In this case study, we will look at sales data from a retailer. There are two datasets, one called “sales.csv”, and one called “stores.csv”. You will first perform some data wrangling and merge the two datasets and then save the merged dataset as “retailer_sales.csv”. Below you find two tables with the variable names and a short description of each variable of the two datasets. After that you will have the oportunity to “pick your adventure” and do the analysis you like. For example, you could check how large the fluctuations of sales are per store, or whether sales numbers go up or down in holidays, how well you can predict sales numbers from the other variables, e.g. using a regression, or you could run a time series analysis to predict future turnovers. For each of these suggestions you will find a short paragraph that provides some guidance and hints to what you could do, but feel free to play with the dataset as you wish. Because there are multiple timepoints per store involved, we will often have to either aggregate the data over these, or will not be able to interpret p-values. One method to address this issue is by using mixed effects models. This is rather advanced and we will not cover it here. However, if you are familiar with the method and want to give it a try in R, you can use the lmer
function from the lme4
package.
Variable | Description |
---|---|
Store | Numeric Id of each of the stores |
DayOfWeek | A number representing the day of the week |
Date | The date |
Sales | The turnover on a given day |
Customers | The number of costumers on a given day |
Open | Whether the store was open (1) or closed (0) on a given day |
Promo | Whether a store was running a promo that day |
StateHoliday | Whether there where (NA) or were no (0) state holidays that day. |
SchoolHoliday | If the store on a given date was affected by the closure of public schools (1) or not (0) |
Variable | Description |
---|---|
Store | Numeric Id of each of the stores |
Assortment | What level of assortment a given store has. Can be basic, extra, or extended |
CompetitionDistance | The distance in meters to the nearest competitor store |
CompetitionOpenSinceMonth | The month of the year in which the nearest competitor opened |
CompetitionOpenSinceYear | The year when the nearest competitor opened |
Promo2 | Wheter a store is participating (1) or not (2) in a continuing and consecutive promotion for some stores |
Promo2SinceWeek | The week of the year in which the store promotion started |
Promo2SinceYear | The year in which the store promotion started |
PromoInterval | Describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. For example, “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store |
sales_case_study.R
. At the top of the script, using comments, write your name and the date. Then, load the tidyverse
package. Here’s how the top of your script should look:## NAME
## DATE
## Sales Data - Case Study
library(tidyverse)
# Load data from your local data file of your R project
sales <- read_csv("data/sales.csv")
stores <- read_csv("data/stores.csv")
# Get to know the data
head(sales)
# A tibble: 6 x 9
Store DayOfWeek Date Sales Customers Open Promo StateHo… SchoolH…
<int> <int> <date> <int> <int> <int> <int> <int> <int>
1 1 5 2015-07-31 5263 555 1 1 0 1
2 2 5 2015-07-31 6064 625 1 1 0 1
3 3 5 2015-07-31 8314 821 1 1 0 1
4 4 5 2015-07-31 13995 1498 1 1 0 1
5 5 5 2015-07-31 4822 559 1 1 0 1
6 6 5 2015-07-31 5651 589 1 1 0 1
str(sales)
Classes 'tbl_df', 'tbl' and 'data.frame': 1017209 obs. of 9 variables:
$ Store : int 1 2 3 4 5 6 7 8 9 10 ...
$ DayOfWeek : int 5 5 5 5 5 5 5 5 5 5 ...
$ Date : Date, format: "2015-07-31" "2015-07-31" ...
$ Sales : int 5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
$ Customers : int 555 625 821 1498 559 589 1414 833 687 681 ...
$ Open : int 1 1 1 1 1 1 1 1 1 1 ...
$ Promo : int 1 1 1 1 1 1 1 1 1 1 ...
$ StateHoliday : int 0 0 0 0 0 0 0 0 0 0 ...
$ SchoolHoliday: int 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame': 31050 obs. of 5 variables:
..$ row : int 63556 63558 63560 63561 63564 63568 63569 63571 63574 63575 ...
..$ col : chr "StateHoliday" "StateHoliday" "StateHoliday" "StateHoliday" ...
..$ expected: chr "an integer" "an integer" "an integer" "an integer" ...
..$ actual : chr "a" "a" "a" "a" ...
..$ file : chr "'https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/sales.csv'" "'https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/sales.csv'" "'https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/sales.csv'" "'https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/sales.csv'" ...
- attr(*, "spec")=List of 2
..$ cols :List of 9
.. ..$ Store : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ DayOfWeek : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Date :List of 1
.. .. ..$ format: chr ""
.. .. ..- attr(*, "class")= chr "collector_date" "collector"
.. ..$ Sales : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Customers : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Open : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Promo : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ StateHoliday : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ SchoolHoliday: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
head(stores)
# A tibble: 6 x 10
Store StoreType Assortment Compe… Compe… Compe… Promo2 Prom… Prom… Prom…
<int> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr>
1 1 c a 1270 9 2008 0 NA NA <NA>
2 2 a a 570 11 2007 1 13 2010 Jan,…
3 3 a a 14130 12 2006 1 14 2011 Jan,…
4 4 c c 620 9 2009 0 NA NA <NA>
5 5 a a 29910 4 2015 0 NA NA <NA>
6 6 a a 310 12 2013 0 NA NA <NA>
str(stores)
Classes 'tbl_df', 'tbl' and 'data.frame': 1115 obs. of 10 variables:
$ Store : int 1 2 3 4 5 6 7 8 9 10 ...
$ StoreType : chr "c" "a" "a" "c" ...
$ Assortment : chr "a" "a" "a" "c" ...
$ CompetitionDistance : int 1270 570 14130 620 29910 310 24000 7520 2030 3160 ...
$ CompetitionOpenSinceMonth: int 9 11 12 9 4 12 4 10 8 9 ...
$ CompetitionOpenSinceYear : int 2008 2007 2006 2009 2015 2013 2013 2014 2000 2009 ...
$ Promo2 : int 0 1 1 0 0 0 0 0 0 0 ...
$ Promo2SinceWeek : int NA 13 14 NA NA NA NA NA NA NA ...
$ Promo2SinceYear : int NA 2010 2011 NA NA NA NA NA NA NA ...
$ PromoInterval : chr NA "Jan,Apr,Jul,Oct" "Jan,Apr,Jul,Oct" NA ...
- attr(*, "spec")=List of 2
..$ cols :List of 10
.. ..$ Store : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ StoreType : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ Assortment : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
.. ..$ CompetitionDistance : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ CompetitionOpenSinceMonth: list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ CompetitionOpenSinceYear : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Promo2 : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Promo2SinceWeek : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ Promo2SinceYear : list()
.. .. ..- attr(*, "class")= chr "collector_integer" "collector"
.. ..$ PromoInterval : list()
.. .. ..- attr(*, "class")= chr "collector_character" "collector"
..$ default: list()
.. ..- attr(*, "class")= chr "collector_guess" "collector"
..- attr(*, "class")= chr "col_spec"
stores <- stores %>%
select(Store, Assortment, CompetitionDistance, CompetitionOpenSinceYear, Promo2) %>%
mutate(Assortment = case_when(Assortment == "a" ~ "basic",
Assortment == "b" ~ "extra",
Assortment == "c" ~ "extended"))
left_join()
function for this).sales <- left_join(sales, stores, by = "Store")
NA
s where there should be 1
s. Change this (remember that variable == NA
won’t yield the result you want; use is.na(variable)
instead).sales <- sales %>%
mutate(StateHoliday = case_when(is.na(StateHoliday) ~ 1,
TRUE ~ 0))
names(sales) <- c("store", "week_day", "date", "sales", "customers", "open",
"promo", "state_holiday", "school_holiday", "assortment",
"competition_distance", "competition_open_since","store_promo")
write_csv()
function).write_csv(sales, "retailer_sales.csv")
To look at the average fluctuations over days you, we suggest you take a subsample of a few stores. You could then plot the individual trajectories, and if you like also add a mean line. You can also use a repeated measures test, to have a statistical test of the stability (you could, for example, use a correlation between two timepoints, or aggregate sales data of stores for each timepoint and run a regression. Note that with these two methods you will violate the assumption of independence of the data, so you cannot interpret the p-value).
# first create a variable called "days" that is a counter
# for the number of days and will be easier to use than
# the date variable
store_ids <- unique(sales$store)
sales$days <- 0
for (i in store_ids){
sales$days[sales$store == i] <- seq_len(sum(sales$store == i))
}
# take a subsample to plot
sales_sub <- sales[sales$store %in% sample(1:1115, 30),]
# get rid of dates where the stores were closed
sales_sub <- filter(sales_sub, sales > 0)
### Create a plot using ggplot:
ggplot(sales_sub, aes(x = days, y = sales)) + # specify the data
geom_line(aes(group = store), col = "grey", alpha = .4) + # add line per store
stat_smooth(lwd = 1.5) + # add an average line
theme_bw() # theme for white plotting window
sample()
function for this) from the days
variable created above, and run a correlation.# correlation between two of the timepoints
r_ds <- sample(sales$days, 2)
cor(sales$sales[sales$days == r_ds[1]], sales$sales[sales$days == r_ds[2]])
[1] 0.9030174
days
(i.e. for each day, take the mean), and store this as an object called sales_agg
. Then run a regression (the function to run a regression is lm()
).# regression
# first summarise the data
sales_agg <- sales %>%
group_by(days) %>%
summarise(
sales = mean(sales)
)
# then run regression
mod <- lm(sales ~ days, data = sales_agg)
summary(mod)
Call:
lm(formula = sales ~ days, data = sales_agg)
Residuals:
Min 1Q Median 3Q Max
-5785.7 -511.7 293.5 1514.4 7247.4
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6012.6989 170.5927 35.25 <2e-16 ***
days -0.5203 0.3134 -1.66 0.0972 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2616 on 940 degrees of freedom
Multiple R-squared: 0.002923, Adjusted R-squared: 0.001862
F-statistic: 2.756 on 1 and 940 DF, p-value: 0.09725
sd()
function and divide by the mean()
), of each stores turnovers (sales
variable). You can do this by using dplyr
’s summarise()
function. Store this as an object called sales_cv
, with the variable cv
. Only use days on which the stores were open to not introduce extra noise (use filter()
for this).# regression
# first summarise the data
sales_cv <- sales %>%
filter(open != 0) %>%
group_by(store) %>%
summarise(
cv = sd(sales) / mean(sales)
)
ggplot(sales_cv, aes(cv)) + # the data to plot
geom_histogram() + # function to create a histogram
xlim(c(0, 1)) + # range of the x-axis
xlab("Coefficient of Variation") + # x-axis title
ylab("Frequency") + # y-axis title
theme_bw() # white theme (white plotting window)
group_by()
and summarise()
).### For state holidays
# aggregate data to do the statistical test
sales_agg <- sales %>%
filter(open != 0) %>%
group_by(store, state_holiday) %>%
summarise(
sales = mean(sales)
)
# check the means
tapply(sales_agg$sales, sales_agg$state_holiday, mean)
0 1
6933.824 7048.918
# get rid of stores that weren't open on any state holiday
sales_agg <- sales_agg %>%
filter(store %in% store[state_holiday == 1])
# check the means again
tapply(sales_agg$sales, sales_agg$state_holiday, mean)
0 1
7206.409 7048.918
# run a paired t.test
t.test(sales ~ state_holiday,
data = sales_agg,
paired = TRUE)
Paired t-test
data: sales by state_holiday
t = 0.7607, df = 155, p-value = 0.448
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-251.4823 566.4642
sample estimates:
mean of the differences
157.491
### Now let's do the same for school holidays
# aggregate data to do the statistical test
sales_agg <- sales %>%
filter(open != 0) %>%
group_by(store, school_holiday) %>%
summarise(
sales = mean(sales)
)
# check the means
tapply(sales_agg$sales, sales_agg$school_holiday, mean)
0 1
6880.237 7162.152
# get rid of stores that weren't open on any state holiday
sales_agg <- sales_agg %>%
filter(store %in% store[school_holiday == 1])
# check the means again
tapply(sales_agg$sales, sales_agg$school_holiday, mean)
0 1
6880.237 7162.152
# run a paired t.test
t.test(sales ~ school_holiday,
data = sales_agg,
paired = TRUE)
Paired t-test
data: sales by school_holiday
t = -25.296, df = 1114, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-303.7817 -260.0484
sample estimates:
mean of the differences
-281.915
To not violate the independence assumption you will have to aggregate the sales data of each store over the different time periods. You can then run a linear regression.
# Aggregate data
sales_agg <- sales %>%
filter(open != 0) %>%
group_by(store, customers, store_promo, competition_distance, assortment) %>%
summarise(
sales = mean(sales)
)
# run a regression model
mod <- lm(sales ~ .,
data = sales_agg)
summary(mod)
Call:
lm(formula = sales ~ ., data = sales_agg)
Residuals:
Min 1Q Median 3Q Max
-26679.4 -1061.1 -234.5 862.7 27211.8
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.063e+03 8.940e+00 118.86 <2e-16 ***
store -1.735e-01 8.256e-03 -21.01 <2e-16 ***
customers 7.166e+00 6.369e-03 1125.20 <2e-16 ***
store_promo 4.354e+02 5.507e+00 79.07 <2e-16 ***
competition_distance 3.418e-02 3.443e-04 99.27 <2e-16 ***
assortmentextended 6.085e+02 5.423e+00 112.20 <2e-16 ***
assortmentextra -7.463e+03 2.481e+01 -300.87 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1710 on 415900 degrees of freedom
(975 observations deleted due to missingness)
Multiple R-squared: 0.7616, Adjusted R-squared: 0.7615
F-statistic: 2.214e+05 on 6 and 415900 DF, p-value: < 2.2e-16
Because we have repeated measures in the data, we always had to aggregate before we could run a statistical test. Using mixed effects models, this would not have been necessary, because there you can account for the dependent structure of the data with random effects. Covering the method itself is beyond the scope of this course, but if you already know the method and want to try it out, here is a code example of how you could do this. The most often used package for mixed effects modeling in R is called lme4
with the lmer()
function for linear mixed effects models, and the glmer()
function for generalized mixed effects models.
# load the package
library(lme4)
# run a linear mixed effects model
# lmer uses the same syntax as lm, the regression function you already know:
# a + between two fixed effects means a main effect, a * means a main
# effect AND interaction, a : means an interaction. Thus the following
# are the same:
lm(dependent_var ~ predictor_1 + predictor_2 + predictor_1 : predictor_2)
lm(dependent_var ~ predictor_1 * predictor_2)
# mixed effects model syntax
me_mod <- lmer(dependent_var ~ fixed_eff_1 + fixed_eff_2 +
(random_eff_slopes | random_eff_intercept),
data = your_dataset)
Here is an example for a mixed effects model you could run with the sales data that takes the same predictors we’ve used in task 18, but this time we use the unaggregated data and add the days variable as a random effect with varying intercepts (note that for a real analysis you’d probably want to rescale some variables and make sure all the labels make sense. But this here is just to show you how in principle you can run a mixed effects model in R.
library(lme4)
# first get rid of days where stores were closed
sales_open <- sales %>%
filter(open != 0)
me_mod <- lmer(sales ~ customers + store_promo + competition_distance +
assortment + (1 | store),
data = sales_open)
summary(me_mod)
Linear mixed model fit by REML ['lmerMod']
Formula:
sales ~ customers + store_promo + competition_distance + assortment +
(1 | store)
Data: sales_open
REML criterion at convergence: 13679867
Scaled residuals:
Min 1Q Median 3Q Max
-49.773 -0.578 -0.079 0.480 32.392
Random effects:
Groups Name Variance Std.Dev.
store (Intercept) 3328175 1824
Residual 656168 810
Number of obs: 842206, groups: store, 1112
Fixed effects:
Estimate Std. Error t value
(Intercept) -2.199e+03 1.022e+02 -21.5
customers 1.075e+01 5.316e-03 2021.3
store_promo 1.061e+03 1.107e+02 9.6
competition_distance 6.559e-02 7.310e-03 9.0
assortmentextended 4.790e+02 1.114e+02 4.3
assortmentextra -1.177e+04 6.134e+02 -19.2
Correlation of Fixed Effects:
(Intr) cstmrs str_pr cmptt_ assrtmntxtn
customers -0.045
store_promo -0.602 0.009
cmpttn_dstn -0.397 0.006 0.148
assrtmntxtn -0.436 -0.001 -0.023 -0.147
assrtmntxtr -0.112 -0.011 0.018 0.040 0.077
fit warnings:
Some predictor variables are on very different scales: consider rescaling