Trulli
source

Overview

In this case study, we will look at sales data from a retailer. There are two datasets, one called “sales.csv”, and one called “stores.csv”. You will first perform some data wrangling and merge the two datasets and then save the merged dataset as “retailer_sales.csv”. Below you find two tables with the variable names and a short description of each variable of the two datasets. After that you will have the opportunity to “pick your adventure” and do the analysis you like. For example, you could check how large the fluctuations of sales are per store, or whether sales numbers go up or down in holidays, how well you can predict sales numbers from the other variables, e.g. using a regression, or you could run a time series analysis to predict future turnovers. For each of these suggestions you will find a short paragraph that provides some guidance and hints to what you could do, but feel free to play with the dataset as you wish. Because there are multiple timepoints per store involved, we will often have to either aggregate the data over these, or will not be able to interpret p-values. One method to address this issue is by using mixed effects models. This is rather advanced and we will not cover it here. However, if you are familiar with the method and want to give it a try in R, you can use the lmer function from the lme4 package.

By the end of this case study you will have practiced how to:

  1. Load in data
  2. Wrangle data
  3. Run regressions

Packages

Package Installation
tidyverse install.packages("tidyverse")
lme4 install.packages("lme4")

Datasets

library(tidyverse)
library(lme4)
sales <- read_csv("../_data/complete/sales.csv")
stores <- read_csv("../_data/complete/stores.csv")
File Rows Columns Description
sales.csv 1017209 9 Sales data of a large retail store
stores.csv 1115 10 Data about the individual stores of the retailer
Table1. “sales.csv” variable description:
Variable Description
Store Numeric Id of each of the stores
DayOfWeek A number representing the day of the week
Date The date
Sales The turnover on a given day
Customers The number of costumers on a given day
Open Whether the store was open (1) or closed (0) on a given day
Promo Whether a store was running a promo that day
StateHoliday Whether there where (NA) or were no (0) state holidays that day.
SchoolHoliday If the store on a given date was affected by the closure of public schools (1) or not (0)
Table 2. “stores.csv” variable description:
Variable Description
Store Numeric Id of each of the stores
Assortment What level of assortment a given store has. Can be basic, extra, or extended
CompetitionDistance The distance in meters to the nearest competitor store
CompetitionOpenSinceMonth The month of the year in which the nearest competitor opened
CompetitionOpenSinceYear The year when the nearest competitor opened
Promo2 Whether a store is participating (1) or not (2) in a continuing and consecutive promotion for some stores
Promo2SinceWeek The week of the year in which the store promotion started
Promo2SinceYear The year in which the store promotion started
PromoInterval Describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. For example, “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store

Tasks

A - Getting setup

  1. Open your R project. It should already have the folders 0_Data and 1_Code. Make sure that all of the data files listed above are contained in the folder
# Done!
  1. Open a new R script and save it as a new file called sales_case_study.R. At the top of the script, using comments, write your name and the date. Then, load the set of packages listed above with library().

  2. For this practical, we’ll use the sales.csv and stores.csv data. These datasets contain sales numbers of various stores from a large retailer over a time period. Using the following template, load the data into R and store it as new objects called sales and stores.

# Load data from your local data file of your R project
sales <- read_csv("XX/XXX")
stores <- read_csv("XX/XXX")
  1. Using summary(), head() and skim(), explore the data to make sure it was loaded correctly.

B - Data Wrangling

  1. From the stores data, only select the following variables: Store, Assortment, CompetitionDistance, CompetitionOpenSinceYear, Promo2. The variable Assortment contains the values “a”, “b”, and “c”. This is not very helpful. Change them so that “a” is now “basic”, “b” is now “extra”, and “c” is now “extended”.
stores <- stores %>%
  select(Store, Assortment, CompetitionDistance, CompetitionOpenSinceYear, Promo2) %>%
  mutate(Assortment = case_when(Assortment == "a" ~ "basic",
                                Assortment == "b" ~ "extra",
                                Assortment == "c" ~ "extended"))
  1. Join the stores data with the sales data (check out the left_join() function for this).
sales <- left_join(sales, stores, by = "Store")
  1. The sales data contains an error in the state_holiday variable. There are NAs where there should be 1s. Change this (remember that variable == NA won’t yield the result you want; use is.na(variable) instead).
sales <- sales %>%
    mutate(StateHoliday = case_when(is.na(StateHoliday) ~ 1,
                                   TRUE ~ 0))
  1. Rename the variables to be al lowercase and with underscores between words. Afterwards your variables should be named like this: “store”, “week_day”, “date”, “sales”, “customers”, “open”, “promo”, “state_holiday”, “school_holiday”, “assortment”, “competition_distance”, “competition_open_since”, and “store_promo”.
names(sales) <- c("store", "week_day", "date", "sales", "customers", "open",
                  "promo", "state_holiday", "school_holiday", "assortment",
                  "competition_distance", "competition_open_since","store_promo")
  1. Save the prepared data as “retailer_sales.csv” (this is easy to do with the write_csv() function).
write_csv(sales, "retailer_sales.csv")

C - Statistics - Fluctuations over Days

To look at the average fluctuations over days you, we suggest you take a subsample of a few stores. You could then plot the individual trajectories, and if you like also add a mean line. You can also use a repeated measures test, to have a statistical test of the stability (you could, for example, use a correlation between two time points, or aggregate sales data of stores for each time point and run a regression. Note that with these two methods you will violate the assumption of independence of the data, so you cannot interpret the p-value).

  1. Run the following code to get a visual impression of how large the fluctuations are.
# first create a variable called "days" that is a counter
# for the number of days and will be easier to use than
# the date variable

store_ids <- unique(sales$store)
sales$days <- 0

for (i in store_ids){
  sales$days[sales$store == i] <- seq_len(sum(sales$store == i))
}

# take a subsample to plot
sales_sub <- sales[sales$store %in% sample(1:1115, 30),]

# get rid of dates where the stores were closed
sales_sub <- filter(sales_sub, sales > 0)


### Create a plot using ggplot:

ggplot(sales_sub, aes(x = days, y = sales)) + # specify the data
  geom_line(aes(group = store), col = "grey", alpha = .4) + # add line per store
  stat_smooth(lwd = 1.5) +  # add an average line
  theme_bw()                # theme for white plotting window
  1. Randomly sample two time points (use the sample() function for this) from the date variable created above, and run a correlation.
# correlation between two of the timepoints
r_ds <- sample(sales$date, 2)

cor(sales$sales[sales$days == r_ds[1]], sales$sales[sales$days == r_ds[2]])
[1] NA
  1. Summarise the sales data over date (i.e. for each day, take the mean), and store this as an object called sales_agg. Then run a regression (the function to run a regression is lm()).
# regression

# first summarise the data
sales_agg <- sales %>%
  group_by(date) %>%
  summarise(
    sales = mean(sales)
  )

# then run regression
mod <- lm(sales ~ date, data = sales_agg)

summary(mod)

Call:
lm(formula = sales ~ date, data = sales_agg)

Residuals:
   Min     1Q Median     3Q    Max 
 -5823   -448    240   1698   8303 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -4012.261   5593.072   -0.72     0.47  
date            0.606      0.346    1.75     0.08 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2890 on 940 degrees of freedom
Multiple R-squared:  0.00325,   Adjusted R-squared:  0.00219 
F-statistic: 3.07 on 1 and 940 DF,  p-value: 0.0802
  1. Compute the coefficient of variation, i.e. the standard deviation scaled on the mean (use the sd() function and divide by the mean()), of each stores turnovers (sales variable). You can do this by using dplyr’s summarise() function. Store this as an object called sales_cv, with the variable cv. Only use days on which the stores were open to not introduce extra noise (use filter() for this).
# regression

# first summarise the data
sales_cv <- sales %>%
  filter(open != 0) %>%
  group_by(store) %>%
  summarise(
    cv = sd(sales) / mean(sales)
  )
  1. Plot the distribution of the coefficient of variation using the following code:
ggplot(sales_cv, aes(cv)) +  # the data to plot
  geom_histogram() +         # function to create a histogram
  xlim(c(0, 1)) +            # range of the x-axis
  xlab("Coefficient of Variation") + # x-axis title
  ylab("Frequency") +        # y-axis title
  theme_bw()                 # white theme (white plotting window)

D - Statistics - Sales numbers in holidays

  1. To test the effects of state holidays, filter the dataset to only include dates on which the given store was open. Then, average the sales of the holiday days and the non holiday days for each store (use group_by() and summarise()).
### For state holidays

# aggregate data to do the statistical test
sales_agg <- sales %>%
  filter(open != 0) %>%
  group_by(store, state_holiday) %>%
  summarise(
    sales = mean(sales)
  )

# check the means
tapply(sales_agg$sales, sales_agg$state_holiday, mean)
   0    1 
6934 7049 
# get rid of stores that weren't open on any state holiday
sales_agg <- sales_agg %>%
  filter(store %in% store[state_holiday == 1])

# check the means again
tapply(sales_agg$sales, sales_agg$state_holiday, mean)
   0    1 
7206 7049 
  1. Run a paired t-test to test whether sales numbers on state holidays differ from sales numbers on normal days.
# run a paired t.test
t.test(sales ~ state_holiday,
       data = sales_agg,
       paired = TRUE)

    Paired t-test

data:  sales by state_holiday
t = 0.8, df = 200, p-value = 0.4
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -251  566
sample estimates:
mean of the differences 
                    157 
  1. Now repeat tasks D1. and D2. for school holidays
### Now let's do the same for school holidays

# aggregate data to do the statistical test
sales_agg <- sales %>%
  filter(open != 0) %>%
  group_by(store, school_holiday) %>%
  summarise(
    sales = mean(sales)
  )

# check the means
tapply(sales_agg$sales, sales_agg$school_holiday, mean)
   0    1 
6880 7162 
# get rid of stores that weren't open on any state holiday
sales_agg <- sales_agg %>%
  filter(store %in% store[school_holiday == 1])

# check the means again
tapply(sales_agg$sales, sales_agg$school_holiday, mean)
   0    1 
6880 7162 
# run a paired t.test
t.test(sales ~ school_holiday,
       data = sales_agg,
       paired = TRUE)

    Paired t-test

data:  sales by school_holiday
t = -30, df = 1000, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -304 -260
sample estimates:
mean of the differences 
                   -282 

E - Statistics - Predict sales numbers from other variables

To not violate the independence assumption you will have to aggregate the sales data of each store over the different time periods. You can then run a linear regression.

  1. Aggregate the sales numbers over days. Include the following groups to later be used as predictors: “store”, “customers”, “store_promo”, “competition_distance”, and “assortment”.
# Aggregate data
sales_agg <- sales %>%
  filter(open != 0) %>%
  group_by(store, customers, store_promo, competition_distance, assortment) %>%
  summarise(
    sales = mean(sales)
  )
  1. Run a regression to test the influence of the variables included in task E1.
# run a regression model
mod <- lm(sales ~ .,
          data = sales_agg)
summary(mod)

Call:
lm(formula = sales ~ ., data = sales_agg)

Residuals:
   Min     1Q Median     3Q    Max 
-26679  -1061   -234    863  27212 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.06e+03   8.94e+00   118.9   <2e-16 ***
store                -1.73e-01   8.26e-03   -21.0   <2e-16 ***
customers             7.17e+00   6.37e-03  1125.2   <2e-16 ***
store_promo           4.35e+02   5.51e+00    79.1   <2e-16 ***
competition_distance  3.42e-02   3.44e-04    99.3   <2e-16 ***
assortmentextended    6.08e+02   5.42e+00   112.2   <2e-16 ***
assortmentextra      -7.46e+03   2.48e+01  -300.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1710 on 415900 degrees of freedom
  (975 observations deleted due to missingness)
Multiple R-squared:  0.762, Adjusted R-squared:  0.762 
F-statistic: 2.21e+05 on 6 and 415900 DF,  p-value: <2e-16

F - Extra

Because we have repeated measures in the data, we always had to aggregate before we could run a statistical test. Using mixed effects models, this would not have been necessary, because there you can account for the dependent structure of the data with random effects. Covering the method itself is beyond the scope of this course, but if you already know the method and want to try it out, here is a code example of how you could do this. The most often used package for mixed effects modeling in R is called lme4 with the lmer() function for linear mixed effects models, and the glmer() function for generalized mixed effects models.

# load the package
library(lme4)

# run a linear mixed effects model

# lmer uses the same syntax as lm, the regression function you already know:
# a + between two fixed effects means a main effect, a * means a main
# effect AND interaction, a : means an interaction. Thus the following
# are the same:

lm(dependent_var ~ predictor_1 + predictor_2 + predictor_1 : predictor_2)
lm(dependent_var ~ predictor_1 * predictor_2)

# mixed effects model syntax
me_mod <- lmer(dependent_var ~ fixed_eff_1 + fixed_eff_2 +
                 (random_eff_slopes | random_eff_intercept),
               data = your_dataset)

Here is an example for a mixed effects model you could run with the sales data that takes the same predictors we’ve used in task 18, but this time we use the unaggregated data and add the days variable as a random effect with varying intercepts (note that for a real analysis you’d probably want to rescale some variables and make sure all the labels make sense. But this here is just to show you how in principle you can run a mixed effects model in R.

library(lme4)

# first get rid of days where stores were closed
sales_open <- sales %>%
  filter(open != 0)

me_mod <- lmer(sales ~ customers + store_promo + competition_distance +
                 assortment + (1 | store),
               data = sales_open)

summary(me_mod)