In this case study, we will look at sales data from a retailer. There are two datasets, one called "sales.csv" and one called "stores.csv". You will first perform some data wrangling, merge the two datasets, and save the merged dataset as "retailer_sales.csv". Below you will find two tables with the variable names and a short description of each variable in the two datasets. After that, you will have the opportunity to "pick your adventure" and run the analyses you like. For example, you could check how large the fluctuations in sales are per store, whether sales numbers go up or down during holidays, or how well you can predict sales numbers from the other variables, e.g., using a regression, or you could run a time series analysis to predict future turnovers. For each of these suggestions you will find a short paragraph with some guidance and hints on what you could do, but feel free to play with the dataset as you wish. Because there are multiple timepoints per store, we will often have to either aggregate the data over timepoints or forgo interpreting p-values. One method to address this issue is to use mixed effects models. This is rather advanced and we will not cover it here. However, if you are familiar with the method and want to give it a try in R, you can use the `lmer` function from the `lme4` package.
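To give a flavor, here is a minimal sketch of such a mixed effects model with a random intercept per store. It assumes the merged and renamed `sales` dataset that you will create below, and the choice of predictors is just an illustration, not part of the case study:

```r
library(lme4)

# the random intercept per store accounts for the repeated
# measurements of each store across days
mod_mixed <- lmer(sales ~ promo + school_holiday + (1 | store),
                  data = sales)
summary(mod_mixed)
```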
By the end of this case study, you will have practiced how to wrangle, merge, and save datasets, and how to explore them with plots, correlations, and regressions. You will need the following packages:
Package | Installation
---|---
`tidyverse` | `install.packages("tidyverse")`
`lme4` | `install.packages("lme4")`
```r
library(tidyverse)
library(lme4)

sales <- read_csv("../_data/complete/sales.csv")
stores <- read_csv("../_data/complete/stores.csv")
```
File | Rows | Columns | Description
---|---|---|---
sales.csv | 1017209 | 9 | Daily sales data of the stores of a large retailer
stores.csv | 1115 | 10 | Data about the individual stores of the retailer
Variable | Description
---|---
Store | Numeric ID of each store
DayOfWeek | A number representing the day of the week
Date | The date
Sales | The turnover on a given day
Customers | The number of customers on a given day
Open | Whether the store was open (1) or closed (0) on a given day
Promo | Whether the store was running a promo that day
StateHoliday | Whether there was a state holiday (NA) or not (0) that day
SchoolHoliday | Whether the store on a given date was affected by the closure of public schools (1) or not (0)
Variable | Description
---|---
Store | Numeric ID of each store
Assortment | The assortment level of a given store; can be basic, extra, or extended
CompetitionDistance | The distance in meters to the nearest competitor store
CompetitionOpenSinceMonth | The month of the year in which the nearest competitor opened
CompetitionOpenSinceYear | The year in which the nearest competitor opened
Promo2 | Whether a store is participating (1) or not (0) in a continuing and consecutive promotion for some stores
Promo2SinceWeek | The week of the year in which the store promotion started
Promo2SinceYear | The year in which the store promotion started
PromoInterval | The consecutive intervals in which Promo2 is started anew, naming the months a new round begins. For example, "Feb,May,Aug,Nov" means each round starts in February, May, August, and November of any given year for that store
Open a new R project containing two folders called `0_Data` and `1_Code`. Make sure that all of the data files listed above are contained in the `0_Data` folder.
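A quick way to check from within R that the files are in place (this assumes your working directory is the project root; adjust the folder name if yours differs):

```r
# should list "sales.csv" and "stores.csv"
list.files("0_Data")
```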
Open a new R script and save it as a new file called `sales_case_study.R`. At the top of the script, write your name and the date using comments. Then load the set of packages listed above with `library()`.
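The top of your script might then look something like this (name and date are placeholders):

```r
# Jane Doe
# 2030-01-01
# Sales case study

library(tidyverse)
library(lme4)
```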
For this practical, we'll use the `sales.csv` and `stores.csv` data. These datasets contain the sales numbers of various stores of a large retailer over a period of time. Using the following template, load the data into R and store it as new objects called `sales` and `stores`.

```r
# Load the data from the local data folder of your R project
sales <- read_csv("XX/XXX")
stores <- read_csv("XX/XXX")
```
Using `summary()`, `head()`, and `skim()`, explore the data to make sure it was loaded correctly.

Then clean the `stores` data: keep only the variables `Store`, `Assortment`, `CompetitionDistance`, `CompetitionOpenSinceYear`, and `Promo2`, and recode the `Assortment` letters into their labels:

```r
stores <- stores %>%
  select(Store, Assortment, CompetitionDistance, CompetitionOpenSinceYear, Promo2) %>%
  mutate(Assortment = case_when(Assortment == "a" ~ "basic",
                                Assortment == "b" ~ "extra",
                                Assortment == "c" ~ "extended"))
```
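For the exploration step above, a minimal check could look like this. Note that `skim()` comes from the `skimr` package, which is not in the package list above, so you may need to install it first:

```r
summary(sales)      # ranges, quartiles, and NAs per column
head(stores)        # first six rows of the store data
skimr::skim(sales)  # compact overview per variable
```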
Now merge the two datasets (you can use the `left_join()` function for this).

```r
sales <- left_join(sales, stores, by = "Store")
```
The `StateHoliday` variable contains `NA`s where there should be `1`s. Change this (remember that `variable == NA` won't yield the result you want; use `is.na(variable)` instead).

```r
sales <- sales %>%
  mutate(StateHoliday = case_when(is.na(StateHoliday) ~ 1,
                                  TRUE ~ 0))
```
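If you want to see why `variable == NA` fails, here is a tiny illustration:

```r
x <- c(1, NA, 3)
x == NA    # NA NA NA: comparisons with NA are always NA
is.na(x)   # FALSE TRUE FALSE: this is what you want
```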
Give the variables shorter, snake_case names:

```r
names(sales) <- c("store", "week_day", "date", "sales", "customers", "open",
                  "promo", "state_holiday", "school_holiday", "assortment",
                  "competition_distance", "competition_open_since", "store_promo")
```
Finally, save the merged dataset as "retailer_sales.csv" (use the `write_csv()` function).

```r
write_csv(sales, "retailer_sales.csv")
```
To look at the average fluctuations over days, we suggest you take a subsample of a few stores. You could then plot the individual trajectories and, if you like, add a mean line. You can also run a statistical test of stability: for example, correlate sales between two timepoints, or aggregate the sales data over stores for each timepoint and run a regression. Note that with these two methods you violate the assumption of independence of the data, so you cannot interpret the p-values.
```r
# first create a variable called "days" that is a counter
# for the number of days and will be easier to use than
# the date variable
store_ids <- unique(sales$store)
sales$days <- 0
for (i in store_ids){
  sales$days[sales$store == i] <- seq_len(sum(sales$store == i))
}

# take a subsample to plot
sales_sub <- sales[sales$store %in% sample(1:1115, 30), ]

# get rid of dates where the stores were closed
sales_sub <- filter(sales_sub, sales > 0)

### Create a plot using ggplot:
ggplot(sales_sub, aes(x = days, y = sales)) +               # specify the data
  geom_line(aes(group = store), col = "grey", alpha = .4) + # add a line per store
  stat_smooth(lwd = 1.5) +                                  # add an average line
  theme_bw()                                                # white plotting window
```
Sample two random timepoints (you can use the `sample()` function for this) from the `days` variable created above, and run a correlation.

```r
# correlation between two timepoints, joined on store so the sales vectors align
r_ds <- sample(unique(sales$days), 2)
t1 <- filter(sales, days == r_ds[1])
t2 <- filter(sales, days == r_ds[2])
both <- inner_join(t1, t2, by = "store", suffix = c("_t1", "_t2"))
cor(both$sales_t1, both$sales_t2)
```
Alternatively, aggregate the sales data over stores for each `date` (i.e., for each day, take the mean) and store this as an object called `sales_agg`. Then run a regression (the function to run a regression is `lm()`).

```r
# first summarise the data: mean sales per day
sales_agg <- sales %>%
  group_by(date) %>%
  summarise(
    sales = mean(sales)
  )

# then run the regression
mod <- lm(sales ~ date, data = sales_agg)
summary(mod)
```
```
Call:
lm(formula = sales ~ date, data = sales_agg)

Residuals:
   Min     1Q Median     3Q    Max
 -5823   -448    240   1698   8303

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -4012.261   5593.072   -0.72     0.47
date            0.606      0.346    1.75     0.08 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2890 on 940 degrees of freedom
Multiple R-squared:  0.00325, Adjusted R-squared:  0.00219
F-statistic: 3.07 on 1 and 940 DF,  p-value: 0.0802
```
Compute the coefficient of variation, i.e., the standard deviation divided by the mean (use the `sd()` function and divide by `mean()`), of each store's turnover (the `sales` variable). You can do this with `dplyr`'s `summarise()` function. Store the result as an object called `sales_cv` with the variable `cv`. To avoid introducing extra noise, only use days on which the stores were open (use `filter()` for this).
for this).# regression
# first summarise the data
sales_cv <- sales %>%
filter(open != 0) %>%
group_by(store) %>%
summarise(
cv = sd(sales) / mean(sales)
)
```r
ggplot(sales_cv, aes(cv)) +          # the data to plot
  geom_histogram() +                 # function to create a histogram
  xlim(c(0, 1)) +                    # range of the x-axis
  xlab("Coefficient of Variation") + # x-axis title
  ylab("Frequency") +                # y-axis title
  theme_bw()                         # white theme (white plotting window)
```