Overview

In this case study, we will look at sales data from a retailer. There are two datasets, one called “sales.csv”, and one called “stores.csv”. You will first perform some data wrangling and merge the two datasets and then save the merged dataset as “retailer_sales.csv”. Below you find two tables with the variable names and a short description of each variable of the two datasets. After that you will have the oportunity to “pick your adventure” and do the analysis you like. For example, you could check how large the fluctuations of sales are per store, or whether sales numbers go up or down in holidays, how well you can predict sales numbers from the other variables, e.g. using a regression, or you could run a time series analysis to predict future turnovers. For each of these suggestions you will find a short paragraph that provides some guidance and hints to what you could do, but feel free to play with the dataset as you wish. Because there are multiple timepoints per store involved, we will often have to either aggregate the data over these, or will not be able to interpret p-values. One method to address this issue is by using mixed effects models. This is rather advanced and we will not cover it here. However, if you are familiar with the method and want to give it a try in R, you can use the lmer function from the lme4 package.

Table1. “sales.csv” variable description:
Variable Description
Store Numeric Id of each of the stores
DayOfWeek A number representing the day of the week
Date The date
Sales The turnover on a given day
Customers The number of costumers on a given day
Open Whether the store was open (1) or closed (0) on a given day
Promo Whether a store was running a promo that day
StateHoliday Whether there where (NA) or were no (0) state holidays that day.
SchoolHoliday If the store on a given date was affected by the closure of public schools (1) or not (0)
Table 2. “stores.csv” variable description:
Variable Description
Store Numeric Id of each of the stores
Assortment What level of assortment a given store has. Can be basic, extra, or extended
CompetitionDistance The distance in meters to the nearest competitor store
CompetitionOpenSinceMonth The month of the year in which the nearest competitor opened
CompetitionOpenSinceYear The year when the nearest competitor opened
Promo2 Wheter a store is participating (1) or not (2) in a continuing and consecutive promotion for some stores
Promo2SinceWeek The week of the year in which the store promotion started
Promo2SinceYear The year in which the store promotion started
PromoInterval Describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. For example, “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store

Data I

  1. Open a new R script and save it as a new file called sales_case_study.R. At the top of the script, using comments, write your name and the date. Then, load the tidyversepackage. Here’s how the top of your script should look:
## NAME
## DATE
## Sales Data - Case Study

library(tidyverse)
  1. Load in the “sales.csv”, and “stores.csv” datasets from your data folder.
# Load data from your local data file of your R project

sales <- read_csv("data/sales.csv")
stores <- read_csv("data/stores.csv")
  1. Get a first impression of the datasets by looking at a few rows of them.
# Get to know the data
head(sales)
# A tibble: 6 x 9
  Store DayOfWeek Date       Sales Customers  Open Promo StateHo… SchoolH…
  <int>     <int> <date>     <int>     <int> <int> <int>    <int>    <int>
1     1         5 2015-07-31  5263       555     1     1        0        1
2     2         5 2015-07-31  6064       625     1     1        0        1
3     3         5 2015-07-31  8314       821     1     1        0        1
4     4         5 2015-07-31 13995      1498     1     1        0        1
5     5         5 2015-07-31  4822       559     1     1        0        1
6     6         5 2015-07-31  5651       589     1     1        0        1
str(sales)
Classes 'tbl_df', 'tbl' and 'data.frame':   1017209 obs. of  9 variables:
 $ Store        : int  1 2 3 4 5 6 7 8 9 10 ...
 $ DayOfWeek    : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Date         : Date, format: "2015-07-31" "2015-07-31" ...
 $ Sales        : int  5263 6064 8314 13995 4822 5651 15344 8492 8565 7185 ...
 $ Customers    : int  555 625 821 1498 559 589 1414 833 687 681 ...
 $ Open         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Promo        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ StateHoliday : int  0 0 0 0 0 0 0 0 0 0 ...
 $ SchoolHoliday: int  1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "problems")=Classes 'tbl_df', 'tbl' and 'data.frame':    31050 obs. of  5 variables:
  ..$ row     : int  63556 63558 63560 63561 63564 63568 63569 63571 63574 63575 ...
  ..$ col     : chr  "StateHoliday" "StateHoliday" "StateHoliday" "StateHoliday" ...
  ..$ expected: chr  "an integer" "an integer" "an integer" "an integer" ...
  ..$ actual  : chr  "a" "a" "a" "a" ...
  ..$ file    : chr  "'https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/sales.csv'" "'https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/sales.csv'" "'https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/sales.csv'" "'https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_data/sales.csv'" ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 9
  .. ..$ Store        : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ DayOfWeek    : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Date         :List of 1
  .. .. ..$ format: chr ""
  .. .. ..- attr(*, "class")= chr  "collector_date" "collector"
  .. ..$ Sales        : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Customers    : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Open         : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Promo        : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ StateHoliday : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ SchoolHoliday: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"
head(stores)
# A tibble: 6 x 10
  Store StoreType Assortment Compe… Compe… Compe… Promo2 Prom… Prom… Prom…
  <int> <chr>     <chr>       <int>  <int>  <int>  <int> <int> <int> <chr>
1     1 c         a            1270      9   2008      0    NA    NA <NA> 
2     2 a         a             570     11   2007      1    13  2010 Jan,…
3     3 a         a           14130     12   2006      1    14  2011 Jan,…
4     4 c         c             620      9   2009      0    NA    NA <NA> 
5     5 a         a           29910      4   2015      0    NA    NA <NA> 
6     6 a         a             310     12   2013      0    NA    NA <NA> 
str(stores)
Classes 'tbl_df', 'tbl' and 'data.frame':   1115 obs. of  10 variables:
 $ Store                    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ StoreType                : chr  "c" "a" "a" "c" ...
 $ Assortment               : chr  "a" "a" "a" "c" ...
 $ CompetitionDistance      : int  1270 570 14130 620 29910 310 24000 7520 2030 3160 ...
 $ CompetitionOpenSinceMonth: int  9 11 12 9 4 12 4 10 8 9 ...
 $ CompetitionOpenSinceYear : int  2008 2007 2006 2009 2015 2013 2013 2014 2000 2009 ...
 $ Promo2                   : int  0 1 1 0 0 0 0 0 0 0 ...
 $ Promo2SinceWeek          : int  NA 13 14 NA NA NA NA NA NA NA ...
 $ Promo2SinceYear          : int  NA 2010 2011 NA NA NA NA NA NA NA ...
 $ PromoInterval            : chr  NA "Jan,Apr,Jul,Oct" "Jan,Apr,Jul,Oct" NA ...
 - attr(*, "spec")=List of 2
  ..$ cols   :List of 10
  .. ..$ Store                    : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ StoreType                : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ Assortment               : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  .. ..$ CompetitionDistance      : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ CompetitionOpenSinceMonth: list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ CompetitionOpenSinceYear : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Promo2                   : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Promo2SinceWeek          : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ Promo2SinceYear          : list()
  .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
  .. ..$ PromoInterval            : list()
  .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
  ..$ default: list()
  .. ..- attr(*, "class")= chr  "collector_guess" "collector"
  ..- attr(*, "class")= chr "col_spec"

Data Wrangling

  1. From the stores data, only select the following variables: Store, Assortment, CompetitionDistance, CompetitionOpenSinceYear, Promo2. The variable Assortment contains the values “a”, “b”, and “c”. This is not very helpful. Change them so that “a” is now “basic”, “b” is now “extra”, and “c” is now “extended”.
stores <- stores %>%
  select(Store, Assortment, CompetitionDistance, CompetitionOpenSinceYear, Promo2) %>%
  mutate(Assortment = case_when(Assortment == "a" ~ "basic",
                                Assortment == "b" ~ "extra",
                                Assortment == "c" ~ "extended"))
  1. Join the stores data with the sales data (check out the left_join() function for this).
sales <- left_join(sales, stores, by = "Store")
  1. The sales data contains an error in the state_holiday variable. There are NAs where there should be 1s. Change this (remember that variable == NA won’t yield the result you want; use is.na(variable) instead).
sales <- sales %>%
    mutate(StateHoliday = case_when(is.na(StateHoliday) ~ 1,
                                   TRUE ~ 0))
  1. Rename the variables to be lower case and with underscores between words. Afterwards your variables should be named like this: “store”, “week_day”, “date”, “sales”, “customers”, “open”, “promo”, “state_holiday”, “school_holiday”, “assortment”, “competition_distance”, “competition_open_since”, and “store_promo”.
names(sales) <- c("store", "week_day", "date", "sales", "customers", "open",
                  "promo", "state_holiday", "school_holiday", "assortment",
                  "competition_distance", "competition_open_since","store_promo")

Data O

  1. Save the prepared data as “retailer_sales.csv” (this is easy to do with the write_csv() function).
write_csv(sales, "retailer_sales.csv")

Statistics

Flucutations over Days

To look at the average fluctuations over days you, we suggest you take a subsample of a few stores. You could then plot the individual trajectories, and if you like also add a mean line. You can also use a repeated measures test, to have a statistical test of the stability (you could, for example, use a correlation between two timepoints, or aggregate sales data of stores for each timepoint and run a regression. Note that with these two methods you will violate the assumption of independence of the data, so you cannot interpret the p-value).

  1. Run the following code to get a visual impression of how large the fluctuations are.
# first create a variable called "days" that is a counter
# for the number of days and will be easier to use than
# the date variable
store_ids <- unique(sales$store)
sales$days <- 0

for (i in store_ids){
  sales$days[sales$store == i] <- seq_len(sum(sales$store == i))
}

# take a subsample to plot
sales_sub <- sales[sales$store %in% sample(1:1115, 30),]

# get rid of dates where the stores were closed
sales_sub <- filter(sales_sub, sales > 0)


### Create a plot using ggplot:

ggplot(sales_sub, aes(x = days, y = sales)) + # specify the data
  geom_line(aes(group = store), col = "grey", alpha = .4) + # add line per store
  stat_smooth(lwd = 1.5) +  # add an average line
  theme_bw()                # theme for white plotting window