suppressMessages({
  suppressWarnings({
    library(tidyverse) # Includes all the necessary packages: dplyr, ggplot2, tidyr, and so on
    library(lubridate) # For dealing with dates
    library(lme4) # Linear mixed effects models
    library(rstanarm) # Bayesian linear mixed effects models with Stan
  })
})

In the previous tutorial, we read in and prepared a kaggle dataset for model-building. In this tutorial, we will build a few predictive models for these data, evaluate them on our validation set, and submit them to kaggle to see how we fare on their test set.

The Data

Recall the data: daily sales at each of \(1,115\) Rossman drug stores in Europe. Our goal is to use the store-level and store/day-level data provided to predict future daily sales for these same stores. See the previous tutorial for details of reading in the data, creating new features, and performing basic data analysis. We saved the model-ready data to disk, so we can load it as follows:

load("/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-training-set.RData")
load("/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-validation-set.RData")
load("/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-test-set.RData")

If you didn’t run the previous tutorial locally, you can download the data from github.

The load() function can take a URL as an argument; however, it has trouble with https.
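If you do go the download route, one simple workaround is to download the file to a temporary location and then load the local copy. This is only a sketch; the URL below is a placeholder, and you should substitute the raw github link to the file you actually want.

# Sketch only: download an .RData file, then load the local copy
# The URL below is a placeholder, not a real link
tmp <- tempfile(fileext = ".RData")
download.file("https://example.com/path/to/rossman-training-set.RData",
              destfile = tmp, mode = "wb")
load(tmp)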

The R function load() takes the given file and loads all the objects in this file into the global environment by default (if you’re calling the function interactively like we are). We can see that this loaded our datasets:

ls()
## [1] "test_discretized" "training_set"     "validation_set"

If you’re programming with load() rather than using it interactively, it’s a good idea to load files into a new environment you create, and then check what you loaded. For example,

# Not run:
# e <- new.env()
# load("my/file.RData",envir = e)
# ls(e) # Check what was loaded

We can take a look at the datasets we loaded:

glimpse(training_set)
## Observations: 798,540
## Variables: 14
## $ Store                <chr> "85", "122", "209", "259", "262", "274", ...
## $ DayOfWeek            <chr> "7", "7", "7", "7", "7", "7", "7", "7", "...
## $ Sales                <dbl> 12421, 3709, 3152, 18535, 29937, 7786, 34...
## $ Promo                <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ StateHoliday         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday        <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType            <chr> "b", "a", "a", "b", "b", "b", "d", "a", "...
## $ Assortment           <chr> "a", "c", "c", "b", "a", "b", "c", "c", "...
## $ CompetitionDistance  <fct> (255, Inf], (255, Inf], (255, Inf], (35,2...
## $ Promo2               <chr> "0", "0", "1", "0", "0", "1", "0", "1", "...
## $ PromoInterval        <chr> "0", "0", "Jan,Apr,Jul,Oct", "0", "0", "J...
## $ Month                <chr> "6", "6", "6", "6", "6", "6", "6", "6", "...
## $ comp_open_since_days <fct> (858,4.68e+03], (-Inf,858], (858,4.68e+03...
## $ promo2_since_days    <fct> (-2.5,0.5], (-2.5,0.5], (0.5,908], (-2.5,...
glimpse(validation_set)
## Observations: 45,852
## Variables: 14
## $ Store                <chr> "1", "2", "3", "4", "5", "6", "7", "8", "...
## $ DayOfWeek            <chr> "5", "5", "5", "5", "5", "5", "5", "5", "...
## $ Sales                <dbl> 5263, 6064, 8314, 13995, 4822, 5651, 1534...
## $ Promo                <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday        <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType            <chr> "c", "a", "a", "c", "a", "a", "a", "a", "...
## $ Assortment           <chr> "a", "a", "a", "c", "a", "a", "c", "a", "...
## $ CompetitionDistance  <fct> (255, Inf], (255, Inf], (255, Inf], (255,...
## $ Promo2               <chr> "0", "1", "1", "0", "0", "0", "0", "0", "...
## $ PromoInterval        <chr> "0", "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct"...
## $ Month                <chr> "7", "7", "7", "7", "7", "7", "7", "7", "...
## $ comp_open_since_days <fct> (858,4.68e+03], (858,4.68e+03], (858,4.68...
## $ promo2_since_days    <fct> (-2.5,0.5], (908, Inf], (908, Inf], (-2.5...
glimpse(test_discretized)
## Observations: 41,088
## Variables: 22
## $ Id                        <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ Store                     <chr> "1", "3", "7", "8", "9", "10", "11",...
## $ DayOfWeek                 <chr> "4", "4", "4", "4", "4", "4", "4", "...
## $ Date                      <date> 2015-09-17, 2015-09-17, 2015-09-17,...
## $ Open                      <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo                     <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday              <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday             <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType                 <chr> "c", "a", "a", "a", "a", "a", "a", "...
## $ Assortment                <chr> "a", "a", "c", "a", "c", "a", "c", "...
## $ CompetitionDistance       <fct> (255, Inf], (255, Inf], (255, Inf], ...
## $ CompetitionOpenSinceMonth <chr> "9", "12", "4", "10", "8", "9", "11"...
## $ CompetitionOpenSinceYear  <chr> "2008", "2006", "2013", "2014", "200...
## $ Promo2                    <chr> "0", "1", "0", "0", "0", "0", "1", "...
## $ Promo2SinceWeek           <chr> "0", "14", "0", "0", "0", "0", "1", ...
## $ Promo2SinceYear           <chr> "0", "2011", "0", "0", "0", "0", "20...
## $ PromoInterval             <chr> "0", "Jan,Apr,Jul,Oct", "0", "0", "0...
## $ Month                     <chr> "9", "9", "9", "9", "9", "9", "9", "...
## $ comp_open_date            <date> 2008-09-01, 2006-12-01, 2013-04-01,...
## $ comp_open_since_days      <fct> (858,4.68e+03], (858,4.68e+03], (858...
## $ promo2_since_date         <date> NA, 2011-04-02, NA, NA, NA, NA, 201...
## $ promo2_since_days         <fct> (-2.5,0.5], (908, Inf], (-2.5,0.5], ...

The training and validation sets were treated identically in the preprocessing, since we applied the preprocessing to the original kaggle “training” set and then held out the most recent 6 weeks of that data to form our validation set. The test set is kaggle’s original “test” set; we made sure to create on it all the new variables we added to the training set, but didn’t bother to delete the extra ones. We only use this set for prediction, so that is okay.

Notice that all of our features are categorical. This is because, during the preprocessing, we decided to convert the remaining numeric variables (there were only 3 anyway) to categorical by binning/discretizing them, in order to blunt the effect of long-tailed distributions and outliers. This is of course not the only way to handle them, and it is one part of the preprocessing you can go back and change if you are curious.
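As a reminder of what we mean by binning, here is a minimal sketch using base R’s cut(); the cutpoints are illustrative only and not necessarily the ones used in the previous tutorial.

# Sketch only: discretize a numeric variable into intervals with cut()
# The cutpoints below are illustrative, not the ones from the previous tutorial
competition_distance_raw <- c(20, 100, 1200, 8000)
cut(competition_distance_raw, breaks = c(-Inf, 35, 255, Inf))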

We can now build our first model.

The Models

Simple Linear Regression (not good)

Simple linear regression of the form \[ y = X\beta + \epsilon \] is not going to work well here. Why? Nonlinearity isn’t a problem anymore, since all of our features are categorical (that was another reason to discretize). What about the fact that observations are grouped? Sales on different days at the same store are likely to be more correlated than sales on different days at different stores. While this is a problem for inference (standard errors of the regression coefficients will be smaller than they should be), it is not necessarily a problem for prediction. The reason we want to account for it is not that ignoring it would be “wrong”; it is simply that accounting for it will probably improve our predictions.

First, let’s ignore the grouping and fit a simple linear regression. This takes about 3 seconds on my 2015 MacBook Pro with 16 gigs of RAM:

simple_lr_1 <- lm(Sales ~ . -Store,data = training_set)
summary(simple_lr_1)
## 
## Call:
## lm(formula = Sales ~ . - Store, data = training_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10954.3  -1711.2   -367.1   1238.8  29074.6 
## 
## Coefficients: (1 not defined because of singularities)
##                                       Estimate Std. Error  t value
## (Intercept)                          5442.8416   143.7909   37.852
## DayOfWeek2                          -1080.0328    10.3413 -104.438
## DayOfWeek3                          -1395.7035    10.3847 -134.400
## DayOfWeek4                          -1355.1777    10.5365 -128.617
## DayOfWeek5                           -991.2246    10.4531  -94.825
## DayOfWeek6                           -939.2667    11.1522  -84.223
## DayOfWeek7                           -902.0822    48.6038  -18.560
## Promo1                               2298.7121     6.6241  347.023
## StateHolidaya                         100.8102   101.7478    0.991
## StateHolidayb                        -216.0637   222.0098   -0.973
## StateHolidayc                       -1019.0403   317.0699   -3.214
## SchoolHoliday1                        309.6167     8.6476   35.804
## StoreTypeb                           5031.2103    33.1032  151.985
## StoreTypec                             -4.4576     9.1855   -0.485
## StoreTyped                            -72.1547     7.1061  -10.154
## Assortmentb                         -2837.2171    44.0682  -64.382
## Assortmentc                           868.2882     6.2789  138.288
## CompetitionDistance(35,255]          2616.3808    36.4026   71.873
## CompetitionDistance(255, Inf]        1244.3447    35.3703   35.180
## Promo21                             -1248.1185   138.6350   -9.003
## PromoIntervalFeb,May,Aug,Nov          343.2907    13.1453   26.115
## PromoIntervalJan,Apr,Jul,Oct          564.0096    11.2987   49.918
## PromoIntervalMar,Jun,Sept,Dec               NA         NA       NA
## Month10                                 0.2754    14.7827    0.019
## Month11                               499.5205    14.8834   33.562
## Month12                              1979.9849    14.9919  132.070
## Month2                                 88.9484    13.0651    6.808
## Month3                                307.6090    12.8444   23.949
## Month4                                431.3562    13.0539   33.044
## Month5                                533.2505    13.0864   40.748
## Month6                                416.6966    13.7303   30.349
## Month7                                128.2613    14.7843    8.676
## Month8                                -70.3441    15.2272   -4.620
## Month9                                -38.7299    14.7751   -2.621
## comp_open_since_days(858,4.68e+03]   -181.8072     6.2382  -29.144
## comp_open_since_days(4.68e+03, Inf]   409.3043    16.9256   24.182
## promo2_since_days(-2.5,0.5]          -331.3742   138.8105   -2.387
## promo2_since_days(0.5,908]           -569.1750    12.9700  -43.884
## promo2_since_days(908, Inf]           -22.0111    13.0488   -1.687
##                                     Pr(>|t|)    
## (Intercept)                          < 2e-16 ***
## DayOfWeek2                           < 2e-16 ***
## DayOfWeek3                           < 2e-16 ***
## DayOfWeek4                           < 2e-16 ***
## DayOfWeek5                           < 2e-16 ***
## DayOfWeek6                           < 2e-16 ***
## DayOfWeek7                           < 2e-16 ***
## Promo1                               < 2e-16 ***
## StateHolidaya                        0.32179    
## StateHolidayb                        0.33045    
## StateHolidayc                        0.00131 ** 
## SchoolHoliday1                       < 2e-16 ***
## StoreTypeb                           < 2e-16 ***
## StoreTypec                           0.62747    
## StoreTyped                           < 2e-16 ***
## Assortmentb                          < 2e-16 ***
## Assortmentc                          < 2e-16 ***
## CompetitionDistance(35,255]          < 2e-16 ***
## CompetitionDistance(255, Inf]        < 2e-16 ***
## Promo21                              < 2e-16 ***
## PromoIntervalFeb,May,Aug,Nov         < 2e-16 ***
## PromoIntervalJan,Apr,Jul,Oct         < 2e-16 ***
## PromoIntervalMar,Jun,Sept,Dec             NA    
## Month10                              0.98514    
## Month11                              < 2e-16 ***
## Month12                              < 2e-16 ***
## Month2                              9.90e-12 ***
## Month3                               < 2e-16 ***
## Month4                               < 2e-16 ***
## Month5                               < 2e-16 ***
## Month6                               < 2e-16 ***
## Month7                               < 2e-16 ***
## Month8                              3.84e-06 ***
## Month9                               0.00876 ** 
## comp_open_since_days(858,4.68e+03]   < 2e-16 ***
## comp_open_since_days(4.68e+03, Inf]  < 2e-16 ***
## promo2_since_days(-2.5,0.5]          0.01698 *  
## promo2_since_days(0.5,908]           < 2e-16 ***
## promo2_since_days(908, Inf]          0.09164 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2663 on 798502 degrees of freedom
## Multiple R-squared:  0.2644, Adjusted R-squared:  0.2644 
## F-statistic:  7757 on 37 and 798502 DF,  p-value: < 2.2e-16

Even though the fit is no good, we can do some basic sanity checks; for example, sales during the Christmas season (November and December, months 11 and 12) are higher than other months. The day of the week seems to matter, as does the competition distance. These aren’t formal statistical inferences; we’re just getting a feel for the variables in the model.

To evaluate the predictions, let’s first score the model on the validation set, then compute the target metric (root mean square percentage error, RMSPE) and plot the predicted vs actual Sales.
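For reference, writing the observed sales as \(y_{i}\) and the predictions as \(\hat{y}_{i}\), the metric computed by get_rmspe() below is \[ \text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_{i} - \hat{y}_{i}}{y_{i}}\right)^{2}}. \]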

simple_lr_validation_preds <- data_frame(
  observed = validation_set %>% pull(Sales),
  predicted = predict(simple_lr_1,newdata = validation_set)
)
## Warning in predict.lm(simple_lr_1, newdata = validation_set): prediction
## from a rank-deficient fit may be misleading
simple_lr_validation_preds
## # A tibble: 45,852 x 2
##    observed predicted
##       <dbl>     <dbl>
##  1     5263     7915.
##  2     6064     7545.
##  3     8314     7545.
##  4    13995     8783.
##  5     4822     8101.
##  6     5651     8101.
##  7    15344     8969.
##  8     8492     8101.
##  9     8565     9379.
## 10     7185     7919.
## # ... with 45,842 more rows
# Score them; RMSPE
get_rmspe <- function(preds) {
  preds %>%
    summarize(rmspe = sqrt( mean( ((observed - predicted)/observed)^2 ))) %>%
    pull(rmspe)
}
get_rmspe(simple_lr_validation_preds) # Very bad!
## Warning: package 'bindrcpp' was built under R version 3.4.4
## [1] 0.4367195
# Plot predicted vs actual
simple_lr_validation_preds %>%
  ggplot(aes(x = observed,y = predicted)) +
  theme_classic() + 
  geom_point(pch = 21,colour = "black",fill = "lightblue") +
  geom_abline(slope = 1,intercept = 0,colour = "red",linetype = "dashed") +
  labs(title = "Validation Predictions, Simple Linear Regression",
       subtitle = "Rossman Store Sales Data",
       x = "Observed",
       y = "Predicted") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::dollar_format())

We see that the predictions have high variance, and that we are underpredicting by a large amount (look at the axis labels).

Linear Mixed Effects Model (better)

We could account for different stores’ differing sales by including Store as a categorical variable, resulting in a different intercept for each store. This would mean estimating over a thousand regression coefficients; while not impossible using modern hardware, it’s also not desirable. Such a model would be hard to interpret (it would be hard to even print it to the screen), and the estimation of so many distinct intercepts might be unstable.

A more stable and computationally efficient approach is to use a linear mixed effects model, as implemented in the user-friendly lme4 package. We’ll fit the model \[ \begin{aligned} y_{ij} &= x_{ij}^{\prime}\beta + b_{i} + \epsilon_{ij} \\ b_{i} &\overset{IID}{\sim} N(0,\sigma^{2}_{b}) \\ \epsilon_{ij} &\overset{IID}{\sim} N(0,\sigma^{2}) \end{aligned} \] Here \(x_{ij}\) is a vector containing the features for the \(i^{th}\) store on the \(j^{th}\) day. The \(b_{i}\) is a random variable representing the \(i^{th}\) store’s deviation from the “baseline” average daily sales. The \(\epsilon_{ij}\) is the familiar error term from simple linear regression. This model still specifies a different intercept for every store; however, the estimation will shrink these intercepts toward their mean, improving stability. We take the random effects \(b_{i}\) to have mean \(0\) because \(x_{ij}\) still contains a 1, that is, the model still has a fixed intercept \(\beta_{0}\) which represents the global average daily sales across all the stores.
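One way to see how this model accounts for the grouping: under this specification, two observations from the same store are positively correlated, \[ \mathrm{Corr}(y_{ij},y_{ik}) = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}}, \quad j \neq k, \] while observations from different stores are independent. The simple linear regression above corresponds to the special case \(\sigma^{2}_{b} = 0\).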

We can fit this model using the lme4::lmer() function. This took about 30 seconds on my laptop.

# Annoyingly, we have to type the formula out manually, as predict.merMod cannot handle
# formulae with a dot
lm1ff <- as.formula(str_c(
  "Sales ~ ",
  str_c(colnames(training_set)[-c(1,3)],collapse = " + "),
  " + (1|Store)"
))

lme1 <- lmer(lm1ff,data = training_set)
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
summary(lme1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## Sales ~ DayOfWeek + Promo + StateHoliday + SchoolHoliday + StoreType +  
##     Assortment + CompetitionDistance + Promo2 + PromoInterval +  
##     Month + comp_open_since_days + promo2_since_days + (1 | Store)
##    Data: training_set
## 
## REML criterion at convergence: 13889219
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -11.2136  -0.5798  -0.0694   0.4908  18.2148 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Store    (Intercept) 5059645  2249    
##  Residual             2074879  1440    
## Number of obs: 798540, groups:  Store, 1115
## 
## Fixed effects:
##                                       Estimate Std. Error t value
## (Intercept)                          4669.8119   802.2221     5.8
## DayOfWeek2                          -1080.0931     5.5939  -193.1
## DayOfWeek3                          -1395.8227     5.6174  -248.5
## DayOfWeek4                          -1355.3635     5.6995  -237.8
## DayOfWeek5                           -990.9379     5.6544  -175.3
## DayOfWeek6                           -942.8404     6.0327  -156.3
## DayOfWeek7                           -531.6864    26.8263   -19.8
## Promo1                               2296.2582     3.5833   640.8
## StateHolidaya                         282.0506    55.1367     5.1
## StateHolidayb                         137.4240   120.1801     1.1
## StateHolidayc                        -825.9343   171.5343    -4.8
## SchoolHoliday1                        296.4689     4.6818    63.3
## StoreTypeb                           4804.6812   803.4137     6.0
## StoreTypec                            -65.3959   207.7072    -0.3
## StoreTyped                            -64.7802   159.4595    -0.4
## Assortmentb                         -2763.5346  1094.9321    -2.5
## Assortmentc                           846.2646   140.5671     6.0
## CompetitionDistance(35,255]          2604.5647   824.9526     3.2
## CompetitionDistance(255, Inf]        1244.5223   800.8928     1.6
## Promo21                             -1163.0863   251.9336    -4.6
## PromoIntervalFeb,May,Aug,Nov          301.4425   295.7666     1.0
## PromoIntervalJan,Apr,Jul,Oct          592.7468   252.3429     2.3
## Month10                                 0.7914     8.0059     0.1
## Month11                               495.7651     8.0611    61.5
## Month12                              1981.0875     8.1233   243.9
## Month2                                 82.2465     7.0679    11.6
## Month3                                291.9649     6.9510    42.0
## Month4                                412.5546     7.0658    58.4
## Month5                                511.0787     7.0851    72.1
## Month6                                403.6988     7.4281    54.3
## Month7                                141.2168     8.0069    17.6
## Month8                                -68.6842     8.2459    -8.3
## Month9                                -47.1240     7.9988    -5.9
## comp_open_since_days(858,4.68e+03]    107.4904    10.8124     9.9
## comp_open_since_days(4.68e+03, Inf]   275.9587    23.1721    11.9
## promo2_since_days(-2.5,0.5]           353.1511    75.2237     4.7
## promo2_since_days(0.5,908]            271.5260     9.6245    28.2
## promo2_since_days(908, Inf]           568.7239    12.8222    44.4
## 
## Correlation matrix not shown by default, as p = 38 > 12.
## Use print(x, correlation=TRUE)  or
##   vcov(x)     if you need it
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient

Recall that the PromoIntervalMar,Jun,Sept,Dec level resulted in an NA coefficient in the simple linear regression; lmer drops this level entirely. This is okay, but it’s something to make note of.
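If you want to confirm which level was dropped, a quick check is to compare the coefficient names from the two fits, using the objects we already have:

# Which fixed-effect levels appear in the linear model but not in the mixed model?
setdiff(names(coef(simple_lr_1)), names(fixef(lme1)))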

Comparing these coefficient estimates to the ones from the simple linear regression, we see that they are similar, but slightly smaller on average and sometimes in the opposite direction:

data_frame(
  coefficient = names(coef(simple_lr_1))[-23],
  Simple = coef(simple_lr_1)[-23],
  Mixed = coef(summary(lme1))[ ,1]
) %>%
  gather(type,value,Simple:Mixed) %>%
  ggplot(aes(x = coefficient,y = value,group = type,fill = type)) +
  theme_classic() +
  geom_bar(stat = "identity",position = "dodge",colour = "black") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Comparison of Coefficients, Simple and Mixed Effects Linear Regression",
       subtitle = "Rossman Store Sales Data",
       x = "",
       y = "Coefficient Estimate",
       fill = "")

We can also take a look at the estimated intercepts. “Estimated” is not really the correct term anymore, since these are no longer parameter estimates but random variables. What we actually look at are the predicted random effects, which are given by the mean of the distribution of \(b_{i} | y\), the random effects conditional on the observed data. These are precomputed by the lmer function:

ranef(lme1)$Store %>%
  as_data_frame() %>%
  rename(intercept = `(Intercept)`) %>%
  ggplot(aes(x = intercept)) +
  theme_classic() +
  geom_histogram(bins = 100,colour = "black",fill = "purple") +
  labs(title = "Histogram of Predicted Random Intercepts, lme1 Model",
       subtitle = "Rossman Store Sales Data",
       x = "Predicted Intercept",
       y = "Count") +
  scale_x_continuous(labels = scales::comma_format())

This is interesting: the distribution of predicted random effects has a long right tail. That is bad news for inference, since it violates the normality assumption on the random effects, but it is potentially good news for prediction, since the stores with very high sales were likely predicted very poorly by our simple linear regression model, and hence inflated our prediction error.
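If you want a closer look at that normality assumption, a quick sketch is a normal QQ plot of the predicted intercepts using base R graphics:

# Sketch: normal QQ plot of the predicted random intercepts
store_effects <- ranef(lme1)$Store[["(Intercept)"]]
qqnorm(store_effects, main = "Normal QQ Plot, Predicted Random Intercepts")
qqline(store_effects, col = "red", lty = 2)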

Let’s take a look at the predictions from this model:

lme1_validation_preds <- data_frame(
  observed = validation_set %>% pull(Sales),
  predicted = predict(lme1,newdata = validation_set)
)
lme1_validation_preds
## # A tibble: 45,852 x 2
##    observed predicted
##       <dbl>     <dbl>
##  1     5263     6070.
##  2     6064     6236.
##  3     8314     8322.
##  4    13995    10906.
##  5     4822     5967.
##  6     5651     6848.
##  7    15344    10050.
##  8     8492     6779.
##  9     8565     7835.
## 10     7185     6828.
## # ... with 45,842 more rows
# Score them; RMSPE
get_rmspe(lme1_validation_preds) # Better
## [1] 0.2209619
# Plot predicted vs actual
lme1_validation_preds %>%
  ggplot(aes(x = observed,y = predicted)) +
  theme_classic() + 
  geom_point(pch = 21,colour = "black",fill = "lightblue") +
  geom_abline(slope = 1,intercept = 0,colour = "red",linetype = "dashed") +
  labs(title = "Validation Predictions, Mixed Effects Model",
       subtitle = "Rossman Store Sales Data",
       x = "Observed",
       y = "Predicted") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::dollar_format())

Hey, those predictions look a lot better than before!

How could we extend this model? We could try adding more random effects terms, or interactions between fixed effects. Let’s take a look at some randomly selected stores’ sales visually to try and see how they change over time. For good measure we’ll also include the predictions from the most recent model.

# Use the training predictions and include the Month and DayOfWeek
# Since there will be roughly 4 observations per store / day of week / month, average them
# for plotting purposes. Also add a combined indicator for month/day for the x-axis
# Note we use the training predictions because the validation set only has 6 weeks of data
lme1_trainpreds_withdate <- training_set %>%
  select(Store,Sales,Month,DayOfWeek) %>%
  bind_cols(data_frame(predicted = predict(lme1,newdata = training_set))) %>%
  group_by(Store,Month,DayOfWeek) %>%
  summarize(observed = mean(Sales),predicted = mean(predicted)) %>%
  mutate(dateind = lubridate::ymd(str_c("2015",Month,DayOfWeek,sep="-")))


# Randomly sample 20 stores and plot them
set.seed(897623)
randstore <- sample(unique(validation_set$Store),20)

lme1_trainpreds_withdate %>%
  gather(type,value,observed:predicted) %>%
  filter(Store %in% randstore) %>%
  ggplot(aes(x = dateind,group = Store,y = value,colour = type)) +
  theme_classic() +
  facet_wrap(~Store) +
  geom_line() +
  labs(title = "Predicted and Observed Sales by Date, Randomly Selected Stores",
       subtitle = "Rossman Store Sales Data",
       x = "Date",
       y = "Sales") +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_y_continuous(labels = scales::dollar_format())

You can re-run the above code using different random seeds to get a fuller idea of how stores’ sales change with the day and month. It seems like introducing a random slope for day of the week might be a good idea; the same for month might also be helpful. Introducing further random effects can cause matrices in the underlying calculations to become very large, so it should be done carefully.

When I ran the code below on my laptop, it hadn’t finished after an hour, so I killed it. You may wish to try it on your own, perhaps letting it run overnight.

lm2ff <- as.formula(str_c(
  "Sales ~ ",
  str_c(colnames(training_set)[-c(1,3)],collapse = " + "),
  " + (DayOfWeek|Store)"
))

lme2 <- lmer(lm2ff,data = training_set)
summary(lme2)

For now, we can score the test set and submit the results to kaggle:

test_scored_lme1 <- test_discretized %>%
  bind_cols(data_frame(Sales = predict(lme1,newdata = test_discretized))) %>%
  mutate(Sales = if_else(Open == 0,0,Sales)) %>%
  mutate(Sales = if_else(is.na(Sales),0,Sales)) %>% # There are 11 NA predictions; set them to 0
  dplyr::select(Id,Sales)

test_scored_lme1
## # A tibble: 41,088 x 2
##    Id    Sales
##    <chr> <dbl>
##  1 1     5221.
##  2 2     7473.
##  3 3     9309.
##  4 4     5930.
##  5 5     6986.
##  6 6     5979.
##  7 7     8740.
##  8 8     8022.
##  9 9     5497.
## 10 10    6109.
## # ... with 41,078 more rows
readr::write_csv(test_scored_lme1,"/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-lme1-predictions.csv")

With the results written to disk, you can then go ahead and submit them through kaggle either through your web browser, or using their API as described in the previous tutorial:

kaggle competitions submit -c rossmann-store-sales -f rossman-lme1-predictions.csv -m "LME1"

This yielded me a public score of 0.1924, a substantial improvement over our linear regression from before, but still not world-class (the winning RMSPE was about 0.10).

With this workflow in place, we are free to try other things, for example a transformation on the response:

lm3ff <- as.formula(str_c(
  "log(Sales+1) ~ ",
  str_c(colnames(training_set)[-c(1,3)],collapse = " + "),
  " + (1|Store)"
))

lme3 <- lmer(lm3ff,data = training_set)
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
summary(lme3)
## Linear mixed model fit by REML ['lmerMod']
## Formula: 
## log(Sales + 1) ~ DayOfWeek + Promo + StateHoliday + SchoolHoliday +  
##     StoreType + Assortment + CompetitionDistance + Promo2 + PromoInterval +  
##     Month + comp_open_since_days + promo2_since_days + (1 | Store)
##    Data: training_set
## 
## REML criterion at convergence: -185464
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -44.151  -0.518  -0.002   0.547   9.649 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Store    (Intercept) 0.09272  0.3045  
##  Residual             0.04593  0.2143  
## Number of obs: 798540, groups:  Store, 1115
## 
## Fixed effects:
##                                       Estimate Std. Error  t value
## (Intercept)                          8.4274064  0.1087065   77.524
## DayOfWeek2                          -0.1301967  0.0008323 -156.438
## DayOfWeek3                          -0.1692195  0.0008358 -202.476
## DayOfWeek4                          -0.1611301  0.0008480 -190.018
## DayOfWeek5                          -0.1019732  0.0008413 -121.215
## DayOfWeek6                          -0.1516023  0.0008975 -168.908
## DayOfWeek7                          -0.2195180  0.0039912  -55.001
## Promo1                               0.3318652  0.0005331  622.499
## StateHolidaya                       -0.0847782  0.0082032  -10.335
## StateHolidayb                       -0.2065328  0.0178803  -11.551
## StateHolidayc                       -0.1436036  0.0255207   -5.627
## SchoolHoliday1                       0.0378592  0.0006966   54.352
## StoreTypeb                           0.5277754  0.1087669    4.852
## StoreTypec                           0.0098442  0.0281199    0.350
## StoreTyped                           0.0239717  0.0215880    1.110
## Assortmentb                         -0.2378302  0.1482326   -1.604
## Assortmentc                          0.1173959  0.0190304    6.169
## CompetitionDistance(35,255]          0.3529484  0.1116837    3.160
## CompetitionDistance(255, Inf]        0.1717344  0.1084265    1.584
## Promo21                             -0.2060436  0.0344191   -5.986
## PromoIntervalFeb,May,Aug,Nov         0.0474225  0.0400416    1.184
## PromoIntervalJan,Apr,Jul,Oct         0.0925148  0.0341628    2.708
## Month10                              0.0080632  0.0011911    6.769
## Month11                              0.0807861  0.0011993   67.360
## Month12                              0.2465131  0.0012086  203.970
## Month2                               0.0127838  0.0010516   12.157
## Month3                               0.0409140  0.0010342   39.562
## Month4                               0.0581873  0.0010512   55.351
## Month5                               0.0814508  0.0010541   77.269
## Month6                               0.0604532  0.0011051   54.702
## Month7                               0.0231240  0.0011913   19.411
## Month8                              -0.0042227  0.0012268   -3.442
## Month9                              -0.0056191  0.0011900   -4.722
## comp_open_since_days(858,4.68e+03]   0.0142338  0.0016077    8.853
## comp_open_since_days(4.68e+03, Inf]  0.0418573  0.0034461   12.146
## promo2_since_days(-2.5,0.5]          0.0290923  0.0111917    2.599
## promo2_since_days(0.5,908]           0.0614155  0.0014318   42.893
## promo2_since_days(908, Inf]          0.1085917  0.0019073   56.933
## 
## Correlation matrix not shown by default, as p = 38 > 12.
## Use print(x, correlation=TRUE)  or
##   vcov(x)     if you need it
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
# Don't forget to exponentiate the resulting predictions!
lme3_validation_preds <- data_frame(
  observed = validation_set %>% pull(Sales),
  predicted = exp(predict(lme3,newdata = validation_set))
)
lme3_validation_preds
## # A tibble: 45,852 x 2
##    observed predicted
##       <dbl>     <dbl>
##  1     5263     5753.
##  2     6064     5762.
##  3     8314     8237.
##  4    13995    11602.
##  5     4822     5305.
##  6     5651     6562.
##  7    15344    10376.
##  8     8492     6335.
##  9     8565     7830.
## 10     7185     6701.
## # ... with 45,842 more rows
# Score them; RMSPE
get_rmspe(lme3_validation_preds) # Better
## [1] 0.2012049
# Plot predicted vs actual
lme3_validation_preds %>%
  ggplot(aes(x = observed,y = predicted)) +
  theme_classic() + 
  geom_point(pch = 21,colour = "black",fill = "lightblue") +
  geom_abline(slope = 1,intercept = 0,colour = "red",linetype = "dashed") +
  labs(title = "Validation Predictions, Mixed Effects Model, log-transformed Sales",
       subtitle = "Rossman Store Sales Data",
       x = "Observed",
       y = "Predicted") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::dollar_format())

test_scored_lme3 <- test_discretized %>%
  bind_cols(data_frame(Sales = exp(predict(lme3,newdata = test_discretized)))) %>%
  mutate(Sales = if_else(Open == 0,0,Sales)) %>%
  mutate(Sales = if_else(is.na(Sales),0,Sales)) %>% # There are 11 NA predictions; set them to 0
  dplyr::select(Id,Sales)

test_scored_lme3
## # A tibble: 41,088 x 2
##    Id    Sales
##    <chr> <dbl>
##  1 1     5074.
##  2 2     7264.
##  3 3     9281.
##  4 4     5586.
##  5 5     6905.
##  6 6     5909.
##  7 7     8772.
##  8 8     7950.
##  9 9     5210.
## 10 10    5907.
## # ... with 41,078 more rows
readr::write_csv(test_scored_lme3,"/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-lme3-predictions.csv")

Submitting this got me a public score of 0.178; the improvement looks slight, but it is far from insignificant on the leaderboard. We could keep iterating in this way.

Next we’ll look at a Bayesian approach to predictive modelling.

Bayesian Linear Mixed Effects Modelling with Stan

Here we’ll investigate a Bayesian approach to fitting the above linear mixed effects model, where we put prior distributions on all unknown quantities, compute their posterior distribution given the data, and then make predictions by averaging over the models defined by the posterior parameter configurations, weighted by their posterior probabilities. For an introduction to Bayesian inference in general, see the Bayesian inference tutorial.

Our statement of the linear mixed effects model was as follows: \[ \begin{aligned} y_{ij} &= x_{ij}^{\prime}\beta + b_{i} + \epsilon_{ij} \\ b_{i} &\overset{IID}{\sim} N(0,\sigma^{2}_{b}) \\ \epsilon_{ij} &\overset{IID}{\sim} N(0,\sigma^{2}) \end{aligned} \] This hierarchical specification is what we need to account for the fact that observations (daily sales) from the same store are likely to be related to each other; it is highly desirable to use a store’s previous sales to predict its future sales.

For the Bayesian approach, the model is written out the same way, except that we now have a further level of probability distributions to specify: the prior distributions on the remaining fixed, unknown quantities: \[ \begin{aligned} y_{ij} &= x_{ij}^{\prime}\beta + b_{i} + \epsilon_{ij} \\ b_{i} &\overset{IID}{\sim} N(0,\sigma^{2}_{b}) \\ \epsilon_{ij} &\overset{IID}{\sim} N(0,\sigma^{2}) \\ \beta &\sim F_{\beta} \\ \sigma^{2}_{b} &\sim F_{\sigma^{2}_{b}} \\ \sigma^{2} &\sim F_{\sigma^{2}} \\ \end{aligned} \] Here, \(F_{\beta}\) and so on are probability distributions that we get to choose; they represent the range of values for these parameters that we consider plausible before seeing the data. We can choose them based on actual prior knowledge (e.g. maybe we think the variance is between 1 and 10, for some reason), or based on how the choice affects the posterior distribution; see “Choosing a Prior” in the previously-linked Bayesian inference tutorial for more information. When building a predictive model, where we care more about how the model predicts the future than about which parameter values it deems plausible, the latter consideration, namely how the prior affects the resulting posterior and predictions, is the more relevant one.

To fit this Bayesian linear mixed effects model, we have a lot of steps:

  • Choose prior distributions
  • Compute an expression for the posterior distribution of the parameters given the data
  • Compute an expression for the predictive distribution of a new datapoint, given the data
  • Fit the model, either by
    • computing the posterior and predictive distributions analytically using math (usually not possible) or
    • devising a sampling scheme based on Markov Chain Monte Carlo (MCMC) to draw exact samples from the posterior and predictive distributions or
    • devising an approximation to the posterior and predictive distribution, using an available method such as Integrated Nested Laplace Approximation or Variational Inference
  • Draw from the predictive distribution to predict new observations (sales from the observed stores on new days)

That’s a lot of steps. In practical predictive modelling, the methods used are often extremely complicated and require a lot of background to understand completely. While nothing can substitute for developing this background, it is usually not necessary to understand a method completely before you can apply it to some data and get a feel for how it works. In addition, even if you have extensive background in a topic (here: Bayesian inference), you will rarely need to code it up from scratch. Before starting any modelling exercise, look at what software is available.

In this case we will illustrate the use of the Bayesian computation/model-fitting software suite Stan, which we will do through the rstanarm R package. Stan is general-purpose software for performing Bayesian computations, and the rstanarm package provides an R formula interface for fitting Bayesian regression models using Stan. Here we will illustrate the use of this package and discuss practical issues; to learn more about how Stan works, see the Stan documentation.

The stan_glmer function implements the above model. Even on a modern laptop, the computations performed by Stan on the full dataset are quite intensive. By default, Stan uses an efficient MCMC sampling algorithm to draw from the posterior and predictive distributions; for the Rossman data this takes a very long time on my laptop. There are two approaches we can take here: reduce the dataset by subsampling, or change the method of computing the posterior. For starters, let’s try the latter: rather than obtaining exact samples via MCMC (again, the default in rstanarm), we will set algorithm = "meanfield" to use mean field Variational Inference, an efficient (but potentially inaccurate) technique for approximating the posterior.
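If you wanted to try the subsampling route instead, a minimal sketch (not run here) would be to keep all of the rows for a random subset of stores, for example:

# Not run: a sketch of the subsampling alternative
# Keep every row for a random 20% of stores
set.seed(1234)
keep_stores <- sample(unique(training_set$Store),
                      size = round(0.2 * n_distinct(training_set$Store)))
training_subsample <- training_set %>% filter(Store %in% keep_stores)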

In setting the arguments for the stan_glmer function below, I relied on the rstanarm package vignettes, which you can consult for more information.

Running this on the full dataset with the default prior distributions for all terms still took 2 - 3 hours. Let’s see how this is done and how we can evaluate the results.

## I ran the below outside of this notebook.
## This took several hours
stan1 <- stan_glmer(
  formula = lm3ff,
  data = training_set,
  family = gaussian,
  algorithm = "meanfield",
  sparse = TRUE
)

# summary(stan1) # Prints output for each store, very long
save(stan1,file = "/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-rstan1.RData")
# Load the object into its own environment, to reduce clutter in the global environment
load_stan <- new.env()
load("/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-rstan1.RData",envir = load_stan)
ls(load_stan)
## [1] "stan1"
pryr::object_size(load_stan$stan1)
## 610 MB

We see that the fitted model object is quite large. As before, we should compare the predicted values on the training set to the observed values, and do the same on the validation set. Since the output of stan_glmer is not a static prediction equation but rather a predictive distribution, a further sensible check is to draw new datasets from this predictive distribution and verify that they have similar statistical properties to the original training set. This can be accomplished using the pp_check function:

# Draw 5 new datasets from the predictive distribution and compare them to the training set
# Remember, the scale on the x-axis is log(sales + 1)
pp_check(load_stan$stan1,plotfun = "hist",nreps = 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The datasets generated from the predictive distribution look similar to the training set, which is one indication that the model fits the training data well. You could also go into more detail and investigate summary statistics like predictive means and standard deviations.

To obtain an actual set of validation predictions, we can draw from the predictive distribution at the covariate values \(X\) for the stores in the validation set, using the posterior_predict function. Its output is not in a form that works directly with tidyverse functions, so it needs some manipulation: posterior_predict returns a matrix whose columns correspond to rows of the validation set and whose rows correspond to samples from the predictive distribution. In other words, it doesn’t output a single prediction for each store and day; it outputs a sample from the predictive distribution for that store and day, which lets us compute both a point estimate (the actual prediction we’ll use) and a measure of uncertainty in the prediction itself.
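Before collapsing the draws down to a single prediction per row, note that the same output lets us quantify that uncertainty directly. Here is a minimal sketch, using the fitted object and validation set from above, of 90% predictive intervals for the first few validation rows:

# Sketch: 90% predictive intervals for the first 5 validation rows
# Rows of the result correspond to validation rows; columns are the 5%, 50%, and 95% quantiles
pred_draws <- posterior_predict(load_stan$stan1, newdata = validation_set[1:5, ])
exp(t(apply(pred_draws, 2, quantile, probs = c(0.05, 0.5, 0.95))))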

We need to transpose the output matrix, match each row (formerly column) in this to the corresponding row in the validation set, compute a single prediction for each store, and output a dataframe containing the observed and predicted values like we had before:

stan1_validpreds <- posterior_predict(load_stan$stan1,newdata = validation_set) %>%
  t() %>% # Transpose; rows now correspond to rows in the validation set
  apply(1,mean) %>% # Compute the mean of each row using R's built-in apply function
  exp() %>% # Convert back to the scale of the original observations (remember we modelled log(sales))
  data_frame(observed = validation_set$Sales,predicted = .)


# Evaluate predictions  
get_rmspe(stan1_validpreds)
## [1] 1.61035
stan1_validpreds %>%
  ggplot(aes(x = observed,y = predicted)) +
  theme_classic() + 
  geom_point(pch = 21,colour = "black",fill = "lightblue") +
  geom_abline(slope = 1,intercept = 0,colour = "red",linetype = "dashed") +
  labs(title = "Validation Predictions, Bayesian LME Model",
       subtitle = "Rossman Store Sales Data",
       x = "Observed",
       y = "Predicted") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::dollar_format())

Woah! That’s not good. The predictions look good “on average”, but many individual predictions are way off, with a chunk being drastically overestimated, and a chunk being somewhat underestimated. What is happening?

Looking at the distribution of prediction errors, we see the bimodal distribution suggested by the above plot:

stan1_validpreds %>%
  mutate(error = predicted - observed) %>%
  ggplot(aes(x = error)) +
  theme_classic() +
  geom_histogram(bins = 100,colour = "black",fill = "lightblue") +
  labs(title = "Prediction Errors - Bayesian LME Model",
       subtitle = "Validation Data",
       x = "Predicted - Observed",
       y = "Count") +
  scale_x_continuous(labels = scales::dollar_format(),breaks = seq(-30000,1500000,by=10000)) +
  scale_y_continuous(labels = scales::comma_format()) +
  theme(axis.text.x = element_text(angle = 90))

We could go into more detail, but frankly the predictions are so bad overall that it’s probably better to try something else. One very compelling reason to take a Bayesian approach to predictive modelling is the degree of flexibility it allows the modeller (you!). There are many things we could try here, for example:

  • Change the default prior distributions; we could be lazy and guess or use trial and error, or we could put more thought into what types of prior distributions would penalize results like the ones we got above (a minimal sketch of how priors can be specified in rstanarm is shown after this list)
  • Change the inference methodology. In Stan we specified algorithm = "meanfield" because this is the fastest option available and our dataset is large. However, the meanfield variational approximation is also potentially the least accurate; in particular, it assumes independence between posterior parameters, which is extremely restrictive. Changing this to the "fullrank" variational approximation, or better still "sampling" (which implements MCMC sampling from the posterior), would most likely improve the model’s predictive performance, but at a much greater computational cost.
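As mentioned in the first point above, here is a minimal sketch of how the default priors might be overridden in rstanarm. The particular prior families and scales below are illustrative guesses only, not recommendations, and the code is not run here:

# Not run: a sketch of overriding rstanarm's default priors
prior_summary(load_stan$stan1) # Inspect the priors the fitted model actually used

stan2 <- stan_glmer(
  formula = lm3ff,
  data = training_set,
  family = gaussian,
  prior = normal(0, 1),           # Fixed-effect coefficients (log-sales scale)
  prior_intercept = normal(8, 5), # Intercept, on the log-sales scale
  prior_aux = exponential(1),     # Residual standard deviation
  algorithm = "meanfield",
  sparse = TRUE
)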

We won’t go further here, but you now have the tools you need to fit and evaluate predictive models on a somewhat complicated dataset, with grouped observations exhibiting temporal correlation. Happy modelling!