suppressMessages({
suppressWarnings({
library(tidyverse) # Includes dplyr, ggplot2, tidyr, and the rest of the tidyverse
library(lubridate) # For dealing with dates
library(lme4) # Linear mixed effects models
library(rstanarm) # Bayesian linear mixed effects models with Stan
})
})
In the previous tutorial, we read in and prepared a kaggle dataset for model-building. In this tutorial, we will build a few predictive models for these data, evaluate them on our validation set, and submit them to kaggle to see how we fare on their test set.
Recall the data: daily sales at each of \(1,115\) Rossman drug stores in Europe. Our goal is to use the store-level and store/day-level features provided to predict daily sales for these same stores in the future. See the previous tutorial for details on reading in the data, creating new features, and performing basic data analysis. We saved the model-ready data to disk, so we can load it as follows:
load("/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-training-set.RData")
load("/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-validation-set.RData")
load("/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-test-set.RData")
If you didn’t run the previous tutorial locally, you can download the saved datasets from GitHub. The load() function can take a URL as an argument, but it has trouble with https, so the simplest workaround is to download each file first and then load it from disk.
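A minimal sketch of that workaround; the URL below is a placeholder for the raw-file link in the tutorial's GitHub repository, so this is not run:
# Not run: download a saved .RData file to a temporary location, then load it
# tmp <- tempfile(fileext = ".RData")
# download.file("https://github.com/<user>/<repo>/raw/master/rossman-training-set.RData",
#               destfile = tmp, mode = "wb") # mode = "wb" keeps binary files intact on Windows
# load(tmp)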
By default (i.e. when called interactively, as we are doing here), R's load() function takes the given file and loads all the objects it contains into the global environment. We can see that this loaded our datasets:
ls()
## [1] "test_discretized" "training_set" "validation_set"
If you’re programming with load() rather than using it interactively, it’s a good idea to load files into a new environment that you create, and then check what you loaded. For example,
# Not run:
# e <- new.env()
# load("my/file.RData",envir = e)
# ls(e) # Check exactly which objects were loaded
We can take a look at the datasets we loaded:
glimpse(training_set)
## Observations: 798,540
## Variables: 14
## $ Store <chr> "85", "122", "209", "259", "262", "274", ...
## $ DayOfWeek <chr> "7", "7", "7", "7", "7", "7", "7", "7", "...
## $ Sales <dbl> 12421, 3709, 3152, 18535, 29937, 7786, 34...
## $ Promo <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType <chr> "b", "a", "a", "b", "b", "b", "d", "a", "...
## $ Assortment <chr> "a", "c", "c", "b", "a", "b", "c", "c", "...
## $ CompetitionDistance <fct> (255, Inf], (255, Inf], (255, Inf], (35,2...
## $ Promo2 <chr> "0", "0", "1", "0", "0", "1", "0", "1", "...
## $ PromoInterval <chr> "0", "0", "Jan,Apr,Jul,Oct", "0", "0", "J...
## $ Month <chr> "6", "6", "6", "6", "6", "6", "6", "6", "...
## $ comp_open_since_days <fct> (858,4.68e+03], (-Inf,858], (858,4.68e+03...
## $ promo2_since_days <fct> (-2.5,0.5], (-2.5,0.5], (0.5,908], (-2.5,...
glimpse(validation_set)
## Observations: 45,852
## Variables: 14
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "8", "...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "5", "...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651, 1534...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "a", "...
## $ CompetitionDistance <fct> (255, Inf], (255, Inf], (255, Inf], (255,...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "0", "...
## $ PromoInterval <chr> "0", "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct"...
## $ Month <chr> "7", "7", "7", "7", "7", "7", "7", "7", "...
## $ comp_open_since_days <fct> (858,4.68e+03], (858,4.68e+03], (858,4.68...
## $ promo2_since_days <fct> (-2.5,0.5], (908, Inf], (908, Inf], (-2.5...
glimpse(test_discretized)
## Observations: 41,088
## Variables: 22
## $ Id <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ Store <chr> "1", "3", "7", "8", "9", "10", "11",...
## $ DayOfWeek <chr> "4", "4", "4", "4", "4", "4", "4", "...
## $ Date <date> 2015-09-17, 2015-09-17, 2015-09-17,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType <chr> "c", "a", "a", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "c", "a", "c", "a", "c", "...
## $ CompetitionDistance <fct> (255, Inf], (255, Inf], (255, Inf], ...
## $ CompetitionOpenSinceMonth <chr> "9", "12", "4", "10", "8", "9", "11"...
## $ CompetitionOpenSinceYear <chr> "2008", "2006", "2013", "2014", "200...
## $ Promo2 <chr> "0", "1", "0", "0", "0", "0", "1", "...
## $ Promo2SinceWeek <chr> "0", "14", "0", "0", "0", "0", "1", ...
## $ Promo2SinceYear <chr> "0", "2011", "0", "0", "0", "0", "20...
## $ PromoInterval <chr> "0", "Jan,Apr,Jul,Oct", "0", "0", "0...
## $ Month <chr> "9", "9", "9", "9", "9", "9", "9", "...
## $ comp_open_date <date> 2008-09-01, 2006-12-01, 2013-04-01,...
## $ comp_open_since_days <fct> (858,4.68e+03], (858,4.68e+03], (858...
## $ promo2_since_date <date> NA, 2011-04-02, NA, NA, NA, NA, 201...
## $ promo2_since_days <fct> (-2.5,0.5], (908, Inf], (-2.5,0.5], ...
The training and validation sets were treated identically in the preprocessing, since we applied the preprocessing to the original kaggle “training” set and then held out the most recent 6 weeks of that data to form our validation set. The test set is kaggle’s original “test” set; we made sure to add to it all the new variables we created for the training set, but didn’t bother to delete the extra ones. Since we only use the test set for prediction, this is okay.
We notice that all of our features are categorical. This is because during the preprocessing we decided to convert the remaining numeric variables (there were only 3 anyway) to categorical by binning/discretizing them, in order to blunt the effect of long-tailed distributions and outliers. This is of course not the only way to do it, and it is one part of the preprocessing that you could go back and change if you are curious.
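As a reminder of what that kind of binning looks like, here is a simplified, not-run sketch using cut(). The data frame name is hypothetical (the raw numeric column is no longer present in the loaded data) and the break points are only illustrative:
# Not run: discretize a numeric feature into intervals with cut()
# rossman_raw %>% # hypothetical data frame holding the raw, pre-binning data
#   mutate(CompetitionDistance = cut(CompetitionDistance,breaks = c(-Inf, 35, 255, Inf)))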
We can now build our first model.
Simple linear regression of the form \[ y = X\beta + \epsilon \] is not going to work well here. Why? Nonlinearity isn’t a problem anymore, since all of our features are categorical (that was another reason to discretize). What about the fact that observations are grouped? Sales from different days at the same store are likely to be more correlated than sales from different days at different stores. While this is a problem for inference (standard errors of the regression coefficients will be smaller than they should be), it is not necessarily a problem for prediction. The reason we want to account for it is not that ignoring it would be “wrong”; it is simply that accounting for it will probably improve our predictions.
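To get a quick sense of how much stores differ from one another, we can look at the spread of store-level average daily sales (output not shown):
# How much do average daily sales vary from store to store?
training_set %>%
  group_by(Store) %>%
  summarize(mean_sales = mean(Sales)) %>%
  summarize(min = min(mean_sales),
            median = median(mean_sales),
            max = max(mean_sales))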
First, let’s ignore the grouping and fit a simple linear regression. This takes about 3 seconds on my 2015 MacBook Pro with 16 gigs of RAM:
simple_lr_1 <- lm(Sales ~ . -Store,data = training_set)
summary(simple_lr_1)
##
## Call:
## lm(formula = Sales ~ . - Store, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10954.3 -1711.2 -367.1 1238.8 29074.6
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 5442.8416 143.7909 37.852
## DayOfWeek2 -1080.0328 10.3413 -104.438
## DayOfWeek3 -1395.7035 10.3847 -134.400
## DayOfWeek4 -1355.1777 10.5365 -128.617
## DayOfWeek5 -991.2246 10.4531 -94.825
## DayOfWeek6 -939.2667 11.1522 -84.223
## DayOfWeek7 -902.0822 48.6038 -18.560
## Promo1 2298.7121 6.6241 347.023
## StateHolidaya 100.8102 101.7478 0.991
## StateHolidayb -216.0637 222.0098 -0.973
## StateHolidayc -1019.0403 317.0699 -3.214
## SchoolHoliday1 309.6167 8.6476 35.804
## StoreTypeb 5031.2103 33.1032 151.985
## StoreTypec -4.4576 9.1855 -0.485
## StoreTyped -72.1547 7.1061 -10.154
## Assortmentb -2837.2171 44.0682 -64.382
## Assortmentc 868.2882 6.2789 138.288
## CompetitionDistance(35,255] 2616.3808 36.4026 71.873
## CompetitionDistance(255, Inf] 1244.3447 35.3703 35.180
## Promo21 -1248.1185 138.6350 -9.003
## PromoIntervalFeb,May,Aug,Nov 343.2907 13.1453 26.115
## PromoIntervalJan,Apr,Jul,Oct 564.0096 11.2987 49.918
## PromoIntervalMar,Jun,Sept,Dec NA NA NA
## Month10 0.2754 14.7827 0.019
## Month11 499.5205 14.8834 33.562
## Month12 1979.9849 14.9919 132.070
## Month2 88.9484 13.0651 6.808
## Month3 307.6090 12.8444 23.949
## Month4 431.3562 13.0539 33.044
## Month5 533.2505 13.0864 40.748
## Month6 416.6966 13.7303 30.349
## Month7 128.2613 14.7843 8.676
## Month8 -70.3441 15.2272 -4.620
## Month9 -38.7299 14.7751 -2.621
## comp_open_since_days(858,4.68e+03] -181.8072 6.2382 -29.144
## comp_open_since_days(4.68e+03, Inf] 409.3043 16.9256 24.182
## promo2_since_days(-2.5,0.5] -331.3742 138.8105 -2.387
## promo2_since_days(0.5,908] -569.1750 12.9700 -43.884
## promo2_since_days(908, Inf] -22.0111 13.0488 -1.687
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## DayOfWeek2 < 2e-16 ***
## DayOfWeek3 < 2e-16 ***
## DayOfWeek4 < 2e-16 ***
## DayOfWeek5 < 2e-16 ***
## DayOfWeek6 < 2e-16 ***
## DayOfWeek7 < 2e-16 ***
## Promo1 < 2e-16 ***
## StateHolidaya 0.32179
## StateHolidayb 0.33045
## StateHolidayc 0.00131 **
## SchoolHoliday1 < 2e-16 ***
## StoreTypeb < 2e-16 ***
## StoreTypec 0.62747
## StoreTyped < 2e-16 ***
## Assortmentb < 2e-16 ***
## Assortmentc < 2e-16 ***
## CompetitionDistance(35,255] < 2e-16 ***
## CompetitionDistance(255, Inf] < 2e-16 ***
## Promo21 < 2e-16 ***
## PromoIntervalFeb,May,Aug,Nov < 2e-16 ***
## PromoIntervalJan,Apr,Jul,Oct < 2e-16 ***
## PromoIntervalMar,Jun,Sept,Dec NA
## Month10 0.98514
## Month11 < 2e-16 ***
## Month12 < 2e-16 ***
## Month2 9.90e-12 ***
## Month3 < 2e-16 ***
## Month4 < 2e-16 ***
## Month5 < 2e-16 ***
## Month6 < 2e-16 ***
## Month7 < 2e-16 ***
## Month8 3.84e-06 ***
## Month9 0.00876 **
## comp_open_since_days(858,4.68e+03] < 2e-16 ***
## comp_open_since_days(4.68e+03, Inf] < 2e-16 ***
## promo2_since_days(-2.5,0.5] 0.01698 *
## promo2_since_days(0.5,908] < 2e-16 ***
## promo2_since_days(908, Inf] 0.09164 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2663 on 798502 degrees of freedom
## Multiple R-squared: 0.2644, Adjusted R-squared: 0.2644
## F-statistic: 7757 on 37 and 798502 DF, p-value: < 2.2e-16
Even though the fit is no good, we can do some basic sanity checks; for example, sales during the Christmas season (November and December, months 11 and 12) are higher than other months. The day of the week seems to matter, as does the competition distance. These aren’t formal statistical inferences; we’re just getting a feel for the variables in the model.
To evaluate the predictions, let’s first score the model on the validation set; then we’ll compute the target metric (root mean square percentage error, RMSPE) and plot the predicted vs. actual Sales.
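For reference, the metric is \[ \text{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_{i} - \hat{y}_{i}}{y_{i}}\right)^{2}}, \] where \(y_{i}\) and \(\hat{y}_{i}\) are the observed and predicted sales for the \(i^{th}\) validation observation; note that the division by \(y_{i}\) means observations with zero sales cannot be scored.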
simple_lr_validation_preds <- data_frame(
observed = validation_set %>% pull(Sales),
predicted = predict(simple_lr_1,newdata = validation_set)
)
## Warning in predict.lm(simple_lr_1, newdata = validation_set): prediction
## from a rank-deficient fit may be misleading
simple_lr_validation_preds
## # A tibble: 45,852 x 2
## observed predicted
## <dbl> <dbl>
## 1 5263 7915.
## 2 6064 7545.
## 3 8314 7545.
## 4 13995 8783.
## 5 4822 8101.
## 6 5651 8101.
## 7 15344 8969.
## 8 8492 8101.
## 9 8565 9379.
## 10 7185 7919.
## # ... with 45,842 more rows
# Score them; RMSPE
get_rmspe <- function(preds) {
preds %>%
summarize(rmspe = sqrt( mean( ((observed - predicted)/observed)^2 ))) %>%
pull(rmspe)
}
get_rmspe(simple_lr_validation_preds) # Very bad!
## [1] 0.4367195
# Plot predicted vs actual
simple_lr_validation_preds %>%
ggplot(aes(x = observed,y = predicted)) +
theme_classic() +
geom_point(pch = 21,colour = "black",fill = "lightblue") +
geom_abline(slope = 1,intercept = 0,colour = "red",linetype = "dashed") +
labs(title = "Validation Predictions, Simple Linear Regression",
subtitle = "Rossman Store Sales Data",
x = "Observed",
y = "Predicted") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format())
We see that the predictions have high variance, and that we are underpredicting by a large amount (look at the axis labels).
We could account for different stores’ differing sales by including Store as a categorical variable, resulting in a different intercept for each store. This would mean estimating over a thousand regression coefficients; while not impossible on modern hardware, it is also not desirable. Such a model would be hard to interpret (it would be hard even to print to the screen), and the estimation of so many distinct intercepts might be unstable.
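For reference, such a model would be a one-line change; a sketch, not run here for the reasons above:
# Not run: a separate fixed intercept for each of the 1,115 stores
# simple_lr_store <- lm(Sales ~ .,data = training_set)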
A more stable and computationally efficient approach is to use a linear mixed effects model, via the user-friendly and fast implementation in the lme4 package. We’ll fit the model \[
\begin{aligned}
y_{ij} &= x_{ij}^{\prime}\beta + b_{i} + \epsilon_{ij} \\
b_{i} &\overset{IID}{\sim} N(0,\sigma^{2}_{b}) \\
\epsilon_{ij} &\overset{IID}{\sim} N(0,\sigma^{2})
\end{aligned}
\] The \(x_{ij}\) is a vector containing the features for the \(i^{th}\) store on the \(j^{th}\) day. The \(b_{i}\) is a random variable representing the \(i^{th}\) store’s “baseline” sales: the average daily sales of a randomly chosen store. The \(\epsilon_{ij}\) is the familiar error term from simple linear regression. This model still specifies a different intercept for every store; however, the estimation will shrink these intercepts toward their mean, improving the stability of the estimation. We choose \(0\) as the mean of the random effects \(b_{i}\) because \(x_{ij}\) still contains a 1, that is, the model still has a fixed intercept \(\beta_{0}\) representing the global average daily sales across all the stores.
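To see where the shrinkage comes from: in this random-intercept model, with the variance parameters treated as known, the predicted random effect for a store \(i\) with \(n_{i}\) observations is (approximately) \[ \hat{b}_{i} = \frac{\sigma^{2}_{b}}{\sigma^{2}_{b} + \sigma^{2}/n_{i}}\left(\bar{y}_{i} - \bar{x}_{i}^{\prime}\hat{\beta}\right), \] i.e. the store’s average residual, pulled toward \(0\) by a factor that depends on how much data the store has and on how variable stores are relative to the day-to-day noise.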
We can fit this model using the lme4::lmer() function. This took about 30 seconds on my laptop.
# Annoyingly, we have to type the formula out manually, as predict.merMod cannot handle
# formulae with a dot
lm1ff <- as.formula(str_c(
"Sales ~ ",
str_c(colnames(training_set)[-c(1,3)],collapse = " + "),
" + (1|Store)"
))
lme1 <- lmer(lm1ff,data = training_set)
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
summary(lme1)
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## Sales ~ DayOfWeek + Promo + StateHoliday + SchoolHoliday + StoreType +
## Assortment + CompetitionDistance + Promo2 + PromoInterval +
## Month + comp_open_since_days + promo2_since_days + (1 | Store)
## Data: training_set
##
## REML criterion at convergence: 13889219
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -11.2136 -0.5798 -0.0694 0.4908 18.2148
##
## Random effects:
## Groups Name Variance Std.Dev.
## Store (Intercept) 5059645 2249
## Residual 2074879 1440
## Number of obs: 798540, groups: Store, 1115
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 4669.8119 802.2221 5.8
## DayOfWeek2 -1080.0931 5.5939 -193.1
## DayOfWeek3 -1395.8227 5.6174 -248.5
## DayOfWeek4 -1355.3635 5.6995 -237.8
## DayOfWeek5 -990.9379 5.6544 -175.3
## DayOfWeek6 -942.8404 6.0327 -156.3
## DayOfWeek7 -531.6864 26.8263 -19.8
## Promo1 2296.2582 3.5833 640.8
## StateHolidaya 282.0506 55.1367 5.1
## StateHolidayb 137.4240 120.1801 1.1
## StateHolidayc -825.9343 171.5343 -4.8
## SchoolHoliday1 296.4689 4.6818 63.3
## StoreTypeb 4804.6812 803.4137 6.0
## StoreTypec -65.3959 207.7072 -0.3
## StoreTyped -64.7802 159.4595 -0.4
## Assortmentb -2763.5346 1094.9321 -2.5
## Assortmentc 846.2646 140.5671 6.0
## CompetitionDistance(35,255] 2604.5647 824.9526 3.2
## CompetitionDistance(255, Inf] 1244.5223 800.8928 1.6
## Promo21 -1163.0863 251.9336 -4.6
## PromoIntervalFeb,May,Aug,Nov 301.4425 295.7666 1.0
## PromoIntervalJan,Apr,Jul,Oct 592.7468 252.3429 2.3
## Month10 0.7914 8.0059 0.1
## Month11 495.7651 8.0611 61.5
## Month12 1981.0875 8.1233 243.9
## Month2 82.2465 7.0679 11.6
## Month3 291.9649 6.9510 42.0
## Month4 412.5546 7.0658 58.4
## Month5 511.0787 7.0851 72.1
## Month6 403.6988 7.4281 54.3
## Month7 141.2168 8.0069 17.6
## Month8 -68.6842 8.2459 -8.3
## Month9 -47.1240 7.9988 -5.9
## comp_open_since_days(858,4.68e+03] 107.4904 10.8124 9.9
## comp_open_since_days(4.68e+03, Inf] 275.9587 23.1721 11.9
## promo2_since_days(-2.5,0.5] 353.1511 75.2237 4.7
## promo2_since_days(0.5,908] 271.5260 9.6245 28.2
## promo2_since_days(908, Inf] 568.7239 12.8222 44.4
##
## Correlation matrix not shown by default, as p = 38 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
Recall that in the simple linear regression, the PromoIntervalMar,Jun,Sept,Dec level resulted in an NA coefficient; lmer instead drops this level entirely. This is okay, but it’s something to make note of.
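If you want to confirm exactly which term was dropped, one quick check (output not shown) is to compare the coefficient names from the two fits:
# Which coefficient did lmer drop relative to the simple linear regression?
setdiff(names(coef(simple_lr_1)),names(fixef(lme1)))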
Comparing these coefficient estimates to the ones from the simple linear regression, we see that they are similar, but slightly smaller on average and sometimes in the opposite direction:
data_frame(
coefficient = names(coef(simple_lr_1))[-23],
Simple = coef(simple_lr_1)[-23],
Mixed = coef(summary(lme1))[ ,1]
) %>%
gather(type,value,Simple:Mixed) %>%
ggplot(aes(x = coefficient,y = value,group = type,fill = type)) +
theme_classic() +
geom_bar(stat = "identity",position = "dodge",colour = "black") +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Comparison of Coefficients, Simple and Mixed Effects Linear Regression",
subtitle = "Rossman Store Sales Data",
x = "",
y = "Coefficient Estimate",
fill = "")
We can also take a look at the estimated intercepts. “Estimated” is no longer quite the right term, though, since these are not parameter estimates but random variables. What we actually look at are the predicted random effects, given by the mean of the distribution of \(b_{i} | y\), the random effects conditional on the observed data. These are precomputed by the lmer function:
ranef(lme1)$Store %>%
as_data_frame() %>%
rename(intercept = `(Intercept)`) %>%
ggplot(aes(x = intercept)) +
theme_classic() +
geom_histogram(bins = 100,colour = "black",fill = "purple") +
labs(title = "Histogram of Predicted Random Intercepts, lme1 Model",
subtitle = "Rossman Store Sales Data",
x = "Predicted Intercept",
y = "Count") +
scale_x_continuous(labels = scales::comma_format())
This is nice; the distribution of predicted random effects has a long right tail. This is actually bad for inference, since it breaks a key model assumption, but it is potentially good for prediction, since those stores with high sales were likely predicted very poorly by our simple linear regression model, and hence inflated our prediction error.
Let’s take a look at the predictions from this model:
lme1_validation_preds <- data_frame(
observed = validation_set %>% pull(Sales),
predicted = predict(lme1,newdata = validation_set)
)
lme1_validation_preds
## # A tibble: 45,852 x 2
## observed predicted
## <dbl> <dbl>
## 1 5263 6070.
## 2 6064 6236.
## 3 8314 8322.
## 4 13995 10906.
## 5 4822 5967.
## 6 5651 6848.
## 7 15344 10050.
## 8 8492 6779.
## 9 8565 7835.
## 10 7185 6828.
## # ... with 45,842 more rows
# Score them; RMSPE
get_rmspe(lme1_validation_preds) # Better
## [1] 0.2209619
# Plot predicted vs actual
lme1_validation_preds %>%
ggplot(aes(x = observed,y = predicted)) +
theme_classic() +
geom_point(pch = 21,colour = "black",fill = "lightblue") +
geom_abline(slope = 1,intercept = 0,colour = "red",linetype = "dashed") +
labs(title = "Validation Predictions, Mixed Effects Model",
subtitle = "Rossman Store Sales Data",
x = "Observed",
y = "Predicted") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format())
Hey, those predictions look a lot better than before!
How could we extend this model? We could try adding more random effects terms, or interactions between fixed effects. Let’s take a look at some randomly selected stores’ sales visually to try and see how they change over time. For good measure we’ll also include the predictions from the most recent model.
# Use the training predictions and include the Month and DayOfWeek
# Since there will be roughly 4 observations per store / day of week / month, average them
# for plotting purposes. Also add a combined indicator for month/day for the x-axis
# Note we use the training predictions because the validation set only has 6 weeks of data
lme1_trainpreds_withdate <- training_set %>%
select(Store,Sales,Month,DayOfWeek) %>%
bind_cols(data_frame(predicted = predict(lme1,newdata = training_set))) %>%
group_by(Store,Month,DayOfWeek) %>%
summarize(observed = mean(Sales),predicted = mean(predicted)) %>%
mutate(dateind = lubridate::ymd(str_c("2015",Month,DayOfWeek,sep="-")))
# Randomly sample 20 stores and plot them
set.seed(897623)
randstore <- sample(unique(validation_set$Store),20)
lme1_trainpreds_withdate %>%
gather(type,value,observed:predicted) %>%
filter(Store %in% randstore) %>%
ggplot(aes(x = dateind,group = Store,y = value,colour = type)) +
theme_classic() +
facet_wrap(~Store) +
geom_line() +
labs(title = "Predicted and Observed Sales by Date, Randomly Selected Stores",
subtitle = "Rossman Store Sales Data",
x = "Date",
y = "Sales") +
theme(axis.text.x = element_text(angle = 90)) +
scale_y_continuous(labels = scales::dollar_format())
You can re-run the above code using different random seeds to get a fuller idea of how stores’ sales change with the day and month. It seems like introducing a random slope for day of the week might be a good idea; the same for month might also be helpful. Introducing further random effects can cause matrices in the underlying calculations to become very large, so it should be done carefully.
When I ran the below on my laptop, it hadn’t finished after an hour, so I killed it. You may wish to try this on your own, perhaps letting it run overnight.
lm2ff <- as.formula(str_c(
"Sales ~ ",
str_c(colnames(training_set)[-c(1,3)],collapse = " + "),
" + (DayOfWeek|Store)"
))
lme2 <- lmer(lm2ff,data = training_set)
summary(lme2)
For now, we can score the test set and submit the results to kaggle:
test_scored_lme1 <- test_discretized %>%
bind_cols(data_frame(Sales = predict(lme1,newdata = test_discretized))) %>%
mutate(Sales = if_else(Open == 0,0,Sales)) %>%
mutate(Sales = if_else(is.na(Sales),0,Sales)) %>% # There are 11 NA predictions; set them to 0
dplyr::select(Id,Sales)
test_scored_lme1
## # A tibble: 41,088 x 2
## Id Sales
## <chr> <dbl>
## 1 1 5221.
## 2 2 7473.
## 3 3 9309.
## 4 4 5930.
## 5 5 6986.
## 6 6 5979.
## 7 7 8740.
## 8 8 8022.
## 9 9 5497.
## 10 10 6109.
## # ... with 41,078 more rows
readr::write_csv(test_scored_lme1,"/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-lme1-predictions.csv")
With the results written to disk, you can then go ahead and submit them through kaggle either through your web browser, or using their API as described in the previous tutorial:
kaggle competitions submit -c rossmann-store-sales -f rossman-lme1-predictions.csv -m "LME1"
This yielded me a public score of 0.1924, a significant improvement over our linear regression from before, but still not world-class (the winning RMSPE was about 0.10).
With this workflow in place, we are free to try other things, for example a transformation on the response:
lm3ff <- as.formula(str_c(
"log(Sales+1) ~ ",
str_c(colnames(training_set)[-c(1,3)],collapse = " + "),
" + (1|Store)"
))
lme3 <- lmer(lm3ff,data = training_set)
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
summary(lme3)
## Linear mixed model fit by REML ['lmerMod']
## Formula:
## log(Sales + 1) ~ DayOfWeek + Promo + StateHoliday + SchoolHoliday +
## StoreType + Assortment + CompetitionDistance + Promo2 + PromoInterval +
## Month + comp_open_since_days + promo2_since_days + (1 | Store)
## Data: training_set
##
## REML criterion at convergence: -185464
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -44.151 -0.518 -0.002 0.547 9.649
##
## Random effects:
## Groups Name Variance Std.Dev.
## Store (Intercept) 0.09272 0.3045
## Residual 0.04593 0.2143
## Number of obs: 798540, groups: Store, 1115
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 8.4274064 0.1087065 77.524
## DayOfWeek2 -0.1301967 0.0008323 -156.438
## DayOfWeek3 -0.1692195 0.0008358 -202.476
## DayOfWeek4 -0.1611301 0.0008480 -190.018
## DayOfWeek5 -0.1019732 0.0008413 -121.215
## DayOfWeek6 -0.1516023 0.0008975 -168.908
## DayOfWeek7 -0.2195180 0.0039912 -55.001
## Promo1 0.3318652 0.0005331 622.499
## StateHolidaya -0.0847782 0.0082032 -10.335
## StateHolidayb -0.2065328 0.0178803 -11.551
## StateHolidayc -0.1436036 0.0255207 -5.627
## SchoolHoliday1 0.0378592 0.0006966 54.352
## StoreTypeb 0.5277754 0.1087669 4.852
## StoreTypec 0.0098442 0.0281199 0.350
## StoreTyped 0.0239717 0.0215880 1.110
## Assortmentb -0.2378302 0.1482326 -1.604
## Assortmentc 0.1173959 0.0190304 6.169
## CompetitionDistance(35,255] 0.3529484 0.1116837 3.160
## CompetitionDistance(255, Inf] 0.1717344 0.1084265 1.584
## Promo21 -0.2060436 0.0344191 -5.986
## PromoIntervalFeb,May,Aug,Nov 0.0474225 0.0400416 1.184
## PromoIntervalJan,Apr,Jul,Oct 0.0925148 0.0341628 2.708
## Month10 0.0080632 0.0011911 6.769
## Month11 0.0807861 0.0011993 67.360
## Month12 0.2465131 0.0012086 203.970
## Month2 0.0127838 0.0010516 12.157
## Month3 0.0409140 0.0010342 39.562
## Month4 0.0581873 0.0010512 55.351
## Month5 0.0814508 0.0010541 77.269
## Month6 0.0604532 0.0011051 54.702
## Month7 0.0231240 0.0011913 19.411
## Month8 -0.0042227 0.0012268 -3.442
## Month9 -0.0056191 0.0011900 -4.722
## comp_open_since_days(858,4.68e+03] 0.0142338 0.0016077 8.853
## comp_open_since_days(4.68e+03, Inf] 0.0418573 0.0034461 12.146
## promo2_since_days(-2.5,0.5] 0.0290923 0.0111917 2.599
## promo2_since_days(0.5,908] 0.0614155 0.0014318 42.893
## promo2_since_days(908, Inf] 0.1085917 0.0019073 56.933
##
## Correlation matrix not shown by default, as p = 38 > 12.
## Use print(x, correlation=TRUE) or
## vcov(x) if you need it
## fit warnings:
## fixed-effect model matrix is rank deficient so dropping 1 column / coefficient
# Don't forget to exponentiate the resulting predictions!
# (Strictly, the inverse of log(Sales + 1) is exp(pred) - 1, but the extra 1 is negligible at this scale.)
lme3_validation_preds <- data_frame(
observed = validation_set %>% pull(Sales),
predicted = exp(predict(lme3,newdata = validation_set))
)
lme3_validation_preds
## # A tibble: 45,852 x 2
## observed predicted
## <dbl> <dbl>
## 1 5263 5753.
## 2 6064 5762.
## 3 8314 8237.
## 4 13995 11602.
## 5 4822 5305.
## 6 5651 6562.
## 7 15344 10376.
## 8 8492 6335.
## 9 8565 7830.
## 10 7185 6701.
## # ... with 45,842 more rows
# Score them; RMSPE
get_rmspe(lme3_validation_preds) # Better
## [1] 0.2012049
# Plot predicted vs actual
lme3_validation_preds %>%
ggplot(aes(x = observed,y = predicted)) +
theme_classic() +
geom_point(pch = 21,colour = "black",fill = "lightblue") +
geom_abline(slope = 1,intercept = 0,colour = "red",linetype = "dashed") +
labs(title = "Validation Predictions, Mixed Effects Model, log-transformed Sales",
subtitle = "Rossman Store Sales Data",
x = "Observed",
y = "Predicted") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format())
test_scored_lme3 <- test_discretized %>%
bind_cols(data_frame(Sales = exp(predict(lme3,newdata = test_discretized)))) %>%
mutate(Sales = if_else(Open == 0,0,Sales)) %>%
mutate(Sales = if_else(is.na(Sales),0,Sales)) %>% # There are 11 NA predictions; set them to 0
dplyr::select(Id,Sales)
test_scored_lme3
## # A tibble: 41,088 x 2
## Id Sales
## <chr> <dbl>
## 1 1 5074.
## 2 2 7264.
## 3 3 9281.
## 4 4 5586.
## 5 5 6905.
## 6 6 5909.
## 7 7 8772.
## 8 8 7950.
## 9 9 5210.
## 10 10 5907.
## # ... with 41,078 more rows
readr::write_csv(test_scored_lme3,"/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-lme3-predictions.csv")
Submitting this got me a public score of 0.178, a small-looking but meaningful improvement. And so on.
Next we’ll look at a Bayesian approach to predictive modelling.
Here we’ll investigate a Bayesian approach to fitting the above linear mixed effects model, where we put prior distributions on all unknown quantities, compute their posterior distribution given the data, and then make predictions by averaging over models defined by the posterior parameter configurations, weighted by their posterior probabilities. For an introduction to Bayesian inference in general, see the Bayesian inference tutorial.
Our statement of the linear mixed effects model was as follows: \[ \begin{aligned} y_{ij} &= x_{ij}^{\prime}\beta + b_{i} + \epsilon_{ij} \\ b_{i} &\overset{IID}{\sim} N(0,\sigma^{2}_{b}) \\ \epsilon_{ij} &\overset{IID}{\sim} N(0,\sigma^{2}) \end{aligned} \] This hierarchical specification is what we need to account for the fact that observations (daily sales) from the same store are likely to be related to each other; it is highly desirable to use a store’s previous sales to predict its future sales.
For the Bayesian approach, the model is written out the same way, except now we have a further level of probability distributions to specify: the prior distributions on the remaining fixed, unknown quantities: \[ \begin{aligned} y_{ij} &= x_{ij}^{\prime}\beta + b_{i} + \epsilon_{ij} \\ b_{i} &\overset{IID}{\sim} N(0,\sigma^{2}_{b}) \\ \epsilon_{ij} &\overset{IID}{\sim} N(0,\sigma^{2}) \\ \beta &\sim F_{\beta} \\ \sigma^{2}_{b} &\sim F_{\sigma^{2}_{b}} \\ \sigma^{2} &\sim F_{\sigma^{2}} \\ \end{aligned} \] Here, \(F_{\beta}\), etc. are probability distributions that we get to choose, representing the range of values for these parameters that we think are plausible prior to seeing the data. We can choose these based on actual prior knowledge (e.g. maybe we think the variance is between 1 and 10, for some reason), or based on how the choice affects the posterior distribution; see “Choosing a Prior” in the previously-linked Bayesian inference tutorial for more information. When building a predictive model, where we care more about how the model predicts the future than about which parameter values it says are plausible, the latter consideration (how the choice of prior affects the resulting posterior and predictions) is the more relevant one.
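As a concrete, purely illustrative example of what choosing these priors looks like in practice, the rstanarm package introduced below exposes them as function arguments. The particular distributions and scales here are arbitrary sketches, not recommendations, and the call is not run:
# Not run: illustrative prior choices for the Bayesian LME fit
# stan_glmer(
#   lm3ff,
#   data = training_set,
#   family = gaussian,
#   prior = normal(0, 5),            # F_beta: prior on the regression coefficients
#   prior_intercept = normal(0, 10), # prior on the fixed intercept
#   prior_aux = exponential(1),      # prior on the residual standard deviation
#   prior_covariance = decov()       # prior on the random effect variance
# )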
To fit this Bayesian linear mixed effects model, we have several steps: choose the prior distributions, compute (or approximate) the posterior distribution of the parameters given the data, and form predictions by averaging over that posterior.
That’s a lot of steps. In practical predictive modelling, it is often the case that the methods used will be extremely complicated, and require a lot of background to completely understand. While nothing can substitute your developing this background, it is usually not necessary to completely understand a method before you are able to apply it to some data and get a feel for how it works. In addition, even if you have extensive background in a topic (here: Bayesian inference), it is usually not the case that you will need to code it up from scratch. Before starting any modelling exercise, look at what software is available.
In this case we will illustrate the use of the Bayesian computation and model-fitting software Stan, which we will access through the rstanarm R package. Stan is a general-purpose software for performing Bayesian computations, and the rstanarm package provides an R formula interface for fitting Bayesian regression models using Stan. Here we will illustrate the use of this package and discuss practical issues; to learn more about how Stan works, see the documentation for Stan.
The stan_glmer function implements the above model. Using a modern laptop and the full dataset, the computations performed by Stan are quite intensive. By default, Stan uses an efficient MCMC sampling algorithm to sample from the posterior and predictive distributions; for the Rossman data, this takes a very long time on my laptop. There are two approaches we can use here: reduce the dataset by subsampling, or change the method of computing the posterior. For starters, let’s try the latter: rather than obtaining exact samples via MCMC (the default in rstanarm), we set algorithm = "meanfield", which implements mean-field variational inference, an efficient (but potentially inaccurate) technique for approximating the posterior.
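For completeness, here is a not-run sketch of the subsampling route, which we won’t pursue here (note that you would then only obtain store-level intercepts for the sampled stores):
# Not run: fit on a random subset of stores so that exact MCMC is feasible
# set.seed(4378)
# store_sample <- sample(unique(training_set$Store),100)
# stan_small <- stan_glmer(
#   lm3ff,
#   data = training_set %>% filter(Store %in% store_sample),
#   family = gaussian,
#   algorithm = "sampling" # exact MCMC, the rstanarm default
# )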
In setting the arguments for the stan_glmer function below, I used information from the rstanarm vignettes, which you can consult for more detail.
Running this on the full dataset with the default prior distributions for all terms still took 2-3 hours. Let’s see how this is done and how we can evaluate the results.
## I ran the below outside of this notebook.
## This took several hours
stan1 <- stan_glmer(
formula = lm3ff,
data = training_set,
family = gaussian,
algorithm = "meanfield",
sparse = TRUE
)
# summary(stan1) # Prints output for each store, very long
save(stan1,file = "/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-rstan1.RData")
# Load the object into its own environment, to reduce clutter in the global environment
load_stan <- new.env()
load("/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-rstan1.RData",envir = load_stan)
ls(load_stan)
## [1] "stan1"
pryr::object_size(load_stan$stan1)
## 610 MB
We see that the fitted model object is quite large. As before, we should compare the predicted values to the observed values on the training set and on the validation set. Since the output of stan_glmer is not a static prediction equation but rather a predictive distribution, a further sensible check is to draw new datasets from this predictive distribution and verify that they have similar statistical properties to the original training set. This can be accomplished using the pp_check function:
# Draw 5 new datasets from the predictive distribution and compare them to the training set
# Remember, the scale on the x-axis is log(sales + 1)
pp_check(load_stan$stan1,plotfun = "hist",nreps = 5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The datasets generated from the predictive distribution look similar to the training set, which is one indication that the model fits the training data well. You could also go into more detail and investigate summary statistics like predictive means and standard deviations.
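For example, the same pp_check function can compare a summary statistic of the observed response to its distribution across the replicated datasets (output not shown; remember the response here is log(Sales + 1)):
# Compare the observed mean and standard deviation to their predictive distributions
pp_check(load_stan$stan1,plotfun = "stat",stat = "mean")
pp_check(load_stan$stan1,plotfun = "stat",stat = "sd")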
To obtain an actual set of validation predictions, we can use the posterior_predict function to draw from the predictive distribution values corresponding to the \(X\) matrix for the stores and days in the validation set. This function’s output is in a form that doesn’t work automatically with tidyverse functions, so it needs some manipulation: it is a matrix whose columns correspond to rows in the validation set and whose rows correspond to samples from the predictive distribution. In other words, the function doesn’t output just a single prediction for each store and day; it outputs a sample from the predictive distribution for that store and day, which lets us compute both a point estimate (the actual prediction we’ll use) and a measure of uncertainty in the prediction itself.
We need to transpose the output matrix, match each row (formerly column) in this to the corresponding row in the validation set, compute a single prediction for each store, and output a dataframe containing the observed and predicted values like we had before:
stan1_validpreds <- posterior_predict(load_stan$stan1,newdata = validation_set) %>%
t() %>% # Transpose; rows now correspond to rows in the validation set
apply(1,mean) %>% # Compute the mean of each row using R's built-in apply function
exp() %>% # Convert back to the scale of the original observations (remember we modelled log(sales))
data_frame(observed = validation_set$Sales,predicted = .)
# Evaluate predictions
get_rmspe(stan1_validpreds)
## [1] 1.61035
stan1_validpreds %>%
ggplot(aes(x = observed,y = predicted)) +
theme_classic() +
geom_point(pch = 21,colour = "black",fill = "lightblue") +
geom_abline(slope = 1,intercept = 0,colour = "red",linetype = "dashed") +
labs(title = "Validation Predictions, Bayesian LME Model",
subtitle = "Rossman Store Sales Data",
x = "Observed",
y = "Predicted") +
scale_x_continuous(labels = scales::dollar_format()) +
scale_y_continuous(labels = scales::dollar_format())
Whoa! That’s not good. The predictions look good “on average”, but many individual predictions are way off, with one chunk drastically overestimated and another chunk somewhat underestimated. What is happening?
Looking at the distribution of prediction errors, we see the bimodal distribution suggested by the above plot:
stan1_validpreds %>%
mutate(error = predicted - observed) %>%
ggplot(aes(x = error)) +
theme_classic() +
geom_histogram(bins = 100,colour = "black",fill = "lightblue") +
labs(title = "Prediction Errors - Bayesian LME Model",
subtitle = "Validation Data",
x = "Predicted - Observed",
y = "Count") +
scale_x_continuous(labels = scales::dollar_format(),breaks = seq(-30000,1500000,by=10000)) +
scale_y_continuous(labels = scales::comma_format()) +
theme(axis.text.x = element_text(angle = 90))
We could go into more detail, but frankly the predictions are so bad overall that it’s probably better to try something else. One very compelling reason to take a Bayesian approach to predictive modelling is the degree of flexibility it allows the modeller (you!). There are many things we could try here. For example, we chose algorithm = "meanfield" because it is the fastest option available and our dataset is large. However, the mean-field variational approximation is also potentially the least accurate; in particular, it assumes independence between posterior parameters, which is extremely restrictive. Changing this to a “full rank” variational approximation, or better yet to "sampling", which implements MCMC sampling from the posterior, would most likely result in better predictive performance, but would come at a much greater computational cost.
We won’t go further here, but you now have the tools you need to fit and evaluate predictive models on a somewhat complicated dataset, with grouped observations exhibiting temporal correlation. Happy modelling!