This notebook serves as an introduction to predictive modelling in R, using the Rossman Store Sales Kaggle competition dataset, available from Kaggle or stored on GitHub. Predictive modelling is tedious and difficult, and this notebook is long. It should take you a few hours to go through in detail, including running all the code yourself. The reader is encouraged to modify the code where appropriate to see the effects of changing steps, and to try to get a better predictive model.
The result of this notebook is a cleaned, model-ready dataset which will be used in subsequent tutorials on fitting predictive models to data. The purpose of this notebook is not to talk about specific models or methods; rather it is to illustrate the long, tedious process of data engineering and preprocessing that is required before predictive modelling methodologies can be applied to real-world data.
# Load the tidyverse packages
suppressMessages({
  suppressWarnings({
    library(tidyverse) # Includes all the necessary stuff: dplyr, ggplot2, tidyr, and so on
    library(lubridate) # For dealing with dates
  })
})
Before we even start reading the data into R, we need to understand the problem as documented. Before beginning any predictive modelling exercise, you should be writing a detailed description of the problem and all available data sources. This will help keep the project organized, and facilitate collaboration within your team.
This is a Kaggle competition; the documentation is available here. Go and read it in detail, and make notes.
There are a variety of factors affecting store sales. Rossman is a company operating drug stores in Europe, whose store managers are required to predict their stores’ daily sales up to 6 weeks in advance. The goal of this competition is to use provided data on the stores to develop a predictive model for store sales on a daily basis.
The provided data contain information at two levels:
The goal of our analysis is going to be to predict day-level sales, so we know we are going to need to use both the store and the store x day information. These will need to be merged after reading in.
Before reading in the data, do your best to understand what to expect: how many rows and columns should be in the file? What are the data types and reasonable value ranges for each column? This can be figured out from reading the documentation if available.
Regardless of whether documentation is available, if you are reading in text files, you should first view them in a terminal. This allows you to verify whether the file has a header, what type of separator it uses (e.g. if it is a .csv file, is it really comma-separated?), whether the file has any encoding challenges, and more. To view the file using BASH (the default terminal on UNIX systems, i.e. Linux and Mac; if you're on Windows, you can download Git Bash), navigate to the folder containing the file and type
head train.csv
This will print out the first ten lines of the file. We verify here that our file does have a header, and appears to have 9 columns.
To count the rows, use the wc -l BASH command:
wc -l train.csv
This should output 1017210 train.csv, indicating that there are 1,017,210 lines in the train.csv file: 1,017,209 data rows plus the header. You can run wc -l * to list the line counts for all files in the current folder. We see that test.csv has 41,089 lines, and store.csv has 1,116. So there are 1,115 stores (the first line of each file is a header containing column names); we can verify whether the training and test datasets together contain this number of stores once we read in the data.
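As a quick sanity check of the store count without leaving the terminal, the same BASH tools can count distinct store ids directly. This is a sketch that assumes the Store id is the first comma-separated field of train.csv (which the header order suggests):

```shell
# Count distinct store ids: skip the header line, take the first
# comma-separated field, de-duplicate, and count the result.
tail -n +2 train.csv | cut -d, -f1 | sort -u | wc -l
```

On these data this should agree with the 1,115 stores counted from store.csv.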
To count the columns in a text file when you know the separator (here, a comma), just count the number of separators in the first line (and add one!):
head -n1 train.csv | tr -cd , | wc -c
This command has 3 parts:
- head -n1 train.csv returns the first row (-n1) of the dataset train.csv.
- | (the "pipe") takes the return value of head -n1 train.csv and passes it to the next command. This works exactly like the %>% operator in R (in fact, this is what %>% was modelled after, and is why it is referred to as the "pipe").
- tr -cd , removes all characters from the line that are not commas, and wc -c counts the characters in the result. Since the result is only the commas in the line, this counts the number of commas in the line. wc stands for "word count"; the -c flag counts characters, and the -l flag you saw above counts lines (which is the same as tr -cd '\n' | wc -c).
Text files can be read in with the read_delim function in the readr package. There is a short form, read_csv, specifically for reading comma-separated data; however, it is just a wrapper around the more general (and therefore more useful) read_delim.
# Set the path where the data is stored
datapath <- "https://github.com/awstringer1/leaf2018/raw/gh-pages/datasets/rossman/"
# Read in the three files
train_read <- readr::read_delim(file = str_c(datapath,"train.csv"),
delim = ",",
col_names = TRUE)
## Parsed with column specification:
## cols(
## Store = col_integer(),
## DayOfWeek = col_integer(),
## Date = col_date(format = ""),
## Sales = col_integer(),
## Customers = col_integer(),
## Open = col_integer(),
## Promo = col_integer(),
## StateHoliday = col_integer(),
## SchoolHoliday = col_integer()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 31050 parsing failures.
## # A tibble: 5 x 5
##     row col          expected   actual
##   <int> <chr>        <chr>      <chr>
## 1 63556 StateHoliday an integer a
## 2 63558 StateHoliday an integer a
## 3 63560 StateHoliday an integer a
## 4 63561 StateHoliday an integer a
## 5 63564 StateHoliday an integer a
## # ... with 1 more variables: file <chr>
## See problems(...) for more details.
This generated warnings: 31,050 parsing failures. What happened?
Reading the documentation for read_delim, we see that the function tries to intelligently guess the datatypes of the columns using the first few rows of data. But if there is data farther down that does not match the guessed type, it doesn't re-guess; instead, it records a parsing failure and leaves the offending value missing.
There are two ways to fix this, depending on the complexity of the dataset:
1. Set the guess_max parameter of read_delim to a number greater than the default of 1000. This will use more rows for guessing the column types, reducing the chances of a mistake. It can slow down the reading of the file, and is not recommended for large files.
2. Manually specify the column types.
Except in situations with very easy-to-debug type-guessing errors, the second approach is strongly recommended, as it is robust and guaranteed to be correct, up to human error.
Looking at the data on the command line (thankfully there are only 9 columns!) we see the following variables:
head -n1 train.csv | tr ',' '\n'
From this, we can manually set the data types of the columns to something reasonable. Included are the high-level data types and the readr single-letter codes; see the documentation for read_delim:
Why did we set so many variables that only take numeric values, like DayOfWeek and Promo, to character datatypes? For modelling purposes, these variables aren’t really numeric; DayOfWeek could just as well have been stored as “Monday”, “Tuesday”,… the fact that whoever extracted these data chose to code the variables a certain way has no bearing on their true underlying “type” in the context of building a predictive model.
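If you prefer not to decode the compact single-letter string, the same specification can be written out explicitly with readr::cols(). This is an equivalent sketch of the types we are about to use (the name train_col_types is our own):

```r
library(readr)

# Explicit column specification, equivalent to the compact "ccDddcccc"
train_col_types <- cols(
  Store = col_character(),
  DayOfWeek = col_character(),
  Date = col_date(format = ""),
  Sales = col_double(),
  Customers = col_double(),
  Open = col_character(),
  Promo = col_character(),
  StateHoliday = col_character(),
  SchoolHoliday = col_character()
)
# Usage: read_delim(..., col_types = train_col_types)
```

The explicit form is more verbose but self-documenting, which can matter when a file has dozens of columns.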
Trying again with the manual column types:
train_read <- readr::read_delim(file = str_c(datapath,"train.csv"),
delim = ",",
col_names = TRUE,
col_types = c("ccDddcccc"))
glimpse(train_read)
## Observations: 1,017,209
## Variables: 9
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "5", "5", "5"...
## $ Date <date> 2015-07-31, 2015-07-31, 2015-07-31, 2015-07-31,...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651, 15344, 8492...
## $ Customers <dbl> 555, 625, 821, 1498, 559, 589, 1414, 833, 687, 6...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"...
Running glimpse on the data tells us what we need to know: the data were read in with the correct types, and the numbers of rows and columns are correct.
Repeating these steps on test.csv and store.csv:
test_read <- readr::read_delim(file = str_c(datapath,"test.csv"),
delim = ",",
col_names = TRUE,
col_types = c("cccDcccc"))
glimpse(test_read)
## Observations: 41,088
## Variables: 8
## $ Id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10...
## $ Store <chr> "1", "3", "7", "8", "9", "10", "11", "12", "13",...
## $ DayOfWeek <chr> "4", "4", "4", "4", "4", "4", "4", "4", "4", "4"...
## $ Date <date> 2015-09-17, 2015-09-17, 2015-09-17, 2015-09-17,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"...
## $ SchoolHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0"...
store_read <- readr::read_delim(file = str_c(datapath,"store.csv"),
delim = ",",
col_names = TRUE,
col_types = c("cccdcccccc"))
glimpse(store_read)
## Observations: 1,115
## Variables: 10
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "...
## $ CompetitionDistance <dbl> 1270, 570, 14130, 620, 29910, 310, 2...
## $ CompetitionOpenSinceMonth <chr> "9", "11", "12", "9", "4", "12", "4"...
## $ CompetitionOpenSinceYear <chr> "2008", "2007", "2006", "2009", "201...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "...
## $ Promo2SinceWeek <chr> NA, "13", "14", NA, NA, NA, NA, NA, ...
## $ Promo2SinceYear <chr> NA, "2010", "2011", NA, NA, NA, NA, ...
## $ PromoInterval <chr> NA, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,...
We were able to read in the data successfully.
Now that the data are read in, we need to merge the store-specific characteristics with the store x day data in the training set. This need arises because our training data contains information for each store for many days, and our prediction task is to predict day-level sales by store.
The practical concept here is that of a key: the most granular non-redundant level possible of unique identification of a row in your dataset. A key can be one or more columns, and is used for uniquely identifying rows of a dataframe. Dataframes sharing keys can be merged together.
For example, a key for the store_read table is the Store variable, because its values are unique. We can check this:
store_read %>%
nrow()
## [1] 1115
store_read %>%
pull(Store) %>%
unique() %>%
length()
## [1] 1115
Suppose we added a column to store_read called rowid, just equal to the row number. Then Store x rowid would be unique, but it would be redundant: Store and rowid each uniquely identify rows in that dataframe, so both are not needed.
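This uniqueness check is worth automating. Here is a small helper sketch using dplyr (the name is_key is our own, not a dplyr function):

```r
library(dplyr)

# TRUE if the given columns uniquely identify the rows of df
is_key <- function(df, ...) {
  nrow(distinct(df, ...)) == nrow(df)
}

is_key(store_read, Store)        # TRUE: one row per store
is_key(train_read, Store)        # FALSE: many days per store
is_key(train_read, Store, Date)  # TRUE, as we verify below
```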
Now look at the Store variable in the train_read data:
train_read %>%
nrow()
## [1] 1017209
train_read %>%
pull(Store) %>%
unique() %>%
length()
## [1] 1115
There are many more rows than stores. Why?
train_read %>%
filter(Store == "1") %>%
arrange(Date)
## # A tibble: 942 x 9
## Store DayOfWeek Date Sales Customers Open Promo StateHoliday
## <chr> <chr> <date> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 2 2013-01-01 0 0 0 0 a
## 2 1 3 2013-01-02 5530 668 1 0 0
## 3 1 4 2013-01-03 4327 578 1 0 0
## 4 1 5 2013-01-04 4486 619 1 0 0
## 5 1 6 2013-01-05 4997 635 1 0 0
## 6 1 7 2013-01-06 0 0 0 0 0
## 7 1 1 2013-01-07 7176 785 1 1 0
## 8 1 2 2013-01-08 5580 654 1 1 0
## 9 1 3 2013-01-09 5471 626 1 1 0
## 10 1 4 2013-01-10 4892 615 1 1 0
## # ... with 932 more rows, and 1 more variables: SchoolHoliday <chr>
Each store has many rows in the dataframe, representing daily sales.
Check: how many unique days are there in the dataframe, and do all stores have data for all of them?
train_read %>%
select(Date) %>%
distinct()
## # A tibble: 942 x 1
## Date
## <date>
## 1 2015-07-31
## 2 2015-07-30
## 3 2015-07-29
## 4 2015-07-28
## 5 2015-07-27
## 6 2015-07-26
## 7 2015-07-25
## 8 2015-07-24
## 9 2015-07-23
## 10 2015-07-22
## # ... with 932 more rows
train_read %>%
group_by(Store) %>%
summarize(numdays = n()) %>%
group_by(numdays) %>%
summarize(numtimes = n()) %>%
arrange(desc(numtimes))
## # A tibble: 3 x 2
## numdays numtimes
## <int> <int>
## 1 942 934
## 2 758 180
## 3 941 1
Interesting; most of the stores (934 of them) do have all 942 days of data, one store is missing a single day, and 180 stores are missing about six months' worth of data.
Check: is the Store x Date
combination unique?
train_read %>%
select(Store,Date) %>%
distinct() %>%
nrow()
## [1] 1017209
train_read %>%
nrow()
## [1] 1017209
Yep.
We can merge store_read with train_read now. This is a many-to-one merge: each row in store_read will be replicated for each occurrence of the same Store value in train_read. That is, each store gets its data repeated for each day on which it occurs in train_read.
Before doing a merge, you need to know how many rows you expect to see in the result. The number of rows in the result should be equal to the number of rows in train_read
, since all stores are present in both dataframes.
Let’s proceed using an inner join: rows will only be kept in the result table if the corresponding key values are present in both merging tables. We can use the dplyr
function inner_join
to accomplish this as follows:
train_store <- train_read %>%
inner_join(store_read,by = "Store")
glimpse(train_store)
## Observations: 1,017,209
## Variables: 18
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "...
## $ Date <date> 2015-07-31, 2015-07-31, 2015-07-31,...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651,...
## $ Customers <dbl> 555, 625, 821, 1498, 559, 589, 1414,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "...
## $ CompetitionDistance <dbl> 1270, 570, 14130, 620, 29910, 310, 2...
## $ CompetitionOpenSinceMonth <chr> "9", "11", "12", "9", "4", "12", "4"...
## $ CompetitionOpenSinceYear <chr> "2008", "2007", "2006", "2009", "201...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "...
## $ Promo2SinceWeek <chr> NA, "13", "14", NA, NA, NA, NA, NA, ...
## $ Promo2SinceYear <chr> NA, "2010", "2011", NA, NA, NA, NA, ...
## $ PromoInterval <chr> NA, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,...
Check that the join was successful: confirm the number of rows is as expected (using glimpse
above), that all variables are present, and that the key in the new table has the required uniqueness:
train_store %>%
select(Store,Date) %>%
distinct()
## # A tibble: 1,017,209 x 2
## Store Date
## <chr> <date>
## 1 1 2015-07-31
## 2 2 2015-07-31
## 3 3 2015-07-31
## 4 4 2015-07-31
## 5 5 2015-07-31
## 6 6 2015-07-31
## 7 7 2015-07-31
## 8 8 2015-07-31
## 9 9 2015-07-31
## 10 10 2015-07-31
## # ... with 1,017,199 more rows
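A complementary check is dplyr::anti_join, which returns the rows of the left table whose key values have no match in the right table; an empty result means an inner join will not silently drop any data. A sketch using the tables above:

```r
library(dplyr)

# Rows of train_read whose Store is absent from store_read;
# zero rows means the inner join loses nothing.
train_read %>%
  anti_join(store_read, by = "Store") %>%
  nrow()
# returns 0 here, since every Store in train_read is present in store_read
```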
It appears that the join was successful! Let’s do the same for the test set:
test_read %>%
nrow()
## [1] 41088
test_store <- test_read %>%
inner_join(store_read,by = "Store")
glimpse(test_store)
## Observations: 41,088
## Variables: 17
## $ Id <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ Store <chr> "1", "3", "7", "8", "9", "10", "11",...
## $ DayOfWeek <chr> "4", "4", "4", "4", "4", "4", "4", "...
## $ Date <date> 2015-09-17, 2015-09-17, 2015-09-17,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType <chr> "c", "a", "a", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "c", "a", "c", "a", "c", "...
## $ CompetitionDistance <dbl> 1270, 14130, 24000, 7520, 2030, 3160...
## $ CompetitionOpenSinceMonth <chr> "9", "12", "4", "10", "8", "9", "11"...
## $ CompetitionOpenSinceYear <chr> "2008", "2006", "2013", "2014", "200...
## $ Promo2 <chr> "0", "1", "0", "0", "0", "0", "1", "...
## $ Promo2SinceWeek <chr> NA, "14", NA, NA, NA, NA, "1", "13",...
## $ Promo2SinceYear <chr> NA, "2011", NA, NA, NA, NA, "2012", ...
## $ PromoInterval <chr> NA, "Jan,Apr,Jul,Oct", NA, NA, NA, N...
Now that our data is read in and merged, we can start modelling.
…or can we? We have our target variable defined already in the data (usually that alone is a very time-consuming process, thanks Kaggle!), and we have a bunch of variables. Should we just pick a fancy model like a neural network, plug all the variables into the precoded software, and call it a day?
It is almost never the case in an applied setting that someone will hand you a dataset that is complete and ready to model. Much of the predictive modelling process involves constructing useful modelling variables from the data in your possession.
The term feature engineering is typically used to describe the act of creating columns in a dataframe to be used in a predictive model, in one of two related situations. When you have structured data, as we have here, the data already reside in a nice matrix-like format, and feature engineering involves creating new useful variables from what is supplied in the data. Contrast this with unstructured data, where feature engineering refers to the act of putting the data into a matrix-like format suitable for modelling software, and deciding what to put in the columns of this matrix.
In our situation, we are dealing with structured data. Let’s take a look at the variables in our dataset, and think about how we could improve upon them. It is worth noting here that this step is where the analyst gets to be creative, and depending on the task at hand can have a huge impact on the predictive performance of the models considered. Any decisions made here are purely subjective, and while following this tutorial, you can and should add or remove steps here to get a feel for the effect this has on your models.
glimpse(train_store)
## Observations: 1,017,209
## Variables: 18
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "...
## $ Date <date> 2015-07-31, 2015-07-31, 2015-07-31,...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651,...
## $ Customers <dbl> 555, 625, 821, 1498, 559, 589, 1414,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "...
## $ CompetitionDistance <dbl> 1270, 570, 14130, 620, 29910, 310, 2...
## $ CompetitionOpenSinceMonth <chr> "9", "11", "12", "9", "4", "12", "4"...
## $ CompetitionOpenSinceYear <chr> "2008", "2007", "2006", "2009", "201...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "...
## $ Promo2SinceWeek <chr> NA, "13", "14", NA, NA, NA, NA, NA, ...
## $ Promo2SinceYear <chr> NA, "2010", "2011", NA, NA, NA, NA, ...
## $ PromoInterval <chr> NA, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,...
There are some potential new features that stand out:
- Date has already been engineered a bit, producing the DayOfWeek variable. We might think sales would be different on different days of the week, so this makes sense. We might also think sales would be different in different months, e.g. a holiday effect. So, let's make a feature called Month: the month of the Date.
- Combined with Date, we can use CompetitionOpenSinceMonth and CompetitionOpenSinceYear to determine the length of time that the nearest competitor has been open.
- Similarly, we can compute how long Promo2 has been in effect, rather than simply knowing when it was started.
Let's implement these transformations. We will do this using a dplyr pipeline, for efficiency and readability. Since we have to apply the same transformations to both the train and the test sets, it is best practice to create a function that does the transformation:
feature_engineer <- function(ds) {
  # Helper: build a date from a year and a week-of-year
  yw <- function(y, w) {
    dt <- ymd(str_c(y, "0101"))  # January 1st of year y
    week(dt) <- as.numeric(w)    # move forward to week w
    dt
  }
  ds %>%
    mutate(
      Month = as.character(lubridate::month(Date)),
      # Zero-pad single-digit months so ymd() can parse them
      comp_open_month_formatted = if_else(
        str_length(CompetitionOpenSinceMonth) == 1,
        str_c("0", CompetitionOpenSinceMonth),
        CompetitionOpenSinceMonth
      ),
      comp_open_date = lubridate::ymd(str_c(CompetitionOpenSinceYear, comp_open_month_formatted, "01")),
      comp_open_since_days = as.numeric(Date - comp_open_date),
      promo2_since_date = yw(Promo2SinceYear, Promo2SinceWeek),
      promo2_since_days = as.numeric(Date - promo2_since_date)
    ) %>%
    select(-comp_open_month_formatted)  # drop the intermediate padding column
}
# Do it to the training data
train_feateng <- feature_engineer(train_store)
glimpse(train_feateng)
## Observations: 1,017,209
## Variables: 23
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "...
## $ Date <date> 2015-07-31, 2015-07-31, 2015-07-31,...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651,...
## $ Customers <dbl> 555, 625, 821, 1498, 559, 589, 1414,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "...
## $ CompetitionDistance <dbl> 1270, 570, 14130, 620, 29910, 310, 2...
## $ CompetitionOpenSinceMonth <chr> "9", "11", "12", "9", "4", "12", "4"...
## $ CompetitionOpenSinceYear <chr> "2008", "2007", "2006", "2009", "201...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "...
## $ Promo2SinceWeek <chr> NA, "13", "14", NA, NA, NA, NA, NA, ...
## $ Promo2SinceYear <chr> NA, "2010", "2011", NA, NA, NA, NA, ...
## $ PromoInterval <chr> NA, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,...
## $ Month <chr> "7", "7", "7", "7", "7", "7", "7", "...
## $ comp_open_date <date> 2008-09-01, 2007-11-01, 2006-12-01,...
## $ comp_open_since_days <dbl> 2524, 2829, 3164, 2159, 121, 607, 85...
## $ promo2_since_date <date> NA, 2010-03-26, 2011-04-02, NA, NA,...
## $ promo2_since_days <dbl> NA, 1953, 1581, NA, NA, NA, NA, NA, ...
# ...and the test data
test_feateng <- feature_engineer(test_store)
glimpse(test_feateng)
## Observations: 41,088
## Variables: 22
## $ Id <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ Store <chr> "1", "3", "7", "8", "9", "10", "11",...
## $ DayOfWeek <chr> "4", "4", "4", "4", "4", "4", "4", "...
## $ Date <date> 2015-09-17, 2015-09-17, 2015-09-17,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType <chr> "c", "a", "a", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "c", "a", "c", "a", "c", "...
## $ CompetitionDistance <dbl> 1270, 14130, 24000, 7520, 2030, 3160...
## $ CompetitionOpenSinceMonth <chr> "9", "12", "4", "10", "8", "9", "11"...
## $ CompetitionOpenSinceYear <chr> "2008", "2006", "2013", "2014", "200...
## $ Promo2 <chr> "0", "1", "0", "0", "0", "0", "1", "...
## $ Promo2SinceWeek <chr> NA, "14", NA, NA, NA, NA, "1", "13",...
## $ Promo2SinceYear <chr> NA, "2011", NA, NA, NA, NA, "2012", ...
## $ PromoInterval <chr> NA, "Jan,Apr,Jul,Oct", NA, NA, NA, N...
## $ Month <chr> "9", "9", "9", "9", "9", "9", "9", "...
## $ comp_open_date <date> 2008-09-01, 2006-12-01, 2013-04-01,...
## $ comp_open_since_days <dbl> 2572, 3212, 899, 351, 5525, 2207, 14...
## $ promo2_since_date <date> NA, 2011-04-02, NA, NA, NA, NA, 201...
## $ promo2_since_days <dbl> NA, 1629, NA, NA, NA, NA, 1355, 2001...
The above transformations had to do almost exclusively with dates. Working with dates is a useful skill, so let's unpack the above a bit. In order to create the comp_open_since_days variable, we:
1. Zero-padded the CompetitionOpenSinceMonth variable. This is an annoying thing that happens when you convert numeric month indicators to character: October through December, months 10 - 12, have 2 characters, while the rest of the months have 1. This throws off date-parsing functions like lubridate::ymd, which take a character string of the form "yyyymmdd", "yyyy-mm-dd", etc., and expect the month to be 2 characters, so "01" instead of "1".
2. Pasted the year, padded month, and "01" together, and parsed the result with lubridate::ymd in the form 20140428 (for example). This creates a lubridate date object, which has a standard internal format useful for calculations. Note that when we read in the data to begin with, readr already parsed Date into this same format.
3. With both Date and comp_open_date in internally-compatible formats, simply taking their difference returns the number of days between them, giving the comp_open_since_days feature.
The approach for computing promo2_since_days is similar; however, because the data record the week of the year in which Promo2 began, instead of the month, we built a simple custom function for constructing a lubridate date from the week and year.
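The padding step in particular trips people up, so here is a minimal standalone sketch, using stringr::str_pad as an alternative to the if_else call in feature_engineer:

```r
library(stringr)
library(lubridate)

months_chr <- c("9", "11")                           # as read from the file
padded <- str_pad(months_chr, width = 2, pad = "0")  # "09", "11"
ymd(str_c("2008", padded, "01"))                     # 2008-09-01, 2008-11-01
```

str_pad pads on the left by default, so two-character months pass through unchanged.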
The point of this exercise is this: feature engineering is tedious and difficult. It is absolutely necessary to have these skills in predictive modelling; in more complex problems with bigger datasets, this task gets much more difficult and time-consuming, but doesn’t become less necessary.
While feature engineering typically refers to the construction of new columns in the modelling dataframe, feature preprocessing typically refers to the actual modification of data values in order to make the data more model-friendly. This can involve creating indicator variables for events, imputing missing values (if appropriate), deleting variables that are highly missing or have low standard deviation or are completely uncorrelated with the target or are too correlated with other variables or… this is another tedious, difficult step in which the analyst has freedom to be creative, and also the power to drastically change the predictive performance of models built on the resulting data.
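For example, one possible preprocessing step for this dataset (a sketch of one defensible choice among many, not necessarily the steps we apply below) is to turn interpretable missingness into an explicit indicator before imputing:

```r
library(dplyr)

# Flag whether a store has a recorded competitor at all, then impute
# the distance for stores without one (here, with the maximum observed
# distance; a large constant is another common choice).
train_feateng %>%
  mutate(
    has_competition = as.numeric(!is.na(CompetitionDistance)),
    CompetitionDistance = if_else(
      is.na(CompetitionDistance),
      max(CompetitionDistance, na.rm = TRUE),
      CompetitionDistance
    )
  )
```

The indicator preserves the "no competitor" information that plain imputation would otherwise erase.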
Let’s look again at our variables:
glimpse(train_feateng)
## Observations: 1,017,209
## Variables: 23
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "...
## $ Date <date> 2015-07-31, 2015-07-31, 2015-07-31,...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651,...
## $ Customers <dbl> 555, 625, 821, 1498, 559, 589, 1414,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "...
## $ CompetitionDistance <dbl> 1270, 570, 14130, 620, 29910, 310, 2...
## $ CompetitionOpenSinceMonth <chr> "9", "11", "12", "9", "4", "12", "4"...
## $ CompetitionOpenSinceYear <chr> "2008", "2007", "2006", "2009", "201...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "...
## $ Promo2SinceWeek <chr> NA, "13", "14", NA, NA, NA, NA, NA, ...
## $ Promo2SinceYear <chr> NA, "2010", "2011", NA, NA, NA, NA, ...
## $ PromoInterval <chr> NA, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,...
## $ Month <chr> "7", "7", "7", "7", "7", "7", "7", "...
## $ comp_open_date <date> 2008-09-01, 2007-11-01, 2006-12-01,...
## $ comp_open_since_days <dbl> 2524, 2829, 3164, 2159, 121, 607, 85...
## $ promo2_since_date <date> NA, 2010-03-26, 2011-04-02, NA, NA,...
## $ promo2_since_days <dbl> NA, 1953, 1581, NA, NA, NA, NA, NA, ...
We will do the following preprocessing. You are encouraged to add/remove preprocessing steps as you see fit; the possibilities are endless and have a nontrivial effect on the results, so you definitely should play around here.
As you can see, there are many things we could do, and this is only the beginning. In cases where you have more variables, you may also have to screen through them and initially filter out any that are completely unrelated to the target, just to get the sheer volume down.
It also needs to be said that the manner of preprocessing described here is suited to structured data. If you have unstructured data, like text or images, and you have engineered all your features yourself, the preprocessing steps might be very different, and will probably be related to the type of data you have (for example, scaling or recolouring images). It also depends on the type of model you are fitting: for example, if you're fitting a neural network to the MNIST handwritten digits data, many of the features have exactly zero variance (the corners of every image are white!), but leaving them in doesn't bother the neural network.
First, we screen out variables by standard deviation and missing-value proportion:
stddev_threshold <- function(x) sd(x, na.rm = TRUE) < .01
onelevel_threshold <- function(x) length(unique(x)) == 1
propmissing_threshold <- function(x) mean(is.na(x)) > .5

vars_to_remove_stddev <- train_feateng %>%
  select_if(is.numeric) %>%  # Only look at numeric variables
  select_if(stddev_threshold) %>%
  colnames()

# The equivalent notion for categorical variables is having only a single level
vars_to_remove_onelevel <- train_feateng %>%
  select_if(is.character) %>%
  select_if(onelevel_threshold) %>%
  colnames()

vars_to_remove_missing <- train_feateng %>%
  select_if(propmissing_threshold) %>%
  colnames()
vars_to_remove_stddev
## character(0)
vars_to_remove_onelevel
## character(0)
vars_to_remove_missing
## character(0)
These data are already pretty clean, so this step did nothing. That won't be the case for every dataset you encounter, and it's good to have checked anyway.
Next let’s talk about missing values. We won’t get into a lengthy discussion about the inferential aspects of missing value imputation. Let’s check which variables have missing values to begin with:
has_missing <- function(x) any(is.na(x))
train_feateng %>%
select_if(has_missing) %>%
glimpse()
## Observations: 1,017,209
## Variables: 10
## $ CompetitionDistance <dbl> 1270, 570, 14130, 620, 29910, 310, 2...
## $ CompetitionOpenSinceMonth <chr> "9", "11", "12", "9", "4", "12", "4"...
## $ CompetitionOpenSinceYear <chr> "2008", "2007", "2006", "2009", "201...
## $ Promo2SinceWeek <chr> NA, "13", "14", NA, NA, NA, NA, NA, ...
## $ Promo2SinceYear <chr> NA, "2010", "2011", NA, NA, NA, NA, ...
## $ PromoInterval <chr> NA, "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,...
## $ comp_open_date <date> 2008-09-01, 2007-11-01, 2006-12-01,...
## $ comp_open_since_days <dbl> 2524, 2829, 3164, 2159, 121, 607, 85...
## $ promo2_since_date <date> NA, 2010-03-26, 2011-04-02, NA, NA,...
## $ promo2_since_days <dbl> NA, 1953, 1581, NA, NA, NA, NA, NA, ...
Because the data are, again, pretty clean, it turns out that the only variables with missing values are missing for an interpretable reason: either the store doesn't have any competitors (CompetitionDistance, CompetitionOpenSinceMonth, CompetitionOpenSinceYear, and derivatives) or the store doesn't run Promo2. We will deal with this in two steps:
# Set variables to impute
vars_to_impute <- train_feateng %>%
select_if(has_missing) %>%
colnames()
# Check that these variables don't actually have rows that are zero
has_zero <- function(x) {
x <- x[!is.na(x)]
if (is.numeric(x)) {
any(x == 0)
}
else if (is.Date(x)) {
FALSE
}
else {
any(x == "0")
}
}
train_feateng %>%
select(one_of(vars_to_impute)) %>%
select_if(has_zero)
## # A tibble: 1,017,209 x 2
## comp_open_since_days promo2_since_days
## <dbl> <dbl>
## 1 2524 NA
## 2 2829 1953
## 3 3164 1581
## 4 2159 NA
## 5 121 NA
## 6 607 NA
## 7 851 NA
## 8 303 NA
## 9 5477 NA
## 10 2159 NA
## # ... with 1,017,199 more rows
# Okay, the two since_days variables have zeroes. This would probably be for the actual day
# that those stores' competitors opened (or their promo started); let's check:
train_feateng %>%
select(Store,Date,comp_open_date,comp_open_since_days) %>%
filter(comp_open_since_days == 0)
## # A tibble: 181 x 4
## Store Date comp_open_date comp_open_since_days
## <chr> <date> <date> <dbl>
## 1 131 2015-07-01 2015-07-01 0
## 2 137 2015-07-01 2015-07-01 0
## 3 304 2015-07-01 2015-07-01 0
## 4 403 2015-07-01 2015-07-01 0
## 5 859 2015-07-01 2015-07-01 0
## 6 944 2015-07-01 2015-07-01 0
## 7 996 2015-07-01 2015-07-01 0
## 8 1053 2015-07-01 2015-07-01 0
## 9 269 2015-06-01 2015-06-01 0
## 10 496 2015-06-01 2015-06-01 0
## # ... with 171 more rows
train_feateng %>%
select(Store,Date,promo2_since_date,promo2_since_days) %>%
filter(promo2_since_days == 0)
## # A tibble: 176 x 4
## Store Date promo2_since_date promo2_since_days
## <chr> <date> <date> <dbl>
## 1 428 2015-06-04 2015-06-04 0
## 2 629 2015-06-04 2015-06-04 0
## 3 872 2015-06-04 2015-06-04 0
## 4 875 2015-04-30 2015-04-30 0
## 5 876 2015-04-30 2015-04-30 0
## 6 265 2015-04-02 2015-04-02 0
## 7 331 2015-04-02 2015-04-02 0
## 8 749 2015-04-02 2015-04-02 0
## 9 946 2015-04-02 2015-04-02 0
## 10 28 2015-02-05 2015-02-05 0
## # ... with 166 more rows
# Yep, looks like it. Go ahead and impute the data with zeroes
impute_with_zeroes <- function(x) {
if (is.numeric(x)) {
x[is.na(x)] <- 0
}
else if (is.Date(x)) {
# Don't do anything, we're deleting dates in the next step
}
else {
x[is.na(x)] <- "0"
}
x
}
train_imputed <- train_feateng %>%
mutate_at(vars_to_impute,impute_with_zeroes)
# Check it worked
train_imputed %>%
select_if(has_missing)
## # A tibble: 1,017,209 x 2
## comp_open_date promo2_since_date
## <date> <date>
## 1 2008-09-01 NA
## 2 2007-11-01 2010-03-26
## 3 2006-12-01 2011-04-02
## 4 2009-09-01 NA
## 5 2015-04-01 NA
## 6 2013-12-01 NA
## 7 2013-04-01 NA
## 8 2014-10-01 NA
## 9 2000-08-01 NA
## 10 2009-09-01 NA
## # ... with 1,017,199 more rows
# Just the two dates that we're going to get rid of anyways
Here we perform the simple step of getting rid of variables we can’t use directly in the model. For us this means the dates. We kept them up to this point to use in checks (like in the imputation step), but now we don’t need them anymore. Also remove Customers, as it is a proxy for Sales and isn’t available in the test set (because it isn’t known at the time of prediction).
vars_to_remove_cantuse <- c(
"Customers",
"CompetitionOpenSinceMonth",
"CompetitionOpenSinceYear",
"Promo2SinceWeek",
"Promo2SinceYear",
"comp_open_date",
"promo2_since_date"
)
train_varsremoved <- train_imputed %>%
select(-one_of(vars_to_remove_cantuse))
We’re getting closer to having usable data: with the exception of Store, Date and Sales, all of the variables in our dataframe could be used in a statistical model. We have to keep Store as it is a grouping variable, and Date will be needed for the train-validation split.
Filter out rows representing sales for days that the store was closed:
# How many days are the stores closed?
train_varsremoved %>%
group_by(Open) %>%
summarize(numdays = n())
## # A tibble: 2 x 2
## Open numdays
## <chr> <int>
## 1 0 172817
## 2 1 844392
# Actually a lot. This might sound silly, but double check that sales are actually
# recorded as zero on days when stores are closed:
train_varsremoved %>%
filter(Open == "0") %>%
filter(abs(Sales) > 0)
## # A tibble: 0 x 16
## # ... with 16 variables: Store <chr>, DayOfWeek <chr>, Date <date>,
## # Sales <dbl>, Open <chr>, Promo <chr>, StateHoliday <chr>,
## # SchoolHoliday <chr>, StoreType <chr>, Assortment <chr>,
## # CompetitionDistance <dbl>, Promo2 <chr>, PromoInterval <chr>,
## # Month <chr>, comp_open_since_days <dbl>, promo2_since_days <dbl>
# Nope, all good
# Are all stores closed roughly the same number of days?
train_varsremoved %>%
filter(Open == "0") %>%
group_by(Store) %>%
summarize(numdays = n()) %>%
group_by(numdays) %>%
summarize(numtimes = n()) %>%
arrange(desc(numtimes))
## # A tibble: 66 x 2
## numdays numtimes
## <int> <int>
## 1 163 263
## 2 158 208
## 3 136 135
## 4 162 116
## 5 161 97
## 6 165 59
## 7 160 40
## 8 166 30
## 9 135 26
## 10 159 11
## # ... with 56 more rows
# So 263 stores are closed for 163 days in the dataset,
# 208 stores are closed for 158 days, and so on
# We could dig in to that more if we wanted
# Remove them:
train_closedremoved <- train_varsremoved %>%
filter(Open == "1") %>%
select(-Open) # Don't need this variable anymore as it will always equal 1
glimpse(train_closedremoved)
## Observations: 844,392
## Variables: 15
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "8", "...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "5", "...
## $ Date <date> 2015-07-31, 2015-07-31, 2015-07-31, 2015...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651, 1534...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "a", "...
## $ CompetitionDistance <dbl> 1270, 570, 14130, 620, 29910, 310, 24000,...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "0", "...
## $ PromoInterval <chr> "0", "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct"...
## $ Month <chr> "7", "7", "7", "7", "7", "7", "7", "7", "...
## $ comp_open_since_days <dbl> 2524, 2829, 3164, 2159, 121, 607, 851, 30...
## $ promo2_since_days <dbl> 0, 1953, 1581, 0, 0, 0, 0, 0, 0, 0, 1307,...
# Actually removed quite a few rows!
Including redundant variables in a model can cause instability, as the variables “fight” for which one explains the patterns in the data. In a regression context this is called multicollinearity, and can result in the inversion of the design matrix being numerically unstable. Such instability is observed more generally in optimization algorithms involving matrices with nearly linearly dependent columns, and the resulting model fits are unstable. So, best to investigate this possibility beforehand.
When we have a medium-large number of variables, it is reasonable to compute all pairwise correlations and display them in a sorted list. When we have a very large number of variables, computing all pairwise correlations could be infeasible, as with \(p\) variables there are \(p(p-1)/2\) correlations to compute, which is quadratic in \(p\).
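As a sketch of the sorted-list approach on small simulated data (the dataframe and variable names here are made up for illustration; with real data you would start from your numeric columns):

```r
# Sketch: list every pairwise correlation once, sorted by absolute value
set.seed(42)
df <- data.frame(a = rnorm(100))
df$b <- df$a + rnorm(100, sd = 0.1)  # nearly collinear with a
df$c <- rnorm(100)                   # unrelated noise
cormat <- cor(df)
# Take each pair once (upper triangle), then sort by absolute correlation
idx <- which(upper.tri(cormat), arr.ind = TRUE)
cor_list <- data.frame(
  var1 = rownames(cormat)[idx[, 1]],
  var2 = colnames(cormat)[idx[, 2]],
  correlation = cormat[idx]
)
cor_list[order(-abs(cor_list$correlation)), ]
```

The highly correlated (a, b) pair sorts to the top, which is exactly the kind of redundancy you would want to investigate before modelling.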
Luckily here we have a very small number of continuous variables:
train_closedremoved %>%
select_if(is.numeric)
## # A tibble: 844,392 x 4
## Sales CompetitionDistance comp_open_since_days promo2_since_days
## <dbl> <dbl> <dbl> <dbl>
## 1 5263 1270 2524 0
## 2 6064 570 2829 1953
## 3 8314 14130 3164 1581
## 4 13995 620 2159 0
## 5 4822 29910 121 0
## 6 5651 310 607 0
## 7 15344 24000 851 0
## 8 8492 7520 303 0
## 9 8565 2030 5477 0
## 10 7185 3160 2159 0
## # ... with 844,382 more rows
In fact, only 3 of them. When the number of continuous variables is small, it is feasible to make a heat map of the correlations. The corrplot package does a nice job of this, and includes matrix-ordering algorithms that cluster correlated variables together for easy viewing.
We’ll also include the response variable here, to get a rough idea of whether any features are correlated with it.
# Look at the actual correlation matrix
cormat <- cor(select_if(train_closedremoved,is.numeric))
round(cormat,2)
## Sales CompetitionDistance comp_open_since_days
## Sales 1.00 -0.04 0.00
## CompetitionDistance -0.04 1.00 -0.02
## comp_open_since_days 0.00 -0.02 1.00
## promo2_since_days -0.06 -0.05 0.01
## promo2_since_days
## Sales -0.06
## CompetitionDistance -0.05
## comp_open_since_days 0.01
## promo2_since_days 1.00
# Plot it
#
corrplot::corrplot(cormat,
method = "shade",
type = "lower",
order = "AOE"
)
We don’t see much relation between any of these variables, including the response. This isn’t necessarily a bad thing; correlation only measures linear dependence, the degree to which one variable can be expressed as a linear function of another. It doesn’t mean we won’t be able to model Sales using these variables.
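To see why low correlation doesn’t rule out a strong relationship, here is a minimal example where one variable is a deterministic function of the other:

```r
# y is completely determined by x, yet the linear correlation vanishes
x <- seq(-1, 1, length.out = 101)
y <- x^2
cor(x, y)  # approximately zero: there is no linear relationship to detect
```

A flexible model could recover this relationship perfectly even though the correlation matrix would show nothing.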
Now let’s take a look at the distribution of each of our categorical variables, to ensure that each is suitable for modelling. We’re looking for anything irregular such as sparse or mislabelled levels. We’ll also take a look at the joint distribution of each categorical variable and the target, to get an idea for whether any can be expected to be predictive on their own.
# Which are the categorical variables?
train_closedremoved %>%
select_if(is.character)
## # A tibble: 844,392 x 10
## Store DayOfWeek Promo StateHoliday SchoolHoliday StoreType Assortment
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 5 1 0 1 c a
## 2 2 5 1 0 1 a a
## 3 3 5 1 0 1 a a
## 4 4 5 1 0 1 c c
## 5 5 5 1 0 1 a a
## 6 6 5 1 0 1 a a
## 7 7 5 1 0 1 a c
## 8 8 5 1 0 1 a a
## 9 9 5 1 0 1 a c
## 10 10 5 1 0 1 a a
## # ... with 844,382 more rows, and 3 more variables: Promo2 <chr>,
## # PromoInterval <chr>, Month <chr>
# Put them in a vector, minus Store
categorical_variables <- train_closedremoved %>%
select_if(is.character) %>%
select(-Store) %>%
colnames()
# Check out the distribution of each
cat_dist <- function(var) {
plt <- train_closedremoved %>%
ggplot(aes_string(x = var)) +
theme_classic() +
geom_bar(stat = "count",aes_string(fill = var),colour = "black") +
labs(title = str_c("Barplot of ",var),
subtitle = "Rossman Store Sales",
x = var,
y = "Count") +
scale_y_continuous(labels = scales::comma_format()) +
guides(fill = FALSE)
return(plt)
}
# Get a list of plots of each categorical variable and pass them to cowplot::plot_grid
# This is a useful trick
categorical_variables %>%
map(cat_dist) %>%
cowplot::plot_grid(plotlist = .)
We see some sparse categories, specifically DayOfWeek = 7 (this is Sunday, and remember we removed all days where the store is closed from the data), all the nonzero StateHoliday levels, StoreType = b, and Assortment = b. It is not necessarily the case that including a sparse level of a categorical variable will cause instability in the model fit, but it is certainly possible. We won’t remove these levels yet, but if we find our models are having convergence or stability issues, we know what to look for.
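Rather than reading the bar plots by eye, you could also flag sparse levels programmatically. A minimal base-R sketch (the 1% threshold is our arbitrary choice, and the toy vector stands in for one of the columns you would map this over, e.g. the categorical_variables defined above):

```r
# Flag levels of a categorical vector holding less than some share of the rows
sparse_levels <- function(x, threshold = 0.01) {
  prop <- table(x) / length(x)
  names(prop)[prop < threshold]
}

# Toy data: one level holds 0.5% of the rows
x <- c(rep("a", 600), rep("b", 395), rep("rare", 5))
sparse_levels(x)  # "rare"
```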
To get an idea of what might go wrong, look closer at the sparseness of Assortment = b and StoreType = b:
store_read %>%
group_by(Assortment,StoreType) %>%
summarize(count = n())
## # A tibble: 9 x 3
## # Groups: Assortment [?]
## Assortment StoreType count
## <chr> <chr> <int>
## 1 a a 381
## 2 a b 7
## 3 a c 77
## 4 a d 128
## 5 b b 9
## 6 c a 221
## 7 c b 1
## 8 c c 71
## 9 c d 220
This is a problem. If we included both these variables unmodified, we would have one combination of categories containing only a single store. We’ll leave it for now, so we can see how and when it crops up in the modelling process.
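If this does cause trouble later, one common remedy is to collapse rare levels into a single catch-all category before fitting. A minimal base-R sketch (the threshold and the "other" label are our invented choices):

```r
# Collapse levels holding under 1% of rows into a single "other" level
lump_rare <- function(x, threshold = 0.01, other = "other") {
  prop <- table(x) / length(x)
  rare <- names(prop)[prop < threshold]
  ifelse(x %in% rare, other, x)
}

x <- c(rep("a", 600), rep("b", 395), rep("z", 5))
table(lump_rare(x))  # the 5 "z" rows become "other"
```

If you prefer the tidyverse, the forcats package provides fct_lump() for the same idea.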
What about the relationship with the target? For visualizing the relationship between a categorical and a continuous variable, we can do boxplots of the continuous variable in each level of the categorical variable:
#
cat_cont_dist <- function(var,df = train_closedremoved) {
plt <- df %>%
ggplot(aes_string(x = var,y = "Sales")) +
theme_classic() +
geom_boxplot(aes_string(fill = var),colour = "black") +
labs(title = str_c("Boxplots of Sales within ",var),
subtitle = "Rossman Store Sales",
x = var,
y = "Sales") +
scale_y_continuous(labels = scales::dollar_format()) +
guides(fill = FALSE)
return(plt)
}
categorical_variables %>%
map(cat_cont_dist) %>%
cowplot::plot_grid(plotlist = .,ncol=2)
Nothing surprising jumps out at us; sales are higher in December, and on days when the store is running a promo. Remember that the boxplots for categories identified as sparse above might be misleading, as they are not based on many data points.
Let’s now take a look at the continuous variables, both histograms of their univariate distributions and scatterplots of their relationship with Sales:
continuous_variables <- train_closedremoved %>%
select_if(is.numeric) %>%
dplyr::select(-Sales) %>%
colnames()
#
cont_dist <- function(var) {
plt <- train_closedremoved %>%
ggplot(aes_string(x = var)) +
theme_light() +
geom_histogram(bins = 100,colour = "black",fill = "purple") +
labs(title = str_c("Histogram of ",var),
subtitle = "Rossman Store Sales",
x = var,
y = "Count") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::comma_format())
plt
}
continuous_variables %>%
map(cont_dist) %>%
cowplot::plot_grid(plotlist = .,ncol=1)
We note some things:
- CompetitionDistance is highly skewed
- comp_open_since_days has some bad outliers (there are stores whose competitor has been open for over \(40,000\) days, or \(109\) years, which isn’t impossible but doesn’t occur all that frequently), and is heavily spiked at zero. Recall zero means there is no competitor on record for that store
- promo2_since_days is severely spiked at \(0\), and has negative values. Either there were data recording errors (promos starting after the range of available data), or we made a mistake in the computation of this variable. In a more detailed analysis (like the kind maybe you are doing!) this should be investigated
#
cont_cont_dist <- function(var) {
plt <- train_closedremoved %>%
sample_n(50000) %>%
ggplot(aes_string(x = var,y = "Sales")) +
theme_light() +
geom_point(pch=21,colour = "black",fill = "orange") +
labs(title = str_c("Scatterplot of Sales vs ",var),
subtitle = "Rossman Store Sales",
x = var,
y = "Sales") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::dollar_format())
return(plt)
}
continuous_variables %>%
map(cont_cont_dist) %>%
cowplot::plot_grid(plotlist = .,ncol=1)
We don’t see any clear relationships. A couple of anomalies arise; there is probably only one store with each of those outlying values of comp_open_since_days (remember each store shows up many times in this dataset).
It is not likely that a linear relationship between sales and any of these variables will be very strong. We could use a model that tries to fit complex non-linear relationships, just hope it works, and try to interpret the results; or we could tackle this in the preprocessing stage. You are encouraged to try the former approach and see if you can improve on the results; here we will investigate discretizing these variables.
Discretizing means splitting a variable up into discrete ranges, for example replacing comp_open_since_days with an indicator for comp_open_since_days between 0 and 180, 180 and 365, 365 and 1000, and so on. This serves two purposes: it removes the effects of outliers, and it allows simple modelling of nonlinear relationships. It has the disadvantage of removing any smooth relationship between the response and the features, so in practice it is good to try modelling both with and without this step.
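The mechanical tool for this in base R is cut(), which maps each value to the half-open interval containing it. A tiny standalone example of the binning we have in mind (the break points here are invented for illustration):

```r
# cut() turns a numeric vector into a factor of intervals
days <- c(-10, 0, 90, 200, 400, 2000)
binned <- cut(days, breaks = c(-Inf, 0, 180, 365, Inf))
binned
# By default intervals are right-closed, so 0 falls in (-Inf,0]
as.integer(binned)
```

Adding -Inf and Inf as end points guarantees every value, including outliers, lands in some bin; this is exactly what the get_knots() function below does.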
One simple way to pick the knots (cut-points for the variables) is using a single-variable decision tree, as implemented in the rpart package. This is tedious on the part of the analyst, as you have to comb through all the continuous variables manually and make sure the knots you find make sense. Nonetheless, it’s a useful technique to employ when you have data that is poorly behaved (as we have here) and you’re under pressure to produce a clean, interpretable model.
#
# Choose knots by recursive partitioning using rpart
# Implicitly set the number of knots by choosing the complexity parameter cp-
# read the rpart documentation
#
get_knots <- function(var,complexity=0.01) {
form <- as.formula(str_c("Sales ~ ",var))
cntrl <- rpart::rpart.control(cp = complexity)
tree <- rpart::rpart(form,train_closedremoved,control = cntrl)
out <- tree$splits %>%
as_data_frame()
if (nrow(out) == 0) {
return(c(0))
}
out %>%
pull(index) %>%
unique() %>%
sort() %>%
c(-Inf,.,Inf) # Add end points so cut() function doesn't return NAs
}
# Set a list of the variable names and the complexities you want to use
# You basically have to set these manually, unless you can think of a better way!
# Try to set them so you get a reasonable number of splits, somewhere between
# 3 and 5, as a guideline. You don't want bins that are too big, because you're
# smoothing over any relationship with the response, but you don't want bins that
# are too small as the relationship will be too noisy and not replicate in the
# validation set
variables_and_complexities <- list(
c("CompetitionDistance",0.003),
c("comp_open_since_days",0.0005),
c("promo2_since_days",0.001)
)
variable_knots <- variables_and_complexities %>%
map(~get_knots(.x[1],complexity = .x[2])) %>%
setNames(continuous_variables)
variable_knots
## $CompetitionDistance
## [1] -Inf 35 255 Inf
##
## $comp_open_since_days
## [1] -Inf 857.5 4682.5 Inf
##
## $promo2_since_days
## [1] -Inf -2.5 0.5 908.5 Inf
These chosen knots are telling us something about the relationship between each feature and the response. The split points (35, 255) for CompetitionDistance indicate that having no competitor results in different sales than having a competitor right next door (35 to 255; you read the documentation, so you know these values are in metres), but if the competitor is not right next door or on the same block, there isn’t much difference between 500m and 5km. You can draw similar conclusions for the other variables. This is just what these data are saying for these values of the complexity parameter using this particular type of decision tree; these aren’t supposed to be formal statistical inferences.
We can compute our new discretized variables and plot their relationships with the response using side-by-side boxplots as we did before:
#
discretize_variables <- function() {
# Pulls the variable_knots and train_closedremoved from the calling environment
# and returns a new training set with the features discretized
df <- train_closedremoved
for (var in continuous_variables) {
df <- df %>%
mutate_at(var,function(x) {
cut(x,breaks = variable_knots[[var]])
}
)
}
df
}
train_discretized <- discretize_variables()
glimpse(train_discretized)
## Observations: 844,392
## Variables: 15
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "8", "...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "5", "...
## $ Date <date> 2015-07-31, 2015-07-31, 2015-07-31, 2015...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651, 1534...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "a", "...
## $ CompetitionDistance <fctr> (255, Inf], (255, Inf], (255, Inf], (255...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "0", "...
## $ PromoInterval <chr> "0", "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct"...
## $ Month <chr> "7", "7", "7", "7", "7", "7", "7", "7", "...
## $ comp_open_since_days <fctr> (858,4.68e+03], (858,4.68e+03], (858,4.6...
## $ promo2_since_days <fctr> (-2.5,0.5], (908, Inf], (908, Inf], (-2....
# Plot them
continuous_variables %>%
map(~cat_cont_dist(.x,df = train_discretized)) %>%
cowplot::plot_grid(plotlist = .,ncol=2)
With an engineered, preprocessed dataset in hand, we can turn to prediction. In this document, we’ll focus on the mechanical aspects of this: performing an appropriate train-validation split, passing data to a model and returning predictions on the training set, validating those predictions on the validation set, and (optionally) computing predictions on the test set and submitting to kaggle via their API.
Evaluation of the results of a predictive model must be done on different data than were used to fit the model. This is mainly to avoid developing a model that fits irregularities or noise in the training data that don’t appear in subsequent datasets on which the model is to be used: the problem of overfitting.
Standard practice is to split the data randomly by rows into a training set, used to fit the model, and a test set used to evaluate final model performance. The training set is often further split during the model building phase yielding a validation set, which is used during intermediate model building (e.g. to choose hyperparameter values). This training/validation split can be done many times if needed.
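For reference, the classical random row split looks like the sketch below (on a made-up row count; as we argue next, this is not the right choice for these grouped, time-ordered data):

```r
# Sketch: a 70/30 random split by row index
set.seed(2015)
n <- 1000  # stand-in for the number of rows in the training data
train_idx <- sample(n, size = floor(0.7 * n))
valid_idx <- setdiff(seq_len(n), train_idx)
c(train = length(train_idx), validation = length(valid_idx))
```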
It is important not to just blindly apply standard practice in every new problem, but to think about what an appropriate strategy is for the task at hand. This is a kaggle competition dataset, and a test set is already provided, so we don’t need to do that split ourselves. We do need a validation set, though, as we don’t want to submit to kaggle every time we want to evaluate the performance of some intermediate model that we build. Should we just split the training data, say 70/30, into training and validation sets?
How to do the split depends on the structure of your data. Randomly splitting based on rows is appropriate for the “classical” case, where rows represent statistically independent observations from some population. That’s not the case here though- observations are grouped by store, and trended across time. The train-validation split needs to address the actual prediction task you’re trying to solve, which in this case is predicting sales for the same stores as in the training set, but at future time points.
How far in the future do we have to predict? Let’s look at the test set:
test_read %>%
pull(Date) %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu.
## "2015-08-01" "2015-08-12" "2015-08-24" "2015-08-24" "2015-09-05"
## Max.
## "2015-09-17"
We can also confirm that the stores in the test set are all in the training set:
teststores <- test_read %>% pull(Store) %>% unique()
trainstores <- train_read %>% pull(Store) %>% unique()
all(teststores %in% trainstores)
## [1] TRUE
We have to predict daily sales for 6 weeks (this was also stated in the documentation). So for our validation split, rather than randomly selecting stores (which doesn’t make sense, since we aren’t tasked with predicting sales for new stores) or randomly selecting dates (we don’t need to predict sales on random dates; we need to predict daily sales for 6 weeks of consecutive days immediately following the training period), we will simply make our validation set the last six weeks of the training data.
Check the range of dates in the training data:
train_discretized %>%
pull(Date) %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu.
## "2013-01-01" "2013-08-16" "2014-03-31" "2014-04-11" "2014-12-10"
## Max.
## "2015-07-31"
The training data goes until July 31st 2015, so let’s make the validation set contain June 15th - July 31st 2015:
training_set <- train_discretized %>%
filter(Date < ymd("20150615")) %>%
dplyr::select(-Date)
validation_set <- train_discretized %>%
filter(Date >= ymd("20150615")) %>%
dplyr::select(-Date)
glimpse(training_set)
## Observations: 798,540
## Variables: 14
## $ Store <chr> "85", "122", "209", "259", "262", "274", ...
## $ DayOfWeek <chr> "7", "7", "7", "7", "7", "7", "7", "7", "...
## $ Sales <dbl> 12421, 3709, 3152, 18535, 29937, 7786, 34...
## $ Promo <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType <chr> "b", "a", "a", "b", "b", "b", "d", "a", "...
## $ Assortment <chr> "a", "c", "c", "b", "a", "b", "c", "c", "...
## $ CompetitionDistance <fctr> (255, Inf], (255, Inf], (255, Inf], (35,...
## $ Promo2 <chr> "0", "0", "1", "0", "0", "1", "0", "1", "...
## $ PromoInterval <chr> "0", "0", "Jan,Apr,Jul,Oct", "0", "0", "J...
## $ Month <chr> "6", "6", "6", "6", "6", "6", "6", "6", "...
## $ comp_open_since_days <fctr> (858,4.68e+03], (-Inf,858], (858,4.68e+0...
## $ promo2_since_days <fctr> (-2.5,0.5], (-2.5,0.5], (0.5,908], (-2.5...
glimpse(validation_set)
## Observations: 45,852
## Variables: 14
## $ Store <chr> "1", "2", "3", "4", "5", "6", "7", "8", "...
## $ DayOfWeek <chr> "5", "5", "5", "5", "5", "5", "5", "5", "...
## $ Sales <dbl> 5263, 6064, 8314, 13995, 4822, 5651, 1534...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "1", "1", "1", "1", "1", "1", "1", "1", "...
## $ StoreType <chr> "c", "a", "a", "c", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "a", "c", "a", "a", "c", "a", "...
## $ CompetitionDistance <fctr> (255, Inf], (255, Inf], (255, Inf], (255...
## $ Promo2 <chr> "0", "1", "1", "0", "0", "0", "0", "0", "...
## $ PromoInterval <chr> "0", "Jan,Apr,Jul,Oct", "Jan,Apr,Jul,Oct"...
## $ Month <chr> "7", "7", "7", "7", "7", "7", "7", "7", "...
## $ comp_open_since_days <fctr> (858,4.68e+03], (858,4.68e+03], (858,4.6...
## $ promo2_since_days <fctr> (-2.5,0.5], (908, Inf], (908, Inf], (-2....
Now we turn to the mechanics of passing data to a model and returning predictions. We need to train a simple model on the training set, and score it on the validation set. We’ll use a simple linear regression on Sales as our model. This is obviously not an appropriate model for such grouped data; other tutorials will describe more appropriate modelling techniques. Here we just want something that we can pass the validation set to. The result needs to be a two-column dataframe, with columns observed and predicted containing the observed and predicted values from the model on the validation data.
Our simple model is defined as follows:
simple_model <- lm(Sales ~ . - Store,data = training_set)
summary(simple_model)
##
## Call:
## lm(formula = Sales ~ . - Store, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10954.3 -1711.2 -367.1 1238.8 29074.6
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 5442.8416 143.7909 37.852
## DayOfWeek2 -1080.0328 10.3413 -104.438
## DayOfWeek3 -1395.7035 10.3847 -134.400
## DayOfWeek4 -1355.1777 10.5365 -128.617
## DayOfWeek5 -991.2246 10.4531 -94.825
## DayOfWeek6 -939.2667 11.1522 -84.223
## DayOfWeek7 -902.0822 48.6038 -18.560
## Promo1 2298.7121 6.6241 347.023
## StateHolidaya 100.8102 101.7478 0.991
## StateHolidayb -216.0637 222.0098 -0.973
## StateHolidayc -1019.0403 317.0699 -3.214
## SchoolHoliday1 309.6167 8.6476 35.804
## StoreTypeb 5031.2103 33.1032 151.985
## StoreTypec -4.4576 9.1855 -0.485
## StoreTyped -72.1547 7.1061 -10.154
## Assortmentb -2837.2171 44.0682 -64.382
## Assortmentc 868.2882 6.2789 138.288
## CompetitionDistance(35,255] 2616.3808 36.4026 71.873
## CompetitionDistance(255, Inf] 1244.3447 35.3703 35.180
## Promo21 -1248.1185 138.6350 -9.003
## PromoIntervalFeb,May,Aug,Nov 343.2907 13.1453 26.115
## PromoIntervalJan,Apr,Jul,Oct 564.0096 11.2987 49.918
## PromoIntervalMar,Jun,Sept,Dec NA NA NA
## Month10 0.2754 14.7827 0.019
## Month11 499.5205 14.8834 33.562
## Month12 1979.9849 14.9919 132.070
## Month2 88.9484 13.0651 6.808
## Month3 307.6090 12.8444 23.949
## Month4 431.3562 13.0539 33.044
## Month5 533.2505 13.0864 40.748
## Month6 416.6966 13.7303 30.349
## Month7 128.2613 14.7843 8.676
## Month8 -70.3441 15.2272 -4.620
## Month9 -38.7299 14.7751 -2.621
## comp_open_since_days(858,4.68e+03] -181.8072 6.2382 -29.144
## comp_open_since_days(4.68e+03, Inf] 409.3043 16.9256 24.182
## promo2_since_days(-2.5,0.5] -331.3742 138.8105 -2.387
## promo2_since_days(0.5,908] -569.1750 12.9700 -43.884
## promo2_since_days(908, Inf] -22.0111 13.0488 -1.687
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## DayOfWeek2 < 2e-16 ***
## DayOfWeek3 < 2e-16 ***
## DayOfWeek4 < 2e-16 ***
## DayOfWeek5 < 2e-16 ***
## DayOfWeek6 < 2e-16 ***
## DayOfWeek7 < 2e-16 ***
## Promo1 < 2e-16 ***
## StateHolidaya 0.32179
## StateHolidayb 0.33045
## StateHolidayc 0.00131 **
## SchoolHoliday1 < 2e-16 ***
## StoreTypeb < 2e-16 ***
## StoreTypec 0.62747
## StoreTyped < 2e-16 ***
## Assortmentb < 2e-16 ***
## Assortmentc < 2e-16 ***
## CompetitionDistance(35,255] < 2e-16 ***
## CompetitionDistance(255, Inf] < 2e-16 ***
## Promo21 < 2e-16 ***
## PromoIntervalFeb,May,Aug,Nov < 2e-16 ***
## PromoIntervalJan,Apr,Jul,Oct < 2e-16 ***
## PromoIntervalMar,Jun,Sept,Dec NA
## Month10 0.98514
## Month11 < 2e-16 ***
## Month12 < 2e-16 ***
## Month2 9.90e-12 ***
## Month3 < 2e-16 ***
## Month4 < 2e-16 ***
## Month5 < 2e-16 ***
## Month6 < 2e-16 ***
## Month7 < 2e-16 ***
## Month8 3.84e-06 ***
## Month9 0.00876 **
## comp_open_since_days(858,4.68e+03] < 2e-16 ***
## comp_open_since_days(4.68e+03, Inf] < 2e-16 ***
## promo2_since_days(-2.5,0.5] 0.01698 *
## promo2_since_days(0.5,908] < 2e-16 ***
## promo2_since_days(908, Inf] 0.09164 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2663 on 798502 degrees of freedom
## Multiple R-squared: 0.2644, Adjusted R-squared: 0.2644
## F-statistic: 7757 on 37 and 798502 DF, p-value: < 2.2e-16
We score this model on new data using the predict function:
train_predictions <- predict(simple_model)
train_predictions[1:10]
## 1 2 3 4 5 6 7
## 10719.830 6738.715 5634.998 9436.456 10901.637 7142.510 6484.753
## 8 9 10
## 5252.795 11351.763 6921.791
This gives us predicted sales for each store and day in the training set.
This isn’t quite what we want: we want a nicely structured dataframe containing both the observed and predicted values. We can construct this manually:
train_predictions <- data_frame(
observed = training_set %>% pull(Sales),
predicted = predict(simple_model)
)
train_predictions
## # A tibble: 798,540 x 2
## observed predicted
## <dbl> <dbl>
## 1 12421 10719.830
## 2 3709 6738.715
## 3 3152 5634.998
## 4 18535 9436.456
## 5 29937 10901.637
## 6 7786 7142.510
## 7 3438 6484.753
## 8 3121 5252.795
## 9 8518 11351.763
## 10 8998 6921.791
## # ... with 798,530 more rows
For the validation set, pass the newdata = validation_set argument to predict:
validation_predictions <- data_frame(
observed = validation_set %>% pull(Sales),
predicted = predict(simple_model,newdata = validation_set)
)
## Warning in predict.lm(simple_model, newdata = validation_set): prediction
## from a rank-deficient fit may be misleading
validation_predictions
## # A tibble: 45,852 x 2
## observed predicted
## <dbl> <dbl>
## 1 5263 7914.913
## 2 6064 7544.625
## 3 8314 7544.625
## 4 13995 8783.201
## 5 4822 8101.178
## 6 5651 8101.178
## 7 15344 8969.466
## 8 8492 8101.178
## 9 8565 9378.770
## 10 7185 7919.370
## # ... with 45,842 more rows
We get a warning about “prediction from a rank-deficient fit” being misleading; this happens because our design matrix was not of full rank in the simple model. Promo2 and PromoInterval encode overlapping information (PromoInterval is “0” exactly when Promo2 is “0” after our imputation), so one coefficient, PromoIntervalMar,Jun,Sept,Dec in the summary above, could not be defined.
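You can track down the offending column with base R’s alias(), which reports which model terms are linear combinations of the others. A toy reproduction of the situation (variables invented for illustration):

```r
# A predictor that is a linear function of another yields an NA coefficient
set.seed(3)
d <- data.frame(x1 = rnorm(50))
d$x2 <- 2 * d$x1                 # x2 is perfectly determined by x1
d$y  <- d$x1 + rnorm(50)
fit <- lm(y ~ x1 + x2, data = d)
coef(fit)["x2"]  # NA, just like PromoIntervalMar,Jun,Sept,Dec above
alias(fit)       # reports how x2 is expressed in terms of x1
```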
Kaggle uses the Root Mean Squared Percentage Error (RMSPE) as the metric in this competition. With our data structured in this way, we can implement this in a function:
get_rmspe <- function(preds) {
preds %>%
summarize(rmspe = sqrt( mean( ((observed - predicted)/observed)^2 ))) %>%
pull(rmspe)
}
get_rmspe(train_predictions)
## [1] Inf
get_rmspe(validation_predictions)
## [1] 0.4367195
Not great: we’re off by about 44% on average on the validation set. We’ll see how to do a bit better in subsequent tutorials. Also note that there are still 54 rows with observed = 0 in the training set; dividing by these zeroes is what makes the training RMSPE come out as Inf. We’ll be manually setting the days where the store is closed to have Sales = 0 in the test set.
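As a diagnostic in the meantime, you can make the training-set RMSPE finite by dropping the observed = 0 rows before computing the metric. A sketch (note this changes what the number measures, so use it for diagnostics only; get_rmspe_nonzero is an illustrative name, not from the original code):

```r
library(dplyr)

# RMSPE computed only over rows with nonzero observed sales, so the
# ratio (observed - predicted)/observed is always defined.
get_rmspe_nonzero <- function(preds) {
  preds %>%
    filter(observed != 0) %>%
    summarize(rmspe = sqrt(mean(((observed - predicted) / observed)^2))) %>%
    pull(rmspe)
}

# Toy check: the observed == 0 row no longer makes the metric Inf
get_rmspe_nonzero(tibble(observed = c(0, 100), predicted = c(5, 90)))  # 0.1
```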
The final mechanical hurdle in our modelling pipeline is scoring the test data. This should be easy after all the work we just put in.
The only problem is preprocessing. In the feature engineering step, our actions were a lot more prescribed, and we easily put them in a function and applied it to both the training and test data. In the preprocessing step, though, we did a lot of manual data analysis and decision making. Now, we need to record what we did, and apply exactly the same steps to the test set. Best practice dictates that you do this as you go while preprocessing, so we’re being a bit sloppy here.
Our preprocessing consisted of removing days on which stores were closed, imputing missing values, and discretizing the continuous variables.
Luckily in this case, we only have to replicate two steps: imputation and discretization. It is important to remember though that these data were really clean; in cases where they aren’t, the task of applying the same preprocessing steps to both the training and test sets can get more difficult, and you should probably write a function like we did in the feature engineering step.
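The “write it as a function as you go” pattern looks something like the following sketch, with toy stand-ins for the notebook’s imputation and discretization steps (the names preprocess and knots here are illustrative, not from the original code):

```r
library(dplyr)

# One function that records all preprocessing decisions, so it can be
# applied identically to the training and test data. The knots must be
# computed from the *training* data only, then reused for the test data.
preprocess <- function(df, knots) {
  df %>%
    mutate(x = if_else(is.na(x), 0, x)) %>%  # impute missing values with 0
    mutate(x = cut(x, breaks = knots))       # discretize with fixed knots
}

knots <- c(-1, 5, 10)  # from the training data
preprocess(data.frame(x = c(NA, 3, 7)), knots)
```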
# Imputation
vars_to_impute
## [1] "CompetitionDistance" "CompetitionOpenSinceMonth"
## [3] "CompetitionOpenSinceYear" "Promo2SinceWeek"
## [5] "Promo2SinceYear" "PromoInterval"
## [7] "comp_open_date" "comp_open_since_days"
## [9] "promo2_since_date" "promo2_since_days"
test_imputed <- test_feateng %>%
mutate_at(vars_to_impute,impute_with_zeroes)
glimpse(test_imputed)
## Observations: 41,088
## Variables: 22
## $ Id <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ Store <chr> "1", "3", "7", "8", "9", "10", "11",...
## $ DayOfWeek <chr> "4", "4", "4", "4", "4", "4", "4", "...
## $ Date <date> 2015-09-17, 2015-09-17, 2015-09-17,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType <chr> "c", "a", "a", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "c", "a", "c", "a", "c", "...
## $ CompetitionDistance <dbl> 1270, 14130, 24000, 7520, 2030, 3160...
## $ CompetitionOpenSinceMonth <chr> "9", "12", "4", "10", "8", "9", "11"...
## $ CompetitionOpenSinceYear <chr> "2008", "2006", "2013", "2014", "200...
## $ Promo2 <chr> "0", "1", "0", "0", "0", "0", "1", "...
## $ Promo2SinceWeek <chr> "0", "14", "0", "0", "0", "0", "1", ...
## $ Promo2SinceYear <chr> "0", "2011", "0", "0", "0", "0", "20...
## $ PromoInterval <chr> "0", "Jan,Apr,Jul,Oct", "0", "0", "0...
## $ Month <chr> "9", "9", "9", "9", "9", "9", "9", "...
## $ comp_open_date <date> 2008-09-01, 2006-12-01, 2013-04-01,...
## $ comp_open_since_days <dbl> 2572, 3212, 899, 351, 5525, 2207, 14...
## $ promo2_since_date <date> NA, 2011-04-02, NA, NA, NA, NA, 201...
## $ promo2_since_days <dbl> 0, 1629, 0, 0, 0, 0, 1355, 2001, 214...
# Discretization
# Re-do the discretize function so it takes the dataframe as input (why didn't
# I do that before? Who knows!)
discretize_variables <- function(df) {
# Pulls continuous_variables and variable_knots from the calling environment
# and returns a new dataframe with those features discretized
for (var in continuous_variables) {
df <- df %>%
mutate_at(var,function(x) {
cut(x,breaks = variable_knots[[var]])
}
)
}
df
}
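Note that cut() returns NA for any value falling outside the knot range, which is how test-set values outside the range seen in training turn into NA predictions downstream. A quick self-contained illustration:

```r
# Values outside the (leftmost, rightmost] knot range map to NA:
# 50 -> (0,100], 150 -> (100,200], 999 -> NA
cut(c(50, 150, 999), breaks = c(0, 100, 200))
```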
test_discretized <- discretize_variables(test_imputed)
glimpse(test_discretized)
## Observations: 41,088
## Variables: 22
## $ Id <chr> "1", "2", "3", "4", "5", "6", "7", "...
## $ Store <chr> "1", "3", "7", "8", "9", "10", "11",...
## $ DayOfWeek <chr> "4", "4", "4", "4", "4", "4", "4", "...
## $ Date <date> 2015-09-17, 2015-09-17, 2015-09-17,...
## $ Open <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ Promo <chr> "1", "1", "1", "1", "1", "1", "1", "...
## $ StateHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ SchoolHoliday <chr> "0", "0", "0", "0", "0", "0", "0", "...
## $ StoreType <chr> "c", "a", "a", "a", "a", "a", "a", "...
## $ Assortment <chr> "a", "a", "c", "a", "c", "a", "c", "...
## $ CompetitionDistance <fctr> (255, Inf], (255, Inf], (255, Inf],...
## $ CompetitionOpenSinceMonth <chr> "9", "12", "4", "10", "8", "9", "11"...
## $ CompetitionOpenSinceYear <chr> "2008", "2006", "2013", "2014", "200...
## $ Promo2 <chr> "0", "1", "0", "0", "0", "0", "1", "...
## $ Promo2SinceWeek <chr> "0", "14", "0", "0", "0", "0", "1", ...
## $ Promo2SinceYear <chr> "0", "2011", "0", "0", "0", "0", "20...
## $ PromoInterval <chr> "0", "Jan,Apr,Jul,Oct", "0", "0", "0...
## $ Month <chr> "9", "9", "9", "9", "9", "9", "9", "...
## $ comp_open_date <date> 2008-09-01, 2006-12-01, 2013-04-01,...
## $ comp_open_since_days <fctr> (858,4.68e+03], (858,4.68e+03], (85...
## $ promo2_since_date <date> NA, 2011-04-02, NA, NA, NA, NA, 201...
## $ promo2_since_days <fctr> (-2.5,0.5], (908, Inf], (-2.5,0.5],...
We can now compute the predictions in the same way as we did with the validation data, and also manually set Sales = 0 on days when the store is closed. We’ll then format the results in the manner required by Kaggle:
test_predict <- test_discretized %>%
bind_cols(data_frame(Sales = predict(simple_model,newdata = test_discretized))) %>%
mutate(Sales = if_else(Open == 0,0,Sales)) %>%
mutate(Sales = if_else(is.na(Sales),0,Sales)) %>% # There are 11 NA predictions; set them to 0
dplyr::select(Id,Sales)
## Warning in predict.lm(simple_model, newdata = test_discretized): prediction
## from a rank-deficient fit may be misleading
glimpse(test_predict)
## Observations: 41,088
## Variables: 2
## $ Id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11",...
## $ Sales <dbl> 7074.352, 6704.064, 7947.098, 7260.617, 8538.209, 7078.8...
This can be written to a .csv file using readr::write_csv, in much the same way as we used readr::read_delim to read in the original data:
readr::write_csv(test_predict,'/Users/alexstringer/phd/s18/leaf-ra/leaf2018/rossman-simple-model-predictions.csv')
You can then submit to kaggle either through their drag-and-drop interface, or, if you feel like authenticating with their API, using the shell command
kaggle competitions submit -c rossmann-store-sales -f rossman-simple-model-predictions.csv -m "Message"
Submitting this got me a score of 0.38970, which is not great. Next, we’ll see about developing a better predicting model for these data.
As a final step, save the training, validation and test sets to disk as .RData files, so we can read them in directly in the subsequent tutorial where we fit more appropriate models:
save(training_set,file = '/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-training-set.RData')
save(validation_set,file = '/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-validation-set.RData')
save(test_discretized,file = '/Users/alexstringer/phd/s18/leaf-ra/leaf2018/datasets/rossman-test-set.RData')
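In the follow-up tutorial these can be read back with load(), which restores each object under the name it was saved with. A quick self-contained demonstration of the save()/load() round trip using a temporary file (demo_set is an illustrative name):

```r
# save()/load() round-trip: load() restores the object under its
# original name in the calling environment.
tmp <- tempfile(fileext = ".RData")
demo_set <- data.frame(Sales = c(1, 2, 3))
save(demo_set, file = tmp)
rm(demo_set)
load(tmp)
exists("demo_set")  # TRUE
```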