# Load the tidyverse packages
suppressMessages({
  suppressWarnings({
    library(tidyverse)
  })
})

Sampling Distributions

In lecture, we discussed the concept of the sampling distribution of an estimator: its probability distribution. An estimator is a function of the sample, which is random. If we repeated the data-collection experiment again, we would get a different sample- and a different value of our estimator. If we repeated the experiment many times, we would get many different values of our estimator, and they would be distributed according to the sampling distribution of that estimator.

Normal Example

If \(X_{i} \overset{IID}{\sim} N(\mu,1)\) is a random sample of size \(n\) from a normal distribution with unit variance and unknown mean \(\mu\), we have seen that we can estimate \(\mu\) with \(\hat{\mu} = \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_{i}\), the sample mean. In class we derived the sampling distribution of the sample mean (for normally distributed samples): \[ \bar{X} \sim \left( \mu,\frac{1}{n}\right) \] That is, normally distributed with the same mean as \(X_{i}\) and variance \(1/n\) scaling down linearly with the sample size.

What does this mean? Let’s draw one random sample of size \(n = 5\) from a normal distribution and compute the sample mean. We’ll fix the true \(\mu\) at \(\mu = 0\), though this works for any \(\mu\):

set.seed(3827)
# Draw one sample from normal(mu,1) with mu fixed at 0
n <- 5
mu <- 0
sig <- 1
samp <- rnorm(n,mu,sig)
mean(samp)
## [1] -0.3514201

The value -0.35 is one realization from a \(N(0,1/5)\) distribution. How do we use this information? Since we’re using \(\bar{X}\) to estimate \(\mu\), our best guess at the value of \(\mu\) based on this sample is \(\mu = -0.35\). This is different from the true value \(\mu = 0\). Does that mean our estimator is bad?

No, because there is variability in the sample, and hence variability in our estimator. Knowing the sampling distribution of our estimator allows us to quantify this variability in a meaningful way.

For example, we can make statements like “if \(\mu = 0\), what is the probability that I would see a sample mean of \(\bar{X} = -0.35\) or something even farther away from \(0\), in a sample of size \(n = 5\)?”. If this probability is low, then either we observed something really unlikely, or \(\mu = 0\) is not a value of \(\mu\) that is well supported by the data. We can compute this probability:

pnorm(-abs(mean(samp)),mu,sig/sqrt(n)) + 1 - pnorm(abs(mean(samp)),mu,sig/sqrt(n))
## [1] 0.4319851

Pretty likely! If \(\mu = 0\), there actually is a pretty good chance of seeing a sample as “extreme” as the one we observed.

We don’t always know the sampling distribution of our estimator analytically. Let’s verify the above calculation through a simulation:

set.seed(432976)
B <- 10000
simulated_means <- 1:B %>%
  map(~rnorm(n,mu,sig)) %>%
  map(mean) %>%
  reduce(c)
  
mean(simulated_means < -abs(mean(samp))) + mean(simulated_means > abs(mean(samp)))
## [1] 0.4347

Let’s unpack that simulation a bit. We:

  • Defined a variable B indiciating the number of samples of size \(n\) we wished to simulate
  • Use the map function to efficiently loop over the vector \((1,2,\ldots,B)\), taking a sample of size \(n\) each time and storing the results in a list
  • Used the map function again to take this list of samples, and take the mean of each
  • Use the reduce function to recursively loop over this list, applying the c (concatenate) function to each element in turn, returning a vector of simulated sample means
  • Computed the relative frequency with which our simulated sample means were father away from \(0\) in absolute value than the original sample mean we got

We can make a histogram of the simulated sample means and compare the theoretical density curve for \(\bar{X}\):

data_frame(x = density(simulated_means,n=100)$x,
           y = density(simulated_means,n=100)$y,
           col = if_else(x < -abs(mean(samp)) | x > abs(mean(samp)),"Yes","No")) %>%
  ggplot() +
  theme_classic() + 
  geom_bar(mapping = aes(x = x,y = y,fill = col),
           stat = "identity",
           colour = "black") +
  stat_function(fun = dnorm,args = list(mean = mu,sd = sig/sqrt(n)),col = "purple") +
  labs(title = "Sampling Distribution of Sample Mean for Normally Distributed Samples",
       subtitle = "Simulated (bars) vs Theoretical (purple line)",
       x = "Simulated Sample Mean",
       y = "Density",
       fill = "More Extreme than Observed?") +
  scale_fill_manual(values = c("Yes" = "red","No" = "orange")) +
  scale_x_continuous(breaks = seq(-2,2,by = 0.5))

The sampling distribution of an estimator gives an idea of what values of the estimator are reasonable to observe, given specfic values of the parameters.

Connection to the Central Limit Theorem

The Central Limit Theorem can be rephrased using our new terminology: let \(X_{i} \overset{IND}{\sim} F\) be independent random variables with mean \(0\) and finite variance \(\sigma^{2} < \infty\). Let \(S_{n} = \sum_{i=1}^{n}X_{i}\). Then the sampling distribution of \(S_{n}/\sigma\sqrt{n}\) converges to a standard normal distribution. Or put another way, the sampling distribution of \(S_{n}/\sigma\sqrt{n}\) is well approximated by a standard normal distribution if \(n\) is large.

Consider the example of the number of people using the Toronto subway in a given day. The total number of people riding the subway in a day is the sum of the riders at each of the TTC’s \(69\) stations. Suppose in an effort to secure more government funding, the TTC is claiming that it has an average of \(3\) million subway riders per weekday (note: this is a fictional scenario). Can we use available data to evaluate how reasonble a claim this is?

We can obtain data on station-level subway ridership from Open Data Toronto. The data has been cleaned slightly (just removed extra header rows) and stored on github. We can read in the data as follows:

ttc <- readr::read_csv("https://raw.githubusercontent.com/awstringer1/sta261s18supplementary/master/datasets/ttc-subway-2015.csv") %>%
  dplyr::select(-starts_with("X")) %>%
  filter(!is.na(total)) %>%
  filter(station != "Grand Totals")
## Warning: Missing column names filled in: 'X5' [5], 'X6' [6], 'X7' [7],
## 'X8' [8], 'X9' [9], 'X10' [10]
## Parsed with column specification:
## cols(
##   station = col_character(),
##   to_trains = col_number(),
##   from_trains = col_number(),
##   total = col_number(),
##   X5 = col_character(),
##   X6 = col_character(),
##   X7 = col_character(),
##   X8 = col_character(),
##   X9 = col_character(),
##   X10 = col_character()
## )
glimpse(ttc)
## Observations: 74
## Variables: 4
## $ station     <chr> "Bloor (line 1, Y-U)", "Yonge (line 2, B-D)", "St....
## $ to_trains   <dbl> 103975, 95407, 63367, 64125, 58259, 47895, 37296, ...
## $ from_trains <dbl> 112214, 87831, 72836, 64851, 60187, 52924, 44038, ...
## $ total       <dbl> 216189, 183238, 136203, 128976, 118446, 100819, 81...

The total column is the total number of riders on a “typical” weekday (we don’t know the ttc’s definiton of “typical”). The to_trains and from_trains are the number of riders boarding and disembarking from trains at each station.

Any time you read in data, you need to perform sanity checks. For example, we know there are \(69\) subway stations, so why are there \(74\) rows in the dataframe? Inspection of the data reveals that transfer stations are counted twice: Bloor/Yonge, Sheppard-Yonge, Kennedy, St. George and Spadina. This reconciles the total number of rows with the correct number of stations.

Now, to evaluate the TTC’s claim. We first compute the observed total number of riders on this one day:

ttc %>%
  summarize(total = sum(total)) %>%
  pull(total)
## [1] 2669373

On this weekday, there were \(2,669,373\) riders. If the average number of riders per weekday was actually \(3,000,000\), how likely would it be to see \(2,669,373\) riders or a number farther from \(3,000,000\) in any single day? It sounds like we can’t answer this question, because we only have data on a single day. If the population whose mean we are interested in is the total number of riders on any weekday of the year, then we are trying to estimate a mean with only a single datapoint, \(\hat{\mu} = X_{1}\).

Or are we? Let’s look at the distribution of station-level ridership:

ttc %>%
  ggplot(aes(x = total)) +
  theme_light () +
  geom_histogram(bins = 15,colour = "black",fill = "grey") +
  labs(title = "Station-level Subway Ridership",
       subtitle = "TTC Ridership Data",
       x = "Number of Riders in a Typical Day",
       y = "Number of Stations") +
  scale_x_continuous(labels = scales::comma_format())

Definitely not normal; I don’t immediately recognize this as being any distribution.

The key point, though, is that our estimator \(\hat{\mu} = X_{1}\) is actually a sum of random quantities which can reasonably be thought to be independent. That is, station-level riderships. What we have actually observed is \[ X_{1} = \sum_{i=1}^{74}R_{i} \] where each \(R_{i}\) is a random variable assumed to have finite mean and variance, and we are assuming the \(R_{i}\) are independent. We can apply the Central Limit Theorem to our estimator, even though at first glance it looked like it was only a single observation!

If the actual average weekdaily ridership is \(3,000,000\), the Central Limit Theorem can be applied to give us the distributional approximations: \[ \begin{aligned} \frac{\sum_{i=1}^{74}R_{i} - 3,000,000}{\sigma_{R}\sqrt{74}} &\overset{\cdot}{\sim} N(0,1) \\ \frac{\sum_{i=1}^{74}\left( R_{i} - \bar{R}\right)^{2}}{\sigma^{2}_{R}} &\overset{\cdot}{\sim} \chi^{2}_{73} \end{aligned} \] Combining these as we did when we derived the \(t\)-distribution lets us approximate \[ \frac{\sum_{i=1}^{74}R_{i} - 3,000,000}{s_{R}\sqrt{74}} \overset{\cdot}{\sim} t_{73} \] where \(s_{R} = \sqrt{\sum_{i=1}^{74}\left( R_{i} - \bar{R}\right)^{2}}\) is the sample variance of the station ridership.

With these results in hand, we can compute an approximate probability of seeing our result or something more extreme, conditional on the TTC’s claim being true, in much the same way as we did in the simulated normal distribution example. First we compute the sample variance and “observed t-statistic” \(t_{obs}\):

sR <- sd(ttc$total)
tobs <- (2669373 - 3000000)/(sR*sqrt(74))
cat("Sample Variance: ",sR,"\nObserved t-statistic: ",tobs,"\n")
## Sample Variance:  39328.53 
## Observed t-statistic:  -0.9772703

We then compute: \[ \begin{aligned} P\left( |\sum_{i=1}^{74}R_{i} - 3,000,000| > |2,669,373 - 3,000,000| \right) &= 1 - P\left(\frac{-|2,669,373 - 3,000,000|}{s_{R}\sqrt{74}} < \frac{\sum_{i=1}^{74}R_{i} - 3,000,000}{s_{R}\sqrt{74}} < \frac{|2,669,373 - 3,000,000|}{s_{R}\sqrt{74}} \right) \\ &= 1 - P\left( -|t_{obs}| < t_{73} < |t_{obs}| \right) \\ &= 1 - P(-0.98 < t_{73} < 0.98) \\ &= 0.3316629 \end{aligned} \]

Given the high between-station variability, based on this methodology it appears that observing \(2,669,373\) riders on a given day is reasonable if the true average daily ridership is \(3,000,000\).

Remember when using the CLT that you are making assumptions; most strongly, assuming the random variables you are summing are statistically independent, which cannot be reliably checked from a single dataset. This is assumed from the design of the data collection process, and should be open to criticism. For example here, we would need to discuss in much more detail what it means and whether we think it is reasonable to claim that ridership at individual stations on the same day are independent.

Check out this tutorial for further detail on how to go about checking these distributional approximations given only a single set of data like we have here.