# Load the tidyverse packages
suppressMessages({
  suppressWarnings({
    library(tidyverse)
  })
})

This tutorial introduces Bayesian statistics from a practical, computational point of view. Less focus is placed on the theory/philosophy and more on the mechanics of computation involved in estimating quantities using Bayesian inference. Students completing this tutorial will be able to fit medium-complexity Bayesian models to data using MCMC.

Topics covered:

  • Introduction: the frequentist and Bayesian perspectives on inference
  • Bayesian inference: flipping more coins, estimating \(\theta\), and point and interval estimation
  • Choosing a prior: conjugate priors, setting hyperparameters by moment-matching, and Empirical Bayes
  • Full example: Bayesian polynomial regression for Toronto 311 contact centre wait times

Introduction

Given data generated from some family of probability distributions indexed by unknown parameters, statistical inference is concerned with estimating these parameters: finding reasonable values for them, given the observed data. The central notion is that of uncertainty: we simply don’t know the values of the parameters that generated the data we observed, and we do know that several different values could reasonably have generated these data. Probability is the mathematical construct used to represent uncertainty.

Frequentist Perspective

Classically, the approach to this problem is taught from the frequentist perspective. Uncertainty in the values of the parameters that generated the data is represented by probability via the notion of repeated sampling: under the given probability model with the given parameter values, what is the relative frequency with which these same data would be observed, if the experiment that generated the data were repeated again and again? Values of the parameters that have a higher probability of having generated the observed data are thought to be more likely than other values.

As an example, consider a coin with unknown probability of heads \(\theta\). We toss the coin once, and observe random outcome (data) \(X = 1\) if the toss is heads and \(X = 0\) if not. For simplicity, suppose we know we have chosen one of two possible coins, either having \(\theta = 0.7\) or \(\theta = 0.3\). How do we use the observed data to infer which of these two coins we threw?

For any \(0 < \theta < 1\), the probability distribution of the single coin toss is given by \[ P(X = x) = \theta^{x}(1-\theta)^{1-x} \] This just says that \(P(X = 1) = \theta\) and \(P(X = 0) = 1-\theta\).

Let’s say we throw the coin once, and observe \(X = 1\). If \(\theta = 0.7\) the probability of observing this result is \(P(X = 1|\theta = 0.7) = 0.7\). That is, if \(\theta = 0.7\), we would expect roughly \(70\%\) of repetitions of this experiment to yield the same result as we observed in our data. If \(\theta = 0.3\) on the other hand, \(P(X = 1|\theta = 0.3) = 0.3\); only \(30\%\) of the repetitions of this experiment would yield the observed data if \(\theta = 0.3\). Because \(\theta = 0.7\) would yield the observed data more frequently than \(\theta = 0.3\), we say that \(\theta = 0.7\) is more likely to have generated the observed data than \(\theta=0.3\), and our inference favours \(\theta =0.7\).

For a more comprehensive background on frequentist/likelihood inference, check out the tutorial on maximum likelihood.

Bayesian Perspective

One criticism of the above approach is that it depends not only on the observed data, but also on infinitely many other possible datasets that are not observed. This is an artifact of the manner in which probability is used to represent uncertainty. In contrast, Bayesian statistics represents uncertainty about the value of a parameter directly, using probability distributions.

In particular, a prior distribution is placed on the parameter, representing the probable values of that parameter before data is observed. Having observed the data, the prior is updated via Bayes’ Rule, yielding the posterior distribution of the parameter, given the data. The choice of prior distribution is based either on subject-matter knowledge or mathematical convenience, and is a subjective choice on the part of the analyst.

To see how this works, suppose we think that about \(4/5\) of the coins in our pocket are \(\theta = 0.3\) coins, and only \(1/5\) are \(\theta = 0.7\) coins. Our prior distribution on \(\theta\) is then \[ P(\theta = q) = 0.2^{I(q = 0.7)}0.8^{I(q = 0.3)} \] where \(I(q = 0.7) = 1\) if \(q = 0.7\) and \(0\) otherwise. This, like the probability distribution of the actual result of the coin toss, just encodes our notion that \(P(\theta = 0.3) = 0.8\) and \(P(\theta = 0.7) = 0.2\).

So without knowing the result of the coin toss, we think there is a \(20\%\) chance that \(\theta = 0.7\). We know from above that if we observe heads on the coin toss, we have observed a result that would occur about \(70\%\) of the time if \(\theta = 0.7\). In probability terms, we have a marginal distribution for \(\theta\), and a conditional distribution for \(X|\theta\).

These two ideas are combined by computing the conditional distribution of \(\theta|X\), known as the posterior distribution for \(\theta\) having observed \(X\). This is obtained (explaining the name) via Bayes’ Rule: \[ p(\theta|X) = \frac{p(X|\theta)\times p(\theta)}{p(X)} \] where the marginal distribution of \(X\), or the normalizing constant or marginal likelihood or model evidence (this thing has a lot of names) is given by \[ p(X) = \sum_{\theta}p(X|\theta)\times p(\theta) \] and ensures \(p(\theta|X)\) is a proper probability distribution.

In our example, the prior probability of \(\theta = 0.7\) is only \(20\%\). But we flip the coin and observe \(X = 1\). We can see how this observation updates our belief about the likely values of \(\theta\) by computing the posterior distribution of \(\theta\) given the observed data: \[ \begin{aligned} &p(\theta|X) = \frac{\theta^{x}(1-\theta)^{1-x}\times 0.2^{I(\theta = 0.7)}0.8^{I(\theta = 0.3)}}{\sum_{\theta = 0.3,0.7}\theta^{x}(1-\theta)^{1-x}\times 0.2^{I(\theta = 0.7)}0.8^{I(\theta = 0.3)}} \\ \implies& P(\theta = 0.7 | X = 1) = \frac{0.7 \times 0.2}{0.7\times0.2 + 0.3\times0.8} \\ &= 0.368 \end{aligned} \] Before observing heads, we would have thought the \(\theta = 0.7\) coin to be very unlikely, but because the observed data favours \(\theta = 0.7\) more strongly than \(\theta = 0.3\), after observing these data we feel that \(\theta = 0.7\) is more likely than before.
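
This arithmetic is easy to check directly in R; here is a minimal sketch (the variable names are just for illustration):

prior_probs <- c("0.3" = 0.8, "0.7" = 0.2)
lik <- c("0.3" = 0.3, "0.7" = 0.7)  # P(X = 1 | theta)
post <- prior_probs * lik / sum(prior_probs * lik)
post  # approximately 0.632 for theta = 0.3 and 0.368 for theta = 0.7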

Bayesian Inference

To continue the explanation of the Bayesian approach, we will extend the above simple example in two logical ways: first by flipping more than one coin, and then by allowing \(\theta\) to take any value in \((0,1)\) rather than just two candidate values.

Flipping More Coins

Suppose now that we flip \(n\) coins, obtaining a dataset \(X = (X_{1},\ldots,X_{n})\) of heads or tails, represented by 0’s and 1’s. If we’re still considering only two candidate values \(\theta = 0.7\) or \(\theta = 0.3\), we may still ask the question “which value is more likely to have generated the observed data?”. We again form the likelihood function for each value of \(\theta\), the relative frequency with which each value of \(\theta\) would have generated the observed sample. Assuming the tosses are statistically independent: \[ p(X|\theta) = \theta^{\sum_{i=1}^{n}X_{i}} \times (1 - \theta)^{n - \sum_{i=1}^{n}X_{i}} \] where \(\sum_{i=1}^{n}X_{i}\) is just the number of heads observed in the sample. We see that any two samples that have the same number of heads will lead to the same inferences about \(\theta\) in this manner.

Suppose we throw the coin \(10\) times and observe \(6\) heads. The likelihood function for each candidate value of \(\theta\) is \[ \begin{aligned} p(X|\theta = 0.7) &= 0.7^{6} \times 0.3^{4} = 0.000953 \\ p(X|\theta = 0.3) &= 0.3^{6} \times 0.7^{4} = 0.000175 \\ \end{aligned} \] It is much more likely to observe \(6\) heads when \(\theta = 0.7\) than when \(\theta = 0.3\).

In the Bayesian setting, with our prior distribution on \(\theta\) from above, we would form the posterior distribution as follows: \[ p(\theta|X) = \frac{\theta^{\sum_{i=1}^{n}x_{i}}(1-\theta)^{n - \sum_{i=1}^{n}x_{i}}\times 0.2^{I(\theta = 0.7)}0.8^{I(\theta = 0.3)}}{\sum_{\theta = 0.3,0.7}\theta^{\sum_{i=1}^{n}x_{i}}(1-\theta)^{n - \sum_{i=1}^{n}x_{i}}\times 0.2^{I(\theta = 0.7)}0.8^{I(\theta = 0.3)}} \] Computing this for our observed data of \(\sum_{i=1}^{10}x_{i} = 6\) yields \[ \begin{aligned} p(\theta = 0.7 |X) = \frac{0.7^{6}0.3^{4}\times 0.2}{0.7^{6}0.3^{4}\times 0.2 + 0.3^{6}0.7^4\times0.8} = 0.576 \\ p(\theta = 0.3 |X) = \frac{0.3^{6}0.7^{4}\times 0.8}{0.3^{6}0.7^{4}\times 0.8 + 0.7^{6}0.3^4\times0.2} = 0.424 \\ \end{aligned} \]
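
Again, this is easy to check numerically; a quick sketch:

prior_probs <- c("0.3" = 0.8, "0.7" = 0.2)
lik <- c("0.3" = 0.3^6 * 0.7^4, "0.7" = 0.7^6 * 0.3^4)
post <- prior_probs * lik / sum(prior_probs * lik)
round(post, 3)  # roughly 0.424 for theta = 0.3 and 0.576 for theta = 0.7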

We can see that the data “updates” our prior belief that \(\theta = 0.3\) was more probable than \(\theta = 0.7\), because the observed data was more likely to have occurred if \(\theta = 0.7\) than if \(\theta = 0.3\).

It is helpful to visualize the prior and posterior for the observed data. Because both the prior and posterior put probability on only two values, we can do this using a simple bar chart:

visualize_binomial_priorposterior <- function(sumx,n) {
  prior <- function(theta) {
    if (theta == .3) {
      return(.8)
    }
    else if (theta == .7) {
      return(.2)
    }
    0
  }
  likelihood <- function(theta) theta^sumx * (1-theta)^(n - sumx)
  marginal_likelihood <- prior(.7) * likelihood(.7) + prior(.3) * likelihood(.3)
  posterior <- function(theta) likelihood(theta) * prior(theta) / marginal_likelihood
  
  # Plot of the prior and posterior distributions for these observed data
  data_frame(
    theta = c(.3,.7,.3,.7),
    value = c(prior(.3),prior(.7),posterior(.3),posterior(.7)),
    type = c("Prior","Prior","Posterior","Posterior")
  ) %>%
    ggplot(aes(x = theta,y = value,fill = type)) +
    theme_classic() + 
    geom_bar(stat = "identity",position = "dodge",colour = "black") +
    labs(title = "Prior and Posterior for theta",
         subtitle = str_c("Observed data: ",sumx," heads in ",n," throws"),
         x = "Theta, probability of heads",
         y = "Prior/Posterior Probability",
         fill = "") +
    scale_x_continuous(breaks = c(0.30,0.70),labels = c("0.30","0.70")) +
    scale_y_continuous(labels = scales::percent_format()) +
    scale_fill_brewer(palette = "Reds")
  
}

Plotting is nice as it lets us compare how different observed data, and different experiments (number of throws) affect the prior/posterior balance of belief:

cowplot::plot_grid(
  visualize_binomial_priorposterior(6,6),
  visualize_binomial_priorposterior(6,10),
  visualize_binomial_priorposterior(6,20),
  visualize_binomial_priorposterior(6,50),
  visualize_binomial_priorposterior(0,10),
  visualize_binomial_priorposterior(1,10),
  visualize_binomial_priorposterior(7,10),
  visualize_binomial_priorposterior(10,10),
  ncol=2
)

Estimating \(\theta\)

The case where there are only two possible parameter values is useful for examples, but not common in practice. More realistic is the case where \(0 < \theta < 1\), and we need to use the observed data to estimate \(\theta\).

Intuition and frequentist results dictate that a “good” estimator of \(\theta\) is the sample proportion of heads, \[ \hat{\theta}_{freq} = \frac{1}{n}\sum_{i=1}^{n}X_{i} \] This makes sense; of course the best guess of the value of the probability of heads based on one sample is just the relative frequency with which \(X\) is heads in that sample. It’s also the maximum likelihood estimator, and the unbiased estimator with minimum variance. Let’s see how this estimator, which is optimal from a frequentist perspective, behaves compared to what we come up with using Bayesian estimation.
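
For concreteness, here is the frequentist estimate computed on a small simulated sample; the value \(\theta = 0.6\) used to simulate is made up purely for illustration:

set.seed(42)
x <- rbinom(10, size = 1, prob = 0.6)  # 10 simulated coin flips
mean(x)  # the sample proportion of heads, i.e. theta-hat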

The Prior

Bayesian inference proceeds as above, with the modification that our prior must be continuous and defined on the unit interval \((0,1)\). This reflects the fact that our parameter can take any value on the interval \((0,1)\). Choosing this is a subjective decision, and is slightly more difficult in the continuous case because interpreting densities is harder than interpreting discrete probability mass functions. In a nutshell, we should choose our prior distribution such that values of \(\theta\) that we think are reasonable have high probability under the prior, and values of \(\theta\) that we think are not reasonable have low probability under the prior. There is a lot more that goes into the choice of prior in more complicated applied problems; we’ll keep it simple for now.

A popular choice of prior for a binomial likelihood like we have here is the beta distribution, \[ \begin{aligned} \theta &\sim Beta(a,b) \\ f(\theta;a,b) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}, \quad 0 < \theta < 1, a > 0, b > 0 \\ E(\theta) &= \frac{a}{a+b} \\ Var(\theta) &= \frac{ab}{(a+b)^2 (a+b+1)} \end{aligned} \] The Beta distribution is defined on \((0,1)\) and itself has two parameters \(a,b\) which control the shape of the distribution, and its moments. If a parameter having a mean and variance makes you uncomfortable at first, try interpreting these quantities less literally; the mean is just the “centre” of the possible values of \(\theta\) as weighted by their probabilities, and the variance (or more accurately, the standard deviation) roughly describes the size of the region around the mean in which \(\theta\) is likely to fall.

Let’s visualize the prior distribution in order to help us specify the parameters \(a,b\) that will give us a reasonable prior:

# Generate plot data- dbeta evaluated at x and a,b for x on a grid between 0 and 1 and various values of a,b
# Run this code line by line if you have trouble reading it and figuring out what it does
expand.grid(a = c(.5,1,2),b = c(.5,1,2)) %>%
  pmap_dfr(~data_frame(x = seq(0.01,0.99,by=0.01),y = dbeta(x = x,shape1=.x,shape2=.y),a = .x,b = .y)) %>%
  ggplot(aes(x = x,y = y)) +
  theme_classic() +
  facet_wrap(a~b) +
  geom_line(colour = "purple") +
  labs(title = "Beta Distribution, various a and b",
       subtitle = "Top value is a, bottom is b",
       x = "Datapoint x",
       y = "Density f(x;a,b)") +
  theme(text = element_text(size = 22))

The Beta distribution is very flexible; different values of a and b give very different shapes. If we thought extreme values (close to 0 or 1) of \(\theta\) were likely, we could choose a prior with a and b both less than 1; if we think middle values are more likely, we can choose a and b to be greater than 1.

For our example, we will choose a Beta(12,12) distribution, for reasons we will discuss below in the section on choosing prior distributions. It looks like this:

data_frame(x = c(0.01,0.99)) %>%
  ggplot(aes(x = x)) +
  theme_classic() + 
  stat_function(fun = dbeta,
                args = list(shape1 = 12,shape2 = 12),
                colour = "blue") +
  labs(title = "Beta Prior for Theta",
       subtitle = "Bayesian Coin Flipping Example",
       x = "Theta",
       y = "Prior Density, p(Theta)") +
  scale_x_continuous(breaks = seq(0,1,by=0.1))

This prior puts strong weight on the coin being close to fair; values below \(\theta = 0.3\) or above \(\theta = 0.7\) have very little prior probability. This can be verified:

# Prior probability of theta being between 0.3 and 0.7
pbeta(0.7,shape1=12,shape2=12) - pbeta(0.3,shape1=12,shape2=12)
## [1] 0.957104

Most of the mass of the distribution is between \(0.3\) and \(0.7\).

The Posterior

Our prior has density \[ p(\theta;a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} \] Our likelihood remains the same as before: \[ p(X|\theta) = \theta^{\sum_{i=1}^{n}x_{i}}(1-\theta)^{n - \sum_{i=1}^{n}x_{i}} \] Bayes’ Rule is still used to compute the posterior from these quantities; however, it looks slightly different now: \[ \begin{aligned} p(\theta|X) &= \frac{p(X|\theta)p(\theta)}{p(X)} \\ &= \frac{p(X|\theta)p(\theta)}{\int_{0}^{1}p(X|\theta)p(\theta)d\theta} \end{aligned} \]

Now, because \(\theta\) is defined on a continuous interval, the marginal likelihood/model evidence/normalizing constant is computed by integrating the joint distribution of \(X\) and \(\theta\) over the range of \(\theta\).

In this example, the marginal likelihood and the posterior can be computed explicitly as follows: \[ \begin{aligned} p(X) &= \int_{0}^{1}p(X|\theta)p(\theta)d\theta \\ &= \int_{0}^{1} \theta^{\sum_{i=1}^{n}x_{i}}(1-\theta)^{n - \sum_{i=1}^{n}x_{i}} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} d\theta \\ &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \int_{0}^{1} \theta^{\sum_{i=1}^{n}x_{i} + a - 1}(1-\theta)^{n - \sum_{i=1}^{n}x_{i} + b - 1} d\theta \\ &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \frac{\Gamma(\sum_{i=1}^{n}x_{i} + a)\Gamma(n - \sum_{i=1}^{n}x_{i} + b)}{\Gamma(n + a + b)} \end{aligned} \]

How did we evaluate that integral and get to the last line? We recognized the integrand as being the \(\theta\)-dependent part of a \(Beta(\sum_{i=1}^{n}x_{i} + a,n - \sum_{i=1}^{n}x_{i} + b)\) density; hence it integrates to the reciprocal of the appropriate normalizing constant. This trick is commonly used in examples illustrating Bayesian inference; don’t conclude from it, though, that the integral is always this easy to evaluate. Beyond simple examples like this one, it is rarely easy, or even possible, to evaluate in closed form; more on this later.

With \(p(X)\) available, we can explicitly compute the posterior: \[ \begin{aligned} p(\theta|X) &= \frac{p(X|\theta)p(\theta)}{p(X)} \\ &= \frac{\theta^{\sum_{i=1}^{n}x_{i}}(1-\theta)^{n - \sum_{i=1}^{n}x_{i}} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}}{\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \times \frac{\Gamma(\sum_{i=1}^{n}x_{i} + a)\Gamma(n - \sum_{i=1}^{n}x_{i} + b)}{\Gamma(n + a + b)}} \\ &= \frac{\Gamma(n + a + b)}{\Gamma(\sum_{i=1}^{n}x_{i} + a)\Gamma(n - \sum_{i=1}^{n}x_{i} + b)} \times \theta^{\sum_{i=1}^{n}x_{i} + a - 1}(1-\theta)^{n - \sum_{i=1}^{n}x_{i} + b - 1} \end{aligned} \] which we recognize as a \(Beta(a + \sum_{i=1}^{n}x_{i},b + n - \sum_{i=1}^{n}x_{i})\) distribution. Priors where the posterior belongs to the same family of distributions as the prior are called conjugate priors, and while they represent the minority of practical applications, they are very useful for examples.
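
We can also check the conjugacy result numerically. The following sketch uses the \(Beta(12,12)\) prior and the running example of \(6\) heads in \(10\) flips, computes \(p(X)\) by numerical integration, and compares the normalized posterior to the claimed Beta density at a few values of \(\theta\):

a <- 12; b <- 12; sumx <- 6; n <- 10
unnormalized_posterior <- function(theta) {
  theta^sumx * (1 - theta)^(n - sumx) * dbeta(theta, a, b)
}
marginal <- integrate(unnormalized_posterior, 0, 1)$value
theta_grid <- c(0.3, 0.5, 0.7)
cbind(numerical = unnormalized_posterior(theta_grid) / marginal,
      conjugate = dbeta(theta_grid, a + sumx, b + n - sumx))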

In this scenario, we can interpret the likelihood as directly updating the prior parameters, from \((a,b)\) to \((a + \sum_{i=1}^{n}x_{i},b + n - \sum_{i=1}^{n}x_{i})\). Let’s visualize this for a few different datasets and sample sizes, for our chosen prior:

prior <- function(theta) dbeta(theta,shape1 = 12,shape2 = 12)
posterior <- function(theta,sumx,n) dbeta(theta,shape1 = 12 + sumx,shape2 = 12 + n - sumx)

data_frame(x = c(0.01,0.99)) %>%
  ggplot(aes(x = x)) +
  theme_classic() + 
  stat_function(fun = prior,
                colour = "blue") +
  stat_function(fun = posterior,
                args = list(sumx = 5,n = 10),
                colour = "purple") +
  stat_function(fun = posterior,
                args = list(sumx = 0,n = 10),
                colour = "red") +
  stat_function(fun = posterior,
                args = list(sumx = 10,n = 10),
                colour = "orange") +
  labs(title = "Beta Prior vs Posterior for Theta, 10 coin flips",
       subtitle = "Blue: Prior. Purple: 5 heads. Red: 0 heads. Orange: 10 heads",
       x = "Theta",
       y = "Density") +
  scale_x_continuous(breaks = seq(0,1,by=0.1))

Some interesting points can be seen from this plot:

  • When the observed data “matches” the prior, in the sense that we observe a dataset for which the original frequentist estimate of \(\theta\) is very probable under the prior, the posterior becomes more peaked around that value.
  • When the observed data are extreme, as in the case of 0 or 10 heads in 10 flips, the frequentist inference would also be extreme. In these cases, we would have estimated \(\hat{\theta}_{freq} = 0\) or \(1\). Because our prior distribution is not extreme, though, the posterior is more moderate, and ends up being peaked at the low/high end of the range of values that were reasonable under the prior.

Check out this shiny app to get a better feel for how the prior, observed data and sample size influence the location and shape of the posterior under this model.

Point and Interval Estimation in Bayesian Inference

With the posterior in hand, what do we actually do? We’re used to immediately having a point estimate from frequentist inference, and there we typically proceed to derive a confidence interval for the parameter using the sampling distribution of the estimator.

The situation isn’t so different here. The posterior is analogous to the sampling distribution in the frequentist case, although the interpretation is different. In frequentist inference, we make statements about probable values for estimates in repeated sampling at fixed parameter values, and use this to infer the value of the parameter under which our observed data was most likely. In Bayesian inference, the posterior distribution is interpreted literally as the conditional distribution of the parameter given the data. We can just directly say things like “there is a \(95\%\) chance that \(\theta\) is between \(0.4\) and \(0.5\)”.

Specifically, to obtain point estimates of parameters, we may use either the posterior mean: \[ \hat{\theta}_{post. mean} = E(\theta|X) = \int_{\theta}\theta p(\theta|X)d\theta \] interpreted as a weighted average of possible values of \(\theta\), with weights corresponding to their posterior probabilities (densities); or, we may use the posterior mode: \[ \hat{\theta}_{post. mode} = \mbox{argmax}_{\theta} p(\theta|X) \] which is the most probable value of \(\theta\), given the observed data. In simple examples the posterior may be symmetric or nearly symmetric, so the two are nearly the same; in more complicated applications, either one is directly preferred, or both are computed and compared.
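
For the Beta posterior in our running example, both quantities are available in closed form; a small sketch, assuming the \(Beta(12,12)\) prior and \(6\) heads in \(10\) flips:

post_mean <- function(sumx, n) (12 + sumx) / (24 + n)          # mean of Beta(12 + sumx, 12 + n - sumx)
post_mode <- function(sumx, n) (12 + sumx - 1) / (24 + n - 2)  # mode of the same Beta
c(mean = post_mean(6, 10), mode = post_mode(6, 10))  # both roughly 0.53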

For interval estimation, the frequentist notion of a confidence interval is replaced by a credible interval: an interval which has a specified probability of containing the parameter, given the observed data. Contrast this interpretation to that of the frequentist confidence interval, which states that a certain proportion of the intervals computed in the same manner from repeatedly sampled datasets would contain the parameter. The Bayesian credible interval interpretation is closer to how many people would interpret such an interval intuitively.

Computing such a credible interval is a matter of finding the corresponding quantiles of the posterior, which is either simple or complicated depending on what the posterior is. In our example the posterior is known completely, and we can get a \(95\%\) credible interval using the qbeta function:

# E.g. for n = 10 and sumx = 5
c(qbeta(0.025,shape1=12 + 5,shape2 = 12 + 5),qbeta(0.975,shape1=12 + 5,shape2 = 12 + 5))
## [1] 0.3354445 0.6645555

The point estimate based on the posterior mode is the same as the frequentist estimate for these data, \(\hat{\theta}_{freq} = \hat{\theta}_{post. mode} = 0.5\), as can be seen from the plot above. The corresponding frequentist confidence interval is given by

c(.5 - sqrt(.5*.5/10)*1.96,.5 + sqrt(.5*.5/10)*1.96)
## [1] 0.1900968 0.8099032

which is much wider. The Bayesian approach gives a much narrower interval here, because we assumed strong prior information that ended up agreeing with the data. If the data had been more extreme, say \(X = 1\) head in \(n = 10\) flips, the frequentist point estimate would be \(\hat{\theta}_{freq} = 0.1\), with confidence interval

c(.1 - sqrt(.1*.9/10)*1.96,.1 + sqrt(.1*.9/10)*1.96)
## [1] -0.08594193  0.28594193

Observing a single head in \(10\) tosses leads us to believe strongly that \(\theta\) must be small, and the corresponding confidence interval actually goes beyond the parameter space. It is true that if the coin were fair (\(\theta = 0.5\)), the observed data had about a \(1\%\) chance of occurring. However, if we had a prior belief that the coin had a probability of heads that is anywhere between \(0.3\) and \(0.7\), as above, we can compute the point and interval estimates obtained in a Bayesian setting:

# Point estimate- find the posterior mode, which is a critical point
# Use R's built in optimization tools, function nlminb() performs constrained minimization
# Pass it a function that returns minus the posterior; minimizing -f(x) is the same as maximizing f(x)
minus_posterior <- function(theta,sumx,n) -1 * posterior(theta,sumx,n)
opt <- nlminb(objective = minus_posterior,start = 0.3,sumx = 1,n = 10,lower = 0.01,upper = 0.99)
opt$par # Return the value at which the maximum occurs, i.e. the posterior mode
## [1] 0.375
# Credible interval
c(qbeta(0.025,shape1=12 + 1,shape2 = 12 + 9),qbeta(0.975,shape1=12 + 1,shape2 = 12 + 9))
## [1] 0.2290662 0.5487546

This interval is much more reasonable, stays within the parameter space, and still even includes the possibility that the coin is fair. Had we increased the sample size and observed a similarly extreme result, the posterior would become more concentrated in that region of the parameter space; that is, observing an equally extreme result with more data would diminish the effect of the prior on the resulting inferences.
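
Here is a small sketch of that last point: the same observed proportion of heads (\(10\%\)) at increasing sample sizes. The posterior mode moves steadily towards the data as \(n\) grows, diminishing the influence of the \(Beta(12,12)\) prior:

sapply(c(10, 100, 1000), function(n) {
  sumx <- 0.1 * n  # 10% heads at each sample size
  c(n = n, posterior_mode = (12 + sumx - 1) / (24 + n - 2))
})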

You can play around in the above-mentioned shiny app to get a feel for how the comparison between frequentist confidence intervals and Bayesian credible intervals works in this example.

Choosing a Prior

The choice of prior distribution is up to the analyst. There is no formula for doing this that will work in every problem; we can, though, discuss a few guidelines for doing so. When choosing a prior, you should consider at a minimum:

  • What values can the parameter take? The prior should be a distribution defined on that range.
  • Which values of the parameter do you believe are plausible, and how strongly? The prior should place most of its mass on those values.
  • How sensitive are the resulting inferences to the choice of prior and its hyperparameters, across datasets that you might plausibly observe?

These are just some of the questions to ask when choosing a prior. It may sound like more work than in the frequentist paradigm, but an advantage of the Bayesian approach is that it makes it relatively simple for us to ask these questions of our modelling procedure.

How did we choose our prior for the coin-flipping example? To begin, we knew that the parameter of interest, \(\theta\), was bounded on \((0,1)\) and could take any real value in that region, so we considered only distributions that were continuous and defined on \((0,1)\). That alone narrowed it down; then we thought about what shape we wanted the distribution to have. We didn’t really have any idea about this, so we picked a distribution with a very flexible shape. We then chose hyperparameters (the parameters of the prior distribution, that we specify in advance) that gave us a reasonable location and spread of this distribution (more on this below). We then did a sensitivity analysis, showing the prior/posterior for various potential observed datasets, and even used a simple shiny app to get a feel for how different priors and datasets would combine in the posterior. All this was to ensure that our choice gives reasonable inferences for datasets that we could plausibly observe.

If this sounds like it should be easy, it isn’t. Above, I used a concept we haven’t learned yet that makes a Beta distribution an obvious choice of prior on \(\theta\) for a \(Bern(\theta)\) likelihood: the Beta is the conjugate prior for the Bernoulli.

Conjugate Priors

A conjugate prior, in relation to a specific likelihood, is a prior that when combined with that likelihood gives a posterior with the same functional form as that prior. The Beta/Bernoulli we saw above is an example of this, because we found: \[ \begin{aligned} \mbox{Prior: } & p(\theta) \propto \theta^{a-1}(1-\theta)^{b-1} \\ \mbox{Likelihood: } & \ell(X|\theta) \propto \theta^{\sum_{i=1}^{n}x_{i}}(1-\theta)^{n - \sum_{i=1}^{n}x_{i}} \\ \implies \mbox{ Posterior: } & p(\theta | X) \propto \theta^{a + \sum_{i=1}^{n}x_{i} - 1}(1-\theta)^{b + n - \sum_{i=1}^{n}x_{i} - 1} \end{aligned} \] The prior has the form \(\theta^{a-1}(1-\theta)^{b - 1}\), and the posterior has the form \(\theta^{c-1}(1-\theta)^{d - 1}\), with \(c\) and \(d\) depending on \(a\) and \(b\) as well as the data. The posterior has the same functional form as the prior, with parameters that are updated after the data is observed.

Conjugate priors are great because they are mathematically tractable, and allow us to easily evaluate the impact of the prior distribution on the posterior under different datasets. It is often not possible, though, to find a conjugate prior for a given likelihood in anything but the simplest examples. Here are some common likelihoods and their conjugate priors; as an exercise, verify that each posterior is in the same family as the prior, and find expressions for the updated parameters:

  • Likelihood: Bernoulli or Binomial, \(P(X = x) = \theta^{x}(1-\theta)^{1-x}\). Prior: \(\theta \sim Beta(a,b)\). Posterior: ???
  • Likelihood: Poisson, \(P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}\). Prior: \(\lambda \sim Gamma(a,b)\). Posterior: ???
  • Likelihood: Normal, \(f(x|\mu,\tau) = \sqrt{\frac{\tau}{2\pi}}\exp\left( -\frac{\tau}{2} (x - \mu)^{2} \right)\) (note \(\tau = 1/\sigma^{2}\) is called the precision, and is the inverse of the variance). Prior: \(\mu \sim Normal(m,v)\), \(\tau \sim Gamma(a,b)\). Posterior: ???

Wikipedia has a great list containing many more examples.
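
As a partial numerical check of the Poisson/Gamma pair above, the following sketch (with made-up hyperparameters \(a = 2\), \(b = 1\) and a small simulated Poisson sample) compares the numerically normalized posterior to a Gamma density with updated parameters, which you should also derive analytically as part of the exercise:

set.seed(1)
x <- rpois(20, lambda = 3)
a <- 2; b <- 1
unnormalized_posterior <- function(lambda) {
  dgamma(lambda, shape = a, rate = b) * sapply(lambda, function(l) prod(dpois(x, l)))
}
marginal <- integrate(unnormalized_posterior, 0, 20)$value
lambda_grid <- c(2, 3, 4)
cbind(numerical = unnormalized_posterior(lambda_grid) / marginal,
      conjugate = dgamma(lambda_grid, shape = a + sum(x), rate = b + length(x)))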

Setting Hyperparameters

When using a conjugate prior (or any prior), once a family of distributions like Beta, Normal, Gamma, etc is chosen, the analyst still needs to set hyperparameter values. We did this above- first we chose a \(Beta(a,b)\) family of distributions, then we went a step further and actually specified \(a = 12\) and \(b = 12\). How did we come up with such wonky looking values of \(a\) and \(b\)? We will discuss two ways here.

Moment-Matching

A very direct way to encode your prior beliefs about the range of reasonable values of a parameter into a prior distribution is by setting hyperparameters via moment-matching. Analogous to the Method of Moments in frequentist estimation, we pick prior moments (mean, variance, etc.) that give us a sensible range of values for the parameter, then find the prior hyperparameters that give us those moments.

This is where we got the \((12,12)\) in the above example. Suppose we think that, prior to seeing any data, \(\theta\) is most likely to be around \(0.5\), with values in either direction away from this being equally likely, and that \(\theta\) is most likely between \(0.3\) and \(0.7\). Translate this statement into mathematical terms: we think the prior should be peaked at \(0.5\) and be symmetric about that value, which implies that its mean is also \(0.5\). We think that “most” of the mass should be between \(0.3\) and \(0.7\), so let’s say that \(0.3\) and \(0.7\) should be two standard deviations away from \(E(\theta) = 0.5\). This gives \(SD(\theta) = 0.1\).

A \(Beta(a,b)\) distribution has mean \(E(\theta) = \frac{a}{a+b}\) and \(Var(\theta) = \frac{ab}{(a+b)^{2}(a+b+1)}\). Moment-matching proceeds by setting these equal to the values we decided on above, and solving for \(a\) and \(b\): \[ \begin{aligned} E(\theta) = \frac{a}{a+b} &= 0.5 \\ Var(\theta) = \frac{ab}{(a+b)^{2}(a+b+1)} &= 0.1^{2} \\ \implies (a,b) &= (12,12) \end{aligned} \] As an exercise, verify the solutions to the above equations. We can verify that our answer is correct computationally by taking a sample from a Beta distribution with these parameters, and checking that the mean and standard deviation are close to what we want:

x <- rbeta(1000,12,12)
c(mean(x),sd(x))
## [1] 0.5062132 0.1031181
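
If you prefer to solve the moment equations directly rather than checking by simulation: symmetry forces \(a = b\), in which case the variance reduces to \(1/(4(2a+1))\), and setting that equal to \(0.1^{2}\) gives

a <- (1 / (4 * 0.1^2) - 1) / 2  # solve 1 / (4 * (2a + 1)) = 0.1^2 for a
a  # 12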

Empirical Bayes

Another method for choosing hyperparameters is to estimate them from the data. This seems to contradict the basic idea of a prior distribution (to encode our beliefs about the values of the parameters, before seeing any data); in practice, this manner of hyperparameter estimation is useful for more complicated models being fit to larger datasets, and is somewhat analogous to semi- or non-parametric estimation in the frequentist paradigm.

Empirical Bayes, sometimes called Type-II Maximum Likelihood, proceeds in exactly the same way as before, by combining a prior and likelihood to obtain a posterior. The prior hyperparameters, though, are chosen not to subjectively encode our beliefs about what values the parameter might take on; rather, they are estimated by fitting the prior distribution to the data itself.

To illustrate this, consider the above example with a \(Beta(a,b)\) prior on the coin-flipping likelihood. We have a problem, though; having performed the experiment only once, we don’t actually have enough data to estimate the prior hyperparameters and use these to inform a posterior distribution. One perceived weakness of Empirical Bayes is that it doesn’t work in very simple problems. However, in very simple problems it is not needed, as we can (as we did above) use reason to choose a prior and easily investigate the effects of this choice on our resulting inferences. When the data is bigger and the models more complex, choosing the prior manually may be difficult, and Empirical Bayes may be a useful tool.

We will modify our experiment as follows. Instead of throwing the coin a bunch of times, calling that our dataset, and trying to estimate \(\theta\), we will throw the coin a bunch of times, say \(n\), and compute the proportion of heads… and then repeat that a bunch of times, say \(N\), resulting in a dataset of \(N\) proportions of heads from \(n\) independent throws, all with the same probability of heads, \(\theta\). We then use this dataset to draw inferences about \(\theta\).

Let’s take a look at what such a dataset might look like:

n <- 100
N <- 500
truetheta <- 0.4
set.seed(80545)
flips <- data_frame(x = rbinom(N,n,truetheta) / n)

flips %>%
  ggplot(aes(x = x)) +
  theme_classic() + 
  geom_histogram(bins = 30,colour = "black",fill = "purple") +
  labs(title = "Observed Sample Proportions",
       subtitle = str_c(N," repetitions of flipping a coin ",n," times"),
       x = "Sample Proportion of Heads",
       y = "Number of Times")

Empirical Bayes essentially treats each observed datapoint as being a point estimate of \(\theta\), and hence the dataset as being a random sample from the prior distribution. We then proceed by fitting a \(Beta(a,b)\) distribution to these data by maximum likelihood:

# No closed-form formula for (a,b); estimate numerically
minusloglik <- function(params) -sum(dbeta(flips$x,shape1 = params[1],shape2 = params[2],log = TRUE))

# Minimize minus the log-likelihood
# Get starting values using the method of moments
xb <- mean(flips$x)
vb <- var(flips$x)
astart <- xb * ((xb * (1-xb))/vb - 1)
bstart <- (1-xb) * ((xb * (1-xb))/vb - 1)
c(astart,bstart)
## [1] 40.12836 59.95234
opt <- nlminb(start = c(astart,bstart),objective = minusloglik,lower=c(0,0))
opt$par
## [1] 40.18212 60.03573

The method of moments and maximum likelihood estimates are nearly identical; in practice any frequentist estimation method could be used.

The fit isn’t bad:

flips %>%
  ggplot(aes(x = x)) +
  theme_classic() + 
  geom_histogram(aes(y = ..density..),bins = 30,colour = "black",fill = "purple") +
  stat_function(fun = dbeta,
                args = list(shape1 = opt$par[1],shape2 = opt$par[2]),
                colour = "orange") +
  labs(title = "Observed Sample Proportions - Beta Distribution Fit by Maximum Likelihood",
       subtitle = str_c(N," repetitions of flipping a coin ",n," times"),
       x = "Sample Proportion of Heads",
       y = "Density")

We now use a \(Beta(a,b)\) distribution as our prior, with \(a\) and \(b\) computed as above as \(a = 40.18\) and \(b = 60.04\). This gives a prior with mean \[ E(\theta) = \frac{a}{a+b} = 0.4 \] and standard deviation \[ SD(\theta) = \sqrt{\frac{ab}{(a+b)^{2}(a+b+1)}} = 0.05 \] Recalling that the value of \(\theta\) used to generate the observed data was \(0.40\), the prior is sharply peaked at the “true” value.
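
These quoted values are easy to verify from the fitted hyperparameters:

a_eb <- opt$par[1]
b_eb <- opt$par[2]
c(prior_mean = a_eb / (a_eb + b_eb),
  prior_sd = sqrt(a_eb * b_eb / ((a_eb + b_eb)^2 * (a_eb + b_eb + 1))))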

We can compare the resulting posterior to that which we would obtain by using a moment-matching prior having our previously chosen standard deviation. As an exercise, verify that the posterior distribution is a \(Beta(c,d)\) distribution with \[ \begin{aligned} c &= a + \sum_{i=1}^{N}\sum_{j=1}^{n}x_{ij} \\ d &= b + Nn - \sum_{i=1}^{N}\sum_{j=1}^{n}x_{ij} \\ \end{aligned} \] where \(x_{ij}\) is the \(j^{th}\) flip in the \(i^{th}\) experiment, so \(\sum_{i=1}^{N}\sum_{j=1}^{n}x_{ij}\) is the total number of heads across all experiments.
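
As a partial check of the arithmetic in that exercise (the helper names here are just for illustration), the total number of heads and the implied Empirical Bayes posterior parameters are:

total_heads <- n * sum(flips$x)  # flips$x are proportions out of n flips each
c_post <- opt$par[1] + total_heads
d_post <- opt$par[2] + N * n - total_heads
c(total_heads = total_heads, c = c_post, d = d_post)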

# Moment-matching parameters: E(theta) = 0.4, SD(theta) = 0.1
a_momatch <- 0.4 * (0.4*0.6 / 0.1^2 - 1)
b_momatch <- 0.6 * (0.4*0.6 / 0.1^2 - 1)
c(a_momatch,b_momatch)
## [1]  9.2 13.8
plt1 <- data_frame(x = c(0,1)) %>%
  ggplot(aes(x = x)) +
  theme_classic() +
  stat_function(fun = dbeta,
                args = list(shape1 = a_momatch,shape2 = b_momatch),
                colour = "purple") +
  stat_function(fun = dbeta,
                args = list(shape1 = opt$par[1],shape2 = opt$par[2]),
                colour = "blue") +
  labs(title = "Comparison of Prior Distributions",
       subtitle = "Blue: Empirical Bayes. Purple: Moment-Matching",
       x = "Theta",
       y = "Prior Density"
  ) +
  scale_x_continuous(breaks = seq(0,1,0.1))

plt2 <- data_frame(x = c(0,1)) %>%
  ggplot(aes(x = x)) +
  theme_classic() +
  stat_function(fun = dbeta,
                args = list(shape1 = a_momatch + n * sum(flips$x),shape2 = b_momatch + N*n - n*sum(flips$x)),
                colour = "purple") +
  stat_function(fun = dbeta,
                args = list(shape1 = opt$par[1] + n * sum(flips$x),shape2 = opt$par[2] + N*n - n*sum(flips$x)),
                colour = "blue") +
  labs(title = "Comparison of Posterior Distributions",
       subtitle = "Blue: Empirical Bayes. Purple: Moment-Matching",
       x = "Theta",
       y = "Posterior Density"
  ) +
  scale_x_continuous(breaks = seq(0,1,0.1))

cowplot::plot_grid(plt1,plt2,nrow=1)

Using the data twice results in a much narrower distribution around the true value of the parameter. We could have chosen a prior by moment-matching that was this narrow, but we probably wouldn’t have. So is Empirical Bayes always better?

We had a lot of data, and a very simple model. As the volume of data greatly outpaces the model complexity, it becomes less and less likely to see individual datasets that are improbable under the model. It is reasonable in this case for the inference procedure to rely more strongly on the observed data. If, however, we had a small dataset or a very complex model, it would be more likely that our observed data lies farther from the truth. We would then want to use an inference procedure that more strongly relied on our beliefs about what values of the parameters are likely.

Try the above experiment, but decrease the values of \(n\) and \(N\). If the original dataset just happens to be centered around some value of \(\theta\) other than the true value, the Empirical Bayes prior will be peaked around the wrong point, and it will be very difficult for the data to bring it back, since the data is also centred at the wrong point.
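
Here is a sketch of that suggested experiment with much smaller \(n\) and \(N\) (the particular values and starting points are arbitrary). Note that with so few flips per experiment, sample proportions of exactly \(0\) or \(1\) can occur, which the Beta likelihood cannot handle, so they are nudged inward slightly:

set.seed(1234)
n_small <- 5
N_small <- 10
flips_small <- data_frame(x = rbinom(N_small, n_small, truetheta) / n_small)
minusloglik_small <- function(params) {
  x_clamped <- pmin(pmax(flips_small$x, 0.01), 0.99)  # avoid log(0) at proportions of exactly 0 or 1
  -sum(dbeta(x_clamped, shape1 = params[1], shape2 = params[2], log = TRUE))
}
nlminb(start = c(2, 3), objective = minusloglik_small, lower = c(0.01, 0.01))$par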

Full Example: Bayesian Regression

For an example, we’ll fit a curve to Toronto 311 contact centre wait times. 311 is a service operated by the City of Toronto in which residents can call (dial *311) and submit service requests for things like potholes, excess garbage pickup, downed trees, and so on. People call in to the service and speak to an actual human being in a contact centre. Data on the daily average wait time for such calls for the period from December 28th, 2009 to January 2nd, 2011 is available from Open Data Toronto here, and stored on github for your use.

It might be of interest to the call centre operations teams to estimate average wait time on a given day of the year; such estimates could be used for planning purposes. Realistically, modelling just the trend as a function of date probably isn’t useful from an operations standpoint, but such an exercise is often required as a proof-of-concept, in order to justify spending effort on a more complicated analysis.

Can we do it, and how? Let’s take a look at the data:

# Read data
contactdat <- readr::read_csv("https://media.githubusercontent.com/media/awstringer1/leaf2018/gh-pages/datasets/contact-centre.csv",col_types = "cd") %>%
  mutate_at("date",lubridate::mdy)

glimpse(contactdat)
## Observations: 371
## Variables: 2
## $ date         <date> 2009-12-28, 2009-12-29, 2009-12-30, 2009-12-31, ...
## $ answer_speed <dbl> 55, 67, 56, 15, 9, 62, 51, 20, 16, 14, 23, 33, 14...
contactdat %>%
  ggplot(aes(x = date,y = answer_speed)) +
  theme_classic() +
  geom_point(pch = 21,colour = "black",fill = "orange") +
  labs(title = "Average Wait Time by Day, Toronto 311 Contact Centre",
       x = "Date",
       y = "Wait Time")

The extreme wait times in the summer of 2010 might be explainable if we went back and read the news to figure out what was happening then. We can make them not so drastic by instead modelling the log of wait time:

contactdat %>%
  ggplot(aes(x = date,y = log(answer_speed))) +
  theme_classic() +
  geom_point(pch = 21,colour = "black",fill = "orange") +
  labs(title = "Average Wait Time by Day, Toronto 311 Contact Centre",
       x = "Date",
       y = "log(Wait Time)")

While the variance in log(answer_speed) is large, it does look like there is some kind of trend.

The simplest thing we can think to do is fit a normal linear model: \[ y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i} \] where \(y_{i}\) is the (log) average wait time on the \(i^{th}\) date, \(x_{i}\) is a suitable numerical value for the \(i^{th}\) date (e.g. number of days since some reference point), and \(\epsilon_{i} \sim N(0,\sigma^{2})\) is the error of the \(i^{th}\) observation about its mean.

Since the apparent relationship between these quantities is not linear, a natural thing to do is to fit a polynomial model, of the form \[ y_{i} = x_{i}^{\prime}\beta + \epsilon_{i} \] where now \(\beta = (\beta_{0},\beta_{1},\ldots,\beta_{p})\) and \(x_{i} = (1,x_{i},x_{i}^{2},\ldots,x_{i}^{p})\). This allows us to fit, for different \(p\), curves of any shape, using the normal linear model theory.

How to choose \(p\) though? Is a quadratic model okay, or do we need a cubic or quartic or higher? Let’s look at the residual sum of squared errors and the graphical displays of these fits for various values of \(p\):

# Degrees to try
ptotry <- c(1:10,100,371)
# Standardize both variables to have mean 0 and variance 1, so they are interpreted on the same scale
contact_std <- contactdat %>%
  mutate_at("date",as.numeric) %>%
  mutate_at("answer_speed",log) %>%
  mutate_all(funs( (. - mean(.)) / sd(.)))

# Fit for different values of p
model_list <- ptotry %>%
  map(~lm(answer_speed ~ poly(date,degree = .,raw = TRUE),data = contact_std)) 

names(model_list) <- ptotry
# Plot
plot_poly <- function(p) {
  plt <- data_frame(x = seq(-2,2,by = 0.1),
             y = predict(model_list[[as.character(p)]],newdata = data_frame(date = x))) %>%
    ggplot(aes(x = x,y = y,group = 1)) +
    theme_classic() +
    geom_point(data = contact_std,
               mapping = aes(x = date,y = answer_speed),
               colour = "black",
               fill = "lightblue",
               pch = 21) +
    geom_line(colour = "purple") +
    labs(title = "Polynomial fit to contact time data",
         subtitle = str_c("Degree = ",p),
         x = "Date (standardized)",
         y = "log(Wait Time) (standardized)") +
    ylim(-3,3)
  
  return(plt)
}

# Plot them all!
ptotry %>%
  map(plot_poly) %>%
  cowplot::plot_grid(plotlist = .,ncol = 2)
## Warning in predict.lm(model_list[[as.character(p)]], newdata =
## data_frame(date = x)): prediction from a rank-deficient fit may be
## misleading

## Warning in predict.lm(model_list[[as.character(p)]], newdata =
## data_frame(date = x)): prediction from a rank-deficient fit may be
## misleading
## Warning: Removed 3 rows containing missing values (geom_path).
## Warning: Removed 5 rows containing missing values (geom_path).

## Warning: Removed 5 rows containing missing values (geom_path).
## Warning: Removed 6 rows containing missing values (geom_path).
## Warning: Removed 7 rows containing missing values (geom_path).

There is a lot of variance in the data. As the degree of the polynomial becomes higher, the curve fits closer and closer to the observed data. How close is close enough? At what point do we start fitting to noise?

Let’s look at the sum of squared errors, and the corresponding estimated residual standard deviation, for each model:

model_list %>%
  map(~c(ssr = sum(residuals(.)^2),sigma = summary(.)$sigma)) %>%
  reduce(bind_rows) %>%
  bind_cols(data_frame(degree = ptotry))
## # A tibble: 12 x 3
##         ssr     sigma degree
##       <dbl>     <dbl>  <dbl>
##  1 365.3286 0.9950127      1
##  2 276.0048 0.8660330      2
##  3 255.1806 0.8338556      3
##  4 234.7854 0.8009308      4
##  5 234.3715 0.8013200      5
##  6 234.1749 0.8020832      6
##  7 234.0080 0.8029011      7
##  8 230.3057 0.7976238      8
##  9 222.7972 0.7855996      9
## 10 219.1650 0.7802511     10
## 11 172.9929 0.7307043    100
## 12 167.3747 0.7359795    371

As degree goes up, the sum of squared residuals goes down, indicating the curve fits closer to the observed data. It seems like fitting more complicated curves is always better.

Looking closer at the models, though, we see something alarming. Let’s look at the actual coefficients for the first 10 models:

# Get the first d slope coefficients from a model and return them in a named vector
get_d_coef <- function(d,mod) {
  
  coef_vec <- coef(mod)
  out <- numeric(d+2)
  out[1:length(coef_vec)] <- coef_vec
  out[d+2] <- length(coef_vec) - 1
  names(out) <- c("Intercept",str_c("beta_",1:d),"degree")
  round(out,4)
}

# Get the coefficients for each model
all_model_coefs <- model_list[1:10] %>%
  map(~get_d_coef(10,.)) %>%
  reduce(bind_rows)

all_model_coefs[ ,1:6]
## # A tibble: 10 x 6
##    Intercept  beta_1  beta_2  beta_3 beta_4  beta_5
##        <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1    0.0000  0.1124  0.0000  0.0000 0.0000  0.0000
##  2    0.5486  0.1124 -0.5501  0.0000 0.0000  0.0000
##  3    0.5486  0.6560 -0.5501 -0.3028 0.0000  0.0000
##  4    0.8124  0.6560 -1.4318 -0.3028 0.3438  0.0000
##  5    0.8124  0.7760 -1.4318 -0.4901 0.3438  0.0564
##  6    0.8383  0.7760 -1.6139 -0.4901 0.5264  0.0564
##  7    0.8383  0.8799 -1.6139 -0.8027 0.5264  0.2862
##  8    0.9510  0.8799 -2.9694 -0.8027 3.0187  0.2862
##  9    0.9510 -0.0026 -2.9694  3.5242 3.0187 -5.3551
## 10    1.0626 -0.0026 -5.0216  3.5242 8.9647 -5.3551
all_model_coefs[ ,7:12]
## # A tibble: 10 x 6
##     beta_6  beta_7 beta_8  beta_9 beta_10 degree
##      <dbl>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>
##  1  0.0000  0.0000 0.0000  0.0000  0.0000      1
##  2  0.0000  0.0000 0.0000  0.0000  0.0000      2
##  3  0.0000  0.0000 0.0000  0.0000  0.0000      3
##  4  0.0000  0.0000 0.0000  0.0000  0.0000      4
##  5  0.0000  0.0000 0.0000  0.0000  0.0000      5
##  6 -0.0448  0.0000 0.0000  0.0000  0.0000      6
##  7 -0.0448 -0.0476 0.0000  0.0000  0.0000      7
##  8 -1.4888 -0.0476 0.2586  0.0000  0.0000      8
##  9 -1.4888  2.6465 0.2586 -0.4253  0.0000      9
## 10 -7.4523  2.6465 2.6794 -0.4253 -0.3417     10

As the degree of the polynomial gets bigger, so do the absolute values of the point estimates of the regression coefficients, and they start pointing in different directions, in an attempt to cancel each other out. The values don’t look large, but remember that since we standardized the covariate and response, they are interpreted as standard deviations from the mean.

The coefficients of larger models get even crazier:

data_frame(beta = c("Intercept",str_c(1:20)),
           value = coef(lm(answer_speed ~ poly(date,degree = 20,raw = TRUE),data = contact_std)) %>% round(4)
) %>%
  print(n = Inf)
## # A tibble: 21 x 2
##         beta     value
##        <chr>     <dbl>
##  1 Intercept    1.2528
##  2         1    0.7235
##  3         2   -8.8965
##  4         3  -15.0357
##  5         4    9.0593
##  6         5   96.2621
##  7         6   73.6388
##  8         7 -228.3528
##  9         8 -253.1213
## 10         9  275.6836
## 11        10  363.7286
## 12        11 -193.1107
## 13        12 -292.4617
## 14        13   82.4425
## 15        14  141.4553
## 16        15  -21.2964
## 17        16  -40.9669
## 18        17    3.0768
## 19        18    6.5570
## 20        19   -0.1917
## 21        20   -0.4465

While the higher degree polynomials may “fit the data better”, the underlying models are based on wildly variable (and probably inaccurate) coefficient estimates, and this is not desirable.

A Bayesian Formulation

There are a number of ways, many of them equivalent to each other, to address this. Following the theme of this lab, we’ll do it in a Bayesian framework: put a prior on \(\beta\) that says large values of the coefficients are extremely unlikely, and then compute and analyze the posterior for \(\beta\) given the prior and the data.

The likelihood for the data \(y\) in this problem is \[ y_{i}|\beta \sim N(x_{i}^{\prime}\beta,\sigma^{2}) \] From the discussion of conjugate priors above, and from your own completion of the associated exercise, we know that the conjugate prior for this distribution is also normal: \[ \beta \sim N(m,\Sigma) \] with \(m = (m_{0},m_{1},\ldots,m_{p})\) representing the prior mean and \(\Sigma\) representing the prior covariance matrix. Moment-matching is easy here, because the prior hyperparameters correspond directly to the first two moments of the prior distribution, so we don’t have to solve any equations like for the Beta example. We can figure this out as follows:

  • We’ll take \(m = 0\), because we simply don’t have any reason to believe that the curve we’re fitting takes any particular shape, prior to looking at the dataset.
  • We’ll take the prior covariance matrix to be diagonal, \(\Sigma = s^{2} \times I\) partly because multivariate stats is not a prerequisite for this lab; also partly because this is reasonable. Inducing prior covariance is sometimes a great idea, but it’s too complicated a discussion for this example.
  • We’ll choose the prior variance, \(s^{2}\), to enforce our belief that the values of the coefficients in the model that generated the data are actually small. What does “small” mean? This is highly subjective; interpreting these regression coefficients as being “standard deviations away from the predicted mean response due to a unit standard deviation change in the covariates”, which is a mouthful, leads to the notion that a \(\beta_{j}\) of 1 is probably getting pretty big. Hence we will choose \(s^{2} = 0.5^{2}\), which gives a prior standard deviation of \(0.5\), so that any value more than 2 standard deviations away from the prior mean, i.e. any coefficient greater than 1 in absolute value, is unlikely under the prior. When running the code here, though, you should definitely change this and note the effect that it has on the resulting curve, because this is super important.

For simplicity, we will take the data variance \(\sigma^{2}\) to be fixed at its sample value, \(\hat{\sigma}^{2} = 1\). It equals 1 precisely because we divided the data by its sample standard deviation when we standardized. A fully Bayesian treatment would put a prior on this as well.

Using standard results about joint and conditional normal random variables leads us to the posterior distribution: \[ \begin{aligned} \beta | y &\sim N(m_{post},\Sigma_{post}) \\ m_{post} &= \left( \frac{1}{s^{2}}I + \frac{1}{\sigma^{2}}X^{\prime}X\right)^{-1}\left( \frac{1}{s^{2}}m + \frac{1}{\sigma^{2}}X^{\prime}y\right) \\ \Sigma_{post} &= \left( \frac{1}{s^{2}}I + \frac{1}{\sigma^{2}}X^{\prime}X\right)^{-1} \end{aligned} \] Here \(X\) is the design matrix, with rows consisting of the individual \(x_{i}\). Because the posterior is normal, both the mean and mode are equal to \(m_{post}\)- our Bayesian point estimate of \(\beta\).

From normal regression theory, if we fit a standard (non-Bayesian) linear regression, the resulting point estimate of \(\beta\) would be \[ \hat{\beta}_{normal} = \left( X^{\prime}X \right)^{-1}X^{\prime}y \] with a Gaussian sampling distribution, and variance \[ Var\left( \hat{\beta}_{normal} \right) = \sigma^{2}\left( X^{\prime}X \right)^{-1} \] We can interpret the effects of the prior distribution intuitively as shrinking the frequentist point estimate towards the prior mean, by an amount depending on the data variance and the prior variance.
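
The shrinkage interpretation is easy to see numerically. The following sketch uses a toy simulated dataset (not the contact-centre data) with \(\sigma^{2} = 1\) and prior mean \(0\), and shows the posterior mean moving from the frequentist estimate towards \(0\) as the prior variance \(s^{2}\) shrinks:

set.seed(1234)
x <- rnorm(50)
y <- 0.8 * x + rnorm(50)  # true intercept 0, slope 0.8, sigma = 1
X <- cbind(1, x)
beta_freq <- solve(t(X) %*% X) %*% t(X) %*% y
beta_bayes <- function(s) solve(diag(2) / s^2 + t(X) %*% X) %*% (t(X) %*% y)
round(cbind(freq = as.numeric(beta_freq),
            s_1 = as.numeric(beta_bayes(1)),
            s_0.5 = as.numeric(beta_bayes(0.5)),
            s_0.1 = as.numeric(beta_bayes(0.1))), 3)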

Let’s compare the Bayesian regression to the normal-theory frequentist regression. For different values of \(p\), we’ll plot the curves corresponding to the Bayesian posterior mean and the frequentist point estimate, and pointwise credible (Bayesian) and confidence (frequentist) intervals. A pointwise interval for the fitted curve at a given \(x\) is calculated as \[ x^{\prime}\hat{\beta} \pm 2\times \sqrt{x^{\prime}Var(\hat{\beta})x} \] These are meant to give a rough idea of the range of plausible curves at each given \(x\); they are misleading in that, when calculated this way, they don’t actually give a measure of the variability associated with the entire curve. Nonetheless they are useful for comparing estimated curves. Along with the plots, we’ll compute the actual model coefficients, and the residual sum of squares associated with the fit for that value of \(p\). We can then compare the two approaches as \(p\) gets high, and make a subjective choice as to the value of \(p\) that we prefer.

All of this takes some work to code up, so here we go! We’re going to need

  • A function to estimate the Bayesian linear regression for a given p, returning the posterior mean and variance
  • A function to estimate the frequentist linear regression for a given p, returning the point estimate and variance
  • A function to take a value of p, fit the above two models, and return a dataframe consisting of the observed responses, their predictions under the model (the value of the curve at that x point), and the lower and upper points of the associated interval, for each model
  • A function to plot the above data, and
  • A script that fits the models for various values of \(p\), returning a dataframe with the residual sum of squares for the model
# Function to fit the Bayesian linear regression described above
# Returns the posterior mean and variance
fit_bayesian_lm <- function(p,m=rep(0,p+1),s=0.5) {
  # p is the degree of polynomial to fit
  # m is the prior mean, and s the prior standard deviation
  
  m <- cbind(m)
  X <- model.matrix(answer_speed ~ poly(date,degree = p,raw = TRUE),data = contact_std)
  y <- cbind(contact_std$answer_speed)
  
  # Posterior mean and variance
  varpost <- solve((1/s^2)*diag(dim(X)[2]) + t(X) %*% X)
  meanpost <- varpost %*% ((1/s^2)*m + t(X) %*% y)
  
  list(
    estimate = meanpost,
    var = varpost
  )
}

# Function to fit the frequentist linear model
# Returns the point estimate and variance
fit_frequentist_lm <- function(p) {
  mod <- lm(answer_speed ~ poly(date,degree = p,raw = TRUE),data = contact_std)
  list(
    estimate = cbind(coef(mod)),
    var = vcov(mod)
  )
}

# Function to predict new y at given x
# For given x, returns predicted y and lower/upper bounds of a 2-sd interval
predict_with_interval <- function(model_data,x,p) {
  # model_data is the output of one of the above functions
  # x here is a scalar
  # Create a vector of length p+1 containing 1, x, x^2, ... , x^p
  xvec <- c(1)
  for (i in 1:p) {
    xvec <- c(xvec,x^i)
  }
  xvec <- cbind(xvec)
  
  # Fitted value at x, and the standard deviation of the fit (square root of x' Var x)
  sn <- sqrt(t(xvec) %*% model_data$var %*% xvec)
  mn <- t(xvec) %*% model_data$estimate
  
  c(
    x = x,
    predicted = mn,
    lower = mn - 2*sn,
    upper = mn + 2*sn
  )
}

# Now, a function that will take a value of p, and fit the above two models, 
# returning predictions and pointwise interval bounds for each training
# set point
fit_both_models <- function(p,...) {
  # The ... can be used to pass additional arguments to the fit_bayesian_lm function,
  # like a different prior mean and variance
  bayesian_lm <- fit_bayesian_lm(p,...)
  frequentist_lm <- fit_frequentist_lm(p)
  
  # Get all unique training set points, and map the predict_with_interval function
  # over them, reducing the results into a data_frame. Do for each of frequentist and bayesian.
  # Also, manually add back on the observed values of y
  bayes_predictions <- contact_std %>%
    pull(date) %>%
    map(~predict_with_interval(bayesian_lm,.x,p)) %>%
    reduce(bind_rows) %>%
    bind_cols(dplyr::select(contact_std,observed = answer_speed))
  freq_predictions <- contact_std %>%
    pull(date) %>%
    map(~predict_with_interval(frequentist_lm,.x,p)) %>%
    reduce(bind_rows) %>%
    bind_cols(dplyr::select(contact_std,observed = answer_speed))
  
  # Return a list containing the model parameters and predictions
  list(
    bayesian = list(
      predictions = bayes_predictions,
      model = bayesian_lm
    ),
    frequentist = list(
      predictions = freq_predictions,
      model = frequentist_lm
    )
  )
}

# Fit the models for several values of p
ptotry <- c(1:10)
model_fits <- ptotry %>%
  map(~list(
            p = .x,
            data = fit_both_models(p = .x))
  )

# Plot them
plot_one_p <- function(model) {
  # Stack the data for the frequentist and bayesian together, create a new variable
  # indicating frequentist/bayesian. This allows us to make a single ggplot with
  # two frames, one for each curve
  plt <- model$data$bayesian$predictions %>%
    mutate(type = "Bayesian") %>%
    bind_rows(model$data$frequentist$predictions %>% mutate(type = "Frequentist")) %>%
    ggplot(aes(x = x,colour = type,fill = type)) +
    theme_classic() + 
    facet_grid(~type) +
    geom_ribbon(aes(ymin = lower,ymax = upper),alpha = 0.3) +
    geom_point(aes(y = observed),colour = "black",fill = "grey",pch = 21) +
    geom_line(aes(y = predicted)) + 
    labs(title = "Contact Time Model",
         subtitle = str_c("p = ",model$p),
         x = "Date (Standardized)",
         y = "Contact Time (Standardized)",
         fill = "Type") +
    scale_fill_brewer(palette = "Set1") +
    scale_colour_brewer(palette = "Set1") +
    guides(colour = FALSE) +
    ylim(c(-2.5,2.5))
  
  return(plt)
}

# Plot them all
model_fits %>%
  map(plot_one_p) %>%
  cowplot::plot_grid(plotlist = .,ncol = 1)

As the frequentist curve gets more wiggly, the Bayesian curve stays reasonably stable. What about for higher \(p\)?

ptotry <- c(11:20)
ptotry %>%
  map(~list(
            p = .x,
            data = fit_both_models(p = .x))
  ) %>%
  map(plot_one_p) %>%
  cowplot::plot_grid(plotlist = .,ncol = 1)

While it’s probably not a good idea to fit models with such high degree polynomials, at least the Bayesian implementation guards somewhat against high variability in the fitted curves.

Based on this, we can decide what degree of polynomial to use. Since the quadratic is shaped almost exactly like the curves of higher degrees (up to around 10), we can stick with the simplest reasonable model and choose that one; we could also be more formal about it and split the data into a training and a test set, fit the polynomials on the training set, and report the test set error.
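
Here is a rough sketch of that train/test idea, using only the frequentist fits for brevity (the 80/20 split and the seed are arbitrary): hold out a random \(20\%\) of days, fit each degree on the remaining days, and compare the held-out mean squared error.

set.seed(4321)
test_idx <- sample(nrow(contact_std), size = round(0.2 * nrow(contact_std)))
train <- contact_std[-test_idx, ]
test <- contact_std[test_idx, ]
test_mse <- map_dbl(1:10, function(p) {
  fit <- lm(answer_speed ~ poly(date, degree = p, raw = TRUE), data = train)
  mean((test$answer_speed - predict(fit, newdata = test))^2)
})
data_frame(degree = 1:10, test_mse = test_mse)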