This document is a discussion and implementation of the second extended example from Horton (2013): "I Hear, I Forget. I Do, I Understand: A Modified Moore-Method Mathematical Statistics Course", The American Statistician 67(4), 219-228 (although the actual example is originally from Evans and Rosenthal (2004): Probability and Statistics: The Science of Uncertainty, New York: W. H. Freeman and Company). The purpose of this document is to illustrate some of the challenges that instructors might face when teaching topics that require computation to actually implement, but where computational skills may or may not be part of the learning objectives for the course. This is in slight contrast to the other tutorial covering an example from the Horton (2013) paper, which focusses more on how computation can be used to enhance student understanding of topics that would traditionally be taught in a purely analytical manner.
This document contains the fully worked example, discussion, and several implementations of algorithms for sampling from a probability distribution. Both base R and, where appropriate, more modern implementations are given side-by-side, and instructors can choose which to use depending on the learning outcomes for the course. In particular, the base R versions are usually better for a course where computation is used but is not necessarily a main learning objective; in contrast, the more advanced versions might be better if R programming is a main learning objective.
The computational concepts here are appropriate for students with one introductory programming course, and ideally one applied course that uses R. If they don’t have the latter, a brief introduction to R programming may be required.
The learning objectives of the example covered here may include:
- R skills, and the ability to use them fluently in the above; for example, understand functions well enough to be able to use integrate without difficulty
- ggplot, and for this to be the default choice for plotting
The question for students is as follows (this is paraphrased but otherwise taken directly from Horton (2013)): consider a random variable with density \(f(y) = c(1+|y|)^3 e^{-y^4}\) for \(y \in \mathbb{R}\), where \(c\) is an unknown normalizing constant. First, compute the CDF of this distribution. Second, write a program that generates a sample from it.
The first part could be done analytically (although it’s difficult and messy); we will show how this part of the question can be solved computationally. The second part literally asks students to write a program.
The CDF is the integral of the density, \[ F(y) = \int_{-\infty}^{y} f(s)ds \] Students might approach this problem by trying to brute-force evaluate this integral using their knowledge of calculus. This is difficult, but it’s a difficult calculus problem, not a difficult statistical problem. If calculus skills are what’s meant to be tested here then that’s one thing, but if the point of the problem is to have students develop practical skills in working with complicated probability densities, then working through this problem analytically can remove focus from this goal.
Whatever the learning objectives, computation can be used to aid understanding of this problem, and to give students a way to check whether their answers are reasonable. R has a built-in integrate function that can be used to numerically evaluate improper integrals of reasonably well-behaved functions. Students can use this right away to find the normalizing constant \(c\):
# Define the function of interest
fx <- function(x) (1+abs(x))^3 * exp(-x^4)
# Integrate it over the entire domain
fx_int <- integrate(fx,-Inf,Inf)$value
fx_int
## [1] 6.809611
# The normalizing constant is the reciprocal of this
norm_const <- 1 / fx_int
norm_const
## [1] 0.1468513
Students should then double-check that their normalized density integrates to 1:
fxc <- function(x) norm_const * fx(x)
integrate(fxc,-Inf,Inf)$value
## [1] 1
The code here looks simple, but students with little knowledge of R can easily get trapped in trying to write it. In particular, it uses some features and concepts that might be unfamiliar to students with little programming background:
- Defining functions with function, which is itself a function: you give it an expression that takes one or more variables and returns a result. This is simple enough, but students will find their first functions hard to debug, in particular because the argument-passing semantics that R uses (pass-by-value, but with lazy evaluation) are somewhat unusual, and make it hard for first-time users to read the stack trace and figure out what went wrong
- Passing a function to integrate. integrate takes an argument which is a function, and applies an algorithm that uses that function, returning a value (actually, returning a list from which you have to extract that value, another concept to explain)
- R uses environments (analogous to namespaces in languages like Python and C++) to store and search for data. We used this here in the definition of fxc: the variable norm_const and the function fx are stored in the global environment; the function fxc will first look for these values in its own execution environment and, not finding them there, will search its enclosing environment (here the global environment, where fxc was defined), where it will find them. It then passes its own argument x to fx, evaluates that, multiplies the result by norm_const, and returns this value. An R programmer might think to write the code this way, but then to explain to students what it’s doing and why it works requires the introduction of several new concepts.
An alternative to introducing some of these new concepts is to, well, not introduce them, and write the code in a different way:
# Print out the value of norm_const
norm_const
## [1] 0.1468513
# Copy it and fx directly
fxc2 <- function(x) 0.1468513 * (1+abs(x))^3 * exp(-x^4)
integrate(fxc2,-Inf,Inf)$value
## [1] 1
The central tradeoff for the instructor seems to be time spent teaching programming vs. the quality of the resulting program. The second implementation of fxc gives exactly the same result, but requires copying and pasting of previously used code and output, which is tedious and introduces further possibility of human error. However, one might argue that blindly pulling variables from the global environment also introduces the possibility of error and should be considered bad practice! Every instructor will probably have different opinions about technical points like this; the point is, these are difficulties that should be thought through ahead of showing these examples to students, in order to ensure that the example helps achieve your desired learning objectives and to avoid falling too far down a rabbit hole.
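If an instructor does decide to explain the environment lookup rather than avoid it, a tiny standalone example (not part of the original exercise; the names a and f are arbitrary) can make the rule concrete before returning to the statistics:
# a is defined in the global environment
a <- 2
# f has no argument or local variable named a, so when f runs, R looks for a in f's
# execution environment, fails, and then searches the enclosing (global) environment
f <- function(x) a * x
f(3)   # 6
# The lookup happens when f is called, so changing a changes the result
a <- 5
f(3)   # 15
This is exactly the lookup that fxc relies on to find norm_const and fx.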
Moving on, we can also write an R function to evaluate the CDF. Whether this is used as the actual answer to the problem, or to aid students’ understanding and let them check their answer numerically, it is a useful exercise.
fx_cdf <- function(x) integrate(fxc,-Inf,x)$value
# Test it out
fx_cdf(-1) # Arbitrary
## [1] 0.08834757
fx_cdf(0) # Should equal fx_cdf(Inf) / 2, as density is symmetric
## [1] 0.5
fx_cdf(1) # Arbitrary, should equal 1 - fx_cdf(-1) again by symmetry
## [1] 0.9116524
fx_cdf(Inf) # Should equal 1
## [1] 1
This is a good time to mention to students strategies for checking their work. We checked two properties of the CDF that we knew from its definition: that \(F(0) = 1/2\), since the density is symmetric about zero, and that \(F(\infty) = 1\).
We probably want to check two more properties of \(F(x)\): that it is non-decreasing in \(x\), and that its shape actually matches that of \(f(x)\). Both of these can be accomplished using a plot. If students aren’t familiar with ggplot, it would have to be introduced at this point, and indeed this is a good example of using ggplot to plot functions:
# The plotting below assumes the tidyverse (ggplot2, dplyr, purrr) is loaded
library(tidyverse)
# Create a dataframe that holds the x-axis range that we want to plot
# Pass this to ggplot, and add a stat_function layer. ggplot does the rest
pdf_plt <- data_frame(x = c(-2,2)) %>%
ggplot(aes(x = x)) +
theme_light() +
stat_function(fun = fxc,colour = "purple") +
labs(title = "PDF",
x = "y",
y = "f(y)") +
scale_x_continuous(breaks = seq(-2,2,by=0.5))
# Need a vectorized version of fx_cdf
fx_cdf_v <- function(x) {
out <- numeric(length(x))
for(i in 1:length(x)) {
out[i] <- fx_cdf(x[i])
}
out
}
cdf_plt <- data_frame(x = c(-2,2)) %>%
ggplot(aes(x = x)) +
theme_light() +
stat_function(fun = fx_cdf_v,colour = "purple") +
labs(title = "CDF",
x = "y",
y = "F(y)") +
scale_x_continuous(breaks = seq(-2,2,by=0.5))
cowplot::plot_grid(pdf_plt,cdf_plt,nrow = 1)
We can see from these plots that the CDF follows the shape of the PDF, increasing sharply where the PDF is high and slowly where the PDF is low.
Even just plotting these functions uncovered another important concept in R programming, namely that of vectorization. ggplot (specifically stat_function) requires the functions it is given to accept vector arguments and return vector values. If students aren’t familiar with this yet, the instructor can dig a bit deeper into what happened here:
- The fxc function uses operations that are all already vectorized, meaning they take a vector as an argument and return the operation evaluated on each element of that vector. The operations +, *, abs, exp and ^ all work this way, among many others
- The fx_cdf function passes its argument to the upper named argument of the integrate function, which accepts only a single scalar value, not a vector. As such, the function fx_cdf can’t automatically accept and return vectors. We can fix this somewhat easily as above, by wrapping the function in a for loop.
If computation itself is a learning objective of the course, this is a great place to introduce the concept of a higher-order function, one which takes another function as an argument, and possibly returns a function as its return value. We can write a higher-order function that will vectorize its input as follows:
vectorize <- function(f) {
# f is a function that accepts scalar arguments and returns scalar values
# vectorize(f) will return a function, which accepts vector arguments and returns vector values
function(x) {
out <- numeric(length(x))
for (i in 1:length(x)) {
out[i] <- f(x[i])
}
out
}
}
fx_cdf_v2 <- vectorize(fx_cdf)
fx_cdf_v2(c(-1,0,1))
## [1] 0.08834757 0.50000000 0.91165243
There is a lot going on up there; this is a useful exercise only, again, if R programming is part of the learning objectives for the course.
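As an aside, base R already provides a higher-order function that does essentially this job: Vectorize() wraps a scalar function so that it accepts vector arguments. Whether to show it instead of, or alongside, the hand-written version above is another instructor judgment call; a minimal sketch:
# Base R's Vectorize() wraps fx_cdf in machinery that loops over vector input
fx_cdf_v3 <- Vectorize(fx_cdf)
fx_cdf_v3(c(-1, 0, 1))   # should match fx_cdf_v2(c(-1, 0, 1))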
The second part of the problem is to write a program that will produce a sample from the density \(f(y)\). We’ll go over the practical aspects of doing this three ways: Inverse Transform Sampling, Rejection Sampling, and MCMC. We won’t do the theory or make any claims about which one is better suited to this problem; we’ll just focus on potential challenges students might face trying to implement each.
If \(X \sim F\) with CDF \(F(x)\), and \(U \sim Unif(0,1)\), then \(F^{-1}(U) \overset{d}{=} X\). This can be used to generate a single point from \(F\) according to the following algorithm:
- Draw \(u\) from a \(Unif(0,1)\) distribution
- Compute \(x = F^{-1}(u)\); then \(x\) is a draw from \(F\)
This can be repeated using many independent random draws from \(U \sim Unif(0,1)\) to get a random sample from \(F\). The implementation is very straightforward… except we don’t know \(F(x)\) here, so we certainly don’t know \(F^{-1}(x)\).
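To see how easy the algorithm is when the inverse CDF is available in closed form, here is a quick illustration (not part of the original problem) for the Exponential(1) distribution, where \(F^{-1}(u) = -\log(1-u)\):
u <- runif(10000)          # Step 1: uniform draws
x <- -log(1 - u)           # Step 2: apply the closed-form inverse CDF
mean(x)                    # Should be close to 1, the Exponential(1) mean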
We can proceed using R by noting that if \(y = F^{-1}(x)\) then \(F(y) - x = 0\). R has a built-in function uniroot for finding roots of functions, i.e. solving equations of the form \(g(y) = 0\) for \(y\). We can use this to find \(y = F^{-1}(x)\) for any \(x\) as follows:
cdf_inv <- function(x) {
# Function returns F^-1(x)
# Define the function of y to pass to uniroot. y is now the variable, and x is considered a fixed constant
g <- function(y) fx_cdf(y) - x
# Pass it to uniroot; search for the root on a large interval (this practicality can be further
# discussed with students)
uniroot(g,lower = -10,upper = 10)$root
}
# Test it out
cdf_inv(fx_cdf(-1))
## [1] -1.000004
cdf_inv(fx_cdf(1))
## [1] 1.000008
cdf_inv(fx_cdf(-.5))
## [1] -0.4999956
cdf_inv(fx_cdf(.5))
## [1] 0.499996
cdf_inv(fx_cdf(-2))
## [1] -2.000024
cdf_inv(fx_cdf(2)) # Yikes!
## [1] 5.003514
cdf_inv(fx_cdf(1.5))
## [1] 1.500017
cdf_inv(fx_cdf(1.6))
## [1] 1.596018
cdf_inv(fx_cdf(1.8))
## [1] 1.829467
We see that our method works okay on points around the range of the density that contains most of the mass (about -2 to 2, although by 2 it gets pretty bad); out in the tails it can be very inaccurate. It remains to be seen whether this would affect the sampling. This is a good time to have students reflect on how cool this is: we couldn’t write down expressions for the CDF or its inverse, but we can still compute with them using R. It’s also a good time to point out the limitations; the naive numerical integration and root-finding methods we used are good, but aren’t always accurate, and understanding that this is something that needs to be checked in applications is important.
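One thing students might experiment with, as a sketch rather than a guaranteed fix, is tightening uniroot's convergence tolerance and widening the search interval, then checking how much of the tail inaccuracy remains; the deeper limitation is that \(F(y)\) is itself computed by numerical integration and is extremely flat in the tails:
# A variant of cdf_inv with a tighter tolerance and a wider search interval.
# Whether this meaningfully improves the tail behaviour is worth checking empirically.
cdf_inv_tight <- function(x) {
g <- function(y) fx_cdf(y) - x
uniroot(g,lower = -20,upper = 20,tol = 1e-12)$root
}
cdf_inv_tight(fx_cdf(2))   # compare with the value obtained above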
Now we have to write an algorithm to sample from the original density. This can be accomplished with a simple loop:
set.seed(34879)
B <- 10000 # Size of sample to generate
thesample <- numeric(B)
for (b in 1:B) {
# Generate a uniform random deviate
u <- runif(1)
# Compute the inverse CDF at it
thesample[b] <- cdf_inv(u)
# Print some progress. Don't print it every iteration- printing stuff to the screen
# is actually much more computationally intensive than actually doing an iteration of
# this algorithm!
if (b %% 500 == 0) {
cat("Produced",b,"samples\n")
}
}
## Produced 500 samples
## Produced 1000 samples
## Produced 1500 samples
## Produced 2000 samples
## Produced 2500 samples
## Produced 3000 samples
## Produced 3500 samples
## Produced 4000 samples
## Produced 4500 samples
## Produced 5000 samples
## Produced 5500 samples
## Produced 6000 samples
## Produced 6500 samples
## Produced 7000 samples
## Produced 7500 samples
## Produced 8000 samples
## Produced 8500 samples
## Produced 9000 samples
## Produced 9500 samples
## Produced 10000 samples
# Basic summary statistics; by symmetry we know the distribution we're sampling from has expected value 0,
# so let's check if we got that part right:
mean(thesample) # pretty close
## [1] 0.01076533
sd(thesample)
## [1] 0.7599331
range(thesample)
## [1] -1.650879 4.128738
# ...and we make a plot to see for sure. We'll cut it off at x = 2, even though our sample goes past that
data_frame(x = thesample) %>%
ggplot(aes(x = x)) +
theme_classic() +
geom_histogram(aes(y = ..density..),bins = 100,colour = "black",fill = "orange") +
stat_function(fun = fxc,colour = "purple") +
coord_cartesian(xlim = c(-2,2)) +
labs(title = "Sample from our density",
subtitle = "Inverse Transform Sampling",
x = "y",
y = "f(y)")
To summarize, implementing this simple sampling algorithm covered the following computational concepts:
- uniroot, and a discussion about the accuracy of numerical computations
- ggplot
This sampling can also be introduced using more advanced programming constructs, namely the map and reduce functions from the purrr package (part of the tidyverse). The above loop is simple enough that this probably isn’t necessary, but it is potentially a good example to help illustrate the use of map/reduce and working with data in lists in R. These concepts certainly aren’t new, and are implemented in base R through apply, lapply, sapply, vapply, and so on; the implementations in the purrr package provide an API that is consistent in its inputs and outputs, and much simpler to use and learn.
set.seed(34879)
B <- 10000
thesample <- runif(B) %>% # Generate the B unif(0,1) realizations in one line
map(~c(x = cdf_inv(.x))) %>% # Apply the cdf_inv function to each element of the above vector, returning a list of named vectors
reduce(bind_rows) # Concatenate the named vectors into a data_frame
thesample
## # A tibble: 10,000 x 1
## x
## <dbl>
## 1 0.9558366
## 2 1.1489711
## 3 -0.9452934
## 4 -0.2579777
## 5 0.7454143
## 6 0.8171312
## 7 0.8668995
## 8 0.3511792
## 9 -0.9762473
## 10 -0.8888388
## # ... with 9,990 more rows
Doing it this way introduces map/reduce in R: take data, apply a function to each datapoint individually (whether they be elements in a vector, rows in a data_frame, or arbitrary items in a list), then collapse the results together somehow. Specifically, we
- Generate B independent realizations of a Unif(0,1) random variable
- Apply cdf_inv to each of these using the purrr::map function, which takes a vector or list and a function of one variable as arguments, and applies that function to each element of the input. This is accomplished in base R using the apply family (a base R version of this pipeline is sketched after this list). Two advantages of using purrr are a consistent set of functions that are compatible with each other, take the same inputs in the same order and return predictable output, and the compact formula (~) syntax for defining anonymous functions. This makes this type of functional programming easier for students to learn, as there is much less time spent focussing on incidental details (which apply function do I use? What are the arguments? What is it going to return? Will it always return a list or sometimes a vector? And so on)
- reduce the list of named vectors returned by map into a single data_frame using dplyr::bind_rows
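For comparison, here is one way the same pipeline might look in base R; this is a sketch, and it returns a plain numeric vector rather than a tibble:
set.seed(34879)
u <- runif(B)
# vapply is like sapply, but we declare the type and length of each result,
# so the output is guaranteed to be a numeric vector
thesample_base <- vapply(u, cdf_inv, numeric(1))
head(thesample_base)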
We’ll implement another sampling algorithm, Rejection Sampling, which is still simple but slightly less so. The Rejection Sampling algorithm is basically:
- Draw a candidate point \(u\) uniformly from an interval \([x_{min}, x_{max}]\) that contains essentially all of the mass of \(f\)
- Draw a second value \(v\) uniformly from \([0, y_{max}]\), where \(y_{max}\) is at least as large as the maximum of \(f\)
- If \(v < f(u)\), accept \(u\) as a draw from \(f\); otherwise reject it and repeat
This algorithm is straightforward to state. To implement it requires considering a few details:
- How to choose the interval \([x_{min}, x_{max}]\) and the bound \(y_{max}\)
- Whether to loop for a fixed number of iterations, or until a fixed number of points have been accepted
- How to monitor the algorithm as it runs, for example by tracking the acceptance rate
With those decisions made we can implement our algorithm:
set.seed(29432)
xmin <- -3
xmax <- 3
ymax <- 0.7
N <- 10000 # Number of samples we want, NOT number of iterations that will be performed
numiter <- 0 # A counter to count the number of iterations performed
acceptancerate <- numeric() # Track the acceptance rate over iterations
thesample <- numeric()
numaccepted <- 0 # Counter of number accepted
while(numaccepted < N) {
u <- runif(1,xmin,xmax)
if (runif(1,0,ymax) < fxc(u)) {
numaccepted <- numaccepted + 1
thesample <- c(thesample,u)
}
numiter <- numiter + 1
acceptancerate <- c(acceptancerate,numaccepted / numiter)
if (numiter %% 5000 == 0) {
cat("Iterations:",numiter,"Accepted:",numaccepted,"Acceptance Rate:",scales::percent(numaccepted / numiter),"\n")
}
}
## Iterations: 5000 Accepted: 1232 Acceptance Rate: 24.6%
## Iterations: 10000 Accepted: 2408 Acceptance Rate: 24.1%
## Iterations: 15000 Accepted: 3603 Acceptance Rate: 24%
## Iterations: 20000 Accepted: 4841 Acceptance Rate: 24.2%
## Iterations: 25000 Accepted: 5993 Acceptance Rate: 24%
## Iterations: 30000 Accepted: 7188 Acceptance Rate: 24%
## Iterations: 35000 Accepted: 8424 Acceptance Rate: 24.1%
## Iterations: 40000 Accepted: 9580 Acceptance Rate: 23.9%
cat("Final Iterations:",numiter,"\nFinal Accepted:",numaccepted,"\nFinal Acceptance Rate:",scales::percent(numaccepted / numiter),"\n")
## Final Iterations: 41787
## Final Accepted: 10000
## Final Acceptance Rate: 23.9%
# Summary statistics and plot
mean(thesample) # pretty close
## [1] 0.001582394
sd(thesample)
## [1] 0.7591176
range(thesample)
## [1] -1.667341 1.696078
# ...and we make a plot to see for sure. We'll cut it off at x = 2, even though our sample goes past that
data_frame(x = thesample) %>%
ggplot(aes(x = x)) +
theme_classic() +
geom_histogram(aes(y = ..density..),bins = 100,colour = "black",fill = "orange") +
stat_function(fun = fxc,colour = "purple") +
coord_cartesian(xlim = c(-2,2)) +
labs(title = "Sample from our density",
subtitle = "Rejection Sampling",
x = "y",
y = "f(y)")
# Also plot the acceptance rate over iterations
data_frame(y = acceptancerate,
x = 1:length(acceptancerate)) %>%
ggplot(aes(x = x,y = y,group = 1)) +
theme_classic() +
geom_line() +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_continuous(labels = scales::percent_format()) +
labs(title = "Acceptance Rate by Iteration",
subtitle = "Rejection Sampling Algorithm",
x = "Iteration",
y = "Acceptance Rate")
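A quick sanity check on these numbers (not in the original write-up): since the density integrates to 1, the long-run acceptance rate should be roughly the area under the density divided by the area of the proposal box, assuming the box really does contain essentially all of the mass:
# Area under fxc is 1 (it's a normalized density); the box has area (xmax - xmin) * ymax
1 / ((xmax - xmin) * ymax)
# = 1 / (6 * 0.7), approximately 0.238, consistent with the ~24% acceptance rate observed above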
Rejection sampling involves computational concepts similar to those in the previous example on inverse transform sampling (loops, etc). There are some new concepts introduced:
- A while loop, which runs until a condition is met rather than for a fixed, known number of iterations
- Growing result vectors with c() when the number of iterations isn’t known in advance
- Tracking a diagnostic (here, the acceptance rate) across iterations and plotting it
We will implement one final sampling algorithm, namely Metropolis-Hastings MCMC. What we need to get the computer to do is:
- Start the chain at some initial point
- At each iteration, propose a new point from a proposal distribution that depends on the current point
- Accept the proposed point with the Metropolis-Hastings acceptance probability, and otherwise keep the current point
- Repeat for many iterations, then discard an initial burn-in period and thin the remaining draws
There are several details to be worked out:
- What proposal distribution to use, and how spread out it should be
- Where to start the chain
- How many iterations to run, how long a burn-in to discard, and how much to thin
These represent a mix of theoretical and practical considerations. For our application here we’ll make some choices, which you of course are free to change:
- The proposal is a normal distribution centered at the current point with standard deviation 0.25
- The chain starts at 0 and runs for 1000 iterations
- The first 100 iterations are discarded as burn-in, and every 9th remaining draw is kept
Let’s see how we might implement this:
set.seed(24389)
niter <- 1000
burnin <- 100
thinning <- 9
xstart <- 0
iterations <- numeric(niter)
curpoint <- xstart
for (i in 1:niter) {
# Propose a new point
newpoint <- rnorm(1,curpoint,.25)
# Decide whether to accept
if (runif(1) < (fx(newpoint) * dnorm(curpoint,newpoint,.25)) / (fx(curpoint) * dnorm(newpoint,curpoint,.25))) {
iterations[i] <- newpoint
} else {
iterations[i] <- curpoint
}
curpoint <- iterations[i]
}
# Burn in and thin
iterations_out <- iterations[(burnin+1):niter][1:(niter-burnin) %% thinning == 0]
# Check out the trace plot, and histogram of samples
traceplot <- data_frame(y = iterations,
x = 1:length(iterations)) %>%
ggplot(aes(x = x,y = y,group = 1)) +
theme_classic() +
geom_path() +
labs(title = "MCMC Trace Plot",
x = "Iteration",
y = "Accepted Value") +
coord_flip()
histogramofsample <- data_frame(x = iterations_out) %>%
ggplot(aes(x = x)) +
theme_classic() +
geom_histogram(aes(y = ..density..),bins = 30,colour = "black",fill = "lightblue") +
stat_function(fun = fxc,colour = "purple") +
labs(title = "Histogram of MCMC Samples",
subtitle = "Purple curve: desired density",
x = "Sampled Value",
y = "Density")
cowplot::plot_grid(traceplot,histogramofsample,nrow = 1)
This implementation required some computational concepts:
- for loops
- ggplot, for the trace plot and the histogram of samples
Looking at the results, we see some immediate problems that can be discussed in a deeper treatment of MCMC. From a computational perspective, the natural thing to do is to change the inputs to the algorithm. We can make a principled choice here, based on our knowledge of the behaviour of the algorithm, to try a bimodal proposal density and lower the proposal standard deviation; but nonetheless, this requires rerunning the code with changed parameters. This introduces a potential integrity issue into our analysis: if we just start changing things willy-nilly, we lose track of which experiments we performed (you could keep track in a notebook, but most students probably won’t). If we mitigate this by just copying and pasting the whole algorithm for every new set of inputs, our code becomes cluttered and we increase the chances of human error. Best practice dictates that we write a function.
This is another layer of computational complexity to what is supposed to be a practical problem. Discussion of why we’re doing this, as above, should help the students appreciate what problem we’re solving by adding this extra layer of work to our procedure.
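One simple, concrete diagnostic students can run themselves before any re-tuning (a suggestion, not part of the original analysis) is the autocorrelation of the chain, using base R's acf(); successive draws from a random-walk proposal are typically highly autocorrelated, and comparing the raw chain with the thinned draws shows what thinning buys us:
# Autocorrelation of the raw chain vs. the burned-in, thinned draws
acf(iterations, main = "Raw MCMC chain")
acf(iterations_out, main = "After burn-in and thinning")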
Let’s write the function. First we’ll write a simple pair of proposal distribution functions, for computing the density of and sampling from an equal mixture of two normals, with low standard deviation. This should help our algorithm visit both modes of the distribution, without having to jump around the whole sample space, increasing the chances of visiting low-density areas. We’ll then implement a function to perform the above MCMC algorithm.
# Functions for computing density and sampling from a mixture of 2 normals,
# centered at mu and -mu
m2n_density <- function(x,mu,sig) {
.5 * dnorm(x,mu,sig) + .5 * dnorm(x,-mu,sig)
}
m2n_sample <- function(mu,sig) {
# mu is a scalar; the proposal is an equal mixture of N(mu, sig) and N(-mu, sig)
if (rbinom(1,1,.5) == 0) {
rnorm(1,mu,sig)
} else {
rnorm(1,-mu,sig)
}
}
mcmc_sample1 <- function(niter,propsd,burnin,thinning,xstart = 0) {
iterations <- numeric(niter)
curpoint <- xstart
for (i in 1:niter) {
# Propose a new point
newpoint <- m2n_sample(curpoint,propsd)
# Compute the acceptance probability
accprob <- (fx(newpoint) * m2n_density(curpoint,newpoint,propsd)) / (fx(curpoint) * m2n_density(newpoint,curpoint,propsd))
# Decide whether to accept
if (runif(1) < accprob) {
iterations[i] <- newpoint
} else {
iterations[i] <- curpoint
}
curpoint <- iterations[i]
}
# Burn in and thin
iterations_out <- iterations[(burnin+1):niter][1:(niter-burnin) %% thinning == 0]
# Check out the trace plot, and histogram of samples
traceplot <- data_frame(y = iterations,
x = 1:length(iterations)) %>%
ggplot(aes(x = x,y = y,group = 1)) +
theme_classic() +
geom_path() +
labs(title = "MCMC Trace Plot",
x = "Iteration",
y = "Accepted Value") +
scale_x_continuous(labels = scales::comma_format()) +
coord_flip()
histogramofsample <- data_frame(x = iterations_out) %>%
ggplot(aes(x = x)) +
theme_classic() +
geom_histogram(aes(y = ..density..),bins = 30,colour = "black",fill = "lightblue") +
stat_function(fun = fxc,colour = "purple") +
labs(title = "Histogram of MCMC Samples",
subtitle = "Purple curve: desired density",
x = "Sampled Value",
y = "Density")
# Return the sample and the plots
list(
sample = iterations_out,
traceplot = traceplot,
histogram = histogramofsample
)
}
# Try it out
testsample <- mcmc_sample1(niter = 1000,propsd = .25,burnin = 100,thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)
Not bad! The major advantage of keeping the code in a function is now if we want to assess the effect of changing an input, we can just call the function again. Our notebook contains a record of everything we’ve done, and the chance of human error remains the same as if we only called the function once.
testsample <- mcmc_sample1(niter = 1000,propsd = .1,burnin = 100,thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)
testsample <- mcmc_sample1(niter = 10000,propsd = .1,burnin = 1000,thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)
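Because the sampler now lives in a function, it is also easy to run a small, recorded batch of experiments in one go. As a sketch (the grid of standard deviations here is arbitrary), purrr::map can call mcmc_sample1 over several proposal standard deviations and keep everything in a named list:
# Run the sampler for a few proposal standard deviations and keep all results together
prop_sds <- c(0.05, 0.1, 0.25, 0.5)
mcmc_experiments <- map(prop_sds, ~mcmc_sample1(niter = 1000,propsd = .x,burnin = 100,thinning = 9))
names(mcmc_experiments) <- paste0("propsd_", prop_sds)
# e.g. look at the trace plot from one of the runs
mcmc_experiments[["propsd_0.25"]]$traceplot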
If computation is a focus of the course itself, we can turn to the question of how to improve the above code. One improvement might be to pass the proposal distribution itself as an argument to the sampling function, rather than just its standard deviation. This allows rapid testing of different proposals. This is a good point to introduce the different function syntaxes available in R, as provided by the rlang::as_function function. This lets us pass a function as an argument using any of the following forms:
- an anonymous function, e.g. function(x) x + 1
- the name of an existing function as a string, e.g. "rnorm"
- the function object itself, e.g. rnorm
- a one-sided formula, e.g. ~ .x + 1
These variants are becoming more and more popular in R programming.
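Here is a quick standalone illustration of what rlang::as_function does (the example functions are arbitrary):
# A formula becomes a function of .x (and .y, for two arguments)
add_one <- rlang::as_function(~ .x + 1)
add_one(2)                                  # 3
# An ordinary function is passed through unchanged
rlang::as_function(function(x) x + 1)(2)    # 3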
Adding this to our function can be done as follows:
mcmc_sample2 <- function(niter,proposal_sample,proposal_density,burnin,thinning,xstart = 0) {
# Convert proposal to functions
proposal_sample <- rlang::as_function(proposal_sample)
proposal_density <- rlang::as_function(proposal_density)
# The rest of the code remains mostly unchanged
iterations <- numeric(niter)
curpoint <- xstart
for (i in 1:niter) {
# Propose a new point
newpoint <- proposal_sample(curpoint)
# Compute the acceptance probability
accprob <- (fx(newpoint) * proposal_density(curpoint,newpoint)) / (fx(curpoint) * proposal_density(newpoint,curpoint))
# Decide whether to accept
if (runif(1) < accprob) {
iterations[i] <- newpoint
} else {
iterations[i] <- curpoint
}
curpoint <- iterations[i]
}
# Burn in and thin
iterations_out <- iterations[(burnin+1):niter][1:(niter-burnin) %% thinning == 0]
# Check out the trace plot, and histogram of samples
traceplot <- data_frame(y = iterations,
x = 1:length(iterations)) %>%
ggplot(aes(x = x,y = y,group = 1)) +
theme_classic() +
geom_path() +
labs(title = "MCMC Trace Plot",
x = "Iteration",
y = "Accepted Value") +
scale_x_continuous(labels = scales::comma_format()) +
coord_flip()
histogramofsample <- data_frame(x = iterations_out) %>%
ggplot(aes(x = x)) +
theme_classic() +
geom_histogram(aes(y = ..density..),bins = 30,colour = "black",fill = "lightblue") +
stat_function(fun = fxc,colour = "purple") +
labs(title = "Histogram of MCMC Samples",
subtitle = "Purple curve: desired density",
x = "Sampled Value",
y = "Density")
# Return the sample and the plots
list(
sample = iterations_out,
traceplot = traceplot,
histogram = histogramofsample
)
}
# Try it out with the same proposal as before
# Use the formula syntax to pass fixed arguments to the passed functions
testsample <- mcmc_sample2(niter = 1000,
proposal_sample = ~m2n_sample(mu = .x,sig = .1),
proposal_density = ~m2n_density(x = .x,mu = .y,sig = .1),
burnin = 100,
thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)
testsample <- mcmc_sample2(niter = 10000,
proposal_sample = ~m2n_sample(mu = .x,sig = .1),
proposal_density = ~m2n_density(x = .x,mu = .y,sig = .1),
burnin = 1000,
thinning = 9)
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)
# Try it with a different proposal. Now you can do anything you want, like a noncentral t:
testsample <- mcmc_sample2(niter = 10000,
proposal_sample = ~rt(n=1,df=5,ncp=.x),
proposal_density = ~dt(x = .x,df=5,ncp = .y),
burnin = 1000,
thinning = 9)
## Warning in dt(x = .x, df = 5, ncp = .y): full precision may not have been
## achieved in 'pnt{final}'
## Warning in dt(x = .x, df = 5, ncp = .y): full precision may not have been
## achieved in 'pnt{final}'
## Warning in dt(x = .x, df = 5, ncp = .y): full precision may not have been
## achieved in 'pnt{final}'
## Warning in dt(x = .x, df = 5, ncp = .y): full precision may not have been
## achieved in 'pnt{final}'
cowplot::plot_grid(testsample$traceplot,testsample$histogram,nrow = 1)