Integrating Computation Into Statistics Courses: Worked Example

Simulate One Experiment (Basic)
Simulate One Experiment (Advanced)
Simulate Many Experiments

# Load the tidyverse packages
suppressMessages({
  suppressWarnings({
    library(tidyverse)
  })
})

Simulate One Experiment (Basic)

Using mostly base R, students may be guided to produce the following result:

# Function to perform one experiment
perform_one_experiment_basic <- function(k,N=5000,p=0.05) {
  n <- N / k
  
  # Simulate the population
  popsim <- numeric(N)
  for (i in 1:N) {
    popsim[i] <- rbinom(n = 1,size = 1,prob = p)
  }
  
  # Group them
  groups <- rep(1:n,k)
  
  # Check the disease status of each group
  any_diseased <- numeric(n)
  for (i in 1:N) {
    if (popsim[i] == 1) {
      any_diseased[groups[i]] <- 1
    }
  }
  
  # Count the number of tests performed
  numtests <- 0
  for (i in 1:n) {
    if (any_diseased[i] == 1) {
      numtests <- numtests + (k + 1)
    }
    else {
      numtests <- numtests + 1
    }
  }
  numtests
}

perform_one_experiment_basic(5)

## [1] 2195

Simulate One Experiment (Advanced)

If the instructor wishes to introduce or otherwise make use of more advanced programming constructs in R, this example can be used for this purpose by extending the above “basic” simulation to include the use of more advanced data structures and some light functional programming:

# Function to check inputs: same as above
# Function to perform one experiment
perform_one_experiment_advanced <- function(k,N=5000,p=0.05) {
  n <- N / k
  data_frame(disease = rbinom(N,1,p),
             group = rep(1:n,k)) %>%
    group_by(group) %>%
    summarize(disease = max(disease)) %>%
    mutate(numtests = k*disease + 1) %>%
    summarize(numtests = sum(numtests)) %>%
    pull(numtests)
}

perform_one_experiment_advanced(5)

## [1] 2115

Simulate Many Experiments

Analogous to the manner in which the single-experiment simulation is coded, there are two immediate ways to perform the multiple simulations desired:

# Two functions to perform B experiments for fixed k and average the result
# 
# Basic
perform_B_experiments_basic <- function(B,k,N=5000,p=0.05) {
  results <- numeric(B)
  for (i in 1:B) {
    results[i] <- perform_one_experiment_basic(k,N,p)
  }
  mean(results)
}

perform_B_experiments_basic(25,5)

## [1] 2123

# Advanced
perform_B_experiments_advanced <- function(B,k,N=5000,p=0.05) {
  map(1:B,~perform_one_experiment_advanced(k,N,p)) %>%
    reduce(c) %>%
    mean()
}

perform_B_experiments_advanced(25,5)

## [1] 2114.4

We can then make a ggplot plot of the empirically-estimated number of experiements required for each value of k as follows. ggplot is worth introducing to students at any level as it allows them to create publication-quality plots:

# Simulate for valid values of k from 1 to 10:
sim_results <- numeric(5)
k_to_do <- c(1,2,5,8,10)
for (i in 1:length(k_to_do)) {
  sim_results[i] <- perform_B_experiments_basic(25,k_to_do[i])
}
# ggplot needs its input to be a data_frame
sim_results <- data_frame(k = k_to_do,numtests_sim = sim_results)


# Or...
sim_results <- c(1,2,5,8,10) %>%
  map(~data_frame(k = .x,numtests_sim = perform_B_experiments_advanced(25,.x))) %>%
  reduce(bind_rows)

sim_results %>%
  ggplot(aes(x = k,y = numtests_sim,group = 1)) +
  theme_classic() +
  geom_point() + 
  geom_line() +
  labs(title = "Simulated Average Number of Tests Performed",
       subtitle = "Blood Pooling Example",
       x = "Group Size (k)",
       y = "Average Number of Tests Performed (25 simulations)") +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(labels = scales::comma_format())

This give students a feel for the problem that will help to guide our analytical solution: the number of tests starts high with low \(k\) (pooling doesn’t reduce the number of tests that much when the groups only have a couple people in them), then decreases as \(k\) increases, but then starts increasing again, as the groups become large enough that most of them end up having a diseased individual, and therefore do not produce a reduction in the number of tests performed. From the simulations, we see that \(k = 5\) is a likely candidate for the group size that minimizes the expected number of tests.

Integrating Computation Into Statistics Courses: Worked Example

Empirical Investigation

Alex Stringer

2018-05-22

Simulate One Experiment (Basic)

Simulate One Experiment (Advanced)

Simulate Many Experiments