# Load the tidyverse packages
suppressMessages({
  suppressWarnings({
    library(tidyverse)
  })
})

Likelihood

Recall how we defined the likelihood function in lecture: for a random vector \(X = (X_{1},\ldots,X_{n})\) whose joint distribution depends on a parameter \(\theta = (\theta_{1},\ldots,\theta_{p})\), the likelihood function is the joint density/mass function of \(X\), treated as a function of \(\theta\) for fixed \(X\): \[ L(\theta | X) = f(x ; \theta) \] The most common case is when \(X\) represents an IID random sample of a single random variable; the joint density, and hence the likelihood, is then a product of the marginal densities of the \(X_{i}\), which are all the same: \[ L(\theta|X) = \prod_{i=1}^{n}f(x_{i};\theta) \] For computational and theoretical reasons, we said we would work primarily with the log-likelihood, \[ \ell(\theta) = \log{L(\theta)} \overset{IID}{=} \sum_{i=1}^{n}\ell_{i}(\theta) \] What does it mean, “for fixed \(X\)”? Every time we take a random sample \(x = (x_{1},\ldots,x_{n})\), we get a different log-likelihood function. This is the point: the likelihood describes the relative frequency with which each value of \(\theta\) would have generated the observed data, in repeated sampling from a population with true parameter equal to \(\theta\). Values of \(\theta\) that give higher likelihoods would generate the observed data more frequently, and hence are said to be more likely to have generated the observed data.
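To make the product and sum above concrete, here is a minimal sketch: the likelihood is the product of the marginal densities evaluated at the data, and the log-likelihood is the corresponding sum of log densities. The sample values and the choice of a \(N(\theta,1)\) marginal here are purely illustrative, not part of the examples below.

# Minimal sketch: likelihood as a product of marginal densities,
# log-likelihood as the corresponding sum of log densities
# (hypothetical sample and N(theta, 1) marginal, purely for illustration)
x_sketch <- c(-0.5, 1.2, 0.3)   # a hypothetical observed sample
theta_sketch <- 0.4             # one candidate parameter value
L_sketch <- prod(dnorm(x_sketch, mean = theta_sketch, sd = 1))
ell_sketch <- sum(dnorm(x_sketch, mean = theta_sketch, sd = 1, log = TRUE))
c(likelihood = L_sketch, exp_of_loglik = exp(ell_sketch))  # equal up to floating point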

Examples

Bernoulli(\(\theta\))

Let’s take a look at this empirically. First we generate a single sample of size \(n = 5\) from a \(Bern(\theta)\) distribution, with \(\theta_{0} = 0.5\) as the true value, and plot the log-likelihood. Since each observation contributes \(\ell_{i}(\theta) = x_{i}\log{\theta} + (1 - x_{i})\log(1 - \theta)\), the log-likelihood for these data is \[ \ell(\theta) = \sum_{i=1}^{n}x_{i} \log{\theta} + \left( n - \sum_{i=1}^{n}x_{i} \right)\log\left( 1 - \theta \right) \]

set.seed(423798)
n <- 5
thetatrue <- 0.5
samp <- rbinom(n,1,thetatrue)
cat("Actual generated sample: ",samp,"\n")
## Actual generated sample:  1 1 0 0 0
# Plot the log-likelihood
bern_ll <- function(theta,x) {
  sum(x)*log(theta) + (length(x) - sum(x)) * log(1 - theta)
}

data_frame(theta = seq(.01,.99,by=.01),
           ll = bern_ll(theta,x = samp)) %>%
  ggplot(aes(x = theta,y = ll,group = 1)) +
  theme_classic() +
  geom_line() +
  geom_vline(xintercept = mean(samp),colour = "purple") +
  geom_vline(xintercept = thetatrue,colour = "red") +
  scale_x_continuous(breaks = seq(0,1,by=.1)) +
  scale_y_continuous(breaks = seq(0,-15,by=-1)) +
  labs(title = "Bernoulli Log-Likelihood for 5 Simulated Coin Flips",
       subtitle = "Red line = true prob. of heads, purple line = maximum likelihood estimate",
       x = "Theta (prob. of heads)",
       y = "Log Likelihood")

Again: the likelihood for a given value of \(\theta\) is the relative frequency with which that value of \(\theta\) would have generated the data we observed. We are free to interpret “the data we observed” through the value of a sufficient statistic: as discussed last lecture and this lecture, the likelihood depends on the observed data only through the value of such a statistic. In our case, \(T(X) = \sum_{i=1}^{n}X_{i}\), the number of heads, is a sufficient statistic. So we ask: with what frequency would each value of \(\theta\) generate two heads in five flips of a coin?

We can simulate this for each value of \(\theta\), and plot the resulting curve. Note that to make the actual numbers on the y-axis match, we need to adjust for the normalizing constant. This won’t affect the shape of the curve, since the constant is not a function of \(\theta\). In practice this isn’t an issue, because we only ever interpret relative values of likelihoods (or differences of log-likelihoods).
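Before simulating, note that this question has a closed-form answer: the probability of exactly two heads in five flips is \(\binom{5}{2}\theta^{2}(1-\theta)^{3}\), which is dbinom(2, 5, theta) in R, and this differs from the likelihood only by the constant \(\binom{5}{2}\). A quick check, reusing bern_ll and samp from above, shows that subtracting log(choose(n, sumx)), exactly as the simulation below does, recovers the log-likelihood:

# Exact version of the normalizing-constant adjustment used in the simulation below:
# log P(2 heads in 5 flips) minus log choose(5,2) equals the Bernoulli log-likelihood
theta_check <- c(0.2, 0.4, 0.5, 0.8)
dbinom(2, 5, theta_check, log = TRUE) - log(choose(5, 2))
bern_ll(theta_check, x = samp)  # same values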

simulate_flips <- function(theta,n=5,sumx=2) {
  # Simulate a bunch of experiments with n flips of a coin with P(heads) = theta
  # Return the log of the relative frequency with which the number of heads equals sumx
  B <- 1000 # Number of simulations to do
  experiment_has_two_heads <- numeric(B) # Record whether each experiment has exactly sumx heads
  for (b in 1:B) {
    experiment_has_two_heads[b] <- as.numeric(sum(rbinom(n,1,theta)) == sumx)
  }
  
  # Subtract the log binomial coefficient so the values match the log-likelihood above
  log(mean(experiment_has_two_heads)) - log(choose(n,sumx))
}

simulate_flips(.5)
## [1] -3.405205
# Vectorized version of the simulate function
simulate_flips_v <- function(theta,n=5,sumx=2) {
  out <- numeric(length(theta))
  for (i in 1:length(theta)) {
    out[i] <- simulate_flips(theta[i],n,sumx)
  }
  out
}
# Perform the simulation for various values of theta and plot:
data_frame(theta = seq(.01,.99,by=.01),
           ll = simulate_flips_v(theta)) %>%
  ggplot(aes(x = theta,y = ll,group = 1)) +
  theme_classic() +
  geom_line() +
  geom_vline(xintercept = mean(samp),colour = "purple") +
  geom_vline(xintercept = thetatrue,colour = "red") +
  scale_x_continuous(breaks = seq(0,1,by=.1)) +
  scale_y_continuous(breaks = seq(0,-15,by=-1)) +
  labs(title = "Bernoulli Empirical Log-Likelihood for 5 Simulated Coin Flips",
       subtitle = "Relative frequency with which each value of theta generated a sample of 2 heads in 5 flips",
       x = "Theta (prob. of heads)",
       y = "Empirical Log Likelihood")

We can do even better than this by realizing that we don’t need to estimate the likelihood at each parameter value using its own independent simulations; we can reuse a single set of random draws across all the parameter values simultaneously! How? Recall the inverse CDF transform: if \(U\sim Unif(0,1)\), then \(F^{-1}(U)\sim F\). We can sample a single dataset of size \(n\) from a \(Unif(0,1)\) distribution, and transform it multiple times, once for each parameter value we want to plot.
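As a quick illustration of this fact (using the standard normal only as an example distribution), plugging uniform draws into qnorm, the inverse CDF of the \(N(0,1)\) distribution, gives draws whose empirical quantiles match those of rnorm:

# Inverse CDF transform in action: qnorm(U) with U ~ Unif(0,1) is distributed N(0,1)
u_sketch <- runif(10000)
round(quantile(qnorm(u_sketch), probs = c(.1, .25, .5, .75, .9)), 2)  # inverse-CDF draws
round(quantile(rnorm(10000),    probs = c(.1, .25, .5, .75, .9)), 2)  # direct draws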

For the \(Binom(n,\theta)\) distribution, the inverse CDF transformation is achieved by thresholding, as follows:

  1. Generate \(U_{1},\ldots,U_{n}\overset{IID}{\sim}Unif(0,1)\)
  2. Let \(Y = X_{1} + \cdots + X_{n}\) where \(X_{i} = 1(U_{i} \leq \theta)\)

We can sample a single uniform random sample, then apply the above transformation for multiple values of \(\theta\).
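Before using this inside the likelihood simulation, here is a standalone sanity check (the names ending in _check are just for this sketch) that thresholding uniforms at \(\theta\) and summing really does produce \(Binom(n,\theta)\) counts:

# Sanity check: frequencies of the thresholded-and-summed uniforms
# should match the Binomial(n, theta) pmf
n_check <- 5; theta_check <- 0.3; B_check <- 10000
y_check <- replicate(B_check, sum(runif(n_check) <= theta_check))
rbind(simulated = as.numeric(table(factor(y_check, levels = 0:n_check))) / B_check,
      exact     = dbinom(0:n_check, n_check, theta_check))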

set.seed(878976)
# Function to transform U into Y
transform_u_to_y <- function(u,theta) {
  sum(u <= theta)
}
# Simulate the likelihood function
simulate_binomial_likelihood <- function(B,n,sumx) {
  # B: how many simulations to do
  # n: parameter of binomial
  # sumx: observed suff. stat.
  
  thetagrid <- seq(0.01,0.99,by=0.01)
  loglikvec <- numeric(length(thetagrid))
  names(loglikvec) <- thetagrid
  # Repeatedly sample uniforms and add up the results
  for (b in 1:B) {
    # Sample a uniform
    u <- runif(n)
    # Transform for each theta
    for (theta in thetagrid) {
      #cat(transform_u_to_y(u,theta),"\n")
      #cat(loglikvec[str_c(theta)],"\n")
      loglikvec[str_c(theta)] <- loglikvec[str_c(theta)] + (transform_u_to_y(u,theta) == sumx)
    }
  }
  
  # Return the loglikvec, divided by B, logged
  log(loglikvec / B)
}

data_frame(theta = seq(.01,.99,by=.01),
           ll = simulate_binomial_likelihood(10000,5,2)) %>%
  ggplot(aes(x = theta,y = ll,group = 1)) +
  theme_classic() +
  geom_line() +
  scale_x_continuous(breaks = seq(0,1,by=.1)) +
  scale_y_continuous(breaks = seq(0,-15,by=-1)) +
  labs(title = "Bernoulli Empirical Log-Likelihood for 5 Simulated Coin Flips",
       subtitle = "Relative frequency with which each value of theta generated a sample of 2 heads in 5 flips",
       x = "Theta (prob. of heads)",
       y = "Empirical Log Likelihood")

Normal(\(\mu,1\))

Next let’s look at an example where the underlying distribution is continuous. In the previous case (and in all cases we will consider), the likelihood is a continuous function of \(\theta\) on the interior of the parameter space; whether the data are discrete or continuous doesn’t matter, because the likelihood is a function of the parameter, not of the data. The only change to our example will be a slight modification of the simulation to account for the fact that the sufficient statistic is now continuous, and so takes on any particular value with probability zero. The math remains essentially the same.

If \(X_{i} \overset{IID}{\sim} N(\mu,1)\) is a random sample from a normal distribution with mean \(\mu\) and variance \(1\), the log-likelihood function (with constants not depending on \(\mu\) removed) is \[ \ell(\mu) = -\frac{1}{2}\sum_{i=1}^{n}\left(x_{i} - \mu\right)^{2} \] Let’s do as we did before: take a random sample of size \(n = 5\) and plot the log-likelihood:

set.seed(4632789)
n <- 5
mutrue <- 0
samp <- rnorm(n,mutrue,1)
cat("Actual generated sample: ",samp,"\n")
## Actual generated sample:  -2.360988 -1.25877 0.8782886 0.2488446 -0.4769181
# Plot the log-likelihood
normal_ll <- function(mu,x) {
  -(1/2)*sum((x - mu)^2)
}

normal_ll_v <- function(mu,x) {
  out <- numeric(length(mu))
  for (i in 1:length(mu)) {
    out[i] <- normal_ll(mu[i],x)
  }
  out
}

data_frame(mu = seq(-3,3,by=.01),
           ll = normal_ll_v(mu,x = samp)) %>%
  ggplot(aes(x = mu,y = ll,group = 1)) +
  theme_classic() +
  geom_line() +
  geom_vline(xintercept = mean(samp),colour = "purple") +
  geom_vline(xintercept = mutrue,colour = "red") +
  scale_x_continuous(breaks = seq(-3,3,by=.5)) +
  scale_y_continuous(breaks = seq(0,-40,by=-5)) +
  labs(title = "Normal Log-Likelihood for 5 Simulated Datapoints",
       subtitle = "Red line = true mean, purple line = maximum likelihood estimate",
       x = "Mu (mean of X)",
       y = "Log Likelihood")

Likelihood has the same interpretation here: the relative frequency with which each value of \(\mu\) would have generated the observed data, or more simply, data with the observed value of a sufficient statistic. For the normal distribution, we can unpack the form of the likelihood to find that \(T(X) = \sum_{i=1}^{n}X_{i}\) is sufficient for \(\mu\). As in the Bernoulli case, we can simulate to investigate, for each value of \(\mu\) we try, the relative frequency with which we get \(\sum_{i=1}^{n}x_{i} = -2.97\) in a random sample of size 5 from a normal distribution with that mean. We have to modify this slightly to account for the fact that the simulated values of \(\sum_{i=1}^{n}x_{i}\) will never exactly equal what we observed, because the underlying data are continuous; we’ll look at a small neighbourhood around the observed value instead.

simulate_norm <- function(mu,n=5,sumx=round(sum(samp),2),eps=0.01) {
  # Simulate a bunch of experiments with n samples from a normal(mu,1) distribution
  # Return the log of the relative frequency with which the sum of the sample is within eps distance of sumx
  B <- 100000 # Number of simulations to do
  experiment_within_eps_distance <- numeric(B) # Record whether the sum of the sample is within eps of sumx
  for (b in 1:B) {
    experiment_within_eps_distance[b] <- as.numeric(abs(sum(rnorm(n,mu,1)) - sumx) < eps)
  }
  
  log(mean(experiment_within_eps_distance))
}

simulate_norm(-.59)
## [1] -5.649294
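Since the sum of \(n\) IID \(N(\mu,1)\) variables is \(N(n\mu,\, n)\), the probability being estimated above is approximately \(2\epsilon\) times the density of that sum at the observed value, which gives a closed-form number to compare the simulation against. A sketch (the helper name exact_norm_check is just for this illustration, and it reuses samp from above):

# Closed-form approximation to the simulated quantity: the sum of n iid N(mu,1) draws
# is N(n*mu, n), and P(|S - sumx| < eps) is approximately 2*eps times its density at sumx
exact_norm_check <- function(mu, n = 5, sumx = round(sum(samp),2), eps = 0.01) {
  log(2*eps) + dnorm(sumx, mean = n*mu, sd = sqrt(n), log = TRUE)
}
exact_norm_check(-.59)  # compare to simulate_norm(-.59) above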

We perform this for various values of \(\mu\) and plot the log-likelihood:

simulate_norm_v <- function(mu,n=5,sumx=round(sum(samp),2),eps=0.01) {
  out <- numeric(length(mu))
  for (i in 1:length(mu)) {
    #cat("Simulating for mu = ",mu[i],"\n")
    out[i] <- simulate_norm(mu[i],n,sumx,eps)
  }
  out
}

data_frame(mu = seq(-3,3,by=.05),
           ll = simulate_norm_v(mu)) %>%
  ggplot(aes(x = mu,y = ll,group = 1)) +
  theme_classic() +
  geom_line() +
  geom_vline(xintercept = mean(samp),colour = "purple") +
  geom_vline(xintercept = mutrue,colour = "red") +
  scale_x_continuous(breaks = seq(-3,3,by=.5)) +
  scale_y_continuous(breaks = seq(-10,-40,by=-2)) +
  labs(title = "Normal Empirical Log-Likelihood for 5 Simulated Datapoints",
       subtitle = "Red line = true mean, purple line = maximum likelihood estimate",
       x = "Mu (mean of X)",
       y = "Log Likelihood")