# Load the tidyverse packages
suppressMessages({
  suppressWarnings({
    library(tidyverse)
  })
})

Rice (2006) Chapter 8.4, Example C, page 263: data is available from the textbook author’s website on the amount of precipitation (“rainfall” to the layman!) during 227 storms in Illinois from 1960 - 1964. We can read these data into R:

path <- "https://raw.githubusercontent.com/awstringer1/leaf2018/gh-pages/datasets/"
rainfall <- readr::read_csv(stringr::str_c(path,"illinois60.txt"),col_names = "rainfall") %>%
  bind_rows(readr::read_csv(stringr::str_c(path,"illinois61.txt"),col_names = "rainfall")) %>%
  bind_rows(readr::read_csv(stringr::str_c(path,"illinois62.txt"),col_names = "rainfall")) %>%
  bind_rows(readr::read_csv(stringr::str_c(path,"illinois63.txt"),col_names = "rainfall")) %>%
  bind_rows(readr::read_csv(stringr::str_c(path,"illinois64.txt"),col_names = "rainfall"))
## Parsed with column specification:
## cols(
##   rainfall = col_double()
## )
## Parsed with column specification:
## cols(
##   rainfall = col_double()
## )
## Parsed with column specification:
## cols(
##   rainfall = col_double()
## )
## Parsed with column specification:
## cols(
##   rainfall = col_double()
## )
## Parsed with column specification:
## cols(
##   rainfall = col_double()
## )
glimpse(rainfall)
## Observations: 227
## Variables: 1
## $ rainfall <dbl> 0.020, 0.001, 0.001, 0.120, 0.080, 0.420, 1.720, 0.05...

We wish to fit a probability model to these data. Let’s first find a suitable family of distributions using common sense and trial and error. We’ll then use the method of moments, as learned in class, to estimate the parameters of our chosen famliy from these data.

We can plot a histogram of the data and overlay the empirical density curve:

rainplot <- rainfall %>%
  ggplot(aes(x = rainfall)) +
  theme_classic() + 
  geom_histogram(aes(y = ..density..),colour = "black",fill = "#ff9933",bins=50) +
  labs(title = "Emprical Distribution of Rainfall",
       subtitle = "Illinois rainfall dataset, 227 storms from 1960 - 1964",
       x = "Amount of Rainfall (inches)",
       y = "Density")

rainplot + geom_density(colour = "red")

We see a large amount of right-skew, and a spike around 0. We should start with the simplest distribution that we can think of with this shape- an \(Exponential(\lambda)\) might do the trick.

To estimate \(\lambda\), we can find (exercise) the method of moments estimator to be \[ \hat{\lambda} = \bar{X} \] the sample mean. For these data,

lambda_exponential <- mean(rainfall$rainfall)
print(lambda_exponential)
## [1] 0.2243921

To summarize: we are proposing that the rainfall observed in this dataset was generated by nature according to an \(Exponential(0.22)\) random process. Does this proposal match the observed data? Let’s check visually:

# Note: dexp() in R uses the "rate" parameterization of the exponential distribution
rainplot +
  geom_line(data = data_frame(x = seq(0,2.5,by=0.01),
                              y = dexp(x,1 / lambda_exponential)),
            mapping = aes(x = x,y = y,group = 1),
            colour = "blue") +
  labs(title = "Exponential Model for Rainfall Data")

This seems to systematically overestimate rainfall; the fit isn’t great. Why? We got the skew of the distribution right, however the exponential distribution is too inflexible for these data. Only one parameter governs the shape of the blue curve above, and it just doesn’t match the shape of our data. To see this another way: recall if \(X \sim Exponential(\lambda)\) then \(E(X) = \lambda\) (a fact we used to derive the Method of Moments estimator above), but also, \(Var(X) = \lambda^{2}\). Does this match our data?

var(rainfall$rainfall)
## [1] 0.1338252
lambda_exponential^2
## [1] 0.0503518

This is another way of showing that our chosen probability distribution is not flexible enough: we estimate the mean well, but we are off by almost \(3\times\) in estimating the variance.

We can try what was originally suggested by the textbook: a \(Gamma(\alpha,\lambda)\) distribution with density \[ f(x;\lambda,\beta) = \frac{\lambda^{\alpha}}{\Gamma{(\alpha})} x^{\alpha-1}\exp\left( -\lambda x\right) \] This has the property that \(Gamma(1,\lambda) \overset{d}{=} Exponential(\lambda)\) (with the rate parametrization), so it can be viewed as a direct extension of our previous model. The first two moments are \[ \begin{aligned} E(X) &= \frac{\alpha}{\lambda} \\ E(X^{2}) &= \frac{\alpha(\alpha+1)}{\lambda^{2}} \end{aligned} \] Solving these, we get the method of moments estimators: \[ \begin{aligned} \hat{\lambda} &= \frac{\bar{X}}{s^{2}} \\ \hat{\alpha} &= \frac{\bar{X}^{2}}{s^{2}} &= \bar{X}\times\hat{\lambda} \end{aligned} \] where \(s^{2}\) is the sample variance. We see that this probability model allows for the estimated variance to change with the mean.

Let’s compute these for our data and fit the resulting curve:

alpha_gam <- mean(rainfall$rainfall)^2 / var(rainfall$rainfall)
alpha_gam
## [1] 0.3762506
lambda_gam <- mean(rainfall$rainfall) / var(rainfall$rainfall)
lambda_gam
## [1] 1.676755
rainplot +
  geom_line(data = data_frame(x = seq(0,2.5,by=0.01),
                              y = dgamma(x,shape = alpha_gam,rate = lambda_gam)),
            mapping = aes(x = x,y = y,group = 1),
            colour = "blue") +
  labs(title = "Gamma Model for Rainfall Data")

We see the fit is much better visually, and we know from our construction of the estimators that we captured the mean-variance relationship well.