This site hosts materials created by Alex Stringer in support of the LEAF project.
GitHub repository: https://github.com/awstringer1/leaf2018
Instructor Tutorials:
Student Tutorials:
Lecture Supplements:
learnr tutorials:
Shiny Apps:
Datasets:
Scroll to the bottom of the page for a table containing some useful datasets for use in class.
Tutorials for instructors wishing to integrate computation into their courses.
Worked example from Horton (2013): probability problem with empirical and analytical solution, and discussion
Worked example from Horton (2013): computing a CDF and sampling from a distribution
Examples of materials that can be worked on with students as part of a course.
Materials that are designed to walk students through a statistical concept and the associated R skills for implementing it. These are distinguished from the Lecture Supplements below by the detail in which the R code is covered: the Student Tutorials are designed to teach R, while the Lecture Supplements are mostly examples of using R.
R concepts covered: simulation, loops, ggplot
ggplot to compare empirical and theoretical distributions (Methods)
Predictive Modelling - Data Processing
Predictive Modelling - Building and Evaluating Models
R concepts covered: reading in data from delimited flat files; basic joins; feature engineering and preprocessing; basic prediction and evaluation; building and evaluating more complex models
R and data analysis skills at the level of STA302 should be sufficient
lme4 and Bayesian LME models using rstanarm are fit and evaluated
ggplot and purrr::map (Methods)
Introduction to Bayesian Statistics
R concepts covered: matrix algebra, working with list-comprehension (in place of loops), plotting with ggplot, reading in simple real-world data from flat files
R programming; 2nd year-level calculus, probability and matrix algebra
R to efficiently perform tedious tasks, like fitting many models or making many plots, using code that is cleaner and faster than native R loops; data reading and manipulation (Computational Thinking)
Materials that can be used during lecture as a supplement to traditional slides and blackboard writing. Each item has an example of a course in which it could be used, with bold indicating a course in which it has been used. Also included are example Intended Learning Outcomes that the material might relate to. These are aligned with the Statistics Undergraduate Program Learning Outcomes.
Normal Approximation to Binomial
R to investigate theoretical results (Computational Thinking)
ggplot (Methods); interpret plots in the context of a problem and decide on further analysis (Real-World Problems)
Central Limit Theorem: application in analyzing roundoff error
R to investigate applied problems (Computational Thinking)
ggplot (Methods)
Fitting a Gamma distribution to rainfall data via Method of Moments
R, merge them, and evaluate the integrity of the resulting data (Computational Thinking)
ggplot (Methods); interpret plots in the context of a problem and decide on further analysis (Real-World Problems)
Simulating Likelihood Functions
Sampling Distributions of Estimators
Sampling Distributions of Likelihood-derived Quantities
R’s built-in optimization routines.
ggplot() (Methods)
Binomial and Logistic Regression
R outputs certain summary statistics, it is the responsibility of the Statistician to know when they should and should not be used/reported, based on the theory behind their calculation (Theory)
ggplot() (Methods)
ggplot() (Methods)
Hosted tutorials on course concepts written using the learnr package in R.
(STA261): Learnr tutorial on Law of Large Numbers and simulating random variables in R
(STA303): Learnr tutorial involving fairly complete analysis of a textbook dataset
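The Law of Large Numbers tutorial listed above is built around simulating random variables in R. As a rough illustration of that idea (a minimal sketch, not taken from the tutorial itself), the following base R code tracks the running sample mean of simulated Exponential draws. The seed, sample size, rate, and variable names are all assumptions chosen for illustration.

```r
# Illustrative sketch only; not the code used in the hosted tutorial.
set.seed(2018)                 # arbitrary seed, for reproducibility

n <- 10000                     # number of simulated draws (assumed)
rate <- 2                      # Exponential rate (assumed); true mean is 1 / rate
true_mean <- 1 / rate

x <- rexp(n, rate = rate)      # simulate n independent Exponential(rate) draws

# Sample mean after 1, 2, ..., n draws
running_mean <- cumsum(x) / seq_len(n)

# Inspect the running mean at a few sample sizes; by the Law of Large
# Numbers it should settle near true_mean as the number of draws grows.
checkpoints <- c(10, 100, 1000, 10000)
print(data.frame(draws = checkpoints,
                 running_mean = running_mean[checkpoints],
                 true_mean = true_mean))
```

Plotting `running_mean` against the number of draws (for example with `ggplot`) makes the convergence easier to discuss in class than the printed table does.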
Below is a collection of readily available datasets that instructors can use for examples, assignments, and tests. The table includes a description of the dataset, the source, and key features/suggested uses. When the dataset is from an R package, the documentation within that package will provide a qualitative description of the data; here I focussed on only the statistical qualities of the data (e.g. how many variables, data types, and so on) to make it easy to browse the list for a dataset that fits your particular needs. In cases when the dataset is not from an already-documented R package, a bit more context is provided.
Dataset | Source | Features | Suggested Uses | Comments/Notes |
---|---|---|---|---|
Rossman Store Sale Data | Kaggle; stored also on github | Medium-complexity predictive analytics problem. Repository contains two datasets with a many-to-one mapping between them, good for basic merging concepts | Predictive modelling | |
TTC Subway Ridership Data | Obtained from Open Data Toronto, stored on github; see this tutorial for an example of reading the data into R | Contains inflow and outflow rider counts for one “typical” weekday, for each of the TTC’s subway stations | Basic summary statistics and plots | There are 74 rows in the data and only 69 stations; the transfer stations Bloor/Yonge, Sheppard/Yonge, Kennedy, St. George and Spadina are each counted as two stations |
Average temperature in Ann Arbor, Michigan | data(aatemp); R package faraway | Two variables, temperature and year, for 115 years ranging from 1854 to 2000 | Simple linear regression | Model assumptions satisfied nicely; point estimate of slope is small but statistically significant |
Rainfall in Illinois storms | From Rice, Mathematical Statistics and Data Analysis. Available freely from his webpage, or from here, datasets Illinois60.txt - Illinois64.txt. See this course material for commands to read the data into R. | Univariate dataset with measurements of rainfall in inches from 227 Illinois storms | Fitting basic probability models; illustrating application of statistical tests | Dataset is heavily right-skewed; it looks exponential but an exponential distribution is not flexible enough. A gamma distribution fits well. Can be used to show a likelihood ratio test of shape = 1 for a gamma distribution |
Beeswax melting point data | From Rice, Mathematical Statistics and Data Analysis. Available freely from his webpage, or from here, dataset beeswax.txt. See this course material for commands to read the data into R. | One continuous response, one continuous predictor | Univariate gaussian curve fitting; simple linear regression | Gaussian distribution fits melting point well; linear regression of melting point on hydrocarbons gives adequate fit with assumptions met |
PVC Operator Data | data(pvc); R package faraway | Continuous outcome and two discrete covariates, a 3x8 balanced full factorial design with 2 replicates in each group | Basic two-factor ANOVA | Model assumptions satisfied nicely; both main effects are significant but no significant interaction; good for introducing simple data analytic principles like plotting basic relationships between variables |
Warpbreaks data | data(warpbreaks); R package faraway | Continuous outcome and two discrete covariates, 3x2 balanced full factorial with 9 replicates per group | Two way ANOVA | Response requires transformation to satisfy the normality assumption for ANOVA; a Box-Cox analysis shows that the log transformation does a good job of this. Interaction plots show a clear interaction between the two factors, which is only marginally significant at the .05 level in an ANOVA on the logged response. This is good for discussing practical vs statistical significance, and some reasons why what is “obvious” from the plots might not show up as significant in the accompanying inferential procedure, and hence why both are important |
orings data | data(orings); R package faraway | Binomial response (# orings damaged) and one continuous covariate (temperature on launch day) | Binomial GLM | This is a classic dataset used to introduce binomial regression models. The group size is small which facilitates discussion about model assumptions and distributional approximations. The temperature on the day of launch when the Challenger shuttle exploded was well outside the range of observed temperatures in the dataset, which facilitates discussion about extrapolation |
Wisconsin Breast Cancer Data | data(wbca); R package faraway | Binary response, 9 continuous covariates | Logistic regression, variable selection | Good small dataset to introduce logistic regression; possible to build a simple and well-performing predictive model; some correlation among the covariates, good for teaching variable selection |
Galapagos Islands species data | data(gala); R package faraway | Count response, 5 continuous covariates | Count regression | Some covariates (e.g. Area) do well with a log transformation; good for variable selection; most of the observed counts are actually large enough that a normal linear model looks reasonable, except looking closer reveals that the constant variance assumption is violated, hence a log transformation on the response or a count regression are appropriate; the mean-variance relationship imposed by a Poisson regression is too restrictive, overdispersion is present |
Salmonella data | data(salmonella); R package faraway | Count response, one discrete covariate with 6 levels | Count regression, dose-response model | Simple example of a variable transformation improving model fit. The log-linear model of dose fits the data poorly; transforming dose to log(dose + 1) improves the fit |
Smoking mortality data | data(femsmoke); R package faraway | Count response and three discrete covariates (smoking status, age group, whether or not they died); example of a 3-way table | Intermediate contingency table analysis | Marginalizing over age leads to the conclusion that smokers are less likely to die, because in these data there are more young smokers than old smokers. Good for discussion of both survivorship bias (why are there fewer old smokers? Are they dying before study inclusion?) and confounding variables: if age weren’t measured, we would erroneously conclude that smoking improves health if we interpreted this study causally |
Hair/Eye colours | data(haireye); R package faraway | 4 x 4 table of counts of combinations of hair and eye colours | Simple example of a contingency table that goes a bit beyond the 2 x 2 case; good for an example of a test with more than 1 degree of freedom | Hair and eye colour end up showing evidence of not being independent, and the effects of some of the smaller categories (red hair, green eyes, etc) on this conclusion can be discussed |
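
As one example of how a row of the table might be turned into a classroom analysis, the sketch below follows the Warpbreaks entry: it checks the balanced design and fits a two-way ANOVA with interaction on the log-transformed response. It is a hedged illustration rather than code from any of the linked materials. It assumes the copy of `warpbreaks` that ships with base R's `datasets` package, and the object names (`cell_counts`, `fit`, `anova_table`) are made up for this example.

```r
# Illustrative sketch; uses the warpbreaks data from base R's datasets package.
data(warpbreaks)

# Check the design: every wool x tension cell should have the same
# number of replicates if the design is balanced.
cell_counts <- table(warpbreaks$wool, warpbreaks$tension)
print(cell_counts)

# Two-way ANOVA with interaction, fit to the log of the response
# (the transformation suggested in the table's notes).
fit <- lm(log(breaks) ~ wool * tension, data = warpbreaks)
anova_table <- anova(fit)
print(anova_table)
```

The `wool:tension` row of the printed ANOVA table is the one to compare against an interaction plot when discussing practical versus statistical significance.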