In this tutorial we will learn how to use R and RStudio for basic data analysis. We are going to work with R code directly, with .R scripts, and with .RMarkdown documents. We will see examples of concepts and then you will complete short exercises on your own.
First, let’s get comfortable with RStudio. R is the underlying programming language: you type code and it does computations for you. RStudio is an Integrated Development Environment (IDE), a program that makes it easier for you to program in R. It lets you type code and run it, manage your data, publish documents, and a whole lot else. Let’s take a moment to get familiar with it.
To run any of the code in this document, highlight the part you want to run and press Cmd+Enter (Mac) or Ctrl+Enter (Windows). You can also click the “Play” button (the little sideways triangle) at the top right corner of each “chunk” (grey box) of code. I would recommend getting out of the habit of using the mouse and into the habit of using the keyboard, to make your workflow faster and more fluid.
You must install the tidyverse package before starting. Type install.packages("tidyverse") and wait a while. Then type library(tidyverse) to load the package so you can use its functions. A package is like a little magic box containing functions and data. Other people (like me!) create R packages that contain code that they have written, and they host them on the internet in a common repository. Users (like you!) can then easily download the package and run the code. Packages have documentation that tells you what’s in them and how to use it. Google “tidyverse documentation” to see what I mean.

I always load the tidyverse package, which is a special package that loads a bunch of other useful packages. I will proceed assuming you’ve done this, and you shouldn’t worry too much about it when you’re just getting started.
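Putting those two steps together (the first line only needs to be run once; the second goes at the top of every new R session):

# Install the tidyverse from CRAN (only needs to be done once):
install.packages("tidyverse")
# Load it so its functions are available in the current session:
library(tidyverse)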
Scientific computing refers to the use of a computer to do complicated calculations. This is useful for your studies, and it is also a good way to get introduced to R.

As a motivating example, we are going to compute the \(2^{nd}\)-order Taylor expansion of a simple trigonometric function. This will introduce you to the following concepts: creating variables, vectors, and functions, and plotting with ggplot2.
Recall from calculus that a sufficiently smooth function \(f(x)\) can be written as a power series: \[\begin{equation} f(x) = \sum_{k=0}^{\infty}\frac{f^{(k)}(x_{0})(x - x_{0})^{k}}{k!} \end{equation}\] where \(x_{0}\) is a reference point and \(f^{(k)}\) is the \(k^{th}\) derivative of \(f\). This representation is exact when the sum is infinite.
However, you can truncate the sum at some finite \(N\) to get a polynomial approximation to \(f\), called a Taylor series approximation. It is common to take \(N = 1\) or \(2\), to get a linear or quadratic approximation. These are used all the time in applied science, and extensively in statistics.
For our purposes, let’s take \(f(x) = \cos(x)\) and \(x_{0} = 0\). Its derivatives are \[\begin{equation}\begin{aligned} f^{(1)}(x) &= -\sin(x) \\ f^{(2)}(x) &= -\cos(x) \\ \end{aligned}\end{equation}\]
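Plugging these derivatives into the truncated series with \(N = 2\) and \(x_{0} = 0\) gives the quadratic approximation \[\begin{equation} f(x) \approx \cos(0) - \sin(0)\,x - \frac{1}{2}\cos(0)\,x^{2} = 1 - \frac{x^{2}}{2}. \end{equation}\] We will now build exactly this approximation in R.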
How can we use R to compute the Taylor series approximation to \(\cos(x)\)? We need to tell R about the initial point \(x_{0}\), tell R about \(f(x)\) and its first two derivatives, and tell R how to compute the value of the approximation for any \(x\).

To tell R about the initial point \(x_{0} = 0\), we create a variable called x0 and assign the value \(0\) to it. We can do this as follows:
# Create a variable simply by assigning it a value
x0 <- 0
This code uses the assignment operator <- to create a variable x0 and assign it the value 0.
The text next to the # sign is called a comment. You can use the # to put helpful text inside your program, so other users (including future you!) know what your program is doing. This is a fundamentally important practice.
You can check that your variable was defined either by listing all the variables currently defined:
ls() # ls == "list"
## [1] "x0"
or by specifically checking if the variable x0 exists:
exists("x0")
## [1] TRUE
or by typing the name of your variable and seeing what prints out:
x0
## [1] 0
This last way also tells you the value of the variable.
Exercise: create a variable called y which takes the value \(2\). Check that it exists.
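One possible solution (not the only way to check):

y <- 2      # Create the variable y and assign it the value 2
exists("y") # Should print TRUE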
So what if you had more than one number you wanted to store? You could create more than one variable, like y1 <- 1; y2 <- 2, but this becomes cumbersome quickly. Instead, you can store many numbers in a single variable by using a vector. The c() (for “concatenate”) function in R is used for this:
y1 <- c(1,2,3,4,5) # Create a vector
y1 # See what it looks like
## [1] 1 2 3 4 5
y1[1]
## [1] 1
y1[2] # Access individual elements
## [1] 2
y1[3:5] # Access a subset of elements
## [1] 3 4 5
length(y1) # How many elements?
## [1] 5
You can also assign values to existing variables:
y1 # Old value
## [1] 1 2 3 4 5
y1 <- c(6,7,8) # Completely new value
y1
## [1] 6 7 8
y1[2] <- 10 # Can assign to a subset
y1
## [1] 6 10 8
Exercise: create a vector containing the values 1:10. Change the fifth element to 20. Print it out and make sure it’s correct.
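One possible solution, using a made-up variable name y2:

y2 <- 1:10  # The colon operator creates the sequence 1, 2, ..., 10
y2[5] <- 20 # Replace the fifth element
y2          # Should print 1 2 3 4 20 6 7 8 9 10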
Now that you can store numbers in R, you want to be able to do things to those numbers. You do this using functions.

Functions are actually themselves “values” that you assign to variables, in the same way as numbers and vectors. And to create a function, you use the function() function. Confused? Check it out:
# Create a function to multiply a number by 2:
multiply_by_2 <- function(x) {
  x * 2
}
# multiply_by_2 <- function(x) x*2
# Call your function:
multiply_by_2(2)
## [1] 4
# Call your function on a pre-defined variable:
multiply_by_2(x0)
## [1] 0
# Many operations also work on vectors:
multiply_by_2(y1)
## [1] 12 20 16
That’s it! There are a few things to note:
Functions take zero or more arguments. Here there is one argument called x. You can add more arguments or have no arguments. Arguments can be any type of R object, and unlike other programming languages you don’t have to pre-specify what they are.

Functions may or may not return a value. If the last statement in a function is an expression like x * 2, then the function returns that value. You can also use return() to return a value.

Functions can call other functions. In this example, * is itself a function!
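To illustrate the first two points, here is a small sketch of a function with two arguments that uses return() explicitly (the name add_and_double is made up for this example):

# A function with two arguments and an explicit return():
add_and_double <- function(x, y) {
  result <- (x + y) * 2
  return(result)
}
add_and_double(1, 3) # Returns 8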
Let’s create functions to represent \(f(x)\) and its first two derivatives:
f <- function(x) cos(x)
f1 <- function(x) -sin(x)
f2 <- function(x) -cos(x)
f(c(-pi/2,0,pi/2)) # pi is a variable provided by R
## [1] 6.123234e-17 1.000000e+00 6.123234e-17
f1(c(-pi/2,0,pi/2))
## [1] 1 0 -1
f2(c(-pi/2,0,pi/2))
## [1] -6.123234e-17 -1.000000e+00 -6.123234e-17
Exercise: create a function that adds 2 to a number.
function_that_adds_2 <- function(x) {
  x + 2
}
Exercise: create a function \(g(x)\) that computes the tangent of an angle, \(\tan(x) = \sin(x)/\cos(x)\). What happens if you call \(g(\pi/2)\)?
g <- function(x) sin(x) / cos(x)
g(0)
## [1] 0
g(pi/2)
## [1] 1.633124e+16
tan(0)
## [1] 0
tan(pi/2)
## [1] 1.633124e+16
Now let’s use our new knowledge of functions and vectors to compute the Taylor expansion of \(f(x) = \cos(x)\) to second order.
The second order Taylor expansion of \(f(x)\) is: \[\begin{equation} f(x_{0}) + f^{(1)}(x_{0})(x - x_{0}) + \frac{1}{2}f^{(2)}(x_{0})(x - x_{0})^{2} \end{equation}\]
# Create a function that computes the Taylor expansion of f(x) at point x0
taylorfunction <- function(x) {
  f(x0) + f1(x0) * (x - x0) + (1/2)*f2(x0) * (x - x0)^2
}
# Try it out
f(0)
## [1] 1
taylorfunction(0)
## [1] 1
f(x0)
## [1] 1
taylorfunction(x0)
## [1] 1
f(0.5)
## [1] 0.8775826
taylorfunction(0.5)
## [1] 0.875
# and so on
Exercise: \(f^{(3)}(x) = \sin(x)\). Can you add a third order of approximation to the above, and can you change the centre of the approximation from \(x_{0} = 0\) to \(x_{0} = \pi/2\)? Create a new function taylorfunction3rd that approximates \(f(x)\) to third order about the point \(x_{0} = \pi/2\). Formula: \[\begin{equation}
f(x_{0}) + f^{(1)}(x_{0})(x - x_{0}) + \frac{1}{2}f^{(2)}(x_{0})(x - x_{0})^{2} + \frac{1}{6}f^{(3)}(x_{0})(x - x_{0})^{3}
\end{equation}\]
# Third order Taylor approx
f3 <- function(x) sin(x)
x0 <- pi/2
taylorfunction3rd <- function(x) {
  f(x0) + f1(x0) * (x - x0) + (1/2)*f2(x0) * (x - x0)^2 + (1/6)*f3(x0) * (x - x0)^3
}
taylorfunction3rd(x0)
## [1] 6.123234e-17
f(x0)
## [1] 6.123234e-17
taylorfunction3rd(1)
## [1] 0.5398013
f(1)
## [1] 0.5403023
We should assess our approximation visually. To do this we will use ggplot2 to plot \(f(x)\) and its approximation.

ggplot2 is a layer-based graphics system. It’s best to just learn it by trial and error: see dozens or hundreds of examples and eventually figure out how to search for what you want to do. That’s how I learned it, anyways. But I’ll try to explain.

You build up a plot in layers. First you add data; then you map variables to aesthetic elements of the plot like the “x-axis” or the “colour”; then you tell it what to draw (points, lines, bars, etc.); then you annotate the plot with axis labels and other things. It looks like this:
# First, start by saving a blank plot to a variable:
# Give (-5,5) as the data; this tells ggplot where to draw the function to and from
# aes(): tell ggplot that the variable "x" goes on the x-axis.
plt <- tibble(x = c(-5,5)) %>% ggplot(aes(x = x))
# What does this look like?
plt
# Not much. Add a blank theme:
plt <- plt + theme_classic()
plt
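# Or try a different complete theme; adding a new complete theme
# replaces the previous one's settings: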
plt <- plt + theme_bw()
plt
# Now add a line for the true function f(x)
plt <- plt + geom_line(stat = "function",fun = f) # stat = "function" tells geom_line() to draw a function
plt
# Now add a line for the Taylor approximation
plt <- plt + geom_line(stat = "function",fun = taylorfunction,colour = "blue",linetype = "dotdash")
plt
# Finally, add some axis labels and a title
plt <- plt + labs(title = "Taylor approximation (blue) to f(x) = cos(x) (black) about x0 = 0",
                  x = "x",y = "f(x) and approximation")
plt
You don’t have to do everything step-by-step like that in practice. You can do it all at once:
tibble(x = c(-5,5)) %>%
  ggplot(aes(x = x)) +
  theme_bw() +
  geom_line(stat = "function",fun = f) +
  geom_line(stat = "function",fun = taylorfunction,colour = "blue",linetype = "dotdash") +
  labs(title = "Taylor approximation (blue) to f(x) = cos(x) (black) about x0 = 0",
       x = "x",y = "f(x) and approximation") +
  coord_cartesian(ylim = c(-1,1))
Notice how I set the y-limits of the plot to better capture the range I wanted to see.
There are a couple of confusing bits in this code:

A tibble is a type of object that stores data. You’ll see this again in the data analysis section of this workshop.

The %>% operator is called the “pipe”. It takes whatever is on its left side and “pipes” it in as the first argument to the function on its right side. So f(x) is the same as x %>% f(). It is a very common and useful part of R programming.
f(x0)
## [1] 6.123234e-17
x0 %>% f()
## [1] 6.123234e-17
# More complicated?
f(f1(f2(f3(x0))))
## [1] 0.8705904
x0 %>% f3() %>% f2() %>% f1() %>% f()
## [1] 0.8705904
The + sign is being used in a weird way here. Usually you would type x + y to add together the numbers x and y. ggplot2 uses the + sign to add together parts of the plot. You can just do this without thinking too much about it, but it’s weird when you first see it, I think.

Exercise: add your taylorfunction3rd function to the above plot. Change the colour to "red" or your favourite alternative. Here is what I got:
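One way to do it, sketched below (keep in mind that taylorfunction3rd was built about \(x_{0} = \pi/2\), and that re-assigning x0 above also affects taylorfunction, so your picture may differ from mine):

tibble(x = c(-5,5)) %>%
  ggplot(aes(x = x)) +
  theme_bw() +
  geom_line(stat = "function",fun = f) +
  geom_line(stat = "function",fun = taylorfunction,colour = "blue",linetype = "dotdash") +
  geom_line(stat = "function",fun = taylorfunction3rd,colour = "red",linetype = "dashed") + # New layer
  labs(title = "Taylor approximations (blue: 2nd order, red: 3rd order) to f(x) = cos(x) (black)",
       x = "x",y = "f(x) and approximations") +
  coord_cartesian(ylim = c(-1,1))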
R is a programming language that is optimized for interactive data analysis. In this section of the tutorial we are going to analyze data using R.

We are going to analyze the “Old Faithful” geyser data, a famous dataset of eruption times from a geyser in Yellowstone National Park in Wyoming. The data is actually available in R, but for illustration purposes I saved it to a text file which you can get directly from my GitHub.

The data is in “comma-separated value” (CSV) format, which means it’s a text file where variables are separated by commas. Take a look at the raw data at the URL used in the code below.

Here is how you read in the data.
# The "readr" package is used to read in data.
faithful <- read_csv(
  file = "https://raw.githubusercontent.com/awstringer1/ssu-r-workshop/master/data/faithful.txt",
  col_names = TRUE,
  col_types = c("nn")
)
# Take a look at it:
glimpse(faithful)
## Rows: 272
## Columns: 2
## $ eruptions <dbl> 3.600, 1.800, 3.333, 2.283, 4.533, 2.883, 4.700, 3.600, 1.9…
## $ waiting <dbl> 79, 54, 74, 62, 85, 55, 88, 85, 51, 85, 54, 84, 78, 47, 83,…
The data has two variables, eruptions and waiting, and 272 rows.
We can use R to analyze the data. Let’s look at some basic summary statistics: mean, median, min/max, and quantiles. You get the mean of a vector of numbers in R using the mean() function. You access one column of a dataframe by using the $ operator. We do this as follows:
mean(faithful$eruptions)
## [1] 3.487783
mean(faithful$waiting)
## [1] 70.89706
It looks like the mean waiting time is 70.8970588 minutes and the mean eruption time is 3.4877831 minutes.
You can get a five-number summary using the summary() function:
summary(faithful$eruptions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.600 2.163 4.000 3.488 4.454 5.100
Exercise: compute the five-number summary for the waiting variable.
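One possible solution, following the pattern above:

summary(faithful$waiting) # Five-number summary (plus the mean) of the waiting times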
We can also plot the data, just like we plotted our functions. We’ll use ggplot to make a scatterplot. The premise is the same as in the Scientific Computing section: build the plot up in layers.
faithful %>% # Provide the data to ggplot() directly
  ggplot(aes(x = waiting,y = eruptions)) + # Tell ggplot to put waiting on the x-axis and eruptions on the y-axis
  theme_bw() +
  geom_point(pch = 21,colour = "black",fill = "orange") +
  labs(
    title = "Eruptions vs Waiting Time",
    subtitle = "Old Faithful Geyser",
    x = "Waiting",
    y = "Eruptions"
  ) +
  scale_x_continuous(breaks = seq(40,100,by = 5)) +
  scale_y_continuous(breaks = seq(1,5,by = .5))
Let’s pause and take a deeper look at one of the features of R that we are using here: the pipe, %>%. The pipe is provided by the tidyverse package (inb4: it actually lives inside the magrittr package, which is loaded by dplyr, which is loaded by tidyverse…). The pipe lets you write clean code. It works by taking its left-hand argument and “piping” it into its right-hand argument. For example:
mean(faithful$waiting)
is the same as
faithful$waiting %>% mean()
This helps you write cleaner code because it makes it clear what data you’re using and what operations you’re performing on it. Suppose you wanted to compute the mean of the unique waiting times that are greater than \(70\). You can do this using the following horrible line of code:
mean(unique(pull(filter(faithful,waiting > 70),waiting)))
Looking at that line, you can’t immediately tell 1) what the data is (it’s the waiting variable in the faithful dataset) and 2) what’s being done to it (filtering it, pulling it from the dataset, retaining only the unique values, and taking their mean). However, using the following equivalent lines of code:
faithful %>%
  filter(waiting > 70) %>%
  pull(waiting) %>%
  unique() %>%
  mean()
## [1] 83.04
gives a crystal-clear representation of all the data and operations used to produce the output. Use the pipe whenever you can!
Exercise: add a title and x/y axis labels to this plot. Remember the labs() function from the Taylor series example?
We can also look at the individual variables’ distributions. A common type of plot for this is a boxplot, which essentially plots the five-number summary of a variable. We can make this in ggplot as follows:
faithful %>%
  ggplot(aes(y = waiting)) +
  theme_classic() +
  geom_boxplot(width = 1) +
  coord_cartesian(xlim = c(-3,3)) + # Annoying way to set the width of the boxplot
  theme(axis.text.x = element_blank(),axis.ticks.x = element_blank()) + # Remove garbage on the x-axis
  labs(title = "Distribution of waiting times until eruption, Old Faithful Geyser",
       x = "",y = "Waiting Time")
Exercise: there are some confusing, unexplained parts of the above code. Are they important? Reading other people’s code is a hard but important skill. Try building up my plot layer by layer and see what each line does. What do the following elements in particular do?
width = 1
coord_cartesian(xlim = c(-3,3))
theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())
Exercise: create a boxplot of eruptions.
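One possible solution, mirroring the waiting-time boxplot above:

faithful %>%
  ggplot(aes(y = eruptions)) +
  theme_classic() +
  geom_boxplot(width = 1) +
  coord_cartesian(xlim = c(-3,3)) +
  theme(axis.text.x = element_blank(),axis.ticks.x = element_blank()) +
  labs(title = "Distribution of eruption times, Old Faithful Geyser",
       x = "",y = "Eruption time")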