Mathletics, by Wayne Winston, is a wonderful book that explains how data analytics is used in professional sports. The data work in the book is done using Microsoft Excel. I am an R user. While reading the first few chapters of his book, I began thinking about how the analytic examples he uses could also serve as great introductory examples for anyone interested in learning R.
R is a freely available programming language used for statistical computing. R is appealing because of the price, the fact that it is open source (it is constantly being improved by an engaged community of R users around the world), and R’s awesome graphic capabilities. R is quickly becoming the go-to tool for data analytics. I have attempted to take the examples from “Mathletics” and use them to create an R tutorial.
To install R, follow this link. Once you have finished, you should also install RStudio. RStudio is an awesome integrated development environment (IDE) for programming in R.
Below are some R basics. A longer overview can be found here. R’s introduction manual can be found here. I found both of these extremely helpful while coming up with these examples and explanations.
If you have any questions or comments, please contact me.
If you know some R already, or just want to dive in, head straight to Chapter 1.
R has a robust library of open source packages. They are easy to install. Make sure you load the package after installation is finished.
install.packages("Lahman") #installs the package
library(Lahman) #this loads it
This package provides the tables from the Sean Lahman Baseball Database.
Since this is an attempt to replicate work that was originally done using Microsoft Excel, a basic understanding of R’s data frame structure is needed. The simplest explanation of a data frame for an Excel user is that data frames are essentially spreadsheets. Each column represents a variable and each row contains all measured variables for the same unit.
Each column of a data frame is a vector, which is the simplest R data structure. A vector groups elements of the same type together in a specific order. You can create a vector with the function c() and assign it to a variable.
#basic vectors
x = c(1, 2, 3)
Most R users use <- as the assignment operator. I use =. This link can explain the difference. When I first started programming in R, I found using = less confusing, so I have stuck with it. But just so you know, Google’s R style guide recommends using <-, as does the R community.
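To make the difference concrete, here is a small sketch (the variable names are made up for illustration). At the top level the two operators behave the same way. Inside a function call, = names one of the function's arguments, while <- creates a variable and then passes its value along.

```r
#inside a function call, = names an argument of the function
result_eq = mean(x = c(10, 20, 30))
#inside a function call, <- creates a variable, then passes its value along
result_arrow = mean(wins <- c(10, 20, 30))
result_eq
#[1] 20
result_arrow
#[1] 20
wins
#[1] 10 20 30
```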
#more vectors
y = c("My", "first", "vector")
z <- c(TRUE, FALSE, TRUE) #same as z = c(TRUE, FALSE, TRUE)
mixed_vector = c(x[1], y[2], z[3])
mixed_vector
[1] "1" "first" "TRUE"
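Notice that every element of mixed_vector came back in quotes. A vector can only hold one type of data, so when types are mixed R converts everything to the most flexible type. The sketch below, with variable names chosen only for illustration, shows the two most common cases.

```r
#numbers mixed with text: everything becomes text
text_mix = c(1, "first", TRUE)
class(text_mix)
#[1] "character"
#numbers mixed with logicals: TRUE becomes 1 and FALSE becomes 0
num_mix = c(1, TRUE, FALSE)
num_mix
#[1] 1 1 0
#convert text back to a number before doing arithmetic with it
as.numeric(text_mix[1]) + 1
#[1] 2
```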
There are other ways to create vectors when the vector follows a specific pattern.
seq(from=2, to=10, by=2)
[1] 2 4 6 8 10
rep(c("One", "Two"), times=3)
[1] "One" "Two" "One" "Two" "One" "Two"
c(-2:5)
[1] -2 -1 0 1 2 3 4 5
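A few more variations on the same functions can be handy. This is just a sketch with made-up variable names: length.out asks seq for a fixed number of values, each repeats every element before moving on to the next, and the colon operator can also count down.

```r
#five evenly spaced values between 0 and 1
quarters = seq(from=0, to=1, length.out=5)
quarters
#[1] 0.00 0.25 0.50 0.75 1.00
#repeat each element twice instead of repeating the whole vector
doubled = rep(c("One", "Two"), each=2)
doubled
#[1] "One" "One" "Two" "Two"
#the colon operator counts down when the first number is larger
countdown = 10:7
countdown
#[1] 10  9  8  7
```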
To create a data frame, we can group vectors together.
Year = seq(2014, 2016)
Team = rep("New York Mets", 3)
W = c(79, 90, 87)
L = c(83, 72, 75)
mets = data.frame(Year, Team, W, L)
#if you are using RStudio, use the View() function to inspect the data frame
#View(mets)
mets
  Year          Team  W  L
1 2014 New York Mets 79 83
2 2015 New York Mets 90 72
3 2016 New York Mets 87 75
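Before working with a data frame, it helps to check its size and the type of each column. The sketch below rebuilds the same data under a different name (mets_check is just an illustrative name) so it can run on its own. One assumption worth flagging: versions of R before 4.0.0 turned text columns into factors by default, so stringsAsFactors=FALSE is set explicitly to keep Team as plain text everywhere.

```r
mets_check = data.frame(
  Year = seq(2014, 2016),
  Team = rep("New York Mets", 3),
  W = c(79, 90, 87),
  L = c(83, 72, 75),
  stringsAsFactors = FALSE  #keep Team as text on any R version
)
#number of rows, then number of columns
dim(mets_check)
#[1] 3 4
#the column names
names(mets_check)
#[1] "Year" "Team" "W"    "L"
#the type of each column
col_types = sapply(mets_check, class)
col_types
```

str(mets_check) prints similar information in a single call.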
We can access the information in a data frame in many ways.
#use $ to access a data frame column
mets$Year
[1] 2014 2015 2016
#use logical expressions with vectors
mets$W[mets$W >= 81]
[1] 90 87
#data frame indexing: df[row, column]
mets[2, 3]
[1] 90
mets[ , c("W", "L")]
   W  L
1 79 83
2 90 72
3 87 75
mets[mets$Year==2016, c(3, 4)]
   W  L
3 87 75
mets[mets$W > mets$L, "Year"]
[1] 2015 2016
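The bracket syntax is not the only option. As a rough sketch, base R's subset() lets you refer to columns by name without repeating mets$, order() sorts rows, and which() reports the positions where a condition is true. The data frame is rebuilt first so the snippet runs on its own; winning and best_first are illustrative names.

```r
#same data frame as above, rebuilt so this snippet runs on its own
mets = data.frame(Year=seq(2014, 2016), Team=rep("New York Mets", 3),
                  W=c(79, 90, 87), L=c(83, 72, 75))
#keep the winning seasons and only three of the columns
winning = subset(mets, W > L, select=c(Year, W, L))
winning
#  Year  W  L
#2 2015 90 72
#3 2016 87 75
#sort the rows from most wins to fewest
best_first = mets[order(mets$W, decreasing=TRUE), ]
best_first$Year
#[1] 2015 2016 2014
#row positions of the winning seasons
which(mets$W > mets$L)
#[1] 2 3
```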
R has many built in functions. Some functions return one element.
max(mets$Year)
[1] 2016
sum(mets$W)
[1] 256
mean(mets$W)
[1] 85.33333
length(mets$Team)
[1] 3
Other functions return a vector the same length as the input.
paste(mets$Year, mets$Team, sep="---")
[1] "2014---New York Mets" "2015---New York Mets" "2016---New York Mets"
#apply a function to every element in a vector
sapply(mets$Year, function(x){ x-2000 })
[1] 14 15 16
#create a new column on the fly using $
mets$Games = mets$W + mets$L
#no spaces for column names, unless column name is inside ` `
mets$W.pct = round(mets$W/mets$Games, 3)
#vectorized if
mets$`Over 500` = ifelse(mets$W > mets$L, TRUE, FALSE)
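Most arithmetic and comparisons in R already work on whole vectors, so sapply and ifelse are often optional. The sketch below uses the same Year, W and L vectors as before, while short_year, over_500 and record are illustrative names. It shows the shorter forms, plus a case where ifelse is the right tool.

```r
Year = seq(2014, 2016)
W = c(79, 90, 87)
L = c(83, 72, 75)
#subtraction is applied to every element, no sapply needed
short_year = Year - 2000
all(short_year == sapply(Year, function(x){ x-2000 }))
#[1] TRUE
#a comparison already returns TRUE or FALSE for every element
over_500 = W > L
over_500
#[1] FALSE  TRUE  TRUE
#ifelse is most useful when the two outcomes are not simply TRUE and FALSE
record = ifelse(W > L, "winning", "losing")
record
#[1] "losing"  "winning" "winning"
```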
We can add rows and columns to our data frames using the rbind and cbind functions.
#use list() for the new row: c() would coerce every value to character
next_year = list(2017, "New York Mets", 162, 0, 162, 1.000, TRUE)
rbind(mets, next_year)
  Year          Team   W  L Games W.pct Over 500
1 2014 New York Mets  79 83   162 0.488    FALSE
2 2015 New York Mets  90 72   162 0.556     TRUE
3 2016 New York Mets  87 75   162 0.537     TRUE
4 2017 New York Mets 162  0   162 1.000     TRUE
cbind(mets, League=rep("NL", 3))
  Year          Team  W  L Games W.pct Over 500 League
1 2014 New York Mets 79 83   162 0.488    FALSE     NL
2 2015 New York Mets 90 72   162 0.556     TRUE     NL
3 2016 New York Mets 87 75   162 0.537     TRUE     NL
#our changes to the data frame were not saved
#make sure to store any changes in a variable
mets_2 = cbind(mets, Manager=rep("Terry Collins", 3))
mets_2
  Year          Team  W  L Games W.pct Over 500       Manager
1 2014 New York Mets 79 83   162 0.488    FALSE Terry Collins
2 2015 New York Mets 90 72   162 0.556     TRUE Terry Collins
3 2016 New York Mets 87 75   162 0.537     TRUE Terry Collins
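One thing to watch when adding rows is the type of each column afterwards. The sketch below uses a small stand-in data frame (seasons and the 2017 numbers are hypothetical) to compare two ways of building the new row. A one-row data frame keeps each column's type, while a row built with c() is a single text vector and turns every column into text.

```r
seasons = data.frame(Year=c(2015, 2016), Team="New York Mets", W=c(90, 87),
                     stringsAsFactors=FALSE)
#a one-row data frame with matching column names (hypothetical 2017 numbers)
new_row = data.frame(Year=2017, Team="New York Mets", W=162,
                     stringsAsFactors=FALSE)
typed = rbind(seasons, new_row)
sapply(typed, class)
#       Year        Team           W
#  "numeric" "character"   "numeric"
#a row built with c() is all text, so every column becomes character
as_text = rbind(seasons, c(2017, "New York Mets", 162))
sapply(as_text, class)
#       Year        Team           W
#"character" "character" "character"
```

With the typed version, sum(typed$W) still works; with the text version it would throw an error.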
There are many ways to import datasets into R. read.csv or readLines can be used when we have a file we would like to work with. Many R packages also come with ready-to-use datasets.
#the mtcars dataset is included with the basic R installation
data(mtcars)
#what are we working with?
colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
nrow(mtcars)
[1] 32
#sneak a peek
head(mtcars) # or tail(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
#get a quick summary
summary(mtcars[ , c(1:4)])
      mpg             cyl             disp             hp
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
 Median :19.20   Median :6.000   Median :196.3   Median :123.0
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
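For the file-based route mentioned earlier, here is a minimal sketch of a round trip through a CSV file. It writes a small table to a temporary file and reads it back, so it runs without any file of your own. The names mets_file and mets_csv are made up for illustration; in practice you would pass the path to your own file.

```r
#write a small table to a temporary CSV file
mets_file = tempfile(fileext=".csv")
write.csv(data.frame(Year=c(2014, 2015, 2016), W=c(79, 90, 87), L=c(83, 72, 75)),
          mets_file, row.names=FALSE)
#read.csv returns a data frame
mets_csv = read.csv(mets_file)
nrow(mets_csv)
#[1] 3
sum(mets_csv$W)
#[1] 256
#readLines returns the raw text, one element per line
raw_lines = readLines(mets_file)
length(raw_lines)
#[1] 4
```

The extra line in raw_lines is the header row that read.csv uses for the column names.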
Ok! Now we’re ready to move on to the fun stuff!