Introduction: Wayne Winston’s Mathletics in R

Mathletics, by Wayne Winston, is a wonderful book that explains how data analytics is used in professional sports. The data work in the book is done using Microsoft Excel. I am an R user. While reading the first few chapters of his book, I began thinking about how the analytic examples he uses could also serve as great introductory examples for anyone interested in learning R.

R is a freely available programming language for statistical computing. R is appealing because of its price, the fact that it is open source (an engaged community of R users around the world is constantly improving it), and its awesome graphics capabilities. R has become one of the go-to tools for data analytics. I have attempted to take the examples from Mathletics and use them to create an R tutorial.

Getting Started

To install R, follow this link. Once you have finished, you should also install RStudio. RStudio is an awesome integrated development environment (IDE) for programming in R.

Below are some R basics. A longer overview can be found here. R’s introduction manual can be found here. I found both of these extremely helpful while coming up with these examples and explanations.

If you have any questions or comments, please contact me.

Some R Basics

If you know some R already, or just want to dive in, head straight to Chapter 1.

R has a robust library of open source packages. They are easy to install. Make sure you load the package after installation is finished.

install.packages("Lahman") #installs the package
library(Lahman) #this loads it

This package provides the tables from the Sean Lahman Baseball Database.

Since this is an attempt to replicate work that was originally done using Microsoft Excel, a basic understanding of R’s data frame structure is needed. The simplest explanation of a data frame for an Excel user is that data frames are essentially spreadsheets. Each column represents a variable and each row holds the values of those variables for a single unit (one team's season, for example).

The columns of a data frame are vectors. A vector is the simplest R data structure: it groups elements of the same type together in a specific order. You can create a vector with the function c() and assign it to a variable.

#basic vectors
x = c(1, 2, 3)

Most R users use <- as the assignment operator. I use =. This link can explain the difference. When I first started programming in R, I found using = less confusing, so I have stuck with it. But just so you know, Google’s R style guide recommends using <-, as does the R community.
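To make the difference concrete, here is a small sketch (the names a, b, and v are made up for this illustration). At the top level the two operators do the same job. Inside a function call they do not: = only names an argument, while <- also creates a variable.

```r
#at the top level, = and <- do the same thing
a = c(1, 2, 3)
b <- c(1, 2, 3)
identical(a, b)

#inside a function call, = only names the argument (x is rev's argument name)
rev(x = c(4, 5, 6))

#inside a function call, <- stores the vector in v AND passes it to rev()
rev(v <- c(4, 5, 6))
v
```

This is the main practical reason many style guides prefer <- for assignment: it always means "store this", wherever it appears.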

#more vectors
y = c("My", "first", "vector")
z <- c(TRUE, FALSE, TRUE) #same as z = c(TRUE, FALSE, TRUE)
mixed_vector = c(x[1], y[2], z[3])
mixed_vector

[1] "1"     "first" "TRUE"
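Notice the quotes around every element in that output. A vector can only hold one type of data, so when types are mixed R converts everything to the most flexible type. The sketch below (the variable name mixed is just for illustration) uses class() to show what happens.

```r
#each kind of vector has its own type
class(c(1, 2, 3))          #"numeric"
class(c("My", "first"))    #"character"
class(c(TRUE, FALSE))      #"logical"

#mixing types forces everything into one type
mixed = c(1, "first", TRUE)
class(mixed)               #"character"
class(c(1, TRUE))          #"numeric", TRUE becomes 1

#convert back when you need the number again
as.numeric(mixed[1]) + 1   #2
```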

There are other ways to create vectors when the vector follows a specific pattern.

seq(from=2, to=10, by=2)

[1]  2  4  6  8 10

rep(c("One", "Two"), times=3)

[1] "One" "Two" "One" "Two" "One" "Two"

c(-2:5)

[1] -2 -1  0  1  2  3  4  5
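A few variations on the same functions, as a quick sketch (the variable names are made up): seq() can take a desired length instead of a step size, rep() can repeat each element in turn, and the colon operator can count down.

```r
#ask for 5 evenly spaced values instead of giving a step size
quarters = seq(from=0, to=1, length.out=5)
quarters       #0.00 0.25 0.50 0.75 1.00

#repeat each element before moving on to the next one
each_three = rep(c("One", "Two"), each=3)
each_three     #"One" "One" "One" "Two" "Two" "Two"

#the colon operator also counts down
countdown = 10:7
countdown      #10 9 8 7
```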

To create a data frame, we can group vectors together.

Year = seq(2014, 2016)
Team = rep("New York Mets", 3)
W = c(79, 90, 87)
L = c(83, 72, 75)
mets = data.frame(Year, Team, W, L)
#if you are using RStudio, use the View() function to inspect the data frame
#View(mets)
mets

  Year          Team  W  L
1 2014 New York Mets 79 83
2 2015 New York Mets 90 72
3 2016 New York Mets 87 75
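Before going further, it can help to confirm that the data frame really is a set of equal-length vectors. This is only a sketch; it rebuilds the same mets data frame so it can run on its own.

```r
Year = seq(2014, 2016)
Team = rep("New York Mets", 3)
W = c(79, 90, 87)
L = c(83, 72, 75)
mets = data.frame(Year, Team, W, L)

#number of rows and columns
dim(mets)                      #3 4

#column names
names(mets)

#a single column is an ordinary vector, one element per row
is.vector(mets$W)              #TRUE
length(mets$W) == nrow(mets)   #TRUE

#the type of every column at a glance
sapply(mets, class)
```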

We can access the information in a data frame in many ways.

#use $ to access a data frame column
mets$Year

[1] 2014 2015 2016

#use logical expressions with vectors
mets$W[mets$W >= 81]

[1] 90 87

#data frame indexing: df[row, column]
mets[2, 3]

[1] 90

mets[ , c("W", "L")]

   W  L
1 79 83
2 90 72
3 87 75

mets[mets$Year==2016, c(3, 4)]

   W  L
3 87 75

mets[mets$W > mets$L, "Year"]

[1] 2015 2016
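What makes the last few examples work is that a comparison returns a vector of TRUE and FALSE values, and R keeps only the rows where that vector is TRUE. Here is a short sketch with a cut-down copy of the data frame (the name winning is made up for illustration).

```r
mets = data.frame(Year = c(2014, 2015, 2016),
                  W = c(79, 90, 87),
                  L = c(83, 72, 75))

#the comparison on its own is just a logical vector
winning = mets$W > mets$L
winning                  #FALSE TRUE TRUE

#TRUE counts as 1, so sum() counts the winning seasons
sum(winning)             #2

#which() gives the row positions instead
which(winning)           #2 3

#combine conditions with & (and) or | (or)
mets[winning & mets$Year < 2016, ]
```

The final line returns only the 2015 row, the one season that is both a winning season and earlier than 2016.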

R has many built-in functions. Some functions return a single element.

max(mets$Year)

[1] 2016

sum(mets$W)

[1] 256

mean(mets$W)

[1] 85.33333

length(mets$Team)

[1] 3

Other functions return a vector the same length as the input.

paste(mets$Year, mets$Team, sep="---")

[1] "2014---New York Mets" "2015---New York Mets" "2016---New York Mets"

#apply a function to every element in a vector
sapply(mets$Year, function(x){ x-2000 })

[1] 14 15 16
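For simple arithmetic like this, sapply is not strictly needed, because R's arithmetic operators already work on every element of a vector. A small sketch (the variable names are made up for illustration):

```r
years = c(2014, 2015, 2016)

#one element at a time with sapply
by_sapply = sapply(years, function(x){ x-2000 })

#the whole vector at once
by_arithmetic = years - 2000

by_sapply
by_arithmetic
```

Both give the same values. sapply becomes more useful when the function is something that does not already accept a whole vector.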

#create a new column on the fly using $
mets$Games = mets$W + mets$L
#no spaces for column names, unless column name is inside ` `
mets$W.pct = round(mets$W/mets$Games, 3)
#vectorized if 
mets$`Over 500` = ifelse(mets$W > mets$L, TRUE, FALSE)
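ifelse is most useful when you want something other than TRUE and FALSE back. In this sketch the labels "Winning" and "Losing" are made up for illustration.

```r
W = c(79, 90, 87)
L = c(83, 72, 75)

#the comparison alone already gives TRUE and FALSE
W > L                                          #FALSE TRUE TRUE

#ifelse swaps each TRUE and FALSE for a value of your choice
record = ifelse(W > L, "Winning", "Losing")
record                                         #"Losing" "Winning" "Winning"
```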

We can add rows and columns to our data frames using the rbind and cbind functions.

#use list() for a row that mixes types; c() would turn every value into text
next_year = list(2017, "New York Mets", 162, 0, 162, 1.000, TRUE)
rbind(mets, next_year)

  Year          Team   W  L Games W.pct Over 500
1 2014 New York Mets  79 83   162 0.488    FALSE
2 2015 New York Mets  90 72   162 0.556     TRUE
3 2016 New York Mets  87 75   162 0.537     TRUE
4 2017 New York Mets 162  0   162 1.000     TRUE

cbind(mets, League=rep("NL", 3))

  Year          Team  W  L Games W.pct Over 500 League
1 2014 New York Mets 79 83   162 0.488    FALSE     NL
2 2015 New York Mets 90 72   162 0.556     TRUE     NL
3 2016 New York Mets 87 75   162 0.537     TRUE     NL

#our changes to the data frame were not saved 
#make sure to store any changes in a variable
mets_2 = cbind(mets, Manager=rep("Terry Collins", 3))
mets_2

  Year          Team  W  L Games W.pct Over 500       Manager
1 2014 New York Mets 79 83   162 0.488    FALSE Terry Collins
2 2015 New York Mets 90 72   162 0.556     TRUE Terry Collins
3 2016 New York Mets 87 75   162 0.537     TRUE Terry Collins
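One thing worth checking after adding a row is that the numeric columns are still numeric. A row built with c() that mixes text and numbers is stored entirely as text, so a one-row data frame is a safer way to add a row. This is a sketch with a cut-down copy of the data; the names text_row, new_row, and mets_3 are made up for illustration.

```r
mets = data.frame(Year = c(2014, 2015, 2016),
                  W = c(79, 90, 87),
                  L = c(83, 72, 75))

#c() can only hold one type, so this row is all text
text_row = c(2017, "New York Mets", 162)
class(text_row)            #"character"

#a one-row data frame keeps each column's own type
new_row = data.frame(Year = 2017, W = 162, L = 0)
mets_3 = rbind(mets, new_row)

#the columns are still numbers, so arithmetic still works
sapply(mets_3, class)
sum(mets_3$W)              #418
```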

There are many ways to import datasets into R. read.csv or readLines can be used when we have a file we would like to work with. Many R packages also come with ready-to-use datasets.

#the mtcars dataset is included with the basic R installation
data(mtcars)
#what are we working with?
colnames(mtcars)

 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"

nrow(mtcars)

[1] 32

#sneak a peek
head(mtcars) # or tail(mtcars)

                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

#get a quick summary
summary(mtcars[ , c(1:4)])

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
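Reading from a file works much the same way. Since there is no file to share here, this sketch first writes a small data frame to a temporary CSV file and then reads it back in. The names path and mets_from_file are made up for illustration; with your own data you would pass the location of your file to read.csv.

```r
mets = data.frame(Year = c(2014, 2015, 2016),
                  W = c(79, 90, 87),
                  L = c(83, 72, 75))

#write the data frame to a temporary csv file
path = tempfile(fileext = ".csv")
write.csv(mets, path, row.names = FALSE)

#read.csv returns a data frame
mets_from_file = read.csv(path)
mets_from_file

#readLines returns the raw text instead, one element per line
readLines(path, n = 2)
```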

Ok! Now we’re ready to move on to the fun stuff!

Introduction

by Nicholas Capofari

January 22, 2017
