Posting Up 538 NBA Predictions Using R

November 25, 2018

FiveThirtyEight

FiveThirtyEight is a great website. From politics to pop culture, Nate Silver and his team do a great job of creating interesting articles and visuals by using various data science techniques.

A big part of what FiveThirtyEight does revolves around sports. Similarly, a major focus of mine is sports, specifically the NBA. FiveThirtyEight assigns win probabilities to every NBA game during the regular season and playoffs. I have been using the NBA regular season to test my modelling skills.

Predicting wins for every NBA game is difficult. It is a safe bet that the Warriors will win at home when my beloved Knicks go west, but what happens when that same Knicks team hosts the Orlando Magic? An adequate model for predicting NBA games should be correct about 70% of the time. Yet even being right 7 out of 10 times is not enough to make money gambling (trust me, I know).

I used FiveThirtyEight predictions as my baseline comparison when creating my models. What I noticed while working with the data is that a pretty basic model can produce results that are nearly as good as what FiveThirtyEight releases. Creating a successful model that is easy to explain and simple to use can be just as difficult as creating a model with 200 features that implements some esoteric algorithm. FiveThirtyEight does not necessarily create a prediction model. Rather, they create rating systems that are used to simulate the NBA season. Sometimes it is best to just keep things simple.

The Model

Pythagorean Win Percentage (Pythagorean Expectation) is a formula created by sabermetrician Bill James to estimate the win percentage of baseball teams based upon runs scored and runs allowed. This formula estimates what proportion of games a team “should” win. Basketball analysts have adapted the formula to the NBA:

\[ \frac{Team\ Points^{16.5}}{Team\ Points^{16.5} + Opponent\ Points^{16.5}} \]

The original exponent used for this formula was 14, which was later updated to 13.91 to better correlate with actual team winning percentages. NBA analyst John Hollinger uses an exponent of 16.5, which I have chosen for this post.
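As a quick sanity check of the formula, here is a minimal sketch in base R. The function name `pyt_win_pct` and the point totals are illustrative assumptions, not part of the original analysis. The ratio form used here is algebraically the same as the formula above; it just avoids raising large totals to the 16.5 power.

```r
# Hypothetical helper (not from the original post): Pythagorean Win Percentage,
# with Hollinger's exponent of 16.5 as the default.
pyt_win_pct <- function(team_points, opp_points, exponent = 16.5) {
  # Same value as team^e / (team^e + opp^e), written as a ratio
  1 / (1 + (opp_points / team_points)^exponent)
}

# Made-up totals: a team that scores 110 for every 100 it allows
pyt_win_pct(110, 100)

# Equal points scored and allowed give an expected win percentage of 0.5
pyt_win_pct(100, 100)
```

Passing `exponent = 14` or `exponent = 13.91` makes it easy to see how much the choice of exponent moves the estimate.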

To predict which team will win an NBA game, just compare the home team’s home Pythagorean Win Percentage with the away team’s away Pythagorean Win Percentage. That’s it. Nothing fancy.

I have scraped all NBA game results from basketball-reference.com. The data sets used for this post can be found on my GitHub page. Here is how to implement home and away Pythagorean Win Percentage. I am going to use the dplyr package, which is part of the tidyverse, to manipulate the data.

library(dplyr)

#NBA historical data on github
filename = paste0("https://raw.githubusercontent.com/capstat/",
                  "postups/master/data/nba_2017.csv")
#get NBA 2017-2018 data
#nba = read.csv(filename)
nba = read.csv("data/nba_2017.csv")

#dplyr pipe %>%
nba_pyt = nba %>%
  #just the features needed
  select("G", "Date", "Home", "Team", "Opponent", 
         "Result", "Team_Points", "Opp_Points") %>%
  #group by team and home/away
  group_by(Team, Home) %>%
  #calculate data as season progresses
  mutate(
    #keep a count for home/away games
    Home_Away_G = cumsum(Result==1 | Result==0),
    #points at home/away
    Home_Away_Points = cumsum(Team_Points),
    #opponent points home/away
    Home_Away_Opp_Points = cumsum(Opp_Points),
    #home/away pyt win pct using 16.5
    Home_Away_Pyt = Home_Away_Points^16.5/
      (Home_Away_Points^16.5 + Home_Away_Opp_Points^16.5))

Our data set should not have repeated entries, so only home game observations are considered. Also, the data should represent stats leading up to a certain game, so the lag function will be utilized. Lastly, the Pythagorean Win Percentage cannot be calculated for the first home or away game of a season. NA will be used to denote these instances.

#only the home teams
home = filter(nba_pyt, Home==1) %>%
  #lag within each team so a row never picks up another team's stats
  group_by(Team) %>%
  #the first home game has no earlier games, so lag() leaves it as NA
  mutate(Home_Away_Pyt = lag(Home_Away_Pyt)) %>%
  ungroup()

#only the away teams
away = filter(nba_pyt, Home==0) %>%
  #lag within each team, same as for the home games
  group_by(Team) %>%
  #the first away game has no earlier games, so lag() leaves it as NA
  mutate(Home_Away_Pyt = lag(Home_Away_Pyt)) %>%
  ungroup()

final = home %>% 
  #join home team stats with away team stats
  left_join(away[,c("Team", "Date", "G", "Home_Away_G", "Home_Away_Pyt")], 
            #key is the date and the team names
            by = c("Opponent"="Team", "Date"),
            suffix=c("_HOMETEAM", "_OPP")) %>%
  #find the difference between the Home/Away pyt win pct
  mutate(Pyt_Diff = Home_Away_Pyt_HOMETEAM - Home_Away_Pyt_OPP)

Date	G	Team	Opponent	Pyt_Diff	Result
10/19/17	1	Toronto Raptors	Chicago Bulls	NA	1
10/21/17	2	Toronto Raptors	Philadelphia 76ers	0.5989345	1
11/5/17	9	Toronto Raptors	Washington Wizards	0.2448159	0
11/7/17	10	Toronto Raptors	Chicago Bulls	0.4662493	1
11/9/17	11	Toronto Raptors	New Orleans Pelicans	0.2942930	1
11/17/17	15	Toronto Raptors	New York Knicks	0.6234272	1
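The cumulative-sum and lag steps above can be checked on a toy example. The sketch below uses base R only, with made-up teams and scores, and assumes each team's rows are already in chronological order. The helper name `lag_one` is hypothetical; it shifts a vector by one position, which is what `lag()` does for a single team.

```r
# Made-up game log: three games for team A, two for team B, in date order
games <- data.frame(
  Team        = c("A", "A", "A", "B", "B"),
  Team_Points = c(100, 110, 95, 120, 101),
  Opp_Points  = c(90, 105, 100, 99, 108)
)

# Running totals per team, mirroring the cumsum() step
games$Cum_Points <- ave(games$Team_Points, games$Team, FUN = cumsum)
games$Cum_Opp    <- ave(games$Opp_Points, games$Team, FUN = cumsum)

# Pythagorean Win Percentage including the current game
games$Pyt_After <- games$Cum_Points^16.5 /
  (games$Cum_Points^16.5 + games$Cum_Opp^16.5)

# Hypothetical helper: shift by one game so a row only sees earlier games
lag_one <- function(v) c(NA, head(v, -1))

# Apply the shift within each team; the first game of each team becomes NA
games$Pyt_Before <- ave(games$Pyt_After, games$Team, FUN = lag_one)
games
```

Shifting within each team matters: a plain shift over the whole column would hand team B's first game the value from team A's last game.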

Historical Pythagorean Win Percentage Predictions

This chart depicts the correct prediction rate for all differences between home and away Pythagorean Win Percentage since the 1983-1984 season.

The last three seasons have seen a drop in accuracy when relying solely on the difference between the home and away Pythagorean Win Percentages. The difference that produced the best prediction rate for the 2016-2017 season was approximately 0.05. That is, for the 2016-2017 season, picking the home team to win when its home Pythagorean Win Percentage is more than 0.05 larger than the visiting team’s away Pythagorean Win Percentage, and to lose otherwise, gives a correct prediction rate of 63.7%. This excludes games where either team does not yet have a Pythagorean Win Percentage (its first game at home or first game away).
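One way to search for such a threshold is to score a grid of candidate values and keep the one with the highest correct prediction rate. The sketch below is only an illustration of the mechanics, not the code behind the chart. The helper name `prediction_rate` and the grid are assumptions, `Result == 1` is assumed to mean a home win, and the input is just the six Raptors home games from the sample output earlier in this post, far too few to say anything about the best threshold.

```r
# Hypothetical helper: share of games predicted correctly at a given threshold.
# Games with a missing difference are excluded, as described above.
prediction_rate <- function(pyt_diff, result, threshold) {
  keep <- !is.na(pyt_diff)
  predicted_win <- pyt_diff[keep] > threshold
  mean(predicted_win == (result[keep] == 1))
}

# The six Raptors home games from the sample output earlier in this post
pyt_diff <- c(NA, 0.5989345, 0.2448159, 0.4662493, 0.2942930, 0.6234272)
result   <- c(1, 1, 0, 1, 1, 1)

# Score an assumed grid of candidate thresholds
thresholds <- seq(0, 0.3, by = 0.05)
rates <- sapply(thresholds, function(t) prediction_rate(pyt_diff, result, t))

# Threshold with the highest rate on this tiny sample (first one wins ties)
best_threshold <- thresholds[which.max(rates)]
data.frame(thresholds, rates)
```

On a full season of games, the same loop would be run over every row of the joined data rather than a handful of games.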

Predicting the 2017-2018 NBA Season

Below is how the 0.05 threshold can be applied to the 2017-2018 NBA season. Games with an NA (a team’s first home or first away game) default to a home team win prediction.

#predict home team to win or lose
final$Prediction = ifelse(
  #win if home/away pyt win pct diff is greater than the threshold
  final$Pyt_Diff > 0.05 | #or
    #if it's the first home game for the team
    is.na(final$Home_Away_Pyt_HOMETEAM) | #or
    #if it's the first away game for the opponent
    is.na(final$Home_Away_Pyt_OPP),
  "WIN", "LOSE")

Date	G	Team	Opponent	Pyt_Diff	Result
10/21/17	2	New York Knicks	Detroit Pistons	NA	0
10/27/17	4	New York Knicks	Brooklyn Nets	0.0494168	1
10/30/17	6	New York Knicks	Denver Nuggets	0.3836709	1
11/1/17	7	New York Knicks	Houston Rockets	0.1969236	0
11/3/17	8	New York Knicks	Phoenix Suns	0.2896730	1
11/5/17	9	New York Knicks	Indiana Pacers	0.0543742	1
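To turn predictions into a correct prediction rate, compare them with the actual results. The sketch below does this for the six Knicks home games in the sample output above, assuming that `Result == 1` means the home team won. The variable names are illustrative, and a six-game sample says nothing about season-long accuracy.

```r
# The six Knicks home games from the sample output above
pyt_diff <- c(NA, 0.0494168, 0.3836709, 0.1969236, 0.2896730, 0.0543742)
result   <- c(0, 1, 1, 0, 1, 1)

# Same rule as above: an NA difference defaults to a home win
prediction <- ifelse(is.na(pyt_diff) | pyt_diff > 0.05, "WIN", "LOSE")

# Translate the 1/0 results into the same labels and compare
actual <- ifelse(result == 1, "WIN", "LOSE")
correct_rate <- mean(prediction == actual)
correct_rate
```

For these six games the rule gets three right, a rate of 0.5. The Nets game is a miss because its difference of 0.0494 falls just under the 0.05 threshold.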

NBA Prediction Results 2017-2018 Season

Type	Rate
538 ELO	0.6618
538 CARMELO	0.6642
Pyt Win %	0.6366

FiveThirtyEight predictions beat out the Pythagorean Win Percentage model, but not by much. The Pythagorean Win Percentage model is also much easier to implement than the FiveThirtyEight models, and it becomes more accurate as the season progresses (66.3% correct prediction rate for games in the new year).

In conclusion, similar results to the FiveThirtyEight NBA predictions can be made using a very basic model. But in reality, to find out who is going to win an NBA game, just take a look at the point spread!
