In 2004 Seattle Mariners outfielder Ichiro Suzuki set the major league record for most hits in a season. In 1997 Boston Red Sox shortstop Nomar Garciaparra had what was considered a good (but not great) year. Their key statistics are presented in table 2.1.¹ (For the sake of simplicity, henceforth Suzuki will be referred to as “Ichiro” or “Ichiro 2004” and Garciaparra will be referred to as “Nomar” or “Nomar 1997.”)
Recall that a batter’s slugging percentage is Total Bases (TB)/At Bats (AB) where
\[ \begin{equation} \textrm{TB} = \textrm{Singles} + 2 \times \textrm{Doubles (2B)} + 3 \times \textrm{Triples (3B)} + 4 \times \textrm{Home Runs (HR).} \end{equation} \]
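As a quick worked example of this definition, take Ichiro's 2004 line as it appears in figure 2.3 (225 singles, 24 doubles, 5 triples, and 8 home runs in 704 at bats):

\[ \textrm{TB} = 225 + 2(24) + 3(5) + 4(8) = 320, \qquad \textrm{SLG} = \frac{320}{704} \approx .455. \]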
#get the tables from Lahman
library("Lahman")
#load the batting stats
data(Batting)
my_stats = Batting[
#ichiro 2004
(Batting$playerID == "suzukic01" & Batting$yearID == 2004) |
#nomar 1997
(Batting$playerID == "garcino01" & Batting$yearID == 1997) |
#bonds 2004 --- for later
(Batting$playerID == "bondsba01" & Batting$yearID == 2004), ]
#add column for singles
my_stats$Singles = my_stats$H -
(my_stats$X2B + my_stats$X3B + my_stats$HR)
#add batting average and SLG
my_stats$BA = my_stats$H/my_stats$AB
my_stats$TB = (my_stats$Singles +
2*my_stats$X2B +
3*my_stats$X3B +
4*my_stats$HR)
my_stats$SLG = my_stats$TB/my_stats$AB
Event | Ichiro 2004 | Nomar 1997 |
---|---|---|
AB | 704 | 684 |
Batting Average | .372 | .306 |
SLG | .455 | .534 |
Hits | 262 | 209 |
Singles | 225 | 124 |
2B | 24 | 44 |
3B | 5 | 11 |
HR | 8 | 30 |
BB+HBP | 53 | 41 |
We see that Ichiro had a higher batting average than Nomar, but because Nomar hit many more doubles, triples, and home runs, he had a much higher slugging percentage. Ichiro walked a few more times than Nomar did. So which player had a better hitting year?
When a batter is hitting, he can cause good things (like hits or walks) to happen or cause bad things (outs) to happen. To compare hitters we must develop a metric that measures how the relative frequency of a batter’s good events and bad events influences the number of runs the team scores.
In 1979 Bill James developed the first version of his famous Runs Created Formula in an attempt to compute the number of runs “created” by a hitter during the course of a season. The most easily obtained data we have available to determine how batting events influence Runs Scored are season-long team batting statistics. A sample of this data is shown in figure 2.1.
#use the dplyr package to combine the player data into team data
library(dplyr)
#change all missing values for IBB and HBP to 0
Batting$HBP[is.na(Batting$HBP)] = 0
Batting$IBB[is.na(Batting$IBB)] = 0
team_batting = Batting %>%
#add column for singles and one for any type of walk
mutate(Singles=H-(X2B+X3B+HR),
any_walk=BB+HBP) %>%
#group by year and team
group_by(yearID, teamID) %>%
#sum what we need
summarise(Runs=sum(R), `At Bats`=sum(AB), Hits=sum(H),
Singles=sum(Singles), `2B`=sum(X2B), `3B`=sum(X3B),
HR=sum(HR), `BB+HBP`=sum(any_walk)) %>%
#change the column names
rename(Year=yearID, Team=teamID)
#save as a csv
write.csv(team_batting, "team_batting.csv", row.names=FALSE)
Year | Runs | At Bats | Hits | Singles | 2B | 3B | HR | BB+HBP | Team |
---|---|---|---|---|---|---|---|---|---|
2000 | 864 | 5628 | 1574 | 995 | 309 | 34 | 236 | 655 | ANA |
2000 | 792 | 5527 | 1466 | 961 | 282 | 44 | 179 | 594 | ARI |
2000 | 810 | 5489 | 1490 | 1011 | 274 | 26 | 179 | 654 | ATL |
2000 | 794 | 5549 | 1508 | 992 | 310 | 22 | 184 | 607 | BAL |
2000 | 792 | 5630 | 1503 | 988 | 316 | 32 | 167 | 653 | BOS |
2000 | 978 | 5646 | 1615 | 1041 | 325 | 33 | 216 | 644 | CHA |
2000 | 764 | 5577 | 1426 | 948 | 272 | 23 | 183 | 686 | CHN |
2000 | 825 | 5635 | 1545 | 1007 | 302 | 36 | 200 | 623 | CIN |
2000 | 950 | 5683 | 1639 | 1078 | 310 | 30 | 221 | 736 | CLE |
2000 | 968 | 5660 | 1664 | 1130 | 320 | 53 | 161 | 643 | COL |
2000 | 823 | 5644 | 1553 | 1028 | 307 | 41 | 177 | 605 | DET |
2000 | 731 | 5509 | 1441 | 978 | 274 | 29 | 160 | 600 | FLO |
2000 | 938 | 5570 | 1547 | 973 | 289 | 36 | 249 | 756 | HOU |
2000 | 879 | 5709 | 1644 | 1186 | 281 | 27 | 150 | 559 | KCA |
2000 | 798 | 5481 | 1408 | 904 | 265 | 28 | 211 | 719 | LAN |
2000 | 740 | 5563 | 1366 | 867 | 297 | 25 | 177 | 681 | MIL |
2000 | 748 | 5615 | 1516 | 1026 | 325 | 49 | 116 | 591 | MIN |
2000 | 738 | 5535 | 1475 | 952 | 310 | 35 | 178 | 505 | MON |
2000 | 871 | 5556 | 1541 | 1017 | 294 | 25 | 205 | 688 | NYA |
2000 | 807 | 5486 | 1445 | 946 | 281 | 20 | 198 | 720 | NYN |
2000 | 947 | 5560 | 1501 | 958 | 281 | 23 | 239 | 802 | OAK |
2000 | 708 | 5511 | 1386 | 898 | 304 | 40 | 144 | 655 | PHI |
2000 | 793 | 5643 | 1506 | 987 | 320 | 31 | 168 | 630 | PIT |
2000 | 752 | 5560 | 1413 | 940 | 279 | 37 | 157 | 648 | SDN |
2000 | 907 | 5497 | 1481 | 957 | 300 | 26 | 198 | 823 | SEA |
2000 | 925 | 5519 | 1535 | 961 | 304 | 44 | 226 | 760 | SFN |
2000 | 887 | 5478 | 1481 | 962 | 259 | 25 | 235 | 759 | SLN |
2000 | 733 | 5505 | 1414 | 977 | 253 | 22 | 162 | 609 | TBA |
2000 | 848 | 5648 | 1601 | 1063 | 330 | 35 | 173 | 619 | TEX |
2000 | 861 | 5677 | 1562 | 969 | 328 | 21 | 244 | 586 | TOR |
James realized there should be a way to predict the runs for each team from hits, singles, 2B, 3B, HR, outs, and BB+HBP.² Using his great intuition, James came up with the following relatively simple formula.
\[ \begin{equation} \textrm{runs created} = \frac{(\textrm{hits} + \textrm{BB} + \textrm{HBP}) \times (\textrm{TB})} {(\textrm{AB} + \textrm{BB} + \textrm{HBP})}. \end{equation} \]
As we will soon see, (2.2) does an amazingly good job of predicting how many runs a team scores in a season from hits, BB, HBP, AB, 2B, 3B, and HR. What is the rationale for (2.2)? To score runs you need to have runners on base, and then you need to advance them toward home plate: (Hits + Walks + HBP) is basically the number of base runners the team will have in a season. The other part of the equation, \(\frac{\textrm{TB}}{(\textrm{AB} + \textrm{BB} + \textrm{HBP})}\), measures the rate at which runners are advanced per plate appearance. Therefore (2.2) is multiplying the number of base runners by the rate at which they are advanced. Using the information in figure 2.1 we can compute Runs Created for the 2000 Anaheim Angels.
\[ \begin{equation} \textrm{runs created} = \frac{(1{,}574 + 655) \times (995 + 2(309) + 3(34) + 4(236))} {(5{,}628+655)} \approx 943. \end{equation} \]
#total bases for each team
team_batting$TB = team_batting$Singles +
2*(team_batting$`2B`) +
3*(team_batting$`3B`) +
4*(team_batting$HR)
#runs created
team_batting$RC = ((team_batting$Hits + team_batting$`BB+HBP`)*team_batting$TB)/
(team_batting$`At Bats` + team_batting$`BB+HBP`)
#percent error
team_batting$RC_error = abs((team_batting$RC-team_batting$Runs)/team_batting$Runs)
Year | Team | Runs | RC | % Error |
---|---|---|---|---|
2000 | ANA | 864 | 943.33 | 9.2% |
2000 | ARI | 792 | 798.62 | 0.8% |
2000 | ATL | 810 | 821.23 | 1.4% |
2000 | BAL | 794 | 829.37 | 4.5% |
2000 | BOS | 792 | 818.07 | 3.3% |
2000 | CHA | 978 | 953.16 | 2.5% |
2000 | CHN | 764 | 773.24 | 1.2% |
2000 | CIN | 825 | 872.67 | 5.8% |
2000 | CLE | 950 | 988.63 | 4.1% |
2000 | COL | 968 | 941.76 | 2.7% |
2000 | DET | 823 | 854.01 | 3.8% |
2000 | FLO | 731 | 752.72 | 3.0% |
2000 | HOU | 938 | 966.56 | 3.0% |
2000 | KCA | 879 | 853.72 | 2.9% |
2000 | LAN | 798 | 810.32 | 1.5% |
2000 | MIL | 740 | 735.66 | 0.6% |
2000 | MIN | 748 | 776.46 | 3.8% |
2000 | MON | 738 | 783.15 | 6.1% |
2000 | NYA | 871 | 892.46 | 2.5% |
2000 | NYN | 807 | 823.30 | 2.0% |
2000 | OAK | 947 | 921.27 | 2.7% |
2000 | PHI | 708 | 728.88 | 2.9% |
2000 | PIT | 793 | 814.49 | 2.7% |
2000 | SDN | 752 | 742.66 | 1.2% |
2000 | SEA | 907 | 884.78 | 2.4% |
2000 | SFN | 925 | 952.14 | 2.9% |
2000 | SLN | 887 | 896.07 | 1.0% |
2000 | TBA | 733 | 726.94 | 0.8% |
2000 | TEX | 848 | 892.68 | 5.3% |
2000 | TOR | 861 | 913.66 | 6.1% |
Actually, the 2000 Anaheim Angels scored 864 runs, so Runs Created overestimated the actual number of runs by around 9%.
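The Angels calculation can also be reproduced without the Lahman tables. The following is a minimal sketch in base R; the function name `runs_created` and its argument names are illustrative choices, not something defined elsewhere in this chapter, and the inputs are copied by hand from figure 2.1.

```r
# Basic Runs Created as in (2.2):
# (times on base) * (total bases) / (AB + BB + HBP).
# The function and argument names here are illustrative assumptions.
runs_created <- function(hits, walks_hbp, singles, doubles, triples, hr, ab) {
  total_bases <- singles + 2 * doubles + 3 * triples + 4 * hr
  (hits + walks_hbp) * total_bases / (ab + walks_hbp)
}

# 2000 Anaheim Angels, values copied from figure 2.1
ana_rc <- runs_created(hits = 1574, walks_hbp = 655, singles = 995,
                       doubles = 309, triples = 34, hr = 236, ab = 5628)
ana_runs <- 864

# absolute percentage error of the estimate
ana_pct_error <- abs(ana_rc - ana_runs) / ana_runs

print(round(ana_rc, 2))               # 943.33
print(round(100 * ana_pct_error, 1))  # 9.2
```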
We find that Runs Created was off by an average of 28 runs per team. Since the average team scored 775 runs, we find an average error of less than 4% when we try to use (2.2) to predict team Runs Scored. It is amazing that this simple, intuitively appealing formula does such a good job of predicting runs scored by a team. Even though more complex versions of Runs Created more accurately predict actual Runs Scored, the simplicity of (2.2) has caused this formula to continue to be widely used by the baseball community.
Year | Team | Runs | RC Error | % Error |
---|---|---|---|---|
2000-2006 | League Average | 775.5 | 28.21 | 3.71% |
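To make the error summary concrete, here is a small sketch that computes the same two summary statistics for the 2000 season only, using the Runs and RC columns typed in from figure 2.2. Because it covers one season rather than 2000 through 2006, its results should be close to, but need not match, the league averages reported above. The variable names are illustrative.

```r
# Actual runs and Runs Created for the 30 teams of 2000 (figure 2.2),
# in the same ANA ... TOR order as the table.
runs <- c(864, 792, 810, 794, 792, 978, 764, 825, 950, 968,
          823, 731, 938, 879, 798, 740, 748, 738, 871, 807,
          947, 708, 793, 752, 907, 925, 887, 733, 848, 861)
rc <- c(943.33, 798.62, 821.23, 829.37, 818.07, 953.16, 773.24, 872.67,
        988.63, 941.76, 854.01, 752.72, 966.56, 853.72, 810.32, 735.66,
        776.46, 783.15, 892.46, 823.30, 921.27, 728.88, 814.49, 742.66,
        884.78, 952.14, 896.07, 726.94, 892.68, 913.66)

abs_error <- abs(rc - runs)

# mean absolute error in runs, and mean absolute percentage error
mean_abs_error <- mean(abs_error)
mean_pct_error <- mean(abs_error / runs)

print(round(mean_abs_error, 2))
print(round(100 * mean_pct_error, 2))
```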
The problem with any version of Runs Created is that the formula is based on team statistics. A typical team has a batting average of .265, hits home runs on 3% of all plate appearances, and has a walk or HBP in around 10% of all plate appearances. Contrast these numbers to those of Barry Bonds’s great 2004 season in which he had a batting average of .362, hit a HR on 7% of all plate appearances, and received a walk or HBP during approximately 39% of his plate appearances. One of the first ideas taught in business statistics class is the following: do not use a relationship that is fit to a data set to make predictions for data that are very different from the data used to fit the relationship. Following this logic, we should not expect a Runs Created Formula based on team data to accurately predict the runs created by a superstar such as Barry Bonds or by a very poor player. In chapter 4 we will remedy this problem.
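The gap between Bonds and a team-level batting line can be checked with a few lines of R. This sketch approximates plate appearances as AB + BB + HBP (an assumption that ignores sacrifice flies and bunts), takes Bonds's counts from figure 2.3, and uses the 2000 Angels row from figure 2.1 as a single team for contrast rather than as a league average.

```r
# Share of plate appearances ending in a home run or in a walk/HBP.
# Plate appearances are approximated as AB + BB + HBP (assumption).
event_rates <- function(ab, hr, walks_hbp) {
  pa <- ab + walks_hbp
  c(hr_rate = hr / pa, walk_rate = walks_hbp / pa)
}

bonds_2004 <- event_rates(ab = 373, hr = 45, walks_hbp = 241)
angels_2000 <- event_rates(ab = 5628, hr = 236, walks_hbp = 655)

print(round(bonds_2004, 2))   # hr_rate 0.07, walk_rate 0.39
print(round(angels_2000, 2))
```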
Despite this caveat, let’s plunge ahead and use (2.2) to compare Ichiro Suzuki’s 2004 season to Nomar Garciaparra’s 1997 season. Let’s also compute Runs Created for Barry Bonds’s 2004 season so that we can compare his statistics with those of the other two players. (See figure 2.3.)
#any walk
my_stats$any_walk = my_stats$BB+my_stats$HBP
#runs created
my_stats$RC = ((my_stats$H + my_stats$any_walk)*my_stats$TB)/
(my_stats$AB + my_stats$any_walk)
#game outs used
my_stats$outs = 0.982*my_stats$AB - my_stats$H +
my_stats$GIDP + my_stats$SF + my_stats$SH + my_stats$CS
#runs created per game
my_stats$RCG = my_stats$RC/(my_stats$outs/26.72)
Player | Year | At Bats | Hits | Singles | 2B | 3B | HR | BB + HBP | Runs Created | Game Outs Used | Runs Created per Game |
---|---|---|---|---|---|---|---|---|---|---|---|
Bonds 2004 | 2004 | 373 | 135 | 60 | 27 | 3 | 45 | 241 | 185.55 | 240.29 | 20.63 |
Suzuki 2004 | 2004 | 704 | 262 | 225 | 24 | 5 | 8 | 53 | 133.16 | 451.33 | 7.88 |
Garciaparra 1997 | 1997 | 684 | 209 | 124 | 44 | 11 | 30 | 41 | 125.86 | 489.69 | 6.87 |
We see that Ichiro created 133 runs and Nomar created 126 runs. Bonds created 186 runs. This indicates that Ichiro 2004 had a slightly better hitting year than Nomar 1997. Of course Bonds’s performance in 2004 was vastly superior to that of the other two players.
A major problem with any Runs Created metric is that a bad hitter with 700 plate appearances might create more runs than a superstar with 400 plate appearances. In figure 2.4 we compare the statistics of two hypothetical players: Nick and Dave.
Player | At Bats | Hits | Singles | 2B | 3B | HR | BB + HBP | Runs Created | Game Outs Used | Runs Created per Game |
---|---|---|---|---|---|---|---|---|---|---|
Nick | 700 | 190 | 150 | 10 | 1 | 9 | 20 | 60.96 | 497.4 | 3.27 |
Dave | 400 | 120 | 90 | 15 | 0 | 15 | 20 | 60.00 | 272.8 | 5.88 |
Nick had a batting average of .271 (190 hits in 700 at bats) while Dave had a batting average of .300. Dave walked more often per plate appearance and had more extra-base hits. Yet Runs Created says Nick was a slightly better player. To solve this problem we need to understand that hitters consume a scarce resource: outs. During most games a team bats for nine innings and gets 27 outs (3 outs × 9 innings = 27).³ We can now compute Runs Created per game. To see how this works let’s look at the data for Ichiro 2004 (figure 2.3).
How did we compute outs? Essentially all AB except for hits and errors result in an out. Approximately 1.8% of all AB result in errors. Therefore, we computed outs as AB - Hits - .018(AB) = .982(AB) - Hits. Hitters also create “extra” outs through sacrifice flies (SF), sacrifice bunts (SH), caught stealing (CS), and grounding into double plays (GIDP). In 2004 Ichiro created 22 of these extra outs. He “used” up 451.3 outs for the Mariners. This is equivalent to \(\frac{451.3}{26.72} = 16.9\) games. Therefore, Ichiro created \(\frac{133.16}{16.9} = 7.88\) runs per game. More formally,
\[ \begin{equation} \textrm{runs created per game} = \frac{\textrm{runs created}} {\frac{.982(\textrm{AB}) - \textrm{hits} + \textrm{GIDP} + \textrm{SF} + \textrm{SH} + \textrm{CS}}{26.72}}. \end{equation} \]
Equation (2.4) simply states that Runs Created per game is Runs Created by batter divided by number of games’ worth of outs used by the batter. Figure 2.3 shows that Barry Bonds created an amazing 20.63 runs per game. Figure 2.3 also makes it clear that Ichiro in 2004 was a much more valuable hitter than was Nomar in 1997. After all, Ichiro created 7.88 runs per game while Nomar created 1.01 fewer runs per game (6.87 runs). We also see that Runs Created per game rates Dave as being 2.61 runs (5.88 - 3.27) better per game than Nick. This resolves the problem that ordinary Runs Created allowed Nick to be ranked ahead of Dave.
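The per-game figures quoted above can be checked with a short sketch in base R. The helper names `outs_used` and `runs_created_per_game` are illustrative, the Runs Created inputs are the rounded values from figures 2.3 and 2.4, and Nick and Dave are assumed to have no sacrifice flies, sacrifice bunts, caught stealing, or double plays, which is what their Game Outs Used values in figure 2.4 imply.

```r
# Outs consumed by a hitter: .982(AB) - hits, plus "extra" outs.
outs_used <- function(ab, hits, extra_outs = 0) {
  0.982 * ab - hits + extra_outs
}

# Runs Created per game: Runs Created divided by games' worth of outs used.
runs_created_per_game <- function(rc, outs, outs_per_game = 26.72) {
  rc / (outs / outs_per_game)
}

# Hypothetical players from figure 2.4 (no extra outs assumed)
nick_rcg <- runs_created_per_game(60.96, outs_used(ab = 700, hits = 190))
dave_rcg <- runs_created_per_game(60.00, outs_used(ab = 400, hits = 120))

# Ichiro 2004: 22 extra outs (GIDP + SF + SH + CS combined, per the text)
ichiro_outs <- outs_used(ab = 704, hits = 262, extra_outs = 22)
ichiro_rcg <- runs_created_per_game(133.16, ichiro_outs)

print(round(c(Nick = nick_rcg, Dave = dave_rcg, Ichiro = ichiro_rcg), 2))
# Nick 3.27, Dave 5.88, Ichiro 7.88
```

Dividing by outs rather than comparing raw totals is what lets Dave, who used far fewer outs, come out ahead of Nick.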
Our estimate of Runs Created per game of 7.88 for Ichiro indicates that we believe a team consisting of nine Ichiros would score an average of 7.88 runs per game. Since no team consists of nine players like Ichiro, a more relevant question might be, how many runs would he create when batting with eight “average hitters”? In his book Win Shares (2002) Bill James came up with a more complex version of Runs Created that answers this question. I will address this question in chapter 3 and chapter 4.
1. The data come from Sean Lahman’s fabulous baseball database, http://baseball1.com/statistics/.
2. Of course, we are leaving out things like Sacrifice Hits, Sacrifice Flies, Stolen Bases, and Caught Stealing. Later versions of Runs Created use these events to compute Runs Created. See http://danagonistes.blogspot.com/2004/10/brief-history-of-run-estimation-runs.html for an excellent summary of the evolution of Runs Created.
3. Since the home team does not bat in the ninth inning when they are ahead and some games go into extra innings, average outs per game is not exactly 27. For the years 2001–6, average outs per game was 26.72.