Chapter 2: WHO HAD A BETTER YEAR, NOMAR GARCIAPARRA OR ICHIRO SUZUKI?

The Runs-Created Approach

In 2004 Seattle Mariners outfielder Ichiro Suzuki set the major league record for most hits in a season. In 1997 Boston Red Sox shortstop Nomar Garciaparra had what was considered a good (but not great) year. Their key statistics are presented in table 2.1.¹ (For the sake of simplicity, henceforth Suzuki will be referred to as “Ichiro” or “Ichiro 2004” and Garciaparra will be referred to as “Nomar” or “Nomar 1997.”)

Recall that a batter’s slugging percentage is Total Bases (TB) divided by At Bats (AB), where

\[ \begin{equation} \textrm{TB} = \textrm{Singles} + 2 \times \textrm{Doubles (2B)} + 3 \times \textrm{Triples (3B)} + 4 \times \textrm{Home Runs (HR).} \end{equation} \]

#get the tables from Lahman
library("Lahman")
#load the batting stats
data(Batting)
my_stats = Batting[
  #ichiro 2004
  (Batting$playerID == "suzukic01" & Batting$yearID == 2004) |
  #nomar 1997
  (Batting$playerID == "garcino01" & Batting$yearID == 1997) |
  #bonds 2004 --- for later
  (Batting$playerID == "bondsba01" & Batting$yearID == 2004), ]
#add column for singles
my_stats$Singles = my_stats$H -
  (my_stats$X2B + my_stats$X3B + my_stats$HR)
#add batting average and SLG
my_stats$BA = my_stats$H/my_stats$AB
my_stats$TB = (my_stats$Singles + 
  2*my_stats$X2B + 
  3*my_stats$X3B + 
  4*my_stats$HR)
my_stats$SLG = my_stats$TB/my_stats$AB
TABLE 2.1 Statistics for Ichiro Suzuki and Nomar Garciaparra
Event Ichiro 2004 Nomar 1997
AB 704 684
Batting Average .372 .306
SLG .455 .534
Hits 262 209
Singles 225 124
2B 24 44
3B 5 11
HR 8 30
BB+HBP 53 41

We see that Ichiro had a higher batting average than Nomar, but because Nomar hit many more doubles, triples, and home runs, he had a much higher slugging percentage. Ichiro walked a few more times than Nomar did. So which player had a better hitting year?
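The comparison can be checked directly from the hit counts. The sketch below uses base R only and hard-codes the counts listed for each player in figure 2.3; the vector and function names are illustrative assumptions, not part of the original analysis.

```r
# Counts taken from figure 2.3; object names are illustrative
ichiro <- c(AB = 704, singles = 225, doubles = 24, triples = 5, HR = 8)
nomar  <- c(AB = 684, singles = 124, doubles = 44, triples = 11, HR = 30)

# Total bases: singles + 2 * doubles + 3 * triples + 4 * home runs
total_bases <- function(p) {
  unname(p["singles"] + 2 * p["doubles"] + 3 * p["triples"] + 4 * p["HR"])
}

# Hits are the sum of the four hit types
hits <- function(p) sum(p[c("singles", "doubles", "triples", "HR")])

batting_avg <- function(p) hits(p) / unname(p["AB"])
slugging    <- function(p) total_bases(p) / unname(p["AB"])

# Batting averages: 0.372 for Ichiro, 0.306 for Nomar
round(c(Ichiro = batting_avg(ichiro), Nomar = batting_avg(nomar)), 3)

# Slugging percentages: 0.455 for Ichiro, 0.534 for Nomar
round(c(Ichiro = slugging(ichiro), Nomar = slugging(nomar)), 3)
```

Ichiro leads in batting average while Nomar leads in slugging, which is why neither statistic alone settles the question.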

When a batter is hitting, he can cause good things (like hits or walks) to happen or cause bad things (outs) to happen. To compare hitters we must develop a metric that measures how the relative frequency of a batter’s good events and bad events influences the number of runs the team scores.

In 1979 Bill James developed the first version of his famous Runs Created Formula in an attempt to compute the number of runs “created” by a hitter during the course of a season. The most easily obtained data we have available to determine how batting events influence Runs Scored are season-long team batting statistics. A sample of this data is shown in figure 2.1.

#use the dplyr package to combine the player data into team data
library(dplyr)
#change all missing values for IBB and HBP to 0
Batting$HBP[is.na(Batting$HBP)] = 0
Batting$IBB[is.na(Batting$IBB)] = 0
team_batting = Batting %>% 
  #add column for singles and one for any type of walk
  mutate(Singles=H-(X2B+X3B+HR),
         any_walk=BB+HBP) %>%
  #group by year and team
  group_by(yearID, teamID) %>%
  #sum what we need
  summarise(Runs=sum(R), `At Bats`=sum(AB), Hits=sum(H),
            Singles=sum(Singles), `2B`=sum(X2B), `3B`=sum(X3B), 
            HR=sum(HR), `BB+HBP`=sum(any_walk)) %>%
  #change the column names
  rename(Year=yearID, Team=teamID)
#save as a csv
write.csv(team_batting, "team_batting.csv", row.names=FALSE)
Figure 2.1. Team batting data for 2000 season.
Year Runs At Bats Hits Singles 2B 3B HR BB+HBP Team
2000 864 5628 1574 995 309 34 236 655 ANA
2000 792 5527 1466 961 282 44 179 594 ARI
2000 810 5489 1490 1011 274 26 179 654 ATL
2000 794 5549 1508 992 310 22 184 607 BAL
2000 792 5630 1503 988 316 32 167 653 BOS
2000 978 5646 1615 1041 325 33 216 644 CHA
2000 764 5577 1426 948 272 23 183 686 CHN
2000 825 5635 1545 1007 302 36 200 623 CIN
2000 950 5683 1639 1078 310 30 221 736 CLE
2000 968 5660 1664 1130 320 53 161 643 COL
2000 823 5644 1553 1028 307 41 177 605 DET
2000 731 5509 1441 978 274 29 160 600 FLO
2000 938 5570 1547 973 289 36 249 756 HOU
2000 879 5709 1644 1186 281 27 150 559 KCA
2000 798 5481 1408 904 265 28 211 719 LAN
2000 740 5563 1366 867 297 25 177 681 MIL
2000 748 5615 1516 1026 325 49 116 591 MIN
2000 738 5535 1475 952 310 35 178 505 MON
2000 871 5556 1541 1017 294 25 205 688 NYA
2000 807 5486 1445 946 281 20 198 720 NYN
2000 947 5560 1501 958 281 23 239 802 OAK
2000 708 5511 1386 898 304 40 144 655 PHI
2000 793 5643 1506 987 320 31 168 630 PIT
2000 752 5560 1413 940 279 37 157 648 SDN
2000 907 5497 1481 957 300 26 198 823 SEA
2000 925 5519 1535 961 304 44 226 760 SFN
2000 887 5478 1481 962 259 25 235 759 SLN
2000 733 5505 1414 977 253 22 162 609 TBA
2000 848 5648 1601 1063 330 35 173 619 TEX
2000 861 5677 1562 969 328 21 244 586 TOR

James realized there should be a way to predict the runs for each team from hits, singles, 2B, 3B, HR, outs, and BB+HBP.² Using his great intuition, James came up with the following relatively simple formula.

\[ \begin{equation} \textrm{runs created} = \frac{(\textrm{hits} + \textrm{BB} + \textrm{HBP}) \times (\textrm{TB})} {(\textrm{AB} + \textrm{BB} + \textrm{HBP})}. \end{equation} \]

As we will soon see, (2.2) does an amazingly good job of predicting how many runs a team scores in a season from hits, BB, HBP, AB, 2B, 3B, and HR. What is the rationale for (2.2)? To score runs you need to have runners on base, and then you need to advance them toward home plate: (Hits + Walks + HBP) is basically the number of base runners the team will have in a season. The other part of the equation, \(\frac{\textrm{TB}}{(\textrm{AB} + \textrm{BB} + \textrm{HBP})}\), measures the rate at which runners are advanced per plate appearance. Therefore (2.2) is multiplying the number of base runners by the rate at which they are advanced. Using the information in figure 2.1 we can compute Runs Created for the 2000 Anaheim Angels.

\[ \begin{equation} \textrm{runs created} = \frac{(1{,}574 + 655) \times (995 + 2(309) + 3(34) + 4(236))} {(5{,}628 + 655)} = 943. \end{equation} \]
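As a minimal sketch of the same calculation in base R, the function below splits (2.2) into its two factors, the number of base runners and the rate at which they are advanced. The function and variable names are illustrative assumptions; the inputs are the Anaheim row of figure 2.1.

```r
# Simple Runs Created, equation (2.2), written as two factors
runs_created <- function(hits, bb_hbp, tb, ab) {
  base_runners <- hits + bb_hbp        # times a batter reached base
  advance_rate <- tb / (ab + bb_hbp)   # total bases per plate appearance
  base_runners * advance_rate
}

# 2000 Anaheim Angels, values from figure 2.1
ana_tb <- 995 + 2 * 309 + 3 * 34 + 4 * 236   # 2659 total bases
ana_rc <- runs_created(hits = 1574, bb_hbp = 655, tb = ana_tb, ab = 5628)
round(ana_rc, 2)                              # 943.33

# Relative error against the 864 runs the Angels actually scored
round((ana_rc - 864) / 864, 3)                # 0.092
```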

#total bases for each team
team_batting$TB = team_batting$Singles + 
  2*(team_batting$`2B`) + 
  3*(team_batting$`3B`) + 
  4*(team_batting$HR)
#runs created
team_batting$RC = ((team_batting$Hits + team_batting$`BB+HBP`)*team_batting$TB)/
  (team_batting$`At Bats` + team_batting$`BB+HBP`)
#percent error
team_batting$RC_error = abs((team_batting$RC-team_batting$Runs)/team_batting$Runs)
Figure 2.2. Team Runs and Runs Created for the 2000 season.
Year Team Runs RC % Error
2000 ANA 864 943.33 9.2%
2000 ARI 792 798.62 0.8%
2000 ATL 810 821.23 1.4%
2000 BAL 794 829.37 4.5%
2000 BOS 792 818.07 3.3%
2000 CHA 978 953.16 2.5%
2000 CHN 764 773.24 1.2%
2000 CIN 825 872.67 5.8%
2000 CLE 950 988.63 4.1%
2000 COL 968 941.76 2.7%
2000 DET 823 854.01 3.8%
2000 FLO 731 752.72 3.0%
2000 HOU 938 966.56 3.0%
2000 KCA 879 853.72 2.9%
2000 LAN 798 810.32 1.5%
2000 MIL 740 735.66 0.6%
2000 MIN 748 776.46 3.8%
2000 MON 738 783.15 6.1%
2000 NYA 871 892.46 2.5%
2000 NYN 807 823.30 2.0%
2000 OAK 947 921.27 2.7%
2000 PHI 708 728.88 2.9%
2000 PIT 793 814.49 2.7%
2000 SDN 752 742.66 1.2%
2000 SEA 907 884.78 2.4%
2000 SFN 925 952.14 2.9%
2000 SLN 887 896.07 1.0%
2000 TBA 733 726.94 0.8%
2000 TEX 848 892.68 5.3%
2000 TOR 861 913.66 6.1%

Actually, the 2000 Anaheim Angels scored 864 runs, so Runs Created overestimated the actual number of runs by around 9%.

We find that Runs Created was off by an average of 28 runs per team. Since the average team scored 775 runs, we find an average error of less than 4% when we try to use (2.2) to predict team Runs Scored. It is amazing that this simple, intuitively appealing formula does such a good job of predicting runs scored by a team. Even though more complex versions of Runs Created more accurately predict actual Runs Scored, the simplicity of (2.2) has caused this formula to continue to be widely used by the baseball community.
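The two error summaries can be sketched for a single season using only the numbers printed in figure 2.2. This is an illustration in base R with assumed variable names; because it covers the 2000 season alone, its results need not match the multi-season averages quoted in the text.

```r
# Runs scored and Runs Created for the 30 teams in figure 2.2 (2000 season),
# listed in the same order as the figure (ANA through TOR)
runs <- c(864, 792, 810, 794, 792, 978, 764, 825, 950, 968,
          823, 731, 938, 879, 798, 740, 748, 738, 871, 807,
          947, 708, 793, 752, 907, 925, 887, 733, 848, 861)
rc <- c(943.33, 798.62, 821.23, 829.37, 818.07, 953.16, 773.24, 872.67,
        988.63, 941.76, 854.01, 752.72, 966.56, 853.72, 810.32, 735.66,
        776.46, 783.15, 892.46, 823.30, 921.27, 728.88, 814.49, 742.66,
        884.78, 952.14, 896.07, 726.94, 892.68, 913.66)

abs_error <- abs(rc - runs)

# Average miss in runs per team
mean_abs_error <- mean(abs_error)

# Average of each team's percent error
mean_pct_error <- mean(abs_error / runs)

c(mean_abs_error = mean_abs_error, mean_pct_error = mean_pct_error)
```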

Year Team Runs RC Error % Error
2000-2006 League Average 775.5 28.21 3.71%

Beware Blind Extrapolation!

The problem with any version of Runs Created is that the formula is based on team statistics. A typical team has a batting average of .265, hits home runs on 3% of all plate appearances, and has a walk or HBP in around 10% of all plate appearances. Contrast these numbers to those of Barry Bonds’s great 2004 season, in which he had a batting average of .362, hit a HR on 7% of all plate appearances, and received a walk or HBP during approximately 39% of his plate appearances. One of the first ideas taught in a business statistics class is the following: do not use a relationship that is fit to a data set to make predictions for data that are very different from the data used to fit the relationship. Following this logic, we should not expect a Runs Created Formula based on team data to accurately predict the runs created by a superstar such as Barry Bonds or by a very poor player. In chapter 4 we will remedy this problem.
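To make the size of the gap concrete, the sketch below recomputes Bonds’s rates from the counts in figure 2.3 and divides them by the typical team rates quoted above. It is a base R illustration with assumed names, and it approximates plate appearances as AB + BB + HBP, ignoring sacrifices.

```r
# Bonds 2004 counts from figure 2.3
bonds <- c(AB = 373, H = 135, HR = 45, BB_HBP = 241)

# Assumption: plate appearances approximated as AB + BB + HBP
pa <- unname(bonds["AB"] + bonds["BB_HBP"])

bonds_rates <- c(
  BA        = unname(bonds["H"] / bonds["AB"]),
  HR_rate   = unname(bonds["HR"] / pa),
  walk_rate = unname(bonds["BB_HBP"] / pa)
)

# Typical team rates quoted in the text
team_rates <- c(BA = 0.265, HR_rate = 0.03, walk_rate = 0.10)

round(bonds_rates, 3)

# How many times the typical team rate each of Bonds's rates is
round(bonds_rates / team_rates, 1)
```

Every ratio is above 1, and the walk rate is nearly four times the team figure, so Bonds sits far outside the data that (2.2) was checked against.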

Ichiro vs. Nomar

Despite this caveat, let’s plunge ahead and use (2.2) to compare Ichiro Suzuki’s 2004 season to Nomar Garciaparra’s 1997 season. Let’s also compute Runs Created for Barry Bonds’s 2004 season so we can compare his statistics with those of the other two players. (See figure 2.3.)

#any walk
my_stats$any_walk = my_stats$BB+my_stats$HBP
#runs created
my_stats$RC = ((my_stats$H + my_stats$any_walk)*my_stats$TB)/
  (my_stats$AB + my_stats$any_walk)
#game outs used
my_stats$outs = 0.982*my_stats$AB - my_stats$H + 
  my_stats$GIDP + my_stats$SF + my_stats$SH + my_stats$CS
#runs created per game
my_stats$RCG = my_stats$RC/(my_stats$outs/26.72)
Figure 2.3. Runs Created for Bonds, Suzuki, and Garciaparra.
Player Year At Bats Hits Singles 2B 3B HR BB+HBP Runs Created Game Outs Used Runs Created per Game
Bonds 2004 373 135 60 27 3 45 241 185.55 240.29 20.63
Suzuki 2004 704 262 225 24 5 8 53 133.16 451.33 7.88
Garciaparra 1997 684 209 124 44 11 30 41 125.86 489.69 6.87

We see that Ichiro created 133 runs and Nomar created 126 runs. Bonds created 186 runs. This indicates that Ichiro 2004 had a slightly better hitting year than Nomar 1997. Of course Bonds’s performance in 2004 was vastly superior to that of the other two players.

Runs Created Per Game

A major problem with any Runs Created metric is that a bad hitter with 700 plate appearances might create more runs than a superstar with 400 plate appearances. In figure 2.4 we compare the statistics of two hypothetical players: Nick and Dave.

Figure 2.4. Nick and Dave’s fictitious statistics.
Player At Bats Hits Singles 2B 3B HR BB+HBP Runs Created Game Outs Used Runs Created per Game
Nick 700 190 150 10 1 9 20 60.96 497.4 3.27
Dave 400 120 90 15 0 15 20 60.00 272.8 5.88

Nick had a batting average of .271 while Dave had a batting average of .300. Dave walked more often per plate appearance and had more extra-base hits. Yet Runs Created says Nick was a better player. To solve this problem we need to understand that hitters consume a scarce resource: outs. During most games a team bats for nine innings and gets 27 outs (3 outs × 9 innings = 27).³ We can now compute Runs Created per game. To see how this works let’s look at the data for Ichiro 2004 (figure 2.3).

How did we compute outs? Essentially all AB except for hits and errors result in an out. Approximately 1.8% of all AB result in errors. Therefore, we computed outs as AB - Hits - .018(AB) = .982(AB) - Hits. Hitters also create “extra” outs through sacrifice flies (SF), sacrifice bunts (SH), caught stealing (CS), and grounding into double plays (GIDP). In 2004 Ichiro created 22 of these extra outs. He “used” up 451.3 outs for the Mariners. This is equivalent to \(\frac{451.3}{26.72} = 16.9\) games. Therefore, Ichiro created \(\frac{133.16}{16.9} = 7.88\) runs per game. More formally,

\[ \begin{equation} \textrm{runs created per game} = \frac{\textrm{runs created}} {\left(0.982 \times \textrm{AB} - \textrm{hits} + \textrm{GIDP} + \textrm{SF} + \textrm{SH} + \textrm{CS}\right) / 26.72}. \end{equation} \]

Equation (2.4) simply states that Runs Created per game is Runs Created by batter divided by number of games’ worth of outs used by the batter. Figure 2.3 shows that Barry Bonds created an amazing 20.63 runs per game. Figure 2.3 also makes it clear that Ichiro in 2004 was a much more valuable hitter than was Nomar in 1997. After all, Ichiro created 7.88 runs per game while Nomar created 1.01 fewer runs per game (6.87 runs). We also see that Runs Created per game rates Dave as being 2.61 runs (5.88 - 3.27) better per game than Nick. This resolves the problem that ordinary Runs Created allowed Nick to be ranked ahead of Dave.
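Equation (2.4) can be sketched as two small base R functions. The names below are illustrative assumptions; the inputs come from the text (Ichiro’s 133.16 runs created and 22 extra outs) and from figure 2.4, where the listed outs for Nick and Dave imply no extra outs.

```r
OUTS_PER_GAME <- 26.72  # average outs per game quoted in the text

# Outs used: 98.2% of at bats, minus hits, plus extra outs
# (GIDP + SF + SH + CS)
outs_used <- function(ab, hits, extra_outs = 0) {
  0.982 * ab - hits + extra_outs
}

# Runs Created divided by the number of games' worth of outs used
rc_per_game <- function(rc, ab, hits, extra_outs = 0) {
  rc / (outs_used(ab, hits, extra_outs) / OUTS_PER_GAME)
}

# Ichiro 2004: 133.16 runs created and 22 extra outs
ichiro <- rc_per_game(rc = 133.16, ab = 704, hits = 262, extra_outs = 22)

# Nick and Dave from figure 2.4, assuming no extra outs
nick <- rc_per_game(rc = 60.96, ab = 700, hits = 190)
dave <- rc_per_game(rc = 60.00, ab = 400, hits = 120)

round(c(Ichiro = ichiro, Nick = nick, Dave = dave), 2)   # 7.88, 3.27, 5.88
```

Nick and Dave create almost the same number of runs in total, but Nick needs about 497 outs to do it and Dave only about 273, so the per-game rate separates them.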

Our estimate of Runs Created per game of 7.88 for Ichiro indicates that we believe a team consisting of nine Ichiros would score an average of 7.88 runs per game. Since no team consists of nine players like Ichiro, a more relevant question might be, how many runs would he create when batting with eight “average hitters”? In his book Win Shares (2002) Bill James came up with a more complex version of Runs Created that answers this question. I will address this question in chapter 3 and chapter 4.


  1. The data come from Sean Lahman’s fabulous baseball database, http://baseball1.com/statistics/.

  2. Of course, we are leaving out things like Sacrifice Hits, Sacrifice Flies, Stolen Bases and Caught Stealing. Later versions of Runs Created use these events to compute Runs Created. See http://danagonistes.blogspot.com/2004/10/brief-history-of-run-estimation-runs.html for an excellent summary of the evolution of Runs Created.

  3. Since the home team does not bat in the ninth inning when they are ahead and some games go into extra innings, average outs per game is not exactly 27. For the years 2001–6, average outs per game was 26.72.