Mathletics

The Runs-Created Approach

In 2004 Seattle Mariner outfielder Ichiro Suzuki set the major league record for most hits in a season. In 1997 Boston Red Sox shortstop Nomar Garciaparra had what was considered a good (but not great) year. Their key statistics are presented in table 2.1.¹ (For the sake of simplicity, henceforth Suzuki will be referred to as “Ichiro” or “Ichiro 2004” and Garciaparra will be referred to as “Nomar” or “Nomar 1997.”)

Recall that a batter’s slugging percentage is Total Bases (TB)/At Bats (AB) where

\[ \begin{equation} \textrm{TB} = \textrm{Singles} + 2 \times \textrm{Doubles (2B)} + 3 \times \textrm{Triples (3B)} + 4 \times \textrm{Home Runs (HR).} \end{equation} \]
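As a quick check of this definition, here is the arithmetic for the two players, assuming the hit counts listed for them in figure 2.3 (slugging percentages rounded to three decimals):

\[ \begin{aligned} \textrm{TB}_{\textrm{Ichiro 2004}} &= 225 + 2(24) + 3(5) + 4(8) = 320, & \textrm{SLG}_{\textrm{Ichiro 2004}} &= \frac{320}{704} \approx .455, \\ \textrm{TB}_{\textrm{Nomar 1997}} &= 124 + 2(44) + 3(11) + 4(30) = 365, & \textrm{SLG}_{\textrm{Nomar 1997}} &= \frac{365}{684} \approx .534. \end{aligned} \]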

#get the tables from Lahman
library("Lahman")
#load the batting stats
data(Batting)
my_stats = Batting[
  #ichiro 2004
  (Batting$playerID == "suzukic01" & Batting$yearID == 2004) |
  #nomar 1997
  (Batting$playerID == "garcino01" & Batting$yearID == 1997) |
  #bonds 2004 --- for later
  (Batting$playerID == "bondsba01" & Batting$yearID == 2004), ]
#add column for singles
my_stats$Singles = my_stats$H -
  (my_stats$X2B + my_stats$X3B + my_stats$HR)
#add batting average and SLG
my_stats$BA = my_stats$H/my_stats$AB
my_stats$TB = (my_stats$Singles + 
  2*my_stats$X2B + 
  3*my_stats$X3B + 
  4*my_stats$HR)
my_stats$SLG = my_stats$TB/my_stats$AB

TABLE 2.1 Statistics for Ichiro Suzuki and Nomar Garciaparra
Event	Ichiro 2004	Nomar 1997
AB	704	684
Batting Average	.372	.306
SLG	.455	.534
Hits	262	209
Singles	225	124
2B	24	44
3B	5	11
HR	8	30
BB+HBP	53	41

We see that Ichiro had a higher batting average than Nomar, but because he hit many more doubles, triples, and home runs, Nomar had a much higher slugging percentage. Ichiro walked a few more times than Nomar did. So which player had a better hitting year?

When a batter is hitting, he can cause good things (like hits or walks) to happen or cause bad things (outs) to happen. To compare hitters we must develop a metric that measures how the relative frequency of a batter’s good events and bad events influences the number of runs the team scores.

In 1979 Bill James developed the first version of his famous Runs Created Formula in an attempt to compute the number of runs “created” by a hitter during the course of a season. The most easily obtained data we have available to determine how batting events influence Runs Scored are season-long team batting statistics. A sample of this data is shown in figure 2.1.

#use the dplyr package to combine the player data into team data
library(dplyr)
#change all missing values for IBB and HBP to 0
Batting$HBP[is.na(Batting$HBP)] = 0
Batting$IBB[is.na(Batting$IBB)] = 0
team_batting = Batting %>% 
  #add column for singles and one for any type of walk
  mutate(Singles=H-(X2B+X3B+HR),
         any_walk=BB+HBP) %>%
  #group by year and team
  group_by(yearID, teamID) %>%
  #sum what we need
  summarise(Runs=sum(R), `At Bats`=sum(AB), Hits=sum(H),
            Singles=sum(Singles), `2B`=sum(X2B), `3B`=sum(X3B), 
            HR=sum(HR), `BB+HBP`=sum(any_walk)) %>%
  #change the column names
  rename(Year=yearID, Team=teamID)
#save as a csv
write.csv(team_batting, "team_batting.csv", row.names=FALSE)

Figure 2.1. Team batting data for the 2000 season.
Year	Runs	At Bats	Hits	Singles	2B	3B	HR	BB+HBP	Team
2000	864	5628	1574	995	309	34	236	655	ANA
2000	792	5527	1466	961	282	44	179	594	ARI
2000	810	5489	1490	1011	274	26	179	654	ATL
2000	794	5549	1508	992	310	22	184	607	BAL
2000	792	5630	1503	988	316	32	167	653	BOS
2000	978	5646	1615	1041	325	33	216	644	CHA
2000	764	5577	1426	948	272	23	183	686	CHN
2000	825	5635	1545	1007	302	36	200	623	CIN
2000	950	5683	1639	1078	310	30	221	736	CLE
2000	968	5660	1664	1130	320	53	161	643	COL
2000	823	5644	1553	1028	307	41	177	605	DET
2000	731	5509	1441	978	274	29	160	600	FLO
2000	938	5570	1547	973	289	36	249	756	HOU
2000	879	5709	1644	1186	281	27	150	559	KCA
2000	798	5481	1408	904	265	28	211	719	LAN
2000	740	5563	1366	867	297	25	177	681	MIL
2000	748	5615	1516	1026	325	49	116	591	MIN
2000	738	5535	1475	952	310	35	178	505	MON
2000	871	5556	1541	1017	294	25	205	688	NYA
2000	807	5486	1445	946	281	20	198	720	NYN
2000	947	5560	1501	958	281	23	239	802	OAK
2000	708	5511	1386	898	304	40	144	655	PHI
2000	793	5643	1506	987	320	31	168	630	PIT
2000	752	5560	1413	940	279	37	157	648	SDN
2000	907	5497	1481	957	300	26	198	823	SEA
2000	925	5519	1535	961	304	44	226	760	SFN
2000	887	5478	1481	962	259	25	235	759	SLN
2000	733	5505	1414	977	253	22	162	609	TBA
2000	848	5648	1601	1063	330	35	173	619	TEX
2000	861	5677	1562	969	328	21	244	586	TOR

James realized there should be a way to predict the runs for each team from hits, singles, 2B, 3B, HR, outs, and BB+HBP.² Using his great intuition, James came up with the following relatively simple formula.

\[ \begin{equation} \textrm{runs created} = \frac{(\textrm{hits} + \textrm{BB} + \textrm{HBP}) \times (\textrm{TB})} {(\textrm{AB} + \textrm{BB} + \textrm{HBP})}. \end{equation} \]

As we will soon see, (2.2) does an amazingly good job of predicting how many runs a team scores in a season from hits, BB, HBP, AB, 2B, 3B, and HR. What is the rationale for (2.2)? To score runs you need to have runners on base, and then you need to advance them toward home plate. The first factor, (Hits + Walks + HBP), is basically the number of base runners the team will have in a season. The other part of the equation, \(\frac{\textrm{TB}}{(\textrm{AB} + \textrm{BB} + \textrm{HBP})}\), measures the rate at which runners are advanced per plate appearance. Therefore (2.2) multiplies the number of base runners by the rate at which they are advanced. Using the information in figure 2.1 we can compute Runs Created for the 2000 Anaheim Angels.

\[ \begin{equation} \textrm{runs created} = \frac{(1,574 + 655) \times (995 + 2(309) + 3(34) + 4(236))} {(5,628 + 655)} = 943. \end{equation} \]

#total bases for each team
team_batting$TB = team_batting$Singles + 
  2*(team_batting$`2B`) + 
  3*(team_batting$`3B`) + 
  4*(team_batting$HR)
#runs created
team_batting$RC = ((team_batting$Hits + team_batting$`BB+HBP`)*team_batting$TB)/
  (team_batting$`At Bats` + team_batting$`BB+HBP`)
#percent error
team_batting$RC_error = abs((team_batting$RC-team_batting$Runs)/team_batting$Runs)

Figure 2.2. Team Runs and Runs Created for the 2000 season.
Year	Team	Runs	RC	% Error
2000	ANA	864	943.33	9.2%
2000	ARI	792	798.62	0.8%
2000	ATL	810	821.23	1.4%
2000	BAL	794	829.37	4.5%
2000	BOS	792	818.07	3.3%
2000	CHA	978	953.16	2.5%
2000	CHN	764	773.24	1.2%
2000	CIN	825	872.67	5.8%
2000	CLE	950	988.63	4.1%
2000	COL	968	941.76	2.7%
2000	DET	823	854.01	3.8%
2000	FLO	731	752.72	3.0%
2000	HOU	938	966.56	3.0%
2000	KCA	879	853.72	2.9%
2000	LAN	798	810.32	1.5%
2000	MIL	740	735.66	0.6%
2000	MIN	748	776.46	3.8%
2000	MON	738	783.15	6.1%
2000	NYA	871	892.46	2.5%
2000	NYN	807	823.30	2.0%
2000	OAK	947	921.27	2.7%
2000	PHI	708	728.88	2.9%
2000	PIT	793	814.49	2.7%
2000	SDN	752	742.66	1.2%
2000	SEA	907	884.78	2.4%
2000	SFN	925	952.14	2.9%
2000	SLN	887	896.07	1.0%
2000	TBA	733	726.94	0.8%
2000	TEX	848	892.68	5.3%
2000	TOR	861	913.66	6.1%

Actually, the 2000 Anaheim Angels scored 864 runs, so Runs Created overestimated the actual number of runs by around 9%.

We find that Runs Created was off by an average of 28 runs per team. Since the average team scored 775 runs, we find an average error of less than 4% when we try to use (2.2) to predict team Runs Scored. It is amazing that this simple, intuitively appealing formula does such a good job of predicting runs scored by a team. Even though more complex versions of Runs Created more accurately predict actual Runs Scored, the simplicity of (2.2) has caused this formula to continue to be widely used by the baseball community.

Year	Team	Runs	RC Error	% Error
2000-2006	League Average	775.5	28.21	3.71%
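The error summary above can be sketched in a few lines of R. This is a minimal illustration, not the code behind the 2000-2006 averages: it uses only three rows copied from figure 2.1, so its averages will differ from the 28-run and 3.71% figures, and the names `runs_created` and `teams` are hypothetical helpers introduced here.

```r
# Basic Runs Created: (H + BB + HBP) * TB / (AB + BB + HBP).
# `walks` stands for BB + HBP, matching the BB+HBP column in figure 2.1.
runs_created <- function(hits, walks, total_bases, at_bats) {
  (hits + walks) * total_bases / (at_bats + walks)
}

# Three teams copied from figure 2.1 (2000 season), for illustration only.
teams <- data.frame(
  team    = c("ANA", "ARI", "ATL"),
  runs    = c(864, 792, 810),
  at_bats = c(5628, 5527, 5489),
  hits    = c(1574, 1466, 1490),
  singles = c(995, 961, 1011),
  doubles = c(309, 282, 274),
  triples = c(34, 44, 26),
  hr      = c(236, 179, 179),
  walks   = c(655, 594, 654)
)

# Total bases, Runs Created, and the two error measures used in the text.
teams$tb <- teams$singles + 2 * teams$doubles + 3 * teams$triples + 4 * teams$hr
teams$rc <- runs_created(teams$hits, teams$walks, teams$tb, teams$at_bats)
teams$abs_error <- abs(teams$rc - teams$runs)
teams$pct_error <- teams$abs_error / teams$runs

# Average absolute error (in runs) and average percent error over these rows.
mean_abs_error <- mean(teams$abs_error)
mean_pct_error <- mean(teams$pct_error)
print(teams[, c("team", "runs", "rc", "abs_error", "pct_error")])
```

Averaging the absolute error, rather than the signed error, keeps overestimates and underestimates from cancelling each other out.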

Beware Blind Extrapolation!

The problem with any version of Runs Created is that the formula is based on team statistics. A typical team has a batting average of .265, hits home runs on 3% of all plate appearances, and has a walk or HBP in around 10% of all plate appearances. Contrast these numbers to those of Barry Bonds’s great 2004 season in which he had a batting average of .362, hit a HR on 7% of all plate appearances, and received a walk or HBP during approximately 39% of his plate appearances. One of the first ideas taught in business statistics class is the following: do not use a relationship that is fit to a data set to make predictions for data that are very different from the data used to fit the relationship. Following this logic, we should not expect a Runs Created Formula based on team data to accurately predict the runs created by a superstar such as Barry Bonds or by a very poor player. In chapter 4 we will remedy this problem.
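The Bonds figures quoted above can be reproduced from his line in figure 2.3, under the simplifying assumption that plate appearances are approximately AB + BB + HBP (sacrifices are ignored here):

\[ \begin{aligned} \textrm{batting average} &= \frac{135}{373} \approx .362, \\ \textrm{HR rate} &= \frac{45}{373 + 241} = \frac{45}{614} \approx 7\%, \\ \textrm{walk or HBP rate} &= \frac{241}{614} \approx 39\%. \end{aligned} \]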

Ichiro vs. Nomar

Despite this caveat, let’s plunge ahead and use (2.2) to compare Ichiro Suzuki’s 2004 season to Nomar Garciaparra’s 1997 season. Let’s also compute Runs Created for Barry Bonds’s 2004 season so that we can compare his statistics with those of the other two players. (See figure 2.3.)

#any walk
my_stats$any_walk = my_stats$BB+my_stats$HBP
#runs created
my_stats$RC = ((my_stats$H + my_stats$any_walk)*my_stats$TB)/
  (my_stats$AB + my_stats$any_walk)
#game outs used
my_stats$outs = 0.982*my_stats$AB - my_stats$H + 
  my_stats$GIDP + my_stats$SF + my_stats$SH + my_stats$CS
#runs created per game
my_stats$RCG = my_stats$RC/(my_stats$outs/26.72)

Figure 2.3. Runs Created for Bonds, Suzuki, and Garciaparra.
Player	Year	At Bats	Hits	Singles	2B	3B	HR	BB + HBP	Runs Created	Game Outs Used	Runs Created per Game
Bonds 2004	2004	373	135	60	27	3	45	241	185.55	240.29	20.63
Suzuki 2004	2004	704	262	225	24	5	8	53	133.16	451.33	7.88
Garciaparra 1997	1997	684	209	124	44	11	30	41	125.86	489.69	6.87

We see that Ichiro created 133 runs and Nomar created 126 runs. Bonds created 186 runs. This indicates that Ichiro 2004 had a slightly better hitting year than Nomar 1997. Of course Bonds’s performance in 2004 was vastly superior to that of the other two players.
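For reference, substituting each player's totals from figure 2.3 into (2.2), with total bases of 320 for Ichiro, 365 for Nomar, and 303 for Bonds computed from the hit columns, gives:

\[ \begin{aligned} \textrm{Ichiro 2004:} \quad & \frac{(262 + 53) \times 320}{704 + 53} = \frac{100{,}800}{757} \approx 133.16, \\ \textrm{Nomar 1997:} \quad & \frac{(209 + 41) \times 365}{684 + 41} = \frac{91{,}250}{725} \approx 125.86, \\ \textrm{Bonds 2004:} \quad & \frac{(135 + 241) \times 303}{373 + 241} = \frac{113{,}928}{614} \approx 185.55. \end{aligned} \]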

Runs Created Per Game

A major problem with any Runs Created metric is that a bad hitter with 700 plate appearances might create more runs than a superstar with 400 plate appearances. In figure 2.4 we compare the statistics of two hypothetical players: Nick and Dave.

Figure 2.4. Nick and Dave’s fictitious statistics.
Player	At Bats	Hits	Singles	2B	3B	HR	BB + HBP	Runs Created	Game Outs Used	Runs Created per Game
Nick	700	190	150	10	1	9	20	60.96	497.4	3.27
Dave	400	120	90	15	0	15	20	60.00	272.8	5.88

Nick had a batting average of .257 while Dave had a batting average of .300. Dave walked more often per plate appearance and had more extra-base hits. Yet Runs Created says Nick was a better player. To solve this problem we need to understand that hitters consume a scarce resource: outs. During most games a team bats for nine innings and gets 27 outs (3 outs × 9 innings = 27).³ We can now compute Runs Created per game. To see how this works let’s look at the data for Ichiro 2004 (figure 2.3).

How did we compute outs? Essentially all AB except for hits and errors result in an out. Approximately 1.8% of all AB result in errors. Therefore, we computed outs as AB - Hits - .018(AB) = .982(AB) - Hits. Hitters also create “extra” outs through sacrifice flies (SF), sacrifice bunts (SH), caught stealing (CS), and grounding into double plays (GIDP). In 2004 Ichiro created 22 of these extra outs. He “used” up 451.3 outs for the Mariners. This is equivalent to \(\frac{451.3}{26.72} = 16.9\) games. Therefore, Ichiro created \(\frac{133.16}{16.9} = 7.88\) runs per game. More formally,

\[ \begin{equation} \textrm{runs created per game} = \frac{\textrm{runs created}} {\frac{.982\textrm{(AB) - hits + GIDP + SF + SH + CS}}{26.72}} \end{equation} \]

Equation (2.4) simply states that Runs Created per game is Runs Created by batter divided by number of games’ worth of outs used by the batter. Figure 2.3 shows that Barry Bonds created an amazing 20.63 runs per game. Figure 2.3 also makes it clear that Ichiro in 2004 was a much more valuable hitter than was Nomar in 1997. After all, Ichiro created 7.88 runs per game while Nomar created 1.01 fewer runs per game (6.87 runs). We also see that Runs Created per game rates Dave as being 2.61 runs (5.88 - 3.27) better per game than Nick. This resolves the problem that ordinary Runs Created allowed Nick to be ranked ahead of Dave.
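The calculation in (2.4) can be sketched as a pair of small R functions. This is a minimal illustration under stated assumptions: `outs_used` and `runs_created_per_game` are hypothetical helper names, the Runs Created values are typed in from figures 2.3 and 2.4 rather than recomputed, and Nick and Dave are given zero extra outs because the outs shown in figure 2.4 equal .982(AB) - Hits exactly.

```r
# Outs used by a batter: about 98.2% of at bats that are not hits,
# plus "extra" outs (GIDP + SF + SH + CS) passed in as a single total.
outs_used <- function(at_bats, hits, extra_outs = 0) {
  0.982 * at_bats - hits + extra_outs
}

# Runs Created per game: Runs Created divided by games' worth of outs used.
# 26.72 outs per game is the constant used in equation (2.4).
runs_created_per_game <- function(rc, at_bats, hits, extra_outs = 0,
                                  outs_per_game = 26.72) {
  rc / (outs_used(at_bats, hits, extra_outs) / outs_per_game)
}

# Inputs copied from figures 2.3 and 2.4; extra outs for Nick and Dave
# are an assumption (zero), Ichiro's 22 comes from the text.
hitters <- data.frame(
  player     = c("Nick", "Dave", "Ichiro 2004"),
  at_bats    = c(700, 400, 704),
  hits       = c(190, 120, 262),
  extra_outs = c(0, 0, 22),
  rc         = c(60.96, 60.00, 133.16)
)

hitters$outs <- outs_used(hitters$at_bats, hitters$hits, hitters$extra_outs)
hitters$rcg <- runs_created_per_game(hitters$rc, hitters$at_bats,
                                     hitters$hits, hitters$extra_outs)
print(hitters)
```

Keeping the outs calculation in its own function makes it easy to see that Dave's advantage comes entirely from the denominator: he creates about as many runs as Nick while using far fewer outs.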

Our estimate of Runs Created per game of 7.88 for Ichiro indicates that we believe a team consisting of nine Ichiros would score an average of 7.88 runs per game. Since no team consists of nine players like Ichiro, a more relevant question might be, how many runs would he create when batting with eight “average hitters”? In his book Win Shares (2002) Bill James came up with a more complex version of Runs Created that answers this question. I will address this question in chapter 3 and chapter 4.


by Wayne Winston, R code by Nick Capofari

January 27, 2017

Chapter 2: WHO HAD A BETTER YEAR, NOMAR GARCIAPARRA OR ICHIRO SUZUKI?
