AN EMPIRICAL TEST OF BILL JAMES'S PYTHAGOREAN FORMULA

by

Paul M. Sommers
David U. Cha
and
Daniel P. Glatt

March 2010

MIDDLEBURY COLLEGE ECONOMICS DISCUSSION PAPER NO. 10-06

DEPARTMENT OF ECONOMICS
MIDDLEBURY COLLEGE
MIDDLEBURY, VERMONT 05753

http://www.middlebury.edu/~econ

by

David U. Cha
Daniel P. Glatt
Paul M. Sommers

Department of Economics
Middlebury College
Middlebury, Vermont 05753
psommers@middlebury.edu

"The Giants do not usually need to score many runs. All that we must do is score more than the other fellow."
    Bill Terry, manager of the 1932 New York Giants [1, p. 136]

Bill James, baseball writer and statistician, in 1980 developed a formula that related a team's won-lost percentage to the number of runs they scored and allowed, as follows:

(1)  $\text{Won-Lost Percentage} = \dfrac{(\text{RunsScored})^2}{(\text{RunsScored})^2 + (\text{RunsAllowed})^2}$

Since the Won-Lost Percentage is the ratio of games won to the total number of games played (games won plus games lost), equation (1) can be re-written as follows:

(2)  $\dfrac{\text{Wins}}{\text{Losses}} = \dfrac{(\text{RunsScored})^2}{(\text{RunsAllowed})^2} = \left(\dfrac{\text{RunsScored}}{\text{RunsAllowed}}\right)^2$

If, for example, the Boston Red Sox score 867 runs and allow 657 runs (as they did in 2007), Bill James's Pythagorean method (so dubbed because of the presence of three squared terms in equation (1); see footnote 1) projects that the team would have a won-lost percentage of $(867)^2/[(867)^2 + (657)^2]$ or .635 (and hence win about .635 × 162 or 103 games). In fact, the Red Sox (world champions in 2007) won 96 regular season games or about 6.8 percent fewer games than the Pythagorean method would predict. In this instance, the exponent of 2 on the right-hand side of equation (2)
is too high. Or, one could argue that 2 is accurate, but the Boston Red Sox should have won more regular season games in 2007 than they actually did.

The purpose of this brief note is to empirically test Bill James's Pythagorean method for all teams in both leagues, by decade, from 1950 to 2007. Does the method work as well since 1980 as it did before 1980? Does the method work better for one league (American or National) than the other? Has the exponent in equation (2) changed in recent decades?

The Models

Equation (2) can be written in log-linear form as follows:

(3)  $\ln\left(\dfrac{\text{Wins}}{\text{Losses}}\right) = 2\,\ln\left(\dfrac{RS}{RA}\right)$

where ln is the natural logarithm, RS denotes runs scored, and RA denotes runs allowed. That is, if we first take logs of both sides of equation (2) and then define $y = \ln(\text{Wins}/\text{Losses})$ and $x = \ln(RS/RA)$ for each team i in year t, we can estimate the coefficients $\beta_0$ and $\beta_1$ by applying least squares to y and x in the following regression:

(4)  $y = \beta_0 + \beta_1 x + \varepsilon$

where $\varepsilon$ is a disturbance term. According to Bill James, $\beta_0$ should be indistinguishable from zero and $\beta_1$ should be close to 2. To test the null hypothesis $H_0: \beta_1 = 2$, we employ a t-test. The test statistic is

$t_{calc} = \dfrac{b_1 - 2}{SE(b_1)}$,

where $b_1$ is the estimated slope coefficient and $SE(b_1)$ is the
standard error of the estimated slope coefficient (see footnote 2). Hereafter, equation (4) where $y = \ln(\text{Wins}/\text{Losses})$ and $x = \ln(RS/RA)$ will be called Model (1).

Model (1) assumes that a given percentage increase in runs scored has the same impact on a team's win-loss ratio as an equal percentage decrease in runs allowed. But what if scoring runs was more (or less) important to winning games than preventing runs? Model (1) might then be revised as follows:

(5)  $\ln\left(\dfrac{\text{Wins}}{\text{Losses}}\right) = \beta_0 + \beta_1 \ln(RS) + \beta_2 \ln(RA) + \varepsilon$

If we relax the assumption that the exponents on RS and RA are the same in magnitude (and, according to James, equal to 2), then the revised model is described by equation (5), hereafter Model (2). Model (1) is the special case of Model (2) in which $\beta_2 = -\beta_1$.

The Data

Data on regular season wins, losses, runs scored, and runs allowed for all teams were gleaned from two primary sources: Total Baseball [3] for the years 1950 through 2003 and http://sports.espn.go.com/mlb/standings for the years 2004 through 2007.

The Results

Table 1 shows the regression results for each league (as well as for both leagues combined) for each decade since the 1950s. The estimated intercept ($b_0$) in all regressions is not discernible from zero, as Bill James would expect. Since the year 2000, the exponent in the ratio of runs scored to runs allowed in James's Pythagorean formula has been indistinguishable from 2. But in most decades before the turn of the millennium, the estimated exponent was statistically distinguishable from 2. And, in all cases when we could reject $H_0: \beta_1 = 2$ (in favor of the alternative hypothesis $H_A: \beta_1 \neq 2$),
our estimate $b_1$ was invariably less than 2. A comparison of the 30-year period 1950-1979 to the 28-year period 1980-2007 shows that $b_1$ was in most cases (with the exception of the American League (AL) from 1980 to 2007) significantly less than 2. The impact of RS/RA on winning is marginally higher now (1980-2007) than it was in the earlier period (1950-1979). Compare the value of $b_1$ (1.9202) estimated for both leagues combined in the period 1980-2007 to the corresponding estimate for $b_1$ (1.8099) in the period 1950-1979. Moreover, it is worth noting that the average number of runs scored is also higher in the National (American) League in the period 1980-2007 than it was in the period 1950-1979 [$\overline{RS}_{1980\text{-}2007,\,NL} = 699$, $\overline{RS}_{1950\text{-}1979,\,NL} = 667.3$, p-value on the difference between means is less than .001; $\overline{RS}_{1980\text{-}2007,\,AL} = 747$, $\overline{RS}_{1950\text{-}1979,\,AL} = 669.6$, p-value on the difference is again less than .001].

Figures 1 and 2 show scatterplots of ln(W/L) against ln(RS/RA) for each subperiod (1950-1979 and 1980-2007, respectively) for each league. Each point represents an observation on one team in one year. The points fall more closely on a straight line for the National League, 1950-1979 than they do for the National League, 1980-2007 (compare $R^2$ = .878 for 1950-1979 with $R^2$ = .847 for 1980-2007 in Table 1). Still, the differences between the two periods by league are admittedly very small.

Table 2 shows the regression results for Model (2), which isolates the impact of runs scored from the impact of runs allowed on the win-loss ratio. The right-hand column reports the coefficient of determination ($R^2$) for each regression each decade, by league. A look down this column and the corresponding column in Table 1 clearly shows that the explanatory power (that is, how well the regressors as a group explain variation in the dependent variable, namely, ln(Wins/Losses)) of Model (2) is not an improvement over Model (1). In other words, runs scored and runs allowed seemingly have an equal (and opposite) effect on the win-loss ratio.
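
As an illustration of the estimation procedure described under The Models, the following is a minimal sketch (not the authors' code) that fits Model (1) by ordinary least squares with NumPy and forms the t statistic for $H_0: \beta_1 = 2$. The team-seasons are synthetic and purely hypothetical, generated from an assumed exponent of 1.85; the function name `fit_model_1` and all variable names are introduced here for illustration only.

```python
import numpy as np

def fit_model_1(wins, losses, rs, ra):
    """Fit ln(W/L) = b0 + b1*ln(RS/RA) by OLS.

    Returns the coefficients, their standard errors, and the
    t statistic for the null hypothesis beta1 = 2.
    """
    y = np.log(np.asarray(wins, dtype=float) / np.asarray(losses, dtype=float))
    x = np.log(np.asarray(rs, dtype=float) / np.asarray(ra, dtype=float))
    X = np.column_stack([np.ones_like(x), x])   # intercept and regressor
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates
    resid = y - X @ b
    dof = len(y) - X.shape[1]                   # n - 2 degrees of freedom
    s2 = resid @ resid / dof                    # residual variance
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X)))
    t_calc = (b[1] - 2.0) / se[1]
    return b, se, t_calc

# Synthetic (hypothetical) team-seasons, NOT the paper's data.
rng = np.random.default_rng(0)
n = 200
rs = rng.uniform(600, 900, n)
ra = rng.uniform(600, 900, n)
true_exponent = 1.85                            # assumed for the simulation
log_odds = true_exponent * np.log(rs / ra) + rng.normal(0.0, 0.05, n)
win_pct = 1.0 / (1.0 + np.exp(-log_odds))
wins = 162 * win_pct
losses = 162 - wins

b, se, t_calc = fit_model_1(wins, losses, rs, ra)
print(f"b0 = {b[0]:.4f} [{se[0]:.4f}]")
print(f"b1 = {b[1]:.4f} [{se[1]:.4f}]")
print(f"t statistic for H0: beta1 = 2 -> {t_calc:.2f}")
```

With real data, the four input arrays would hold one entry per team-season, and the printed bracketed values correspond to the standard errors reported in brackets in Table 1.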

Concluding Remarks

Early in the 1980s, Bill James developed a formula in response to the question: Can you tell how many games a team will win, based on its runs scored and runs allowed? A regression analysis of data on regular season runs scored, runs allowed, and wins (and losses) for each team each season in Major League Baseball since 1950 reveals that Bill James's Pythagorean formula has stood the test of time very well indeed. Runs scored and runs allowed have equal (and opposite) effects on team winning, in both leagues and in years before and since 1980. If any modification should be made to the formula, the exponent on runs scored and runs allowed should be reduced to a power slightly below 2 [about 1.92 for both leagues since the year 1980].
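
To make the suggested modification concrete, the short sketch below recomputes the 2007 Boston Red Sox projection (867 runs scored, 657 allowed, 162 games) with the original exponent of 2 and with the estimated exponent of 1.92. The function name `pythagorean_wins` is hypothetical and chosen here only for illustration; the generalized formula simply replaces the power 2 in equation (1) with an arbitrary exponent.

```python
def pythagorean_wins(runs_scored, runs_allowed, exponent=2.0, games=162):
    """Projected winning percentage and wins from a generalized
    Pythagorean formula: RS^k / (RS^k + RA^k)."""
    rs_k = runs_scored ** exponent
    ra_k = runs_allowed ** exponent
    win_pct = rs_k / (rs_k + ra_k)
    return win_pct, win_pct * games

# 2007 Boston Red Sox figures quoted in the text.
pct_2, wins_2 = pythagorean_wins(867, 657, exponent=2.0)
pct_192, wins_192 = pythagorean_wins(867, 657, exponent=1.92)

print(f"exponent 2.00: pct = {pct_2:.3f}, projected wins = {wins_2:.1f}")
print(f"exponent 1.92: pct = {pct_192:.3f}, projected wins = {wins_192:.1f}")
```

The lower exponent moves the projection from about 103 wins to about 102, still above the 96 games the team actually won, which illustrates how small the practical effect of the adjustment is for a single team-season.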

Table 1. Regression Results for Model (1)
ln(Wins/Losses) = b0 + b1 ln(RS/RA)

                                                  Slope coefficient
                            Intercept             on ln(RS/RA)
Period      League          b0                    b1                   R^2
1950-1959   AL              -.0059  [.0131] a     1.7543 [.0598]       .917
            NL               .00003 [.0122]       1.8758 [.0737]       .893
            Both leagues    -.0030  [.0089]       1.7985 [.0461]       .906
1960-1969   AL              -.0017  [.0094]       1.8757 [.0593]       .911
            NL              -.0013  [.0111]       1.9323 [.0655]       .901
            Both leagues    -.0016  [.0072]       1.9055 [.0441]       .905
1970-1979   AL               .00001 [.0091]       1.8139 [.0560]       .894
            NL               .0012  [.0101]       1.6576 [.0642]       .850
            Both leagues     .0006  [.0068]       1.7398 [.0425]       .873
1980-1989   AL               .0005  [.0078]       1.8849 [.0577]       .885
            NL              -.0017  [.0100]       2.0195 [.0848]       .828
            Both leagues    -.0005  [.0063]       1.9381 [.0489]       .859
1990-1999   AL               .00003 [.0078]       1.9324 [.0599]       .883
            NL              -.0021  [.0090]       1.8370 [.0645]       .856
            Both leagues    -.0012  [.0060]       1.8814 [.0441]       .869
2000-2007   AL              -.0055  [.0099]       2.0026 [.0624]       .904
            NL              -.0004  [.0089]       1.8720 [.0682]       .857
            Both leagues    -.0023  [.0066]       1.9445 [.0458]       .883

1950-1979   AL              -.0021  [.0059]       1.8062 [.0333]       .907
            NL               .00005 [.0064]       1.8146 [.0393]       .878
            Both leagues    -.0011  [.0044]       1.8099 [.0255]       .893
1980-2007   AL              -.0012  [.0048]       1.9415 [.0344]       .891
            NL              -.0014  [.0054]       1.8951 [.0411]       .847
            Both leagues    -.0013  [.0036]       1.9202 [.0266]       .870

a Numbers in brackets are standard errors and numbers in boldface (italics) are significant at better than the .01 (.05) level. The null hypothesis for the intercept is H0: β0 = 0 and the null hypothesis for the slope coefficient on ln(RS/RA) is H0: β1 = 2. In both cases, the alternative hypothesis is two-tailed.
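
The slope tests summarized in Table 1 can be reproduced approximately from the reported estimates and standard errors alone. The sketch below computes $t_{calc} = (b_1 - 2)/SE(b_1)$ for the "Both leagues" rows. Because the number of team-seasons in each regression is not reported here, it assumes large-sample normal critical values (1.96 and 2.576 for the two-tailed .05 and .01 levels) rather than exact t critical values; the helper names are hypothetical.

```python
def t_against_two(b1, se):
    """t statistic for H0: beta1 = 2."""
    return (b1 - 2.0) / se

def verdict(t, crit_05=1.96, crit_01=2.576):
    """Two-tailed decision using normal critical values (an assumption;
    exact degrees of freedom are not available from the table)."""
    if abs(t) > crit_01:
        return "rejected at 1%"
    if abs(t) > crit_05:
        return "rejected at 5%"
    return "not rejected at 5%"

# (b1, SE) for the "Both leagues" rows of Table 1.
both_leagues = {
    "1950-1959": (1.7985, 0.0461),
    "1960-1969": (1.9055, 0.0441),
    "1970-1979": (1.7398, 0.0425),
    "1980-1989": (1.9381, 0.0489),
    "1990-1999": (1.8814, 0.0441),
    "2000-2007": (1.9445, 0.0458),
    "1950-1979": (1.8099, 0.0255),
    "1980-2007": (1.9202, 0.0266),
}

results = {period: t_against_two(b1, se)
           for period, (b1, se) in both_leagues.items()}

for period, t in results.items():
    print(f"{period}: t = {t:6.2f}  H0 {verdict(t)}")
```

Under these assumptions, every statistic is negative, consistent with the observation that whenever the null hypothesis is rejected the estimated exponent lies below 2.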

Table 2. Regression Results for Model (2)
ln(Wins/Losses) = b0 + b1 ln(RS) + b2 ln(RA)

                                                 Slope coefficient on:
                            Intercept            ln(RS)              ln(RA)
Period      League          b0                   b1                  b2                   R^2
1950-1959   AL                .3888 [.9790]      1.7224 [.0995]      -1.7829 [.0930]      .917
            NL              -1.0380 [1.1370]     1.9526 [.1119]      -1.7937 [.1164]      .894
            Both leagues     -.2284 [.7351]      1.8162 [.0739]      -1.7816 [.0718]      .906
1960-1969   AL                .3171 [.5948]      1.8494 [.0771]      -1.8986 [.0733]      .911
            NL                .8684 [.7091]      1.8708 [.0823]      -2.0052 [.0883]      .902
            Both leagues      .5648 [.4578]      1.8624 [.0562]      -1.9499 [.0567]      .906
1970-1979   AL               -.3023 [.5620]      1.8365 [.0702]      -1.7900 [.0715]      .895
            NL               -.6484 [.7772]      1.7046 [.0854]      -1.6046 [.0903]      .850
            Both leagues     -.4204 [.4613]      1.7060 [.0545]      -1.7060 [.0564]      .873
1980-1989   AL                .0077 [.3052]      1.8844 [.0619]      -1.8855 [.0630]      .885
            NL               -.3260 [.4232]      2.0462 [.0919]      -1.9959 [.0904]      .829
            Both leagues     -.1265 [.2429]      1.9477 [.0524]      -1.9283 [.0525]      .859
1990-1999   AL                .0794 [.4242]      1.9265 [.0680]      -1.9385 [.0683]      .883
            NL               -.1865 [.4482]      1.8535 [.0762]      -1.8253 [.0707]      .856
            Both leagues     -.0895 [.2951]      1.8887 [.0505]      -1.8753 [.0486]      .869
2000-2007   AL               -.8735 [.9428]      2.0721 [.0980]      -1.9421 [.0907]      .904
            NL               1.6195 [.7686]      1.7343 [.0938]      -1.9788 [.0842]      .862
            Both leagues      .5502 [.5714]      1.8998 [.0652]      -1.9829 [.0666]      .884

a Numbers in brackets are standard errors and numbers in boldface (italics) are significant at better than the .01 (.05) level. The null hypothesis for the intercept is H0: β0 = 0 and the null hypotheses for the slope coefficients on ln(RS) and ln(RA) are H0: β1 = 2 and H0: β2 = -2, respectively. In all three cases, the alternative hypothesis is two-tailed.
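
The comparison between the two models can be checked informally from the reported figures. The sketch below takes the "Both leagues" rows of Tables 1 and 2 and computes the gain in $R^2$ from Model (1) to Model (2), together with the sum b1 + b2, which would be zero if runs scored and runs allowed had exactly equal and opposite effects. This is only a descriptive check: a formal test of β1 = -β2 would need the covariance between the two estimates, which is not reported.

```python
# (period, R^2 Model 1, R^2 Model 2, b1 on ln(RS), b2 on ln(RA)),
# "Both leagues" rows of Tables 1 and 2.
rows = [
    ("1950-1959", 0.906, 0.906, 1.8162, -1.7816),
    ("1960-1969", 0.905, 0.906, 1.8624, -1.9499),
    ("1970-1979", 0.873, 0.873, 1.7060, -1.7060),
    ("1980-1989", 0.859, 0.859, 1.9477, -1.9283),
    ("1990-1999", 0.869, 0.869, 1.8887, -1.8753),
    ("2000-2007", 0.883, 0.884, 1.8998, -1.9829),
]

# Round to the precision of the published tables to avoid float noise.
r2_gains = [round(r2_m2 - r2_m1, 3) for _, r2_m1, r2_m2, _, _ in rows]
coef_sums = [round(b1 + b2, 4) for _, _, _, b1, b2 in rows]

for (period, *_), gain, total in zip(rows, r2_gains, coef_sums):
    print(f"{period}: R^2 gain = {gain:.3f}, b1 + b2 = {total:+.4f}")

max_gain = max(r2_gains)
max_abs_sum = max(abs(s) for s in coef_sums)
print(f"largest R^2 gain: {max_gain:.3f}; largest |b1 + b2|: {max_abs_sum:.4f}")
```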

Figure 1. Scatterplots of ln(W/L) against ln(RS/RA), 1950-1979, in two panels: American League and National League. [Figure not reproduced; vertical axis: ln(W/L), horizontal axis: ln(RS/RA).]

Figure 2. Scatterplots of ln(W/L) against ln(RS/RA), 1980-2007, in two panels: American League and National League. [Figure not reproduced; vertical axis: ln(W/L), horizontal axis: ln(RS/RA).]

References

1. P. Williams, When the Giants Were Giants, Chapel Hill, NC: Algonquin Books, 1994.
2. B. James, The Bill James Baseball Abstract 1983, New York: Ballantine Books, 1983.
3. Total Baseball: The Ultimate Baseball Encyclopedia (edited by H. Thorn, P. Birnbaum, and B. Deane), Wilmington, DE: Sports Media Publishing, 2004.

Footnotes

1. See, for example, the reference to "The Pythagorean Formula" in [2, p. 10].
2. The $b_1$ estimate can also be interpreted as an elasticity of (Wins/Losses) with respect to (RS/RA), where (in general) the elasticity of Y with respect to X is defined as $\dfrac{X}{Y}\dfrac{dY}{dX}$. In other words, a 1 percent increase in (RS/RA) will lead to a $b_1$ percent increase in (Wins/Losses). Moreover, since Wins + Losses is equal to a constant (162 games since the year 1962 and 154 games in years before 1962), it should also be noted that a given percentage change in Wins is equal to the percentage change in the winning percentage, [Wins/(Wins + Losses)].