More on Defensive Regression (or Runs) Analysis

This appendix has three primary objectives: first, to disclose aspects of DRA not disclosed in chapter two; second, to address aspects of the model that raise issues related less to baseball per se than to statistical modeling in general; and third, to drive home the fundamental point that DRA is not an answer, but a method. Included in this appendix are certain alternative models I tried, and suggestions for further improvements, which should provide some sense of the range of alternative approaches that are possible.

DRA POST-1951

Overview

There are essentially two DRA models: post-1951 and pre-1952. The post-1951 model uses a subset of Retrosheet play-by-play data currently available for seasons after 1951, and was almost completely described in chapter two. The pre-1952 model must make do with considerably less data, which renders it more primitive for infielders and unavoidably more complicated for outfielders. When we first began explaining DRA, we took a bottom-up approach, starting from the shortstop position and gradually building up until we had a team model. Here we'll take a top-down approach, revealing the entire post-1951 team model all at once, and then discussing its components. Likewise, we'll start with a top-down discussion of the pre-1952 model. The following page presents the entire post-1951 model on one page, with a glossary of defined terms on the facing page.

1952-2009 DRA Model

Team defensive runs saved above or below the league rate, given innings pitched, DR.ip, is estimated as the sum of pitching, catching, infield, and outfield defensive runs:

    Pitching = .27*SO.bfp - .34*BB.bfp - 1.49*HR.bh + .42*A1.bip + .44*IFO.bip - .56*WP.ip
    Catching = .59*CS.sba + .59*GO2.bip
    Infield  = .52*rgo3 + .53*ra4 + .45*ra5 + .44*ra6
    Outfield = .53*rpo7 + .46*rpo8 + .44*rpo9 + .61*A7.ip + .61*A8.ip + .61*A9.ip

All plain variables are team seasonal totals. See definitions on facing page.

All variables with a dot, for example, A6.bip, are calculated in the same way:

    A6.bip = A6 - [league A6 * (BIP / league BIP)]

A6.bip equals total A6 recorded by the team above (if negative, below) the league average rate that year, given total team BIP opportunities. The opportunities variable following the dot is always in lower case letters.

All variables beginning with an "r" are residual team plays that year; that is, estimated net plays taking into account available predictors, using regression analysis.

    rgo3 = GO3.bip + .09*RBIP.bip
    ra4  = A4.bip + .08*RBIP.bip + .15*RFO.rbip + .32*LFO.lbip + .18*HR.bh + .20*WP.ip + .19*SH.bip
    ra6  = A6.bip - .06*RBIP.bip + .29*RFO.rbip + .15*LFO.lbip + .12*HR.bh + .56*WP.ip + .43*SH.bip
    ra5  = A5.bip - .10*RBIP.bip + .21*RFO.rbip + .10*LFO.lbip + .15*A1.bip + .13*rgo3 + .13*IBB.pa
    rpo7 = PO7.bip + .03*RBIP.bip + .21*RGO.rbip + .10*LGO.lbip
    rpo8 = PO8.bip - .01*RBIP.bip + .27*RGO.rbip + .24*LGO.lbip + .07*IFO.bip + .20*SH.bip
    rpo9 = PO9.bip - .03*RBIP.bip + .22*RGO.rbip + .22*LGO.lbip + .12*IFO.bip

Example of allocation of team fielding runs to individual (lower-case "i") fielders:

    ia6 runs = .44*ra6*(iip / IP) + .44*[ia6 - A6*(iip / IP)]
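The centering rule and the pitching line of the model can be sketched in a few lines of Python. This is only an illustration: the function names, the argument names, and every input number below are hypothetical, and the centering follows the reading that a team's expected plays equal the league total scaled by the team's share of league opportunities.

```python
def center(team_total, team_opps, league_total, league_opps):
    """Net plays above (+) or below (-) the league rate, given team opportunities."""
    return team_total - league_total * team_opps / league_opps


def pitching_runs(so_bfp, bb_bfp, hr_bh, a1_bip, ifo_bip, wp_ip):
    """Team pitching runs from centered inputs, using the weights in the model above."""
    return (0.27 * so_bfp - 0.34 * bb_bfp - 1.49 * hr_bh
            + 0.42 * a1_bip + 0.44 * ifo_bip - 0.56 * wp_ip)


# Hypothetical totals: a team records 500 A6 on 4,400 BIP in a league
# that records 6,000 A6 on 55,000 BIP.
a6_bip = center(team_total=500, team_opps=4400, league_total=6000, league_opps=55000)
print(a6_bip)

# Hypothetical centered inputs for one team-season.
print(pitching_runs(so_bfp=100, bb_bfp=-50, hr_bh=-10, a1_bip=0, ifo_bip=0, wp_ip=0))
```

A positive result means plays (or runs saved) above the league rate; a negative result means below it.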

Definitions of Team-Level Variables for DRA Model (1952-2009)

    Abbrev.  Definition                                        Formula or Source
    1-9      Pitcher ... Right Fielder
    A        Assists (total, if not followed by a number)
    BB       Unintentional BB + HBP                            UBB + HBP
    BFP      Batters Faced by Pitchers                         PA - IBB
    BH       Balls Hit                                         BFP - SO - BB
    BIP      Balls In Play                                     BH - HR
    CS       Caught Stealing
    FO       Fly Outs (total)                                  RFO + LFO
    GO       Ground Outs (total)                               RGO + LGO
    GO2      GO at catcher                                     A2 - CS
    GO3      GO at first base                                  A3 + UGO3
    HBP      Hit By Pitch
    HR       Home Runs
    IA       Infielder-only Assists                            sum(A1, A2, ..., A6)
    IBB      Intentional Bases on Balls                        BB (traditional) - UBB
    IFO      Infielder-only FO                                 FO - OPO
    IP       Innings Pitched (or Played)
    IPO      Infielder-only PO                                 sum(PO1, PO2, ..., PO6)
    LFO      Left-handed batter FO                             play-by-play data
    LGO      Left-handed batter GO                             play-by-play data
    OA       Outfielder-only A                                 sum(A7, A8, A9)
    OPO      Outfielder-only PO                                sum(PO7, PO8, PO9)
    PA       Plate Appearances
    PB       Passed Balls
    PO       Putouts (total, if not followed by a number)
    RBIP     Right-handed batter BIP                           play-by-play data
    RFO      Right-handed batter FO                            play-by-play data
    RGO      Right-handed batter GO                            play-by-play data
    SBA      Stolen Base (SB) Attempts                         SB + CS
    SH       Sacrifice Hits
    SO       Strikeouts
    UBB      Unintentional BB                                  BB (traditional) - IBB
    UGO3     Unassisted GO3                                    avg(UGO3e1, UGO3e2)
    UGO3e1   UGO3 estimate #1                                  IPO - A - IFO
    UGO3e2   UGO3 estimate #2                                  GO - IA - CS - GIDP
    WP       Wild Pitches (includes PB)                        WP (traditional) + PB
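Several of the definitions above chain together (BFP feeds BH, which feeds BIP), so it may help to see them computed in order. The sketch below is a minimal illustration; the dictionary keys (for example "BB_trad" and "WP_trad" for the traditional totals) and all of the season totals are made up for the example.

```python
def derive(t):
    """Derive DRA team-level variables from traditional season totals."""
    ubb = t["BB_trad"] - t["IBB"]   # UBB = BB (traditional) - IBB
    bb = ubb + t["HBP"]             # BB  = UBB + HBP
    bfp = t["PA"] - t["IBB"]        # BFP = PA - IBB
    bh = bfp - t["SO"] - bb         # BH  = BFP - SO - BB
    bip = bh - t["HR"]              # BIP = BH - HR
    return {
        "UBB": ubb,
        "BB": bb,
        "BFP": bfp,
        "BH": bh,
        "BIP": bip,
        "SBA": t["SB"] + t["CS"],       # SBA = SB + CS
        "WP": t["WP_trad"] + t["PB"],   # WP  = WP (traditional) + PB
        "GO2": t["A2"] - t["CS"],       # GO2 = A2 - CS
    }


# Hypothetical team-season totals.
totals = {"PA": 6200, "IBB": 40, "BB_trad": 540, "HBP": 50, "SO": 1000,
          "HR": 150, "SB": 90, "CS": 40, "WP_trad": 45, "PB": 10, "A2": 70}
derived = derive(totals)
print(derived)
```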

The previous two pages are a bit much to take in all at once. But I do not believe that any other comprehensive system for team and individual defense remotely as accurate as DRA can be summarized as concisely. Before addressing the new points, let's quickly recap in a few pages the basic approach under DRA as described in chapter two. You might find it helpful to flip back to the preceding two pages as you read both the recap and the discussion of new issues.

DRA is essentially a forced-zero-intercept, "two-stage" multivariable least-squares regression analysis model. I'm using the "two-stage" terminology informally; as we shall see, the DRA model is not an instrumental variables model, otherwise known as a two-stage least-squares model. The forced zero intercept merely means that we center the ultimate outcome being predicted (team runs allowed), each "play made" outcome (each pitching and fielding play that is made) used to predict expected team runs allowed, and each variable used to predict expected pitching and fielding plays, so that all outcomes and their respective predictors are net numbers, above or below the league-average rate. Furthermore, each outcome or predictor is centered by reference to its appropriate "denominator" of opportunities (the denominators are not literally used as denominators in the arithmetical sense; hence the quotation marks).

The first stage of regression analysis involves regressing centered fielding variables onto centered variables not under the control of the fielding position being evaluated (and ideally not influenced by the quality of other fielders) that tend to be associated with more or fewer fielding plays at that position. The residual left over from each first-stage regression at each position is treated as an estimate of the skill plays made at that position above or below expectation.
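To make the idea of a first-stage residual concrete, here is a deliberately simplified sketch: a zero-intercept regression of one centered fielding variable onto a single centered predictor. The actual first-stage regressions are multivariable, so this one-predictor version, the function names, and the four data points are all assumptions made purely for illustration.

```python
def slope_through_origin(x, y):
    """Least-squares slope for y = b*x with the intercept forced to zero."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def first_stage_residuals(x, y):
    """Return the slope and the residuals y - b*x (the estimated 'skill' plays)."""
    b = slope_through_origin(x, y)
    return b, [yi - b * xi for xi, yi in zip(x, y)]


# Hypothetical centered data for four team-seasons.
rbip_bip = [100, -50, -100, 50]   # predictor, e.g. RBIP.bip
a4_bip = [-10, 6, 6, -2]          # outcome, e.g. A4.bip

b, resid = first_stage_residuals(rbip_bip, a4_bip)
print(b)       # approximately -0.08
print(resid)   # approximately [-2, 2, -2, 2]
```

With a slope of about -.08, the residual is A4.bip + .08*RBIP.bip, which has the same form as the RBIP.bip term in the ra4 formula.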
The second-stage regression involves regressing net team runs allowed onto net pitching and (first-stage-regression-adjusted) fielding plays in order to reveal the number of runs associated with each net pitching and (first-stage-regression-adjusted) fielding outcome. To rate a team at a position, you simply apply the run weight determined in the second-stage regression to the net plays (which, again, are negative half the time) to determine defensive runs at that position. Finally, you allocate team defensive runs at that position to each player in two steps: first pro rata, based on his share of the team's innings played at that position; then by his net plays compared to the team rate, given that share of team innings. Each net play is credited with the same run weight used for the team rating at that position.
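The two-step allocation to individual fielders can be sketched as follows, using the shortstop run weight of .44. The function and argument names are hypothetical, as are the team and player totals.

```python
def individual_runs(run_weight, team_residual, team_plays,
                    ind_plays, ind_innings, team_innings):
    """Allocate team defensive runs at a position to one fielder."""
    share = ind_innings / team_innings
    pro_rata = run_weight * team_residual * share            # share of the team rating
    vs_team = run_weight * (ind_plays - team_plays * share)  # net plays vs. team rate
    return pro_rata + vs_team


# Hypothetical team: ra6 = 20 residual plays, 500 A6 in 1,440 innings.
starter = individual_runs(0.44, 20, 500, ind_plays=400, ind_innings=1080, team_innings=1440)
backup = individual_runs(0.44, 20, 500, ind_plays=100, ind_innings=360, team_innings=1440)
print(starter, backup, starter + backup)
```

When the individual plays and innings add up to the team totals, the individual runs add up to the team runs at the position (here .44 * 20 = 8.8), because the second terms cancel across players.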

Centering The Variables By Their Respective Denominators

We center all the team variables by their respective denominators of opportunities. Centering in this way is the first step towards making each variable less correlated with the others, so that its "independent" net impact in runs may be better estimated. The little quotation marks are there because we will not achieve true independence in a mathematically precise sense.

The best denominator of opportunities for the ultimate outcome we're trying to model, actual total team runs allowed per season, is innings pitched, so we calculate team runs allowed above or below the league-average rate given the team's innings played, that is, net runs allowed given innings played, or RA.ip. In some sense this is just denominating net runs allowed by total outs, as innings are defined by outs. This is correct, because the ultimate limit on the number of runs a team can score in an inning is defined by outs.

The best denominator of opportunities for pitchers to record strikeouts (SO) or unintentional walks (including batters hit by pitch, BB) is the number of batters they face, or batters facing pitcher (BFP); hence net strikeouts given batters facing the team's pitchers (SO.bfp) and net unintentional walks and batters hit by pitch (BB.bfp) (see footnote 1). The best denominator for home runs allowed (HR) is any BFP not ending in an SO or BB, or balls hit (BH); hence HR.bh, which tracks home runs allowed, given that the batter has made contact.

The number of balls in play (BH minus HR, or BIP) is the primary denominator of opportunities for plays involving getting the batter out on a batted ball not hit out of the park. By initially denominating batted ball outcomes by BIP, we begin the process of measuring net plays independent of the pitching staff's SO.bfp, BB.bfp, and HR.bh.
Infield fly outs, that is, fly balls caught by infielders (IFO), are almost always weakly hit balls that could be caught by two or more fielders. Since they are nearly automatic outs, analogous to SO, we credit the pitchers with IFO relative to the league, given total BIP, resulting in the IFO.bip variable appearing among pitching runs. Likewise, we credit the pitcher if he records an assist (A1), which will almost always be on a ground ball he has fielded (A1.bip). BIP is also the best denominator for ground out fielding plays at catcher (GO2) and first (GO3), assists at second, third, and short, and putouts at each outfield position.

The simplest denominator for runners caught stealing (CS) is the number of stolen base attempts (SBA), hence CS.sba. Finally, wild pitches (defined here to include passed balls, WP) and outfielder assists (A7, A8, and A9) are denominated by innings played (IP), not because that is optimal, but because it is simple. An alternative approach is addressed further below.

For the pitching, catching, and outfielder assists variables, centering is the only adjustment that has to be made. (The coefficients for A7.ip, A8.ip, and A9.ip are the same because I combined all three into one variable, A789.ip, when running the second-stage regression.) Furthermore, with the exception of IFO.bip and GO2.bip, we have the exact counts of denominators per pitcher (their BFP, BH, and BIP) and catcher (their SBA), so the individual formulas are the same as the team formula, and the sum of individual results equals the team results.

There is one variable that is truly a combination of a pitching and catching variable: WP.ip, and not just because it includes passed balls. We credit or debit the pitchers with total WP.ip, because by far the largest source of variance in both wild pitches and passed balls is knuckleball pitching and sheer pitcher wildness. However, to give catchers some credit for being better or worse at preventing wild pitches and passed balls, we credit each catcher with the number of his net passed balls, given innings played, relative to his team (which would control somewhat for the effect of pitchers), and multiplied by three, because there have been roughly two wild pitches per passed ball throughout major league history. Thus, we credit the catcher with effectively two wild pitches saved and one passed ball saved for every passed ball he records in a season above or below his team's rate.

Footnote 1: In this version of DRA, I tried treating intentional walks separately; for reasons discussed shortly below it didn't make any difference, though it should have. The BFP denominator for SO.bfp and BB.bfp excludes plate appearances ending in an intentional walk.
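The catcher passed-ball credit described above can be sketched as follows. The function name, the arguments, and the example totals are hypothetical. The result is expressed in wild-pitch equivalents relative to the catcher's own team; converting that figure to runs (for instance with the WP.ip run weight) is left out because the text does not spell out that step.

```python
def catcher_wp_equivalents(catcher_pb, catcher_innings, team_pb, team_innings,
                           multiplier=3.0):
    """Net passed balls relative to the team rate, scaled to wild-pitch equivalents.

    Positive values mean more passed balls (worse) than the team rate would
    predict for the catcher's innings; negative values mean fewer (better).
    """
    expected_pb = team_pb * catcher_innings / team_innings
    net_pb = catcher_pb - expected_pb
    return multiplier * net_pb


# Hypothetical team: 12 passed balls in 1,440 innings, split between two catchers.
starter = catcher_wp_equivalents(catcher_pb=5, catcher_innings=900,
                                 team_pb=12, team_innings=1440)
backup = catcher_wp_equivalents(catcher_pb=7, catcher_innings=540,
                                team_pb=12, team_innings=1440)
print(starter, backup)
```

Because each catcher is measured against his own team's rate, the figures for a team's catchers sum to zero when their passed balls and innings add up to the team totals.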
This passed-ball credit is an admittedly crude measure of the impact catchers have on passed balls and wild pitches, but it is probably reasonable, because catchers miss so much playing time that the set of their catching teammates, at least over the course of a career, probably approaches league-average performance. And, as emphasized in our catcher chapter, all of the traditional methods for evaluating catchers are very suspect, because the biggest impact catchers may have is on pitcher effectiveness, more specifically, SO.bfp and BB.bfp, rather than on base runner defense.

Adjusting Net Fielding Plays Made Using Proxy BIP Distribution Variables

Second, we refine the estimate of true skill plays made on BIP by backing out, using regression analysis, the estimated effects pitchers and batters have on the distribution of BIP throughout the field. The key items of information gleaned from Retrosheet used to make these adjustments are the number of total BIP hit by opponent right-handed batters (Right-handed opponent batter BIP, or RBIP), the number of fly outs (FO) and ground outs (GO) recorded against opponent right-handed batters (Right-handed opponent batter FO and GO, or RFO and RGO), and the number of FO and GO recorded against opponent left-handed batters (Left-handed opponent batter FO and GO, or LFO and LGO).

The denominator for RBIP is total BIP, yielding RBIP.bip (you have to have a BIP to have an RBIP), which is negative when a larger-than-average share of the team's BIP is hit by left-handed opponent batters. The denominator for RFO and RGO is RBIP (you have to have an RBIP to have either an RFO or an RGO), yielding RFO.rbip and RGO.rbip. The denominator for LFO and LGO is total BIP hit by opponent left-handed batters, which is merely BIP minus RBIP, or LBIP, yielding LFO.lbip and LGO.lbip. Notice that these variables have all been constructed so that they are at least arithmetically independent of each other.

These five key variables (RBIP.bip, RFO.rbip, RGO.rbip, LFO.lbip, and LGO.lbip) are the Proxy BIP Distribution Variables. They are good, if imperfect, proxies for whatever perfect information could theoretically be obtained regarding the actual distribution of expected BIP fielding plays. As we showed in our Bill Mazeroski, Buddy Bell, and Mickey Mantle examples in chapter two, regression analysis reveals that they have the kind of statistical relationships with net second base assists (A4) given total BIP (A4.bip), net third base assists (A5) given total BIP (A5.bip), and net center field putouts (PO8) given total BIP (PO8.bip) that one would expect.
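The construction of the five Proxy BIP Distribution Variables can be sketched as below. The helper names and the team and league totals are hypothetical, and the centering follows the same assumed form used for the other dotted variables: the team total minus the league total scaled by the team's share of league opportunities.

```python
def center(team_total, team_opps, league_total, league_opps):
    """Net total above (+) or below (-) the league rate, given team opportunities."""
    return team_total - league_total * team_opps / league_opps


def proxy_bip_variables(team, league):
    """Compute the five Proxy BIP Distribution Variables for one team-season."""
    t_lbip = team["BIP"] - team["RBIP"]      # LBIP = BIP - RBIP
    l_lbip = league["BIP"] - league["RBIP"]
    return {
        "RBIP.bip": center(team["RBIP"], team["BIP"], league["RBIP"], league["BIP"]),
        "RFO.rbip": center(team["RFO"], team["RBIP"], league["RFO"], league["RBIP"]),
        "RGO.rbip": center(team["RGO"], team["RBIP"], league["RGO"], league["RBIP"]),
        "LFO.lbip": center(team["LFO"], t_lbip, league["LFO"], l_lbip),
        "LGO.lbip": center(team["LGO"], t_lbip, league["LGO"], l_lbip),
    }


# Hypothetical totals.
team = {"BIP": 4400, "RBIP": 2640, "RFO": 800, "RGO": 900, "LFO": 500, "LGO": 600}
league = {"BIP": 44000, "RBIP": 24200, "RFO": 7260, "RGO": 8470, "LFO": 5940, "LGO": 6930}
proxies = proxy_bip_variables(team, league)
print(proxies)
```

In this made-up example the team faces more right-handed BIP than the league rate (RBIP.bip is positive) and records slightly more fly outs than expected against right-handed batters.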
When RBIP.bip is positive (that is, when there is an above-average number of BIP hit by opponent right-handed batters, given total BIP), there are more ground outs recorded on the left side of the infield (third and short) and more fly outs recorded on the right side of the outfield (center and right). When RBIP.bip is negative (in other words, when there is an above-average number of BIP hit by opponent left-handed batters, given total BIP), there are more ground outs recorded on the right side of the infield (first and second) and more fly outs recorded on the left side of the outfield (left field). In both cases, that's because hitters tend to pull the ball when they ground out and tend to be behind the ball when they fly out. (Fly balls and line drives to the outfield that are pulled tend to be hit harder and drop in as clean hits.)

The coefficients for RBIP.bip are much bigger (positive or negative) in the infield than in the outfield. That's because batter-handedness has a much greater effect on the direction of ground outs than fly outs. You can see this by watching how infields and outfields shift. For the several left-handed batters these days for whom a Williams-type shift is put on, especially Ryan Howard, you'll frequently see the third baseman playing between third and second, and the shortstop playing behind second, but the outfielders playing practically straightaway.

RFO.rbip and LFO.lbip are used to adjust ground out plays in the infield for fly ball and ground ball pitching. By using relative FO to estimate relative opportunities to record infield assists, we avoid using the assists made by the fielder being evaluated to estimate his own relative opportunities to make plays. By splitting fly outs by opponent batter-handedness, we capture to a significant extent cases in which (i) a team's left-handed pitchers (who would face proportionately more right-handed batters) tend to induce RGO or RFO and (ii) a team's right-handed pitchers (who would face proportionately more left-handed batters) tend to induce LGO or LFO. Right- and left-handed opponent batters also have their own impact on whether BIP are hit on the ground or in the air, which is also reflected in RFO.rbip and LFO.lbip. However, RFO.rbip and LFO.lbip are controlled more by a team's pitchers, who would tend to have much more extreme ground ball or fly ball tendencies than the league's batters as a whole (excluding, of course, the team's own hitters), though this is less true for more recent seasons, which feature less-balanced schedules.

If RFO.rbip is positive, that suggests there will be fewer GO recorded against those right-handed batters, and particularly fewer GO on the left side of the infield. (If RFO.rbip is negative, there will be more GO, particularly on the left side.) If LFO.lbip is positive, that suggests there will be fewer GO, and particularly fewer GO on the right side of the infield. (If LFO.lbip is negative, there will be more GO, particularly on the right side.) The coefficients at second, third, and shortstop in the chart at the beginning of this appendix all reflect that expectation. We'll address first base further below.

RGO.rbip and LGO.lbip are used to adjust fly out plays in the outfield for fly ball and ground ball pitching by left- and right-handed pitchers, respectively.
By using relative GO to estimate relative outfield putout opportunities, we avoid having the actual putouts recorded by each outfielder being used to estimate how many putouts he should have made. If RGO.rbip is positive, that suggests there will be fewer FO recorded against those right-handed batters, and particularly fewer FO on the right side of the outfield (and vice versa). If LGO.lbip is positive, that suggests there will be fewer FO, and particularly fewer fly outs on the left side of the outfield (and again, vice versa).

Notice again that batter-handedness has less of an impact in the outfield than in the infield, as shown by the fact that the coefficients for RGO.rbip and LGO.lbip are nearly equal in the outfield, whereas the coefficients for RFO.rbip and LFO.lbip are significantly different at each infield position. The obvious case, mentioned in the Mantle example, is center field, which is, well, in the center of the field, where the impact of left- and right-handed batters (and pitchers) is approximately equal. But in right field, the coefficients for RGO.rbip and LGO.lbip are also nearly the same. Only in left is there a meaningful difference between the RGO.rbip and LGO.lbip coefficients, but even so, the difference is not as great as the differences between the coefficients for RFO.rbip and LFO.lbip at second, third, and short. The bottom line seems to be that opponent batter handedness, and the interaction between opponent batter handedness and pitcher handedness, has a much, much greater impact on the direction of ground balls than fly balls.

Adjusting Net Plays For The Impact Of Base Runners

The Proxy BIP Distribution Variables attempt to account for where batted balls are hit, that is, whether they are hit on the ground or in the air, and on the left or right side of the field. But there are other factors that were not discussed in chapter two that impact the likelihood that fielders at each position will make plays.

One obvious factor for infielders is the presence of runners at first base. This increases double play assist opportunities for middle infielders but also forces the first baseman to play close to the bag, which reduces his chance of fielding ground balls in the hole between first and second.

Taking this into account using regression analysis is a little tricky. If you create a variable for estimated runners at first, this would include not only walks but also hits allowed. But hits allowed are partly a function of net plays made at first, second, and short. Any statistical association revealed by regression analysis between, say, shortstop assists and runners on first could reflect either the shortstop's impact on the number of runners at first (by allowing or preventing hits) or the impact of the runners at first on shortstop assists (by increasing or decreasing double play assist opportunities).

There are a few candidates for variables that get around this circularity problem, at least for middle infielder double play assists, because they are not influenced by infielder fielding: SO.bfp, BB.bfp, HR.bh, WP.ip, and perhaps SH.bip (net sacrifice hits given BIP).
The more SO.bfp, the fewer hits and runners at first. The more BB.bfp, the more runners on first. HR clear the base paths, which obviously prevents double plays. WP and SH allow runners on first to reach second, thus preventing a double play. At both shortstop and second base these variables have, at least directionally, the impact one would expect, though the particular coefficients are not very stable from sample to sample, and since WP.ip and SH.bip have relatively little variation from team to team, they are probably not practically significant and could have been dropped from the model. In addition, SH.bip might also belong more with the category of Proxy BIP Distribution Variables, because sacrifice hits are by definition ground balls that can only be fielded in a particular area of the infield (say, approximately anywhere within sixty feet of home plate).
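As an illustration of how the Proxy BIP Distribution Variables and the base-runner variables enter the middle-infield formulas, the sketch below applies the ra4 and ra6 coefficients from the model summary to one team-season. The function name, the dictionary layout, and all centered input values are hypothetical.

```python
# Coefficients taken from the ra4 and ra6 formulas in the model summary.
RA4_COEFS = {"RBIP.bip": 0.08, "RFO.rbip": 0.15, "LFO.lbip": 0.32,
             "HR.bh": 0.18, "WP.ip": 0.20, "SH.bip": 0.19}
RA6_COEFS = {"RBIP.bip": -0.06, "RFO.rbip": 0.29, "LFO.lbip": 0.15,
             "HR.bh": 0.12, "WP.ip": 0.56, "SH.bip": 0.43}


def residual_plays(net_plays, adjustments, coefs):
    """Net plays adjusted for the listed centered context variables."""
    return net_plays + sum(coefs[name] * adjustments[name] for name in coefs)


# Hypothetical centered context variables for one team-season.
context = {"RBIP.bip": 220, "RFO.rbip": 8, "LFO.lbip": -28,
           "HR.bh": 10, "WP.ip": -5, "SH.bip": 0}

ra4 = residual_plays(-15, context, RA4_COEFS)   # hypothetical A4.bip = -15
ra6 = residual_plays(12, context, RA6_COEFS)    # hypothetical A6.bip = +12
print(ra4, 0.53 * ra4)   # residual plays and runs at second base
print(ra6, 0.44 * ra6)   # residual plays and runs at shortstop
```

In this made-up case the shortstop's raw total is above the league rate, but most of that surplus is attributed to facing many right-handed BIP, so his residual comes out negative.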

Net intentional walks (IBB) given total plate appearances (PA), or IBB.pa (note that PA equals IBB plus BFP in the post-1951 model), have a negative impact on third base plays, probably because they reduce sacrifice bunts that should be added back. In any event, that variable has little practical impact and could have been dropped from the model.

Adjusting Net Plays For The Impact Of Ball Hogging

A fielder might make more plays not by preventing more BIP from going through for hits, but by taking more easy chances that could have been fielded by other fielders and were more or less guaranteed outs anyway. By far the most important example of this is FO fieldable by infielders. Ninety to ninety-five percent of fly balls and pop-ups caught by infielders can usually be taken by at least two, and sometimes three, different fielders (two infielders and an outfielder). Centerfielders who have played very shallow, especially Andruw Jones, have tended to hog some of these chances. Regressions of PO8.bip onto IFO.bip throughout history consistently show that the more IFO.bip, the fewer PO8.bip, and vice versa. Therefore, if IFO.bip has been reduced by centerfielder ball hogging, a portion of those negative hogged plays is added to expected PO8.bip, thus reducing the centerfielder rating, and vice versa.

At times there is an impact for corner outfielders as well. I was somewhat surprised that IFO.bip was so important in right field. Perhaps the fact that most pop-ups are hit to the right side of the field (as most batters are right-handed, and most pop-ups are hit to the opposite side of the field, for reasons we've already discussed) explains this result. Right fielders may take more discretionary pop flies from first basemen (some of whom are the slowest players in baseball) than left fielders take from third basemen.

A batted ball category similar to infield fly outs is SH.
The three fielders who field most SH are the pitcher, first baseman, and third baseman, along with, to a very small extent, the catcher. There is probably some bunt hogging, depending on the fielding quality of pitchers. A great fielding pitcher, such as Greg Maddux, probably fielded some bunts that might otherwise have been fielded by Chipper Jones or Fred McGriff. In contrast, someone like Randy Johnson probably relied more on others to handle sacrifice bunts. The third baseman formula above reflects this factor by backing out a portion of A1.bip when calculating ra5. (So, if the pitcher is taking bunt opportunities from the third baseman, estimated hogged bunts are added back to the third baseman, and vice versa.)

Third basemen and first basemen don't fight over bunt opportunities; rather, bunt opportunities are gifts from the batter. Presumably, hitters playing against Brooks Robinson aimed their bunts toward Boog Powell, and hitters playing against Keith Hernandez aimed their bunts toward Howard Johnson. Regression analysis indicates that the more rgo3 (which is already adjusted for batter-handedness), the fewer ra5, and vice versa.

Another similarity between SH.bip and IFO.bip is that both are essentially guaranteed outs. All that is at stake with a sacrifice hit attempt is whether the lead runner advances, and the value of that is only about .20 runs. As a practical matter, no fielder should be getting any credit for fielding a sacrifice bunt and getting the runner out at first. Given the total number of SH attempts fielded, the fielder should be given credit for the net number of lead runners taken out relative to the league rate, given those total opportunities, multiplied by .20 runs. I doubt any contemporary third or first baseman would earn more than a couple of runs a season for any such skill. Given total SH attempts fielded, the fielder should be charged for the net number of times he went for the out at second and failed to get either the lead runner or the batter out, multiplied by the free out lost and the hit given up, or about 0.75 runs.

Any new DRA model I develop will take more complete advantage of play-by-play data, will exclude SH from BIP altogether, and will subtract SH assists from each fielder's total. This will also make it unnecessary to back out SH.bip from positions that never have the opportunity to field SH, such as middle infielders and outfielders. Therefore, any future DRA model would not have the SH.bip factor in the rpo8 formula (it wasn't statistically significant in left or right), nor in the ra4 or ra6 formulas (except if significant in limiting double play opportunities).

First Base

About ninety-eight to ninety-nine percent of ground outs result in an assist for the fielder who fields the ball, with one exception: first base.
First basemen record assists for only about half of the ground balls they convert into outs; the rest of the time they just run to the bag to record the putout unassisted. Traditional statistics don't differentiate between ground ball putouts and fly ball putouts, but Retrosheet play-by-play data after 1951 does, so it is possible to count the exact number of ground balls a first baseman fields. Unfortunately, I had neither individual nor team totals of unassisted ground outs at first base (UGO3) when I first developed the post-1951 model. However, a reasonably good estimate of the team total can be obtained indirectly, as shown in the charts at the beginning of this appendix:

    UGO3     Unassisted GO3               avg(UGO3e1, UGO3e2)
    UGO3e1   Unassisted GO3 estimate #1   IPO - A - IFO
    UGO3e2   Unassisted GO3 estimate #2   GO - IA - CS - GIDP

In English, the three rows above say that estimated UGO3 is simply the average of two estimates.

The first estimate (UGO3e1) is the estimated number of infield putouts that were not due to catching fly balls: total infield putouts, minus total team assists (including outfield assists, which always result in an infielder putout), minus FO recorded by infielders. I had the exact count for the latter variable, because my data provider gave me the Retrosheet count for total fly outs; all you need to do is subtract outfield putouts from that total to arrive at infielder fly outs. This estimate will overestimate UGO3 by the number of unassisted ground ball putouts at infield positions other than first, which are, in total, only about one-third the total at first.

The second estimate is the estimated number of GO that were not in the form of infield assists. I had a total Retrosheet count of GO (at all infield positions); infield assists from fielding ground balls are estimated as total infield assists less CS and double play assists. This estimate underestimates total infield unassisted ground outs by the number of infield assists on relays.

The noise in the above estimates is not inconsiderable, but probably not biased either. We are not ultimately concerned with getting the exact total of UGO3, but net UGO3, given BIP, or UGO3.bip. Unassisted ground ball putouts at second and third, as well as infielder relay assists, are rare and random events that should merely create random noise, whereas first base unassisted ground ball putouts are routine and reflect to a large degree the systematic preference of the first baseman to run to the bag or to toss to the pitcher covering the bag.

The sum of first base assists (A3) and UGO3 is estimated GO at first base (GO3). Here is the formula for residual, or regression-adjusted, GO3:

    rgo3 = GO3.bip + .09*RBIP.bip
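The chain from the two UGO3 estimates to residual first-base ground outs can be sketched as follows. The two estimates are taken as given inputs here rather than rebuilt from their components, and the function names and all totals are hypothetical; the .09 and .52 weights come from the model summary at the beginning of the appendix.

```python
def estimated_go3(a3, ugo3_e1, ugo3_e2):
    """GO3 = A3 + UGO3, where UGO3 is the average of the two estimates."""
    ugo3 = (ugo3_e1 + ugo3_e2) / 2
    return a3 + ugo3


def rgo3(go3, team_bip, league_go3, league_bip, rbip_bip):
    """Residual GO3: center GO3 on BIP, then adjust for batter-handedness."""
    go3_bip = go3 - league_go3 * team_bip / league_bip
    return go3_bip + 0.09 * rbip_bip


# Hypothetical team-season and league totals.
go3 = estimated_go3(a3=120, ugo3_e1=310, ugo3_e2=290)
resid = rgo3(go3, team_bip=4400, league_go3=4000, league_bip=44000, rbip_bip=220)
print(go3)            # estimated ground outs at first base
print(resid)          # residual plays at first base
print(0.52 * resid)   # team defensive runs at first base
```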
A regression for rgo3 would also include +.06*RFO.rbip and +.13*LFO.lbip, but we need to sacrifice some accuracy at first base by deleting these variables to ensure that the global regression of RA.ip onto our fully-adjusted pitching, fielding, and base-running variables generates correct run weights for infield and outfield plays. Here's why.

The Proxy BIP Distribution Variables have a couple of important limitations. One is that in order to obtain in the second-stage regression run weights in the infield and outfield that make sense (are approximately equal or slightly higher in the outfield), it is usually desirable that the sums of the RFO.rbip and LFO.lbip regression weights for adjusting infielder positions be approximately equal to the sums of the RGO.rbip and LGO.lbip regression weights, respectively, for adjusting outfielder positions. In other words, we do not want each infielder assist to be discounting each outfield putout more than each outfielder putout is discounting each infielder assist. The sum of RFO.rbip coefficients is .71 with an adjustment included at first base (rgo3) (.65 without); the sum of RGO.rbip coefficients is .70. The sum of LFO.lbip coefficients is .70 with an adjustment at first base (.57 without). But the sum of LGO.lbip coefficients is only .56:

    rgo3 = GO3.bip [+ .06*RFO.rbip + .13*LFO.lbip]
    ra4  = A4.bip + (...) + .15*RFO.rbip + .32*LFO.lbip (...)
    ra6  = A6.bip + (...) + .29*RFO.rbip + .15*LFO.lbip (...)
    ra5  = A5.bip + (...) + .21*RFO.rbip + .10*LFO.lbip (...)
    rpo7 = PO7.bip + (...) + .21*RGO.rbip + .10*LGO.lbip
    rpo8 = PO8.bip + (...) + .27*RGO.rbip + .24*LGO.lbip (...)
    rpo9 = PO9.bip + (...) + .22*RGO.rbip + .22*LGO.lbip (...)

Including the first base adjustment for RFO.rbip, the RFO.rbip and RGO.rbip coefficients would be balanced (.71 versus .70), but including the first base adjustment for LFO.lbip would leave the LFO.lbip and LGO.lbip coefficients imbalanced (.70 versus .56). That imbalance leads to run-weight coefficients for the outfield positions being lower than for the infield positions, because the marginal outfield plays are associated with a reduction in ground out plays greater than the reduction in outfield plays that is associated with marginal infield plays. We'll address issues related to this further below, when we discuss modeling issues, based on statistical theory, apart from baseball.

Examples Of First Stage Regression And Diagnostics

There would be little point to showing every single regression analysis and its output, but a couple of illustrative examples should convey the issues involved in variable selection.
If one regresses A4.bip onto the Proxy BIP Distribution Variables applicable to infielders ( RBIP.bip, RFO.rbip, LFO.lbip, and SH.bip ) and variables that may impact the number of runners on first base and thus double play pivot opportunities ( IBB.pa, SO.bfp, BB.bfp, HR.bh, and WP.ip ), we obtain the following output (I imported my Excel spreadsheet of centered variables into the statistical software package S-PLUS in order to run the regressions):

    Call: lm(formula = A4.bip ~ IBB.pa + SO.bfp + BB.bfp + HR.bh + WP.ip +
        RBIP.jbip + RFO.rbip + LFO.lbip + SH.bip,
        data = DRAsept07sansNL69, na.action = na.exclude)

    Residuals:
        Min      1Q  Median     3Q    Max
     -88.18  -16.88  0.6016   15.7  81.84

    Coefficients:
                   Value  Std. Error   t value  Pr(>|t|)
    (Intercept)   0.0000      0.7196    0.0000    1.0000
    IBB.pa        0.0284      0.0515    0.5509    0.5818
    SO.bfp       -0.0061      0.0081   -0.7524    0.4519
    BB.bfp        0.0285      0.0147    1.9351    0.0532
    HR.bh        -0.2003      0.0417   -4.8000    0.0000
    WP.ip        -0.2460      0.0590   -4.1672    0.0000
    RBIP.bip     -0.0785      0.0042  -18.6234    0.0000
    RFO.rbip     -0.1498      0.0156   -9.6086    0.0000
    LFO.lbip     -0.3140      0.0229  -13.7335    0.0000
    SH.bip       -0.2164      0.0795   -2.7220    0.0066

    Residual standard error: 25.22 on 1218 degrees of freedom
    Multiple R-Squared: 0.4874

Generally, we will eliminate from consideration variables with a Pr(>|t|) greater than .05. It is quite common for statisticians to restrict model variables to those with p-values of less than .05. When we eliminate variables with p-values greater than .05 from the above regression we obtain the following result:

    Call: lm(formula = A4.bip ~ HR.bh + WP.ip + RBIP.bip + RFO.rbip +
        LFO.lbip + SH.bip,
        data = DRAsept07sansNL69, na.action = na.exclude)

    Residuals:
        Min      1Q  Median     3Q    Max
     -88.52  -16.75  0.1957  15.41  82.76

    Coefficients:
                   Value  Std. Error   t value  Pr(>|t|)
    (Intercept)   0.0000      0.7201    0.0000    1.0000
    HR.bh        -0.1804      0.0405   -4.4523    0.0000
    WP.ip        -0.2003      0.0537   -3.7325    0.0002
    RBIP.bip     -0.0793      0.0042  -18.8644    0.0000
    RFO.rbip     -0.1494      0.0155   -9.6195    0.0000
    LFO.lbip     -0.3159      0.0228  -13.8672    0.0000
    SH.bip       -0.1943      0.0781   -2.4876    0.0130

    Residual standard error: 25.23 on 1221 degrees of freedom
    Multiple R-Squared: 0.4855
    F-statistic: 192 on 6 and 1221 degrees of freedom, the p-value is 0

The above output, rearranged and rounded, says that a good estimate of expected A4.bip is

    Expected A4.bip = -.08*RBIP.bip -.15*RFO.rbip -.32*LFO.lbip -.18*HR.bh -.20*WP.ip -.19*SH.bip.

Since we are looking for net plays, we subtract expected A4.bip from actual A4.bip to obtain the following formula for residual (or regression-adjusted) plays at second:

    ra4 = A4.bip +.08*RBIP.bip +.15*RFO.rbip +.32*LFO.lbip +.18*HR.bh +.20*WP.ip +.19*SH.bip.

We round to two decimal places not only for the sake of readability, but also because the standard errors in the estimates of the coefficients (see the Std. Error column in the regression output) are generally greater than .01 and actually tend to be about .05. Reporting extra decimal places would be a classic case of false precision.

There are some interesting additional details in the final output. Notice that the data set is DRAsept07sansNL69. I developed the model in September 2007 from Retrosheet data that was then available only for 1957 through 2006. Also, because of some data anomalies at the time in the 1969 National League data set, I excluded that year and league from the sample. Having developed the model from 1957-2006 data, I applied it out of sample to 1952-56, 1969 (National League), and 2007-09 when finalizing this book. We'll discuss the out-of-sample output shortly below.

The Multiple R-Squared of .4855 indicates that approximately 49 percent, or about half, of the variance in A4.bip can be explained by the model. The remaining residual is what we call ra4 and treat as reflecting the true skill of the team's second baseman. The distribution of ra4 is still too wide: the worst team at second base had -89 ra4; the best, +83 ra4. The quartiles are fairly reasonable: -17 ra4 and +15 ra4. The Residual standard error is the standard deviation in ra4, which is 25.
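The step from expected plays to residual plays is just a subtraction, which is why every coefficient changes sign. A minimal Python sketch of that step follows; the coefficients are the rounded values quoted above, while the function names and the team inputs are hypothetical, invented only to illustrate the arithmetic.

```python
# Rounded coefficients of the reduced A4.bip regression (the intercept is zero).
EXPECTED_A4_COEFS = {
    "RBIP.bip": -0.08,
    "RFO.rbip": -0.15,
    "LFO.lbip": -0.32,
    "HR.bh": -0.18,
    "WP.ip": -0.20,
    "SH.bip": -0.19,
}


def expected_a4(team):
    """Expected A4.bip given the team's centered predictor values."""
    return sum(coef * team[name] for name, coef in EXPECTED_A4_COEFS.items())


def residual_a4(team):
    """ra4: actual A4.bip minus expected A4.bip, so each coefficient's sign flips."""
    return team["A4.bip"] - expected_a4(team)


# Hypothetical team-season (not real data): many right-handed BIP, extra fly
# outs by right-handed batters, fewer by left-handed batters, a few extra HR.
team = {
    "A4.bip": 10.0, "RBIP.bip": 100.0, "RFO.rbip": 20.0, "LFO.lbip": -10.0,
    "HR.bh": 5.0, "WP.ip": 0.0, "SH.bip": 0.0,
}
print(round(expected_a4(team), 1), round(residual_a4(team), 1))
```

For this made-up team the context alone predicts about 8.7 fewer assists than average, so the 10 assists above average actually recorded become roughly 18.7 residual plays.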
Though the ra4 do not follow a so-called normal distribution exactly, due to an excessive number of extreme outcomes, it is still approximately correct to say that the middle half of teams have between -17 and +15 ra4, and the middle two-thirds have approximately -25 to +25 ra4. Judging by batted ball data, this spread is probably too wide, which indicates that the model is not perfectly capturing all the factors that can give or take away chances from second basemen. But the second-stage regression will discount ra4 (and other such residual estimated skill plays at other positions) to adjust for this.

I have not included the usual diagnostic plots of residuals. The residuals show no discernible non-linearity at any position. The scatter plots of residuals against fitted values show no change in the spread of residuals. While the residuals in both the first- and second-stage regressions were unimodal and symmetric, it must be said that the tails were fatter than one would like, thus falling short of the ideal in regression modeling of normally distributed residuals.

Recall that the presence of runners at first base should reduce GO3.bip, because the first baseman has to play closer to the bag. Regression analysis suggests that the typical impact is either not statistically significant or not practically significant over the course of a season.

    Call: lm(formula = GO3.bip ~ IBB.pa + SO.bfp + BB.bfp + HR.bh + WP.ip +
        RBIP.jbip + RFO.rbip + LFO.lbip + SH.bip,
        data = DRAsept07sansNL69, na.action = na.exclude)

    Residuals:
       Min      1Q  Median     3Q    Max
     -86.4  -15.39  0.0661  14.99  104.8

    Coefficients:
                   Value  Std. Error   t value  Pr(>|t|)  [1-std impact]
    (Intercept)   0.0259      0.6764    0.0383    0.9694
    IBB.pa       -0.0131      0.0484   -0.2695    0.7876
    SO.bfp       -0.0053      0.0076   -0.7052    0.4808
    BB.bfp        0.0261      0.0138    1.8874    0.0593  1.5 runs
    HR.bh         0.0183      0.0392    0.4659    0.6413
    WP.ip        -0.2065      0.0555   -3.7219    0.0002  3 runs
    RBIP.jbip    -0.0900      0.0040  -22.7182    0.0000
    RFO.rbip     -0.0701      0.0147   -4.7858    0.0000
    LFO.lbip     -0.1347      0.0215   -6.2675    0.0000
    SH.bip       -0.1725      0.0747   -2.3081    0.0212  1.5 runs

    Residual standard error: 23.7 on 1218 degrees of freedom
    Multiple R-Squared: 0.3675
    F-statistic: 78.64 on 9 and 1218 degrees of freedom, the p-value is 0

I've highlighted the variables not under the control of fielders that would impact the number of runners at first base (IBB.pa, SO.bfp, BB.bfp, HR.bh, and WP.ip).
The only one with a p-value below .05 was WP.ip, and, given the standard deviation of WP.ip, that impact in runs per season would typically be only plus or minus three runs. For reasons explained shortly above, we excluded RFO.rbip and LFO.lbip from the model for rgo3.
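The balance argument behind that exclusion comes down to adding up the rounded first-stage coefficients quoted earlier. The short Python sketch below redoes that addition; the dictionary layout and the helper name `total` are hypothetical conveniences, not part of the model.

```python
# Fly-out adjustment weights in the infield formulas. The rgo3 entries are the
# first-base terms that the final model leaves out.
INFIELD = {
    "rgo3": {"RFO.rbip": 0.06, "LFO.lbip": 0.13},
    "ra4": {"RFO.rbip": 0.15, "LFO.lbip": 0.32},
    "ra6": {"RFO.rbip": 0.29, "LFO.lbip": 0.15},
    "ra5": {"RFO.rbip": 0.21, "LFO.lbip": 0.10},
}
# Ground-out adjustment weights in the outfield formulas.
OUTFIELD = {
    "rpo7": {"RGO.rbip": 0.21, "LGO.lbip": 0.10},
    "rpo8": {"RGO.rbip": 0.27, "LGO.lbip": 0.24},
    "rpo9": {"RGO.rbip": 0.22, "LGO.lbip": 0.22},
}


def total(weights, variable, skip=()):
    """Sum one variable's weights across positions, optionally skipping some."""
    return round(sum(w[variable] for pos, w in weights.items() if pos not in skip), 2)


for infield_var, outfield_var in (("RFO.rbip", "RGO.rbip"), ("LFO.lbip", "LGO.lbip")):
    print(
        infield_var,
        "with 1B:", total(INFIELD, infield_var),
        "without 1B:", total(INFIELD, infield_var, skip=("rgo3",)),
        "| outfield", outfield_var + ":", total(OUTFIELD, outfield_var),
    )
```

For left-handed batters, the infield sum is close to the outfield sum only when the first-base terms are dropped, which is the imbalance the text describes.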

Second-Stage Regression And Diagnostics

Set forth below is the regression output from the second-stage, global regression, in which we regress actual team runs allowed above or below the league rate that year, RA.ip, onto all of the estimated net skill plays at all positions, including net pitcher plays such as BB.bfp, SO.bfp, HR.bh, IFO.bip, A1.bip, WP.ip, and net residual fielder plays such as ra4, ra6, rpo8, etc.

    Call: lm(formula = R.ip ~ IBB.pa + SO.bfp + BB.bfp + HR.bh + IFO.bip +
        A1.bip + WP.ip + CS.sba + GO2.bip + A789.ip + rgo3 + ra4 + ra5 +
        ra6 + rpo7 + rpo8 + rpo9, data = DRA, na.action = na.exclude)

    Residuals:
        Min      1Q    Median     3Q    Max
     -63.44  -14.46  -0.09485  15.02  67.22

    Coefficients:
                   Value  Std. Error   t value  Pr(>|t|)
    (Intercept)  -0.0097      0.6368   -0.0152    0.9879
    IBB.pa        0.3074      0.0455    6.7631    0.0000
    SO.bfp       -0.2777      0.0075  -37.0611    0.0000
    BB.bfp        0.3375      0.0131   25.7802    0.0000
    HR.bh         1.4918      0.0376   39.6665    0.0000
    IFO.bip      -0.4413      0.0175  -25.2079    0.0000
    A1.bip       -0.4238      0.0305  -13.8906    0.0000
    WP.ip         0.5646      0.0546   10.3424    0.0000
    CS.sba       -0.5898      0.0713   -8.2727    0.0000
    GO2.bip      -0.5943      0.0682   -8.7095    0.0000
    OA.ip        -0.6081      0.0956   -6.3621    0.0000
    rgo3         -0.5171      0.0288  -17.9480    0.0000
    ra4          -0.5265      0.0288  -18.2667    0.0000
    ra5          -0.4469      0.0271  -16.4746    0.0000
    ra6          -0.4439      0.0266  -16.6587    0.0000
    rpo7         -0.5349      0.0316  -16.9241    0.0000
    rpo8         -0.4583      0.0278  -16.5126    0.0000
    rpo9         -0.4954      0.0300  -16.5378    0.0000

    Residual standard error: 22.32 on 1210 degrees of freedom
    Multiple R-Squared: 0.889
    F-statistic: 570.1 on 17 and 1210 degrees of freedom, the p-value is 0

The standard error of a little over 22 runs is similar to the standard errors for the twenty or so well-known formulas for estimating team runs scored, as demonstrated by John Jarvis on his website. Generally this means that the DRA estimate of runs allowed per team is within plus or minus 22 runs about two-thirds of the time. The worst matches, with the greatest errors, are -63 runs and +67 runs. I would imagine that almost all of the many well-known offensive models would have similar outliers in a fifty- or sixty-year sample.

The Multiple R-Squared is not as high as I would like. When separate DRA models are developed for the Modern Era (1969-1992) and Contemporary Era (1993-present), such models tend to have multiple r-squareds of approximately ninety-five percent, which is approximately the same as is found in the better models of team offense, as reported by John Jarvis at his website (three were as high as ninety-six percent). Part of the art of developing regression models is balancing accuracy and simplicity. In this case I felt it would dramatically simplify this book to have one model for all seasons since the early 1950s.

I have not included the usual diagnostic plots of residuals. The residuals show no discernible non-linearity. We've dealt with multi-collinearity among the predictor variables via centering and first-stage regressions, so the variables all have correlations with each other between -.1 and +.1, down from -.6 and +.6 for the simple seasonal totals. The scatter plot of residuals against fitted values shows no change in the spread of residuals. The Durbin-Watson statistic did not indicate any meaningful correlation in team residuals over time. While the residuals in both the first- and second-stage regressions were unimodal and symmetric, it must be said again that the tails were fatter than one would like, though closer to a normal distribution than in the case of the first-stage regressions. However, due to the large sample sizes, no residual in the first-stage regression was remotely large enough to impact the coefficient estimates in the second-stage regression.
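For readers unfamiliar with the Durbin-Watson statistic mentioned above, here is a minimal sketch of the standard formula in Python. The residual series are invented for illustration, and how the residuals were ordered for the test reported here (for example, one franchise's seasons in sequence) is an assumption on my part rather than something the output discloses.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares.

    Values near 2 suggest no first-order serial correlation; values toward 0
    suggest positive correlation, and values toward 4 negative correlation.
    """
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Invented residuals (runs) for one hypothetical team across consecutive seasons.
persistent = [12.0, 15.0, 11.0, 14.0, -13.0, -12.0, -15.0, -12.0]  # runs of same sign
alternating = [12.0, -12.0, 12.0, -12.0, 12.0, -12.0]              # flips every year

print(round(durbin_watson(persistent), 2))
print(round(durbin_watson(alternating), 2))
```

A series whose errors persist from year to year produces a value well below 2, while one that flips sign every season produces a value well above it.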
One of the typical diagnostic tests for a regression model is to apply it out of sample to see how well it works. When I was finishing this book and had to apply the model to the 1969 National League and 2007-09 seasons for both leagues, the standard error was 23 runs and the r-squared was .90, virtually identical to the in-sample values. Unfortunately, the 1952-56 standard error was 36 runs, with a .89 r-squared. However, that is easily explained. First, the play-by-play data for the early-to-mid 1950s is not nearly as complete as it is for the late 1950s; some teams are missing up to 40 games of data per season. This results in significant data errors in the Proxy BIP Distribution Variables, CS.sba, GO3.bip, and IFO.bip. Second, as we will see in our discussion of the pre-1952 model(s), there was a dramatic change during the 1950s in the impact of pitchers on batted ball outcomes.

The run weights for the so-called Three True Outcomes (BB.bfp, SO.bfp, and HR.bh) are remarkably consistent with those determined under a variety of rigorous offensive models, though the weight for HR.bh is about one-tenth of a run too high. The run weight for CS.sba is almost precisely right, for it equals the sum of the typical increase in run expectation if a base is stolen (approximately .15 to .20 runs) and the typical decrease in run expectation if a runner on base is taken out (approximately .45 to .40 runs). Similarly, the run weight for A789.ip is almost precisely right, for it equals the sum of the typical increase in run expectation if a base runner gains the extra base (approximately .15 to .20 runs) and the typical decrease in run expectation if a runner on base is taken out by the outfielder (approximately .45 to .40 runs).

The run weight for WP.ip should be .27 runs, not .56, the excess being due to the fact that WP.ip carries the higher run expectation of the state of there already being one or more runners on base. In other words, positive WP.ip is strongly correlated with runs allowed not only because a WP increases runs allowed, but also because having runners on base already is even more strongly correlated with allowing runs; the WP.ip variable cannot separate out these two effects. We'll get to an imperfect fix in one of our alternative DRA models. IBB.pa is also too high, for the same reason; the average intentional walk increases expected runs by only .16, rather than .31.

Jim Albert and Jay Bennett's Curve Ball: Baseball, Statistics, and the Role of Chance in the Game (see pages 187 through 189 of the current paperback edition) has an excellent discussion about how regression variables can carry information of omitted variables (here, the existence of base runners) in both a good and a bad way. WP.ip and IBB.pa are examples where the omitted variables (the fact that there are runners on base already, which correlates with allowing runs) have a bad effect on the estimates. We will shortly see examples of variables carrying useful information.
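The sanity checks in the preceding paragraphs are simple sums and differences, sketched here in Python. The fitted values are the coefficients as printed in the second-stage output; the benchmark values are the ones quoted in the text; the variable and function names are hypothetical.

```python
# Magnitudes of fitted second-stage coefficients, as printed in the output.
FITTED = {"CS.sba": 0.5898, "OA.ip": 0.6081, "WP.ip": 0.5646, "IBB.pa": 0.3074}


def out_on_bases_value(base_gain_avoided, runner_removed):
    """Run value of retiring a runner: the extra base he does not gain plus
    the run expectation lost when he is taken off the bases."""
    return base_gain_avoided + runner_removed


# The two ends of the quoted ranges pair up (.15 with .45, .20 with .40).
low_pair = out_on_bases_value(0.15, 0.45)
high_pair = out_on_bases_value(0.20, 0.40)
print(round(low_pair, 2), round(high_pair, 2), FITTED["CS.sba"], FITTED["OA.ip"])

# Benchmarks quoted in the text for events that only occur with runners on.
BENCHMARK = {"WP.ip": 0.27, "IBB.pa": 0.16}
excess = {name: round(FITTED[name] - BENCHMARK[name], 2) for name in BENCHMARK}
print(excess)
```

Both pairings come to about .60 runs, close to the fitted CS.sba and outfield assist weights, while the WP.ip and IBB.pa weights exceed their benchmarks by roughly .29 and .15 runs respectively.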
THEORETICAL QUESTIONS REGARDING THE PROXY BIP DISTRIBUTION VARIABLES

We now come to perhaps the most interesting issue in the DRA model from the standpoint of general statistical modeling: the role of the Proxy BIP Distribution Variables and the run weights for the residual fielding plays (ra4, ra6, rpo7, etc.).

The Proxy BIP Distribution Variables are good proxy variables under standard multivariable regression theory, for two reasons. First, they are strongly correlated with the true distribution of ground balls and fly balls hit by right- and left-handed batters. As explained in chapter two, the .80 correlations between RFO.rbip and RGO.rbip, and between LFO.lbip and LGO.lbip, suggest they explain about two-thirds of the variance in true ground balls versus fly balls generated by right- and left-handed batters, respectively. That's because the correlations would be zero if fielding quality alone drove them, since the quality of team outfields and the quality of team infields should be uncorrelated over large samples. The fact that the correlations are nevertheless approximately .80 suggests that the square of that number (64%) is the amount of variation between FO given BIP and GO given BIP that must be controlled by the pitchers.

Second, the chosen proxies are uncorrelated (or only very weakly correlated) with the theoretical error term in a perfectly specified model (in other words, the true skill plays of the position being evaluated), and uncorrelated with any other predictors used to predict skill plays at that position, such as the base-runner variables and the ball-hogging variables.

When we get to the second, global regression, the residuals from the first set of regressions (rgo3, ra4, ra5, ra6, rpo7, rpo8, and rpo9) can be viewed as explanatory variables in predicting RA.ip that are either proxy variables for true skill plays or estimates of true skill plays that are subject to measurement error. If we view them as proxy variables, they have the problem that they are correlated somewhat with the error term in modeling RA.ip. For example, ra6 is too high (overestimates true skill net A6 (tsa6)) if the team's outfielders are above average in true skill, which would be associated with more runs prevented. Seen instead as simply measurement error in explanatory variables, this results in classical errors-in-variables, which can be shown to result in attenuation bias,[2] which causes the coefficients to be too small. This is exactly what happens in the DRA model, where the true run value of a true skill play (about .75 to .85 runs, depending on the position) is attenuated to something closer to .50 runs.
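Attenuation bias is easy to see in a small simulation. The Python sketch below is only an illustration of the textbook errors-in-variables result, not a reconstruction of the DRA data: every standard deviation in it is a hypothetical value chosen so that a true weight of .80 runs per play shrinks to roughly .50, and the noise is assumed to be independent of true skill.

```python
import random


def ols_slope(x, y):
    """Slope of a one-predictor least-squares fit of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def reliability(sd_skill, sd_noise):
    """Share of measured variance that is true skill; the attenuation factor."""
    return sd_skill ** 2 / (sd_skill ** 2 + sd_noise ** 2)


def simulate(true_weight=-0.80, sd_skill=20.0, sd_noise=15.5, sd_runs=22.0,
             n=20000, seed=0):
    """Return (slope on true skill plays, slope on noisy measured plays)."""
    rng = random.Random(seed)
    skill = [rng.gauss(0.0, sd_skill) for _ in range(n)]           # true skill plays
    measured = [s + rng.gauss(0.0, sd_noise) for s in skill]       # residual plays
    runs = [true_weight * s + rng.gauss(0.0, sd_runs) for s in skill]
    return ols_slope(skill, runs), ols_slope(measured, runs)


clean, noisy = simulate()
print(round(clean, 2), round(noisy, 2), round(-0.80 * reliability(20.0, 15.5), 2))
```

Regressing runs on the true skill plays recovers a weight near the assumed -.80, while regressing on the noisy measurement recovers a weight near the true weight times the reliability ratio. With these assumed values that ratio is a little over 60 percent.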
Though that results in a mis-estimation of the true run value of a true net skill play, it is ultimately helpful in the DRA model because we are more interested in estimating the total defensive runs per position per team. Attenuation is an appropriate haircut for an estimate of skill plays with too much noise in it.

When I first published an article in 2003 about the basic approach of DRA, one of the readers suggested that it was an instrumental variables regression model, also known as a two-stage least-squares model. I do not believe that is the case. In the first-stage regressions for each position, the Proxy BIP Distribution Variables are serving simply as good (because they are independent of the position being evaluated) if imperfect predictors in an ordinary least-squares estimate of net plays at each position, given total BIP, for example, A6.bip, A4.bip, PO8.bip, A5.bip, etc.

2. See Jeffrey M. Wooldridge, Introductory Econometrics: A Modern Approach, 318-22 (South-Western, 2009).

Perhaps the reader was thinking of the first-stage, per-position regressions as the first stage in a formal two-stage (that is, instrumental variables) model. Seen in that light, the Proxy BIP Distribution Variables are attempting to function in some sense like instrumental variables, but without satisfying all the requirements that an instrumental variable should satisfy, most importantly exogeneity, or independence from the error term in the second-stage regression.

For example, RFO.rbip is in some sense acting as an instrumental variable to purge estimates of net skill plays at each infield position of the effect of fly ball versus ground ball pitching to right-handed batters. And about two-thirds of RFO.rbip probably does reflect the tendency of opponent right-handed batters to hit the ball on the ground or in the air, which has a very minor impact on ultimate runs allowed. (The expected run value of a ground ball is close to that of a ball hit in the air; more ground balls go through for hits, but more balls hit in the air go for extra bases.) However, RFO.rbip also reflects to some extent the skill of the outfielders in preventing hits, which does have an impact on runs allowed and would impact the error term in the second-stage regression.

Another way in which the two-stage DRA model is inconsistent with the two-stage instrumental variables regressions that I have seen is that the number of instrumental variables is less than the number of predictor (pitching and fielding) variables, and that a different set of instrumental variables is used for each predictor. Finally, in the examples of two-stage instrumental variables regression that I have seen, the fitted variables from the first stage are included in the second-stage regression; here, the residuals from the first-stage regression are included in the second-stage regression.
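The last distinction, fitted values versus residuals, can be made concrete with a one-predictor first stage. The Python sketch below uses invented numbers and hypothetical function names; it shows only which half of the first-stage decomposition each approach would carry forward, not an actual two-stage least-squares estimator.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


def first_stage(x, y):
    """Split y into the part predicted by x (fitted) and the rest (residual)."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * a for a in x]
    residuals = [b - f for b, f in zip(y, fitted)]
    return fitted, residuals


# Invented centered team totals: fly outs by right-handed batters and
# shortstop assists, both relative to league rates (not real data).
rfo = [-30.0, -10.0, 0.0, 15.0, 25.0]
a6 = [20.0, -5.0, 4.0, -12.0, -7.0]

fitted, residuals = first_stage(rfo, a6)

# A two-stage least-squares model would pass `fitted` to its second stage;
# DRA passes `residuals` (the ra6-style net plays) instead.
print([round(f, 1) for f in fitted])
print([round(r, 1) for r in residuals])
```

The residuals are, by construction, uncorrelated with the predictor used in the first stage, which is what makes them usable as net plays with the batted-ball context removed.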
Though the Proxy BIP Distribution Variables used in DRA are not ideal, they make the model much better than it would be without them. Furthermore, the ultimate validation of DRA is less whether it passes all the standard diagnostic tests for a regression model than whether it generates (i) fielder defensive runs estimates that match well with batted ball data systems, and (ii) team defensive runs estimates that match actual team runs allowed. On the basis of many tests I've conducted over the years, DRA defensive runs estimates for individual fielders match estimates derived from batted ball data almost or about as well as the latter match each other. And DRA defensive runs estimates for teams match actual team runs allowed nearly or about as well as the best offensive runs models based on team seasonal totals of various offensive events match actual team runs scored. Most importantly, the Proxy BIP Distribution Variables can be replaced with better Proxy BIP Distribution Variables in future versions of DRA that can be developed by exploiting the Retrosheet play-by-play database (currently available after 1951) to its maximum extent. When I say better,