Predicting Baseball Win/Loss Records from Player Projections

Connor Daly
dasconnor@gmail.com

November 29, 2017

1 Introduction

When forecasting future results in Major League Baseball (MLB), there are essentially two sources from which you can derive your predictions: teams and players. How do players perform individually, and how do their collaborative actions coalesce to form a team's results? Several methods of both types exist, but they are often shrouded in proprietary formulas. Currently, several mature and highly sophisticated player projection systems are used to forecast season results. None are abundantly transparent about their methodology. Here I set out to develop the simplest possible player-based team projection system and try to add one basic improvement.

2 Predicting Wins and Losses

2.1 Team-Based Projections

One approach to such forecasting is to analyze team performance in head-to-head matchups. A common implementation of this approach is known as an Elo rating and prediction system.

Elo systems start by assigning every team an average rating. After games are played, winning teams' ratings increase and losing teams' ratings decrease relative to the expected outcome of the matchup. Expected outcomes are determined by the difference in rating between the two teams. If a very good team almost loses to a really bad team, its rating will only increase slightly. If an underdog pulls off an upset, however, it will earn relatively more points. As more games are played, older games become progressively less meaningful. Essentially, this prediction method considers only whom you played, what the margin of victory was, and where the match was played (home-field advantage is adjusted for). Using Monte Carlo simulations, one can predict the outcomes of individual seasons for each team. Between seasons, teams are regressed toward the mean. For a detailed explanation of a baseball Elo model, see FiveThirtyEight [Boi].

The main advantage of this kind of team-based approach is that it can capture some of the hard-to-pin-down factors that make teams more than just the sum of their parts. Without figuring out what the secret sauce is, this method estimates the sum total contributions of ownership, coaches, team philosophy, and countless other factors. The method does have a significant downside, however, in that it can't take advantage of known changes in team composition, such as changes in players and coaches. If I know Babe Ruth is leaving the Yankees after a particular season, I probably want to project them differently than I would have otherwise. This model fails to capture that.
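To make the update concrete, here is a minimal sketch of a single Elo step in R. The K-factor and the 400-point scale are the conventional chess values, used purely for illustration; FiveThirtyEight's MLB model uses its own constants, a margin-of-victory adjustment, and a home-field adjustment.

```r
# Minimal Elo update sketch. K = 20 and the 400-point scale are
# illustrative defaults, not FiveThirtyEight's actual parameters.
elo_expected <- function(r_a, r_b) {
  1 / (1 + 10 ^ ((r_b - r_a) / 400))
}

elo_update <- function(r_a, r_b, a_won, k = 20) {
  e_a   <- elo_expected(r_a, r_b)      # expected score for team A
  s_a   <- if (a_won) 1 else 0         # actual score for team A
  delta <- k * (s_a - e_a)             # points transferred
  c(a = r_a + delta, b = r_b - delta)
}

# Example: a 1600-rated favorite beats a 1400-rated underdog and gains
# only a little; had the underdog won, far more points would transfer.
elo_update(1600, 1400, a_won = TRUE)
```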

2.2 Player-Based Projections

Baseball enjoys a unique advantage over other major American sports in that it is significantly easier to decouple the performances of individual players and determine who was ultimately responsible for creating a given result. If a batter hits a home run, we can say with a high degree of certainty that the batter and the pitcher combined to cause this event. By looking at the large number of combinations of batter/pitcher matchups, we can gauge the relative skill of each by their performance against a wide variety of opponents. A sport such as football, on the other hand, presents significant challenges to gauging the true skill of individual players. Looking at the running game, how can one intelligently and objectively pass out credit and blame? If a running back runs for a seven-yard gain on a toss sweep to the right, how much credit should the left guard receive? Decoupling in baseball isn't perfect, but compared to other sports, it's much easier.

2.2.1 Wins Above Replacement

A foundational pillar of sabermetrics, the empirical, quantitative analysis of baseball, is the concept of wins above replacement (WAR). Essentially, the idea is that all meaningful baseball statistics must measure how events on the field help or hurt a team's chances of winning in expectation. Games are won by scoring runs and preventing runs from being scored. Thus, every event can be understood in the context of runs created or runs allowed. This idea can be hard to grasp at first. How many runs does a home run create? One? Rather counter-intuitively, the generally accepted value is around 1.4 runs. How is this? Not only did the batter score himself, but he will also have batted in any runners on base. You must also consider the possibility that, had the batter made an out instead of scoring those base runners, following batters could have driven them in. Using real playing data, we can determine the expected run-creating or run-subtracting value of every event in baseball. See Table 1 for a complete breakdown of the run value of such events. By looking at the total contributions of a player over the course of a season, we can sum the expected run contributions of every event the player caused.

Now we need to compare our player against a baseline. A first intuition might be to compare the player to league average. But defining league average as a baseline of zero runs created sells league-average players short: a league-average player is better than approximately half of the players in the league. That's valuable production! Instead, we scale our player's contribution against the idea of a replacement-level player.

The production of a replacement-level player is intended to be equivalent to the contributions of an infinitely replaceable minimum-salary veteran or minor league free agent. For reference, a team of replacement players is defined by Fangraphs to win approximately 48 games over the course of a 162-game season. Using this replacement level, we determine the original player's runs above replacement. Next, we scale the runs above replacement by the number of runs per win in an average game. Finally, we scale this calculated WAR to the number of possible wins in a season, so that the sum of all WAR and replacement wins equals the total number of wins in the season. On average, a player's context-free stats would have resulted in a team winning an extra number of games corresponding to his WAR, relative to the same team with a replacement-level player in his place. There is a finite pool of WAR for all players: when one player performs better, less WAR is allocated to the rest of the players.

Unfortunately for the reader, there are several variants of WAR, and all define things slightly differently. Several rely on inexplicably chosen constants or proprietary formulas. My calculations are based mainly on Fangraphs WAR, but I did make some alterations, which will be explained later. For more in-depth explanations of WAR and its underpinnings, see [Joh85], [Tom06], and [Fanb].
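As a concrete illustration of the chain from events to runs to wins, here is a small R sketch. The stat line is invented, the run values come from Table 1 below, and the replacement gap and runs-per-win constants are round numbers of the right magnitude (roughly ten runs per win is the commonly cited figure), not Fangraphs' exact season-specific values.

```r
# From events to WAR (sketch): apply Table 1's run values to an invented
# season stat line, compare against an assumed replacement baseline, and
# convert runs to wins. All constants are illustrative approximations.
run_values <- c(single = 0.475, double = 0.776, triple = 1.070,
                home_run = 1.397, walk = 0.323, out = -0.299)
stat_line  <- c(single = 100, double = 30, triple = 3,
                home_run = 25, walk = 60, out = 400)

batting_runs     <- sum(run_values * stat_line)  # runs relative to average
replacement_runs <- -20   # assumed replacement-vs-average gap over a season
runs_per_win     <- 10    # commonly cited approximation

war <- (batting_runs - replacement_runs) / runs_per_win
war   # roughly 2.9 WAR for this invented stat line
```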

2.2.2 WAR to Wins

By projecting a season's worth of players' expected WAR contributions, we can group players by target-year team and take the sum total of their contributions. The combined total of their WAR should help predict the team's actual number of wins. This relationship isn't necessarily one-to-one, as will be discussed in 4.5. This method benefits from being able to track players as they change teams.

Table 1: Run Values by Event

Event                  Run Value   Event                    Run Value
Home Run                   1.397   Balk                         0.264
Triple                     1.070   Intentional Walk             0.179
Double                     0.776   Stolen Base                  0.175
Error                      0.508   Defensive Indifference       0.120
Single                     0.475   Bunt                         0.042
Interference               0.392   Sacrifice Bunt              -0.096
Hit By Pitch               0.352   Pickoff                     -0.281
Non-intentional Walk       0.323   Out                         -0.299
Passed Ball                0.269   Strikeout                   -0.301
Wild Pitch                 0.266   Caught Stealing             -0.467

Empirical measurements of the run value of events over the 1999-2002 seasons. Data from [Tom06].

3 Projecting Players

To create a season-long team projection system based on player projections, the first step is to project the players themselves. Essentially, you need to look at a player's past performance and predict how he will perform in the future. Some methods of doing this are highly sophisticated, others quite simple. Systems like Baseball Prospectus's PECOTA, Dan Szymborski's ZiPS, and Chris Mitchell's KATOH all combine many variables and various calculations to compute projected outcomes. PECOTA in particular is based primarily on player similarity scores: it uses various metrics to find comparable players for a given to-be-projected player, then uses the performance of those comparables to infer a trajectory for the targeted player's future performance. Although its general methodology has been discussed, its specific implementation is proprietary. On the other end of the sophistication spectrum is perhaps the simplest possible projection system: Marcel the Monkey.

3.1 Marcel the Monkey

Marcel the Monkey, or simply Marcel, is a player projection system invented by Tom Tango [Tan]. It sets out to be the simplest possible player projection system. Essentially, it takes a weighted average of a player's last three years (weights of 5/4/3 for batters and 3/2/1 for pitchers), regresses the player toward the mean by 1200 plate appearances, and applies an aging curve that increases a player's skills until age 29, after which point they begin to decline. These projections make no attempt to differentiate by team, league, or position, with the exception that some different constants are used for starting pitchers and relief pitchers. Rather than directly projecting counting stats such as hits or home runs, Marcel projects rate stats like hits or home runs per plate appearance. Plate appearances for batters are calculated from the previous two years and then added to a baseline of 200 plate appearances. Thus, all players are projected to have at least 200 plate appearances in the target year, even a player who may have retired two years prior. When translating from player to team projections, this is controlled for by setting rosters with the actual players who played on teams in the target year.

A note about pitchers: pitchers are projected per inning pitched rather than per plate appearance. Starting pitchers are projected to a minimum of 60 innings and relievers to a minimum of 25. A pitcher's starter or reliever role is defined by the ratio of games started to games played: a starter has started more than half of his appearances in the given period.

Marcel player projections are the foundation of my Marcel-based projection system. The first phase of my project centered on implementing a Marcel projection scheme in R for both batters and pitchers. Going back in time, older seasons don't contain the same amount of statistical data that modern seasons do. Because of this, I am only able to create Marcel projections for seasons from 1955 onwards.
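A minimal sketch of the batter weighting and regression just described, under one reading of the scheme: weighted totals at 5/4/3, plus 1200 plate appearances of league-average performance. The aging curve and the playing-time baseline are omitted, and the numbers in the example are invented.

```r
# Marcel-style rate projection for a batter (sketch). Seasons are ordered
# most recent first; regression adds 1200 PA of league-average performance.
marcel_rate <- function(stat, pa, lg_rate, weights = c(5, 4, 3)) {
  w_stat <- sum(weights * stat)   # weighted count of the stat
  w_pa   <- sum(weights * pa)     # weighted plate appearances
  (w_stat + 1200 * lg_rate) / (w_pa + 1200)
}

# Example: projecting home runs per PA from three invented seasons,
# with an assumed league rate of 0.030 HR per PA
marcel_rate(stat = c(30, 25, 20), pa = c(600, 580, 550), lg_rate = 0.030)
```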

4 Marcel the Monkey to Marcel the Manager

After developing my Marcel projections, the next step in projecting team seasonal results was to group the players into teams and sum their accomplishments.

4.1 Season Lengths

Prior to 1961, both the American and National Leagues played a 154-game season before later switching to 162 games. Other regular seasons have also been shortened, such as by the player strike in 1994. As such, all projections must account for varying season lengths. All reported accuracy statistics are scaled to a 162-game season.

4.2 Adjustments to WAR Calculation

Although I generally followed the standard calculations for Fangraphs WAR, my calculations diverged enough to be considered significantly different. For position players, I only considered batting runs created, not fielding runs or baserunning runs. The numbers also aren't position, league, or park adjusted. Because WAR is designed to be a retrospective statistic and my numbers are forward-looking, I did my best to remove them from all possible context. My projections don't take things like park factors or league adjustments into account, so neither should my WAR calculations. Fielding runs were not calculated because advanced fielding statistics are not provided in the Lahman database I used to gather my projections. See A.1 for more information on data sources.

4.3 New Season's Rosters

Rosters for target-year teams were assembled by looking at batting statistics for the following year. I defined being on a team for a given year as having at least one plate appearance for that team, with that team being the first the player appeared with that year. Because I drew my batting stats from the Lahman database (see A.1), I could only project through the 2016 season. I will be able to predict the 2017 season when the 2017 version of the Lahman database is published, likely in the coming weeks. To predict a season before it actually happens, I would need to add a new source of data to determine which players to include on a roster.
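As a sketch of that roster rule, assuming the Lahman R package's Batting table (where the stint field orders a player's teams within a season, so stint 1 is the first team he appeared with), the selection might look like the following; the plate-appearance formula and the target year are illustrative.

```r
# Roster rule from 4.3 (sketch): a player is on the team he first appeared
# with in the target year, provided he had at least one plate appearance.
library(Lahman)
library(dplyr)

rosters_2016 <- Batting %>%
  filter(yearID == 2016, stint == 1) %>%            # first team of the year
  mutate(PA = AB + BB + HBP + SH + SF) %>%          # approximate PA
  filter(PA >= 1) %>%
  select(playerID, teamID)

head(rosters_2016)
```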

4.4 Rescaling WAR

Going by Fangraphs' definition of WAR, a team of only replacement players is expected to achieve a winning percentage of approximately 0.294. Over a 162-game season, this corresponds to about 48 wins; however, not all seasons since 1955 contained 162 games. Additionally, not all seasons featured 30 teams. Hence, the number of available wins for players to earn fluctuates from year to year. The WAR available in year x is

WAR(x) = NumTeams(x) \cdot NumGames(x) \cdot \left( \tfrac{1}{2} - ReplacementLevel \right) \qquad (1)

where NumTeams(x) is the number of teams playing in year x, NumGames(x) is the mode of games played per team in year x, and ReplacementLevel is the winning percentage of a team of exclusively replacement-level players. WAR is then divided so that 57% is allocated to position players and 43% to pitchers. Once the total amount of WAR has been allocated, player projections must be scaled so that their projected WAR sums to the available WAR for the season.
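Equation (1) translates directly into a short R function, with the 57/43 split applied afterward; the 30-team, 162-game example is mine.

```r
# Equation (1): total WAR available in a season, given the replacement
# level as a winning percentage, then split between position players
# and pitchers.
available_war <- function(num_teams, num_games, replacement_level = 0.294) {
  num_teams * num_games * (1 / 2 - replacement_level)
}

# Example: a modern 30-team, 162-game season yields roughly 1000 WAR
total <- available_war(30, 162)
c(total = total, position = 0.57 * total, pitching = 0.43 * total)
```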

4.5 Correcting Diminishing Returns

The next step in projecting teams is to sum individual player WAR to establish a team WAR. Once that has been done, we can add a team's total WAR to the season's per-team replacement win total. The resulting total is that team's win projection for the year. You could stop there; however, doing so makes a key incorrect assumption: that win totals increase linearly with run differentials. As the results section will show, there are clear diminishing returns at the extreme ends of the projection spectrum. The more WAR a team adds over a projected 81 wins (a .500 season), the more the model will overestimate the value of those WAR in predicting the number of wins. Similarly, the further a team falls below .500, the more the model will underestimate it. The relationship between WAR and wins is not entirely linear!

A simple solution to this problem is to apply a correcting function to the projections. I looked at applying two different correction models to the data, one linear and one cubic, and used simultaneous perturbation stochastic approximation (SPSA) to determine their parameters [Spa03]. For a more detailed explanation of model selection, see 5.1.

4.6 Measuring Correctness

4.6.1 Validity and Verification

When constructing mathematical models of reality, one must always ask two questions: is the model correctly implemented, and does the model actually represent some semblance of reality? To answer the first question, we will look at publicly available data on Marcel player projections and wins above replacement. For the second, we will construct a loss function to measure how predictive our model can be.

The first step in model verification is to ensure that my implementation of Marcel projections matches the intended projections of the method's creator. Thankfully, he has many years of Marcel projections posted on his website [Tan]. Although our numbers aren't in complete agreement, they appear to be within an acceptable bound. Differences are on the order of one or two per stat and are likely due to implementation details such as digit precision and rounding decisions.

To verify I was computing WAR correctly, I compared my projected WAR totals to the actual WAR earned in the target season. Looking mostly at the top of the leaderboard, I checked that my WAR projections seemed to be a reasonable weighted average of the previous three years. If a player averaged three WAR per year and was projected for six, I'd know something was off. I did recognize, however, that there would likely be reasonably large divergences for players who were defensively extreme, either extremely good or extremely bad. On the aggregate, the WAR totals seemed to match up pretty well, but I don't have a rigorous calculation showing this is true.

Finally, to verify that I aggregated team projections correctly, I looked at the sum of all projected wins per year and compared it to the total number of available wins, making sure the calculated number was within a couple of wins of the actual total. I allowed small differences because in some years teams play different numbers of games, and rounding can cause a win or two to fall through the cracks. The projections will still be reasonable.

4.6.2 The Loss Function

When measuring the validity of the model, it may seem tempting to say that we can measure its accuracy directly. But what exactly is it that our model is trying to measure? Are we trying to predict actual wins and losses, or are we trying to predict true talent, which can only be measured noisily via wins and losses? I would argue that we attempt to ascertain true talent by noisily measuring wins and losses. Thus, we define our loss function y(\theta):

y(\theta) = L(\theta) + \epsilon(\theta) \qquad (2)

y(\theta) = \frac{162}{n} \sum_{i=1}^{n} \frac{|\hat{x}_i - x_i|}{NumGames(i)} \qquad (3)

where L(\theta) measures the loss of the prediction's ability to measure true talent and \epsilon(\theta) is a noise term. The concrete version (3) specifies that the loss is measured as the mean absolute error of the model's predictions, scaled to a 162-game season. That is, for teams numbered 1, ..., i, ..., n, \hat{x}_i is the model's predicted number of wins for team i, x_i is the team's actual number of wins, and NumGames(i) is the number of games played by team i. This computes the mean absolute error over all teams, scaled to 162 games, which allows us to compare results from teams that played seasons of different lengths.
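Equation (3) translates directly into R; the four-team example, including one strike-shortened season, is invented.

```r
# Equation (3): mean absolute error between projected and actual wins,
# scaled to a 162-game season so seasons of different lengths compare.
projection_loss <- function(predicted, actual, num_games) {
  mean(162 * abs(predicted - actual) / num_games)
}

# Invented example: three full seasons plus one strike-shortened team
projection_loss(predicted = c(88, 75, 95, 60),
                actual    = c(81, 70, 102, 55),
                num_games = c(162, 162, 162, 114))
```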

5 Results

Without applying any correction model, I was able to achieve a loss of 8.16 wins per team per 162-game season. See Figure 1 for a visual representation of the results. Although not perfect, there is a clear trend line between the predictions and the results. The residuals from the one-to-one line, however, appear to show a positive trend, meaning the model is overestimating teams at the right end of the graph and underestimating teams at the left. We can attempt to correct for this.

[Figure 1: Results of Uncorrected Projections. (a) Uncorrected projections for 1955-2016; (b) residuals for uncorrected projections.]

5.1 Calculating Correction Parameters

I used two different models to attempt to apply corrections: one linear and one cubic. For both, I used SPSA to determine the optimal parameter values. I chose a linear and a cubic model because I assumed that the correction needed to be reasonably antisymmetric around 81 projected wins, corresponding to a .500 record. Both a negatively sloping linear function and a cubic function can provide that correction. I picked my initial parameters by guessing a scaling factor and choosing the other terms such that the x-intercept was 81. I attempted to find the correction such that

CorrectedWins = ProjectedWins + Correction \qquad (4)

5.1.1 Linear Model

For the linear model, I modeled Correction = -\beta_0 \cdot ProjectedWins + \beta_1, starting with an initial \beta value of [0.25, 20.25]. The initial \beta values were determined by manually guessing and checking a few test values. After a million runs with parameters A = 1000, a = 0.01, c = 0.015, \alpha = 0.602, \gamma = 0.101, and a Bernoulli (+1, -1) distribution for my deltas, I determined the optimal value to be [0.250002, 20.249998]. This resulted in a net loss of 7.77 wins per team per 162-game season. I used the gain sequence provided in [Spa03], so I know that the gain-sequence conditions for convergence are satisfied. By using a Bernoulli distribution for my deltas, as in [Spa03], I have satisfied the conditions on the deltas. The remaining conditions are unknowable without knowledge of L, but it seems reasonable that it is sufficiently smooth and bounded.

5.1.2 Cubic Model

For the cubic model, I used vertex form to model Correction = \beta_0 (ProjectedWins - \beta_1)^3, starting with an initial \beta value of [0.01, 81]. The initial \beta values were determined by manually guessing and checking a few test values. After a million runs with parameters A = 1000, a = 0.0001, c = 0.0015, \alpha = 0.602, \gamma = 0.101, and a Bernoulli (+1, -1) distribution for my deltas, I determined the optimal value to be [-0.0003673105, 80.9289773302]. This resulted in a net loss of 8.02 wins per team per 162-game season. I used a scalar multiple of the gain sequence provided in [Spa03], so I know that the gain-sequence conditions for convergence are satisfied. By using a Bernoulli distribution for my deltas, as in [Spa03], I have satisfied the conditions on the deltas. The remaining conditions are unknowable without knowledge of L, but it seems reasonable that it is sufficiently smooth and bounded.
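For readers unfamiliar with SPSA, here is a compact sketch of the iteration both searches use, following the standard gain-sequence form a_k = a/(k + 1 + A)^\alpha and c_k = c/(k + 1)^\gamma from [Spa03]. The loss argument stands in for Equation (3) evaluated over the training data; the quadratic in the usage line is a toy.

```r
# Sketch of an SPSA search: at each step, perturb all parameters at once
# with a random +/-1 vector, estimate the gradient from two loss
# evaluations, and step along the negative estimate.
spsa <- function(loss, beta, n_iter = 1000,
                 a = 0.01, c = 0.015, A = 1000,
                 alpha = 0.602, gamma = 0.101) {
  for (k in 0:(n_iter - 1)) {
    a_k   <- a / (k + 1 + A)^alpha
    c_k   <- c / (k + 1)^gamma
    delta <- sample(c(-1, 1), length(beta), replace = TRUE)  # Bernoulli
    g_hat <- (loss(beta + c_k * delta) - loss(beta - c_k * delta)) /
             (2 * c_k * delta)                    # simultaneous estimate
    beta  <- beta - a_k * g_hat
  }
  beta
}

# Toy usage: step toward the minimizer of a simple quadratic
spsa(function(b) sum((b - c(0.25, 20.25))^2), beta = c(0, 0), n_iter = 10000)
```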

5.2 Results with Corrections

After analyzing both the linear and the cubic model, I needed to decide which to use for my corrections, and I used cross-validation to make the choice. I used three different test sets, created by grouping the data points by their position modulo three. Performing the same SPSA calculations as in the individual trials, but with 10,000 runs, I found the linear model had an average loss of 7.82 wins per 162-game season while the cubic model had 8.04. I therefore used the linear model for my corrections.

Although this helped reduce our loss function, our corrected model still isn't perfect. Noticeably, the corrected left tail in Figure 2a isn't predicted as well as in the uncorrected version. Overall, though, the corrected model sees noticeable year-to-year improvements over the uncorrected model, as in Figure 2b.

[Figure 2: Looking at Corrected Projections. (a) Corrected projections for 1955-2016; (b) year-to-year correction improvement, 1955-2016.]
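A sketch of that fold construction, with placeholder names: fit_correction stands in for the SPSA fit on the training folds and is assumed to return a correcting function, projection_loss is Equation (3) as sketched in 4.6.2, and the data frame columns are assumptions.

```r
# Three-fold cross-validation from 5.2 (sketch): folds are assigned by
# row position modulo three; each fold is held out once while the
# correction is fit on the other two.
cv_loss <- function(data, fit_correction) {
  folds <- seq_len(nrow(data)) %% 3
  fold_losses <- sapply(0:2, function(f) {
    correct <- fit_correction(data[folds != f, ])  # returns a function
    test    <- data[folds == f, ]
    projection_loss(correct(test$projected), test$actual, test$games)
  })
  mean(fold_losses)
}
```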

5.3 Perfect and Perfectly Imperfect Knowledge

So how good is the model, actually? We know we can achieve an average loss of under eight games per season, but is that any good? Suppose we knew nothing about individual MLB teams and instead knew only the distribution of MLB records. We can assume that it is approximately normal and, by definition, has an average winning percentage of .500. The standard deviation in winning percentage turns out to be around 0.07, which corresponds to around 11.3 wins per 162-game season. If we randomly assign teams a winning percentage from this distribution, we end up with a loss of around 13 wins per 162-game season. Similarly, if we were to project a .500 record for every team, we'd be off by about 9.5 wins per 162-game season.

By contrast, how good a projection could we ever hope for? The best predictor of how many wins a team accrues turns out to be an estimate based solely on its runs scored and runs allowed. These estimates are called pythagorean win projections, the most accurate of which is the pythagenpat win total [Pro]. If we had perfect knowledge of how many runs a team would score and allow, we could use its pythagenpat wins to predict its record, as in Figure 3. Yet even with this perfect knowledge, we can only come within 3.18 wins per 162-game season. If we consider projecting every team to a .500 record to be the low point and 3.18 wins to be a theoretical upper bound, our model has achieved roughly 27% of the possible knowledge gain.

[Figure 3: Predicting Wins from Pythagenpat Wins, 1955-2016]
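A sketch of the pythagenpat estimator, using the commonly cited exponent form x = ((RS + RA)/G)^0.287; the exact constant varies slightly across sources, and the example team is invented.

```r
# Pythagenpat expected winning percentage [Pro]: a pythagorean estimate
# whose exponent scales with the run environment.
pythagenpat_pct <- function(runs_scored, runs_allowed, games) {
  x <- ((runs_scored + runs_allowed) / games)^0.287  # common form
  runs_scored^x / (runs_scored^x + runs_allowed^x)
}

# Invented example: a team scoring 800 and allowing 700 runs over
# 162 games projects to roughly 91 wins
162 * pythagenpat_pct(800, 700, 162)
```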

6 Park Effects

Now, I attempt to add one final improvement to the model: park effects. Essentially, not all ballparks in Major League Baseball are created equal; they have different dimensions and atmospheric conditions that make some parks easier to score runs in than others. Using park effects data (see A), I deflated all player stats to remove park effects before computing each player's Marcel projection. After a player's season was projected, I looked at his destination home ballpark and inflated his numbers to reflect his new home. Surprisingly, this made my projections worse across the board, beating my standard Marcel model only twice in 60 years, as shown in Figure 4. A clear shift occurs around 1973: in 1974, more detailed park effect data became available, which led to improved predictions. Although park effects are certainly real, I'm left to conclude that, averaged over a very large sample of players, the current level of granularity is too coarse to be very predictive.

[Figure 4: Comparing Projections With and Without Park Effects]
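A sketch of the deflate-then-inflate step, treating a park factor as a simple multiplier centered at 1.00 and ignoring the home/road split for simplicity; the factors and stat in the example are invented.

```r
# Park-effects adjustment (sketch): deflate a stat by the old home park's
# factor to get a park-neutral value, then inflate by the new park's
# factor. Factors are multipliers centered at 1.00 (1.08 = 8% inflation).
park_adjust <- function(stat, old_factor, new_factor) {
  (stat / old_factor) * new_factor
}

# Invented example: 30 home runs hit in a hitter-friendly park (1.08),
# re-expressed for a pitcher-friendly destination park (0.95)
park_adjust(30, old_factor = 1.08, new_factor = 0.95)
```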

7 Conclusions

At the start of this project, I wanted to build the dumbest player-based projection model possible and see if I could improve it. Beyond a simple error correction, I couldn't in the short time I had. Although my model may be dumb enough for a monkey, it is still reasonably predictive and appears likely to hold up against far more sophisticated predictions.

7.1 Comparison to Other Models

Unfortunately, many of the data points required for a full multi-year model comparison lie behind paywalls or aren't easily searchable on the internet. Baseball Prospectus has currently taken down its seasonal PECOTA projections while it upgrades its site. We can, however, look at the year 2016. After training my model on the years 1955-2015, it predicts 2016 with a loss of 6 wins per 162-game season. Table 2 shows how it stacked up against the competition: Marcel projections went toe-to-toe with the best of the best. Data is courtesy of [Aus].

Table 2: 2016 Projection Comparison (loss in wins per 162-game season)

Marcel  PECOTA  FanGraphs  Davenport  Banished to the Pen  Essays  Composite
6.00    6.20    5.80       6.97       5.97                 6.37    5.73

7.2 Challenges and Future Directions

There are several limitations to my model, some mathematical, some sabermetrical. First, my measurements of WAR only look at batting and defense-independent pitching. This removes skills related to baserunning and fielding from the game.

This causes players whose fielding or baserunning talent is significantly different from league average to be incorrectly valued. Second, Marcel doesn't do a great job of adjusting for playing time: every player is projected for a minimum of 200 plate appearances with no regard for his expected role on the team. More intelligently modeling fielding and baserunning skill, as well as better adjusting for playing time, could significantly improve the model. Another simple improvement would be a more robust aging curve. Different positions tend to age differently, so a position-specific aging curve could add benefits. Obviously, I'd also like a better way to manage roster data so I can project current rosters into the future without relying on Lahman data.

Mathematically, I would have liked to run better SPSA optimizations. For the amount of time I ran them, I wasn't able to move my final parameters very far from my initial guesses. This forced me to check many values by hand to figure out where best to start the optimization. Better choices of SPSA parameters and longer running times would likely have helped. Additionally, my model doesn't account well for uncertainty. Marcel has a way to measure reliability based on how much of a player's projection comes from his own stats versus how much is regressed toward the mean. I would have liked to add a similar component that could perhaps provide confidence intervals around a team's projection.

A Appendix

A.1 Sources of Data

Seasonal batting and pitching data were obtained from the Lahman database [Lah]. I made use of years through 2016, the last published year with entries at the time of writing. Park effect factors came from Fangraphs [Fana]. Fully detailed factors were available from 1974 through 2015; earlier years only had basic effects available.

References

[Joh85] John Thorn and Pete Palmer. The Hidden Game of Baseball. University of Chicago Press, 1985. ISBN 9780226242484.

[Spa03] James C. Spall. Introduction to Stochastic Search and Optimization. Wiley-Interscience Series in Discrete Mathematics and Optimization. John Wiley and Sons, 2003. ISBN 9780471330523.

[Tom06] Tom M. Tango, Mitchel G. Lichtman, and Andrew E. Dolphin. The Book: Playing the Percentages in Baseball. TMA Press, 2006. ISBN 9781494230170.

[Aus] Darius Austin. Evaluating the 2016 Season Preview Predictions. URL: http://www.banishedtothepen.com/evaluating-the-2016-season-preview-predictions/.

[Boi] Jay Boice. How Our 2017 MLB Predictions Work. URL: https://fivethirtyeight.com/features/how-our-2017-mlb-predictions-work/.

[Fana] Fangraphs. Park Factors. URL: http://www.fangraphs.com/guts.aspx?type=pf&teamid=0&season=2012.

[Fanb] Fangraphs. WAR for Position Players. URL: https://www.fangraphs.com/library/war/war-position-players/.

[Lah] Sean Lahman. Lahman's Baseball Database. URL: http://www.seanlahman.com/baseball-archive/statistics/.

[Pro] Baseball Prospectus. Pythagenpat. URL: http://legacy.baseballprospectus.com/glossary/index.php?mode=viewstat&stat=136.

[Tan] Tom Tango. The 2004 Marcels. URL: http://www.tangotiger.net/archives/stud0346.shtml.