Stanford CS 221
Predicting Season-Long Baseball Statistics
By: Brandon Liu and Bryan McLellan

Task Definition

Though handwritten baseball scorecards have become obsolete, baseball remains at its core a statistical goldmine, full of well-kept statistics and measurables. Every aspect of the game is televised, tracked, and stored, from a player's home run total to the speed, movement, and type of pitch on a given at bat. Combining traditional statistics with these more recently tracked measurements presents interesting opportunities to apply machine learning algorithms to predict baseball player performance over season-long data. Predicting player performance accurately can lend insight into predicting team performance and individual player awards like the MVP (Most Valuable Player) and Cy Young Awards. These predictive insights also provide an interesting model for player valuation: for instance, how much value does a player provide to his team compared to what he is paid in salary? This project applies various machine learning algorithms to predict several mainstream baseball statistics for active MLB batters and pitchers. These statistics include, but are not limited to, [RBI, Runs, HR, AVG] for batters and [W, L, ERA, BB, H] for pitchers.

Evaluation

In evaluating our results, we ultimately truncated our data set to the top 200 active MLB batters and starting pitchers. Relief pitchers frequently change roles, so their data was not suitable. For any given statistic, our prediction engine takes in statistics from one year and
outputs predictions for the next year. Our evaluation consisted of inputting 2015 data, outputting 2016 predictions, and comparing those predictions against actuals. Our primary evaluation metrics were the raw difference from the actual value as well as the percent difference [(actual - predicted) / actual]. As a less stringent error metric, attempting to predict the top 20 players in each category gave us ~45% accuracy, a figure deflated slightly by the 3-4 rookies who reached the top of the statistical categories in 2016. As an additional case study, we attempted to predict the Cy Young and MVP award winners for each league by aggregating leaders in our predicted statistical categories. The predictions were quite realistic, selecting viable candidates and usually picking the actual winner as the 2nd or 3rd place finisher.

Infrastructure:

The project infrastructure largely involved data collection from various sources:
- Sean Lahman's baseball database: batting and pitching statistics from 1871 to 2015
- Fangraphs.com: records of advanced statistics on all players
- Baseball-reference.com: a traditional baseball encyclopedia
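As a sketch of how CSV exports from these sources can be read into memory for linking and querying (the filenames and column names here are illustrative, not our exact schema):

```python
import csv

def load_table(path):
    """Read one source CSV into a list of row dicts keyed by column name."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def index_by(rows, key):
    """Index rows by a column (e.g. the Lahman playerID) for joining sources."""
    return {row[key]: row for row in rows}
```

Indexing every source by the same player ID turns cross-source joins into dictionary lookups rather than repeated name matching.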
The Lahman baseball data sets were the primary source for our statistics and querying, as they contain unique player IDs that guarantee matching data for a single player. Some of the data, like Fangraphs', had to be scraped with Python code that automatically exported CSV files (Fangraphs includes an export button on its website). The Lahman and Baseball-reference data sets, though downloadable, required cleaning and validation to remove blanks, resolve duplicates, and process improperly formatted rows and columns. The Baseball-reference.com data set, for example, would intermittently repeat the header names as extraneous rows within the data. Once the data was cleaned, we had to link records across all three sources, mapping each player name to the unique player ID included in the Lahman data set. We additionally had to match the names of different statistics and in some cases write conversion code by hand. For example, IPouts in the Lahman data set is the number of outs a pitcher recorded in a given year, while IP is the number of innings pitched; to recover the traditional Innings Pitched (IP) statistic, we divided IPouts by three to get whole innings and expressed the remainder (the leftover outs) in the conventional fractional notation (e.g. IPouts = 602 corresponds to 200.2 IP). Furthermore, the first pass through the data was quite time consuming and required a significant amount of manual data entry, which is not entirely reflected in the truncated ~200-player data set used for the final code.

Approach:

We modeled the task as a prediction problem and applied a number of regression algorithms, inputting 2015 data, outputting predictions for 2016, and comparing results. Primarily, we applied linear regression, support vector regression, and a neural
net. This combination of methods allowed us to explore both linear and nonlinear fits to the data, with varying degrees of specificity in our predictions. The neural net, for instance, tended to overfit the training data, and we found our best performance using support vector regression.

Baseline and Oracle

For our baseline implementation, we took historical batting numbers over the last 5 years and computed the raw average of each statistic (Hits, At Bats, RBI, Runs, Stolen Bases, Games) over those years. We then calculated the percent difference between these averages and the actual statistics. The baseline performed relatively poorly, since it did not account for the number of games played or injuries: it yielded ~60% median error [(actual - predicted) / actual] across statistics, with some, like stolen bases, exceeding 100% deviation. [1]

[1] For instance, a player predicted to steal 20 bases in a season but getting injured and stealing 1 base would have ~2000% error.

For our oracle implementation, we extrapolated mid-year statistics, multiplying them by each player's career second-half averages starting from the halfway point of the season, in order to predict end-of-season totals. We thresholded on players with a minimum of 100 at bats. The model yielded roughly 70% accuracy.

Feature selection and implementation choices with peripheral statistics:

In 2015, Fangraphs introduced Baseball Info Solutions contact-strength ratings data [2], more advanced batted-ball statistics intended to provide additional analysis of player performance.

[2] http://www.fangraphs.com/blogs/instagraphs/new-batted-ball-stats/

For batters, Fangraphs provides GB/FB, LD%, FB, IFFB, HR/FB, IF%, Pull%, Cent%, Opp%,
Soft%, Med%, Hard%. In particular, these batted-ball statistics about how and where a ball is hit factor heavily into other statistics like batting average (AVG) and runs (R). In certain cases they deviate heavily from the norm and lend insight into key performance statistics. For instance, if a batter's HR/FB ratio is 50%, then half of his fly balls are home runs, an unsustainable rate that may be inflating his Home Run, AVG, Run, and RBI totals. For pitchers, Fangraphs provides K/9, BB/9, BABIP, LOB%, GB%, HR/FB, GB/FB, Balls, Strikes, Total Pitches, Pull%, Cent%, Opp%, Soft%, Med%, Hard%. Batted-ball and peripheral statistics allow us to diagnose a pitcher's performance. Traditional pitcher statistics like ERA or Wins can often be skewed by flukey peripherals (e.g. a high concentration of softly batted fluke hits despite the pitcher actually performing well). Unfortunately, since these peripheral statistics were introduced relatively recently, their predictions were not especially effective or well correlated with player performance. As Bradley Woodrum writes in a study for The Hardball Times, "The tools we have for evaluating and predicting hitter performance are still growing... When we're tempted to cite batted ball data, we need to be more careful." [3] With our smaller sample of 200 players, we found that historical player performance was actually a more effective approach. Nonetheless, these peripheral statistics provided valuable insight into which input features would have the most impact on our predictions.

[3] http://www.hardballtimes.com/offensive-batted-ball-statistics-and-their-optimal-uses/

Optimization and Tweaking:

Naturally, the players with the least historical data proved the most difficult to predict. Rookies, injured players, and newer players (e.g. with only one or two full seasons)
lacked the data quality of veteran players. We accounted for this by normalizing our predictions against the number of games each player played: games started for starting pitchers, total games for batters. We projected each player's statistics over a full season and then compared our statistics to the actual number of games played; in doing so, we essentially removed games played as a prediction factor. In terms of implementation, this meant scaling each batter's statistics in proportion to the number of games in a regular 162-game season, with a similar scheme for pitchers. We also applied add-1 smoothing to our prediction numbers, to avoid division by zero and to normalize players who were not predicted to post high stats in a given season but might otherwise skew the error. The motivation we discovered for this: if a player had zero home runs during an injury-shortened ten-game season, we would want to scale that player's statistics up toward a full season, but multiplying zero by any factor still yields zero. Uniform add-1 smoothing avoids this. These normalizations produced a marked improvement across each of our regressions in the final prediction results.

Literature Review

A number of different baseball projection systems exist, with applications both similar to and different from our own.
FiveThirtyEight [4]

Nate Silver's team at ESPN's FiveThirtyEight runs 50,000 simulations of the season to predict team records and playoff, division, and World Series results. The simulations account for traditional stats, starting pitchers, travel distance, and rest, and the probabilities are updated after each game. The team scrapes game-by-game data to build an Elo-based rating system and predictive model. The main ingredient is a score and rating maintained for each starting pitcher, on which they base Monte Carlo simulations that play out the season thousands of times; each simulation updates a team's rating, adding bonuses if the team makes the playoffs or the World Series.

[4] http://fivethirtyeight.com/features/how-our-2016-mlb-predictions-work/

Steamer [5]

Steamer was created from a high school project collaboration. The projection system uses a weighted average of past performance, regressed toward the league average. The weights are set using regression analysis of past players, with a relatively simplistic regression.

[5] http://steamerprojections.com/blog/

In 2015,
Steamer projections performed better than those of its competitors. This system is quite similar to ours, although Steamer generates weight vectors across the entire baseball population, while we restricted our sample to ~200 players and performed regression on a per-player basis.

PECOTA [6]

PECOTA was developed by Nate Silver. It fits a given player's past performance to that of comparable MLB players using similarity scores. The primary characteristics for similarity are:
1. Basic production metrics: batting average, ISO, strikeout rate, and groundball rate
2. Usage metrics: career length and plate appearances
3. Physical attributes: height, weight, and throwing/batting handedness
4. Fielding position and starter/reliever role

PECOTA then finds the player's nearest neighbors to determine comparable players, and projects the player's future performance from the historical performance of those comparables at a similar age. Our model, by contrast, does not account for decreased performance due to age.

[6] http://www.baseballprospectus.com/glossary/index.php?search=pecota

Comparison

A projections comparison between the different systems and actuals for the 2015 season shows average error on several statistical categories (K, BB, HR, ERA). Steamer outperforms the competition in most categories, so a relatively simplistic regression can be quite effective (although this considers only the 2015 sample).
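A PECOTA-style similarity lookup can be sketched as a nearest-neighbor search over a handful of player features; the features, scaling, and distance below are illustrative and far simpler than PECOTA's actual similarity scores:

```python
import math

def similarity(p, q, features):
    """Higher is more similar: negative Euclidean distance over the features."""
    return -math.sqrt(sum((p[f] - q[f]) ** 2 for f in features))

def most_comparable(player, pool, features, k=3):
    """Return the k historical players most similar to the given player."""
    return sorted(pool, key=lambda q: similarity(player, q, features), reverse=True)[:k]
```

A projection system would then forecast the player's future seasons from the subsequent-season performance of the returned comparables, weighted by similarity.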
By comparison, our average error for K, BB, HR, and ERA on 2016 data (without normalizing over the number of games) was: K: 0.4, BB: 0.4, HR: 0.387, ERA: 0.44, which does not perform as well as the established systems. [7]

[7] http://www.beyondtheboxscore.com/2016/1/21/10795210/marcel-pecota-steamer-zips-pitchers-projections-review

Error Analysis

Each of our algorithms uses the same space complexity, building Python dictionaries to store the features and training data from which we make predictions. In practical terms, linear regression and support vector regression each run in a matter of seconds (generally eight to eleven). Running our neural network takes far longer, on the order of ten minutes, to generate all predictions.
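As a minimal, self-contained illustration of the per-player pipeline timed above (games-played normalization with add-1 smoothing, then a regression fit), the sketch below uses closed-form ordinary least squares on a single prior-year statistic. In practice we used library implementations of linear regression, SVR, and a neural network, so the function names and parameters here are illustrative:

```python
FULL_SEASON_GAMES = 162  # MLB regular-season length for batters

def normalize(stat, games, full=FULL_SEASON_GAMES):
    """Scale a counting stat to a full season, with add-1 smoothing so a
    zero total from an injury-shortened year still scales up."""
    return (stat + 1) * full / games

def fit_ols(xs, ys):
    """Ordinary least squares y = a*x + b: next-year stat from prior-year stat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def predict(model, x):
    a, b = model
    return a * x + b
```

Fitting one model per statistic per player keeps each regression tiny, which is consistent with the second-scale runtimes noted above.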
[Figure: residuals vs. predicted values for a representative statistic]

For most of our predicted statistics, plots of residual value vs. predicted statistic generally resemble the figure above. We see a healthy amount of clustering around a residual of 0, meaning our predictions were very close to correct. Across all of our methodologies, we generated the following spreadsheet:

[Table: error metrics for each statistic across all prediction methods]
This sheet allows us to compare our error metrics across all of our approaches to predicting 2016 statistics. Looking at the final results, our support vector regression was generally the most successful algorithm. As with all of our algorithms, SVR improved further when we normalized the data before running the regression, accounting for rookies, injuries, and other anomalies in the data set. For batters, SVR yielded a median percent difference under 30% in most cases. A general trend in our results is that predictions tend to be more accurate for batters than for pitchers, likely because batters play more games in a given season and therefore provide more data for prediction. Pitcher predictions are highly volatile and can be affected by error-prone defenses, strength of schedule, or poor offensive support. A number of outliers also heavily skew the data; despite our normalization efforts, mean error remains considerably skewed. Notably, though, when we aggregated statistics to create leaderboards of top performers for MVP and Cy Young selection, our results provided accurate predictions. One further point of analysis: the 2016 season, on which we predicted, was an unusual year for baseball, with a 35% increase in home runs, a 10% increase in runs relative to the previous two years, and a broken league record for strikeouts. [8] These changes can be explained by shifting trends in team strategy, such as defensive shifts and more hitters swinging with an uppercut to hit home runs. Our model was not able to take this into account, since it relied only on the statistics of previous years.
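The leaderboard aggregation behind the MVP and Cy Young case study could be sketched with a simple rank-sum scheme; this is one of several reasonable aggregations, and the player names and statistics below are illustrative:

```python
def award_leaderboard(predictions, stats, top_n=5):
    """Rank players by summed rank across predicted statistical categories;
    a lower total rank indicates a stronger award candidate."""
    totals = {player: 0 for player in predictions}
    for stat in stats:
        ordered = sorted(predictions, key=lambda p: predictions[p][stat], reverse=True)
        for rank, player in enumerate(ordered, start=1):
            totals[player] += rank
    return sorted(totals, key=totals.get)[:top_n]
```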
[8] http://www.espn.com/blog/statsinfo/post/_/id/126617/mlb-trends-more-home-runs-strikeouts-shifts-pitching-changes
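Finally, the two evaluation metrics from the Evaluation section, signed percent difference and top-20 leaderboard overlap, can be sketched as follows (the function names are ours):

```python
def percent_error(actual, predicted):
    """Signed percent difference: (actual - predicted) / actual."""
    return (actual - predicted) / actual

def top_k_overlap(actual, predicted, k=20):
    """Fraction of the actual top-k players (by one statistic) that also
    appear in the predicted top-k for that statistic."""
    top_actual = set(sorted(actual, key=actual.get, reverse=True)[:k])
    top_pred = set(sorted(predicted, key=predicted.get, reverse=True)[:k])
    return len(top_actual & top_pred) / k
```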