Predicting Season-Long Baseball Statistics
Stanford CS 221
By Brandon Liu and Bryan McLellan


Task Definition

Though handwritten baseball scorecards have become obsolete, baseball remains at its core a statistical goldmine, full of well-kept statistics and measurables. Every aspect of the game is televised, tracked, and stored, from a player's home run total to the speed, movement, and type of each pitch in a given at bat. Combining traditional statistics with these more recently tracked measurements presents interesting opportunities to apply machine learning algorithms to predict player performance over season-long data. Accurate player predictions can lend insight into team performance and into individual awards like the MVP (Most Valuable Player) and Cy Young Awards. These predictive insights also suggest a model for player valuation: how does the value a player provides his team compare to what he is paid in salary?

This project applies various machine learning algorithms to predict several mainstream baseball statistics for active MLB batters and pitchers, including, but not limited to, RBIs, Runs, HRs, and AVG for batters, and W, L, ERA, BB, and H for pitchers.

Evaluation

In evaluating our results, we ultimately truncated our data set to the top 200 active MLB batters and starting pitchers; relief pitchers change roles frequently, so their data was not a good fit.

For any given statistic, our prediction engine takes in statistics from one year and outputs predictions for the next. Our evaluation consisted of inputting 2015 data, outputting 2016 predictions, and comparing them against actuals. Our primary evaluation metrics were the raw difference from actual and the percent difference, (actual - predicted) / actual. As a less stringent error metric, attempting to predict the top 20 players in each category gave us ~45% accuracy, deflated slightly by the 3-4 rookies who reached the top of the statistical categories in 2016. As an additional case study, we attempted to predict the Cy Young and MVP award winners for each league by aggregating the leaders in our predicted statistical categories. The predictions were quite realistic: they selected viable candidates and usually ranked the actual winner second or third.
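To make these two metrics concrete, here is a minimal sketch of the error computations, assuming predictions and actuals are stored as dicts mapping player IDs to a single statistic (the dict layout and function names are ours for illustration, not the exact structures in our code):

```python
# Assumed layout: player ID -> value of one statistic, for one season.

def percent_difference(actual, predicted):
    """Per-player percent difference: (actual - predicted) / actual."""
    return {pid: (actual[pid] - predicted[pid]) / actual[pid]
            for pid in actual if pid in predicted and actual[pid] != 0}

def top_n_accuracy(actual, predicted, n=20):
    """Fraction of the actual top-n leaders found in the predicted top n."""
    top_actual = set(sorted(actual, key=actual.get, reverse=True)[:n])
    top_predicted = set(sorted(predicted, key=predicted.get, reverse=True)[:n])
    return len(top_actual & top_predicted) / n
```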

Infrastructure

The project infrastructure largely involved data collection from various sources:

- Sean Lahman's baseball database: batting and pitching statistics from 1871 to 2015
- Fangraphs.com: records of advanced statistics on all players
- Baseball-reference.com: a traditional baseball encyclopedia

The Lahman data sets were the primary source for our statistics and querying, as they contain unique player IDs that let us guarantee matching data for a single player. Some of the data, such as Fangraphs', had to be scraped with Python code that drove the site's CSV export (Fangraphs includes an export button on its website). The Lahman and Baseball-reference data sets, though downloadable, required cleaning and validation to remove blanks, resolve duplicates, and process improperly formatted rows and columns; the Baseball-reference.com data, for example, intermittently repeated its header names as extraneous rows within the data. Once the data was cleaned, we had to link the three data sources, mapping each player name to the unique player ID from the Lahman data set. We also had to match the names of different statistics and, in some cases, write conversion code by hand. For example, IPouts in the Lahman data set is the number of outs a pitcher recorded in a given year; to recover the traditional Innings Pitched (IP) statistic, we divided IPouts by three, with the remainder giving the additional outs. Finally, the first iteration through the data set was quite time consuming and required a significant amount of manual data entry, which is not entirely reflected in the truncated ~200-player data set used for the final code.
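As a concrete example of these hand-written conversions, here is a minimal sketch of the IPouts-to-IP calculation described above (the function name is ours for illustration):

```python
def ipouts_to_ip(ipouts):
    """Convert Lahman's IPouts (total outs recorded) into traditional
    Innings Pitched notation, where .1 and .2 mean one and two extra outs."""
    innings, extra_outs = divmod(ipouts, 3)
    return innings + extra_outs / 10

print(ipouts_to_ip(602))  # 200 full innings plus 2 outs -> 200.2
```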

Approach

We modeled the task as a prediction problem and applied a number of regression algorithms, inputting 2015 data, outputting predictions for 2016, and comparing results. Primarily, we applied linear regression, support vector regression, and a neural net. This combination of methods allowed us to explore both linear and nonlinear fits to the data, and to attempt various degrees of specificity in our predictions. The neural net tended to overfit the training data, and we found our best performance with support vector regression.

Baseline and Oracle

For our baseline implementation, we took historical batting numbers over the last five years and computed the raw average of each statistic (Hits, At Bats, RBI, Runs, Stolen Bases, Games) over those years, then calculated the percent difference between these averages and the actual statistics. The baseline performed relatively poorly, since it did not account for the number of games played or injuries: it yielded ~60% median error, (actual - predicted) / actual, with some statistics such as stolen bases exceeding 100% deviation from expected. (For instance, a player predicted to steal 20 bases who instead got injured and stole one base would show ~2000% error.)

For our oracle implementation, we extrapolated from mid-year statistics, multiplying them by each player's career second-half averages from the halfway point of the season onward to predict end-of-season totals. We thresholded on players with a minimum of 100 at bats. This model yielded roughly 70% accuracy.
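A minimal sketch of the baseline computation, assuming each player's recent history is stored as a list of per-year stat dicts (a hypothetical layout, not the exact structure in our code):

```python
# Assumed layout: `seasons` holds one player's last five years as a
# list of per-year stat dicts keyed by statistic abbreviation.

STATS = ["H", "AB", "RBI", "R", "SB", "G"]

def baseline_projection(seasons):
    """Raw average of each statistic over the given seasons."""
    return {s: sum(year[s] for year in seasons) / len(seasons) for s in STATS}

def percent_error(actual, projected):
    """Baseline evaluation: (actual - predicted) / actual per statistic."""
    return {s: (actual[s] - projected[s]) / actual[s]
            for s in STATS if actual[s] != 0}
```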

Feature Selection and Implementation Choices with Peripheral Statistics

In 2015, Fangraphs introduced Baseball Info Solutions' contact-strength ratings, a set of more advanced batted-ball statistics for analyzing player performance (http://www.fangraphs.com/blogs/instagraphs/new-batted-ball-stats/). For batters, Fangraphs provides GB/FB, LD%, FB, IFFB, HR/FB, IF%, Pull%, Cent%, Opp%, Soft%, Med%, and Hard%. These batted-ball statistics, describing how and where a ball is hit, factor heavily into other statistics like batting average (AVG) and runs (R). In certain cases they deviate heavily from the norm and lend insight about key performance statistics: if a batter's HR/FB ratio is 50%, then half of his fly balls are home runs, an unsustainable rate that may be inflating his Home Runs, AVG, Runs, and RBI statistics. For pitchers, Fangraphs provides K/9, BB/9, BABIP, LOB%, GB%, HR/FB, GB/FB, Balls, Strikes, Total Pitches, Pull%, Cent%, Opposite%, Soft%, Medium%, and Hard%. These batted-ball and peripheral statistics let us diagnose a pitcher's performance, since traditional statistics like ERA or Wins can be skewed by fluky peripherals (e.g., a high concentration of softly batted fluke hits despite otherwise pitching well).

Unfortunately, because these peripheral statistics were introduced relatively recently, they were neither the most effective predictors nor the best correlated with player performance. As Bradley Woodrum writes in a study for The Hardball Times, "The tools we have for evaluating and predicting hitter performance are still growing... When we're tempted to cite batted ball data, we need to be more careful" (http://www.hardballtimes.com/offensive-batted-ball-statistics-and-their-optimal-uses/). With our smaller sample of 200 players, we found that historical player performance was actually a more effective approach. Nonetheless, these peripheral statistics provided valuable insight into which input features would have the most impact on our predictions.
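One simple way to quantify that kind of insight, sketched here purely as an illustration (the array layout and this particular correlation ranking are our assumptions, not the exact procedure from our experiments), is to rank candidate peripheral features by their absolute correlation with the target statistic:

```python
# Illustrative only: rank candidate peripheral statistics by the absolute
# Pearson correlation between each feature column and the target statistic.
# X is assumed to be a (players x features) array; `names` labels its columns.

import numpy as np

def rank_features(X, y, names):
    corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return sorted(zip(names, corrs), key=lambda t: t[1], reverse=True)

# e.g. rank_features(X, next_year_avg, ["LD%", "HR/FB", "Hard%", "Pull%"])
```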

Optimization and Tweaking

Naturally, the players with the least historical data proved the most difficult to predict: rookies, injured players, and newer players (with only one or two full seasons) lacked the data quality of veteran players. We accounted for this by normalizing our predictions against the number of games played by each player: games started for starting pitchers, total games for batters. We projected each player's statistics over a full season and then compared against the actual number of games he played; in doing so, we essentially removed games played as a prediction factor. In terms of implementation, this meant scaling each player's statistics in proportion to the number of games in a regular season.

We also applied add-1 smoothing to our prediction numbers, both to avoid division by zero and to normalize players who were not predicted to post high stats in a given season but might otherwise skew error. The motivating example we discovered: if a player had zero home runs during a season in which he was injured and played only ten games, we would want to scale his statistics up toward a full season, but multiplying zero by any factor still yields zero. Uniform add-1 smoothing avoids this. These normalizations produced large improvements across each of our regressions in the final prediction results.
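A minimal sketch of this normalization, assuming counting statistics in a dict and a 162-game batter season (the names and layout are ours for illustration; starting pitchers would use games started and their own full-season count instead):

```python
FULL_SEASON_GAMES = 162  # batter schedule; pitchers would differ

def normalize_to_full_season(counting_stats, games_played):
    """Scale counting stats to a full season. Add-1 smoothing keeps a
    zero (e.g. 0 HR in a 10-game injury season) from staying zero."""
    scale = FULL_SEASON_GAMES / max(games_played, 1)
    return {stat: (value + 1) * scale for stat, value in counting_stats.items()}
```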

Literature Review

A number of baseball projection systems exist, with applications both similar to and different from our own.

FiveThirtyEight: Nate Silver's team at ESPN's FiveThirtyEight forecasts team records and playoff, division, and World Series results from 50,000 simulations of the season (http://fivethirtyeight.com/features/how-our-2016-mlb-predictions-work/). The simulations account for traditional stats, starting pitchers, travel distance, and rest, and the probabilities are updated after each game. FiveThirtyEight scrapes game-by-game data to build an Elo-based rating system and predictive model. The main contributor is a score and rating maintained for each starting pitcher, on which they base Monte Carlo simulations that play out the season thousands of times; each simulation updates a team's rating, adding bonuses if the team makes the World Series or the playoffs.

Steamer: Steamer began as a high school project collaboration (http://steamerprojections.com/blog/). The projection system uses a weighted average of past performance, regressed toward the league average, with the weights set by a relatively simple regression analysis of past players. In 2015, Steamer projections performed better than other competitors'. This system is quite similar to ours, although Steamer generates weight vectors across the entire baseball population, while we minimized our sample to ~200 players and performed regression on a per-player basis.
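A minimal sketch of a Steamer-style projection for a single statistic; the recency weights and regression strength below are placeholder values, since Steamer fits its actual weights by regression analysis:

```python
def project(past_seasons, league_avg, weights=(5, 4, 3), regression=0.3):
    """past_seasons: the player's values, most recent season first."""
    pairs = list(zip(weights, past_seasons))
    weighted = sum(w * s for w, s in pairs) / sum(w for w, _ in pairs)
    # Regress the weighted average part of the way toward league average.
    return (1 - regression) * weighted + regression * league_avg

# e.g. a batting-average projection from three past seasons
print(project([0.310, 0.280, 0.295], league_avg=0.255))  # ~0.284
```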

PECOTA: PECOTA was developed by Nate Silver (http://www.baseballprospectus.com/glossary/index.php?search=pecota). It fits a given player's past performance to that of comparable MLB players using similarity scores. The primary characteristics for similarity are:

1. Basic production metrics, such as batting average, ISO, strikeout rate, and groundball rate
2. Usage metrics: career length and plate appearances
3. Physical attributes: height, weight, and throwing/batting handedness
4. Fielding position and role (starting vs. relief pitcher)

PECOTA then finds the nearest neighbors of the player to determine comparables, and projects the player's future performance from the historical performance of those comparable players at a similar age range; a nearest-neighbor lookup like the sketch below is one way such comparables can be found. Our model, by contrast, does not account for age-related decline in performance.
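A minimal sketch of the comparables lookup, assuming each player is encoded as a numeric feature vector of the characteristics above and using plain Euclidean distance (PECOTA's actual similarity scores are more involved than this):

```python
import numpy as np

def nearest_comparables(player_vec, historical_vecs, k=10):
    """Indices of the k historical players most similar to player_vec."""
    dists = np.linalg.norm(historical_vecs - player_vec, axis=1)
    return np.argsort(dists)[:k]

# A projection would then be built from how those k comparable players
# performed in the seasons following the matched age.
```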

Comparison

A projection comparison between the different systems and actuals for the 2015 season shows the average error on different statistical categories (K, BB, HR, ERA); Steamer outperforms the competition in most categories (http://www.beyondtheboxscore.com/2016/1/21/10795210/marcel-pecota-steamer-zips-pitchers-projections-review). Relatively simple regression, then, can be quite effective, although this considers only the 2015 sample. By comparison, our average error on 2016 data (without normalizing over the number of games) is K: 0.4, BB: 0.4, HR: 0.387, ERA: 0.44, which does not perform as well as the established systems.

Error Analysis

Each of our algorithms has the same space complexity: each builds Python dictionaries to store the features and training data on which we base our predictions. In practical terms of running time, linear regression and support vector regression finish in a matter of seconds (generally eight to eleven), while the neural network takes far longer, on the order of ten minutes to generate all predictions.
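For reference, one plausible scikit-learn setup for the three model families we compared (the library choice and hyperparameters here are illustrative, not the exact ones from our experiments):

```python
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

models = {
    "linear": LinearRegression(),
    "svr": SVR(kernel="rbf", C=1.0),
    "neural_net": MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000),
}

# X_train: feature rows from 2015 stats; y_train: a 2016 target statistic.
# for name, model in models.items():
#     model.fit(X_train, y_train)
#     predictions[name] = model.predict(X_test)
```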

[Figure: residuals vs. predicted values for a representative statistic]

For most of our predicted statistics, plots of residual value vs. predicted statistic look like the figure above, with a healthy amount of clustering around a residual of 0, meaning our predictions were very close to correct. Across all of our methodologies, we generated the following spreadsheet of results:

[Table: error metrics for each algorithm across all predicted statistics]

This sheet lets us compare our error metrics across all of our approaches to predicting 2016 statistics. Looking at the final results, our support vector regression was generally the most successful algorithm. Like all of our algorithms, it improved further when we normalized the data before running the regression to account for rookies, injuries, and other anomalies in the data set. For batters, SVR yielded a median percent difference under 30% in most cases.

A general trend in our results is that predictions tend to be more accurate for batters than for pitchers. This is likely because batters play more games in a given season and therefore provide more data for prediction, while pitcher statistics are highly volatile and can be affected by error-prone defenses, strength of schedule, or poor offensive support. A number of outliers also skew the data heavily; despite our normalization efforts, mean error remains considerably skewed. Notably, though, when we aggregated our predicted statistics into leaderboards of top performers for MVP and Cy Young selection, the results were accurate.

One further caveat: 2016, the season we predicted, was an unusual year for baseball. It saw a 35% increase in home runs, a 10% increase in runs over the past two years, and a broken league record for strikeouts (http://www.espn.com/blog/statsinfo/post/_/id/126617/mlb-trends-more-home-runs-strikeouts-shifts-pitching-changes). These shifts can be explained by changing trends in team strategy, such as defensive shifts and more hitters swinging up on the ball to hit home runs. Our model was not able to take this into account, since it had only the statistics of previous years to work with.