Trends in Baseball Scoring & Strikeouts, 1962-2014
Geoffrey Holland
ECON 5341 Advanced Data Analysis
16 November 2015
Background
- Statistics are an intrinsic part of baseball
- Series under study are Runs Scored and Strikeouts
- Data from the Lahman Baseball Database
  - Contains individual player records back to 1871
  - Nearly 100,000 player-seasons available
- Time period: 1962-2014
  - Year of the Pitcher, 1968
  - Steroid Era, 1990s-2000s
  - Second Year of the Pitcher, 2010
- Data points normalized to team average for a season
  - Work stoppages (1972, 1981, 1994)
  - Changing number of teams (18 in 1961, 30 in 2014)
  - 162-game season (adopted by AL in 1961, NL in 1962)
- Examine both league-wide and individual teams
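The normalization described above can be sketched as follows. This is a minimal illustration, assuming each team-season total is scaled to a 162-game schedule and the league figure is the mean across teams; the function names and the sample numbers are hypothetical and are not taken from the Lahman data.

```python
def per_162(total, games_played, full_season=162):
    """Scale a season total to a full-length schedule (assumed normalization)."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    return total * full_season / games_played


def league_team_average(team_values):
    """Mean of per-team values, so the figure does not grow with the number of teams."""
    return sum(team_values) / len(team_values)


# Hypothetical team-seasons from a shortened year: (runs, games played)
teams = [(500, 108), (450, 108), (520, 110)]
normalized = [per_162(runs, games) for runs, games in teams]
print(round(league_team_average(normalized), 1))
```

Averaging per team rather than summing keeps seasons with 18 teams comparable to seasons with 30, and the per-162 scaling handles seasons shortened by work stoppages.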
Figure: Runs & Strikeouts, League Average (series R and SO; y-axis 0 to 1,400). Annotation: 1968, Year of the Pitcher.
Figure: Runs & Strikeouts, League Average (series R and SO; y-axis 0 to 1,400). Annotation: Steroid Era.
Figure: Runs & Strikeouts, League Average (series R and SO; y-axis 0 to 1,400). Annotation: 2010, Second Year of the Pitcher.
Hypotheses
1. Runs and Strikeouts will demonstrate autocorrelation
   - Good players tend to stay in the league, so positive performance begets positive performance
2. Both series will be stationary
   - Baseball seasons are finite
   - Runs and Strikeouts are count variables, not continuous
3. Individual teams will have higher-order autocorrelation due to the persistence of exceptional players
Figure: Runs & Strikeouts, League Average (series R and SO; y-axis 0 to 1,400).
Runs appears stationary, but SO may not be.
ACF & PACF
Figure (four panels, lags 0 to 25):
- Autocorrelations of so (Bartlett's formula for MA(q) 95% confidence bands)
- Partial autocorrelations of so (95% confidence bands, se = 1/sqrt(n))
- Autocorrelations of r (Bartlett's formula for MA(q) 95% confidence bands)
- Partial autocorrelations of r (95% confidence bands, se = 1/sqrt(n))
Underlying process appears to be AR(1)
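For readers who want to see what the autocorrelation panels measure, here is a minimal sketch of the sample autocorrelation function and the 1/sqrt(n) band quoted for the partial autocorrelation plots. The function names and the sample series are hypothetical; the wider Bartlett bands used for the ACF panels are not reproduced here.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelations r_1..r_max_lag (standard biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def white_noise_band(n, z=1.96):
    """Approximate 95% band half-width using se = 1/sqrt(n)."""
    return z / math.sqrt(n)


# Hypothetical league-average runs for a handful of seasons
runs = [700, 720, 715, 690, 705, 730, 725, 710]
for lag, r in enumerate(sample_acf(runs, 3), start=1):
    flag = "outside band" if abs(r) > white_noise_band(len(runs)) else "inside band"
    print(lag, round(r, 3), flag)
```

A series whose autocorrelations decay gradually while only the first partial autocorrelation stands out is the usual signature of an AR(1) process.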
OLS Results - Runs
- t-statistic 6.93 > 2.678 (99% critical value)
- β < 1, so Runs is stationary
OLS Results - Strikeouts
- t-statistic 28.43 > 2.678 (99% critical value)
- But β > 1, so Strikeouts is non-stationary
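The two OLS slides read stationarity off the slope β. A minimal sketch of such a regression follows, assuming the specification is y_t = α + β·y_{t-1} + ε_t (the slides do not show the exact specification). The function name and the sample series are hypothetical.

```python
import math


def fit_ar1(series):
    """OLS of y_t on a constant and y_{t-1}; returns (alpha, beta, t_stat_beta)."""
    x = series[:-1]  # lagged values
    y = series[1:]   # current values
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = y_bar - beta * x_bar
    resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - 2)  # residual variance
    se_beta = math.sqrt(s2 / sxx)
    # A perfect fit has zero standard error; report an infinite t-statistic.
    t_stat = beta / se_beta if se_beta > 0 else math.copysign(math.inf, beta)
    return alpha, beta, t_stat


# Hypothetical series, not the actual league data
runs = [700, 720, 715, 690, 705, 730, 725, 710, 695, 715]
alpha, beta, t_stat = fit_ar1(runs)
print(round(beta, 3), round(t_stat, 2))
```

The t-statistic here tests β = 0, that is, whether the first lag matters at all. A formal unit-root test such as ADF instead tests β = 1 and uses its own critical values.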
First Difference of Strikeouts
Figure (three panels):
- D.so by yearid, 1960-2020
- Autocorrelations of D.so, lags 0 to 25 (Bartlett's formula for MA(q) 95% confidence bands)
- Partial autocorrelations of D.so, lags 0 to 25 (95% confidence bands, se = 1/sqrt(n))
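First differencing replaces each observation with its change from the previous season. A minimal sketch, using a hypothetical strikeout series rather than the actual data:

```python
def first_difference(series):
    """Return the year-over-year changes y_t - y_{t-1}."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


# Hypothetical upward-trending strikeout totals
so = [1100, 1125, 1160, 1175, 1210]
print(first_difference(so))
```

If a series is I(1), its level wanders but its first difference fluctuates around a stable mean, which is why the differenced series is the one examined for remaining autocorrelation.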
VAR Model
Conclusions
- Runs and Strikeouts will demonstrate autocorrelation
  - First lag only
  - High turnover
  - Difficult to sustain performance
  - Too many structural breaks
- X Both series will be stationary
  - Runs is stationary because β < 1
  - Strikeouts is non-stationary, I(1)
  - KPSS rejects null; ADF and PP fail to reject at α = 0.01
  - Probably a coincidence due to late-year outliers, but interesting nonetheless
Forecasting Method
1. Normalize Runs and Strikeouts for a full 162-game season
2. Regress Runs and Strikeouts on a trend term
   1. Save detrended Runs and Strikeouts
   2. Save constants and coefficients from the OLS trend regression
3. Calculate straight-line 2015 Runs and Strikeouts from the trend regression coefficient and constant
4. Create a VAR model from detrended Runs and Strikeouts
   1. Lag selection using AIC, HQIC, and SBIC. Lag length varies by team.
5. Use the VAR model to predict Runs and Strikeouts above or below trend in 2015
6. Add the straight-line regression to the VAR-predicted delta for the final adjusted forecast
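The steps above can be sketched end to end in a few lines. This is an illustration under stated assumptions, not the presentation's actual code: the lag length is fixed at 1 rather than chosen by AIC, HQIC, or SBIC, the VAR has no constant, the input is assumed to be already normalized to 162 games, and all function names and data values are hypothetical.

```python
def ols_trend(years, values):
    """Fit values = a + b * year by OLS; return (a, b)."""
    n = len(years)
    x_bar = sum(years) / n
    y_bar = sum(values) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(years, values))
         / sum((x - x_bar) ** 2 for x in years))
    return y_bar - b * x_bar, b


def fit_var1(u, v):
    """Fit a VAR(1) without constant to two detrended series.

    Returns rows (c_u, c_v) so that each series at t is approximated by
    c_u * u[t-1] + c_v * v[t-1].
    """
    lag_u, lag_v = u[:-1], v[:-1]
    suu = sum(a * a for a in lag_u)
    svv = sum(b * b for b in lag_v)
    suv = sum(a * b for a, b in zip(lag_u, lag_v))
    det = suu * svv - suv * suv
    rows = []
    for target in (u[1:], v[1:]):
        stu = sum(t * a for t, a in zip(target, lag_u))
        stv = sum(t * b for t, b in zip(target, lag_v))
        # Solve the 2x2 normal equations by Cramer's rule.
        rows.append(((stu * svv - stv * suv) / det,
                     (stv * suu - stu * suv) / det))
    return rows


def forecast_next(years, runs, strikeouts):
    """Adjusted one-step forecast: straight-line trend plus VAR(1) delta."""
    series = (runs, strikeouts)
    trends = [ols_trend(years, s) for s in series]
    detrended = [[y - (a + b * x) for x, y in zip(years, s)]
                 for (a, b), s in zip(trends, series)]
    coeffs = fit_var1(detrended[0], detrended[1])
    last_u, last_v = detrended[0][-1], detrended[1][-1]
    next_year = years[-1] + 1
    return tuple(
        (a + b * next_year) + (c_u * last_u + c_v * last_v)
        for (a, b), (c_u, c_v) in zip(trends, coeffs)
    )


# Hypothetical per-162 league averages, not the actual data
years = list(range(2005, 2015))
runs = [760, 780, 775, 750, 745, 710, 695, 700, 675, 660]
so = [1020, 1055, 1070, 1095, 1120, 1145, 1150, 1210, 1225, 1250]
print([round(x) for x in forecast_next(years, runs, so)])
```

Splitting the forecast into a trend part and a VAR part keeps the long-run drift in the OLS line and lets the VAR capture only the short-run deviations that the two series share.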
Figure: Runs & Strikeouts, League Average (series R and SO; y-axis 0 to 1,400).
League Avg Forecast, 1 Lag

Figure: R (detrended) and SO (detrended) by yearid, 2010-2020, with dynamic forecasts from 2015 (frstat, fsostat) and 95% lower and upper bounds for each.

2015 Forecast                 Strikeouts     Runs
OLS Straight Line Forecast         1,123      768
VAR Delta                            126     (82)
Adjusted Forecast                  1,249      687
Actual                             1,248      688
Delta from Forecast                    1      (1)
Delta % from Forecast               0.1%    -0.2%
Runs and SO by Team

Figure (four panels by team, 1960-2020): Los Angeles Angels, Baltimore Orioles, Pittsburgh Pirates, Texas Rangers. Series: Runs Scored per 162, Strikeouts per 162.
Rangers Forecast, 1 Lag

Figure: R (detrended) and SO (detrended) by year, 2010-2015, with dynamic forecasts from 2015 (frstat, fsostat).

2015 Forecast                 Strikeouts     Runs
OLS Straight Trend Forecast        1,084      873
VAR Delta                             49     (79)
Adjusted Forecast                  1,133      794
Actual                             1,233      751
Delta from Forecast                  100     (43)
Delta % from Forecast               8.8%    -5.4%
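The rows of the Rangers table follow from simple arithmetic, sketched below. The sketch assumes the delta is actual minus adjusted forecast and that the percentage is taken relative to the adjusted forecast, which reproduces the Rangers figures; the function name is hypothetical.

```python
def forecast_summary(straight_line, var_delta, actual):
    """Return (adjusted forecast, delta from forecast, delta % from forecast)."""
    adjusted = straight_line + var_delta
    miss = actual - adjusted
    return adjusted, miss, round(100 * miss / adjusted, 1)


print(forecast_summary(1084, 49, 1233))   # Rangers strikeouts -> (1133, 100, 8.8)
print(forecast_summary(873, -79, 751))    # Rangers runs       -> (794, -43, -5.4)
```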
Angels Forecast, 2 Lags

Figure: R (detrended) and SO (detrended) by year, 2010-2015, with dynamic forecasts from 2015 (frstat, fsostat) and 95% lower and upper bounds for each.

2015 Forecast                 Strikeouts     Runs
OLS Straight Trend Forecast        1,007      811
VAR Delta                             58     (69)
Adjusted Forecast                  1,065      742
Actual                             1,150      661
Delta from Forecast                   85     (81)
Delta % from Forecast               8.0%   -10.9%
Pirates Forecast, 3 Lags

Figure: R (detrended) and SO (detrended) by year, 2010-2015, with dynamic forecasts from 2015 (frstat, fsostat).

2015 Forecast                 Strikeouts     Runs
OLS Straight Trend Forecast        1,170      671
VAR Delta                            186     (69)
Adjusted Forecast                  1,356      602
Actual                             1,322      697
Delta from Forecast                 (34)       95
Delta % from Forecast              -2.5%    15.8%
Orioles Forecast, 4 Lags

Figure: R (detrended) and SO (detrended) by year, 2010-2015, with dynamic forecasts from 2015 (frstat, fsostat).

2015 Forecast                 Strikeouts     Runs
OLS Straight Trend Forecast        1,010      758
VAR Delta                              -      (9)
Adjusted Forecast                  1,010      749
Actual                             1,331      713
Delta from Forecast                  321     (36)
Delta % from Forecast              31.8%    -4.8%
Acknowledgements
- Lahman Baseball Database copyright 1996-2015 Sean Lahman. Used with permission under Creative Commons License 3.0. http://www.seanlahman.com/baseball-archive/statistics/
- This presentation was created with a design template from SmileTemplates.com. http://www.smiletemplates.com/powerpoint-templates/baseball/00405/
Questions?