Sports Predictive Analytics: NFL Prediction Model By Dr. Ash Pahwa IEEE Computer Society San Diego Chapter January 17, 2017 Copyright 2017 Dr. Ash Pahwa 1
Outline Case Studies of Sports Analytics Sports Sports Analytics Applications of Sports Analytics Sports Analytics Literature Data Sources Sports Predictive Models Regression Model Multi Variable Regression with Lasso NFL Prediction Model Prediction for Super Bowl 2016 Prediction for NFL 2017 Playoffs Copyright 2017 Dr. Ash Pahwa 2
San Diego L.A. Chargers Copyright 2017 Dr. Ash Pahwa 3
Case Studies of Sports Analytics? Copyright 2017 Dr. Ash Pahwa 4
What is Sports Analytics? Which Pitcher is Better? 2007 Baseball Jake Peavy San Diego Padres : National League John Lackey L.A. Angles : American League ERA: 2.54 ERA: 3.01 ERA: Earned Run Average. Mean number of runs yielded per 9 innings American league allows Designated Hitter (DH) for pitcher National league does not allow Designated Hitter (DH) for pitcher. Pitcher must bat. Copyright 2017 Dr. Ash Pahwa 5
Which Variable is the Best Predictor of the Winning Percentage? Copyright 2017 Dr. Ash Pahwa 6
Pythagorean Theorem Used in Baseball. Proposed by Bill James Suppose F represents a team s run scored A represents team s runs allowed Pyhtagorean Winning Percentage = F2 F 2 +A 2 It is called Pythagorean theorem because it is similar to the elementary geometry theorem Example: Year 2012: Detroit Tigers Scored Runs = F = 726 Allowed Runs = A = 670 Pyhtagorean Winning Percentage = F2 F 2 +A 2 = 7262 726 2 +670 2 = 527,076 975,976 = 0.54 Total games won = 162*0.54 = 88 games Copyright 2017 Dr. Ash Pahwa 7
How to Convey Information to Decision Makers Data Scientist Understand the Sport Understand the Players Understand Performance data Decision Makers Management Operational Personals Raw Data Selection of Statistical Models Predictive Models Meaningful Results Copyright 2017 Dr. Ash Pahwa 8
Goals of Sports Analytics 1. Apply Statistical Models to Sporting Data 2. Ratings and Rankings 3. Predictive Models 4. Player and Team Assessment Copyright 2017 Dr. Ash Pahwa 9
Statistical Models Predictive Models Indices of Central Tendencies and Variability Histogram (Frequency Distribution) Statistics Used to Examine Relationships Normal Distribution (z-values and p-values) Inferential Statistics Ratings + Rankings Normality Ratings + Rankings Predictive Models Simple Linear Regression Mean Covariance Outlier Rank Aggregation Multiple Linear Regression Median Correlation Pearson Mode Rank Correlation Spearman t-test ANOVA Range, Variance Partial Correlation Chi-Square Standard Deviation Polynomial Regression Logistic Regression Copyright 2017 Dr. Ash Pahwa 10
Ranking of Players and Teams? Copyright 2017 Dr. Ash Pahwa 11
Pair-Wise Comparison Copyright 2017 Dr. Ash Pahwa 12
Pair-Wise Comparison The Social Network Copyright 2017 Dr. Ash Pahwa 13
Pair-Wise Comparison Can be Used for Ranking Sorting the Rating list produces Ranking Copyright 2017 Dr. Ash Pahwa 14
Log 5 Method Developed by Bill James in 1970s Computes the probability that Team A will beat Team B Log 5 formula has nothing to do with the mathematical function Log Copyright 2017 Dr. Ash Pahwa 15
Log 5 Formula p a p a p b p a,b = p a + p b 2 p a p b Suppose Team A true winning percentage is 10 out of 16 games Percentage of true winning = p a = 10/16 = 0.625 Suppose Team B true winning percentage is 7 out of 16 games Percentage of true winning = p b = 7/16 = 0.438 -------------------------------------------------------------------------------------- The probability that Team A will beat Team B p a,b = p a p a p b p a +p b 2 p a p b = 0.625 0.625 0.438 = 0.625 0.274 = 0.351 = 0.681 0.625+0.438 2 0.625 0.438 1.063 2 0.274 0.515 ---------------------------------------------------------------------------------------- The probability that Team B will beat Team A p b,a = p b p a p b p a +p b 2 p a p b = 0.438 0.625 0.438 = 0.438 0.274 = 0.164 = 0.318 0.625+0.438 2 0.625 0.438 1.063 2 0.274 0.515 --------------------------------------------------------------------------------------------------- p a,b + p b,a = 1 Copyright 2017 Dr. Ash Pahwa 16
Arpad Elo Physics Professor at Marquette University Milwaukee, Wisconsin Chess Player Devised a method to rank chess players His method was adopted by US Chess Federation World Chess Federation Copyright 2017 Dr. Ash Pahwa 17
Points Gained or Lost Points gained/lost by player A = Points gained/lost by player B Before the game Player A rating r A Player B rating r B Difference d AB = r A r B After the game Player A rating r A Player B rating r B r A + r B = r A + r B Points gained/lost by player A against player B μ AB = L d AB 400 = 1 1+10 d AB 400 Copyright 2017 Dr. Ash Pahwa 18
Example Before the game Player A rating r A = 2400 Player B rating r B = 2000 Difference d AB = r A r B = 400 Difference d BA = r B r A = 400 If Player A wins Draw If Player B wins S(AB) 1 0.5 0 S(BA) 0 0.5 1 Suppose K = 32 for Chess μ AB = 0.91 μ BA = 0.09 If Player A wins After the game Player A rating r A = r A + K S AB μ AB = 2400 + 32 1 0.91 = 2403 Player B rating r B = r B + K S BA μ BA = 2000 + 32 0 0.09 = 1997 r A + r B = r A + r B 2400 + 2000 = 2403 + 1997 If Player B wins After the game Player A rating r A = r A + K S AB μ AB = 2400 + 32 0 0.91 = 2371 Player B rating r B = r B + K S BA μ BA = 2000 + 32 1 0.09 = 2029 r A + r B = r A + r B 2400 + 2000 = 2371 + 2029 Copyright 2017 Dr. Ash Pahwa 19
The Social Network Elo Formula was Used to pair-wise Comparison of Girls Copyright 2017 Dr. Ash Pahwa 20
Sports Copyright 2017 Dr. Ash Pahwa 21
Sports Inherent part of Human Culture Sports competition dates back to the dawn of our species Greek Olympics : 776 BC Copyright 2017 Dr. Ash Pahwa 22
Sportsman Spirit Virtues of sports fairness self-control courage persistence It has been associated with interpersonal concepts of treating others and being treated fairly Maintaining self-control if dealing with others, and respect for both authority and opponents GOOD SPORTSMEN HAVE ALWAYS BEEN HELD IN HIGH ESTEEM Copyright 2017 Dr. Ash Pahwa 23
Most Popular Sports in the World Copyright 2017 Dr. Ash Pahwa 24
Most Popular Sports in America Copyright 2017 Dr. Ash Pahwa 25
Sports Analytics Copyright 2017 Dr. Ash Pahwa 26
Predictive Models Estimation Regression Classification (win/loss) Logistic Regression Discriminant Analysis Linear Quadratic Support Vector Machine Copyright 2017 Dr. Ash Pahwa 27
Goals of Sports Analytics Player Discovering hidden talent in a new player Player Evaluation Assessing Player Performance Which metrics is most important to assess a players performance Assessing Player Value How much value a player adds to the teams value Copyright 2017 Dr. Ash Pahwa 28
Goals of Sports Analytics Team Ranking top teams Accessing Team Performance How to compute the value of a team Which Team Members are best suited to play against the opposing team Which strategy to use to play against a team? Anticipating Opponents Behavior Accessing the probability of a win in a sporting event Copyright 2017 Dr. Ash Pahwa 29
Need for Prediction Results Betting on an sporting event People betting on sports need to see the prediction results Probability of a win Point Spread Fantasy Sports DraftKings FanDuel Copyright 2017 Dr. Ash Pahwa 30
Applications of Sports Analytics Copyright 2017 Dr. Ash Pahwa 31
Application of Sports PA Movies The film is based on Michael Lewis' 2003 nonfiction book of the same name, an account of the Oakland Athletics baseball team's 2002 season and their general manager Billy Beane's attempts to assemble a competitive team. In the film, Beane (Brad Pitt) and assistant GM Peter Brand (Jonah Hill), faced with the franchise's limited budget for players, build a team of undervalued talent by taking a sophisticated sabermetric approach towards scouting and analyzing players. They acquire "submarine" pitcher Chad Bradford (Casey Bond) and former catcher Scott Hatteberg (Chris Pratt), and win 20 consecutive games, an American League record. Copyright 2017 Dr. Ash Pahwa 32
Team Selection: Moneyball Copyright 2017 Dr. Ash Pahwa 33
Moneyball Billy Beane & Paul DePodesta William Lamar "Billy" Beane III (born March 29, 1962) is an American former professional baseball player and current front office executive. He is the Executive Vice President of Baseball Operations and minority owner of the Oakland Athletics of Major League Baseball (MLB). The character of Brand is an invention by the filmmakers; in the excellent Michael Lewis non-fiction book upon which the movie is based, the real-life Brand is identified as Paul DePodesta. Unlike Brand, DePodesta is slender, fit and handsome. He s also Harvard-educated (not a Yalie screenwriter Aaron Sorkin s private joke). Copyright 2017 Dr. Ash Pahwa 34
Application of Sports PA Boston Red Sox Using Predictive Analytics Strategies Boston Red Sox won 3 world series in Baseball In 2006, Time named Bill James in the Time 100 as one of the most influential people in the world. He is a Senior Advisor on Baseball Operations for the Boston Red Sox. Copyright 2017 Dr. Ash Pahwa 35
Sports Analytics Literature Copyright 2017 Dr. Ash Pahwa 36
Predictive Models for Sports Literature Copyright 2017 Dr. Ash Pahwa 37
Statistical Learning Gareth James Daniela Witten Trevor Hastie Robert Tibshirani Stanford University Copyright 2017 Dr. Ash Pahwa 38
Data Sources Copyright 2017 Dr. Ash Pahwa 39
Data Sources Valid Accurate Complete Contains derived variables Copyright 2017 Dr. Ash Pahwa 40
Data Sources www.nfl.com www.nba.com www.footballoutsiders.com www.pro-football-reference.com www.soccerstats.com www.basketball-reference.com Copyright 2017 Dr. Ash Pahwa 41
Data Source Football www.armchairanalysis.com Copyright 2017 Dr. Ash Pahwa 42
Sports Predictive Models Copyright 2017 Dr. Ash Pahwa 43
Goals of Predictive Analytics Application: Estimation or Classification Estimation Regression modeling technique is used Output is a number House price Product sales for next quarter GNP growth for the next quarter How many points a team will score Classification Logistic Regression Support Vector Machine Discriminant Analysis (Linear, Quadratic) Naïve Bayes, Decision Trees etc. modeling techniques are used Output is a categorical variable Sports team will win or lose Email is junk or not Which grade student will get Tweet is positive or negative Copyright 2017 Dr. Ash Pahwa 44
Common PA Techniques Regression Linear 2 variables Linear multi variables Logistic Polynomial Clustering Decision Trees Neural Networks Naïve Bayes ARIMA A few more Copyright 2017 Dr. Ash Pahwa 45
Regression Model Copyright 2017 Dr. Ash Pahwa 46
History of Linear Regression Sir Francis Galton and Karl Pearson Developed the concepts on Regression and Correlation in 1900-1930 Copyright 2017 Dr. Ash Pahwa 47
Definition: Linear Regression 2 Variable Regression How a response variable y changes y x 1 c As the predictor (explanatory) variable x changes Multiple Regression How a response variable y changes As the predictor (explanatory) variables x1, x2, xn change y x x x... 1 1 2 2 3 3 x n n c Copyright 2017 Dr. Ash Pahwa 48
Single Variable Polynomial Regression First degree to Fifth Degree The 5 th degree polynomial regression looks good for training set data But with test data it may not perform well PROBLEM: Over fit for this data Copyright 2017 Dr. Ash Pahwa 49
Overfitting If there exists a model with estimated parameters w such that Training error (w2) < Training Error (w1) Generalization (Test) error (w2) > Generalization (Test) Error (w1) Copyright 2017 Dr. Ash Pahwa 50
Bias and Variance Goal is to find out a model complexity where Generalization (Validation) errors are least Bias + variance are least Mean Square Error = Bias 2 + Variance Just like Generalization Error We cannot compute Bias and Variance Copyright 2017 Dr. Ash Pahwa 51
1 st Degree Polynomial Fit y = a 1 x + c > #result = lm(y1~x) > result = lm(y1 ~ poly(x,1,raw=true)) > summary(result) Call: lm(formula = y1 ~ poly(x, 1, raw = TRUE)) Residuals: Min 1Q Median 3Q Max -0.9872-0.7277-0.1394 0.7842 1.2692 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 0.1919 0.4285 0.448 0.662 poly(x, 1, raw = TRUE) -0.0492 0.1120-0.439 0.668 Residual standard error: 0.8449 on 12 degrees of freedom Multiple R-squared: 0.01581, Adjusted R-squared: -0.0662 F-statistic: 0.1928 on 1 and 12 DF, p-value: 0.6684 > p = predict(result,list(x=xpredict)) > lines(xpredict,p,col='black',lwd=3) > Copyright 2017 Dr. Ash Pahwa 52
16 th Degree Polynomial Fit y = a 1 x 16 + a 2 x 15 + + a 15 x 2 + a 16 x + c > result = lm(y1 ~ poly(x,16,raw=true)) > summary(result) Call: lm(formula = y1 ~ poly(x, 16, raw = TRUE)) Residuals: ALL 14 residuals are 0: no residual degrees of freedom! Coefficients: (3 not defined because of singularities) Estimate Std. Error t value Pr(> t ) (Intercept) 1.820e-01 NA NA NA poly(x, 16, raw = TRUE)1 1.280e+02 NA NA NA poly(x, 16, raw = TRUE)2-7.656e+02 NA NA NA poly(x, 16, raw = TRUE)3 1.973e+03 NA NA NA poly(x, 16, raw = TRUE)4-2.895e+03 NA NA NA poly(x, 16, raw = TRUE)5 2.675e+03 NA NA NA poly(x, 16, raw = TRUE)6-1.631e+03 NA NA NA poly(x, 16, raw = TRUE)7 6.712e+02 NA NA NA poly(x, 16, raw = TRUE)8-1.869e+02 NA NA NA poly(x, 16, raw = TRUE)9 3.447e+01 NA NA NA poly(x, 16, raw = TRUE)10-3.944e+00 NA NA NA poly(x, 16, raw = TRUE)11 2.282e-01 NA NA NA poly(x, 16, raw = TRUE)12 NA NA NA NA poly(x, 16, raw = TRUE)13-5.901e-04 NA NA NA poly(x, 16, raw = TRUE)14 NA NA NA NA poly(x, 16, raw = TRUE)15 1.468e-06 NA NA NA poly(x, 16, raw = TRUE)16 NA NA NA NA Residual standard error: NaN on 0 degrees of freedom Multiple R-squared: 1, Adjusted R-squared: NaN F-statistic: NaN on 13 and 0 DF, p-value: NA Copyright 2017 Dr. Ash Pahwa 53
Under fit or Over fit Model # Degree of Polynom ial Under fit or Over fit 1 1 Under fit 2 2 Under fit 3 4 Under fit 4 8 OK 5 10 Over fit 6 16 Over fit Copyright 2017 Dr. Ash Pahwa 54
Lasso Regression Least Absolute Shrinkage and Selection Operator Copyright 2017 Dr. Ash Pahwa 55
Cost Function of OLS + Ridge + Lasso OLS Ridge Regression Lasso Regression Copyright 2017 Dr. Ash Pahwa 56
NFL Model Copyright 2017 Dr. Ash Pahwa 57
Data Source Seasons 2000 2016 Weeks 1 21 Week#1 #17: 16 games + 1 bye Week#18: 4 games: Wild card Week#19: 4 games: Divisional Playoff Week#20: 2 games: Conference Championship Week#21: 1 game: Super Bowl Copyright 2017 Dr. Ash Pahwa 58
Data Tables Arm Chair Data 26 Tables Arm Chair Game Data Play Data Team Data External Source Stadium Data (Wikipedia) City Coordinates (Wikipedia) City GDP (Wikipedia + US Govt.) Copyright 2017 Dr. Ash Pahwa 59
Sonny Moore s Computer Rating http://sonnymoorepowerratings.com/archive Copyright 2017 Dr. Ash Pahwa 60
USA Today: Jeff Sagarin http://www.usatoday.com/sports/nfl Copyright 2017 Dr. Ash Pahwa 61
NFL Prediction Model Result Copyright 2017 Dr. Ash Pahwa 62
Comparison of Errors Sonny Moore & Jeff Sagarin Copyright 2017 Dr. Ash Pahwa 63
Super Bowl 2016 Prediction by Nate Silver Copyright 2017 Dr. Ash Pahwa 64
Super Bowl 2016 Prediction by A+ Copyright 2017 Dr. Ash Pahwa 65
Super Bowl 2016 All Predicted Models were Wrong Panthers Broncos Nate Silver 59% 41% A+ 56.5% 43.5% Copyright 2017 Dr. Ash Pahwa 66
2017 NFL Prediction Wild Card : Playoff Predictions for all the 4 games were 100% correct Copyright 2017 Dr. Ash Pahwa 67
2017 NFL Prediction Divisional Title: Playoff Predictions 2 correct 2 incorrect 50% correct www.nflprediction.co Copyright 2017 Dr. Ash Pahwa 68
2017 NFL Prediction Conference Championship: Playoff Atlanta Falcons vs Green Bay Packers Atlanta Falcons 61% New England Patriots vs Pittsburgh Steelers New England Patriots 70% Copyright 2017 Dr. Ash Pahwa 69
2017 NFL Prediction Super Bowl New England Patriots vs Atlanta Falcons New England Patriots 60% Copyright 2017 Dr. Ash Pahwa 70
Summary Sports Sports Analytics Applications of Sports Analytics Sports Analytics Literature Data Sources Sports Predictive Models Regression Model Multi Variable Regression with Lasso NFL Prediction Model Prediction for Super Bowl 2016 Prediction for NFL 2017 Playoffs Copyright 2017 Dr. Ash Pahwa 71
UCSD Extension Courses Winter 2017 Copyright 20172Dr. Ash Pahwa
UCSD Extension courses Winter 2017 Winter 2017 Dates Online Sports Predicted Analytics (7 weeks) January 30 March 19, 2017 Copyright 20173Dr. Ash Pahwa
Course UCSD Sports Predictive Analytics Content L# Date Subject 1 01/30/17 Introduction to Sports Analytics 1.1 What is Sports Analytics? 1.2 Tools for Sports Analytics 1.3 Science of Learning from Data 1.4 Basic Stat : Data Types + Histograms + Std Dev 2 02/06/17 Statistical Methods Applied on Sports Data 2.1 Normal Distribution 2.2 Correlation 2.3 Rank and Partial Correlation 3 02/13/17 Central Limit Theorem + Hypothesis Testing 3.1 CLT + Parameter Estimation 3.2 Hypothesis Testing (*) 4 02/20/17 Inference Stat: Chi-Sq+ANOVA+Model Selec 4.1 Chi-Square (*) 4.2 ANOVA 4.3 Stat Model Selection (*) 5 02/27/17 Ratings and Rankings + Rank Aggregation 5.1 Ratings and Rankings (Elo System) 5.2 Rank Aggregation (Borda Voting) 6 03/06/17 Prediction Using Regression Model 6.1 Introduction to Regression 6.2 Regression 2 Variables 6.3 Regression Multi Variables 7 03/13/17 Prediction Using Logistic Regression Model 7.1 Logistic Regression Mathematics 7.2 Logistic Regression Mechanics 7.3 Multivariable Logistic Regression Copyright 2017 Dr. Ash Pahwa 74