Standing Between a Bayesian and a Frequentist: An Empirical Bayes Exploration of Movies, Baseball, and Long Beach Basketball

Arthur Berg
Pennsylvania State University


Bayesian and Frequentist Representatives
Rev. Thomas Bayes FRS (1702-1761)
English mathematician, Presbyterian minister
$P(H \mid E) = \dfrac{P(E \mid H)\,P(H)}{P(E)}$
Sir Ronald Fisher FRS (1890-1962)
English statistician, evolutionary biologist, geneticist
"Let the data speak for itself."

Bayes Estimator as a Convex Combination
1st Goal: List the top 250 movies of all time.
- Movies are rated on a scale of 1 to 10.
- Some movies are rated by many people, and some by only a few.
- Movies with fewer than 3000 votes are not considered.
- The average rating across all movies is $C = 6.9$.
- $\mu_i$ represents the mean rating by everyone who has seen movie $i$.
- The real goal is to construct the best estimate of $\mu_i$, then pick the top 250.
The frequentist approach uses only $\bar{X}_i$, the average rating for movie $i$:
$\hat{\mu}_i^{(\text{Fisher})} = \bar{X}_i$
The Bayesian approach shrinks $\bar{X}_i$ towards $C$, with more shrinking applied when the number of votes for movie $i$ is small:
$\hat{\mu}_i^{(\text{Bayes})} = \alpha_i \bar{X}_i + (1 - \alpha_i)\,C$, where $\alpha_i \in (0, 1)$

Internet Movie Database Top 250
Rank  WR   R    Title                                      Votes
1     9.2  9.2  The Shawshank Redemption (1994)            546,155
2     9.1  9.2  The Godfather (1972)                       427,961
3     9.0  9.0  The Godfather: Part II (1974)              257,643
4     8.9  9.0  The Good, the Bad and the Ugly (1966)      170,045
5     8.9  9.0  Pulp Fiction (1994)                        436,456
6     8.9  8.9  Inception (2010)                           265,531
7     8.9  8.9  Schindler's List (1993)                    289,170
8     8.9  8.9  12 Angry Men (1957)                        126,983
9     8.8  8.9  One Flew Over the Cuckoo's Nest (1975)     225,419
10    8.8  8.9  The Dark Knight (2008)                     487,800
85    8.5  8.7  Black Swan (2010)                          20,326
142   8.2  8.3  Avatar (2009)                              285,005
240   8.0  8.5  True Grit (2010)                           6,444
(WR = weighted rating, R = raw average rating)

IMDb Weighted Ranking ("a true Bayesian estimate")
$WR_i = \dfrac{v_i R_i + m C}{v_i + m} = \underbrace{\dfrac{v_i}{v_i + m}}_{\alpha_i} R_i + \underbrace{\dfrac{m}{v_i + m}}_{1 - \alpha_i} C$
- $R_i$ = average rating of movie $i$ (this is $\bar{X}_i$)
- $v_i$ = total number of votes from regular voters
- $m$ = minimum number of votes to make the list = 3000
- $C$ = grand mean across all movies in the database = 6.9
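As a quick check on the weighted-rating formula, the sketch below recomputes WR for a few rows of the Top 250 table using m = 3000 and C = 6.9 from the slide. The function name `weighted_rating` is an illustrative choice, not IMDb's, and because the table's R values are already rounded to one decimal, agreement with the published WR is only approximate.

```python
def weighted_rating(votes, avg_rating, min_votes=3000, grand_mean=6.9):
    """Shrink a movie's average rating toward the grand mean C.

    alpha = v / (v + m) is the weight on the movie's own average,
    so movies with few votes are pulled more strongly toward C.
    """
    alpha = votes / (votes + min_votes)
    return alpha * avg_rating + (1 - alpha) * grand_mean


# (title, votes v_i, raw average R_i) copied from the Top 250 table
rows = [
    ("The Shawshank Redemption (1994)", 546155, 9.2),
    ("Black Swan (2010)", 20326, 8.7),
    ("True Grit (2010)", 6444, 8.5),
]

for title, votes, raw in rows:
    print(f"{title}: R = {raw}, WR = {weighted_rating(votes, raw):.1f}")
```

True Grit has the fewest votes of the three, so its rating moves the most: half a point, versus essentially no change for The Shawshank Redemption.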

A Bayesian Calculation
$X_i = (X_{i,1}, \ldots, X_{i,v_i})$ represents the $v_i$ ratings of movie $i$.
prior: $\mu_i \sim N(\mu_0, \sigma_0^2)$
conditional: $X_{i,j} \mid \mu_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma^2)$ $(j = 1, \ldots, v_i)$
$\hat{\mu}_i^{(\text{Bayes})} = E[\mu_i \mid X_i] = \left(\dfrac{v_i}{v_i + \sigma^2/\sigma_0^2}\right) \bar{X}_i + \left(\dfrac{\sigma^2/\sigma_0^2}{v_i + \sigma^2/\sigma_0^2}\right) \mu_0 = \dfrac{v_i}{v_i + m} R_i + \dfrac{m}{v_i + m} C$
with $\mu_0 = C$ and $m = \sigma^2/\sigma_0^2$.
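The slide states the posterior mean without the intermediate step. A short derivation under the same normal prior and likelihood is sketched here; it is the standard conjugate-normal calculation rather than something taken from the slides.

```latex
\begin{align*}
p(\mu_i \mid X_i)
  &\propto \exp\!\left\{-\frac{(\mu_i - \mu_0)^2}{2\sigma_0^2}\right\}
     \prod_{j=1}^{v_i} \exp\!\left\{-\frac{(X_{i,j} - \mu_i)^2}{2\sigma^2}\right\} \\
  &\propto \exp\!\left\{-\frac{1}{2}\left(\frac{1}{\sigma_0^2} + \frac{v_i}{\sigma^2}\right)\mu_i^2
     + \left(\frac{\mu_0}{\sigma_0^2} + \frac{v_i \bar{X}_i}{\sigma^2}\right)\mu_i\right\},
\end{align*}
so the posterior is normal with mean
\[
E[\mu_i \mid X_i]
  = \frac{\mu_0/\sigma_0^2 + v_i \bar{X}_i/\sigma^2}{1/\sigma_0^2 + v_i/\sigma^2}
  = \frac{v_i}{v_i + \sigma^2/\sigma_0^2}\,\bar{X}_i
    + \frac{\sigma^2/\sigma_0^2}{v_i + \sigma^2/\sigma_0^2}\,\mu_0 ,
\]
where the last step multiplies numerator and denominator by $\sigma^2$.
```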

1. Does shrinking really help?
2. How much to shrink by?
$\text{Prediction Error} = \sum_{i=1}^{n} (\mu_i - \hat{\mu}_i)^2$

Standing Between a Bayesian and a Frequentist
In 1956, Charles Stein proved the existence of an estimator better than the sample mean under certain assumptions.
In 1961, Willard James and Charles Stein explicitly constructed such an estimator.

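The James-Stein Estimator ($n \ge 4$)
$\mu_i \sim N(\mu_0, \sigma_0^2)$
$X_i \mid \mu_i \overset{\text{iid}}{\sim} N(\mu_i, \sigma^2)$ $(i = 1, \ldots, n)$
$\hat{\mu}_i^{(\text{Bayes})} = E[\mu_i \mid X_i] = \underbrace{\left(\dfrac{\sigma^2}{\sigma_0^2 + \sigma^2}\right)}_{\alpha} \mu_0 + \underbrace{\left(\dfrac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\right)}_{1 - \alpha} X_i$
$\hat{\mu}_i^{(\text{JS})} = \underbrace{\left(\dfrac{(n-3)\sigma^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}\right)}_{\hat{\alpha}} \bar{X} + \underbrace{\left(1 - \dfrac{(n-3)\sigma^2}{\sum_{j=1}^{n} (X_j - \bar{X})^2}\right)}_{1 - \hat{\alpha}} X_i$
In practice, if $\sigma^2$ is unknown, an estimate is used.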
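To make the James-Stein formula concrete, here is a minimal pure-Python sketch together with a small simulation from the normal model on this slide. The function names, the seed, and the simulation settings (n = 500, sigma_0 = 1, sigma = 2) are illustrative assumptions, not values from the talk. The sketch does not clip the shrinkage weight at 1, as the positive-part variant would.

```python
import random


def james_stein(x, sigma2):
    """James-Stein estimates that shrink each x[i] toward the grand mean.

    sigma2 is the (assumed known) sampling variance of each observation.
    """
    n = len(x)
    if n < 4:
        raise ValueError("shrinking toward the grand mean needs n >= 4")
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    alpha_hat = (n - 3) * sigma2 / ss  # estimated weight on the grand mean
    return [alpha_hat * xbar + (1 - alpha_hat) * xi for xi in x]


def prediction_error(mu, estimates):
    """Sum of squared differences between true means and estimates."""
    return sum((m - e) ** 2 for m, e in zip(mu, estimates))


# Simulation under the slide's model; all settings here are illustrative.
rng = random.Random(0)
n, mu0, sigma0, sigma = 500, 0.0, 1.0, 2.0
mu = [rng.gauss(mu0, sigma0) for _ in range(n)]   # true means from the prior
x = [rng.gauss(m, sigma) for m in mu]             # one noisy observation each

ml_err = prediction_error(mu, x)                  # use X_i as is
js_err = prediction_error(mu, james_stein(x, sigma ** 2))
print(f"unshrunk error: {ml_err:.1f}, James-Stein error: {js_err:.1f}")
```

With noise this large relative to the spread of the true means, the shrunken estimates should have a markedly smaller total squared error, which speaks to the question of whether shrinking helps.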

Predicting Batting Averages
2nd Goal: Predict final batting averages from pre-season performances.
- Pre-season batting averages for 18 major league players are provided.
- Season final batting averages for the same players are also recorded.
- The data are from the 1970 season and were published by Efron and Morris in JASA (1975) and Scientific American (1977).
The frequentist approach uses only $X_i$, the pre-season batting average for player $i$:
$\hat{p}_i^{(\text{Fisher})} = X_i$
The Empirical Bayes approach shrinks $X_i$ towards $\bar{X}$ by some empirically determined amount:
$\hat{p}_i^{(\text{Stein})} = \hat{\alpha} X_i + (1 - \hat{\alpha}) \bar{X}$, where $\hat{\alpha} \in (0, 1)$

#   Name         Hits/AB  Pre-season ($\hat{\mu}^{(\text{ML})}$)  Season final ($\mu$)
1   Clemente     18/45    0.400                 0.346
2   Robinson     17/45    0.378                 0.298
3   Howard       16/45    0.356                 0.276
4   Johnstone    15/45    0.333                 0.222
5   Berry        14/45    0.311                 0.273
6   Spencer      14/45    0.311                 0.270
7   Kessinger    13/45    0.289                 0.263
8   Alvarado     12/45    0.267                 0.210
9   Santo        11/45    0.244                 0.269
10  Swoboda      11/45    0.244                 0.230
11  Unser        10/45    0.222                 0.264
12  Williams     10/45    0.222                 0.256
13  Scott        10/45    0.222                 0.303
14  Petrocelli   10/45    0.222                 0.264
15  Rodriguez    10/45    0.222                 0.226
16  Campaneris   9/45     0.200                 0.286
17  Munson       8/45     0.178                 0.316
18  Alvis        7/45     0.156                 0.200
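The table above is enough to try the shrinkage estimator directly. The sketch below applies the James-Stein weights to the 18 pre-season averages and compares total squared prediction error against the season-final averages. One labeled assumption: the sampling variance is approximated by the binomial variance at the grand mean, X̄(1 - X̄)/45, which is a simplification chosen here and not necessarily the variance treatment used in the talk or by Efron and Morris.

```python
AT_BATS = 45

# Pre-season hits and season-final averages, copied from the table above.
hits = [18, 17, 16, 15, 14, 14, 13, 12, 11, 11, 10, 10, 10, 10, 10, 9, 8, 7]
season_final = [0.346, 0.298, 0.276, 0.222, 0.273, 0.270, 0.263, 0.210, 0.269,
                0.230, 0.264, 0.256, 0.303, 0.264, 0.226, 0.286, 0.316, 0.200]

pre_season = [h / AT_BATS for h in hits]
n = len(pre_season)
grand_mean = sum(pre_season) / n

# ASSUMPTION: common sampling variance taken as the binomial variance
# evaluated at the grand mean.
sigma2 = grand_mean * (1 - grand_mean) / AT_BATS

ss = sum((x - grand_mean) ** 2 for x in pre_season)
weight_on_mean = (n - 3) * sigma2 / ss  # James-Stein weight on the grand mean
shrunk = [weight_on_mean * grand_mean + (1 - weight_on_mean) * x
          for x in pre_season]


def prediction_error(estimates):
    """Sum of squared differences from the season-final averages."""
    return sum((m - e) ** 2 for m, e in zip(season_final, estimates))


ml_err = prediction_error(pre_season)
js_err = prediction_error(shrunk)
print(f"weight on grand mean: {weight_on_mean:.3f}")
print(f"prediction error, pre-season averages: {ml_err:.4f}")
print(f"prediction error, shrunken averages:   {js_err:.4f}")
```

Under this variance assumption most of the weight goes to the grand mean, so every player's estimate is pulled strongly toward the group average.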

Batting Average Dataset
[Figure: 1977 Batting Averages Dataset (Efron). Pre-season and season-final batting averages (vertical axis, 0.0 to 0.4) for players 1 to 18.]

James-Stein Estimation of Batting Averages
[Figure: 1977 Batting Averages Dataset (Efron). Pre-season and season-final batting averages (vertical axis, 0.0 to 0.4) for players 1 to 18.]

Ranking Bias
Empirical Bayes + Order Statistics
[Figure: 1977 Batting Averages Dataset (Efron). Pre-season and season-final batting averages (vertical axis, 0.0 to 0.4) for players 1 to 18.]
Genome-wide association studies
- SNPs: AA/Aa/aa or 0/1/2 ($10^7$)
- Estimated effects of the top SNPs are biased up (winner's curse).
- Ranking bias estimator: part frequentist, part Bayesian, with robust properties.
- Applied to 2 GWAS studies with 2,000 cases and 3,000 controls: Crohn's Disease and Type 1 Diabetes.

49ers Statistics
http://www.longbeachstate.com/

Opponents Over 3 Seasons (08-09, 09-10, 10-11)
opponent               #
alaska anchorage       1
arizona state          1
boise state            1
byu cougars            1
byu hawaii             1
cal poly               7
cal state fullerton    6
cal state northridge   6
clemson                2
cs monterey bay        1
duke                   1
green bay              2
idaho                  1
idaho state            1
iowa                   1
kentucky               1
loyola marymount       2
montana                1
montana state          1
new mexico state       1
north carolina         1
notre dame             1
oregon                 1
pacific                8
pepperdine             2
saint mary's           1
saint peter's          1
san diego state        1
san francisco state    1
syracuse               1
temple                 1
texas                  1
uc davis               6
uc irvine              6
uc riverside           6
uc santa barbara       7
ucla                   1
univ. san francisco    1
utah state             2
washington             1
weber state            2
west virginia          1
wisconsin              1

Winning Percentages
All Games (number of games in parentheses)
All 3 Seasons (93)   56%
08-09 Season (30)    50%
09-10 Season (33)    52%
10-11 Season (30)    67%
Conference Games
All 3 Seasons        67%
08-09 Season         63%
09-10 Season         50%
10-11 Season         88%

Spread = 49ers Score - Opponent Score (10-11 Season)
[Figure: spread (axis 0 to 15) for each conference opponent: UC Santa Barbara, Cal State Northridge, UC Davis, Cal Poly, UC Riverside, Cal State Fullerton, Pacific, UC Irvine.]

Over/Under = 49ers Score + Opponent Score (10-11 Season)
[Figure: over/under, i.e. total score (axis 120 to 160), for each conference opponent: UC Irvine, Cal State Fullerton, Cal State Northridge, UC Riverside, UC Davis, Pacific, UC Santa Barbara, Cal Poly.]

Conversion Formulas
$x$ = LB Score
$y$ = Opponent Score
Over/Under $= x + y$
Spread $= x - y$
$x = \dfrac{\text{Over/Under} + \text{Spread}}{2}$
$y = \dfrac{\text{Over/Under} - \text{Spread}}{2}$
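The conversion formulas can be written as a pair of small functions; a minimal sketch follows, with function names chosen here for illustration. The example inputs (over/under 161, spread 2) are illustrative values on the scale of the games discussed in the talk.

```python
def scores_from_lines(over_under, spread):
    """Recover (LB score x, opponent score y) from total and margin."""
    x = (over_under + spread) / 2
    y = (over_under - spread) / 2
    return x, y


def lines_from_scores(x, y):
    """Inverse map: (over/under, spread) from the two scores."""
    return x + y, x - y


x, y = scores_from_lines(over_under=161, spread=2)
print(x, y)                     # LB score and opponent score
print(lines_from_scores(x, y))  # round-trips to the original total and margin
```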

Predictions
Rank  Opponent               LB Score  Opp. Score  Spread  Over/Under
2     Cal Poly               66        55          11      121
3     Cal State Northridge   81        66          15      147
4     Pacific                69        68          1       136
5     UC Santa Barbara       72        55          17      126
6     Cal State Fullerton    79        71          7       150
7     UC Riverside           75        66          9       141
8     UC Irvine              82        80          2       161
      UC Davis               76        64          13      140

How good are the predictions?
Using the 09-10 season to predict the 10-11 season:
(adjusted prediction error for spread) / (unadjusted prediction error for spread) = 197 / 341 = 58%
(adjusted prediction error for over/under) / (unadjusted prediction error for over/under) = 513 / 818 = 63%
Using the 08-09 season to predict the 09-10 season:
(adjusted prediction error for spread) / (unadjusted prediction error for spread) = 150 / 194 = 78%
(adjusted prediction error for over/under) / (unadjusted prediction error for over/under) = 442 / 641 = 69%
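The four ratios can be recomputed from the error totals printed on this slide; a short sketch is below. The dictionary layout and the name `error_ratio` are illustrative. The slide's totals are presumably rounded, so a recomputed ratio can differ from the quoted percentage by about a point.

```python
def error_ratio(adjusted, unadjusted):
    """Adjusted (shrunken) prediction error as a fraction of the unadjusted one."""
    return adjusted / unadjusted


# (adjusted error, unadjusted error) as printed on the slide
errors = {
    ("09-10 predicting 10-11", "spread"): (197, 341),
    ("09-10 predicting 10-11", "over/under"): (513, 818),
    ("08-09 predicting 09-10", "spread"): (150, 194),
    ("08-09 predicting 09-10", "over/under"): (442, 641),
}

for (seasons, quantity), (adjusted, unadjusted) in errors.items():
    ratio = error_ratio(adjusted, unadjusted)
    print(f"{seasons}, {quantity}: {ratio:.1%}")
```

A ratio below 100% means the adjusted predictions had a smaller total squared error than the unadjusted ones.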

LB vs UCI Vegas Odds (as of 3am on game day)
All bets are pay $110 to win $100. Long Beach is the favorite; UCI is the underdog.
Casino       Spread  Over/Under
LV Hilton    -10     148.5
Wynn         -9.5    149
MGM Mirage   -10     NA
Predicted    -2      161
These predictions recommend betting on UCI (still expecting LB to win) and betting on over for the over/under option.
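The recommendation follows from comparing the predicted margin and total with the posted lines. The sketch below spells out that comparison, along with the break-even win probability implied by paying $110 to win $100. The function names are illustrative, margins are entered as positive points by which the favorite is expected to win (so a spread of -10 becomes 10), and ties against the line are not treated as a separate case.

```python
def break_even_probability(stake=110, payout=100):
    """Win probability at which a pay-`stake`-to-win-`payout` bet breaks even."""
    return stake / (stake + payout)


def recommend(predicted_margin, line_margin, predicted_total, line_total):
    """Pick sides by comparing predictions with the posted lines.

    Margins are favorite minus underdog, in points (positive numbers).
    """
    side = "favorite" if predicted_margin > line_margin else "underdog"
    total = "over" if predicted_total > line_total else "under"
    return side, total


# LV Hilton line from the table: LB favored by 10, over/under 148.5.
# Prediction from the table: LB by 2, total 161.
print(recommend(predicted_margin=2, line_margin=10,
                predicted_total=161, line_total=148.5))
print(f"break-even win probability: {break_even_probability():.1%}")
```

Here the underdog is UCI, so the output matches the slide's recommendation; the break-even figure shows that a bet at these terms must win a little more than half the time to pay off.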

Disclaimers:
1. I do not necessarily encourage sports betting.
2. I am not liable for any bets made based on my presentation.

Thank You!!
Beach.ArthurBerg.com
berg@psu.edu