Pairwise Comparison Models: A Two-Tiered Approach to Predicting Wins and Losses for NBA Games


Tony Liu

Introduction

The broad aim of this project is to use the Bradley-Terry pairwise comparison model as the basis for finding strong predictive models for NBA games. The Bradley-Terry model is commonly used in sports power rankings, primarily to calculate win probabilities from teams' win/loss records. I argue, however, that using win percentages alone might not be the most effective method. For instance, it is not necessarily true that if team A has a better than even chance of beating team B, and team B has a better than even chance of beating team C, then team A has a better than even chance of beating team C. Instead, I hypothesise that a better predictive model can be built by first predicting features that are highly correlated with win rate. My model therefore takes a two-tiered approach. First, I predict the features that are predictive of win rate; then I feed those predictions into a model that has those features as the predictors and win rate as the response.

Here, I turn to Dean Oliver's Four Factors of Basketball Success. [i] Oliver argues that most of the variation in wins can be explained by Shooting, Turnovers, Rebounding and Free Throws, to which he assigns weights of 40%, 25%, 20% and 15%, respectively. In my model, I instead determine the coefficients for each factor that give the greatest predictive power. The four factors are defined as follows.
Shooting: Effective Field Goal Percentage = (Field Goals Made + 0.5*Three Pointers Made)/Field Goals Attempted
Turnovers: Turnover Percentage = Turnovers/(Field Goals Attempted + 0.44*Free Throw Attempts + Turnovers)
Rebounding: Offensive Rebound Rate = Offensive Rebounds/(Offensive Rebounds + Opposition Defensive Rebounds)
Rebounding: Defensive Rebound Rate = Defensive Rebounds/(Opposition Offensive Rebounds + Defensive Rebounds)
Free Throws: Free Throw Factor = Free Throws Made/Field Goals Attempted
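The definitions above translate directly into code. The following is a minimal sketch, assuming team-level box score totals for a single game are available; the `BoxScore` class and its field names are illustrative and not part of the original project.

```python
from dataclasses import dataclass


@dataclass
class BoxScore:
    """Team-level totals for one game (hypothetical field names)."""
    fgm: int  # field goals made
    fga: int  # field goals attempted
    tpm: int  # three pointers made
    fta: int  # free throw attempts
    ftm: int  # free throws made
    tov: int  # turnovers
    orb: int  # offensive rebounds
    drb: int  # defensive rebounds


def four_factors(team: BoxScore, opp: BoxScore) -> dict:
    """Compute the four factors for `team` in a game against `opp`."""
    return {
        "efg": (team.fgm + 0.5 * team.tpm) / team.fga,
        "tov_pct": team.tov / (team.fga + 0.44 * team.fta + team.tov),
        "orb_rate": team.orb / (team.orb + opp.drb),
        "drb_rate": team.drb / (opp.orb + team.drb),
        "ft_factor": team.ftm / team.fga,
    }
```

Because the rebound rates depend on the opponent's rebounds, both box scores are needed even when only one team's factors are being computed.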

As these values are all rates, we can predict each of the four factors for both teams. Let's consider a game between team A and team B, and a prediction for A's Turnover Percentage. We need to know A's mean turnover percentage, the league's mean turnover percentage, and the mean turnover percentage of teams when they play against B. Given these values, I can apply the Bradley-Terry model, where the three agents are A, B and the league.

Why use the Bradley-Terry model at all? Other models that use data at a far more granular level will likely have greater potential for predictive power. I choose the Bradley-Terry model, however, for its simplicity and its potential to make sensible predictions with very limited data. The model I suggest requires only team-level data for each game, so there are far fewer features.

Methodology

There are two predictive layers in the model, and I optimise each: a model for predicting the four factors, and a model for predicting win rate from the four factors. I use the 2010-2011 NBA season as my data set. There are (82*30)/2 = 1230 games per season. I split the data set into a training set and a test set. The training set consists of the first 70% of the season and the test set of the remaining 30%, giving 861 observations in the training set and 369 observations in the test set. As a point of comparison for my model, I also tune the model that uses only the win/loss record.

Predicting the four factors

In predicting the four factors, the key parameter I consider is the number of past games used in the prediction of a single game. Since this is a purely predictive model, I can only predict a game using past games. An obvious choice would be to include every game leading up to the game being predicted. There are, however, some potential disadvantages with this method. By the time a team plays its 70th game of the season, the first twenty games might not be very predictive of the outcome of that game. Injuries and roster changes can also decrease the importance of earlier games. The alternative is a moving window, so in my model I tune the size of the window. I train and test within my original training set and calculate the test MSE. The sizes of the training and test portions within the original training set vary with the window size. Given a window size d, the training portion is essentially the set of games that take place in the league before every team has played d games. For instance, for a window size of 1, where only the previous game is used to predict the next, every team has played at least one game by the 18th game of the season, which is consistent with the 844 observations reported for that window size. I calculate the mean squared error for five different window sizes and also for the case in which I include every game leading up to the game being predicted.
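The first tier can be sketched as two small pieces: a moving-window mean over a team's most recent games, and a rule for combining the team's rate, the rate opponents record against the other team, and the league rate. The combination rule below is an assumption: the text does not give its exact formula, so this sketch uses the odds-ratio (log5-style) form that is commonly associated with Bradley-Terry comparisons. The names `RollingMean` and `combine_rates` are hypothetical.

```python
from collections import deque


def odds(p: float) -> float:
    """Convert a rate strictly between 0 and 1 into odds."""
    return p / (1.0 - p)


def combine_rates(team_rate: float, opp_allowed_rate: float, league_rate: float) -> float:
    """Odds-ratio combination of team, opponent-allowed and league rates.

    ASSUMPTION: this log5-style rule is one plausible reading of
    "Bradley-Terry with agents A, B and the league"; the source does
    not state the exact formula it used.
    """
    o = odds(team_rate) * odds(opp_allowed_rate) / odds(league_rate)
    return o / (1.0 + o)


class RollingMean:
    """Mean of the most recent `window` values (all values if window is None)."""

    def __init__(self, window=None):
        # deque(maxlen=None) is unbounded, which covers the "all games" case.
        self.values = deque(maxlen=window)

    def add(self, x: float) -> None:
        self.values.append(x)

    def mean(self) -> float:
        if not self.values:
            raise ValueError("no past games in window")
        return sum(self.values) / len(self.values)
```

One property of this rule is that when the opponent is exactly league-average, the prediction reduces to the team's own windowed rate; a window of `None` reproduces the "all games" baseline.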

Window size   Num. obs.   Rebound MSE   Turnover MSE   eFG% MSE      FT factor MSE   Sum of MSE
1             844         0.016501403   0.002960085    0.011684333   0.022131734     0.053277555
2             776         0.011073513   0.002020287    0.007479058   0.02408846      0.044661318
5             693         0.007100297   0.00142125     0.005043673   0.01293233      0.02649755
10            536         0.0063628     0.001249419    0.004432883   0.002776665     0.014821767
20            371         0.005733524   0.001195112    0.004259816   0.005780949     0.016969401
All games     844         0.00608761    0.001254227    0.004407891   0.009369296     0.021119024

From the table, the window size that performs best is 10, using the criterion of summing the errors across the four factors. The summed error decreases until size 10 before increasing again. The increase at size 20 comes from the FT factor; the rebound, turnover and eFG% errors are all slightly lower at 20 than at 10.

Predicting wins from the four factors

Now I select a model that predicts win rate from the four factors. I consider both linear and non-linear models: least squares regression, logistic regression, and regression and classification trees. I perform 10-fold cross validation on the training set to determine the 0-1 loss and/or mean squared error of each model. For the regression models, I use Point Differential as the response; for the classification models, I use the two classes win and loss. I removed Opposition Offensive Rebounds and Opposition Defensive Rebounds from the feature set because a model that includes them would have high multicollinearity, since they can be derived from the Offensive Rebounds and Defensive Rebounds features. The linear models perform best, and within both the linear and non-linear approaches the classification method outperforms its regression counterpart. I choose logistic regression for my two-tiered model.
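The cross-validation step can be sketched generically. This is a minimal illustration in plain Python rather than the tooling actually used, which the text does not name; `fit` and `predict` are hypothetical callables standing in for whichever classifier is being evaluated.

```python
import random


def k_fold_indices(n: int, k: int, seed: int = 0):
    """Shuffle 0..n-1 and split into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_zero_one_loss(X, y, fit, predict, k: int = 10, seed: int = 0) -> float:
    """Pooled 0-1 loss over all held-out predictions.

    `fit(X_train, y_train)` returns a model and
    `predict(model, X_test)` returns a list of predicted labels.
    """
    errors = 0
    for fold in k_fold_indices(len(y), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in fold])
        # Count misclassified held-out games in this fold.
        errors += sum(p != y[i] for p, i in zip(preds, fold))
    return errors / len(y)
```

Each game is held out exactly once, so the returned value is the fraction of games misclassified by a model that never saw them during fitting.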
10-Fold Cross Validation Results

Model                 MSE          abs(y_hat - y)   0-1 Loss
Least squares         9.84896582   2.54035922       0.04298316
Logistic regression   n/a          n/a              0.03716921
Regression tree       74.90737     6.877177         0.2078856
Classification tree   n/a          n/a              0.1962978

Predicting wins from Win/Loss record only

I tune the window size for the model that predicts win rate using only the win/loss record. Here I use the 0-1 loss to find the optimal parameter, which appears to be a window size of 20. Using every game prior to the game being predicted gives only an insignificantly smaller error, so I opt for the simpler model that requires less information.
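For reference, the Bradley-Terry win probability for this single-tier model can be written out. The text does not state how team strengths are obtained from the windowed record, so the following is a sketch under one common assumption: each team's strength \pi_T is the odds of its win rate w_T over the last d games.

```latex
\[
P(A \text{ beats } B) = \frac{\pi_A}{\pi_A + \pi_B},
\qquad
\pi_T = \frac{w_T}{1 - w_T}
\]
\[
P(A \text{ beats } B)
  = \frac{w_A (1 - w_B)}{w_A (1 - w_B) + w_B (1 - w_A)}
\]
```

The second line follows from the first by multiplying the numerator and denominator by (1 - w_A)(1 - w_B). Under this assumption the expression is undefined when both teams' win rates are 0 or both are 1, which can happen for small windows such as d = 1; the text does not say how such cases were handled.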

Window size   Num. obs.   0-1 Loss
1             844         0.4490521
2             776         0.4379562
5             693         0.4007732
10            536         0.3708514
20            371         0.3451493
All games     844         0.3414948

The final predictive models

To determine whether the two-tiered approach performs better than the model that uses only the win/loss record, I compare the following two models.

1. A two-tiered model that uses a window of 10 games to predict the four factors and then uses those predicted four factors to predict wins through the logistic model.
2. A single-tier model that predicts wins using only the win/loss record with a window of 20 games.

Results

I refit the logistic model on the entire training set and then predict the four factors and win rates for the test set of 369 observations. Similarly, I predict win rate with the single-tier model on the same set. The two-tiered approach has a lower test error than the single-tier model, with a 0-1 loss of 0.360 against 0.385.

Model                  0-1 Loss    Correct Guesses   Total Games
Two-tier model         0.3604336   236               369
Single-tier win/loss   0.3848238   227               369

We can compare these error rates with more established models that use player statistics to predict wins. Omidiran's paper [ii] compares the performance of different models based on players' Adjusted Plus-Minus (APM) scores, which measure a player's overall contribution to the point differential. He uses the same season for his data set, training on the first 410 games before predicting the final 820 games of the season. His dummy model, which considers only home court advantage, had a 0-1 loss of 0.4024. The least squares model achieved a 0-1 loss of 0.4073 and the ridge regression model 0.3732. His Subspace Prior Regression models achieved a 0-1 loss of under 0.3. It is therefore interesting that the two-tiered model is at least comparable to, if not a better predictor of wins than, several of the plus-minus models, while requiring far less information. It must be noted, however, that a primary motivation of APM models is to measure player performance.

Conclusion

The results of this project are encouraging for several reasons. First, there is reasonable evidence that indirectly predicting wins, as in the two-tiered approach, could be a more successful paradigm for modelling NBA games. Second, the Bradley-Terry model can be successfully applied to statistics beyond wins. Third, the number of past games considered when predicting a game is an important consideration. Based on my findings, a sample size between 10 and 20 games seems to be optimal.

[i] http://www.basketball-reference.com/about/factors.html
[ii] Omidiran, Dapo. Low-Dimensional Models for PCA and Regression. Diss. U of California at Berkeley, 2013.