Projecting Three-Point Percentages for the NBA Draft

Hilary Sun (hsun3@stanford.edu), Jerold Yu (jeroldyu@stanford.edu), Roland Centeno (rcenteno@stanford.edu)

December 16, 2017

1 Introduction

As NBA teams have begun to embrace data analytics, more emphasis has been placed on shooting and spacing the floor. The three-point shot, in particular, is becoming increasingly prevalent in the NBA. Players and teams are taking more threes than ever, and scouts increasingly look for prospects who will be proficient three-point shooters in the NBA [3]. The goal of this project is to project the best three-point shooters among top NCAA prospects. Using a player's NCAA and NBA statistics as inputs, along with team statistics from both leagues, our model outputs a projected NBA three-point shooting percentage for that player.

2 Related Work

Previous work has been done on building models for recruitment from the NCAA to the NBA, but most of it focuses on predicting overall NBA success from the draft, or on predicting the draft itself. Harris and Berri studied which factors affect the WNBA draft, modeling it with a Poisson regression model and a negative binomial model that incorporate three-point percentage [5]. Using data from Division I players, Bishop and Gajewski used principal component analysis, logistic regression, and cross-validation to predict a player's NBA draft potential [4].

There has also been some work evaluating the relationship between pre-NBA statistics and NBA career success. Coates and Oguntimein analyzed correlations and regressions on data from players drafted between 1987 and 1989 to see whether college statistics can predict success in the NBA, using features such as points per game, minutes per game, and free throw percentages. They concluded that there is a strong relationship between a player's college performance and his professional performance [2]. Another study did the same on data from 1988-2002, running multiple regressions for various positions; it concluded that pre-NBA statistics are related to career longevity for guards and forwards, but not for centers [1]. These studies, however, treat a player's performance holistically, whereas we focus specifically on three-point percentages.

Although little published research focuses on threes, there are fan-made statistical analyses of three-point percentages that perform linear regression on selected features. One fan concluded that a player's college statistics convey significant information about his later three-point percentage, but also that this is a difficult prediction problem [6]. We hope to extend the work on this problem.

3 Dataset and Features

3.1 Data Collection

Our dataset was scraped from basketball-reference.com and sports-reference.com/cbb. We selected 149 NCAA prospects from nbadraft.net's Top 100 Big Board lists for 2009-2016 that met the following criteria:

- The player must have attempted at least 30 threes in the NCAA.
- If the player was drafted before the 2015 NBA Draft, he must have attempted at least 300 threes in the NBA.
- If the player was drafted in 2015 or 2016, he must have attempted at least 100 threes in the NBA.

We chose players from this list because we wanted our model to be relevant to NBA scouts who want to learn more about the best prospects in the country. Additionally, this filtering narrowed our data to players who are considered regular 3-point shooters.

3.2 Feature Selection

We included the following features for each player:

- The player's total number of 3-point field goal attempts in the NCAA.
- The player's NCAA 3-point shooting percentage.
- The player's NCAA free throw shooting percentage.
- The strength of schedule of the player's NCAA team.
- The average number of threes per game the player's NCAA team took.
- The average offensive rating of the NBA team(s) the player has played for.
- The average number of threes the player's NBA team(s) took, relative to the league average.

The player's NBA 3-point shooting percentage is our output.

In selecting features, our intuition was to include not only player statistics that should correlate with the output, but also the environment surrounding the player. For example, we reasoned that if the NBA team a player played for had a good offense (i.e., a high offensive rating), he is more likely to take higher-quality shots and thus have a higher shooting percentage.

We split our dataset 80/20 between training and testing. Because many of our features are closely related, we suspected there were correlations between variables and redundant information, so we ran principal component analysis (PCA) on our training data to reduce its dimensionality. After running PCA from the sklearn library on our training data, the variance explained was as follows:

Principal Component   Variance Explained
1                     0.348
2                     0.180
3                     0.158
4                     0.136

Since the first four principal components already explain 82.273% of the variance in the data, we decided that reducing the dimensionality of our data to four features was enough, and we projected our training and test datasets into the new feature space.
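As an illustration of this step, here is a minimal sketch using scikit-learn. The array names X and y are our own placeholders for the assembled feature matrix and NBA 3-point percentage labels, not the authors' code; the random data exists only to make the sketch self-contained.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Placeholder data with the paper's shapes: 149 players, 7 input features.
# In the real pipeline, X holds the scraped features and y the NBA 3P%.
rng = np.random.default_rng(0)
X = rng.random((149, 7))
y = rng.random(149)

# 80/20 split between training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit PCA on the training data only, then project both sets onto the
# first four principal components.
pca = PCA(n_components=4)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)

print(pca.explained_variance_ratio_)        # per-component variance explained
print(pca.explained_variance_ratio_.sum())  # ~0.82 on the paper's data
```

Fitting PCA on the training split alone, then applying the same transform to the test split, keeps information from the held-out set from leaking into the learned components.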

After reducing the dimensionality of the data, we ran our dataset through k-fold cross-validation to decide which model best suited the data.

4 Methods

4.1 Linear Regression

We used linear regression as a baseline. It assigns a coefficient $\theta$ to each feature according to its effect on the labels, minimizing the residual sum of squares between the labels and the values predicted by the model:

$h(x) = \theta^T x, \quad X \in \mathbb{R}^{N \times 4},\ \theta \in \mathbb{R}^4$

We used the LinearRegression class from the sklearn library to fit this model.

4.2 Weighted Linear Regression

We used weighted linear regression to place more emphasis on points with less variance. We defined the weights as the inverse of the squared variance, so that examples with smaller error variance are assigned more weight, since they should provide relatively more information about the model. We therefore minimized:

$\sum_i w^{(i)} \left(y^{(i)} - \theta^T x^{(i)}\right)^2, \quad X \in \mathbb{R}^{N \times 4},\ \theta \in \mathbb{R}^4,\ w \in \mathbb{R}^N,\ y \in \mathbb{R}^N$

The WLS class from the statsmodels library was used to fit this regression.

4.3 Random Forest Regression

We also wanted to test whether our problem was better treated as a non-parametric regression, where the prediction is not a direct function of the features but of information derived from them. One way to do this is with decision trees, combining a number of weak models into a strong ensemble. A random forest draws random bootstrap samples of the data to construct n trees; each node then branches on the best split, chosen among a randomly selected subset of predictors at that node. The algorithm averages the predictions from the individual trees to obtain the final prediction [7], which also helps prevent overfitting. To run random forests, we used scikit-learn's machine learning Python library, which includes a random forest regressor, sklearn.ensemble.RandomForestRegressor [9]. We used mean squared error as the criterion for determining each node split:

$\mathrm{MSE} = \frac{1}{n} \sum_i \left(y_i - \hat{y}_i\right)^2$

4.4 Gradient Boosted Regression

Our other decision tree regression was gradient boosting. It assumes that the weak model at each stage of the algorithm can be improved by fitting a stronger model on top of it [8]. We again used scikit-learn, via sklearn.ensemble.GradientBoostingRegressor [9]. At each stage, we minimized the least squares loss and fit a regression tree to it:

$\mathrm{LS} = \frac{1}{2} \sum_i \left(y_i - \hat{y}_i\right)^2$

Gradient boosting, like random forests, allows us to create more complex regression models.
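To make the comparison concrete, the following is a hedged sketch of fitting these four models with the 10-fold cross-validation described in the next section. It reuses X_train_pca and y_train from the PCA sketch above; the inverse-variance weights for WLS are stubbed out with ones, since the paper does not detail how they were estimated.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# X_train_pca and y_train come from the PCA sketch above.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# 10-fold cross-validation, scored by (negated) mean squared error.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    scores = cross_val_score(model, X_train_pca, y_train, cv=cv,
                             scoring="neg_mean_squared_error")
    print(f"{name}: dev-test MSE = {-scores.mean():.3f}")

# Weighted least squares: statsmodels expects one weight per example.
# w stands in for the inverse-variance weights of Section 4.2; uniform
# weights here reduce WLS to ordinary least squares.
w = np.ones_like(y_train)
wls = sm.WLS(y_train, sm.add_constant(X_train_pca), weights=w).fit()
print(wls.params)
```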

5 Experiments, Results, and Discussion

5.1 k-fold Cross-Validation

Since we were not sure whether our data would be fit better by a simple or a complex regression, we ran k-fold cross-validation on all of the methods above, with k = 10. To choose the hyperparameters for random forest and gradient boosting, we first tuned them on the whole training set and then used those parameters in the cross-validation (see the tuning sketch at the end of this section). For random forest, we tuned the maximum tree depth over the range 4-6 using sklearn's GridSearchCV, which gave an optimal depth of 4. For gradient boosting, we tuned the learning rate and the maximum tree depth, searching learning rates from 0.000001 to 1 on a log scale and depths from 4 to 6, again with GridSearchCV. The best parameters were a learning rate of 0.0001 and a maximum depth of 5.

For each model, we calculated the average mean squared error as our metric:

$\mathrm{MSE} = \frac{1}{n} \sum_i \left(y_i - \hat{y}_i\right)^2$

We computed the average MSE on the training set and on the development test sets held out by k-fold. The following table summarizes the cross-validation results:

Model                       Training MSE   Dev Test MSE
Linear Regression           9.141          9.983
Weighted Linear Regression  9.202          9.947
Random Forest               5.276          11.792
Gradient Boosting           10.061         13.056

As weighted linear regression performed best on the development test sets, we proceeded with that model. We trained it on the complete training set and used it to predict NBA 3-point percentages for our test set.

[Figure: actual vs. predicted 3-point percentages for the players in the training and test sets.]

While many points cluster around the line where predicted value equals actual value, some outliers exist that our model failed to predict accurately.

In addition, to see which parameters were most significant in influencing the predicted NBA 3-point percentage, we normalized the theta values by the average of each parameter. The standardized thetas are as follows:

Parameter               Standardized Coefficient
ncaa_fg3                4.54E-04
ncaa_fg3a               4.54E-04
ncaa_fg3_pct            4.32E-04
ncaa_ft_pct             9.06E-04
ncaa_sos                4.63E-04
ncaa_team_fg3a_avg      4.24E-03
nba_avg_team_ortg       2.34E-03
nba_relative_team_fg3a  3.31E-04

From the relative sizes of the standardized coefficients, we can observe that the shooting volume of the player's college team, as well as the offensive rating of the NBA team he joined, had the most impact in our model. Much of a player's projected 3-point skill therefore depends on his teams' ability rather than on his individual statistics.
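The grid search described above might look roughly like the following sketch. The stated ranges (depths 4-6, learning rates 1e-6 to 1 on a log scale) come from the paper; the grid granularity and fixed random seeds are our own assumptions, and X_train_pca and y_train again come from the earlier sketches.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Random forest: tune the maximum tree depth over 4-6.
rf_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"max_depth": [4, 5, 6]},
    scoring="neg_mean_squared_error",
)
rf_search.fit(X_train_pca, y_train)
print(rf_search.best_params_)  # the paper reports max_depth = 4

# Gradient boosting: tune learning rate (log scale, 1e-6 to 1) and depth.
gb_search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={
        "learning_rate": np.logspace(-6, 0, num=7),
        "max_depth": [4, 5, 6],
    },
    scoring="neg_mean_squared_error",
)
gb_search.fit(X_train_pca, y_train)
print(gb_search.best_params_)  # reported: learning_rate = 1e-4, max_depth = 5
```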

6 Conclusion/Future Work

Though our model was able to accurately capture the shooting percentages of average three-point shooters, we found that there were too many outliers (e.g., Tony Wroten) for the model to be used reliably. It is unclear where the difficulty in catching and accounting for outliers lies. Perhaps there is a more effective method for capturing the nonlinearities in predicting three-point shooting percentages. We may also have missed key features that are highly correlated with a player's three-point shooting percentage.

With more time and data, we would first refine the model to be more accurate. From there, it would be interesting to expand the model to include international players. As the NBA expands its market globally, players like Kristaps Porzingis and Nikola Jokic have been drafted from overseas and have performed at all-star levels. In general, NBA fans know less about the abilities of international prospects than about their NCAA counterparts, so models that generate more information about them could be very insightful.

Additionally, we could expand the model to account for players who only began to shoot threes in the NBA. Since NBA teams have come to covet the 3-point shot in recent years, a higher value has been placed on big men who can shoot. Consequently, players such as Marc Gasol and Brook Lopez have extended their range beyond the arc and become proficient 3-point shooters. Including more features indicative of a player's 3-point shooting potential (e.g., midrange proficiency) could help predict which players might add the 3-point shot to their game.

7 Contributions

Jerold wrote the web-scraping code, collected the data, and wrote the gradient boosting and k-fold cross-validation code. Hilary wrote the code to split the data into training and test sets, along with the principal component analysis and random forest code. Roland wrote the linear regression and weighted linear regression code. All team members collaborated on debugging the code, analyzing the results, and writing the report.

References

[1] W. Abrams, J. Barnes, and A. Clement, "Relationship of Selected Pre-NBA Career Variables to NBA Players' Career Longevity," The Sport Journal, April 2, 2008. [Online]. Available: http://thesportjournal.org/article/relationship-of-selected-pre-nba-career-variables-to-nba-players-career-longevity/. [Accessed November 10, 2017].

[2] D. Coates and B. Oguntimein, "The Length and Success of NBA Careers: Does College Production Predict Professional Outcomes?" 2008. [Online]. Available: http://web.holycross.edu/RePEc/spe/CoatesOguntimein_NBA.pdf. [Accessed November 18, 2017].

[3] C. Gaines, "Nearly 30% of all shots in the NBA now come from behind the 3-point line," Business Insider, March 8, 2016. [Online]. Available: http://www.businessinsider.com/nba-three-point-shooting-2016-3. [Accessed November 1, 2017].

[4] B. Gajewski and T. Bishop, "Drafting a Career in Sports: Determining Underclassmen College Players' Stock in the NBA Draft," 2004. [Online]. Available: https://works.bepress.com/byron_gajewski/10/download/. [Accessed November 20, 2017].

[5] J. Harris and D. Berri, "Predicting the WNBA Draft: What Matters Most from College Performance?" 2005. [Online]. Available: http://daveberri.weebly.com/uploads/6/1/3/8/61387427/2015harrisberriijsf.pdf. [Accessed November 20, 2017].

[6] A. Johnson, "Predictions Are Hard, Especially About Three Point Shooting," Counting the Baskets, September 2014. [Online]. Available: http://counting-the-baskets.typepad.com/my-blog/2014/09/prediction-are-hard-especially-about-three-point-shooting.html. [Accessed November 18, 2017].

[7] A. Liaw and M. Wiener, "Classification and Regression by randomForest," R News, vol. 2/3, 2002. [Online]. Available: http://www.bios.unc.edu/~dzeng/BIOS740/randomforest.pdf. [Accessed December 8, 2017].

[8] A. Natekin and A. Knoll, "Gradient boosting machines, a tutorial," 2013. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/. [Accessed December 8, 2017].

[9] F. Pedregosa et al., "Scikit-learn: Machine Learning in Python," JMLR, vol. 12, pp. 2825-2830, 2011. [Accessed December 8, 2017].