Data Science Final Project


Hunter Johns

Introduction

At its most basic, basketball gives each team two objectives: score as many times as possible, and hold the opposing team to as few made shots as possible. Basketball shares this scheme with baseball (albeit with runs instead of field goals), which has undergone a data-driven transformation into an analytical sport. Basketball, however, features far more interactions between players than baseball does, and thus has resisted the kind of data revolution that baseball saw with the work of theoretician Bill James and his executor Billy Beane. Short of developing a sabermetrics for basketball, the purpose of my final project is to see how well the most important interaction in basketball, the one between the shooter and their defender, can be abstracted.

Since it came out in 2012, I've been a fan of the video game NBA 2K13. As I've played the game, I've become increasingly interested in the way its creators modeled how basketball works, specifically the way they abstracted that shooter-defender interaction. As far as I can tell without reading the source code, the probability of a shot being made depends most on the shooter's ability from the specific place on the floor they are shooting from, the defender's ability in that same place, and how well the defender defends the shooter. (The game does make an effort to introduce player-specific modifiers to replicate real-life player behavior, but I'm choosing to disregard these in favor of the more basic shooter-defender interaction.)

Enter my data set: a Kaggle set of every shot taken by an NBA player in the 2014-15 season, a total of 128,069 entries. The set features variables like CLOSE_DEF_DIST, the distance from the shooter to the closest defender, and SHOT_DIST, the distance from the shooter to the basket.
Data Wrangling

My goal was to develop a classification model using the NBA 2K13 mode of basketball thinking, i.e. shot percentage is based only on a mixture of shooter and defender skill in addition to spatial information like the distance from shooter to basket and from shooter to defender. To begin with, I created a new variable, shot_class, which classifies a shot based on its distance to the basket. I chose the breaks in shot classification based on where a shot changes characteristics; for instance, a shot within four feet is most likely a layup (not a shot in the "jump-shot" sense). I could not find a dplyr function which would evaluate cases and assign them a variable value based on the evaluation, so I iterated over the rows using a for-loop.

    for (i in 1:nrow(shot_df)) {
      if (shot_df$shot_dist[i] < 4) {
        shot_df$shot_class[i] <- "close"
      } else if (shot_df$shot_dist[i] < 15) {
        shot_df$shot_class[i] <- "mid 1"
      } else if (shot_df$shot_dist[i] < 22) {
        shot_df$shot_class[i] <- "mid 2"
      } else if (shot_df$shot_dist[i] >= 22) {
        # >= rather than > so that a shot from exactly 22 ft counts as "long"
        shot_df$shot_class[i] <- "long"
      } else {
        shot_df$shot_class[i] <- "other"
      }
    }

    4 ft. and closer - "close" - layup or low post shot
    4-15 ft. - "mid1" (stored as "mid 1" in the code) - close jump shot, high post shot, or floater
    15-22 ft. - "mid2" (stored as "mid 2" in the code) - jump shot inside the 3pt line
    22 ft. and beyond - "long" - 3pt shot

A problem I then ran into was how to find each player's field goal percentage in each of the zones. For this purpose I created an entirely new data frame for shooters, grouping by the name of the player. I also wrangled new variables, such as average shot distance, for analysis after the classification model. [See Appendix Chunk #1]

In order to find the same data for defenders, I applied a similar method and created a separate data frame for defenders. [See Appendix Chunk #2]

In order for a machine-learning black box to create a prediction model out of the data frame with all of the season's shots, I needed to put the relevant information (shooter and defender field goal percentages) into each shot case. For instance, if a shot was in the "mid1" classification, or between four and fifteen feet, the code would reference the appropriate case in the defender data frame, access that player's defending field goal percentage for the "mid1" shot area, and apply it to the entry for the shot in question. The code would then do the same for the shooter. [See Appendix Chunk #3]

I first gave the shot data frame to the rpart function, but something about the data did not mesh with the function. I then tried the ctree function, which successfully made a classification model of the data.
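The loop-based wrangling described above can also be expressed with vectorized operations. The following is a minimal sketch in base R, not the code used for this project: it bins distances with cut(), computes each shooter's field goal percentage per zone with aggregate(), and attaches that percentage to every shot with merge(). The toy data frame and the names classify_shot and zone_fgp are assumptions made for illustration; the column names follow the for-loop above. (dplyr's case_when() covers the same case-by-case assignment.)

```r
# Toy data standing in for the real shot log (values are made up for illustration).
toy_shots <- data.frame(
  player_name = c("A", "A", "A", "B", "B", "B"),
  shot_dist   = c(2.0, 10.5, 23.7, 3.1, 3.5, 22.0),
  shot_result = c("made", "missed", "made", "missed", "made", "missed"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: bin distances into the four zones.
# right = FALSE makes the intervals [0, 4), [4, 15), [15, 22), [22, Inf).
classify_shot <- function(dist) {
  as.character(cut(dist,
                   breaks = c(0, 4, 15, 22, Inf),
                   labels = c("close", "mid 1", "mid 2", "long"),
                   right = FALSE))
}

toy_shots$shot_class <- classify_shot(toy_shots$shot_dist)
toy_shots$made <- as.integer(toy_shots$shot_result == "made")

# Field goal percentage per shooter and zone: the mean of a 0/1 indicator.
zone_fgp <- aggregate(made ~ player_name + shot_class, data = toy_shots, FUN = mean)
names(zone_fgp)[names(zone_fgp) == "made"] <- "shooter_zone_fgp"

# Attach each shooter's zone percentage to every shot by matching on name and zone.
toy_shots <- merge(toy_shots, zone_fgp, by = c("player_name", "shot_class"))
```

The same aggregate-then-merge pattern would apply to defenders by grouping on the closest-defender column instead of the shooter's name. Matching on keys rather than on row position avoids depending on the order of the per-player data frame.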
Using holdout validation, with 100,000 cases as the training set and the remaining 28,069 cases as the test set, the model correctly predicted the result of 62% of shots. [See Appendix Chunk #4]

In order to analyze the classification model, I used the model to make a prediction for each case in the set. I then assigned the result of the prediction to a new variable.

    shot_df <- shot_df %>% mutate(shot_pred = NA)
    for (i in 1:nrow(shot_df)) {
      # predict() returns a factor; the assignment below stores its integer code
      prediction <- predict(shot_pred_tree, shot_df[i,])
      shot_df$shot_pred[i] <- prediction
    }

Analysis of Model

I started by doing a visual analysis of the discrepancy between the model and the actual result of the shots. (A note: in classification graphs, 1 is a make, 2 is a miss.)

    # alpha is a constant, so it is set on the geom rather than inside aes()
    ggplot(shot_df, aes(x = close_def_dist, y = SHOT_DIST)) +
      geom_point(alpha = 0.1) +
      labs(x = "Distance to Closest Defender", y = "Distance to Basket",
           title = "Classification of Shots, Defensive Distance Vs. Shot Distance") +
      facet_wrap(~factor(shot_pred))

    ggplot(shot_df, aes(x = close_def_dist, y = SHOT_DIST)) +
      geom_point(alpha = 0.1) +
      labs(x = "Distance to Closest Defender", y = "Distance to Basket",
           title = "Result of Shots, Defensive Distance Vs. Shot Distance") +
      facet_wrap(~factor(shot_result))

Looking at the graphs, the model found the same trend towards the bottom of the graph: shots taken within about five feet of the basket with a defender separation of more than a few feet were likely to be good. However, the model predicted that shots more than five feet from the basket with a defender separation of less than about three feet were very unlikely to go in, which in practice did not appear to be true.

I did another side-by-side comparison of the model's features, this time of shooter zone FGP versus defender zone FGP.

    # alpha is a constant, so it is set on the geom rather than inside aes()
    ggplot(shot_df, aes(x = shooter_zone_fgp, y = defender_zone_fgp)) +
      geom_point(alpha = 0.1) +
      labs(x = "Shooter Zone FGP", y = "Defender Zone FGP",
           title = "Classification of Shots, Shooter Zone FGP Vs. Defender Zone FGP") +
      facet_wrap(~shot_pred)

    ggplot(shot_df, aes(x = shooter_zone_fgp, y = defender_zone_fgp)) +
      geom_point(alpha = 0.1) +
      labs(x = "Shooter Zone FGP", y = "Defender Zone FGP",
           title = "Result of Shots, Shooter Zone FGP Vs. Defender Zone FGP") +
      facet_wrap(~shot_result)

Interestingly, I did not find the same noticeable difference in the distribution of actual made versus missed shots as I found in predicted made versus missed shots. This is a failure of the model; much more than shooting ability goes into a field goal percentage, more even than could be gleaned from this data set. An example of this failure might be a player who is playing out of their most comfortable role on a team.

I also faceted by zone to see how results compared with classifications.

    # alpha is a constant, so it is set on the geom rather than inside aes()
    ggplot(shot_df, aes(x = shooter_zone_fgp, y = defender_zone_fgp,
                        color = factor(shot_pred))) +
      geom_point(alpha = 0.1) +
      labs(x = "Shooter Zone FGP", y = "Defender Zone FGP",
           title = "Classification of Shots, Shooter Zone FGP Vs. Defender Zone FGP") +
      facet_wrap(~shot_class)

    ggplot(shot_df, aes(x = shooter_zone_fgp, y = defender_zone_fgp,
                        color = SHOT_RESULT)) +
      geom_point(alpha = 0.1) +
      labs(x = "Shooter Zone FGP", y = "Defender Zone FGP",
           title = "Result of Shots, Shooter Zone FGP Vs. Defender Zone FGP") +
      facet_wrap(~shot_class)

Here it is easier to see the breakdown in predictive power in the mid1 and mid2 zones, as well as a decrease in the stratification of makes and misses. This suggests that the interaction between shooters and defenders is less pronounced, at least in terms of field goal percentage, for jump shots inside the three-point arc. Another striking observation is that the makes and misses are stratified horizontally in both data sets, suggesting that the role of the defender in the shooter-defender interaction is not as important as I had thought. To test this, I took defender_zone_fgp out of the prediction model and received the same 62% prediction accuracy. When I removed the CLOSE_DEF_DIST feature, however, the accuracy of the model went down.

    train_df2 <- shot_df[1:120000,]
    test_df2 <- shot_df[120001:128069,]

    # same model as Chunk #4 but without defender_zone_fgp;
    # fitted and scored on the *2 data frames defined above
    shot_pred_tree2 <- ctree(shot_result ~ SHOT_DIST + SHOT_CLOCK + shooter_zone_fgp +
                               shooter_zone_shots + CLOSE_DEF_DIST,
                             data = train_df2)
    plot(shot_pred_tree2)

    pred_model2 <- predict(shot_pred_tree2, test_df2)
    conf2 <- table(test_df2$shot_result, pred_model2)
    TP2 <- conf2[1,1]
    FN2 <- conf2[1,2]
    FP2 <- conf2[2,1]
    TN2 <- conf2[2,2]
    acc2 <- (TP2 + TN2)/(TP2 + TN2 + FN2 + FP2)
    acc2
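An accuracy figure is easier to interpret next to a baseline. As a hedged sketch (not part of the original analysis), the code below computes accuracy from a confusion table and compares it with the accuracy of always predicting the more common outcome. The vectors actual and predicted are made-up examples, and accuracy_from_table and majority_baseline are hypothetical helper names.

```r
# Made-up outcomes and predictions, for illustration only.
actual    <- factor(c("made", "missed", "missed", "made", "missed", "missed"),
                    levels = c("made", "missed"))
predicted <- factor(c("made", "missed", "made", "missed", "missed", "missed"),
                    levels = c("made", "missed"))

# Accuracy is the share of cases on the diagonal of the confusion table.
accuracy_from_table <- function(conf) {
  sum(diag(conf)) / sum(conf)
}

# Accuracy obtained by always predicting the most frequent class.
majority_baseline <- function(y) {
  max(table(y)) / length(y)
}

conf <- table(actual, predicted)
model_acc    <- accuracy_from_table(conf)   # 4 of the 6 toy cases are predicted correctly
baseline_acc <- majority_baseline(actual)   # 4 of the 6 toy cases are "missed"
```

In this toy example the two numbers coincide, which is the situation to watch for: when a model's accuracy sits close to the majority-class baseline, its features add little beyond the overall miss rate.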

To get a better idea of the model's accuracy by zone, I made a visualization.

    pred_by_zone <- shot_df %>%
      mutate(shot_pred2 = shot_pred - 1) %>%
      group_by(shot_class) %>%
      summarize(real_fgp = sum(fgm)/n(),
                pred_fgp = (1-(sum(shot_pred2)/n())))

    ggplot(pred_by_zone %>% arrange(desc(real_fgp)), aes(x = shot_class, y = real_fgp)) +
      geom_bar(stat = "identity") +
      labs(x = "Shot Class", y = "FGP", title = "Real FGP by Zone")

    ggplot(pred_by_zone %>% arrange(desc(pred_fgp)), aes(x = shot_class, y = pred_fgp)) +
      geom_bar(stat = "identity") +
      labs(x = "Shot Class", y = "FGP", title = "Predicted FGP by Zone")
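The summary above compares field goal percentages by zone. To measure how often the predictions are actually correct within each zone, one option is to average a correct/incorrect indicator per zone. The sketch below uses made-up data and assumes, following the earlier note, that shot_pred stores 1 for a predicted make and 2 for a predicted miss; the toy_zone data frame and the zone_accuracy name are illustrative.

```r
# Made-up shots, for illustration only.
toy_zone <- data.frame(
  shot_class  = c("close", "close", "mid 1", "mid 1", "long", "long"),
  shot_result = c("made", "missed", "missed", "missed", "made", "missed"),
  shot_pred   = c(1, 1, 2, 2, 2, 2),  # 1 = predicted make, 2 = predicted miss
  stringsAsFactors = FALSE
)

# Translate the integer codes back to labels, then flag correct predictions.
pred_label <- ifelse(toy_zone$shot_pred == 1, "made", "missed")
toy_zone$correct <- pred_label == toy_zone$shot_result

# Share of correct predictions within each zone.
zone_accuracy <- tapply(toy_zone$correct, toy_zone$shot_class, mean)
```

A zone whose predicted FGP matches its real FGP can still have low per-shot accuracy, so the two summaries answer different questions.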

According to the data frame and the visualization, the model captures the trend in overall field goal percentages, which decline with distance. However, the model thinks that close shots are far and away more likely to go in than any other shot, which does not hold up in reality.

Conclusion

A predictive model based only on a shooter's basic stats, defender stats, distance to the basket, distance from the defender to the shooter, and shot clock (shots put up at the end of the shot clock tend to be worse than those in the middle of the shot clock) captures some of the trends in shot results, but not all. The abstracted view of basketball, with no player interaction outside of the shooter and their closest defender, does explain some of the data, but a more detailed analysis could be produced with more data on the other eight players on the court.

Appendix

Chunk #1

    off_by_player <- shot_df %>%
      group_by(player_name) %>%
      summarize(avgdefdist = mean(close_def_dist),
                avgshotdist = mean(shot_dist),
                pts = sum(pts),
                numshots = n(),
                pts_to_attempts = pts/numshots,
                avgdribbles = mean(dribbles))

    off_by_player <- off_by_player %>%
      mutate(close_fgp = NA, close_shots = NA, mid1_fgp = NA, mid1_shots = NA,
             mid2_fgp = NA, mid2_shots = NA, long_fgp = NA, long_shots = NA,
             off_sweetspot = NA, off_sweetspot_rad = NA)

    for (i in 1:nrow(off_by_player)) {
      # per-player counters, reset for every shooter
      close_made <- close_att <- close_def_dist <- 0
      mid1_made <- mid1_att <- mid1_def_dist <- 0
      mid2_made <- mid2_att <- mid2_def_dist <- 0
      long_made <- long_att <- long_def_dist <- 0
      sweetspot <- c()
      counter <- 0
      working <- shot_df %>% filter(player_name == off_by_player$player_name[i])
      for (j in 1:nrow(working)) {
        # count attempts and makes in the zone of shot j
        if (working$shot_class[j] == "close") {
          close_att <- close_att + 1
          if (working$shot_result[j] == "made") {
            close_made <- close_made + 1
          }
        } else if (working$shot_class[j] == "mid 1") {
          mid1_att <- mid1_att + 1
          if (working$shot_result[j] == "made") {
            mid1_made <- mid1_made + 1
          }
        } else if (working$shot_class[j] == "mid 2") {
          mid2_att <- mid2_att + 1
          if (working$shot_result[j] == "made") {
            mid2_made <- mid2_made + 1
          }
        } else if (working$shot_class[j] == "long") {
          long_att <- long_att + 1
          if (working$shot_result[j] == "made") {
            long_made <- long_made + 1
          }
        }
        if (working$shot_result[j] == "made") {
          counter <- counter + 1
          # [j] so that only this shot's distance is appended
          sweetspot <- c(sweetspot, working$shot_dist[j])
        }
      }
      off_by_player$close_fgp[i] <- close_made/close_att
      off_by_player$close_shots[i] <- close_att
      off_by_player$mid1_fgp[i] <- mid1_made/mid1_att
      off_by_player$mid1_shots[i] <- mid1_att
      off_by_player$mid2_fgp[i] <- mid2_made/mid2_att
      off_by_player$mid2_shots[i] <- mid2_att
      off_by_player$long_fgp[i] <- long_made/long_att
      off_by_player$long_shots[i] <- long_att
      off_by_player$off_sweetspot[i] <- (sum(sweetspot))/counter
      off_by_player$off_sweetspot_rad[i] <- sd(sweetspot)
    }

Chunk #2

    def_by_player <- shot_df %>%
      group_by(closest_defender) %>%
      summarize(avgdist = mean(close_def_dist),
                pts_against = sum(pts),
                numshots = n(),
                pts_to_attempts = pts_against/numshots) %>%
      arrange(desc(numshots))

    def_by_player <- def_by_player %>%
      mutate(close_opp_fgp = NA, close_opp_shots = NA, mid1_opp_fgp = NA, mid1_opp_shots = NA,
             mid2_opp_fgp = NA, mid2_opp_shots = NA, long_opp_fgp = NA, long_opp_shots = NA,
             def_sweetspot = NA, def_sweetspot_rad = NA)

    for (i in 1:nrow(def_by_player)) {
      # per-defender counters, reset for every defender
      close_made <- close_att <- 0
      mid1_made <- mid1_att <- 0
      mid2_made <- mid2_att <- 0
      long_made <- long_att <- 0
      counter <- 0
      # the grouping column is closest_defender (lowercase), as in group_by() above
      working <- shot_df %>% filter(closest_defender == def_by_player$closest_defender[i])
      for (j in 1:nrow(working)) {
        if (working$shot_class[j] == "close") {
          close_att <- close_att + 1
          if (working$shot_result[j] == "made") {
            close_made <- close_made + 1
          }
        } else if (working$shot_class[j] == "mid 1") {
          mid1_att <- mid1_att + 1
          if (working$shot_result[j] == "made") {
            mid1_made <- mid1_made + 1
          }
        } else if (working$shot_class[j] == "mid 2") {
          mid2_att <- mid2_att + 1
          if (working$shot_result[j] == "made") {
            mid2_made <- mid2_made + 1
          }
        } else if (working$shot_class[j] == "long") {
          long_att <- long_att + 1
          if (working$shot_result[j] == "made") {
            long_made <- long_made + 1
          }
        }
      }
      def_by_player$close_opp_fgp[i] <- close_made/close_att
      def_by_player$close_opp_shots[i] <- close_att
      def_by_player$mid1_opp_fgp[i] <- mid1_made/mid1_att
      def_by_player$mid1_opp_shots[i] <- mid1_att
      def_by_player$mid2_opp_fgp[i] <- mid2_made/mid2_att
      def_by_player$mid2_opp_shots[i] <- mid2_att
      def_by_player$long_opp_fgp[i] <- long_made/long_att
      def_by_player$long_opp_shots[i] <- long_att
    }

Chunk #3

    shot_df <- shot_df %>%
      mutate(defender_sweetspot = NA, shooter_sweetspot = NA,
             shooter_zone_fgp = NA, shooter_zone_shots = NA,
             defender_zone_fgp = NA, defender_zone_shots = NA,
             def_dist_to_sweetspot = abs(shot_dist - (CLOSE_DEF_DIST + defender_sweetspot)))

    # copy the shooter's and defender's zone stats onto each shot;
    # the shooter lookups are cut off after "[shot_df" in the source and are left as they appear
    for (i in 1:nrow(shot_df)) {
      if (shot_df$shot_class[i] == "close") {
        shot_df$shooter_zone_fgp[i] <- off_by_player$close_fgp[shot_df
        shot_df$shooter_zone_shots[i] <- off_by_player$close_shots[shot_df
        shot_df$defender_zone_fgp[i] <- def_by_player$close_opp_fgp[shot_df$closest_defender[i]]
        shot_df$defender_zone_shots[i] <- def_by_player$close_opp_shots[shot_df$closest_defender[i]]
      } else if (shot_df$shot_class[i] == "mid 1") {
        shot_df$shooter_zone_fgp[i] <- off_by_player$mid1_fgp[shot_df
        shot_df$shooter_zone_shots[i] <- off_by_player$mid1_shots[shot_df
        shot_df$defender_zone_fgp[i] <- def_by_player$mid1_opp_fgp[shot_df$closest_defender[i]]
        shot_df$defender_zone_shots[i] <- def_by_player$mid1_opp_shots[shot_df$closest_defender[i]]
      } else if (shot_df$shot_class[i] == "mid 2") {
        shot_df$shooter_zone_fgp[i] <- off_by_player$mid2_fgp[shot_df
        shot_df$shooter_zone_shots[i] <- off_by_player$mid2_shots[shot_df
        shot_df$defender_zone_fgp[i] <- def_by_player$mid2_opp_fgp[shot_df$closest_defender[i]]
        shot_df$defender_zone_shots[i] <- def_by_player$mid2_opp_shots[shot_df$closest_defender[i]]
      } else if (shot_df$shot_class[i] == "long") {
        shot_df$shooter_zone_fgp[i] <- off_by_player$long_fgp[shot_df
        shot_df$shooter_zone_shots[i] <- off_by_player$long_shots[shot_df
        shot_df$defender_zone_fgp[i] <- def_by_player$long_opp_fgp[shot_df$closest_defender[i]]
        shot_df$defender_zone_shots[i] <- def_by_player$long_opp_shots[shot_df$closest_defender[i]]
      }
    }

Chunk #4

    train_df <- shot_df[1:120000,]
    test_df <- shot_df[120001:128069,]

    shot_pred_tree <- ctree(shot_result ~ SHOT_DIST + SHOT_CLOCK + shooter_zone_fgp +
                              shooter_zone_shots + defender_zone_fgp + CLOSE_DEF_DIST,
                            data = train_df)
    plot(shot_pred_tree)

    # confusion table: rows are actual results, columns are predictions
    pred_model <- predict(shot_pred_tree, test_df)
    conf <- table(test_df$shot_result, pred_model)
    TP <- conf[1,1]
    FN <- conf[1,2]
    FP <- conf[2,1]
    TN <- conf[2,2]
    acc <- (TP + TN)/(TP + TN + FN + FP)
    acc
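Chunk #4 takes the first 120,000 rows for training, which assumes that the row order carries no pattern. A hedged alternative, sketched here with made-up data rather than the project's data, is to draw the training rows at random with a fixed seed. The data frame toy_df, the names toy_train and toy_test, the 80/20 proportion, and the seed are all assumptions.

```r
set.seed(42)  # arbitrary seed so that the split is reproducible

# Made-up data frame standing in for shot_df.
toy_df <- data.frame(id = 1:100, shot_dist = seq(0.5, 50, by = 0.5))

# Sample 80% of the row indices for training; the remaining rows form the test set.
train_idx <- sample(nrow(toy_df), size = floor(0.8 * nrow(toy_df)))
toy_train <- toy_df[train_idx, ]
toy_test  <- toy_df[-train_idx, ]
```

The resulting data frames could then be passed to ctree() and predict() in place of the row-range subsets used in Chunk #4.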