Data Science Final Project
Hunter Johns

Introduction

At its most basic, basketball gives each team two objectives: score as many times as possible, and hold the opposing team to as few made shots as possible. Baseball shares this scheme (with runs instead of field goals) and has undergone a data-driven transformation into an analytical sport. Basketball, however, features far more interactions between players than baseball does, and so has resisted the kind of data revolution that baseball saw with the work of theoretician Bill James and his executor Billy Beane. Short of developing a sabermetrics for basketball, the purpose of my final project is to see how well the most important interaction in basketball, the one between the shooter and their defender, can be abstracted.

Since it came out in 2012, I've been a fan of the video game NBA 2K13. As I've played the game, I've become increasingly interested in the way its creators modeled how basketball works, specifically the way they abstracted that shooter-defender interaction. As far as I can tell without reading the source code, the probability of a shot being made depends most on the shooter's ability from the specific place on the floor they are shooting from, the defender's ability in that same place, and how well the defender defends the shooter. The game does make an effort to introduce player-specific modifiers to replicate real-life player behavior, but I'm choosing to disregard these in favor of the more basic shooter-defender interaction.

Enter my data set: a Kaggle set of every shot taken by an NBA player in the 2014-15 season, a total of roughly 128,000 entries. The set features variables like CLOSE_DEF_DIST, the distance from the shooter to the closest defender, and SHOT_DIST, the distance from the shooter to the basket.

Data Wrangling

My goal was to develop a classification model using the NBA 2K13 mode of basketball thinking, i.e. the chance of making a shot depends only on a mixture of shooter and defender skill plus spatial information such as the distance from shooter to basket and from shooter to defender. To begin with, I created a new variable, shot_class, which classifies a shot by its distance to the basket. I chose the breaks where a shot changes character; for instance, a shot closer than four feet is most likely a layup (not a shot in the "jump-shot" sense). I could not find a dplyr function that would evaluate cases and assign a variable value based on the evaluation, so I iterated over the rows with a for-loop.

    for (i in 1:nrow(shot_df)) {
      if (shot_df$shot_dist[i] < 4) {
        shot_df$shot_class[i] = "close"
      } else if (shot_df$shot_dist[i] < 15) {
        shot_df$shot_class[i] = "mid 1"
      } else if (shot_df$shot_dist[i] < 22) {
        shot_df$shot_class[i] = "mid 2"
      } else if (shot_df$shot_dist[i] >= 22) {  # >= so a shot at exactly 22 ft is "long"
        shot_df$shot_class[i] = "long"
      } else {
        shot_df$shot_class[i] = "other"
      }
    }

    Under 4 ft.        - "close" - layup or low post shot
    4-15 ft.           - "mid 1" - close jump shot, high post shot, or floater
    15-22 ft.          - "mid 2" - jump shot inside the 3pt line
    22 ft. and beyond  - "long"  - 3pt shot

A problem I then ran into was how to find each player's field goal percentage in each of the zones. For this purpose I created an entirely new data frame for shooters, grouped by player name. I also wrangled new variables, such as average shot distance, for analysis after the classification model. [See Appendix Chunk #1]

To find the same data for defenders, I applied a similar method and created a separate data frame for defenders. [See Appendix Chunk #2]

For a machine-learning black box to build a prediction model out of the data frame with all of the season's shots, I needed to put the relevant information (shooter and defender field goal percentages) into each shot case. For instance, if a shot was in the "mid 1" classification, between four and fifteen feet, the code would look up the appropriate case in the defender data frame, read that player's defended field goal percentage for the "mid 1" shot area, and apply it to the entry for the shot in question. The code would then do the same for the shooter. [See Appendix Chunk #3]

I first gave the shot data frame to the rpart function, but something about the data did not mesh with the function. I then tried the ctree function, which successfully built a classification model from the data.
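
The zone classification and the per-zone percentages described in this section can also be computed without explicit loops. What follows is only a minimal sketch in base R, not the code used for the project: the data frame `shots` is a made-up toy stand-in for the shot log, and `zone_fgp` and `shots_joined` are hypothetical names. It uses `cut()` for the zone breaks (dplyr's `case_when()` would be the closest dplyr equivalent), `aggregate()` for the per-player, per-zone field goal percentage, and `merge()` to attach that percentage to every shot.

```r
# Toy stand-in for the shot log; values are invented for illustration only.
shots <- data.frame(
  player_name = c("a", "a", "a", "a", "b", "b", "b"),
  shot_dist   = c(2.0, 3.0, 10.5, 22.0, 3.9, 18.0, 25.1),
  shot_result = c("made", "missed", "missed", "made", "missed", "made", "missed")
)

# Left-closed bins [0,4), [4,15), [15,22), [22,Inf) reproduce the zone breaks.
shots$shot_class <- cut(
  shots$shot_dist,
  breaks = c(0, 4, 15, 22, Inf),
  labels = c("close", "mid 1", "mid 2", "long"),
  right  = FALSE
)

# Field goal percentage per shooter and zone: the mean of a 0/1 "made" flag.
shots$made <- as.integer(shots$shot_result == "made")
zone_fgp <- aggregate(made ~ player_name + shot_class, data = shots, FUN = mean)
names(zone_fgp)[3] <- "shooter_zone_fgp"

# Attach each shooter's zone percentage to every shot taken in that zone.
shots_joined <- merge(shots, zone_fgp, by = c("player_name", "shot_class"))
shots_joined
```

The same pattern, grouped by the closest defender instead of the shooter, would give the defender-side percentages.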

Using a holdout split, with 100,000 cases as the training set and the remaining 28,069 cases as the test set, the model correctly predicted the result of 62% of shots. [See Appendix Chunk #4]

To analyze the classification model, I used it to make a prediction for each case in the set and assigned the result of the prediction to a new variable.

    # predict() on the whole data frame returns a factor; as.integer() keeps
    # the codes used in the plots (1 = made, 2 = missed)
    shot_df$shot_pred <- as.integer(predict(shot_pred_tree, shot_df))

Analysis of Model

I started with a visual analysis of the discrepancy between the model and the actual result of the shots. (A note: in classification graphs, 1 is a make, 2 is a miss.)

    ggplot(shot_df, aes(x = close_def_dist, y = shot_dist)) +
      geom_point(alpha = 0.1) +
      labs(x = "Distance to Closest Defender", y = "Distance to Basket",
           title = "Classification of Shots, Defensive Distance Vs. Shot Distance") +
      facet_wrap(~factor(shot_pred))

    ggplot(shot_df, aes(x = close_def_dist, y = shot_dist)) +
      geom_point(alpha = 0.1) +
      labs(x = "Distance to Closest Defender", y = "Distance to Basket",
           title = "Result of Shots, Defensive Distance Vs. Shot Distance") +
      facet_wrap(~factor(shot_result))

Looking at the graphs, the model found the same trend as the real data towards the bottom of the graph: shots taken closer than about five feet from the basket with a defender separation of more than a few feet were likely to be good. However, the model predicted that shots from more than five feet with a defender separation of less than about three feet were very unlikely to go in, which in practice did not appear to be true.

I did another side-by-side comparison of the model's features, this time of shooter zone FGP versus defender zone FGP.

    ggplot(shot_df, aes(x = shooter_zone_fgp, y = defender_zone_fgp)) +
      geom_point(alpha = 0.1) +
      labs(x = "Shooter Zone FGP", y = "Defender Zone FGP",
           title = "Classification of Shots, Shooter Zone FGP Vs. Defender Zone FGP") +
      facet_wrap(~shot_pred)

    ggplot(shot_df, aes(x = shooter_zone_fgp, y = defender_zone_fgp)) +
      geom_point(alpha = 0.1) +
      labs(x = "Shooter Zone FGP", y = "Defender Zone FGP",
           title = "Result of Shots, Shooter Zone FGP Vs. Defender Zone FGP") +
      facet_wrap(~shot_result)

Interestingly, I did not find the same noticeable difference in the distribution of actual makes versus misses as I found in predicted makes versus misses. This is a failure of the model; much more than shooting ability goes into a field goal percentage, more even than could be gleaned from this data set. An example of this failure might be a player who is playing out of their most comfortable role on a team.

I also faceted by zone to see how results compared with classifications.

    ggplot(shot_df, aes(x = shooter_zone_fgp, y = defender_zone_fgp,
                        color = factor(shot_pred))) +
      geom_point(alpha = 0.1) +
      labs(x = "Shooter Zone FGP", y = "Defender Zone FGP",
           title = "Classification of Shots, Shooter Zone FGP Vs. Defender Zone FGP") +
      facet_wrap(~shot_class)

    ggplot(shot_df, aes(x = shooter_zone_fgp, y = defender_zone_fgp,
                        color = shot_result)) +
      geom_point(alpha = 0.1) +
      labs(x = "Shooter Zone FGP", y = "Defender Zone FGP",
           title = "Result of Shots, Shooter Zone FGP Vs. Defender Zone FGP") +
      facet_wrap(~shot_class)

Here it is easier to see the breakdown in predictive power in the mid 1 and mid 2 zones, as well as a decrease in the stratification of makes and misses. This suggests that the interaction between shooters and defenders is less pronounced, at least in terms of field goal percentage, for jump shots inside the three-point arc.

Another striking observation is that the makes and misses are stratified horizontally (along the shooter FGP axis) in both data sets, suggesting that the role of the defender in the shooter-defender interaction is not as important as I had thought. To test this, I took defender_zone_fgp out of the prediction model and received the same 62% prediction accuracy. When I removed the close_def_dist feature, however, the accuracy of the model went down.

    # Same split as the full model, refit without defender_zone_fgp.
    # Column names follow the lowercase spelling used in the wrangling code.
    train_df2 <- shot_df[1:120000,]
    test_df2 <- shot_df[120001:128069,]

    shot_pred_tree2 <- ctree(shot_result ~ shot_dist + shot_clock +
                               shooter_zone_fgp + shooter_zone_shots +
                               close_def_dist,
                             data = train_df2)
    plot(shot_pred_tree2)

    pred_model2 <- predict(shot_pred_tree2, test_df2)
    conf2 <- table(test_df2$shot_result, pred_model2)  # rows = actual, columns = predicted
    TP2 <- conf2[1,1]
    FN2 <- conf2[1,2]
    FP2 <- conf2[2,1]
    TN2 <- conf2[2,2]
    acc2 <- (TP2 + TN2)/(TP2 + TN2 + FN2 + FP2)
    acc2
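
The feature-removal comparison above comes down to computing one accuracy number per model, and the same number can be broken out by shot zone. The following is a hedged sketch on invented toy labels, not project results: `accuracy()` is a hypothetical helper, and `pred_full` and `pred_reduced` stand in for the predictions of a model with and without one feature.

```r
# Share of cases where the predicted label equals the actual label.
# Comparing as character avoids errors when factor level sets differ.
accuracy <- function(actual, predicted) {
  mean(as.character(actual) == as.character(predicted))
}

# Toy labels, invented for illustration only.
actual       <- factor(c("made", "missed", "missed", "made", "missed"))
pred_full    <- factor(c("made", "missed", "made", "made", "missed"))
pred_reduced <- factor(c("missed", "missed", "made", "made", "missed"))
zone         <- c("close", "close", "mid 1", "long", "long")

# The helper agrees with the (TP + TN) / total formula on a 2x2 table.
conf <- table(actual, pred_full)
acc_from_table <- (conf[1, 1] + conf[2, 2]) / sum(conf)

acc_full    <- accuracy(actual, pred_full)
acc_reduced <- accuracy(actual, pred_reduced)

# Accuracy of the full model within each shot zone.
acc_by_zone <- tapply(as.character(actual) == as.character(pred_full), zone, mean)
acc_by_zone
```

A drop from `acc_full` to `acc_reduced` is what would indicate that the removed feature carried predictive information.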

To get a better idea of the model's accuracy by zone, I made a visualization.

    pred_by_zone <- shot_df %>%
      mutate(shot_pred2 = shot_pred - 1) %>%  # 0 = predicted make, 1 = predicted miss
      group_by(shot_class) %>%
      summarize(real_fgp = sum(fgm)/n(),
                pred_fgp = 1 - sum(shot_pred2)/n())

    # reorder() sorts the bars by FGP; arrange() alone does not change bar order
    ggplot(pred_by_zone, aes(x = reorder(shot_class, -real_fgp), y = real_fgp)) +
      geom_bar(stat = "identity") +
      labs(x = "Shot Class", y = "FGP", title = "Real FGP by Zone")

    ggplot(pred_by_zone, aes(x = reorder(shot_class, -pred_fgp), y = pred_fgp)) +
      geom_bar(stat = "identity") +
      labs(x = "Shot Class", y = "FGP", title = "Predicted FGP by Zone")

According to the data frame and the visualization, the model captures the trend in overall field goal percentages, which decline with distance. However, the model thinks that close shots are far and away more likely to go in than any other shot, which does not hold up in reality.

Conclusion

A predictive model based only on a shooter's basic stats, defender stats, distance to the basket, distance from the defender to the shooter, and shot clock (shots put up at the end of the shot clock tend to be worse than those in the middle of the shot clock) captures some of the trends in shot results, but not all. The abstracted view of basketball, with no player interaction outside of the shooter and their closest defender, does explain some of the data, but a more detailed analysis could be produced with more data on the other eight players on the court.

Appendix

Chunk #1

    library(dplyr)

    off_by_player <- shot_df %>%
      group_by(player_name) %>%
      summarize(avgdefdist = mean(close_def_dist),
                avgshotdist = mean(shot_dist),
                pts = sum(pts),
                numshots = n(),
                pts_to_attempts = pts/numshots,
                avgdribbles = mean(dribbles))

    off_by_player <- off_by_player %>%
      mutate(close_fgp = NA, close_shots = NA, mid1_fgp = NA, mid1_shots = NA,
             mid2_fgp = NA, mid2_shots = NA, long_fgp = NA, long_shots = NA,
             off_sweetspot = NA, off_sweetspot_rad = NA)

    # column prefix -> label stored in shot_class
    zones <- c(close = "close", mid1 = "mid 1", mid2 = "mid 2", long = "long")

    for (i in 1:nrow(off_by_player)) {
      working <- shot_df %>%
        filter(player_name == off_by_player$player_name[i])
      made <- working$shot_result == "made"

      # makes / attempts and number of attempts in each zone
      for (z in names(zones)) {
        in_zone <- working$shot_class == zones[[z]]
        off_by_player[[paste0(z, "_fgp")]][i] <- sum(made & in_zone)/sum(in_zone)
        off_by_player[[paste0(z, "_shots")]][i] <- sum(in_zone)
      }

      # "sweet spot": mean and spread of the distances of made shots
      off_by_player$off_sweetspot[i] <- mean(working$shot_dist[made])
      off_by_player$off_sweetspot_rad[i] <- sd(working$shot_dist[made])
    }

Chunk #2

    def_by_player <- shot_df %>%
      group_by(closest_defender) %>%
      summarize(avgdist = mean(close_def_dist),
                pts_against = sum(pts),
                numshots = n(),
                pts_to_attempts = pts_against/numshots) %>%
      arrange(desc(numshots))

    def_by_player <- def_by_player %>%
      mutate(close_opp_fgp = NA, close_opp_shots = NA, mid1_opp_fgp = NA,
             mid1_opp_shots = NA, mid2_opp_fgp = NA, mid2_opp_shots = NA,
             long_opp_fgp = NA, long_opp_shots = NA,
             def_sweetspot = NA, def_sweetspot_rad = NA)

    for (i in 1:nrow(def_by_player)) {
      working <- shot_df %>%
        filter(closest_defender == def_by_player$closest_defender[i])
      made <- working$shot_result == "made"

      # opponents' makes / attempts and attempts against this defender, per zone
      for (z in names(zones)) {
        in_zone <- working$shot_class == zones[[z]]
        def_by_player[[paste0(z, "_opp_fgp")]][i] <- sum(made & in_zone)/sum(in_zone)
        def_by_player[[paste0(z, "_opp_shots")]][i] <- sum(in_zone)
      }
    }

Chunk #3

    shot_df <- shot_df %>%
      mutate(defender_sweetspot = NA, shooter_sweetspot = NA,
             shooter_zone_fgp = NA, shooter_zone_shots = NA,
             defender_zone_fgp = NA, defender_zone_shots = NA,
             # stays NA until defender_sweetspot is filled in
             def_dist_to_sweetspot = abs(shot_dist - (close_def_dist + defender_sweetspot)))

    # Row of each shot's shooter and defender in the per-player tables.
    # The shooter-side index was cut off in the source; it is reconstructed
    # here by analogy with the defender lines. match() looks players up by
    # name, which stays correct even though def_by_player is sorted by numshots.
    off_row <- match(shot_df$player_name, off_by_player$player_name)
    def_row <- match(shot_df$closest_defender, def_by_player$closest_defender)

    # label stored in shot_class -> column prefix in the per-player tables
    zone_prefix <- c("close" = "close", "mid 1" = "mid1", "mid 2" = "mid2", "long" = "long")

    for (z in names(zone_prefix)) {
      in_zone <- which(shot_df$shot_class == z)
      p <- zone_prefix[[z]]
      shot_df$shooter_zone_fgp[in_zone] <- off_by_player[[paste0(p, "_fgp")]][off_row[in_zone]]
      shot_df$shooter_zone_shots[in_zone] <- off_by_player[[paste0(p, "_shots")]][off_row[in_zone]]
      shot_df$defender_zone_fgp[in_zone] <- def_by_player[[paste0(p, "_opp_fgp")]][def_row[in_zone]]
      shot_df$defender_zone_shots[in_zone] <- def_by_player[[paste0(p, "_opp_shots")]][def_row[in_zone]]
    }

Chunk #4

    library(partykit)  # provides ctree(); the party package has an equivalent

    # First 120,000 rows train the tree; the remaining 8,069 rows are held out.
    # Column names follow the lowercase spelling used in the wrangling code.
    train_df <- shot_df[1:120000,]
    test_df <- shot_df[120001:128069,]

    shot_pred_tree <- ctree(shot_result ~ shot_dist + shot_clock +
                              shooter_zone_fgp + shooter_zone_shots +
                              defender_zone_fgp + close_def_dist,
                            data = train_df)
    plot(shot_pred_tree)

    pred_model <- predict(shot_pred_tree, test_df)
    conf <- table(test_df$shot_result, pred_model)  # rows = actual, columns = predicted
    TP <- conf[1,1]
    FN <- conf[1,2]
    FP <- conf[2,1]
    TN <- conf[2,2]
    acc <- (TP + TN)/(TP + TN + FN + FP)
    acc
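
The split in Chunk #4 holds out one block of rows a single time. As a point of comparison, k-fold cross-validation rotates the held-out block so that every row is tested exactly once. The sketch below shows that idea under loud assumptions: the data frame `toy` is randomly generated rather than taken from the shot log, and a logistic regression from base R (`glm`) stands in for the ctree model so the example runs without extra packages.

```r
set.seed(1)

# Synthetic stand-in data; these are not real shots.
n <- 200
toy <- data.frame(
  shot_dist      = runif(n, 0, 30),
  close_def_dist = runif(n, 0, 10)
)
p_make <- plogis(1 - 0.1 * toy$shot_dist + 0.1 * toy$close_def_dist)
toy$made <- rbinom(n, 1, p_make)

# Assign every row to one of k folds at random.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, then score on the held-out fold.
fold_acc <- sapply(1:k, function(f) {
  train <- toy[fold != f, ]
  test  <- toy[fold == f, ]
  fit <- glm(made ~ shot_dist + close_def_dist, data = train, family = binomial)
  pred <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)
  mean(pred == test$made)
})

# Cross-validated accuracy is the average over the k held-out folds.
mean(fold_acc)
```

Assigning folds at random also guards against any ordering in the rows (for example by game date or by player) leaking into a first-N versus last-N split.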