SPATIAL STATISTICS

A SPATIAL ANALYSIS AND COMPARISON OF NBA PLAYERS

KELLIN RUMSEY

Introduction

The 2016 National Basketball Association championship featured two of the league's biggest names. The Golden State Warriors' Stephen Curry, who won the league's Most Valuable Player award in 2015 and 2016, found himself matched up against the Cleveland Cavaliers and LeBron James, a superstar who has won the MVP award four times in his illustrious career. One of the most intriguing aspects of the matchup was the drastically different playstyles of the two players. James is nearly half a foot taller and 50 pounds heavier than Curry, so their preferred scoring methods contrast sharply. In this paper, we explore this difference in shot selection and efficiency through the use of spatial statistics. Using point pattern analysis, we fit Non-Homogeneous Poisson Processes (NHPP) to compare shot intensity, and we use spatial logistic regression on the point-referenced data to fit a predictive surface for the efficiency of each player.

Intensity Analysis

The event intensity of a spatial point pattern at a location u is the mean number of events per unit area in a small region centered at u. If the process is homogeneous, the intensity is a constant λ, the same at every location. If the process is non-homogeneous, we let the intensity λ(x) vary from location to location and estimate it with the kernel estimator

(1)   \lambda(x) = \frac{1}{h^{2}} \sum_{i} \kappa\left( \frac{\lVert x - x_i \rVert}{h} \right) \Big/ q(\lVert x \rVert)

where the x_i are the observed shot locations, κ is a kernel function, h is the bandwidth, and q is an edge-correction term.

First we should verify that the point process is indeed non-homogeneous. Figure 1 provides a side-by-side comparison of 2958 shots for each player during the 2016 NBA season. Although we do not formally treat the points as marked (this was attempted, but the results were uninteresting), the green points are shots that were made and the black points are shots that were missed. From Figure 1, we see a couple of immediate differences. As expected, Curry shoots from farther away more frequently than James, and James has a higher intensity of shots near the basket.
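As a rough illustration of the kernel estimator in (1), the following sketch evaluates the intensity on a set of locations. The isotropic Gaussian kernel and the choice q = 1 (no edge correction) are assumptions made here for simplicity; the paper does not state which kernel, bandwidth, or correction was used, and the function name is hypothetical.

```python
import numpy as np

def kernel_intensity(points, grid, h):
    """Kernel intensity estimate at each location in `grid`.

    Assumptions (not taken from the paper): an isotropic Gaussian
    kernel and an edge-correction term q fixed at 1.
    """
    points = np.asarray(points, dtype=float)    # (n, 2) event locations
    grid = np.asarray(grid, dtype=float)        # (m, 2) evaluation locations
    diff = grid[:, None, :] - points[None, :, :]
    u2 = (diff ** 2).sum(axis=2) / h ** 2       # squared scaled distances
    kappa = np.exp(-0.5 * u2) / (2.0 * np.pi)   # 2-D standard normal density
    return kappa.sum(axis=1) / h ** 2           # (1/h^2) * sum of kernels

# Usage: three events near the origin, evaluated on a fine grid.
events = [[0.0, 0.0], [1.0, 0.5], [-0.5, 1.0]]
axis = np.arange(-10.0, 10.0 + 1e-9, 0.1)
gx, gy = np.meshgrid(axis, axis)
locations = np.column_stack([gx.ravel(), gy.ravel()])
lam = kernel_intensity(events, locations, h=1.0)
total = lam.sum() * 0.1 * 0.1   # Riemann sum of the estimated intensity
```

Because each kernel integrates to one, the estimated intensity integrates to roughly the number of events when edge effects are negligible, which is a convenient sanity check.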
It is obvious that the process is clustered if the window is taken to be the entire court. A more interesting case is to consider the polygon which traces the outline of the bulk of the shots.

Figure 1. Shot Charts

This polygon contains approximately 94% of the points for Curry and 99% of the points for James. The K-functions, G-functions and F-functions for the points contained in this region are plotted in Figures 2, 3 and 4. The evidence of clustering is far weaker than if we consider the entire court, but it exists nonetheless.

Figure 2. Ripley's K-Functions
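To make the clustering diagnostic concrete, here is a minimal sketch of a naive Ripley's K estimate. It omits edge correction, which is an assumption made for brevity; the paper does not say which correction was applied, and the function name is hypothetical. Under complete spatial randomness K(r) = πr², so estimates above that curve suggest clustering.

```python
import numpy as np

def ripley_k(points, radii, area):
    """Naive Ripley's K estimate (no edge correction, an assumption).

    K(r) = area / (n (n - 1)) * number of ordered pairs (i, j), i != j,
    whose distance is at most r.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    pair_dist = dist[~np.eye(n, dtype=bool)]     # drop the i == j entries
    counts = np.array([(pair_dist <= r).sum() for r in radii])
    return area * counts / (n * (n - 1))

# Usage: four points on the corners of a unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k_hat = ripley_k(corners, radii=[0.5, 1.0, 1.5], area=1.0)
```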

Figure 3. G-Functions

Figure 4. F-Functions

It is interesting to note that the clustering appears to be more evident in the case of LeBron James. A possible explanation is that, due to his size, he is able to shoot close to the basket more often. Curry is much smaller and must shoot whenever he has open space, leading to a slightly more homogeneous pattern.

Now we are ready to fit the NHPPs for each player. The densities of these fits are plotted below, overlaid with the points and contour plots.

Figure 5. Non-Homogeneous Point Pattern Densities

From Figure 5, we can make a couple of interesting observations. First, we see that the maximum intensity value for James is almost double Curry's. While both players have their highest shot intensity at the origin, James is shooting there far more frequently. Curry's intensity is also higher than James's at the three-point line, with non-zero intensity extending significantly farther out. We also notice that James appears to prefer to shoot from the left side of the court as he moves farther out. We have also identified a small sweet spot for Curry on the right side inside the three-point line, where the contour level is 2. With the exception of this one contour, the intensity of shots for both players appears to decrease monotonically with distance from the basket.

Next we will fit predictive models and analyze the spatial component of the residuals. First we should consider a few different models. We try the following models for λ(x, y).

(1) Linear: \lambda(x, y) = \exp(\beta_0 + \beta_1 x + \beta_2 y)
(2) Linear with Interaction: \lambda(x, y) = \exp(\beta_0 + \beta_1 x + \beta_2 y + \beta_3 xy)
(3) Quadratic: \lambda(x, y) = \exp(\beta_0 + \beta_1 x + \beta_2 y + \beta_3 xy + \beta_4 x^2 + \beta_5 y^2)
(4) Distance from Basket: \lambda(x, y) = \exp(\beta_0 + \beta_1 \sqrt{x^2 + y^2})

It turns out, however, that model (2) is not very interesting and looks essentially the same as model (1). Hence Figures 6 and 7 provide predictive surfaces for the linear, quadratic and distance-based models.
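The four candidate models differ only in which covariates enter the log-intensity. The sketch below builds the corresponding design matrices and evaluates λ(x, y) for a given coefficient vector. The function names are hypothetical, and the distance model is assumed to be log-linear in r = sqrt(x² + y²) with an intercept, which is how the fitted model for James is later reported.

```python
import numpy as np

def design_matrix(x, y, model):
    """Columns of covariates for each candidate log-intensity model."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    one = np.ones_like(x)
    if model == "linear":
        cols = [one, x, y]
    elif model == "interaction":
        cols = [one, x, y, x * y]
    elif model == "quadratic":
        cols = [one, x, y, x * y, x ** 2, y ** 2]
    elif model == "distance":
        cols = [one, np.hypot(x, y)]   # assumed form: intercept + distance
    else:
        raise ValueError(f"unknown model: {model}")
    return np.column_stack(cols)

def intensity(x, y, beta, model):
    """lambda(x, y) = exp(design row . beta)."""
    return np.exp(design_matrix(x, y, model) @ np.asarray(beta, dtype=float))

# Usage: one location evaluated under the quadratic and distance designs.
quad_row = design_matrix([3.0], [4.0], "quadratic")
dist_row = design_matrix([3.0], [4.0], "distance")
```

Writing every model as exp(Xβ) keeps the comparison uniform: the linear model is nested in the interaction model, which is nested in the quadratic model.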

Figure 6. Stephen Curry - Predictive Surfaces

Figure 7. LeBron James - Predictive Surfaces

It is obvious that the linear model is not very effective, although it does seem to imply (consistent with our previous comments) that Stephen Curry has a slight preference for the right side of the court and LeBron James a slight preference for the left side. The distance-based model and the quadratic model look fairly similar, but at first glance it appears that the quadratic model is performing slightly better. Here we turn to the standard diagnostic of checking residuals.

Figure 8. Stephen Curry - Smoothed Residuals

Figure 9. LeBron James - Smoothed Residuals

Abandoning the linear model, let us compare the distance-based and quadratic models. For Curry, it appears that the quadratic model fits slightly better. The magnitude of the residuals has been reduced, and they appear somewhat less correlated. We must remember, of course, that this model contains more parameters than the distance model. To help us choose between the two, we can turn to the AIC values, which reward goodness of fit but penalize the number of parameters. We see that the AIC suggests, ever so slightly, that we should stick with the reduced distance model.

For James, we first notice that the NHPP seems to fit better in general, although we are drastically underestimating the number of shots that he takes near the basket. Again it appears that the quadratic model is the better fit, but this time the difference is less obvious. Indeed, by AIC the distance-based model is the clear winner.

Table 1. AIC Values

                 Linear Model   Distance Model   Quadratic Model
Stephen Curry    3,571.800      2,886.400        2,904.500
LeBron James     1,598.400      -539.400         512.900

This is actually an important result. For James, we can conclude that the best spatial model is the one that simply considers distance from the basket. For Curry, however, there appears to be more to the story, as the higher-parameter model explains more than the reduced model. In a sense, this is intuitive: since Curry is the better shooter, and since he is smaller, he must shoot from everywhere on the court. This is consistent with the earlier observation that the process is more homogeneous for Curry. Upon closer examination, each parameter in the quadratic model is statistically significant (although the interaction term is barely significant), so despite the AIC value, we decide to keep the quadratic model for Stephen Curry, but we will use the reduced distance model for LeBron James. Changing the notation slightly, we provide the following NHPP predictive models for each player.
(2)   \log \lambda_{SC}(x, y) = 1.9 \pm (1.5 \times 10^{-2})\,x \pm (6.1 \times 10^{-2})\,y \pm (1.1 \times 10^{-3})\,xy \pm (4.3 \times 10^{-3})\,x^2 \pm (5.0 \times 10^{-4})\,y^2

(3)   \log \lambda_{LJ}(x, y) = 2.7 - 0.14\,r, \quad r = \sqrt{x^2 + y^2}

In (2), each ± marks a coefficient whose sign is not recoverable here; only the magnitudes are shown.

Of course, the values of λ in (2) and (3) are meaningful only relative to each other. Ideally, we should normalize them by the number of games in the dataset to obtain a physical meaning for λ. In that case it would represent the predicted number of shots per game that each player would take at a given location.

The spatial analysis of shot intensity has provided some insight into the difference in the two players' playstyles. In summary, James's shot selection can be more easily characterized through purely spatial analysis. Curry's shot selection appears to be less dependent on distance and more homogeneous.
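Since the fitted intensities are meaningful only relative to each other, one natural use of the distance model in (3) is to compare two locations. The sketch below computes the ratio λ(p)/λ(q) for James. It assumes the distance coefficient is negative (2.7 − 0.14r), which matches the observation that intensity decreases with distance from the basket; the function names are hypothetical, and the units of x and y are whatever units the shot coordinates were recorded in.

```python
import math

def james_log_intensity(x, y, b0=2.7, b1=-0.14):
    """Log-intensity under the distance model (3); b1 < 0 is an assumption."""
    r = math.hypot(x, y)          # distance from the basket at the origin
    return b0 + b1 * r

def relative_intensity(p, q):
    """Ratio lambda(p) / lambda(q); the intercept cancels out."""
    return math.exp(james_log_intensity(*p) - james_log_intensity(*q))

# Usage: shots at the basket versus shots 5 units away.
ratio = relative_intensity((0.0, 0.0), (3.0, 4.0))   # equals exp(0.14 * 5)
```

Under these assumptions the intensity at the basket is about twice the intensity 5 units away, and the ratio depends only on the difference in distances, not on direction.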

1. Efficiency Analysis

In this section, we would like to fit a spatial logistic regression model which will allow us to predict the probability of making a shot at location x. Again, we are interested in comparing the two players. We will see, however, that several issues arise when we try to do this. For one, the method is fairly slow and our datasets are fairly large. Secondly, the spatial correlation of the binary data isn't overwhelmingly clear, especially for LeBron James. Without further ado, we present empirical and parametric semivariograms for each player.

Figure 10. Stephen Curry - Semivariograms

Figure 11. LeBron James - Semivariograms

The semivariograms are reasonable for Stephen Curry, and we may choose the exponential model, as it provides a reasonable fit. The data for LeBron James, however, are less encouraging. The empirical semivariogram doesn't display the expected pattern, hence the parametric semivariogram fits are not very good either. In this case, the Gaussian semivariogram is the obvious, but still disappointing, choice.

Using the spBayes package, we can attempt to fit a spatial logistic regression model. The variograms make it clear that, especially in the second case, there are some counterintuitive spatial relationships present. Bayesian inference on this model was costly, limiting both the number of draws we were able to make and the size of the subset that we sampled from. We attempted to fit the model using 500 shots. The trace plots and density curves are not great by any means, but after trying many combinations of priors we came to the realization that the parameter estimation wasn't going to be fixed by a simple choice of prior. The trace plot for β actually looked quite reasonable, and we reached a point where the chain for φ would have been reasonable after an appropriate burn-in and thinning. The chain for σ² never converged, however, and wouldn't have been helped by any amount of thinning. In the end, the trace plots should be very concerning and cause us to question the results from here on out. For lack of time, however, we continue with the analysis.

Figure 12. Trace Plots and Posteriors

Using a separate set of 500 shots for each player, the spPredict function was used, and a rough predictive plot was formed based on the five-number summary of the values produced by taking the mean of the posterior predictive distribution at each location. Again the results are important and insightful for comparing the two players. Whatever lack of confidence we have in our model due to Figure 12, the model seems to be capturing some of the truth.
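For readers unfamiliar with the semivariograms used in this section, the following sketch shows how an empirical semivariogram can be computed from shot locations and binary make/miss outcomes, along with an exponential model curve to compare against. This is not the code used in the paper: the function names, the binning scheme, and the parameterization nugget + partial sill × (1 − exp(−φh)) are assumptions chosen for illustration.

```python
import numpy as np

def empirical_semivariogram(coords, z, bin_edges):
    """Half the mean squared difference of z over pairs in each distance bin."""
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(z, dtype=float)
    i, j = np.triu_indices(len(z), k=1)            # each unordered pair once
    dist = np.hypot(*(coords[i] - coords[j]).T)
    sqdiff = (z[i] - z[j]) ** 2
    gamma = np.full(len(bin_edges) - 1, np.nan)    # NaN marks an empty bin
    for b in range(len(bin_edges) - 1):
        in_bin = (dist > bin_edges[b]) & (dist <= bin_edges[b + 1])
        if in_bin.any():
            gamma[b] = 0.5 * sqdiff[in_bin].mean()
    return gamma

def exponential_semivariogram(h, nugget, partial_sill, phi):
    """Exponential model curve (assumed parameterization)."""
    h = np.asarray(h, dtype=float)
    return nugget + partial_sill * (1.0 - np.exp(-phi * h))

# Usage: three shots on a line, coded 1 = made and 0 = missed.
gamma_hat = empirical_semivariogram(
    coords=[[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]],
    z=[0, 1, 0],
    bin_edges=[0.0, 1.5, 2.5],
)
```

A semivariogram that rises with distance and then levels off indicates spatial correlation at short range; a flat or erratic one, as described for James, gives the spatial model little to work with.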

Figure 13. Efficiency Predictions

For instance, the model correctly determines that LeBron James will make a very good proportion of his shots near the basket, but will make far fewer as he moves farther away from it. Stephen Curry, on the other hand, doesn't make as many shots near the basket, but he remains effective from essentially anywhere on the court. The fact that the model captures these details suggests that we may be close to the truth.

In the end, there were some problems with using the spBayes package here. With more time, we could afford to use the entire dataset instead of such a small subset. Another point of interest would be to predict at a grid of points across the entire surface rather than at select points.

2. Conclusion

Overall, the results of this spatial comparison were satisfactory and intuitive to somebody who understands the players and the difference in their playstyles. In the intensity section, we showed that even in the dense regions there was evidence of clustering, but that these clusters occurred differently for each player. We showed that Stephen Curry's shot selection is somewhat more homogeneous, and that he is far more likely to shoot from farther out. LeBron's high intensity close to the basket is consistent with his high efficiency according to Figure 13. He shoots often close to the basket and is highly effective in the same range. This provides an intuitive explanation for his tremendous success in the league.

We compared two of the best players in the world, and found statistical evidence to back the intuition that was already in place for their very different styles of play.