1 Golf Analysis

1.1 Introduction

In a round, golfers have a number of choices to make. For a particular shot, is it better to use the longest club available to try to reach the green, or would it be better to use a shorter club that would be more likely to end up in the fairway? On a par five, should the intent from the start be to be on in regulation? Or is it worth the risk to make it on in two? And for a given course, what clubs should be kept in the bag?

Here, we're going to take a look at how particular shots affect the score on a hole. We're also going to look at some plotting tools to visualize the data we have.

1.2 Data

Navigate to the golf data folder and make it your working directory. Load the data by typing

    tps1.data = read.csv("tps1.csv", header = TRUE)

You should check to see what is contained in the dataset. One way that might make this easier is to type

    tps1.data[which(tps1.data$Player.Last.Name == "Mickelson"), ]

This should bring up four lines of output from the data. Observing that the Shot column has entries 1, 2, 3, and 4, we see that each row is a shot on this particular hole.

Now, what do some of the other columns mean? The easiest way to find out is to open Shot Detail Field Defs.pdf, which has descriptions for each of the columns. Look through this to get a better sense of what data is encoded in the file.

Exercise What are the units of the X, Y, and Z coordinate columns? What do the columns show when the ball is in the hole? Is the location of the tee box given in the coordinates?

The data presented here is from the first hole of the Torrey Pines South Course. A map of the course is available in the file tps_scorecard.pdf. Since we have coordinates, let's try to visualize the shot data.

    plot(tps1.data$X.Coordinate, tps1.data$Y.Coordinate)
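As an aside on the row-selection idiom used for the Mickelson query above, here is a minimal sketch on a small made-up data frame. The column names Player.Last.Name and Shot follow the dataset, but toy.data, the player names, and the values are invented for illustration and are not taken from tps1.csv.

```r
# Toy stand-in for the shot data. The names and values are invented
# for illustration; they do not come from tps1.csv.
toy.data = data.frame(
  Player.Last.Name = c("Adams", "Adams", "Baker", "Baker", "Baker"),
  Shot = c(1, 2, 1, 2, 3)
)

# names() and str() give a quick overview of what a data frame contains.
print(names(toy.data))
str(toy.data)

# which() returns the positions of the TRUE entries of a logical vector.
baker.rows = which(toy.data$Player.Last.Name == "Baker")

# Indexing with [rows, ] keeps the selected rows and all columns.
baker.shots = toy.data[baker.rows, ]
print(baker.shots)
```

The same pattern works with any logical condition on a column, so it can be reused whenever a subset of the shots is needed.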
Figure 1.1 The X and Y coordinates for the shot data.

The plot you obtain should look like Figure 1.1. As we can see, this isn't a particularly helpful plot. The problem here is that we need to get rid of the points whose coordinates are 0.

Exercise Use the which function to define a new data frame which does not include any shots for which the X coordinate is 0. Call this tps1.nonzero. Plot the X and Y coordinates of this new data frame. What do you get? Does the concentration of shot locations make sense?

Exercise In Exercise 1.2.2 (the preceding exercise), you should have plotted the shots. In the lower left corner, there is a large cluster of points and then two slightly smaller clusters. What is the reason for this? You may want to use the unique function on the column of dates.

1.3 A First Analysis: Distance off the Tee

Suppose we're interested in knowing how the features of a drive correspond to a golfer's score on this hole. One place to begin an analysis would be based on drive distance.

A Simple Linear Regression

First, define a data frame called tps1.first.shot which consists of all first shots on the hole. Then, define a vector called distance.yards which is the distance of the shot in
yards.^1

Now, we can define a linear model by typing

    golf.model = lm(tps1.first.shot$Hole.Score ~ distance.yards)

From here, we can get a lot of information about the fit of the model by using the summary function.

    > summary(golf.model)

    Call:
    lm(formula = tps1.first.shot$Hole.Score ~ distance.yards)

    Residuals:
    Min 1Q Median 3Q Max

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                  <2e-16 ***
    distance.yards                                      **

    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 301 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 301 DF, p-value:

This is quite a bit of information, and we don't have the tools to understand all of it in this course.^2 The basic idea, though, is that we are modeling the score a golfer receives on the hole as a random variable $Y_i$, and we are assuming here that it follows the formula

    $$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{1.1} $$

where $X_i$ is the distance of the drive, $\varepsilon_i$ is a normal random variable with mean 0, and $\beta_0$ and $\beta_1$ are constants. The point of the regression is to estimate $\beta_0$ and $\beta_1$. So in this particular case, we have

    $$ Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \varepsilon_i, \tag{1.2} $$

where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the entries in the Estimate column of the summary output for (Intercept) and distance.yards. As a sanity check, note that the estimated coefficient $\hat{\beta}_1$ is negative. This means that longer drives are associated with lower scores on the hole.

^1 It's not strictly necessary to define the distance in yards, but we do it here for interpretability, because tee shots are usually measured in yards.
^2 Honestly, it would take about a year of an introductory statistics course to cover all of this material.

So, suppose we had shots of 250, 275, and 300 yards. What would the expected scores on the hole be? Turning to Equation (1.2), we would get

    $$ E[Y_1] = \hat{\beta}_0 + \hat{\beta}_1 (250). $$

We could similarly compute the other values. Alternatively, R has the built-in command predict. To use this function, we define a new data frame with the necessary predictors. In this case, that's just the distance of the shot. This gives
    > newdata = data.frame(distance.yards = c(250, 275, 300))
    > predict(golf.model, newdata)

These predictions agree with our earlier results.

Exercise What are the predicted scores for drives of 240, 280, and 310 yards?

Diagnostics

When we wrote Equation (1.1) earlier, we were making some pretty strong assumptions about the model.

Exercise What are some of the assumptions necessary for a linear model? In particular, what can be said about each individual $Y_i$? How about the $Y_i$ collectively?

Now, we're going to take the step of looking at our fit to the data. In general, looking at the data before fitting a model is bad practice. Why? Well, humans who have already seen the data tend to fit all sorts of complicated models to it, and once a model has been chosen that way, there's no real way to make any statistical guarantees. In any case, we can type the following

    > plot(distance.yards, tps1.first.shot$Hole.Score)
    > abline(golf.model)

This should produce Figure 1.2.

Figure 1.2 Hole score as a function of the yardage of the drive.
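The agreement between predict and the hand computation from the fitted equation, noted earlier in this section, can be checked directly. What follows is a minimal sketch on simulated data: the variables sim.distance, sim.score, and sim.model are hypothetical, and the coefficients 5 and -0.004 used to generate the scores are invented for illustration, not estimates from the Torrey Pines data.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated drives (yards) and hole scores. The generating values 5 and
# -0.004 are made up; they are not estimates from the real data.
sim.distance = runif(200, min = 230, max = 320)
sim.score = 5 - 0.004 * sim.distance + rnorm(200, sd = 0.6)

sim.model = lm(sim.score ~ sim.distance)

# Hand computation: estimated intercept plus estimated slope times distance.
new.distance = c(250, 275, 300)
beta = coef(sim.model)
by.hand = beta[1] + beta[2] * new.distance

# predict() matches the predictor by name in the new data frame,
# so the column must be called sim.distance here.
by.predict = predict(sim.model, data.frame(sim.distance = new.distance))

# TRUE when the two computations agree up to numerical tolerance.
print(all.equal(unname(by.hand), unname(by.predict)))
```

The column name in the new data frame has to match the predictor name in the model formula, which is why the text defines newdata with a column called distance.yards. With that check in mind, return to the fitted line in Figure 1.2.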
The fit in Figure 1.2 is not good. However, we could sort of imagine this happening. For one, the response variable, which is the score in this case, only takes a few values. Second, the trend line doesn't seem to fit the data particularly well. If we want to do better, we'll have to throw more refined data at this problem.

Exercise Suppose we had data from the entire round. How could we modify the analysis so we would get data that would be closer to normal? Why would it be closer to normal? What assumptions would we have to make that we haven't made for this analysis?

1.4 A Second Look

We're going to try once again to fit the score on a hole as a function of the length of the drive, but we want data that will be closer to normally distributed. Load the new data by typing

    full.round = read.csv("TorreyPinesSouth.csv")

Exercise How many unique first names are there among the golfers? How many unique last names are there among the golfers? How many golfers are represented in the data?

Exercise Write a function hole.scores which takes a player identification number and returns the vector of scores on par four holes. Write a function drive.distances which takes a player identification number and returns the vector of drive distances in yards on par four holes. You should be able to type

    > hole.scores(1810)
    [1]
    > drive.distances(1810)
    [1]
    [9]

Exercise Write a function that returns the vector of average scores on par four holes and the vector of average drive distances on par four holes for all golfers. Call the function round.info.

Exercise Define the linear model average.model so that score is a function of drive distance. Plot the data and the fitted line. Does it look more reasonable to fit a linear model now? How well does the line fit the data?

There are other refinements one could use for this. In particular, we might be interested in the effect that hitting into the rough or a fairway trap has on the outcome for the hole.
You could continue analyzing this further by dividing the data appropriately.
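One way such a refinement might look is sketched below on simulated data. Everything here is an assumption made for illustration: the data frame sim.data, its columns score, distance, and fairway, and the generating coefficients are invented, and the real files may encode where a drive finished quite differently (check Shot Detail Field Defs.pdf). The first option divides the data as suggested above; the second, which is not from the text, keeps one model and adds the indicator as an extra predictor.

```r
set.seed(2)  # make the simulated data reproducible

n = 150
sim.distance = runif(n, min = 240, max = 320)

# Hypothetical indicator: TRUE if the drive finished in the fairway.
sim.fairway = runif(n) < 0.6

# Invented generating process; these coefficients are not from real data.
sim.score = 5.2 - 0.004 * sim.distance - 0.3 * sim.fairway + rnorm(n, sd = 0.5)

sim.data = data.frame(score = sim.score,
                      distance = sim.distance,
                      fairway = sim.fairway)

# Option 1: divide the data and fit a separate line to each group.
fairway.data = sim.data[sim.data$fairway, ]
rough.data = sim.data[!sim.data$fairway, ]
fairway.model = lm(score ~ distance, data = fairway.data)
rough.model = lm(score ~ distance, data = rough.data)

# Option 2: a single model with the indicator as an additional predictor.
combined.model = lm(score ~ distance + fairway, data = sim.data)
print(coef(combined.model))
```

In the combined model, the coefficient labelled fairwayTRUE estimates the shift in score for fairway drives at a fixed distance, under the assumption that both groups share the same slope. Fitting the groups separately drops that assumption at the cost of fewer observations per fit.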
