Golf Analysis

1.1 Introduction

In a round, golfers have a number of choices to make. For a particular shot, is it better to use the longest club available to try to reach the green, or would it be better to use a shorter club that would be more likely to end up in the fairway? On a par five, should the intent from the start be to be on in regulation? Or is it worth the risk to make it on in two? And for a given course, what clubs should be kept in the bag?

Here, we're going to take a look at how particular shots affect the score on a hole. We're also going to look at some plotting tools to visualize the data we have.

1.2 Data

Navigate to the golf data folder and make it your working directory. Load the data by typing

    tps1.data = read.csv("tps1.csv", header = TRUE)

You should check to see what is contained in the dataset. One way that might make this easier is to type

    tps1.data[which(tps1.data$Player.Last.Name == "Mickelson"), ]

This should bring up four lines of output from the data. Observing that the Shot column has entries 1, 2, 3, and 4, we see that each row is a shot on this particular hole. Now, what do some of the other columns mean? The easiest way to find out is to open Shot Detail Field Defs.pdf, which has descriptions for each of the columns. Look through this to get a better sense of what data is encoded in the file.

Exercise 1.2.1. What are the units of the X, Y, and Z coordinate columns? What do the columns show when the ball is in the hole? Is the location of the tee box given in the coordinates?

The data presented here is from the first hole of the Torrey Pines South Course. A map of the course is available in the file tps_scorecard.pdf. Since we have coordinates, let's try to visualize the shot data.

    plot(tps1.data$X.Coordinate, tps1.data$Y.Coordinate)
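The row selection used for Mickelson above relies on which, which converts a logical test into row indices. The following is a minimal sketch of that idiom on a small made-up data frame; the names shots, player, shot, and score are assumptions for illustration only and are not columns of tps1.csv.

```r
# Toy data frame; these column names are hypothetical, not those in tps1.csv.
shots <- data.frame(
  player = c("A", "A", "B", "B", "C"),
  shot   = c(1, 2, 1, 2, 1),
  score  = c(4, 4, 5, 5, 3)
)

# which() turns a logical vector into the indices where it is TRUE.
rows.for.A <- which(shots$player == "A")

# Indexing with [rows, ] keeps those rows and all columns.
shots.A <- shots[rows.for.A, ]

# The same idiom with != keeps every row that fails the test.
shots.not.A <- shots[which(shots$player != "A"), ]

print(rows.for.A)
print(nrow(shots.not.A))
```

The comma inside the brackets matters: leaving the column position empty asks for every column of the selected rows.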

Figure 1.1 The X and Y coordinates for the shot data.

The plot you obtain should look like Figure 1.1. As we can see, this isn't a particularly helpful plot. The problem here is that we need to get rid of the points whose coordinates are 0.

Exercise 1.2.2. Use the which function to define a new data frame which does not include any shots for which the X coordinate is 0. Call this tps1.nonzero. Plot the X and Y coordinates of this new data frame. What do you get? Does the concentration of shot locations make sense?

Exercise 1.2.3. In Exercise 1.2.2, you should have plotted the shots. In the lower left corner, there is a large cluster of points and then two slightly smaller clusters. What is the reason for this? You may want to use the unique function on the column of dates.

1.3 A First Analysis: Distance off the Tee

Suppose we're interested in knowing how the features of a drive correspond to a golfer's score on this hole. One place to begin an analysis would be based on drive distance.

1.3.1 A Simple Linear Regression

First, define a data frame called tps1.first.shot which consists of the first shot from each golfer on the hole. Then, define a vector called distance.yards which is the distance of the shot in yards. (It's not strictly necessary to define the distance in yards, but we do it here for interpretability, since tee shots are usually measured in yards.) Now, we can define a linear model by typing

    golf.model = lm(tps1.first.shot$Hole.Score ~ distance.yards)

From here, we can get a lot of information about the fit of the model by using the summary function.

    > summary(golf.model)

    Call:
    lm(formula = tps1.first.shot$Hole.Score ~ distance.yards)

    Residuals:
        Min      1Q  Median      3Q     Max
    -1.8294 -0.2358 -0.1037  0.6580  1.9724

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)     6.410738   0.711750   9.007  < 2e-16 ***
    distance.yards -0.007970   0.002489  -3.201  0.00151 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.6499 on 301 degrees of freedom
    Multiple R-squared:  0.03293,  Adjusted R-squared:  0.02971
    F-statistic: 10.25 on 1 and 301 DF,  p-value: 0.001514

This is quite a bit of information, and we don't have the tools to understand all of it in this course. (Honestly, it would take about a year of an introductory statistics course to cover all of this material.) The basic idea though is that we are modeling the score a golfer receives on the hole as a random variable $Y_i$, and we are assuming here that it follows the formula

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \tag{1.1}$$

where $X_i$ is the distance of the drive, $\varepsilon_i$ is a normal random variable with mean 0, and $\beta_0$ and $\beta_1$ are constants. The point of the regression is to estimate $\beta_0$ and $\beta_1$. So in this particular case, we have

$$Y_i = 6.410738 - 0.007970 X_i + \varepsilon_i. \tag{1.2}$$

As a sanity check, note that the coefficient $\beta_1$ is negative. This means that longer drives are correlated with lower scores on the hole.

So, suppose we had shots of 250, 275, and 300 yards. What would the expected scores on the hole be? Turning to Equation (1.2), we would get

$$E[Y_1] = 6.410738 - 0.007970(250) = 4.418238.$$

We could similarly compute the other values. Alternatively, R has the built-in command predict. To use this function, we define a new data frame with the necessary predictors. In this case, that's just the distance of the shot.
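As a self-contained illustration of the predict workflow, the sketch below fits a line to simulated data and checks that predict matches the hand formula (intercept plus slope times distance). The data are generated, not taken from the course files, and the names toy, score, and toy.model are assumptions made for this example.

```r
# Simulated data, not the course data: score falls slightly with distance.
set.seed(1)
distance.yards <- runif(50, min = 240, max = 320)
score <- 6.4 - 0.008 * distance.yards + rnorm(50, sd = 0.5)
toy <- data.frame(distance.yards = distance.yards, score = score)

# Fit a simple linear regression of score on drive distance.
toy.model <- lm(score ~ distance.yards, data = toy)

# The column name in newdata must match the predictor name in the formula.
newdata <- data.frame(distance.yards = c(250, 275, 300))
by.predict <- predict(toy.model, newdata)

# Hand calculation from the fitted coefficients.
b <- coef(toy.model)
by.hand <- b[1] + b[2] * newdata$distance.yards

# Compare the two sets of predictions.
print(all.equal(unname(by.predict), unname(by.hand)))
```

Because the hand calculation here uses the full-precision coefficients from coef rather than the rounded values printed by summary, the two results match to numerical tolerance.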

For the golf model, this gives

    > newdata = data.frame(distance.yards = c(250, 275, 300))
    > predict(golf.model, newdata)
           1        2        3
    4.418313 4.219070 4.019827

which agrees with our earlier hand calculation up to the rounding of the coefficients printed by summary.

Exercise 1.3.1. What are the predicted scores for drives of 240, 280, and 310 yards?

1.3.2 Diagnostics

When we wrote Equation (1.1) earlier, we were making some pretty strong assumptions about the model.

Exercise 1.3.2. What are some of the assumptions necessary for a linear model? In particular, what can be said about each individual $Y_i$? How about the $Y_i$ collectively?

Now, we're going to take the step of looking at our fit to the data. In general, looking at the data before fitting a model is bad practice. Why? Well, humans fit all sorts of complicated models to data, and after that has happened, there's no real way to make any statistical guarantees. In any case, we can type the following

    > plot(distance.yards, tps1.first.shot$Hole.Score)
    > abline(golf.model)

This should produce Figure 1.2.

Figure 1.2 Hole score as a function of the yardage of the drive.

This is not a good fit. However, this is not too surprising. For one, the response variable, which is the score in this case, takes only a few values. Second, the trend line doesn't seem to fit the data particularly well. If we want to do better, we'll have to throw more refined data at this problem.

Exercise 1.3.3. Suppose we had data from the entire round. How could we modify the analysis so we would get data that would be closer to normal? Why would it be closer to normal? What assumptions would we have to make that we haven't made for this analysis?

1.4 A Second Look

We're going to try once again to fit the score on a hole as a function of the length of the drive, but we want data that will be more normal. Load the new data by typing

    full.round = read.csv("TorreyPinesSouth.csv")

Exercise 1.4.1. How many unique first names are there among the golfers? How many unique last names are there among the golfers? How many golfers are represented in the data?

Exercise 1.4.2. Write a function hole.scores which takes a player identification number and returns the vector of scores on par four holes. Write a function drive.distances which takes a player identification number and returns the vector of drive distances in yards on par four holes. You should be able to type

    > hole.scores(1810)
     [1] 4 4 5 4 5 4 4 4 4 5
    > drive.distances(1810)
    [1] 310.3889 306.7500 285.4167 300.2222 303.7778 297.9167 295.6667 283.0833
    [9] 292.7778 282.7500

Exercise 1.4.3. Write a function that returns the vector of average scores on par four holes and the vector of average drive distances on par four holes for all golfers. Call the function round.info.

Exercise 1.4.4. Define the linear model average.model so that score is a function of drive distance. Plot the data and the fitted line. Does it look more reasonable to fit a linear model now? How well does the line fit the data?

There are other refinements one could use for this. In particular, we might be interested in the effect that hitting into the rough or a fairway trap has on the outcome for the hole. You could continue analyzing this further by dividing the data appropriately.
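One possible way to divide the data is by where the drive came to rest. The sketch below uses a small made-up data frame; the column lie and its levels are hypothetical and do not come from the course files, so the real column names would need to be looked up in the field definitions.

```r
# Toy data; `lie` and its levels are assumptions made for illustration.
drives <- data.frame(
  lie   = c("fairway", "rough", "fairway", "bunker", "rough", "fairway"),
  score = c(4, 5, 4, 5, 4, 3)
)

# tapply() applies a function to score within each level of lie.
mean.by.lie <- tapply(drives$score, drives$lie, mean)

# split() returns one data frame per group, each of which
# could then be plotted or modeled on its own.
by.lie <- split(drives, drives$lie)

print(mean.by.lie)
print(names(by.lie))
```

With real data, each element of by.lie could be passed to lm separately to compare the fitted lines across groups.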