Correlation and regression using the Lahman database for baseball Michael Lopez, Skidmore College

Similar documents
Lab 11: Introduction to Linear Regression

Announcements. Lecture 19: Inference for SLR & Transformations. Online quiz 7 - commonly missed questions

Lesson 14: Modeling Relationships with a Line

Navigate to the golf data folder and make it your working directory. Load the data by typing

100-Meter Dash Olympic Winning Times: Will Women Be As Fast As Men?

y ) s x x )(y i (x i r = 1 n 1 s y Statistics Lecture 7 Exploring Data , y 2 ,y n (x 1 ),,(x n ),(x 2 ,y 1 How two variables vary together

Measuring Batting Performance

a) List and define all assumptions for multiple OLS regression. These are all listed in section 6.5

Additional On-base Worth 3x Additional Slugging?

100-Meter Dash Olympic Winning Times: Will Women Be As Fast As Men?

Salary correlations with batting performance

Running head: DATA ANALYSIS AND INTERPRETATION 1

(per 100,000 residents) Cancer Deaths

MONEYBALL. The Power of Sports Analytics The Analytics Edge

An Analysis of the Effects of Long-Term Contracts on Performance in Major League Baseball

Lecture 22: Multiple Regression (Ordinary Least Squares -- OLS)

Chapter. 1 Who s the Best Hitter? Averages

Building an NFL performance metric

A Shallow Dive into Deep Sea Data Sarah Solie and Arielle Fogel 7/18/2018

Background Information. Project Instructions. Problem Statement. EXAM REVIEW PROJECT Microsoft Excel Review Baseball Hall of Fame Problem

Math SL Internal Assessment What is the relationship between free throw shooting percentage and 3 point shooting percentages?

Projecting Three-Point Percentages for the NBA Draft

The Reliability of Intrinsic Batted Ball Statistics Appendix

Regression Analysis of Success in Major League Baseball

Announcements. Unit 7: Multiple Linear Regression Lecture 3: Case Study. From last lab. Predicting income

Major League Baseball Offensive Production in the Designated Hitter Era (1973 Present)

Chapter 12 Practice Test

The Simple Linear Regression Model ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD

Correction to Is OBP really worth three times as much as SLG?

8th Grade. Data.

The Rise in Infield Hits

The MACC Handicap System

Simulating Major League Baseball Games

Driv e accu racy. Green s in regul ation

Relative Value of On-Base Pct. and Slugging Avg.

CS 221 PROJECT FINAL

27Quantify Predictability U10L9. April 13, 2015

Lesson 2 Pre-Visit Slugging Percentage

Name Student Activity

Algebra 1 Unit 6 Study Guide

Effect of homegrown players on professional sports teams

save percentages? (Name) (University)

B. AA228/CS238 Component

Fastball Baseball Manager 2.5 for Joomla 2.5x

Modeling Fantasy Football Quarterbacks

MiSP Photosynthesis Lab L3

Section I: Multiple Choice Select the best answer for each problem.

Tying Knots. Approximate time: 1-2 days depending on time spent on calculator instructions.

Minimal influence of wind and tidal height on underwater noise in Haro Strait

SAP Predictive Analysis and the MLB Post Season

DATA SCIENCE SUMMER UNI VIENNA

Effects of Incentives: Evidence from Major League Baseball. Guy Stevens April 27, 2013

FORECASTING BATTER PERFORMANCE USING STATCAST DATA IN MAJOR LEAGUE BASEBALL

Boyle s Law: Pressure-Volume Relationship in Gases. PRELAB QUESTIONS (Answer on your own notebook paper)

Should bonus points be included in the Six Nations Championship?

Average Runs per inning,

Draft - 4/17/2004. A Batting Average: Does It Represent Ability or Luck?

Do Clutch Hitters Exist?

How To Win The Ashes A Statistician s Guide

BIOL 101L: Principles of Biology Laboratory

Lab #12:Boyle s Law, Dec. 20, 2016 Pressure-Volume Relationship in Gases

Machine Learning an American Pastime

AggPro: The Aggregate Projection System

A) The linear correlation is weak, and the two variables vary in the same direction.

One could argue that the United States is sports driven. Many cities are passionate and

Is lung capacity affected by smoking, sport, height or gender. Table of contents

Grade: 8. Author(s): Hope Phillips

ISyE 6414 Regression Analysis

Predicting the Total Number of Points Scored in NFL Games

Rating Player Performance - The Old Argument of Who is Bes

Analysis of Highland Lakes Inflows Using Process Behavior Charts Dr. William McNeese, Ph.D. Revised: Sept. 4,

Compression Study: City, State. City Convention & Visitors Bureau. Prepared for

Session 2: Introduction to Multilevel Modeling Using SPSS

ISDS 4141 Sample Data Mining Work. Tool Used: SAS Enterprise Guide

from ocean to cloud HEAVY DUTY PLOUGH PERFORMANCE IN VERY SOFT COHESIVE SEDIMENTS

1. Answer this student s question: Is a random sample of 5% of the students at my school large enough, or should I use 10%?

Boyle s Law: Pressure-Volume Relationship in Gases

Why We Should Use the Bullpen Differently

Title: 4-Way-Stop Wait-Time Prediction Group members (1): David Held

Analysis of Variance. Copyright 2014 Pearson Education, Inc.

Lesson 3 Pre-Visit Teams & Players by the Numbers

Announcements. % College graduate vs. % Hispanic in LA. % College educated vs. % Hispanic in LA. Problem Set 10 Due Wednesday.

Lab 13: Hydrostatic Force Dam It

Transpiration. DataQuest OBJECTIVES MATERIALS

Lab 1. Adiabatic and reversible compression of a gas

Unit 6, Lesson 1: Organizing Data

Two Machine Learning Approaches to Understand the NBA Data

An average pitcher's PG = 50. Higher numbers are worse, and lower are better. Great seasons will have negative PG ratings.

Stats in Algebra, Oh My!

Department of Economics Working Paper Series

JAWS 2: Incident at Arched Rock By Dave Cone Originally published in Bay Currents, Nov 1993

Bioequivalence: Saving money with generic drugs

This page intentionally left blank

CS 7641 A (Machine Learning) Sethuraman K, Parameswaran Raman, Vijay Ramakrishnan

(Under the Direction of Cheolwoo Park) ABSTRACT. Major League Baseball is a sport complete with a multitude of statistics to evaluate a player s

USING A CALCULATOR TO INVESTIGATE WHETHER A LINEAR, QUADRATIC OR EXPONENTIAL FUNCTION BEST FITS A SET OF BIVARIATE NUMERICAL DATA

2014 NATIONAL BASEBALL ARBITRATION COMPETITION ERIC HOSMER V. KANSAS CITY ROYALS (MLB) SUBMISSION ON BEHALF OF THE CLUB KANSAS CITY ROYALS

FIG: 27.1 Tool String

What Makes a Track Fast?

Homework Exercises Problem Set 1 (chapter 2)

Transcription:

Correlation and regression using the Lahman database for baseball Michael Lopez, Skidmore College Overview The Lahman package is a gold mine for statisticians interested in studying baseball. In today s lab, we are going to explore parts of the Lahman package, while also linking to the statistical concepts in a bivariate analysis, including correlation, regression, and R-squared. First, let s install the package, and then we load the libraries that we ll need for this lab. As always, once a package is downloaded, you do not need to run the install.packages() code again. #install.packages("lahman") library(lahman) library(mosaic) There is so much to the Lahman database, and in this course, we ll only touch the tip of the iceberg. First, scroll to the data frame list on the last page of the tutorial. Under Data sets, there is a list of roughly 25 data sets that come with the Lahman package. You can also get a list of the data frames by entering the following command. LahmanData 1. Which data frames in the Lahman package has the largest number of observations? Which have the fewest? We re going to start by using the Teams data. data(teams) head(teams) tail(teams) The Teams data contains team-level information for every year of baseball season, from 1871-2014. That s a lot of years! To make things a bit easier, we are going to focus on the modern era of baseball, which is generally considered to be the period from 1970 onwards. 2. Recall from Lab 1 2.1 Write a command to generate the names of the Teams data. 2.2 Write a command that gives you the dimensions of the Teams data. 2.3 Identify the observation in row 500 and column 10. To reduce a larger data frame into a subset of rows that contain only information meeting a certain criterion, we are going to take advantage of the filter() command. Note: The filter() command comes from the dplyr package, which is one of R s best packages for data manipulation. It loads automatically with mosaic, and you can read more by going to the tutorial here. 1

Teams.1 <- filter(teams, yearid >= 1970) head(teams.1) We store the newer data set as Teams.1, and you can tell Teams.1 is a smaller data set because it begins in 1970, and not 1871. Next, let s create a few missing team-level variables that we are going to need for the lab. In dplyr, we create a series of new variables using the mutate() command. Teams.1 <- mutate(teams.1, X1B = H - X2B - X3B - HR, TB = X1B + 2*X2B + 3*X3B + 4*HR, RC = (H + BB)*TB/(AB + BB), RC.new = (H + BB - CS)*(TB + (0.55*SB))/(AB+BB) ) This creates four new team-level variables, X1B (number of singles), TB (total bases), RC (runs created), and RC.new (a newer runs created formula). The first formula for runs created is the basic one, discussed previously in lecture, while the second one uses a tweak for stolen bases. Notice that in the above code, the data name (Teams.1) remained unchanged. Alternatively, we could have created a new name (say, Teams.2) if we had wanted. Bivariate analysis Correlation and scatter plots Using Teams.1, let s quantify the strength, shape, and direction of the association between runs created and runs. xyplot(r ~ RC, data = Teams.1, main = "Runs ~ Runs Created, 1970-2015") cor(r ~ RC, data = Teams.1) cor(r ~ RC, data = Teams.1)^2 xyplot(r ~ RC.new, data = Teams.1, main = "Runs ~ Runs Created, 1970-2015") cor(r ~ RC.new, data = Teams.1) cor(r ~ RC.new, data = Teams.1)^2 3. Which variable - RC or RC.new - shows a stronger link with runs? What does this entail about the updated stolen bases formula? Let s identify some additional ways of quantifying the association between two variables. As an example, let s look at the relationship between a team s home runs and its runs. cor(r ~ HR, data = Teams.1) cor(r ~ HR, data = Teams.1)^2 4. Interpret the correlation coefficient and the R-squared values between home runs and runs. 2

Simple linear regression. From introductory statistics, you ll remember the simple linear regression (SLR) equation. Recall, here s the estimated SLR fit: ŷ i = ˆβ 0 + ˆβ 1 x i In the model above, ŷ i represents our predicted value of an outcome variable given x i, while ˆβ 0 and ˆβ 1 are the estimated intercepts and slopes, respectively. In our example, we can plug in our variable names as follows: ˆR i = ˆβ 0 + ˆβ 1 HR i 5. Make a scatter plot of R as a function of HR. Using the function, make educated guesses for ˆβ 0 and ˆβ 1. Fortunately, we don t have to make educated guesses. Let s look at the fit a model of runs as a function of home runs using the R command lm(). fit.1 <- lm(r ~ HR, data = Teams.1) summary(fit.1) There s a ton of stuff in the code above, most of which is useful. First, the summary() code gives the regression output, as well as other model characteristics. 6. Write the estimated regression line. Does the actual estimated regression line resemble your guesses from question 5)? 7. Interpret both the intercept and the slope (for HR) in the estimated regression equation. Is the intercept a useful term here? 8. Estimate the number of runs that a team with 250 home runs would score. There is also code to draw an estimated regression line, assuming that you ve already stored a linear model. plotmodel(fit.1) In the code above, R knows that fit.1 contains a linear model of runs as a function of home runs. Hopefully this reflects the line of best fit that you would ve drawn by hand. Let s return to the output of the regression equation. summary(fit.1) 9. Is the link between home runs and runs significant? How can you tell? Related note on significance testing Let s turn out attention to another variable with a lesser link to runs, SB. The following code is used to produce identical output. 3

fit.2 <- lm(r ~ SB, data = Teams.1) summary(fit.2) cor.test(teams.1$r, Teams.1$SB) Can you figure out the link between the two outputs from above? In the first, the relative significance of the slope parameter (as judged by a p value = 0.008619) is identical to the relative significance of the test for the significance of the correlation coefficient (also a p value of 0.008619). 10. Use a similar code to determine if there is there a significant linear association between team runs and the number of times that team was caught stealing? Analyzing several variables simultaneously We close the lab by looking at how one would analyze a subset of variables to judge which are most strongly associated with team success. First, we use the select() command to reduce our data set from several dozen columns to only the ones we are interested in studying. Teams.3 <- select(teams.1, R, RC.new, RC, RA, HR, SO, attendance) Next, we calculate the pairwise correlation coefficient between each variable. cor.matrix <- cor(teams.3) cor.matrix round(cor.matrix, 3) 11. Of the variables listed, which boast the strongest and weakest correlations with runs scored? Sidenote: What does the round() command do? This is a good time to point out that, as is usually the case with observational data like this, strong links between two variables do not entail that one variable causes the other. While hitting home-runs will likely cause teams to score more runs, it is not neccessarily the case that higher attendence causes teams to score more runs. 12. Think of one reason why attendence and runs are significantly correlated besides saying that attendence causes teams to score more runs. Plotting correlations There are lots of fun plots to make in R. Here are a few. install.packages('corrplot') library(corrplot) corrplot(cor.matrix, method="circle", type = "lower") 4

Better team metrics: on base, slugging, OPS Perhaps other functions of traditional baseball statistics are worth looking at. Here, we calculate on base percentage (OBP), slugging percentage (SLG), and OPS, which is the sum of OBP and SLG. Teams.1 <- mutate(teams.1, OBP = (H + BB)/(AB + BB), SLG = (X1B + 2*X2B + 3*X3B + 4*HR)/AB, OPS = OBP + SLG) cor.matrix <- cor(select(teams.1, R, RC, OBP, SLG, OPS)) round(cor.matrix, 2) Repeatability In addition to wanting a metric to correlate with success and to reflect individual talent, it is worth looking at how well a metric can predict a future, unknown performance. This requires some careful coding. Using the dplyr package, we create a new data frame, Teams.2, which arranges the team-level data by franchise and year, and then calculates the number of runs scored in the following year as the variable next.r. Teams.2 <- Teams.1 %>% arrange(franchid, yearid) %>% group_by(franchid) %>% mutate(next.r = lead(rc)) head(teams.2) 13. Verify that the last column of Teams.2 contains the future runs scored for each team in each row. To close, we look at which team-level variables most strongly correlate with future runs scored. cor.matrix <- cor(select(ungroup(teams.2), next.r, R, RC, OBP, SLG, OPS, HR, X1B), use="pairwise.complete.obs") round(cor.matrix, 2) 14. Which variables appear to be the best at predicting a team s runs scored in the following year? Which appears to be the worst? 15. Related: Compare the correlation of number of singles (X1B) to next.r and R, as well as number of home runs (HR) to next.r and R. Think carefully about how this would impact your recommendations of how to build a team. Which measures should be looked at most closely? Which measures appear to be mostly noise? 16. For fun: Go to the Shiny app here. Take a few guesses, and then click View scatterplot of correlation between your guesses and actual correlation. How good are you at guessing the correlation coefficient? 5