Running head: DATA ANALYSIS AND INTERPRETATION 1

Data Analysis and Interpretation Final Project
Vernon Tilly Jr.
University of Central Oklahoma

Owners of the various Major League Baseball (MLB) teams are interested in learning ways to recruit, select, and retain their best players. Before we address these items of interest, we need some background on the MLB. The MLB is a professional baseball league consisting of teams that play in the American and National leagues. The league is one of the major professional sports leagues of the United States and Canada. It is composed of 30 teams: 29 in the United States and one in Canada. The MLB had the highest season attendance of any sports league in 2011. There are approximately 1,200 players in the league.

We will use various pieces of data to assist the MLB team owners in their efforts to find ways to recruit, select, and retain their best baseball players. We will use descriptive analysis to help them understand the data used. Specific statistical tests will be run, analyzed, and interpreted for Salary, Homeruns (HR), and Batting Average (AVG), in an effort to see whether there is a relationship between a player's Salary and his homeruns or his batting average.

To begin, we will use the information presented in Table 1. This table lists the variables and their types, as well as their measurement scales. This information helps us determine what we are able to do with the data. Take Team, for example: it is a qualitative variable, which means it is non-numerical and descriptive in nature. Homeruns (HR), on the other hand, is a quantitative variable: it is numerical, so it can be counted and its values are meaningful. We can also denote Team as a cross-sectional data type with a Nominal measurement scale. Cross-sectional simply means the characteristic is recorded across many subjects at one point in time rather than tracked over time. Nominal is the least sophisticated measurement scale: its values only name categories, so there is little arithmetic you can do with them.
Table 1

Variable Name            Variable Type   Data Type         Measurement Scale
Name (of Players)        Qualitative     Cross-sectional   Nominal
Team (Name of Team)      Qualitative     Cross-sectional   Nominal
Salary (in dollars)      Quantitative    Continuous        Ratio
Games Played (G)         Quantitative    Discrete          Ratio
Hits (H)                 Quantitative    Discrete          Ratio
Homeruns (HR)            Quantitative    Discrete          Ratio
Runs Batted In (RBI)     Quantitative    Discrete          Ratio
Batting Average (AVG)    Quantitative    Continuous        Ratio

We will be focusing on three variables, Salary, Homeruns (HR), and Batting Average (AVG), as all represent quantitative data. Salary and Batting Average (AVG) are continuous, which means they can take any of an infinite number of values within an interval. Homeruns (HR), on the flip side, is discrete: it takes only countable whole-number values such as 2, 3, or 4, since a player cannot earn half a homerun. The last item we will address is the measurement scale, as all three have a Ratio scale. The ratio scale is the strongest measurement scale, with a true zero point, which means $0.00 means no money. Ratio data are also meaningful in mathematical calculations, which we will use to arrive at a conclusion and recommendation for the owners.

Given a baseball data set containing a random sample of 254 players with their respective stats, we will investigate the linear relationship, if any, between baseball players' performance and pay. The performance variables, as discussed, will be batting average (AVG) and homeruns (HR). Table 2 below gives the relative frequency of players by MLB league affiliation. It reflects a relative distribution of the 254 sample players of 47.2% for the National league and 52.8% for the American league. This is illustrated as well in Figure 1 for clarity.

Table 2

League      Sample # of Players   Relative Frequency
National    120                   0.4724
American    134                   0.5276
Total       254                   1.0000

Figure 1. Percentage of the random sample of 254 players by MLB league affiliation (pie chart: National 47%, American 53%).

For the given baseball data set containing a random sample of 254 players with their respective stats, we illustrate the mean, median, mode, skew, and standard deviation for Salary in Figure 2 below. We also illustrate the mean, standard deviation, and skew for AVG in Figure 3 below. What we want the owners to take away from this illustration is that the sample Mean represents the average and is pulled by outliers at either end of the spectrum. The Median, on the other hand, is less sensitive to outliers and gives a truer picture of a typical value. The Mode represents the value that occurs most frequently within the data set of each variable, like Salary and AVG. The Standard Deviation measures the amount of dispersion of the sample data points around the central location. We must consider Skew as well, as it reflects how the data values are distributed about the Mean: the closer it is to zero, the more symmetric the distribution.

Figure 2 (Salary) and Figure 3 (AVG)

Statistic                   Salary          AVG
Mean                        4689717.22      0.275527559
Standard Error              301727.4371     0.001413593
Median                      3500000         0.278
Mode                        380000          0.28
Standard Deviation          4808744.053     0.022528972
Sample Variance             2.3124E+13      0.000507555
Kurtosis                    1.311246602     1.333467801
Skewness                    1.299066959     -0.562095159
Range                       23048571        0.143
Minimum                     380000          0.19
Maximum                     23428571        0.333
Sum                         1191188174      69.984
Count                       254             254
Confidence Level (95.0%)    594217.4297     0.002783909

For the given baseball data set, we find the highest-paid player is Mr. Jason Giambi, with a salary of $23,428,571 and an average of 0.289. Mr. Giambi's average falls at approximately the 76th percentile in relation to the other players, so approximately 75% have a lower average and approximately 23% have a higher average, with 3 other players sharing the same average of 0.289. What we need to note here is that the highest average for this random sample of players is 0.333, owned by Mr. Ichiro Suzuki, with a salary of $12,500,000. Using Figures 2 and 3 above, we find that at the 95% confidence level the margin of error for the population mean salary is $594,217.43 and for the population mean AVG is 0.0028. In other words, the 95% confidence interval is the sample mean plus or minus that margin: $4,689,717.22 ± $594,217.43 for Salary and 0.2755 ± 0.0028 for AVG.

To better illustrate the players' salaries, Table 3 gives the relative frequency distribution and Figure 4 a relative frequency histogram below. In Table 3 we can see approximately 80% of the players earn less than $8,510,000. In Figure 4 the relative frequency histogram reflects a positively skewed, or skewed to the right, distribution with a long tail extending to the right. This attribute reflects the presence of a small number of relatively large values.
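The confidence intervals described above can be reproduced from the summary statistics alone. The following is a minimal sketch, assuming the "Confidence Level (95.0%)" row in Figures 2 and 3 is the margin of error (half-width) of the interval; the function and variable names are illustrative and not part of the original analysis.

```python
import math

# Summary statistics copied from Figures 2 and 3.
# "margin" is the "Confidence Level (95.0%)" row, assumed to be the half-width.
SALARY = {"mean": 4689717.22, "sd": 4808744.053, "n": 254, "margin": 594217.4297}
AVG = {"mean": 0.275527559, "sd": 0.022528972, "n": 254, "margin": 0.002783909}


def standard_error(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)


def confidence_interval(mean, margin):
    """Return (lower, upper) for the interval mean +/- margin."""
    return mean - margin, mean + margin


salary_se = standard_error(SALARY["sd"], SALARY["n"])
avg_se = standard_error(AVG["sd"], AVG["n"])

salary_ci = confidence_interval(SALARY["mean"], SALARY["margin"])
avg_ci = confidence_interval(AVG["mean"], AVG["margin"])

# Multiplier implied by the reported numbers (margin divided by standard error).
implied_multiplier = SALARY["margin"] / salary_se

print(f"Salary SE: {salary_se:,.2f}")
print(f"Salary 95% CI: {salary_ci[0]:,.2f} to {salary_ci[1]:,.2f}")
print(f"AVG SE: {avg_se:.6f}")
print(f"AVG 95% CI: {avg_ci[0]:.4f} to {avg_ci[1]:.4f}")
print(f"Implied multiplier: {implied_multiplier:.3f}")
```

The standard errors computed this way should agree with the "Standard Error" row of the figures, which is a quick check that the reported margin belongs to the reported standard deviation and sample size.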

Table 3

Class (in $1,000s)    Frequency   Relative Frequency   Cumulative Frequency   Cumulative Relative Frequency
$350 - $3,070         118         0.4646               118                    0.4646
$3,070 - $5,790       64          0.2520               182                    0.7165
$5,790 - $8,510       20          0.0787               202                    0.7953
$8,510 - $11,230      18          0.0709               220                    0.8661
$11,230 - $13,950     20          0.0787               240                    0.9449
$13,950 - $16,670     10          0.0394               250                    0.9843
$16,670 - $19,390     1           0.0039               251                    0.9882
$19,390 - $22,110     1           0.0039               252                    0.9921
$22,110 - $24,830     2           0.0079               254                    1.0000

Figure 4. Relative frequency histogram: Salary of the Random Sample of 254 MLB Players (x-axis: Salary in $1,000s; y-axis: Percentage of Players).

For clarity, and to offer the owners a different view to consider, Figure 5 below shows the relative frequency polygon for salary. The polygon gives a general idea of the shape of the distribution using the midpoints of the salary classes from our random sample and the frequency distribution. It complements our histogram in Figure 4 above. From this illustration we can see that the largest share of players, about 46%, falls in the lowest salary class, below about $3,070,000. It also illustrates that most of the players earn less than $10,000,000, something to think about.

Figure 5. Relative frequency polygon: Random Sample of 254 MLB Players Salary (x-axis: Salary in $1,000s; y-axis: Percentage of Players).

To illustrate the players' homeruns, Table 4 gives the relative frequency distribution and Figure 6 a histogram below. In Table 4 we can see approximately 80% of the players hit fewer than 196 homeruns. In Figure 6 the histogram reflects a positively skewed, or skewed to the right, distribution with a long tail extending to the right. This attribute reflects the presence of a small number of relatively large values.

Table 4

Number of Home Runs   Frequency   Relative Frequency   Cumulative Frequency   Cumulative Relative Frequency
0 - 98                153         0.6024               153                    0.6024
98 - 196              49          0.1929               202                    0.7953
196 - 294             31          0.1220               233                    0.9173
294 - 392             11          0.0433               244                    0.9606
392 - 490             3           0.0118               247                    0.9724
490 - 588             4           0.0157               251                    0.9882
588 - 686             2           0.0079               253                    0.9961
686 - 784             1           0.0039               254                    1.0000
Total                 254         1.0000

Class interval width: 98
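The relative and cumulative columns of Table 4 follow mechanically from the class counts. The sketch below rebuilds them in Python; the counts and class labels are copied from Table 4, while the function name `frequency_table` is an illustrative choice and not something the original report used.

```python
from itertools import accumulate

# Class labels and counts copied from Table 4 (class width 98).
class_labels = ["0-98", "98-196", "196-294", "294-392",
                "392-490", "490-588", "588-686", "686-784"]
counts = [153, 49, 31, 11, 3, 4, 2, 1]


def frequency_table(counts):
    """Return one (count, relative, cumulative, cumulative_relative) row per class."""
    total = sum(counts)
    cumulative = list(accumulate(counts))
    return [(count, count / total, cum, cum / total)
            for count, cum in zip(counts, cumulative)]


table = frequency_table(counts)

for label, (count, rel, cum, cum_rel) in zip(class_labels, table):
    print(f"{label:>8}  {count:>4}  {rel:.4f}  {cum:>4}  {cum_rel:.4f}")
```

The same function applies unchanged to the salary counts in Table 3 or the batting average counts in Table 5.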

Figure 6. Histogram of Homeruns (x-axis: Number of Homeruns; y-axis: Number of Players). Bar heights by class: 153, 49, 31, 11, 3, 4, 2, 1.

For clarity, and to offer the owners a different view to consider, Figure 7 below shows the relative frequency polygon for Homeruns. The polygon gives a general idea of the shape of the distribution using the midpoints of the homerun classes from our random sample and the frequency distribution. It complements our histogram in Figure 6 above. From this illustration we can see approximately 60% of the players hit fewer than 100 homeruns. It also illustrates that most of the players hit fewer than approximately 250 homeruns, something to think about.

Figure 7. Relative frequency polygon: Random Sample of 254 MLB Players Homeruns (x-axis: Number of Homeruns; y-axis: Percentage of Players).

To illustrate the players' batting average (AVG), Table 5 gives the relative frequency distribution and Figure 8 a relative frequency histogram below. In Table 5 we can see approximately 82% of the players have an average of less than 0.295. In Figure 8 the relative frequency histogram reflects a negatively skewed, or skewed to the left, distribution with a long tail extending to the left. This attribute reflects the presence of a small number of relatively small values.

Table 5

Batting Average (AVG)   Frequency   Relative Frequency   Cumulative Frequency   Cumulative Relative Frequency
0.190 - 0.211           4           0.0157               4                      0.0157
0.211 - 0.232           4           0.0157               8                      0.0315
0.232 - 0.253           31          0.1220               39                     0.1535
0.253 - 0.274           68          0.2677               107                    0.4213
0.274 - 0.295           101         0.3976               208                    0.8189
0.295 - 0.316           40          0.1575               248                    0.9764
0.316 - 0.337           6           0.0236               254                    1.0000
Total                   254         1.0000

Class interval width: 0.021

Figure 8. Relative frequency histogram: Batting Average (x-axis: Average; y-axis: Percentage of Players).

For clarity, and to offer the owners a different view to consider, Figure 9 below shows the relative frequency polygon for Batting Average (AVG). The polygon gives a general idea of the shape of the distribution using the midpoints of the batting average classes from our random sample and the frequency distribution. It complements our histogram in Figure 8 above. From this illustration we can see approximately 40% of the players have an average just below 0.300. It also illustrates that most of the players are within the 0.250 to 0.300 range on batting average, something to think about.

Figure 9. Relative frequency polygon: Random Sample of 254 MLB Players Batting Average (AVG) (x-axis: Average (AVG); y-axis: Percentage of Players).

Based on the raw stats data from the random sample of 254 MLB players, and the information presented and interpreted in written and graphical form for Salary, Homeruns (HR), and Average (AVG), we now have a pretty good idea of the characteristics of each variable on its own. The question now is whether or not a linear relationship exists between these variables. To examine this we set up hypothesis tests, whereby we will reject the null hypothesis in favor of the alternative if the test leads in that direction, or fail to reject the null and keep the status quo. Below are the stated null and alternative hypotheses for Salary & HR as well as for Salary & AVG.

Salary & HR
Null: H0: There is no relationship.
Alternative: HA: There is a relationship.

Salary & AVG
Null: H0: There is no relationship.
Alternative: HA: There is a relationship.

A simple way of comparing two variables is the scatter plot. It can be used to quickly see whether there is a potential relationship between two variables, here Salary paired with Homeruns (HR) and Salary paired with Average (AVG) for the random sample. This is provided as a precursor to the regression analysis coming up. Based on how the data points scatter around the trend line, it looks like there could be a linear relationship between Salary and the independent variables HR and AVG, as presented in Figure 10 and Figure 11 respectively.

Figure 10. Scatter plot: Salary to Homeruns (x-axis: Salary, $0 to $25,000,000; y-axis: Homeruns, 0 to 900).

Figure 11. Scatter plot: Salary to Average (AVG) (x-axis: Salary, $0 to $25,000,000; y-axis: Average, 0 to 0.35).

We will continue our testing by using regression and correlation analysis at the 95% confidence level. Below we have two graphs reflecting our regression of Salary as the dependent variable, also known as the response variable, on the respective independent variables HR and AVG, a.k.a. explanatory variables. Figure 12 represents Salary and Homeruns (HR), and Figure 13 represents Salary and Batting Average (AVG).

Figure 12. HR Line Fit Plot (x-axis: HR, 0 to 1000; y-axis: Salary, $0 to $25,000,000; series: Salary, Predicted Salary, Linear (Salary)). Trend line: y = 27943x + 1E+06, R² = 0.5192.

We note here that the fitted regression equations are printed on both Fig. 12 and Fig. 13 in the upper right-hand corner. For simplicity they are listed below, with coefficients as rounded on the charts:

Salary & HR: y = 27943x + 1E+06, that is, y = 27,943x + 1,000,000
Salary & AVG: y = 8E+07x - 2E+07, that is, y = 80,000,000x - 20,000,000
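To show how the fitted lines translate into dollar figures, here is a small sketch that evaluates them. It uses the full-precision slope and intercept values from the report's regression summary (27943.36 and 1380298.03 for HR; 82783583.99 and -18119441.61 for AVG) next to the rounded HR equation printed on the chart. The inputs of 200 homeruns and a 0.280 average are hypothetical examples chosen for illustration, not players from the sample.

```python
# Full-precision coefficients from the report's regression summary.
HR_SLOPE = 27943.36            # dollars of salary per additional homerun
HR_INTERCEPT = 1380298.03
AVG_SLOPE = 82783583.99        # dollars of salary per 1.000 of batting average
AVG_INTERCEPT = -18119441.61

# Rounded coefficients as displayed on the HR line fit plot (y = 27943x + 1E+06).
HR_SLOPE_CHART = 27943
HR_INTERCEPT_CHART = 1_000_000


def predict_salary(x, slope, intercept):
    """Evaluate the fitted line y = slope * x + intercept."""
    return slope * x + intercept


# Hypothetical inputs for illustration only.
salary_200_hr = predict_salary(200, HR_SLOPE, HR_INTERCEPT)
salary_200_hr_chart = predict_salary(200, HR_SLOPE_CHART, HR_INTERCEPT_CHART)
salary_280_avg = predict_salary(0.280, AVG_SLOPE, AVG_INTERCEPT)

print(f"Predicted salary at 200 HR (full precision): ${salary_200_hr:,.2f}")
print(f"Predicted salary at 200 HR (chart rounding): ${salary_200_hr_chart:,.2f}")
print(f"Predicted salary at 0.280 AVG: ${salary_280_avg:,.2f}")
```

The gap between the two HR predictions comes entirely from the chart rounding the intercept to one significant figure, which is why the full-precision coefficients are the better choice for any actual calculation.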

Figure 13. AVG Line Fit Plot (x-axis: AVG, 0 to 0.35; y-axis: Salary, ($5,000,000) to $25,000,000; series: Salary, Predicted Salary). Trend line: y = 8E+07x - 2E+07, R² = 1 as printed on the chart; Table 6 reports R² = 0.1504 for Salary and AVG.

Based on the regression summary output, we find the respective slope and intercept values. For simplicity they are listed below:

Salary & HR: Slope: 27943.36, Intercept: 1380298.03
Salary & AVG: Slope: 82783583.99, Intercept: -18119441.61

We have run a few more tests, as reflected in Table 6. One is Covariance, which tells us the direction of the linear relationship between two variables. We cannot tell much from this test, save that there seems to be a positive linear relationship. The Correlation coefficient is a better measure of both direction and strength. Based on the data in Table 6, the Correlation of Salary to Homeruns of 0.72 indicates a strong positive linear relationship, as a perfect positive relationship equals 1 and 0 represents no linear relationship. The Correlation of Salary to Batting Average (AVG), at 0.39, is still a positive linear relationship, though much weaker than for Homeruns. Using R Squared, also known as the coefficient of determination, we can state the percentage of the variation in Salary explained by each model. For Salary to Homeruns, the model explains 52%, leaving 48% unexplained. The model for Salary to Batting Average explains 15%, leaving 85% unexplained. This may seem rather weak when spending millions of dollars; however, there is still a linear relationship. To test this we compare the p-value with the 5% level of significance (the 95% confidence level). As stated previously, we will reject the null hypothesis in favor of the alternative if the test leads in that direction, or fail to reject the null and keep the status quo.

Table 6

Measure                                   Value
Covariance between Salary & Home Runs     429686871.1
Correlation between Salary & Home Runs    0.720582577
Covariance between Salary & AVG           42017.18666
Correlation between Salary & AVG          0.387841194
P-Value Salary - HR                       3.30614E-06
P-Value Salary - AVG                      2.67418E-07
R Squared Salary - HR                     0.519239251
R Squared Salary - AVG                    0.150420792

Salary & HR: We reject the null hypothesis since 0.00000330614 < 0.05.
Salary & AVG: We reject the null hypothesis since 0.000000267418 < 0.05.
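The figures in Table 6 are tied together by a few standard identities: the correlation is the covariance divided by the product of the two standard deviations, R Squared in a one-variable regression is the square of the correlation, and the slope is the covariance divided by the variance of the explanatory variable. The sketch below checks these for Salary and AVG using only numbers reported in Table 6 and Figures 2 and 3. It assumes the reported covariance is the sample covariance, and the function names are illustrative.

```python
# Values copied from Table 6 and from Figures 2 and 3.
COV_SALARY_AVG = 42017.18666
R_SALARY_HR = 0.720582577
SD_SALARY = 4808744.053
SD_AVG = 0.022528972
VAR_AVG = 0.000507555
MEAN_SALARY = 4689717.22
MEAN_AVG = 0.275527559


def correlation(cov_xy, sd_x, sd_y):
    """Pearson correlation from a covariance and two standard deviations."""
    return cov_xy / (sd_x * sd_y)


def r_squared(r):
    """Coefficient of determination for a one-variable linear regression."""
    return r * r


def ols_slope(cov_xy, var_x):
    """Least-squares slope: covariance divided by the variance of x."""
    return cov_xy / var_x


def ols_intercept(mean_y, slope, mean_x):
    """Least-squares intercept: the line passes through (mean_x, mean_y)."""
    return mean_y - slope * mean_x


r_avg = correlation(COV_SALARY_AVG, SD_SALARY, SD_AVG)
r2_avg = r_squared(r_avg)
r2_hr = r_squared(R_SALARY_HR)
slope_avg = ols_slope(COV_SALARY_AVG, VAR_AVG)
intercept_avg = ols_intercept(MEAN_SALARY, slope_avg, MEAN_AVG)

print(f"Correlation Salary-AVG: {r_avg:.6f}")
print(f"R Squared Salary-AVG:   {r2_avg:.6f}")
print(f"R Squared Salary-HR:    {r2_hr:.6f}")
print(f"Slope Salary on AVG:    {slope_avg:,.2f}")
print(f"Intercept:              {intercept_avg:,.2f}")
```

Because the inputs are rounded as printed in the report, the recomputed slope and intercept match the regression summary only to about five significant figures, which is enough to confirm that the slope and intercept for AVG are consistent with the covariance in Table 6.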

In conclusion, given p-values that round to 0.0000, the null hypothesis can be rejected for Salary and each of the two independent variables, Homeruns and AVG, at the 5% level of significance. Therefore the decision is to reject the null hypothesis in favor of the alternative hypothesis. The R Squared values support this, with Homeruns coming out ahead of Batting Average in strength, though both show a linear relationship. The bottom line is that there is a linear relationship.

It is recommended that the MLB team owners review this report and ask any questions necessary if they need more clarification on its contents. It appears to this analyst that a larger sample may need to be pulled, which could include the entire population of MLB players, as the stats are available. We recommend the owners consider that most of the bang for the buck is below the $10,000,000 level, and that most homeruns are earned below this level as well. While batting average does have a positive linear relationship to salary, it is not strong, and there could be other variables to consider, like age, games played, and so on. It also appears most averages fall in the 0.250 to 0.300 range, so there is not much negotiating room there. Anything paid above $15,000,000 does not appear to be a good return on performance.

As for recruiting new players, most are hungry just to get in the game, and the stats suggest it, as they want to show they have what it takes to stay in the Major League. Most seem to have good averages and numerous homeruns at the lower level of cost to the owners. It would appear $5,000,000 or less is a good start for recruiting young talent. To select and keep good players, there is plenty of negotiating room as far as salary goes between $5,000,000 and $15,000,000.