The Reliability of Intrinsic Batted Ball Statistics Appendix

Glenn Healey, EECS Department
University of California, Irvine, CA 92617

Given information about batted balls for a set of players, we review techniques for estimating the reliability of a statistic as a function of the sample size. We also review methods for using the estimated reliability to compute the variance of true talent and to generate forecasts.

1 Cronbach's alpha estimate for reliability

The reliability of a statistic $S$ for a sample of size $N$ is defined by

$$R(N) = \frac{\sigma_t^2}{\sigma_o^2(N)} \tag{1}$$

where $\sigma_t^2$ is the variance of true talent across players for $S$ and $\sigma_o^2(N)$ is the variance of the observed values across players for $S$ as a function of $N$.

Cronbach's alpha $\alpha(N)$ [2] is an estimate of $R(N)$ that is generated from a data set with $N$ batted balls for each of a set of players. Unlike split-half methods, Cronbach's alpha does not require partitioning of the data set. The estimate $\alpha(N)$ of $R(N)$ is an approximation to the average of all possible split-half correlations that would be computed from a full data set with $2N$ batted balls for each player, where each split-half contains $N$ batted balls per player.

Let $S(i,j)$ be the value of statistic $S$ for batted ball $i$ for player $j$. Cronbach's alpha is given by

$$\alpha(N) = \frac{N}{N-1}\left(1 - \frac{\sum_{i=1}^{N} \sigma_{S_i}^2}{\sigma_{S_T}^2}\right) \tag{2}$$

where $\sigma_{S_i}^2$ is the variance of $S(i,j)$ across players for batted ball $i$ and $\sigma_{S_T}^2$ is the variance across players of the variable

$$S_T(j) = \sum_{i=1}^{N} S(i,j) \tag{3}$$
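As a minimal sketch of equations (2) and (3), the function below computes Cronbach's alpha from an array of per-batted-ball values. The array layout (one row per batted ball, one column per player), the function name, and the use of the sample variance (`ddof=1`) are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def cronbach_alpha(S):
    """Cronbach's alpha for an array S of shape (N, P).

    S[i, j] is the value of the statistic for batted ball i of player j
    (hypothetical layout: N batted balls per player, P players).
    """
    S = np.asarray(S, dtype=float)
    N = S.shape[0]
    # Variance across players for each batted ball index i (sigma^2_{S_i}).
    item_var = S.var(axis=1, ddof=1)
    # Variance across players of the per-player total S_T(j), equation (3).
    total_var = S.sum(axis=0).var(ddof=1)
    # Equation (2). The ratio is the same for any ddof, provided the same
    # value is used for both variances.
    return (N / (N - 1)) * (1.0 - item_var.sum() / total_var)
```

If every batted ball of a player had the same value, the sum of the per-ball variances would be $1/N$ of the variance of the totals and the function would return 1, which matches the interpretation of alpha as a reliability.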

2 Spearman-Brown prophecy formula

The Spearman-Brown prophecy formula [1] [4] allows us to predict an unknown $R(N')$ value from an estimated $R(N)$ value. This is useful for situations where the size of our data set allows us to estimate $R(N)$ using Cronbach's alpha but not $R(N')$, where $N' > N$. The Spearman-Brown formula is

$$R(N') = \frac{K\,R(N)}{1 + (K-1)\,R(N)} \tag{4}$$

where $K = N'/N$.

We used this equation to predict $R(N')$ for the $I$ and $O$ batted ball statistics for pitchers for values of $N'$ that are greater than 400. The most accurate values of $R(N)$ that are estimated using Cronbach's alpha are those for the largest values of $N$. Therefore, we applied equation (4) to the values of $R(N)$ for the six values of $N$ between 395 and 400 to predict $R(N')$ for values of $N'$ above 400 for both statistics. For each $N'$ over 400, the six predictions were averaged to generate the extrapolated $R(N')$. We obtained the result that the predicted $R(N')$ reaches 0.5 at 838 batted balls for $I$ and at 1268 batted balls for $O$.

3 Computing the Variance of True Talent

Given the $\alpha(N)$ estimate of $R(N)$, we can estimate the variance of true talent $\sigma_t^2$ using equation (1) and the sample variances $\sigma_o^2(N)$ for the batted ball statistics. Our data set included the 92 batters and 112 pitchers who had at least 400 batted balls tracked by HITf/x in 2014. Figure 1 plots the standard deviation $\sigma_o(N)$ for $I$ and $O$ across the 92 batters and Figure 2 plots $\sigma_o(N)$ for $I$ and $O$ across the 112 pitchers. We see that $I$ has a smaller variance than $O$ and that $\sigma_o(N)$ tends to decrease with increasing $N$.

Figure 3 plots the estimated $\sigma_t$ as a function of $N$. We see that $\sigma_t$ is nearly constant with $N$, as expected, since the variance of true talent for a statistic is invariant to sample size. For the largest value of $N$, $\sigma_t$ is 35 wOBA points for batters and 14 wOBA points for pitchers.
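The calculations in Sections 2 and 3 can be sketched as a few small functions: equation (4) itself, the averaging of several Spearman-Brown predictions, the sample size at which a target reliability is reached (equation (4) solved for $N'$), and the true-talent standard deviation obtained by rearranging equation (1) with $\alpha(N)$ in place of $R(N)$. All function and parameter names here are hypothetical, and the code illustrates the formulas only; it is not the author's implementation.

```python
import math

def spearman_brown(r_n, n, n_prime):
    """Equation (4): predict R(N') from R(N), with K = N'/N."""
    k = n_prime / n
    return k * r_n / (1.0 + (k - 1.0) * r_n)

def extrapolate_reliability(estimates, n_prime):
    """Average the Spearman-Brown predictions of R(N').

    `estimates` maps each sample size N to its estimated reliability,
    for example the alpha values at the six largest N.
    """
    preds = [spearman_brown(r, n, n_prime) for n, r in estimates.items()]
    return sum(preds) / len(preds)

def n_for_target(r_n, n, target=0.5):
    """Sample size N' at which equation (4) predicts R(N') == target."""
    k = target * (1.0 - r_n) / (r_n * (1.0 - target))
    return k * n

def true_talent_sd(alpha_n, sigma_o_n):
    """Equation (1) rearranged: sigma_t = sqrt(alpha(N) * sigma_o(N)^2)."""
    return math.sqrt(alpha_n * sigma_o_n ** 2)
```

For instance, with a made-up reliability of 0.25 at $N = 100$, `n_for_target(0.25, 100)` gives 300, and `spearman_brown(0.25, 100, 300)` returns 0.5 as a consistency check.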

Figure 1: Standard deviation $\sigma_o$ across 92 batters

Figure 2: Standard deviation $\sigma_o$ across 112 pitchers

Figure 3: Standard deviation $\sigma_t$ of true talent for [...] for batters and pitchers

4 Linear regression for forecasting

Suppose that we are given a set of points $(x_1,y_1),(x_2,y_2),\ldots,(x_P,y_P)$. We can generate a prediction $\hat{y}_i$ for $y_i$ from $x_i$ by using the linear regression model

$$\hat{y}_i = a + b x_i \tag{5}$$

where $e_i = \hat{y}_i - y_i$ is the error for point $i$. Let $\mu_x$ and $\sigma_x$ denote the mean and standard deviation for the $x_i$ and let $\mu_y$ and $\sigma_y$ denote the mean and standard deviation for the $y_i$. The values for the intercept $a$ and the slope $b$ that minimize the squared error

$$E = \sum_{i=1}^{P} e_i^2 \tag{6}$$

are given by

$$a = \mu_y - \frac{r \mu_x \sigma_y}{\sigma_x}, \qquad b = \frac{r \sigma_y}{\sigma_x} \tag{7}$$

where $r$ is the correlation coefficient for the points [3]. The line defined by (7) is called the regression line. The model errors $e_i$ for the regression line are zero-mean and have a standard deviation given by

$$\sigma_e = \sigma_y \sqrt{1 - r^2} \tag{8}$$

where $r^2$ is called r-squared or the coefficient of determination. The value of $\sigma_e$ is a measure of the accuracy of the predictive model. If $r = 1$, then the points all lie on the regression line and $\sigma_e = 0$. If $r = 0$, then from (5) and (7) the prediction simplifies to the horizontal line $\hat{y}_i = \mu_y$ and $\sigma_e = \sigma_y$.

For our application, each point $(x_i, y_i)$ represents the value of a statistic computed for player $i$ over two different samples of data. The reliability estimate $\alpha$ is the expected value of the correlation coefficient $r$. From (8), therefore, larger values of $\alpha$ lead to smaller values of the prediction error $\sigma_e$. Since $\alpha$ is larger for the $I$ statistic than for the $O$ statistic for both batters and pitchers for sufficiently large values of $N$, this leads to smaller prediction errors for $I$.

In this context, $\sigma_y$ represents the standard deviation across players for a batted ball statistic. We showed in Figures 1 and 2 that this standard deviation is always smaller for $I$ than for $O$. Using (8), this also leads to a smaller prediction error $\sigma_e$ for the $I$ statistic than for the $O$ statistic. Figures 4 and 5 plot $\sigma_e$ as a function of $N$ for batters and pitchers for the $I$ and $O$ statistics. We see that $\sigma_e$ is consistently smaller for $I$ than for $O$ for both batters and pitchers.

Figure 4: Prediction error $\sigma_e$ for batters

Figure 5: Prediction error $\sigma_e$ for pitchers
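The regression line of equations (5) to (8) can be sketched as follows. The function name and return layout are assumptions made for illustration. The standard deviations use the population convention (`ddof=0`), under which the standard deviation of the residuals equals $\sigma_y\sqrt{1-r^2}$ exactly.

```python
import numpy as np

def fit_regression_line(x, y):
    """Return (a, b, r, sigma_e) for the least-squares line y_hat = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mu_x, mu_y = x.mean(), y.mean()
    sigma_x, sigma_y = x.std(), y.std()      # population SD (ddof=0)
    r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
    b = r * sigma_y / sigma_x                # slope, equation (7)
    a = mu_y - b * mu_x                      # intercept, equation (7)
    # Equation (8); max() guards against a tiny negative value from rounding.
    sigma_e = sigma_y * np.sqrt(max(0.0, 1.0 - r ** 2))
    return a, b, r, sigma_e
```

In the setting of this section, `x` and `y` would hold the values of one statistic for each player over two different samples, so a larger correlation or a smaller spread in `y` both reduce the returned `sigma_e`.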

5 Regression to the Mean

Suppose that the points $(x_1,y_1),(x_2,y_2),\ldots,(x_P,y_P)$ were generated to compute a split-half correlation for a statistic, where each $x_i$ and each $y_i$ were obtained from $N$ batted balls for player $i$. For this case, we can often assume that $\mu_x = \mu_y$ and $\sigma_x = \sigma_y$. Under this assumption, we can combine equations (5) and (7) to obtain

$$\hat{y}_i = r x_i + (1 - r)\mu \tag{9}$$

where $\mu$ is the shared mean $\mu = \mu_x = \mu_y$. Since $\alpha(N)$ is the expected value of the correlation $r$ over all possible split-half partitions of the data, a prediction equation that doesn't depend on the partition is given by

$$\hat{y}_i = \alpha(N) x_i + (1 - \alpha(N))\mu \tag{10}$$

This relationship is referred to as regression to the mean since the prediction $\hat{y}_i$ is a weighted average of the observed $x_i$ and the mean $\mu$.

References

[1] W. Brown. Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3(3):296-322, October 1910.
[2] L. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, 16(3):297-334, 1951.
[3] N. Draper and H. Smith. Applied Regression Analysis. Wiley, New York, 3rd edition, 1998.
[4] C. Spearman. Correlation calculated from faulty data. British Journal of Psychology, 3(3):271-295, October 1910.