The Reliability of ntrinsic Batted Ball Statistics Appendix Glenn Healey, EECS Department University of California, rvine, CA 92617 Given information about batted balls for a set of players, we review techniques for estimating thereliabilityofastatisticasafunctionofthesamplesize. Wealsoreviewmethodsforusing the estimated reliability to compute the variance of true talent and to generate forecasts. 1 Cronbach s alpha estimate for reliability The reliability of a statistic S for a sample of size is defined by R() = σ2 t σ 2 o () (1) where σ 2 t is the variance of true talent across players for S and σ 2 o() is the variance of the observed values across players for S as a function of. Cronbach s alpha α() [2] is an estimate of R() that is generated from a data set with batted balls for each of a set of players. Unlike split-half methods, Cronbach s alpha does not require partitioning of the data set. The estimate α() of R() is an approximation to the average of all possible split-half correlations that would be computed from a full data set with 2 batted balls for each player where each split-half contains batted balls per player. Let S(i, j) be the value of statistic S for batted ball i for player j. Cronbach s alpha is given by α() = ( i=1 σ 2 ) S 1 i 1 σs 2 T (2) where σ 2 S i is the variance of S(i,j) across players for batted ball i and σ 2 S T is the variance of the variable across players. S T (j) = S(i,j) (3) i=1 1
2 Spearman-Brown prophecy formula The Spearman-Brown prophecy formula [1] [4] allows us to predict an unknown R( ) value from an estimated R() value. This is useful for situations where the size of our data set allows us to estimate R() using Cronbach s alpha but not R( ) where >. The Spearman-Brown formula is where K = /. R( ) = KR() 1+(K 1)R() We used this equation to predict R( ) for the and batted ball statistics for pitchers for values of that are greater than 4. The most accurate values of R() that are estimated using Cronbach s alpha are those for the largest values of. Therefore, we applied equation (4) to the values of R() for the six values of between 395 and 4 to predict R( )forvaluesof above4forbothstatistics. Foreach over 4, thesixpredictions were averaged to generate the extrapolated R(). We obtained the result that the predicted R() reaches.5 at 838 batted balls for and at 1268 batted balls for. (4) 3 Computing the Variance of True Talent Given the α() estimate of R(), we can estimate the variance of true talent σ 2 t using equation (1) and the sample variances σo 2 () for the batted ball statistics. ur data set included the92battersand112pitchers whohadatleast 4battedballstracked byhtf/x in 214. Figure 1 plots the standard deviation σ o () for and across the 92 batters and figure2plots σ o () for andacross the112pitchers. Weseethat hasasmaller variance than and that σ o () tends to decrease with increasing. Figure 3 plots the estimated σ t as a function of. We see that σ t is nearly constant with which is proper since the variance of true talent for a statistic is invariant to sample size. For the largest value of, σ t is 35 wba points for batters and 14 wba points for pitchers. 2
.9.8.7 σ o.6.5.3 5 1 15 2 25 3 35 4 Figure 1: Standard deviation σ o across 92 batters.1.8.6 σ ο 5 1 15 2 25 3 35 4 Figure 2: Standard deviation σ o across 112 pitchers 3
.5 batters pitchers.3 σ t.1 3 32 34 36 38 4 Figure 3: Standard deviation σ t of true talent for for batters and pitchers 4 Linear regression for forecasting Suppose that we are given a set of points (x 1,y 1 ),(x 2,y 2 ),...,(x P,y P ). We can generate a prediction ŷ i for y i from x i by using the linear regression model ŷ i = a+bx i (5) where e i = ŷ i y i is the error for point i. Let µ x and σ x denote the mean and standard deviation for the x i and let µ y and σ y denote the mean and standard deviation for the y i. The values for the intercept a and the slope b that minimize the squared error are given by P E = e 2 i (6) i=1 a = µ y rµ xσ y σ x, b = rσ y σ x (7) 4
where r is the correlation coefficient for the points [3]. The line defined by (7) is called the regression line. The model errors e i for the regression line are zero-mean and have a standard deviation given by σ e = σ y 1 r 2 (8) where r 2 is called r-squared or the coefficient of determination. The value of σ e is a measure of the accuracy of the predictive model. f r = 1 then the points all lie on the regression line and σ e =. f r =, then from (5) and (7) the prediction simplifies to the horizontal line ŷ i = µ y and σ e = σ y. For our application, each point (x i,y i ) represents the value of a statistic computed for player i over two different samples of data. The reliability estimate α is the expected value of the correlation coefficient r. From (8), therefore, larger values of α lead to smaller values of the prediction error σ e. Since α is larger for the statistic than for the statistic for both batters and pitchers for sufficiently large values of, this leads to smaller prediction errors for. n this context, σ y represents the standard deviation across players for a batted ball statistic. We showed in figures 1 and 2 that this standard deviation is always smaller for than for. Using (8), this also leads to a smaller prediction error σ e for the statistic than for the statistic. Figures 4 and 5 plot σ e as a function of for batters and pitchers for the and statistics. We see that σ e is consistently smaller for than for for both batters and pitchers. 5
.1.8.6 σ e 5 1 15 2 25 3 35 4 Figure 4: Prediction error σ e for batters.5.3 σ e.1 24 26 28 3 32 34 36 38 4 Figure 5: Prediction error σ e for pitchers 6
5 Regression to the Mean Suppose that the points (x 1,y 1 ),(x 2,y 2 ),...,(x P,y P ) were generated to compute a splithalf correlation for a statistic where each x i and each y i were obtained from batted balls for player i. For this case, we can often assume that µ x = µ y and σ x = σ y. Under this assumption, we can combine equations (5) and (7) to obtain ŷ i = rx i +(1 r)µ (9) where µ is the shared mean µ = µ x = µ y. Since α() is the expected value of the correlation r over all possible split-half partitions of the data, a prediction equation that doesn t depend on the partition is given by ŷ i = α()x i +(1 α())µ (1) This relationship is referred to as regression to the mean since the prediction ŷ i is a weighted average of the observed x i and the mean µ. References [1] W. Brown. Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3(3):296 322, ctober 191. [2] L. Cronbach. Coefficient alpha and the internal structure of tests. Psychometrika, 16(3):297 334, 1951. [3]. Draper and H. Smith. Applied Regression Analysis. Wiley, ew York, 3rd edition, 1998. [4] C. Spearman. Correlation calculated from faulty data. British Journal of Psychology, 3(3):271 295, ctober 191. 7