ISyE 6414 Regression Analysis
Lecture 2: More Simple Linear Regression
- R-squared (coefficient of determination)
- Correlation analysis: Pearson's correlation; Spearman's rank correlation
- Variable transformation: Box-Cox transformation
Reminder: HW#1 due on Friday, May 18 for on-campus students, and on Wednesday, May 23 for distance learning students
Simple Linear Regression: Point Estimation
Data: observe (x_i, y_i) for i = 1, ..., n.
Model: the simple linear regression model
  y_i = β0 + β1 x_i + ε_i,
where the ε_i's are iid with mean 0 and variance σ².
The least squares method: minimize the sum of squared residuals
  Σ_{i=1}^n [y_i − (b0 + b1 x_i)]².
The least squares estimators are
  β̂1 = S_xy / S_xx,   β̂0 = ȳ − β̂1 x̄,
where
  S_xx = Σ(x_i − x̄)² = Σ x_i² − n x̄²,
  S_xy = Σ(x_i − x̄)(y_i − ȳ) = Σ x_i y_i − n x̄ ȳ,
  S_yy = Σ(y_i − ȳ)² = Σ y_i² − n ȳ².
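The shortcut formulas above can be checked numerically. A minimal Python sketch (the course uses R; this is only an illustrative implementation of the least squares formulas):

```python
def ls_fit(x, y):
    """Least squares estimates via the shortcut sums S_xx and S_xy."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum(xi * xi for xi in x) - n * xbar ** 2                  # S_xx = sum x_i^2 - n xbar^2
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar    # S_xy = sum x_i y_i - n xbar ybar
    b1 = sxy / sxx                    # slope estimate
    b0 = ybar - b1 * xbar             # intercept estimate
    return b0, b1

# On data lying exactly on y = 2 + 3x, the fit recovers the line:
b0, b1 = ls_fit([0, 1, 2, 3], [2, 5, 8, 11])
print(b0, b1)  # -> 2.0 3.0
```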
Simple Linear Regression: Point Estimation
The estimator of σ² is
  σ̂² = SSE / (n − 2),
where e_i = y_i − (β̂0 + β̂1 x_i) and
  SSE = Σ_{i=1}^n e_i² = S_yy − β̂1 S_xy = S_yy − S_xy² / S_xx.
Warning: two different schools of notation on sums of squares (SS vs. S in our text):
  SS_tot = Σ_{i=1}^n (Y_i − Ȳ)² = S_yy, the total sum of squares
  SS_err = Σ_{i=1}^n (Y_i − Ŷ_i)², the residual (or error) sum of squares (RSS vs. SSE)
  SS_reg = Σ_{i=1}^n (Ŷ_i − Ȳ)², the explained (or regression) sum of squares (SS_reg vs. SSR)
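The identity SSE = S_yy − S_xy²/S_xx can be verified directly; a small Python check (illustrative only, not part of the course's R code):

```python
def sse_two_ways(x, y):
    """SSE by summing squared residuals, and via S_yy - S_xy^2/S_xx."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse_direct = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sse_formula = syy - sxy ** 2 / sxx
    return sse_direct, sse_formula

d, f = sse_two_ways([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
# the two computations agree; sigma_hat^2 would then be SSE/(n-2)
print(d, f)  # both equal 3.6 here
```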
Inference on the β's
Assumption: the ε_i's are iid N(0, σ²). Then
  (β̂1 − β1) / se(β̂1) ~ T_{n−2},  where se(β̂1) = σ̂ / √S_xx,
  (β̂0 − β0) / se(β̂0) ~ T_{n−2},  where se(β̂0) = σ̂ √(1/n + x̄²/S_xx).
Conduct hypothesis tests or find confidence intervals for β1 or β0.
Inference at a given x_new
At a given x_new, the point estimator of Y is Ŷ = β̂0 + β̂1 x_new.
A (1 − α) confidence interval on the mean response is
  Ŷ ± t_{α/2, n−2} σ̂ √(1/n + (x_new − x̄)²/S_xx).
A (1 − α) prediction interval on a future observation (appropriate for testing data) is
  Ŷ ± t_{α/2, n−2} σ̂ √(1 + 1/n + (x_new − x̄)²/S_xx).
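A sketch of the two interval formulas in Python (illustrative; the t quantile t_{0.025,3} = 3.182 for this toy data set is hard-coded rather than looked up):

```python
import math

def interval_halfwidths(x, y, x_new, t_crit):
    """Half-widths of the mean-response CI and the prediction interval at x_new."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sigma2 = (syy - sxy ** 2 / sxx) / (n - 2)            # sigma_hat^2 = SSE/(n-2)
    ci = t_crit * math.sqrt(sigma2 * (1 / n + (x_new - xbar) ** 2 / sxx))
    pi = t_crit * math.sqrt(sigma2 * (1 + 1 / n + (x_new - xbar) ** 2 / sxx))
    return ci, pi

ci, pi = interval_halfwidths([1, 2, 3, 4, 5], [2, 1, 4, 3, 5], x_new=3.5, t_crit=3.182)
# the prediction interval is always wider than the confidence interval,
# since it adds the extra "1" for the noise in a single future observation
```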
1. Simple Linear Regression: R-squared
Simple linear regression: assume we observe n pairs of data (y_1, x_1), (y_2, x_2), ..., (y_n, x_n). The model is y_i = β0 + β1 x_i + ε_i, where the ε_i's are iid N(0, σ²).
R-squared: a statistic that effectively summarizes how well X can be used to predict Y:
  R² = 1 − SS_err/SS_tot = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)² = 1 − (S_yy − S_xy²/S_xx)/S_yy = S_xy² / (S_xx S_yy),
also called the coefficient of (simple) determination.
The larger R² is, the more the predictor X explains the variability in the response Y.
R-squared
  R² = 1 − SS_err/SS_tot = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
- When all observations fall on the fitted line, R² = 1.
- When the fitted slope β̂1 = 0, we have β̂0 = Ȳ and Ŷ_i = Ȳ, so R² = 0.
- The closer R² is to 1, the greater the degree of linear association between X and Y.
- But a high R² does not indicate that useful prediction can be made (CIs can be wide).
- A small R² does not mean that X and Y are unrelated (the relationship may be nonlinear).
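R² equals the squared Pearson correlation r² (as noted in the correlation slides that follow); a quick Python check of R² = 1 − SSE/SS_tot against r² (illustrative sketch):

```python
def r_squared_and_r(x, y):
    """R^2 = 1 - SSE/SS_tot, and Pearson r = S_xy / sqrt(S_xx S_yy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sse = syy - sxy ** 2 / sxx
    r2 = 1 - sse / syy
    r = sxy / (sxx * syy) ** 0.5
    return r2, r

r2, r = r_squared_and_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(r, r2)  # -> 0.8 0.64, and indeed r2 == r**2
```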
Warning: no intercept
In the model without an intercept, Y_i = β x_i + ε_i, the R-squared is
  R² = 1 − SS_err/SS_tot = 1 − Σ_{i=1}^n (Y_i − β̂ x_i)² / Σ_{i=1}^n Y_i².
2. Correlation Analysis
The correlation coefficient between RVs X and Y is
  ρ = corr(X, Y) = E[(X − μ_x)(Y − μ_y)] / √(Var(X) Var(Y)).
The estimator of the population correlation is known as Pearson's coefficient of correlation:
  r = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / √(Σ_{i=1}^n (X_i − X̄)² · Σ_{i=1}^n (Y_i − Ȳ)²) = S_xy / √(S_xx S_yy).
Note that R² = r², since
  R² = 1 − SS_err/S_yy = 1 − (S_yy − S_xy²/S_xx)/S_yy = S_xy² / (S_xx S_yy).
Pearson's Correlation Coefficient
Properties of r = S_xy / √(S_xx S_yy):
- −1 ≤ r ≤ 1.
- If Y_i tends to increase (decrease) linearly with X_i, then r > 0 (r < 0).
- The closer the points (X_i, Y_i) come to forming a straight line, the closer r is to ±1.
- The magnitude of r is unchanged if either the X or Y sample is transformed linearly.
- The correlation does not depend on which variable is called Y and which is called X.
Properties of r
- If r is near ±1, then there is a strong linear relationship between Y and X; we might predict Y from X via linear regression.
- If r is near 0, there is a weak linear relationship between Y and X.
- Note that r = 0 does not imply that Y and X are not related. For example, r = 0 when Y_i = X_i² with x_i = −2, −1.9, ..., 2 (in R: seq(-2, 2, 0.1)).
Example: Data
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.

Nation           X   Y    | Nation     X   Y    | Nation    X   Y
Bolivia          77  118  | Ethiopia   13  208  | Mexico    91  33
Brazil           69  65   | Finland    95  7    | Poland    98  16
Cambodia         32  184  | France     95  9    | Russian   73  32
Canada           85  8    | Greece     54  9    | Senegal   47  145
China            94  43   | India      89  124  | Turkey    76  87
Czech Republic   99  12   | Italy      95  10   | UK        90  9
Egypt            89  55   | Japan      87  6    |
Example
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.
Question: Are Y and X related (associated), and how?

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
cor(x, y)    # same as cor(x, y, method = "pearson")
[1] -0.7910654

We want to test H0: ρ = 0 based on the magnitude of r.
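The same computation can be reproduced outside R; a pure-Python sketch of Pearson's r on the DPT data, matching the cor(x, y) output above:

```python
x = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
y = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9]

def pearson(x, y):
    """Pearson's r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = pearson(x, y)
print(round(r, 7))  # -> -0.7910654, agreeing with R's cor(x, y)
```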
Test H0: ρ = 0
The correlation coefficient between RVs X and Y is
  ρ = corr(X, Y) = E[(X − μ_x)(Y − μ_y)] / √(Var(X) Var(Y)),
which is estimated by Pearson's correlation coefficient
  r = S_xy / √(S_xx S_yy) = Σ(X_i − X̄)(Y_i − Ȳ) / √(Σ(X_i − X̄)² · Σ(Y_i − Ȳ)²).
How do we use r to test the hypothesis that there is no correlation between X and Y (H0: ρ = 0)?
Assumption of this question: the (X, Y)'s are from a bivariate normal distribution.
The Relationship of r to Linear Regression
In the simple linear regression Y_i = β0 + β1 x_i + ε_i, we have
  β̂1 = S_xy / S_xx = r √(S_yy / S_xx),  since r = S_xy / √(S_xx S_yy),
so β̂1 = 0 if and only if r = 0. Thus, testing H0: ρ = 0 is equivalent to testing H0: β1 = 0.
[When the (X, Y)'s are from a bivariate normal distribution and the (x_i, ε_i) are independent, then ρ = β1 σ_x/σ_y, or β1 = ρ σ_y/σ_x.]
Test Statistic
When testing H0: β1 = 0, the test statistic is
  T_obs = β̂1 / (σ̂ / √S_xx).
Since
  σ̂² = SSE/(n − 2) = (S_yy − S_xy²/S_xx)/(n − 2) = S_yy (1 − r²)/(n − 2)
and β̂1 = S_xy/S_xx = r √(S_yy/S_xx), we have
  T_obs = r √(S_yy/S_xx) · √S_xx / √(S_yy (1 − r²)/(n − 2)) = r √((n − 2)/(1 − r²)).
Under H0: ρ = 0, T_obs has a t-distribution with df = n − 2.
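The algebra above says the two forms of the statistic are identical; a small Python check (illustrative):

```python
import math

def t_two_ways(x, y):
    """T = b1/(sigma_hat/sqrt(S_xx)), and T = r*sqrt((n-2)/(1-r^2))."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    sigma2 = (syy - sxy ** 2 / sxx) / (n - 2)
    r = sxy / math.sqrt(sxx * syy)
    t_reg = b1 / math.sqrt(sigma2 / sxx)             # regression form
    t_cor = r * math.sqrt((n - 2) / (1 - r ** 2))    # correlation form
    return t_reg, t_cor

t1, t2 = t_two_ways([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
# t1 and t2 coincide (both about 2.309 on this toy data)
```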
Example
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.
We observe r = −0.7910654 and n = 20. To decide whether Y and X are associated, we can formulate the problem as testing the null hypothesis H0: ρ = 0 against the alternative hypothesis H1: ρ ≠ 0.
Decision rule: the test statistic is T_obs = r √((n − 2)/(1 − r²)), and we reject H0 at the α level if and only if |T_obs| ≥ t_{α/2, n−2}.
Example
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.
Now we observe r = −0.7910654 and n = 20, so we have
  T_obs = r √((n − 2)/(1 − r²)) = −0.7911 × √(18/(1 − 0.7911²)) ≈ −5.49.
Since t_{0.025, 18} = 2.101 and |T_obs| > t_{α/2, n−2}, we reject H0: ρ = 0 at the 5% level.
Now suppose one claims that ρ = −0.7. Does the observed value differ significantly from this value?
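This arithmetic can be double-checked in Python (the slides use R; an illustrative sketch):

```python
import math

r, n = -0.7910654, 20
t_obs = r * math.sqrt((n - 2) / (1 - r ** 2))
print(round(t_obs, 2))  # -> -5.49
# |t_obs| = 5.49 exceeds t_{0.025,18} = 2.101, so H0: rho = 0 is rejected at the 5% level
```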
Key Fact
R. A. Fisher showed that Z_r = ½ log((1 + r)/(1 − r)) has an approximate normal distribution:
  Z_r ~ N( ½ log((1 + ρ)/(1 − ρ)), 1/(n − 3) ),
where n = # of observations (here and below log = log_e, not log_10).
This result is not sensitive to the bivariate normal assumption, and is useful quite broadly.
Decision Rule
  Z_r ~ N( ½ log((1 + ρ)/(1 − ρ)), 1/(n − 3) )
In the problem of testing H0: ρ = ρ0 vs. H1: ρ ≠ ρ0, the test statistic is
  Z_obs = [ ½ log((1 + r)/(1 − r)) − ½ log((1 + ρ0)/(1 − ρ0)) ] / √(1/(n − 3)),
and we reject H0 at the α level if |Z_obs| ≥ z_{α/2}.
[Some useful critical values for the standard normal: z_0.01 = 2.326, z_0.025 = 1.960, z_0.05 = 1.645, z_0.10 = 1.282.]
Example
Observe r = −0.7910654 and n = 20. We want to test H0: ρ = −0.7 vs. H1: ρ ≠ −0.7 at the 10% level.
The observed Z_r = ½ log((1 + r)/(1 − r)) = −1.0743.
Under H0: ρ = −0.7, Z_r ~ N( ½ log((1 + ρ)/(1 − ρ)), 1/(n − 3) ) = N(−0.8673, 0.0588).
The corresponding (normalized) test statistic is
  Z_obs = (z_r − μ0)/√(var0) = (−1.0743 − (−0.8673)) / √0.0588 ≈ −0.853.
This value does not exceed z_0.05 = 1.645 in magnitude, and thus there is no evidence to reject H0: ρ = −0.7 at the 10% level.
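The Fisher-z arithmetic can be reproduced in Python (illustrative check of the numbers on this slide):

```python
import math

r, n, rho0 = -0.7910654, 20, -0.7
z_r = 0.5 * math.log((1 + r) / (1 - r))           # observed Fisher z
mu0 = 0.5 * math.log((1 + rho0) / (1 - rho0))     # mean of Z_r under H0
z_obs = (z_r - mu0) / math.sqrt(1 / (n - 3))
print(round(z_r, 4), round(mu0, 4), round(z_obs, 3))
# z_r ~ -1.0743, mu0 ~ -0.8673, z_obs ~ -0.853; |z_obs| < 1.645, so do not reject H0
```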
Confidence Interval for ρ
  Z_r ~ N(Z_ρ, 1/(n − 3)),  Z_ρ = ½ log((1 + ρ)/(1 − ρ)).
How to find a confidence interval for ρ:
First, (Z_r − Z_ρ)/√(1/(n − 3)) ~ N(0, 1). Thus, for the observed Z_r, a 100(1 − α)% CI for Z_ρ is
  Z_r ± z_{α/2} √(1/(n − 3)) = (Z_L, Z_U).
Second, if Z = ½ log((1 + ρ)/(1 − ρ)), then ρ = (e^{2Z} − 1)/(e^{2Z} + 1).
So transform back to find the 100(1 − α)% CI for ρ:
  [ (e^{2Z_L} − 1)/(e^{2Z_L} + 1), (e^{2Z_U} − 1)/(e^{2Z_U} + 1) ].
Example: Confidence Interval for ρ
The observed Z_r = ½ log((1 + r)/(1 − r)) = −1.0743.
A 90% CI for Z_ρ is given by
  Z_r ± z_{α/2} √(1/(n − 3)) = −1.0743 ± 1.645/√17 = (−1.4732, −0.6753).
Thus the 90% CI for ρ is
  ( (e^{2(−1.4732)} − 1)/(e^{2(−1.4732)} + 1), (e^{2(−0.6753)} − 1)/(e^{2(−0.6753)} + 1) ) = (−0.900, −0.588).
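The back-transformation (e^{2Z} − 1)/(e^{2Z} + 1) is exactly tanh(Z), so the interval can be sketched in Python as (illustrative arithmetic check):

```python
import math

r, n, z_crit = -0.7910654, 20, 1.645          # z_{0.05} for a 90% CI
z_r = 0.5 * math.log((1 + r) / (1 - r))
half = z_crit * math.sqrt(1 / (n - 3))
z_lo, z_hi = z_r - half, z_r + half
# inverse Fisher z: rho = (e^(2Z) - 1)/(e^(2Z) + 1) = tanh(Z)
rho_lo, rho_hi = math.tanh(z_lo), math.tanh(z_hi)
print(round(rho_lo, 3), round(rho_hi, 3))  # -> -0.9 -0.588
```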
Pearson Correlation r
The Pearson correlation r can be highly influenced by outliers in one or both samples. If we delete the one extreme case with the largest X and smallest Y, then r can change from −1 to 0!
To avoid conclusions that depend heavily on a single observation, use a nonparametric approach.
Spearman's Rank Correlation r_S
1. Order the X_i's and assign them ranks.
2. Do the same for the Y_i's, and replace the original data pairs by the pairs of rank values. (Ties are treated by mid-ranks.)
3. The Spearman rank correlation is the Pearson correlation computed from the pairs of ranks.
[Then we can use T_obs = r_S √((n − 2)/(1 − r_S²)) for testing.]
Spearman's Rank Correlation r_S

 i   X     Rank(X)   Y     Rank(Y)   d = Rank(X) − Rank(Y)
 1   X_1   R_x1      Y_1   R_y1      d_1 = R_x1 − R_y1
 2   X_2   R_x2      Y_2   R_y2      d_2 = R_x2 − R_y2
 ...
 n   X_n   R_xn      Y_n   R_yn      d_n = R_xn − R_yn

If there are no ties, r_S = 1 − 6 Σ_{i=1}^n d_i² / (n(n² − 1)).
Data (ranks)
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992. Ranks in parentheses.

Nation          X (rank)    Y (rank)
Bolivia         77 (8)      118 (16)
Brazil          69 (5)      65 (14)
Cambodia        32 (2)      184 (19)
Canada          85 (9)      8 (3)
China           94 (15)     43 (12)
Czech Republic  99 (20)     12 (8)
Egypt           89 (11.5)   55 (13)
Ethiopia        13 (1)      208 (20)
Finland         95 (17)     7 (2)
France          95 (17)     9 (5)
Greece          54 (4)      9 (5)
India           89 (11.5)   124 (17)
Italy           95 (17)     10 (7)
Japan           87 (10)     6 (1)
Mexico          91 (14)     33 (11)
Poland          98 (19)     16 (9)
Russian         73 (6)      32 (10)
Senegal         47 (3)      145 (18)
Turkey          76 (7)      87 (15)
UK              90 (13)     9 (5)
Spearman's Rank Correlation r_S

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
cor(x, y)    # Pearson's correlation r
[1] -0.7910654
cor(x, y, method = "spearman")
[1] -0.5431913
# Alternative method to compute Spearman's rank correlation
a <- rank(x); b <- rank(y); cor(a, b)
[1] -0.5431913
Example
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.
Spearman's rank correlation is r_S = −0.5431913 and n = 20. In the problem of testing H0: ρ = 0 (no association) vs. H1: ρ ≠ 0, we can use the test statistic
  T_obs = r_S √((n − 2)/(1 − r_S²)) ≈ −2.74,
and a significance level α test is to reject H0 if and only if |T_obs| ≥ t_{α/2, n−2}.
Since t_{0.025, 18} = 2.101 and |T_obs| > 2.101, we reject H0: ρ = 0 at the 5% level.
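A pure-Python sketch of the full Spearman computation (mid-ranks for ties, then Pearson on the ranks), mirroring the R output; illustrative only:

```python
def midranks(v):
    """Ranks with ties replaced by mid-ranks (average of tied positions)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    ranks = [0.0] * len(v)
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        mid = (i + j) / 2 + 1            # 1-based average rank of tied positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = mid
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
y = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9]
r_s = pearson(midranks(x), midranks(y))
t_obs = r_s * ((len(x) - 2) / (1 - r_s ** 2)) ** 0.5
print(round(r_s, 7), round(t_obs, 2))  # -> -0.5431913 -2.74
```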
Spearman's Rank Correlation r_S
- It is not sensitive to outliers.
- In samples without unusual observations and with a linear trend, we often have r_S ≈ r.
- The magnitude of the Spearman correlation does not change if either X or Y (or both) is monotonically transformed.
- If |r_S| is noticeably greater than |r|, then a transformation of the data might provide a stronger linear relationship.
3. Variable Transformation
If the model fit is inadequate, it does not mean that regression is not useful; it just means that the particular linear regression you proposed is not useful.
One problem might be that the relationship between X and Y is not exactly linear. To model a nonlinear relationship, we can transform X or Y (or both) by some nonlinear function, e.g., f(x) = x^a or log x.
Example: assume (X, Y) are related through y = γ e^{θx}. We can transform y to y′ = log y and then fit the new model
  y′ = log γ + θx = β0 + β1 x.
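For the exponential example, the log transform makes the model exactly linear; a Python sketch using hypothetical values γ = 2, θ = 0.5 (illustrative only):

```python
import math

gamma, theta = 2.0, 0.5             # hypothetical true parameters
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [gamma * math.exp(theta * xi) for xi in x]   # y = gamma * e^(theta x), no noise

# transform: y' = log y, then fit y' = beta0 + beta1 * x by least squares
yp = [math.log(yi) for yi in y]
n = len(x)
xbar, ybar = sum(x) / n, sum(yp) / n
b1 = sum((xi - yi0) * (yi - ybar) for xi, yi, yi0 in zip(x, yp, [xbar] * n)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
# b1 recovers theta = 0.5, and exp(b0) recovers gamma = 2.0
print(round(b1, 6), round(math.exp(b0), 6))
```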
Box-Cox Transformation
Does the response Y need transformation? A useful tool is the boxcox function from the MASS library of R, which performs the Box-Cox transformation. It considers power transformations of the form Y^λ, or log Y (λ = 0), and finds the maximum likelihood estimate of λ when fitting the model
  Y_i^(λ) = β0 + β1 x_i + ε_i,  where the ε_i are iid N(0, σ²).
Equivalently, the transformation is
  Y′ = (Y^λ − 1)/λ  if λ ≠ 0,
  Y′ = log Y        if λ = 0.
When λ̂ ≈ 1, there is no need to transform Y.
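The λ = 0 case is the limit of (Y^λ − 1)/λ as λ → 0, which is why log Y fills in there. A small Python sketch of the transformation itself (not a reimplementation of MASS::boxcox, which also profiles the likelihood over λ):

```python
import math

def boxcox_transform(y, lam, eps=1e-12):
    """(y^lam - 1)/lam for lam != 0; log(y) in the limit lam -> 0."""
    if abs(lam) < eps:
        return math.log(y)
    return (y ** lam - 1) / lam

# lam = 1 leaves y essentially untransformed (just shifted by 1):
print(boxcox_transform(5.0, 1.0))  # -> 4.0
# near lam = 0 the transform approaches log(y), so the family is continuous in lam:
print(abs(boxcox_transform(3.0, 1e-8) - math.log(3.0)) < 1e-6)  # -> True
```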
R Example
HW#1, complete GPA data vs. ACT score:

GPAdata <- read.table(file = "http://www.isye.gatech.edu/~ymei/6414/gpadata.csv", sep = ",")
lm1 <- lm(V1 ~ V2, data = GPAdata)
library(MASS)
boxcox(lm1)
### Or alternatively
boxcox(V1 ~ V2, data = GPAdata)
boxcox(lm1)
[plot: Box-Cox profile log-likelihood over the default λ range]
boxcox(lm1, lambda = seq(-10, 10))
[plot: Box-Cox profile log-likelihood for λ in (−10, 10)]
boxcox(lm1, lambda = seq(0, 5, 0.1))
[plot: Box-Cox profile log-likelihood for λ in (0, 5)]
boxcox(lm1, lambda = seq(1, 3, 0.1))
[plot: Box-Cox profile log-likelihood for λ in (1, 3)]
The confidence interval for λ is about [1.5, 2.9]. We see that perhaps Y′ = Y² might be best here; Y^1.5 or Y^2.5 are also possible. Transformation can be useful here.
Try the following R code

GPAdata <- read.table(file = "http://www.isye.gatech.edu/~ymei/6414/gpadata.csv", sep = ",")
GPAdata$V3 <- GPAdata$V1^2
plot(GPAdata$V2, GPAdata$V1, xlab = "ACT", ylab = "GPA")
plot(GPAdata$V2, GPAdata$V3, xlab = "ACT", ylab = "GPA Square")
model1 <- lm(V1 ~ V2, data = GPAdata)
model2 <- lm(V3 ~ V2, data = GPAdata)
summary(model1); summary(model2)
Scatter plots
Left: X vs. original Y.  Right: X vs. Y².
Plot the data and fitted lines
Add fitted regression lines to the scatter plots:

## If you want to plot two plots together (2 columns)
par(mfcol = c(1, 2))
plot(GPAdata$V2, GPAdata$V1, xlab = "ACT", ylab = "GPA")
abline(model1)
plot(GPAdata$V2, GPAdata$V3, xlab = "ACT", ylab = "GPA Square")
abline(model2)
Residuals vs. Fitted
Plot the residuals against the fitted values:

## If you want to plot two plots together (2 columns)
par(mfcol = c(1, 2))
plot(fitted(model1), resid(model1))
abline(0, 0)
plot(fitted(model2), resid(model2))
abline(0, 0)
qqnorm for residuals
Try the following R code:

par(mfcol = c(1, 2))
qqnorm(residuals(model1))
qqline(residuals(model1))
qqnorm(residuals(model2))
qqline(residuals(model2))
QQ plots for the two models
Transformation of Predictors (X)?

> update(lm1, . ~ . + I(V2^2))
Call:
lm(formula = V1 ~ V2 + I(V2^2), data = GPAdata)
Coefficients:
(Intercept)        V2     I(V2^2)
   1.516750  0.089425   -0.001036

This is now multiple linear regression: examine the coefficient of the added term I(V2^2) in the regression.