ISyE 6414 Regression Analysis


ISyE 6414 Regression Analysis
Lecture 2: More Simple Linear Regression
- R-squared (coefficient of determination)
- Correlation analysis: Pearson's correlation; Spearman's rank correlation
- Variable transformation: Box-Cox transformation
Reminder: HW#1 is due on Friday, May 18 for on-campus students, and on Wednesday, May 23 for distance-learning students.

Simple Linear Regression: Point estimation
Data: observe (x_i, y_i) for i = 1, ..., n.
Model: the simple linear regression model y_i = β_0 + β_1 x_i + ε_i, where the ε_i's are iid with mean 0 and variance σ².
The least squares method: minimize the sum of squared residuals Σ_{i=1}^n [y_i − (b_0 + b_1 x_i)]².
The least squares estimators are β̂_1 = S_xy / S_xx and β̂_0 = ȳ − β̂_1 x̄, where
S_xx = Σ(x_i − x̄)² = Σx_i² − n x̄²,
S_xy = Σ(x_i − x̄)(y_i − ȳ) = Σx_i y_i − n x̄ ȳ,
S_yy = Σ(y_i − ȳ)² = Σy_i² − n ȳ².
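As an aside not on the original slides, the formulas above can be checked numerically in R against lm(); the x and y vectors here are made-up illustrative values:

x <- c(1, 2, 3, 4, 5); y <- c(2.1, 3.9, 6.2, 8.1, 9.8)   # hypothetical data
Sxx <- sum((x - mean(x))^2)                   # S_xx
Sxy <- sum((x - mean(x)) * (y - mean(y)))     # S_xy
b1 <- Sxy / Sxx                               # slope estimate
b0 <- mean(y) - b1 * mean(x)                  # intercept estimate
c(b0, b1)
coef(lm(y ~ x))                               # lm() reproduces the same estimates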

Simple Linear Regression: Point estimation
The estimator of σ² is σ̂² = SS_err / (n − 2), where e_i = y_i − (β̂_0 + β̂_1 x_i) and
SS_err = Σ_{i=1}^n e_i² = S_yy − β̂_1 S_xy = S_yy − S_xy²/S_xx.
Warning: there are two different schools of notation on sums of squares ("SS" vs. "S" in our text):
SS_tot = Σ_{i=1}^n (Y_i − Ȳ)² = S_yy, the total sum of squares
SS_err = Σ_{i=1}^n (Y_i − Ŷ_i)², the residual (or error) sum of squares (RSS vs. SSE)
SS_reg = Σ_{i=1}^n (Ŷ_i − Ȳ)², the explained (or regression) sum of squares (SS_E vs. SSR)

Inference on the β's
Assumption: the ε_i's are iid N(0, σ²). Then
(β̂_1 − β_1) / se(β̂_1) ~ T_{n−2}, with se(β̂_1) = σ̂ / √S_xx
(β̂_0 − β_0) / se(β̂_0) ~ T_{n−2}, with se(β̂_0) = σ̂ √(1/n + x̄²/S_xx)
These can be used to conduct hypothesis tests or to find confidence intervals for β_1 or β_0.
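Continuing the illustrative sketch above (not from the original slides), these standard errors and t-statistics can be computed by hand and compared with summary(lm()):

fit <- lm(y ~ x)
n <- length(x)
sigma_hat <- sqrt(sum(resid(fit)^2) / (n - 2))     # sigma-hat
se_b1 <- sigma_hat / sqrt(Sxx)                     # se(beta1-hat)
se_b0 <- sigma_hat * sqrt(1/n + mean(x)^2 / Sxx)   # se(beta0-hat)
t_b1 <- coef(fit)[2] / se_b1                       # test statistic for H0: beta1 = 0
2 * pt(-abs(t_b1), df = n - 2)                     # two-sided p-value
summary(fit)$coefficients                          # same standard errors and t-values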

Inference at a given x_new
At a given x_new, the point estimator of Y is Ŷ = β̂_0 + β̂_1 x_new.
A (1 − α) confidence interval on the mean response Y is Ŷ ± t_{α/2, n−2} σ̂ √(1/n + (x_new − x̄)²/S_xx).
A (1 − α) prediction interval on a future observation (appropriate for testing data) is Ŷ ± t_{α/2, n−2} σ̂ √(1 + 1/n + (x_new − x̄)²/S_xx).
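In R, the same intervals come from predict() on a fitted lm object; a sketch continuing the illustrative fit above, with x_new = 3.5 chosen arbitrarily:

x_new <- data.frame(x = 3.5)
predict(fit, newdata = x_new, interval = "confidence", level = 0.95)  # CI for the mean response
predict(fit, newdata = x_new, interval = "prediction", level = 0.95)  # wider PI for a future observation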

1. Simple linear regression: R-squared
Simple linear regression: assume we observe n pairs of data (y_1, x_1), (y_2, x_2), ..., (y_n, x_n). The model is y_i = β_0 + β_1 x_i + ε_i, where the ε_i's are iid N(0, σ²).
R-squared: a statistic that summarizes how well X can be used to predict Y:
R² = 1 − SS_err/SS_tot = 1 − Σ(y_i − ŷ_i)²/Σ(y_i − ȳ)² = 1 − (S_yy − S_xy²/S_xx)/S_yy = S_xy²/(S_xx S_yy),
also called the coefficient of (simple) determination. The larger R² is, the more the predictor X explains the variability in the response Y.
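A quick numerical check of this formula, continuing the illustrative fit above (not from the original slides):

SSerr <- sum(resid(fit)^2)          # residual sum of squares
SStot <- sum((y - mean(y))^2)       # total sum of squares, S_yy
1 - SSerr / SStot                   # R-squared from the formula
summary(fit)$r.squared              # matches the R-squared reported by lm()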

R-squared
R² = 1 − SS_err/SS_tot = 1 − Σ(y_i − ŷ_i)²/Σ(y_i − ȳ)²
- When all observations fall on the fitted line, R² = 1.
- When the fitted slope β̂_1 = 0, we have β̂_0 = Ȳ and Ŷ_i = Ȳ, so R² = 0.
- The closer R² is to 1, the greater the degree of linear association between X and Y.
- But a high R² does not guarantee that useful predictions can be made (the intervals can still be wide).
- A small R² does not mean that X and Y are unrelated (the relationship may be nonlinear).

Warning: no intercept
In the model without an intercept, Y_i = βx_i + ε_i, the R-squared is
R² = 1 − SS_err/SS_tot = 1 − Σ_{i=1}^n (Y_i − β̂x_i)² / Σ_{i=1}^n Y_i²
(note that the total sum of squares is not centered at Ȳ).

2. Correlation analysis
The correlation coefficient between random variables X and Y is
ρ = corr(X, Y) = E[(X − μ_x)(Y − μ_y)] / √(Var(X) Var(Y)).
The estimator of the population correlation is known as Pearson's correlation coefficient:
r = Σ_{i=1}^n (X_i − X̄)(Y_i − Ȳ) / √(Σ(X_i − X̄)² Σ(Y_i − Ȳ)²) = S_xy / √(S_xx S_yy).
Note that R² = r², since R² = 1 − SS_err/S_yy = 1 − (S_yy − S_xy²/S_xx)/S_yy = S_xy²/(S_xx S_yy).

Pearson's correlation coefficient
Properties of r = S_xy/√(S_xx S_yy) = Σ(X_i − X̄)(Y_i − Ȳ)/√(Σ(X_i − X̄)² Σ(Y_i − Ȳ)²):
- −1 ≤ r ≤ 1.
- If Y_i tends to increase (decrease) linearly with X_i, then r > 0 (r < 0).
- The closer the points (X_i, Y_i) come to forming a straight line, the closer r is to ±1.
- The magnitude of r is unchanged if either the X or Y sample is transformed linearly.
- The correlation does not depend on which variable is called Y and which is called X.

Properties of r
- If r is near ±1, there is a strong linear relationship between Y and X, and we might predict Y from X via linear regression.
- If r is near 0, there is a weak linear relationship between Y and X.
- Note that r = 0 does not imply that Y and X are not related. For example, take Y_i = X_i² with x_i = −2, −1.9, ..., 2.

Example: Data
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.

Nation          X   Y    | Nation    X   Y    | Nation   X   Y
Bolivia         77  118  | Ethiopia  13  208  | Mexico   91  33
Brazil          69  65   | Finland   95  7    | Poland   98  16
Cambodia        32  184  | France    95  9    | Russia   73  32
Canada          85  8    | Greece    54  9    | Senegal  47  145
China           94  43   | India     89  124  | Turkey   76  87
Czech Republic  99  12   | Italy     95  10   | UK       90  9
Egypt           89  55   | Japan     87  6    |

Example
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.
Question: Are Y and X related (associated), and how?

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
cor(x, y)                        # equivalently cor(x, y, method = "pearson")
[1] -0.7910654

We want to test H_0: ρ = 0 based on the magnitude of r.

Test H_0: ρ = 0
The correlation coefficient between random variables X and Y is
ρ = corr(X, Y) = E[(X − μ_x)(Y − μ_y)] / √(Var(X) Var(Y)),
which is estimated by Pearson's correlation coefficient
r = S_xy / √(S_xx S_yy) = Σ(X_i − X̄)(Y_i − Ȳ) / √(Σ(X_i − X̄)² Σ(Y_i − Ȳ)²).
How can we use r to test the hypothesis that there is no correlation between X and Y (H_0: ρ = 0)?
Assumption for this question: the (X, Y)'s are from a bivariate normal distribution.

The relationship of r to linear regression
In the simple linear regression Y_i = β_0 + β_1 x_i + ε_i, we have
β̂_1 = S_xy / S_xx = r √(S_yy / S_xx), since r = S_xy / √(S_xx S_yy),
so β̂_1 = 0 if and only if r = 0. Thus, testing H_0: ρ = 0 is equivalent to testing H_0: β_1 = 0.
[When the (X, Y)'s are from a bivariate normal distribution and the (x_i, ε_i) are independent, ρ = β_1 σ_x / σ_y, or equivalently β_1 = ρ σ_y / σ_x.]

Test statistic
When testing H_0: β_1 = 0, the test statistic is T_obs = β̂_1 / (σ̂ / √S_xx).
Since σ̂² = SS_err/(n − 2) = (S_yy − S_xy²/S_xx)/(n − 2) = S_yy(1 − r²)/(n − 2), r = S_xy/√(S_xx S_yy), and β̂_1 = S_xy/S_xx = r √(S_yy/S_xx), we have
T_obs = r √(S_yy/S_xx) √S_xx / √(S_yy(1 − r²)/(n − 2)) = r √(n − 2) / √(1 − r²).
Under H_0: ρ = 0, T_obs has a t-distribution with df = n − 2.
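In R, cor.test() carries out exactly this t-test; a sketch (not from the slides) using the x and y immunization vectors defined on the earlier example slide:

ct <- cor.test(x, y)    # Pearson test of H0: rho = 0
ct$statistic            # t = r * sqrt(n - 2) / sqrt(1 - r^2), about -5.49 here
ct$p.value              # two-sided p-value on n - 2 = 18 degrees of freedom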

Example
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.
We observe r = -0.7910654 and n = 20. To decide whether Y and X are associated, we formulate the problem as testing the null hypothesis H_0: ρ = 0 against the alternative hypothesis H_1: ρ ≠ 0.
Decision rule: the test statistic is T_obs = r √(n − 2) / √(1 − r²), and we reject H_0 at level α if and only if |T_obs| ≥ t_{α/2, n−2}.

Example
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.
With r = -0.7910654 and n = 20, we have T_obs = r √(n − 2) / √(1 − r²) = -0.791 × √18 / √(1 − 0.791²) ≈ -5.49.
Since t_{0.025,18} = 2.101 and |T_obs| > t_{α/2, n−2}, we reject H_0: ρ = 0 at the 5% level.
Now suppose one claims that ρ = -0.7. Does the observed value differ significantly from this value?

Key fact
R. A. Fisher showed that Z_r = (1/2) log[(1 + r)/(1 − r)] has an approximate normal distribution:
Z_r ~ N( (1/2) log[(1 + ρ)/(1 − ρ)], 1/(n − 3) ),
where n = number of observations (here and below, log = log_e, not log_10).
The result is not sensitive to the bivariate normal assumption and is useful quite broadly.

Decision rule
Z_r ~ N( (1/2) log[(1 + ρ)/(1 − ρ)], 1/(n − 3) )
For testing H_0: ρ = ρ_0 vs. H_1: ρ ≠ ρ_0, the test statistic is
Z_obs = [ (1/2) log((1 + r)/(1 − r)) − (1/2) log((1 + ρ_0)/(1 − ρ_0)) ] / √(1/(n − 3)),
and we reject H_0 at level α if |Z_obs| ≥ z_{α/2}.
[Some useful critical values for the standard normal: z_0.01 = 2.326, z_0.025 = 1.960, z_0.05 = 1.645, z_0.10 = 1.282.]
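A small helper function sketching this decision rule (the function name is ours, not from the slides); in R, atanh(r) equals (1/2) log[(1 + r)/(1 − r)]:

fisher_z_test <- function(r, n, rho0 = 0) {
  z_obs <- (atanh(r) - atanh(rho0)) / sqrt(1 / (n - 3))  # normalized test statistic
  c(z_obs = z_obs, p_value = 2 * pnorm(-abs(z_obs)))     # two-sided p-value
}
fisher_z_test(-0.7910654, 20, rho0 = -0.7)               # reproduces the example on the next slides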

Example
We observe r = -0.7910654 and n = 20, and want to test H_0: ρ = -0.7 vs. H_1: ρ ≠ -0.7 at the 10% level.
The observed Z_r = (1/2) log[(1 + r)/(1 − r)] = -1.0743.
Under H_0: ρ = -0.7, Z_r ~ N( (1/2) log[(1 + ρ)/(1 − ρ)], 1/(n − 3) ) = N(-0.8673, 0.0588).
The corresponding (normalized) test statistic is
Z_obs = (z_r − μ_0)/√(var_0) = [(-1.0743) − (-0.8673)] / √0.0588 ≈ -0.854.
Since |Z_obs| does not exceed z_0.05 = 1.645, there is no evidence to reject H_0: ρ = -0.7 at the 10% level.

Confidence interval for ρ
Z_r ~ N(Z_ρ, 1/(n − 3)), where Z_ρ = (1/2) log[(1 + ρ)/(1 − ρ)].
How do we find a confidence interval for ρ?
First, (Z_r − Z_ρ)/√(1/(n − 3)) ~ N(0, 1), so for the observed Z_r, a 100(1 − α)% CI for Z_ρ is
Z_r ± z_{α/2} √(1/(n − 3)) = (Z_L, Z_U).
Second, if Z = (1/2) log[(1 + ρ)/(1 − ρ)], then ρ = (e^{2Z} − 1)/(e^{2Z} + 1).
So transform back to find the 100(1 − α)% CI for ρ:
[ (e^{2Z_L} − 1)/(e^{2Z_L} + 1), (e^{2Z_U} − 1)/(e^{2Z_U} + 1) ].
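A sketch of this two-step construction in R (the helper name is ours); tanh() is the inverse of the transform Z = (1/2) log[(1 + ρ)/(1 − ρ)]:

fisher_z_ci <- function(r, n, alpha = 0.10) {
  half <- qnorm(1 - alpha / 2) / sqrt(n - 3)   # z_{alpha/2} * sqrt(1/(n-3))
  tanh(atanh(r) + c(-half, half))              # back-transform (Z_L, Z_U) to the rho scale
}
fisher_z_ci(-0.7910654, 20)                    # roughly (-0.90, -0.59); see the next slide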

Example: confidence interval for ρ
The observed Z_r = (1/2) log[(1 + r)/(1 − r)] = -1.0743.
A 90% CI for Z_ρ = (1/2) log[(1 + ρ)/(1 − ρ)] is given by
Z_r ± z_{α/2} √(1/(n − 3)) = -1.0743 ± 1.645 × √(1/17) = (-1.4732, -0.6754).
Thus the 90% CI for ρ is
( (e^{2(-1.4732)} − 1)/(e^{2(-1.4732)} + 1), (e^{2(-0.6754)} − 1)/(e^{2(-0.6754)} + 1) ) = (-0.900, -0.588).

Pearson correlation r
The Pearson correlation r can be highly influenced by outliers in one or both samples. In such a case, deleting the one extreme observation with the largest X and smallest Y can change r from nearly -1 to nearly 0!
To avoid conclusions that depend heavily on a single observation, use a nonparametric approach.

Spearman's rank correlation r_S
1. Order the X_i's and assign them ranks.
2. Do the same for the Y_i's, and replace the original data pairs by the pairs of rank values (ties are treated by mid-ranks).
3. The Spearman rank correlation is the Pearson correlation computed from the pairs of ranks.
[Then we can use T_obs = r_S √(n − 2) / √(1 − r_S²) for testing.]

Spearman's rank correlation r_S

 i    X     Rank(X)   Y     Rank(Y)   d_i = Rank(X_i) − Rank(Y_i)
 1    X_1   R_x1      Y_1   R_y1      d_1 = R_x1 − R_y1
 2    X_2   R_x2      Y_2   R_y2      d_2 = R_x2 − R_y2
 ...
 n    X_n   R_xn      Y_n   R_yn      d_n = R_xn − R_yn

If there are no ties, r_S = 1 − 6 Σ_{i=1}^n d_i² / [n(n² − 1)].

Data (ranks)
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992. Ranks are shown in parentheses.

Nation          X         Y        | Nation    X         Y        | Nation   X        Y
Bolivia         77 (8)    118 (16) | Ethiopia  13 (1)    208 (20) | Mexico   91 (14)  33 (11)
Brazil          69 (5)    65 (14)  | Finland   95 (17)   7 (2)    | Poland   98 (19)  16 (9)
Cambodia        32 (2)    184 (19) | France    95 (17)   9 (5)    | Russia   73 (6)   32 (10)
Canada          85 (9)    8 (3)    | Greece    54 (4)    9 (5)    | Senegal  47 (3)   145 (18)
China           94 (15)   43 (12)  | India     89 (11.5) 124 (17) | Turkey   76 (7)   87 (15)
Czech Republic  99 (20)   12 (8)   | Italy     95 (17)   10 (7)   | UK       90 (13)  9 (5)
Egypt           89 (11.5) 55 (13)  | Japan     87 (10)   6 (1)    |

Spearman's rank correlation r_S

x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
cor(x, y)                       # Pearson's correlation r
[1] -0.7910654
cor(x, y, method = "spearman")  # Spearman's rank correlation
[1] -0.5431913

# Alternative method: compute the Pearson correlation of the ranks
a <- rank(x); b <- rank(y)
cor(a, b)
[1] -0.5431913

Example
X = percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992.
Spearman's rank correlation is r_S = -0.5431913 with n = 20. For testing H_0: ρ = 0 (no association) vs. H_1: ρ ≠ 0, we can use the test statistic
T_obs = r_S √(n − 2) / √(1 − r_S²) ≈ -2.74,
and a level-α test rejects H_0 if and only if |T_obs| ≥ t_{α/2, n−2}.
Since t_{0.025,18} = 2.101 and |T_obs| = 2.74 > 2.101, we reject H_0: ρ = 0 at the 5% level.
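In R, cor.test() with method = "spearman" gives a corresponding test (it reports an S statistic and, when ties are present, an asymptotic p-value rather than the t approximation above); the t approximation itself is easy to compute:

cor.test(x, y, method = "spearman")        # Spearman test (warns about ties in y)
rs <- cor(x, y, method = "spearman")
rs * sqrt((length(x) - 2) / (1 - rs^2))    # t approximation, about -2.74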

Spearman's rank correlation r_S
- It is not sensitive to outliers.
- In samples without unusual observations and with a linear trend, we often have r_S ≈ r.
- The magnitude of the Spearman correlation does not change if either X or Y (or both) is monotonically transformed.
- If r_S is noticeably greater than r, then a transformation of the data might provide a stronger linear relationship.

3. Variable transformation
If the model fit is inadequate, it does not mean that regression is not useful; it just means that the linear regression you proposed is not useful.
One problem might be that the relationship between X and Y is not exactly linear. To model a nonlinear relationship, we can transform X or Y (or both) by some nonlinear function, e.g., f(x) = x^a or log(x).
Example: assume X and Y are related through y = γ e^{θx}. We can transform y to y' = log(y) and then fit the new model y' = log(γ) + θx = β_0 + β_1 x.
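A minimal illustration of this log transformation on simulated data (the values γ = 2 and θ = 0.3 are made up, not from the slides):

set.seed(1)
x <- seq(1, 10, by = 0.5)
y <- 2 * exp(0.3 * x + rnorm(length(x), sd = 0.1))  # y = gamma * exp(theta * x) with noise
coef(lm(log(y) ~ x))    # intercept estimates log(gamma) = 0.69, slope estimates theta = 0.3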

Box-Cox transformation
Does the response Y need transformation? A useful tool is the boxcox function from the MASS library in R, which performs the Box-Cox transformation. It considers power transformations of the form Y^λ (or log Y when λ = 0) and finds the maximum likelihood estimate of λ when fitting the model
Y_i^(λ) = β_0 + β_1 x_i + ε_i, where the ε_i are iid N(0, σ²).
Equivalently, the transformation is
Y^(λ) = (Y^λ − 1)/λ if λ ≠ 0, and log Y if λ = 0.
When λ ≈ 1, there is no need to transform Y.

R example
HW#1: complete GPA data vs. ACT score.

GPAdata <- read.table(file = "http://www.isye.gatech.edu/~ymei/6414/gpadata.csv", sep=",")
lm1 <- lm(V1 ~ V2, data = GPAdata)
library(MASS)
boxcox(lm1)
### Or alternatively
boxcox(V1 ~ V2, data = GPAdata)

boxcox(lm1)   [plot of the Box-Cox profile log-likelihood over λ]

boxcox(lm1, lambda = seq(-10, 10))   [plot over a wider λ range]

boxcox(lm1, lambda = seq(0, 5, 0.1))   [plot zoomed in on λ between 0 and 5]

boxcox(lm1, lambda = seq(1, 3, 0.1))
The confidence interval for λ is about [1.5, 2.9]. We see that Y' = Y² might be best here; Y^1.5 or Y^2.5 are also possible. Transformation can be useful here.

Try the following R code

GPAdata <- read.table(file = "http://www.isye.gatech.edu/~ymei/6414/gpadata.csv", sep=",")
GPAdata$V3 <- GPAdata$V1^2
plot(GPAdata$V2, GPAdata$V1, xlab = "ACT", ylab = "GPA")
plot(GPAdata$V2, GPAdata$V3, xlab = "ACT", ylab = "GPA squared")
model1 <- lm(V1 ~ V2, data = GPAdata)
model2 <- lm(V3 ~ V2, data = GPAdata)
summary(model1); summary(model2)

Scatter plots. Left: x vs. the original Y. Right: x vs. Y².

Plot the data and fitted lines
Add the fitted regression line to the scatter plots:

## If you want to plot two plots together (2 columns)
par(mfcol = c(1, 2))
plot(GPAdata$V2, GPAdata$V1, xlab = "ACT", ylab = "GPA")
abline(model1)
plot(GPAdata$V2, GPAdata$V3, xlab = "ACT", ylab = "GPA squared")
abline(model2)


Residuals vs. fitted values
Plot the residuals against the fitted values:

## If you want to plot two plots together (2 columns)
par(mfcol = c(1, 2))
plot(fitted(model1), resid(model1))
abline(0, 0)
plot(fitted(model2), resid(model2))
abline(0, 0)


qqnorm for residuals
Try the following R code:

par(mfcol = c(1, 2))
qqnorm(residuals(model1))
qqline(residuals(model1))
qqnorm(residuals(model2))
qqline(residuals(model2))

QQ plots for the two models

Transformation of predictors (X)?

> update(lm1, . ~ . + I(V2^2))

Call:
lm(formula = V1 ~ V2 + I(V2^2), data = GPAdata)

Coefficients:
(Intercept)        V2     I(V2^2)
   1.516750  0.089425   -0.001036

Multiple linear regression: examine the coefficient (and significance) of the added term in the regression.