Announcements: Problem Set 10 due Wednesday.

UNIT 7: MULTIPLE LINEAR REGRESSION
LECTURE 1: INTRODUCTION TO MLR
STATISTICS 101
Nicole Dalzell
June 15, 2015

Statistics 101 (Nicole Dalzell), U7 - L1: Multiple Linear Regression, June 15, 2015

Recap: % College educated vs. % Hispanic in LA

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

[Figure: maps of % college graduate and % Hispanic by LA zip code area (with freeways marked), and a scatterplot of % college graduate vs. % Hispanic.]
Recap: % College educated vs. % Hispanic in LA - linear model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    0.7290      0.0308    23.68    0.0000
%Hispanic     -0.7527      0.0501   -15.01    0.0000

Participation question: Which of the below is the best interpretation of the slope?

(a) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
(b) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
(c) An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
(d) In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA? How reliable is this p-value if these zip code areas are not randomly selected?

Recap: Inference for the slope of a SLR model (only one explanatory variable):

Hypothesis test: T = (b1 - null value) / SE_b1, with df = n - 2
Confidence interval: b1 ± t*_(df = n - 2) × SE_b1

The null value is often 0, since we are usually checking for any relationship between the explanatory and the response variable. The regression output gives b1, SE_b1, and the two-tailed p-value for the t-test of the slope with a null value of 0. We rarely do inference on the intercept, so we'll focus on the estimates and inference for the slope.

SLR: Categorical Predictors - Dinosaur Weight

What relationship do you see between the weight of dinosaurs and the type of dinosaur?
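The deck's examples use R, but the slope inference above is simple arithmetic. Here is a quick sketch in Python that reproduces the test statistic and a 95% confidence interval from the output above; the critical value t* ≈ 1.98 for df = 98 is an assumption looked up from a t-table, not something printed in the output.

```python
# Slope inference for the % college graduate vs. % Hispanic SLR,
# using the estimates printed in the regression output above.
b1 = -0.7527      # estimated slope for %Hispanic
se_b1 = 0.0501    # standard error of the slope
n = 100           # 100 LA zip code areas, so df = n - 2 = 98

# Test statistic for H0: beta_1 = 0
t_stat = (b1 - 0) / se_b1

# 95% CI: b1 +/- t* x SE_b1; t* ~ 1.98 for df = 98 (assumed from a t-table)
t_star = 1.98
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)

print(round(t_stat, 2))   # close to the -15.01 shown in the output
print(tuple(round(x, 3) for x in ci))
```

The small discrepancy with the printed t value (-15.01) comes from rounding in the displayed estimates.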
[Figure: boxplots of dinosaur weight (kg), from 0 to about 8×10^4, by type: Ornithischian vs. Saurischian.]
SLR: Categorical Predictors - Dinosaurs!

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)          2786        4422    0.630    0.5316
typeSaurischian     13652        5968    2.288    0.0265

weight-hat = 2786 + 13652 × TypeSaurischian

Type of dinosaur is a categorical variable with two levels: Ornithischian and Saurischian.
- For Ornithischian dinosaurs: plug in 0 for TypeSaurischian.
- For Saurischian dinosaurs: plug in 1 for TypeSaurischian.

Slope b1: We expect that Saurischian dinosaurs weighed, on average, 13,652 kilograms more than Ornithischian dinosaurs.

Return to the scene of the crime: Murder Rates by Country

Last class, we used the poverty rate in a district to help predict the number of annual murders per million in that district.

murders-hat = -29.91 + 2.56 × poverty

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -29.901       7.789   -3.839    0.0000
percpov         2.559       0.390    6.562  3.64e-06

Data source: http://www.nationmaster.com/country-info/stats/Crime/Murders/Per-capita

Do we think that poverty rates are the only thing that influences the number of annual murders per million in a district?
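Since TypeSaurischian is an indicator that is 0 for Ornithischian and 1 for Saurischian, the fitted equation yields the two group means directly. A minimal Python check of the plug-in arithmetic above:

```python
# Fitted model: weight-hat = 2786 + 13652 * TypeSaurischian
b0, b1 = 2786, 13652

ornithischian = b0 + b1 * 0   # indicator = 0 for the baseline level
saurischian = b0 + b1 * 1     # indicator = 1 for Saurischian

print(ornithischian, saurischian)  # 2786 16438
```

So the intercept is the mean weight for the baseline (Ornithischian) group, and the slope is the difference in mean weights between the two groups.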
Data from the ACS

A random sample of 783 observations from the 2012 ACS.

1. income: yearly income (wages and salaries)
2. employment: employment status (not in labor force, unemployed, or employed)
3. hrs_work: weekly hours worked
4. race: race (White, Black, Asian, or other)
5. age: age
6. gender: gender (male or female)
7. citizen: whether the respondent is a US citizen or not
8. time_to_work: travel time to work
9. lang: language spoken at home (English or other)
10. married: whether the respondent is married or not
11. edu: education level (hs or lower, college, or grad)
12. disability: whether the respondent is disabled or not
13. birth_qrtr: quarter in which the respondent was born (jan thru mar, apr thru jun, jul thru sep, or oct thru dec)

We have 1 response variable (income) and 12 potential explanatory variables.

- How do we fit a model and interpret the coefficients with so many predictors?
- What do we do with a mix of categorical and numerical explanatory variables?
- How do we determine which (if any) of them are important in our model?
- How do we determine if our model is any good?

Everybody in the Pool

How do we interpret all of this?

Examples: How would we interpret the coefficient for hrs_work?
MLR (Multiple Linear Regression)

In MLR, everything is conditional on all other variables in the model: all estimates in a MLR for a given variable are conditional on all other variables being in the model.

Slope:
- Numerical x: All else held constant, for a one unit increase in x_i, y is expected to be higher/lower on average by b_i units.
- Categorical x: All else held constant, the predicted difference in y between the baseline level and the given level of x_i is b_i.

Examples: How would we interpret the coefficient for hrs_work? How would we interpret the coefficient for genderfemale?
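To make "all else held constant" concrete, here is a small Python sketch using just the intercept, hrs_work, and genderfemale estimates from the ACS model output shown later in this lecture (the other predictors are dropped purely for illustration). Changing one predictor while fixing the others moves the prediction by exactly that predictor's coefficient:

```python
# Selected coefficients from the ACS income model (see the model output slide);
# the remaining predictors are omitted here to keep the sketch short.
b0 = -15342.76        # intercept
b_hrs = 1048.96       # hrs_work
b_female = -17135.05  # genderfemale (indicator: 1 = female, 0 = male)

def predicted_income(hrs_work, female):
    # Simplified prediction using only two of the model's predictors
    return b0 + b_hrs * hrs_work + b_female * female

# One more hour of work, gender held constant: prediction moves by b_hrs
diff_hours = predicted_income(41, 1) - predicted_income(40, 1)

# Female vs. male, hours held constant: prediction moves by b_female
diff_gender = predicted_income(40, 1) - predicted_income(40, 0)

print(round(diff_hours, 2), round(diff_gender, 2))
```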
Categorical Predictors with Multiple Levels

We have several categorical variables in this study. Some are binary (i.e., have only two levels), but others are not.

- employment: employment status (not in labor force, unemployed, or employed)
- race: race (White, Black, Asian, or other)
- gender: gender (male or female)
- citizen: whether the respondent is a US citizen or not
- lang: language spoken at home (English or other)
- married: whether the respondent is married or not
- edu: education level (hs or lower, college, or grad)
- disability: whether the respondent is disabled or not
- birth_qrtr: quarter in which the respondent was born (jan thru mar, apr thru jun, jul thru sep, or oct thru dec)

Birth Quarter Coefficients

birth_qrtr has four levels: jan thru mar, apr thru jun, jul thru sep, oct thru dec. How many coefficients do we see for birth_qrtr?

Race Coefficients

Categorical predictors and slopes for (almost) each level

When we are working with a categorical variable with k levels, we only see k - 1 parameters being estimated in the model. This happens because one level of the variable is absorbed into the intercept of the model; that level is called the baseline. So, what happened to the folks born in January through March?
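The k - 1 coefficients correspond to k - 1 indicator (dummy) variables, one for each non-baseline level. A small Python sketch of how a 4-level birth_qrtr value maps to three indicators, with the baseline (jan thru mar) getting all zeros:

```python
levels = ["jan thru mar", "apr thru jun", "jul thru sep", "oct thru dec"]
baseline = levels[0]  # absorbed into the intercept

def indicators(quarter):
    # One indicator per non-baseline level: k = 4 levels -> 3 indicators
    return [1 if quarter == level else 0 for level in levels[1:]]

print(indicators("jan thru mar"))  # [0, 0, 0] -- the baseline
print(indicators("oct thru dec"))  # [0, 0, 1]
```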
Categorical predictors and slopes for (almost) each level

Gender: male/female (k = 2). Baseline: male.

Respondent   gender:female
Male         0
Female       1

Birth Quarter (k = 4). Baseline: jan thru mar.

Respondent        birth_qrtr:apr thru jun   birth_qrtr:jul thru sep   birth_qrtr:oct thru dec
1, jan thru mar   0                         0                         0
2, apr thru jun   1                         0                         0
3, jul thru sep   0                         1                         0
4, oct thru dec   0                         0                         1

Participation question: All else held constant, how do incomes of those born January thru March compare to those born April thru June? All else held constant, those born Jan thru Mar make, on average, ___ than those born Apr thru Jun.

(a) $2,043.42 less
(b) $2,043.42 more
(c) $4,978.12 less
(d) $4,978.12 more

Prediction with MLR: Return to the scene of the crime

Predict the annual murders per million in a district with a poverty rate of 24%.

murders-hat = -29.91 + 2.56 × percpov

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -29.901       7.789   -3.839    0.0000
percpov         2.559       0.390    6.562  3.64e-06
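Prediction is just plugging into the fitted equation. A quick Python check for the 24% poverty rate, using the unrounded estimates from the output (note the intercept is negative):

```python
# Fitted SLR: murders-hat = -29.901 + 2.559 * percpov
b0, b1 = -29.901, 2.559

poverty_rate = 24
murders_hat = b0 + b1 * poverty_rate
print(round(murders_hat, 1))  # about 31.5 annual murders per million
```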
Prediction with MLR: Weights of books

    weight (g)  volume (cm³)  cover
1   800         885           hc
2   950         1016          hc
3   1050        1125          hc
4   350         239           hc
5   750         701           hc
6   600         641           hc
7   1075        1228          hc
8   250         412           pb
9   700         953           pb
10  650         929           pb
11  975         1492          pb
12  350         419           pb
13  950         1010          pb
14  425         595           pb
15  725         1034          pb

Write down the linear model for book weight based on volume and cover type.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

Interpret the coefficients for volume and cover:pb.

Interpretation of the regression coefficients:
- Slope of volume: All else held constant, for each 1 cm³ increase in volume we would expect weight to increase on average by 0.72 grams.
- Slope of cover: All else held constant, the model predicts that paperback books weigh on average 184.05 grams less than hardcover books.
- Intercept: Hardcover books with no volume are expected on average to weigh 198 grams. Obviously, the intercept does not make sense in context; it only serves to adjust the height of the line.

Participation question: Which of the following is the correct calculation for the predicted weight of a paperback book that is 600 cm³?

(a) 197.96 + 0.72 × 600 - 184.05 × 1
(b) 184.05 + 0.72 × 600 - 197.96 × 1
(c) 197.96 + 0.72 × 600 - 184.05 × 0
(d) 197.96 + 0.72 × 1 - 184.05 × 600
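A paperback gets cover:pb = 1, so the prediction plugs in volume = 600 and the paperback indicator. A quick Python check of the calculation in option (a):

```python
# Fitted MLR: weight-hat = 197.96 + 0.72 * volume - 184.05 * cover_pb
b0, b_volume, b_pb = 197.96, 0.72, -184.05

volume, cover_pb = 600, 1   # a 600 cm^3 paperback book
weight_hat = b0 + b_volume * volume + b_pb * cover_pb
print(round(weight_hat, 2))  # 445.91 grams
```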
Examining our Model: How do we determine if the model is significant?

We can now interpret the coefficients, but how do we know if the model is any good? To be more specific, how can we show that the model is significant?

Inference for the model as a whole: the F-test.

Degrees of freedom: df1 = k, df2 = n - k - 1, where k is the number of predictors.

H0: β1 = β2 = ... = βk = 0
HA: At least one βi ≠ 0

Model output

Coefficients:
                         Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)             -15342.76    11716.57   -1.309  0.190760
hrs_work                  1048.96      149.25    7.028  4.63e-12 ***
raceblack                -7998.99     6191.83   -1.292  0.196795
raceasian                29909.80     9154.92    3.267  0.001135 **
raceother                -6756.32     7240.08   -0.933  0.351019
age                        565.07      133.77    4.224  2.69e-05 ***
genderfemale            -17135.05     3705.35   -4.624  4.41e-06 ***
citizenyes              -12907.34     8231.66   -1.568  0.117291
time_to_work                90.04       79.83    1.128  0.259716
langother               -10510.44     5447.45   -1.929  0.054047 .
marriedyes                5409.24     3900.76    1.387  0.165932
educollege               15993.85     4098.99    3.902  0.000104 ***
edugrad                  59658.52     5660.26   10.540   < 2e-16 ***
disabilityyes           -14142.79     6639.40   -2.130  0.033479 *
birth_qrtrapr thru jun   -2043.42     4978.12   -0.410  0.681569
birth_qrtrjul thru sep    3036.02     4853.19    0.626  0.531782
birth_qrtroct thru dec    2674.11     5038.45    0.531  0.595752

Residual standard error: 48670 on 766 degrees of freedom
(60 observations deleted due to missingness)
Multiple R-squared: 0.3126, Adjusted R-squared: 0.2982
F-statistic: 21.77 on 16 and 766 DF, p-value: < 2.2e-16

Participation question: True/False: The F-test yielding a significant result means the model fits the data well.

(a) True
(b) False
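The F statistic in the output can be recovered from R² and the degrees of freedom via F = (R²/k) / ((1 - R²)/(n - k - 1)). A quick Python check against the output above, using k = 16 slope coefficients and 766 residual degrees of freedom:

```python
r_squared = 0.3126   # Multiple R-squared from the output above
k = 16               # number of slope coefficients (model df)
df_resid = 766       # residual degrees of freedom, n - k - 1

f_stat = (r_squared / k) / ((1 - r_squared) / df_resid)
print(round(f_stat, 2))  # close to the 21.77 reported in the output
```

The tiny discrepancy, if any, comes from R² being rounded to four digits in the printout.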
Significance also depends on what else is in the model

Model 1:
                         Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)             -15342.76    11716.57   -1.309  0.190760
hrs_work                  1048.96      149.25    7.028  4.63e-12
raceblack                -7998.99     6191.83   -1.292  0.196795
raceasian                29909.80     9154.92    3.267  0.001135
raceother                -6756.32     7240.08   -0.933  0.351019
age                        565.07      133.77    4.224  2.69e-05
genderfemale            -17135.05     3705.35   -4.624  4.41e-06
citizenyes              -12907.34     8231.66   -1.568  0.117291
time_to_work                90.04       79.83    1.128  0.259716
langother               -10510.44     5447.45   -1.929  0.054047
marriedyes                5409.24     3900.76    1.387  0.165932  <----
educollege               15993.85     4098.99    3.902  0.000104
edugrad                  59658.52     5660.26   10.540   < 2e-16
disabilityyes           -14142.79     6639.40   -2.130  0.033479
birth_qrtrapr thru jun   -2043.42     4978.12   -0.410  0.681569
birth_qrtrjul thru sep    3036.02     4853.19    0.626  0.531782
birth_qrtroct thru dec    2674.11     5038.45    0.531  0.595752

Model 2:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -22498.2      8216.2   -2.738   0.00631
hrs_work         1149.7       145.2    7.919  7.60e-15
raceblack       -7677.5      6350.8   -1.209   0.22704
raceasian       38600.2      8566.4    4.506  7.55e-06
raceother       -7907.1      7116.2   -1.111   0.26683
age               533.1       131.2    4.064  5.27e-05
genderfemale   -15178.9      3767.4   -4.029  6.11e-05
marriedyes       8731.0      3956.8    2.207   0.02762  <----

Note that marriedyes is not significant in Model 1 (p = 0.17) but is significant in the smaller Model 2 (p = 0.03): a predictor's significance depends on what else is in the model.

Weights of books

[Table of book weights (g), volumes (cm³), and cover types, as shown earlier.]

Weights of hardcover and paperback books

Can you identify a trend in the relationship between volume and weight of hardcover and paperback books?

Modeling weights of books using volume and cover type
# load data
library(DAAG)
data(allbacks)

# fit model
book_mlr = lm(weight ~ volume + cover, data = allbacks)
summary(book_mlr)

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  197.96284    59.19274    3.344  0.005841 **
volume         0.71795     0.06153   11.669   6.6e-08 ***
cover:pb    -184.04727    40.49420   -4.545  0.000672 ***

Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07

[Figure: scatterplot of weight (g) vs. volume (cm³) for hardcover and paperback books.]
Linear model

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    197.96       59.19     3.34      0.01
volume           0.72        0.06    11.67      0.00
cover:pb      -184.05       40.49    -4.55      0.00

weight-hat = 197.96 + 0.72 × volume - 184.05 × cover:pb

1. For hardcover books, plug in 0 for cover:
   weight-hat = 197.96 + 0.72 × volume - 184.05 × 0 = 197.96 + 0.72 × volume
2. For paperback books, plug in 1 for cover:
   weight-hat = 197.96 + 0.72 × volume - 184.05 × 1 = 13.91 + 0.72 × volume

Visualising the linear model

[Figure: scatterplot of weight (g) vs. volume (cm³) with separate, parallel fitted lines for hardcover and paperback books.]
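The cover indicator shifts the intercept but not the slope, so the two fitted lines are parallel. A small Python check of the two intercepts derived above:

```python
# Fitted MLR for book weight
b0, b_volume, b_pb = 197.96, 0.72, -184.05

def weight_hat(volume, paperback):
    return b0 + b_volume * volume + b_pb * paperback

hardcover_intercept = weight_hat(0, 0)   # 197.96
paperback_intercept = weight_hat(0, 1)   # 197.96 - 184.05 = 13.91

# Both cover types share the same slope, so the lines are parallel
slope_hc = weight_hat(100, 0) - weight_hat(0, 0)
slope_pb = weight_hat(100, 1) - weight_hat(0, 1)
print(round(paperback_intercept, 2))  # 13.91
```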
A note on interaction variables

weight-hat = 197.96 + 0.72 × volume - 184.05 × cover:pb

[Figure: scatterplot of weight (g) vs. volume (cm³) with parallel fitted lines for hardcover and paperback books.]

This model assumes that hardcover and paperback books have the same slope for the relationship between their volume and weight. If this isn't reasonable, then we would include an interaction variable in the model (beyond the scope of this course).

Revisit: Modeling poverty

[Figure: scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house, with pairwise correlations; e.g., the correlation between poverty and female_house is 0.53.]

Predicting poverty using % female householder

# load data
poverty = read.csv("http://stat.duke.edu/~mc301/data/poverty.csv")

# fit model
pov_slr = lm(poverty ~ female_house, data = poverty)
summary(pov_slr)

Linear model:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       3.31        1.90     1.74      0.09
female_house      0.69        0.16     4.32      0.00

[Figure: scatterplot of % in poverty vs. % female householder with fitted line.]

R = 0.53, R² = 0.53² = 0.28

Another look at R² - from last time

anova(pov_slr)

ANOVA:
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.68    0.00
Residuals     49  347.68     7.10
Total         50  480.25

SS of y: SS_Total = Σ(y - ȳ)² = 480.25 (total variability)
SS of residuals: SS_Error = Σ e_i² = 347.68 (unexplained variability)
SS of regression: SS_Reg = SS_Total - SS_Error = 480.25 - 347.68 = 132.57 (explained variability)

R² = explained variability / total variability = 132.57 / 480.25 = 0.28
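The R² identity from the ANOVA table is easy to verify; a quick Python check using the sums of squares above:

```python
ss_total = 480.25   # total sum of squares, sum((y - ybar)^2)
ss_error = 347.68   # residual sum of squares (unexplained variability)

ss_reg = ss_total - ss_error     # explained variability
r_squared = ss_reg / ss_total
print(round(ss_reg, 2), round(r_squared, 2))  # 132.57 0.28
```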
Predicting poverty using % female hh + % white

pov_mlr = lm(poverty ~ female_house + white, data = poverty)
summary(pov_mlr)
anova(pov_mlr)

Linear model:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)      -2.58        5.78    -0.45      0.66
female_house      0.89        0.24     3.67      0.00
white             0.04        0.04     1.08      0.29

ANOVA:
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.74    0.00
white          1    8.21     8.21     1.16    0.29
Residuals     48  339.47     7.07
Total         50  480.25

R² = explained variability / total variability = (132.57 + 8.21) / 480.25 = 0.29

Adjusted R²:

R²_adj = 1 - (SS_Error / SS_Total) × ((n - 1) / (n - k - 1))

where n is the number of cases and k is the number of predictors (explanatory variables) in the model.

Application exercise: Calculate adjusted R² for the multiple linear regression model predicting % living in poverty from % female householders and % white. Remember n = 51 (50 states + DC).

(a) 0.26
(b) 0.29
(c) 0.32
(d) 0.71

ANOVA:
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.74  0.0001
white          1    8.21     8.21     1.16  0.2868
Residuals     48  339.47     7.07
Total         50  480.25

R² vs. adjusted R²

                                              R²    Adjusted R²
Model 1 (poverty vs. female_house)           0.28   0.26
Model 2 (poverty vs. female_house + white)   0.29   0.26

When any variable is added to the model, R² increases. But if the added variable doesn't really provide any new information, or is completely unrelated, adjusted R² does not increase.
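The adjusted R² formula applied to both models, using the sums of squares from the ANOVA tables above and n = 51:

```python
def adjusted_r2(ss_error, ss_total, n, k):
    # R^2_adj = 1 - (SS_Error / SS_Total) * ((n - 1) / (n - k - 1))
    return 1 - (ss_error / ss_total) * ((n - 1) / (n - k - 1))

n = 51  # 50 states + DC
model1 = adjusted_r2(ss_error=347.68, ss_total=480.25, n=n, k=1)
model2 = adjusted_r2(ss_error=339.47, ss_total=480.25, n=n, k=2)
print(round(model1, 2), round(model2, 2))  # 0.26 0.26
```

Even though R² rose from 0.28 to 0.29, both adjusted values round to 0.26: once the penalty for the extra predictor is applied, adding white contributes essentially nothing.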
R²_adj - properties

R²_adj = 1 - (SS_Error / SS_Total) × ((n - 1) / (n - k - 1))

Because k is never negative, R²_adj will always be smaller than R². R²_adj applies a penalty for the number of predictors included in the model. Therefore, we choose models with higher R²_adj over others.

Participation question: True or false: R²_adj tells us the percentage of variability in the response variable explained by the model.

(a) True
(b) False

Collinearity and parsimony

We saw that adding the variable white to the model did not increase adjusted R², i.e. did not add any valuable information to the model. Why?

[Figure: scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house with pairwise correlations, repeated from earlier.]

Collinearity between explanatory variables (cont.)

Two predictor variables are said to be collinear when they are correlated, and this collinearity (also called multicollinearity) complicates model estimation. Remember: predictors are also called explanatory or independent variables, so ideally they should be independent of each other.

We don't like adding predictors that are associated with each other to the model, because often the addition of such a variable brings nothing to the table. Instead, we prefer the simplest best model, i.e. the parsimonious model. In addition, adding collinear variables can result in biased estimates of the slope parameters. While it's impossible to prevent collinearity from arising in observational data, experiments are usually designed to control for correlated predictors.