
Unit 7: Multiple Linear Regression
Lecture 1: Introduction to MLR
Statistics 101 (Nicole Dalzell), June 15, 2015

Announcements

Problem Set 10 due Wednesday.

Recap: % College educated vs. % Hispanic in LA

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

[Two maps of LA zip code areas, with freeways marked: % college graduate (0% to 100%) and % Hispanic (0.0 to 1.0), alongside a scatterplot of % college graduate vs. % Hispanic.]

Recap: % College educated vs. % Hispanic in LA - linear model

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     0.7290      0.0308    23.68    0.0000
%Hispanic      -0.7527      0.0501   -15.01    0.0000

Participation question: Which of the below is the best interpretation of the slope?
(a) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
(b) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
(c) An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
(d) In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA? How reliable is this p-value if these zip code areas are not randomly selected?

Recap: Inference for the slope of a SLR model (only one explanatory variable):

Hypothesis test: T = (b_1 - null value) / SE_{b_1}, with df = n - 2
Confidence interval: b_1 ± t*_{df = n-2} × SE_{b_1}

The null value is often 0, since we are usually checking for any relationship between the explanatory and the response variable. The regression output gives b_1, SE_{b_1}, and the two-tailed p-value for the t-test of the slope where the null value is 0. We rarely do inference on the intercept, so we'll focus on the estimate and inference for the slope.

SLR: Categorical predictors - Dinosaur weight

What relationship do you see between the weight of dinosaurs and the type of dinosaur?

[Boxplot: dinosaur weight (kg), roughly 0 to 8e+04, by type: Ornithischian vs. Saurischian.]
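The slope inference above can be reproduced from the printed estimate and standard error. A minimal Python sketch (the lecture's own output is from R); the t* value of 1.98 for df = 98 is an approximation, not taken from the slides:

```python
# t-statistic for H0: beta1 = 0, using the slope row for %Hispanic
b1, se_b1, n = -0.7527, 0.0501, 100   # estimate, SE, sample size from the slide
t_stat = (b1 - 0) / se_b1             # T = (b1 - null value) / SE_b1
df = n - 2                            # SLR degrees of freedom: n - 2

# 95% CI: b1 +/- t* x SE_b1; t* ~ 1.98 for df = 98 (approximate critical value)
t_star = 1.98
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)

print(round(t_stat, 2))  # close to the -15.01 shown in the table (rounding)
print(df)                # 98
```

Since the interval excludes 0, the CI agrees with the tiny p-value in the output.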

SLR: Categorical predictors - Dinosaurs!

                      Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)               2786        4422    0.630    0.5316
dino$typeSaurischian     13652        5968    2.288    0.0265

Weight-hat = 2786 + 13652 × TypeSaurischian

Type of dinosaur is a categorical variable with two levels: Ornithischian and Saurischian.
For Ornithischian dinosaurs: plug in 0 for TypeSaurischian.
For Saurischian dinosaurs: plug in 1 for TypeSaurischian.
Slope b_1: We expect that Saurischian dinosaurs weighed on average 13,652 kilograms more than Ornithischian dinosaurs.

Return to the scene of the crime - Murder rates by country

Last class, we used the poverty rate in a district to help predict the number of annual murders per million in that district:

murders-hat = -29.91 + 2.56 × poverty

(Data: http://www.nationmaster.com/country-info/stats/Crime/Murders/Per-capita)

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -29.901       7.789   -3.839    0.0000
percpov          2.559       0.390    6.562  3.64e-06

Do we think that poverty rates are the only thing that influences the number of annual murders per million in a district?
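The dinosaur model above reduces to two group predictions, one per level of the indicator. A small Python sketch (`predicted_weight` is a hypothetical helper, not from the lecture):

```python
def predicted_weight(dino_type):
    """Fitted model from the slide: weight-hat = 2786 + 13652 * TypeSaurischian."""
    indicator = 1 if dino_type == "Saurischian" else 0  # 0 for Ornithischian (baseline)
    return 2786 + 13652 * indicator

print(predicted_weight("Ornithischian"))  # intercept alone: predicted mean of the baseline group
print(predicted_weight("Saurischian"))    # 2786 + 13652 = 16438
```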

Data from the ACS

A random sample of 783 observations from the 2012 ACS:
1. income: yearly income (wages and salaries)
2. employment: employment status (not in labor force, unemployed, or employed)
3. hrs_work: weekly hours worked
4. race: race (White, Black, Asian, or other)
5. age: age
6. gender: gender (male or female)
7. citizens: whether respondent is a US citizen or not
8. time_to_work: travel time to work
9. lang: language spoken at home (English or other)
10. married: whether respondent is married or not
11. edu: education level (hs or lower, college, or grad)
12. disability: whether respondent is disabled or not
13. birth_qrtr: quarter in which respondent is born (jan thru mar, apr thru jun, jul thru sep, or oct thru dec)

We have 1 response variable (income) and 12 potential explanatory variables.
- How do we fit a model and interpret the coefficients with so many predictors?
- What do we do with a mix of categorical and numerical explanatory variables?
- How do we determine which (if any) of them are important in our model?
- How do we determine if our model is any good?

Everybody in the Pool - Examples

How do we interpret all of this? How would we interpret the coefficient for hrs_work?

MLR (Multiple Linear Regression)

In MLR, everything is conditional on all other variables in the model: all estimates in a MLR for a given variable are conditional on all other variables being in the model.

Slope:
- Numerical x: All else held constant, for a one unit increase in x_i, y is expected to be higher/lower on average by b_i units.
- Categorical x: All else held constant, the predicted difference in y between the baseline and the given level of x_i is b_i.

Examples: How would we interpret the coefficient for hrs_work? How would we interpret the coefficient for genderfemale?

Categorical predictors with multiple levels

We have several categorical variables in this study. Some are binary (i.e., have only two levels) but others are not:
- employment: not in labor force, unemployed, or employed
- race: White, Black, Asian, or other
- gender: male or female
- citizens: whether respondent is a US citizen or not
- lang: language spoken at home, English or other
- married: whether respondent is married or not
- edu: education level, hs or lower, college, or grad
- disability: whether respondent is disabled or not
- birth_qrtr: quarter in which respondent is born, jan thru mar, apr thru jun, jul thru sep, or oct thru dec

Birth quarter coefficients

birth_qrtr: quarter in which respondent is born (jan thru mar, apr thru jun, jul thru sep, or oct thru dec). How many coefficients do we see for birth_qrtr?

Race coefficients

When we are working with a categorical variable with k levels, we only see k − 1 parameters being estimated in the model. This happens because one of the levels of the variable is consumed by the intercept of the model; this level of the categorical variable is called the baseline. So, what happened to the folks born in January through March?

Categorical predictors and slopes for (almost) each level

Gender: male/female (k = 2). Baseline: male.

Respondent   gender:female
Female       1
Male         0

Birth quarter (k = 4). Baseline: jan thru mar.

Respondent        apr thru jun  jul thru sep  oct thru dec
1, jan thru mar              0             0             0
2, apr thru jun              1             0             0
3, jul thru sep              0             1             0
4, oct thru dec              0             0             1

Participation question: All else held constant, how do incomes of those born January thru March compare to those born April thru June? All else held constant, those born Jan thru Mar make, on average,
(a) $2,043.42 less
(b) $2,043.42 more
(c) $4,978.12 less
(d) $4,978.12 more
than those born Apr thru Jun.

Prediction with MLR - Return to the scene of the crime

Predict the annual murders per million in a district with a poverty rate of 24%.

murders-hat = -29.91 + 2.56 × percpov

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    -29.901       7.789   -3.839    0.0000
percpov          2.559       0.390    6.562  3.64e-06
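The indicator tables above can be sketched as a small helper function (a hypothetical Python illustration; R builds these indicator columns automatically when fitting the model):

```python
def dummy_code(level, levels):
    """Return k-1 indicators for a k-level variable, treating the first level as the baseline."""
    others = levels[1:]  # baseline level gets no indicator of its own
    return [1 if level == other else 0 for other in others]

quarters = ["jan thru mar", "apr thru jun", "jul thru sep", "oct thru dec"]
print(dummy_code("jan thru mar", quarters))  # baseline -> all zeros
print(dummy_code("jul thru sep", quarters))  # second indicator switched on
```

The baseline row coming out as all zeros is exactly why its level is absorbed into the intercept.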

Prediction with MLR - Weights of books

     weight (g)  volume (cm³)  cover
1           800           885  hc
2           950          1016  hc
3          1050          1125  hc
4           350           239  hc
5           750           701  hc
6           600           641  hc
7          1075          1228  hc
8           250           412  pb
9           700           953  pb
10          650           929  pb
11          975          1492  pb
12          350           419  pb
13          950          1010  pb
14          425           595  pb
15          725          1034  pb

Write down the linear model for book weight based on volume and cover type.

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     197.96       59.19     3.34      0.01
volume            0.72        0.06    11.67      0.00
cover:pb       -184.05       40.49    -4.55      0.00

Interpret the coefficients for volume and cover:pb.

Interpretation of the regression coefficients:
- Slope of volume: All else held constant, for each 1 cm³ increase in volume we would expect weight to increase on average by 0.72 grams.
- Slope of cover: All else held constant, the model predicts that paperback books weigh on average 184.05 grams less than hardcover books.
- Intercept: Hardcover books with no volume are expected on average to weigh 198 grams. Obviously, the intercept does not make sense in context; it only serves to adjust the height of the line.

Participation question: Which of the following is the correct calculation for the predicted weight of a paperback book that is 600 cm³?
(a) 197.96 + 0.72 × 600 − 184.05 × 1
(b) 184.05 + 0.72 × 600 − 197.96 × 1
(c) 197.96 + 0.72 × 600 − 184.05 × 0
(d) 197.96 + 0.72 × 1 − 184.05 × 600
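The prediction question can be checked by plugging into the fitted model. A quick Python sketch (`predict_weight` is a hypothetical helper, not from the lecture):

```python
def predict_weight(volume, paperback):
    """Fitted model from the slide: weight-hat = 197.96 + 0.72*volume - 184.05*cover_pb."""
    cover_pb = 1 if paperback else 0  # indicator for the pb level; hc is the baseline
    return 197.96 + 0.72 * volume - 184.05 * cover_pb

print(round(predict_weight(600, paperback=True), 2))   # 197.96 + 0.72*600 - 184.05*1 = 445.91
print(round(predict_weight(600, paperback=False), 2))  # hardcover of the same volume: 629.96
```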

Examining our model - How do we determine if the model is significant?

We can now interpret the coefficients, but how do we know if the model is any good? To be more specific, how can we show that the model is significant?

Inference for the model as a whole: the F-test.
Degrees of freedom: df_1 = k, df_2 = n − k − 1
H_0: β_1 = β_2 = ... = β_k = 0
H_A: At least one of the β_i ≠ 0

Model output:

Coefficients:
                         Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)             -15342.76    11716.57   -1.309  0.190760
hrs_work                  1048.96      149.25    7.028  4.63e-12 ***
raceblack                -7998.99     6191.83   -1.292  0.196795
raceasian                29909.80     9154.92    3.267  0.001135 **
raceother                -6756.32     7240.08   -0.933  0.351019
age                        565.07      133.77    4.224  2.69e-05 ***
genderfemale            -17135.05     3705.35   -4.624  4.41e-06 ***
citizenyes              -12907.34     8231.66   -1.568  0.117291
time_to_work                90.04       79.83    1.128  0.259716
langother               -10510.44     5447.45   -1.929  0.054047 .
marriedyes                5409.24     3900.76    1.387  0.165932
educollege               15993.85     4098.99    3.902  0.000104 ***
edugrad                  59658.52     5660.26   10.540   < 2e-16 ***
disabilityyes           -14142.79     6639.40   -2.130  0.033479 *
birth_qrtrapr thru jun   -2043.42     4978.12   -0.410  0.681569
birth_qrtrjul thru sep    3036.02     4853.19    0.626  0.531782
birth_qrtroct thru dec    2674.11     5038.45    0.531  0.595752

Residual standard error: 48670 on 766 degrees of freedom
  (60 observations deleted due to missingness)
Multiple R-squared: 0.3126, Adjusted R-squared: 0.2982
F-statistic: 21.77 on 16 and 766 DF, p-value: < 2.2e-16

Participation question: True/False: The F test yielding a significant result means the model fits the data well.
(a) True
(b) False
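The F-statistic's degrees of freedom can be reproduced from the counts implied by the output. A quick Python check, taking n = 783 observations and k = 16 estimated slope coefficients (the non-intercept rows) as given:

```python
n, k = 783, 16        # observations in the sample; slope coefficients in the output
df1 = k               # numerator df for the F-test
df2 = n - k - 1       # denominator df: n - k - 1
print(df1, df2)       # matches "F-statistic ... on 16 and 766 DF"
```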

Significance also depends on what else is in the model

Model 1 (all 12 predictors; full output shown above):
marriedyes                5409.24     3900.76    1.387  0.165932  <----

Model 2 (fewer predictors):
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -22498.2      8216.2   -2.738   0.00631
hrs_work        1149.7       145.2    7.919  7.60e-15
raceblack      -7677.5      6350.8   -1.209   0.22704
raceasian      38600.2      8566.4    4.506  7.55e-06
raceother      -7907.1      7116.2   -1.111   0.26683
age              533.1       131.2    4.064  5.27e-05
genderfemale  -15178.9      3767.4   -4.029  6.11e-05
marriedyes      8731.0      3956.8    2.207   0.02762  <----

The arrowed rows show that marriedyes is not significant in the full model but becomes significant once the other predictors are dropped.

Weights of hardcover and paperback books

[Book weight/volume/cover data table repeated from earlier.]

Can you identify a trend in the relationship between volume and weight of hardcover and paperback books?

Modeling weights of books using volume and cover type
[Scatterplot: weight (g) vs. volume (cm³), with hardcover and paperback books marked separately.]

# load data
library(DAAG)
data(allbacks)

# fit model
book_mlr = lm(weight ~ volume + cover, data = allbacks)
summary(book_mlr)

Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  197.96284    59.19274    3.344  0.005841 **
volume         0.71795     0.06153   11.669   6.6e-08 ***
cover:pb    -184.04727    40.49420   -4.545  0.000672 ***

Residual standard error: 78.2 on 12 degrees of freedom
Multiple R-squared: 0.9275, Adjusted R-squared: 0.9154
F-statistic: 76.73 on 2 and 12 DF, p-value: 1.455e-07

Linear model

              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     197.96       59.19     3.34      0.01
volume            0.72        0.06    11.67      0.00
cover:pb       -184.05       40.49    -4.55      0.00

weight-hat = 197.96 + 0.72 × volume − 184.05 × cover:pb

1. For hardcover books: plug in 0 for cover:
   weight-hat = 197.96 + 0.72 × volume − 184.05 × 0 = 197.96 + 0.72 × volume
2. For paperback books: plug in 1 for cover:
   weight-hat = 197.96 + 0.72 × volume − 184.05 × 1 = 13.91 + 0.72 × volume

Visualising the linear model:

[Scatterplot of weight (g) vs. volume (cm³) with separate, parallel fitted lines for hardcover and paperback books.]

A note on interaction variables

weight-hat = 197.96 + 0.72 × volume − 184.05 × cover:pb

[Scatterplot with the two parallel fitted lines for hardcover and paperback books.]

This model assumes that hardcover and paperback books have the same slope for the relationship between their volume and weight. If this isn't reasonable, then we would include an interaction variable in the model (beyond the scope of this course).

Revisit: Modeling poverty

[Scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house, with pairwise correlations; poverty and female_house have correlation 0.53.]

Predicting poverty using % female householder:

# load data
poverty = read.csv("http://stat.duke.edu/mc301/data/poverty.csv")

# fit model
pov_slr = lm(poverty ~ female_house, data = poverty)
summary(pov_slr)

Linear model:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)        3.31        1.90     1.74      0.09
female_house       0.69        0.16     4.32      0.00

[Scatterplot: % in poverty vs. % female householder.]
R = 0.53, R² = 0.53² = 0.28

Another look at R² - from last time:

anova(pov_slr)

ANOVA:
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.68    0.00
Residuals     49  347.68     7.10
Total         50  480.25

SS of y: SS_Tot = Σ(y − ȳ)² = 480.25 (total variability)
SS of residuals: SS_Err = Σ e_i² = 347.68 (unexplained variability)
SS of regression: SS_Reg = SS_Tot − SS_Err = 480.25 − 347.68 = 132.57 (explained variability)

R² = explained variability / total variability = 132.57 / 480.25 = 0.28

Predicting poverty using % female hh + % white

pov_mlr = lm(poverty ~ female_house + white, data = poverty)
summary(pov_mlr)
anova(pov_mlr)

Linear model:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)       -2.58        5.78    -0.45      0.66
female_house       0.89        0.24     3.67      0.00
white              0.04        0.04     1.08      0.29

ANOVA:
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.74    0.00
white          1    8.21     8.21     1.16    0.29
Residuals     48  339.47     7.07
Total         50  480.25

R² = explained variability / total variability = (132.57 + 8.21) / 480.25 = 0.29

Adjusted R²:

R²_adj = 1 − (SS_Err / SS_Tot) × ((n − 1) / (n − k − 1))

where n is the number of cases and k is the number of predictors (explanatory variables) in the model.

Application exercise: Calculate adjusted R² for the multiple linear regression model predicting % living in poverty from % female householders and % white. Remember n = 51 (50 states + DC).
(a) 0.26
(b) 0.29
(c) 0.32
(d) 0.71

ANOVA:
              Df  Sum Sq  Mean Sq  F value  Pr(>F)
female_house   1  132.57   132.57    18.74  0.0001
white          1    8.21     8.21     1.16  0.2868
Residuals     48  339.47     7.07
Total         50  480.25

R² vs. adjusted R²:

                                              R²   adjusted R²
Model 1 (poverty vs. female_house)          0.28          0.26
Model 2 (poverty vs. female_house + white)  0.29          0.26

When any variable is added to the model, R² increases. But if the added variable doesn't really provide any new information, or is completely unrelated, adjusted R² does not increase.
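The application exercise can be checked numerically. A minimal Python sketch using the sums of squares from the ANOVA table above:

```python
ss_err, ss_tot = 339.47, 480.25   # residual and total sums of squares for pov_mlr
n, k = 51, 2                      # 50 states + DC; two predictors

r_sq = 1 - ss_err / ss_tot                                   # plain R^2
adj_r_sq = 1 - (ss_err / ss_tot) * ((n - 1) / (n - k - 1))   # penalized version

print(round(r_sq, 2))      # 0.29
print(round(adj_r_sq, 2))  # 0.26 -> answer (a)
```

The gap between 0.29 and 0.26 is the penalty term (n − 1)/(n − k − 1) at work.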

Adjusted R² - properties

R²_adj = 1 − (SS_Err / SS_Tot) × ((n − 1) / (n − k − 1))

Because k is never negative, R²_adj will always be smaller than R². R²_adj applies a penalty for the number of predictors included in the model. Therefore, we choose models with higher R²_adj over others.

Participation question: True or false: R²_adj tells us the percentage of variability in the response variable explained by the model.
(a) True
(b) False

Collinearity and parsimony

We saw that adding the variable white to the model did not increase adjusted R², i.e. it did not add any valuable information to the model. Why?

[Scatterplot matrix of poverty, metro_res, white, hs_grad, and female_house with pairwise correlations.]

Collinearity between explanatory variables (cont.)

Two predictor variables are said to be collinear when they are correlated, and this collinearity (also called multicollinearity) complicates model estimation. Remember: predictors are also called explanatory or independent variables, so they should be independent of each other. We don't like adding predictors that are associated with each other to the model, because often the addition of such a variable brings nothing to the table. Instead, we prefer the simplest best model, i.e. the parsimonious model. In addition, adding collinear variables can result in biased estimates of the slope parameters. While it's impossible to prevent collinearity from arising in observational data, experiments are usually designed to control for correlated predictors.