Announcements. Unit 7: Multiple Linear Regression Lecture 3: Case Study. From last lab. Predicting income

Similar documents
Announcements. % College graduate vs. % Hispanic in LA. % College educated vs. % Hispanic in LA. Problem Set 10 Due Wednesday.

Announcements. Lecture 19: Inference for SLR & Transformations. Online quiz 7 - commonly missed questions

Lab 11: Introduction to Linear Regression

Navigate to the golf data folder and make it your working directory. Load the data by typing

Lesson 14: Modeling Relationships with a Line

1. The data below gives the eye colors of 20 students in a Statistics class. Make a frequency table for the data.

y ) s x x )(y i (x i r = 1 n 1 s y Statistics Lecture 7 Exploring Data , y 2 ,y n (x 1 ),,(x n ),(x 2 ,y 1 How two variables vary together

Habit Formation in Voting: Evidence from Rainy Elections Thomas Fujiwara, Kyle Meng, and Tom Vogl ONLINE APPENDIX

Distancei = BrandAi + 2 BrandBi + 3 BrandCi + i

Data Set 7: Bioerosion by Parrotfish Background volume of bites The question:

IHS AP Statistics Chapter 2 Modeling Distributions of Data MP1

Running head: DATA ANALYSIS AND INTERPRETATION 1

College/high school median annual earnings gap,

a) List and define all assumptions for multiple OLS regression. These are all listed in section 6.5

Name Date Period. E) Lowest score: 67, mean: 104, median: 112, range: 83, IQR: 102, Q1: 46, SD: 17

Foundations of Data Science. Spring Midterm INSTRUCTIONS. You have 45 minutes to complete the exam.

Chapter 2: Modeling Distributions of Data

Chapter 12 Practice Test

Section I: Multiple Choice Select the best answer for each problem.

1wsSMAM 319 Some Examples of Graphical Display of Data

27Quantify Predictability U10L9. April 13, 2015

ASTERISK OR EXCLAMATION POINT?: Power Hitting in Major League Baseball from 1950 Through the Steroid Era. Gary Evans Stat 201B Winter, 2010

Correlation and regression using the Lahman database for baseball Michael Lopez, Skidmore College

Novel empirical correlations for estimation of bubble point pressure, saturated viscosity and gas solubility of crude oils

Model Selection Erwan Le Pennec Fall 2015

download.file(" destfile = "kobe.rdata")

Lab 2: Probability. Hot Hands. Template for lab report. Saving your code

Standardized catch rates of U.S. blueline tilefish (Caulolatilus microps) from commercial logbook longline data

Figure 1a. Top 1% income share: China vs USA vs France

COMPLETING THE RESULTS OF THE 2013 BOSTON MARATHON

Journal of Sports Economics 2000; 1; 299

The MACC Handicap System

STAT 625: 2000 Olympic Diving Exploration

ECO 745: Theory of International Economics. Jack Rossbach Fall Lecture 6

STAT 101 Assignment 1

Journal of Human Sport and Exercise E-ISSN: Universidad de Alicante España

Minimal influence of wind and tidal height on underwater noise in Haro Strait

Fundamentals of Machine Learning for Predictive Data Analytics

North Point - Advance Placement Statistics Summer Assignment

Bioequivalence: Saving money with generic drugs

Biostatistics Advanced Methods in Biostatistics IV

Boyle s Law. Pressure-Volume Relationship in Gases. Figure 1

League Quality and Attendance:

Lab #12:Boyle s Law, Dec. 20, 2016 Pressure-Volume Relationship in Gases

8.5 Training Day Part II

Lecture 22: Multiple Regression (Ordinary Least Squares -- OLS)

Name May 3, 2007 Math Probability and Statistics

Why We Should Use the Bullpen Differently

Equation 1: F spring = kx. Where F is the force of the spring, k is the spring constant and x is the displacement of the spring. Equation 2: F = mg

Session 2: Introduction to Multilevel Modeling Using SPSS

Pitching Performance and Age

Diameter in cm. Bubble Number. Bubble Number Diameter in cm

Department of Economics Working Paper

Midterm Exam 1, section 2. Thursday, September hour, 15 minutes

4-3 Rate of Change and Slope. Warm Up. 1. Find the x- and y-intercepts of 2x 5y = 20. Describe the correlation shown by the scatter plot. 2.

Boyle s Law: Pressure-Volume Relationship in Gases. PRELAB QUESTIONS (Answer on your own notebook paper)

Standardized catch rates of yellowtail snapper ( Ocyurus chrysurus

Rationale. Previous research. Social Network Theory. Main gap 4/13/2011. Main approaches (in offline studies)

Multilevel Models for Other Non-Normal Outcomes in Mplus v. 7.11

Table 9 and 10 present the norms by gender and by age group, respectively. As can be

Lesson 18: There Is Only One Line Passing Through a Given Point with a Given Slope

The pth percentile of a distribution is the value with p percent of the observations less than it.

SPATIAL STATISTICS A SPATIAL ANALYSIS AND COMPARISON OF NBA PLAYERS. Introduction

Full file at

Lecture 16: Chapter 7, Section 2 Binomial Random Variables

STAT 155 Introductory Statistics. Lecture 2-2: Displaying Distributions with Graphs

MATH IN ACTION TABLE OF CONTENTS. Lesson 1.1 On Your Mark, Get Set, Go! Page: 10 Usain Bolt: The fastest man on the planet

Calculating Probabilities with the Normal distribution. David Gerard

Grade: 8. Author(s): Hope Phillips

100-Meter Dash Olympic Winning Times: Will Women Be As Fast As Men?

STAT 155 Introductory Statistics. Lecture 2: Displaying Distributions with Graphs

Algebra 1 Unit 6 Study Guide

Empirical Example II of Chapter 7

8th Grade. Data.

Lesson 4: Describing the Center of a Distribution

Pitching Performance and Age

Vapor Pressure of Liquids

JEPonline Journal of Exercise Physiologyonline

DESIGN AND ANALYSIS OF ALGORITHMS (DAA 2017)

Name Student Activity

Solutionbank S1 Edexcel AS and A Level Modular Mathematics

Acknowledgement: Author is indebted to Dr. Jennifer Kaplan, Dr. Parthanil Roy and Dr Ashoke Sinha for allowing him to use/edit many of their slides.

STA 103: Midterm I. Print clearly on this exam. Only correct solutions that can be read will be given credit.

Sample Final Exam MAT 128/SOC 251, Spring 2018

Tying Knots. Approximate time: 1-2 days depending on time spent on calculator instructions.

Using SAS/INSIGHT Software as an Exploratory Data Mining Platform Robin Way, SAS Institute Inc., Portland, OR

!!!!!!!!!!!!! One Score All or All Score One? Points Scored Inequality and NBA Team Performance Patryk Perkowski Fall 2013 Professor Ryan Edwards!

Descriptive Stats. Review

Team competition: Eliminating the gender gap in competitiveness

Effective Use of Box Charts

MoLE Gas Laws Activities

SCIENTIFIC COMMITTEE SEVENTH REGULAR SESSION August 2011 Pohnpei, Federated States of Micronesia

The Simple Linear Regression Model ECONOMETRICS (ECON 360) BEN VAN KAMMEN, PHD

Ozobot Bit Classroom Application: Boyle s Law Simulation

GENDER INEQUALITY IN THE LABOR MARKET

STANDARD SCORES AND THE NORMAL DISTRIBUTION

A REVIEW AND EVALUATION OF NATURAL MORTALITY FOR THE ASSESSMENT AND MANAGEMENT OF YELLOWFIN TUNA IN THE EASTERN PACIFIC OCEAN

Driv e accu racy. Green s in regul ation

Neighborhood Influences on Use of Urban Trails

ISDS 4141 Sample Data Mining Work. Tool Used: SAS Enterprise Guide

Transcription:

Announcements Announcements Unit 7: Multiple Linear Regression Lecture 3: Case Study Statistics 101 Mine Çetinkaya-Rundel April 18, 2013 OH: Sunday: Virtual OH, 3-4pm - you ll receive an email invitation Monday: cancelled (will be at the poster sessions) Tuesday: after class, 3-4pm (for last minute paper questions) Wednesday: 2-4pm (as usual, probably too late for paper questions) OH for finals week will be announced on Tuesday Final exam location: Gross Hall 107 Request: Borrow some clickers at the end of class on Tuesday, to be returned during finals week or at the final? Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 2 / 20 From last lab Predicting income > mlr_step5 = lm(income hrs_work + race + age + gender + edu + disability, data = acs_sub) > summary(mlr_step5) Load data, and subset for those who were employed, if you haven t yet done so. > download.file("http://stat.duke.edu/courses/spring13/ sta101.001/labs/acs.rdata", destfile = "acs.rdata") > load("acs.rdata") > acs_sub = subset(acs, acs$employment == "employed") (Intercept) -21737.27 7719.33-2.82 0.00 hrs work 1000.05 135.53 7.38 0.00 race:black -6015.53 5877.30-1.02 0.31 race:asian 29595.59 8029.98 3.69 0.00 race:other -8599.21 6648.63-1.29 0.20 age 561.57 118.88 4.72 0.00 gender:female -18120.56 3495.87-5.18 0.00 edu:college 17273.85 3827.51 4.51 0.00 edu:grad 58551.90 5418.84 10.81 0.00 disability:yes -15851.98 6209.49-2.55 0.01 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 3 / 20 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 4 / 20

(1) Nearly normal residuals with mean 0 > qqnorm(, main = "Normal prob. plot\nof residuals") > qqline() > hist(, main = "Histogram of residuals") Sample Quantiles Normal prob. plot of residuals 3 1 0 1 2 3 Theoretical Quantiles 0 100 300 Histogram of residuals (2) Constant variability of residuals > plot( mlr_step5$fitted, main = "Residuals vs. fitted") > plot(abs() mlr_step5$fitted, main = "Absolute value of\nresiduals vs. fitted") Residuals vs. fitted 0e+00 5e+04 1e+05 mlr_step5$fitted abs() 0 100000 250000 Absolute value of residuals vs. fitted 0e+00 5e+04 1e+05 mlr_step5$fitted Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 5 / 20 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 6 / 20 (3) Each (numerical) variable linearly related to outcome > par(mfrow = c(2,2)) # 4 plots in one window, 2 rows, 2 columns > plot(, main = "Income vs.\n") > plot(, main = "Residuals vs.\n") > plot(, main = "Income vs. age") > plot(, main = "Residuals vs. age") Income vs. Income vs. age (4) Independence We know that the data are sampled randomly, so there should be no pattern in residuals with respect to the order of data collection. > plot(, main = "Residuals vs.\norder of data collection") 0e+00 3e+05 1e+05 2e+05 Residuals vs. 0e+00 3e+05 1e+05 2e+05 Residuals vs. age Residuals vs. order of data collection 0 200 400 600 800 Index Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 7 / 20 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 8 / 20

Transformations Log of 0 0 100 200 300 400 500 600 We saw that residuals have a right-skewed distribution, and the relationship between and income is non-linear (exponential). In these situations a transformation applied to the response variable may be useful. In order to decide which transformation to use, we should examine the distribution of the response variable. min = 0 Q1 = 12000 mean = 44098 median = 30000 Q3 = 55000 max = 450000 The extremely right skewed distribution suggests that a log transformation may be useful. > summary() Min. 1st Qu. Median Mean 3rd Qu. Max. 0 12000 30000 44100 55000 450000 > log(0) [1] -Inf Since there are some individuals who made 0 income (from salaries and wages) last year, we cannot take the log of their income, since log(0) is undefined. A commonly used trick is to add a very small number to all values, and then take the log. 0e+00 1e+05 2e+05 3e+05 4e+05 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 9 / 20 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 10 / 20 Logged distribution Logged relationships > _log = log( + 0.01) > hist() > hist(_log) > plot(_log ) > plot(_log ) 0 100 300 500 0e+00 2e+05 4e+05 0 100 200 300 400 500 _log Log(income) vs. _log Log(income) vs. age _log Are there any interesting features in the distribution of the logged income? Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 11 / 20 We still might want to do something about those 0 incomes, it doesn t make sense to model them with the rest of the data. Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 12 / 20

Further subsetting the data Logged relationships - for those with any income > plot(acs_sub2$income_log acs_sub2$hrs_work) > plot(acs_sub2$income_log acs_sub2$age) People who work more than 0 hours per week but make 0 income in salaries and wages are different than others whose income is proportional to number of hours they work. So we have reason to omit these people from the analysis (and model their income differently based on other variables). > acs_sub2 = subset(acs_sub, > 0) > acs_sub2$income_log = log(acs_sub2$income) acs_sub2$income_log 4 6 8 10 12 Log(income) vs. acs_sub2$income_log 4 6 8 10 12 Log(income) vs. age acs_sub2$hrs_work acs_sub2$age Much better... Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 13 / 20 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 14 / 20 New model: log of income Final model for log of income... after a new model selection process (backwards elimination, p-value method): > mlr_log_fin = lm(income_log hrs_work + age + gender + time_to_work + married + edu + disability, data = acs_sub2) > summary(mlr_log_fin) Check the nearly normal residuals with mean 0 and constant variability of residuals conditions for the logged income model using appropriate diagnostics plots. > mlr_log_fin = lm(income_log hrs_work + age + gender + time_to_work + married + edu + disability, data = acs_sub2) Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 15 / 20 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 16 / 20

(1) Nearly normal residuals with mean 0 (2) Constant variability of residuals Normal prob. plot of residuals Histogram of residuals Residuals vs. fitted Absolute value of residuals vs. fitted Sample Quantiles 0 50 150 mlr_log_fin$residuals abs(mlr_log_fin$residuals) 0 1 2 3 4 3 1 0 1 2 3 8 9 10 11 12 13 8 9 10 11 12 13 Theoretical Quantiles mlr_log_fin$residuals mlr_log_fin$fitted mlr_log_fin$fitted Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 17 / 20 Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 18 / 20 Interpreting coefficients of logged models (1) Interpretations for model for logged income Which is the correct interpretation of the slope of? Interpreting coefficients of logged models (2) Interpretations for model for logged income Which is the correct interpretation of the slope of gender:female? For each additional hour worked per week, (a) we would expect income to increase on average by $0.05. (b) we would expect income to increase on average by 0.05%. (c) we would expect income to increase on average by 5.12%. (d) we would expect income to increase on average by $18.59 (e) we would expect income to increase on average by a factor of 18.59. Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 19 / 20 The model predicts that females, on average, make (a) $26,000 less than males. (b) 23% less than males. (c) 77% more than males. (d) $2,600 less than males. (e) $2,600 less more males. Statistics 101 (Mine Çetinkaya-Rundel) U7 - L3: Case study April 18, 2013 20 / 20