Announcements. Lecture 19: Inference for SLR & Transformations. Online quiz 7 - commonly missed questions


Lecture 19: Inference for SLR & Transformations
Statistics 101 (Mine Çetinkaya-Rundel), April 3, 2012

Announcements

- HW 7 due Thursday.
- Correlation guessing game: ends on April 12 at noon. The winner will be announced in class. Prize: +1 point (out of 100) on the final. http://istics.net/stat/correlations (Group: sta101)

Recap: Online quiz 7 - commonly missed questions

Question 1:

Recap: Review question

Which of the following is false? In SLR,
(a) residuals should be nearly normally distributed with mean at 0
(b) residuals should have non-constant variance
(c) residuals vs. x plot should show a random scatter around 0
(d) the relationship between x and y should be linear, and outliers should be handled with caution

Major league baseball

Yesterday in lab you worked with 2009 MLB data. What was the best predictor of runs?

[Figure: Runs vs. on-base plus slugging; x-axis: ob_slg, y-axis: runs]

Major league baseball

$R^2$ for the regression line for predicting runs from on-base plus slugging is 91.31%. Which of the below is the correct interpretation of this value? 91.31% of
(a) runs can be accurately predicted by on-base plus slugging.
(b) variability in predictions of runs is explained by on-base plus slugging.
(c) variability in predictions of on-base plus slugging is explained by runs.
(d) variability in runs is explained by on-base plus slugging.
(e) variability in on-base plus slugging is explained by runs.

Understanding regression output from software

Major league baseball (regression output)

m = lm(runs ~ ob_slg, data = mlb)
summary(m)

Call:
lm(formula = runs ~ ob_slg, data = mlb)

Residuals:
    Min      1Q  Median      3Q     Max
-39.140 -12.568  -1.205  10.488  57.634

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -921.14      97.38  -9.459 3.24e-10 ***
ob_slg       2222.61     129.61  17.148  < 2e-16 ***
---
Residual standard error: 22.37 on 28 degrees of freedom
Multiple R-squared: 0.9131, Adjusted R-squared: 0.91
F-statistic: 294.1 on 1 and 28 DF, p-value: < 2.2e-16

Testing for the slope

Clicker question: Assuming that the 2009 season is representative of all MLB seasons, we would like to test if these data provide convincing evidence that the slope of the regression line for predicting runs from on-base plus slugging is different than 0. What are the appropriate hypotheses?
(a) $H_0: b_0 = 0$; $H_A: b_0 \neq 0$
(b) $H_0: \beta_1 = 0$; $H_A: \beta_1 \neq 0$
(c) $H_0: b_1 = 0$; $H_A: b_1 \neq 0$
(d) $H_0: \beta_0 = 0$; $H_A: \beta_0 \neq 0$
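To make the regression output above easier to read, here is a minimal R sketch that rebuilds the "t value" column from the "Estimate" and "Std. Error" columns. The numbers are copied from the output; the variable names are illustrative assumptions, not part of the lecture code.

```r
# Values copied from the summary(m) output above; names are illustrative only.
estimate  <- c(intercept = -921.14, ob_slg = 2222.61)
std_error <- c(intercept = 97.38,   ob_slg = 129.61)

# Each t value is the estimate divided by its standard error (null value 0).
t_value <- estimate / std_error
print(round(t_value, 3))

# Residual degrees of freedom are n - 2, so 28 df implies n = 30 teams.
n <- 28 + 2

# In simple linear regression the F-statistic is the squared t value of the slope.
f_stat <- unname(t_value["ob_slg"]^2)
print(round(f_stat, 1))
```

Because the inputs are rounded, the recomputed values agree with the printed output only up to rounding.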

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)     -921      97.38   -9.46   0.0000
ob_slg          2223     129.61   17.15   0.0000

- We always use a t-test in inference for regression.
  Remember: Test statistic, $T = \frac{\text{point estimate} - \text{null value}}{SE}$
- Point estimate = $b_1$ is the observed slope, and is given in the regression output.
- $SE_{b_1}$ is the standard error associated with the slope, and can be calculated as
  $SE_{b_1} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2 / (n - 2)}{\sum (x_i - \bar{x})^2}}$
  It is also given in the regression output (and it's silly to try to calculate it by hand, just know that it's doable and why the formula works the way it does).
- Degrees of freedom associated with the slope is $df = n - 2$, where $n$ is the sample size.

Testing for the slope (cont.)

            Estimate Std. Error t value Pr(>|t|)
(Intercept)     -921      97.38   -9.46   0.0000
ob_slg          2223     129.61   17.15   0.0000

$T = \frac{2223 - 0}{129.6116} = 17.15$
$df = 30 - 2 = 28$
$p\text{-value} = P(|T| > 17.15) < 0.01$

% College graduate vs. % Hispanic in LA

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

[Figure: two maps of LA zip code areas, "Education: College graduate" and "Race/Ethnicity: Hispanic", each shaded on a 0.0 to 1.0 scale with freeways and "No data" areas marked]

% College educated vs. % Hispanic in LA - another look

What can you say about the relationship between % college graduate and % Hispanic in a sample of 100 zip code areas in LA?

[Figure: scatterplot of % College graduate (y-axis, 0% to 100%) vs. % Hispanic (x-axis, 0% to 100%)]
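The test statistic and p-value for the MLB slope can be reproduced in a few lines of R. This is a sketch that takes the slope, standard error, and sample size from the slides as given; the variable names are assumptions made for illustration.

```r
# Slope estimate, its standard error, and sample size, as quoted on the slide.
b1    <- 2223
se_b1 <- 129.6116
n     <- 30

# Test statistic for H0: beta_1 = 0, and its degrees of freedom.
t_stat <- (b1 - 0) / se_b1
df     <- n - 2

# Two-sided p-value: P(|T| > t_stat) under a t distribution with df = 28.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

print(round(t_stat, 2))
print(p_value < 0.01)
```

Using `lower.tail = FALSE` rather than `1 - pt(...)` avoids losing precision when the tail probability is very small.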

% College educated vs. % Hispanic in LA - linear model

Clicker question: Which of the below is the best interpretation of the slope?

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7290     0.0308   23.68   0.0000
%Hispanic    -0.7527     0.0501  -15.01   0.0000

(a) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 75% decrease in % of college grads.
(b) A 1% increase in Hispanic residents in a zip code area in LA is associated with a 0.75% decrease in % of college grads.
(c) An additional 1% of Hispanic residents decreases the % of college graduates in a zip code area in LA by 0.75%.
(d) In zip code areas with no Hispanic residents, % of college graduates is expected to be 75%.

% College educated vs. % Hispanic in LA - linear model

Do these data provide convincing evidence that there is a statistically significant relationship between % Hispanic and % college graduates in zip code areas in LA?

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.7290     0.0308   23.68   0.0000
hispanic     -0.7527     0.0501  -15.01   0.0000

Violent crime rate vs. unemployment

Relationship between violent crime rate (annual number of violent crimes per 100,000 population) and unemployment rate (% of work eligible population not working) in 51 US States (including DC):

[Figure: scatterplot of violent_crime_rate (y-axis) vs. unemployed (x-axis), with DC labeled]

Note: The data are from the 2003 Statistical Abstract of the US. A 2012 version is available online; if you are looking for data on states for your project, it's a good resource.

Violent crime rate vs. unemployment

Clicker question: Which of the below is the correct set of hypotheses and the p-value for testing if the slope of the relationship between violent crime rate and unemployment is positive?

            Estimate Std. Error t value Pr(>|t|)
(Intercept)    27.68     130.00    0.21   0.8323
unemployed    105.03      32.04    3.28   0.0019

(a) $H_0: b_1 = 0$; $H_A: b_1 \neq 0$; p-value = 0.0019
(b) $H_0: \beta_1 = 0$; $H_A: \beta_1 > 0$; p-value = 0.0019/2 = 0.00095
(c) $H_0: \beta_1 = 0$; $H_A: \beta_1 \neq 0$; p-value = 0.0019/2 = 0.00095
(d) $H_0: b_1 = 0$; $H_A: b_1 > 0$; p-value = 0.0019/2 = 0.00095
(e) $H_0: \beta_1 = 0$; $H_A: \beta_1 \neq 0$; p-value = 0.8323
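The regression output reports a two-tailed p-value, so a one-sided test needs a small adjustment. The sketch below recomputes both p-values for the unemployment slope from the rounded estimate and standard error in the table; the variable names are assumptions, and the results match the table only up to rounding.

```r
# Slope estimate and standard error for `unemployed`, copied from the table.
b1    <- 105.03
se_b1 <- 32.04
n     <- 51            # 50 states plus DC

t_stat <- b1 / se_b1   # test statistic for H0: beta_1 = 0
df     <- n - 2

# One-sided p-value for HA: beta_1 > 0 is the upper-tail area only.
p_one_sided <- pt(t_stat, df = df, lower.tail = FALSE)

# The two-sided p-value reported by the software is twice that area.
p_two_sided <- 2 * p_one_sided

print(signif(c(one_sided = p_one_sided, two_sided = p_two_sided), 2))
```

Halving the reported p-value is only appropriate when the estimated slope falls on the side named in the alternative hypothesis, as it does here with a positive slope.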

CI for the slope: Confidence interval for the slope

Clicker question: Remember that a confidence interval is calculated as point estimate ± ME, and the degrees of freedom associated with the slope in a simple linear regression is $n - 2$. Which of the below is the correct 95% confidence interval for the slope parameter? Note that the model is based on observations from 51 states.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)    27.68     130.00    0.21   0.8323
unemployed    105.03      32.04    3.28   0.0019

(a) 27.68 ± 1.65 × 32.04
(b) 105.03 ± 2.01 × 32.04
(c) 105.03 ± 1.96 × 32.04
(d) 27.68 ± 1.96 × 32.04

CI for the slope: Recap

Inference for the slope for a SLR model (only one explanatory variable):
- Hypothesis test: $T = \frac{b_1 - \text{null value}}{SE_{b_1}}$, with $df = n - 2$
- Confidence interval: $b_1 \pm t^\star_{df = n - 2} \, SE_{b_1}$
- The null value is often 0 since we are usually checking for any relationship between the explanatory and the response variable.
- The regression output gives $b_1$, $SE_{b_1}$, and the two-tailed p-value for the t-test for the slope where the null value is 0.
- We rarely do inference on the intercept, so we'll be focusing on the estimates and inference for the slope.

CI for the slope: Caution

- Always be aware of the type of data you're working with: random sample, non-random sample, or population.
- Statistical inference, and the resulting p-values, are meaningless when you already have population data.
- If you have a sample that is non-random (biased), the results will be unreliable.
- The ultimate goal is to have independent observations, and you know how to check for those by now.

An alternative statistic: ANOVA

- We considered the t-test as a way to evaluate the strength of evidence for a hypothesis test evaluating the relationship between x and y.
- However, we could focus on $R^2$: the proportion of variability in the response variable (y) explained by the explanatory variable (x).
- A large $R^2$ suggests a linear relationship between x and y exists.
- A small $R^2$ suggests the evidence provided by the data may not be convincing.
- Considering the amount of explained variability is called analysis of variance (ANOVA).
- In SLR, where there is only one explanatory variable (and hence one slope parameter), the t-test and the ANOVA yield the same result.
- In multiple linear regression, they provide different pieces of information.
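As a worked version of the confidence interval recap above, the following R sketch builds the 95% interval for the unemployment slope. The inputs come from the regression table; the variable names are illustrative assumptions.

```r
# Slope estimate and standard error for `unemployed`, copied from the table.
b1    <- 105.03
se_b1 <- 32.04
n     <- 51

# Critical value t* for 95% confidence with df = n - 2 (not the normal 1.96).
df     <- n - 2
t_star <- qt(0.975, df = df)

# Margin of error and interval: b1 +/- t* x SE.
me <- t_star * se_b1
ci <- c(lower = b1 - me, upper = b1 + me)

print(round(t_star, 2))
print(round(ci, 2))
```

The critical value comes from the t distribution with 49 degrees of freedom, which is why it is slightly larger than the 1.96 used for a normal-based interval.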

Truck prices

The scatterplot below shows the relationship between year and price of a random sample of 43 pickup trucks. Describe the relationship between these two variables.

[Figure: scatterplot of price (y-axis) vs. year (x-axis) for all 43 trucks]

From: http://faculty.chicagobooth.edu/robert.gramacy/teaching.html

Remove unusual observations

Let's remove trucks older than 20 years, and only focus on trucks made in 1992 or later. Now what can you say about the relationship?

[Figure: scatterplot of price (y-axis) vs. year (x-axis) for trucks made in 1992 or later]

Truck prices - linear model?

[Figure: price vs. year with fitted line, and residuals vs. year]

Model: $\widehat{price} = b_0 + b_1 \, year$

The linear model doesn't appear to be a good fit since the residuals have non-constant variance.

Truck prices - log transform of the response variable

[Figure: log(price) vs. year with fitted line, and residuals vs. year]

Model: $\widehat{\log(price)} = b_0 + b_1 \, year$

We applied a log transformation to the response variable. The relationship now seems linear, and the residuals have roughly constant variance.

Interpreting models with log transformation

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -265.07      25.04  -10.59     0.00
pu$year         0.14       0.01   10.94     0.00

Model: $\widehat{\log(price)} = -265.07 + 0.14 \, year$

For each additional year the truck is newer (for each year decrease in the truck's age) we would expect the log price of the truck to increase on average by 0.14 log dollars, which is not very useful...

Working with logs

- Subtraction and logs: $\log(a) - \log(b) = \log\left(\frac{a}{b}\right)$
- Natural logarithm: $e^{\log(x)} = x$
- We can use these identities to undo the log transformation.

Interpreting models with log transformation (cont.)

The slope coefficient for the log transformed model is 0.14, meaning the log price difference between trucks that are one year apart is predicted to be 0.14 log dollars.

$\log(\text{price at year } x + 1) - \log(\text{price at year } x) = 0.14$
$\log\left(\frac{\text{price at year } x + 1}{\text{price at year } x}\right) = 0.14$
$e^{\log\left(\frac{\text{price at year } x + 1}{\text{price at year } x}\right)} = e^{0.14}$
$\frac{\text{price at year } x + 1}{\text{price at year } x} = e^{0.14} = 1.15$

For each additional year the truck is newer (for each year decrease in the truck's age) we would expect the price of the truck to increase on average by a factor of 1.15.

Recap: dealing with non-constant variance

- Non-constant variance is one of the most common model violations; however, it is usually fixable by transforming the response (y) variable.
- The most common variance stabilizing transform is the log transformation: $\log(y)$
- When using a log transformation on the response variable the interpretation of the slope changes: for each unit increase in x, y is expected on average to decrease/increase by a factor of $e^{b_1}$.
- Another useful transformation is the square root: $\sqrt{y}$
- These transformations may also be useful when the relationship is non-linear, but in those cases a polynomial regression may also be needed (this is beyond the scope of this course, but you're welcome to try it for your project, and I'd be happy to provide further guidance).
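The back-transformation of the log-model slope can be checked numerically. The sketch below uses the rounded slope 0.14 from the output; `price_ratio` is a hypothetical helper introduced here for illustration, not a function from the lecture.

```r
# Slope of the log(price) ~ year model, rounded as in the regression output.
slope <- 0.14

# Hypothetical helper: multiplicative change in predicted price when the
# model year increases by `years`, obtained by undoing the log with exp().
price_ratio <- function(slope, years = 1) {
  exp(slope * years)
}

# One model year newer: predicted price is multiplied by about 1.15.
print(round(price_ratio(slope), 2))

# The same change expressed as a percentage increase per year.
print(round((price_ratio(slope) - 1) * 100, 1))

# Effects compound multiplicatively: five years is ratio^5, not 5 * ratio.
print(round(price_ratio(slope, years = 5), 2))
```

The last line illustrates why the interpretation is "by a factor of": a five-year gap multiplies the predicted price by $e^{0.14 \times 5}$ rather than adding five one-year increments.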

R code

# load data
pu_allyrs = read.csv("http://stat.duke.edu/courses/spring12/sta101.1/lec/pickups.csv")

# drop trucks older than 20 yrs old
pu = subset(pu_allyrs, pu_allyrs$year >= 1992)

# linear model
plot(pu$price ~ pu$year)
m1 = lm(pu$price ~ pu$year)
abline(m1)
plot(m1$residuals ~ pu$year)

# model with log transformation
plot(log(pu$price) ~ pu$year)
m2 = lm(log(pu$price) ~ pu$year)
abline(m2)
plot(m2$residuals ~ pu$year)

# model summary and interpretation of the slope coefficient
summary(m2)
exp(0.14)
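If the course data file is not reachable, the same workflow can be tried on simulated data. The sketch below is not the pickup data: the year range, intercept, noise level, and seed are all assumptions chosen only so that prices look plausible and the true log-scale slope is 0.14.

```r
# Simulated stand-in for the pickup data (all parameter values are assumptions).
set.seed(101)
n    <- 500
year <- sample(1992:2008, n, replace = TRUE)

# Log-linear truth: constant noise on the log scale means the spread of
# price grows with its level, i.e. non-constant variance on the raw scale.
log_price <- -271 + 0.14 * year + rnorm(n, mean = 0, sd = 0.3)
sim <- data.frame(year = year, price = exp(log_price))

# Fit the model on the raw scale and on the log scale.
m_raw <- lm(price ~ year, data = sim)
m_log <- lm(log(price) ~ year, data = sim)

# Slope on the log scale and its multiplicative interpretation.
b1 <- unname(coef(m_log)["year"])
print(round(c(slope = b1, factor_per_year = exp(b1)), 3))

# Residual plots for comparison (uncomment to draw):
# plot(resid(m_raw) ~ sim$year)
# plot(resid(m_log) ~ sim$year)
```

Plotting the two sets of residuals against year should show a fan shape for the raw-scale model and a roughly even band for the log-scale model, mirroring the truck price slides.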