Statistics Lecture 7 - Exploring Data: How two variables vary together

Statistics 111 - Lecture 7
Exploring Data: Numerical Summaries for Relationships between Variables
Feb. 4, 2016, Stat 111 - Lecture 7 - Correlation

Administrative Notes
Homework 1 due in recitation: Friday, Feb. 5
Homework 2 now posted on course website: http://stat.wharton.upenn.edu/~stjensen/stat111.html
Homework 2 due in recitation: Friday, Feb. 12

How two variables vary together
We have already looked at single-variable measures of spread:
$$\text{variance} = s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
We now have pairs of two variables: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
We need a statistic that summarizes how they are spread out together

Correlation
Correlation of two variables:
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})(y_i - \bar{y})}{s_x s_y}$$
We divide by the standard deviations of both X and Y, so correlation has no units
Correlation is a measure of the strength of the linear relationship between variables X and Y
Correlation has a range between -1 and 1
r = 1 means the relationship between X and Y is exactly positive linear
r = -1 means the relationship between X and Y is exactly negative linear
r = 0 means that there is no linear relationship between X and Y

Measure of Strength
Correlation roughly measures how tight the linear relationship is between two variables
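As a rough illustration of the correlation formula above, the following Python sketch computes r directly from its definition. The function names and the small data sets are made up for illustration and are not from the lecture; they are chosen so that the three special cases (r = 1, r = -1, r = 0) can be checked by hand.

```python
import math

def sample_sd(values):
    """Sample standard deviation, dividing by n - 1 as in the variance formula."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def correlation(xs, ys):
    """Correlation r: sum of products of deviations, divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / ((n - 1) * sample_sd(xs) * sample_sd(ys))

# Hypothetical data, not from the lecture
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_pos = correlation(xs, [2 * x + 1 for x in xs])    # exactly positive linear
r_neg = correlation(xs, [10 - 3 * x for x in xs])   # exactly negative linear
# y = x^2 on symmetric x values: a clear relationship, but not a linear one
r_sym = correlation([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])

print(round(r_pos, 6), round(r_neg, 6), round(r_sym, 6))
```

The last case shows why r = 0 only rules out a linear relationship: y is completely determined by x, yet the positive and negative products of deviations cancel.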

Examples
Education and Mortality: r = -0.51
Draft Order and Birthday: r = -0.22
http://guessthecorrelation.com

Cautions about Correlation
Correlation is only a good statistic to use if the relationship is roughly linear
Correlation cannot be used to measure non-linear relationships
Always plot your data to make sure that the relationship is roughly linear!

Correlation with Non-linear Data
A high correlation does not always mean a strong linear relationship
A low correlation does not always mean no relationship (just no linear relationship)

Linear Regression
If our X and Y variables do show a linear relationship, we can calculate a best fit line in addition to the correlation:
$$Y = a + b X$$
The values a and b together are called the regression coefficients
a is called the intercept
b is called the slope

Example: US Cities
Mortality = a + b Education
How do we determine our best line, i.e. the best regression coefficients a and b?

Finding the Best Line
We want the line that has the smallest total Y-residuals
We must square the Y-residuals and add them up:
$$\text{total residuals} = \sum_{i=1}^{n} \left( y_i - (a + b x_i) \right)^2$$
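To make the "smallest total residuals" criterion concrete, here is a minimal Python sketch that evaluates the sum of squared Y-residuals for candidate lines and picks the best one from a coarse grid of (a, b) values. The data set and the grid are hypothetical, chosen only for illustration; a brute-force search is used here purely to show what is being minimized, not as the way the coefficients are normally found.

```python
def total_squared_residuals(xs, ys, a, b):
    """Sum over all points of (y_i - (a + b * x_i))^2 for the line Y = a + b X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, not from the lecture
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Two hand-picked candidate lines
loose_fit = total_squared_residuals(xs, ys, a=1.0, b=1.0)
better_fit = total_squared_residuals(xs, ys, a=2.2, b=0.6)

# Coarse grid: intercepts 0.0 to 4.0 and slopes 0.0 to 1.0, in steps of 0.1
candidates = [(a10 / 10, b10 / 10) for a10 in range(0, 41) for b10 in range(0, 11)]
best_a, best_b = min(
    candidates,
    key=lambda ab: total_squared_residuals(xs, ys, ab[0], ab[1]),
)

print(loose_fit, better_fit, best_a, best_b)
```

Squaring keeps positive and negative residuals from cancelling, so a line cannot look good merely by overshooting some points as much as it undershoots others.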

Best values for slope and intercept
The line with the smallest total residuals is called the least-squares line
We need calculus to find the a and b that give the least-squares line:
$$b = r \frac{s_y}{s_x}, \qquad a = \bar{y} - b \bar{x}$$

Example: Education and Mortality
Mortality = 1353.16 - 37.62 Education
Negative association means negative slope b

R Output
r = -0.51
Note that R gives correlation in terms of $r^2$
We need to take the square root and give it the correct sign

Interpreting the regression line
Mortality = 1353.16 - 37.62 Education
The slope coefficient b is the average change in the Y variable when the X variable increases by one
E.g. increasing education by one year means an average decrease of 38 deaths per 100,000
The intercept coefficient a is the average value of the Y variable when the X variable is equal to zero
E.g. an uneducated city (Education = 0) would have an average mortality of 1353 deaths per 100,000

Example: Crack Cocaine Sentences
Sentence = 90.4 + 0.269 Quantity
An extra gram of crack means an extra 0.27 months (about 8 days) of sentence on average
The intercept is not always interpretable: having no crack gets you a 90-month sentence!

Prediction
The regression equation can be used to predict our Y variable for a specific value of our X variable
E.g. Sentence = 90.4 + 0.269 Quantity
A person caught with 175 grams has a predicted sentence of $90.4 + 0.269 \times 175 = 137.5$ months
E.g. Mortality = 1353.16 - 37.62 Education
A city with only 6 median years of education has a predicted mortality of $1353.16 - 37.62 \times 6 = 1127$ deaths per 100,000
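The slope and intercept formulas and the two predictions above can be checked with a short Python sketch. The function names and the small data set are assumptions made for illustration; only the two fitted equations (Sentence and Mortality) and their inputs come from the lecture.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (a, b, r) using b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = statistics.mean(xs)
    y_bar = statistics.mean(ys)
    s_x = statistics.stdev(xs)  # sample standard deviation (divides by n - 1)
    s_y = statistics.stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b, r

def predict(a, b, x):
    """Predicted Y for a given X on the line Y = a + b X."""
    return a + b * x

# Hypothetical data, not from the lecture
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
a, b, r = least_squares_line(xs, ys)

# Predictions from the lecture's fitted equations
sentence_months = predict(90.4, 0.269, 175)    # person caught with 175 grams
mortality_rate = predict(1353.16, -37.62, 6)   # city with 6 median years of education

print(round(a, 2), round(b, 2), round(r, 3))
print(round(sentence_months, 1), round(mortality_rate))
```

Note that the sign of b always matches the sign of r, because the standard deviations are positive; this is the same fact used when recovering r from $r^2$.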

Problems with Prediction
The quality of predictions depends on the truth of the assumption of a linear relationship
Crack sentences aren't really linear
We shouldn't extrapolate predictions beyond the range of data for our X variable
E.g. Sentence = 90.4 + 0.269 Quantity
Having no crack still has a predicted sentence of 90 months!
E.g. Mortality = 1353.16 - 37.62 Education
A city with a median education of 35 years is predicted to have almost no deaths (about 36 per 100,000)!

The Clemens Report
Done by Clemens's agents: Hendricks Sports Mgmt
Very long and involved report: 45 pages, 18,000 words
Compared Clemens to three other pitchers
Claims that Clemens does not have an unusual late career

Selection Bias!
However, they only looked at the 3 pitchers that are most similar to him in their late careers
This set of comparison pitchers minimizes the chances that Clemens will look unique
We want to look at a larger comparison set of pitchers: all starting pitchers with long careers
32 pitchers (including Clemens) have had at least 15 years as a starting pitcher with at least 3000 total innings pitched

Better Analysis
A best fit line can be used to make comparisons between players easier
In this case, curves are even better (Stat 112) because careers are non-linear

Conclusions
Expanding the set of comparison pitchers makes Clemens's career path look more unusual
Clemens has an atypical trajectory relative to other pitchers
Of course, these curves tell us nothing about why Clemens's career path is unusual!

Next Class - Lecture 8
Introduction to Probability
Moore and McCabe: Section 4.1-4.2