Lecture 22: Multiple Regression (Ordinary Least Squares -- OLS)

Michael Hallstone, Ph.D.
hallston@hawaii.edu

Some Common Sense Assumptions for Multiple Regression
(These are truncated for simplicity.)

- The dependent variable is interval or ratio (not nominal or ordinal).
- Ratio level variables are best for the independent variables.
- Nominal and ordinal independent variables can be used if they are made into dummy variables:
  o A nominal or ordinal variable can be used as an independent variable if the characteristic of interest is coded 1 and all other values are coded 0. For example, in criminology it is known that, as a group, men commit more violent crime than women. So if we were interested in the effect of maleness on criminality we would code men = 1 and women = 0. If we thought having a college degree (or higher) might increase income, we would code college = 1 and all lower educational categories = 0. Thus we pretend it is a ratio variable where 1 = 100% of our characteristic of interest and 0 = 0% of the characteristic.
- The data come from a random sample.
- There is a logical relationship between the variables.

Introduction

You must understand Lecture 21 on simple regression for this lecture to make any sense. The same basic idea applies. In simple regression we saw whether we could predict the outcome of a dependent variable (usually called y) using one independent variable (usually called x). Simple regression only allowed us to use one independent variable to explain our dependent variable; in multiple regression we can use many.

In the previous lecture I made the argument that anything worth explaining in the social world has multiple causes, not just one. To quote the previous lecture:

"As you might guess, the social world is very complex: most social phenomena are not caused or explained by single factors. I can't think of a single social science question ('Why do people do x?') that can be explained by a single factor. Why people use drugs, commit murder, abuse children, spend money, or visit Hawaii are all poorly explained by single factors or variables."

Thus the term multiple regression comes from the notion that we are looking at how multiple independent variables influence the dependent variable.
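The dummy coding described in the assumptions above can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code` and the raw observations are made up for the example and do not come from the lecture's data.

```python
def dummy_code(values, of_interest):
    """Return 1 where a value has the characteristic of interest, else 0."""
    return [1 if v in of_interest else 0 for v in values]

# Hypothetical raw observations, invented purely for illustration.
sex = ["male", "female", "female", "male"]
education = ["high school", "college", "graduate", "some college"]

# men = 1, women = 0
male = dummy_code(sex, {"male"})
# college degree or higher = 1, all lower categories = 0
college = dummy_code(education, {"college", "graduate"})

print(male)     # [1, 0, 0, 1]
print(college)  # [0, 1, 1, 0]
```

Passing a set of categories lets "college degree (or higher)" cover more than one raw label while everything else collapses to 0.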

For example, using SAT scores alone to predict college GPA is inadequate. There are a whole lot of other things besides SAT scores that predict GPA. What about high school GPA as a measure of college preparedness? How about the number of hours studied per week? How about how many hours a week a person works? Obviously all sorts of independent variables work together to cause GPA to rise or fall, not a single factor or variable.

This ain't a math course

The math on multiple regression is daunting. Even way back in the dark ages of the mid-1990s, computers routinely took several minutes to spit out multiple regression output. Computer processing power is much faster now and this is no longer an issue, but suffice it to say that if it took a computer several minutes, doing it by hand is a nightmare! Thus I skip the math and use SPSS.

Example of Multiple Linear Ordinary Least Squares Regression

We will use the same example of the number of surfers in the water at Waimea Bay, home of the legendary Eddie Aikau big wave surf contest. We will look at several factors that might predict how many surfers are in the water at Waimea Bay: wave height, wind direction, swell direction, and whether or not the pro surfing tour is on the island.

[Photo: Real Waimea. No thanks! For a sense of size, that's a 9-10 foot surfboard.]

Dependent variable: number of surfers in the water at Waimea

The dependent variable (or y) is the number of surfers in the water at Waimea Bay.

Independent variables: wave size, wind, swell, and pros looking to be famous

A little more background is in order to understand the logic of these independent variables.

Wave height

Waves are like icebergs. It is not only what you see above the water; they have depth below the water too. A wave breaks when its bottom feels the ocean floor and slows down. This causes the top of the wave to pitch forward, or break. Surfers catch waves just when they start to break. Waimea is a deep-water surf spot, as all big wave surf spots are. So the waves at Waimea do not even start to break until they are large enough for the bottom of the wave to feel the deep ocean floor in Waimea Bay.

[Photo: The perspective from Waimea Point: legendary surfer Brock Little hoping to make it to the center of Waimea Bay.]

Wind and swell direction matter in surfing

Truly giant waves like those at Waimea are really dangerous even in the best of conditions. In case those pictures did not make an impression: no, really! The waves at Waimea are extremely dangerous. In my younger days I found waves with 15-foot faces to look like small mountains moving through the water. The faces of these immense waves are forty feet or even higher. They are as big as several-story buildings moving through the water. The experts who surf these waves have dedicated their lives to surfing and train to hold their breath for a few minutes while being driven violently deep underwater. Holding your breath that long sitting on your couch is hard; doing it while being violently thrashed by the awesome power of the ocean is way more difficult. Even really experienced professional surfers have died surfing large waves in the best of conditions. This danger explains why wind and swell are important.

[Photo: The 2016 Eddie contest, with one of the best big wave surfers in the world. That's a 9-10 foot long surfboard, folks.]

Wind should be calm or not too strongly offshore

The waves are best when the winds are calm or slightly offshore. Offshore winds that are not too strong gently shape the waves and make them more inviting, as they hold up the face of the breaking wave and allow a surfer to catch the wave more easily. Onshore winds, on the other hand, make surfing more difficult, as they crumble the wave and make its face bumpy. Trying to stand up on a bumpy 40+ foot wave while being battered by 20 feet of foaming water is probably next to impossible and almost certainly suicidal. Thus there are days when the waves are huge but the wind is too strong and/or from the wrong direction, and the immense waves of Waimea go unridden.

Swell direction influences how well waves break

The most famous wave in the world is likely Pipeline, just down the road from Waimea. They say it needs a West swell to create its magical tube. Pipeline will break when the waves come from another direction, but it will not produce its postcard beauty. Since only crazy people surf Waimea I have no idea what swell direction it likes, but let us pretend Waimea is more fun to surf when the waves come from a favorable swell direction.

[Photo: Pipeline on a clean West swell.]

The North Shore is even more crowded when the pro surf tour is in town

The North Shore of Oahu is the proving ground of professional surfing. It is not hyperbole to say that a professional surfer is not considered world class until they show they can surf the large, dangerous waves of the North Shore. Surfers make their money from sponsorships, and sponsors demand to see pictures and videos of their surfers on the North Shore. When Waimea is breaking, it is almost mandatory for a professional surfer to be photographed surfing its legendary waves. So when the professional surfing tour is on Oahu during November and December, we would expect more surfers in the water at Waimea.

[Photo: The crowds on a super small day. Imagine that! This is not considered real Waimea.]

A review of the variables

Dependent variable (y): number of surfers in the water at Waimea

Independent variables (x):
- $x_1$ = wave height (in feet)
- $x_2$ = bad wind or good wind? (bad wind = 1, good wind = 0)
- $x_3$ = pro tour in town? (pro tour in town = 1, pro tour out of town = 0)
- $x_4$ = swell direction (optimum swell direction = 1, other swell direction = 0)

The regression equation would look like this:

$$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$$

where $a$ = y-intercept and $\hat{y}$ = estimated number of surfers in the water at Waimea. The $b$'s in the equation are the slopes of each independent variable and will be explained below.

SPSS output

Is the model significant?

Look at the ANOVA box under "Sig." That is the p-value of the entire model. The significance of the model refers to whether or not all of the independent variables combined account for a significant amount of variation in the dependent variable.

Decision rule for statistical significance using the p-value

- If p-value < α (alpha), it is significant!
- If p-value >= α (alpha), it is NOT significant!

Typically we use alpha = .05, so:

- If p-value < .05, it is significant!
- If p-value >= .05, it is NOT significant!

Our p-value is small. SPSS displays it as .000, but it is not really zero; it is p < .001, which is less than .05, so our model is significant.
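The decision rule above is simple enough to write as a one-line check. The sketch below is illustrative only: the function name `is_significant` is made up, and the p-values passed to it are stand-ins rather than SPSS output (0.0009 simply represents "p < .001").

```python
def is_significant(p_value, alpha=0.05):
    """Decision rule: significant only when the p-value is strictly below alpha."""
    return p_value < alpha

print(is_significant(0.0009))  # True: stands in for a model with p < .001
print(is_significant(0.05))    # False: a p-value equal to alpha is NOT significant
print(is_significant(0.20))    # False
```

Note the strict inequality: a p-value exactly equal to alpha falls on the "not significant" side of the rule.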

Finding the slope of the independent variables: unstandardized coefficients

Each independent variable has a number that represents a slope. In SPSS these are called unstandardized coefficients and are found under "B" in the Coefficients box. The slopes or unstandardized coefficients are meaningful only if the overall model is statistically significant.

Coefficients (a)

Model 1                    B         Std. Error   Beta     t        Sig.
(Constant)                 .764      13.631                .056     .956
Wave height at Waimea      1.410     .541         .334     2.608    .013
bad wind=1                 -20.903   7.235        -.408    -2.889   .007
pro tour in town=1         14.857    5.139        .272     2.891    .007
best swell direction=1     7.544     6.338        .115     1.190    .242

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: number of surfers in water

Recall that

$$\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$$

where $a$ = y-intercept.

Dependent variable (y): number of surfers in the water at Waimea

Independent variables (x):
- $x_1$ = wave height (in feet)
- $x_2$ = bad wind or good wind? (bad wind = 1, good wind = 0)
- $x_3$ = pro tour in town? (pro tour in town = 1, pro tour out of town = 0)
- $x_4$ = swell direction (optimum swell direction = 1, other swell direction = 0)

Thus our regression equation is

$$\hat{y} = 0.764 + 1.410 x_1 - 20.903 x_2 + 14.857 x_3 + 7.544 x_4$$

Now we need to see which of the independent variables have a statistically significant effect. We use the p-value under "Sig." in the far right column of the Coefficients box above.
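To see how the fitted equation turns a set of conditions into an estimate, here is a minimal Python sketch. The intercept and slopes are copied from the "B" column of the Coefficients table; the function name `predict_surfers` and the example day are hypothetical and chosen only for illustration.

```python
# Intercept and slopes copied from the "B" column of the Coefficients table.
INTERCEPT = 0.764
SLOPES = {
    "wave_height": 1.410,   # x1, in feet
    "bad_wind": -20.903,    # x2, dummy: bad wind = 1
    "pro_tour": 14.857,     # x3, dummy: pro tour in town = 1
    "best_swell": 7.544,    # x4, dummy: optimum swell direction = 1
}

def predict_surfers(wave_height, bad_wind, pro_tour, best_swell):
    """Plug one day's values into y-hat = a + b1*x1 + b2*x2 + b3*x3 + b4*x4."""
    x = {
        "wave_height": wave_height,
        "bad_wind": bad_wind,
        "pro_tour": pro_tour,
        "best_swell": best_swell,
    }
    return INTERCEPT + sum(SLOPES[name] * x[name] for name in SLOPES)

# Hypothetical day: 20-foot waves, good wind, pro tour in town, non-optimum swell.
estimate = predict_surfers(20, 0, 1, 0)
print(round(estimate, 3))  # 43.821
```

The estimate is only as good as the model, and a slope still has to pass its own significance test before it is worth interpreting.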

- $x_1$ = wave height: p-value = .013 or 1.3%, so it is less than .05 or 5% and statistically significant.
- $x_2$ = bad wind or good wind: p-value = .007 or 0.7%, so it is less than .05 or 5% and statistically significant.
- $x_3$ = pro tour in town: p-value = .007 or 0.7%, so it is less than .05 or 5% and statistically significant.
- $x_4$ = swell direction: p-value = .242 or 24.2%, so it is greater than .05 or 5% and not statistically significant. We treat this variable as having no effect.

Interpreting the slopes of the independent variables: unstandardized coefficients

Recall that the overall model must be statistically significant for our interpretation below to make sense. If the overall model is not significant, the interpretation of the slopes is meaningless. Recall from above that our model has p < .001, so it is statistically significant.

$$\hat{y} = 0.764 + 1.410 x_1 - 20.903 x_2 + 14.857 x_3 + 7.544 x_4$$

To quote my colleague Dr. Michael Delucchi: "In multiple regression each slope represents the impact of the independent variable when all other independent variables are held constant. It tells us the impact each independent variable has over and above the effects of remaining independent variables."

Interpreting the effect of ratio level independent variables on the dependent variable

$x_1$ = wave height, $b_1$ = 1.41

When all other independent variables are held constant, for every one-foot increase in wave height the number of surfers in the water increases by 1.41. (There is no such thing as 1.41 surfers, so if you wanted to round this to approximately 1 surfer that would make sense.) An increase in wave height increases the number of surfers in the water.

Interpreting the effect of dummy independent variables on the dependent variable

Recall that the variables below were nominal variables coded 1 and 0, where 1 = the characteristic of interest. We pretend they are ratio variables where 1 = 100% of the characteristic of interest and 0 = 0% of the characteristic of interest.

- $x_2$ = bad wind or good wind? (bad wind = 1, good wind = 0)
- $x_3$ = pro tour in town? (pro tour in town = 1, pro tour out of town = 0)
- $x_4$ = swell direction (optimum swell direction = 1, other swell direction = 0)

Dummy variables are interpreted differently. The slope is the effect on the dependent variable when the dummy variable = 1, compared to when it = 0.

$x_2$ = bad wind or good wind, $b_2$ = -20.903 (the minus sign is meaningful!)

When all other independent variables are held constant, when there is a bad wind (as compared to a good wind), the number of surfers in the water decreases by 20.903 surfers. (There is no such thing as 20.903 surfers, so if you wanted to round this to approximately 21 surfers that would make sense.) The minus sign means the effect on the dependent variable is negative. The number of surfers goes down when the wind is bad.

$x_3$ = pro tour in town, $b_3$ = 14.857

When all other independent variables are held constant, when the pro tour is in town (as compared to when the pro tour is not in town), the number of surfers in the water increases by 14.857 surfers. (There is no such thing as 14.857 surfers, so if you wanted to round that to approximately 15 surfers that would make sense.) Having the pro tour in town increases the crowd in the water.

$x_4$ = swell direction

Since its p-value = .242 or 24.2%, it is not statistically significant, and we treat this variable as having no effect. So we have no evidence that the optimum swell direction for Waimea changes the number of surfers in the water!
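One way to check what "held constant" means is to change a single input of the fitted equation and look at the difference in the estimate. The sketch below assumes the coefficients from the Coefficients table; the baseline day and the function name `y_hat` are hypothetical.

```python
# Coefficients copied from the "B" column of the Coefficients table.
INTERCEPT = 0.764
B_WAVE, B_WIND, B_TOUR, B_SWELL = 1.410, -20.903, 14.857, 7.544

def y_hat(wave_height, bad_wind, pro_tour, best_swell):
    """Estimated number of surfers from the fitted regression equation."""
    return (INTERCEPT + B_WAVE * wave_height + B_WIND * bad_wind
            + B_TOUR * pro_tour + B_SWELL * best_swell)

# Hypothetical baseline day: 20-foot waves, good wind, no pro tour, non-optimum swell.
base = y_hat(20, 0, 0, 0)

# Change exactly one variable at a time; everything else stays at the baseline.
one_more_foot = y_hat(21, 0, 0, 0) - base
bad_wind_effect = y_hat(20, 1, 0, 0) - base
tour_effect = y_hat(20, 0, 1, 0) - base

print(round(one_more_foot, 3))    # 1.41
print(round(bad_wind_effect, 3))  # -20.903
print(round(tour_effect, 3))      # 14.857
```

Each difference equals the corresponding slope no matter which baseline day is chosen, which is exactly what the "all other independent variables held constant" wording describes.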

Standardized slopes or standardized coefficients: SPSS calls them "Beta"

Recall that the overall model must be statistically significant for our interpretation below to make sense.

We need a way to figure out which independent variable has the strongest effect on the dependent variable. In our example we would wonder what has a greater effect on the number of surfers at Waimea: wave height, wind, or having pros in town? (We know from above that swell direction is not statistically significant.) To do this we standardize the slopes or coefficients. The standardized slope that is largest in absolute value has the largest relative effect on the dependent variable.

The formula is

$$\text{Beta} = b \left( \frac{s_x}{s_y} \right)$$

where
- $b$ = slope of the independent variable
- $s_x$ = standard deviation of the independent variable
- $s_y$ = standard deviation of the dependent variable

Thus the standardized slope for the size of the waves is

$$\text{Beta}_{\text{wave height}} = b \left( \frac{s_{\text{wave height}}}{s_{\text{number of surfers in the water}}} \right) = 1.41 \left( \frac{6.04872}{25.53345} \right) = 1.41\,(0.2369) = .334$$
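The arithmetic in the Beta formula can be checked with a short Python sketch. The slope and the two standard deviations are the ones quoted in the lecture; the function name `standardized_slope` is made up for the example.

```python
def standardized_slope(b, s_x, s_y):
    """Beta = b * (s_x / s_y): rescale a slope into standard deviation units."""
    return b * (s_x / s_y)

# Values quoted in the lecture for wave height and number of surfers.
b_wave = 1.41
s_wave_height = 6.04872
s_surfers = 25.53345

beta_wave = standardized_slope(b_wave, s_wave_height, s_surfers)
print(round(beta_wave, 3))  # 0.334
```

The result matches the .334 that SPSS reports under "Beta" for wave height.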

Standardized slopes or coefficients or Beta in SPSS

Coefficients (a)

Model 1                    B         Std. Error   Beta     t        Sig.
(Constant)                 .764      13.631                .056     .956
Wave height at Waimea      1.410     .541         .334     2.608    .013
bad wind=1                 -20.903   7.235        -.408    -2.889   .007
pro tour in town=1         14.857    5.139        .272     2.891    .007
best swell direction=1     7.544     6.338        .115     1.190    .242

B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: number of surfers in water

Strength of standardized slopes in descending order

- Beta $x_2$ = bad wind or good wind = -0.408
- Beta $x_1$ = wave height = 0.334
- Beta $x_3$ = pro tour in town = 0.272

In absolute value, -0.408 is the largest number, so having a bad wind has the greatest effect of all the independent variables on the number of surfers in the water. That should make sense, as when the wind is bad, surfing Waimea is literally death defying and perhaps suicidal. Real Waimea only breaks when the waves are huge and dangerous. If the waves are giant and the wind is strongly onshore, there are zero surfers in the water because it's too dangerous.

Interpreting standardized slopes or coefficients or Beta

The plain English on this one is a little wacky, as it involves the language of standard deviations. Recall that a standard deviation produces a number that answers the question "On average, how much do the data points differ from the mean?" So the standard deviation is, roughly, the average deviation from the mean.

Beta $x_2$ = bad wind or good wind = -0.408 (the minus sign matters!)

With the other independent variables held constant, when there is a bad wind (as compared to when there is a good wind), there is a 0.408 standard deviation decrease in the number of surfers in the water.

Beta $x_1$ = wave height = 0.334

With the other independent variables held constant, for every increase of one standard deviation in wave height, there is a 0.334 standard deviation increase in the number of surfers in the water.

Beta $x_3$ = pro tour in town = 0.272

With the other independent variables held constant, when the pro tour is in town (as compared to when the pro tour is not in town), there is a 0.272 standard deviation increase in the number of surfers in the water.
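The ranking of standardized slopes described above can be reproduced with a short Python sketch. It assumes the Beta and Sig. values read from the Coefficients table; the variable names and the choice to drop non-significant variables before ranking are illustrative, mirroring how the lecture lists only the three significant slopes.

```python
# (variable, Beta, Sig.) as read from the Coefficients table.
rows = [
    ("wave height", 0.334, 0.013),
    ("bad wind", -0.408, 0.007),
    ("pro tour in town", 0.272, 0.007),
    ("best swell direction", 0.115, 0.242),
]

ALPHA = 0.05

# Keep only the statistically significant variables (p-value < alpha).
significant = [row for row in rows if row[2] < ALPHA]

# Rank by the absolute value of Beta, strongest effect first.
ranked = sorted(significant, key=lambda row: abs(row[1]), reverse=True)

for name, beta, sig in ranked:
    print(f"{name}: Beta = {beta:+.3f}")
# bad wind: Beta = -0.408
# wave height: Beta = +0.334
# pro tour in town: Beta = +0.272
```

Sorting on the absolute value is what lets the negative bad-wind slope come out on top; the sign tells us only the direction of the effect.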