ISyE 6414: Regression Analysis

Joella McCormick
5 years ago

1 ISyE 6414: Regression Analysis
Lectures: MWF 8:00-10:30, MRDC #2404
Early five-week session: May 14 - June 15 (8:00-9:10; 10-min break; 9:20-10:30)
Instructor: Dr. Yajun Mei ("YA_JUNE MAY")
ymei@isye.gatech.edu; Tel: (O)
Office Hours: MWF 10:30-11:00, after class or in Groseclose #343
Course Homepage: Canvas (all HWs are due on Canvas); backup:
HW#1 is due on Friday, May 18 for on-campus students, and on Wednesday, May 23 for distance learning students.

2 My academic pathway
Undergraduate: Math, Peking Univ., BS in 1996
Worked as a computer programmer in a Chinese bank
Graduate: PhD in Math with a minor in EE, Caltech (advisor: Dr. Gary Lorden)
Post Doc in biostatistics: FHCRC, Seattle, Sep 2005 (supervisor: Dr. Sarah Holte)
New Research Fellow: SAMSI & Duke Univ., Fall 2005
Joined ISyE of GT since Jan
Currently a tenured associate professor.

3 About this course
Regression Analysis is the key building block for many modern Machine Learning, Artificial Intelligence, and Business Analytics techniques and methods (such as Neural Networks, Deep Learning, Boosting, Random Forest, etc.).
This course aims to help you:
Understand its theoretical aspects (HW#1, #2, #4, and a midterm)
Understand its computational aspects (HW#3, and a course project)

4 Organization of the Course
Textbooks (notes/slides provided):
Kutner, Nachtsheim, Neter and Li, Applied Linear Statistical Models, 5th ed.
Faraway, Practical Regression and ANOVA using R (freely downloadable online)
Topics:
Simple Linear Regression (Ch 1-4)
Multiple Linear Regression (Ch 5-11) (2 weeks, Midterm)
Advanced Regression (Ch 13-14) (2 weeks)
Design of Experiments (Ch 13, 14)

5 Organization of the Course
Grading Policy (the past AVG GPA is [3.7, 3.9]):
Class attendance (5%)
Homework (4*10% = 40%): Collaboration is encouraged, but you cannot look at any other solutions before submitting.
One in-class Midterm (25%): 9:15am-10:30am, Friday, May 25 (happy Memorial weekend)
Class project (30%): a team of 2-4 or by yourself. See the handout for possible project topics.
  Proposal (1-3 pages): May 30 (Wed)
  Presentation file: due 7am on June 13 (Wed) (only for on-campus students, not required for DL students)
  Final report: June 15 (Friday)
[Only for the Distance Learning students: two-lecture delay for homework and the class project proposal, and one-week delay for the midterm and the final report.]

6 Part A
Basic background on probability and statistics.
We might not discuss this background part in detail, but I have listed some slides here so that you can refresh your memory if necessary.
Three key probability distributions: Binomial, Poisson, and Normal.

7 Probability Review
See Appendix A of our text.
Probability
Discrete Random Variables
Continuous Random Variables
Joint Distribution

8 Probability
Basics of Probability Theory
Random experiments, e.g., flip a fair coin three times and observe Heads or Tails
Sample space: the set of all possible outcomes, e.g., S = {HHH, THH, HTH, HHT, HTT, THT, TTH, TTT}
Event: a subset of the sample space of a random experiment, e.g., observing one head
Union/Intersection/Complement of events; Counting Techniques; Axioms of Probability; Conditional Probability; Independence; Bayes' Theorem

9 Random Variable
A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.
Example: Let X be the number of heads when flipping a fair coin three times. Rigorously,
w      HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
X(w)   3    2    2    2    1    1    1    0
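As a small illustration of the coin-flip example, the sketch below enumerates the sample space in R and tabulates the distribution of X. The object names (`flips`, `pmf`) are made up for this illustration and are not part of the course material.

```r
# Enumerate the sample space of three fair coin flips (8 equally likely outcomes)
flips <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                     third = c("H", "T"), stringsAsFactors = FALSE)

# X(w) = number of heads in outcome w
flips$X <- rowSums(flips[, c("first", "second", "third")] == "H")

# Probability mass function of X: count outcomes per value, divide by 8
pmf <- table(flips$X) / nrow(flips)
print(flips)
print(pmf)

# X should match a Binomial(3, 0.5) random variable
all.equal(as.numeric(pmf), dbinom(0:3, size = 3, prob = 0.5))
```

The last line compares the enumerated probabilities with `dbinom()`, which is one way to see that X is Binomial(3, 0.5).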

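10 Discrete Random Variable
X with countable possible values
Probability mass function: $f(x) = P(X = x)$
Cumulative distribution function: $F(x) = P(X \le x) = \sum_{t \le x} f(t)$
Mean: $\mu = E(X) = \sum_x x\, f(x)$
Variance: $\sigma^2 = \mathrm{Var}(X) = \sum_x (x - \mu)^2 f(x)$
Standard deviation: $\sigma = \sqrt{\sigma^2}$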

11 Important discrete RVs
Discrete Uniform
Binomial(n, p)
Geometric(p)
Poisson($\lambda$)
What are the mean and Var/SD?
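For reference, the standard answers to the question above are collected here. Two parameterizations are assumptions of this sketch, not taken from the slides: the discrete uniform is on the integers $a, a+1, \ldots, b$, and the geometric variable counts the number of trials up to and including the first success. The SD is the square root of the variance in each case.

```latex
\begin{aligned}
\text{Discrete Uniform}\{a,\ldots,b\}: \quad & E(X) = \frac{a+b}{2}, & \mathrm{Var}(X) &= \frac{(b-a+1)^2 - 1}{12} \\
\text{Binomial}(n,p): \quad & E(X) = np, & \mathrm{Var}(X) &= np(1-p) \\
\text{Geometric}(p): \quad & E(X) = \frac{1}{p}, & \mathrm{Var}(X) &= \frac{1-p}{p^2} \\
\text{Poisson}(\lambda): \quad & E(X) = \lambda, & \mathrm{Var}(X) &= \lambda
\end{aligned}
```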

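12 Continuous Random Variable
Probability density function: $f(x) \ge 0$ with $P(a \le X \le b) = \int_a^b f(x)\,dx$
Cumulative distribution function: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$
Mean: $\mu = E(X) = \int x\, f(x)\,dx$
Variance: $\sigma^2 = \mathrm{Var}(X) = \int (x - \mu)^2 f(x)\,dx$
Standard deviation: $\sigma = \sqrt{\sigma^2}$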

13 Important Continuous RVs
Gamma/Weibull/Lognormal/Beta distribution
What are the mean and Var/SD?

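14 Central Limit Theorem
a. If X is Binomial(n, p), then approximately
$$Z = \frac{X - np}{\sqrt{np(1-p)}} \sim N(0, 1) \quad \text{(continuity correction)}$$
b. If $X_1, X_2, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then approximately
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \quad \left(\text{or } Z = \frac{\sum_{i=1}^n X_i - n\mu}{\sqrt{n}\,\sigma} \sim N(0, 1)\right)$$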
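A quick numerical check of both statements can be sketched in R. The specific values (n = 50, p = 0.3, cutoff 18, exponential samples of size 40, 5000 replications) are arbitrary choices for this illustration.

```r
# (a) Binomial: exact P(X <= k) versus the normal approximation
#     with continuity correction (k + 0.5)
n <- 50; p <- 0.3; k <- 18
exact  <- pbinom(k, size = n, prob = p)
approx <- pnorm((k + 0.5 - n * p) / sqrt(n * p * (1 - p)))
print(c(exact = exact, approx = approx))

# (b) iid case: standardized means of Exponential(1) samples
#     (mean 1, variance 1) should look roughly N(0, 1)
set.seed(1)
m <- 40
z <- replicate(5000, sqrt(m) * (mean(rexp(m, rate = 1)) - 1) / 1)
print(c(mean = mean(z), sd = sd(z)))
```

The simulated `z` values should have mean near 0 and standard deviation near 1; a histogram or `qqnorm(z)` shows how close the shape is to normal.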

15 Statistical Review
Population parameter vs. Sample statistic
Point Estimation
Confidence Interval
Hypothesis Testing

16 Population Parameter vs Sample Statistic
Population: a set of entities about which statistical inferences are to be drawn. Typically the population is very large, making a census or a complete enumeration of all its values impractical or impossible.
Sample: a subset of observed objects from the population. The sample is of manageable size (though possibly massive).
Parameter: a (typically unobservable) quantity that indexes a family of probability distributions. It can be regarded as a numerical characteristic of a population or a model.
Statistic: a measure of some attribute of a sample. It is calculated by applying a function to the values of the items in the sample.
[Population parameter vs. Sample statistic]

17 Important Sample Statistics
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$
Sample standard deviation: $s = \sqrt{s^2}$
Sample range: $r = \max(x_i) - \min(x_i)$
Quartiles:
The lower quartile: 25% of the data is less than $q_1$
The median: 50% of the data is less than $q_2$
The upper quartile: 75% of the data is less than $q_3$
As a measure of variability, the interquartile range (IQR) is defined as $IQR = q_3 - q_1$.
Plots: Stem-and-Leaf Diagram/Plot, Histogram, Box Plots, Probability Plots (or Normal QQ plots)

18 Normal Distribution
Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.
Sample mean: $\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, or $\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0, 1)$.
Sample variance: $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ satisfies $\frac{(n-1) S^2}{\sigma^2} \sim \chi^2_{n-1}$ (Chi-square distribution).

19 Normal Distribution (Cont.)
Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.
The sample mean $\bar{X}$ is independent of the sample variance $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$. Moreover,
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{N(0, 1)}{\sqrt{\chi^2_{n-1}/(n-1)}} \sim t \text{ with df} = n - 1.$$
[In many cases, $\frac{\hat\theta - \theta}{\mathrm{s.e.}(\hat\theta)}$ has a t-distribution.]
In Appendix B on page 1317, for the t-distribution, the critical point is $t_{\alpha, df} = t(A; df)$ with $A = 1 - \alpha$, so $t_{0.025, 18} = t(0.975; 18) = 2.101$.

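20 Point Estimation
The bias of the estimator $\hat\theta$ is $\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta$. An estimator is unbiased if the bias is 0.
The variance of the estimator is $\mathrm{Var}(\hat\theta)$.
The mean square error of the estimator $\hat\theta$ is $\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta)^2 = \mathrm{Var}(\hat\theta) + [\mathrm{Bias}(\hat\theta)]^2$.
The standard error of $\hat\theta$ is $\mathrm{s.e.} = \sqrt{\mathrm{Var}(\hat\theta)}$.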
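The decomposition of the mean square error into variance plus squared bias follows from adding and subtracting $E(\hat\theta)$. A short derivation, using only the definitions on this slide:

```latex
\begin{aligned}
\mathrm{MSE}(\hat\theta)
  &= E\big[(\hat\theta - E\hat\theta) + (E\hat\theta - \theta)\big]^2 \\
  &= E(\hat\theta - E\hat\theta)^2
     + 2\,(E\hat\theta - \theta)\,\underbrace{E(\hat\theta - E\hat\theta)}_{=\,0}
     + (E\hat\theta - \theta)^2 \\
  &= \mathrm{Var}(\hat\theta) + \big[\mathrm{Bias}(\hat\theta)\big]^2 .
\end{aligned}
```

The cross term vanishes because $E\hat\theta - \theta$ is a constant and $E(\hat\theta - E\hat\theta) = 0$.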

21 Methods of Point Estimation
There are three methods for constructing point estimates of a population parameter:
A. Method of moments (MOM)
B. Method of maximum likelihood (MLE)
C. Bayesian estimation of parameters

22 MOM & MLE
The method of moments (MOM) estimators are found by equating the population moments to the sample moments and solving the resulting equations, e.g.,
$$h(\theta) = E(X) = \bar{X} = \frac{X_1 + \cdots + X_n}{n}.$$
The maximum likelihood estimator (MLE) is the value of $\theta$ that maximizes the likelihood function
$$L(\theta) = f(x_1)\, f(x_2) \cdots f(x_n).$$
If the domain of $f(x)$ does not depend on $\theta$, solving $\frac{d \log L(\theta)}{d\theta} = 0$ yields the MLE.
Otherwise, plot $L(\theta)$ and find the maximum.
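To make the two methods concrete, here is a minimal sketch for an Exponential(rate) sample. The exponential model, the true rate 2, and the sample size 200 are assumptions chosen for illustration only. For this model $E(X) = 1/\text{rate}$, so the MOM estimator is $1/\bar{x}$, and solving $d \log L / d\,\text{rate} = 0$ gives the same closed form.

```r
set.seed(1)
x <- rexp(200, rate = 2)   # simulated data; the true rate is 2

# Log-likelihood of the exponential model as a function of the rate
loglik <- function(rate) sum(dexp(x, rate = rate, log = TRUE))

# Numerical MLE: maximize the log-likelihood over a bounded interval
fit <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE)

# Closed form shared by MOM and MLE for this model
closed_form <- 1 / mean(x)

print(c(numeric_mle = fit$maximum, closed_form = closed_form))
```

`optimize()` is a one-dimensional optimizer from base R; the numerical maximizer should agree with `1 / mean(x)` up to the optimizer's tolerance.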

23 Confidence Interval & Hypothesis Testing
One sample:
1. Normal mean with known variance (one-sided)
2. Normal mean with unknown variance
3. Normal variance
4. Proportion of Binomial distribution
Two samples: inference on mean difference
5. Two independent normal distributions: variances known
6. Two independent normal distributions: unknown and equal variances
7. Two independent normal distributions: unknown and unequal variances
8. Paired samples

24 Part B
Overview of Supervised Learning
Simple Linear Regression

25 Overview of Supervised Learning
Supervised learning (directed data mining, learning with a teacher):
The observed data are of the form $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$, where the variables can be split into two groups:
independent variables (explanatory variables, inputs, predictors) $X = (X_1, \ldots, X_p)$, and
one (or more) dependent variables (outputs, responses) $Y$.
The objective is to predict $Y$ given values of the input $X$.

26 Supervised Learning
Observed data (training data): $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$
Objective: find a function $f(x_{\mathrm{new}}) = f(x_1, \ldots, x_p)$ that predicts $Y$ well for any given input $x_{\mathrm{new}} = (x_1, \ldots, x_p)$.
Deterministic relationship? (many classification tasks in machine learning)

27 The Additive Error Model
Key statistical idea: Observed Data = True Value + Noise
For the observed training data,
$$Y_i = f(x_{i1}, \ldots, x_{ip}) + \varepsilon_i \quad \text{for } i = 1, \ldots, n,$$
where the errors $\varepsilon_i$'s are iid with mean 0 and are independent of the $X$'s.
Find the function $f(x_1, \ldots, x_p)$ or find its approximation! (Generative vs. Predictive models)
The simplest case: when $p = 1$, $f(x) = \beta_0 + \beta_1 x$
Simple linear regression: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
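The additive error model with $p = 1$ can be simulated directly, which is a handy way to see "true value + noise" in action. The coefficients (2 and 0.5), the noise standard deviation, the sample size, and the seed below are all arbitrary choices for this sketch.

```r
set.seed(6414)
n <- 100
x <- runif(n, min = 0, max = 10)     # inputs, treated as fixed once drawn
eps <- rnorm(n, mean = 0, sd = 1)    # iid errors with mean 0
y <- 2 + 0.5 * x + eps               # observed data = true value + noise

# Fit the simple linear regression and compare with the true (2, 0.5)
sim_fit <- lm(y ~ x)
print(coef(sim_fit))
```

With 100 observations the estimated intercept and slope should land close to, but not exactly on, the values used to generate the data.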

28 The First Main Topic
Simple linear regression

29 Empirical Models: Regression
Many engineering and scientific problems are concerned with determining a relationship between a set of variables.
For example: Y = college GPA in the first year, X = high school GPA; or Y = mortality rate, X = immunization rate.
Knowledge of such a relationship would enable us to predict the output Y.
Regression analysis is a statistical technique that is very useful for these types of problems, as it can be used to build a model to predict Y at a given X value.

30 Example: Immunization and Mortality
Suppose one wants to investigate the relationship between the percentage of children who have been immunized against the infectious diseases diphtheria, pertussis, and tetanus (DPT) in a given country and the corresponding mortality rate for children under five years of age in that country.
The UN Children's Fund (UNICEF) considers the under-five mortality rate to be one of the most important indicators of the level of well-being for children.

31 Data
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
Nation            X    Y     Nation     X    Y     Nation     X    Y
Bolivia          77  118     Ethiopia  13  208     Mexico    91   33
Brazil           69   65     Finland   95    7     Poland    98   16
Cambodia         32  184     France    95    9     Russian   73   32
Canada           85    8     Greece    54    9     Senegal   47  145
China            94   43     India     89  124     Turkey    76   87
Czech Republic   99   12     Italy     95   10     UK        90    9
Egypt            89   55     Japan     87    6

32 Look at Scatter Plot
The plot shows that the mortality rate tends to decrease as the percentage of immunized children increases.

33 Question
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
Question: Are Y and X related (associated), and how?
Does better immunization improve the mortality rate?
Can we use the data to develop a model for predicting the under-five mortality rate from the percentage of children immunized against DPT?

34 Linear Regression
It is interesting both theoretically, because of the elegance of the underlying theory, and from an applied point of view, because of the wide variety of uses.
Fit a model for a dependent variable as a function of one or more independent variables.
We will talk about:
Building models
Assessing fit and reliability
Drawing conclusions

35 A Simple Linear Regression
We are interested in developing a linear equation that best summarizes the relationship in a sample between the response variable (Y) and the predictor variable (or independent variable) x:
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where the $\varepsilon_i$'s are independent with mean 0 and variance $\sigma^2$.
The equation is also used to predict Y from X.

36 (a) How to estimate the $\beta$'s
Observe n data points $(Y_i, x_i)$, and assume
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where the $\varepsilon_i$'s are independent with mean 0 and variance $\sigma^2$.
How to estimate the $\beta$'s?

37 Method of Least Squares
The (ordinary) least squares estimator: choose $\beta_0$ and $\beta_1$ to minimize the residual sum of squares (RSS)
$$\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2.$$

38 Why Least Squares?
It gives the maximum likelihood estimators (MLE) of $\beta_0$ and $\beta_1$ when the errors $\varepsilon_i$'s are iid $N(0, \sigma^2)$.
It leads to the best linear unbiased estimators (BLUE) of $\beta_0$ and $\beta_1$, whether or not the errors $\varepsilon_i$'s are normally distributed.
[A linear estimator is of the form $\sum_{i=1}^n c_i Y_i$. The meaning of BLUE for $\beta_1$: minimize $\mathrm{Var}\left(\sum c_i Y_i\right) = \sigma^2 \sum c_i^2$ subject to $E\left(\sum c_i Y_i\right) = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_1$ for all $\beta_0$ and $\beta_1$, i.e., subject to $\sum c_i = 0$ and $\sum c_i x_i = 1$.]

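39 Method of Least Squares
When minimizing the residual sum of squares (RSS), the solutions are
$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x},$$
where $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n \bar{x} \bar{y}$.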
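One way to see where these solutions come from is to set the two partial derivatives of the RSS to zero (the normal equations). A sketch of the steps, in the notation of these slides:

```latex
\begin{aligned}
\frac{\partial\,\mathrm{RSS}}{\partial \beta_0}
  &= -2 \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta_1}
  &= -2 \sum_{i=1}^n x_i (y_i - \beta_0 - \beta_1 x_i) = 0 .
\end{aligned}
```

Substituting $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ into the second equation gives

```latex
\sum_{i=1}^n x_i (y_i - \bar{y}) = \hat\beta_1 \sum_{i=1}^n x_i (x_i - \bar{x})
\;\Longrightarrow\;
\hat\beta_1 = \frac{S_{xy}}{S_{xx}},
```

because $\sum x_i (y_i - \bar{y}) = \sum (x_i - \bar{x})(y_i - \bar{y}) = S_{xy}$ and $\sum x_i (x_i - \bar{x}) = \sum (x_i - \bar{x})^2 = S_{xx}$.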

40 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
Nation            X    Y     Nation     X    Y     Nation     X    Y
Bolivia          77  118     Ethiopia  13  208     Mexico    91   33
Brazil           69   65     Finland   95    7     Poland    98   16
Cambodia         32  184     France    95    9     Russian   73   32
Canada           85    8     Greece    54    9     Senegal   47  145
China            94   43     India     89  124     Turkey    76   87
Czech Republic   99   12     Italy     95   10     UK        90    9
Egypt            89   55     Japan     87    6

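41 Answer
For our data, $n = 20$, $\bar{x} = 77.4$, $\bar{y} = 59$, $\sum x_i^2 = 130446$, $\sum x_i y_i = 68626$.
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2 = 10630.8$
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n \bar{x} \bar{y} = -22706$
$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{-22706}{10630.8} \approx -2.1359$; $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = 59 - (-2.13587)(77.4) \approx 224.316$
Thus, the fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$, or $E(Y) = 224.316 - 2.1359 x$.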

42 (b) Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$.
Estimate the mean under-five mortality rate per 1000 live births when x = 10. Repeat the question when x = 90.
[$\approx 202.96$; $\approx 32.09$]

43 (c) How to estimate $\sigma^2$?
Recall that the model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$'s are iid with mean 0 and variance $\sigma^2$.
We have the estimators $\hat\beta_0$ and $\hat\beta_1$; how do we estimate the third parameter, $\sigma^2$?
Answer: It is natural to use the observed fitting errors $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$ and the residual sum of squares $\mathrm{RSS} = \sum_{i=1}^n e_i^2$.
The estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2}$ [and $\frac{(n-2)\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-2}$].
In practice, it is easier to compute RSS as follows:
$$\mathrm{RSS} = \sum_{i=1}^n e_i^2 = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

44 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
In our example, the fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$. Find an estimate of $\sigma^2 = \mathrm{Var}(\varepsilon)$.
Two ways to calculate the residual sum of squares RSS:
Calculate the observed fitting errors (residuals) $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$, and then $\mathrm{RSS} = \sum_{i=1}^n e_i^2 \approx 29000.95$.
Use $S_{xx} = 10630.8$, $S_{xy} = -22706$, $S_{yy} = 77498$, and $\mathrm{RSS} = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 77498 - \frac{(-22706)^2}{10630.8} \approx 29000.95$.
The estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2} = \frac{29000.95}{18} \approx 1611.16$ (or $\hat\sigma = \sqrt{1611.16} \approx 40.14$).

45 R code (calculator-type)
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
Sxx <- sum(x * x) - length(x) * (mean(x))^2
Sxy <- sum(x * y) - length(x) * mean(x) * mean(y)
Syy <- sum(y * y) - length(y) * (mean(y))^2
beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)
### Two ways to compute RSS
error <- y - (beta0hat + beta1hat * x)
RSS <- sum(error * error)
### Or
RSS <- Syy - Sxy^2 / Sxx
sigma2hat <- RSS / (length(x) - 2)
c(beta0hat, beta1hat, sigma2hat)

46 (d) Properties of OLS estimators
To derive the statistical inference for the (ordinary) least squares estimators $\hat\beta_1$ and $\hat\beta_0$, we need to find
$E(\hat\beta_i)$
$\mathrm{Var}(\hat\beta_i)$
Then, by the central limit theorem, asymptotically
$$\frac{\hat\beta_i - E(\hat\beta_i)}{\sqrt{\mathrm{Var}(\hat\beta_i)}} \sim N(0, 1)$$

47 Key Steps
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2$, $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n \bar{x} \bar{y}$
Assumption: the $x_i$'s are constants, and the $Y_i$'s are independent with $E(Y_i) = \beta_0 + \beta_1 x_i$ and $\mathrm{Var}(Y_i) = \sigma^2$.
$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^n c_i Y_i$, where $c_i = \frac{x_i - \bar{x}}{S_{xx}}$, satisfying the following three properties:
$\sum_{i=1}^n c_i = 0$, $\sum_{i=1}^n c_i x_i = 1$, $\sum_{i=1}^n c_i^2 = \frac{1}{S_{xx}}$
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \sum_{i=1}^n \left(\frac{1}{n} - c_i \bar{x}\right) y_i$
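The three properties of the weights $c_i$, and the representation of both estimators as linear combinations of the $y_i$, can be checked numerically on the immunization data. This is only a verification sketch; the variable names (`ci`, `b0_weights`) are chosen for this illustration.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)

Sxx <- sum((x - mean(x))^2)
ci <- (x - mean(x)) / Sxx           # weights c_i

# The three properties: sum c_i = 0, sum c_i x_i = 1, sum c_i^2 = 1 / Sxx
print(c(sum(ci), sum(ci * x), sum(ci^2) * Sxx))

# Both estimators are linear combinations of the y_i
beta1hat <- sum(ci * y)
b0_weights <- 1 / n - ci * mean(x)
beta0hat <- sum(b0_weights * y)

# Compare with lm()
print(c(beta0hat, beta1hat))
print(coef(lm(y ~ x)))
```

The first printed vector should be (0, 1, 1) up to floating-point error, and the hand-computed coefficients should match `lm()`.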

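48 (d) Properties of OLS
Unbiased: $E(\hat\beta_1) = \beta_1$, $E(\hat\beta_0) = \beta_0$
Variance: $\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}$, $\mathrm{Var}(\hat\beta_0) = \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$, where $S_{xx} = \sum (x_i - \bar{x})^2$
Note that they are correlated: $\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\frac{\sigma^2 \bar{x}}{S_{xx}}$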

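49 CI and Tests
Since $\sigma^2$ is unknown, consider $\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2}$, and thus
$\mathrm{se}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}}$, $\mathrm{se}(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$.
Then $\frac{\hat\beta_1 - \beta_1}{\mathrm{se}(\hat\beta_1)}$ and $\frac{\hat\beta_0 - \beta_0}{\mathrm{se}(\hat\beta_0)}$ have t-distributions with n-2 degrees of freedom.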

50 (d1) Inference on $\beta_1$
When testing $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$, the test statistic is
$$T_{obs} = \frac{\hat\beta_1}{\mathrm{se}(\hat\beta_1)} = \frac{\hat\beta_1}{\hat\sigma/\sqrt{S_{xx}}}$$
and we reject $H_0$ if $|T_{obs}| > t_{\alpha/2, n-2}$.
A $1 - \alpha$ confidence interval on $\beta_1$ is
$$\hat\beta_1 \pm t_{\alpha/2, n-2} \frac{\hat\sigma}{\sqrt{S_{xx}}}$$
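The test statistic and confidence interval for the slope can be computed "by hand" in R and compared with what `summary()` and `confint()` report for the same fit. This sketch uses the immunization data and $\alpha = 0.05$; the variable names are chosen for this illustration.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)
fit <- lm(y ~ x)

Sxx <- sum((x - mean(x))^2)
sigmahat <- sqrt(sum(resid(fit)^2) / (n - 2))   # sqrt(RSS / (n - 2))
beta1hat <- unname(coef(fit)["x"])

# Test statistic for H0: beta1 = 0 and the two-sided critical value
se_beta1 <- sigmahat / sqrt(Sxx)
Tobs <- beta1hat / se_beta1
tcrit <- qt(1 - 0.05 / 2, df = n - 2)
reject <- abs(Tobs) > tcrit

# 95% confidence interval on beta1
ci_beta1 <- beta1hat + c(-1, 1) * tcrit * se_beta1

print(c(Tobs = Tobs, tcrit = tcrit))
print(reject)
print(ci_beta1)
# Built-in equivalents for comparison
print(summary(fit)$coefficients["x", ])
print(confint(fit)["x", ])
```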

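51 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$.
Test $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$ at the $\alpha = 5\%$ level.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.14$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$.
$T_{obs} = \frac{\hat\beta_1}{\hat\sigma/\sqrt{S_{xx}}} = \frac{-2.1359}{40.14/\sqrt{10630.8}} = \frac{-2.1359}{0.3893} \approx -5.49$. Since $|T_{obs}| > 2.101$, we reject $H_0$.]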

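52 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$.
Find a 95% confidence interval on $\beta_1$.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.14$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$.
So $\hat\beta_1 \pm t_{\alpha/2, n-2} \frac{\hat\sigma}{\sqrt{S_{xx}}} = -2.1359 \pm 2.101 \times 0.3893 = -2.1359 \pm 0.8179 \approx [-2.954, -1.318]$.]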

53 (d2) Inference on $\beta_0$
When testing $H_0: \beta_0 = b_0$ versus $H_1: \beta_0 \ne b_0$, the test statistic is
$$T_{obs} = \frac{\hat\beta_0 - b_0}{\mathrm{se}(\hat\beta_0)} = \frac{\hat\beta_0 - b_0}{\hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$$
and we reject $H_0$ if $|T_{obs}| > t_{\alpha/2, n-2}$.
A $1 - \alpha$ confidence interval on $\beta_0$ is
$$\hat\beta_0 \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

54 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$.
Test $H_0: \beta_0 = b_0$ versus $H_1: \beta_0 \ne b_0$ at the $\alpha = 5\%$ level.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.14$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$.
$T_{obs} = \frac{\hat\beta_0 - b_0}{\hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} = \frac{224.316 - b_0}{31.44}$]

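55 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$.
Find a 95% confidence interval on $\beta_0$.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.14$, $t_{\alpha/2, n-2} = t_{0.025, 18} = 2.101$.
So $\hat\beta_0 \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = 224.316 \pm 2.101 \times 31.44 = 224.316 \pm 66.06 \approx [158.26, 290.37]$.]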

56 (d3) Inference on $\beta_0 + \beta_1 x_{\mathrm{new}}$
For the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ and a given $x_{\mathrm{new}}$, what is the confidence interval for the mean response $E(Y) = \beta_0 + \beta_1 x_{\mathrm{new}}$?
Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{\mathrm{new}} = \sum_{i=1}^n \left[\frac{1}{n} + c_i (x_{\mathrm{new}} - \bar{x})\right] Y_i$
$E(\hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}) = \beta_0 + \beta_1 x_{\mathrm{new}}$
$\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}) = \sigma^2 \left[\frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}\right]$
The $1 - \alpha$ confidence interval on the mean response is
$$\hat\beta_0 + \hat\beta_1 x_{\mathrm{new}} \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{\frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}}$$
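The variance formula on this slide follows from writing the point estimator as a linear combination of independent $Y_i$'s and using the three properties of the weights $c_i = (x_i - \bar{x})/S_{xx}$ from the Key Steps slide. A sketch of the calculation:

```latex
\begin{aligned}
\mathrm{Var}\big(\hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}\big)
  &= \sigma^2 \sum_{i=1}^n \Big[\tfrac{1}{n} + c_i (x_{\mathrm{new}} - \bar{x})\Big]^2 \\
  &= \sigma^2 \Big[\tfrac{1}{n}
      + \tfrac{2 (x_{\mathrm{new}} - \bar{x})}{n} \underbrace{\sum_{i=1}^n c_i}_{=\,0}
      + (x_{\mathrm{new}} - \bar{x})^2 \underbrace{\sum_{i=1}^n c_i^2}_{=\,1/S_{xx}}\Big] \\
  &= \sigma^2 \Big[\tfrac{1}{n} + \tfrac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}\Big].
\end{aligned}
```

The interval is therefore narrowest at $x_{\mathrm{new}} = \bar{x}$ and widens as $x_{\mathrm{new}}$ moves away from the center of the observed $x$ values.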

57 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$.
Find a 95% confidence interval on the mean under-five mortality rate when x = 10.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.14$, $\bar{x} = 77.4$, $t_{0.025, 18} = 2.101$.
$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{\frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}} = 202.96 \pm 58.26 \approx [144.69, 261.22]$]

58 (e) Prediction of a New Observation
For the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:
How to predict a future observation Y corresponding to a given $x_{\mathrm{new}}$?
Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}$
How about a confidence interval on Y? This is often called a prediction interval.

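59 Key Idea
For the future response $Y = \beta_0 + \beta_1 x_{\mathrm{new}} + \varepsilon_{\mathrm{future}}$, consider the estimator $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}$. Then
$E(Y - \hat{Y}) = 0$
$\mathrm{Var}(Y - \hat{Y}) = \mathrm{Var}\left(\beta_0 + \beta_1 x_{\mathrm{new}} + \varepsilon_{\mathrm{future}} - (\hat\beta_0 + \hat\beta_1 x_{\mathrm{new}})\right) = \mathrm{Var}(\varepsilon_{\mathrm{future}}) + \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}) = \sigma^2 + \sigma^2 \left[\frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}\right]$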

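60 Key Idea (Cont.)
For the future response $y = \beta_0 + \beta_1 x_{\mathrm{new}} + \varepsilon$, consider the estimate $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}$. Then
$$\frac{y - \hat{Y}}{\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}}} \sim N(0, 1)$$
So
$$\frac{y - \hat{Y}}{\hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}}} \sim T_{n-2}$$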

61 Prediction Interval
For the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:
How to predict a future observation Y corresponding to a given $x_{\mathrm{new}}$?
Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}$
The $1 - \alpha$ prediction interval is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}}$$
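The confidence interval on the mean response and the prediction interval differ only by the extra "1 +" under the square root. A sketch that computes both by hand for x = 10 and x = 90 on the immunization data and compares them with `predict()`; the variable names are chosen for this illustration.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)
fit <- lm(y ~ x)

Sxx <- sum((x - mean(x))^2)
sigmahat <- sqrt(sum(resid(fit)^2) / (n - 2))
tcrit <- qt(0.975, df = n - 2)

xnew <- c(10, 90)
yhat <- unname(coef(fit)[1]) + unname(coef(fit)[2]) * xnew

# Standard errors: mean response versus a single future observation
se_mean <- sigmahat * sqrt(1 / n + (xnew - mean(x))^2 / Sxx)
se_pred <- sigmahat * sqrt(1 + 1 / n + (xnew - mean(x))^2 / Sxx)

ci_mean <- cbind(fit = yhat, lwr = yhat - tcrit * se_mean, upr = yhat + tcrit * se_mean)
pi_new  <- cbind(fit = yhat, lwr = yhat - tcrit * se_pred, upr = yhat + tcrit * se_pred)
print(ci_mean)
print(pi_new)

# Built-in equivalents for comparison
newdata <- data.frame(x = xnew)
print(predict(fit, newdata, interval = "confidence", level = 0.95))
print(predict(fit, newdata, interval = "prediction", level = 0.95))
```

The prediction interval is always wider than the confidence interval at the same $x_{\mathrm{new}}$, because it also accounts for the noise in the future observation.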

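62 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$.
Find a 95% prediction interval on Y when x = 10.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.14$, $\bar{x} = 77.4$, $t_{0.025, 18} = 2.101$.
$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}} = 202.96 \pm 102.50 \approx [100.46, 305.46]$]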

63 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = 224.316 - 2.1359 x + \varepsilon$.
Find a 95% prediction interval on Y when x = 90.
[Recall $S_{xx} = 10630.8$, $\hat\sigma = 40.14$, $\bar{x} = 77.4$, $t_{0.025, 18} = 2.101$.
$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}} = 32.09 \pm 87.03 \approx [-54.94, 119.12]$]

64 Summary (I): Point Estimation
Assume that we observe $(x_i, y_i)$ for $i = 1, \ldots, n$, and we consider the simple linear regression model
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
where the $\varepsilon_i$'s are iid with mean 0 and variance $\sigma^2$.
Define
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n \bar{x}^2$,
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n \bar{x} \bar{y}$,
$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n \bar{y}^2$.
The least squares estimators are $\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$, $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$.

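65 Summary (II): Estimation of $\sigma^2$ and Inference
The estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2}$, where $\mathrm{RSS} = \sum_{i=1}^n e_i^2$ and the residuals are $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$.
In practice, it is better to use
$$\mathrm{RSS} = \sum_{i=1}^n e_i^2 = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$
$\frac{\hat\beta_1 - \beta_1}{\mathrm{se}(\hat\beta_1)} \sim T_{n-2}$; $\mathrm{se}(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}}$
$\frac{\hat\beta_0 - \beta_0}{\mathrm{se}(\hat\beta_0)} \sim T_{n-2}$; $\mathrm{se}(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$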

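66 Summary (III): Inference
At a given $x_{\mathrm{new}}$, the point estimator of Y is $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{\mathrm{new}}$.
A $1 - \alpha$ confidence interval on the mean response is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{\frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}}$$
A $1 - \alpha$ prediction interval on the future observation is
$$\hat{Y} \pm t_{\alpha/2, n-2}\, \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{\mathrm{new}} - \bar{x})^2}{S_{xx}}} \quad \text{(appropriate for testing data)}$$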

67 Part C
Introduction to R

68 What is R
R is a system for statistical computation and graphics.
It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
Free software
OS: Windows, Unix, Linux
Homepage:

69 Installing R Under Windows
Need Windows OS (32/64 bits)
Go to any CRAN site (see mirrors.html for a list), and follow the instructions
Download R for Windows, R win.exe (Size: 54Mb), double-click on the icon, and follow the instructions to install

70 Data With R
Objects: vector, factor, array, matrix, data.frame, ts, list
Mode (numerical, character, complex, and logical); Length
Read data stored in text (ASCII) files: read.table(), scan(), and read.fwf()
Saving data: write(x, file = "data.txt"); write.table() writes a data.frame to a file
Generating data
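A small round trip shows how these pieces fit together: build a data.frame, save it with `write.table()`, and read it back with `read.table()`. The three rows are taken from the immunization data, and the temporary file path is only for this sketch.

```r
# A small data.frame (first three nations of the immunization data)
dat <- data.frame(x = c(77, 69, 32), y = c(118, 65, 184))

# Inspect mode and length of one column
print(mode(dat$x))
print(length(dat$x))

# Save to a text (ASCII) file, then read it back
path <- tempfile(fileext = ".txt")
write.table(dat, file = path, row.names = FALSE)
dat2 <- read.table(path, header = TRUE)
print(dat2)

# Generating data: a regular sequence and random normal draws
grid <- seq(0, 1, by = 0.25)
noise <- rnorm(5)

unlink(path)   # remove the temporary file
```

`header = TRUE` tells `read.table()` that the first line of the file holds the column names written by `write.table()`.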

71 Linear Regression in R
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
fm1 <- lm(y ~ x)
fm1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x

72 summary(fm1)
> summary(fm1)
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                              e-06 ***
x                                        e-05 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: on 18 degrees of freedom
Multiple R-Squared: , Adjusted R-squared:
F-statistic: 30.1 on 1 and 18 DF, p-value: 3.281e-05

73 Confidence Interval on coefficients
> confint(fm1)
            2.5 % 97.5 %
(Intercept)
x
> confint(fm1, level = 0.99)
            0.5 % 99.5 %
(Intercept)
x

74 Intervals for xnew
> xnew <- data.frame(x = c(10, 90))
## Confidence intervals on the mean response
> predict(fm1, xnew, interval = "confidence", level = 0.95)
fit lwr upr
## Prediction intervals for future observations
> predict(fm1, xnew, interval = "prediction", level = 0.95)
fit lwr upr
