ISyE 6414: Regression Analysis


1 ISyE 6414: Regression Analysis
Lectures: MWF 8:00-10:30, MRDC #2404
Early five-week session: May 14 - June 15 (8:00-9:10; 10-min break; 9:20-10:30)
Instructor: Dr. Yajun Mei ("YA_JUNE MAY"), ymei@isye.gatech.edu; Tel: (O)
Office Hours: MWF 10:30-11:00, after class or Groseclose #343
Course Homepage: Canvas (all HWs due on Canvas); backup:
HW#1 due on Friday, May 18 for on-campus students, and on Wednesday, May 23 for distance learning students

2 My academic pathway
Undergraduate: Math, Peking Univ., BS in 1996
Worked as a computer programmer in a Chinese bank,
Graduate: PhD in Math with a minor in EE, Caltech, (advisor: Dr. Gary Lorden)
Postdoc in biostatistics: FHCRC, Seattle, Sep 2005 (supervisor: Dr. Sarah Holte)
New Research Fellow: SAMSI & Duke Univ., Fall 2005
Joined ISyE of GT since Jan
Currently a tenured associate professor.

3 About this course
Regression Analysis is the key building block for many modern Machine Learning, Artificial Intelligence, and Business Analytics techniques and methods (such as Neural Networks, Deep Learning, Boosting, Random Forests, etc.)
This course aims to help you
Understand its theoretical aspects (HW#1, #2, #4, and a midterm)
Understand its computational aspects (HW#3, and a course project)

4 Organization of the Course
Textbooks (notes/slides provided):
Kutner, Nachtsheim, Neter and Li, Applied Linear Statistical Models, 5th ed.
Faraway, Practical Regression and ANOVA using R (freely downloadable online)
Topics:
Simple Linear Regression (Ch 1-4)
Multiple Linear Regression (Ch 5-11) (2 weeks, Midterm)
Advanced Regression (Ch 13-14) (2 weeks)
Design of Experiments (Ch 13, 14)

5 Organization of the Course
Grading Policy (the past AVG GPA is [3.7, 3.9]):
Class attendance (5%)
Homework (4*10% = 40%): Collaboration is encouraged, but you cannot look at any other solutions before submitting.
One in-class Midterm (25%): 9:15am-10:30am, Friday, May 25 (happy Memorial weekend)
Class project (30%): a team of 2-4 or by yourself. See the handout for possible project topics.
Proposal (1-3 pages): May 30 (Wed)
Presentation file: due 7am on June 13 (Wed) (only for on-campus students, not required for DL students)
Final report: June 15 (Friday)
[Only for the Distance Learning students: two-lecture delay for homework and the class project proposal, and one-week delay for the midterm and the final report.]

6 Part A
Basic background on probability and statistics. We might not discuss this background part in detail, but I have listed some slides here so that you can refresh your memory if necessary.
Three key probability distributions: Binomial, Poisson, and Normal.

7 Probability Review
See Appendix A of our text.
Probability
Discrete Random Variables
Continuous Random Variables
Joint Distribution

8 Probability
Basics of Probability Theory
Random experiments, e.g., flip a fair coin three times and observe Heads or Tails
Sample space: the set of all possible outcomes, e.g., S = {HHH, THH, HTH, HHT, HTT, THT, TTH, TTT}
An event: a subset of the sample space of a random experiment, e.g., observe one head
Union/Intersection/Complement of events; Counting Techniques; Axioms of Probability; Conditional Probability; Independence; Bayes' Theorem

9 Random Variable
A random variable is a function that assigns a real number to each outcome in the sample space of a random experiment.
Example: Let X be the number of heads when flipping a fair coin three times. Rigorously,
w      HHH  HHT  HTH  THH  HTT  THT  TTH  TTT
X(w)   3    2    2    2    1    1    1    0

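10 Discrete Random Variable
X with countably many possible values
Probability mass function: $f(x) = P(X = x)$
Cumulative distribution function: $F(x) = P(X \le x) = \sum_{t \le x} f(t)$
Mean: $\mu = E(X) = \sum_x x f(x)$
Variance: $\sigma^2 = \mathrm{Var}(X) = \sum_x (x - \mu)^2 f(x)$
Standard deviation: $\sigma = \sqrt{\sigma^2}$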

11 Important discrete RVs
Discrete Uniform
Binomial(n, p)
Geometric(p)
Poisson($\lambda$)
What are the mean and Var/SD?

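12 Continuous Random Variable
Probability density function: $f(x) \ge 0$ with $P(a \le X \le b) = \int_a^b f(x)\,dx$
Cumulative distribution function: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$
Mean: $\mu = E(X) = \int x f(x)\,dx$
Variance: $\sigma^2 = \mathrm{Var}(X) = \int (x - \mu)^2 f(x)\,dx$
Standard deviation: $\sigma = \sqrt{\sigma^2}$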

13 Important Continuous RVs
Gamma/Weibull/Lognormal/Beta distribution
What are the mean and Var/SD?

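14 Central Limit Theorem
a. If X is Binomial(n, p), then $Z = \frac{X - np}{\sqrt{np(1-p)}} \approx N(0,1)$ (continuity correction)
b. If $X_1, X_2, \ldots, X_n$ are iid with mean $\mu$ and variance $\sigma^2$, then $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \approx N(0,1)$ (or, equivalently, $Z = \frac{X_1 + \cdots + X_n - n\mu}{\sqrt{n}\,\sigma} \approx N(0,1)$)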
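
As a rough numerical illustration of part (a), the sketch below compares an exact binomial probability with its normal approximation, with and without the continuity correction. The values of `n`, `p`, and `k` are illustrative assumptions, not taken from the slides.

```r
# Normal approximation to Binomial(n, p); n, p, k are assumed example values.
n <- 50
p <- 0.3
k <- 18

exact <- pbinom(k, size = n, prob = p)       # exact P(X <= k)
mu <- n * p                                  # binomial mean
sigma <- sqrt(n * p * (1 - p))               # binomial standard deviation

approx_plain <- pnorm((k - mu) / sigma)      # CLT without continuity correction
approx_cc <- pnorm((k + 0.5 - mu) / sigma)   # CLT with continuity correction

round(c(exact = exact, plain = approx_plain, corrected = approx_cc), 4)
```

For these values the corrected approximation should sit noticeably closer to the exact probability than the uncorrected one.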

15 Statistical Review
Population parameter vs. Sample statistic
Point Estimation
Confidence Interval
Hypothesis Testing

16 Population Parameter vs Sample Statistic
Population: a set of entities concerning which statistical inferences are to be drawn. Typically the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible.
Sample: a subset of observed objects from the population. The sample represents a subset of manageable size (possibly massive).
Parameter: a (typically unobservable) quantity that indexes a family of probability distributions. It can be regarded as a numerical characteristic of a population or a model.
Statistic: a measure of some attribute of a sample. It is calculated by applying a function to the values of the items comprising the sample.

17 Important Sample statistics
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
Sample standard deviation: $s = \sqrt{s^2}$
Sample range: $r = \max(x_i) - \min(x_i)$
Quartiles:
The lower quartile: 25% of the data is less than $q_1$
The median: 50% of the data is less than $q_2$
The upper quartile: 75% of the data is less than $q_3$
As a measure of variability, the interquartile range (IQR) is defined as IQR $= q_3 - q_1$
Plots: Stem-and-Leaf Diagram/Plot, Histogram, Box Plots, Probability Plots (or Normal QQ plots)
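
A minimal sketch of these statistics in R, using the under-five mortality values from the course's later R example as sample data. Note that `quantile()` uses R's default interpolation rule (type 7), which is an assumption here and may differ slightly from a textbook's quartile definition.

```r
# Under-five mortality values (y) from the course's R example
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

y_mean <- mean(y)                        # sample mean
y_var <- var(y)                          # sample variance (divisor n - 1)
y_sd <- sd(y)                            # sample standard deviation
y_range <- max(y) - min(y)               # r = max(x_i) - min(x_i)
q <- quantile(y, c(0.25, 0.50, 0.75))    # q1, median, q3 (R default: type 7)
y_iqr <- unname(q[3] - q[1])             # IQR = q3 - q1

c(mean = y_mean, var = y_var, sd = y_sd, range = y_range, IQR = y_iqr)
```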

18 Normal Distribution
Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.
Sample mean: $\bar{X} \sim N(\mu, \sigma^2/n)$, or $\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \sim N(0,1)$
Sample variance: $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$ satisfies $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$ (Chi-square distribution)

19 Normal Distribution (Cont.)
Assume $X_1, X_2, \ldots, X_n$ are iid normal with mean $\mu$ and variance $\sigma^2$.
The sample mean $\bar{X}$ is independent of the sample variance $S^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$. Moreover,
$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{N(0,1)}{\sqrt{\chi^2_{n-1}/(n-1)}}$ has a t-distribution with df = n-1.
[In many cases, $\frac{\hat\theta - \theta}{s.e.(\hat\theta)}$ has a t-distribution.]
In Appendix B on page 1317, for the t-distribution, the critical point is $t_{\alpha, df} = t(A; df)$ with $A = 1 - \alpha$, so t_{ ,18} = t( ; 18) =
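
Instead of a printed table, the critical point can be looked up with R's `qt()`. The sketch below assumes $\alpha = 0.05$ (two-sided) and df = 18, matching the n - 2 = 18 degrees of freedom of the later regression example; both values are assumptions for illustration.

```r
# Upper critical points of the t-distribution: t_{a, df} = t(1 - a; df)
alpha <- 0.05   # assumed significance level
df <- 18        # assumed degrees of freedom (n - 2 with n = 20)

t_two_sided <- qt(1 - alpha / 2, df = df)   # t_{alpha/2, df}, used for two-sided CIs and tests
t_one_sided <- qt(1 - alpha, df = df)       # t_{alpha, df}, used for one-sided tests

c(two_sided = t_two_sided, one_sided = t_one_sided)
```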

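20 Point Estimation
The bias of the estimator $\hat\theta$ is $\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta$. An estimator is unbiased if the bias is 0.
The variance of the estimator $\hat\theta$ is $\mathrm{Var}(\hat\theta) = E[(\hat\theta - E(\hat\theta))^2]$.
The mean square error of the estimator $\hat\theta$ is $\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta)^2 = \mathrm{Var}(\hat\theta) + \mathrm{Bias}(\hat\theta)^2$
The standard error of $\hat\theta$ is s.e. $= \sqrt{\mathrm{Var}(\hat\theta)}$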

21 Methods of Point Estimation
There are three methodologies to create point estimates of a population parameter.
A. Method of moments (MOM)
B. Method of maximum likelihood (MLE)
C. Bayesian estimation of parameters

22 MOM & MLE
The method of moments (MOM) estimators are found by equating the population moments to the sample moments and solving the resulting equations, e.g., $h(\theta) = E(X) = \bar{X} = \frac{X_1 + \cdots + X_n}{n}$
The maximum likelihood estimator (MLE) is the value of $\theta$ that maximizes the likelihood function $L(\theta) = f(x_1) f(x_2) \cdots f(x_n)$
If the domain of f(x) does not depend on $\theta$, solving $\frac{d \log L(\theta)}{d\theta} = 0$ yields the MLE.
Otherwise, plot $L(\theta)$ and find the maximum.
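
To make the two methods concrete, here is a small sketch on simulated Poisson data, where the MOM and MLE estimators of $\lambda$ coincide (both equal the sample mean). The seed, sample size, true $\lambda = 3$, and the search interval are all assumptions chosen for illustration.

```r
set.seed(6414)
x <- rpois(200, lambda = 3)   # simulated data; lambda = 3 is an assumed true value

# Log-likelihood: log L(lambda) = sum_i log f(x_i; lambda)
loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

# Maximize the log-likelihood numerically over an assumed search interval
fit <- optimize(loglik, interval = c(0.01, 20), maximum = TRUE, tol = 1e-8)
lambda_mle <- fit$maximum

# Method of moments: solve E(X) = lambda = sample mean
lambda_mom <- mean(x)

c(mle = lambda_mle, mom = lambda_mom)
```

Numerical maximization is used here only to mirror the "maximize L" definition; for the Poisson model, setting the derivative of the log-likelihood to zero gives the sample mean in closed form.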

23 Confidence Interval & Hypothesis Testing
One sample:
1. Normal mean with known variance (one-sided)
2. Normal mean with unknown variance
3. Normal variance
4. Proportion of Binomial Distribution
Two samples: inference on mean difference
5. Two independent normal distributions: variances known
6. Two independent normal distributions: unknown and equal variances
7. Two independent normal distributions: unknown and unequal variances
8. Paired Samples

24 Part B
Overview of Supervised Learning
Simple Linear Regression

25 Overview of Supervised Learning
Supervised Learning (directed data mining, learning with a teacher):
The observed data are of the form $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$, where the variables can be split into two groups:
independent variables (explanatory variables, inputs, predictors) $X = (X_1, \ldots, X_p)$, and
one (or more) dependent variable (output, response) Y.
The objective is to predict Y given values of the input X.

26 Supervised Learning
Observed Data (Training Data): $(Y_i, X_{i1}, \ldots, X_{ip})$ for $i = 1, \ldots, n$
Objective: find a function $f(x_{new}) = f(x_1, \ldots, x_p)$ that can predict Y well for any given input $x_{new} = (x_1, \ldots, x_p)$.
Deterministic relationship? (many classification tasks in machine learning)

27 The Additive Error Model
Key Statistical Idea: Observed Data = True Value + Noise
For the observed training data, $Y_i = f(x_{i1}, \ldots, x_{ip}) + \varepsilon_i$ for $i = 1, \ldots, n$, where the errors $\varepsilon_i$'s are iid with mean 0 and are independent of the X's.
Find the function $f(x_1, \ldots, x_p)$ or find its approximation! (Generative vs. Predictive models)
The simplest case: when $p = 1$, $f(x) = \beta_0 + \beta_1 x$
Simple linear regression: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

28 The First Main Topic
Simple linear regression

29 Empirical Models: Regression
Many engineering and scientific problems are concerned with determining a relationship between a set of variables.
For example: Y = first-year college GPA; X = high school GPA
Or Y = mortality rate; X = immunization rate.
Knowledge of such a relationship would enable us to predict the output Y.
Regression analysis is a statistical technique that is very useful for these types of problems, as it can be used to build a model to predict Y at a given X value.

30 Example: Immunization and Mortality
Suppose one wants to investigate the relationship between the percentage of children who have been immunized against the infectious diseases diphtheria, pertussis, and tetanus (DPT) in a given country and the corresponding mortality rate for children under five years of age in that country.
The UN Children's Fund (UNICEF) considers the under-five mortality rate to be one of the most important indicators of the level of well-being for children.

31 Data
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
Nation          X    Y     Nation    X    Y     Nation    X    Y
Bolivia         77   118   Ethiopia  13   208   Mexico    91   33
Brazil          69   65    Finland   95   7     Poland    98   16
Cambodia        32   184   France    95   9     Russian   73   32
Canada          85   8     Greece    54   9     Senegal   47   145
China           94   43    India     89   124   Turkey    76   87
Czech Republic  99   12    Italy     95   10    UK        90   9
Egypt           89   55    Japan     87   6

32 Look at Scatter Plot
The plot shows that the mortality rate tends to decrease as the percentage of children immunized increases.
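
The original figure did not survive transcription, so the following is a sketch of how such a scatter plot could be drawn from the course's data vectors. The axis labels and title are my own wording, and the sample correlation is added only as a numerical check on the direction of the trend.

```r
# Data from the course's R example: x = % immunized, y = under-five mortality
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

# Scatter plot of mortality against immunization (labels are illustrative)
plot(x, y,
     xlab = "Percentage immunized against DPT",
     ylab = "Under-five mortality per 1000 live births",
     main = "Immunization vs. mortality, 1992")

r <- cor(x, y)   # Pearson correlation; a negative value indicates a decreasing trend
r
```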

33 Question
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
Question: Are Y and X related (associated), and how?
Does better immunization improve the mortality rate?
Can we use the data to develop a model for predicting the under-five mortality rate from the percentage of children immunized against DPT?

34 Linear Regression
It is interesting both theoretically, because of the elegance of the underlying theory, and from an applied point of view, because of the wide variety of uses.
Fit a model for a dependent variable as a function of one or more independent variables
We will talk about
Building models
Assessing fit and reliability
Drawing conclusions

35 A Simple Linear Regression
We are interested in developing a linear equation that best summarizes the relationship in a sample between the response variable (Y) and the predictor variable (or independent variable) x:
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
where the $\varepsilon_i$'s are independent with mean 0 and variance $\sigma^2$.
The equation is also used to predict Y from X.

36 (a) How to estimate the $\beta$'s
Observe n data points $(Y_i, x_i)$, and assume $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$'s are independent with mean 0 and variance $\sigma^2$.
How to estimate the $\beta$'s?

37 Method of Least Squares
The (ordinary) least squares estimator: choose $\beta_0$ and $\beta_1$ to minimize the residual sum of squares (RSS)
$\mathrm{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

38 Why Least Squares?
It gives the Maximum Likelihood Estimators (MLE) of $\beta_0$ and $\beta_1$ when the errors $\varepsilon_i$'s are iid $N(0, \sigma^2)$.
It leads to the best linear unbiased estimators (BLUE) of $\beta_0$ and $\beta_1$, whether or not the errors $\varepsilon_i$'s are normally distributed.
[A linear estimator is of the form $\sum_{i=1}^{n} c_i Y_i$. The meaning of BLUE for $\beta_1$: minimize $\mathrm{Var}(\sum c_i Y_i) = \sigma^2 \sum c_i^2$ subject to $E(\sum c_i Y_i) = \sum c_i (\beta_0 + \beta_1 x_i) = \beta_1$ for all $\beta_0$ and $\beta_1$, i.e., subject to $\sum c_i = 0$ and $\sum c_i x_i = 1$.]

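39 Method of Least Squares
When minimizing the residual sum of squares (RSS), the solutions are
$\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$, $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$
where $S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$ and $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$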
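
For readers who want the intermediate step, here is a sketch of the standard derivation behind these solutions: set both partial derivatives of RSS to zero and solve the resulting normal equations. The notation follows the slides; the derivation itself is the usual textbook argument, not transcribed from the lecture.

```latex
\begin{align*}
\mathrm{RSS}(\beta_0,\beta_1) &= \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta_0}
  &= -2\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0
  \;\Longrightarrow\; \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} \\
\frac{\partial\,\mathrm{RSS}}{\partial \beta_1}
  &= -2\sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \\
\text{substituting } \hat\beta_0:\quad
  \sum x_i y_i - n\bar{x}\bar{y}
  &= \hat\beta_1 \Big(\sum x_i^2 - n\bar{x}^2\Big)
  \;\Longrightarrow\; \hat\beta_1 = \frac{S_{xy}}{S_{xx}}
\end{align*}
```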

40 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
(Same 20-nation data table as on slide 31.)

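41 Answer
For our data, $n = 20$, $\bar{x} = $ , $\bar{y} = 59$, $\sum x_i^2 = $ , $\sum x_i y_i = $
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2 = $
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y} = $
$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = $ ; $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = $
Thus, the fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$ or $E(Y) = \hat\beta_0 + \hat\beta_1 x$.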

42 (b) Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$
Estimate the mean under-five mortality rate per 1000 live births when x = 10.
Repeat the question when x = 90. [ ; ]
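
A sketch of how the two estimates can be computed directly from the least squares formulas, using the course's data vectors. The names `x_new` and `y_hat` are my own.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)

# Least squares estimates: beta1hat = Sxy / Sxx, beta0hat = ybar - beta1hat * xbar
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)

# Estimated mean response at the two immunization levels asked about
x_new <- c(10, 90)
y_hat <- beta0hat + beta1hat * x_new

data.frame(x_new = x_new, y_hat = y_hat)
```

Since x = 10 lies below the smallest observed immunization rate (13), the first estimate is an extrapolation and should be read with extra caution.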

43 (c) How to estimate $\sigma^2$?
Recall that the model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$'s are iid with mean 0 and variance $\sigma^2$.
We have the estimators $\hat\beta_0$ and $\hat\beta_1$; how do we estimate the third parameter, $\sigma^2$?
Answer: It is natural to use the observed fitting errors $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$ and the residual sum of squares $\mathrm{RSS} = \sum_{i=1}^{n} e_i^2$
The estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2}$ [and, for normal errors, $\hat\sigma^2 \sim \frac{\sigma^2 \chi^2_{n-2}}{n-2}$]
In practice, it is easier to compute RSS as follows: $\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$

44 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
In our example, the fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$. Find an estimate of $\sigma^2 = \mathrm{Var}(\varepsilon)$.
Two ways to calculate the residual sum of squares RSS:
Calculate the observed fitting errors (residuals) $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$ and then $\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = $
Use $S_{xx} = $ , $S_{xy} = -22706$, $S_{yy} = 77498$, and $\mathrm{RSS} = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = $
The estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2} = $ (or $\hat\sigma = $ ).

45 R code (calculator-type)
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
Sxx <- sum(x * x) - length(x) * (mean(x))^2
Sxy <- sum(x * y) - length(x) * mean(x) * mean(y)
Syy <- sum(y * y) - length(y) * (mean(y))^2
beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)
### Two ways to compute RSS
error <- y - (beta0hat + beta1hat * x)
RSS <- sum(error * error)
### Or
RSS <- Syy - Sxy^2 / Sxx
sigma2hat <- RSS / (length(x) - 2)
c(beta0hat, beta1hat, sigma2hat)

46 (d) Properties of OLS estimators
To derive the statistical inference for the (ordinary) least squares estimators $\hat\beta_1$ and $\hat\beta_0$, we need to find
$E(\hat\beta_i)$
$\mathrm{Var}(\hat\beta_i)$
Then by the central limit theorem, asymptotically $\frac{\hat\beta_i - E(\hat\beta_i)}{\sqrt{\mathrm{Var}(\hat\beta_i)}} \sim N(0,1)$

47 Key Steps
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$, $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$
Assumption: the $x_i$'s are constants, and the $Y_i$'s are independent with $E(Y_i) = \beta_0 + \beta_1 x_i$ and $\mathrm{Var}(Y_i) = \sigma^2$.
$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^{n} c_i Y_i$, where $c_i = \frac{x_i - \bar{x}}{S_{xx}}$, satisfying the following three properties:
$\sum_{i=1}^{n} c_i = 0$
$\sum_{i=1}^{n} c_i x_i = 1$
$\sum_{i=1}^{n} c_i^2 = \frac{1}{S_{xx}}$
$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x} = \sum_{i=1}^{n} \left(\frac{1}{n} - c_i \bar{x}\right) y_i$
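
The three properties of the weights $c_i$ can be checked numerically. The sketch below does so on the course's data; the variable names `ci` and `di` are my own shorthand for the weights of $\hat\beta_1$ and $\hat\beta_0$.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)

Sxx <- sum((x - mean(x))^2)
ci <- (x - mean(x)) / Sxx      # weights so that beta1hat = sum(ci * y)
di <- 1 / n - ci * mean(x)     # weights so that beta0hat = sum(di * y)

# The three properties: sum(ci) = 0, sum(ci * x) = 1, sum(ci^2) = 1 / Sxx
props <- c(sum_ci = sum(ci), sum_ci_x = sum(ci * x), sum_ci2_times_Sxx = sum(ci^2) * Sxx)

beta1hat <- sum(ci * y)        # both estimators are linear in the y's
beta0hat <- sum(di * y)

props
c(beta0hat = beta0hat, beta1hat = beta1hat)
```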

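48 (d) Properties of OLS
Unbiased: $E(\hat\beta_0) = \beta_0$, $E(\hat\beta_1) = \beta_1$
Variance: $\mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}$, $\mathrm{Var}(\hat\beta_0) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$, where $S_{xx} = \sum (x_i - \bar{x})^2$
Note that they are correlated: $\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\frac{\sigma^2 \bar{x}}{S_{xx}}$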

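49 CI and Tests
Since $\sigma^2$ is unknown, consider $\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2}$, and thus $se(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}}$ and $se(\hat\beta_0) = \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$
Then $\frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)}$ and $\frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)}$ have a t-distribution with n-2 degrees of freedom.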

50 (d1) Inference on $\beta_1$
When testing $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$, the test statistic is
$T_{obs} = \frac{\hat\beta_1}{se(\hat\beta_1)} = \frac{\hat\beta_1}{\hat\sigma/\sqrt{S_{xx}}}$
and we reject $H_0$ if $|T_{obs}| > t_{\alpha/2,\,n-2}$
A $1 - \alpha$ confidence interval on $\beta_1$ is $\hat\beta_1 \pm t_{\alpha/2,\,n-2}\, \frac{\hat\sigma}{\sqrt{S_{xx}}}$
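
A sketch of this test and interval computed step by step on the course's data, assuming $\alpha = 0.05$. Variable names such as `se_beta1` and `ci_beta1` are my own.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)

beta1hat <- Sxy / Sxx
sigmahat <- sqrt((Syy - Sxy^2 / Sxx) / (n - 2))   # sqrt(RSS / (n - 2))
se_beta1 <- sigmahat / sqrt(Sxx)                   # standard error of beta1hat

alpha <- 0.05                                      # assumed significance level
t_obs <- beta1hat / se_beta1                       # test statistic for H0: beta1 = 0
t_crit <- qt(1 - alpha / 2, df = n - 2)            # t_{alpha/2, n-2}
reject_h0 <- abs(t_obs) > t_crit
ci_beta1 <- beta1hat + c(-1, 1) * t_crit * se_beta1

list(t_obs = t_obs, t_crit = t_crit, reject_h0 = reject_h0, ci_beta1 = ci_beta1)
```

The squared test statistic should agree with the F-statistic that `summary(lm(y ~ x))` reports for the same data.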

51 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$
Test $H_0: \beta_1 = 0$ versus $H_1: \beta_1 \ne 0$ at the $\alpha = 0.05$ level.
[Recall $S_{xx} = $ , $\hat\sigma = $ , $t_{\alpha/2,\,n-2} = t_{0.025,\,18} = $ , and $T_{obs} = \frac{\hat\beta_1}{\hat\sigma/\sqrt{S_{xx}}} = $ ]

52 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$
Find a 95% confidence interval on $\beta_1$.
[Recall $S_{xx} = $ , $\hat\sigma = $ , $t_{\alpha/2,\,n-2} = t_{0.025,\,18} = $ . So $\hat\beta_1 \pm t_{\alpha/2,\,n-2}\, \frac{\hat\sigma}{\sqrt{S_{xx}}} = $ $\pm$ $= [ , ]$]

53 (d2) Inference on $\beta_0$
When testing $H_0: \beta_0 = b_0$ versus $H_1: \beta_0 \ne b_0$, the test statistic is
$T_{obs} = \frac{\hat\beta_0 - b_0}{se(\hat\beta_0)} = \frac{\hat\beta_0 - b_0}{\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}}$
and we reject $H_0$ if $|T_{obs}| > t_{\alpha/2,\,n-2}$
A $1 - \alpha$ confidence interval on $\beta_0$ is $\hat\beta_0 \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$
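
The same steps for the intercept, again as a sketch on the course's data with $\alpha = 0.05$ assumed. The null value `b0_null = 0` is a hypothetical choice for illustration (the value used in the lecture example did not survive transcription); it is also the null that `summary(lm())` tests by default.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)

beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)
sigmahat <- sqrt((Syy - Sxy^2 / Sxx) / (n - 2))
se_beta0 <- sigmahat * sqrt(1 / n + mean(x)^2 / Sxx)   # standard error of beta0hat

alpha <- 0.05      # assumed significance level
b0_null <- 0       # hypothetical null value b0
t_obs <- (beta0hat - b0_null) / se_beta0
t_crit <- qt(1 - alpha / 2, df = n - 2)
reject_h0 <- abs(t_obs) > t_crit
ci_beta0 <- beta0hat + c(-1, 1) * t_crit * se_beta0

list(t_obs = t_obs, reject_h0 = reject_h0, ci_beta0 = ci_beta0)
```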

54 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$
Test $H_0: \beta_0 = $ versus $H_1: \beta_0 \ne $ at the $\alpha = 0.05$ level.
[Recall $S_{xx} = $ , $\hat\sigma = $ , $t_{\alpha/2,\,n-2} = t_{0.025,\,18} = $ , and $T_{obs} = \frac{\hat\beta_0 - b_0}{\hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}} = $ ]

55 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$
Find a 95% confidence interval on $\beta_0$.
[Recall $S_{xx} = $ , $\hat\sigma = $ , $t_{\alpha/2,\,n-2} = t_{0.025,\,18} = $ . So $\hat\beta_0 \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}} = $ $\pm$ $= [ , ]$.]

56 (d3) Inference on $\beta_0 + \beta_1 x_{new}$
For the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
For a given $x_{new}$, what is the confidence interval for the mean response $E(Y) = \beta_0 + \beta_1 x_{new}$?
Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new} = \sum_{i=1}^{n} \left[\frac{1}{n} + c_i (x_{new} - \bar{x})\right] Y_i$
$E(\hat\beta_0 + \hat\beta_1 x_{new}) = \beta_0 + \beta_1 x_{new}$
$\mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_{new}) = \sigma^2\left[\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}\right]$
The $1 - \alpha$ confidence interval on the mean response is $\hat\beta_0 + \hat\beta_1 x_{new} \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$
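
A sketch of the confidence interval on the mean response, computed from the formula above at x = 10 and x = 90 on the course's data (95% level assumed). The names `x_new`, `half_width`, `lower`, and `upper` are my own.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)
beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)
sigmahat <- sqrt((Syy - Sxy^2 / Sxx) / (n - 2))
t_crit <- qt(0.975, df = n - 2)        # 95% level, two-sided

x_new <- c(10, 90)
y_hat <- beta0hat + beta1hat * x_new   # point estimate of the mean response

# Half-width: t * sigmahat * sqrt(1/n + (x_new - xbar)^2 / Sxx)
half_width <- t_crit * sigmahat * sqrt(1 / n + (x_new - mean(x))^2 / Sxx)
lower <- y_hat - half_width
upper <- y_hat + half_width

data.frame(x_new, fit = y_hat, lwr = lower, upr = upper)
```

The interval is narrowest near $\bar{x}$ and widens as $x_{new}$ moves away from it, which is why the interval at x = 10 comes out wider than the one at x = 90.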

57 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$
Find a 95% confidence interval on the mean under-five mortality rate when x = 10
[Recall $S_{xx} = $ , $\hat\sigma = $ , $\bar{x} = $ , $t_{0.025,\,18} = $ . So $\hat{Y} \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}} = $ $\pm$ $= [ , ]$]

58 (e) Prediction of a New Observation
For the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
How to predict a future observation Y corresponding to a given $x_{new}$?
Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$
How about a confidence interval on Y? This is often called a prediction interval.

59 Key Idea
For the future response $Y = \beta_0 + \beta_1 x_{new} + \varepsilon_{future}$
Consider the estimator $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$. Then
$E(Y - \hat{Y}) = 0$
$\mathrm{Var}(Y - \hat{Y}) = \mathrm{Var}(\beta_0 + \beta_1 x_{new} + \varepsilon_{future} - \hat\beta_0 - \hat\beta_1 x_{new}) = \mathrm{Var}(\varepsilon_{future}) + \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_{new}) = \sigma^2 + \sigma^2\left[\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}\right]$

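60 Key Idea (Cont.)
For the future response $y = \beta_0 + \beta_1 x_{new} + \varepsilon$
Consider the estimate $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$. Then
$\frac{y - \hat{Y}}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}} \sim N(0,1)$
So $\frac{y - \hat{Y}}{\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}} \sim T_{n-2}$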

61 Prediction Interval
For the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
How to predict a future observation Y corresponding to a given $x_{new}$?
Point estimator: $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$
The $1 - \alpha$ prediction interval is $\hat{Y} \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$
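
A sketch of the prediction interval from this formula at x = 10 and x = 90 on the course's data (95% level assumed), alongside the half-width of the mean-response interval for comparison. The names `pi_half`, `ci_half`, `pi_lower`, and `pi_upper` are my own.

```r
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
n <- length(x)

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Syy <- sum((y - mean(y))^2)
beta1hat <- Sxy / Sxx
beta0hat <- mean(y) - beta1hat * mean(x)
sigmahat <- sqrt((Syy - Sxy^2 / Sxx) / (n - 2))
t_crit <- qt(0.975, df = n - 2)        # 95% level, two-sided

x_new <- c(10, 90)
y_hat <- beta0hat + beta1hat * x_new

# Prediction interval adds the variance of the future error: note the extra "1 +"
leverage <- 1 / n + (x_new - mean(x))^2 / Sxx
pi_half <- t_crit * sigmahat * sqrt(1 + leverage)
ci_half <- t_crit * sigmahat * sqrt(leverage)   # mean-response interval, for comparison
pi_lower <- y_hat - pi_half
pi_upper <- y_hat + pi_half

data.frame(x_new, fit = y_hat, lwr = pi_lower, upr = pi_upper, ci_half, pi_half)
```

The formula does not know that mortality rates cannot be negative, so a lower limit below zero can occur and would need to be truncated or handled by a different model in practice.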

62 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$
Find a 95% prediction interval on Y when x = 10
[Recall $S_{xx} = $ , $\hat\sigma = $ , $\bar{x} = $ , $t_{0.025,\,18} = $ . So $\hat{Y} \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}} = $ $\pm$ $= [ , ]$]

63 Example (Cont.)
X = Percentage of children immunized against DPT; Y = under-five mortality rate per 1000 live births, in 1992
The fitted (simple linear regression) model is $Y = \hat\beta_0 + \hat\beta_1 x + \varepsilon$
Find a 95% prediction interval on Y when x = 90
[Recall $S_{xx} = $ , $\hat\sigma = $ , $\bar{x} = $ , $t_{0.025,\,18} = $ . So $\hat{Y} \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}} = $ $\pm$ $= [ , ]$]

64 Summary (I): Point Estimation
Assume that we observe $(x_i, y_i)$ for $i = 1, \ldots, n$, and we consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$'s are iid with mean 0 and variance $\sigma^2$.
Define
$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$
$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$
The least squares estimators are $\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$, $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$

65 Summary (II): Estimation of $\sigma^2$ and Inference
The estimator of $\sigma^2$ is $\hat\sigma^2 = \frac{\mathrm{RSS}}{n-2}$, where $\mathrm{RSS} = \sum_{i=1}^{n} e_i^2$ and the residuals are $e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i)$.
In practice, it is better to use $\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = S_{yy} - \hat\beta_1 S_{xy} = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$
$\frac{\hat\beta_1 - \beta_1}{se(\hat\beta_1)} \sim T_{n-2}$; $se(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}}$
$\frac{\hat\beta_0 - \beta_0}{se(\hat\beta_0)} \sim T_{n-2}$; $se(\hat\beta_0) = \hat\sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$

66 Summary (III): Inference
At a given $x_{new}$, the point estimator of Y is $\hat{Y} = \hat\beta_0 + \hat\beta_1 x_{new}$
A $1 - \alpha$ confidence interval on the mean response $E(Y)$ is $\hat{Y} \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$
A $1 - \alpha$ prediction interval on the future observation is $\hat{Y} \pm t_{\alpha/2,\,n-2}\, \hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}$ (appropriate for testing data)

67 Part C
Introduction to R

68 What is R
R is a system for statistical computation and graphics
It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files
Free software
OS: Windows, Unix, Linux
Homepage:

69 Installing R Under Windows
Need Windows OS (32/64 bits)
Go to any CRAN site (see mirrors.html for a list) and follow the instructions
Download R for Windows, R win.exe (Size: 54Mb), double-click on the icon, and follow the instructions to install

70 Data With R
Objects: vector, factor, array, matrix, data.frame, ts, list
Mode (numerical, character, complex, and logical); Length
Read data stored in text (ASCII) files: read.table(), scan(), and read.fwf()
Saving data: write(x, file = "data.txt"); write.table() writes a data.frame to a file
Generating data
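
A minimal sketch of saving and re-reading a data.frame with the functions named above. The three rows are taken from the course data for illustration, and the temporary file path is an assumption so that the example does not write into the working directory.

```r
# A small data.frame built from the first three rows of the course data
df <- data.frame(x = c(77, 69, 32), y = c(118, 65, 184))

# Save to a text (ASCII) file; tempfile() is used here instead of a fixed path
path <- tempfile(fileext = ".txt")
write.table(df, file = path, row.names = FALSE)

# Read it back; header = TRUE because write.table() wrote the column names
df2 <- read.table(path, header = TRUE)

str(df2)      # inspect the modes of the columns
length(df2)   # for a data.frame, length() is the number of columns
```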

71 Linear Regression in R
x <- c(77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90)
y <- c(118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9)
fm1 <- lm(y ~ x)
fm1
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x

72 summary(fm1)
> summary(fm1)
Call:
lm(formula = y ~ x)
Residuals:
    Min      1Q  Median      3Q     Max
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 e-06 ***
x                                           e-05 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error:  on 18 degrees of freedom
Multiple R-Squared:  , Adjusted R-squared:
F-statistic: 30.1 on 1 and 18 DF,  p-value: 3.281e-05

73 Confidence Interval on coefficients
> confint(fm1)
            2.5 %  97.5 %
(Intercept)
x
> confint(fm1, level = 0.99)
            0.5 %  99.5 %
(Intercept)
x

74 Intervals for xnew
> xnew <- data.frame(x = c(10, 90))
## Confidence intervals on the mean response
> predict(fm1, xnew, interval = "confidence", level = 0.95)
fit lwr upr
## Prediction intervals for future observations
> predict(fm1, xnew, interval = "prediction", level = 0.95)
fit lwr upr


More information

Stat 139 Homework 3 Solutions, Spring 2015

Stat 139 Homework 3 Solutions, Spring 2015 Stat 39 Homework 3 Solutions, Spring 05 Problem. Let i Nµ, σ ) for i,..., n, and j Nµ, σ ) for j,..., n. Also, assume that all observations are independent from each other. In Unit 4, we learned that the

More information

Operations on Radical Expressions; Rationalization of Denominators

Operations on Radical Expressions; Rationalization of Denominators 0 RD. 1 2 2 2 2 2 2 2 Operations on Radical Expressions; Rationalization of Denominators Unlike operations on fractions or decimals, sums and differences of many radicals cannot be simplified. For instance,

More information

Imperfectly Shared Randomness in Communication

Imperfectly Shared Randomness in Communication Imperfectly Shared Randomness in Communication Madhu Sudan Harvard Joint work with Clément Canonne (Columbia), Venkatesan Guruswami (CMU) and Raghu Meka (UCLA). 11/16/2016 UofT: ISR in Communication 1

More information

New Class of Almost Unbiased Modified Ratio Cum Product Estimators with Knownparameters of Auxiliary Variables

New Class of Almost Unbiased Modified Ratio Cum Product Estimators with Knownparameters of Auxiliary Variables Journal of Mathematics and System Science 7 (017) 48-60 doi: 10.1765/159-591/017.09.00 D DAVID PUBLISHING New Class of Almost Unbiased Modified Ratio Cum Product Estimators with Knownparameters of Auxiliary

More information
