Week 7 One-way ANOVA

Objectives By the end of this lecture, you should be able to: Understand the shortcomings of comparing multiple means through a series of pairwise hypothesis tests. Understand the steps of the ANOVA method and the method's advantages. Compare the means of three or more populations using the ANOVA method.

The Logic and the Process of Analysis of Variance Suppose a salesperson wants to compare the level of customer satisfaction for four different insurance companies. Our question is: "Is there a difference in satisfaction scores across the four insurance companies?"

The Logic and the Process of Analysis of Variance The purpose of ANOVA is much the same as the t tests presented before: the goal is to determine whether the mean differences that are obtained for sample data are sufficiently large to justify a conclusion that there are mean differences between the populations from which the samples were obtained.

The Logic and the Process of Analysis of Variance (cont.) The difference between ANOVA and the t tests is that ANOVA can be used in situations where there are two or more means being compared, whereas the t tests are limited to situations where only two means are involved. Analysis of variance is necessary to protect researchers from excessive risk of a Type I error in situations where a study is comparing more than two population means.

Shortcomings of Comparing Multiple Means Using Multiple t-tests We could just run six different independent samples t-tests (company 1 vs. company 2; company 1 vs. company 3; company 1 vs. company 4; company 2 vs. company 3; company 2 vs. company 4; and company 3 vs. company 4). This would be tedious, but we could use a computer to compute these quickly and easily.

Shortcomings of Comparing Multiple Means Using Multiple t-tests It turns out this is a very bad idea because of a major flaw: when more than one t-test is run, each at its own level of significance, the probability of making one or more Type I errors grows quickly with the number of tests. Recall that a Type I error occurs when we reject a null hypothesis that is actually true. The level of significance, α, is the probability of a Type I error in a single test. So, for a single t-test in our example, with an α of 0.05, we have a Type I error probability of 5%. When testing more than one pair of samples, the probability of making at least one Type I error (treating the tests as independent) is $1 - (1 - \alpha)^c$, where $c$ is the number of t-tests. With $c = 6$ tests and α = 0.05, this is $1 - 0.95^6 \approx 0.26$.
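As a rough numerical check of the argument above, the following R sketch counts the pairwise tests needed for four groups and evaluates the probability of at least one Type I error. It assumes the tests are independent, which pairwise tests on shared samples are not exactly, so the result is an approximation. The variable names are illustrative.

```r
# Number of groups in the insurance example and the per-test significance level
k <- 4
alpha <- 0.05

# Number of pairwise t-tests needed to compare every pair of groups
n_tests <- choose(k, 2)

# Probability of at least one Type I error across all tests,
# assuming the tests are independent (an approximation here)
familywise_error <- 1 - (1 - alpha)^n_tests

n_tests
round(familywise_error, 3)
```

With six tests the chance of at least one false rejection is roughly 26%, far above the 5% intended for a single test.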

The Logic and the Process of Analysis of Variance (cont.) ANOVA allows researchers to evaluate all of the mean differences in a single hypothesis test using a single α-level and thereby keeps the risk of a Type I error under control no matter how many different means are being compared. Although ANOVA can be used in a variety of different research situations, we will cover only independent-measures designs involving only one independent variable (one-way ANOVA).

To apply one-way ANOVA, the following assumptions should hold:
1. All observations are independent of one another and randomly selected from the population that they represent.
2. The population at each value of the categorical variable (factor level) is approximately normal.
3. The variances for each factor level are approximately equal to one another.
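One possible way to screen the normality and equal-variance assumptions in R is sketched below. The data are simulated purely for illustration (the `scores` data frame and its `company` and `score` columns are hypothetical), and the choice of the Shapiro-Wilk and Bartlett tests is an assumption of this sketch rather than something the lecture prescribes. Independence cannot be checked this way; it has to come from how the data were collected.

```r
# Simulated satisfaction scores for four hypothetical companies (illustration only)
set.seed(1)
scores <- data.frame(
  company = factor(rep(c("C1", "C2", "C3", "C4"), each = 20)),
  score   = rnorm(80, mean = rep(c(70, 75, 72, 78), each = 20), sd = 8)
)

# Fit the one-way model so that the residuals can be examined
fit <- aov(score ~ company, data = scores)

# Assumption 2: Shapiro-Wilk test of normality on the model residuals
normality <- shapiro.test(residuals(fit))

# Assumption 3: Bartlett test of equal variances across the factor levels
equal_var <- bartlett.test(score ~ company, data = scores)

normality$p.value
equal_var$p.value
```

Small p-values in either test would suggest that the corresponding assumption is questionable for the data at hand.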

Steps of ANOVA To apply the ANOVA method to the insurance companies, we analyze the total variation of the scores, which includes the variation of the scores within the groups and the variation between the group means. Since we are interested in two different types of variation, we first calculate each type of variation independently and then calculate the ratio between the two, called an F-value. Just like the z, t, and chi-square tests, ANOVA has its own distribution, called the F-distribution, that we use to set our critical values and test our hypothesis. Just like the t-distribution and the chi-square distribution, the F-distribution relies on degrees of freedom. Since the F-value is a ratio of two different sources of variance, we'll need two different degrees of freedom: one for the variation between groups and one for the variation within groups.

Steps of ANOVA When using the ANOVA method, we are testing the null hypothesis that the population means of our groups are equal. When we conduct a hypothesis test, we are assessing the probability of obtaining an F-statistic at least as extreme as the one observed by chance alone. If we reject the null hypothesis that the means are equal, we are saying that differences as large as the ones we see would be unlikely to happen just by chance. To test a hypothesis using the ANOVA method, there are several steps that we need to take.

Steps of ANOVA We will create what is called the ANOVA table, which has the following columns:
Source: where the variation in the test is coming from: between the groups, within the groups, or all the variance for all the observations (Total).
SS: the sum of squares.
Df: the degrees of freedom.
MS: the mean square.
F: the value of the test statistic.

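Steps of ANOVA
1. Calculate the total sum of squares (SST): $SST = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2$, where $y_{ij}$ is observation $j$ in group $i$, $k$ is the number of groups, $n_i$ is the size of group $i$, and $\bar{y}$ is the grand mean.
2. Calculate the sum of squares between groups (SSB): $SSB = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2$, where $\bar{y}_i$ is the mean of group $i$.
3. Find the sum of squares within groups (SSW): $SSW = SST - SSB = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$.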
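Steps 1 to 3 can be sketched in R as follows. The scores are made-up values for four hypothetical companies, chosen only so that the arithmetic is easy to follow by hand; the names `scores`, `company`, and `score` are illustrative.

```r
# Made-up satisfaction scores: three customers for each of four companies
scores <- data.frame(
  company = factor(rep(c("C1", "C2", "C3", "C4"), each = 3)),
  score   = c(4, 5, 6,  7, 8, 9,  1, 2, 3,  4, 5, 6)
)

# Grand mean, per-group means, and per-group sample sizes
grand_mean  <- mean(scores$score)
group_means <- tapply(scores$score, scores$company, mean)
group_n     <- tapply(scores$score, scores$company, length)

# Step 1: total sum of squares (every score against the grand mean)
sst <- sum((scores$score - grand_mean)^2)

# Step 2: sum of squares between groups (group means against the grand mean)
ssb <- sum(group_n * (group_means - grand_mean)^2)

# Step 3: sum of squares within groups, obtained by subtraction
ssw <- sst - ssb

c(SST = sst, SSB = ssb, SSW = ssw)
```

For these values the grand mean is 5, so SST = 62, SSB = 54, and SSW = 8.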

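Steps of ANOVA
4. Solve for the degrees of freedom for the test: $df_{between} = k - 1$ and $df_{within} = N - k$, where $k$ is the number of groups and $N$ is the total number of observations.
5. Calculate the Mean Squares Between (MSB) and the Mean Squares Within (MSW): $MSB = SSB / df_{between}$ and $MSW = SSW / df_{within}$.
6. Calculate the F statistic: $F = MSB / MSW$.
7. Find F critical using tables with the two degrees of freedom (between, within).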
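Steps 4 to 7 might look like this in R, again using made-up scores for four hypothetical companies. Here `qf()` stands in for the printed F table, and the final lines compare the hand calculation with `aov()` as a sanity check. The significance level of 0.05 is an assumption of the sketch.

```r
# Made-up satisfaction scores: three customers for each of four companies
scores <- data.frame(
  company = factor(rep(c("C1", "C2", "C3", "C4"), each = 3)),
  score   = c(4, 5, 6,  7, 8, 9,  1, 2, 3,  4, 5, 6)
)
alpha   <- 0.05                      # assumed significance level
k       <- nlevels(scores$company)   # number of groups
n_total <- nrow(scores)              # total number of observations

# Sums of squares between and within groups
grand_mean  <- mean(scores$score)
group_means <- tapply(scores$score, scores$company, mean)
group_n     <- tapply(scores$score, scores$company, length)
ssb <- sum(group_n * (group_means - grand_mean)^2)
ssw <- sum((scores$score - ave(scores$score, scores$company))^2)

# Step 4: degrees of freedom
df_between <- k - 1
df_within  <- n_total - k

# Step 5: mean squares
msb <- ssb / df_between
msw <- ssw / df_within

# Step 6: F statistic
f_stat <- msb / msw

# Step 7: critical value (in place of a table lookup) and p-value
f_crit  <- qf(1 - alpha, df_between, df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Sanity check against R's built-in one-way ANOVA
f_aov <- summary(aov(score ~ company, data = scores))[[1]][["F value"]][1]
c(F_by_hand = f_stat, F_aov = f_aov, F_critical = f_crit)
```

With these values the degrees of freedom are 3 and 8, MSB = 18, MSW = 1, and F = 18, which exceeds the critical value of about 4.07.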


Steps of ANOVA
8. Make a decision: if the F test statistic is greater than F critical (or the p-value of the F statistic is less than α), reject the null hypothesis and conclude that at least two groups have different means.
9. If you found a significant difference, apply a follow-up test to find which groups have different means. One such test is Tukey's Honest Significant Difference (HSD) test.
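The decision rule and the follow-up test can be sketched in R as below. The data are made-up scores for four hypothetical companies, and the significance level of 0.05 is an assumption; the pattern is to read the p-value from the ANOVA table and run `TukeyHSD()` only when the overall test rejects.

```r
# Made-up satisfaction scores: three customers for each of four companies
scores <- data.frame(
  company = factor(rep(c("C1", "C2", "C3", "C4"), each = 3)),
  score   = c(4, 5, 6,  7, 8, 9,  1, 2, 3,  4, 5, 6)
)
alpha <- 0.05  # assumed significance level

# Fit the one-way ANOVA and pull the p-value of the F test from its table
fit         <- aov(score ~ company, data = scores)
anova_table <- summary(fit)[[1]]
p_value     <- anova_table[["Pr(>F)"]][1]

# Step 8: reject the null hypothesis when the p-value is below alpha
reject_null <- p_value < alpha

# Step 9: only after a significant result, look at the pairwise differences
if (reject_null) {
  posthoc <- TukeyHSD(fit, conf.level = 1 - alpha)
  print(posthoc)
}
```

Each row of the Tukey output is one pair of groups, with the estimated difference in means, a confidence interval, and an adjusted p-value.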

R Example Does the price of a car depend on its body style?
boxplot(Automobile$price ~ Automobile$BodyStyle, main = "Car Prices", ylab = "Price", xlab = "Body Style")
How do we interpret the values in a boxplot?

Boxplots The orientation can be vertical or horizontal. In this figure, it is drawn horizontally. Q1 is the first quartile (the median of the lower half of the data). Q2 is the second quartile (the median of all the data). Q3 is the third quartile (the median of the upper half of the data). IQR = Q3 - Q1 is the interquartile range. Outliers are values either greater than Q3 + 1.5*IQR or less than Q1 - 1.5*IQR. Here, we have only large outliers, as indicated by the dots at the right of the box. No values are smaller than Q1 - 1.5*IQR, hence no outliers are shown at the left.
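The quartiles, the interquartile range, and the outlier cutoffs described above can be computed directly in R. The `prices` vector below is made up for illustration. Note that `boxplot()` draws its box from the hinges returned by `fivenum()`, which can differ slightly from the default `quantile()` values, so the two approaches may disagree on borderline points.

```r
# Made-up prices with one unusually large value
prices <- c(10, 12, 13, 14, 15, 16, 18, 40)

# Quartiles and interquartile range using R's default quantile method
q   <- quantile(prices, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(prices)

# Cutoffs beyond which a value is flagged as an outlier
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Values outside the cutoffs
outliers <- prices[prices < lower_fence | prices > upper_fence]

# The points that boxplot() itself would draw as outliers (hinge-based)
box_out <- boxplot.stats(prices)$out

outliers
box_out
```

In this example both approaches flag only the value 40, which lies above the upper cutoff; nothing falls below the lower one.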

R Example aggregate(price ~ BodyStyle, Automobile, mean)

R Example aggregate(price ~ BodyStyle, Automobile, sd)

R Example
Pricesmodel <- aov(Automobile$price ~ Automobile$BodyStyle)
summary(Pricesmodel)

R Example TukeyHSD(Pricesmodel)