Acknowledgement: The author is indebted to Dr. Jennifer Kaplan, Dr. Parthanil Roy and Dr. Ashoke Sinha for allowing him to use and edit many of their slides.
Topic for this lecture
• Today's lecture material can be read from Chapter 3 of the textbook.
• I am going to cover only a part of the textbook in this class, and the part I do not cover is not important for this course.
• Today we shall cover some descriptive statistics for categorical variables.
• In descriptive statistics we summarize data through graphs and tables.
3 Rules of Data Analysis
1. Make a picture
   • To help you think clearly about the patterns and relationships hiding in your data.
2. Make a picture
   • To show the important features and unexpected values or patterns in your data.
3. Make a picture
   • To tell others what your data reveal.
How to display Categorical Data?
• Frequency Tables
• Bar Charts
• Pie Charts
• Contingency Tables
Frequency Tables
• The frequency of a particular data value is the number of times that value occurs. Thus a frequency is simply the count for a particular level.
• In a frequency table, the categories/levels are written in the leftmost column and the corresponding frequencies are written in the second column.
• Sometimes proportions or percentages are written instead of, or in addition to, the actual counts. A proportion is also called a relative frequency.
Frequency Table: An Example
Frequency table of the number of golf balls sold on different days of a week

Day         # of Golf Balls Sold (Frequency)   % of Golf Balls Sold
Monday      17                                 19.54
Tuesday     13                                 14.94
Wednesday   15                                 17.24
Thursday    20                                 22.99
Friday      22                                 25.29
Total       87                                 100
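The percentage column can be reproduced from the counts. The following is a minimal sketch in Python, assuming the counts are typed in by hand from the golf ball table; the variable names are illustrative and not part of the slides.

```python
# Counts copied from the golf ball frequency table.
sales = {"Monday": 17, "Tuesday": 13, "Wednesday": 15, "Thursday": 20, "Friday": 22}

total = sum(sales.values())  # total number of golf balls sold

# Relative frequency as a percentage: count / total * 100, rounded to 2 decimals.
percentages = {day: round(100 * count / total, 2) for day, count in sales.items()}

# Print the frequency table: category, count, percentage.
for day, count in sales.items():
    print(f"{day:<10} {count:>3} {percentages[day]:>6.2f}")
print(f"{'Total':<10} {total:>3} {100:>6.2f}")
```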
Bar Charts
• A bar chart or bar graph is a chart with rectangular bars whose lengths are proportional to the frequencies they represent.
• The bars can be plotted vertically (more common) or horizontally (less common).
• Percentages or relative proportions can also be plotted instead of the actual counts.
Bar Chart: Golf Balls Sold
[Figure: bar chart of # of golf balls sold by day. Monday 17, Tuesday 13, Wednesday 15, Thursday 20, Friday 22]
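To illustrate that bar length is proportional to frequency, here is a rough text-only sketch in Python. It draws horizontal bars with one symbol per golf ball sold. The function name `text_bar_chart` and the choice of `#` as the bar symbol are assumptions made for this illustration, not something from the slides.

```python
# Counts copied from the golf ball frequency table.
sales = {"Monday": 17, "Tuesday": 13, "Wednesday": 15, "Thursday": 20, "Friday": 22}

def text_bar_chart(counts, symbol="#"):
    """Return one line per category, using one symbol per unit of frequency."""
    width = max(len(label) for label in counts)  # pad labels to a common width
    return [f"{label:<{width}} | {symbol * count} {count}"
            for label, count in counts.items()]

for line in text_bar_chart(sales):
    print(line)
```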
The following bar chart represents the incarceration rate (per 100,000 people) of various countries.
Pie Chart
• A pie chart (or circle graph) is a circular chart divided into sectors, illustrating proportions.
• The arc length of each sector (and consequently its central angle and area) is proportional to the quantity it represents.
• The sector angles are computed from the fact that 100% corresponds to 360 degrees.
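Since 100% corresponds to 360 degrees, the central angle of a sector is its proportion of the total multiplied by 360. A minimal sketch in Python, assuming the golf ball counts from the frequency table example (the variable names are illustrative):

```python
# Counts copied from the golf ball frequency table.
sales = {"Monday": 17, "Tuesday": 13, "Wednesday": 15, "Thursday": 20, "Friday": 22}

total = sum(sales.values())

# Central angle in degrees: (count / total) * 360.
angles = {day: 360 * count / total for day, count in sales.items()}

for day, angle in angles.items():
    print(f"{day:<10} {angle:6.1f} degrees")
```

The angles add up to 360 degrees, just as the percentages add up to 100.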
Pie Chart: Golf Balls Sold
[Figure: pie chart of % of golf balls sold by day. Monday 20%, Tuesday 15%, Wednesday 17%, Thursday 23%, Friday 25%]
Pie Chart: An Example
[Figure: pie chart of English native speakers]
Bar Chart vs. Pie Chart
• A bar chart is used more often to represent actual frequencies, while a pie chart is used to represent relative proportions (in %).
• When comparing relative proportions is important, a pie chart is more appropriate.
• When the absolute counts or frequencies are more important, a bar chart should be used.
Major points so far
• First step in organizing data
   • Draw a picture
• Appropriate pictures for categorical data
   • Pie chart
   • Bar chart
Multiple categorical variables
How to represent two categorical variables in tabular or graphical form?
Contingency tables, cluster bar plots and stacked bar plots.
Contingency Tables
• A contingency table (also referred to as a cross tabulation or cross tab) is often used to record and analyze the relation between two (or more) categorical variables.
• Here the rows represent the categories of one categorical variable, and the columns represent the categories of the other categorical variable.
• Each cell holds the frequency for the corresponding combination of row and column categories.
• Most often we have two categorical variables, and we can answer many questions about the data from the contingency table.
Data from STT 200 Class
• How many sophomores were there in Section 3? 16
• How many students were there in Section 1? 31
• How many seniors were in the class? 7

         Freshmen  Sophomores  Juniors  Seniors  Total
Sec 1    23        2           3        3        31
Sec 2    16        10          2        2        30
Sec 3    2         16          9        2        29
Sec 4    9         14          7        0        30
Total    50        42          21       7        120
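The three answers above are a single cell, a row total and a column total of the contingency table. A minimal sketch in Python, assuming the cell counts are typed in by hand from the STT 200 table; the dictionary layout and names are illustrative only.

```python
# Cell counts from the STT 200 contingency table: one row per section,
# columns ordered as in `levels`.
levels = ["Freshmen", "Sophomores", "Juniors", "Seniors"]
table = {
    "Sec 1": [23, 2, 3, 3],
    "Sec 2": [16, 10, 2, 2],
    "Sec 3": [2, 16, 9, 2],
    "Sec 4": [9, 14, 7, 0],
}

# Marginal totals: sum across each row and down each column.
row_totals = {sec: sum(counts) for sec, counts in table.items()}
col_totals = {level: sum(counts[i] for counts in table.values())
              for i, level in enumerate(levels)}
grand_total = sum(row_totals.values())

print(table["Sec 3"][levels.index("Sophomores")])  # sophomores in Section 3
print(row_totals["Sec 1"])                         # students in Section 1
print(col_totals["Seniors"])                       # seniors in the class
```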
What proportion of students are freshmen?

         Freshmen  Sophomores  Juniors  Seniors  Total
Sec 1    23        2           3        3        31
Sec 2    16        10          2        2        30
Sec 3    2         16          9        2        29
Sec 4    9         14          7        0        30
Total    50        42          21       7        120

A. About 74%
B. About 46%
C. Exactly 30%
D. About 42%
E. The answer is not given
Solution: (50/120)*100% = 41.67%, i.e. about 42%. Answer: D
What proportion of students in Section 4 are freshmen?

         Freshmen  Sophomores  Juniors  Seniors  Total
Sec 1    23        2           3        3        31
Sec 2    16        10          2        2        30
Sec 3    2         16          9        2        29
Sec 4    9         14          7        0        30
Total    50        42          21       7        120

A. About 74%
B. About 46%
C. Exactly 30%
D. About 42%
E. The answer is not given
Solution: (9/30)*100% = 30%. Answer: C
What proportion of freshmen are in Section 1?

         Freshmen  Sophomores  Juniors  Seniors  Total
Sec 1    23        2           3        3        31
Sec 2    16        10          2        2        30
Sec 3    2         16          9        2        29
Sec 4    9         14          7        0        30
Total    50        42          21       7        120

A. About 74%
B. Exactly 46%
C. Exactly 30%
D. About 42%
E. The answer is not given
Solution: (23/50)*100% = 46%. Answer: B
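The last three questions differ only in the denominator: all students, the students in one section, or all freshmen. A minimal sketch in Python that makes the three denominators explicit, assuming the same hand-entered counts from the STT 200 table (the variable names are illustrative):

```python
# Cell counts from the STT 200 contingency table; columns are
# Freshmen, Sophomores, Juniors, Seniors (freshmen are at index 0).
table = {
    "Sec 1": [23, 2, 3, 3],
    "Sec 2": [16, 10, 2, 2],
    "Sec 3": [2, 16, 9, 2],
    "Sec 4": [9, 14, 7, 0],
}

grand_total = sum(sum(row) for row in table.values())    # all students
freshmen_total = sum(row[0] for row in table.values())   # all freshmen

# Proportion of all students who are freshmen (denominator: whole class).
p_freshmen = freshmen_total / grand_total

# Proportion of Section 4 students who are freshmen (denominator: Section 4).
p_freshmen_given_sec4 = table["Sec 4"][0] / sum(table["Sec 4"])

# Proportion of freshmen who are in Section 1 (denominator: all freshmen).
p_sec1_given_freshmen = table["Sec 1"][0] / freshmen_total

print(f"{100 * p_freshmen:.2f}%")
print(f"{100 * p_freshmen_given_sec4:.2f}%")
print(f"{100 * p_sec1_given_freshmen:.2f}%")
```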
Cluster and stacked bar plots
• Plots can be drawn to compare different groups, or to check whether there is any relation between two categorical variables.
• Two plots are widely used for this purpose:
   • Cluster bar plot
   • Stacked bar plot
Cluster bar plot (seniority on horizontal axis): frequencies
[Figure: cluster bar plot of frequencies; horizontal axis: Freshmen, Sophomores, Juniors, Seniors; one bar per section (Sec 1 to Sec 4)]
Cluster bar plot (sections on horizontal axis): frequencies
[Figure: cluster bar plot of frequencies; horizontal axis: Sec 1, Sec 2, Sec 3, Sec 4; one bar per seniority level (Freshmen, Sophomores, Juniors, Seniors)]
Stacked bar plot (for seniority): frequencies
[Figure: stacked bar plot of frequencies; horizontal axis: Freshmen, Sophomores, Juniors, Seniors; segments: Sec 1 to Sec 4]
Stacked bar plot (for sections): frequencies
[Figure: stacked bar plot of frequencies; horizontal axis: Sec 1, Sec 2, Sec 3, Sec 4; segments: Freshmen, Sophomores, Juniors, Seniors]
Are the variables related or independent of each other?
• To see whether the categorical variables Seniority and Section are related, it is more suitable to make stacked bar plots of conditional relative frequencies.
• Here is a table of conditional relative frequencies (in percentages) given the seniority:

         Freshmen  Sophomores  Juniors  Seniors
Sec 1    46        4.76        14.29    42.86
Sec 2    32        23.81       9.52     28.57
Sec 3    4         38.10       42.86    28.57
Sec 4    18        33.33       33.33    0
Total    100       100         100      100
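Each entry of this table is a cell count divided by its column total (the number of students at that seniority level). A minimal sketch in Python, assuming the counts from the STT 200 contingency table are entered by hand; the names are illustrative only.

```python
# Cell counts from the STT 200 contingency table, one row per section.
levels = ["Freshmen", "Sophomores", "Juniors", "Seniors"]
table = {
    "Sec 1": [23, 2, 3, 3],
    "Sec 2": [16, 10, 2, 2],
    "Sec 3": [2, 16, 9, 2],
    "Sec 4": [9, 14, 7, 0],
}

# Column totals: number of students at each seniority level.
col_totals = [sum(row[i] for row in table.values()) for i in range(len(levels))]

# Conditional relative frequency (in %) of each section given the seniority:
# cell count / column total * 100, rounded to 2 decimals.
conditional = {
    sec: [round(100 * count / col_totals[i], 2) for i, count in enumerate(row)]
    for sec, row in table.items()
}

for sec, percents in conditional.items():
    print(sec, percents)
```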
Stacked bar plots for comparison and finding relation
[Figure: 100% stacked bar plot; horizontal axis: Freshmen, Sophomores, Juniors, Seniors; vertical axis: 0% to 100%; segments: Sec 1 to Sec 4]
• Here the segments of the first bar represent the percentages of the different sections given that the students are freshmen. Similarly, the other bars represent sophomores, juniors and seniors respectively.
• As the segments of the same color have different lengths on different bars, we conclude that section and seniority are related (i.e. not independent).
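The same comparison can be made numerically. If section and seniority were unrelated, each bar would split into sections in roughly the same way as the whole class does. The sketch below, in Python, measures how far each bar is from the overall section percentages. The "largest gap" summary is an assumption made for this illustration; it is a descriptive check, not a formal test and not something from the slides.

```python
# Cell counts from the STT 200 contingency table, one row per section.
levels = ["Freshmen", "Sophomores", "Juniors", "Seniors"]
table = {
    "Sec 1": [23, 2, 3, 3],
    "Sec 2": [16, 10, 2, 2],
    "Sec 3": [2, 16, 9, 2],
    "Sec 4": [9, 14, 7, 0],
}

col_totals = [sum(row[i] for row in table.values()) for i in range(len(levels))]
grand_total = sum(col_totals)

# Overall percentage of students in each section (ignoring seniority).
overall = {sec: 100 * sum(row) / grand_total for sec, row in table.items()}

# For each seniority level, the largest difference (in percentage points)
# between the section percentages within that level and the overall ones.
gaps = {}
for i, level in enumerate(levels):
    shares = {sec: 100 * row[i] / col_totals[i] for sec, row in table.items()}
    gaps[level] = max(abs(shares[sec] - overall[sec]) for sec in table)
    print(f"{level:<11} largest gap: {gaps[level]:.2f} percentage points")
```

Large gaps for every seniority level agree with the conclusion drawn from the plot.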
Simpson's paradox
Do not use unfair averages. Watch for lurking variables.
Example: Simpson's paradox
• Two pilots: Moe and Jill. We are interested in how often they landed their flights on time.

        Day             Night           Overall
Moe     90 out of 100   10 out of 20    100 out of 120 (83%)
Jill    19 out of 20    75 out of 100   94 out of 120 (78%)

• Moe has a success rate of 83% (100 out of 120).
• Jill has a success rate of 78% (94 out of 120).
• Does it mean Moe has a better success rate than Jill?
Example: Simpson's paradox

        Day                   Night                 Overall
Moe     90 out of 100 (90%)   10 out of 20 (50%)    100 out of 120 (83%)
Jill    19 out of 20 (95%)    75 out of 100 (75%)   94 out of 120 (78%)

• Note that during the day Jill has a success rate of 95% (19 out of 20), which is better than Moe's 90% (90 out of 100).
• Also, at night Jill has a better success rate of 75% (75 out of 100), compared with Moe's 50% (10 out of 20).
• So Jill is better than Moe both by day and by night, but worse overall. How is that possible?
• Notice that landing at night is more difficult, and Jill flies mostly at night. The overall average does not take that fact into account, hence the anomaly.
• So be careful when interpreting the overall average!
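The reversal can be checked directly from the counts. A minimal sketch in Python, assuming the on-time and total flight counts are typed in from the table above; the data layout and the helper name `rate` are illustrative only.

```python
# (on-time landings, total flights) for each pilot, by time of day.
records = {
    "Moe": {"Day": (90, 100), "Night": (10, 20)},
    "Jill": {"Day": (19, 20), "Night": (75, 100)},
}

def rate(on_time, flights):
    """On-time success rate as a percentage."""
    return 100 * on_time / flights

# Success rate within each time of day.
by_time_rates = {
    pilot: {time: rate(*counts) for time, counts in by_time.items()}
    for pilot, by_time in records.items()
}

# Overall success rate: pool day and night flights before dividing.
overall_rates = {
    pilot: rate(sum(ok for ok, _ in by_time.values()),
                sum(n for _, n in by_time.values()))
    for pilot, by_time in records.items()
}

for pilot in records:
    print(pilot, by_time_rates[pilot], f"overall {overall_rates[pilot]:.0f}%")
```

Jill's overall rate is pulled toward her night rate because 100 of her 120 flights were at night, while Moe's is pulled toward his day rate because 100 of his 120 flights were during the day.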