Descriptive Stats Review
Categorical Data
The Area Principle: violating it distorts the data, possibly making it harder to compare categories.
Everything should add up to 100%: when we add up all of our categorical data, we should get 100%.
Know how to read a contingency table.
Survival Contingency Table
                     Class
Survival   First  Second  Third  Crew  Total
Alive        203     118    178   212    711
Dead         122     167    528   673   1490
Total        325     285    706   885   2201
The percentage of passengers who were both in first class and survived? 203/2201 or 9.2%
The percentage of first-class passengers who survived? 203/325 or 62.5%
The percentage of the survivors who were in first class? 203/711 or 28.6%
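The three percentages above share a numerator (203) and differ only in the denominator. A minimal Python sketch using the counts from the table; the variable names are illustrative and not part of the slides:

```python
# Counts copied from the survival contingency table.
table = {
    "Alive": {"First": 203, "Second": 118, "Third": 178, "Crew": 212},
    "Dead": {"First": 122, "Second": 167, "Third": 528, "Crew": 673},
}

grand_total = sum(sum(row.values()) for row in table.values())  # everyone on board
first_total = sum(row["First"] for row in table.values())       # column total
alive_total = sum(table["Alive"].values())                      # row total

cell = table["Alive"]["First"]  # first class AND survived

joint = cell / grand_total                 # of all passengers
survived_given_first = cell / first_total  # of first-class passengers
first_given_survived = cell / alive_total  # of survivors

print(f"{joint:.1%} {survived_given_first:.1%} {first_given_survived:.1%}")
```

Reading the question carefully to find "percent of what?" tells you which total goes in the denominator.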
Survival Contingency Table
                     Class
Survival   First  Second  Third  Crew  Total
Alive        203     118    178   212    711
Dead         122     167    528   673   1490
Total        325     285    706   885   2201
What are the marginal distributions? Survival (711 alive, 1490 dead) and Class (325 first, 285 second, 706 third, 885 crew)
Conditional distributions? They come from the middle values: take one row or column and divide each cell by that row's or column's total.
Is survival independent of class? Why? NO! The percentage of first-class passengers who survived is 203/325 or 62.5%, whereas the percentage of crew members who survived is 212/885 or 24.0%.
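One way to check independence is to compare the conditional survival rate within each class to the overall survival rate. This is a sketch in Python using the table's counts, with illustrative names; if survival were independent of class, all the rates would be roughly equal.

```python
# Counts copied from the survival contingency table.
alive = {"First": 203, "Second": 118, "Third": 178, "Crew": 212}
dead = {"First": 122, "Second": 167, "Third": 528, "Crew": 673}

# Conditional distribution of survival within each class.
survival_rate = {c: alive[c] / (alive[c] + dead[c]) for c in alive}

# Marginal (overall) survival rate, ignoring class.
total_alive = sum(alive.values())
total_dead = sum(dead.values())
overall_rate = total_alive / (total_alive + total_dead)

for c, rate in survival_rate.items():
    print(f"{c:<6} {rate:.1%}")
print(f"Overall {overall_rate:.1%}")
```

The rates differ widely from class to class, which is the evidence that survival and class are not independent.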
Categorical Data
Make sure you have enough individuals in your data. "I can make a free throw 75% of the time." I just happened to make three out of four shots and then called it quits.
Don't overstate your claim. Independence is an important concept, but it is rare for two variables to be entirely independent. We can't conclude that one variable has no effect whatsoever on another. Usually, all we know is that little effect was observed in our study. Other studies of other groups under other conditions might find different results.
Simpson's Paradox
Ronnie Belliard
2002: 61/289, .211 of his at-bats were hits
2003: 124/447, .277 of his at-bats were hits
Two-season average: 185/736, hits .2514 of the time
Casey Blake
2002: 4/20, .200 of his at-bats were hits
2003: 143/557, .257 of his at-bats were hits
Two-season average: 147/577, hits .2548 of the time
The two-season batting average for Belliard was lower than Blake's, but divided into separate seasons, Belliard had the higher batting average both seasons. This is Simpson's Paradox.
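The reversal can be checked directly from the hit and at-bat counts on the slide. A short Python sketch, with an illustrative data layout:

```python
# (hits, at-bats) per season, copied from the slide.
records = {
    "Belliard": {2002: (61, 289), 2003: (124, 447)},
    "Blake": {2002: (4, 20), 2003: (143, 557)},
}

# Batting average within each season.
by_season = {
    player: {year: hits / at_bats for year, (hits, at_bats) in seasons.items()}
    for player, seasons in records.items()
}

# Combined average: pool hits and at-bats first, then divide.
combined = {
    player: sum(h for h, _ in seasons.values()) / sum(ab for _, ab in seasons.values())
    for player, seasons in records.items()
}

for year in (2002, 2003):
    print(year, by_season["Belliard"][year] > by_season["Blake"][year])
print("combined", combined["Belliard"] > combined["Blake"])
```

The paradox arises because the seasons carry very different weights: Blake's weak 2002 rests on only 20 at-bats, so it barely affects his combined average.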
Quantitative Data
Have your calculator with you, and know how to use it!
Don't make a histogram of categorical data. Just because a zip code is a number does not automatically make it quantitative data.
Don't CUSS & BS a bar chart. Histograms are what cause us to swear.
CUSS & BS
Center
Unusual features (gaps, outliers)
Shape
Spread
& Be Specific
Measures of center? Mean, median
Shape? Unimodal, bimodal, multimodal, uniform, symmetric, skewed left, skewed right
Spread? IQR, SD; consistent or varied?
Histograms
Choose a bin width appropriate to the data.
[Figures: bin width too small; bin width too large]
The 5-number summary
Min, Q1, Median, Q3, Max
Outliers!
Resistant? Median, IQR
Non-resistant? Mean, SD
How do we identify outliers? Anything beyond the fences:
Upper Fence: Q3 + (1.5)IQR
Lower Fence: Q1 - (1.5)IQR
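The fence rule can be sketched in Python as follows. The data values are made up for illustration, and the quartiles use the "median of each half" convention (dropping the middle value when n is odd), which is an assumption here; calculators and software may use slightly different quartile rules.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Skip the middle value when n is odd so neither half contains it.
    upper = data[half + 1:] if n % 2 else data[half:]
    return min(data), median(lower), median(data), median(upper), max(data)

# Hypothetical data, invented for this example.
values = [2, 4, 5, 7, 8, 9, 10, 12, 30]

low, q1, med, q3, high = five_number_summary(values)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print((low, q1, med, q3, high), (lower_fence, upper_fence), outliers)
```

Here only the value 30 falls outside the fences, so it is the one flagged as an outlier.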
What is this variance you speak of?
How are variance and SD related?
The square root of variance is SD.
Or: SD squared is variance.
[Formulas: variance; standard deviation]
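For reference, the relationship can be written out as follows, assuming the usual sample definitions with an n - 1 divisor (the slide's own formulas did not survive, so this is a standard textbook form rather than a copy of the original):

```latex
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2,
\qquad
s = \sqrt{s^2}
```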
Oh great, Greek letters
What is the difference between µ and x̄?
µ: population average
x̄: sample average
σ and s?
σ: population standard deviation
s: sample standard deviation
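Python's standard `statistics` module keeps the same distinction: `pstdev` treats the data as a whole population (σ), while `stdev` treats it as a sample (s). A small sketch with made-up numbers:

```python
import math
from statistics import mean, pstdev, stdev

# Hypothetical data, invented for this example.
data = [2, 4, 4, 4, 5, 5, 7, 9]

average = mean(data)   # same arithmetic for mu and x-bar; only the interpretation differs
sigma = pstdev(data)   # population SD: divides by n
s = stdev(data)        # sample SD: divides by n - 1

print(average, sigma, round(s, 3))
```

Because the sample version divides by n - 1 instead of n, s always comes out a little larger than σ on the same data.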
Comparing Distributions
Avoid inconsistent scales!
Label clearly!
Outliers!
CUSS & BS all distributions! Always mention center, unusual features, shape, spread.
BE SPECIFIC! (Provide a value where applicable.)
Comparing Distributions
When comparing histograms we can compare:
Center
Unusual features
Shape
Spread
Shocking, right?
Comparing Distributions
When comparing boxplots:
Compare the shapes. Do the boxes look symmetric or skewed?
Compare medians. Which group has a larger center? Any idea why?
Compare IQRs. Which group is more varied? More consistent?
Identify outliers, if any. Remember how to find the upper and lower fences!
Re-expressing Data
What do we do if our real data sucks? Make it not suck!
But how? Re-express the data by applying a function to it.
Common functions are the square root and log functions.
Don't forget to convert our findings back from our re-expressed data!
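A minimal sketch of re-expressing with a base-10 log and then converting back, in Python. The data values are invented purely for illustration:

```python
import math
from statistics import mean

# Hypothetical, strongly right-skewed values (not from the slides).
raw = [1, 10, 100, 1000, 10000]

# Re-express: take the base-10 log of every value.
logged = [math.log10(x) for x in raw]

# Summarize on the re-expressed scale.
center_logged = mean(logged)

# Convert the finding back to the original units by undoing the log.
center_back = 10 ** center_logged

print(mean(raw), center_logged, center_back)
```

On the log scale the values are evenly spaced, and the back-converted center sits far below the raw mean, which is dragged upward by the largest value.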
Categorical Practice
What percent of the class are females with Democratic political views?
What percent of the Democrats are female?
What percent of the females are Democrats?
What is the marginal frequency distribution of political views?
What is the conditional relative frequency distribution of gender among Republicans?
Are gender and political views independent?
Weight of Pennies
Make a histogram (while we CUSS & BS)
2.57, 2.56, 3.14, 3.03, 3.13, 2.47, 2.43, 3.11, 3.06, 2.48
2.51, 2.50, 3.07, 3.08, 3.01, 2.45, 2.50, 3.13, 2.51, 3.12
3.10, 3.08, 2.46, 2.44, 2.47, 2.54, 3.09, 3.13, 2.56, 2.49
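For checking a hand-drawn histogram, here is a rough text version in Python. The bin width of 0.10 is an assumption chosen for this sketch, not something the exercise specifies; values are converted to whole hundredths first so that binning is not thrown off by floating-point rounding.

```python
from collections import Counter

# Penny weights copied from the exercise.
weights = [
    2.57, 2.56, 3.14, 3.03, 3.13, 2.47, 2.43, 3.11, 3.06, 2.48,
    2.51, 2.50, 3.07, 3.08, 3.01, 2.45, 2.50, 3.13, 2.51, 3.12,
    3.10, 3.08, 2.46, 2.44, 2.47, 2.54, 3.09, 3.13, 2.56, 2.49,
]

def bin_counts(values, width=10):
    """Count values per bin; bins are keyed by their lower edge in hundredths."""
    hundredths = [round(v * 100) for v in values]
    counts = Counter((h // width) * width for h in hundredths)
    lo, hi = min(counts), max(counts)
    # Include empty bins so gaps show up in the picture.
    return {b: counts.get(b, 0) for b in range(lo, hi + width, width)}

width = 10  # assumed bin width: 0.10
hist = bin_counts(weights, width)
for start, count in hist.items():
    print(f"{start / 100:.2f}-{(start + width - 1) / 100:.2f} | {'*' * count}")
```

The empty bins in the middle make the gap between the two clusters easy to see, which is the unusual feature to call out when you CUSS & BS.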
Comparing Distributions
Here are the weekly payrolls for two imaginary restaurants, Mooseburgers and McTofu.
1. Find the 5-number summaries for both.
2. Create parallel boxplots. Label your graph.
3. Write a few sentences comparing the distributions.
4. Which restaurant pays the higher average salary?
5. Why is the mean salary misleading?
6. Where would you rather work? Explain with stats!