University of California, Los Angeles
Department of Statistics

Statistics 13
Instructor: Nicolas Christou

Measures of central tendency and variation
Data display

Measures of central tendency

1. Sample mean: Let $x_1, x_2, \ldots, x_n$ be the $n$ observations of a sample. The sample mean $\bar{x}$ is computed as follows:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

2. Median: It is the value that falls in the middle when the observations are sorted from smallest to largest. To compute the median, follow these steps:

a. Sort the observations from smallest to largest.
b. Compute the position of the median: $\frac{n+1}{2}$.

Examples:

A. Sample size is odd: 7 annual incomes: 28, 60, 26, 32, 30, 26, 29.
First sort these observations from smallest to largest: 26, 26, 28, 29, 30, 32, 60.
Next compute $\frac{n+1}{2} = \frac{7+1}{2} = 4$. The median is the 4th observation. Median = 29.

B. Sample size is even: 8 annual incomes: 26, 26, 28, 29, 30, 32, 60, 80.
Again compute $\frac{n+1}{2} = \frac{8+1}{2} = 4.5$. The position falls between the 4th and 5th observations, so the median is the average of the two middle observations. Median = $\frac{29+30}{2} = 29.5$.

Question: How do unusual observations affect the sample mean and the median?
Example: 8 annual incomes: 26, 26, 28, 29, 30, 32, 60, 8000.
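The two examples and the question about unusual observations can be checked with a short Python sketch. It is only an illustration: the helper name `median_by_position` is hypothetical and simply follows steps a and b above, and the mean comes from the standard-library `statistics` module.

```python
import statistics


def median_by_position(values):
    """Median via the (n + 1) / 2 position rule (positions are 1-based)."""
    ordered = sorted(values)                 # step a: sort smallest to largest
    n = len(ordered)
    position = (n + 1) / 2                   # step b: position of the median
    if position.is_integer():                # odd n: a single middle observation
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]       # even n: observation just below the position
    upper = ordered[int(position)]           # and the one just above it
    return (lower + upper) / 2


odd_sample = [28, 60, 26, 32, 30, 26, 29]
even_sample = [26, 26, 28, 29, 30, 32, 60, 80]
with_outlier = [26, 26, 28, 29, 30, 32, 60, 8000]

print(median_by_position(odd_sample))        # 29
print(median_by_position(even_sample))       # 29.5

# Replacing 80 by 8000 moves the mean a great deal but leaves the median alone.
print(statistics.mean(even_sample), median_by_position(even_sample))    # 38.875 29.5
print(statistics.mean(with_outlier), median_by_position(with_outlier))  # 1028.875 29.5
```

The last two lines show that one unusual observation moves the mean from 38.875 to 1028.875, while the median stays at 29.5.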
Measures of non-central tendency

1. First quartile ($Q_1$) or 25th percentile: Its position is $\frac{n+1}{4}$.
2. Third quartile ($Q_3$) or 75th percentile: Its position is $\frac{3(n+1)}{4}$.

Example: Find $Q_1$ and $Q_3$ of the following 8 annual incomes: 26, 26, 28, 29, 30, 32, 60, 80.

Position of $Q_1$: $\frac{n+1}{4} = \frac{8+1}{4} = 2.25$, which rounds to the 2nd observation (round to the nearest integer).
Position of $Q_3$: $\frac{3(n+1)}{4} = \frac{3(8+1)}{4} = 6.75$, which rounds to the 7th observation (round to the nearest integer).

Therefore, $Q_1 = 26$, $Q_3 = 60$.

Five-number summary of a data set:

MIN   $Q_1$   MEDIAN   $Q_3$   MAX

Box plot: A popular way to display data and identify outliers.

You are given 11 annual incomes in thousands of dollars: 26, 26, 28, 29, 30, 32, 60, 65, 70, 40, [illegible]. Construct the box plot of income using these 11 observations.

Begin by sorting these incomes: 26, 26, 28, 29, 30, 32, 40, [illegible], 60, 65, 70.

Find the position of the first quartile, median, and third quartile:

Position of $Q_1$ = $\frac{n+1}{4} = \frac{11+1}{4}$ = 3rd
Position of Median = $\frac{n+1}{2} = \frac{11+1}{2}$ = 6th
Position of $Q_3$ = $\frac{3(n+1)}{4} = \frac{3(11+1)}{4}$ = 9th

Find the first quartile, median, and third quartile: $Q_1 = 28$, Median = 32, $Q_3 = 60$, and the interquartile range is $IQR = Q_3 - Q_1 = 60 - 28 = 32$.

Outliers are observations above $Q_3 + 1.5\,IQR$ or below $Q_1 - 1.5\,IQR$. Also, serious outliers are observations above $Q_3 + 3\,IQR$ or below $Q_1 - 3\,IQR$. In our example we do not have any outliers, since $Q_3 + 1.5\,IQR = 60 + 1.5(32) = 108$ and $Q_1 - 1.5\,IQR = 28 - 1.5(32) = -20$.

Now we can construct the box plot.
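The position rule and the outlier limits above can be sketched in Python as follows. The function names are hypothetical, and one detail is an assumption not stated in the notes: a position ending in exactly .5 is rounded up. Statistical software often uses other quartile conventions, so its results may differ slightly from this rule.

```python
import math


def quartile_position(n, k):
    """1-based position k(n + 1)/4 of the k-th quartile, rounded to the nearest integer."""
    exact = k * (n + 1) / 4
    return math.floor(exact + 0.5)           # assumption: a tie (x.5) rounds up


def quartiles(values):
    """Return (Q1, Q3) using the rounded-position rule."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered[quartile_position(n, 1) - 1]
    q3 = ordered[quartile_position(n, 3) - 1]
    return q1, q3


def fences(q1, q3):
    """Limits beyond which an observation is an outlier or a serious outlier."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "outlier": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "serious": (q1 - 3 * iqr, q3 + 3 * iqr),
    }


# The 8-income example: positions 2.25 -> 2nd and 6.75 -> 7th.
incomes = [26, 26, 28, 29, 30, 32, 60, 80]
print(quartiles(incomes))                    # (26, 60)

# The box plot example, starting from the quartiles found in the notes.
print(fences(28, 60))
# {'iqr': 32, 'outlier': (-20.0, 108.0), 'serious': (-68.0, 156.0)}
```

Any observation outside the `outlier` interval would be drawn as a separate point on the box plot.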
Box plot pathologies: Here are some interesting box plots. Can you write down a set of observations that corresponds to each of these box plots?

[Figure: three box plots; axis values not recoverable]
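Measures of variation

1. Range:

2. Interquartile range (IQR):

3. Sample variance and sample standard deviation. Let $x_1, x_2, \ldots, x_n$ be the values of a sample. The sample variance $s^2$ is the average of the squared deviations of each observation from the sample mean, and it is computed as follows:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

where $x_i - \bar{x}$ is the $i$th deviation from the sample mean $\bar{x}$. It is easier for calculations to use:

\[ s^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] \]

The standard deviation is simply the square root of the variance. Both $\bar{x}$ and $s$ have the same units.

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} \]

or, easier for calculations,

\[ s = \sqrt{\frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right]} \]

Note: $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}$.

Example: Find the sample mean $\bar{x}$, sample variance $s^2$, and sample standard deviation $s$ of the following sample: 1, 1.1, 0.9, 1.3, 0.7 (weights of five oranges in ounces).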
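As a check on the orange example, here is a minimal Python sketch that evaluates both the defining formula and the computational formula for $s^2$. The variable names are illustrative, and `statistics.variance` from the standard library (which also divides by $n - 1$) serves as an independent comparison.

```python
import math
import statistics

weights = [1, 1.1, 0.9, 1.3, 0.7]            # weights of five oranges, in ounces
n = len(weights)

mean = sum(weights) / n

# Defining formula: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in weights) / (n - 1)

# Computational formula: sum of squares minus (sum)^2 / n, divided by n - 1.
s2_shortcut = (sum(x * x for x in weights) - sum(weights) ** 2 / n) / (n - 1)

s = math.sqrt(s2_definition)

print(round(mean, 4))                          # 1.0
print(round(s2_definition, 4))                 # 0.05
print(round(s2_shortcut, 4))                   # 0.05
print(round(s, 4))                             # 0.2236
print(round(statistics.variance(weights), 4))  # 0.05
```

Both formulas give the same variance up to floating-point rounding, which is why the values are rounded before printing.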
Adding and multiplying observations by a constant

Let $x_1, x_2, \ldots, x_n$ be the observations of a sample of size $n$, and let $\bar{x}$ and $s^2$ be the sample mean and sample variance respectively.

a. Suppose that a constant $a$ is added to each observation. Find the new sample mean and sample variance.

b. Suppose that each observation is multiplied by a constant $a$. Find the new sample mean and sample variance.
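One way to explore questions a and b numerically before working them out algebraically is the sketch below. It reuses the eight annual incomes from the median example; the constant `a = 10` is an arbitrary, hypothetical choice.

```python
import statistics

incomes = [26, 26, 28, 29, 30, 32, 60, 80]
a = 10                                       # hypothetical constant, chosen only for illustration

shifted = [x + a for x in incomes]           # question a: add a to each observation
scaled = [a * x for x in incomes]            # question b: multiply each observation by a

for label, sample in [("original", incomes), ("x + a", shifted), ("a * x", scaled)]:
    print(label, statistics.mean(sample), statistics.variance(sample))
```

The printed values suggest that adding $a$ shifts the mean by $a$ and leaves the variance unchanged, while multiplying by $a$ multiplies the mean by $a$ and the variance by $a^2$. A numerical check of this kind illustrates the result but does not prove it.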
Data display

Three popular methods:

1. Stem-and-leaf display
2. Frequency distribution
3. Histogram

Stem-and-leaf display: Split each observation into a stem and a leaf. Then place the stems in a column from smallest to largest. Next to each stem place its leaves from smallest to largest.

Frequency distribution: We can group data into classes (bins). The first step is to define the number of classes and the width of each class. There are many ways to do this.

Histogram: The frequency distribution can be graphed. The graph is called a histogram. To construct a histogram: On the horizontal axis place the class limits. Then construct, for each class, a rectangle whose base is the width of the class and whose height is the frequency of that class. There is also a relative frequency histogram (the height of each rectangle is the relative frequency of that class).

Construct by hand the stem-and-leaf plot of the following observations (ozone data, ppm):

 [1] 0.0 0.081 0.035 0.080 0.053 0.077 0.051 0.059 0.01 0.07 0.090 0.069 0.057
[14] 0.09 0.05 0.083 0.068 0.078 0.096 0.019 0.065 0.061 0.09 0.035 0.097 0.057
[27] 0.036 0.060 0.03 0.036

See more examples on the next pages.
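The first two displays can be sketched in Python using the eight annual incomes from the earlier examples. The function names and the class settings (start 20, width 20, four classes, each including its lower limit but not its upper limit) are assumptions made for illustration; as noted above, there are many ways to choose classes.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Stem = tens digit, leaf = units digit (integer data assumed)."""
    groups = defaultdict(list)
    for v in sorted(values):                 # sorting keeps the leaves in order
        groups[v // 10].append(v % 10)
    lines = []
    for stem in range(min(groups), max(groups) + 1):   # include empty stems
        leaves = "".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


def frequency_distribution(values, start, width, n_classes):
    """Rows of (lower limit, upper limit, frequency, relative frequency)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)    # class [lower, upper) containing v
        if 0 <= index < n_classes:
            counts[index] += 1
    total = len(values)
    return [
        (start + i * width, start + (i + 1) * width, c, c / total)
        for i, c in enumerate(counts)
    ]


incomes = [26, 26, 28, 29, 30, 32, 60, 80]

for line in stem_and_leaf(incomes):
    print(line)

for row in frequency_distribution(incomes, start=20, width=20, n_classes=4):
    print(row)
```

The frequencies in the third column are the rectangle heights of a histogram, and the relative frequencies in the fourth column are the heights of a relative frequency histogram.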
a. California ozone data. You can access the data at:
http://www.stat.ucla.edu/~nchristo/statistics13/ozone.txt

Here are the data:

  [1] 0.0 0.081 0.035 0.080 0.053 0.077 0.051 0.059 0.01 0.07 0.090 0.069
 [13] 0.057 0.09 0.05 0.083 0.068 0.078 0.096 0.019 0.065 0.061 0.09 0.035
 [25] 0.097 0.057 0.036 0.060 0.03 0.036 0.051 0.09 0.030 0.105 0.07 0.078
 [37] 0.08 0.095 0.079 0.067 0.09 0.081 0.077 0.08 0.05 0.059 0.101 0.038
 [49] 0.08 0.06 0.089 0.033 0.036 0.03 0.078 0.06 0.056 0.085 0.01 0.09
 [61] 0.059 0.115 0.03 0.08 0.09 0.099 0.059 0.089 0.093 0.038 0.099 0.06
 [73] 0.050 0.068 0.079 0.01 0.056 0.09 0.08 0.051 0.071 0.077 0.063 0.063
 [85] 0.061 0.068 0.039 0.061 0.0 0.05 0.08 0.061 0.065 0.036 0.05 0.06
 [97] 0.067 0.073 0.050 0.105 0.09 0.10 0.055 0.053 0.090 0.063 0.055 0.08
[109] 0.01 0.097 0.079 0.097 0.056 0.036 0.078 0.061 0.066 0.09 0.070 0.039
[121] 0.096 0.065 0.03 0.067 0.09 0.086 0.079 0.073 0.081 0.080 0.073 0.03
[133] 0.083 0.080 0.068 0.077 0.077 0.08 0.06 0.066 0.10 0.111 0.079 0.07
[145] 0.037 0.067 0.071 0.07 0.100 0.071 0.038 0.07 0.075 0.035 0.100 0.036
[157] 0.058 0.035 0.09 0.079 0.08 0.11 0.08 0.08 0.111 0.037 0.051 0.0
[169] 0.07 0.053 0.080 0.0 0.059 0.055 0.05

And the stem-and-leaf plot:

  The decimal point is 2 digit(s) to the left of the |

   1 | 9
   2 | 77889999
   3 | 0355556666667788899
   4 | 1111333666778899
   5 | 00111133355566677899999
   6 | 01111133355566777788889
   7 | 01113335777778888999999
   8 | 0000111335699
   9 | 00356677799
  10 | 00155
  11 | 115

Box plot of ozone:

[Figure: box plot of ozone (ppm)]
b. Soil lead and zinc data (area of interest in the Netherlands; see the next handout in R). You can access these data at:
http://www.stat.ucla.edu/~nchristo/statistics13/soil.txt

Histogram of lead:

[Figure: Histogram of soil lead; horizontal axis: Lead (ppm); vertical axis: Frequency]

Histogram of log(lead):

[Figure: Histogram of soil log(lead); horizontal axis: Log_lead; vertical axis: Frequency]