4. ANALYSING FREQUENCY TABLES

Categorical (nominal) data are usually summarized in frequency tables. Continuous numerical data may also be grouped into intervals, and the frequency of observations in each interval may also be summarized in a frequency table (or in a histogram; see the earlier lab on "Exploring and Describing Data"). In this lab we will explore two kinds of frequency tables and the ideas they may be used to test.

One-way Frequency Table

The first type of frequency table lists the number of observations in different categories of a single list. An example is the following table evaluating how good humans are at choosing random numbers. The data are from the early years of a US State Lottery, in which players would buy a ticket and choose any number they wanted between 000 and 999. Winnings would be divided between all holders of the winning number, which was chosen randomly. The following data are based on a random sample of 100 players of the Lottery (these are not the winning numbers, but rather they are the numbers selected by players). Listed are the frequencies of numbers chosen that have 0 to 9 as the first digit:

First digit of chosen number    Frequency (f)
0                                 4
1                                16
2                                14
3                                15
4                                13
5                                 8
6                                 9
7                                 7
8                                 8
9                                 6
Total                           100

These frequencies may be compared to those predicted by different hypotheses. For example, when players pick a number between 000 and 999, are some first digits more popular than others? It looks like numbers beginning with 0 are unpopular, and those beginning with 1 through 4 are excessively popular. A goodness of fit test is an appropriate method for testing these data against the null hypothesis that there is no preference for different digits in the population.

Two-way (Contingency) Table

The second kind of frequency table is the two-way table, or contingency table. Here, every observation is cross-classified by two category variables instead of just one. The usual goal is to test whether the true (population) relative numbers of individuals falling into the different classes for one variable are the same regardless of individual values for the second variable.
An example given below lists the number of survivors and non-survivors in two classes of mountaineers descending from the peak of Mount Everest between 1978 and 1999: those using supplemental oxygen, and those descending without supplemental oxygen. (Most deaths on Mount Everest occur during the descent, not the ascent.)

Survival                   Used supplemental oxygen   Did not use supplemental oxygen
Survived descent           1045                       88
Did not survive descent      32                        8
Total                      1077                       96

(data from Huey and Eguskitza 2000, JAMA 284: 181)

In this case we are interested in knowing whether the relative numbers of survivors and non-survivors depend on whether or not supplemental oxygen was used. This is not an experimental study, so we are unable to test whether a difference in survival between classes is caused by oxygen use, but at least we can decide whether supplemental oxygen and survival are associated. The null hypothesis is once again the skeptical point of view: survival and oxygen use are not associated with one another (i.e., survival and oxygen use are independent). A test of differing survival frequencies between the two categories of mountaineers is carried out using a contingency test.

Hypothesis Testing

Forming and testing hypotheses is one of the most basic endeavors in statistical analysis of biological data. With your notes and the course textbook, review your knowledge of the following concepts:

- null hypothesis (H_o) and alternate hypothesis (H_a)
- Type I errors and Type II errors
- significance level
- degrees of freedom

Test Statistics for Goodness of Fit and Contingency Tests

The chi-squared statistic, χ², is a measure of discrepancy between observed and expected frequencies, where expected frequencies are those expected under the null hypothesis. A second measure of discrepancy is the G-statistic (the log likelihood ratio):

$$\chi^2 = \sum_i \frac{(f_i - \hat{f}_i)^2}{\hat{f}_i}$$

$$G = 2 \sum_i f_i \ln\left(\frac{f_i}{\hat{f}_i}\right)$$

where f_i is the observed frequency and \hat{f}_i is the expected frequency in class i.
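As a rough check on the two discrepancy measures, the following Python sketch computes both for the Lottery table under the null hypothesis of no digit preference (an expected frequency of 100/10 = 10 per digit). Python is not part of this lab, and the function names are illustrative assumptions; the sketch is meant only for checking hand or calculator work.

```python
import math

def chi_squared(observed, expected):
    """Pearson chi-squared: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def g_statistic(observed, expected):
    """Log likelihood ratio G: 2 * sum of observed * ln(observed / expected).

    Classes with an observed count of zero contribute nothing to the sum.
    """
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

# Lottery data from the one-way table: first digits 0 through 9.
observed = [4, 16, 14, 15, 13, 8, 9, 7, 8, 6]
n = sum(observed)                                # 100 players in the sample
expected = [n / len(observed)] * len(observed)   # no preference: 10 per digit

chi2 = chi_squared(observed, expected)
g = g_statistic(observed, expected)
print(f"chi-squared = {chi2:.3f}, G = {g:.3f}")
```

The two statistics are computed from the same observed and expected frequencies but weight the discrepancies differently, so they will generally not be identical.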
Under the null hypothesis both statistics have a distribution that conforms approximately to the theoretical chi-squared distribution. The degrees of freedom will usually be k - 1, where k is the number of classes of the category variable, except in special situations to be dealt with later in the course.

Analogous statistics contrast observed and expected frequencies in contingency tables. Here, however, the expected frequencies are based on the null hypothesis that relative frequencies are the same in each set. The expected frequency for row i and column j in the contingency table is obtained as

$$\hat{f}_{ij} = \frac{R_i C_j}{N}$$

where R_i and C_j are the row and column totals, respectively, and N is the grand total number of independent observations.

With your notes and the course textbook, review your knowledge of the following concepts:

- independence
- rules of thumb for low expected frequencies in chi-square tests
- Yates correction for continuity [JMP IN does not employ this correction]
- Fisher's exact test

Using the program

In the case of one-way tables, only a single categorical variable is required (e.g., "First digit of chosen number"). Two categorical variables are needed for a two-way (contingency) table (e.g., "Use of supplemental oxygen" and "Survival"). Make sure that after entering the data, the category variable(s) have the nominal attribute (this can be reset in the columns section of the left frame, or by selecting Column Info in the Cols pull-down menu). The observed frequencies may be entered directly into a new column (call it "observed frequency" or "number of observations").

To produce a bar graph of frequencies from a one-way table, use the Distribution menu option and select the categorical variable as your Y column in the pop-up window. In the same window you also need to select the observed frequency column as your Freq variable.

To carry out a goodness of fit test, click the red symbol next to the categorical variable name above the bar graph and select Test Probabilities. This action will open a new display box below the frequency table in the Distribution output window.
Here you will need to enter the expected frequencies for your test. Click on each row and enter either the expected frequency or the expected proportion for that row (it doesn't matter which, as long as you are consistent; the goodness of fit test will be carried out using the expected frequencies in either case). Unfortunately, JMP IN doesn't display the expected frequencies it uses to calculate the test statistic, so these will be lacking if you have simply entered the expected proportions. In this case you will be unable to ensure that the expected frequencies are large enough to fulfill the assumptions of the χ² goodness of fit test. To calculate expected frequencies you will need to use your own calculator, or better yet the JMP IN calculator.
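For readers who want to check that arithmetic outside JMP IN, here is a minimal Python sketch that converts expected proportions into expected frequencies and summarizes how many of them are small. The function names are hypothetical, and the threshold used (no expected frequency below 1, and no more than 20% of them below 5) is one commonly quoted rule of thumb, assumed here for illustration; use whichever rule your notes and textbook give.

```python
def expected_from_proportions(proportions, total):
    """Convert expected proportions into expected frequencies."""
    if abs(sum(proportions) - 1.0) > 1e-6:
        raise ValueError("expected proportions should sum to 1")
    return [p * total for p in proportions]

def small_expected_summary(expected):
    """Return (number of expected < 1, fraction of expected < 5)."""
    below_1 = sum(1 for e in expected if e < 1)
    below_5 = sum(1 for e in expected if e < 5)
    return below_1, below_5 / len(expected)

# Lottery example: 100 players, equal expected proportion for each first digit.
proportions = [0.1] * 10
expected = expected_from_proportions(proportions, total=100)
below_1, frac_below_5 = small_expected_summary(expected)

# Assumed rule of thumb (check it against your notes): no expected
# frequency below 1 and no more than 20% of them below 5.
acceptable = below_1 == 0 and frac_below_5 <= 0.20
print(expected, acceptable)
```

The same two functions apply unchanged to unequal proportions, such as expected frequencies that scale with the size of each class.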
To produce a mosaic plot for a two-way (contingency) table, use the Fit Y by X menu option. In the pop-up window, select one of the categorical variables as your Y column and the other as your X column. Once again, select the observed frequency column as your Freq variable. A two-way table will also appear beneath the mosaic plot, giving the observed frequencies (the program will also display the expected frequencies, but you need to select this option by pressing the red symbol next to the Contingency Table title). Unfortunately, JMP IN does not include the Yates correction for continuity when the G-test and chi-square (Pearson) tests are carried out on 2x2 tables (you will need to include the Yates correction with 2x2 tables on your assignments and written exams). However, it does include the Fisher exact test, which you can use to validate the results of the chi-square and G tests.

One-way and two-way frequency tables can be constructed from raw data on individual subjects using the Tables -> Summary option in the pull-down menu or by selecting Summary in the Tables tab on the JMP Starter. In the pop-up menu choose one (for one-way tables) or two (for two-way tables) categorical variables and click the Group button. Then click the Statistics button in the same window and select N. When you click OK a new data table will appear that tallies the frequency of observations corresponding to each category or combination of categories.

Problems

1. Enter the Lottery data given above and generate the corresponding bar graph.

a) Examine the bar graph. Do the frequencies appear to vary greatly between classes?

b) Carry out a statistical test of the hypothesis that players favor some first digits over others when choosing a number between 000 and 999. In your work, present all steps (i.e., state hypotheses, give the P-value, the significance level for the test, and state your conclusion). Since the computer provides the P-value directly, there is no need to provide the critical value from the tables in Zar.

c) Compare the results for the chi-squared (Pearson) test to those for the G test (Likelihood ratio). Why are they different?

d) Compare the results from your visual appraisal of the data to the goodness of fit tests. Which approach provides qualitative information and which one provides quantitative information? What level of uncertainty is associated with those quantitative probabilities?

e) What are the degrees of freedom for these tests? Why do we lose a degree of freedom?

f) Why would it be necessary to alter the analysis if expected values are small?

2. A physical gene map of the human genome was published in 1998 that contained the estimated locations of 30075 human genes. The table below lists the estimated number of genes on each chromosome. The second column lists the fraction of the total human genome made up by each chromosome. For example, the X chromosome constitutes a little more than 5% of the total genome size. These data are in the data file genemap98.jmp on the shared directory.

Chromosome   Proportion of total genome   Observed number of genes
1            0.0834                       3114
2            0.0809                       2257
3            0.0679                       2015
4            0.0644                       1478
5            0.0615                       1529
6            0.0580                       1893
7            0.0542                       1594
8            0.0492                       1206
9            0.0460                       1248
10           0.0457                       1371
11           0.0457                       1755
12           0.0453                       1585
13           0.0311                        703
14           0.0295                       1047
15           0.0282                       1029
16           0.0311                        849
17           0.0292                       1263
18           0.0269                        523
19           0.0212                       1114
20           0.0228                        758
21           0.0123                        305
22           0.0136                        565
X            0.0520                        874

Data from Deloukas et al (1998). A physical map of 30,000 human genes. Science 282:744-746 (see http://www.ncbi.nlm.nih.gov/genemap98/page.cgi?f=genedistrib.html).

a) Display the estimated numbers of genes on different chromosomes using a bar graph and mosaic plot. Describe the differences between chromosomes. Which chromosomes have the most genes? Which have the fewest?

b) We would not expect each chromosome to have the same number of genes because chromosomes differ in size. Use the variable "Proportion of total genome" to calculate the expected number of genes on each chromosome, taking into account chromosome size differences. (Create a new column to receive the expected frequencies, and use the JMP IN calculator to compute them. The total number of genes is 30075. Call the new variable "Expected No. genes".) Generate a bar graph for the expected frequencies and place it beside the bar graph for the observed numbers of genes. Do larger chromosomes tend to have more genes?

c) Generate a new column and compute the following quantity for each chromosome using the JMP IN calculator:

$$\frac{(\text{Observed No. genes}) - (\text{Expected No. genes})}{\sqrt{(\text{Expected No. genes})}}$$

This quantity (sometimes called a z-score) measures the difference between observed and expected frequencies scaled by the square root (√) of the expected frequencies. On this scale, which chromosomes have a dramatic deficiency of genes for their size? Which chromosomes have the most dramatic excesses?

d) Use the observed and expected frequencies to test the null hypothesis that gene number is determined purely by chromosome size. To have the program do this for you automatically you will need to enter the expected frequencies for each chromosome, one at a time. Alternatively, you could use the JMP IN calculator to compute the chi-square statistic directly. This is easily done by squaring the quantities calculated in (c) and summing them up (Col Sum is an option in the Statistical functions provided in the Functions panel of the JMP IN calculator window). You can also use the calculator to provide you with the P-value for the calculated χ² statistic using Probability -> ChiSquareDistribution, or you can look up the appropriate critical value in Zar.

Note: The substantially lower than expected gene density on the X chromosome might result from expression bias. Gene expression from the X chromosome is reduced because in females the second copy of the X chromosome is inactivated, and in males a second X is lacking (males are XY). Genes with reduced expression are more difficult to detect by the method the researchers used to find them.

3. Enter the Mount Everest mountaineer survival and supplemental oxygen data from the above table into a JMP IN data table. The most useful way to do this is to create a new data table with four rows and three columns.
Call the first column "Survival" and the second column "Oxygen use". Enter the four combinations of these two variables into the four rows. Finally, put the observed frequencies into the third column.

a) Inspect the mosaic plot for these data. Describe the pattern in words. Are the relative frequencies of individuals surviving similar or different in the two oxygen groups?

b) Test whether survival of mountaineers descending from Mount Everest is significantly associated with use of supplemental oxygen. Show all steps in your work (a good habit, as always).

c) Why does the P-value for Fisher's exact test differ from that of the Pearson χ² and the G tests?

d) Repeat the calculation of the Pearson χ² by hand. Did you obtain the same number as JMP IN? Why?

e) Do the expected frequencies satisfy the assumptions of the chi-square test? What strategy do you recommend?

f) The authors who compiled the Everest data also presented results from the teams of mountaineers descending Everest (climbers tend not to go alone). These data are given below. Which data set is the most appropriate to test an association between survival and supplemental oxygen? Why?

Survival                        Used supplemental oxygen   Did not use supplemental oxygen
All team members survived       85                         24
At least one team member died    8                          4
Total                           93                         28

g) Should we conclude from the test in (f) that supplemental oxygen has no effect on survival?

h) The same authors also compiled similar data for K2, a nearby summit in the Himalayas. Analyse these data in the same way as for Mount Everest. Are the results the same as those in (f)?

Survival                        Used supplemental oxygen   Did not use supplemental oxygen
All team members survived       12                         24
At least one team member died    0                         12
Total                           12                         36

4. Open the data file student_data.jmp from the shared directory. This file records the data taken from Biology 300 students on the first day of class, January 2001. The variables are:

height, Student height in cm
hand, Student handedness (left or right; "both" was classified as left)
parent.first, Parent listed first by student when giving their heights (mom or dad)
mom.height, Student's mother's height, in cm
dad.height, Student's father's height, in cm
mom.hand, Whether mother is left or right-handed
sex, Whether student is male or female

a) Use Distribution to test whether male and female students occur with equal frequency in the Bio 300 class. Note that in the pop-up window you will not need to specify a column for the Freq button because you are working now with the raw data instead of the frequency table.

b) Use an appropriate method to test whether there is a statistical association between handedness of student (left or right-handed) and that of his/her mother.

c) Use Tables to generate a two-way (contingency) table for handedness of student and mother. This method shows how JMP IN may be used to construct frequency tables from raw data.

d) Some students listed their dad first when giving their height, whereas some students listed their mother first. Does this depend on the sex of the student?
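
The tallying of raw data into a two-way table and the hand calculation of the Pearson χ² can both be cross-checked outside JMP IN. The Python sketch below is an illustration only, not part of the lab software. It tallies one record per individual into a contingency table, computes expected frequencies as (row total × column total) / grand total, and computes the Pearson χ² with and without the Yates correction. All function names are hypothetical. The P-value helper assumes 1 degree of freedom (a 2x2 table), where the upper-tail chi-squared probability equals erfc(sqrt(χ²/2)); it is not valid for larger tables. The records are rebuilt from the Everest table given earlier in this lab.

```python
import math
from collections import Counter

def cross_tabulate(records):
    """Tally (row category, column category) pairs into a two-way table."""
    counts = Counter(records)
    rows = sorted({r for r, _ in counts})
    cols = sorted({c for _, c in counts})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

def expected_frequencies(table):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chi_squared(table, yates=False):
    """Pearson chi-squared; yates=True subtracts 0.5 from each |O - E|.

    The Yates correction is intended for 2x2 tables only.
    """
    expected = expected_frequencies(table)
    total = 0.0
    for obs_row, exp_row in zip(table, expected):
        for o, e in zip(obs_row, exp_row):
            diff = abs(o - e)
            if yates:
                diff = max(diff - 0.5, 0.0)  # never correct past zero
            total += diff ** 2 / e
    return total

def p_value_df1(chi2):
    """Upper-tail chi-squared probability for 1 degree of freedom only."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Rebuild one record per climber from the Everest table in this lab.
records = (
    [("survived", "oxygen")] * 1045
    + [("survived", "no oxygen")] * 88
    + [("died", "oxygen")] * 32
    + [("died", "no oxygen")] * 8
)
rows, cols, table = cross_tabulate(records)
expected = expected_frequencies(table)
chi2_plain = pearson_chi_squared(table)
chi2_yates = pearson_chi_squared(table, yates=True)

print(rows, cols, table)
print(f"uncorrected: {chi2_plain:.3f} (P = {p_value_df1(chi2_plain):.4f})")
print(f"Yates:       {chi2_yates:.3f} (P = {p_value_df1(chi2_yates):.4f})")
```

Categories are sorted alphabetically, so the row and column order in the output may differ from the order used in the tables above; the statistic itself does not depend on that order.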