Biostatistics & SAS programming

Kevin Zhang
March 6, 2017
ANOVA

Two groups only

Independent-groups t test
- Each subject belongs to only one group and is observed only once.
- Thus the observations from different groups reflect different subjects.

Paired t test
- Each subject is observed twice.
- Values from the two groups are paired, and thus correlated.

PROC TTEST handles both cases.
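As a rough sketch of the two cases: the data set names two_groups and before_after and the variables y, group, before, and after are hypothetical, made up for illustration and not taken from the slides.

```sas
/* Independent groups: one row per subject, with group membership
   stored in a classification variable (hypothetical names). */
proc ttest data=two_groups;
    class group;          /* must have exactly two levels */
    var y;                /* the measured value */
run;

/* Paired groups: one row per subject, with both measurements
   on the same row (hypothetical names). */
proc ttest data=before_after;
    paired before*after;  /* tests the mean of before - after */
run;
```

The layout of the data decides which form applies: a CLASS variable for independent groups, two columns on the same row for paired observations.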

More than 2 groups???

In a clinical trial, 3 newly approved medications (A, B, and C) are given to 3 groups of diabetic volunteers. We wish to know which one is most effective at reducing the glucose level. How do we compare them?

Med1.GLU  Med2.GLU  Med3.GLU
79        86        66
78        75        71
75        75        65
90        81        63
83        88        68
79        71        66
75        87        66
81        80        62
71        84        65
78        84        64
73                  68
72                  66
76                  62
73                  64
75                  65
                    65
                    65
                    64

ANalysis Of VAriance (ANOVA)

Why do we deal with variance?
- A sequence of constant values has a variance of 0.
- Variance reflects the information contained in the sample!

Partition of variance by source
- In the beginning, we have no idea how to capture the information.
- We may propose a model, a classification, etc.
- Question: does the model or classification REALLY capture something from the sample? That is, does your model or classification make sense?

Total variance: ALL the information you collected
- Things we can control: the information caught by your model or classification
- Things out of control: random error

ANOVA table

Source                   DF                                  SS                                    MS
Model or classification  Number of parameters - 1            Variance caught by your model         Averaged variability due to the model/classification
Random errors            Sample size - number of parameters  Variance that is out of our control   Averaged variability due to randomness
Total                    Sample size - 1                     Total variance

The table has two more columns, the F test statistic and the P-value, which are filled in on the model row.

Yes, the F value is just a comparison: it shows whether your model/classification accounts for the major share of the information or not.
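For the one-way case used in the rest of these slides, the table entries can be written out explicitly. The notation is an assumption of this sketch rather than something defined on the slides: k groups, n_i observations in group i, N observations in total, y_ij for the j-th value in group i, \bar{y}_i for the mean of group i, and \bar{y} for the overall mean.

```latex
\begin{aligned}
SS_{\text{model}} &= \sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2,
  & DF_{\text{model}} &= k - 1,\\
SS_{\text{error}} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2,
  & DF_{\text{error}} &= N - k,\\
SS_{\text{total}} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2
  = SS_{\text{model}} + SS_{\text{error}},
  & DF_{\text{total}} &= N - 1,\\
MS &= \frac{SS}{DF},
  & F &= \frac{MS_{\text{model}}}{MS_{\text{error}}}.
\end{aligned}
```

A large F means the variability between group means is large compared with the variability inside the groups.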

Back to the glucose example

Looking at the three columns of glucose values from the earlier slide, we can see the difference for sure!

For the glucose data, the variability (SS) splits as follows:
- Total variability: 2712.418605
- Caught by the classification: 1993.507494
- Left for randomness: 718.911111

Degrees of freedom:
- Classification: you have 3 medications (classes), thus DF = 3 - 1 = 2.
- Total: the sample size is 43 (all volunteers), thus total DF = 43 - 1 = 42.
- Randomness: 42 - 2 = 40.

Filling the blanks

Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
Model     2      1993.507494    996.753747     55.46   <.0001
Error    40       718.911111     17.972778
Total    42      2712.418605

This tells us the classification is successful: the mean glucose level is not the same for all 3 medications, so at least one of them can be distinguished from the others.
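The Mean Square, F Value, and Pr > F columns follow from the SS and DF columns by simple arithmetic. A minimal sketch that recomputes them in a DATA step; the variable names are made up for illustration, and the results should agree with the table up to rounding.

```sas
/* Recompute Mean Square, F, and the p-value from SS and DF. */
data _null_;
    ss_model = 1993.507494;  df_model = 2;
    ss_error = 718.911111;   df_error = 40;

    ms_model = ss_model / df_model;   /* Mean Square = SS / DF */
    ms_error = ss_error / df_error;
    f_value  = ms_model / ms_error;   /* F = MS(model) / MS(error) */

    /* Upper-tail probability of the F distribution with (2, 40) DF */
    p_value  = 1 - probf(f_value, df_model, df_error);

    /* Write the results to the SAS log */
    put ms_model= 12.6 ms_error= 12.6 f_value= 8.2 p_value= pvalue6.4;
run;
```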

SAS programming: DATA step

We prefer the following structure for your data set:

Observed values   Classes
79                Med1
78                Med1
86                Med2

How to: reorganizing data sets

- Take column 1 (Med1) out as a separate data set, say Med1.
- Take column 2 (Med2) as the Med2 data set.
- Take column 3 as the Med3 data set.
- Stack Med1, Med2, and Med3 together.

Errr... more actions are needed:
- Add a label for each group.
- Change the value column name from Medx.GLU to one common name.

First few values of each column:

Med1.GLU: 79, 78, 75, 90, 83
Med2.GLU: 86, 75, 75, 81
Med3.GLU: 66, 71, 65, 63, 68

DATA step

We will do the same thing for all 3 columns, making sure the value column gets the same name in every data set: Glucose.

    data med1 (rename=(med1_glu=glucose) keep=med1_glu Medication);
        set Glu;
        Medication = "Medication 1";      /* Add the label to all values in this data set */
        if cmiss(med1_glu) then delete;   /* Trim the missing values in the imported data */
    run;

Notes:
- keep= lists med1_glu, not glucose: KEEP= is applied before RENAME=, so at that point the value column is still named med1_glu, and we want to keep the values for sure.
- Dealing with the missing values: the columns have different lengths, so the imported data contains missing cells. The IF statement trims those rows.

Result:

Glucose  Medication
79       Medication 1
78       Medication 1
75       Medication 1
90       Medication 1
83       Medication 1
79       Medication 1
75       Medication 1
81       Medication 1
71       Medication 1
78       Medication 1
73       Medication 1
72       Medication 1
76       Medication 1
73       Medication 1
75       Medication 1

Similar code for Med2 and Med3

    data med2 (rename=(med2_glu=glucose) keep=med2_glu Medication);
        set Glu;
        Medication = "Medication 2";
        if cmiss(med2_glu) then delete;
    run;

    data med3 (rename=(med3_glu=glucose) keep=med3_glu Medication);
        set Glu;
        Medication = "Medication 3";
        if cmiss(med3_glu) then delete;
    run;

Stack all 3 data sets

    data med;
        set med1 med2 med3;   /* Stack all 3 data sets */
    run;

Now it is ready!
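The same long layout can also be produced in a single DATA step with an array. This is an alternative sketch, not the approach taken on the slides. It assumes Glu holds the numeric columns med1_glu, med2_glu, and med3_glu as in the code above; the output name med_long is hypothetical.

```sas
/* Alternative sketch: reshape Glu from wide to long in one pass. */
data med_long (keep=Glucose Medication);
    set Glu;
    length Medication $ 12;
    array g{3} med1_glu med2_glu med3_glu;
    do i = 1 to 3;
        /* Skip the empty cells of the shorter columns */
        if not missing(g{i}) then do;
            Glucose    = g{i};
            Medication = catx(' ', 'Medication', i);   /* "Medication 1", ... */
            output;                                    /* one row per value */
        end;
    end;
run;
```

The rows come out interleaved by medication instead of stacked group after group. That makes no difference to an analysis that uses Medication as a classification variable.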

PROC ANOVA

    proc anova data=med;
        class Medication;
        model Glucose = Medication;   /* ANOVA: Target = Factor */
        means Medication / tukey;
    run;

- CLASS tells SAS which variable in the data set is used to classify the value column.
- MODEL states your classification model, i.e. the Medication column is used to classify the Glucose value column.
- MEANS with the TUKEY option is the post-hoc study: in case we find a difference, we compare the classes pairwise.
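It can help to look at the groups directly, alongside the ANOVA output. A minimal sketch, assuming the stacked data set med with columns Glucose and Medication built in the earlier steps; these two steps are not on the slides.

```sas
/* Group sizes, means, and standard deviations per medication */
proc means data=med n mean std;
    class Medication;
    var Glucose;
run;

/* Side-by-side boxplot: one box per medication */
proc sgplot data=med;
    vbox Glucose / category=Medication;
run;
```

The group sizes reported by PROC MEANS also confirm whether the missing cells were trimmed as intended.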

Homework

The CFO of a global company wishes to study the pay rates of employees in different regions. In payrate.csv he has summarized the pay rates of sampled employees from the US, Canada, Europe, Australia, Asia, and Africa branches.

Please analyze the values and answer:
- Do you think there is a significant difference between the branches? Why?
- Show the side-by-side boxplot.
- If you think the difference is significant, what is the relationship among the branches? Can you sort the branches from the largest pay rate to the smallest?
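A possible starting point for reading the file, as a sketch only. It assumes payrate.csv sits in the current working directory and has a header row; the output name payrate is hypothetical, and the layout of the file is not described on the slide.

```sas
/* Read the CSV file into a SAS data set (path and name are assumptions) */
proc import datafile="payrate.csv"
            out=payrate
            dbms=csv
            replace;
    getnames=yes;   /* take variable names from the first row */
run;

/* Inspect the imported columns before any reshaping */
proc contents data=payrate;
run;

proc print data=payrate (obs=10);
run;
```

If the file has one column per branch, like the glucose data, it needs to be reorganized into one value column and one class column before the analysis.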