Corey S Brier, Department of Statistics, Yale University

STAT 625: 2000 Olympic Diving Exploration

Corey S Brier
Yale University

Abstract

This document contains an investigation of bias using data from the 2000 Olympic Diving Event. Please note that this report is not the full diving report; it contains only those sections relating to bias, as per the assignment for 9/26.

1 Data import and formatting

The data are provided in an easy-to-use CSV file, so we may import them directly.

> library(YaleToolkit)
> library(png)
> some <- function(data, n = 7, replace = FALSE) {
    # Return n randomly sampled rows of data
    sel <- sample(1:dim(data)[1], n, replace)
    return(data[sel,])
  }
> setwd("c:/users/corey/documents/yale/s3/625/week4")
> data <- read.csv("diving2000.csv", as.is = T)
> whatis(data)
   variable.name      type missing distinct.values precision
1          Event character       0               4        NA
2          Round character       0               3        NA
3          Diver character       0             156        NA
4        Country character       0              42        NA
5           Rank   numeric       0              49       1.0
6         DiveNo   numeric       0               6       1.0
7     Difficulty   numeric       0              20       0.1
8         JScore   numeric       0              21       0.1
9          Judge character       0              25        NA
10      JCountry character       0              21        NA
                 min           max
1             M10mPF         W3mSB
2              Final          Semi
3  ABALLI Jesus-Iory ZHUPINA Olena
4                ARG           ZIM
5                  1            49
6                  1             6
7                1.5           3.8
8                  0            10
9         ALT Walter  ZAITSEV Oleg
10               AUS           ZIM

It is useful to change some data types and add a new column for gender. First, we convert the event and round to factors:

> data$Event <- as.factor(data$Event)
> data$Round <- as.factor(data$Round) # This could be left as character

Let's add a column for gender:

> menloc <- (data$Event == "M3mSB") | (data$Event == "M10mPF")
> femaleloc <- !menloc
> data$sex[menloc] <- "M"
> data$sex[femaleloc] <- "F"
> data$sex <- factor(data$sex)

Each row of the data corresponds to a score for a dive, not to a particular contestant, so we expect some amount of clustering. It could be useful to get all rows for a particular diver, so let's assign each distinct diver a different number:

> data$divernumber <- rep(NA, length(data$Diver))
> for (i in 1:length(unique(data$Diver))) {
    dname <- (unique(data$Diver))[i]
    data[data$Diver == dname,]$divernumber <- i
  }

Also, for each dive, let us compute the average score and add it back into our data set. Each dive occupies seven consecutive rows (one per judge), so we can use a vectorized method and avoid an unnecessary loop.

> dmeans <- apply(matrix(data$JScore, ncol = 7, byrow = T), 1, mean)
> data$avg <- rep(dmeans, each = 7)

[[Here, much content from my previous assignments was omitted]]

2 Initial bias considerations

The data include the countries of the divers as well as the countries of the judges. One possible analysis is to search for bias, such as a judge giving preferential treatment to a competitor from his or her own country. Although this section is not a complete analysis, we present some preliminary steps. First, it makes sense to find out whether any judge actually evaluated a competitor from their own country:
> finalsdata <- data[data$Round == "Final",]
> sum(as.numeric(finalsdata$Country == finalsdata$JCountry))
[1] 0
> prelimdata <- data[data$Round == "Prelim",]
> sum(as.numeric(prelimdata$Country == prelimdata$JCountry))
[1] 201
> semidata <- data[data$Round == "Semi",]
> sum(as.numeric(semidata$Country == semidata$JCountry))
[1] 113

Although a single diver is represented on multiple rows of our data set, each row corresponds to one judge's score for one dive, so these counts are numbers of scores and we do not need to worry about over-counting. The results are clear: no judge scored a diver from his or her own country in the finals, but judges did so in the preliminary round (201 scores) and in the semi-final round (113 scores).

An additional option is to split the data into the scores where the diver's country and the judge's country were the same and the scores where they were not, to allow for a comparison:

> samecountry <- data[data$Country == data$JCountry,]
> diffcountry <- data[!(data$Country == data$JCountry),]
> summary(samecountry$JScore)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.000   7.000   7.500   7.462   8.500  10.000
> summary(diffcountry$JScore)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.000   6.000   7.000   6.814   8.000  10.000

Of course, the two groups are now very unbalanced, but the univariate summaries indicate that scores are higher in both mean (7.462 versus 6.814) and median (7.5 versus 7.0) when a judge evaluated a diver from his or her own country. However, it is not yet clear how significant this relationship is.

3 More Bias Investigation

Above, we already saw the number of scores, within each round, that corresponded to a judge grading a diver from his or her own country. To find the number of judges (as opposed to the number of dives or the number of scores) who judged someone from their own country, we do:

> length(unique(data[data$Country == data$JCountry,]$Judge))
[1] 17

Let's see how many unique judges there are in total:

> length(unique(data$Judge))
[1] 25
> 17/25
[1] 0.68

So, 68% of judges scored someone from their own country. We now stop focusing on McFarland specifically and instead consider all 17 of those judges. Of course, we have 25 - 17 = 8 judges who did not grade anyone from their own country and thus could serve as a basis for comparison.

4 All biases: for 9/24 and 9/26

For each judge except McFarland, we first make two box plots. The first box plot displays the scores given by a judge to divers of their own country (having already subtracted the overall discrepancy for that judge). We can then compare this to the second box plot, which displays, for those same dives, the average of the scores given by all seven judges. Notice that the box plots are directly comparable because they both describe scores for the same dives.

Next, we create the scatter plot by taking the (discrepancy-adjusted) scores of a judge for divers of their own country and subtracting the average of the scores from all seven judges for that dive. Thus, a point on our scatter plot corresponds to the difference between the score of the judge of interest and the average score for that dive. It follows that the x-axis is just an index, and a permutation of these values should not change our interpretation. If a judge were totally unbiased, we would expect a difference of zero (green line); the actual average difference is plotted with a red line.

> diffmax <- 0
> #par(mfrow = c(4,4), mar = c(2,1,1,1))
> for (i in 1:length(unique(data[data$Country == data$JCountry &
                                 data$Judge != "McFARLAND Steve",]$Judge))){
    # i <- 3
    thisjudge <- unique(data[data$Country == data$JCountry &
                             data$Judge != "McFARLAND Steve",]$Judge)[i]
    thiscountry <- unique(data[data$Judge == thisjudge &
                               data$Judge != "McFARLAND Steve",]$JCountry)
    # For thisjudge, we also want to get their overall discrepancy (in case
    # they always score high or low in general)
    judgediscrep <- mean(data[data$Judge == thisjudge,]$JScore -
                         data[data$Judge == thisjudge,]$avg)
    # Let's get the data for those divers from this judge's country
    datasubset <- data[data$Country == thiscountry & data$Judge == thisjudge,]
    # So, for each dive, we have the average score of the judges, and the score
    # for the judge of the same country
    # Let's compare these:
    diffs <- (datasubset$JScore - judgediscrep) - datasubset$avg
    # Track the largest absolute difference seen across all judges
    if (max(abs(diffs)) > diffmax) {
      diffmax <- max(abs(diffs))
    }
    png(paste("p", i, ".png", sep = ""))
    #h <- hist(diffs, plot = F, probs = T, breaks = seq(-1.5,1.5, by = .25))
    nf <- layout(matrix(c(1,2), 1, 2, byrow = TRUE), c(1,2), c(2,1), TRUE)
    par(mar = c(0,2.5,1,2), mgp = c(0,0,0), oma = c(0,0,0,0))
    #barplot(h$counts, axes = T, xlim = c(0, max(h$counts)),
    #        space = 0, horiz = TRUE)
    boxplot(datasubset$JScore - judgediscrep, datasubset$avg,
            names = c("judge", "Average"), ylab = "Scores", las = 2)
    par(mar = c(0,2,2,1), mgp = c(0,0,0))
    plot(1:dim(datasubset)[1], diffs, ylim = c(-1.5,1.5),
         main = paste(thisjudge, " ", thiscountry,
                      "\n Difference in judge score and average"),
         sub = "", xlab = "")
    abline(h = 0, col = "green")
    # Mean difference: red (2) when it is positive, blue (4) otherwise
    abline(h = mean(diffs), col = 4 - 2*as.numeric(mean(diffs) > 0))
    dev.off()
  }

This code produces 16 separate PNG plots (one per judge, excluding McFarland), which are then stitched together in Mathematica. The usual par(mfrow = c(2,2)) type of command does not work here because the call to layout overrides the graphics canvas. The stitched image is then stored as biasimage.pdf.
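The adjusted difference shown in each scatter plot can be illustrated on a small made-up example. The following is only a sketch and is not part of the original analysis: the data frame toy, its Dive column, the judge labels "A", "B", "C" and every score in it are invented for illustration; only the column names Country, Judge, JCountry, JScore and avg mirror those of the diving data.

```r
# Toy data (hypothetical values): 3 dives, each scored by the same 3 judges.
toy <- data.frame(
  Dive     = rep(1:3, each = 3),
  Country  = rep(c("USA", "CHN", "USA"), each = 3),
  Judge    = rep(c("A", "B", "C"), times = 3),
  JCountry = rep(c("USA", "CHN", "GER"), times = 3),
  JScore   = c(8.0, 7.0, 7.5,
               7.0, 7.5, 6.5,
               9.0, 8.0, 8.5),
  stringsAsFactors = FALSE
)

# Average score per dive, repeated on every row of that dive
toy$avg <- ave(toy$JScore, toy$Dive)

thisjudge <- "A"
isjudge <- toy$Judge == thisjudge

# Overall discrepancy: how far this judge sits from the panel average,
# taken over all of the judge's scores (own country or not)
judgediscrep <- mean(toy$JScore[isjudge] - toy$avg[isjudge])

# Scores this judge gave to divers from the judge's own country
own <- toy[isjudge & toy$Country == toy$JCountry, ]

# Adjusted difference: remove the judge's general tendency, then compare
# with the panel average for the same dive
diffs <- (own$JScore - judgediscrep) - own$avg
print(judgediscrep)
print(diffs)
```

A positive value of diffs means that, even after allowing for the judge's general tendency to score high or low, the judge scored the own-country dive above the panel average.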
5 T-tests for all judges

To conduct t-tests, it may be more reasonable to compare the judge's score with the average of the other six scores for that dive, that is, the average excluding that judge. Thus, we compare the scores that a judge gave to divers of their own country with the average of the scores the other judges gave to those same dives. Because both values refer to the same dive, it is appropriate to use a paired t-test. We also store the interesting results of the t-tests in a data frame.

> pvals <- rep(1, 17)
> judges <- rep(NA, 17)
> jcountrys <- rep(NA, 17)
> meandiff <- rep(NA, 17)
> for (i in 1:length(unique(data[data$Country == data$JCountry,]$Judge))){
    thisjudge <- unique(data[data$Country == data$JCountry,]$Judge)[i]
    thiscountry <- unique(data[data$Judge == thisjudge,]$JCountry)
    judgediscrep <- mean(data[data$Judge == thisjudge,]$JScore -
                         data[data$Judge == thisjudge,]$avg)
    # Let's get the data for those divers from this judge's country
    datasubset <- data[data$Country == thiscountry & data$Judge == thisjudge,]
    # Recreate the average score for a dive, this time without the judge from
    # the country of interest (remove this judge's score from the 7-judge mean):
    datasubset$newavg <- (7*datasubset$avg - datasubset$JScore)/6
    mytest <- t.test(datasubset$JScore - judgediscrep, datasubset$newavg,
                     paired = T)
    pvals[i] <- mytest$p.value
    meandiff[i] <- mytest$estimate[1]
    judges[i] <- thisjudge
    jcountrys[i] <- thiscountry
  }
> ttests <- data.frame(judges, jcountrys, pvals, meandiff)
> ttests
                    judges jcountrys        pvals   meandiff
1             WANG Facheng       CHN 7.610458e-01 0.02768835
2               MENA Jesus       MEX 4.461937e-04 0.33592012
3             ZAITSEV Oleg       RUS 1.254460e-05 0.31065110
4          McFARLAND Steve       USA 2.432926e-04 0.21151337
5               ALT Walter       GER 8.866274e-06 0.42253586
6        BARNETT Madeleine       AUS 9.913533e-04 0.30262637
7         BOOTHROYD Sydney       GBR 2.083509e-02 0.32512165
8  RUIZ-PEDREGUERA Rolando       CUB 2.033477e-02 0.32011682
9               CRUZ Julia       ESP 2.258140e-03 0.34509112
10           BOYS Beverley       CAN 2.619120e-02 0.25101602
11         BOUSSARD Michel       FRA 1.762819e-01 0.10785511
12         BURK Hans-Peter       GER 1.465126e-02 0.49622642
13               XU Yiming       CHN 3.476949e-03 0.29745429
14            SEAMAN Kathy       CAN 1.392309e-01 0.17250784
15     GEISSBUHLER Michael       SUI 3.665481e-02 0.78543720
16             HUBER Peter       AUT 6.014372e-02 0.35429911
17          CALDERON Felix       PUR 8.490038e-02 0.33650130

We see a range of p-values. Of course, we need to be very careful when we run many simultaneous t-tests, because we expect that natural variation in the data will produce some more or less extreme results by chance alone. Nonetheless, we see that some judges appear much more biased than others.

An alternative option would be to conduct a t-test that compares two sets of differences: the judge's score minus the average of the other judges' scores for divers from the judge's own country, versus the same difference for the dives by divers from other countries. Both the above t-test and this one are set up to account for a judge's overall enthusiasm. The latter test would be unbalanced, but would include many more data points.
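The alternative comparison described above could be set up roughly as follows. This is a minimal sketch under stated assumptions, not an analysis of the diving data: the vectors own_diff and other_diff are simulated, and their sizes, means and standard deviations are invented purely to show the unbalanced two-sample layout. The test used is Welch's two-sample t-test, which is the default behaviour of t.test when var.equal is left at FALSE.

```r
set.seed(625)  # arbitrary seed so the simulated example is reproducible

# Simulated (not real) differences for one hypothetical judge:
# judge's score minus the average of the other six judges' scores.
own_diff   <- rnorm(12,  mean = 0.3, sd = 0.4)  # dives by own-country divers
other_diff <- rnorm(150, mean = 0.0, sd = 0.4)  # dives by all other divers

# Unpaired comparison of the two sets of differences. The groups have very
# different sizes, so we keep the default Welch correction rather than
# assuming equal variances.
alt_test <- t.test(own_diff, other_diff)

print(alt_test$estimate)  # mean difference in each group
print(alt_test$p.value)
```

Because each difference is already taken relative to the other judges on the same dive, a judge who scores uniformly high or low shifts both groups equally, so the comparison between the groups is unaffected by that general tendency.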