STAT 625: 2000 Olympic Diving Exploration

Corey S Brier, Department of Statistics, Yale University

Abstract

This document contains a preliminary investigation of data from the 2000 Olympic diving events. In particular, we offer an explanation for the bimodality in the degree of difficulty. The assignment for 9/17 begins in Section 4.

1 Data import and formatting

The data are provided in an easy-to-use CSV file, so we can import them directly.

> library(YaleToolkit)
> some <- function(data, n = 7, replace = FALSE) {
+   sel <- sample(1:dim(data)[1], n, replace)
+   return(data[sel,])
+ }
> setwd("c:/users/corey/documents/yale/s3/625/week3")
> data <- read.csv("diving2000.csv", as.is = TRUE)
> whatis(data)
   variable.name      type missing distinct.values precision               min           max
1          Event character       0               4        NA            M10mPF         W3mSB
2          Round character       0               3        NA             Final          Semi
3          Diver character       0             156        NA ABALLI Jesus-Iory ZHUPINA Olena
4        Country character       0              42        NA               ARG           ZIM
5           Rank   numeric       0              49       1.0                 1            49
6         DiveNo   numeric       0               6       1.0                 1             6
7     Difficulty   numeric       0              20       0.1               1.5           3.8
8         JScore   numeric       0              21       0.1                 0            10
9          Judge character       0              25        NA        ALT Walter  ZAITSEV Oleg
10      JCountry character       0              21        NA               AUS           ZIM

It is useful to change some data types:

> data$Event <- as.factor(data$Event)
> data$Round <- as.factor(data$Round) # This could be left as character

Let's add a column for gender:

> menloc <- (data$Event == "M3mSB") | (data$Event == "M10mPF")
> femaleloc <- !menloc
> data$sex[menloc] <- "M"
> data$sex[femaleloc] <- "F"
> data$sex <- factor(data$sex)

Each row of the data corresponds to one judge's score for one dive, not to a particular contestant, so we expect some amount of clustering. It could be useful to retrieve all rows for a particular diver, so let's assign each distinct diver a different number:

> data$divernumber <- rep(NA, length(data$Diver))
> for (i in 1:length(unique(data$Diver))) {
+   dname <- (unique(data$Diver))[i]
+   data[data$Diver == dname,]$divernumber <- i
+ }

Also, for each dive, let us compute the average score and add it back into our data set. Each dive occupies seven consecutive rows (one per judge), so we use a vectorized method to avoid an unnecessary loop.

> dmeans <- apply(matrix(data$JScore, ncol = 7, byrow = TRUE), 1, mean)
> data$avg <- rep(dmeans, each = 7)

2 Graphical Exploration

We start with a simple histogram of the judges' scores:

[Figure: histogram of the judges' scores. x-axis: Judge's Score (0 to 10); y-axis: Frequency.]

We notice there is quite a bit of bimodality in the difficulty:

[Figure: histogram of dive difficulty. x-axis: Dive Difficulties (1.5 to 3.5); y-axis: Frequency.]

Constructing side-by-side box plots of the dive difficulties reveals that the difficulties of dives in the semifinal round are much lower than those in the final or preliminary rounds:

[Figure: box plots of Difficulty (1.5 to 3.5) by round: Final, Prelim, Semi.]

To confirm the suspicion that Round is a large source of the bimodality, we plot the difficulty against the judge's score, jittering each point to reduce the over-plotting. Additionally, all of the points from the semifinal round are colored red.
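The plotting code for these two figures is not shown in the report. A minimal sketch of how such plots could be produced follows. It runs on a small simulated data frame (`toy`, invented here purely for illustration) that borrows the column names of the diving data, so the values are not the Olympic results; `pt_col` is likewise a hypothetical helper name.

```r
# Toy stand-in for the diving data; the values are simulated, not real.
set.seed(1)
n <- 300
toy <- data.frame(
  Round  = factor(sample(c("Final", "Prelim", "Semi"), n, replace = TRUE)),
  JScore = round(runif(n, 0, 10) * 2) / 2   # scores in half-point steps
)
# Give semifinal dives lower difficulties, mimicking the pattern in the text.
toy$Difficulty <- ifelse(toy$Round == "Semi",
                         round(runif(n, 1.5, 2.1), 1),
                         round(runif(n, 2.4, 3.8), 1))

# Side-by-side box plots of difficulty by round.
boxplot(Difficulty ~ Round, data = toy, ylab = "Difficulty")

# Jittered scatter plot; semifinal points are drawn in red.
pt_col <- ifelse(toy$Round == "Semi", "red", "black")
plot(jitter(toy$Difficulty), jitter(toy$JScore), col = pt_col, pch = 20,
     xlab = "Difficulty (jittered)", ylab = "Judge's Score (jittered)")
```

On the real data, `toy` would simply be replaced by `data`.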

[Figure: scatter plot of Judge's Score (jittered, 0 to 10) against Difficulty (jittered, 1.5 to 3.5); semifinal points in red.]

This clearly indicates that the dives performed in the semifinal round had lower difficulties than those in the other two rounds. Knowledge of the exact dive requirements and scoring system for the 2000 Olympics would shed more light on why this is the case. Now, let us remove the data from the semifinal round and see whether any bimodality remains:

> datanosemi <- data[(data$Round != "Semi"),]
> hist(datanosemi$Difficulty, xlab = "Difficulty without semifinal round")

[Figure: histogram of difficulty with the semifinal round removed. x-axis: Difficulty without semifinal round (2.4 to 3.8); y-axis: Frequency.]

Looking at the above figure, bimodality certainly seems to be less of an issue, although some remains and may merit further investigation.

Next, we consider four plots in which the difficulty is plotted on the vertical axis. First (top-left), we construct box plots that contrast the four events. The men's events (left two box plots) seem to show slightly higher difficulties, so in the top-right plot we separate the men and women without considering the specific event. We see that there is perhaps a small difference, but nothing drastic. Of course, given the size of our data set, we should not be surprised if standard statistical tests indicated a significant effect. The bottom-left plot compares the difficulties across dive numbers for all contestants. There seems to be little difference, with perhaps higher-than-average difficulties on dive number six. Finally, the bottom-right plot shows rank versus dive difficulty. Each value on the horizontal axis has multiple dives, but what is more interesting is the cluster at the bottom, which seems to stop at the diver ranked number 20. The points that correspond to the semifinal round are colored red, and they in fact match this cluster. One possible explanation is that divers with a numerically higher rank (i.e., a lower position) participated only in the preliminary round.
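The code for the four-panel figure is not shown in the report. One way such a layout could be built is sketched below, assuming base graphics with `par(mfrow = ...)`. The data frame `toy` is simulated for illustration only (its `Round` labels are assigned at random among the better-ranked rows), and `old_par` is a hypothetical helper name.

```r
# Toy stand-in for the diving data; all values are simulated.
set.seed(2)
n <- 400
toy <- data.frame(
  Event      = factor(sample(c("M10mPF", "M3mSB", "W10mPF", "W3mSB"), n, replace = TRUE)),
  DiveNo     = sample(1:6, n, replace = TRUE),
  Rank       = sample(1:49, n, replace = TRUE),
  Difficulty = round(runif(n, 1.5, 3.8), 1)
)
# Sex is read off the first letter of the event code.
toy$sex <- factor(ifelse(substr(as.character(toy$Event), 1, 1) == "M", "M", "F"))
# In this toy data, some rows with rank 18 or better are marked as semifinal.
toy$Round <- ifelse(toy$Rank <= 18 & runif(n) < 0.3, "Semi", "Prelim")

old_par <- par(mfrow = c(2, 2))           # 2 x 2 grid of panels
boxplot(Difficulty ~ Event, data = toy)   # top-left: by event
boxplot(Difficulty ~ sex, data = toy)     # top-right: by sex
boxplot(Difficulty ~ DiveNo, data = toy)  # bottom-left: by dive number
plot(toy$Rank, jitter(toy$Difficulty),    # bottom-right: rank vs difficulty
     col = ifelse(toy$Round == "Semi", "red", "black"), pch = 20,
     xlab = "Rank", ylab = "Difficulty (jittered)")
par(old_par)                              # restore the previous layout
```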

[Figure: four panels with difficulty (1.5 to 3.5) on the vertical axis. Top-left: box plots by event. Top-right: box plots by sex (F, M). Bottom-left: box plots by dive number (1 to 6). Bottom-right: jittered difficulty against rank (0 to 50), with semifinal points in red.]

We can confirm that divers ranked 20th or worse participated only in the preliminary round as follows:

> table(data[data$Rank >= 20,]$Round)

 Final Prelim   Semi 
     0   3710      0 
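A complementary check, sketched here on a tiny invented data frame rather than the real file, is to look at the worst rank that appears in each round. The names `toy` and `worst_rank` are hypothetical and the values are made up for illustration.

```r
# Toy stand-in for the diving data; all values are invented.
toy <- data.frame(
  Rank  = c(1, 1, 1, 5, 5, 18, 18, 25, 40),
  Round = factor(c("Prelim", "Semi", "Final", "Prelim", "Semi",
                   "Prelim", "Semi", "Prelim", "Prelim"))
)

# Worst (largest) rank that appears in each round.
worst_rank <- tapply(toy$Rank, toy$Round, max)
print(worst_rank)

# Cross-tabulate "ranked 20th or worse" against round.
print(table(toy$Rank >= 20, toy$Round))
```

If only the preliminary round reaches ranks of 20 or worse, the cutoff between rounds is visible directly in `worst_rank`.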

3 Considering the judges and the scoring

The data include the divers' countries as well as the judges' countries. One possible analysis might search for bias, such as a judge giving preferential treatment to a competitor from his or her own country. Although this section is not a complete analysis, we present some preliminary steps. First, it makes sense to find out whether any judge actually evaluated a competitor from their own country:

> finalsdata <- data[data$Round == "Final",]
> sum(as.numeric(finalsdata$Country == finalsdata$JCountry))
[1] 0
> prelimdata <- data[data$Round == "Prelim",]
> sum(as.numeric(prelimdata$Country == prelimdata$JCountry))
[1] 201
> semidata <- data[data$Round == "Semi",]
> sum(as.numeric(semidata$Country == semidata$JCountry))
[1] 113

A single diver appears on multiple rows of our data set, but because each row corresponds to one judge's score for one dive, this code does not over-count. The results are clear: no one judged a diver from their own country in the finals, but some judges did so in the preliminary and semifinal rounds. An additional option is to extract the rows where the diver's country and the judge's country were the same, and those where they were not, to allow for a comparison:

> samecountry <- data[data$Country == data$JCountry,]
> diffcountry <- data[!(data$Country == data$JCountry),]
> summary(samecountry$JScore)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   7.000   7.500   7.462   8.500  10.000 
> summary(diffcountry$JScore)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   6.000   7.000   6.814   8.000  10.000 
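The per-round counts and the same-country comparison can also be obtained in one pass with a logical flag and `table()`. The sketch below assumes nothing beyond base R and uses a tiny invented data frame; `toy`, `same`, `same_by_round`, and `mean_by_same` are hypothetical names, and the scores are made up.

```r
# Toy stand-in for the diving data; all values are invented for illustration.
toy <- data.frame(
  Round    = c("Final", "Final", "Prelim", "Prelim", "Prelim", "Semi", "Semi"),
  Country  = c("USA", "CHN", "USA", "AUS", "CHN", "USA", "AUS"),
  JCountry = c("AUS", "USA", "USA", "AUS", "USA", "CHN", "AUS"),
  JScore   = c(8.0, 9.0, 8.5, 7.5, 8.0, 7.0, 7.5)
)

# TRUE when the judge and the diver come from the same country.
toy$same <- toy$Country == toy$JCountry

# Counts of same-country scores in every round, in one table.
same_by_round <- table(toy$Round, toy$same)
print(same_by_round)

# Mean score within each group (same country versus different country).
mean_by_same <- tapply(toy$JScore, toy$same, mean)
print(mean_by_same)
```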

Of course, the data are very unbalanced now, but the univariate summaries indicate that scores are higher in both mean and median when a judge evaluated a diver from his or her own country. However, it is not yet clear how significant this relationship is.

Focusing only on data from the preliminary round, we can plot each diver on the horizontal axis (the diver number was generated above) and the score for each of their dives on the vertical axis. We have also colored and enlarged any point where a judge has given a score to a diver from the same country:

> plot(jitter(prelimdata$divernumber), jitter(prelimdata$JScore),
+      pch = 20,
+      col = 1 + as.numeric(prelimdata$Country == prelimdata$JCountry),
+      cex = 1 + 1*as.numeric(prelimdata$Country == prelimdata$JCountry),
+      xlab = "Diver Number", ylab = "Judge's score")

[Figure: Judge's score (0 to 10) against Diver Number (0 to 150) for the preliminary round; same-country scores are enlarged and drawn in red.]

Right away we wonder if something is wrong, because the graph appears to fall into four distinct regions, and within each region the scores of the divers seem to be decreasing. This actually makes sense, however. The data were given to us sorted first by event (starting with the men's springboard), and within each event the divers were ordered by rank. So the overall shape of the graph may be slightly distracting, but it should not be alarming. More importantly, we wish to look for patterns in the red points. It is very tempting to say that the scores corresponding to the red points seem artificially inflated, but the graph does not provide conclusive evidence (especially when compared to our investigation of the bimodality above).

4 Investigating Steve McFarland for Potential Bias

Although a raw comparison can be misleading, we begin our search for bias from Steve McFarland by computing his average score for US competitors and for non-US competitors:

> steveusa <- data[data$Judge == "McFARLAND Steve" & data$Country == "USA",]
> mean(steveusa$JScore)
[1] 7.797619
> stevenousa <- data[data$Judge == "McFARLAND Steve" & data$Country != "USA",]
> mean(stevenousa$JScore)
[1] 6.698374

We see that, on average, Steve McFarland scored American divers 1.1 points higher than non-American divers. We have to be careful, however. It could be the case that the American divers are actually better, on average, than the other competitors. Thus, we calculate the average score given to American divers by all of the judges except Steve McFarland:

> nosteve <- data[data$Judge != "McFARLAND Steve",]
> mean(nosteve[nosteve$Country == "USA",]$JScore)
[1] 7.460177

This reveals that the other judges' scores for US divers are indeed higher than McFarland's scores for non-US divers. However, McFarland's scores for the Americans are still about 0.34 points higher than the other judges' scores for the Americans. This might indicate some bias, so let's look more closely at the US divers that Steve McFarland judged. For each of those 7 divers, we plot all of their scores. Black points indicate scores from judges other than McFarland, while red points correspond to McFarland's scores. The green triangles represent the average of McFarland's scores for that diver, and the blue diamonds represent the average of all of the other judges' scores for that diver. Data from the final round are excluded, but some within-diver clustering is expected because, for each dive and within each event, we expect reasonably comparable scores:

[Figure: jittered scores (vertical axis, 4 to 9) for each of the 7 US divers judged by McFarland (horizontal axis, divers 1 to 7).]

We see right away that McFarland's average score is above the average score from the other judges for each of these 7 divers. The greatest absolute discrepancy between McFarland's average score and the other judges' average score occurs for diver 5 on this chart, corresponding to DAVISON, Michelle.

To search for bias statistically, we can assume that all judges are unbiased and then permute the judges over the dives. This will preserve the performance standard within countries and individual competitors, but will test against judges being extreme in their scoring:

> dataperm <- data[data$Round != "Final",]
> dataperm$Judge <- sample(dataperm$Judge)
> print(m1 <- mean(dataperm[dataperm$Judge == "McFARLAND Steve" &
+                           dataperm$Country == "USA",]$JScore))
[1] 7.444444
> print(m2 <- mean(dataperm[dataperm$Judge != "McFARLAND Steve" &
+                           dataperm$Country == "USA",]$JScore))
[1] 7.481553
> abs(m1 - m2)
[1] 0.03710895
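A single shuffle gives only one draw from the permutation distribution. A minimal sketch of how the shuffle could be repeated to obtain a reference distribution and a one-sided p-value follows. It is not the report's code: the data frame `toy` is simulated with no bias built in, and `usa_gap`, `n_perm`, `perm_gaps`, and `p_value` are hypothetical names introduced for illustration.

```r
# Toy stand-in for the non-final rows; scores are simulated with no built-in bias.
set.seed(625)
n_dives <- 200
toy <- data.frame(
  Judge   = rep(c("McFARLAND Steve", paste("Judge", 2:7)), times = n_dives),
  Country = rep(sample(c("USA", "CHN", "AUS"), n_dives, replace = TRUE), each = 7),
  JScore  = round(runif(7 * n_dives, 4, 10) * 2) / 2
)

# Difference in mean score for US divers: the named judge versus everyone else.
usa_gap <- function(d, judge = "McFARLAND Steve") {
  usa <- d[d$Country == "USA", ]
  mean(usa$JScore[usa$Judge == judge]) - mean(usa$JScore[usa$Judge != judge])
}

observed <- usa_gap(toy)

# Shuffle the judge labels many times and recompute the gap each time.
n_perm <- 2000
perm_gaps <- replicate(n_perm, {
  shuffled <- toy
  shuffled$Judge <- sample(shuffled$Judge)
  usa_gap(shuffled)
})

# One-sided permutation p-value: share of shuffles with a gap at least as large.
p_value <- mean(perm_gaps >= observed)
print(c(observed = observed, p_value = p_value))
```

On the real data, `toy` would be replaced by the non-final rows, and a small `p_value` would indicate that the observed gap is unusual under random assignment of judges.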

In the single permutation shown above, both results are mean scores for US competitors. The first takes the rows where the (permuted) judge is McFarland, while the second takes the rows where it is not. Under permutation the two means are very similar (they differ by about 0.04), which is much smaller than the difference of about 0.34 that we saw initially, indicating that the initial difference may be significant. Across many permutations the absolute difference remains very small, so it seems reasonable to suppose that McFarland has some amount of bias.

Earlier, we computed the mean score for each dive. We also already extracted the dives for which McFarland judged a US diver (steveusa). So, for each of those dives, we can compare McFarland's score with the mean score given by the judges:

> mean(steveusa$JScore - steveusa$avg)
[1] 0.2006803

We see that McFarland scored about 0.20 higher than the judges' average across dives performed by Americans. Let's see if he is simply enthusiastic and grades non-US divers higher by 0.2 as well:

> discrep <- mean(stevenousa$JScore - stevenousa$avg)
> discrep
[1] 0.01045296

This value is very close to zero. It is positive, so on average McFarland does score a particular dive higher than the other judges do, but the amount is not nearly as great as the advantage he seems to give to the Americans. Subtracting out this average deviation, we have an estimate of his actual bias:

> mean(steveusa$JScore - steveusa$avg) - mean(stevenousa$JScore - stevenousa$avg)
[1] 0.1902273

Now, if McFarland is really unbiased, then subtracting the discrepancy from his scores and comparing the mean with the mean of the average scores given to US competitors should not yield a difference. Thus we have a one-sided hypothesis test:

> t.test(steveusa$JScore - discrep, steveusa$avg, alternative = "greater")

        Welch Two Sample t-test

data:  steveusa$JScore - discrep and steveusa$avg
t = 1.2284, df = 80.173, p-value = 0.1114
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -0.06746283         Inf
sample estimates:
mean of x mean of y 
 7.787166  7.596939 

This yields a p-value of about 0.111, which indicates that the difference may not be truly significant. There are a number of issues with this test, however, so we need to be careful. We would like both steveusa$JScore - discrep and steveusa$avg to be roughly normal, so we plot some basic histograms:

[Figure: histograms of steveusa$JScore - discrep (left, 5.5 to 8.5) and steveusa$avg (right, 5 to 9); y-axes: Frequency (0 to 15).]

The first histogram appears roughly acceptable, though there is perhaps some cause for concern in the second. Also, the two samples here are not independent, since the average score over all of the judges includes McFarland's own score. We suspect that excluding McFarland's score from the average would make the result slightly more significant. A non-parametric test we could try is the two-sample Mann-Whitney U test:

> wilcox.test(steveusa$JScore - discrep, steveusa$avg, alternative = "greater",
+             exact = FALSE)

        Wilcoxon rank sum test with continuity correction

data:  steveusa$JScore - discrep and steveusa$avg
W = 941, p-value = 0.2991
alternative hypothesis: true location shift is greater than 0

Again we see a result that does not seem significant. We could also try a permutation test, which likewise would not require the data to follow a normal distribution. Another option would be to create an indicator variable that designates whether McFarland is adjudicating a US diver:

> data$isSteveUSA <- rep(0, length(data$avg))
> data[data$Judge == "McFARLAND Steve" & data$Country == "USA",]$isSteveUSA <- 1

We could then fit a regression model that includes this indicator variable and see whether it is significant. Some care would need to be taken, because fitting JScore as the response would treat each dive as seven separate observations, which is not appropriate.

Further explorations might consider the bias of all judges toward their home countries. If most or all judges are biased, then it would be useful to compare McFarland's bias with that of the others. Perhaps he is not as biased as some of the other judges. Alternatively, if most judges are biased, then there may be no net effect on the rankings, since each competitor's score would be similarly inflated. These are only speculations, but they provide direction for additional analyses.
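One possible starting point for the all-judges comparison is sketched below. It is an assumption-laden illustration, not part of the original analysis: the data frame `toy` is simulated with a home-country bonus of 0.5 deliberately built in, and `home`, `others_avg`, `dev`, `dev_table`, and `home_bias` are hypothetical names. Each judge's score is compared with the average of the other six judges on the same dive, so that a judge's own score is not part of the benchmark.

```r
# Toy stand-in for the diving data; values are simulated, with a home-country
# bonus of 0.5 deliberately built in so the summary has something to find.
set.seed(42)
judges <- data.frame(Judge    = paste("Judge", 1:7),
                     JCountry = c("USA", "CHN", "AUS", "RUS", "GER", "CAN", "MEX"))
n_dives <- 300
toy <- data.frame(
  dive     = rep(seq_len(n_dives), each = 7),
  Country  = rep(sample(judges$JCountry, n_dives, replace = TRUE), each = 7),
  Judge    = rep(judges$Judge, times = n_dives),
  JCountry = rep(judges$JCountry, times = n_dives)
)
toy$home   <- toy$Country == toy$JCountry
toy$JScore <- 7 + rnorm(nrow(toy), sd = 0.5) + 0.5 * toy$home

# Average of the other six judges on the same dive (leave-one-out mean).
dive_sum       <- ave(toy$JScore, toy$dive, FUN = sum)
toy$others_avg <- (dive_sum - toy$JScore) / 6
toy$dev        <- toy$JScore - toy$others_avg

# Mean deviation per judge, split by non-home and home divers.
dev_table <- tapply(toy$dev, list(toy$Judge, toy$home), mean)
colnames(dev_table) <- c("other", "home")
home_bias <- dev_table[, "home"] - dev_table[, "other"]
print(round(cbind(dev_table, home_bias), 2))
```

The `home_bias` column plays the same role as the estimate computed for McFarland in Section 4, but for every judge at once, which would make it possible to see where he falls relative to the others.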