Hellgate 100k Race Analysis February 12, 2015

Similar documents
Pitching Performance and Age

Navigate to the golf data folder and make it your working directory. Load the data by typing

Pitching Performance and Age

ASTERISK OR EXCLAMATION POINT?: Power Hitting in Major League Baseball from 1950 Through the Steroid Era. Gary Evans Stat 201B Winter, 2010

Running head: DATA ANALYSIS AND INTERPRETATION 1

Minimal influence of wind and tidal height on underwater noise in Haro Strait

STAT 625: 2000 Olympic Diving Exploration

The MACC Handicap System

Announcements. Lecture 19: Inference for SLR & Transformations. Online quiz 7 - commonly missed questions

Stats 2002: Probabilities for Wins and Losses of Online Gambling

NBA TEAM SYNERGY RESEARCH REPORT 1

Analysis of performance and age of the fastest 100- mile ultra-marathoners worldwide

Announcements. % College graduate vs. % Hispanic in LA. % College educated vs. % Hispanic in LA. Problem Set 10 Due Wednesday.

Data Set 7: Bioerosion by Parrotfish Background volume of bites The question:

Major League Baseball Offensive Production in the Designated Hitter Era (1973 Present)

Is lung capacity affected by smoking, sport, height or gender. Table of contents

The Aging Curve(s) Jerry Meyer, Central Maryland YMCA Masters (CMYM)

ISDS 4141 Sample Data Mining Work. Tool Used: SAS Enterprise Guide

STAT 155 Introductory Statistics. Lecture 2-2: Displaying Distributions with Graphs

Sample Final Exam MAT 128/SOC 251, Spring 2018

Lab 11: Introduction to Linear Regression

Lesson 14: Modeling Relationships with a Line

Compression Study: City, State. City Convention & Visitors Bureau. Prepared for

Chapter 12 Practice Test

Will the New Low Emission Zone Reduce the Amount of Motor Vehicles in London?

Modeling Fantasy Football Quarterbacks

Regional Analysis of Extremal Wave Height Variability Oregon Coast, USA. Heidi P. Moritz and Hans R. Moritz

Section I: Multiple Choice Select the best answer for each problem.

Standardized catch rates of U.S. blueline tilefish (Caulolatilus microps) from commercial logbook longline data

Statistical Analysis of PGA Tour Skill Rankings USGA Research and Test Center June 1, 2007

Legendre et al Appendices and Supplements, p. 1

Calculation of Trail Usage from Counter Data

Competitive Performance of Elite Olympic-Distance Triathletes: Reliability and Smallest Worthwhile Enhancement

Opleiding Informatica

GENETICS OF RACING PERFORMANCE IN THE AMERICAN QUARTER HORSE: II. ADJUSTMENT FACTORS AND CONTEMPORARY GROUPS 1'2

MANITOBA'S ABORIGINAL COMMUNITY: A 2001 TO 2026 POPULATION & DEMOGRAPHIC PROFILE

Wildlife Ad Awareness & Attitudes Survey 2015

Building an NFL performance metric

Evaluating the Design Safety of Highway Structural Supports

A Hare-Lynx Simulation Model

Driv e accu racy. Green s in regul ation

Journal of Human Sport and Exercise E-ISSN: Universidad de Alicante España

NordFoU: External Influences on Spray Patterns (EPAS) Report 16: Wind exposure on the test road at Bygholm

MJA Rev 10/17/2011 1:53:00 PM

Domestic Energy Fact File (2006): Owner occupied, Local authority, Private rented and Registered social landlord homes

Volume 37, Issue 3. Elite marathon runners: do East Africans utilize different strategies than the rest of the world?

Distancei = BrandAi + 2 BrandBi + 3 BrandCi + i

Analysis of Variance. Copyright 2014 Pearson Education, Inc.

Briefing Paper #1. An Overview of Regional Demand and Mode Share

Hook Selectivity in Gulf of Mexico Gray Triggerfish when using circle or J Hooks

Energy capture performance

Average Runs per inning,

A Review of Roundabout Speeds in north Texas February 28, 2014

Active Travel and Exposure to Air Pollution: Implications for Transportation and Land Use Planning

SCIENTIFIC COMMITTEE SEVENTH REGULAR SESSION August 2011 Pohnpei, Federated States of Micronesia

100-Meter Dash Olympic Winning Times: Will Women Be As Fast As Men?

Nebraska Births Report: A look at births, fertility rates, and natural change

Percentage. Year. The Myth of the Closer. By David W. Smith Presented July 29, 2016 SABR46, Miami, Florida

Equation 1: F spring = kx. Where F is the force of the spring, k is the spring constant and x is the displacement of the spring. Equation 2: F = mg

PRACTICAL EXPLANATION OF THE EFFECT OF VELOCITY VARIATION IN SHAPED PROJECTILE PAINTBALL MARKERS. Document Authors David Cady & David Williams

Review of A Detailed Investigation of Crash Risk Reduction Resulting from Red Light Cameras in Small Urban Areas by M. Burkey and K.

Addendum to SEDAR16-DW-22

Partners for Child Passenger Safety Fact and Trend Report October 2006

Wind Resource Assessment for NOME (ANVIL MOUNTAIN), ALASKA Date last modified: 5/22/06 Compiled by: Cliff Dolchok

Midterm Exam 1, section 2. Thursday, September hour, 15 minutes

Dobbin Day - User Guide

Failure Data Analysis for Aircraft Maintenance Planning

Economic Value of Celebrity Endorsements:

A REVIEW OF AGE ADJUSTMENT FOR MASTERS SWIMMERS

Historical Analysis of Montañita, Ecuador for April 6-14 and March 16-24

ANOVA - Implementation.

WIND DATA REPORT. Vinalhaven

Figure 1. Winning percentage when leading by indicated margin after each inning,

Low Speed Wind Tunnel Wing Performance

Evaluating the Influence of R3 Treatments on Fishing License Sales in Pennsylvania

Wind Project Siting & Resource Assessment

Algebra 1 Unit 6 Study Guide

COMPLETING THE RESULTS OF THE 2013 BOSTON MARATHON

A N E X P L O R AT I O N W I T H N E W Y O R K C I T Y TA X I D ATA S E T

Chapter 13. Factorial ANOVA. Patrick Mair 2015 Psych Factorial ANOVA 0 / 19

What s UP in the. Pacific Ocean? Learning Objectives

Math SL Internal Assessment What is the relationship between free throw shooting percentage and 3 point shooting percentages?

FireWorks NFIRS BI User Manual

A Description of Variability of Pacing in Marathon Distance Running

An Application of Signal Detection Theory for Understanding Driver Behavior at Highway-Rail Grade Crossings

ROUNDABOUT CAPACITY: THE UK EMPIRICAL METHODOLOGY

OXYGEN POWER By Jack Daniels, Jimmy Gilbert, 1979

100-Meter Dash Olympic Winning Times: Will Women Be As Fast As Men?

CPUE standardization of black marlin (Makaira indica) caught by Taiwanese large scale longline fishery in the Indian Ocean

Influence of Angular Velocity of Pedaling on the Accuracy of the Measurement of Cyclist Power

POWER Quantifying Correction Curve Uncertainty Through Empirical Methods

Spatial Methods for Road Course Measurement

E. Agu, M. Kasperski Ruhr-University Bochum Department of Civil and Environmental Engineering Sciences

Applying Hooke s Law to Multiple Bungee Cords. Introduction

Analysis of Factors Affecting Train Derailments at Highway-Rail Grade Crossings

Analysis of the Article Entitled: Improved Cube Handling in Races: Insights with Isight

COMPARISON OF CONTEMPORANEOUS WAVE MEASUREMENTS WITH A SAAB WAVERADAR REX AND A DATAWELL DIRECTIONAL WAVERIDER BUOY

Assessment Schedule 2016 Mathematics and Statistics: Demonstrate understanding of chance and data (91037)

SCDNR Charterboat Logbook Program Data, Mike Errigo, Eric Hiltz, and Amy Dukes SEDAR32-DW-08

SPATIAL STATISTICS A SPATIAL ANALYSIS AND COMPARISON OF NBA PLAYERS. Introduction

Transcription:

Hellgate 100k Race Analysis brockwebb45@gmail.com February 12, 2015 Synopsis The Hellgate 100k is a tough, but rewarding race directed by Dr. David Horton. Taking place around the second week of December in the mountains just north of Roanoke Virginia, this race is known for challenging and unpredictable weather that make each running a unique experience. The goal of this analysis was to look at the data and see if there were any interesting patterns or results to investigate. Also, the basic stats could be used to create a nice info-graphic highlighting Hellgate. Methodology Please note: all the data files used and complete source code is available from the github repository at: https:// github.com/ brockwebb/ hellgate_100k_statistical_analysis All race time data and weather information was obtained through the referenced web sources, cleaned, and transformed into comma separated value (csv) format for ease in any analysis application. The weather data used was from the Roanoke Regional Airport (Weather Data). No data existed for temperatures at higher elevations like Headforemost Mountain at mile 24/Aid Station 4, which is known as the coldest section of the course when runners reach it before dawn. The Airport data was believed to be the best and most reliable source for this information due to its criticality in aviation safety. Taking windchill in account was done using the average observed wind speed and the lowest observed temperature as an estimate of what was felt on the course, albeit Hellgate has higher elevations, receives more wind exposure at the top of the mountains, and was probably colder. Normality tests on the data distributions were conducted by applying three methods: Shapiro-Wilk test, Anderson-Darling test, visual inspection of the Q-Q plot. The non-parametric Mann-Whitney U Test was used for all non-normal data comparisons. In all tests, significance was the standard accepted p-value of p<0.05. Linear models were used to show best fit whenever appropriate. A second order polynomial was used in one case as the points demonstrated a curved pattern. For the predictive modeling of Hellgate times based on the Beast Series, the step() function was used to attempt the best prediction from the data. Regardless of the Rˆ2 value obtained from the model fit, regression analysis was used to investigate goodness of fit. Regression analysis included an examination of the resulting residual and Q-Q plots for any patterns that would suggest complications or problems with any data fitting models used. Results and Conclusions All charts and in-depth analysis are found in the Appendix. This section provides a summation of the most salient results and conclusions. Male vs. Female Runners The age distribution of male and female runners was equivalent (p=0.4879). Overall, males finished faster than females (p=0.00015) with an average time of 15:33:47 to 16:05:47 for the women. The most surprising 1

result was that there were no significant differences between male and female finishing times for the first five years(p=0.2482). After that time, males became significantly faster whereas female times seem to stay similar to previous years (p<0.00016). 2008+ Improvement in Male Finish Times The reason for this improvement or trend is still unknown. The improvement effect was also noted in this analysis: Increase in finishers and improvement of performance of masters runners in the Marathon des Sables (Jampen SC, Knechtle B, Rust CA, Lepers R & Rosemann T, 2013). Books like Born to Run by Christopher McDougall was first published in May 2009, drawing a broader audience into the sport, as well as barefoot running. According to finishing stats from Ultrarunning Magazine, Ultra finishes went from 30,789 in 2008 to 69,573 in 2013 (UltraRunning Magazine, 2013). Overall, the effect may be attributed to an increase in popularity of the sport leading to increased participation that uncovered a wealth of undiscovered talent. Several factors were investigated, including looking at the performance of veterans of the race, use of the race committee in applicant screening, and whether or not the Beast Series, which began in 2008, was a factor. The Beast Series did not explain the improvement, but had power in predicting the Hellgate finishing time, covered later in this section. The question of whether or not the new race registration procedure of screening applicants is making a difference is undecided because the trend began three years before the introduction of it. The race committee may be stabilizing the finish percentage or improving the overall time, but the effect requires further study when more data are available. Finish Rate The early years had a much lower finish rate than the later years. The trend observed implied a learning curve and fit a second order polynomial trend-line nicely. Explanations of this learning curve might be due to knowledge of the race, race reports, and potentially the race committee screening procedure. As the improvement in finish times suggests, better prepared runners are showing up and future races are predicted to finish in the low 80 s. Temperature and Finish Rate The warmest start was 2004, but factoring in windchill, 2013 was the warmest year to date. The coldest years were 2007, tied with 2009 (20 degrees F). Factoring in windchill, 2007 was the coldest around 9 degrees F. Temperature appeared to correlate with the finishing percentage, but closer examination of the residual plots revealed non-linear behavior, meaning some other factor was not accounted for in the result. Course Conditions Runners experienced clear trail conditions six times, ice two time, and snow-ice four times. Course conditions had no significant impact on finishing time (all p>0.05). Overall, the wind direction was to the East five times, North-West three times. Beast Series Beast series runners were not significantly different than their peers as far as finish time are concerned. Only one year, 2011, really looked visually different from the box plots. By Removing the few outliers in 2011, Beast runners finished slower than their peers (p=.0325). 2

Past performance in the Beast Series was examined to determine how well it could predict a runners finish time at Hellgate. The following equation was found to provide a reasonable estimate of a runner s finish time: Hellgate time (predicted) = (-2.948e+04) + (1.897)MMTR.time + (5.684e-01)GS100.time + (-1.128e- 05)MMTR.timeGS100.time Where: MMTR.time and GS100.time are the finish time from each race, in decimal based hours (e.g. h:m:s 16:30:00 = 16.5 hours). This equation has an adjusted Rˆ2 =.08075 and has decent looking residual plots, confirming the model s validity as a good predictor. Future Study and Investigation Future study might include a broader, more in depth look at a runner s experience and preparation before Hellgate. More data and insight into how the Race Committee selects runners from the applicant pool would be helpful in determining its effectiveness. It would be interesting to see if the learning curve effect exists for other races and how long it takes to stabilize around a given level. When more data are available, the Hellgate prediction model should be evaluated to measure its robustness. Lastly, it would be prudent to check other race performance data to see if the 2008/2009 time frame kicked off an increase in overall performance throughout the sport. Appendix The sections below execute are constructed though execution of all the R code and were the basis of the above report. Environment Setup Global Options Global options set options for general formatting, including suppressing messages and warnings from the code. There are many messages produced that add length to the document and distract from readability. Also, turning on/off all code, chart sizes, etc. to display results is possible here too. Data libraries The following libraries are used in this analysis: 1. ggplot2 graphics/charts 2. grid layout of charts for display 3. lubridate handling date/time 4. knitr knitr global options for output/file build 5. nortest normality testing 6. plyr summarization/aggregation Global Functions One global function is used to return the equation and rˆ2 value that is generated by the linear model for display on graphs using ggplot (Regression Equation) 3

Loading the Data Files All the data files used, including this R-Markdown file with complete source code is available from the github repository at: https://github.com/brockwebb/hellgate_100k_statistical_analysis Information on which runners actually started, did not finish (DNF), and where they dropped was not available. This may have aided even more perspective into the complex challenges this race presents and where the major obstacles are. Basic Stats: Below are the basic stats on the race: Time/Age Records Here are the male and female finishers fastest/slowest times and youngest/oldest ages: Total finishes: [1] 1032 stat male female 1 Total Finishes 853 179 2 Fastest Finish 10:45:49 12:23:40 3 Longest Finish 18:55:00 18:55:54 4 Average Time 15:33:47 16:05:47 5 Youngest Finish 19 21 6 Oldest Finish 66 62 7 Average Age 39.9 39.4 Runners by state Looking at the distribution of runners from each state, top five states, totaling 71.1 percent of all finishes are from Virginia, Pennsylvania, Maryland, North Carolina, and Ohio: state runner.count percent.by.state 34 VA 468 45.3 29 PA 86 8.3 16 MD 76 7.4 20 NC 65 6.3 25 OH 39 3.8 [1] 71.1 Finish Percentage by Year The number of starters has increased from 71 (2003) to 148 (2014). The Number of finishers has also gone from 44 (2003) to 122 (2014). The best finish percentage was 88.1% in 2010. Also show are the First Timers those whose first Hellgate finished in a success. Interestingly enough, there was a noticeable jump in 2010 in 4

the total number of new runners. The fact that the highest finishing percentage and the largest number of new runners occurred in the same year is probably a coincidence, as there is not enough data to say otherwise. 150 Number of Runners 100 "starters" finishers first.timers starters 50 2004 2006 2008 2010 2012 2014 Date In 2011, the race registration format changed. Entry criteria required that each runner list the races they have done to prove they are able to complete Hellgate. A race committee then reviewed the applications to ensure runners had a good chance at finishing. It would appear that a learning curve with finishing may have occurred anyways, as more runners gained the benefit of race reports and prior Hellgate experience. Future race predictions would indicate a finish percentage in the low 80 s, and the effect of the Race Committee might have stabilized the success rate of Hellgate in this area. This learning curve is represented by the smoothed line in the plot below: 5

80 Finishing Percentage 70 "finish.percent" finish.percent 60 2004 2006 2008 2010 2012 2014 Date Weather Effects Hellgate can have some interesting weather, and the course conditions are dependent on many factors leading up to the race (snowfall, rain, cold/heat waves, etc). Weather data was obtained from the U.S. National Oceanic and Atmosphere Administration s (NOAA) National Climatic Data Center as measured from the Roanoke Regional Airport (Weather Data). No data existed for temperatures at higher elevations like Headforemost Mountain at mile 24/Aid Station 4, which is known as the coldest section of the course when runners reach it before dawn. The Airport data was chosen because wind speed/direction is incredibly important for airplanes and was believed to be the best/most reliable source for this information. This was important in calculating wind chill by using the formula from the National Weather Service (Windchill Calculation). It should be noted that the formula was only applied when average wind speeds exceeded the 3 miles per hour (mph) threshold where it is considered valid (Windchill Calculation). The course conditions were recovered from the race reports, but mainly from the Race Director s (Dr. David Horton) account of the conditions (Extreme Ultra Running). Temperature Plotting the both the Temperature and the Wind Chill using the average wind speed and coldest observed temperature at Roanoke Regional Airport. The warmest start was 2004, but factoring in windchill, 2013 was 6

the warmest year to date. The coldest years were 2007, tied with 2009 (20 degrees F). Factoring in windchill, 2007 was the coldest around 9 degrees F. 40 30 Temperature (F) 20 "min_temp_f" avg_min_wind_chill min_temp_f 10 2004 2006 2008 2010 2012 2014 Date Wind Direction Historical wind direction was not found in the data set so wind direction was determined by using the five minute wind gust direction as an estimate for the overall direction the air system was moving. The five minute wind gust is the highest sustained wind for a five minute period during the observation time. The data set contains direction in the form of degrees in a circular reference where east is zero degrees, north is ninety degrees, and so forth. Wind blowing in the south direction was the most common with five occurrences, north west the second most common with three: 7

5 4 count 3 2 five_min_wind_gust_direction E N NE NW SE SW 1 0 E N NE NW SE SW five_min_wind_gust_direction Course Conditions: Most of the time, the trails were clear. However, some years had ice or a mixture of snow and ice on the ground: 8

6 4 count trail.condition clear ice snow ice 2 0 clear ice snow ice trail.condition Statistical analysis Plot of male versus female times Looking at the overall finishing times distributions by gender: 9

8e 05 6e 05 density 4e 05 gender F M 2e 05 0e+00 40000 50000 60000 time The time distributions are similar, but the median time is different. The shape is not too normal looking. Much of the data has this type of skew, mainly because a larger portion of the runners finish much later in the race. To prove non-normality, applying the Shapiro-Wilk and Anderson-Darling tests and confirming Q-Q plot produces the following for the Men s times: 10

Normal Q Q Plot Sample Quantiles 40000 45000 50000 55000 60000 65000 3 2 1 0 1 2 3 Theoretical Quantiles norm.test pvals 1 Shapiro-Wilk: 4.542474e-14 2 Anderson-Darling: 1.531689e-23 Both normality tests generate incredibly small p values, rejecting normality. The Q-Q plot is very curved, where a normal one will follow a straight line. Therefore non-parametric tests are required to test for statistical significance. The non-parametric Mann-Whitney U Test (wilcox.test) is used and result in very small p-values to demonstrate a statistically significant different in male/female times: [1] "wilcox.test p-value:" "0.000150214031214457" Because the p value is incredibly small, it is clear that there is a significant difference between male and female finishing time. 11

Plot of male versus female ages 0.05 0.04 0.03 density 0.02 gender F M 0.01 0.00 20 30 40 50 60 age The age distributions look identical, and normally distributed. Running a check on the normality for the Female ages yields the following: 12

Normal Q Q Plot Sample Quantiles 20 30 40 50 60 2 1 0 1 2 Theoretical Quantiles norm.test pvals 1 Shapiro-Wilk: 0.3903384 2 Anderson-Darling: 0.3663799 The p values indicate normality, and the Q-Q plot confirms this. Therefore the Students T-test will be used. Based on the p value obtained below (0.49), there is no statistically significant difference between the ages within each group. [1] "t.test p-value:" "0.487897806045054" Age versus Hellgate Finishing Time Looking at overall age, older runners tend to take longer (on average) to finish. Obviously this is not a surprising result, but from the scatter of the data, the trend is very gradual, meaning that older runners are still very competitive. 13

17.5 Hellgate Finish Time (hours) 15.0 gender F M 12.5 20 30 40 50 60 Runner Age (years) Age by race year, males vs females In general, for both genders, the trend is an older. There are many veterans that have been running this race for several years, meaning that a core group of steadily older persons may be driving the numbers. However, the change is not that great. 14

60 Runner Age (years) 50 40 gender F M 30 20 2004 2006 2008 2010 2012 2014 Date 15

Times of men and women versus Hellgate finish time 17.5 Finish Time (hours) 15.0 gender F M 12.5 2004 2006 2008 2010 2012 2014 Date In the 2008 race, the standard error shown by the gray shaded area (95% confidence interval around the mean) for the men s mean finishing time is distinctly different than the women s mean time (plus standard error). Overall, the women s mean finishing time seems to hold flat or increase slightly. The men s time continues this trend for every race after. First compare the male/female times before 2008: 16

0.2 density gender F M 0.1 0.0 12.5 15.0 17.5 Hellgate Finish Time (hours) Testing pre and post 2008 Male/Female time differences: group males.vs.females 1 2003-2007 p-value: 0.2428249975 2 2008-2014 p-value: 0.0001558289 Based on the p values obtained, for the first five years of the race (2003-2007) there is no significant difference between the male and female finishing times. After 2007, it s a different story, and the p-values are incredibly low, meaning that the males are significantly faster than females from 2008-2014. Effect of Temperature on Finishing Percentage The temperature with the calculated wind chill was used to understand the apparent effects of temperature experienced by the runners and overall finish rate. It appears that there might be a correlation between temperature and finishing percentage as seen in this graph: 17

80 Finish Percent 70 Prediction Equation : y = 65.7 + 0.525x, r 2 = 0.31 60 10 15 20 25 30 35 Windchill Temperature (F) It would appear that temperature might influence the finishing percent, albeit the Rˆ2 value is a bit low. Looking at the residual plots: 18

Residuals vs Fitted Normal Q Q Residuals 15 5 0 5 10 166 167 168 Standardized residuals 2 1 0 1 167 168 166 72 74 76 78 80 82 84 3 2 1 0 1 2 3 Fitted values Theoretical Quantiles Standardized residuals 0.0 0.4 0.8 1.2 Scale Location 166 167 168 Standardized residuals 2 1 0 1 Residuals vs Leverage Cook's distance 975 976 977 72 74 76 78 80 82 84 Fitted values 0.000 0.001 0.002 0.003 0.004 Leverage The Q-Q plot shows the data is not normal and the Residual plot has a curved fit, meaning that there is non-linear behavior and this type of obvious pattern means it is a poor predictor. Therefore, it is determined that temperature is not a good predictor on the finishing rate. Effect of Course Conditions on Finishing Time There are many factors that can influence the finishing times, including the overall course conditions. Looking at a box-plot of times by course conditions, the average finishing times do not look that different, although the spread of the times appears to get more narrowly focused: 19

17.5 Finish Time (hours) 15.0 trail.condition clear ice snow ice 12.5 clear ice snow ice Trail Condition Looking at at histograms of each, they all look stacked on top of each other: 20

0.2 density trail.condition clear ice snow ice 0.1 0.0 12.5 15.0 17.5 Finish Time (hours) The data appears to be non-normal. The Clear trail condition looks the most rounded like a bell curve, the others are more skewed to the right. Just to test this hypothesis, three methods are provided for the Clear data that demonstrate non-normality. The Shapiro-Wilk test and the Anderson-Darling tests combined with a Q-Q plot provides the results: 21

Normal Q Q Plot Sample Quantiles 40000 45000 50000 55000 60000 65000 3 2 1 0 1 2 3 Theoretical Quantiles norm.test pvals 1 Shapiro-Wilk: 3.782605e-11 2 Anderson-Darling: 8.475425e-16 Because both normality tests combined with the Q-Q plot showing very curved behavior is observed, the data is considered to be non-normal and a non-parametric test must be used. Applying the Mann-Whitney U Test (a non-parametric test) to determine if any significant differences exist: comparison pvals 1 Clear vs. Ice: 0.07652347 2 Clear vs. Snow-Ice: 0.06402921 3 Ice vs. Snow-Ice: 0.85804870 As seen from the p-values produced, none are significant (p<0.05), and therefore the conclusion is that trail conditions are a factor in influencing finish time is rejected. Beast Series Analysis Because runners were faster overall after 2008, investigations into the Beast Series as a potential factor. While running the Beast Series did not appear to be a factor in influencing an overall faster time at Hellgate for those runners, it was thought that previous performance may predict a runners finish at Hellgate. 22

Predicting Hellgate Time for the Beast Series runners Using the previous Beast time, can we predict the Hellgate finish time? Taking the Beast finishers, calculate their average time (pace) before, then see how well it compares to their Hellgate pace. 2013 was considered the Mini-Beast as the Grindstone 100 was cancelled due to the US Government shutdown that occurred. Total miles is normalized to pace/mile so the millage difference can be accounted for. Mileage is done with the advertised race distance, meaning that Hellgate is 62 miles, Grindstone is 100 miles, etc. The actual distances may be a little bit longer, but this difference does not affect the comparison, and was done for clarity or ease of use. 18 16 Hellgate Pace (min/mile) 14 12 Prediction Equation : y = 7.43 + 0.555x, r 2 = 0.59 10 12 14 16 18 Pace Before Hellgate (min/mile) 23

Residuals vs Fitted Normal Q Q Residuals 3 1 1 2 3 652 743 118 Standardized residuals 3 1 1 3 652 118 743 13 14 15 16 17 2 1 0 1 2 Fitted values Theoretical Quantiles Standardized residuals 0.0 0.5 1.0 1.5 652743 118 Scale Location Standardized residuals 4 2 0 2 4 Residuals vs Leverage Cook's distance 743 652 118 13 14 15 16 17 Fitted values 0.00 0.01 0.02 0.03 0.04 Leverage At first glance, this looks like a pretty good fit, but the Rˆ2 value would indicate that the pave before Hellgate explains about 58% of the variation seen. The residual and Q-Q plots also look decent. But, in the upper left, there appears to be a cluster of times that are much higher than what would be predicted. By looking at many different factors, including gender, temperature, number of times finishing Hellgate, and trail conditions, one thing stood out. The snow-ice year (2013) seems to explain that cluster as seen by the graph below: 24

65000 60000 55000 time 50000 trail.condition clear ice snow ice 45000 40000 500 600 700 800 900 1000 1100 avg.time.before.hg 2013 certainly seems like a distinctly different year, as all of the finishing times are clustered above the others as seen in the above graph. Another factor is experience, not just running Hellgate but doing what it takes to finish the Beast Series. Looking at the veteran Beast series runners, by Beast Series Finishes: 25

65000 60000 time 55000 50000 as.factor(beast.times.finished) 1 2 3 4 5 6 7 45000 40000 500 600 700 800 900 1000 1100 avg.time.before.hg Now let s highlight Hellgate Finishes instead, showing that veterans of Hellgate are scattered throughout: 26

65000 60000 time 55000 50000 45000 as.factor(times.finished) 1 2 3 4 5 6 7 8 9 40000 500 600 700 800 900 1000 1100 avg.time.before.hg No distinct pattern from the number of Beast or Hellgate finishes is apparent. 2013 was also the year of the Govt shutdown, which forced Grindstone to be canceled. If years with and without Grindstone were treated separately, a better fit of the data is obtained: 27

17.5 Group No Grindstone With Grindstone Hellgate Finish Time (Hours) 15.0 12.5 No Grindstone : y = 17701 + 54.3x, r 2 = 0.69 With Grindstone : y = 18229 + 44.4x, r 2 = 0.75 10 12 14 16 18 Pace Before Hellgate (min/mile) The prediction equation is significantly improved for each group. Training is important, and race experience on a tough course would have been good preparation. but a test to see if 2013 was statistically different needs to be done before removing the group from the sample. Another question on whether or not it was the trail conditions or lack of a big race prior to Hellgate was to blame. There were four years which had snow-ice conditions. Just Looking at the years with snow-ice, the year in question (Y11, 2013) as the second fastest average time: 28

60000 time yearid Y1 Y11 Y3 Y5 50000 40000 Y1 Y11 Y3 Y5 yearid Several comparisons of 2013 can be made. First whether or not 2013 was different than other Beast years and from other Hellgate years. Next, looking at trail conditions between other snow-ice years and the other clear and ice years too. Again for non-normality, statistical significance is done with the Mann-Whitney U test: comparison.labels comparison.pvals 1 2013 vs other Beast years 0.17660283 2 2013 versus all Hellgate 0.60931047 3 2013 vs. Other Snow-Ice 0.09194064 4 Snow-Ice vs. Clear and Ice Years 0.14950988 All of the p values are greater than 0.05, therefore there are no significant differences between the 2013 year and other Beast years, all other Hellgate years or between all other terrain conditions. Therefore, the 2013 group should not be treated as outliers in the data. Comparing Beast Runners with other Hellgate Runners While the 2013 group may not be statistically different, looking at all Beast years is important to understand if any differences can be observed. A box-plot of the years with the Beast Series Runners compared to the rest of the Hellgate runners is below: 29

60000 time beast.runner N Y 50000 40000 2008 12 13 2009 12 12 2010 12 11 2011 12 10 2012 12 08 2013 12 14 2014 12 13 as.factor(date) For each year, testing to see if there are any differences between Beast Runners and the rest of the runners: race.year pvals 1 2008 0.4306507 2 2009 0.4346617 3 2010 0.8212902 4 2011 0.1086852 5 2012 0.2282386 6 2013 0.3269300 7 2014 0.6129500 In all years, the p values are greater than 0.05, meaning that there are no significant differences between Beast Runners and other Hellgate Runners. However, in 2011, the 9th year, the box-plot shows that the beast runners might have been significantly slower than their peers. There are some outliers in this group. A histogram of this year is below and a statistical test would claim that the difference is not significant. Removing the outliers and re-running the test make this year significantly different, that the average Beast Runner was slower, on average, from the rest of the pack. 30

1e 04 density beast.runner N Y 5e 05 0e+00 40000 50000 60000 time Wilcoxon rank sum test with continuity correction data: dfhg_all$time[which(dfhg_all$yearid == "Y9" & dfhg_all$beast.runner == and dfhg_all$time[whic W = 1445.5, p-value = 0.1087 alternative hypothesis: true location shift is not equal to 0 Wilcoxon rank sum test with continuity correction data: dfhg_all$time[which(dfhg_all$yearid == "Y9" & dfhg_all$beast.runner == and dfhg_all$time[whic W = 60, p-value = 0.0325 alternative hypothesis: true location shift is not equal to 0 Developing a better model: Beast Series Analysis Using the Total Beast time resulted in a 58% fit and the residual and Q-Q plots were decent, but it is an aggregate of all races. The three 50k races in the beginning of the year are mixed in and it may be possible to look more closely at each race leading up to Hellgate to determine if a better model can be produced. The Beast data was obtained from both Eco-XSports and Extreme Ultra Running, depending on the year (early years can only be found on Extreme Ultra Running). First, a linear model was fit using a combination of all the race finishing times. Next, using the step function in R, the software tests several combinations 31

to determine the outcomes that are most significant and optimizes the model with the best parameters to form a prediction equation. Start: AIC=2422.05 beast.hellgate ~ beast.hl.50k + beast.t.mtn + beast.pl.50k + beast.gs.100 + beast.mmtr Df Sum of Sq RSS AIC - beast.t.mtn 1 697546 877796833 2420.2 - beast.hl.50k 1 1127697 878226984 2420.2 - beast.pl.50k 1 10429774 887529061 2421.9 <none> 877099287 2422.1 - beast.gs.100 1 204355624 1081454911 2452.5 - beast.mmtr 1 379745149 1256844436 2475.8 Step: AIC=2420.17 beast.hellgate ~ beast.hl.50k + beast.pl.50k + beast.gs.100 + beast.mmtr Df Sum of Sq RSS AIC - beast.hl.50k 1 654894 878451727 2418.3 <none> 877796833 2420.2 - beast.pl.50k 1 12763825 890560658 2420.4 + beast.t.mtn 1 697546 877099287 2422.1 - beast.gs.100 1 204971861 1082768694 2450.7 - beast.mmtr 1 401378178 1279175011 2476.5 Step: AIC=2418.29 beast.hellgate ~ beast.pl.50k + beast.gs.100 + beast.mmtr Df Sum of Sq RSS AIC <none> 878451727 2418.3 - beast.pl.50k 1 14593204 893044930 2418.8 + beast.hl.50k 1 654894 877796833 2420.2 + beast.t.mtn 1 224743 878226984 2420.2 - beast.gs.100 1 206830343 1085282070 2449.1 - beast.mmtr 1 414089376 1292541103 2476.2 Call: lm(formula = beast.hellgate ~ beast.pl.50k + beast.gs.100 + beast.mmtr, data = beast.times) Residuals: Min 1Q Median 3Q Max -6775.2-1559.4-269.9 1324.3 7048.9 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 1.259e+04 1.984e+03 6.347 2.44e-09 *** beast.pl.50k 1.765e-01 1.114e-01 1.584 0.115 beast.gs.100 1.255e-01 2.104e-02 5.963 1.69e-08 *** beast.mmtr 7.050e-01 8.357e-02 8.437 2.45e-14 *** --- 32

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2412 on 151 degrees of freedom (29 observations deleted due to missingness) Multiple R-squared: 0.7952, Adjusted R-squared: 0.7911 F-statistic: 195.4 on 3 and 151 DF, p-value: < 2.2e-16 Residuals vs Fitted Normal Q Q Residuals 5000 0 5000 35 154 24 Standardized residuals 3 1 1 2 3 35 154 24 45000 50000 55000 60000 65000 Fitted values 2 1 0 1 2 Theoretical Quantiles Standardized residuals 0.0 0.5 1.0 1.5 35 Scale Location 154 24 Standardized residuals 3 1 1 2 3 Residuals vs Leverage 12 Cook's distance 35 93 0.5 45000 50000 55000 60000 65000 Fitted values 0.00 0.02 0.04 0.06 0.08 0.10 0.12 Leverage The output indicates that the Grindstone and MMTR time are significant, but the addition of the Promise Land (PL.50K) factor seems weird. This was suspect to be over fitting and a new model with only the interaction of the Grindstone and MMTR results would be used. Looking at the interaction of MMTR and GS100 to predict Hellgate Times: To begin, summaries of the linear fits to compare MMTR and Grindstone alone, as well as their interaction: Call: lm(formula = Hellgate ~ MMTR, data = beast) Residuals: Min 1Q Median 3Q Max -10978.0-1999.0-56.4 1876.4 7869.9 attr(,"class") [1] "Duration" attr(,"class")attr(,"package") 33

[1] "lubridate" Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 1.258e+04 2.077e+03 6.055 7.84e-09 *** MMTR 1.174e+00 5.517e-02 21.284 < 2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2817 on 182 degrees of freedom Multiple R-squared: 0.7134, Adjusted R-squared: 0.7118 F-statistic: 453 on 1 and 182 DF, p-value: < 2.2e-16 Call: lm(formula = Hellgate ~ GS.100, data = beast) Residuals: Min 1Q Median 3Q Max -9567.6-2253.6 157.6 2222.9 7578.9 attr(,"class") [1] "Duration" attr(,"class")attr(,"package") [1] "lubridate" Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 2.719e+04 1.747e+03 15.57 <2e-16 *** GS.100 2.772e-01 1.622e-02 17.09 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 3104 on 153 degrees of freedom (29 observations deleted due to missingness) Multiple R-squared: 0.6563, Adjusted R-squared: 0.654 F-statistic: 292.1 on 1 and 153 DF, p-value: < 2.2e-16 Call: lm(formula = Hellgate ~ MMTR * GS.100, data = beast) Residuals: Min 1Q Median 3Q Max -6855.3-1308.4-122.6 1224.7 6910.4 attr(,"class") [1] "Duration" attr(,"class")attr(,"package") [1] "lubridate" Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) -2.948e+04 1.096e+04-2.689 0.007970 ** MMTR 1.897e+00 2.973e-01 6.381 2.04e-09 *** GS.100 5.684e-01 1.097e-01 5.180 6.99e-07 *** 34

MMTR:GS.100-1.128e-05 2.859e-06-3.946 0.000121 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2315 on 151 degrees of freedom (29 observations deleted due to missingness) Multiple R-squared: 0.8112, Adjusted R-squared: 0.8075 F-statistic: 216.3 on 3 and 151 DF, p-value: < 2.2e-16 Note that the model excludes the missing GS100 values from the 2013 year when GS100 was cancelled due to the Govt Shutdown. To Summarize the (adjusted) Rˆ2 values: 1) MMTR Rˆ2 =.71 2) GS100 Rˆ2 =.654 3) Interaction of MMTR and GS100 Rˆ2 =.08075 This means that the Hellgate data can be plotted against a new set of times from the combination of MMTR and GS100 based on the formula below: Hellgate time (predicted) = (-2.948e+04) + (1.897)MMTR.time + (5.684e-01)GS100.time + (-1.128e- 05)MMTR.timeGS100.time Note that the coefficients from the interaction model were used. Using the interaction equation, generating a plot and new model: 18 Actual Hellgate Finish Time (Hours) 16 14 12 Prediction Equation : y = 0.000644 + 1x, r 2 = 0.81 12 14 16 18 Predicted Hellgate Finish Time (Hours) Looking at the residual plots of the model to see if there are any biases or other abnormalities. While the X axis on the residual plot looks slightly unbalanced, it has no other obvious pattern. The Q-Q plot has good normality, so the model looks good: 35

Residuals vs Fitted Normal Q Q Residuals 5000 0 5000 23 154 93 Standardized residuals 3 1 1 2 3 93 23 154 45000 50000 55000 60000 Fitted values 2 1 0 1 2 Theoretical Quantiles Standardized residuals 0.0 0.5 1.0 1.5 23 Scale Location 154 93 Standardized residuals 3 1 1 2 3 Residuals vs Leverage 154 Cook's distance 23 35 0.5 0.5 45000 50000 55000 60000 Fitted values 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 Leverage Based on this result, the results from Grindstone and MMTR can be used reasonable well to predict Hellgate performance. References 1. cookbook-r (2015). Online resource. http://www.cookbook-r.com/graphs/plotting_distributions_ (ggplot2) 2. Eco-XSports (2015). Retrieved January, 2015 from www.eco-exsports.com 3. Extreme Ultra Running (2015). Retrieved January 2015 from www.extremeultrarunning.com 4. Jampen SC, Knechtle B, Rust CA, Lepers R, Rosemann T (2013). Increase in finishers and improvement of performance of masters runners in the Marathon des Sables. Int J Gen Med. 2013; 6: 427 438. http://dx.doi.org/10.2147/ijgm.s45265 5. Regression Equation (2015). Retrieved on January 18,2015 from: http://stackoverflow.com/questions/ 7549694/ggplot2-adding-regression-line-equation-and-r2-on-graph 6. Trim Whitespace (2015). Retrieved from http://stackoverflow.com/questions/2261079/how-to-trim-leading-and-trailing-w on January 27, 2015 7. Line Smoothing (2015). http://stats.stackexchange.com/questions/110380/smoother-lines-for-ggplot2. Retrieved January 28, 2015. 8. UltraRunning Magazine (2013). 2013 UltraRunning participation by the numbers. Retrieved on February 10, 2015 from the website http://www.ultrarunning.com/featured/2013-ultrarunning-participation-by-the-numbers/ 36

9. Weather Data (2015). Menne, Matthew J., Imke Durre, Bryant Korzeniewski, Shelley McNeal, Kristy Thomas, Xungang Yin, Steven Anthony, Ron Ray, Russell S. Vose, Byron E.Gleason, and Tamara G. Houston (2012): Global Historical Climatology Network - Daily (GHCN-Daily), Version 3. Custom GHCN-Daily CSV: CITY:US510016 - Roanoke, VA US dates 12-1-2003-12-31-2014. NOAA National Climatic Data Center. doi:10.7289/v5d21vhz Retrieved January 5, 2015. 10. Windchill Calculation (2015). Retrieved January 2015 from http://www.nws.noaa.gov/om/winter/ windchill.shtml 37