Analysis of Professional Cycling Results as a Predictor for Future Success

Similar documents
Uninformed Search (Ch )

The next criteria will apply to partial tournaments. Consider the following example:

CENG 466 Artificial Intelligence. Lecture 4 Solving Problems by Searching (II)

OIL & GAS. 20th APPLICATION REPORT. SOLUTIONS for FLUID MOVEMENT, MEASUREMENT & CONTAINMENT. Q&A: OPEC Responds to SHALE BOOM

Uninformed Search (Ch )

The Evolution of Transport Planning

APPENDIX D LEVEL OF TRAFFIC STRESS METHODOLOGY

Predicting the Total Number of Points Scored in NFL Games

Grand Slam Tennis Computer Game (Version ) Table of Contents

Traffic circles. February 9, 2009

ELO MECHanICS EXPLaInED & FAQ

A Network-Assisted Approach to Predicting Passing Distributions

Introduction. AI and Searching. Simple Example. Simple Example. Now a Bit Harder. From Hammersmith to King s Cross

At each type of conflict location, the risk is affected by certain parameters:

The Rule of Right-Angles: Exploring the term Angle before Depth

A Hare-Lynx Simulation Model

Problem Solving Agents

Princess Nora University Faculty of Computer & Information Systems ARTIFICIAL INTELLIGENCE (CS 370D) Computer Science Department

The system design must obey these constraints. The system is to have the minimum cost (capital plus operating) while meeting the constraints.

Uninformed search methods

Verification of Peening Intensity

Calculation of Trail Usage from Counter Data

Ranger Walking Initiation Stephanie Schneider 5/15/2012 Final Report for Cornell Ranger Research

Sprint Hurdles SPRINT HURDLE RACES: A KINEMATIC ANALYSIS

Handicapping Process Series

PREDICTING the outcomes of sporting events

Roundabouts along Rural Arterials in South Africa

The Aging Curve(s) Jerry Meyer, Central Maryland YMCA Masters (CMYM)

A Cost Effective and Efficient Way to Assess Trail Conditions: A New Sampling Approach

Introduction to Pattern Recognition

SPORTING LEGENDS: SEAN KELLY

Matt Halper 12/10/14 Stats 50. The Batting Pitcher:

CS 221 PROJECT FINAL

Gerald D. Anderson. Education Technical Specialist

Game Theory (MBA 217) Final Paper. Chow Heavy Industries Ty Chow Kenny Miller Simiso Nzima Scott Winder

Ascent to Altitude After Diving

Literature Review: Final

Massey Method. Introduction. The Process

Allocations vs Announcements

Incompressible Potential Flow. Panel Methods (3)

Uninformed search methods II.

REAL LIFE GRAPHS M.K. HOME TUITION. Mathematics Revision Guides Level: GCSE Higher Tier

CHEMICAL ENGINEERING LABORATORY CHEG 239W. Control of a Steam-Heated Mixing Tank with a Pneumatic Process Controller

Shedding Light on Motion Episode 4: Graphing Motion

The Incremental Evolution of Gaits for Hexapod Robots

One of the most important gauges on the panel is

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS

Machine Learning an American Pastime

JUVENILE GENETIC EVALUATION AND SELECTION FOR DAIRY GOATS

Three New Methods to Find Initial Basic Feasible. Solution of Transportation Problems

References: Hi, License: Feel free to share these questions with anyone, but please do not modify them or remove this message. Enjoy the questions!

Percentage. Year. The Myth of the Closer. By David W. Smith Presented July 29, 2016 SABR46, Miami, Florida

U11-U12 Activities & Games

STAGE 2 ACTIVITIES 6-8 YEAR OLD PLAYERS. NSCAA Foundations of Coaching Diploma

Visualization of a crowded block of traffic on an entire highway.

Uninformed search methods II.

Lecture 22: Multiple Regression (Ordinary Least Squares -- OLS)

The effect of pack formation at the 2005 world orienteering championships

Analyses of the Scoring of Writing Essays For the Pennsylvania System of Student Assessment

Instructions For Use. Recreational Dive Planner DISTRIBUTED BY PADI AMERICAS, INC.

MEETPLANNER DESIGN DOCUMENT IDENTIFICATION OVERVIEW. Project Name: MeetPlanner. Project Manager: Peter Grabowski

100-Meter Dash Olympic Winning Times: Will Women Be As Fast As Men?

12m. Instructions For Use. 14m. 16m. 18m. Recreational Dive Planner. 20m DISTRIBUTED BY PADI AMERICAS, INC.

Launch Line. 2.0 m Figure 1. Receptacle. Golf Ball Delivery Challenge Event Specifications MESA Day 2018

Opleiding Informatica

Grade: 8. Author(s): Hope Phillips

Agile project management with scrum

Traffic Impact Analysis

Traffic Impact Study. Westlake Elementary School Westlake, Ohio. TMS Engineers, Inc. June 5, 2017

How to Make, Interpret and Use a Simple Plot

PSM I PROFESSIONAL SCRUM MASTER

FEATURES. Features. UCI Machine Learning Repository. Admin 9/23/13

Contingent Valuation Methods

Uninformed search methods

AGA Swiss McMahon Pairing Protocol Standards

Software Engineering. M Umair.

Ozobot Bit Classroom Application: Boyle s Law Simulation

ROAD MODULE 4 ASSISTANT REFEREE: CRITERIUM PIT

LOW PRESSURE EFFUSION OF GASES revised by Igor Bolotin 03/05/12

ROAD MODULE 3 ASSISTANT JUDGE STAGE RACE

THERMALLING TECHNIQUES. Preface

MOUNTAIN BIKE ORIENTEERING RULES

AIRFLOW GENERATION IN A TUNNEL USING A SACCARDO VENTILATION SYSTEM AGAINST THE BUOYANCY EFFECT PRODUCED BY A FIRE

Discussion on the Selection of the Recommended Fish Passage Design Discharge

Practice Task: Trash Can Basketball

Pedestrian Dynamics: Models of Pedestrian Behaviour

100-Meter Dash Olympic Winning Times: Will Women Be As Fast As Men?

Teaching Notes. Contextualised task 35 The 100 Metre Race

Motion Control of a Bipedal Walking Robot

21-110: Problem Solving in Recreational Mathematics

/435 Artificial Intelligence Fall 2015

Application of Bayesian Networks to Shopping Assistance

Exponent's Fundamentally Flawed Research

Increased Onboard Bicycle Capacity Improved Caltrain s Performance in 2009

Applying Occam s Razor to the Prediction of the Final NCAA Men s Basketball Poll

Who s Who - Americans in the 2009 Tour de France

Author s Name Name of the Paper Session. Positioning Committee. Marine Technology Society. DYNAMIC POSITIONING CONFERENCE September 18-19, 2001

Estimating a Toronto Pedestrian Route Choice Model using Smartphone GPS Data. Gregory Lue

Prioritizing Schools for Safe Routes to School Infrastructure Projects

TIME-ON-TIME SCORING

Transcription:

Analysis of Professional Cycling Results as a Predictor for Future Success Alex Bertrand Introduction As a competitive sport, road cycling lies in a category nearly all its own. Putting aside the sheer length of the races (100 miles is, by most accounts, a short day) and the wildly varying natures of the courses themselves (mountains, cobblestones, and tight city corners are all part of the fun), it is unique in that is a particularly elegant blend of group and individual dynamics. On one hand, it is a fiercely individual sport. Team awards are given, but at the end of the day there will only be one man in each spot on the podium. And yet on the other, each and every rider is at the mercy of the group. A professional peloton is a constantly evolving blend of different skillsets and goals, constantly changing in response to moves made by individuals. It is because of this sort of interaction that I am interested in examining cycling from a network analysis perspective. Problem In this project I will attempt to use the results of the spring cycling season to predict the finishing order of the biggest and best cycling race in existence, the Tour de France. Of course, predicting it exactly is going to be nearly impossible, so my goal for this work will simply be to come as close as possible. For my data I will be using the results of four of the five Monuments of cycling, as well as two stage races. Each takes place at a different point in the spring, and each is well attended by some of the top names in cycling. They were chosen as such to provide maximum overlap with the names that compete in the Tour. The six races are as follows: Milan-San Remo Tour of Flanders Liège-Bastogne-Liège Paris-Roubaix Tirreno Adriatico Criterium du Dauphine

Related Work Works performing any sort of meta-analysis of cycling have been few and far between; one would go so far as to say largely nonexistent. Large amounts of energy have been put into studying the physical science behind cycling but comparatively little has been done in the Moneyball vein. Network analysis has been applied infrequently to other sports, but in methodologies which I have found to be largely irrelevant to the problem I ll discuss here. Studies of passing dynamics in basketball, for example, have little to no bearing on the predictive sort of analysis I seek to complete in this project. Model/Algorithm Note: Of the three approaches I tried, the latter two used the same model for the data; as such, this is what I ll describe in detail. The first approach was mostly necessary in terms of getting to know the data, but less significant in terms of results. The code for it is still present in my submission, but I ll skip over it here. I chose to model the data as a directed graph comprising riders (nodes) and performances (edges). In other words, each rider was represented by a node and each race represented by an edge between one rider and another. A single edge corresponds to the source node coming directly behind the destination node in the race if there is an edge from A to B, for example, it would mean that A got place N in a race and B got place N-1 (lower is better). There is a natural directedness to race results in that there is a (somewhat) clearly defined order of superiority in the finish; drawing edges between adjacent finishers conveys this flow of talent/speed. Following outward edges, then, corresponds to moving towards faster and faster riders. The hope for this model was that such a flow could be used to correlate riders with their relative speed. In practice this proved difficult; this will be discussed more shortly. Data Scraping All data used for this project was obtained via web-scraping or copying from cyclingnews.com. This was particularly convenient in that all the data used was available from a single source, and in a consistent format. This made working with the data relatively easy, with the small exception that occasionally names of riders would be slightly different, most often around special characters (eg Florian Senechal vs. Florian Sénéchal). It is necessary to maintain consistency across riders to properly correlate one of their results with the other, so

there is a section of the code that deals specifically with finding the nearest matching name in the results data at runtime. Evaluation Metric In order to gauge the efficacy of a particular algorithm it is necessary to have a metric by which we can evaluate its accuracy. It is goal to predict Tour de France results given the results of the earlier season reasons. As such, we need some way of measuring the finishing order an algorithm outputs against the actual finishing order (of the 2017 Tour). For these purposes, and after also evaluating the Kendall Tau and Spearman Rho, I settled on a metric based on set intersection for reasons of simplicity and focus. This particular calculation, as we ll see, effectively weights the higher ranks more while maintaining a relatively simple implementation. This former in particular is desirable for this effort; it is much more important to predict the top finishers in the Tour than it is to get the exact order of the masses right. The set intersection score (because if there s a better name, I don t know it) is computed as follows: 1. For i=1, Compare the first i elements of each list (in this case, our predicted finishing order and the actual results). Treat each as an unordered set, and calculate the fraction of elements that are shared between the two, ie S! S! S! S! This is the score for that value of i. 2. Repeat step 1 for all i s up to the size of the smaller of the two lists, getting a score for each step along the way. 3. Average all scores to get the final set intersection score. Construction of the actual graph was a straightforward process of going through the result set for each of the 6 races and adding an edge between consecutive finishers as determined by time. More than one edge between any two given riders was both uncommon and disallowed, largely in keeping with the constraints imposed by the snap.py graph representation. Two edges between any two given riders (one directed edge in each direction), however, was fine. This scenario arose when the first beat the second in one race, but lost to him in another. A looser version of this case, in fact, is one of the larger issues I believe I faced; cycles in the graph made reasoning about which riders were strictly better than the others quite difficult. The algorithm performed on this data went through two main iterations: Breadth First Search

Given the notion that an edge pointing towards a node indicates a rider for which that given node is superior, one might logically surmise that the more nodes point (either directly or by proxy/flow) to a given rider, the faster that rider is. As such, I evaluated a scoring system under which a rider is assigned a number of points equal to the number of nodes that can be reached with a BFS search starting at their node, and following incoming edges, outward. In other words; a rider s speed under this model is directly proportional to the number of nodes that are connected to him (note that it is not the number of nodes he is connected to subtle distinction). Riders can then be ranked by the number of points they accumulate in their respective searches. Implementing this algorithm without any modification, however, yields an interesting result: nearly every rider has the exact same score. This will be discussed in more details in the Challenges section of this report, but for now suffice to say it is due to the variance present in racing and presence of cycles in the graph. In the interest of solving this problem I chose to implement a depth-limited BFS essentially, only allowing progress to continue for a certain number of levels of search. Empirically, this proved effective in differentiating score among riders. This iteration of the algorithm performed somewhat ineffectively, peaking at a set intersection score of ~/.186. In initial implementations of the algorithm I incremented a rider s score by 1 for each node their search touched. I propose, however, that this favors slower riders in that it weights the (slower) riders they beat just as highly as the (relatively faster) riders that the top athletes manage to best. To remedy this I changed the increment value to be dependent on the current depth of search at which a node was reached essentially, placing a high value on beating riders that in turn beat many other riders or, put another way, having a high fanout. I tried both a linear and squared proportionality, and squared (that is, incrementing a riders score by [depth] 2 for each node that is touched during the search) performed slightly better. A depth of 10 with this i 2 increment yielded a set intersection score of ~.297. Random Walks In an attempt to remedy the cycle problems that BFS search faced, I implemented a ranking system based loosely on the random walks we have seen for PageRank. Starting from a random node, the algorithm followed outward links (this time moving up the speed gradient) for a specified number of repetitions or until it hit a dead end. Each rider it hit on the walk had their score increased, and the riders with the highest scores at the end were taken to be the fastest. Inasmuch as the edges (in theory) only move from slower to faster riders, repeatedly moving along those directions should narrow to faster and faster riders and, with enough repetitions, emphasize those riders at the front of the peloton.

In similar fashion to the BFS, it proved to be useful to weight later nodes on the random walk more, this time because there was a higher probability that they were towards the faster end of the group. An i 2 proportionality again outperformed its linear counterpart. This approach was moderately effective, reaching a set intersection score of ~.240. Challenges Far and away the biggest challenge was the messiness of the results data. By this I mean the lack of consistency; a rider might finish in 2 nd one day, and then 34 th the next. This is more or less normal for cycling ( bunch sprints at the end of the race, as they are called, oftentimes mean that such the difference between 1 st and 30 th might only be three or four seconds) but it makes it very difficult to get an accurate feel for who exactly is faster than whom. Add to this the fact that racing, as a human athletic endeavor, is naturally inconsistent and it becomes quite difficult to get an accurate picture of the standings, at least in a computational sense. These two factors in particular combine to generate a graph with many cycles and irregularities that makes generating continuous speed gradient nearly impossible. Approaching this with random walks was an attempt to deal with that sort of randomness, but it ultimately proved a little too difficult to surmount simply. Results and Findings Depth-limited breadth first search with a squared proportionality for score update performed best of the implemented algorithms with a set intersection score of ~.297. Future work Given more time, I would have like to spend time investigating alternate methods of representing the data in graph form. I think what I eventually came upon was adequate, but my suspicion is that an even slightly altered representation (perhaps something more adept at dealing with cycles) would greatly increase effectiveness on the algorithm design side. One think I would have liked to have tried when parsing the data was binning results by finish group instead of finish time/order. As stated, cycling races often end with large groups of riders finishing at nearly the exact same time so close, in

fact, that races will most often list long groups as finishing at the same time so long as there is never more than a few seconds gap between them. The data representation I landed on isn t ideal for this, but constructing edges based on finishing group would probably provide a more accurate picture when evaluating one cyclist against another.