Predicting Results of a Match-Play Golf Tournament with Markov Chains

Kevin R. Gue, Jeffrey Smith and Özgür Özmen*

*Department of Industrial & Systems Engineering, Auburn University, Alabama, USA, [kevin.gue, jsmith, ozgur]@auburn.edu

Abstract. We introduce a Markov chain model for predicting outcomes in golf match-play. The model uses each player's probability distribution of scores on each hole to estimate the probability of winning the match. The model is specific both to the individual participants and to the course on which the match is played. We use six years of PGA ShotLink data to determine individual player statistics and to estimate the required probability distributions. We compare the model's predictions with the results of the Ryder Cup singles match-play (Day 3).

1. Introduction

Golf tournaments take two major forms: stroke-play and match-play. In stroke-play, a player's final score is the sum of his or her scores on each hole of the tournament; the player with the lowest score wins. In match-play, two players compete hole by hole, and the player with the lower score on a hole wins one point. Equal scores on a hole yield one half-point for each player. The player with the most points after 18 holes wins. A match may not require a full 18 holes if one player's lead exceeds the number of holes remaining. For example, if a player is 3-up with only two holes remaining, he or she has won the match and play ceases. In this study, we consider only match-play competition.

Match outcome and decision support models for sports have been proposed in several studies (Scarf and Shi, 2005; Goddard, 2005; Barnett and Clarke, 2005). Reilly and Williams (2003) summarize the effects of applying scientific methods to soccer.
Regarding golf, Scheid (1979) simulates the effect of handicap allowances on players' chances of winning, and McHale (2010) conducts simulation studies to examine the fairness of handicapping using data from a real golf tournament. Similarly, Franks and McGarry (2003) examine the relationship between observed and expected results using real data. Markov chains are widely used to model sporting events (Sokol, 2004; Kostuk et al., 2001). Berry () builds a Markov chain model to compare Tiger Woods with other golfers and to determine whether he has the persona of a winner. Fearing et al. (2011) use PGA Tour ShotLink data to develop distance-based models of putting performance and to create a new putting performance metric.

Our study combines player statistics from six years of PGA ShotLink data with a Markov chain model to predict the outcome of golf match-play events. In the following sections, we first describe the aggregated data and the mathematical model. We then discuss our efforts to validate the data model. Finally, we present the results of our computational model for the Ryder Cup and conclude with pedagogical notes and future goals.

2. Methodology

2.1. PGA ShotLink Data

PGA ShotLink data is gathered at the major PGA Tour stroke-play events by volunteers using mobile computers and laser rangefinders. We used six years of raw data (2004-2009) consisting of the scores of every player on every hole in every tournament during those years. We aggregated this data to estimate player performance statistics by par of hole. The structure of the data is given in Table 1. For example, we can determine a player's probability of scoring $i$ strokes on a par-$j$ hole, where $i \in \{1, 2, \ldots, 10\}$ and $j \in \{3, 4, 5\}$, as in Table 2. (Professional players almost never score more than 10 on a hole.)

1 http://www.pgatour.com/story/9596346/
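The aggregation step described above, from score counts to per-par score probabilities, can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `score_distribution` and the counts in `example_counts` are hypothetical and are not taken from the ShotLink data.

```python
from typing import List, Sequence


def score_distribution(counts: Sequence[int]) -> List[float]:
    """Convert counts of scores 1..10 on holes of one par into relative frequencies.

    counts[k] is the number of times the player scored k + 1 strokes.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("player has no observations for this par")
    return [c / total for c in counts]


# Hypothetical counts for scores 1..10 on par-5 holes (illustrative, not real data).
example_counts = [0, 0, 20, 300, 500, 150, 25, 4, 1, 0]
example_probs = score_distribution(example_counts)
```

Estimating the probabilities as simple relative frequencies is an assumption of this sketch; the text does not say whether any smoothing was applied to rarely observed scores.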
Table 1. Sample player scores on Par-5 holes
Player Name        Obs 1 2 3 4 5 6 7 8 9 10
Mickelson, Phil    47 4 678 575 97 3
Mahan, Hunter      79 34 733 863 38 8 3
Watson, Bubba      79 38 48 463 8 3

Table 2. Sample probabilities on Par-5 holes
Player Name        Obs 1 2 3 4 5 6 7 8 9 10
Mickelson, Phil    47.3.48.4.7.
Mahan, Hunter      79..4.48.8.
Watson, Bubba      79.4.45.43.8.

2.2. Mathematical Model

We model a match as a Markov process in which the state of the match is the advantage one player has over the other, and the transition probabilities are the probabilities that that player wins, ties, or loses the current hole. We assume that performance on a hole does not depend on holes already played, so the process satisfies the required memorylessness property. We also assume that a player's performance is not influenced by the identity of his opponent.

Let $A_j$ and $B_j$ be random variables corresponding to the scores of Player A and Player B on a par-$j$ hole. The probability that A wins the hole is

\[
P(A_j < B_j) = \sum_{b=1}^{10} P(A_j < b \mid B_j = b)\, P(B_j = b)
             = \sum_{b=1}^{10} P(A_j < b)\, P(B_j = b)
             = \sum_{b=1}^{10} \sum_{a=1}^{b-1} P(A_j = a)\, P(B_j = b),
\]

where the second equality uses the independence of the two players' scores and the inner sum is empty for $b = 1$. Similarly, the probability of a tie is

\[
P(A_j = B_j) = \sum_{a=1}^{10} P(A_j = a)\, P(B_j = a),
\]

and the probability of a loss is

\[
P(A_j > B_j) = 1 - P(A_j < B_j) - P(A_j = B_j).
\]

With the probabilities of a win, tie, and loss for each of the three pars (3, 4, 5), we can completely specify a state transition diagram (see Figure 1). The match, which may be defined from either player's perspective, begins in state zero and proceeds hole by hole, using the probabilities appropriate to the par of each hole. An 18-hole match has a finite set of states. Gray nodes indicate termination states, in which one player has won. There is also a termination state for a tie after 18 holes.

The structure of the state diagram suggests a simple, recursive expression for the probability that the match is in state $m$ after $h$ holes. Let $w_h$, $t_h$, and $l_h$ be the probabilities of a win, tie, and loss on hole $h$.
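The win, tie, and loss probabilities for a single hole follow directly from the two players' score distributions. The sketch below is one possible way to compute them; the function name `hole_outcome` and the two example distributions are hypothetical, and the scores of the two players are assumed independent, as in the model.

```python
from typing import Sequence, Tuple


def hole_outcome(pa: Sequence[float], pb: Sequence[float]) -> Tuple[float, float, float]:
    """Return (win, tie, loss) for Player A on one hole.

    pa[k] and pb[k] are the probabilities that A and B score k + 1 strokes.
    """
    # A wins when A's score is strictly lower than B's score.
    win = sum(
        pa[a] * pb[b]
        for b in range(len(pb))
        for a in range(min(b, len(pa)))
    )
    # A ties when both players make the same score.
    tie = sum(x * y for x, y in zip(pa, pb))
    # Everything else is a loss for A.
    loss = 1.0 - win - tie
    return win, tie, loss


# Hypothetical distributions over scores 1..4 (illustrative only).
pa_example = [0.0, 0.0, 0.5, 0.5]    # A scores 3 or 4 with equal probability
pb_example = [0.0, 0.0, 0.25, 0.75]  # B scores 3 with 0.25, 4 with 0.75
win_a, tie_a, loss_a = hole_outcome(pa_example, pb_example)
# win_a = 0.375, tie_a = 0.5, loss_a = 0.125
```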
These probabilities depend, of course, on the par of the hole. In general, the probability $p(m,h)$ of being in state $m$ after $h$ holes is given by the recursion

\[
p(m,h) = p(m-1,h-1)\, w_h + p(m,h-1)\, t_h + p(m+1,h-1)\, l_h.
\]

If states $(m-1,h-1)$, $(m,h-1)$, or $(m+1,h-1)$ are infeasible (e.g., 3-up after two holes), we set their respective state probabilities to zero. Similarly, if a state is feasible but the transition is not, we modify the recursion appropriately. For example, $p(2,18) = p(1,17)\, w_{18}$ because the other feeder states, $(3,17)$ and $(2,17)$, are winning states, and therefore the match is over if they are reached (see Figure 1). The probability that a player wins the match is the sum of the probabilities of reaching the winning states. The probability that the match ends in a tie is $p(0,18)$.

Figure 1. State diagram of a match. Gray circles represent terminating states, indicating the end of the match.

3. Validation and Results

We assume that the probabilities we derived from the raw data are accurate and applicable in head-to-head matches between individual players. This is an important assumption and warrants validation. Because the current data was gathered from stroke-play tournaments, we also wanted to collect match-play data as evidence of validity. To our knowledge, only two major events have match-play rounds: Ryder Cup Day-3 and the Accenture Match Play tournament. We searched the web for data from these tournaments and for head-to-head records between players. Because the Accenture event is a knockout tournament and there are so few match-play tournaments, we found only a small number of players who had played against each other multiple times.

Our goal was to have enough match-play observations between two players to calculate a binomial confidence interval and compare it with the conditional probabilities computed by our model. Because the number of observations at each par level was small, the 95% confidence intervals were wide, and the conditional probabilities derived from ShotLink data always fell within them. This gives us little comfort in our validation efforts, and we are currently seeking additional head-to-head data to improve the validation process.
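The recursion for $p(m,h)$ in the mathematical model, including the early-termination rule, can be sketched as a forward pass over the holes. This is a minimal illustration, not the authors' implementation: the function name `match_probabilities` and the dictionary-based state representation are assumptions of the sketch, and the per-hole probabilities are supplied directly as `(w, t, l)` triples.

```python
from typing import Dict, Sequence, Tuple


def match_probabilities(
    holes: Sequence[Tuple[float, float, float]]
) -> Tuple[float, float, float]:
    """Return (P(A wins), P(tie), P(B wins)) for a match-play match.

    holes[h] = (w, t, l): probabilities that A wins, ties, or loses hole h + 1.
    The state is A's lead m; a match ends once |m| exceeds the holes remaining.
    """
    n = len(holes)
    live: Dict[int, float] = {0: 1.0}  # lead -> probability the match is still live
    p_win = 0.0
    p_loss = 0.0
    for h, (w, t, l) in enumerate(holes, start=1):
        nxt: Dict[int, float] = {}
        for m, p in live.items():
            for step, q in ((1, w), (0, t), (-1, l)):
                nxt[m + step] = nxt.get(m + step, 0.0) + p * q
        remaining = n - h
        live = {}
        for m, p in nxt.items():
            if m > remaining:        # A's lead cannot be overcome: terminal win
                p_win += p
            elif -m > remaining:     # B's lead cannot be overcome: terminal loss
                p_loss += p
            else:
                live[m] = p          # match continues (or is tied at the end)
    p_tie = live.get(0, 0.0)
    return p_win, p_tie, p_loss


# Hypothetical example: 18 identical holes with symmetric win/loss chances.
symmetric = match_probabilities([(0.25, 0.5, 0.25)] * 18)
```

Removing terminal states from `live` as soon as they are reached plays the same role as setting infeasible feeder probabilities to zero in the recursion: a finished match contributes nothing to later holes.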
For our second validation effort, we use only the PGA Championship data in PGA ShotLink and assume that if two players played a hole on the same day in the same round at the same event, we can treat that data as if they had played against each other in match-play on that hole. We analyzed the data and picked the two players (Toms and Mickelson) who had played the most holes in common on the same days. Using this data, we calculated the probabilities of winning, losing, and halving a hole for these players. Assuming the normal approximation, we calculated binomial 95% confidence intervals on the respective probabilities. Table 3 shows that all of the conditional probabilities calculated by our algorithm using all ShotLink data fall within the confidence intervals. In terms of validation, our results are still fairly weak. We hope to work with the PGA Tour to identify and obtain additional data to support our validation effort.

2 We could find scorecards for the Accenture Match Play Tournament for several years, including 2009, 2008, 2007, 2006, 2005, and 2004, and for the Ryder Cup in 2010, 2008, 2006, and 2004.

Table 3. Validation results for all ShotLink data
Columns: All ShotLink Data (Toms, Tie, Mickelson); PGA Championship Data (Toms, Tie, Mickelson); 95% Confidence Interval (Toms, Tie, Mickelson)
Par - 3.5.56.4.9.563.9 (.36,.3) (.463,.66) (.36,.3)
Par - 4.7.469.6.73.473.54 (.9,.36) (.43,.534) (.,.36)
Par - 5.344.399.56.39.43.78 (.,.47) (.89,.56) (.74,.38)

3.1. Ryder Cup Day-3 Results

The Ryder Cup is a golf competition between teams from Europe and the United States that is held every two years. Each team consists of twelve members who are picked by the respective team captains. The Ryder Cup involves various match-play competitions between players selected from the two teams of twelve. Currently, the matches consist of eight foursomes matches, eight fourball matches, and twelve singles matches. 3 The winner of each match scores a point for his team, or 1/2 point if the match ends in a draw. In this paper, we are interested only in the singles matches that are played on Day-3 of the Ryder Cup. Each captain announces his team's playing order the night before the Day-3 session, and players with the same position in the order play against each other. We ran our algorithm for the 2010 Ryder Cup, and the results are given in Table 4.
Note that the actual winners are shown in bold.

Table 4. Results for Ryder Cup Singles Match Play
Match  US Player          EU Player              P(US Wins) P(EU Wins) P(Tie)
1      Stricker, Steve    Westwood, Lee          .557.3.
2      Cink, Stewart      McIlroy, Rory          .48.39.8
3      Furyk, Jim         Donald, Luke           .48.389.3
4      Johnson, Dustin    Kaymer, Martin         .59.9.6
5      Kuchar, Matt       Poulter, Ian           .475.396.9
6      Overton, Jeff      Fisher, Ross           .569.39.
7      Watson, Bubba      Jimenez, Miguel A.     .579.3.9
8      Woods, Tiger       Molinari, Francesco    .679.3.9
9      Fowler, Rickie     Molinari, Edoardo      .746.64.9
10     Mickelson, Phil    Hanson, Peter          .74.86.
11     Johnson, Zach      Harrington, Padraig    .457.45.8
12     Mahan, Hunter      McDowell, Graeme       .535.34.3

In the appendix, we present the probabilities of winning and tying a hole from the perspective of the US player (the first player listed). Table 5 shows the conditional probabilities for the Ryder Cup match-ups computed by our algorithm using all PGA ShotLink data. Table 6, as in our second validation effort, presents the corresponding probabilities for Ryder Cup opponents derived from PGA Championship rounds. We list only the match-ups that have enough observations to justify the normal approximation.

3 http://en.wikipedia.org/wiki/ryder Cup
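The normal-approximation interval used in the validation can be sketched as follows. The text does not spell out the exact form of the interval, so this sketch assumes the standard Wald interval for a binomial proportion; the function name `binomial_ci` and the counts in the example are hypothetical.

```python
import math
from typing import Tuple


def binomial_ci(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wald (normal-approximation) confidence interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval. Bounds are clipped to [0, 1].
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical example: one player won 50 of 100 commonly played holes.
low, high = binomial_ci(50, 100)
# low = 0.402, high = 0.598 (up to rounding)
```

With few observations the half-width is large, which is why an interval of this kind can contain the model's estimate without providing strong evidence for it.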
Table 7 gives 95% confidence intervals for the PGA Championship data, for comparison with the probabilities found using all ShotLink data.

4. Conclusion

Assuming that the player probabilities we derived from ShotLink data are accurate for match-play, we calculate the probabilities of each player winning, losing, and tying a hole against each opponent at each par level (3, 4, and 5). To validate this assumption further, we need more match-play data for comparison. With our recursive algorithm, we can also find the probabilities of winning an 18-hole match. For the Ryder Cup, using the same recursive logic (but without termination states and pruning), we can predict which team is more likely to win the Day-3 singles match-play event, which consists of twelve matches. In the 2010 Ryder Cup, team Europe was leading 9.5 to 6.5 before Day-3 started. Our algorithm found an 8% chance of winning for team US. Even the chance of the US winning with a deficit of 3+ was around 4%, which suggested that a very exciting Day-3 event awaited us; at least this was an accurate prediction.

As a future goal, we are working on a team selection tool based on our probability model. The tool will assist in the team selection process by finding good player assignments given the opposing team's line-up.

We also assigned this model as a class project in our undergraduate applied probability course to gauge the students' reaction to the learning experience. Our purpose was to introduce an entertaining but stimulating problem that would raise student interest and make the subject matter more memorable. The feedback we received from the students was encouraging and useful for designing different implementations of this project assignment.
Our future plan is to design the project in milestones, so that students accomplish one task at a time: manipulating the data, calculating conditional probabilities, calculating match results, and calculating game results (in the Ryder Cup case). We want them to compare their results with the real-life Ryder Cup results to gain more faith in the method.

5. Acknowledgments

We would like to thank Kin Lo of PGA Tour Headquarters and the PGA Tour for providing us with the ShotLink data that was used in this research.

Appendix

Table 5. Ryder Cup Match-up Probabilities when All ShotLink data is used
Columns: Par-3 Win, Par-3 Tie, Par-4 Win, Par-4 Tie, Par-5 Win, Par-5 Tie
Stricker vs. Westwood     .6.59.3.467.33.45
Cink vs. McIlroy          .5.535.86.47.74.394
Furyk vs. Donald          .4.53.63.498.9.436
Johnson vs. Kaymer        .69.496.35.446.355.379
Kuchar vs. Poulter        .6.56.63.49.87.47
Overton vs. Fisher        .7.55.3.477.34.388
Watson vs. Jimenez        .45.54.9.469.395.377
Woods vs. Molinari, F     .97.563.33.54.436.357
Fowler vs. Molinari, E    .95.48.36.449.43.353
Mickelson vs. Hanson      .86.53.3.46.43.368
Johnson vs. Harrington    .5.55.7.479.86.45
Mahan vs. McDowell        .54.53.96.459.36.4
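Per-match win, tie, and loss probabilities can be combined into a team-level prediction for the singles session. The sketch below is one way to do this and makes assumptions that the text does not state explicitly: the matches are treated as independent, points are tracked in half-point units, and the names `team_points_distribution` and `prob_at_least` are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def team_points_distribution(
    matches: Sequence[Tuple[float, float, float]]
) -> Dict[int, float]:
    """Distribution of a team's session total, in half-points.

    matches[i] = (p_win, p_tie, p_loss) for match i from the team's perspective.
    A win is worth 2 half-points, a tie 1, and a loss 0. Matches are assumed
    independent (an assumption of this sketch).
    """
    dist: Dict[int, float] = {0: 1.0}
    for p_win, p_tie, p_loss in matches:
        nxt: Dict[int, float] = {}
        for pts, p in dist.items():
            for gain, q in ((2, p_win), (1, p_tie), (0, p_loss)):
                nxt[pts + gain] = nxt.get(pts + gain, 0.0) + p * q
        dist = nxt
    return dist


def prob_at_least(dist: Dict[int, float], points: float) -> float:
    """Probability that the team scores at least `points` points in the session."""
    need = round(2 * points)
    return sum(p for half_pts, p in dist.items() if half_pts >= need)


# Hypothetical two-match session with even matches and no ties.
demo = team_points_distribution([(0.5, 0.0, 0.5)] * 2)
demo_p = prob_at_least(demo, 1)  # 0.75: the team wins at least one of two matches
```

Given the points a team still needs entering the final day, `prob_at_least` applied to the twelve singles matches gives the chance of reaching that total.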
Table 6. PGA Championship results
Columns: Par-3 Win, Par-3 Tie, Par-4 Win, Par-4 Tie, Par-5 Win, Par-5 Tie
Stricker vs. Westwood     .9.54.8.47.36.5
Furyk vs. Donald          .56.578.7.47..538
Mickelson vs. Hanson      .34.48.99.463.375.53
Johnson vs. Harrington    .34.484.85.44.37.43
Mahan vs. McDowell        .9.5.59.534.458.375

Table 7. Confidence intervals of PGA Championship results
Columns: Par-3 Win, Par-3 Tie, Par-4 Win, Par-4 Tie, Par-5 Win, Par-5 Tie
Stricker vs. Westwood     (.,.348) (.4,.683) (.4,.357) (.385,.555) (.55,.456) (.337,.663)
Furyk vs. Donald          (.67,.45) (.457,.699) (.64,.89) (.396,.546) (.,.33) (.43,.674)
Mickelson vs. Hanson      (.83,.44) (.35,.63) (.9,.369) (.387,.54) (.7,.543) (.358,.74)
Johnson vs. Harrington    (.3,.338) (.36,.67) (.7,.35) (.35,.498) (.99,.454) (.89,.557)
Mahan vs. McDowell        (.76,.36) (.37,.673) (.83,.36) (.43,.638) (.59,.658) (.8,.569)

References

Barnett T. and Clarke S. (2005) Combining player statistics to predict outcomes of tennis matches. IMA Journal of Management Mathematics 16, 113-120.

Berry S. () Is Tiger Woods a winner? Mathematical Association of America Distinguished Lecture Series.

Fearing D., Acimovic J. and Graves S. (2011) How to catch a Tiger: Understanding putting performance on the PGA Tour. Journal of Quantitative Analysis in Sports 7.

Franks I. and McGarry T. (2003) The science of match analysis. Science and Soccer.

Goddard J. (2005) Regression models for forecasting goals and match results in association football. International Journal of Forecasting 21, 331-340.

Kostuk K., Willoughby K. and Saedt A. (2001) Modelling curling as a Markov process. European Journal of Operational Research 133, 557-565.

McHale I. (2010) Assessing the fairness of the golf handicapping system in the UK. Journal of Sports Sciences 28, 1033-1041.

Reilly T. and Williams A. (2003) Science and Soccer.

Scarf P. and Shi X. (2005) Modelling match outcomes and decision support for setting a final innings target in test cricket. IMA Journal of Management Mathematics 16, 161.

Scheid F. (1979) Golf competition between individuals. Winter Simulation Conference: Proceedings of the 11th Conference on Winter Simulation, San Diego, CA, United States, 55 5.

Sokol J. (2004) An intuitive Markov chain lesson from baseball. INFORMS Transactions on Education 5, 47-55.