The Importance of Skill and Luck on the PGA Tour

Carl Ackermann
Department of Finance
39 Mendoza College of Business
University of Notre Dame
Notre Dame, Indiana 46556-5646
(574) 63-8407 (phone)
(574) 63-555 (fax)
Ackermann.@nd.edu

Robert A. Connolly
Kenan-Flagler Business School
CB3490, McColl Building
UNC - Chapel Hill
Chapel Hill, NC 27599-3490
(99) 96-0053 (phone)
(99) 96-5539 (fax)
connollr@bschool.unc.edu

Richard J. Rendleman, Jr.
Kenan-Flagler Business School
CB3490, McColl Building
UNC - Chapel Hill
Chapel Hill, NC 27599-3490
(99) 96-388 (phone)
(99) 96-068 (fax)
richard_rendleman@unc.edu

We thank Tom Gresik and seminar participants at the University of Notre Dame for helpful comments. Please direct all correspondence to Richard Rendleman.

Abstract

Using a variety of methods, we study the relative contributions of skill and luck in determining the scores of professional golfers who play on the PGA Tour in the U.S. Fixed effects modeling establishes the presence of important changes in skill over the multi-year sample used in our empirical analysis. We apply a set of tests from the literature for hot hands, and employ a Markov-switching model on the scores of individual golfers. The results from standard tests reveal modest evidence of hot hands, while the Markov-switching approach delivers clear evidence of hot hands among some of the best players on the Tour.

1. Introduction

Like all sports, outcomes in golf involve elements of both skill and luck. Perhaps the highest level of skill in golf is displayed on the PGA Tour. Even among these highly skilled players, however, a small portion of each 18-hole score can be attributed to luck, or what players and commentators often refer to as good and bad breaks (sometimes called the "rub of the green"). The purpose of our work is to determine the extent to which skill and luck combine to determine 18-hole scores in PGA Tour events. We are also interested in whether PGA Tour players experience "hot" or "cold" hands, that is, runs of exceptionally good or bad scores due to temporary changes in their levels of skill.

From a psychological standpoint, understanding the extent to which luck plays a role in determining 18-hole golf scores is important. Clearly, a player would not want to make swing changes to correct abnormally high scores that were due, primarily, to bad luck. Similarly, a player who shoots a low round should not get discouraged if he cannot sustain that level of play, especially if good luck was the primary reason for his low score.

In some cases, the luck factor in a round of golf can be easily identified. In the final round of the 2002 Bay Hill Invitational, David Duval's approach shot to the 16th hole hit the pin and bounced back into a water hazard fronting the green. Duval took a nine on the hole. Few would argue that Duval's score of nine was due to bad judgment or a sudden change in his skill level, and it is highly unlikely that Duval made swing changes to correct the type of shot he made on 16. In contrast, the good fortune experienced by Craig Perks in the final round of the 2002 Players Championship, when he chipped in on holes 16 and 18 and sunk a 28-foot putt for birdie on the 17th hole en route to victory, is not something that Perks can expect to repeat on a regular basis.
Only time will tell whether Perks' victory, his first on the PGA Tour, reflected exceptional ability or four consecutive days of exceptionally good luck.

In other cases, specific occurrences of good and bad luck may not be nearly as easy to identify. Luck simply occurs in small quantities as part of the game. Even a player as highly skilled as Tiger Woods cannot produce a perfect swing on every shot. As a result, a certain element of luck is introduced to a shot even before contact with the ball is made.

In the next two sections of the paper, we show that after adjusting a player's score for his general skill level and the difficulty of the course on the day a round is played, the standard deviation of residual golf scores on the PGA Tour is approximately 2.7 strokes per round.¹ Although we have no direct evidence, we suspect that the standard deviation of residual scores would be much higher for amateur golfers. In Sections 3 and 4, we also investigate the nature and extent of hot and cold hands (streaky play) using a variety of tests, including a Markov-switching model (originally suggested by Albert (1993)). We apply tests from the sports statistics literature (i.e., runs tests, Markov-chain tests, and logistic models of performance persistence), but they produce quite limited evidence of hot and cold hands. By contrast, our Markov-switching model tests show that streaky play exists and has intuitively reasonable characteristics (e.g., streaks have a relatively short life and are mean-reverting), but that it is apparent in only a minority of the top PGA golfers whose individual play we study. A final section summarizes the paper.

2. Data and Statistical Methodology

We have collected individual scores for every player in every stroke-play PGA Tour event for years 1998-2001.² Our data cover scores of players who made and missed cuts.³ They also include scores for players who withdrew from tournaments and who were disqualified; as long as we have a score, we use it.⁴ Our data also indicate where each round was played. This is especially important for tournaments, such as the Bob Hope Chrysler Classic, that are played on more than one course.

1. In a related paper that focuses primarily on the performance of Tiger Woods, Berry (2001) estimates the mean standard deviation of golf scores from a predicted score to be slightly more than 3 strokes. His predicted score is based on a random effects model that takes into account the skill level of each player and the intrinsic difficulty of each round.

2. Our data include all stroke-play events for which participants receive credit for earning official PGA Tour money, even though some of the events, including all four majors, are not actually run by the PGA Tour. The data were collected, primarily, from internet sources that include the following: www.pgatour.com, www.golfweek.com, www.golfonline.com, www.golfnews.augustachronicle.com, www.insidetheropes.com, and www.golftoday.com. When we were unable to obtain all necessary data from these sources, we checked national and local newspapers and, in some instances, contacted tournament headquarters directly.

3. Although there are a few exceptions, after the second round of a typical PGA Tour event, the field is reduced to the 70 players, including ties, with the lowest scores after the first two rounds.

In an effort to separate the effects of skill and luck, we employ two statistical models to predict a player's score in each of his 18-hole rounds. In model 1, the simpler of the two models, a player's predicted score for a given round reflects two things: i) the level of skill he displays throughout the entire 1998-2001 period and ii) the difficulty of the course on the day the round is played. In model 2, an individual player's skill level is allowed to change through time. The basic form of model 1 is described below.

2.1. Performance Model 1

Let $\text{score}_{i,j}$ denote the 18-hole score of player $i$ (with $i = 1, 2, \ldots, n$) in PGA Tour round $j$ (with $j = 1, 2, \ldots, m$). For statistical purposes, a round is defined as an 18-hole round of play on a specific course.⁵ The predicted score for player $i$ in round $j$ is determined by the following fixed-effect multiple regression model:⁶

$$\text{score}_{i,j} = \alpha + \sum_{k=2}^{n} \beta_k \, \text{player}_k + \sum_{c=2}^{m} \gamma_c \, \text{round}_c + \varepsilon_{i,j} \qquad (1)$$

In (1), $\text{player}_k$ is a player dummy that takes on a value of 1 if $i = k$ and zero otherwise, $\text{round}_c$ is a round dummy that takes on a value of 1 if $j = c$ and zero otherwise, and $\varepsilon_{i,j}$ is an error term with $E(\varepsilon_{i,j}) = 0$. In the model, Tiger Woods is player $k = 1$, and the first round of the 1998 Mercedes Championships, the first event of our sample, is round $c = 1$. Therefore, the regression intercept, $\alpha$, represents the score that Tiger would be expected to shoot in the first round of the 1998 Mercedes Championships, $\beta_k$ represents the additional amount that player $k$ would be expected to score in the first round of the 1998 Mercedes, and $\gamma_c$ denotes the additional amount that all players, including Tiger, would be expected to shoot in any other round.

The only restriction placed on the data is that a player must have recorded at least four 18-hole scores during 1998-2001 for his scores to be included in the regression estimate.⁷ With this restriction, the data include 75,054 scores for 80 players recorded over a possible 848 rounds, although no single player participated in more than 457 rounds.

Based on the estimate of the intercept of model 1, Tiger Woods would have been expected to shoot 68.8 in the first round of the 1998 Mercedes Championships. To estimate what other players would have been expected to shoot in the same round, one must add the estimated player-specific $\beta$ coefficients, which we refer to as skill coefficients, as illustrated in the model 1 section of table 1.⁸ Clearly, all of the player skill coefficients summarized in table 1 are significantly different from zero. This is not a direct test, however, of whether a given player's level of skill is significantly different from that of Tiger Woods, since there is a 0.53 standard error associated with the estimate of the intercept in model 1.

To predict what any player, including Tiger Woods, should shoot in any other round of the PGA Tour, one must add the estimate of the round coefficient, examples of which are shown in table 2. No specific information about course conditions, adverse weather, etc. is provided in the estimation of round coefficients. Nevertheless, if such conditions combine to produce abnormal scores in a given 18-hole round, the effect of these conditions will be reflected in the estimated coefficients. Note that the highest round coefficient is associated with the portion of the third round of the 1999 AT&T Pebble Beach National Pro-Am played on the Pebble Beach course.⁹ In this particular round, wind and rain combined to make the famed course extremely difficult and forced tournament officials to cancel the event during the fourth round due to unplayable conditions.

4. Typically, the daily scores of a player who withdraws or is disqualified from a PGA Tour event are not shown in the final summary of scores. However, in these cases we went back to the scoring summaries at the end of each round prior to withdrawal or disqualification to pick up the earlier scores.

5. Most PGA Tour events are played on a single course over a four-day period, with a single 18-hole round scheduled for each day. As such, for statistical purposes, these tournaments consist of four rounds. However, over the 1998-2001 period, seven tournaments per year were played on more than one course. In the Bob Hope Chrysler Classic, the first four rounds are played on four different courses using a rotation that assigns each tournament participant to each of the four courses over the first four days. A cut is made after the fourth round, and a final round is played on the fifth day on a single course. Thus, for statistical purposes, the Bob Hope tournament consists of 17 rounds: four for each of the first four days of play and one additional round for the fifth and final day.

6. Berry (2001) employs a similarly constructed random effects model that takes into account the skill level of each player and the intrinsic difficulty of each round to measure the performance of Tiger Woods relative to others on the PGA Tour.

7. Given the structure of the model, four scores per player is the minimum required for the regression to be of full rank.

8. It is not necessary for a player to have actually participated in the Mercedes Championships to compute his skill coefficient. For example, Nick Price did not participate in the 1998 Mercedes tournament. Nevertheless, his .3 skill coefficient represents the strokes that one should add to Tiger's base case score of 68.8 to estimate what Price's 18-hole score would have been in the first round if he had actually played.

9. Approximately one third of the field in the 1999 AT&T Pebble Beach National Pro-Am played the third round on the Pebble Beach course, with the remainder of the participants playing the same round at Spyglass and Poppy Hills.
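The fixed-effect specification of model 1 amounts to ordinary least squares on an intercept plus player and round dummies. A minimal sketch in Python with NumPy follows, assuming zero-based integer codes in which player 0 and round 0 take the baseline roles that the text assigns to Tiger Woods and the first round of the 1998 Mercedes Championships. The function name `fit_fixed_effects` and its arguments are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fit_fixed_effects(player, rnd, score, n_players, n_rounds):
    """OLS with player and round dummies (player 0 and round 0 are baselines).

    Returns (alpha, beta, gamma, residuals), where beta[0] = gamma[0] = 0
    by construction, mirroring the omitted dummies in the regression.
    """
    player = np.asarray(player)
    rnd = np.asarray(rnd)
    score = np.asarray(score, dtype=float)

    n_obs = score.shape[0]
    n_cols = 1 + (n_players - 1) + (n_rounds - 1)
    X = np.zeros((n_obs, n_cols))
    X[:, 0] = 1.0                                  # intercept (alpha)
    for row in range(n_obs):
        if player[row] > 0:
            X[row, player[row]] = 1.0              # dummy for player k
        if rnd[row] > 0:
            X[row, n_players - 1 + rnd[row]] = 1.0  # dummy for round c

    coef, *_ = np.linalg.lstsq(X, score, rcond=None)
    alpha = coef[0]
    beta = np.concatenate([[0.0], coef[1:n_players]])
    gamma = np.concatenate([[0.0], coef[n_players:]])
    residuals = score - X @ coef
    return alpha, beta, gamma, residuals
```

A predicted score for player k in round c is then `alpha + beta[k] + gamma[c]`. A dense design matrix is used only for readability; with tens of thousands of scores a sparse representation would be the more practical choice.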

2.2. Performance Model 2

In the previous model, a player's skill coefficient is assumed to be constant throughout the entire 1998-2001 period. In the second model, a player's skill coefficient is allowed to vary through time.

The time pattern of potential variation in a player's level of skill should be very specific to that player. It could reflect injury, illness, a change in equipment, an intended or unintended change in technique, and other factors. Such patterns of change should not be common to all players. Nevertheless, if a permanent change in skill occurs, it is most likely to occur between PGA Tour seasons during the months of November and December, when no official PGA Tour events are scheduled.¹⁰ During this period, most Tour participants take time off from active play, and some will make major changes in technique and equipment if warranted. Also, many players are bound to specific equipment manufacturers by contracts that extend through entire PGA Tour seasons. Therefore, in our second model, we allow the skill coefficients of individual players to vary by calendar year.

The simplest way to allow for annual changes in skill coefficients would be to estimate model 1 separately for each of the four years of the sample period. However, by estimating the model in this way, there would be no way to test for the significance of changes in a player's skill through time. Moreover, the data requirements for estimating four separate regressions would not be the same as those of model 1 and, therefore, the samples for the two models would not be consistent. An alternative approach would be to estimate a single regression equation with separate skill coefficients estimated for each player in each calendar year. However, experiments with a model in this form revealed that a number of players whose scores were included in the estimation of model 1 would have to be removed from the original sample for the regression to be of full rank.
To overcome the potential problem of singularity, while maintaining the same sample employed in the estimation of model 1, we employ the following method for allowing an individual player's skill coefficient to vary through time. If a full calendar year has passed, at least 5 scores were used in the estimation of the previous skill coefficient for the same player, and at least 5 additional scores were recorded for the same player, then a new incremental skill coefficient for that player is estimated. 5 scores was used as a cutoff for two reasons. First, a sample of 5 scores is sufficiently large to detect changes in skill that are statistically significant. Second, it is sufficiently small to allow changes in skill coefficients in each calendar year for many active players, such as Colin Montgomerie, who split time between European Tour and PGA Tour events and, accordingly, do not participate in a full complement of tournaments on the PGA Tour.

Let $\text{score}_{i,j}$ denote the 18-hole score of player $i$ (with $i = 1, 2, \ldots, n$) in PGA Tour round $j$ (with $j = 1, 2, \ldots, m$), $q_k$ denote the number of time-varying incremental skill coefficients estimated for player $k$, and $\beta_{k,t}$ denote the value of incremental skill coefficient $t$ for player $k$. Then, the basic version of model 2, denoted as model 2.1, is specified as follows:

$$\text{score}_{i,j} = \alpha + \sum_{t=2}^{q_1} \beta_{1,t} \, \text{player}_{1,t} + \sum_{k=2}^{n} \sum_{t=1}^{q_k} \beta_{k,t} \, \text{player}_{k,t} + \sum_{c=2}^{m} \gamma_c \, \text{round}_c + \varepsilon_{i,j} \qquad (2.1)$$

As in model 1, Tiger Woods is player $k = 1$, and the first round of the 1998 Mercedes Championships is round $c = 1$. Therefore, the regression intercept, $\alpha$, represents Tiger's expected score in the first round of the 1998 Mercedes Championships. The term $\sum_{t=2}^{q_1} \beta_{1,t} \, \text{player}_{1,t}$ picks up potential incremental changes in skill for Tiger starting in 1999. In equation (2.1), the skill coefficient $\beta_{k,1}$ is estimated in each of the $q_k$ estimation periods for player $k$, the coefficient $\beta_{k,2}$ is estimated in all periods starting in period 2, and so on; that is, the dummy $\text{player}_{k,t}$ equals 1 for player $k$'s scores in period $t$ and in every later period.

Table 3 helps to clarify how the player dummies are structured in the estimation of the skill coefficients of model 2.1. The structure illustrated for Phil Mickelson is the most straightforward. As indicated in the table, Phil played in 5 or more rounds in each of the four years, 1998-2001.
For Phil, $\beta_{k,1}$ can be interpreted as his level of skill in 1998 in relation to Tiger Woods' 1998 skill level, $\beta_{k,2}$ can be interpreted as Phil's incremental change in skill starting in 1999, etc.

10. The months of November and December are often referred to as the "silly season," a period in which several unofficial events are scheduled. Typically, these events are played in a non-stroke-play format and tend to be limited to a small number of select players. None of these tournaments are included in our sample.
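One way to read the rule for starting a new incremental skill coefficient is as a grouping of calendar years into estimation periods. The sketch below is an interpretation under stated assumptions rather than the authors' procedure: a period stays open until it has accumulated the minimum number of scores, the next calendar year then opens a new period, and a trailing period that never reaches the minimum is folded into the previous one. The function name `assign_skill_periods` is hypothetical, and the cutoff is left as the parameter `min_scores` rather than hard-coded.

```python
def assign_skill_periods(scores_per_year, min_scores):
    """Map each calendar year to an estimation period (1, 2, ...).

    scores_per_year: number of scores a player recorded in each
    consecutive calendar year of the sample.
    min_scores: minimum number of scores required both in the previous
    period and in the new one before a new coefficient is estimated.
    """
    periods = []
    current = 1       # period currently being filled
    in_period = 0     # scores accumulated in the current period
    for count in scores_per_year:
        periods.append(current)
        in_period += count
        if in_period >= min_scores:
            # Period is large enough; the next year opens a new one.
            current += 1
            in_period = 0

    # A trailing period with too few scores gets no coefficient of its
    # own, so its years are merged into the previous period.
    if periods and periods[-1] == current and current > 1:
        periods = [p if p < current else current - 1 for p in periods]
    return periods
```

With an illustrative cutoff of 25 (an assumption, not a value taken from the text), a player with ample scores in every year, `[80, 80, 80, 80]`, maps to `[1, 2, 3, 4]`, while a sparse early record such as `[12, 19, 2, 35]` maps to `[1, 1, 2, 2]`: the first two years share one coefficient and the last two share an incremental one.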

Note that Matt Kuchar recorded an insufficient number of rounds in either 1998 or 1999 to meet the 5-round minimum requirement, but together, his rounds in these two years exceed the minimum. Therefore, for Kuchar, the skill coefficient $\beta_{k,1}$ is structured to cover the entire two-year period 1998-1999. Kuchar recorded only two scores in 2000, but in 2001 he recorded 35. As a result, the coefficient $\beta_{k,2}$ represents the incremental change in skill for Matt over the 2000-2001 period estimated in relation to that achieved in the 1998-1999 period. The coefficient structure for Charles Howell reflects that he did not begin play on the PGA Tour until 2000, and the structure for Chris Smith reflects that he did not participate in any PGA Tour events in 2000.

The second panel of table 1 (labeled Model 2.1) shows time-dependent values of $\beta_{k,t}$ for the active players on the PGA Tour with the lowest overall skill coefficients. Although Tiger Woods was the most highly skilled player on the PGA Tour in 1998, the $\beta_{k,t}$ coefficients indicate that he improved almost a full stroke per round in 1999 and again in 2000. However, in 2001, his skill level deteriorated somewhat, causing his average score per round to increase by .3 strokes. Note that the two coefficient values reported for David Duval, 0.07 and 0.008, together indicate that Duval's skill level in 1999 produced an average score that was 0.064 strokes lower than that of Tiger Woods in 1998. In other words, Duval appeared to be a slightly stronger player in 1999 than Woods was in 1998. However, Woods improved even more than Duval in 1999. With his improvement, Woods ended 1999 being stronger than Duval by approximately 0.89 strokes per round.

2.3. Testing the Significance of Long-Term Skill Changes

As structured, the $\beta_{k,t}$ coefficients in model 2.1 provide estimates of incremental but permanent changes in skill starting in period $t$.
For a player who meets the 5-round minimum requirement in each year of the 1998-2001 sample period, each $\beta_{k,t}$ represents an estimate of a permanent change in the player's skill level. To obtain estimates of skill changes over longer periods, successive values of $\beta_{k,t}$ can be added together. For example, $\beta_{k,2} + \beta_{k,3}$ is an estimate of a player's change in skill over periods 2 and 3. Although estimating skill changes for more than one period is straightforward, the model must be restructured to estimate the statistical significance of such changes. To develop significance tests for changes in skill over longer time periods, we estimate the regression models denoted as models 2.2 and 2.3 and summarized in table 4. Together, these models, along with model 2.1, can be used to test the hypotheses regarding changes in skill summarized in table 5.

For each of the hypothesis tests summarized in table 5, the proportion of players for whom the test (H1 through H3) is significant at the 5 percent level is substantially higher than 5 percent. Therefore, one can conclude that more players than can be expected by random chance experienced significant changes in their skill levels from one period to the next. Moreover, an even larger proportion of these players experienced significant changes in their skill levels over periods of two to three years (H4 through H6). Finally, these results suggest that model 1, which assumes a constant level of skill for each player for the entire 1998-2001 period, does not provide an adequate estimate of what a player's score should be in a given round.

From a golfing perspective, it is interesting to note that the results of hypothesis tests 4 through 6 indicate that approximately one out of seven players experienced statistically significant long-term changes in skill over the sample period. Assuming symmetry in these changes, so that roughly half of them are improvements, this suggests that 13 of 14 players on the PGA Tour are probably as good as they are ever going to get, and even those who work their way up to the Tour are not likely to get much better. Although our tests do not address this issue directly, it is an interesting question whether the skill-change dynamics for players who are attempting to qualify to play on the PGA Tour are similar to what we have seen for players already on the Tour.

Table 6 provides the names of players who showed the most improvement over the various calendar year-based sub-periods from 1999 to 2001.
Although the player dummies and associated coefficients of models 2.1-2.3 are not aligned by calendar year, it is possible to reconstruct the results to determine the actual calendar year-based sub-periods to which the coefficients apply.

Two results from table 6 are striking. First, Tiger Woods, who began 1998 as the most highly skilled player on the PGA Tour, improved more than any player other than Mike Weir from 1999-2000. It is hard to believe that a player of Tiger's skill level could get even better, yet the results summarized in table 6 indicate that Tiger improved by 1.880 strokes during this period. A second striking result is that Phil Mickelson was the fourth most improved player over this same period and the fifth most improved over the three-year period, 1999-2001. Those who follow the PGA Tour know that Mickelson has come under repeated media attack in recent years for choking and not playing up to his potential (especially in major tournaments),¹¹ yet the results of table 6 indicate that, like Woods, Mickelson has also shown significant improvement.

Table 7 lists the best 20 PGA Tour players as of the end of the 2001 season based on the sum of each player's skill coefficients from model 2.1. The Official World Golf Ranking is also shown for each player as of 11/04/01, the Ranking date that corresponds to the end of the official 2001 PGA Tour season. The World Golf Ranking is based on a player's most recent rolling two years of performance, with points being awarded based on position of finish in qualifying worldwide golf events; the ranking does not reflect actual player scores.¹²,¹³ Despite being based on two entirely different criteria, the two ranking methods produce a similar list of players. Only two players in the top 10 of the Official World Golf Ranking as of 11/04/01, Ernie Els (4) and Darren Clarke (10), do not appear in table 7, and only five players listed in table 7 were not ranked among the World Golf Ranking's top 20.

An interesting result in table 7 is that the sum of skill coefficients from model 2.1 is negative for four of the 20 players. Recall that all skill coefficients are measured relative to Tiger Woods' level of skill in 1998. Therefore, the negative skill coefficients are an indication that these four players, including Tiger himself, were playing at a higher level of skill at the end of 2001 than Tiger was playing in 1998.

11. Interestingly, during the 1998-2001 period, Mickelson actually played better than normal in the 16 major tournaments in which he participated. Over 62 major tournament rounds, Mickelson's average residual score was -0.564 strokes per round, broken down as -1.408 strokes per round on average in the first two rounds and +0.343 strokes per round over the last two rounds of each tournament.
Rather than interpreting these results as Mickelson choking, it appears that, on average, Mickelson played exceptionally well in the first two rounds of major tournaments over the 1998-2001 period and just slightly worse than average over the last two rounds. However, in comparison to his exceptional performance in the first two rounds, his performance in the last two rounds looks bad. By contrast, Tiger Woods, who gets credit for playing his best in majors, had an average residual score of -0.633 strokes per round, almost the same as Mickelson, but his better-than-average play was more evenly distributed among tournament rounds.

12. Details of the ranking methodology can be found in the "about" section of the Official World Golf Ranking web site, www.officialworldgolfranking.com.

13. Berry (2001) suggests that the player skill coefficients of his random effects model may be better than the World Golf Rankings for ranking professional golfers.

3. Analysis of Hot Hands

3.1. Sort Group Analysis

The preceding section establishes that more players than can be expected by random chance experience statistically significant permanent changes in skill between PGA Tour seasons. The question then arises as to whether non-permanent skill changes occur more frequently than can be expected by random chance. We refer to such non-permanent skill changes as hot and cold hands: hot when a player's scores are lower than normal and cold when scores are higher than normal.

Our first test of hot hands focuses on whether players who record exceptionally good or bad scores on one day of a tournament can be expected to continue their exceptional performance in subsequent rounds of the same tournament. For this test we examine the performance of players in all adjacent tournament rounds that are not divided by a cut. A typical PGA Tour tournament involves four rounds of play with a cut being made after the second round. Therefore, for a typical tournament, we compare performance between rounds 1 and 2 and also between rounds 3 and 4. We do not compare performance between rounds 2 and 3, because approximately half of the players who participate in round 2 are cut and do not continue for a third round. In contrast to the typical tournament, the Bob Hope Chrysler Classic involves five rounds of play with a cut being made after the fourth round. Therefore, in the Bob Hope tournament we compare performance between rounds 1 and 2, rounds 2 and 3, and rounds 3 and 4, but do not compare performance between rounds 4 and 5.

For the purposes of this test, we sort all residual scores from the performance model in the first of each pair of qualifying rounds and place each player into one of 10 categories of approximately equal size based on the ranking of residuals in the first of each qualifying two-round pair.
Within each sort category we then compute the average residual score in the first and second of the two adjacent rounds, with the following small adjustment being made to each residual score in the second of the two rounds when computing the average second-round score. The adjustment reflects that the sum of residual scores for each individual player is zero. As such, if the residual score in the first of two adjacent rounds is positive, the expected score in each of the remaining rounds must be slightly negative, and if the first-round score is negative, the expected second-round score must be slightly positive. Therefore, we adjust each residual score in the second of two adjacent rounds by adding the first-round residual divided by the number of rounds recorded by the same player minus 1. This adjustment allocates the built-in bias from conditioning on the first-round residual evenly among the remaining rounds for the same player.

Table 8 summarizes the results of this test. Estimates reported in the first panel of table 8 show that for the entire 1998-2001 period, the estimated average residual score in the first sort group for the first of two qualifying adjacent rounds was -5.43 strokes. This means that, on average, players in the first group scored 5.43 strokes better (lower) than predicted by the model. If a portion of the 5.43 strokes represents a temporary change in skill for these same players, rather than random good luck, scores in the next round should continue to be lower than predicted. Otherwise, scores of players in the first sort group should revert back to normal. Table 8 shows that the average residual score of the same players in the next round was -0.05 strokes. Therefore, of the 5.43 strokes of abnormally good play experienced in the first of two adjacent rounds, it appears that no more than 0.05 strokes can be attributed to an improvement in skill that carried over into the next round. On the basis of this test, as a group, players in the first sort category cannot expect their good fortune to continue to any significant extent into the next round.

With the exception of the last two sort groups, this same tendency is evident throughout table 8; regardless of the first-round sort category, average residual scores in the second of two adjacent rounds are very close to zero. Therefore, with respect to the first 8 sort groups, we conclude that significant differences between actual and predicted scores are due, primarily, to luck, and that these differences cannot be expected to persist.

We also run a simple least squares regression within each sort group where the second residual in each pair is regressed against the first.
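The adjustment and the within-group regression described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the residual arrays are assumed to hold one entry per qualifying pair of adjacent rounds, `n_rounds_player` holds the total number of rounds recorded by the player in each pair, and the function names are hypothetical. Because each player's residuals sum to zero, conditioning on a first-round residual r leaves the remaining n - 1 residuals with an expected value of -r / (n - 1), which the adjustment adds back.

```python
import numpy as np

def adjust_second_residual(first_resid, second_resid, n_rounds_player):
    """Add back the bias created by conditioning on the first residual."""
    first_resid = np.asarray(first_resid, dtype=float)
    second_resid = np.asarray(second_resid, dtype=float)
    n_rounds_player = np.asarray(n_rounds_player, dtype=float)
    return second_resid + first_resid / (n_rounds_player - 1.0)

def sort_group_summary(first_resid, second_adj, n_groups=10):
    """Sort pairs by first-round residual and summarize each group.

    Returns one tuple per group: (mean first residual,
    mean adjusted second residual, slope of second on first).
    """
    first_resid = np.asarray(first_resid, dtype=float)
    second_adj = np.asarray(second_adj, dtype=float)
    order = np.argsort(first_resid, kind="stable")
    summary = []
    for idx in np.array_split(order, n_groups):
        x = first_resid[idx]
        y = second_adj[idx]
        slope = np.polyfit(x, y, 1)[0]   # least squares slope within group
        summary.append((x.mean(), y.mean(), slope))
    return summary
```

If first-round residuals were pure luck, the mean adjusted second residual and the slope in each group would both be close to zero, which is the pattern the text reports for most sort groups.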
If players within a group tend to continue with similar performance, this regression coefficient should be positive and significant. Within the first panel of table 8, only two such coefficients are positive and also significant at the 5 percent level: those for sort groups 3 and 10. The fact that the coefficients for sort groups 2 and 4 are negative indicates that nothing unusual is likely to be happening in the third sort group. Moreover, the fact that the average second-round residual in group 3 is so close to zero (-0.046) indicates that any tendency for the direction of abnormal performance to persist in this sort group is not accompanied by a level of performance that is significant in terms of the actual golf scores being achieved; no player is going to get excited about the potential for improving 0.046 strokes from one round to the next.¹⁴

One possible explanation for the different pattern of average residuals in categories 9 and 10, and the significant positive regression coefficient in sort group 10, is that players who have performed poorly prior to a cut may take more chances in an attempt to make the cut. If riskier play tends to be accompanied by higher scores, this pattern should emerge. An alternative explanation is that many players in the last two sort groups may consider it a foregone conclusion that they will miss the cut and, perhaps, do not give their best effort in the second of the two adjacent rounds. Although the two explanations are quite different, both involve players changing their normal method of play when the second of the two rounds is followed by a cut. We can test for this tendency by separating the average residual scores into those for which the second of the two adjacent rounds is immediately followed by a cut and those for which a cut does not occur after the second of the two rounds.

Results of these tests are summarized in the second and third panels of table 8. For all sort groups except groups 9 and 10, there is little difference in performance when a cut occurs after the second of two qualifying adjacent rounds (panel 2) and when it does not (panel 3). However, for sort categories 9 and 10, the average residual in the second of the two adjacent rounds is higher (worse) when a cut occurs immediately afterward, and for sort category 10, the regression of the second-round residual on the first is significant only when a cut occurs after the second round. Thus, we conclude that tournament participants who perform exceptionally poorly in the earliest round(s) may change their method of play in the round before a cut to reflect the high probability of being cut. Otherwise, the threat of being cut does not appear to affect player performance.

3.2. Runs Test of Hot Hands

To detect the presence of hot hands in the performance of individual PGA Tour players, we also perform a series of standard non-parametric runs tests. For the purposes of conducting the runs tests, we do not include all players in the sample. Of the 80 players in the sample, 40 recorded 10 or fewer scores over the entire 1998-2001 period, and another 7 recorded between 11 and 100 scores. These 547 players participated in relatively few tournaments, and the tournaments in which they did participate are likely to have been separated significantly in time. As a result, the concept of a run, a string of better-than-normal or worse-than-normal scores, means little among these players. Our tests focus on two groups of players: those who recorded more than 100 scores (43 players) and those who recorded at least 5 scores in each of the four years of the sample period (8 players). The second group is the same group for which an incremental skill coefficient was calculated in each of the four years in connection with model 2.

The runs tests are conducted as follows, using the performance model as the basis for computing residual scores. For each player, let $N_1$ denote the number of residual scores less than or equal to zero, $N_2$ the number greater than zero, and $R$ the total number of runs. Then, the expected number of runs and the standard deviation of the number of runs are as follows:¹⁵

$$E(\text{runs}) = \frac{2 N_1 N_2}{N_1 + N_2} + 1 \qquad \text{and} \qquad \sigma(\text{runs}) = \sqrt{\frac{2 N_1 N_2 \left( 2 N_1 N_2 - N_1 - N_2 \right)}{\left( N_1 + N_2 \right)^2 \left( N_1 + N_2 - 1 \right)}}.$$

For $N_1 + N_2 > 20$, the test statistic, $Z$, is approximately normally distributed with a mean of zero and a standard deviation of 1.0:

$$Z = \frac{R - E(\text{runs})}{\sigma(\text{runs})}$$

Although a two-sided runs test is typically used to test for independence in a time series, our concern is not whether residual golf scores are independent, but rather, whether players display tendencies to have fewer runs than would be expected from a purely random series. Thus, for each player we employ a one-sided test of whether $Z = \left( R - E(\text{runs}) \right) / \sigma(\text{runs}) < 0$.

Table 9 summarizes the results of our initial set of runs tests. For the sample that includes the 43 players who recorded more than 100 scores, 8 players, or 7.4 percent of the players in the group, exhibited runs that were significant at the 5 percent level. To test for the significance of the

14. We also ran the same regression on the whole sample, over 38,000 observations. While the slope coefficient was statistically significant, its tiny size (0.04) implies that a golfer would need a residual of 25.0 to gain a single stroke in the next round.

15.
See Mendenhall, Scheafler and Wacerly (98), p. 60.

4 results across all 43 players, an average Z-statistic of -0.639 and standard deviation of the Z- statistic of.0700 were computed for the entire group. The statistic, g( Z) = Z 0 Z / N + N, σ ( ) where Z is the average value of Z and σ ( Z ) is the standard deviation of the Z values, tests whether the average Z-value among all players in the group is significantly less than zero. A p- value of 0.0085 indicates that the number of significant runs among the players in the group is greater than can be expected by chance. A similar test, limited to the 8 players who recorded at least 5 scores in all four years, gives almost identical results. In Appendix A, we describe the consequences of a rough adjustment for the impact of the cut on the runs test. In Table 0, the first column reports the same runs test applied to scores to a subsample of the top 30 golfers determined by the sum of their sill coefficients under model.. 6 The performance of only two of these golfers, Tiger Woods and Jim Fury, appears to reject the null hypothesis of the runs test in favor of fewer streas than expected. 3.3. Marov Chain Test for Hot Hands Albright (993) studies hitting streas in baseball using several methods. One test focuses on whether a string of successive at-bats forms a first-order Marov chain. The test statistic is given by where χ [ ] / (3) M M00M M0M0 M0 M Mij is the number of times that that state i is followed by state j and Mi is the number of times the i th state occurs. Under the null hypothesis of randomness, this statistic has a chi-squared distribution with one degree of freedom. In our application, the two states represent better and worse-than-average performance, and this is measured empirically by negative and positive residuals from the fixed effects regression (Model ). We apply this method to a subsample of the top 30 golfers determined by the sum of their sill coefficients under model.. The results are summarized in Table 0. 
We find that three golfers, Tiger Woods, Jim Fury, and Chris Perry, show evidence of streay play. Test statistics for 7 of 30 golfers fail to reject the null hypothesis of random success, once the longer-term fixed effects associated with permanent sill and permanent sill changes are removed.
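
The two non-parametric tests in Sections 3.2 and 3.3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used in the paper: the function names `runs_test` and `markov_chain_test` are hypothetical, the input is assumed to be one player's residual scores in time order, the Markov chain statistic follows the form $M[M_{00}M_{11} - M_{10}M_{01}]^2/(M_0^2 M_1^2)$, and $M$ is assumed to be the total number of observed transitions, with $M_i$ counted over transitions that leave state $i$.

```python
import math


def runs_test(residuals):
    """One-sided runs test on the signs of one player's residual scores."""
    worse = [r > 0 for r in residuals]      # True: residual above zero (worse than normal)
    n2 = sum(worse)                         # N_2: residuals greater than zero
    n1 = len(worse) - n2                    # N_1: residuals less than or equal to zero
    runs = 1 + sum(a != b for a, b in zip(worse, worse[1:]))
    n = n1 + n2
    expected = 2.0 * n1 * n2 / n + 1.0
    sd = math.sqrt(2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1)))
    z = (runs - expected) / sd              # needs both signs present, else sd is zero
    # Lower-tail probability P(Z <= z) under the normal approximation.
    p_lower = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return {"runs": runs, "expected": expected, "sd": sd, "z": z, "p": p_lower}


def markov_chain_test(residuals):
    """Chi-squared (1 d.f.) test of a first-order Markov chain in residual signs."""
    # State 0: better than normal (residual <= 0); state 1: worse than normal.
    states = [0 if r <= 0 else 1 for r in residuals]
    m = [[0, 0], [0, 0]]
    for prev, nxt in zip(states, states[1:]):
        m[prev][nxt] += 1                   # m[i][j]: times state i is followed by state j
    m0, m1 = sum(m[0]), sum(m[1])           # assumption: M_i counts transitions out of state i
    total = m0 + m1                         # assumption: M is the number of transitions
    chi2 = total * (m[0][0] * m[1][1] - m[1][0] * m[0][1]) ** 2 / (m0 ** 2 * m1 ** 2)
    # Upper-tail probability of a chi-squared variable with one degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return {"counts": m, "chi2": chi2, "p": p_value}


if __name__ == "__main__":
    # Made-up residuals: four good rounds followed by four poor rounds.
    streaky = [-1.2, -0.4, -2.0, -0.7, 0.9, 1.5, 0.3, 2.1]
    print(runs_test(streaky))
    print(markov_chain_test(streaky))
```

A strongly negative `z` (few runs) or a large `chi2` points toward streaky play. With real data, the normal approximation in `runs_test` is only appropriate for players with enough scores, which is why the text restricts the sample.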

3.4. Logistic Model Test for Hot Hands

Albright also considers the possibility that situational variables might affect the onset (and perhaps persistence) of streaky play. In this setting, a logistic regression model may be used to isolate evidence of streaky play:

$$\ln\left(\frac{P_n}{1-P_n}\right) = \alpha + \theta H_n + \sum_{k=1}^{K}\lambda_k Z_{nk} \qquad (4)$$

where $P_n = P(X_n = 1)$ and $X_n = 1$ indicates a better-than-normal golf round, $\ln[P_n/(1-P_n)]$ is the log odds of a better-than-normal golf round (i.e., a negative residual from Model ), $H_n$ is a history of success variable, and $Z_n$ represents a sequence of situational variables for the nth golfer that affect the odds of success (e.g., proximity to the leader). If the $\theta$ term is positive, the history of recent success raises the odds of success, an indication of streaky play.

In his application to hitting streaks in baseball, Albright tried a number of (arbitrary) ways to define the history of success variable. [17] In our setting, we measure recent history as the cumulated residuals from fixed effects Model for the past 16 rounds (roughly four tournaments, normally a month's time). [18] As for situational variables, we use the tournament week number, the tournament round (first, second, etc.), the number of strokes a golfer is behind the leader (which captures whether a golfer is in contention), and the interaction of the tournament round and distance from the leader (since being close to the leader at the beginning of play on the last day is more important than early in a tournament).

16. A motivating factor in our selection of these 30 top players is that they were the least likely to be cut, and, therefore, the sequence of golf scores for this group was largely free of any effects from the cut.

17. A cautionary note is in order here. In their comment on the Albright paper, Stern and Morris (1993) argue that the logistic regression approach has low power and is plagued with a negative bias in the estimate of the slope coefficient on the history variable. Albright takes issue with a portion of their comment in his rejoinder.

18. We also experimented with four, eight, and 12-round cumulative performance, but the 16-round horizon produced the strongest case favoring streaky play.

We estimated (4) for the same subsample of the top 30 golfers used in the Markov Chain test. In our initial tests, we found clear evidence that the impact of cumulative residuals (our principal golfer-specific history variable) varied from year to year. Consequently, we interacted this variable with year dummy variables. In Table 10, we report on several dimensions of this portion of our data analysis. In column A under the Wald Test P-Value heading, we report the p-value for a chi-squared test whose null hypothesis is that the coefficients on the history variables are zero in the logistic regression. This test rejects the null hypothesis for only four golfers out of 30. Thus, history doesn't seem to matter very much in predicting success, the opposite of what we might expect to see if streaky play were common. We also report for each year an estimate of the derivative of the success probability in that year with respect to our principal history variable, the cumulative 6-week score, adjusted for fixed effects. The estimated effects are fairly small and uniformly positive for only one in ten golfers (and only one of the three sets of estimates is significantly positive).

Finally, we also estimate the impact of four situational variables (described earlier) on the probability of success. In column B under the Wald Test P-Value heading, we report the p-value for a chi-squared test whose null hypothesis is that the coefficients on these four situational variables are zero in the logistic regression. The p-values suggest that these variables have no significant impact on the probability of success for any golfer in our sample of 30. Thus, we find little evidence of hot hands among golf's leading players using methods applied elsewhere in the sports statistics literature.

In his comment on Albright's work, Albert (1993) proposes a Markov-switching model of hitting probabilities. In his rejoinder, Albright endorses this approach, saying, "it is one of the most reasonable alternatives to the independent Bernoulli model I have seen" (pg. 96). In the next section, we evaluate the utility of a Markov-switching model in capturing streaky play among golfers.

4. Markov-Switching Model of Performance

4.1. Basics of Markov-Switching Model

Since Markov-switching models have been analyzed extensively elsewhere (see Hamilton, 1994, Ch. 22), we offer only a brief review of this modeling approach. An easy starting point is to consider a segmented trends model where a latent variable $s_t$ takes a value of 1 or 2, thereby indicating whether a golfer is playing well or playing poorly. A golfer's average performance in the good state is given by $\mu_1$, and the golfer's average performance in the bad state is $\mu_2$. The variance of performance in each state is indicated by $\sigma_1^2$ and $\sigma_2^2$. The model is completed by assuming that a Markov chain describes the evolution of the latent (state) variable:

$$\begin{aligned}
p(s_t = 1 \mid s_{t-1} = 1) &= p_{11} \\
p(s_t = 2 \mid s_{t-1} = 1) &= p_{12} \\
p(s_t = 1 \mid s_{t-1} = 2) &= p_{21} \\
p(s_t = 2 \mid s_{t-1} = 2) &= p_{22}
\end{aligned} \qquad (5)$$

where the $p_{ij}$ are transition probabilities between states $i$ and $j$. [19] Given the Markov chain assumption, $s_{t-1}$ is sufficient to describe all past history of a golfer's performance and the latent variable.

This model accommodates a wide variety of patterns in golfer performance. Specifically, different combinations of $\mu_1$, $\mu_2$, $p_{11}$, and $p_{22}$ can produce symmetric or asymmetric cycles of varying length in golfer performance. For example, a golfer who occasionally displays brilliant play for a few rounds might be best described by a large (negative) value of $\mu_1$ and a small value of $p_{11}$. Another interesting case is where a golfer's performance this round is independent of his performance in his last round. If this holds, we expect $p_{11} = 1 - p_{22}$. This amounts to saying that a golfer's performance follows a random walk. Finally, one might conclude that a golfer is prone to streaky play (hot hands being one example) when $\operatorname{sgn}(\mu_1) \neq \operatorname{sgn}(\mu_2)$, accompanied by large estimates of both $p_{11}$ and $p_{22}$. [20]

The joint distribution of observed data (the score in the $i$th round is indicated by $score_i$) with sample size $T$ [21] may be written as $p(score_1, \ldots, score_T, s_1, \ldots, s_T; \theta)$, where $s_t$ is the latent variable and $\theta = \{\mu_1, \mu_2, \sigma_1, \sigma_2, p_{11}, p_{22}\}$. Given values for the parameter vector, the probability of being in regime $s_t$ at date $t$ given the available information at that time is the filter inference:

$$p(s_t \mid score_1, \ldots, score_t; \theta). \qquad (6)$$

Alternatively, computing this probability with the full sample produces a smoothed inference about the regime:

$$p(s_t \mid score_1, \ldots, score_T; \theta). \qquad (7)$$

When all the scores are used to compute this probability (i.e., $t = T$), (6) and (7) produce the same value. Using (6), the analyst can estimate the probability that a golfer has hot hands using only the sample up to the present round, while (7) uses the entire sample to make the estimate. Note that in a MOD representation, the probability of hot hands would be a constant and independent of the history of golfer performance.

19. In this application, we assume that the transition probabilities are constant, but Diebold et al. studied a Markov-switching model in which the transition probabilities depend on a set of exogenous variables.

20. We note one final characteristic of this model. At first pass, the model just described appears to be quite similar to a mixture of distributions (MOD) model. The important difference between the Markov-switching model and the MOD representation is that the probability of draws from the two states is not independent in our model. Independence is a characteristic of the MOD approach. In this setting, the probability that a golfer is in the midst of a hot streak isn't constant: it depends on scores in other rounds.

21. We use T to indicate the sample size here in part to emphasize that we are modeling the evolution of an individual golfer's scores over time after accounting for skill changes and other influences captured by Model.

Hamilton (1994) and Kim (1993) show how to estimate the parameter vector with Kalman filter methods. The principal operational issue involves the sensitivity of final estimates to alternative starting values. For example, Engel and Hamilton (1990) report finding multiple local maxima in their work. We addressed these issues in the following way. To create starting values, we sorted the residual score data (i.e., the $\varepsilon_t$ from () for each golfer $j$) from lowest to highest and used the means and variances in the bottom and top third of the sample as starting values for $(\mu_1, \sigma_1)$ and $(\mu_2, \sigma_2)$ in the estimation process. Initially, we assumed the transition probabilities, $p_{11}$ and $p_{22}$, were equal. We used this starting value algorithm to estimate all the models in this paper.

To check this approach, we reran all these models using several adjustments to the starting values for the means and transition probabilities. For the overwhelming majority of cases, these adjustments had little impact on estimates and even less impact on inferences. We found a few golfers where adjusting the starting value algorithm produced significantly different results. We explored the parameter space for these golfers in much greater detail. In a very few cases where the specification test results indicated the empirical model failed along some dimension(s), we re-estimated the model repeatedly with different starting values to explore whether we had found a local maximum. While there can be no absolute guarantees that our estimates represent global maxima, we are reasonably confident that our approach has converged on global maxima.
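
To make the filter inference and the starting-value rule more concrete, the following Python sketch evaluates the filtered probability of the good state and the log likelihood for a two-state model with normally distributed residuals. It is an illustration under stated assumptions, not the estimation code used in the paper: the names `starting_values` and `hamilton_filter` are hypothetical, standard deviations are used in place of variances, the recursion is started from the ergodic state probabilities, and the maximization of the log likelihood over the parameters is left to whatever optimizer the reader prefers.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def starting_values(residuals):
    """Mean and standard deviation of the bottom and top thirds of the sorted residuals."""
    xs = sorted(residuals)
    k = max(2, len(xs) // 3)                # size of each tail used for starting values

    def mean_sd(values):
        m = sum(values) / len(values)
        var = sum((v - m) ** 2 for v in values) / (len(values) - 1)
        return m, math.sqrt(var)

    return mean_sd(xs[:k]), mean_sd(xs[-k:])   # (mu_1, sigma_1), (mu_2, sigma_2)


def hamilton_filter(residuals, mu, sigma, p11, p22):
    """Filtered P(s_t = 1 | residuals up to t) and the log likelihood.

    mu and sigma are pairs (state 1, state 2); state 1 is the good (low-score) state.
    """
    p21 = 1.0 - p22                         # probability of moving from state 2 to state 1
    # Assumption: start from the ergodic probability of state 1 (needs p11 + p22 < 2).
    prob1 = (1.0 - p22) / (2.0 - p11 - p22)
    filtered, loglik = [], 0.0
    for x in residuals:
        # Prediction step: probability of state 1 before seeing this round's residual.
        pred1 = p11 * prob1 + p21 * (1.0 - prob1)
        # Update step: weight each state by how well it explains the residual.
        joint1 = pred1 * normal_pdf(x, mu[0], sigma[0])
        joint2 = (1.0 - pred1) * normal_pdf(x, mu[1], sigma[1])
        likelihood = joint1 + joint2
        prob1 = joint1 / likelihood
        filtered.append(prob1)
        loglik += math.log(likelihood)
    return filtered, loglik


if __name__ == "__main__":
    # Made-up residuals: a run of good rounds followed by a run of poor rounds.
    sample = [-2.1, -1.9, -2.0, -1.8, -2.2, 2.1, 1.9, 2.0, 1.8, 2.2]
    (mu1, sd1), (mu2, sd2) = starting_values(sample)
    probs, ll = hamilton_filter(sample, (mu1, mu2), (sd1, sd2), 0.9, 0.9)
    print([round(p, 3) for p in probs], round(ll, 3))
```

Re-running the optimizer from several perturbed versions of the output of `starting_values` is one simple way to look for the multiple local maxima discussed above.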

4.2. Hypothesis Tests and Tests of Model Adequacy

4.2.1. Hypothesis Tests

There are several interesting hypotheses about golfer performance that can be tested in the context of the Markov-switching model. One hypothesis is that golfer performance follows a random walk. That is, performance from round to round is essentially unpredictable. Engel and Hamilton show that this hypothesis can be evaluated with the following test statistic:

$$\chi^2(1) = \frac{[\hat{p}_{11} - (1 - \hat{p}_{22})]^2}{V(\hat{p}_{11}) + V(\hat{p}_{22}) + 2\,\mathrm{Cov}(\hat{p}_{11}, \hat{p}_{22})} \qquad (8)$$

where the ^ indicates an estimated value, $V(\hat{p}_{ii})$ indicates the estimated variance of a parameter estimate, and $\mathrm{Cov}(\hat{p}_{11}, \hat{p}_{22})$ is the estimated covariance of two parameter estimates. This test statistic corresponds to the three-part null hypothesis $p_{11} = 1 - p_{22}$, $\mu_1 \neq \mu_2$, $\sigma_1 \neq \sigma_2$. [23] The basic idea behind the test is that if we find golf scores follow a random walk, that would represent significant evidence against the hot hands (streaky play) idea represented in the segmented trends model.

Another interesting hypothesis holds that there is no difference in golfer performance across the two states. This corresponds to a formal test for differences in golfer performance in the two states indexed by $s_t$. The two-part null hypothesis is given by $\mu_1 = \mu_2$ and $\sigma_1 \neq \sigma_2$, and the test statistic is

$$\chi^2(1) = \frac{(\hat{\mu}_1 - \hat{\mu}_2)^2}{V(\hat{\mu}_1) + V(\hat{\mu}_2) - 2\,\mathrm{Cov}(\hat{\mu}_1, \hat{\mu}_2)} \qquad (9)$$

where the notation is as indicated earlier. The covariance term enters with a negative sign because the denominator is the estimated variance of the difference $\hat{\mu}_1 - \hat{\mu}_2$.

Finally, we can test for differences in variance across the two states by imposing the constant variance assumption, re-estimating the Markov-switching model, and performing a likelihood ratio test using the values of the restricted and unrestricted log likelihood function. [24] The specific test statistic is given by $-2[LLF_R - LLF_U] \sim \chi^2(1)$, where $LLF_R$ is the log-likelihood function value under the restriction that $\sigma_1 = \sigma_2$ and $LLF_U$ is the log-likelihood function value without the restriction being imposed.

22. We are very grateful to James Hamilton for his generosity in providing his Gauss code for estimating Markov-switching models.

23. Direct tests of the segmented trends model confront the problem that some of the parameters of the segmented trends model are not identified under the null hypothesis. Alluded to in Engel and Hamilton, this hypothesis testing problem has been addressed by Hansen (1992) and more recently by Gong and Mariano (1997).

24. Our thanks to James Hamilton for suggesting this direct approach to the problem.
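
The three tests in this section reduce to simple arithmetic once the parameter estimates, their estimated variances and covariances, and the two log-likelihood values are available. The sketch below shows that arithmetic. The function names are hypothetical, all inputs are assumed to come from a separately estimated model, and the test of equal means uses the variance of a difference, so the covariance term is subtracted.

```python
import math


def chi2_1_pvalue(stat):
    """Upper-tail probability of a chi-squared variable with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def wald_random_walk(p11, p22, var_p11, var_p22, cov_p):
    """Wald test of p11 = 1 - p22 (performance independent of the previous round)."""
    stat = (p11 - (1.0 - p22)) ** 2 / (var_p11 + var_p22 + 2.0 * cov_p)
    return stat, chi2_1_pvalue(stat)


def wald_equal_means(mu1, mu2, var_mu1, var_mu2, cov_mu):
    """Wald test of mu1 = mu2 (no difference in average performance across states)."""
    stat = (mu1 - mu2) ** 2 / (var_mu1 + var_mu2 - 2.0 * cov_mu)
    return stat, chi2_1_pvalue(stat)


def lr_equal_variances(llf_restricted, llf_unrestricted):
    """Likelihood ratio test of sigma1 = sigma2 from the two log-likelihood values."""
    stat = -2.0 * (llf_restricted - llf_unrestricted)
    return stat, chi2_1_pvalue(stat)


if __name__ == "__main__":
    # Made-up estimates, for illustration only.
    print(wald_random_walk(0.90, 0.80, 0.0025, 0.0025, 0.0))
    print(wald_equal_means(-1.5, 1.2, 0.09, 0.16, 0.01))
    print(lr_equal_variances(-105.0, -100.0))
```

A large statistic in `wald_random_walk` rejects the random walk description of a golfer's scores, which is the outcome that favors streaky play.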

4.2.2. Model Specification Tests

Rather than simply assuming that the simple Markov-switching model just described is appropriate to the data, we conducted an extensive analysis of our empirical models using specification tests proposed by Hamilton (1996). He relies on the Lagrange Multiplier approach in his work, and technical details are available in the original article. The idea behind these test statistics is reasonably straightforward, and uses the score of the $t$th observation, defined as $\partial \log p(score_t \mid Z_t; \theta)/\partial\theta$, where $\log p(score_t \mid Z_t; \theta)$ is the log of the conditional distribution of $score_t$ given previous observations on score and a set of exogenous variables $Z_t$. It can be shown that

$$\int \frac{\partial \log p(score_t \mid Z_t; \theta)}{\partial\theta}\; p(score_t \mid Z_t; \theta)\; d\,score_t = 0.$$

In turn, this implies that $E(h_t(\theta_0) \mid Z_t) = 0$, where the function $h_t(\theta_0) \equiv \partial \log p(score_t \mid Z_t; \theta)/\partial\theta$ is evaluated at the true value for $\theta$ (indicated by $\theta_0$). This means that if the model is correctly specified, the scores should be unforecastable on the basis of information available at $(t-1)$, including lagged values of the score given by $h_{t-1}(\theta_0)$.

We use two basic forms of specification tests in this paper to assess the adequacy of our empirical model of golfer performance. We employ a version of the Newey (1985)-Tauchen (1985)-White (1982) dynamic specification tests to check for autocorrelation, ARCH effects in variance, and, perhaps most interestingly, the first-order Markov structure assumptions embedded in (5). When a model fails the Markov specification test, this suggests that a Markov-switching model with autocorrelated states may be needed to capture the dynamics. [25] We also employ Hamilton's LM tests for ARCH effects and autocorrelation within regimes, autocorrelation across regimes, omitted variables from the mean function, and omitted variables from the variance function. Finally, we implement Andrews' (1993) LM test for structural stability, which tests for structural shift in the model at each observation in the sample. [26]

4.3. Analysis of Results

In Table, we report estimates of the Markov-switching model parameters for each of 30 golfers. These golfers represent the most successful Tour players for the sample period, 1998-2001. Their relative success ensures that our estimates of dynamics are as free as is possible of potential

25. Hamilton (1989) studies models of this sort, but we find no reason to use them here.

26. James Hamilton gave us his code for computing all these specification tests for the simple Markov-switching model he studied in Engel and Hamilton (1990). To apply these tests to more complicated models requires a significant investment