

Joe DiMaggio Done It Again and Again and Again and Again?

David Rockoff and Philip Yates

CHANCE, Vol. 24, No. 1, 2011

"Joe DiMaggio done it again! Joe DiMaggio done it again! Clackin' that bat, gone with the wind! Joe DiMaggio's done it again!"
"Joe DiMaggio Done It Again," Woody Guthrie, 1949

One of the most-celebrated feats in the history of American sports is baseball player Joe DiMaggio's 56-game hitting streak during the 1941 season. In major league baseball, a 30% success rate by a hitter is considered good, and a batter will typically have three to five attempts in a game; "Joltin' Joe" hit safely in 56 consecutive games, a record that has rarely been approached since.

One might wonder just how amazing this accomplishment is, or how surprised people should be that there has been such a long streak in the history of the sport. Many have attempted to quantify this using statistical methods. A basic probability approach goes something like this: If a player has a .300 batting average (i.e., gets a hit in 30% of his at-bats) and has four at-bats each game, his probability of getting at least one hit in a given game is $1 - (1 - .3)^4$, or .76. Thus, the probability of this player getting a hit in every game during a given 56-game stretch is $.76^{56}$, or .00000021 (assuming all at-bats and all games are independent).

But we are really interested in the probability of there ever being a 56-game hitting streak by any player, not just the probability of one particular player achieving it over a given 56-game stretch. A direct probability approach is then difficult, mainly because the 56-game stretches for a player are not independent of one another; performance in games 1–56 during a season shares much information with performance in games 2–57. Fortunately, advances in computer technology and in the availability of baseball data have made a simulation approach feasible.

The inspiration for our research came from a New York Times article on March 30, 2008, titled "A Journey to Baseball's Alternative Universe." Samuel Arbesman and Steven Strogatz ran simulations of baseball seasons to estimate the probability of long hitting streaks, using data for each batter during each season in major league history. They treated a player's at-bats per game as constant across all games in a season; simulated 10,000 baseball histories; and tabulated which player held the longest streak, when he did it, and how long his streak was. There was a hitting streak of at least 56 games in 42% of these simulated histories, meaning we should not be all that amazed that there has been one in our actual observed history.

Don M. Chance, in the CHANCE article "What Are the Odds? Another Look at DiMaggio's Streak," used a modified calculation of hitting opportunities to study the likelihood of long hitting streaks, arguing that nonintentional bases on balls and sacrifice flies are opportunities for a hit and should be included in any calculation. This increases the number of opportunities in a game, but decreases the probability of success in a single opportunity. The net effect is a decrease in the probability of a hit in most games and, thus, of a lengthy hitting streak.

We disagree with the notion that a base on balls is a missed opportunity for a hit. A base on balls is usually quite beneficial for the team on the receiving end. A good hitter, unless he has a very long streak going already, is unlikely to eschew the walk in favor of swinging at bad pitches in a dire bid to get a hit. That is presumably why a base on balls is not counted against a player's batting average.

Figure 1. Contour plot of the probability of a player with a .300 batting average getting a hit in each of two games

Constant vs. Variable At-Bats

We noted that assigning the same number of at-bats for each game greatly overestimates the probability of long streaks. This is due to Jensen's inequality.
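As a rough check on the numbers quoted so far, the following minimal Python sketch recomputes the basic probability approach and then compares a two-game streak when eight at-bats are split evenly or unevenly. It assumes a .300 hitter and independent at-bats; `p_hit_in_game` is a hypothetical helper introduced only for illustration.

```python
def p_hit_in_game(avg: float, at_bats: int) -> float:
    """Probability of at least one hit in a game, assuming independent at-bats."""
    return 1.0 - (1.0 - avg) ** at_bats


avg = 0.300  # assumed batting average, as in the text's example

# Basic probability approach: four at-bats in every one of 56 games.
p_game = p_hit_in_game(avg, 4)
p_streak_56 = p_game ** 56

# Two-game streak with eight at-bats in total, split evenly or unevenly.
p_even = p_hit_in_game(avg, 4) * p_hit_in_game(avg, 4)
p_uneven = p_hit_in_game(avg, 2) * p_hit_in_game(avg, 6)

print(f"P(hit in a game)          = {p_game:.4f}")
print(f"P(hit in 56 given games)  = {p_streak_56:.2e}")
print(f"P(two-game streak, 4 + 4) = {p_even:.4f}")
print(f"P(two-game streak, 2 + 6) = {p_uneven:.4f}")
```

Under these assumptions, the even split gives a two-game probability of about .58 and the uneven split about .45, even though the total number of at-bats is the same.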
The probability of a two-game hitting streak is much lower if the player's at-bats are, say, two and then six than if his at-bats are four and then four. A simple example can be seen in Figure 1. The probability of a two-game hitting streak is much lower if the player has fewer at-bats in each game. Going down a northwest-southeast diagonal of this lattice structure graphically illustrates Jensen's inequality. Thus, the constant at-bat assumption overestimates the likelihood of long hitting streaks.

The need to vary at-bats also is due to the fact that at-bats per game have decreased and become more varied over time. Figure 2 illustrates this phenomenon.

Figure 2. Density estimates of at-bats per game by decade. Note: The unit of analysis is a player-season.

The simulations run in "Chasing DiMaggio: Streaks in Simulated Seasons Using Non-Constant At-Bats," published in the Journal of Quantitative Analysis in Sports in 2009, varied at-bats using Retrosheet game data for all of major league baseball from 1954–2007, as well as for the National League in 1911, 1921, 1922, and 1953. Since the publication of that paper, Retrosheet has added game data from both the American League and National League for the 1920–1929 seasons. It should be noted that these simulations are not true simulations of a game, but simulations of a player's at-bats in a game over all the games played in a season. Unfortunately, due to the unavailability of some game-by-game data, some or all of the careers of some of the best hitters in baseball (e.g., Willie Keeler, Ted Williams, Joe DiMaggio) are not included in this analysis.

The following is a brief overview of the simulations with varying at-bats. For each hitter in each season, the batting average is fixed, using that player's actual batting average for that season.

This allows us to treat the 1990 and 2001 versions of a player such as Barry Bonds as two players. We assumed at-bats over the course of a single game are independent. This means the number of hits a player gets in a given game has a binomial distribution with the number of trials equal to the at-bats in that game and the probability of success equal to the batting average for the season. Using the game-by-game data from Retrosheet, each player in each season had their at-bats sampled with replacement from their actual at-bat distribution to create a simulated season's worth of games. This was done 1,000 times to create 1,000 simulated baseball histories. A hitting streak was considered to be any run of consecutive games with at least one hit in each. In the results section, this method will be denoted as Binom.

Table 1. Top 20 Maximum Hitting Streaks in 1,000 Simulated Baseball Histories

Table 2. Hitting Streaks in 18,607,000 Simulated Player Seasons

Varying Batting Average

While the number of at-bats varied each game, the batting average remained constant. The question remained: How should we vary batting average in this simulation study?

In "Chasing DiMaggio: Streaks in Simulated Seasons Using Non-Constant At-Bats," why did the authors feel the need to vary at-bats during the simulations? Over the course of a baseball season, a player is not going to have the same number of at-bats in every game. There may be days when the player is starting and others when he may be coming off the bench to pinch-hit. Varying the at-bats is an attempt to mimic this phenomenon in the simulations.

How do we attempt to vary batting average in our simulations to resemble the course of a season? We attempted this in three ways. The first method we used was the simplest approach to varying batting average.
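For reference, the fixed-average Binom procedure described earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it simulates a single hypothetical player-season 1,000 times rather than every player in every season, and the at-bat list, the .330 average, and the function names are invented for the example.

```python
import random


def longest_streak(hits_per_game):
    """Length of the longest run of consecutive games with at least one hit."""
    best = current = 0
    for hits in hits_per_game:
        current = current + 1 if hits > 0 else 0
        best = max(best, current)
    return best


def simulate_binom_season(avg, observed_at_bats, rng):
    """One simulated season: resample at-bats with replacement, then draw hits."""
    n_games = len(observed_at_bats)
    at_bats = rng.choices(observed_at_bats, k=n_games)
    # Each at-bat is an independent trial with success probability `avg`,
    # so the number of hits in a game is binomial.
    return [sum(rng.random() < avg for _ in range(ab)) for ab in at_bats]


# Hypothetical player-season (not real data): 150 games and a .330 average.
rng = random.Random(2011)
observed_at_bats = [4] * 80 + [5] * 40 + [3] * 20 + [1] * 10
avg = 0.330

streaks = [
    longest_streak(simulate_binom_season(avg, observed_at_bats, rng))
    for _ in range(1000)
]
print("longest simulated streak:", max(streaks))
print("share of simulated seasons with a 30+ game streak:",
      sum(s >= 30 for s in streaks) / len(streaks))
```

Resampling the observed at-bats, rather than fixing them at a constant value, is what distinguishes this from the constant at-bat approach discussed earlier.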
We treated a player's chance at a hit in a given at-bat as a beta random variable, taking the player's actual number of hits (successes) in a season as the first shape parameter and the player's actual number of outs (failures) in a season as the second shape parameter. The mean of this random variable is the batter's batting average for the season. In the results section, this method will be denoted as Beta.

The second method of varying batting average treats hit probability in a given game as correlated with performance in neighboring games, which may better mimic so-called hot- and cold-hand effects. For a brief description, let's look at how the method works for a neighborhood of 15 games.
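Before the neighborhood description, here is a minimal sketch of the Beta method just described. It assumes one beta draw per game, which is our reading rather than something the text specifies; the season totals (200 hits, 400 outs) and the function name are hypothetical.

```python
import random


def simulate_beta_season(hits, outs, at_bats_by_game, rng):
    """Hits per game when each game's hit probability is a Beta(hits, outs) draw."""
    season = []
    for ab in at_bats_by_game:
        # Assumption: one draw per game. The mean is hits / (hits + outs).
        p_game = rng.betavariate(hits, outs)
        season.append(sum(rng.random() < p_game for _ in range(ab)))
    return season


# Hypothetical season totals (not real data): 200 hits and 400 outs, a .333 average.
rng = random.Random(24)
draws = [rng.betavariate(200, 400) for _ in range(10_000)]
mean_draw = sum(draws) / len(draws)
print(f"mean of simulated hit probabilities: {mean_draw:.3f}")
print(f"range of simulated hit probabilities: {min(draws):.3f} to {max(draws):.3f}")

season = simulate_beta_season(200, 400, [4] * 150, rng)
print("games with at least one hit:", sum(h > 0 for h in season))
```

The draw is made once per game here because drawing a fresh value for every single at-bat would leave the chance of a hit in each at-bat equal to the season average, and so would not change the distribution of hits.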

In the simulations, we start with a fixed batting average for each game and run a simulated season, just like the Binom method. Hit probabilities are then updated by incorporating information from neighboring games into the simulated season. For instance, in game 50 of a simulated season, the probability of a base hit reflects the player's simulated performance in games 35 through 65. Using the new hit probabilities, the simulations generate a new array of hits. This process is repeated for each game in a player's season and denoted as Binom-15.

The third method of varying batting average is the same as Binom-15, except it uses a 30-game neighborhood; that is, the probability of a base hit in game 50 reflects the player's performance in games 20 through 80. This method is denoted as Binom-30.

Results

Table 1 lists the top 20 performances in the simulations. It should be noted that these are the peak streaks for a player. For example, using the Binom-15 method, the 1922 version of George Sisler had a maximum hitting streak of 95 games. His second-highest streak was 83 games. That streak is not included on the list since his highest was 95.

Table 3. Hitting Streaks in 1,000 Simulated Baseball Histories

Table 4. Hitting Streaks for 1941 Joe DiMaggio in 10,000 Simulations

Another way to look at the results is to summarize the hitting streaks over each simulated player-season; yet another is to summarize over each of the individual 1,000 baseball histories. Table 2 compares our four methods, showing how many simulated player-seasons contained a hitting streak of at least 40 games, at least 50 games, and at least 56 games (DiMaggio's record). Table 3 shows the same breakdown, but over entire simulated histories. The Binom-15 method yields the most long streaks: 561 individual player-seasons contained a hitting streak of 56 games or longer, and a whopping 450 of the 1,000 histories featured such a streak.
This means that, if we assume batting ability fluctuates smoothly over the course of a month, we should not be all that surprised that there has been a 56-game hitting streak in real life. It also shows that results change significantly depending on our assumptions.

How did Joltin' Joe do in our simulations using all four methods, plus the constant at-bat method? We ran 10,000 simulations of DiMaggio's 1941 season, using his actual game-by-game at-bats.
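As an aside on the neighborhood methods (Binom-15 and Binom-30) compared above, the two-pass procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the updated probability for a game is the pooled hits divided by pooled at-bats over the window, and that windows are simply truncated at the start and end of the season; both choices, the function names, and the 150-game schedule are hypothetical.

```python
import random


def draw_hits(at_bats, probs, rng):
    """Binomial hits per game, one independent trial per at-bat."""
    return [sum(rng.random() < p for _ in range(ab)) for ab, p in zip(at_bats, probs)]


def neighborhood_probs(hits, at_bats, k):
    """Hit probability for each game from performance in games g-k through g+k."""
    n = len(hits)
    probs = []
    for g in range(n):
        lo, hi = max(0, g - k), min(n, g + k + 1)  # assumption: truncate at season edges
        window_ab = sum(at_bats[lo:hi])
        window_hits = sum(hits[lo:hi])
        probs.append(window_hits / window_ab if window_ab else 0.0)
    return probs


def simulate_neighbor_season(avg, at_bats, k, rng):
    """First pass at a fixed average, then a second pass with updated probabilities."""
    first_pass = draw_hits(at_bats, [avg] * len(at_bats), rng)
    updated = neighborhood_probs(first_pass, at_bats, k)
    return draw_hits(at_bats, updated, rng)


# Hypothetical 150-game season with four at-bats per game and a .330 average.
rng = random.Random(56)
at_bats = [4] * 150
season = simulate_neighbor_season(0.330, at_bats, 15, rng)
print("games with at least one hit:", sum(h > 0 for h in season))
```

With `k=15`, the window for game 50 (index 49) covers games 35 through 65, matching the example in the text; `k=30` gives the Binom-30 variant.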

DiMaggio's actual game-by-game data were obtained from Cliff Blau, a member of the Society for American Baseball Research. Table 4 shows he reached his record only a few times with each method and had more long streaks with the two Binom methods. The first two rows also reinforce that when batting average is treated as constant, the assumption of constant at-bats makes a long streak more likely.

Photo: Joe DiMaggio with bat ready at first day's workout on March 6, 1946, in Bradenton, Florida, after returning from Panama. DiMaggio had been out since 1942 for U.S. Army service. AP Photo/Preston Stroup

Multiseason Streaks

At the end of the 2005 season through the beginning of the 2006 season, Jimmy Rollins of the Philadelphia Phillies had a hitting streak of 38 games. His streak would not have been captured by the methods previously discussed, although it officially counts as a streak.

Chance looked at the odds of long hitting streaks using a player's career data. His analysis used only 125 players: the top 100 in career batting average, plus all other players who had real-life hitting streaks of at least 30 games. We ran additional simulations on these good hitters to estimate the frequency of long hitting streaks that span two seasons. We started with the same list of players for our analysis, using the game-by-game at-bats for their entire careers found in the Retrosheet data. Retrosheet contains the complete careers for only 24 of those players and partial data for an additional 62. In these simulations, multiseason long streaks occurred approximately one-tenth as often as single-season long streaks, meaning that to truly compare the likelihood of witnessing a long streak, we should increase the number of long streaks in Tables 2 and 3 by 10%.

Concluding Thoughts

Why is there such a difference between the simulations when batting average is fixed and when batting average varies under the neighbor method? If a batter was successful in his first 15 or 30 games, his batting average (probability of a hit) is going to be higher for the next game. A player who starts a simulated season hot is going to have longer streaks using this method.

The 30-game neighborhood method had a streak greater than DiMaggio's 1941 streak in 34.3% of our simulated baseball histories. The 15-game neighborhood method had a streak greater than DiMaggio's 1941 streak in 45% of our simulated baseball histories. Over a stretch of 15 games, it might be more common to see a player have a higher batting average than over a stretch of 30 games. This may have produced longer hitting streaks in the simulations.

If we are to believe the results of these attempts at simulating baseball histories, it is not surprising that someone did have a 56-game hitting streak during the course of Major League Baseball's history. The surprise comes into play when we pick out a certain player, such as Joltin' Joe or Sisler, to have the specific streak.

Further Reading

Arbesman, S., and S. Strogatz. 2008. A journey to baseball's alternative universe. The New York Times. March 30.
Berry, S. 1991. The summer of '41: A probability analysis of DiMaggio's streak and Williams' average of .406. CHANCE 4(4):8–11.
Chance, D. M. 2009. What are the odds? Another look at DiMaggio's streak. CHANCE 22(2):33–42.
Gould, S. J. 1989. The streak of streaks. CHANCE 2(2):10–16.
McCotter, T. 2008. Hitting streaks don't obey your rules: Evidence that hitting streaks aren't just byproduct of random variation. Baseball Research Journal 37:62–70.
Rockoff, D. M., and P. A. Yates. 2009. Chasing DiMaggio: Streaks in simulated seasons using non-constant at-bats. Journal of Quantitative Analysis in Sports 5(2), Article 4. www.bepress.com/jqas/vol5/iss2/4.
Short, T., and L. Wasserman. 1989. Should we be surprised by the streak of streaks? CHANCE 2(2):13.
Warrack, G. 1995. The great streak. CHANCE 8(3):41–43, 60.