Hoque 1

LITERATURE REVIEW

ADITYA HOQUE

INTRODUCTION

One could argue that the United States is sports driven. Many cities are passionate about and centered around their sports teams. Sports are also financially significant, as they account for billions of dollars per year in the United States (Jia, Wong, & Zeng, 2013). Baseball is often referred to as America's pastime. Major League Baseball (MLB) is the oldest of the four major professional sports leagues in North America (NHL, MLB, PGA, NBA and MORE to come, 2014). Keyser (2014) notes that MLB teams play 162 games in a regular season, which is more games than teams play in the National Hockey League (NHL), National Basketball Association (NBA), or National Football League (NFL). The length of the MLB season compared to the seasons of the other three leagues makes it easy to see why so many North Americans are invested in the game of baseball.

The abundance of data from previous baseball games allows even casual fans to study the game in depth (Jia, Wong, & Zeng, 2013). The amount of data also allows teams to use complex analysis to develop the best team possible. As Smith (2016) says, the Chicago Cubs won the 2016 World Series and ended their title drought of 108 years, mainly because Theo Epstein, their general manager, assembled the best team in MLB. Smith (2016) adds that Theo Epstein was also able to end the title drought of the Red Sox, which was 86 years long, in 2004 by using a similar strategy. Those two streaks were the two longest championship droughts in MLB history (Smith, 2016).

While some general managers are successful, others are not. Some newly acquired players do not perform as well as their receiving team may have anticipated because most current statistics are team independent, which means they cannot accurately measure how a player affects his current team or how he will affect another team.

BIG DATA

According to Rajaraman (2016), big data has volume, variety, velocity, veracity, and value; baseball data meets all five of these requirements. The first characteristic of big data is that there is a large amount of data, and this trait applies to baseball data because baseball has a rich history of games with recorded data. The second trait of big data is its variety, and according to Jia, Wong, and Zeng (2013), there is a wide array of data taken from baseball games. The third requirement to be classified as big data is velocity. Over the course of a six-month baseball season, 30 teams play 162 games each (Keyser, 2014). Because each game involves two teams, that is 2,430 games in total, or approximately 13 MLB games per day. This high frequency of games leads to a large collection of data. To be called big data, the data must also have veracity; the data must be true. If the collected baseball data is accurate, it meets this criterion. Lastly, the data has to have value. The value in baseball data lies in its use. For example, to see how well a batter performed offensively, the number of hits he had matters, whereas the number of defensive errors he made does not. Therefore, as long as it is used in a meaningful way, baseball data holds value. Since baseball data meets the requirements to be considered big data, characteristics that apply to big data can also be applied to baseball data.

THEORY DEVELOPMENT

Predictive analytics, the field of analyzing data to predict the future, has traditionally been separate from theory development. However, Waller and Fawcett (2013) developed a model, as shown in Figure 1, that demonstrates how predictive analytics can play a role in the theory development process.

Figure 1: This model shows the relationship between predictions (vertical axis) and explanations (horizontal axis). (Waller and Fawcett, 2013)

According to the model, anything that uses little prediction and little explanation is just a description, anything that uses a lot of prediction but little explanation falls under the category of predictive analytics, anything that uses a lot of explanation but little prediction is a critical explanation, and anything that is high in both prediction and explanation unfolds new theories. This model shows that there is a way to integrate predictions into theories.

BASEBALL STATISTICS

Baseball statistical analysis has been around almost as long as the game itself. Although baseball and statistics have long been associated with each other, the use of statistics has evolved over time (see Figure 2).

Mid-19th century: Henry Chadwick develops the box score and counts hits, home runs, and total bases, leading to the calculation of statistics such as batting average and slugging percentage.
1925: F.C. Lane, Baseball Magazine editor, creates new methods to measure offensive production and publishes Batting.
1940s: Statistician Allan Roth is hired by the Brooklyn Dodgers to evaluate player performance.
1960s: Earl Weaver, Baltimore Orioles manager and future Hall of Famer, uses index cards to make in-game decisions.
1980s: Bill James writes numerous books about statistics that gain popularity.
21st century: Moneyball, a movie about the Oakland Athletics' unconventional way of building a baseball team using statistics, comes out and becomes a sensation.

Figure 2: This figure shows how statistics have been relevant to baseball since the mid-19th century, according to Birnbaum (n.d.) and Ivor-Campbell (n.d.).

Baseball still has simple statistics, including, but not limited to, hits, home runs, runs batted in, and other statistics that most baseball fans, even the casual ones, are aware of. However, there is another, more complicated group of statistics that has emerged in baseball over the past 15 years, known as sabermetrics.

SABERMETRICS

HISTORY

Bill James, who is often considered the father of sabermetrics, coined the term sabermetrics by combining SABR, which stands for the Society for American Baseball Research, and metrics, which means a method of measuring something (Birnbaum, n.d.). James defined sabermetrics as "the search for objective knowledge about baseball" (Birnbaum, n.d.). Nowadays, sabermetrics is used to describe any mathematical or statistical analysis of baseball or the actual statistics themselves (Analyzing Sabermetrics, n.d.). According to Birnbaum (n.d.), the concept of sabermetrics was first introduced to the public in 1982, when James published his book, The Bill James Historical Baseball Abstract, and it has been gaining popularity ever since. Birnbaum (n.d.) also says that even today there are a number of different publications about sabermetrics and sabermetric thinking.

However, the work of sabermetricians is not often respected. Sabermetricians often derive unconventional measurements and unique statistics, which usually leads to them being ridiculed and ignored. For example, Birnbaum (n.d.) says that James' theories were largely mocked (or ignored) by the baseball establishment, although over time his work started to be recognized. In fact, Time Magazine once named him one of the 100 most influential people in the world (Birnbaum, n.d.). That shows how important the work of sabermetricians is, even if it is not always instantly recognized as meaningful.

OVERVIEW

Many people may not consider statistical analysis a form of science, but sabermetrics is a science. The goal of sabermetric researchers and statisticians is to question traditional methods of evaluating a baseball player and search for new ways of obtaining objective knowledge through statistical analysis (Birnbaum, n.d.). Sabermetrics can also be used to answer questions (Birnbaum, n.d.), such as: how helpful were the hits of this player? Regardless of how they are formed or used, sabermetric statistics are a way to further analyze how players and teams perform.

RUN PREDICTORS

Predicting how many runs a team will score and measuring how many runs a player adds is one branch of sabermetrics, known as run predicting. Run scoring is the goal of every offense, so this area of sabermetrics is one of the main ways offense is evaluated. Runs Created, Weighted On-Base Average, and Extrapolated Runs are three popular run predicting methods (Birnbaum, n.d.). Bard (personal communication, November 30, 2016) says that Run Expectancy Matrices are a model for run predicting.

RUNS CREATED

Runs Created is a formula invented by Bill James, whose goal was to find a way to predict how many runs a team scored based on its other statistics (Birnbaum, n.d.). According to Birnbaum (n.d.), James looked at which statistics help the team, such as hits (H) and walks (BB), and which statistics hurt the team, such as strikeouts (K), to develop his formula. Figure 3 shows an example of the type of stat line James would have worked with.

Figure 3: This is an example of the type of batting line Bill James analyzed to create his Runs Created formula. (Birnbaum, n.d.)

Through trial and error, he came up with the formula shown in Figure 4 (Birnbaum, n.d.).

Figure 4: This is the Runs Created formula Bill James came up with. TB represents total bases, H represents hits, BB represents bases on balls, also known as walks, and AB represents at bats. (Birnbaum, n.d.)

James created this formula to predict team scoring, but it can also be used on individual players. That ability is important because there is no way of knowing exactly how many runs a player added to his team. The statistics of an individual player can be substituted into the formula to get the formula's estimate of how many runs that player created for his team.
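The formula image for Figure 4 is not reproduced in this text. As a minimal sketch, the following assumes the basic form of Runs Created that is commonly attributed to James and that matches the variables named in the caption: (H + BB) × TB / (AB + BB). The function name and the sample stat line are hypothetical and are not taken from the paper.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Basic Runs Created, assumed form: (H + BB) * TB / (AB + BB)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        # No plate appearances recorded, so no runs can be estimated.
        return 0.0
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line, invented purely for illustration.
estimate = runs_created(hits=150, walks=50, total_bases=250, at_bats=500)
print(round(estimate, 1))
```

The same function works for a team or for one player; only the totals passed in change.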

WEIGHTED ON-BASE AVERAGE

The formula for Weighted On-Base Average (wOBA) uses something known as linear weights. Linear weights are used to give a run value to each possible event (Linear Weights, 2016). For example, it is understandable that a double is more valuable than a single, but by how much is not obvious; linear weights give a definitive answer to this question. According to Furtado (1999), there are multiple ways to calculate the weights for each event. However, they usually end up being very similar, as evidenced by Figure 5.

Figure 5: This figure shows the linear weights calculated by Johnson and Palmer. 1B represents singles, 2B represents doubles, 3B represents triples, HR represents home runs, HP+BB represents hit by pitches and bases on balls, or walks, CS represents caught stealing, GIDP represents grounding into double plays, and Out represents outs. The values for CS, GIDP, and Out are negative because those are events that hurt the team. (Furtado, 1999)

When used in wOBA, the weights are adjusted so that the weight for outs becomes zero (Linear Weights, 2016), as shown in Table 1.

Table 1: This table shows the weights for each event in 2015 after they have been adjusted so that the weight for outs is zero. (Linear Weights, 2016)
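To make the idea of a weighted on-base average concrete, here is a minimal sketch. The weights below are illustrative assumptions, chosen to be in the neighborhood of published values for a single season; they are not copied from Table 1 or from any figure in this paper, and real weights change every year. The denominator (AB + BB − IBB + SF + HBP) and the stat line are likewise assumptions made for the example.

```python
# Illustrative linear weights with the weight for outs set to zero.
# These numbers are assumptions for the example, not the paper's values.
WEIGHTS = {
    "uBB": 0.690,  # unintentional walks
    "HBP": 0.722,  # hit by pitch
    "1B": 0.888,
    "2B": 1.271,
    "3B": 1.616,
    "HR": 2.101,
}


def woba(stats, weights=WEIGHTS):
    """Weighted sum of positive events divided by plate-appearance-like opportunities."""
    counts = dict(stats)
    counts["uBB"] = stats["BB"] - stats["IBB"]  # remove intentional walks
    numerator = sum(weight * counts[event] for event, weight in weights.items())
    denominator = (
        stats["AB"] + stats["BB"] - stats["IBB"] + stats["SF"] + stats["HBP"]
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical stat line, invented for illustration.
line = {"AB": 500, "BB": 60, "IBB": 5, "SF": 5, "HBP": 5,
        "1B": 100, "2B": 30, "3B": 5, "HR": 20}
print(round(woba(line), 3))
```

Because outs carry a weight of zero, they appear only in the denominator, which is what lets the result be read on the same scale as an on-base percentage.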

Once the weights for each event are found, they can be used in the formula for wOBA, as demonstrated in Figure 6. It is important to note that the weights change from year to year based on how much each event actually affected run scoring (wOBA, n.d.).

Figure 6: This is the formula for wOBA for 2013 (wOBA, n.d.). uBB represents unintentional walks, HBP represents hit by pitch, 1B represents singles, 2B represents doubles, 3B represents triples, HR represents home runs, AB represents at bats, BB represents bases on balls, or walks, IBB represents intentional walks, and SF represents sacrifice flies, which are plays where the batter flies out but a run scores on the play.

EXTRAPOLATED RUNS

Furtado (1999) thought there was a problem with the ways runs were predicted, so he invented a new formula known as Extrapolated Runs, which is portrayed in Figure 7. Like James, Furtado wanted to use linear weights in his formula, but he derived the weights by using linear regression. The main difference between his formula and the one used for wOBA is that his formula assigns negative values to outs.

Figure 7: This figure shows the formulas that Jim Furtado (1999) invented. 1B represents singles, 2B represents doubles, 3B represents triples, HR represents home runs, HP represents hit by pitches, TBB represents total walks, IBB represents intentional walks, SB represents stolen bases, CS represents the number of times caught stealing, AB represents at bats, H represents hits, K represents strikeouts, GIDP represents grounding into double plays, SF represents sacrifice flies, which are plays where the batter flies out but a run scores on the play, and SH represents sacrifice hits, which are plays where the batter is put out attempting to bunt but the runners on base advance to the next base. (Furtado, 1999)

RUN EXPECTANCY MATRICES

Run Expectancy Matrices are the most common models used to calculate linear weights (Linear Weights, 2016). They are based on the 24 different possible base-out states in baseball (eight possible base configurations multiplied by three possible out states) (Linear Weights, 2016). The value in each cell of the matrix is the average number of runs scored from that state through the end of the inning, taken over every time the state occurs (Linear Weights, 2016), as shown in Table 2.
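The way such a matrix is filled in can be sketched as follows. This is a minimal illustration under stated assumptions: each record is a hypothetical tuple of (bases occupied, outs, runs scored from that point to the end of the inning), the base configuration is encoded as a three-character string such as "1--" for a runner on first, and the sample records are invented. Real matrices are built from many seasons of play-by-play data.

```python
from collections import defaultdict


def run_expectancy(records):
    """Average the runs scored through the end of the inning for each base-out state."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for bases, outs, runs_to_end in records:
        state = (bases, outs)  # one of 8 base configurations x 3 out counts
        totals[state] += runs_to_end
        counts[state] += 1
    return {state: totals[state] / counts[state] for state in counts}


# Invented observations: (bases, outs, runs scored in the rest of the inning).
sample = [
    ("---", 0, 0), ("---", 0, 1), ("---", 0, 2),
    ("1--", 1, 0), ("1--", 1, 1),
]
matrix = run_expectancy(sample)
print(matrix[("---", 0)], matrix[("1--", 1)])
```

States that never occur in the input simply do not appear in the result, so a caller working with sparse data would need to decide how to handle missing cells.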

Table 2: This table shows the run expectancy matrix based on games played from 2010 to 2015.

FLAWS

Sabermetrics are a great way to analyze and understand the past, but they do not necessarily predict the future or account for change. Specifically, they do not provide a way to understand what would happen if the lineup order were modified or the teammates were changed. There are many statistics that ignore the situation of a player, but very few statistics actually take into consideration how the rest of the team performs. This statistical void is a problem because owners and general managers are trying to build a team, not a group of individual players.

SIMULATIONS

According to Hunter (2014), who created a baseball simulation and described what he learned from doing so, baseball is much simpler than it seems. It is possible to create baseball simulations because of the nature of the statistics involved in a baseball game and season. Many events in baseball can be expressed in terms of probabilities. For example, to find the probability that a batter gets on base, his On-Base Percentage (OBP), or the percentage of plate appearances in which he gets on base, would be used.

One key benefit that simulations provide, and that traditional and sabermetric statistics do not, is team dependency. The factors involved in most statistics are team independent. Because of that, it is hard to accurately measure a player's offensive contribution to his team and to determine how he might perform on different teams using current statistics. For example, the number of doubles a player hits does not necessarily depend on the team he is on. However, when a team signs a player, it wants to know how that player will affect that team. Even though every team's goal is to score as many runs as possible, each team has different needs. For example, low scoring teams need batters with good power, whereas high scoring teams need their batters to get on base (Hunter, 2014). While there are traditional and sabermetric statistics that measure power and how often a batter gets on base, the needs of a team are likely more detailed than that. Simulations can break a game of baseball down to its bare elements.

Simulations can also be used to examine hypothetical situations. Different scenarios, such as new teams or lineup spots, can only be tried with simulations, which is what makes simulations more valuable than statistics for certain jobs (Tippett, personal communication, December 1, 2016). Tom Tippett, former statistician for the Boston Red Sox, says that the Red Sox use simulations for some things, but not for everything, because of the amount of time it takes for most simulations to run (personal communication, December 1, 2016).
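The idea that a game can be simulated from per-batter probabilities can be sketched as below. Everything here is an assumption made for illustration: the outcome probabilities are invented, the base-running rule is deliberately crude (on a hit, every runner advances as many bases as the batter; on a walk, only forced runners move), and double plays, sacrifice flies, steals, and extra innings are left out.

```python
import random

# Hypothetical outcome probabilities for one batter; they sum to 1.
BATTER = {"out": 0.68, "walk": 0.08, "single": 0.15,
          "double": 0.05, "triple": 0.01, "home_run": 0.03}
BASES_GAINED = {"single": 1, "double": 2, "triple": 3, "home_run": 4}


def advance(bases, outcome):
    """Return (new_bases, runs) where bases is [first, second, third] booleans."""
    first, second, third = bases
    if outcome == "walk":
        runs = 1 if (first and second and third) else 0
        # Only runners forced by the batter move up one base.
        return [True, second or first, third or (first and second)], runs
    gained = BASES_GAINED[outcome]
    # Position 0 is the batter; occupied bases are positions 1 to 3.
    positions = [0] + [i + 1 for i, occupied in enumerate(bases) if occupied]
    moved = [p + gained for p in positions]
    runs = sum(1 for p in moved if p >= 4)
    return [base in moved for base in (1, 2, 3)], runs


def simulate_game(lineup, rng, innings=9):
    """Play one team's offensive half of a game and return runs scored."""
    runs, batter = 0, 0
    for _ in range(innings):
        outs, bases = 0, [False, False, False]
        while outs < 3:
            probs = lineup[batter % len(lineup)]
            outcome = rng.choices(list(probs), weights=list(probs.values()))[0]
            batter += 1
            if outcome == "out":
                outs += 1
            else:
                bases, scored = advance(bases, outcome)
                runs += scored
    return runs


def average_runs(lineup, games=2000, seed=0):
    """Average runs per game over many simulated games."""
    rng = random.Random(seed)
    return sum(simulate_game(lineup, rng) for _ in range(games)) / games


print(average_runs([BATTER] * 9))
```

Because each lineup slot carries its own probability table, swapping one player or reordering the list changes the simulated run total, which is the team dependency the passage above describes.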

ENGINEERING PLAN

ENGINEERING PROBLEM

MLB teams do not use data from past games to its fullest potential for player analysis and lineup efficiency. Many teams overpay for players due to overvaluing names and/or using inappropriate statistics.

ENGINEERING GOAL

The goal of this project is to engineer a computer program capable of simulating baseball games and outputting the impact of each player. The program should be able to accept different players and lineup orders in order to measure the impact of each player based on the rest of the lineup (his team) and his lineup position.

PROCEDURE/GENERAL METHODS

DEVELOPMENT

There are many small factors that need to be considered in the simulation, but they cannot all be dealt with from the beginning. The way to develop this simulation is to start small and then add details. The first version should have basic outcomes for an at bat: single, double, triple, and home run. The next version should have another outcome added, such as walks. Each following version should have more outcomes than its predecessor. The final version should have at least the following possible outcomes: single, double, triple, home run, walk, double play, and sacrifice fly.

The program needs to be able to output three things: the average runs per game with the given lineup, the runs added per plate appearance by each player, and the best possible lineup. The average runs per game comes from running the simulation multiple times and taking the average of the number of runs the team scored each game. The runs added by each player is calculated by looking at how much, on average, the player affects his team's expected runs in terms of a run expectancy matrix. The optimal lineup is derived by trying all 362,880 (9!) possible orderings of the nine batters and determining which configuration results in the maximum runs per game.

DESIGN CRITERIA

The most important criteria for this simulation are accuracy, the number of possible events, and the number of convenience features. The simulation needs to accurately reflect how many runs per game the team scores, because only then can the simulation's estimate of how many runs each player added be trusted. The simulation also needs to allow for a wide range of possibilities for each plate appearance because that is the case in an actual game of baseball.

TESTING

Runs per game is the one value that the program outputs that can actually be compared to real data. It is the only such output because the optimal lineup depends on the program's prediction of runs per game, and the idea of using a simulation with a run expectancy matrix to calculate the runs added by each player is novel. Data from the years prior to the year of interest can be used to make a simulation for a certain year. For example, to make a simulation for the 2016 season, data from the 2010-2015 seasons would be used.
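One plausible reading of "runs added" in terms of a run expectancy matrix is the change in expected runs across a plate appearance plus any runs that scored on the play. The sketch below assumes that reading; it is not confirmed as the project's exact method. The three run expectancy values are hypothetical placeholders rather than entries from Table 2, and the state encoding ("1--" for a runner on first) is an assumption made for the example.

```python
import math


def runs_added(re_matrix, before, after, runs_scored):
    """Run value of one plate appearance: RE(after) - RE(before) + runs scored.

    `after` is None when the plate appearance makes the third out,
    in which case the expected runs remaining are zero.
    """
    re_after = 0.0 if after is None else re_matrix[after]
    return re_after - re_matrix[before] + runs_scored


# Hypothetical run expectancy values for three base-out states.
RE = {("---", 0): 0.50, ("1--", 0): 0.90, ("---", 1): 0.27}

leadoff_single = runs_added(RE, ("---", 0), ("1--", 0), runs_scored=0)
leadoff_out = runs_added(RE, ("---", 0), ("---", 1), runs_scored=0)
print(leadoff_single, leadoff_out)

# Number of batting orders for nine players, matching the 362,880 in the text.
print(math.factorial(9))
```

Averaging these per-plate-appearance values over all of a player's simulated plate appearances gives a runs-added-per-plate-appearance figure. Since every one of the 9! lineups needs its own batch of simulated games, the exhaustive lineup search is the expensive part of the plan.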

REFERENCES

Analyzing sabermetrics. (n.d.). Retrieved from http://www.grammarphobia.com/blog/2012/09/analyzing-sabermetrics.html
Bard, S. (2016, November 30). Personal interview and email.
Birnbaum, P. (n.d.). A guide to sabermetric research. Retrieved from http://sabr.org/sabermetrics
Furtado, J. (1999a). Introducing XR. Retrieved from http://www.baseballthinkfactory.org/btf/scholars/furtado/articles/introducingxr.htm
Furtado, J. (1999b). Why do we need another player evaluation method? Retrieved from http://www.baseballthinkfactory.org/btf/scholars/furtado/articles/why_do_we_need_Another_Player_Evaluation_Method.htm
Hunter, M. (2014). 10 lessons I learned from creating a baseball simulator. Retrieved from http://www.hardballtimes.com/10-lessons-i-learned-from-creating-a-baseballsimulator/
Ivor-Campbell, F. (n.d.). F.C. Lane. Retrieved from http://sabr.org/bioproj/person/089be8f3
Jia, R., Wong, C., & Zeng, D. (2013). Predicting the Major League Baseball Season.
Keyser, H. (2014). Why are baseball seasons 162 games long? Retrieved from http://mentalfloss.com/article/58831/why-are-baseball-seasons-162-games-long
Linear Weights. (2016). Retrieved from http://www.fangraphs.com/library/principles/linearweights/
NHL, MLB, PGA, NBA and MORE to come. (2014). Retrieved from http://fundinginnovation.ca/nhl-mlb-pga-nba-come/
Rajaraman, V. (2016). Big Data Analytics. Resonance, 21(8), 695-716. doi:10.1007/s12045-016-0376-7
Smith, C. (2016). With Red Sox, Cubs, Theo Epstein Ends 2 Longest World Series Droughts, Becomes Sure Hall of Famer. Retrieved from http://www.masslive.com/redsox/index.ssf/2016/11/theo_epstein_chicago_cubs_vs_c.html
Tippett, T. (2016, December 1). Email.
Waller, M. A., & Fawcett, S. E. (2013). Click Here For a Data Scientist: Big Data, Predictive Analytics, and Theory Development in the Era of a Maker Movement Supply Chain. Journal of Business Logistics, 34(4), 249-252. doi:10.1111/jbl.12024
wOBA. (n.d.). Retrieved from http://www.fangraphs.com/library/offense/woba/