Master of Arts In Mathematics


SIMULATION MODEL USING STANDARDIZED LINEUP TO EVALUATE PLAYER OFFENSIVE VALUE. A thesis presented to the faculty of San Francisco State University in partial fulfillment of the requirements for the degree Master of Arts in Mathematics by Eugene Beyder. San Francisco, California, June 2015

Copyright by Eugene Beyder

CERTIFICATION OF APPROVAL I certify that I have read SIMULATION MODEL USING STANDARDIZED LINEUP TO EVALUATE PLAYER OFFENSIVE VALUE by Eugene Beyder and that in my opinion this work meets the criteria for approving a thesis submitted in partial fulfillment of the requirements for the degree: Master of Arts in Mathematics at San Francisco State University. Professor of Mathematics Associate Professor of Mathematics

SIMULATION MODEL USING STANDARDIZED LINEUP TO EVALUATE PLAYER OFFENSIVE VALUE Eugene Beyder San Francisco State University 2015

Baseball is a sport in which batting statistics are commonly used to assess the offensive value of a player. However, many traditional statistics do not accurately portray a player's true contribution to his team since they overlook a variety of circumstances outside of the hitter's control, in particular his position in the batting order. Using a standardized lineup, we have built a simulation based on Markov chains in which a player is evaluated by his offensive contribution in different batting positions in the lineup and then compared to other players in the league. Thus, this model is able to evaluate a player's offensive skill set in isolation and to distinguish which player has a greater value to his team among players with similar traditional offensive statistics. This model reveals several strategies perhaps not yet explored and can be used by Major League Baseball teams when making various decisions such as signing free agents and setting offensive lineups.

I certify that the Abstract is a correct representation of the content of this thesis. Chair, Thesis Committee

ACKNOWLEDGMENTS

I want to thank my advisors for their support and guidance. I am extremely thankful for the ability to merge together two of my favorite topics, math and baseball. Of course I want to thank my family for the constant support and encouragement they have provided me throughout my life.

TABLE OF CONTENTS

1 Introduction
1.1 Background on Baseball
1.2 Motivation
1.3 Purpose of Thesis
2 Theory Behind Simulation
2.1 Background
2.2 Markov Chain
2.3 Transitional Matrix
3 Data Collection
3.1 Collecting Transitional Matrix
3.2 Individual Player Transitional Matrix
3.3 Average Player Transitional Matrix
3.4 Runs Matrix
4 Building Simulation Model
4.1 How to Simulate a Single Game
4.2 How the Model Will Simulate
5 Distribution and Ranking of Each Batter Position
5.1 Process for Finding Random Players
5.2 Testing Data
5.3 Finding a Player's Rank by Batting Position
6 Results of the Model
7 Conclusion and Future Work
Appendix
Bibliography

LIST OF TABLES

1.1 Percent of plays during the 2012-2014 seasons, where WP is wild pitch, PB is passed ball, and errors are errors on pick-off attempts
2.1 Examples of code to represent states of system
4.1 Example of a single game simulation
4.2 Standardized lineup for model
5.1 Shapiro-Wilk test for 30 random players in 1st batting position of standardized lineup
5.2 Normality test for batting positions 2-5
5.3 Random 30 players' runs per game for batting positions 1-5 of standardized lineup
5.4 Explanation of equation used for finding ranking
6.1 Rankings of top players in MLB
6.2 Rankings of players with similar PA (plate appearances) and BA (batting average)
6.3 Rankings of players with similar Rs and RBIs

LIST OF FIGURES

3.1 Example of transitional matrix
5.1 Histogram and normal Q-Q plot for 30 random players' runs per game in lead-off position of standardized lineup
5.2 Histogram and normal Q-Q plot for 30 random players' runs per game in second position of standardized lineup
5.3 Histogram and normal Q-Q plot for 30 random players' runs per game in third position of standardized lineup
5.4 Histogram and normal Q-Q plot for 30 random players' runs per game in fourth position of standardized lineup
5.5 Histogram and normal Q-Q plot for 30 random players' runs per game in fifth position of standardized lineup

Chapter 1
Introduction

1.1 Background on Baseball

Baseball is a game played between two teams, a home team and a visiting team. A typical game of baseball consists of nine innings, where each inning is broken up into two half innings. The top half of the inning is reserved for the visiting team to play offense while the home team plays defense, and in the bottom half of the inning the roles are reversed. Each team's offense consists of a 9-batter lineup that must take turns in that exact designated order for the entirety of the game; this order is set before the game starts, and each time a player takes his turn it is called an at-bat, denoted (AB). [6] A half inning ends when the defensive team has acquired three outs. The goal for the offense is to score as many times as possible before the defense gets three outs. An out can be made many ways depending on the situation of runners on base. Each new inning begins with the batter who follows the player responsible for hitting into the third out. Each at-bat results in an out, a batter occupying one of the three bases, or a scoring scenario. At the end of nine innings, scoring plays known as runs determine the winning team. To score a run, a player must advance in order from 1st base to 2nd base to 3rd base and finally reach home plate safely.

1.2 Motivation

Typical offensive statistics in baseball can be found everywhere from the back of players' baseball cards to websites known as baseball reference pages. Most of these statistics can be classified as counting statistics, since they are determined simply by counting successes vs. failures. The most popular counting statistics are hits, runs, and runs batted in. Hits, denoted (H), [6] is the number of times the batter reached base as a result of hitting the ball with no outs made on the play and no mistake from the defense. Runs, denoted (R), [6] is the number of times the batter passed all three bases and reached home plate safely to score. Runs batted in, denoted (RBI), [6] is the number of runners that scored as a result of the batter hitting the ball. These statistics serve as a very quick comparison between Player A and Player B. This type of comparison is sometimes all that is considered, not only by the casual fan but also by many top baseball broadcasters and commentators.

However, judging a player solely on counting statistics does not fully represent his offensive value, because some aspects of baseball are out of the player's control. Counting statistics like R and RBI have more to do with the success of those who bat before and after the player in question. For instance, to get an R, a player must reach base, pass all three bases and safely reach home plate. If Player A is consistently reaching base safely but the players after him are not successful at scoring him, then Player A does not get an R credited to him. On the other hand, if the players hitting behind Player B, who is reaching base safely at the same rate as Player A, are successful at scoring him, then Player B will get credit for an R. Thus, when comparing Player A and Player B, Player B will have more runs, though he may not necessarily be a better offensive player. A similar argument can be made for RBIs. This category is based on how many times a run scores as a result of a player's AB. The number of RBIs is predicated on the number of opportunities the batter has with runners on base. Player A, a below-average hitter, could have many opportunities for an RBI and succeed at a very poor rate. Meanwhile, Player B, an above-average hitter, may have far fewer chances for an RBI but succeed at a much higher rate, and end the season with the same number of RBIs as Player A. Thus, counting statistics such as R and RBI make it difficult to accurately compare players with similar statistics.

Percentage statistics offer greater insight into the quality of a player by looking at how successfully the player performs in different situations. A typical percentage statistic is batting average, denoted (BA), [6] which is the success rate of getting a hit. Batting average is calculated by dividing a batter's number of hits by his number of at-bats, BA = H/AB. But even percentage statistics like BA, while telling us how well the player performed individually, still need to be evaluated alongside the impact of the players around him. For example, if the player is on a poorly performing team, the opponent will try to prevent the good player from making a big impact whenever possible. This means that the player will have fewer opportunities, resulting in deceptive statistics.

The main point that these cases illustrate is that judging a player based on counting statistics and percentage statistics does not provide enough information for a comprehensive assessment of a player's offensive value. Each player has unique circumstances surrounding each AB that play a role in his offensive production and, consequently, in his offensive statistics. In order to attain a better analysis of a player's offensive value, we need a platform in which all players are evaluated under the same surrounding circumstances.

Many algorithms and methods for evaluating players beyond typical statistics already exist; most of them deal with predicting an individual player's statistics, which may not reflect the true value of a player. Fantasy baseball [5] is a great example of players being evaluated and judged based on the numbers they have produced and are predicted to produce for the remainder of the season. A more valuable and insightful statistic for judging a player is his overall contribution to his team winning. Statistics of this nature have already been examined; the most well known is a sabermetric statistic called WAR, [1] which measures wins above replacement. This statistic incorporates how a player performs defensively and offensively, including base-running and hitting, and produces a number for how many more wins the team will have during the season if this player played in every game.

Others have come to a similar realization that alternative methods of analysis need to be performed. A Markov chain model of run-scoring in baseball has been used by several researchers to model the progression of a half-inning of baseball. R.E. Trueman, in "Analysis of baseball as a Markov process", defined a half-inning as a 25-state Markov chain. [14] In fact, many statisticians have used the Markov chain model to simulate an entire game. For example, in "Markov Chain Approach to Baseball", Bruce Bukiet, Elliotte Rusty Harold and José Luis Palacios applied a Markov chain model to determine optimal batting orders. [9] By developing algorithms to find the optimal order out of the 9! possible orders, they created a method to find the distribution of runs per half inning and per entire game using a 25-state Markov chain. In addition, they determined the expected number of games a team should win in an entire 162-game season. Meanwhile, Nobuyoshi Hirotsu and Mike Wright, in "A Markov Chain Approach To Optimal Pinch Hitting Strategies In a Designated Hitter Rule Baseball Game", applied a Markov chain model to optimize the pinch hitting strategy. [10] This type of analysis allowed them to determine whether it was optimal to use a pinch hitter with a high likelihood of hitting a home run but also a high likelihood of making an out instead of a batter with a low likelihood of a home run but also a low likelihood of making an out.

1.3 Purpose of Thesis

In this analysis, we attempt to determine a player's contribution to his team by considering only his offensive impact. In baseball there are two types of plays, batted plays and non-batted plays. Non-batted plays are primarily plays that involve base runners advancing during at-bats. [11] They are classified as steal attempts, balks, wild pitches, passed balls and errors on pick-off attempts. Batted plays are defined as those not included in non-batted plays, such as hits and walks. As evidenced by Table 1.1, non-batted plays have accounted for only about 3% of all plays in the last three seasons. [7] Non-batted plays are therefore rare and inherently unpredictable, and as a result are difficult to model. For the purpose of this thesis, we will disregard the effects of non-batted plays on a player's offensive production and focus solely on batted plays.

Year    % non-batted plays    % steal attempt plays    % balk, WP, PB, errors
2012    3.4 %                 2.3 %                    1.1 %
2013    3.1 %                 1.9 %                    1.2 %
2014    3.2 %                 2.0 %                    1.2 %

Table 1.1: Percent of plays during the 2012-2014 seasons, where WP is wild pitch, PB is passed ball, and errors are errors on pick-off attempts

As we have demonstrated, there are various circumstances that impact a player's offensive success during each AB. Furthermore, the traditional statistics used to assess a player offensively do not take these factors into account. Therefore, many players are not being valued properly for their offensive skills. For example, a player's offensive production is often determined by his position in the lineup. We want to create a method that compares players' offensive contributions under the same surrounding circumstances to allow for a more accurate comparison. We will take into account only batted plays to eliminate many of the irregularities of non-batted plays. This will enable us to measure a player's offensive capabilities in isolation and, as a result, his contribution to the number of runs per game his team scores. Thus, the goal of this thesis is to develop a statistical method using a Markov chain model to compare a player's offensive skill set for a given slot in the batting order across the entire league.

Chapter 2
Theory Behind Simulation

2.1 Background

Data analysis is prevalent in all fields of study. It indicates what has already happened and provides information on what may happen in the future. Worthwhile data analysis requires a large sample. Using the collected data, we can model future success based on previous results. Simulations enable the user to imitate the behavior of real-world systems over an extended period of time. In order to perform a simulation, there must first be a developed model, which represents the key characteristics or functions of the system at hand. The model represents the system itself, whereas the simulation represents the operation of the system over time. Simulations are also used when the real system is unavailable, for example because the system is not accessible, is being designed but is not yet built, or simply does not exist. There are many types of simulations in use today, such as the Monte Carlo simulation, which enables one to model situations that present uncertainty and play them out thousands of times on a computer. These simulations allow one to better estimate the probability of events being successful. From this data, one can draw more accurate conclusions about the information at hand and better predict future events. In terms of baseball, one can use simulations to create a platform in which every player is placed in the same situation and analyzed based on how well he performs in that situation.

We will apply this knowledge of simulations to studying baseball statistics. In the game of baseball every event and situation is charted and recorded, and can be expressed through data sets. In particular, we can examine the statistics that accumulate throughout the course of a game, a season and the career of a player or a team. This access to information allows us to use simulations to gain insight into which strategies are most effective in certain situations that may arise throughout the course of a game.

2.2 Markov Chain

A Markov chain is a probability model that is used to characterize movement between different locations called states. A Markov chain is made up of absorbing states and transitional states. Transitional states are states that can move to other states. Absorbing states are final locations; once in an absorbing state, movement to other states is not allowed. [12] [13] [11]

Formally, consider a set of random variables $X_1, X_2, \ldots, X_n$ where $X_i$ for $i = 1, \ldots, n$ represents the state of the system at time $i$, for the possible states of the system $1, \ldots, n$. There exists a set of numbers $P_{ij}$, where both $i$ and $j$ range from $1, \ldots, n$, such that when the process is in state $i$, the transition to the next state $j$ will have a probability of $P_{ij}$:

$$P_{ij} = \Pr(X_{n+1} = j \mid X_n = i)$$

Thus the set $\{X_i \mid i \in 1, \ldots, n\}$ is called a Markov chain with transitional probabilities $P_{ij}$, where $P_{ij}$ satisfies the following conditions:

$$0 \le P_{ij} \le 1$$

for all combinations of $i$ and $j$, and

$$\sum_{j=1}^{n} P_{ij} = 1$$

where $i = 1, \ldots, n$, since the process must transition to some state $j$ once it leaves state $i$. We are in an absorbing state if and only if

$$P_{ii} = 1$$

and

$$P_{ij} = 0$$

where $i \neq j$; otherwise we are in a transitional state. [12]

To use a Markov chain, some restrictions have to be met. First, there needs to be a finite number of outcomes or states. Second, the probability for each possible state that the process initially occupies is known. And finally, the probability of a given state can depend only on the previous state and not on the events leading up to that state. If a sequence of states has the Markov property, then every future state is conditionally independent of every prior state, given the current state. In essence, this rules out any notion of momentum as a factor affecting outcomes in our model. [11] [12]

In order to model a baseball game we need to establish the variables that will affect our system. Since the game is not measured by a clock but rather play by play, baseball lends itself naturally to a discrete model. We want to look at each at-bat, in particular the situation before the at-bat takes place and the result of the at-bat. Thus we need to know the probability of generating a certain result, which will vary from player to player. Also, we need to know the positioning of the runners on base along with the number of outs prior to the at-bat, and the positioning of the runners and the number of outs after the at-bat. These situations will be known as our states. A state is viewed as a description of the runners on base and the number of outs in an inning. [11] There are 3 bases that can be occupied at any given at-bat, which results in $2^3 = 8$ possible ways that runners can occupy the three bases. At the same time there can be 0, 1, or 2 outs for each of the 8 base situations, resulting in 24 possible situations that can be present when a player comes to bat. Once the third out is made the inning is over; this will be the absorbing state in our Markov chain. [13]

To prepare our data for simulation we first need a way to represent the 24 transitional states. Each state is coded as XXX Y, where the X's, going from left to right, represent 1st, 2nd and 3rd base and take the value 0 for an empty base or 1 for an occupied base. Y takes the values 0, 1 or 2 to represent the number of outs in the current state. The absorbing state will be coded as 3 to represent three outs. Table 2.1 shows a few examples of coded states and their representations.

Code (XXX Y)    State of runners and outs (1st base, 2nd base, 3rd base, number of outs)
000 0           No runners on base and no out
100 1           Runner on 1st base and 1 out
110 2           Runners on 1st and 2nd base and 2 out
001 0           Runner on 3rd base and no out

Table 2.1: Examples of code to represent states of system
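The state coding described above can be sketched in a few lines. The thesis works in R and shows no code, so the Python below is only an illustration; the names `encode_state`, `TRANSITIONAL_STATES` and `ALL_STATES` are hypothetical and do not come from the thesis.

```python
from itertools import product


def encode_state(first, second, third, outs):
    """Return the 'XXX Y' code for a base-out situation (hypothetical helper)."""
    if outs == 3:
        return "3"  # absorbing state: runner positions no longer matter
    return f"{int(first)}{int(second)}{int(third)} {outs}"


# 2**3 base configurations times 3 out counts gives the 24 transitional states.
TRANSITIONAL_STATES = [
    encode_state(b1, b2, b3, outs)
    for outs in (0, 1, 2)
    for b1, b2, b3 in product((0, 1), repeat=3)
]

# Adding the single absorbing state gives the 25 possible ending states.
ALL_STATES = TRANSITIONAL_STATES + ["3"]

if __name__ == "__main__":
    print(len(TRANSITIONAL_STATES), len(ALL_STATES))
    print(encode_state(1, 1, 0, 2))  # runners on 1st and 2nd, 2 out
```

Keeping the codes as strings mirrors Table 2.1 directly, at the cost of some parsing whenever the number of runners or outs is needed.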

2.3 Transitional Matrix

In order to keep track of the movements from state to state, a transitional probability matrix is created. The transitional matrix in our system will be a 24 x 25 matrix whose elements represent each possible transition from state to state. Each inning begins a new Markov chain with the starting state always being 000 0. There are 24 base-out states in which any given at-bat can start, and the at-bat can transition to the same 24 base-out states plus the 3-outs state. Once we transition to the 3-outs state the inning is over and our Markov chain ends. Every element in the matrix represents the probability of transitioning from state $i$ to state $j$, hence the $(i,j)$th element of the matrix will be $P_{ij}$. [12] A transition from state $i$ to state $j$ exists only if

$$P_{ij} > 0$$

Here are the situations in which the $P_{ij}$'s take on a value of 0:

The number of outs decreases. A transition of this variety, for example from 000 1 (a one-out state) to 000 0 (a zero-out state), cannot happen because the number of outs in an inning of a baseball game can only stay the same or increase after a play.

The number of base runners added to the state increases by more than one. A transition of this variety, for example from 000 0 to 110 0, cannot happen because only one base runner can be added to the state per at-bat, since only one at-bat takes place between states.

Base runners go back to previous bases. A transition of this variety, for example from 001 0 to 110 0, cannot happen because in baseball base runners are not allowed to retreat to a base they have already passed.

No transitions occurred from state $i$ to state $j$ in the data behind that transitional matrix. The transition does not violate any restrictions and our system allows for such a transition, but the player did not come to bat when the state was $i$.
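The first three rules above are structural and can be checked from the state codes alone. The Python sketch below is one hedged way to express them; it is not taken from the thesis, the function names are hypothetical, and it assumes the batter can never end up ahead of a runner who was already on base. The fourth rule depends on the data and is not covered here.

```python
def parse_state(code):
    """Return (set of occupied bases, outs) for an 'XXX Y' code; '3' is absorbing."""
    if code == "3":
        return set(), 3
    bases, outs = code.split()
    occupied = {i + 1 for i, flag in enumerate(bases) if flag == "1"}
    return occupied, int(outs)


def runners_can_reach(before, after_old):
    """True if earlier runners could occupy exactly the bases in after_old
    without anyone moving backward (other runners may score or be put out)."""
    if len(after_old) > len(before):
        return False
    # Matching the trailing earlier runners to the new bases is the easiest case.
    trailing = sorted(before)[:len(after_old)]
    return all(old <= new for old, new in zip(trailing, sorted(after_old)))


def is_structurally_possible(start, end):
    """Apply the three structural rules to a transition between coded states."""
    if start == "3":
        return False  # the inning is already over
    before, outs_before = parse_state(start)
    after, outs_after = parse_state(end)
    if outs_after < outs_before:
        return False  # rule 1: outs never decrease
    if end == "3":
        return True  # runner positions are irrelevant after the third out
    if len(after) > len(before) + 1:
        return False  # rule 2: at most one runner (the batter) is added
    # Rule 3: either every occupied base holds an earlier runner, or the
    # trailing runner is the batter and the rest are earlier runners.
    if runners_can_reach(before, after):
        return True
    return bool(after) and runners_can_reach(before, after - {min(after)})
```

With this sketch, the three examples from the text (000 1 to 000 0, 000 0 to 110 0, and 001 0 to 110 0) are all rejected, while a single from 000 0 to 100 0 is accepted.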

15 Chapter 3 Data Collection 3.1 Collecting Transitional Matrix We will be using data from the 2012-2014 seasons for my project. We chose the last three seasons because it provides a large enough sample size of sufficient data for my analysis. We did not want to use data for more years because we want the data to still model the players current skill set. Taking a larger data set will devalue the analysis of the players worth today, as a players skill set changes over time. To acquire a transitional matrix we must first download data from [7] for the particular season desired. Specifically we download a roster of all players for the particular season, a file with all plays that took place for that particular season, and a file that represents the names of all the different categories that the data in the season can be classified as. Next we want to take a snap shot look at each play that occurred

16 during that season. In particular we want to record the occupancy of the bases and the number of outs prior to the play occurring, as well as the location of runners on base and number of outs after the play has occurred. We code this information as discussed before, and denote them as our states. Next, since we are concerned with batting plays only for our model, we want to extract out all plays that advance runners via non batted plays. Thus plays such as steal attempts, wild pitches, passed balls, and errors trying to pick off a base runner are removed from the data set. We can accomplish this by using a variable called BAT_EVENT_FL.[11] This variable classifies all plays that are batted plays into one category. Taking a subset from our data set of all plays that have BAT_EVENT_FL equal to TRUE accomplishes the desired result. With the method described earlier to get the location of the base runners and number of outs, we come across a small issue when the outs are equal to 3. The method records the location of base runners when 3 outs are made. Thus there are 8 states that have 3 outs: ( 000 3, 100 3, 010 3, 001 3, 110 3, 011 3, 101 3, 111 3 ). For our model the location of runners when 3 outs are made is irrelevant. Therefore we will recode all of these 8 events with 3 outs simply as just 3. [3] Now we have 24 states that any atbat can start at and 25 states that the atbat can transition to. The 25 ending states include the same 24 starting states plus the 3 state. Using the table[3] function in R organizes our data set into a 24 X 25 matrix called T, where the first vertical column represents the 24 starting states

17 before an atbat, and the first horizontal row represents the 25 ending states after an atbat. Every entry in the matrix (Ty) represents the number of transition from the starting state i to the ending state j. Note, the values in the matrix are not percentages, but rather whole numbers since the table function only collects the data and organizes the information in matrix form. The data in the matrix represents the number of times in a particular season transitions from the starting state i to the ending state j occurred. To find the probabilities of transitioning from state i to state j we use the prop.table function in R. This function takes every T\j from the T matrix and divides them by the sum of the row that the entry occupies to find Pij for our desired transitional matrix. p.. Z21 T ^v25 rri 2^j=1 ij Thus we now have a method for acquiring a transitional matrix for a particular baseball season. 3.2 Individual Player Transitional Matrix The previous method finds a matrix that contains the transitions for all players in a particular baseball season. Now, we can find the transitional matrix for individual players. To acquire an individual player s matrix we need to extract all the atbats

18 that are not taken by our desired player. To do this we need to arrange our data where BAT_EVENT_FL is equal to TRUE into a three dimensional matrix. The three categories will be the starting states, ending states and BAT_ID. Each player has a unique BAT_ID [7] [11] that is made up of the first four letters of the player s last name, followed by first letter of the first name and then a three digit number. For example, Albert Puljos BAT_ID is pujoaool. The roster file contains a list of all players for that year and thebat_id for those players. A player s BAT_ID is assigned on the first official atbat of the players career, and never changes from season to season. Our three dimension matrix can be put together by once again using the table function in R for the three variables BAT_ID, starting state and ending state. This three dimensional matrix contains a two dimensional matrix for each BAT_ID, where the variable are starting states and ending states. To get a matrix for a particular player, we just use the players BAT_ID to pick out the corresponding two dimensional matrix. This matrix represents the number of times the player transitioned from state i to state j. To find the probabilities, we again use the prop.table function in R. Thus we now have a method for acquiring a transitional matrix for a player in any baseball season. Since we will be using 2012-2014 seasons as my data, we need to acquire a transitional matrix for a player for those three years. To accomplish this we need to merge together the matrices for each individual year. We gather

the matrices by the process laid out above; however, we do not want to change the entries in the matrices to probabilities yet, since adding counts weights each season by its number of at-bats. We can add the three matrices together by using matrix addition in R. This creates a matrix that represents the number of transitions from starting state i to ending state j for a player for the years 2012-2014. Now we are ready to change the matrix to represent the probabilities of transitioning from state i to state j with the prop.table function. We repeat this process to gather transitional matrices for any player desired. Figure 3.1 displays an example of a transitional matrix.

[Figure 3.1: Example of Transitional Matrix]
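The counting, merging and normalizing steps of Sections 3.1 and 3.2 can be sketched in R as follows. This is a minimal illustration rather than the code used for the thesis: the column names STATE, NEW.STATE and SEASON are assumptions, the six rows of data are invented, and only a handful of states appear.

```r
# Toy play-by-play data. Column names STATE, NEW.STATE and SEASON are
# assumptions for illustration, and the rows are invented.
plays <- data.frame(
  SEASON    = c(2012, 2012, 2013, 2013, 2014, 2014),
  BAT_ID    = c("pujoa001", "pujoa001", "pujoa001",
                "troum001", "pujoa001", "troum001"),
  STATE     = c("000 0", "000 0", "100 0", "000 0", "000 0", "000 0"),
  NEW.STATE = c("000 1", "100 0", "100 1", "000 1", "000 1", "100 0"),
  stringsAsFactors = TRUE
)

# Three-way table of counts: starting state x ending state x batter.
counts_3d <- table(plays$STATE, plays$NEW.STATE, plays$BAT_ID)

# Two-dimensional slice of counts for one batter, all seasons pooled.
T_player <- counts_3d[, , "pujoa001"]

# Equivalent route described in the text: one count matrix per season,
# then matrix addition. Factor levels keep the dimensions aligned.
by_season <- lapply(split(plays, plays$SEASON), function(d) {
  table(d$STATE, d$NEW.STATE, d$BAT_ID)[, , "pujoa001"]
})
T_sum <- Reduce(`+`, by_season)

# Convert counts to probabilities row by row (margin = 1).
# A starting state with no observations would produce NaN in its row.
P_player <- prop.table(T_sum, margin = 1)
print(P_player)
```

Normalizing only after the seasons are added means that a season with more at-bats contributes proportionally more to the final probabilities.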

3.3 Average Player Transitional Matrix

For our model we will be building a lineup where each batter is the average MLB batter for that specific position in the batting order. To clarify further, the average hitter in the lead-off spot is found by taking all the players that hit first in the batting order during the season and gathering all the transitions that took place from starting state i to ending state j while they were in the first position of the batting order. This can be accomplished very much like the process in the previous section. Again, we arrange the data for which BAT_EVENT_FL is equal to TRUE into a three-dimensional matrix. The three dimensions will be the starting states, ending states and BAT_LINEUP_ID. [7][11] Since our data contains all the plays that occurred during a particular season, by using BAT_EVENT_FL equal to TRUE we are only looking at the plays that occurred via an at-bat. For each play the data set keeps track of which player was at bat and which position in the lineup he was batting in when that play took place. There are nine possible batting positions that any player can be assigned to. The variable that keeps track of this batting position is BAT_LINEUP_ID. Our three-dimensional matrix can be put together by once again using the table function in R on the three variables BAT_LINEUP_ID, starting state and ending state. This three-dimensional matrix contains a two-dimensional matrix for each BAT_LINEUP_ID, where the variables are starting states and ending states. To get a matrix for a

particular batting position, we just use the batting position number to pick out the matrix. This matrix represents the number of times a transition occurred from starting state i to ending state j for all players who batted in that particular batting position. To find the probabilities, we use the prop.table function in R. For our model we will use the 2011-2014 seasons to acquire the transitional matrices for the average player in each batting position. We chose to use one additional season for the average player transitional matrices because the average player transitional matrix is the foundation of our model, and we wanted to make sure the data was sufficient to model the typical player for each batting position. To get a transitional matrix for those four years we need to merge together the matrices for each individual year. Adding the matrices together by using matrix addition creates a matrix that represents the number of transitions from starting states to ending states for a batting position for the years 2011-2014. Now, we change the matrix to represent the probabilities of transitioning from starting states to ending states with the prop.table function. Finally we repeat this process for the remaining batting positions in the lineup.

3.4 Runs Matrix

Now that we have a method for finding transitional matrices, our Markov process is complete; however, we are not ready to begin building our simulation yet. We

still need a way to know the number of runs scored in all possible transitions from starting states to ending states. However, there are a few restrictions. First, when three outs are made the inning is over and no runs score. There is a possibility that the third out was made on the bases after a run had already scored; however, this situation is quite rare and extremely difficult to model. Second, runs may score only on plays where the batter has an at-bat; thus steal attempts, pick-off attempts, balks, passed balls and wild pitches cannot result in a run in this model. From earlier in the paper we learned that many transitions will have a probability of 0. Clearly these transitions will also have a runs scored value of 0. To find the number of runs scored for transitions that have a probability greater than 0, we simply find the difference between the total number of base runners and outs in the starting state and the corresponding total in the ending state; this shows how the runners in the state have transitioned. We also need to account for the batter, so we add one to our total. Every runner and the batter must end the play on base, out, or across the plate, so the runs scored are whatever this count does not otherwise account for. We can write this idea in equation form:

$$R_{ij} = \begin{cases} (\text{baserunners}_i + \text{outs}_i) - (\text{baserunners}_j + \text{outs}_j) + 1 & \text{if } P_{ij} > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $\text{baserunners}_i$ represents the number of base runners

in state i, $\text{outs}_i$ represents the number of outs in state i, and $R_{ij}$ represents the number of runs scored when transitioning from starting state i to ending state j for the Runs Matrix R. [11][13] If we applied the equation to get all the entries of the Runs Matrix, we would notice that some transitions from state i to state j have a negative value, yet the lowest number of runs that can be scored on any play is 0. In all of the transitions that have negative runs scored, however, the probability of the transition is 0. These situations occur because the total number of runners on base and outs went up by more than one from state i to state j, whereas a single at-bat can add only one player, the batter, to that total. For example, if our starting state i is 000 0 and the ending state j is 110 0, the equation gives $R_{ij} = (0 + 0) - (2 + 0) + 1 = -1$. Another issue occurs for cases where $P_{ij} = 0$ but $R_{ij} > 0$. To account for both issues we fill the cells that have a probability of 0 with a runs scored value of 0 before using the equation to find the remaining $R_{ij}$'s. Our finished product is a 24 x 24 Runs Matrix called R, with entries $R_{ij}$ for each starting state i and each ending state j.
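As an illustration of the rule above, the following R sketch builds a Runs Matrix from a transitional matrix. It is a sketch under stated assumptions, not the thesis code: the state labels ("110 1" meaning runners on first and second with one out), the helper names runners_plus_outs and build_runs_matrix, and the toy probabilities are all hypothetical, and the three-out state is left out to match the 24 x 24 shape described in the text.

```r
# Runners plus outs encoded in a state label such as "110 1"
# (first/second/third base as 0/1 flags, then the number of outs).
runners_plus_outs <- function(state) {
  parts <- strsplit(state, " ")[[1]]
  bases <- as.integer(strsplit(parts[1], "")[[1]])
  sum(bases) + as.integer(parts[2])
}

# The 24 states with fewer than three outs.
bases  <- c("000", "100", "010", "001", "110", "101", "011", "111")
states <- as.vector(outer(bases, 0:2, paste))

# Apply R_ij = (runners_i + outs_i) - (runners_j + outs_j) + 1
# wherever P_ij > 0, and 0 everywhere else.
build_runs_matrix <- function(P) {
  start <- vapply(rownames(P), runners_plus_outs, integer(1))
  end   <- vapply(colnames(P), runners_plus_outs, integer(1))
  R <- outer(start, end, "-") + 1
  dimnames(R) <- dimnames(P)
  R[P == 0] <- 0   # impossible transitions score nothing
  R
}

# Toy transitional matrix with a few invented non-zero entries.
P <- matrix(0, 24, 24, dimnames = list(states, states))
P["000 0", "000 1"] <- 0.70  # batter out
P["000 0", "100 0"] <- 0.25  # batter reaches first
P["000 0", "000 0"] <- 0.05  # solo home run
P["100 0", "000 0"] <- 1.00  # two-run home run (toy value)

R <- build_runs_matrix(P)
print(R["000 0", c("000 1", "100 0", "000 0", "110 0")])
```

Masking on the probabilities removes both the negative entries and the positive entries that belong to impossible transitions in a single step.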

Chapter 4

Building Simulation Model

4.1 How to Simulate a Single Game

After establishing the transitional matrices and the Runs matrix, we are ready to build our simulation model. To start, we want to model how to simulate a half inning of a baseball game. In order to do this, we need to know the variables that affect our model: the Runs matrix and the batting lineup. The Runs matrix tells us how many runs were scored on a particular transition. The batting lineup tells us the order in which the players bat; however, instead of the players' names we use the players' transitional matrices. To denote the transitional matrices of the different batting positions in the lineup we will use $A_x$, where $x \in \{1, \ldots, 9\}$ represents the position in the lineup. During the simulation we want to keep track of the number of runs scored throughout the game, and have the final output of the simulation be the total runs scored in the game. To do this we will set

runs = 0 at the beginning of the simulation and, after each transition from state i to state j, check the Runs matrix for the number of runs scored on the play. Once the number of runs scored on the play is found, we add that amount to the total runs scored for the game. Since every player in the batting lineup must bat in order, we need a way to make sure that this order is maintained. To account for this, we use a counter variable x that represents the batter's position in the lineup. To begin the game x = 1, representing the first batter. After each transition has occurred the counter variable goes up by one: x = x + 1. When the counter variable passes 9 it returns to 1 and continues. Every half inning has the same starting state, i = 000 0, [13] the state with no runners on base and no outs. The half-inning simulation begins by examining $A_x$, the transitional matrix of the batter in position x of the lineup. Then we pick out the row of $A_x$ that has starting state i: $A_x[i, ]$. From that row we keep only the ending states j whose $P_{ij} > 0$. The remaining entries represent the probabilities of transitioning from our starting state i to the possible ending states j. Based on these probabilities we randomly select one entry, which becomes our ending state j; the transition for the batter is represented by $A_x[i, j]$. We now check the number of runs scored on this transition by looking at $R_{ij}$ and adding this value to our total runs for the game:

runs = $R_{ij}$ + runs
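A single transition of this kind can be sketched in R as follows. The probabilities in row_i are invented for illustration, and the helper name next_batter is hypothetical; the sketch only shows how one ending state might be drawn and how the batting-order counter wraps around.

```r
set.seed(2014)  # only to make the sketch reproducible

# One row A_x[i, ] of a batter's transitional matrix for starting state
# "000 0". The probabilities are invented for illustration.
row_i <- c("000 1" = 0.68, "100 0" = 0.22, "010 0" = 0.05,
           "001 0" = 0.01, "000 0" = 0.04, "110 0" = 0.00)

# Keep only the ending states that can actually occur (P_ij > 0).
possible <- row_i[row_i > 0]

# Draw the ending state j with probability P_ij.
j <- sample(names(possible), size = 1, prob = possible)
print(j)

# Advance the batting-order counter, wrapping from 9 back to 1.
next_batter <- function(x) x %% 9 + 1
```

Dropping the zero-probability entries before sampling is not strictly necessary, since sample never selects an entry with probability 0, but it mirrors the description in the text.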

After this we set our ending state j to be our new starting state i and repeat the process for the next batter by setting x = (x + 1). This process would go on forever unless we have a way to terminate it. For our model, an inning should end once three outs have been recorded; thus when the state becomes 3 the inning is over. To have our model account for this we check the ending state after each transition. This is accomplished with a while loop. A while loop repeats a process and stops only once it encounters a case that violates the loop condition. When our model's ending state becomes 3 the condition is violated and the loop terminates. We want our model to simulate a half inning of a baseball game 9 times, where an inning ends with the number x batter and the next inning begins with the number (x + 1) batter. We can accomplish this task with a for loop. A for loop performs a process a designated number of times. Our model runs a while loop inside a for loop for innings one through nine and produces as output the number of runs scored in the game. Table 4.1 shows an example of a simulated game, where the states after each transition are shown along with an interpretation of the transitions. The transitions in Table 4.1 marked with ** are some of the transitions that can occur through multiple outcomes. For example, going from 100 0 to 100 1 tells us that one out occurred on the play, but does not tell us which player

is the base runner or how the out was made. This information is irrelevant for our simulation model because we are concerned with a player's ability to contribute to the runs scored for the team, and not with individual statistics, which would be counted if we kept track of how the transition occurred.
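The while loop inside a for loop described in this section can be sketched in R as follows. This is a minimal sketch, not the thesis code. To stay short and runnable it uses a deliberately tiny toy model in which every batter either makes an out (probability 0.7) or hits a solo home run (probability 0.3), so only the bases-empty states appear; the same function would accept full transitional matrices. The names simulate_game, A_toy and R_toy are hypothetical, and giving the runs matrix an all-zero column for the three-out state "3" is an assumption made for convenience.

```r
states <- c("000 0", "000 1", "000 2")
ends   <- c(states, "3")          # "3" is the three-out state

# Toy transitional matrix: out with probability 0.7, home run otherwise.
A_toy <- matrix(0, 3, 4, dimnames = list(states, ends))
A_toy["000 0", c("000 0", "000 1")] <- c(0.3, 0.7)
A_toy["000 1", c("000 1", "000 2")] <- c(0.3, 0.7)
A_toy["000 2", c("000 2", "3")]     <- c(0.3, 0.7)

# Toy runs matrix: staying in the same state is a solo home run.
R_toy <- matrix(0, 3, 4, dimnames = list(states, ends))
for (s in states) R_toy[s, s] <- 1

lineup <- rep(list(A_toy), 9)     # nine identical toy batters

simulate_game <- function(lineup, R, innings = 9) {
  runs <- 0
  x <- 1                          # batting-order counter, kept across innings
  for (inning in seq_len(innings)) {
    i <- "000 0"                  # every half inning starts empty, no outs
    while (i != "3") {
      p <- lineup[[x]][i, ]
      p <- p[p > 0]               # ending states that can occur
      j <- sample(names(p), size = 1, prob = p)
      runs <- runs + R[i, j]      # add runs scored on this transition
      i <- j                      # ending state becomes the new starting state
      x <- x %% length(lineup) + 1
    }
  }
  runs
}

set.seed(1)
print(simulate_game(lineup, R_toy))
```

Because x is initialized outside the for loop, each inning begins with the batter who follows the one who made the last out of the previous inning.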

Output      Interpretation
000 0       Start of Inning
000 1       Batter out
000 2       Batter out
3           Batter out
1           End of 1st Inning
0           Runs scored in 1st Inning
000 0       Start of Inning
000 1       Batter out
100 1       Batter reached 1st
3           Batter hit into double play
2           End of 2nd Inning
0           Runs scored in 2nd Inning
000 0       Start of Inning
000 1       Batter out
100 1       Batter reached 1st
100 2  **   Batter out or baserunner out
3           Third out made
3           End of 3rd Inning
0           Runs scored in 3rd Inning
000 0       Start of Inning
000 1       Batter out
001 1       Batter reached 3rd
010 1       Batter reached 2nd, run scores
010 2       Batter out
3           Third out made
4           End of 4th Inning
1           Runs scored in 4th Inning
000 0       Start of Inning
100 0       Batter reached 1st
100 1  **   Batter out or baserunner out
3           Batter hit into double play
5           End of 5th Inning
0           Runs scored in 5th Inning
000 0       Start of Inning
100 0       Batter reached 1st
010 1  **   Out made and runner on 2nd
010 2  **   Batter out
100 2       Batter reached 1st, run scores
001 2       Batter reached 3rd, run scores
3           Third out made
6           End of 6th Inning
2           Runs scored in 6th Inning
000 0       Start of Inning
000 1       Batter out
000 2       Batter out
3           Batter out
7           End of 7th Inning
0           Runs scored in 7th Inning
000 0       Start of Inning
000 1       Batter out
000 2       Batter out
3           Batter out
8           End of 8th Inning
0           Runs scored in 8th Inning
000 0       Start of Inning
100 0       Batter reached 1st
100 1  **   Batter out or baserunner out
110 1       Batter reached 1st
110 1       Batter reached 1st, run scores
110 2  **   Batter out or baserunner out
110 2       Batter reached 1st, run scores
111 2       Batter reached 1st
011 2       Batter reached 2nd, 2 runs score
3           Third out made
9           End of 9th Inning
4           Runs scored in 9th Inning
7           Total Runs for the game

Table 4.1: Example of a single game simulation

4.2 How the Model Will Simulate

Simulating a single game is useful, yet not very reliable for analysis. Too much irregularity can affect a single game's results. Thus, to get a more accurate analysis we simulate many games at one time, which averages out the rare occurrences that could significantly affect a single game. To simulate more than one game at a time, we can use the replicate [3] function in R. The replicate function runs the simulation of a single baseball game a specified number of times, and after every simulated game is complete the runs scored in that game are stored in a runs vector. Once the specified number of simulations is complete, we take the mean of the runs vector by adding up all the entries and dividing by the number of simulations. This process gives as output the average runs per game for the lineup used to run the simulations. For our analysis we ran three simulations of 100,000 games each and took the average of the three results to represent the runs per game. Thus, whenever we refer to a value of runs per game, keep in mind that all values were created using the same number of simulations. We found that when running fewer simulations the number of runs per game fluctuates within a much larger interval than preferred, making the results less consistent.
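The use of replicate described above might look like the following sketch. The function simulate_game here is only a stand-in so that the example runs on its own: it draws a Poisson count with an arbitrary mean and is not the Markov chain simulation of Section 4.1. The batch size is also reduced from the 100,000 games used in the thesis to keep the run short.

```r
set.seed(42)

# Stand-in for the single-game simulation; an assumption for illustration.
simulate_game <- function() rpois(1, lambda = 4.1)

# Average runs per game over n_games simulated games.
runs_per_game <- function(n_games) {
  runs <- replicate(n_games, simulate_game())  # the runs vector
  mean(runs)
}

# Three batches, then the average of the three batch means.
batches  <- replicate(3, runs_per_game(10000))
estimate <- mean(batches)
print(batches)
print(estimate)

# Fewer games per batch gives visibly noisier batch means.
small_batches <- replicate(3, runs_per_game(100))
print(small_batches)
```

The spread of a batch mean shrinks roughly with the square root of the number of games, which is consistent with the observation that smaller runs fluctuate within a wider interval.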

As mentioned in Section 3.3, our model builds a lineup where every $A_x$ is the average player's transitional matrix for batting position x. We also saw in Section 3.3 how to acquire these transitional matrices. Table 4.2 displays this lineup. We use this standardized lineup to create the surrounding circumstances in which different players are evaluated. To evaluate any player at position x in the lineup, we replace $A_x$ with the transitional matrix of the player in question and run the simulations to obtain the runs scored per game for that player. We can find the runs scored per game in all positions x for any player desired. For the rest of the paper, when we mention the runs per game of position x, it can be interpreted as replacing $A_x$ with the player's transitional matrix while keeping the rest of the lineup as is. We can now run the simulations and determine the runs per game a hypothetical lineup would score with a certain player in different batting positions.

Batting Position   Transitional matrix
$A_1$              Average Position 1 batter
$A_2$              Average Position 2 batter
$A_3$              Average Position 3 batter
$A_4$              Average Position 4 batter
$A_5$              Average Position 5 batter
$A_6$              Average Position 6 batter
$A_7$              Average Position 7 batter
$A_8$              Average Position 8 batter
$A_9$              Average Position 9 batter

Table 4.2: Standardized lineup for model

Chapter 5

Distribution and Ranking of Each Batter Position

5.1 Process for Finding Random Players

In order to judge the significance of a player's runs per game value in a batting position, we need a way to measure the results against all players. To achieve this, we took a random sample of 30 players from the 2014 roster and analyzed their results. [2] [8] However, a few restrictions have to be imposed. First, the random player chosen has to have enough data to produce results in our simulation, which means at least 1000 at-bats over the 2012-2014 baseball seasons. This ensures a player's transitional probabilities represent his true skill and not a streak resulting from a limited number of at-bats. Second, a player has to have transitions from all 24 possible starting states in the transitional matrix. Often when a player

has a limited number of at-bats, not enough data is available to represent all starting states. There are also rare occurrences when a player has more than 1000 at-bats but still has a missing starting state. An example of such a situation occurs for some players who are lead-off hitters in the National League. They do not have any at-bats when the situation is 001 0, a runner on third base with no outs. For the player to come to bat in this situation, a pitcher or a pinch hitter has to somehow reach third base. This is quite rare, as most pitchers are poor hitters and not very fast runners. The simulation cannot work if there is a missing starting state in the transitional matrix, because if the simulation happens to be in that state when the player comes to bat, it has no state to transition to. Thus when picking our random players it is crucial to check that they meet these two restrictions. Our method for finding random players was to take the list of players from baseball-reference.com [2] and order it by at-bats from highest to lowest. We then ran the sample function in R, which produces a list of random numbers that we used to select positions in the ordered list of players. Lastly, we went through the list until we had gathered 30 players who met the requirements stated above.

5.2 Testing Data

Once we had our 30 random players, we ran the simulations and found the average number of runs per game scored by the standardized lineup for each random player

in the first batting position. At this point we needed a way of testing the significance of the values found for the players in order to judge how one compares to another. We found the mean and standard deviation of the runs per game scored by the 30 random players once they were inserted into the standardized lineup. Figure 5.1 shows the histogram and normal Q-Q plot for all 30 players. We noticed that the data closely resemble a normal distribution. The next step was to test the distribution for normality. Using the Shapiro-Wilk test in R, we reject the null hypothesis of normality if the p-value is less than 0.05. A larger p-value means the test finds less evidence against normality, so we would like a p-value closer to 1. Table 5.1 shows the results of the test.
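The normality check can be sketched in R with the lead-off values listed in Table 5.3. The vector below copies the "1st" column of that table; the variable names are hypothetical. The plotting calls are left as comments so that the sketch runs without a graphics device.

```r
# Runs per game for the 30 random players in the lead-off position
# (the "1st" column of Table 5.3).
leadoff <- c(4.08872, 4.12041, 4.4135,  4.02769, 4.0238,  4.1821,
             4.01687, 4.06172, 4.11859, 4.21142, 4.20234, 4.06843,
             4.17947, 4.13781, 4.04305, 4.05057, 4.32492, 4.15033,
             4.18135, 4.17070, 4.18365, 3.94107, 4.20460, 4.19482,
             4.21339, 4.23211, 4.08312, 4.36367, 4.28853, 4.13348)

m <- mean(leadoff)
s <- sd(leadoff)

# Shapiro-Wilk test: the null hypothesis is that the data are normal.
sw <- shapiro.test(leadoff)
reject_normality <- sw$p.value < 0.05

print(c(mean = m, sd = s, p.value = sw$p.value))

# Plots corresponding to the two panels of Figure 5.1:
# hist(leadoff, xlab = "runs per game")
# qqnorm(leadoff); qqline(leadoff)
```

Failing to reject the null hypothesis does not prove normality; it only indicates that a sample of this size shows no strong evidence against it.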

[Figure: two panels, "Histogram" (Frequency against runs per game) and "Normal Q-Q Plot" (against Theoretical Quantiles)]

Figure 5.1: Histogram and normal Q-Q plot of the 30 random players' runs per game in the lead-off position of the standardized lineup.

Test for normality
Null hypothesis: data is normal
Shapiro-Wilk normality test in R: mean = 4.1537, sd = 0.1066, p-value = 0.5872
Reject null hypothesis if p < 0.05
Thus fail to reject null hypothesis
Conclusion: can assume normally distributed data

Table 5.1: Shapiro-Wilk test for 30 random players in 1st batting position of standardized lineup

While 30 random players serve as a valid sample size, using a larger sample of random players would give a more thorough analysis. Unfortunately, the process of finding random players and running the simulations is quite time consuming. When we gather our random sample of 30 players, the sample may not represent players of various skill levels equally. This is mostly due to the small sample size; with a larger sample this issue would subside. To account for this we impose a restriction on the random sample in order to ensure a valid result: we want the p-value to be at least 0.5. Thus, once we find a random list of 30 players with a p-value of 0.5 or greater, we take this as an indication that our sample is a good representation of the diverse skill sets of various players. Once we have an adequate random sample of 30 players, we can apply this list to finding the simulation results for the remaining batting positions. For our analysis we chose to examine the first five batting positions. Table 5.3 shows the runs per game scored by the standardized lineup with each random player inserted for the five batting positions examined. For example, when Goldschmidt replaces the lead-off batter in the standardized lineup, the lineup

scores an average of 4.4135 runs per game. Meanwhile, when he replaces the second batter the lineup scores an average of 4.46075 runs per game, and when he replaces the third it scores an average of 4.3155 runs per game, etc. To examine the remaining batting positions, one can extend the steps described above. Table 5.2 shows the results of the normality test for batting positions 2-5 using the same 30 random players, and Figures 5.2, 5.3, 5.4 and 5.5 show the histograms and normal Q-Q plots for these respective batting positions. The results are similar to what we found for the lead-off position, thus we can assume that runs per game is normally distributed for the batting positions examined.

Data          2nd position   3rd position   4th position   5th position
Mean          4.1558         4.0155         4.0702         4.1250
sd            0.1137         0.1080         0.1143         0.1148
p-value       0.7103         0.5333         0.9182         0.5256
data normal   yes            yes            yes            yes

Table 5.2: Normality test for batting positions 2-5

[Figure: two panels, "Histogram" (Frequency against runs per game) and "Normal Q-Q Plot" (against Theoretical Quantiles)]

Figure 5.2: Histogram and normal Q-Q plot of the 30 random players' runs per game in the second position of the standardized lineup.

[Figure: two panels, "Histogram" (Frequency against runs per game) and "Normal Q-Q Plot" (against Theoretical Quantiles)]

Figure 5.3: Histogram and normal Q-Q plot of the 30 random players' runs per game in the third position of the standardized lineup.

[Figure: two panels, "Histogram" (Frequency against runs per game) and "Normal Q-Q Plot" (against Theoretical Quantiles)]

Figure 5.4: Histogram and normal Q-Q plot of the 30 random players' runs per game in the fourth position of the standardized lineup.

[Figure: two panels, "Histogram" (Frequency) and "Normal Q-Q Plot" (against Theoretical Quantiles)]

Figure 5.5: Histogram and normal Q-Q plot of the 30 random players' runs per game in the fifth position of the standardized lineup.

Last name     First name   1st       2nd        3rd       4th        5th
Infante       Omar         4.08872   4.0597     3.95419   4.00658    4.04591
Stubbs        Drew         4.12041   4.1015     3.94778   3.984      4.084
Goldschmidt   Paul         4.4135    4.46075    4.3155    4.3369     4.4298
Ramirez       Alexi        4.02769   4.071997   3.91741   3.976      4.0430
Arencibia     J.P.         4.0238    3.98660    3.89625   3.9331     4.02384
Reyes         Jose         4.1821    4.22951    4.04255   4.08961    4.14037
Smoak         Justin       4.01687   3.99054    3.85864   3.94861    3.97419
Ibanez        Raul         4.06172   4.07661    3.94486   4.03171    4.07609
Teixeira      Mark         4.11859   4.0819     3.97364   4.06619    4.04832
Lawrie        Brett        4.21142   4.19453    4.06110   4.12376    4.18682
Headley       Chase        4.20234   4.16208    4.06655   4.11496    4.12554
Cain          Lorenzo      4.06843   4.05157    3.92585   4.00750    4.04320
Freese        David        4.17947   4.13333    3.98575   4.03707    4.08608
McCann        Brian        4.13781   4.15659    3.97953   4.01421    4.10254
Suzuki        Kurt         4.04305   4.08185    3.92177   4.02188    4.04810
Weeks         Rickie       4.05057   4.03681    3.84943   3.88603    3.94464
Cabrera       Melky        4.32492   4.30511    4.13521   4.25843    4.27223
Pagan         Angel        4.15033   4.14319    3.98227   4.02904    4.10284
Sandoval      Pablo        4.18135   4.15799    4.01857   4.11975    4.14600
Gordon        Alex         4.17070   4.21958    4.05317   4.08978    4.15788
Marte         Starling     4.18365   4.13068    4.01658   3.98826    4.08477
Espinosa      Danny        3.94107   3.9502     3.83350   3.82092    3.89586
Jones         Adam         4.20460   4.23851    4.10414   4.15790    4.21838
Seager        Kyle         4.19482   4.21873    4.05973   4.12516    4.21091
Butler        Billy        4.21339   4.17713    4.06688   4.13215    4.16010
Moss          Brandon      4.23211   4.29082    4.16213   4.21482    4.24384
Jones         Garrett      4.08312   4.17691    4.046     4.071086   4.11017
Cano          Robinson     4.36367   4.36710    4.20831   4.2736     4.34492
Zimmerman     Ryan         4.28853   4.269606   4.11811   4.19821    4.28096
Martinez      J.D.         4.13348   4.15178    4.01817   4.04865    4.11930

Table 5.3: Runs per game for the 30 random players in batting positions 1-5 of the standardized lineup

5.3 Finding a Player's Rank by Batting Position

Now that we know the data follow a normal distribution, we can examine how a player's contribution to the standardized lineup's runs per game ranks among all players for a specific batting position. This can be achieved by finding the percentile rank of runs per game for a player in a desired batting position. Using the function pnorm in R we get the proportion of players that will have a lower runs per game value than the player in question. Turning this value into a percentage creates a ranking system. Table 5.4 shows the pnorm function and its components.

pnorm(q, mean, sd, lower.tail = TRUE)
q                   runs per game of player for batting position
mean                mean of distribution for batting position
sd                  standard deviation of distribution for batting position
lower.tail = TRUE   finds area under the curve to the left of q value

Table 5.4: Explanation of equation used for finding ranking
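As a worked example of this ranking, the sketch below applies pnorm to Goldschmidt's runs per game from Table 5.3, using the means and standard deviations reported in Tables 5.1 and 5.2. The variable names are hypothetical; pnorm is vectorized, so all five batting positions can be ranked in one call.

```r
positions <- c("1st", "2nd", "3rd", "4th", "5th")

# Goldschmidt's runs per game by batting position (Table 5.3).
q <- c(4.4135, 4.46075, 4.3155, 4.3369, 4.4298)

# Mean and standard deviation of each position's distribution
# (Table 5.1 for the 1st position, Table 5.2 for positions 2-5).
mu    <- c(4.1537, 4.1558, 4.0155, 4.0702, 4.1250)
sigma <- c(0.1066, 0.1137, 0.1080, 0.1143, 0.1148)

# Area under the normal curve to the left of q, as a percentage.
percentile <- 100 * pnorm(q, mean = mu, sd = sigma, lower.tail = TRUE)
names(percentile) <- positions
print(round(percentile, 1))
```

Each value is the share of players expected to produce fewer runs per game than Goldschmidt in that batting position, under the normality assumption of Section 5.2.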

Chapter 6

Results of the Model

Now that our process for evaluating players has been built, we need to examine the method's effectiveness. Recall that for the standardized lineup, which reflects the average hitter at every batting position, we used data from the 2011-2014 baseball seasons to acquire the transitional matrices. The average runs scored in the Major Leagues [1][2][8] over those 4 years was 4.21 runs per game. Running the simulation for the standardized lineup produced an average of 4.11 runs per game. This difference of only 0.1 runs per game can be attributed to the removal of non-batted plays. While we did expect our runs per game to be lower than the MLB average because non-batted plays were removed, we were a bit surprised at how little influence non-batted plays have. Since non-batted plays make up only about 3% of all plays in an MLB season, it is plausible that the influence of these plays is quite low.

For the model to be effective, it should corroborate information that is well known. For example, it should be able to determine who the best offensive players are regardless of which position they bat in the lineup. Table 6.1 displays how the model ranks a few of the top 50 players in the game as ranked by Fantasy Baseball over the last three seasons. [5]

First name   Last name   1st      2nd      3rd      4th      5th
Mike         Trout       99.90%   99.70%   99.50%   99.20%   99.60%
Miguel       Cabrera     99.90%   99.90%   99.90%   99.80%   99.90%
Joey         Votto       99.40%   99.70%   99.40%   99.20%   98.20%
Robinson     Cano        97.00%   96.60%   97.20%   96.10%   97.60%
Matt         Holliday    97.80%   97.90%   97.30%   98.50%   98.20%

Table 6.1: Rankings of top players in MLB

The model should also reveal information that is not as well known or clear. For example, the model should be able to distinguish between players who have similar statistics for the past three years and determine who is more valuable offensively to the team. Table 6.2 shows the rankings for two players with exactly the same BA and the same number of plate appearances over the past three years. [4] It is clear from Table 6.2 that Holliday makes a greater contribution to his team scoring runs in most batting positions, despite the fact that Holliday and Freeman have exactly the same batting average. Moreover, the model should be able to distinguish between players with a similar

number of runs (R) and runs batted in (RBI). Table 6.3 focuses on players with similar R and RBI totals over the past three years and shows that these players are not equally valued by the model. [4] The table indicates that Cano is a far superior offensive contributor to his team than Pence in every single batting spot, despite having an almost identical number of R and RBI.

First name   Last name   PA     BA      1st      2nd      3rd      4th      5th
Matt         Holliday    1957   0.289   97.80%   97.90%   97.30%   98.50%   98.20%
Freddie      Freeman     1957   0.289   87.40%   89.20%   95.00%   90.30%   93.90%

Table 6.2: Rankings of players with similar PA (plate appearances) and BA (batting average)

First name   Last name   R     RBI   1st      2nd      3rd      4th      5th
Hunter       Pence       284   277   69.20%   60.40%   66.90%   53.30%   70.00%
Robinson     Cano        263   283   97.00%   96.60%   97.20%   96.10%   97.60%

Table 6.3: Rankings of players with similar R and RBI

Chapter 7

Conclusion and Future Work

The goal of building this model was to develop a method for evaluating players based on skill set and not on statistical measures that rely on factors inherently outside of the player's control. Although this model also aims to optimize runs per game like many previous simulations that use Markov chain models, it evaluates players on an equal playing field and eliminates several circumstances for which the hitter is not responsible, in particular the outcomes of the at-bats of the players before and after the hitter in question. Previous Markov chain models have addressed issues such as optimal lineups of particular teams and pinch-hitting strategy. This model offers a more insightful analysis of a player's skills and offensive contribution to his team winning. Examining results such as those found in Tables 6.2 and 6.3 supports this goal. Although this model is designed to evaluate a player's complete offensive skill set, it

does not account for offensive skills such as base running and speed. Certain players are well known for their ability to steal bases. In fact, for some players stealing bases is their greatest contribution. This model undervalues players of this caliber. Furthermore, a player known for stealing bases may affect the circumstances that the player after him faces when he is at bat. The threat of a steal creates added pressure on the defense, causing mistakes to occur more frequently, and it also helps the batter, as the pitcher's focus is divided between the batter and the base runner. This can often lead to a higher chance that the pitcher will throw a pitch favorable to the hitter, making it easier for him to get a hit. The factor of speed also plays a role in a player's transitional matrix. If player A comes to bat with a slow runner on base, the likelihood that the base runner will score on a double is fairly low. However, if player B comes to bat with a fast runner on base, the likelihood that the base runner will score on a double is much greater. This is an example of a situation in which aspects of the game are outside of the player's control. Thus, despite the best efforts of this model to account for elements outside of the player's control, the nature of baseball makes it almost impossible to do so completely. For example, every team has a unique ballpark with different dimensions that factor into a team's offensive prowess. Some parks are considered more pitcher friendly, while others are more favorable to batters. In addition, the climate and altitude in these different locations can also play a role. In summary, there are many variables that affect a player's transitional matrix. Thus, while the model does take certain factors that

48 are out of the players hand out of the analysis, there are far too many elements that are not clear as to how to model. For future work, we plan to incorporate base running, the ability to determine which type of runners are on base, and how the transition occurred into my model. One possible consideration is to expand the matrix to the same state yet find different ways to transition to that state. We would also improve on my sample size that we used to determine the distributions for each batting position. Ideally, we would like to get values for all of the players in the league. Overall, this model can be used to evaluate players for various reasons. Professional baseball teams can use this model to make decisions about which players to acquire for their team, whether through free agency, or trades and determine the optimal lineup to score more runs per game and thus gain the most value for their team.
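One way to read the idea of distinguishing how a transition occurred is to append an event label to the destination state, so that the same base-out state reached in different ways occupies different columns. The sketch below only illustrates that reading: the plays are invented, and the EVENT column and its labels are hypothetical names that do not appear elsewhere in this work.

```r
# Invented plays; the EVENT column and its labels are hypothetical.
toy <- data.frame(
  STATE     = c("000 0", "000 0", "000 0", "000 0"),
  NEW.STATE = c("100 0", "100 0", "100 0", "000 1"),
  EVENT     = c("single", "walk", "single", "out"),
  stringsAsFactors = FALSE
)

# Standard table: every route into "100 0" is pooled into one column.
P.pooled <- prop.table(with(toy, table(STATE, NEW.STATE)), 1)

# Expanded table: the destination label also records how it was reached.
toy$NEW.STATE.EVENT <- with(toy, paste(NEW.STATE, EVENT, sep = " / "))
P.expanded <- prop.table(with(toy, table(STATE, NEW.STATE.EVENT)), 1)

print(P.pooled)
print(P.expanded)
```

In the expanded table, the probability of reaching "100 0" is split between the single and the walk, which is the kind of information a base-running extension would need.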

Appendix

Code Used in R

To acquire a transitional matrix for a particular season, 2011 in this case, we use the following code [11].

    parse.retrosheet2.pbp = function(season){
      # download the zipped event files for the season from Retrosheet
      download.retrosheet <- function(season){
        download.file(
          url=paste("http://www.retrosheet.org/events/", season, "eve.zip", sep=""),
          destfile=paste("download.folder", "/zipped/", season, "eve.zip", sep="")
        )
      }
      # unzip the downloaded archive
      unzip.retrosheet <- function(season){
        unzip(paste("download.folder", "/zipped/", season, "eve.zip", sep=""),
              exdir=paste("download.folder", "/unzipped", sep=""))
      }

      # run cwevent on the event files to produce all<year>.csv
      create.csv.file=function(year){
        wd = getwd()
        setwd("download.folder/unzipped")
        if (.Platform$OS.type == "unix"){
          system(paste(paste("cwevent -y", year, "-f 0-96"),
                       paste(year,"*.EV*",sep=""),
                       paste("> all", year, ".csv", sep="")))
        } else {
          shell(paste(paste("cwevent -y", year, "-f 0-96"),
                      paste(year,"*.ev*",sep=""),
                      paste("> all", year, ".csv", sep="")))
        }
        setwd(wd)
      }
      # combine the team roster files into a single data frame
      create.csv.roster = function(year){
        filenames <- list.files(path = "download.folder/unzipped/")
        filenames.roster = subset(filenames,
                                  substr(filenames, 4, 11)==paste(year,".ros",sep=""))
        read.csv2 = function(file)
          read.csv(paste("download.folder/unzipped/", file, sep=""),header=FALSE)
        R = do.call("rbind", lapply(filenames.roster, read.csv2))

        names(R)[1:6] = c("Player.ID", "Last.Name", "First.Name",
                          "Bats", "Pitches", "Team")
        wd = getwd()
        setwd("download.folder/unzipped")
        write.csv(R, file=paste("roster", year, ".csv", sep=""))
        setwd(wd)
      }
      # remove the Retrosheet files that are no longer needed
      cleanup = function(){
        wd = getwd()
        setwd("download.folder/unzipped")
        if (.Platform$OS.type == "unix"){
          system("rm *.EVN")
          system("rm *.EVA")
          system("rm *.ROS")
          system("rm TEAM*")
        } else {
          shell("del *.EVN")
          shell("del *.EVA")
          shell("del *.ROS")
          shell("del TEAM*")
        }
        setwd(wd)

        setwd("download.folder/zipped")
        if (.Platform$OS.type == "unix"){
          system("rm *.zip")
        } else {
          shell("del *.zip")
        }
        setwd(wd)
      }
      download.retrosheet(season)
      unzip.retrosheet(season)
      create.csv.file(season)
      create.csv.roster(season)
      cleanup()
    }

    Roster2011 <- read.csv("roster2011.csv")
    data2011 <- read.csv("all2011.csv", header=FALSE)
    fields <- read.csv("fields.csv")
    names(data2011) <- fields[, "Header"]
    data2011$HALF.INNING <- with(data2011, paste(GAME_ID, INN_CT, BAT_HOME_ID))
    data2011$RUNS.SCORED <- with(data2011, (BAT_DEST_ID > 3) +

                                 (RUN1_DEST_ID > 3) + (RUN2_DEST_ID > 3) + (RUN3_D
    # [line truncated in the source]

    # label a base-out state as "<runners> <outs>", e.g. "100 1"
    get.state <- function(runner1, runner2, runner3, outs){
      runners <- paste(runner1, runner2, runner3, sep="")
      paste(runners, outs)
    }

    # state before the play
    RUNNER1 <- ifelse(as.character(data2011[,"BASE1_RUN_ID"])=="", 0, 1)
    RUNNER2 <- ifelse(as.character(data2011[,"BASE2_RUN_ID"])=="", 0, 1)
    RUNNER3 <- ifelse(as.character(data2011[,"BASE3_RUN_ID"])=="", 0, 1)
    data2011$STATE <- get.state(RUNNER1, RUNNER2, RUNNER3, data2011$OUTS_CT)

    # state after the play
    NRUNNER1 <- with(data2011, as.numeric(RUN1_DEST_ID==1 | BAT_DEST_ID==1))
    NRUNNER2 <- with(data2011, as.numeric(RUN1_DEST_ID==2 | RUN2_DEST_ID==2 |
                                          BAT_DEST_ID==2))
    NRUNNER3 <- with(data2011, as.numeric(RUN1_DEST_ID==3 | RUN2_DEST_ID==3 |
                                          RUN3_DEST_ID==3 | BAT_DEST_ID==3))
    NOUTS <- with(data2011, OUTS_CT + EVENT_OUTS_CT)
    data2011$NEW.STATE <- get.state(NRUNNER1, NRUNNER2, NRUNNER3, NOUTS)

    # keep only plays that change the state or score a run
    data2011 <- subset(data2011, (STATE!=NEW.STATE) | (RUNS.SCORED>0))

    library(plyr)
    data.outs <- ddply(data2011, .(HALF.INNING), summarize,
                       Outs.Inning = sum(EVENT_OUTS_CT))
    data2011 <- merge(data2011, data.outs)
    # keep complete half-innings, then keep batting events only; the second
    # filter is applied to data2011c so that the first filter is not discarded
    data2011c <- subset(data2011, Outs.Inning == 3)
    data2011c <- subset(data2011c, BAT_EVENT_FL == TRUE)

    library(car)
    # collapse every three-out state into the single absorbing state "3"
    data2011c$NEW.STATE <- recode(data2011c$NEW.STATE,
      "c('000 3', '100 3', '010 3', '001 3',
         '110 3', '101 3', '011 3', '111 3') = '3'")

    T.matrix <- with(data2011c, table(STATE, NEW.STATE))
    P.matrix <- prop.table(T.matrix, 1)

To find individual player transition matrices use

    T3 = with(dataseasonc, table(BAT_ID, STATE, NEW.STATE))
    T.matrix = T3[BAT_ID,,]
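The counting and row-normalizing steps above can be seen on a very small example. The following is a minimal sketch, assuming five invented plate appearances and two hypothetical batter IDs ("a" and "b"); it is not Retrosheet data, and it only illustrates how table and prop.table turn pairs of states into transition probabilities.

```r
# Toy illustration only: the plays and batter IDs below are invented.
get.state <- function(runner1, runner2, runner3, outs){
  runners <- paste(runner1, runner2, runner3, sep="")
  paste(runners, outs)
}

toy <- data.frame(
  BAT_ID    = c("a", "a", "b", "a", "b"),
  # state before each play: runner on first (0/1), second, third, outs
  STATE     = get.state(c(0, 0, 1, 0, 1), 0, 0, c(0, 0, 0, 1, 0)),
  # state after each play
  NEW.STATE = get.state(c(1, 0, 1, 0, 0), c(0, 0, 1, 0, 0), 0, c(0, 1, 0, 2, 2)),
  stringsAsFactors = FALSE
)

# Counts of transitions from each starting state (rows) to each new state (columns).
T.toy <- with(toy, table(STATE, NEW.STATE))

# Divide each row by its total so that every row is a probability distribution.
P.toy <- prop.table(T.toy, 1)
print(P.toy)

# Three-way table: the first index selects the transition counts of one batter.
T3.toy <- with(toy, table(BAT_ID, STATE, NEW.STATE))
print(T3.toy["a", , ])
```

Each row of P.toy sums to one, and the slice T3.toy["a", , ] contains only the three plays attributed to batter "a".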

To find transition matrices by batting position use

    T4 = with(dataseasonc, table(BAT_LINEUP_ID, STATE, NEW.STATE))
    T.matrix = T4[BAT_LINEUP_ID,,]

Code used for simulation of a single game:

    endingstate=c("000 0", "000 1", "000 2", "001 0", "001 1", "001 2", "010 0", "01
    # [line truncated in the source]
    C=endingstate
    d=1:25
    inning=1:9
    batter=function(k){rep(mat2[k, ], length.out=150)}
    # mat2 is a 9 by 9 matrix of the batting lineups that take position k in the lineu
    # [line truncated in the source]

    ######simulation of baseball game
    ### Line up
    A01=position1p.matrix
    A02=position2p.matrix
    A03=position3p.matrix
    A04=position4p.matrix
    A05=position5p.matrix
    A06=position6p.matrix
    A07=position7p.matrix
    A08=position8p.matrix

    A09=position9p.matrix
    A10=Playerp.matrix
    Transitional.Matrix=list(A01, A02, A03, A04, A05, A06, A07, A08, A09, A10)

    simulate.game=function(A01, A02, A03, A04, A05, A06, A07, A08, A09, A10, R){
      runs=0
      i=1
      j=batter(k)[i]
      for(inning in 1:9) {
        runs.in.inning=0
        state="000 0"
        while(1) {
          # transition probabilities of the current batter from the current state
          Z=Transitional.Matrix[[j]]
          TP=Z[state,]
          C1=C[TP>0]
          C2=TP[TP>0]
          newstate=sample(C1, 1, prob=C2)

          # three outs: the half-inning is over
          if(newstate=="3") {
            break
          }
          runs=(R[state, newstate] + runs)
          runs.in.inning=(R[state, newstate] + runs.in.inning)
          state=newstate
          i=i+1
          j=batter(k)[i]
        }
        # move to the next batter and reset the state for the next inning
        i=i+1
        j=batter(k)[i]
        state="000 0"
      }
      runs
    }

To run a simulation for a player in different batting positions for multiple games use:

    shuffle.lineup=function(k) {
      batter(k)
      RUNS = replicate(100000, simulate.game(A01, A02, A03, A04, A05, A06, A07, A08,
      # [line truncated in the source]
      mean(RUNS)
    }
    m<-numeric(10)
    for(k in (1:9)){m[k]=shuffle.lineup(k)}
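The simulation code above depends on objects that are built elsewhere (mat2, the position matrices, and the run matrix R), so it cannot be run in isolation. The following self-contained sketch shows the same sampling loop on a deliberately simplified chain. It rests on a toy assumption that is not part of the model in this work: every plate appearance is either a home run or an out, so a state is just the number of outs. All names in the block (make.P, R.toy, simulate.half.inning) are hypothetical.

```r
# Toy model (an assumption for illustration): every plate appearance is either
# a home run (state unchanged, one run scores) or an out (outs increase by one).
# States are the number of outs; "3" is the absorbing end-of-inning state.
states <- c("0", "1", "2", "3")

make.P <- function(h){
  P <- matrix(0, 4, 4, dimnames = list(states, states))
  for(s in 1:3){
    P[s, s]     <- h       # home run: same number of outs
    P[s, s + 1] <- 1 - h   # out: move to the next state
  }
  P["3", "3"] <- 1         # three outs: the inning is over
  P
}

# Runs scored on each transition: one run whenever the number of outs is unchanged.
R.toy <- diag(4)
dimnames(R.toy) <- list(states, states)
R.toy["3", "3"] <- 0

simulate.half.inning <- function(P, R){
  runs  <- 0
  state <- "0"
  while(state != "3"){
    TP <- P[state, ]
    # sample the next state among those with positive probability
    newstate <- sample(states[TP > 0], 1, prob = TP[TP > 0])
    runs  <- runs + R[state, newstate]
    state <- newstate
  }
  runs
}

set.seed(1)
P.toy <- make.P(0.1)
RUNS  <- replicate(20000, simulate.half.inning(P.toy, R.toy))
mean(RUNS)   # should be close to the exact value 3 * 0.1 / 0.9
```

In this toy chain the expected number of runs per half-inning is 3h/(1-h), which gives a simple check that the sampling loop and the run matrix are wired together correctly before moving to the full 25-state model.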

Bibliography

[1] Baseball Prospectus, www.baseballprospectus.com, [Online; accessed 2014].
[2] Baseball Reference, Major League Baseball statistics, www.baseball-reference.com, [Online; accessed 2014].
[3] Companion to Analyzing Baseball Data with R, https://github.com/maxtoki/baseball_r, [Online; accessed 2014].
[4] FanGraphs baseball statistics, www.fangraphs.com, [Online; accessed 2015].
[5] Fantasy baseball, http://baseball.fantasysports.yahoo.com, [Online; accessed 2014].
[6] The official site of Major League Baseball, www.mlb.com, [Online; accessed 2014].
[7] Retrosheet home page, www.retrosheet.org/, [Online; accessed 2014].
[8] Sean Lahman's database, www.seanlahman.com/baseball-archive/statistics/, [Online; accessed 2014].
[9] B. Bukiet, E.R. Harold, and J.L. Palacios, A Markov chain approach to baseball, Operations Research 45 (1997), 14-23.
[10] Nobuyoshi Hirotsu and Mike Wright, A Markov chain approach to optimal pinch hitting strategies in a designated hitter rule baseball game, Operations Research 46 (2003), 353-371.
[11] Max Marchi and Jim Albert, Analyzing baseball data with R, CRC Press, 2014.
[12] Sheldon M. Ross, Simulation, fifth edition, Academic Press, 2013.
[13] Tom M. Tango, Mitchel G. Lichtman, and Andrew E. Dolphin, The book: Playing the percentages in baseball, Potomac Books, Inc, 2007.
[14] R.E. Trueman, Analysis of baseball as a Markov process, Optimal Strategies in Sports (1977), 68-76.