Predicting the use of the sacrifice bunt in Major League Baseball BUDT 714 May 10, 2007

Similar documents
Predicting the use of the Sacrifice Bunt in Major League Baseball. Charlie Gallagher Brian Gilbert Neelay Mehta Chao Rao

Simulating Major League Baseball Games

Do Clutch Hitters Exist?

BASKETBALL PREDICTION ANALYSIS OF MARCH MADNESS GAMES CHRIS TSENG YIBO WANG

SAP Predictive Analysis and the MLB Post Season

Guide to Softball Rules and Basics

Additional On-base Worth 3x Additional Slugging?

Left Fielder. Homework

TOP OF THE TENTH Instructions

A Novel Approach to Predicting the Results of NBA Matches

The MLB Language. Figure 1.

Gouwan Strike English Manual

Field Manager s Rulebook

Des Plaines Youth Baseball

SUPERSTITION LITTLE LEAGUE LOCAL RULES

Hershey Little League Player Development Guide

2017 JERSEY CENTRAL LEAGUE CONSTITUTION

BASEBALL PINTO DIVISION RULES OF PLAY LAKES YOUTH ATHLETIC ASSOCIATION

PREDICTING the outcomes of sporting events

Major League Baseball Offensive Production in the Designated Hitter Era (1973 Present)

Offensive & Defensive Tactics. Plan Development & Analysis

BASEBALL SHETLAND DIVISION RULES OF PLAY

TEE BALL BASICS. Here is a field diagram: 45 feet

Williamsburg Youth Baseball League 2018 Rules

NPYL Minor Division Rules Boys 9 & 10 years of age Revised January 2015

Computer Scorekeeping Procedures

MONEYBALL. The Power of Sports Analytics The Analytics Edge

SFX YOUTH SPORTS T-ball Coaching Handbook

1 st basemen. I will:

Computer Scorekeeping Procedures Updated: 6/10/2015

Season Plans. We hope you have learned a lot from this book: what your responsibilities. chapter 9

Bulldog Baseball Player Guide

CARY GROVE YOUTH BASEBALL and SOFTBALL PONY LEAGUE RULES

Little League Skill Sets for Each Division Level

Building an NFL performance metric

Predicting the Total Number of Points Scored in NFL Games

EMGSL 2010 Rules of Play

Rohnert Park Rebels Tournament Rules (10U, 12U, 14U, 16U)

2015 GTAAA Jr. Bulldogs Memorial Day Tournament

SCLL League Rules 2017

An Analysis of the Effects of Long-Term Contracts on Performance in Major League Baseball

DRILL #1 LEARN THE BASES

Player Progression Baseball

TPLL 2017 MINORS RULES

CARY GROVE YOUTH BASEBALL and SOFTBALL PONY LEAGUE RULES

UABA Coaches Manual. Mission Statement: The Coaches:

Richmond City Baseball. Player Handbook

LESSON PLAN FOCUS - Bunt Defenses, Double Plays, Bunt for a hit, Have a plan in each count. Name of Activity Description Key Teaching Points

Daily Dozen For Pitchers* Two line communication drill in OF. NFL Wide Receiver Game. 5 Ball Game(Teach players to run this one)

Westfield Youth Sports, Inc. Double A Baseball Rules

Rookies Division Rules (last update April 19, 2015)

GUIDE TO BASIC SCORING

CHLL 11U and 12U Rules

Minors Division (10u) Rules

2018 SEASON: MINORS PALOS VERDES LITTLE LEAGUE LOCAL RULES

The 2015 MLB Season in Review. Using Pitch Quantification and the QOP 1 Metric

Department of Economics Working Paper Series

Matt Halper 12/10/14 Stats 50. The Batting Pitcher:

Defensive Observations. II. Build your defense up the middle. By Coach Jack Dunn I. Defensive Observations

Table 1. Average runs in each inning for home and road teams,

GRASSLAND BASEBALL RULES & REGULATIONS 9-10 LEAGUE

Our Shining Moment: Hierarchical Clustering to Determine NCAA Tournament Seeding

Any team member may pitch, subject to the restrictions of the pitch count as recommended

Beaver Bash Tournament Rules and Information

5. Any player or coach throwing equipment in anger is subject to ejection from the game at the discretion of the umpire.

DOUBLE A DIVISION (AA)

Wednesday, December 27

Montville Travel League 2017 League Rules

DRILL #1 FROM THE TEE

Name of Activity Description Key Teaching Points. Sideline base running - Deep in LF or RF. Rotate groups after 5 minutes

Hitting with Runners in Scoring Position

The Change Up. Tips on the Change Up

AHYAA_PBA_PH Combined 12 Fast Pitch Softball Rules

2018 Colorado Springs Little League. Baseball Division-Specific Rules & Regulations. Local Supplemental Document

A One-Parameter Markov Chain Model for Baseball Run Production

Friday, July 28 ACTIVITY LOCATION TEAM STARTING POINTS. Loosen Arms/ Radar Velocity Field 3 RF Line Cardinals (Red/ Grey)

2018 Red Land 7U Tournament Rules June 23 24, 2018

YOUTH BASEBALL & SOFTBALL PROGRAM RULES AND REGULATIONS 2018

Table of Contents. Pitch Counter s Role Pitching Rules Scorekeeper s Role Minimum Scorekeeping Requirements Line Ups...

An Analysis of Factors Contributing to Wins in the National Hockey League

A Markov Model of Baseball: Applications to Two Sluggers

Milford Little League Pee Wee In-park Rules

MSHYB 2018 American Association Rules

2018 Lincoln Sox NIT & All State Showcase Event

2015 Shetland Score Keeping Guide

TOURNAMENT RULES -and- PROCEDURES

2018 WINTER WIFFLEBALL LEAGUE RULES (Last changes were made on 2/7/17)

2018 Rule Changes. Please note: These changes are NOT listed in the printed version of the 2018 rulebooks,

Madison Baseball Association Local Rules and Regulations Rookie League (7 and 8 year olds)

A Markov Model for Baseball with Applications

Charlottesville MABL, Inc. MSBL/MABL Local League Rules

MBSA Boys Pinto Rules

Predictors for Winning in Men s Professional Tennis

BASEBALL 2. AIM OF THE GAME

For any inquiries into the game contact James Formo at

LESSON PLAN. Name of Activity. Evaluation. Time. Points. Tryout Plan #3 0:00 0:40 1:10. 1:40 Team Tactical Game

Oak Park Youth Baseball & Softball. 10U Softball Rulebook 2016

KID PITCH SPECIFIC RULES 9U & 10U

BABE: THE SULTAN OF PITCHING STATS? by. August 2010 MIDDLEBURY COLLEGE ECONOMICS DISCUSSION PAPER NO

COACH PITCH DIVISION

Transcription:

Predicting the use of the sacrifice bunt in Major League Baseball BUDT 714 May 10, 2007 Group 6 Charles Gallagher Brian Gilbert Neelay Mehta Chao Rao

Executive Summary Background When a runner is on-base during a baseball game, the batter at the plate may be ordered by the manager to bunt, rather than swing away. A properly placed bunt can advance the runner a base, and although the bunter himself will get thrown out, it is a productive out as the runner is now in a position to score. This is called a sacrifice bunt, a technique often used in the course of a game. If the opposing team has insight into the specific situations when a sacrifice bunt will be executed, they will put themselves in a better position to defend it. Goal The goal of this project is to create a model for the St. Louis Cardinals that predicts the tendency of an opposing team to execute a sacrifice bunt in certain situations. While the occurrences of the sacrifice bunt in a given season are small when compared to the total number of plate occurrences that provide an opportunity to bunt, a successful defense of a sacrifice bunt or a sacrifice bunt executed in the right situation can mean a win for the manager ordering the bunt. It is often the case that teams advancing to the playoffs do so by a margin of one win over another team. Data To accomplish our goal, we were sponsored by Sig Mejdal, (the Senior Quantitative Analyst for the Cardinals) who gave us a dataset of 65,535 different events that occurred during the 2004 baseball season (random teams and games) with men on base. The dataset included the two teams, which of the teams was at home, which of the teams was on offense, the event that occurred, the inning, the number of outs before the event, the base-runners (and which bases) before the event, the score before the event, the fielding position of the batter, the batter s spot in the hitting order, and the batter s name. Analysis After an extensive analysis of the data which included adding variables from outside sources as well as creating new variables from the given dataset, we attempted to build classification trees which would provide us a scenario based tool which could be utilized in game situations in order to predict when a sacrifice bunt would be successful. Our end goal was to produce team specific trees that the manager could use as a predictive edge. Due to the extremely low occurrences of bunts in the data, creating a tree was not realistic as initially thought. Even when over sampling the data, the high predictive error rates compared to the Naïve Rule led us to investigate additional models/option. Conclusion We relied heavily on domain knowledge to help guide us through the selection process of the numerous variables given in the data-set. As our next goal, we attempted to use logistic regression to build an accurate model for the customer. In building a global model and several team specific models, many of our assumptions were validated. In looking at the team specific models, we were able to explain many of the significant variables such as why a given position on a certain team would be more likely to bunt. Ultimately, this helped us explain how and why much of the data acted as observed and in the end, led us to believe that a Manager s intuition has the potential to be even more powerful. 2

Technical Summary The Question The question we are trying to answer is predicting the tendency of an opposing baseball team to execute a sacrifice bunt in certain situations. The Data Our data set comes from Sig Mejdal, the Senior Quantitative Analyst for the St. Louis Cardinals. The original data set consisted of over 250,000 observations from the 2004 major league baseball season; in order to fit into Excel, the data was purged, giving us a random sampling of about 65,000 observations. Each observation had 20 variables describing the event. The original variables are described below: tyrgam a unique numerical code identifying the game bat.tm a unique numerical code identifying the team at bat bat.tm.nm the popular name of the team at bat fd.tm a unique numerical code identifying the fielding team fd.tm.nm the popular name of the fielding team home.tm a unique numerical code identifying the home team event a numerical code describing the outcome of the at bat, such as single, double, etc. innin the number of the inning in which the event took place out.before the number of outs before the at bat out.after the number of outs after the at bat basecode.before a numerical code indicating the position of base runners before the at bat basecode.after a numerical code indicating the position of base runners after the at bat score.bat.before the number of runs scored by the batting team before the at bat score.field the number of runs scored by the fielding team before the at bat score.bat.after the number of runs scored by the batting team after the at bat bat.type is a code binning the results from the event column into larger groups such as pop-up, grounder, line out, bunt, etc. fieldpos a numerical code indicating the fielding position of the player at bat bpos a number indicating the batter s order in the lineup last.name the last name of the batter first.name the first name of the batter Our domain knowledge experts are Neelay Mehta and Brian Gilbert, two avid baseball fans. The granularity of the data is at the individual at bat level. The total number of observations in the data set is 65,535. Each of the 30 teams in Major League Baseball has between 1,962 and 2,395 observations. Bunting occurs 3.10% of the time in the overall data set (aka. The Naïve Rule). On average bunting occurs between 4.19% (with a runner on first) and 0.15% (with bases loaded) of the time, when looking at different scenarios with runners on base. On average batters bunt between 12.23% 3

(batter #9, a pitcher or the team s worst hitter) and 0.12% (batter #4, generally a power hitter) of the time. Data cleaning No information was missing in the data set. It was necessary to remove some variables such as out.after, basecode.after, and score.bat.after, because these variables are unknown at the time when we are trying to make our prediction. More variables were eliminated (tyrgam, fd.tm.nm and fieldpos) based on domain knowledge. Additional variables were eliminated based on redundancy, such as bat.tm, fd.tm, and event. Data preprocessing Next we derived some new variables by binning bat.type, innin, and home.tm. The variables were transformed into dummy variables as follows: bat.type becomes Bunt:1, No Bunt: 0, innin becomes Inning 1-9:1, >9:0, and home.tm becomes Home:1, Away: 0. We also created dummy variables for each of the nine categories in bpos, and each of the seven scenarios in basecode.before. Next we created two new variables AL: 0 NL: 1, a dummy variable indicating whether American League or National League rules were in effect, because different rules can encourage or discourage bunting. The other variable created was ScoreDiff (the score of the batting team minus the score of the fielding team), because in close games the batting team is more likely to bunt to get the winning or tying run. Also, we isolated ten top bunters from information on the 2004 season obtained from baseball-reference.com, so those players frequently called on to bunt are accounted for. To visualize the data, we compared the global numbers to those of four specific teams: The Red Sox, Rockies, Brewers, and Dodgers. The Red Sox and Rockies were included because they were the most extreme in terms of bunt usage (Red Sox most rarely, Rockies most frequently). The Brewers and Dodgers had managers (Ned Yost and Jim Tracy) who currently manage intra-division rivals of the Cardinals (Yost still manages the Brewers, Tracy now manages the Rockies). The Cardinals play intradivision teams most frequently. Analysis Based on conversations with Sig Mejdal, we knew he was looking for a predictive model with a high level of transparency, specifically mentioning a decision tree. This made sense to us as well, given that the end user would be someone who is non-technical, such as a team manager. However, regardless of how many attempts we manipulated the data, classification trees were unsuccessful. The best classification tree was generated using over-sampling, and generated an error rate of 21.68%, which is much higher than the Naïve Rule of 3.10%. Next we experimented with logistic regression, which would require a black box approach when actually implemented. We experimented with two versions of logistic regression models: (1) a global model, using all the data to construct one model and (2) separate models specific to each team who the Cardinals are playing against. While we preferred the Team-specific models as they were more parsimonious, their problems with predictive accuracy forced us to use the global models. Our global model used a best subset, with a cutoff of 0.5. It scored a validation rate of 3.00%, marginally better than the Naïve Rule of 3.10%. 4

Appendix 1: Visualizing the data Inning vs. Bunt Percentage 25% Bunt Percentage 20% 15% 10% 5% 0% 1 2 3 4 5 6 7 8 9 10 Inning Dodgers Rockies Red Sox Brewers Global Batter Position vs. Bunt Percentage 25% Bunt Percentage 20% 15% 10% 5% 0% 1 2 3 4 5 6 7 8 9 Batter Position Dodgers Rockies Red Sox Brewers Global 5

Appendix 2: The final regression model The Regression Model Input variables Constant term Inning 1-9: 0 10+: 1 out.before basecode.before_3 basecode.before_4 basecode.before_6 basecode.before_7 binned score diff_2 binned score diff_5+ binned bpos_3-7 binned bpos_9 Top Bunter Team binned_2 Team binned_3 Team binned_4 Coefficient Std. Error p-value Odds -2.91484118 0.19527394 0 * 1.43896341 0.31070757 0.00000363 4.2163229-2.06532502 0.12200621 0 0.12677707-1.18709564 0.52609891 0.02404486 0.3051061 0.47677508 0.16713402 0.00433562 1.61087108-2.63196373 1.02573335 0.01028984 0.07193705-2.14123392 0.72531897 0.00315593 0.11750975 0.66862828 0.20002525 0.00082962 1.95155859-0.85927516 0.3422673 0.01205473 0.42346892-1.65075326 0.1863011 0 0.1919053 1.85490274 0.15460882 0 6.39107656 1.51956904 0.26843038 0.00000002 4.57025528 0.90669906 0.20409924 0.00000889 2.47613549 1.00761068 0.20899259 0.00000143 2.73904896 1.47704506 0.21063162 0 4.3799839 Residual df Residual Dev. % Success in training data # Iterations used Multiple R-squared 9985 1757.736206 3.17 9 0.3749283 Validation Data scoring - Summary Report Cut off Prob.Val. for Success (Updatable) 0.5 Classification Confusion Matrix Predicted Class Actual Class 1 0 1 226 1317 0 184 48273 Error Report Class # Cases # Errors % Error 1 1543 1317 85.35 0 48457 184 0.38 Overall 50000 1501 3.00 Elapsed Time Overall (secs) 593.00 6