arxiv: v1 [math.pr] 31 Mar 2014

Similar documents
Home Team Advantage in English Premier League

YouGov Survey Results

The Premier League housing boom

King s College London, Division of Imaging Sciences & Biomedical Engineering, London,

HOW TO READ FORM LAB S GAME NOTES

PREDICTING the outcomes of sporting events

PLEASE NOTE MATCHES HIGHLIGHTED IN BLUE WHICH ARE SUBJECT TO CHANGE RESULTING FROM FA CUP, CHAMPIONS LEAGUE or EUROPA LEAGUE PARTICIPATION

Bayesian networks for unbiased assessment of referee bias in Association Football

Chelsea takes the title before the first ball of the Premier League is even kicked

Simulating Major League Baseball Games

The First 25 Years of the Premier League

Scoring dynamics across professional team sports: tempo, balance and predictability

Using Poisson Distribution to predict a Soccer Betting Winner

UNIT 3 Graphs Activities

An examination of try scoring in rugby union: a review of international rugby statistics.

SPORTS TEAM QUALITY AS A CONTEXT FOR UNDERSTANDING VARIABILITY

Spurs finally top Premier League at least in terms of house price growth

TECHNICAL STUDY 2 with ProZone

Evaluating Australian Football League Player Contributions Using Interactive Network Simulation

Chapter 5: Methods and Philosophy of Statistical Process Control

The impact of human capital accounting on the efficiency of English professional football clubs

Does Momentum Exist in Competitive Volleyball?

WELCOME TO FORM LAB MAX

Reboot Annual Review of Football Finance Sports Business Group June 2016

Predictive Analysis of Football Matches using In-play Data

NBA TEAM SYNERGY RESEARCH REPORT 1

A Network-Assisted Approach to Predicting Passing Distributions

BET AGAINST THE MASSES 219

Analysis of Curling Team Strategy and Tactics using Curling Informatics

Honest Mirror: Quantitative Assessment of Player Performances in an ODI Cricket Match

Applying Occam s Razor to the Prediction of the Final NCAA Men s Basketball Poll

MLU 2012 Spring Break Camp

Revisiting the Hot Hand Theory with Free Throw Data in a Multivariate Framework

Why We Should Use the Bullpen Differently

Behavior under Social Pressure: Empty Italian Stadiums and Referee Bias

premier league COMMUNITIES 2013/14

Premier League 2 and Professional Development League Bulletin. No. 3

A IMPROVED VOGEL S APPROXIMATIO METHOD FOR THE TRA SPORTATIO PROBLEM. Serdar Korukoğlu 1 and Serkan Ballı 2.

SCANCOMING UK FIXTURE LIST

Premier League 2 and Professional Development League. No. 30

Results. North. South. 1-0 West Ham United 2-3 Aston Villa

Navigate to the golf data folder and make it your working directory. Load the data by typing

Using Spatio-Temporal Data To Create A Shot Probability Model

Opleiding Informatica

U18 Bulletin. No. 12

Game Theory (MBA 217) Final Paper. Chow Heavy Industries Ty Chow Kenny Miller Simiso Nzima Scott Winder

One of the most-celebrated feats

Lab Report Outline the Bones of the Story

An Analysis of Factors Contributing to Wins in the National Hockey League

save percentages? (Name) (University)

Notational Analysis - Performance Indicators. Mike Hughes, Centre for Performance Analysis, University of Wales Institute Cardiff.

A Novel Approach to Predicting the Results of NBA Matches

Project Title: Overtime Rules in Soccer and their Effect on Winning Percentages

Results. North. South. Wolverhampton Wanderers. 3-2 West Bromwich Albion. 5-0 Leicester City. 1-1 Southampton

Football clubs in the Premier League. Are they doing what they should for disabled people? Easy Read

The Willingness to Walk of Urban Transportation Passengers (A Case Study of Urban Transportation Passengers in Yogyakarta Indonesia)

SECRET BETTING CLUB FINK TANK FREE SYSTEM GUIDE DECEMBER 2013 UPDATE

HOME ADVANTAGE IN FIVE NATIONS RUGBY TOURNAMENTS ( )

SCANCOMING UK FIXTURE LIST

U18 Premier League and U18 Professional Development League. No. 3

UEFA CHAMPIONS LEAGUE /15 SEASON MATCH PRESS KITS

Tech Suit Survey (Virginia LMSC)

Kinetic Energy Analysis for Soccer Players and Soccer Matches

A Game Theoretic Study of Attack and Defense in Cyber-Physical Systems

Using Markov Chains to Analyze a Volleyball Rally

Reading Time: 15 minutes Writing Time: 1 hour 30 minutes. Structure of Book. Number of questions to be answered. Number of modules to be answered

Connecting with communities 2015/16

RETREATING LINE INTRODUCTION

Ranger Walking Initiation Stephanie Schneider 5/15/2012 Final Report for Cornell Ranger Research

Using Markov Chains to Analyze Volleyball Matches

arxiv: v1 [stat.ap] 18 Nov 2018

Coaching Players Ages 17 to Adult

Running head: DATA ANALYSIS AND INTERPRETATION 1

from ocean to cloud HEAVY DUTY PLOUGH PERFORMANCE IN VERY SOFT COHESIVE SEDIMENTS

SCANCOMING UK FOOTBALL LIST

A V C A - B A D G E R R E G I O N E D U C A T I O N A L T I P O F T H E W E E K

BUILDING UP PLAY - TACTICAL PATTERNS OF PLAY CHAPTER 6

Evaluation of the Performance of CS Freiburg 1999 and CS Freiburg 2000

To focus on winning one competition or try to win more: Using a Bayesian network to help decide the optimum strategy for a football club

Analysis of Shear Lag in Steel Angle Connectors

Predicting Results of a Match-Play Golf Tournament with Markov Chains

4th Down Expected Points

ESTIMATION OF THE DESIGN WIND SPEED BASED ON

Sports Analytics: Designing a Volleyball Game Analysis Decision- Support Tool Using Big Data

Homeostasis and Negative Feedback Concepts and Breathing Experiments 1

LEE COUNTY WOMEN S TENNIS LEAGUE

Premier League 2 and Professional Development League Bulletin. No. 10

Premier League 2 and PDL Bulletin. No. 8

Matt Halper 12/10/14 Stats 50. The Batting Pitcher:

North Season 2016/ /09/2016

National Unit Specification: General Information

CARDIFF METROPOLITAN UNIVERSITY Prifysgol Fetropolitan Caerdydd CARDIFF SCHOOL OF SPORT DEGREE OF BACHELOR OF SCIENCE (HONOURS)

Matthew Gebbett, B.S. MTH 496: Senior Project Advisor: Dr. Shawn D. Ryan Spring 2018 Predictive Analysis of Success in the English Premier League

Basketball data science

Manchester City Vs Liverpool. Monday 25th August Get full match stats every week here

Is Home-Field Advantage Driven by the Fans? Evidence from Across the Ocean. Anne Anders 1 John E. Walker Department of Economics Clemson University

5.1 Introduction. Learning Objectives

TIME-ON-TIME SCORING

Effects of directionality on wind load and response predictions

Premier League 2 and PDL Bulletin. No. 16

Transcription:

A Markovian model for association football possession and its outcomes Javier López Peña arxiv:1403.7993v1 [math.pr] 31 Mar 2014 Abstract We propose a bottom-up approach to the study of possession and its outcomes for association football, based on probabilistic finite state automata with transition probabilities described by a Markov process. We show how even a very simple model yields faithful approximations to the distribution of passing sequences and chances of taking shots for English Premier League teams, which we fit using a whole season of granular game data (380 games). We compare the resulting model with classical top-down distributions traditionally used to describe possessions, showing that the Markov models yield a more accurate asymptotic behavior. 1 Introduction In association football (football in the forthcoming) key game events such as shots and goals are very rare, in stark contrast to other team sports. By comparison, passes are two orders of magnitude more abundant than goals. It stands to reason that in order to get a comprehensive statistical summary of a football game passes should be one of the main focal points. However, not much attention has been paid to passing distributions, and the media attention to passes is normally limited to the total number of them, sometimes together with the passing accuracy. A similar lack of attention is paid to the analysis of possession, oftentimes limited to a single percentage value. Kickdex Ltd, and Department of Mathematics, University College London. javier@kickdex.com In [1] Reep and Benjamin analyze the distribution of length of passing sequences in association football and compare it to Poisson and negative binomial distributions. More recent works suggest the distribution of lengths of passing sequences in modern football is better explained by Bendford s law instead. However, all these works take a top-down approach in which a model is arbitrarily chosen and then fitted to the distribution, without any explanation on why football games should be described by that model. In this work, we propose to take a bottom-up approach instead, inspired by our previous work on passing networks (cf. [2]) we model a team s game by a finite state automaton with states Recovery, Possession, Ball lost, Shot taken, with evolution matrix obtained from historical game data. The resulting Markovian system provides a model for possession after any given number of steps, as well as estimates for the probabilities of likelihood of either keeping possession or for it resulting in either of the two possible outcomes: lost ball or shot taken. By choosing adequate fitting data, we compare the possession models for different teams and show how they vary across different leagues. We then compare the resulting model with the ones previously used in the literature (Poisson, NBD, Pareto) and with the actual possessions data in order to find the best fit. The obvious advantage of our model is that it explains the resulting distribution as the natural limit of a Markov process.

2 Determining possession lengths Since football is such a fluid game, it is not always immediately clear which sequences of events constitute a possession. For our analysis, we consider the simplest possible notion of possession as any sequence of consecutive game events in which the ball stays in play and under control of the same team. As such, we will consider that a possession starts the first time a team makes a deliberate action on the ball, and ends any time the ball gets out of the pitch or there is a foul (regardless who gets to put the ball in play afterwards), any time there is a deliberate on-ball action by the opposing team, such as a pass interception tackle or a clearance (but not counting unsuccessful ball touches which don t interrupt the game flow), or any time the team takes a shot, regardless the outcome is a goal, out, hitting the woodwork, or a goalkeeper save. One could introduce a further level of sophistication by distinguishing between clear passes and divided balls, such as passes to a general area where player of both teams dispute the ball (for instance in an aerial duel). For the sake of simplicity, we will not take this into consideration. It is also worth noting that when measuring possession length we shall consider all in-ball actions, and not just passes. In particular dribbles, won aerial duels, and self-passes will all be considered as valid actions when accounting for a single possession. Share of possessions 10 0 10 1 10 2 10 3 10 4 10 5 Pareto Poisson NBD 10 6 0 2 4 6 8 10 Number of consecutive actions Figure 1: Classical top-down models In [1] it is suggested that length of passing sequences can be approximated by Poisson or Negative Binomial distributions. Our tests suggest this is no longer the case for the average EPL team in the 2012/2013 season: Figure 1 shows how the best fitting Poisson and NBD grossly underestimate the share of long possessions. For the sake of comparison, we have also included the fitting of a Pareto distribution, which displays the opposite effect (underestimating short possessions and overestimating longer ones). For our data, all three classic distributions fail to meet the asymptotic behavior of the observed data. This can be partly explained by our different notion for possession length, since Reep and Benjamin only consider the total number of passes in a possession, but the different definition does not tell us the whole story. According to Reep and Benjamin data, there are only 17 instances, measured over 54 games in 1957-58 and 1961-61, of sequences involving 9 passes or more (out of around 30000 possessions). Even if we use their more restrictive notion of possession there are many more instances of long possession nowadays. A possible explanation might be the generally admitted fact that football playing style has evolved over the years, leaning towards a more and more technical style which favors longer possessions, with many teams consistently playing possessions of 20 passes or even more. In any case, it is quite evident that in order to accurately describe possessions in current professional football we need to find a different type of model which allows for a longer tail. One possible such candidate would be a power-law distribution, such as Pareto s (also plotted in Figure 1 for comparison). As we plot longer parts of the tail, however, it will become apparent that Pareto distribution does not display the correct type of asymptotic behavior in order to describe our data. 3 Markovian possession models A typical possession in a football game starts with a ball recovery, which can be achieved in an active or passive manner. Ball recovery is followed by a se- 2

quence of ball movements that will eventually conclude with a loss of possession. Said lost of possession can be either intentional, due to an attempt to score a goal, or unintentional, due to an error, an interception/tackle, or an infringement of the rules. The different possession stages can be modeled using a nondeterministic finite state automaton, with initial state indicating the start of a possession, final states indicating the end of the possession, and some number of other states to account for intermediate stages of the possession, as well as appropriate probabilities for the state transitions. The freedom in the choice of intermediate states allows for a wealth of different Markov models with varying degrees of complexity. One of the simplest such possible schemes is to consider only one intermediate state Keep to indicate continued possession, and two different final states Loss and Shot indicating whether the possession ends in a voluntary manner (by taking an attempt to score a goal) or in an involuntary manner. This simple model is schematized by the following diagram: Recovery p k Keep p l p s Figure 2: Simple Markov model Loss Shot A slight variation allows to track for divided balls (though using this model would require us to change the definition of possession that we described above): In an alternative model, one might be interested on the interactions among players in different positions (loops have been removed from the graph to avoid cluttering). This model is sketched in Figure 4. Once we have chosen the set of states that configure our Markov model, its behavior is then described by the transition matrix A = (a i,j ), where entry a i,j represents the probability of transitioning from state i to state j; similarly, given an initial state i, the probability of the game being at state j after r steps is Recovery GK p r,k p r,d p d,k p k Keep p k,d Divided p s p l p d,l Shot Loss Figure 3: Markov model with divided balls Def Rec Mid Fwd Loss Shot Figure 4: Positional Markov model given by A r e i, where e i = (0,..., 0, 1, 0,..., 0) is the vector with coordinate 1 in position i and 0 everywhere else. More complex Markov models allow for a richer representation of game states, but since transition probabilities must (in general) be determined heuristically, they will be harder to fine-tune. 4 Model fitting As a proof of concept, we will perform a detailed fitting of the simplest Markov model described in the previous section. This first approximation model works under the very simplistic assumption that the transition probabilities between two game states remain constant through the entire sequence of events. Transition probabilities p k, p l, p s (which must satisfy 3

p k + p s + p l = 1) are, respectively, the probability of keeping possession, losing the ball, or taking a shot. Since the transition matrix for this system is extremely simple, this model can be solved analytically, yielding the probability distribution given by P ({X = x}) = (1 p k )p x 1 k Ce λx, (1) (where λ = log p k and C is a normalization constant) suggesting that the distribution of lengths follows a pattern asymptotically equivalent to an exponential distribution. Share of possessions 10 0 10 1 10 2 10 3 10 4 10 5 Pareto Simple Markov Adjusted Markov 10 6 0 5 10 15 20 25 30 Number of consecutive actions Figure 5: Power law and Markov models As Figure 5 shows, the simple Markov model provides a good fitting model for the general trend, with the correct asymptotic behavior, but it tends to over-estimate occurrence of sequences of 3 to 8 actions, as well as severely underestimating the number of sequences consisting on a single action. For individual Premier League teams, Pearson s goodness of fit test yields a p value higher than 0.99 in all cases. The simple Markov model can be easily improved by weakening the constraint on the transition probability being constant. A simple way of doing it is by adding an accumulation factor b (with 0 < b < 1) and modifying the probability distribution to P ({X = x}) = C(e λx + b x ), (2) since b x goes to 0 as x increases, the added factor does not modify the asymptotic behavior of the resulting probability distribution, but it allows to correct for the errors in the probabilities for short possessions. It is worth noting that whilst this adjusted model (also shown in Figure 5) does not strictly come from a Markov process (as the transition probability is no longer constant), the additional factor b can be interpreted as a (negative) self-affirmation feedback, incorporated to the model in a similar way as the one used in [3] for goal distributions. A possible interpretation of this factor would be the added difficulty of completing passes as a team proceeds to move the ball closer to their opponents box. A higher value of the b coefficient means that a team is more likely to sustain their passing accuracy over the course of a long possession. Besides the distribution of length of possessions, the Markov model allows us to study the probability of any state of the model after a given number of steps. In particular, one can look at the Shot state, obtaining a model for the probability that a shot will have taken place within a given number of actions. Once again, the system can be solved analytically, yielding x 1 P ({Shot X x}) = p i k p 1 p x k s = p s, (3) 1 p k i=0 where this probability should be understood as the chance that a team will take a shot within x consecutive ball touches. Figure 6 shows once again how the model yields a good approximation to the observed data fitting comfortably within the error bars. The p value for the Pearson s goodness of fitting test is again greater than 0.99. The fitted coefficients for all the Premier League teams in the 2012-2013 season are listed in table 1 (teams are sorted by final league position). There is a remarkable correlation between higher values of the keep probability p k and what is generally considered nice gameplay, as well as between higher values of the shot probability p s and teams considered to have a more aggressive football style. Traditionally possession is considered a bad indicator of performance, but our number suggest that this does not need to be the case when it is analyzed in a more sophisticated manner. Apart from a 4

Chance of taking a shot 0.12 0.10 0.08 0.06 0.04 0.02 Markov model 0.00 0 5 10 15 20 25 30 Number of consecutive actions Figure 6: Probability of shots for the Markov model few outliers, such as under-performers Swansea and Wigan Athletic, and over-performing Stoke, most teams fitted probabilities correlate fairly well with their league table positions. 5 Conclusions and future work We have shown how Markov processes provide a bottom-up approach to the problem of determining the probability distribution of possession related game states, as well as their outcomes. The bottom up approach has an obvious advantage of providing a conceptual explanation for the resulting probability distribution, but besides that we have shown that even a very simple Markov model yields better approximations than the classical top-down approaches. It is worth noting that even if we focused in the particular case of association football (and concretely the English Premier League) the type of analysis we have carried out is of a very general nature and can easily be performed for many team sports which follow similar possession patterns, such as basketball, hockey, handball, or waterpolo, to name a few. Besides the study of probability distributions, sufficiently granular Markov processes can be used to carry out game simulations. In theory one might want to consider a full Markov model containing a state for every single player, as in the passing networks described in [2], and use it as the base of an Team p k p s b 1 Manchester United 0.794 0.017 0.685 2 Manchester City 0.794 0.016 0.683 3 Chelsea 0.785 0.018 0.663 4 Arsenal 0.797 0.015 0.690 5 Tottenham Hotspur 0.782 0.016 0.653 6 Everton 0.772 0.017 0.631 7 Liverpool 0.788 0.019 0.672 8 West Bromwich Albion 0.771 0.018 0.630 9 Swansea City 0.794 0.018 0.684 10 West Ham United 0.756 0.018 0.589 11 Norwich City 0.764 0.014 0.601 12 Fulham 0.783 0.018 0.657 13 Stoke City 0.752 0.015 0.575 14 Southampton 0.772 0.015 0.630 15 Aston Villa 0.771 0.016 0.619 16 Newcastle United 0.771 0.019 0.624 17 Sunderland 0.763 0.017 0.601 18 Wigan Athletic 0.783 0.017 0.660 19 Reading 0.749 0.016 0.558 20 Queens Park Rangers 0.767 0.016 0.618 Table 1: EPL Markov model fitting parameters agent based model in order to forecast game outcomes, although finding all the probability transitions in that extreme situation would admittedly be very hard. 6 Data and analysis implementation details Our analysis uses data for all the English Premier League games in the 2012-2013 season (380 games). Raw data for game events was provided by Opta. Data munging, model fitting, analysis, and chart plotting were performed using IPython [4] and the python scientific stack [5, 6]. References [1] C. Reep and B. Benjamin Skill and chance in Association Football. J. of the Royal Stat. Soc. 5

A, 131 (1968) 581 585. [2] J. López Peña and H. Touchette A network theory analysis of football strategies. In Sports Physics. École Polytechnique Univ. Press, 519 530. [11] J. Duch, J. S. Waitzman, and L. A. N. Amaral. Quantifying the performance of individual players in a team activity. PloS One, 5(6):e10937, 2010. [3] E. Bittner, A. Nußaumer, W. Janke and M. Weigel Self-affirmation model for football goal distributions. EPL, 78 (2007) 58002 DOI: 10.1209/0295-5075/78/58002. [4] F. Pérez and B. E. Granger IPython: A System for Interactive Scientific Computing Computing in Science and Engineering, 9 (2007) 21 29 DOI: 10.1109/MCSE.2007.53. [5] T.E. Oliphant Python for Scientific Computing Computing in Science & Engineering 9, 90 (2007) [6] J.D. Hunter Matplotlib: A 2D Graphics Environment Computing in Science & Engineering 9, 90 (2007) [7] M. Hughes and I. Franks Analysis of passing sequences, shots and goals in soccer Journal of Sports Sciences, 23 (2005) 509 514 [8] D.R. Brillinger A potential function approach to the flow of play in soccer, Journal of Quantitative Analysis in Sports, 3 (2007), DOI: jqas.2007.3.1.1048 [9] A. Joseph, N. E. Fenton, and M. Neil. Predicting football results using Bayesian nets and other machine learning techniques. Knowledge-Based Sys., 19(7):544 553, 2006. [10] Carlos Cotta, Antonio M. Mora, Cecilia Merelo-Molina, and Juan Julián Merelo. FIFA World Cup 2010: A Network Analysis of the Champion Team Play. Complex Systems in Sports Workshop (CS-Sports 2011), August 2011. 6