This paper offers a slightly different view of home team advantage: instead of points, home team advantage is based on the number of goals scored and on goal differences. The advantage of using goals can be demonstrated on the results of a team that played the same opponent at home and away. Let us assume that the result at home was a 3:0 win and the result away was a 2:1 win. Obviously, the better result was recorded at home; however, based on points obtained, it is not possible to distinguish between these results, as the team is awarded 3 points in both cases. The method described in the following part makes it possible to distinguish between these results, to measure the home team advantage for individual teams, and to observe its changes over time.

2 Data and Methods

English Premier League results from the 1992/1993 season to the 2015/2016 season were obtained from England Football Results and Betting Odds (2017). Data for the first English Premier League season (1992/1993) were obtained from the official website Premier League Football News, Fixtures, Scores & Results (2017). This website was also used for a basic check of all data, e.g. the total number of goals scored by a team in a whole season.

The Premier League consisted of 22 teams in the first three seasons and of 20 teams in the remaining seasons. A balanced schedule was used in all seasons, i.e. each team played every other team exactly twice, once as the home team and once as the visiting team. This means that for each team there are 19 opponents (21 in the first three seasons) with two results in a season. These two results are combined and used to measure home team advantage, which is evaluated according to Definitions 1, 2 or 3. Naturally, each season is analysed separately to eliminate changes in the set of teams that form the league and changes in rosters, which are usually bigger between seasons than within a season.

Definition 1.
Active measure of home team advantage is a random variable $A$ that can take the values $-1$, $0$ and $1$. $A = -1$ for team $T_1$ if the two matches between teams $T_1$ and $T_2$ in a season ended with results where team $T_1$ scored more goals on the field of team $T_2$ than on its own field. $A = 0$ for team $T_1$ if this team scored exactly the same number of goals at home and away, and $A = 1$ for team $T_1$ if this team scored more goals on its own field than on the field of team $T_2$. With the result $h_{T_1} : a_{T_2}$ on the home field of team $T_1$ and $h_{T_2} : a_{T_1}$ on the home field of team $T_2$, the value of the random variable $A$ is determined as

$$A = \operatorname{sgn}(h_{T_1} - a_{T_1}). \tag{1}$$

Definition 2. Passive measure of home team advantage is a random variable $P$ that can take the values $-1$, $0$ and $1$. $P = -1$ for team $T_1$ if the two matches between teams $T_1$ and $T_2$ in a season ended with results where team $T_1$ conceded more goals on its home field than on the field of team $T_2$. $P = 0$ for team $T_1$ if this team conceded exactly the same number of goals at home and away, and $P = 1$ for team $T_1$ if this team conceded more goals on the field of team $T_2$ than on its own field. With the result $h_{T_1} : a_{T_2}$ on the home field of team $T_1$ and $h_{T_2} : a_{T_1}$ on the home field of team $T_2$, the value of the random variable $P$ is determined as

$$P = \operatorname{sgn}(h_{T_2} - a_{T_2}). \tag{2}$$

Definition 3. Combined measure of home team advantage is a random variable $C$ that can take the values $-1$, $0$ and $1$. $C = -1$ for team $T_1$ if the two matches between teams $T_1$ and $T_2$ in a season ended with a better result, measured by goal difference, for team $T_1$ on the away field. $C = 0$ for team $T_1$ if the goal difference in both matches was exactly the same from $T_1$'s point of view, and $C = 1$ for team $T_1$ if this team recorded a better result, measured by goal difference, on its own field. With the result $h_{T_1} : a_{T_2}$ on the home field of team $T_1$ and $h_{T_2} : a_{T_1}$ on the home field of team $T_2$, the value of the random variable $C$ is determined as

$$C = \operatorname{sgn}\big((h_{T_1} - a_{T_2}) - (a_{T_1} - h_{T_2})\big). \tag{3}$$

All three measures are defined so that the value $1$ means that the result was better on the home field, $0$ means that there was no difference, and $-1$ means that the better result was recorded on the away field. Obviously, the active measure for team $T_1$ is the passive measure for team $T_2$. Combining the two results between the same pair of teams, as in Definitions 1, 2 and 3, largely eliminates the effect of teams in the league being of different quality. All three random variables take the same values with the same interpretation; therefore, the following parts use the combined measure $C$, which can simply be replaced by $A$ or $P$ to obtain results for the other two measures.

The English Premier League used a balanced schedule in all seasons, with exactly two matches between each pair of teams. Let $L$ denote the number of teams in the league (for our data $L = 22$ or $L = 20$); then each team has $K = L - 1$ opponents in a season. The random sample $C_1, C_2, \dots, C_K$ is obtained as one season's results of a given team against its opponents. The $C_i$ are considered to be identically distributed because there are no big changes in a team during one season. Therefore, the probabilities $p_{-1}$, $p_0$ and $p_1$ of the possible outcomes $-1$, $0$ and $1$ are considered constant within a season. This means that the home team advantage of a team is stationary during a season. The second assumption is that the $C_i$ are independent, i.e. matches with one opponent do not influence matches with other opponents.

Remark 1. The assumption that $C_i$, $i = 1, 2, \dots, K$, are i.i.d.
may not be true in reality. However, the violation of this assumption can be expected to be weak, and the assumption is used in the same sense in the majority of studies that deal with sports. Without this simplification it would be impossible to apply statistics to sports, as every single match could be played under slightly different conditions (for example, in different weather). Moreover, the methods described below are robust, and this simplification should not cause problems with the interpretation of the findings.

Let $Z_r$, $r = -1, 0, 1$, be the random variable that counts the cases in a season in which a home team advantage ($r = 1$), an away team advantage ($r = -1$) or no advantage ($r = 0$) is observed. Obviously, for the $K$ opponents in a season, $Z_{-1} + Z_0 = K - Z_1$. The vector $(Z_{-1}, Z_0, Z_1)$ follows a trinomial distribution with parameters $K$ and $p_{-1}$, $p_0$, $p_1$. Under this notation, the probability mass function is given by

$$P(k_{-1}, k_0, k_1) = \frac{K!}{k_{-1}!\, k_0!\, k_1!}\; p_{-1}^{k_{-1}}\, p_0^{k_0}\, p_1^{k_1}, \tag{4}$$

where $K$ is the total number of opponents of one team in a season, $p_1$, $p_{-1}$ and $p_0$ are the probabilities of a home team advantage ($r = 1$), an away team advantage ($r = -1$) and no advantage ($r = 0$), and $k_{-1}, k_0, k_1$, with $k_{-1} + k_0 + k_1 = K$, are the observed counts of the corresponding outcomes.

Bayesian inference is used to estimate the unknown parameters and, consequently, their confidence intervals. The prior distribution of the parameters $p_{-1}$, $p_0$ and $p_1$ is set to be uniform, i.e. it does not matter where a team plays a match, and the probability in Equation 4 is used as the conditional probability of the observation given the parameters, i.e. $P(k_{-1}, k_0, k_1 \mid p_{-1}, p_0, p_1)$. This leads to the posterior probability density of the parameters $p_{-1}$, $p_0$, $p_1$ given by

$$P(p_{-1}, p_0, p_1 \mid k_{-1}, k_0, k_1) = \frac{\Gamma(K + 3)}{\Gamma(k_{-1} + 1)\,\Gamma(k_0 + 1)\,\Gamma(k_1 + 1)}\; p_{-1}^{k_{-1}}\, p_0^{k_0}\, p_1^{k_1}, \qquad p_{-1}, p_0, p_1 \ge 0, \quad \sum_{r=-1}^{1} p_r = 1, \tag{5}$$

where $K$ is the total number of opponents of one team in a season and $k_{-1}, k_0, k_1$, with $k_{-1} + k_0 + k_1 = K$, are the observed counts of the given advantage. Equation 5 is the probability density function of the Dirichlet distribution $\mathrm{Dir}(\alpha_1 = k_{-1} + 1, \alpha_2 = k_0 + 1, \alpha_3 = k_1 + 1)$. The Bayesian estimator of the probabilities in Equation 4 under the squared-error loss function is the mean of this Dirichlet distribution, i.e.

$$\hat{p}_r = \frac{k_r + 1}{K + 3}, \qquad r = -1, 0, 1. \tag{6}$$

If $(p_{-1}, p_0, p_1)$ follows the Dirichlet distribution $\mathrm{Dir}(\alpha_1 = k_{-1} + 1, \alpha_2 = k_0 + 1, \alpha_3 = k_1 + 1)$, $k_{-1} + k_0 + k_1 = K$, then the marginal distribution of $p_r$, $r = -1, 0, 1$, is $\mathrm{Beta}(\alpha = k_r + 1, \beta = K - k_r + 2)$ (see Pitman 1993, p. 473). This can be used to find individual $(1 - \alpha_l - \alpha_u)$-confidence intervals $(\hat{p}_{r,l}, \hat{p}_{r,u})$ for each $p_r$, which are given by

$$\hat{p}_{r,l} = \mathrm{Beta}^{-1}(\alpha_l,\; k_r + 1,\; K - k_r + 2) \tag{7}$$

and

$$\hat{p}_{r,u} = \mathrm{Beta}^{-1}(1 - \alpha_u,\; k_r + 1,\; K - k_r + 2), \tag{8}$$

where $\mathrm{Beta}^{-1}(q, \alpha, \beta)$ denotes the $q$-quantile of the $\mathrm{Beta}(\alpha, \beta)$ distribution.

Remark 2. These individual confidence intervals can be used to build a simultaneous confidence interval for all three parameters. By the Bonferroni inequality, together they form a $(1 - 3(\alpha_l + \alpha_u))$-simultaneous confidence interval.

For hypothesis testing it is necessary to obtain $P(p_1 > p_{-1})$ from Equation 5. Using the results of Omar & Joarder (2012, p. 932) and the observed values of $k_{-1}$ and $k_1$, this probability is estimated as

$$P(p_1 > p_{-1}) = 1 - I_{1/2}(k_1 + 1,\; k_{-1} + 1), \tag{9}$$

where $I_{1/2}(k_1 + 1, k_{-1} + 1)$ is the regularized incomplete beta function, i.e. the cumulative distribution function of the Beta distribution evaluated at $1/2$.

Remark 3. $P(p_1 > p_{-1})$ in this paper is an estimate based on the observed values of $k_{-1}$ and $k_1$. However, for better readability, the word estimate is omitted in the following text. $P(p_1 > p_{-1})$ is the probability of occurrence of home team advantage, i.e. it can be used as a measure of home team advantage (the higher the value of $P(p_1 > p_{-1})$, the higher the home team advantage).
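The computation described by Definition 3 and Equations 6 to 9 can be illustrated with a minimal sketch. It is not the authors' code: the function names and the example counts are hypothetical, only the Python standard library is used, and the Beta distribution function is evaluated through the binomial-sum identity that holds for integer parameters, which is the case here. The upper interval bound is taken as the $(1 - \alpha_u)$-quantile so that the interval has coverage $1 - \alpha_l - \alpha_u$.

```python
from math import comb


def sgn(x):
    """Sign function returning -1, 0 or 1."""
    return (x > 0) - (x < 0)


def combined_measure(h_t1, a_t2, h_t2, a_t1):
    """Combined measure C (Definition 3) for team T1.

    h_t1:a_t2 is the result on T1's field, h_t2:a_t1 the result on T2's field.
    """
    return sgn((h_t1 - a_t2) - (a_t1 - h_t2))


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b) for positive integers a, b.

    Uses the identity I_x(a, b) = P(Bin(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile(q, a, b, tol=1e-12):
    """q-quantile of Beta(a, b), found by bisection on beta_cdf."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def analyse(counts, alpha_l=0.025, alpha_u=0.025):
    """counts maps r in {-1, 0, 1} to the observed count k_r."""
    K = sum(counts.values())
    # Equation 6: posterior mean of the Dirichlet distribution.
    estimates = {r: (k + 1) / (K + 3) for r, k in counts.items()}
    # Equations 7 and 8: quantiles of the marginal Beta(k_r + 1, K - k_r + 2).
    intervals = {
        r: (
            beta_quantile(alpha_l, k + 1, K - k + 2),
            beta_quantile(1 - alpha_u, k + 1, K - k + 2),
        )
        for r, k in counts.items()
    }
    # Equation 9: P(p_1 > p_-1) = 1 - I_{1/2}(k_1 + 1, k_-1 + 1).
    p_home = 1 - beta_cdf(0.5, counts[1] + 1, counts[-1] + 1)
    return estimates, intervals, p_home


# Example from the introduction: 3:0 at home, 2:1 win away (1:2 on T2's field).
c_example = combined_measure(3, 0, 1, 2)

# Hypothetical season with K = 19 opponents (illustrative counts, not real data).
estimates, intervals, p_home = analyse({-1: 4, 0: 5, 1: 10})
```

For the introductory example the combined measure equals 1, because the goal difference at home (+3) is larger than the goal difference away (+1). With the illustrative counts, `estimates[1]` is 11/22 and `p_home` is roughly 0.94.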
The hypothesis that the home team advantage is real can be accepted if $P(p_1 > p_{-1}) \ge 1 - \alpha$.

3 Results

As mentioned before, we analysed the English Premier League from the 1992/1993 season to the 2015/2016 season. In total, 9,366 matches were played in these seasons, and, thanks to promotion and relegation, 47 teams played at least one season in the English Premier League. Out of these teams, only

Table 2: Results for the 2015/2016 season. Columns: Team, $C_i = -1$, $C_i = 0$, $C_i = 1$, Sum, $P(p_1 > p_{-1})$. Teams listed: Arsenal, Aston Villa, Bournemouth, Crystal Palace, Everton, Chelsea, Leicester, Liverpool, Man City, Man United, Newcastle, Norwich, Southampton, Stoke, Sunderland, Swansea, Tottenham, Watford, West Brom, West Ham.

Figure 1: Evolution of $P(p_1 > p_{-1})$ for Liverpool.

Figure 2: Evolution of the Bayesian estimate and symmetric 95% confidence interval for $p_1$ for Liverpool.

Figure 3: Evolution of $P(p_1 > p_{-1})$ for Arsenal.

Table 3: Evolution of $P(p_1 > p_{-1})$ for all teams in the seasons 2012/13 to 2015/16. Columns: Team, then one column per season (12/13, 13/14, 14/15, 15/16). Teams listed: Arsenal, Aston Villa, Bournemouth, Burnley, Cardiff, Chelsea, Crystal Palace, Everton, Fulham, Hull, Leicester, Liverpool, Man City, Man United, Newcastle, Norwich, QPR, Reading, Southampton, Stoke, Sunderland, Swansea, Tottenham, Watford, West Brom, West Ham, Wigan.

Table 4: Five lowest obtained values of $P(p_1 > p_{-1})$. Columns: Team, Season, $P(p_1 > p_{-1})$, $C_i = -1$, $C_i = 0$, $C_i = 1$. Rows: Hull (2008/), Norwich (1993/), Blackburn (2003/), Wolves (2011/), Crystal Palace (1997/).

Dixon, M. J. & Coles, S. G. (1997), Modelling Association Football Scores and Inefficiencies in the Football Betting Market, Journal of the Royal Statistical Society. Series C (Applied Statistics) 46(2).
England Football Results and Betting Odds (2017), Premiership Results & Betting Odds.
Karlis, D. & Ntzoufras, I. (2003), Analysis of sports data by using bivariate Poisson models, The Statistician 52(3).
Maher, M. J. (1982), Modelling association football scores, Statistica Neerlandica 36(3).
Marek, P., Šedivá, B. & Ťoupal, T. (2014), Modeling and prediction of ice hockey match results, Journal of Quantitative Analysis in Sports 10(3).
Omar, M. H. & Joarder, A. H. (2012), Some Mathematical Characteristics of the Beta Density Function of Two Variables, Bulletin of the Malaysian Mathematical Sciences Society 35(4).
Pitman, J. (1993), Probability, 1st edn, Springer.
Pollard, R. & Pollard, G. (2005), Long-term trends in home advantage in professional team sports in North America and England ( ), Journal of Sports Sciences 23(4).
Premier League Football News, Fixtures, Scores & Results (2017), Premier League Football Scores, Results & Season Archives.

