Matthew Gebbett, B.S. MTH 496: Senior Project Advisor: Dr. Shawn D. Ryan Spring 2018 Predictive Analysis of Success in the English Premier League


Abstract

The purpose of this project is to use mathematical methods from statistical and applied analysis to create models capable of predicting the rankings of the English Premier League teams for the season. The method revolves around a statistical multivariate regression approach. The models are created by applying each approach to carefully selected data that may contribute to on-field performance, producing a score for each of the teams. The teams are then ranked based on these final average scores. The rankings assigned to the teams by each model are compared to the actual rankings of each team to check the accuracy of each model and to allow comparison of the different approaches.

1. Introduction

Being born in England, I have always been interested in soccer, or football as we refer to it. Soccer was always my favorite sport while growing up. I played for multiple teams in my youth and went to matches supporting my local teams as often as possible. My dad and my sister were always arguing about whose team was better. Both were very passionate about the sport, which led to me becoming a big fan. After moving to the USA in 2005, I stopped following the sport almost immediately because the passion my sister had for the sport stayed in England with her, and my dad became too busy with his job. Eventually, I stopped playing the sport altogether until I reached college. Once in Cleveland, I started to get back into the sport, watching the matches every weekend and playing casually with friends.

As I became more interested in mathematics, I became more involved in the analysis of the matches. A fundamental curiosity for most fans is wanting to know which team really has the advantage in each match and which aspects of the game are really the most important. Could these things be predicted from the strengths and weaknesses of each team in a scientific way? The central question addressed here is whether the end-of-season results can be predicted from midseason or prior-season data.

Throughout my college experience, I have encountered the many different pathways that math can lead to. Initially, I was drawn to a more applied pathway involving statistics or analysis. The background I obtained from studying statistical analysis made me want to know how it could be used rigorously when applied to soccer. This led to the central goal of developing a model to determine the end-of-season rankings. To reach this goal, I will have to find the optimal combination of statistical and analytical approaches.
To focus the study, we restrict ourselves to data collected from the English Premier League (EPL), which only came into existence in 1992. Originally, the English soccer league was referred to as the English Football League (EFL), which began in 1888 and technically still exists today ([2] Shaw, 2013). The EFL consisted of the top four tiers of English soccer. A team in the third tier can move up to the second tier with a good season or down to the fourth tier with a poor season. These processes are known as promotion and relegation, respectively. Until 1991, the top division in the EFL was referred to as the First Division, but in 1991 there was a large investment into television programming for soccer matches in the top tier, resulting in the need for a new name and branding ([3] History of the English Premier League, 2018). This led to the rebranding of the top tier as the English Premier League, while the EFL still exists as the second through fourth tiers of English soccer. When the Premier League first began, there were twenty-two teams in the league, with the bottom three being relegated and the champions of the league being granted access to the European Cup. Currently there are twenty teams in the Premier League, with the bottom three being relegated and the top four being granted access to the Champions League. There are no regions or conferences in English soccer, so every team plays every other team twice, home and away. This leads to a completely balanced schedule and should make the ranking process more objective. I will be using data from two seasons: one full season and part of the current season ([4] Premier League Team Statistics, stics/england-premier-league , 2018).

The teams play a total of thirty-eight matches each season in the EPL, which runs from the middle of August until the end of May. A win grants a team three points, a draw grants one point, and a loss grants zero points. The team with the most points at the end of the season wins the Premier League. Since the introduction of the EPL in 1992, only six teams have managed to win: Manchester United, Chelsea, Arsenal, Blackburn Rovers, Leicester City, and Manchester City. What makes these teams so successful and so consistently ranked near the top?

Many teams have passed through the Premier League because of the relegation and promotion aspect of English football. A team that places in the bottom three of the Premier League is relegated to the second tier. To keep the tiers consistent in size, the best three teams in the second tier are moved up to the Premier League. This system is consistent throughout the tiers of English football, all the way to the tenth tier.

The top four teams of the Premier League qualify for the Champions League, a large tournament between the top teams from the leagues across Europe. The team that wins the Champions League is considered the best team in Europe and, by extension, the world that year. Teams can also qualify for the Europa League, which is the next best tournament in Europe. The teams that qualify from the EPL are those in fifth and sixth place. The team that wins the Europa League automatically qualifies for the Champions League regardless of its national league position. The important positions in the league are therefore first through fourth, which qualify for the Champions League; fifth and sixth, which qualify for the Europa League; and the bottom three, which are relegated to a lower division.
Finishing higher in the league does grant some bonus money to the club, but that is negligible compared to qualifying for the European competitions or being relegated to a lower division. The team that wins the Europa League also qualifies for the Champions League, but that cannot be determined by looking at results in the Premier League alone. While the primary goal of this project is to see if mathematical modeling can be used to predict all the positions in the league, a secondary focus will be on these more important league positions. The difference between moving into the Champions League and being relegated can be on the order of hundreds of millions of dollars in franchise value and sale revenue. Since every team plays every other team both at home and away, there are no distinct scheduling advantages across the season, so the playing field is level. Unlike many other sports, there is also no end-of-season knockout round to determine the champion of England, which allows for an easier model.

My project data consists of data from one full season and two-thirds of another season. The full season is one in which all thirty-eight matches were played, while the other is the current season, which has not finished yet. Between the two seasons, twenty-three teams have played in the Premier League, because three teams were relegated after the full season and three teams were promoted from the lower league. The teams I will primarily be looking at are the teams that qualified for the Champions League (positions 1-4), teams that qualified for the Europa League (positions 5-6), teams that were relegated (positions 18-20), teams that came very close to relegation but managed to stay in the EPL (positions 15-17), and the teams that were recently promoted from the lower league (positions 1-3 in tier 2) ([5] European Qualification for UEFA Competitions Explained, 2018).

[Image from: Premier League]

Based on my knowledge of soccer through many years of observation and, more importantly, on results from the research data, I will restrict the key factors determining ranking to several variables that appear to be key to determining the outcome of individual matches and, in turn, the league. For my project, there are seven variables used throughout: goal difference (GD), penalties conceded, discipline, shot accuracy, pass accuracy, clean sheets, and possession.

GD is defined as the number of goals scored by a team minus the number of goals conceded (e.g., 100 goals scored and 40 goals given up lead to a GD of +60). Penalties conceded is the number of times a team commits a foul in its own penalty box, resulting in a penalty kick. Discipline may be harder to quantify, but in this project it is defined as the total number of red cards earned by a team during the season; if teams had the same number of red cards, yellow cards were used as a tie breaker. Shot accuracy is the number of shots on target divided by the total number of shots taken by a team. Possession is the percentage of time that a team is in control of the ball. Clean sheets is the number of matches during the season in which a team did not concede a goal. Pass accuracy is the percentage of passes by a team's players that successfully reach a teammate.

The value of each of these variables is different, so something like GD is going to determine a match much better than passing accuracy. However, removing passing accuracy may reduce the accuracy of the model and hide a possible subtle dependence buried beneath the surface.
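As a minimal sketch of how three of the variables defined above could be computed from raw season totals, consider the following Python fragment. The class `SeasonTotals`, its field names, and the three example teams are hypothetical and the numbers are made up for illustration; they are not the project's data.

```python
from dataclasses import dataclass


@dataclass
class SeasonTotals:
    """Hypothetical raw season totals for one team (illustrative field names)."""
    goals_scored: int
    goals_conceded: int
    shots_on_target: int
    total_shots: int
    red_cards: int
    yellow_cards: int


def goal_difference(t: SeasonTotals) -> int:
    # GD = goals scored minus goals conceded.
    return t.goals_scored - t.goals_conceded


def shot_accuracy(t: SeasonTotals) -> float:
    # Shots on target divided by total shots.
    return t.shots_on_target / t.total_shots


def discipline_key(t: SeasonTotals) -> tuple:
    # Red cards first; yellow cards only break ties between equal red-card counts.
    return (t.red_cards, t.yellow_cards)


# Made-up totals, for illustration only.
teams = {
    "Team A": SeasonTotals(100, 40, 200, 500, 2, 60),
    "Team B": SeasonTotals(50, 30, 150, 450, 2, 45),
    "Team C": SeasonTotals(30, 60, 120, 400, 0, 70),
}

# Best-disciplined team first.
discipline_order = sorted(teams, key=lambda name: discipline_key(teams[name]))
```

Sorting on the tuple `(red_cards, yellow_cards)` is one simple way to express the tie-breaking rule: Python compares the red-card counts first and only looks at yellow cards when the reds are equal.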

The second half of the project will focus on predicting the final rankings for the season based on data obtained before the transfer window. Before and during each season, teams make transfers that bring new players to the club. The summer transfer window runs from June 1st until August 31st, and the winter transfer window runs from January 1st until January 31st. These windows exist so that players cannot move between clubs at any time during the year. They also add an element of change to the league, because any team can hypothetically buy any player if it has the money and the player's contract allows a purchase. The summer window allows players to transfer while they are only training and do not have any matches. This window also creates excitement for the upcoming season, as each team normally brings in new players that get fans excited. The winter window is much more important to this project, as it occurs at around the halfway point of the season and allows a team to potentially bring in its biggest rival's best player, which can cause a big change in the league ranking prediction. To avoid the confounding effects of the winter transfer window, I have decided to collect the data directly after the winter transfer window for the season, but before any new matches can be played with a team's new players.

2. Statistics in Sports

Ever since sports became popular in the public's eye, people have debated which team is the best, which player scored the best goal, and so on. These questions have changed as teams get better and athletes become stronger and faster, but simply claiming who the best player is has never been enough. People quickly turned to statistics to settle the debate with an objective source. This allowed people to settle their debates on who the best player is by looking at that player's statistics, comparing them to every other player's, and seeing who came out on top.
These comparisons have led teams to collect data and analyze it to try to improve. These analyses are constantly performed today and are essential for many teams and companies to continue their progress toward a championship.

Prior to the formation of the Premier League in 1992, there was not nearly as much interest in soccer in England as there is today. Players were paid much less money, the conditions they played in were worse, the stadiums were smaller and so held fewer spectators, and overall the game was simply not as professional. Once the investment into the Football League in 1991 was enacted, many clubs received large bonuses, because the investment was primarily intended to make matches easier for the public to view. Soccer matches were rarely broadcast on television until the Premier League came into existence. From that point on, every team's matches were broadcast across the country. Because of this large increase in income for the clubs and their players, they were able to improve their skills and training, so that very quickly the players became much stronger and faster. This improvement drew many fans outside of England to the game and led many people to support English teams even though they were not from the area. The players also received a larger salary. This higher pay led many foreign players to migrate to the Premier League, where they could continue to play quality soccer but also make a very good living from it. In the 1960s, players' salaries were around 20 a week. Today the average salary of a player in the Premier League is around 34,000 a week ([6] Harris, 2011).

At the inception of the Premier League in 1992/93, just 11 players named in the starting line-ups for the first round of matches were 'foreign' (players hailing from outside of the United Kingdom or Republic of Ireland). By 2000/01, the share of foreign players participating in the Premier League was 36 percent. In the 2004/05 season the figure had increased to 45 percent. On December 26, 1999, Chelsea became the first Premier League side to field an entirely foreign starting line-up, and on February 14, 2005, Arsenal were the first to name a completely foreign 16-man squad for a match ([3] History of the Premier League, 2018).

[Image from ([6] Harris, 2011)]

The record transfer fee, as well as the average cost of players, skyrocketed to new levels. Currently, the world record transfer fee is 222 million, paid by Paris Saint-Germain for Neymar, while the record transfer fee for a player in the Premier League is 90 million for Paul Pogba. Back in 1991, however, the record transfer fee was 5.5 million. This transfer record was consistently broken every few years as the popularity of watching soccer grew. All this investment into players, with clubs spending exorbitant fees to bring the best in the world to their club, drew the interest of companies wanting to put their brand names wherever anyone could see them. It also drew the interest of betting companies wanting to make as much money as possible from the sport.

Betting has been a part of sports for a very long time. People make a wager against a friend or against a company claiming that a certain team will win, and sometimes the wager is more specific, such as a team winning by three goals to one. The betting company gives odds of the specific event occurring, such as 8-1, which means that for every 1 unit of currency put in, the investor will receive 8 units of currency if the result holds. Before betting became very profitable, the data collected focused on very tangible concepts such as the total number of goals scored per player or the total number of saves made by a goalkeeper. This data is easy to understand and can help determine which player is the best in each position. More recently, the data has become much more complex, focusing on player positions and how players move around the field relative to every other player. This data is expensive to gather, so it is rarely available for public use. Companies such as Opta Sport collect this data and sell it to betting companies, as well as to teams in the Premier League, that want to use the service to improve their odds or strategy, respectively ([7] Predictive Analytics, 2018). These large corporations are constantly changing their odds for matches based on all sorts of data, such as the weather and the lineups of each team. They do this so that their predictions are accurate and reputable enough to attract more people to bet with them rather than with a competitor. These companies employ full-time data analysts who have a lot of experience analyzing data and determining which team is more likely to win each match.
These betting companies also allow people to bet on which team they think will win the league at the end of the year. This is my goal in this project: to determine which team is most likely to win the Premier League in the 2016/2017 season, as well as to predict the exact order of finish based on recorded statistical categories. This would show in advance, based on current statistics, who will move on to the UEFA Champions League and a big payday, and who will be relegated to the lower league by finishing at the bottom of the standings.

3. How the Statistics Will Work

A statistical approach is very commonly used in sports for determining which team is more likely to win certain games, or which team is worth betting on for certain variables. In my project's case, I am trying to find the best method to predict the Premier League ranking at the end of the season. The method has to use the data I collected to create a model that solves for the optimal weight of each variable in the equation, so that it can predict the league with accuracy. The method that makes the most sense for this is an applied multivariate regression model.

A brief description of an applied multivariate regression model is necessary to understand how the model works and what its results mean. A regression model is defined as

$$Y = \beta_0 + \beta_1 X_1,$$

where $Y$ is the response variable, $\beta_0$ and $\beta_1$ are the weights, and $X_1$ is the explanatory variable. The response variable is the actual value that we are trying to predict. An explanatory variable is an independent variable that is used to solve for the response variable. The weights are unknowns that are solved for in this equation. In my case, this model will not work exactly, since I have more variables than the equation can handle. Multiple variables require a multiple regression model, which is defined as

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n.$$

This equation is identical to the one above, except that it allows for multiple explanatory variables instead of just one. Each $\beta_i$ is a separate weight, one for each explanatory variable $X_i$. For my model, I will be using multivariate regression. I will also be using several different sizes of multiple regression, as the variables are selected individually for each model and the values change with each model studied.

The results we are looking for minimize the root mean squared (RMS) error. This error measures the average distance in rank between the predicted result and the actual result, namely

$$E_{RMS} := \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( Y_{Pred,i} - Y_{Actual,i} \right)^2 }.$$

That is, the individual squared errors are summed, the sum is divided by the total number of data points $N$, and finally the square root is taken. The RMS error places more emphasis on outliers than simply using an absolute value for measuring error. A completely random model should, on average, give an error of 10 if there are 20 total data points being used. Since there are 20 teams in the Premier League, $N = 20$ and the summed squared error is divided by 20.
A bad model would produce an error larger than 10, meaning that each predicted data point is very far from the actual data point. A good model would produce an error lower than 10, preferably as close to 0 as possible. A prediction error of 0 would be a perfect result, but this would be very difficult to obtain because there are many factors not accounted for in a model (e.g., luck, weather, political climate). Here is an example of several different error results and their meaning. The three models have error terms of 5.36, 3.57, and a third, far larger value. The 5.36 is a very good result, while 3.57 is an excellent result. The third error term, however, is a result that could not be much worse, so this sort of model would be dismissed.
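The RMS error defined above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the project; the function names and the five-team rankings are assumptions made up for the example. A mean absolute error is included only to illustrate the remark that the RMS error places more emphasis on outliers.

```python
import math


def rms_error(predicted, actual):
    """Root mean squared error between predicted and actual rankings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("rankings must be non-empty and of equal length")
    n = len(predicted)
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / n)


def mean_abs_error(predicted, actual):
    """Mean absolute error, shown only for comparison with the RMS error."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up five-team example (not project data).
actual = [1, 2, 3, 4, 5]
spread_out = [2, 1, 3, 5, 4]    # four teams each off by one place
concentrated = [3, 2, 1, 4, 5]  # two teams each off by two places

# Both predictions have the same mean absolute error,
# but the one with the larger individual misses has the larger RMS error.
rms_spread = rms_error(spread_out, actual)
rms_concentrated = rms_error(concentrated, actual)
```

For the league models, `predicted` and `actual` would each hold the 20 league positions, so the sum of squared differences is divided by 20 before the square root is taken.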

A multivariate regression model is crucial for predicting the best possible result because, by design, a regression model produces the minimum possible error for the chosen variables. The error term that results from running each model is used to determine whether that model is a viable prediction. If the error is high, the model can be dismissed; if the error is low, the model can be considered and possibly adjusted further. In addition, the coefficients or weights $\beta_i$ in multivariate regression indicate the relative importance of each component of the model.

Here is an example of a model that was relatively accurate in its prediction. The x-component of each dot represents what the model predicted for a team, while the y-component is the team's actual position. The closer a point is to the line, the closer that team's predicted value is to its actual value. If a point is above the line, then the model overestimates the placement of the actual result. If a point is below the line, then the model has underestimated it. The more points that lie away from the line, the greater the variance in the data. Here is an example of a model that was less accurate.

There are very few dots on the line; therefore, very few teams were predicted in the correct place. The dots are also not very close to the line in general, with some being close to ten places away. The error for this model would be very high, so it will not be investigated further. Based on both models, I will look at the variables selected and try to determine why these variables produced these weights. Both models were developed to predict the league. In this project, we analyze these models and try to figure out why they produced the results they did by looking at the variables and the weights, and then adjust the next model based on the previous results.

4. The Models

As stated previously, this project uses a multivariate regression analysis to predict the Premier League standings. Each of the models used between 2 and 8 variables, and all produced different, interesting results. The relevant data was transferred to Microsoft Excel, organized for each model, and from there transferred to both R and MATLAB. The statistical program R was used for the multivariate regression because it allows one to run the regression for each model quickly and adjust each model easily. It produced the weights for each model run and allows one to see how important each weight is in R's attempt to reduce the error term. MATLAB was also used to run several multivariate regression analyses. MATLAB gave me more freedom to manipulate the values for each model, but it was a slower process than R. Both were used in the analysis successfully.

Using all the data I collected, I was able to create three different versions of each variable to test: the original variable, the weighted version, and the normalized version. The original version is just the data itself, so if a team scored 50 goals that season and conceded 30, its raw GD would be +20.

The weighted version eliminates the differences in scale across the variables. The spread from the best team to the worst team for GD was +60 to -43, while a variable such as red cards ranged only from 0 to 5. This caused some variables to be dismissed because they were barely affecting the outcome of the model. So, the weighted approach ranks the teams on each variable on a scale from 0 to 1 in increments of .05. If a team had the highest GD, it was given a value of 1; the second highest GD was given .95, the third .9, and so on. If two or more teams tied, they were given the same value, and the next team would drop by more than .05, so if two teams tied at .45, the next best team would be given .35. The weighted version eliminated the bias between variables, but some teams' values were originally significantly higher than others', and those gaps were now compressed arbitrarily. If a team had a GD of 100 and the next best team had 20, these two teams would differ by only .05, which may lead to surprising results in the model.

So, the next approach is a normalized approach, in which the values are still all between 0 and 1 but are obtained by dividing each value by that of the highest team in each category. This creates a distribution with no bias toward a team or a variable. All three methods were run to allow for diversity in the results and to determine which approach would be most accurate moving forward. The initial set of models was run using the original data in R.
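The weighted and normalized versions of a variable described above can be sketched as follows. This is a hedged illustration, not the Excel workflow used in the project: the function names, the `higher_is_better` flag (an assumption for variables such as red cards, where a lower count is better), and the four example GD values are all hypothetical. Note also that plain division by the maximum maps a negative raw value, such as a negative GD, to a negative number, so the sketch does not clamp values to the 0 to 1 range.

```python
def rank_weighted(values, step=0.05, higher_is_better=True):
    """Rank-based weights: best team gets 1.0, then 0.95, 0.90, and so on.

    Tied teams share a value, and the next team skips ahead by one step
    per tied team (two teams tied at .45 are followed by .35).
    """
    ordered = sorted(values, reverse=higher_is_better)
    # ordered.index(v) is the number of teams strictly better than v.
    return [round(1.0 - step * ordered.index(v), 2) for v in values]


def max_normalized(values):
    """Divide every value by the largest value in the category."""
    top = max(values)
    if top == 0:
        raise ValueError("cannot normalize when the largest value is zero")
    return [v / top for v in values]


# Made-up goal differences for four teams (not project data).
gd = [100, 20, 20, -43]

gd_weighted = rank_weighted(gd)     # gap between 100 and 20 shrinks to one step
gd_normalized = max_normalized(gd)  # gap between 100 and 20 is preserved

# Hypothetical red-card counts, where fewer cards is better.
red_weighted = rank_weighted([0, 5, 2], higher_is_better=False)
```

The example reproduces the trade-off discussed in the text: the rank-based weights put every variable on the same scale but compress a GD gap of 80 into a single .05 step, while division by the maximum keeps the relative size of that gap.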
The first model run would always contain every variable, so that one could then reduce it from there based on simulation results. This is the type of output R would give each time any model was run:

    Coefficients:
                          Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)
    Goal_Diff                                                      **
    Possesion_Avg_Pct
    Yellow_Cards
    Red_Cards
    Shot_Accuracy_Pct
    Passing_Pct
    Conceded_Penalties
    Clean_Sheets

The primary output to consider is the Estimate column. These are the $\beta_i$ discussed in Section 3, which are then inserted into the multivariate regression model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n. \qquad (*)$$

The $\beta_i$ terms have now been found, and the $X_i$ terms are the data initially collected. This information was plugged into equation (*) to produce a ranking prediction. The $\beta_0$ term is the intercept, allowing for one extra degree of freedom. The prediction is the $Y$ value in the equation, and it represents the value that the equation predicts the end-of-season ranking to be. This was done in Excel for each of the twenty teams in the Premier League, and the values were then sorted from least to greatest so that the lowest predicted value was in first and the highest predicted value was in twentieth.

In the resulting table, the leftmost column is the team, the second column is the actual end-of-season ranking, and the third column is the predicted ranking for the most basic model. This model would then be investigated further by looking at the RMS error term to determine exactly how accurate it is, but simply looking at the table is helpful in determining which model might be more useful to run in the future. Most of the top 10 in this model is accurate, except for West Bromwich being predicted in 16th when they actually finished 10th. This single prediction is 6 places off, which increases the model's error and reduces the chances of the model being accurate.

This process was done for four models using the same data. These are referred to as the basic models (Model Group 1): the completely basic model accounting for everything (Model 1A), a reduced basic model (Model 1B), a basic model without GD (Model 1C), and a reduced basic model without GD (Model 1D). A reduced model is one in which stepwise elimination is conducted in R. This is done by removing the least valuable variable from the model, re-running it, and then checking whether another variable can be eliminated, until a final model is found. The final model found by R for Model 1B looks like this:

    Coefficients:
                   Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)
    Goal_Diff                                     e-09  ***
    Yellow_Cards

The model was reduced to just GD and yellow cards because the other variables were determined to be insignificant. This model is much smaller than the others, but it still follows the multivariate regression method: the estimates are plugged in as the $\beta_i$ values and the predicted league ranking is then determined. The other two models were Model 1C and Model 1D. GD is the most important variable, since scoring more goals than the opponent is the point of soccer, so it was eliminated in these models to see whether the league could be predicted without it and to determine how important the other variables were.

Once Model Group 1 had been run, I started to run the weighted models (Model Group 2) in a similar fashion. The data was used in R, and the output visually looks the same. The four models run were the weighted basic model (Model 2A), the reduced weighted basic model (Model 2B), the weighted model without GD (Model 2C), and the reduced weighted model without GD (Model 2D). These four variations are the same as the ones run on the original data. The weighted version produced more accurate results than the original version because the data values are more consistent throughout, so no team has a massive advantage on one variable. The $\beta$ terms from R can be seen here:

    Coefficients:
                                        Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)                                                        e-09  ***
    Clean_Sheets_Rank_Weighted
    Goal_Diff_Rank_Weighted                                            e-06  ***
    Possesion_Avg_Pct_Rank_Weighted
    YDiscipline_Rank_Weighted
    Shot_Accuracy_Pct_Rank_Weighted
    Passing_Pct_Rank_Weighted
    Conceded_Penalties_Rank_Weighted

The key difference between Model Group 2 and Model Group 1 is the actual values of the $\beta$ terms. In Model Group 1, the estimated values are all very similar except for shot accuracy. In Model Group 2, the estimated values are slightly more spread out, with GD standing apart from the rest. This is because GD is the most important variable in predicting the league, so its estimated value is much higher.
The next step was to run the normalized versions of the models in R (Model Group 3). The models run here are the same four variations as before, except that they use the normalized version of the data (Model 3A, Model 3B, Model 3C, Model 3D). These results were very similar to Model Group 2, with GD being valued much more than the other variables and the spread being even throughout the rest of them. The results for each model were logged and placed into an Excel file for later comparison.
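The fit-then-rank procedure used for all three model groups can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not R's `lm` and not the project's Excel sheet: it fits the weights by ordinary least squares through the normal equations, inserts them into equation (*), and sorts the predicted values from least to greatest. All function names and the five-team data set are hypothetical and made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; variables may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_least_squares(rows, targets):
    """Return [beta_0, beta_1, ..., beta_n] minimizing the squared error."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(betas, row):
    """Evaluate Y = beta_0 + beta_1 X_1 + ... + beta_n X_n for one team."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], row))


def predicted_ranking(names, rows, betas):
    """Sort predicted values from least to greatest; lowest value is 1st place."""
    scores = {name: predict(betas, row) for name, row in zip(names, rows)}
    ordered = sorted(names, key=lambda name: scores[name])
    return {name: place for place, name in enumerate(ordered, start=1)}


# Made-up data: two normalized variables per team and actual final positions.
names = ["A", "B", "C", "D", "E"]
rows = [(1.0, 0.9), (0.6, 1.0), (0.5, 0.4), (0.1, 0.6), (-0.4, 0.2)]
actual = [1, 2, 3, 4, 5]

betas = fit_least_squares(rows, actual)
ranking = predicted_ranking(names, rows, betas)
```

Dropping a column from `rows` and refitting mirrors one step of the variable elimination described above, although this sketch does not implement R's stepwise selection criterion.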

Once the models in R had been run, I went on to compare them with models in MATLAB, where we use optimization to minimize the error. This was done using only the normalized data because it eliminates the most bias in the data. The process is essentially the same, except that in MATLAB I was able to reduce the model as much as I felt I needed to, guided by visualizing the results of each model. The squared error to be minimized is

$$E_{RMS}^2 = \frac{1}{20} \sum_{i=1}^{20} \left( \alpha_1 A_i + \cdots + \alpha_n N_i - R_i \right)^2,$$

where $A, \ldots, N$ are the variables such as GD, the $\alpha$ values are the weights we are solving for, and $R_i$ is the actual end-of-season ranking of team $i$. Taking the partial derivative of this expression with respect to each weight and setting it equal to zero results in

$$0 = \frac{1}{10} \sum_{i=1}^{20} n_i \left( \alpha_1 A_i + \cdots + \alpha_n N_i - R_i \right),$$

where $n_i$ is the variable belonging to the weight with respect to which the partial derivative was taken. Together these equations form a linear system whose solution is $X = A^{-1} b$, where $A$ here denotes the matrix assembled from these partial derivative equations, $X$ is the vector of weights, and $b$ is the vector with entries such as $\sum_i R_i A_i$, where $R_i$ is the ranking.

Numerous different models were run with this approach using MATLAB. These consisted of a normalized model using all eight variables; seven variables, with yellow cards having been eliminated; six variables, with yellow cards and clean sheets having been eliminated; and then these three variations without an initial $\beta_0$ term included. These variations were carried out by simulating each model, analyzing its results, comparing them to the actual end-of-season ranking, and then determining why the prediction might have been further from or closer to the actual result. Once I had analyzed the results, I was able to determine which variable could be removed. I would then run the model again and repeat the same process. In the end, around 20 models had been considered, all in the attempt to predict the end-of-season ranking for the Premier League.

5. The Results

The results of the models presented in the previous section are all useful in furthering my attempt to predict the outcome of the Premier League.
The most important quantity was the mean squared error term, which determines how accurate each model was compared to the actual end-of-season ranking. After running each of the models, the results were collected in Microsoft Excel so that each model could be compared with the actual end-of-season ranking and, in turn, with the other models. Many of the models produced a result that was very far from the actual end-of-season ranking. Almost all the models without GD produced a mean squared error term above 10 which, as mentioned before, would be worse than complete randomization. This is because GD is key in predicting the Premier League, since goals are the reason teams win matches. There is a direct correlation between goals and league ranking, so GD is necessary for a model to be accurate. This step eliminated several of the potential models. After observing the data, I concluded that the models that did not contain GD were unsubstantial. The models that lacked GD consistently produced a mean squared error term above 9 across all types of models. These models without GD are not included in the observations that follow.

Model Group 1 was relatively accurate, but because there was a bias towards the data with a larger spread, it was not as effective as Model Group 2. Model Group 1 produced a mean squared error between 4.5 and 4.9, which means that each Premier League team in the prediction is on average between 4.5 and 4.9 places from its actual ranking. These results were less accurate than the weighted and normalized approaches. Model Group 2 produced the most accurate result of all. Initially, I expected Model Group 3 to be the most accurate, but the results show that Model Group 2 was, overall, more accurate. Model Group 2 produced errors of 3.1 and 4.02. These are both accurate because they are only 3.1 and 4.02 places off on average in the prediction. Model Group 3 produced the most consistent results for this season. Many more models were run in this group because the approach in MATLAB used only normalized data. The results for Model Group 3 produced an error term between 4.02 and 5.36 across all the models. The MATLAB approach had the largest variation of all, containing both the lowest and the highest of these results, 4.02 and 5.36, while the results from R were between 4.5 and 4.9.
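The fitting and scoring steps behind these error values can be sketched as follows. This is a minimal illustration in Python with NumPy, not the MATLAB or R code used in the project. It assumes the coefficients come from solving the least-squares normal equations, that predicted scores are converted to ranks by sorting, and that the error is the root mean squared difference in places. The function names and the five-team data set are invented for the example.

```python
import numpy as np


def fit_coefficients(X, R):
    """Solve the normal equations (X^T X) alpha = X^T R for alpha."""
    X = np.asarray(X, dtype=float)
    R = np.asarray(R, dtype=float)
    return np.linalg.solve(X.T @ X, X.T @ R)


def scores_to_ranks(scores):
    """Turn model scores into ranks 1..n, with the lowest score ranked 1."""
    scores = np.asarray(scores, dtype=float)
    return scores.argsort().argsort() + 1


def rms_error(predicted, actual):
    """Root mean squared difference between two rankings."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Made-up scaled data for five hypothetical teams (not Premier League data).
gd = [1.0, 0.8, 0.6, 0.4, 0.2]
passing = [0.9, 0.7, 0.8, 0.5, 0.6]
# The column of ones plays the role of the intercept (beta_0) term.
X = np.column_stack([np.ones(5), gd, passing])
R = np.array([1, 2, 3, 4, 5])  # actual final ranks

alpha = fit_coefficients(X, R)
predicted = scores_to_ranks(X @ alpha)
print(alpha)
print(rms_error(predicted, R))
```

Dropping the column of ones gives the variant without an intercept term. For larger or nearly collinear variable sets, `np.linalg.lstsq` would be a more numerically robust choice than forming the normal equations directly.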
The three best models from the non-randomization part are as follows:

Prem_Results_Weighted_End_of_Season_Rank, error = 4.02:
Rank = (GD)-.1228(Poss) (YDIsc)-.6692(SA) (Pass) (CP) (CS) (Beta)

Normalised_Model_Full_Without_Yellow_CS, error = 4.02:
Rank = (GD) (Poss) (SA) (Pass) (CP) (Red) (Beta)

Prem_Results_Reduced_Weighted_End_of_Season_Rank, error = 3.1:
Rank = (GD) (Pass) (Beta)

These three models were the most accurate in predicting the league, with mean squared error values of 4.02, 4.02, and 3.1. The following plot shows the predicted end-of-season result vs. the actual end-of-season ranking for each of the three models:

The top four is consistently accurate throughout each of these models, but the winner, Chelsea, was predicted correctly only once across every model that was run. This is because Tottenham placed higher than Chelsea in almost every category, which raises the question: why did Tottenham not win the Premier League? This occurred because, while Tottenham had a better GD, Chelsea was more efficient with their goals. Winning a match 5-0 earns the same points as winning 1-0, so while Tottenham did score more and concede less than Chelsea, they were less consistent with their goal scoring across the season. The middle of the table looks like a guess in the plots, which is due to the congestion of the league in that area. The top 6 are all much further ahead than any of the other teams due to their clubs' wealth and the skill of their players; there is a large difference in wealth. The middle and bottom of the table, however, are all relatively close in club value ([8] Premier League - Market Value v League Position, 2018). This causes the table to be very congested because any team has good odds of beating every other team in that section of the table. The top teams are almost always predicted to beat the bottom teams, while the bottom and middle teams are very evenly matched. This causes most of these teams to be on around the same number of points at the end of the season.

Only 6 points separate the teams that finished between 8th and 17th place, with many teams level on points and separated only by GD. This means that almost all these teams are difficult to predict because they most likely performed about the same throughout the season. This led to very few models coming close to predicting this section of the table accurately. The final section is the bottom of the table, where the relegation spots are. Of the three best models, two predicted the correct teams to be relegated, while one predicted two of the three. This section of the table was easier for the models to predict because there was a very large gap in performance between these bottom teams and the teams above them: 6 points separated 8th and 17th, while 6 points also separated 17th and 18th. None of the best models predicted the exact order of the relegation zone, but it is more important to predict the correct teams than the exact positions.

6. Randomization
The key to this analysis was to accurately predict the end-of-season ranking for each of the Premier League teams in the season. If the models were able to accurately predict the season, then they should be able to consistently produce a result in future years as well as previous years. Based on this, I created my own version of the Premier League season in which some of the data was randomized to produce a new result. The purpose of this randomization is to determine how accurate the best models from the regular analysis of the Premier League are in a different, hypothetical season. The data used was gathered from the season and then randomized in a specific manner. Which data was used depended on the model being tested.
The normalized version of the data was used for the normalized model; here the values are spread between 0 and 1 by dividing each team's value on each variable by the maximum value of that variable. The weighted data was used in the weighted models; here the data is evenly spread between 0 and 1 in increments of 0.05. This created an even spread of all the data between 0 and 1 and eliminated the potential for biased data. The randomization process was done by taking the data from Excel and randomly assigning each Premier League team a value of either one or two. This value determined whether the team would have its statistics increased or decreased: one for a decrease and two for an increase. This was done using an online randomization calculator ([9] True Random Number Service, 2018). Once the teams had been assigned an increase or a decrease, each was assigned a second random value to determine which variable would be changed. This was done using the same online calculator, this time drawing a value between one and eight since there are eight variables. One of the weighted models had only two variables associated with it, so for that model the variable selection was done by choosing between only 0 and 1, so that each team had at least one variable affected. Now that every team had been assigned a variable and a positive or negative change, the values could be increased or decreased to create the new version of the season. Each randomly selected variable was increased or decreased by the same amount for every team: 0.25. So if a team was ranked at 0.75 for GD, it would now, in this new version of the season, be considered the best team with a ranking of 1. With the new season randomized and each value changed accordingly, the optimal models from the actual season can be tested on this random season to see whether they are consistent.
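The randomization procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python's seeded pseudo-random generator in place of the RANDOM.ORG service used in the project, the team names and values are invented, and the values are not clamped to the range 0 to 1 because the text does not say whether that was done.

```python
import random

STEP = 0.25  # size of the increase or decrease stated in the text


def randomize_season(table, variables, rng):
    """Return a copy of `table` with one variable per team shifted by +/- STEP."""
    new_table = {}
    for team, stats in table.items():
        direction = rng.choice([-1, 1])   # first draw: decrease or increase
        variable = rng.choice(variables)  # second draw: which variable changes
        changed = dict(stats)             # copy so the real season is untouched
        changed[variable] = stats[variable] + direction * STEP
        new_table[team] = changed
    return new_table


# Hypothetical scaled data for two made-up teams and two variables.
table = {
    "Team A": {"GD": 0.75, "Pass": 0.50},
    "Team B": {"GD": 0.25, "Pass": 1.00},
}
rng = random.Random(2018)  # seeded so the sketch is repeatable
new_season = randomize_season(table, ["GD", "Pass"], rng)
for team, stats in new_season.items():
    print(team, stats)
```

Passing the full list of eight variable names instead of two reproduces the eight-variable case. The fitted models would then be applied to `new_season` in the same way as to the real data.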
The three models being tested previously provided a mean squared error term of 1.6 for the weighted end of season ranking, 1.6 for the normalized model without yellow cards and clean sheets, and 1.4 for the reduced weighted model. The models produced this result:

Here, Model 1 is the 6-variable normalized result, Model 2 is the full weighted result, and Model 3 is the reduced weighted result. This result is interesting because these three models were the best models from the actual season, and yet they produce a wide range of error values. The full weighted model is consistent with its result for the actual season. The normalized model produces an error term that is twice as large as its result for the actual season. This would still be a useful error term for predicting the actual season; however, its being twice as large calls the model's consistency into question. The most erratic model is the reduced weighted model, because its error term for the randomized season is 10 times larger than its error term for the actual season. This occurred because the randomization was spread across only two variables, one of which is GD. This means that about half of the teams had their GD affected by the randomization process, with the increase or decrease being a quarter of their total goals scored. This led to the prediction being very inaccurate for any of the teams that had their GD affected. The teams that gained GD almost certainly moved up several places in the end-of-season table. Any team that did not have its GD affected stayed very close to its actual result because the other teams were shifting both up and down around it. The teams that had their GD negatively affected dropped several places in the end-of-season ranking. The other two models remain accurate to the actual end-of-season result despite having several of their variables randomized. This indicates that these models would be useful in another season of competition to determine the final outcome of the Premier League. These models are also less sensitive to a significant change in any given category.

7. Further Analysis & Conclusion
The next step is to test these three best models on an actual season of the Premier League as opposed to a hypothetical one. The test uses slightly more than half of the season to see whether the results will be accurate. The data collected for this season was the same 8 variables as beforehand. Two versions of the data were collected, the weighted and the normalized versions, so that all three of the best models could be run. The data collected covered 26 matches, which is more than half of 38. The reasoning for this decision is that the transfer window for the Premier League ends on week 26. This means every single team has finished conducting its business of buying and selling players to and from different clubs. This eliminates a player being transferred away from a club as a confounding variable that could potentially be the reason the club moved up or down in the ranking. The season, at the time of writing, is still ongoing. This means that comparing the results of the predictive analysis to the actual end-of-season rankings would be inconclusive because not every team has finished playing all its matches. The prediction is shown below, but it cannot be compared to the end-of-season ranking because the season has not finished yet. The three best models from the season were used, and the image shows which model is associated with which prediction. The only team that currently has its league position decided is Manchester City, in 1st. Every model predicted this at the mid-point of the season, so at least one league position is correctly predicted.

To conduct this experiment more effectively in the future, one change that should be made is to collect the data in a different way. I would not use GD as a variable; I believe that using goals scored and goals conceded as two separate variables would provide a more effective model. I would also test the models on more past seasons. This would allow the most consistent model to be identified by observing the average mean squared error term across all seasons tested.

In conclusion, this project was conducted to try to predict the outcome of the Premier League season. This was done by collecting important data from the season and analyzing it using a multivariate regression model. Many models were run, checked, and tested to determine which was the most accurate in its prediction. The root mean squared error was analyzed to give the best three models. These models were tested against the actual end-of-season rankings and had accurate predictions, with mean squared errors of 3.1, 4.02, and 4.02. These models were then tested on a randomized league to see whether they were consistent. The results showed that a model in which the data was normalized and which contained Goal Difference, Red Cards, Passing Accuracy, Possession, Shot Accuracy, and Penalties Conceded was the most consistent at predicting the Premier League.

8. References
1. Alexopoulos, Evangelos. Introduction to Multivariate Regression Analysis. HIPPOKRATIA, 2010, pp
2. Shaw, Phil. ESTABLISHING THE TEMPLATE To Football League 125, 2013.
3. History of the English Premier League. SuperSport - Football, MultiChoice, 2018.
4. Premier League Statistics Retrieved from: stics/england-premier-league
5. European Qualification for UEFA Competitions Explained. Premier League Football News, Fixtures, Scores & Results, 2018.
6. Harris, Nick. From £20 to £33,868 per Week: a Quick History of English Football's Top-Flight Wages. Sporting Intelligence, 20 Jan. 2011.
7. Predictive Analytics. Opta Sports, 2018.
8. Premier League - Market Value v League Position. Transfermarkt, 2018.
9. True Random Number Service. RANDOM.ORG - Integer Generator, 2018.
