Evolving strategies for prediction of sporting fixtures


Mark Rowan
School of Computer Science, The University of Birmingham, Birmingham, B15 2TT, UK
April 30, 2007
Supervisor: Dr. John Bullinaria

Abstract

Gambling is a risky business, particularly in the field of sports betting, where the inherently random nature of the game can present great difficulties to systems attempting to predict future event outcomes. In this mini-project, a genetic algorithm representation is presented with the intention of recovering underlying structure in football data, including potential temporal changes. A number of enhancements to the algorithm are then proposed and implemented, with varying degrees of success, and suitable values for a range of parameters are identified. The preliminary results show that it is possible to recover at least sufficient underlying structure from the data to break even across the length of a footballing season, and indeed that in certain circumstances, a healthy profit may be made. The report ends with a comprehensive list of proposals for extension of the system and implementation of alternative technologies.

Keywords: Football, betting, prediction, gambling, sports, genetic algorithm, evolutionary computation.

Contents

1 Introduction
    Problem Definition
    Relevant literature
2 Design
    Genetic algorithm
    Representation
    Fitness evaluation
    Crossover
    Mutation
    Data sources
    Program structure
    Gambler
    Predictor
    Population
    Individual
    ScoresDatabase
    FixtureResult
3 Experiments performed
    Preparing the testing data set
    Naive betting
    Naive betting vs basic GA
    Number of epochs
    Population size
    Fitness scaling
    Self-adaptation
    Self-adapting home/away win ratio
    Self-adapting bitstring and home/away win ratio mutation rate
    Self-adaptation (complex schema)
    Bounded self-adapting mutation strengths
    Fast EP
    Improved Fast EP
    Points given for a correct prediction
    Experts
    Multiple oracles (ensemble machine)
    Multiple populations
    Human selection of bets
    Performance on other data sets
    English Premier League
    German Bundesliga
    German Bundesliga
    Evolved values in representation
4 Evaluation
    Conclusions
    Further work
    Improvements to the genetic algorithm
    General improvements to the project
A Dividing data into sets by time
B Human-selected bets for testing set
C Literature search strategy
    C.1 Forms of literature
    C.2 Range of literature
    C.3 Initial search tokens
    C.4 Search strategy
D Project proposal

List of Figures

Program structure
Naive betting
Effect of varying number of epochs
Effect of varying size of population
Different fitness scaling schemes
Self adaptation strengths for the home/away win ratio
Self-adapting win ratio and bitstring mutation
Self-adapting mutation strengths (complex schema)
Bounded self-adapting mutation strengths (complex schema)
Cauchy-based mutation and self-adaptation
Improved Fast-EP mutation
Utilisation of Gaussian vs Cauchy mutation operators (40 epochs)
Utilisation of Gaussian vs Cauchy mutation operators (500 epochs)
Points given by an individual if the home team is predicted to win
Using a number of oracles
Switching number of oracles after x consecutive losses
Adapt number of oracles after x losses across y bets: finding gradient
Adapt number of oracles after x losses across y bets: fine-tuning gradient
Adapt number of oracles after x losses across y bets: extending gradient
Number of populations
Weighting populations' predictions across time
Performance of the system compared to a human expert
Performance of the system on English Premier League testing data
Performance of the system on German Bundesliga testing data
Performance of the system on German Bundesliga testing data
Actual representation values evolved for each team

Chapter 1 Introduction

Gambling is a risky business, and much effort has been spent analysing statistics to try to foretell the outcomes of uncertain events for profit. One of the most popular gambling arenas is that of sports betting. For the purposes of this project, English football, and in particular the Premier League, will be used as the sport for study.

Genetic algorithms, pioneered by Holland [10], have long been known to perform well on many general function optimisation problems. If the problem of comparing pairs of football teams can successfully be represented as a numerical function, then a GA should have at least some measure of success at optimising that function, and so at recovering some understanding of the data describing the football teams.

1.1 Problem Definition

Sports, such as football, are inherently random in nature, which makes forecasting their outcomes particularly difficult. There is massive business in bookmaking and betting, and almost as large a market in tipsters who are paid to advise gamblers, and in systems designed to keep track of large quantities of data and statistics about the game to aid in making predictions.

The aim of this project is to determine the success of attempting to use genetic algorithms to evolve strategies for prediction of Premier League football matches. This will entail designing an appropriate representation for a genetic algorithm, in order to reduce the problem of comparing pairs of teams to a simple multi-dimensional function. It will also entail implementing various extensions or additions to the GA, testing them to show which are beneficial, and selecting ideal values for the various parameters. Finally, this work should also lead to a proposal of future areas and ideas to be researched.

1.2 Relevant literature

Clair and Letscher [3] successfully implement a strategy to predict winners and underdogs in football pools. This is quite a different strategy to what will need to be evolved for this project, however, as the tournament for which they are predicting is a simple knockout, where the winner of a pair of teams goes on to play in the next round. They focus on a feature unique to the pools: computationally finding a balance between predicting success for the favourite team and predicting an upset caused by the underdog. This is done in order to minimise the size of the pool of players with which the money will be shared, rather than just to maximise the number of correct predictions, which is the aim of this project.

Tsakonas et al. [20] present work similar in a number of ways to the aims of this project. They find success in using fuzzy rules, neural networks, and genetic programming, and they highly recommend such soft-computing techniques as an area for future research in the field of computer-aided gambling. They make use of a relatively large number of data features ("difference of infirmity factors; difference of dynamics profile; difference of ranks; host factor; personal score of the teams").

With the exception of the study in neural networks, they focus primarily on the use of these techniques to generate hard rule-sets which then govern prediction, rather than using the techniques themselves to dynamically adapt to the data and predict outcomes more flexibly. It will be beneficial to draw on some of their experiences, and attempt firstly to reduce the extent of the input data features to just those which are necessary to achieve a good rate of correct prediction. In contrast to their approach, this mini-project aims to evolve a representation of the strengths and weaknesses of teams which is flexible and can accommodate new data as it is encountered, rather than generating rigid sets of rules using fuzzy logic or genetic programming, which cannot accommodate flexibility in input data to such a high degree.

Rotshtein et al. [16] attempt to use a simpler set of input data, using the logic that human sports fans frequently make predictions based on simple, common-sense assumptions, such as "IF a team T1 won all previous matches, AND a team T2 lost all previous matches, AND the team T1 won the previous matches between T1 and T2, THEN it should be expected that T1 would win". They show how this can be formalised as a fuzzy logic model, and progress to devise such a model. Their input data is far more sparse than that of Tsakonas et al. [20]. The authors introduce a genetic algorithm and a neural network to optimise the fuzzy logic rule generator, with respectable success, although they concede in their conclusions that the prediction model can be further improved by accounting for additional factors in the fuzzy rules: home/away game, number of injured players, and various psychological effects including factors such as the referee who oversees the game, and the weather conditions prior to kick-off.

Kvam and Sokol [13] build upon the work by Clair and Letscher [3] to accurately rank (and/or rate) teams using only basic input data based on Markov chain models. They summarise that "part of the reason for the comparative success of our model is that the other models... treat the outcome of games as binary events, wins and losses. In contrast, our model estimates the probability of the winning team being better than the losing team based on the location of the game and the margin of victory and is therefore able to more accurately assess the outcome of a close game." The authors focus a large amount of their research on accurately predicting the outcomes of close games, using the differences in probabilities of each team winning to fine-tune their predictions. From this research it appears that a binary win / lose comparison for the genetic algorithm in this project may be deficient for accurate predictions.

Chapter 2 Design

2.1 Genetic algorithm

2.1.1 Representation

The representation chosen for a genetic algorithm can have dramatic effects on the success of the system. The proposed representation for this mini-project will contain a measure of attack and defence strengths (or abilities) for each team, so that the predicted outcome of a match can be determined by comparing the two teams' attack and defence strengths. For a set of teams t, this could be represented as a vector of real values for each team in the format:

$$[\{a, d\}_1, \{a, d\}_2, \ldots, \{a, d\}_t]$$

Teams play very differently when at their home ground compared to when they are visiting another team's ground. There are well-defined reasons for this: Buraimo and Simmons [2] state that "home field advantage is a bundle of attributes including home team psychology, greater familiarity with pitch, passionate home fans and susceptibility of referees to home crowd pressure". Therefore it would seem reasonable to extend the representation to include home and away strengths:

$$[\{a_h, a_a, d_h, d_a\}_1, \{a_h, a_a, d_h, d_a\}_2, \ldots, \{a_h, a_a, d_h, d_a\}_t]$$

Since a team's performance fluctuates over time due to seasonal effects, such as the arrival of new players in the summer and January transfer windows, the representation could be extended further to include each of the 38 gameweeks that a Premiership team plays in. If gameweeks belong to a set g the representation could be written as:

$$[[\{a_h, a_a, d_h, d_a\}_1, \{a_h, a_a, d_h, d_a\}_2, \ldots, \{a_h, a_a, d_h, d_a\}_g]_1,$$
$$[\{a_h, a_a, d_h, d_a\}_1, \{a_h, a_a, d_h, d_a\}_2, \ldots, \{a_h, a_a, d_h, d_a\}_g]_2, \ldots,$$
$$[\{a_h, a_a, d_h, d_a\}_1, \{a_h, a_a, d_h, d_a\}_2, \ldots, \{a_h, a_a, d_h, d_a\}_g]_t]$$

Represented in this way, each individual quickly becomes very large. There are currently forty unique teams to have appeared for at least one season in the Premiership. This equates to 4 parameters × 38 gameweeks × 40 teams = 6080 elements in the vector (see footnote 1).

An alternative method of introducing temporal data into the representation, using the second suggestion above, is outlined in Appendix A. In brief, this entails breaking the input space into temporally related portions, and training experts only on one portion of the input space. Following this schema, the data in each individual could be reduced to a more manageable 4 parameters × 40 teams = 160 elements (see footnote 2). Experiments showing the effects of this schema are reported in Chapter 3.

Footnote 1: If these real values are represented as double-precision variables of eight bytes each, this leads to a size of 48640 bytes per individual, or a somewhat excessive 46MB for a population size of 1000 individuals.

Footnote 2: At 8-byte precision, this equates to 1280 bytes per individual or 1.2MB for a population size of 1000.
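The vector sizes and memory footprints quoted for the two representations follow from simple multiplication. The short sketch below reproduces that arithmetic; the function and constant names are illustrative and do not come from the project code, and the population size of 1000 is the figure used in the footnotes.

```python
# Arithmetic behind the representation sizes; all names here are illustrative.
BYTES_PER_DOUBLE = 8          # double-precision value, as assumed in the footnotes
PARAMS_PER_TEAM = 4           # a_h, a_a, d_h, d_a
TEAMS = 40                    # unique teams to have appeared in the Premiership
GAMEWEEKS = 38                # gameweeks per Premiership season


def vector_length(params: int, teams: int, gameweeks: int = 1) -> int:
    """Number of real values held by one individual."""
    return params * gameweeks * teams


def population_bytes(length: int, population: int) -> int:
    """Raw storage for a whole population of such individuals."""
    return length * BYTES_PER_DOUBLE * population


per_gameweek = vector_length(PARAMS_PER_TEAM, TEAMS, GAMEWEEKS)   # 6080 elements
compact = vector_length(PARAMS_PER_TEAM, TEAMS)                   # 160 elements

for label, length in (("per-gameweek", per_gameweek), ("compact", compact)):
    total = population_bytes(length, 1000)
    print(f"{label}: {length} elements, "
          f"{length * BYTES_PER_DOUBLE} bytes per individual, "
          f"{total / 2**20:.1f} MiB for 1000 individuals")
```

The megabyte figures in the footnotes correspond to binary megabytes (1 MB taken as 2^20 bytes).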

A number of methods exist for representing a vector of real-valued numbers for GAs, including storing them as directly manipulable numbers in the vector, and real-to-discrete binary encoding. Binary encoding, with mutation consisting of randomly flipping a number of bits proportional to the mutation strength, is the more elegant solution. However, it introduces Hamming cliffs to the representation. Mathias and Whitley [15] state that Gray coding is known to eliminate Hamming cliffs that exist in binary function spaces. A Hamming cliff occurs when two consecutive numbers have complementary binary representations. For example, the binary representations for the numbers 7 and 8 are complements of each other (i.e. 0111 and 1000). Gray coding is an alternative encoding of a number using binary characters which allows every number to be only a Hamming distance of 1 away from its immediate neighbours.

Binary encoding also has the disadvantage that it only works up to a certain precision, depending on the number of bits representing each value in the vector. Real-valued encoding would allow arbitrary precision as the algorithm sees fit, with mutation achieved by adding a random value δ to each value of the vector, proportional to the mutation strength.

2.1.2 Fitness evaluation

Good parents are chosen by the genetic algorithm according to their fitness, which is a measure of how well an individual in the population performs on the problem. The fitness is calculated by taking the home and away attack and defence parameters for each team contained in each individual, and using them to predict the outcome for each match in the training set.
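As a brief aside on the encoding discussion in Section 2.1.1, the Hamming cliff between 7 and 8 and its removal by Gray coding can be checked directly. The sketch below uses the standard reflected binary Gray code; the function names are illustrative and are not taken from the project.

```python
def to_gray(n: int) -> int:
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)


def from_gray(g: int) -> int:
    """Inverse of to_gray: XOR together all right-shifts of g."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Plain binary: 7 = 0111 and 8 = 1000 differ in all four bits (a Hamming cliff).
print(hamming(7, 8))
# Gray coded: neighbouring integers always differ in exactly one bit.
print(hamming(to_gray(7), to_gray(8)))
```

Under plain binary a single-bit mutation cannot move a value from 7 to 8, whereas under Gray coding it can, which is the property the text attributes to Mathias and Whitley [15].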
The output of the predictor is a value $x \in \{0, 1, 3\}$, where:

x = 3 in the case that the home team is predicted to win
x = 1 in the case that the match is predicted to be a draw
x = 0 in the case that the away team is predicted to win

The reason for choosing these values for x is that, in the Premier League, these are the points given to teams in the equivalent cases. It seemed sensible to follow this convention for the fitness evaluation.

The predicted outcome x is then compared with the actual outcome y, as recorded in the training set, and the absolute difference is cumulatively summed across the whole training set. For an individual i trained over a training set T, where each $t \in T$ is a pair of teams playing a match, the fitness is therefore calculated as follows:

$$f(i) = \sum_{t \in T} |x_t - y_t|$$

and the problem therefore becomes a non-linear fitness-minimisation problem.

Home and away win ratios

In calculating what result the predictor should output for two given teams, the defence score of the home team is subtracted from the attack score of the away team, and the defence score of the away team is subtracted from the attack score of the home team, giving a value for the overall strength s of each team in comparison to the opposing team, which ranges from -1 to +1. The values are then normalised to $0 \le s \le 1$:

$$s_h = a_h - d_a$$
$$s_a = a_a - d_h$$
$$s = \frac{s + 1}{2}$$

The difference in strengths of the teams is calculated:

$$d = s_h - s_a$$

and this is again normalised so that $0 \le d \le 1$:

$$d = \frac{d + 1}{2}$$

Finally this resulting difference d is compared to the thresholds t to determine the resulting prediction x of the match outcome:

$$x = \begin{cases} 3 & \text{if } d > 1 - t_h \\ 0 & \text{if } d < t_a \\ 1 & \text{otherwise} \end{cases}$$

It is necessary to consider the thresholds at which different results will be chosen for output. These thresholds relate to the real-world proportions of games which are won by the home or away team, or which result in a draw. Buraimo and Simmons [2] state that around 48 percent of games in English football are won by the home team, with the remainder of games split approximately equally between away wins and draws (see footnote 3). For ease of calculation, these thresholds could be set to 50% for home wins, 25% for draws, and 25% for away wins. The prediction x therefore becomes:

$$x = \begin{cases} 3 & \text{if } d > 0.5 \\ 0 & \text{if } d < 0.25 \\ 1 & \text{otherwise} \end{cases}$$

Weighting per season

A team's overall performance tends to remain relatively stable over many seasons. For example, Manchester United are usually found to finish in one of the top three places. However, some teams in the League have had vastly different performances across seasons. A good example is that of Chelsea FC, who would in the past invariably finish the season mid-table, but since the arrival of a wealthy investor at the club, have finished more recent seasons at the top of the table. Other clubs, such as Blackburn Rovers, who have been champions in past seasons, have been consistently found in the lower part of the top half of the table in more recent times. It is generally the case that a team's recent performance is a good indicator of its future performance, whereas its historic performance is not a very good indicator. For example, Wimbledon used to perform well in the Premier League but, for various reasons and effects, are now to be found three divisions below in League 2 under the name MK Dons.

One way to deal with this effect is to take into account all the history of a team's performance in the league across the entire training set, but to bias the fitness evaluation towards a team's recent history. This is achieved by weighting the fitness of each individual according to the time since the match being evaluated was played. If w is the distance between two seasons, taken as the year of the most recent season in the training set minus the year of the season in which the match being evaluated was played, the fitness evaluation becomes:

$$f(i) = \sum_{t \in T} \frac{|x_t - y_t|}{2^{w}}$$

This produces an exponential decay curve of weightings. Training on the most recent season will have a weighting of 1, whilst training on the preceding season will have a weighting of 0.5, and the season before that will be weighted at 0.25, then 0.125, etc.

2.1.3 Crossover

Crossover in the GA is performed between two parent individuals in order to create a new population of individuals which shares qualities of both parents. This is achieved by taking an element at each position in the vector from either one parent or the other, and using it to construct a new vector (table 2.1).

Footnote 3: This is confirmed by league statistics, calculated to be 46.42% home wins (std dev 2.99), 26.86% draws (std dev 3.11), and 26.72% away wins (std dev 1.50) over the 14-season history of the Premier League.
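The prediction rule and the season-weighted fitness of Section 2.1.2 can be sketched together in a few lines. This is a minimal illustration rather than the project's implementation: it assumes that every attack and defence parameter lies in [0, 1] (so that raw strengths fall in [-1, +1]), uses the 50%/25% thresholds from the text, takes the weighting to be 0.5 raised to the season distance w, and invents the `Team` class, the match tuple layout and all function names.

```python
from dataclasses import dataclass


@dataclass
class Team:
    # Hypothetical container; each value is assumed to lie in [0, 1].
    attack_home: float
    attack_away: float
    defence_home: float
    defence_away: float


def predict(home: Team, away: Team, t_home: float = 0.5, t_away: float = 0.25) -> int:
    """Return 3 (home win), 1 (draw) or 0 (away win)."""
    s_h = home.attack_home - away.defence_away      # raw strength in [-1, +1]
    s_a = away.attack_away - home.defence_home
    s_h, s_a = (s_h + 1) / 2, (s_a + 1) / 2         # normalise to [0, 1]
    d = (s_h - s_a + 1) / 2                         # normalised difference
    if d > 1 - t_home:
        return 3
    if d < t_away:
        return 0
    return 1


def fitness(teams: dict, matches: list, latest_season: int) -> float:
    """Season-weighted sum of absolute prediction errors (lower is better).

    Each match is (home name, away name, actual outcome in {3, 1, 0}, season year).
    """
    total = 0.0
    for home, away, actual, season in matches:
        w = latest_season - season                  # 0 for the most recent season
        error = abs(predict(teams[home], teams[away]) - actual)
        total += error * 0.5 ** w                   # weights 1, 0.5, 0.25, ...
    return total


teams = {"Strong": Team(0.9, 0.9, 0.9, 0.9), "Weak": Team(0.1, 0.1, 0.1, 0.1)}
matches = [("Strong", "Weak", 3, 2004), ("Weak", "Strong", 3, 2003)]
print(fitness(teams, matches, latest_season=2004))
```

In this toy example the first match is predicted correctly and contributes nothing, while the upset in the older season contributes an error of 3 at half weight.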

Table 2.1: Crossover between two parents' vectors to create a new child individual

The fittest individual in the population is also separately copied to the new population, unchanged (that is, without mutation or crossover) in order to preserve it for future generations. This technique is known as elitism. Jennison and Sheehan [11] note that elitist strategies enhance selectivity by retaining the fittest individual at each generation and an elitist algorithm would be guaranteed to solve [...] simple problems because mutation and crossover can be relied upon to improve the fittest individual, element by element if necessary, until the optimum is reached. However, by increasing selectivity, elitist strategies increase the risk of effective convergence at an inferior local optimum in problems with multiple local optima, so there is no immediate guarantee that implementing elitism will initially be a successful strategy.

2.1.4 Mutation

Each individual in the population (apart from the elite individual mentioned earlier) is then subject to random mutation. For each element in the vector, a random value δ is added. δ is randomly generated from the normal distribution, with a mean of 0 and standard deviation of 1, for each element, and is multiplied by a scaling factor µ known as the mutation strength. For each element j of the vector x in each individual:

$$x_j' = x_j + N(0, 1) \cdot \mu_x$$

2.2 Data sources

All data for the training and testing sets was obtained in CSV format. The attributes used for training were the names of the home and away teams for each match, and the date on which the match was played, as well as a record of the final score. For the testing set, the attributes recorded were the names of the home and away teams for each match, the date on which the match was played, the name of the winning team, and the odds given to that outcome before the match, according to the BetBrain service (see footnote 4).
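The crossover, elitism and mutation operators of Sections 2.1.3 and 2.1.4 can be sketched as follows. This is an illustration under stated assumptions, not the project's code: each element is taken from either parent with equal probability (the text does not give the probability), parents are drawn uniformly at random purely as a stand-in for the selection scheme described in Section 2.3, lower fitness is treated as better, and all function names are invented here.

```python
import random


def crossover(parent_a: list, parent_b: list, rng: random.Random) -> list:
    """Build a child by taking each position from one parent or the other."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def mutate(vector: list, strength: float, rng: random.Random) -> list:
    """Add N(0, 1) noise, scaled by the mutation strength, to every element."""
    return [x + rng.gauss(0.0, 1.0) * strength for x in vector]


def next_generation(population: list, fitnesses: list, strength: float,
                    rng: random.Random) -> list:
    """One generation step with elitism (fitness is minimised)."""
    elite_index = min(range(len(population)), key=fitnesses.__getitem__)
    children = [list(population[elite_index])]      # elite copied unchanged
    while len(children) < len(population):
        # Placeholder parent choice; the report uses roulette-wheel selection.
        parent_a, parent_b = rng.sample(population, 2)
        children.append(mutate(crossover(parent_a, parent_b, rng), strength, rng))
    return children


rng = random.Random(0)
population = [[0.0] * 4, [1.0] * 4, [2.0] * 4]
new_population = next_generation(population, [5.0, 1.0, 3.0], 0.1, rng)
print(new_population[0])   # the elite individual, carried over untouched
```

Keeping the elite copy outside the mutation step is what guarantees that the best fitness found so far never worsens from one generation to the next.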
2.3 Program structure

2.3.1 Gambler

The Gambler module reads the testing set into memory and then requests betting recommendations from the Predictor for each match in the testing set. It calculates the winnings for testing purposes. A Gambler can take recommendations from any number of differently-trained Predictors, although this will not be implemented for this project. The Gambler expects to find testing data in the tab-separated format in table 2.2.

2.3.2 Predictor

The Predictor initialises a population or multiple populations, and provides an interface for the Gambler to query the predicted outcomes of matches. The Predictor in turn queries each of the Populations it has initialised, to find their predicted results, before reporting these to the Gambler. It also initialises a ScoresDatabase.

Footnote 4: An online service which collates several bookmakers' freely-available odds and provides the user with the highest odds for each outcome, in order to maximise potential profits.

Figure 2.1: Program structure

Table 2.2: Testing dataset format

    %season identifier
    $home team    away team    winning team    odds    dd/mm/yyyy
    $home team    away team    winning team    odds    dd/mm/yyyy
    ...
    %next season identifier
    $home team    away team    winning team    odds    dd/mm/yyyy
    ...

    Field types: season identifier (String), home team (String), away team (String), winning team (String), odds (double), date (String).

Table 2.3: Training dataset format

    %season identifier
    $home team    away team    home score    home result    dd/mm/yyyy
    $home team    away team    home score    home result    dd/mm/yyyy
    ...
    %next season identifier
    $home team    away team    home score    home result    dd/mm/yyyy
    ...

    Field types: season identifier (String), home team (String), away team (String), home score (int), home result (int), date (String).

2.3.3 Population

Each Population initialises a training set by reading items from the ScoresDatabase. It also initialises an array of Individuals and begins training the GA. Training is performed according to the following process:

1. Randomly initialise each Individual's representation vector
2. Mutate each Individual
3. Evaluate each Individual's fitness
4. Select and cross over two Individuals to create a new array of Individuals

The two parent individuals are chosen using roulette-wheel (fitness-proportional) selection, so that two good, but not necessarily the fittest, Individuals are selected. This is in order to allow the GA to maintain population diversity. The Population selects the fittest Individual to be consulted for prediction of games, and returns this Individual's predicted winner when the Predictor requests it on behalf of the Gambler.

2.3.4 Individual

Each Individual contains the vector representation of the four attributes for each team (home and away attack/defence strengths) as well as referencing an array of teams, passed to it by the ScoresDatabase, which it will use to predict the outcome of a match between any two teams passed to it. It also initialises the mutation strength for the Individual, as well as the ratios for home wins, away wins, and draws, used as the thresholds for determining the winner of a match between two teams. Each Individual provides its own method for random mutation of the representation vector according to the mutation strengths contained within.

2.3.5 ScoresDatabase

The ScoresDatabase reads the training data set into memory, creating an array of FixtureResult objects. It also creates an array listing all unique teams, which is passed to the Populations so they can initialise their Individuals correctly. The ScoresDatabase expects to find data in the tab-separated format in table 2.3.

2.3.6 FixtureResult

Each FixtureResult is a simple container, storing the names of two teams which play a match, the date and season at which the match was played, and the final result of the match.
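The roulette-wheel selection used by the Population can be sketched as below. Because fitness in this design is an error to be minimised, the raw values cannot be used directly as wheel slices; the sketch assumes one simple inversion (each individual's slice is its distance from the worst fitness in the generation, plus a tiny constant), which is an assumption made here and not something the report specifies. The function name is likewise illustrative.

```python
import random
from collections import Counter


def roulette_select(fitnesses: list, rng: random.Random, eps: float = 1e-9) -> int:
    """Return the index of one parent, favouring low (better) fitness values."""
    worst = max(fitnesses)
    # Assumed inversion: better individuals get proportionally larger slices.
    weights = [worst - f + eps for f in fitnesses]
    pick = rng.random() * sum(weights)
    running = 0.0
    for index, weight in enumerate(weights):
        running += weight
        if pick <= running:
            return index
    return len(fitnesses) - 1   # guard against floating-point round-off


rng = random.Random(42)
fitnesses = [1.0, 5.0, 10.0]    # individual 0 has the lowest error
counts = Counter(roulette_select(fitnesses, rng) for _ in range(2000))
print(counts)
```

Good individuals are chosen most often, but weaker ones still have a chance of being picked, which is how this scheme helps to maintain population diversity.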

Chapter 3 Experiments performed

3.1 Preparing the testing data set

For the purposes of testing, the GA was trained repeatedly under varying experimental conditions on all the English Premier League data from the 1993/94 to 2004/05 seasons. The system was then tested on unseen data from the 2005/06 season, and approximately the first half of the 2006/07 season up to the Arsenal vs. Wigan game on 11th February 2007.

Odds for the first part of the data from the 2005/06 season were obtained from football-data.co.uk and the BetBrain max odds were selected. It should be noted that BetBrain finds the best odds (i.e. highest payout) across a number of bookmakers, so this will have the effect of boosting slightly any potential returns from the data set compared to using a single bookmaker's odds. Many gamblers use services such as BetBrain to attempt to maximise the payout for any one fixture. Odds for the later fixtures, up to 11th February 2007, were obtained from betbase.info, again using the maximum odds across a number of bookmakers.

Since Wigan and Reading made their Premiership debuts in the 2005/06 and 2006/07 seasons respectively, they do not appear in the training data. The decision was therefore made to remove all games involving these two teams from the relevant testing data, in order to avoid unfair advantages or disadvantages due to the GA not being trained to predict results for fixtures involving either of these two teams.

3.2 Naive betting

Naive betting on just the home or away team to win, or for the outcome to be a draw, will be used as a baseline for evaluation of the GA's performance. All bets placed will be to the value of 10, drawn from a starting balance. There are 559 games in the testing set.

Percentage yield will be used as a measure of performance of the algorithm. This is calculated as yield = profit / turnover, so a profit of 400 on a turnover (spend) of 6000 returns a yield of 6.67%.

Bookmakers reduce the odds (and thus the payouts returned to customers) by a small amount in order to make their profit. This can be seen by the gentle downward slope of the average payout in fig 3.1. The average payout is roughly equivalent to the rewards which would be obtained by randomly betting on matches according to the probabilities that a home win will occur approximately 50% of the time, an away win 25% of the time, and a draw 25% of the time.

If the three odds (home win, away win, draw) for a given game are inverted and then summed, standard probabilities would sum to exactly 1. However, bookmakers' inverse odds sum to a value slightly greater than 1, in order for them to ensure a profit. For the data set being used for testing, the mean of the inverse summed odds indicates that the bookmakers are making approximately a 5.1% profit on average.

As naively betting on the home team to win each time returns a profit for this testing set, this will be used as a benchmark for the performance of the algorithm. The aim is to tune the algorithm such that it can return a consistently higher yield than naive betting. All experiments are performed over five runs, and the mean and standard deviations of the results recorded and plotted.
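The two calculations described in Section 3.2, percentage yield and the sum of inverse odds, are short enough to sketch directly. The sketch assumes decimal odds (total payout per unit staked); the function names and the example odds of 2.0, 3.4 and 3.8 are invented for illustration and are not taken from the testing set.

```python
def percentage_yield(profit: float, turnover: float) -> float:
    """Profit as a percentage of the total amount staked."""
    return 100.0 * profit / turnover


def inverse_odds_sum(home: float, draw: float, away: float) -> float:
    """Sum of implied probabilities for decimal odds; 1.0 means a fair book."""
    return 1.0 / home + 1.0 / draw + 1.0 / away


# The worked example from the text: 400 profit on 6000 staked.
print(f"{percentage_yield(400, 6000):.2f}%")

# Hypothetical odds for a single fixture; a sum above 1 is the bookmaker's margin.
print(f"{inverse_odds_sum(2.0, 3.4, 3.8):.4f}")
```

A fair book, for instance odds of 3.0 on each of the three outcomes, gives a sum of exactly 1, so any excess over 1 reflects the reduction in payouts that the text describes.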

Figure 3.1: Naive betting

3.3 Naive betting vs basic GA

3.3.1 Number of epochs

Hypothesis: In a genetic algorithm, training is performed in a number of generations, or epochs. Generally, the higher the number of epochs, the more accurately trained the model will be. However, training over too large a number of epochs can lead to over-fitting, where the model learns to output correctly for just the data it has seen, and fails to generalise. Extra unnecessary epochs of training greatly increase the time taken to train the model, and if this time can be reduced without impacting on the performance of the model, it will benefit the user in the event that computation time is expensive.

Results: This is the basic GA as defined in the Design section. An arbitrary number of individuals (50) was chosen for the population size according to a best guess as to what would be a reasonable number. The number of epochs of training was then varied and the results plotted (fig 3.2) against the benchmark of naive betting on the home team to win.

Figure 3.2: Effect of varying number of epochs

Conclusion: None of the results are good when compared to naive betting, although they do perform slightly better in the early stage of the testing set where naive betting shows a steep drop in profits before recovering. As expected, a model trained with a low number of epochs does not perform well as the model is poorly trained, but models trained with higher numbers of epochs also perform badly due to the over-fitting they produce. 40 epochs of training seems to be a suitable figure.

17 3.3.2 Population size Hypothesis Haupt [9] states that DeJong [5] found that a small population size improved initial performance while large population size improved long-term performance and a high mutation rate was good for off-line performance while low mutation rate was good for on-line performance. For the purposes of this problem, a high long-term level of performance of the model will be required, therefore a large population is likely to produce the best results due to the extra diversity between individuals it introduces. For later experimentation, it will be useful to know that a high mutation rate would be beneficial, as the algorithm is trained off-line at each epoch. Results The GA was trained over 40 epochs with varying population sizes (fig 3.3). Figure 3.3: Effect of varying size of population Conclusion Clearly, the most successful population size for the model is 100 individuals. Having too small a population (in the case of 20 and 40 individuals) restricts the diversity within the population, and leads to fewer good individuals being discovered on each epoch of training. It is not immediately clear why a population size of 200 results in very poor performance of the model, despite the large increase in fitness function evaluations it allows, although it is possible that this is an effect of the good parent individuals having to be selected from a much larger pool, and therefore becoming less likely to be picked after each epoch than if the pool were smaller. 3.4 Fitness scaling Hypothesis Kreinovich et al. [12] state that empirical studies of genetic algorithms... showed that in the beginning, the... algorithm often leads to the appearance of a few superindividuals who dominate 16

18 the selection process and therefore slow it down. At the end, when the population consists largely of the individuals x, for which J(x) is close to maximum, the competition is practically absent, which again slows down the process. Goldberg [8] documents the procedure of fitness scaling. Instead of taking fitness equal to the value of the objective function F (x) = J(x), we take F (x) = f(j(x)), where f(z) is some monotone function from real numbers into real numbers (called a scaling) (summarised by Kreinovich et al. [12]). Forrest [6] introduces linear or simple fitness scaling, where f(z) = z b, where b is the fitness of the worst individual, computed for each generation. This has the effect of removing a constant amount of fitness from each individual, such that only the remaining differences (which now appear greater) are used for ranking the individuals. Another form of fitness scaling is power scaling, introduced by Gillies [7], where f(z) = z k, and k is a constant > 0. For this experiment, a standard value of k = 3 will be used. Finally, the fitness can be scaled exponentially where f(z) = exp(z), which has a similar but more drastic effect when compared to power scaling. A further fitness scaling schema could combine all three of these concepts, such that linear scaling is applied first to each individual, followed by power scaling and exponential scaling. It is hypothesised that fitness scaling will increase the resulting performance of the model over the given number of runs. Results These three fitness scaling schemas were implemented and tested, in addition to the combined schema, over 40 epochs of training and with 100 individuals in the population (fig 3.4). Figure 3.4: Different fitness scaling schemes Conclusion Fitness scaling appears to have no beneficial effect when compared to runs of the algorithm without using scaling (compare fig 3.3). 
This is unexpected, as genetic algorithm theory holds that fitness scaling generally improves algorithm performance. All three of the individual schemas perform similarly badly; however, the combined fitness scaling schema shows a marked improvement over the individual schemas.

3.5 Self-adaptation

Self-adapting home/away win ratio

Hypothesis

The home/away win ratio is very important in determining the prediction that each individual makes. Although the current values were justified in section 2.1.2, there is no guarantee that these are precisely correct for each training set. It would be beneficial for the genetic algorithm to adapt the values of the ratio for each individual as the algorithm runs, and to include the success of these adaptations in the fitness evaluation process. Self-adaptation in this form is introduced by Schwefel [17] and built upon by Bäck and Schwefel [1].

Self-adaptation in this simple schema will be achieved by adding a random Gaussian-distributed number to the mutation strength η of each individual on each iteration, such that $\eta_i' = \eta_i + N(0, 1)$, where $N(0, 1)$ is a Gaussian-distributed number with mean 0 and standard deviation 1.

Results

Different starting values for the mutation strength were selected for self-adaptation of the home/away win ratio. The experiment was run over 40 epochs of training and with 100 individuals in the population. Fitness scaling was switched off until the final set of runs, during which the combined schema discussed previously was used. The results were recorded and plotted in fig 3.5.

Conclusion

Self-adaptation of the home/away win ratios is clearly very successful, particularly at a higher initial self-adaptation strength of 1.0.
When combined fitness scaling was used, there was a very slight improvement in the mean results, although it should be noted that the standard deviation was greatly reduced, indicating that the results with fitness scaling are more consistent than without.

Figure 3.5: Self adaptation strengths for the home/away win ratio

Self-adapting bitstring and home/away win ratio mutation rate

Hypothesis

In addition to self-adapting the mutation of the home/away win ratio, as this is a real-valued optimisation problem, it is possible to self-adapt the mutation rate of the bitstring itself. In this case, η is again set such that $\eta_i' = \eta_i + N(0, 1)$, where $N(0, 1)$ is a Gaussian-distributed number with mean 0 and standard deviation 1.

Runs of the experiment were recorded with various combinations of bitstring and home/away win ratio self-adaptation (fig 3.6). In the first experiment, the same $N(0, 1)$ is used for both win ratio and bitstring mutation adaptation, and in the final experiment $N(0, 1)$ is generated separately for each.

Results

The self-adapting mutation strength for each individual was initialised at 1.0, and the algorithm was trained over 40 epochs with 100 individuals in the population. Fitness scaling was switched on to use the combined schema. The results were recorded and plotted in fig 3.6.

Figure 3.6: Self-adapting win ratio and bitstring mutation

Conclusion

Although the results are all similar, the best were obtained when allowing separate self-adaptation rates for bitstring mutation and win ratio mutation, but the standard deviation of these results was higher. Forcing the algorithm to use the same self-adaptation rate for both bitstring and win ratio mutation produced the least successful results.

Self-adaptation (complex schema)

Hypothesis

Bäck and Schwefel [1] propose an alternative, more complex self-adaptation schema:

$$\eta_i'(j) = \eta_i(j) \exp\left(\tau' N(0, 1) + \tau N_j(0, 1)\right)$$

The factors $\tau$ and $\tau'$... are rather robust exogenous parameters, which Schwefel [17] suggests to set as follows:

$$\tau \propto \left(\sqrt{2\sqrt{n}}\right)^{-1}, \qquad \tau' \propto \left(\sqrt{2n}\right)^{-1}$$

Where:
$n$ is the number of individuals in the population.
$N_j(0, 1)$ is normally distributed, and generated for each individual $j$.
$N(0, 1)$ is normally distributed, and generated once for the whole population.

Results

With 40 epochs of training and 100 individuals in the population, and combined fitness scaling, the results were recorded and plotted in fig 3.7.

Figure 3.7: Self-adapting mutation strengths (complex schema)

Conclusion

The results of using the simple and complex schemas are almost identical. The complex schema shows some slight mean improvement over the simple schema (compare fig 3.6), so it will be used for future experiments. Again, an initial self-adaptation strength of 1.0 has been shown to be the most successful, so it will be used for future experiments.
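A minimal sketch of the complex schema, under stated assumptions, is given below. It follows the definitions in this section as written: $n$ is taken as the population size, one shared Gaussian draw is used for the whole population, and one further draw is made per individual. The proportionality constants for $\tau$ and $\tau'$ are assumed to be 1, and the function name and list-of-floats representation are hypothetical.

```python
import math
import random


def self_adapt_complex(etas, rng=random):
    """Return updated mutation strengths using the log-normal rule
    eta' = eta * exp(tau' * N(0,1) + tau * N_j(0,1)).

    etas: one mutation strength per individual (assumed representation).
    rng:  the random module or a random.Random instance.
    """
    n = len(etas)
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(n))   # constant of 1 assumed
    tau_prime = 1.0 / math.sqrt(2.0 * n)        # constant of 1 assumed
    shared = rng.gauss(0.0, 1.0)                # drawn once per population
    return [
        eta * math.exp(tau_prime * shared + tau * rng.gauss(0.0, 1.0))
        for eta in etas
    ]


if __name__ == "__main__":
    population_etas = [1.0] * 100  # initial strength of 1.0, as in the text
    updated = self_adapt_complex(population_etas, random.Random(42))
    print(min(updated), max(updated))
```

Because the update is multiplicative, a positive strength always stays positive under this rule, whereas the simple additive schema can push a strength below zero.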

3.5.4 Bounded self-adapting mutation strengths

Hypothesis

It became apparent when observing the mutation strengths which were being evolved that they were often reaching very low or very high, or even negative, values. This was considered not to be beneficial to the mutation of the population, so the self-adaptation strengths were bounded as follows:

$$\frac{1}{n}\, i < x < n\, i$$

where $n$ is a real value and $i$ is the initial mutation strength of the individual. It is hypothesised that this will improve the overall fitness of the population by keeping the mutation self-adaptation rate within reasonable bounds. Therefore, a better performance should be obtained from the model after training.

Results

The model was trained over 40 epochs with 100 individuals in the population, using combined fitness scaling, and the complex mutation self-adaptation schema with an initial self-adaptation strength of 1.0. The results were recorded and plotted in fig 3.8.

Figure 3.8: Bounded self-adapting mutation strengths (complex schema)

Conclusion

There is very little difference to be obtained by bounding the self-adaptation mutation strengths when compared to fig 3.7. However, it is interesting to note that with a bounding value of n = 4 the standard deviation of the results is zero. This indicates that bounding the mutation self-adaptation rate has the effect of stabilising the model's output and making it more consistent.

It is also worth noting that the results from the model almost exactly follow the home-wins betting line. It seems, therefore, that the model has successfully learnt that the best easy performance can be obtained simply by biasing the model such that it always bets on home wins. There may be an evolutionary plateau that has to be overcome before results can improve on naive home-win betting.

3.6 Fast EP

Hypothesis

Yao et al. [21] introduce a technique known as fast evolutionary programming. They claim that this technique... is very good at search in a large neighborhood while CEP [Classical Evolutionary Programming] is better at search in a small local neighborhood. The search space in use for this project is a large-dimensional real-valued space, therefore it is hypothesised that FEP will introduce benefits over CEP due to its ability to make larger jumps out of any local optima that it may encounter.

Cauchy mutation was implemented as follows: $\eta_i' = \eta_i + N(0, 1)$, where $N(0, 1)$ here denotes a Cauchy-distributed number with location 0 and scale 1 (the Cauchy distribution has no finite mean or standard deviation, so location and scale take their place). Cauchy numbers were generated according to the equation $a + b \tan(\pi x)$, where $a$ is the location, $b$ is the scale, and $x$ is a uniformly distributed random number with $-0.5 < x < 0.5$.

Results

Using Cauchy mutation, the algorithm was run over 40 epochs of training with 100 individuals in the population and combined fitness scaling, using the complex self-adaptation schema with bounding n set to 4, and testing differing initial self-adaptation strengths. The results were then plotted in fig 3.9.

Conclusion

The plots show almost no difference with the varying initial self-adaptation strengths, and there appears to be no significant improvement over Gaussian mutation. These results are very disappointing, although the standard deviation has again been reduced. The model seems incapable so far of breaking free of betting on the home win for each fixture, although it has learnt to do this very well, as can be seen from the low standard deviation.

3.7 Improved Fast EP

Hypothesis

Yao et al. [21] state that generally, Cauchy mutation performs better when the current search point is far away from the global minimum, while Gaussian mutation is better at finding a local optimum in a good region. It would be ideal if Cauchy mutation is used when search points are far away from the global optimum and Gaussian mutation is adopted when search points are in the neighborhood of the global optimum. Unfortunately, the global optimum is usually unknown in practice, making the ideal switch from Cauchy to Gaussian mutation very difficult... IFEP generates two offspring from each parent, one by Cauchy mutation and the other by Gaussian. The better one is then chosen as the offspring.

Results

Using IFEP mutation, with an initial mutation strength of 1.0, the algorithm was run over 40 epochs of training with 100 individuals in the population and combined fitness scaling, using the complex self-adaptation schema, and self-adaptation strengths bounding n set to 4. The results for IFEP compared to classical (Gaussian) EP were then plotted in fig 3.10.

In addition, the number of Gaussian vs the number of Cauchy mutated individuals at each epoch was plotted over 15 runs in fig 3.11. Yao et al. [21] state that IFEP tends to make use of Cauchy-generated individuals early on in the search, when large mutation step sizes are beneficial, before switching to predominantly Gaussian-generated individuals closer to the global optimum. The number of Gaussian and Cauchy-generated individuals was plotted again over 500 epochs to observe any potential differences in the IFEP selection scheme across longer runs (fig 3.12).

Figure 3.9: Cauchy-based mutation and self-adaptation
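The two ingredients of this experiment, Cauchy sampling via $a + b \tan(\pi x)$ and bounding of the mutation strength, could be sketched roughly as below. The function names are hypothetical, and the bounding is written as an inclusive clamp for simplicity, which is an assumption rather than a detail taken from the report.

```python
import math
import random


def cauchy_sample(a=0.0, b=1.0, rng=random):
    """Draw one Cauchy-distributed number with location a and scale b
    by transforming a uniform x in [-0.5, 0.5) through a + b*tan(pi*x)."""
    x = rng.random() - 0.5
    return a + b * math.tan(math.pi * x)


def clamp_strength(x, initial, n=4.0):
    """Keep a mutation strength x between initial/n and initial*n.
    Inclusive bounds are an assumption made for this sketch."""
    return min(max(x, initial / n), initial * n)


if __name__ == "__main__":
    rng = random.Random(7)
    eta = 1.0
    for _ in range(5):
        # Additive Cauchy update of the strength, then bounding with n = 4.
        eta = clamp_strength(eta + cauchy_sample(rng=rng), initial=1.0, n=4.0)
        print(eta)
```

The heavy tails of the Cauchy distribution mean that very large steps occur far more often than with Gaussian noise, which is why some form of bounding is a natural companion to it.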

Figure 3.10: Improved Fast-EP mutation
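The IFEP selection step (two offspring per parent, one Gaussian and one Cauchy, keeping the better) can be illustrated with a toy sketch. Here an individual is reduced to a single real value and the fitness function is supplied by the caller; both are simplifying assumptions, since the report's individuals also carry a bitstring. Ties are resolved in favour of the Gaussian child, which is likewise an arbitrary choice.

```python
import math
import random
from collections import Counter


def ifep_offspring(parent, eta, fitness, rng=random):
    """Create a Gaussian and a Cauchy offspring from one parent and
    return (better_child, operator_name). Higher fitness is better."""
    gauss_child = parent + eta * rng.gauss(0.0, 1.0)
    cauchy_child = parent + eta * math.tan(math.pi * (rng.random() - 0.5))
    if fitness(cauchy_child) > fitness(gauss_child):
        return cauchy_child, "cauchy"
    return gauss_child, "gaussian"


if __name__ == "__main__":
    rng = random.Random(3)

    def toy_fitness(v):
        # Hypothetical objective with its optimum at v = 3.
        return -abs(v - 3.0)

    # Count how often each operator wins, in the spirit of figs 3.11 and 3.12.
    wins = Counter(
        ifep_offspring(0.0, 1.0, toy_fitness, rng)[1] for _ in range(200)
    )
    print(wins)
```

Counting the winning operator per epoch, as in the usage example, is one way the utilisation plots could be produced.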

Figure 3.11: Utilisation of Gaussian vs Cauchy mutation operators (40 epochs)

Figure 3.12: Utilisation of Gaussian vs Cauchy mutation operators (500 epochs)

Conclusion

Both classical (Gaussian) EP and IFEP produce exactly the same results, clearly and exactly following the home-wins profit line. The standard deviation is zero for both, indicating that both methods are capable over small numbers of runs (i.e. the five being recorded for the experiment) of producing stable and consistent results. It is clear, now, that there is some evolutionary plateau causing the model to be unable to improve on the home-wins betting strategy which it has successfully converged upon.

The number of Cauchy-generated individuals being chosen should decrease as the GA approaches the optimum, as more Gaussian-generated individuals are chosen instead. However, neither fig 3.11 nor fig 3.12 shows such a pattern, with the number of Gaussian-generated individuals chosen at each epoch consistently exceeding the number of Cauchy-generated individuals. According to Liang et al. [14], Gaussian mutation only takes precedence when the algorithm is near the global optimum, which suggests that in the current representation the optimum is possibly very flat and ill-defined.

3.8 Points given for a correct prediction

Hypothesis

The fitness evaluation strategy detailed in section gives three points for a predicted home win in an analogy of the football league, with one point for a predicted draw and no points for a predicted home loss. However, there is no strict requirement to follow this analogy directly, as fitness is calculated simply as a function of the total error between the predictions and the results. Using two points for a predicted win and one for a draw could have a similar effect.

Results

Using IFEP mutation, with an initial mutation strength of 1.0, the algorithm was run over 40 epochs of training with 100 individuals in the population and combined fitness scaling, using the complex self-adaptation schema, and self-adaptation strengths bounding n set to 4. The results for granting different numbers of points for a win were recorded and plotted in fig 3.13.

Conclusion

Granting two points to each individual for a correct prediction badly harms the ability of the GA to find fit individuals at each generation, as can be seen from the much lower mean line and the very large standard deviation. Conversely, granting four or more points for a correct prediction, whilst keeping a draw at one point and an incorrect prediction at zero points, leads to exactly the same model output as when using three points. From this, it is clear that increasing the number of points given for a correct prediction above two has the effect of biasing the model towards predicting home wins.

Although techniques such as fitness scaling and self-adaptation produce some (admittedly sometimes minor) positive effects, it will be necessary to overcome this plateau in the evolved model in some more significant way to achieve consistently better results than just predicting the home win every time.

3.9 Experts

Multiple oracles (ensemble machine)

Number of oracles

Hypothesis

If we define each individual in the population to be a predictor, we can choose a number of the best-evolved predictors and nominate them as oracles which will be consulted by the Gambler. So far only one oracle (the fittest individual) has been consulted, but it is possible that by consulting an ensemble of differently-trained oracles and taking a consensus on their predictions, an advantage can be gained by introducing more variety into the consulted population of predictors.

Results

Using IFEP mutation, with an initial mutation strength of 1.0, the algorithm was run over 40 epochs of training with 100 individuals in the population and combined fitness scaling, using the complex self-adaptation schema, and self-adaptation strengths bounding n set to 4. The results for using differing numbers of oracles were then plotted in fig 3.14.

Figure 3.13: Points given by an individual if the home team is predicted to win

Figure 3.14: Using a number of oracles
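One way of taking a consensus across several oracles is a simple majority vote, sketched below. The report does not say how the consensus was formed, so the voting rule, the tie-break in favour of the fittest oracle, and the outcome labels 'H', 'D' and 'A' (home win, draw, away win) are all assumptions made for illustration.

```python
from collections import Counter


def consensus(predictions):
    """Return the most common prediction among the oracles.

    predictions: list of outcome labels, ordered fittest oracle first.
    Ties are broken in favour of the earliest (fittest) oracle's choice.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for outcome in predictions:
        if counts[outcome] == top:
            return outcome


if __name__ == "__main__":
    # Five hypothetical oracles voting on one fixture.
    print(consensus(["H", "A", "H", "D", "H"]))
```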

Conclusion

Clearly, when more than one oracle is used, the final results are much worse than the previously obtained results or even naive betting on home wins. However, in the critical dip in betting returns at the start of the testing data set, the additional oracles enabled the system to perform consistently better on average, cutting the losses dramatically. It appears that, although the multiple oracles perform worse towards the end of the testing set, they perform better in the hard-to-predict start period, due to the variety in the predictions that they produce. This can be seen from the results for five oracles on the graph: this plot provides the best performance in the early part of the test set, but by far the worst performance by the end of the test set.

Varying number of oracles dynamically

Hypothesis

A useful strategy may be to use multiple oracles (e.g. the five shown to be successful previously) for the first part of the season, and implement a gating scheme to switch to using just one oracle (the fittest) once the returns from multiple-oracle prediction start to fall, and to alternate between them as the performance of the model demands.

Results

Using IFEP mutation, with an initial mutation strength of 1.0, the algorithm was run over 40 epochs of training with 100 individuals in the population and combined fitness scaling, using the complex self-adaptation schema, and self-adaptation strengths bounding n set to 4. The oracles were alternated after every x consecutive losses, and the results for different values of x were plotted in fig 3.15.

Figure 3.15: Switching number of oracles after x consecutive losses

In addition, the oracles were then alternated after every x losses in y bets, rather than after every x consecutive losses, in order to give more flexibility to the conditions in which the oracles are alternated. y was held at 20, and x was varied to find the best ratio $x/y$. The results were plotted in fig
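The gating scheme described above can be sketched with a sliding window over recent bet outcomes. This is an illustrative sketch only: the class name is hypothetical, the window is cleared after each switch (an assumption), and the scheme starts with multiple oracles as the hypothesis suggests. Setting y equal to x gives the "x consecutive losses" rule as a special case.

```python
from collections import deque


class OracleSwitcher:
    """Alternate between many oracles and one oracle after x losses
    within the last y bets."""

    def __init__(self, x, y, many=5):
        self.x = x
        self.many = many
        self.recent = deque(maxlen=y)   # True = bet won, False = bet lost
        self.n_oracles = many           # start the season with the ensemble

    def record(self, won):
        """Record one bet outcome and return the oracle count to use next."""
        self.recent.append(won)
        losses = sum(1 for outcome in self.recent if not outcome)
        if losses >= self.x:
            self.n_oracles = 1 if self.n_oracles == self.many else self.many
            self.recent.clear()         # assumed: start counting afresh
        return self.n_oracles


if __name__ == "__main__":
    switcher = OracleSwitcher(x=3, y=20)
    for outcome in [False, True, False, True, False, True]:
        print(switcher.record(outcome))
```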
