An Investigation into the Effectiveness of an Artificial Neural Network in the Prediction of Results in British Flat Racing

An Investigation into the Effectiveness of an Artificial Neural Network in the Prediction of Results in British Flat Racing Donald John Nairn September 2010 Dissertation submitted in partial fulfilment for the degree of Master of Science in Advanced Computing Department of Computing Science and Mathematics University of Stirling

Abstract

The purpose of this project is to investigate the ability of Artificial Intelligence (AI) to make successful predictions regarding the results of sporting competitions. Artificial Intelligence, as a discipline, has evolved over the last fifty or sixty years to the extent that it is in use today in a multitude of areas of everyday, and not so everyday, life. It is used for medical diagnosis, fraud identification, risk assessment, identification of consumer trends, OCR (Optical Character Recognition), the control of autonomous vehicles, military target identification and targeting, image identification, and a host of other applications. Many of these involve pattern recognition in one form or another: the ability of systems to recognise patterns in data. In many ways this, in a nutshell, is what occurs when an attempt is made to identify the winner of some sporting competition ahead of the results being known. An attempt is made to examine and weigh a myriad of factors, of varying significance, in order to arrive at a conclusion as to the ultimate outcome of the event.

The aim of this project was to attempt to develop an AI system that was able to correctly identify potential winners of thoroughbred horse races by examining the historical information that is available regarding the runners and riders. In order to do this an Artificial Neural Network (ANN) was designed and constructed, and then supplied with data that was deemed apposite to the identification of successful race entrants. The success of this system was determined by comparing its predictions for test cases against the known results of those cases. It was found that the ability of the network to predict was extremely variable, altering from one training session to another. Despite repeated attempts to arrive at consistent prediction values for the network, this proved not to be possible. As a result, the ability to further develop and improve the network in its current form was severely compromised.

Attestation

I understand the nature of plagiarism, and I am aware of the University's policy on this. I certify that this dissertation reports original work by me during my University project.

Signature (you must sign and date this page)    Date

Acknowledgements

I wish to express my gratitude to the Student Awards Agency of Scotland (SAAS) who met the cost of the fees for my course of study at the University of Stirling, as well as providing me with a generous bursary for the period during which I was studying for my Post Graduate Diploma. I would like to thank Mr Kevin Swingler of the University of Stirling, who provided me with unstinting advice, direction and support during the period that I was engaged in the production of this assignment, as well as throughout the year that I was a student at the university. In particular, I wish to express my greatest thanks to my father, Mr Alan Nairn, without whose encouragement and support, both moral and practical, I could not have even contemplated commencing this course of study.

Table of Contents

Abstract
Attestation
Acknowledgements
Table of Contents
List of Figures
1 Introduction
  1.1 Background and Context
  1.2 Scope and Objectives
  1.3 Achievements
  1.4 Overview of Dissertation
2 State-of-The-Art
  2.1 Horseracing in the United Kingdom
  2.2 Neural Networks in Sports Prediction
3 Preparatory Work
  3.1 Decisions Made
  3.2 Data Collection
    3.2.1 Data Source
    3.2.2 Limitations and Inconsistencies of the Data
4 Development Considerations
  4.1 Data Preparation
    4.1.1 Data Preparation - Numeric
    4.1.2 Data Preparation - Non-numeric/Integer
    4.1.3 Data Preparation - Other
  4.2 Artificial Neural Network Design Considerations
    4.2.1 Number of Units
    4.2.2 Number of Hidden Layers
    4.2.3 Learning Rate
  4.3 Data Storage Within the Program
5 Design and Development
  5.1 clscheckdata
  5.2 clscreateann
  5.3 clscreatearray
  5.4 clswritetofile
    5.4.1 writearraycontentstofile
    5.4.2 appendlinetofile
  5.5 Global.java
6 Results
  6.1 Testing Stage 1 - Logic Gates
    6.1.1 AND Logic Gate
    6.1.2 XOR Logic Gate
  6.2 Testing Stage 2 - Racing Data with known data relationship in place
  6.3 Testing Stage 3 - Ongoing testing
7 Conclusion
  7.1 Summary
  7.2 Evaluation
  7.3 Future Work
References
Appendix 1
  clscheckdata
    removeracesbasedonclassifications
    getitemindex
    removeoutliers
    determineaverage
    determinesd
    createcategorydatafieldsbyhorse
    createcategorydatafieldsbyrace
    normaliseannoutputs
    normaliseanninputsnumeric
    determineminmax
    createnormalvalue
    removequotesfromheadings
    doesdatacontainquotes
    checkforessentialfields
    getheadingsforjtable
    ensurevaluesnumeric
    getcoursename
    getsoughtfields
    createcategoriesbyhorse
    createcategoriesbyrace
  clscreateann
    createann (Constructor)
    createweightmatrix
    trainann
    populatehiddenneurons
    populateoutputneurons
    testnewdataagainstann
    populatetestinghiddenneurons
    populatetestingoutputneurons
  clscreatearray
    countfields
    countrecords
    createanndataarray
    get2ddataarrayfromarraylistofstrings
    get2ddataarray
    get2ddataarraylist
    getarraylistofarraylistsfrom2ddata

List of Figures

Figure 1. Threshold Logic Unit
Figure 2. Logical AND
Figure 3. Logical OR
Figure 4. Logical XOR
Figure 5. Multi-layer Perceptron
Figure 6. Graphical Representation of Backpropagation
Figure 7. Map of Southwell Racecourse
Figure 8. ProForm Data Export Filter
Figure 9. ProForm Data Exporter Screen
Figure 10. Extract from ProForm Form Screen
Figure 11. Encoding of Category data in an ANN
Figure 12. Data Structure with ArrayLists
Figure 13. Example Weight Matrix

1 Introduction

1.1 Background and Context

Almost every day of the year, a number of race meetings occur at racecourses the length and breadth of Britain. At these meetings professional and semi-professional owners, trainers, and riders (jockeys) race thoroughbred horses against one another over varying distances. Whilst many of these events are relatively obscure, there are a number, such as Royal Ascot in June, 'Glorious' Goodwood in July, and the Ebor Festival at York in August, that are as much a part of the annual sporting calendar as Wimbledon fortnight or the Oxford and Cambridge boat race. Although such events draw crowds of visitors because of the spectacle that is involved, for many attendees one of the attractions of a day at the races is the potential for wagering upon the likely winner of each race. For those unable to attend in person, the opportunity has long existed to bet on the races through the medium of a high street bookmaker or, more recently, through similar services that are offered online.

In Britain approximately £12.1 billion is wagered annually on horse races, with the sport contributing about £3.7 billion annually¹ (2008 figures). Whilst most gamblers are happy just to wager small amounts of money in the expectation of either winning or losing similar sized amounts, others gamble heavily with the concomitant risk of losing similarly. For a few professional or semi-professional gamblers, betting on horse races provides a more or less steady income, and for a very select minority it funds a comfortable lifestyle. Taking these facts and figures into consideration, it is inevitable that there should have been a great deal of research carried out in search of the perfect 'system' to use in order to maximise potential winnings whilst minimising potential losses.
There exist an extraordinary number of such systems that are used by gamblers to try to better their chances of choosing the winning horse from the selection of potentials that are featured daily in the sports pages of newspapers and on the websites of bookmakers. However, many of these systems are dubious in their effectiveness at best. A short search of the Internet will return links to thousands of such sites featuring at least as many systems of often questionable usefulness², some being made available free of charge and others only available upon the payment of (sometimes sizeable) subscription charges. Often the system itself is not made available to the subscriber, but only the results that it provides. Before parting with any money, either to subscribe to such a system or to utilise it to place wagers, the potential gambler would be well advised to check carefully the results that the system purports to supply. In carrying out such checks the gambler is aided by a number of websites that provide historical results for races, in some cases containing information going back over decades. Sites such as these are invaluable not only for investigating the effectiveness of existing systems but also for anyone seeking to develop one of their own. In large part it was the ready availability of information sources such as these that led to the utilisation of horseracing as the vehicle for this project.

¹ http://www.southwestbusiness.co.uk/news/report-reveals-8216-sport-kings-8217-secondfootball/article-1397803-detail/article.html
² A search utilising Google for "Horseracing Systems" returned about 2,180,000 results.

1.2 Scope and Objectives

In commencing this project the intention was to seek to circumvent systems of the type that are described in the section above and try to develop an Artificial Neural Network that would successfully predict the results of horse races. Unlike other means of arriving at similar conclusions, such as the ID3 (Iterative Dichotomiser 3) algorithm developed by Ross Quinlan, an ANN does not provide a transparent view of the processes that it employs internally to arrive at the conclusions that it presents. Rather, it simply accepts inputs and then provides the output or outputs that an iterative training process has trained it to deliver.

It was necessary for the scope of the project to be limited somewhat according to the availability of data for purchase. The chosen data supplier only provides records for horse races that are covered under the auspices of the British Horseracing Authority (BHA), and therefore excludes the two tracks in Northern Ireland (and the twenty-five tracks in Eire) which are the responsibility of Horse Racing Ireland (Rásaíocht Capall Éireann). This is not a significant issue, however, as for reasons that are set out below (Section 3.1) it was decided that the Artificial Neural Network would be developed on a racecourse by racecourse basis.
In a similar fashion, certain codes of racing and race types were also excluded, as is explained further in Sections 3.1 and 5.1.

The choice of this form of Artificial Intelligence as the vehicle for this project was in part dictated by the courses of study that I have undertaken whilst a student at the University of Stirling, and in part because the pattern matching and forecasting capabilities of ANNs seemed ideally suited to the problem at hand. Artificial Neural Networks are data-driven and self-adaptive and therefore require little preconceived knowledge about the system that is being studied. They are capable of modelling subtle relationships between the data inputs being supplied that may be difficult to spot or may even be unknown to the developer of the network. Furthermore, once trained, Neural Networks are capable of generalising to unknown data from the data with which they have been presented, which is precisely what it was intended to achieve in this case.

It was hoped that this AI agent, once it had been sufficiently trained, would successfully identify potential winners when presented with the details of races that have yet to be run. This would be a significant achievement in itself, although it would also open up the potential for making further developments and refinements to the agent in future that would lead to the production of similar results, but with possibly slightly different ultimate goals in mind.

1.3 Achievements

An early achievement was to successfully design and build a Java class that instantiated a very basic neural network that was capable of simple pattern recognition through supervised training and back-propagation. It was decided to build such a simple model at first in order to 'prove' the ability of the code that had been written to achieve the desired result: the learning of simple logic gates such as AND and OR. Its ability to accomplish this (as well as the XOR logic gate, which is more demanding for reasons that will be illustrated in Section 2) was a significant step forward. Once this had been accomplished, it was built upon in order to provide it with the further functionality that would permit more flexibility in the final model.

During the process of developing the final project outcome, by far the greatest amount of time had to be devoted to considering how the data that the system would use, both for the training of the neural network and for its subsequent testing, would be prepared. Deciding upon appropriate formats in which the data could be presented was not always straightforward, and remains a problem for reasons that will hopefully become apparent in the pages that follow.
Similarly, the choice of the specific data fields that should be utilised as inputs remains problematical and open to future revision.

The potential quantity and complexity of the data that was to be input had a knock-on effect, forcing the abandonment of the original intention, which was to utilise a console system to permit the user of the system to define their requirements, and forcing the decision that a Graphical User Interface (GUI) would not only be beneficial, but would be essential in order to permit flexible access to the functionality of the network. This therefore had to be designed and coded, hopefully in such a way as to prove relatively easy to use and understand in the short term, whilst providing flexibility for future developments should they prove necessary.

1.4 Overview of Dissertation

The pages that follow describe the origins of Artificial Neural Networks, how they operate, and the manner in which developments in the field have progressed to the point where they are now used on an everyday basis in many areas of modern life. There follows a short introduction to horse racing, the codes of racing that occur in the United Kingdom, and how they differ from one another. This is of some significance, as some of the decisions that were made in the selection of the types of data that were included in the project were dependent upon the types of racecourses that exist in the UK and the variety of race meetings that are held at them.

A short examination is then made of similar projects carried out by others that have proved of use in this one. This includes the use of Artificial Neural Networks not only in horse racing prediction, but also in similar sports, such as greyhound racing. ANNs have also been utilised in other sports prediction projects, and these sources also proved useful, both for the manner in which the problems encountered were overcome and for the overall outcomes that were achieved.

The technical chapters of the report deal with the manner in which the data that was to be used in the training and testing of the neural network was obtained, and the subsequent decisions that had to be made about how that data was to be presented to the ANN. Inevitably in a project concerning an Artificial Neural Network, the operations that have to be carried out on the data are key to its success or failure, and therefore some time is spent examining the processes that were used to ensure that the data that was presented to the network was of sufficient quality and quantity. The actual structure of the network that was decided upon is examined and the decisions regarding how data was to be held and manipulated within the model explained. The layout of the computer program that was designed and implemented is also considered.

The Conclusion (Section 7) describes the overall effectiveness of the ANN in accomplishing the objectives that it was designed to achieve and looks at a number of areas which could prove promising in increasing its efficiency.

2 State-of-The-Art

The history of Artificial Neural Networks as we know them today began in the 1940s, when electronic computers that had been developed for military use during the Second World War became available to civilians. This made possible the testing of theoretical models of the behaviour of biological neurons in the brain that had in some cases originated in the 19th century or before, for example with the works of the Scottish educationalist and natural philosopher Alexander Bain and the German mathematician and philosopher Gottlob Frege.

In 1943, Warren McCulloch and Walter Pitts proposed their Threshold Logic Unit (TLU), sometimes known as the McCulloch-Pitts Neuron, an artificial neuron that sought to emulate the functionality of biological neurons, and utilised the power provided by computers to attempt to model the processes that were believed to occur in the brain. Whilst the TLU only employed binary inputs and outputs, which were passed through a unit step function, it was possible to use a network of such TLUs to model any Boolean function. Figure 1 illustrates the structure of one such Threshold Logic Unit.

Figure 1. Threshold Logic Unit

The TLU consists of a set of Inputs ($I_1$ to $I_n$), each of which represents an activation value received from another neuron or from an external stimulus. Each is multiplied by a Weight ($W_{j1}$ to $W_{jn}$) that represents the strength of the synaptic connection between the pre-synaptic neuron and the post-synaptic neuron. Once received by the neuron, the weighted inputs are summed ($\Sigma$) to give the activation level ($A_j$), which is passed through a non-linear activation function in order to produce the output value ($Y_j$). This can be, in turn, passed to other neurons or to an external output.

The output of the Threshold Logic Unit ($Y_j$) can be represented as a function of its inputs using the equation:

$$Y_j = \operatorname{sgn}\left(\sum_{i=1}^{n} W_{ji} I_i - \theta\right)$$

where $\theta$ represents the neuron's activation threshold. Thus, if

$$\sum_{i=1}^{n} W_{ji} I_i \geq \theta$$

the value of $Y_j$ will be 1, and if

$$\sum_{i=1}^{n} W_{ji} I_i < \theta$$

the value of $Y_j$ will be 0.

Such Threshold Logic Units can be used in order to implement the basic logic gates (i.e. NOT, AND and OR), where at most two inputs are required, each providing a binary number, and only one output results.

Figure 2. Logical AND

Figure 2, above, illustrates how a linear decision boundary can be used in a logical AND in order to separate the outputs that result in 1 (represented by an empty circle) from those that result in 0 (represented by filled-in circles). The value of $\theta$ in this case is 1.5. In a similar fashion logical OR (Figure 3) can be represented by a single, linear decision boundary with a value for $\theta$ of 0.5.
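The behaviour of a single Threshold Logic Unit acting as an AND or OR gate can be sketched in a few lines of Java. This is only an illustrative sketch and not the code developed for this project: the class name `ThresholdLogicUnit` and its method `fire` are hypothetical, and both connection weights are assumed to be 1.0, since the text specifies only the thresholds (1.5 for AND, 0.5 for OR).

```java
// Minimal sketch of a Threshold Logic Unit (TLU) with binary inputs.
// Assumption: every connection weight is 1.0; only the thresholds
// (1.5 for AND, 0.5 for OR) are taken from the text.
final class ThresholdLogicUnit {
    private final double[] weights;
    private final double threshold;

    ThresholdLogicUnit(double[] weights, double threshold) {
        this.weights = weights.clone();
        this.threshold = threshold;
    }

    /** Returns 1 if the weighted sum of the inputs reaches the threshold, otherwise 0. */
    int fire(int... inputs) {
        if (inputs.length != weights.length) {
            throw new IllegalArgumentException("expected " + weights.length + " inputs");
        }
        double activation = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            activation += weights[i] * inputs[i];
        }
        return activation >= threshold ? 1 : 0;
    }

    public static void main(String[] args) {
        ThresholdLogicUnit and = new ThresholdLogicUnit(new double[] {1.0, 1.0}, 1.5);
        ThresholdLogicUnit or = new ThresholdLogicUnit(new double[] {1.0, 1.0}, 0.5);
        // Print the truth table of both gates.
        for (int a = 0; a <= 1; a++) {
            for (int b = 0; b <= 1; b++) {
                System.out.println(a + " " + b + " -> AND=" + and.fire(a, b) + " OR=" + or.fire(a, b));
            }
        }
    }
}
```

With unit weights, the weighted sum can only take the values 0, 1 or 2, so a threshold of 1.5 admits only the input pair (1, 1), whilst a threshold of 0.5 admits every pair except (0, 0).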

Figure 3. Logical OR (θ = 0.5)

The simplest arrangement of TLUs is a single layer network, called a perceptron (or single layer perceptron). It consists simply of one input layer of Threshold Logic Units feeding forward to a single output layer of similar units. The concept of the perceptron was developed and promoted in large part by Frank Rosenblatt during the 1950s, and for the next decade a great deal of research was carried out into its capabilities. In 1969, however, Marvin Minsky and Seymour Papert published a book entitled 'Perceptrons' that demonstrated the inability of the single layer perceptron to learn an XOR function (Figure 4).

Figure 4. Logical XOR

This is because the XOR function requires two linear decision boundaries and thus is not linearly separable. As a result it requires a more complex structure of Threshold Logic Units. Such a structure is the Multilayer Perceptron (Figure 5), the effectiveness of which in modelling more complex functions was demonstrated in the 1970s. Within the multilayer perceptron there exist an input layer, an output layer, and one or more hidden layers of artificial neurons. Each neuron in each layer is fully linked to every neuron in the succeeding layer by connections, each of which has a certain weight.

Figure 5. Multi-layer Perceptron

In 1986, David Rumelhart, Geoffrey Hinton and Ronald Williams published "Learning Internal Representations by Error Propagation". This publication brought together work that had been done by other researchers over the previous fifteen years or so and described a multilayer neural network which utilised nonlinear but differentiable transfer functions, such as the sigmoid function, instead of the step function of the single layer perceptron. Their model also included an effective training algorithm utilising backpropagation, a method of teaching artificial neural networks that had been first described in 1969, but now became more widely recognised.

Values applied to the Input Layer ($X_1 \ldots X_i$) are multiplied by the connection weights between the Input Layer and the Hidden Layer ($w_{ij}$) and summed at the Hidden Layer. They are passed through a sigmoid activation function to determine the output from each of the units of the Hidden Layer ($O_1 \ldots O_j$), and then the process is repeated using the connection weights between the Hidden Layer and the Output Layer ($w_{jk}$) to determine the final outputs from the Output Layer ($Y_1 \ldots Y_k$).

Backpropagation within a multilayer perceptron involves four major steps:

1. The inputs of a training pattern are passed through the neural network from the input layer, to the hidden layer (or layers), and then to the output layer. In doing so, they are altered by the weights that are associated with the various connections between the neurons. In this way the output activations are determined.

2. The activations at the output are compared to the target outputs in order to arrive at the error at the output:

$$e_k = t_k - y_k$$

where $t_k$ is the target output and $y_k$ the actual output of output unit $k$.

3. The Generalised Delta Rule is then applied in order to determine by what amount each weight between the hidden and output layers should be altered in order to reduce the error at the output on subsequent applications of a training pattern:

$$\Delta w_{jk} = \eta\,\delta_k\,o_j \quad \text{where} \quad \delta_k = (t_k - y_k)\,y_k\,(1 - y_k)$$

Here $\eta$ is a constant, the learning rate, which determines the amount by which the value of the weight is changed ($\Delta w$) at each training step.

4. The weights between the input layer and the hidden layer are then similarly altered, utilising the chain rule, which distributes the error of an output unit to all the hidden units to which it is connected, weighted by each connection:

$$\Delta w_{ij} = \eta\,\delta_j\,x_i \quad \text{where} \quad \delta_j = o_j\,(1 - o_j)\sum_{k} w_{jk}\,\delta_k$$

Thus the four steps comprise two major phases: a forward pass, or propagation through the network in order to arrive at output values which can be compared to the known target values, and a backward pass through the network in which the error signal is passed to each unit in the network and appropriate weight changes are made.

At the conclusion of the above steps, the result is that the weights between neurons have been altered so as to lower the error at the output layer. The amount by which the weights are changed is determined by the learning rate.
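The four steps described above can be sketched as a single Java training step for a network with one hidden layer of sigmoid units. This is a minimal sketch under stated assumptions rather than the class written for this project: the name `BackpropSketch` and its methods are hypothetical, there are no bias units or momentum, and the caller is assumed to supply the initial weight matrices.

```java
// Sketch of one backpropagation training step for a network with a single
// hidden layer and sigmoid units. Assumptions: no bias units, no momentum,
// and the initial weights are supplied by the caller.
final class BackpropSketch {
    final double[][] wInputHidden;   // w_ij: input i -> hidden j
    final double[][] wHiddenOutput;  // w_jk: hidden j -> output k
    final double learningRate;       // eta

    BackpropSketch(double[][] wInputHidden, double[][] wHiddenOutput, double learningRate) {
        this.wInputHidden = wInputHidden;
        this.wHiddenOutput = wHiddenOutput;
        this.learningRate = learningRate;
    }

    static double sigmoid(double a) {
        return 1.0 / (1.0 + Math.exp(-a));
    }

    /** Weighted sum of the previous layer's values, passed through the sigmoid. */
    static double[] forwardLayer(double[] in, double[][] w) {
        double[] out = new double[w[0].length];
        for (int j = 0; j < out.length; j++) {
            double sum = 0.0;
            for (int i = 0; i < in.length; i++) {
                sum += in[i] * w[i][j];
            }
            out[j] = sigmoid(sum);
        }
        return out;
    }

    double[] predict(double[] x) {
        return forwardLayer(forwardLayer(x, wInputHidden), wHiddenOutput);
    }

    /** Runs one forward and backward pass; returns the squared error before the update. */
    double trainStep(double[] x, double[] target) {
        // Step 1: forward pass through the hidden and output layers.
        double[] o = forwardLayer(x, wInputHidden);
        double[] y = forwardLayer(o, wHiddenOutput);

        // Step 2: error at each output, scaled by the sigmoid derivative y(1 - y).
        double[] deltaOut = new double[y.length];
        double squaredError = 0.0;
        for (int k = 0; k < y.length; k++) {
            double error = target[k] - y[k];
            squaredError += error * error;
            deltaOut[k] = error * y[k] * (1.0 - y[k]);
        }

        // Hidden deltas use the hidden-to-output weights as they were before the update.
        double[] deltaHidden = new double[o.length];
        for (int j = 0; j < o.length; j++) {
            double sum = 0.0;
            for (int k = 0; k < y.length; k++) {
                sum += wHiddenOutput[j][k] * deltaOut[k];
            }
            deltaHidden[j] = o[j] * (1.0 - o[j]) * sum;
        }

        // Step 3: update the hidden-to-output weights.
        for (int j = 0; j < o.length; j++) {
            for (int k = 0; k < y.length; k++) {
                wHiddenOutput[j][k] += learningRate * deltaOut[k] * o[j];
            }
        }
        // Step 4: update the input-to-hidden weights.
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < o.length; j++) {
                wInputHidden[i][j] += learningRate * deltaHidden[j] * x[i];
            }
        }
        return squaredError;
    }
}
```

Calling `trainStep` repeatedly on the same pattern should drive the returned squared error downwards, which is the behaviour the learning rate is intended to control. Note that the hidden-layer deltas are computed before any weight is changed, so that both updates are based on the same forward pass.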

Multiple applications of the above steps lead to the error at the outputs being reduced to a sufficient extent that it can be said that the network has learned the data. The object of the exercise is to train the network sufficiently on known inputs and outputs, so that when presented with inputs, the result of which is not known, it is capable of arriving at appropriate outputs. A typical learning curve is shown below in Figure 6, which illustrates why this progress towards the minimisation of the error is termed Gradient Descent Learning.

Figure 6. Graphical Representation of Backpropagation

As can be seen from Figure 6, the presence of local minima can cause a premature conclusion to the learning process, should the error value sink into one. This can be avoided by making use of the momentum term, which utilises the value of the previous update to each weight when making the current update to that weight. Thus the weight update rule that was illustrated above:

$$\Delta w_{ij} = \eta\,\delta_j\,x_i$$

can be updated as shown below:

$$\Delta w_{ij}(t) = \alpha\,\Delta w_{ij}(t-1) + \eta\,\delta_j(t)\,x_i(t)$$

where $\alpha$ is the momentum term. In this way it is possible to speed up the learning process and reduce the risk of the network becoming trapped in a local minimum. There is, however, a risk that, if the momentum term is too high, the network will overshoot the correct solution and subsequently oscillate around it.
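The momentum-based update for a single connection can be sketched in Java as follows. The class name `MomentumUpdate` and its method `nextChange` are hypothetical, and the sketch assumes that the previous change is stored per weight and starts at zero. It is not taken from the program developed for this project.

```java
// Sketch of the momentum-based weight update for a single connection.
// Assumption: one instance is kept per weight, and the previous change
// is zero before the first update.
final class MomentumUpdate {
    private final double learningRate; // eta
    private final double momentum;     // alpha
    private double previousChange;     // delta w(t - 1)

    MomentumUpdate(double learningRate, double momentum) {
        this.learningRate = learningRate;
        this.momentum = momentum;
        this.previousChange = 0.0;
    }

    /** Computes delta w(t) = alpha * delta w(t - 1) + eta * delta(t) * x(t) and remembers it. */
    double nextChange(double delta, double input) {
        double change = momentum * previousChange + learningRate * delta * input;
        previousChange = change;
        return change;
    }
}
```

As an illustration with assumed values of η = 0.5 and α = 0.9, two successive calls with δ = 0.2 and x = 1.0 give changes of 0.1 and then 0.9 × 0.1 + 0.1 = 0.19: consecutive updates in the same direction reinforce one another, which is what speeds up learning and also what creates the risk of overshooting.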

Whilst Artificial Neural Networks have been in use since as long ago as the 1960s, it was in the mid-1980s, when the backpropagation algorithm came into general use, that their use for processes such as forecasting and pattern recognition began to become widespread.

The requirement for a threshold value in the hidden and output layers can be removed by the introduction of a bias unit with a value of 1. This acts in the same fashion as any other input to the two layers and is connected to them via weights that are updated in the same way as are the others. No change is made, however, to the bias unit value, which remains at 1 throughout the training period.

The success of Artificial Neural Networks can be judged by the uses to which they are put in many disparate areas of modern life. They are utilised in the detection of potentially fraudulent claims at insurance companies, to determine loan rates for bank customers, to predict road traffic densities and patterns, control retail inventories, carry out medical diagnostics, and a host of other applications.

2.1 Horseracing in the United Kingdom

In the United Kingdom, horse racing as the professional sport that we know today dates back at least as far as the reign of Queen Anne (1702-1714). At this time races involving several horses, ridden by professional riders, on which spectators placed bets, took place at a number of courses that were founded throughout the UK. Horse races in the UK can be run on courses that feature hurdles or fences, in which case they are termed National Hunt races (or Steeplechases), or on an unobstructed course, in which case they are simply called flat races. A slight complication is offered by NHF or "National Hunt Flat" races, which are run under National Hunt rules but do not feature fences, and are used for the initial training of horses that are intended to go on to compete in National Hunt races proper.
Whilst National Hunt racing traditionally took place during winter, when the ground is expected to be softer and therefore more forgiving for horses that have to jump fences, it now continues throughout the year. Flat racing traditionally occurs during the summer months, with the season lasting from March to November. However, the advent of all weather (AW) artificial surfaces at some tracks (currently four tracks in the UK, with more planned) means that flat racing now continues all year round.

Nowadays there are sixty licensed racecourses in Great Britain (sixty-one if Great Leighs, which opened in 2008 but had its licence revoked in 2009, is included), each of which has its own characteristics. Some hold meetings of both codes of racing, whilst others hold only one, and racing can occur at many of them at evening as well as at daytime meetings. Some run races in a clockwise direction (RH or right-hand courses) and others in a counter-clockwise orientation (LH or left-hand). Each also differs from its fellows in its size, and therefore the tightness of its corners, and in its topography. In this way a thumbnail portrait of Epsom racecourse would describe it as being "LH Sharp/Undulating", whilst Kempton is merely "RH Sharp". The upshot of this is that there is a huge variety of course conditions and types which it will, somehow, be necessary for any ANN that is built to take into account if it is to prove of use on more than just a single course.

2.2 Neural Networks in Sports Prediction

Upon the commencement of this project the extent to which Artificial Intelligence in general, and Artificial Neural Networks in particular, had previously been utilised in the area of sports forecasting was unknown. Given the recognised abilities of neural networks in the areas of prediction and pattern recognition, it was inevitable that others before me should have sought to utilise their features in order to predict the results of sporting contests.

One thing that was noticed among the projects that were found recorded, in varying degrees of detail, was that most, if not all, were reports by users of how they utilised pre-existing software from third party vendors in order to investigate the efficacy of neural networks. Although this route was one that could have been followed in this project, the intention from the outset had been to attempt to develop a model from the ground up. This was not only because it was wished to put into practice some of the theory that had been assimilated whilst studying at the University of Stirling, but also because it was believed that by doing so it would be possible to create a model that was customised to the conditions, variables, and data types encountered in horse racing.

Despite the difference described above between this project and those that I found described elsewhere, the papers that were examined were of great value, as they permitted the avoidance of some potential pitfalls that others had previously encountered.
Another key difference between what was being attempted here and what had previously been accomplished by others was how often the objectives of the exercise differed.

As an example, one of the first papers that was encountered during a search for similar exercises was one entitled "A Case Study Using Neural Network Algorithms: Horse Racing Predictions in Jamaica"[7]. The aim of this work was to utilise horse racing as a medium through which to investigate the relative effectiveness of four different learning algorithms: the standard Back Propagation algorithm, the Quasi-Newton BFGS Algorithm, the Levenberg-Marquardt Algorithm and the Conjugate Gradient Descent Algorithm. The software package used was the Trajan 6.0 neural network package. The result reporting within the paper is somewhat perfunctory, but an accuracy rate of 71% at picking a horse that finished in the top three was reported for the most successful algorithm (Backpropagation, as described above in Section 2). This strongly suggested that the backpropagation algorithm be employed on this project, although given its almost universal usage, it is the one most likely to have been utilised in any event.

Of particular interest in this paper were the choices that were made by the authors of the training variables to use in the study. This was to prove to be of continuing interest due to the potential impact of the variables utilised on the effectiveness of any neural network, and the large potential number of such variables available in horse racing. The variables chosen by Williams and Li are shown below in Table 1, along with conclusions that I made regarding their possible significance.

Variable used | Comments
Racing Distance | Significant
Type of Race | Probably of little or no significance
Past Position | Significant
Weight of Horse | Of possible significance (but not recorded in British horse racing)
Weight of Jockey | Significant
Horse's Finish Time | Unlikely to be of significance, as finish time can be affected by many outside factors such as going (i.e. softness of track), pace of race, skill of jockey, etc.
Equipment Used (e.g. visor, blinkers, etc.) | Upon research, seems to be of little significance, particularly in flat racing
Age of Horse | Significant, as horses only reach maturity at four years of age. Horses younger than four usually have yet to display their full potential
Number of Horses in Race | Unlikely to be of significance

Table 1

What also proved interesting in the work by Williams and Li was their use of one network to represent one horse, providing each horse with eight input variables in order to arrive at an output rating for that runner.
This was at variance with other studies that had been encountered in which, for example, a network was constructed with each input representing a horse in the race.

A very successful study of sports prediction in a different but similar field, that of greyhound racing, was carried out by Hsinchun Chen et al. in 1994[8]. The intention of this work was to compare the relative accuracy of a backpropagation neural network, an ID3 algorithm (a decision tree building algorithm), and human expertise (in the form of three tipping services). The quite remarkable results for 100 races are shown below in Table 2.

Table 2
Technique | Correct | Incorrect | Did Not Bet | Payoffs ($)
Expert 1 | 19 | 81 | 0 | -71.40
Expert 2 | 17 | 83 | 0 | -61.20
Expert 3 | 18 | 82 | 0 | -70.20
ID3 | 34 | 50 | 26 (see footnote 3) | 69.20
Backpropagation | 20 | 80 | 0 | 124.80

It should be noted that, unlike in the Jamaican study[7], only those dogs that actually won their race were considered in the statistics quoted. It can readily be seen that the human experts had very comparable results, both in the correct predictions that they made and in the similar losses that would have resulted had their advice been followed. Backpropagation had a similar success rate at predicting winners to that of the experts, but far surpassed them at accruing winnings. Whilst not matching the winnings achieved by backpropagation, the ID3 algorithm was far more successful at predicting winning runners than any other method, as well as eliminating 26 races where a winner could not be identified, thus possibly limiting potential losses.

It is interesting to note, given that backpropagation correctly predicted a similar number of winners to the experts, that the payoff from this method of prediction was by far the highest of the three methods. Similarly, despite correctly predicting more winners than any other method, ID3 failed to match backpropagation in payoff value. The report explains that both ID3 and backpropagation correctly identified as winners a number of runners that were at long odds (i.e. outside chances). Whilst the best payoff for the experts on a single win was $11.40, the ID3 algorithm produced winning results offering a payout as high as $41.20, and backpropagation yielded a maximum payout of $78.00. This was encouraging.
It suggested what had been hoped from the commencement of this study: that a neural network might be able to notice patterns of factors in runners indicating possible success that would elude a human, or that would, if they were spotted, be ignored because of some other factor that was perceived to negate them.

Footnote 3: In these 26 races ID3 did not/could not predict a winner.

Given the relative merits of ID3 and backpropagation that this report suggests, it would appear that a prediction method combining the two might be the most successful means of maximising winnings. In particular, using ID3 to eliminate races where a winner is unlikely to be reliably identified, and then applying backpropagation to those remaining in order to maximise winnings, would appear to be an avenue worthy of further exploration. It would have been revealing if the authors of the report had provided more information regarding the correlation between the results predicted by backpropagation and those predicted by ID3, in order to determine to what extent their results might be complementary or mutually exclusive.

A further feature of the above results is the close match between the number of correct winners predicted by the human experts and by the backpropagation method. This suggests that the success of any AI system cannot be measured simply by comparing the number of winners it predicts with that of other systems, whether those utilise AI or are human in origin; instead, its ability to maximise profitability by spotting outsiders should be considered.

Although greyhound racing is very different from horse racing, the variables that were used by Chen et al. are instructive where there is some correlation with horse racing, and in some cases are directly transferable to the latter sport. Table 3 shows the inputs that were used in their study.
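The point made above, that a system should be judged on profitability rather than on the number of winners alone, can be checked directly against the figures in Table 2. The following is a minimal sketch in Python. The names TABLE_2 and summarise are illustrative only, and the "strike rate per bet placed" is a measure derived here from the table, not one quoted in the cited report.

```python
# Figures copied from Table 2: (correct, incorrect, did_not_bet, payoff_in_dollars).
TABLE_2 = {
    "Expert 1": (19, 81, 0, -71.40),
    "Expert 2": (17, 83, 0, -61.20),
    "Expert 3": (18, 82, 0, -70.20),
    "ID3": (34, 50, 26, 69.20),
    "Backpropagation": (20, 80, 0, 124.80),
}


def summarise(correct, incorrect, did_not_bet, payoff):
    """Derive per-bet measures; races with no bet are excluded from the denominator."""
    bets = correct + incorrect
    return {
        "bets": bets,
        "did_not_bet": did_not_bet,
        "strike_rate": correct / bets,
        "payoff": payoff,
        "payoff_per_bet": payoff / bets,
    }


summary = {name: summarise(*row) for name, row in TABLE_2.items()}

# Rank the techniques on two different measures of success.
best_strike_rate = max(summary, key=lambda name: summary[name]["strike_rate"])
best_payoff = max(summary, key=lambda name: summary[name]["payoff"])

for name, s in summary.items():
    print(f"{name:16s} strike rate {s['strike_rate']:.3f}  "
          f"payoff per bet {s['payoff_per_bet']:+.3f}")
print("Best strike rate:", best_strike_rate)
print("Best payoff:", best_payoff)
```

On these figures the two measures pick different techniques, which is exactly why the number of winners alone is an incomplete measure of success.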
Table 3
Variable used | Explanation | Comments
Fastest Time | The fastest time in seconds for a 5/16 mile race | Not necessarily of relevance to horse racing as horse races are held over varying distances
Win Percentage | The number of first places divided by the total number of races | Is directly comparable with horse racing
Place Percentage | The number of second places divided by the total number of races | Is directly comparable with horse racing
Show Percentage | The number of third places divided by the total number of races |
Break Average | The dog's position during the first turn (averaged over the seven most recent races) | No horse racing equivalent
Finish Average | The average finishing position over the previous seven races | Is directly comparable with horse racing
Time 7 Average | The average finishing time of the seven most recent races | No horse racing equivalent. As races are run over varying distances it would be of no relevance
Time 3 Average | The average finishing time of the three most recent races | No horse racing equivalent. As races are run over varying distances it would be of no relevance
Grade Average | The average grade of the seven most recent races the dog competed in | Whilst there is a grading system in horse racing (Classes) I have never seen this information made available for seven past races. The nearest equivalent is "Class LTO" (Last Time Out)
Up Grade | Weight given to a dog when dropping down to a less competitive race grade | Is directly equivalent to "Penalty Carried" in horse racing

The above gives an indication of inputs that proved successful in a system of a similar type to this one.

Another site that proved of interest was that of BrainMaker Neural Network Software which, while advertising its product, provides two pages in which its use to predict horse racing results is highlighted. "Predicting Thoroughbreds Finish Time with BrainMaker Neural Networks"[9] describes how the product was used by one customer to pick "the winning horse in 17 out of 22 thoroughbred races at Detroit Race Course". The page goes on to describe some of the problems that the customer faced in achieving this, and how he overcame them. The other page, "BrainMaker Predicts the Order of Finish in Horseracing"[10], describes another customer's use of BrainMaker, this time to predict winners in races of six furlongs. What both of these examples have in common is that each was designed to operate on only a very limited subset of data. The first system requires that all races used have over nine runners, that every horse has at least eight past performances, and that the race be run at Detroit racecourse. The second requires that all races be six furlongs in length and run at Philadelphia racecourse. Despite these limitations, both give interesting insights into the methods by which positive results were obtained by their respective users.
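Two ideas from this part of the review lend themselves to a short illustration: the form-based inputs of Table 3 (win, place and show percentages, and an average over the seven most recent races), and the restriction of a model to a narrow subset of races, as in the first BrainMaker example. The sketch below is in Python and is an assumption throughout. The function names, the data layout (a list of finishing positions, most recent first) and the course label "Detroit" are hypothetical, and none of it is taken from the cited studies' own implementations.

```python
def form_features(past_positions, window=7):
    """Compute Table 3 style inputs from finishing positions (most recent first)."""
    if not past_positions:
        raise ValueError("at least one past performance is required")
    runs = len(past_positions)
    recent = past_positions[:window]  # the `window` most recent races
    return {
        "win_pct": sum(1 for p in past_positions if p == 1) / runs,
        "place_pct": sum(1 for p in past_positions if p == 2) / runs,
        "show_pct": sum(1 for p in past_positions if p == 3) / runs,
        "finish_avg": sum(recent) / len(recent),
    }


def eligible_for_first_system(course, runners, min_runners=10, min_past_runs=8):
    """Apply the stated limits: over nine runners, each with at least eight
    past performances, at the one racecourse the system was built for."""
    if course != "Detroit":
        return False
    if len(runners) < min_runners:
        return False
    return all(len(past) >= min_past_runs for past in runners.values())


# Hypothetical runner with eight past finishing positions, most recent first.
features = form_features([1, 2, 3, 1, 5, 4, 2, 1])

# Hypothetical ten-runner race in which every runner has eight past runs.
race = {f"Runner {i}": [3, 1, 4, 2, 6, 5, 2, 1] for i in range(10)}
print(features)
print(eligible_for_first_system("Detroit", race))
```

Filtering races before computing inputs mirrors the way both BrainMaker users narrowed their data, at the cost of leaving the model with nothing to say about races outside that subset.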
All of the above papers helped to crystallise my thinking regarding the inputs that might help the network arrive at meaningful results. Just as usefully, they allowed me to identify those inputs that might be not merely less useful but actually deleterious in their effect.

Another useful project that I found was "Using Machine Learning to Predict the Results of Sporting Matches"[11]. This study is not based upon horse racing, but rather upon predicting the winners of Australian National Rugby League (NRL) and National Basketball League (NBL) games. While the results obtained proved somewhat disappointing, it nonetheless provided excellent food for thought regarding the inputs to a neural network that work most efficiently.

As well as these pages, which dealt with specific implementations of neural networks, I also found a number of websites that offered advice on how neural networks could best be implemented to predict the results of sports. If there is one lesson to be learned from the above websites, and others (listed in the Bibliography), it is that the data that is input, and the form in which it is supplied to the model, are key to the success or otherwise of the neural network.

3 Preparatory Work

3.1 Decisions Made

Given the enormous disparity which exists between racecourses in the United Kingdom (as discussed above in Section 2.1), it was felt from the start of the project that an attempt to arrive at a 'one-size-fits-all' neural network capable of predicting results for any race at any course would be unlikely to succeed. Given the difference between National Hunt and Flat racing, I decided to concentrate on the latter, feeling that races over jumps introduce an additional complication, in that a horse might refuse a jump, fall, or unseat its rider, all of which are less likely to occur in Flat racing.

One of the more obvious differences between Flat and National Hunt racing is that in the former, horses start races from stalls that are spread across the width of the track, whereas in the latter they merely begin the race from behind a tape barrier, i.e. in arbitrary positions. The draw that a horse receives in a flat race is determined by the drawing of lots, and can provide a horse with a significant advantage or disadvantage at some courses and over some distances. However, this advantage or disadvantage is not equal between courses. Thus, in a five furlong race at Southwell racecourse, the draw has no impact, the horses running a straight course to the line. However, in a six furlong race at the same track, which is run around a sharp, left-handed turn, low numbers have a distinct advantage. Why this should be the case is possibly best illustrated by Figure 7, below, the source of which was the website of the Racing Post periodical (see footnote 4). At other courses no advantage or disadvantage to the draw is discernible.

Figure 7. Map of Southwell Racecourse

Footnote 4: http://www.racingpost.com/horses/course_home.sd?crs_id=61&r_date=2010-09-09

The above factors made it essential to build separate neural networks for each track that hosts flat races. Since this is impractical given that 38 such courses exist in the United Kingdom, it was decided to concentrate on one course specifically, for the benefit of this project. The course chosen, purely arbitrarily, was York.

3.2 Data Collection

3.2.1 Data Source

When this project was commenced, the source of historic racing data that would be used to train and test the Artificial Neural Network was unknown. Extensive research was necessary into what was available and exactly what features each data provider offered to the user. In the end it was decided to subscribe to the ProForm[12] racing service, the website for which gives information not only about the features and use of its software but also about horse racing in general. It appears that the owner of the company, Simon Walton, was a computer programmer before his hobby of seeking to use historical race data became his full-time occupation. Whether because of this or not, the site appeared to be the one that had the most favourable feedback from subscribers, as well as the best record for technical support, should it be needed.

I was aided in this decision by the fact that the ProForm database of historical results was downloadable to the subscriber's computer (in MS-SQL format), and could then be accessed directly. It transpired, though, that the software that comes packaged with the download permits the export of data by the user in the form of a comma-separated values (.csv) file. The page that controls this also provides a number of filtering options that can be used to decide exactly what data is to be exported, e.g. by race type, date, and racecourse (Figure 8). On a more granular level, the user can decide which fields of data in particular should be exported. Figure 9 shows many of the data fields that can, in this way, have their values exported to .csv format.
Figure 8. ProForm Data Export Filter

The result of this was that it would not be necessary to learn (without the benefit of technical documentation) the structure of an unfamiliar database before being able to make use of it. Instead, the data export facility would provide the data needed for training and testing the Artificial Neural Network, albeit in a format that is neither as fast nor as convenient as direct access using SQL queries would have been. As ProForm updates automatically with details of the current day's winners and the declarations for the following day's races and runners, it can also serve, ultimately, as the perfect tool to aid in the testing of a network against hitherto undecided races.

The fields that can be exported from the Data Exporter screen are conveniently sorted into those that pertain to the course and race, those that pertain to the runner and its jockey, trainer, etc., and those that are calculated or filtered values based on other fields, e.g. "AVG(x)" (average speed rating over the previous x races, by default five) and "Won S/R Before" (ratio of wins to runs prior to the race being examined).
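The export route described above can be illustrated with a short Python sketch that reads a .csv export and keeps only the fields chosen as network inputs. It is a minimal sketch under stated assumptions. Of the column headings used, only "Won S/R Before" and "Horse Colour" are named in the text; the "Race Date" and "Horse" headings, the sample rows, the function names and the choice of 0.0 for blank cells are all hypothetical, and the real ProForm export layout may differ.

```python
import csv
import io

# Hypothetical miniature export standing in for a real ProForm .csv file.
SAMPLE_EXPORT = (
    "Race Date,Horse,Won S/R Before,Horse Colour\n"
    "2009-05-14,Runner A,0.25,Bay\n"
    "2009-05-14,Runner B,,Grey\n"
)


def load_selected_fields(handle, wanted):
    """Read a .csv export and keep only the named columns from each row."""
    reader = csv.DictReader(handle)
    missing = [name for name in wanted if name not in (reader.fieldnames or [])]
    if missing:
        raise KeyError(f"export lacks expected fields: {missing}")
    return [{name: row[name] for name in wanted} for row in reader]


def to_float(text, default=0.0):
    """Convert a cell to float; blank cells fall back to `default` (an assumption)."""
    text = text.strip()
    return float(text) if text else default


# Irrelevant fields such as "Horse Colour" are simply never selected.
rows = load_selected_fields(io.StringIO(SAMPLE_EXPORT), ["Horse", "Won S/R Before"])
inputs = [to_float(row["Won S/R Before"]) for row in rows]
print(rows)
print(inputs)
```

Checking for missing headings up front matters because, as discussed later in this section, not every field is populated or named consistently across the whole database.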

Figure 9. ProForm Data Exporter Screen

In all, over 130 fields are provided on the Data Exporter screen, and whilst many of these are of little or no relevance (e.g. "Horse Colour", "Horse Country"), many may be considered as potential inputs to the neural network. However, any increase in the number of inputs that are provided to a neural network results in an increase in the dimensionality of the network, and therefore in the amount of time and quantity of data that it takes to train it correctly. It is therefore important to try to arrive at a set of inputs that includes the data that most explicitly maps onto the task to be learned.

Hence the importance of attempting to find the most significant fields from which to compose the inputs, and thus to limit the network complexity and required training time.

3.2.2 Limitations and Inconsistencies of the Data

The earliest records in the database date from January 1997, therefore giving well over ten years of racing data. In the case of Catterick racecourse, for example, this equates to over 1400 races, involving almost 16000 horses. The only caveat that it is necessary to bear in mind is that a few of the fields in the database that are shown in Figure 9 have been added since the database was begun, and the data has not always been back-dated; therefore not all the data stretches back as far as the above date. For example, the "Trn Stats" ('Trainer Stats') field, one that might be of potential value in ensuring that the neural network can successfully learn patterns, only contains data from April 2006. However, even were data only to be taken from this later date onwards, the above example of Catterick would contain racing records for 573 races and over 5800 horses (see footnote 5). This comfortably exceeds the amount of data that was available to Chen et al.[8], who utilised data from 300 races involving 2400 greyhounds, and to Williams and Li[7], who used data from only 143 races (the number of horses included in these races was not specified). It would therefore appear that there is sufficient data to proceed, even if the earliest records are ignored because they lack some values.

Footnote 5: This is after the removal of certain races for reasons that will be made clear in Section 5.1.

Another area in which a slight inconsistency was found within ProForm is in the headings under which data is stored. In the daily form-card, which displays races that are yet to be run, certain headings are used; once the race has been run and the data becomes part of the historical database, the same data is sometimes stored under different headings. One example of this can be seen in Figure 10, which shows an extract from the ProForm Daily Form screen (form-card) for a race.

Figure 10. Extract from ProForm Form Screen

In Figure 10 the gender of the horse is presented in two separate fields (highlighted). The "H_Sex_1" field records broadly whether the horse is male ("Male") or female ("Female"), whilst the "Sex" field records more specific information (colt ("c"), filly ("f"), gelding ("g"), etc.).

However, this same information is rendered in the .csv file that is produced by the Data Exporter from the historic data in the format shown below (Table 4).

Table 4
Sex | Sex Abbrev
Male | c
Male | g
Male | g
Male | c
Male | g
Female | f
Male | c

It can be seen that the data that was represented in Figure 10 under the heading "H_Sex_1" is given the heading "Sex" in the historical data, while the data that was given the heading "Sex" in Figure 10 is given the heading "Sex Abbrev" in Table 4. Whilst inconvenient, this inconsistency and other similar ones within the data can, once known, be worked around, and care can be taken to ensure that the data used during the prediction phase of the neural network is matched to the same fields that were used during the training and testing phases.

All in all, the ProForm database offers the data and features that are necessary to ensure the effective training and testing of an Artificial Neural Network.
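One way of working around the heading mismatch described above is to translate form-card headings into their historical equivalents before any prediction is made. The following Python sketch shows the idea. The two heading pairs come from Figure 10 and Table 4, while the function name, the "Horse" heading and the sample row are hypothetical.

```python
# Form-card heading -> heading used for the same data in the historical export.
FORM_CARD_TO_HISTORIC = {
    "H_Sex_1": "Sex",
    "Sex": "Sex Abbrev",
}


def to_historic_headings(form_card_row):
    """Return a copy of the row keyed by historical headings.

    A new dictionary is built rather than renaming in place, because the
    form-card heading "Sex" is also the historical name of a different field.
    """
    renamed = {}
    for heading, value in form_card_row.items():
        target = FORM_CARD_TO_HISTORIC.get(heading, heading)
        if target in renamed:
            raise ValueError(f"two form-card fields map to {target!r}")
        renamed[target] = value
    return renamed


# Hypothetical form-card row for a gelding.
form_card_row = {"Horse": "Runner A", "H_Sex_1": "Male", "Sex": "g"}
historic_row = to_historic_headings(form_card_row)
print(historic_row)
```

Keeping the mapping in a single table means that any further inconsistencies discovered later can be added in one place, so that prediction inputs stay aligned with the fields used in training.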