Prediction models for soccer sports analytics


Linköping University, Department of Computer and Information Science
Master thesis, 30 ECTS, Computer Science
LIU-IDA/LITH-EX-A--2018/021--SE

Prediction models for soccer sports analytics
Edward Nsolo

Supervisor: Niklas Carlson
Examiner: Patrick Lambrix

Linköpings universitet, SE Linköping

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication, barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security, and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page.

© Edward Nsolo

Abstract

In recent times, research interest in soccer has increased substantially, driven by the growing availability of soccer statistics. Data provider firms have made historical soccer data easier to access, and as a result data scientists have started researching the field. In this thesis, we develop prediction models that could be applied by data scientists and other soccer stakeholders. As a case study, we run several machine learning algorithms on historical data from five major European leagues and compare the results. The study investigates approaches that could simplify the models while maintaining their correctness and robustness. Such approaches include feature selection and the conversion of regression prediction problems into binary classification problems. Furthermore, a literature review did not reveal any research on generalizing binary classification predictions by applying target class upper boundaries other than 50% frequency binning. This thesis therefore investigates the effects of such a generalization on the simplicity and performance of the models. We aimed to extend the traditional discretization of classes with an equal-frequency binning function, which is the standard way of converting regression problems into binary classification in many applications. Furthermore, we sought to establish the important player features in individual leagues that could help team managers develop cost-efficient transfer strategies. These features were selected successfully by applying wrapper and filter algorithms. Both methods turned out to be useful, as the time taken to build the models was minimal and the models were able to make good predictions. Furthermore, we noticed that different features matter for different leagues. This should be kept in mind when assessing the performance of players. Different machine learning algorithms were found to behave differently under different conditions. However, Naïve Bayes was determined to be the best fit in most cases. Moreover, the results suggest that it is possible to generalize binary classification problems and maintain performance to a reasonable extent. It should be noted, however, that the early stages of generalizing binary classification models involve the tedious work of training on the datasets, and that cost should be weighed as a tradeoff when considering this approach.

Acknowledgments

Firstly, I would like to express my sincere gratitude to my thesis examiner and supervisor, Prof. Patrick Lambrix and Prof. Niklas Carlson of Linköping University, for the opportunity to carry out this thesis project under their supervision. Their continuous support, guidance, and patience steered me in the right direction and led to the successful completion of this thesis. Secondly, I would like to extend my gratitude to fellow schoolmates, friends, and family for their company, advice, and encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Lastly, I would like to thank almighty God for good health and for the opportunity of a scholarship to study in Sweden. This publication was produced during a scholarship period at Linköping University, and I would therefore like to give special appreciation to the Swedish Institute scholarship.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables

Introduction
    Purpose
    Research questions
    Delimitations
Related work
Theory
    Software (Weka)
    Min-max normalization
    Feature selection methods
    Class imbalance
    SMOTE (Synthetic Minority Oversampling Technique)
    TigerJython with Weka
    Machine learning algorithms
    Evaluation of the prediction models
Research methods, techniques, and methodology
    Pre-study
    Experimental study
    Methodology
Data pre-processing
    Data collection
    Data rescaling, missing values, and duplicates
    Converting regression problem to binary classification problem
Feature selection
    Feature selection with wrapper method
    Feature selection with filter attribute evaluator
Performance of prediction models
    Accuracy results of the prediction models
    F1 Score results of the prediction models
    AUC-ROC results of the prediction models
Discussion and conclusion
    What are the best mechanisms for selecting essential features for predicting the performance of top players in European leagues?
    What are the essential features for developing prediction models for top players in European leagues?
    What are the useful classification models for predicting performance of top players in European leagues?
    How can binary prediction models be generalized?
Future research
Bibliography

A Wrapper method results of the combined-leagues
B Attributes selected by Wrapper method of the combined leagues
C Execution time of wrapper method for the combined leagues
D Aggregated results of filter method for the combined leagues
E Model accuracy results of wrapper datasets for the combined leagues
F Model accuracy of filter-datasets for the combined leagues
G F1 score results of wrapper datasets for the combined leagues
H F1 Score results of the filter-datasets for the combined leagues
I AUC-ROC results of the wrapper datasets for the combined leagues
J AUC-ROC results of the filter-datasets
K Accuracy results for individual leagues
L F1 score results for individual leagues
M AUC-ROC results for individual leagues

List of Figures

4.1 A procedure for analyzing soccer sport historical data
Data preparation model
Knowledge flow activities for data formatting process
Feature selection with wrapper method knowledge flow model
Feature selection with filter method knowledge flow model
Merit of subsets of attributes selected
Execution time of wrapper subset evaluator
Model accuracy results
Overall F1 Score results of the combined leagues
Overall AUC-ROC results
C.1 Execution time of wrapper attribute evaluator for defenders datasets
C.2 Execution time of wrapper method for goalkeepers datasets
C.3 Execution time of wrapper method for midfielders datasets
C.4 Execution time of wrapper method for forwards datasets
E.1 Prediction model accuracy for defenders wrapper-dataset
E.2 Prediction model accuracy for midfielders wrapper-dataset
E.3 Model accuracy for the goalkeepers wrapper-datasets
E.4 Prediction model accuracy for forwards wrapped datasets
F.1 Model accuracy for defenders filter-datasets
F.2 Model accuracy for midfielders filter-datasets
F.3 Model accuracy for goalkeepers filter-datasets
F.4 Model accuracy for forwards filter-datasets
G.1 F1 Score results of the defenders wrapper-datasets
G.2 F1 Score results of the midfielders wrapper-datasets
G.3 F1 Score results of the goalkeepers wrapped datasets
G.4 F1 Score results of the forwards wrapped datasets
H.1 F1 Score results of the defenders filter-datasets
H.2 F1 Score results of the midfielders filter-datasets
H.3 F1 Score results of the goalkeepers filtered datasets
H.4 F1 Score results of the forwards filtered datasets
I.1 AUC-ROC results of the defenders wrapper-datasets
I.2 AUC-ROC results of the midfielders wrapper-datasets
I.3 AUC-ROC results of the goalkeepers wrapper-datasets
I.4 AUC-ROC results of the forwards wrapper-datasets
J.1 AUC-ROC results of the defenders filter-datasets
J.2 AUC-ROC results of the midfielders filter-datasets
J.3 AUC-ROC results of the goalkeepers filter-datasets
J.4 AUC-ROC results of the forwards filter-datasets

List of Tables

3.1 Confusion matrix
List of all attributes
All datasets of the combined leagues
All datasets of the Bundesliga
All datasets of the EPL
All datasets of the La Liga
All datasets of the Ligue 1
All datasets of the Serie A
Important features for goalkeepers as selected by wrapper method
Important features for defenders as selected by wrapper method
Important features for midfielders as selected by wrapper method
Important features for forwards as selected by wrapper method
Important features for the combined leagues by wrapper attribute evaluator
Most frequent selected attributes by wrapper method in combined leagues
Important features for goalkeepers as selected by filter method
Important features for defenders as selected by filter method
Important features for midfielders as selected by filter method
Important features for forwards as selected by filter method
Important features for the combined leagues by filter attribute evaluator
Overall model accuracy
F1 Score results between wrapper and filter of the combined leagues
AUC-ROC results
Comparison between the wrapper and filter subset evaluator
Sample of actual predictions
A.1 Attribute selection with wrapper method for defenders top10pc dataset
A.2 Attribute selection with wrapper method for defenders top25pc dataset
A.3 Attribute selection with wrapper method for defenders top50pc dataset
A.4 Attribute selection with wrapper method for goalkeepers top10pc dataset
A.5 Attribute selection with wrapper method for goalkeepers top25pc dataset
A.6 Attribute selection with wrapper method for goalkeepers top50pc dataset
A.7 Attribute selection with wrapper method for midfielders top10pc dataset
A.8 Attribute selection with wrapper method for midfielders top25pc dataset
A.9 Attribute selection with wrapper method for midfielders top50pc dataset
A.10 Attribute selection with wrapper method for forwards top10pc dataset
A.11 Attribute selection with wrapper method for forwards top25pc dataset
A.12 Attribute selection with wrapper method for forwards top50pc dataset
A.13 Frequency of attributes being selected by several wrapper schemes
B.1 Support count of selected attributes with wrapper method for defenders' top10pc dataset
B.2 Support count of selected attributes for wrapper method for defenders' top25pc dataset
B.3 Support count of selected attributes for wrapper method for defenders' top50pc dataset
B.4 Support count of selected attributes with wrapper method for goalkeepers' top10pc dataset
B.5 Support count of selected attributes for wrapper method for goalkeepers' top25pc dataset
B.6 Count of selected attributes for goalkeepers' top50pc dataset
B.7 Support count of selected attributes with wrapper method for midfielders' top10pc dataset
B.8 Support count of selected attributes for wrapper method for midfielders' top25pc dataset
B.9 Count of selected attributes for midfielders' top50pc dataset
B.10 Support count of selected attributes with wrapper method for forwards' top10pc dataset
B.11 Support count of selected attributes for wrapper method for forwards' top25pc dataset
B.12 Count of selected attributes for forwards' top50pc dataset
D.1 Overall filter method results
K.1 Accuracy results of Bundesliga
K.2 Accuracy results of EPL
K.3 Accuracy results of La Liga
K.4 Accuracy results of Ligue 1
K.5 Accuracy results of Serie A
L.1 F1 results of Bundesliga
L.2 F1 results of EPL
L.3 F1 results of La Liga
L.4 F1 results of Ligue 1
L.5 F1 results of Serie A
M.1 AUC-ROC results of Bundesliga
M.2 AUC-ROC results of EPL
M.3 AUC-ROC results of La Liga
M.4 AUC-ROC results of Ligue 1
M.5 AUC-ROC results of Serie A

1 Introduction

Soccer is an ancient sport whose history can be traced back 2000 years to China [1]. To date, it is still recognized as the highest-ranked sport in the world in terms of attendance, broadcasting, and support. More than half of the global population follows the sport, and even in countries like India, the USA, and Canada, where it is not the dominant sport, there are millions of players and fans [2].

Following the empirical success of machine learning techniques in other sports like baseball, the extent of recording historical soccer data changed dramatically as the need for new information and the capabilities of machine learning tools increased [3]. An example of such analysis is the Moneyball study, which demonstrated how a team with limited resources could improve its performance by finding players with potential in features that are usually undervalued when predicting wins [4]. Determining the essential features is not always as straightforward as it may appear [5] [6] [7].

According to FIFA rules, European soccer leagues have a transfer window spanning 16 weeks, during which team managers look all over the world for talented players who could strengthen their squads. The acquisition of such players can be motivated by many factors, including the excellent form of the player in his former league. Such motivation alone is unsatisfactory because European leagues differ: they require different levels of physicality and technicality to excel. What makes a good player in one league may not do so in another. Hence, disappointment and criticism have grown as top players underperform at their new clubs after colossal transfers that turned out to be failures.
An example of such a player is Paul Pogba of Manchester United, whose good performances in Serie A made him the world's most expensive footballer in 2016, but who failed to impress when transferred to the English Premier League in the same year [8]. Therefore, we see the importance of establishing different criteria for individual leagues that can guide team managers toward buying players who give good value for the money spent. Furthermore, this thesis aims to produce prediction models that are simple and more accurate by grouping players by their specific roles, leagues, and performance levels. In general, we want to build general prediction models that cover many aspects of what makes a good player.

First, we find essential sets of player features because:

- Soccer involves many features that may be irrelevant, uncontrollable, hard to measure, or hard to derive useful information from. All activities associated with the identification of players, training statistics, and rarely executed actions like penalties can be irrelevant for performance prediction models.
- Various features carry different weights depending on a player's role in a soccer match; essential features of defenders may not be essential for forwards or goalkeepers. For example, the number of saves is an essential feature for goalkeepers but not for other players.
- The performance of machine learning algorithms depends on adequately prepared training datasets. Attributes must be selected wisely and arranged according to the player roles.

Secondly, we test different machine learning algorithms, which involves processing the data, selecting features, running classifiers, and evaluating the performance of the models. Furthermore, different types of discretization are used to broaden the range over which players can be classified.

Soccer is a mega-profit business that attracts firms from many sectors, such as clothes, shoes, and gambling [9]. Consequently, stakeholders demand simple prediction models to facilitate improvements in training, match tactics, and individual player performance. Thus, theoretically, it can be asserted that better prediction models may lead to better performance and profit for all soccer stakeholders. We believe that the models created here can be tested empirically and be beneficial to users. Data mining and machine learning are among the best methods currently available for making predictions and learning associations in soccer data. Therefore, in this master's thesis, data mining and machine learning techniques were applied to historical data on top players in European leagues, with one season used for training and another season for actual prediction testing.

1.1 Purpose

In attempting to build prediction models that analyze European soccer players, this thesis generally aims at developing simple prediction models that work as classification problems.
Furthermore, separate prediction models are built for each player category: defenders, goalkeepers, midfielders, and forwards. In pursuing the main objective of building soccer prediction models for top players in European leagues, the following specific targets form the basis of the thesis project:

- Find essential features of players according to their roles that determine the performance of top players in European leagues.
- Develop binary classification models that predict the performance of top players in European leagues, using both individual-league and consolidated-league data.
- Analyze the player ranking schemes provided by soccer data providers.

1.2 Research questions

The analysis of the best prediction models for top players explicitly tries to answer the following research questions:

- What are the best mechanisms for selecting essential features for predicting the performance of top soccer players?
- What are the potential features for developing classification models for top players?
- What are the useful classification models for predicting top players?
- How can binary classification models be generalized?
- How accurate can the player ranking schemes of the soccer data providers be?

1.3 Delimitations

This thesis analyzes only players' historical data from five major European leagues: the English Premier League (EPL), La Liga Premier Division, Bundesliga, Serie A, and French Ligue 1. For each category of classification models available in the machine learning tools, popular representative algorithms were selected. The categories of selected classifiers are Bayes-based, function-based, lazy-learner, rule-based, and tree-based classifiers. For resampling, only oversampling was applied because there was not enough time to implement all algorithms with all possible combinations of parameters.

2 Related work

This chapter describes the contributions of other researchers and of studies similar to this thesis, reviewing the application of data mining and machine learning in soccer. It looks at how data mining has been used to predict the performance of players, teams, match outcomes, and injuries. Current research in other sports analytics is also discussed.

Currently, the main research areas in soccer analytics concern effective training, soccer officials, and the prediction of future young stars, match outcomes, injuries, team performance, and player ratings. Prediction of player performance is among the most researched fields in soccer analytics, since players are the center of the sport. Several studies have established new and useful ways of predicting player performance with machine learning techniques. These studies include Vroonen et al. [10], who proposed advanced techniques for predicting the future performance of young soccer players. Vroonen et al. used similarity metrics to create an advanced system that predicts the performance of younger players of the same age. The approach was beneficial, as it outperformed the baseline results. More research is possible to cover other aspects of this study, as it only included players of the same age rated above 0.9/1. A similar study could therefore be conducted on players of different ages, and the rating threshold could be lowered to 0.8/1, which is still reasonably high.

A common problem for many researchers is the use of uninformative historical data, because not all match events give useful information for analysis. There has been inadequate research focusing on match phases (phases that led to a goal or a shot). Only recently have some authors begun to concentrate on match phases in the performance analysis of historical soccer data. Decroos et al. [11] grouped several match events by using the dynamic time warping technique and then applied an exponential-decay-based approach to calculate the performance of players in the distinct phases. Furthermore, S. Brown [12] conducted a similar study but with PageRank as the approach. Both approaches had high accuracy, but these studies overlooked the fact that the match events leading to a goal or a shot mostly favor attackers and midfielders and, as a result, discriminate against defenders and goalkeepers. For improved results, the authors could have separated the datasets according to the different game roles, which is an approach emphasized in this thesis.

Soccer analysis considers not only match events but also other attributes such as postural and physiological characteristics. There are studies, such as the one by Lara et al. [13], that analyze prediction systems using player movement and balance statistics. The authors used decision trees (CART) and logistic regression as machine learning algorithms. Their observations revealed that the performance of players does not depend on balance data and that logistic regression outperformed decision trees (CART). Given that there are many machine learning techniques, it would be interesting to replicate the study and compare the results.

Furthermore, most sports analytics, including soccer analytics, predicts player performance using only past records. Some researchers incorporate live match statistics to produce live predictions. Cintia et al. [14] proposed a model that uses live match statistics together with the chi-square technique. The results showed that the application of live statistics yielded 50% accuracy, which is below the desirable performance.

Some soccer analysts tend to use many complicated features when forecasting the performance of players. Consequently, the resulting prediction models become incomprehensible and deceptive. Over time, there has been an effort from the research community to find useful prediction models that are applicable in real life. Brandt and Brefeld [15] discovered that focusing on just a few features and simple machine learning techniques such as PageRank, C5.0, and SVMs (RBF kernels) could yield higher accuracy of prediction models. Similarly, G. Kumar [16] reached the same conclusion by using standard feature selection mechanisms. In addition to what Brandt and Brefeld did, G. Kumar used many learning algorithms to figure out essential player attributes.
The algorithms used include Linear Regression, SMOreg, Gaussian Processes, LeastMedSq, M5P, Bagging with REP Tree, Additive Regression with Decision Stump, REP Tree, J48, Decision Tables, Multilayer Perceptron, Simple Linear Regression, Locally Weighted Learning, IBk, KStar, and RBF Network. However, some of Kumar's [16] claims contradict the assertion of Cintia et al. [14] that the application of live statistics leads to better accuracy than historical data.

3 Theory

This chapter describes the tools and machine learning techniques used in this thesis. It gives the necessary background on the concepts that appear in the text and on the tools applied. For each tool and method, a link to previous research is given to show how other researchers used similar approaches; where a different approach is used, a motivation is provided. Note that the concepts in this chapter are described at a high level, on the assumption that interested readers will consult the references provided for more in-depth treatment.

The chapter is organized as follows. Section 3.1 presents the tools used for the project. Section 3.2 describes the techniques used for rescaling the data. Section 3.3 elaborates on different mechanisms for feature selection, especially those applied in this thesis. Sections 3.4 and 3.5 illustrate class imbalance and ways of solving the problem with resampling techniques. Section 3.6 presents the scripting package used to extend discretization. Lastly, Section 3.7 presents high-level knowledge about the different classification algorithms used in this thesis.

3.1 Software (Weka)

Weka is open-source software for small and large-scale machine learning tasks. The software has been designed to perform all core machine learning functions, including data preprocessing, classification, clustering, feature selection, and visualization. Since its first release in 1993, the tool has been progressively accepted and widely used for data mining research and in other fields because of its simplicity and extensibility [17] [18]. Apart from soccer, Weka has also been used in other research domains. Therefore, there is enough confidence that Weka is suitable for research similar to this [19] [20]. Weka has five modules which perform the same tasks in different ways.
For this thesis, the first three modules (Explorer, Experimenter, and Knowledge Flow) were used to accomplish most of the project tasks. The first module (Explorer) was used for exploring the datasets to determine the kinds of filters and parameters that needed to be applied. The second module (Experimenter) was used to run algorithms and perform tests on each dataset after the data had been cleaned and features had been selected. The third module (Knowledge Flow) was used for designing and executing the sub-steps of the data pre-processing procedure. Most data pre-processing is done with Knowledge Flow models: missing values, normalization, feature selection, and resampling are handled there. The discretization task is handled by the TigerJython extension, which enables running Jython scripts for custom discretization and pruning of the datasets. The Workbench and Simple CLI modules were not used because the first three modules were sufficient for this project.

3.2 Min-max normalization

Normalization is an important step in any data-preparation process. It is used to transform data to follow a smaller and consistent range across all attributes within a dataset; normally the preferred range is from 0 to 1. In many cases, attributes within the same dataset are on different scales, which affects the efficiency of machine learning algorithms, especially those that use distance measures as a criterion for learning [21] [22]. When datasets contain attributes with different scales, attributes with larger scales tend to overshadow attributes with smaller scales by distorting the attribute weight proportions in distance calculations. As a result, algorithms run slower and produce misleading outcomes.

Normalization can be done using several techniques, including min-max, mean, and standard deviation normalization. Generally, no normalization method is better than another. Depending on the type of data and the machine learning algorithm used, minor differences may be noticed. Previous studies suggest that the min-max normalization technique can lead to a model that is slightly more accurate, less complicated, and faster to train than models built with other techniques [23]. The min-max normalization technique transforms the values of a given attribute to a range of 0 to 1 or -1 to 1 by using the minimum and maximum values of that attribute within a dataset. Because many researchers have used the min-max technique and it has been shown to have benefits, this study has chosen the technique for feature rescaling.
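As a minimal sketch of the rescaling described in this section, the Python snippet below maps the values of one attribute to the range 0 to 1, or to -1 to 1 when other bounds are passed. It is an illustration only and not the Weka filter used in the thesis; the function name `min_max_normalize` and the sample attribute values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers to the range [new_min, new_max].

    Hypothetical helper for illustration; not part of Weka.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no scale information;
        # map every value to the lower bound to avoid dividing by zero.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + ((v - lo) / (hi - lo)) * span for v in values]


# Two made-up attributes on very different scales.
goals = [0, 5, 10, 20]
minutes_played = [900, 1800, 2700, 3420]

print(min_max_normalize(goals))
print(min_max_normalize(minutes_played))
print(min_max_normalize(goals, -1.0, 1.0))
```

After rescaling, both attributes lie in the same range, so neither dominates a distance calculation merely because of its original units.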
Other methods could have been applied as well and would still have produced reasonable results. The formula for min-max normalization is given in equation 3.1:

$$
\text{Normalized value of } x = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{3.1}
$$

3.3 Feature selection methods

Feature selection is the process of finding an optimal subset of attributes that enhances the performance of the classification model. It is common for machine learning problems to have attributes that are redundant or do not influence the target class [24] [25]. Such attributes are referred to as insignificant or undesired features. Removing insignificant attributes in many cases improves the performance, robustness, and simplicity of the classification models [26].

There are four reasons why feature selection is essential for machine learning processes. First, feature selection produces datasets with fewer attributes and therefore improves the simplicity and interpretability of classification models. Second, feature selection significantly reduces learning time and hence improves efficiency [27]. Third, feature selection reduces the amount of noise and outliers and hence decreases the problem of overfitting. Fourth, feature selection generates dense datasets, which in turn increases the statistical significance of the values in the datasets [25].

Standard feature selection techniques in machine learning include filters, wrappers, and embedded methods. The filter method is a technique that uses the correlation between a normal attribute and the class attribute. The features that score above the desired correlation threshold are selected as potential attributes [28]. Filter methods are more efficient but disregard the inter-correlation between normal attributes. The wrapper method is another technique, which may be used when interrelated features exist. The wrapper method uses machine learning algorithms to cross-check multiple subsets of attributes and saves the set with optimum performance as a potential set of attributes [27]. Unlike filters, this technique considers not only the relationship between the attributes and the class attribute but also the interrelation among attributes. When choosing between the wrapper and filter methods becomes a challenge, the embedded method can be used, because it uses both filter and wrapper methods with the aim of utilizing the benefits of both techniques [25] [29]. All techniques have benefits and drawbacks; it cannot be generalized that one technique is better than another. Generally in machine learning, all techniques are executed, and then the method that led to optimal performance and robustness is selected and included as part of the model.

3.4 Class imbalance

When the values of the class attribute are not equally distributed within a dataset, a class imbalance problem can arise, which affects the interpretation of the prediction model's accuracy [30]. Commonly, this problem causes the accuracy of the majority class to overshadow the accuracy of the minority class, which misleads the interpretation of the results. To better illustrate the effects of this problem, consider a dataset with 1000 cancer test results in which ten records tested cancer-positive while 990 records tested cancer-negative. If the results are as described in the confusion matrix below, it is apparent that, although the accuracy is 98.2%, the model is unsuitable for this situation. The positive class contributes only 0.2% to the overall accuracy, which is dominated by the 98% contribution of the negative class. By implication, 8 of the 10 cancer-positive predictions (80%) are wrong, and only 2 of the 12 actual cancer-positive cases are detected, despite the seemingly reasonable model accuracy of 98.2%.
                      Actual positives (cancer +ve)    Actual negatives (cancer -ve)
Predicted positive    TP = 2                           FP = 8
Predicted negative    FN = 10                          TN = 980

\[
\begin{aligned}
\text{Model accuracy} &= \text{Positive class accuracy} + \text{Negative class accuracy} \\
&= \frac{TP}{TP + TN + FP + FN} \times 100\% + \frac{TN}{TP + TN + FP + FN} \times 100\% \\
&= \frac{2}{1000} \times 100\% + \frac{980}{1000} \times 100\% \\
&= 0.2\% + 98\% = 98.2\%
\end{aligned}
\tag{3.2}
\]

The class imbalance problem is most problematic for decision tree classifiers and small datasets [30] [31]. Since this thesis project used several decision tree algorithms, such as J48 and Random Forest, it was deemed necessary to resample the data. Additionally, some of the datasets generated for goalkeepers and forwards were small, which made it equally important to apply resampling techniques that address the problem of class imbalance. Several techniques can be used to deal with class imbalance, including under-sampling, oversampling, and a blended method which uses both techniques. This thesis used only an oversampling technique called SMOTE, which is described in the following section.

SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is a popular method of handling class imbalance. It is an oversampling technique that synthesizes new instances of the minority class and adds them to the dataset to match the number of instances of the majority class. Oversampling is the process of increasing the number of instances of an underrepresented class in a given dataset to increase the statistical significance of the minority class, which in turn reduces the degree to which the models could overfit. Since the class ratios between the minority and majority classes for the datasets used in this thesis were 1 to 4 and 1 to 10, which are high, oversampling is more suitable than the other resampling techniques [31]. Contrary to traditional techniques, SMOTE does not just add random copies: for each selected minority-class instance it looks for the specified K nearest neighbors of the same class and creates synthetic instances between the instance and those neighbors until the SMOTE percentage is reached. This technique has been shown to be efficient and it is popular among researchers [32]. The SMOTE algorithm takes three parameters (the minority class, the percentage of instances to be added, and the number of nearest neighbors). If the parameter values are not chosen properly, SMOTE may introduce noise and hence affect the accuracy of the classification models [33]. The selection of good values for the SMOTE parameters is explained later, in the methodology chapter.

TigerJython with Weka

TigerJython is a simple development platform for writing Jython scripts that gives Weka the extra capability to execute Python and Java scripts, so that machine learning tasks can be performed with much more control than the graphical user interfaces allow [34]. To use scripting languages like Jython, a TigerJython plugin was installed through the package manager. With TigerJython a user can directly invoke Weka's Java classes and Java and Python modules at will, and extend the limited functionalities as desired. For this thesis, a Jython script is used to extend the discretization functionality.

3.7 Machine learning algorithms

This section presents the underlying knowledge about machine learning algorithms used for feature selection and classification.
It begins by briefly describing how the algorithms work, and presents the benefits and challenges encountered in previous research. Furthermore, for each algorithm a brief discussion of the expectations and tradeoffs is given.

Bayesian network

Bayesian network algorithms are classifiers that use probability distributions in Directed Acyclic Graphs to give predictions. They take the leaf nodes as the attributes and the parent nodes as the predictor class. The Bayesian network is among the widely used classifiers in machine learning because it is robust and performs well on classification problems with complex and conflicting information. Moreover, Bayesian network techniques work well with datasets of any size, a property that is beneficial for this thesis as the size of the processed data is between 285 and 1857 instances. On the other hand, the main drawback of the algorithm is its inability to deal well with continuous data [35]. In many cases, to apply Bayesian network classifiers, discretization must be applied, which in turn disturbs the linear relationship of the data.

Naïve Bayes

Naïve Bayes is another type of classifier based on Bayes' theorem of conditional probabilities. It is considered a simplified version of the Bayesian network classifier in which the attributes are treated as independent, meaning that knowing the value of one attribute does not imply the value of the other attributes. As its name suggests, the classifier is simple and efficient because ignoring class-conditional relationships between attributes reduces the computational work. Like the general Bayesian network classifier, Naïve Bayes works well with small and large datasets. Although it is a simple classifier, many studies have found Naïve Bayes to outperform many classifiers, even sophisticated ones [36]. On the other hand, Naïve Bayes has drawbacks: its accuracy decreases because it ignores information about conditional dependencies, and it forces the use of discretization, which in turn negatively affects certain types of datasets [37].

Logistic Regression

Logistic Regression is a classifier that falls under the function category. It works similarly to linear regression except for the relation function and the type of class attribute used. Compared with linear regression, it works well with a categorical class attribute, although it only handles binary classification [38], and it works relatively well on skewed datasets where a linear relationship is not well observed. The algorithm uses the population growth (logistic) function to express the relationship between the input variables and the predictor class. The equation used for Logistic Regression comprises four variables: y is the predicted class, x is the value of the independent attribute, b_0 is the intercept of the logistic equation, and b_1 is the coefficient of the single independent attribute.

\[ \text{predicted class } (y) = \frac{e^{(b_0 + b_1 x)}}{1 + e^{(b_0 + b_1 x)}} \tag{3.3} \]

Lazy classifiers

Unlike other types of classifiers, lazy learners use the instances of the training dataset directly to make predictions. Other types of classifiers create a model first and then use it with testing data to make predictions. The name lazy follows from their tendency not to produce a model at all. They are sometimes referred to as instance-based learners; examples of such learners are the KNN (IBK) and KStar classifiers. Except for KStar, which uses an entropy-based distance measure to make predictions, most lazy classifiers use standard distance measures such as the Euclidean distance [39].
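To make the instance-based idea concrete, here is a minimal sketch of a k-nearest-neighbors prediction using the Euclidean distance. It is only an illustration under stated assumptions, not Weka's IBK implementation: the function names, the two-attribute training instances, and the labels "top" and "other" are all hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training instances.

    train: list of (feature_vector, label) pairs kept as-is (no model is built).
    """
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical, already normalized training instances.
train = [
    ((0.9, 0.8), "top"),
    ((0.8, 0.9), "top"),
    ((0.2, 0.1), "other"),
    ((0.1, 0.3), "other"),
    ((0.3, 0.2), "other"),
]

prediction_high = knn_predict(train, (0.85, 0.75), k=3)
prediction_low = knn_predict(train, (0.2, 0.2), k=3)
print(prediction_high, prediction_low)
```

Every prediction scans the whole training set, which is why the cost of such learners grows with the number of stored instances and attributes.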
Although the idea behind lazy learners is simple, they are efficient, especially for small datasets. However, when lazy learners use large or high-dimensional datasets as input, they perform poorly and require large storage capacity [40]. Another challenge is that it can be difficult to determine the appropriate number K of nearest neighbors, because repeatedly guessing the K value can be time-consuming. Furthermore, since lazy learners make predictions directly from the training datasets, the training datasets must be updated constantly as new data become available.

Rule-based classifiers

Rule-based classifiers generate a set of rules in the form of (IF condition THEN class label) and use them to make predictions. They are efficient and straightforward learners and are widely used for many classification problems. For this thesis, two types of rule-based classifiers were implemented: ZeroR and PART. The ZeroR (Zero Rules) classifier is the simplest classifier; it focuses on the target class and ignores the other, non-target attributes. It is mainly used to provide a baseline accuracy that determines the minimum accepted accuracy for other classifiers [41]. PART is another type of rule-based classifier that combines two techniques to make predictions. It works like C4.5 but additionally prunes the C4.5 decision tree separately on each iteration and selects the best leaf as a rule [42].

Furthermore, the Decision Tables classifier completes the list of rule-based algorithms that we use. Decision Tables are known as one of the most straightforward and efficient classifiers; they use the simple logic of decision tables to classify instances of data. They are like decision trees in the classification process. What distinguishes them is how the decisions or rules are presented.
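The baseline role of ZeroR can be illustrated with a minimal sketch: the classifier ignores every non-class attribute and always predicts the most frequent class seen in training. The class name `ZeroRSketch`, the helper `accuracy`, and the 9-to-1 label distribution are assumptions made for this example and are not taken from Weka or from the thesis datasets.

```python
from collections import Counter


class ZeroRSketch:
    """Illustrative majority-class baseline (hypothetical, not Weka's ZeroR)."""

    def fit(self, labels):
        # Only the class attribute is inspected; all other attributes are ignored.
        self.majority_ = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n_instances):
        return [self.majority_] * n_instances


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical imbalanced class attribute: 9 "other" players for every "top" player.
labels = ["other"] * 9 + ["top"] * 1
baseline = ZeroRSketch().fit(labels)
baseline_accuracy = accuracy(labels, baseline.predict(len(labels)))
print(baseline.majority_, baseline_accuracy)
```

On imbalanced data such a baseline already reaches a high accuracy without learning anything about the attributes, so any useful classifier should exceed it.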

Decision tree classifiers

Decision tree classifiers are non-parametric algorithms that build classification and regression models by using decision tree concepts. The decision trees are built in such a way that leaf nodes represent the dependent class attribute, and the root node represents the independent input attributes. To make a prediction, a decision tree is first generated and stored as a set of rules to be used for determining the class value of a new instance. For this study, two of the standard decision tree classifiers, J48 and Random Forest, were used.

J48 is a Java implementation of the C4.5 idea proposed by Ross Quinlan as an improvement on its predecessor, the ID3 algorithm [43]. Unlike ID3, J48 works well with both numeric and categorical data and applies pruning techniques to minimize errors [44].

Random Forest is another type of tree-based classifier used for this thesis. It works by building multiple decision trees generated randomly from subsets of the training dataset. To classify a new instance, each decision tree predicts the class to which the instance belongs, and the class with the most predictions is selected as the class of the instance. The benefits of this algorithm include robustness and the capability to handle large and high-dimensional datasets appropriately. However, Random Forest has disadvantages as well. The first drawback is that Random Forest is not as good for regression problems as for classification problems. The second is that the optimal number of features is not known intuitively, so trial and error is required.

3.8 Evaluation of the prediction models

Determining the best prediction models is always a difficult task that requires a great deal of knowledge of machine learning techniques and a broad range of performance metrics [45]. A classifier's performance can be measured by using simple metrics such as the confusion matrix or advanced metrics that include multiple interpretations.
Examples of measures used to evaluate the performance of machine learning algorithms include the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), accuracy, Precision, Recall, the F1 measure, the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC), and robustness in terms of time taken for training and testing, to mention a few.

Confusion matrix

The confusion matrix is the essential means of evaluating the performance of prediction models. It is composed of four fundamental metrics that can be used as criteria when selecting the best prediction model. Commonly, machine learning algorithms target two of the metrics, TP and TN, where the goal is to achieve the highest score, while for the other two metrics, FN and FP, the goal is to achieve the lowest score [46].

- TP: The desirable element of the confusion matrix that presents the number of correct predictions for the positive/target class.
- TN: The desirable element of the confusion matrix that presents the number of correct predictions for the negative/non-target class.
- FP: The undesirable element of the confusion matrix that presents the number of incorrect predictions for the positive/target class.
- FN: The undesirable element of the confusion matrix that presents the number of incorrect predictions for the negative/non-target class.

The confusion matrix can be used as a quick way of analyzing the performance of the models. However, it cannot be used to compare models because it does not have a single metric which incorporates all metrics. If the confusion matrix alone is to be used, the analyst will have to consider the values of TP, TN, FP, and FN individually; otherwise, the results may not reflect the real performance.

Table 3.1: Confusion matrix

                   Predicted Positive    Predicted Negative
Actual Positive    TP                    FN
Actual Negative    FP                    TN

Prediction accuracy

The second metric that can be used to measure the performance of the prediction models is the accuracy, which is given as the percentage of the sum of the correct predictions (TP + TN) over the sum of all predictions (TP + TN + FP + FN). Accuracy suffices to measure correctness but cannot tell the extent of FP and FN. In sensitive domains such as health, aviation, and security, high values of FP and FN are intolerable. Therefore, accuracy needs to be used along with other metrics such as F1 and AUC to make accurate evaluations [47] [45]. The formula for accuracy is given below.

\[ \text{Model accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.4} \]

F1 measure

The third metric for measuring the accuracy of the classifiers is the F1 score. This metric presents the combined effect of FP and FN, which are reflected in the precision and recall values [48]. Precision indicates the extent of the false positives while Recall indicates the extent of the false negatives. The range of F1 values is 0 to 1, where 1 is the perfect prediction. A higher value of F1 indicates that the prediction model has few FP and FN.

- Precision: Given by the ratio of TP to (TP + FP)
- Recall: Given by the ratio of TP to (TP + FN)
- F1 measure: Given by twice the ratio of the product to the sum of precision and recall [49]

\[ F1 = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}} \tag{3.5} \]

Area Under the Curve (AUC)

The AUC summarizes a plot of the TP rate (Recall) against either the FP rate or Precision [50].
It is an easy way to depict the relationship between Recall and specificity, or between Recall and Precision, applied at different thresholds [45]. The former is generally known as AUC-ROC (Receiver Operating Characteristic) and the latter is known as AUC-PRC. They both correspond to each other, but in this thesis AUC-ROC was used over AUC-PRC as it forms a somewhat smoother curve.

- Specificity: measures the degree to which negative predictions are correct. It is given by the ratio of TN to (TN + FP)
- AUC-ROC plot: Recall against (1 - specificity), i.e., TP rate vs. FP rate
- AUC-PRC plot: Precision vs. Recall

The range of AUC-ROC/PRC is from 0 to 1, where 1 is a perfect prediction and 0 is a completely wrong prediction.
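The threshold-independent AUC aside, the count-based metrics of this section can be computed directly from the four confusion-matrix entries. The sketch below does so for the counts of the cancer example in Section 3.4 (TP = 2, FP = 8, FN = 10, TN = 980). The function name `confusion_metrics` and the choice to return 0.0 when a denominator is zero are assumptions of this illustration, not part of the thesis.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, recall, specificity and F1 from raw counts."""

    def ratio(numerator, denominator):
        # Assumed convention for this sketch: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),       # equation 3.4
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),  # equation 3.5
    }


# Counts from the cancer example in Section 3.4.
metrics = confusion_metrics(tp=2, fp=8, fn=10, tn=980)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts the accuracy is 0.982 while the F1 score is only about 0.18, which shows numerically why accuracy alone is a poor guide on imbalanced data.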

4 Research methods, techniques, and methodology

This chapter presents the research methods, techniques, and the methodology, outlining step by step the procedures that led to the accomplishment of the thesis project. Many research methods are commonly used in computer science, including simulation, experimentation, observation, pre-study, literature, and comparative study. This thesis used three common approaches that are relevant to machine learning research: pre-study, literature, and experimental study. Sections 4.1, 4.2, and 4.3 respectively describe the pre-study activities undertaken to start the project, give an overview of how the experimental study was conducted, and present the methodology used for this thesis.

4.1 Pre-study

As the starting point of the thesis project, a pre-study was conducted as a means of getting familiar with ongoing research in soccer analytics. The aim was to build a foundation for the thesis by analyzing different methodologies, tools, algorithms, and areas explored in the past. This was carried out as a literature review process in which several sports analytics and machine learning publications were collected. Furthermore, the literature study was used as a means of acquiring the necessary knowledge of some of the concepts needed for the thesis work. The sports analytics and machine learning journals and conferences used for this thesis include:

- MIT SLOAN Sports analytics conference
- MLSA Machine Learning and Data Mining for Sports Analytics MLSA workshop
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Machine Learning: ECML-95 conference
- Journal of Machine Learning Research
- Journal of Artificial Intelligence Research

Furthermore, search engines were used to collect important information about the study. Some examples of search terms include:

- team player injuries young talent && prediction forecasting && models
- feature attribute && selection reduction && techniques methods
- soccer football && machine learning data mining artificial intelligence ai
- basketball baseball health && machine learning data mining artificial intelligence ai
- training && machine learning data mining artificial intelligence ai

4.2 Experimental study

Experimental research is a quantitative study aiming to find, validate, and analyze the studied cases in relation to the different parameters applied [51] [52]. It can be incorporated into observations and comparative studies between several conditions. The experiments in this thesis were undertaken in the form of a supervised machine learning process in which, in each phase described in section 4.3, the performance of players in five European leagues was analyzed. Among the comparisons made in the study were those between:

- wrapper and filter feature selection techniques;
- Top10, Top25, and Top50 ranking of players;
- Bundesliga, EPL, La Liga, Ligue 1, and Serie A leagues;
- Bayes network, Naïve Bayes, Logistic Regression, IBK, PART, Decision Tables, J48, Random Forest, and ZeroR classifiers.

The experiments were aided by the machine learning tool Weka, which enabled data pre-processing, training, and evaluation of the algorithms. Along with the tool, Java and Python scripts were implemented to extend the limited functions in Weka. The purpose of the scripts was to extend data discretization to any desired ranking split and to implement pruning of the raw datasets following the data cleaning and feature selection procedures.

4.3 Methodology

The purpose of this section is to present the processes undertaken in forming prediction models for soccer players. The section commences by elaborating the methods regarding data collection and preparation.
It describes the process involving reduction, resampling, and rescaling of the data. The second subsection is about the feature-selection procedure, where a list of the most important features is selected for each dataset. The third subsection explains the procedure used for generating training and testing datasets and shows its validity in the context of this study. The fourth and fifth subsections elaborate the development, execution, and testing of prediction models. Furthermore, for each classifier and filter used, the choices of parameter values used for the experiments are explained. The final subsection is about the analysis of the performance of the machine learning models created. Different performance metrics, including accuracy, precision, recall, and AUC (ROC), are presented together with other parameters regarding robustness. Figure 4.1 summarizes the overall methodology for this thesis.

Figure 4.1: A procedure for analyzing soccer sport historical data

Data collection

In this thesis, data collection covered five European leagues, comprising the English Premier League (EPL), the Spanish league (La Liga), the German league (Bundesliga), the Italian league (Serie A), and the French league (Ligue 1), where players were arranged according to their different roles. WhoScored was used as the main data provider considering that most of its data were acquired from Opta. Many secondary data providers like Squawka also use Opta as a primary data source, but the differences in the rating of players' performances between data providers are small, and therefore they are equally reliable. WhoScored and other secondary data providers use internal schemes developed by a group of soccer experts to rate player and team performances. The small differences gave confidence that the selection of data provider does not matter as long as acquiring the data is not expensive in terms of time and resources. Previous researchers have also used WhoScored as their main source of data, which indicates that the information provided is accessible and can also be used for this thesis. An example of a study that used WhoScored is Decroos et al. [11], in their research on a spatial-temporal action rating system for soccer.

The data collection process led to two groups of datasets which in total contained 5212 instances. Group one separated players according to their corresponding leagues and further divided them into four raw datasets following the position of the player on the pitch. Group two combined players from all leagues into four datasets: goalkeepers, defenders, midfielders, and forwards. Groups one and two each had a total of 2606 instances. Players whose primary role and position on the pitch were defending and back, respectively, were categorized as defenders. Players whose primary role and position were playmaking and central, respectively, were considered midfielders. Players whose primary responsibility and position were attacking and front, respectively, were categorized as forwards (attackers). The remaining players were categorized as goalkeepers. The smallest dataset was for goalkeepers; it contained 222 instances. The largest dataset was for midfielders; it had 1109 instances. The remaining datasets, which belong to defenders and forwards, had 970 and 305 instances, respectively.

Data preparation

Data preparation is an essential procedure for machine learning problems, as in real life raw datasets usually contain errors such as incorrect formats, inaccurate values, typographical mistakes, duplicates, and incomplete information.
Such errors can negatively affect the performance of machine learning algorithms, as they can increase the amount of time needed for learning and in the worst cases give a misleading outcome [53] [54]. In this thesis, data preparation is divided into four phases: data formatting, relevance analysis, resampling, and feature selection.


Figure 4.2: Data preparation model

Phase one: Data formatting

Phase one includes all procedures for formatting the raw datasets to make them compatible with the Weka software and for improving the naming and ordering of attributes. In particular, the activities involved in this phase include the following.

Unrecognizable character types were converted to equivalent recognizable characters. For example, the name of the player Dembélé was replaced by the equivalent characters Dembele. All datasets were converted to the standard file formats CSV and ARFF, which are supported by the Weka tool and commonly used in machine learning. Then, discretization of the datasets was done to group data into the top 10, 25, and 50 percent. In addition, the rating attribute was selected as the class attribute. A class label with values above the selected threshold was marked as the target class in each dataset.

Furthermore, the removal of incorrect data, replacement of missing values, and merging of duplicate information followed. Several players were identified as duplicates because they had transferred from one team to another across Europe. In this case, the record of the previous team was merged with the record of the new team. Missing values for some attributes were replaced by zero and others were replaced by the appropriate information collected from the Internet. For example, missing nationality information for some players was obtained from search engines.

Before normalization, the numeric attribute player_id was transformed to nominal so that it would be excluded automatically, since by default Weka tends to ignore nominal attributes and the class attribute when the normalization filter is applied. The normalization process was done by the min-max method, where a scale of 0 to 1 was used.

Phase two: Relevance analysis

Relevance analysis mainly involves data reduction and feature selection.
Feature selection is all about finding subsets of attributes of the dataset that can give the best results at lesser expense; more details about this will be given in the following section. Data reduction, on the other hand, is an exploratory activity that involves the removal of data with no or unwanted information. Normally in this stage, duplicate attributes are merged, attributes with a single value are eliminated, decimal places are reduced by rounding, and the data are partitioned according to a certain threshold. Many of the previous studies used the number of matches or minutes played in the season as a constraint for removing irrelevant instances [11] [55] [56] [57]. In any sport, high-rated players turn out to be relatively more important and more relied upon than other players in ensuring the success of a team [55]. On average, top-rated players have more game time than regular players because of their consistent fitness and excellent performances. For an individual player to be considered a good player, he must at least have a decent game time per season. However, in this thesis a different approach is used: no game time or other constraints are used to filter players, because the information given by the instances may be beneficial during training. If the relationship between game time and the rating attribute is insignificant, this will be captured and the attribute removed during the feature selection process.

Figure 4.3: Knowledge flow activities for data formatting process

Since the aim is to create binary prediction models for top-rated players, in this study the term top-rated refers to the top 10%, top 25%, and top 50% of players. Binary classification models need a binary class with values above or below the percentage threshold. Weka can only discretize the attributes by using equal-frequency binning (i.e., 50%-frequency binning). However, the aim is to have 10%, 25%, and 50% frequency binning; therefore, a TigerJython script is used to perform custom discretization. After running the script, three datasets were created for each of the defenders, goalkeepers, midfielders, and forwards groups. The first dataset had two labels: top ten percent and non-top ten percent players. The second dataset had two labels: top twenty-five percent and non-top twenty-five percent players. Lastly, the third dataset had two labels: top fifty percent and non-top fifty percent players.

Phase three: Resampling (dealing with class imbalance)

Discretizing the datasets to smaller percentage frequency bins such as 10% produced datasets with an unbalanced number of instances among classes. The ratios between the positive and negative class for the top 10% and 25% datasets were 1:10 and 1:4, respectively. When a pilot analysis was undertaken on these datasets, most of the models overfitted. The classifiers returned 100% classification accuracy.
Therefore, it was necessary to remove the class imbalance to be able to generalize the binary classification models, whose desired ratio between positive and negative class is 1:1. To achieve the desired ratio, the SMOTE percentage parameter of the algorithm is set by using the formula below. This formula is used to calculate the number of instances needed to reach a class ratio of 1:1. It takes the number of instances of the majority and minority class as input and returns the percentage of minority
