Prediction models for soccer sports analytics


Linköping University, Department of Computer and Information Science
Master thesis, 30 ECTS, Computer Science
LIU-IDA/LITH-EX-A--2018/021--SE
Prediction models for soccer sports analytics
Edward Nsolo
Supervisor: Niklas Carlson
Examiner: Patrick Lambrix
Linköpings universitet, SE Linköping

Copyright

The publishers will keep this document online on the Internet, or its possible replacement, for a period of 25 years starting from the date of publication, barring exceptional circumstances. The online availability of the document implies permanent permission for anyone to read, to download, or to print out single copies for his/her own use and to use it unchanged for non-commercial research and educational purposes. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional upon the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility. According to intellectual property law, the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.
For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page.
© Edward Nsolo

Abstract

In recent times there has been a substantial increase in research interest in soccer due to the increased availability of soccer statistics data. With the help of data provider firms, access to historical soccer data has become simpler, and as a result data scientists have started researching in the field. In this thesis, we develop prediction models that could be applied by data scientists and other soccer stakeholders. As a case study, we run several machine learning algorithms on historical data from five major European leagues and compare them. The study is built upon the idea of investigating different approaches that could be used to simplify the models while maintaining their correctness and robustness. Such approaches include feature selection and the conversion of regression prediction problems to binary classification problems. Furthermore, a literature review did not reveal research attempts on generalizing binary classification predictions by applying target class upper boundaries other than 50% frequency binning. Thus, this thesis investigated the effects of such generalization on the simplicity and performance of such models. We aimed to extend the traditional discretization of classes with an equal-frequency binning function, which is standard for converting regression problems into binary classification in many applications. Furthermore, we sought to establish important player features in individual leagues that could help team managers to have cost-efficient transfer strategies. The selection of those features was achieved successfully by the application of wrapper and filter algorithms. Both methods turned out to be useful, as the time taken to build the models was minimal and the models were able to make good predictions. Furthermore, we noticed that different features matter for different leagues.
Therefore, in assessing the performance of players, such consideration should be kept in mind. Different machine learning algorithms were found to behave differently under different conditions. However, Naïve Bayes was determined to be the best fit in most cases. Moreover, the results suggest that it is possible to generalize binary classification problems and maintain the performance to a reasonable extent. However, the early stages of generalizing binary classification models involve tedious work on training datasets, and that should be weighed as a tradeoff when considering this approach.


Acknowledgments

Firstly, I would like to express my sincere gratitude to my thesis examiner, Prof. Patrick Lambrix, and my supervisor, Prof. Niklas Carlson, of Linköping University for the opportunity to carry out this thesis project under their supervision. Their continuous support, guidance, and patience motivated me in the right direction, which led to the successful accomplishment of this thesis. Secondly, I would like to extend my gratitude to fellow schoolmates, friends, and family for their company, advice, and encouragement throughout my years of study and through the process of researching and writing this thesis. This accomplishment would not have been possible without them. Lastly, I would like to thank almighty God for good health and the opportunity of a scholarship to study in Sweden. This publication was produced during a scholarship period at Linköping University; thus, I would like to give special appreciation to the Swedish Institute scholarship.

Contents

Abstract
Acknowledgments
Contents
List of Figures
List of Tables
1 Introduction
1.1 Purpose
1.2 Research questions
1.3 Delimitations
2 Related work
3 Theory
3.1 Software (Weka)
3.2 Min-max normalization
3.3 Feature selection methods
3.4 Class imbalance
3.5 SMOTE (Synthetic Minority Oversampling Technique)
3.6 TigerJython with Weka
3.7 Machine learning algorithms
Evaluation of the prediction models
Research methods, techniques, and methodology
Pre-study
Experimental study
Methodology
Data pre-processing
Data collection
Data rescaling, missing values, and duplicates
Converting regression problem to binary classification problem
Feature selection
Feature selection with wrapper method
Feature selection with filter attribute evaluator
Performance of prediction models
Accuracy results of the prediction models
F1 Score results of the prediction models
7.3 AUC-ROC results of the prediction models
Discussion and conclusion
What are the best mechanisms for selecting essential features for predicting the performance of top players in European leagues?
What are the essential features for developing prediction models for top players in European leagues?
What are the useful classification models for predicting performance of top players in European leagues?
How can binary prediction models be generalized?
Future research
Bibliography
A Wrapper method results of the combined-leagues
B Attributes selected by Wrapper method of the combined leagues
C Execution time of wrapper method for the combined leagues
D Aggregated results of filter method for the combined leagues
E Model accuracy results of wrapper datasets for the combined leagues
F Model accuracy of filter-datasets for the combined leagues
G F1 score results of wrapper datasets for the combined leagues
H F1 Score results of the filter-datasets for the combined leagues
I AUC-ROC results of the wrapper datasets for the combined leagues
J AUC-ROC results of the filter-datasets
K Accuracy results for individual leagues
L F1 score results for individual leagues
M AUC-ROC results for individual leagues

List of Figures

4.1 A procedure for analyzing soccer sport historical data
Data preparation model
Knowledge flow activities for data formatting process
Feature selection with wrapper method knowledge flow model
Feature selection with filter method knowledge flow model
Merit of subsets of attributes selected
Execution time of wrapper subset evaluator
Model accuracy results
Overall F1 Score results of the combined leagues
Overall AUC-ROC results
C.1 Execution time of wrapper attribute evaluator for defenders datasets
C.2 Execution time of wrapper method for goalkeepers datasets
C.3 Execution time of wrapper method for midfielders datasets
C.4 Execution time of wrapper method for forwards datasets
E.1 Prediction model accuracy for defenders wrapper-dataset
E.2 Prediction model accuracy for midfielders wrapper-dataset
E.3 Model accuracy for the goalkeepers wrapper-datasets
E.4 Prediction model accuracy for forwards wrapped datasets
F.1 Model accuracy for defenders filter-datasets
F.2 Model accuracy for midfielders filter-datasets
F.3 Model accuracy for goalkeepers filter-datasets
F.4 Model accuracy for forwards filter-datasets
G.1 F1 Score results of the defenders wrapper-datasets
G.2 F1 Score results of the midfielders wrapper-datasets
G.3 F1 Score results of the goalkeepers wrapped datasets
G.4 F1 Score results of the forwards wrapped datasets
H.1 F1 Score results of the defenders filter-datasets
H.2 F1 Score results of the midfielders filter-datasets
H.3 F1 Score results of the goalkeepers filtered datasets
H.4 F1 score results of the forwards filtered datasets
I.1 AUC-ROC results of the defenders wrapper-datasets
I.2 AUC-ROC results of the midfielders wrapper-datasets
I.3 AUC-ROC results of the goalkeepers wrapper-datasets
I.4 AUC-ROC results of the forwards wrapper-datasets
J.1 AUC-ROC results of the defenders filter-datasets
J.2 AUC-ROC results of the midfielders filter-datasets
J.3 AUC-ROC results of the goalkeepers filter-datasets
J.4 AUC-ROC results of the forwards filter-datasets

List of Tables

3.1 Confusion matrix
List of all attributes
All datasets of the combined leagues
All datasets of the Bundesliga
All datasets of the EPL
All datasets of the La Liga
All datasets of the Ligue 1
All datasets of the Serie A
Important features for goalkeepers as selected by wrapper method
Important features for defenders as selected by wrapper method
Important features for midfielders as selected by wrapper method
Important features for forwards as selected by wrapper method
Important features for the combined leagues by wrapper attribute evaluator
Most frequent selected attributes by wrapper method in combined leagues
Important features for goalkeepers as selected by filter method
Important features for defenders as selected by filter method
Important features for midfielders as selected by filter method
Important features for forwards as selected by filter method
Important features for the combined leagues by filter attribute evaluator
Overall model accuracy
F1 Score results between wrapper and filter of the combined leagues
AUC-ROC results
Comparison between the wrapper and filter subset evaluator
Sample of actual predictions
A.1 Attribute selection with wrapper method for defenders top10pc dataset
A.2 Attribute selection with wrapper method for defenders top25pc dataset
A.3 Attribute selection with wrapper method for defenders top50pc dataset
A.4 Attribute selection with wrapper method for goalkeepers top10pc dataset
A.5 Attribute selection with wrapper method for goalkeepers top25pc dataset
A.6 Attribute selection with wrapper method for goalkeepers top50pc dataset
A.7 Attribute selection with wrapper method for midfielders top10pc dataset
A.8 Attribute selection with wrapper method for midfielders top25pc dataset
A.9 Attribute selection with wrapper method for midfielders top50pc dataset
A.10 Attribute selection with wrapper method for forwards top10pc dataset
A.11 Attribute selection with wrapper method for forwards top25pc dataset
A.12 Attribute selection with wrapper method for forwards top50pc dataset
A.13 Frequency of attributes being selected by several wrapper schemes
B.1 Support count of selected attributes with wrapper method for defender's top10pc dataset
B.2 Support count of selected attributes for wrapper method for defender's top25pc dataset
B.3 Support count of selected attributes for wrapper method for defender's top50pc dataset
B.4 Support count of selected attributes with wrapper method for goalkeeper's top10pc dataset
B.5 Support count of selected attributes for wrapper method for goalkeepers top25pc dataset
B.6 Count of selected attributes for goalkeepers top50pc dataset
B.7 Support count of selected attributes with wrapper method for midfielders top10pc dataset
B.8 Support count of selected attributes for wrapper method for midfielder's top25pc dataset
B.9 Count of selected attributes for midfielders top50pc dataset
B.10 Support count of selected attributes with wrapper method for forwards top10pc dataset
B.11 Support count of selected attributes for wrapper method for forwards top25pc dataset
B.12 Count of selected attributes for forwards top50pc dataset
D.1 Overall filter method results
K.1 Accuracy results of Bundesliga
K.2 Accuracy results of EPL
K.3 Accuracy results of La Liga
K.4 Accuracy results of Ligue 1
K.5 Accuracy results of Serie A
L.1 F1 results of Bundesliga
L.2 F1 results of EPL
L.3 F1 results of La Liga
L.4 F1 results of Ligue 1
L.5 F1 results of Serie A
M.1 AUC-ROC results of Bundesliga
M.2 AUC-ROC results of EPL
M.3 AUC-ROC results of La Liga
M.4 AUC-ROC results of Ligue 1
M.5 AUC-ROC results of Serie A


1 Introduction

Soccer is an ancient sport whose history can be traced back 2000 years to China [1]. To date, it is still recognized as the highest-ranked sport in the world in terms of attendance, broadcasting, and support. More than half of the global population follows the sport, and even in countries like India, the USA, and Canada, where it is not the most dominant sport, there are millions of players and fans [2].

Due to the empirical evidence of the success of machine learning techniques in other sports like baseball, the extent of recording historical soccer data changed dramatically as the need for new information and the advancement of machine learning tools increased [3]. An example of such analysis is the Moneyball study, which demonstrated how a team could improve performance despite limited resources by finding players with potential in features that are usually undervalued in predicting winning [4]. Determining essential features is not always as straightforward as it may appear [5] [6] [7].

According to FIFA rules, European soccer leagues have a transfer window which spans 16 weeks, during which soccer team managers tend to look all over the world for talented players who could strengthen their squads. The acquisition of such players can be motivated by many factors, including the excellent form of the player in the former league. Such motivation is unsatisfactory because European leagues are different: they require different levels of physicality and technicality to excel. What makes a good player in one league may not do so in another. Hence, there has been an increase in disappointment and criticism of top players underperforming in their current clubs after colossal player transfers turned out to be failures. An example of such a player is Paul Pogba of Manchester United, who performed well in Serie A, which made him the world's most expensive footballer in 2016, but failed to impress when transferred to the English Premier League in the same year [8]. Therefore, we see the importance of establishing different criteria in individual leagues that can guide team managers to make proper investments in buying players who will give good value for the money spent.

Furthermore, this thesis aims to produce prediction models that are simple and more accurate by grouping players into their specific roles, leagues, and performance levels. In general, we want to build general prediction models that cover many aspects of what makes a good player. First, we find essential sets of player features, for the following reasons. Soccer involves many features that may be irrelevant, uncontrollable, hard to measure, or hard to derive useful information from. All activities associated with the identification of players, training statistics, and rarely executed actions like penalties can be irrelevant for performance prediction models. Also, various features carry different weights depending on a player's role in a soccer match: essential features of defenders may not be essential for forwards or goalkeepers. For example, the number of saves is an essential feature for goalkeepers but not for other players. Also, the performance of machine learning algorithms depends on adequate training datasets. Attributes must be selected wisely and arranged according to the player roles.

Secondly, we test different machine learning algorithms, which involves processing the data, selecting features, running classifiers, and evaluating the performance of the models. Furthermore, different types of discretization are used to broaden the range over which players can be classified.

Soccer is a mega-profit business which attracts firms from many sectors such as clothes, shoes, and gambling [9]. Consequently, stakeholders demand simple prediction models to facilitate improvement in training, match tactics, and individual player performance. Thus, theoretically, it can be asserted that better prediction models may lead to better performance and profit for all soccer stakeholders. We believe that the models created here may be tested empirically and be beneficial to users. Data mining and machine learning are among the best methods available for making predictions and learning associations in soccer data. Therefore, in this master's thesis, data mining and machine learning techniques were carried out using historical data from top players in European leagues of the season for training and season for actual prediction testing.

1.1 Purpose

In attempting to build prediction models that analyze European soccer players, this thesis generally aims at developing simple prediction models that work as classification problems.
Furthermore, separate prediction models are built for each player category: defenders, goalkeepers, midfielders, and forwards. In fulfilling the main objective of the study, building soccer prediction models for top players in European leagues, the following specific targets form the basis of the thesis project:

- Find the essential features of players, according to their roles, for determining the performance of top players in European leagues.
- Develop binary classification models that predict the performance of top players in European leagues, on individual and consolidated league data.
- Analyze the player ranking schemes provided by soccer data providers.

1.2 Research questions

The analysis of the best prediction models for top players explicitly tries to answer the following research questions:

- What are the best mechanisms for selecting essential features for predicting the performance of top soccer players?
- What are the potential features for developing classification models for top players?
- What are the useful classification models for predicting top players?
- How can binary classification models be generalized?
- How accurate can the player ranking schemes of the soccer data providers be?

1.3 Delimitations

This thesis analyzes only players' historical data from five major European leagues: the English Premier League (EPL), La Liga Premier Division, Bundesliga, Serie A, and French Ligue 1. For each category of classification models available in the machine learning tools, popular representative algorithms were selected. The categories of selected classifiers are Bayes-based, function-based, lazy-learner, rule-based, and tree-based classifiers. As a resampling technique, only oversampling was applied, because there was not enough time to implement all algorithms with all possible combinations of parameters.


2 Related work

This chapter describes the contributions of other researchers and studies similar to this thesis, reviewing the application of data mining and machine learning in soccer. It looks at how data mining has been used to predict the performance of players, teams, match outcomes, and injuries. It also elaborates on current research in other sports analytics. Currently, the main research areas in soccer analytics are concerned with effective training, soccer officials, and the prediction of future young stars, match outcomes, injuries, team performances, and player ratings.

Prediction of player performance is among the most researched fields in soccer analytics, since players are the center of the sport. Several studies have established new and useful ways of predicting player performance using machine learning techniques. These include Vroonen et al. [10], who proposed advanced techniques for predicting the future performance of young soccer players. Vroonen et al. used similarity metrics to create an advanced system that predicts the performance of younger players of the same age. The approach was beneficial, as it outperformed the baseline results. More research is possible to cover other aspects of this study, as it only included players of the same age rated above 0.9/1. Therefore, a similar study could be conducted on players of different ages, and the rating threshold could be lowered to 0.8/1, which is still reasonably high.

A common problem for many researchers is the use of ineffective historical data, because not all match events give useful information for analysis. There has been inadequate research focusing on match phases (phases that led to a goal or a shot). Only recently have some authors begun to concentrate on match phases in the performance analysis of historical soccer data. Decroos et al. [11] grouped several match events using a dynamic time warping technique and then applied an exponential-decay-based approach to calculate the performance of players in the distinct phases. Furthermore, S. Brown [12] conducted a similar study, but with PageRank as the approach. Both approaches had high accuracy, but these studies overlooked the fact that the match events leading to a goal or a shot mostly favor attackers and midfielders and, as a result, discriminate against defenders and goalkeepers. For improved results, the authors could have separated the datasets according to different game roles, which is an approach emphasized in this thesis.

Soccer analysis considers not only match events but also other attributes such as postural and physiological characteristics. There are studies, such as the one by Lara et al. [13], which analyze prediction systems that use player movement and balance statistics. The authors used decision trees (CART) and logistic regression as machine learning algorithms. The observations revealed that the performance of players does not depend on balance data and that logistic regression outperformed decision trees (CART). Given that there are many machine learning techniques, it would be interesting to replicate the study and compare the results.

Furthermore, most sports analytics, including soccer analytics, predict player performance using only past records. Some researchers incorporate live match statistics to produce live predictions. Cintia et al. [14] proposed a model which considers the use of live match statistics and uses a chi-square technique. The results showed that the application of live statistics yielded 50% accuracy, which is below the desirable performance.

Some soccer analysts tend to use many complicated features when forecasting the performance of players. Consequently, the resulting prediction models become incomprehensible and deceptive. Over time, there has been an effort from the research community to find useful prediction models that are applicable in real life. Brandt and Brefeld [15] discovered that focusing on just a few features and simple machine learning techniques such as PageRank, C5.0, and SVMs (RBF kernels) could increase the accuracy of prediction models. Similarly, G. Kumar [16] reached the same conclusion by using standard feature selection mechanisms. Going beyond the work of Brandt and Brefeld, G. Kumar used many learning algorithms to figure out essential player attributes. The algorithms used include Linear Regression, SMOreg, Gaussian Processes, LeastMedSq, M5P, Bagging with REP Tree, Additive Regression with Decision Stump, REP Tree, J48, Decision Tables, Multilayer Perceptron, Simple Linear Regression, Locally Weighted Learning, IBk, KStar, and RBF Network. However, some of Kumar's [16] claims contradict the assertion of Cintia et al. [14] that the application of live statistics leads to better accuracy than historical data.

3 Theory

This chapter describes the tools and machine learning techniques used in this thesis. It gives the necessary background on the concepts in the text and on the tools applied. For each tool and method, a link to previous research is given to show how other researchers used similar approaches, and where a different approach is used, a motivation supports it. Note that the concepts presented in this chapter are described at a high level, on the assumption that interested readers will read the references provided to acquaint themselves with the more in-depth concepts.

This chapter is organized into seven sections. Section 3.1 presents the tools used for the project. Section 3.2 describes the technique used for rescaling the data. Section 3.3 elaborates on different mechanisms for feature selection, especially those applied in this thesis. Sections 3.4 and 3.5 illustrate class imbalance and ways of solving the problem with a resampling technique. Section 3.6 presents the scripting package used to extend discretization. Lastly, Section 3.7 presents high-level knowledge about the different classification algorithms used in this thesis.

3.1 Software (Weka)

Weka is open-source software for small and large-scale machine learning tasks. The software has been designed to perform all core machine learning functionalities, including data preprocessing, classification, clustering, feature selection, and visualization. Since its first release in 1993, the tool has been progressively accepted and widely used for data mining research and across other fields because of its simplicity and extensibility [17] [18]. Apart from soccer, Weka has also been used in other domains of research. Therefore, there is enough confidence that Weka is suitable for research similar to this [19] [20].

Weka has five modules which perform the same tasks in different ways. For this thesis, the first three modules (Explorer, Experimenter, and Knowledge Flow) were used to accomplish most of the project tasks. The first module (Explorer) was used for exploring the datasets to determine the kinds of filters and parameters that needed to be applied. The second module (Experimenter) was used to run algorithms and perform tests on each dataset after the data had been cleaned and features had been selected. The third module (Knowledge Flow) was used for designing and executing sub-steps of the data pre-processing procedure. Most of the data pre-processing procedures are done with Knowledge Flow models: missing values, normalization, feature selection, and resampling are handled there. The discretization task is done by the extension module TigerJython, which enables running Jython scripts for custom discretization and pruning of the datasets. The Workbench and Simple CLI modules were not used because the first three modules were satisfactory for this project.

3.2 Min-max normalization

Normalization is an important step in any data-preparation process. It is used to transform data to follow a smaller and consistent range across all attributes within datasets; normally the preferred range is from 0 to 1. In many cases, attributes within the same dataset are on different scales, which affects the efficiency of machine learning algorithms, especially those that use distance measures as a criterion for learning [21] [22]. When datasets contain attributes with different scales, attributes with larger scales tend to overshadow attributes with smaller scales by distorting attribute weight proportions in distance calculations. As a result, algorithms run slower and produce misleading outcomes.

Normalization can be done using several techniques, including min-max, mean, and standard deviation normalization methods. Generally, no normalization method is better than another. Depending on the type of data and the machine learning algorithm used, minor differences may be noticed. Previous studies suggest that the min-max normalization technique can lead to a model with slightly higher accuracy, lower complexity, and less training time compared to other techniques [23]. The min-max normalization technique transforms given attribute values to a range of 0 to 1 or -1 to 1 by using the minimum and maximum attribute values within a dataset. Because many researchers have used the min-max technique and it has been shown to have benefits, this study chose the technique for feature rescaling.
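As an illustration of the rescaling described above, here is a minimal Python sketch. It is not the Weka filter used in the thesis: the function name `min_max_normalize`, its `new_min`/`new_max` parameters, and the sample attribute values are hypothetical, chosen only to show how each value is shifted by the attribute minimum and divided by the attribute range.

```python
from typing import List


def min_max_normalize(values: List[float],
                      new_min: float = 0.0,
                      new_max: float = 1.0) -> List[float]:
    """Linearly rescale values so the minimum maps to new_min and the maximum to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map every value to new_min
        # (an assumption of this sketch, to avoid dividing by zero).
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Two made-up player attributes on very different scales.
minutes_played = [90.0, 1800.0, 3420.0]
pass_accuracy = [0.62, 0.80, 0.91]

print(min_max_normalize(minutes_played))             # rescaled to the range 0..1
print(min_max_normalize(pass_accuracy, -1.0, 1.0))   # rescaled to the range -1..1
```

After rescaling, both attributes span the same range, so neither dominates a distance calculation simply because of its units.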
Other methods could have been applied as well and still have produced reasonable results. The formula for min-max normalization is given in equation 3.1:

\[
\text{Normalized value of } x = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{3.1}
\]

3.3 Feature selection methods

Feature selection is the process of finding an optimal subset of attributes that enhances the performance of the classification model. It is common for machine learning problems to have attributes that are redundant or do not influence the target class [24] [25]. Such attributes are referred to as insignificant or undesired features. Removing insignificant attributes in many cases improves the performance, robustness, and simplicity of the classification models [26].

There are four reasons why feature selection is essential for machine learning processes. First, the feature selection process produces datasets with fewer attributes and therefore improves the simplicity and interpretability of classification models. Second, feature selection significantly reduces learning time and hence improves efficiency [27]. Third, feature selection reduces the amount of noise and outliers and hence decreases the problem of overfitting. Fourth, feature selection generates dense datasets that in turn increase the statistical significance of the values in the datasets [25].

Standard feature selection techniques in machine learning include filters, wrappers, and embedded methods. The filter method is a technique that uses the correlation between a normal attribute and the class attribute. The features that score above the desired correlation threshold are selected as potential attributes [28]. Filter methods are more efficient but disregard the inter-correlation between normal attributes. The wrapper method is another technique, which may be used when interrelated features exist. The wrapper method uses machine learning algorithms to cross-check multiple subsets of attributes and saves the set with optimum performance as a potential set of attributes [27]. Unlike filters, this technique considers not only the relationship between attributes and the class attribute but also the interrelation between attributes. When choosing between the wrapper and filter methods becomes a challenge, the embedded method can be used, because it combines both filter and wrapper methods with the aim of utilizing the benefits of both techniques [25] [29]. All techniques have benefits and drawbacks; it cannot be generalized that one technique is better than another. Generally in machine learning, all techniques are executed, and then the method that leads to optimal performance and robustness is selected and included as part of the model.

3.4 Class imbalance

When the values of the class attribute are not equally distributed within a dataset, a class imbalance problem can arise, which affects the interpretation of the prediction model's accuracy [30]. Commonly, this problem causes the accuracy of the majority class to dominate the accuracy of the minority class, which misleads the interpretation of the results. To better illustrate the effects of this problem, consider a dataset with 1000 cancer test results wherein ten records tested cancer-positive while 990 records tested cancer-negative. If the results appeared as described in the confusion matrix below, it is apparent that, though the accuracy is 98.2%, the model is unsuitable for this situation. The accuracy for the positive class is extremely low (0.2%) and is dominated by the negative class accuracy (98%). By implication, 99.8% of all tested cancer-positive predictions will be wrong despite the reasonable model accuracy of 98.2%.

                      Actual positives (cancer +ve)   Actual negatives (cancer -ve)
Predicted positive    TP = 2                          FP = 8
Predicted negative    FN = 10                         TN = 980

\[
\begin{aligned}
\text{Model accuracy} &= \text{Positive class accuracy} + \text{Negative class accuracy} \\
&= \frac{TP}{TP + TN + FP + FN} \times 100\% + \frac{TN}{TP + TN + FP + FN} \times 100\% \\
&= \frac{2}{1000} \times 100\% + \frac{980}{1000} \times 100\% \\
&= 0.2\% + 98\% = 98.2\%
\end{aligned}
\tag{3.2}
\]

The class imbalance problem is most problematic for decision tree classifiers and small datasets [30] [31]. Since this thesis project used several decision tree algorithms, such as J48 and Random Forest, it was deemed necessary to resample the data. Additionally, some of the datasets generated for goalkeepers and forwards were small, which made it all the more important to apply resampling techniques that address class imbalance. Several techniques can be used to deal with class imbalance, including under-sampling, oversampling, and a blended method which uses both. This thesis used only the oversampling technique called SMOTE, which is described in section 3.5.

3.5 SMOTE (Synthetic Minority Oversampling Technique)

SMOTE is a popular method of handling class imbalance. It is an oversampling technique that synthetically generates new instances of the minority class to be added to the
dataset to match the number of instances of the majority class. Oversampling is the process of increasing the number of instances of an underrepresented class in a dataset, which increases the statistical significance of the minority class and in turn reduces the degree to which models overfit. Since the class ratios between minority and majority class in the datasets used for this thesis were 1 to 4 and 1 to 10, which are high, oversampling is more suitable than the other resampling techniques [31]. Unlike traditional oversampling, SMOTE does not just add random copies; for each selected minority instance it looks up the specified number of nearest neighbors (K) and creates synthetic instances between the instance and those neighbors until the SMOTE percentage is reached. The technique has been shown to be effective and is popular among researchers [32]. The SMOTE algorithm takes three parameters: the minority class, the percentage of instances to be added, and the number of nearest neighbors. If the parameter values are not chosen properly, SMOTE may introduce noise and hence reduce the accuracy of the classification models [33]. The selection of good values for the SMOTE parameters is explained in the methodology (Section 4.3).

3.6 TigerJython with Weka

TigerJython is a simple development platform for writing Jython scripts. It gives Weka the extra capability of executing Python and Java scripts to perform machine learning tasks with much more control than the graphical user interfaces offer [34]. To use scripting languages like Jython, the TigerJython plugin was installed through the package manager. With TigerJython a user can directly invoke Weka's Java classes, as well as Java and Python modules, and extend the limited functionality as desired. For this thesis, a Jython script is used to extend the discretization functionality.

3.7 Machine learning algorithms

This section presents the underlying knowledge about the machine learning algorithms used for feature selection and classification.
It begins by briefly describing how the algorithms work and presents the benefits and challenges reported in previous research. Furthermore, for each algorithm the expectations and trade-offs are briefly discussed.

Bayesian network

Bayesian network classifiers use probability distributions over a directed acyclic graph (DAG) to make predictions. The leaf nodes represent the attributes and the parent node represents the predicted class. Bayesian networks are among the most widely used classifiers in machine learning because they are robust and perform well on classification problems with complex and conflicting information. Moreover, Bayesian network techniques work well with datasets of any size, a property that is beneficial for this thesis, as the size of the processed data is between 285 and 1857 instances. On the other hand, the main drawback of the algorithm is its inability to deal well with continuous data [35]. In many cases, discretization must be applied before Bayesian network classifiers can be used, which in turn disturbs the linear relationship of the data.

Naïve Bayes

Naïve Bayes is another type of classifier based on Bayes' theorem of conditional probabilities. It is considered a simplified version of the Bayesian network classifier in which the attributes are treated as independent, meaning that knowing the value of one attribute does not imply the value of the other attributes. As its name suggests, the classifier is simple and efficient, because ignoring the class-conditional relationships between attributes reduces the computational work. Like the general Bayesian network classifier, Naïve Bayes works well with both small and large datasets. Although it is a simple classifier, many studies have found Naïve Bayes to
outperform many classifiers, even sophisticated ones [36]. On the other hand, Naïve Bayes has drawbacks: accuracy decreases because information about conditional dependencies is ignored, and the classifier forces the use of discretization, which in turn negatively affects certain types of datasets [37].

Logistic Regression

Logistic Regression is a classifier that falls under the function category. It works similarly to linear regression, except for the relation function and the type of class attribute used. It has the following advantages over linear regression: it works well with categorical class attributes, it is designed for binary classification [38], and it works relatively well on skewed datasets in which a linear relationship is not well observed. The algorithm uses the logistic (population growth) function to express the relationship between the input variables and the predicted class. The equation for Logistic Regression comprises four variables: y is the predicted class, x is the value of the independent attribute, b_0 is the intercept of the logistic equation, and b_1 is the coefficient of the single independent attribute.

\[
\text{predicted class } (y) = \frac{e^{b_0 + b_1 x}}{1 + e^{b_0 + b_1 x}} \tag{3.3}
\]

Lazy classifiers

Unlike other types of classifiers, lazy learners use the instances of the training dataset directly to make predictions. Other types of classifiers first create a model and then use it with testing data to make predictions. The name "lazy" stems from the tendency of not producing a model at all. Lazy learners are sometimes referred to as instance-based learners; examples are the KNN (IBK) and KStar classifiers. Except for KStar, which uses an entropy-based distance measure to make predictions, most lazy classifiers use standard distance measures such as the Euclidean distance [39].
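
The instance-based idea described above can be sketched in a few lines. This is a minimal k-nearest-neighbor classifier with Euclidean distance, not Weka's IBK implementation; the function names, the two-attribute toy data, and the labels "top" and "non-top" are assumptions made for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equally long numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training instances. `train` is a list of (feature_vector, label)."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical, already normalized training instances.
train = [
    ([0.9, 0.8], "top"),
    ([0.8, 0.9], "top"),
    ([0.2, 0.1], "non-top"),
    ([0.1, 0.3], "non-top"),
    ([0.3, 0.2], "non-top"),
]
print(knn_predict(train, [0.85, 0.85], k=3))
```

No model is built in advance: every prediction scans the stored training instances, which is why prediction cost and storage grow with the size of the training data.
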
Although the idea behind lazy learners is simple, they are effective, especially for small datasets. However, when large or high-dimensional datasets are used as input, lazy classifiers perform poorly and require large storage capacity [40]. Another challenge is that it can be difficult to determine an appropriate number of nearest neighbors K, because repeatedly guessing the K value can be time-consuming. Furthermore, since lazy learners make predictions directly from the training datasets, the training datasets must be updated continually as new data become available.

Rule-based classifiers

Rule-based classifiers generate a set of rules of the form (IF condition THEN class label) and use them to make predictions. They are efficient and straightforward learners and are widely used for many classification problems. For this thesis, the two rule-based classifiers used are ZeroR and PART. ZeroR (Zero Rules) is the simplest classifier; it focuses on the target class and ignores all other attributes. It is mainly used to establish a baseline accuracy, that is, the minimum accepted accuracy for other classifiers [41]. PART is another rule-based classifier that combines two techniques to make predictions. It works like C4.5 but additionally prunes the C4.5 decision tree separately in each iteration and selects the best leaf as a rule [42]. Furthermore, the Decision Tables classifier completes the list of rule-based algorithms that we use. Decision Tables are known as one of the most straightforward and efficient classifiers; they use the simple logic of decision tables to classify instances of data. They resemble decision trees in the classification process. What distinguishes them is how the decisions or rules are presented.
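
To make the role of the ZeroR baseline concrete, the sketch below predicts the most frequent class for every instance and reports the resulting accuracy. It is a plain-Python illustration, not Weka's ZeroR class; the function names and the ten example labels are assumptions.

```python
from collections import Counter


def zero_r_fit(labels):
    """Return the majority class of the training labels; ZeroR ignores
    every attribute other than the class."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels with a 1:9 class ratio.
labels = ["top"] + ["non-top"] * 9
majority = zero_r_fit(labels)
baseline = accuracy([majority] * len(labels), labels)
print(majority, baseline)
```

On imbalanced data this baseline is already high, so a classifier is only useful if it clearly exceeds the baseline or improves on metrics other than accuracy.
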

Decision tree classifiers

Decision tree classifiers are non-parametric algorithms that build classification and regression models by using decision tree concepts. The trees are built so that the leaf nodes represent the dependent class attribute, while the root and internal nodes represent the independent input attributes. To make a prediction, a decision tree is first generated and stored as a set of rules that is then used to determine the class value of a new instance. For this study, two of the standard decision tree classifiers, J48 and Random Forest, were used. J48 is a Java implementation of C4.5, which was proposed by Ross Quinlan as an improvement of its predecessor, the ID3 algorithm [43]. Unlike ID3, J48 works well with both numeric and categorical data and applies pruning techniques to minimize errors [44].

Random Forest is the other tree-based classifier used for this thesis. It works by building multiple decision trees generated randomly from subsets of the training dataset. To classify a new instance, each decision tree predicts the class to which the instance belongs, and the class with the most predictions is selected as the class of the instance. The benefits of this algorithm include robustness and the capability to handle large and high-dimensional datasets appropriately. However, Random Forest has disadvantages as well. First, Random Forest is not as good for regression problems as for classification problems. Second, the optimal number of features is not known intuitively, so trial and error is required.

3.8 Evaluation of the prediction models

Determining the best prediction model is always a difficult task that requires a great deal of knowledge of machine learning techniques and a broad range of performance metrics [45]. A classifier's performance can be measured by using simple metrics such as the confusion matrix or advanced metrics that include multiple interpretations.
Examples of measures used to evaluate the performance of machine learning algorithms include the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), accuracy, precision, recall, the F1 measure, the Area Under the Curve of the Receiver Operating Characteristic (AUC-ROC), and robustness in terms of the time taken for training and testing, to mention a few.

Confusion matrix

The confusion matrix is the essential means of evaluating the performance of prediction models. It is composed of four fundamental metrics that can be used as criteria when selecting the best prediction model. Commonly, machine learning algorithms aim for the highest possible score on two of the metrics, TP and TN, and the lowest possible score on the other two, FN and FP [46].

TP: The desirable element of the confusion matrix that presents the number of correct predictions for the positive/target class.
TN: The desirable element of the confusion matrix that presents the number of correct predictions for the negative/non-target class.
FP: The undesirable element of the confusion matrix that presents the number of incorrect predictions for the positive/target class.
FN: The undesirable element of the confusion matrix that presents the number of incorrect predictions for the negative/non-target class.

The confusion matrix can be used as a quick way of analyzing the performance of the models. However, it cannot be used to compare models, because it does not have a single metric that incorporates all four values. If the confusion matrix alone is to be used,
the analyst will have to consider the values of TP, TN, FP, and FN individually; otherwise, the results may not reflect the real performance.

Table 3.1: Confusion matrix

                   Predicted positive   Predicted negative
Actual positive    TP                   FN
Actual negative    FP                   TN

Prediction accuracy

The second metric that can be used to measure the performance of prediction models is accuracy, which is the ratio of correct predictions (TP + TN) to all predictions (TP + TN + FP + FN), usually expressed as a percentage. Accuracy measures overall correctness but does not reveal the extent of FP and FN. In sensitive domains such as health, aviation, and security, high values of FP and FN are intolerable. Therefore, accuracy needs to be used along with other metrics, such as F1 and AUC, to make accurate evaluations [47] [45]. The formula for accuracy is:

\[
\text{Model accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.4}
\]

F1 measure

The third metric for measuring the accuracy of classifiers is the F1 score. This metric presents the combined effect of FP and FN, which are reflected in the precision and recall values [48]. Precision indicates the extent of the false positives, while recall indicates the extent of the false negatives. F1 ranges from 0 to 1, where 1 is a perfect prediction. A higher F1 value indicates that the prediction model has few FP and FN.

Precision: Given by the ratio of TP to (TP + FP)
Recall: Given by the ratio of TP to (TP + FN)
F1 measure: Given by twice the ratio of the product to the sum of precision and recall [49]

\[
F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.5}
\]

Area Under the Curve (AUC)

The AUC summarizes a curve that plots the TP rate (recall) against either the FP rate or precision [50].
It is an easy way to depict the relationship between recall and specificity, or between recall and precision, at different thresholds [45]. The former is generally known as AUC-ROC (Receiver Operating Characteristic) and the latter as AUC-PRC. The two correspond to each other, but in this thesis AUC-ROC was used rather than AUC-PRC because it forms a somewhat smoother curve.

Specificity: Measures the degree to which actual negatives are predicted correctly. It is given by the ratio of TN to (TN + FP)
AUC-ROC plot: Recall against (1 - specificity), i.e., TP rate vs. FP rate
AUC-PRC plot: Precision vs. Recall

AUC-ROC and AUC-PRC range from 0 to 1, where 1 is a perfect prediction and 0 is a completely wrong prediction.
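
The metrics of this section can be computed directly from the four confusion-matrix counts. The sketch below applies equations 3.4 and 3.5 to the cancer example of section 3.4 (TP = 2, FP = 8, FN = 10, TN = 980) and also shows one common way to obtain AUC-ROC from classifier scores, using the known equivalence between the area under the ROC curve and the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The function names and the score list are assumptions for illustration; the thesis itself reads these metrics from Weka.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, specificity, and F1 from the counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


def roc_auc(scores, labels):
    """AUC-ROC as the share of (positive, negative) pairs in which the
    positive instance has the higher score; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Cancer example from section 3.4.
metrics = confusion_metrics(tp=2, fp=8, fn=10, tn=980)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# Hypothetical scores: 1 marks the positive class, 0 the negative class.
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

For the cancer example the accuracy is 0.982 while precision, recall, and F1 all stay at or below 0.2, which is the gap between accuracy and the other metrics that the section describes.
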

4 Research methods, techniques, and methodology

This chapter presents the research methods, techniques, and methodology, outlining step by step the procedures that led to the accomplishment of the thesis project. Many research methods are commonly used in computer science, including simulation, experimentation, observation, pre-study, literature study, and comparative study. This thesis used three common approaches that are relevant to machine learning research: pre-study, literature study, and experimental study. In respective order, Sections 4.1, 4.2, and 4.3 describe the pre-study activities undertaken to start the project, give an overview of how the experimental study was conducted, and present the methodology used for this thesis.

4.1 Pre-study

As the starting point of the thesis project, a pre-study was conducted as a means of getting familiar with ongoing research in soccer analytics. The aim was to build a foundation for the thesis by analyzing different methodologies, tools, algorithms, and areas explored in the past. This was carried out as a literature review in which several sports analytics and machine learning publications were collected. Furthermore, the literature study was used to acquire the necessary knowledge of some of the concepts needed for the thesis work. The sports analytics and machine learning journals and conferences used for this thesis include:

- MIT Sloan Sports Analytics Conference
- MLSA: Machine Learning and Data Mining for Sports Analytics (MLSA workshop)
- IEEE Transactions on Pattern Analysis and Machine Intelligence
- Machine Learning: ECML-95 conference
- Journal of Machine Learning Research
- Journal of Artificial Intelligence Research

Furthermore, search engines were used to collect important information for the study. Some examples of search terms include:
- team player injuries young talent && prediction forecasting && models
- feature attribute && selection reduction && techniques methods
- soccer football && machine learning data mining artificial intelligence ai
- basketball baseball health && machine learning data mining artificial intelligence ai
- training && machine learning data mining artificial intelligence ai

4.2 Experimental study

Experimental research is a quantitative study that aims to find, validate, and analyze the studied cases in relation to the different parameters applied [51] [52]. It can be combined with observations and comparative studies between several conditions. The experiments in this thesis were undertaken in the form of a supervised machine learning process in which, in each phase described in section 4.3, the performance of players in five European leagues was analyzed. The comparisons made in the study include those between:

- wrapper and filter feature selection techniques;
- Top10, Top25, and Top50 rankings of players;
- the Bundesliga, EPL, La Liga, Ligue 1, and Serie A leagues;
- the Bayes network, Naïve Bayes, Logistic Regression, IBK, PART, Decision Tables, J48, Random Forest, and ZeroR classifiers.

The experiments were aided by the machine learning tool Weka, which enabled data pre-processing, training, and evaluation of the algorithms. Along with the tool, Java and Python scripts were implemented to supplement Weka's limited functions. The purpose of the scripts was to extend data discretization to any desired ranking split and to implement pruning of the raw datasets following the data cleaning and feature selection procedures.

4.3 Methodology

The purpose of this section is to present the processes undertaken while forming prediction models for soccer players. The section commences by elaborating on the methods used for data collection and preparation.
It describes the process involving reduction, resampling, and rescaling of the data. The second subsection is about the feature selection procedure, in which a list of the most important features is selected for each dataset. The third subsection explains the procedure used to generate the training and testing datasets and shows its validity in the context of this study. The fourth and fifth subsections elaborate on the development, execution, and testing of the prediction models. Furthermore, for each classifier and filter used, the choices of parameter values used in the experiments are explained. The final subsection is about the analysis of the performance of the machine learning models created. Different performance metrics, including accuracy, precision, recall, and AUC (ROC), are presented together with additional parameters regarding robustness. Figure 4.1 summarizes the overall methodology for this thesis.

Figure 4.1: A procedure for analyzing soccer sport historical data

Data collection

In this thesis, data were collected from five European leagues: the English Premier League (EPL), the Spanish league (La Liga), the German league (Bundesliga), the Italian league (Serie A), and the French league (Ligue 1). Players were arranged according to their different roles. WhoScored was used as the main data provider, considering that most of its data were acquired from Opta. Many secondary data providers, such as Squawka, also use Opta as a primary data source, and the differences in the rating of players' performances between data providers are small; therefore they are considered equally reliable. WhoScored and other secondary data providers use internal schemes developed by a group of soccer experts to rate player and team performances. These small differences gave confidence that the choice of data provider does not matter as long as acquiring the data is not expensive in terms of time and resources. Previous researchers have also used WhoScored as their main source of data, which indicates that the information provided is accessible and can be used for this thesis as well. One example is the study by Decroos et al. [11] on a spatio-temporal action rating system for soccer.

The data collection process led to two groups of datasets, which in total contained 5212 instances. Group one separated players according to their leagues and further divided them into four raw datasets according to the position of the player on the pitch. Group two combined players from all leagues into four datasets: goalkeepers, defenders, midfielders, and forwards. Each of the two groups had a total of 2606 instances. Players whose primary role was defending and whose position was at the back were categorized as defenders. Players whose primary role was playmaking and whose position was central were considered midfielders. Players whose primary responsibility was attacking and whose position was at the front were categorized as forwards. The remaining players were categorized as goalkeepers. The smallest dataset was that of the goalkeepers, which contained 222 instances. The largest dataset was that of the midfielders, which had 1109 instances. The remaining datasets, which belong to defenders and forwards, had 970 and 305 instances, respectively.

Data preparation

Data preparation is an essential procedure for machine learning problems, as real-life raw datasets usually contain errors such as incorrect formats, inaccurate values, typographical mistakes, duplicates, and incomplete information.
Such errors can negatively affect the performance of machine learning algorithms, as they can increase the amount of time needed for learning and in the worst case give a misleading outcome [53] [54]. In this thesis, data preparation is divided into four phases: data formatting, relevance analysis, resampling, and feature selection.

Figure 4.2: Data preparation model

Phase one: Data formatting

Phase one includes all procedures for formatting the raw datasets to make them compatible with the Weka software and for improving the naming and ordering of attributes. In particular, the activities involved in this phase include:

- Conversion of unrecognizable character types to equivalent recognizable characters. For example, the name of the player Dembélé was replaced by the equivalent characters Dembele.
- Conversion of all datasets to the standard file formats CSV and ARFF, which are supported by Weka and commonly used in machine learning.
- Discretization of the datasets to group the data into the top 10, 25, and 50 percent. In addition, the rating attribute was selected as the class attribute. In each dataset, a class label with values above the selected threshold was marked as the target class.
- Removal of incorrect data, replacement of missing values, and merging of duplicate information. Several players were identified as duplicates because they had transferred from one team to another across Europe. In these cases, the record of the previous team was merged with the record of the new team. Missing values for some attributes were replaced by zero, and others were replaced by the appropriate information collected from the Internet. For example, the missing nationality information of some players was obtained from search engines.

Before normalization, the numeric attribute player_id was transformed to nominal so that it would be excluded automatically, as Weka by default ignores nominal attributes and the class attribute when the normalization filter is applied. Normalization was done by the min-max method with a scale of 0 to 1.

Phase two: Relevance analysis

Relevance analysis mainly involves data reduction and feature selection.
Feature selection is about finding subsets of the attributes of a dataset that give the best results at lower cost; more details are given in the following section. Data reduction, on the other hand, is an exploratory activity that involves the removal of data with no or unwanted information. Normally in this stage, duplicate attributes are merged, attributes with a single value are eliminated, decimal places are reduced by rounding, and the data are partitioned according to a certain threshold. Many previous studies used the number of matches or the minutes played in the season as a constraint for removing irrelevant instances [11] [55] [56] [57]. In any sport, high-rated players turn out to be relatively more important and more relied upon than other players in ensuring the success of a team [55]. On average, top-rated players have more game time than regular players because of their consistent fitness and excellent performances. For an individual player to be considered a good player, he must at least have a decent amount of game time per season. However, this thesis takes a different approach: no game time or other constraints are used to filter players, because the information in those instances may be beneficial during training. If the relationship between game time and the rating attribute is insignificant, this will be captured and the attribute removed during the feature selection process.

Figure 4.3: Knowledge flow activities for data formatting process

Since the aim is to create binary prediction models for top-rated players, in this study the term top-rated refers to the top 10%, top 25%, and top 50% of players. Binary classification models need a binary class with values above or below the percentage threshold. Weka can discretize attributes only by equal-frequency binning (i.e., 50%-frequency binning). However, the aim is to have 10%, 25%, and 50% frequency binning, so a TigerJython script is used to perform custom discretization. After running the script, three datasets were created for each of the defenders, goalkeepers, midfielders, and forwards. The first dataset had two labels: top ten percent and non-top ten percent players. The second dataset had two labels: top twenty-five percent and non-top twenty-five percent players. Lastly, the third dataset had two labels: top fifty percent and non-top fifty percent players.

Phase three: Resampling (dealing with class imbalance)

Discretizing the datasets with smaller percentage frequency binning, such as 10%, produced datasets with an unbalanced number of instances among the classes. The ratios between the positive and negative class for the top 10% and top 25% datasets were 1:10 and 1:4, respectively. When a pilot analysis was undertaken on these datasets, most of the models overfitted. The classifiers returned 100% classification accuracy.
Therefore, it was necessary to remove the class imbalance in order to generalize the binary classification models, for which the desired ratio between the positive and negative class is 1:1. To this end, the SMOTE percentage parameter of the algorithm is set by using the formula below to achieve the desired ratio. This formula is used to calculate the number of instances needed to reach a class ratio of 1:1. It takes the number of instances of the majority and minority class as input and returns the percentage of minority instances to be added.
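
The two preprocessing steps described in this part of the methodology, labelling the top share of players and choosing a SMOTE percentage that yields a 1:1 class ratio, can be sketched as follows. Neither the thesis's Jython script nor its exact formula is reproduced in the text, so both functions are assumptions: `label_top_fraction` is one straightforward way to threshold ratings, and `smote_percentage_for_balance` is simply the arithmetic needed to equalize two class counts, assuming the percentage is interpreted relative to the current size of the minority class. The ratings are made-up values.

```python
def label_top_fraction(ratings, fraction):
    """Label the highest-rated `fraction` of players as 'top' and the
    rest as 'non-top'. Players tied with the threshold rating are all
    labelled 'top', so the top group can be slightly larger."""
    n_top = max(1, round(len(ratings) * fraction))
    threshold = sorted(ratings, reverse=True)[n_top - 1]
    return ["top" if r >= threshold else "non-top" for r in ratings]


def smote_percentage_for_balance(n_majority, n_minority):
    """Percentage of synthetic minority instances, relative to the current
    minority size, needed for both classes to have the same size."""
    if n_minority <= 0:
        raise ValueError("minority class must contain instances")
    if n_majority <= n_minority:
        return 0.0
    return (n_majority - n_minority) / n_minority * 100.0


# Hypothetical player ratings.
ratings = [6.1, 7.9, 6.8, 7.2, 6.5, 8.3, 6.9, 7.0, 6.4, 6.6]
labels_top10 = label_top_fraction(ratings, 0.10)
n_top = labels_top10.count("top")
n_rest = labels_top10.count("non-top")
print(n_top, n_rest, smote_percentage_for_balance(n_rest, n_top))
```

With one top-rated player and nine others, eight synthetic top-rated instances are needed, which corresponds to a percentage of 800 under the stated assumption.
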


More information

Safety Assessment of Installing Traffic Signals at High-Speed Expressway Intersections

Safety Assessment of Installing Traffic Signals at High-Speed Expressway Intersections Safety Assessment of Installing Traffic Signals at High-Speed Expressway Intersections Todd Knox Center for Transportation Research and Education Iowa State University 2901 South Loop Drive, Suite 3100

More information

Queue analysis for the toll station of the Öresund fixed link. Pontus Matstoms *

Queue analysis for the toll station of the Öresund fixed link. Pontus Matstoms * Queue analysis for the toll station of the Öresund fixed link Pontus Matstoms * Abstract A new simulation model for queue and capacity analysis of a toll station is presented. The model and its software

More information

Traffic safety analysis for cyclists at roundabouts, a case study in Norrköping

Traffic safety analysis for cyclists at roundabouts, a case study in Norrköping LiU-ITN-TEK-A--18/011--SE Traffic safety analysis for cyclists at roundabouts, a case study in Norrköping Shengjie Tang 2018-05-25 Department of Science and Technology Linköping University SE-601 74 Norrköping,

More information

Evaluating and Classifying NBA Free Agents

Evaluating and Classifying NBA Free Agents Evaluating and Classifying NBA Free Agents Shanwei Yan In this project, I applied machine learning techniques to perform multiclass classification on free agents by using game statistics, which is useful

More information

BASKETBALL PREDICTION ANALYSIS OF MARCH MADNESS GAMES CHRIS TSENG YIBO WANG

BASKETBALL PREDICTION ANALYSIS OF MARCH MADNESS GAMES CHRIS TSENG YIBO WANG BASKETBALL PREDICTION ANALYSIS OF MARCH MADNESS GAMES CHRIS TSENG YIBO WANG GOAL OF PROJECT The goal is to predict the winners between college men s basketball teams competing in the 2018 (NCAA) s March

More information

A Developmental Approach. To The Soccer Learning Process

A Developmental Approach. To The Soccer Learning Process A Developmental Approach To The Soccer Learning Process Soccer by definition Soccer is a game played between 2 teams and each team is trying to score more goals than the other team. Soccer games are decided

More information

Predicting Tennis Match Outcomes Through Classification Shuyang Fang CS074 - Dartmouth College

Predicting Tennis Match Outcomes Through Classification Shuyang Fang CS074 - Dartmouth College Predicting Tennis Match Outcomes Through Classification Shuyang Fang CS074 - Dartmouth College Introduction The governing body of men s professional tennis is the Association of Tennis Professionals or

More information

Citation for published version (APA): Canudas Romo, V. (2003). Decomposition Methods in Demography Groningen: s.n.

Citation for published version (APA): Canudas Romo, V. (2003). Decomposition Methods in Demography Groningen: s.n. University of Groningen Decomposition Methods in Demography Canudas Romo, Vladimir IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please

More information

Evaluating the Influence of R3 Treatments on Fishing License Sales in Pennsylvania

Evaluating the Influence of R3 Treatments on Fishing License Sales in Pennsylvania Evaluating the Influence of R3 Treatments on Fishing License Sales in Pennsylvania Prepared for the: Pennsylvania Fish and Boat Commission Produced by: PO Box 6435 Fernandina Beach, FL 32035 Tel (904)

More information

At each type of conflict location, the risk is affected by certain parameters:

At each type of conflict location, the risk is affected by certain parameters: TN001 April 2016 The separated cycleway options tool (SCOT) was developed to partially address some of the gaps identified in Stage 1 of the Cycling Network Guidance project relating to separated cycleways.

More information

Reliability. Introduction, 163 Quantifying Reliability, 163. Finding the Probability of Functioning When Activated, 163

Reliability. Introduction, 163 Quantifying Reliability, 163. Finding the Probability of Functioning When Activated, 163 ste41912_ch04_123-175 3:16:06 01.29pm Page 163 SUPPLEMENT TO CHAPTER 4 Reliability LEARNING OBJECTIVES SUPPLEMENT OUTLINE After completing this supplement, you should be able to: 1 Define reliability.

More information

Player valuation in European football

Player valuation in European football Player valuation in European football Edward Nsolo Patrick Lambrix Niklas Carlsson Linköping University, Sweden Abstract. As the success of a team depends on the performance of individual players, the

More information

Predicting the Total Number of Points Scored in NFL Games

Predicting the Total Number of Points Scored in NFL Games Predicting the Total Number of Points Scored in NFL Games Max Flores (mflores7@stanford.edu), Ajay Sohmshetty (ajay14@stanford.edu) CS 229 Fall 2014 1 Introduction Predicting the outcome of National Football

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Jason Corso SUNY at Buffalo 19 January 2011 J. Corso (SUNY at Buffalo) Introduction to Pattern Recognition 19 January 2011 1 / 32 Examples of Pattern Recognition in

More information

Estimating the Probability of Winning an NFL Game Using Random Forests

Estimating the Probability of Winning an NFL Game Using Random Forests Estimating the Probability of Winning an NFL Game Using Random Forests Dale Zimmerman February 17, 2017 2 Brian Burke s NFL win probability metric May be found at www.advancednflstats.com, but the site

More information

Two Machine Learning Approaches to Understand the NBA Data

Two Machine Learning Approaches to Understand the NBA Data Two Machine Learning Approaches to Understand the NBA Data Panagiotis Lolas December 14, 2017 1 Introduction In this project, I consider applications of machine learning in the analysis of nba data. To

More information

RUGBY is a dynamic, evasive, and highly possessionoriented

RUGBY is a dynamic, evasive, and highly possessionoriented VISUALIZING RUGBY GAME STYLES USING SOMS 1 Visualizing Rugby Game Styles Using Self-Organizing Maps Peter Lamb, Hayden Croft Abstract Rugby coaches and analysts often use notational data describing match

More information

Honest Mirror: Quantitative Assessment of Player Performances in an ODI Cricket Match

Honest Mirror: Quantitative Assessment of Player Performances in an ODI Cricket Match Honest Mirror: Quantitative Assessment of Player Performances in an ODI Cricket Match Madan Gopal Jhawar 1 and Vikram Pudi 2 1 Microsoft, India majhawar@microsoft.com 2 IIIT Hyderabad, India vikram@iiit.ac.in

More information

Evaluation and Improvement of the Roundabouts

Evaluation and Improvement of the Roundabouts The 2nd Conference on Traffic and Transportation Engineering, 2016, *, ** Published Online **** 2016 in SciRes. http://www.scirp.org/journal/wjet http://dx.doi.org/10.4236/wjet.2014.***** Evaluation and

More information

Running head: DATA ANALYSIS AND INTERPRETATION 1

Running head: DATA ANALYSIS AND INTERPRETATION 1 Running head: DATA ANALYSIS AND INTERPRETATION 1 Data Analysis and Interpretation Final Project Vernon Tilly Jr. University of Central Oklahoma DATA ANALYSIS AND INTERPRETATION 2 Owners of the various

More information

Chapter 12 Practice Test

Chapter 12 Practice Test Chapter 12 Practice Test 1. Which of the following is not one of the conditions that must be satisfied in order to perform inference about the slope of a least-squares regression line? (a) For each value

More information

Using Poisson Distribution to predict a Soccer Betting Winner

Using Poisson Distribution to predict a Soccer Betting Winner Using Poisson Distribution to predict a Soccer Betting Winner By SYED AHMER RIZVI 1511060 Section A Quantitative Methods - I APPLICATION OF DESCRIPTIVE STATISTICS AND PROBABILITY IN SOCCER Concept This

More information

Anxiety and attentional control in football penalty kicks: A mechanistic account of performance failure under pressure

Anxiety and attentional control in football penalty kicks: A mechanistic account of performance failure under pressure Anxiety and attentional control in football penalty kicks: A mechanistic account of performance failure under pressure Submitted by Greg Wood to the University of Exeter as a thesis for the degree of Doctor

More information

Using Spatio-Temporal Data To Create A Shot Probability Model

Using Spatio-Temporal Data To Create A Shot Probability Model Using Spatio-Temporal Data To Create A Shot Probability Model Eli Shayer, Ankit Goyal, Younes Bensouda Mourri June 2, 2016 1 Introduction Basketball is an invasion sport, which means that players move

More information

Smart-Walk: An Intelligent Physiological Monitoring System for Smart Families

Smart-Walk: An Intelligent Physiological Monitoring System for Smart Families Smart-Walk: An Intelligent Physiological Monitoring System for Smart Families P. Sundaravadivel 1, S. P. Mohanty 2, E. Kougianos 3, V. P. Yanambaka 4, and M. K. Ganapathiraju 5 University of North Texas,

More information

The Effect of a Seven Week Exercise Program on Golf Swing Performance and Musculoskeletal Screening Scores

The Effect of a Seven Week Exercise Program on Golf Swing Performance and Musculoskeletal Screening Scores The Effect of a Seven Week Exercise Program on Golf Swing Performance and Musculoskeletal Screening Scores 2017 Mico Hannes Olivier Bachelor of Sport Science Faculty of Health Sciences and Medicine Bond

More information

Tokyo: Simulating Hyperpath-Based Vehicle Navigations and its Impact on Travel Time Reliability

Tokyo: Simulating Hyperpath-Based Vehicle Navigations and its Impact on Travel Time Reliability CHAPTER 92 Tokyo: Simulating Hyperpath-Based Vehicle Navigations and its Impact on Travel Time Reliability Daisuke Fukuda, Jiangshan Ma, Kaoru Yamada and Norihito Shinkai 92.1 Introduction Most standard

More information

knn & Naïve Bayes Hongning Wang

knn & Naïve Bayes Hongning Wang knn & Naïve Bayes Hongning Wang CS@UVa Today s lecture Instance-based classifiers k nearest neighbors Non-parametric learning algorithm Model-based classifiers Naïve Bayes classifier A generative model

More information

Title: 4-Way-Stop Wait-Time Prediction Group members (1): David Held

Title: 4-Way-Stop Wait-Time Prediction Group members (1): David Held Title: 4-Way-Stop Wait-Time Prediction Group members (1): David Held As part of my research in Sebastian Thrun's autonomous driving team, my goal is to predict the wait-time for a car at a 4-way intersection.

More information

Transformer fault diagnosis using Dissolved Gas Analysis technology and Bayesian networks

Transformer fault diagnosis using Dissolved Gas Analysis technology and Bayesian networks Proceedings of the 4th International Conference on Systems and Control, Sousse, Tunisia, April 28-30, 2015 TuCA.2 Transformer fault diagnosis using Dissolved Gas Analysis technology and Bayesian networks

More information

This file is part of the following reference:

This file is part of the following reference: This file is part of the following reference: Hancock, Timothy Peter (2006) Multivariate consensus trees: tree-based clustering and profiling for mixed data types. PhD thesis, James Cook University. Access

More information

Grade: 8. Author(s): Hope Phillips

Grade: 8. Author(s): Hope Phillips Title: Tying Knots: An Introductory Activity for Writing Equations in Slope-Intercept Form Prior Knowledge Needed: Grade: 8 Author(s): Hope Phillips BIG Idea: Linear Equations how to analyze data from

More information

ROUNDABOUT CAPACITY: THE UK EMPIRICAL METHODOLOGY

ROUNDABOUT CAPACITY: THE UK EMPIRICAL METHODOLOGY ROUNDABOUT CAPACITY: THE UK EMPIRICAL METHODOLOGY 1 Introduction Roundabouts have been used as an effective means of traffic control for many years. This article is intended to outline the substantial

More information

B. AA228/CS238 Component

B. AA228/CS238 Component Abstract Two supervised learning methods, one employing logistic classification and another employing an artificial neural network, are used to predict the outcome of baseball postseason series, given

More information

Simulating Major League Baseball Games

Simulating Major League Baseball Games ABSTRACT Paper 2875-2018 Simulating Major League Baseball Games Justin Long, Slippery Rock University; Brad Schweitzer, Slippery Rock University; Christy Crute Ph.D, Slippery Rock University The game of

More information

Development of a photopletysmography based method for investigating changes in blood volume pulsations for the purpose of pressure ulcer prevention

Development of a photopletysmography based method for investigating changes in blood volume pulsations for the purpose of pressure ulcer prevention Linköping University Department of Biomedical Engineering Master thesis, 3 ECTS Biomedical engineering 17 LIU-IMT/LITH-EX-A--17/1--SE Development of a photopletysmography based method for investigating

More information

Basketball data science

Basketball data science Basketball data science University of Brescia, Italy Vienna, April 13, 2018 paola.zuccolotto@unibs.it marica.manisera@unibs.it BDSports, a network of people interested in Sports Analytics http://bodai.unibs.it/bdsports/

More information

CS 221 PROJECT FINAL

CS 221 PROJECT FINAL CS 221 PROJECT FINAL STUART SY AND YUSHI HOMMA 1. INTRODUCTION OF TASK ESPN fantasy baseball is a common pastime for many Americans, which, coincidentally, defines a problem whose solution could potentially

More information

CONTROL VALVE WHAT YOU NEED TO LEARN?

CONTROL VALVE WHAT YOU NEED TO LEARN? CONTROL VALVE WHAT YOU NEED TO LEARN? i) The control valve characteristics refers to the relationship between the volumetric flowrate F (Y-axis) through the valve AND the valve travel or opening position

More information

Chapter 5: Methods and Philosophy of Statistical Process Control

Chapter 5: Methods and Philosophy of Statistical Process Control Chapter 5: Methods and Philosophy of Statistical Process Control Learning Outcomes After careful study of this chapter You should be able to: Understand chance and assignable causes of variation, Explain

More information

Game Theory (MBA 217) Final Paper. Chow Heavy Industries Ty Chow Kenny Miller Simiso Nzima Scott Winder

Game Theory (MBA 217) Final Paper. Chow Heavy Industries Ty Chow Kenny Miller Simiso Nzima Scott Winder Game Theory (MBA 217) Final Paper Chow Heavy Industries Ty Chow Kenny Miller Simiso Nzima Scott Winder Introduction The end of a basketball game is when legends are made or hearts are broken. It is what

More information

Open Research Online The Open University s repository of research publications and other research outputs

Open Research Online The Open University s repository of research publications and other research outputs Open Research Online The Open University s repository of research publications and other research outputs Developing an intelligent table tennis umpiring system Conference or Workshop Item How to cite:

More information

Tutorial for the. Total Vertical Uncertainty Analysis Tool in NaviModel3

Tutorial for the. Total Vertical Uncertainty Analysis Tool in NaviModel3 Tutorial for the Total Vertical Uncertainty Analysis Tool in NaviModel3 May, 2011 1. Introduction The Total Vertical Uncertainty Analysis Tool in NaviModel3 has been designed to facilitate a determination

More information

Predicting NBA Shots

Predicting NBA Shots Predicting NBA Shots Brett Meehan Stanford University https://github.com/brettmeehan/cs229 Final Project bmeehan2@stanford.edu Abstract This paper examines the application of various machine learning algorithms

More information

Application of Bayesian Networks to Shopping Assistance

Application of Bayesian Networks to Shopping Assistance Application of Bayesian Networks to Shopping Assistance Yang Xiang, Chenwen Ye, and Deborah Ann Stacey University of Guelph, CANADA Abstract. We develop an on-line shopping assistant that can help a e-shopper

More information

Calculation of Trail Usage from Counter Data

Calculation of Trail Usage from Counter Data 1. Introduction 1 Calculation of Trail Usage from Counter Data 1/17/17 Stephen Martin, Ph.D. Automatic counters are used on trails to measure how many people are using the trail. A fundamental question

More information

Pairwise Comparison Models: A Two-Tiered Approach to Predicting Wins and Losses for NBA Games

Pairwise Comparison Models: A Two-Tiered Approach to Predicting Wins and Losses for NBA Games Pairwise Comparison Models: A Two-Tiered Approach to Predicting Wins and Losses for NBA Games Tony Liu Introduction The broad aim of this project is to use the Bradley Terry pairwise comparison model as

More information

2600T Series Pressure Transmitters Plugged Impulse Line Detection Diagnostic. Pressure Measurement Engineered solutions for all applications

2600T Series Pressure Transmitters Plugged Impulse Line Detection Diagnostic. Pressure Measurement Engineered solutions for all applications Application Description AG/266PILD-EN Rev. C 2600T Series Pressure Transmitters Plugged Impulse Line Detection Diagnostic Pressure Measurement Engineered solutions for all applications Increase plant productivity

More information

Football Player s Performance and Market Value

Football Player s Performance and Market Value Football Player s Performance and Market Value Miao He 1, Ricardo Cachucho 1, and Arno Knobbe 1,2 1 LIACS, Leiden University, the Netherlands, r.cachucho@liacs.leidenuniv.nl 2 Amsterdam University of Applied

More information

Clutch Hitters Revisited Pete Palmer and Dick Cramer National SABR Convention June 30, 2008

Clutch Hitters Revisited Pete Palmer and Dick Cramer National SABR Convention June 30, 2008 Clutch Hitters Revisited Pete Palmer and Dick Cramer National SABR Convention June 30, 2008 Do clutch hitters exist? More precisely, are there any batters whose performance in critical game situations

More information

#1 Accurately Rate and Rank each FBS team, and

#1 Accurately Rate and Rank each FBS team, and The goal of the playoffpredictor website is to use statistical analysis in the absolute simplest terms to: #1 Accurately Rate and Rank each FBS team, and #2 Predict what the playoff committee will list

More information

A computer program that improves its performance at some task through experience.

A computer program that improves its performance at some task through experience. 1 A computer program that improves its performance at some task through experience. 2 Example: Learn to Diagnose Patients T: Diagnose tumors from images P: Percent of patients correctly diagnosed E: Pre

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Announcements Course TA: Hao Xiong Office hours: Friday 2pm-4pm in ECSS2.104A1 First homework

More information

A Chiller Control Algorithm for Multiple Variablespeed Centrifugal Compressors

A Chiller Control Algorithm for Multiple Variablespeed Centrifugal Compressors Purdue University Purdue e-pubs International Compressor Engineering Conference School of Mechanical Engineering 2014 A Chiller Control Algorithm for Multiple Variablespeed Centrifugal Compressors Piero

More information

CS472 Foundations of Artificial Intelligence. Final Exam December 19, :30pm

CS472 Foundations of Artificial Intelligence. Final Exam December 19, :30pm CS472 Foundations of Artificial Intelligence Final Exam December 19, 2003 12-2:30pm Name: (Q exam takers should write their Number instead!!!) Instructions: You have 2.5 hours to complete this exam. The

More information

A SEMI-PRESSURE-DRIVEN APPROACH TO RELIABILITY ASSESSMENT OF WATER DISTRIBUTION NETWORKS

A SEMI-PRESSURE-DRIVEN APPROACH TO RELIABILITY ASSESSMENT OF WATER DISTRIBUTION NETWORKS A SEMI-PRESSURE-DRIVEN APPROACH TO RELIABILITY ASSESSMENT OF WATER DISTRIBUTION NETWORKS S. S. OZGER PhD Student, Dept. of Civil and Envir. Engrg., Arizona State Univ., 85287, Tempe, AZ, US Phone: +1-480-965-3589

More information

Projecting Three-Point Percentages for the NBA Draft

Projecting Three-Point Percentages for the NBA Draft Projecting Three-Point Percentages for the NBA Draft Hilary Sun hsun3@stanford.edu Jerold Yu jeroldyu@stanford.edu December 16, 2017 Roland Centeno rcenteno@stanford.edu 1 Introduction As NBA teams have

More information

Online Companion to Using Simulation to Help Manage the Pace of Play in Golf

Online Companion to Using Simulation to Help Manage the Pace of Play in Golf Online Companion to Using Simulation to Help Manage the Pace of Play in Golf MoonSoo Choi Industrial Engineering and Operations Research, Columbia University, New York, NY, USA {moonsoo.choi@columbia.edu}

More information

Player valuation in European football (Extended version)

Player valuation in European football (Extended version) Player valuation in European football (Extended version) Edward Nsolo Patrick Lambrix Niklas Carlsson Linköping University, Sweden Abstract. As the success of a team depends on the performance of individual

More information

AGA Swiss McMahon Pairing Protocol Standards

AGA Swiss McMahon Pairing Protocol Standards AGA Swiss McMahon Pairing Protocol Standards Final Version 1: 2009-04-30 This document describes the Swiss McMahon pairing system used by the American Go Association (AGA). For questions related to user

More information

POWER Quantifying Correction Curve Uncertainty Through Empirical Methods

POWER Quantifying Correction Curve Uncertainty Through Empirical Methods Proceedings of the ASME 2014 Power Conference POWER2014 July 28-31, 2014, Baltimore, Maryland, USA POWER2014-32187 Quantifying Correction Curve Uncertainty Through Empirical Methods ABSTRACT Christopher

More information

An Analysis of Factors Contributing to Wins in the National Hockey League

An Analysis of Factors Contributing to Wins in the National Hockey League International Journal of Sports Science 2014, 4(3): 84-90 DOI: 10.5923/j.sports.20140403.02 An Analysis of Factors Contributing to Wins in the National Hockey League Joe Roith, Rhonda Magel * Department

More information

Predicting the use of the sacrifice bunt in Major League Baseball BUDT 714 May 10, 2007

Predicting the use of the sacrifice bunt in Major League Baseball BUDT 714 May 10, 2007 Predicting the use of the sacrifice bunt in Major League Baseball BUDT 714 May 10, 2007 Group 6 Charles Gallagher Brian Gilbert Neelay Mehta Chao Rao Executive Summary Background When a runner is on-base

More information

Lesson 14: Games of Chance and Expected Value

Lesson 14: Games of Chance and Expected Value Student Outcomes Students use expected payoff to compare strategies for a simple game of chance. Lesson Notes This lesson uses examples from the previous lesson as well as some new examples that expand

More information

Novel empirical correlations for estimation of bubble point pressure, saturated viscosity and gas solubility of crude oils

Novel empirical correlations for estimation of bubble point pressure, saturated viscosity and gas solubility of crude oils 86 Pet.Sci.(29)6:86-9 DOI 1.17/s12182-9-16-x Novel empirical correlations for estimation of bubble point pressure, saturated viscosity and gas solubility of crude oils Ehsan Khamehchi 1, Fariborz Rashidi

More information

Section I: Multiple Choice Select the best answer for each problem.

Section I: Multiple Choice Select the best answer for each problem. Inference for Linear Regression Review Section I: Multiple Choice Select the best answer for each problem. 1. Which of the following is NOT one of the conditions that must be satisfied in order to perform

More information

Determining bicycle infrastructure preferences A case study of Dublin

Determining bicycle infrastructure preferences A case study of Dublin *Manuscript Click here to view linked References 1 Determining bicycle infrastructure preferences A case study of Dublin Brian Caulfield 1, Elaine Brick 2, Orla Thérèse McCarthy 1 1 Department of Civil,

More information

Which On-Base Percentage Shows. the Highest True Ability of a. Baseball Player?

Which On-Base Percentage Shows. the Highest True Ability of a. Baseball Player? Which On-Base Percentage Shows the Highest True Ability of a Baseball Player? January 31, 2018 Abstract This paper looks at the true on-base ability of a baseball player given their on-base percentage.

More information

Decision Trees. an Introduction

Decision Trees. an Introduction Decision Trees an Introduction Outline Top-Down Decision Tree Construction Choosing the Splitting Attribute Information Gain and Gain Ratio Decision Tree An internal node is a test on an attribute A branch

More information

Evaluating The Best. Exploring the Relationship between Tom Brady s True and Observed Talent

Evaluating The Best. Exploring the Relationship between Tom Brady s True and Observed Talent Evaluating The Best Exploring the Relationship between Tom Brady s True and Observed Talent Heather Glenny, Emily Clancy, and Alex Monahan MCS 100: Mathematics of Sports Spring 2016 Tom Brady s recently

More information

Journal of Chemical and Pharmaceutical Research, 2014, 6(3): Research Article

Journal of Chemical and Pharmaceutical Research, 2014, 6(3): Research Article Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research 2014 6(3):304-309 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 World men sprint event development status research

More information

Predicting the NCAA Men s Basketball Tournament with Machine Learning

Predicting the NCAA Men s Basketball Tournament with Machine Learning Predicting the NCAA Men s Basketball Tournament with Machine Learning Andrew Levandoski and Jonathan Lobo CS 2750: Machine Learning Dr. Kovashka 25 April 2017 Abstract As the popularity of the NCAA Men

More information

Predicting Horse Racing Results with Machine Learning

Predicting Horse Racing Results with Machine Learning Predicting Horse Racing Results with Machine Learning LYU 1703 LIU YIDE 1155062194 Supervisor: Professor Michael R. Lyu Outline Recap of last semester Object of this semester Data Preparation Set to sequence

More information

TECHNICAL STUDY 2 with ProZone

TECHNICAL STUDY 2 with ProZone A comparative performance analysis of games played on artificial (Football Turf) and grass from the evaluation of UEFA Champions League and UEFA Cup. Introduction Following on from our initial technical

More information

CHAPTER 6 DISCUSSION ON WAVE PREDICTION METHODS

CHAPTER 6 DISCUSSION ON WAVE PREDICTION METHODS CHAPTER 6 DISCUSSION ON WAVE PREDICTION METHODS A critical evaluation of the three wave prediction methods examined in this thesis is presented in this Chapter. The significant wave parameters, Hand T,

More information

CS 7641 A (Machine Learning) Sethuraman K, Parameswaran Raman, Vijay Ramakrishnan

CS 7641 A (Machine Learning) Sethuraman K, Parameswaran Raman, Vijay Ramakrishnan CS 7641 A (Machine Learning) Sethuraman K, Parameswaran Raman, Vijay Ramakrishnan Scenario 1: Team 1 scored 200 runs from their 50 overs, and then Team 2 reaches 146 for the loss of two wickets from their

More information

Analysis of the Article Entitled: Improved Cube Handling in Races: Insights with Isight

Analysis of the Article Entitled: Improved Cube Handling in Races: Insights with Isight Analysis of the Article Entitled: Improved Cube Handling in Races: Insights with Isight Michelin Chabot (michelinchabot@gmail.com) February 2015 Abstract The article entitled Improved Cube Handling in

More information

Determining Occurrence in FMEA Using Hazard Function

Determining Occurrence in FMEA Using Hazard Function Determining Occurrence in FMEA Using Hazard Function Hazem J. Smadi Abstract FMEA has been used for several years and proved its efficiency for system s risk analysis due to failures. Risk priority number

More information

The impact of human capital accounting on the efficiency of English professional football clubs

The impact of human capital accounting on the efficiency of English professional football clubs MPRA Munich Personal RePEc Archive The impact of human capital accounting on the efficiency of English professional football clubs Anna Goshunova Institute of Finance and Economics, KFU 17 February 2013

More information

From Bombe stops to Enigma keys

From Bombe stops to Enigma keys From Bombe stops to Enigma keys A remarkably succinct description of the Bombe written many years ago, reads as follows:- The apparatus for breaking Enigma keys, by testing a crib and its implications

More information

THE REFEREEING IN BASKETBALL- TRENDS AND OPTIMIZATION STRATEGIES OF THE TRAINING AND PERFORMANCE OF REFEREES IN A DIVISION

THE REFEREEING IN BASKETBALL- TRENDS AND OPTIMIZATION STRATEGIES OF THE TRAINING AND PERFORMANCE OF REFEREES IN A DIVISION THE MINISTRY OF NATIONAL EDUCATION THE NATIONAL UNIVERSITY OF PHYSICAL EDUCATION AND SPORTS THE REFEREEING IN BASKETBALL- TRENDS AND OPTIMIZATION STRATEGIES OF THE TRAINING AND PERFORMANCE OF REFEREES

More information

Environmental Science: An Indian Journal

Environmental Science: An Indian Journal Environmental Science: An Indian Journal Research Vol 14 Iss 1 Flow Pattern and Liquid Holdup Prediction in Multiphase Flow by Machine Learning Approach Chandrasekaran S *, Kumar S Petroleum Engineering

More information

DEVELOPMENT OF A SET OF TRIP GENERATION MODELS FOR TRAVEL DEMAND ESTIMATION IN THE COLOMBO METROPOLITAN REGION

DEVELOPMENT OF A SET OF TRIP GENERATION MODELS FOR TRAVEL DEMAND ESTIMATION IN THE COLOMBO METROPOLITAN REGION DEVELOPMENT OF A SET OF TRIP GENERATION MODELS FOR TRAVEL DEMAND ESTIMATION IN THE COLOMBO METROPOLITAN REGION Ravindra Wijesundera and Amal S. Kumarage Dept. of Civil Engineering, University of Moratuwa

More information

EXPLORING MOTIVATION AND TOURIST TYPOLOGY: THE CASE OF KOREAN GOLF TOURISTS TRAVELLING IN THE ASIA PACIFIC. Jae Hak Kim

EXPLORING MOTIVATION AND TOURIST TYPOLOGY: THE CASE OF KOREAN GOLF TOURISTS TRAVELLING IN THE ASIA PACIFIC. Jae Hak Kim EXPLORING MOTIVATION AND TOURIST TYPOLOGY: THE CASE OF KOREAN GOLF TOURISTS TRAVELLING IN THE ASIA PACIFIC Jae Hak Kim Thesis submitted for the degree of Doctor of Philosophy at the University of Canberra

More information

A Game Theoretic Study of Attack and Defense in Cyber-Physical Systems

A Game Theoretic Study of Attack and Defense in Cyber-Physical Systems The First International Workshop on Cyber-Physical Networking Systems A Game Theoretic Study of Attack and Defense in Cyber-Physical Systems ChrisY.T.Ma Advanced Digital Sciences Center Illinois at Singapore

More information

A Cost Effective and Efficient Way to Assess Trail Conditions: A New Sampling Approach

A Cost Effective and Efficient Way to Assess Trail Conditions: A New Sampling Approach A Cost Effective and Efficient Way to Assess Trail Conditions: A New Sampling Approach Rachel A. Knapp, Graduate Assistant, University of New Hampshire Department of Natural Resources and the Environment,

More information

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS

#19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS #19 MONITORING AND PREDICTING PEDESTRIAN BEHAVIOR USING TRAFFIC CAMERAS Final Research Report Luis E. Navarro-Serment, Ph.D. The Robotics Institute Carnegie Mellon University November 25, 2018. Disclaimer

More information