Prior Design for Dependent Dirichlet Processes: An Application to Marathon Modeling

Size: px

Start display at page:

Download "Prior Design for Dependent Dirichlet Processes: An Application to Marathon Modeling"

Roland Jordan
5 years ago
Views:

1 RESEARCH ARTICLE Prior Design for Dependent Dirichlet Processes: An Application to Marathon Modeling Melanie F. Pradier 1,2 *, Francisco J. R. Ruiz 1,2,3, Fernando Perez-Cruz 1,2,4 a Signal Theory and Communications Department, University Carlos III in Madrid, Madrid, Spain, 2 Gregorio Marañón Health Research Institute, Madrid, Spain, 3 Data Science Institute, Columbia University, New York, United States of America, 4 Member of the Technical Staff at Bell Labs, Alcatel-Lucent, Murray Hill, New Jersey, United States of America * melanie@tsc.uc3m.es Abstract OPEN ACCESS Citation: F. Pradier M, J. R. Ruiz F, Perez-Cruz F (2016) Prior Design for Dependent Dirichlet Processes: An Application to Marathon Modeling. PLoS ONE 11(1): e doi: /journal. pone Editor: Guy Brock, University of Louisville, UNITED STATES Received: May 18, 2015 Accepted: January 4, 2016 Published: January 28, 2016 Copyright: 2016 F. Pradier et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Data Availability Statement: All relevant data are within the paper and its Supporting Information files. Data will also be available at the author s personal webpage. Funding: The authors are grateful for financial support from the following institutions: Melanie F. Pradier is supported by the European Union 7th Framework Programme through the Marie Curie Initial Training Network Machine Learning for Personalized Medicine MLPM2012, Grant No ( Francisco J. R. Ruiz is supported by an FPU fellowship from the Spanish Ministry of Education (AP ). This work is This paper presents a novel application of Bayesian nonparametrics (BNP) for marathon data modeling. We make use of two well-known BNP priors, the single-p dependent Dirichlet process and the hierarchical Dirichlet process, in order to address two different problems. First, we study the impact of age, gender and environment on the runners performance. We derive a fair grading method that allows direct comparison of runners regardless of their age and gender. Unlike current grading systems, our approach is based not only on top world records, but on the performances of all runners. The presented methodology for comparison of densities can be adopted in many other applications straightforwardly, providing an interesting perspective to build dependent Dirichlet processes. Second, we analyze the running patterns of the marathoners in time, obtaining information that can be valuable for training purposes. We also show that these running patterns can be used to predict finishing time given intermediate interval measurements. We apply our models to New York City, Boston and London marathons. 1 Introduction Data in the real world typically involves some source of uncertainty. This uncertainty may come from noisy measurements, incomplete information, or from the fact that we only have access to a subset of the data from a larger population. Probabilistic models have proven to be an effective approach for understanding such data, by incorporating our assumptions and prior knowledge of the world. In fact, probabilistic models have become an important tool in all areas of science as a way to develop statistical algorithms that are able to learn hidden structures from the data and make predictions [1]. Within probabilistic models, Bayesian nonparametric (BNP) models present several desirable properties [2]. Their most recognizable benefit is to avoid the need of specifying a closedform model, i.e., a parametric model with a predefined fixed number of hidden variables. PLOS ONE DOI: /journal.pone January 28, /28

2 also partially supported by the Ministerio de Economia of Spain (projects COMONSENS, id. CSD , and ALCIT, id. TEC C03-01), by the Comunidad de Madrid (project CASI-CAM-CM, id. S2013/ICE-2845), and by the Office of Naval Research (ONR N ). Bell Labs Alcatel-Lucent provided support in the form of salaries for author FPC, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific role of these authors are articulated in the author contributions section. Competing Interests: The authors declare that the author Fernando Perez-Cruz is employed by Bell Labs Alcatel-Lucent. This does not alter the authors adherence to PLOS ONE policies on sharing data and materials. Instead, BNP priors place probability mass on an infinite range of models and let the inference procedure choose the one that best fits the data, in order to provide competitive predictions or density estimations [3]. Although this flexibility makes such models attractive for experts in other fields, obtaining interpretable results might be an even stronger requirement. Most BNP models are described as general priors [4 7] that might not give easy-to-interpret solutions if applied blindly, even if they provide accurate predictions. In order to additionally obtain interpretable results, we should tailor the design of the model and include all our prior knowledge in the assumptions. We need to take problem-specific design choices that will allow for an easy interpretation of the results afterwards. In this way, the first insights of the obtained results should not be foreign to us. This makes the model trustworthy for experts in other fields that do not know about machine learning or statistics, so other conclusions that were not common knowledge can be taken as plausible. At this stage we are able to formulate hypotheses that can be tested with future data and can provide previously unknown insights about the given problem. Such procedure avoids the frequent black-box flavor found in other methods, facilitating collaboration across fields. Examples of such multi-disciplinar efforts can be found in psychiatry [8], genetics [9], biostatistics [10], computer vision [11], econometry [12] or musicology [13]. This paper presents a novel application of BNPs to model marathon runners. We opt for BNP models to leave room for the unexpected. In particular, we make use of the dependent Dirichlet process (DDP) [14], which is a powerful tool that encompasses the Dirichlet process (DP) and the hierarchical Dirichlet process (HDP). However, the DDP is very general and it cannot be directly applied to data without additional constraints. Here, we specify a way to tie the parameters across groups using a Gaussian process [15], thus making the DDP a practical prior for our problem at hand. This paper is an application of the single-p DDP and the HDP to marathon modeling. The novelty of the paper relies not only on the application, but also on the necessary steps to transform a prior that provides accurate estimates into a prior that also gives interpretable results. Non-trivial structural assumptions and design solutions are made to find hidden properties of the athletes while providing accurate predictions. We believe that BNP models will be more useful in the future for experts without machine learning expertise if we can tailor the priors to provide accurate and understandable solutions. The rest of the paper is organized as follows. Section 2 describes the two problems addressed in this paper. In Section 3, we review the BNP priors behind our models. We then introduce the statistical models used in this paper. In particular, Section 4 presents the atom-dependent Dirichlet process (ADDP), a collection of BNP mixture models based on the single-p DDP prior coupled with a Gaussian process, while Section 5 describes an HDP over probability vectors to detect latent running profiles and perform predictions. Section 6 is devoted to the experiments and results. Our conclusions and further discussion can be found in Section 7. 2 Problem Overview 2.1 Age-grading problem We first consider the age-grading problem, in which we want to compare fairly the finishing times of runners having different age and gender. Currently, most popular marathons award entry to participants by their best marathon in the previous 12 months. For example, in Boston it is the only way a participant can gain entry to the race, while other paths are available in New York, Chicago or London. The proposed methodology can be used to equalize entry requirements for different marathons, which vary considerably for one event to the next, as there is no PLOS ONE DOI: /journal.pone January 28, /28

3 widely accepted standard method to specify them. Furthermore, the World Master Athletics (WMA) has an age-grading system [16] for equalizing the finishing time according to age and sex. They lobby for this measure to be taken into consideration for selecting the winners of each race, even though that procedure is based on world records, i.e., outliers that may not be very representative, or even realistic, for most races. Our method also provides an alternative way to reward runners of different ages and gender. Our approach consists in adapting a single-p DDP [5] to cluster the finishing time for each runner according to his/her age and sex. We propose a Gaussian process to control how the clusters (representing marathon finishing time) change from one group to the next (different ages or gender). We find that the means of these clusters are directly comparable to the marathon entry requirements and the age-graded tables from the WMA. Additionally, direct comparisons for any finishing time are straightforward, since we find a full distribution for all ages and both genders. Our single-p DDP can simultaneously deal with different races and/or the same race on different years, providing a unified ranking for all the races that may differ on elevation profile, temperature or humidity. 2.2 Running pattern problem The second problem considered herein is the temporal analysis of running patterns. Our objective here is twofold. First, we are interested in a tool that can capture latent running profiles that reflect the marathon difficulties along the 26.2 miles ( km). This can be useful for athletes training purposes. Second, we aim at predicting the arrival time of runners using intermediate records. This problem has already been addressed in [17], where the finishing times are imputed for the 2013 Boston marathon. One of the best approaches rely on the 100 nearest neighbors, which has the limitation of clustering runners that are doing the same absolute times. We propose to use an HDP [6] to model the fraction of time each runner has spent at each intermediate interval (typically, measures are taken every 5 km and at half-marathon). In this way, we cluster the time ratio instead of the absolute times. Runners that run at different but constant speed will be in the same cluster, no matter if they run each mile in five, eight or eleven minutes. Thus, this model allows estimating finishing times for slower runners that have the same time-ratio profile than fast ones. We use an HDP model in which the likelihood function is a Dirichlet distribution, and each DP clusters the runners by age group and sex. When modeling the full race, it helps to understand the different trade-offs and which parts of the race are harder. 3 Dependent Dirichlet processes In this Section, we briefly review the stochastic processes on which our models are based. In particular, we focus on the DDP and two of its variants. The DDP is a generalization of the DP for groups of data [5]. The DP is a stochastic process whose realizations are random infinite discrete probability distributions [18]. A DP is completely specified by a base distribution H, which is the expected value of the process, and a positive real number α (usually referred to as concentration parameter), which plays the role of an inverse variance. In general, a draw G 0 * DP(α, H) from a DP can be expressed as G 0 ¼ X1 k¼1 p k d k ; ð1þ where the vector π contains the atom weights, and ϕ k are the atom locations defined in the parameter space. In this form, it can be shown that the atom locations ϕ k are independent and PLOS ONE DOI: /journal.pone January 28, /28

4 identically distributed as ϕ k * H, whilst the weights (also known as stick proportions) are distributed as π * GEM(α), a distribution named after Griffiths, Engen and McCloskey [19, 20]. Such representation is called stick-breaking representation [21]. Note that Eq (1) defines an infinite mixture model, i.e., a mixture model with a countably infinite number of clusters. However, since the weights π k decrease exponentially fast, only a finite number of clusters are used to describe any finite dataset [3]. In fact, the expected number of components grows logarithmically with the number of observations [6]. In the DP mixture model, the actual number of clusters describing the data is not fixed, and can be automatically inferred from the data using the usual Bayesian posterior inference framework, e.g., Markov chain Monte Carlo [22] or variational [23] methods. The Chinese restaurant process (CRP) is typically found in the literature of DPs, a standard culinary metaphor that vividly illustrates how the DP operates [6]. In this metaphor, the atom locations ϕ k are referred to as dishes in a restaurant, and observations that are clustered together are viewed as customers sitting on the same table, therefore eating from the same dish. In this generative process, customers enter the restaurant one at a time, and they can either sit on an existing table, with probability proportional to the number of previous customers that are already sitting on that table, or open a new table, with probability proportional to α. In the latter case, they also sample a new dish from the prior, i.e., ϕ k new * H. In its more general form, the DDP is a stochastic process that generalizes the DP and that can be applied for clustering groups of data [5]. For each group j, we have an infinite mixture model of the form G j ¼ X1 k¼1 p jk d jk ; ð2þ where G l is a group-specific random measure, and the atom weights π jk and atom locations ϕ jk follow stochastic processes on the covariate space j. We describe below the two particularizations of the DDP that we use throughout the paper: the hierarchical DP (HDP) and the singlep DDP. Fig 1 shows a comparative sketch between both models. 3.1 Hierarchical Dirichlet process The HDP is a particular DDP to cluster groups of data sharing mixture components [6]. It uses a DP for each group of data, with the DPs for all groups sharing a base distribution which is Fig 1. Comparison of two Dependent Dirichlet processes. Hierarchical DP is at the left, single-p DDP is at the right. The first one shares atom locations, while the second one shares mixture weights. doi: /journal.pone g001 PLOS ONE DOI: /journal.pone January 28, /28

5 itself drawn from a DP. Mathematically, we first draw a base distribution from a DP as G 0 * DP(γ, H), where G 0 ¼ P 1 k¼1 v kd k, and for each group j we draw a distribution from a DP using G 0 as the base distribution, i.e., G j * DP(α, G 0 ). This construction ensures that all the distributions G j share the atom locations given by G 0, since G 0 is itself a discrete probability distribution. Since each atom corresponds to a cluster, cluster parameters ϕ k are shared across all groups. Each G j admits a representation as G j ¼ X1 k¼1 p jk d k ; ð3þ where the atom locations ϕ k do not depend on the group j. Furthermore, this method allows groups to share statistical strength via the atom weights v k of the base distribution G 0. Indeed, the vector of weights for each group j can be obtained as π j * DP(α, v). In the corresponding culinary metaphor, the HDP can be explained with a Chinese restaurant franchise (CRF), in which there is a collection of restaurants, and dishes are shared across restaurants. However, the popularity of each dish, i.e., the corresponding atom weight, is different in each of the restaurants [6]. 3.2 Single-p dependent Dirichlet process The single-p DDP is another DDP, which works in a complementary fashion to the HDP. In this case, atom weights are shared across groups while atom locations are allowed to vary across groups. In terms of the often used culinary metaphor for DPs [2], the HDP shares the dishes across restaurants but allows a different dish popularity in each of the restaurants, while the single-p DDP shares the dish popularity across restaurants but allows the dishes to vary slightly, in order to better fit each group of customers. The latter would be a peculiar CRF in which the popularity of tables is matched one-to-one across restaurants, with the served dish in each linked table slightly customized in each restaurant (e.g., different ingredients, cooking time or local taste). In the single-p DDP, the latent measure for each group j can be expressed as G j ¼ X1 k¼1 p k d jk ; ð4þ where the vector π * GEM(α) contains the mixture weights and ϕ jk are the atom locations. The single-p DDP does not specify how to tie the atom locations ϕ jk across groups for each k. This step is critical, as it conditions the performance of the model. We put forward in Section 4 another stochastic process for this purpose. 4 Modeling of the finishing time 4.1 Atom-dependent Dirichlet process mixture model This section describes a BNP model based on the single-p DDP prior that allows comparing the shape of different distributions while keeping the corresponding quantiles fixed. In the context of the marathon, we use it to obtain a fair comparison between runners regardless of their age or gender. Runners are grouped together according to their age and gender, yielding J different groups. In our infinite mixture model, we cluster the runners of all groups with a potentially unbounded number of clusters. Each cluster k presents a fixed percentage of runners given by π k, with a stochastic process linking the atom locations, i.e., the mean finishing time. This construction has the potential to provide a direct comparison for the finishing time in each group j. However, as the single-p DDP is a very general prior, we need to define the PLOS ONE DOI: /journal.pone January 28, /28

6 likelihood and the stochastic process in a way that is insightful about the marathoner s finishing time. We denote each marathoner finishing time as x ji, where j =1,..., J indexes the group and i runs over marathoners, and we assume a Gaussian likelihood for the finishing time x ji : x ji jc ji ¼ k; m k ; y j ; s 2 x N x ji jm k þ y j ; s 2 x : ð5þ Here, μ k denotes the global mean for cluster k, θ j is the shift associated to group j, c ji represents the cluster assignment associated to observation x ji, and s 2 x is the variance of the Gaussian distributions. Hence, we use a cluster-specific parameter μ k to describe cluster k, but we allow deviations from this value due to age or gender (this effect is modeled by θ j ). The key aspect that makes the single-p DDP useful for comparing different age-gender groups is the stochastic process that governs θ j. We would like this value to vary smoothly across ages, and therefore we choose a zero-mean Gaussian process prior for it: θ N ð0; Σ y Þ; ð6þ where θ =[θ 1,..., θ J ] >, and for the covariance function we use the standard choice, i.e., the squared exponential kernel given by 2 ðσ y Þ j ¼ s 2 exp ð jþ y þ kdð jþ; ð7þ 2n 2 where l and j represent two different age groups, s 2 y accounts for the variance, ν controls the degree of correlation between age groups l and j, and κ is a jitter factor to avoid numerical instabilities. We use an independent Gaussian process for each gender. There are other alternatives, see [15] for a comprehensive introduction for valid covariance functions. In our case, the squared exponential kernel is a smooth kernel (infinitely differentiable) that captures the correlation between the different age groups. Finally, we place the following priors over the assignment variables c ji and the cluster weights π: and c ji jπ π; ð8þ πja GEMðaÞ; ð9þ We refer to this single-p DDP prior, together with the likelihood model in Eq (5) and the Gaussian process for θ j, as the atom-dependent Dirichlet process (ADDP) mixture model. This model is similar to the ANOVA-DDP prior in [24]. We rely on a Gaussian process to control the dependency between the shift variables θ j, while the ANOVA-DDP prior relies on independent priors. This allows for smooth variations of the cluster means with the age. A graphical representation for the ADDP is depicted in Fig 2. We place a Gaussian prior over the cluster means μ k and an inverse gamma prior over the variance s 2, i.e., x mk N m0 ; s2 0 ; ð10þ s 2 x IGða; bþ; ð11þ where μ 0, s 2, a and b are hyperparameters of the model. The value of 0 s2 0 is assumed to be much PLOS ONE DOI: /journal.pone January 28, /28

7 Fig 2. Graphical representation of the basic ADDP mixture model. Grey circles represent observed variables, white circles are hidden random variables. Plates refer to duplicated random variables. doi: /journal.pone g002 larger than s 2 y, so that the first one controls the overall finishing time for the clusters (hours), whilst s 2 y controls the differences between groups due to different ages (minutes). We assume a common variance s 2 x for all the clusters, because it provides an ordering of the clusters, which is necessary for comparing the finishing times. Allowing for different variances for each component should provide a more accurate (in the sense of better density estimation) description of the finishing times, but a less interpretable and less actionable representation, as runners assigned to Gaussian components with different variances are not directly comparable. We have not placed a joint prior for the cluster means and variances through the normal-inverse gamma distribution, since separate priors might have better properties for density estimation [25] and allow for faster Gibbs sampling inference. 4.2 Model extensions Multiple races. We could apply the previous models to finishing times from different races or years, but we might obtain unexpected results since different races can present unalike conditions due to temperature, elevation profile, humidity, pull of runners, etc. In this section, we present an useful extension that deals with different races all together and allows drawing comparisons between these races. In order to deal with this, we extend the basic ADDP model using varying weights across races. This leads to a hierarchical ADDP model, in which cluster weights are allowed to change across races, and cluster parameters are allowed to change across age-gender groups. We refer to this model as the hierarchical ADDP (H-ADDP) model. Fig 3 shows a graphical representation of this extended model, with the following likelihood and PLOS ONE DOI: /journal.pone January 28, /28

8 Fig 3. Graphical representation of the H-ADDP mixture model. Cluster weights change across races, whereas cluster means change across age-gender groups. doi: /journal.pone g003 priors: x rji jc rji ¼ k; m k ; y j ; s 2 x N x rji jm k þ y j ; s 2 x ; m k N m 0 ; s 2 0 ; θ N ð0; Σ y Þ; s 2 x IGða; bþ; c rji jπ r π r ; π r jv; a DPða; vþ; vjg GEMðgÞ; ð12þ where r indexes the different races, γ is the upper level concentration parameter, π r ¼ðp rk Þ 1 k¼1 are the mixture weights for race r, v ¼ðv k Þ 1 k¼1 are the global weights for all races. One simple way to interpret this model is by conditioning on a particular race or an agegender group. If we only have data from a single race, we recover our original ADDP model. If we only have data from a single age-gender group, we recover an HDP, i.e., the cluster components are shared, but the mixing proportions are different. Age-gender interaction. The basic ADDP mixture model considers male and female runners independently, assuming independent shift delays θ j between both genders j (i.e., we use a PLOS ONE DOI: /journal.pone January 28, /28

9 block-diagonal covariance matrix). A natural extension of the model consists in introducing an additional gender factor δ and some age-gender interaction factors ω j in order to capture the correlation between male and female athletes. In such model, the shift delays θ j are shared for both men and women, and j indexes different age groups instead of age-gender groups. We refer to this model as the age-gender interaction ADDP model. The generative model can be written as follows: x ji jc ji ¼ k; g ji ; m k ; y j ; s 2 x ; d; o j N x ji jm k þ y j þ g ji ðd þ o j Þ; s 2 x ; m k N m 0 ; s 2 0 ; θ N ð0; Σ y Þ; s 2 x IGða; bþ; c ji jπ π; πja GEMðaÞ; d N 0; s 2 g ; ð13þ ω N ð0; Σ o Þ; where the indicator variables g ji differentiate male (g ji = 0) and female (g ji = 1) runners, δ is the gender effect, and ω =[ω 1,..., ω J ] > contains the age-gender interaction factors influencing the likelihood of female runners. Additionally, s 2 and S g ω are hyperparameters of the model. Cluster-dependent shifts. We now present another extension of the model concerning the shifts θ. In the model described above, the delay θ j only depends on the age and gender, which implies that the shift is the same for all clusters k, no matter whether they are fast or slow runners. However, we can also consider cluster-dependent shifts. This allows us to capture different shift evolutions across age/gender depending on the speed of runners. That is, instead of having a single delay θ j for each group, we consider a different delay θ jk for each cluster and group. Each vector θ k =[θ 1k,..., θ Jk ] > follows its own Gaussian process with mean μ θ and covariance matrix S θ : θ k N ðμ y ; Σ y Þ: ð14þ Table 1 shows a summary of all the ADDP model proposed in this paper. The main difference between them is the way in which the atom locations vary across the covariate space, i.e., the age-gender space. In the case of the H-ADDP model, atom positions depend on the age and gender of runners, and weights vary as a function of the race. Table 1. Summary of the ADDP mixture models developed in this paper. Model Atom weights Atom locations Basic ADDP π k μ k + θ j H-ADDP π rk μ k + θ j Age-gender interaction ADDP π k μ k + θ j + g ji (δ + ω j ) Cluster-dependent ADDP π k μ k + θ jk doi: /journal.pone t001 PLOS ONE DOI: /journal.pone January 28, /28

10 5 Modeling of running patterns In this section, we propose an HDP model as a tool that complements the analysis of the previously introduced ADDP. In particular, we address the analysis of temporal evolution of runners during the race, in order to understand how the marathoners pace themselves to complete the marathon. We aim at discovering running patterns, i.e., distinguishing those overly optimistic runners, with decreasing speed along the race, from well-trained runners who tend to keep a constant speed. It also helps to understand where the marathons are harder, so that runners can know beforehand. The hidden running patterns can be used for training purposes, as they can find out the typical shortcomings of athletes with respect to their age, which may help runners train and run more intelligently. In addition, we also show that discovering running patterns provides a new tool for prediction of finishing times, with results comparable to the best reported method in [17]. The idea is to cluster the data according to the relative time spent in each interval regardless of each runner s total time. In this sense, we are no longer interested in the absolute times, but in the time proportions invested for each interval. Marathons tend to record the elapsed time every 5 kilometers, in addition to half and full-marathon times. Our input data in this case consists of an N D dimensional matrix X with the time spent for each interval, together with the age and gender. Here, N is the number of runners and D denotes the number of available time records. We normalize our input data so that each runner is represented as a vector containing the fraction of time spent for each intermediate interval. We split the data X into J groups of runners having the same gender and belonging to the same age group. Here, we do not use the actual age of the runners, but split them according to age groups (i.e., age ranks) instead. We use the HDP [6] to cluster the running patterns for the different groups. In the HDP, clusters are allowed to show different probabilities for each group, but the per-cluster parameters are shared across groups. As explained in Section 3, we first draw a global base distribution from a DP as G 0 * DP(γ, H), where G 0 ¼ P 1 k¼1 v kd k, and for each group j we draw a distribution from a DP using G 0 as base distribution, i.e., G j * DP(α, G 0 ). In our model, the likelihood function is a Dirichlet distribution and can be written as x ji jc ji ¼ k; p k Dirichletðtp k1 ;...; tp kd Þ; ð15þ where x ji is the normalized D-dimensional vector for runner i in group j, c ji represents its cluster assignment, p k =[p k1,..., p kd ] is the vector of patterns representing cluster k, and τ is the concentration hyperparameter of the model. We place a Dirichlet prior over the per-cluster vectors p k, p k Dirichletð 1 ;...; D Þ; ð16þ where l d is the length of interval d, and is its concentration hyperparameter. 6 Experiments In the following experiments, we apply the described models to the New York City marathon [26] for 6 different years, between 2006 and This database consists of 249,899 runners in total. We additionally compare the NYC marathon data to the marathons of Boston [27] and London [28] for 2010 and 2011, including 117,255 additional runners. Data is included in S1 Marathon Database File. In order to test the resulting models, we set aside a test set with 20% of the participants for each race and age/gender group, ensuring that the age and gender proportions are the same in both train and test sets. PLOS ONE DOI: /journal.pone January 28, / 28

11 Inference and parameter estimation. Posterior inference for all simulations is based on Gibbs sampling. Following Algorithm 8 in [29], we do not integrate out the hidden variables, and we propose 10 new clusters at each iteration. Simulations are run in MatLab, with an approximated running time of 2.3 seconds per iteration for the plain ADDP model using an Intel Core i7-4700mq. In our results in Section 6.1, we report the values of the hidden variables (means and shift delays) averaged for the last 10,000 iterations after running the sampler for 50,000 iterations. In Section 6.2, we run 10,000 iterations of the sampler and average the results for the last 2,000 iterations. For the per-cluster variables, we carry out the averaging procedure to account for potential label switching. Label switching is common in Bayesian mixture models, and it occurs because the likelihood is invariant to permutation of the cluster labels [30, 31]. Such problem might occur in our case, but it is easy to detect by looking at the traceplot of the cluster-specific parameters, and hence parameters across MCMC samples can easily be matched and averaged. Moreover, in our plots we only report the value of the most populous clusters after filtering out the spurious ones with few runners. These spurious clusters appear as a consequence of the nonparametric nature of the model and the inference procedure, and can also be easily detected by looking at the traceplot of the cluster means. Without more data we cannot actually tell whether there is an actual phenomenon in them or they have locked in a noisy pattern. Choice of hyperparameters. In the case of the basic ADDP mixture model in Section 4, we use: s 2 ¼ 0:05, ν = 10, κ y =10 6, a =1,b =1,μ 0 = 5 and s 2 ¼ 1. The value of μ 0 0 and σ 0 are set so that 2-hour marathoners are within 3 standard deviations and 9-hour marathoners (typical cut-off time) are not unheard of. The ratio between σ θ and σ 0 is approximately 1 to 4, so the former is of the order of 15 minutes, while the latter is of the order of one hour. Therefore, σ 0 controls the overall finishing time, while σ θ controls the differences among age groups. The values of a and b control the variance of each Gaussian component and are set so that values of less than an hour, but not too small, are more likely. ν was defined in Eq (7) and controls the correlation between the mean finishing time of runners having same gender and different ages, so that they do not deviate significantly. The value of κ measures the error in the recorded time and in this case it prevents numerical instabilities. In the age-gender interaction ADDP model, we choose the same hyperparameters as in the basic ADDP model, and the additional hyperparameters are set as s 2 ¼ 0:05 (which corresponds to the same prior variance as for θ g j) and S o ¼ 1 S 2 y. For the HDP model in Section 5, we set = 0.2 km 1 and τ = 5000, because the differences between the relative time spent at each 5-km interval are typically small. Since τ plays the role of an inverse variance, the larger its value, the more clusters we should expect. We have found empirically that values of τ ranging in the thousands (between 2500 and 10000) provide results that admit similar interpretations, and they only differ in the granularity of the clusters. Finally, we place a Gamma prior with shape 1 and scale 10 over the concentration parameters α and γ, and sample their values following [32]. These hyperpriors are chosen in order to avoid the creation of too many spurious clusters. Due to the huge amount of data, results are not very sensitive to hyperparameter values, as long as they are not set to completely misleading and unrealistic values. In our application at hand, we can actually set (and be able to explain) the values of the hyperparameters using all our prior knowledge, as detailed above, which is important in order to incorporate the known information and allow for the expected variances. 6.1 Modeling of the finishing time Density estimation. We first show the performance of our basic ADDP mixture model in terms of density estimation, compared to a standard HDP with Gaussian likelihood, see [6] for PLOS ONE DOI: /journal.pone January 28, / 28

12 details. We depict in Fig 4 the empirical histogram for a particular group of runners (we choose all forty year-old male finishers for the plot), together with the overall density estimates, using data for the six considered NYC marathons. We obtain σ x = 5.8 minutes. Both the ADDP and HDP model perform similarly in terms of density estimation. Note that the empirical histogram in Fig 4 presents one narrow peak that is not fully captured by the ADDP nor HDP model. This peak, just under 4 hours, and the valley right afterwards are due to some runners trying to finish (and succeeding) a sub-4-hour marathon. This is a psychological effect that has limited interest for us, since it is not indicative of runners inherent performance. Using cluster-specific values for the variance s 2 x would yield better density estimation, even capturing this peak, but it would fail to provide any comparison between distributions. Here, we are interested in ordering runners into clusters for comparison, which is achieved by having a shared value of s 2 x for all clusters. Impact of age and gender. We now use the age-gender interaction ADDP model described in Subsection 4.2. In addition to its density estimation capacity, the ADDP has an additional descriptive strength, since it can show the impact of age and gender on runners performance straightforwardly through inference of the age delays θ j, gender factor δ and age-gender interaction factors ω j. Fig 5 shows the average proportion of runners in each cluster, as well as the inferred cluster means μ k, for both the ADDP and a standard HDP with Gaussian Fig 4. Density estimation capacities for the basic ADDP model. The histogram corresponds to the population of forty-year-old male runners, which is the largest age-gender group. The red curves are the probability density functions inferred by the basic ADDP model. The blue dotted lines represent the inferred individual clusters. doi: /journal.pone g004 PLOS ONE DOI: /journal.pone January 28, / 28

13 Fig 5. Age proportions per cluster. The HDP is at the top and the age-gender interaction ADDP is at the bottom. Male and female runners are shown together. The colors design the age of the runners, from 18-year-old in blue to 69-year-old in dark red. For the ADDP model, the cluster labels correspond to μ k + θ j, with the value of θ j corresponding to the 28-year-old male runners, the shifts for other age or gender can be found in Table 2. doi: /journal.pone g005 likelihood [6]. Runners aged above 69 are not reported because there are too few of them. The HDP results are not easy to interpret, except for the first three clusters that show the time degradation with age, because we do not know the cumulative percentage as the finishing time is increased. The ADDP model is much easier to understand. The first cluster contains the Olympic quality runners for all ages, if Olympics were held for each age group (less than 1% of the runners). The second cluster has the competitive runners (about 13% of the runners), the third cluster has the standard marathoners (about 33% of all runners), and so on. The x-axis in Fig 5 provides the value of μ k + θ j for 28 year-old male runners and the degradation for other ages and sex is shown in Table 2. Using the plot and the table, we can know the proportion of runners in each group and how much extra time they need compare to the fastest group. Fig 6 shows the value of the inferred cluster means μ k plus and minus one standard deviation, shifted according to θ j, δ, and ω j for both men and women. We only depict the two fastest clusters, and compare the corresponding values of the finishing time with the entry requirements of different marathons and the WMA records. Best performance for females and males is predicted, respectively, at 26 and 28 years old, which is consistent with [33]. The plateau PLOS ONE DOI: /journal.pone January 28, / 28

14 Table 2. Averaged values of θ j for men (or θ j + δ + ω j for women) for all age groups in minutes. The values have been shifted so that θ 28 = 0 for men. This table is useful to directly compare the finishing time of two runners of any age and gender. We only have to calculate the difference of their respective cells in the table and we will get the time penalty that should be considered (in minutes). age men women age men women age men women age men women age men women age men women doi: /journal.pone t002 afterwards illustrates a stable period of performance between the 30 s and 40 s for both genders. All plots behave in a similar way. This is what we meant in the introduction by that the first insights should not be foreign to us, so experts in marathon modeling can take other conclusions as plausible. Now, we focus on what is different. The most striking difference is how the entry levels penalize younger male runners, specially runners under 25. To be fairer to the youngest runners, their entry time should be raised (in about 7 minutes compared to the 30 years-old). The Boston marathon entry level is perfectly aligned with our second cluster for 40+ years old men and almost perfectly match the female second cluster, except for the runners under 23. There is a penalty of about 4 7 minutes for runners aged and between 7 14 minutes for year-old runners. The entry times slightly favor the year-old runners. The London marathon also penalizes excessively runners in their fifties compared to those in their forties and sixties, which seems odd. It is also clear that over 50 (or even 45), the degradation of the finishing times per year is significant enough to merit a finer scale to guarantee entry times (this may also apply for years-old). For example, a runner of 60 years old is doing almost 15 minutes less than a runner of 64 (which is a very long time in any marathon). Finally, the WMA curve for men penalizes the older male runners, while for older female runners it seems to have a similar trend than the first cluster of the ADDP model. For younger runners, the difference between the typical women in the olympic cluster and the WMA is larger than the difference between the typical men in the olympic cluster and the WMA. Fig 7 shows the finishing time gap between women and men. The gap seems to be of about 30 minutes and slightly increasing with age. There are too few runners over 65 for the final decay to be statistically significant. We can come up with two different plausible explanations, but we do not have data to confirm whether this empirical effect is due to any of them or to some other unknown factor. The degradation with age between women and men might be due to physiological factors, i.e., women age differently than men for long distance running, or it PLOS ONE DOI: /journal.pone January 28, / 28

15 Fig 6. Inferred cluster means and entry requirements. Comparison of the inferred cluster means, i.e., μ k + θ j with k 2 {1,2}, with the entry requirements (runners below the curve can qualify) for New York City, Boston and London marathons. (Top) Men. (Bottom) Women. doi: /journal.pone g006 PLOS ONE DOI: /journal.pone January 28, / 28

16 Fig 7. Gender effect on the final performance. Averaged inferred value for the gender coefficients δ + ω j. doi: /journal.pone g007 might be due to socio-economical factors, i.e., women above 30 cannot train as much as men do. Comparison across multiple races. We compare the NYC marathons to the ones in Boston and London, using the H-ADDP model described in Section 4.2. We consider both the 2010 and 2011 marathons, and we split the runners into age groups instead of using their actual age because we do not have this data available for London marathon. Fig 8 shows the inferred values of the per-race weights π rk. The values for μ k +θ j in the x-axis are those of the male runners and the value of σ x = 19 minutes. First, we notice that the values of π rk are quite different for each place, but they show little variation between different years. We can argue that this pattern is mainly due to the race difficulty, assuming a stationary selection of the runners. Boston has the most striking pattern, which can easily be explained by the strict entry requirement time. For years old the entry time is 3h25m and the cluster with 70% of the runners has a mean of 3h19m20s. There is a group of almost 15% of the runners that finish just under 4 hours and about 15% of the runners that do much worse than their qualifying time. This might be due to poor training or having some issue during the race. For the Boston marathons, there are no runners in the 3h cluster and around 1% runners in the fastest cluster. The void in the 3h group is due to the massive proportion of runners in the 3h19m group, which makes any runner in that group to be represented by the 3h19m cluster. The runners under 3h PLOS ONE DOI: /journal.pone January 28, / 28

17 Fig 8. Mixture weights for the H-ADDP mixture model. The figure shows the mixture weights π rk for each race r and cluster k. The legend shows the different races, and the x-axis corresponds to different clusters. doi: /journal.pone g008 are the runners that do much better than the needed qualifying time and cannot be represented by the massive 3h19m group. The proportions in NYC and London are more similar to each other, as both marathons allow runners to enter the race in more ways than just by entry requirement. Being these two races more accessible or democratic, we can consider the proportions in the different clusters closer to the general population of marathon runners. The 3h group is more populous in London than NYC, but the 3h19m and 3h53m clusters contain a larger proportion of NYC runners. The 4h30m group is equally probable in both races. London seems to attract a higher proportion of slower runners (over 5 hours). This difference might be due to the difficulty of the marathons (profile and weather conditions) or the pull of runners. NYC race is more hilly than London, which can explain the difference in the first cluster, but the runners in NYC are more diverse (coming from different parts of the country and world), while London attracts more local runners. This might also explain the pattern of the slowest athletes. Comparison taking into account the speed. We now apply the model extension in Section 4.2 with cluster-dependent shift delays θ jk. Figs 9 and 10 show the inferred cluster means for men and women respectively. Although the overall shape of the curves is quite similar across clusters, which validates our previous conclusions, there are some noticeable differences for the fastest runners and we concentrate on those. PLOS ONE DOI: /journal.pone January 28, / 28

18 Fig 9. Inferred cluster means μ k + θ jk for men. We have used the extended model with speed-dependent clusters. The legend shows the inferred value of the proportions π k for each cluster. doi: /journal.pone g009 The most interesting difference is the behavior of Clusters 1 and 2. At a first glance, it seems to the naked eye that women are faster than men. Cluster 1 captures the Olympic female runners that are doing under 2h45m, and its proportion is very low (less than 1%). Cluster 2 for women covers those runners doing under 3h45m and it represents 13% of the female population. There is a significant difference between Olympic female runners and competitive female runners, so two clusters are needed. For men under 50, Cluster 2 represents 13% of the runners and it captures those doing under 3h30. The model considers that the Olympic runners can be modeled by the tail of the distribution of competitive runners, i.e., Cluster 2, without requiring a new cluster as the Olympic women need. Male runners over 50 behave as women do, and two clusters are needed to separate the Olympic and the competitive groups. For men under 50, Cluster 1 sits in between the two populous clusters, becoming irrelevant in terms of density estimation. It appears because the model forces the same proportion for all clusters across age and gender groups. In order to support this conclusion we have depicted the histogram of the 28-year-old runners together with the inferred density in logarithm scale in Figs 11 and 12. In those plots, we can see that the Olympic male runners just doing over 2 hours can be modeled by the same cluster as the competitive male runners, while the finishing time for the Olympic female runners could not be explained by the competitive female ones, PLOS ONE DOI: /journal.pone January 28, / 28

19 Fig 10. Inferred cluster means μ k + θ jk for women. We have used the extended model with speed-dependent clusters. The legend shows the inferred value of the proportions π k for each cluster. doi: /journal.pone g010 and hence a specific cluster is needed. In previous sections, θ j was not allowed to vary with k. The cluster for the Olympic males was then visible, but this is an effect of forcing the same value of θ j for all clusters, since women and older male runners need such cluster. This is the only significant difference when we replace θ j with θ jk. In short, Cluster 1 (with low weight) is describing a different local feature for men under 50 than for men over 50 and women of all ages, and the Olympic male runners below 50 are found in the tail of Cluster 2 instead of Cluster 1 as in the other age/gender groups. There may be several explanations for this effect. Here we present some hypothesis that should be confirmed with further studies. First, in the NYC marathon the Olympic women run by themselves in an early wave, while the Olympic men start at the same time as everyone else, so competitive men can try to follow them. However, this does not explain the need for a cluster for fast male runners over 50. Second, we can also hypothesize that female Olympic runners have a training that is significantly different from competitive female runners, while for male runners there is a continuum in the training between Olympic and competitive runners. This could also apply for male over 50, in which there are not that many doing Olympic finishing times and competitive runners are not as strong. Finally, another plausible hypothesis is that younger male runners are more risky than female and older male runners. Those that succeed PLOS ONE DOI: /journal.pone January 28, / 28

20 Fig 11. Density estimation for 28-year-old male runners. We have used the cluster-dependent ADDP model, where cluster positions are speeddependent. The blue solid line represents the inferred distribution and the green dash-dotted line is the normalized histogram with 6-minute bins. Black and red lines correspond to the individual Gaussian components that define the inferred density, weighted by their averaged weight. doi: /journal.pone g011 do a better time and close the gap between the Olympic and competitive runners. Older males and female competitive runners do not follow such a risky approach and therefore they do not close the gap with the Olympic runners. There is some evidence on this risky hypothesis in Section 6.2, in which we see that the clusters with reckless running patterns are mainly populated by younger males. 6.2 Modeling latent running patterns Here, we consider temporal sequences of time measurements every 5 km, and at half and full marathon, as explained in Section 5. Running patterns. In this section, we use the data from NYC marathons, with 194,778 runners. We discarded data from the 2006 marathon because we observed that intermediate measurements were not fully synchronized with the half and full marathon times. After applying our HDP model, we found a modal value of 46 clusters. In Fig 13 we show the twelve most populated clusters, which account for around 90% of the population on average. The remaining clusters do not behave significantly different than the ones we show in this section. For clarity, we do not directly plot the time proportion spent at each interval, but instead PLOS ONE DOI: /journal.pone January 28, / 28

Legendre et al Appendices and Supplements, p. 1

Legendre et al Appendices and Supplements, p. 1 Legendre et al. 2010 Appendices and Supplements, p. 1 Appendices and Supplement to: Legendre, P., M. De Cáceres, and D. Borcard. 2010. Community surveys through space and time: testing the space-time interaction