Machine Learning for Stock Selection

Robert J. Yan
Computer Science Dept., The University of Western Ontario
jyan@csd.uwo.ca

Charles X. Ling
Computer Science Dept., The University of Western Ontario
cling@csd.uwo.ca

ABSTRACT
In this paper, we propose a new method called Prototype Ranking (PR) designed for the stock selection problem. PR takes into account the huge size of real-world stock data and applies a modified competitive learning technique to predict the ranks of stocks. The primary target of PR is to select the top performing stocks among many ordinary stocks. PR is designed to perform learning and testing in a noisy stock sample set where the top performing stocks are usually the minority. The performance of PR is evaluated by a trading simulation on real stock data. Each week the stocks with the highest predicted ranks are chosen to construct a portfolio. In the period 1978-2004, PR's portfolio earns a much higher average return as well as a higher risk-adjusted return than Cooper's method, which shows that the PR method leads to a clear profit improvement.

Categories and Subject Descriptors
I.5 [PATTERN RECOGNITION]

General Terms
Algorithms

Keywords
Stock selection

1. INTRODUCTION
Recently a considerable amount of work has been devoted to predicting stocks based on machine learning techniques (e.g., [1;3;6]). These methods use a set of training samples to generate an approximation of the underlying function of the data. Compared with statistical methods, machine learning methods do not involve assumptions about sample independence or special distributions [7]. Such assumptions may not always be met in real-world situations, to which machine learning methods are designed to adapt.

In this paper, we investigate the issue of stock selection to form a portfolio with high return. In a real-world trading environment, given a set of stocks, how can we select the best ones? This task involves predicting a ranking of the stocks and choosing the top ones to form the portfolio.
The usual categorical prediction systems (e.g., price/return trend prediction [3], which only predicts the direction of the price movement rather than the expected price) are not appropriate for this task. For instance, we do not know how to select the 5 best stocks if the system predicts that 20 stocks will move upward. Therefore, the task of stock selection needs a continuous prediction system. All the stock price/return prediction methods (e.g., linear regression) are continuous systems. However, they may still lead to unreliable results. When it comes to individual stock prediction, the majority of previous methods (e.g., [6]) select the model that achieves the maximum overall prediction accuracy (e.g., the minimum sum of squared deviations from actual outputs) for all stocks. However, in the case of stock selection, where the goal is to form a portfolio from the best stocks, we only care about the top performing stocks. Thus, the model optimized for all stocks may not be suitable for our task.

We propose a new method, namely Prototype Ranking (PR), that is based on competitive learning [5]. PR is designed for the stock selection task rather than the individual stock prediction task. The overall prediction accuracy is no longer the primary objective during the model search. Instead, PR tries to learn a network of prototypes, where the prototypes are super points that each represent a group of nearby training samples, and the whole network can be considered as a model. This network has a better chance to distinguish the top performing stocks from ordinary stocks.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'07, August 12-15, 2007, San Jose, California, USA. Copyright 2007 ACM 978-1-59593-609-7/07/0008...$5.00.
PR is applied to samples of NYSE and AMEX individual stocks over the period 1978 to 2004. The experimental results show that PR is robust in short-term stock selection, and that its performance is better than Cooper's traditional selection method [2] after transaction costs.

Section 2 defines the task of stock selection. Section 3 introduces the process of PR learning and testing. The experimental results are shown in Section 4. A conclusion is given in Section 5.

2. DEFINING STOCK SELECTION TASK
In this section, we discuss the formulation of the stock selection task and its evaluation. We assume that trading days (when the market is open) are divided into weeks of five days labeled by the index $t$. The task of stock selection is to find the $n$ best performing stocks in the set of stocks that we choose for week $t$, given only the information available at the start of week $t$. In order to formulate stock selection as a machine learning task we need to specify the following entities:

The training sample set of week $t$, with $N$ samples $S(j,t)$, $j = 1, \dots, N$. Note that each sample in the training sample set is associated with a specific week prior to week $t$.

A sample $S(j,t) = (X(j,t), RR(j,t))$, where $S$ lies in the sample space, $X$ is the predictor vector, and $RR$ is the stock's real return. There exists an underlying function $f$ such that $f(X(t)) = RR(t)$.

A separate testing sample set of week $t$, with $M$ samples $S(j,t) = (X(j,t), RR(j,t))$, $j = 1, \dots, M$.

As in a typical machine learning process, a ranking function $g$ that approximates $f$ is learned from the training sample set by a specific algorithm. The rank of a testing sample $j$ in the testing sample set is then predicted by $Rank(j,t) = g(X(j,t))$. After all the testing samples are assigned predicted ranks, the $n$ stocks with the highest/lowest ranks are selected to form the portfolio of week $t$. This process is repeated from the first testing week $t_s$ to the last testing week $t_e$.

Such a stock selection task depends on two key decisions: How do we find $g$? What choice do we make for the predictor vector? We discuss how to use the competitive-learning-based method PR to find the ranking function $g$ in Section 3.1. For the predictor vector, we follow Cooper [2] in the choice of predictors; this is discussed in Section 4.1.

3. PROTOTYPE RANKING
In this section, we discuss the algorithm of the PR method, which consists of a training process and a testing process. PR applies a modified competitive learning method to learn a ranking function $g$ from the training sample set and generates predicted ranks for the testing samples.

A quick review of traditional competitive learning is as follows. A competitive learning model (network) consists of $H$ prototypes $p_1, p_2, \dots, p_H$. Prototypes can be thought of as super points that represent a group of actual training samples around them in the input space. Each prototype $p_i$ has an associated reference vector $w_i$ in the input space. The general competitive learning process can be described as follows:

1. Initialize the set of prototypes by randomly choosing $w_i$ for each $p_i$.
2. For each training sample $S$, calculate the distance from $S$ to each $w_i$ and choose the one or several closest prototypes (winners).
3. Adapt the reference vectors of the winners towards $S$: $w_i(t+1) = w_i(t) + \varepsilon(t)\,(S - w_i(t))$, where $\varepsilon(t)$ is the learning rate.

Competitive learning algorithms are widely used for clustering analysis [5] and feature mapping [4].
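The three steps of the traditional competitive learning review above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PR algorithm: the function name `competitive_learning`, the single-winner rule, the constant learning rate, the number of epochs, and the choice to initialize each reference vector from a randomly picked sample are all hypothetical choices made here for the example.

```python
import math
import random


def competitive_learning(samples, num_prototypes, learning_rate=0.1, epochs=5, seed=0):
    """Winner-take-all competitive learning; returns the reference vectors."""
    rng = random.Random(seed)
    # Step 1: initialise each reference vector w_i from a randomly chosen sample
    # (an assumption; the text only says the w_i are chosen randomly).
    w = [list(s) for s in rng.sample(samples, num_prototypes)]
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for j in order:
            s = samples[j]
            # Step 2: the winner is the prototype closest to the sample.
            winner = min(range(num_prototypes), key=lambda i: math.dist(w[i], s))
            # Step 3: w_i <- w_i + eps * (S - w_i), applied to the winner only.
            w[winner] = [wi + learning_rate * (si - wi) for wi, si in zip(w[winner], s)]
    return w


# Two well-separated synthetic clusters (made-up data, not stock samples).
data_rng = random.Random(1)
data = [(data_rng.gauss(0.0, 0.1), data_rng.gauss(0.0, 0.1)) for _ in range(50)]
data += [(data_rng.gauss(5.0, 0.1), data_rng.gauss(5.0, 0.1)) for _ in range(50)]
prototypes = sorted(competitive_learning(data, num_prototypes=2))
print(prototypes)
```

On this toy data each of the two prototypes should settle near one cluster centre, which is the "super point" behaviour described above. Adapting several winners per sample, or decaying the learning rate, would be natural variations.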
3.1 PR Training
As shown in Figure 2, PR training consists of the following three steps.

(1). Data preparation. The raw stock data is converted into samples. For each week, samples are divided into training samples and testing samples.

(2). Training the prototype tree. Traditional competitive learning defines a mapping from the input data to a single prototype network. A modified competitive learning algorithm is introduced in this paper, which maps the input data into multiple prototype networks arranged in a tree structure. We call these networks a prototype tree. Figure 1 shows an example of a two-dimensional prototype tree with depth 3. In the PR algorithm, an initial complete k-ary prototype tree of depth L is first created. Each node represents a fixed prototype in the predictor space, which is a subspace of the input sample space. Nodes at the same depth are distributed uniformly to compose a network. The training process maps a training sample set to a prototype tree. For each training sample $S(j,t)$, PR searches for its nearest prototypes (winners) on each tree level $m$. Those winning prototypes are then adapted to $S(j,t)$. Note that in PR, the search for winners is performed in the predictor space instead of the entire sample space, because the prediction task needs patterns in the predictor space. At the end of this step, we obtain a trained prototype tree that reflects the patterns in the training samples.

(3). Optimizing the trained prototype tree into a ranking model for the minority samples. This step first prunes the redundant prototypes. If all the children of a prototype are similar to each other, they can be replaced by their parent prototype without information loss. Since the majority of stocks in a stock dataset are ordinary, most prototypes in the tree are trained to be ordinary. Such a tree tends to give ordinary predictions, which are of little use to us. Pruning dramatically decreases the number of ordinary prototypes. After pruning, the tree has a better chance to generate extreme predictions. However, a single prediction is not what we need. To make the pruned tree predict relative relations among stocks, we assign each prototype an expected rank.
By doing so, the pruned tree is converted into a ranking model. When it is used for prediction, testing samples that are close to prototypes with high expected ranks obtain high predicted ranking scores.

Figure 1. Illustration of a 2D prototype tree

3.2 PR Testing
The idea of PR testing is to assign each testing sample a predicted ranking score. Inside the ranking model obtained from training, there are a number of prototypes with expected ranks distributed in the predictor space. Since prototypes always represent the nearby samples, the rank of a testing sample should be close to the ranks of its neighbouring prototypes. Therefore, we can apply kernel regression [10] to calculate the predicted rank of a testing sample. The testing stocks with the highest/lowest predicted ranking scores are selected to form a portfolio. The real return of this portfolio is then evaluated as a measure of the performance of PR.

In summary, the PR method has several properties:

It has the ability to process real-world stock datasets. To adapt to new data, the model must be renewed every week. Considering the huge size of real-world stock datasets, batch methods that use all the available data to build a new model each week become impractical. Instead, PR adopts an on-line update mechanism: it uses only the latest data to update the old model.

The PR method takes the properties of stock data into account. By applying prototypes, PR can handle data noise and data imbalance (i.e., there are many more samples belonging to one category than another).

The PR method does not predict individual stock returns or prices. The goal of PR is to generate ranking scores. The ranking score can be considered as the relative price/return and is more predictable than the individual price/return [8].

Figure 2. The framework of PR. Training: (1) data preparation, (2) training the prototype tree, (3) optimizing the prototype tree for ranking. Testing: (4) predicting sample ranks.

4. EXPERIMENTS
In this section, some empirical results are discussed. In Section 4.1, we first introduce the data used in the experiments. The procedure of the experiments as well as the measurement is discussed in Section 4.2. The following sections present the results of the experiments that we designed to evaluate the PR method.

4.1 Data
The data come from the database of the Center for Research in Security Prices (CRSP). We examine all samples of NYSE and AMEX individual stocks over the period December 1962 to December 2004. The stock universe we study is revised monthly.
It consists of the 300 NYSE and AMEX stocks that have the largest market capitalization. In all, 504 different stocks were chosen. We convert the daily data into five-trading-day weekly data. In a given week, we omit any stock that has missing volume or price information for any of the previous ten days. Samples in the weekly data set have the same format, $S(j,t) = (X(j,t), RR(j,t))$, where $t$ is the index of the week and $j$ is the stock's permanent number. The predictor vector $X(j,t)$ contains three predictors:

Predictor 1: $x(1,j,t)$ = the return of stock $j$ for week $t-1$.
Predictor 2: $x(2,j,t)$ = the return of stock $j$ for week $t-2$.
Predictor 3: $x(3,j,t)$ = the volume ratio, defined as $(V_1 - V_2)/(V_1 + V_2)$, where $V_1$, $V_2$ are the values of the volume for stock $j$ for weeks $t-1$, $t-2$.

Compared with the volume ratio Cooper used, which is $(V_1 - V_2)/V_1$, our volume ratio leads to a more symmetric distribution of values.

4.2 Procedure
The PR method is evaluated over the time period from the first week of 1978 to the last week of 2004. We apply PR to the training sample set of week $t$ to train a model and then make predictions for the stocks in the testing sample set of week $t$. To evaluate the performance of PR, we need to compare the predicted results with the real results. As mentioned in Section 1, overall criteria (e.g., the sum of squared errors) are not appropriate. What we need to evaluate is the effectiveness of the algorithm, that is, whether or not the stocks chosen by PR are profitable. Clearly, this can be evaluated by checking the real return of the portfolio formed from the chosen stocks.

In this paper we use a simple portfolio formation scheme. Each week we form a neutral portfolio consisting of $n$ stocks long and $n$ stocks short. The long (short) stocks are those with the highest (lowest) ranks. Each stock has equal weight (except when several stocks are tied for last place, in which case all those stocks are chosen with equal reduced weight). The average return of these portfolios over the testing time period, denoted by ARP, is what we study.

The PR method aims to minimize the danger of data snooping. Data snooping occurs when a given set of data is used more than once for purposes of inference or model selection [9].
Therefore, the learning parameters must be decided prior to the testing time period. In this experiment, PR searches for the optimal values of its parameters in the time period from 1963 to 1977 and performs learning and testing in 1978-2004. The optimal parameter values are d=4, $\varepsilon(t)$=0.9, k=9, s=4.5, and T=0.8.

We design two experiments for different evaluation purposes. Experiment 1 tests the predictability of the PR method. Experiment 2 compares the PR method with Cooper's method both before and after transaction costs. In these experiments, we divide the testing period into two (1978-1993 and 1994-2004), because 1978-1993 is the period Cooper used for his tests, so we can obtain a direct comparison.

4.3 Experiment 1
The predictability of PR can be evaluated by comparing the returns of the different portfolios it constructs in the same time period. Given a week $t$, two portfolios $P_1$, $P_2$ are constructed. $P_1$ has $2n_1$ stocks and $P_2$ has $2n_2$ stocks. We denote the expected return and the real return of a portfolio $P$ in week $t$ as $\widehat{RP}_t(P)$ and $RP_t(P)$ respectively. Naturally, if a portfolio performs as predicted, the algorithm that generates the portfolio is considered to have predictability. The condition of predictability can be defined as follows. Assume that an algorithm predicts that $P_1$ is better than $P_2$, which means that $\widehat{RP}_t(P_1) > \widehat{RP}_t(P_2)$. If $(\widehat{RP}_t(P_1) > \widehat{RP}_t(P_2)) \Rightarrow (RP_t(P_1) > RP_t(P_2))$, then this algorithm has predictability in week $t$.

However, PR does not really calculate the expected return of a portfolio. PR always chooses the stock with the highest (lowest) rank, and the chosen stock always has the highest expected return in the set of remaining stocks. The more stocks involved in a portfolio, the lower its expected return. Therefore we may change the condition of predictability to: if $(n_1 < n_2) \Rightarrow (RP_t(P_1) > RP_t(P_2))$, then PR has predictability in week $t$. Similarly, the condition of predictability in a certain time period is defined as follows: if $(n_1 < n_2) \Rightarrow (ARP(P_1) > ARP(P_2))$, then PR has predictability in this time period.

However, the above condition only works in a pure dataset with no noise. Given a real-world stock dataset, PR generates two portfolios, $P_1$ with $2n_1$ stocks and $P_2$ with $2n_2$ stocks. If $n_1$ and $n_2$ are too close (e.g., $n_1 = 5$ and $n_2 = 6$), then even if PR has a certain level of predictability, the above condition may still be violated because of the heavy noise in the training samples. To correctly reflect PR's predictability in the noisy environment, the difference between $n_1$ and $n_2$ should be large enough to tolerate the noise. We always set $n_2 - n_1 \ge 5$.
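The portfolio formation scheme of Section 4.2 and the period-level predictability condition can be sketched together as follows. This is a minimal illustration, not the authors' implementation: the function names, the tie handling (ties are ignored), and the return convention (averaging over the 2n equally weighted positions, with short positions earning the negated stock return) are assumptions made for this example, and the data are synthetic.

```python
from statistics import mean


def long_short_return(pred_rank, real_return, n):
    """Real return of an equal-weight neutral portfolio: long the n stocks with
    the highest predicted rank, short the n with the lowest (ties ignored)."""
    order = sorted(range(len(pred_rank)), key=lambda j: pred_rank[j])
    short_leg, long_leg = order[:n], order[-n:]
    long_ret = mean(real_return[j] for j in long_leg)
    short_ret = mean(real_return[j] for j in short_leg)
    # Assumed convention: average over the 2n positions, shorts earn -return.
    return 0.5 * (long_ret - short_ret)


def average_returns(weeks, sizes):
    """ARP for each n in sizes; weeks is a list of (pred_rank, real_return)."""
    return {n: mean(long_short_return(p, r, n) for p, r in weeks) for n in sizes}


def has_predictability(arp):
    """Check (n1 < n2) => ARP(P1) > ARP(P2) on adjacent portfolio sizes."""
    sizes = sorted(arp)
    return all(arp[a] > arp[b] for a, b in zip(sizes, sizes[1:]))


# Noise-free toy week: the real return increases with the predicted rank.
ranks = list(range(100))
returns = [r / 100 for r in ranks]
arp = average_returns([(ranks, returns)], sizes=[5 * i for i in range(1, 11)])
print(has_predictability(arp))
```

With sizes n = 5, 10, ..., 50 the check covers nine adjacent pairs. On real, noisy data the condition can fail for sizes that are too close together, which is why the sizes here are spaced 5 apart.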
In this experiment, for both time periods 1978-1993 and 1994-2004, PR generates ten portfolios with different stock numbers $2n$ ($n = 5i$, $i = 1, \dots, 10$). We compare these portfolios and present the results in Figure 3. For each time period, the predictability condition has been tested in 9 cases. In all the cases, the condition is satisfied. In both periods, the average return of the portfolio increases steadily as $n$ decreases from 50.

In addition, we calculate the return difference $d_{b/w}$ between the return of the expected best portfolio and the return of the expected worst portfolio. $d_{b/w}$ represents the level of predictability in a way:

$d_{b/w} = RP(P_i) - RP(P_j)$, where $P_i = \arg\max_P \{\widehat{RP}(P)\}$ is the portfolio with the highest expected return and $P_j = \arg\min_P \{\widehat{RP}(P)\}$ is the portfolio with the lowest expected return.

We may rewrite the equation as follows:

$d_{b/w} = RP(P(n=5)) - RP(P(n=50))$

In 1978-1993, $d_{b/w}$ is 1.0% and in 1994-2004, $d_{b/w}$ is 0.7%. Both are significant differences. All these results show strong evidence of the predictability of PR over 1978-1993 and 1994-2004.

Figure 3. The overall predictability of PR

4.4 Experiment 2
In this paper, we focus on short-term stock selection based only on historical return and volume information. Cooper [2] investigated the same problem and proposed a method (CP). In the learning phase, Cooper first divides the historical distribution of each predictor's values into deciles. Using the decile boundary values, the three-dimensional predictor space is partitioned into 1000 cells, with each cell assigned the average one-week return of all stocks in it. In the testing phase, the average return of a cell is used as the predicted return of the testing samples belonging to this cell. CP involves no machine learning techniques. The comparison between PR and CP will show us whether the stock selection task benefits from applying machine learning techniques.

We apply both PR and CP to the same stock data set using the same procedure discussed in Section 4.2. For each week from 1978 to 2004, each method forms three weekly portfolios with 10, 20, and 30 stocks respectively. Table 1 reports their performance in 1978-1993 and 1994-2004. In all cases, PR earns a higher average return than CP. The average margin of the three PR returns over the three CP returns is 76.3% in 1978-1993 and 58.4% in 1994-2004. We also compare the risk-adjusted portfolio performance, which is usually measured by the Sharpe Ratio (return/std.). Table 1 shows that in this case PR also outperforms CP. For example, the Sharpe Ratios of the three PR portfolios over 1978-1993 are 0.51, 0.52, and 0.52 respectively. In contrast, the Sharpe Ratios of the CP portfolios are 0.32, 0.38, and 0.37 over the same time period. The comparison in 1994-2004 shows similar results.

The above experiment compares the predictability of PR and CP and shows that PR generates more accurate and stable predictions. We also evaluate whether PR's predictions are more profitable than CP's predictions under transaction costs. The final investment value (FIV) of the portfolios under transaction costs is used as the measure. Although estimating the real transaction costs of each trade is difficult, it is reasonable to suppose that the costs for CP and PR would have been similar. For technical convenience, we follow Cooper [2] in setting the round-trip cost levels $c_l$ used to evaluate the after-cost performance of both methods:

$c_l$ = 0.25% for $l = 1$ (low transaction costs)
$c_l$ = 0.5% for $l = 2$ (medium transaction costs)
$c_l$ = 0.75% for $l = 3$ (high transaction costs)

Table 1. The Performance Comparison: PR vs. CP

Portfolio   Performance       1978-1993        1994-2004
                              PR      CP       PR      CP
10-stock    Ave. Return (%)   1.69    0.89     1.31    0.81
            STD (%)           3.3     2.8      6.2     5.1
            Sharpe Ratio      0.51    0.32     0.21    0.16
20-stock    Ave. Return (%)   1.35    0.80     1.32    0.81
            STD (%)           2.6     2.1      5.1     4.3
            Sharpe Ratio      0.52    0.38     0.26    0.19
30-stock    Ave. Return (%)   1.14    0.67     1.16    0.77
            STD (%)           2.2     1.8      4.6     3.5
            Sharpe Ratio      0.52    0.37     0.27    0.22

We compare the 10-stock portfolios of PR and CP, which represent their best profitability. Considering that investing is a continuous process, we do not split the testing period. Therefore we calculate the FIV of the portfolios in 2004 under different transaction costs, assuming that investors start off with \$1 in 1978 and reinvest the portfolio income every week. These results are shown in Table 2.

Table 2. The FIV Comparison: PR vs. CP

Transaction Costs   FIV of PR (2004) (\$)   FIV of CP (2004) (\$)
Low                 6E5                     256.5
Medium              77.7                    0.22
High                1.43                    0

For both methods, the profit drops dramatically as the transaction costs increase from 0.25% to 0.75%. At the same cost level, PR always outperforms CP. At the low cost level, the FIVs of PR and CP in 2004 are \$6E5 and \$256, respectively. In the cases of medium and high transaction costs, PR portfolios are still profitable: the FIVs of PR are \$77.7 (medium) and \$1.43 (high). In contrast, the profit of the CP portfolios disappears under medium or high transaction costs. As we expected, PR survives a higher level of costs than CP and shows better profitability.

5. CONCLUSION
This paper proposed a machine learning method called Prototype Ranking (PR) for short-term stock prediction. The goal of the PR method is to select the $n$ best performing stocks from a stock set based on the ranking function $g$ learned from the historical stock data.
PR applies a modified competitive learning technique, which is designed for discovering models in a noisy and imbalanced environment. In the testing phase, each testing sample is assigned a predicted ranking score and the stocks with the highest/lowest ranks are selected to form a portfolio. The experimental results show strong evidence of the predictability of PR. In addition, PR outperforms CP, which is a non-machine-learning method. This shows the advantage of applying machine learning to short-term stock prediction.

This work can be further improved in two directions. First, given the current predictors, we may apply boosting techniques to improve the accuracy. Second, in this paper we only apply short-term prediction. It is possible to combine short-term prediction with long-term prediction for stock selection.

6. ACKNOWLEDGMENTS
We appreciate the access to the CRSP database provided by the University of Western Ontario via the WRDS system.

REFERENCES
[1] Avramov, D., and Chordia, T. (2006), Predicting stock returns, Journal of Financial Economics 82, 387-415.
[2] Cooper, M. (1999), Filter rules based on price and volume in individual security overreaction, Review of Financial Studies 12, 901-935.
[3] Edwards, R.D., and Magee, J., Technical Analysis of Stock Trends (Amacom Books, 1997).
[4] Fritzke, B. (1994), Growing Cell Structures - A Self-Organizing Network for Unsupervised and Supervised Learning, Neural Networks 7, 1441-1460.
[5] Fritzke, B., Some competitive learning methods, 1997.
[6] Hamid, S.A., and Iqbal, Z. (2004), Using neural networks for forecasting volatility of S&P 500 Index futures prices, Journal of Business Research 57, 1116-1125.
[7] Hastie, T., Tibshirani, R., and Friedman, J.H., The Elements of Statistical Learning (Springer, 2003).
[8] Hellstrom, T. (2001), Optimizing the Sharpe Ratio for a Rank Based Trading System, EPIA 2001, LNAI 2258, 130-141.
[9] Sullivan, R., Timmermann, A., and White, H. (1999), Data-Snooping, technical trading rule performance, and the bootstrap, The Journal of Finance 54, 1647-1691.
[10] Wolberg, J.R., Expert trading systems: modeling financial markets with kernel regression (Wiley, New York, 2000).