SHRiMP: Accurate Mapping of Short Color-space Reads

Stephen M. Rumble 1,2, Phil Lacroute 3,4, Adrian V. Dalca 1, Marc Fiume 1, Arend Sidow 3,4, Michael Brudno 1,5,*

1 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
2 Department of Computer Science, Stanford University, Stanford, California, United States of America
3 Department of Genetics, Stanford University, Stanford, California, United States of America
4 Department of Pathology, Stanford University, Stanford, California, United States of America
5 Banting & Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada

Abstract

The development of Next Generation Sequencing technologies, capable of sequencing hundreds of millions of short reads (25–70 bp each) in a single run, is opening the door to population genomic studies of non-model species. In this paper we present SHRiMP, the SHort Read Mapping Package: a set of algorithms and methods to map short reads to a genome, even in the presence of a large amount of polymorphism. Our method is based upon a fast read mapping technique, separate thorough alignment methods for regular letter-space as well as AB SOLiD (color-space) reads, and a statistical model for false positive hits. We use SHRiMP to map reads from a newly sequenced Ciona savignyi individual to the reference genome. We demonstrate that SHRiMP can accurately map reads to this highly polymorphic genome, while confirming high heterozygosity of C. savignyi in this second individual. SHRiMP is freely available at http://compbio.cs.toronto.edu/shrimp.

Citation: Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, et al. (2009) SHRiMP: Accurate Mapping of Short Color-space Reads. PLoS Comput Biol 5(5): e1000386. doi:10.1371/journal.pcbi.1000386

Editor: Wyeth W. Wasserman, University of British Columbia, Canada

Received January 8, 2009; Accepted April 9, 2009; Published May 22, 2009

Copyright: © 2009 Rumble et al.
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was sponsored by Natural Sciences and Engineering Research Council (NSERC) of Canada Undergraduate Student Research Awards, Canadian Institute for Health Research (CIHR), Applied Biosystems, NSERC Discovery Grant, MITACS, and a Canada Foundation for Innovation equipment grant. Computational resources were provided by the Stanford BioX2 compute cluster, supported by NSF award CNS-0619926. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: brudno@cs.toronto.edu

Introduction

Next generation sequencing (NGS) technologies are revolutionizing the study of variation among individuals in a population. The ability of sequencing platforms such as AB SOLiD and Illumina (Solexa) to sequence one billion basepairs (gigabase) or more in a few days has enabled the cheap re-sequencing of human genomes, with the genomes of a Chinese individual [1], a Yoruban individual [2], and matching tumor and healthy samples from a female individual [3] sequenced in the last few months. These resequencing efforts have been enabled by the development of extremely efficient mapping tools, capable of aligning millions of short (25–70 bp) reads to the human genome [4–10]. In order to accelerate the computation, most of these methods allow for only a fixed number of mismatches (usually two or three) between the reference genome and the read, and usually do not allow for the matching of reads with insertion/deletion (indel) polymorphisms. These methods are extremely effective for mapping reads to the human genome, most of which has a low polymorphism rate, and so the likelihood that a single read spans multiple SNPs is small.
While matching with up to a few differences (allowing for a SNP and 1–2 errors) is sufficient in these regions, these methods fail when the polymorphism level is high.

NGS technologies are also opening the door to the study of population genomics of non-model individuals in other species. Various organisms have a wide range of polymorphism rates, from 0.1% in humans to 4.5% in the marine ascidian Ciona savignyi. The polymorphisms present in a particular species can be used to discern its evolutionary history and understand the selective pressures in various genomic loci. For example, the large amount of variation in C. savignyi (two individuals' genomes are as different as Human and Macaque) was found to be due to a large effective population size [11]. The re-sequencing of species like C. savignyi (and regions of the human genome with high variability) requires methods for short read mapping that allow for a combination of several SNPs, indels, and sequencing errors within a single (short) read. Furthermore, due to large-scale structural variation, only a fraction of the read may match to the genome, necessitating the use of local, rather than global, alignment methods.

Previous short read mapping tools typically allow for a fixed number of mismatches by separating a read into several sections and requiring some number of these to match perfectly, while others are allowed to vary [4,6,8]. An alternative approach generates a set of subsequences from the read (often represented as spaced seeds [7,10,12]), again in such a manner that if a read were to match at a particular location with some number of mismatches, at least one of the subsequences would match the genome [5,9]. While these methods are extremely fast, they were developed for genomes with relatively low levels of polymorphism, and typically cannot handle a highly polymorphic, non-model genome. This becomes especially apparent when working with data from Applied Biosystems' SOLiD sequencing platform (AB SOLiD).
AB SOLiD uses a di-base sequencing chemistry that generates one of four possible calls (colors) for each pair of nucleotides. While a sequencing error is a change of one color-call to another, a single SNP will change two adjacent color positions. Hence a read with two (non-adjacent) SNPs and a sequencing error will differ from the reference genome in five different positions. Simultaneously, the nature of the di-base sequencing code allows for the identification (and correction) of sequencing errors, so by carefully analyzing the exact sequence of matches and mismatches within a read, it is possible to determine that the read and the genome differ by two SNPs. While efficient mappers for color-space sequences have been developed [5,13], they translate the genome to color-space, and directly compare to the color-space read. The complexity of the color-space representation makes the identification of complex variation such as adjacent SNPs and short indels challenging or impossible with these tools.

PLoS Computational Biology | www.ploscompbiol.org | May 2009 | Volume 5 | Issue 5 | e1000386

Author Summary

Next Generation Sequencing (NGS) technologies are revolutionizing the way biologists acquire and analyze genomic data. NGS machines, such as Illumina/Solexa and AB SOLiD, are able to sequence genomes more cheaply by 200-fold than previous methods. One of the main application areas of NGS technologies is the discovery of genomic variation within a given species. The first step in discovering this variation is the mapping of reads sequenced from a donor individual to a known ("reference") genome. Differences between the reference and the reads are indicative either of polymorphisms, or of sequencing errors. Since the introduction of NGS technologies, many methods have been devised for mapping reads to reference genomes. However, these algorithms often sacrifice sensitivity for fast running time. While they are successful at mapping reads from organisms that exhibit low polymorphism rates, they do not perform well at mapping reads from highly polymorphic organisms. We present a novel read mapping method, SHRiMP, that can handle much greater amounts of polymorphism. Using Ciona savignyi as our target organism, we demonstrate that our method discovers significantly more variation than other methods. Additionally, we develop color-space extensions to classical alignment algorithms, allowing us to map color-space, or di-base, reads generated by AB SOLiD sequencers.
In this paper we develop algorithms for the mapping of short reads to highly polymorphic genomes and methods for the analysis of the mappings. We demonstrate an algorithm for mapping short reads in the presence of a large amount of polymorphism. By employing a fast k-mer hashing step and a simple, very efficient implementation of the Smith-Waterman algorithm, our method conducts a full alignment of each read to all areas of the genome that are potentially homologous. Secondly, we introduce a novel, specialized algorithm for mapping di-base (color-space) reads, which allows for an accurate, non-heuristic alignment of AB SOLiD reads to a reference genome. Finally, we introduce methodology for evaluating the accuracy of discovered alignments. Because a read may match the genome in several locations with variable amounts of polymorphism, we develop a statistical method for scoring the hits, allowing for the selection of the most probable variants, and filtering of false positives.

Our methods are implemented as part of SHRiMP: the SHort Read Mapping Package. To demonstrate the usefulness of SHRiMP we re-sequenced a Japanese Ciona savignyi genome on the SOLiD platform. Preliminary estimates obtained in the course of sequencing the reference genome indicate that the SNP heterozygosity is 4.5%, whereas indel heterozygosity is 16.6%. This species represents the most challenging known test case for the detection of polymorphisms with short read technologies. We aligned the SOLiD reads of the Japanese individual to the C. savignyi reference genome using both SHRiMP and AB's read mapper. SHRiMP is able to identify 5-fold more SNPs than AB's mapper, while also capturing 70,000 indel variants.

Results/Discussion

This section is organized as follows: we begin with three methodological sections, in which we first present an overview of the algorithms used in SHRiMP for mapping short reads, explain our specialized algorithm for alignment of di-base sequencing (AB SOLiD) data, and present our framework for computing p-values and other statistics for alignment quality.
The data flow for these methods is illustrated in Figure 1. In the last two subsections we first show the application of SHRiMP to the resequencing of Ciona savignyi using the AB SOLiD sequencing technology and then present results on the accuracy of the SHRiMP tool on simulated data.

Read Mapping Algorithm

The SHRiMP algorithm draws upon three recent developments in the field of sequence alignment: q-gram filter approaches, introduced by Rasmussen et al. [14]; spaced seeds, introduced by Califano and Rigoutsos [15] and popularized by the PatternHunter family of tools [7,10]; and specialized vector computing hardware to speed up the Smith-Waterman algorithm [16–18] to rapidly find the likely locations for the reads on the genome. Once these locations are identified, we run a thorough, Smith-Waterman-based algorithm to rigorously evaluate the alignments. In this section we provide a brief exposition of the methods used to align short reads in SHRiMP (a more thorough description of each of these steps is in Methods).

Spaced seeds. Most heuristic methods for local alignment rely on the identification of seeds: short exact matches between the two sequences. The advantage of using exact matches is that they are easy to find using hash tables, suffix arrays, or related techniques. While classically seeds have been contiguous matches, more recently spaced seeds, where predetermined positions in the read are allowed not to match, have been shown to be more sensitive. Spaced seeds are often represented as a string of 1s and 0s, where 1s indicate positions that must match, while 0s indicate positions that may mismatch. We refer to the length or span of the seed as the total length of the string, and the weight of the seed as the number of 1s in the string. For example, the seed 11100111 requires matches at positions 1–3 and 6–8, and has length 8 and weight 6. Because seeds with such small weight match extremely often, we require multiple seeds to match within a region before it is further considered, using a technique called q-gram filtering.
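The seed 11100111 and the multiple-hit requirement can be illustrated with a small letter-space sketch. This is not SHRiMP's implementation: the function names (`seed_key`, `seed_hits`, `qgram_filter`) are hypothetical, the genome is scanned naively rather than through an optimized index, and the defaults of two hits within a 41 bp window simply echo the setting reported with the running-time table.

```python
from collections import defaultdict

SEED = "11100111"  # example seed from the text: span 8, weight 6


def seed_key(seq, pos, seed=SEED):
    """Characters of seq under the '1' positions of the seed placed at pos."""
    return "".join(seq[pos + i] for i, bit in enumerate(seed) if bit == "1")


def seed_hits(genome, read, seed=SEED):
    """Genome positions whose spaced k-mer equals some spaced k-mer of the read."""
    span = len(seed)
    read_kmers = defaultdict(list)  # spaced k-mer -> offsets in the read
    for r in range(len(read) - span + 1):
        read_kmers[seed_key(read, r, seed)].append(r)
    return [g for g in range(len(genome) - span + 1)
            if seed_key(genome, g, seed) in read_kmers]


def qgram_filter(hits, min_hits=2, window=41):
    """Keep hit positions that open a window holding at least min_hits hits."""
    hits = sorted(hits)
    kept = []
    for i, start in enumerate(hits):
        in_window = sum(1 for h in hits[i:] if h < start + window)
        if in_window >= min_hits:
            kept.append(start)
    return kept


# The two don't-care positions absorb a mismatch at the fourth base of the read.
same_key = seed_key("ACGTACGT", 0) == seed_key("ACGAACGT", 0)
hits = seed_hits("TTTTACGAACGTTTTT", "ACGTACGT")
```

Only windows returned by `qgram_filter` would be passed on to the more expensive alignment stage; isolated single hits are discarded.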
Q-gram filters. While most older local alignment tools, such as BLAST, use a single matching seed to start a thorough comparison of the strings around the seed, more recently Rasmussen et al. [14] introduced the use of q-gram filters, where multiple seeds are used to determine if a good match exists. This idea is also used in SHRiMP, where we require a predetermined number of seeds from a read to match within a window of the genome before we conduct a thorough comparison.

Vectorized Smith-Waterman. If a particular read has the required number of seeds matching to a window of the genome, we conduct a rapid alignment of the two regions to verify the similarity. This alignment is done using the classical Smith-Waterman algorithm [19], implemented using specialized vector instructions that are part of all modern CPUs. In order to speed up this stage we compute just the score of the optimal alignment, and not the alignment itself. For every read we store the locations of top hits, sorted by their score. The number of top hits to store is a parameter.

Figure 1. Data flow and processing within SHRiMP. Candidate mapping locations are first discovered by the seed scanner and then validated by the vectorized Smith-Waterman algorithm, computing only a score. Top scoring hits are then fully aligned by a platform-specific algorithm (i.e. letter-space for Solexa data and color-space for SOLiD data). Statistical confidence for the final mappings is then computed using the PROBCALC utility. doi:10.1371/journal.pcbi.1000386.g001

Final alignment. After we finish aligning all of the reads to all of the potential locations, we conduct a final, full alignment of each read to all of the top hits. This final alignment stage differs depending on the specifics of the sequencing technology. Within SHRiMP we have implemented separate final alignment modules for Illumina/Solexa data (this is done with the regular Smith-Waterman algorithm) and for color-space (di-base) data produced by the AB SOLiD instrument (described in the next section). Additionally we have an experimental module for alignment of two-pass sequencing data, where two reads are generated from every genomic location, which is described elsewhere [20].

Algorithm for Color-space Alignment

The AB SOLiD sequencing technology introduced a novel di-base sequencing technique, which reads overlapping pairs of letters and generates one of four colors (typically labelled 0–3) at every stage. Each base is interrogated twice: first as the right nucleotide of a pair, and then as the left one. The exact combinations of letters and the colors they generate are shown in Figure 2A. The sequencing code can be thought of as a finite state automaton (FSA), in which each previous letter is a state and each color code is a transition to the next letter state. This automaton is demonstrated in Figure 2B.
It is notable that the sequence of colors is insufficient to reconstruct the DNA sequence, as reconstruction requires knowledge of the first letter of the sequence (or the last letter of the primer, which is fixed for a single run of the technology).

Figure 2. Two representations of the color-space (di-base) encoding used by the AB SOLiD sequencing system. A: The standard representation, with the first and second letter of the queried pair along the horizontal and vertical axes, respectively. B: The equivalent Finite State Automaton representation, with edges labelled with the readouts and nodes corresponding to the basepairs of the underlying genome. doi:10.1371/journal.pcbi.1000386.g002
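The encoding and the automaton walk can be sketched in a few lines. The sketch assumes the standard SOLiD di-base code, which is equivalent to taking the XOR of two-bit base codes (A=0, C=1, G=2, T=3); the function names `encode` and `decode` are illustrative and not taken from SHRiMP.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def encode(dna):
    """Color-space readout of a DNA string: one color per overlapping pair."""
    return [CODE[a] ^ CODE[b] for a, b in zip(dna, dna[1:])]


def decode(first_base, colors):
    """Rebuild the letters from a known first base by walking the automaton."""
    letters = [first_base]
    for color in colors:
        letters.append(BASES[CODE[letters[-1]] ^ color])
    return "".join(letters)


reference = "ATGCA"
colors = encode(reference)               # [3, 1, 3, 1]

# A sequencing error in one color mistranslates every later letter.
with_error = list(colors)
with_error[1] = 0
mistranslated = decode("A", with_error)  # "ATTAC"

# A SNP (G -> T at the third base) changes exactly two adjacent colors.
snp_colors = encode("ATTCA")             # [3, 0, 2, 1]
changed = [i for i, (x, y) in enumerate(zip(colors, snp_colors)) if x != y]
```

Decoding the same colors from a different first base yields a different DNA string, which is why the first letter (or the last primer base) is needed: `decode("A", [0, 0, 0])` and `decode("C", [0, 0, 0])` give poly-A and poly-C respectively.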

Figure 3. Various mutation and error events, and their effects on the color-code readouts. The reference genome is labeled G and the read R. A: A perfect alignment; B: In case of a sequencing error (the 2 should have been read as a 0) the rest of the read no longer matches the genome in letter-space; C: In case of a SNP two adjacent colors do not match the genome, but all subsequent letters do match. However, D: only 3 of the 9 possible color changes represent valid SNPs; E: the rules for deciding which insertion and deletion events are valid are even more complex, as indels can also change adjacent color readouts. doi:10.1371/journal.pcbi.1000386.g003

The AB SOLiD sequencing technology has the remarkable property of differentiating between sequencing errors and biological SNPs (under the assumption that the reference genome has no sequencing errors): a SNP changes two adjacent readouts of the color-space code, while a sequencing error is unlikely to happen at two adjacent positions by chance (the technology does not sequence adjacent letters at adjacent time points). At the same time, however, the color-space code introduces certain complexities.

Let us consider a comparison done by first translating the color-space read code into the letter-space sequence. Notice that a single sequencing error would cause every position after the place of error to be mistranslated (Figure 3B). Consequently, most approaches have translated the letter-space genome into the corresponding color code. However, this is problematic: since the color-coding of every di-base pairing is not unique, a string of colors can represent one of several DNA strings, depending on the preceding base pair. For example, a string of zeroes could be translated as a poly-A, poly-C, poly-G or poly-T string.

There is an additional drawback to translating the genome into color-space code: a sequence of matches and mismatches in color-space does not map uniquely into letter-space similarity. For example, a single SNP results in two sequential color-space mismatches.
However, given two consecutive colors, there are 9 possible ways to generate two mismatches. Of these, only 3 correspond to a SNP, while the rest lead to DNA strings that completely differ from the reference. This is illustrated in Figure 3D.

We propose an alternate approach. Our key observation is that while a color-space error causes the rest of the sequence to be mistranslated, the genome will match one of the other three possible translations. This is illustrated in Figure 4C. Consequently, we adapt the classical dynamic programming algorithm to simultaneously align the genome to all four possible translations of the read, allowing the algorithm to move from one translation to another by paying a crossover, or sequencing error, penalty. If one wishes for a probabilistic interpretation of the algorithm, one can consider the FSA in Figure 2B to be a Hidden Markov Model, where the letter is the hidden state, and the color-space sequence is the output of the model. By taking the cross product of this HMM with the standard pair-HMM associated with the Smith-Waterman algorithm, we can allow all of the typical alignment parameters, including the error penalty, to be probabilistically motivated as the log of the probability of the event, and trained using the Expectation-Maximization algorithm.

It is notable that our approach handles not only matches, mismatches, and sequencing errors, but also indels. Because the sequences are aligned in letter-space (to be precise, they are aligned and translated simultaneously), indels can be penalized using the standard affine gap penalty with no further modification of the algorithm.

Figure 4. Color-space (di-base) sequence alignment. A: The Dynamic Programming (DP) representation, B: recurrences, and C: alignment of a letter-space sequence to a color-space read with a sequencing error. Within the DP matrix we simultaneously align all of the four possible translations (vertical) to the reference genome (horizontal); however the alignment can transition between translations by paying the crossover penalty.
This is illustrated by the fourth recurrence, where the third index (k) corresponds to the translation currently being used. In the alignment (C) after the sequencing error, the original translation of the read (starting from a T) no longer matches, but a different one (starting from a C) does. doi:10.1371/journal.pcbi.1000386.g004
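The idea of aligning all four translations at once can be sketched as a score-only local alignment with a third DP index for the active translation. This is a simplified illustration rather than the recurrences of Figure 4B: it uses a linear gap penalty instead of the affine one, lets an alignment start in any translation at no cost, and the scoring values and the name `colorspace_sw_score` are assumptions made for the example.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # A=0, C=1, G=2, T=3


def encode(dna):
    """Color-space readout: XOR of the codes of each overlapping base pair."""
    return [CODE[a] ^ CODE[b] for a, b in zip(dna, dna[1:])]


def decode(first_base, colors):
    """Translate colors back to letters, starting from a given first base."""
    letters = [first_base]
    for color in colors:
        letters.append(BASES[CODE[letters[-1]] ^ color])
    return "".join(letters)


def colorspace_sw_score(genome, colors, match=2, mismatch=-3, gap=-4, xover=-3):
    """Best local score of a color-space read against a letter-space genome."""
    # Four candidate letter-space translations, one per possible first base;
    # the first base itself is dropped so letter j pairs with color j.
    trans = [decode(b, colors)[1:] for b in BASES]
    n, m = len(genome), len(colors)
    # H[i][j][k]: best alignment ending at genome[i-1] and read letter j-1,
    # with the read currently interpreted through translation k.
    H = [[[0] * 4 for _ in range(m + 1)] for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for k in range(4):
                sub = match if genome[i - 1] == trans[k][j - 1] else mismatch
                # Arriving from another translation costs one crossover.
                diag = max(H[i - 1][j - 1][p] + (0 if p == k else xover)
                           for p in range(4))
                H[i][j][k] = max(0, diag + sub,
                                 H[i - 1][j][k] + gap,   # genome base unmatched
                                 H[i][j - 1][k] + gap)   # read letter unmatched
                best = max(best, H[i][j][k])
    return best


genome = "ACGTTGCAAGGCTTACGA"
clean = encode("TGCAAGGC")   # read over genome[5:12], preceded by a T
noisy = list(clean)
noisy[3] ^= 1                # one miscalled color
```

With these toy parameters the clean read scores seven matches, while the read with one miscalled color loses only the crossover penalty, because the letters after the error match the genome under a different translation.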

In the SHRiMP algorithm, we only apply the special color-space Smith-Waterman algorithm in the final stage. For the initial stages, we convert the genome from letter-space to color-space, and search for k-mer matches as well as perform vectorized Smith-Waterman strictly in color-space. In order to better incorporate SNPs in color-space data, we use a spaced seed that allows for two adjacent mismatching colors between the read and the reference genome.

Computing Statistics for Reads and Mate-pairs

Once all of the reads are mapped, for every read and mate-pair we compute mapping confidence statistics. Initially these are computed for each read; however, they are then combined to compute likelihoods of accidental matches for mate-pairs.

Computing statistics for single reads. While a very thorough statistical theory for local alignments has been established [21], this theory assumes the comparison of infinite length strings, and hence is inappropriate for evaluating alignments of very short reads to a reference genome. Instead, we have designed confidence statistics that explicitly model short reads, and allow for the computation of confidences in the presence of short insertions and deletions. We estimate the confidence in the possible mappings of each read by using the following statistics (calculated by the PROBCALC program): pchance, the probability that the hit occurred by chance, and pgenome, the probability that the hit was generated by the genome, given the observed rates of the various evolutionary and error events. For example, a good alignment would have a low pchance (close to 0) and a very high pgenome (close to 1). In this section we briefly expand on these two concepts, give them mathematical definitions, and merge them to formulate an overall alignment quality measurement. A detailed description is in Methods (Computing Statistics: pchance and pgenome).
The pchance of a hit is the probability that the read will align with as good a score to a genome that has the same length, but random nucleotide composition with equal base frequencies (that is, the read will align as well by chance). In order to compute this, we count all of the possible k-mers with an equal number of changes as observed in the hit, and we call this number Z. For example, if we only have substitutions in our alignment (that is, no indels) and an alignment length of $\ell$, then

$$Z_{subs} = \binom{\ell}{subs}\, 3^{subs}$$

gives the number of unique strings to which the read can align with the specified number of substitutions. A more detailed discussion on the construction of Z, especially for the more complex Z count for indels, appears in Computing Statistics: pchance and pgenome. The term $Z/4^{\ell}$ compares the number of unique strings with the given score (when aligned to the read) to all possible unique reads of length $\ell$, and gives us the probability that a read matches by chance at any location. To compute the pchance statistic over the entire length of the genome, we assume independence of positions, and evaluate the likelihood that there is a match at any of the positions:

$$p_{chance} = 1 - \left(1 - cf(\ell) \cdot \frac{Z}{4^{\ell}}\right)^{2 \cdot g} \tag{1}$$

where $\ell$ is the alignment length, $g$ is the genome length (the factor 2 corresponds to the two strands), and $cf(\ell)$ is a correction factor for mappings that are shorter than the length of the read, detailed in Computing Statistics: pchance and pgenome.

Our second computation, pgenome, defines the probability that a hit was generated by the genome via common evolutionary events characteristic of the genome, i.e. substitutions, indels and errors. First, we estimate the rate for each type of event via bootstrapping. Then, we compute the likelihood that the read will differ by as many events from the genome via a binomial probability that uses this estimation and our observations for the events in the current hit.
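The substitution-only count and equation (1) for pchance are simple enough to evaluate directly, as in the sketch below. It is not the PROBCALC code: the function names are hypothetical, the correction factor is passed in as a plain number (defaulting to 1, since its definition is given only in Methods), and indels are not handled.

```python
import math


def z_substitutions(length, subs):
    """Number of distinct strings at exactly `subs` substitutions from a read."""
    return math.comb(length, subs) * 3 ** subs


def pchance(z, length, genome_len, cf=1.0):
    """Equation (1): chance of at least one match among 2 * genome_len positions."""
    per_position = cf * z / 4 ** length
    if per_position >= 1.0:
        return 1.0
    # 1 - (1 - p)^(2g), evaluated with log1p/expm1 so tiny p is not rounded away.
    return -math.expm1(2 * genome_len * math.log1p(-per_position))


# A 35 bp alignment with two substitutions against a 180 Mb genome.
z = z_substitutions(35, 2)
p = pchance(z, 35, 180_000_000)
```

For a full-length 35 bp hit the chance probability is tiny, while allowing more substitutions or a shorter alignment length raises it quickly, which is why lenient score thresholds make chance alignments a practical concern.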
For example, when considering the number of errors, we first estimate the average error rate $C_e$ over all hits, and then we can define the probability that the current read was created via this many errors by

$$p_e \approx \binom{\ell}{n_e}\, C_e^{\,n_e}\, (1 - C_e)^{\ell - n_e} \tag{2}$$

where $n_e$ is the number of observed errors in the current hit, and $\ell$ is the alignment length. We can similarly define $p_{subs}$ and $p_{indel}$ for substitution and indel events, respectively. Finally, we can form pgenome as

$$p_{genome} = p_e \, p_{subs} \, p_{indel} \tag{3}$$

More specifics about the mathematical formulations are available in Computing Statistics: pchance and pgenome.

Finally, we define the quality measurement of this hit as the normalized odds, i.e. a probability odds ratio pgenome/pchance normalized over all of the hits of this read:

$$normodds_{hit} = \frac{pgenome_{hit} / pchance_{hit}}{\sum_{\forall hits} pgenome / pchance} \tag{4}$$

This value represents the relative credibility of this hit compared to the others for a given read: a single hit would have a normalized odds score of 1, two equally good hits would each have a normodds of 0.5, while for an exact match and a more distant one, the former will have a normodds close to 1, and the latter close to 0.

Computing statistics for mate-pairs. SHRiMP also assigns mate-pair confidence values (akin to the read confidence values predicted by probcalc) by combining the confidence values for individual reads with empirically observed distributions of insert sizes in the library. We compute the distribution of the mapped distances (distance between the mapped positions of the two reads) d for all mate-pairs, and save the average distance m (see Computing Mate Pairs with Statistics for more details). Then, for each mate-pair mapping, we assign a pchance, pgenome and normodds score, similar in meaning to those used in the previous section:

- pchance for mate-pairs: assume $p_c(g)$ is the pchance of a read that takes g, the length of the genome, as a parameter. Now, the pchance of a mate-pair $(read_1, read_2)$ is defined as

$$p_c = p_{c,read_1}(g) \cdot p_{c,read_2}(|m - d + 1|) \tag{5}$$

  where g is the length of the genome used in probcalc, m is the average mate-pair distance, and d is the distance of the current mate-pair. That is, we ask the question: what is the probability that a read as good as the first read would align anywhere in the genome by chance, and that a second read will align by chance within the observed mate-pair distance?

- pgenome for mate-pairs: assume $p_g$ is the pgenome of a read. We can compute the pgenome of each mate-pair by

$$p_g = p_{g,1} \cdot p_{g,2} \cdot T \tag{6}$$

  where T is the tail probability of the mate-pair distance distribution we computed (both tails, starting at the $|m - d|$ cutoff). Therefore, for a mate-pair with the distance really close to m, the pgenome will be close to $p_g = p_{g,1} \cdot p_{g,2}$; otherwise, it will be penalized. Thus, following the definition of pgenome, we will get a lower probability that the mate-pair was generated from the genome if the mate-pair distance is too big or too small compared to the average.

A discussion of the implementation steps is included in the SHRiMP README, and a more detailed discussion of the statistical values is included in Computing Mate Pairs with Statistics.

Validation

In our experiments, we used SHRiMP to compare 135 million 35 bp reads from a tunicate Ciona savignyi to the reference genome [22]. The fragments were sequenced from sheared genomic DNA with an AB SOLiD 1.0 instrument. In the following sections we first describe the running time of SHRiMP at different parameter settings, and then evaluate the quality of our alignments compared to Applied Biosystems' read mapping program.

Running time analysis. One of the advantages of the SHRiMP algorithm is the seamless parallelism provided by the fact that we can simply subdivide the reads into separate computational jobs, without affecting the results. This allows us to take full advantage of compute clusters regardless of the amount of memory available at each machine. We took a random subset consisting of 500,000 35 bp C. savignyi reads and mapped them to the genome. The full read dataset and reference genome are available at http://compbio.cs.toronto.edu/shrimp/misc/paper_ciona_reads_35mer.csfasta.tar.bz2 and http://mendel.stanford.edu/sidowlab/cionadata/cionasavignyi_v.2.1.fa.zip, respectively. The running times at several parameter settings are summarized in Table 1.
Note that from smallest to largest seed weight, we see a nearly two orders of magnitude difference in total run time, most of which is concentrated in the vectorized Smith-Waterman filter, and, to a lesser degree, in the spaced k-mer scan. The final, full color-space Smith-Waterman alignment took approximately constant time across all runs, as the average number of top scoring hits that reached the stage was nearly constant (24.49 ± 0.5); however, its proportional time increased as the filter stages became more efficient. While SHRiMP is somewhat slower than other short read mapping programs, it allows both for microindels in the alignments and a proper color-space alignment algorithm. SHRiMP is also very configurable in terms of sensitivity and running time trade-offs.

Ciona savignyi polymorphism analysis. The primary strength of SHRiMP and other mapping methods based on Smith-Waterman alignments is the ability to map reads containing complex patterns of sequence variation, including insertions, deletions and clusters of closely-spaced SNPs. Mappers that exclusively produce ungapped alignments can only find SNPs. Furthermore they are more likely to miss dense clusters of SNPs, since the overlapping reads contain many mismatches, and SNPs adjacent to an indel, since only a small fraction of the overlapping reads contain just the SNP. Finally, since SHRiMP produces local alignments, it can map a read even if either end overlaps a large indel or structural variant.

To evaluate the effectiveness of SHRiMP for detecting sequence variation we used it to find polymorphisms in a resequenced Ciona savignyi individual. C. savignyi is a challenging test case because of its very high polymorphism rate: the SNP heterozygosity is 4.5% and the average per-base indel heterozygosity is 16.6% (indel rate of 0.0072 events per base) [11]. We therefore expect that even short reads will frequently span multiple variant sites. We used the AB SOLiD sequencing platform to generate 135 million reads of length 35 bp from a single C. savignyi individual.
We then aligned those reads to the reference genome [22] with SHRiMP using lenient scoring thresholds so that reads with multiple variant sites could be mapped, and we selected the single highest-scoring alignment for each read (see Methods). We discarded alignments in repetitive sequence by removing reads with multiple similarly scoring alignments ("non-unique" matches). The mapping took 48 hours using 250 2.33 GHz cores. Table 2 summarizes the mapping results.

The alignment data contains noise due to two types of errors: sequencing errors and chance alignments. Chance alignments are a significant problem for short reads, particularly with the low alignment score thresholds necessary for mapping reads containing significant variation. Reads containing both sequence variation and sequencing errors are even more likely to map to the wrong position in the reference sequence. To combat the high false-positive rate, for the remaining analysis we focused on a high-quality subset of the data consisting of sequence variants supported by at least four independent reads.

Across the genome SHRiMP detected 2,119,720 SNPs supported by at least four reads. For comparison, we used the SOLiD aligner provided by Applied Biosystems to map the reads to the genome with up to three mismatches, where each mismatch can be either a single color-space mismatch or a pair of adjacent mismatches consistent with the presence of a SNP. Compared to the SOLiD mapper, SHRiMP mapped 4.2 times as many reads and found 5.5 times as many SNPs. The AB mapper, however, was much faster, requiring 255 CPU hours to complete the alignments, or roughly 50× faster than SHRiMP. While it is possible to run the mapper with greater sensitivity, allowing for more errors and SNPs, and thus more mapped reads, doing so would surrender much of the runtime advantage and still not overcome its fundamental inability to detect insertion and deletion polymorphisms. SHRiMP, on the other hand, is capable of handling indels, and detected tens of thousands of them. SHRiMP detected 51,592 deletions and 19,970 insertions of size 1–5 bp. The observed 2.5× ratio of deletions to insertions for the C. savignyi data is biased by the construction of the reference genome: whenever the two haplomes differed, the reference agreed with the longer one. There is also a smaller inherent bias against detecting insertions (reads containing nucleotides not present in the reference) compared to deletions, because a read spanning a deletion only incurs a gap penalty whereas an insertion both incurs a gap penalty and has fewer bases that match the reference. For simulated data (see next section) this bias was only ~5% for single basepair indels (data not shown). The size distribution of the detected indels (Figure 5A) drops more rapidly with length than expected [11], but this detection bias against longer indels is not surprising since longer indels have lower alignment scores.

Table 1. Running time of SHRiMP for mapping 500,000 35 bp SOLiD C. savignyi reads to the 180 Mb reference genome on a single Core2 2.66 GHz processor.

K-mer                     (7,8)          (8,9)      (9,10)     (10,11)    (11,12)    (12,13)
% K-mer Scan              10.1%          16.5%      18.9%      13.4%      9.8%       7.4%
% Vectorized SW Filter    88.8%          75.4%      49.8%      30.2%      20.1%      14.9%
% Full SW Alignment       1.1%           8.0%       30.7%      55.5%      68.8%      76.2%
Time                      1 d 21 h 34 m  6 h 18 m   1 h 36 m   50 m 28 s  37 m 52 s  32 m 32 s

In all cases, two k-mer hits were required within a 41 bp window to invoke the vectorized Smith-Waterman filter. doi:10.1371/journal.pcbi.1000386.t001

Table 2. Mapping results for 135 million 35 bp SOLiD reads from Ciona savignyi using SHRiMP and the SOLiD mapper provided by Applied Biosystems.

                                           SHRiMP               SOLiD Mapper
Uniquely-Mapped Reads                      51,856,904 (38.5%)   15,268,771 (11.3%)
Non-Uniquely-Mapped Reads                  64,252,692 (47.7%)   12,602,387 (9.4%)
Unmapped Reads                             18,657,736 (13.8%)   106,896,174 (79.3%)
Average Coverage (Uniquely-Mapped Reads)   10.3                 3.0
Median Coverage (Uniquely-Mapped Reads)    8                    1
SNPs                                       2,119,720            383,099
Deletions (1–5 bp)                         51,592               0
Insertions (1–5 bp)                        19,970               0

Non-uniquely-mapped reads have at least two alignments, none of which is significantly better than the others (see Methods). SNPs and indels have at least four supporting reads. doi:10.1371/journal.pcbi.1000386.t002

Mapping C. savignyi sequence is challenging primarily because the population contains so much variation. Figure 5B shows the high frequency of closely spaced SNPs detected by SHRiMP. Mappers that can only detect nearly exact matches fail to map the reads overlapping these dense SNP clusters.
Note that even though the reads are generated from the whole genome, a significant fraction of the non-repetitive C. savignyi genome is coding, making it possible to see the typical three-periodicity of SNPs in coding regions. Furthermore, SHRiMP recovers microindels, which are completely invisible to ungapped aligners and yet account for a significant fraction of sequence variation in C. savignyi.

Figure 5. Size distribution of indels (A) and distance between adjacent SNPs (B) detected by SHRiMP. The distance between adjacent SNPs shows a clear 3-periodicity, due to the fact that a significant fraction of the non-repetitive C. savignyi genome is coding. doi:10.1371/journal.pcbi.1000386.g005

Analysis of simulated data. In order to further validate the accuracy of the SHRiMP alignments, we designed simulated experiments in which we sampled random locations from the C. savignyi genome, introduced polymorphisms (SNPs and indels) at the rates previously observed in the C. savignyi genome [22], added sequencing errors at rates observed in our C. savignyi dataset (2–7%, depending on the position in the read), and mapped the reads back to the original genome. Each sampled read could have multiple SNPs and indels, though due to the low indel rate only a small fraction of the reads had multiple indels. We mapped the reads with SHRiMP and postprocessed with PROBCALC (pchance < 0.001). Considering only those reads that had a unique top hit, we computed the precision (the fraction of reads for which this unique hit was correct) and the recall (the fraction of all reads that had a unique, correct hit). Table 3 shows the results of this analysis. We classified each read by its number of SNPs and its maximum indel length, and computed precision and recall for each class. With such polymorphism, we can expect the average read to have approximately 1.5 SNPs and 1.9 errors. SHRiMP was able to accurately map 76% of reads with 2 SNPs and 0 indels, at 84% precision, and nearly half of all reads with 2 SNPs and 3 bp indels at 74% precision.

Table 3. Color-space mapping accuracy of SHRiMP.

| Max indel length | 0 SNPs Prec. | 0 SNPs Rec. | 1 SNP Prec. | 1 SNP Rec. | 2 SNPs Prec. | 2 SNPs Rec. | 3 SNPs Prec. | 3 SNPs Rec. | 4 SNPs Prec. | 4 SNPs Rec. |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 85.7 | 83.2 | 84.8 | 81.3 | 83.5 | 76.6 | 80.6 | 65.2 | 75.6 | 46.8 |
| 1 | 83.8 | 79.4 | 82.2 | 74.0 | 79.4 | 62.6 | 72.8 | 43.2 | 63.1 | 24.7 |
| 2 | 83.2 | 77.1 | 80.8 | 69.6 | 77.9 | 56.6 | 68.2 | 36.4 | 56.4 | 18.9 |
| 3 | 80.7 | 71.0 | 79.6 | 64.2 | 73.6 | 48.3 | 66.5 | 31.5 | 57.1 | 16.6 |
| 4 | 78.0 | 65.4 | 76.5 | 56.1 | 71.4 | 41.9 | 60.6 | 23.9 | 50.3 | 12.4 |
| 5 | 75.9 | 58.9 | 73.0 | 48.1 | 69.7 | 36.6 | 57.0 | 21.3 | 46.0 | 12.7 |

Each cell shows the precision and recall (in percent) for mapping simulated reads with varying amounts of polymorphism. SHRiMP was able to accurately map >46% of all reads with either 4 SNPs or 5 bp indels, despite the large number of sequencing errors in our dataset (up to 7% towards the end of the read). doi:10.1371/journal.pcbi.1000386.t003

Methods

Details of the SHRiMP Algorithm

The algorithm starts with a rapid k-mer hashing step to localize potential areas of similarity between the reads and the genome. All of the spaced k-mers present in the reads are indexed. Then, for each k-mer in the genome, all of the matches of that particular k-mer among the reads are found. If a particular read has at least a specified number of k-mer matches within a given window of the genome, we execute a vectorized Smith-Waterman step, described in the next section, to score and validate the similarity.
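The seed-and-window filter described above can be illustrated with a short Python sketch. It is not the SHRiMP implementation: the function names are hypothetical, a `deque` stands in for the circular buffer of recent positions, and the defaults (seed 11110111, two hits per 40-base window) are simply the values reported for the C. savignyi analysis. Reads and genome are plain letter-space strings here.

```python
from collections import defaultdict, deque

def spaced_kmers(seq, seed):
    """Yield (position, key) for every spaced k-mer of seq under the seed mask."""
    keep = [i for i, c in enumerate(seed) if c == "1"]
    for pos in range(len(seq) - len(seed) + 1):
        yield pos, "".join(seq[pos + i] for i in keep)

def candidate_windows(reads, genome, seed="11110111", min_hits=2, window=40):
    """Return sorted (read_id, window_start) pairs that would be passed on
    to the Smith-Waterman filter (hypothetical helper, illustration only)."""
    # Index every spaced k-mer that occurs in any read.
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for _, key in spaced_kmers(read, seed):
            index[key].add(read_id)
    # Scan the genome once, remembering each read's recent hit positions.
    recent = defaultdict(deque)
    candidates = set()
    for gpos, key in spaced_kmers(genome, seed):
        for read_id in index.get(key, ()):
            hits = recent[read_id]
            hits.append(gpos)
            # Forget hits that have fallen out of the sliding window.
            while hits[0] <= gpos - window:
                hits.popleft()
            if len(hits) >= min_hits:
                candidates.add((read_id, hits[0]))
    return sorted(candidates)
```

Indexing the reads rather than the genome keeps the index size proportional to the read batch, which is the property the text relies on for controlling memory use.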
The top n highest-scoring regions are retained, filtered through a full backtracking Smith-Waterman algorithm, and output at the end of the program if their final scores meet a specified threshold. The SHRiMP algorithm is summarized in Figure 6.

Figure 6. SHRiMP hashing technique and vectorized alignment algorithm. A: Overview of the k-mer filtering stage within SHRiMP. A window is moved along the genome. If a particular read has a preset number of k-mers within the window, the vectorized Smith-Waterman stage is run to align the read to the genome. B: Schematic of the vectorized implementation of the Needleman-Wunsch algorithm. The red cells are the vector being computed, on the basis of the vectors computed in the last step (yellow) and the next-to-last step (blue). The match/mismatch vector for the diagonal is determined by comparing one sequence with the other one reversed (indicated by the red arrow below). To obtain the set of match/mismatch positions for the next diagonal, the lower sequence needs to be shifted to the right. doi:10.1371/journal.pcbi.1000386.g006

Spaced seed filter. We build an index of all spaced k-mers in the reads, and query this index with the genome. Our approach was taken primarily for simplicity: our algorithm can rapidly isolate which reads have several k-mer matches within a small window by maintaining a simple circular buffer of recent positions in the genome that matched the read. Since our targeted compute platform is a cluster of batch processing machines, indexing the reads means that we can easily control memory usage and parallelism by varying the read input size and splitting the read set accordingly. Data is only loaded at program invocation; we do not stream in new reads from disk as the algorithm runs.

Vectorized Smith-Waterman implementation. The SHRiMP approach relies on a rather liberal initial filtering step, followed by a rigorous but very fast Smith-Waterman alignment process. By maximizing the speed of the Smith-Waterman comparison, we can afford to let the algorithm test a larger number of potential regions. Most contemporary mobile, desktop and server-class processors have special vector execution units, which perform multiple simultaneous data operations in a single instruction. For example, it is possible to add the eight individual 16-bit elements of two 128-bit vectors in one machine instruction. Over the past decade, several methods have been devised to significantly enhance the execution speed of Smith-Waterman-type algorithms by parallelizing the computation of several cells of the dynamic programming matrix. The simplest such implementation computes the dynamic programming matrix using diagonals. Since each cell of the matrix can be computed once the cell immediately above, immediately to the left, and at the upper-left corner have been computed, one can compute each successive diagonal once the two prior diagonals have been completed. In this way, the problem can be parallelized across the length of supported diagonals (see Figure 6B). In most cases, this is a factor of 4 to 16. The only portion of such a "Wozniak" approach that cannot be parallelized is the identification of match/mismatch scores for every cell of the matrix, which has to be done sequentially. These operations are expensive, necessitating 24 independent data loads for 8-cell vectors, and become increasingly problematic as vector sizes increase. Because memory loads cannot be vectorized, when the parallelism grows, so does the number of lookups. For example, with 16-cell vectors, the number of data loads doubles to 48.

We propose an alternate method, where the running time of the fully vectorized algorithm is independent of the number of matches and mismatches in the matrix, though it only supports fixed match/mismatch scores (rather than full scoring matrices). Our key observation is that it is possible to completely parallelize the score computation for every diagonal. Figure 6B demonstrates the essence of our algorithm: by storing one of the sequences backwards, we can align them in such a way that a small number of logical instructions obtain the positions of matches and mismatches for a given diagonal. We then construct a vector of match and mismatch scores for every cell of the diagonal without having to use expensive and un-vectorizable load instructions or pre-compute a "query profile". In our tests, using a diagonal approach with our scoring scheme surpasses the performance of Wozniak's original algorithm and performs on par with Farrar's method [17]. Table 4 summarizes these results.
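The indexing behind the reversed-sequence trick can be checked with a scalar Python sketch. This is an illustration under simplifying assumptions, not the authors' SIMD code: Python lists stand in for vector registers, the function name is hypothetical, and a single linear gap penalty replaces the affine open/extend penalties. The point is only that, on anti-diagonal d, `a[i]` faces `b[d - i]`, which equals `b_rev[n - 1 - d + i]`, so two contiguous slices line up and can be compared element-wise.

```python
def sw_score_by_diagonals(a, b, match=100, mismatch=-90, gap=-250):
    """Best local-alignment score, filled one anti-diagonal at a time.

    Illustrative only: linear gap penalty, score without traceback.
    """
    m, n = len(a), len(b)
    b_rev = b[::-1]                 # store one sequence backwards
    prev2, prev1 = {}, {}           # diagonals d-2 and d-1, keyed by row i
    best = 0
    for d in range(m + n - 1):      # cells (i, j) with i + j == d
        lo, hi = max(0, d - n + 1), min(m - 1, d)
        off = n - 1 - d             # shifts by one for each new diagonal
        # One element-wise comparison of two contiguous slices yields the
        # match/mismatch scores for the whole diagonal (no per-cell lookups).
        sub = [match if x == y else mismatch
               for x, y in zip(a[lo:hi + 1], b_rev[off + lo:off + hi + 1])]
        cur = {}
        for k, i in enumerate(range(lo, hi + 1)):
            diag = prev2.get(i - 1, 0) + sub[k]   # cell (i-1, j-1)
            up = prev1.get(i - 1, 0) + gap        # cell (i-1, j)
            left = prev1.get(i, 0) + gap          # cell (i, j-1)
            cur[i] = max(0, diag, up, left)
        best = max(best, max(cur.values()))
        prev2, prev1 = prev1, cur
    return best
```

In a real vector implementation the inner loop over `i` would also be a handful of SIMD max/add instructions; it is left scalar here to keep the dependency on the two previous diagonals explicit.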
The advantage of our method over Farrar's is that it is independent of the scores used for matches/mismatches/gaps, and it will scale better with larger vector sizes. A disadvantage is that we cannot support full scoring matrices and are restricted to match/mismatch scores, though this is less important for DNA alignment. Additionally, Farrar's method is much faster for large databases where most of the sequence is dissimilar to the query. However, this is never the case for SHRiMP, as the seed scan phase targets only small, similar regions for dynamic programming. In these cases our algorithms perform similarly.

Final pass. The vectorized Smith-Waterman approach described above is used to rapidly determine if the read has a strong match to the local genomic sequence. The locations of the top n hits for each read are stored in a heap data structure, which is updated after every invocation of the vectorized Smith-Waterman algorithm if the heap is not full, or if the attained score is greater than or equal to the lowest-scoring top hit. Once the whole genome is processed, the n highest-scoring matches are re-aligned using the appropriate full color- or letter-space Smith-Waterman algorithm. This is necessary because the vectorized Smith-Waterman algorithm described above only computes the maximum score of an alignment, not the traceback, which would require a much more complicated and costly implementation. Instead, at most the top n alignments for each read are re-aligned in the final step.

Computing Statistics: pchance and pgenome

In Computing Statistics for Single Reads, we briefly introduced the concepts of the pchance, pgenome and normalized odds of a hit. In this section we expand on the details regarding the construction of pchance and pgenome.
Table 4. Performance (in millions of cells per second) of the various Smith-Waterman implementations, including a regular implementation (not vectorized), Wozniak's diagonal implementation with memory lookups, Farrar's method and our diagonal approach without score lookups.

| Processor type | Unvectorized | Wozniak | Farrar | SHRiMP |
|---|---|---|---|---|
| Xeon | 97 | 261 | 335 | 338 |
| Core 2 | 105 | 285 | 533 | 537 |

We inserted each into SHRiMP, and used SHRiMP to align 50 thousand reads to a reference genome with default parameters. The improvements of the Core 2 architecture for vectored instructions lead to a significant speedup for our approach and Farrar's, while the only slight improvement of Wozniak's algorithm is due to the slow match/mismatch lookups. doi:10.1371/journal.pcbi.1000386.t004

In these formulas we make use of the following definitions:

• $g$ is the genome length.
• $L$ is the alignment length (note this may be different from the read length, which is constant).
• $\mathit{subs}$ is the number of substitutions (mismatches) in our alignment.
• $\mathit{ins}$ is the number of nucleotide insertions in our alignment, where the genome is the original sequence. For example, if the genome is AC-G and a read is ACTG, there is an insertion of a T.
• $\mathit{dels}$ is the number of nucleotide deletions in our alignment. For example, if the genome is ACTG and a read is A-TG, there is a deletion of a C.
• $\mathit{ins}_{ev}$ is the number of insertion events (for example, for a single insertion of length 3 we have $\mathit{ins}_{ev} = 1$ and $\mathit{ins} = 3$). $\mathit{del}_{ev}$ is similar.
• $\mathit{ins}_n$: following the previous definition, $\mathit{ins}_{ev}!$ describes the number of permutations of insertion events. To determine the number of distinguishable permutations, we first need to look at the frequency of insertion events of a certain size, $\mathrm{frequency}_{insev}(\mathit{size})$. For example, if we have 3 insertions of size 2, we need to divide the permutations by $\mathrm{frequency}_{insev}(2)! = 3!$. Therefore, the distinguishable permutations of insertion events can be written as

$$\frac{\mathit{ins}_{ev}!}{\prod_{i \in \mathit{insev\ sizes}} \left(\mathrm{frequency}_{insev}(\mathit{size} = i)!\right)}.$$

Below, we refer to the denominator term $\prod_{i \in \mathit{insev\ sizes}} \left(\mathrm{frequency}_{insev}(\mathit{size} = i)!\right)$ as $\mathit{ins}_n$.
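As a small worked illustration of the last definition, the following Python sketch computes the denominator term and the resulting count of distinguishable orderings from a list of insertion-event sizes. The function name and the list-of-sizes input are assumptions made for the example; they do not come from the SHRiMP source.

```python
from collections import Counter
from math import factorial, prod

def distinguishable_event_orderings(event_sizes):
    """Return (ins_ev! / ins_n, ins_n) for a list of insertion-event sizes.

    ins_ev is the number of events; ins_n is the product, over each distinct
    event size, of the factorial of how often that size occurs.
    """
    ins_ev = len(event_sizes)
    ins_n = prod(factorial(count) for count in Counter(event_sizes).values())
    return factorial(ins_ev) // ins_n, ins_n
```

For three insertions of size 2, as in the example above, the input `[2, 2, 2]` gives a denominator of 3! and a single distinguishable ordering, while `[1, 2, 2]` gives a denominator of 2! and three orderings.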
We similarly define $\mathit{del}_n$. Finally, $P(n,k)$ describes the number of ways to assign $n$ indistinguishable objects to $k$ indistinguishable bins, which is defined recursively by

$$P(n,k) = \sum_{i=1}^{k} P(n-k,\, i),$$

with $P(n,n) = 1$ and $P(n,1) = 1$.

pchance. We begin with the mathematical formulation of pchance (defined above):

$$p_{chance} = 1 - \left(1 - cf(L)\cdot\frac{Z}{4^{L}}\right)^{2g}, \tag{7}$$

where, as described before, $Z/4^{L}$ is the number of possible unique sequences with the given edit distance as a fraction of all possible unique reads of length $L$. Thus, $Z/4^{L}$ gives us the probability that the current read has aligned by chance to a random genome of the size of a read. To this term, we apply a correction factor of $cf(L) = (\mathit{readsize} - L + 1)$, which accounts for all the possible places the alignment of size $L$ might match. For example, if the read size is 25 and we have a match of size 22, we should count $Z/4^{L}$ for every position where this match could be found, that is, 25 − 22 + 1 = 4 positions. Finally, to get the probability that the current read has aligned by chance to a random genome of size $g$ (instead of size $L$), we get formula (7).

The factor that lies at the core of this calculation is $Z$, the number of possible unique sequences that would align to the read with the given edit distance. We have shown the definition of $Z_{subs}$, which computes $Z$ when there are no indels in the alignment:

$$Z_{subs} = \binom{L}{\mathit{subs}}\, 3^{\mathit{subs}}. \tag{8}$$

However, the calculation of the number of references to which a read will map with a particular indel count, $Z_{indels}$, depends on the sequence of that read and is significantly more complicated. We define a lower and an upper bound on $Z$ in this case: the lower bound (least number of unique sequences) occurs when the current read is one repeated nucleotide, for example [AAAAAA], and the upper bound occurs with the most change in nearby nucleotides, say [ACGTAC]. In the former case, we need to look at the deletion events from the genome to this read, and consider all the combinations of that number of deletion events and deleted nucleotides, as well as all the places where these combinations may occur. This gives the formula

$$Z_{lower} = \frac{\mathit{del}_{ev}!\; P(\mathit{dels}, \mathit{del}_{ev})}{\mathit{del}_n} \binom{L + \mathit{dels} - \mathit{ins}}{\mathit{dels}}\, 3^{\mathit{dels}}. \tag{9}$$

Looking for the upper bound, we note that the places and combinations of insertions also matter in generating unique sequences, therefore giving us two extra terms involving $\mathit{ins}_{ev}$:

$$Z_{upper} = \frac{\mathit{ins}_{ev}!}{\mathit{ins}_n} \binom{L}{\mathit{ins}_{ev}}\, Z_{lower} \tag{10}$$

$$\phantom{Z_{upper}} = \frac{\mathit{ins}_{ev}!}{\mathit{ins}_n} \binom{L}{\mathit{ins}_{ev}} \frac{\mathit{del}_{ev}!\; P(\mathit{dels}, \mathit{del}_{ev})}{\mathit{del}_n} \binom{L + \mathit{dels} - \mathit{ins}}{\mathit{dels}}\, 3^{\mathit{dels}}. \tag{11}$$

In order to estimate the correct value for $Z_{indels}$, we estimated the average complexity of the reads in our dataset (i.e., between the simplest [AAAAA…] and the most complex [ACGTACGT…]), and found that the mean observed $Z_{indels}$ could be accurately estimated by

$$Z_{indels} \approx \tfrac{1}{2}\left(Z_{lower} + Z_{upper}\right). \tag{12}$$

Finally, we can approximate the total $Z$ as

$$Z \approx Z_{subs} \cdot Z_{indels}. \tag{13}$$

pgenome. In Computing Statistics for Single Reads, we defined our pgenome factor as $p_{e} \cdot p_{subs} \cdot p_{indel}$, where each factor has the form

$$p_{e} \approx \binom{L}{n_e} (c_e)^{n_e} (1 - c_e)^{L - n_e}, \tag{14}$$

with $c_e$ the rate of event $e$ (estimated via bootstrapping) and $n_e$ the number of observed events of type $e$ in the current alignment. We wrote $p_e$ as an approximation because there are small corrections to this formula for each probability that is part of pgenome. First, for the error term $p_e$ (with $e$ now denoting sequencing errors), the number of sites that can support errors is in fact the read size minus one, giving us

$$p_{e} = \binom{L - 1}{n_e} (c_e)^{n_e} (1 - c_e)^{L - 1 - n_e}. \tag{15}$$

When considering substitutions, we can have changes at any of the inner nucleotides, excluding erroneous sites:

$$p_{sub} = \binom{L - 2 - n_e}{n_{sub}} (c_{sub})^{n_{sub}} (1 - c_{sub})^{L - 1 - n_e - n_{sub}}. \tag{16}$$

As before, when we look at alignments that involve indels, the formula becomes more complex. In the case of pgenome, we do not have to consider the various placements of insertion or deletion events, but we do have to consider, for fixed placements of events, the various combinations of the total number of insertions and deletions into a set number of events:

$$p_{indel} = P(\mathit{indels}, \mathit{indel}_{ev})!\, \binom{L - 1}{n_{indelev}} (c_{indelev})^{n_{indelev}} (1 - c_{indelev})^{L - 1 - n_{indelev}}. \tag{17}$$

Computing mate pairs with statistics. In this section we provide several details for the implementation, usage and statistics of the matepair post-processing step introduced in Computing Statistics for Mate-pairs.
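Because this post-processing step builds on the single-read statistics, a small numerical sketch of pchance for the substitution-only case may be useful here. It assumes that Equation (7) has the form $1 - (1 - cf \cdot Z/4^{L})^{2g}$ and Equation (8) the form $\binom{L}{\mathit{subs}} 3^{\mathit{subs}}$. The function and parameter names are illustrative and are not taken from the SHRiMP or PROBCALC source, and the `log1p`/`expm1` rewrite is only a floating-point convenience, not something the text prescribes.

```python
from math import comb, expm1, log1p

def z_subs(align_len, subs):
    """Number of sequences reachable with exactly `subs` substitutions."""
    return comb(align_len, subs) * 3 ** subs

def p_chance(z, align_len, read_size, genome_len):
    """Chance-hit probability for one read on both strands of the genome.

    Hypothetical helper: assumes the form 1 - (1 - cf * Z / 4^L)^(2g).
    """
    cf = read_size - align_len + 1          # placements of the alignment
    per_site = cf * z / 4 ** align_len      # chance of a hit at one site
    if per_site >= 1.0:
        return 1.0
    # Equivalent to 1 - (1 - per_site) ** (2 * genome_len), but stable
    # when per_site is far below machine epsilon.
    return -expm1(2 * genome_len * log1p(-per_site))
```

For example, `p_chance(z_subs(35, 2), 35, 35, 180_000_000)` evaluates the chance-hit probability for a full-length 35 bp alignment with two substitutions against a 180 Mb genome; allowing more substitutions increases `z_subs` and therefore the probability.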
We define a good matepair mapping as a mapping whose distance d (between the two reads) is smaller than some chosen limit M, and for which the read mappings are in a consistent orientation and strand (i.e. $R^{+}F^{+}$ or $F^{-}R^{-}$). First, probcalc_mp will compute a matepair distance and standard deviation by looking at all the connected forward and reverse reads (that is, all matepairs) and adding the distance of any matepair with exactly one good mapping to a histogram. Optionally, one can choose to use only unique good mappings, or only a certain number of mappings (say, the first 100,000) to speed up the program. Next, we call a matepair concordant if it has at least one good mapping, and otherwise we call it discordant. Depending on the task, probcalc_mp can output all concordant matepairs, or all discordant matepairs. For each matepair mapping, probcalc_mp will compute the pgenome and pchance, as introduced in Computing Mate Pairs with Statistics.

Parameters

For the C. savignyi polymorphism analysis we ran SHRiMP with the following parameters. We used the spaced seed 11110111 and required two hits per 40-base window to invoke the Smith-Waterman algorithm. The Smith-Waterman scoring parameters were set to +100 for a matching base, −90 for a mismatch, −250 and −100 to open and extend a gap respectively, and −300 for a crossover (sequencing error). The minimum Smith-Waterman