Overlap-Based Genome Assembly from Variable-Length Reads

Joseph Hui, Ilan Shomorony, Kannan Ramchandran and Thomas A. Courtade
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
Email: {ude.yelekreb, ilan.shomorony, kannan, courtade}@berkeley.edu

Abstract

Recently developed high-throughput sequencing platforms can generate very long reads, making the perfect assembly of whole genomes information-theoretically possible [1]. One of the challenges in achieving this goal in practice, however, is that traditional assembly algorithms based on the de Bruijn graph framework cannot handle the high error rates of long-read technologies. On the other hand, overlap-based approaches such as string graphs [2] are very robust to errors, but cannot achieve the theoretical lower bounds. In particular, these methods handle the variable-length reads provided by long-read technologies in a suboptimal manner. In this work, we introduce a new assembly algorithm with two desirable features in the context of long-read sequencing: (1) it is an overlap-based method, thus being more resilient to read errors than de Bruijn graph approaches; and (2) it achieves the information-theoretic bounds even in the variable-length read setting.

I. INTRODUCTION

Current DNA sequencing technologies are based on a two-step process. First, tens or hundreds of millions of fragments from random unknown locations on the target genome are read via shotgun sequencing. Second, these fragments, called reads, are merged to each other based on regions of overlap using an assembly algorithm. As output, an assembly algorithm returns a set of contigs, which are strings that, in principle, correspond to substrings of the target genome. In other words, contigs describe sections of the genome that are correctly assembled.

Algorithms for sequence assembly can be mainly classified into two categories: approaches based on de Bruijn graphs [3] and approaches based on overlap graphs [2, 4, 5]. Following the short-read high-throughput trend of second-generation sequencers, assemblers based on de Bruijn graphs became popular.
Roughly speaking, these assemblers operate by constructing a de Bruijn graph with vertex set given by the set of distinct K-mers extracted from the reads, and connecting two vertices via a directed edge whenever the corresponding K-mers appear consecutively in the same read. By construction, if the reads achieve sufficient coverage, the target genome corresponds to a Chinese Postman route on the graph, which is a path that traverses every edge at least once. The problem of finding the correct Chinese Postman route (thus determining the target genome) is complicated by the fact that repeated regions in the genome are condensed into single paths. Thus, to resolve repeats and obtain long contigs, a finishing step must be taken where the original reads are brought back and aligned onto the graph.

While the construction of the de Bruijn graph can be performed efficiently both in time and space, this approach has two main drawbacks. The first is that shredding the reads into K-mers renders the task of resolving repeats and obtaining long contigs more challenging. The second drawback is that the de Bruijn graph construction is not robust to read errors. Indeed, even small error rates will generate many chimeric K-mers (i.e., K-mers that are not substrings of the target genome). As a result, heuristics must be implemented in practice to clean up the graph.

On the other hand, overlap-based assembly algorithms typically operate by constructing an overlap graph with vertex set corresponding to the set of observed reads, where two vertices are connected if the suffix of one of the reads enjoys significant similarity with the prefix of the other (i.e., the two reads overlap by a significant margin). This way, the target genome corresponds to a (generalized) Hamiltonian path on the graph, assuming sufficient coverage. By not breaking the reads into small K-mers, overlap-based approaches promise to generate less fragmented assemblies. Moreover, read errors have small impact if we restrict our attention to overlaps of sufficient length, implying that overlap-based assemblers can be more robust to read errors than their de Bruijn counterparts.

(This work was supported in part by NSF Grants CCF-1528132 and CCF-0939370 (Center for Science of Information).)
Therefore, in the context of long-read third-generation sequencing (where error rates are high, and will continue to be for the foreseeable future [6]), overlap-based approaches are expected to play a central role.

In spite of their relevance in the context of long-read sequencing, our formal understanding of overlap-based algorithms is fairly limited. Under most natural formulations, extracting the correct sequence from the overlap graph becomes an NP-hard problem [7, 8]. Moreover, as the graph in general contains many spurious edges due to repeats in the target genome, formal analysis of these algorithms is difficult and very few of them have theoretical guarantees. One example is the work in [9], where an overlap-based algorithm is shown to have theoretical performance guarantees under the assumption of fixed-length reads. In practice this is never the case (e.g., PacBio reads can differ by tens of thousands of base pairs [5]), and processing the reads so that they all have the same length is usually suboptimal.

In this paper, we introduce an efficient, overlap-based assembly algorithm that handles variable-length reads and is guaranteed to reconstruct the target genome provided the reads satisfy the information-theoretic sufficient conditions proposed in [1].

II. BACKGROUND AND DEFINITIONS

In the genome assembly problem, the goal is to reconstruct a target sequence g = (g[0], ..., g[G-1]) of length G with symbols from the alphabet Σ = {A, C, G, T}. The sequencer produces a set of N reads R = {r_1, ..., r_N} from g, each of which is a substring of g. For ease of exposition, we assume a circular genome model to avoid edge effects, so that a substring may wrap around to the beginning of g. Thus g[5 : 3] denotes g[5 : G-1] g[0 : 3]. The reads may be of arbitrary length. The goal is to design an assembler, which takes the set of reads R and attempts to reconstruct the sequence g.
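The circular indexing convention above can be made concrete with a small helper. This is only an illustrative sketch: the function name `circular_slice` is hypothetical, and it assumes the notation g[i : j] includes both endpoints, as the example g[5 : 3] = g[5 : G-1] g[0 : 3] suggests.

```python
def circular_slice(g: str, i: int, j: int) -> str:
    """Return g[i : j] under the circular genome model, both endpoints inclusive."""
    G = len(g)
    i %= G
    j %= G
    if i <= j:
        return g[i:j + 1]
    # The interval runs past the end of g and wraps around to position 0.
    return g[i:] + g[:j + 1]


genome = "ACGTACGGTC"
print(circular_slice(genome, 5, 3))  # positions 5..9 followed by positions 0..3
```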

A. Bridging conditions and Optimal assembly

In [1], the authors derive necessary and sufficient conditions for assembly in terms of bridging conditions of repeats. These conditions are used to characterize the information limit for the feasibility of the assembly problem. In this section, we recall the main ideas behind this characterization, which serve as motivation for our approach.

A double repeat of length l ≥ 0 in g is a substring x ∈ Σ^l appearing at distinct positions i_1 and i_2 in g; i.e., g[i_1 : i_1 + l - 1] = g[i_2 : i_2 + l - 1] = x. Similarly, a triple repeat of length l is a substring x that appears at three distinct locations in g (possibly overlapping); i.e., g[i_1 : i_1 + l - 1] = g[i_2 : i_2 + l - 1] = g[i_3 : i_3 + l - 1] = x for distinct i_1, i_2 and i_3 (modulo G, given the circular assumption on g). If x is a double repeat but not a triple repeat, we say that it is precisely a double repeat. A double repeat x is maximal if it is not a substring of any strictly longer double repeat. Finally, if x = g[i_1 : i_1 + l] = g[i_2 : i_2 + l] and y = g[j_1 : j_1 + l'] = g[j_2 : j_2 + l'] for some i_1, i_2, l, j_1, j_2, l' where x, y are maximal and i_1 < j_1 < i_2 < j_2, then x and y form an interleaved repeat. Examples are shown in Fig. 1.

Fig. 1. Examples of various kinds of repeats: maximal double repeat, double repeat, triple repeat (also a double repeat), and interleaved repeat.

A repeat consists of several copies, starting at distinct locations i_1, i_2, and so forth. A read r = g[j_1 : j_2] is said to bridge a copy g[i : i + l] if j_1 < i and j_2 > i + l, as illustrated in Fig. 2.

Fig. 2. A read bridging one copy of a triple repeat.

A repeat is bridged if at least one copy is bridged by some read, and all-bridged if every copy is bridged by some read. A set of reads R is said to cover g if every base in g is covered by some read. In the context of two reads r_1, r_2 which both contain some string s of interest, r_1 and r_2 are said to be inconsistent if, when aligned with respect to s, they disagree at some base, as illustrated in Fig. 3.

Fig. 3. Reads r_1 and r_2 are inconsistent with respect to the shared string s.
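The repeat and bridging definitions above can be checked on a toy circular genome with a few lines of code. The sketch below is not part of the original algorithm; the function names are hypothetical, and bridging is interpreted (as an assumption) as the read extending by at least one base past each end of the repeat copy, with reads and copies given by start position and length.

```python
def circular_occurrences(g: str, x: str) -> list[int]:
    """Start positions at which x occurs in the circular genome g (len(x) <= len(g))."""
    doubled = g + g[:len(x) - 1]  # lets a match wrap past the end of g
    return [i for i in range(len(g)) if doubled.startswith(x, i)]


def repeat_kind(g: str, x: str) -> str:
    """Classify x as a triple repeat, precisely a double repeat, or not a repeat."""
    n = len(circular_occurrences(g, x))
    if n >= 3:
        return "triple"
    if n == 2:
        return "precisely double"
    return "not a repeat"


def bridges(read_start: int, read_len: int, copy_start: int, copy_len: int, G: int) -> bool:
    """True if the read covers the copy plus at least one extra base on each side."""
    offset = (copy_start - read_start) % G  # distance from read start to copy start
    return offset >= 1 and offset + copy_len <= read_len - 1


g = "GACGTCACGTTACGA"
print(circular_occurrences(g, "ACG"), repeat_kind(g, "ACG"))
print(circular_occurrences(g, "ACGT"), repeat_kind(g, "ACGT"))
print(bridges(0, 6, 1, 3, len(g)))  # read GACGTC against the copy of ACG at position 1
```

On this example, ACG occurs three times and ACGT twice, so ACGT is precisely a double repeat, while the read starting at position 0 bridges the first copy of ACG.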
In [1], the authors proposed a de Bruijn graph-based assembly algorithm called MULTIBRIDGING and proved it to have the following theoretical guarantee, stated in terms of bridging conditions:

Theorem 1. [1] MULTIBRIDGING correctly reconstructs the target genome g if R covers g and
B1. Every triple repeat is all-bridged.
B2. Every interleaved repeat is bridged (i.e., of its four copies, at least one is bridged).

The motivation for appealing to conditions B1 and B2 stems from the observation that, under a uniform sampling model where N reads of a fixed length L are sampled uniformly at random from the genome, these conditions nearly match necessary conditions for assembly [1]. Motivated by this near-characterization of the information limit for perfect assembly and the advantages of overlap-based assembly for long-read technologies, we describe an overlap-based algorithm with the same performance guarantees. That is, provided conditions B1 and B2 are satisfied, our assembly algorithm will correctly reconstruct the target genome g. The analysis in [1] shows that, when B1 and B2 are not met, the assembly problem is likely to be infeasible, and there is inherent ambiguity in the target genome given the set of observed reads. In this sense, our algorithm can be considered to be a near-optimal overlap-based assembler.

III. ALGORITHM OVERVIEW

Let us begin with a description of the overall structure of our approach. The algorithm starts with a set R of variable-length reads, as illustrated in Fig. 4(a). Notice that in general R may contain many reads that are essentially useless - for example, a read consisting only of a single letter. Hence we begin by discarding some of these useless reads. A typical discarding strategy (used, for example, in the string graph approach [2, 4]) consists of simply discarding any read that is contained within another read. However, such reads can potentially encode useful information about the genome (see example in Figure 5). Thus, we first process the reads using a more careful rule described in Section IV to only throw away reads that are truly useless. This yields a trimmed-down set of reads as shown in Fig. 4(b).

The next step, the read extension, is the most complex part of the algorithm.
For each read, we consider its potential successors and predecessors and carefully decide whether it can be extended to the right and to the left in an unambiguous way. Whenever B1 is satisfied, our extension algorithm is guaranteed to extend all reads correctly. Moreover, we can keep extending the reads in both directions until we hit the end of a double repeat. At this point we are not sure how to proceed and we stop, obtaining a set of extended reads as shown in Fig. 4(c).

In the third step, we merge reads that contain certain unique signatures and must belong together. Although the example in Fig. 3 does not show it, in this step we may also merge non-identical reads. If a double repeat is bridged by some read, this merging process will merge the bridging read with the correct reads to the left and right, thus resolving the repeat. The merging operation produces a new set of reads as illustrated in Fig. 4(d). At this point the only remaining ambiguity comes from unbridged double repeats.

Finally, we resolve the residual ambiguity by constructing a graph. Notice that for each unbridged double repeat, we have two reads going in, and two going out, but we do not know the correct matching. We express this structure as a graph, where each long read is a node and each unbridged double repeat is also a (single) node, as illustrated in Fig. 4(e). Since each of the unbridged double repeats has in- and out-degree two, the graph is Eulerian, and contains at least one Eulerian cycle. Whenever condition B2 is also satisfied, this cycle is unique, and corresponds to the true ordering of the long reads, yielding the true sequence.

Fig. 4. The steps of the assembly algorithm (read filtering, read extension, read merging, Eulerian graph).

In the next two sections, we will describe the algorithm in detail. In Section IV, we describe the three read processing steps: read filtering, extension and merging. Then in Section V, we present the final step where we construct the Eulerian graph and extract the genome sequence g from it. We refer to the appendix for detailed proofs.

IV. PROCESSING VARIABLE-LENGTH READS

A basic question that arises when dealing with variable-length reads is how to handle reads that are entirely contained in other reads; i.e., a read r_1 that is a substring of another read r_2. An intuitive idea would be to simply discard all such reads, as they seemingly contain no additional information for assembly. However, as shown in Fig. 5, discarding all contained reads is in general suboptimal as it can create holes in the coverage, making perfect assembly from the remaining reads infeasible. Here, r_1 is contained within r_2, because r_2 bridges a repeat which in turn contains r_1. However, deleting r_1 causes the left copy to no longer be covered by any read.

Fig. 5. Removing read r_1, which is contained in r_2, creates a coverage gap.

We start instead with a more careful treatment of contained reads, described in Algorithm 1. As it turns out, this procedure preserves valuable properties of the set of reads R, which will allow us to achieve perfect assembly. This is stated in the following lemma, whose proof we defer to the appendix.

Algorithm 1 Contained read filtering
1: Input: R
2: for r ∈ R do
3:   if r is contained in two reads that are inconsistent with each other then
4:     Remove r from R
5: Output: Updated R

Lemma 1. Suppose R covers g and B1 and B2 hold. After the filtering procedure in Algorithm 1, R still covers g, B1 and B2 still hold, and in addition,
B3. No read in R is a triple repeat in g.

After filtering out unnecessary reads, we move to the read extension step. The main idea is to consider one read at a time, and keep extending it in both directions according to other overlapping reads. Due to the existence of repeats in g, however, we cannot always confidently determine the next base, so we stop when this is no longer possible. We describe this in Algorithm 2.

Algorithm 2 Extension Algorithm
1: Input: R after filtering from Algorithm 1
2: for r ∈ R do
3:   BifurcationFound ← False
4:   t ← r
5:   while BifurcationFound = False do
6:     s ← longest proper suffix of t that appears (anywhere) in another read, but preceded by a distinct symbol
7:     X ← symbol of t preceding s
8:     U ← {segments XsK appearing in R for some K ∈ Σ}
9:     if |U| = 0 then
10:      t ← s
11:    else if |U| = 1 then
12:      r ← rA, where U = {XsA}
13:      t ← r
14:    else if |U| = 2 then
15:      BifurcationFound ← True
16:      U_right(r) ← {XsA, XsB}
17:    else
18:      B1 must have been violated
19: Repeat for left extension (obtaining U_left(r) instead)
20: Output: Set of extended reads R.

Algorithm 2 works by finding reads that overlap with r, and using them to determine what the possible next bases are. For each read r, in line 6, we carefully choose a suffix s and then look for occurrences of XsK for some K ∈ Σ in any other read to form the set U, as illustrated in Fig. 6. We will later prove that the suffix s always exists. If U is empty, we return to line 6 and consider a shorter suffix of r. If U has a single element XsA, we extend r by A. If U has two elements XsA and XsB, we conclude that we must be at the end of a repeat and a bifurcation should happen. So we set BifurcationFound to be true, and exit the loop.

Fig. 6. After the suffix s of r is defined in line 6 of Algorithm 2, we look for reads containing a string XsK for some K ∈ Σ. In this case, we would have U = {XsA, XsB}.

Fig. 7. In line 6 of Algorithm 2, we consider the longest suffix s of r (or the longest suffix that is shorter than the previously considered s) that appears in another read preceded by a distinct symbol.

A key aspect of this procedure is the selection of the suffix s, which determines the size of the match we are looking for. Intuitively, if a read overlaps with r by a large amount, we should trust that it gives us the correct next base, whereas if a read overlaps with r by a small amount, this is likely to be a spurious match. To determine the amount of overlap that is enough to be trustworthy, we look for a suffix of r that appears on a different read preceded by a different symbol, as shown in Fig. 7. To understand the choice of s, consider the following definition.

Definition 1. A read r's triple-suffix is the longest suffix of r that is a triple repeat in the genome.

A read's triple-suffix tells us the minimum overlap that we consider reliable. Although we cannot always determine this quantity exactly, it turns out that the suffix s chosen in line 6 is always an overestimate.

Theorem 2. Suffix s is always at least as long as r's triple-suffix.

The reason why an s that is at least as long as r's triple-suffix is trustworthy is that, whenever conditions B1, B2 and B3 are satisfied, our extension operations are never in error. Hence Algorithm 2 never produces a read that could not have come from the genome, nor does it cause the set of reads to violate any of our initial conditions.

Theorem 3. The Extension Algorithm produces a set of reads that continue to obey constraints B1, B2, and B3.

After extending our reads in Algorithm 2, we have a set of reads that end at precisely-double repeats, as illustrated in Fig. 4(c). These repeats make the correct next base ambiguous. However, although the next base itself is ambiguous, finding the precisely-double repeats still allows us to resolve some additional ambiguity.
We do so by merging reads together in Algorithm 3. Notice that in Algorithm 2, whenever we found a bifurcation in line 15, we recorded signatures U_right(r) = {XsA, XsB} that should identify the two possible extensions of r to the right (and U_left(r) for the possible left extensions). In Algorithm 3 we use these signatures to guide the merging operations. As in the case of the extension algorithm, in the appendix we show that Algorithm 3 does not make any mistakes:

Algorithm 3 Merging Algorithm
1: Input: R after extension from Algorithm 2
2: for r ∈ R do
3:   Let {XsA, XsB} = U_right(r). Merge all reads with XsA as r_1 and all reads with XsB as r_2
4:   If r is inconsistent with r_1, merge it with r_2
5:   If r is inconsistent with r_2, merge it with r_1
6:   If r is contained in both r_1 and r_2, discard it.
7:   Canonical successors: S(r) ← {r_1, r_2}
8: Repeat for left extensions (and compute canonical predecessors P(r) instead)
9: Output: New set of long reads R and two canonical successors S(r) and predecessors P(r) for r ∈ R

Theorem 4. The Merging Algorithm produces a set of reads that continue to obey constraints B1, B2, and B3.

Although we loop over R in the algorithm, we point out that strictly speaking this loop is not well-defined as we are modifying the set R as we loop through it. We present the algorithm in this way for simplicity. In reality, one would process reads in a queue, and additionally reprocess certain reads as necessary (whenever their successors are merged).

V. BUILDING AN EULERIAN GRAPH FROM EXTENDED READS

After the merging part of the algorithm, we obtain a set of long reads R that stretch between pairs of unbridged repeats, as illustrated in Figure 4(c). In addition, Algorithm 3 outputs, for every long read r, a pair of canonical successors, say r_1 and r_2. From the canonical successor/predecessor relationships, we will construct the final Eulerian graph G that will allow us to figure out the correct ordering of the long reads. First, we present several technical observations that guarantee that the construction of G is well defined and will satisfy certain properties.

To begin, let us consider the current state of the set of reads. The following lemma about the canonical successors of a read (and the analogous statement for predecessors) follows by construction from Algorithm 3.

Lemma 2. After Algorithm 3, each read r has two canonical successors r_1 and r_2 such that:
(a) r has a suffix Xs that is precisely a double repeat, and such that r_1 and r_2 contain XsA and XsB, for some A ≠ B (see Fig. 8), and no other read contains XsK for any K.
(b) r is not contained within both r_1 and r_2.
(c) r is consistent with both r_1 and r_2.

Our eventual goal is to show that the reads can be grouped into (non-disjoint) groups of four that all overlap on a particular substring, as shown in Fig. 8. First, we show that a read's two successors must have the same overlap.

Lemma 3. A read r has the same overlap with its successors r_1 and r_2, and is contained in neither.

Now we can demonstrate another type of symmetry: predecessors and successors are opposites in the natural sense.

Corollary 1. If r_1 is one of r's canonical successors, then r is one of r_1's canonical predecessors.
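To make the notion of overlap between a read and a successor concrete, the following sketch computes the longest proper suffix of one read that is also a prefix of another. The function name is hypothetical, and taking the longest exact suffix-prefix match is an assumption of this sketch; in the algorithm the relevant overlap is determined by the canonical successor relationship rather than by this simple rule.

```python
def overlap_length(left: str, right: str) -> int:
    """Length of the longest proper suffix of `left` that is also a prefix of `right`."""
    # Proper suffix: the whole of `left` is excluded, matching "contained in neither".
    for k in range(min(len(left) - 1, len(right)), 0, -1):
        if left.endswith(right[:k]):
            return k
    return 0


r = "TTACG"
r1 = "ACGAT"
r2 = "ACGCC"
# Both successors share the same overlap ACG with r, as Lemma 3 requires.
print(overlap_length(r, r1), overlap_length(r, r2))
```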

Fig. 8. From the original set of reads (a), merging produces sets of reads (b) that overlap precisely at unbridged double repeats.

With these technical observations, we can prove the following theorem, which shows that every read in the graph can be grouped into a four-read structure as the one shown in Fig. 8(b).

Theorem 5. If r has successors r_1 and r_2, then r_1 and r_2 both have predecessors r and r' for some r'.

Proof. Suppose r has successors r_1 and r_2. By Corollary 1, r is one of r_1's predecessors; let the other be r'. By Lemma 3, r has the same overlap with both r_1 and r_2, and by the analogous version of Lemma 3 for predecessors, r_1 has that same overlap with r'. Each of these reads therefore contains either the suffix signature of Lemma 2(a) followed by some K ∈ Σ, or the prefix signature of the predecessor version of Lemma 2(a) preceded by some K ∈ Σ. Thus, r_1 and r_2 are the successors of r and r', and vice versa, as shown in Fig. 9(a).

Note that for any four-read configuration implied by Theorem 5, the mutual overlap must be an unbridged double repeat, since no reads other than these four contain the corresponding signatures for any K ∈ Σ.

Now we have groups of reads matched in this manner: two start and two end reads, where the end reads are the start reads' successors, and vice versa. We will now construct a graph on all reads. First, for each read we will create a node. Second, for every group, we create a node corresponding to the unbridged repeat, and edges as shown in Fig. 9(b).

Fig. 9. (a) All read overlaps occur in a four-read configuration. (b) The four-read configurations are used to construct a graph.

The correct genome corresponds to some Eulerian cycle through this graph in the natural sense, because every read r must be succeeded by either r_1 or r_2, and r' will then be succeeded by the other one, which determines an Eulerian cycle. Finally, we have the following result regarding the uniqueness of Eulerian cycles, which is proved in the appendix.

Lemma 4. Suppose a graph G is Eulerian and every node has in-degree and out-degree at most 2. If there are multiple distinct Eulerian cycles in G, then any Eulerian cycle must visit two vertices u and v in an interleaved manner; i.e., u, v, u, v.
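The last step, reading the genome off an Eulerian cycle of the constructed graph, can be illustrated with a short sketch. The text does not name a specific procedure for extracting the cycle; Hierholzer's algorithm is used here as an assumption, and the node labels in the example (two long reads "a" and "b" and one unbridged double repeat node "x") are hypothetical.

```python
def eulerian_cycle(edges: list[tuple[str, str]], start: str) -> list[str]:
    """Return the vertices along an Eulerian cycle from `start`, assuming one exists."""
    out: dict[str, list[str]] = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if out.get(u):
            # Follow an unused outgoing edge of u.
            stack.append(out[u].pop())
        else:
            # u has no unused edges left, so it is final in the cycle.
            cycle.append(stack.pop())
    cycle.reverse()
    return cycle


# One unbridged double repeat x with two reads going in and two going out.
edges = [("a", "x"), ("x", "b"), ("b", "x"), ("x", "a")]
cycle = eulerian_cycle(edges, "a")
print(cycle)
```

In this example the repeat node is the only node with in- and out-degree two, and the cycle traverses every edge exactly once, which fixes the order of the long reads around the circular genome.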
Since, in our construction, only the unbridged double repeat nodes have degree more than one, Lemma 4 implies that our constructed graph only has multiple Eulerian cycles if g has unbridged interleaved repeats. We conclude that if B2 is satisfied our constructed Eulerian graph has a unique Eulerian cycle, which must correspond to the genome sequence g.

Finally, we confirm that the entire algorithm can be implemented efficiently.

Theorem 6. Suppose that we have N reads with read lengths bounded by some fixed constant L_max, and that the coverage depth c = (\sum_{i=1}^{N} L_i) / G (the average number of reads covering a symbol in g) is a constant. Then the algorithm can be implemented in O(N^3) time.

VI. CONCLUDING REMARKS

In this paper, we described an overlap-based assembly algorithm with performance guarantees under the assumption of error-free reads. However, the overlap-based nature of the algorithm makes it amenable to be modified to handle read errors. The algorithm relies on string operations such as testing whether one string is contained within another and whether two strings share a substring, which have natural approximate analogues. For example, instead of testing whether two strings overlap, we can test whether they approximately overlap (with few errors). The bridging conditions then translate to approximate analogues (e.g., instead of triple repeats being all-bridged, we must have approximate triple repeats being all-bridged), and the sufficiency proofs should naturally translate to this scenario.

We point out that real sequencing datasets are generally large and require linear- or near-linear-time algorithms, making the O(N^3) complexity guaranteed by Theorem 6 impractical. However, we expect that minor changes in the algorithm design and analysis can yield an algorithm which is, in practice, at most quadratic and possibly near-linear.

VII. APPENDIX

Here we present several proofs referred to throughout the text.

A. Proof of Lemma 1

Proof. Suppose that r is a triple repeat. In particular, r must be contained in some maximal triple repeat s. If all copies of s are preceded by the same base, say X, then Xs would also be a triple repeat, meaning that s would not be maximal. Thus, s must be preceded by at least two different bases, say X and Y. If B1 holds, then s is all-bridged, so Xs and Ys must appear in two reads, and these two reads contain r and are inconsistent with each other, so that r is removed. Thus, all triple repeats are removed (B3).

Now suppose r was removed. r appears in two inconsistent reads, so it is at least a double repeat. If r is a triple repeat, it must be all-bridged, so that r is subsumed in one of the bridging reads and can be discarded. Otherwise, r is precisely a double repeat. Suppose it is contained within both r_1 and r_2. Since the two reads are inconsistent, they must correspond precisely to the two distinct locations where r may be in the genome. r must be subsumed within either r_1 or r_2, and can be discarded. Thus, removing a read never violates B1 or B2.
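The filtering rule analyzed in this proof can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: reads are treated as plain (non-circular) strings, containment is tested against the original read set rather than a set updated in place, and the helper names are hypothetical.

```python
from itertools import combinations


def occurrences(needle: str, haystack: str):
    """Yield every start offset at which needle occurs in haystack."""
    start = haystack.find(needle)
    while start != -1:
        yield start
        start = haystack.find(needle, start + 1)


def inconsistent(a: str, off_a: int, b: str, off_b: int, length: int) -> bool:
    """True if a and b disagree at some base once aligned on the shared string
    a[off_a : off_a + length] == b[off_b : off_b + length]."""
    # Compare the bases to the left of the shared string, moving outward.
    for k in range(1, min(off_a, off_b) + 1):
        if a[off_a - k] != b[off_b - k]:
            return True
    # Compare the bases to the right of the shared string.
    right = min(len(a) - off_a - length, len(b) - off_b - length)
    for k in range(right):
        if a[off_a + length + k] != b[off_b + length + k]:
            return True
    return False


def filter_contained_reads(reads: list[str]) -> list[str]:
    """Drop each read that is contained in two mutually inconsistent reads."""
    kept = []
    for i, r in enumerate(reads):
        placements = [(j, off) for j, other in enumerate(reads) if j != i
                      for off in occurrences(r, other)]
        remove = any(ja != jb and inconsistent(reads[ja], oa, reads[jb], ob, len(r))
                     for (ja, oa), (jb, ob) in combinations(placements, 2))
        if not remove:
            kept.append(r)
    return kept


print(filter_contained_reads(["CGT", "ACGTA", "TTCGTGG", "GACGTAC"]))
print(filter_contained_reads(["CGT", "ACGTA", "GACGTAC"]))
```

In the first call, CGT sits inside ACGTA and TTCGTGG, which disagree immediately to its left, so it is removed. In the second call every read containing CGT agrees on the surrounding bases, so nothing is discarded, which is the behaviour that avoids the coverage gap of Fig. 5.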

. Poof of Theoem 2 Poof. Fit we pove a ueful lemma. Lemma 5. If tiple uffix i a pope uffix of t befoe line 6, then it will be a uffix of afte line 6. Subpoof of Lemma 5. Let tiple-uffix be a ting peceded by. Then cannot alo be a tiple epeat (othewie it would be a tiple-uffix longe than, a contadiction). Thu, of the thee copie of, they cannot all be peceded by ; all but one o two mut be peceded by a diffeent bae, ay, a hown below. Fig. 10. i at mot a double epeat. Since thi copy mut be bidged, thee mut be a ead containing the ting. lo, i a pope uffix of t by aumption. Thu, atifie the condition decibed in line 6; and ince line 6 eache fo the longet ting atifying thoe condition, will be at leat a long a tiple-uffix. That i to ay, i a uffix of. With thi lemma, we can poceed by induction. Fit we conide the bae cae, when we et on line 6 with t =. Fitly, i guaanteed to have ome tiple-uffix fo nontivial genome, ince each bae in,g,c,t hould appea at leat thee time in the genome; thi mean that at leat the lat bae of i a tiple epeat. Now tiple-uffix cannot be itelf, by 3; thu i a pope uffix of = t. Thi atifie the condition of Lemma 5. Then we conide all thee cae in the next code block. (a) If we poceed to line 9, we aign t = and loop. We et t = only if U =, that i, i unbidged. We pove by contadiction that tiple uffix i a pope uffix of t. Suppoe not; then t i a uffix of. Since i a tiple epeat, it mut be bidged, and t, a ubting of, mut alo be bidged, a contadiction. gain, thi allow u to ue Lemma 5, and we ae done. If we poceed to line 11, we et t =. Since i at leat a long a tiple-uffix, i at leat a long a tiple uffix. If not, ome i a tiple epeat, but then i alo a tiple epeat that i longe than, a contadiction. Now ince i at leat a long a tiple uffix, and i a pope uffix of t =, tiple uffix will alo be a pope uffix of, o we can ue Lemma 5. (c) If we poceed to line 14, we exit the loop, o that the invaiant i ielevant. 
(d) If we proceed to line 17, we see at least three different strings, say $sA$, $sB$, and $sC$. Thus $s$ is a triple repeat, and it is longer than the triple-suffix, a contradiction to the inductive hypothesis. Thus this line can never be reached.

Note that this guarantees that line 6 never fails, because the triple-suffix is always a suffix of $t$ and therefore there is at least one valid candidate for the value $s$.

C. Proof of Theorem 3

Proof. The only place where the set of reads is modified is line 12. Here, $s$ is a double repeat, because both $As$ and $Bs$ are substrings of some reads. Suppose $s$ is precisely a double repeat. Then, since $s$ appears both in $As$ and in $Bs$, both $As$ and $Bs$ appear only once in the genome (and $s$ appears once in each, i.e., twice in total), as shown in Fig. 11.

Fig. 11. Extension step when $s$ is a double repeat.

Since $As$ appears once in the genome and a read contains $AsX$, the next base of $r$ must be $X$. Alternatively, suppose $s$ is not precisely a double repeat, i.e., it is a triple repeat. Then $s$ must be all-bridged by assumption 1. If $r$ were succeeded by some base other than $X$ (say, $Y$), then the string $sY$ would appear in the genome. Since $s$ is all-bridged, $sY$ would also appear in some read. But it does not, so again $r$ must be succeeded by $X$, as shown in Fig. 12.

Fig. 12. Extension step when $s$ is a triple repeat.

Thus, we can extend $r$ by $X$ safely. Clearly, a correct extension operation cannot violate any of the conditions 1, 2, and 3.

D. Proof of Theorem 4

Proof. We are given $U(s) = \{A, B\}$, which were carried over from the extension algorithm. We claim that both $sA$ and $sB$ appear only once in the genome each. If either of them appeared more than once, then $s$ would be a triple repeat that is longer than the triple-suffix (by Theorem 2), a contradiction. So, since $sA$ and $sB$ appear only once each in the genome, we can merge all reads with $sA$ into a single read $r_1$, and similarly all reads with $sB$ into a single read $r_2$. Also, since $s$ appears only twice in the genome, if a read $r$ is inconsistent with $r_1$, it must be aligned with $r_2$ and can be merged with $r_2$; and vice versa. If $r$ is contained in both $r_1$ and $r_2$, it cannot contribute to conditions 1, 2 and 3, and can be discarded.

E. Proof of Lemma 3

Proof.
Suppose toward a contradiction that $r$ has two successors $r_1$ and $r_2$, where $r$ has strictly greater overlap with $r_1$ than with $r_2$, with $s$ being the shorter overlap, as shown in Fig. 13.

Fig. 13. Two successors with different overlaps produce a contradiction.

$r_2$ may not contain $r$, otherwise both $r_1$ and $r_2$ would contain $r$, contradicting Lemma 2. Now, if we let $z$ be the suffix of $r$ as in Lemma 2(a), $z$ must be a suffix of $s$. Since $z$ is precisely a double repeat (once followed by $A$ and once followed by $B$), $zB$ only appears once in the genome. Now if we apply the predecessor version of Lemma 2 to $r_2$, $r_2$ begins with some $w$ and has predecessors containing $Xw$ and $Yw$ respectively. Since $w$ is precisely a double repeat, it cannot contain $zB$ (which only appears once) and must be a prefix of $s$. Since $w$ can only be preceded by $X$ or $Y$, we can assume without loss of generality that $r$ contains $Xw$ as depicted. But then $r$ and $r_1$ both contain $Xw$, contradicting (the predecessor version of) Lemma 2(a), which states that only one read may contain $Xw$. Thus a read must have the same overlap with both successors, and cannot be contained in either, or it would be contained in both, contradicting Lemma 2.

F. Proof of Corollary 1

Proof. By (the predecessor version of) Lemma 2, $r_1$ begins with a double repeat $w$ and has predecessors containing $Xw$ and $Yw$ respectively, as shown in Fig. 14.

Fig. 14. Successors/predecessors are symmetric.

If we let $s$ be the overlap between $r$ and $r_1$ and $A$ be the symbol following $s$ in $r_1$, as we argued in the proof of Lemma 3, $sA$ can only appear once in the genome. Hence, since $w$ is a double repeat, it must be a prefix of $s$. Moreover, $r$ cannot be contained within $r_1$ by Lemma 3. Since by (the predecessor version of) Lemma 2, only the predecessors of $r_1$ may contain $Kw$ for some $K \in \Sigma$, $r$ must be one of the predecessors of $r_1$.

G. Proof of Lemma 4

Proof. Since $G$ contains an Eulerian cycle, $d_{in}(v) = d_{out}(v)$ for every $v$ (the in- and out-degrees are equal). Moreover, any node $v$ with $d_{in}(v) = d_{out}(v) = 1$ can only be traversed by an Eulerian cycle in a unique way, so it can be condensed, and the Eulerian cycles in the resulting graph are in one-to-one correspondence with the Eulerian cycles in $G$. Similarly, if a node $v$ has a self-loop, $v$ can be condensed, since there is a unique way in which any Eulerian cycle can traverse it. Therefore, we may prove the lemma assuming the special case where $d_{in}(v) = d_{out}(v) = 2$ for every $v$, and there are no self-loops.

To prove the forward direction, suppose an Eulerian cycle $C$ visits $u$ and $v$ in an interleaved manner.
Then $C$ can be partitioned into four paths, two from $u$ to $v$ and two from $v$ to $u$. By considering the two possible orderings of these four paths into a cycle, we obtain two distinct Eulerian cycles.

For the backward direction, suppose no such nodes $u$ and $v$ exist. If after the condensation steps described above, $n = 1$, the Eulerian cycle must be unique. So suppose $n \geq 2$, and consider the order in which $C$ visits the nodes in $v_1, \ldots, v_n$, each one twice, starting from $v_1$. Suppose $v_a$ is the first node to be visited twice. Since there is no self-loop, there must be a node $v_b$ that was visited between the first and second visit to $v_a$. Therefore, $v_a$ and $v_b$ are visited in an interleaved manner, which is a contradiction.

H. Proof of Theorem 6

Proof. In many portions of the algorithm, we will need to find bridging reads. To make this more efficient, we can preprocess each read by hashing all of its substrings, which takes $O(N)$ time. Using this hash table, we can implement Algorithm 1 in constant time per read, so $O(N)$ time. In Algorithm 2, we can loop only $O(G) = O(N)$ times for each read, because either $t$ becomes shorter or the read becomes longer every time we loop; and if we loop $G$ times we have already assembled the entire genome and can terminate. Each loop takes constant time using our hash table to find bridging reads. Thus the total time is $O(N^2)$. After Algorithm 2, we will recompute the hash table. This step takes $O(N^3)$ time because the reads may have grown in size up to $O(G)$. Then, in Algorithm 3, we perform some additional constant-time bridging checks, and merge some reads. Each merge takes $O(N)$ time and reduces the number of reads by one, thus this phase is $O(N^2)$ time. Finally, we construct a graph and traverse an Eulerian cycle in it, which takes linear time. This completes the algorithm.
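The proof of Theorem 6 relies on a hash table over read substrings so that bridging reads can be looked up quickly. A minimal sketch of such an index follows, assuming error-free reads given as plain strings over `ACGT`. The names `build_substring_index`, `successor_bases`, and `bridging_reads` are hypothetical, and this naive version enumerates every substring explicitly without trying to match the running times stated in the proof.

```python
from collections import defaultdict


def build_substring_index(reads):
    """Map every substring of every read to the set of read indices containing it."""
    index = defaultdict(set)
    for rid, read in enumerate(reads):
        for i in range(len(read)):
            for j in range(i + 1, len(read) + 1):
                index[read[i:j]].add(rid)
    return index


def successor_bases(index, s, alphabet="ACGT"):
    """Bases X such that the string sX occurs in at least one read."""
    # Membership tests do not insert keys into a defaultdict.
    return {x for x in alphabet if s + x in index}


def bridging_reads(index, s, alphabet="ACGT"):
    """Indices of reads that contain s with at least one base on each side."""
    found = set()
    for x in alphabet:
        for y in alphabet:
            found |= index.get(x + s + y, set())
    return found


reads = ["TACGA", "GACGC", "ACG"]       # toy reads, not taken from the paper
index = build_substring_index(reads)
print(successor_bases(index, "ACG"))
print(bridging_reads(index, "ACG"))
```

Once the table is built, each query costs a bounded number of dictionary lookups, which is the property the running-time argument depends on. A production implementation would more likely use a suffix structure than materialize every substring.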