lack of resolution Gene duplication Organismal tree:


Trees: what might they mean?

Calculating a tree is comparatively easy; figuring out what it might mean is much more difficult.

If this is the probable organismal tree:

Why could a gene tree look like this?

Lack of resolution
e.g., 60% bootstrap support for bipartition (AD)(CB)

Long branch attraction artifact
The two longest branches join together, e.g., 100% bootstrap support for bipartition (AD)(CB).
What could you do to investigate if this is a possible explanation? Use only slow positions; use an algorithm that corrects for ASRV (among-site rate variation).

Gene transfer
Organismal tree and molecular tree (figure labels: speciation, gene transfer)

Gene duplication
Organismal tree: gene duplication
Molecular tree (assuming gene loss): gene duplication, seq. from B, seq. from C, seq. from D

Gene duplication and gene transfer are equivalent explanations.
The more relatives of C are found that do not have the blue type of gene, the less likely is the duplication-loss scenario.
Scenarios shown: horizontal or lateral gene transfer; ancient duplication followed by gene loss.
Note that scenario B involves many more individual events than A: 1 HGT with orthologous replacement (A) versus 1 gene duplication followed by 4 independent gene loss events (B).

Function, ortho- and paralogy
Molecular tree: gene duplication, seq. from B, seq. from C, seq. from D
The presence of the duplication is a taxonomic character (shared derived character in C D). The phylogeny suggests that seq and seq have similar function, and that this function was important in the evolution of the clade BCD. seq in B and seq in C and D are orthologs and probably have the same function, whereas seq and seq in BCD probably have different function (the difference might be in subfunctionalization of functions that seq had in A, e.g. organ-specific expression).

Phylip
Written and distributed by Joe Felsenstein and collaborators (some of the following is copied from the PHYLIP homepage).
PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees).
PHYLIP is the most widely-distributed phylogeny package, and competes with PAUP* to be the one responsible for the largest number of published trees. PHYLIP has been in distribution since 1980, and has over 15,000 registered users.

Input and output
Output is written onto special files with names like "outfile" and "outtree". Trees written onto "outtree" are in the Newick format, an informal standard agreed to in 1986 by authors of a number of major phylogeny packages.
Input is either provided via a file called "infile" or in response to a prompt.

What's in PHYLIP
Programs in PHYLIP allow you to do parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.

Phylip works well with protein and nucleotide sequences.
Many other programs mimic the style of PHYLIP programs (e.g. TREEPUZZLE, phyml, protml).
Many other packages use PHYLIP programs in their inner workings (e.g., PHYLO_WIN).
PHYLIP runs under all operating systems.
Web interfaces are available.

Programs in PHYLIP are modular
For example:
SEQBOOT takes one set of aligned sequences and writes out a file containing bootstrap samples.
PROTDIST takes aligned sequences (one or many sets) and calculates distance matrices (one or many).
FITCH (or NEIGHBOR) calculates best fitting or neighbor joining trees from one or many distance matrices.
CONSENSE takes many trees and returns a consensus tree.
Modules are available to draw trees as well, but often people use treeview or njplot.

The Phylip Manual is an excellent source of information. Brief one-line descriptions of the programs are here.

The easiest way to run PHYLIP programs is via a command line menu (similar to clustalw). The program is invoked through clicking on an icon, or by typing the program name at the command line.

    > seqboot
    > protpars
    > fitch

If there is no file called infile the program responds with:

    [gogarten@carrot gogarten]$ seqboot
    seqboot: can't find input file "infile"
    Please enter a new file name>

Figure labels: program folder; menu interface.

Example 1: Protpars
Example: seqboot, protpars, consense on infile1.
NOTE: the bootstrap majority consensus tree does not necessarily have the same topology as the FM tree from the original data!
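The pseudosamples that SEQBOOT writes are bootstrap resamples of alignment columns. The sketch below only illustrates that idea and is not SEQBOOT's own code or file format: the toy alignment, the function name `bootstrap_alignment`, and the fixed random seed are assumptions made for this example.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one pseudosample: columns drawn with replacement from the alignment."""
    names = list(alignment)
    length = len(alignment[names[0]])
    # Every sequence must have the same aligned length.
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("sequences are not aligned to the same length")
    # Draw as many column indices as the alignment has columns, with replacement.
    columns = [rng.randrange(length) for _ in range(length)]
    # The same columns are used for every sequence, so columns stay intact.
    return {name: "".join(alignment[name][c] for c in columns) for name in names}

# Hypothetical toy alignment of four taxa.
alignment = {
    "A": "MKTAYIAK",
    "B": "MKSAYLAK",
    "C": "MRTAYIGK",
    "D": "MRSAFIGK",
}

rng = random.Random(1)  # fixed seed so the run is repeatable
samples = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(samples[0])
```

Each pseudosample would then be analysed like the original data (for example with PROTPARS), and the resulting trees summarised with CONSENSE.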
Slide notes: threshold parsimony, gap symbols - versus; outfile; outtree; compare to distance matrix analysis.

Protpars example: seqboot and protpars on infile1

protpars (versus distance/FM)
Extended majority rule consensus tree

CONSENSUS TREE: the numbers on the branches indicate the number of times the partition of the species into the two sets which are separated by that branch occurred among the trees, out of 100.00 trees.

[Figure: consensus tree. Taxa, in the order drawn: Prochloroc, Synechococ, Guillardia, Clostridiu, Thermoanae, Homo sapie, Oryza sati, Arabidopsi, Synechocys, Nostoc pun, Nostoc sp, Trichodesm, Thermosyne. Support values shown on branches: 100, 85.7, 88.3, 100, 100, 50.8, 100.0, 53.0, 99.5, 38.5.]

Branches are scaled with respect to bootstrap support values; the number for the deepest branch is handled incorrectly by njplot and treeview.

(protpars versus) distance/FM
Tree is scaled with respect to the estimated number of substitutions.
What might be the explanation for the red algae not grouping with the plants?

If time: demo of njplot, without and with correction for ASRV.

Remember: this is an unrooted tree!
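The support numbers on a consensus tree count how often each bipartition occurs among the bootstrap trees. The sketch below counts bipartitions in a list of unrooted trees. It is an illustration rather than the CONSENSE algorithm: trees are written as nested Python tuples instead of Newick strings, and the 60/40 split of the hundred toy trees is invented to mirror the "60% bootstrap support for bipartition (AD)(CB)" example.

```python
from collections import Counter

def leaves(node):
    """Leaf names below a node; a node is a name (str) or a tuple of child nodes."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))

def bipartitions(tree):
    """Non-trivial splits of an unrooted tree.

    Each split is stored as the side that does NOT contain a reference taxon,
    so the same split always gets the same key regardless of how the tree is drawn.
    """
    all_taxa = leaves(tree)
    reference = min(all_taxa)
    splits = set()

    def walk(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        other = all_taxa - side
        # Skip trivial splits (a single taxon, or nothing, on one side).
        if len(side) > 1 and len(other) > 1:
            splits.add(other if reference in side else side)
        for child in node:
            walk(child)

    walk(tree)
    return splits

def support(trees):
    """Percentage of trees in which each bipartition occurs."""
    counts = Counter()
    for tree in trees:
        counts.update(bipartitions(tree))
    return {split: 100.0 * n / len(trees) for split, n in counts.items()}

# Hypothetical bootstrap result: 60 trees with (AD)(CB), 40 trees with (AB)(CD).
trees = [(("A", "D"), ("C", "B"))] * 60 + [(("A", "B"), ("C", "D"))] * 40
sup = support(trees)
for split, percent in sorted(sup.items(), key=lambda item: -item[1]):
    print(sorted(split), percent)
```

Because the trees are unrooted, (AD)(CB) is recorded once, under the key {B, C}, whichever side of the tree is visited first.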

Slide notes: subtree with branch lengths, without and with correction for ASRV; compare to trees with FITCH and clustalw (same dataset); bootstrap support ala clustal; protpars (gaps as ).

Phylip, Unix and Perl
Rather than typing commands at the menu, you can write the responses that you would need to give via the keyboard into a file (e.g. your_input.txt).
You could start and execute the program protpars by typing

    protpars < your_input.txt

your_input.txt might contain the following lines:

    infile1.txt
    t
    10
    y

In a script you could use the line

    system("protpars < your_input.txt");

The main problem is the overwrite prompt that appears if the outfile and outtree files already exist. You can either create these files beforehand, or erase them by moving (mv) their contents somewhere else.

Create *.phy files
The easiest (probably) is to run clustalw with the phylip option:

    print "# This program aligns all multiple sequence files with names *.fa \n # found in its directory using clustalw, and saves them in phylip format.\n";
    while(defined($file=glob("*.fa"))){
        @parts=split(/\./,$file);
        $file=$parts[0];
        system("clustalw -infile=$file.fa -align -output=phylip");
    };
    # cleanup:
    system ("rm *.dnd");

Alternatively, you could use a web version of readseq; this one worked great for me.

Run phylip programs from perl
An example on how to run multiple bootstrap analyses is here, the cmd files are here, here, and here:

    print "# This program runs seqboot, protpars and consense on all multiple \n # sequence files with names *.phy\n";
    while(defined($file=glob("*.phy"))){
        @parts=split(/\./,$file);
        $file=$parts[0];
        system ("cp $file.phy infile");
        system ("seqboot < seqboot.cmd");
        system ("mv outfile infile");
        system ("protpars < protpars.cmd");
        system ("rm outfile");
        system ("mv outtree intree");
        system ("consense < consense.cmd");
        system ("mv outtree $file.outtree");
        system ("mv outfile $file.outfile");
    };
    # cleanup:
    system ("rm infile");
    system ("rm intree");

Alternative for entering the commands for the menu:

    system ("cp A.phy infile");
    system ("echo -e 'y\n9\n' | seqboot");

echo returns the quoted string, i.e., y\n9\n.
The -e option allows the use of \n. The symbol | pipes the output from echo to seqboot.

phyml
PHYML - A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.
An online interface is here; there is a command line version that is described here (not as straightforward as in clustalw); a phylip-like interface is automatically invoked if you type phyml. The manual is here.
The paper describing phyml is here, a brief interview with the authors is here.

TreePuzzle née PUZZLE
TREE-PUZZLE is a very versatile maximum likelihood program that is particularly useful to analyze protein sequences. The program was developed by Korbinian Strimmer and Arndt von Haeseler (then at the Univ. of Munich) and is maintained by von Haeseler, Heiko A. Schmidt, and Martin Vingron (contacts see http://www.tree-puzzle.de/).

TREE-PUZZLE allows fast and accurate estimation of ASRV (through estimating the shape parameter alpha) for both nucleotide and amino acid sequences.
It has a fast algorithm to calculate trees through quartet puzzling (calculating ml trees for quartets of species and building the multispecies tree from the quartets).
The program provides confidence numbers (puzzle support values), which tend to be smaller than bootstrap values (i.e. provide a more conservative estimate).
The program calculates branch lengths and likelihood for user defined trees, which is great if you want to compare different tree topologies, or different models using the maximum likelihood ratio test.
Branches which are not significantly supported are collapsed.
TREE-PUZZLE runs on "all" platforms.
TREE-PUZZLE reads PHYLIP format, and communicates with the user in a way similar to the PHYLIP programs.

Maximum likelihood ratio test
If you want to compare two models of evolution (this includes the tree) given a data set, you can utilize the so-called maximum likelihood ratio test.
If $L_1$ and $L_2$ are the likelihoods of the two models, $d = 2(\log L_1 - \log L_2)$ approximately follows a chi-square distribution with $n$ degrees of freedom. Usually $n$ is the difference in model parameters, i.e., how many parameters are used to describe the substitution process and the tree. In particular, $n$ can be the difference in branches between two trees (one tree is more resolved than the other).
In principle, this test can only be applied if one model is a more refined version of the other. In the particular case when you compare two trees, one calculated without assuming a clock, the other assuming a clock, the degrees of freedom are the number of OTUs - 2 (as all sequences end up in the present at the same level, their branches cannot be freely chosen).
To calculate the probability you can use the CHISQUARE calculator for Windows available from Paul Lewis.

TREE-PUZZLE allows (cont)
TREEPUZZLE calculates distance matrices using the ml specified model. These can be used in FITCH or Neighbor.
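As a worked illustration of the maximum likelihood ratio test for the clock versus no-clock comparison described above, the log-likelihood values and the number of OTUs below are hypothetical, chosen only to show the arithmetic. The critical value 12.59 is the standard 5% point of the chi-square distribution with 6 degrees of freedom.

```latex
% Hypothetical values, for illustration only (natural logarithms assumed)
\begin{align*}
\log L_1 &= -2350.4 \quad \text{(tree estimated without a clock)} \\
\log L_2 &= -2356.1 \quad \text{(tree estimated with a clock)} \\
d &= 2(\log L_1 - \log L_2) = 2 \times 5.7 = 11.4 \\
n &= \text{number of OTUs} - 2 = 8 - 2 = 6 \\
\chi^2_{0.05,\,6} &= 12.59 > 11.4
\end{align*}
```

With these numbers $d$ stays below the critical value, so the clock model would not be rejected at the 5% level. A larger $d$ (or fewer OTUs) would change that conclusion.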
PUZZLEBOOT automates the approach of feeding TREE-PUZZLE distance matrices to FITCH or NEIGHBOR, so that it can be used for bootstrap analyses.
WARNING: this is a distance matrix analysis!
The official script for PUZZLEBOOT is here. You need to create a command file (puzzle.cmds), and puzzle needs to be invocable through the command puzzle. Your input file needs to be the renamed outfile from seqboot.
A slightly modified working version of puzzleboot_mod.sh is here, and here is an example for puzzle.cmds. Read the instructions before you run this!

Maximum likelihood mapping is an excellent way to assess the phylogenetic information contained in a dataset.
ML mapping can be used to calculate the support around one branch.

@@@ Puzzle is cool, don't leave home without it! @@@

Some possible pathways from sequence to tree, model and support values:
Sequence alignment: CLUSTALW, MUSCLE
Removing ambiguous positions: T-COFFEE, FORBACK
Generation of pseudosamples: SEQBOOT
Calculating and evaluating phylogenies: PROTDIST, NEIGHBOR, TREE-PUZZLE, FITCH, PROTPARS, PHYML (ml mapping)
Comparing phylogenies: CONSENSE, SH-TEST in TREE-PUZZLE
Comparing models: Maximum Likelihood Ratio Test
Visualizing trees: ATV, njplot, treeview

From: Olga Zhaxybayeva and J Peter Gogarten, BMC Genomics 2002, 3:4

Figure 5. Likelihood-mapping analysis for two biological data sets. (Upper) The distribution patterns. (Lower) The occupancies (in percent) for the seven areas of attraction. (A) Cytochrome-b data from ref. 14. (B) Ribosomal DNA of major arthropod groups (15).
From: Korbinian Strimmer and Arndt von Haeseler, Proc. Natl. Acad. Sci. USA Vol. 94, pp. 6815-6819, June 1997

ml mapping can assess the topology surrounding an individual branch
E.g.: If we want to know if Giardia lamblia forms the deepest branch within the known eukaryotes, we can use ML mapping to address this problem.
To apply ml mapping we choose the "higher" eukaryotes as cluster a, another deep branching eukaryote (the one that competes against Giardia) as cluster b, Giardia as cluster c, and the outgroup as cluster d. For an example output see this sample ml-map.
An analysis of the carbamoyl phosphate synthetase domains with respect to the root of the tree of life is here.

ml mapping can assess the not necessarily treelike histories of genomes
Application of ML mapping to comparative genome analyses:
See here for a comparison of different probability measures. Fig. 3: outline of approach. Fig. 4: Example and comparison of different measures.
See here for an approach that solves the problem of poor taxon sampling that is usually considered inherent with quartet analyses. Fig. 2: The principle of analyzing extended datasets to obtain embedded quartets.

Example (next slides):
Cluster a: 14 sequences, outgroup (prokaryotes)
Cluster b: 20 sequences, other Eukaryotes
Cluster c: 1 sequence, Plasmodium
Cluster d: 1 sequence, Giardia

[Figure: likelihood-mapping triangle with corners (a,b)-(c,d) at the top, (a,d)-(b,c) at the lower left and (a,c)-(b,d) at the lower right, divided into regions 1, 2 and 3.]

Number of quartets in region 1: 68 (= 24.3%)
Number of quartets in region 2: 21 (= 7.5%)
Number of quartets in region 3: 191 (= 68.2%)

Occupancies of the seven areas 1, 2, 3, 4, 5, 6, 7:

[Figure: the same triangle divided into seven areas: 1, 2 and 3 at the corners, 4, 5 and 6 along the edges, and 7 in the center.]

Number of quartets in region 1: 53 (= 18.9%)
Number of quartets in region 2: 15 (= 5.4%)
Number of quartets in region 3: 173 (= 61.8%)
Number of quartets in region 4: 3 (= 1.1%)
Number of quartets in region 5: 0 (= 0.0%)
Number of quartets in region 6: 26 (= 9.3%)
Number of quartets in region 7: 10 (= 3.6%)
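The percentages in the listing are simply each region's quartet count divided by the total number of quartets (280 in this example). A short sketch that reproduces them from the counts above follows. The function name `occupancies` is made up for the illustration, and the reading of areas 1 to 3 as the corner ("resolved") regions is an assumption based on the diagram.

```python
def occupancies(counts):
    """Percentage of quartets falling in each region, rounded to one decimal."""
    total = sum(counts.values())
    return {region: round(100.0 * n / total, 1) for region, n in counts.items()}

# Quartet counts copied from the listing above.
three_regions = {1: 68, 2: 21, 3: 191}
seven_areas = {1: 53, 2: 15, 3: 173, 4: 3, 5: 0, 6: 26, 7: 10}

three = occupancies(three_regions)
seven = occupancies(seven_areas)
print(three)
print(seven)

# Assumption: areas 1-3 are the corners, i.e. quartets that favour a single topology.
corner_total = sum(seven_areas[r] for r in (1, 2, 3))
resolved = round(100.0 * corner_total / sum(seven_areas.values()), 1)
print(resolved)
```

In both partitions region 3, the (a,d)-(b,c) corner, holds the majority of quartets.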

TREE-PUZZLE PROBLEMS/DRAWBACKS
The more species you add, the lower the support for individual branches. While this is true for all algorithms, in TREE-PUZZLE this can lead to completely unresolved trees with only a handful of sequences.
Trees calculated via quartet puzzling are usually not completely resolved, and they do not correspond to the ML-tree: the determined multi-species tree is not the tree with the highest likelihood; rather, it is the tree whose topology is supported through ml-quartets, and the lengths of the resolved branches are determined through maximum likelihood.

puzzle example
The best tree might not be the true tree. When can one conclude that a tree is a significantly worse explanation for the data compared to the best tree?
Estimate the probability that a dataset might have resulted from a given tree.
Example: Kira's kangaroo data. Usertrees - SH test - go through outfile.
(PARS, PROTPARS and DNAPARS perform a similar test when confronted with multiple usertrees.)

COMPARISON OF DIFFERENT SUPPORT MEASURES
Zhaxybayeva and Gogarten, BMC Genomics 2003 4: 37
A: mapping of posterior probabilities according to Strimmer and von Haeseler
B: mapping of bootstrap support values
C: mapping of bootstrap support values from extended datasets

ml-mapping versus bootstrap values from extended datasets
More gene families group species according to environment than according to 16S rRNA phylogeny.
In contrast, a thermophilic archaeon has more genes grouping with the thermophilic bacteria.

Bayes' Theorem
Reverend Thomas Bayes (1702-1761)

$$P(\text{model} \mid \text{data}, I) = P(\text{model}, I)\,\frac{P(\text{data} \mid \text{model}, I)}{P(\text{data}, I)}$$

Posterior probability, $P(\text{model} \mid \text{data}, I)$: represents the degree to which we believe a given model accurately describes the situation given the available data and all of our prior information I.
Prior probability, $P(\text{model}, I)$: describes the degree to which we believe the model accurately describes reality based on all of our prior information.
Likelihood, $P(\text{data} \mid \text{model}, I)$: describes how well the model predicts the data.
Normalizing constant: $P(\text{data}, I)$.

Elliott Sober's Gremlins
Observation: loud noise in the attic.
Hypothesis: gremlins in the attic playing bowling.
Likelihood = P(noise | gremlins in the attic)
Posterior probability = P(gremlins in the attic | noise)

Alternative Approaches to Estimate Posterior Probabilities
Bayesian Posterior Probability Mapping with MrBayes (Huelsenbeck and Ronquist, 2001)

Problem: Strimmer's formula

$$p_i = \frac{L_i}{L_1 + L_2 + L_3}$$

only considers 3 trees (those that maximize the likelihood for the three topologies).

Solution: exploration of the tree space by sampling trees using a biased random walk (implemented in the MrBayes program). Trees with higher likelihoods will be sampled more often.

$$p_i \approx \frac{N_i}{N_{total}}$$

where $N_i$ is the number of sampled trees of topology $i$, $i = 1, 2, 3$, and $N_{total}$ is the total number of sampled trees (has to be large).

Illustration of a biased random walk: figure generated using the MCRobot program (Paul Lewis, 2001).
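The two estimates of the posterior probability of a quartet topology can be sketched in a few lines. The code is only an illustration of the two formulas and not code from TREE-PUZZLE or MrBayes: the log-likelihood values and the list of sampled topologies are hypothetical, and the function names are made up for this example.

```python
import math
from collections import Counter

def quartet_posteriors(log_likelihoods):
    """Strimmer's formula p_i = L_i / (L_1 + L_2 + L_3), from log-likelihoods.

    The largest log-likelihood is subtracted first so that exp() does not
    underflow; the common factor cancels in the ratio.
    """
    largest = max(log_likelihoods)
    weights = [math.exp(value - largest) for value in log_likelihoods]
    total = sum(weights)
    return [weight / total for weight in weights]

def sampled_posteriors(sampled_topologies):
    """Estimate p_i as N_i / N_total from the topologies visited by a sampler."""
    counts = Counter(sampled_topologies)
    n_total = len(sampled_topologies)
    return {topology: n / n_total for topology, n in counts.items()}

# Hypothetical maximized log-likelihoods for the three quartet topologies.
p = quartet_posteriors([-1000.0, -1001.0, -1003.0])
print(p)

# Hypothetical output of a biased random walk: topology labels of 1000 sampled trees.
freq = sampled_posteriors([1] * 700 + [2] * 250 + [3] * 50)
print(freq)
```

The first function uses only the three maximized likelihoods, which is the limitation named above. The second relies on the sample being large enough that the visit frequencies have settled.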