Ivan Gregor, Lars Steinbrück, Alice C McHardy.
Abstract
Phylogenetic reconstruction is vital to analyzing the evolutionary relationships of genes within and across populations of different species. With next-generation sequencing technologies now producing sets comprising thousands of sequences, robustly identifying the tree topology that is optimal under standard criteria such as maximum parsimony, maximum likelihood or posterior probability is a computationally very demanding task for phylogenetic inference methods. Here, we describe a stochastic search method for a maximum parsimony tree, implemented in a software package we named PTree. Our method is based on a new pattern-based technique that enables us to efficiently infer intermediate sequences whose incorporation into the current tree topology yields a phylogenetic tree with a lower cost. Evaluation across multiple datasets showed that our method is comparable, in terms of topological accuracy and runtime, to the algorithms implemented in PAUP* and TNT, which are widely used by the bioinformatics community. We show that our method can process large-scale datasets of 1,000–8,000 sequences. We believe that our novel pattern-based method enriches the current set of tools and methods for phylogenetic tree inference. The software is available at: http://algbio.cs.uni-duesseldorf.de/webapps/wa-download/.
Keywords: Local search; Maximum parsimony; Phylogeny reconstruction; Stochastic search
Year: 2013 PMID: 23825794 PMCID: PMC3698465 DOI: 10.7717/peerj.89
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Local tree topology.
A is an internal node, V0 is its parent, V1…Vn are its children, and S represents the corresponding substitution sets.
Figure 2. Repeated substitution pattern example.
(A) depicts a local tree topology where node A is an internal node that represents DNA sequence (A C A C A), V0 its parent, V1…V3 its children, and S0…S3 are corresponding substitution sets. The repeated substitution pattern is found at node A since the intersection of at least two substitution sets is non-empty, namely: S0∩S1 = S0∩S3 = S1∩S3 = {C2T} = Y1. Thus, new candidate intermediate node I1 (A T A C A) originates from A (A C A C A) by applying mutations in Y1 = {C2T}. Note that the arrow that goes from node A to its parent V0 does not represent the direction in which the local tree topology is rooted but it serves only for the purpose of the definition of the repeated substitution pattern. (B) depicts an expected local tree topology after intermediate node I1 is added to the tree topology. After I1 is added, the cost of the local tree topology (i.e., the number of substitutions) decreases from 7 to 5.
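The pattern search in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the PTree implementation. The function names are hypothetical, and so are the contents of the substitution sets beyond C2T, which were chosen only so that the local cost starts at 7 as stated in the caption.

```python
from itertools import combinations


def apply_substitutions(seq, subs):
    """Apply substitutions such as 'C2T' (C at 1-based position 2 becomes T)."""
    bases = list(seq)
    for sub in subs:
        old, pos, new = sub[0], int(sub[1:-1]), sub[-1]
        if bases[pos - 1] != old:
            raise ValueError(f"{sub} does not match sequence {seq}")
        bases[pos - 1] = new
    return "".join(bases)


def repeated_patterns(sub_sets):
    """Return every non-empty pairwise intersection of the substitution sets."""
    found = {}
    for (i, a), (j, b) in combinations(enumerate(sub_sets), 2):
        shared = a & b
        if shared:
            found[(i, j)] = shared
    return found


def cost_with_intermediate(sub_sets, pattern):
    """Local cost after neighbours that carry `pattern` are attached to a new node."""
    cost = len(pattern)  # edge between A and the new intermediate node
    for s in sub_sets:
        # Neighbours that share the whole pattern no longer pay for it.
        cost += len(s - pattern) if pattern <= s else len(s)
    return cost


node_a = "ACACA"
# Hypothetical S0..S3: only the shared C2T is taken from the caption.
sub_sets = [{"C2T", "A1G"}, {"C2T", "C4G"}, {"A3T", "A5C"}, {"C2T"}]

patterns = repeated_patterns(sub_sets)           # pairs (0, 1), (0, 3), (1, 3)
y1 = set.intersection(*patterns.values())        # {"C2T"}
intermediate = apply_substitutions(node_a, y1)   # candidate node I1
cost_before = sum(len(s) for s in sub_sets)
cost_after = cost_with_intermediate(sub_sets, y1)
print(intermediate, cost_before, cost_after)
```

With these assumed sets, the three neighbours that share C2T each drop one substitution and the new edge from A to I1 adds one, which reproduces the decrease from 7 to 5 described in the caption.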
Figure 3. Topological accuracy.
Comparison of topological accuracies of selected tree building methods as a function of sequence divergence using 5,000 datasets of 40 taxa with 500 bases each from Guindon & Gascuel (2003). PTree was run with 50 iterations and Jukes–Cantor correction enabled. PAUP* was run with the following settings: Neighbor Joining (NJ), maximum parsimony (MP; with the tree bisection and reconnection (TBR) branch swapping option used in the heuristic search), and maximum likelihood (ML; with the nearest neighbor interchange (NNI) branch swapping option used in the heuristic search). TNT was run with the subtree pruning and regrafting (SPR) branch swapping option used in the heuristic search.
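For readers unfamiliar with the Jukes–Cantor correction mentioned above, the standard formula converts the observed proportion p of differing sites into an estimated distance d = -(3/4) ln(1 - 4p/3). The sketch below shows only that textbook formula. How PTree applies the correction internally is not described here, and the function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = sum(a != b for a, b in zip(seq_a, seq_b))
    return diffs / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p of differences."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at or beyond saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACACA", "ATACA")  # one difference in five sites
d = jukes_cantor(p)
print(p, round(d, 4))
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions at the same site that a raw count cannot see.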
Average parsimony cost comparison using the HIV dataset.
Comparison of average parsimony costs for the Neighbor Joining (NJ) algorithm, PTree, TNT with the subtree pruning and regrafting (SPR) search heuristic, and PAUP* with different search heuristics: nearest neighbor interchange (NNI), SPR, and tree bisection and reconnection (TBR), on different dataset sizes using concatenated sequences of HIV reverse transcriptase and polymerase with ∼1,600 bases each downloaded from EuResist (2010).
| Method / Size of input dataset | 125 | 250 | 500 | 1,000 | 2,000 | 4,000 | 8,000 |
|---|---|---|---|---|---|---|---|
| NJ | 6,298 | 11,789 | 21,911 | 42,383 | 79,567 | 152,546 | 289,472 |
| PAUP* (NNI) | 6,132 | 11,557 | 21,559 | 41,636 | 78,421 | 150,316 | 287,530 |
| PTree | 6,104 | 11,572 | 21,549 | 41,573 | 78,276 | 149,950 | 286,815 |
| TNT (SPR) | 6,078 | 11,487 | 21,401 | 41,250 | 77,556 | 148,527 | 283,921 |
| PAUP* (SPR) | 6,083 | 11,482 | 21,405 | 41,274 | 77,594 | 148,683 | – |
| PAUP* (TBR) | 6,080 | 11,476 | 21,381 | 41,235 | 77,537 | 148,601 | – |
Average parsimony cost comparison using the RAxML dataset.
Comparison of average parsimony costs for selected methods on different-sized datasets with sequence lengths of ∼1,200 bases from Stamatakis, Ludwig & Meier (2005).
| Method / Size of input dataset | 125 | 250 | 500 | 1,000 | 2,000 | 4,000 | 8,000 |
|---|---|---|---|---|---|---|---|
| NJ | 957 | 2,675 | 6,045 | 12,704 | 21,698 | 71,883 | 151,678 |
| PAUP* (NNI) | 946 | 2,650 | 5,970 | 12,489 | 21,188 | 70,391 | 148,718 |
| PTree | 934 | 2,607 | 5,898 | 12,299 | 20,951 | 70,247 | 148,423 |
| TNT (SPR) | 934 | 2,592 | 5,850 | 12,214 | 20,743 | 69,545 | 146,906 |
| PAUP* (SPR) | 934 | 2,597 | 5,852 | 12,203 | 20,718 | 69,508 | – |
| PAUP* (TBR) | 933 | 2,589 | 5,836 | 12,159 | 20,673 | 69,380 | – |
Average runtime comparison using the HIV dataset.
Average runtime comparison for selected methods on different dataset sizes using concatenated sequences of HIV reverse transcriptase and polymerase with ∼1,600 bases each downloaded from EuResist (2010).
| Method / Size of input dataset | 125 | 250 | 500 | 1,000 | 2,000 | 4,000 | 8,000 |
|---|---|---|---|---|---|---|---|
| NJ | 0.2 s | 0.2 s | 0.5 s | 2 s | 9 s | 35 s | 9 m 13 s |
| PAUP* (NNI) | 2.4 s | 11.8 s | 1 m 54 s | 23 m 44 s | 1 h 46 m | 11 h 26 m | 92 h 52 m |
| PTree | 19.4 s | 58.5 s | 3 m 30 s | 14 m 13 s | 1 h 4 m | 4 h 47 m | 23 h 48 m |
| TNT (SPR) | 4 s | 14 s | 1 m 24 s | 6 m 38 s | 41 m 2 s | 6 h 34 m | 33 h 38 m |
| PAUP* (SPR) | 27.7 s | 1 m 30 s | 19 m 35 s | 2 h 44 m | 22 h | 155 h 17 m | – |
| PAUP* (TBR) | 27.7 s | 4 m 33 s | 28 m 43 s | 5 h 2 m | 35 h 49 m | 190 h 31 m | – |
Average runtime comparison using the RAxML dataset.
Average runtime comparison for selected methods on different-sized datasets with sequence lengths of ∼1,200 bases from Stamatakis, Ludwig & Meier (2005).
| Method / Size of input dataset | 125 | 250 | 500 | 1,000 | 2,000 | 4,000 | 8,000 |
|---|---|---|---|---|---|---|---|
| NJ | 0.1 s | 0.1 s | 0.1 s | 1 s | 8 s | 1 m | 8 m 28 s |
| PAUP* (NNI) | 1.4 s | 11.4 s | 1 m 30 s | 17 m 7 s | 2 h 51 m | 28 h 33 m | 139 h 49 m |
| PTree | 7.5 s | 38.5 s | 2 m 14 s | 6 m 43 s | 23 m 56 s | 2 h 10 m | 11 h 10 m |
| TNT (SPR) | 0.7 s | 3 s | 14 s | 51 s | 5 m 23 s | 57 m 57 s | 4 h 48 m |
| PAUP* (SPR) | 17.7 s | 2 m 44 s | 19 m 51 s | 2 h 5 m | 60 h 23 m | >1 month | – |
| PAUP* (TBR) | 41.2 s | 4 m 56 s | 39 m 22 s | 4 h 38 m | 252 h 42 m | >1 month | – |