Michael Nute1, Tandy Warnow2,3,4,5.
Abstract
BACKGROUND: Multiple sequence alignment is an important task in bioinformatics, and alignments of large datasets containing hundreds or thousands of sequences are increasingly of interest. While many alignment methods exist, the most accurate alignments are likely to be based on stochastic models where sequences evolve down a tree with substitutions, insertions, and deletions. While some methods have been developed to estimate alignments under these stochastic models, only the Bayesian method BAli-Phy has been able to run on even moderately large datasets, containing 100 or so sequences. A technique to extend BAli-Phy to enable alignments of thousands of sequences could potentially improve alignment and phylogenetic tree accuracy on large-scale data beyond the best-known methods today.
Keywords: Boosting; MCMC; Multiple sequence alignment
Year: 2016 PMID: 28185555 PMCID: PMC5123300 DOI: 10.1186/s12864-016-3101-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary statistics for true alignments on 1,000-sequence data
| Data | Sites | p-distance Avg | p-distance Max | Gaps/Seq | Gap Length | % Blank |
|---|---|---|---|---|---|---|
| RNAsim | 4806 | 41 % | 61 % | 1036 | 3.1 | 68 % |
| Indelible M2 | 2179 | 67 % | 74 % | 210 | 5.6 | 54 % |
| Rose L1 | 3777 | 70 % | 77 % | 209 | 13.2 | 73 % |
| Rose M1 | 3934 | 70 % | 77 % | 294 | 9.9 | 74 % |
| Rose S1 | 2106 | 69 % | 77 % | 285 | 3.9 | 52 % |
The p-distance is the normalized pairwise Hamming distance. Numbers shown are averages over 10 replicates
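The p-distance in the note above can be computed directly from pairs of aligned rows. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and skipping columns where either sequence has a gap is an assumption about how the normalization is applied.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of compared columns at which two aligned rows differ.

    Assumption: columns where either row has a gap are not compared.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def p_distance_summary(rows):
    """Average and maximum p-distance over all pairs of rows."""
    dists = [p_distance(a, b) for a, b in combinations(rows, 2)]
    return sum(dists) / len(dists), max(dists)


# Toy alignment, for illustration only (not data from the study).
toy_alignment = ["ACGT", "ACGA", "TCGA"]
average, maximum = p_distance_summary(toy_alignment)
```

The "Avg" and "Max" columns of the table correspond to the two values returned by `p_distance_summary`, averaged over replicates.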
Alignment and tree accuracy metrics for all methods on 1,000 sequences
| Data | Method | Prec. | Rec. | TC | Delta-RF (RAxML) | Delta-RF (FT-2) |
|---|---|---|---|---|---|---|
| Indelible M2 | P(default) | 95.1 % | 94.6 % | 4.5 % | 1.86 % | 0.68 % |
| Indelible M2 | P+BAli-Phy | | | | | |
| Indelible M2 | P+MAFFT-L | 97.2 % | 97.0 % | 6.8 % | 0.75 % | -0.20 % |
| Indelible M2 | MAFFT-L | 80.2 % | 75.0 % | 1.4 % | 15.73 % | 8.74 % |
| Indelible M2 | MAFFT-def | 1.0 % | 0.4 % | 0.0 % | (not run) | (not run) |
| RNAsim | P(default) | 90.3 % | 90.4 % | 3.5 % | 0.56 % | |
| RNAsim | P+BAli-Phy | | | | 0.70 % | 0.42 % |
| RNAsim | P+MAFFT-L | 88.8 % | 89.0 % | 3.9 % | | 0.45 % |
| RNAsim | MAFFT-L | 91.8 % | 91.5 % | 2.9 % | 0.73 % | 6.47 % |
| RNAsim | MAFFT-def | 83.7 % | 71.5 % | 1.4 % | (not run) | (not run) |
| Rose L1 | P(default) | 90.9 % | 90.6 % | 15.9 % | 2.07 % | 2.24 % |
| Rose L1 | P+BAli-Phy | | | | | |
| Rose L1 | P+MAFFT-L | 90.0 % | 89.8 % | 21.8 % | 1.98 % | 2.00 % |
| Rose L1 | MAFFT-L | 84.1 % | 76.6 % | 6.4 % | 3.45 % | 3.15 % |
| Rose L1 | MAFFT-def | 1.1 % | 0.4 % | 0.0 % | (not run) | (not run) |
| Rose M1 | P(default) | 79.7 % | 79.0 % | 9.0 % | 5.35 % | 6.26 % |
| Rose M1 | P+BAli-Phy | | | | 4.70 % | 5.45 % |
| Rose M1 | P+MAFFT-L | 78.6 % | 78.2 % | 12.9 % | 5.96 % | 5.89 % |
| Rose M1 | MAFFT-L | 74.9 % | 63.3 % | 3.0 % | | |
| Rose M1 | MAFFT-def | 1.2 % | 0.5 % | 0.0 % | (not run) | (not run) |
| Rose S1 | P(default) | | | 2.8 % | 3.94 % | 4.29 % |
| Rose S1 | P+BAli-Phy | 84.3 % | 84.3 % | | | |
| Rose S1 | P+MAFFT-L | 83.5 % | 83.3 % | 4.8 % | 3.55 % | 4.38 % |
| Rose S1 | MAFFT-L | 76.2 % | 68.2 % | 0.5 % | 3.80 % | 3.79 % |
| Rose S1 | MAFFT-def | 1.2 % | 0.5 % | 0.0 % | (not run) | (not run) |
Note that precision, recall and TC are accuracy metrics (so larger is better) but Delta-RF is an error metric (so smaller is better). Metrics are averages over 10 replicates. Method names have been shortened slightly for space: P(default) refers to PASTA(default), P+(...) is shorthand for PASTA+(...), MAFFT-def refers to default MAFFT, and MAFFT-L refers to MAFFT L-INS-i. Bold numbers indicate best performing method
Fig. 1 Results on 1,000-sequence datasets, comparing default PASTA and PASTA+BAli-Phy. Each point represents one replicate. PASTA denotes the alignment from PASTA under default settings (referred to as “PASTA(default)” in the text), and PASTA+BAli-Phy denotes the alignment after an additional iteration using BAli-Phy. Delta-RF refers to the difference between the RF error rates of ML trees computed on the estimated and true alignments. In each subfigure, a position above the 45-degree line indicates that PASTA+BAli-Phy is preferable; the axes for the subfigure for Delta-RF have been flipped to maintain this interpretation, since Delta-RF is an error metric rather than an accuracy metric
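The Delta-RF quantity described in the caption can be sketched as a difference of two RF error rates measured against the same reference tree. This is an illustrative sketch only: representing each tree as a set of bipartitions, normalizing by the total number of bipartitions in both trees, and all function names are assumptions rather than details taken from the source.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes a fixed anchor taxon."""
    side = frozenset(side)
    anchor = min(taxa)
    return frozenset(taxa) - side if anchor in side else side


def rf_error_rate(tree_splits, reference_splits):
    """Normalized RF error: splits found in exactly one tree over all splits.

    Assumption: normalization by the summed split counts of both trees.
    """
    tree, ref = set(tree_splits), set(reference_splits)
    total = len(tree) + len(ref)
    if total == 0:
        return 0.0
    return len(tree ^ ref) / total


def delta_rf(splits_from_estimated_aln, splits_from_true_aln, reference_splits):
    """RF error of the tree on the estimated alignment minus that on the true alignment."""
    return (rf_error_rate(splits_from_estimated_aln, reference_splits)
            - rf_error_rate(splits_from_true_aln, reference_splits))


# Toy five-taxon example, for illustration only (not data from the study).
taxa = "ABCDE"
reference = {canonical_split("AB", taxa), canonical_split("ABC", taxa)}
tree_on_true_alignment = set(reference)
tree_on_estimated_alignment = {canonical_split("AB", taxa),
                               canonical_split("ABD", taxa)}
toy_delta = delta_rf(tree_on_estimated_alignment, tree_on_true_alignment, reference)
```

Under this definition a negative Delta-RF, such as the -0.20 % entry in the 1,000-sequence table, means the tree built on the estimated alignment was closer to the reference tree than the tree built on the true alignment.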
Alignment and tree accuracy metrics for UPP alignments on 10,000 sequences
| Data | Backbone | Prec. | Rec. | TC | Delta-RF (FT-2) |
|---|---|---|---|---|---|
| Indelible | P(default) | 96.2 % | 93.6 % | 2.6 % | 0.77 % |
| Indelible | P+BAli-Phy | | | | |
| Indelible | P+MAFFT-L | 97.3 % | 95.0 % | 3.2 % | 0.62 % |
| RNAsim | P(default) | 90.8 % | 90.5 % | 0.5 % | 0.77 % |
| RNAsim | P+BAli-Phy | | | | |
| RNAsim | P+MAFFT-L | 89.4 % | 89.1 % | 0.5 % | |
Each method shown under Backbone is the method used to align the backbone of 1,000 sequences. Due to the running time required for RAxML on data of this size, Δ-RF shown is for FastTree-2 only. Bold numbers indicate best performing method
Fig. 2 Results on 10,000 sequences, using UPP with two different backbones: one computed using default PASTA and the other computed using PASTA+BAli-Phy (i.e., one iteration of PASTA using BAli-Phy as the subset aligner after default PASTA completes). Each point represents one replicate. Delta-RF refers to the difference between the RF error rates of ML trees computed on the estimated and true alignments. In each subfigure, a position above the 45-degree line indicates that PASTA+BAli-Phy is preferable; the axes for the subfigure for Delta-RF have been flipped to maintain this interpretation, since Delta-RF is an error metric rather than an accuracy metric
P-values for each model condition and metric for the hypothesis test that P+BAli-Phy outperforms P(default) with respect to alignment accuracy
| Data | Precision | Recall | TC |
|---|---|---|---|
| Indelible M2 | | | |
| RNAsim | | | |
| Rose L1 | 0.211 | 0.188 | |
| Rose M1 | 0.473 | 0.298 | |
| Rose S1 | 0.820 | 0.770 | |
Values are based on a one-sided Student’s t-test for differences between the two methods on each replicate. Bolded values indicate significant differences using a Benjamini-Hochberg procedure to control the false discovery rate at 5 % [33]
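The Benjamini-Hochberg step mentioned in the note can be sketched as follows. This is a generic implementation of the standard step-up procedure, not the authors' script; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, fdr=0.05):
    """Return one boolean per p-value: True if rejected at the given FDR.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= (k / m) * fdr, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only (not values from the table).
example_pvalues = [0.001, 0.03, 0.035, 0.9]
significant = benjamini_hochberg(example_pvalues, fdr=0.05)
```

In this example the second p-value (0.03) exceeds its own threshold of 0.025 but is still rejected, because the third (0.035) falls below its threshold of 0.0375. This is why the procedure is applied to all p-values jointly instead of comparing each one to a fixed cutoff.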
P-values for each model condition and metric for the hypothesis test that P+BAli-Phy outperforms P(default) with respect to tree accuracy
| Data | Delta-RF (RAxML) | Delta-RF (FastTree-2) |
|---|---|---|
| Indelible M2 | | |
| RNAsim | 0.677 | 0.660 |
| Rose L1 | 0.036 | 0.054 |
| Rose M1 | 0.136 | |
| Rose S1 | | |
Values are based on a one-sided Student’s t-test for differences between the two methods on each replicate. Bolded values indicate significant differences using a Benjamini-Hochberg procedure to control the false discovery rate at 5 % [33]
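The per-replicate test in the note can be read as a paired comparison, although that reading is an assumption: the source says only that the test uses differences between the two methods on each replicate. The sketch below computes the t statistic and degrees of freedom under that reading; the function name and the numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(method_a, method_b):
    """t statistic and degrees of freedom for per-replicate differences a - b.

    Assumes a paired design: entry i of both lists comes from replicate i.
    """
    if len(method_a) != len(method_b):
        raise ValueError("both methods need one value per replicate")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two replicates are required")
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical per-replicate scores, for illustration only.
t_value, degrees_of_freedom = paired_t_statistic([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Converting the statistic to a one-sided p-value requires the Student t distribution, which the Python standard library does not provide. A statistics package such as SciPy would be one option. For an error metric such as Delta-RF, the direction of the difference would be reversed, since smaller values are better.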