| Literature DB >> 19478997 |
Robert K Bradley1, Adam Roberts, Michael Smoot, Sudeep Juvekar, Jaeyoung Do, Colin Dewey, Ian Holmes, Lior Pachter.
Abstract
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment--previously available only with alignment programs which use computationally-expensive Markov Chain Monte Carlo approaches--yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.Entities:
Mesh:
Year: 2009 PMID: 19478997 PMCID: PMC2684580 DOI: 10.1371/journal.pcbi.1000392
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1Overview of the components constituting the FSA alignment program.
The algorithms that are used in each component are highlighted in the accompanying boxes. The bold arrows show the simplest mode of use for FSA, where posterior probabilities are calculated directly using default parameters for all pairs of sequences and the optional steps of anchor finding and iterative refinement are omitted.
Figure 2The default Pair HMM used by FSA.
By default FSA uses a Pair HMM with two sets of Insert (I) and Delete (D) states to generate a two-component geometric mixture distribution. FSA can optionally use a three-state HMM, which has only one set of Insert and Delete states. M is a Match state emitting aligned characters.
Figure 3Two alignments (left and right) which make the same homology statements and therefore are both represented by the same POSET (center).
“The mathematics of distance-based alignment” in Text S1 discusses this view of alignments as POSETs. The alignment on the right minimizes the number of gap-open events and as such is appropriate for analyses such as inferring parsimonious indel frequencies across a clade. Alignments are displayed with TeXshade [63].
Figure 4Schematic overview of FSA's parallelization strategy on a computer cluster.
For large input sizes, a disk-based database may be used to store some of the primary data structures and reduce memory usage.
Figure 5The Java GUI allows users to visualize the estimated alignment accuracy under FSA's statistical model.
FSA's alignment is colored according the expected accuracy under FSA's statistical model (top) as well as according to the “true” accuracy (bottom) given from a comparison between FSA's alignment and the reference structural alignment. It is clear from inspection that accuracies estimated under FSA's statistical model correspond closely to the true accuracies. Sequences are from alignment BBS12030 in the RV12 dataset of BAliBASE 3 [24].
Benchmarks against protein structural databases.
| Program | BAliBASE 3 | BAliBASE 3+fp | SABmark 1.65 |
| (Acc/Sn/PPV) | (Acc/Sn/PPV) | (Acc/Sn/PPV) | |
| AMAP | 0.70/0.62/0.83 | 0.73/0.61/0.80 | 0.57/0.43/0.46 |
| ClustalW | 0.66/0.63/0.62 | 0.59/0.63/0.53 | 0.38/0.44/0.30 |
| DIALIGN | 0.68/0.63/0.68 | 0.68/0.62/0.63 | 0.48/0.41/0.34 |
| FSA | 0.71/0.62/0.85 |
|
|
| FSA (–maxsn) | 0.73/0.68/0.76 | 0.74/0.68/0.72 | 0.52/0.45/0.39 |
| MAFFT | 0.74/0.71/0.71 | 0.68/0.71/0.61 | 0.44/0.49/0.35 |
| MUMMALS | 0.74/0.70/0.73 | 0.69/0.70/0.64 | 0.49/0.52/0.38 |
| MUSCLE | 0.70/0.67/0.66 | 0.63/0.66/0.57 | 0.40/0.46/0.32 |
| Probalign |
| 0.71/0.71/0.65 | 0.49/0.50/0.37 |
| ProbCons | 0.74/0.70/0.72 | 0.69/0.70/0.64 | 0.47/0.50/0.37 |
| T-Coffee | 0.72/0.67/0.71 | 0.67/0.67/0.63 | 0.45/0.46/0.35 |
| SeqAn::T-Coffee | 0.73/0.69/0.70 | 0.67/0.69/0.61 | 0.43/0.47/0.34 |
Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on the BAliBASE 3 [24] and SABmark 1.65 [25] databases. Probalign has the highest accuracy on the commonly-used BAliBASE 3 dataset and FSA in default mode has superior accuracy on the BAliBASE 3+fp and SABmark 1.65 datasets (note that only FSA and AMAP explicitly attempt to maximize the expected accuracy). FSA has higher positive predictive values than any other program on all datasets and can additionally achieve high sensitivity when run in maximum-sensitivity mode. The BAliBASE 3+fp dataset, which mirrors BAliBASE 3 but includes a single non-homologous sequence in each alignment, was designed to test the robustness of alignment programs to incomplete homology. Traditional alignment programs, designed to maximize sensitivity, suffer greatly-increased mis-alignment when even a single non-homologous sequence is introduced; in contrast, FSA is robust to the non-homologous sequence and has an unchanged positive predictive value. Remarkably, FSA was the only tested program with a mis-alignment rate of <50% on the SABmark 1.65 dataset; the majority of the homology statements made by other programs were incorrect. Because the SABmark 1.65 dataset contains many sequences of only partial or even no homology, a method such as FSA which is robust to non-homologous sequence performs better under our accuracy criterion than a program such as MUMMALS despite the fact that MUMMALS has significantly-higher sensitivity on this dataset. The BAliBASE 3 dataset consisted of full-length sequences in all reference sets RV11, RV12, RV20, RV30, RV40 and RV50; we created the BAliBASE 3+fp dataset from the same reference sets by adding a single false-positive, a random sequence, to each alignment. The SABmark 1.65 dataset consisted of the Twilight Zone and Superfamilies datasets.
Benchmarks against RNA structural databases.
| Program | BRAliBase 2.1 | Consan mix80 |
| (Acc/Sn/PPV) | (Acc/Sn/PPV) | |
| ClustalW | 0.85/0.86/0.86 | 0.65/0.65/0.68 |
| DIALIGN | 0.82/0.83/0.85 | 0.76/0.75/0.82 |
| FSA | 0.90/0.91/0.94 | 0.77/0.74/0.92 |
| FSA (–maxsn) |
|
|
| MAFFT | 0.90/0.91/0.91 | 0.77/0.78/0.77 |
| MUSCLE | 0.90/0.91/0.90 | 0.74/0.76/0.74 |
| ProbConsRNA | 0.91/0.92/0.92 | (failed to align) |
| T-Coffee | 0.81/0.82/0.84 | 0.38/0.33/0.40 |
| SeqAn::T-Coffee | 0.89/0.90/0.90 | (failed to align) |
Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on the BRAliBase 2.1 dataset of small RNAs [26] and the Consan mix80 dataset of Small and Large Subunit ribosomal RNAs [27]. The BRAliBase 2.1 dataset consisted of all alignments with 15 sequences (the largest alignments). The mix80 dataset provided difficult alignment problems: The four alignments each contain from 107 to 254 sequences of approximately 1–4 kilobases in length, with average percentage identity less than <50%. Two program, ProbConsRNA and SeqAn::T-Coffee, were incapable of aligning these large datasets. When run in –fast mode, FSA considers only a subset (∼20% in this case) of all sequence pairs. Note that because the mix80 dataset consists of long sequences, FSA automatically uses anchoring for speed. FSA does not use anchoring on the short sequences of BRAliBase 2.1.
Benchmarks against simulated mammalian and fly genomic DNA.
| Program | Blanchette et al. | DAWG | simgenome |
| (Acc/Sn/PPV) | (Acc/Sn/PPV) | (Acc/Sn/PPV) | |
| CHAOS/DIALIGN | 0.58/0.44/0.74 | 0.72/0.46/0.43 | 0.62/0.67/0.59 |
| DIALIGN-TX | 0.73/0.68/0.77 | 0.72/0.51/0.44 | 0.64/0.68/0.61 |
| FSA (–exonerate) | 0.86/0.82/0.93 |
|
|
| FSA (–exonerate –maxsn) | 0.87/0.85/0.90 | 0.75/0.41/0.50 | 0.76/0.79/0.77 |
| MAVID | 0.57/0.45/0.68 | 0.66/0.36/0.32 | 0.72/0.77/0.72 |
| MLAGAN | 0.70/0.63/0.80 | 0.45/0.39/0.19 | 0.71/0.71/0.73 |
| Pecan |
| 0.77/0.48/0.53 | 0.78/0.81/0.78 |
| TBA | 0.83/0.81/0.87 | 0.80/0.32/0.75 | 0.74/0.79/0.72 |
Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA and other alignment methods on simulated alignments of mammalian and Drosophila DNA. The simulated alignments of nonfunctional DNA sequences (“Blanchette et al.”) from nine mammals (human, chimp, baboon, mouse, rat, cat, dog, cow, and pig) were produced by [28]. Simulated alignments of nonfunctional (“DAWG ”) and functional as well as nonfunctional (“simgenome ”) DNA sequences from the twelve species of Drosophila described in [43] were produced with the DAWG [29] and simgenome [30] programs as described in [30] (both were parametrized based on Pecan alignments of Drosophila whole-genome alignments). Three of the simgenome alignments contained sequences of length zero and were removed from this analysis. FSA was run with the –exonerate option to use both anchors from the exonerate program as well as MUMs from MUMmer. FSA had the highest accuracy on the two simulated Drosophila datasets and only Pecan had higher accuracy on the mammalian dataset. Pecan consistently produced the most-sensitive aligments.
Benchmarks against simulated unrelated protein and DNA sequences.
| Program | Protein | DNA |
| AMAP | 14% | n/a |
| ClustalW | 97% | 95% |
| DIALIGN | 24% | 17% |
| FSA |
|
|
| FSA (–maxsn) | 21% | 17% |
| MAFFT | 83% | 93% |
| MUMMALS | 63% | n/a |
| MUSCLE | 89% | 80% |
| Probalign | 44% | n/a |
| ProbCons | 51% | 77% |
| T-Coffee | 63% | 75% |
| SeqAn::T-Coffee | 74% | 78% |
Large-scale random sequence tests indicate that for most alignment programs, aligned sequences are not necessarily homologous (table shows the fraction of random sequence aligned, calculated by taking a sum-of-pairs over pairwise alignments). Even when run in maximum-sensitivity mode (–maxsn), FSA aligned only a small fraction of the random sequence. We generated 50 datasets, each with 10 random sequences, and ran all programs with default parameters. Protein sequences were 300 aa in length and DNA sequences were 1,000 nt in length. Results reported for ProbCons on DNA sequences were obtained with ProbConsRNA.
Benchmarks against simulated unrelated genomic DNA.
| Program | Genomic DNA |
| CHAOS/DIALIGN | 10% |
| ClustalW | 96% |
| DIALIGN-TX | 20% |
| FSA (–exonerate) | 1% |
| FSA (–exonerate –maxsn) | 4% |
| MAVID | 17% |
| MLAGAN | 30% |
| Pecan | 1% |
| TBA |
|
Large-scale random sequence tests for genomic alignment programs. As in Table 4, table entries are the fraction of random sequence aligned, calculated by taking a sum-of-pairs over pairwise alignments. FSA aligns a small fraction of random genomic sequence in both its default and maximium-sensitivity (–maxsn) modes. TBA did not align a single base in these tests and was thus the best performer. As the three best-performing programs in this test, TBA, Pecan and FSA –exonerate, all use inexact sequence matches as anchors, the relative performance of these three programs can be explained by the stringency of the anchoring thresholds used: TBA uses the highest threshold by default, Pecan the next-highest and FSA the lowest. All three of these programs show good base-level specificity on the simulated alignments of Table 3, for which TBA has the highest specificity on one dataset and FSA on two. The random sequence tests consisted of 50 datasets, each with 10 random DNA sequences (uniform base distribution) of length 50 kb. All programs were run with default parameters. For genomic aligners that required a phylogenetic tree, we used the guide tree computed by ClustalW (rooted via the midpoint algorithm of the PHYLIP [64] retree program).
Comparisons of alignments obtained in codon and amino acid space.
| Program | Alignment similarity (average) |
| ClustalW | 0.914 |
| DIALIGN | 0.912 |
| FSA |
|
| FSA (–noanchored) |
|
| MAFFT | 0.932 |
| MUSCLE | 0.915 |
| ProbCons | 0.902 |
| T-Coffee | 0.897 |
| SeqAn::T-Coffee | 0.905 |
We assessed the concordance between alignments obtained in nucleotide and amino acid space by aligning all 1,502 genes in Saccharomyces cerevisiae which have orthologs in the six related yeast species S. paradoxus, S. mikatae, S. kudriavzevii, S. bayanus, S. castellii, and S. kluyveri (this dataset was analyzed in [5]). Alignments produced by FSA, in both anchored and unanchored (–noanchored) modes, had the highest concordance. Alignment similarity between alignments computed in nucleotide and amino acid space was assessed by converting the amino acid alignment to the implied nucleotide alignment and computing the alignment similarity (the proportion of identical homology statements made by the alignments; see Text S1, “The mathematics of distance-based alignment” for details) between them. Alignments for ProbCons on nucleotide sequences were obtained with ProbConsRNA.
Ablation analysis of FSA on protein structural databases.
| FSA options | BAliBASE 3 | BAliBASE 3+fp | SABmark 1.65 |
| (Acc/Sn/PPV) | (Acc/Sn/PPV) | (Acc/Sn/PPV) | |
| (default) | 0.71/0.62/0.85 |
|
|
| –fast | 0.70/0.61/0.85 | 0.74/0.62/0.84 | 0.59/0.37/0.52 |
| –nolearn | 0.72/0.65/0.81 | 0.75/0.65/0.79 | 0.56/0.44/0.44 |
| –refinement 0 | 0.70/0.61/0.85 | 0.74/0.61/0.84 | 0.59/0.37/0.52 |
| –noindel2 | 0.70/0.61/0.85 | 0.74/0.60/0.84 | 0.59/0.38/0.52 |
| –maxsn |
| 0.74/0.68/0.72 | 0.52/0.45/0.39 |
| –fast –maxsn | 0.73/0.67/0.76 | 0.73/0.67/0.71 | 0.52/0.44/0.39 |
| –nolearn –maxsn | 0.73/0.68/0.74 | 0.70/0.68/0.67 | 0.49/0.47/0.37 |
| –refinement 0 –maxsn | 0.72/0.66/0.78 | 0.73/0.66/0.73 | 0.53/0.43/0.39 |
| –noindel2 –maxsn | 0.73/0.68/0.76 | 0.72/0.68/0.70 | 0.51/0.45/0.39 |
Ablation analysis of FSA on the protein benchmarks of Table 1 : Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA with different components enabled or disabled. From top to bottom, FSA was run in default mode, –fast mode, with learning disabled, with iterative refinement disabled, and with 1 set (rather than 2 sets) of indel states; these options were then repeated for maximum-sensitivity mode (–maxsn). As made evident by the results (PPV) on the BAliBASE 3+fp and SABmark 1.65 datasets, query-specific learning helps FSA to distinguish homologous and non-homologous sequences. The above figures understate the utility of iterative refinement: while it generally has little effect on these small protein alignments, it occasionally dramatically reduces the number of small gaps and thereby improves the alignment accuracy.
Ablation analysis of FSA on RNA structural databases.
| FSA options | BRAliBase 2.1 | FSA options | Consan mix80 |
| (Acc/Sn/PPV) | (Acc/Sn/PPV) | ||
| (default) | 0.90/0.91/0.94 | ||
| –fast | 0.90/0.91/0.94 | –fast | 0.77/0.74/0.92 |
| –nolearn | 0.91/0.92/0.93 | –nolearn –fast | 0.77/0.74/0.93 |
| –refinement 0 | 0.90/0.91/0.93 | –refinement 0 –fast | 0.73/0.69/0.94 |
| –noindel2 | 0.91/0.92/0.93 | –noindel2 –fast | 0.73/0.69/0.91 |
| –noanchored –fast | 0.77/0.74/0.93 | ||
| –maxsn |
| ||
| –fast –maxsn | 0.91/0.92/0.92 | –fast –maxsn | 0.78/0.78/0.86 |
| –nolearn –maxsn | 0.91/0.92/0.92 | –nolearn –fast –maxsn | 0.78/0.78/0.85 |
| –refinement 0 –maxsn | 0.90/0.91/0.93 | –refinement 0 –fast –maxsn | 0.74/0.70/0.92 |
| –noindel2 –maxsn | 0.91/0.92/0.92 | –noindel2 –fast –maxsn | 0.74/0.73/0.84 |
| –noanchored –fast –maxsn |
|
Ablation analysis of FSA on the RNA benchmarks of Table 2 : Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA with different components enabled or disabled. From top to bottom, FSA was run in default mode, –fast mode, with learning disabled, with iterative refinement disabled, with 1 set (rather than 2 sets) of indel states, and with anchored disabled; these options were then repeated for maximum-sensitivity mode (–maxsn). Iterative refinement is important for the large alignments of the mix80 dataset.
Ablation analysis of FSA on simulated mammalian genomic DNA.
| FSA options | Blanchette et al. |
| (Acc/Sn/PPV) | |
| (default) | 0.53/0.32/0.93 |
| –exonerate | 0.83/0.77/0.94 |
| –exonerate –minscore 50 |
|
| –exonerate –refinement 0 | 0.82/0.76/0.93 |
| –exonerate –noindel2 | 0.78/0.72/0.94 |
Ablation analysis of FSA on the simulated mammalian DNA of Table 3 : Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA with different components enabled or disabled. We tested the effectiveness of components related to anchor annealing for aligning long sequences, including using anchors from MUMmer and exonerate and changing the minimum acceptable score for an exonerate anchor (the default is –minscore 100). These results clearly show that while using only MUMs for anchoring (the default mode) gives a high positive predictive value, inexact matches must be used to obtain high sensitivity on very long or distant nonfunctional sequences lacking the local constraints which give rise to MUMs across species in functional (e.g., coding) sequence.
Ablation analysis of FSA on simulated unrelated protein and DNA sequences.
| FSA options | Protein | DNA |
| (default) | 4% | 5% |
| –fast | 4% | 5% |
| –nolearn | 13% | 8% |
| –refinement 0 |
|
|
| –noindel2 | 5% | 10% |
| –maxsn | 21% | 17% |
| –fast –maxsn | 22% | 17% |
| –nolearn –maxsn | 30% | 16% |
| –refinement 0 –maxsn | 19% | 15% |
| –noindel2 –maxsn | 27% | 21% |
Ablation analysis of FSA on the unrelated sequence benchmarks of Table 4 : Comparisons of the accuracies (Acc), sensitivities (Sn) and positive predictive values (PPV) of FSA with different components enabled or disabled. From top to bottom, FSA was run in default mode, –fast mode, with learning disabled, with iterative refinement disabled, and with 1 set (rather than 2 sets) of indel states; these options were then repeated for maximum-sensitivity mode (–maxsn). Query-specific learning helps to make FSA robust to non-homologous sequence.
Timing comparison of FSA and other methods on 16S sequences.
| Program | 100 | 200 | 300 | 400 | 500 seqs |
| ClustalW | 1,194 s | 4,147 s | 9,110 s | 16,187 s | 27,755 s |
| DIALIGN | 4,346 s | 19,449 s | 49,388 s | (fail) | (fail) |
| FSA –fast | 1,513 s | 3754 s | 5,641 s | 9,767 s | 15,683 s |
| FSA –fast –noindel2 –refinement 0 | 638 s | 1,495 s | 2,467 s | 3,604 s | 5,154 s |
| MAFFT |
|
|
|
|
|
| MUSCLE | 351 s | 1,235 s | 1,516 s | 4,384 s | 7,552 s |
| ProbConsRNA | 16,319 s | (fail) | (fail) | (fail) | (fail) |
| T-Coffee | 1,362 s | 3,666 s | 7,880 s | 15,254 s | 22,085 s |
| SeqAn::T-Coffee | 3,024 s | (fail) | (fail) | (fail) | (fail) |
Comparison of runtimes of FSA and other alignment methods when aligning 16S ribosomal sequences. MAFFT was faster than any other method by an order of magnitude; the next-fastest programs were MUSCLE and FSA. FSA can be made substantially faster by using a 3-state, rather than the default 5-state, HMM (with little loss of accuracy; see Table 8) and disabling iterative refinement. MAFFT was run with the –auto option, which presumably triggered a faster alignment mode on the 500 sequence dataset than was used for the datasets with fewer sequences. The designation “(fail)” means that a programs failed to align a dataset (generally due to out-of-memory errors). Timing results are from computers with 2.40 GHz CPUs and 2 GB of RAM. 16S sequences were obtained as a random slice of prokMSA from Greengenes [65] and had an average length of 1,450 nt.
Timing comparison of FSA in regular and parallelized modes.
| FSA options | 100 | 200 | 300 | 500 | 1,000 seqs |
| FSA | 6,407 s | 27,534 s | — | — | — |
| FSA –parallelize 10 | 819 s | 5,713 s | 22,113 s | — | — |
| FSA –fast | 1,650 s | 3,781 s | 6,207 s | 12,249 s | — |
| FSA –fast –parallelize 10 | 201 s | 513 s | 924 s | 2,511 s | 15,179 s |
Runtimes for FSA in regular, –fast and –parallelize modes when aligning the 16S sequences of Table 11 sequences in unanchored mode (–noanchored) with a 3-state HMM (–noindel2) and refinement disabled (–refinement 0). When running in –fast mode on a cluster with 10 processors (3.00 and 3.20 GHz; 8 GB of RAM), FSA can align 500 16S sequences in 20% of the time required without parallelization. The parallelized FSA was run on a cluster managed by the Condor batch queueing system [66]; nodes were connected by a 100 Mbps Ethernet network. Note that these runtimes are much slower than users can expect from default FSA usage, which uses anchoring for speed (Table 11); we used unanchored mode to make clear the benefits of parallelization.
Timing comparison of FSA in parallelized mode with different numbers of processors.
| 1 | 5 | 10 | 15 | 20 processors | |
| 100 seqs | 1,650 s | 365 s | 214 s | 135 s | 105 s |
| 200 seqs | 3,781 s | 889 s | 506 s | 385 s | 355 s |
Runtimes for FSA in –fast –parallelize P mode as a function of the number of processors P in the computer cluster with a 3-state HMM (–noindel2) and refinement disabled (–refinement 0). Sequences and cluster specifications are same as for Table 12.