Raffi Hagopian, John R Davidson, Ruchira S Datta, Bushra Samad, Glen R Jarvis, Kimmen Sjölander.
Abstract
We present the jump-start simultaneous alignment and tree construction using hidden Markov models (SATCHMO-JS) web server for simultaneous estimation of protein multiple sequence alignments (MSAs) and phylogenetic trees. The server takes as input a set of sequences in FASTA format, and outputs a phylogenetic tree and MSA; these can be viewed online or downloaded from the website. SATCHMO-JS is an extension of the SATCHMO algorithm, and employs a divide-and-conquer strategy to jump-start SATCHMO at a higher point in the phylogenetic tree, reducing the computational complexity of the progressive all-versus-all HMM-HMM scoring and alignment. Results on 983 structurally aligned pairs from the PREFAB benchmark dataset show that SATCHMO-JS provides a statistically significant improvement in alignment accuracy over MUSCLE, Multiple Alignment using Fast Fourier Transform (MAFFT), ClustalW and the original SATCHMO algorithm. The SATCHMO-JS web server is available at http://phylogenomics.berkeley.edu/satchmo-js. The datasets used in these experiments are available for download at http://phylogenomics.berkeley.edu/satchmo-js/supplementary/.
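To make the complexity argument concrete, the sketch below counts HMM-HMM scorings under a deliberately simple cost model: every pair of nodes is scored once up front, and after each merge the new node is scored against all remaining nodes. This model, the function name `scoring_count` and the example sizes are assumptions chosen for illustration; they are not taken from the paper, and the real server's cost also includes the work needed to build the starting subtrees.

```python
def scoring_count(n_nodes):
    """HMM-HMM scorings for progressive agglomeration starting from n_nodes.

    Assumed cost model (illustrative only): score every pair once up front,
    then after each merge score the new node against every other remaining
    node.
    """
    if n_nodes < 2:
        return 0
    initial = n_nodes * (n_nodes - 1) // 2
    # A merge among m nodes yields one new node and m - 2 others to score.
    after_merges = sum(m - 2 for m in range(2, n_nodes + 1))
    return initial + after_merges


n_sequences = 500  # hypothetical input size
n_subtrees = 20    # hypothetical number of pre-aligned subtrees

print(scoring_count(n_sequences))  # start from individual sequences
print(scoring_count(n_subtrees))   # jump-start from subtree-level nodes
```

Under this model the count grows roughly quadratically with the number of starting nodes, which is why starting from a small set of subtrees rather than from individual sequences can reduce the scoring work substantially.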
Year: 2010 PMID: 20430824 PMCID: PMC2896197 DOI: 10.1093/nar/gkq298
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Beta adrenergic receptors SATCHMO-JS tree and MSA displayed using the PhyloScope viewer. The PhyloScope viewer allows users to select internal nodes of the tree for examination of the alignments at these nodes, which may reflect different levels of inferred structural similarity across homologs. Columns are colored according to conservation based on BLOSUM62 sum-of-pairs scores (light blue indicates the highest level of conservation, followed by dark blue, grey and uncolored). Clicking on a subtree node restricts the MSA displayed to the sequences descending from that node, and highlights the selected subtree. The SATCHMO algorithm attempts to determine which columns are part of the conserved core structure across all sequences that descend from a node, resulting in some residues being displayed in lowercase (indicating that they are inserted relative to the consensus) at nodes higher in the tree (toward the root) but in uppercase at subtrees nearer the leaves. (A) The SATCHMO-JS tree and MSA corresponding to the root of the tree, where all sequences are selected. The first ∼70 residues of most sequences display in lowercase (indicating insertions relative to the consensus structure) reflecting structural variability over the dataset as a whole in this region. (Coincidentally, the region identified by SATCHMO as conserved across the dataset corresponds to the PFAM 7TM_1 HMM, which matches this region.) (B) The ADRB1 subtree (corresponding to orthologous Beta-1 adrenergic receptors from different species) has been selected by clicking the subtree node. This results in coloring the selected subtree red and displaying the MSA corresponding to sequences descending from that node. Note that many residues that displayed in lowercase in the SATCHMO root-level MSA are now displayed in uppercase, indicating that they are predicted by SATCHMO to be part of the conserved core structure for Beta-1 adrenergic receptors. Examining this subtree MSA shows that ADRB1_XENLA (from Xenopus laevis, African clawed frog) and ADRB1_MEGLA (from Meleagris gallopavo, Common turkey) diverge from mammalian orthologs at the N-terminus.
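The uppercase/lowercase convention described in the caption can be inspected programmatically. The following is a minimal sketch, assuming one MSA row is available as a plain string in which uppercase letters are residues aligned to consensus columns, lowercase letters are insertions, and `-` or `.` are gap characters. The function name `summarize_row` and the example rows are made up for illustration and are not taken from the server's output.

```python
def summarize_row(row):
    """Count consensus-aligned and inserted residues in one MSA row.

    Uppercase letters are treated as aligned to consensus columns and
    lowercase letters as insertions; gap characters are ignored.
    """
    aligned = sum(1 for c in row if c.isupper())
    inserted = sum(1 for c in row if c.islower())
    return {"aligned": aligned, "inserted": inserted}


# Hypothetical rows for the same sequence at two tree nodes.
root_row = "mgag.VLALG-ASEP"     # N-terminal residues shown as insertions
subtree_row = "MGAG.VLALG-ASEP"  # same residues shown as aligned

print(summarize_row(root_row))
print(summarize_row(subtree_row))
```

Comparing the two counts for the same sequence at the root and at a subtree node gives a rough measure of how much additional sequence is treated as conserved core nearer the leaves.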
Figure 2. Benchmarking MSA accuracy. Methods used in this comparison include the original SATCHMO, SATCHMO-JS, ClustalW, MUSCLE and MAFFT (MUSCLE and MAFFT each used five iterations of refinement). Results are shown on 983 pairs from the PREFAB benchmark dataset, divided into bins based on the percent identity in the reference structural alignment. The Modeler score (Qmodeler) is a measure of the precision of an alignment, while the Developer score (Qdeveloper) is a measure of the recall. For every percent identity bin, either SATCHMO or SATCHMO-JS produces the best overall performance in both Modeler and Developer scores, with SATCHMO-JS generally producing better results than SATCHMO. Over the dataset as a whole, SATCHMO-JS’s improvement relative to other methods tested is statistically significant (P < 0.05 using Wilcoxon paired score signed rank tests) for all scoring functions (including Qcombined and the Cline Shift score, which balance recall and precision) with a single exception: relative to MAFFT, the difference is significant only for the Developer score (P = 1.138e-05). For the Modeler, Qcombined and Cline Shift scores, the P-values relative to MAFFT are 0.204, 0.093 and 0.157, respectively. See text for additional details.
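The Modeler and Developer scores can be read as precision and recall over aligned residue pairs. The sketch below illustrates that reading for a single pairwise alignment, assuming each alignment is given as two gapped strings of equal length with `-` as the gap character. The function names are hypothetical, and this is not the scoring code used for the benchmark, which may restrict scoring to particular reference columns.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue-index pairs placed in the same column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = set()
    i = j = 0  # running residue indices in the ungapped sequences
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def q_scores(test, reference):
    """Compute (modeler, developer) for a test alignment against a reference.

    modeler   = correct pairs / pairs in the test alignment      (precision)
    developer = correct pairs / pairs in the reference alignment (recall)
    """
    test_pairs = aligned_pairs(*test)
    ref_pairs = aligned_pairs(*reference)
    correct = len(test_pairs & ref_pairs)
    modeler = correct / len(test_pairs) if test_pairs else 0.0
    developer = correct / len(ref_pairs) if ref_pairs else 0.0
    return modeler, developer


# Toy example: the test alignment pairs only two residues, both correctly.
reference = ("ACDEF", "ACD-F")
test = ("ACDEF--", "AC---DF")
print(q_scores(test, reference))
```

In the toy example the test alignment commits to few pairs and gets them all right, so its precision is high while its recall is low. This is the trade-off that combined measures such as Qcombined and the Cline Shift score are meant to balance.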
Compute time required to estimate MSAs of different sizes, measured in seconds
| No. of sequences / avg. length | SATCHMO-JS | SATCHMO | ProbCons | T-Coffee | MAFFT |
|---|---|---|---|---|---|
| 100/230 | 30.09 | 198.85 | 85.99 | 219.74 | 2.14 |
| 200/155 | 112.49 | 346.06 | 265.23 | 954.96 | 13.94 |
| 300/126 | 234.79 | 533.97 | 560.93 | 3882.01 | 23.84 |
| 500/392 | 1085.12 | 14393.87 | 10469.5 | failed | 232.94 |
The first column gives the number of sequences and average sequence length for each dataset. ProbCons and MAFFT were run with five iterations of refinement; SATCHMO, SATCHMO-JS and T-Coffee used default parameters. The time to run SATCHMO-JS includes the time required for MAFFT, QuickTree and the subtree-selection program. MUSCLE’s run-time on these datasets is slightly longer than that of MAFFT (data not shown). T-Coffee failed to complete on the dataset with 500 sequences.
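As a quick reading aid for the timing table, the sketch below computes how many times faster SATCHMO-JS ran than the original SATCHMO on each dataset. The times are copied from the table; the dictionary layout and the `speedup` helper are illustrative choices, not part of the original analysis.

```python
# Wall-clock times in seconds, copied from the table above.
# Keys are (number of sequences, average sequence length).
TIMES = {
    (100, 230): {"SATCHMO-JS": 30.09, "SATCHMO": 198.85},
    (200, 155): {"SATCHMO-JS": 112.49, "SATCHMO": 346.06},
    (300, 126): {"SATCHMO-JS": 234.79, "SATCHMO": 533.97},
    (500, 392): {"SATCHMO-JS": 1085.12, "SATCHMO": 14393.87},
}


def speedup(times, baseline="SATCHMO", method="SATCHMO-JS"):
    """Return baseline time divided by method time for each dataset."""
    return {size: t[baseline] / t[method] for size, t in times.items()}


for (n_seqs, avg_len), ratio in speedup(TIMES).items():
    print(f"{n_seqs} sequences (avg. length {avg_len}): {ratio:.1f}x")
```

The ratio depends on both the number of sequences and their length, so the four datasets should not be read as a simple scaling curve.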