Abstract
We have developed MUMMALS, a program to construct multiple protein sequence alignments using probabilistic consistency. MUMMALS improves alignment quality by using pairwise alignment hidden Markov models (HMMs) with multiple match states that describe local structural information without exploiting explicit structure predictions. Parameters for such models have been estimated from a large library of structure-based alignments. We show that (i) on remote homologs, MUMMALS achieves the statistically best accuracy among several leading aligners, such as ProbCons, MAFFT and MUSCLE, although the average improvement is small, on the order of several percent; (ii) a large collection (>10 000) of automatically computed pairwise structure alignments of divergent protein domains is superior to smaller but carefully curated datasets for estimation of alignment parameters and performance tests; (iii) reference-independent evaluation of alignment quality using sequence alignment-dependent structure superpositions correlates well with reference-dependent evaluation that compares sequence-based alignments to structure-based reference alignments.
Year: 2006 PMID: 16936316 PMCID: PMC1636350 DOI: 10.1093/nar/gkl514
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. (a) An illustration of structure-based sequence alignment and hidden state paths. In Sequences 1 and 2, uppercase letters and lowercase letters represent aligned core blocks and unaligned regions, respectively. If two corresponding unaligned regions bounded by the same two core blocks are of different length, we split the shorter one into two pieces and introduce contiguous gaps in the middle. For both N- and C-terminal ends, the shorter unaligned region is pushed toward the core blocks. Secondary structure (ss) types (helix, ‘h’; strand, ‘e’; coil, ‘c’) are shown for Sequence 1. The hidden state paths for three models are shown below the amino acid sequences. (b) Model structure of HMM_1_1_0. Residue pairs in unaligned regions are modeled using the same match state (‘M’) as those in the aligned blocks. Insertions in the first sequence and second sequence are modeled using states ‘X’ and ‘Y’, respectively. (c) Model structure of HMM_1_1_1. Residue pairs in the unaligned regions are modeled using a different match state (‘U’) than the match state in the core blocks (‘M’). (d) Model structure of HMM_1_3_1. Residue pairs in aligned core blocks are modeled using three match states (‘H’, ‘S’, ‘C’) according to three secondary structure types of the first sequence. In (b), (c) and (d), match states are shown as squares and insertion states are shown as diamonds. Begin state, end state, and transitions from or to them are present in these models, but are not shown.
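The mapping from a case-coded structure-based alignment to a hidden state path, as described in the caption, can be sketched in a few lines. This is only an illustration and not the MUMMALS code: the function name `state_path`, its arguments and the example sequences are hypothetical, the secondary structure string is assumed to be given per alignment column, a residue pair with mixed case is treated as unaligned, and the use of state ‘U’ for unaligned pairs in HMM_1_3_1 is inferred from the trailing ‘1’ in the model name.

```python
# Illustrative sketch only; names and example data are hypothetical.
SS_TO_STATE = {"h": "H", "e": "S", "c": "C"}  # helix, strand, coil


def state_path(seq1, seq2, model="HMM_1_1_1", ss1=None):
    """Return one hidden-state letter per alignment column.

    seq1, seq2: gapped rows of equal length ('-' is a gap); uppercase marks
    aligned core blocks, lowercase marks unaligned regions.
    ss1: secondary structure of seq1 per column, used only by HMM_1_3_1.
    """
    if len(seq1) != len(seq2):
        raise ValueError("alignment rows must have equal length")
    if model == "HMM_1_3_1" and (ss1 is None or len(ss1) != len(seq1)):
        raise ValueError("HMM_1_3_1 needs one ss letter per column")
    path = []
    for col, (a, b) in enumerate(zip(seq1, seq2)):
        if a == "-" and b == "-":
            raise ValueError("column %d contains only gaps" % col)
        if b == "-":
            path.append("X")  # residue only in the first sequence
        elif a == "-":
            path.append("Y")  # residue only in the second sequence
        elif a.isupper() and b.isupper():
            # Residue pair inside an aligned core block.
            path.append(SS_TO_STATE[ss1[col]] if model == "HMM_1_3_1" else "M")
        else:
            # Residue pair inside an unaligned region.
            path.append("M" if model == "HMM_1_1_0" else "U")
    return "".join(path)


seq1 = "ACDefghKLM"
seq2 = "ACDe--fKLM"  # shorter unaligned region split, gaps in the middle
ss1 = "hhhcccceee"
for model in ("HMM_1_1_0", "HMM_1_1_1", "HMM_1_3_1"):
    print(model, state_path(seq1, seq2, model, ss1))
```

For the toy alignment above, the three models give the paths `MMMMXXMMMM`, `MMMUXXUMMM` and `HHHUXXUSSS`, respectively.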
Table 1. Average Q-scores in pairwise alignment tests on representative SCOP40 domain pairs
| Method/Model | Testing datasets | | | | |
|---|---|---|---|---|---|
| | SCOP 0–10% | SCOP 10–15% | SCOP 15–20% | SCOP 20–40% | SCOP All |
| | (355) | (432) | (420) | (578) | (1785) |
| HMM_1_1_0 | 0.146 | 0.322 | 0.568 | 0.851 | 0.516 |
| HMM_1_1_1 | 0.146 | 0.328 | 0.573 | 0.855 | 0.520 |
| HMM_3_1_1 | 0.150 | 0.327 | 0.574 | 0.858 | 0.521 |
| HMM_1_3_1 | | 0.334 | 0.585 | 0.858 | 0.526 |
| HMM_3_3_1 | 0.151 | | | | |
| HMM_1_1_0 | 0.123 | 0.295 | 0.551 | 0.843 | 0.498 |
| HMM_1_1_1 | 0.132 | 0.310 | 0.572 | 0.851 | 0.511 |
| ProbCons | 0.116 | | | | |
| MAFFT-fftnsi | 0.087 | 0.256 | 0.496 | 0.809 | 0.457 |
| MAFFT-einsi | 0.081 | 0.248 | 0.491 | 0.809 | 0.453 |
| MAFFT-linsi | 0.116 | 0.262 | 0.495 | 0.794 | 0.460 |
| MAFFT-ginsi | 0.116 | 0.265 | 0.496 | 0.794 | 0.461 |
| MUSCLE | | 0.293 | 0.507 | 0.817 | 0.482 |
| ClustalW | 0.136 | 0.270 | 0.482 | 0.809 | 0.467 |
Each HMM is named in the format ‘HMM_solv_ss_u’, where ‘solv’ is the number of solvent accessibility categories, ‘ss’ is the number of secondary structure types, and ‘u’ is 1 if unaligned regions are modeled with an additional match state. Average Q-scores of four testing datasets with different identity ranges are shown. Q-score is the number of correctly aligned residue pairs in the test alignment divided by the total number of aligned residue pairs in the reference alignment. The number of alignments in each testing dataset is shown in parentheses and the identity range in % is specified above the number of alignments. The best results of our models and the best results of other programs are in bold numbers.
a. Trained on DaliLite alignments of SCOP40 domain pairs with 20–40% identity.
b. Trained on BAliBASE2.0 pairwise alignments.
c. Our model is statistically better than the best of other programs according to Wilcoxon signed-rank test (P < 0.015).
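The Q-score defined in the table notes (correctly aligned residue pairs in the test alignment divided by all aligned residue pairs in the reference alignment) can be computed as in the following minimal sketch. It assumes each pairwise alignment is given as two gapped strings with `-` as the gap character; the helper names `aligned_pairs` and `q_score` and the toy sequences are hypothetical and are not taken from the MUMMALS evaluation scripts.

```python
# Illustrative sketch only; names and example data are hypothetical.
def aligned_pairs(row1, row2):
    """Return the set of (i, j) residue-index pairs aligned in two gapped rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    pairs = set()
    i = j = 0  # residue counters for the first and second sequence
    for a, b in zip(row1, row2):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def q_score(test, ref):
    """Fraction of reference residue pairs that the test alignment reproduces."""
    ref_pairs = aligned_pairs(*ref)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(aligned_pairs(*test) & ref_pairs) / len(ref_pairs)


reference = ("ACDE", "ACDE")
test = ("ACDE-", "AC-DE")  # the last two residue pairs are shifted
print(q_score(test, reference))  # 0.5
```

Only pairs present in the reference count toward the score, so a test alignment is not penalized for aligning residues that the reference leaves unaligned.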
Average alignment scores in tests of multiple sequence alignment programs
| Methods/Models | Testing datasets | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | SCOP 0–10% | SCOP 10–15% | SCOP 15–20% | SCOP 20–40% | SCOP | PREFAB | SABmark | SABmark | BAliBASE3.0 |
| | (355) | (432) | (420) | (578) | All (1785) | (1682) | Sup (425) | Twi (209) | Q/col (218) |
| HMM_1_1_0 | 0.313 | 0.514 | 0.727 | 0.885 | 0.644 | 0.723 | 0.516 | 0.193 | 0.862/0.551 |
| HMM_1_1_1 | 0.313 | 0.512 | 0.728 | 0.886 | 0.644 | 0.724 | 0.512 | 0.186 | 0.861/0.550 |
| HMM_3_1_1 | 0.321 | 0.514 | 0.730 | 0.888 | 0.647 | 0.726 | 0.516 | 0.186 | 0.862/0.554 |
| HMM_1_3_1 | 0.327 | 0.518 | 0.732 | 0.889 | 0.650 | 0.729 | 0.519 | 0.194 | 0.863/0.554 |
| HMM_3_3_1 | | | | | | | | | |
| ProbCons | 0.291 | 0.486 | 0.702 | 0.879 | 0.625 | 0.716 | 0.485 | 0.166 | 0.862 |
| MAFFT-fftnsi | 0.283 | 0.472 | 0.673 | 0.865 | 0.608 | 0.700 | 0.450 | 0.147 | 0.829 |
| MAFFT-einsi | 0.293 | 0.498 | 0.710 | 0.882 | 0.631 | 0.720 | 0.502 | 0.175 | 0.866 |
| MAFFT-linsi | 0.301 | | 0.707 | 0.883 | 0.633 | | | | |
| MAFFT-ginsi | | 0.497 | | | | 0.715 | 0.495 | 0.176 | 0.840 |
| MUSCLE | 0.262 | 0.453 | 0.662 | 0.866 | 0.597 | 0.680 | 0.433 | 0.136 | 0.816 |
| ClustalW | 0.210 | 0.357 | 0.566 | 0.798 | 0.519 | 0.617 | 0.390 | 0.127 | 0.749 |
The format of the HMM names (‘HMM_solv_ss_u’) is explained in Table 1. Average Q-scores are shown for all the testing datasets. For the BAliBASE3.0 dataset, both the Q-score (‘Q’, first number) and column score (‘col’, second number, fraction of entirely correct columns) are shown. The first four testing datasets are representative SCOP40 domain pairs with added homologs. SABmark has a ‘superfamily’ dataset (sup) and a ‘twilight zone’ dataset (twi). The number of alignments in each testing dataset is shown in parentheses and the identity range in % is specified above the number of alignments for SCOP datasets. The first five methods are MUMMALS implementing different HMMs. All sequence pairs are subject to the consistency measure in MUMMALS. The best scores of MUMMALS and the best scores of other programs are in bold.
a. MUMMALS with this model is statistically better than the best of other programs according to Wilcoxon signed-rank test (P < 0.015).
b. For BAliBASE3.0 test, the difference between MUMMALS with model HMM_1_3_1 or HMM_3_3_1 and this program is not statistically significant (P > 0.05) according to Wilcoxon signed-rank test.
c. For BAliBASE3.0 test, MUMMALS with model HMM_1_3_1 or HMM_3_3_1 is statistically better than this program (P-value less than 0.01, except for Q-scores of ProbCons, for which P = 0.017).
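The column score reported for BAliBASE3.0 (fraction of entirely correct columns) can be sketched as follows. This is an assumption-laden illustration, not the benchmark's own scoring program: alignments are taken as lists of gapped strings in the same sequence order, every reference column is scored (benchmarks may restrict scoring to annotated core columns), and the names `columns` and `column_score` are hypothetical.

```python
# Illustrative sketch only; names and example data are hypothetical.
def columns(rows):
    """Describe each alignment column as a tuple of residue indices (None = gap)."""
    if len(set(len(r) for r in rows)) > 1:
        raise ValueError("alignment rows must have equal length")
    counters = [0] * len(rows)  # next residue index for every sequence
    cols = []
    for chars in zip(*rows):
        col = []
        for k, ch in enumerate(chars):
            if ch == "-":
                col.append(None)
            else:
                col.append(counters[k])
                counters[k] += 1
        cols.append(tuple(col))
    return cols


def column_score(test_rows, ref_rows):
    """Fraction of reference columns reproduced exactly by the test alignment."""
    ref_cols = columns(ref_rows)
    if not ref_cols:
        raise ValueError("reference alignment has no columns")
    test_cols = set(columns(test_rows))
    return sum(col in test_cols for col in ref_cols) / len(ref_cols)


reference = ["ACD", "ACD", "ACD"]
test = ["ACD-", "ACD-", "AC-D"]  # the third sequence breaks the last column
print(column_score(test, reference))  # 2 of 3 reference columns are intact
```

A single misplaced residue makes the whole column count as wrong, which is why column scores in the table are much lower than the corresponding Q-scores.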
Assessment of multiple sequence alignment programs using reference-independent sequence and structural similarity scores on 1207 representative SCOP40 domain pairs with identity <20%
| Method | Structural similarity | | | | | | Sequence similarity | |
|---|---|---|---|---|---|---|---|---|
| | DALI | GDT-TS | TM-score | 3D-score | LBcona | LBconb | Sequence identity | Blosum62 score |
| HMM_1_1_0 | 0.1178 | 0.2510 | 0.3005 | 0.2499 | 0.2181 | 0.2828 | 0.0953 | 0.1687 |
| HMM_1_1_1 | 0.1200 | 0.2519 | 0.3010 | 0.2514 | 0.2190 | 0.2838 | | |
| HMM_3_1_1 | 0.1217 | 0.2540 | 0.3034 | 0.2532 | 0.2215 | 0.2872 | 0.0938 | 0.1665 |
| HMM_1_3_1 | 0.1226 | 0.2564 | 0.3061 | 0.2557 | 0.2230 | 0.2892 | 0.0944 | 0.1662 |
| HMM_3_3_1 | | | | | | | 0.0932 | 0.1651 |
| ProbCons | 0.1003 | 0.2324 | 0.2767 | 0.2307 | 0.2060 | 0.2670 | | 0.1719 |
| MAFFT-fftnsi | 0.0982 | 0.2333 | 0.2814 | 0.2297 | 0.2004 | 0.2632 | 0.0917 | 0.1621 |
| MAFFT-einsi | | 0.2425 | 0.2886 | 0.2410 | 0.2105 | 0.2763 | 0.0940 | 0.1666 |
| MAFFT-linsi | 0.1135 | | 0.2982 | | 0.2143 | | 0.0923 | 0.1632 |
| MAFFT-ginsi | 0.1126 | 0.2454 | | 0.2429 | | 0.2803 | 0.0972 | |
| MUSCLE | 0.0980 | 0.2297 | 0.2777 | 0.2266 | 0.1941 | 0.2535 | 0.0939 | 0.1686 |
| ClustalW | 0.0723 | 0.1916 | 0.2318 | 0.1876 | 0.1551 | 0.2030 | 0.0733 | 0.1344 |
The first five methods are MUMMALS implementing different HMMs. The format of the HMM names (‘HMM_solv_ss_u’) is explained in Table 1. The best scores of MUMMALS and the best scores of other programs (ProbCons, MAFFT with different options, MUSCLE, ClustalW) are in bold.
a. MUMMALS with this model is statistically better than the best of other programs according to Wilcoxon signed-rank test (P < 0.01).
Number of large Q-score or TM-score differences (no less than 0.1) among the multiple sequence alignment programs on 1207 representative SCOP40 domain pairs with identity <20%
| | HMM_1_1_0 | HMM_1_1_1 | HMM_3_1_1 | HMM_1_3_1 | HMM_3_3_1 | ProbCons | MAFFT-fftnsi | MAFFT-einsi | MAFFT-linsi | MAFFT-ginsi | MUSCLE | ClustalW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HMM_1_1_0 | — | 13/8 | 41/18 | 29/15 | | 50/157 | 105/191 | 108/163 | 119/127 | 85/102 | 92/208 | 73/399 |
| HMM_1_1_1 | 14/14 | — | 35/16 | 18/6 | | 53/156 | 100/189 | 103/171 | 109/124 | 90/110 | 91/209 | 76/397 |
| HMM_3_1_1 | 28/45 | 22/40 | — | 16/24 | | 39/164 | 86/211 | 101/183 | 109/135 | 81/124 | 82/225 | 63/418 |
| HMM_1_3_1 | 19/29 | 9/22 | 26/20 | — | | 46/155 | 90/202 | 107/169 | 110/124 | 89/116 | 92/214 | 70/408 |
| HMM_3_3_1 | | | | | — | | | | | | | |
| ProbCons | 187/76 | 186/80 | 203/67 | 189/73 | | — | 149/133 | 152/98 | 172/80 | 162/76 | 162/169 | 110/336 |
| MAFFT-fftnsi | 334/123 | 334/118 | 354/113 | 344/115 | | 242/154 | — | 123/98 | 137/71 | 153/76 | 115/143 | 88/342 |
| MAFFT-einsi | 234/156 | 232/152 | 260/147 | 247/152 | | 152/186 | 116/225 | — | 80/37 | 133/96 | 134/193 | 93/369 |
| MAFFT-linsi | 206/138 | 204/140 | 237/131 | 219/143 | | 133/188 | 87/207 | 62/82 | — | 85/98 | 93/196 | 60/387 |
| MAFFT-ginsi | 160/124 | 162/124 | 192/119 | 176/126 | | 113/184 | 109/253 | 129/158 | 111/132 | — | 90/185 | 67/395 |
| MUSCLE | 370/94 | 374/94 | 390/83 | 384/90 | | 302/138 | 218/136 | 309/134 | 295/110 | 327/106 | — | 75/288 |
| ClustalW | 627/67 | 628/66 | 645/60 | 645/65 | | 559/103 | 498/96 | 582/91 | 585/70 | 600/76 | 449/100 | — |
The first five methods are MUMMALS implementing different HMMs. The format of the HMM names (‘HMM_solv_ss_u’) is explained in Table 1. Each off-diagonal cell has two numbers separated by a slash. The first number is the number of cases where the alignment quality score of the program listed to the left (in a row) is inferior to that of the program listed above (in a column) by 0.1 or more. The second number is the number of cases where the score of the ‘row’ program is better than that of the ‘column’ program by 0.1 or more. The alignment quality scores in the lower triangle and upper triangle are Q-scores and weighted and normalized TM-scores, respectively. Comparisons of MUMMALS with the best model (HMM_3_3_1) with other programs are highlighted in bold.
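The two numbers in each cell can be reproduced from per-alignment scores of a ‘row’ program and a ‘column’ program, as in this minimal sketch. The function name `count_large_differences`, the tolerance `eps` (added so that differences of exactly 0.1 are not lost to floating-point rounding) and the toy scores are assumptions for illustration and do not come from the paper.

```python
# Illustrative sketch only; names and example data are hypothetical.
def count_large_differences(row_scores, col_scores, threshold=0.1, eps=1e-9):
    """Return (row_worse, row_better) counts for paired per-alignment scores.

    row_worse:  cases where the row program is inferior by `threshold` or more.
    row_better: cases where the row program is better by `threshold` or more.
    """
    if len(row_scores) != len(col_scores):
        raise ValueError("both programs must be scored on the same test cases")
    row_worse = row_better = 0
    for r, c in zip(row_scores, col_scores):
        if c - r >= threshold - eps:
            row_worse += 1
        elif r - c >= threshold - eps:
            row_better += 1
    return row_worse, row_better


# Toy per-alignment scores for two programs on four test cases.
row_program = [0.50, 0.20, 0.70, 0.40]
col_program = [0.65, 0.25, 0.45, 0.50]
worse, better = count_large_differences(row_program, col_program)
print("%d/%d" % (worse, better))  # 2/1
```

Counting only differences of at least 0.1 ignores small fluctuations, so the cell shows how often one program is clearly better or clearly worse than the other on individual test cases.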