Timo Lassmann, Erik L L Sonnhammer.
Abstract
Multiple sequence alignments play a central role in the annotation of novel genomes. Given the biological and computational complexity of this task, the automatic generation of high-quality alignments remains challenging. Since multiple alignments are usually employed at the very start of data analysis pipelines, it is crucial to ensure high alignment quality. We describe a simple, yet elegant, solution to assess the biological accuracy of alignments automatically. Our approach is based on the comparison of several alignments of the same sequences. We introduce two functions to compare alignments: the average overlap score and the multiple overlap score. The former identifies difficult alignment cases by expressing the similarity among several alignments, while the latter estimates the biological correctness of individual alignments. We implemented both functions in the MUMSA program and demonstrate the overall robustness and accuracy of both functions on three large benchmark sets.
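The abstract names the two scores but does not give their formulas, so the following is only a minimal sketch of the idea of comparing several alignments of the same sequences. It assumes that two alignments are compared through their sets of aligned residue pairs, that pairwise overlap is a Dice-style ratio of shared pairs, and that the multiple overlap score of an alignment is the average support its pairs receive from the other alignments. The function names (`aligned_pairs`, `overlap`, `average_overlap`, `multiple_overlap`) are hypothetical and are not the MUMSA interface.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    A pair is ((seq_i, residue_i), (seq_j, residue_j)) with seq_i < seq_j.
    """
    counters = [0] * len(alignment)  # residues consumed so far, per sequence
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != '-':
                present.append((s, counters[s]))
                counters[s] += 1
        pairs.update(combinations(present, 2))
    return pairs


def overlap(a, b):
    """Assumed pairwise overlap: shared pairs relative to the mean pair count."""
    pa, pb = aligned_pairs(a), aligned_pairs(b)
    if not pa and not pb:
        return 1.0
    return 2 * len(pa & pb) / (len(pa) + len(pb))


def average_overlap(alignments):
    """Mean pairwise overlap; low values suggest a difficult alignment case."""
    scores = [overlap(a, b) for a, b in combinations(alignments, 2)]
    return sum(scores) / len(scores)


def multiple_overlap(alignments):
    """Per-alignment score: fraction of the other alignments supporting each pair."""
    pair_sets = [aligned_pairs(a) for a in alignments]
    others = len(alignments) - 1
    scores = []
    for i, pairs in enumerate(pair_sets):
        if not pairs or others == 0:
            scores.append(0.0)
            continue
        support = sum(
            sum(p in other for j, other in enumerate(pair_sets) if j != i)
            for p in pairs
        )
        scores.append(support / (len(pairs) * others))
    return scores


# Toy input: three alternative alignments of the same two sequences.
toy = [["AC-G", "ACTG"], ["ACG-", "ACTG"], ["AC-G", "ACTG"]]
aos = average_overlap(toy)
mos = multiple_overlap(toy)
```

In this toy case the second alignment places the final G differently from the other two, so it receives the lowest per-alignment score, while the average score summarizes how much the three alignments agree overall.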
Year: 2005 PMID: 16361270 PMCID: PMC1316116 DOI: 10.1093/nar/gki1020
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Table 1. Alignment methods and parameters used in this study
| Method | Description/Options |
|---|---|
| Poa version 2 | Local unprogressive mode using blosum80.mat |
| ClustalW version 1.83 | Default parameters |
| Muscle version 3.52 | One iteration: -stable -maxiters 1 |
| | Two iterations: -stable -maxiters 2 |
| | Default: -stable |
| Probcons version 1.09 | Default parameters |
| Dialign version 2.2 | Default parameters |
| Mafft version 5.63 | -Localpair |
| | -Localpair -maxiterate 100 |
| | -Globalpair |
| | -Globalpair -maxiterate 100 |
| Kalign version 1.03 (manuscript submitted) | Default parameters |
Figure 1. Histograms of the distribution of difficult/easy alignment cases in the Balibase (A), Prefab (B), SABmark superfamily (C) and SABmark twilight (D) benchmark test sets. The accuracy of each alignment was calculated by comparison to reference alignments using the sum-of-pairs (SP), Q and f scores, respectively (see Materials and Methods). The SABmark twilight set consists of predominantly difficult cases, while the Balibase and Prefab sets contain mainly easy cases. The superfamily subset of SABmark is made up of an equal number of difficult and easy alignment cases.
Figure 2. Scatter plots of estimated case difficulty using the average overlap score versus real difficulty: Balibase (A), Prefab (B), SABmark superfamily (C) and SABmark twilight (D). The Pearson correlation coefficient (r) is high for all test sets.
Table 2. Pearson correlation coefficients between real alignment accuracy and predicted accuracy by MOS (bold), norMD, al2co and the average sequence identity
| Method | Balibase | Prefab | SABmark sup | SABmark twi |
|---|---|---|---|---|
| **MOS** | **0.76** | **0.87** | **0.86** | **0.78** |
| NorMD | 0.50 | 0.56 | 0.66 | 0.54 |
| Average sequence ID | 0.61 | 0.51 | 0.65 | 0.44 |
| Al2co 1_1 | 0.07 | — | 0.32 | 0.32 |
| Al2co 1_2 | 0.04 | — | 0.31 | 0.32 |
| Al2co 1_3 | −0.05 | — | 0.29 | 0.32 |
| Al2co 2_1 | 0.23 | — | 0.37 | 0.34 |
| Al2co 2_2 | 0.18 | — | 0.37 | 0.35 |
| Al2co 2_3 | 0.01 | — | 0.33 | 0.34 |
| Al2co 3_1 | 0.28 | — | 0.39 | 0.35 |
| Al2co 3_2 | 0.22 | — | 0.38 | 0.35 |
| Al2co 3_3 | 0.06 | — | 0.35 | 0.34 |
For al2co, the first number refers to the way conservation was calculated: 1, entropy-based measure; 2, variance-based measure; 3, sum-of-pairs measure. The second number refers to the weighting strategy used: 1, Unweighted amino acid frequency; 2, Henikoff weighting scheme; 3, estimated independent counts.
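The coefficients in this table are Pearson correlations between a predicted score and the real alignment accuracy. As a reminder of what that statistic measures, here is a minimal, dependency-free sketch. The function name `pearson_r` and the sample values are hypothetical and are not data from the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares (the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Hypothetical values: predicted quality score vs. accuracy against a reference.
predicted = [0.91, 0.74, 0.42, 0.20]
real = [0.95, 0.60, 0.50, 0.10]
r = pearson_r(predicted, real)
```

A value near 1 means the predictor orders and spaces the alignments much like the reference-based accuracy does, while a value near 0, as for several al2co modes on Balibase, means the score carries little linear information about accuracy.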
Figure 3. ROC curves demonstrating the agreement between real and predicted rank of several alignments of the same sequences: Balibase (A), Prefab (B), SABmark superfamily (C) and SABmark twilight (D). For al2co, we only show the best curve among all nine combinations of methods (Table 3, italic). For the Prefab set no meaningful results could be obtained using al2co. The rankings based on our MOSs are more accurate than the rankings according to norMD, al2co and sequence identity scores, accepting fewer false positives at comparable levels of true positives. The predictions of all scores are less accurate on the SABmark sets than on Balibase and Prefab.
Table 3. The AUC (area under the ROC curve) values for each benchmark set and method
| Method | Balibase | Prefab | SABmark sup | SABmark twi |
|---|---|---|---|---|
| **MOS** | **0.80** | **0.80** | **0.70** | **0.70** |
| NorMD | 0.64 | 0.75 | 0.62 | 0.56 |
| Average sequence ID | 0.62 | 0.62 | 0.51 | 0.48 |
| Al2co 1_1 | 0.55 | — | 0.45 | 0.44 |
| Al2co 1_2 | 0.56 | — | 0.44 | 0.44 |
| Al2co 1_3 | 0.55 | — | 0.44 | 0.44 |
| Al2co 2_1 | 0.56 | — | 0.45 | 0.46 |
| Al2co 2_2 | 0.58 | — | 0.46 | 0.46 |
| Al2co 2_3 | 0.57 | — | 0.45 | 0.46 |
| Al2co 3_1 | 0.57 | — | 0.46 | 0.47 |
| Al2co 3_2 | 0.59 | — | 0.46 | 0.47 |
| Al2co 3_3 | 0.59 | — | 0.46 | 0.46 |
On all benchmark sets our method (bold) is superior to norMD, al2co and the average sequence identity. See Table 2 for a description of the al2co modes. All methods are less accurate at predicting the correct rank of alignments on the SABmark benchmark sets.
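To make the AUC values easier to interpret, the sketch below computes the area under a ROC curve through its rank interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. This excerpt does not state how positives and negatives were defined for the alignment rankings, so the binary labels here are an assumption. The function name `auc` and the sample values are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `labels` are truthy for positives (assumed here: the alignment that is
    truly better) and falsy for negatives.
    """
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predictor scores and ground-truth labels.
example_auc = auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

Under this reading, an AUC of 0.5 corresponds to ranking no better than chance, which is roughly where the average sequence identity and al2co fall on the SABmark sets.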