Quan Le, Fabian Sievers, Desmond G Higgins.
Abstract
Motivation: Multiple sequence alignment (MSA) is commonly used to analyze sets of homologous protein or DNA sequences. This has led to the development of many methods and packages for MSA over the past 30 years. Comparing different methods has been problematic and has relied on gold standard benchmark datasets of 'true' alignments or on MSA simulations. A number of protein benchmark datasets have been produced which rely on a combination of manual alignment and/or automated superposition of protein structures. These are either restricted to very small MSAs with few sequences or require manual alignment, which can be subjective. In both cases, it remains very difficult to properly test MSAs of more than a few dozen sequences. PREFAB and HomFam both rely on a small subset of sequences of known structure and do not fairly test the quality of a full MSA.
Year: 2017 PMID: 28093407 PMCID: PMC5408826 DOI: 10.1093/bioinformatics/btw840
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Average prediction accuracy versus number of sequences in the alignment for 151 families. Solid lines show full alignments under different aligner settings; dashed lines show sub-alignments of 200 sequences embedded in the full alignments
Fig. 2. The variation of prediction accuracy according to the choice of three reference sequences and the non-reference sequences in five samples. In blue is the variation of prediction accuracy among three reference sequences in the same alignment; in red is the variation of prediction accuracy of the same reference sequence among five samples of the same Pfam family (Color version of this figure is available at Bioinformatics online.)
Fig. 3. Average prediction accuracy versus average SPS score for alignments of 200 and 1000 sequences from 238 Pfam families
Fig. 4. Effect on prediction accuracy when adding errors to Clustal Omega alignments of 200 sequences on the SSPA. The boxplot for each percentage of error is created from 10 resamples: the whiskers represent the top and bottom 25% quantiles, the red line is the median
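Fig. 3 compares prediction accuracy with the SPS score. As a rough illustration only, the sketch below computes a sum-of-pairs score under its usual definition: the fraction of residue pairs aligned in a reference alignment that are also aligned in a test alignment. The function names `aligned_pairs` and `sps` and the toy sequences are assumptions made for this example; they are not taken from the paper, and this is not the authors' scoring code.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Each residue is identified as (sequence index, ungapped position).
    """
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows must have the same length")
    seen = [0] * len(alignment)  # residues consumed so far per sequence
    pairs = set()
    for col in range(n_cols):
        present = []
        for s, row in enumerate(alignment):
            if row[col] != "-":
                present.append((s, seen[s]))
                seen[s] += 1
        # Sequence indices ascend within `present`, so pairs are ordered
        # consistently across alignments.
        pairs.update(combinations(present, 2))
    return pairs


def sps(test, reference):
    """Fraction of reference residue pairs reproduced by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned pairs")
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)


# Toy example (hypothetical sequences ACG and ACTG).
reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
score = sps(test, reference)  # 2 of the 3 reference pairs are recovered
```

Both alignments must contain the same sequences in the same order, since residues are matched by sequence index and ungapped position.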
The prediction accuracy for alignments of 200 sequences for 238 Pfam families
| Aligner settings | Prediction Accuracy (in %) |
|---|---|
| MAFFT L-INS-i | 78.94 * |
| MAFFT—Default | 78.19 |
| MAFFT—Fast Mode | 77.53 * |
| Clustal Omega—2 iter | 78.36 * |
| Clustal Omega—1 iter | 78.56 * |
| Clustal Omega—Default | 78.63 |
| MUSCLE—2 iter | 78.17 |
| MUSCLE—Default | 78.13 |
| MUSCLE—1 iter | 77.29 * |
| PASTA—Default | 78.70 |
| T-Coffee—Default | 78.45 |
| Kalign 2—Default | 77.93 |
| Clustal W2—Default | 77.13 |
| HMMER—Default | 77.86 |
For settings of the same aligner, an asterisk (*) indicates that the score differs significantly (higher or lower) from that aligner's default score, with P < 0.01 by the Wilcoxon signed rank test.
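The significance marks above come from a paired comparison of per-family scores. A minimal sketch of a two-sided Wilcoxon signed rank test follows, assuming the normal approximation without a tie correction to the variance (reasonable for a few hundred pairs, poor for very small samples). The function name `wilcoxon_signed_rank` and the example scores are hypothetical; the source does not say which implementation the authors used.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed rank test (normal approximation).

    Returns (w_plus, w_minus, p_value). Zero differences are dropped and
    tied absolute differences share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, w_minus, p_value


# Hypothetical paired scores for one non-default setting versus the default.
setting_scores = [10, 12, 9, 15, 11]
default_scores = [8, 13, 9, 11, 8]
w_plus, w_minus, p = wilcoxon_signed_rank(setting_scores, default_scores)
```

In practice each list would hold one score per Pfam family, and a setting would be flagged when `p` falls below 0.01.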