Carsten Kemena, Cedric Notredame.
Abstract
This review focuses on recent trends in multiple sequence alignment tools. It describes the latest algorithmic improvements including the extension of consistency-based methods to the problem of template-based multiple sequence alignments. Some results are presented suggesting that template-based methods are significantly more accurate than simpler alternative methods. The validation of existing methods is also discussed at length with the detailed description of recent results and some suggestions for future validation strategies. The last part of the review addresses future challenges for multiple sequence alignment methods in the genomic era, most notably the need to cope with very large sequences, the need to integrate large amounts of experimental data, the need to accurately align non-coding and non-transcribed sequences and finally, the need to integrate many alternative methods and approaches.
Year: 2009 PMID: 19648142 PMCID: PMC2752613 DOI: 10.1093/bioinformatics/btp452
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Benchmarking of a selection of methods on the RV11 BaliBase dataset. BaliBase/RV11 is made of 38 datasets, each consisting of seven or more highly divergent protein sequences (<20% pair-wise identity on the reference alignment).
| Method | Version | Score | Mode | Templates | RV11 | Server |
|---|---|---|---|---|---|---|
| 3DPSI-Coffee | 7.05 | Consistency | Accurate | Profile + Structure | 61.00 | |
| PROMAL-3D | Server | Consistency | Default | Profile + Structure | 58.66 | prodata.swmed.edu/promals3d |
| PROMALS | Server | Consistency | Default | Profile | 55.80 | prodata.swmed.edu/promals3d |
| PSI-Coffee | 7.05 | Consistency | Psicoffee | Profile | 53.71 | |
| M-Coffee4 | 7.05 | Consistency | Muscl + Kal. + ProbC + TC | – | 41.63 | |
| T-Coffee | 7.05 | Consistency | Default | – | 42.30 | |
| ProbCons | 1.1 | Consistency | Default | – | 40.80 | probcons.stanford.edu |
| ProbCons | 1.1 | Consistency | Monophasic Penalty | – | 37.53 | probcons.stanford.edu |
| Kalign | 2.03 | It+Matrix | Default | – | 33.82 | msa.cgb.ki.se |
| MUSCLE | 3.7 | It+Matrix | Default | – | 31.37 | |
| Mafft | 6.603b | It+Matrix | Default | – | 26.21 | align.genome.jp/mafft |
| Prank | 0.080715 | Matrix | Default | – | 26.18 | |
| Prank | 0.080715 | Matrix | +F | – | 24.82 | |
| ClustalW | 2.0.9 | Matrix | Default | – | 22.74 | |
All packages were run using the default parameters. Servers were run in August 2008.
Comparison of alternative reference datasets (adapted from Blackshields and Higgins)
| Dataset | #Categories | Agreement (%) | Self-agreement |
|---|---|---|---|
| BaliBase | 11 | 71.4 | 82.9 |
| RV11 | 1 | 77.4 | 83.3 |
| RV50 | 1 | 76.8 | 80.6 |
| SabMark | 4 | 69.8 | 81.3 |
| Oxbench | 10 | 65.0 | 70.8 |
| Prefab | 5 | 64.6 | 72.3 |
| Homstrad | 4 | 66.8 | 76.9 |
| IRMdb | 9 | 58.1 | 88.1 |
| – | – | – | – |
| – | – | – | – |
Blackshields and Higgins published the average accuracy of 10 MSA packages (Mafft, Muscle, POA, Dialign-T, Dialign2, PCMA, align_m, T-Coffee, ClustalW, ProbCons) on six reference databases. This table shows a new analysis of the original data. ‘Dataset’ indicates the dataset considered. In this column, ‘RV11’ and ‘RV50’ are two BaliBase categories, and ‘Empirical Dataset’ refers to the five empirical datasets (BaliBase3, SabMark, Oxbench, Prefab and Homstrad). ‘All datasets’ includes IRMdb as well. ‘#Categories’ indicates the number of sub-categories contained in the dataset considered. ‘Agreement’: average agreement between all the categories of a given dataset and all the categories of the other databases. The agreement is defined as the number of times two given database sub-categories agree on the relative accuracy of two methods, expressed as a percentage. The ‘Empirical dataset’ average is obtained by considering all possible pairs of methods across all possible pairs of categories within the empirical datasets (i.e. all datasets except IRMdb). ‘Self-agreement’: the same measure, restricted to a single database (i.e. each category in turn against all the other categories of that database). The last two rows show the average agreement across all datasets and across all empirical datasets, respectively.
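The agreement measure described in this legend can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `agreement`, the tie handling (two categories agree on a pair when the sign of the accuracy difference is the same) and the example accuracies are assumptions made for illustration only.

```python
from itertools import combinations


def agreement(scores_a, scores_b):
    """Percentage of method pairs that two categories rank the same way.

    scores_a, scores_b: dicts mapping a method name to its accuracy on one
    benchmark category. Only methods present in both dicts are compared.
    Assumption: a pair counts as agreeing when the sign of the accuracy
    difference (positive, negative or zero) is identical in both categories.
    """
    methods = sorted(set(scores_a) & set(scores_b))
    pairs = list(combinations(methods, 2))
    if not pairs:
        raise ValueError("need at least two shared methods")

    def sign(x):
        return (x > 0) - (x < 0)

    agreeing = sum(
        sign(scores_a[m1] - scores_a[m2]) == sign(scores_b[m1] - scores_b[m2])
        for m1, m2 in pairs
    )
    return 100.0 * agreeing / len(pairs)


# Hypothetical accuracies, for illustration only (not taken from the table).
cat_x = {"A": 60.0, "B": 50.0, "C": 40.0}
cat_y = {"A": 55.0, "B": 35.0, "C": 45.0}
print(agreement(cat_x, cat_y))
```

In this toy case the two categories agree on the pairs (A, B) and (A, C) but disagree on (B, C). Averaging this value over every pair of categories drawn from different databases would give the ‘Agreement’ column, and over pairs within one database the ‘Self-agreement’ column.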
Fig. 1. Generic overview of the derivation of a consistency-based scoring scheme. The sequences are first compared two by two using any suitable method. The second box shows the projection of the pair-wise comparisons. These projections may equally come from multiple sequence alignments, pair-wise comparisons or any method able to generate such projections, including posterior decoding of an HMM. They may also come from a template-based comparison such as the one described in Figure 3. The pairs thus identified are incorporated in the primary library and associated with weights that are used during the extension. The figure shows the T-Coffee extension protocol. When using probabilistic consistency, the probabilities are treated as weights and the triplet extension is made by multiplying the weights rather than taking the minimum. See Supplementary Material for color version of the figure.
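The triplet extension mentioned in this caption can be sketched in a few lines of Python. This is a simplified illustration, not the T-Coffee or ProbCons implementation: the library layout (a dict keyed by pairs of `(sequence, position)` residues), the names `pair_key` and `extend_library`, and the example weights are all assumptions, and the normalization used by probabilistic methods is omitted.

```python
from collections import defaultdict
import operator


def pair_key(x, y):
    """Order-independent key for a pair of residues."""
    return (x, y) if x <= y else (y, x)


def extend_library(primary, combine=min):
    """Triplet extension of a primary library (simplified sketch).

    primary: dict mapping (residue, residue) -> weight, where a residue is a
    (sequence_name, position) tuple and the two residues belong to different
    sequences.
    combine: how the two weights of a triplet x-z-y are merged; `min` follows
    the T-Coffee protocol named in the caption, `operator.mul` follows the
    probabilistic variant.
    """
    # For every residue z, record the residues it is paired with.
    neighbours = defaultdict(dict)
    for (x, y), weight in primary.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    # Start from the direct weights ...
    extended = defaultdict(float)
    for (x, y), weight in primary.items():
        extended[pair_key(x, y)] += weight

    # ... and add the contribution of every intermediate residue z.
    for z, linked in neighbours.items():
        residues = sorted(linked)
        for i, x in enumerate(residues):
            for y in residues[i + 1:]:
                if x[0] == y[0]:
                    continue  # both residues lie in the same sequence
                extended[pair_key(x, y)] += combine(linked[x], linked[y])
    return dict(extended)


# Hypothetical weights for one residue from each of three sequences.
primary = {
    (("A", 1), ("B", 1)): 88,
    (("A", 1), ("C", 1)): 77,
    (("B", 1), ("C", 1)): 100,
}
extended = extend_library(primary)

# Same triplet with hypothetical pair probabilities, combined by product.
probabilities = {
    (("A", 1), ("B", 1)): 0.8,
    (("A", 1), ("C", 1)): 0.5,
    (("B", 1), ("C", 1)): 0.9,
}
extended_prob = extend_library(probabilities, combine=operator.mul)
```

With `min`, the pair A1/B1 receives its direct weight (88) plus the weaker of the two links through C1 (77). Passing `combine` as a parameter keeps the two consistency flavours in one function so that they can be compared on the same library.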
Fig. 2. Typical colored output of M-Coffee. This output was obtained on the RV11033 BaliBase dataset, made of 11 distantly related bacterial NADH dehydrogenases. The alignment was obtained by combining Muscle, T-Coffee, Kalign and Mafft with M-Coffee. Correctly aligned residues (correctly aligned with 50% of their column, as judged from the reference) are in upper case; incorrectly aligned ones are in lower case. In this colored output, each residue has a color that indicates the agreement of the four initial MSAs with respect to the alignment of that specific residue. Dark red indicates residues aligned in a similar fashion among all the individual MSAs, whereas blue indicates very low agreement. Dark yellow, orange and red residues can be considered to be reliably aligned. See Supplementary Material for color version of the figure.
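The per-residue agreement that drives the coloring can be approximated as follows. This is only a rough sketch of the idea, not the index actually computed by M-Coffee: the functions `aligned_pairs` and `residue_support`, the representation of an MSA as a list of gapped strings in a fixed sequence order, and the toy alignments are assumptions.

```python
def aligned_pairs(msa):
    """Set of aligned residue pairs in an MSA given as a list of gapped strings.

    A residue is identified as (sequence_index, residue_index), where the
    residue index counts non-gap characters from 0.
    """
    counters = [0] * len(msa)
    pairs = set()
    for col in range(len(msa[0])):
        present = []
        for s, row in enumerate(msa):
            if row[col] != "-":
                present.append((s, counters[s]))
                counters[s] += 1
        for i in range(len(present)):
            for j in range(i + 1, len(present)):
                pairs.add((present[i], present[j]))
    return pairs


def residue_support(final_msa, input_msas):
    """Fraction of input MSAs supporting the pairings of each residue.

    For every residue of final_msa, average over its aligned pairs the
    proportion of input MSAs that contain the same pair.
    """
    input_pair_sets = [aligned_pairs(m) for m in input_msas]
    hits, totals = {}, {}
    for pair in aligned_pairs(final_msa):
        n_support = sum(pair in pair_set for pair_set in input_pair_sets)
        for residue in pair:
            hits[residue] = hits.get(residue, 0) + n_support
            totals[residue] = totals.get(residue, 0) + len(input_msas)
    return {residue: hits[residue] / totals[residue] for residue in totals}


# Toy example: two sequences (ACG and ATG), two hypothetical input alignments.
final = ["ACG", "ATG"]
inputs = [["ACG", "ATG"], ["AC-G", "A-TG"]]
support = residue_support(final, inputs)
```

Here the first and last residues are paired identically in both input alignments, while the middle residues are paired in only one of them. A coloring scheme would then map high support values to the red end of the scale and low values to blue.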
Fig. 3. Overview of template-based protocols. Templates are identified and mapped onto the target sequences. The figure shows three possible types of templates: homology extension, structure and functional annotation. The templates are then compared with a suitable method (profile aligner, structural aligner, etc.) and the resulting alignment (or comparison) is mapped onto the final alignment of the original target sequences. The residue pairs thus identified are then incorporated in the primary library. See Supplementary Material for color version of the figure.
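The projection step of this protocol, in which an alignment between two templates is mapped back onto the target sequences, might look like the sketch below. The function name `project_template_alignment`, the dict-based target-to-template mappings and the example positions are hypothetical; how templates are found and aligned is outside the scope of the sketch.

```python
def project_template_alignment(map_a, map_b, template_pairs):
    """Project a template-template alignment onto two target sequences.

    map_a: dict mapping a position in target A to a position in its template.
    map_b: same for target B and its template.
    template_pairs: iterable of (template A position, template B position)
        pairs taken from the alignment of the two templates.
    Returns the list of (target A position, target B position) pairs that
    could be added to the primary library.
    """
    # Invert the mappings so template positions lead back to target positions.
    template_to_a = {tpl: pos for pos, tpl in map_a.items()}
    template_to_b = {tpl: pos for pos, tpl in map_b.items()}

    target_pairs = []
    for tpl_a, tpl_b in template_pairs:
        # Keep a pair only when both template residues map to a target residue.
        if tpl_a in template_to_a and tpl_b in template_to_b:
            target_pairs.append((template_to_a[tpl_a], template_to_b[tpl_b]))
    return target_pairs


# Hypothetical mappings and template alignment, for illustration only.
map_a = {1: 10, 2: 11, 3: 12}
map_b = {1: 5, 2: 6, 4: 8}
template_pairs = [(10, 5), (11, 6), (12, 7), (13, 8)]
print(project_template_alignment(map_a, map_b, template_pairs))
```

Template pairs whose residues have no counterpart in a target sequence are dropped, so only positions covered by both mappings contribute pairs to the library.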