Francisco M Ortuño, Olga Valenzuela, Hector Pomares, Fernando Rojas, Javier P Florido, Jose M Urquiza, Ignacio Rojas.
Abstract
Multiple sequence alignments (MSAs) have become one of the most studied approaches in bioinformatics, underpinning other important tasks such as structure prediction, biological function analysis and next-generation sequencing. However, current MSA algorithms do not always provide consistent solutions, since alignments become increasingly difficult when dealing with low-similarity sequences. As is widely known, these algorithms depend directly on specific features of the sequences, which strongly influence alignment accuracy. Many MSA tools have been designed recently, but it is not possible to know in advance which one is the most suitable for a particular set of sequences. In this work, we analyze some of the most widely used algorithms in the literature and their dependence on several sequence features. A novel intelligent algorithm based on least squares support vector machines (LS-SVM) is then developed to predict how accurate each alignment could be, depending on its analyzed features. This algorithm is trained with a dataset of 2180 MSAs. The proposed system first estimates the accuracy of possible alignments; the most promising methodologies are then selected to align each set of sequences. Since only one selected algorithm is run, the computational time is not excessively increased.
Year: 2012 PMID: 23066102 PMCID: PMC3592395 DOI: 10.1093/nar/gks919
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. PAcAlCI scheme. The architecture comprises four modules: input dataset, feature extraction, feature selection and LS-SVM prediction.
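The core of the prediction module is a least squares SVM regressor. A minimal NumPy sketch of LS-SVM regression is shown below; it solves the standard LS-SVM KKT linear system with an RBF kernel. The hyperparameter values (`gamma`, `sigma`) and the toy 1-D target are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """RBF (Gaussian) kernel matrix between the row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=1000.0, sigma=0.2):
    """LS-SVM regression: solve the KKT system
       [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    sol = np.linalg.solve(A, np.concatenate([[0.0], y]))
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvm_predict(X_train, alpha, b, X_new, sigma=0.2):
    """Evaluate the fitted model on new feature vectors."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

# Toy usage on a smooth 1-D target (stand-in for the accuracy regression)
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.sin(2 * np.pi * X[:, 0])
b, alpha = lssvm_fit(X, y)
pred = lssvm_predict(X, alpha, b, X)
```

Unlike a classical SVM, the LS-SVM replaces inequality constraints with equality constraints, so training reduces to one dense linear solve rather than a quadratic program.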
Summary of applied methodologies
| Method | Version | Type |
|---|---|---|
| ClustalW | 2.0.10 | Progressive |
| Muscle | 3.8.31 | Progressive |
| Kalign | 2.04 | Progressive |
| Mafft | 6.85 | Progressive |
| RetAlign | 1.0 | Progressive |
| T-Coffee | 8.97 | Consistency-based |
| ProbCons | 1.12 | Consistency-based |
| FSA | 1.15.5 | Consistency-based |
| 3D-Coffee | 8.97 | Additional data |
| Promals | Server | Additional data |
Ten different methodologies were run to align multiple sequences. Their versions and the applied strategies are also shown.
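Because the predictor scores every aligner before anything is run, the system only needs to execute the single best-scoring method, which is how the computational cost stays close to that of one alignment. A minimal sketch of that selection step follows; `predict_accuracy` and the toy predictor are hypothetical stand-ins for the trained LS-SVM model.

```python
# The ten methodologies from the table above
ALIGNERS = ["ClustalW", "Muscle", "Kalign", "Mafft", "RetAlign",
            "T-Coffee", "ProbCons", "FSA", "3D-Coffee", "Promals"]

def select_aligner(features, predict_accuracy):
    """Return the aligner with the highest predicted accuracy.
    `predict_accuracy(features, method)` stands in for the LS-SVM model."""
    return max(ALIGNERS, key=lambda m: predict_accuracy(features, m))

# Hypothetical predictor favouring consistency-based methods (illustrative only)
def toy_predictor(features, method):
    base = {"ProbCons": 0.70, "T-Coffee": 0.65}.get(method, 0.55)
    return base - 0.01 * features["num_sequences"] / 100

best = select_aligner({"num_sequences": 8}, toy_predictor)
```

Only `best` is then actually run on the input sequences; the other nine aligners are never executed.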
Summary of features extracted from several databases
| Feature | Source | Range | Type | Rank |
|---|---|---|---|---|
| Number of sequences | BAliBASE | [4, 142] | Integer | 3 |
| Average length | BAliBASE | [66.13, 1630.11] | Real | 4 |
| Variance length (normalized) | BAliBASE | [0, 1] | Real | 6 |
| Reference subset | BAliBASE | [1, 6] | Integer | 5 |
| AA in | UniProt | [0, 1] | Real | 16 |
| AA in | UniProt | [0, 1] | Real | 7 |
| AA in transmembrane^a | UniProt | [0, 1] | Real | 22 |
| Domains^b | Pfam | [0.00, 6.67] | Real | 1 |
| Shared Domains^b | Pfam | [0.00, 117.07] | Real | 15 |
| GO terms^b | GOA | [0.00, 8.67] | Real | 11 |
| MF-GO terms^b | GOA | [0.00, 5.17] | Real | 17 |
| CC-GO terms^b | GOA | [0.00, 2.46] | Real | 20 |
| BP-GO terms^b | GOA | [0.00, 4.07] | Real | 19 |
| Shared GO terms^b | GOA | [0.00, 201.85] | Real | 18 |
| 3D-Structures^b | PDB | [0.04, 3.06] | Real | 14 |
| Seq. with any 3D structure | PDB | [0, 1] | Real | 21 |
| Shared 3D structures^b | PDB | [0.00, 0.75] | Real | 23 |
| Polar AA^a | Biochemistry | [0, 1] | Real | 9 |
| Non-polar AA^a | Biochemistry | [0, 1] | Real | 12 |
| Basic AA^a | Biochemistry | [0, 1] | Real | 10 |
| Aromatic AA^a | Biochemistry | [0, 1] | Real | 13 |
| Acid AA^a | Biochemistry | [0, 1] | Real | 8 |
| MSA method | — | [1, 10] | Integer | 2 |
Twenty-three features were retrieved from different databases. The relevance ranking was also measured according to the mRMR procedure. ^a These features are calculated as the percentage of amino acids (AA) with that specific feature. ^b These features are calculated as the number of occurrences per sequence.
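The mRMR (minimum redundancy, maximum relevance) procedure behind the "Rank" column greedily picks the feature most relevant to the target while penalizing redundancy with the features already chosen. A simplified sketch is below; it uses absolute Pearson correlation as a stand-in for the mutual-information estimates that mRMR normally employs, which is an assumption for illustration only.

```python
import numpy as np

def mrmr_rank(X, y, k):
    """Greedy mRMR-style ranking of the columns of X against target y,
    using |Pearson correlation| in place of mutual information."""
    n_feat = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]   # most relevant feature first
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            # average redundancy with already-selected features
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Toy check: feature 0 drives y, feature 1 duplicates it, feature 2 is noise
rng = np.random.default_rng(0)
f0 = rng.normal(size=200)
X = np.column_stack([f0, f0 + 0.01 * rng.normal(size=200),
                     rng.normal(size=200)])
y = f0 + 0.1 * rng.normal(size=200)
order = mrmr_rank(X, y, 3)
```

The redundancy penalty is what lets a weakly relevant but novel feature outrank a near-duplicate of an already-selected one.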
Figure 2. Evolution of the MRE. The number of features increases progressively in order of ascending relevance. Training and test errors are shown.
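The experiment behind Figure 2 can be sketched as a sweep: grow the feature set one feature at a time in mRMR relevance order, refit the regressor, and record the mean relative error (MRE) on the training and test splits. The ordinary-least-squares stand-in below replaces the LS-SVM purely to keep the sketch short; the sweep logic is the same.

```python
import numpy as np

def mean_relative_error(y_true, y_pred):
    """MRE = mean(|y - yhat| / |y|), the error tracked in Figure 2."""
    return float(np.mean(np.abs(y_true - y_pred) / np.abs(y_true)))

def feature_sweep(X_tr, y_tr, X_te, y_te, ranking, fit_predict):
    """Grow the feature set in relevance order; return (k, train MRE,
    test MRE) tuples. `fit_predict` is any regression routine."""
    curve = []
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]
        pred_tr, pred_te = fit_predict(X_tr[:, cols], y_tr, X_te[:, cols])
        curve.append((k, mean_relative_error(y_tr, pred_tr),
                         mean_relative_error(y_te, pred_te)))
    return curve

def ols_fit_predict(Xtr, ytr, Xte):
    """Least-squares linear fit with intercept (stand-in for the LS-SVM)."""
    A = np.column_stack([Xtr, np.ones(len(Xtr))])
    w, *_ = np.linalg.lstsq(A, ytr, rcond=None)
    B = np.column_stack([Xte, np.ones(len(Xte))])
    return A @ w, B @ w

# Toy data: only feature 0 matters, so one feature already suffices
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 2.0 + X[:, 0]
curve = feature_sweep(X[:60], y[:60], X[60:], y[60:], [0, 1, 2],
                      ols_fit_predict)
```

Plotting the `curve` tuples reproduces the shape of Figure 2: test error typically flattens (or rises) once the informative features are exhausted, which is how the 10-feature operating point would be chosen.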
Accuracies obtained for four different sets of sequences
| Alignment | Method | Real Acc. | Pred. Acc. | Rel. error |
|---|---|---|---|---|
| RV11 4th | 3D-Coffee | | | 0.1758 |
| | Promals | | | 0.0551 |
| | ProbCons | 0.6230 | | 0.0973 |
| | T-Coffee | 0.6120 | 0.5976 | 0.0235 |
| | Muscle | 0.6000 | 0.3840 | 0.3600 |
| | Kalign | 0.5730 | 0.6168 | 0.0765 |
| | Mafft | 0.5260 | 0.6246 | 0.1875 |
| | FSA | 0.4390 | 0.4159 | 0.0527 |
| | RetAlign | 0.3880 | 0.2767 | 0.2868 |
| | ClustalW2 | 0.1960 | 0.5291 | 1.6994 |
| RV11 20th | 3D-Coffee | | | 0.1390 |
| | Promals | | | 0.0202 |
| | Mafft | 0.6920 | 0.6516 | 0.0583 |
| | ProbCons | 0.6810 | 0.6994 | 0.0270 |
| | T-Coffee | 0.6540 | 0.6035 | 0.0772 |
| | ClustalW2 | 0.6520 | 0.5785 | 0.1127 |
| | RetAlign | 0.6330 | 0.5269 | 0.1677 |
| | Kalign | 0.6000 | 0.6823 | 0.1371 |
| | Muscle | 0.5920 | 0.6040 | 0.0203 |
| | FSA | 0.5320 | 0.6311 | 0.1863 |
| RV40 24th | Promals | | | 0.0955 |
| | Mafft | | | 0.0234 |
| | Kalign | 0.5616 | | 0.1099 |
| | 3D-Coffee | 0.5750 | 0.6117 | 0.0638 |
| | T-Coffee | 0.5750 | 0.5562 | 0.0326 |
| | ProbCons | 0.5680 | 0.5982 | 0.0532 |
| | FSA | 0.5330 | 0.5331 | 0.0001 |
| | Muscle | 0.5140 | 0.5153 | 0.0024 |
| | RetAlign | 0.5110 | 0.5520 | 0.0802 |
| | ClustalW2 | 0.4960 | 0.4378 | 0.1173 |
| RV50 10th | Promals | | | 0.0919 |
| | Mafft | | | 0.0520 |
| | ProbCons | | | 0.0495 |
| | 3D-Coffee | | | 0.0314 |
| | T-Coffee | 0.7105 | | 0.0879 |
| | Kalign | 0.7370 | | 0.0171 |
| | FSA | 0.5910 | 0.6412 | 0.0849 |
| | Muscle | 0.5290 | 0.7089 | 0.3400 |
| | RetAlign | 0.5110 | 0.6308 | 0.2345 |
| | ClustalW2 | 0.4830 | 0.5768 | 0.1942 |
Predicted accuracies are compared with those obtained by each methodology in four different problems. Values in bold show accuracies included in the confidence interval. The prediction error is also measured.
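The "Rel. error" column follows the usual definition of relative error with respect to the real accuracy, which can be checked directly against the complete rows of the table:

```python
def relative_error(real_acc, pred_acc):
    """Relative prediction error as tabulated: |real - pred| / real."""
    return abs(real_acc - pred_acc) / real_acc

# Check against two complete rows of the RV11 4th block
err_tcoffee = relative_error(0.6120, 0.5976)  # tabulated as 0.0235
err_muscle = relative_error(0.6000, 0.3840)   # tabulated as 0.3600
```

Note that this measure diverges as the real accuracy approaches zero, which is why a small absolute misprediction on a very hard alignment (e.g. the ClustalW2 row of RV11 4th) yields a relative error above 1.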
Figure 3. Distribution of relative errors for training and test sets. The corresponding LS-SVM prediction was performed using 10 features.
Figure 4. Distribution of relative errors for training and test sets. Low accuracies were previously filtered to improve the LS-SVM prediction, avoiding predictions with high errors.
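Filtering low accuracies makes sense given the error measure above: dividing by a near-zero real accuracy inflates the relative error. A minimal sketch of the filtering step is below; the 0.2 cutoff is an illustrative assumption, not the paper's threshold.

```python
import numpy as np

def filter_low_accuracy(X, y, min_acc=0.2):
    """Drop alignments whose real accuracy is below `min_acc` before
    training, since near-zero accuracies inflate relative errors.
    The 0.2 default is an assumption for illustration."""
    mask = y >= min_acc
    return X[mask], y[mask]

# Toy feature matrix and accuracy vector
X = np.arange(10).reshape(5, 2).astype(float)
y = np.array([0.05, 0.45, 0.62, 0.10, 0.71])
Xf, yf = filter_low_accuracy(X, y)
```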
Figure 5.Intersection of suitable and predicted methodologies (Venn diagrams) corresponding to the four alignments whose accuracies are shown in Table 3.
Comparison between PAcAlCI and AlexSys
| Feature | PAcAlCI | AlexSys |
|---|---|---|
| Number of aligners | 10 | 6 |
| Kind of problem | Regression (real) | Classification (binary) |
| Machine-learning strategy | LS-SVM | Decision trees |
| Values of prediction | Accuracies | Weak (Acc. < 0.5) or Strong (Acc. > 0.5) |
| Success rate | 83.6% | 45.0% (first aligner) |
| | 85.9% | 45.5% (second aligner) |
PAcAlCI is qualitatively compared with AlexSys. The performance and attributes of both procedures are shown.
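The key qualitative difference in the table is that PAcAlCI solves a regression problem (a real-valued accuracy per aligner) while AlexSys solves a binary classification problem. A regression output can always be collapsed into AlexSys-style labels, but not the reverse, as this small sketch shows:

```python
def to_alexsys_label(pred_acc, threshold=0.5):
    """Collapse a real-valued accuracy prediction into AlexSys-style
    binary labels, using the 0.5 threshold from the table."""
    return "strong" if pred_acc > threshold else "weak"

# Three hypothetical predicted accuracies
labels = [to_alexsys_label(a) for a in (0.62, 0.38, 0.51)]
```

The real-valued predictions also induce a full ranking of the ten aligners per input, whereas binary labels cannot distinguish between two "strong" candidates.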