Kuldip K Paliwal, Alok Sharma, James Lyons, Abdollah Dehzangi.
Abstract
Deciphering the three-dimensional structure of a protein from its sequence is a challenging task in biological science. Protein fold recognition and protein secondary structure prediction are intermediate steps in identifying the three-dimensional structure of a protein. For protein fold recognition, evolutionary information about amino acid sequences, taken from the position-specific scoring matrix (PSSM), has recently been applied with improved results. Separately, the SPINE-X predictor has been developed and applied to protein secondary structure prediction. Several reported methods for protein fold recognition have only limited accuracy. In this paper, we develop a strategy that combines evolutionary information (from the PSSM) with secondary structure predicted by SPINE-X to improve protein fold recognition. The strategy is based on finding the probabilities of amino acid pairs (AAP). The proposed method has been tested on several protein benchmark datasets, and an improvement of 8.9% in recognition accuracy has been achieved. We have achieved, for the first time, prediction accuracies of over 90% and 75% for sequence similarity values below 40% and 25%, respectively. We also obtain prediction accuracies of 90.6% and 77.0% for the Extended Ding and Dubchak (EDD) and Taguchi and Gromiha (TG) benchmark protein fold recognition datasets, respectively, which are widely used in the literature.
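The abstract does not spell out how the amino acid pair probabilities are computed. As a minimal sketch, assume the PSSM has already been converted to row-wise probabilities (one row per sequence position, one column per amino acid) and that a pair feature is the sum, over positions i, of the probability of residue m at position i times the probability of residue n at position i + k. The function name `pair_features`, the gap parameter `k`, and the toy 3-letter alphabet are illustrative assumptions, not taken from the paper.

```python
def pair_features(prob_matrix, k=1):
    """Return a flattened A*A list of pair scores for gap k.

    prob_matrix: list of L rows, each a list of A probabilities
    (A = 20 for a real PSSM; assumed already normalised per row).
    """
    length = len(prob_matrix)
    alphabet = len(prob_matrix[0]) if length else 0
    scores = [[0.0] * alphabet for _ in range(alphabet)]
    # Accumulate products of probabilities at positions i and i + k.
    for i in range(length - k):
        row_a = prob_matrix[i]
        row_b = prob_matrix[i + k]
        for m in range(alphabet):
            for n in range(alphabet):
                scores[m][n] += row_a[m] * row_b[n]
    # Flatten row by row so the result can be used as a feature vector.
    return [value for row in scores for value in row]


# Toy example with a 3-letter alphabet and 3 positions (assumed data).
toy = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
features = pair_features(toy, k=1)
```

With a real 20-column matrix this yields 400 values per gap; whether and how several gaps are concatenated is not stated here and would be a further design choice.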
Year: 2014 PMID: 25521502 PMCID: PMC4290640 DOI: 10.1186/1471-2105-15-S16-S12
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. An overview of .
Recognition accuracy (in percentage) under n-fold cross-validation for different feature extraction techniques with SVM classification on the DD dataset.
| Feature sets | ||||||
|---|---|---|---|---|---|---|
| PF1 [ | 48.6 | 49.1 | 49.5 | 50.1 | 50.5 | 50.6 |
| PF2 [ | 46.3 | 47.0 | 47.5 | 47.7 | 47.9 | 48.2 |
| PF [ | 51.2 | 52.2 | 52.6 | 52.9 | 53.4 | 53.4 |
| O [ | 49.7 | 50.4 | 50.8 | 50.8 | 51.1 | 51.0 |
| AAC [ | 43.6 | 43.9 | 44.2 | 44.8 | 44.6 | 45.1 |
| AAC+HXPZV [ | 45.1 | 46.2 | 46.5 | 46.8 | 46.9 | 47.2 |
| ACC [ | 65.7 | 66.6 | 66.8 | 67.5 | 67.7 | 68.0 |
| PSSM+PF1 [ | 62.5 | 63.2 | 63.7 | 64.2 | 64.5 | 64.6 |
| PSSM+PF2 [ | 62.7 | 63.3 | 64.1 | 64.2 | 64.6 | 64.7 |
| PSSM+PF [ | 65.5 | 66.2 | 66.5 | 66.9 | 67.1 | 67.5 |
| PSSM+O [ | 62.5 | 62.1 | 62.5 | 62.9 | 63.4 | 63.5 |
| PSSM+AAC [ | 57.5 | 58.1 | 58.4 | 58.7 | 59.1 | 59.2 |
| PSSM+AAC+HXPZV [ | 55.9 | 56.9 | 57.1 | 57.7 | 58.0 | 58.2 |
| Mono-gram [ | 67.7 | 68.4 | 68.6 | 69.1 | 69.4 | 69.6 |
| Bi-gram [ | 72.6 | 73.1 | 73.7 | 73.7 | 74.1 | 74.1 |
Recognition accuracy (in percentage) under n-fold cross-validation for different feature extraction techniques with SVM classification on the TG dataset.
| Feature sets | ||||||
|---|---|---|---|---|---|---|
| PF1 [ | 38.1 | 38.4 | 38.6 | 38.7 | 38.8 | 38.8 |
| PF2 [ | 38.0 | 38.4 | 38.5 | 38.6 | 38.7 | 38.8 |
| PF [ | 42.3 | 42.6 | 42.7 | 43.0 | 43.0 | 43.1 |
| O [ | 35.8 | 36.1 | 36.2 | 36.1 | 36.3 | 36.3 |
| AAC [ | 31.5 | 31.5 | 31.7 | 31.8 | 31.9 | 32.0 |
| AAC+HXPZV [ | 35.7 | 36.0 | 36.1 | 36.2 | 36.3 | 36.3 |
| ACC [ | 64.9 | 65.4 | 65.9 | 66.2 | 66.4 | 66.4 |
| PSSM+PF1 [ | 51.1 | 51.5 | 52.0 | 52.3 | 52.4 | 52.7 |
| PSSM+PF2 [ | 50.2 | 50.4 | 50.7 | 50.8 | 51.0 | 51.1 |
| PSSM+PF [ | 57.2 | 57.8 | 58.0 | 58.3 | 58.5 | 58.8 |
| PSSM+O [ | 46.0 | 46.3 | 46.5 | 46.5 | 46.7 | 46.7 |
| PSSM+AAC [ | 43.2 | 43.5 | 43.6 | 43.8 | 43.8 | 44.0 |
| PSSM+AAC+HXPZV [ | 45.6 | 45.9 | 46.0 | 46.2 | 46.3 | 46.6 |
| Mono-gram [ | 57.2 | 57.3 | 58.2 | 58.4 | 58.8 | 58.8 |
| Bi-gram [ | 67.1 | 67.5 | 67.6 | 67.8 | 68.1 | 68.1 |
Recognition accuracy (in percentage) under n-fold cross-validation for different feature extraction techniques with SVM classification on the EDD dataset.
| Feature sets | ||||||
|---|---|---|---|---|---|---|
| PF1 [ | 50.2 | 50.5 | 50.5 | 50.7 | 50.8 | 50.8 |
| PF2 [ | 49.3 | 49.5 | 49.7 | 49.8 | 49.8 | 49.9 |
| PF [ | 54.7 | 55.0 | 55.2 | 55.4 | 55.5 | 55.6 |
| O [ | 46.4 | 46.6 | 46.6 | 46.7 | 46.7 | 46.9 |
| AAC [ | 40.3 | 40.6 | 40.7 | 40.7 | 40.9 | 40.9 |
| AAC+HXPZV [ | 40.2 | 40.4 | 40.6 | 40.7 | 40.9 | 40.9 |
| ACC [ | 84.9 | 85.2 | 85.4 | 85.6 | 85.8 | 85.9 |
| PSSM+PF1 [ | 74.1 | 74.5 | 74.7 | 75.0 | 75.1 | 75.2 |
| PSSM+PF2 [ | 73.7 | 74.1 | 74.5 | 74.6 | 74.7 | 74.9 |
| PSSM+PF [ | 78.2 | 78.6 | 78.8 | 79.0 | 79.1 | 79.3 |
| PSSM+O [ | 67.6 | 68.0 | 68.1 | 68.3 | 68.3 | 68.5 |
| PSSM+AAC [ | 60.9 | 61.3 | 61.5 | 61.6 | 61.7 | 61.9 |
| PSSM+AAC+HXPZV [ | 66.7 | 67.2 | 67.4 | 67.7 | 67.8 | 67.9 |
| Mono-gram [ | 76.2 | 76.3 | 76.6 | 76.8 | 77.0 | 76.9 |
| Bi-gram [ | 83.6 | 84.0 | 84.1 | 84.3 | 84.3 | 84.5 |
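The tables above report accuracy under n-fold cross-validation. The sketch below illustrates that evaluation loop under simplifying assumptions: folds are assigned round-robin without shuffling or stratification, and the classifier is passed in as a callback. The helper names (`n_fold_indices`, `cross_val_accuracy`, `majority_class`) are hypothetical, and the majority-class baseline only stands in for the SVM used in the paper.

```python
from collections import Counter


def n_fold_indices(num_samples, n):
    """Split sample indices into n folds by round-robin assignment."""
    return [list(range(fold, num_samples, n)) for fold in range(n)]


def cross_val_accuracy(samples, labels, n, fit_predict):
    """Pooled accuracy (in percent) over all n held-out folds.

    fit_predict(train_x, train_y, test_x) must return predicted labels;
    it is a placeholder for whatever classifier is being evaluated.
    """
    correct = 0
    for test_idx in n_fold_indices(len(samples), n):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(samples)) if i not in held_out]
        predictions = fit_predict(
            [samples[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [samples[i] for i in test_idx],
        )
        correct += sum(p == labels[i] for p, i in zip(predictions, test_idx))
    return 100.0 * correct / len(samples)


def majority_class(train_x, train_y, test_x):
    """Hypothetical baseline: always predict the most common training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_x)


# Toy data (assumed): six samples, two classes.
labels = ["a", "a", "a", "b", "a", "a"]
samples = list(range(len(labels)))
accuracy = cross_val_accuracy(samples, labels, n=3, fit_predict=majority_class)
```

For fold recognition, where some folds have few members, a stratified split would usually be preferable to the round-robin assignment used here.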
Recognition accuracy (in percentage) for 10-fold cross validation procedure for PSSM and SSPM using SVM classifier on the DD, TG and EDD datasets.
| Feature sets | DD | TG | EDD |
|---|---|---|---|
| Using PSSM only | 74.5 | 73.8 | 88.8 |
| Using SSPM only | 59.8 | 55.2 | 71.7 |
| Using PSSM+SSPM (i.e., | 76.1 | 77.0 | 90.6 |
Recognition accuracy (in percentage) for 10-fold cross validation procedure using different classifiers on k-AAP.
| Classifiers | DD | TG | EDD |
|---|---|---|---|
| Naïve Bayes | 62.3 | 48.5 | 58.2 |
| SVM (SMO with linear polynomial of degree P = 1) | 75.4 | 76.1 | 88.8 |
| SVM (SMO with P = 3) | 69.1 | 69.2 | 86.2 |
| Random Forest (10 base learners) | 62.9 | 52.1 | 73.0 |
| Adaboost.M1 (10 base learners) | 68.1 | 59.3 | 79.2 |
| kNN (for | 70.8 | 65.6 | 84.3 |
Figure 2. Sensitivity and specificity of all feature sets for the DD dataset.
Figure 3. Sensitivity and specificity of all feature sets for the TG dataset.
Figure 4. Sensitivity and specificity of all feature sets for the EDD dataset.
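The figures report sensitivity and specificity, but this excerpt does not give the definitions used. The sketch below assumes the standard one-vs-rest definitions for a multi-class problem: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP) for a chosen fold class. The function name and the toy labels are illustrative only.

```python
def sensitivity_specificity(true_labels, predicted_labels, target):
    """One-vs-rest sensitivity and specificity for the class `target`."""
    tp = fn = fp = tn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == target and guess == target:
            tp += 1
        elif truth == target:
            fn += 1
        elif guess == target:
            fp += 1
        else:
            tn += 1
    # Guard against classes with no positives or no negatives.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy labels (assumed): three fold classes "a", "b", "c".
truth = ["a", "a", "b", "b", "c"]
guess = ["a", "b", "b", "b", "a"]
sens_a, spec_a = sensitivity_specificity(truth, guess, "a")
sens_b, spec_b = sensitivity_specificity(truth, guess, "b")
```

Averaging these per-class values over all fold classes is one common way to summarise a feature set, though the aggregation used for the figures is not stated here.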