Daozheng Chen, Xiaoyu Tian, Bo Zhou, Jun Gao.
Abstract
Protein fold classification plays an important role in both protein functional analysis and drug design. The number of proteins in the PDB is very large, but only a small fraction has been categorized and stored in the SCOPe database. An efficient method for protein fold classification is therefore needed. In recent years, a variety of classification methods have been applied to this problem. In this study, we propose a novel classification method called proFold. We incorporate protein tertiary structure information during feature extraction and employ a novel ensemble strategy during classifier training. On the widely used DD-dataset, proFold achieves 76.2% overall accuracy, higher than existing ensemble classifiers evaluated on the same dataset. On two other commonly used datasets, the EDD-dataset and the TG-dataset, it reaches accuracies of 93.2% and 94.3%, respectively, also higher than existing methods. ProFold is available to the public as a web server.
Year: 2016 PMID: 27660761 PMCID: PMC5021882 DOI: 10.1155/2016/6802832
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
The eight DSSP secondary-structure states, divided into four groups.
| Eight-state SS | Code | Description | Four groups |
|---|---|---|---|
| 3₁₀ helix (G) | G | Helix-3 | First |
| Alpha-helix (H) | H | Alpha helix | First |
| Pi-helix (I) | I | Helix-5 | First |
| Beta-strand (E) | E | Strand | Second |
| Beta-bridge (B) | B | Beta bridge | Second |
| Beta-turn (T) | T | Turn | Third |
| High curvature loop (S) | S | Bend | Third |
| Irregular (L) | (none) | Empty, no secondary structure assigned | Fourth |
The description and dimension of the DSSP feature.
| Features description | Dimension |
|---|---|
| State composition | 8 |
| Group composition | 4 |
| Number of continuous states | 8 |
| Number of continuous groups | 4 |
| Number of continuous state compositions | 8 |
| Number of continuous group compositions | 4 |
| Alternate frequency between groups | 4 |
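
As a minimal sketch of the first two rows of the table above, the state and group compositions can be computed from a per-residue DSSP string. The function name `dssp_composition`, the use of `L` for unassigned residues, and normalization by sequence length are assumptions made for illustration, not details confirmed by the source. The remaining features are not sketched because their exact definitions are not given here.

```python
from collections import Counter

# Eight DSSP states in a fixed order; "L" stands in for residues with no
# assigned secondary structure (an assumption about the encoding).
STATES = "GHIEBTSL"

# Four groups from the grouping table: helices, strands, turns/bends, irregular.
GROUPS = {"G": 0, "H": 0, "I": 0, "E": 1, "B": 1, "T": 2, "S": 2, "L": 3}


def dssp_composition(ss: str) -> tuple[list[float], list[float]]:
    """Return (8-dim state composition, 4-dim group composition) as fractions."""
    if not ss:
        raise ValueError("empty DSSP string")
    # Map any unexpected symbol (e.g. '-' or ' ') to the irregular state.
    cleaned = [c if c in GROUPS else "L" for c in ss.upper()]
    counts = Counter(cleaned)
    n = len(cleaned)
    state_comp = [counts[s] / n for s in STATES]
    group_comp = [0.0] * 4
    for s in STATES:
        group_comp[GROUPS[s]] += counts[s] / n
    return state_comp, group_comp


# Hypothetical 12-residue assignment: 4 helix, 4 strand, 2 turn, 2 irregular.
state_comp, group_comp = dssp_composition("HHHHEEEETTLL")
```

Both vectors sum to 1, so proteins of different lengths yield comparable values.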
The 20 amino acids divided into 3 groups according to their physicochemical properties.
| Physicochemical property | The 1st group | The 2nd group | The 3rd group |
|---|---|---|---|
| Hydrophobicity | RKEDQN | GASTPHY | CVLIMFW |
| Van der Waals volume | GASCTPD | NVEQIL | MHKFRYW |
| Polarity | LIFWCMVY | PATGS | HQRKNED |
| Polarizability | GASDT | CPNVEQIL | KMHFRYW |
| Charge | KR | ANCQGHILMFPSTWYV | DE |
| Surface tension | GQDNAHR | KTSEC | ILMFPWYV |
| Secondary structure | EALMQKRH | VIYCWFT | GNPSD |
| Solvent accessibility | ALFCGIVW | RKQEND | MPSTHY |
Names and dimensions of the amino acid composition and physicochemical features.
| Feature name | Dimension |
|---|---|
| Amino acid composition | 20 |
| Hydrophobicity | 21 |
| Van der Waals volume | 21 |
| Polarity | 21 |
| Polarizability | 21 |
| Charge | 21 |
| Surface tension | 21 |
| Secondary structure | 21 |
| Solvent accessibility | 21 |
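
A dimension of 21 per property is consistent with the commonly used composition/transition/distribution (CTD) descriptors over three residue groups: 3 + 3 + 15 values. The source does not spell out the formula, so the sketch below is an assumption, shown for the hydrophobicity grouping only. The function name `ctd_features` and the rule for picking the 25%, 50%, and 75% positions are illustrative choices.

```python
# Hydrophobicity grouping copied from the grouping table above.
HYDROPHOBICITY = ("RKEDQN", "GASTPHY", "CVLIMFW")


def ctd_features(seq: str, groups: tuple[str, str, str]) -> list[float]:
    """Composition (3) + transition (3) + distribution (15) = 21 values."""
    # Encode each residue as its group index 0, 1 or 2; skip unknown residues.
    lookup = {aa: i for i, grp in enumerate(groups) for aa in grp}
    enc = [lookup[aa] for aa in seq.upper() if aa in lookup]
    n = len(enc)
    if n == 0:
        raise ValueError("no recognised residues")

    # Composition: fraction of residues in each group.
    comp = [enc.count(g) / n for g in range(3)]

    # Transition: fraction of adjacent pairs that switch between two groups.
    pairs = list(zip(enc, enc[1:]))
    trans = []
    for a, b in ((0, 1), (0, 2), (1, 2)):
        hits = sum(1 for p in pairs if p in ((a, b), (b, a)))
        trans.append(hits / len(pairs) if pairs else 0.0)

    # Distribution: relative position of the first, 25%, 50%, 75% and last
    # residue of each group (0.0 when the group is absent).
    dist = []
    for g in range(3):
        pos = [i + 1 for i, x in enumerate(enc) if x == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not pos:
                dist.append(0.0)
                continue
            k = max(1, int(frac * len(pos)))  # 1-based rank within the group
            dist.append(pos[k - 1] / n)
    return comp + trans + dist


# Hypothetical 10-residue sequence, used only to exercise the function.
features = ctd_features("MKVLAAGIRE", HYDROPHOBICITY)
```

Applying the same function to each of the eight groupings and appending the 20-dimensional amino acid composition would give 8 × 21 + 20 = 188 values, matching the dimensions listed in the table.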
Figure 1: The training process of the four feature groups through the corresponding classifier.
Figure 2: The ensemble process of calculating the test data through the models.
Comparison with existing ensemble learning methods on DD-dataset.
| Methods | References | Overall accuracy (%) |
|---|---|---|
| PFP-Pred | [ | 62.1 |
| GAOEC | [ | 64.7 |
| PFP-FunDSeqE | [ | 70.5 |
| Dehzangi et al. | [ | 62.7 |
| Dehzangi et al. | [ | 62.4 |
| MarFold | [ | 71.7 |
| PFP-RFSM | [ | 73.7 |
| Feng and Hu | [ | 70.2 |
| Feng et al. | [ | 70.8 |
| PFPA | [ | 73.6 |
| proFold | This paper | 76.2 |
Comparison with the different methods on EDD-dataset by 10-fold cross validation.
| Methods | References | Overall accuracy (%) |
|---|---|---|
| Paliwal et al. | [ | 90.6 |
| Paliwal et al. | [ | 86.2 |
| Dehzangi et al. | [ | 88.2 |
| HMMFold | [ | 86.0 |
| Saini et al. | [ | 89.9 |
| Lyons et al. | [ | 92.9 |
| proFold | This paper | 93.2 |
Comparison with the different methods on TG-dataset by 10-fold cross validation.
| Methods | References | Overall accuracy (%) |
|---|---|---|
| Paliwal et al. | [ | 77.0 |
| Paliwal et al. | [ | 73.3 |
| Dehzangi et al. | [ | 73.8 |
| HMMFold | [ | 93.8 |
| Saini et al. | [ | 74.5 |
| NiRecor | [ | 84.6 |
| Lyons et al. | [ | 85.6 |
| proFold | This paper | 94.3 |
The accuracy of 5-fold cross validation on the features extracted from DD-dataset using 10 basic classifiers.
| Feature group | Basic classifier | Fivefold CV accuracy (%) |
|---|---|---|
| DSSP | LMT | 43.0 |
| DSSP | RandomForest | 51.3 |
| DSSP | LibSVM | 46.4 |
| DSSP | SimpleLogistic | 43.0 |
| DSSP | RotationForest | 49.7 |
| DSSP | SMO | 36.4 |
| DSSP | NaiveBayes | 43.4 |
| DSSP | RandomTree | 32.8 |
| DSSP | FT | 42.4 |
| DSSP | SimpleCart | 37.7 |
| AAsCPC | LMT | 32.5 |
| AAsCPC | RandomForest | 35.4 |
| AAsCPC | LibSVM | 34.4 |
| AAsCPC | SimpleLogistic | 32.5 |
| AAsCPC | RotationForest | 27.7 |
| AAsCPC | SMO | 34.4 |
| AAsCPC | NaiveBayes | 28.3 |
| AAsCPC | RandomTree | 11.6 |
| AAsCPC | FT | 34.4 |
| AAsCPC | SimpleCart | 20.6 |
| PSSM | LMT | 56.3 |
| PSSM | RandomForest | 53.7 |
| PSSM | LibSVM | 57.2 |
| PSSM | SimpleLogistic | 55.9 |
| PSSM | RotationForest | 56.1 |
| PSSM | SMO | 30.2 |
| PSSM | NaiveBayes | 42.4 |
| PSSM | RandomTree | 29.6 |
| PSSM | FT | 49.5 |
| PSSM | SimpleCart | 33.4 |
| FunD | LMT | 42.1 |
| FunD | RandomForest | 43.1 |
| FunD | LibSVM | 21.2 |
| FunD | SimpleLogistic | 43.1 |
| FunD | RotationForest | 41.8 |
| FunD | SMO | 38.9 |
| FunD | NaiveBayes | 38.3 |
| FunD | RandomTree | 39.9 |
| FunD | FT | 44.1 |
| FunD | SimpleCart | 34.7 |
The basic classifier of each feature group with the highest accuracy.
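
The table belonging to this caption is not present here. As a hedged illustration, the top-scoring classifier per feature group can be read off the fivefold cross-validation table above with a short script. The dictionary name `ACCURACY` is hypothetical, and the result is derived only from the listed accuracies; it is not a statement of which classifiers the authors ultimately used.

```python
# Fivefold CV accuracies (%) on the DD-dataset, copied from the table above.
ACCURACY = {
    "DSSP": {
        "LMT": 43.0, "RandomForest": 51.3, "LibSVM": 46.4,
        "SimpleLogistic": 43.0, "RotationForest": 49.7, "SMO": 36.4,
        "NaiveBayes": 43.4, "RandomTree": 32.8, "FT": 42.4, "SimpleCart": 37.7,
    },
    "AAsCPC": {
        "LMT": 32.5, "RandomForest": 35.4, "LibSVM": 34.4,
        "SimpleLogistic": 32.5, "RotationForest": 27.7, "SMO": 34.4,
        "NaiveBayes": 28.3, "RandomTree": 11.6, "FT": 34.4, "SimpleCart": 20.6,
    },
    "PSSM": {
        "LMT": 56.3, "RandomForest": 53.7, "LibSVM": 57.2,
        "SimpleLogistic": 55.9, "RotationForest": 56.1, "SMO": 30.2,
        "NaiveBayes": 42.4, "RandomTree": 29.6, "FT": 49.5, "SimpleCart": 33.4,
    },
    "FunD": {
        "LMT": 42.1, "RandomForest": 43.1, "LibSVM": 21.2,
        "SimpleLogistic": 43.1, "RotationForest": 41.8, "SMO": 38.9,
        "NaiveBayes": 38.3, "RandomTree": 39.9, "FT": 44.1, "SimpleCart": 34.7,
    },
}

# For each feature group, pick the classifier with the highest accuracy.
best = {group: max(scores, key=scores.get) for group, scores in ACCURACY.items()}

for group, clf in best.items():
    print(f"{group}: {clf} ({ACCURACY[group][clf]:.1f}%)")
```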
Comparison with the different ensemble strategies on three datasets.
| Dataset | Accuracy of the traditional ensemble strategy (%) | Accuracy of the proposed ensemble strategy (%) |
|---|---|---|
| DD | 72.5 | 76.2 |
| EDD | 89.9 | 93.2 |
| TG | 91.7 | 94.3 |
The accuracy of each fold class with and without the DSSP feature.
| Fold number | Accuracy without the DSSP feature (%) | Accuracy with the DSSP feature (%) |
|---|---|---|
| 1 | 100.0 | 100.0 |
| 2 | 88.9 | 100.0 |
| 3 | 55.0 | 60.0 |
| 4 | 62.5 | 87.5 |
| 5 | 88.9 | 88.9 |
| 6 | 66.7 | 77.8 |
| 7 | 77.3 | 84.1 |
| 8 | 66.7 | 66.7 |
| 9 | 92.3 | 92.3 |
| 10 | 66.7 | 66.7 |
| 11 | 50.0 | 50.0 |
| 12 | 47.4 | 68.4 |
| 13 | 100.0 | 100.0 |
| 14 | 50.0 | 50.0 |
| 15 | 100.0 | 100.0 |
| 16 | 91.7 | 93.8 |
| 17 | 83.3 | 91.7 |
| 18 | 38.5 | 46.2 |
| 19 | 85.2 | 85.2 |
| 20 | 50.0 | 50.0 |
| 21 | 87.5 | 87.5 |
| 22 | 58.3 | 58.3 |
| 23 | 57.1 | 71.4 |
| 24 | 100.0 | 100.0 |
| 25 | 25.0 | 25.0 |
| 26 | 44.4 | 59.3 |
| 27 | 92.6 | 96.3 |
| Overall | 71.3 | 76.2 |
The fold classes whose accuracy increased significantly after adding the DSSP feature.
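
The table for this caption is not present here. As a rough sketch, the per-fold gain can be computed directly from the with/without table above. The 10-point cutoff `THRESHOLD` is a hypothetical choice for illustration only; the source does not state what it counts as a significant increase.

```python
# (fold, accuracy without DSSP, accuracy with DSSP), in %, from the table above.
ROWS = [
    (1, 100.0, 100.0), (2, 88.9, 100.0), (3, 55.0, 60.0), (4, 62.5, 87.5),
    (5, 88.9, 88.9), (6, 66.7, 77.8), (7, 77.3, 84.1), (8, 66.7, 66.7),
    (9, 92.3, 92.3), (10, 66.7, 66.7), (11, 50.0, 50.0), (12, 47.4, 68.4),
    (13, 100.0, 100.0), (14, 50.0, 50.0), (15, 100.0, 100.0), (16, 91.7, 93.8),
    (17, 83.3, 91.7), (18, 38.5, 46.2), (19, 85.2, 85.2), (20, 50.0, 50.0),
    (21, 87.5, 87.5), (22, 58.3, 58.3), (23, 57.1, 71.4), (24, 100.0, 100.0),
    (25, 25.0, 25.0), (26, 44.4, 59.3), (27, 92.6, 96.3),
]

THRESHOLD = 10.0  # hypothetical cutoff in percentage points, not from the source

# Gain in percentage points for every fold class.
gains = {fold: round(with_dssp - without, 1) for fold, without, with_dssp in ROWS}

# Fold classes whose gain reaches the illustrative cutoff.
large_gains = [fold for fold, gain in gains.items() if gain >= THRESHOLD]

for fold in large_gains:
    print(f"fold {fold}: +{gains[fold]:.1f} points")
```

In the listed data no fold class loses accuracy when the DSSP feature is added; the gains are concentrated in a minority of folds.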