Abstract
The recognition of protein folds is an important step in the prediction of protein structure and function. Since the recognition of 27-class protein folds by Ding and Dubchak in 2001, prediction algorithms, prediction parameters, and datasets for the prediction of protein folds have all been improved. However, the influence on protein folding of interactions between predicted secondary structure segments, and of motif information, has not been considered. Recognizing the 27-class protein folds with segment interactions and motif information is therefore important. Based on the 27-class folds dataset built by Liu et al., amino acid composition, the interactions of secondary structure segments, motif frequency, and predicted secondary structure information were extracted. Using the Random Forest algorithm and an ensemble classification strategy, the 27-class protein folds and the corresponding structural classes were identified by independent test. On the testing set, the overall accuracies for folds and for structural classes reached 78.38% and 92.55%, respectively. When the training set and testing set were combined, the overall accuracy by 5-fold cross validation was 81.16%. For comparison with previous results, the method was also tested on Ding and Dubchak's dataset, which has been widely used by earlier researchers, and an improved overall accuracy of 70.24% was obtained.
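The first feature group named in the abstract, amino acid composition, is the 20-dimensional vector of residue frequencies in a sequence. A minimal Python sketch follows; the function name `amino_acid_composition` and the decision to skip non-standard residue codes are assumptions made for illustration, not details given in the abstract.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the 20-dimensional residue-frequency vector of a sequence.

    Assumption: characters outside the 20 standard codes (e.g. 'X') are
    skipped, and frequencies are normalised by the number of counted residues.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    # One frequency per amino acid, in the fixed order of AMINO_ACIDS.
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    vec = amino_acid_composition("ACDAAX")
    print(len(vec), vec[0])
```

The other feature groups (segment interactions, motif frequency, predicted secondary structure) would be appended to this vector to reach the larger dimensionalities reported for the combined parameter sets.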
Year: 2014 PMID: 25136571 PMCID: PMC4127253 DOI: 10.1155/2014/262850
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Datasets of 27-class protein folds.
| Fold | Liu et al. dataset: training set | Liu et al. dataset: testing set | Ding and Dubchak [11] dataset: training set | Ding and Dubchak [11] dataset: testing set |
|---|---|---|---|---|
| All α | 174 | 169 | 54 | 61 |
| (1) Globin-like | 14 | 14 | 13 | 6 |
| (2) Cytochrome c | 10 | 10 | 7 | 9 |
| (3) DNA-binding 3-helical bundle | 92 | 90 | 12 | 20 |
| (4) 4-helical up-and-down bundle | 25 | 24 | 7 | 8 |
| (5) 4-helical cytokines | 8 | 8 | 9 | 9 |
| (6) Alpha; EF-hand | 25 | 23 | 6 | 9 |
| All β | 260 | 254 | 109 | 117 |
| (7) Immunoglobulin-like | 86 | 85 | 30 | 44 |
| (8) Cupredoxins | 18 | 18 | 9 | 12 |
| (9) Viral coat and capsid proteins | 24 | 24 | 16 | 13 |
| (10) ConA-like lectins/glucanases | 18 | 17 | 7 | 6 |
| (11) SH3-like barrel | 41 | 41 | 8 | 8 |
| (12) OB-fold | 29 | 28 | 13 | 19 |
| (13) Trefoil | 11 | 10 | 8 | 4 |
| (14) Trypsin-like serine proteases | 17 | 16 | 9 | 4 |
| (15) Lipocalins | 16 | 15 | 9 | 7 |
| α/β | 341 | 337 | 115 | 143 |
| (16) (TIM)-barrel | 93 | 92 | 29 | 48 |
| (17) FAD (also NAD)-binding motif | 5 | 5 | 11 | 12 |
| (18) Flavodoxin-like | 37 | 36 | 11 | 13 |
| (19) NAD(P)-binding Rossmann-fold | 17 | 16 | 13 | 27 |
| (20) P-loop-containing nucleotide | 74 | 73 | 10 | 12 |
| (21) Thioredoxin-like | 37 | 36 | 9 | 8 |
| (22) Ribonuclease H-like motif | 39 | 40 | 10 | 12 |
| (23) Hydrolases | 33 | 33 | 11 | 7 |
| (24) Periplasmic binding protein-like | 6 | 6 | 11 | 4 |
| α+β | 181 | 179 | 33 | 62 |
| (25) β-grasp | 39 | 39 | 7 | 8 |
| (26) Ferredoxin-like | 101 | 99 | 13 | 27 |
| (27) Small inhibitors, toxins, and lectins | 41 | 41 | 13 | 27 |
| Overall | 956 | 939 | 311 | 383 |
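As a quick consistency check, the four subtotal rows of the table (covering folds 1-6, 7-15, 16-24, and 25-27) should add up to the Overall row. The sketch below copies those numbers from the table; the variable names are illustrative only.

```python
# Subtotal rows copied from the table, in column order:
# (Liu train, Liu test, Ding-Dubchak train, Ding-Dubchak test).
class_subtotals = {
    "folds 1-6": (174, 169, 54, 61),
    "folds 7-15": (260, 254, 109, 117),
    "folds 16-24": (341, 337, 115, 143),
    "folds 25-27": (181, 179, 33, 62),
}
reported_overall = (956, 939, 311, 383)

# Sum each column across the four subtotal rows.
column_sums = tuple(sum(col) for col in zip(*class_subtotals.values()))
assert column_sums == reported_overall

# Size of the combined Liu et al. set (training + testing), i.e. the pool
# that the abstract describes for 5-fold cross validation.
combined_liu = reported_overall[0] + reported_overall[1]
print(column_sums, combined_liu)
```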
The physicochemical property values for 20 amino acid residues.
| Code | H1 | H2 | PL | SASA |
|---|---|---|---|---|
| A | 0.62 | −0.5 | 8.1 | 1.181 |
| C | 0.29 | −1 | 5.5 | 1.461 |
| D | −0.9 | 3 | 13 | 1.587 |
| E | −0.74 | 3 | 12.3 | 1.862 |
| F | 1.19 | −2.5 | 5.2 | 2.228 |
| G | 0.48 | 0 | 9 | 0.881 |
| H | −0.4 | −0.5 | 10.4 | 2.025 |
| I | 1.38 | −1.8 | 5.2 | 1.81 |
| K | −1.5 | 3 | 11.3 | 2.258 |
| L | 1.06 | −1.8 | 4.9 | 1.931 |
| M | 0.64 | −1.3 | 5.7 | 2.034 |
| N | −0.78 | 2 | 11.6 | 1.655 |
| P | 0.12 | 0 | 8 | 1.468 |
| Q | −0.85 | 0.2 | 10.5 | 1.932 |
| R | −2.53 | 3 | 10.5 | 2.56 |
| S | −0.18 | 0.3 | 9.2 | 1.298 |
| T | −0.05 | −0.4 | 8.6 | 1.525 |
| V | 1.08 | −1.5 | 5.9 | 1.645 |
| W | 0.81 | −3.4 | 5.4 | 2.663 |
| Y | 0.26 | −2.3 | 6.2 | 2.368 |
Figure 1. The numbers of sequences containing secondary structure segments. (a) and (b) are for the training set and the testing set, respectively.
Prediction accuracies of different parameters in the testing set (%).
| Fold | A | A + ACC | A + ACC + M | A + ACC + M + P | A + ACC + M + P (5-fold cross validation) | The results of Liu et al. [27] | Ding and Dubchak's dataset [11] |
|---|---|---|---|---|---|---|---|
| 1 | 21.43 | 71.43 | 71.43 | 71.43 | 75.00 (0.0252) | 78.5 | 100.00 |
| 2 | 10.00 | 70.00 | 70.00 | 80.00 | 95.00 (0.0000) | 90.0 | 100.00 |
| 3 | 60.00 | 90.00 | 91.11 | 91.11 | 92.86 (0.0026) | 75.5 | 75.00 |
| 4 | 4.17 | 83.33 | 75.00 | 75.00 | 81.63 (0.0000) | 54.1 | 87.50 |
| 5 | 12.50 | 25.00 | 12.50 | 25.00 | 18.75 (0.0187) | 25.0 | 77.78 |
| 6 | 0.00 | 60.87 | 52.17 | 52.17 | 75.00 (0.0342) | 39.1 | 66.67 |
| 7 | 87.06 | 91.76 | 89.41 | 90.59 | 89.47 (0.0114) | 82.3 | 79.55 |
| 8 | 11.11 | 27.78 | 27.78 | 38.89 | 41.67 (0.0000) | 55.5 | 75.00 |
| 9 | 45.83 | 50.00 | 50.00 | 58.33 | 70.83 (0.0421) | 70.8 | 84.62 |
| 10 | 23.53 | 35.29 | 47.06 | 52.94 | 57.14 (0.0255) | 47.0 | 66.67 |
| 11 | 24.39 | 56.10 | 48.78 | 58.54 | 70.73 (0.0185) | 43.9 | 37.50 |
| 12 | 0.00 | 46.43 | 64.29 | 60.71 | 54.39 (0.0096) | 60.7 | 89.47 |
| 13 | 0.00 | 30.00 | 50.00 | 60.00 | 66.67 (0.0426) | 10.0 | 50.00 |
| 14 | 37.50 | 56.25 | 62.50 | 62.50 | 81.82 (0.0000) | 75.0 | 25.00 |
| 15 | 53.33 | 40.00 | 40.00 | 46.67 | 67.74 (0.0136) | 40.0 | 100.00 |
| 16 | 86.96 | 95.65 | 98.91 | 100.00 | 98.92 (0.0144) | 89.1 | 66.67 |
| 17 | 0.00 | 20.00 | 20.00 | 20.00 | 20.00 (0.0097) | 20.0 | 91.67 |
| 18 | 11.11 | 30.56 | 47.22 | 61.11 | 68.49 (0.0894) | 16.6 | 38.46 |
| 19 | 37.50 | 81.25 | 100.00 | 100.00 | 100.00 (0.0300) | 81.2 | 62.96 |
| 20 | 26.03 | 72.60 | 90.41 | 89.04 | 91.84 (0.0398) | 87.6 | 41.67 |
| 21 | 30.56 | 50.00 | 75.00 | 72.22 | 72.60 (0.0217) | 52.7 | 75.00 |
| 22 | 22.50 | 40.00 | 62.50 | 57.50 | 65.82 (0.0113) | 50.0 | 41.67 |
| 23 | 27.27 | 45.45 | 90.91 | 90.91 | 95.46 (0.0107) | 78.7 | 57.15 |
| 24 | 0.00 | 16.67 | 50.00 | 66.67 | 41.67 (0.0373) | 50.0 | 25.00 |
| 25 | 12.82 | 56.41 | 61.54 | 61.54 | 69.23 (0.0233) | 30.7 | 12.50 |
| 26 | 51.52 | 88.89 | 90.91 | 92.93 | 86.00 (0.0104) | 67.6 | 62.96 |
| 27 | 100.00 | 75.61 | 87.80 | 92.68 | 92.68 (0.0122) | 100.0 | 96.30 |
| Q | 43.66 | 68.80 | 76.25 | 78.38 | 81.16 | 66.5 | 70.24 |
Note: A is amino acid composition (20 dimensions); A + ACC is amino acid composition plus the interaction of segments (164 dimensions); A + ACC + M adds motif frequency (290 dimensions); and A + ACC + M + P adds predicted secondary structure information (296 dimensions). Q is the overall accuracy. Standard deviations are given in parentheses in the sixth column. The penultimate column gives the results of Liu et al. [27] on the same dataset, and the last column gives our results on the dataset built by Ding and Dubchak [11].
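The overall accuracy Q can be recovered from the per-fold accuracies and the testing-set sizes. The sketch below assumes Q is the number of correctly predicted test sequences divided by the total number of test sequences, which is a common definition but is not spelled out in the note. The per-fold sizes are those of the Liu et al. testing set, the accuracies are the A + ACC + M + P column, and the function name `overall_accuracy` is illustrative.

```python
# Testing-set sizes per fold (Liu et al. dataset) and per-fold accuracies (%)
# for the A + ACC + M + P feature set, both copied from the tables.
test_sizes = [14, 10, 90, 24, 8, 23, 85, 18, 24, 17, 41, 28, 10, 16, 15,
              92, 5, 36, 16, 73, 36, 40, 33, 6, 39, 99, 41]
fold_accuracy = [71.43, 80.00, 91.11, 75.00, 25.00, 52.17, 90.59, 38.89,
                 58.33, 52.94, 58.54, 60.71, 60.00, 62.50, 46.67, 100.00,
                 20.00, 61.11, 100.00, 89.04, 72.22, 57.50, 90.91, 66.67,
                 61.54, 92.93, 92.68]


def overall_accuracy(accuracies, sizes):
    """Assumed definition: Q = correct predictions / all test sequences."""
    # Recover the integer number of correct predictions in each fold
    # from the rounded percentage and the fold size.
    correct = [round(a / 100 * n) for a, n in zip(accuracies, sizes)]
    return 100 * sum(correct) / sum(sizes)


q = overall_accuracy(fold_accuracy, test_sizes)
print(f"Q = {q:.2f}%")
```

Under this assumption the per-fold values correspond to 736 correct predictions out of 939 test sequences, which agrees with the 78.38% reported in the abstract. Because Q is weighted by fold size, large folds such as (TIM)-barrel and Ferredoxin-like dominate it, while small folds with low accuracy (for example folds 5 and 17) have little effect.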
Figure 2. The architecture of the protein folds identification system.
The previous identification results by an independent test from Ding and Dubchak's dataset (%).
| Author | Classifier | Accuracy |
|---|---|---|
| Ding and Dubchak [ | SVM (All-Versus-All) | 56.0 |
| Chinnasamy et al. [ | Tree-Augmented Naive Bayesian Classifier | 58.2 |
| Shen and Chou [ | OET-KNN | 62.1 |
| Nanni [ | Fusion of classifiers | 61.1 |
| Chen and Kurgan [ | PFRES | 68.4 |
| Guo and Gao [ | GAOEC | 64.7 |
| Damoulas and Girolami [ | Multiclass multikernel | 70.0 |
| Zhang et al. [ | Increment of diversity | 61.1 |
| Ghanty and Pal [ | Fusion of different classifiers | 68.6 |
| Dong et al. [ | ACCFold | 70.1 |
| Shen and Chou [ | PFP-FunDSeqE | 70.5 |
| Yang et al. [ | MarFold | 71.7 |
| Liu et al. [ | SVM | 69.8 |
| This work | Random Forest | 70.24 |
Overall accuracies of structural class using different approaches in the testing set (%).
| Dataset | Author | All α | All β | α/β | α+β | Overall accuracy |
|---|---|---|---|---|---|---|
| Liu and Hu [ | This work | | | | | 92.55 |
| Liu and Hu [ | Liu and Hu [ | 97.04 | 85.43 | 94.07 | 78.21 | 89.24 |
| Ding and Dubchak [ | This work | | | | | |
| Ding and Dubchak [ | Liu and Hu [ | 86.89 | 88.03 | 83.22 | 59.68 | 81.46 |
| Ding and Dubchak [ | Zhang et al. [ | | | | | 79.11 |
| Ding and Dubchak [ | Chinnasamy et al. [ | | | | | 80.52 |