Zhenxing Feng, Xiuzhen Hu, Zhuo Jiang, Hangyu Song, Muhammad Aqeel Ashraf.
Abstract
The recognition of protein folds is an important step in the prediction of protein structure and function. Recently, an increasing number of researchers have sought to improve the methods for protein fold recognition. Since Ding and Dubchak constructed a dataset of 27 protein fold classes in 2001, fold prediction has advanced through improved algorithms, new feature parameters and newly constructed datasets. In this study, we reorganized a dataset of 76 fold classes constructed by Liu et al. and used the values of the increment of diversity, average chemical shifts of secondary structure elements and secondary structure motifs as feature parameters for the recognition of multi-class protein folds. Using the combined feature vector as input to the Random Forests algorithm with an ensemble classification strategy, we propose a novel method to identify the 76 protein fold classes. The overall accuracy on the test dataset in an independent test was 66.69%; when the training and test sets were combined, the overall accuracy with 5-fold cross-validation was 73.43%. This method was further used to predict the test dataset and the corresponding structural classification of the first 27-protein fold class dataset, resulting in overall accuracies of 79.66% and 93.40%, respectively. Moreover, when the training and test sets were combined, the accuracy using 5-fold cross-validation was 81.21%. Additionally, this approach improved the prediction results on the 27-protein fold class dataset constructed by Ding and Dubchak.
Keywords: Average chemical shifts; Multi-class protein folds; Random Forest algorithm; Secondary structure elements; Secondary structure motifs; The increment of diversity
Year: 2015 PMID: 26980999 PMCID: PMC4778582 DOI: 10.1016/j.sjbs.2015.10.008
Source DB: PubMed Journal: Saudi J Biol Sci ISSN: 1319-562X Impact factor: 4.219
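The increment of diversity (ID) named in the abstract and keywords is commonly defined from a diversity measure D(X) = N log N − Σ nᵢ log nᵢ over a count vector X, with ID(X, Y) = D(X + Y) − D(X) − D(Y). The following is a minimal sketch of that definition, not the authors' implementation. The use of amino acid composition as the count vector, and the helper names `diversity`, `increment_of_diversity` and `composition`, are assumptions made for illustration; the record does not state which counts the paper uses.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet for the illustration


def diversity(counts):
    """D(X) = N*log(N) - sum(n_i*log(n_i)), treating 0*log(0) as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two aligned count vectors."""
    if len(x) != len(y):
        raise ValueError("count vectors must have the same length")
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)


def composition(sequence):
    """Amino acid counts of a sequence (an assumed choice of feature)."""
    counts = Counter(sequence)
    return [counts[a] for a in AMINO_ACIDS]


# Hypothetical query and class-level reference sequences.
query = composition("MKVLAAGIVGLLLA")
reference = composition("MKTLLAAGVVALLA")
id_value = increment_of_diversity(query, reference)
```

A small ID means the two count vectors have similar distributions (it is zero when they are proportional), so one ID value per fold class gives a similarity profile for a query protein.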
Datasets of 76 protein fold classes.
| Fold (name) | Ntrain/(Ntest) | Fold (name) | Ntrain/(Ntest) | Fold (name) | Ntrain/(Ntest) |
|---|---|---|---|---|---|
| 1 (GL) | 14/14 | 27 (ITL) | 41/41 | 53 (SM) | 44/44 |
| 2 (CY) | 10/10 | 28 (RCD) | 13/13 | 54 (PT-L) | 31/31 |
| 3 (DB) | 92/90 | 29 (SR) | 13/13 | 55 (PBPI) | 26/26 |
| 4 (HB) | 25/24 | 30 (F-L) | 21/21 | 56 (CD-L) | 7/7 |
| 5 (4HC) | 8/8 | 31 (SD) | 15/14 | 57 (L-L) | 8/8 |
| 6 (EF) | 25/23 | 32 (α-T) | 16/16 | 58 (I-L) | 8/7 |
| 7 (IL) | 86/85 | 33 (CP) | 9/8 | 59 (C-L) | 29/30 |
| 8 (CD) | 18/18 | 34 (α-S) | 32/33 | 60 (U-L) | 9/8 |
| 9 (VCP) | 24/24 | 35 (NRL) | 7/7 | 61 (GRP) | 16/16 |
| 10 (CLL) | 18/17 | 36 (MC) | 9/9 | 62 (C-DP) | 8/9 |
| 11 (SH3) | 41/41 | 37 (CFD) | 14/14 | 63 (TED) | 26/25 |
| 12 (OB) | 29/28 | 38 (C2D) | 9/9 | 64 (DL) | 8/9 |
| 13 (BT) | 11/10 | 39 (GD) | 16/16 | 65 (ETK) | 10/9 |
| 14 (TSP) | 17/16 | 40 (PDL) | 24/25 | 66 (BCM) | 8/9 |
| 15 (LIP) | 16/15 | 41 (AP) | 8/8 | 67 (Z-L) | 12/11 |
| 16 (TIM) | 93/92 | 42 (PDB) | 29/29 | 68 (S-L) | 7/8 |
| 17 (FAD) | 5/5 | 43 (6BP) | 10/9 | 69 (ACN) | 33/32 |
| 18 (FLL) | 37/36 | 44 (7BP) | 8/8 | 70 (PL) | 19/19 |
| 19 (NAD) | 17/16 | 45 (SR-β) | 12/13 | 71 (Nu) | 12/12 |
| 20 (P-L) | 74/73 | 46 (DSH) | 40/40 | 72 (Tbp) | 18/18 |
| 21 (THL) | 37/36 | 47 (β-C) | 8/7 | 73 (DNA) | 11/11 |
| 22 (RHM) | 39/40 | 48 (AN-α) | 13/12 | 74 (PK) | 22/22 |
| 23 (HYD) | 33/33 | 49 (HL) | 25/26 | 75 (NH-L) | 15/15 |
| 24 (PBP) | 6/6 | 50 (RCC) | 9/9 | 76 (CTL) | 12/12 |
| 25 (β-G) | 39/39 | 51 (P/H) | 17/17 | | |
| 26 (FEL) | 101/99 | 52 (P-L) | 12/13 | | |
Note: Ntrain/(Ntest) represents the number of proteins of each fold in the training/(test) dataset.
Full names: (1) globin-like, (2) cytochrome c, (3) DNA-binding 3-helical bundle, (4) 4-helical up-and-down bundle, (5) 4-helical cytokines, (6) EF hand, (7) immunoglobulin-like β-sandwich, (8) cupredoxins, (9) viral coat and capsid proteins, (10) ConA-like lectin/glucanases, (11) SH3-like barrel, (12) OB-fold, (13) β-trefoil, (14) trypsin-like serine proteases, (15) lipocalins, (16) TIM barrel, (17) FAD (also NAD)-binding motif, (18) flavodoxin-like, (19) NAD(P)-binding Rossmann fold, (20) P-loop, (21) thioredoxin-like, (22) ribonuclease H-like motif, (23) hydrolases, (24) periplasmic binding protein-like, (25) β-grasp, (26) ferredoxin-like, (27) small inhibitors/toxins/lectins, (28) RuvA C-terminal domain-like, (29) spectrin repeat-like, (30) ferritin-like, (31) SAM domain-like, (32) α/α toroid, (33) cytochrome P450, (34) α–α superhelix, (35) nuclear receptor ligand-binding domain, (36) multiheme cytochromes, (37) diphtheria toxin/transcription factors/cytochrome f, (38) C2 domain-like, (39) galactose-binding domain-like, (40) PDZ domain-like, (41) acid proteases, (42) PH domain-like barrel, (43) 6-bladed β-propeller, (44) 7-bladed β-propeller, (45) single-stranded right-handed β-helix, (46) double-stranded β-helix, (47) β-clip, (48) adenine nucleotide α hydrolase-like, (49) HAD-like, (50) rhodanese/cell cycle control phosphatase, (51) phosphorylase/hydrolase-like, (52) PRTase-like, (53) S-adenosyl-L-methionine-dependent methyltransferases, (54) PLP-dependent transferase-like, (55) periplasmic binding protein-like II, (56) cytidine deaminase-like, (57) lysozyme-like, (58) IL8-like, (59) cystatin-like, (60) UBC-like, (61) glyoxalase/bleomycin resistance protein/dihydroxybiphenyl dioxygenase, (62) CBS-domain pair, (63) thioesterase/thiol ester dehydrase-isomerase, (64) dsRBD-like, (65) eukaryotic type KH domain (KH-domain type I), (66) Bacillus chorismate mutase-like, (67) zincin-like, (68) SH2-like, (69) acyl-CoA N-acyltransferases (Nat), (70) profilin-like, (71) Nudix, (72) TBP-like, (73) DNA clamp, (74) protein kinase-like (PK-like), (75) Ntn hydrolase-like, and (76) C-type lectin-like.
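The class sizes in the table are strongly unbalanced (for example, fold 17 has 5/5 proteins while fold 16 has 93/92), so the 5-fold cross-validation on the combined training and test sets is sensitive to how proteins are partitioned. The sketch below shows one way to build stratified folds so that every fold class is spread across all five parts. It is an illustration only: the record does not say whether the authors stratified their splits, and the function name `stratified_k_fold` and the round-robin assignment are assumptions.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread over k parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    parts = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members of this class round-robin over the k parts.
        for pos, idx in enumerate(members):
            parts[pos % k].append(idx)

    for i in range(k):
        test = sorted(parts[i])
        train = sorted(idx for j, part in enumerate(parts) if j != i for idx in part)
        yield train, test


# Combined (training + test) sizes for three folds, taken from the table.
sizes = {"GL": 14 + 14, "FAD": 5 + 5, "TIM": 93 + 92}
labels = [name for name, n in sizes.items() for _ in range(n)]
splits = list(stratified_k_fold(labels, k=5))
```

With this scheme, even the smallest class here (FAD, 10 proteins) contributes two proteins to each held-out part, which a purely random split would not guarantee.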
Summary of predicted secondary structure motifs.
| Feature set | Occurrence frequencies of the selected features |
|---|---|
| P1 | “E”, “C” and “H” |
| P2 | “ECE”, “ECH”, “HCH” and “HCE” |
| P3 | “ECECE”, “ECECH”, “ECHCE” and “ECHCH” |
Note: “H” indicates “helix”, “E” indicates “strand”, and “C” indicates “coil”.
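The feature sets in the table above can be illustrated with a short sketch that takes a predicted secondary structure string and counts the listed states and motifs. Several choices here are assumptions, not details from the record: consecutive identical states are collapsed into single segments before motifs are matched, P1 is reported as residue fractions, P2 and P3 are reported as overlapping occurrence counts, and the names `collapse`, `count_overlapping` and `motif_features` are hypothetical.

```python
from itertools import groupby

P1 = ("E", "C", "H")
P2 = ("ECE", "ECH", "HCH", "HCE")
P3 = ("ECECE", "ECECH", "ECHCE", "ECHCH")


def collapse(ss):
    """Reduce runs of identical states to one symbol, e.g. 'EEECCH' -> 'ECH'."""
    return "".join(state for state, _ in groupby(ss))


def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def motif_features(ss):
    """Return P1 residue fractions and P2/P3 motif counts for a string over H/E/C."""
    if not ss:
        raise ValueError("secondary structure string must not be empty")
    segments = collapse(ss)
    features = {state: ss.count(state) / len(ss) for state in P1}
    for motif in P2 + P3:
        features[motif] = count_overlapping(segments, motif)
    return features


# Hypothetical predicted secondary structure for a short protein.
features = motif_features("CEEEECCHHHHCCEEECEEC")
```

Collapsing runs first means a motif such as “ECE” describes a strand, a coil and a strand as segments rather than three adjacent residues, which is the usual reading of segment-level motifs.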
Figure 1. Prediction accuracies for 76 protein fold classes using combinations of different parameters in the test set (%). Note: parameter1: ID, increment of diversity values (76 dimensions); parameter2: ID + ACS, values of the increment of diversity and average chemical shifts of secondary structure elements (220 dimensions); parameter3: ID + ACS + M, values of the increment of diversity, average chemical shifts of secondary structure elements and motif frequency (629 dimensions); parameter4: ID + ACS + M + P, values of the increment of diversity, average chemical shifts of secondary structure elements, motif frequency and predicted secondary structure information (644 dimensions); parameter5: ID + ACS + M + P (5-fold cross-validation), values of the increment of diversity, average chemical shifts of secondary structure elements, motif frequency and predicted secondary structure information (644 dimensions); and Q, the overall accuracy.
Figure 2. Prediction accuracies of 27 protein fold classes using combinations of different parameters. Note: parameter1: ID, increment of diversity values (76 dimensions); parameter2: ID + ACS, values of the increment of diversity and average chemical shifts of secondary structure elements (220 dimensions); parameter3: ID + ACS + M, values of the increment of diversity, average chemical shifts of secondary structure elements and motif frequency (629 dimensions); parameter4: ID + ACS + M + P, values of the increment of diversity, average chemical shifts of secondary structure elements, motif frequency and predicted secondary structure information (644 dimensions); parameter5: ID + ACS + M + P (5-fold cross-validation), values of the increment of diversity, average chemical shifts of secondary structure elements, motif frequency and predicted secondary structure information (644 dimensions); and Q, the overall accuracy. Parameter6 summarizes the results of Liu et al. (2012) using an identical dataset. Parameter7 summarizes our results using the dataset constructed by Ding and Dubchak (2001).
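The figures report per-class accuracies alongside Q, the overall accuracy. A minimal sketch of how the two quantities relate is given below, assuming Q is the fraction of all test proteins assigned to the correct fold, which the record does not define explicitly. The function name `accuracies` and the toy labels are hypothetical.

```python
from collections import Counter


def accuracies(true_labels, predicted_labels):
    """Return (per-class accuracy dict, overall accuracy Q)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_class = {label: correct[label] / totals[label] for label in totals}
    overall = sum(correct.values()) / len(true_labels)
    return per_class, overall


# Toy example with two fold classes of unequal size.
true_folds = ["TIM", "TIM", "TIM", "FAD"]
predicted_folds = ["TIM", "TIM", "FAD", "FAD"]
per_class, q = accuracies(true_folds, predicted_folds)
```

Under this definition Q is weighted by class size, so large folds influence it more than small ones; the unweighted mean of the per-class accuracies can differ noticeably on an unbalanced dataset.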
Identification accuracy using the 27-protein fold class dataset constructed by Ding and Dubchak (%).
| Author | Classifier | Accuracy |
|---|---|---|
| | SVM (all-versus-all) | 56.0 |
| | Tree-augmented naive Bayesian classifier | 58.2 |
| | OET-KNN | 62.1 |
| | Fusion of classifiers | 61.1 |
| | PFRES | 68.4 |
| | GAOEC | 64.7 |
| | Multi-class multi-kernel | 70.0 |
| | Increment of diversity | 61.1 |
| | Fusion of different classifiers | 68.6 |
| | ACCFold | 70.1 |
| | PFP-FunDSeqE | 70.5 |
| Yang and Kecman (2011) | MarFold | 71.7 |
| | SVM | 69.8 |
| Present study | Random Forests | 70.8 |
Overall accuracies of structural class identification using different approaches in the test set (%).
| Dataset | Author | α | β | α/β | α + β | Overall accuracy |
|---|---|---|---|---|---|---|
| | Present study | 95.2 | 92.91 | 97.63 | 84.36 | 93.40 |
| | | 97.04 | 85.43 | 94.07 | 78.21 | 89.24 |
| | Present study | 85.25 | 88.03 | 83.22 | 69.35 | 82.77 |
| | | 86.89 | 88.03 | 83.22 | 59.68 | 81.46 |
| | | | | | | 79.11 |
| | | | | | | 80.52 |