
The recognition of multi-class protein folds by adding average chemical shifts of secondary structure elements.

Zhenxing Feng¹, Xiuzhen Hu¹, Zhuo Jiang¹, Hangyu Song¹, Muhammad Aqeel Ashraf².

Abstract

The recognition of protein folds is an important step in the prediction of protein structure and function. Recently, an increasing number of researchers have sought to improve the methods for protein fold recognition. Following the construction of a dataset consisting of 27 protein fold classes by Ding and Dubchak in 2001, prediction algorithms, parameters and the construction of new datasets have improved for the prediction of protein folds. In this study, we reorganized a dataset consisting of 76-fold classes constructed by Liu et al. and used the values of the increment of diversity, average chemical shifts of secondary structure elements and secondary structure motifs as feature parameters in the recognition of multi-class protein folds. With the combined feature vector as the input parameter for the Random Forests algorithm and ensemble classification strategy, we propose a novel method to identify the 76 protein fold classes. The overall accuracy of the test dataset using an independent test was 66.69%; when the training and test sets were combined, with 5-fold cross-validation, the overall accuracy was 73.43%. This method was further used to predict the test dataset and the corresponding structural classification of the first 27-protein fold class dataset, resulting in overall accuracies of 79.66% and 93.40%, respectively. Moreover, when the training set and test sets were combined, the accuracy using 5-fold cross-validation was 81.21%. Additionally, this approach resulted in improved prediction results using the 27-protein fold class dataset constructed by Ding and Dubchak.


Keywords: Average chemical shifts; Multi-class protein folds; Random Forest algorithm; Secondary structure elements; Secondary structure motifs; The increment of diversity

Year: 2015 PMID: 26980999 PMCID: PMC4778582 DOI: 10.1016/j.sjbs.2015.10.008

Source DB: PubMed Journal: Saudi J Biol Sci ISSN: 1319-562X Impact factor: 4.219

Introduction

The large number of protein sequences generated in the post-genomic era has challenged researchers to develop high-throughput computational methods to structurally annotate these sequences. The protein fold reflects a key topological structure in proteins, as it contains three major aspects of protein structure: units of secondary structure, the relative arrangement of structures, and the overall relationship of protein peptide chains (Martin et al., 1998, Ming et al., 2015). The proper spatial structure of a protein is highly correlated with its physiological functions. Abnormal protein folding may cause various diseases, for example, neurodegenerative diseases such as Alzheimer’s disease, spongiform encephalopathy, Parkinson’s disease and mad cow disease. Thus, the correct identification of protein folds can be valuable for studies on pathogenic mechanisms and drug design (Thomas et al., 1995, Christopher and Michelle, 2004, Krishna and Grishin, 2005, Lindquist et al., 2001, Scheibel et al., 2004, Ma et al., 2002, Ma and Lindquist, 2002) and represents an important topic in bioinformatics. Ding and Dubchak (2001) constructed a dataset consisting of 27 protein fold classes using multiple feature parameters, including amino acid composition, predicted secondary structure, etc., and proposed support vector machines and neural network methods to predict the 27 protein fold classes, achieving an overall accuracy of 56.0%. Subsequently, using the dataset constructed by Ding and Dubchak and identical feature parameters, several studies have suggested algorithmic improvements for protein fold identification. For example, Chinnasamy et al. (2005) introduced the phylogenetic tree and Bayes classifier for the identification of protein folds and achieved an overall accuracy of 58.2%. Nanni (2006) proposed a new ensemble of K-local hyperplanes based on random subspace and feature selection, achieving an overall accuracy of 61.1%. 
Guo and Gao (2008) presented a novel hierarchical ensemble classifier termed GAOEC (genetic-algorithm optimized ensemble classifier) and achieved an overall accuracy of 64.7%. Damoulas and Girolami (2008) proposed the kernel combination methodology for the prediction of protein folds and achieved an accuracy of 70%. Lin et al. (2013) exploited novel techniques to impressively increase the accuracy of protein fold classification. Additional studies have suggested the selection of feature parameters to predict protein folds. For example, Shamim et al. (2007) used the structural properties of amino acid residues and amino acid residue pairs and achieved an overall accuracy of 65.2%. Dong et al. (2009) proposed a method termed ACCFold and achieved an overall accuracy of 70.1%. Nanni et al. (2010) proposed a method to extract features from the 3D structure and achieved significant improvement; however, this method does not solely rely on protein primary sequences to predict protein folds. Li et al. (2013) proposed a method termed PFP-RFSM and obtained improved results for protein fold identification. Numerous studies have not only focused on the selection of feature parameters but also on the improvement of algorithms to identify protein folds. For example, Zhang et al. (2009) proposed an approach that utilizes the increment of diversity by selecting the pseudo amino acid composition, position weight matrix score, etc., and used these parameters to predict the 27 protein fold classes, with an overall accuracy of 61.1%. Shen and Chou (2006) applied the OET-KNN ensemble classifier to identify folds by introducing pseudo amino acids with sequential order information as a feature parameter and achieved an overall accuracy of 62.1%. Chen and Kurgan (2007) proposed the PFRES method using evolutionary information and predicted secondary structure, obtaining an accuracy of 68.4%. 
Ghanty and Pal (2009) proposed the fusion of heterogeneous classifiers approach, with features including the selected trio AACs and trio potential, and the overall recognition accuracy was 68.6%. Shen and Chou (2009) applied an identification method to protein folds using functional domain and sequential evolution information and achieved an overall accuracy of 70.5%. Yang and Kecman (2011) proposed a novel ensemble classifier termed MarFold, which combines three margin-based classifiers for protein fold recognition, and the overall prediction accuracy was 71.7%. Additional studies have constructed and analyzed new 27-fold class datasets. For example, with a sequence identity less than 40%, Mohammad et al. (2007) constructed a dataset composed of 2554 proteins belonging to 27-fold classes, proposed structural properties of amino acid residues and amino acid residue pairs as parameters, and achieved an overall accuracy of 70.5% using 5-fold cross-validation. With sequence identity below 40%, Dong et al. (2009) constructed a 27-fold class dataset (containing 3202 sequences), proposed the ACCFold method, and obtained an overall accuracy of 87.6% using 5-fold cross-validation. Liu and Hu (2010) constructed a new 27-fold class dataset according to the construction of the Ding and Dubchak dataset (2001). This new dataset contains 1895 sequences with a sequence identity below 35%. Motif frequency, low-frequency power spectral density, amino acid composition, predicted secondary structure, and autocorrelation function values were combined as the set of feature parameters. Using the SVM algorithm and the ensemble classification strategy, the overall accuracy in the independent test was 66.67%. Moreover, studies on datasets consisting of 76, 86, and 199 fold classes have demonstrated improvements (Liu et al., 2012, Dong et al., 2009). In this study, we reorganized the dataset constructed by Liu et al. (2012). 
According to the biological characteristics, values of the increment of diversity, motif frequency, predicted secondary structure motifs and the average chemical shift information of predicted secondary structure elements were extracted as feature parameters. Based on the ensemble classification strategy, these combined features were used as the input parameter for the Random Forests algorithm. An independent test and 5-fold cross-validation were used to predict the 76 protein fold classes, which resulted in good protein fold identification. The protein folds of the 27-fold class dataset and the corresponding structural classes were also identified, yielding improved results.

Materials and methods

Protein fold dataset

The 76-fold class dataset constructed by Liu et al. (2012) was reorganized; 8 and 5 protein sequences were added to the training and test sets, respectively. The resulting training set contains 1744 protein chains and the test set contains 1726 protein chains, in agreement with the column totals of Table 1. The sequence identity of the dataset was below 35%. Each fold class is represented by 10 or more sequences across the training and test sets combined. The distribution of the corresponding fold names and sequence numbers is shown in Table 1. The 76-fold class dataset is available at http://202.207.29.245:8080/Ha/HomePage/fzxHomePage.jsp.

Table 1

Datasets of 76 protein fold classes.

Fold (name)	Ntrain/(Ntest)	Fold (name)	Ntrain/(Ntest)	Fold (name)	Ntrain/(Ntest)
1 (GL)	14/14	27 (ITL)	41/41	53 (SM)	44/44
2 (CY)	10/10	28 (RCD)	13/13	54 (PT-L)	31/31
3 (DB)	92/90	29 (SR)	13/13	55 (PBPI)	26/26
4 (HB)	25/24	30 (F-L)	21/21	56 (CD-L)	7/7
5 (4HC)	8/8	31 (SD)	15/14	57 (L-L)	8/8
6 (EF)	25/23	32 (α-T)	16/16	58 (I-L)	8/7
7 (IL)	86/85	33 (CP)	9/8	59 (C-L)	29/30
8 (CD)	18/18	34 (α-S)	32/33	60 (U-L)	9/8
9 (VCP)	24/24	35 (NRL)	7/7	61 (GRP)	16/16
10 (CLL)	18/17	36 (MC)	9/9	62 (C-DP)	8/9
11 (SH3)	41/41	37 (CFD)	14/14	63 (TED)	26/25
12 (OB)	29/28	38 (C2D)	9/9	64 (DL)	8/9
13 (BT)	11/10	39 (GD)	16/16	65 (ETK)	10/9
14 (TSP)	17/16	40 (PDL)	24/25	66 (BCM)	8/9
15 (LIP)	16/15	41 (AP)	8/8	67 (Z-L)	12/11
16 (TIM)	93/92	42 (PDB)	29/29	68 (S-L)	7/8
17 (FAD)	5/5	43 (6BP)	10/9	69 (ACN)	33/32
18 (FLL)	37/36	44 (7BP)	8/8	70 (PL)	19/19
19 (NAD)	17/16	45 (SR-β)	12/13	71 (Nu)	12/12
20 (P-L)	74/73	46 (DSH)	40/40	72 (Tbp)	18/18
21 (THL)	37/36	47 (β-C)	8/7	73 (DNA)	11/11
22 (RHM)	39/40	48 (AN-α)	13/12	74 (PK)	22/22
23 (HYD)	33/33	49 (HL)	25/26	75 (NH-L)	15/15
24 (PBP)	6/6	50 (RCC)	9/9	76 (CTL)	12/12
25 (β-G)	39/39	51 (P/H)	17/17
26 (FEL)	101/99	52 (P-L)	12/13

Note: Ntrain/(Ntest) represents the number of protein sequences of each fold in the training/(test) dataset.

Full names: (1) globin-like, (2) cytochrome c, (3) DNA-binding 3-helical bundle, (4) 4-helical up-and-down bundle, (5) 4-helical cytokines, (6) EF hand, (7) immunoglobulin-like β-sandwich, (8) cupredoxins, (9) viral coat and capsid proteins, (10) ConA-like lectin/glucanases, (11) SH3-like barrel, (12) OB-fold, (13) β-trefoil, (14) trypsin-like serine proteases, (15) lipocalins, (16) TIM barrel, (17) FAD (also NAD)-binding motif, (18) flavodoxin-like, (19) NAD(P)-binding Rossmann fold, (20) P-loop, (21) thioredoxin-like, (22) ribonuclease H-like motif, (23) hydrolases, (24) periplasmic binding protein-like, (25) β-grasp, (26) ferredoxin-like, (27) small inhibitors/toxins/lectins, (28) RuvA C-terminal domain-like, (29) spectrin repeat-like, (30) ferritin-like, (31) SAM domain-like, (32) α/α toroid, (33) cytochrome P450, (34) α–α superhelix, (35) nuclear receptor ligand-binding domain, (36) multiheme cytochromes, (37) diphtheria toxin/transcription factors/cytochrome f, (38) C2 domain-like, (39) galactose-binding domain-like, (40) PDZ domain-like, (41) acid proteases, (42) PH domain-like barrel, (43) 6-bladed β-propeller, (44) 7-bladed β-propeller, (45) single-stranded right-handed β-helix, (46) double-stranded β-helix, (47) β-clip, (48) adenine nucleotide α hydrolase-like, (49) HAD-like, (50) rhodanese/cell cycle control phosphatase, (51) phosphorylase/hydrolase-like, (52) PRTase-like, (53) S-adenosyl-l-methionine-dependent methyltransferases, (54) PLP-dependent transferase-like, (55) periplasmic binding protein-like II, (56) cytidine deaminase-like, (57) lysozyme-like, (58) IL8-like, (59) cystatin-like, (60) UBC-like, (61) glyoxalase/bleomycin resistance protein/dihydroxybiphenyl dioxygenase, (62) CBS-domain pair, (63) thioesterase/thiol ester dehydrase-isomerase, (64) dsRBD-like, (65) eukaryotic type KH domain (KH-domain type I), (66) Bacillus chorismate mutase-like, (67) zincin-like, (68) SH2-like, (69) acyl-CoA N-acyltransferases (Nat), (70) profilin-like, (71) Nudix, (72) TBP-like, (73) DNA clamp, (74) protein kinase-like (PK-like), (75) Ntn hydrolase-like, and (76) C-type lectin-like.

The first 27 types of the 76 protein fold classes correspond to the dataset of Ding and Dubchak (2001), and each type of fold has been expanded. The number of sequences in the dataset is threefold greater than that of the Ding and Dubchak dataset. The second dataset used in this study was constructed by Ding and Dubchak. The previously used dataset, with sequence identity below 35%, contained a training set that included 311 protein chains and a test set that included 383 protein chains.

The selection of feature parameters

Increment of diversity (ID)

The ID algorithm has been successfully used in the classification of protein structure and subcellular localization (Chen and Li, 2007). The ID can be used as a classification prediction algorithm and can extract characteristics of the sequence as parameters of the classification prediction. In a state space of k dimensions, $m_i$ indicates the absolute frequency of the ith state. The diversity measure for the diversity source S: {$m_1, m_2, \ldots, m_k$} is defined as follows:

$$D(S) = D(m_1, m_2, \ldots, m_k) = M \log M - \sum_{i=1}^{k} m_i \log m_i$$

Here, $M = \sum_{i=1}^{k} m_i$. In this state space, the ID between the sources of diversity X: {$n_1, n_2, \ldots, n_k$} and Y: {$m_1, m_2, \ldots, m_k$} is defined as follows:

$$ID(X, Y) = D(X + Y) - D(X) - D(Y)$$

where D(X + Y), which is termed the combination diversity source space, is the measure of the diversity of the sum of two diversity sources. The ID is used to measure the similarity level between two diversity sources. If X is similar to Y, then the value of ID(X, Y) will be small; in particular, if X = Y, then ID(X, Y) = 0. Considering the local conservation of fold sequences, the sequence of each protein fold was divided into n segments, and in each segment, the occurrence frequencies of the 20 amino acid residues in the protein sequences were extracted as a parameter, as previously described (Chen and Li, 2007, Wang et al., 2014). Thus, the initial parameter of each sequence was converted into a 20*n-dimensional vector that was inputted into the ID algorithm for classification, and an improved result was obtained. Following substantial iterative calculations, a relatively better result was obtained when each protein sequence was divided into 10 segments. Therefore, we selected a 200-dimensional vector as the initial parameter for input into the ID algorithm and obtained 76 ID values for each sequence, one for each fold class.
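As a minimal sketch of the ID feature, the Python below assumes the usual definitions D(S) = M log M − Σ m_i log m_i and ID(X, Y) = D(X + Y) − D(X) − D(Y), with natural logarithms and 0 log 0 taken as 0. The function names and the equal-width segmentation rule are illustrative assumptions and are not taken from the study.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def diversity(counts):
    """D(S) = M*log(M) - sum(m_i*log(m_i)); zero counts contribute nothing."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(m * math.log(m) for m in counts if m > 0)

def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y) for two count vectors of equal length."""
    combined = [a + b for a, b in zip(x, y)]
    return diversity(combined) - diversity(x) - diversity(y)

def segment_composition(sequence, n_segments=10):
    """Residue counts per segment (assumed equal-width split): 20 * n_segments values."""
    vector = []
    for s in range(n_segments):
        start = s * len(sequence) // n_segments
        end = (s + 1) * len(sequence) // n_segments
        counts = Counter(sequence[start:end])
        vector.extend(counts.get(aa, 0) for aa in AMINO_ACIDS)
    return vector
```

Under this reading, the 76 ID values of a sequence would come from comparing its 200-dimensional count vector with one reference count vector per fold class, for example the sum of the training-set vectors of that fold; that construction is an assumption here.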

Average chemical shift (ACS)

Several studies have noted that the ACS of a particular nucleus in the protein backbone correlates well with its secondary structure (Sibley and Cosman, 2003, Zhao et al., 2010). Mielke and Krishnan (2003, 2004, 2009) have presented a CS-based empirical approach to predict secondary structure and the protein structural class. Arai et al. (2010) have predicted the protein structural class using 1H–15N HSQC spectra. Moreover, CS information has been used to improve the prediction quality for various protein subcellular localizations (Fan and Li, 2012a, Fan and Li, 2012b). These results suggest that CS information can be regarded as an important parameter in the prediction of protein folds. Chemical shift values corresponding to the protein backbone atoms were obtained from the BMRB (http://www.bmrb.wisc.edu) (Seavey et al., 1991). The online web server PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/) was used to obtain the predicted secondary structure of each protein sequence in the 76-protein fold class dataset. We calculated the ACS using a previously described method (Mielke and Krishnan, 2003, Fan and Li, 2012a, Fan and Li, 2012b, Fan et al., 2013, Fan and Li, 2013, Anaika et al., 2003). We selected chemical shift values of 1Hα and 1HN (two types of protein backbone atoms for every amino acid residue of protein sequence P) to calculate the corresponding ACS. Subsequently, each amino acid in the sequence was replaced by its ACS. Following iterative calculations, we selected the averaged chemical shifts of 1Hα and 1HN, which were more suitable for predicting protein folds. Protein sequence P is thus expressed as the ordered series of the ACS values of its residues.

The auto cross covariance (ACC) (Wold et al., 1993) has been successfully adopted for the prediction of protein folds (Dong et al., 2009, Qi et al., 2015), G-proteins (Guo et al., 2006, Wen et al., 2007), protein interactions (Guo et al., 2008), and β-hairpins (Jun et al., 2010). However, the ACC has primarily been used to study interactions between residues or bases. We are the first to use the ACC at the level of predicted secondary structure elements (helix, strand, or coil) for protein fold prediction (Xinghui et al., 2015). The ACC contains two types of variables: the AC variable measures the correlation between identical properties (i.e., an identical secondary structure element) and the CC variable measures the correlation between different properties. Given the corresponding predicted secondary structure elements (helix, strand, or coil) in one sequence, AC variables describe the average interactions between identical predicted secondary structure elements, and the separation distance between two predicted secondary structure elements is given by lg elements. For example, if two secondary structure elements are neighboring, then lg = 1; if the two secondary structure elements are next-to-neighboring, then lg = 2, etc. The AC variables were redefined and calculated according to Eq. (4), as follows:

$$AC(i, lg) = \frac{1}{L - lg} \sum_{j=1}^{L - lg} \left(S_{i,j} - \bar{S}_i\right)\left(S_{i,j+lg} - \bar{S}_i\right) \qquad (4)$$

Here, $\bar{S}_i = \frac{1}{L} \sum_{j=1}^{L} S_{i,j}$, where i represents a secondary structure element (helix, strand, or coil), L is the number of secondary structure elements in the protein sequence, and $S_{i,j}$ is a feature value of secondary structure element i at position j. $\bar{S}_i$ is the average value for the secondary structure element i along the entire sequence (Zhang et al., 2014). Given the ACS values for the 20 amino acid residues in a sequence, the secondary structure element i contains m residues, and $S_{i,j}$ represents the summation of the ACS values for these m residues. CC variables were redefined and calculated according to Eq. (5), as follows:

$$CC(i_1, i_2, lg) = \frac{1}{L - lg} \sum_{j=1}^{L - lg} \left(S_{i_1,j} - \bar{S}_{i_1}\right)\left(S_{i_2,j+lg} - \bar{S}_{i_2}\right) \qquad (5)$$

where $i_1$ and $i_2$ are two different types of secondary structure elements (helix, strand, or coil), and $S_{i_1,j}$ is a feature value of secondary structure element $i_1$ at position j. $\bar{S}_{i_1}$ ($\bar{S}_{i_2}$) is the average value for secondary structure element $i_1$ ($i_2$) along the entire sequence (Li et al., 2015). The dimension of CC variables is 3*2*lg. The ACC is the summation of variables AC and CC. Following substantial calculations and a comparison of the prediction results, the optimal maximal value of lg was selected as 8 in this study (Zhiwei et al., 2015).
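The element-level ACC can be sketched in Python as follows. This is only an illustration under stated assumptions: the covariance form is the standard ACC one (products of mean-centred values at separation lg, averaged over L − lg positions), and the encoding of the feature value (the summed ACS of the element at position j when that element has type i, and 0 otherwise) is a guess at a detail the text does not spell out. The names `to_elements` and `acc_features` are hypothetical.

```python
from itertools import groupby

TYPES = "HEC"  # helix, strand, coil

def to_elements(ss, acs):
    """Collapse a per-residue secondary structure string into (type, summed ACS) elements."""
    elements, pos = [], 0
    for state, run in groupby(ss):
        n = len(list(run))
        elements.append((state, sum(acs[pos:pos + n])))
        pos += n
    return elements

def acc_features(elements, max_lg=8):
    """Return AC (3 * max_lg values) followed by CC (3 * 2 * max_lg values)."""
    L = len(elements)
    if L == 0:
        return [0.0] * (len(TYPES) ** 2 * max_lg)
    # Assumed encoding: S[t][j] is the element value if element j has type t, else 0.
    S = {t: [v if kind == t else 0.0 for kind, v in elements] for t in TYPES}
    mean = {t: sum(S[t]) / L for t in TYPES}

    def cov(t1, t2, lg):
        if L <= lg:  # separation longer than the element chain: no pairs to average
            return 0.0
        pairs = ((S[t1][j] - mean[t1]) * (S[t2][j + lg] - mean[t2]) for j in range(L - lg))
        return sum(pairs) / (L - lg)

    lags = range(1, max_lg + 1)
    ac = [cov(t, t, lg) for t in TYPES for lg in lags]
    cc = [cov(t1, t2, lg) for t1 in TYPES for t2 in TYPES if t1 != t2 for lg in lags]
    return ac + cc
```

With a maximal lg of 8 this yields 24 AC and 48 CC values, that is, 72 values per chemical shift type.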

Motif information (M)

A motif is a locally conserved region in a protein during evolution (Ben-Hur and Brutlag, 2003) that is often related to biological function. For example, some motifs are related to DNA binding sites and enzyme catalytic sites (Wang et al., 2003). As feature parameters, motif information has been successfully applied for the prediction of superfamilies, protein folds, etc. (Ben-Hur and Brutlag, 2003, Liu et al., 2012, Wang et al., 2014). Two types of motifs were used in this study: motifs with a biological function obtained by searching the existing functional motif database PROSITE (de Castro et al., 2009) and statistical motifs that were obtained using MEME (http://meme.nbcr.net/meme/cgi-bin/meme.cgi). Motif information (M) includes functional and statistical motifs.

Functional motif

The PROSITE database was used to obtain protein sequence patterns with notable biological functions. The PS_SCAN package provided by the PROSITE database was run with Perl as a motif-scan tool to search the sequences of the 76-fold class training set, and 181 functional motifs were selected. For an arbitrary sequence in the dataset, the frequencies of different motifs in the sequence were recorded. If a motif occurs once, the corresponding frequency value was recorded as “1”; if the motif occurs twice, the value was recorded as “2”, etc.; if the motif is absent, the corresponding frequency value was recorded as “0”. Thus, the frequencies of different functional motifs in a protein sequence were converted into a 181-dimensional vector.

Statistical motif

For statistical motifs, MEME was applied as the motif-scan tool (Bailey et al., 2006). The motifs with the three highest frequencies were selected. Each motif contained 6–10 amino acid residues; thus, 228 motifs (three for each of the 76 fold classes) were obtained and selected from the 76-fold class training set. For an arbitrary sequence in the dataset, if a motif occurs once, the frequency value was recorded as “1”; if the motif occurs twice, the value was recorded as “2”, etc.; if the motif is absent, the corresponding frequency value was recorded as “0”. Thus, frequencies of different statistical motifs in a protein sequence were converted into a 228-dimensional vector.
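The motif frequency encoding described above can be illustrated with a short Python sketch. It is not the PS_SCAN or MEME tooling itself: the helper names are hypothetical, the patterns in the usage line are toy examples rather than real PROSITE entries, the converter handles only the basic PROSITE pattern syntax, and counting every start position (so that overlapping matches are counted) is an assumption.

```python
import re

def prosite_to_regex(pattern):
    """Convert basic PROSITE pattern syntax into a Python regular expression."""
    regex = pattern.replace("-", "")                      # element separators
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} means any residue except P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4) becomes .{2,4}
    regex = regex.replace("x", ".")                       # x means any residue
    regex = regex.replace("<", "^").replace(">", "$")     # sequence-end anchors
    return regex

def motif_vector(sequence, patterns):
    """Occurrence count of each motif in the sequence: 0 if absent, 1, 2, ... otherwise."""
    vector = []
    for pattern in patterns:
        lookahead = "(?=(?:%s))" % prosite_to_regex(pattern)
        vector.append(len(re.findall(lookahead, sequence)))
    return vector

# Toy usage with made-up patterns.
example = motif_vector("ACCACC", ["C-C", "A-x-C", "W"])
```

Applying `motif_vector` with the 181 functional patterns and then with the 228 statistical motifs would give the two count vectors that make up M.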

Predicted secondary structure motifs (P)

Because the protein fold is a description based on the secondary structure, the formation of secondary structure from the sequence influences the folding of the protein. We extracted the occurrence frequencies of three types of predicted secondary structure motifs (P1) from previous studies (Shen and Chou, 2006, Chen and Kurgan, 2007, Yang et al., 2011) as feature parameters, resulting in a 3-dimensional vector. The occurrence frequencies of four types of supersecondary motifs (P2) were subsequently extracted as feature parameters, resulting in a 4-dimensional vector. Finally, the occurrence frequencies of complex supersecondary motifs (P3) were extracted as parameters (Pi represents the three feature sets, with i = 1, 2, or 3). Thus, the frequencies of secondary structure motifs, supersecondary motifs, and complex supersecondary motifs were converted into a 15-dimensional vector represented by P. The online web server PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/) was used to obtain the predicted secondary structure of each protein sequence. The three feature sets are provided in Table 2.

Table 2

Summary of predicted secondary structure motifs.

Feature set	Occurrence frequencies of the selected features
P1	“E”, “C” and “H”
P2	“ECE”, “ECH”, “HCH” and “HCE”
P3	“ECECE”, “ECECH”, “ECHCE”, “ECHCH”, “HCECE”, “HCECH”, “HCHCE” and “HCHCH”

Note: “H” indicates “helix”, “E” indicates “strand”, and “C” indicates “coil”.
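The three feature sets in Table 2 can be turned into a 15-dimensional vector along the following lines. This Python sketch makes two assumptions that the text does not state: motifs are counted on the element-level string obtained by collapsing runs of identical states in the PSIPRED output, and raw counts (including overlapping occurrences) are used rather than normalised frequencies. The function name is hypothetical.

```python
from itertools import groupby

P1 = ["E", "C", "H"]
P2 = ["ECE", "ECH", "HCH", "HCE"]
P3 = ["ECECE", "ECECH", "ECHCE", "ECHCH", "HCECE", "HCECH", "HCHCE", "HCHCH"]

def ss_motif_vector(ss):
    """Count the 3 + 4 + 8 motifs of Table 2 in a predicted secondary structure string."""
    # Collapse runs such as "EEEE" into a single element "E".
    collapsed = "".join(state for state, _ in groupby(ss))

    def count(motif):
        # Count every start position, so overlapping occurrences are included.
        return sum(collapsed.startswith(motif, i) for i in range(len(collapsed)))

    return [count(motif) for motif in P1 + P2 + P3]

# "CCEEEECCHHHHCCEEECC" collapses to "CECHCEC".
vector = ss_motif_vector("CCEEEECCHHHHCCEEECC")
```

For this example string the vector records two strands, four coils and one helix, one occurrence each of “ECH” and “HCE”, and one occurrence of “ECHCE”.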

Random Forests

Random Forests is a classification algorithm developed by Leo Breiman (2001). The general idea of the algorithm is that multiple weak classifiers are combined into a single strong classifier. Random Forests uses a collection of multiple decision trees, in which each decision tree acts as a classifier, and the final predictions are made by the majority vote of the trees. The advantages of Random Forests include (1) few parameters to adjust and (2) no requirement for data preprocessing. Random Forests uses two important parameters: (1) the number of feature parameters considered at each split of a node in a single decision tree, which is represented by m (m = $\sqrt{M}$, where M is the total number of features that were initially selected), and (2) the number of decision trees, which is represented by k (in this study, k = 1000). The Random Forests algorithm has been successfully used in the prediction of antifreeze proteins (Kandaswamy et al., 2011), DNA-binding residues (Wang et al., 2009), the metabolic syndrome status and β-hairpins (Jia and Hu, 2011). The Random Forests algorithm was applied using R-2.15.1 software (http://www.r-project.org/) and the randomForest program package.
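The two parameters and the voting step can be made concrete with a small Python sketch. It does not reproduce the R package used in the study: the helper names are hypothetical, m is taken as the integer part of the square root of M (the common default for classification forests), and 644 is the dimension of the full combined feature vector reported for this dataset.

```python
import math
from collections import Counter

def features_per_split(total_features):
    """m: number of candidate features tried at each split, taken as floor(sqrt(M))."""
    return int(math.sqrt(total_features))

def majority_vote(tree_predictions):
    """Final class label: the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

m = features_per_split(644)                  # 644 combined features in this study
label = majority_vote([3, 7, 3, 16, 3])      # toy votes from five trees
```

With M = 644 this gives m = 25 candidate features per split; in the study the vote would be taken over k = 1000 trees rather than five.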

Results and discussion

Comparison using different parameters

For the 76-fold class dataset, ID, M, P, and ACS values were extracted as feature parameters, and the combined feature vector was used as the input for the Random Forests algorithm. The overall accuracy for the test set was 66.69% using an independent test (Fig. 1). Because some features and combinations of features may contribute more to the accuracy than others, we also tested the effectiveness of the individual features and their various systematic combinations, together with the detailed accuracies for each fold. We then combined the test set with the training set, as previously described (Lin et al., 2013, Shamim et al., 2007, Ghanty and Pal, 2009), and the overall accuracy was 73.43% using 5-fold cross-validation. The identification results from the gradual addition of relevant feature parameters are summarized in Fig. 1.
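The two evaluation protocols can be sketched in Python as follows. The fold assignment is an assumption: the text does not say how the combined set was partitioned, so the sketch uses a stratified split that spreads each fold class across the five parts. The function names and the fixed random seed are illustrative only.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so that each class is spread across the folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def overall_accuracy(true_labels, predicted_labels):
    """Q: percentage of sequences whose fold class is predicted correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)
```

In each of the five rounds, one fold would serve as the test part and the remaining four as the training part, and Q would be computed over the pooled predictions.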

Figure 1

Prediction accuracies for 76 protein fold classes using combinations of different parameters in the test set (%). Note: parameter1: ID, increment of diversity values (76 dimensions); parameter2: ID + ACS, values of the increment of diversity and average chemical shifts of secondary structure elements (220 dimensions); parameter3: ID + ACS + M, values of the increment of diversity, average chemical shifts of secondary structure elements and motif frequency (629 dimensions); parameter4: ID + ACS + M + P, values of the increment of diversity, average chemical shifts of secondary structure elements, motif frequency and predicted secondary structure information (644 dimensions); parameter5: ID + ACS + M + P (5-fold cross-validation), values of the increment of diversity, average chemical shifts of secondary structure elements, motif frequency and predicted secondary structure information (644 dimensions); and Q, the overall accuracy.
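The dimensions quoted in the caption can be checked against the feature definitions with a few lines of Python. The only inferred quantity is the ACS block: 144 dimensions is what remains after subtracting the 76 ID values from 220, and it equals an ACC vector with a maximal lg of 8 computed for two chemical shift types. That decomposition is a reading consistent with the numbers, not a statement taken from the text.

```python
MAX_LG = 8      # maximal separation between secondary structure elements
N_TYPES = 3     # helix, strand, coil
N_NUCLEI = 2    # two backbone chemical shift types (inferred, see above)

id_dim = 76                                   # one ID value per fold class
ac_dim = N_TYPES * MAX_LG                     # 3 * lg
cc_dim = N_TYPES * (N_TYPES - 1) * MAX_LG     # 3 * 2 * lg
acs_dim = N_NUCLEI * (ac_dim + cc_dim)
motif_dim = 181 + 228                         # functional + statistical motifs
p_dim = 3 + 4 + 8                             # P1 + P2 + P3

cumulative = [
    id_dim,
    id_dim + acs_dim,
    id_dim + acs_dim + motif_dim,
    id_dim + acs_dim + motif_dim + p_dim,
]
```

The resulting cumulative dimensions are 76, 220, 629 and 644, matching parameter1 to parameter4.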

When only the ID values, which reflect the local conservation of fold sequences, were used as the feature parameter in the independent test, the overall accuracy was 26.59%. Following the addition of the ACSs of secondary structure elements, the overall accuracy increased to 57.01% (an increase of 30.42 percentage points). The accuracies for folds 2, 4, 6, etc., increased by more than 50%, and the accuracies of folds 1, 3, 11, etc., increased by approximately 30%. The accuracies of the remaining folds also improved to varying extents. Thus, the ACSs of secondary structure elements substantially affected the identification of protein folds, and they provided better accuracies than the other individual features. Given the specific biological background of protein folds, the proposed feature parameter of ACSs of secondary structure elements was very suitable for predicting the 76 fold classes. Upon the addition of motif frequency information to the ID values and the ACSs of secondary structure elements, the overall accuracy increased to 63.19%, an increase of 6.18 percentage points. During this process, the accuracies of folds 2, 10, 14, 40, 49, 50, 60 and 71 substantially increased. Furthermore, the individual feature of motif frequency information, which reflects the function and structure information of folds, performed very well on these folds. Examination of these folds indicates that the local conservation of their sequences is better than that of other fold classes, and that their sensitivity to motif frequency information is higher. Finally, the addition of the predicted secondary structure motifs, which influence the spatial folding of the protein, resulted in an overall accuracy of 66.69%, and the prediction accuracies of various folds were further improved, resulting in the best overall accuracy among the incremental combinations (Fig. 1). 
However, the combination of the ACSs of secondary structure elements, motif frequency information and the predicted secondary structure motifs, without the ID values, gave an overall accuracy of 66.74%, which is only 0.05 percentage points higher than that of the full four-feature combination. Overall, as relevant feature parameters were gradually added, the accuracies of a majority of the folds improved to varying extents. The great majority of feature combinations provided better accuracies than the individual features. Thus, the combined feature parameters were effective in predicting the 76 fold classes. For an additional comparison, we combined the training and test sets as previously described (Lin et al., 2013, Shamim et al., 2007, Ghanty and Pal, 2009), and the corresponding prediction results using 5-fold cross-validation are summarized in Fig. 1. The overall prediction accuracy using 5-fold cross-validation reached 73.43%, which is 6.74 percentage points higher than that of the independent test. In addition to the 76-protein fold class dataset, the previous results of Liu et al. (2012) using an independent test are also summarized for comparison. Note that the overall accuracy using an independent test was 21.77% higher than that of Liu et al. (2012). Overall, the results of the 76-protein fold class prediction are encouraging. However, the prediction results for folds 17, 48, 57, 66 and 67 were poor, indicating that future studies are necessary. The web server for protein fold prediction is accessible to the public (http://202.207.29.245:8080/Ha/HomePage/fzxHomePage.jsp).

Comparison with predictions using the 27-fold class dataset

To evaluate the efficiency of our method, the first 27 fold classes in the 76-fold class dataset and the dataset constructed by Ding and Dubchak (2001) were also evaluated using identical feature parameters, classification strategy, and algorithm. Overall accuracies of 79.66% and 70.76%, respectively, were achieved for the two datasets using an independent test (Fig. 2). Moreover, we combined the training and test sets of the first 27 fold classes in the 76-fold class dataset and achieved an overall accuracy of 81.21% (which is higher than that of the independent test) using 5-fold cross-validation. The identification results from the gradual addition of relevant feature parameters are summarized in Fig. 2. We also tested the effectiveness of the individual features and their various systematic combinations, together with the detailed accuracies for each fold.

Figure 2

Prediction accuracies of 27 protein fold classes using combinations of different parameters. Note: parameter1: ID, increment of diversity values (76 dimensions); parameter2: ID + ACS, values of the increment of diversity and average chemical shifts of secondary structure elements (220 dimensions); parameter3: ID + ACS + M, values of the increment of diversity, average chemical shifts of secondary structure elements and motif frequency (629 dimensions); parameter4: ID + ACS + M + P, values of the increment of diversity, average chemical shifts of secondary structure elements, motif frequency and predicted secondary structure information (644 dimensions); parameter5: ID + ACS + M + P (5-fold cross-validation), values of the increment of diversity, average chemical shifts of secondary structure elements, motif frequency and predicted secondary structure information (644 dimensions); Q, the overall accuracy. The parameter6 summarizes the results of Liu et al. (2012) using an identical dataset. The parameter7 summarizes our results using the dataset constructed by Ding and Dubchak (2001).

Using the identical dataset and test method, the overall accuracy was 13% higher than that of Liu et al. (2012) (Fig. 2), and the prediction using 5-fold cross-validation was superior. The previous results for the Ding and Dubchak dataset are also summarized in Table 3 for comparison. Our accuracy was slightly lower than the best result, that of Yang and Kecman (2011), but higher than the other previously reported accuracies (Table 3).

Table 3

Identification accuracy using the 27-protein fold class dataset constructed by Ding and Dubchak (%).

Author	Classifier	Accuracy
Ding and Dubchak (2001)	SVM (all-versus-all)	56.0
Chinnasamy et al. (2005)	Tree-augmented naive Bayesian classifier	58.2
Shen and Chou (2006)	OET-KNN	62.1
Nanni (2006)	Fusion of classifiers	61.1
Chen and Kurgan (2007)	PFRES	68.4
Guo and Gao (2008)	GAOEC	64.7
Damoulas and Girolami (2008)	Multi-class multi-kernel	70.0
Zhang et al. (2009)	Increment of diversity	61.1
Ghanty and Pal (2009)	Fusion of different classifiers	68.6
Dong et al. (2009)	ACCFold	70.1
Shen and Chou (2009)	PFP-FunDSeqE	70.5
Yang and Kecman (2011)	MarFold	71.7
Liu et al. (2012)	SVM	69.8
Present study	Random Forests	70.8

Identification of the structural classes for the 27-fold classes

As previously described by Shen and Chou (2006), the 27 protein fold classes belong to four structural classes. To evaluate the efficiency of our method for structural class identification, we extracted values of the ID, motif frequency, predicted secondary structure motifs and ACSs of secondary structure elements as feature parameters. The combined feature parameters were used as input parameters for the Random Forests algorithm, and the overall accuracy of the test set for the four structural classes was 93.40% using an independent test. This overall accuracy was approximately 4 percentage points higher than that of the method of Liu and Hu (2010) (93.40% versus 89.24%; Table 4). Using this approach, we also evaluated the Ding and Dubchak dataset, which has been used in several studies, and the overall accuracy (82.77%) was higher than the previous results obtained from this dataset (Table 4).

Table 4

Overall accuracies of structural class identification using different approaches in the test set (%).

Dataset	Author	α	β	α/β	α + β	Overall accuracy
Liu et al. (2012)	Present study	95.2	92.91	97.63	84.36	93.40
Liu et al. (2012)	Liu and Hu (2010)	97.04	85.43	94.07	78.21	89.24
Ding and Dubchak (2001)	Present study	85.25	88.03	83.22	69.35	82.77
Ding and Dubchak (2001)	Liu and Hu (2010)	86.89	88.03	83.22	59.68	81.46
Ding and Dubchak (2001)	Zhang et al. (2009)	-	-	-	-	79.11
Ding and Dubchak (2001)	Chinnasamy et al. (2005)	-	-	-	-	80.52
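
The per-class and overall accuracies reported in Table 4 can be illustrated with a short sketch. It assumes the usual definitions: the accuracy of a class is the percentage of its test sequences that are predicted correctly, and the overall accuracy is the percentage of all test sequences predicted correctly. The function name and the toy labels are hypothetical and are not taken from the datasets used here.

```python
from collections import Counter


def class_and_overall_accuracy(true_labels, predicted_labels):
    """Return ({class: accuracy in %}, overall accuracy in %)."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_class = {c: 100.0 * correct[c] / totals[c] for c in totals}
    overall = 100.0 * sum(correct.values()) / len(true_labels)
    return per_class, overall


# Toy example: 10 test sequences from three structural classes.
true_labels = ["alpha"] * 4 + ["beta"] * 4 + ["alpha/beta"] * 2
predicted = ["alpha", "alpha", "alpha", "beta",
             "beta", "beta", "beta", "beta",
             "alpha/beta", "alpha"]
per_class, overall = class_and_overall_accuracy(true_labels, predicted)
print(per_class, overall)
```

Because the overall accuracy weights each class by its number of test sequences, it is generally not the plain mean of the per-class values.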

Conclusion

For an identical dataset, the choice of feature parameters determines whether a given protein sequence is classified correctly or incorrectly. Our approach resulted in good predictions and is valid for the following reasons. First, considering the correlation between the biological function of protein folds and secondary structure elements, the composition and combined features of secondary structure elements were adopted as prediction parameters. We additionally calculated the ACSs of secondary structure elements because chemical shifts reflect structural information, such as the nature of hydrogen exchange dynamics, ionization and oxidation states, the influence of the ring current of aromatic residues, hydrogen bonding interactions and long-range correlation information of the sequence. Second, each sequence was divided into segments according to the local conservation of folds, with the amino acid composition selected as the initial parameter, after which the ID algorithm was used to obtain ID values as a prediction parameter. Third, motif information, including functional and statistical motifs, was extracted considering the local conservation of kernel structure in the protein folds. Finally, the Random Forests algorithm, a convenient and highly efficient combination classifier, was employed to yield final classification results that are decided by votes from decision trees.
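
The voting step of Random Forests mentioned above can be sketched as follows. This is a simplified illustration of plurality voting only, not the implementation used in this study; the function name `forest_vote` and the fold labels are hypothetical, and ties are resolved in favour of the class that was seen first.

```python
from collections import Counter


def forest_vote(tree_predictions):
    """Return the class with the most votes and its share of all votes."""
    counts = Counter(tree_predictions)
    # most_common keeps first-seen order among equal counts, which settles ties.
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(tree_predictions)


# Hypothetical votes from five decision trees for one protein sequence.
votes = ["fold_1", "fold_3", "fold_1", "fold_7", "fold_1"]
winner, share = forest_vote(votes)
print(winner, share)
```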

42 in total

1. Protein classification using texture descriptors extracted from the protein backbone image.

Authors: Loris Nanni; Jian-Yu Shi; Sheryl Brahnam; Alessandra Lumini
Journal: J Theor Biol Date: 2010-03-20 Impact factor: 2.691

2. Delaunay triangulation with partial least squares projection to latent structures: a model for G-protein coupled receptors classification and fast structure recognition.

Authors: Z Wen; M Li; Y Li; Y Guo; K Wang
Journal: Amino Acids Date: 2006-05-26 Impact factor: 3.520

3. PFRES: protein fold classification by using evolutionary information and predicted secondary structure.

Authors: Ke Chen; Lukasz Kurgan
Journal: Bioinformatics Date: 2007-10-17 Impact factor: 6.937

4. Prediction of protein folds: extraction of new features, dimensionality reduction, and fusion of heterogeneous classifiers.

Authors: Pradip Ghanty; Nikhil R Pal
Journal: IEEE Trans Nanobioscience Date: 2009-03-10 Impact factor: 2.935

