Literature DB >> 22720092

Prediction of protein domain with mRMR feature selection and analysis.

Bi-Qing Li¹, Le-Le Hu, Lei Chen, Kai-Yan Feng, Yu-Dong Cai, Kuo-Chen Chou.

Abstract

The domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop effective methods for predicting the protein domains according to the sequences information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from the sequence information still remains a challenging and elusive problem. Here, a new method was developed by combing the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), as well as by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28-40% higher than those by the existing method on the same benchmark dataset. Furthermore, it was revealed by an in-depth analysis that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool in annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impacts to the domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, HIV protease cleavage sites, among many other important topics in protein science and biomedicine.

Entities: Chemical Disease Gene Species

Mesh：

Substances：
Proteins
Solvents

Year: 2012 PMID： 22720092 PMCID： PMC3376124 DOI： 10.1371/journal.pone.0039308

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Protein domains are structural, evolutionary and functional units of proteins. Prediction of protein domains from the sequence information can facilitate the prediction of protein tertiary structure [1], [2], the annotation of protein functions [2], [3], the protein structure determination [4], protein engineering [5] as well as mutagenesis [6], [7]. Particularly, the functional domains are actually the cores of proteins that play the major role for their functions. That is why in determining the 3D (three dimensional) structure of a protein by experiments (see, e.g., [8], [9], [10], [11]) or by computational modeling (see, e.g., [7], [12], [13], [14], [15]) the first priority was always focused on its functional domain. This is because the knowledge of protein functional domains is important for both basic research and drug development. Recently, the functional domain information of proteins has been widely used to formulate protein samples through the conception of pseudo amino acid composition [16], [17] for predicting various important attributes of proteins, such as membrane proteins and their types [18], GPCRs and their types [19], [20], proteases and their types [21], protein quaternary structural attribute [22], [23], protein structural classification [24], and protein subcellular localization [25], [26], [27]. Meanwhile, the protein domain information was also used to help analyzing protein-protein binding interactions [28], [29] and predicting the network of substrate-enzyme-product triads [30]. With the avalanche of protein sequences generated in the postgenomic age, many efforts have been made in hopes to predict the domains of proteins from their primary sequences alone. They can be roughly divided into three categories: (i) template-based method [31], [32], [33], (ii) ab-initio method [34], [35], and (iii) hybrid method by combining the aforementioned two [36], [37], [38]. Most template-based approaches attempted to find homologous sequences in the existing domain databases and then infer the domains of the query protein from these sequences. The obvious drawback of the template-based method was that it would work only when a domain was conserved and had already been deposited in a database. In other words, such an approach would fail to work if the query protein did not have significant sequence similarity to any of the domain-known proteins. In contrast to the template-based method, the ab-initio method could make predictions basically only based on the primary sequence information alone, and hence for those query proteins without significant sequence similarity to any of the domain-known proteins, the ab-initio method would be a good choice. The concreted techniques involved in the ab-initio methods are the machine learning algorithms [35], [39], artificial neural networks [40], and support vector machines [41], [42], along with the high quality domain databases such as CATH [43], SCOP [44] and DALI [45]. However, since it needed to scan the entire sequence of a protein usually involving several hundreds of amino acids, and also relied on the inputs containing weak domain information, the ab-initio method needed much more computational time and also often suffered from low prediction accuracy. The third method, or the hybrid method [36], [37], [46], took the strategy by integrating the template-based method and the ab-initio method. In the mean time, many softwares and web-server tools were developed for predicting protein domains, such as FIEFDom [47], DoMpro [40], DROP [42], DomCut [48] and Globplot [49]. Most of these tools aimed at predicting domain linker or domain boundary, and then followed by inferring the domain region. The present study was initiated in an attempt to address the problem from such a keypoint by first identifying each of the constituent amino acid residue in a query protein belonging to the domain region or not. The techniques involved were RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection). The amino acid features incorporated were the sequence conservation, residual disorder, secondary structure propensity, and solvent accessibility. As summarized in a review [17] and demonstrated by a series of recent publications [50], [51], [52], [53], [54], [55], [56], to establish a really useful statistical predictor for a protein system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor. Below, let us describe how to deal with these steps.

Materials and Methods

1. Benchmark Dataset

A total of 517,100 protein sequences were retrieved from UniProt/Swiss-Prot database (version 2010_06) [57]. In order to construct a high-quality benchmark dataset, protein sequences were screened strictly according to the following criteria. (i) To reduce redundancy and homology bias, the cutoff threshold was set at 25% as suggested in [58], meaning that those sequence samples were removed by means of the program CD-HIT [59] that had pairwise sequence identity to any other in the dataset. (ii) Of the remaining 45,942 protein samples obtained via the above winnowing procedure, only 9,409 were kept that had clear experimental domain annotations. (iii) Of the samples obtained via the above step, 110 proteins were removed because their disorder feature could not be calculated. Finally, a total of 9,299 protein sequences were obtained for the benchmark dataset used in this study. Furthermore, on the basis of the benchmark dataset , two working datasets, i.e., a learning (training) dataset and an independent testing dataset , were constructed. In order to fully use the data in and meanwhile guarantee that and be completely independent of each other, the following condition was imposed:where , , and represent the symbols for “union”, “intersection”, and “empty set” in the set theory, respectively. Constrained by the condition of Eq.1, 8,000 protein sequences were randomly picked for the learning dataset and the remaining 1,299 sequences for the testing dataset . See the Online for the codes of the proteins included in the two datasets, and , respectively. Three different sliding windows [60] were used to generate the positive and negative datasets for this study: size-13, size-15, and size-17. For the size-13 window, we extracted all the 13-residue segments along a protein chain. The segments thus obtained can be denoted as and classified into the following two groups: During the operation of sliding the window along a protein chain (cf. Figure 4 of [61], not all segments thus generated contain 13 amino acid residues. For those with less than 13 residues such as the ones generated at the positions close to the N-terminal or C-terminal, we complement their subsites with the nominal amino acid “X” to make them contain 13 residues as well. Thus, we obtained 1,694,782 positive samples and 4,093,531 negative samples from the learning dataset . Subsequently, for each of the two sets of 13-residue samples, the program CD-HIT [59] was used to remove those that had pairwise sequence identity to any other in a same set. Finally, we obtained 121,013 positive samples and 242,026 negative samples; i.e.,where represents the positive learning dataset derived from using the size-13 sliding window according to Eq.2, while the corresponding negative dataset derived from . By following the same procedure but using size-15 and size-17 sliding windows, respectively, we obtainedand Now, the similar operation was made with the sliding windows on the 1,299 sequences in the testing dataset , and we obtained 250,208 positive samples and 573,791 negative samples, respectively; i.e.,where represents the positive learning dataset derived from , while the corresponding negative dataset.

2. Feature Construction and Computational Method

2.1 The features of PSSM conservation scores

Biology is a natural science with historic dimension. All biological species have developed starting out from a very limited number of ancestral species. The evolution in protein sequences involves changes of single residues, insertions and deletions of several residues [13], gene doubling, and gene fusion. In the course of time such changes accumulate, so that many similarities between initial and resultant amino acid sequences are eliminated, but the corresponding proteins may still share many common attributes, such as containing to a same domain and possessing basically the same function. In view of this, evolutionary conservation will play important roles in biological analysis: a more conserved residue within a protein sequence may indicate that it is more important for the protein function and thus under stronger selective pressure. To incorporate this kind of evolutionary effects, we used PSSM (position-specific scoring matrix) [62] generated by Position Specific Iterative BLAST (PSI BLAST) [63] to measure the conservation status for a specific residue. A 20-dimensional vector was used to denote the probabilities of conservation against mutations to 20 different amino acids for a specific residue. For a given sequence with , its PSSM would correspond to a matrix, as formulated by equation 12 of [54]. Similar PSSM approaches have been successfully used to enhance the prediction quality for various protein attributes (see, e.g., [21], [26], [27], [50], [54], [55], [56], [64], [65], [66], [67], [68], [69].

2.2 The features of amino acid factors

Since each of the 20 amino acids has specific but different properties, the composition of these properties of different residues within a protein may have impacts on its structure and function. AAIndex [70] is a database containing various physicochemical and biochemical properties of amino acids. Atchley et al. [71] performed multivariate statistical analyses on AAIndex and transformed AAIndex to five multidimensional and highly interpretable numeric patterns of attribute covariation that could reflect (i) polarity, (ii) secondary structure, (iii) molecular volume, (iv) codon diversity, and (v) electrostatic charge. Such five numerical pattern scores, denoted as AAFactor (amino acid factors), were used in this study to represent the respective properties of each amino acid in a given protein.

2.3 The features of disorder score

Protein segments lacking fixed three-dimensional structures under physiological conditions play important roles in biological functions [72], [73], [74]. The disordered regions of proteins allow for more modification sites and interaction partners and always contain PTM (post translational modification) sites, sorting signals, and protein ligands. Therefore they are quite important for protein structure and function [72], [75], [76]. In this study, the program VSL2 [77], which can accurately predict both long and short disordered regions in proteins, was used to calculate the disorder score that denotes the disorder status of each amino acid in a given protein sequence.

2.4 The features of secondary structure and solvent accessibility

As is well known, the function of a protein is closely correlated with its structure, and the post-translational modification of specific residues may be affected by the solvent accessibility of the relevant residues. Therefore, it would be useful during the process of encoding the constituent amino acids by also taking into account the features such as the secondary structure propensity and solvent accessibility. These kinds of features could be predicted by the software SSpro4 [78]. The second structural propensity predicted by SSpro4 for each amino acid was “helix”, “strand”, or “other”, encoded with “100”, “010” and “001”, respectively; the solvent accessibility as ‘buried’ or ‘exposed’, encoded with “10” and “01”, respectively.

2.5 Feature space and feature vector

Each of the residues in a given protein segment was formulated in terms of 31 features, of which 20 from the PSSM scores, 1 from the disorder score, 5 from the AAFactor, 3 from the secondary structural propensities, and 2 from the solvent accessibility states. Thus, each of the segment samples generated by the size-13 sliding window would contain features; that by the size-15 sliding window, ; and that by the size-17 sliding window, . According to the general form of pseudo amino acid composition (cf. equation 6 of [17], each of these segments can be formulated by the following feature vector:where represents the feature score, the transpose operator, andFor those segments that contain the nominal residue “X”, the corresponding subsite was substituted with zero.

2.6 The mRMR method

In this study, the mRMR (minimal-redundancy-maximal-relevance) criterion [79] was used to rank the importance of the features. The mRMR method could rank the features according to their relevance to the target concerned and the redundancy among the features themselves. The ranked feature with a smaller index indicates that it has a better trade-off between the maximum relevance and minimum redundancy. To quantify both the relevance and redundancy, the following mutual information (MI) is defined to estimate how one vector is related to another:where x, y are two vectors, is the joint probabilistic density, p(x) and p(y) are the marginal probabilistic densities. Suppose G denotes the entire space containing all the feature components, G s denotes the already-selected feature set containing m features, and G t denotes the to-be-selected feature set containing n features. The relevance D between the feature f in G and the target c can be calculated byThe redundancy between the feature in G t and all the features in G s can be calculated byTo get the feature in G t with the maximum relevance and minimum redundancy, let us combine Eq.10 with Eq.11, as formulated byThe mRMR feature evaluation would continue rounds when given a feature set with features. After these evaluations, a feature set can be obtained by the mRMR method as formulated belowwhere each feature in has a subscript index indicating at which round the feature is selected. The better the feature is, the earlier it has been selected. The mRMR program can be downloaded from the web-site at http://penglab.janelia.org/proj/mRMR/.

2.7 The RF (random forest) method

The RF approach is a popular machine-learning algorithm that has been recently successfully used in dealing with various biological prediction problems (see, e.g., [38], [52], [80], [81], [82], [83]). Developed by Loe Breiman [84], RF is an ensemble predictor consisting of multiple decision trees. In Weka 3.6.4 [85], the classifier named with “RandomForest” has implemented the predictor. In the current study, RandomForest was adopted as the prediction engine and operated with the default parameters. During the process of classifying a queried sample with its feature vector, RandomForest first grew 10 decision trees according to the following procedures. (i) Suppose the number of training cases is N, take N samples at random – but with replacement, from the original data. These samples are to form the training set for growing the tree. Here the so-called “with replacement” is a mathematical term meaning that a sample selected at random from the original dataset is returned to the original dataset before a second one is selected at random. In other words, whenever a sample is selected, the original dataset contains all the same samples. Thus, an exactly same sample may be selected more than once, and there is no change at all in the size of the original dataset at any stage. (ii) If each case consists of M input features, choose a number m = [log2 M+1] which is much less than M. At each node, m features are selected randomly out of the M features and the most optimized split on these m features is employed to split the node. The value of m does not change during the growth of the tree. (iii) Each tree is fully grown and not pruned. Then the input vector is predicted by each of 10 decision tree and 10 predicted classes provided by them are obtained. Finally, the class with the most votes will be selected as the output class of RandomForest. The Weka program package can be downloaded from the web-site at http://www.cs.waikato.ac.nz/ml/weka/index_downloading.html

2.8 The cross-validation method

In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test, and jackknife test [86]. However, as elucidated in [58] and demonstrated by Eqs.28–32 of [17], among the three cross-validation methods, the jackknife test is deemed the least arbitrary (most objective) that can always yield a unique result for a given benchmark dataset, and hence has been increasingly used and widely recognized by investigators to examine the accuracy of various predictors (see, e.g., [20], [26], [27], [87], [88], [89], [90], [91]). However, to reduce the computational time, we adopted the 5-fold cross-validation in this study as done by many investigators with SVM as the prediction engine (see, e.g., [92], [93], [94]). During the process of 5-fold cross-validation, the benchmark dataset was first equally divided into 5 subsets. Subsequently, each of the subsets was in turn used as the testing dataset and the remaining four subsets as the training or learning dataset. To evaluate the performance of the predictor, the prediction accuracy, specificity, sensitivity and MCC (Matthews's correlation coefficient) were calculated below:where reflects the sensitivity, i.e., the rate of positive samples that are correctly predicted as positive; reflects the specificity, i.e., the rate of negative samples that are correctly predicted as negative; reflects the accuracy, i.e., the rate of correctly predicted events; MCC is the Matthew's correlation coefficient; TP represents the true positive; TN, the true negative; FP, the false positive; and FN, the false negative.

2.9 The IFS (incremental feature selection) approach

Based on the ranked features according to their importance evaluated by the mRMR approach, we used the IFS [95], [96], [97] approach to determine the optimal number of features. During the IFS procedure, features in the ranked feature set were added with a stepwise of from higher to lower rank. A new feature set was formed when features had been added. Thus feature sets would be composed for N ranked features. The i-th feature set is:where N denotes the total number of features in the original dataset and l (step) is a positive integer. In this study . For each of the [N/l] feature sets, an RF classifier was constructed and examined using the 5-fold cross-validation on the benchmark dataset. By doing so we obtained an IFS table with one column for the index i and the other four columns for the prediction accuracy, sensitivity, specificity and MCC, respectively. Thus, we could obtain the optimal feature set (), with which the predictor would yield the best prediction performance.

2.10 The final optimal feature set

The MCC curve was fluctuating with the increase of feature numbers. Therefore, it was necessary to carefully examine its variation against the increasing feature number. In this study the feature-increasing gap was set at 5 to winnow out the optimal features. In other words, we compared two neighbor MCC values at a time with a stepwise of five features, if the latter MCC value is greater than the former one, then the corresponding five features were reserved to join the final optimal feature set; otherwise, discarded. The final optimal feature set thus established consisted of 195 features and would be used for further analysis. We installed Weka into our Linux machine. Its “Run Environment and Configuration” was: Hardware 2 Intel(R) Xeon(R); CPU E5335@2.00 GHz; 16 G RAM; OS CentOS release 4.9 (Final) x86_64.

Results and Discussion

1. The mRMR Result

Listed in the Online are two kinds of outcomes obtained by running the mRMR software: one is called the “MaxRel feature list” that ranked all the features according to their relevance to the class of samples; the other one is the “mRMR feature list” that ranked the features according to the criteria of maximum relevance and minimum redundancy. In the mRMR feature list, the smaller the index of a feature was, the more important the feature would be for the protein domain prediction. Accordingly, the mRMR feature list could be used to establish the optimal feature set in the IFS procedure.

2. IFS and Final Optimal Feature Set

In Section 2.9 of Materials and Methods, by setting 403 for N and 5 for the feature-increasing gap, 80 individual predictors corresponding to 80 feature subsets were established for predicting the protein domain sites in the sequence samples generated by the size-13 sliding window. Listed in the Online are the rates of prediction accuracy, specificity, sensitivity and MCC (cf. 14) obtained by each of the 80 predictors. Shown in is the IFS curve plotted based on the data in Online . The same calculations were also carried out for the size-15 and size-17 windows, and the corresponding results were also plotted in , from which we can see that the predictor based on the size-13 window outperformed the other two, and that the maximal MCC was 0.342 when 360 features were included. These 360 features were deemed to form the optimal feature set of our classifier. With such a classifier, the prediction sensitivity, specificity and accuracy were 0.577, 0.768 and 0.704 respectively ( ). The optimal 360 features were given in the Online . After taking the IFS procedure (cf. In Sections 2.9 and 2.10 of Materials and Methods), we obtained the 195 final optimal features as given in the Online .

Figure 1

A plot to show the change of the MCC values versus the feature numbers with different window sizes.

Table 1

The predicted results obtained with different window size.

Window size	Dataset	Sensitivityc	Specificityc	Accuracyc	MCCc
13	a	0.577	0.768	0.704	0.342
	b	0.578	0.794	0.728	0.367
15	a	0.570	0.766	0.701	0.334
	b	0.571	0.793	0.726	0.360
17	a	0.569	0.767	0.701	0.333
	b	0.574	0.793	0.726	0.362

5-fold crossover test based on the learning dataset (cf. Eq.1).

Using the rule trained by to predict the query proteins in the independent dataset (cf. Eq.1).

See Eq.14 for more explanation.

A plot to show the change of the MCC values versus the feature numbers with different window sizes.

The IFS curves were drawn based on the data in Online . The MCC value reached the peak when the number of feature = 360 and the window size = 13. The 360 features thus obtained were used to form the optimal feature set for the protein domain predictor. Purple line is for the case of size-17 window, green for size-15 window, and brown for size-17 window. See the text for further explanation. 5-fold crossover test based on the learning dataset (cf. Eq.1). Using the rule trained by to predict the query proteins in the independent dataset (cf. Eq.1). See Eq.14 for more explanation. Hereafter, all the analyses will be based on such 195 final optimal features. The CPU time of the above calculation for size-13, 15 and 17 windows were about 4 hours, 5 hours and 6 hours, respectively.

3. Feature Analysis

The distribution of the number of each type of features in the final optimal feature set was investigated and shown in . Of the 195 optimal features, 147 were from PSSM conservation scores, 21 from the amino acid factors, 4 from the disorder scores, 7 from the solvent accessibilities, and 16 from the secondary structural propensities. All these five kinds of features made contributions to the prediction of protein domain sites. It was revealed by the site-specific distribution of the optimal feature set (see ) that sites 1–2, site 10 and site 13 played most important roles in determining the domain sites. In addition, the features of site 4 and site 5 also had considerable impacts on the prediction of protein domain sites.

Figure 2

A 2-dimensional histogram to characterize the final optimal features set.

The impact on the domain prediction from (A) the five different feature types, and (B) each of the 13 subsites. See the text for further explanation.

A 2-dimensional histogram to characterize the final optimal features set.

The impact on the domain prediction from (A) the five different feature types, and (B) each of the 13 subsites. See the text for further explanation.

4. PSSM Conservation Score Feature Analysis

As mentioned above, among the 195 optimal features, 147 belonged to the PSSM conservation features and hence had the highest proportion. It can be clearly seen from that each of the 20 different amino acid types would have different PSSM conservation impact in determining the protein domain site. In this regard, the amino acid N (asparagine) or D (aspartic acid) would have the highest impact, successively followed by G (glycine), R (arginine), and so forth. Interestingly, it has been reported that D, G and R were over-represented in protein interaction domains [98]. Besides, G was believed to be instrumental in defining the core domain and inter-domain regions of a protein [39]. Meanwhile, as shown in , for the samples generated by the size-13 window (cf. Eq.2), the conservation status at the subsite 10 played the most important role in predicting the protein domain site, followed by the subsites 1, 2, 4, and 7. Furthermore, of the top ten features in the final optimal feature list, five were from the PSSM conservation features. The first one was the conservation status against residue M (methionine) at subsite 1 (index 3, “AA1_pssm_13”). The other four were the conservation status against residue A at subsite 12 (index 4, “AA12_pssm_1”), the conservation status against residue G at subsite 6 and site 4 (index 6 and index 7, “AA6_pssm_8” and AA4_pssm_8), and the conservation status against residue T (threonine) at subsite 2 (index 8, “AA2_pssm_17”)

Figure 3

A 2-dimensional histogram to characterize the PSSM features in the final optimal features set.

(A) The impact on the domain prediction from the mutation to each of the 20 amino acid types. (B) The evolutional conservation status for each of the 13 subsites. See the text for further explanation.

A 2-dimensional histogram to characterize the PSSM features in the final optimal features set.

5. Amino Acid Factor Analysis

Illustrated in are the impacts of different amino acid factors and their subsite locations to the protein domain prediction. It can be seen from that the codon diversity was the most important feature to the protein domain site prediction, as supported by [98], [99]. Besides, it was reported that “codon harmonization” would put some non-preferred codons into the positions corresponding to the predicted protein domain boundaries [100]. Furthermore, the electrostatic charge has proved to be essential for the localization and activation of many proteins containing polycationic domains in their amino acid sequence [101]. Meanwhile, it has also been revealed that binding of oppositely charged proteins via electrostatic interactions can induce domain formation [102]. As shown in , the amino acid residues at the subsite 2 and site 13 would have the highest impact to the protein domain sites prediction. Interestingly, the electrostatic feature at the subsite 13 had an index of 2 in our final optimal feature set, indicating that it was one of the most important features for the protein domain site prediction.

Figure 4

A 2-dimensional histogram to characterize the amino acid factor types in the final optimal features set.

The impact on the domain prediction from (A) the five different amino acid types, and (B) each of the 13 subsites. See the text for further explanation.

A 2-dimensional histogram to characterize the amino acid factor types in the final optimal features set.

The impact on the domain prediction from (A) the five different amino acid types, and (B) each of the 13 subsites. See the text for further explanation.

6. Disorder Analysis

Within the final optimal feature set, four of all the 13 disorder features were selected, indicating that the disorder feature might play a pivotal role in protein domain site prediction. Such four disorder features were from subsites 1, 5, 10 and 13. Particularly, the disorder feature of subsite 5 had the index of 1 in the final optimal feature set, suggesting that it was the most important feature in the protein domain site prediction. Also, the disorder feature of subsite 13 has an index of 9 in the final optimal feature site. These findings are fully consistent with the observations that the regions of substantial structural flexibility in a protein often correspond to domain boundaries where the structure is usually exposed and less constrained [39].

7. Solvent Accessibility Features Analysis

Shown in are the solvent accessibility features in the optimal feature set. It can be seen from that the number of buried solvent accessibility features was much more than that of the exposed, indicating that the protein domains were skewed toward the buried areas. Such findings are consistent with the report the buried protein regions can be accessible to water when they are in a free subunit or in one domain state and can form a complex or an aggregate with other subunits or domains [103]. Moreover, it can be seen from that the solvent accessibility features at the subsites 2, 3, 8, 9, and 11–13 have relatively more impacts on the domain site prediction.

Figure 5

A 2-dimensional histogram to characterize the solvent accessibility types in the final optimal features set.

The impact on the protein domain prediction from (A) the two different types of the solvent accessibility, and (B) each of the 13 subsites. See the text for further explanation.

A 2-dimensional histogram to characterize the solvent accessibility types in the final optimal features set.

The impact on the protein domain prediction from (A) the two different types of the solvent accessibility, and (B) each of the 13 subsites. See the text for further explanation.

8. Secondary Structure Features Analysis

The feature and site-specific distribution of the secondary structure in the optimal feature set was given in , from which we can see that the features of “strand” and “other” did affect the domain site prediction (panel A), while the secondary structure features at subsites 1, 5, 6, 8 and 13 had relatively more impact on the domain site determination (panel B).

Figure 6

A 2-dimensional histogram to characterize the secondary structure types in the final optimal features set.

The impact on the domain prediction from (A) the three different secondary structure types, and (B) each of the 13 subsites. See the text for further explanation. See the text for further explanation.

A 2-dimensional histogram to characterize the secondary structure types in the final optimal features set.

9. Scan the Entire Protein Sequence to Refine the Domain Region Prediction

As mentioned above, each of the amino acid residues in a protein sequence was identified whether it belonged to a domain region or non-domain region (cf. Eq.2). If a residue was identified as belonging to a domain region, it was coded with “1”; otherwise, “2”, as illustrated in . However, it is inevitable that some domain residues might be mispredicted as non-domain residues resulting in some short strand of “2” inserted in a long strand of “1” and vice versa. To filter out this kind of false positives and false negatives, a special scanning algorithm was developed to refine the entire predicted results according to the following criteria. (i) Any negative code “2” should be modified to a positive code “1” if it followed a strand of more than four continuous “1” codes but was followed by less than four continuous “2” codes. (ii) Any positive code “1” should be modified to a negative code “2” if it followed a strand of more than four continuous “2” codes but was followed by less than three continuous “1” codes. After such a scanning procedure, it can be seen from that many sporadic “2” codes in the long “1” regions have disappeared, and that many sporadic “1” codes in the long “2” regions have disappeared too. Meanwhile, the prediction quality was further improved as indicated in . Finally, the regions with the long continuous “1” codes thus obtained were assigned corresponding to the domain regions as indicated in Online

Figure 7

Illustration to show the predicted results obtained before and after applying the sequence-scanning refinement operation.

A residue assigned to the domain region was coded with “1”; otherwise, “2”. The 3D structure of A1A5Q6 was retrieved from ModBase. See the text for further explanation.

Table 2

A comparison between the predicted results with and without the scanning refinement.

Window size	Scanning refinementa	Sensitivityb	Specificityb	Accuracyb	MCCb
13	No	0.578	0.794	0.728	0.367
	Yes	0.642	0.808	0.758	0.441
15	No	0.571	0.793	0.726	0.360
	Yes	0.634	0.806	0.754	0.431
17	No	0.574	0.793	0.726	0.362
	Yes	0.645	0.804	0.756	0.438

See section 9 of Results and Discussion for more explanation about the scanning procedure.

See Eq.14 for more explanation.

Illustration to show the predicted results obtained before and after applying the sequence-scanning refinement operation.

A residue assigned to the domain region was coded with “1”; otherwise, “2”. The 3D structure of A1A5Q6 was retrieved from ModBase. See the text for further explanation. See section 9 of Results and Discussion for more explanation about the scanning procedure. See Eq.14 for more explanation.

10. In Comparison with the Existing Methods

To evaluate our method, let us compare its performance with three existing methods in this area, including DoMpro [40], Globplot [49] and Domcut [48] based on the same testing dataset. Those methods such as FIEFDom [47] were not included because they were aimed at predicting domain boundaries rather than domains themselves. In other word, this kind of methods was based on such an assumption that nearly the whole protein was domain region except two or three domain boundaries. As a consequence, their sensitivity would be very close to 1, but the specificity would be very low with quite poor overall success rates. The prediction result by the DoMpro [40] on a query protein sequence was formulated by a series of “N” and “T” codes to indicate that the corresponding residue being outside and inside the domain region, respectively. The predicted outcomes by the Globplot method [49] were the domain regions directly. As for the method Domcut [48], a score was assigned to each of the constituent residues in a query protein. The residues with a score below the cutoff threshold (default −0.09) were regarded as the inter-domain linker regions. For facilitating comparison, the results by all these methods on the same independent dataset (cf. Eq.1) are also listed in , from which we can see that our method was about 58–70% higher than the other methods in specificity, 28–40% higher in accuracy, and 24–31% higher in MCC, but about 20% lower in sensitivity. These results indicate that the current method will play an important complementary role to the existing methods in identifying the domains of proteins.

Table 3

Comparison of the current method with the existing methods on the same testing dataset (cf. Eq.1).

Method	Sensitivity	Specificity	Accuracy	MCC
Our method	0.643	0.808	0.757	0.441
DoMpro [40]	0.924	0.182	0.406	0.138
Globplot [49]	0.868	0.325	0.485	0.199
Domcut [48] a	0.979	0.110	0.367	0.149
Domcut [48] b	0.856	0.325	0.482	0.186

Using the default cutoff threshold of −0.09.

Using the optimal cutoff threshold,

Using the default cutoff threshold of −0.09. Using the optimal cutoff threshold,

11. Useful Insights for Guiding Experiments or Being Validated by Experiments

The selected features at different sites may provide clues or insights for researchers to find or validate new protein domains, as can be viewed from the following four aspects. (i) PSSM feature. It was found through analyzing the PSSM conservation score that the mutations to amino acid residues N and D had the most impact on identifying the protein domain sites. Besides, the mutation to residues G and R also had more impacts than the other amino acids in this regard, fully consistent with the report [98] that D, G and R were over-represented in protein interaction domains, and the report [39] that amino acid G was instrumental in defining the core domain and interdomain regions of a protein. (ii) Codon diversity feature. It was revealed in this study that the codon diversity played pivotal role in identifying the protein domain sites, as evidenced by a series of experiments [98], [99], [100]. (iii) Electrostatic charge feature. It is interesting to note that electrostatic charge has proved to be essential for the localization and activation of many proteins containing polycationic domains in their amino acid sequence [101], and that binding of oppositely charged proteins via electrostatic interactions can induce domain formation [102]. All these observations are quite consistent with the findings in this study that the electrostatic feature of site 13 has an index of 2 in our final optimal feature set meaning that it is one of the most important features for the protein domain sites prediction. (iv) Disorder feature. It was found that in the final optimal feature set derived from this study, four of all the 13 disorder features were selected, and that disorder feature of site 5 had the index of 1, implying it was the most important feature to the protein domain site prediction. Interestingly, it has been reported that disorder regions often correspond to the domain boundaries [39]. Accordingly, the remainders in the optimal feature set are certainly worth being further investigated by future experiments. It is anticipated that the strategy and approaches developed in this study may also be extended to investigate protein signal peptides (see, e.g., [60], [61], [104], [105]), B-cell epitopes [106], [107], HIV protease cleavage sites [108], [109], [110], [111], enzyme specificity [112], [113], among many other important topics in protein science and biomedicine. The UniProt ID codes for the proteins in the benchmark dataset: (A) training dataset and (B) testing dataset. (XLSX) Click here for additional data file. Detailed results obtained by the mRMR feature selection and analysis operation. It contains two sheets. The first one shows the 403 maximum relevance features of the size-13 window ranked according to their relevance with the classification of the samples. The second one shows the 403 mRMR features of the size-13 window ranked according to the redundancy and relevance criteria. (XLSX) Click here for additional data file. The results of the sensitivity (Sn), specificity (Sp), accuracy (Ac), and Matthews's correlation coefficient (MCC) versus the feature numbers. It contains three sheets. The first one shows the IFS results for the size-13 window, the second for the size-15 window, and the third for the size-17 window. The IFS curves were plotted based on the data in this file. See the text for more explanation. (XLSX) Click here for additional data file. The 360 features selected by the IFS procedure for the size-13 window. See the text for more explanation. (XLSX) Click here for additional data file. The 195 final optimal features selected by the incremental analysis procedure for the size-13 window. See the text for more explanation. (XLSX) Click here for additional data file. Domain region of the 1,299 proteins predicted by the method proposed in this paper. (XLSX) Click here for additional data file.

108 in total

1. Identification of cell-binding sites on the Laminin alpha 5 N-terminal domain by site-directed mutagenesis.

Authors: P K Nielsen; Y Yamada
Journal: J Biol Chem Date: 2000-11-29 Impact factor: 5.157

2. Predicting protein structural class by functional domain composition.

Authors: Kuo-Chen Chou; Yu-Dong Cai
Journal: Biochem Biophys Res Commun Date: 2004-09-03 Impact factor: 3.575

3. Predicting caspase substrate cleavage sites based on a hybrid SVM-PSSM method.

Authors: Dandan Li; Zhenran Jiang; Weiming Yu; Lei Du
Journal: Protein Pept Lett Date: 2010-12 Impact factor: 1.890

4. ProtIdent: a web server for identifying proteases and their types by fusing functional domain and sequential evolution information.

Authors: Kuo-Chen Chou; Hong-Bin Shen
Journal: Biochem Biophys Res Commun Date: 2008-09-05 Impact factor: 3.575

5. Prediction of G-protein-coupled receptor classes in low homology using Chou's pseudo amino acid composition with approximate entropy and hydrophobicity patterns.

Authors: Q Gu; Y S Ding; T L Zhang
Journal: Protein Pept Lett Date: 2010-05 Impact factor: 1.890

6. Plant-mPLoc: a top-down strategy to augment the power for predicting plant protein subcellular localization.

Authors: Kuo-Chen Chou; Hong-Bin Shen
Journal: PLoS One Date: 2010-06-28 Impact factor: 3.240

7. Predicting the network of substrate-enzyme-product triads by combining compound similarity and functional domain composition.

Authors: Lei Chen; Kai-Yan Feng; Yu-Dong Cai; Kuo-Chen Chou; Hai-Peng Li
Journal: BMC Bioinformatics Date: 2010-05-31 Impact factor: 3.169

8. A protein domain interaction interface database: InterPare.

Authors: Sungsam Gong; Changbum Park; Hansol Choi; Junsu Ko; Insoo Jang; Jungsul Lee; Dan M Bolser; Donghoon Oh; Deok-Soo Kim; Jong Bhak
Journal: BMC Bioinformatics Date: 2005-08-25 Impact factor: 3.169

9. Some remarks on protein attribute prediction and pseudo amino acid composition.

Authors: Kuo-Chen Chou
Journal: J Theor Biol Date: 2010-12-17 Impact factor: 2.691

10. Ab initio and homology based prediction of protein domains by recursive neural networks.

Authors: Ian Walsh; Alberto J M Martin; Catherine Mooney; Enrico Rubagotti; Alessandro Vullo; Gianluca Pollastri
Journal: BMC Bioinformatics Date: 2009-06-26 Impact factor: 3.169

36 in total

1. Prediction of O-glycosylation sites based on multi-scale composition of amino acids and feature selection.

Authors: Yuan Chen; Wei Zhou; Haiyan Wang; Zheming Yuan
Journal: Med Biol Eng Comput Date: 2015-03-10 Impact factor: 2.602

2. Predicting DNA-binding sites of proteins based on sequential and 3D structural information.

Authors: Bi-Qing Li; Kai-Yan Feng; Juan Ding; Yu-Dong Cai
Journal: Mol Genet Genomics Date: 2014-01-22 Impact factor: 3.291

3. Identification of compound-protein interactions through the analysis of gene ontology, KEGG enrichment for proteins and molecular fragments of compounds.

Authors: Lei Chen; Yu-Hang Zhang; Mingyue Zheng; Tao Huang; Yu-Dong Cai
Journal: Mol Genet Genomics Date: 2016-08-16 Impact factor: 3.291