Ricardo Corral-Corral, Jesús A Beltrán, Carlos A Brizuela, Gabriel Del Rio.
Abstract
Protein structure and protein function should be related, yet the nature of this relationship remains unresolved. Mapping the residues critical for protein function onto protein structure features is an opportunity to explore this relationship, yet two important limitations have precluded a proper analysis of the structure-function relationship of proteins: (i) the lack of a formal definition of what critical residues are and (ii) the lack of a systematic evaluation of methods and protein structure features. To address this problem, we introduce an index that quantifies the protein-function criticality of a residue based on experimental data, together with a strategy that optimizes both the descriptors of protein structure (physicochemical and centrality descriptors) and the machine learning algorithms in order to minimize the error in the classification of critical residues. We observed that both physicochemical and centrality descriptors of residues effectively relate protein structure and protein function, and that physicochemical descriptors better describe critical residues. We also show that critical residues are better classified when residue criticality is treated as a binary attribute (i.e., residues are considered critical or not critical). Using this binary annotation for critical residues, eight models rendered accurate and non-overlapping classifications of critical residues, confirming the multi-factorial character of the structure-function relationship of proteins.
Keywords: functional residues; machine learning; protein structure
Year: 2017 PMID: 28991206 PMCID: PMC6151554 DOI: 10.3390/molecules22101673
Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
Figure 1. Diagram of the strategy used to identify optimized models to predict critical residues. The figure shows the data sets (we used 3 in our study, but these can be modified as more information becomes available) and the initial filtering operation used to optimize the attributes (2789 ProtDCal attributes plus 11 centralities), the machine-learning algorithm (MLA) and its corresponding parameters, so as to reduce the error in the classification of critical residues, as performed by AutoWEKA. In this work, 55 MLAs currently implemented in WEKA were used. The optimized model is then tested by cross-validation using the training set (leave-one-out) or by swapping training sets as test sets. This strategy was implemented and executed using WEKA (see Methods for further details).
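The two evaluation protocols named in Figure 1 (leave-one-out on the training set, and swapping one data set in as the test set for a model trained on another) can be illustrated with a minimal Python sketch. This is not the WEKA workflow used in the study: the nearest-neighbor classifier is only a toy stand-in for an optimized model, and the function names and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

# (descriptor vector, label), where label 1 = critical and 0 = not critical
Sample = Tuple[Sequence[float], int]


def nearest_neighbor_predict(train: List[Sample], x: Sequence[float]) -> int:
    """Toy stand-in for an optimized model: label of the closest training sample."""
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(train, key=lambda sample: squared_distance(sample[0], x))[1]


def leave_one_out_accuracy(data: List[Sample]) -> float:
    """Hold out each sample once, train on the rest, score the held-out sample."""
    hits = 0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += int(nearest_neighbor_predict(train, x) == y)
    return hits / len(data)


def swapped_accuracy(train: List[Sample], test: List[Sample]) -> float:
    """Train on one data set and test on a different one."""
    hits = sum(int(nearest_neighbor_predict(train, x) == y) for x, y in test)
    return hits / len(test)


# Hypothetical one-descriptor data sets, for illustration only.
set_a: List[Sample] = [([0.0], 0), ([0.1], 0), ([1.0], 1), ([1.1], 1)]
set_b: List[Sample] = [([0.2], 0), ([0.9], 1)]

loo_score = leave_one_out_accuracy(set_a)
swap_score = swapped_accuracy(set_a, set_b)
```

The swapped test is the stricter of the two, since none of the test residues come from the data set used for training.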
Algorithms and parameters optimized for JAMMING1SET.
| Descriptors | Algorithm | Parameters | Relative Absolute Error (%) |
|---|---|---|---|
| Centralities 1 | Multilayer Perceptron | [-L, 0.5571819547734898, -M, 0.23520521436874284, -B, -H, o, -C, -D, -S, 1] | 77.2 |
| Centralities 2 | LWL | [-A, weka.core.neighboursearch.LinearNNSearch, -W, weka.classifiers.functions.MultilayerPerceptron, --, -L, 0.3899191912662868, -M, 0.4683563849238558, -H, o, -C, -R, -D, -S, 1] | 78.2 |
| ProtDCal 1 | Logistic | [-R, 0.057140274761388915] | 67.9 |
| ProtDCal 2 | Logistic | [-R, 0.057140274761388915] | 67.9 |
| Union 1 | SMO | [-C, 0.7127949742291734, -N, 0, -M, -K, weka.classifiers.functions.supportVector.PolyKernel -E 1.1133446901320447 -L] | 68.6 |
| Union 2 | SMO | [-C, 0.7127949742291734, -N, 0, -M, -K, weka.classifiers.functions.supportVector.PolyKernel -E 1.1133446901320447 -L] | 68.6 |
Three sets of descriptors (Centralities, ProtDCal and Union, see Materials and Methods) were used to identify the best algorithm and corresponding parameters to learn the annotated critical and non-critical residues from the data set referred to as JAMMING1SET (see Materials and Methods). For each set of descriptors, AutoWEKA was executed at two different duration times: 1500 (model 1) and 3000 (model 2) min. We report these two results to show that no significant improvement was observed by doubling the execution time of AutoWEKA. AutoWEKA may perform the selection of descriptors; when such selection rendered the best model, this is specified (Attribute search, Attribute evaluation).
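For readers unfamiliar with the error measure in the last column, the following Python sketch shows the usual definition of relative absolute error: the total absolute error of the model divided by that of a baseline that always predicts the mean, expressed as a percentage. It is an illustration under that assumption; WEKA's own computation may differ in detail (for instance, in which data the baseline is estimated from, or in how class probabilities are handled), and the function name and values are hypothetical.

```python
from typing import Sequence


def relative_absolute_error(actual: Sequence[float],
                            predicted: Sequence[float]) -> float:
    """Absolute error of the model relative to a mean-only baseline, in percent."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum(abs(p - a) for a, p in zip(actual, predicted))
    baseline_error = sum(abs(mean_actual - a) for a in actual)
    return 100.0 * model_error / baseline_error


# Hypothetical binary criticality labels (1 = critical, 0 = not critical).
labels = [0.0, 0.0, 1.0, 1.0]

perfect = relative_absolute_error(labels, [0.0, 0.0, 1.0, 1.0])
mean_only = relative_absolute_error(labels, [0.5, 0.5, 0.5, 0.5])
```

Under this definition, a value of 100% means the model does no better than always predicting the mean, so values well below 100% are the ones of interest.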
Leave-One-Out Test for JAMMING1SET.
| Attribute Set | TP Rate | FP Rate | Precision | Recall | MCC | ROC Area |
|---|---|---|---|---|---|---|
| Centralities 1 | 0.83 | 0.53 | 0.81 | 0.83 | 0.37 | 0.77 |
| Centralities 2 | 0.83 | 0.51 | 0.81 | 0.83 | 0.37 | 0.79 |
| ProtDCal 1, 2 | 0.84 | 0.52 | 0.82 | 0.84 | 0.39 | 0.81 |
| Union 1, 2 | 0.84 | 0.53 | 0.82 | 0.84 | 0.38 | 0.81 |
| SVM-Centralities | 0.79 | 0.33 | 0.82 | 0.79 | 0.41 | 0.73 |
| SVM-ProtDCal | 0.81 | 0.27 | 0.84 | 0.81 | 0.48 | 0.81 |
| SVM-Union | 0.82 | 0.25 | 0.85 | 0.82 | 0.49 | 0.78 |
Evaluation of different models trained with JAMMING1SET using the leave-one-out test. The reported statistical parameters are the weighted averages, as reported by WEKA. The attribute sets (Centralities, ProtDCal and Union) are divided into one or two groups, corresponding to the models obtained at 1500 or 3000 min of optimization performed by AutoWEKA (see Table 1). The last rows, labeled with SVM-, show the results obtained with the filter-wrapper feature selection method based on the support vector machine implementation in WEKA referred to as LibLINEAR, using different descriptors (see Materials and Methods).
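The statistics in this table follow the standard confusion-matrix definitions, which the Python sketch below spells out for a binary problem, including a class-size weighted average of per-class values. It is an illustration only, not WEKA's code; the function names and counts are hypothetical.

```python
import math
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics for the class treated as 'positive'."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "tp_rate": _ratio(tp, tp + fn),    # identical to recall
        "fp_rate": _ratio(fp, fp + tn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


def weighted_average(tp: int, fp: int, tn: int, fn: int, key: str) -> float:
    """Average a per-class metric over both classes, weighted by class size."""
    positive = binary_metrics(tp, fp, tn, fn)[key]
    # Swap the roles of the two classes to get the metric for the other class.
    negative = binary_metrics(tn, fn, tp, fp)[key]
    n_positive, n_negative = tp + fn, tn + fp
    return (positive * n_positive + negative * n_negative) / (n_positive + n_negative)


# Hypothetical counts, for illustration only.
metrics = binary_metrics(tp=8, fp=2, tn=6, fn=4)
avg_recall = weighted_average(tp=8, fp=2, tn=6, fn=4, key="recall")
```

Two consequences of these definitions are visible in the table: TP Rate and Recall are the same quantity, and a weighted-average recall equals the overall fraction of correctly classified residues. The MCC is the more informative column when the two classes differ in size.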
Swapped Test for JAMMING1SET Best Models.
| Attribute Set | TP Rate | FP Rate | Precision | Recall | MCC | ROC Area |
|---|---|---|---|---|---|---|
| Centralities 1 | 0.61 | 0.67 | 0.55 | 0.61 | −0.07 | 0.42 |
| ProtDCal 1 | 0.58 | 0.60 | 0.54 | 0.58 | −0.02 | 0.45 |
| Union 1 | 0.58 | 0.61 | 0.74 | 0.58 | −0.02 | 0.40 |
The JAMMING3SET was used to test the best models trained with JAMMING1SET. The reported statistical parameters are the weighted averages, as reported by WEKA. The attribute set names (Centralities, ProtDCal and Union) are followed by a number 1, corresponding with the models obtained at 1500 min of optimization performed by AutoWEKA (see Table 1).
Algorithms and parameters optimized for JAMMING2SET.
| Descriptors | Algorithm | Parameters | Relative Absolute Error (%) |
|---|---|---|---|
| Centralities 1 | Linear Regression | [-S, 2, -R, 7.855468822045874E-7], Attribute search: GreedyStepwise [-C, -B, -N, 172], Attribute evaluation: CfsSubsetEval [] | 82.8 |
| Centralities 2 | LWL | [-A, weka.core.neighboursearch.LinearNNSearch, -W, weka.classifiers.functions.LinearRegression, --, -S, 0, -R, 0.20912016083576357], Attribute search: GreedyStepwise [-C, -N, 213], Attribute evaluation: CfsSubsetEval [-L] | 82.4 |
| ProtDCal 1 | M5P | [-M, 1, -R] | 80.1 |
| ProtDCal 2 | M5P | [-M, 1, -R] | 80.1 |
| Union 1 | Bagging | [-P, 74, -I, 8, -S, 1, -W, weka.classifiers.trees.DecisionStump, --] | 88.5 |
| Union 2 | Bagging | [-P, 74, -I, 8, -S, 1, -W, weka.classifiers.trees.DecisionStump, --] | 88.5 |
Three sets of descriptors (Centralities, ProtDCal and Union, see Materials and Methods) were used to identify the best algorithm and corresponding parameters to learn the annotated critical and non-critical residues from the data set referred to as JAMMING2SET (see Materials and Methods). For each set of descriptors, AutoWEKA was executed at two different duration times: 1500 (model 1) and 3000 (model 2) min. We report these two results to show that no significant improvement was observed by doubling the execution time of AutoWEKA. AutoWEKA may perform the selection of descriptors; when such selection rendered the best model, this is specified (Attribute search, Attribute evaluation).
Leave-One-Out Test for JAMMING2SET.
| Attribute Set | Correlation Coefficient | Relative Absolute Error (%) |
|---|---|---|
| Centralities 1 | 0.54 | 83.2 |
| Centralities 2 | 0.55 | 83.0 |
| ProtDCal 1, 2 | 0.64 | 74.5 |
| Union 1, 2 | 0.52 | 83.7 |
Evaluation of different models trained with JAMMING2SET using the leave-one-out test. The reported statistical parameters correspond with the cross-validation results, as reported by WEKA. The attribute sets (Centralities, ProtDCal and Union) are divided into 1 or 2 groups, corresponding with the models obtained at 1500 or 3000 min of optimization performed by AutoWEKA (see Table 4).
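The correlation coefficient reported for numeric predictions is, in the usual reading, the Pearson correlation between actual and predicted values. The Python sketch below computes it from its definition; it assumes that reading, and the function name and values are hypothetical.

```python
import math
from typing import Sequence


def pearson_correlation(actual: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    spread_a = math.sqrt(sum((a - mean_a) ** 2 for a in actual))
    spread_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    if spread_a == 0 or spread_p == 0:
        return 0.0  # correlation is undefined for constant input; 0.0 by convention here
    return covariance / (spread_a * spread_p)


# Hypothetical criticality index values, for illustration only.
same_trend = pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A correlation near 1 only says that predictions rise and fall with the actual values; it does not imply small errors, which is why the table pairs it with the relative absolute error.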
Algorithms and parameters optimized for JAMMING3SET.
| Descriptors | Algorithm | Parameters | Relative Absolute Error (%) |
|---|---|---|---|
| Centralities 1 | IBk | [-K, 13] | 82.3 |
| Centralities 2 | AdaBoostM1 | [-P, 79, -I, 108, -Q, -S, 1, -W, weka.classifiers.functions.MultilayerPerceptron, -L, 0.7665662779502016, -M, 0.21535709618934423, -B, -H, i, -C, -R, -D, -S, 1] | 99.8 |
| ProtDCal 1 | Naïve Bayes | [-D] | 61.5 |
| ProtDCal 2 | J48 | [-J, -A, -S, -M, 18, -C, 0.3184313887632543] | 67.5 |
| Union 1 | Simple Logistic | [-W, 0] | 100 |
| Union 2 | J48 | [-J, -A, -S, -M, 18, -C, 0.3184313887632543] | 67.5 |
Three sets of descriptors (Centralities, ProtDCal and Union, see Materials and Methods) were used to identify the best algorithm and corresponding parameters to learn the annotated critical and non-critical residues from the JAMMING3SET. For each set of descriptors, AutoWEKA was executed at two different duration times: 1500 (model 1) and 3000 (model 2) min. We report these two results to show that no significant improvement was observed by doubling the execution time of AutoWEKA. AutoWEKA may perform the selection of descriptors; when such selection rendered the best model, this is specified (Attribute search/evaluation).
Leave-One-Out Test for JAMMING3SET.
| Attribute Set | TP Rate | FP Rate | Precision | Recall | MCC | ROC Area |
|---|---|---|---|---|---|---|
| Centralities 1 | 0.76 | 0.50 | 0.75 | 0.76 | 0.34 | 0.76 |
| ProtDCal 1 | 0.83 | 0.61 | 0.80 | 0.83 | 0.31 | 0.79 |
| ProtDCal 2 | 0.82 | 0.55 | 0.80 | 0.82 | 0.32 | 0.68 |
| Union 1 | 0.84 | 0.55 | 0.82 | 0.84 | 0.37 | 0.81 |
| Union 2 | 0.84 | 0.49 | 0.82 | 0.84 | 0.41 | 0.73 |
| SVM-Centralities | 0.72 | 0.26 | 0.77 | 0.72 | 0.42 | 0.72 |
| SVM-ProtDCal | 0.74 | 0.29 | 0.76 | 0.74 | 0.42 | 0.72 |
| SVM-Union | 0.74 | 0.19 | 0.81 | 0.74 | 0.50 | 0.77 |
Evaluation of different models trained with JAMMING3SET using the leave-one-out test. The reported statistical parameters are the weighted averages, as reported by WEKA. The attribute sets (Centralities, ProtDCal and Union) are divided into 1 or 2 groups, corresponding with the models obtained at 1500 or 3000 min of optimization performed by AutoWEKA (see Table 6). The last rows, labeled with SVM-, show the results obtained with the filter-wrapper feature selection method based on the support vector machine implementation in WEKA referred to as LibLINEAR, using different descriptors (see Materials and Methods).
Swapped Test for JAMMING3SET Best Models.
| Attribute Set | TP Rate | FP Rate | Precision | Recall | MCC | ROC Area |
|---|---|---|---|---|---|---|
| Centralities 1 | 0.54 | 0.78 | 0.63 | 0.54 | −0.2 | 0.30 |
| ProtDCal 2 | 0.72 | 0.76 | 0.68 | 0.72 | −0.04 | 0.53 |
| Union 2 | 0.72 | 0.76 | 0.68 | 0.72 | −0.04 | 0.53 |
The JAMMING1SET was used to test the best models identified using the JAMMING3SET. The reported statistical parameters are the weighted averages, as reported by WEKA. The attribute sets (Centralities, ProtDCal and Union) are divided into 1 or 2 groups, corresponding with the models obtained at 1500 or 3000 min of optimization performed by AutoWEKA (see Table 7).
Comparison of classifications between critical residues models.
| Model | Training Set | Centralities 1 | Centralities 2 | ProtDCal 1, 2 | Union 1, 2 | Average |
|---|---|---|---|---|---|---|
| Centralities 1 | 1 | 100 | 78 | 0 | 62 | 60 |
| Centralities 2 | 1 | 92 | 100 | 0 | 63 | 63 |
| ProtDCal 1, 2 | 1 | 0 | 0 | 100 | 0 | 25 |
| Union 1, 2 | 1 | 66 | 56 | 0 | 100 | 55 |
| **Model** | **Training Set** | **Centralities 1** | **ProtDCal 1** | **ProtDCal 2** | **Union 2** | **Average** |
| Centralities 1 | 3 | 100 | 47 | 50 | 65 | 65 |
| ProtDCal 1 | 3 | 35 | 100 | 100 | 84 | 79 |
| ProtDCal 2 | 3 | 28 | 75 | 100 | 69 | 68 |
| Union 2 | 3 | 37 | 63 | 69 | 100 | 67 |
Models trained with JAMMING1SET (Training Set column value 1) or JAMMING3SET (Training Set column value 3) were compared in their classifications of critical residues. The table shows the percentage of identical critical residues found in each comparison. Comparisons were performed only where critical residues were treated as binary and the descriptors and instances were identical. Note that the table is asymmetrical because each percentage is computed relative to the number of critical-residue classifications obtained from the model indicated in the row. For instance, when the training set was 3 (JAMMING3SET), the critical residues classified by the model trained on ProtDCal 1 are 100% identical to those obtained from ProtDCal 2, yet those from ProtDCal 2 are only 75% identical to those from ProtDCal 1; this indicates that ProtDCal 2 yields more critical-residue classifications than ProtDCal 1 and that all the classifications from ProtDCal 1 are included in those derived from ProtDCal 2 (see Materials and Methods). Models obtained with Centralities at 3000 min or Union at 1500 min were not analyzed because these rendered large errors (see Table 6).
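The asymmetry described in this caption follows directly from using the row model's count as the denominator. A minimal Python sketch, with hypothetical residue numbers chosen to reproduce the 100% versus 75% example:

```python
from typing import Set


def percent_shared(row_model: Set[int], column_model: Set[int]) -> float:
    """Percentage of the row model's critical residues also found by the column model."""
    if not row_model:
        return 0.0
    return 100.0 * len(row_model & column_model) / len(row_model)


# Hypothetical residue positions classified as critical by each model.
protdcal_1 = {10, 25, 40}
protdcal_2 = {10, 25, 40, 77}

row_1_vs_2 = percent_shared(protdcal_1, protdcal_2)
row_2_vs_1 = percent_shared(protdcal_2, protdcal_1)
```

Here every residue from the first set is contained in the second, so the first comparison gives 100%, while the second set has one extra residue and therefore only three of its four classifications are shared.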
Centrality measures used in this study.
| Centrality | Description |
|---|---|
| Excentricity | This is defined as the longest of the shortest distances from a node to any other node. The shortest distances were computed using Dijkstra’s algorithm. |
| Excentricity inverted | 1/Excentricity |
| Degree | This is defined as the number of contacts for any residue in the graph. |
| Sphere degree | For a given residue i, identify its neighbors and count their contacts. This is the degree at the second level of node i. |
| Sphere degree accumulated (SNN) | Derived like the sphere degree, but the count also includes the number of neighbors of residue i. |
| Mean distance | The sum of the shortest distances from residue i to every other residue, divided by the number of neighbors of residue i. |
| Closeness centrality | Derived from the shortest distances, computed with Dijkstra’s algorithm, between residue i and the other residues in the protein. Closeness is the inverse of the sum of all these distances and is equivalent to 1/mean distance. |
| Clustering coefficient | Obtained by dividing the observed number of contacts among the neighbors of residue i (o) by the expected number of neighbors (n): o/(n*(n − 1)). |
| Clustering coefficient inverted | 1/Clustering coefficient |
| Traversity | This index measures the number of times a residue is traversed while connecting every pair of residues in the contact map using the shortest paths from Dijkstra’s algorithm. Two versions of this centrality are produced: one that follows the order of residues in the protein sequence (Traversity A) and another that does not (Traversity B). |
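
Several of the centralities in the table above can be computed from a residue contact graph with a few lines of Python. The sketch below covers degree, excentricity and closeness only. It assumes an unweighted graph, for which a breadth-first search gives the same shortest distances as Dijkstra's algorithm; the graph and the function names are hypothetical, and this is not the implementation used in the study.

```python
from collections import deque
from typing import Dict, Set

Graph = Dict[int, Set[int]]  # residue index -> indices of contacting residues


def shortest_distances(graph: Graph, source: int) -> Dict[int, int]:
    """Shortest path lengths (in contacts) from source to every reachable residue."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def degree(graph: Graph, residue: int) -> int:
    """Number of contacts of a residue."""
    return len(graph[residue])


def excentricity(graph: Graph, residue: int) -> int:
    """Longest of the shortest distances from a residue."""
    return max(shortest_distances(graph, residue).values())


def closeness(graph: Graph, residue: int) -> float:
    """Inverse of the sum of shortest distances from a residue."""
    total = sum(shortest_distances(graph, residue).values())
    return 1.0 / total if total else 0.0


# Hypothetical contact graph: a chain of four residues, 1-2-3-4.
chain: Graph = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

end_excentricity = excentricity(chain, 1)
middle_closeness = closeness(chain, 2)
```

In this chain, the terminal residue has the largest excentricity and the inner residues the largest closeness, which matches the intuition that central residues are a short path away from everything else.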
Figure 2. Correlated mutation index of contacting residues. The upper part of the image shows the contacting residues (AAk and AAm) for residue AAi in the three-dimensional protein structure; a distance (d) criterion of 5 Å defined contacting residues (see Materials and Methods). In the lower part, the substitution rate (S) is derived from a structure-based multiple sequence alignment for each of the residues in contact; the bottom part of the image shows an example of how to obtain the S parameter (for a more detailed description, see the corresponding Methods section). Each residue in a sequence is represented by a one-letter code; only conserved residues are shown in the alignment, and the rest of the sequence is presented as dots.
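The 5 Å distance criterion for contacting residues can be sketched in Python as follows. The sketch assumes one representative coordinate per residue and treats a distance equal to the cutoff as a contact; both choices are assumptions made for illustration, since the caption does not state which atoms are compared. The coordinates and the function name are hypothetical.

```python
import math
from typing import Dict, List, Set, Tuple

Coordinate = Tuple[float, float, float]  # x, y, z in Å


def contact_map(coordinates: List[Coordinate],
                cutoff: float = 5.0) -> Dict[int, Set[int]]:
    """Map each residue index to the residues within the distance cutoff."""
    contacts: Dict[int, Set[int]] = {i: set() for i in range(len(coordinates))}
    for i in range(len(coordinates)):
        for j in range(i + 1, len(coordinates)):
            if math.dist(coordinates[i], coordinates[j]) <= cutoff:
                contacts[i].add(j)
                contacts[j].add(i)
    return contacts


# Hypothetical coordinates: residues 0 and 1 are 3 Å apart, residue 2 is farther away.
toy_structure: List[Coordinate] = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
toy_contacts = contact_map(toy_structure)
```

The resulting dictionary has the same shape as a contact graph, so it can serve as the input for graph-based descriptors of each residue.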