Omar N A Demerdash, Michael D Daily, Julie C Mitchell.
Abstract
In allostery, a binding event at one site in a protein modulates the behavior of a distant site. Identifying residues that relay the signal between sites remains a challenge. We have developed predictive models using support-vector machines, a widely used machine-learning method. The training data set consisted of residues classified as either hotspots or non-hotspots based on experimental characterization of point mutations from a diverse set of allosteric proteins. Each residue had an associated set of calculated features. Two sets of features were used, one consisting of dynamical, structural, network, and informatic measures, and another of structural measures defined by Daily and Gray. The resulting models performed well on an independent data set consisting of hotspots and non-hotspots from five allosteric proteins. For the independent data set, our top 10 models using Feature Set 1 recalled 68-81% of known hotspots, and among total hotspot predictions, 58-67% were actual hotspots. Hence, these models have precision P = 58-67% and recall R = 68-81%. The corresponding models for Feature Set 2 had P = 55-59% and R = 81-92%. We combined the features from each set that produced models with optimal predictive performance. The top 10 models using this hybrid feature set had R = 73-81% and P = 64-71%, the best overall performance of any of the sets of models. Our methods identified hotspots in structural regions of known allosteric significance. Moreover, our predicted hotspots form a network of contiguous residues in the interior of the structures, in agreement with previous work. In conclusion, we have developed models that discriminate between known allosteric hotspots and non-hotspots with high accuracy and sensitivity. Moreover, the pattern of predicted hotspots corresponds to known functional motifs implicated in allostery, and is consistent with previous work describing sparse networks of allosterically important residues.
Year: 2009 PMID: 19816556 PMCID: PMC2748687 DOI: 10.1371/journal.pcbi.1000531
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Training and Independent Data Sets of Proteins with PDB identifications for the inactive and active state structures for various classes of molecules.
| Protein (functional class) | PDB of effector ligand-unbound (inactive state) | PDB of effector ligand-bound (active state) |
| Training data set | | |
| CheY (signal transduction) | 3chy | 1fqw |
| PurR repressor (transcription factor) | 1dbq | 1wet |
| Tet repressor (transcription factor) | 2trt | 1qpi |
| Hemoglobin (carrier protein/enzyme) | 4hhb | 1hho |
| Phosphofructokinase (enzyme) | 6pfk | 4pfk |
| phosphoglycerate dehydrogenase (enzyme) | 1psd | 1yba |
| fructose-1,6-bisphosphatase (enzyme) | 1eyj | 1eyi |
| Aspartate transcarbamoylase (enzyme) | 1rac | 1d09 |
| RhoA (signal transduction) | 1ftn | 1a2b |
| CDC-42 (signal transduction) | 1an0 | 1nf3 |
| glycogen phosphorylase (enzyme) | 1gpb | 7gpb |
| Independent data set | | |
| glucokinase (enzyme) | 1v4t | 1v4s |
| glutamate dehydrogenase (enzyme) | 1nr7 | 1hwz |
| lac repressor (transcription factor) | 1tlf | 1efa |
| myosin II (motor protein/enzyme) | 1vom | 1fmw |
| thrombin (enzyme) | 1sgi | 1sg8 |
Feature Set 1.
| Feature Set 1 | Abbreviation |
| Dynamical measures | |
| Deformation Energy of the inactive state | def-energ-i |
| Mean Squared Fluctuation of the inactive state | msf-i |
| Mean Squared Fluctuation of the active state | msf-a |
| Difference in Mean Squared Fluctuation between inactive and active states | diff-msf |
| Mutual Information in the inactive state | mut-info-i |
| Structural measures | |
| B-factor of the inactive state | bfac-i |
| B-factor of the active state | bfac-a |
| Difference in B-factor between the inactive and active states | diff-bfac |
| No. Potential Hydrogen Bonds in the active state | hbond-a |
| No. Potential Hydrogen Bonds in the inactive state | hbond-i |
| Difference in No. of Potential Hyd. Bonds between the inactive and active states | diff-hbond |
| Average Local Atomic Density in the inactive state | at-dens-i |
| Average Local Atomic Density in the active state | at-dens-a |
| Difference in Atomic Density between the inactive and active states | diff-at-dens |
| Network measures | |
| Node degree in inactive state | node-deg-i |
| Perturbation in Clustering Coefficient upon Node Removal in inactive state | pert-clust-coef-i |
| Informatic measures | |
| Evolutionary Conservation | cons |
| Local Structural Entropy | lse |
Feature Set 2.
| Feature Set 2 | Abbreviation |
| Alpha-carbon Displacement | Ca-disp |
| Side-Chain RMS Distance between inactive and active states | sc-rms |
| Rotation of Alpha Carbon-Beta Carbon bond from the inactive to active state | sc-flip |
| Difference in Phi Angle between inactive state and active states | dphi |
| Difference in Psi Angle between inactive state and active states | dpsi |
| Maximum of dphi and dpsi | maxdihed |
| Difference in Chi1 Angle between inactive state and active states | dchi1 |
| Difference in Chi2 Angle between inactive state and active states | dchi2 |
| Maximum of dchi1 and dchi2 | maxdchi |
| Fractional Change in Contact Environment relative to inactive state | fI |
| Fractional Change in Contact Environment relative to active state | fA |
| Maximum of fI and fA | fmax |
| All-Atom Solvent-Accessible Surface Area in inactive state | asa1 |
| All-Atom Solvent-Accessible Surface Area in active state | asa2 |
| Average of asa1 and asa2 | asaavg |
| Side-Chain Solvent-Accessible Surface Area in inactive state | asasc1 |
| Side-Chain Solvent-Accessible Surface Area in active state | asasc2 |
| Average of asasc1 and asasc2 | asascavg |
| Backbone Solvent-Accessible Surface Area in inactive state | asabb1 |
| Backbone Solvent-Accessible Surface Area in active state | asabb2 |
| Average of asabb1 and asabb2 | asabbavg |
Top 20 highest performing feature/kernel degree combinations (as ranked by F1) using Feature Set 1.
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.68 | 0.62 | 0.75 | def-energ-i, msf-i, diff-msf, at-dens-a, diff-at-dens, diff-bfac, lse | 3 |
| 0.68 | 0.58 | 0.82 | msf-i, msf-a, diff-hbond, bfac-a, node-deg-i, lse | 2 |
| 0.68 | 0.54 | 0.91 | msf-i, msf-a, diff-hbond | 3 |
| 0.67 | 0.63 | 0.73 | def-energ-i, msf-i, diff-msf, at-dens-i, at-dens-a, diff-at-dens, diff-bfac, lse | 3 |
| 0.67 | 0.61 | 0.75 | msf-a, diff-hbond, diff-at-dens, bfac-a, lse | 2 |
| 0.67 | 0.60 | 0.77 | msf-i, msf-a, mut-info-i, diff-hbond, diff-at-dens, bfac-a, lse | 2 |
| 0.67 | 0.57 | 0.82 | msf-i, diff-hbond, node-deg-i, lse | 3 |
| 0.67 | 0.57 | 0.82 | msf-i, msf-a, hbond-i, diff-hbond, bfac-a, lse | 2 |
| 0.67 | 0.57 | 0.82 | def-energ-i, msf-i, diff-hbond, lse | 3 |
| 0.67 | 0.62 | 0.73 | msf-a, diff-msf, diff-hbond, at-dens-a, diff-at-dens, bfac-a, lse | 2 |
| 0.67 | 0.56 | 0.82 | msf-i, msf-a, diff-hbond, diff-bfac, lse | 3 |
| 0.66 | 0.56 | 0.80 | def-energ-i, msf-i, diff-hbond, diff-at-dens, diff-bfac, lse | 3 |
| 0.66 | 0.56 | 0.80 | def-energ-i, msf-i, msf-a, diff-msf, diff-hbond, diff-at-dens, lse | 3 |
| 0.66 | 0.58 | 0.77 | msf-i, hbond-i, diff-hbond, node-deg-i, lse | 2 |
| 0.66 | 0.58 | 0.77 | def-energ-i, msf-i, diff-hbond, lse | 2 |
| 0.66 | 0.59 | 0.75 | def-energ-i, msf-a, diff-hbond, diff-at-dens, diff-bfac, lse | 3 |
| 0.66 | 0.60 | 0.73 | def-energ-i, msf-a, diff-hbond, diff-at-dens, bfac-a, node-deg-i, lse | 2 |
| 0.66 | 0.62 | 0.70 | def-energ-i, msf-a, diff-msf, diff-hbond, diff-at-dens, diff-bfac, node-deg-i, lse | 3 |
| 0.66 | 0.62 | 0.70 | def-energ-i, msf-i, diff-msf, diff-hbond, at-dens-a, diff-at-dens, diff-bfac, lse | 3 |
| 0.66 | 0.64 | 0.68 | def-energ-i, msf-a, diff-hbond, diff-at-dens, diff-bfac, node-deg-i, lse | 3 |
Precision, recall, and F1 scores were calculated from the results of the nine-fold cross-validation on the training set. Refer to Table 2 for explanations of feature abbreviations.
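The three scores reported in these tables are related in the standard way: precision is the fraction of predicted hotspots that are true hotspots, recall is the fraction of known hotspots that are recovered, and F1 is their harmonic mean. The following minimal Python sketch illustrates the arithmetic. The function names are ours, and the confusion counts in the example are hypothetical values chosen only for illustration, not taken from the paper.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts. Empty denominators yield 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts (illustration only, not data from the paper).
p, r, f1 = precision_recall_f1(tp=30, fp=15, fn=7)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")

# Consistency check against a tabulated row with P = 0.62 and R = 0.75.
print(round(f1_from_pr(0.62, 0.75), 2))
```

Because F1 is a harmonic mean, a model with very high recall but modest precision can score the same F1 as a more balanced model, which is why rows with quite different precision and recall share an F1 value in the table.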
Summary of the performance of the four feature sets.
| Feature Set | Range of F1 of top 300 models for training data set | No. of models of top 300 w/F1>0.60 on ind. data set | F1 of top model on ind. data set |
| Feature Set 1 | 0.63–0.68 | 22 | 0.73 |
| Feature Set 2 | 0.68–0.71 | 293 | 0.68 |
| Aug. Feature Set 1 | 0.60–0.71 | 31 | 0.68 |
| Hybrid Feature Set | 0.63–0.73 | 26,113** | 0.73 |
** A total of 80,000 feature/kernel degree combinations using the Hybrid Feature Set had F1 scores in the range of 0.63–0.73 on the training data set, and all of these combinations were tested on the independent data set. Of the 80,000 models, 26,113 had an F1 greater than 0.60 on the independent data set. Abbreviation: ind. = independent.
Top 20 highest performing feature/kernel degree combinations (as ranked by F1) using Feature Set 2.
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.71 | 0.58 | 0.93 | Ca-disp, sc-flip, asa1, asa2, asasc1 | 3 |
| 0.71 | 0.58 | 0.93 | Ca-disp, sc-flip, asa1, asa2, asasc1 | 3 |
| 0.71 | 0.58 | 0.91 | dpsi, asaavg, asascavg, asabbavg | 3 |
| 0.71 | 0.56 | 0.95 | Ca-disp, sc-flip, asa1, asa2, asasc1, asascavg | 3 |
| 0.70 | 0.61 | 0.84 | dpsi, dchi1, asascavg | 2 |
| 0.70 | 0.57 | 0.91 | maxdchi, asa2, asasc1, asascavg, asabb1, asabbavg | 2 |
| 0.70 | 0.57 | 0.91 | maxdchi, asa2, asaavg, asasc1, asabb1, asabbavg | 2 |
| 0.70 | 0.57 | 0.91 | maxdchi, asa1, asaavg, asasc2, asabbavg | 2 |
| 0.70 | 0.57 | 0.91 | maxdchi, asa1, asa2, asasc1, asascavg, asabbavg | 2 |
| 0.70 | 0.57 | 0.91 | maxdchi, asa1, asa2, asasc1, asascavg, asabb1, asabbavg | 2 |
| 0.70 | 0.57 | 0.91 | maxdchi, asa1, asa2, asasc1, asasc2, asabbavg | 2 |
| 0.70 | 0.57 | 0.91 | maxdchi, asa1, asa2, asasc1, asasc2, asascavg, asabbavg | 2 |
| 0.70 | 0.56 | 0.93 | sc-flip, asa2, asasc1, asascavg, asabb1, asabbavg | 3 |
| 0.70 | 0.56 | 0.93 | asa2, asaavg, asasc2, asabb1 | 3 |
| 0.70 | 0.56 | 0.93 | Ca-disp, sc-flip, dchi2, asa1, asa2, asaavg | 3 |
| 0.70 | 0.56 | 0.93 | Ca-disp, sc-flip, asa2, asaavg, asasc1 | 2 |
| 0.70 | 0.56 | 0.93 | Ca-disp, sc-flip, asa1, asa2, asaavg, asascavg | 3 |
| 0.70 | 0.55 | 0.95 | Ca-disp, sc-flip, asa2, asaavg, asasc1 | 3 |
| 0.70 | 0.55 | 0.95 | Ca-disp, sc-flip, asa2, asaavg, asasc1 | 3 |
| 0.70 | 0.58 | 0.86 | dpsi, dchi1, asa1, asasc2, asabb1 | 2 |
Precision, recall, and F1 scores were calculated from the results of the nine-fold cross-validation on the training set. Refer to Table 3 for explanations of feature abbreviations.
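The Kernel Degree column in these tables takes the values 2 and 3. Assuming that it refers to the degree of a polynomial SVM kernel (this record does not spell out the kernel form), the kernel value for two residue feature vectors can be sketched as below. The `gamma` and `coef0` parameters and their default values are illustrative assumptions, and the feature values are made up.

```python
from typing import Sequence


def polynomial_kernel(x: Sequence[float], y: Sequence[float], degree: int,
                      gamma: float = 1.0, coef0: float = 1.0) -> float:
    """Polynomial kernel K(x, y) = (gamma * <x, y> + coef0) ** degree.

    gamma and coef0 are assumed defaults for this sketch, not values
    reported in the source.
    """
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree


# Two hypothetical residues, each described by three scaled feature values.
residue_a = [0.2, -1.0, 0.5]
residue_b = [0.4, 0.3, -0.1]
for degree in (2, 3):
    print(degree, polynomial_kernel(residue_a, residue_b, degree))
```

A higher degree lets the decision boundary depend on products of more features at once, which is one plausible reason the same feature combination can rank differently at degree 2 and degree 3.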
Figure 1. Feature usage in the top 300 SVM models using Feature Set 1.
For each feature, the number of models (frequency) in the top 300, as ranked by F1 performance on the training data, that used that particular feature was tabulated.
Figure 2. Feature usage in the top 300 SVM models using Feature Set 2.
For each feature, the number of models (frequency) in the top 300, as ranked by F1 performance on the training data, that used that particular feature was tabulated.
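The tabulation described in these two captions amounts to counting, for each feature, how many of the top-ranked models include it. A minimal sketch follows, using three feature combinations from the Feature Set 1 table as sample input. The function name is ours, and the real analysis covered 300 models rather than three.

```python
from collections import Counter
from typing import Iterable


def feature_usage(models: Iterable[Iterable[str]]) -> Counter:
    """Count, for each feature, the number of models that use it."""
    counts: Counter = Counter()
    for features in models:
        # set() ensures a feature is counted at most once per model.
        counts.update(set(features))
    return counts


# Three feature combinations copied from the Feature Set 1 table.
top_models = [
    ["def-energ-i", "msf-i", "diff-msf", "at-dens-a", "diff-at-dens", "diff-bfac", "lse"],
    ["msf-i", "msf-a", "diff-hbond", "bfac-a", "node-deg-i", "lse"],
    ["msf-i", "msf-a", "diff-hbond"],
]

for feature, n_models in feature_usage(top_models).most_common():
    print(f"{feature:15s} {n_models}")
```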
Feature/kernel degree combinations from the top 300 models which used only sequence or inactive state structural information.
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.65 | 0.56 | 0.80 | msf-i, lse | 3 |
| 0.65 | 0.55 | 0.80 | msf-i, lse, mut-info-i | 3 |
| 0.65 | 0.55 | 0.80 | msf-i, hbond-i, lse | 3 |
| 0.65 | 0.56 | 0.77 | msf-i, lse, mut-info-i | 2 |
| 0.65 | 0.56 | 0.77 | msf-i, lse, node-deg-i | 3 |
| 0.64 | 0.56 | 0.75 | msf-i, bfac-i, lse, node-deg-i | 2 |
| 0.64 | 0.56 | 0.75 | def-energ-i, msf-i, lse, mut-info-i | 2 |
| 0.63 | 0.55 | 0.75 | msf-i, lse, node-deg-i, mut-info-i | 3 |
| 0.63 | 0.55 | 0.75 | msf-i, lse | 2 |
| 0.63 | 0.55 | 0.75 | msf-i, bfac-i, hbond-i, lse | 2 |
| 0.63 | 0.56 | 0.73 | msf-i, hbond-i, lse, node-deg-i | 3 |
| 0.63 | 0.56 | 0.73 | def-energ-i, msf-i, lse | 3 |
| 0.63 | 0.56 | 0.73 | def-energ-i, msf-i, lse | 2 |
| 0.63 | 0.53 | 0.77 | msf-i, bfac-i, lse, mut-info-i | 3 |
| 0.63 | 0.53 | 0.77 | msf-i, bfac-i, mut-info-i | 3 |
| 0.63 | 0.54 | 0.75 | msf-i, bfac-i, lse, mut-info-i | 2 |
| 0.63 | 0.54 | 0.75 | def-energ-i, msf-i, hbond-i, lse | 3 |
Precision, recall, and F1 scores were calculated from the results of the nine-fold cross-validation on the training set.
Top 20 highest performing feature/kernel degree combinations (as ranked by F1) using Augmented Feature Set 1: the top 8 Set 1 features augmented with the deformation energy of the active state (abbreviated def-energ-a in the table) and the difference in deformation energy between the inactive and active states (abbreviated diff-def-energ).
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.71 | 0.64 | 0.80 | msf-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 2 |
| 0.70 | 0.66 | 0.75 | diff-hbond, lse, diff-at-dens, msf-a, def-energ-a | 3 |
| 0.70 | 0.64 | 0.77 | msf-i, diff-hbond, lse, diff-at-dens, diff-bfac, def-energ-a | 3 |
| 0.69 | 0.63 | 0.77 | diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 3 |
| 0.69 | 0.61 | 0.80 | msf-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, diff-bfac, def-energ-a | 2 |
| 0.69 | 0.63 | 0.75 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 2 |
| 0.69 | 0.63 | 0.75 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, diff-def-energ | 2 |
| 0.69 | 0.59 | 0.82 | diff-hbond, lse, msf-a, diff-def-energ | 3 |
| 0.69 | 0.58 | 0.84 | def-energ-i, msf-i, lse, diff-def-energ | 3 |
| 0.68 | 0.64 | 0.73 | diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a, diff-def-energ | 2 |
| 0.68 | 0.62 | 0.75 | diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 2 |
| 0.68 | 0.62 | 0.75 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a | 2 |
| 0.68 | 0.62 | 0.75 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, msf-a, diff-bfac, def-energ-a | 3 |
| 0.68 | 0.61 | 0.77 | msf-i, diff-hbond, lse, diff-at-dens, msf-a, diff-bfac, def-energ-a | 3 |
| 0.68 | 0.54 | 0.91 | msf-i, diff-hbond, msf-a | 3 |
| 0.67 | 0.63 | 0.73 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, diff-def-energ | 3 |
| 0.67 | 0.63 | 0.73 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, msf-a, bfac-a, def-energ-a | 2 |
| 0.67 | 0.61 | 0.75 | diff-hbond, lse, diff-at-dens, msf-a, bfac-a | 2 |
| 0.67 | 0.61 | 0.75 | def-energ-i, diff-hbond, lse, diff-at-dens, msf-a, diff-bfac | 3 |
Precision, recall, and F1 scores were calculated from the results of the nine-fold cross-validation on the training set.
Performance of the top Feature Set 1-models on the independent data set.
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.73 | 0.67 | 0.81 | msf-i, diff-hbond, mut-info-i, msf-a, diff-msf | 3 |
| 0.68 | 0.60 | 0.78 | msf-i, mut-info-i, msf-a | 3 |
| 0.68 | 0.59 | 0.81 | msf-i, diff-hbond, mut-info-i, msf-a | 3 |
| 0.67 | 0.61 | 0.76 | msf-i, diff-hbond, msf-a, diff-msf | 3 |
| 0.67 | 0.61 | 0.73 | msf-i, hbond-a, msf-a | 3 |
| 0.67 | 0.58 | 0.78 | msf-i, diff-hbond, diff-at-dens, mut-info-i, msf-a | 3 |
| 0.66 | 0.58 | 0.76 | msf-i, diff-hbond, msf-a | 3 |
| 0.66 | 0.58 | 0.76 | msf-i, bfac-i, msf-a | 3 |
| 0.66 | 0.62 | 0.70 | msf-i, msf-a, diff-msf | 3 |
| 0.66 | 0.64 | 0.68 | msf-i, diff-hbond, msf-a | 2 |
| 0.66 | 0.64 | 0.68 | msf-i, diff-hbond, diff-at-dens, msf-a | 2 |
| 0.66 | 0.70 | 0.62 | hbond-a, hbond-i, lse, mut-info-i, msf-a, diff-msf | 3 |
| 0.65 | 0.63 | 0.68 | msf-i, bfac-i, diff-hbond, msf-a | 3 |
| 0.64 | 0.57 | 0.73 | msf-i, diff-hbond, diff-at-dens, msf-a | 3 |
| 0.64 | 0.61 | 0.68 | def-energ-i, msf-i, diff-hbond, msf-a | 3 |
| 0.63 | 0.67 | 0.59 | hbond-a, hbond-i, lse, msf-a, diff-msf | 3 |
| 0.62 | 0.68 | 0.57 | hbond-a, hbond-i, lse, diff-at-dens, msf-a, diff-msf | 3 |
| 0.61 | 0.61 | 0.62 | hbond-i, lse, diff-at-dens, msf-a, diff-msf | 3 |
| 0.60 | 0.61 | 0.59 | diff-hbond, lse, diff-at-dens, pert-clust-coef-i, msf-a, bfac-a | 2 |
| 0.60 | 0.61 | 0.59 | msf-i, diff-hbond, lse, at-dens-a, diff-at-dens, pert-clust-coef-i, msf-a, bfac-a | 2 |
Each of the top 300 feature/kernel degree combinations (as determined by the leave-one-out cross-validation) was used to train a model on the entire training data set. The resulting models were tested on the independent data set.
Performance of top Feature Set 2-models on the independent data set.
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.69 | 0.56 | 0.89 | Ca-disp, dchi2, asa1, asaavg, asasc1, asabb1 | 3 |
| 0.69 | 0.56 | 0.89 | Ca-disp, dpsi, dchi2, asaavg, asascavg, asabb1 | 3 |
| 0.69 | 0.56 | 0.89 | Ca-disp, asa1, asaavg, asasc1, asabb1, asabbavg | 3 |
| 0.69 | 0.55 | 0.92 | dpsi, asaavg, asascavg | 3 |
| 0.68 | 0.59 | 0.81 | sc-flip, asa2, asaavg, asasc2, asascavg, asabb1 | 3 |
| 0.67 | 0.58 | 0.81 | Ca-disp, sc-flip, dchi1, dchi2, asasc2, asascavg, asabb1 | 2 |
| 0.67 | 0.55 | 0.86 | dpsi, dchi1, asascavg | 2 |
| 0.67 | 0.55 | 0.86 | Ca-disp, sc-flip, dchi2, asa1, asaavg, asasc2 | 3 |
| 0.67 | 0.55 | 0.86 | Ca-disp, asa1, asasc1, asascavg, asabbavg | 2 |
| 0.67 | 0.55 | 0.86 | Ca-disp, fI, asaavg, asasc1, asascavg, asabbavg | 2 |
| 0.67 | 0.54 | 0.89 | Ca-disp, dpsi, asa1, asa2 | 3 |
| 0.67 | 0.54 | 0.89 | Ca-disp, asa1, asaavg, asasc1 | 2 |
| 0.67 | 0.57 | 0.81 | dchi1, dchi2, asasc2, asascavg, asabb1 | 2 |
| 0.67 | 0.55 | 0.84 | dpsi, dchi2, asasc1, asascavg, asabb1, asabb2 | 3 |
| 0.67 | 0.54 | 0.86 | dpsi, dchi2, asaavg | 3 |
| 0.67 | 0.54 | 0.86 | dpsi, asaavg, asascavg, asabbavg | 3 |
| 0.67 | 0.54 | 0.86 | dpsi, asa2, asaavg, asascavg, asabb1, asabb2 | 3 |
| 0.67 | 0.54 | 0.86 | dpsi, asa1, asa2, asaavg, asasc1, asabb1 | 3 |
| 0.67 | 0.54 | 0.86 | Ca-disp, dchi2, asaavg, asascavg | 2 |
| 0.67 | 0.54 | 0.86 | Ca-disp, dpsi, dchi2, asa2, asascavg, asabb1 | 3 |
Each of the top 300 feature/kernel degree combinations (as determined by the leave-one-out cross-validation) was used to train a model on the entire training data set. The resulting models were tested on the independent data set. The top 20 models are given above.
Performance on the independent data set of the models from the top 300 that used only inactive state structure and/or sequence information.
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.56 | 0.54 | 0.59 | msf-i, lse | 3 |
| 0.56 | 0.54 | 0.59 | def-energ-i, msf-i, lse | 3 |
| 0.55 | 0.54 | 0.57 | msf-i, hbond-i, lse | 3 |
| 0.55 | 0.56 | 0.54 | msf-i, lse, node-deg-i, mut-info-i | 3 |
| 0.55 | 0.56 | 0.54 | msf-i, bfac-i, hbond-i, lse | 2 |
| 0.55 | 0.53 | 0.57 | msf-i, lse, mut-info-i | 3 |
| 0.55 | 0.53 | 0.57 | msf-i, bfac-i, lse | 3 |
| 0.54 | 0.54 | 0.54 | msf-i, lse, node-deg-i | 3 |
| 0.54 | 0.54 | 0.54 | msf-i, lse | 2 |
| 0.53 | 0.53 | 0.54 | msf-i, bfac-i, lse, mut-info-i | 3 |
| 0.53 | 0.51 | 0.54 | def-energ-i, msf-i, hbond-i, lse | 3 |
| 0.52 | 0.53 | 0.51 | msf-i, lse, mut-info-i | 2 |
| 0.52 | 0.53 | 0.51 | msf-i, bfac-i, lse, mut-info-i | 2 |
| 0.51 | 0.57 | 0.46 | def-energ-i, msf-i, lse, mut-info-i | 2 |
| 0.50 | 0.55 | 0.46 | msf-i, bfac-i, lse, node-deg-i | 2 |
| 0.49 | 0.53 | 0.46 | def-energ-i, msf-i, lse | 2 |
| 0.49 | 0.57 | 0.43 | msf-i, hbond-i, lse, node-deg-i | 3 |
Performance on the independent data set of the top 20 models using Augmented Feature Set 1: the top 8 features from Set 1 augmented with the deformation energy of the active state (abbreviated def-energ-a in the table) and the difference in deformation energy between the inactive and active states (abbreviated diff-def-energ).
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.68 | 0.67 | 0.70 | def-energ-i, msf-i, diff-at-dens, msf-a, def-energ-a | 2 |
| 0.68 | 0.68 | 0.68 | def-energ-i, msf-i, diff-hbond, msf-a | 2 |
| 0.68 | 0.63 | 0.73 | msf-i, diff-hbond, msf-a, diff-def-energ | 3 |
| 0.67 | 0.66 | 0.68 | msf-i, diff-at-dens, msf-a, diff-def-energ | 2 |
| 0.67 | 0.66 | 0.68 | msf-i, diff-hbond, msf-a, def-energ-a | 2 |
| 0.67 | 0.63 | 0.70 | msf-i, msf-a, def-energ-a | 2 |
| 0.66 | 0.58 | 0.76 | msf-i, diff-hbond, msf-a | 3 |
| 0.66 | 0.62 | 0.70 | msf-i, diff-hbond, msf-a, def-energ-a | 3 |
| 0.66 | 0.64 | 0.68 | msf-i, diff-hbond, msf-a | 2 |
| 0.66 | 0.64 | 0.68 | msf-i, diff-hbond, diff-at-dens, msf-a | 2 |
| 0.66 | 0.64 | 0.68 | msf-i, diff-hbond, diff-at-dens, msf-a, def-energ-a | 2 |
| 0.66 | 0.67 | 0.65 | def-energ-i, msf-i, diff-hbond, diff-at-dens, msf-a | 2 |
| 0.65 | 0.65 | 0.65 | msf-i, msf-a, diff-def-energ | 2 |
| 0.65 | 0.65 | 0.65 | msf-i, diff-at-dens, msf-a, def-energ-a, diff-def-energ | 2 |
| 0.64 | 0.57 | 0.73 | msf-i, diff-hbond, diff-at-dens, msf-a | 3 |
| 0.64 | 0.66 | 0.62 | def-energ-i, lse, msf-a, def-energ-a, diff-def-energ | 3 |
| 0.62 | 0.62 | 0.62 | msf-i, diff-hbond, lse, msf-a, def-energ-a | 3 |
| 0.62 | 0.59 | 0.65 | def-energ-i, msf-i, diff-hbond, msf-a | 3 |
| 0.61 | 0.61 | 0.62 | def-energ-i, msf-i, lse, diff-at-dens, diff-def-energ | 3 |
| 0.61 | 0.63 | 0.59 | msf-i, diff-hbond, diff-at-dens, msf-a, diff-def-energ | 2 |
Precision, recall, and F1 scores were calculated from the results on the independent data set.
Figure 3. Improvement of F1 upon successive feature addition.
The bar on the far right represents a feature combination from the top 10 models. Preceding bars represent feature combinations where each bar contains one feature fewer than the bar to its right.
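The nested feature combinations described in this caption can be generated mechanically: starting from a full combination, each bar corresponds to a prefix with one feature fewer than the next. The sketch below assumes a particular order of addition, which this record does not give, and takes the scoring function as a parameter because model training is outside its scope. The dummy scorer in the usage lines only counts features and is not an F1 value.

```python
from typing import Callable, Sequence

FeatureSet = tuple[str, ...]


def successive_additions(features: Sequence[str]) -> list[FeatureSet]:
    """Return nested subsets: the first feature, the first two, ..., all."""
    return [tuple(features[:k]) for k in range(1, len(features) + 1)]


def score_series(features: Sequence[str],
                 score_fn: Callable[[FeatureSet], float]) -> list[tuple[FeatureSet, float]]:
    """Pair each nested subset with the score returned by score_fn."""
    return [(subset, score_fn(subset)) for subset in successive_additions(features)]


# Feature combination from one tabulated model; the order is an assumption.
combo = ["msf-i", "diff-hbond", "mut-info-i", "msf-a", "diff-msf"]

# Dummy scorer (feature count only) standing in for "train a model on this
# subset and return its F1".
for subset, score in score_series(combo, lambda s: float(len(s))):
    print(score, ", ".join(subset))
```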
Feature/kernel degree combinations from the top 300 models that used only two or three features.
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.68 | 0.54 | 0.91 | msf-i, diff-hbond, msf-a | 3 |
| 0.65 | 0.55 | 0.82 | msf-i, diff-hbond, msf-a | 2 |
| 0.65 | 0.56 | 0.80 | msf-i, lse | 3 |
| 0.65 | 0.54 | 0.82 | msf-i, msf-a, diff-msf | 3 |
| 0.65 | 0.55 | 0.80 | msf-i, lse, diff-msf | 3 |
| 0.65 | 0.55 | 0.80 | msf-i, lse, mut-info-i | 3 |
| 0.65 | 0.55 | 0.80 | msf-i, diff-hbond, lse | 3 |
| 0.65 | 0.55 | 0.80 | msf-i, hbond-i, lse | 3 |
| 0.65 | 0.56 | 0.77 | msf-i, lse, mut-info-i | 2 |
| 0.65 | 0.56 | 0.77 | msf-i, lse, node-deg-i | 3 |
| 0.64 | 0.54 | 0.80 | msf-i, hbond-a, msf-a | 3 |
| 0.64 | 0.55 | 0.77 | msf-i, lse, diff-bfac | 3 |
| 0.64 | 0.55 | 0.77 | msf-i, diff-hbond, lse | 2 |
| 0.64 | 0.55 | 0.77 | msf-i, hbond-a, lse | 3 |
| 0.64 | 0.52 | 0.82 | msf-i, mut-info-i, msf-a | 3 |
| 0.63 | 0.55 | 0.75 | msf-i, lse | 2 |
| 0.63 | 0.56 | 0.73 | msf-i, lse, at-dens-a | 3 |
| 0.63 | 0.56 | 0.73 | def-energ-i, msf-i, lse | 3 |
| 0.63 | 0.56 | 0.73 | def-energ-i, msf-i, lse | 2 |
| 0.63 | 0.53 | 0.77 | msf-i, lse, diff-at-dens | 3 |
| 0.63 | 0.53 | 0.77 | msf-i, bfac-i, lse | 3 |
| 0.63 | 0.54 | 0.75 | msf-i, hbond-i, msf-a | 2 |
| 0.63 | 0.54 | 0.75 | msf-i, bfac-i, msf-a | 3 |
Precision, recall, and F1 scores were calculated from the results of the nine-fold cross-validation on the training set.
Top 20 highest performing feature/kernel degree combinations (as ranked by F1) using all possible combinations of a mixture of Set 1 and Set 2 features that were found most frequently in the top-scoring models made using all possible combinations of each of the two feature sets separately (Hybrid Feature Set).
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.73 | 0.65 | 0.84 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa2, asaavg, asasc1, asascavg, asabb1 | 3 |
| 0.73 | 0.65 | 0.84 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa2, asaavg, asasc1, asascavg, asabb1 | 3 |
| 0.72 | 0.68 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, Ca-disp, asasc1, asasc2, asascavg, asabb1, asabbavg | 2 |
| 0.72 | 0.68 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, Ca-disp, asa2, asaavg, asasc2 | 2 |
| 0.72 | 0.66 | 0.80 | def-energ-i, diff-hbond, lse, diff-at-dens, diff-bfac, asascavg | 3 |
| 0.72 | 0.66 | 0.80 | def-energ-i, msf-i, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asa2, asasc1, asasc2, asabb1 | 3 |
| 0.72 | 0.64 | 0.82 | def-energ-i, msf-i, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asa2, asasc1, asascavg, asabb1 | 3 |
| 0.72 | 0.64 | 0.82 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa2, asaavg, asasc1, asasc2, asascavg, asabb1 | 3 |
| 0.72 | 0.64 | 0.82 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asaavg, asasc1, asasc2, asabbavg | 3 |
| 0.72 | 0.64 | 0.82 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asaavg, asasc1, asasc2, asabbavg | 3 |
| 0.72 | 0.64 | 0.82 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asa2, asasc1, asascavg, asabbavg | 3 |
| 0.72 | 0.64 | 0.82 | def-energ-i, msf-i, diff-hbond, lse, diff-at-dens, bfac-a, diff-bfac, asa1, asa2, asasc1, asascavg, asabbavg | 3 |
| 0.72 | 0.69 | 0.75 | def-energ-i, diff-hbond, lse, bfac-a, asasc1, asasc2, asascavg | 2 |
| 0.72 | 0.61 | 0.86 | def-energ-i, lse, Ca-disp, asasc2, asabb1, asabbavg | 3 |
| 0.72 | 0.61 | 0.86 | def-energ-i, diff-hbond, lse, asasc2 | 3 |
| 0.72 | 0.67 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, Ca-disp, asa1, asa2, asasc1, asasc2, asabbavg | 2 |
| 0.72 | 0.67 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, Ca-disp, asa1, asa2, asaavg, asasc2, asascavg, asabbavg | 2 |
| 0.72 | 0.67 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, diff-bfac, asa2, asaavg, asasc2 | 2 |
| 0.72 | 0.67 | 0.77 | def-energ-i, diff-hbond, lse, bfac-a, asasc2, asascavg, asabbavg | 2 |
Precision, recall, and F1 scores were calculated from the results of the nine-fold cross-validation on the training set.
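The search behind this table pairs every non-empty subset of candidate features with each kernel degree. A minimal sketch of that enumeration is given below. Restricting the degrees to 2 and 3 is an assumption based on the values that appear in the tables, and the function names are ours.

```python
from itertools import combinations
from typing import Iterator, Sequence


def enumerate_models(features: Sequence[str],
                     degrees: Sequence[int] = (2, 3)) -> Iterator[tuple[tuple[str, ...], int]]:
    """Yield every (feature subset, kernel degree) pair for non-empty subsets."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            for degree in degrees:
                yield subset, degree


def count_models(n_features: int, n_degrees: int = 2) -> int:
    """Number of pairs produced above: (2**n - 1) subsets times n_degrees."""
    return (2 ** n_features - 1) * n_degrees


# Small illustration with three features taken from the tables.
for subset, degree in enumerate_models(["msf-i", "lse", "asa1"]):
    print(degree, ", ".join(subset))

# The count doubles with every added feature, which is why the number of
# candidate models grows so quickly for larger feature sets.
print(count_models(3), count_models(10))
```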
Performance of the top models consisting of mixtures of the top Set 1 and Set 2 features on the independent data set (Hybrid Feature Set).
| F1 | Precision | Recall | Feature Combination | Kernel Degree |
| 0.73 | 0.67 | 0.78 | msf-i, diff-at-dens, msf-a, asaavg, asascavg, asabb1, asabbavg | 2 |
| 0.73 | 0.67 | 0.78 | msf-i, diff-hbond, msf-a, Ca-disp, asasc1, asabbavg | 2 |
| 0.73 | 0.67 | 0.78 | msf-i, diff-hbond, msf-a, Ca-disp, asasc1, asabb1 | 2 |
| 0.73 | 0.67 | 0.78 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asabbavg | 2 |
| 0.73 | 0.67 | 0.78 | msf-i, diff-hbond, msf-a, asaavg, asasc1, asabb1 | 2 |
| 0.72 | 0.71 | 0.73 | diff-hbond, msf-a, Ca-disp, asaavg, asasc2, asascavg, asabbavg | 3 |
| 0.72 | 0.68 | 0.76 | diff-hbond, msf-a, Ca-disp, asa1, asa2, asaavg, asasc1, asascavg, asabbavg | 3 |
| 0.72 | 0.66 | 0.78 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asabb1 | 2 |
| 0.71 | 0.64 | 0.81 | msf-i, diff-at-dens, asaavg, asascavg, asabb1, asabbavg | 3 |
| 0.71 | 0.64 | 0.81 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asabb1, asabbavg | 3 |
| 0.71 | 0.64 | 0.81 | msf-i, diff-hbond, msf-a, asa1, asabbavg | 3 |
| 0.71 | 0.64 | 0.81 | msf-i, diff-hbond, msf-a, asa1, asasc1, asabbavg | 3 |
| 0.71 | 0.64 | 0.81 | msf-i, diff-hbond, diff-at-dens, msf-a, asa1, asascavg, asabb1, asabbavg | 3 |
| 0.71 | 0.62 | 0.84 | msf-i, diff-at-dens, msf-a, asa2, asaavg, asasc2, asascavg, asabb1 | 3 |
| 0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asasc1, asabb1, asabbavg | 2 |
| 0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asaavg, asasc1, asabb1 | 2 |
| 0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asa1 | 2 |
| 0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asasc1, asabbavg | 2 |
| 0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, Ca-disp, asa1, asasc1, asabb1, asabbavg | 2 |
| 0.71 | 0.67 | 0.76 | msf-i, diff-hbond, msf-a, asa1, asabbavg | 2 |
Precision, recall, and F1 scores were calculated from the results on the independent data set. Listed are the top scoring feature/kernel degree combinations as ranked by F1 on the independent data set.
The top 9 models with the highest precision on the independent data set that were used in the structural analysis.
| F1 train | P train | R train | F1 ind | P ind | R ind | Feature Combination | Kernel Degree |
| 0.65 | 0.54 | 0.84 | 0.70 | 0.75 | 0.65 | msf-i, diff-hbond, msf-a, Ca-disp, asa2, asaavg, asasc1, asasc2, asascavg, asabbavg | 3 |
| 0.65 | 0.55 | 0.80 | 0.70 | 0.74 | 0.68 | msf-i, diff-at-dens, Ca-disp, asaavg, asabb1, asabbavg | 2 |
| 0.64 | 0.57 | 0.73 | 0.69 | 0.73 | 0.65 | msf-i, diff-hbond, bfac-a, Ca-disp, asa2, asasc1, asabb1, asabbavg | 2 |
| 0.63 | 0.56 | 0.70 | 0.69 | 0.73 | 0.65 | msf-i, diff-hbond, diff-at-dens, msf-a, bfac-a, diff-bfac, asa1, asa2, asaavg, asasc2, asabbavg | 2 |
| 0.63 | 0.52 | 0.80 | 0.69 | 0.71 | 0.68 | msf-i, diff-at-dens, Ca-disp, asa1, asaavg, asabbavg | 2 |
| 0.65 | 0.55 | 0.80 | 0.69 | 0.71 | 0.68 | msf-i, diff-at-dens, msf-a, asa1, asa2, asaavg, asasc2, asabbavg | 3 |
| 0.64 | 0.57 | 0.73 | 0.69 | 0.71 | 0.68 | def-energ-i, msf-i, diff-hbond, bfac-a, diff-bfac, asa1, asa2, asasc1, asascavg, asabbavg | 2 |
| 0.64 | 0.56 | 0.75 | 0.69 | 0.71 | 0.68 | def-energ-i, msf-i, diff-hbond, msf-a, Ca-disp, asa2, asasc1, asasc2, asascavg, asabb1 | 3 |
| 0.64 | 0.57 | 0.73 | 0.69 | 0.71 | 0.68 | def-energ-i, msf-i, diff-hbond, diff-at-dens, bfac-a, diff-bfac, asa2, asaavg, asasc2, asabb1 | 2 |
The performance on both the training (abbreviated train) and independent (abbreviated ind) data sets is given. The F1, Precision (P), and Recall (R) values for each model are reported based on their performance on the training and independent data sets.
Figure 4. Hotspot predictions mapped to the inactive state structure of lac repressor.
(A) Predictions for lac repressor made by the top 9 highest-precision Hybrid Feature Set models according to the voting scheme, mapped onto the inactive state structure (1tlf). Experimentally tested residues are rendered in van der Waals spheres, with known non-hotspots in small spheres and known hotspots in larger ones. For other residues, the prediction is shown along the backbone trace, but no experimental data are available to test the prediction. Each residue in the structure is colored according to a blue→green→red heat map, where the extremes are as follows: red represents residues predicted to be hotspots by 9/9 models, and blue represents residues predicted to be hotspots by 0/9 models (that is, predicted non-hotspots by 9/9 models). (Refer to the color bar for the exact mapping of the number of hotspot predictions to the color.) For ease of viewing, only one of the dimers (chains A and B) is shown. His 74 and Asp 278, residues that are not in the independent data set but were studied experimentally and found to be allosterically active, are rendered in van der Waals mode as well [63]. Correct positive (hotspot) and negative (non-hotspot) predictions are colored according to the heat map, while false predictions are colored gray. The inducer molecule IPTG is rendered as sticks and colored by element. (B) The complete set of residues that caused the IS phenotype is rendered in van der Waals spheres. The hotspots depicted in (A) are the subset of these for which no substitution caused an I− phenotype (completely nonfunctional). Incorrect predictions, i.e. false negatives, are colored gray.
Figure 5. Hotspot predictions mapped to the inactive state structure of myosin II.
Predictions for the myosin II motor domain made by the top 9 highest-precision Hybrid Feature Set models according to the voting scheme, mapped onto the inactive state structure (1vom). Refer to Figure 4 above for an explanation of the coloring. Residues that met our criteria for classification as hotspots and were included in the independent data set are rendered in van der Waals spheres. Switch-II residues (454–459) are depicted in van der Waals spheres as well and colored according to the heat map. Switch-II is a region with high homology to the switch region of G-proteins that couples GTP hydrolysis to effector-domain conformation.
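The voting scheme used for the coloring in Figures 4 and 5 can be sketched as follows: each of the 9 models casts one vote per residue, and the vote count is mapped onto a blue→green→red scale. The linear interpolation through green at the midpoint is an assumption made for this sketch. The exact mapping is defined by the color bar of the original figure, which is not reproduced here.

```python
from typing import Sequence


def count_votes(predictions: Sequence[bool]) -> int:
    """Number of models that predict the residue to be a hotspot."""
    return sum(bool(p) for p in predictions)


def heat_color(votes: int, n_models: int = 9) -> tuple[float, float, float]:
    """Map a vote count to an RGB triple on a blue -> green -> red scale.

    0 votes gives pure blue and n_models votes gives pure red; the linear
    interpolation in between is an assumed, illustrative mapping.
    """
    if not 0 <= votes <= n_models:
        raise ValueError("votes must lie between 0 and n_models")
    t = votes / n_models
    if t <= 0.5:
        s = t / 0.5          # 0 at blue, 1 at green
        return (0.0, s, 1.0 - s)
    s = (t - 0.5) / 0.5      # 0 at green, 1 at red
    return (s, 1.0 - s, 0.0)


# Hypothetical per-model predictions for a single residue (9 models).
residue_predictions = [True, True, False, True, True, True, False, True, True]
votes = count_votes(residue_predictions)
print(votes, heat_color(votes))
```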