Rita Melo, Robert Fieldhouse, André Melo, João D G Correia, Maria Natália D S Cordeiro, Zeynep H Gümüş, Joaquim Costa, Alexandre M J J Bonvin, Irina S Moreira.
Abstract
Understanding protein-protein interactions is a key challenge in biochemistry. In this work, we describe a methodology that predicts Hot-Spots (HS) in protein-protein interfaces from their native complex structure more accurately than previously published Machine Learning (ML) techniques. Our model is trained on a large number of complexes and on a significantly larger number of different structural- and evolutionary sequence-based features. In particular, we added interface size, type of interaction between residues at the interface of the complex, number of different types of residues at the interface and the Position-Specific Scoring Matrix (PSSM), for a total of 79 features. We used twenty-seven algorithms, ranging from a simple linear-based function to support-vector machine models with different cost functions. The best model was obtained with the conditional inference random forest (c-forest) algorithm on a dataset pre-processed by normalization of the features and up-sampling of the minority class. On the independent test set, the method has an overall accuracy of 0.80, an F1-score of 0.73, a sensitivity of 0.76 and a specificity of 0.82.
Keywords: Solvent Accessible Surface Area (SASA); evolutionary sequence conservation; hot-spots; machine learning; protein-protein interfaces
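The pre-processing named in the abstract (normalization of features followed by up-sampling of the minority class) can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `zscore_columns` and `upsample_minority`, the toy feature values, the class labels `"HS"` and `"non-HS"`, and the choice of z-score scaling with random resampling with replacement are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant columns, whose standard deviation is zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def upsample_minority(rows, labels, seed=0):
    """Resample smaller classes with replacement until all classes match in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # The majority class needs no extra rows; the minority class is padded.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: two features, one hot-spot against three other residues.
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
classes = ["HS", "non-HS", "non-HS", "non-HS"]

scaled = zscore_columns(features)
balanced_rows, balanced_labels = upsample_minority(scaled, classes)
```

After this step both classes contain the same number of rows, so a classifier trained on `balanced_rows` no longer sees the original class imbalance.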
Year: 2016 PMID: 27472327 PMCID: PMC5000613 DOI: 10.3390/ijms17081215
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Figure 1. The flowchart of the current work.
Statistical metrics attained on the training set by the five top-performing algorithms under each of the studied pre-processing conditions.
| Pre-Processing / Metrics | Algorithms | | | | | |
|---|---|---|---|---|---|---|
| Scaled | nnet | avNNET | C5.0 Tree | C5.0 Rules | svmRadialSigma | |
| AUROC | 0.52 | 0.65 | 0.77 | 0.72 | 0.78 | |
| Accuracy | 0.92 | 0.94 | 0.96 | 0.92 | 0.91 | |
| Sensitivity | 0.92 | 0.88 | 0.88 | 0.85 | 0.80 | |
| Specificity | 0.91 | 0.98 | 1.00 | 0.96 | 0.97 | |
| PPV | 0.86 | 0.95 | 0.99 | 0.92 | 0.93 | |
| NPV | 0.95 | 0.94 | 0.94 | 0.92 | 0.89 | |
| FPR | 0.09 | 0.02 | 0.00 | 0.04 | 0.03 | |
| F1-score | 0.89 | 0.92 | 0.93 | 0.89 | 0.86 | |
| Scaled_Down | c-forest | avNNET | C5.0 Tree | C5.0 Rules | GBM | |
| AUROC | 0.79 | 0.70 | 0.73 | 0.71 | 0.80 | |
| Accuracy | 0.91 | 0.95 | 0.96 | 0.90 | 1.00 | |
| Sensitivity | 0.93 | 0.96 | 0.96 | 0.89 | 0.99 | |
| Specificity | 0.90 | 0.93 | 0.95 | 0.91 | 1.00 | |
| PPV | 0.90 | 0.93 | 0.95 | 0.90 | 1.00 | |
| NPV | 0.92 | 0.96 | 0.96 | 0.89 | 0.99 | |
| FPR | 0.10 | 0.07 | 0.05 | 0.09 | 0.00 | |
| F1-score | 0.91 | 0.95 | 0.96 | 0.90 | 1.00 | |
| Scaled_Up | c-forest | avNNET | C5.0 Tree | C5.0 Rules | GBM | |
| AUROC | 0.85 | 0.75 | 0.85 | 0.82 | 0.84 | |
| Accuracy | 0.93 | 0.94 | 0.98 | 0.95 | 0.98 | |
| Sensitivity | 0.93 | 0.96 | 0.99 | 0.96 | 0.97 | |
| Specificity | 0.93 | 0.92 | 0.97 | 0.94 | 0.99 | |
| PPV | 0.93 | 0.92 | 0.97 | 0.94 | 0.99 | |
| NPV | 0.93 | 0.96 | 0.99 | 0.95 | 0.97 | |
| FPR | 0.07 | 0.08 | 0.03 | 0.06 | 0.01 | |
| F1-score | 0.93 | 0.94 | 0.98 | 0.95 | 0.98 | |
| PCA | nnet | avNNET | C5.0 Tree | C5.0 Rules | svmRadialSigma | |
| AUROC | 0.69 | 0.75 | 0.61 | 0.59 | 0.76 | |
| Accuracy | 1.00 | 0.99 | 0.98 | 0.92 | 0.91 | |
| Sensitivity | 1.00 | 0.97 | 0.98 | 0.91 | 0.76 | |
| Specificity | 1.00 | 1.00 | 0.98 | 0.93 | 0.99 | |
| PPV | 1.00 | 0.99 | 0.96 | 0.89 | 0.97 | |
| NPV | 1.00 | 0.98 | 0.99 | 0.95 | 0.88 | |
| FPR | 0.00 | 0.00 | 0.02 | 0.07 | 0.01 | |
| F1-score | 1.00 | 0.98 | 0.97 | 0.90 | 0.85 | |
| PCA_Down | nnet | avNNET | C5.0 Tree | C5.0 Rules | svmRadialSigma | |
| AUROC | 0.70 | 0.78 | 0.67 | 0.67 | 0.75 | |
| Accuracy | 0.87 | 0.91 | 0.97 | 0.91 | 0.91 | |
| Sensitivity | 0.88 | 0.88 | 0.96 | 0.96 | 0.88 | |
| Specificity | 0.87 | 0.93 | 0.99 | 0.87 | 0.93 | |
| PPV | 0.87 | 0.92 | 0.99 | 0.88 | 0.93 | |
| NPV | 0.88 | 0.89 | 0.96 | 0.95 | 0.89 | |
| FPR | 0.13 | 0.07 | 0.01 | 0.13 | 0.07 | |
| F1-score | 0.87 | 0.90 | 0.97 | 0.92 | 0.91 | |
| PCA_Up | nnet | avNNET | C5.0 Tree | C5.0 Rules | svmRadialSigma | |
| AUROC | 0.75 | 0.82 | 0.80 | 0.78 | 0.80 | |
| Accuracy | 0.95 | 0.98 | 0.98 | 0.96 | 0.94 | |
| Sensitivity | 0.94 | 0.97 | 0.99 | 0.96 | 0.92 | |
| Specificity | 0.96 | 0.99 | 0.98 | 0.96 | 0.95 | |
| PPV | 0.96 | 0.99 | 0.98 | 0.96 | 0.95 | |
| NPV | 0.94 | 0.97 | 0.99 | 0.96 | 0.92 | |
| FPR | 0.04 | 0.01 | 0.02 | 0.04 | 0.05 | |
| F1-score | 0.95 | 0.98 | 0.98 | 0.96 | 0.94 | |
avNNET, model-averaged neural network; C5.0 Rules, single C5.0 ruleset; C5.0 Tree, single C5.0 tree; c-forest, conditional inference random forest; GBM, stochastic gradient boosting machine; nnet, neural network; svmRadialSigma, support vector machine with the radial basis function kernel; AUROC, Area Under the Receiver Operating Characteristic curve; PPV, Positive Predictive Value; NPV, Negative Predictive Value; FPR, False Positive Rate.
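As a reminder of how the threshold-based metrics in these tables follow from a binary confusion matrix, the sketch below applies their standard definitions. The function name `binary_metrics` and the counts passed to it are hypothetical and are not taken from the study; AUROC is left out because it requires ranked prediction scores, not just counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics, with hot-spots as the positive class."""
    sensitivity = tp / (tp + fn)  # true positive rate (recall)
    specificity = tn / (tn + fp)  # true negative rate
    ppv = tp / (tp + fp)          # positive predictive value (precision)
    npv = tn / (tn + fn)          # negative predictive value
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "fpr": fp / (fp + tn),    # equals 1 - specificity
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```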
Statistical metrics attained on the independent test set by the five top-performing algorithms under each of the studied pre-processing conditions.
| Pre-Processing / Metrics | Algorithms | | | | | |
|---|---|---|---|---|---|---|
| Scaled | nnet | avNNET | C5.0 Tree | C5.0 Rules | svmRadialSigma | |
| AUROC | 0.71 | 0.68 | 0.68 | 0.72 | 0.70 | |
| Accuracy | 0.74 | 0.71 | 0.71 | 0.74 | 0.73 | |
| Sensitivity | 0.57 | 0.57 | 0.50 | 0.60 | 0.55 | |
| Specificity | 0.83 | 0.79 | 0.83 | 0.82 | 0.83 | |
| PPV | 0.65 | 0.60 | 0.62 | 0.65 | 0.64 | |
| NPV | 0.78 | 0.77 | 0.75 | 0.79 | 0.77 | |
| FPR | 0.43 | 0.43 | 0.40 | 0.40 | 0.45 | |
| F1-score | 0.61 | 0.58 | 0.55 | 0.62 | 0.59 | |
| Scaled_Down | c-forest | avNNET | C5.0 Tree | C5.0 Rules | GBM | |
| AUROC | 0.75 | 0.68 | 0.63 | 0.71 | 0.73 | |
| Accuracy | 0.76 | 0.69 | 0.64 | 0.72 | 0.75 | |
| Sensitivity | 0.79 | 0.71 | 0.67 | 0.76 | 0.74 | |
| Specificity | 0.74 | 0.69 | 0.62 | 0.70 | 0.75 | |
| PPV | 0.63 | 0.55 | 0.49 | 0.59 | 0.62 | |
| NPV | 0.87 | 0.81 | 0.77 | 0.84 | 0.84 | |
| FPR | 0.21 | 0.29 | 0.33 | 0.24 | 0.26 | |
| F1-score | 0.70 | 0.62 | 0.57 | 0.66 | 0.68 | |
| Scaled_Up | c-forest | avNNET | C5.0 Tree | C5.0 Rules | GBM | |
| AUROC | 0.78 | 0.73 | 0.65 | 0.70 | 0.80 | |
| Accuracy | 0.80 | 0.75 | 0.69 | 0.73 | 0.82 | |
| Sensitivity | 0.76 | 0.66 | 0.48 | 0.59 | 0.76 | |
| Specificity | 0.82 | 0.80 | 0.80 | 0.81 | 0.85 | |
| PPV | 0.70 | 0.64 | 0.57 | 0.63 | 0.73 | |
| NPV | 0.86 | 0.81 | 0.74 | 0.78 | 0.86 | |
| FPR | 0.24 | 0.34 | 0.52 | 0.41 | 0.24 | |
| F1-score | 0.73 | 0.65 | 0.52 | 0.61 | 0.75 | |
| PCA | nnet | avNNET | C5.0 Tree | C5.0 Rules | svmRadialSigma | |
| AUROC | 0.65 | 0.73 | 0.68 | 0.71 | 0.71 | |
| Accuracy | 0.67 | 0.75 | 0.70 | 0.74 | 0.74 | |
| Sensitivity | 0.60 | 0.60 | 0.66 | 0.67 | 0.52 | |
| Specificity | 0.71 | 0.84 | 0.72 | 0.77 | 0.86 | |
| PPV | 0.54 | 0.67 | 0.57 | 0.62 | 0.67 | |
| NPV | 0.77 | 0.79 | 0.79 | 0.81 | 0.76 | |
| FPR | 0.40 | 0.40 | 0.34 | 0.33 | 0.48 | |
| F1-score | 0.57 | 0.64 | 0.61 | 0.64 | 0.58 | |
| PCA_Down | nnet | avNNET | C5.0 Tree | C5.0 Rules | svmRadialSigma | |
| AUROC | 0.70 | 0.68 | 0.59 | 0.61 | 0.69 | |
| Accuracy | 0.71 | 0.69 | 0.61 | 0.63 | 0.70 | |
| Sensitivity | 0.76 | 0.71 | 0.55 | 0.60 | 0.72 | |
| Specificity | 0.68 | 0.69 | 0.64 | 0.64 | 0.69 | |
| PPV | 0.56 | 0.55 | 0.46 | 0.48 | 0.56 | |
| NPV | 0.84 | 0.81 | 0.72 | 0.74 | 0.82 | |
| FPR | 0.24 | 0.29 | 0.45 | 0.40 | 0.28 | |
| F1-score | 0.65 | 0.62 | 0.50 | 0.53 | 0.63 | |
| PCA_Up | nnet | avNNET | C5.0 Tree | C5.0 Rules | svmRadialSigma | |
| AUROC | 0.67 | 0.75 | 0.56 | 0.61 | 0.69 | |
| Accuracy | 0.70 | 0.77 | 0.59 | 0.63 | 0.71 | |
| Sensitivity | 0.59 | 0.64 | 0.48 | 0.55 | 0.64 | |
| Specificity | 0.76 | 0.84 | 0.65 | 0.68 | 0.75 | |
| PPV | 0.58 | 0.69 | 0.43 | 0.48 | 0.59 | |
| NPV | 0.77 | 0.81 | 0.69 | 0.73 | 0.79 | |
| FPR | 0.41 | 0.36 | 0.52 | 0.45 | 0.36 | |
| F1-score | 0.58 | 0.66 | 0.46 | 0.52 | 0.61 | |
avNNET, model-averaged neural network; C5.0 Rules, single C5.0 ruleset; C5.0 Tree, single C5.0 tree; c-forest, conditional inference random forest; GBM, stochastic gradient boosting machine; nnet, neural network; svmRadialSigma, support vector machine with the radial basis function kernel; AUROC, Area Under the Receiver Operating Characteristic curve; PPV, Positive Predictive Value; NPV, Negative Predictive Value; FPR, False Positive Rate.
Figure 2. Top 15 variables for the c-forest method. SASA, Solvent Accessible Surface Area; #, number of residues.
Comparison of the statistical metrics attained by the best predictor in this work and by some of the most common predictors in the literature.
| Performance | c-Forest/ (Training) | c-Forest/ (Test) | SBHD2 (Training) | SBHD2 (Test) | Robetta (Training) | Robetta (Test) | KFC2-A (Training) | KFC2-A (Test) | KFC2-B (Training) | KFC2-B (Test) | CPORT (Training) | CPORT (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUROC | 0.85 | 0.78 | 0.74 | 0.69 | 0.62 | 0.62 | 0.72 | 0.66 | 0.60 | 0.67 | 0.54 | 0.54 |
| Accuracy | 0.93 | 0.80 | 0.70 | 0.71 | 0.66 | 0.66 | 0.76 | 0.71 | 0.70 | 0.73 | 0.49 | 0.49 |
| Sensitivity | 0.93 | 0.76 | 0.70 | 0.70 | 0.38 | 0.29 | 0.57 | 0.53 | 0.26 | 0.28 | 0.55 | 0.54 |
| Specificity | 0.93 | 0.82 | 0.70 | 0.71 | 0.85 | 0.88 | 0.85 | 0.81 | 0.93 | 0.96 | 0.45 | 0.47 |
| PPV | 0.93 | 0.70 | 0.55 | 0.56 | 0.61 | 0.60 | 0.67 | 0.59 | 0.65 | 0.80 | 0.34 | 0.35 |
| NPV | 0.93 | 0.86 | 0.82 | 0.82 | 0.68 | 0.67 | 0.79 | 0.77 | 0.71 | 0.72 | 0.66 | 0.66 |
| F1-score | 0.93 | 0.73 | 0.62 | 0.62 | 0.47 | 0.39 | 0.62 | 0.56 | 0.37 | 0.42 | 0.42 | 0.42 |
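The F1-scores in the test columns of the table above can be cross-checked against the reported sensitivity and PPV, since F1 is their harmonic mean. The sketch below does this and ranks the predictors; the values are copied from the table, while the dictionary layout, the helper name `f1_from` and the key `"c-forest"` are illustrative choices. Small gaps are expected because the inputs are rounded to two decimals.

```python
# (sensitivity, PPV, reported F1-score) on the independent test set,
# copied from the Test columns of the comparison table.
reported = {
    "c-forest": (0.76, 0.70, 0.73),
    "SBHD2": (0.70, 0.56, 0.62),
    "Robetta": (0.29, 0.60, 0.39),
    "KFC2-A": (0.53, 0.59, 0.56),
    "KFC2-B": (0.28, 0.80, 0.42),
    "CPORT": (0.54, 0.35, 0.42),
}


def f1_from(sensitivity, ppv):
    """F1-score as the harmonic mean of sensitivity (recall) and PPV (precision)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)


recomputed = {name: f1_from(s, p) for name, (s, p, _) in reported.items()}

# Largest absolute difference between recomputed and reported F1-scores.
max_gap = max(abs(recomputed[name] - reported[name][2]) for name in reported)

# Predictors ordered from highest to lowest recomputed test F1-score.
ranking = sorted(recomputed, key=recomputed.get, reverse=True)

for name in ranking:
    print(f"{name}: recomputed {recomputed[name]:.3f}, reported {reported[name][2]:.2f}")
```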