Irina S Moreira, Panagiotis I Koukos, Rita Melo, Jose G Almeida, Antonio J Preto, Joerg Schaarschmidt, Mikael Trellet, Zeynep H Gümüş, Joaquim Costa, Alexandre M J J Bonvin.
Abstract
We present SpotOn, a web server to identify and classify interfacial residues as Hot-Spots (HS) and Null-Spots (NS). SpotOn implements a robust algorithm with a demonstrated accuracy of 0.95 and sensitivity of 0.98 on an independent test set. The predictor was developed using an ensemble machine learning approach with up-sampling of the minor class. It was trained on 53 complexes using various features based on both protein 3D structure and sequence. The SpotOn web interface is freely available at http://milou.science.uu.nl/services/SPOTON/.
Year: 2017 PMID: 28808256 PMCID: PMC5556074 DOI: 10.1038/s41598-017-08321-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Cluster dendrogram of the machine learning algorithms tested in this work. The 5 clusters are separated by dashed lines and are ordered from I to V.
Mean values of the statistical metrics attained by the best algorithm of each cluster under each pre-processing condition, for both the training set (Train) and the testing set (Test).
| Metric | PCA (Train) | PCA (Test) | Scaled (Train) | Scaled (Test) |
|---|---|---|---|---|
| AUROC | 0.79 | 0.67 | 0.80 | 0.77 |
| Accuracy | 0.89 | 0.78 | 0.90 | 0.81 |
| Sensitivity | 0.60 | 0.31 | 0.67 | 0.40 |
| Specificity | 0.98 | 0.92 | 0.97 | 0.94 |
| PPV | 0.87 | 0.53 | 0.88 | 0.67 |
| NPV | 0.89 | 0.81 | 0.91 | 0.83 |
| F1-score | 0.67 | 0.38 | 0.75 | 0.49 |
| MCC | 0.68 | 0.29 | 0.71 | 0.42 |

| Metric | PCAUp (Train) | PCAUp (Test) | ScaledUp (Train) | ScaledUp (Test) |
|---|---|---|---|---|
| AUROC | 0.93 | 0.80 | 0.94 | 0.83 |
| Accuracy | 0.93 | 0.79 | 0.97 | 0.79 |
| Sensitivity | 0.95 | 0.55 | 0.98 | 0.48 |
| Specificity | 0.93 | 0.86 | 0.96 | 0.88 |
| PPV | 0.93 | 0.57 | 0.96 | 0.57 |
| NPV | 0.94 | 0.87 | 0.98 | 0.85 |
| F1-score | 0.94 | 0.55 | 0.97 | 0.52 |
| MCC | 0.83 | 0.41 | 0.91 | 0.38 |

| Metric | PCADown (Train) | PCADown (Test) | ScaledDown (Train) | ScaledDown (Test) |
|---|---|---|---|---|
| AUROC | 0.79 | 0.70 | 0.81 | 0.74 |
| Accuracy | 0.91 | 0.75 | 0.90 | 0.76 |
| Sensitivity | 0.90 | 0.78 | 0.87 | 0.66 |
| Specificity | 0.92 | 0.74 | 0.93 | 0.80 |
| PPV | 0.92 | 0.48 | 0.92 | 0.51 |
| NPV | 0.91 | 0.92 | 0.89 | 0.88 |
| F1-score | 0.91 | 0.59 | 0.89 | 0.57 |
| MCC | 0.78 | 0.46 | 0.78 | 0.42 |

PCA: dataset after Principal Component Analysis; PCAUp: dataset after Principal Component Analysis and up-sampling of the minor class; PCADown: dataset after Principal Component Analysis and down-sampling of the major class; Scaled: dataset after z-score calculation; ScaledUp: dataset after z-score calculation and up-sampling of the minor class; ScaledDown: dataset after z-score calculation and down-sampling of the major class.
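
The scaling and resampling conditions listed in this footnote can be illustrated with a minimal sketch. It assumes plain Python lists and is not the authors' pipeline: the function names, the random seed, and the use of the population standard deviation are all assumptions made for illustration, and the PCA step is omitted for brevity.

```python
import random
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every feature column to zero mean and unit (population) std."""
    columns = list(zip(*rows))
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(x - m) / s for x, (m, s) in zip(row, stats)] for row in rows]


def _group_by_label(rows, labels):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups


def upsample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    groups = _group_by_label(rows, labels)
    target = max(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def downsample_majority(rows, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, labels)
    target = min(len(members) for members in groups.values())
    out_rows, out_labels = [], []
    for label, members in groups.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data (not from the paper): four Null-Spots and one Hot-Spot.
toy_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 60.0]]
toy_labels = ["NS", "NS", "NS", "NS", "HS"]
scaled_up = upsample_minority(zscore_columns(toy_rows), toy_labels)
```

In this sketch, "ScaledUp" corresponds to z-scoring followed by `upsample_minority`, and "ScaledDown" to z-scoring followed by `downsample_majority`. Resampling would normally be applied to the training set only, so that the test set keeps its original class balance.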
Statistical metrics for the best algorithm of each cluster and for their combined regression models, both the “Full Regression” and the stepwise-optimized regression model (rf + svmPoly + pda), for both the training and testing sets.
| Metric | C5.0 (Train) | C5.0 (Test) | pda (Train) | pda (Test) | plr (Train) | plr (Test) | rf (Train) | rf (Test) | svmPoly (Train) | svmPoly (Test) | Full Regression (Train) | Full Regression (Test) | rf + svmPoly + pda (Train) | rf + svmPoly + pda (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUROC | 0.88 | 0.83 | 0.85 | 0.84 | 0.83 | 0.85 | 0.93 | 0.83 | 0.89 | 0.83 | 0.91 | 0.91 | 0.91 | 0.91 |
| Accuracy | 0.88 | 0.91 | 0.85 | 0.88 | 0.83 | 0.85 | 0.93 | 0.90 | 0.89 | 0.90 | 0.94 | 0.95 | 0.94 | 0.95 |
| Sensitivity | 0.78 | 0.68 | 0.86 | 0.76 | 0.82 | 0.84 | 0.87 | 0.71 | 0.80 | 0.68 | 0.98 | 0.98 | 0.98 | 0.98 |
| Specificity | 0.98 | 0.98 | 0.84 | 0.91 | 0.85 | 0.85 | 0.98 | 0.96 | 0.98 | 0.97 | 0.84 | 0.85 | 0.84 | 0.85 |
| PPV | 0.98 | 0.90 | 0.84 | 0.73 | 0.84 | 0.64 | 0.98 | 0.84 | 0.97 | 0.87 | 0.95 | 0.95 | 0.95 | 0.95 |
| NPV | 0.81 | 0.91 | 0.85 | 0.93 | 0.82 | 0.95 | 0.89 | 0.91 | 0.83 | 0.91 | 0.91 | 0.94 | 0.91 | 0.94 |
| FNR | 0.22 | 0.32 | 0.14 | 0.24 | 0.18 | 0.16 | 0.13 | 0.29 | 0.20 | 0.32 | 0.02 | 0.02 | 0.02 | 0.02 |
| FPR | 0.02 | 0.02 | 0.16 | 0.09 | 0.15 | 0.15 | 0.02 | 0.04 | 0.02 | 0.03 | 0.16 | 0.15 | 0.16 | 0.15 |
| F1 | 0.86 | 0.78 | 0.85 | 0.74 | 0.83 | 0.73 | 0.92 | 0.77 | 0.88 | 0.76 | 0.96 | 0.97 | 0.96 | 0.97 |
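
For readers who want to check how the metrics in these tables relate to one another, the following is a minimal sketch that derives them from a binary confusion matrix using their standard definitions. It assumes that Hot-Spots are the positive class and that no denominator is zero. The function name and the example counts are hypothetical and are not taken from the paper.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Standard binary metrics from confusion-matrix counts.

    tp/fn: Hot-Spots predicted correctly / missed (positive class, assumed).
    tn/fp: Null-Spots predicted correctly / wrongly flagged as Hot-Spots.
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),             # positive predictive value
        "npv": tn / (tn + fn),             # negative predictive value
        "fpr": fp / (fp + tn),             # equals 1 - specificity
        "fnr": fn / (fn + tp),             # equals 1 - sensitivity
        "f1": 2 * tp / (2 * tp + fp + fn),
        "mcc": (tp * tn - fp * fn) / mcc_denominator,
    }


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=8, fp=2, tn=18, fn=2)
```

Because FPR and FNR are the complements of specificity and sensitivity, each pair of rows in a table can be checked against the other.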
Comparison of the performance of SpotOn with other common methods used for HS prediction for the full dataset.
| Metric | SpotOn | SBHD2 | Robetta | KFC2-A | KFC2-B | CPORT |
|---|---|---|---|---|---|---|
| AUROC | 0.91 | 0.69 | 0.62 | 0.66 | 0.67 | 0.54 |
| Sensitivity | 0.98 | 0.70 | 0.29 | 0.53 | 0.28 | 0.54 |
| Specificity | 0.84 | 0.71 | 0.88 | 0.81 | 0.96 | 0.47 |
| F1-score | 0.96 | 0.62 | 0.39 | 0.56 | 0.42 | 0.42 |
Figure 2. Collage of the results page of the SpotOn web server: screenshot of the WebGL structure viewer highlighting the hot-spot residues in the interface (top); table listing the residues classified as HS (middle); and sequence viewer highlighting the residues classified as HS and NS in the full sequence of the chains submitted for analysis (bottom).
Figure 3. Workflow of the SpotOn web server pipeline. Each box corresponds to a step in the pipeline, and the horizontal bars at the bottom of the image indicate the environment in which each step takes place. First, the user uploads the PDB file and defines the two monomers of the interface. After the user's credentials have been checked and the input data validated, the web server generates the run directory with all the necessary files. If validation fails, a message indicating the exact problem is displayed on screen. The master node of the Linux cluster where SpotOn is hosted monitors the directory where the run folders are located and, if neither the global maximum number of concurrent SpotOn jobs nor the maximum number of jobs per user has been reached, submits the analysis to the queue. Depending on the system load at the time of submission, the analysis may start immediately or after a short delay. The user is notified as soon as the job starts running. The run itself takes place on one of the worker nodes of the cluster and, upon completion, the user is notified via email.
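
The admission check described in this caption (a global cap on concurrent jobs plus a per-user cap) can be sketched as follows. This is a minimal illustration and not the SpotOn scheduler: the function name, the limit values, and the representation of jobs as `(job_id, user)` tuples are all assumptions.

```python
from collections import Counter

# Hypothetical limits; the actual values used by the server are not stated.
MAX_CONCURRENT_JOBS = 10
MAX_JOBS_PER_USER = 2


def select_jobs_to_submit(pending, running,
                          max_total=MAX_CONCURRENT_JOBS,
                          max_per_user=MAX_JOBS_PER_USER):
    """Return the ids of pending jobs that may be queued now.

    pending: list of (job_id, user) tuples in arrival order.
    running: list of (job_id, user) tuples already queued or running.
    """
    per_user = Counter(user for _, user in running)
    total = len(running)
    submitted = []
    for job_id, user in pending:
        if total >= max_total:
            break        # global limit reached: nothing more can start
        if per_user[user] >= max_per_user:
            continue     # this user is at the limit; try the next job
        submitted.append(job_id)
        per_user[user] += 1
        total += 1
    return submitted


# Example: one job by "alice" is already running; limits are 3 total, 2 per user.
queued = select_jobs_to_submit(
    pending=[("j1", "alice"), ("j2", "alice"), ("j3", "bob"), ("j4", "alice")],
    running=[("j0", "alice")],
    max_total=3,
    max_per_user=2,
)
```

Jobs that are skipped remain pending and would be reconsidered the next time the monitoring process scans the run directory.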