Quanya Liu1, Peng Chen2, Bing Wang3,3, Jun Zhang4, Jinyan Li5.
Abstract
BACKGROUND: Hot spot residues are functional sites in protein interaction interfaces. Identifying them with experimental methods is time-consuming and laborious, so many computational methods have been developed to predict hot spot residues. Most of these prediction methods are based on structural features, sequence characteristics, and/or other protein features.
Keywords: Ensemble learning; Hot spot residues; Protein-protein interaction
Year: 2018 PMID: 30598091 PMCID: PMC6311905 DOI: 10.1186/s12918-018-0665-8
Source DB: PubMed Journal: BMC Syst Biol ISSN: 1752-0509
Databases for hot spots prediction
| Data set | Positive samples (HS) | Negative samples (NHS) | Total |
|---|---|---|---|
| Training set (ASEdb) | 58 | 91 | 149 |
| Test set (BID) | 70 | 115 | 185 |
| Independent test (SKEMPI) | 120 | 234 | 354 |
| Independent test (dbMPIKT) | 106 | 384 | 490 |
| Independent test (Mix set) | 292 | 697 | 989 |
Fig. 1 Encoding schema for protein residues. The protein sequence is first converted to a numerical sequence using the 46 attributes of AAindex1. Each residue is then encoded using the autocorrelation function combined with a sliding window. Here, R_1 represents the 1st residue in the protein sequence, R_2 the 2nd residue, ..., and R_L the L-th residue; each residue belongs to one of the 20 common amino acid types
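The encoding described in Fig. 1 can be illustrated with a minimal Python sketch. It assumes a mean-centred autocovariance as the autocorrelation function, a symmetric window clipped at the sequence ends, and a placeholder property scale. `TOY_SCALE`, `half_width`, `max_lag`, and all function names are hypothetical; the exact formula and window size used by the authors are not given in this record.

```python
from statistics import fmean

# Placeholder scale (NOT a real AAindex1 entry): one arbitrary value per amino acid.
TOY_SCALE = {aa: i / 10 for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}


def window_values(seq, center, half_width, scale):
    """Property values of the residues in a window centred on `center`,
    clipped to the sequence ends."""
    start = max(0, center - half_width)
    end = min(len(seq), center + half_width + 1)
    return [scale[aa] for aa in seq[start:end]]


def autocorrelation(values, lag):
    """Mean-centred autocovariance at `lag` (assumed form of the function)."""
    n = len(values)
    if lag >= n:
        return 0.0
    mean = fmean(values)
    total = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return total / (n - lag)


def encode_residue(seq, center, scales, half_width=3, max_lag=2):
    """Feature vector for one residue: one autocorrelation value per
    (property scale, lag) pair, computed over the residue's window."""
    features = []
    for scale in scales:
        values = window_values(seq, center, half_width, scale)
        features.extend(autocorrelation(values, lag) for lag in range(1, max_lag + 1))
    return features


# Encode the 4th residue of a short toy sequence with a single property scale.
vector = encode_residue("ACDEFGH", center=3, scales=[TOY_SCALE])
print(vector)
```

With the 46 AAindex1 attributes in place of the single toy scale, the same call would yield 46 × `max_lag` values per residue.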
Fig. 2 The flowchart of our model
Fig. 3 Performance comparison of the model with different m values
Fig. 4 Performance comparison of three classifiers on ASEdb
Prediction performance of the top 83 classifiers on the training and test sets
| Data sets | ACC | SPE | RECALL | PRE | F1 | MCC |
|---|---|---|---|---|---|---|
| Train(ASEdb) | 0.9402 | 0.9627 | 0.9078 | 0.9426 | 0.9247 | 0.8759 |
| Test(BID) | 0.9150 | 0.9595 | 0.8471 | 0.9476 | 0.8941 | 0.8278 |
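The six measures reported in these tables follow the standard confusion-matrix definitions, which the sketch below computes. The counts in the example are hypothetical (chosen only to match the size of the BID test set, 70 hot spots and 115 non-hot spots) and are not the authors' results; `binary_metrics` is an illustrative name.

```python
from math import sqrt


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics for a binary classifier, with hot spots as the positive class."""
    pre = _ratio(tp, tp + fp)      # precision
    recall = _ratio(tp, tp + fn)   # sensitivity
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "SPE": _ratio(tn, tn + fp),
        "RECALL": recall,
        "PRE": pre,
        "F1": _ratio(2 * pre * recall, pre + recall),
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Hypothetical counts for a 185-residue test set (70 HS, 115 NHS).
for name, value in binary_metrics(tp=60, fp=5, tn=110, fn=10).items():
    print(f"{name}: {value:.4f}")
```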
Fig. 5 The ROC curves of the ensemble model with the top 83 classifiers on the training and test sets
Prediction performance of the model with the top 83 classifiers on different test sets
| Data sets | ACC | SPE | RECALL | PRE | F1 | MCC |
|---|---|---|---|---|---|---|
| Test(SKEMPI) | 0.9028 | 0.9268 | 0.8573 | 0.8590 | 0.8579 | 0.7843 |
| Test(dbMPIKT) | 0.9322 | 0.9616 | 0.8364 | 0.8618 | 0.8472 | 0.8052 |
| Test(Mix set) | 0.9183 | 0.9491 | 0.8503 | 0.8802 | 0.8644 | 0.8069 |
Fig. 6 The ROC curves of the ensembles of the top 83 classifiers for the SKEMPI, dbMPIKT and Mix sets
Comparison of different prediction methods on the BID test set
| Method | Features | ACC | F1 | PRE |
|---|---|---|---|---|
| HotPoint | Structural features | 0.72 | 0.49 | 0.55 |
| ppRF | B-factor, individual atomic contacts and the co-occurring contacts | 0.78 | 0.58 | 0.69 |
| HEP | Physicochemical, structural neighborhood features | 0.79 | 0.70 | 0.60 |
| PredHS | Structural neighborhood features | 0.88 | 0.76 | 0.79 |
| Hu’s method | Sequence features | 0.76 | 0.80 | 1.00 |
| Our method | Sequence features | 0.92 | 0.89 | 0.95 |
Performance comparison of different feature selection methods on the training set
| Method | ACC | SPE | RECALL | PRE | F1 | MCC |
|---|---|---|---|---|---|---|
| Our method | 0.9402 | 0.9627 | 0.9078 | 0.9426 | 0.9247 | 0.8759 |
| Hu’s feature selection | 0.9262 | 0.8959 | 0.9918 | 0.8174 | 0.8956 | 0.8494 |
Types and numbers of the base classifiers
| Classifier | Number | Features |
|---|---|---|
| KNN | 46 | 1-46 |
| SVM(RBF) | 37 | 1-3, 5-14, 18-31, 33, 34, 36, 37, 39, 40, 43-46 |
*The feature numbers correspond to the feature numbers in Additional file 1
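The feature ranges in this table add up to 46 KNN and 37 SVM entries, which matches the 83 base classifiers of the ensemble. The sketch below expands the ranges to check that count and shows one simple way such base predictions could be combined. Majority voting is an assumption here, since this record does not state the combination rule; `expand_ranges` and `majority_vote` are hypothetical helper names.

```python
def expand_ranges(spec):
    """Expand a string such as '1-3, 5, 7-8' into a list of feature numbers."""
    features = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            features.extend(range(int(low), int(high) + 1))
        else:
            features.append(int(part))
    return features


# Feature numbers copied from the table above.
KNN_FEATURES = expand_ranges("1-46")
SVM_FEATURES = expand_ranges("1-3, 5-14, 18-31, 33, 34, 36, 37, 39, 40, 43-46")


def majority_vote(predictions):
    """Combine per-classifier label lists (1 = hot spot, 0 = non-hot spot).

    A residue is called a hot spot when more than half of the base
    classifiers vote 1 (assumed rule, not confirmed by the source).
    """
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all classifiers must predict the same residues")
    return [1 if 2 * sum(votes) > len(votes) else 0 for votes in zip(*predictions)]


print(len(KNN_FEATURES), len(SVM_FEATURES))  # sizes of the two classifier groups

# Three toy base classifiers voting on three residues.
print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))
```

Because 83 is odd, a strict majority rule never produces a tie for the full ensemble.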
Fig. 7 Correlation coefficient heat map of 37 features
Classification of the AAindex1 properties
| Category | AAindex1 attributes |
|---|---|
| Alpha and Turn propensities | GEIM800103, CHAM83102, QIAN880129, ROBB760111, RICJ880114, RACS820104, QIAN880117, WOLS870103, FASG760104, ISOY800106, ROBB760107, QIAN880139, QIAN880113, RICJ880117, SNEP660104, VASM830101, BUAN790103 |
| Hydrophobicity | NAKH900113, QIAN880128, PRAM820101, KHAG800101, SUEM840102, WERD780103, RICJ880104, VASM830102, ROSM880103, RICJ880105, ISOY800107, RACS820103, JOND750102, TANS770108, KLEP840101, VELV850101 |
| Physicochemical properties | JOND920102, QIAN880113 |
| Add properties | GERO01103, NADH010107, AURR980118, AURR980120, WILM950104, GEOR030107, GERO01103 |
*The second column gives the number of each attribute in AAindex1
Fig. 8 The cluster dendrogram of the 46 descriptors
Fig. 9 Visualization of the prediction performance for PDB ID: 1DVA (chain H and chain X). Hot spots are shown in red and non-hot spots in blue. a Experimentally verified data from BID. b Prediction results of our model. Correctly predicted hot spots are colored red, and correctly predicted non-hot spots are colored blue. The residue in yellow (E70 for our method) is a non-hot spot wrongly predicted to be a hot spot. c Prediction results of Hu’s method. Correctly predicted hot spots are colored red, and correctly predicted non-hot spots are colored blue. The residues in yellow (G38, E70 and L153) are non-hot spots wrongly predicted to be hot spots