Aijun Deng, Huan Zhang, Wenyan Wang, Jun Zhang, Dingdong Fan, Peng Chen, Bing Wang.
Abstract
The study of protein-protein interactions is of great biological significance, and predicting protein-protein interaction sites can advance the understanding of cellular activity and aid drug development. However, interaction and non-interaction sites are usually unevenly distributed because only a small number of protein interactions have been confirmed experimentally, which greatly reduces the predictive capability of computational methods. In this work, two imbalanced-data processing strategies based on the XGBoost algorithm are proposed to re-balance the original dataset, using the inherent relationship between positive and negative samples, for the prediction of protein-protein interaction sites. A feature extraction method based on the evolutionary conservation of proteins was applied to represent the protein interaction sites, and the influence of overlapping regions between positive and negative samples on prediction performance was considered. Our method achieved a prediction accuracy of 0.807 and an MCC of 0.614 on an original dataset with 10,455 surface residues, of which only 2297 are interface residues. These experimental results demonstrate the effectiveness of our XGBoost-based method.
Keywords: XGBoost; overlapping regions; protein interaction sites; unbalanced data sets
Year: 2020 PMID: 32218345 PMCID: PMC7178137 DOI: 10.3390/ijms21072274
Source DB: PubMed Journal: Int J Mol Sci ISSN: 1422-0067 Impact factor: 5.923
Sample numbers in the original and re-balanced datasets.
| Dataset | Positive samples | Negative samples |
|---|---|---|
| Original data | 2297 | 8158 |
| RENN | 2297 | 2131 |
| IHT | 2297 | 2297 |
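
The RENN row reflects repeated editing of the majority (negative) class until no further samples are removed. The following is a minimal pure-Python sketch of that idea, not the authors' implementation: the function names, the toy 2-D points, the choice of `k=3`, and the majority-vote removal rule are all assumptions made for illustration. In practice the features would be the evolutionary-conservation descriptors mentioned in the abstract.

```python
from collections import Counter
from math import dist


def enn_pass(X, y, majority_label, k=3):
    """One edited-nearest-neighbours pass over the majority class (illustrative)."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        if yi != majority_label:
            keep.append(i)  # minority samples are never removed
            continue
        # Indices of the k nearest other samples by Euclidean distance.
        others = (j for j in range(len(X)) if j != i)
        neighbours = sorted(others, key=lambda j: dist(xi, X[j]))[:k]
        votes = Counter(y[j] for j in neighbours)
        # Keep the sample only if most of its neighbours share its label
        # (use an odd k with two classes to avoid ties).
        if votes.most_common(1)[0][0] == yi:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def repeated_enn(X, y, majority_label, k=3, max_iter=100):
    """Repeat the ENN pass until a pass removes nothing (or max_iter is hit)."""
    for _ in range(max_iter):
        X_new, y_new = enn_pass(X, y, majority_label, k)
        if len(y_new) == len(y):
            break
        X, y = X_new, y_new
    return X, y


# Hypothetical toy data: four positives, and six negatives of which one
# lies inside the positive cluster (an "overlapping region" sample).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (5, 5), (5, 6), (6, 5), (6, 6), (7, 7)]
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

X_res, y_res = repeated_enn(X, y, majority_label=0)
print(Counter(y_res))
```

Because this rule only removes negatives that sit among positives, the resulting class sizes need not be equal, which is consistent with the RENN row ending at 2131 negatives rather than exactly 2297.
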
Prediction performance of the two re-balanced models.
| Model | Acc | F-measure | Pre | Spe | Sen | MCC |
|---|---|---|---|---|---|---|
| RENN-XGB | 0.707 | 0.731 | 0.699 | 0.645 | 0.765 | 0.454 |
| IHT-XGB | 0.807 | 0.808 | 0.804 | 0.802 | 0.812 | 0.614 |
Figure 1. Performance comparison between the original and re-balanced datasets.
Classification results of the two sampling methods and the imbalanced dataset.
| Dataset | Positive samples | Negative samples | TP | TN | FP | FN |
|---|---|---|---|---|---|---|
| Imbalanced data | 2297 | 8158 | 5 | 8151 | 25 | 2249 |
| RENN | 2297 | 2131 | 1758 | 1376 | 755 | 539 |
| IHT | 2291 | 2297 | 1864 | 1844 | 454 | 432 |
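
The reported metrics can be recomputed from confusion-matrix counts using the standard definitions of accuracy, precision, sensitivity, specificity, F-measure, and MCC. The sketch below is illustrative only: the function name `binary_metrics` and the dictionary keys are assumptions, and the counts are the TP, TN, FP, and FN values of the IHT and imbalanced rows.

```python
from math import sqrt


def binary_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pre = tp / (tp + fp) if tp + fp else 0.0          # precision
    sen = tp / (tp + fn) if tp + fn else 0.0          # sensitivity (recall)
    spe = tn / (tn + fp) if tn + fp else 0.0          # specificity
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0   # F-measure
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"acc": acc, "f": f1, "pre": pre, "spe": spe, "sen": sen, "mcc": mcc}


# Counts taken from the IHT and imbalanced-data rows.
iht = binary_metrics(tp=1864, tn=1844, fp=454, fn=432)
imbalanced = binary_metrics(tp=5, tn=8151, fp=25, fn=2249)

for name, m in (("IHT", iht), ("Imbalanced", imbalanced)):
    print(name, {key: round(value, 3) for key, value in m.items()})
```

For the IHT counts, the computed values agree with the IHT-XGB performance figures to about three decimal places. For the imbalanced counts, accuracy stays near 0.78 while sensitivity is close to zero, which illustrates why accuracy alone is misleading on skewed data and why MCC is reported.
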
Figure 2. Prediction performance comparison of the IHT-XGB method with four previous works.
Prediction performance on benchmark datasets.
| Dataset | Method | Acc | F-measure | Pre | Spe | Sen | MCC |
|---|---|---|---|---|---|---|---|
| Dset_186 | SSWRF | 0.679 | 0.386 | 0.322 | 0.697 | 0.581 | 0.234 |
| Dset_186 | LORIS | 0.604 | 0.384 | 0.287 | 0.586 | | 0.221 |
| Dset_186 | PSIVER | 0.673 | 0.353 | 0.306 | 0.743 | 0.416 | 0.151 |
| Dset_186 | SCRIBER | 0.78 | 0.279 | 0.279 | 0.87 | 0.279 | 0.15 |
| Dset_186 | DELPHI | | 0.353 | 0.353 | | 0.352 | 0.235 |
| Dset_186 | IHT_XGB | 0.716 | | | 0.788 | 0.644 | |
| Dset_72 | SSWRF | 0.648 | 0.351 | 0.267 | 0.643 | 0.654 | 0.224 |
| Dset_72 | LORIS | 0.614 | 0.324 | 0.238 | 0.610 | 0.631 | 0.177 |
| Dset_72 | PSIVER | 0.661 | 0.278 | 0.25 | 0.693 | 0.465 | 0.135 |
| Dset_72 | SCRIBER | 0.837 | 0.232 | 0.232 | 0.909 | 0.232 | 0.141 |
| Dset_72 | DELPHI | | 0.275 | 0.276 | | 0.274 | 0.189 |
| Dset_72 | IHT_XGB | 0.702 | | | 0.741 | | |
| Dset_164 | SSWRF | 0.621 | 0.365 | 0.323 | 0.656 | 0.527 | 0.152 |
| Dset_164 | LORIS | 0.588 | 0.323 | 0.263 | 0.609 | 0.538 | 0.111 |
| Dset_164 | PSIVER | 0.596 | 0.295 | 0.253 | 0.634 | 0.464 | 0.078 |
| Dset_164 | SCRIBER | 0.756 | 0.327 | 0.327 | 0.851 | 0.327 | 0.179 |
| Dset_164 | DELPHI | | 0.332 | 0.332 | | 0.332 | 0.184 |
| Dset_164 | IHT_XGB | 0.733 | | | 0.795 | | |
Figure 3. Visualization of predictions by the proposed methods. (A) and (B) show the cartoon and sphere representations of 1a4y-a; (C) and (D) show the predictions of the RENN-XGB and IHT-XGB methods, where green, red, yellow, and blue balls represent TP, TN, FP, and FN predictions, respectively.
Figure 4. Schematic of the RENN and IHT algorithms. Circles denote positive samples, and stars denote negative samples.
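
As a rough illustration of the IHT (instance hardness threshold) idea, the sketch below scores each negative sample by how likely a classifier is to assign it its true label and keeps only the easiest negatives, as many as there are positives. This is a simplified stand-in, not the authors' procedure: a leave-one-out k-nearest-neighbour estimate replaces the cross-validated classifier normally used for hardness, and the function names, toy points, and `k=3` are assumptions.

```python
from math import dist


def knn_true_class_prob(X, y, i, k=3):
    """Leave-one-out estimate: share of the k nearest neighbours with sample i's label."""
    others = (j for j in range(len(X)) if j != i)
    neighbours = sorted(others, key=lambda j: dist(X[i], X[j]))[:k]
    return sum(y[j] == y[i] for j in neighbours) / len(neighbours)


def instance_hardness_undersample(X, y, majority_label, k=3):
    """Keep all minority samples plus an equal number of the easiest majority samples."""
    minority = [i for i, label in enumerate(y) if label != majority_label]
    majority = [i for i, label in enumerate(y) if label == majority_label]
    # Rank majority samples from easiest (high true-class probability) to hardest.
    ranked = sorted(majority,
                    key=lambda i: knn_true_class_prob(X, y, i, k),
                    reverse=True)
    kept = sorted(minority + ranked[:len(minority)])
    return [X[i] for i in kept], [y[i] for i in kept]


# Hypothetical toy data: three positives and six negatives, one of which
# sits inside the positive cluster and is therefore "hard".
X = [(0, 0), (0, 1), (1, 0),
     (0.3, 0.3), (5, 5), (5, 6), (6, 5), (6, 6), (4, 4)]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0]

X_bal, y_bal = instance_hardness_undersample(X, y, majority_label=0)
print(len(X_bal), y_bal.count(1), y_bal.count(0))
```

Selecting a fixed number of the easiest negatives is one way to arrive at exactly balanced classes, in contrast to a neighbourhood-editing rule, which stops wherever the data dictate.
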
Figure 5. Flowchart of our method.