Yuliang Pan, Shuigeng Zhou, Jihong Guan.
Abstract
BACKGROUND: Protein-DNA interactions govern a large number of cellular processes, and they can be altered by a small fraction of interface residues, the so-called hot spots, which account for most of the interface binding free energy. Accurate prediction of hot spots is critical to understanding the principles of protein-DNA interactions. Some computational methods can already predict a large number of hot-spot residues accurately and efficiently. However, the scarcity of experimentally validated hot-spot residues in protein-DNA complexes and the low diversity of the employed features limit the performance of existing methods.
Keywords: Ensemble stacking classifier; Feature selection; Hot spots; Protein-DNA complexes
Year: 2020 PMID: 32938375 PMCID: PMC7495898 DOI: 10.1186/s12859-020-03675-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The workflow of PreHots
Performance comparison between ESC and five existing classifiers
| Method | ACC | SEN | SPE | PRE | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|
| RF | 0.683 | 0.696 | 0.684 | 0.687 | 0.669 | 0.374 | 0.758 |
| SVM | 0.685 | 0.673 | 0.695 | 0.670 | 0.665 | 0.366 | 0.793 |
| CatBoost | 0.722 | 0.731 | 0.726 | 0.734 | 0.721 | 0.455 | 0.806 |
| GTB | 0.711 | 0.743 | 0.733 | 0.718 | 0.705 | 0.468 | 0.816 |
| EVC | 0.725 | 0.741 | 0.721 | 0.699 | 0.694 | 0.446 | 0.826 |
| ESC | 0.783 | 0.795 | 0.753 | 0.784 | 0.782 | 0.562 | 0.833 |
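The column headers in these tables are the usual binary-classification metrics. As a reading aid, the sketch below shows their standard textbook definitions, assuming hot spots are the positive class. The function names `binary_metrics` and `auc` are illustrative and not taken from the paper, and the paper's exact evaluation protocol (for example, averaging over cross-validation folds) is not reproduced here.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (hot spot = positive class)."""
    acc = (tp + tn) / (tp + fp + tn + fn)               # ACC: accuracy
    sen = tp / (tp + fn) if tp + fn else 0.0            # SEN: sensitivity (recall)
    spe = tn / (tn + fp) if tn + fp else 0.0            # SPE: specificity
    pre = tp / (tp + fp) if tp + fp else 0.0            # PRE: precision
    f1 = 2 * pre * sen / (pre + sen) if pre + sen else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation
    return {"ACC": acc, "SEN": sen, "SPE": spe, "PRE": pre, "F1": f1, "MCC": mcc}


def auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy counts, not data from the paper.
metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Unlike the other six metrics, AUC cannot be derived from a single confusion matrix, because it depends on how the classifier ranks all residues by predicted score.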
Fig. 2 The change of E value in the process of stepwise feature selection
Performance comparison between SBS and four existing feature selection methods
| Method | ACC | SEN | SPE | PRE | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|
| RF (28) | 0.744 | 0.739 | 0.749 | 0.716 | 0.715 | 0.483 | 0.823 |
| RFE (20) | 0.739 | 0.723 | 0.730 | 0.719 | 0.718 | 0.452 | 0.830 |
| mRMR (25) | 0.755 | 0.787 | 0.746 | 0.766 | 0.761 | 0.531 | 0.835 |
| HSIC Lasso (30) | 0.740 | 0.777 | 0.727 | 0.746 | 0.744 | 0.500 | 0.841 |
| SBS (19) | 0.767 | 0.784 | 0.766 | 0.776 | 0.741 | 0.535 | 0.853 |
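To make the stepwise procedure concrete, here is a minimal sketch that assumes SBS denotes sequential backward selection: start from the full feature set, repeatedly drop the feature whose removal gives the best score, and keep the best subset seen. The `score` callback is a hypothetical stand-in for the selection criterion (the E value of Fig. 2), whose definition is not given in this record. The toy scoring function is invented purely for illustration.

```python
def sequential_backward_selection(features, score, min_features=1):
    """Greedy backward elimination; returns (best_subset, best_score, history)."""
    current = list(features)
    best_subset, best_score = list(current), score(current)
    history = [(list(current), best_score)]
    while len(current) > min_features:
        # Score every subset obtained by dropping exactly one feature.
        candidates = [
            (score([f for f in current if f != drop]), drop) for drop in current
        ]
        step_score, dropped = max(candidates, key=lambda c: c[0])
        current.remove(dropped)
        history.append((list(current), step_score))
        # Remember the smallest subset that is at least as good as the best so far.
        if step_score >= best_score:
            best_subset, best_score = list(current), step_score
    return best_subset, best_score, history


# Toy criterion (an assumption, not the paper's E value): reward informative
# features and penalise subset size.
informative = {"a", "b"}


def toy_score(subset):
    return len(informative & set(subset)) - 0.1 * len(subset)


best_subset, best_score, history = sequential_backward_selection(
    ["a", "b", "c", "d"], toy_score
)
```

In practice `score` would typically wrap a cross-validated classifier evaluation on the candidate subset, which makes each elimination step cost one model fit per remaining feature.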
The rankings of the 19 selected features
| Rank | Feature | Type | Rank | Feature | Type |
|---|---|---|---|---|---|
| 1 | PSSM(R) | Sequence | 11 | Lse score | Sequence |
| 2 | H-Bond in HBPLUS | Structure | 12 | phi in SPOT-1D | Sequence |
| 3 | ALPHA in Xssp | Structure | 13 | COMBINED2 score in ENDES | Structure |
| 4 | Current_flow_closeness_centrality | Network | 14 | ACC in Xssp | Structure |
| 5 | Q3_prob_3 in NetSurfp2 | Sequence | 15 | RSA in NetSurfp2 | Sequence |
| 6 | HSEa-u in SPOT-1D | Exposure | 16 | P(8-G) in SPOT-1D | Sequence |
| 7 | COMBINED1 score in ENDES | Structure | 17 | CN in hsexpo | Exposure |
| 8 | SIDESCORE score in ENDES | Structure | 18 | Conservation score | Sequence |
| 9 | Q8_prob_1 in NetSurfp2 | Sequence | 19 | Blosum(E) | Sequence |
| 10 | P(8-I) in SPOT-1D | Sequence | | | |
These features fall into four types, i.e., network features, exposure features, structure features and sequence features
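The per-type totals can be tallied directly from the ranking table. The short sketch below transcribes the Type column in rank order (1 to 19) and counts each type. The variable names are illustrative.

```python
from collections import Counter

# Feature types in rank order 1..19, transcribed from the ranking table.
ranked_feature_types = [
    "Sequence", "Structure", "Structure", "Network", "Sequence",
    "Exposure", "Structure", "Structure", "Sequence", "Sequence",
    "Sequence", "Sequence", "Structure", "Structure", "Sequence",
    "Sequence", "Exposure", "Sequence", "Sequence",
]

type_counts = Counter(ranked_feature_types)
for feature_type, count in type_counts.most_common():
    print(f"{feature_type}: {count}")
```

By this count the selected set holds 10 sequence, 6 structure, 2 exposure and 1 network feature.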
Fig. 3 The number of features of each type in the optimal feature set
Performance comparison between our method and four existing methods on the benchmark dataset
| Method | ACC | SEN | SPE | PRE | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|
| PreHots | 0.789 | 0.813 | 0.801 | 0.785 | 0.784 | 0.597 | 0.868 |
| PrPDH | 0.683 | 0.667 | 0.700 | 0.690 | 0.678 | 0.367 | 0.779 |
| PremPDI | 0.756 | 0.711 | 0.800 | 0.780 | 0.744 | 0.513 | 0.790 |
| mCSM-NA | 0.461 | 0.056 | 0.867 | 0.284 | 0.093 | -0.133 | 0.314 |
| SAMPDI | 0.544 | 0.444 | 0.644 | 0.556 | 0.494 | 0.091 | 0.522 |
Performance comparison between our method and four existing methods on the independent dataset
| Method | ACC | SEN | SPE | PRE | F1 | MCC | AUC |
|---|---|---|---|---|---|---|---|
| PreHots | 0.788 | 0.818 | 0.766 | 0.711 | 0.761 | 0.576 | 0.820 |
| PrPDH | 0.600 | 0.545 | 0.638 | 0.514 | 0.529 | 0.182 | 0.628 |
| PremPDI | 0.463 | 0.333 | 0.553 | 0.344 | 0.338 | -0.114 | 0.411 |
| mCSM-NA | 0.563 | 0.121 | 0.872 | 0.400 | 0.186 | -0.010 | 0.472 |
| SAMPDI | 0.545 | 0.272 | 0.727 | 0.400 | 0.324 | 0.000 | 0.525 |
Fig. 4 The hot spot residues of the λexo-DNA complex (PDB ID: 3SM4) identified by experiments. The green surface denotes the protein chain (chain A), while the purple and yellow surfaces represent the DNA chains (purple: chain E; yellow: chain D). Red marks experimentally identified hot spot residues and blue marks experimentally determined non-hot spot residues
Fig. 5 The hot spot residues of the DNA-bound SUP-12 (28−121) complex (PDB ID: 4CH1) identified by experiments. The green surface denotes the protein chain (chain A), while the purple surface represents the DNA chain (chain B). Red marks experimentally identified hot spot residues and blue marks experimentally determined non-hot spot residues