Bas Stringer, Hans de Ferrante, Sanne Abeln, Jaap Heringa, K Anton Feenstra, Reza Haydarlou.
Abstract
MOTIVATION: The interactions between proteins and other molecules are essential to many biological and cellular processes. Experimental identification of interface residues is time-consuming, costly and challenging, while protein sequence data is ubiquitous. Consequently, many computational and machine learning approaches have been developed over the years to predict such interface residues from sequence. However, the effectiveness of different deep learning architectures and learning strategies for protein-protein, protein-nucleotide and protein-small molecule interface prediction has not yet been investigated in great detail. Therefore, we here explore the prediction of protein interface residues using six deep learning architectures and various learning strategies with sequence-derived input features.
Year: 2022 PMID: 35150231 PMCID: PMC9004643 DOI: 10.1093/bioinformatics/btac071
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Generation of the BioDL dataset from the PDB and BioLip databases
Fig. 2. Training and testing procedure of our predictors (see Section 2.5 for an explanation of the procedure)
Impact of different architectural building blocks on the performance of the dnet_hhc PPI predictor trained on HHC_TR and tested on HHC_TE
| Model | ACC | SPEC | F1 | MCC | AP | AUC | ΔAUC |
|---|---|---|---|---|---|---|---|
| dnet_hhc | 0.784 | | 0.403 | 0.272 | 0.381 | 0.733 | 0 |
| hu → gn | 0.783 | 0.868 | 0.401 | 0.269 | | 0.730 | –0.003 |
| ce → mse | | 0.868 | | | 0.391 | 0.728 | –0.005 |
| 1d → 2d | 0.781 | 0.866 | 0.394 | 0.261 | 0.379 | 0.723 | –0.010 |
| pre → rel | 0.780 | 0.866 | 0.392 | 0.258 | 0.390 | 0.720 | –0.013 |
| + mp | 0.784 | 0.868 | 0.403 | 0.272 | 0.387 | 0.718 | –0.015 |
| oh → pv | 0.774 | 0.862 | 0.373 | 0.235 | 0.358 | 0.714 | –0.019 |
| − bn | 0.770 | 0.860 | 0.364 | 0.224 | 0.327 | 0.696 | –0.037 |
| − pa | 0.750 | 0.848 | 0.309 | 0.157 | 0.267 | 0.661 | –0.072 |
| − do | 0.752 | 0.849 | 0.314 | 0.163 | 0.291 | 0.646 | –0.087 |
Note: ‘hu → gn’: GlorotNormal kernel initialization used instead of HeUniform; ‘ce → mse’: MeanSquaredError loss function used instead of CrossEntropy; ‘1d → 2d’: 2D spatial form used instead of 1D; ‘pre → rel’: ReLU activation function used instead of PReLU; ‘+ mp’: MaxPooling layer used; ‘oh → pv’: ProtVec encoding used instead of One-Hot; ‘− bn’: no BatchNormalization layer used; ‘− pa’: no padding used; ‘− do’: no Dropout layer used. Highest score per metric indicated in bold.
P < 0.05.
P < 0.0005.
Performance of the ensemble PPI predictor ensnet_p compared with all other predictors trained on BioDL_P_TR and tested on BioDL_P_TE
| Model | ACC | SPEC | F1 | MCC | AP | AUC |
|---|---|---|---|---|---|---|
| | | | | | | |
| | 0.834 | 0.905 | 0.312 | 0.218 | 0.276 | 0.739 |
| | 0.833 | 0.905 | 0.310 | 0.215 | 0.276 | 0.736 |
| | 0.833 | 0.905 | 0.309 | 0.215 | 0.279 | 0.735 |
| | 0.832 | 0.904 | 0.303 | 0.208 | 0.273 | 0.733 |
| | 0.833 | 0.905 | 0.309 | 0.214 | 0.270 | 0.729 |
| | 0.829 | 0.903 | 0.292 | 0.196 | 0.253 | 0.717 |
Note: Highest score per metric indicated in bold; AUC differences >0.10 are P < 0.05.
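The threshold-dependent columns in these tables (ACC, SPEC, F1, MCC) can all be derived from a confusion matrix over residues. The following is a minimal sketch using the standard textbook definitions, assuming binary labels where 1 marks an interface residue; the function names and the toy labels are illustrative and are not taken from the paper.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = interface residue)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def threshold_metrics(y_true, y_pred):
    """Return ACC, SPEC, F1 and MCC using their standard definitions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SPEC": spec, "F1": f1, "MCC": mcc}


# Toy example (hypothetical labels): 3 interface residues out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 1, 0]
metrics = threshold_metrics(y_true, y_pred)
# ACC = 0.8, SPEC = 6/7, F1 = 2/3, MCC = 11/21
```

Because interface residues are a minority class, ACC and SPEC can stay high even for weak predictors, which is one reason MCC, AP and AUC are reported alongside them.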
Fig. 3. (A) ROC and (B) P/R plots of all six architecture models and the ensemble models, trained on BioDL_P_TR and tested on BioDL_P_TE PPI data. The ensnet_p clearly outperforms the six architecture models in the ROC plot, and in the P/R plot only rnet_p and rnn_p yield somewhat higher precision (∼0.6) at very low recall (0.01–0.02)
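The AUC and AP columns summarize the ROC and P/R curves, respectively, as single numbers. As a rough illustration, the sketch below computes AUC-ROC through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative) and AP as the mean precision at the rank of each positive. It assumes per-residue scores without ties for AP, and it is not the evaluation code used by the authors; all names and values are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC-ROC as P(score of random positive > score of random negative).

    Ties between a positive and a negative count as half a win.
    Quadratic in the number of residues, so only suited to small examples.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy example (hypothetical scores for five residues).
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
auc = roc_auc(y_true, scores)           # 5 of 6 positive/negative pairs ordered correctly
ap = average_precision(y_true, scores)  # (1/1 + 2/3) / 2
```

For real datasets a library routine with tie handling and an efficient sort-based implementation would be the more sensible choice.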
Performance of ensnet_a, trained on the generic BioDL_A_TR dataset, compared with the ensnet_p, ensnet_s and ensnet_n models trained on type-specific datasets containing protein, small-molecule or nucleotide interaction interfaces, scored on the interaction-specific test sets as indicated
| Model | Test set | ACC | SPEC | F1 | MCC | AP | AUC |
|---|---|---|---|---|---|---|---|
| | | 0.828 | 0.902 | 0.289 | 0.192 | 0.248 | 0.733 |
| | | 0.840 | 0.909 | 0.339 | 0.249 | 0.302 | |
| | | 0.937 | 0.967 | 0.339 | 0.306 | 0.289 | 0.826 |
| | | 0.944 | 0.970 | 0.413 | 0.384 | 0.388 | |
| | | 0.901 | 0.947 | 0.272 | 0.219 | 0.238 | 0.835 |
| | | 0.921 | 0.957 | 0.418 | 0.376 | 0.399 | |
Note: Highest AUC per metric per test set indicated in bold (P < 1e−6).
Performance comparison of our ensnet models and other state-of-the-art sequence-based interaction prediction methods on applicable test sets
| Model | Test set | ACC | SPEC | F1 | MCC | AP | AUC |
|---|---|---|---|---|---|---|---|
| Protein–protein interaction (PPI) | | | | | | | |
| | | 0.767 | 0.849 | 0.485 | | 0.491 | |
| SeRenDIP | | | | | 0.277 | | 0.724 |
| | | 0.849 | 0.916 | 0.197 | 0.114 | 0.155 | |
| SeRenDIP | | | | | | | 0.636 |
| | | 0.785 | 0.870 | | | | |
| SCRIBER | | n.a. | | 0.333 | 0.230 | 0.287 | 0.715 |
| SSWRF | | n.a. | 0.891 | 0.287 | 0.178 | 0.256 | 0.687 |
| CRFPPI | | n.a. | 0.887 | 0.266 | 0.154 | 0.238 | 0.681 |
| LORIS | | n.a. | 0.887 | 0.263 | 0.151 | 0.228 | 0.656 |
| SPRINGS | | n.a. | 0.882 | 0.229 | 0.111 | 0.201 | 0.625 |
| PSIVER | | n.a. | 0.874 | 0.191 | 0.066 | 0.170 | 0.581 |
| SPRINT | | n.a. | 0.873 | 0.183 | 0.057 | 0.167 | 0.570 |
| SPPIDER | | n.a. | 0.870 | 0.198 | 0.071 | 0.159 | 0.517 |
| Protein–small molecule interaction | | | | | | | |
| | | | | | | | |
| SCRIBER | | 0.874 | 0.931 | 0.278 | 0.209 | 0.259 | 0.706 |
| Protein–DNA/RNA interaction | | | | | | | |
| | | | | | | | |
| DRNApred (DNA) | | 0.830 | 0.903 | 0.294 | 0.198 | 0.240 | 0.609 |
| DRNApred (RNA) | | 0.814 | 0.894 | 0.230 | 0.124 | 0.248 | 0.547 |
Note: Highest scores per metric per test set indicated in bold; confidence for difference in AUC-ROC with runner-up.
Metrics according to Zhang and Kurgan (2019) on their ZK448 test set.
P < 0.05.
P < 0.005.
Fig. 4. The 15 most important features, ranked by SHAP values for 5000 randomly selected amino acids in ZK448_P. A SHAP value indicates the contribution of a feature to a residue’s interface prediction. Colors represent the input values of a feature: blue for low and red for high values. The width of the distribution of a feature’s SHAP values shows its relative importance across the sampled residues. For aggregated features only the sum is shown, e.g. WM_PC is the sum of 3_wm_PC, 5_wm_PC, 7_wm_PC and 9_wm_PC
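The aggregation described in the Fig. 4 caption can be sketched as follows: per-residue SHAP values of features that differ only in their numeric window prefix are summed into one `WM_*` feature, and features are then ranked. This is a hedged illustration, not the authors' code. It assumes feature names follow the `<window>_wm_<name>` pattern seen in the caption, ranks by mean absolute SHAP value (a common convention that the caption does not confirm), and uses a hypothetical extra feature called `length` with made-up values.

```python
import re

# Matches names such as "3_wm_PC" and captures the part after "_wm_".
WINDOW_PATTERN = re.compile(r"^\d+_wm_(.+)$")


def aggregate_window_features(shap_values):
    """Sum per-residue SHAP values over features differing only in window size.

    shap_values maps a feature name to a list with one SHAP value per residue.
    """
    aggregated = {}
    for name, values in shap_values.items():
        match = WINDOW_PATTERN.match(name)
        key = f"WM_{match.group(1)}" if match else name
        if key in aggregated:
            aggregated[key] = [a + v for a, v in zip(aggregated[key], values)]
        else:
            aggregated[key] = list(values)
    return aggregated


def rank_by_mean_abs(shap_values, top=15):
    """Rank features by mean absolute SHAP value, largest first."""
    importance = {
        name: sum(abs(v) for v in values) / len(values)
        for name, values in shap_values.items()
    }
    return sorted(importance, key=importance.get, reverse=True)[:top]


# Toy input (made-up SHAP values for two residues).
shap_values = {
    "3_wm_PC": [0.10, -0.20],
    "5_wm_PC": [0.05, 0.00],
    "7_wm_PC": [0.00, 0.10],
    "9_wm_PC": [0.05, 0.10],
    "length": [0.30, -0.10],  # hypothetical non-windowed feature
}
aggregated = aggregate_window_features(shap_values)
ranking = rank_by_mean_abs(aggregated)
```

Summing before ranking means that contributions of opposite sign from different window sizes can cancel for a given residue, so an aggregated feature may rank lower than its individual components would suggest.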