Carter Knutson1, Mridula Bontha1, Jenna A Bilbrey1, Neeraj Kumar2.
Abstract
Protein-ligand interactions (PLIs) are essential for biochemical functionality, and their identification is crucial for estimating biophysical properties for rational therapeutic design. Experimental characterization of these properties is currently the most accurate method, but it is time-consuming and labor-intensive. A number of computational methods have been developed in this context, but most existing PLI prediction methods depend heavily on 2D protein sequence data. Here, we present a novel parallel graph neural network (GNN) that integrates knowledge representation and reasoning for PLI prediction, performing deep learning guided by expert knowledge and informed by 3D structural data. We develop two distinct GNN architectures: [Formula: see text] is the base implementation that employs distinct featurization to enhance domain-awareness, while [Formula: see text] is a novel implementation that can predict with no prior knowledge of the intermolecular interactions. A comprehensive evaluation demonstrated that the GNNs can successfully capture the binary interactions between a ligand and a protein's 3D structure, with test accuracies of 0.979 for [Formula: see text] and 0.958 for [Formula: see text] in predicting the activity of a protein-ligand complex. These models are further adapted for regression tasks to predict experimental binding affinities and [Formula: see text], which are crucial for a compound's potency and efficacy. We achieve Pearson correlation coefficients of 0.66 and 0.65 on experimental affinity and 0.50 and 0.51 on [Formula: see text] with [Formula: see text] and [Formula: see text], respectively, outperforming similar 2D sequence-based models. Our method can serve as an interpretable and explainable artificial intelligence (AI) tool for predicting the activity, potency, and biophysical properties of lead candidates. To this end, we show the utility of [Formula: see text] on SARS-CoV-2 protein targets by screening a large compound library and comparing the predictions with experimentally measured data.
Year: 2022 PMID: 35538084 PMCID: PMC9086424 DOI: 10.1038/s41598-022-10418-2
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Features associated with each atom in the protein and ligand for the Graph-CNN model described by Torng et al.[28], the GNN described by Lim et al.[32], and the two GNN models described in this contribution.
| Feature | Graph-CNN, Torng and Altman[28]: protein | Graph-CNN, Torng and Altman[28]: ligand | GNN, Lim et al.[32]: protein | GNN, Lim et al.[32]: ligand | Current work: protein | Current work: ligand |
|---|---|---|---|---|---|---|
| Atom type | x | x | x | x | x | x |
| Atom degree | x | x | x | x | x | |
| x | x | x | x | x | ||
| Implicit valence | x | x | x | x | x | |
| Aromaticity | x | x | x | x | x | |
| Atom in ring | x | |||||
| Residue type | x | |||||
| Hybridization | x | |||||
| Formal charge | x | |||||
| Single bond^a | x | x^b | ||||
| Double bond^a | x | x^b | ||||
| Triple bond^a | x | x^b | ||||
| Bond aromaticity^a | x | x^b | ||||
| Conjugation^a | x | x^b | ||||
| Bond in ring^a | x | x^b | ||||
Features are associated with the atom unless otherwise noted.
^a Bond feature. ^b The bond feature is considered indirectly through a corresponding atom-level feature that captures the same physical property.
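The atom-level features listed in the table (atom type, degree, implicit valence, aromaticity) are typically concatenated into one vector per atom. The following is a minimal sketch of that idea, not the authors' featurizer: the `Atom` record, the `ATOM_TYPES` vocabulary, and the maximum degree and valence are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass

# Hypothetical vocabularies; the source does not list the exact categories.
ATOM_TYPES = ["C", "N", "O", "S", "F", "P", "Cl", "Br", "other"]
MAX_DEGREE = 5
MAX_IMPLICIT_VALENCE = 5


@dataclass
class Atom:
    """Minimal stand-in for an atom record (illustrative, not a library type)."""
    symbol: str
    degree: int
    implicit_valence: int
    is_aromatic: bool


def one_hot(value, choices):
    """One-hot encode value over choices; unknown values map to the last slot."""
    index = choices.index(value) if value in choices else len(choices) - 1
    return [1.0 if i == index else 0.0 for i in range(len(choices))]


def featurize(atom):
    """Concatenate one-hot blocks for type, degree, and valence, plus an aromaticity flag."""
    degrees = list(range(MAX_DEGREE + 1))
    valences = list(range(MAX_IMPLICIT_VALENCE + 1))
    return (
        one_hot(atom.symbol, ATOM_TYPES)
        + one_hot(atom.degree, degrees)
        + one_hot(atom.implicit_valence, valences)
        + [1.0 if atom.is_aromatic else 0.0]
    )
```

With these assumed vocabularies each atom maps to a 22-element vector; stacking the vectors for all atoms gives the node feature matrix that a GNN consumes alongside an adjacency matrix.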
Figure 1. Schematic showing the prediction logic implemented in our GNN models. The two models differ in the attention head applied. One uses the PLI obtained from docking simulations to create a combined feature and adjacency matrix. In the other, the features for the ligand and target protein are encoded separately alongside their corresponding adjacency matrices. The output from the attention head is passed through a series of multilayer perceptron (MLP) layers that can be tuned for activity classification, using the sigmoid activation function and binary cross-entropy loss, or for property regression, using a linear activation function and mean squared error loss.
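The two output configurations named in the caption can be illustrated with a short sketch. This is a plain-Python illustration under stated assumptions, not the authors' training code: the function names are hypothetical, and the inputs stand for the raw outputs of the final MLP layer.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y_true, logits, eps=1e-12):
    """Mean BCE for the classification head (sigmoid applied to raw outputs)."""
    total = 0.0
    for y, z in zip(y_true, logits):
        # Clip the probability so that log() never receives exactly 0.
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def mean_squared_error(y_true, outputs):
    """MSE for the regression head (linear, i.e. identity, activation)."""
    return sum((y - o) ** 2 for y, o in zip(y_true, outputs)) / len(y_true)
```

Only the last activation and the loss change between the two tasks, which is why the same network body can be reused for activity classification and property regression.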
Number of active complexes, inactive complexes, and protein targets in the PDBbind, DUD-E, IBS, and SARS-CoV-2 datasets used to create the training and test sets in this work.
| Dataset | Targets | Total actives | Total inactives |
|---|---|---|---|
| PDBbind2018 | 991 | 1418 | 4804 |
| DUD-E | 96 | 46,145 | 16,996,568 |
| SARS-CoV-2 | 7 | 56,191 | 56,191 |
Summary of the data used for training and testing the experimental binding affinity (EBA) regression models.
| Dataset | Total targets | Total ligands | Train targets | Test targets | Train samples | Test samples | Total samples |
|---|---|---|---|---|---|---|---|
| PDBbind2018-EBA (with docked poses) | 1278 | 1278 | 1023 | 255 | 6485 | 1557 | 8042 |
| PDBbind2018-EBA (crystal only) | 10,375 | 10,375 | 8300 | 2075 | 8300 | 2075 | 10,375 |
| PDBbind2016-EBA (crystal only) | 11,674 | 11,674 | a | a | a | a | 11,654 |
| PDBbind2019-EBA (crystal only) | 190 | 190 | – | – | – | – | 190 |
^a Refer to Table S5 in the supporting information for detailed dataset splits.
Table summarizing the dataset used for regression models.
| Dataset | Total samples | Train samples | Test samples |
|---|---|---|---|
| PDBbind2016- | 4576 | 3676 | 900 |
| DUD-E- | 3706 | 3105 | 601 |
SARS-CoV-2 targets with experimentally assessed non-covalent inhibitors used for testing models.
| PDB-ID | Target | Inhibitor |
|---|---|---|
| 7LTJ[ | M | MCULE-5948770040 |
| 7L0D[ | M | ML188 |
| 7LME[ | M | ML300 |
| 7L11[ | M | Compound 5aa |
| 7L12[ | M | Compound 14 |
| 7L13[ | M | Compound 21a |
| 7L14[ | M | Compound 26a |
^a As defined by Zhang et al.[45].
Comparison of test dataset results for our two GNN models on the various test sets described in the text, along with representative examples from the literature.
| Method | Test accuracy | Sensitivity | Specificity | References |
|---|---|---|---|---|
| 0.840 | Current work | |||
| 0.958 | 0.910 | Current work | ||
| 0.855 | 0.590 | 0.910 | Current work | |
| 0.951 | 0.690 | Current work | ||
| 0.934 | 0.660 | Current work | ||
| 0.845 | 0.580 | 0.900 | Current work | |
| Docking | 0.591 | Current work | ||
| GNN | 0.968 | 0.830 | Lim et al.[ | |
| CNN | 0.904 | Gonczarek et al.[ | ||
| CNN | 0.868 | Ragoza et al.[ | ||
| CNN | 0.855 | Wallach et al.[ | ||
| Graph-CNN | 0.886 | Torng and Altman[ |
The top scores are shown in bold.
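For reference, the three classification metrics in the table follow from confusion-matrix counts. The sketch below uses the standard definitions; the function name and the 0/1 label encoding (1 = active, 0 = inactive) are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: fraction of true actives that were predicted active.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of true inactives that were predicted inactive.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity
```

When inactives greatly outnumber actives, accuracy is dominated by the inactive class, so sensitivity and specificity are worth reading alongside it.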
Figure 2. Comparison of our two GNN models with docking. Each bar corresponds to the percentage of protein–ligand complexes for which a pose within 2 Å RMSD of the crystal structure is identified among the top-N ranks.
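The top-N statistic plotted in this figure can be sketched as follows. The sketch assumes one reading of the caption: a complex counts as a success if at least one of its N best-ranked poses lies within 2 Å RMSD of the crystal pose. The function name and the input layout are hypothetical.

```python
def top_n_success_rate(ranked_rmsds, n, threshold=2.0):
    """Percentage of complexes with a pose under `threshold` angstroms in the top n.

    ranked_rmsds maps a complex identifier to the RMSD values of its poses,
    ordered by the scoring method's rank (best-ranked pose first).
    """
    if not ranked_rmsds:
        return 0.0
    hits = sum(
        1
        for rmsds in ranked_rmsds.values()
        if any(r < threshold for r in rmsds[:n])
    )
    return 100.0 * hits / len(ranked_rmsds)
```

Calling the function with N = 1, 2, 3, and so on, once per scoring method, yields the kind of bar values the figure compares.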
Figure 3. Binding probability distributions for IBS molecules with M and NSP15 as targets. (A,B) Predicted binding probability for the NSP15 and M targets against IBS molecules. (C,D) Predicted binding probability for active molecules against M and NSP15, respectively. In each plot, the x-axis denotes the predicted probability and the y-axis denotes the density of molecules.
Performance comparison of our GNN models in predicting experimental affinity on the PDBbind dataset.
| MODEL | RMSE | MAE | Pearson r | Spearman r | |
|---|---|---|---|---|---|
| 1.74 | 1.41 | 0.40 | 0.39 | 0.16 | |
| 1.61 | 1.32 | 0.49 | 0.49 | 0.24 | |
| 1.68 | 1.32 | 0.45 | 0.46 | 0.21 | |
| Pafnucy[ | 1.60 | 1.34 | 0.52 | 0.50 | 0.27 |
| Pafnucy[ | 1.86 | 1.50 | 0.38 | 0.37 | 0.14 |
The top scores are shown in bold.
^a 255 test targets from our PDBbind2018-EBA-docking dataset, using all docked poses for evaluation. ^b PDBbind2018-EBA-crystal-only test dataset.
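The regression metrics reported in the table (RMSE, MAE, Pearson r, Spearman r) can be computed as in the sketch below. It uses the standard textbook definitions in plain Python; the function names are illustrative, and Spearman r is taken here as the Pearson correlation of average ranks, which is an assumption about how ties are handled.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def pearson(x, y):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

RMSE and MAE measure the error in the units of the predicted affinity, while the two correlation coefficients measure how well the predictions track or rank the experimental values.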
Performance comparison of our GNN models in predicting experimental affinity on the PDBbind2019 structure-based evaluation dataset.
| MODEL | RMSE | MAE | Pearson r | Spearman r |
|---|---|---|---|---|
| 1.39 | 0.49 | 0.50 | ||
| 1.52 | 1.22 | 0.42 | 0.46 | |
| Pafnucy[ | 1.11 | |||
| FAST[ | 1.48 | 1.21 | 0.42 | 0.40 |
| 1.42 | 1.13 | 0.48 | 0.47 |
The results for the FAST method are reported for its 3D CNN model. The top scores are shown in bold.
Performance comparison of deep learning models in predicting [Formula: see text].
| Model | RMSE | Pearson r |
|---|---|---|
| 1.24 | 0.45 | |
| 1.26 | 0.44 | |
| 1.21 | 0.51 | |
| 1.24 | 0.51 | |
| DeepAffinity[ | 0.84 | |
| DeepDTA[ | 0.78 | 0.85 |
| MONN[ | 0.76 |
Our models were trained and tested on PDBbind2016 + DUD-E targets, whose [Formula: see text] values were curated from the PDBbind and ChEMBL repositories, respectively. DeepAffinity, DeepDTA, and MONN were trained and tested on BindingDB data. The top scores are given in bold.
Performance of our GNN models on SARS-CoV-2 M targets and some of the potential inhibitors whose [Formula: see text] has been experimentally measured.
| PDB-ID | Experimental | ||||
|---|---|---|---|---|---|
| 7LTJ | 6.20 | 7.12 | 6.51 | 6.90 | 5.37 |
| 7L0D | 7.56 | 7.08 | 7.33 | 7.43 | 5.6 |
| 7LME | 7.61 | 7.07 | 7.46 | 7.59 | 5.3 |
| 7L11 | 7.90 | 7.13 | 7.74 | 7.59 | 6.8 |
| 7L12 | 7.96 | 7.36 | 8.26 | 7.61 | 7.74 |
| 7L13 | 8.11 | 7.42 | 8.42 | 7.56 | 6.89 |
| 7L14 | 8.06 | 7.16 | 7.87 | 7.67 | 6.76 |
Figure 4. Schematic showing the results produced by each method in both the regression and classification models. Including 3D structural data makes it possible to predict both activity and biophysical properties and to relate them to protein–ligand interactions.