Jincai Yang1,2, Cheng Shen2,3, Niu Huang2,4.
Abstract
Predicting protein-ligand interactions using artificial intelligence (AI) models has attracted great interest in recent years. However, data-driven AI models unequivocally suffer from a lack of sufficiently large and unbiased datasets. Here, we systematically investigated the data biases on the PDBbind and DUD-E datasets. We examined the model performance of atomic convolutional neural network (ACNN) on the PDBbind core set and achieved a Pearson R² of 0.73 between experimental and predicted binding affinities. Strikingly, the ACNN models did not require learning the essential protein-ligand interactions in complex structures and achieved similar performance even on datasets containing only ligand structures or only protein structures, while data splitting based on similarity clustering (protein sequence or ligand scaffold) significantly reduced the model performance. We also identified the property and topology biases in the DUD-E dataset which led to the artificially increased enrichment performance of virtual screening. The property bias in DUD-E was reduced by enforcing the more stringent ligand property matching rules, while the topology bias still exists due to the use of molecular fingerprint similarity as a decoy selection criterion. Therefore, we believe that sufficiently large and unbiased datasets are desirable for training robust AI models to accurately predict protein-ligand interactions.
Keywords: artificial intelligence; convolutional neural network; molecular docking; protein-ligand interaction; scoring function; topology fingerprint; virtual screening
Year: 2020 PMID: 32161539 PMCID: PMC7052818 DOI: 10.3389/fphar.2020.00069
Source DB: PubMed Journal: Front Pharmacol ISSN: 1663-9812 Impact factor: 5.810
The PDBbind and DUD-E datasets.
| Name | Task type | Sets | Crystal structures | #Actives | #Decoys |
|---|---|---|---|---|---|
| PDBbind | Regression | Core | 195 | 195 | 0 |
| PDBbind | Regression | Refined | 3,706 | 3,706 | 0 |
| PDBbind | Regression | General | 11,987 | 11,987 | 0 |
| DUD-E | Classification | Original | 102 | 22,886 | 1,411,214 |
| DUD-E | Classification | MW ≤ 500 | 102 | 19,374 | 1,182,039 |
Figure 1. Atomic convolutional neural network performance measured by the Pearson R² values obtained from the different PDBbind datasets using different splitting approaches. Each dataset was split into the training, validation, and test subsets five times with different random seeds following an 80/10/10 ratio, and studied on three different binding components, including protein-ligand complex structure (binding complex), only ligand structure (ligand alone), and only protein structure (protein alone), individually. (A) Models trained and tested within the same set. (B) Models trained on randomly selected subsets of the refined and the general sets (removing the core set structures) and tested on the core set. Models trained on the PDBbind datasets (C) (protein alone) and (D) (ligand alone) using different splitting methods.
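The repeated random splitting described in the caption can be illustrated with a minimal sketch. It assumes a plain shuffle-and-cut scheme; the function name `split_80_10_10`, the placeholder identifiers, and the choice of seeds 0 to 4 are hypothetical and are not taken from the original study.

```python
import random


def split_80_10_10(ids, seed):
    """Shuffle ids with a fixed seed and cut them into train/valid/test (80/10/10)."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_valid = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # remainder goes to the test subset
    return train, valid, test


# Hypothetical identifiers standing in for the 195 core-set complexes.
complex_ids = [f"complex_{i:03d}" for i in range(195)]

# Five splits with different random seeds (the seed values are an assumption).
splits = [split_80_10_10(complex_ids, seed) for seed in range(5)]
```

Fixing the seed per split keeps each of the five partitions reproducible, so a model's R² can be averaged over the same five test subsets on every run.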
Figure 2. Atomic contributions derived from the ACNN model (ligand alone) on three representative systems chosen from the PDBbind core set, including (A) protein tyrosine phosphatase 1B (PTP1B) inhibitors, (B) ligands bound to the antibody Fab, and (C) acetylcholinesterase (AChE) inhibitors. The ACNN model (ligand alone) was trained on the refined set (removing the core set structures) and tested on the core set. Each row shows two ligands from the same protein target with different binding affinities (pKi or pK); predicted values are given in parentheses. The first column shows the superimposed ligand structures using the binding pocket alignment approach. The second and third columns show the atomic contributions of each ligand. The size of the balls represents the absolute values of the atomic scores. The atomic scores of selected atoms are labeled explicitly. Atoms shown as black spheres have negative scores. The molecular images were generated using UCSF Chimera (Pettersen et al., 2004).
Figure 3. Performance of random forest (RF) models on the DUD-E datasets using (A) six properties or (B) topology fingerprints. Note that the DUD-E(MW ≤ 500) dataset was compiled by removing actives with MW (only including heavy atoms) greater than 500 and their associated decoys. The cross-class CV split the dataset into three folds based on target classes, and the random CV randomly split targets with the same fold sizes as in cross-class CV.
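The two cross-validation schemes in the caption can be sketched as follows. This is an illustration under stated assumptions: the target names, the class labels, and the assignment of classes to folds are hypothetical toy values, and the helpers `cross_class_folds` and `random_folds_like` are not from the original work.

```python
import random
from collections import defaultdict


def cross_class_folds(target_to_class, class_to_fold):
    """Group targets into folds so that each target class lands in exactly one fold."""
    folds = defaultdict(list)
    for target, cls in sorted(target_to_class.items()):
        folds[class_to_fold[cls]].append(target)
    return [folds[k] for k in sorted(folds)]


def random_folds_like(folds, seed):
    """Randomly reassign targets to folds while keeping the given fold sizes."""
    rng = random.Random(seed)
    targets = [t for fold in folds for t in fold]
    rng.shuffle(targets)
    out, start = [], 0
    for fold in folds:
        out.append(targets[start:start + len(fold)])
        start += len(fold)
    return out


# Hypothetical targets and classes (toy data, not the DUD-E annotation).
target_to_class = {
    "t1": "kinase", "t2": "kinase", "t3": "kinase",
    "t4": "protease", "t5": "protease",
    "t6": "gpcr", "t7": "nuclear_receptor",
}
# Hypothetical mapping of classes to three folds.
class_to_fold = {"kinase": 0, "protease": 1, "gpcr": 2, "nuclear_receptor": 2}

cc_folds = cross_class_folds(target_to_class, class_to_fold)
rnd_folds = random_folds_like(cc_folds, seed=0)
```

Matching the fold sizes between the two schemes means any performance gap reflects which targets share a fold, not how much training data each fold leaves.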
Figure 4. Significantly changed bits between actives and decoys on DUD-E(MW ≤ 500). Eighty-four bits with absolute log2 fold change ≥ 1 between the actives and decoys and mean relative frequency ≥ 0.03 were selected as representative bits from the Morgan fingerprints (2,048 bits). The bits were sorted by frequencies of ZINC12 compounds. The chemical features of three selected bits are presented, and the chemical features of all 84 bits are summarized in .
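The bit-selection criterion in the caption can be written out as a short sketch. Several details are assumptions, not statements from the source: fingerprints are represented as sets of on-bit indices, the fold change is taken as the ratio of relative frequencies in actives over decoys, the "mean relative frequency" is taken as the average of the two frequencies, and a small pseudocount guards against division by zero. The function name `select_bits` and the toy data are hypothetical.

```python
import math


def select_bits(active_fps, decoy_fps, n_bits,
                min_abs_log2fc=1.0, min_mean_freq=0.03, pseudo=1e-6):
    """Return (bit, log2 fold change) for bits passing both thresholds."""

    def rel_freq(fps):
        # Fraction of molecules in which each bit is set.
        counts = [0] * n_bits
        for on_bits in fps:
            for b in on_bits:
                counts[b] += 1
        return [c / len(fps) for c in counts]

    fa = rel_freq(active_fps)
    fd = rel_freq(decoy_fps)
    selected = []
    for b in range(n_bits):
        # Assumed definition: actives over decoys, with a pseudocount.
        log2fc = math.log2((fa[b] + pseudo) / (fd[b] + pseudo))
        # Assumed definition: average of the two relative frequencies.
        mean_freq = (fa[b] + fd[b]) / 2
        if abs(log2fc) >= min_abs_log2fc and mean_freq >= min_mean_freq:
            selected.append((b, log2fc))
    return selected


# Toy 4-bit fingerprints (hypothetical data, not Morgan fingerprints).
actives = [{0, 1}, {0, 1}, {0, 1}, {0, 2}]
decoys = [{0}, {0}, {0, 1}, {0, 2}]
selected = select_bits(actives, decoys, n_bits=4)
```

In this toy example only bit 1 passes: it is set in 75% of actives and 25% of decoys, while the other bits are either equally frequent in both groups or absent.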