Alex M Clark, Krishna Dole, Anna Coulon-Spektor, Andrew McNutt, George Grass, Joel S Freundlich, Robert C Reynolds, Sean Ekins.
Abstract
On the order of hundreds of absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) models have been described in the literature over the past decade, but most are inaccessible to anyone but their authors. Public accessibility is also an issue for computational models of bioactivity, and the inability to share such models remains a major challenge that limits drug discovery. We describe a reference implementation of a Bayesian model-building software module, released as an open source component that is now included in the Chemistry Development Kit (CDK) project and also implemented in the CDD Vault and in several mobile apps. We use this implementation to build an array of Bayesian models for ADME/Tox, in vitro and in vivo bioactivity, and other physicochemical properties. These models have cross-validation receiver operator characteristic (ROC) values comparable to those reported in prior publications using alternative tools. Implementing Bayesian models with FCFP6 descriptors in the CDD Vault enables the rapid production of robust machine learning models from public data or the user's own datasets. The current study sets the stage for generating models in proprietary software (such as CDD) and exporting them in a format that can be run in open source software using CDK components. This work also demonstrates that biocomputation can be enabled across distributed private or public datasets to enhance drug discovery.
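As a rough illustration of how a Bayesian model over fingerprint bits can be built and applied, the following minimal sketch assumes a Laplacian-corrected naive Bayesian estimator, a common choice for sparse fingerprint data. The abstract does not specify the estimator, so this formulation is an assumption, and the function names `train_bayesian` and `score` are hypothetical and not part of CDK or CDD Vault. Each molecule is represented as a set of integer fingerprint bit indices.

```python
import math
from collections import Counter


def train_bayesian(fingerprints, labels):
    """Build a per-bit contribution table (assumed Laplacian-corrected estimator).

    fingerprints: list of sets of int bit indices, one set per molecule.
    labels: list of bool, True for an active molecule.
    """
    total = len(fingerprints)
    actives = sum(1 for y in labels if y)
    base_rate = actives / total  # overall fraction of actives

    seen = Counter()  # number of molecules containing each bit
    hits = Counter()  # number of active molecules containing each bit
    for bits, active in zip(fingerprints, labels):
        for b in bits:
            seen[b] += 1
            if active:
                hits[b] += 1

    # Log ratio of observed actives to the number expected from the base
    # rate; the +1 terms damp the estimate for rarely seen bits.
    return {b: math.log((hits[b] + 1) / (seen[b] * base_rate + 1)) for b in seen}


def score(model, bits):
    """Sum the contributions of the bits present; unseen bits contribute 0."""
    return sum(model.get(b, 0.0) for b in bits)
```

In this sketch a higher score means the molecule shares more bits with known actives than the base rate alone would predict. The raw score is not a calibrated probability.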
Year: 2015 PMID: 25994950 PMCID: PMC4478615 DOI: 10.1021/acs.jcim.5b00143
Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 4.956
Datasets Used for Bayesian Models Created with CDD Models Using FCFP6 Fingerprints
| model | datasets used and refs | cutoff for active | no. of molecules | three-fold ROC |
|---|---|---|---|---|
| malaria ( | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS)[ | 3D7 EC50 <10 nM | 184 actives, 19,824 inactives | 0.97 |
| TB ( | CDD Public datasets from NIAID/SRI (MLSMR, CB2, kinase)[ | | 6891 actives, 210,190 inactives | 0.88 |
| TB ( | CDD Public datasets from NIAID/SRI (MLSMR, CB2, kinase, and ARRA)[ | | 3712 actives, 1145 inactives | 0.89 |
| TB ( | CDD Public MLSMR single-point data | | 3986 actives, 210,447 inactives | 0.87 |
| TB ( | CDD Public MLSMR dose–response | | 624 actives, 1649 inactives | 0.75 |
| TB ( | CDD Public[ | described in ref ( | 371 actives, 407 inactives | 0.73 |
| cholera | CDD Public in the TB ARRA dataset[ | IC50 <5 μM | 50 actives, 1874 inactives | 0.93 |
| Ames mutagenicity | ref ( | Ames positive, active = 1 | 3501 actives, 3007 inactives | 0.83 |
| mouse intrinsic clearance | data from ChEMBL | <10 μL/(min·g) | 52 actives, 312 inactives | 0.82 |
| human intrinsic clearance | data from ChEMBL | ≤10 μL/(min·g) | 105 actives, 638 inactives | 0.92 |
| human intrinsic clearance | AZ data from ChEMBL[ | ≤10 μL/(min·mg) | 496 actives, 604 inactives | 0.80 |
| Caco-2 | proprietary data from ADMEdata.com | pH 6.5, cutoff >1×10⁻⁵ | 181 actives, 325 inactives | 0.79 |
| Caco-2 | data from ChEMBL | cutoff >1×10⁻⁵ | 60 actives, 399 inactives | 0.89 |
| 5-HT2B | ref ( | active = 1, described in ref ( | 146 actives, 607 inactives | 0.89 |
| solubility | ref ( | Log solubility = −5 | 1136 actives, 154 inactives | 0.87 |
| PXR activation | ref ( | described in ref ( | 174 actives, 143 inactives | 0.80 |
| maximum recommended therapeutic dose | ref ( | >10 mg/(kg·day) | 350 actives, 813 inactives | 0.85 |
| blood brain barrier permeability | ref ( | BBB positive, described in ref ( | 1472 actives, 432 inactives | 0.92 |
ROC = receiver operator characteristic integral.
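The ROC integral reported in the table is the area under the ROC curve. As a minimal sketch, it can be computed as the fraction of (active, inactive) pairs in which the active molecule receives the higher score, with ties counted as half. The helper name `roc_auc` is hypothetical, and this pairwise form is for illustration only; it is not taken from the authors' implementation.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: list of float model outputs, one per molecule.
    labels: list of bool, True for an active molecule.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # active ranked above inactive
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. For datasets of the size listed above (hundreds of thousands of molecules), a sort-based rank computation would be preferable to this quadratic loop.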
Datasets Used for Bayesian Models Created for Use by MMDS, with ECFP6 Fingerprints
| model | datasets used and refs | cutoff for active | no. of molecules | five-fold ROC |
|---|---|---|---|---|
| solubility | ref ( | Log solubility = −5 | 1144 actives, 155 inactives | 0.86 |
| probe-like | ref ( | described in ref ( | 253 actives, 69 inactives | 0.76 |
| hERG | ref ( | described in ref ( | 373 actives, 433 inactives | 0.85 |
| KCNQ1 | PubChem BioAssay: AID 2642[ | using actives assigned in PubChem | 301,737 actives, 3878 inactives | 0.84 |
| Bubonic plague ( | PubChem single-point screen BioAssay: AID 898 | active when inhibition ≥50% | 223 actives, 139,710 inactives | 0.81 |
| Chagas disease ( | PubChem BioAssay: AID 2044 | with EC50 <1 μM, >10-fold difference in cytotoxicity as active | 1692 actives, 2363 inactives | 0.8 |
| TB ( | | | 1434 actives, 5789 inactives | 0.73 |
| malaria ( | CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS)[ | 3D7 EC50 <10 nM | 175 actives, 19,604 inactives | 0.98 |
All eight models are ECFP6, with folding into 32,768 slots.
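Folding maps each raw fingerprint hash code onto a fixed number of slots, so that every molecule is described by indices in the same bounded range. The sketch below assumes folding by bit mask, which works because 32,768 is a power of two (2^15). The exact folding operation used for these models is not stated here, and the name `fold_hashes` is hypothetical.

```python
FOLD_SIZE = 32768  # 2**15 slots, as stated for the eight models


def fold_hashes(hash_codes, size=FOLD_SIZE):
    """Fold raw integer hash codes into `size` slots (assumed mask-based).

    Returns the sorted, de-duplicated slot indices. Distinct hash codes
    can collide in the same slot, which is the accepted cost of folding.
    """
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    mask = size - 1
    # Python's & on negative ints behaves like two's complement, so
    # signed 32-bit hash codes also land in the range 0..size-1.
    return sorted({h & mask for h in hash_codes})
```

For example, the hash codes 0 and 32768 fold to the same slot (0), which shows how folding trades some resolution for a compact, fixed-size representation.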
Figure 1. Example of a serialized file containing a very small Bayesian model. The default file extension is .bayesian, and the MIME type is chemical/x-bayesian.
Figure 2. Example of the model output in CDD Models. (A) Model derived from whole-cell datasets from antimalarial screening across four CDD Public datasets (MMV, St. Jude, Novartis, and TCAMS), ∼20,000 EC50 values, cutoff < 10 nM. (B) Options for exporting a model from CDD.
Figure 3. Example of the Bayesian model implemented in the MMDS mobile app. (a) hERG model, based on literature data. (b) A molecule from a hERG paper.[151] (c) Results scored with this model (hERG measured IC50 = 24 nM) showing a visually intuitive atom coloring for this and other Bayesian models. This compound would appear to be an inhibitor of hERG and possibly KCNQ1 potassium channels.
Figure 4. Screenshots summarizing the ROC plots and active and inactive compounds for eight models implemented in MMDS.