David Ruano-Ordás, Lindsey Burggraaff, Rongfang Liu, Cas van der Horst, Laura H Heitman, Michael T M Emmerich, Jose R Mendez, Iryna Yevseyeva, Gerard J P van Westen.
Abstract
Drugs have become an essential part of our lives due to their ability to improve people's health and quality of life. However, for many diseases, approved drugs are not yet available or existing drugs have undesirable side effects, which drives the pharmaceutical industry to discover new drugs and active compounds. Drug development is an expensive process that typically starts with the detection of candidate molecules (screening) after a protein target has been identified. High-performance screening techniques have therefore become critical for mitigating these high costs, and the popularity of computer-based screening (often called virtual screening or in silico screening) has rapidly increased during the last decade. A wide variety of Machine Learning (ML) techniques have been used in conjunction with chemical structure and physicochemical properties for screening purposes, including (i) simple classifiers, (ii) ensemble methods, and more recently (iii) Multiple Classifier Systems (MCS). Here, we apply an MCS for virtual screening (D2-MCS) using circular fingerprints. We applied our technique to a dataset of cannabinoid CB2 ligands obtained from the ChEMBL database. The HTS collection of Enamine (1,834,362 compounds) was virtually screened using D2-MCS, which identified 48,232 potentially active molecules. The identified molecules were ranked to select 21 promising novel compounds for in vitro evaluation. Experimental validation confirmed six highly active hits (> 50% displacement at 10 µM and subsequent Ki determination) and an additional five medium active hits (> 25% displacement at 10 µM). Hence, D2-MCS provided a hit rate of 29% for highly active compounds and an overall hit rate of 52%.
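The hit rates quoted in the abstract follow directly from the reported counts. The short check below reproduces that arithmetic; the variable and function names are illustrative only and do not come from the paper.

```python
# Counts reported in the abstract.
tested = 21                  # compounds selected for in vitro evaluation
highly_active = 6            # > 50% displacement at 10 µM
medium_active = 5            # > 25% displacement at 10 µM
library_size = 1_834_362     # Enamine HTS collection
predicted_active = 48_232    # molecules flagged as potentially active


def percent(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


high_hit_rate = percent(highly_active, tested)
overall_hit_rate = percent(highly_active + medium_active, tested)
predicted_fraction = 100 * predicted_active / library_size

print(f"Highly active hit rate: {high_hit_rate}%")
print(f"Overall hit rate: {overall_hit_rate}%")
print(f"Share of library predicted active: {predicted_fraction:.1f}%")
```

The last figure is not stated in the abstract; it is simply the ratio of the two reported library counts.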
Keywords: Clustering methods; Drug discovery; Measure-guided methodology; Multiple classifier systems
Year: 2019 PMID: 33430920 PMCID: PMC6836644 DOI: 10.1186/s13321-019-0389-9
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Feature characteristics and codification
| Feature type | Feature values | No. of features |
|---|---|---|
| Chemical substructure fingerprints | Binary | 2048 |
| Physicochemical descriptors | Discrete values | 50 |
| Physicochemical descriptors | Continuous values | 34 |
| Total | | 2132 |
Fig. 1 Structure and functionality of the D2-MCS toolkit. D2-MCS builds a set of classifiers (one per feature cluster) whose outputs are combined to generate a single solution. The set of selected trained classifiers (one for each dataset part) together with a voting system constitutes a complete D2-MCS instance
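The idea in the Fig. 1 caption, one classifier per feature cluster with the outputs combined by voting, can be sketched as follows. This is a minimal illustration and not the D2-MCS implementation: the caption does not say which voting scheme is used, so simple majority voting is an assumption, and the clusters, the toy classifiers, and every function name below are hypothetical.

```python
from collections import Counter


def split_by_cluster(features, clusters):
    """Project one feature vector onto each cluster (a list of column indices)."""
    return [[features[i] for i in indices] for indices in clusters]


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def predict(features, clusters, classifiers):
    """Let each cluster's classifier vote on its own feature subset."""
    views = split_by_cluster(features, clusters)
    votes = [clf(view) for clf, view in zip(classifiers, views)]
    return majority_vote(votes)


# Hypothetical setup: three feature clusters, each with a toy rule that
# stands in for a trained classifier.
clusters = [[0, 1], [2, 3], [4]]
classifiers = [
    lambda v: "Active" if sum(v) >= 1 else "Inactive",
    lambda v: "Active" if sum(v) >= 2 else "Inactive",
    lambda v: "Active" if v[0] > 0.5 else "Inactive",
]

sample = [1, 0, 1, 0, 0.9]
label = predict(sample, clusters, classifiers)
print(label)
```

In a real system each lambda would be replaced by a model trained only on the columns of its cluster, and the vote could be weighted rather than a plain majority.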
Fig. 2 Workflow of the three-stage potential candidate ranker methodology. Our ranking methodology comprised three main stages: (i) class probability estimator, (ii) global relevance calculator and (iii) relevance sorter
Fig. 3 CB2 dataset partitioning. The CB2 dataset instances (rows) were randomly divided into four homogeneous and evenly sized groups
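One way to obtain four random, evenly sized groups like those in Fig. 3 is a stratified round-robin split. The sketch below assumes that "homogeneous" means each group keeps roughly the same class proportions; the caption does not define the term, and the function name, seed, and toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_partition(labels, k=4, seed=0):
    """Split row indices into k groups of near-equal size and class balance."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)

    groups = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across classes to keep sizes even
    for rows in rows_by_class.values():
        rng.shuffle(rows)
        for j, row in enumerate(rows):
            groups[(offset + j) % k].append(row)
        offset += len(rows)
    return groups


# Toy labels only; these are not the CB2 data.
labels = ["Active"] * 10 + ["Inactive"] * 10
groups = stratified_partition(labels, k=4, seed=42)
for number, group in enumerate(groups, start=1):
    actives = sum(labels[row] == "Active" for row in group)
    print(f"Group {number}: {len(group)} rows, {actives} active")
```

Group sizes differ by at most one row, and the per-class counts within each group also differ by at most one.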
Fig. 4 Performance comparison plot for the testing stage (represented as a double arrow). Panel a shows the performance obtained using the MCC measure and panel b shows the performance obtained using the PPV measure. Also shown (as dots) is the name of the classifier achieving the best performance for each cluster (C), together with its performance value during the optimization/training stage (P)
Fig. 5 Performance comparison achieved for the Minimize FP and Minimize FN meta-models
Confusion matrix achieved for both configurations
| Meta-model | TP | FP | TN | FN |
|---|---|---|---|---|
| Minimize FP | 474 | 3 | 475 | 30 |
| Minimize FN | 480 | 60 | 418 | 24 |
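The two measures named in the Fig. 4 caption, MCC and PPV, can be recomputed from the confusion matrix above using their standard definitions. The sketch below does so; the function names and the dictionary layout are illustrative and are not taken from the D2-MCS toolkit.

```python
import math


def ppv(tp, fp):
    """Positive predictive value (precision): TP / (TP + FP)."""
    return tp / (tp + fp)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Cell counts copied from the confusion matrix table.
confusion = {
    "Minimize FP": {"tp": 474, "fp": 3, "tn": 475, "fn": 30},
    "Minimize FN": {"tp": 480, "fp": 60, "tn": 418, "fn": 24},
}

scores = {
    name: {"MCC": mcc(**cells), "PPV": ppv(cells["tp"], cells["fp"])}
    for name, cells in confusion.items()
}

for name, result in scores.items():
    print(f"{name}: MCC = {result['MCC']:.3f}, PPV = {result['PPV']:.3f}")
```

On these counts the Minimize FP configuration scores higher on both measures, at the cost of six additional false negatives (30 versus 24).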
Experimentally validated compounds
| Enamine ID | InChIKey | Probability | Distance | pKi ± SEM or % displacement at 10 μM |
|---|---|---|---|---|
| | | | | |
| Z28609248 | HXJYJTXXUOYRSB-UHFFFAOYSA-N | 0.81 | 0.29 | 16% |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| Z30007452 | VBFKBSAAMKINJD-UHFFFAOYSA-N | 0.77 | 0.24 | −2% |
| | | | | |
| | | | | |
| | | | | |
| Z44866691 | WPWBUEOMELTWOC-FCDQGJHFSA-N | 0.76 | 0.25 | −1% |
| | | | | |
| Z1317886912 | MEXULSRPIBCDQX-UHFFFAOYSA-N | 0.76 | 0.28 | 3% |
| Z44867007 | PCCXRCZRXNECAZ-JLPGSUDCSA-N | 0.76 | 0.30 | 0% |
| Z237484560 | LIGIHTRZFDFDAN-UHFFFAOYSA-N | 0.75 | 0.15 | −1% |
| Z223843850 | CVSSLUCDGJDGHX-UHFFFAOYSA-N | 0.75 | 0.32 | −5% |
| | | | | |
| Z55473655 | VDTRQSFAESBVFB-UHFFFAOYSA-N | 0.75 | 0.26 | 7% |
| Z2094674960 | RISCNDGLDMULEE-UHFFFAOYSA-N | 0.75 | 0.29 | 0% |
| Z1523102560 | IXASXIGZGJSBJT-UHFFFAOYSA-N | 0.75 | 0.30 | 18% |
| | | | | |
Shown are the Enamine identifier (ID number), InChIKey, assigned probability, distance to the training set, and biological activity. Biological activity is shown as pKi (with the standard error of the mean) when available, or otherwise as % displacement of the radioligand by 10 μM of the compound. Identified novel hits are indicated in italics
Summary of predictions grouped by meta-model
| Prediction | Minimize FP | Minimize FN |
|---|---|---|
| Active | 48,232 | 166,664 |
| Inactive | 1,786,130 | 1,667,698 |
| Total | 1,834,362 | 1,834,362 |