Mariana González-Medina, Fernando D Prieto-Martínez, John R Owen, José L Medina-Franco.
Abstract
BACKGROUND: Measuring the structural diversity of compound databases is relevant in drug discovery and many other areas of chemistry. Since molecular diversity depends on molecular representation, comprehensive chemoinformatic analysis of the diversity of libraries uses multiple criteria. For instance, the diversity of the molecular libraries is typically evaluated employing molecular scaffolds, structural fingerprints, and physicochemical properties. However, the assessment with each criterion is analyzed independently and it is not straightforward to provide an evaluation of the "global diversity".
Keywords: Chemical space; Data mining; Molecular fingerprints; Molecular scaffolds; Physicochemical properties; Shannon entropy; Structural diversity
Year: 2016 PMID: 27895718 PMCID: PMC5105260 DOI: 10.1186/s13321-016-0176-9
Source DB: PubMed Journal: J Cheminform ISSN: 1758-2946 Impact factor: 5.514
Fig. 1 Prototype Consensus Diversity Plot. Scaffold diversity is plotted along the vertical axis and fingerprint diversity along the horizontal axis. The thresholds (dashed lines) can be set depending on the metric used to quantify the diversity on each axis
Table 1 Compound data sets analyzed in this work
| Data set | Size (a) | Source |
|---|---|---|
| Natural products screening compounds (MEGx) | 2500 | ac-discovery.com |
| Semi-synthetic screening compounds (NATx) | 2500 | ac-discovery.com |
| Generally Recognized as Safe (GRAS) | 2249 | [ |
| GRAS subset (cyclic systems) | 1195 | [ |
| Carcinogenic | 738 | monographs.iarc.fr/ENG/Classification/index.php, [ |
| Carcinogenic subset (cyclic systems) | 544 | monographs.iarc.fr/ENG/Classification/index.php, [ |
| Anticancer drugs | 76 | [ |
| Non-anticancer drugs | 1399 | [ |
| Compounds in clinical trials (Clinical) | 713 | [ |
| Epigenetic focused | 850 | selleckchem.com |
(a) Number of unique compounds
Table 2 Summary of scaffold diversity analysis
| Data set | N | N/M | Nsing | Nsing/N | Nsing/M | AUC | F50 |
|---|---|---|---|---|---|---|---|
| MEGx | 935 | 0.374 | 642 | 0.687 | 0.257 | 0.781 | 0.072 |
| NATx | 799 | 0.320 | 400 | 0.501 | 0.160 | 0.768 | 0.116 |
| GRAS | 238 | 0.106 | 150 | 0.630 | 0.067 | 0.926 | 0.004 |
| GRAS subset | 237 | 0.198 | 150 | 0.633 | 0.126 | 0.867 | 0.021 |
| Carcinogenic | 262 | 0.355 | 195 | 0.744 | 0.264 | 0.800 | 0.031 |
| Carcinogenic subset | 261 | 0.480 | 195 | 0.747 | 0.450 | 0.737 | 0.107 |
| Anticancer drugs | 70 | 0.921 | 65 | 0.929 | 0.855 | 0.537 | 0.457 |
| Non-anticancer drugs | 844 | 0.572 | 686 | 0.813 | 0.465 | 0.699 | 0.157 |
| Clinical | 603 | 0.846 | 565 | 0.937 | 0.792 | 0.576 | 0.409 |
| Epigenetic focused | 727 | 0.855 | 666 | 0.916 | 0.784 | 0.569 | 0.415 |
N number of chemotypes, M number of molecules, Nsing number of singletons, AUC area under the curve, F50 fraction of chemotypes that contains 50% of the data set
Fig. 2 Scaffold retrieval (CSR) curves for the data sets studied in this work. Area under the curve (AUC) and fraction of chemotypes required for retrieving 50% of the compounds in the data sets (F50) are summarized in Table 2
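The scaffold metrics reported above (N/M, Nsing/N, Nsing/M, AUC, F50) can be sketched from a list of per-chemotype compound counts. This is a minimal illustration, not the authors' implementation: the function name `scaffold_summary`, the trapezoidal integration, and the choice to report F50 at the first sampled point that reaches 50% (no interpolation) are all assumptions, and the input counts are toy values.

```python
from itertools import accumulate


def scaffold_summary(counts):
    """Summarize scaffold diversity from per-chemotype compound counts.

    `counts` is a hypothetical input: one integer per chemotype giving the
    number of compounds that share that scaffold.
    """
    ordered = sorted(counts, reverse=True)       # most populated chemotype first
    n = len(ordered)                             # N: number of chemotypes
    m = sum(ordered)                             # M: number of molecules
    n_sing = sum(1 for c in ordered if c == 1)   # Nsing: singletons

    # Retrieval curve: x = fraction of chemotypes, y = fraction of compounds.
    xs = [0.0] + [(i + 1) / n for i in range(n)]
    ys = [0.0] + [c / m for c in accumulate(ordered)]

    # Trapezoidal area under the curve (assumed integration scheme).
    auc = sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2 for i in range(n))

    # F50: smallest sampled fraction of chemotypes covering >= 50% of compounds.
    f50 = next(x for x, y in zip(xs, ys) if y >= 0.5)

    return {
        "N": n, "M": m, "N/M": n / m,
        "Nsing": n_sing, "Nsing/N": n_sing / n, "Nsing/M": n_sing / m,
        "AUC": auc, "F50": f50,
    }


# Toy data set: five chemotypes covering ten compounds.
toy = scaffold_summary([5, 2, 1, 1, 1])
print(toy["AUC"], toy["F50"])
```

Under this construction a set made only of singletons follows the diagonal and gives an AUC of 0.5, so AUC values closer to 0.5 and larger F50 values indicate higher scaffold diversity.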
Fig. 3 Results of scaled Shannon entropy (SSE) at different numbers of most populated chemotypes
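A scaled Shannon entropy over the n most populated chemotypes can be sketched as follows. The sketch assumes the common definition SE = -Σ p_i log2 p_i, with p_i the fraction of compounds in chemotype i among the n chemotypes considered, and SSE = SE / log2(n); the function name and the toy counts are hypothetical.

```python
import math


def scaled_shannon_entropy(counts, n_top):
    """Scaled Shannon entropy of the n_top most populated chemotypes.

    Returns a value in [0, 1]: 1 when compounds are spread evenly over the
    chemotypes, close to 0 when one chemotype dominates.
    """
    top = sorted(counts, reverse=True)[:n_top]
    n = len(top)
    if n < 2:
        raise ValueError("need at least two chemotypes to scale by log2(n)")
    total = sum(top)
    # Shannon entropy of the distribution of compounds over chemotypes.
    se = -sum((c / total) * math.log2(c / total) for c in top if c > 0)
    return se / math.log2(n)


# Evenly populated chemotypes versus one dominant chemotype (toy counts).
print(scaled_shannon_entropy([10, 10, 10, 10], 4))
print(scaled_shannon_entropy([97, 1, 1, 1], 4))
```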
Table 3 Summary of the intra-library similarity distributions computed with MACCS keys/Tanimoto
| Data set | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | SD |
|---|---|---|---|---|---|---|---|
| MEGx | 0 | 0.388 | 0.475 | 0.485 | 0.574 | 1.0 | 0.138 |
| NATx | 0.104 | 0.378 | 0.444 | 0.460 | 0.525 | 1.0 | 0.119 |
| GRAS | 0 | 0.256 | 0.375 | 0.385 | 0.500 | 1.0 | 0.181 |
| GRAS subset | 0.016 | 0.269 | 0.380 | 0.385 | 0.487 | 1.0 | 0.161 |
| Carcinogenic | 0 | 0.135 | 0.229 | 0.244 | 0.333 | 1.0 | 0.151 |
| Carcinogenic subset | 0.014 | 0.180 | 0.269 | 0.284 | 0.370 | 1.0 | 0.144 |
| Anticancer drugs | 0.065 | 0.362 | 0.468 | 0.460 | 0.562 | 1.0 | 0.139 |
| Non-anticancer drugs | 0 | 0.283 | 0.370 | 0.373 | 0.458 | 1.0 | 0.130 |
| Clinical | 0 | 0.345 | 0.438 | 0.432 | 0.522 | 1.0 | 0.128 |
| Epigenetic focused | 0.039 | 0.344 | 0.430 | 0.431 | 0.516 | 1.0 | 0.123 |
1st Qu. first quartile, 3rd Qu. third quartile
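The kind of distribution summarized above can be sketched by computing the Tanimoto coefficient for every pair of fingerprints in a library. In this sketch each fingerprint is represented as a set of "on" bit positions; the toy fingerprints are hypothetical, and real MACCS keys would come from a cheminformatics toolkit. Returning 0.0 for two empty fingerprints is an assumed convention.

```python
from itertools import combinations
from statistics import mean, median


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Assumed convention: two empty fingerprints have similarity 0.0.
    return len(a & b) / union if union else 0.0


def intra_library_similarities(fingerprints):
    """All pairwise Tanimoto similarities within one library."""
    return [tanimoto(a, b) for a, b in combinations(fingerprints, 2)]


# Toy library of three fingerprints (hypothetical bit positions).
library = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
sims = intra_library_similarities(library)
print(min(sims), median(sims), mean(sims), max(sims))
```

A lower median similarity corresponds to a higher fingerprint-based diversity of the library.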
Table 4 Mean of the intra-set Euclidean distance of six physicochemical properties
| Data set | Euclidean distance |
|---|---|
| MEGx | 2.95 |
| NATx | 1.07 |
| GRAS | 1.00 |
| Carcinogenic | 2.22 |
| Anticancer drugs | 2.90 |
| Non-anticancer drugs | 2.50 |
| Clinical | 2.62 |
| Epigenetic focused | 2.10 |
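A mean intra-set Euclidean distance of this kind can be sketched as the average distance over all compound pairs in property space. The sketch below is an illustration under stated assumptions: the z-score scaling step and the function names are not taken from the source, and the six-value property vectors are made-up toy data.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each property column (an assumed scaling choice)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    # A constant column has zero spread; divide by 1.0 instead of 0.
    sds = [pstdev(c) or 1.0 for c in cols]
    return [tuple((v - mu) / sd for v, mu, sd in zip(r, mus, sds)) for r in rows]


def mean_intra_set_distance(rows):
    """Mean Euclidean distance over all pairs of property vectors."""
    return mean(math.dist(a, b) for a, b in combinations(rows, 2))


# Toy set: three compounds, six hypothetical property values each.
props = [
    (1.0, 2.0, 0.5, 300.0, 60.0, 3.0),
    (2.0, 4.0, 1.5, 350.0, 80.0, 5.0),
    (0.0, 1.0, 3.0, 250.0, 40.0, 2.0),
]
print(mean_intra_set_distance(standardize(props)))
```

Scaling before computing distances keeps a property with a large numeric range from dominating the result; larger mean distances indicate a more diverse property profile.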
Fig. 4 Consensus Diversity Plots (CDPs) for the eight data sets and two subsets studied in this work. The CDPs classify the compound data sets by molecular scaffolds, fingerprint representations, and physicochemical properties. Each data point represents a compound set. Fingerprint-based diversity is plotted on the X-axis. Scaffold diversity is plotted on the Y-axis as area under the curve (AUC) or F50. The quadrants in red identify compound data sets with high fingerprint-based diversity and high scaffold diversity; the quadrants in white identify data sets with relatively low fingerprint-based diversity and lower scaffold diversity; the quadrants in blue locate data sets with high fingerprint-based diversity but low scaffold diversity; and the quadrants in yellow identify compound libraries with low fingerprint-based diversity but high scaffold diversity. Data points are colored by the diversity of the physicochemical properties of the data set, measured as the Euclidean distance of six properties of pharmaceutical relevance. The distance is shown on a continuous color scale from red (more diversity), through orange/brown (intermediate diversity), to green (less diversity). The size of each data point reflects the relative size of the data set: smaller points indicate data sets with fewer molecules. In this application example, the quadrants were set using a value of 0.75 for AUC and the median values of the distributions of F50 and MACCS/Tanimoto similarity
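The quadrant assignment described in this caption can be sketched with the tabulated values. The sketch uses the median MACCS/Tanimoto similarity and the AUC reported for four of the data sets, the AUC threshold of 0.75 from the caption, and, as an assumption, a similarity threshold equal to the median of the ten per-data-set median similarities. It treats red as high diversity on both axes, as implied by the other three quadrants. The function name `cdp_quadrant` is hypothetical.

```python
from statistics import median

# Median MACCS/Tanimoto similarity of all ten data sets (from the table above).
MEDIAN_SIMILARITIES = [0.475, 0.444, 0.375, 0.380, 0.229,
                       0.269, 0.468, 0.370, 0.438, 0.430]

# (median similarity, scaffold AUC) for four example data sets.
DATA = {
    "MEGx": (0.475, 0.781),
    "Carcinogenic": (0.229, 0.800),
    "Carcinogenic subset": (0.269, 0.737),
    "Anticancer drugs": (0.468, 0.537),
}


def cdp_quadrant(similarity, auc, sim_threshold, auc_threshold=0.75):
    """Assign a data set to a Consensus Diversity Plot quadrant color."""
    high_fingerprint = similarity < sim_threshold   # low similarity = diverse
    high_scaffold = auc < auc_threshold             # low AUC = diverse
    if high_fingerprint and high_scaffold:
        return "red"
    if high_fingerprint:
        return "blue"
    if high_scaffold:
        return "yellow"
    return "white"


# Assumed threshold: median across the per-data-set median similarities.
sim_threshold = median(MEDIAN_SIMILARITIES)
for name, (sim, auc) in DATA.items():
    print(name, cdp_quadrant(sim, auc, sim_threshold))
```

The same function applies to F50 on the Y-axis if the comparison is reversed, since larger F50 values indicate higher scaffold diversity.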