J Jesús Naveja, Martin Vogt, Dagmar Stumpfe, José L Medina-Franco, Jürgen Bajorath.
Abstract
Chemical optimization of organic compounds produces a series of analogues. In addition to considering an analogue series (AS) or multiple series on a case-by-case basis, which is often done in the practice of chemistry, the extraction of analogues from compound repositories is of high interest in organic and medicinal chemistry. In organic chemistry, ASs are a source of alternative synthetic routes and also aid in exploring relationships between compounds from different sources including synthetic vs. naturally occurring molecules. In medicinal chemistry, ASs are the major source of structure-activity relationship information and of hits or leads for drug development. ASs might be identified in different ways. For a given reference compound, a substructure search can be carried out using its scaffold. Alternatively, matched molecular pairs can be calculated to retrieve analogues from a compound repository. However, if no query compounds are used, the identification of ASs in databases is a difficult task. Herein, we introduce a computational approach to systematically identify ASs in collections of organic compounds. The approach involves compound decomposition on the basis of well-established retrosynthetic rules, organization of compound-core relationships, and identification of analogues sharing the same core. The method was applied on a large scale to extract ASs from the ChEMBL database, yielding more than 30 000 distinct series.
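The last two steps named in the abstract (organizing compound-core relationships and collecting analogues that share a core) can be illustrated with a minimal sketch. It assumes the retrosynthetic decomposition has already been carried out, so each compound arrives with a precomputed set of core identifiers. The compound names, core labels, the `min_size` threshold, and the function `group_by_core` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def group_by_core(compound_cores, min_size=2):
    """Invert a compound -> cores mapping and keep every core that is
    shared by at least `min_size` compounds (a candidate analogue series)."""
    core_to_compounds = defaultdict(set)
    for compound, cores in compound_cores.items():
        for core in cores:
            core_to_compounds[core].add(compound)
    return {
        core: members
        for core, members in core_to_compounds.items()
        if len(members) >= min_size
    }


# Hypothetical input: in practice the cores would come from a
# fragmentation step that is not shown here.
compound_cores = {
    "cpd_A": {"core_1", "core_2"},
    "cpd_B": {"core_1", "core_3"},
    "cpd_C": {"core_1", "core_2"},
    "cpd_D": {"core_4"},
}

series = group_by_core(compound_cores)
```

In this toy input, `core_1` is shared by three compounds and `core_2` by two, so both are kept, while cores seen in only one compound are discarded.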
Year: 2019 PMID: 31459378 PMCID: PMC6648924 DOI: 10.1021/acsomega.8b03390
Source DB: PubMed Journal: ACS Omega ISSN: 2470-1343
Figure 1. Concept of the compound–core relationship method. The schematic representation illustrates the identification of analogue series using the CCR approach. For two exemplary compounds (left), all possible cores are shown resulting from the application of retrosynthetic rules and replacement of substitution sites with hydrogen atoms (generalization). In compounds (left), sites of retrosynthetic bond elimination are indicated by red lines. In cores (middle), generalized substitution sites are indicated by red hydrogen atoms. For the two analogues, the largest identical generalized cores and the reconstructed core with two substitution sites (right) are encircled (purple). The reconstructed core contains the invariant sulfonamide group.
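The selection of the largest identical core for a pair of analogues, as described in the caption, might be sketched as follows. This is an illustration under stated assumptions only: cores are represented by hypothetical labels, `core_size` is an assumed lookup of core size (for example a heavy-atom count), and `largest_shared_core` is a made-up helper rather than the authors' implementation.

```python
def largest_shared_core(cores_a, cores_b, core_size):
    """Return the largest core present in both core sets, or None.

    Ties in size are broken by label so the result is deterministic.
    """
    shared = cores_a & cores_b
    if not shared:
        return None
    return max(shared, key=lambda core: (core_size[core], core))


# Hypothetical cores for two analogues and assumed core sizes.
cores_a = {"core_small", "core_mid", "core_large"}
cores_b = {"core_small", "core_large", "core_other"}
core_size = {"core_small": 6, "core_mid": 10, "core_large": 14, "core_other": 9}

best = largest_shared_core(cores_a, cores_b, core_size)
```

Here the two compounds share `core_small` and `core_large`, and the larger of the two is returned.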
Figure 2. Compound–core relationships and identification of analogue series. (a) AS associated with three retrosynthetic cores. The core at the top represents all analogues (depicted on a purple background), whereas the two remaining cores represent two analogues each (encircled in green and red, respectively). (b) Two overlapping ASs are shown, each of which is associated with an individual core. The core at the top represents four analogues (depicted on a purple background) and the core at the bottom three (encircled in red). One of the analogues is shared by both series.
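The two situations in the caption (a core whose analogues are fully covered by another core, and two series that merely share an analogue) can be told apart with plain set comparisons. The sketch below uses invented analogue labels; the number of analogues in panel (a) is an assumption, and `relate` is a hypothetical helper.

```python
def relate(members_a, members_b):
    """Classify how the compound sets of two cores relate to each other."""
    if members_a == members_b:
        return "identical"
    if members_a < members_b or members_b < members_a:
        return "subset"
    if members_a & members_b:
        return "overlap"
    return "disjoint"


# Hypothetical data mirroring panel (a): one core covers all analogues,
# two further cores cover two analogues each.
panel_a = {
    "top": {"a1", "a2", "a3", "a4"},
    "green": {"a1", "a2"},
    "red": {"a3", "a4"},
}

# Hypothetical data mirroring panel (b): four and three analogues,
# one of which belongs to both series.
panel_b = {
    "top": {"b1", "b2", "b3", "b4"},
    "bottom": {"b4", "b5", "b6"},
}

green_vs_top = relate(panel_a["green"], panel_a["top"])
top_vs_bottom = relate(panel_b["top"], panel_b["bottom"])
shared = panel_b["top"] & panel_b["bottom"]
```

A "subset" result corresponds to the panel (a) case, where the top core already represents the whole series; an "overlap" result corresponds to panel (b), where both series are kept and share a member.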
Figure 3. Representing identified analogue series in R-group tables. A conventional R-group table for an AS with three substitution sites (R1–R3) is shown. Six exemplary analogues are listed. The core representing the AS is shown at the top. For each compound, the ChEMBL ID is provided.
Composition of Analogue Series Identified in ChEMBL Using the CCR Method
| # analogues/series | # series (%) | # series (%), multiple substitution sites | average # substitution sites |
|---|---|---|---|
| 2–9 | 27 391 (90.0%) | 7046 (25.7%) | 1.38 |
| 10–19 | 2272 (7.5%) | 933 (41.1%) | 1.71 |
| >19 | 768 (2.5%) | 380 (49.5%) | 1.93 |
Reported are the size distribution of ASs and the fraction of ASs per size range having multiple substitution sites. In addition, the average number of substitution sites per AS of increasing size is given.
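The percentages in the table follow directly from the raw counts. As a quick consistency check, the short script below recomputes them; the counts are copied from the table, while the variable names are only illustrative.

```python
# Series counts per size range, copied from the table above.
series_counts = {"2-9": 27391, "10-19": 2272, ">19": 768}
# Series with multiple substitution sites, per size range.
multi_site_counts = {"2-9": 7046, "10-19": 933, ">19": 380}

total_series = sum(series_counts.values())

# Share of each size range among all series, in percent.
share_of_all = {
    size: round(100 * count / total_series, 1)
    for size, count in series_counts.items()
}

# Share of multi-site series within each size range, in percent.
multi_site_share = {
    size: round(100 * multi_site_counts[size] / series_counts[size], 1)
    for size in series_counts
}
```

The total comes to 30 431 series, and the recomputed shares match the tabulated values.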
Comparison of MMPC- and CCR-Based Retrieval of Analogue Series from ChEMBL
| method | MMPC | CCR |
|---|---|---|
| # compounds in ASs | 103 154 | 145 269 |
| # ASs | 22 111 | 30 431 |
| # ASs (%), multiple substitution sites | 3509 (15.9%) | 8359 (27.5%) |
For the MMPC and CCR methods, the total number of ASs extracted from ChEMBL, the number of compounds forming these ASs, and the number (percentage) of ASs with multiple substitution sites are reported.
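To put the two columns side by side, the relative increase of CCR over MMPC can be computed from the tabulated counts. The numbers are taken from the table; the dictionary keys and the helper `pct_increase` are illustrative names only.

```python
# Counts copied from the comparison table above.
mmpc = {"compounds": 103154, "series": 22111, "multi_site": 3509}
ccr = {"compounds": 145269, "series": 30431, "multi_site": 8359}


def pct_increase(new, old):
    """Relative increase of `new` over `old`, in percent."""
    return round(100 * (new - old) / old, 1)


increase = {key: pct_increase(ccr[key], mmpc[key]) for key in mmpc}
```

On these counts, CCR retrieves roughly 41% more compounds and 38% more series than MMPC, and more than twice as many series with multiple substitution sites.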