Literature DB >> 31459378

Systematic Extraction of Analogue Series from Large Compound Collections Using a New Computational Compound-Core Relationship Method.

J Jesús Naveja^1,2,2, Martin Vogt¹, Dagmar Stumpfe¹, José L Medina-Franco², Jürgen Bajorath¹.

Abstract

Chemical optimization of organic compounds produces a series of analogues. In addition to considering an analogue series (AS) or multiple series on a case-by-case basis, which is often done in the practice of chemistry, the extraction of analogues from compound repositories is of high interest in organic and medicinal chemistry. In organic chemistry, ASs are a source of alternative synthetic routes and also aid in exploring relationships between compounds from different sources including synthetic vs. naturally occurring molecules. In medicinal chemistry, ASs are the major source of structure-activity relationship information and of hits or leads for drug development. ASs might be identified in different ways. For a given reference compound, a substructure search can be carried out using its scaffold. Alternatively, matched molecular pairs can be calculated to retrieve analogues from a compound repository. However, if no query compounds are used, the identification of ASs in databases is a difficult task. Herein, we introduce a computational approach to systematically identify ASs in collections of organic compounds. The approach involves compound decomposition on the basis of well-established retrosynthetic rules, organization of compound-core relationships, and identification of analogues sharing the same core. The method was applied on a large scale to extract ASs from the ChEMBL database, yielding more than 30 000 distinct series.

Entities: Chemical Species

Year: 2019 PMID： 31459378 PMCID： PMC6648924 DOI： 10.1021/acsomega.8b03390

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

In medicinal chemistry, hit-to-lead and lead optimization campaigns produce a series of analogues. An analogue series (AS) is generally defined as a series of compounds that share the same core structure and carry different R-groups at single or multiple substitution sites.[1] ASs are conventionally represented in R-group tables and are the major source of structure–activity relationship (SAR) information.[1−4] They are usually investigated as individual series in the course of chemical optimization. Computational methods have been introduced to organize large ASs and monitor SAR progression.[2−5] Going beyond the analysis of individual ASs, another important task is searching for analogues in compound libraries and databases. If one is interested in identifying analogues of given reference compound(s), substructure search approaches can be applied using the core structure of a reference compound as a query.[1,6] For example, this might be attempted in hit expansion when searching for analogues of an interesting active compound. Furthermore, analogues of reference compounds can also be identified without a predefined core structure by searching for matched molecular pairs (MMPs).[7] An MMP is defined as a pair of compounds that are only distinguished by a structural modification at a single site.[8] This modification can be rationalized as the exchange of a pair of substructures or a chemical transformation.[9] To detect analogue relationships via MMPs, chemical transformations are restricted in size to focus on typical R-group replacements.[7,10] MMPs can be efficiently generated algorithmically,[9] making MMP-based analogue searching generally applicable[7] and an attractive alternative to substructure search methods. A much more difficult task than query-based analogue searching is the identification of ASs in large compound data sets, without prior knowledge. However, this task is highly relevant for knowledge extraction from compounds and activity data. In medicinal chemistry, one would like to identify and extract ASs of any composition from heterogeneous compound sources to maximize SAR information retrieval and provide templates for compound optimization. However, to the best of our knowledge, only one computational method for the systematic identification of ASs has so far been introduced.[11] This approach is also based upon the MMP formalism. For a given data set, all possible MMPs are generated and organized in an MMP-based network in which nodes represent compounds and edges pairwise MMP relationships. In this network, separate MMP clusters (MMPCs) are formed by individual ASs that can hence be easily identified.[11] Accordingly, this approach is termed herein an MMP cluster (MMPC)-based method. In the simplest case, an AS from a cluster is formed by a matching molecular series (MMS)[12] having a single substitution site. However, separate clusters in the MMP-based network can also be formed by multiple and overlapping MMSs representing ASs with multiple substitution sites.[11] In this case, each participating MMS contributes a unique single site. Herein, we introduce another computational methodology for the systematic identification of ASs in repositories of organic compounds, which does not rely on the MMP formalism. Rather, it is based upon the decomposition of single compounds according to well-established retrosynthetic rules and subsequent organization of compound–core relationships (CCRs). In a large-scale application, this new compound–core relationship (CCR) method was applied to systematically extract ASs from the ChEMBL database[13] and compared with the MMPC approach.

Results and Discussion

Extracting ASs from compound repositories without prior knowledge or query compounds is a difficult task. The CCR method introduced herein for systematically identifying ASs in databases of any composition is conceptually simple and generally applicable. The method comprises three sequential steps, including the generation of cores, exploration of compound–core relationships, and identification of analogue series, which are discussed in the following sections.

Methodological Concept

Generation of Cores

The primary goal of the method is the identification of core structures and corresponding analogues such that the compounds can be readily reconstructed from the cores by substitutions at one or more sites and organized into ASs. The basis for reconstruction is provided by systematically applying combinations of possible bond deletions in each compound using retrosynthetic rules. Specifically, for each database compound, all possible combinations of one to five (or any other predefined number of) bonds are systematically subjected to retrosynthetic cleavage. Hence, a maximum number of five substitution sites per AS are covered. Each combination of applicable retrosynthetic rules leading to the corresponding elimination of single or multiple bonds yields a potential core. The core is considered valid if it consists of a single substructure containing an individual end point (substitution site) for each cleaved bond. Figure illustrates the generation of cores for two analogues having two retrosynthetic cleavage sites. In addition to the three cores obtained from each analogue through retrosynthetic modification, each original compound is recorded as a core with no cleavage sites. Substitution sites in cores are recorded. Furthermore, it is required that the core and eliminated fragments (substituents) meet a predefined size ratio. In our proof-of-concept study presented herein, we applied the rule that the core must contain at least two-thirds of the heavy atoms comprising the original compound. In other words, the ratio of the number of heavy atoms in the core to the sum of the total number of heavy atoms in all substituents must be at least 2:1. If these requirements are met, a core is accepted for further analysis. For a given database, all possible cores are generated and then “generalized”. During the generalization step, all substitution sites are disregarded by introducing hydrogen atom substitutions at each site such that different cores become identical if they only differ in the position of substitution sites. In Figure , two identical cores resulting from generalization are highlighted.

Figure 1

Concept of the compound–core relationship method. The schematic representation illustrates the identification of analogue series using the CCR approach. For two exemplary compounds (left), all possible cores are shown resulting from the application of retrosynthetic rules and replacement of substitution sites with hydrogen atoms (generalization). In compounds (left), sites of retrosynthetic bond elimination are indicated by red lines. In cores (middle), generalized substitution sites are indicated by red hydrogen atoms. For the two analogues, the largest identical generalized cores and the reconstructed core with two substitution sites (right) are encircled (purple). The reconstructed core contains the invariant sulfonamide group.

Exploration of Compound–Core Relationships

Original database compounds that are identical to hydrogen-substituted cores are assigned to the corresponding cores as the smallest possible analogues. Generalization of cores is followed by reconstruction of recorded substitution sites and the assignment of additional database compounds to cores that differ at given substitution sites. The generalization and reconstruction steps ensure that compounds with all possible substitutions are assigned to corresponding cores, for example, analogues with ortho-, meta-, and/or para-substitution at one or more rings. We note that this cannot be accomplished on the basis of MMPs. The assignment of compounds to cores with reconstructed substitution sites yields all possible compound–core relationships in an organized form. Figure illustrates the reconstruction of a single core with two substitution sites representing two exemplary analogues.

Identification of Analogue Series

An AS is formed if at least two compounds are associated with a core. Because all possible cores meeting the acceptance criteria are involved in CCRs, analogues forming an AS are often associated with multiple cores. ASs might consist of distinct sets of analogues, i.e., analogues belonging to one and only one AS, or overlapping sets of compounds. In addition, an AS might be fully contained as a subset in another series. The latter case is disambiguated by removal of ASs forming a subset of another. In addition, if two ASs contain exactly the same analogues, the one associated with the largest core is retained. Other possible cases must also be taken into consideration. Figure a illustrates the frequently observed situation that an AS is associated with multiple cores. One of the cores might represent the entire AS and others subsets of analogues comprising the series. In this case, the core associated with the entire AS is retained to represent the series.

Figure 2

Compound–core relationships and identification of analogue series. (a) AS associated with three retrosynthetic cores. The core at the top represents all analogues (depicted on a purple background), whereas the two remaining cores represent two analogues each (encircled in green and red, respectively). (b) Two overlapping ASs are shown, each of which is associated with an individual core. The core at the top represents four analogues (depicted on a purple background) and the core at the bottom three (encircled in red). One of the analogues is shared by both series. Figure b shows an example of overlapping ASs having different cores. The series share one analogue that is associated with both cores. This example also illustrates the rationale for consistently applying core/substituents size ratio restrictions. We note that the smaller core at the bottom in Figure b is a substructure of the larger one at the top. Due to the applied 2:1 size ratio restriction, the three analogues at the top are not presented by the small core at the bottom. This provides a basis for separating the series into two smaller ASs. The confined set of six analogues in this example could have been easily combined into a single AS by assigning two cores to the series. However, application of the size ratio restriction as a criterion for separating overlapping series generally avoids the situation that increasingly large compounds associated with cores that are substructures of each other form elongated “pseudo-AS” that might be artificial in nature and not meaningful chemically. Albeit rarely observed (see Section ), this possible complication should strictly be avoided to ensure chemical relevance of computed ASs. Therefore, in overlapping ASs, each analogue is assigned to the largest AS it belongs to and removed from others. If the number of compounds in alternative ASs is the same, the AS associated with the larger core is selected. Furthermore, if the cores have an identical size, preference is given to the one with fewer substitution sites. Application of these criteria ensures that nearly all overlapping series are disambiguated, as further discussed below. The protocol outlined above guarantees that each AS is ultimately associated with a single core and each compound is associated with no more than one AS. Distinguishing between different CCRs is also of practical relevance. The consistent association of analogues and cores on the basis of size ratio restrictions and the selection of largest possible cores ensures that newly identified ASs are well-defined and can be easily represented in standard R-group tables, as illustrated in Figure . Hence, such ASs are readily available for follow-up analysis in medicinal chemistry.

Figure 3

Representing identified analogue series in R-group tables. A conventional R-group table for an AS with three substitution sites (R1–R3) is shown. Six exemplary analogues are listed. The core representing the AS is shown at the top. For each compound, the ChEMBL ID is provided.

Evaluation

Large-Scale Search Application

In a proof-of-concept application, the CCR method was applied to systematically search for ASs in 244 704 active compounds from the ChEMBL database (for details, see the Materials and Methods section). A total of 30 431 ASs containing 145 269 compounds were identified, 8359 of which contained cores with multiple substitution sites. Table reports the size distribution of these ASs, 90% containing between two andnine analogues, 7.5% containing between 10 and 19, and 2.5% containing more than 19 analogues. Furthermore, with increasing size, the proportion of ASs with multiple substitution sites and the average number of substitution sites per AS also increased. For example, the 768 ASs containing at least 20 analogues included 380 series with multiple substitution sites and had on average close to two substitution sites per AS (with a maximum of sites).

Table 1

Composition of Analogue Series Identified in ChEMBL Using the CCR Methoda

# analogues/series	# series (%)	# series (%), multiple substitution sites	average # substitution sites
2–9	27 391 (90.0%)	7046 (25.7%)	1.38
10–19	2272 (7.5%)	933 (41.1%)	1.71
>19	768 (2.5%)	380 (49.5%)	1.93

Reported are the size distribution of ASs and the fraction of ASs per size range having multiple substitution sites. In addition, the average number of substitution sites per AS of increasing size is given. Importantly, 18 606 (61%) of the identified ASs containing 64 323 compounds were nonoverlapping and associated with a single core representing the entire series, corresponding to the example shown in Figure a. Furthermore, 11 825 ASs (39%) containing 80 946 compounds were obtained from a set of 24 202 initially overlapping AS, as illustrated in Figure b. Most of the overlapping ASs were separated into well-defined series by uniquely assigning each compound to a single core. Disambiguation (as detailed above) was not possible for a very small subset of 96 overlapping ASs, 82 of which contained less than five compounds. Thus, taken together, the results of systematic search calculations using the CCR method revealed that the majority of newly identified ASs was distinct from others. In cases where series overlap was detected, separation into nonoverlapping ASs was mostly unambiguous. Pseudo-ASs were not detected.

Method Comparison

For comparison, search calculations on the basis of 244 704 ChEMBL compounds were repeated using the MMPC approach,[11] the only other computational methodology available to date for systematically identifying ASs. The results are reported in Table . MMPC calculations identified 22 111 ASs that covered a total of 103 154 ChEMBL compounds. These series included 3509 ASs (15.9%) with multiple substitution sites. In contrast, the CCR search calculations detected 30 431 ASs that covered a total of 145 326 compounds and included 8359 ASs (27.5%) with multiple substitution sites. Most of the ASs obtained by MMPC were also detected using the CCR method, with some variation in the composition of individual (especially larger) series. Moreover, nearly all analogues (97%) obtained by MMPC were identified using the CCR approach, which yielded 45 508 additional analogues. MMPC calculations yielded 2191 ASs comprising 10 or more analogues. Of these series, 1986 ASs (91%) having more than 50% compound overlap were also identified by CCR including 1406 ASs with at least 80% compound overlap and 730 identical ASs. The overlap was calculated as the Jaccard index, i.e., the ratio of the number of shared analogues to the total number of unique analogues in a pair of corresponding series. CCR calculations identified a total of 3040 ASs with 10 or more analogues including 1352 ASs that were not detected using MMPC.

Table 2

Comparison of MMPC- and CCR-Based Retrieval of Analogue Series from ChEMBLa

method	MMPC	CCR
# compounds in ASs	103 154	145 269
# ASs	22 111	30 431
# ASs (%), multiple substitution sites	3509 (15.9%)	8359 (27.5%)

For the MMPC and CCR methods, the total number of ASs extracted from ChEMBL, the number of compounds forming these ASs, and the number (percentage) of ASs with multiple substitution sites are reported. The MMPC/CCR comparison showed that the CCR method identified a significantly larger number of ASs, with a larger proportion of series having multiple substitution sites, and achieved a larger global compound coverage.

Conclusions

The identification of ASs in compound repositories without prior knowledge is of considerable relevance for the practice of organic and medicinal chemistry. ASs and the associated activity information can be used to rationalize and/or guide chemical synthesis and optimization efforts. However, only little has been done so far to automatically identify and extract ASs from databases, leaving much room for further developments. Herein, we have introduced a new computational approach to systematically search for ASs. The CCR method relies on the decomposition of single compounds on the basis of retrosynthetic rules, systematic generation of cores and compound–core relationships, and identification of ASs on the basis of organized and prioritized relationships. By design, the methodology is conceptually simple yet generally applicable. As such, it is thought to represent an attractive addition to the current repertoire of computational methods with utility for organic and medicinal chemistry. In our proof-of-concept investigation, a systematic search for ASs in ChEMBL identified a large number of ASs. The majority of ASs were nonoverlapping and distinct from others and associated with an individual core representing the entire AS. Such series should be of considerable interest for further SAR analysis and the identification of target-selective or promiscuous compounds. In summary, the CCR method introduced herein represents a new and general approach for systematically identifying ASs. It should be of interest to computational as well as organic and medicinal chemists including investigators aiming to explore relationships between compounds from different sources such as natural products and synthetic compounds. Such analyses will provide interesting topics for future application-oriented research.

Materials and Methods

Retrosynthetic Rules

As retrosynthetic rules for compound decomposition, a well-established set of 13 retrosynthetic combinatorial analysis procedure (RECAP) rules was applied.[14] We emphasize that the CCR methodology does not depend on a given set of rules. Depending on individual preferences or project requirements, any chosen set of reaction/retrosynthetic rules can be used. This is particularly relevant for applications in organic chemistry when new synthesis schemes are explored and compared with others.

Core Generation Details

The systematic generation of cores is among the three central components of the CCR method. Further details are provided. Bonds in compounds are cleaved according to RECAP rules and respective substituents are removed. If multiple RECAP rules are applicable to a given compound, all possible combinations are explored to generate cores. For example, if three rules A, B, and C apply, seven cores are obtained, including three with single cleavage sites (A, B, and C), three with dual sites (A/B, A/C, and B/C), and one with three cleavage sites (A/B/C). However, cores are only accepted to establish compound–core relationships if the ratio of the number of heavy atoms forming the core to the number of heavy atoms of all eliminated substituents is at least 2:1. The number of bonds in a compound to which RECAP rules applied was limited (and rarely larger than 20). Consequently, the exhaustive exploration of all possible combinations and resulting cores did not pose a combinatorial problem in most cases. In addition, the 2:1 size ratio restriction further reduced the number of cores for analyzing compound–core relationships. Nonetheless, a computational time restriction of 100 s per compound was implemented for core generation. However, due to this constraint, only 629 of 244 704 ChEMBL compounds failed to produce cores. The protocol for compound decomposition according to retrosynthetic rules was implemented in Java with the aid of the OEChem toolkit.[15]

Implementation of the CCR Algorithm

The CCR algorithm for systematically identifying ASs, as detailed in the Results and Discussion section, was implemented in Python.

Searching for Analogue Series

Systematic search calculations using the CCR and MMPC reference methods were carried out in a curated version of ChEMBL release 23.[13] Only compounds with direct interactions (target relationship type “D”) with human targets at the highest confidence level (target confidence score 9) and available Ki or IC50 values were selected, yielding a total of 244 704 active compounds. The application of these selection criteria was not essential for the analysis but ensured that detected ASs exclusively consisted of compounds for which meaningful activity data were available.

13 in total

1. Evaluation of different virtual screening strategies on the basis of compound sets with characteristic core distributions and dissimilarity relationships.

Authors: Tomoyuki Miyao; Swarit Jasial; Jürgen Bajorath; Kimito Funatsu
Journal: J Comput Aided Mol Des Date: 2019-08-21 Impact factor: 3.686

2. Introducing the metacore concept for multi-target ligand design.

Authors: Dagmar Stumpfe; Alexander Hoch; Jürgen Bajorath
Journal: RSC Med Chem Date: 2021-04-15

3. Advances in exploring activity cliffs.

Authors: Dagmar Stumpfe; Huabin Hu; Jürgen Bajorath
Journal: J Comput Aided Mol Des Date: 2020-05-05 Impact factor: 3.686

4. A general approach for retrosynthetic molecular core analysis.

Authors: J Jesús Naveja; B Angélica Pilón-Jiménez; Jürgen Bajorath; José L Medina-Franco
Journal: J Cheminform Date: 2019-09-24 Impact factor: 5.514

5. Predicting Isoform-Selective Carbonic Anhydrase Inhibitors via Machine Learning and Rationalizing Structural Features Important for Selectivity.

Authors: Salvatore Galati; Dimitar Yonchev; Raquel Rodríguez-Pérez; Martin Vogt; Tiziano Tuccinardi; Jürgen Bajorath
Journal: ACS Omega Date: 2021-01-26

10. DeepCOMO: from structure-activity relationship diagnostics to generative molecular design using the compound optimization monitor methodology.

Authors: Dimitar Yonchev; Jürgen Bajorath
Journal: J Comput Aided Mol Des Date: 2020-10-05 Impact factor: 3.686