Matched molecular pair-based data sets for computer-aided medicinal chemistry.

Ye Hu, Antonio de la Vega de León, Bijun Zhang, Jürgen Bajorath.

Abstract

Matched molecular pairs (MMPs) are widely used in medicinal chemistry to study changes in compound properties including biological activity, which are associated with well-defined structural modifications. Herein we describe up-to-date versions of three MMP-based data sets that have originated from in-house research projects. These data sets include activity cliffs, structure-activity relationship (SAR) transfer series, and second generation MMPs based upon retrosynthetic rules. The data sets have in common that they have been derived from compounds included in the ChEMBL database (release 17) for which high-confidence activity data are available. Thus, the activity data associated with MMP-based activity cliffs, SAR transfer series, and retrosynthetic MMPs cover the entire spectrum of current pharmaceutical targets. Our data sets are made freely available to the scientific community.

Year: 2014; PMID: 24627802; PMCID: PMC3945766; DOI: 10.12688/f1000research.3-36.v2

Source DB: PubMed; Journal: F1000Res; ISSN: 2046-1402

Introduction

The matched molecular pair (MMP) concept is widely applied in medicinal chemistry [1–4]. An MMP is defined as a pair of compounds that are distinguished only by a structural modification at a single site [1], i.e., the exchange of a substructure, termed a chemical transformation [5]. MMPs are attractive tools for computational analysis because they can be generated algorithmically and make it possible to associate defined structural modifications at the level of compound pairs with changes in chemical properties, including biological activity [2–4]. MMPs are usually chemically intuitive and easily accessible, which helps to bridge the gap between computational analysis and the practice of medicinal chemistry. In the context of different studies, we have systematically generated MMPs by mining publicly available compound activity data. All possible MMPs were derived from compounds active against currently available pharmaceutical targets. These MMPs were then used to explore structure-activity relationships (SARs) on a large scale and from different viewpoints. In a previous data article, we reported and made publicly available a number of different data sets and computational tools developed in our laboratory [6]. Here we describe three recently developed MMP-based data structures, which should be of interest for SAR analysis and compound design, and we provide up-to-date versions of the corresponding data sets. These data sets are anticipated to be a helpful resource for computer-aided medicinal chemistry applications. They include MMP-based activity cliffs (i.e., MMP-cliffs), SAR transfer series, and MMPs derived on the basis of retrosynthetic fragmentation rules, and they were derived from all bioactive compounds currently available in the ChEMBL database (release 17) [7, 8]. Only high-confidence activity data (as specified below) were considered.

MMP-cliffs, SAR transfer series, and retrosynthetic MMPs provide comprehensive sources of SAR information. In addition, retrosynthetic MMPs are thought to increase the utility of computational MMP analysis for practical chemistry efforts because these second generation MMPs consider reaction information during molecular fragmentation, which sets them apart from standard MMPs originating from systematic fragmentation of all possible exocyclic single bonds in a molecule (as detailed below).

Materials and methods

Concepts

(1) Activity cliffs are generally defined as pairs or groups of compounds that are structurally similar and have large differences in potency [9–11]. Accordingly, activity cliffs usually have high SAR information content (because small chemical changes in similar or analogous compounds lead to large potency effects). The assessment of activity cliffs requires clearly defined similarity and potency difference criteria [9–11]. The formation of an MMP can be considered as a similarity criterion, which is similarity metric-free and often chemically more intuitive than the use of calculated molecular similarity [11, 12]. MMP formation as a similarity criterion has led to the introduction of MMP-cliffs [12]. For MMP-cliffs, a difference in potency of at least two orders of magnitude between cliff-forming compounds was set as a potency difference criterion [12]. Figure 1 shows exemplary MMP-cliffs.
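The potency difference criterion can be illustrated with a minimal sketch. The function names and the example Ki values are hypothetical and are not taken from the data sets; the sketch only assumes that potencies are compared on a logarithmic scale (pKi), so that two orders of magnitude correspond to a difference of 2.0.

```python
import math


def pki_from_ki_nm(ki_nm: float) -> float:
    """Convert a Ki value given in nM to pKi = -log10(Ki in mol/L)."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    return 9.0 - math.log10(ki_nm)


def is_mmp_cliff(pki_a: float, pki_b: float, min_delta: float = 2.0) -> bool:
    """Return True if two MMP-forming compounds differ in potency by at
    least `min_delta` log units (2.0 = two orders of magnitude)."""
    return abs(pki_a - pki_b) >= min_delta


# Hypothetical pair of MMP-forming compounds.
weak = pki_from_ki_nm(1000.0)  # 1 µM
strong = pki_from_ki_nm(5.0)   # 5 nM
print(is_mmp_cliff(weak, strong))
```

Setting `min_delta=1.0` gives the less stringent criterion of one order of magnitude.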

Figure 1.

MMP-cliffs.

Six representative MMP-cliffs for three targets belonging to different target families are shown: (a) muscarinic acetylcholine receptor M3, (b) serine/threonine-protein kinase c-TAK1, (c) matrix metalloproteinase-2. The pKi value of each compound is provided and the structural differences between cliff-forming compounds are highlighted in red.

(2) SAR transfer can be rationalized in different ways. For example, a compound series might display similar potency progression against two different targets [13]. Alternatively, two different compound series with corresponding analogs, i.e., series having different core structures and containing compounds with pairwise corresponding substitutions, might display similar potency progression against a given target [14]. Such SAR transfer series displaying similar target-specific SAR behavior are often sought after in medicinal chemistry as alternative compounds for optimization. Here we focus on these target-based SAR transfer series. Figure 2 shows an example.

Figure 2.

SAR transfer series.

An exemplary target-based SAR transfer series is shown. Compound pairs are arranged in the order of increasing potency (from the bottom to the top). Potency progression is monitored by corresponding pairs of color-coded dots using a continuous color spectrum from green (lowest potency value in the compound data set; pKi = 5.7) over yellow to red (highest potency value; pKi = 9.0). The pKi value of each compound is provided. The core structures are drawn in black and the substituents in red. The compounds are active against serine/threonine-protein kinase D2.

(3) Computational generation of MMPs typically involves molecular fragmentation through the systematic deletion of exocyclic single bonds [5]. Hence, the resulting fragments representing a molecular core and substituent are not derived by considering chemical reactions. Accordingly, a transformation relating MMP-forming compounds to each other might not necessarily be interpretable from a synthetic perspective. The synthetic accessibility of MMPs might therefore be improved by considering reaction information during molecular fragmentation. This has been accomplished by applying the well-known retrosynthetic combinatorial analysis procedure (RECAP) rules [15], leading to the introduction of RECAP-MMPs [16]. Representative examples are shown in Figure 3. In addition, exemplary differences between standard MMPs and RECAP-MMPs are illustrated in Figure 4.
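Once compounds have been fragmented into a core and a substituent, pairs that share a core and differ in the substituent form MMPs. The following is a minimal sketch of this pairing step only. It is not the in-house implementation, it performs no fragmentation itself, and the compound identifiers and fragment SMILES are hypothetical inputs chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def find_mmps(fragmentations):
    """Pair compounds that share a core but carry different substituents.

    `fragmentations` is an iterable of (compound_id, core, substituent)
    tuples, assumed to come from a prior fragmentation step.
    """
    by_core = defaultdict(set)
    for compound_id, core, substituent in fragmentations:
        by_core[core].add((compound_id, substituent))

    mmps = []
    for core, members in by_core.items():
        for (id_a, sub_a), (id_b, sub_b) in combinations(sorted(members), 2):
            if id_a != id_b and sub_a != sub_b:
                # The transformation is the exchange of one substituent for the other.
                mmps.append((id_a, id_b, core, f"{sub_a}>>{sub_b}"))
    return mmps


# Hypothetical fragmentations; [*] marks the attachment point.
toy = [
    ("cpd1", "[*]c1ccccc1", "[*]Cl"),
    ("cpd2", "[*]c1ccccc1", "[*]OC"),
    ("cpd3", "[*]c1ccncc1", "[*]Cl"),
]
pairs = find_mmps(toy)
print(pairs)
```

Grouping by core first avoids comparing every compound with every other compound; only compounds that already share a core are paired.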

Figure 3.

RECAP-MMPs.

In (a)–(d), four exemplary RECAP-MMPs representing different retrosynthetic rules are shown. For each RECAP-MMP, the chemical transformation is highlighted in red.

Figure 4.

Standard MMPs vs. RECAP-MMPs.

Two pairs of compounds that form both standard MMPs and RECAP-MMPs are shown. For each pair, the structural differences between compounds are highlighted. The chemical transformation associated with the standard MMP is colored in red, while the transformation of the RECAP-MMP corresponds to the combination of fragments colored in red and blue.

MMP generation

For the generation of MMP-cliffs, SAR transfer series, and RECAP-MMPs, transformation size restrictions that limit transformations to meaningful chemical substitutions were introduced [12]. Specifically, the common core structure had to be at least twice the size of each exchanged substructure. Furthermore, the difference in size of the exchanged fragments was limited to at most eight non-hydrogen atoms and the maximal size of an exchanged fragment was set to 13 non-hydrogen atoms [12]. Therefore, the largest permitted transformations included, for example, the addition of a substituted ring to a compound or the replacement of a five- or six-membered ring with a substituted condensed two-ring system (with a maximum of 13 atoms). All possible transformation size-restricted MMPs and RECAP-MMPs were calculated using an in-house implementation of the algorithm by Hussain and Rea [5] that utilizes the OpenEye toolkit [17].
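The three transformation size restrictions can be expressed as a single predicate. This is a minimal sketch with hypothetical function and parameter names; it assumes that sizes are counted in non-hydrogen atoms and that a hydrogen substituent counts as a fragment of size 0.

```python
def passes_size_restrictions(core_atoms: int, frag_a_atoms: int, frag_b_atoms: int,
                             min_core_ratio: int = 2, max_diff: int = 8,
                             max_frag: int = 13) -> bool:
    """Check the transformation size restrictions described in the text."""
    larger = max(frag_a_atoms, frag_b_atoms)
    return (
        # The core must be at least twice the size of each exchanged fragment.
        core_atoms >= min_core_ratio * larger
        # The exchanged fragments may differ by at most eight atoms.
        and abs(frag_a_atoms - frag_b_atoms) <= max_diff
        # No exchanged fragment may exceed 13 atoms.
        and larger <= max_frag
    )


print(passes_size_restrictions(core_atoms=30, frag_a_atoms=6, frag_b_atoms=13))
print(passes_size_restrictions(core_atoms=20, frag_a_atoms=5, frag_b_atoms=13))
```

The first call satisfies all three conditions; the second fails because a 20-atom core is less than twice the size of the 13-atom fragment.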

Compounds and activity data

Compound data were taken from the latest version of ChEMBL (release 17) [7, 8]. Only compounds with direct interactions (i.e., target relationship type “D”) against human targets at the highest confidence level (target confidence score 9) were selected. Two types of potency measurements were separately considered, i.e., Ki (equilibrium constant) and IC50 (half-maximal inhibition concentration) values. In order to ensure high data confidence, inactive or inconclusive compounds and compounds with approximate measurements such as “>”, “<”, or “∼” were not considered. For compounds with multiple measurements against the same target, the geometric mean was calculated as the final potency annotation, provided that all values fell within one order of magnitude; otherwise, the compound was discarded. All qualifying compounds were further organized into target sets. A total of 661 and 1203 target sets (consisting of compounds with reported specific activity against a given target) were collected for the Ki- and IC50-based subsets, respectively, as reported in Table 1. The target sets contained a total of 45,353 and 95,685 compounds and 77,421 and 135,291 potency measurements for the Ki and IC50 subsets, respectively. These target sets provided the basis for the generation of all MMPs.
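The handling of multiple measurements can be sketched as follows. The function names and the record layout (a relation symbol paired with a value in nM) are hypothetical; treating a spread of exactly one order of magnitude as acceptable is also an assumption, because the text does not specify the boundary case.

```python
import math


def exact_values(records):
    """Keep only exact measurements, dropping approximate ones
    ('>', '<', '~'). `records` holds (relation, value_nM) tuples."""
    return [value for relation, value in records if relation == "="]


def aggregate_potency(values_nm, max_log_range: float = 1.0):
    """Return the geometric mean of the measurements if they all fall
    within `max_log_range` log units, else None (compound discarded)."""
    if not values_nm:
        return None
    if any(v <= 0 for v in values_nm):
        raise ValueError("potency values must be positive")
    logs = [math.log10(v) for v in values_nm]
    if max(logs) - min(logs) > max_log_range:
        return None
    # The geometric mean is the arithmetic mean on the log scale.
    return 10 ** (sum(logs) / len(logs))


# Hypothetical measurements for one compound against one target.
records = [("=", 10.0), ("=", 40.0), (">", 10000.0)]
potency = aggregate_potency(exact_values(records))
print(potency)
```

For the two exact values of 10 nM and 40 nM, the geometric mean is approximately 20 nM, whereas the arithmetic mean would be 25 nM.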

Table 1.

Data sets.

Number of	Ki	IC50
Targets	661	1203
Compounds	45,353	95,685
Measurements	77,421	135,291

For the Ki and IC50 subsets from the latest version of ChEMBL (release 17), the total numbers of targets, compounds, and corresponding potency measurements are reported.

Results

As a follow-up on the original publications in which MMP-cliffs [12], SAR transfer series [14], and RECAP-MMPs [16] were introduced, all corresponding data sets have been re-generated on the basis of ChEMBL release 17, hence providing up-to-date versions for release. Separate data subsets have been generated for different types of well-defined potency measurements (i.e., assay-dependent IC50 vs. assay-independent Ki values) to avoid inconsistencies due to simultaneous consideration of different potency measurements that cannot be directly compared.

MMP-cliffs

Figure 1 illustrates small chemical changes in compound pairs leading to large potency differences that are captured by MMP-cliffs. For ease of structural interpretation, we currently prefer MMP-based activity cliff representations over alternative representations that rely on calculated similarity values [11]. Table 2 provides the MMP-cliff statistics for the current data set. On the basis of Ki and IC50 measurements, more than 20,000 and 25,000 MMP-cliffs were obtained, respectively, when a potency difference of at least 100-fold between cliff-forming compounds was required. These MMP-cliffs corresponded to ~5% of all MMPs that were generated from ChEMBL compounds with high-confidence activity data. They covered 293 and 500 different targets on the basis of Ki and IC50 measurements, respectively. In addition to this more conservative potency difference cutoff, MMP-cliffs were also identified with a less stringent criterion, i.e., two compounds forming an MMP were required to have a potency difference of at least one order of magnitude. In this case, as reported in Table 2, nearly 99,000 and more than 126,000 MMP-cliffs were detected in 392 and 726 targets for the Ki and IC50 subsets, respectively. The proportion of MMP-cliffs increased to approx. 25%.

Table 2.

MMP and MMP-cliff statistics.

Number of		Ki	IC50
MMPs		385,777	537,848
Targets with MMPs		467	929
MMP compounds		40,454 (89.2%)	80,744 (84.4%)
∆Potency ≥ 1 OoM	MMP-cliffs	98,608	126,464
	% MMP-cliffs	25.6%	23.5%
	Targets with MMP-cliffs	392	726
	MMP-cliff compounds	29,976 (66.1%)	50,413 (52.7%)
∆Potency ≥ 2 OoM	MMP-cliffs	20,073	25,297
	% MMP-cliffs	5.2%	4.7%
	Targets with MMP-cliffs	293	500
	MMP-cliff compounds	11,760 (25.9%)	16,816 (17.6%)

For the Ki- and IC50-based compound subsets, the number of MMPs, the number of targets for which MMPs were obtained, and the number (and ratio) of compounds that formed MMPs are reported. In addition, the number and proportion of MMP-cliffs derived from all MMPs with potency difference (∆Potency) of at least one order (1 OoM) or two orders of magnitude (2 OoM) are provided, respectively, as well as the number of targets for which MMP-cliffs were obtained and the number (and ratio) of cliff-forming compounds.
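The MMP-cliff proportions in Table 2 follow directly from the reported counts. The short sketch below recomputes them; the dictionary layout and function name are illustrative only, and the counts are copied from the table.

```python
# Counts copied from Table 2.
table2 = {
    "Ki": {"mmps": 385_777, "cliffs_1oom": 98_608, "cliffs_2oom": 20_073},
    "IC50": {"mmps": 537_848, "cliffs_1oom": 126_464, "cliffs_2oom": 25_297},
}


def cliff_percentages(row):
    """Return the percentage of MMPs that are MMP-cliffs at the
    1 OoM and 2 OoM thresholds, rounded to one decimal place."""
    return tuple(
        round(100 * row[key] / row["mmps"], 1)
        for key in ("cliffs_1oom", "cliffs_2oom")
    )


for subset, row in table2.items():
    print(subset, cliff_percentages(row))
```

The recomputed values agree with the percentages given in the table (25.6% and 5.2% for Ki; 23.5% and 4.7% for IC50).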

SAR transfer series

SAR transfer series are best rationalized as pairs of compound series active against the same target that have distinct core structures and consist of corresponding pairs of analogs, as illustrated in Figure 2 for a small series with three pairs. In contrast to the original analysis of target-based SAR transfer [14], which was based upon MMPs without transformation size restrictions, the current analysis has been carried out on the basis of size-restricted MMPs. This modification further supports SAR exploration (because only small chemical changes are considered) and explains a reduction in series numbers compared to the original publication. In Table 3, the numbers of different series available for the current data set are reported. A total of 1270 and 2109 matching series were obtained from the Ki and IC50 subsets, respectively. Matching series met the structural requirement of consisting of at least three pairs of corresponding analogs. In addition, the potency values of compounds associated with individual series had to span at least two orders of magnitude. From these pre-selected matching series, 157 (Ki) and 513 (IC50) SAR transfer series with at least approximate potency progression and activity against 42 and 54 targets, respectively, were obtained. A subset of 60 (Ki) and 322 (IC50) SAR transfer series displayed strictly corresponding (regular) potency progression (often over different potency ranges) [14]. These series were active against 23 (Ki) and 27 (IC50) different targets. The size of SAR transfer series with approximate and regular potency progression ranged from three to 12 corresponding pairs of analogs. On average, the SAR transfer series consisted of three to four pairs.
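The selection criteria for matching series can be sketched in a few lines. The function names and potency values are hypothetical. In particular, the test for regular potency progression below is a simplified interpretation (corresponding analogs rank in the same strict order in both series); the precise definitions of approximate and regular progression are given in the original publication [14] and are not reproduced here.

```python
def is_matching_series(pot_a, pot_b, min_pairs: int = 3, min_span: float = 2.0) -> bool:
    """Check the pre-selection criteria: at least `min_pairs` corresponding
    analog pairs, and each series spanning >= `min_span` log units."""
    if len(pot_a) != len(pot_b) or len(pot_a) < min_pairs:
        return False
    return (max(pot_a) - min(pot_a) >= min_span
            and max(pot_b) - min(pot_b) >= min_span)


def has_regular_progression(pot_a, pot_b) -> bool:
    """Assumed simplification: after ordering pairs by potency in series A,
    potency must increase strictly in both series."""
    order = sorted(range(len(pot_a)), key=lambda i: pot_a[i])
    a_sorted = [pot_a[i] for i in order]
    b_sorted = [pot_b[i] for i in order]

    def strictly_up(values):
        return all(x < y for x, y in zip(values, values[1:]))

    return strictly_up(a_sorted) and strictly_up(b_sorted)


# Hypothetical pKi values; index i in both lists refers to corresponding analogs.
series_a = [5.0, 6.5, 8.0]
series_b = [5.5, 7.0, 8.2]
print(is_matching_series(series_a, series_b), has_regular_progression(series_a, series_b))
```

Swapping the last two values of `series_b` keeps the series a matching series but breaks the strict correspondence in potency ordering.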

Table 3.

Target-based SAR transfer series statistics.

Number of	Ki	IC50
Matching series	1270	2109
T_SAR-TS	157	513
Targets with T_SAR-TS	42	54
T_SAR-TS-RP	60	322
Targets with T_SAR-TS-RP	23	27

For the Ki and IC50 subsets, the number of qualifying matching compound series is reported. In addition, the number of target-based SAR transfer series with at least approximate potency progression (T_SAR-TS), the subset of SAR transfer series with regular potency progression (T_SAR-TS-RP), and the corresponding numbers of targets are given.

RECAP-MMPs

The replacement of systematic fragmentation of exocyclic single bonds with a set of 13 retrosynthetic rules for MMP generation reduced the number of MMPs that were obtained by more than half (RECAP-MMP numbers are reported in Table 4). However, perhaps surprisingly, large numbers of RECAP-MMPs remained for further consideration and assessment of synthetic feasibility. From the Ki and IC50 subsets, nearly 170,000 and more than 240,000 RECAP-MMPs were obtained with activity against 371 and 778 targets, respectively. Examples are shown in Figure 3.

Table 4.

RECAP-MMP statistics.

Number of	Ki	IC50
RECAP-MMPs	169,889	240,322
Targets with RECAP-MMPs	371	778
RECAP-MMP compounds	28,529 (62.9%)	53,917 (56.3%)

For the Ki and IC50 subsets, the number of RECAP-MMPs, the number of targets for which RECAP-MMPs were obtained, and the number (and ratio) of compounds that formed RECAP-MMPs are reported.
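The reduction in MMP numbers upon switching to retrosynthetic rules can be checked against the counts in Tables 2 and 4. The variable and function names below are illustrative; the counts are copied from the tables.

```python
# Standard MMP counts (Table 2) and RECAP-MMP counts (Table 4).
standard_mmps = {"Ki": 385_777, "IC50": 537_848}
recap_mmps = {"Ki": 169_889, "IC50": 240_322}


def reduction_percent(subset: str) -> float:
    """Percentage by which the number of MMPs drops for a given subset."""
    return round(100 * (1 - recap_mmps[subset] / standard_mmps[subset]), 1)


for subset in standard_mmps:
    print(subset, reduction_percent(subset))
```

Both subsets show a reduction of roughly 55 to 56%, consistent with the statement that the number of MMPs was reduced by more than half.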

Data availability

All MMP-cliffs, SAR transfer series, and RECAP-MMPs are provided in canonical SMILES representation [18] on a per-target basis, separately for the Ki and IC50 subsets. The canonical SMILES representation of compounds was calculated using the Molecular Operating Environment [19] on the basis of standardized molecular structures, obtained by removing solvents or ions and rebalancing protonation states. Furthermore, the canonical SMILES representation of key fragments (cores) and chemical transformations derived from MMPs and RECAP-MMPs was generated using the OpenEye toolkit [17].

ZENODO: Detailed data sets of MMP-cliffs, SAR transfer series, RECAP-MMPs and compound activities, doi: 10.5281/zenodo.8418 [20].

Summary

We have described new and up-to-date MMP-based data sets comprising activity cliffs, SAR transfer series, and second generation retrosynthetic MMPs that have been systematically generated from currently available public domain compounds with high-confidence activity data. Hence, these data sets are comprehensive and have broad target coverage. They are made available without restrictions to the scientific community to aid in SAR analysis, compound design, and other medicinal chemistry applications. It is hoped that these data sets will be of interest and use to many investigators in this field and catalyse further research efforts.

Referee reports

This revised version looks good. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Hu et al. have compiled a useful set of matched pair datasets based on the ChEMBL database of biological activity. They describe in a straightforward manner the derivation of the datasets and basic concepts relevant for matched pairs. The following suggested modifications to the data provided would enhance their usefulness and completeness:

As a result of switching the method for generating MMPs from cleavage of single bonds to a RECAP-based method, the pool of MMPs now includes substitutions of internal fragments (e.g. in Figure 3c) as well as substitution of a terminal R-group (as in Figure 3 examples a, b, and d). Although both types of MMPs involve replacement of a single structural fragment, it may be desirable for many applications to distinguish between core scaffold replacement and R-group variation. It would therefore be helpful to annotate the datasets to easily separate these two classes of MMPs.

Since the authors filter out IC50s/Kis of indeterminate values, it is unclear how compounds that were clearly inactive were processed. Were compounds with IC50s/Kis that could not be quantified due to a flat dose response curve included in the datasets?

The authors present a filtered dataset where a number of factors have contributed to rejection of potential MMPs, namely: the difference in size of the exchanged fragments was limited to 8 heavy atoms; the ratio of the common core fragment to the size of each exchanged fragment had to be > 2; and the exchanged fragment could have a maximum of 13 heavy atoms. While these are reasonable filters to obtain MMPs that truly represent small structural changes, the cutoffs selected are arbitrary and for some targets may exclude MMPs that another user might consider relevant. Rather than providing the final filtered dataset, it would be helpful if the authors would provide the full original datasets with the values of the features used for filtering annotated as extra columns. This would allow maximal flexibility in designing custom MMP sets.

In the files that list the RECAP MMPs, key fields are missing that would require the user to retrieve the relevant data from ChEMBL in order to perform any analysis: (a) the target name (only the target ChEMBL ID is provided); and (b) more importantly, the compound activities are not included.

In the files that list the transfer series, for each matched pair the authors provide the two series cores and full compound SMILES, but not the substituted fragments.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

This paper provides a review of the Bajorath group's recent work on matched molecular pairs (MMP), a technique for exploring structure-activity relationships and identifying chemical transformations that can readily modulate biological activity. The authors focus on recent applications of MMP to large datasets from the publicly accessible ChEMBL database. The paper provides an excellent introduction for those unfamiliar with the MMP technique and with concepts such as activity cliffs.

In addition to providing an overview of the recent literature, the authors also provide links to publicly available software and datasets that will provide tutorial materials for those interested in learning more about these powerful techniques. Datasets and software like those described in this paper are valuable resources. A logical next step from this work would be to create interactive tutorials using tools like the iPython Notebook Viewer or knitr.

The presentation is clear, but a few changes may help readers unfamiliar with some of the concepts. On p2 the authors refer to "second generation MMPs". It would be useful to add a sentence explaining the differences between first and second generation MMPs. MMP-cliffs that differ in activity by 10-fold may also be interesting. It would be informative to see the number of examples available with a 1 log vs 2 log difference. In section (3) on page 3, it would be interesting to provide a specific example of how the results of RECAP-generated MMPs differ from those generated using a more "traditional" approach.

This paper provides an excellent gateway to a topic that is becoming increasingly important in drug discovery. The paper should be of interest to computational and medicinal chemists as well as biologists. I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

The matched molecule pairs approach provides a “chemistry friendly” and intuitive way of expressing relationships among molecules and therefore this manuscript is of importance to all cheminformatics scientists interested in the study of activity cliffs, SAR analysis and in the design of bioactive molecules in general. The authors extracted several datasets of matched molecule pairs from public databases that could be used as benchmarks in further analysis of this phenomenon. Care has been taken to assure a high quality of data.

The manuscript is well written and the procedure and all results are sufficiently documented and, in addition, all datasets are available for download; therefore I am suggesting only a few minor modifications to the text:

Introduction: replace "the latest release of the ChEMBL" with the version number.

Provide a bit more technical information about the in-house implementation of the molecule fragmentation procedure used to generate matched pairs. Was the procedure implemented entirely in-house, or is it based on a publicly available cheminformatics toolkit? (If this is the case, please cite the respective toolkit.)

The authors mention that structures in the download file are available as canonical SMILES. The form of canonical SMILES, however, will depend on the particular program used to generate it. Please specify whether the original ChEMBL SMILES is included or the canonical SMILES was created by another toolkit.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

The data set described by Hu et al. is a large set of carefully curated small molecule matched molecular pairs (MMPs) with high-quality activity data derived from ChEMBL. The set includes examples of structure-activity cliffs, as well as matched SAR transfer series, both of which are important in the development and validation of activity prediction algorithms. The availability of the MMP data set will be very valuable to researchers who are focused on methods development. The data should also be of interest to those interested in fundamental questions about molecular activity (e.g. questions about the independence and additivity of activity changes that are linked with substituent changes). I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

13 in total

1. Exploring activity cliffs in medicinal chemistry.

Authors: Dagmar Stumpfe; Jürgen Bajorath
Journal: J Med Chem Date: 2012-01-27 Impact factor: 7.446

2. MMP-Cliffs: systematic identification of activity cliffs on the basis of matched molecular pairs.

Authors: Xiaoying Hu; Ye Hu; Martin Vogt; Dagmar Stumpfe; Jürgen Bajorath
Journal: J Chem Inf Model Date: 2012-04-17 Impact factor: 4.956

3. Matched molecular pairs as a medicinal chemistry tool.

Authors: Ed Griffen; Andrew G Leach; Graeme R Robb; Daniel J Warner
Journal: J Med Chem Date: 2011-09-22 Impact factor: 7.446

4. Recent progress in understanding activity cliffs and their utility in medicinal chemistry.

Authors: Dagmar Stumpfe; Ye Hu; Dilyana Dimova; Jürgen Bajorath
Journal: J Med Chem Date: 2013-09-13 Impact factor: 7.446

5. RECAP--retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry.

Authors: X Q Lewell; D B Judd; S P Watson; M M Hann
Journal: J Chem Inf Comput Sci Date: 1998 May-Jun

6. Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets.

Authors: Jameed Hussain; Ceara Rea
Journal: J Chem Inf Model Date: 2010-03-22 Impact factor: 4.956

7. Systematic assessment of compound series with SAR transfer potential.

Authors: Bijun Zhang; Anne Mai Wassermann; Martin Vogt; Jürgen Bajorath
Journal: J Chem Inf Model Date: 2012-12-06 Impact factor: 4.956

Review 8. Matched molecular pair analysis in drug discovery.

Authors: Alexander G Dossetter; Edward J Griffen; Andrew G Leach
Journal: Drug Discov Today Date: 2013-04-02 Impact factor: 7.851

9. ChEMBL: a large-scale bioactivity database for drug discovery.

Authors: Anna Gaulton; Louisa J Bellis; A Patricia Bento; Jon Chambers; Mark Davies; Anne Hersey; Yvonne Light; Shaun McGlinchey; David Michalovich; Bissan Al-Lazikani; John P Overington
Journal: Nucleic Acids Res Date: 2011-09-23 Impact factor: 16.971

10. Freely available compound data sets and software tools for chemoinformatics and computational medicinal chemistry applications.

Authors: Ye Hu; Jurgen Bajorath
Journal: F1000Res Date: 2012-08-14
