Huabin Hu1, Dagmar Stumpfe1, Jürgen Bajorath1. 1. Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany.
Abstract
Activity cliffs are formed by structurally analogous compounds with large potency variations and are highly relevant for the exploration of discontinuous structure-activity relationships and compound optimization. So far, activity cliffs have mostly been studied on a case-by-case basis or assessed by global statistical analysis. Different from previous investigations, we report a large-scale analysis of activity cliff formation with a strong focus on individual compound activity classes (target sets). Compound potency distributions were systematically analyzed and categorized, and structural relationships were dissected and visualized on a per-set basis. Our study uncovered target set-dependent interplay of potency distributions and structural relationships and revealed the presence of activity cliffs and origins of cliff formation in different structure-activity relationship environments.
Activity cliffs are formed by structurally analogous compounds with large potency variations and are highly relevant for the exploration of discontinuous structure-activity relationships and compound optimization. So far, activity cliffs have mostly been studied on a case-by-case basis or assessed by global statistical analysis. Different from previous investigations, we report a large-scale analysis of activity cliff formation with a strong focus on individual compound activity classes (target sets). Compound potency distributions were systematically analyzed and categorized, and structural relationships were dissected and visualized on a per-set basis. Our study uncovered target set-dependent interplay of potency distributions and structural relationships and revealed the presence of activity cliffs and origins of cliff formation in different structure-activity relationship environments.
Activity cliffs are formed by structurally similar (analogous)
active compounds with large differences in potency.[1−4] Because activity cliffs represent
small chemical changes having large biological activity effects, they
embody the pinnacle of structure–activity relationship (SAR)
discontinuity,[3] which is detrimental for
quantitative SAR predictions.[2] However,
discontinuous SARs and activity cliffs often reveal SAR determinants,
especially when encountered during early stages of compound optimization,
and thus provide viable information for medicinal chemistry.[3,4]For a consistent assessment of activity cliffs, similarity
and
potency difference criteria must be clearly defined.[3] On the basis of globally assessed potency range distributions
of pairs of active analogues, an at least 100-fold difference in potency
(on the basis of equilibrium constants, if available) has been proposed
and frequently been used as an activity cliff criterion.[4,5] The definition of activity cliffs also depends on the molecular
representations and similarity measures that are used.[4,6] Compound similarity for activity cliff definition can be quantified
in different ways, for example, by calculating Tanimoto similarity
on the basis of molecular fingerprint representations or by applying
substructure-based similarity criteria.[3,4] Numerical similarity
measures, such as the Tanimoto coefficient, yield a continuum of values,
and a threshold must be set for defining activity cliffs. By contrast,
substructure-based methods produce a binary readout, for example,
two compounds share the same core structure—and are classified
as similar—or they do not. In addition to comparing molecular
graph-based (two-dimensional) representations, activity cliffs have
also been determined in three dimensions by calculating the similarity
of experimental compound binding modes taken from complex X-ray structures.[7]For graph-based activity cliff definition,
substructure similarity
assessment is—in our experience—generally more consistent
than numerical similarity calculations and often easier to interpret
from a chemical perspective.[4] Among substructure-based
approaches, the matched molecular pair (MMP) concept[8,9] is particularly attractive for activity cliff definition. An MMP
is defined as a pair of compounds that are only distinguished by a
chemical modification at a single site.[8] This modification corresponds to the exchange of a pair of substructures,[8,9] which is termed a chemical transformation.[9] By introducing appropriate transformation size restrictions, the
formation of MMPs can be limited to structural analogues typically
generated during compound optimization.[10] Applying this similarity criterion yields a structurally conservative
and chemically intuitive definition of activity cliffs.[4,10] Moreover, transformation size-restricted MMPs can be efficiently
generated algorithmically,[9,10] hence enabling large-scale
analysis of activity cliff populations.In light of these considerations,
our preferred activity cliff
definition encompasses the formation of a transformation size-restricted
MMP by two compounds sharing the same biological activity that have
an at least 100-fold difference in potency.[4,10] Whenever
possible, potency differences are determined on the basis of (assay-independent)
equilibrium constants. The so-defined activity cliffs have been termed
MMP-cliffs.[10]The definition of activity
cliffs is focused on compound pairs
and hence accounts for pairwise relationships. However, activity cliffs
in compound data sets are mostly not formed by isolated compound pairs
(i.e., pairs without structural neighbors forming additional activity
cliffs). Rather, the vast majority of activity cliffs are formed in
a coordinated manner by groups of structurally related compounds with
large potency variations, meaning that individual compounds are involved
in the formation of multiple activity cliffs with different analogues.[11,12] In activity cliff networks where nodes represent compounds and edges
pairwise activity cliffs, compound subsets forming coordinated cliffs
give rise to the formation of disjoint clusters.[12] These activity cliff clusters are a rich source of SAR
information and much more informative than cliffs considered as isolated.[13] More than 95% of MMP-cliffs detected across
different data sets were formed in a coordinated manner.[14] In activity cliff networks, clusters often include
“hubs,” that are, nodes representing molecules that
are centers of local activity cliff formation with multiple partner
compounds. Such molecules have also been termed “activity cliff
generators.”[15,16]In addition to activity
cliff coordination, the frequency with
which activity cliffs occur across different data sets has been determined.[5,14] There has been substantial growth in activity cliff information
over time. For example, from June 2011 until January 2015, the number
of MMP-cliffs originating from the ChEMBL database,[17] the major public repository of compounds and activity data
from medicinal chemistry sources, nearly doubled; with a total of
more than 17 000 MMP-cliffs available at the beginning of 2015.[14] In addition, the target coverage of MMP-cliffs
increased from about 200 to 300 individual target proteins over this
period of time. However, despite this strong growth, the proportion
of bioactive compounds involved in the formation of MMP-cliffs across
different compound data sets remained essentially constant at close
to 23%.[14]So far, activity cliffs
have been studied in exemplary compound
sets on a case-by-case basis or surveyed by global statistical analysis.[5,14] In addition, cliff populations have been organized and visualized
in network representations.[12,13] However, what has not
been attempted thus far is systematically exploring and comparing
activity cliff formation in different compound activity classes (also
called target sets). To these ends, we have analyzed in detail potency
distributions and structural relationships between compounds in many
different target sets, studied how activity cliffs were formed, and
determined the differences between sets. Hence, the focus of our current
study has been on details of activity cliff arrangements in individual
compound sets rather than on global statistical exploration. Our analysis
revealed many characteristic differences in activity cliff formation
between target sets.
Materials and Methods
Activity Cliff Definition
For our
current analysis, we introduced a modification of our preferred MMP-cliff
definition stated above.[4,10] For MMP generation,
standard random fragmentation of exocyclic single bonds[9] was replaced by fragmentation according to retrosynthetic
(RECAP) rules,[18] yielding (transformation
size-restricted) RECAP-MMPs (RMMPs).[19] Retrosynthetic
MMPs were generated to further increase the chemical relevance (synthetic
accessibility) of compound pairs, forming cliffs. Accordingly, the
formation of an RMMP was used as a similarity criterion for activity
cliffs, and an at least 100-fold difference in potency between RMMP
compounds was required, as before. The so-defined activity cliffs
are referred to as RMMP-cliffs.
Compounds
and Activity Data
Bioactive
compounds with high-confidence activity data were assembled from ChEMBL
version 23.[17] The following selection criteria
were applied: First, only compounds involved in direct interactions
(type “D”) with human targets at the highest confidence
level (assay confidence score 9) were selected. Second, only numerically
specified equilibrium constants (Ki values)
were considered as potency measurements. Equilibrium constants were
reported as pKi values. On the basis of
these selection criteria, a total of 71 967 unique compounds were
obtained with activity against a total of 904 targets. Accordingly,
these compounds were organized into 904 target sets.
RMMP Analysis
RMMPs were systematically
generated for all target sets, yielding 354 094 target set-based RMMPs
(243 110 unique RMMPs) that were formed by 46 977 compounds from 574
target sets. For the subsequent analysis, only target sets that contained
at least 100 RMMPs were retained, which resulted in 237 sets yielding
a total of 347 025 target-based RMMPs (238 795 unique RMMPs) formed
by 44 451 compounds.For each
target set, an RMMP network was generated in which nodes represented
compounds and edges pairwise RMMP relationships. In this network,
each separate RMMP cluster represented a unique series of analogues.
RMMP networks were also used to represent RMMP-cliffs by highlighting
edges that represented both RMMP and activity cliff relationships.
All network representations were drawn with Cytoscape.[20]
Potency Distributions
For the 237
qualifying target sets, compound potency distributions were monitored
in boxplots. On the basis of the interquartile range (IQR), that is,
the range between quartile 1 (Q1) and 3 (Q3), target sets were assigned
to three different categories, as shown in Figure : category 1 (CAT 1), IQR was smaller than
1 order of magnitude (<10-fold difference in potency); CAT 2, IQR
fell between 1 and less than 2 orders of magnitude (10- to 100-fold
difference); and CAT 3, IQR no smaller than 2 orders of magnitude
(≥100-fold difference in potency). By definition, the IQR represented
the potency range of ∼50% of the compounds in each target set.
Figure 1
Potency
distribution in target sets and categorization. The compound
potency distributions of all 237 target sets were analyzed in a boxplot
and the IQR, that is, the difference between quartile 3 and 1, was
determined. On the basis of the IQR, target sets were divided into
three different categories (CAT 1: IQR < 1; CAT 2: 1 ≤ IQR
< 2; CAT 3: IQR ≥ 2).
Potency
distribution in target sets and categorization. The compound
potency distributions of all 237 target sets were analyzed in a boxplot
and the IQR, that is, the difference between quartile 3 and 1, was
determined. On the basis of the IQR, target sets were divided into
three different categories (CAT 1: IQR < 1; CAT 2: 1 ≤ IQR
< 2; CAT 3: IQR ≥ 2).
Results and Discussion
Study
Concept
Activity cliffs have
so far mostly been studied on the basis of individual compound series
or by global statistical analysis.[3−5] Our current study was
designed to systematically investigate, for the first time, the differences
in activity cliff formation and frequency between different target
sets by relating compound potency distributions and structural relationships
to each other. Therefore, potency distributions were determined for
many different target sets, categorized, compared, and related to
intra-set analogue relationships, which were systematically determined.
Primary goals of the analysis included the assessment of differences
in activity cliff formation and frequency between different target
sets and the rationalization of such differences on the basis of potency
and structural criteria, as defined in the following. To better understand
target set-dependent activity cliff distributions, they were visualized
in network representations. Taken together, these features set our
current analysis apart from previous studies of activity cliffs in
computational and medicinal chemistry.[3,4]
Structural Relationships
Close structural
relationships between active compounds are one of the two major determinants
of activity cliffs, in addition to potency differences. RMMP (or MMP)
calculations reveal close structural relationships and identify pairs
of analogues. Importantly, however, the number of RMMPs produced by
a given target set cannot be reliably used as an indicator of structural
homogeneity. Rather, the presence or absence of multiple subsets of
analogues comprising different series strongly influences structural
heterogeneity or homogeneity, which is reflected by the cluster structure
of RMMP networks, as illustrated in Figure . Here, two target sets with similar numbers
of RMMP-forming compounds are compared. The target set on the left
was dominated by a large cluster of analogues and was thus structurally
homogeneous, whereas the set on the right contained 20 different small
clusters and 1 larger cluster and was structurally heterogeneous.
It follows that the cluster structure of RMMP networks must be carefully
considered as a prerequisite for RMMP-cliff formation.
Figure 2
Structural similarity
in target sets. For two exemplary target
sets, RMMP networks are shown in which blue nodes represent compounds
and edges pairwise RMMP relationships. Separate clusters represent
a unique series of analogues. Although the number of RMMP-forming
compounds (CPDs) was similar for both target sets, the number of clusters
differed significantly.
Structural similarity
in target sets. For two exemplary target
sets, RMMP networks are shown in which blue nodes represent compounds
and edges pairwise RMMP relationships. Separate clusters represent
a unique series of analogues. Although the number of RMMP-forming
compounds (CPDs) was similar for both target sets, the number of clusters
differed significantly.
Potency Distributions and Profiles
The likelihood of large potency differences between similar compounds
can be estimated by monitoring the potency distributions of target
sets. For our analysis, we assigned potency distributions to three
different categories (CAT 1–3) on the basis of boxplot-derived
IQR values, as shown in Figure . CAT 1, 2, and 3 comprised 25, 169, and 43 target sets, respectively.
Hence, the majority of target sets fell into CAT 2 whose IQR spanned
1 to 2 orders of magnitude in potency and thus delineated an activity
cliff-relevant range, which was further expanded by CAT 3. These observations
supported our categorization of potency distributions. Accordingly,
potency distributions became increasingly variable from CAT 1 to 3,
as revealed by the potency distribution profiles in Figure . The CAT 1 profiles in Figure a reflect narrow
potency distributions on the basis of which activity cliff formation
is unlikely. By contrast, the CAT 2 profiles in Figure b and, especially, CAT 3 profiles in Figure c reveal large potency
variations between structural analogues, resulting in a principally
high propensity of activity cliffs.
Figure 3
Potency distribution profiles. Shown are
exemplary potency distribution
profiles for target sets belonging to different categories [(a), CAT
1; (b), CAT 2; (c), CAT 3] according to Figure . Black dots represent RMMP compounds and
red dots singletons not participating in RMMPs.
Potency distribution profiles. Shown are
exemplary potency distribution
profiles for target sets belonging to different categories [(a), CAT
1; (b), CAT 2; (c), CAT 3] according to Figure . Black dots represent RMMP compounds and
red dots singletons not participating in RMMPs.
RMMP-Cliffs
In 207 of the 237 qualifying
targets sets, RMMP-cliffs were identified, amounting to a total of
11 834 cliffs. Table reports that the number of RMMP-cliffs increased over target sets
of CAT 1, 2, and 3, with on average 2, 52, and 69 cliffs per set,
respectively. Thus, there was a general trend of increasing number
of RMMP-cliffs with increasing variability of potency distributions.
The very small number of RMMP-cliffs for CAT 1 sets was directly attributable
to the narrow potency distributions characterizing this category. Table reports that the
48 target sets containing 50 to a maximum of 820 RMMP-cliffs exclusively
belonged to CAT 2 and CAT 3 that had activity cliff-relevant IQR values.
By contrast, target sets with less than 50 RMMP-cliffs were found
in all 3 categories. Figure shows that the majority of target sets with large number
of 100 or more RMMP-cliffs belonged to CAT 2, which was due to the
large number of 169 target sets in this category compared to only
43 sets in CAT 3. A systematic increase in the number of activity
cliffs with increasing IQR values was not observed.
Table 1
Target Set Statisticsa
CAT
# target sets
# clusters (mean)
# RMMP-cliffs (mean)
1
25
10
2
2
169
54
52
3
43
37
69
For each target set category (CAT),
the number (#) of target sets, mean number of RMMP clusters per set,
[# clusters (mean)], and mean number of RMMP-cliffs are reported.
Table 2
RMMP-Cliff Distributiona
# RMMP-cliffs (range)
# target sets
CATs
0
30
1, 2, 3
[1, 10)
77
1, 2, 3
[10, 20)
33
1, 2, 3
[20, 50)
49
1, 2, 3
[50, 100)
20
2, 3
[100, 500)
25
2, 3
[500, 820]
3
2, 3
For different ranges of RMMP-cliffs,
the number of target sets (# targets) and categories (CATs) they belong
to are reported.
Figure 4
RMMP-cliffs vs IQR values.
For each of the 237 target sets, the
number of RMMP-cliffs (y-axis) is plotted against
increasing IQR values (x-axis). Red vertical lines
separate target sets belonging to CAT 1, 2, and 3.
RMMP-cliffs vs IQR values.
For each of the 237 target sets, the
number of RMMP-cliffs (y-axis) is plotted against
increasing IQR values (x-axis). Red vertical lines
separate target sets belonging to CAT 1, 2, and 3.For each target set category (CAT),
the number (#) of target sets, mean number of RMMP clusters per set,
[# clusters (mean)], and mean number of RMMP-cliffs are reported.For different ranges of RMMP-cliffs,
the number of target sets (# targets) and categories (CATs) they belong
to are reported.However,
despite these general trends, the propensity to form RMMP-cliffs
could not solely be attributed to the variability and spread of potency
distributions. Rather, as further discussed below, potency distributions
in target sets must be viewed in combination with RMMP networks and
their cluster structure. Table also reports that target sets in CAT 1, 2, and 3 contained
on average 10, 54, and 37 RMMP clusters, respectively. Thus, CAT 2
and CAT 3 sets contained large number of clusters (analogue series)
whose local potency distributions strongly influenced RMMP-cliff formation.
Interplay of Potency Patterns and Structural
Relationships
The 207 target sets containing RMMP-cliffs
were individually examined to evaluate potency distribution profiles
and RMMP networks in context and rationalize why RMMP-cliffs were
formed with different frequencies. The analysis revealed a number
of characteristic features determining cliff formation that are summarized
in Figure by comparing
exemplary target sets. Figure a (top) shows a set of phosphodiesterase 3A inhibitors with
a flat CAT 1 potency distribution profile, which prohibited RMMP-cliff
formation, despite the presence of two analogue series with in-part
extensive RMMP relationships. In addition, Figure a (bottom) displays somatostatin receptor
5 ligands with a variable CAT 2 distribution and more than 100 RMMP-forming
compounds. Although cliff formation was more likely in this case,
the target set did not contain any RMMP-cliffs either. This was a
direct consequence of a heterogeneous cluster structure and local
potency distributions over different subsets of analogues forming
16 clusters, as revealed by the RMMP network of this set.
Figure 5
Differences
in RMMP-cliff formation. In (a–c), exemplary
target sets with characteristic differences in activity cliff formation
are compared, as described in the text. For each set, its potency
distribution profile and RMMP network are shown and RMMP statistics
are reported. Network nodes are colored by potency using a continuous
color spectrum from red (lowest potency in the target set) over yellow
(intermediate) to green (highest potency). If available, compounds
forming exemplary RMMP-cliffs are shown and consistently labeled in
all display items.
Differences
in RMMP-cliff formation. In (a–c), exemplary
target sets with characteristic differences in activity cliff formation
are compared, as described in the text. For each set, its potency
distribution profile and RMMP network are shown and RMMP statistics
are reported. Network nodes are colored by potency using a continuous
color spectrum from red (lowest potency in the target set) over yellow
(intermediate) to green (highest potency). If available, compounds
forming exemplary RMMP-cliffs are shown and consistently labeled in
all display items.Figure b shows
two different sets of kinase inhibitors with similar CAT 2 potency
distributions but different RMMP cluster structures that yielded 40
(top) and 27 (bottom) RMMP-cliffs, respectively. Exemplary RMMP-cliffs
are displayed. In both instances, the target sets were structurally
heterogeneous but RMMP-cliffs were formed across different clusters,
revealing high degrees of SAR discontinuity.In Figure c, sets
of anandamide amidohydrolase (top) and Bcl-X (bottom) inhibitors are
compared having CAT 2 (top) and CAT 3 (bottom) distributions, respectively.
The anandamide amidohydrolase inhibitors contained only 49 RMMP-forming
compounds. The RMMP network was dominated by a densely connected cluster
of 19 analogues that formed 79 coordinated RMMP-cliffs (exemplary
cliffs are shown). Thus, in this case, the number of RMMP-cliffs was
much larger than the number of participating analogues because of
extensive coordination of cliffs. Hence, this cluster represented
an SAR hotspot. By contrast, the Bcl-X inhibitors contained a much
larger number of 119 RMMP-forming compounds that were distributed
over 20 clusters. Although the CAT 3 potency distribution of this
target set was highly variable, the majority of compounds in individual
clusters had comparable potency, whereas the potency levels of clusters
significantly differed, giving rise to the presence of only three
RMMP-cliffs.Taken together, the results in Figure were representative of many
target sets
we studied. Analyzing the potency distribution profiles and in combination
with RMMP networks revealed the characteristic features of target
sets and clearly rationalized differences in RMMP-cliff frequency
across target sets.
Conclusions
Herein,
we have reported a systematic analysis of RMMP-cliffs in
more than 200 target sets to investigate and better understand the
origins of cliff formation and differences in the frequency of cliffs.
Our study was strongly focused on individual target sets and their
comparison. Potency distributions were determined and categorized,
and structural relationships were analyzed at the level of RMMPs and
organized in networks. Structural homogeneity of target sets and potency
distributions of increasing variability generally supported the formation
of RMMP-cliffs. However, the interplay of structural and potency relationships
determined the frequency with which RMMP-cliffs were formed, as revealed
by relating potency profiles and RMMP networks to each other and studying
local potency distributions across different RMMP clusters.The analysis scheme introduced herein reveals target set-dependent
formation of activity cliffs, provides immediate visual access to
characteristic activity cliff-relevant features of target sets, and
rationalizes differences in the frequency of cliffs across sets.