Assessing the degree to which analogue series are chemically saturated is of major relevance in compound optimization. Decisions to continue or discontinue series are typically made on the basis of subjective judgment. Currently, only very few methods are available to aid in decision making. We further investigate and extend a computational concept to quantitatively assess the progression and chemical saturation of a series. To these ends, existing analogues and virtual candidates are compared in chemical space and compound neighborhoods are systematically analyzed. A large number of analogue series from different sources are studied, and alternative chemical space representations and virtual analogues of different designs are explored. Furthermore, evolving analogue series are distinguished computationally according to different saturation levels. Taken together, our findings provide a basis for practical applications of computational saturation analysis in compound optimization.
In
medicinal chemistry, compound optimization relies on the generation
of analogues to explore structure–activity relationships (SARs)
and improve molecular properties. Chemical optimization is largely
driven by intuition and experience. The optimization process is difficult
to rationalize and formalize, and consequently, subjective criteria
dominate decision making. In particular, it is very difficult to determine
when a sufficient number of analogues have been generated and no further
progress can be expected. Series progression can be evaluated on the
basis of SAR features and/or chemical saturation. Both criteria go
hand in hand but provide somewhat different perspectives. From an
SAR viewpoint, the central question is whether or not compound potency
and other relevant properties can be further improved by generating
additional compounds. Chemical saturation, on the other hand, primarily
addresses the question whether the chemical space around active compounds
has been sufficiently covered to ensure that no potential optimization
pathways remain unexplored.
So far, only a few approaches have been introduced to evaluate
the generation of analogue series. These approaches include multiparameter
optimization (MPO) using desirability functions,[1] attrition curves,[2] or risk statistics,[3] which balance multiple compound properties. However,
MPO does not directly assess progression saturation of analogue series;
rather, it prioritizes candidate compounds with desirable property profiles.
MPO can hence be applied to indirectly evaluate series progression
by prioritizing candidate compounds with preferred property combinations
that can still be obtained. Regardless of the MPO approach that is
applied, a pool of candidates for evaluation must be generated separately.
In addition, numerical[3,4] and graphical[5] SAR analysis methods have been introduced to monitor SAR
progression of evolving compound series and evaluate whether newly
generated analogues yield further SAR information. These approaches
may also suggest preferred candidate compounds[3] or provide diagnostics for SAR landscapes of compound data sets.[4,5]
Previously, we have introduced a computational concept to more
directly and quantitatively assess progression saturation of analogue
series using virtual analogue populations that are mapped into chemical
reference space together with existing analogues.[6] The approach is focused on the assessment of chemical progression
saturation, but not SAR progression, and addresses the following questions:
Do existing analogues provide sufficient chemical space coverage?
Are virtual candidate compounds available that have a high likelihood
of activity? As discussed in detail below, the methodology requires
the use of chemical reference spaces and virtual candidate compounds
to quantify chemical saturation. The degree to which the results of
computational saturation analysis depend on such parameters is yet
to be explored. Therefore, we have further investigated and extended
the methodology for medicinal chemistry applications by analyzing
many analogue series of different compositions, exploring alternative
chemical space representations, and evaluating virtual analogues resulting from
different design strategies. In this way, critical computational parameter
settings underlying the analysis were assessed. We also show that
large series with similar numbers of analogues have different saturation
characteristics. Distinguishing between different saturation levels
relies on computational analysis, as demonstrated herein.
Materials and Methods
Methodology
Concept
Progression saturation
of the analogue series is evaluated by comparing distributions of
existing analogues and virtual candidate compounds for series expansion
in chemical space[6] applying a neighborhood
concept.[7,8] First, chemical space coverage of the analogue
series is estimated by determining the proportion of virtual candidates
falling into predefined neighborhoods of existing analogues. For this
purpose, a global saturation score is calculated, as defined below.
In this case, chemical neighborhoods of analogues are defined on the
basis of nearest-neighbor distances between virtual compounds to measure
global coverage of the chemical space. Second, the population of neighborhoods
of active analogues (active neighborhoods) is assessed by determining
virtual candidates falling into active neighborhoods. In this case, neighborhoods
are defined differently, on the basis of median distances between active
compounds in chemical space, and a local saturation score is calculated,
as also defined below. Relating global and local compound distributions
and resulting scores to each other makes it possible to evaluate progression
saturation of the analogue series. The assessment is primarily focused
on chemical progression saturation (rather than SAR progression),
addressing the questions whether chemical space coverage by existing
analogues is extensive and, in addition, whether a significant number
of virtual candidates exist that are likely to be active.
Scoring Scheme
A quantitative measure
of progression saturation is obtained by calculating two scores and
relating them to each other.[6] The raw global
saturation score is defined as the ratio of the number of virtual
candidate compounds that fall into neighborhoods of experimental analogues
relative to the total number of virtual compounds:
$$\text{raw global score}(S) = \frac{|V_{\text{Exptl}}|}{|V|}$$
Here, $S$ denotes the set of experimental analogues, $V$ denotes the set of virtual analogues,
and $V_{\text{Exptl}}$ denotes the set of virtual
compounds falling into neighborhoods of experimental analogues. The
neighborhood radius is determined as follows. For each virtual analogue,
the mean Euclidean distance to the top 1% of its nearest virtual neighbors
is calculated. The median of these distances is used as the neighborhood
radius. Accordingly, only closely related virtual candidates map to
the same neighborhood. The top 1% of nearest virtual neighbors were
selected on the basis of initial test calculations in which the percentage
of nearest virtual neighbors was systematically varied. For different
percentages, comparable scores were obtained and the top 1% were selected
to control the number of distance calculations.
The so-defined raw global saturation score, the calculation of which is illustrated
in Figure 1a, measures how extensively existing analogues cover
the chemical space sampled by virtual candidates.
Hence, the larger the score, the more virtual analogues map to neighborhoods
of experimental analogues, indicating extensive coverage. Depending
on the composition of a series, S may contain both
active and inactive or only active analogues.
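The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the study: the function names are hypothetical, compounds are assumed to be given as descriptor vectors, a virtual candidate is assumed to be "in" a neighborhood when its distance is less than or equal to the radius, and at least one nearest neighbor is always used for small candidate sets.

```python
import math
from statistics import mean, median


def euclidean(p, q):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def neighborhood_radius(virtual, top_fraction=0.01):
    """Median, over all virtual analogues, of the mean distance to the
    nearest virtual neighbors (top_fraction of the other candidates).
    Requires at least two virtual analogues."""
    # Assumption: use at least one neighbor when the set is small.
    k = max(1, round(top_fraction * (len(virtual) - 1)))
    mean_nn_distances = []
    for i, v in enumerate(virtual):
        dists = sorted(euclidean(v, w) for j, w in enumerate(virtual) if j != i)
        mean_nn_distances.append(mean(dists[:k]))
    return median(mean_nn_distances)


def raw_global_score(experimental, virtual, top_fraction=0.01):
    """Fraction of virtual analogues that lie within the neighborhood
    radius of at least one experimental analogue."""
    radius = neighborhood_radius(virtual, top_fraction)
    covered = sum(
        1 for v in virtual
        if any(euclidean(v, s) <= radius for s in experimental)  # assumed inclusive
    )
    return covered / len(virtual)


# Toy example: ten virtual candidates on a line, one unit apart.
virtual = [(float(x), 0.0) for x in range(10)]
print(raw_global_score([(0.0, 0.0), (5.0, 0.0)], virtual))
```

In the toy example the radius equals the unit spacing, so each experimental analogue covers itself and its immediate neighbors on the line.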
Figure 1
Saturation scores and
categorized combinations. (a, b) Calculation
of the raw global and local saturation scores is illustrated, respectively.
The coordinate system represents a chemical reference space containing
an analogue series and virtual candidates. “D” stands
for descriptor. Each descriptor adds a dimension to the space. (c)
Combinations of global and local saturation scores are categorized
as indicators of different progression saturation stages. Figure panels
were adapted from ref (6).
The raw local saturation score
is defined as the ratio of the number
of active analogues relative to the number of virtual analogues falling
into the neighborhoods of active analogues:
$$\text{raw local score}(A) = \frac{|A|}{|V_{\text{active}}| + 1}$$
Here, $A$ is the set of active compounds
and $V_{\text{active}}$ is the set of virtual candidates
in active neighborhoods. A Laplace-like correction, adding 1 to the denominator, is
applied to avoid numerical instabilities when $|V_{\text{active}}|$ is small or 0. For calculating the raw
local saturation score, the neighborhood radius of each active analogue
is set to the median value of pairwise distances between active analogues.
Hence, the size of the so-defined neighborhood accounts for typically
observed distances between active analogues in chemical space. The
raw local saturation score, the calculation of which is illustrated
in Figure 1b, measures
the distribution of virtual analogues around active analogues. The
larger the score, the less populated are the neighborhoods of active
analogues with virtual candidates.
The raw global scores and logarithmically transformed raw local saturation
scores are converted into conventional Z-scores on
the basis of the mean and standard deviation of the score population
of large sets of analogue series. That is, the mean of the score
population is subtracted from each score and the difference is
divided by the standard deviation of the distribution, yielding the Z-score. Accordingly, the resulting Z-scores
have a mean of 0 and standard deviation of 1. Global and local scores
generally display low correlation.[6]
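The local score and the Z-score conversion can be sketched as follows. This is an illustrative sketch under stated assumptions, not the original code: function names are hypothetical, the neighborhood boundary is treated as inclusive, the natural logarithm is assumed for the transformation (the base is not specified in the text), and the population standard deviation is assumed for standardization.

```python
import math
from statistics import mean, median, pstdev


def euclidean(p, q):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def raw_local_score(actives, virtual):
    """Number of actives divided by (number of virtual candidates in
    active neighborhoods + 1). Requires at least two active analogues."""
    pairwise = [
        euclidean(a, b)
        for i, a in enumerate(actives)
        for b in actives[i + 1:]
    ]
    radius = median(pairwise)  # median pairwise distance between actives
    n_virtual_active = sum(
        1 for v in virtual
        if any(euclidean(v, a) <= radius for a in actives)  # assumed inclusive
    )
    return len(actives) / (n_virtual_active + 1)  # +1: Laplace-like correction


def z_scores(values):
    """Standardize a population of scores to zero mean and unit deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    return [(x - mu) / sigma for x in values]


# Toy example: three actives on a line, two of three candidates nearby.
actives = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
candidates = [(1.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
print(raw_local_score(actives, candidates))

# Local scores of several series are log-transformed before standardization.
print(z_scores([math.log(s) for s in (1.0, 2.0, 4.0)]))
```

Because Z-scores are computed over a population of series, the standardized score of one series depends on the other series it is analyzed with.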
Score Combinations
Combinations of global and local saturation scores can be divided into four categories
that characterize different levels of progression saturation,[6] as schematically shown in Figure 1c:
(1) Category low/high (low global and high local scores, upper left quadrant) characterizes series that have low
analogue coverage of chemical reference space and only few virtual
candidates in active neighborhoods. Hence, these series are only little
explored and thus rationalized as early-stage series.
(2) Category low/low (lower left quadrant) describes series with low chemical
space coverage by experimental analogues but with many virtual candidates
located in active neighborhoods. Such series are more advanced chemically
and rationalized as mid-stage series.
(3) Category high/low (lower right quadrant) identifies series with more extensive coverage by experimental
analogues and many virtual candidates in active neighborhoods.
Such series are best characterized as late-stage series, which approach
saturation.
(4) Category high/high (upper right quadrant) characterizes
series with extensive analogue coverage but only few remaining virtual
candidates in active neighborhoods, thus indicating a high level of
chemical saturation (saturated series).
Threshold values for categorization of Z-score combinations are set to
1, i.e., one standard deviation (σ) above the mean of the fitted
normal distribution of Z-scores for sets of analogue
series.
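The four categories amount to a simple quadrant rule on the two Z-scores. The following sketch makes the rule explicit; the function name is hypothetical, and treating a score exactly equal to the threshold as "high" is an assumption not stated in the text.

```python
def categorize(global_z, local_z, threshold=1.0):
    """Map a (global, local) Z-score pair to a category and progression stage."""
    high_global = global_z >= threshold  # assumption: threshold is inclusive
    high_local = local_z >= threshold
    if not high_global and high_local:
        return ("low/high", "early-stage")
    if not high_global and not high_local:
        return ("low/low", "mid-stage")
    if high_global and not high_local:
        return ("high/low", "late-stage")
    return ("high/high", "saturated")


print(categorize(0.2, -0.3))
print(categorize(1.4, 0.1))
```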
Analogue Series
Analogue series were
assembled using a computational method that identifies analogue series
in compound data sets of any composition.[9] This method makes use of the matched molecular pair (MMP) concept.[10] An MMP is defined as a pair of compounds that
are distinguished by a chemical modification at only a single site.[10] This modification involves the exchange of a
pair of substructures, which is termed a transformation.[11] For our study, MMPs were generated by single-cut
fragmentation of exocyclic single bonds[11] on the basis of retrosynthetic rules,[12] yielding RECAP-MMPs.[13]
Compounds sharing the same RECAP-MMP core form a matching molecular series (MMS),[14] which represents an analogue series with a single
substitution site.[9] By contrast, different
MMSs sharing analogues form a series with multiple substitution sites.[9] In this case, substitution sites of the corresponding
RECAP-MMP cores are transferred to shared analogues from which an
analogue-series-based (ASB) scaffold[15,16] with multiple
substitution sites[16] is extracted. This
ASB scaffold then represents an analogue series with multiple substitution
sites.
Two sets of analogue series were extracted from screening data
and medicinal chemistry sources, respectively.
Series of Screening Compounds
A set of 80 series containing active and inactive analogues was extracted
from the PubChem Bioassay database.[17] The
series were required to have single substitution sites, consist of
at least 30 analogues tested in the same assay, and include at least
three active compounds. They contained a total of 1618 compounds and
covered 25 biochemical assays and 23 unique targets. These series
were used in the proof-of-concept study introducing computational
progression saturation analysis.[6] They
are termed PubChem series in the following.
Series from Medicinal Chemistry
A set of 64 analogue series yielding ASB scaffolds with multiple
substitution sites were extracted from ChEMBL (release 23)[18] on the basis of high-confidence activity data.
Accordingly, only compounds with direct interactions (type “D”)
with human targets at the highest assay confidence level (confidence
score 9) were selected. As potency measurements, only specified equilibrium
constants (Ki values) or IC50 values were considered. These analogue series, termed ChEMBL series
in the following, were required to consist of 10–30 active
analogues. They contained a total of 1422 compounds and covered 62
unique targets.
In addition, three large series with 100 or
more analogues were extracted from ChEMBL to model evolving series
and analyze their saturation characteristics.
Compositions of all analogue series, their structures, and targets
are reported in Table S1 in the Supporting
Information. Figure 2 shows exemplary compounds from representative series with associated
activity and target information together with virtual analogues. Because
our approach is designed to assess chemical saturation of analogue
series, rather than SAR progression, potency and target information
for analogues is not of primary relevance. As discussed, compound
potency is not a computational parameter here.
Figure 2
Representative analogue
series. Exemplary compounds of series from
(a) PubChem and (b) ChEMBL are shown with associated target and potency
information and corresponding virtual analogues (red). In addition,
the composition of each series is reported and its ASB scaffold is
shown.
Chemical Reference Spaces
Two overlapping chemical reference spaces were generated using descriptor subsets
of different designs and information contents. Because chemical reference
space is a variable for saturation analysis, investigating the influence
of alternative chemical space representations is an important aspect
of our study. The first space representation was 7-dimensional
and formed by intuitive molecular descriptors including molecular
weight, the number of hydrogen-bond donor and acceptor atoms, the
number of rotatable bonds, the logarithm of the octanol/water partition
coefficient (log P), aqueous solubility, and
topological polar surface area. These descriptors accounted for chemical
features known to be relevant to ligand–target interactions.
In addition, a 14-dimensional reference space was generated by adding
seven more abstract two-dimensional descriptors with little pairwise
correlation to the initial set. Selected descriptors (Table 1) were calculated with the Molecular
Operating Environment (MOE)[19] and scaled
to zero mean and unit variance.
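Scaling to zero mean and unit variance puts descriptors with very different value ranges (for example, molecular weight and hydrogen-bond donor counts) on a common footing before distances are calculated. A minimal sketch is given below; the function name is hypothetical, and the choice of which compounds define the mean and variance (here, all rows passed in) is an assumption, as is the use of the population standard deviation.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Scale each descriptor (column) to zero mean and unit variance.
    rows: list of descriptor vectors, one per compound."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # assumption: population std
    return [
        # Constant descriptors carry no information and are mapped to 0.
        [(x - mu) / sigma if sigma > 0 else 0.0
         for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]


# Toy example with two descriptors on very different scales.
scaled = standardize_columns([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
print(scaled)
```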
Table 1
Descriptors for Chemical Reference Spaces(a)

| set | descriptor name | description |
|---|---|---|
| set 1 | a_acc | number of H-bond acceptor atoms |
| set 1 | a_don | number of H-bond donor atoms |
| set 1 | b_1rotN | number of rotatable single bonds |
| set 1 | log P (o/w) | log octanol/water partition coefficient |
| set 1 | logs | log solubility in water |
| set 1 | TPSA | topological polar surface area |
| set 1 | weight | molecular weight |
| set 2 | petitjeanSC | topological shape index (diameter – radius)/radius |
| set 2 | rsynth | synthetic feasibility based on retrosynthetic rules |
| set 2 | PEOE_VSA_FNEG | fractional negative van der Waals surface area |
| set 2 | balabanJ | topological index (Balaban distance connectivity) |
| set 2 | PEOE_RPC+ | relative positive partial charge |
| set 2 | PEOE_VSA_FPPOS | fractional polar positive van der Waals surface area |
| set 2 | a_nN | number of nitrogen atoms |

(a) Descriptors used for the design
of chemical reference spaces are described according to the Molecular
Operating Environment with which they were calculated. Set 1 constitutes
a seven-dimensional reference space, and sets (1 + 2) form a 14-dimensional
space.
Virtual Analogues
Two conceptually different strategies were applied to generate virtual analogues for
series: a transformation-based and a matrix-based approach. Virtual
analogues served as candidates for series progression.
Transformation-Based Virtual Candidates
Systematic
RECAP-MMP fragmentation, as described above, was applied
to ChEMBL compounds with high-confidence activity data to sample chemical
transformations from which individual substituent fragments were extracted.
A total of 13 203 unique substituents were obtained. These
substituents were systematically recombined with the RECAP-MMP core
of each screening analogue series, yielding a constant number of 13 203
virtual analogues per series. This strategy was applied in our initial
study.[6] In addition, the substituent pool
was recombined with ASB scaffolds representing medicinal chemistry
series with multiple substitution sites to randomly sample the same
number of virtual candidates per series. Transformation-based virtual
analogues were generated with Python scripts aided by the OpenEye
toolkit.[20]
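The two enumeration modes described above (exhaustive recombination for single-site cores, random sampling for multi-site scaffolds) can be illustrated with a purely combinatorial sketch. It does not build molecules and does not use the OpenEye toolkit; cores, scaffolds, and substituents are placeholder strings, and all function names are hypothetical.

```python
import random


def enumerate_single_site(core, substituents):
    """Combine a single-site core with every substituent in the pool."""
    return [(core, (r,)) for r in substituents]


def sample_multi_site(scaffold, n_sites, substituents, n_samples, seed=0):
    """Randomly sample distinct substituent combinations for a scaffold
    with n_sites substitution sites (assumption: sites are independent)."""
    pool = list(dict.fromkeys(substituents))  # drop duplicate substituents
    target = min(n_samples, len(pool) ** n_sites)  # cannot exceed all combos
    rng = random.Random(seed)
    combos = set()
    while len(combos) < target:
        combos.add(tuple(rng.choice(pool) for _ in range(n_sites)))
    return [(scaffold, combo) for combo in sorted(combos)]


pool = ["F", "Cl", "Br", "OMe"]  # placeholder substituent labels
print(len(enumerate_single_site("core_A", pool)))
print(len(sample_multi_site("scaffold_B", 2, pool, 10)))
```

With a pool of 13 203 substituents, the single-site case yields exactly one virtual analogue per substituent, whereas the multi-site case must be sampled because the number of combinations grows exponentially with the number of sites.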
Matrix-Based Virtual Candidates
Furthermore, the SAR matrix (SARM) data
structure[21,22] was used as a source of virtual analogues
for medicinal chemistry
series. SARMs are obtained from compound sets through systematic two-step
MMP fragmentation and identify all subsets that have structurally
analogous cores, i.e., core structures that are distinguished by a
structural modification only at a single site.[21] Each subset of analogue series with structurally related
cores is represented in a single SARM that is reminiscent of a standard
R-group table. A matrix cell represents a unique combination of a
core and substituent. Analogue series in SARMs typically contain different
substituents. Hence, the systematic recombination of structurally
related cores obtained from the second round of fragmentation and
substituents from the first round reproduces all existing analogues
and, in addition, generates virtual analogues representing as yet
unexplored core–substituent combinations.[21]
By design, structurally related cores and SARMs can
be obtained only from analogue series with multiple substitution sites.
Such series typically yield multiple SARMs. Depending on the number
of structurally related cores and substituents resulting from two-step
fragmentation, a series-specific number of virtual candidates is obtained.
A major difference between the transformation- and matrix-based
virtual analogue generation approaches is that matrix-based candidates
are generally more closely related to existing analogues than transformation-based
candidates, for which a diverse array of possible substituents is
available. The distribution of virtual analogues from SARMs can be
rationalized as an envelope in chemical space formed around an existing
series.[22]
The 64 ChEMBL series with 10–30 analogues yielded between
101 and 501 matrix-based virtual candidates per series. In each case,
the same number of transformation-based virtual analogues was generated
through random sampling. In addition, for direct comparison with PubChem
series, a constant number of 13 203 transformation-based virtual
analogues per ChEMBL series was also generated through random sampling.
The same number of transformation-based virtual analogues was generated
for three large ChEMBL series with more than 100 analogues.
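The matrix-based idea can be reduced to a small sketch: every empty cell of a core-by-substituent matrix is a virtual analogue. The example below operates on placeholder labels for one matrix and omits the two-step MMP fragmentation that produces cores and substituents in the first place; the function name is hypothetical.

```python
def matrix_virtual_analogues(existing):
    """Return all core-substituent combinations of one matrix that are
    not yet realized by an existing analogue.
    existing: set of (core, substituent) pairs."""
    cores = sorted({core for core, _ in existing})
    substituents = sorted({sub for _, sub in existing})
    return [
        (core, sub)
        for core in cores
        for sub in substituents
        if (core, sub) not in existing  # empty matrix cell
    ]


# Toy matrix: three related cores, three substituents, four known analogues.
existing = {("A", "F"), ("A", "Cl"), ("B", "F"), ("C", "Br")}
print(matrix_virtual_analogues(existing))
```

Because only substituents already present in the matrix are recombined, the number of virtual candidates is bounded by the matrix size, which is consistent with the series-specific candidate counts reported above.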
Results and Discussion
Progression Saturation Analysis
The scoring scheme underlying progression saturation assessment is illustrated
in Figure 1a,b. Global
and local saturation scores quantitatively account for global chemical
space coverage of analogues and the distribution of virtual candidates
across chemical neighborhoods of active analogues, respectively. These
scores yield characteristic combinations that reflect different levels
of saturation and make it possible to assign analogue series to different
progression stages according to Figure 1c. Relationships among score magnitudes, analogue distributions,
and saturation states are detailed in the Score Combinations section. We note that there is no a priori preferred
saturation level for analogue series. The methodology is designed
as a diagnostic to characterize and differentiate between different
levels of chemical saturation, which is important for practical applications.
Herein, we have systematically analyzed sets of series from different
sources. The major difference between PubChem and ChEMBL series is
that the former contained both active and inactive analogues, whereas
the latter exclusively consisted of active analogues from medicinal
chemistry publications. Different parameters were evaluated that were
expected to impact the computational assessment of progression saturation.
Z-Scores
Global
and local saturation scores must be separately calculated for different
sets of analogue series and varying parameter settings including chemical
space representations and populations of virtual candidates. For example,
for the 80 screening series with 13 203 transformation-based
virtual candidates projected into the seven-dimensional reference
space, global and local saturation Z-scores covered
the intervals [−2.1, 2.3] and [−1.3, 4.6], respectively.
Other parameter settings yielded only slightly different score distributions.
The threshold value for high global and local saturation Z-scores was set to 1σ above the mean in all cases and was thus
always 1.0. On the basis of this threshold value, the four different
score combinations for classifying analogue series according to Figure 1c were determined.
Alternative Chemical Reference Spaces
For
the assessment of progression saturation, analogue series must
be projected into chemical reference spaces. We first investigated
the influence of alternative chemical space representations on scoring
and series categorization. Figure 3 shows the comparison of 7-dimensional (Figure 3a) and 14-dimensional reference
spaces (Figure 3b)
for 80 PubChem series in the presence of 13 203 transformation-based
virtual analogues. In both reference spaces, similar score combinations
were obtained for analogue series, leading to a closely corresponding
assignment of series to different progression stages (Figure 3c). In both reference spaces
of different dimensionalities, the majority of series (60 vs 58) fell
into the low/low global/local saturation score category, 13 series belonged
to the high/low category, and only three series were assigned to the
high/high category. As shown in Figure 4, equivalent observations were made for the 64 ChEMBL
series in the presence of 13 203 transformation-based virtual
analogues. In both chemical reference spaces, most series (43 vs 49)
belonged to the low/low category. In this case, no series were assigned
to the high/high category. Hence, both sets of PubChem and ChEMBL
series were dominated by analogue series with mid-stage character.
Only 17 PubChem and 12 ChEMBL series were found to belong to different
categories in the 7- and 14-dimensional reference spaces, consistent
with the closely corresponding distributions observed in Figures 3 and 4. Thus, scoring was stable in both reference spaces and very
similar assignments were obtained, indicating that progression saturation
assessment was largely insensitive to the chemical reference space variation tested.
Furthermore, the score distributions in Figures 3 and 4 also reveal
a significant spread of series across the scoring range, indicating
the capacity of global and local scores to distinguish between different
series. At least for analogue series from chemical optimization projects
published in the medicinal chemistry literature, the observed prevalence
of series with a mid-stage character would be expected.
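A comparison of this kind can be expressed as a simple bookkeeping step over per-series category labels. The sketch below is illustrative only; the function name and the series identifiers are hypothetical, and the labels are toy values rather than results from the study.

```python
from collections import Counter


def compare_assignments(categories_a, categories_b):
    """Compare category labels of the same series in two reference spaces.
    categories_*: dict mapping series identifier -> category label.
    Returns per-space category counts and the series whose label changed."""
    shared = categories_a.keys() & categories_b.keys()
    changed = sorted(s for s in shared if categories_a[s] != categories_b[s])
    counts_a = Counter(categories_a[s] for s in shared)
    counts_b = Counter(categories_b[s] for s in shared)
    return counts_a, counts_b, changed


# Toy labels for three hypothetical series in two reference spaces.
space_7d = {"s1": "low/low", "s2": "high/low", "s3": "low/high"}
space_14d = {"s1": "low/low", "s2": "low/low", "s3": "low/high"}
print(compare_assignments(space_7d, space_14d))
```

Reporting both the category counts and the list of changed series distinguishes a stable overall distribution from stable individual assignments, which need not coincide.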
Figure 3
Analysis of PubChem series in different chemical reference spaces.
Progression saturation of analogue series from PubChem with 13 203
virtual candidates was assessed in (a) 7-dimensional and (b) 14-dimensional
reference spaces. Scatter plots report local and global saturation
scores obtained for all series (each dot represents a series). (c)
Assignment of series to different score combination categories (according
to Figure 1c).
Figure 4
Analysis of ChEMBL series in different chemical reference spaces.
Progression saturation of analogue series from ChEMBL with 13 203
virtual candidates was assessed in (a) 7-dimensional and (b) 14-dimensional
reference spaces. (c) Assignment of series to different score combination
categories. The representation is according to Figure 3.
Virtual Analogues of Different Designs
Next, we compared different ensembles of virtual candidates for progression
saturation assessment, which represented another parameter of the
analysis. For ChEMBL series having multiple substitution sites, matrix-based
virtual analogues were generated, yielding varying numbers of 101–501
candidates per series. For each series, the corresponding number of
transformation-based virtual analogues was generated, and saturation
scores were calculated on the basis of both alternative sets of virtual
analogues in the 14-dimensional reference space. Figure 5a,b shows the score distributions obtained
for matrix- and transformation-based virtual candidates, respectively.
The resulting assignment of series to score combination categories
is shown in Figure 5c. Here, moderate changes were observed, predominantly for the high/low
and low/low categories. Progression saturation assessment in the presence
of transformation-based virtual analogues assigned 13 series to the
high/low and 38 series to the low/low category. By contrast, assessment
in the presence of matrix-based virtual candidates yielded eight
high/low and 44 low/low series. In addition, 13 and 12 low/high series
were obtained using transformation- and matrix-based virtual analogues,
respectively. Thus, there was a shift from late-stage series toward
series with mid-stage character for matrix-based virtual analogues
compared to transformation-based candidates. By design, matrix-based
virtual analogues were structurally closer to existing analogues than
transformation-based virtual candidates, which contained structurally
more diverse substituents. Hence, matrix-based candidates should be
more likely to map to neighborhoods of active analogues than transformation-based
virtual analogues, which would result in lower raw local saturation scores.
This was consistent with the observation that 22 series contained
no transformation-based virtual analogues within neighborhoods of
active analogues, whereas all series contained at least a few matrix-based
virtual candidates in active neighborhoods. However, lower raw local saturation
scores do not necessarily translate into significant differences in
category assignments because the category of a series is determined
on the basis of the magnitude of its global and local scores relative
to the scores of the other analogue series. Mapping of transformation-
or matrix-based virtual analogues delineates the chemical space of
an analogue series. One might argue that the space covered by matrix-based
virtual compounds, which are closely related to existing analogues,
might more accurately reflect the space that is relevant for a given
series.
Figure 5
Progression saturation analysis using virtual analogues of different
designs. Progression saturation of ChEMBL series was assessed on the
basis of corresponding numbers of (a) matrix- and (b) transformation-based
virtual candidates. Scatter plots report local and global saturation
scores obtained for all series (each dot represents a series). (c)
Comparison of the assignment of series to different categories. Here,
the raw global saturation score calculation was modified using the
top 10% of nearest virtual neighbors for determining the neighborhood
radius to account for the smaller number of virtual candidates.
Increasing Numbers of Virtual Analogues
In addition to alternative analogue design strategies, the use
of varying numbers of virtual analogues to sample chemical space was
also investigated. Figure 6 shows the results of progression saturation analysis of ChEMBL
series in the presence of stepwise increasing numbers of transformation-based
virtual analogues. Only small variations in series assignments were
observed over a wide range of virtual candidates, indicating that
the number of virtual analogues was not a critical parameter for saturation
progression assessment. This can be rationalized by taking into account
that, independent of the size of the set, virtual candidates are used
to sample the chemical space centered on an analogue series. This
implies that larger sets of virtual analogues do not necessarily cover
a larger section of chemical space. Rather, they more densely sample
chemical space around an analogue series. Saturation scores assess
the chemical space coverage by the series on the basis of these virtual
analogues. Because a small set of samples is less likely to be evenly
distributed, one would expect larger variations for smaller sets of
virtual candidates and more stable category assignments for larger
sets. Over different set sizes, only small fluctuations in distributions
were observed, indicating that chemical space coverage by a few hundred
to a few thousand virtual analogues was sufficient for ensuring stable
category distributions. For practical applications, the results in Figure 6 suggest that on
the order of 1000 virtual candidates are sufficient for the analysis
of moderately sized analogue series.
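The dependence of a coverage-type score on the number of virtual samples can be probed with a short Python sketch. It assumes one plausible reading of the raw global score: the fraction of virtual candidates that fall within a neighborhood radius of at least one existing analogue, with the radius taken as the median of the smallest 10% of nearest-neighbor distances among the virtual candidates. Both definitions, all function names, and the random Gaussian coordinates are assumptions made for illustration only.

```python
import math
import random
import statistics


def neighborhood_radius(virtual, top_fraction=0.10):
    """Median of the smallest `top_fraction` of virtual nearest-neighbor distances.

    Requires at least two virtual candidates. This radius definition is assumed.
    """
    nearest = []
    for i, v in enumerate(virtual):
        nearest.append(min(math.dist(v, w)
                           for j, w in enumerate(virtual) if j != i))
    nearest.sort()
    k = max(1, int(len(nearest) * top_fraction))
    return statistics.median(nearest[:k])


def global_score(analogues, virtual, top_fraction=0.10):
    """Fraction of virtual candidates within the radius of any existing analogue."""
    radius = neighborhood_radius(virtual, top_fraction)
    covered = sum(1 for v in virtual
                  if any(math.dist(v, a) <= radius for a in analogues))
    return covered / len(virtual)


def sample(rng, n, dim=14, spread=1.0):
    """Random points standing in for compounds in a 14-dimensional space."""
    return [tuple(rng.gauss(0.0, spread) for _ in range(dim)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    analogues = sample(rng, 40, spread=0.5)
    # Larger sets sample the same region more densely rather than a wider region.
    for n in (100, 200, 400):
        print(n, round(global_score(analogues, sample(rng, n)), 3))
```

Because the virtual candidates are always drawn around the series, increasing `n` changes the sampling density but not the region that is sampled, which mirrors the rationale given above.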
Figure 6
Category distributions for increasing
numbers of virtual analogues.
For increasing numbers of virtual analogues, score combination categories
for ChEMBL series are reported as percentages. As in Figure 5, the top 10% of nearest neighbors
were used for determining the neighborhood radius for calculating
the global saturation score.
Evolving Analogue Series
Although
parameter evaluation depends on the analysis of large ensembles of
similar analogue series from which statistically sound scores can
be derived, the assessment of individual series is of particular interest
in medicinal chemistry. In practice, the analysis of progression saturation
would primarily be on the agenda for evolving series for which a significant
number of analogues have already been generated. However, such series
are rarely disclosed and only few examples are available in the public
domain, which prohibits Z-score calculations. From
ChEMBL, we have obtained three series with 100 or more analogues and
analyzed them on the basis of raw scores instead. For consistency
with calculations reported above, 13 203 transformation-based
virtual analogues were generated for each series
and projected into the 14-dimensional reference space. To model the
evolution of series, subsets of 30 and 60 analogues were taken from
each series and compared to those of the complete series. The results
are shown in Figure 7a, and representative analogues from each series are depicted in Figure 7b. The analogue series
include 126 inhibitors of acetyl-CoA carboxylase 2; 146 inhibitors
of phosphodiesterase 10A, the largest available series; and 100 inhibitors
of the 5-lipoxygenase activating protein. For compounds from each
series, exemplary transformation-based virtual analogues are shown. Figure 7a reveals for each
series an increase in saturation scores for subsets of increasing
size, consistent with the expectation that increasing numbers of analogues
should generally result in increasing levels of chemical saturation.
Interestingly, the progression saturation characteristics of all three
series differed. The series on the left in Figure 7a yielded the lowest global and highest local
saturation scores, the series in the middle had intermediate scores,
and the series on the right yielded the highest global and lowest
local scores. Thus, on a relative scale, the series on the left had
lower analogue coverage and fewer virtual candidates in active neighborhoods
than the others. By contrast, the series on the right had
more extensive analogue coverage than the others and more virtual
candidates in active neighborhoods. In qualitative terms, progression
saturation of the series in Figure 7a increased from the left to the right. The analogue
series on the right displayed late-stage character and the highest
level of progression saturation.
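The subset-based modeling of series evolution can be sketched in Python as follows. The sketch assumes that subsets are nested (the first 30 analogues, the first 60, then all), that the virtual candidates and the neighborhood radius are held fixed across subsets, and that a simple coverage fraction stands in for the raw global score. How the subsets were actually selected is not specified above, so these choices, the function names, and the random two-dimensional coordinates are illustrative assumptions.

```python
import math
import random


def coverage(analogues, virtual, radius):
    """Fraction of virtual candidates within `radius` of at least one analogue."""
    hits = sum(1 for v in virtual
               if any(math.dist(v, a) <= radius for a in analogues))
    return hits / len(virtual)


def progression(series, virtual, radius, sizes=(30, 60)):
    """Score nested subsets of a series, followed by the complete series."""
    steps = [n for n in sizes if n < len(series)] + [len(series)]
    return [(n, coverage(series[:n], virtual, radius)) for n in steps]


if __name__ == "__main__":
    rng = random.Random(7)
    # 126 points stand in for a series of 126 analogues (hypothetical coordinates).
    series = [(rng.random(), rng.random()) for _ in range(126)]
    virtual = [(rng.random(), rng.random()) for _ in range(2000)]
    for n, score in progression(series, virtual, radius=0.03):
        print(n, round(score, 3))
```

With nested subsets and a fixed radius, the coverage fraction cannot decrease as analogues are added, which is in line with the expectation stated above that growing series become more saturated.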
Figure 7
Categorization of evolving analogue series.
(a) For three large
analogue series, raw global and local saturation scores are reported
for subsets of increasing size (small dots, 30 analogues; medium-size
dots, 60 analogues; and large dots, all analogues of a series). These
series include inhibitors of acetyl-CoA carboxylase 2 (red; 126 analogues),
phosphodiesterase 10A (blue; 146), and 5-lipoxygenase activating protein
(orange; 100). (b) Representative analogues from each series are shown
together with transformation-based virtual nearest neighbors (red).
The color code of boxes separating analogue subsets from different
series corresponds to (a). In addition, the target of each series
and the ChEMBL ID and potency value of each analogue are reported.
Conclusions
Herein, we have investigated computational progression saturation
analysis of different sets of analogue series, originating from biological
screening or medicinal chemistry, under varying parameter settings.
The computational approach depends on relating chemical space distributions
of existing analogues and virtual candidates to each other and on
analyzing chemical neighborhoods of existing analogues. Accordingly,
it is important to explore how different analysis conditions might
influence the results of progression saturation analysis. Relevant
parameters include alternative chemical space representations, varying
compound numbers, and different virtual analogue design strategies.
Moreover, it is of critical relevance to evaluate the influence of
varying computational analysis settings on the categorization of series
on the basis of characteristic score combinations, a prerequisite
for meaningful practical applications. Therefore, different chemical
reference spaces and populations of virtual compounds were explored.
Essentially, scoring remained stable under these conditions and only
minor to moderate alterations in series categorization were observed.
Furthermore, we have analyzed exemplary large analogue series, which
are rare in the public domain, and used these analogue series to model
series progression. Analysis of evolving series revealed an intuitive
increase in chemical saturation when series grew in size, lending
further credence to the methodological concept. Moreover, computational
comparison of large series comprising similar numbers of analogues
revealed different saturation characteristics, which is of high relevance
for medicinal chemistry applications. Distinguishing between these
characteristics relied on our computational analysis scheme. Herein,
we provide all details required for computational progression saturation
analysis, which should make it straightforward for interested investigators
to implement the methodology.