Drug discovery programs frequently target members of the human kinome and try to identify small molecule protein kinase inhibitors, primarily for cancer treatment, additional indications being increasingly investigated. One of the challenges is controlling the inhibitors degree of selectivity, assessed by in vitro profiling against panels of protein kinases. We manually extracted, compiled, and standardized such profiles published in the literature: we collected 356 908 data points corresponding to 482 protein kinases, 2106 inhibitors, and 661 patents. We then analyzed this data set in terms of kinome coverage, results reproducibility, popularity, and degree of selectivity of both kinases and inhibitors. We used the data set to create robust proteochemometric models capable of predicting kinase activity (the ligand-target space was modeled with an externally validated RMSE of 0.41 ± 0.02 log units and R02 0.74 ± 0.03), in order to account for missing or unreliable measurements. The influence on the prediction quality of parameters such as number of measurements, Murcko scaffold frequency or inhibitor type was assessed. Interpretation of the models enabled to highlight inhibitors and kinases properties correlated with higher affinities, and an analysis in the context of kinases crystal structures was performed. Overall, the models quality allows the accurate prediction of kinase-inhibitor activities and their structural interpretation, thus paving the way for the rational design of compounds with a targeted selectivity profile.
Drug discovery programs frequently target members of the human kinome and try to identify small molecule protein kinase inhibitors, primarily for cancer treatment, additional indications being increasingly investigated. One of the challenges is controlling the inhibitors degree of selectivity, assessed by in vitro profiling against panels of protein kinases. We manually extracted, compiled, and standardized such profiles published in the literature: we collected 356 908 data points corresponding to 482 protein kinases, 2106 inhibitors, and 661 patents. We then analyzed this data set in terms of kinome coverage, results reproducibility, popularity, and degree of selectivity of both kinases and inhibitors. We used the data set to create robust proteochemometric models capable of predicting kinase activity (the ligand-target space was modeled with an externally validated RMSE of 0.41 ± 0.02 log units and R02 0.74 ± 0.03), in order to account for missing or unreliable measurements. The influence on the prediction quality of parameters such as number of measurements, Murcko scaffold frequency or inhibitor type was assessed. Interpretation of the models enabled to highlight inhibitors and kinases properties correlated with higher affinities, and an analysis in the context of kinases crystal structures was performed. Overall, the models quality allows the accurate prediction of kinase-inhibitor activities and their structural interpretation, thus paving the way for the rational design of compounds with a targeted selectivity profile.
Protein kinases typically
function in highly connected, dynamic,
and regulated networks and are central actors in the majority of signal
transduction cascades. The human kinome is composed of more than 500
kinases[1] and deregulated kinase signaling
has frequently been observed to be oncogenic.[2] Human kinases are therefore attractive targets for drug discovery
and thus have received considerable attention from the pharmaceutical
industry, which has invested in the identification of small molecule
protein kinase inhibitors (PKIs) targeting the protein kinase catalytic
domain.[3,4] These efforts have so far led to the approval
of 36 PKIs for clinical use (28 by the US Food and Drugs Administration[5]). In addition, at least 600 PKIs have entered
formal clinical trials.[6] The vast majority
of the approved or under investigation PKIs aim at treating various
neoplasms, but PKIs are now also being designed to treat other indications
such as diabetes, neurological, inflammatory, and autoimmune diseases
like rheumatoid arthritis.[7−11]A number of PKIs behave as allosteric regulators[12−14] but the majority
(about 95%[6]) act as competitive inhibitors,[15] usually blocking ATP cofactor binding, with
several different distinct binding modes as demonstrated in X-ray
crystallography studies.[16,17] There is also a relatively
large set of irreversible PKIs.[18,19] The highly conserved
nature of the ATP binding site renders the development of highly selective
PKIs challenging,[20] while the selectivity
profile of a PKI governs its total effect on an organism. Indeed,
the clinical efficacy of some PKIs against kinase targets against
which they were not originally developed led to their approval for
other indications. For example, while originally approved to treat
chronic myeloid leukemia,[21] acting via
inhibition of cAbl, Imatinib (Gleevec) was later shown to inhibit
Kit and PDGFR. Consequently, it was approved for the treatment of
gastrointestinal stromal tumors and the hypereosinophilic syndrome
in which these particular kinases are dysregulated.[22,23] The dual beneficial and adverse off-target pharmacology of PKIs
is complex[24−27] and relies on both target exposure and activity spectra.In
vitro profiling is one of the standard tools at hand to reduce
attrition rates observed during drug discovery and development.[28] As a general rule, promiscuous compounds are
difficult to optimize and develop. Given both the known promiscuity
of PKIs, and the large number of protein kinase genes known and assayable,
compounds developed as PKIs are routinely profiled against substantial
parts of the (human) kinome.[29−32] The objective here is to screen out the PKIs with
undesired kinase profiles as early as possible. Furthermore, it is
now common practice to screen compound libraries against the kinome
in order to identify either new pharmacological probes for poorly
characterized targets,[33] or to identify
hits for newly validated kinases.Since the initial work of
Davies et al. in 2000,[34] a growing number
of publications have reported the profiling
against large kinase panels of either libraries of compounds (some
selected as potential PKIs),[35,36] or smaller and more
focused sets of PKIs (as part of their characterization).[37,38] Databases such as ChEMBL[39,40] make publicly available
a very large amount of structure–activity relationships (SARs)
manually extracted and curated from the scientific literature. Due
to the literature focus of these efforts, full kinase profiles of
compounds were not routinely added, with notable exceptions, such
as for the Millipore Kinase Screening data set[41] (CHEMBL2218924) and the GSK Kinase Inhibitor data set[42] (CHEMBL2303647). An effort was therefore undertaken
to manually capture in the primary literature (including, but not
limited to medicinal chemistry journals) and standardize such kinase
profile-related SAR. Here, we report the collation of an unprecedentedly
large number of kinase inhibition data points, and the release to
the public domain, through a deposition in the ChEMBL database.Due to the heterogeneous nature of the sources, the collected SAR
data set is not a complete one:[43] while
for some PKIs most of the kinases were tested, others were tested
only on a few kinases. In addition, the sizes of the kinase panels
vary across the publications and assay technology and readout can
substantially vary. To overcome this and demonstrate the potential
of a kinase data set like the current in the public domain, we leveraged
on the collected SAR and the content of the ChEMBL database to populate
the kinase/PKI pairs that were not initially measured, using a state-of-the-art
and robust activity modeling methodology. After integration with ChEMBL
data (release 19, see Materials and Methods for details), proteochemometric bioactivity modeling (PCM) was applied
at large-scale to this set. The results demonstrate that the application
of this method to the compiled data set enables a robust prediction
of kinase activity across the subset of the kinome covered by the
data set; combined with a systematic exploration of the compounds
and kinases features that drive potency, this paves the way for bespoke
design of compounds with selective promiscuity.
Materials and Methods
Data Sources
All potential sources of PKI kinome profiling
bioactivities were identified through literature searches. A first
selection restricted the sources to articles or data sets containing
bioactivities from assays where (i) the target was tractable; (ii)
the technology generated results of the following activity types:
Pct_Ctrl, Pct_Inhib, Kd, Ki, IC50 (excluding for example bioactivities
generated from thermal stability shift assays[44]); (iii) the ligand structures were disclosed; and (iv) quantitative
bioactivity measurements were available and not expressed exclusively
as ranges. In addition to measurements published in peer-reviewed
articles, data points made publicly available by the NIH Libraries
of Network-based Cellular Signatures (LINCS) program[45,46] and the International Centre for Kinase Profiling within the MRC
Protein Phosphorylation Unit at the University of Dundee[47] were included. For each data source, the bibliographic
reference, PubMed identifier, digital object identifier (DOI), date
of publication, title, authors list and abstract of the publication,
publishing journal name, reference. The list of data sources is provided
in the Supporting Information (SI) Table
S1.
PKIs
PKI structures were directly extracted, transformed
from SMILES representations, or manually drawn using Biovia Draw[48] and then standardized using Pipeline Pilot version
8.5:[49] salts were removed, charges and
stereochemistry were standardized, and a canonical tautomer was generated.
The accuracy of stereochemistry and tautomeric state were manually
checked and adjusted if necessary, using PubMed for bibliographic
searches (http://www.ncbi.nlm.nih.gov/pubmed) and the Protein
Data Bank (PDB) for kinase–PKI complex crystal structures (http://www.rcsb.org/pdb). A unique name was attributed to each
structure and synonyms were stored. The list of PKI names, structures,
and compound-related data is provided in Supporting Information Table S2.
Kinases
For each name used to identify
the kinase in
the original publications (SwissProt entry number, kinase name, gene
name), the corresponding Entrez preferred gene symbol from the Human
Genome Nomenclature Committee (HGNC) (www.genenames.org) was identified. To further qualify the kinases, several properties
were added, describing whether the assay was run against a single
kinase, the targeted binding site (for the kinases with multiple kinase
domains), the length of the domain, the source of the assayed kinase,
the presence and nature of mutations, the exact sequence, the phosphorylation
status, and the organism (see Supporting Information 1 for details); however we anticipate that this data is tentative
in some cases due to sparse reporting in the literature. The list
of kinases is provided in Supporting Information Table S3.
Assays
To fully describe the assays
and when applicable,
the name of the kinase profiling service commercial provider, the
assay technology, the measured effect and the ATP and compound concentrations
were captured (see Supporting Information 2 for details).
Bioactivities
For all bioactivities,
the original value,
operator, unit and display name (of possible values: Kd, Ki, IC50, Pct_Ctrl,
Pct_Inhib) were captured. Pct_Ctrl is one assay results preprocessing
method that attempts to correct for variability by normalizing compound
measurements relative to controls: Pct_Ctrl =100 × Mean_on_compound/Mean_on_controls.
Similarly, Pct_Inhib is defined as 1 – (100 × Mean_on_sample/Mean_on_controls).
Those values were then submitted to a set of standardization rules
described in details in Supporting Information 3. Briefly, Pct_Ctrl and Pct_Inhib results were transformed in Pct_Effect
values; all bioactivities were then standardized (standardized result
value, SRV and their negative logarithm pSRV); an additional property
was created to monitor the applied transformations (standardized result
type, SRT).The minimal set of properties required to capture
a bioactivity value was the compound and target names, the organism,
the result value, result operator, and result type. The list of possible
values for all properties related to bioactivities is provided in Supporting Information Table S4.
Generation
of Data Sets 1 and 2
The ensemble of bioactivities
initially captured was filtered to build data set 1: only measurements
from human kinases were kept, and measurements were rejected where
the exact kinase could not be identified (Target Type = “Gene
Family”) or where the standardized result type was “inconclusive”.
In the analysis, two naming conventions refer to the kinases that
were assayed:“HGNC”,
corresponding to the Target Name“Kinase”,
corresponding to the Target
Name, Binding Site, and Mutation and describing the kinase with more
precisionThe notion of coverage was
introduced as follows: for
a Kinase–PKI or HGNC–PKI pair, the presence of at least
one measurement in data set 1 contributes one unit. By summing over
all pairs and dividing by the number of pairs (per data source or
across data set 1), the coverage of the data source or of the data
set is calculated.Multiple measurements per Kinase–PKI
pair were aggregated
to keep only one copy: the original measurements and their sources
were discarded, and only the number N, mean, and standard deviation
(SD) of pSRV were considered further. Data set 2 was then generated
by filtering out the aggregated measurements with a pSRV SD greater
than one standard deviation unit.
Kinase- and PKI-Centric
Indices
For each Kinase, the
number of PKIs having a pSRV_Mean, a pSRV_Mean below 6 (1 μM),
and a pSRV_Mean below 7 (100 nM), the kinase selectivity index at
1 μM (SI_k_1uM) and the kinase selectivity index at 100 nM (SI_k_100
nM)[35,50,51] were calculated.
For each PKI, the number of Kinases having a pSRV_Mean, a pSRV_Mean
below 6 (1 μM), and a pSRV_Mean below 7 (100 nM), the selectivity
index at 1 μM (SI_1uM) and the selectivity index at 100 nM (SI_100
nM) were calculated. All calculations and data processing were performed
with Pipeline Pilot version 8.5.[49]
SureChEMBL
Searches
SureChEMBL (https://www.surechembl.org/) is a large, freely available, live repository of compound structures
automatically extracted from full-text patent documents.[52] The SureChEMBL system consists of the data pipeline
and the user interface. The former automatically detects named chemical
entities in text, and chemical structures in images. It annotates
the incoming patent documents with chemical structures, which are
then stored in a database. The latter allows users to search patent
documents using text or chemical queries and retrieve the compounds
extracted from specified document search hits.The SureChEMBL
repository was used to search for the earliest patent documents claiming
the PKIs in the data set. The search was implemented as follows:Each of the PKIs in the data set
was submitted to the
UniChem[53] web service to run a search against
the SureChEMBL compound pool. In addition to exact matching, UniChem
allows connectivity matching, i.e. searching for compounds with the
same connectivity, by ignoring stereochemistry, isotope or salt information.[54]For each of the
1132 search hits (53.8% of the data
set), all patent documents containing the hit (either in text or image)
from the three main patent authorities (WIPO, USPTO, and EPO) were
retrieved.The PKIs and their corresponding
earliest patent documents
were annotated with bibliographic data, such as priority date, title,
inventor, and assignee.The target, target
family, and target type properties
were manually assigned based on patent titles with additional PubMed
look ups when required. The affiliation type and activity domain properties
were manually assigned based on the SureChEMBL-provided list of inventors
and affiliations.The list of PKIs, patents,
and related properties is provided in Supporting
Information Table S2.
Proteochemometric Models
Proteochemometric
(PCM) modeling
creates predictive models using machine-learning.[55,56] The methodology is related to conventional quantitative structure–activity
relationship (QSAR) modeling but includes an explicit target protein-based
descriptor alongside conventional small molecule descriptors.[57] This allows the pooling of bioactivity data
of several related targets into a single PCM model. Previous application
areas of PCM include GPCR families and their ligands, viral mutants
and enzyme inhibitors, transporter proteins and compounds blocking
these proteins, and kinases and kinase inhibitors.[58−64] While this is not the first attempt to apply PCM to a PKI data set,
it is, to the authors’ knowledge, the first time it has been
applied on the large scale demonstrated here and to a kinase data
set of such ligand diversity. Schürer and Muskal used a diverse
public set, yet constructed hundreds of separate QSAR models, an approach
that does not allow interpretation from the kinase perspective (which
residues contribute to binding of which compound).[65] Finally, our additional aim is to explore the feasibility
of PCM analysis toward the rational design of kinase inhibitors with
biased, defined promiscuity.
Proteochemometric Data
Set
To generate PCM models with
the broadest applicability domain, Ki data
points from the collected SAR set were merged with those present in
ChEMBL (release 19). For modeling, Ki values
were -log10 transformed, leading to pKi values. Multiple measurements were averaged; if duplicate pairs
were present in this set and ChEMBL, the duplicates were discarded
favoring the measurements presented here over those already in ChEMBL.
Furthermore, allosteric data points were removed based on previously
published work (using data points that were retrieved from text mining
paper abstracts)[66] and peptide-like PKIs
were discarded. Peptides were detected according to the protocol used
by ChEMBL (among others filters based on presence of a peptide bond
and known peptides).
Proteochemometric Descriptors
The
binding pocket was
described using the kinase sequences alignment provided by Huang et
al.[67] which was converted to physicochemical
properties using the first three Z-scales published by Sandberg et
al.[68] as done previously.[57] This led to 117 protein descriptors. Three mean values
for the total sequence for the 3 Z-scales (Z1_mean, Z2_mean, Z3_mean),
and 3 scales per residue (Z1_Res_1, Z2_res_1, etc.) times 38 residues.
PKIs were described using a combination of physicochemical properties
(37 descriptors, see Supporting Information Table S5 for a full list) and FCFP_6 circular fingerprints.[69] The fingerprint features were selected based
on their frequency (optimal frequency was present in 50% of the PKIs)
and a total of 768 features were encoded as a fixed-length bit string.
Finally, 16 properties coded for presence of cofactors or secondary
kinase domains, and 4 encoded properties about the activity type/interaction
(IC50/Ki/Kd/predicted allosteric). In total, the feature vector consisted of
942 descriptors per PKI/kinase pair.
Proteochemometric Machine
Learning
Script files for
R statistics (version 2.15.0) were prepared in Pipeline Pilot version
8.5 using the “randomForest” package.[49,70,71] Models were trained using 10 000
trees on the EMBL-EBI cluster (45 cores) in parallel using the “foreach”
package.[72] The optimal value for “mtry”
was empirically found to be a fraction of 0.3 of the total descriptors.
Scaling of the importance variables was turned off and no imputation
was performed.
Proteochemometric Regression Models
Several models
were created: the data set was split into random training sets of
10%, 30%, 50%, 70%, and 90% of the total data set. Subsequently models
were trained on these sets and validated on the remaining 90%, 70%,
50%, 30%, and 10% of the set to create learning curves. This process
was performed four times, splitting the set with a different seed
every iteration. Finally a full model was trained on 100% of the set,
which was subsequently interpreted and is available for prediction.Model quality was validated using both the correlation coefficient
and the Root Mean Square Error (RMSE). The combination was chosen
as both are complementary. RMSE can be overly optimiztic when the
bulk of the compounds is inactive (a simple inactive prediction suffices)
whereas the R2 suffers from distortions
if the modeled activity range is very low (see Results for further details).
Results
Data Set Composition
There were 356 908 entries
in the original data set, corresponding to 61 data sources, 482 HGNC
symbols and 2106 PKIs. Perhaps counterintuitively, the analysis of
the data sources indicates that the majority of bioactivities were
disclosed in journals not directly focused on medicinal chemistry:
74.6% of the measurements were published in three journals: Nature Chemical Biology (30.5%), Biochemical Journal (22.6%), and Nature Biotechnology (21.5%) (Figure a). Furthermore,
78.2% of the measurements came from three articles and the two online
sources (Figure b).
Figure 1
Distribution
of measurements across (a) the journals and (b) the
data sources.
Distribution
of measurements across (a) the journals and (b) the
data sources.In vitro kinase profile
commercial providers generated 68.5% of
the measurements, the major contributors being DiscoverX (22.5%),
Millipore (20.9%), and Reaction Biology (15.0%) (Table ). The vast majority of the
measurements were of type Pct_Ctrl (61.8%), followed by Ki (30.0%), Kd (7.7%), and
IC50 (0.5%).
Table 1
Distribution of Measurements Across
Kinase Profile Commercial Providersa
kinase profile
provider
measurements
count
percent of
total
Ambit/DiscoverX
80328
22.5
Millipore
74620
20.9
Reaction Biology
53417
15.0
MRC Dundee
28778
8.1
Carna
Biosciences
2725
0.8
Caliper Life Sciences
1856
0.5
Invitrogen
1910
0.5
Proquinase
931
0.3
Abbott
72742
20.4
Other
39595
11.1
Although not a kinase profile
provider, the large number of measurements shared by Abbott justified
the creation of a dedicated category.
Although not a kinase profile
provider, the large number of measurements shared by Abbott justified
the creation of a dedicated category.The standardized result type provides an assessment
of the amount
of usable information within the data set: 9.9% of the data were qualified
as “inconclusive”, 14.8% as “inactive”
(ranges with thresholds indicating a lack of activity), 60.8% as “calculated”
(Pct_Effect, single point measurements), and 14.5% as “measured”
(Ki, Kd, or
IC50, in which a greater confidence can be given than in
single point measurements;Table ). Supporting Information Figure S1 compares across the data sources the relative numbers
of measurements and their distribution according to the result types.
Table 2
Distribution of Measurements Across
the Standardized Result Types
result type
measurements
count
percent of
total
calculated
216894
60.8
inactive
52817
14.8
measured
51797
14.5
inconclusive
35394
9.9
The distribution of the standardized result type values
across
the result types is presented in Figure . It shows that approximately half of the Kd measurements were considered as inactive (to
be compared with approximately one-third of the Ki measurements, with another third being considered inconclusive).
Figure 2
Distribution
of measurements across the standardized result types,
split by measured result types. The number of measurements is labeled
and is proportional to the radius of the discs. The color corresponds
to the original data type (yellow Pct_Ctrl, orange PCT_Inhib).
Distribution
of measurements across the standardized result types,
split by measured result types. The number of measurements is labeled
and is proportional to the radius of the discs. The color corresponds
to the original data type (yellow Pct_Ctrl, orange PCT_Inhib).The analysis of the Target Type
values indicates that 97.6% of
the measurements were obtained in assays where the target could be
mapped to a single HGNC symbol, with another 2.2% coming from complexes.
Here, 65.2% of the measurements were generated in binding assays,
22.5%, in quantitative PCR assays, and 11.2%, in fluorescence assays.
Finally, almost all measurements were generated in human kinase assays
(99.5%).The collected data cover widely the kinome, as can
be observed
in Figure where the
majority of the branches in the treelike representation of the kinome
introduced by Manning et al. feature measurements.[1]
Figure 3
Coverage of the kinome by the current data set. The kinase names
and circles sizes are proportional to the number of corresponding
compounds. The picture was generated using the Kinome Render,[73] and the kinome tree illustration is reproduced
courtesy of Cell Signaling Technology, Inc. (www.cellsignal.com). A high resolution version is available as Supporting Information Figure S2.
Coverage of the kinome by the current data set. The kinase names
and circles sizes are proportional to the number of corresponding
compounds. The picture was generated using the Kinome Render,[73] and the kinome tree illustration is reproduced
courtesy of Cell Signaling Technology, Inc. (www.cellsignal.com). A high resolution version is available as Supporting Information Figure S2.A limited number of kinases have already proven to be attractive
targets for drug discovery efforts, and the number of measurements
available per HGNC symbol reflects this trend: roughly one-third of
the 482 genes exhibited more than 1000 measurements, another third
had 250 to 1000 measurements, and the last third less than 250 measurements
(Figure a). The 10
most characterized kinases are ABL1, EGFR, FLT3, KIT, LCK, PIM1, PRKACA,
GSK3B, AURKB, and PDPK1, with more than 2000 measurements each (Figure b). The 10 least
profiled kinases are PINK1, SGK233, ATR, CDK1, CDK8, EFNA2, EIF2AK4,
MAST3, SMG1, and STRADA, with less than four measurements each (Figure c).
Figure 4
Number of measurements
per HGNC symbol: (a) all kinases; (b) most
popular kinases; (c) least popular kinases. The colors indicate the
result type.
Number of measurements
per HGNC symbol: (a) all kinases; (b) most
popular kinases; (c) least popular kinases. The colors indicate the
result type.To focus on data of interest
for drug discovery projects, data
set 1 was prepared by excluding measurements from nonhuman kinases
(1923 measurements) or from gene families (exact kinase not identified,
267 measurements) and qualified as inconclusive (35 392 measurements).
In total, 319 320 measurements were further considered (89.5%
of the initial data set). Those were not distributed equally among
the data sources, since the sizes of the kinase panels depend on the
profiling organizations. The coverage across the data sources was
therefore analyzed. In total, the coverage was 15.6% when considering
the targets at the kinase level, and 20.0% at the HGNC level (Table ). Supporting Information Tables S6 and S7 provide the counts
and coverages for each data source, at the Kinase and HGNC levels,
respectively.
Table 3
Coverage of Dataset 1 at the Kinase
and HGNC Levels
compounds
count
targets count
pairs count
count of
pairs with data
coverage
(%)
kinase level
2106
627
1320462
207495
15.7
HGNC level
2106
452
951912
191877
20.2
In data set 1, only 15.7% of the Kinase–PKI
pairs do have
at least one measurement. But the distribution of coverage per data
source indicates that 42 data sources have an internal coverage of
90% or more (68.9% of the data sources) at the Kinase level, while
only 7 have a coverage of 50% or less (Table ). This highlights that data set 1 is a heterogeneous
“patchwork” of relatively densely populated submatrices
of Kinases-PKIs, separated by empty submatrices (Supporting Information Figure S3).
Table 4
Distribution
of the Data Sources of
Dataset 1 per Binned Coverage, at the Kinase Level
internal
coverage (%)
[0; 20]
]20; 30]
]30; 40]
]40; 50]
]50; 60]
]60; 70]
]70; 80]
]80; 90]
]90; 100]
count
1
2
2
2
0
2
5
5
42
The reproducibility of the
measurements was then analyzed. Data
set 1 corresponds to 319 320 measurements, distributed across
207 495 Kinase–PKI pairs. 70.6% of these pairs exhibit
a unique measurement, while only 1.8% had five measurements or more
(Table ).
Table 5
Distribution of the Number of Measurements
Per Kinase–PKI Pair
number of
measurements per Kinase–PKI pair
1
2
3
4
5 or more
count
146564
25134
27656
4477
3664
percent total
70.63
12.11
13.33
2.16
1.77
For the 29.4% of Kinase–PKI
pairs with multiple measurements,
the distribution of the pSRV SD indicates that applying a filter at
SD equal or less than 1 unit would lead to the rejection of only 570
pairs (Table ), a
relatively small number, in the order of magnitude of the amount of
information contained in the profile of a single PKI against all Kinases. Supporting Information Figure S4 shows the distribution
of the binned pSRV SD values split by number of measurements per Kinase–PKI
pair.
Table 6
Binned Distribution of the pSRV SD
binned pSRV
SD
0.0
]0.0; 0.5]
]0.5; 1.0]
]1.0; 1.5]
]1.5; 2.0]
>2.0
count
161844
39002
6079
507
58
5
percent total
78.00
18.80
2.93
0.24
0.03
0.00
Data
set 2 was created by applying the “pSRV SD ≤
1” filter to include only Kinase–PKI pairs with significant
measurements. It contains 627 Kinases, 2106 PKIs, and 206 925
Kinase–PKI pairs with a pSRV (15.6% of the total number of
pairs). For 316 Kinases (50.4%), at most 200 PKIs were tested that
provided a significant measurement, while only 51 Kinases (8.1%) had
more than 1000 PKIs tested against (Table ).
Table 7
Distribution of the
Binned Number
of PKIs per Kinase
binned number
of PKIs per Kinase
[1; 200]
]200; 400]
]400; 600]
]600; 800]
]800; 1000]
]1000; 1200]
]1200; 1400]
]1400; 1600]
>1600
count
316
137
70
29
24
36
14
0
1
percent total
50.40
21.85
11.16
4.63
3.83
5.74
2.23
0.00
0.16
By applying an upper limit of 100 nM for potent PKIs,
it was observed
that the coverage of the kinome by potent inhibitors is similar to
the one observed in Data set 1 (Supporting Information Figure S5). The distribution of the kinase selectivity index at
100 nM across the kinome (limited to Kinases with at least 100 PKIs
to reduce sampling bias, rejecting mutants, and nonkinase domains)
indicates that some Kinases seem more prone to inhibition than others,
namely CDC7, MAP2K5, FLT3, FLT1, and DDR1 (Supporting Information Figure S6). However, the SI_k_100 nM does not allow
factoring in the number of tested PKIs. The number of potent PKIs
was plotted against the number of tested PKIs to identify among the
Kinases probed with many PKIs those that proved difficult to inhibit
(selective Kinases, such as NEK2, MAPK1, MAPK12, MAPK13, MAPKKAPK2,
AKT1, AKT2, and AKT3) and those that were more prone to inhibition
(promiscuous Kinases, such as FLT3, CLK4, AURKB, KDR, FLT1, CSF1R,
RET, and CDC7) (Figure ). For example, a similar number of PKIs was tested against CLK4
(1206 PKIs) and NEK2 (1210 PKIs); however, while 447 PKIs were active
below 1 μM and 175 below 100 nM against CLK4, only 71 PKIs were
active below 1 μM and 11 below 100 nM against NEK2. Of course,
directly comparing these hit rates requires assuming that the PKIs
assayed against each Kinase (57% of the total number of PKIs) covered
the same chemical space, which is not necessarily true: the numbers
of PKIs are similar, but the PKIs might be different. Furthermore,
defining a Kinase as “selective” or “promiscuous”
relies on a similar assumption and a more appropriate but more complex
qualification for these Kinases would be selective or promiscuous
“within the boundaries of the chemical space they were assayed
against”.
Figure 5
(a) Comparison across all Kinases between the number of
PKIs tested
(x axis) and the number of PKIs active below 100
nM (y axis). (b) Same comparison as in part a with
a focus on the most selective Kinases. Color and marker size reflect
the potency of the most potent inhibitor of each Kinase. Shown in
parentheses is the SI_k_100 nM.
(a) Comparison across all Kinases between the number of
PKIs tested
(x axis) and the number of PKIs active below 100
nM (y axis). (b) Same comparison as in part a with
a focus on the most selective Kinases. Color and marker size reflect
the potency of the most potent inhibitor of each Kinase. Shown in
parentheses is the SI_k_100 nM.A similar analysis of data set 2 can be performed by focusing
on
PKIs instead of Kinases. Some of the collected data sources describe
the results of PKIs profiling against Kinase panels of limited size.
In addition, a number of measurements were rejected when building
data sets 1 and 2. As a consequence, a significant number of PKIs
from data set 2 present results against 50 Kinases or less (925 PKIs,
43.9%) (Table ), too
small a panel to allow the evaluation of kinase selectivity. In parallel,
323 PKIs present results against at least 200 Kinases (15.3%) (Table ). At that level of
coverage of the kinome, it should be possible to identify nonselective
kinase inhibitors, and inhibitors already displaying some selectivity
could be prioritized (keeping in mind the possibility of a later identification
of additional inhibited kinases that were not present in the initial
panel). It is to be noted that this constraint prevented any further
analysis based on Murcko scaffolds, due to the limited size of the
sample left after filtering (2 Murcko scaffolds with 5 PKIs, 3 Murcko
scaffolds with 4 PKIs, 5 Murcko scaffolds with 3 PKIs, 22 Murcko scaffolds
with 2 PKIs, and singletons).
Table 8
Distribution of the
Binned Number
of Kinases with Measurements per PKIa
binned number
of Kinases per PKI
[1; 25]
]25; 50]
]50; 75]
]75; 100]
]100; 200]
]200; 300]
]300; 400]
]400; 500]
>500
count
628
297
244
446
168
36
192
87
8
percent total
29.82
14.10
11.59
21.18
7.98
1.71
9.12
4.13
0.38
Note that two
different bin sizes
(25 and 100 Kinases) were used.
Note that two
different bin sizes
(25 and 100 Kinases) were used.Table presents
the 10 PKIs with the highest number of Kinases exhibiting at least
one significant measurement (485 Kinases and more), and the structures
of the top 25 PKIs are shown in Supporting Information Figure S7. Not unexpectedly, one can observe the presence of Staurosporine
and derivatives PKC412 and Go6976, which share a privileged kinase
scaffold, along with FDA-approved PKIs Sorafenib, Sunitinib, and Pazopanib.
Table 9
List of the 10 PKIs with the Highest
Number of Kinases Exhibiting at Least One Significant Measurement
PKI name
kinase count
Tyrphostin_AG1478
525
Akti_1_2
523
JNK_Inhibitor_IX
521
AS601245
518
PKC412
503
AR_AO_14418
503
Sorafenib
501
Go6976
501
Indirubin_6_bromo_oxime
493
Staurosporine
487
The same threshold for potency was used to assess
which of the
most frequent PKIs were often observed to be potent (Figure ). Unsurprisingly, Staurosporine
stands out as the most promiscuous PKI, along with close structural
analogs Lestaurtinib and K252a. But structurally dissimilar PKIs such
as AZD7762 (a checkpoint kinase inhibitor that potentiates chemotherapy-induced
apoptosis),[74,75] TAE_684 (a PKI identified as
an inhibitor of NPM-ALK and LRRK2)[76,77] and MK_8033
(a c-Met/Ron dual inhibitor)[78] exhibit
the same apparent lack of selectivity.
Figure 6
(a) Comparison across
all PKIs between the number of Kinases tested
(x axis) and the number of Kinases with measurements
below 100 nM (y axis). (b) Same comparison as in
part a with a focus on the most selective PKIs. Color and marker size
reflect the potency of the most potently inhibited Kinase for each
PKI.
(a) Comparison across
all PKIs between the number of Kinases tested
(x axis) and the number of Kinases with measurements
below 100 nM (y axis). (b) Same comparison as in
part a with a focus on the most selective PKIs. Color and marker size
reflect the potency of the most potently inhibited Kinase for each
PKI.Focusing then on the most selective
PKIs, PKIs with measurements
on more than 450 Kinases and less than 15 Kinases inhibited below
100 nM were analyzed. Such selective PKIs were the JNK2/3 inhibitor
JNK_inhibitor_IX,[79] the EGFR inhibitor
Tyrphostin_AG1478,[80] the JNK inhibitor
AS602145,[81] the GSK3 inhibitor AR_A0_14418,[82] the p38-MAPK signaling pathway inhibitors Skepinone_L[83] and SB_203580,[84] the
LRRK2 inhibitor HG_10_102_01,[85,86] the FLT3 inhibitor
Tandutinib,[87] and the dual PI3K/mTOR inhibitor
PI_103[88] (Figure ).The apparent structural diversity
of these selective PKIs rules
out the simplistic association of kinase selectivity with certain
scaffolds. Assuming that compounds sharing similar scaffolds bind
in similar ways in the orthosteric site, the importance of scaffold
decorations to fine-tune the interactions to these sites is highlighted
by comparing pairs of structurally related compounds. Such an interesting
pair comprises MK_8033 (tested on 214 Kinases out of which 176 were
inhibited below 100 nM, 82.2%) and Skepinone_L (tested on 479 Kinases
out of which only 2 were inhibited below 100 nM, 0.4%) (Figure ).The most selective
PKI was Akti_1_2, a dual AKT1-AKT2, whose level
of selectivity might be explained by its allosteric, pleckstrin homology
(PH), domain-dependent mode of binding.[89] This observation led us to further investigate the influence of
the PKIs binding modes on their kinase profiles. PKIs with an unspecified
mode of binding (assumed to be Type I), Type II and covalent PKIs
all compete with ATP to bind in the catalytic site. Type I PKIs bind
to activated kinases, type II bind to inactivated kinases, and covalent
PKIs bind irreversibly to both activated and inactivated kinases.
Allosteric PKIs do not compete with ATP and bind kinases in several
cavities (if binding in the catalytic site, they do not interact with
the hinge). By comparing the value of the largest inhibition per PKI
across all Kinases to the number of Kinases observed within 10-fold
of that value (limiting ourselves as previously to the 322 PKIs exhibiting
at least 200 Kinases with measurements), it was observed that the
largest inhibition was on average the highest for type II PKIs, followed
by covalent and type I PKIs (here, PKIs are assumed to be of type
I when not in the other categories) (Figure ). The allosteric PKIs were on average weaker
inhibitors. This is a similar observation to what was found in previous
work.[66] While we have not compared the
size of the allosteric inhibitors to the competitive inhibitors in
this work, previously we found them to have a lower absolute affinity
but to be equally ligand efficient due to their smaller molecular
weight.[66] This was also true for the kinase
inhibitor section of our data set. No significant difference was observed
for the number of Kinases observed within one log of the largest inhibition
per PKI across all Kinases.
Figure 7
Influence of the PKIs binding mode on their
potency. (a) The x axis shows for each PKI the number
of Kinases found within
10-fold of the most potently inhibited Kinase, in log scale, and the y axis shows the value at which this Kinase is inhibited.
Marker size reflects the number of Kinases having measurements, for
each PKI. (b) Boxplots show the upper quartile, median, average, and
lower quartile for the value of the largest inhibition per PKI across
all Kinases.
Influence of the PKIs binding mode on their
potency. (a) The x axis shows for each PKI the number
of Kinases found within
10-fold of the most potently inhibited Kinase, in log scale, and the y axis shows the value at which this Kinase is inhibited.
Marker size reflects the number of Kinases having measurements, for
each PKI. (b) Boxplots show the upper quartile, median, average, and
lower quartile for the value of the largest inhibition per PKI across
all Kinases.An alternative explanation
of the observed lower affinity could
be that allosteric PKIs were most often discovered by serendipity
and might therefore be less often the outcome of numerous rounds of
medicinal chemistry optimization, while on the other hand type II
and covalent PKIs are usually rationally designed or optimized under
the assumption that the specific structural features of their binding
modes will confer on them an advantage in terms of kinase selectivity.
The number of Kinases with a measurement was therefore compared to
the SI_100 nM index, across the different binding modes classes. As
noted previously, allosteric PKIs were observed to be significantly
less potent than the other types of PKIs, and only one allosteric
PKI did exhibit a non-null value of SI_100 nM: allosteric PKIs could
therefore not be compared to the other types for SI_100 nM. As recently
observed by Sutherland et al., Type II PKIs exhibited larger SI_100
nM values than the Type I and covalent PKIs, which were not significantly
different (Figure ).[90] It appears therefore that, in this
data set, Type II PKIs might achieve a higher potency but not necessarily
a better selectivity than Type I and covalent ones.
Figure 8
Influence of the PKI
binding mode on the kinase selectivity. (a)
The x axis shows for each PKI the number of Kinases
with at least one measurement, and the y axis shows
in log scale the value of the SI_100 nM selectivity index. (b) The
boxplots show the upper quartile, median, average, and lower quartile
for the value of the SI_100 nM selectivity index.
Influence of the PKI
binding mode on the kinase selectivity. (a)
The x axis shows for each PKI the number of Kinases
with at least one measurement, and the y axis shows
in log scale the value of the SI_100 nM selectivity index. (b) The
boxplots show the upper quartile, median, average, and lower quartile
for the value of the SI_100 nM selectivity index.
SureChEMBL Patent Analysis
Out of the 661 unique patents,
a majority (563; 85.2%) were assigned to commercial companies, followed
by 91 patents (13.8%) assigned to academic institutions, and only
7 (1.1%) to academic–industry partnerships. Most of the patents
(552; 83.5%) were assigned after 1995, with a significant increase
of the number of patents issued per year by industry in 2002–2003
and an apparent decline in the industry since 2006 (Figure ).
Figure 9
Number of patents priority
claims per year. The color indicates
the affiliation type.
Number of patents priority
claims per year. The color indicates
the affiliation type.It should be noted that these observations are valid within
the
boundaries of this data set which was not obtained via an exhaustive
search of all kinase-related patents for the same period but via a
nonexhaustive search of literature: trends on the plot do not reflect
the overall kinase inhibitor patent productivity, but the publication
bias of literature compounds. In addition, no attempt was made to
account for the multiple mergers and acquisitions within the industrial
sector, and the assignees names were kept as in the original patents.
The assignees activity domains corresponded almost always to pharmaceutical
research (630; 95.3%), with 5 patents coming from the agrochemical
industry (0.8%), 19 (2.9%) from other industries, and 7 (1.1%) that
could not be attributed. The 12 assignees having the highest numbers
of unique patents in the data set are presented at Table .
Table 10
List
of the 12 Assignees Having the
Highest Numbers of Unique Patents
assignee
number of
patents
Abbott
55
Smithkline Beecham
24
Novartis AG
16
AstraZeneca
15
Pfizer
14
Warner Lambert
11
Boehringer Ingelheim
10
Dana Farber Cancer Inst.
10
Bayer AG
8
Glaxo Group
8
Hoffmann La Roche
8
Merck & Co
8
Out of the 12 most prolific institutions, one is an academic center,
and the Abbott company alone accounts for 395 patented PKIs (34.9%)
and 55 patents. This is not unexpected since one single article from
Abbott[91] contributes to 1441 PKIs to the
complete data set (68.5%), some PKIs being not exclusive to this article.
The target families for the patent could be identified in 301 cases
(45.5%) and, not unexpectedly, it was a kinase in all but 39 cases
(5.9%) (Figure ).
Figure 10
Target
families as identified in the patent titles. The bars are
colored according to the priority claims year (one patent from 1900,
covering caffeine, was undisplayed for clarity).
Target
families as identified in the patent titles. The bars are
colored according to the priority claims year (one patent from 1900,
covering caffeine, was undisplayed for clarity).In 147 cases (22.2%), a disease or medical term was used
instead
of a target name to define the scope of the patent, while no disease
or target info was provided in 189 cases (28.6%). Finally, 24 patents
were not related to pharmaceutical research.
Proteochemometric Analysis
Results
The PCM data set
was composed of 117 148 data points (100 969 Ki, 979 IC50, 15 200 Kd) corresponding to 361 Kinases and 2860 PKIs
(11% matrix completeness). Approximately 27% were uniquely present
in data set 2, 1% were uniquely present in ChEMBL, and 72% were present
in both sets. The PCM models perform very well on this data set. The
ligand–target space is modeled with an RMSE of 0.20 log units
and an R02 of 0.94 (with the
externally validated performance reaching 0.41 ± 0.02 log units
and R02 0.74 ± 0.03 at
90%/10% cross-validated models) (see the learning curves in Figure a). Grouping predictions
per target drops performance slightly to 0.44 ± 0.15 log units
RMSE and 0.65 ± 0.14 R02 (Figure b), and
when grouped per compound, performance is 0.36 ± 0.12 log units
RMSE and 0.59 ± 0.22 R02 (Figure c). This
illustrates that while the full kinase ligand–target space
is well modeled, there are negative outliers, likely caused by a nonuniform
sampling of the interaction space. From a target perspective, some
Kinases are more densely populated with tested PKIs (Figure , Table ), whereas from a chemical perspective, some
PKIs have also been tested on more Kinases than others (Figure ). Interestingly the average
RMSE for individual compounds is slightly lower than that for individual
targets, whereas the correlation coefficient is on average slightly
higher for analysis grouped per target than for grouped per PKI. In
other words, PKIs tend to show a lower variance across targets than
the other way around, target variance across PKIs. We also observe
this effect in our data: when the bioactivity points are grouped by
target the mean pKi of compounds is 6.16 (±0.87) log units, whereas
when the values are grouped by compound the mean pKi of targets is
6.56 (±1.20) log units. As PKIs bind (on average) with a relatively
similar potency on different targets (displaying a lower variance),
the RMSE is relatively low. However, the narrow spread makes it challenging
for a model to rank order the PKIs by affinity (does a PKI bind better
to target A than B?), and this translates in a relatively lower correlation
coefficient. Conversely, this means that targets can display a greater
variance in compound affinity. As a consequence, targets are unlikely
to have (on average) a similar preference for all PKIs as indicated
by the larger RMSE. However, larger differences in affinity lead to
the modeling error having a lower influence on the compound ranking.
For example, two compounds A and B, A having a better affinity, with
a difference larger than the RMSE, are predicted correctly ranked
when the model is slightly mispredicting within the RMSE. If their
difference is smaller than the model RMSE, the chances are higher
for these compounds to be ranked wrongly. Here, the larger variance
translates in a relatively higher correlation coefficient.
Figure 11
Performance
of proteochemometric modeling on the set (R02 and RMSE). The curves show the median value
and median average deviation (MAD) at different sizes of the training
set based on the four different iterations (a) or based on the different
targets or compounds (b, c). (a) Full validation set (the dashed lines
extrapolate to full model performance using the values obtained in
random forest out-of-bag validation). (b) Validation set grouped per
target and subsequent median of these values (this method provides
a better performance estimate of the models for individual targets).
(c) Validation set grouped per PKI and a subsequent median of these
values (this method provides a better performance estimate of the
models for individual PKIs).
Performance
of proteochemometric modeling on the set (R02 and RMSE). The curves show the median value
and median average deviation (MAD) at different sizes of the training
set based on the four different iterations (a) or based on the different
targets or compounds (b, c). (a) Full validation set (the dashed lines
extrapolate to full model performance using the values obtained in
random forest out-of-bag validation). (b) Validation set grouped per
target and subsequent median of these values (this method provides
a better performance estimate of the models for individual targets).
(c) Validation set grouped per PKI and a subsequent median of these
values (this method provides a better performance estimate of the
models for individual PKIs).In order to further quantify performance, the average RMSE
was
also calculated (using all models that made up the learning curves)
for each target. R02 was ignored
here as the full activity range was smaller than 2.5 log units for
some targets, which is insufficient (assuming an average error of
approximately 0.5 log units).[92,93] For both the validation
sets grouped per Kinase and per PKI, the average RMSE was plotted
versus the frequency (Figure ). The plots show that our PCM models are capable of a robust
modeling of the kinase ligand–target space. However, it should
be noted that PKIs or Kinases with only a few data points could display
a variable performance.
Figure 12
(a) Average RMSE versus number of data points
in the set grouped
per PKI, for the Kinases with at least 600 PKIs tested with Ki results. (b) Average RMSE versus number of
data points in the set grouped per Kinase, for the PKIs with at least
200 Kinases tested with Ki results. The
color indicates the mean pKi across all Kinases/PKIs. The plots have
been created on all models (different iterations of the learning curves)
and give a good idea of the error distribution of our PCM models.
(a) Average RMSE versus number of data points
in the set grouped
per PKI, for the Kinases with at least 600 PKIs tested with Ki results. (b) Average RMSE versus number of
data points in the set grouped per Kinase, for the PKIs with at least
200 Kinases tested with Ki results. The
color indicates the mean pKi across all Kinases/PKIs. The plots have
been created on all models (different iterations of the learning curves)
and give a good idea of the error distribution of our PCM models.Only 104 Kinases had Ki results on
more than 600 PKIs, the threshold we chose for the analysis. On that
subset, the three best modeled Kinases were PRKCZ, PAK1, and CAMK2B
(RMSE 0.22–0.24 log units) whereas GSK3B, AURKB, and CLK4 were
modeled the worst (RMSE 0.69–0.78 log units) (Figure a). A representative validation
plot for CAMK2B is shown in Supporting Information Figure S8, and a representative plot for GSK3B is in Supporting Information Figure S9.Much
fewer PKIs (∼600–800) have been tested on the
best predicted Kinases than on the worst predicted ones (∼1000–1200).
At the same time, their mean pKi (∼5.2) is on average almost
one log unit smaller than that of the worst predicted ones (∼6.0)
(Figure ), and their
mean SI_k_100 nM (∼0.01) is also smaller than that of the worst
predicted ones (∼0.14), which might imply that the distribution
of pKi values is more important than the number of values (Supporting Information Figure S10). This is also
observed when plotting the pKi values against the Kinases (Supporting Information Figure S11).The
number of PKIs per Murcko scaffold[94] was
variable in the data set, ranging from singletons and pairs
(half of the compounds) to 22 compounds sharing the same Murcko scaffold
(Supporting Information Figure S12a). Judging
from the RMSE, scaffold frequency did not seem to have an influence
on the quality of the predictions (Supporting Information Figure S12b). In addition, the median similarity
of the compounds to the rest of the training set was calculated and
further investigation led to the conclusion that no trend could be
observed with regards to the RMSE or R2 (data not shown).The worst modeled Kinases appear to have
more PKIs, which are on
average more potent, and which cover a wider range of potencies, than
the best modeled ones. That might seem counterintuitive if one reasons
by analogy to the pharmacophore perception problem. This problem states
that when a training set is built, it should contain several diverse
chemical series, with large ranges of potencies within each series,
in order to computationally identify the structural features responsible
for the biological activity. One explanation for our observed inconsistency
to this wider range might be as follows: we judge the “ability
to predict” (i.e., small RMSE value) by excluding data from
the set, building models without the excluded data and then using
the models to predict the values for these data before comparing to
the measured values:If the
complete range of measured values is narrow,
then whatever the composition of the excluded set, there are good
chances that its range will be covered by that of the model building
set: we stay within the “applicability domain” in terms
of pKi values.On the other hand, if
the range is wide, then the ranges
of excluded and model building data might not overlap very well anymore,
which might lead to poorer performances.Only 50 PKIs had Ki results on more
than 200 Kinases, the threshold we chose for the analysis. On that
subset, the PKIs whose activities were predicted the best were VX745,
CP_724714, and Roscovitine (RMSE 0.31–0.32 log units), and
the worst were Staurosporine, Bosutinib, and CEP_32496 (RMSE 0.91–0.96
log units) (Figure b), with the structures shown in Figure . Representative plots for two are shown
in the Supporting Information Figures S13
(VX745) and S14 (Bosutinib).
Figure 13
Examples of PKIs whose activities were well
predicted by our model
(green, underlined) and predicted with lesser accuracy (red, in italic).
Examples of PKIs whose activities were well
predicted by our model
(green, underlined) and predicted with lesser accuracy (red, in italic).The number of Kinases against
which the PKIs were tested does not
seem to influence the RMSE (Figure b). But as already observed with the validation set
grouped per Kinase, the mean pKi of the best predicted PKIs (∼5.2)
is on average almost one log unit smaller than that of the worst predicted
ones (∼6.3) (Figure b), and their mean SI_100 nM (∼0.01) is also smaller
than that of the worst predicted ones (∼0.26), underlining
again the importance of the distribution of the pKi values (Supporting Information Figure S10b). This is
also observed when plotting the pKi values against the Kinases (Supporting Information Figure S15).It
is never trivial to prospectively validate a model; however,
something can be said on the relevance of future predictions. Novel
PKIs might belong to the chemical space of the data set (e.g., structures
tested on a different Kinase, or new structures similar to the data
set ones) and then quantification of reliability of predictions is
relatively straightforward. Should a structure not be completely covered,
reliability is expected to be lower; however, as we have shown, it
is difficult to quantify the reliability for individual compounds.
Hence the best course of action would be in our view to consider the
average model error.Finally, when plotting the RMSE as a function
of the number of
Kinases assayed for a PKI, no difference was observed between PKIs
with type I or type II binding modes (covalent PKIs were represented
by only two compounds and no conclusion can therefore be made on this
small population) (Figure ). It is to be noted that the three best and three worst predicted
compounds mentioned above were all type I orthosteric binders.
Figure 14
Influence
of the PKI binding mode type on the RMSE for the PKIs.
The x axis shows the number of Kinases and the y axis shows the RMSE value.
Influence
of the PKI binding mode type on the RMSE for the PKIs.
The x axis shows the number of Kinases and the y axis shows the RMSE value.
Model Interpretation
The model quality allows the accurate
prediction of PKI–Kinase activities, opening the door for rational
design of selectivity for individual PKIs. Analyzing the descriptors
(see Supporting Information Table S8 for
the complete list), several key properties were identified to correlate
with a higher Ki value across the board.
For the PKIs these were (in order of decreasing importance): compound
size (and related properties like number of bonds and number of rings),
lower predicted solubility (i.e., more lipophilic compounds tend to
have a higher affinity),[95] larger number
of hydrogen bond donors, and a larger polar surface area. Metz et
al. had already highlighted the influence of the number of hydrogen
bonds donors and acceptors but had excluded a correlation with size.[91] In addition, a number of molecular substructures
were found to correlate with a higher affinity. Out of the 15 substructures
with the highest correlation to affinity (averaged over all 20 models),
11 clearly correspond to the Staurosporine class of compounds (they
retrieve 10–19 of the 19 Staurosporine analogs), 3 retrieved
bicyclic aromatic systems such as azaindole (as observed in Vemurafenib)
and derivatives, and the last one is the common motif diaminephenylpyrimidine,
observed for example in Tamatinib (Figure , with more examples presented in Supporting Information Figure S16).
Figure 15
(a) List
of the 15 substructures with the highest correlation to
affinity (averaged over all 20 models. (b) examples of PKIs retrieved
by these substructures. The substructures (a) and hits (b) are colored
by chemical class: green diamine phenylpyrimidine (DAPP) and related;
blue Staurosporine and related; orange azaindole and related.
(a) List
of the 15 substructures with the highest correlation to
affinity (averaged over all 20 models. (b) examples of PKIs retrieved
by these substructures. The substructures (a) and hits (b) are colored
by chemical class: green diaminephenylpyrimidine (DAPP) and related;
blue Staurosporine and related; orange azaindole and related.Analysis of the molecular substructures
correlated to the prediction
accuracy identified sets of similar substructures, as seen in Figure where the previously
identified top 15 substructures are again highlighted.
Figure 16
Impact of
molecular substructures on the correlation with affinity
(x axis) and on the importance for accuracy (y axis),
both averaged over the 20 models. The 15 substructures previously
described are displayed and colored by chemical class (green diamine
phenylpyrimidine (DAPP) and related; blue Staurosporine and related;
brown azaindole and related) while the other substructures are only
represented by gray markers.
Impact of
molecular substructures on the correlation with affinity
(x axis) and on the importance for accuracy (y axis),
both averaged over the 20 models. The 15 substructures previously
described are displayed and colored by chemical class (green diaminephenylpyrimidine (DAPP) and related; blue Staurosporine and related;
brown azaindole and related) while the other substructures are only
represented by gray markers.Conversely we could identify protein features correlated
to higher
affinity: Kinases with more electrophilic, large and hydrophobic (small
effect) residues tend to display higher affinities. More specifically,
key properties at specific positions were found to correlate with
relatively higher affinities: alignment positions 13 (larger and more
electrophilic amino acids), 23 (large amino acids), 31 (hydrophobic
and small amino acids), and 35 (more hydrophobic). As an example, Table shows the amino
acids observed at these four positions in three Kinases of increasing
mean pKi: PAK1 (mean pKi = 5.12, third smallest pKi value observed;
RMSE = 0.23, second smallest RMSE value observed), TYRO3 (mean pKi
= 5.69; RMSE = 0.39), and FLT1 (mean pKi = 6.47, largest pKi value
observed; RMSE = 0.67, fourth largest RMSE value observed). While
on this example the properties of the amino acids at position 13 do
not allow a clear differentiation, the trend is overall respected
at the three other positions.
Table 11
Values of the Key
Properties Correlated
with Higher Affinities at the Four Identified Positionsa
PAK1 (mean
pKi = 5.12)
TYRO3 (mean pKi = 5.69)
FLT1 (mean pKi = 6.47)
position
desirable
property (unit)
amino acid
value
amino acid
value
amino acid
value
13
large size (Da)
Asn329
132
Lys572
146
Asn893
132
13
electrophilicity (KD score)
Asn329
–3.5
Lys572
–3.9
Asn893
–3.5
23
large size (Da)
Gly349
75
His598
155
Tyr914
181
31
small size (Da)
Lys391
146
Ala647
89
Ala1024
89
31
hydrophobicity (KD score)
Lys391
–3.9
Ala647
1.8
Ala1024
1.8
35
hydrophobicity (KD score)
Thr406
–0.7
Ala662
1.8
Cys1039
2.5
The KD score
corresponds to
the Kyte and Doolitle hydrophobicity[96].
The KD score
corresponds to
the Kyte and Doolitle hydrophobicity[96].Position 13 belongs to the
ß4 strand of the N lobe and lines
the back pocket of the orthosteric site, while position 23 in the
loop between the N lobe ß4 strand and the C lobe D helix lies
in the entrance pocket, separated from the hinge amino acids by one
amino acid (Figure ). The side chains of the amino acids at positions 13 and 23 point
away from the ATP cavity and cannot make any interaction with it.
Position 31 belongs to the catalytic loop in the C lobe and lines
the phosphate pocket. Lysine amino acids are conservatively observed
at this position in the kinome since the Lysine side chain points
toward the cavity and is engaged in stabilizing interactions with
the ATP phosphate groups. Position 35 belongs to the activation loop
in the C lobe and precedes the Asp-Phe-Gly (DFG) motif at the N terminus
of the activation loop, which conformational switch distinguishes
between type I and type II ligands. Its side chain points toward the
cavity, potentially interacting with ligands (via direct hydrogen
bonds to type II ligands for example). The reason why the PCM model
identified these four positions is therefore not immediately related
to a potential modulation (depending on the nature of the amino acid)
of direct interactions with ligands.
Figure 17
Mapping of the four positions where key
properties correlate with
relatively higher affinities (carbon atoms in pink) on the PAK1–ATP
(carbon atoms in light green) crystal structure (PDB Code: 3Q53, resolution: 2.09
A). The hinge amino acids are highlighted in orange (carbon atoms
in orange, only the backbone is displayed). Key amino acids Met344 (gatekeeper) and Arg299 are also displayed (carbon
atoms in cyan). The hydrogen bonds between ATP and the amino acids
are represented as dotted lines.
Mapping of the four positions where key
properties correlate with
relatively higher affinities (carbon atoms in pink) on the PAK1–ATP
(carbon atoms in light green) crystal structure (PDB Code: 3Q53, resolution: 2.09
A). The hinge amino acids are highlighted in orange (carbon atoms
in orange, only the backbone is displayed). Key amino acids Met344 (gatekeeper) and Arg299 are also displayed (carbon
atoms in cyan). The hydrogen bonds between ATP and the amino acids
are represented as dotted lines.The model might have captured
more complex parameters such as the
impact of the presence of key amino acids on the conformational plasticity
of the kinases. A highly conserved spatial motif of four hydrophobic
amino acids spanning all structural domains have been found to dynamically
assemble upon kinase activation, a process that depends on the Activation
Loop conformation and phosphorylation status. These amino acids form
a “Regulatory Spine” (R-Spine) that stabilizes the kinase
active conformation, coordinates the motions of the N- and C-lobes
and is disordered in the inactive conformation.[97] A second spine was later identified, formed via the assembly
of eight hydrophobic amino acids and requiring the additional presence
of the adenine ring of the ATP (or an equivalent ring system from
a synthetic ligand). This “Catalytic Spine” (C-Spine)
traverses and connects both lobes.[98] The
protein kinase functionality requires conformational plasticity, provided
by the spines assembly and dismantlement, itself triggered by activation
loop phosphorylation state and ATP presence. The four positions identified
by the PCM model were all found in close proximity to the R- and C-Spine
amino acids (Table , Figure ), a region
of strategically great importance.[99] For
the construction of the PCM model we limited ourselves to a sequence
alignment based mainly on the ATP binding pocket. Hence, what we observe
interpreting the model for residues at positions 13, 23, 31, 35 might
be a proxy for changes actually taking place in the spine region.
The currently used alignment includes half of the Spines positions:
two of the four R-Spine positions and four of the eight C-Spine positions.
It might very well be that extending this alignment to include the
full spine regions will improve the PCM model and allow it to confirm
the role these amino acids play.
Table 12
List, Amino Acid
Identities and Locations
of the Four Positions Identified by the PCM Models and of the C- and
R-Spine Amino Acids in Their Vicinity
amino acid
location
position 13
Asn329
N lobe, ß4 strand
R-Spine
Tyr330
N lobe, ß4 strand
position 23
Gly349
N lobe ß4 strand to
C lobe D-helix loop
C-Spine
Leu352
C lobe, D-helix
R-Spine
His387
C lobe, catalytic loop
position 31
Lys391
C lobe, catalytic loop
C-Spine
Ile395
C lobe, ß7 strand
C-Spine
Leu396
C lobe, ß7 strand
C-Spine
Leu397
C lobe, ß7 strand
position
35
Thr406
C lobe,
activation loop
R-Spine
Phe408
C lobe, activation loop
Figure 18
Four positions where key properties correlate
with relatively higher
affinities (carbon atoms in pink) are surrounded by the regulatory
(carbon atoms in light brown) and catalytic (carbon atoms in light
blue for the kinase, in light green for the ATP) spines in this PAK1-ATP
crystal structure (PDB code 3Q53, resolution 2.09 A). The hinge amino acids are highlighted
in orange (carbon atoms in orange, only the backbone is displayed).
Key amino acids Met344 (gatekeeper) and Arg299 are also displayed (carbon atoms in cyan). The hydrogen bonds between
ATP and the amino acids are represented as dotted lines.
Four positions where key properties correlate
with relatively higher
affinities (carbon atoms in pink) are surrounded by the regulatory
(carbon atoms in light brown) and catalytic (carbon atoms in light
blue for the kinase, in light green for the ATP) spines in this PAK1-ATP
crystal structure (PDB code 3Q53, resolution 2.09 A). The hinge amino acids are highlighted
in orange (carbon atoms in orange, only the backbone is displayed).
Key amino acids Met344 (gatekeeper) and Arg299 are also displayed (carbon atoms in cyan). The hydrogen bonds between
ATP and the amino acids are represented as dotted lines.It is to be noted that a previous effort by Martin
and Mukherjee,
relying on a 46 amino acids definition of the binding site as composed
of the ATP and back pockets, and on a sequence identity as a similarity
measure, identified a set of 16 amino acids most influential for interkinase
SAR similarity. Our method relied on a 38 amino acids definition of
the binding site and we translated our amino acids to physicochemical
properties, that could arguably be considered to be more relevant
for drug binding: we have previously shown that using different descriptors
leads to different descriptions of the binding pocket.[57] There, while performance of the models trained
was similar,[100] interpretation might be
diverging between different descriptor types. Indeed, out of the 4
amino acids highlighted in our approach, only 1 is also present in
the 16 amino acids from Martin and Mukherjee, while 2 are adjacent
to 1 of Martin and Mukherjee and the fourth is located 5 amino acids
away.
Summary and Conclusions
We have presented here our
efforts in compiling, standardizing
and analyzing a very large set of kinase bioactivity profiles based
on published literature, followed by extensive proteochemometric bioactivity
modeling. The full data set is available as a Supporting Information zip file and can be downloaded from
the ChEMBL FTP server.[101] The bioactivity
data will be fully integrated and available in a future release of
the ChEMBL database. This will complement the existing kinase screening
depositions and will hopefully facilitate more comparative chemogenomics
studies in the community.Different aspects of the data set
composition were analyzed, ranging
from the data sources (journals, articles) to the data types and providers,
the patents covering the PKIs and the corresponding patentees’
affiliations, the kinome coverage, the selectivity or promiscuity
of both Kinases and PKIs, and the impact of the PKI type on both their
potency and selectivity.Though incomplete by nature due to
the heterogeneity of the sources,
the data set allowed for a robust proteochemometric prediction of
kinase activity across the fraction of the kinome addressed in the
compiled literature: our models facilitate the exploration of the
kinase ligand–target space and the rationalization of the global
activity. The provided data files in the Supporting Information enable the rational design of compounds with a
targeted selectivity profile, hopefully leading to more specific leads
or probes with higher efficacy. The key is the ability to capture
the differences in ligand–target interaction space displayed
by the kinases and their inhibitors. Here it has been demonstrated
that the current set allows for the creation of PCM models capable
to achieve just that. However, the currently presented PCM model can
be further improved by descriptor tuning or inclusion of other parameters
such as protein domain[102] or concentrations
used in the experiments. Although it was judged outside the scope
of the current work, our future plans include a dedicated study on
a high quality kinomewide PCM as proposed here.In conclusion,
in the opinion of the authors, the ever increasing
computational power, coupled to the development of better algorithms
(such as deep learning)[103] will allow the possibility of rational kinome inhibitor design to
be a reality in the very near future.
Authors: E ter Haar; W P Walters; S Pazhanisamy; P Taslimi; A C Pierce; G W Bemis; F G Salituro; S L Harbeson Journal: Mini Rev Med Chem Date: 2004-03 Impact factor: 3.862
Authors: Sonya D Zabludoff; Chun Deng; Michael R Grondine; Adam M Sheehy; Susan Ashwell; Benjamin L Caleb; Stephen Green; Heather R Haye; Candice L Horn; James W Janetka; Dongfang Liu; Elizabeth Mouchet; Shannon Ready; Judith L Rosenthal; Christophe Queva; Gary K Schwartz; Karen J Taylor; Archie N Tse; Graeme E Walker; Anne M White Journal: Mol Cancer Ther Date: 2008-09 Impact factor: 6.261
Authors: Richard M Angell; Francis L Atkinson; Murray J Brown; Tsu Tshen Chuang; John A Christopher; Maria Cichy-Knight; Allison K Dunn; Kendra E Hightower; Susanna Malkakorpi; James R Musgrave; Margarete Neu; Paul Rowland; Robyn L Shea; Jeffery L Smith; Donald O Somers; Sonia A Thomas; Gladstone Thompson; Ruolan Wang Journal: Bioorg Med Chem Lett Date: 2006-12-15 Impact factor: 2.823
Authors: Craig W Lindsley; Zhijian Zhao; William H Leister; Ronald G Robinson; Stanley F Barnett; Deborah Defeo-Jones; Raymond E Jones; George D Hartman; Joel R Huff; Hans E Huber; Mark E Duggan Journal: Bioorg Med Chem Lett Date: 2005-02-01 Impact factor: 2.823
Authors: Laura R Pearce; Gordon R Alton; Daniel T Richter; John C Kath; Laura Lingardo; Justin Chapman; Catherine Hwang; Dario R Alessi Journal: Biochem J Date: 2010-10-15 Impact factor: 3.857
Authors: Gerard Jp van Westen; Remco F Swier; Jörg K Wegner; Adriaan P Ijzerman; Herman Wt van Vlijmen; Andreas Bender Journal: J Cheminform Date: 2013-09-23 Impact factor: 5.514
Authors: Anna H C Vlot; Wilhelmus E A de Witte; Meindert Danhof; Piet H van der Graaf; Gerard J P van Westen; Elizabeth C M de Lange Journal: AAPS J Date: 2017-12-04 Impact factor: 4.009
Authors: Ryan H B Smith; Zaigham M Khan; Peter Man-Un Ung; Alex P Scopton; Lisa Silber; Seshat M Mack; Alexander M Real; Avner Schlessinger; Arvin C Dar Journal: Biochemistry Date: 2021-01-13 Impact factor: 3.162
Authors: Eelke B Lenselink; Niels Ten Dijke; Brandon Bongers; George Papadatos; Herman W T van Vlijmen; Wojtek Kowalczyk; Adriaan P IJzerman; Gerard J P van Westen Journal: J Cheminform Date: 2017-08-14 Impact factor: 5.514
Authors: David H Drewry; Carrow I Wells; David M Andrews; Richard Angell; Hassan Al-Ali; Alison D Axtman; Stephen J Capuzzi; Jonathan M Elkins; Peter Ettmayer; Mathias Frederiksen; Opher Gileadi; Nathanael Gray; Alice Hooper; Stefan Knapp; Stefan Laufer; Ulrich Luecking; Michael Michaelides; Susanne Müller; Eugene Muratov; R Aldrin Denny; Kumar S Saikatendu; Daniel K Treiber; William J Zuercher; Timothy M Willson Journal: PLoS One Date: 2017-08-02 Impact factor: 3.240
Authors: Pavel V Pogodin; Alexey A Lagunin; Anastasia V Rudik; Dmitry A Filimonov; Dmitry S Druzhilovskiy; Mark C Nicklaus; Vladimir V Poroikov Journal: Front Chem Date: 2018-04-26 Impact factor: 5.221
Authors: Michael L Maitland; Sarina Piha-Paul; Gerald Falchook; Razelle Kurzrock; Ly Nguyen; Linda Janisch; Sanja Karovic; Mark McKee; Elizabeth Hoening; Shekman Wong; Wijith Munasinghe; Joann Palma; Cherrie Donawho; Guinan K Lian; Peter Ansell; Mark J Ratain; David Hong Journal: Br J Cancer Date: 2018-03-19 Impact factor: 7.640