Literature DB >> 27482722

Unprecedently Large-Scale Kinase Inhibitor Set Enabling the Accurate Prediction of Compound-Kinase Activities: A Way toward Selective Promiscuity by Design?

Serge Christmann-Franck¹, Gerard J P van Westen², George Papadatos², Fanny Beltran Escudie¹, Alexander Roberts¹, John P Overington², Daniel Domine¹.

Abstract

Drug discovery programs frequently target members of the human kinome and try to identify small molecule protein kinase inhibitors, primarily for cancer treatment, additional indications being increasingly investigated. One of the challenges is controlling the inhibitors degree of selectivity, assessed by in vitro profiling against panels of protein kinases. We manually extracted, compiled, and standardized such profiles published in the literature: we collected 356 908 data points corresponding to 482 protein kinases, 2106 inhibitors, and 661 patents. We then analyzed this data set in terms of kinome coverage, results reproducibility, popularity, and degree of selectivity of both kinases and inhibitors. We used the data set to create robust proteochemometric models capable of predicting kinase activity (the ligand-target space was modeled with an externally validated RMSE of 0.41 ± 0.02 log units and R02 0.74 ± 0.03), in order to account for missing or unreliable measurements. The influence on the prediction quality of parameters such as number of measurements, Murcko scaffold frequency or inhibitor type was assessed. Interpretation of the models enabled to highlight inhibitors and kinases properties correlated with higher affinities, and an analysis in the context of kinases crystal structures was performed. Overall, the models quality allows the accurate prediction of kinase-inhibitor activities and their structural interpretation, thus paving the way for the rational design of compounds with a targeted selectivity profile.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2016 PMID： 27482722 PMCID： PMC5039764 DOI： 10.1021/acs.jcim.6b00122

Source DB: PubMed Journal: J Chem Inf Model ISSN： 1549-9596 Impact factor: 4.956

Introduction

Protein kinases typically function in highly connected, dynamic, and regulated networks and are central actors in the majority of signal transduction cascades. The human kinome is composed of more than 500 kinases[1] and deregulated kinase signaling has frequently been observed to be oncogenic.[2] Human kinases are therefore attractive targets for drug discovery and thus have received considerable attention from the pharmaceutical industry, which has invested in the identification of small molecule protein kinase inhibitors (PKIs) targeting the protein kinase catalytic domain.[3,4] These efforts have so far led to the approval of 36 PKIs for clinical use (28 by the US Food and Drugs Administration[5]). In addition, at least 600 PKIs have entered formal clinical trials.[6] The vast majority of the approved or under investigation PKIs aim at treating various neoplasms, but PKIs are now also being designed to treat other indications such as diabetes, neurological, inflammatory, and autoimmune diseases like rheumatoid arthritis.[7−11] A number of PKIs behave as allosteric regulators[12−14] but the majority (about 95%[6]) act as competitive inhibitors,[15] usually blocking ATP cofactor binding, with several different distinct binding modes as demonstrated in X-ray crystallography studies.[16,17] There is also a relatively large set of irreversible PKIs.[18,19] The highly conserved nature of the ATP binding site renders the development of highly selective PKIs challenging,[20] while the selectivity profile of a PKI governs its total effect on an organism. Indeed, the clinical efficacy of some PKIs against kinase targets against which they were not originally developed led to their approval for other indications. For example, while originally approved to treat chronic myeloid leukemia,[21] acting via inhibition of cAbl, Imatinib (Gleevec) was later shown to inhibit Kit and PDGFR. Consequently, it was approved for the treatment of gastrointestinal stromal tumors and the hypereosinophilic syndrome in which these particular kinases are dysregulated.[22,23] The dual beneficial and adverse off-target pharmacology of PKIs is complex[24−27] and relies on both target exposure and activity spectra. In vitro profiling is one of the standard tools at hand to reduce attrition rates observed during drug discovery and development.[28] As a general rule, promiscuous compounds are difficult to optimize and develop. Given both the known promiscuity of PKIs, and the large number of protein kinase genes known and assayable, compounds developed as PKIs are routinely profiled against substantial parts of the (human) kinome.[29−32] The objective here is to screen out the PKIs with undesired kinase profiles as early as possible. Furthermore, it is now common practice to screen compound libraries against the kinome in order to identify either new pharmacological probes for poorly characterized targets,[33] or to identify hits for newly validated kinases. Since the initial work of Davies et al. in 2000,[34] a growing number of publications have reported the profiling against large kinase panels of either libraries of compounds (some selected as potential PKIs),[35,36] or smaller and more focused sets of PKIs (as part of their characterization).[37,38] Databases such as ChEMBL[39,40] make publicly available a very large amount of structure–activity relationships (SARs) manually extracted and curated from the scientific literature. Due to the literature focus of these efforts, full kinase profiles of compounds were not routinely added, with notable exceptions, such as for the Millipore Kinase Screening data set[41] (CHEMBL2218924) and the GSK Kinase Inhibitor data set[42] (CHEMBL2303647). An effort was therefore undertaken to manually capture in the primary literature (including, but not limited to medicinal chemistry journals) and standardize such kinase profile-related SAR. Here, we report the collation of an unprecedentedly large number of kinase inhibition data points, and the release to the public domain, through a deposition in the ChEMBL database. Due to the heterogeneous nature of the sources, the collected SAR data set is not a complete one:[43] while for some PKIs most of the kinases were tested, others were tested only on a few kinases. In addition, the sizes of the kinase panels vary across the publications and assay technology and readout can substantially vary. To overcome this and demonstrate the potential of a kinase data set like the current in the public domain, we leveraged on the collected SAR and the content of the ChEMBL database to populate the kinase/PKI pairs that were not initially measured, using a state-of-the-art and robust activity modeling methodology. After integration with ChEMBL data (release 19, see Materials and Methods for details), proteochemometric bioactivity modeling (PCM) was applied at large-scale to this set. The results demonstrate that the application of this method to the compiled data set enables a robust prediction of kinase activity across the subset of the kinome covered by the data set; combined with a systematic exploration of the compounds and kinases features that drive potency, this paves the way for bespoke design of compounds with selective promiscuity.

Materials and Methods

Data Sources

All potential sources of PKI kinome profiling bioactivities were identified through literature searches. A first selection restricted the sources to articles or data sets containing bioactivities from assays where (i) the target was tractable; (ii) the technology generated results of the following activity types: Pct_Ctrl, Pct_Inhib, Kd, Ki, IC50 (excluding for example bioactivities generated from thermal stability shift assays[44]); (iii) the ligand structures were disclosed; and (iv) quantitative bioactivity measurements were available and not expressed exclusively as ranges. In addition to measurements published in peer-reviewed articles, data points made publicly available by the NIH Libraries of Network-based Cellular Signatures (LINCS) program[45,46] and the International Centre for Kinase Profiling within the MRC Protein Phosphorylation Unit at the University of Dundee[47] were included. For each data source, the bibliographic reference, PubMed identifier, digital object identifier (DOI), date of publication, title, authors list and abstract of the publication, publishing journal name, reference. The list of data sources is provided in the Supporting Information (SI) Table S1.

PKIs

PKI structures were directly extracted, transformed from SMILES representations, or manually drawn using Biovia Draw[48] and then standardized using Pipeline Pilot version 8.5:[49] salts were removed, charges and stereochemistry were standardized, and a canonical tautomer was generated. The accuracy of stereochemistry and tautomeric state were manually checked and adjusted if necessary, using PubMed for bibliographic searches (http://www.ncbi.nlm.nih.gov/pubmed) and the Protein Data Bank (PDB) for kinase–PKI complex crystal structures (http://www.rcsb.org/pdb). A unique name was attributed to each structure and synonyms were stored. The list of PKI names, structures, and compound-related data is provided in Supporting Information Table S2.

Kinases

For each name used to identify the kinase in the original publications (SwissProt entry number, kinase name, gene name), the corresponding Entrez preferred gene symbol from the Human Genome Nomenclature Committee (HGNC) (www.genenames.org) was identified. To further qualify the kinases, several properties were added, describing whether the assay was run against a single kinase, the targeted binding site (for the kinases with multiple kinase domains), the length of the domain, the source of the assayed kinase, the presence and nature of mutations, the exact sequence, the phosphorylation status, and the organism (see Supporting Information 1 for details); however we anticipate that this data is tentative in some cases due to sparse reporting in the literature. The list of kinases is provided in Supporting Information Table S3.

Assays

To fully describe the assays and when applicable, the name of the kinase profiling service commercial provider, the assay technology, the measured effect and the ATP and compound concentrations were captured (see Supporting Information 2 for details).

Bioactivities

For all bioactivities, the original value, operator, unit and display name (of possible values: Kd, Ki, IC50, Pct_Ctrl, Pct_Inhib) were captured. Pct_Ctrl is one assay results preprocessing method that attempts to correct for variability by normalizing compound measurements relative to controls: Pct_Ctrl =100 × Mean_on_compound/Mean_on_controls. Similarly, Pct_Inhib is defined as 1 – (100 × Mean_on_sample/Mean_on_controls). Those values were then submitted to a set of standardization rules described in details in Supporting Information 3. Briefly, Pct_Ctrl and Pct_Inhib results were transformed in Pct_Effect values; all bioactivities were then standardized (standardized result value, SRV and their negative logarithm pSRV); an additional property was created to monitor the applied transformations (standardized result type, SRT). The minimal set of properties required to capture a bioactivity value was the compound and target names, the organism, the result value, result operator, and result type. The list of possible values for all properties related to bioactivities is provided in Supporting Information Table S4.

Generation of Data Sets 1 and 2

The ensemble of bioactivities initially captured was filtered to build data set 1: only measurements from human kinases were kept, and measurements were rejected where the exact kinase could not be identified (Target Type = “Gene Family”) or where the standardized result type was “inconclusive”. In the analysis, two naming conventions refer to the kinases that were assayed: “HGNC”, corresponding to the Target Name “Kinase”, corresponding to the Target Name, Binding Site, and Mutation and describing the kinase with more precision The notion of coverage was introduced as follows: for a Kinase–PKI or HGNC–PKI pair, the presence of at least one measurement in data set 1 contributes one unit. By summing over all pairs and dividing by the number of pairs (per data source or across data set 1), the coverage of the data source or of the data set is calculated. Multiple measurements per Kinase–PKI pair were aggregated to keep only one copy: the original measurements and their sources were discarded, and only the number N, mean, and standard deviation (SD) of pSRV were considered further. Data set 2 was then generated by filtering out the aggregated measurements with a pSRV SD greater than one standard deviation unit.

Kinase- and PKI-Centric Indices

For each Kinase, the number of PKIs having a pSRV_Mean, a pSRV_Mean below 6 (1 μM), and a pSRV_Mean below 7 (100 nM), the kinase selectivity index at 1 μM (SI_k_1uM) and the kinase selectivity index at 100 nM (SI_k_100 nM)[35,50,51] were calculated. For each PKI, the number of Kinases having a pSRV_Mean, a pSRV_Mean below 6 (1 μM), and a pSRV_Mean below 7 (100 nM), the selectivity index at 1 μM (SI_1uM) and the selectivity index at 100 nM (SI_100 nM) were calculated. All calculations and data processing were performed with Pipeline Pilot version 8.5.[49]

SureChEMBL Searches

SureChEMBL (https://www.surechembl.org/) is a large, freely available, live repository of compound structures automatically extracted from full-text patent documents.[52] The SureChEMBL system consists of the data pipeline and the user interface. The former automatically detects named chemical entities in text, and chemical structures in images. It annotates the incoming patent documents with chemical structures, which are then stored in a database. The latter allows users to search patent documents using text or chemical queries and retrieve the compounds extracted from specified document search hits. The SureChEMBL repository was used to search for the earliest patent documents claiming the PKIs in the data set. The search was implemented as follows: Each of the PKIs in the data set was submitted to the UniChem[53] web service to run a search against the SureChEMBL compound pool. In addition to exact matching, UniChem allows connectivity matching, i.e. searching for compounds with the same connectivity, by ignoring stereochemistry, isotope or salt information.[54] For each of the 1132 search hits (53.8% of the data set), all patent documents containing the hit (either in text or image) from the three main patent authorities (WIPO, USPTO, and EPO) were retrieved. The PKIs and their corresponding earliest patent documents were annotated with bibliographic data, such as priority date, title, inventor, and assignee. The target, target family, and target type properties were manually assigned based on patent titles with additional PubMed look ups when required. The affiliation type and activity domain properties were manually assigned based on the SureChEMBL-provided list of inventors and affiliations. The list of PKIs, patents, and related properties is provided in Supporting Information Table S2.

Proteochemometric Models

Proteochemometric (PCM) modeling creates predictive models using machine-learning.[55,56] The methodology is related to conventional quantitative structure–activity relationship (QSAR) modeling but includes an explicit target protein-based descriptor alongside conventional small molecule descriptors.[57] This allows the pooling of bioactivity data of several related targets into a single PCM model. Previous application areas of PCM include GPCR families and their ligands, viral mutants and enzyme inhibitors, transporter proteins and compounds blocking these proteins, and kinases and kinase inhibitors.[58−64] While this is not the first attempt to apply PCM to a PKI data set, it is, to the authors’ knowledge, the first time it has been applied on the large scale demonstrated here and to a kinase data set of such ligand diversity. Schürer and Muskal used a diverse public set, yet constructed hundreds of separate QSAR models, an approach that does not allow interpretation from the kinase perspective (which residues contribute to binding of which compound).[65] Finally, our additional aim is to explore the feasibility of PCM analysis toward the rational design of kinase inhibitors with biased, defined promiscuity.

Proteochemometric Data Set

To generate PCM models with the broadest applicability domain, Ki data points from the collected SAR set were merged with those present in ChEMBL (release 19). For modeling, Ki values were -log10 transformed, leading to pKi values. Multiple measurements were averaged; if duplicate pairs were present in this set and ChEMBL, the duplicates were discarded favoring the measurements presented here over those already in ChEMBL. Furthermore, allosteric data points were removed based on previously published work (using data points that were retrieved from text mining paper abstracts)[66] and peptide-like PKIs were discarded. Peptides were detected according to the protocol used by ChEMBL (among others filters based on presence of a peptide bond and known peptides).

Proteochemometric Descriptors

The binding pocket was described using the kinase sequences alignment provided by Huang et al.[67] which was converted to physicochemical properties using the first three Z-scales published by Sandberg et al.[68] as done previously.[57] This led to 117 protein descriptors. Three mean values for the total sequence for the 3 Z-scales (Z1_mean, Z2_mean, Z3_mean), and 3 scales per residue (Z1_Res_1, Z2_res_1, etc.) times 38 residues. PKIs were described using a combination of physicochemical properties (37 descriptors, see Supporting Information Table S5 for a full list) and FCFP_6 circular fingerprints.[69] The fingerprint features were selected based on their frequency (optimal frequency was present in 50% of the PKIs) and a total of 768 features were encoded as a fixed-length bit string. Finally, 16 properties coded for presence of cofactors or secondary kinase domains, and 4 encoded properties about the activity type/interaction (IC50/Ki/Kd/predicted allosteric). In total, the feature vector consisted of 942 descriptors per PKI/kinase pair.

Proteochemometric Machine Learning

Script files for R statistics (version 2.15.0) were prepared in Pipeline Pilot version 8.5 using the “randomForest” package.[49,70,71] Models were trained using 10 000 trees on the EMBL-EBI cluster (45 cores) in parallel using the “foreach” package.[72] The optimal value for “mtry” was empirically found to be a fraction of 0.3 of the total descriptors. Scaling of the importance variables was turned off and no imputation was performed.

Proteochemometric Regression Models

Several models were created: the data set was split into random training sets of 10%, 30%, 50%, 70%, and 90% of the total data set. Subsequently models were trained on these sets and validated on the remaining 90%, 70%, 50%, 30%, and 10% of the set to create learning curves. This process was performed four times, splitting the set with a different seed every iteration. Finally a full model was trained on 100% of the set, which was subsequently interpreted and is available for prediction. Model quality was validated using both the correlation coefficient and the Root Mean Square Error (RMSE). The combination was chosen as both are complementary. RMSE can be overly optimiztic when the bulk of the compounds is inactive (a simple inactive prediction suffices) whereas the R2 suffers from distortions if the modeled activity range is very low (see Results for further details).

Results

Data Set Composition

There were 356 908 entries in the original data set, corresponding to 61 data sources, 482 HGNC symbols and 2106 PKIs. Perhaps counterintuitively, the analysis of the data sources indicates that the majority of bioactivities were disclosed in journals not directly focused on medicinal chemistry: 74.6% of the measurements were published in three journals: Nature Chemical Biology (30.5%), Biochemical Journal (22.6%), and Nature Biotechnology (21.5%) (Figure a). Furthermore, 78.2% of the measurements came from three articles and the two online sources (Figure b).

Figure 1

Distribution of measurements across (a) the journals and (b) the data sources.

Distribution of measurements across (a) the journals and (b) the data sources. In vitro kinase profile commercial providers generated 68.5% of the measurements, the major contributors being DiscoverX (22.5%), Millipore (20.9%), and Reaction Biology (15.0%) (Table ). The vast majority of the measurements were of type Pct_Ctrl (61.8%), followed by Ki (30.0%), Kd (7.7%), and IC50 (0.5%).

Table 1

Distribution of Measurements Across Kinase Profile Commercial Providersa

kinase profile provider	measurements count	percent of total
Ambit/DiscoverX	80328	22.5
Millipore	74620	20.9
Reaction Biology	53417	15.0
MRC Dundee	28778	8.1
Carna Biosciences	2725	0.8
Caliper Life Sciences	1856	0.5
Invitrogen	1910	0.5
Proquinase	931	0.3
Abbott	72742	20.4
Other	39595	11.1

Although not a kinase profile provider, the large number of measurements shared by Abbott justified the creation of a dedicated category.

Although not a kinase profile provider, the large number of measurements shared by Abbott justified the creation of a dedicated category. The standardized result type provides an assessment of the amount of usable information within the data set: 9.9% of the data were qualified as “inconclusive”, 14.8% as “inactive” (ranges with thresholds indicating a lack of activity), 60.8% as “calculated” (Pct_Effect, single point measurements), and 14.5% as “measured” (Ki, Kd, or IC50, in which a greater confidence can be given than in single point measurements;Table ). Supporting Information Figure S1 compares across the data sources the relative numbers of measurements and their distribution according to the result types.

Table 2

Distribution of Measurements Across the Standardized Result Types

result type	measurements count	percent of total
calculated	216894	60.8
inactive	52817	14.8
measured	51797	14.5
inconclusive	35394	9.9

The distribution of the standardized result type values across the result types is presented in Figure . It shows that approximately half of the Kd measurements were considered as inactive (to be compared with approximately one-third of the Ki measurements, with another third being considered inconclusive).

Figure 2

Distribution of measurements across the standardized result types, split by measured result types. The number of measurements is labeled and is proportional to the radius of the discs. The color corresponds to the original data type (yellow Pct_Ctrl, orange PCT_Inhib). The analysis of the Target Type values indicates that 97.6% of the measurements were obtained in assays where the target could be mapped to a single HGNC symbol, with another 2.2% coming from complexes. Here, 65.2% of the measurements were generated in binding assays, 22.5%, in quantitative PCR assays, and 11.2%, in fluorescence assays. Finally, almost all measurements were generated in human kinase assays (99.5%). The collected data cover widely the kinome, as can be observed in Figure where the majority of the branches in the treelike representation of the kinome introduced by Manning et al. feature measurements.[1]

Figure 3

Coverage of the kinome by the current data set. The kinase names and circles sizes are proportional to the number of corresponding compounds. The picture was generated using the Kinome Render,[73] and the kinome tree illustration is reproduced courtesy of Cell Signaling Technology, Inc. (www.cellsignal.com). A high resolution version is available as Supporting Information Figure S2. A limited number of kinases have already proven to be attractive targets for drug discovery efforts, and the number of measurements available per HGNC symbol reflects this trend: roughly one-third of the 482 genes exhibited more than 1000 measurements, another third had 250 to 1000 measurements, and the last third less than 250 measurements (Figure a). The 10 most characterized kinases are ABL1, EGFR, FLT3, KIT, LCK, PIM1, PRKACA, GSK3B, AURKB, and PDPK1, with more than 2000 measurements each (Figure b). The 10 least profiled kinases are PINK1, SGK233, ATR, CDK1, CDK8, EFNA2, EIF2AK4, MAST3, SMG1, and STRADA, with less than four measurements each (Figure c).

Figure 4

Number of measurements per HGNC symbol: (a) all kinases; (b) most popular kinases; (c) least popular kinases. The colors indicate the result type.

Number of measurements per HGNC symbol: (a) all kinases; (b) most popular kinases; (c) least popular kinases. The colors indicate the result type. To focus on data of interest for drug discovery projects, data set 1 was prepared by excluding measurements from nonhuman kinases (1923 measurements) or from gene families (exact kinase not identified, 267 measurements) and qualified as inconclusive (35 392 measurements). In total, 319 320 measurements were further considered (89.5% of the initial data set). Those were not distributed equally among the data sources, since the sizes of the kinase panels depend on the profiling organizations. The coverage across the data sources was therefore analyzed. In total, the coverage was 15.6% when considering the targets at the kinase level, and 20.0% at the HGNC level (Table ). Supporting Information Tables S6 and S7 provide the counts and coverages for each data source, at the Kinase and HGNC levels, respectively.

Table 3

Coverage of Dataset 1 at the Kinase and HGNC Levels

	compounds count	targets count	pairs count	count of pairs with data	coverage (%)
kinase level	2106	627	1320462	207495	15.7
HGNC level	2106	452	951912	191877	20.2

In data set 1, only 15.7% of the Kinase–PKI pairs do have at least one measurement. But the distribution of coverage per data source indicates that 42 data sources have an internal coverage of 90% or more (68.9% of the data sources) at the Kinase level, while only 7 have a coverage of 50% or less (Table ). This highlights that data set 1 is a heterogeneous “patchwork” of relatively densely populated submatrices of Kinases-PKIs, separated by empty submatrices (Supporting Information Figure S3).

Table 4

Distribution of the Data Sources of Dataset 1 per Binned Coverage, at the Kinase Level

internal coverage (%)	[0; 20]	]20; 30]	]30; 40]	]40; 50]	]50; 60]	]60; 70]	]70; 80]	]80; 90]	]90; 100]
count	1	2	2	2	0	2	5	5	42

The reproducibility of the measurements was then analyzed. Data set 1 corresponds to 319 320 measurements, distributed across 207 495 Kinase–PKI pairs. 70.6% of these pairs exhibit a unique measurement, while only 1.8% had five measurements or more (Table ).

Table 5

Distribution of the Number of Measurements Per Kinase–PKI Pair

number of measurements per Kinase–PKI pair	1	2	3	4	5 or more
count	146564	25134	27656	4477	3664
percent total	70.63	12.11	13.33	2.16	1.77

For the 29.4% of Kinase–PKI pairs with multiple measurements, the distribution of the pSRV SD indicates that applying a filter at SD equal or less than 1 unit would lead to the rejection of only 570 pairs (Table ), a relatively small number, in the order of magnitude of the amount of information contained in the profile of a single PKI against all Kinases. Supporting Information Figure S4 shows the distribution of the binned pSRV SD values split by number of measurements per Kinase–PKI pair.

Table 6

Binned Distribution of the pSRV SD

binned pSRV SD	0.0	]0.0; 0.5]	]0.5; 1.0]	]1.0; 1.5]	]1.5; 2.0]	>2.0
count	161844	39002	6079	507	58	5
percent total	78.00	18.80	2.93	0.24	0.03	0.00

Data set 2 was created by applying the “pSRV SD ≤ 1” filter to include only Kinase–PKI pairs with significant measurements. It contains 627 Kinases, 2106 PKIs, and 206 925 Kinase–PKI pairs with a pSRV (15.6% of the total number of pairs). For 316 Kinases (50.4%), at most 200 PKIs were tested that provided a significant measurement, while only 51 Kinases (8.1%) had more than 1000 PKIs tested against (Table ).

Table 7

Distribution of the Binned Number of PKIs per Kinase

binned number of PKIs per Kinase	[1; 200]	]200; 400]	]400; 600]	]600; 800]	]800; 1000]	]1000; 1200]	]1200; 1400]	]1400; 1600]	>1600
count	316	137	70	29	24	36	14	0	1
percent total	50.40	21.85	11.16	4.63	3.83	5.74	2.23	0.00	0.16

By applying an upper limit of 100 nM for potent PKIs, it was observed that the coverage of the kinome by potent inhibitors is similar to the one observed in Data set 1 (Supporting Information Figure S5). The distribution of the kinase selectivity index at 100 nM across the kinome (limited to Kinases with at least 100 PKIs to reduce sampling bias, rejecting mutants, and nonkinase domains) indicates that some Kinases seem more prone to inhibition than others, namely CDC7, MAP2K5, FLT3, FLT1, and DDR1 (Supporting Information Figure S6). However, the SI_k_100 nM does not allow factoring in the number of tested PKIs. The number of potent PKIs was plotted against the number of tested PKIs to identify among the Kinases probed with many PKIs those that proved difficult to inhibit (selective Kinases, such as NEK2, MAPK1, MAPK12, MAPK13, MAPKKAPK2, AKT1, AKT2, and AKT3) and those that were more prone to inhibition (promiscuous Kinases, such as FLT3, CLK4, AURKB, KDR, FLT1, CSF1R, RET, and CDC7) (Figure ). For example, a similar number of PKIs was tested against CLK4 (1206 PKIs) and NEK2 (1210 PKIs); however, while 447 PKIs were active below 1 μM and 175 below 100 nM against CLK4, only 71 PKIs were active below 1 μM and 11 below 100 nM against NEK2. Of course, directly comparing these hit rates requires assuming that the PKIs assayed against each Kinase (57% of the total number of PKIs) covered the same chemical space, which is not necessarily true: the numbers of PKIs are similar, but the PKIs might be different. Furthermore, defining a Kinase as “selective” or “promiscuous” relies on a similar assumption and a more appropriate but more complex qualification for these Kinases would be selective or promiscuous “within the boundaries of the chemical space they were assayed against”.

Figure 5

(a) Comparison across all Kinases between the number of PKIs tested (x axis) and the number of PKIs active below 100 nM (y axis). (b) Same comparison as in part a with a focus on the most selective Kinases. Color and marker size reflect the potency of the most potent inhibitor of each Kinase. Shown in parentheses is the SI_k_100 nM. A similar analysis of data set 2 can be performed by focusing on PKIs instead of Kinases. Some of the collected data sources describe the results of PKIs profiling against Kinase panels of limited size. In addition, a number of measurements were rejected when building data sets 1 and 2. As a consequence, a significant number of PKIs from data set 2 present results against 50 Kinases or less (925 PKIs, 43.9%) (Table ), too small a panel to allow the evaluation of kinase selectivity. In parallel, 323 PKIs present results against at least 200 Kinases (15.3%) (Table ). At that level of coverage of the kinome, it should be possible to identify nonselective kinase inhibitors, and inhibitors already displaying some selectivity could be prioritized (keeping in mind the possibility of a later identification of additional inhibited kinases that were not present in the initial panel). It is to be noted that this constraint prevented any further analysis based on Murcko scaffolds, due to the limited size of the sample left after filtering (2 Murcko scaffolds with 5 PKIs, 3 Murcko scaffolds with 4 PKIs, 5 Murcko scaffolds with 3 PKIs, 22 Murcko scaffolds with 2 PKIs, and singletons).

Table 8

Distribution of the Binned Number of Kinases with Measurements per PKIa

binned number of Kinases per PKI	[1; 25]	]25; 50]	]50; 75]	]75; 100]	]100; 200]	]200; 300]	]300; 400]	]400; 500]	>500
count	628	297	244	446	168	36	192	87	8
percent total	29.82	14.10	11.59	21.18	7.98	1.71	9.12	4.13	0.38

Note that two different bin sizes (25 and 100 Kinases) were used.

Note that two different bin sizes (25 and 100 Kinases) were used. Table presents the 10 PKIs with the highest number of Kinases exhibiting at least one significant measurement (485 Kinases and more), and the structures of the top 25 PKIs are shown in Supporting Information Figure S7. Not unexpectedly, one can observe the presence of Staurosporine and derivatives PKC412 and Go6976, which share a privileged kinase scaffold, along with FDA-approved PKIs Sorafenib, Sunitinib, and Pazopanib.

Table 9

List of the 10 PKIs with the Highest Number of Kinases Exhibiting at Least One Significant Measurement

PKI name	kinase count
Tyrphostin_AG1478	525
Akti_1_2	523
JNK_Inhibitor_IX	521
AS601245	518
PKC412	503
AR_AO_14418	503
Sorafenib	501
Go6976	501
Indirubin_6_bromo_oxime	493
Staurosporine	487

The same threshold for potency was used to assess which of the most frequent PKIs were often observed to be potent (Figure ). Unsurprisingly, Staurosporine stands out as the most promiscuous PKI, along with close structural analogs Lestaurtinib and K252a. But structurally dissimilar PKIs such as AZD7762 (a checkpoint kinase inhibitor that potentiates chemotherapy-induced apoptosis),[74,75] TAE_684 (a PKI identified as an inhibitor of NPM-ALK and LRRK2)[76,77] and MK_8033 (a c-Met/Ron dual inhibitor)[78] exhibit the same apparent lack of selectivity.

Figure 6

(a) Comparison across all PKIs between the number of Kinases tested (x axis) and the number of Kinases with measurements below 100 nM (y axis). (b) Same comparison as in part a with a focus on the most selective PKIs. Color and marker size reflect the potency of the most potently inhibited Kinase for each PKI. Focusing then on the most selective PKIs, PKIs with measurements on more than 450 Kinases and less than 15 Kinases inhibited below 100 nM were analyzed. Such selective PKIs were the JNK2/3 inhibitor JNK_inhibitor_IX,[79] the EGFR inhibitor Tyrphostin_AG1478,[80] the JNK inhibitor AS602145,[81] the GSK3 inhibitor AR_A0_14418,[82] the p38-MAPK signaling pathway inhibitors Skepinone_L[83] and SB_203580,[84] the LRRK2 inhibitor HG_10_102_01,[85,86] the FLT3 inhibitor Tandutinib,[87] and the dual PI3K/mTOR inhibitor PI_103[88] (Figure ). The apparent structural diversity of these selective PKIs rules out the simplistic association of kinase selectivity with certain scaffolds. Assuming that compounds sharing similar scaffolds bind in similar ways in the orthosteric site, the importance of scaffold decorations to fine-tune the interactions to these sites is highlighted by comparing pairs of structurally related compounds. Such an interesting pair comprises MK_8033 (tested on 214 Kinases out of which 176 were inhibited below 100 nM, 82.2%) and Skepinone_L (tested on 479 Kinases out of which only 2 were inhibited below 100 nM, 0.4%) (Figure ). The most selective PKI was Akti_1_2, a dual AKT1-AKT2, whose level of selectivity might be explained by its allosteric, pleckstrin homology (PH), domain-dependent mode of binding.[89] This observation led us to further investigate the influence of the PKIs binding modes on their kinase profiles. PKIs with an unspecified mode of binding (assumed to be Type I), Type II and covalent PKIs all compete with ATP to bind in the catalytic site. Type I PKIs bind to activated kinases, type II bind to inactivated kinases, and covalent PKIs bind irreversibly to both activated and inactivated kinases. Allosteric PKIs do not compete with ATP and bind kinases in several cavities (if binding in the catalytic site, they do not interact with the hinge). By comparing the value of the largest inhibition per PKI across all Kinases to the number of Kinases observed within 10-fold of that value (limiting ourselves as previously to the 322 PKIs exhibiting at least 200 Kinases with measurements), it was observed that the largest inhibition was on average the highest for type II PKIs, followed by covalent and type I PKIs (here, PKIs are assumed to be of type I when not in the other categories) (Figure ). The allosteric PKIs were on average weaker inhibitors. This is a similar observation to what was found in previous work.[66] While we have not compared the size of the allosteric inhibitors to the competitive inhibitors in this work, previously we found them to have a lower absolute affinity but to be equally ligand efficient due to their smaller molecular weight.[66] This was also true for the kinase inhibitor section of our data set. No significant difference was observed for the number of Kinases observed within one log of the largest inhibition per PKI across all Kinases.

Figure 7

Influence of the PKIs binding mode on their potency. (a) The x axis shows for each PKI the number of Kinases found within 10-fold of the most potently inhibited Kinase, in log scale, and the y axis shows the value at which this Kinase is inhibited. Marker size reflects the number of Kinases having measurements, for each PKI. (b) Boxplots show the upper quartile, median, average, and lower quartile for the value of the largest inhibition per PKI across all Kinases. An alternative explanation of the observed lower affinity could be that allosteric PKIs were most often discovered by serendipity and might therefore be less often the outcome of numerous rounds of medicinal chemistry optimization, while on the other hand type II and covalent PKIs are usually rationally designed or optimized under the assumption that the specific structural features of their binding modes will confer on them an advantage in terms of kinase selectivity. The number of Kinases with a measurement was therefore compared to the SI_100 nM index, across the different binding modes classes. As noted previously, allosteric PKIs were observed to be significantly less potent than the other types of PKIs, and only one allosteric PKI did exhibit a non-null value of SI_100 nM: allosteric PKIs could therefore not be compared to the other types for SI_100 nM. As recently observed by Sutherland et al., Type II PKIs exhibited larger SI_100 nM values than the Type I and covalent PKIs, which were not significantly different (Figure ).[90] It appears therefore that, in this data set, Type II PKIs might achieve a higher potency but not necessarily a better selectivity than Type I and covalent ones.

Figure 8

Influence of the PKI binding mode on the kinase selectivity. (a) The x axis shows for each PKI the number of Kinases with at least one measurement, and the y axis shows in log scale the value of the SI_100 nM selectivity index. (b) The boxplots show the upper quartile, median, average, and lower quartile for the value of the SI_100 nM selectivity index.

SureChEMBL Patent Analysis

Out of the 661 unique patents, a majority (563; 85.2%) were assigned to commercial companies, followed by 91 patents (13.8%) assigned to academic institutions, and only 7 (1.1%) to academic–industry partnerships. Most of the patents (552; 83.5%) were assigned after 1995, with a significant increase of the number of patents issued per year by industry in 2002–2003 and an apparent decline in the industry since 2006 (Figure ).

Figure 9

Number of patents priority claims per year. The color indicates the affiliation type.

Number of patents priority claims per year. The color indicates the affiliation type. It should be noted that these observations are valid within the boundaries of this data set which was not obtained via an exhaustive search of all kinase-related patents for the same period but via a nonexhaustive search of literature: trends on the plot do not reflect the overall kinase inhibitor patent productivity, but the publication bias of literature compounds. In addition, no attempt was made to account for the multiple mergers and acquisitions within the industrial sector, and the assignees names were kept as in the original patents. The assignees activity domains corresponded almost always to pharmaceutical research (630; 95.3%), with 5 patents coming from the agrochemical industry (0.8%), 19 (2.9%) from other industries, and 7 (1.1%) that could not be attributed. The 12 assignees having the highest numbers of unique patents in the data set are presented at Table .

Table 10

List of the 12 Assignees Having the Highest Numbers of Unique Patents

assignee	number of patents
Abbott	55
Smithkline Beecham	24
Novartis AG	16
AstraZeneca	15
Pfizer	14
Warner Lambert	11
Boehringer Ingelheim	10
Dana Farber Cancer Inst.	10
Bayer AG	8
Glaxo Group	8
Hoffmann La Roche	8
Merck & Co	8

Out of the 12 most prolific institutions, one is an academic center, and the Abbott company alone accounts for 395 patented PKIs (34.9%) and 55 patents. This is not unexpected since one single article from Abbott[91] contributes to 1441 PKIs to the complete data set (68.5%), some PKIs being not exclusive to this article. The target families for the patent could be identified in 301 cases (45.5%) and, not unexpectedly, it was a kinase in all but 39 cases (5.9%) (Figure ).

Figure 10

Target families as identified in the patent titles. The bars are colored according to the priority claims year (one patent from 1900, covering caffeine, was undisplayed for clarity).

Target families as identified in the patent titles. The bars are colored according to the priority claims year (one patent from 1900, covering caffeine, was undisplayed for clarity). In 147 cases (22.2%), a disease or medical term was used instead of a target name to define the scope of the patent, while no disease or target info was provided in 189 cases (28.6%). Finally, 24 patents were not related to pharmaceutical research.

Proteochemometric Analysis Results

The PCM data set was composed of 117 148 data points (100 969 Ki, 979 IC50, 15 200 Kd) corresponding to 361 Kinases and 2860 PKIs (11% matrix completeness). Approximately 27% were uniquely present in data set 2, 1% were uniquely present in ChEMBL, and 72% were present in both sets. The PCM models perform very well on this data set. The ligand–target space is modeled with an RMSE of 0.20 log units and an R02 of 0.94 (with the externally validated performance reaching 0.41 ± 0.02 log units and R02 0.74 ± 0.03 at 90%/10% cross-validated models) (see the learning curves in Figure a). Grouping predictions per target drops performance slightly to 0.44 ± 0.15 log units RMSE and 0.65 ± 0.14 R02 (Figure b), and when grouped per compound, performance is 0.36 ± 0.12 log units RMSE and 0.59 ± 0.22 R02 (Figure c). This illustrates that while the full kinase ligand–target space is well modeled, there are negative outliers, likely caused by a nonuniform sampling of the interaction space. From a target perspective, some Kinases are more densely populated with tested PKIs (Figure , Table ), whereas from a chemical perspective, some PKIs have also been tested on more Kinases than others (Figure ). Interestingly the average RMSE for individual compounds is slightly lower than that for individual targets, whereas the correlation coefficient is on average slightly higher for analysis grouped per target than for grouped per PKI. In other words, PKIs tend to show a lower variance across targets than the other way around, target variance across PKIs. We also observe this effect in our data: when the bioactivity points are grouped by target the mean pKi of compounds is 6.16 (±0.87) log units, whereas when the values are grouped by compound the mean pKi of targets is 6.56 (±1.20) log units. As PKIs bind (on average) with a relatively similar potency on different targets (displaying a lower variance), the RMSE is relatively low. However, the narrow spread makes it challenging for a model to rank order the PKIs by affinity (does a PKI bind better to target A than B?), and this translates in a relatively lower correlation coefficient. Conversely, this means that targets can display a greater variance in compound affinity. As a consequence, targets are unlikely to have (on average) a similar preference for all PKIs as indicated by the larger RMSE. However, larger differences in affinity lead to the modeling error having a lower influence on the compound ranking. For example, two compounds A and B, A having a better affinity, with a difference larger than the RMSE, are predicted correctly ranked when the model is slightly mispredicting within the RMSE. If their difference is smaller than the model RMSE, the chances are higher for these compounds to be ranked wrongly. Here, the larger variance translates in a relatively higher correlation coefficient.

Figure 11

Performance of proteochemometric modeling on the set (R02 and RMSE). The curves show the median value and median average deviation (MAD) at different sizes of the training set based on the four different iterations (a) or based on the different targets or compounds (b, c). (a) Full validation set (the dashed lines extrapolate to full model performance using the values obtained in random forest out-of-bag validation). (b) Validation set grouped per target and subsequent median of these values (this method provides a better performance estimate of the models for individual targets). (c) Validation set grouped per PKI and a subsequent median of these values (this method provides a better performance estimate of the models for individual PKIs). In order to further quantify performance, the average RMSE was also calculated (using all models that made up the learning curves) for each target. R02 was ignored here as the full activity range was smaller than 2.5 log units for some targets, which is insufficient (assuming an average error of approximately 0.5 log units).[92,93] For both the validation sets grouped per Kinase and per PKI, the average RMSE was plotted versus the frequency (Figure ). The plots show that our PCM models are capable of a robust modeling of the kinase ligand–target space. However, it should be noted that PKIs or Kinases with only a few data points could display a variable performance.

Figure 12

(a) Average RMSE versus number of data points in the set grouped per PKI, for the Kinases with at least 600 PKIs tested with Ki results. (b) Average RMSE versus number of data points in the set grouped per Kinase, for the PKIs with at least 200 Kinases tested with Ki results. The color indicates the mean pKi across all Kinases/PKIs. The plots have been created on all models (different iterations of the learning curves) and give a good idea of the error distribution of our PCM models. Only 104 Kinases had Ki results on more than 600 PKIs, the threshold we chose for the analysis. On that subset, the three best modeled Kinases were PRKCZ, PAK1, and CAMK2B (RMSE 0.22–0.24 log units) whereas GSK3B, AURKB, and CLK4 were modeled the worst (RMSE 0.69–0.78 log units) (Figure a). A representative validation plot for CAMK2B is shown in Supporting Information Figure S8, and a representative plot for GSK3B is in Supporting Information Figure S9. Much fewer PKIs (∼600–800) have been tested on the best predicted Kinases than on the worst predicted ones (∼1000–1200). At the same time, their mean pKi (∼5.2) is on average almost one log unit smaller than that of the worst predicted ones (∼6.0) (Figure ), and their mean SI_k_100 nM (∼0.01) is also smaller than that of the worst predicted ones (∼0.14), which might imply that the distribution of pKi values is more important than the number of values (Supporting Information Figure S10). This is also observed when plotting the pKi values against the Kinases (Supporting Information Figure S11). The number of PKIs per Murcko scaffold[94] was variable in the data set, ranging from singletons and pairs (half of the compounds) to 22 compounds sharing the same Murcko scaffold (Supporting Information Figure S12a). Judging from the RMSE, scaffold frequency did not seem to have an influence on the quality of the predictions (Supporting Information Figure S12b). In addition, the median similarity of the compounds to the rest of the training set was calculated and further investigation led to the conclusion that no trend could be observed with regards to the RMSE or R2 (data not shown). The worst modeled Kinases appear to have more PKIs, which are on average more potent, and which cover a wider range of potencies, than the best modeled ones. That might seem counterintuitive if one reasons by analogy to the pharmacophore perception problem. This problem states that when a training set is built, it should contain several diverse chemical series, with large ranges of potencies within each series, in order to computationally identify the structural features responsible for the biological activity. One explanation for our observed inconsistency to this wider range might be as follows: we judge the “ability to predict” (i.e., small RMSE value) by excluding data from the set, building models without the excluded data and then using the models to predict the values for these data before comparing to the measured values: If the complete range of measured values is narrow, then whatever the composition of the excluded set, there are good chances that its range will be covered by that of the model building set: we stay within the “applicability domain” in terms of pKi values. On the other hand, if the range is wide, then the ranges of excluded and model building data might not overlap very well anymore, which might lead to poorer performances. Only 50 PKIs had Ki results on more than 200 Kinases, the threshold we chose for the analysis. On that subset, the PKIs whose activities were predicted the best were VX745, CP_724714, and Roscovitine (RMSE 0.31–0.32 log units), and the worst were Staurosporine, Bosutinib, and CEP_32496 (RMSE 0.91–0.96 log units) (Figure b), with the structures shown in Figure . Representative plots for two are shown in the Supporting Information Figures S13 (VX745) and S14 (Bosutinib).

Figure 13

Examples of PKIs whose activities were well predicted by our model (green, underlined) and predicted with lesser accuracy (red, in italic).

Examples of PKIs whose activities were well predicted by our model (green, underlined) and predicted with lesser accuracy (red, in italic). The number of Kinases against which the PKIs were tested does not seem to influence the RMSE (Figure b). But as already observed with the validation set grouped per Kinase, the mean pKi of the best predicted PKIs (∼5.2) is on average almost one log unit smaller than that of the worst predicted ones (∼6.3) (Figure b), and their mean SI_100 nM (∼0.01) is also smaller than that of the worst predicted ones (∼0.26), underlining again the importance of the distribution of the pKi values (Supporting Information Figure S10b). This is also observed when plotting the pKi values against the Kinases (Supporting Information Figure S15). It is never trivial to prospectively validate a model; however, something can be said on the relevance of future predictions. Novel PKIs might belong to the chemical space of the data set (e.g., structures tested on a different Kinase, or new structures similar to the data set ones) and then quantification of reliability of predictions is relatively straightforward. Should a structure not be completely covered, reliability is expected to be lower; however, as we have shown, it is difficult to quantify the reliability for individual compounds. Hence the best course of action would be in our view to consider the average model error. Finally, when plotting the RMSE as a function of the number of Kinases assayed for a PKI, no difference was observed between PKIs with type I or type II binding modes (covalent PKIs were represented by only two compounds and no conclusion can therefore be made on this small population) (Figure ). It is to be noted that the three best and three worst predicted compounds mentioned above were all type I orthosteric binders.

Figure 14

Influence of the PKI binding mode type on the RMSE for the PKIs. The x axis shows the number of Kinases and the y axis shows the RMSE value.

Model Interpretation

The model quality allows the accurate prediction of PKI–Kinase activities, opening the door for rational design of selectivity for individual PKIs. Analyzing the descriptors (see Supporting Information Table S8 for the complete list), several key properties were identified to correlate with a higher Ki value across the board. For the PKIs these were (in order of decreasing importance): compound size (and related properties like number of bonds and number of rings), lower predicted solubility (i.e., more lipophilic compounds tend to have a higher affinity),[95] larger number of hydrogen bond donors, and a larger polar surface area. Metz et al. had already highlighted the influence of the number of hydrogen bonds donors and acceptors but had excluded a correlation with size.[91] In addition, a number of molecular substructures were found to correlate with a higher affinity. Out of the 15 substructures with the highest correlation to affinity (averaged over all 20 models), 11 clearly correspond to the Staurosporine class of compounds (they retrieve 10–19 of the 19 Staurosporine analogs), 3 retrieved bicyclic aromatic systems such as azaindole (as observed in Vemurafenib) and derivatives, and the last one is the common motif diamine phenylpyrimidine, observed for example in Tamatinib (Figure , with more examples presented in Supporting Information Figure S16).

Figure 15

(a) List of the 15 substructures with the highest correlation to affinity (averaged over all 20 models. (b) examples of PKIs retrieved by these substructures. The substructures (a) and hits (b) are colored by chemical class: green diamine phenylpyrimidine (DAPP) and related; blue Staurosporine and related; orange azaindole and related. Analysis of the molecular substructures correlated to the prediction accuracy identified sets of similar substructures, as seen in Figure where the previously identified top 15 substructures are again highlighted.

Figure 16

Impact of molecular substructures on the correlation with affinity (x axis) and on the importance for accuracy (y axis), both averaged over the 20 models. The 15 substructures previously described are displayed and colored by chemical class (green diamine phenylpyrimidine (DAPP) and related; blue Staurosporine and related; brown azaindole and related) while the other substructures are only represented by gray markers. Conversely we could identify protein features correlated to higher affinity: Kinases with more electrophilic, large and hydrophobic (small effect) residues tend to display higher affinities. More specifically, key properties at specific positions were found to correlate with relatively higher affinities: alignment positions 13 (larger and more electrophilic amino acids), 23 (large amino acids), 31 (hydrophobic and small amino acids), and 35 (more hydrophobic). As an example, Table shows the amino acids observed at these four positions in three Kinases of increasing mean pKi: PAK1 (mean pKi = 5.12, third smallest pKi value observed; RMSE = 0.23, second smallest RMSE value observed), TYRO3 (mean pKi = 5.69; RMSE = 0.39), and FLT1 (mean pKi = 6.47, largest pKi value observed; RMSE = 0.67, fourth largest RMSE value observed). While on this example the properties of the amino acids at position 13 do not allow a clear differentiation, the trend is overall respected at the three other positions.

Table 11

Values of the Key Properties Correlated with Higher Affinities at the Four Identified Positionsa

		PAK1 (mean pKi = 5.12)		TYRO3 (mean pKi = 5.69)		FLT1 (mean pKi = 6.47)
position	desirable property (unit)	amino acid	value	amino acid	value	amino acid	value
13	large size (Da)	Asn₃₂₉	132	Lys₅₇₂	146	Asn₈₉₃	132
13	electrophilicity (KD score)	Asn₃₂₉	–3.5	Lys₅₇₂	–3.9	Asn₈₉₃	–3.5
23	large size (Da)	Gly₃₄₉	75	His₅₉₈	155	Tyr₉₁₄	181
31	small size (Da)	Lys₃₉₁	146	Ala₆₄₇	89	Ala₁₀₂₄	89
31	hydrophobicity (KD score)	Lys₃₉₁	–3.9	Ala₆₄₇	1.8	Ala₁₀₂₄	1.8
35	hydrophobicity (KD score)	Thr₄₀₆	–0.7	Ala₆₆₂	1.8	Cys₁₀₃₉	2.5

The KD score corresponds to the Kyte and Doolitle hydrophobicity[96].

The KD score corresponds to the Kyte and Doolitle hydrophobicity[96]. Position 13 belongs to the ß4 strand of the N lobe and lines the back pocket of the orthosteric site, while position 23 in the loop between the N lobe ß4 strand and the C lobe D helix lies in the entrance pocket, separated from the hinge amino acids by one amino acid (Figure ). The side chains of the amino acids at positions 13 and 23 point away from the ATP cavity and cannot make any interaction with it. Position 31 belongs to the catalytic loop in the C lobe and lines the phosphate pocket. Lysine amino acids are conservatively observed at this position in the kinome since the Lysine side chain points toward the cavity and is engaged in stabilizing interactions with the ATP phosphate groups. Position 35 belongs to the activation loop in the C lobe and precedes the Asp-Phe-Gly (DFG) motif at the N terminus of the activation loop, which conformational switch distinguishes between type I and type II ligands. Its side chain points toward the cavity, potentially interacting with ligands (via direct hydrogen bonds to type II ligands for example). The reason why the PCM model identified these four positions is therefore not immediately related to a potential modulation (depending on the nature of the amino acid) of direct interactions with ligands.

Figure 17

Mapping of the four positions where key properties correlate with relatively higher affinities (carbon atoms in pink) on the PAK1–ATP (carbon atoms in light green) crystal structure (PDB Code: 3Q53, resolution: 2.09 A). The hinge amino acids are highlighted in orange (carbon atoms in orange, only the backbone is displayed). Key amino acids Met344 (gatekeeper) and Arg299 are also displayed (carbon atoms in cyan). The hydrogen bonds between ATP and the amino acids are represented as dotted lines. The model might have captured more complex parameters such as the impact of the presence of key amino acids on the conformational plasticity of the kinases. A highly conserved spatial motif of four hydrophobic amino acids spanning all structural domains have been found to dynamically assemble upon kinase activation, a process that depends on the Activation Loop conformation and phosphorylation status. These amino acids form a “Regulatory Spine” (R-Spine) that stabilizes the kinase active conformation, coordinates the motions of the N- and C-lobes and is disordered in the inactive conformation.[97] A second spine was later identified, formed via the assembly of eight hydrophobic amino acids and requiring the additional presence of the adenine ring of the ATP (or an equivalent ring system from a synthetic ligand). This “Catalytic Spine” (C-Spine) traverses and connects both lobes.[98] The protein kinase functionality requires conformational plasticity, provided by the spines assembly and dismantlement, itself triggered by activation loop phosphorylation state and ATP presence. The four positions identified by the PCM model were all found in close proximity to the R- and C-Spine amino acids (Table , Figure ), a region of strategically great importance.[99] For the construction of the PCM model we limited ourselves to a sequence alignment based mainly on the ATP binding pocket. Hence, what we observe interpreting the model for residues at positions 13, 23, 31, 35 might be a proxy for changes actually taking place in the spine region. The currently used alignment includes half of the Spines positions: two of the four R-Spine positions and four of the eight C-Spine positions. It might very well be that extending this alignment to include the full spine regions will improve the PCM model and allow it to confirm the role these amino acids play.

Table 12

List, Amino Acid Identities and Locations of the Four Positions Identified by the PCM Models and of the C- and R-Spine Amino Acids in Their Vicinity

	amino acid	location
position 13	Asn₃₂₉	N lobe, ß4 strand
R-Spine	Tyr₃₃₀	N lobe, ß4 strand
position 23	Gly₃₄₉	N lobe ß4 strand to C lobe D-helix loop
C-Spine	Leu₃₅₂	C lobe, D-helix
R-Spine	His₃₈₇	C lobe, catalytic loop
position 31	Lys₃₉₁	C lobe, catalytic loop
C-Spine	Ile₃₉₅	C lobe, ß7 strand
C-Spine	Leu3₉₆	C lobe, ß7 strand
C-Spine	Leu₃₉₇	C lobe, ß7 strand
position 35	Thr₄₀₆	C lobe, activation loop
R-Spine	Phe₄₀₈	C lobe, activation loop

Figure 18

Four positions where key properties correlate with relatively higher affinities (carbon atoms in pink) are surrounded by the regulatory (carbon atoms in light brown) and catalytic (carbon atoms in light blue for the kinase, in light green for the ATP) spines in this PAK1-ATP crystal structure (PDB code 3Q53, resolution 2.09 A). The hinge amino acids are highlighted in orange (carbon atoms in orange, only the backbone is displayed). Key amino acids Met344 (gatekeeper) and Arg299 are also displayed (carbon atoms in cyan). The hydrogen bonds between ATP and the amino acids are represented as dotted lines. It is to be noted that a previous effort by Martin and Mukherjee, relying on a 46 amino acids definition of the binding site as composed of the ATP and back pockets, and on a sequence identity as a similarity measure, identified a set of 16 amino acids most influential for interkinase SAR similarity. Our method relied on a 38 amino acids definition of the binding site and we translated our amino acids to physicochemical properties, that could arguably be considered to be more relevant for drug binding: we have previously shown that using different descriptors leads to different descriptions of the binding pocket.[57] There, while performance of the models trained was similar,[100] interpretation might be diverging between different descriptor types. Indeed, out of the 4 amino acids highlighted in our approach, only 1 is also present in the 16 amino acids from Martin and Mukherjee, while 2 are adjacent to 1 of Martin and Mukherjee and the fourth is located 5 amino acids away.

Summary and Conclusions

We have presented here our efforts in compiling, standardizing and analyzing a very large set of kinase bioactivity profiles based on published literature, followed by extensive proteochemometric bioactivity modeling. The full data set is available as a Supporting Information zip file and can be downloaded from the ChEMBL FTP server.[101] The bioactivity data will be fully integrated and available in a future release of the ChEMBL database. This will complement the existing kinase screening depositions and will hopefully facilitate more comparative chemogenomics studies in the community. Different aspects of the data set composition were analyzed, ranging from the data sources (journals, articles) to the data types and providers, the patents covering the PKIs and the corresponding patentees’ affiliations, the kinome coverage, the selectivity or promiscuity of both Kinases and PKIs, and the impact of the PKI type on both their potency and selectivity. Though incomplete by nature due to the heterogeneity of the sources, the data set allowed for a robust proteochemometric prediction of kinase activity across the fraction of the kinome addressed in the compiled literature: our models facilitate the exploration of the kinase ligand–target space and the rationalization of the global activity. The provided data files in the Supporting Information enable the rational design of compounds with a targeted selectivity profile, hopefully leading to more specific leads or probes with higher efficacy. The key is the ability to capture the differences in ligand–target interaction space displayed by the kinases and their inhibitors. Here it has been demonstrated that the current set allows for the creation of PCM models capable to achieve just that. However, the currently presented PCM model can be further improved by descriptor tuning or inclusion of other parameters such as protein domain[102] or concentrations used in the experiments. Although it was judged outside the scope of the current work, our future plans include a dedicated study on a high quality kinomewide PCM as proposed here. In conclusion, in the opinion of the authors, the ever increasing computational power, coupled to the development of better algorithms (such as deep learning)[103] will allow the possibility of rational kinome inhibitor design to be a reality in the very near future.

92 in total

Review 1. Protein kinases--the major drug targets of the twenty-first century?

Authors: Philip Cohen
Journal: Nat Rev Drug Discov Date: 2002-04 Impact factor: 84.694

Review 2. Kinase chemogenomics: targeting the human kinome for target validation and drug discovery.

Authors: E ter Haar; W P Walters; S Pazhanisamy; P Taslimi; A C Pierce; G W Bemis; F G Salituro; S L Harbeson
Journal: Mini Rev Med Chem Date: 2004-03 Impact factor: 3.862

3. AZD7762, a novel checkpoint kinase inhibitor, drives checkpoint abrogation and potentiates DNA-targeted therapies.

Authors: Sonya D Zabludoff; Chun Deng; Michael R Grondine; Adam M Sheehy; Susan Ashwell; Benjamin L Caleb; Stephen Green; Heather R Haye; Candice L Horn; James W Janetka; Dongfang Liu; Elizabeth Mouchet; Shannon Ready; Judith L Rosenthal; Christophe Queva; Gary K Schwartz; Karen J Taylor; Archie N Tse; Graeme E Walker; Anne M White
Journal: Mol Cancer Ther Date: 2008-09 Impact factor: 6.261

Review 4. FDA-approved small-molecule kinase inhibitors.

Authors: Peng Wu; Thomas E Nielsen; Mads H Clausen
Journal: Trends Pharmacol Sci Date: 2015-05-12 Impact factor: 14.819

5. N-(3-Cyano-4,5,6,7-tetrahydro-1-benzothien-2-yl)amides as potent, selective, inhibitors of JNK2 and JNK3.

Authors: Richard M Angell; Francis L Atkinson; Murray J Brown; Tsu Tshen Chuang; John A Christopher; Maria Cichy-Knight; Allison K Dunn; Kendra E Hightower; Susanna Malkakorpi; James R Musgrave; Margarete Neu; Paul Rowland; Robyn L Shea; Jeffery L Smith; Donald O Somers; Sonia A Thomas; Gladstone Thompson; Ruolan Wang
Journal: Bioorg Med Chem Lett Date: 2006-12-15 Impact factor: 2.823

6. Allosteric Akt (PKB) inhibitors: discovery and SAR of isozyme selective inhibitors.

Authors: Craig W Lindsley; Zhijian Zhao; William H Leister; Ronald G Robinson; Stanley F Barnett; Deborah Defeo-Jones; Raymond E Jones; George D Hartman; Joel R Huff; Hans E Huber; Mark E Duggan
Journal: Bioorg Med Chem Lett Date: 2005-02-01 Impact factor: 2.823

7. Characterization of PF-4708671, a novel and highly specific inhibitor of p70 ribosomal S6 kinase (S6K1).

Authors: Laura R Pearce; Gordon R Alton; Daniel T Richter; John C Kath; Laura Lingardo; Justin Chapman; Catherine Hwang; Dario R Alessi
Journal: Biochem J Date: 2010-10-15 Impact factor: 3.857

8. Kinase selectivity potential for inhibitors targeting the ATP binding site: a network analysis.

Authors: Danzhi Huang; Ting Zhou; Karine Lafleur; Cristina Nevado; Amedeo Caflisch
Journal: Bioinformatics Date: 2009-11-26 Impact factor: 6.937

9. Characterization of GSK2334470, a novel and highly specific inhibitor of PDK1.

Authors: Ayaz Najafov; Eeva M Sommer; Jeffrey M Axten; M Phillip Deyoung; Dario R Alessi
Journal: Biochem J Date: 2011-01-15 Impact factor: 3.857

10. Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): comparative study of 13 amino acid descriptor sets.

Authors: Gerard Jp van Westen; Remco F Swier; Jörg K Wegner; Adriaan P Ijzerman; Herman Wt van Vlijmen; Andreas Bender
Journal: J Cheminform Date: 2013-09-23 Impact factor: 5.514

17 in total

1. Predictive models of aqueous solubility of organic compounds built on A large dataset of high integrity.

Authors: Hongmao Sun; Pranav Shah; Kimloan Nguyen; Kyeong Ri Yu; Ed Kerns; Md Kabir; Yuhong Wang; Xin Xu
Journal: Bioorg Med Chem Date: 2019-05-27 Impact factor: 3.641

2. Target and Tissue Selectivity Prediction by Integrated Mechanistic Pharmacokinetic-Target Binding and Quantitative Structure Activity Modeling.

Authors: Anna H C Vlot; Wilhelmus E A de Witte; Meindert Danhof; Piet H van der Graaf; Gerard J P van Westen; Elizabeth C M de Lange
Journal: AAPS J Date: 2017-12-04 Impact factor: 4.009

3. Type II Binders Targeting the "GLR-Out" Conformation of the Pseudokinase STRADα.

Authors: Ryan H B Smith; Zaigham M Khan; Peter Man-Un Ung; Alex P Scopton; Lisa Silber; Seshat M Mack; Alexander M Real; Avner Schlessinger; Arvin C Dar
Journal: Biochemistry Date: 2021-01-13 Impact factor: 3.162

4. Developing a Kinase-Specific Target Selection Method Using a Structure-Based Machine Learning Approach.

Authors: Arina Afanasyeva; Chioko Nagao; Kenji Mizuguchi
Journal: Adv Appl Bioinform Chem Date: 2020-12-02

5. Facing small and biased data dilemma in drug discovery with enhanced federated learning approaches.

Authors: Zhaoping Xiong; Ziqiang Cheng; Xinyuan Lin; Chi Xu; Xiaohong Liu; Dingyan Wang; Xiaomin Luo; Yong Zhang; Hualiang Jiang; Nan Qiao; Mingyue Zheng
Journal: Sci China Life Sci Date: 2021-07-26 Impact factor: 6.038

6. Computational-experimental approach to drug-target interaction mapping: A case study on kinase inhibitors.

Authors: Anna Cichonska; Balaguru Ravikumar; Elina Parri; Sanna Timonen; Tapio Pahikkala; Antti Airola; Krister Wennerberg; Juho Rousu; Tero Aittokallio
Journal: PLoS Comput Biol Date: 2017-08-07 Impact factor: 4.475

7. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set.

Authors: Eelke B Lenselink; Niels Ten Dijke; Brandon Bongers; George Papadatos; Herman W T van Vlijmen; Wojtek Kowalczyk; Adriaan P IJzerman; Gerard J P van Westen
Journal: J Cheminform Date: 2017-08-14 Impact factor: 5.514

8. Progress towards a public chemogenomic set for protein kinases and a call for contributions.

Authors: David H Drewry; Carrow I Wells; David M Andrews; Richard Angell; Hassan Al-Ali; Alison D Axtman; Stephen J Capuzzi; Jonathan M Elkins; Peter Ettmayer; Mathias Frederiksen; Opher Gileadi; Nathanael Gray; Alice Hooper; Stefan Knapp; Stefan Laufer; Ulrich Luecking; Michael Michaelides; Susanne Müller; Eugene Muratov; R Aldrin Denny; Kumar S Saikatendu; Daniel K Treiber; William J Zuercher; Timothy M Willson
Journal: PLoS One Date: 2017-08-02 Impact factor: 3.240

9. How to Achieve Better Results Using PASS-Based Virtual Screening: Case Study for Kinase Inhibitors.

Authors: Pavel V Pogodin; Alexey A Lagunin; Anastasia V Rudik; Dmitry A Filimonov; Dmitry S Druzhilovskiy; Mark C Nicklaus; Vladimir V Poroikov
Journal: Front Chem Date: 2018-04-26 Impact factor: 5.221

10. Clinical pharmacodynamic/exposure characterisation of the multikinase inhibitor ilorasertib (ABT-348) in a phase 1 dose-escalation trial.

Authors: Michael L Maitland; Sarina Piha-Paul; Gerald Falchook; Razelle Kurzrock; Ly Nguyen; Linda Janisch; Sanja Karovic; Mark McKee; Elizabeth Hoening; Shekman Wong; Wijith Munasinghe; Joann Palma; Cherrie Donawho; Guinan K Lian; Peter Ansell; Mark J Ratain; David Hong
Journal: Br J Cancer Date: 2018-03-19 Impact factor: 7.640