
On the Choice of Active Site Sequences for Kinase-Ligand Affinity Prediction.

Jannis Born¹,², Yoel Shoshan³, Tien Huynh⁴, Wendy D Cornell⁴, Eric J Martin⁵, Matteo Manica¹.

Abstract

Recent work showed that active site rather than full-protein-sequence information improves predictive performance in kinase-ligand binding affinity prediction. To refine the notion of an "active site", we here propose and compare multiple definitions. We report significant evidence that our novel definition is superior to previous definitions and better models of ATP-noncompetitive inhibitors. Moreover, we leverage the discontiguity of the active site sequence to motivate novel protein-sequence augmentation strategies and find that combining them further improves performance.

Year: 2022 PMID: 36098536 PMCID: PMC9516689 DOI: 10.1021/acs.jcim.2c00840

Source DB: PubMed Journal: J Chem Inf Model ISSN: 1549-9596 Impact factor: 6.162

Introduction

The human kinome is indispensable for the regulation of cell function and comprises many widely studied drug targets due to its key role in a multitude of diseases such as cancer. Therefore, proteochemometric models that can predict protein–ligand interaction, kinetic energies, or binding affinities have received growing interest.[1] Most efforts rely on either structure-based[2,3] or sequence-based[4,5] deep learning models. While structure-based approaches can, in principle, model binding dynamics more realistically, their practical superiority is questionable: recent work showed that incorporating noncovalent interactions gives no benefit over simple protein/ligand descriptors.[6] Sequence-based models for affinity prediction are usually trained on prohibitively long protein sequences that consist predominantly of residues irrelevant for binding. Recently, however, we demonstrated that using only residues of the ATP-binding site rather than the full protein increases the signal-to-noise ratio in the protein representation and significantly improves performance in protein–ligand affinity prediction for human kinases.[7] All experiments in that work were based on an active site definition from Sheridan et al. (ref (8)), which comprises 29 residues surrounding the ATP-binding site that were identified using multiple sequence alignment (MSA). The superiority of the active site representation manifested consistently across all ligand types, with the sole exception of one drug class: MEK/MAPK inhibitors.[7] Notably, this class contains many allosteric binders, in particular ATP-noncompetitive MAPK inhibitors that bind to a unique site near the ATP-binding pocket.[9] One goal of the present work is to address this systematic limitation in modeling allosteric binders and to refine the definition of an “active site” for binding affinity prediction.
To this end, we leverage an alternative active site definition from ref (10) that comprises 16 residues, 6 of which lie farther away from the immediate binding site (see Figure 1A). These two representations are compared to a broader Combined definition (cf. Figure 1B). Last, we explore additional mechanisms to leverage the knowledge about the active site, in particular how it can inspire data augmentation. We propose two new protein sequence augmentation techniques and find that they have complementary positive effects.

Figure 1

Overview of active site definitions and representations. A) Visualization of cAMP-dependent protein kinase catalytic subunit alpha (P17612). Residues unique to the active site definitions of refs (8) and (10) are colored in orange and green, respectively. Residues contained in both definitions are shown in red. B) Partial amino acid sequence (residues 48–62) of the same kinase. The upper gray panel displays the four kinase sequence representations examined in this work. The lower gray panel visualizes three kinase augmentation strategies, exemplified on the “combined” active site definition: flipping (i.e., reversing) the entire sequence, flipping contiguous subsequences, and swapping neighboring subsequences. Residues affected by the augmentation are encircled in black.

Kinase Sequence Representation

Active Site Definitions

In our previous work,[7] the active site representation relied on 29 residues defined originally in Sheridan et al. [ref (8), Table 1]. These residues form short contiguous subsequences that lie discontiguously in the original sequence (cf. Figure 1B, top). Here, the predictive power of this Sheridan definition is compared to 16 residues that were found most relevant for kinase-kernel models by Martin et al.[10] These Martin residues were identified from a starting set of 46 residues based on how frequently they were picked by a variable selection algorithm for a large set of kinase-kernel models. Since only 10 of these 16 residues overlap with the Sheridan definition, we also examine a Combined active site definition with 35 residues. For a table with the PKA numbering of all residues, see subsection S1.3.
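The size of the Combined definition follows from inclusion-exclusion over the two residue sets. Writing S for the Sheridan residues and M for the Martin residues (symbols introduced here only for illustration):

```latex
|S \cup M| = |S| + |M| - |S \cap M| = 29 + 16 - 10 = 35,
\qquad
|M \setminus S| = 16 - 10 = 6
```

The 6 residues in M \ S are the ones the Martin definition contributes beyond the Sheridan definition.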

Kinase Sequence Augmentation

While the MSA guarantees a meaningful and consistent ordering of the residues (and their physical roles), the sequences do not provide explicit 3D information on protein conformation. In particular, proximity in the sequence likely, but not necessarily, corresponds to proximity in 3D space. We therefore hypothesized that sequence augmentation strategies could help the model learn general binding patterns for two reasons: 1) There may be 1D representations that align better with the 3D relation of residues than the original sequence. Representing a kinase as a distribution of sequences reflects this lack of knowledge and might regularize the model and thus improve generalization, especially to unseen target families. 2) Static roles of specific residue positions may induce overfitting in practice, as the model might memorize overly specific patterns. A natural augmentation technique for protein sequences is flipping (F) the entire reduced residue set (p = 0.5). Moreover, we leveraged the knowledge about the location of the active site residues and exploited their discontiguity in the full sequence to motivate two additional augmentation strategies (cf. Figure 1B, bottom). First, flipping contiguous subsequences (FS): Since subsequences of the active site that are contiguous in the full sequence are close together in space, reading such sequences from either direction should not affect model predictions (p = 0.5). Second, swapping neighboring contiguous subsequences (SS): This strategy relies on the assumption that neighboring contiguous subsequences are more likely to be close in space than distant active site subsequences (p = 0.2). Last, we also explore combinations of these augmentation strategies. For details, see Supporting Information S1.2.
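The three strategies can be illustrated with a minimal Python sketch. It is not the pytoda implementation: the function names, the representation of an active site as a list of contiguous segments, the toy residues, and the pairwise left-to-right swapping scheme are all assumptions made for illustration. Only the probabilities (0.5, 0.5, 0.2) come from the text; the exact procedure is in Supporting Information S1.2.

```python
import random
from typing import List, Optional

# Assumed representation: each string is one run of active site residues
# that is contiguous in the full kinase sequence.
Segments = List[str]


def _rng(rng: Optional[random.Random]):
    # Fall back to the global random module when no generator is passed.
    return rng if rng is not None else random


def flip(segments: Segments, p: float = 0.5, rng=None) -> Segments:
    """F: with probability p, reverse the entire active site sequence."""
    if _rng(rng).random() < p:
        return [seg[::-1] for seg in reversed(segments)]
    return list(segments)


def flip_subsequences(segments: Segments, p: float = 0.5, rng=None) -> Segments:
    """FS: reverse each contiguous segment independently with probability p."""
    r = _rng(rng)
    return [seg[::-1] if r.random() < p else seg for seg in segments]


def swap_neighbors(segments: Segments, p: float = 0.2, rng=None) -> Segments:
    """SS: swap neighboring segments with probability p.

    Assumption: pairs are visited left to right and a swapped pair is
    skipped afterwards, so no segment moves more than one position.
    """
    r = _rng(rng)
    out = list(segments)
    i = 0
    while i < len(out) - 1:
        if r.random() < p:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out


def augment(segments: Segments, rng=None) -> str:
    """F + FS + SS applied in one (assumed) order, returned as one string."""
    out = swap_neighbors(segments, rng=rng)
    out = flip_subsequences(out, rng=rng)
    out = flip(out, rng=rng)
    return "".join(out)


# Toy segments, not a real kinase active site.
toy = ["LGT", "VKA", "HRD"]
print(augment(toy, rng=random.Random(0)))
```

All three transformations only permute residues, so the amino acid composition of the active site is preserved while its 1D ordering varies between training epochs.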

Experimental Setup

The experimental setup is largely identical to the binding affinity prediction task described in ref (7). We take data from BindingDB[11] and examine two types of models, a k-nearest-neighbor (KNN) model that builds a joint similarity space of protein and ligand distances and a deep neural network called BiMCA (Bimodal Multiscale Convolutional Attention encoder[12]) that ingests protein and ligand sequences (SMILES strings) and consists of convolutional and attention layers. The remaining methods (data source and preprocessing, model definitions) can be found in Supporting Information S1.

Results

Ligand Split

This split corresponds to the classical discovery setting: Kinases are shared across train and validation data, and thus, we measure generalization in the ligand space. The results on the ligand split confirm the superiority of using active sites compared to full sequences, irrespective of the exact definition of the active site (cf. Table 1). The table indicates that the Combined representation consistently yields the best results for both models, both metrics, and both validation and test data. These improvements are statistically significant (Wilcoxon signed-rank test, W+) compared to at least one active site definition for all settings (see Figure S1 and Figure S2).
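For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a paired Wilcoxon signed-rank test. It assumes that W+ denotes the sum of ranks of the positive paired differences and that the p-value is one-sided; the function name and the exact enumeration of sign assignments (feasible for 10 folds) are choices made here, not taken from the paper. The input values are toy numbers, not results.

```python
from itertools import product
from typing import Sequence, Tuple


def wilcoxon_w_plus(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, exact one-sided p-value) for paired samples x and y.

    W+ is the sum of the ranks of |x_i - y_i| over pairs with x_i > y_i.
    The p-value is P(W+ >= observed) under the null hypothesis that each
    difference is equally likely to be positive or negative.
    """
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Exact null distribution: enumerate all 2^n sign assignments.
    hits = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, hits / 2 ** n


# Toy paired values (e.g., per-fold errors of two configurations).
w, p = wilcoxon_w_plus([5, 6, 7, 8], [4, 4, 4, 4])
print(w, p)  # 10.0 0.0625
```

When comparing an error metric such as RMSE, passing the baseline as `x` and the candidate as `y` makes a large W+ correspond to the candidate having lower error on most folds.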

Table 1

Results on Validation and Test Data (Ligand Split)ᵃ

		RMSE (↓)		Pearson (↑)
data	config	BiMCA	BiMCA-pre	BiMCA	BiMCA-pre
val.	full sequence	0.908 ± 0.01	0.848 ± 0.01	0.748 ± 0.00	0.782 ± 0.01
	AS (Sheridan)	0.829 ± 0.01	0.821 ± 0.01	0.794 ± 0.00	0.797 ± 0.01
	AS (Martin)	0.839 ± 0.01	0.813 ± 0.01	0.791 ± 0.00	0.804 ± 0.01
	AS (combined)	0.828 ± 0.01	0.811 ± 0.01	0.797 ± 0.01	0.804 ± 0.01
test	full sequence	0.912 ± 0.01	0.863 ± 0.01	0.744 ± 0.00	0.774 ± 0.01
	AS (Sheridan)	0.832 ± 0.01	0.826 ± 0.01	0.792 ± 0.01	0.795 ± 0.01
	AS (Martin)	0.842 ± 0.01	0.818 ± 0.01	0.789 ± 0.01	0.801 ± 0.01
	AS (combined)	0.832 ± 0.01	0.816 ± 0.01	0.795 ± 0.01	0.802 ± 0.01

ᵃ 10-fold cross-validation results on kinase data from BindingDB. For each model and data partition, we show mean and standard deviation across 10 folds and mark the best representation in bold.
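The two metrics in the table are the standard root-mean-square error and Pearson correlation between measured and predicted affinities. A minimal sketch in plain Python follows; the function names are chosen here for illustration and are not taken from the paper's code.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error; lower is better."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; higher is better."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy affinities: predictions that are perfectly correlated but offset.
measured = [5.0, 6.0, 7.0]
predicted = [5.5, 6.5, 7.5]
print(rmse(measured, predicted), pearson(measured, predicted))
```

The toy example shows why both metrics are reported: a constant offset leaves the Pearson correlation at its maximum while still producing a nonzero RMSE.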

There are several kinase inhibitor classes with notable performance differences: First, the conspicuous inferiority of the Sheridan definition to the full protein sequence for MEK inhibitors [ref (7), Figure 7], caused by allosteric MAPK inhibitors that cannot be modeled using an ATP-based active site definition, was a limitation of our previous work. Importantly, this can be resolved using the Martin or the Combined active site definition with 6 more distant residues (cf. Figure S1 panel C). These definitions include residues distant from the ATP-binding site and around the “hydrophobic spine”, hypothesized to affect the stability of binding site features or the active and inactive forms.[13] Second, the Martin definition also includes T51, a residue that forms an important salt bridge with residues in the same loop in many CDK kinases, another class where Martin/Combined is better than Sheridan.

Kinase Split

This split tests the ability of the model to predict the binding affinity for an unseen protein kinase. Since it induces high heterogeneity across the folds/chunks of data, care has to be taken in drawing conclusions, especially from the test data results. The results for the KNN and the BiMCA on the validation and test data are shown in Figures 2A and 2B, respectively.
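A split of this kind can be sketched as a grouped k-fold partition in which all measurements of one kinase land in the same fold. The record layout, the function name `grouped_folds`, and the round-robin assignment of groups to folds are assumptions for illustration; the procedure actually used is described in Supporting Information S1.

```python
import random
from typing import Callable, Hashable, List, Sequence, Tuple

Record = Tuple[str, str, float]  # assumed layout: (kinase_id, ligand_id, affinity)


def grouped_folds(
    records: Sequence[Record],
    key: Callable[[Record], Hashable],
    k: int = 10,
    seed: int = 0,
) -> List[List[Record]]:
    """Partition records into k folds so that no group spans two folds."""
    groups = sorted({key(r) for r in records})
    random.Random(seed).shuffle(groups)
    # Round-robin assignment of shuffled groups to folds.
    fold_of = {g: i % k for i, g in enumerate(groups)}
    folds: List[List[Record]] = [[] for _ in range(k)]
    for r in records:
        folds[fold_of[key(r)]].append(r)
    return folds


# Toy records with hypothetical identifiers.
data = [
    ("kinase_A", "lig_1", 6.2),
    ("kinase_A", "lig_2", 7.1),
    ("kinase_B", "lig_1", 5.4),
    ("kinase_C", "lig_3", 8.0),
]
kinase_split = grouped_folds(data, key=lambda r: r[0], k=2)
ligand_split = grouped_folds(data, key=lambda r: r[1], k=2)
```

Grouping by the kinase identifier yields folds with unseen kinases, whereas grouping by the ligand identifier yields folds with unseen ligands while kinases remain shared, which mirrors the two settings evaluated here.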

Figure 2

RMSE in affinity prediction for kinase split on validation and test data. 10-fold cross-validation results on kinase data from BindingDB. Performance on validation (A) and test data (B) is shown. Statistically significant differences between the three different active site configurations are marked with a star.

On the validation data, no clear trend is visible when comparing the three active site configurations across models, data splits, and metrics. Notably, however, all active site definitions significantly outperform the full sequence representations across all models, splits, and metrics. While the Sheridan representation is significantly superior to the Martin representation for the KNN (p < 0.05, W+) and to the Combined representation for the BiMCA, this trend does not persist in the test data. During testing, the Martin representation consistently obtained the highest Pearson correlation, irrespective of the model (cf. Table S1). However, this finding is not corroborated when the RMSE is used as the response metric (cf. Figure 2B). Notably, our best model (the pretrained BiMCA) obtained the best performance with the Combined representation in all but one case. In Supporting Information S2.3, we report additional results on a subset of samples where both kinases and ligands are unseen. The results on this strict split show the higher generalization capabilities of the BiMCA compared to the KNN and underline the superiority of the active site sequence representations.

Kinase Sequence Augmentation

To further improve performance, we systematically investigated different kinase sequence augmentation strategies. The results demonstrate that all augmentation techniques improved model performance (cf. Table 2). Interestingly, the structure-motivated techniques of swapping (SS) and flipping subsequences (FS) exhibited a performance boost similar to that of simple flipping (F). However, the benefit of flipping is not statistically significant, whereas FS and SS yield significant benefits (p < 0.01, W+) in several configurations. Moreover, their performance boost is roughly additive, as combining all three strategies yields the best results in seven out of eight cases (p < 0.01, W+, RMSE on validation data). We hypothesize that the pretrained model is harder to improve because it has partially learned invariance to the applied transformations.

Table 2

Results of Sequence Augmentation (Kinase Split)ᵃ

		RMSE (↓)		Pearson (↑)
data	augmentation	BiMCA	BiMCA-pre	BiMCA	BiMCA-pre
val.	none	1.32 ± 0.16	1.20 ± 0.12	0.438 ± 0.08	0.489 ± 0.09
	flip (F)	1.25 ± 0.13	1.19 ± 0.13	0.463 ± 0.08	0.502 ± 0.08
	flip subseq (FS)	1.28 ± 0.12	1.18 ± 0.12	0.431 ± 0.11	0.521 ± 0.08
	swap subseq (SS)	1.28 ± 0.17	1.18 ± 0.12	0.443 ± 0.11	0.511 ± 0.09
	FS + SS	1.27 ± 0.11	1.18 ± 0.12	0.444 ± 0.09	0.508 ± 0.09
	F + FS + SS	1.22 ± 0.10	1.18 ± 0.11	0.468 ± 0.11	0.505 ± 0.09
test	none	1.33 ± 0.08	1.23 ± 0.08	0.431 ± 0.06	0.505 ± 0.07
	flip (F)	1.28 ± 0.05	1.23 ± 0.07	0.478 ± 0.04	0.515 ± 0.06
	flip subseq (FS)	1.32 ± 0.09	1.22 ± 0.04	0.444 ± 0.08	0.516 ± 0.04
	swap subseq (SS)	1.28 ± 0.04	1.23 ± 0.03	0.479 ± 0.01	0.506 ± 0.06
	FS + SS	1.29 ± 0.06	1.22 ± 0.07	0.469 ± 0.04	0.526 ± 0.05
	F + FS + SS	1.27 ± 0.06	1.21 ± 0.05	0.479 ± 0.06	0.531 ± 0.05

ᵃ All models used the Combined active site definition.

Discussion

In this work, we corroborate the finding that “less is more” in sequence-based kinase-ligand affinity prediction models. Our experiments show that the 16 residues identified in ref (10) yield similar results to the Sheridan residues. We report evidence that a novel, Combined kinase representation is superior to the Sheridan and the Martin representations for predicting binding affinity for unseen ligands. For unseen kinases, we not only corroborate our previous results on the superiority of active sites to full kinase sequences but also find that no residue composition is strictly advantageous. While we have previously found that incorporating fewer residues yields better results, we find here that bringing back specific residues that are more distant from the ATP pocket significantly increases performance, especially for allosteric binders. Although these residues (cf. subsection S1.3) were identified algorithmically, Martin et al.[10] discuss in a post hoc analysis the dynamical roles of residues in the “hydrophobic spine”[14] as well as other residues important in loop dynamics and activation–deactivation mechanisms of kinases that do not interact directly with the ligand (T51, L103, V119, G126, I163). Other residues might be involved in both direct and indirect interactions (F54, L95, L106, F187, L162). Even though the ideal set of residues for sequence-based kinase affinity prediction models remains unclear, our results are a step forward in compactly modeling kinase-ligand binding. As shown in our previous work,[7] improved affinity predictors can be leveraged to drive molecular generative models toward generating molecules with higher binding affinity to specific kinases. Lastly, the knowledge about the location of the active site motivates multiple novel sequence augmentation techniques that yielded further, complementary performance improvements.

Data and Software Availability

The data processing and augmentation pipelines are implemented in the pytoda package.[12] The source code has been released on https://github.com/PaccMann/paccmann_kinase_binding_residues#choosing-active-site-sequences, and the preprocessed BindingDB data is available via https://ibm.biz/active_site_data.[7] Moreover, in the Generative Toolkit for Scientific Discovery (GT4SD), we provide an example on leveraging the affinity predictor as a reward function in a protein-driven molecular generative model:[15] https://github.com/GT4SD/gt4sd-core/tree/main/examples/protein_driven_molecule_generation.

References

1. Kinase-kernel models: accurate in silico screening of 4 million compounds across the entire human kinome.

Authors: Eric Martin; Prasenjit Mukherjee
Journal: J Chem Inf Model Date: 2012-01-06 Impact factor: 4.956

2. Surface comparison of active and inactive protein kinases identifies a conserved activation mechanism.

Authors: Alexandr P Kornev; Nina M Haste; Susan S Taylor; Lynn F Ten Eyck
Journal: Proc Natl Acad Sci U S A Date: 2006-11-09 Impact factor: 11.205

3. QSAR models for predicting the similarity in binding profiles for pairs of protein kinases and the variation of models between experimental data sets.

Authors: Robert P Sheridan; Kiyean Nam; Vladimir N Maiorov; Daniel R McMasters; Wendy D Cornell
Journal: J Chem Inf Model Date: 2009-08 Impact factor: 4.956

4. DeepAffinity: interpretable deep learning of compound-protein affinity through unified recurrent and convolutional neural networks.

Authors: Mostafa Karimi; Di Wu; Zhangyang Wang; Yang Shen
Journal: Bioinformatics Date: 2019-09-15 Impact factor: 6.937

5. Kinases and pseudokinases: lessons from RAF.

Authors: Andrey S Shaw; Alexandr P Kornev; Jiancheng Hu; Lalima G Ahuja; Susan S Taylor
Journal: Mol Cell Biol Date: 2014-02-24 Impact factor: 4.272

6. Deep Learning in Drug Target Interaction Prediction: Current and Future Perspectives.

Authors: Karim Abbasi; Parvin Razzaghi; Antti Poso; Saber Ghanbari-Ara; Ali Masoudi-Nejad
Journal: Curr Med Chem Date: 2021 Impact factor: 4.530

7. RosENet: Improving Binding Affinity Prediction by Leveraging Molecular Mechanics Energies with an Ensemble of 3D Convolutional Neural Networks.

Authors: Hussein Hassan-Harrirou; Ce Zhang; Thomas Lemmin
Journal: J Chem Inf Model Date: 2020-05-26 Impact factor: 4.956

8. On the Frustration to Predict Binding Affinities from Protein-Ligand Structures with Deep Neural Networks.

Authors: Mikhail Volkov; Joseph-André Turk; Nicolas Drizard; Nicolas Martin; Brice Hoffmann; Yann Gaston-Mathé; Didier Rognan
Journal: J Med Chem Date: 2022-05-24 Impact factor: 7.446

9. Active Site Sequence Representations of Human Kinases Outperform Full Sequence Representations for Affinity Prediction and Inhibitor Generation: 3D Effects in a 1D Model.

Authors: Jannis Born; Tien Huynh; Astrid Stroobants; Wendy D Cornell; Matteo Manica
Journal: J Chem Inf Model Date: 2021-12-14 Impact factor: 4.956

10. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology.

Authors: Michael K Gilson; Tiqing Liu; Michael Baitaluk; George Nicola; Linda Hwang; Jenny Chong
Journal: Nucleic Acids Res Date: 2015-10-19 Impact factor: 16.971