Kinact: a computational approach for predicting activating missense mutations in protein kinases.

Carlos H M Rodrigues, David B Ascher, Douglas E V Pires.

Abstract

Protein phosphorylation is tightly regulated due to its vital role in many cellular processes. While gain of function mutations leading to constitutive activation of protein kinases are known to be driver events of many cancers, the identification of these mutations has proven challenging. Here we present Kinact, a novel machine learning approach for predicting kinase activating missense mutations using information from sequence and structure. By adapting our graph-based signatures, Kinact represents both structural and sequence information, which are used as evidence to train predictive models. We show the combination of structural and sequence features significantly improved the overall accuracy compared to considering either primary or tertiary structure alone, highlighting their complementarity. Kinact achieved a precision of 87% and 94% and Area Under ROC Curve of 0.89 and 0.92 on 10-fold cross-validation, and on blind tests, respectively, outperforming well established tools (P < 0.01). We further show that Kinact performs equally well on homology models built using templates with sequence identity as low as 33%. Kinact is freely available as a user-friendly web server at http://biosig.unimelb.edu.au/kinact/.

Substances:
Protein Kinases

Year: 2018 PMID: 29788456 PMCID: PMC6031004 DOI: 10.1093/nar/gky375

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The ability of cells to recognize and correctly respond to their microenvironment is crucial for survival. Responding dynamically to cellular signals requires fast molecular switches. Protein phosphorylation is the most widespread type of post-translational modification, with over one-third of the proteins in the human proteome phosphorylated (1). The dynamic equilibrium between phosphorylation and dephosphorylation is stringently regulated, and provides a rapid mechanism to modulate protein behaviour and activity across most signalling pathways (2). Loss of control over this regulation process, through the introduction of dominant activating mutations in kinases and the consequent hyperphosphorylation of their targets, can have many phenotypic consequences, including the development and metastasis of many cancers (3–7), as well as the development of metabolic disorders (8). Advances in next generation sequencing techniques are leading to the identification of a range of novel mutations, including in kinases. In the absence of experimental information, it is currently challenging to identify mutations that are likely to lead to constitutive activation of kinases. While many computational approaches have been proposed for predicting the effects of mutations that disrupt activity, they have had limited success in predicting gain-of-function mutations, as also shown in this work, despite the important roles these mutations play in many diseases, particularly cancer. To fill this gap, here we present Kinact, a machine learning-based predictive model and web server. Using our graph-based signatures, the method was tailored to accurately identify kinase activating mutations from a combination of sequence and structural information.

MATERIALS AND METHODS

Data sets

Mutations were derived from three mutational databases with experimental evidence of their functional consequence: Kindriver (9), ClinVar (10) and Ensembl (11). Kinase mutations were divided into two groups based upon the available experimental evidence: activating and non-activating. The non-activating group comprises variations that either disrupt activity (inactivating) or have no significant biological effect (neutral). The activating mutations were defined by a significant, experimentally measured increase in kinase activity. The complete data set contained 384 mutations (260 activating and 124 non-activating) distributed across 42 proteins, of which 256 (186 activating and 70 non-activating) could be mapped onto experimentally solved 3D structures. Supplementary Figures S1 and S2 of the Supplementary Materials summarise the composition and the class distribution of mutations in the data set. The 256 mutations with experimental structures available were randomly split into training and blind test sets. The proportions of activating and non-activating mutations in the training and blind test sets were kept similar to those observed in the original data set, in an attempt to prevent bias in the final method. The training set comprises 179 mutations (130 activating and 49 non-activating), which were used to train Kinact under 10-fold cross-validation. The remaining 77 (56 activating and 21 non-activating) were used as a blind test to validate the predictive model, minimizing the risk of overfitting. To assess the quality of the subsets selected for training and blind testing, we repeated this process 20 times, and the final version of the web server was built using the predictive model with the best performance. Average and standard deviation values are reported in the Supplementary Materials.
In addition, 41 mutations (24 activating and 17 non-activating, in 14 kinases) that did not have experimentally solved structures available, and were therefore not part of the original 256 mutations, had their structures built by homology modelling to further evaluate the predictive performance of Kinact in a blind test.
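
The class-preserving random split of the 256 structure-mapped mutations can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_split`, the 30% test fraction applied per class and the seed handling are assumptions made for illustration, and only the class counts (186 activating, 70 non-activating) are taken from the text.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Split indices into train/test while preserving the class ratio.

    Hypothetical helper: shuffles the indices of each class separately and
    moves `test_fraction` of each class into the test set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Class counts reported in the text for mutations with experimental structures.
labels = ["activating"] * 186 + ["non-activating"] * 70

# Repeating the split with different seeds mimics the 20 resampling rounds.
splits = [stratified_split(labels, test_fraction=0.3, seed=s) for s in range(20)]
train_idx, test_idx = splits[0]
n_test_activating = sum(1 for i in test_idx if labels[i] == "activating")
```

Under these assumptions each round yields 179 training and 77 blind test mutations, with 56 activating mutations in the blind test set, matching the set sizes given above.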

Feature engineering

The task of predicting and understanding the effects of mutations in proteins at a molecular level has been tackled by approaches using different biological features, each with their own assumptions and limitations. Protein structural and sequence features have been the two most popular categories of attributes used by these computational methods. Sequence-based features have focussed predominantly on the analysis of residue conservation throughout a protein family and its homologs (12) and on sequence composition (13). By contrast, previous studies have used a wide range of structural features, including secondary structure, solvent accessibility and dihedral angles (14,15). Significant effort has also been devoted to more computationally intensive approaches for modelling mutation effects, ranging from force fields and energy terms to molecular dynamics simulations (16,17). As an alternative, graph-based structural signatures have been shown to be a scalable and effective approach for modelling the residue environment, and have been successfully employed to train machine learning-based methods that predict and elucidate the effects of mutations on protein stability and on interactions with their partners (18–26). Moreover, these signatures have also been used to provide insights into the molecular mechanisms of mutations and how they lead to disease and disease predisposition (27–33) and drug resistance (34–41). These graph-based signatures are predominantly composed of distance patterns extracted from the wild-type residue environment, which, together with pharmacophore modelling of its components, have been shown to be an effective way to model both the geometry and the physicochemical composition of protein regions. Beyond this diverse range of approaches, a combination of sequence and structural information has also proven valuable when predicting damaging mutations (42,43).
Based on these observations, graph-based signatures together with complementary sequence and structural information were used to build a predictive model. This complementary information included: (a) wild-type residue environment descriptors, (b) wild-type residue interactions, (c) predicted stability changes upon mutation, (d) sequence-based predicted effects on protein function and (e) the mCSM mutation pharmacophore modelling. A total of 82 different attributes (72 structural and 10 sequence-based) were calculated for each mutation in our dataset. These were then provided as evidence to train and test supervised learning algorithms using the Weka Tool Kit (44). The attributes used in this work were categorised into six different groups and are summarised in Supplementary Table S1 of the Supplementary Materials.
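
To make the idea of distance patterns combined with pharmacophore classes more concrete, the following is a simplified sketch and not the actual signature implementation. It assumes a residue environment is given as a list of atoms labelled with a pharmacophore class, and counts atom pairs of each class combination that fall within a set of distance cutoffs. The function name, the class labels, the cutoffs and the toy coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist


def distance_pattern_signature(atoms, cutoffs=(2.0, 4.0, 6.0)):
    """Count atom pairs per pharmacophore class pair within each cutoff.

    `atoms` is a list of (pharmacophore_class, (x, y, z)) tuples.
    Returns a dict keyed by ((class_a, class_b), cutoff) with cumulative counts.
    """
    classes = sorted({cls for cls, _ in atoms})
    pair_labels = [(a, b) for i, a in enumerate(classes) for b in classes[i:]]
    signature = {(pair, c): 0 for pair in pair_labels for c in cutoffs}

    for (cls_a, xyz_a), (cls_b, xyz_b) in combinations(atoms, 2):
        pair = tuple(sorted((cls_a, cls_b)))
        d = dist(xyz_a, xyz_b)
        for c in cutoffs:
            if d <= c:  # cumulative: a close pair counts for every larger cutoff
                signature[(pair, c)] += 1
    return signature


# Toy environment with hypothetical coordinates (in angstroms).
toy_atoms = [
    ("hydrophobic", (0.0, 0.0, 0.0)),
    ("hydrophobic", (3.0, 0.0, 0.0)),
    ("donor", (0.0, 1.5, 0.0)),
    ("acceptor", (0.0, 0.0, 5.0)),
]
signature = distance_pattern_signature(toy_atoms)
```

Flattening the dictionary values in a fixed key order gives a numeric feature vector of the kind that could be passed to a supervised learner alongside the other attribute groups.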

WEB SERVER

We have implemented Kinact as a user-friendly, freely available web server (http://biosig.unimelb.edu.au/kinact/). The server front end was built using Bootstrap framework version 3.3.7, while the back-end was built in Python via the Flask framework (Version 0.12.2). It is hosted on a Linux server running Apache.

Input

The server provides two different input options for the user (Supplementary Figure S4). The ‘Single mutation’ option allows users to predict whether a given mutation will lead to protein kinase activation or not. This option requires the user to provide a PDB (45) file or PDB accession code of the kinase, the point mutation specified as a string containing the wild-type residue one-letter code, its corresponding residue number and the mutant residue one-letter code, and the chain identifier of the wild-type residue. The primary sequence of the kinase of interest in FASTA format is also required. The ‘Mutation list’ option allows users to upload a list of mutations in a file for batch processing. To help users submit their jobs, sample submission entries are available on the submission page and a help page is available via the top navigation bar.
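
The point mutation string described above (wild-type one-letter code, residue number, mutant one-letter code) can be validated before submission with a short sketch such as the one below. The function `parse_mutation` and its validation rules are assumptions for illustration and do not reflect how the server itself parses input; `V600E` is used only as an example string.

```python
import re

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MUTATION_PATTERN = re.compile(rf"^([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


def parse_mutation(code):
    """Split a mutation string such as 'V600E' into (wild_type, position, mutant)."""
    match = MUTATION_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a valid point mutation string: {code!r}")
    wild_type, position, mutant = match.groups()
    if wild_type == mutant:
        raise ValueError(f"wild-type and mutant residues are identical: {code!r}")
    return wild_type, int(position), mutant


wild_type, position, mutant = parse_mutation("V600E")
```

A check of this kind catches malformed entries early; confirming that the wild-type residue actually occurs at that position in the chosen chain would additionally require reading the structure file.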

Output

For the ‘Single mutation’ option, as shown in Figure 1, the output page displays the Kinact prediction, the details of the user input data, such as the structures of the wild-type and mutant residues, and information on the kinase group to which the submitted structure was assigned, based on sequence similarity according to the Standard Kinase Classification Scheme (46).

Figure 1.

Web server results page for a single mutation prediction. The predicted outcome is shown alongside complementary information on the submitted protein and the details of the mutation being evaluated. In addition, Kinact displays information on the group of homologous protein kinases according to the Standard Kinase Classification Scheme. The effects of the mutation on protein stability calculated by mCSM (21), SDM (43) and DUET (25) are also shown.

In addition, Kinact provides a set of analyses to help users investigate the impact of the mutation in greater detail. All resources displayed within the analysis section, including PyMOL sessions and the multiple sequence alignment in FASTA format, are made available for download. The first item in the analysis section (Supplementary Figure S5) allows users to explore the 3D structure and the inter-residue interactions established by the wild-type residue, calculated by Arpeggio (47). Below this, users can also explore residue conservation mapped onto the structure of the wild-type kinase (Supplementary Figure S6). The 3D structure of the kinase of interest is displayed and coloured according to conservation within the kinase sub-group, from red (not conserved) to blue (conserved). The structures are displayed in an interactive viewer implemented with 3Dmol.js (48). Finally, users can also explore, within the analysis section, a multiple sequence alignment of the sequence of the provided structure and those from the closest kinase group according to the Standard Kinase Classification Scheme, assigned by similarity (Supplementary Figure S7). Previously experimentally characterised point mutations within any kinase of the group are highlighted, enabling users to rapidly identify, through homology, the effect of mutations at the corresponding residue position.
For the ‘Mutation list’ option, the server output is shown as a downloadable table (Supplementary Figure S8) and users also have the option to analyse each mutation separately, similarly to what was described for the ‘Single mutation’ option.

VALIDATION

To evaluate the quality of the training and blind test sets, we resampled these subsets 20 times and evaluated the performance of the predictive model on each split using AUC and precision. All values for the blind tests are reported in the Supplementary Materials for each sample. Averages and standard deviations are also shown, and no bias was identified. Here we compare the performance of the best Kinact predictive model with that of widely used tools for studying the effects of mutations on protein function: PolyPhen-2 (42), SIFT (12) and wKinMut2 (49), a tool to identify and interpret pathogenic variants in human protein kinases.

Performance on cross validation

To better evaluate the contribution of structure-based and sequence-based attributes to the performance of supervised learning algorithms, three different predictive models were generated. The first model uses only attributes that rely on protein sequence information, which include mutation tolerance predictions (12,42) as well as a pharmacophore difference vector between wild-type and mutant residues, as proposed by the mCSM signatures (21). For this model we used the complete original dataset of 384 mutations. The second model uses only structural attributes calculated using the experimental structural data from the PDB. These include the graph-based structural signatures and complementary descriptors described in Supplementary Table S1 of the Supplementary Materials. Finally, the third model was constructed based on a combination of all attributes, using both sequence and structural data. For the models that used structural data we used only the dataset of mutations with experimental structures available, which accounts for 256 mutations. To run and assess the performance of the machine learning algorithms, we split each dataset into 70% of the mutations for training and 30% for blind test. Accordingly, for the model that uses only sequence-based data we used 268 mutations for training (182 activating and 86 non-activating) and 116 for blind test (77 activating and 39 non-activating). For the other two models, which used structure-based features, 179 mutations were used for training and 77 mutations for blind test, as previously described. All models were trained under 10-fold cross-validation. Supplementary Figure S3 of the Supplementary Materials summarises the distribution of activating and non-activating mutations in the training and blind test sets for all models. The machine learning methods, evaluation procedures and performance metrics used are described in Supplementary Data.
A series of experiments was carried out to assess the ability of Kinact to predict whether a given mutation was likely to lead to constitutive activation of a kinase. The ROC curves across the training data set for models using sequence information alone, structure-based features alone, and the Kinact model that combines both attribute classes are shown in Figure 2. Details on the evaluation metrics for each algorithm are summarised in Supplementary Tables S2-S4 of the Supplementary Materials. Across the complete training set, Kinact achieved a precision of 87% and an Area Under ROC Curve (AUC) of 0.89, significantly higher than the models using either just sequence or just structural data (AUC of 0.77 and 0.83, and precision of 78% and 81%, respectively; P < 0.01). The final predictive models were trained using the full training set, and all performance evaluation metrics were calculated as the average over the 10 folds of cross-validation.
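
For readers less familiar with the two headline metrics, the sketch below shows one common way to compute precision and the Area Under the ROC Curve from labels and scores. It is a generic illustration, not the evaluation code used in this work (which relied on Weka): the function names, the 0.5 decision threshold and the toy data are assumptions. The AUC is computed through its rank interpretation, i.e. the probability that a randomly chosen activating mutation receives a higher score than a randomly chosen non-activating one.

```python
def precision(y_true, y_pred):
    """Fraction of predicted positives (1) that are true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return correct / (len(pos) * len(neg))


# Toy data: 1 = activating, 0 = non-activating; scores are hypothetical.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

toy_precision = precision(y_true, y_pred)
toy_auc = roc_auc(y_true, scores)
```

In a 10-fold cross-validation setting, these functions would be applied to the held-out fold of each round and the resulting values averaged.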

Figure 2.

Comparative performance of Kinact. The ROC curves obtained on the training data set for models using sequence information alone, structural information alone, and the combined Kinact model are shown in (A). Kinact (AUC of 0.89) performs significantly better (P-value < 0.01) than the models using either just sequence or just structural data (AUC of 0.77 and 0.83, respectively). To compare the performance of Kinact against the widely used tools SIFT, PolyPhen-2 and wKinMut2, a blind test (B) on a non-redundant test set was performed, and Kinact (AUC of 0.96) significantly (P-value < 0.01) outperformed all three methods (AUC of 0.54, 0.70 and 0.52, respectively). Using homology models (C), Kinact was also able to accurately identify activating mutations (AUC of 0.77), and again outperformed the other methods.

Blind test

To properly evaluate the predictive performance and generalization of the method, Kinact was initially evaluated against a separate, independent, non-redundant blind test set of 77 missense mutations in protein kinases with available experimental structures, achieving a precision of 97% and an Area Under ROC Curve of 0.96. Kinact significantly outperformed all three other methods (Figure 2B; P-value < 0.01). Looking specifically at the activating mutations, SIFT predicted 55% of mutations as deleterious (score < 0.05), PolyPhen-2 classified 84% as probably damaging (score > 0.85), and wKinMut2 predicted 62% of mutations as disease related (score > 0), while Kinact correctly classified 99% of them. Comparisons of Kinact with tools that assess the effects of mutations on protein stability are described in the Supplementary Materials.

Homology models

The ability of the web server to accurately classify mutations using homology models was evaluated using a set of 41 mutations in kinases without experimentally resolved structures. Homology models of the kinases were generated with Modeller (50), using experimentally resolved structures with sequence identity as low as 33% as templates. Using the homology models, Kinact was able to accurately identify activating mutations (AUC of 0.77 and precision of 0.78), supporting the applicability of this approach beyond experimental structures to those that are computationally modelled. This was also significantly better than PolyPhen-2, SIFT and wKinMut2 (Figure 2C). When comparing the performance of the methods specifically on the activating mutations, Kinact correctly classified 100% of mutations, while SIFT predicted 75% as deleterious (score < 0.05), PolyPhen-2 classified 83% as probably damaging (score > 0.85), and wKinMut2 predicted 77% as disease related (score > 0).
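
The per-tool percentages quoted for activating mutations follow from applying each tool's score cutoff and counting the fraction of mutations flagged. The sketch below illustrates that calculation using the cutoffs stated in the text (SIFT score < 0.05, PolyPhen-2 score > 0.85, wKinMut2 score > 0). The dictionary `FLAGGING_RULES`, the function `fraction_flagged` and the example scores are hypothetical and are not taken from the evaluated data.

```python
# Score cutoffs as quoted in the text; the rule table itself is illustrative.
FLAGGING_RULES = {
    "SIFT": lambda score: score < 0.05,        # deleterious
    "PolyPhen-2": lambda score: score > 0.85,  # probably damaging
    "wKinMut2": lambda score: score > 0,       # disease related
}


def fraction_flagged(tool, scores):
    """Return the fraction of mutations that `tool` flags under its cutoff."""
    if not scores:
        raise ValueError("no scores provided")
    rule = FLAGGING_RULES[tool]
    return sum(1 for s in scores if rule(s)) / len(scores)


# Hypothetical scores for four activating mutations.
example_scores = {
    "SIFT": [0.01, 0.20, 0.04, 0.50],
    "PolyPhen-2": [0.99, 0.90, 0.30, 0.86],
    "wKinMut2": [1.2, -0.4, 0.0, 0.7],
}
fractions = {tool: fraction_flagged(tool, s) for tool, s in example_scores.items()}
```

Note that a mutation flagged as deleterious or damaging by these tools is not thereby predicted to be activating; the fractions only indicate how many activating mutations each tool considers functionally relevant.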

CONCLUSION

Here we present Kinact, a predictive model and web server tailored for identifying kinase activating mutations using graph-based signatures, sequence and structural data. Kinact combines high performance, open access and web visualization tools to assist research on how mutations affect protein kinase activity, as well as to prioritise mutations for further investigation. Given the importance of these variants in the context of many diseases, especially in the development of many types of cancer, and given that widely used tools have not been able to successfully predict gain-of-function mutations, we believe Kinact will be a useful tool to help identify and understand the role of these mutations. The method is freely available as a user-friendly web server at http://biosig.unimelb.edu.au/kinact/.

49 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. The origins of protein phosphorylation.

Authors: Philip Cohen
Journal: Nat Cell Biol Date: 2002-05 Impact factor: 28.824

Review 3. The protein kinase complement of the human genome.

Authors: G Manning; D B Whyte; R Martinez; T Hunter; S Sudarsanam
Journal: Science Date: 2002-12-06 Impact factor: 47.728

4. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations.

Authors: Raphael Guerois; Jens Erik Nielsen; Luis Serrano
Journal: J Mol Biol Date: 2002-07-05 Impact factor: 5.469

5. Twelve novel HGD gene variants identified in 99 alkaptonuria patients: focus on 'black bone disease' in Italy.

Authors: Martina Nemethova; Jan Radvanszky; Ludevit Kadasi; David B Ascher; Douglas E V Pires; Tom L Blundell; Berardino Porfirio; Alessandro Mannoni; Annalisa Santucci; Lia Milucci; Silvia Sestini; Gianfranco Biolcati; Fiammetta Sorge; Caterina Aurizi; Robert Aquaron; Mohammed Alsbou; Charles Marques Lourenço; Kanakasabapathi Ramadevi; Lakshminarayan R Ranganath; James A Gallagher; Christa van Kan; Anthony K Hall; Birgitta Olsson; Nicolas Sireau; Hana Ayoob; Oliver G Timmis; Kim-Hanh Le Quan Sang; Federica Genovese; Richard Imrich; Jozef Rovensky; Rangan Srinivasaraghavan; Shruthi K Bharadwaj; Ronen Spiegel; Andrea Zatkova
Journal: Eur J Hum Genet Date: 2015-03-25 Impact factor: 4.246

Review 6. Combating mutations in genetic disease and drug resistance: understanding molecular mechanisms to guide drug design.

Authors: Amanda T S Albanaz; Carlos H M Rodrigues; Douglas E V Pires; David B Ascher
Journal: Expert Opin Drug Discov Date: 2017-06 Impact factor: 6.098

7. mCSM: predicting the effects of mutations in proteins using graph-based signatures.

Authors: Douglas E V Pires; David B Ascher; Tom L Blundell
Journal: Bioinformatics Date: 2013-11-26 Impact factor: 6.937

8. ClinVar: public archive of relationships among sequence variation and human phenotype.

Authors: Melissa J Landrum; Jennifer M Lee; George R Riley; Wonhee Jang; Wendy S Rubinstein; Deanna M Church; Donna R Maglott
Journal: Nucleic Acids Res Date: 2013-11-14 Impact factor: 16.971

9. DynaMut: predicting the impact of mutations on protein conformation, flexibility and stability.

Authors: Carlos Hm Rodrigues; Douglas Ev Pires; David B Ascher
Journal: Nucleic Acids Res Date: 2018-07-02 Impact factor: 16.971

10. In silico functional dissection of saturation mutagenesis: Interpreting the relationship between phenotypes and changes in protein stability, interactions and activity.

Authors: Douglas E V Pires; Jing Chen; Tom L Blundell; David B Ascher
Journal: Sci Rep Date: 2016-01-22 Impact factor: 4.379
