Literature DB >> 32712465

PhosphoEffect: Prioritizing Variants On or Adjacent to Phosphorylation Sites through Their Effect on Kinase Recognition Motifs.

Abstract

Phosphorylation sites often have key regulatory functions and are central to many cellular signaling pathways, so mutations that modify them have the potential to contribute to pathological states such as cancer. Although many classifiers exist for prioritization of coding genomic variants, to our knowledge none of them explicitly account for the alteration or creation of kinase recognition motifs that alter protein structure, function, regulation of activity, and interaction networks through modifying the pattern of phosphorylation. We present a novel computational pipeline that uses a random forest classifier to predict the pathogenicity of a variant, according to its direct or indirect effect on local phosphorylation sites and the predicted functional impact of perturbing a phosphorylation event. We call this classifier PhosphoEffect and find that it compares favorably and with increased accuracy to the existing classifier PolyPhen 2.2.2 when tested on a dataset of known variants enriched for phosphorylation sites and their neighbors.

Entities: Chemical Disease Gene Mutation Species

Keywords: Biochemistry; Omics

Year: 2020 PMID： 32712465 PMCID： PMC7387813 DOI： 10.1016/j.isci.2020.101321

Source DB: PubMed Journal: iScience ISSN： 2589-0042

Introduction

Mutations in the coding region of the human genome can contribute to pathological phenotypes through their molecular effect on the structure and function of proteins. Although many genetic diseases are caused by well-characterized mutations in single genes, others, such as cancer, are complex polygenic conditions resulting from the accumulation of numerous genomic mutations, with different cancers exhibiting a wide variety of mutational patterns. In the post-genomic era, next-generation sequencing (NGS) platforms such as Illumina and IonTorrent can generate huge amounts of sequence data and identify millions of variants in a single genome (Zhang et al., 2011). This can facilitate the identification of pathogenic mutations in tumors and aid in their prediction, diagnosis, prognosis, and intervention, as well as guide in the design of novel targeted drugs (Torshizi and Wang, 2018). However, the majority of mutations in any cancer genome are benign, so-called passenger mutations, which arise by chance and may hitchhike to fixation if they co-occur with mutations beneficial to the growth of the tumor, the so-called driver mutations. Distinguishing driver from passenger mutations has been the holy grail of cancer diagnosis and treatment (Pon and Marra, 2015). Recent developments in machine learning algorithms, which can predict the impact of mutations on protein structure and/or function and “prioritize” them based on clinical relevance have indeed come to the rescue of cancer diagnosis and treatment. These algorithms include PolyPhen (Adzhubei et al., 2010), SIFT (Sorting Intolerant From Tolerant; Sim et al., 2012), FATHMM (Functional Analysis Through Hidden Markov Models; Rogers et al., 2018), CADD (Combined Annotation Dependent Depletion; Rentzsch et al., 2018), and MutationTaster (Schwarz et al., 2014). Their predictions are based on a wide range of features including sequence context, conservation, and predicted impact on protein structure. They also take into account post-translational modifications (PTMs), such as phosphorylation, as these can regulate the structure, function, and interaction partners of a protein and play important roles in cellular signaling (Ardito et al., 2017). Mutations that perturb PTMs—for instance, substituting a phosphorylation site for a residue that cannot be phosphorylated— can have a significant impact on the function of a protein and/or the regulation of its activity (Radivojac et al., 2008). Efforts have therefore been made to catalog missense variants that map to phosphorylation sites, for instance, AWESOME (A Website Exhibits SNP On Modification Event; Yang et al., 2019), ActiveDriverDB (Krassowski et al., 2018), and MIMP (Mutation Impact on Phosphorylation, Wagih et al., 2015). Similar attempts have been made for ubiquitination with some experimental validation (Martínez-Jiménez et al., 2020). However, not all phosphorylation sites are necessarily functional, and the regulatory importance of phosphorylation sites can vary greatly. Several studies have therefore aimed to classify functional versus nonfunctional phosphorylation sites according to features such as sequence context, conservation, and structure (Beltrao et al., 2012; Xiao et al., 2016). Additionally, residues of neighboring phosphorylation sites can influence the degree of phosphorylation, since protein kinases largely recognize short linear motifs surrounding the target residue. Thus, mutations in the flanking region of phosphorylation sites have the potential to greatly influence protein structure and function by up- or downregulating the stoichiometry of phosphorylation at that site. However, as far as we are aware, existing variant classifiers do not explicitly consider these effects. Additionally, given that protein phosphorylation frequently modifies protein-protein interaction (PPI) partners and this modification is central to many, if not most, cellular signaling pathways, another important consideration is the impact of disrupting or altering phosphorylation sites on its PPI, and indeed on the whole PPI network. The study of molecular interaction networks has become increasingly applied to human diseases such as cancer, with the effect of many known cancer mutations being understood through their effect on PPI networks (Hijazi et al., 2020; Creixell et al., 2015). But one of the major shortcomings of existing phosphorylation classifiers in our view is their failure to take such network information into account and implement them in the algorithm. Finally, variant prioritization algorithms generally consider sequence conservation as a key feature, based on the reasoning that mutations in highly conserved regions have undergone negative selection and are thus most likely deleterious. However, this may penalize against pathogenic variants that are on or adjacent to phosphorylation sites, since these are enriched in disordered regions that are generally poorly conserved (Krassowski et al., 2018). In light of this, we sought to develop a classifier that built upon existing algorithms for prioritizing variants but included a greater range of features describing the effect of a mutation on protein structure and function, as well as the broader PPI network, through its effect on direct and/or neighboring phosphorylation sites. We collected and annotated a set of variants of known clinical significance, then used the annotated features to train and test a random forest-based classifier. We find that the classifier, which we call PhosphoEffect, compares favorably to PolyPhen. We thus present a user-ready, open-source pipeline that can take as input a list of single amino acid point mutations and output the predicted probability that each mutation is pathogenic.

Results

Enrichment of Mutations around Phosphorylation Sites

We retrieved all annotated human phosphosites from PhosphoSitePlus (Hornbeck et al., 2015), a comprehensive database of PTMs curated from UniProtKB and published experimental datasets. These were mapped onto missense mutations derived from the COSMIC (Catalogue Of Somatic Mutations In Cancer) database, a comprehensive, manually curated genomic data repository for human cancers (Tate et al., 2019). As described in the Methods section, we took a sample of 100 tumors of each of the main tumor types in the dataset of whole-genome/whole-exome sequence studies, containing a total of 282,019 single amino acid mutations. We evaluated the frequency of mutations on and around phosphorylation sites to test for enrichment (Figure 1). We found that mutations were underrepresented in phosphorylation sites (hypergeometric test, fold enrichment = 0.94, Benjamini-Hochberg (BH)-corrected p = 1.60 × 10−5). In contrast, mutations were significantly overrepresented at positions −4, −3, −2, −1, and +1 (fold-enrichments 1.06, 1.14, 1.03, 1.08, and 1.05 respectively, all BH-corrected p values < 0.05), whereas there was no significant enrichment in either direction at the remaining positions (Table 1).

Figure 1

Enrichment of Mutations in the Vicinity of Phosphorylation Sites in Human Cancers

The solid red line shows the expected number of mutations at each site, whereas the dotted lines are the upper and lower 2.5% bounds of the corresponding hypergeometric distribution, respectively.

Table 1

Enrichment of Mutations in the Vicinity of Phosphorylation Sites in the COSMIC Cancer Mutation Database

Site Relative to Phosphosite	Fold Enrichment of Mutations	BH-Corrected p Value
−5	1.02	0.126
−4	1.06	6.86 × 10⁻⁶
−3	1.14	2.42 × 10⁻⁸
−2	1.03	1.79 × 10⁻²
−1	1.08	3.08 × 10⁻⁸
0	0.94	1.60 × 10⁻⁵
1	1.05	3.54 × 10⁻⁴
2	1.03	4.74 × 10⁻²
3	1.01	0.632
4	1.01	0.662
5	1.01	0.296

All hypergeometric test p -values are Benjamini-Hochberg corrected for multiple testing.

Enrichment of Mutations in the Vicinity of Phosphorylation Sites in Human Cancers The solid red line shows the expected number of mutations at each site, whereas the dotted lines are the upper and lower 2.5% bounds of the corresponding hypergeometric distribution, respectively. Enrichment of Mutations in the Vicinity of Phosphorylation Sites in the COSMIC Cancer Mutation Database All hypergeometric test p -values are Benjamini-Hochberg corrected for multiple testing. We hypothesized that mutations on phosphorylation sites are underrepresented because they are more likely to abolish the phosphorylation site altogether and be deleterious, whereas mutations in the neighborhood of phosphorylation sites that alter the binding affinity of kinases and/or phosphatases may fine-tune the level of phosphorylation and lead to deregulated protein function, rationalizing their overenrichment in tumors.

Evaluating the Success of Existing Predictors on Variants Affecting Phosphorylation Sites

Existing predictors such as PolyPhen do not take into account the effect of mutations on residues of neighboring phosphosites nor do they explicitly consider the network-rewiring effect of mutations affecting phosphorylation. We assessed the performance of PolyPhen on three classes of variants: mutations mapping to phosphorylation sites (class 1), mutations that are not directly phosphorylated but are within five residues of a phosphorylation site (class 2), and mutations that have no phosphorylation sites in the +-5 region (class 3) (Figure 2). A total of 21,037 variants of known clinical significance were retrieved from the ClinVar database as described in the Methods section. Of these, 450 were classified as class 1, 3,452 were class 2, and 17,135 were class 3. PolyPhen classifies variants as “probably damaging” (false positive rate <10%), “possibly damaging” (false positive rate 10%–20%), or benign (false-positive rate >20%). For the purpose of the following evaluation we are classifying predictions of “probably damaging” or “possibly damaging” as pathogenic.

Figure 2

Performance of PolyPhen on Known Variants

The false-positive (A) and false negative (B) rates of the PolyPhen classifier on phosphorylated residues, residues neighboring phosphorylation sites, and all others. The error bars indicate standard errors of the proportion. The difference in false-positive rates between classes was significant (p=2.43×10ˆ-9) and the difference in false-negative rates was marginally significant (p=0.0455). All p-values, including for ad-hoc pairwise testing, are in Table S1.

Performance of PolyPhen on Known Variants The false-positive (A) and false negative (B) rates of the PolyPhen classifier on phosphorylated residues, residues neighboring phosphorylation sites, and all others. The error bars indicate standard errors of the proportion. The difference in false-positive rates between classes was significant (p=2.43×10ˆ-9) and the difference in false-negative rates was marginally significant (p=0.0455). All p-values, including for ad-hoc pairwise testing, are in Table S1. We found that PolyPhen had the highest sensitivity on the class 3 variants, with a true positive rate of 88% indicating that 12% of pathogenic variants were incorrectly classified as benign. In contrast, 16% of pathogenic class 1 variants and 17% of pathogenic class 2 variants were classified as benign. A chi-square test indicated a significant difference between the three proportions (chi-square = 39.67, df = 2, p = 2.43 × 10−9). In ad hoc pairwise testing the only significant difference was between class 2 and class 3 (chi-square = 36.97, df = 1, BH-corrected p = 7.19 × 10−9, Table S1). This indicates that PolyPhen has a tendency to underestimate the pathogenicity of mutations that occur in the neighborhood of phosphorylation sites. In regard to specificity, the false-positive rate was 25% for class 1 variants, 23% for class 2, and 26% for class 3. The differences between the groups were marginally significant (chi-square = 6.18, df = 2, p = 0.0455), and ad hoc pairwise testing showed a marginally significant difference between class 2 and class 3 (chi-square = 6.02, df = 1, p = 0.424, all p values BH corrected for multiple testing). However, the effect size is small and the significance of this difference, if any, is unclear.

The Influence of Mutations on Kinase Recognition Motifs

To test the hypothesis that mutations in the neighborhood of phosphorylation sites could contribute to disease by altering or creating kinase recognition motifs, and thus modifying the strength or regulation of phosphorylation sites, we needed a metric to predict the strength of phosphorylation at a site as a function of its sequence context. To this end, we used the NetPhorest classifier (Horn et al., 2014) as described in the Methods section and derived a score as the sum of the probabilities that a site would be phosphorylated by each possible class of kinase. To illustrate this, an example is shown of p53 (Figure 3), one of the most commonly mutated genes in human cancers (Muller and Vousden, 2013). The residue Ser20 is itself phosphorylated and also lies just downstream of two other phosphosites, Ser15 and Thr18. A point mutation at this position therefore has the potential to have a triple effect on the phosphorylation stoichiometry of the protein.

Figure 3

NetPhorest Outputs for p53 with Point Mutations at Ser20

(A–C) (A) Wild-type sequence, (B) Mutant S20Y, (C) Mutant S20P.

NetPhorest Outputs for p53 with Point Mutations at Ser20 (A–C) (A) Wild-type sequence, (B) Mutant S20Y, (C) Mutant S20P. Figure 3 shows the NetPhorest predictions for these three residues in the wild-type and in two different point mutants: serine to tyrosine (S20Y), which can also be phosphorylated, and serine to proline (S20P), which abrogates phosphorylation at this position. Neither mutation affects the sole ATM/ATR kinase recognition motif of Ser15, so the score on this phosphorylation site would be 0 for both mutants. In contrast, the other two phosphosites are altered by both mutations. S20Y increases the summed strength for Thr18 phosphorylation from 1.03 to 1.1, so has a score of 0.07 on this site, whereas Tyr20 has a summed strength of 0.85 compared with 0.53 on Ser20, a score of 0.32. This results in a total score of 0.39 for the S20Y mutant, reflecting its impact on the strengths of direct and neighboring phosphosites. The less conservative S20P mutant reduces the strength of Thr18 phosphorylation to 0.85 and completely abolishes the Ser20 phosphorylation as proline lacks the phosphorylatable hydroxyl group. The total score for S20P is therefore |(0.85-1.03)| + |(0-0.53)| = 0.71, representing a bigger impact of this mutation on the surrounding phosphorylation pattern. This procedure was applied to the class 2 variants, as described in the previous section, to determine whether there was a correlation between the impact of a variant on local phosphorylation sites and its pathogenicity. The mean score of the pathogenic variants was 0.21, whereas that of the benign variants was 0.20 (Mann-Whitney U = 1,429,324, n1 = 1,575, n2 = 1,877, p = 0.047, Figure S1). Although only marginally significant, this supports our hypothesis that deleterious variants are slightly more likely to influence local phosphorylation patterns than benign ones, and so this impact score was used as a feature for training the classifier. Additionally, since some mutations have multiple phosphorylation sites in their flanking regions, structural and functional features of neighboring phosphosites that were used to estimate the impact of perturbing phosphorylation were weighted according to the predicted impact of the mutation on that phosphosite, as detailed in the Methods section.

Performance of the Random Forest Classifier

A random forest classifier, which we call PhosphoEffect, was trained on a training set of 2,274 benign and 2,400 pathogenic mutants (Tables S2, S3, S4, and S5), as described in the Methods section. The features of the variants which were selected for training are shown in Table 2, and the optimal hyperparameters for the classifier, chosen by grid search cross-validation, are shown in Table 3.The size of the training set was limited by the number of known pathogenic missense variants corresponding to phosphosites or their neighbors. The feature matrices, pre- and post-processing as described in the Methods section, can be found in Tables S6, S7, S8, S9, S10, S11, S12, and S13.

Table 2

Features Used to Train the Classifier

Feature	Source
Molecular weight of wild-type residue (Da)
Molecular weight of mutant residue (Da)
Chemical property of wild-type residue (acidic, basic, polar, nonpolar)
Chemical property of mutant residue
Polyphen score	Polyphen 2.2.2 (Adzhubei et al., 2010)
Secondary structure (helix, sheet, disordered, other)	RING (Piovesan et al., 2016)
Wild-type residue phosphorylated?	PhosphoSitePlus (Hornbeck et al., 2015)
Number of phosphorylated neighbors (+-5)	PhosphoSitePlus
Impact on phosphorylation level	Derived from NetPhorest (Horn et al., 2014)
Network perturbation score	Derived from iPTMnet (Huang et al., 2018) andSTRING database (Szklarczyk et al., 2019)
Number of internal contacts (of neighboring phosphorylated residues)	RING
RAPDF (of neighboring phosphorylated residues)	RING

RAPDF, residue-specific all atom-dependent conditional probability distribution function, RING, residue interaction network generator.

Table 3

Hyperparameters of the Random Forest Classifier

Hyperparameter	Value
Bootstrap	True
Criterion	Gini
Maximum leaf nodes	None
Minimum impurity decrease	1 × 10⁻⁵
Minimum impurity split	None
Minimum samples per leaf	1
Minimum samples per split	2
Number of estimators	100

The optimal hyperparameters, selected by grid search cross-validation, for the random forest classifier are shown.

Features Used to Train the Classifier RAPDF, residue-specific all atom-dependent conditional probability distribution function, RING, residue interaction network generator. Hyperparameters of the Random Forest Classifier The optimal hyperparameters, selected by grid search cross-validation, for the random forest classifier are shown. The performance of PolyPhen and PhosphoEffect were compared on a test set of 568 benign and 600 pathogenic mutants (Tables S14 and S15). PolyPhen had a true-positive rate of 81.2% and a false-positive rate of 25.4%, with an area under the receiver operating characteristic curve (AUROC) of 0.859. Using a 25% cutoff for false-positive rate for fair comparison, PhosphoEffect had a true positive rate of 86.3%, with an AUROC of 0.884 (Figure 4).

Figure 4

Performance of the PhosphoEffect Compared with PolyPhen on the Test Dataset

Receiver operating characteristic (ROC) curves for (A) PhosphoEffect classifier and (B) PolyPhen; (C) PhosphoEffect without PolyPhen feature and their respective areas under the curve are 0.884, 0.859, 0.747.

Performance of the PhosphoEffect Compared with PolyPhen on the Test Dataset Receiver operating characteristic (ROC) curves for (A) PhosphoEffect classifier and (B) PolyPhen; (C) PhosphoEffect without PolyPhen feature and their respective areas under the curve are 0.884, 0.859, 0.747. As with PolyPhen, probability scores associated with a 0%–10% false-positive rate are dubbed “probably pathogenic” and those with a 10%–20% false-positive rate are “possibly pathogenic.” At this cutoff, 82.7% of pathogenic variants in the test set were assigned “possibly” or “probably” pathogenic, with a probability threshold of 0.485 for a variant to be classified as possibly or probably pathogenic. Since the PolyPhen score was used as one of the features to train the classifier, an improvement is to be expected. When the classifier was trained in the absence of PolyPhen score as a feature, the result was an AUROC of 0.747, indicating the features used to update the prediction of variants have reasonable predictive power on their own. Figure 5 shows the relative importance of the 18 features used to train PhosphoEffect (Table S16). Unsurprisingly, given that the aim of the classifier is to update the PolyPhen prediction based on the impact of a mutation on phosphorylation levels, the PolyPhen score is by far the most important predictor, with a feature importance of 0.54. The impact of the mutation on phosphorylation level according to NetPhorest was the second most important 0.10. This supports our hypothesis that the modification of kinase recognition motifs is an important contributor to the pathogenicity of a variant and should be explicitly included in variant prioritization algorithms.

Figure 5

Feature Importances

The relative contribution of each feature on which the classifier was trained to the model predictions; note log-scale of y axis.

Feature Importances The relative contribution of each feature on which the classifier was trained to the model predictions; note log-scale of y axis. Structural features of the variant and its neighboring phosphosites were of intermediate importance, but the network perturbation score is of virtually no importance. This is not surprising because most of the values for this score are zero, mainly due to the highly incomplete annotation of phosphorylation-dependent PPI, which are challenging to assay on a high-throughput scale and come solely from the curation of published low-throughput studies.

Pipeline

We present a pipeline implemented in Python (compatible with versions 3.5 and above) into which the user can input a tab-delimited text file containing a list of point mutations to query (format UniProtKB accession number, residue number, wild-type amino acid, mutant amino acid) to obtain a list of classifications along with associated probabilistic scores. A separate output file logs any errors that are raised during the running of the pipeline, for instance, if a wild-type residue in the input does not match the canonical sequence.

Discussion

In this work we have developed a new classifier for the prioritization of missense variants according to their effect on phosphorylation sites. It takes as a base the score assigned by PolyPhen representing the probability that the variant is pathogenic and updates this score using a variety of features reflecting the impact of the mutation on kinase recognition motifs (resulting in the enhancement, disruption, or deregulation of phosphorylation sites) and the structural and functional impact of such perturbations. On our testing set, which was enriched for mutations on or around phosphosites, PhosphoEffect incorrectly classified 17.3% of pathogenic variants as benign, compared with 18.8% for PolyPhen, with a substantially lower false-positive rate, representing a clear improvement in the accuracy of distinguishing benign and pathogenic variants. Additionally, the impact of a mutation on local phosphorylation strength was the second most important feature in the classifier, after the PolyPhen score. This supports the inclusion of features describing the direct and indirect impact of mutations on phosphorylation sites, and such an approach could identify large numbers of clinically relevant mutations that act indirectly through modifying the strength of neighboring phosphosites, which are overlooked by existing classifiers. In the era of personalized medicine, where patients with cancer are increasingly having their tumor genomes sequenced, the identification of novel pathogenic variants could aid in diagnosis, prognosis, and choice of treatment, as well as targeted drug design. We hope that this computational pipeline will facilitate the identification of new potentially deleterious mutations that rewire phosphorylation-dependent signaling in cancer and can be clinically validated.

Limitations of the Study

However, there remains a lot of work to be done in this area. For instance, our metric for assessing the impact of mutations on the global PPI network through modification of phosphorylation barely contributed at all to the performance of the classifier, as for almost all variants this score was zero. This is most likely due to the scarcity of data on PPI, since iPTMnet is a database curated from evidence from published studies on individual PPI and phosphorylation sites. Since one of the major roles of phosphorylation events is to alter PPIs (Nishi et al., 2011) and phosphorylation is central to many disease-related signaling pathways (Ardito et al., 2017), we do believe that the network-rewiring impact of mutations is crucial to consider when prioritizing variants. Future work would entail creating a new model to predict the dependency of a PPI on a given phosphorylation event, for instance, based on the presence of phosphoprotein binding domains and their recognition motifs (Guo et al., 2019), which could itself be trained on data from the iPTMnet database and would give a more accurate estimation of the impact of a mutation on the protein-protein interaction network. Furthermore, there are many more features that could have been included in the model, such as the solvent accessibility or number of water contacts per phosphorylated residue—these may indicate how likely a phosphorylation event is to modify the structure or interactions of a residue.

Resource Availability

Lead Contact

Sudhakaran Prabakaran email: sp339@cam.ac.uk

Materials Availability

Not applicable

Data and Code Availability

All codes for this work can be obtained from https://github.com/PrabakaranGroup/PhosphoEffect-pipeline.

Methods

All methods can be found in the accompanying Transparent Methods supplemental file.

1 in total

1. Identification of phosphorylation site using S-padding strategy based convolutional neural network.

Authors: Yanjiao Zeng; Dongning Liu; Yang Wang
Journal: Health Inf Sci Syst Date: 2022-09-17

1 in total