Literature DB >> 35532924

Profiling the Human Phosphoproteome to Estimate the True Extent of Protein Phosphorylation.

Anton Kalyuzhnyy1,2, Patrick A Eyers1, Claire E Eyers1,3, Emily Bowler-Barnett4, Maria J Martin4, Zhi Sun5, Eric W Deutsch5, Andrew R Jones1,2.   

Abstract

Public phosphorylation databases such as PhosphoSitePlus (PSP) and PeptideAtlas (PA) compile results from published papers or openly available mass spectrometry (MS) data. However, there is no database-level control for false discovery of sites, likely leading to the overestimation of true phosphosites. By profiling the human phosphoproteome, we estimate the false discovery rate (FDR) of phosphosites and predict a more realistic count of true identifications. We rank sites into phosphorylation likelihood sets and analyze them in terms of conservation across 100 species, sequence properties, and functional annotations. We demonstrate significant differences between the sets and develop a method for independent phosphosite FDR estimation. Remarkably, we report estimated FDRs of 84, 98, and 82% within sets of phosphoserine (pSer), phosphothreonine (pThr), and phosphotyrosine (pTyr) sites, respectively, that are supported by only a single piece of identification evidence─the majority of sites in PSP. We estimate that around 62 000 Ser, 8000 Thr, and 12 000 Tyr phosphosites in the human proteome are likely to be true, which is lower than most published estimates. Furthermore, our analysis estimates that 86 000 Ser, 50 000 Thr, and 26 000 Tyr phosphosites are likely false-positive identifications, highlighting the significant potential of false-positive data to be present in phosphorylation databases.

Entities:  

Keywords:  PeptideAtlas; PhosphoSitePlus; UniProt; database; evolutionary conservation; false discovery rate; mass spectrometry; phosphopeptides; phosphoproteomics; phosphorylation; phosphosites; proteome; proteomics

Mesh:

Substances:

Year:  2022        PMID: 35532924      PMCID: PMC9171898          DOI: 10.1021/acs.jproteome.2c00131

Source DB:  PubMed          Journal:  J Proteome Res        ISSN: 1535-3893            Impact factor:   5.370


Introduction

Protein phosphorylation is a fundamental post-translation modification (PTM) that regulates protein function and is well studied in relation to cell signaling pathways and disease.[1−3] Huge numbers of phosphorylated peptides and sites have been reported and characterized after isolation from human cells using approaches allied to tandem mass spectrometry (LC-MS/MS), focussing primarily on the phosphorylation of “canonical” (established) serine (Ser), threonine (Thr), and tyrosine (Tyr) residues.[4−8] However, large numbers of “noncanonical” phosphorylation sites have also been annotated on proteins from a variety of sources including human cells.[9] This additional complexity highlights the ongoing requirement for careful, evidence-based phosphosite identification from mass spectrometric datasets. Historically, the focused analysis of phosphorylation sites in proteins tended to rely on biochemical analysis including, for example, chromatography and solid-state Edman sequencing.[10−12] However, while giving confidence in phosphosite identification, such low-throughput approaches are now rare, lacking the depth of coverage needed for most large-scale studies. The dominance of MS approaches has led to the development of multiple strategies to both understand and help mitigate the high levels of phosphopeptide false discovery rate (FDR), particularly in sets of mapped peptide spectral matches (PSMs) that result from LC-MS/MS and sequence database analysis.[13−15] The goal of such approaches is to separate true identifications from false ones. Even without considering noncanonical phosphorylation (which is likely to be absent in typical phosphoproteomics pipelines due to its acid-labile nature), many confidently identified phosphopeptides possess multiple Ser, Thr, or Tyr residues that could be differentially modified in a given proteolytically generated peptide.[9] Phosphosite occupancy is variable on any given protein under different biological conditions such that analysis of a peptide containing, for example, two Ser residues that have the potential to be phosphorylated with different dynamics could present evidence for either only one or both being modified, depending on the sample studied.[6,7,16,17] Many phosphorylation events are also substoichiometric, possibly falling below the limit of detection of certain analyses.[7,16] As such, careful data handling and statistical processes should be applied, either within the search engine used for peptide mapping or in a downstream software application to calculate additional statistics, such as a local false localization rate (FLR) or conversely the probability that a given site within a peptide is correct or incorrect. Software/algorithms include phosphoRS,[18] Ascore,[19] Andromeda’s PTM Score,[20] and recently released PTMProphet.[21] We have previously benchmarked the performance of some instrumental parameters and software pipelines for phosphoproteomics,[22] demonstrating that there is considerable variability in how such scores map to robust statistics, such as local or global FLR, depending on the instrument fragmentation mode and resolution. Following confident identification of phosphopeptides and localization of given sites, data tend to be compiled from within a single study or across multiple studies (meta-analysis) to determine the extent of evidence for a given site from multiple PSMs. In general, where there are independent observations of PSMs supporting a phosphosite, it can be reasonably assumed that the evidence for a site to be real increases, although to our knowledge, there are no current statistical models to calculate this phenomenon accurately. Multiple PSMs can be observed per identified phosphosite as a result of either different peptide sequences containing that site, or the same peptide sequence being detected several times.[23] There are some caveats to this logic though, as it is possible for the same PSM to be wrongly assigned to a phosphopeptide multiple times. This can occur if the correct interpretation for the spectrum had a very similar peptide sequence and identical mass to the wrongly assigned phosphopeptide.[16,24] Although LC-MS/MS and computational analysis is generally recognized as very effective and reliable for phosphosite detection, from each study, it is likely that there is some element of remaining false discovery of peptides and sites wrongly localized, depending on the applied statistical thresholds. This is particularly problematic for studies that set relatively weak thresholds for phosphosite localization (e.g., equating to site probability >0.75) to maximize sensitivity—more true positives may be identified, but at the expense of very large numbers of false positives passing the threshold. A multicenter benchmarking study highlighted some of the challenges in practice, showing considerable variability in the number of true-positive, false-positive, and false-negative sites reported across different laboratories, with particular issues arising when a peptide carried multiple phosphate groups.[25] Methods and guidelines for FLR are still evolving and not consistently applied in phosphoproteome studies, and so it is likely that most published studies contain considerable numbers of falsely localized phosphosites.[25−29] This can lead to overestimation of the total number of known true human phosphosites if database providers do not control for FDR across multiple datasets.[30] One such database is PhosphoSitePlus (PSP) which represents a comprehensive, manually curated, and well-cited resource containing experimentally defined PTMs primarily focusing on phosphorylation.[29] As of March 2020, PSP encompassed phosphosite evidence across 17 830 human protein sequences, which are defined as canonical in UniProt (representing the most prevalent protein product per gene,[31] for example). The evidence for phosphorylation comes from manually curated reviews of literature primarily describing tandem MS studies and also low-throughput experiments, or from in-house MS studies.[29] Interestingly, the majority of phosphosites in PSP only have a single piece of evidence associated with their identification (i.e., there is only one study identifying the phosphosite). As mentioned in the PSP documentation itself, researchers should be cautious when accepting such sites as true positives.[29] It is possible that many users of PSP are not aware of the need for caution when reviewing or reusing data, and we are not familiar with any previous effort to assess phosphosite FDR within PSP. A second curated proteomics resource is PeptideAtlas (PA)[26] which is a repository of tandem MS datasets that have been processed through Trans-Proteomic Pipeline to ensure high and consistent quality of phosphopeptide identifications.[32] The latest PA builds incorporate the use of the PTMProphet algorithm for phosphosite localization where each potential phosphosite within an observed PSM is assigned a probability score between 0 and 1 of being phosphorylated.[21] As with PSP, researchers should also be careful when accepting sites in PA with only a single piece of identification evidence (i.e., a single associated PSM) as positively identified phosphosites. Instead, phosphosites that not only have multiple PSM observations in PA but also have high phosphorylation probability scores assigned within the majority of those PSMs are most likely to be true-positive identifications. In addition to PSP and PA, other databases containing data on human phosphosites include UniProt, which collates mostly manually curated phosphosites from the literature but is planned to start incorporating high-throughput derived data in later releases;[31] dbPTM, a server importing data from other resources, but currently unavailable as of July 2021;[33] and PhosphoDB containing results from a set of studies on phosphopeptides derived from multiple proteases.[34] Even with easy access to these accumulated phosphorylation site resources, to our knowledge, no estimates have been made to predict the scale of phosphosite FDR across large datasets. In this work, by profiling the reported human phosphoproteome, we aimed to estimate the false discovery rate of phosphosites with evidence in PSP and/or PA and use these estimates to predict the count of true phosphosites within the currently explored human phosphoproteome. We categorized the sites into sets of various predicted phosphorylation likelihood based on the amount of positive identification evidence reported in PSP and PA, properties not readily available in other databases. Using orthogonal features of phosphosites assigned to these sets, such as evolutionary conservation, sequence properties, and functional annotations, we aimed to demonstrate significant differences between the sets and develop an improved method for independent FDR estimation, which can be used to indicate the extent of true phosphosites within the human phosphoproteome.

Methods

Processing and Categorizing Phosphorylation Data in PSP and PA

Phosphorylation data in PeptideAtlas (PA) (2020 build)[26] was filtered to only include human Ser/Thr/Tyr sites from canonical UniProt protein sequences with at least one PSM observation (1 069 709 sites across 63 616 sequences) (Table S1). The sites were categorized according to the number of PSM observations with a certain phosphorylation probability score assigned by PTMProphet.[21] The counts of observations with a probability of >0.95 were used as “positive” evidence for site phosphorylation. The counts at a probability threshold of ≤0.19 were used as “negative” evidence in favor of a site being a nonphosphosite. The total number of PSM observations per site was considered to distinguish sites for which ≥10% of all associated PSMs had a PTM probability >0.95, from sites where a small minority (<10%) of associated PSMs had this probability. Based on this, selected confidence categories were applied to predict site phosphorylation likelihood in PA (“High”: ≥5 positive observations which accounted for ≥10% of total observations across all probabilities; “Medium”: ≥5 positive observations, which accounted for <10% of total observations or 2–4 positive observations; “Low”: 1 positive observation; “Not phosphorylated”: 0 positive observations and ≥5 negative observations; “Other”: site did not fall into any described categories). PhosphoSitePlus (PSP) data (11/03/20 build; Phosphorylation_site_dataset.gz)[29] was filtered to only include human Ser/Thr/Tyr sites from canonical protein sequences labeled by UniProt identifiers (231 607 sites across 17 830 sequences) (Table S2). The sites were ranked based on the number of times they have been characterized in low/high-throughput studies. The sum of observations across all studies was used to predict site phosphorylation likelihood in PSP (“High”: ≥5 observations; “Medium”: 2–4 observations; “Low”: 1 observation).

Evolutionary Conservation Analysis

To determine the cross-species conservation of all Ser, Thr, and Tyr sites in the reference human proteome,[31] which have phosphorylation evidence in PSP and PA, human reference proteome (20 605 sequences, UniProt ID: UP000005640) and the proteomes of 100 eukaryotic species (50 mammals, 12 birds, 5 fish, 4 reptiles, 2 amphibians, 11 insects, 4 fungi, 7 plants, and 5 protists; Table S3) were downloaded from UniProt (UniProt release 2019_10). Each sequence in the human proteome was used as a query in a BLASTp search (BLAST+ 2.10.0 version)[35] against all 100 eukaryotic proteomes. The BLAST output was processed to extract a top matching significant orthologue (E-value of ≤0.00001) from each species for each human target. Human targets were then aligned with their matched orthologues using the MUSCLE algorithm (version 3.8.31)[36] with default settings if all sequences to be aligned were <2000 amino acids long. If any sequences to be aligned (either the human sequence or any of the orthologue sequences) were ≥2000 amino acids long, two iterations of the algorithm were run using settings for large alignments (-maxiters 2 option).[36] From the alignments, percentage conservation scores were calculated for every Ser, Thr, and Tyr site within each human target out of 100 (all eukaryotic proteomes) and out of the number of aligned orthologues. Conservation percentages were calculated considering any Ser/Thr substitutions in orthologues, whereby an orthologue was included in the count if, for example, a Thr in its sequence was aligned with a Ser in the target human sequence and vice versa. Conservation data was then cross-referenced with PSP/PA datasets to identify sites in the human proteome with phosphorylation evidence in PSP/PA and determine their conservation. To ensure consistency in terms of proteins and sites used, any human protein target for which it was not possible to calculate site conservation either due to the protein having no matches in BLAST (14 proteins), no significant matches in BLAST (236 proteins), no Ser/Thr/Tyr sites in its sequence (1 protein) or due to failed alignments (10 proteins), was excluded from any further analysis (Table S4). Any human targets labeled with the same UniProt identifier in the reference human proteome, PSP and PA, but which corresponded to different protein sequences across the datasets (73 proteins; Table S4) were also excluded. Conservation was assessed for the remaining targets (Table S5) by linear regression models with nonassumed intercept for a simpler interpretation of slope between phosphosites and nonphosphosites. The average conservation of likely phosphosites (sites ranked “High” or “Medium” in PSP and/or PA) was plotted against the average conservation of likely nonphosphosites (sites in “Not phosphorylated” and “Other” sets) within each target protein that had at least three likely phosphosites and three likely nonphosphosites. Conservation scores (%) were also compared across all sites within phosphorylation likelihood sets using box plots.

Analysis of Amino Acids Adjacent to Phosphosites

Target protein sequences (20 271 sequences; Table S6) were processed to identify amino acids at the −1 and +1 proximal positions adjacent to every Ser, Thr, and Tyr site. If a target sequence ended with a Ser, Thr, or Tyr site then its +1 amino acid was marked as “Not found”. For each amino acid, its frequency at each proximal position was first normalized to 1000 and then to its frequency in the prefiltered human reference proteome (expected distribution). Proximal amino acid frequencies around target Ser, Thr, and Tyr in “High in PSP and PA” set were compared to those in the “Not phosphorylated” set and to the expected amino acid distribution. The comparisons were assessed by Fisher’s exact statistical test[37] performed using scipy module in Python[38] with Bonferroni corrections to generate adjusted p values. For each amino acid, any significant difference (Bonferroni corrected p value <0.001) between the compared sets was used to estimate phosphosite false discovery rate across all phosphorylation likelihood sets. FDR estimates assumed that all sites in the highest phosphorylation likelihood set “High in PSP and PA” set were true-positive phosphosite identifications, whereas all sites with the weakest phosphorylation confidence (either the “Not phosphorylated” or the “Other” set) were nonphosphosites. Therefore, the observed count of a certain proximal amino acid in the “High in PSP and PA” (nPos) corresponded to its expected count at 0% FDR, whereas its observed count in the “Not phosphorylated” or “Other” set (nNeg) corresponded to its expected count at 100% FDR. To estimate % FDR in any other phosphorylation likelihood set based on the observed count of the compared proximal amino acid in that set (nObs), we used the following equation The equation has the effect of estimating what proportion of the observed count (nObs) is explained by assumed false positives (nNeg) and what proportion by true positives (nPos). For example, if amino acid X was found at +1 position next to 500 Ser sites in the highest phosphorylation confidence set (0% FDR set; nPos = 500) compared to 10 Ser sites in the “Not phosphorylated” set (100% FDR set; nNeg = 10), and next to 350 sites in the set of interest (nObs = 350), then pSer FDR within the set of interest would be 31%. This would suggest that 31% of sites in that set behave like false-positive pSer in terms of X amino acid frequency at +1 position, whereas 69% of those sites behave like sites in the highest phosphorylation likelihood set (true pSer). An average FDR with 95% confidence intervals (CI) was calculated per each likelihood set using all significantly enriched amino acids around a particular target phosphosite and which had an enrichment of >1.5 relative to the expected distribution. Final FDR estimates were used to derive the total number of true-positive (TP) phosphosite identifications across phosphorylation likelihood sets. To compare FDR/TP estimates between individual PSP and PA sets, the method was replicated using alternative phosphorylation likelihood sets, where sites were categorized according to the highest amount of positive phosphorylation evidence from one database (at least one observation at PTM probability >0.95 in PA; at least one observation in PSP), without taking into account any evidence in the other. Phosphosite FDR estimates within “High” sets in each database were presented as a weighted average between FDR estimates in sites ranked “High” in that database only and sites ranked “High” in both PSP and PA. For example, the FDR in “High in PA” set was a weighted average of FDR estimates in “High in both” set and “High in PA only” set. To analyze phosphosites in UniProt, phosphorylation data for the reference human proteome was downloaded directly from UniProt (June 2021 version; release 2021_04) and processed to split phosphosites according to associated evidence codes from the Evidence and Conclusion Ontology (ECO:0007744—combinatorial computational and experimental evidence imported from large-scale experiments; ECO:0000269—manually annotated experimental evidence; ECO:0000250—similarity evidence based on orthologous sequence). Any target proteins removed earlier (Table S4) were also removed from this analysis. The resulting protein sequences (n = 9481) and sets of Ser, Thr, and Tyr phosphosites were analyzed in terms of adjacent amino acids and conservation using the above method. Phosphosite FDR was calculated for the large-scale study set (ECO:0007744) using “High in PSP and PA” as 0% FDR set and “Not phosphorylated” as the 100% set.

Functional Enrichment Analysis

All protein sequences in the filtered reference human proteome (Table S6) were categorized into sets according to what their highest-ranked Ser, Thr, and Tyr site was in terms of phosphorylation evidence (“High in PSP and PA”, “High in PSP or PA”, “Medium in PSP and/or PA”, “Low in PSP and/or PA”, “Other in PA”, “Not phosphorylated” and “No evidence in PSP or PA”). Each protein set within Ser, Thr, and Tyr datasets was analyzed with DAVID (version 6.8)[39] using all proteins in filtered proteome with any Ser, Thr, or Tyr evidence in PSP or PA (16 296, 14 565, and 12 912 proteins, respectively) as control background. Protein sets containing no reported evidence in PSP or PA were searched against a background of all proteins in the filtered reference proteome to determine any differences in their functional enrichment compared to proteins with PSP/PA evidence. Per each set searched, the top 10 (where possible) significant (Benjamini–Hochberg corrected p value <0.05) functional terms with the highest percentage of proteins mapped were identified, replacing any near-synonymous terms with additional terms from outside the initial top 10. All target protein sets were also searched in UniProt (release 2020_04) to determine the percentage of proteins mapped to UniProt keywords “Phosphoprotein”, “Alternative splicing”, “Nucleus”, “Transcription”, “Acetylation”, “Membrane”, “Glycoprotein”, “Signal”, and “Disulfide bond”.

Secondary Structure Analysis

Categorized Ser, Thr, and Tyr sites in filtered reference human proteome were mapped to protein structures (β strand, helix, turn, and coiled coil) described for those proteins in UniProt (release 2020_04) (Tables S5 and S7). Any target proteins searched in UniProt which were marked as obsolete (15 proteins) or represented different sequences despite being labeled with the same identifier (25 proteins) were removed further and marked as “NA” (Table S5). Normalized (to 1000) counts of target amino acids within protein structures were assessed with Fisher’s exact statistical test[37] using the scipy module in Python[38] to generate p values and indicate any significant enrichment (p < 0.05) between “High in PSP and PA” set and the “Not phosphorylated” set. The method was also applied separately for Ser sites, which had phosphorylation evidence in UniProt, and which were mapped to the described phosphorylation likelihood sets based on PSP/PA evidence.

Results and Discussion

Categorizing All Ser, Thr, and Tyr Annotated Phosphosites in the Human Proteome

We first ranked all Ser, Thr, and Tyr phosphosites in PA and PSP in the filtered reference human proteome according to the amount of accumulated identification evidence (Figure and Tables and S5). The majority of Ser, Thr, and Tyr sites (50.1, 63.3, and 54.3%, respectively) with phosphorylation evidence in PSP were placed into the “Low” phosphorylation likelihood set, meaning that there was only a single piece of evidence supporting their positive identification (Figure A). Furthermore, out of all analyzed Ser, Thr, and Tyr sites with at least one observation at PTM probability >0.95 in PA (suggesting a positive phosphosite identification), 21.7, 34.0, and 33.5%, respectively, were placed in the “Low” set (Figure B and Table ), highlighting that a considerable amount of potential phosphosites only had one piece of positive identification evidence across both databases.
Figure 1

Distribution of serine (Ser), threonine (Thr), and tyrosine (Tyr) phosphosites from UniProt’s reference human proteome that have any positive identification evidence in (A) PhosphoSitePlus (PSP) or (B) PeptideAtlas (PA) based on established phosphorylation likelihood sets (see the Methods section). Venn diagrams provide the counts of (C) Ser, (D) Thr, and (E) Tyr sites ranked “High” in PSP (left), PA (right), and both resources (overlap).

Table 1

Categorizing Serine (Ser), Threonine (Thr), and Tyrosine (Tyr) Sites from UniProt’s Reference Human Proteome into Phosphorylation Likelihood Sets Based on Available Phosphorylation Evidence in PhosphoSitePlus (PSP) and PeptideAtlas (PA)

Phosphorylation likelihood setPhosphorylation evidence per siteSer countThr countTyr count
High in PSP5+ pieces of evidence32 30681619763
Medium in PSP2–4 pieces of evidence34 15412 1977228
Low in PSP1 piece of evidence66 77735 17320 191
High in PA5+ observations at PTM score >0.95 which is ≥10% of total observations26 18642041169
Medium in PA5+ observations at PTM score >0.95 which is <10% of total observations OR 2–4 observations at PTM score >0.9520 51752971460
Low in PA1 observation at PTM score >0.9512 95048951324
Not phosphorylated0 observations at PTM score >0.19 AND 5+ observations at PTM score ≤0.19 AND no evidence in PSP13 89284622184
Other sitesAt least 1 observation in PA but does not fall into any other PA categories AND no evidence in PSP60 22135 00010 009
Distribution of serine (Ser), threonine (Thr), and tyrosine (Tyr) phosphosites from UniProt’s reference human proteome that have any positive identification evidence in (A) PhosphoSitePlus (PSP) or (B) PeptideAtlas (PA) based on established phosphorylation likelihood sets (see the Methods section). Venn diagrams provide the counts of (C) Ser, (D) Thr, and (E) Tyr sites ranked “High” in PSP (left), PA (right), and both resources (overlap). Interestingly, we found that in the human proteome there were more Tyr sites assigned to “High” set in PSP (5+ observations) than Thr sites (Figure A and Table ), indicating a higher initial proportion of pTyr compared to pThr in PSP. The high prevalence of likely true Tyr phosphosites in the PSP dataset could have been a result of in-house studies which identified large numbers of pTyr sites using immunoaffinity strategies not suitable for pSer/pThr discovery,[40,41] and studies that have not been officially published.[29] From PA, it is possible to identify sites for which covering phosphopeptides are observed, but for which the modifications are only localized to other sites in the same peptides, thus providing strong evidence for likely nonphosphosites. Sets of potential Ser, Thr, and Tyr nonphosphosites were therefore established based initially on evidence in PA (Table S8). Those sets were then cross-referenced with data in PSP to determine whether PSP contained any sites ranked as nonphosphosites in PA. Interestingly, we found that 2489 Ser, 1341 Thr, and 891 Tyr sites assigned to the “Not phosphorylated” set in PA were found to have evidence in PSP (Table S9). In fact, out of those potential PA nonphosphosites, 146 Ser, 97 Thr, and 293 Tyr sites were placed into “High” phosphorylation likelihood set according to PSP evidence (Table S9). This strongly indicated the presence of potential false positives in PSP and/or false negatives in PA. For example, Ser42 in protein P17066 (HSPA6) and Ser59 in Q8N488 (RYBP) had 8 and 6 phosphosite identification references in PSP, respectively (mostly from in-house MS studies), but had no positive identification evidence in PA or any mention in UniProt[31] (Table S5). On the other hand, Ser4 in P15927 (RPA2) had 33 phosphosite identification references in PSP and was also mentioned in UniProt’s annotations, but has never been positively localized in any of its 127 associated PSMs in PA (Table S5). To eliminate potential false assignments when considering evidence in both PSP and PA, a site was only categorized as a nonphosphosite if it had no evidence in PSP in addition to having “negative” phosphorylation evidence in PA (Table ). As a result, we established final negative control sets containing 13 892 Ser, 8462 Thr, and 2184 Tyr sites. Similar adjustments were made to the “Other” PA set (sites in that set must have no evidence in PSP) which contained the majority of analyzed PA sites (Table ). Having further cross-referenced sets of sites of various phosphorylation likelihood between PSP and PA (Table S9), we established a “gold standard” set of phosphosites, all of which had “High” phosphorylation likelihood according to both PSP and PA evidence (Table S10). This set contained 16 978 Ser, 2747 Thr, and 986 Tyr highly likely true phosphosites (Figure C–E and Table S10). As for the general agreement between PSP and PA in terms of phosphorylation evidence, we found that 37.7% Ser, 20.5% Thr, and 9.10% Tyr sites with PSP evidence also had at least one observation at PTM probability >0.95 in PA (Table S9). This variation in phosphosites observed between the two databases can be explained by the likely use of different methods for phosphosite detection and localization between PA and the sources referenced in PSP, as well as due to a considerable presence of random false positives in both datasets before thresholding has been applied (see the Introduction section).

Evolutionary Conservation Analysis

Phosphoproteomes from all species are constantly evolving, although many ancient phosphosites are conserved across species and taxa, increasing the likelihood of them being functionally relevant.[42−44] In our analysis, we determined the conservation of all potential Ser, Thr, and Tyr phosphosites and nonphosphosites in UniProt’s filtered reference human proteome across 100 eukaryotic species (Table S5), weighed toward vertebrates, but also including examples of insects, plants, and unicellular eukaryotes (Table S3). In our first analysis, we explored the mean conservation of phosphosites and nonphosphosites per protein (at least three of each present per protein) and performed a correlation analysis across all proteins (Figure ). We fitted linear regression models through the origin, under the theory that proteins unique to humans would have zero conservation for both phosphosites and nonphosphosites. We found great variation between the conservation of both site types, ranging from near zero to 100%, which was mostly dependent on the overall conservation of the protein sequence. However, based on the generated linear regression models, we concluded that on average, Ser, Thr, and Tyr phosphosites (“High” or “Medium” in PSP and/or PA) were around 4.6, 5.4, and 2.0%, respectively, more conserved across all 100 eukaryotes than corresponding likely nonphosphosites (sites in “Not phosphorylated” and “Other” sets) within analyzed proteins when allowing Ser/Thr substitutions toward the conservation score (Figure ). Similar results were obtained when assessing phosphosite conservation only across found orthologues for each protein (Figure S1). The results (Figures and S1) provide additional evidence that phosphosites are generally more conserved than nonphosphosites.[42,43,45] The difference in conservation is thus subtle and variable, but statistically robust. Furthermore, in our analyzed sets of proteins which had at least three likely phosphosites and three likely nonphosphosites, we found 104, 88, and 19 proteins where conservation of Ser, Thr, and Tyr likely phosphosites, respectively, was at least 20% higher than conservation of likely nonphosphosites (Table S11).
Figure 2

Mean % conservation across 100 eukaryotic species of likely (A) Ser, (B) Thr, and (C) Tyr phosphosites and corresponding likely nonphosphosites within each target protein (n = number of proteins analyzed). The regression coefficient (R2) is given by “R_sq”.

Mean % conservation across 100 eukaryotic species of likely (A) Ser, (B) Thr, and (C) Tyr phosphosites and corresponding likely nonphosphosites within each target protein (n = number of proteins analyzed). The regression coefficient (R2) is given by “R_sq”. We next compared the conservation of all sites split by phosphorylation likelihood sets (Figure ) and observed that sites in the highest phosphorylation likelihood set (“High in both PSP and PA”) had the highest average conservation across all 100 eukaryotic proteomes considering Ser/Thr substitutions (average conservation of 58.4, 58.6, and 69.4% across 16 978 Ser, 2747 Thr, and 986 Tyr sites, respectively) (Figure and Table S12). In comparison, the sites in “Low in PSP and/or PA” set had slightly lower average conservation scores of 54.3, 55.4, and 64.0% in 35 126 Ser, 13 253 Thr, and 7471 Tyr sites, respectively (Figure and Table S12). Assuming that high conservation is a property of true phosphosites, the results (Figure ) show that this property was observed more frequently in higher phosphorylation likelihood sets compared to lower ones suggesting higher potential phosphosite FDR in sets with less phosphorylation evidence.
Figure 3

Box plots of conservation percentages (%) across 100 eukaryotic species of human (A) Ser, (B) Thr, and (C) Tyr sites categorized across established phosphorylation confidence sets based on PSP and PA evidence. Within each box, a horizontal line represents median % conservation and (x) symbol represents mean % conservation per group. Each box extends from the 25th to the 75th percentile of each set’s distribution of conservation % values. Vertical lines extending from the boxes correspond to adjacent values. Dots (•) represent outlier values. The red line shows median % conservation in “High in PSP and PA” set for visual comparison.

Box plots of conservation percentages (%) across 100 eukaryotic species of human (A) Ser, (B) Thr, and (C) Tyr sites categorized across established phosphorylation confidence sets based on PSP and PA evidence. Within each box, a horizontal line represents median % conservation and (x) symbol represents mean % conservation per group. Each box extends from the 25th to the 75th percentile of each set’s distribution of conservation % values. Vertical lines extending from the boxes correspond to adjacent values. Dots (•) represent outlier values. The red line shows median % conservation in “High in PSP and PA” set for visual comparison. There were numerous cases in our analysis of likely nonphosphosites and sites with “Low” phosphorylation likelihood where amino acid conservation was also high compared to likely phosphosites, indicative of a conserved function for these amino acids in, for example, catalysis or a biomolecular interaction that is unrelated to phosphorylation. Furthermore, we found 64, 30, and 6 proteins in which the average conservation across 100 eukaryotes of Ser, Thr, and Tyr likely nonphosphosites, respectively, was at least 20% higher than the conservation of corresponding likely phosphosites (Table S11). It is possible that the predicted phosphosites within those proteins were either false positives or were nonfunctional true phosphosites, explaining the comparative weaker selective pressure. In fact, previous reports estimated that as many as 65% of known phosphosites may be nonfunctional as individual sites (although may have a more general structural role) due to limited kinase specificity and therefore have similar evolution rates compared to nonphosphosites which would explain the observed trends.[46,47] It is also possible that some proteins were formed by recent gene fusion events leading to regions containing phosphorylation sites only found in a few closer related orthologues (low conservation), with other protein domains being more highly conserved. In addition, higher evolutionary rates in closely related species (primates, for example) could lead to new protein functions unique to that group of species, further explaining low conservation of some phosphosites in our analysis. We further note that Tyr sites in the highest phosphorylation likelihood set (“High in PSP and PA”) had a higher mean conservation (69.4%) compared to Ser/Thr sites in that set (58.4 and 58.6%, respectively) (Figure and Table S12). There are several possible explanations for this result, including the idea that pTyr is under stronger conservation pressure (i.e., mutations cannot easily be tolerated) in animals that make up the vast majority (84/100) of the species analyzed (Table S3). It is also possible that there is a degree of experimental bias due to the comparison of the much larger set of pSer/pThr to pTyr. The typically higher data quality for pTyr, enhanced by the availability of epitope-specific monoclonal antibodies may also contribute to this phenomenon. Amino acids directly adjacent to known phosphorylation sites are often involved in optimizing substrate capture for subsequent phospho-transfer by the kinase enzymatic machinery.[48−50] Multiple reports specifically highlight the importance of proline (Pro) in the mechanism of phosphorylation for families of kinases such as the cyclin-dependent kinases, mitogen-activated protein kinases, and, more recently, the centrosomal kinase PLK4.[48,51−56] Consequently, there is a high prevalence of Pro in numerous phosphorylation motif sequences as part of Ser/Thr-Pro combinations.[57,58] In our analysis, we identified the frequency of −1 and +1 amino acids relative to a possible phosphosite and compared it across different sets of sites ranked by the relative strength of phosphorylation evidence in Table . We found a strong enrichment of Pro at the +1 position next to Ser and Thr sites in the reference human proteome that were placed in the set with the most phosphorylation evidence (“High in PSP and PA”) (Figure A,B and Table S13). In fact, Pro was observed at the +1 position next to 44.3 and 74.9% of all Ser and Thr sites, respectively, in that set (Table S13). The enrichment of Pro at +1 position around those sites was significant (adj. p value <0.001) in relation to the normalized distribution of Pro in the human proteome, where it is, in fact, only the sixth most observed amino acid (Table S13). The normalized number of observations of Pro at +1 relative to Ser and Thr sites in the highest phosphorylation likelihood set was also significantly (adj. p value <0.001) higher than around Ser/Thr sites in the “Not phosphorylated” set (Figure A,B), where only 2.68% of Ser and 5.67% of Thr sites had Pro at +1 position (Table S13). Therefore, the enrichment of Pro around highly likely Ser and Thr phosphosites suggests that this feature, among others, can be used as a differentiating characteristic for phosphosites compared to nonphosphosites.
Figure 4

Counts of proximal amino acids positioned at (A) +1 around Ser; (B) +1 around Thr; (C) +1 around Tyr; (D) −1 around Ser; (E) −1 around Thr; and (F) −1 around Tyr sites of various phosphorylation likelihood based on evidence in PSP and PA, normalized to the observed distribution of those amino acids in human proteome (represented by dotted baseline fixed at 1). Significant (Bonferroni corrected p value <0.001) enrichment of proximal amino acids in the “High in PSP and PA” set is highlighted by the caret symbol (∧) compared with the “Not phosphorylated” set, and an asterisk (*) compared to the expected amino acid distribution.

Counts of proximal amino acids positioned at (A) +1 around Ser; (B) +1 around Thr; (C) +1 around Tyr; (D) −1 around Ser; (E) −1 around Thr; and (F) −1 around Tyr sites of various phosphorylation likelihood based on evidence in PSP and PA, normalized to the observed distribution of those amino acids in human proteome (represented by dotted baseline fixed at 1). Significant (Bonferroni corrected p value <0.001) enrichment of proximal amino acids in the “High in PSP and PA” set is highlighted by the caret symbol (∧) compared with the “Not phosphorylated” set, and an asterisk (*) compared to the expected amino acid distribution. We also found a significant enrichment of Asp at +1 position next to Ser sites in the highest phosphorylation likelihood set (Figure A). To explain this, we linked the sequences containing those sites to phosphorylation motifs which commonly feature Ser-Asp combinations, including those phosphorylated by Casein kinase II.[57,59] At the −1 positions around target Ser, we found significant enrichment (adj. p value <0.001) of Asp and Gly in the highest phosphorylation likelihood set compared to “Not phosphorylated” set (Figure D). It is possible that the observed enrichment was due to the presence of those amino acids within substrate motifs of Casein Kinase II, CDK5, PKC, and MEKK,[57] suggesting high prevalence of potential true Ser phosphosites. Similar conclusions were made for the enrichment of Gly at −1 around Thr sites in the highest phosphorylation likelihood set (Figure E), which was linked to possible Gly–Thr combinations within PKA, ERK1, and ERK2 kinase substrate motifs.[57] We compared the frequency of significantly enriched amino acids (Bonferroni corrected p value <0.001; enrichment >1.5) across the sites within different phosphorylation likelihood sets and used the comparison to estimate phosphosite false discovery rate across those sets. Using the counts of all four enriched amino acids (Asp and Pro at +1; Asp and Gly at −1) around Ser sites of various phosphorylation likelihood (Figure A,D) and working under the assumption of FDR = 0% in set 1 “High in PSP and PA”, we estimated average Ser phosphosite FDR = 49% (CI ± 12%) in set 2 “High in PSP or PA”; FDR = 54% (CI ± 25%) in set 3 “Medium in PSP and/or PA”; FDR = 84% (CI ± 11%) in set 4 “Low in PSP and/or PA” and FDR = 91% (CI ± 4%) in the “Other” set. Similarly, using the enrichment of Pro at +1 and Gly at −1 around target Thr sites, we estimated Thr phosphosite FDR = 59% (CI ± 7%) in set 2; FDR = 86% in set 3 (CI ± 8%); FDR = 98% (CI ± 5%) in set 4 and FDR = 99% (CI ± 1%) in the “Other” set (Tables A and S14). Our FDR estimates clearly highlight that the majority of Ser and Thr sites with just one piece of phosphosite identification evidence are likely false-positive identifications, and users of these databases can reasonably assume that if a site does not have multiple levels of evidence, then it is unlikely to represent a true phosphorylation site.
Table 2

Counts of Estimated True-Positive (TP) Serine (Ser), Threonine (Thr), and Tyrosine (Tyr) Phosphosites within Sets of Various Phosphorylation Likelihood Based on (A) Combined Evidence and (B) Individual Positive Identification Evidence in PhosphoSitePlus (PSP) or PeptideAtlas (PA)a

For each set, TP counts were derived from the FDR estimates within the set and the overall count of target amino acids in the set.

For each set, TP counts were derived from the FDR estimates within the set and the overall count of target amino acids in the set. In our analysis of proximal sites around target Tyr, we found a significant enrichment (adj. p value <0.001) of Ala, Glu, and Asp at +1 positions, in addition to enriched Ile, Val, and Asp at −1 in “High in PSP and PA” set compared to “Not phosphorylated” set (Figure C,F). We were able to link the enrichment of those proximal sites to their possible involvement in various phosphorylation motifs including EGFR and Abl kinase substrate motifs; PTP1B and PTPRJ phosphatase substrate motifs, and multiple SH2 domain binding motifs,[57,60] therefore indicating a higher frequency of true Tyr phosphosites in the highest confidence set compared to other sets. Using the frequencies of all six enriched proximal amino acids around target Tyr in “High in PSP and PA” (Figure C,F), we estimated FDR = 49% (CI ± 9%) in set 2 “High in PSP or PA”, FDR = 69% (CI ± 5%) in set 3 “Medium in PSP and/or PA”, FDR = 82% (CI ± 9%) in set 4 “Low in PSP and/or PA”, and FDR = 98% (CI ± 4%) in the “Other” set (Tables A and S14). Our FDR estimates varied depending on the selected enriched proximal amino acid in the highest phosphorylation likelihood set (Table S14), and thus the FDR estimates obtained with our method should be seen as approximate indicators of the extent of false positives in a set of sites with quantifiable phosphorylation evidence. Based on our Ser, Thr, and Tyr phosphosite FDR estimates, we predicted that there were around 62 000 Ser, 8000 Thr, and 12 000 Tyr true-positive (TP) phosphosite identifications in the human proteome that were supported by evidence in PSP and/or PA (Table A). Furthermore, the results suggested that 86 000 Ser, 50 000 Thr, and 26 000 Tyr sites with positive phosphorylation evidence in PSP and/or PA (sites in “High”, “Medium”, and “Low” sets) were false positives (Table A). Interestingly, the estimated count of Tyr TPs was higher than the count of Thr TPs, which goes against the general understanding of threonine phosphorylation being more prevalent than tyrosine,[6] although it is difficult to estimate the underlying true distributions, given experimental biases due to availability of different tools and methods. Our results are influenced because there are initially more Tyr sites with “High” or “Medium” phosphorylation evidence than Thr sites, particularly in PSP (Figure A and Table A), where there has been a strong focus to identify Tyr sites using in-house methods. The ratio of the count of sites that have been recorded as “High” in both databases is however 16 978 (pSer), 2747 (pThr), and 986 (pTyr), following more closely previously reported estimates of phosphorylation site frequency. It thus remains to be seen if the pTyr sites reported in PSP, but without independent evidence are true or false. Using the same method, we compared phosphosite FDR between PSP and PA sets by considering positive phosphorylation evidence (“High”, “Medium”, or “Low” sets) in one database without taking into account any evidence in the other (Figure S2 and Table S15). The analysis revealed a generally lower FDR per each set in PA compared to the respective set in PSP, overall suggesting that a higher proportion of analyzed sites in PA are true phosphosites compared to the analyzed sites in PSP (Tables B and S15). As noted in the Introduction, manually curated evidence for phosphorylation sites is also collated in UniProt. However, while this resource provides information pertaining to the publication providing this evidence, the number of individual observations is not reported, preventing a matched analysis being performed with PSP/PA. Nevertheless, we were able to extract all phosphorylation data from the human reference proteome in UniProt and separate phosphosites into sets according to the type of manually curated phosphosite evidence (experimental evidence, combinatorial computational and experimental evidence from large-scale experiments, sequence similarity with an orthologous protein). As before, the sets were analyzed in terms of adjacent amino acids around target phosphosites (Figure S3A–F). Due to potential phosphosite differences and biases associated with different discovery methods (motif frequency, for example), we suggest that our method should only be used to analyze sites from high-throughput studies because it was built primarily using sites of similar evidence type. This is further evident from the conservation analysis of UniProt sites (Figure S3H,I), which revealed different conservation patterns between the set of sites identified by large-scale studies and the other UniProt sets. As a result, we were able to estimate FDR for a set of UniProt sites with evidence from large-scale proteomics studies (Figure S3G). In that set, we estimated average pSer FDR = 7% (CI ± 8%); pThr FDR = 22% (CI ± 14%) and pTyr FDR = 6% (CI ± 7%) (Figure S3G), suggesting that there is a much higher proportion of true-positive phosphosites in UniProt compared to PSP or PA datasets. The FDR difference between pSer and pThr follows the statistical expectation from analyses of large datasets with unbalanced counts of true positives for different residues. For example, if a study reported 1200 phosphosites at 5% FDR, of which 1000 are pSer and 200 are pThr, the false positives (∼60) would assort approximately equally across pSer (∼30 out of 1000 i.e., 3% FDR on pSer) and pThr (∼30 out of 200 i.e., 15% on pThr), meaning that in the vast majority of studies (which do not correct for this issue) the general pThr FDR will be significantly higher than for pSer. For pTyr, the majority of sites comes from separate studies that specifically enrich for pTyr via antibodies, which likely accounts for the pTyr FDR being similar to the pSer FDR.

Functional Enrichment Analysis

In our analysis, we categorized all 20 271 proteins in the filtered human reference proteome (Table S5) according to what their highest-ranked Ser, Thr, and Tyr site was based on phosphorylation likelihood sets in Table . The resulting sets (Table S16) were analyzed in DAVID[39] to compare functional enrichment patterns between phosphorylation likelihood sets. First, we found that across all datasets (Ser, Thr, and Tyr) the protein sets containing sites ranked “High in both PSP and PA” were associated with the most significant (Benjamini–Hochberg adj. p value <0.05) functional groups (Figure S4) suggesting their functional coherence i.e., sharing mappings to keywords, ontology terms or pathways. Interestingly, proteins with sites from “Low in PSP and/or PA” set as their highest-ranked site and proteins which did not have any evidence phosphorylation evidence (“No evidence in PSP or PA” set) were also enriched for numerous functional categories suggesting that they too share some functional properties (Figure S4). Proteins containing sites from the “Not phosphorylated” set as their highest-ranked Ser/Thr/Tyr site were enriched for one significant functional group in the case of Tyr dataset and no functional groups in the case of Ser/Thr datasets, which was likely due to small protein sample size in those sets. To investigate this further, we compared the top 10 enriched functional groups between the protein sets and found that proteins containing Ser, Thr, and Tyr sites with the most phosphorylation evidence (“High in PSP and PA” set) were significantly enriched for categories and terms associated with phosphorylation such as “Phosphoprotein”, “Transcription”, “Nucleus”, and “Alternative splicing” (Figure ), suggesting that those proteins were true phosphoproteins. There is a risk of generating circular evidence here, as the enriched term “Phosphoprotein” is a UniProt keyword, and will have been annotated based on literature evidence, potentially shared with PSP. UniProt does not routinely load phosphorylation evidence from high-throughput datasets, and so classifications of phosphoproteins are generally independent of evidence used in PA. Other enriched keywords have also likely been determined based on independent evidence, and thus we believe are unbiased observations of our sets. Overall, 92.3, 93.9, and 88.2% of proteins containing Ser, Thr, and Tyr sites of the highest phosphorylation likelihood, respectively, were enriched for the term “Phosphoprotein”, which, as per description in UniProt, is a term assigned to a “protein which is post-translationally modified by the attachment of either a single phosphate group, or of a complex molecule, such as 5′-phospho-DNA, through a phosphate group”.[31] Furthermore, those proteins were enriched for “Acetylation” (Figure ) which in some cases might indicate phosphorylation since crosstalk between acetylation and phosphorylation has been frequently reported,[61,62] alongside other modifications such as O-glycosylation.[63] Another enriched function is “Alternative splicing” (Figure ) which is known to be controlled by reversible phosphorylation,[64] further indicating that those proteins likely contain functional phosphosites. However, it is possible that this enrichment could correlate with the depth of analysis of the mentioned proteins rather than their phosphorylation likelihood since extensively studied gene products (and abundant proteins with more easily detectable phosphosites) are likely to have better quality data associated with isoform identification and be consequently linked to “Alternative splicing”.
Figure 5

Top 10 functional categories for which protein sets containing various highest-ranked (A) Ser, (B) Thr, and (C) Tyr sites based on the amount of available phosphorylation evidence (“High in PSP and PA”, “Low in PSP and/or PA”, “No evidence in PSP or PA”) were significantly enriched in DAVID (Benjamini–Hochberg corrected p value <0.05). For each protein set, the % of proteins enriched for a particular functional category is given as well as the log 2(fold enrichment) for that set. The number of proteins in each set is presented by n.

Top 10 functional categories for which protein sets containing various highest-ranked (A) Ser, (B) Thr, and (C) Tyr sites based on the amount of available phosphorylation evidence (“High in PSP and PA”, “Low in PSP and/or PA”, “No evidence in PSP or PA”) were significantly enriched in DAVID (Benjamini–Hochberg corrected p value <0.05). For each protein set, the % of proteins enriched for a particular functional category is given as well as the log 2(fold enrichment) for that set. The number of proteins in each set is presented by n. In comparison, proteins that only had sites from “Low in PSP and/or PA” set as their highest-ranked Ser, Thr, and Tyr sites (i.e., proteins which did not have sites with strong phosphorylation evidence) were not enriched for clear phosphorylation-associated terms and were instead enriched for categories such as “Glycoprotein”, “Signal” and “Disulfide bond” and “Membrane” (Figure ), suggesting that the majority of those proteins were likely nonphosphoproteins and their associated phosphosites with weak evidence were therefore likely false positives. Assuming that sites with no phosphorylation evidence in PSP or PA are likely nonphosphosites (although it is possible that phosphorylation has not been investigated or localized yet), potential high FDR in the “Low in PSP and/or PA” set was further supported by proteins with no phosphorylation evidence being enriched for similar functional groups (Figure ). In fact, we observed a clear decrease in the proportion of proteins enriched for phosphorylation-associated functional groups (where a set was enriched for at least 10 functional groups) going across our established sets suggesting higher phosphosite FDR in lower confidence sets (Figure S5). Our investigation of UniProt terms linked to protein sets revealed that the enrichment for the term “Phosphoprotein” and other terms likely to be associated with phosphorylation (“Alternative splicing”, “Nucleus”, “Acetylation”, “Transcription”) generally decreased across the sets of reduced confidence, which suggested higher FDR in sets with less phosphorylation evidence (Figure ). For example, only 13.0, 31.7, and 36.3% of all proteins, which had Ser, Thr, and Tyr sites, respectively, from “Low in PSP and/or PA” phosphorylation likelihood set as their most confident site, were marked as phosphoproteins in UniProt (Figure and Table S17), suggesting that most proteins in those sets were not phosphoproteins and further highlighting that the associated sites with only a single piece of evidence are likely false-positive identifications.
Figure 6

Percentage of proteins within sets containing (A) Ser, (B) Thr, and (C) Tyr sites of various phosphorylation likelihood as their highest-ranked site, annotated with specific UniProt keywords. The number of proteins in each set is presented by n.

Percentage of proteins within sets containing (A) Ser, (B) Thr, and (C) Tyr sites of various phosphorylation likelihood as their highest-ranked site, annotated with specific UniProt keywords. The number of proteins in each set is presented by n.

Secondary Structure Analysis

We also investigated whether Ser, Thr, and Tyr sites with strong phosphorylation evidence were located more frequently within specific protein secondary structures, compared to sites with less evidence. For example, previous analysis of thousands of phosphosites from multiple species identified hotspots within domain families of proteins, particularly near domain interfaces and adjacent to catalytic residues, where they presumably regulate enzymatic output.[65,66] We found that significantly more (Fisher’s test p value <0.05) Ser, Thr, and Tyr sites with the strongest phosphorylation evidence (“High in PSP and PA” set) were localized within coiled coils compared to sites in the “Not phosphorylated” set (Figure ). This might readily be explained by coiled coils being frequently found in transcription factors, the activity or subcellular location of which is often dependent on phosphorylation.[67−69] Therefore, the results in Figure further indicated that there were more potential true Ser, Thr, and Tyr phosphosites in “High in PSP and PA” set than in other sets. In terms of other analyzed protein structures (β strand, turn, α helix), there was no significant enrichment of sites from the highest phosphorylation confidence set within those structures compared to the “Not phosphorylated” set (Figure ). In fact, our current reading of the literature suggests that it is still unclear whether phosphorylation sites are found on average to be localized more or less frequently within β strands, turns, or α helices, though clear evidence for localization of PTMs at functionally important loci in proteins has been previously presented.[70]
Figure 7

Normalized counts of (A) Ser, (B) Thr, and (C) Tyr amino acids of various phosphorylation likelihood based on evidence in PSP and PA, which are found within protein structures (β strand, α helix, turn, coiled coil). Significant (Fisher’s test p value <0.05) enrichment of amino acids from “High in PSP and PA” set within protein structures is highlighted by the dot symbol (•) compared with the “Not phosphorylated” set.

Normalized counts of (A) Ser, (B) Thr, and (C) Tyr amino acids of various phosphorylation likelihood based on evidence in PSP and PA, which are found within protein structures (β strand, α helix, turn, coiled coil). Significant (Fisher’s test p value <0.05) enrichment of amino acids from “High in PSP and PA” set within protein structures is highlighted by the dot symbol (•) compared with the “Not phosphorylated” set.

Conclusions

In our analysis, we ranked all potential Ser, Thr, and Tyr phosphosites in UniProt reference human proteome according to how much quantitative and qualitative phosphorylation evidence they were assigned in PSP and PA databases. Having analyzed the sites and the proteins that contain them in terms of conservation, proximal site patterns, functional enrichment, and structural properties, we established that Ser, Thr, and Tyr sites with weak phosphosite identification evidence, particularly sites with a single piece of supporting evidence, were likely to be false-positive identifications. This finding was further confirmed by FDR estimations across the established phosphorylation likelihood sets which revealed phosphosite FDR of 84, 98, and 82% in sets of Ser, Thr, and Tyr sites, respectively, where only one piece of identification evidence was present. Since there is a considerable presence of such sites in PSP and PA datasets, our results implied high FDR in both those datasets, although PSP was predicted to have a generally higher proportion of false-positive phosphosites compared to PA. This is potentially a cause for concern since many potential false positives are presented to scientists as true phosphosites, without a clear explanation of the likelihood of such claims. Nevertheless, using our FDR estimates we predicted that there are around 62 000 Ser, 8000 Thr, and 12 000 Tyr true-positive phosphosites in the human proteome that are supported by evidence in PSP and/or PA. These estimated counts are lower than other published estimates[29,30,71,72] particularly for Ser/Thr sites, presumably due to the previous inclusion of false positives and subsequent overestimation of the number of true phosphosites. We conclude that researchers must be aware of the potential for false-positive sites in both public and self-generated databases and should always evaluate the evidence behind the phosphosites used in their research. As a general rule, phosphorylation sites with <5 independent observations should be treated with caution, and those with only one observation in a database are likely to be false positives. In a recent phosphoproteomics study from our group, we demonstrated the utility of the classification presented here, by matching the sites identified by LC-MS/MS to their evidence categories from PSP and PA.[73] For new phosphoproteomics studies, it will be common for some ambiguity to remain regarding phosphosite localization, and many sites will be observed that would not pass a 1 or 5% false localization rate cutoff from a single dataset, but for which there may be some supporting evidence. By evaluating new datasets in combination with all of the evidence collated from a large number of previous studies, greater confidence can be assigned to “borderline” significant phosphosites that may indeed be correct, or conversely, sites with weak evidence that have never been reported can be rejected. Here, we have provided a methodological framework for estimating global FDR in large-scale phosphorylation datasets, which does not rely on native scores from search engines or site localization software. Methods for estimating global FDR in meta-analyses of phosphosites are not yet robust, and thus we would recommend that other groups profile orthogonal properties of ranked sets, as we have done here, to estimate the real distribution of true and false phosphosites in their data.
  72 in total

1.  A rapid method for determining protein kinase phosphorylation specificity.

Authors:  Jessica E Hutti; Emily T Jarrell; James D Chang; Derek W Abbott; Peter Storz; Alex Toker; Lewis C Cantley; Benjamin E Turk
Journal:  Nat Methods       Date:  2004-10       Impact factor: 28.547

2.  Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry.

Authors:  Joshua E Elias; Steven P Gygi
Journal:  Nat Methods       Date:  2007-03       Impact factor: 28.547

3.  Non-functional phosphorylations?

Authors:  Gustav E Lienhard
Journal:  Trends Biochem Sci       Date:  2008-07-04       Impact factor: 13.807

4.  A structural basis for substrate specificities of protein Ser/Thr kinases: primary sequence preference of casein kinases I and II, NIMA, phosphorylase kinase, calmodulin-dependent kinase II, CDK5, and Erk1.

Authors:  Z Songyang; K P Lu; Y T Kwon; L H Tsai; O Filhol; C Cochet; D A Brickey; T R Soderling; C Bartleson; D J Graves; A J DeMaggio; M F Hoekstra; J Blenis; T Hunter; L C Cantley
Journal:  Mol Cell Biol       Date:  1996-11       Impact factor: 4.272

5.  Evaluation of Parameters for Confident Phosphorylation Site Localization Using an Orbitrap Fusion Tribrid Mass Spectrometer.

Authors:  Samantha Ferries; Simon Perkins; Philip J Brownridge; Amy Campbell; Patrick A Eyers; Andrew R Jones; Claire E Eyers
Journal:  J Proteome Res       Date:  2017-08-11       Impact factor: 4.466

6.  Mechanism of activation of protein kinase B by insulin and IGF-1.

Authors:  D R Alessi; M Andjelkovic; B Caudwell; P Cron; N Morrice; P Cohen; B A Hemmings
Journal:  EMBO J       Date:  1996-12-02       Impact factor: 11.598

7.  PhosFox: a bioinformatics tool for peptide-level processing of LC-MS/MS-based phosphoproteomic data.

Authors:  Sandra Söderholm; Petteri Hintsanen; Tiina Öhman; Tero Aittokallio; Tuula A Nyman
Journal:  Proteome Sci       Date:  2014-06-26       Impact factor: 2.480

Review 8.  Computational phosphoproteomics: from identification to localization.

Authors:  Dave C H Lee; Andrew R Jones; Simon J Hubbard
Journal:  Proteomics       Date:  2015-02-17       Impact factor: 3.984

9.  dbPTM in 2019: exploring disease association and cross-talk of post-translational modifications.

Authors:  Kai-Yao Huang; Tzong-Yi Lee; Hui-Ju Kao; Chen-Tse Ma; Chao-Chun Lee; Tsai-Hsuan Lin; Wen-Chi Chang; Hsien-Da Huang
Journal:  Nucleic Acids Res       Date:  2019-01-08       Impact factor: 16.971

10.  Mechanism of conditional partner selectivity in MITF/TFE family transcription factors with a conserved coiled coil stammer motif.

Authors:  Vivian Pogenberg; Josué Ballesteros-Álvarez; Romana Schober; Ingibjörg Sigvaldadóttir; Agnieszka Obarska-Kosinska; Morlin Milewski; Rainer Schindl; Margrét Helga Ögmundsdóttir; Eiríkur Steingrímsson; Matthias Wilmanns
Journal:  Nucleic Acids Res       Date:  2020-01-24       Impact factor: 16.971

View more
  2 in total

1.  Progress Identifying and Analyzing the Human Proteome: 2021 Metrics from the HUPO Human Proteome Project.

Authors:  Gilbert S Omenn; Lydie Lane; Christopher M Overall; Young-Ki Paik; Ileana M Cristea; Fernando J Corrales; Cecilia Lindskog; Susan Weintraub; Michael H A Roehrl; Siqi Liu; Nuno Bandeira; Sudhir Srivastava; Yu-Ju Chen; Ruedi Aebersold; Robert L Moritz; Eric W Deutsch
Journal:  J Proteome Res       Date:  2021-10-20       Impact factor: 5.370

2.  Method for Independent Estimation of the False Localization Rate for Phosphoproteomics.

Authors:  Kerry A Ramsbottom; Ananth Prakash; Yasset Perez Riverol; Oscar Martin Camacho; Maria-Jesus Martin; Juan Antonio Vizcaíno; Eric W Deutsch; Andrew R Jones
Journal:  J Proteome Res       Date:  2022-05-31       Impact factor: 5.370

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.