Method for Independent Estimation of the False Localization Rate for Phosphoproteomics.

Kerry A Ramsbottom¹, Ananth Prakash², Yasset Perez Riverol², Oscar Martin Camacho¹, Maria-Jesus Martin², Juan Antonio Vizcaíno², Eric W Deutsch³, Andrew R Jones¹.

Abstract

Phosphoproteomic methods are commonly employed to identify and quantify phosphorylation sites on proteins. In recent years, various tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified or to estimate the global false localization rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic datasets, and their statistical reliability on real datasets is largely unknown, potentially leading to studies reporting incorrectly localized phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring modifications on a decoy amino acid, that is, one that cannot be modified, to allow for independent estimation of global FLR. We test a variety of amino acids, on both synthetic and real data sets, demonstrating that the selection can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys. We propose the use of a decoy amino acid to control false reporting in the literature and in public databases that re-distribute the data. Data are available via ProteomeXchange with identifier PXD028840.

Entities: Chemical

Keywords: database searching; false localization rate; phosphoproteomics; software; statistics

Substances: Amino Acids

Year: 2022 PMID: 35640880 PMCID: PMC9251759 DOI: 10.1021/acs.jproteome.1c00827

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 5.370

Introduction

There is great research interest in studying post-translational modifications (PTMs) of proteins, due to their importance in cell signaling, as a rapid means by which proteins change their function, and their implication in almost all known disease processes. The most widely studied reversible modifications include phosphorylation (by far the most studied one and our primary focus here), acetylation, methylation, and attachment of small proteins, such as ubiquitin and SUMO. High-throughput tandem mass spectrometry (MS) is commonly used for detection and localization of phosphorylation sites on proteins, using so-called phosphoproteomic methods. Typically in these methods, proteins are first extracted from samples and digested with an enzyme such as trypsin, and then phosphorylated peptides are enriched in the sample, for example, using TiO2 or another metal ion attached to a column (affinity chromatography), to which phosphate binds preferentially. The bound peptides are then eluted and analyzed by liquid chromatography–MS.[1] In the common analysis mode used in phosphoproteomics, data-dependent acquisition is performed to fragment the most abundant peptides observed. The MS2 fragmentation spectra (plus the mass/charge of the intact precursor) are then used to identify peptide sequences, e.g., using sequence database search software. In this approach, the spectra are searched against a theoretical digest of the proteome (i.e., a peptide sequence database) for the given species, taking into account the variable modifications selected. For phosphorylation, most users search for phosphorylation on the canonical Ser, Thr, and Tyr (STY) residues, where the vast majority of detectable phosphorylation resides in eukaryotic systems. The search engine then considers every STY residue with and without the addition of the phosphate mass (+79.97 Da), greatly increasing the size of the search space, with a corresponding reduction in statistical power for peptide identification.
Confident peptide identification is governed by the quality of the match between the observed spectrum and the theoretical spectrum expected for a peptide from the sequence database, from which local statistics such as p-values or e-values are usually calculated, as well as sometimes a posterior error probability (PEP). If a PEP value is calculated, 1 − PEP gives the probability that a given peptide-spectrum match (PSM) is correct. There are of course many different proteomics search engines, including commercial, free, and/or open-source packages; for a review, see Verheggen et al.[2] An important consideration for phosphoproteomics is the confidence that a given site within a protein has been correctly identified as being phosphorylated. Ambiguity in this regard may occur when a confident PSM has more STY residues than n, where n is the number of phosphorylation modification instances detected, that is, intact peptidoform mass = peptide sequence mass + (n × 79.97 Da). In this case, the search engine itself or a downstream analysis package calculates statistics related to each of the n phosphosites within a peptide, such as a PEP that the site has been incorrectly localized (sometimes also called the local false localization rate, local FLR) or another ad hoc score. As for PSMs, if an accurate PEP can be estimated, then 1 − PEP_site gives the probability that the site has been correctly localized, in this case assuming that the PSM is definitively correct. Correct site localization can be critically important for downstream uses of data. As one example, there are completely different kinases and phosphatases involved in Ser/Thr versus Tyr phosphorylation, and thus, biological conclusions as to the upstream and downstream signaling pathways would be completely different.
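The ambiguity condition described above can be illustrated with a minimal Python sketch. The function names and the mass tolerance are hypothetical and chosen only for illustration; the sketch also assumes that phosphorylation is the only modification contributing to the mass difference, which is a simplification of a real search.

```python
PHOSPHO_MASS = 79.97  # Da, phosphate mass shift quoted in the text


def count_phospho(peptidoform_mass, peptide_mass, tol=0.02):
    """Infer n from: peptidoform mass = peptide mass + n * 79.97 Da.

    Assumption: no other modifications are present, and `tol` (Da) is an
    illustrative tolerance, not a value taken from the study.
    """
    n = round((peptidoform_mass - peptide_mass) / PHOSPHO_MASS)
    if n < 0 or abs(peptide_mass + n * PHOSPHO_MASS - peptidoform_mass) > tol:
        raise ValueError("mass difference is not a whole number of phosphates")
    return n


def is_localization_ambiguous(sequence, n_phospho, targets="STY"):
    """Localization is ambiguous when candidate residues outnumber phosphates."""
    candidates = sum(sequence.count(aa) for aa in targets)
    return candidates > n_phospho
```

For example, a singly phosphorylated peptide containing one Ser and one Thr has two candidate sites for one phosphate, so a localization score is needed; a peptide with a single Ser and no Thr or Tyr is unambiguous.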
Even where ambiguity relates to different, say, Ser residues in a peptide, nearby amino acid motifs allow inference of the kinase responsible for phosphorylating the site, and thus, incorrect site determination could lead researchers to making incorrect assumptions and conclusions. Many of the published site localization algorithms were benchmarked by the originating authors, and scores were calibrated based on synthetic data sets, with a known “ground truth”, that is, where the sites of phosphorylation were known.[3] There have also been some independent efforts to benchmark different site localization tools, showing that the choice of tool does alter the global statistics,[4,5] that is, sensitivity—how many sites in a whole data set can be correctly localized at a suitable overall (global) FLR. While tools continue to improve and become more widely used for ensuring confident site localization, there remain several unsolved challenges for the field, as follows. First, it is unclear whether findings on synthetic data sets can be extrapolated to genuine biological data sets that have generally a higher level of complexity, and synthetic data sets may have amino acid frequencies near to modification sites, which do not well reflect natural samples. Second, for analysis of real experimental data sets, there are no commonly used methods for independent estimation of global FLR, for example, that allow a researcher to ask the question—how many sites have we confidently identified and how many are likely false positives in the whole data set. For regular peptide/protein identification in proteomics, decoy database search methods are now almost ubiquitously used for estimating peptide/protein false discovery rate (FDR), since they give a search engine-independent statistic that is easy to understand. There is no generally accepted method for calculating the same type of statistic for PTM site identification. 
Third, our groups are interested in very large-scale re-analysis of public PTM enriched data sets, via a project called PTMeXchange. We wish to have methods that allow for accurate calculation of the probability that a given PTM site has been observed in a meta-analysis of data sets, where there could be potentially multiple PSMs from different studies supporting a given site. To our knowledge, there are no suitable statistical models for combining different evidence streams. In this work, we explore the concept of using decoy amino acid(s) for estimation of site localization statistics (i.e. global FLR), in this context, defined as one that we know cannot be modified, to model the distribution of false localizations detected from a processing pipeline, resulting from both incorrect peptide identifications and incorrect localization of sites on correct peptide identifications. We test a range of different amino acids for their suitability as a decoy in synthetic and real data sets as well as demonstrate the results obtained from several common proteome informatics pipelines. The concept of using a decoy amino acid for localization of PTMs is not a new one. It has been used before in several previous publications and approaches.[6] However, to our knowledge, no publication has yet validated the statistics associated with the use of decoy amino acids, particularly on multiple tools, and the method of using a decoy amino acid has not gained widespread use in the field. The majority of PTM-based studies still relies on using ad hoc score thresholds for determining whether PTMs have been correctly identified or not. From the results we present, we make some recommendations as to how we believe large-scale PTM-enriched studies should be analyzed to control the local and global FLR. While we have focused on computational analysis and pipelines for phosphorylation, general approaches and conclusions are largely applicable to other types of PTM readily detected by MS. 
The code used for the analysis is in GitHub: https://github.com/PGB-LIV/PhosphoFLR.

Methods

Our overall goal is to demonstrate methods for controlling and understanding FLR, rather than benchmarking tools per se, although we wished to demonstrate the reproducibility of methods in different pipelines. As such, we tested four commonly used analysis pipelines: Trans-Proteomic Pipeline (TPP)[7] including Comet search[8] and PTMProphet site localization;[9] MaxQuant including PTMScore;[10] Proteome Discoverer, including Mascot search and ptmRS localization;[11] and PEAKS DB search with A-Score.[12] We tested the effects on global FLR of selecting the localization on different amino acids as “decoy” and profiled the frequency of potential decoy amino acids relative to assumed correct STY phosphorylation sites, to see which provides the decoy distribution best matching the target distribution, that is, other STY sites to which the site could be wrongly localized. The MS proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE[13] partner repository with the dataset identifier PXD028840 and DOI 10.6019/PXD028840. Four data sets were used for evaluation of methods for estimating global FLR: two synthetic data sets, one model plant phosphoproteomics data set (from Arabidopsis thaliana), and one human phosphoproteomics data set. The raw files of the four data sets searched were obtained from the ProteomeXchange Consortium[14] via the PRIDE repository.[15] These included ten files from the PXD007058[5] synthetic data set (files named “HCDOT” pools 1–5, reps 1 and 2), ten files from the PXD000138 synthetic data set,[16] twelve files from the PXD008355[17] Arabidopsis data set (rapamycin treated), and six from the PXD000612[18] human data set (files randomly selected). The PXD007058 and PXD000138 data sets contain synthetic phosphopeptide libraries.
The use of synthetic phosphopeptides allowed us to define FLR (through one method) by comparing the results from our search pipelines to the known phosphopeptide sequences to determine if our analyses correctly localize the phosphosites. The PXD008355 A. thaliana data set and the PXD000612 human data set are both biological data sets with unknown phosphosites. Databases were created for the searches of each data set. The PXD007058 search database consisted of the synthetic peptides;[5] the PXD000138 search used IPI human sequences and phosphopeptide libraries matched to the original study;[16] the PXD008355 Arabidopsis search database contained Araport11[19] sequences; and the PXD000612 human search database was created from the Level 1 PeptideAtlas Tiered Human Integrated Search Proteome,[20] containing core isoforms from neXtProt.[21] Each search database also contained the cRAP contaminant sequences (https://www.thegpm.org/crap/, last accessed October 2021). Decoys across all four databases were generated for each entry using the de Bruijn method (with k = 2).[22] Using the TPP,[7] the data set files were first searched using Comet.[8] The resulting files were then combined and processed using PeptideProphet,[23] iProphet,[24] and PTMProphet.[9] For PXD007058, in addition to searching for phosphorylation, we also searched for pyrophosphorylation modifications, which pilot searches had shown to be unintentionally present on some synthetic peptides. The key Comet search parameters for each data set are shown in Table 1; the key parameters used for the other pipelines are shown in Supporting Information Table S1.

Table 1

Comet Search Parameters for Each Data Set

	PXD007058 (synthetic data set)	PXD000138 (synthetic data set)	PXD008355 (Arabidopsis data set)	PXD000612 (human data set)
peptide mass tolerance	20.0 ppm	5 ppm	7 ppm	7 ppm
fragment bin tolerance	0.02 Da	0.02 Da	0.02 Da	0.02 Da
digest mode	tryptic	tryptic	tryptic	tryptic
max missed cleavages	4	4	2	2
fixed mods	carbamidomethylation (C)	carbamidomethylation (C)	carbamidomethylation (C)	carbamidomethylation (C)
variable mods	oxidation (MWP)	oxidation (M)	oxidation (M)	oxidation (M)
	phospho (STYX)ᵃ	phospho (STYX)ᵃ	phospho (STYX)ᵃ	phospho (STYX)ᵃ
	pyrophospho (STY)ᵇ, N-terminal acetylation	N-terminal acetylation	N-terminal acetylation	N-terminal acetylation
	ammonia loss (QC)	ammonia loss (QC)	ammonia loss (QC)	ammonia loss (QC)
	pyro-Glu (EQ on the N-terminus)	pyro-Glu (EQ on the N-terminus)	pyro-Glu (EQ on the N-terminus)	pyro-Glu (EQ on the N-terminus)
	deamination (NQ)	deamination (NQ)	deamination (NQ)	deamination (NQ)
max variable PTMs	5	5	5	5

ᵃ X corresponds to the different decoy amino acid searched: Ala, Gly, Leu, Asp, Glu, or Pro.

ᵇ Preliminary analysis of the data set detected that several peptides had been manufactured with pyrophosphate modification rather than the intended phosphate, which can cause apparent errors when comparing the results to the answer key if they are not accounted for.

Downstream Data Analysis

The data from searching with TPP were processed downstream by custom Python scripts (https://github.com/PGB-LIV/PhosphoFLR). First, the global FDR was calculated from the decoy counts, and the PSMs were filtered for 1% PSM FDR. From these filtered PSMs, a site-based file was generated giving separate localization scores for each phosphosite found on each PSM, removing decoy and contaminant hits. These site-based PSMs were ordered by a combined probability, calculated by multiplying the PSM probability by the localization probability. In the processing pipeline from TPP, iProphet calculates a probability that a given PSM is correct, and PTMProphet calculates a probability for the site assignment. We demonstrated that there is almost no meaningful correlation (r² ≈ 0.01) between these probabilities (Supporting Information Figure S1), and thus, we conclude that these probabilities are sufficiently independent that they can be multiplied to arrive at a final probability that a given site’s identification is supported by the given spectrum. For the PEAKS search, the PSM score and A-scores for targets and decoys were modeled based on the counts of targets and decoys per histogram score bin, to generate similar probability estimates (code provided in the GitHub repository). For MaxQuant and Mascot searches, the PSM probability values were calculated as 1 − PEP values (reported natively by the pipeline), with the PTM probabilities taken directly from the PTM-score and ptmRS probabilities, respectively. For the synthetic peptide search, these site-based results were then filtered further to allow comparison with the synthetic peptide known localization key. Partial peptides and PSMs with an incorrect phosphorylation count compared to what was expected from the answer key, or with other additional modifications, were removed. The remaining PSMs were then compared against the synthetic peptide answer key to determine if the phosphosites had been correctly identified.
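The first steps of this workflow (PSM-level FDR filtering followed by generation of site-based rows ordered by combined probability) can be sketched as below. This is an illustrative sketch only, not the code from the PhosphoFLR repository: the dictionary keys are hypothetical, FDR is estimated here as decoys/targets (one common convention, assumed rather than taken from the source), and contaminant removal is omitted for brevity.

```python
def filter_psms_at_fdr(psms, max_fdr=0.01):
    """Keep target PSMs in the largest ranked prefix with decoy/target <= max_fdr.

    Assumption: each PSM is a dict with hypothetical keys 'psm_prob' and
    'is_decoy'; FDR is approximated as (decoy count) / (target count).
    """
    ranked = sorted(psms, key=lambda p: p["psm_prob"], reverse=True)
    targets = decoys = cutoff = 0
    for i, psm in enumerate(ranked, start=1):
        if psm["is_decoy"]:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = i
    return [p for p in ranked[:cutoff] if not p["is_decoy"]]


def site_rows(psms):
    """Emit one row per scored phosphosite, ordered by combined probability."""
    rows = []
    for psm in psms:
        # 'sites' maps a residue position to its localization probability.
        for position, site_prob in psm["sites"].items():
            rows.append({
                "peptide": psm["peptide"],
                "position": position,
                "combined_prob": psm["psm_prob"] * site_prob,
            })
    rows.sort(key=lambda r: r["combined_prob"], reverse=True)
    return rows
```

Multiplying the two probabilities is only justified when they are approximately independent, which is the point of the correlation check described in the text.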
For analysis of the synthetic data only, results were ordered by the corresponding site localization probability rather than the combined probability because, owing to the small size of the search database, not all pipelines could produce accurate estimates of PSM probability. For all other data sets, the sites reported for each analysis method were ordered by combined probability. Global FLR was estimated for every ranked site, from which we can then later apply a threshold at the lowest scoring site that delivers a desired global FLR (e.g., 1, 5, or 10%), similar to the q-value approach for standard database searching. The global FLRs for all of the data sets were estimated using two methods: the model FLR method and the decoy FLR method. For the synthetic data sets, a third method was also used, called the answer key method (comparing against the known phosphosite answer key).
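The thresholding step can be illustrated with a short sketch. It assumes a list of global FLR estimates, one per row of the ranked site list (best-scoring first), and applies the usual q-value convention of taking the minimum FLR at that row or any lower-ranked row; the function names are hypothetical and this is not presented as the authors' exact procedure.

```python
def to_q_values(flr_by_rank):
    """Convert per-row FLR estimates into monotone q-values.

    Each q-value is the minimum FLR observed at that row or any row below
    it in the ranked list, so q-values never decrease down the list.
    """
    q = list(flr_by_rank)
    for i in range(len(q) - 2, -1, -1):
        q[i] = min(q[i], q[i + 1])
    return q


def count_at_flr(flr_by_rank, threshold):
    """Number of top-ranked sites retained at the desired global FLR."""
    q = to_q_values(flr_by_rank)
    passing = [i for i, value in enumerate(q, start=1) if value <= threshold]
    return max(passing) if passing else 0
```

With this convention a single early false hit does not prevent lower-ranked rows from being accepted if the running FLR later falls back below the threshold.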

Model FLR Method

First, we estimated the global FLR using the combined probabilities (global model FLR). For TPP, we use 1 − final probability to give the local FLR (PEP) for each given site scored in a ranked list. The “global model FLR” is calculated as a running sum of the local FLR divided by the count of rows (eq 1), that is, the estimated count of false localizations up to each row in a ranked list, divided by the total number of observations reported up to that row.

Global Model FLR Equation

$$\text{Global model FLR} = \frac{\sum_{i=1}^{n}\left(1 - P_{\mathrm{PSM},i} \times P_{\mathrm{PTM},i}\right)}{n} \qquad (1)$$

Where n is the count of observations at a given position in the ranked list, P_PSM is the probability that a given PSM identification is correct, and P_PTM is the probability that a given site localization is correct, so that 1 − (P_PSM × P_PTM) is the local FLR (PEP) for that site.
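As an illustration of the running-sum calculation, the following sketch computes the global model FLR at every row of a ranked list from the final (combined) probabilities. The function name is hypothetical, and the input is assumed to be the product of the PSM probability and the site localization probability for each scored site.

```python
def global_model_flr(final_probs):
    """Global model FLR at each row of the ranked list.

    final_probs: combined probabilities (PSM probability x site probability).
    The local FLR of a row is 1 - final probability; the global FLR at
    row n is the running sum of local FLRs divided by n.
    """
    ranked = sorted(final_probs, reverse=True)
    flr, running_sum = [], 0.0
    for n, prob in enumerate(ranked, start=1):
        running_sum += 1.0 - prob
        flr.append(running_sum / n)
    return flr
```

For three sites with final probabilities 1.0, 0.9, and 0.8, the local FLRs are 0, 0.1, and 0.2, giving global model FLR values of 0, 0.05, and 0.1 at rows 1 to 3.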

Decoy FLR Method

We used the identification of phosphorylated decoy amino acids (a worked example follows for alanine, pA), as these are known to be false localizations and can therefore be used to estimate the FLR. The counts of the phosphorylated decoy amino acids were first normalized to allow comparison with true hits, by modeling the random frequency with which one would expect incorrect sites to be assigned to target STY residues. The target/decoy ratio (Tc/Xc) was determined by dividing the total count of STY residues (Tc) by the total count of the decoy amino acid residues (Xc), within the set of PSMs with a scored phosphosite (eq 2). At each position in the ranked list, we have a cumulative count of phosphorylated decoy amino acids observed, ∑pXc. For every false localization to the decoy amino acid, it would be expected that there is a count of “silent” false positives within the target list, proportional to the ratio of STY amino acids divided by the total count of decoy amino acids, that is, (Tc/Xc) × ∑pXc, to model the expected frequency of random wrong hits. The expected count of false positives within the targets is then multiplied by 2 (to model the normalized frequency of random wrong assignments among both the decoy amino acid and the target amino acid) to arrive at a normalized false localization count. This is a relatively conservative method to calculate global FLR, but without the correction to multiply by two, the use of a less frequent decoy amino acid would be insufficiently corrected for. This normalized false localization count is then divided by the total count of observations (n), at a given row in the ranked list, to obtain the decoy (global) FLR estimate (pX_FLR in eq 2).

Phosphorylated Decoy Amino Acid FLR Equation

$$pX\_FLR = \frac{2 \times \dfrac{T_c}{X_c} \times \sum pX_c}{n} \qquad (2)$$

Where Tc is the total target (STY) count, Xc is the total decoy amino acid count, ∑pXc is the cumulative count of phosphorylated decoy amino acids, and n is the count of observations at a given position in the ranked list.
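The decoy calculation described above can be sketched as follows. The function and argument names are hypothetical; the input is assumed to be a ranked list of flags marking whether each reported site falls on the decoy amino acid, together with the total STY and decoy residue counts among PSMs with a scored phosphosite.

```python
def decoy_flr(ranked_is_decoy, target_count, decoy_count):
    """Decoy-based global FLR (pX_FLR) at each row of the ranked list.

    ranked_is_decoy: booleans, best-scoring site first; True when the
        phosphate was localized to the decoy amino acid (e.g. pAla).
    target_count: total STY residues (Tc).
    decoy_count: total decoy amino acid residues (Xc).
    """
    ratio = target_count / decoy_count
    flr, decoy_hits = [], 0
    for n, is_decoy in enumerate(ranked_is_decoy, start=1):
        decoy_hits += int(is_decoy)
        # 2 x (Tc / Xc) x cumulative decoy hits, divided by rows so far.
        flr.append(2 * ratio * decoy_hits / n)
    return flr
```

As a worked example with invented numbers: if Tc = 120 and Xc = 80 (ratio 1.5) and the first pAla hit appears at row 50, the estimate at that row is 2 × 1.5 × 1 / 50 = 0.06, falling to 0.03 by row 100 if no further decoy hits occur. The estimate is not capped here, so it can exceed 1 near the top of a list that begins with decoy hits.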

Answer Key FLR

For the synthetic data sets, we used the synthetic peptide false localizations in a similar way; the false localization count (i.e., results not matching the answer key) was divided by the total count of sites to calculate the FLR (eq 3).

Synthetic Answer Key FLR

$$\text{Answer key FLR} = \frac{F_c}{n} \qquad (3)$$

Where n is the count of observations, and the false localization count Fc is the count of sites not matching the answer key, both at a given position in the ranked list.
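A minimal sketch of the answer key comparison is given below. The data layout is an assumption made for illustration: each reported site is a (peptide, position) pair, and the answer key maps each synthetic peptide to the set of positions known to be phosphorylated.

```python
def answer_key_flr(ranked_sites, answer_key):
    """Answer key FLR (Fc / n) at each row of the ranked list.

    ranked_sites: (peptide, position) pairs, best-scoring first.
    answer_key: dict mapping peptide -> set of truly phosphorylated positions.
    """
    flr, false_count = [], 0
    for n, (peptide, position) in enumerate(ranked_sites, start=1):
        if position not in answer_key.get(peptide, set()):
            false_count += 1  # reported site disagrees with the known key
        flr.append(false_count / n)
    return flr
```

This sketch treats a peptide missing from the key as a false localization; the filtering steps described in the Methods (removing partial peptides and PSMs with an unexpected phosphorylation count) would normally be applied before this comparison.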

Collapsing Observations of a Site across Multiple PSMs

When summarizing study results, it is desirable to “collapse” results where there are multiple PSMs supporting the same modification site down to a single row. There is no well-agreed method for collapsing data, although common practice when collapsing multiple PSMs into individual reports for peptides is to use the maximum peptide score for ordering results and to disregard the count of PSMs. The rationale for this simplification is that multiple PSMs reporting the same peptide are not independent statistical tests, and thus the same wrong answer can appear in multiple PSMs. As such, a simple method for ranking final “collapsed” results for sites is to take the maximum final probability. However, our own profiling of data sets suggests that this method is sub-optimal. Many of the high-scoring decoy hits are supported by only a single PSM, so a site supported by a higher number of PSMs is more likely to be true than one supported by a single PSM, and a collapse method should weight sites accordingly (Supporting Information Figure S2). For this study, we use a relatively ad hoc method for collapsing multiple observations that attempts to balance maximum final probability and spectral counts. We took the maximum probability for a given site, derived from multiple PSMs, and binned the final probability values to 2 decimal places. We ranked via binned final probability and then ranked within bins via the count of PSMs.
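The collapsing scheme can be sketched in a few lines. The site identifiers are hypothetical, and binning is implemented here by rounding to two decimal places; the source does not state whether rounding or truncation was used, so that choice is an assumption.

```python
from collections import defaultdict


def collapse_sites(site_psms):
    """Collapse PSM-level site rows to one row per site.

    site_psms: iterable of (site_id, final_probability) pairs.
    Returns (site_id, binned_max_probability, psm_count) tuples ranked by
    binned probability, then by PSM count within each bin.
    """
    best = defaultdict(float)
    counts = defaultdict(int)
    for site, prob in site_psms:
        best[site] = max(best[site], prob)
        counts[site] += 1
    # Assumption: "binned to 2 decimal places" is implemented as rounding.
    rows = [(site, round(best[site], 2), counts[site]) for site in best]
    rows.sort(key=lambda row: (-row[1], -row[2]))
    return rows
```

Under this scheme, two sites whose best probabilities both fall in the 0.99 bin are ordered by how many PSMs support them, so a site seen twice ranks above a site seen once.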

Profiling Distance Distributions from Real Identifications to Decoy Amino Acids

In order to compare the decoy amino acids investigated, the distributions of amino acids around phosphorylation sites were compared. The phosphorylation sites obtained by searching each database for phospho (STY) using TPP were first filtered for 5% model FLR. The minimum distance between an assumed correctly localized phosphorylated STY and the nearest candidate amino acid was compared, alongside the minimum distance to the nearest other STY. Histograms were generated with the normalized frequencies of these distributions in order to compare the selected decoy amino acids with STY.
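The distance profiling can be illustrated with the sketch below. It assumes 0-based indexing within whatever sequence is supplied (the source does not specify whether distances were measured within the peptide or the protein), and the function names are hypothetical.

```python
from collections import Counter


def min_distance(sequence, site_index, residues):
    """Distance from a phosphosite to the nearest residue in `residues`.

    The site itself is excluded, so passing residues="STY" gives the
    distance to the nearest *other* STY. Returns None if no such residue.
    """
    distances = [
        abs(i - site_index)
        for i, aa in enumerate(sequence)
        if aa in residues and i != site_index
    ]
    return min(distances) if distances else None


def normalized_histogram(distances):
    """Normalized frequency of each distance, ignoring missing values."""
    observed = [d for d in distances if d is not None]
    counts = Counter(observed)
    total = len(observed)
    return {d: c / total for d, c in sorted(counts.items())}
```

Comparing the histogram for a candidate decoy (for example, residues="A") with the histogram for residues="STY" shows how closely the decoy mimics the positions to which a phosphate could plausibly be mislocalized.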

Profiling Site Probabilities for Proximal Amino Acids

When analyzing the results for different decoy amino acids, we observed particular differences in the global FLR estimates for certain decoys (particularly pAla vs pGly) that could not be explained by distributions of amino acids in relation to confident target sites (above). We further explored these effects by calculating the average final probabilities for assumed correct sites with different amino acids in the −1 and +1 position relative to the site. The assumed correct sites were estimated as sites with combined probability ≥0.68. This threshold was calculated from the average minimum combined probability using a 5% FLR cut off for each of the decoy FLR estimations across all searches. These average probability distributions were calculated for the Arabidopsis and human data sets, from results of the TPP search with no decoy amino acid (pSTY), pAla decoy (pASTY), pLeu decoy (pLSTY), and pGly decoy (pGSTY).

Results

Analysis of Synthetic Data Set PXD007058

The analysis setup aimed to determine whether global FLR within genuinely modifiable residues (target amino acids) could be estimated reliably by including in the search a “decoy” amino acid, that is, one that is not modified. We tested localization on six different amino acids to act as a decoy in parallel searches: glycine, leucine, alanine, glutamate, aspartate, and proline, to determine what effect the selection of a particular decoy had on the results obtained. The set of potential decoy amino acids was selected based on the following rationale: (i) glycine, leucine, alanine: no evidence that they can be phosphorylated in any known biological system; all are relatively frequent amino acids in most biological systems; (ii) glutamate: frequently phosphorylated[25] and not typically detectable as phosphorylated in most standard enrichment MS experiments and thus could be a plausible choice as a decoy; (iii) aspartate and proline: chosen deliberately as expected poor choices of decoy amino acid, since there are known SP and SD phosphorylation motifs, which could bias estimates of global FLR. We expect that a statistically reliable choice of amino acid should have a similar distance distribution from a phosphorylation site (STY) to another truly possible phosphorylation site (STY), under the theory that incorrect localizations are more likely to fall on amino acids nearby in the sequence. We first searched the PXD007058 synthetic data set, which allowed us to test the three different methods of FLR estimation against a known answer. The data set was searched using TPP, and the global FLR was calculated using the decoy phosphorylated amino acid method for six different choices of decoy amino acid, that is, in six parallel searches (Figure 1a).
As described in the Methods, the global “decoy” FLR is estimated based on the counts of hits to the decoy amino acids in the ranked list of results, adjusted for the ratio of the counts of the decoy amino acid to the target amino acid in the modified peptides that have been considered. We also show the global FLR calculation for all three methods in Figure 1b–g, split by decoy amino acid choice; that is, (i) answer key: identifying false localization by comparing to the known phosphosites (in this synthetic data set, where the truly modified site is known), (ii) decoy amino acid method, and (iii) model FLR, which is based on summing the local FLR calculated intrinsically by the analysis software (see Methods). The first observation we make is that the choice of decoy amino acid can have a substantial effect on the sensitivity (counts of assumed true sites) for a given estimated global FLR threshold (Table 2 and Figure 1). At 5% FLR, the lowest sensitivity is achieved with a Gly decoy (749 sites) versus the highest sensitivity with a Glu decoy (952 sites).

Figure 1

(a) Comparison of FLR estimation searching PXD007058 (synthetic data set) using localization on different decoy amino acids: pAla, pGly, pLeu, pAsp, pGlu, and pPro (TPP, fully tryptic, 1% FDR); (b–g) comparison of FLR estimation methods searching PXD007058 for each of the different decoy amino acids (TPP, fully tryptic, 1% FDR). X-axis = count of sites; y-axis = global FLR estimated as q-values.

Table 2

Counts at pX FLR (Calculated by the Decoy Method) for 1, 5, and 10% Using Each FLR Method, Searching PXD007058 (Synthetic Data Set) (TPP, Fully Tryptic, 1% FDR)

	count at 1% FLR			count at 5% FLR			count at 10% FLR
	answer key	model	decoy	answer key	model	decoy	answer key	model	decoy
pAla	702	226	799	835	700	866	935	836	921
pGly	571	203	515	778	645	749	866	791	787
pLeu	632	229	665	843	710	823	938	844	914
pAsp	535	258	207	834	721	784	929	849	805
pGlu	722	232	841	858	734	952	975	866	n/a
pPro	636	206	621	817	692	768	883	832	823

It can be seen that for pAla and pLeu decoys (Figure 1b,d), there is close agreement between the two “empirical” methods of estimating global FLR, that is, answer key and decoy, with the model giving more conservative estimates of FLR. The pGly and pPro methods (Figure 1c,g) have good agreement between the answer key and decoy methods up to ∼700 counts, and then the decoy method gives more conservative estimates (rising steeply) compared to the answer key method (and model). The pAsp decoy method agrees well with the model FLR but is more conservative than the answer key (Figure 1e). The pGlu decoy method is the least conservative, apparently underestimating global FLR compared to the answer key (Figure 1f). Overall, the matching of pAla and pLeu decoy FLR estimation to the answer key FLR gives some supporting evidence toward pAla and pLeu being appropriate choices for decoy amino acids. The model FLR method is shown to be more conservative than the other “empirical” FLR methods in most cases, especially in the most important regions of the distribution, for example, up to 5% global FLR. The estimates from Figure 1b–g and Table 2 demonstrate greater stability in the model FLR and answer key FLR across different decoy amino acids; that is, there is less variation in sensitivity at a given estimated FLR. This is to be expected since, comparing the six different searches, many of the errors in localization are due to incorrect localization to a target amino acid (which largely behave the same across the six searches).
On the same synthetic data set, PXD007058, we also compared the estimation methods across four different pipelines: TPP, PEAKS, MaxQuant, and Mascot (Supporting Information Figure S3 and Table S2), searched for pSTY, pASTY, and pLSTY (for decoy comparison). The initial set of results from our analysis pipeline are the redundant identifications of phosphorylation sites; that is, if multiple PSMs support the same site, these appear as multiple rows (not collapsed). In general, as noted in the Methods, our preference is to order these results by the final probability that a site has been observed (PSM probability × site localization probability). This synthetic data set has a small database size and an overall small count of identifications, which makes it difficult to model PSM probability accurately. As such, for the synthetic data set only, we ordered results by site localization probability, having first accepted only PSMs with FDR <1%. FLR was calculated using the synthetic answer key and the decoy amino acid hits. For all four pipelines tested, both the pAla and pLeu decoy methods agree well with the results from the answer key FLR method, demonstrating that our method with these amino acids gives reliable FLR estimates in a software-independent manner. There are differences in the total number of sites identified at a fixed FLR threshold, depending on the pipeline applied. For this data set, TPP gives the highest sensitivity, followed by PEAKS, Mascot/ptmRS, and MaxQuant. However, our primary goal in this manuscript is not to benchmark different pipelines extensively, as there are choices of algorithm parameterization that need to be optimized and could affect conclusions, and thus, we do not make any general conclusions about software performance for PTM analysis here.
The synthetic data set PXD000138 was also used to compare the FLR calculations using localization on different decoy amino acids (Supporting Information Figure S5 and Table S3). These results also showed close agreement between the “empirical” methods of global FLR estimation; that is, the answer key and decoy estimation methods generally agreed with each other across the different decoy amino acids, although, as observed for PXD007058, the choice of decoy amino acid affects the sensitivity (count of sites at a fixed FLR). The model estimation gives more conservative estimates of FLR across all decoy searches. In order to compare the decoy amino acids further, the distribution of amino acids around phosphorylation sites was compared for each of the synthetic sets (Supporting Information Figures S4 and S6). It would be assumed that if a phosphosite is wrongly localized, it would usually be to the nearest STY residue other than the correct site. We therefore assumed that a statistically reliable decoy amino acid will follow a similar (normalized) frequency distribution to the closest STY residue from correct hits. For the two synthetic data sets, all decoy amino acids had substantial differences between their distributions in relation to the assumed correct phosphosites, and none matched the distribution of the nearest STY residues particularly well. Given the artificial nature of synthetic data sets, we do not expect these distributions to reflect the reality of biological data sets, and thus, we next explored the behavior of the method on real data sets.

Biological Data Set Analysis

Data Set PXD008355

To investigate the effect of using localization on different decoy amino acids on different data sets, we compared the FLR estimations across the six different amino acids using two experimental data sets from A. thaliana and human. Figure 2 shows the decoy FLR comparisons searching the PXD008355 Arabidopsis data set with TPP. Here, we can see a similar trend as previously seen in the synthetic data set PXD007058. The FLR estimations with pGly and pPro give the most conservative performance at higher FLR values; that is, a steep rise in global FLR (Figure 2a) giving the lowest sensitivity at 10% global FLR, although there is a more complex picture at 1 and 5% FLR values (Table 3 and Figure 2b). We assume that many studies will aim to threshold at 5% global FLR; here, we observe the lowest counts (of sites at 5% FLR) for the pAsp decoy, intermediate counts for the pGly, pGlu, and pPro methods, and the highest counts of sites (sensitivity) for the pAla and pLeu decoys. One of the challenges with accurate FLR estimation is that there can be some high-scoring incorrect localizations, and their position in the ranked list can have significant implications for the count of sites at 1% FLR (Figure 2b and Table 3). We thus would not recommend general thresholding at 1% global FLR; instead, we recommend applying a 5% FLR, where the global FLR estimates are likely to be robust. There is additional discussion of these high-scoring false hits in the Supporting Information (i), Figure S7 and Table S4. We also calculated the model FLR for each decoy option, demonstrating good agreement at 5% FLR between the two methods (decoy FLR vs model FLR) for the pAla, pLeu, and pGly decoy options but less good agreement for the other decoys (Supporting Information Figure S8).
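As a point of comparison for the decoy FLR, the model FLR can be sketched under the assumption (consistent with the description in the Discussion of "summing local FLR values") that each site's local FLR is 1 minus its final probability and that the global FLR at a given rank is the running sum of local FLR values divided by the number of sites accepted. The function names `model_flr` and `count_at_threshold` are hypothetical and are not part of TPP or any other pipeline.

```python
from typing import List


def model_flr(final_probabilities: List[float]) -> List[float]:
    """Global model FLR at each rank, with sites sorted by descending probability."""
    ranked = sorted(final_probabilities, reverse=True)
    curve = []
    expected_false = 0.0
    for n, p in enumerate(ranked, start=1):
        expected_false += 1.0 - p          # local FLR of this site (assumption)
        curve.append(expected_false / n)   # expected false sites / accepted sites
    return curve


def count_at_threshold(flr_curve: List[float], threshold: float) -> int:
    """Largest number of top-ranked sites whose global FLR is at or below threshold."""
    count = 0
    for n, flr in enumerate(flr_curve, start=1):
        if flr <= threshold:
            count = n
    return count
```

The same `count_at_threshold` helper can be applied to a decoy-based q-value curve, which makes it straightforward to compare site counts at 1, 5, and 10% FLR between the two estimation methods.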

Figure 2

Table 4

Counts at pX FLR for 1, 5, and 10% Using Each Decoy Method, Searching PXD000612 (Human Data Set) (TPP, Fully Tryptic, 1% FDR) Showing all PSMs (Not Collapsed) and Collapsed to Unique Sites

	count at 1% FLR		count at 5% FLR		count at 10% FLR
	not collapsed	collapsed	not collapsed	collapsed	not collapsed	collapsed
pAla	62,050	5246	95,924	9504	103,609	11,491
pGly	58,367	5557	86,752	9267	94,536	11,193
pLeu	3822	6705	98,875	10,172	106,453	12,030
pAsp	2650	809	71,968	8458	86,451	10,052
pGlu	22,563	608	78,708	7819	94,337	10,092
pPro	47,942	3256	79,821	7384	90,430	9482

Table 3

Counts at pX FLR for 1, 5, and 10% Using Each Decoy Method, Searching PXD008355 (Arabidopsis Data Set) (TPP, Fully Tryptic, 1% FDR) Showing all PSMs (Not Collapsed) and Collapsing Multiple PSMs to One Row per Modified Peptide, Sorting by Combined Probability and Count of the Supporting PSMs

	count at 1% FLR		count at 5% FLR		count at 10% FLR
	not collapsed	collapsed	not collapsed	collapsed	not collapsed	collapsed
pAla	23,104	1570	40,541	4704	44,556	5815
pGly	18,872	1469	35,939	3654	38,885	4990
pLeu	17,943	1017	42,157	4964	45,875	6068
pAsp	13,490	234	32,595	4151	39,262	5130
pGlu	21,923	2766	34,949	4385	40,532	5607
pPro	23,696	2771	35,170	4483	38,226	5221

Comparison of FLR estimation searching PXD008355 (Arabidopsis data set) using different decoy amino acids: pAla, pGly, pLeu, pAsp, pGlu, and pPro (TPP, fully tryptic, 1% FDR). (a) All PSMs; (b) all PSMs, zoom at 5% FLR; (c) collapsed by modified peptide, sorting by combined probability and count of supporting PSMs; (d) collapsed by modified peptide, sorting by combined probability and count of supporting PSMs, zoom at 5% FLR. X-axis is count of sites; y-axis is global FLR estimated as q-values.

We next explored data after collapsing multiple scores from different PSMs supporting the same site, taking the maximum probability for a given site for ranking results, along with the greatest number of supporting PSMs (see Methods). The results following this collapse step are shown in Figure 2c,d, demonstrating relatively similar trends, with considerable differences in sensitivity at 5 and 10% global FLR, and with pAla and pLeu giving the highest sensitivity at 5 and 10% FLR. The statistical assumptions for the model FLR do not hold after collapse, so this method was not used. To further investigate the selection of decoy amino acid candidates, the minimum distance between an assumed correctly localized phosphorylated STY (<5% global FLR filtered) and the nearest candidate amino acid was compared, alongside the minimum distance to the nearest STY (Figure 3). The rationale for this comparison is that in a regular search, not employing a decoy, a wrongly localized phosphosite will usually be assigned to the STY residue nearest to the correct site. We assume that a statistically reliable decoy amino acid will follow a similar (normalized) frequency distribution to the closest STY residue from correct hits. When comparing these distances in the Arabidopsis data set, it can be seen that Ala, Leu, and Gly follow somewhat similar frequency distributions to proximal STY, particularly in the + positions (i.e., toward the C-terminus of the protein).
Asp, Glu, and Pro are all enriched at the +1 position relative to STY, which likely partially explains the higher FLR estimates observed for the same site counts in Table 3 and Figure 2; that is, the pipeline wrongly assigns sites to Asp, Glu, and Pro more frequently than it would to other target sites (STY). We also observed in Table 3 that using a glycine decoy gave relatively low sensitivity at 5% FLR and steeply increasing FLR at higher site counts. The results for Gly in Figure 3 are thus an outlier with respect to Figure 2 as well as for the synthetic data set PXD007058 in Figure 1, in which we observed the lowest site count at 5% FLR for estimates using a Gly decoy. Our starting expectation was that Ala, Leu, and Gly would all make reliable choices as decoy amino acids, and thus, we also conducted an analysis of amino acid frequencies to attempt to explain the differences seen for Gly; results are shown in Supporting Information (ii) and Table S5. Gly and Ala have similar frequencies of observations in phosphopeptides, so this also does not explain the disparity. We further explore this phenomenon below.
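The distance comparison described above can be sketched as follows. This is a minimal illustration, assuming that distances are measured as signed residue offsets within a single sequence (negative toward the N-terminus, positive toward the C-terminus) and that ties between equally distant residues are resolved toward the C-terminal side, which is an arbitrary choice made here. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def nearest_offset(sequence: str, site_index: int, residues: str) -> Optional[int]:
    """Signed offset from site_index to the closest other residue found in `residues`.

    Returns None when the sequence contains no such residue besides the site itself.
    """
    best = None
    for i, aa in enumerate(sequence):
        if i == site_index or aa not in residues:
            continue
        offset = i - site_index
        closer = best is None or abs(offset) < abs(best)
        tie_c_terminal = best is not None and abs(offset) == abs(best) and offset > best
        if closer or tie_c_terminal:
            best = offset
    return best


def offset_distribution(sites: Iterable[Tuple[str, int]], residues: str) -> Dict[int, float]:
    """Normalized frequency of nearest offsets over (sequence, site_index) pairs."""
    counts = Counter()
    for sequence, site_index in sites:
        offset = nearest_offset(sequence, site_index, residues)
        if offset is not None:
            counts[offset] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {offset: n / total for offset, n in sorted(counts.items())}
```

Calling `offset_distribution(sites, "A")` and `offset_distribution(sites, "STY")` on the same set of confidently localized sites gives two normalized distributions that can be compared position by position, in the spirit of the candidate versus STY comparison in the text.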

Figure 3

Comparison of minimum distance between phosphorylated STY and the nearest target amino acid (Ala, Leu, Gly, Asp, Glu, and Pro), compared to STY distribution, searching PXD008355 (Arabidopsis data set).

Similar to the PXD007058 synthetic data set, a comparison between FLR estimates was made across the different pipelines: TPP, PEAKS, MaxQuant, and Mascot/ptmRS. The PXD008355 Arabidopsis data set was searched with an Ala decoy as well as a Leu and Gly decoy, and FLR estimations were calculated in the same way as before (Supporting Information Figure S9 and Table S6). In general, the highest sensitivity is achieved by TPP and Mascot/ptmRS, whereas there are high-scoring decoy (amino acid) hits in the other two pipelines that lead to much lower sensitivity at a given FLR cut-off. For the TPP and Mascot/ptmRS pipelines, the results from estimation with the three decoys are largely reproducible; that is, pAla and pLeu give the highest (and similar) counts of sites at a given FLR, whereas pGly gives a lower count of sites at the same FLR threshold. In this approach, sites are ordered by final probability (PSM probability × PTM probability). An alternative approach commonly used in the field is to threshold first at, say, <1% FDR for PSMs or peptides, and then order purely by PTM localization score or probability. We tested a similar approach to see what effect there is on sensitivity at a given FLR for the pAla results (Supporting Information Figure S10). When ordering site localizations by PTM probability only, rather than by the combined PSM × PTM probability, we can see that there is lower sensitivity at 1% FLR for the PTM probability option, and almost identical sensitivity between the two options at 5 and 10% FLR (Supporting Information Table S7).
We therefore conclude that it is slightly superior to model both the probability that a given PSM is correct and the probability that the PTM has been correctly localized to give the best ordering of results, particularly for the highest-scoring sites around 1% global FLR.

Data Set PXD000612

Given that we see consistent trends for the synthetic data set and Arabidopsis data set in terms of comparing decoys across different pipelines, for the final validation, we focus only on the use of TPP on one further validation set from a different species (human). We would expect some different phosphorylation motifs when comparing an animal species to a plant species, which could affect decoy amino acid performance. Figure 4 and Table 4 illustrate the FLR comparison using the different decoy amino acids, searching the PXD000612 human data set. A similar trend can be seen here as in the Arabidopsis data set, with pAla and pLeu giving the highest site counts at 5% FLR and pAsp giving the lowest site count. On the zoomed plot (<5% FLR, Figure 4b), the same issue as for data set PXD008355 can be observed, with unstable decoy estimation at low counts due to a random factor from a few high-scoring decoys (FLR < 1%). For this data set, there is also good agreement between the model FLR and the decoy FLR for most amino acids except pGlu, where the model FLR tends to be more conservative than the decoy FLR (Supporting Information Figure S11). We also show the data after collapsing multiple PSMs reporting on the same site (Figure 4c,d), giving similar trends in sensitivity at fixed FLR thresholds as for data without collapse.
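The collapse step can be sketched as below. This is a minimal illustration, assuming that a site is keyed by protein accession, position, and modified residue, that the collapsed row keeps the maximum final probability, and that ties in probability are broken by the number of supporting PSMs. The row layout and the name `collapse_to_sites` are hypothetical; the precise procedure used in the study is the one described in its Methods.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (protein accession, site position, modified residue, final probability)
PsmSite = Tuple[str, int, str, float]
# (protein accession, site position, modified residue, max probability, PSM count)
CollapsedSite = Tuple[str, int, str, float, int]


def collapse_to_sites(rows: List[PsmSite]) -> List[CollapsedSite]:
    """Collapse redundant PSM-level rows to one row per site, ranked for FLR estimation."""
    best: Dict[Tuple[str, int, str], float] = {}
    support: Dict[Tuple[str, int, str], int] = defaultdict(int)
    for protein, position, residue, probability in rows:
        key = (protein, position, residue)
        best[key] = max(probability, best.get(key, 0.0))
        support[key] += 1
    collapsed = [(p, pos, r, best[(p, pos, r)], support[(p, pos, r)])
                 for (p, pos, r) in best]
    # Rank by maximum probability, then by count of supporting PSMs (both descending).
    collapsed.sort(key=lambda row: (row[3], row[4]), reverse=True)
    return collapsed
```

Decoy hits (for example, rows whose modified residue is `"A"`) pass through the collapse unchanged in kind, so a decoy-based FLR can be estimated on the collapsed list in the same way as on the uncollapsed one.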

Figure 4

Comparison of decoy FLR estimation searching PXD000612 (human data set) using different decoy amino acids: pAla, pGly, pLeu, pAsp, pGlu, and pPro (TPP, fully tryptic, 1% FDR). (a) No collapse, all sites shown; (b) zoom on 5% FLR; (c) collapse to unique sites; (d) zoom on <5% FLR for data collapsed to unique sites. X-axis is count of sites; y-axis is global FLR, estimated as q-values by the decoy method.

The distance between the phosphorylated STY and the nearest candidate amino acid was again compared (Supporting Information Figure S12) to further investigate the effect of decoy amino acid choice and to examine how the distributions differ between the different data sets. When comparing these distances in the human data set, a similar pattern is seen to that of the Arabidopsis data set. It can be seen that Ala and Leu again follow a somewhat similar frequency distribution to proximal STY residues, again particularly in the positive direction. Asp and Pro are again enriched at the +1 position relative to STY, which would be expected. Gly is also seen to follow a similar distribution to STY residues and therefore would be expected to be a reliable decoy amino acid, based on this measure. However, looking at the FLR comparisons seen in Figure 4, Gly can be seen to give more conservative FLR estimation (i.e., lower site counts at 5% FLR than, e.g., Ala or Leu), as was also seen in the Arabidopsis data set. We next explored whether particular amino acids in proximity to true phosphorylation sites cause results to change. We plotted the average final probabilities for the Arabidopsis data set searched via the TPP pipeline and split according to the amino acid in the −1 (Figure 5) and +1 (Supporting Information Figure S13) position relative to the assumed correct phosphorylation site, for the data set searched with no decoy (pSTY), pAla decoy, pLeu decoy, and pGly decoy.
In the search with no decoy, there is a particularly striking trend that sites have a lower probability when the −1 amino acid is Ser, Thr, or Tyr. This occurs because the site localization algorithm (PTMProphet in this case) has fewer ions available to discriminate the correct from the incorrect localization. In the pAla and pLeu results, we see that Ala and Leu in the −1 position cause sites to have a similar reduction in final probability as Ser, Thr, and Tyr in the no decoy search; that is, final probability shifts from ∼0.96 to ∼0.91 (Ala and Leu decoy). We interpret this to mean that they behave as statistically “good decoys”; that is, when they are present in the −1 position relative to a true site, they behave in a similar manner to STY residues. In the pGly data, there is a much larger drop-off in final probabilities when Gly is in the −1 position (∼0.96 to ∼0.87), meaning that (most commonly) Gly-pSer sites are scored less well than phosphoserines preceded by other amino acids, and excessive probability space is being distributed to the pGly-Ser hypotheses. In the results, we observe that around 47% (Arabidopsis data) and 35% (human data) of the high-scoring pGly decoys have the pGly-Ser motif. We see similar trends in the human data set (Supporting Information Figures S14 and S15). It is unclear why this particular amino acid combination causes a problem for PTM localization, but we hypothesize that the Gly-pSer bond is perhaps particularly stable during fragmentation and hence a discriminating y ion terminating with pSer is less commonly observed. We thus conclude, based on what we have observed, that pGly is not an ideal choice for a decoy amino acid.
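The grouping behind this comparison can be sketched as follows: final probabilities of assumed correct sites are averaged after splitting by the amino acid at a fixed offset (−1 or +1) from the site. This is a minimal sketch; the input layout, the optional `min_probability` filter, and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_probability_by_neighbor(sites: Iterable[Tuple[str, int, float]],
                                 offset: int = -1,
                                 min_probability: float = 0.0) -> Dict[str, float]:
    """Average final probability per neighboring amino acid.

    sites: (sequence, zero-based site index, final probability) triples.
    offset: -1 for the preceding residue, +1 for the following residue.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for sequence, site_index, probability in sites:
        j = site_index + offset
        # Skip low-probability sites and sites with no residue at the offset.
        if probability < min_probability or not 0 <= j < len(sequence):
            continue
        totals[sequence[j]] += probability
        counts[sequence[j]] += 1
    return {aa: totals[aa] / counts[aa] for aa in sorted(totals)}
```

Running this once per search (no decoy, Ala decoy, Leu decoy, Gly decoy) and comparing the mean for the decoy residue with the means for Ser, Thr, and Tyr in the no decoy search mirrors the comparison made in the text.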

Figure 5

Comparison of averaged final site probabilities for all peptides (final probability ≥0.68), split by amino acid in the −1 position, for PXD008355 (Arabidopsis data set). (a) STY (no decoy); (b) STY with Ala decoy; (c) STY with Leu decoy; and (d) STY with Gly decoy.

Discussion

In the proteomics field, prior to the widespread adoption of decoy database searching, there was a general problem with false positive results in the literature, as labs vied to report the largest number of peptides and proteins, without control of FDR. It is now accepted that all proteomics studies should control FDR at an appropriately conservative level, for example, <1% FDR at the protein level for protein-centric studies. It has long been recognized that a similar problem exists with reporting PTM sites. The accurate discovery of a site can be crucially important for downstream interpretation since the identity of the residue (STY for canonical phosphorylation) and the proximal amino acids govern understanding of the kinase and phosphatase that regulate it. Given the interest in understanding phosphorylation (and other PTM) sites in most human diseases, adequate control of false reporting is crucial. It has been recently reported that >80% of the reported sites in a popular phosphorylation database are estimated to be false positives.[26] This likely resulted from studies using overly weak FLR thresholds in publications, with the results then being deposited in databases. Correct identifications tend to be reported from multiple studies, whereas random wrong site identifications tend to be seen only once or twice, and thus over time, database-level FLR creeps up. This study is, to our knowledge, the most detailed attempt to understand how best to estimate global FLR using localization on decoy amino acids. We compare the method against the use of a statistical model, based on summing local FLR values, and the results agree reasonably but not perfectly well. We also demonstrate that the selection of a particular amino acid, even when correcting for the frequency of that amino acid in the results, does influence results more than would be desirable.
We believe that our results support the use of either pAla or pLeu as an appropriate decoy, based on their similar frequencies proximal to real phosphorylation sites (in a test case from humans and a model plant), as compared to the target amino acids STY. The results for pAla and pLeu decoys also agree well with the model FLR (for the large data sets) and the answer key FLR (for the synthetic data sets). We have a slight preference to use pAla as a decoy going forward since there is a slight risk of confusion between Leu and Ile amino acids, which often cannot be distinguished by MS. In rare cases, where there are two peptides in the database differing only by Ile/Leu, errors or inconsistencies in decoy FLR estimation could be introduced. From the TPP pipeline, using iProphet and PTMProphet, it is possible in theory to use either the model FLR or the decoy FLR for thresholding final results. As noted above, applying a 1% global FLR threshold may be unstable (based on the decoy FLR method), depending on the chance appearance of a few decoys high on the ranked list. If control at this level is required, for example, in cases where the exact sites of all modifications will be used in future validation studies, the model FLR would thus be preferred. For a less conservative threshold, say 5% FLR, for example, used when making more general biological conclusions from a quantitative study or pathway analysis following it, we believe that thresholding using the pAla decoy FLR method can be recommended. The rationale is that this method can be straightforwardly applied using any combination of tools and is simple to interpret. Most other pipelines in current use do not report accurate PEP values for PSMs and for site localization that would allow model FLR to be calculated reliably. We also recommend that the scores per site (final probability in the case of TPP-produced data) and the pAla “identifications” get carried forward and reported.
This allows for the potential for meta-analyses and database submissions to estimate the resulting global FLR once multiple data sets have been combined. We acknowledge that there is a slight downside to searching with a decoy amino acid in that the search spaces for PSM identification and PTM localization are both increased, leading to a potential loss in sensitivity. In our analyses, Ala residues are present at less than 1/3 the total frequency of STY residues, leading to a relatively modest increase in search spaces, say 30%. We also suggest that the proteomics field has generally accepted that doubling the PSM search database (and search time) through the inclusion of decoys is an acceptable trade-off for gaining the ability to estimate global FDR straightforwardly and transparently. It is also important to stress that the method of searching for a decoy amino acid is intended to perform global correction for different errors that can be introduced in a modification discovery pipeline (e.g., incorrect peptide identification and incorrect site localization). However, the approach does not enable calculation of local statistics for false discovery since a given decoy amino acid will not be present in all peptides and has a variable count of decoy sites in other peptides, compared to target amino acids. For local statistics, groups should rely on pipelines that can calculate accurate statistics for peptide identification and site localization. Nevertheless, our results show that adding an empirical decoy amino acid gives confidence that appropriate thresholds and quality control have been applied throughout the analytical pipeline, to deliver the final results from a study.
While we have presented results for pAla as a decoy for phosphorylation studies, we suggest that modified Ala could also be an appropriate decoy for other modification types, such as the Lys modifications acetylation, methylation, ubiquitination, and SUMOylation, or for cases where multiple modifications are scored at the same time, although we have not yet profiled the amino acid distributions sufficiently to conclude that Ala is more suitable than other amino acids in these cases.

Conclusions

We have assessed six different amino acids for their ability to act as suitable decoy amino acids for the estimation of global FLR in phosphoproteomics studies. We have analyzed four data sets, two synthetic with a known answer and two biological-sample data sets. We conclude that either Ala or Leu make appropriate decoys and give reliable estimates of FLR above 1% FLR. Below 1% FLR, estimates can be unstable due to a few random high-scoring decoys. We demonstrate that the decoy-based FLR gives similar estimates to a modeled FLR for Ala and Leu decoys, based on summing local FLR values per site and based on the answer key for the synthetic data sets. We recommend that phosphoproteomics investigators should adopt the “pAla” decoy going forward, that is, the pASTY method, and report sites with appropriate global FLR control.

28 in total

1. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search.

Authors: Andrew Keller; Alexey I Nesvizhskii; Eugene Kolker; Ruedi Aebersold
Journal: Anal Chem Date: 2002-10-15 Impact factor: 6.986

Review 2. Quantitative phosphoproteomics to characterize signaling networks.

Authors: Kristoffer T G Rigbolt; Blagoy Blagoev
Journal: Semin Cell Dev Biol Date: 2012-06-05 Impact factor: 7.727

3. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling.

Authors: Kirti Sharma; Rochelle C J D'Souza; Stefka Tyanova; Christoph Schaab; Jacek R Wiśniewski; Jürgen Cox; Matthias Mann
Journal: Cell Rep Date: 2014-08-21 Impact factor: 9.423

4. A probability-based approach for high-throughput protein phosphorylation analysis and site localization.

Authors: Sean A Beausoleil; Judit Villén; Scott A Gerber; John Rush; Steven P Gygi
Journal: Nat Biotechnol Date: 2006-09-10 Impact factor: 54.908

Review 5. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics.

Authors: Eric W Deutsch; Luis Mendoza; David Shteynberg; Joseph Slagel; Zhi Sun; Robert L Moritz
Journal: Proteomics Clin Appl Date: 2015-04-02 Impact factor: 3.494

6. Comet: an open-source MS/MS sequence database search tool.

Authors: Jimmy K Eng; Tahmina A Jahan; Michael R Hoopmann
Journal: Proteomics Date: 2012-12-04 Impact factor: 3.984

7. Capturing the phosphorylation and protein interaction landscape of the plant TOR kinase.

Authors: Jelle Van Leene; Chao Han; Astrid Gadeyne; Dominique Eeckhout; Caroline Matthijs; Bernard Cannoot; Nancy De Winne; Geert Persiau; Eveline Van De Slijke; Brigitte Van de Cotte; Elisabeth Stes; Michiel Van Bel; Veronique Storme; Francis Impens; Kris Gevaert; Klaas Vandepoele; Ive De Smet; Geert De Jaeger
Journal: Nat Plants Date: 2019-03-04 Impact factor: 15.793

8. The PRIDE database and related tools and resources in 2019: improving support for quantification data.

Authors: Yasset Perez-Riverol; Attila Csordas; Jingwen Bai; Manuel Bernal-Llinares; Suresh Hewapathirana; Deepti J Kundu; Avinash Inuganti; Johannes Griss; Gerhard Mayer; Martin Eisenacher; Enrique Pérez; Julian Uszkoreit; Julianus Pfeuffer; Timo Sachsenberg; Sule Yilmaz; Shivani Tiwary; Jürgen Cox; Enrique Audain; Mathias Walzer; Andrew F Jarnuczak; Tobias Ternent; Alvis Brazma; Juan Antonio Vizcaíno
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

9. Profiling the Human Phosphoproteome to Estimate the True Extent of Protein Phosphorylation.

Authors: Anton Kalyuzhnyy; Patrick A Eyers; Claire E Eyers; Emily Bowler-Barnett; Maria J Martin; Zhi Sun; Eric W Deutsch; Andrew R Jones
Journal: J Proteome Res Date: 2022-05-09 Impact factor: 5.370

10. ProteomeXchange provides globally coordinated proteomics data submission and dissemination.

Authors: Juan A Vizcaíno; Eric W Deutsch; Rui Wang; Attila Csordas; Florian Reisinger; Daniel Ríos; José A Dianes; Zhi Sun; Terry Farrah; Nuno Bandeira; Pierre-Alain Binz; Ioannis Xenarios; Martin Eisenacher; Gerhard Mayer; Laurent Gatto; Alex Campos; Robert J Chalkley; Hans-Joachim Kraus; Juan Pablo Albar; Salvador Martinez-Bartolomé; Rolf Apweiler; Gilbert S Omenn; Lennart Martens; Andrew R Jones; Henning Hermjakob
Journal: Nat Biotechnol Date: 2014-03 Impact factor: 54.908