Phosphoproteomic methods are commonly employed to identify and quantify phosphorylation sites on proteins. In recent years, various tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified or to estimate the global false localization rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic datasets, and their statistical reliability on real datasets is largely unknown, potentially leading to studies reporting incorrectly localized phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring modifications on a decoy amino acid, that is, one that cannot be modified, to allow for independent estimation of global FLR. We test a variety of amino acids, on both synthetic and real data sets, demonstrating that the selection can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys. We propose the use of a decoy amino acid to control false reporting in the literature and in public databases that re-distribute the data. Data are available via ProteomeXchange with identifier PXD028840.
Phosphoproteomic methods are commonly employed to identify and quantify phosphorylation sites on proteins. In recent years, various tools have been developed, incorporating scores or statistics related to whether a given phosphosite has been correctly identified or to estimate the global false localization rate (FLR) within a given data set for all sites reported. These scores have generally been calibrated using synthetic datasets, and their statistical reliability on real datasets is largely unknown, potentially leading to studies reporting incorrectly localized phosphosites, due to inadequate statistical control. In this work, we develop the concept of scoring modifications on a decoy amino acid, that is, one that cannot be modified, to allow for independent estimation of global FLR. We test a variety of amino acids, on both synthetic and real data sets, demonstrating that the selection can make a substantial difference to the estimated global FLR. We conclude that while several different amino acids might be appropriate, the most reliable FLR results were achieved using alanine and leucine as decoys. We propose the use of a decoy amino acid to control false reporting in the literature and in public databases that re-distribute the data. Data are available via ProteomeXchange with identifier PXD028840.
There is great research
interest in studying post-translational
modifications (PTMs) to proteins, due to their importance in cell
signaling, as a rapid mode of proteins changing their function, and
their implication in almost all known disease processes. The most
widely studied reversible modifications include phosphorylation (by
far the most studied one and our primary focus here), acetylation,
methylation, and attachment of small proteins, such as ubiquitin and
SUMO.High-throughput tandem mass spectrometry (MS) is commonly
used
for detection and localization of phosphorylation sites on proteins,
using so-called phosphoproteomic methods. Typically in these methods,
proteins are first extracted from samples and digested with an enzyme
such as trypsin, and then, phosphorylated peptides are enriched in
the sample, for example, using TiO2 or other metal ion
attached to a column (affinity chromatography), to which phosphate
binds preferentially. The bound peptides are then eluted and analyzed
by liquid chromatography–MS.[1] In
the common analysis mode used in phosphoproteomics, data dependent
acquisition is performed to fragment the most abundant peptides observed.
The MS2 fragmentation spectra (plus the mass/charge of
the intact precursor) are then used to identify peptide sequences,
e.g., using sequence database search software. In this approach, the
spectra are searched against a theoretical digest of the proteome
(i.e. peptide sequence database) for the given species, taking account
for the variable modifications selected. For phosphorylation, most
users search for phosphorylation on the canonical Ser, Thr, and Tyr
(STY) residues, where the vast majority of detectable phosphorylation
resides in eukaryotic systems. The search engine then considers every
STY residue with and without the addition of the phosphate mass (+79.97
Da), greatly increasing the size of the search space, with a corresponding
reduction in statistical power for peptide identification. Confident
peptide identification is governed by the quality of the match between
the observed spectrum and the theoretical spectrum expected for a
peptide from the sequence database, from which local statistics such
as p-values or e-values are usually
calculated as well as sometimes a posterior error probability (PEP).
If a PEP value is calculated, 1-PEP gives the probability that a given
peptide-spectrum match (PSM) is correct. There are of course many
different proteomics search engines, including commercial and free
and/or open source; for a review, see Verheggen et al.[2]An important consideration for phosphoproteomics
is the confidence
that a given site within a protein has been correctly identified as
being phosphorylated. Ambiguity in this regard may occur when a confident
PSM has more STY residues than n, where n is the number of phosphorylation modification instances detected,
that is, intact peptidoform mass = peptide sequence mass + (n * 79.97 Da). In this case, the search engine itself or
a downstream analysis package calculates statistics related to each
of the n phosphosites within a peptide, such as a
PEP that the site has been incorrectly localized (sometimes also called
local false localization rate (FLR); local FLR or other ad hoc score.
As for PSMs, if an accurate PEP can be estimated, then 1-PEPsite gives the probability that the site has been correctly localized,
in this case, assuming already that the PSM is definitively correct.
Correct site localization can be critically important for downstream
uses of data. As one example, there are completely different kinases
and phosphatases involved in Ser/Thr versus Tyr phosphorylation, and
thus, biological conclusions as to the up and downstream signaling
pathways would be completely different. Even where ambiguity relates
to different, say, Ser residues in a peptide, nearby amino acid motifs
allow inference of the kinase responsible for phosphorylating the
site, and thus, incorrect site determination could lead researchers
to making incorrect assumptions and conclusions.Many of the
published site localization algorithms were benchmarked
by the originating authors, and scores were calibrated based on synthetic
data sets, with a known “ground truth”, that is, where
the sites of phosphorylation were known.[3] There have also been some independent efforts to benchmark different
site localization tools, showing that the choice of tool does alter
the global statistics,[4,5] that is, sensitivity—how
many sites in a whole data set can be correctly localized at a suitable
overall (global) FLR. While tools continue to improve and become more
widely used for ensuring confident site localization, there remain
several unsolved challenges for the field, as follows. First, it is
unclear whether findings on synthetic data sets can be extrapolated
to genuine biological data sets that have generally a higher level
of complexity, and synthetic data sets may have amino acid frequencies
near to modification sites, which do not well reflect natural samples.
Second, for analysis of real experimental data sets, there are no
commonly used methods for independent estimation of global FLR, for
example, that allow a researcher to ask the question—how many
sites have we confidently identified and how many are likely false
positives in the whole data set. For regular peptide/protein identification
in proteomics, decoy database search methods are now almost ubiquitously
used for estimating peptide/protein false discovery rate (FDR), since
they give a search engine-independent statistic that is easy to understand.
There is no generally accepted method for calculating the same type
of statistic for PTM site identification. Third, our groups are interested
in very large-scale re-analysis of public PTM enriched data sets,
via a project called PTMeXchange. We wish to have methods that allow
for accurate calculation of the probability that a given PTM site
has been observed in a meta-analysis of data sets, where there could
be potentially multiple PSMs from different studies supporting a given
site. To our knowledge, there are no suitable statistical models for
combining different evidence streams.In this work, we explore
the concept of using decoy amino acid(s)
for estimation of site localization statistics (i.e. global FLR),
in this context, defined as one that we know cannot be modified, to
model the distribution of false localizations detected from a processing
pipeline, resulting from both incorrect peptide identifications and
incorrect localization of sites on correct peptide identifications.
We test a range of different amino acids for their suitability as
a decoy in synthetic and real data sets as well as demonstrate the
results obtained from several common proteome informatics pipelines.
The concept of using a decoy amino acid for localization of PTMs is
not a new one. It has been used before in several previous publications
and approaches.[6] However, to our knowledge,
no publication has yet validated the statistics associated with the
use of decoy amino acids, particularly on multiple tools, and the
method of using a decoy amino acid has not gained widespread use in
the field. The majority of PTM-based studies still relies on using
ad hoc score thresholds for determining whether PTMs have been correctly
identified or not. From the results we present, we make some recommendations
as to how we believe large-scale PTM-enriched studies should be analyzed
to control the local and global FLR. While we have focused on computational
analysis and pipelines for phosphorylation, general approaches and
conclusions are largely applicable to other types of PTM readily detected
by MS. The code used for the analysis is in GitHub: https://github.com/PGB-LIV/PhosphoFLR.
Methods
Our overall goal is to demonstrate methods for controlling
and
understanding FLR, rather than benchmarking tools per se, although
we wished to demonstrate the reproducibility of methods in different
pipelines. As such, we tested four commonly used analysis pipelines:
trans-proteomic pipeline (TPP)[7] including
Comet search[8] and PTMProphet site localization;[9] MaxQuant including PTMScore;[10] ProteomeDiscover, including Mascot search and ptmRS localization;[11] and PEAKS DB search with A-Score.[12] We tested the effects on global FLR of selecting
the localization on different amino acids as “decoy”
and profiled the frequency of potential decoy amino acids relative
to assumed correct STY phosphorylation sites, to see which provides
the decoy distribution best matching the target distribution, that
is, other STY sites to which the site could be wrongly localized.
The MS proteomics data have been deposited to the ProteomeXchange
Consortium via the PRIDE[13] partner repository
with the dataset identifier PXD028840 and DOI 10.6019/PXD028840.Four data sets were used for evaluation of methods for estimating
global FLR—two synthetic data sets, one model plant phosphoproteomics
data set (from Arabidopsis thaliana), and one human phosphoproteomics data set. The raw files of the
four search data sets were obtained from the ProteomeXchange Consortium[14] via the PRIDE repository.[15] These included ten files from the PXD007058[5] synthetic data set (files named “HCDOT” pools
1–5, reps 1 and 2), 10 files from the PXD000138 synthetic data
set,[16] twelve files from the PXD008355[17]Arabidopsis data
set (rapamycin treated), and six from the PXD000612[18] human data set (files randomly selected). The PXD007058
and PXD000138 data sets contain synthetic phosphopeptide libraries.
The use of synthetic phosphopeptides allowed us to define FLR (through
one method) by comparing the results from our search pipelines to
the known phosphopeptide sequences to determine if our analyses correctly
localize the phosphosites. The PXD008355 A. thaliana data set and the PXD000612 human data set are both biological data
sets with unknown phosphosites.Databases were created for the
searches of each data set. The PXD007058
search database consisted of the synthetic peptides;[5] the PXD000138 search used IPI human sequences and phosphopeptide
libraries matched to the original study;[16] the PXD008355 Arabidopsis search
database contained Araport11[19] sequences,
and the PXD000612 human search database was created from the Level
1 PeptideAtlas Tiered Human Integrated Search Proteome,[20] containing core isoforms from neXtProt.[21] Each search database also contained the cRAP
contaminant sequences (https://www.thegpm.org/crap/, last accessed October 2021). Decoys across all four databases were
generated for each entry using the Brujin method (with k = 2).[22]Using the TPP,[7] the data set files were
first searched using Comet.[8] The resulting
files were then combined and processed using PeptideProphet,[23] iProphet,[24] and PTMProphet.[9] For PXD007058, in addition to searching for phosphorylation,
we also searched for pyrophosphorylation modifications, which pilot
searches had determined that were unintentionally present on some
synthetic peptides. The key Comet search parameters for each dataset
are shown in Table ; the key parameters used for the other pipelines are shown in Supporting Information Table S1.
Table 1
Comet Search Parameters for Each Data
Set
PXD007058 (synthetic data set)
PXD000138 (synthetic data set)
PXD008355 (Arabidopsis data set)
PXD000612 (human data set)
peptide mass tolerance
20.0 ppm
5 ppm
7 ppm
7 ppm
fragment bin tolerance
0.02 Da
0.02 Da
0.02 Da
0.02 Da
digest mode
tryptic
tryptic
tryptic
tryptic
max missed cleavages
4
4
2
2
fixed mods
carbamidomethylation (C)
carbamidomethylation (C)
carbamidomethylation (C)
carbamidomethylation (C)
variable mods
oxidation (MWP)
oxidation (M)
oxidation (M)
oxidation (M)
phospho (STYXa)
phospho (STYXa)
phospho (STYXa)
phospho
(STYXa)
pyrophospho (STY)b, N-terminal acetylation
N-terminal acetylation
N-terminal acetylation
N-terminal acetylation
ammonia loss (QC)
ammonia loss (QC)
ammonia loss (QC)
ammonia loss (QC)
pyro-Glu (EQ on the N-terminus)
pyro-Glu (EQ on the N-terminus)
pyro-Glu (EQ on the N-terminus)
pyro-Glu (EQ on the N-terminus)
deamination (NQ)
deamination (NQ)
deamination (NQ)
deamination (NQ)
max variable PTMs
5
5
5
5
X corresponds
to the different decoy
amino acid searched: Ala, Gly, Leu, Asp, Glu, or Pro.
Preliminary analysis of the data
set detected that several peptides had been manufactured with pyrophosphate
modification rather than the intended phosphate, which can cause apparent
errors when comparing the results to the answer key if they are not
accounted for.
X corresponds
to the different decoy
amino acid searched: Ala, Gly, Leu, Asp, Glu, or Pro.Preliminary analysis of the data
set detected that several peptides had been manufactured with pyrophosphate
modification rather than the intended phosphate, which can cause apparent
errors when comparing the results to the answer key if they are not
accounted for.
Downstream Data Analysis
The data from searching with
TPP were downstream processed by custom Python scripts (https://github.com/PGB-LIV/PhosphoFLR). First, the global FDR was calculated from the decoy counts, and
the PSMs were filtered for 1% PSM FDR. From these filtered PSMs, a
site-based file was generated giving separate localization scores
for each phosphosite found on each PSM, removing decoy and contaminant
hits. These site-based PSMs were ordered by a combined probability,
calculated by multiplying the PSM probability by the localization
probability. In the processing pipeline from TPP, iProphet calculates
a probability that a given PSM is correct, and PTMProphet calculates
a probability for the site assignment. We demonstrated that there
is almost no meaningful correlation (r2 = ∼0.01) between these probabilities (Supporting Information Figure S1), and thus, we conclude that
these probabilities are sufficiently independent that they can be
multiplied to arrive at a final probability that a given site’s
identification is supported by the given spectrum.For the PEAKS
search, the PSM score and A-scores for targets and decoys were modeled
based on the counts of targets and decoys per histogram score bin,
to generate similar probability estimates (code provided in the GitHub
repository). For MaxQuant and Mascot searches, the PSM probability
values were calculated as 1-PEP values (reported by the pipeline natively)
with the PTM probabilities being calculated innately through PTM-score/ptmRS
probabilities, respectively. For the synthetic peptide search, these
site-based results were then filtered further to allow comparison
with the synthetic peptide known localization key. Partial peptides
and PSMs with the incorrect phosphorylation count compared to what
was expected from the answer key, or other additional modifications,
were removed. These remaining PSMs were then compared against the
synthetic peptide answer key to determine if the phosphosites had
been correctly identified. For analysis of the synthetic data only,
results were ordered by the corresponding site localization probability
rather than the combined probability; since due to the small size
of the search database, not all pipelines could produce accurate estimates
of PSM probability.The sites reported for each analysis method
were ordered by combined
probability, and global FLR was estimated for every ranked site, from
which we can then later apply a threshold at the lowest scoring site
that delivers a desired global FLR (e.g. 1, 5, or 10%), similar to
the q-value approach for standard database searching.
The global FLRs for all of the data sets were estimated using two
methods—model FLR and decoy FLR method. For the synthetic data
sets, a third method was also used, called answer key method (comparing
against the known phosphosite answer key)
Model FLR Method
First, we estimated the global FLR
using the combined probabilities (global model FLR). For TPP, we use
the 1—final probability, to give the local FLR (PEP) for each
given site scored in a ranked list. The “Global model FLR”
is calculated as a running sum of the local FLR divided by the count
of rows (eq ), that
is, the estimated frequency of false localizations at each row in
a ranked list, divided by the total number of reported observations.
Global Model
FLR Equation
Where n is the count of observations,
PPSM is the local FLR (PEP) for a given PSM identification
and PPTM is the local FLR for a given site localization.
Decoy FLR Method
We used the identification of phosphorylated
decoy amino acids (e.g. worked example follows for alanine, pA), as
these are known to be false localizations and can therefore be used
to estimate the FLR. The counts of the phosphorylated decoy amino
acids were first normalized to allow comparison with true hits; by
modeling the random frequency, one would expect incorrect sites to
be assigned to target STY residues. The target/decoy ratio (Tc/Xc) was determined
by dividing the total count of STY residues (Tc) by the total count of the decoy amino acid residues (Xc), within the set of PSMs with a scored phosphosite
(eq ).At each
position in the ranked list, we have a count of decoy amino acids
observed ∑1(pXc). For every
false localization of the decoy amino acid, it would be expected that
there is a count of “silent” false positives within
the target list, proportional to the ratio of STY amino acids divided
by the total count of decoy amino acids, that is, (Tc/Xc) ∑1(pXc), to model the expected frequency
of random wrong hits. The expected count of false positives within
the targets is then multiplied by 2 (to model the normalized frequency
of random wrong assignments among both the decoy amino acid and the
target amino acid) to arrive at a normalized false localization count.
This is a relatively conservative method to calculate global FLR,
but without the correction to multiply by two, the use of a less frequent
decoy amino acid would be insufficiently corrected for.This
normalized false localization count is then divided by the
total count of observations (n), at a given row in
the ranked list, to obtain the Model (global) FLR estimate (pX_FLR in eq ).
Phosphorylated Decoy Amino Acid FLR Equation
Where Tc is the total target (STY)
count, Xc is the total decoy amino acid
count, pXc is the count of phosphorylated
decoy amino acid, and n is the count of observations
at a given position in the
ranked list.
Answer Key FLR
For the synthetic
data sets, we used
the synthetic peptide false localizations in a similar way; the false
localization count (i.e. result not matching the answer key) was divided
by the total count of sites to calculate the FLR (eq ).
Synthetic Answer Key FLR
Where n is
the count of observations, and false localization count Fc is the count of sites not matching the answer key in
a given position in the ranked list.
Collapsing Observations
of a Site across Multiple PSMs
When summarizing study results,
it is desirable to “collapse”
results where there are multiple PSMs supporting the same modification
site down to a single row. There is no well-agreed method for collapsing
data, although common practice when using collapsing multiple PSMs
into individual reports for peptides is to use the maximum peptide
score for ordering results and disregarding the count of PSMs. The
rationale for this simplification is that multiple PSMs reporting
for the same peptide are not independent statistical tests and thus
the same wrong answer can appear in multiple PSMs. As such, a simple
method for ranking final “collapsed” results for sites
is simply to take the maximum final probability. However, our own
profiling of data sets suggests that this method is sub-optimal. Many
of the high scoring decoy hits are supported by only a single PSM,
and so a collapse method that weights sites supported by a higher
number of PSMs is more likely to be true than the one supported by
a single PSM (Supporting Information Figure
S2). For this study, we use a relative ad hoc method for collapsing
multiple observations that attempts to balance maximum final probability
and spectral counts. We took the maximum probability for a given site,
derived from multiple PSMs and binned into final probability values
for 2 decimal places. We ranked via binned final probability and then
ranked within bins via the count of PSMs.
Profiling Distance Distributions
from Real Identifications to
Decoy Amino Acids
In order to compare between the decoy amino
acids investigated, the distribution of amino acids around phosphorylation
sites were compared. The phosphorylation sites obtained searching
each database for phospho (STY) using TPP was first filtered for 5%
model FLR. The minimum distance between an assumed correctly localized
phosphorylated STY and the nearest candidate amino acid was compared,
alongside the minimum distance for the nearest STY. Histograms were
generated with the normalized frequencies of these distributions in
order to compare between the selected decoy amino acids and STY.
Profiling Site Probabilities for Proximal Amino Acids
When
analyzing the results for different decoy amino acids, we observed
particular differences in the global FLR estimates for certain decoys
(particularly pAla vs pGly) that could not be explained by distributions
of amino acids in relation to confident target sites (above). We further
explored these effects by calculating the average final probabilities
for assumed correct sites with different amino acids in the −1
and +1 position relative to the site. The assumed correct sites were
estimated as sites with combined probability ≥0.68. This threshold
was calculated from the average minimum combined probability using
a 5% FLR cut off for each of the decoy FLR estimations across all
searches. These average probability distributions were calculated
for the Arabidopsis and human data
sets, from results of the TPP search with no decoy amino acid (pSTY),
pAla decoy (pASTY), pLeu decoy (pLSTY), and pGly decoy (pGSTY).
Results
Analysis of Synthetic Data Set PXD007058
The analysis
setup aimed to determine whether global FLR within genuinely modifiable
residues (target amino acids) could be estimated reliably by including
in the search a “decoy” amino acid, that is, not modified.
We tested localization on six different amino acids to act as a decoy
in parallel searches: glycine, leucine, alanine, glutamate, aspartate,
and proline, to determine what effect the selection of a particular
decoy had on the results obtained. The set of potential decoy amino
acids was selected based on the following rationale: (i) glycine,
leucine, alanine—no evidence that they can be phosphorylated
in any known biological system; all are relatively frequent amino
acids in most biological systems; (ii) glutamate—frequently
phosphorylated[25] and not typically detectable
as phosphorylated in most standard enrichment MS experiments and thus
could be a plausible choice as a decoy; (iii) aspartate and proline
were chosen as expected to be deliberately poor choices of decoy amino
acids since there are known SP and SD phosphorylation motifs, which
could bias estimates of global FLR. We expect that a statistically
reliable choice of amino acid should have a similar distance distribution
from a phosphorylation site (STY) to another truly possible phosphorylation
site (STY), under the theory that incorrect localizations are more
likely to amino acids nearby in the sequence.We first searched
the PX007058 synthetic dataset, which allowed us to test the three
different methods of FLR estimation against a known answer. The data
set was searched using TPP and the global FLR of these was calculated
using the decoy phosphorylated amino acid method for six different
choices of decoy amino acid, that is, in six parallel searches (Figure a). As described
in the Methods, the global “Decoy” FLR is estimated
based on the counts of hits to the decoy amino acids in the ranked
list of results, adjusted for the ratio of the counts of the decoy
amino acid to the target amino acid in the modified peptides that
have been considered. We also show the global FLR calculation for
all three methods in Figure b–g, split by decoy amino acid choice; that is, (i)
answer key—identifying false localization by comparing to the
known phosphosites (in this synthetic data set, where the truly modified
site is known), (ii) decoy amino acid method, and (iii) model FLR
that is based on summing local FLR calculated by the analysis software
intrinsically (see Methods). The first observation
we make is that the choice of decoy amino acid can have a substantial
effect on the sensitivity (counts of assumed true sites), for a given
estimated global FLR threshold (Table and Figure ). At 5% FLR, the lowest sensitivity is achieved with a Gly
decoy (749 sites) versus the highest sensitivity with a Glu decoy
(952 sites).
Figure 1
(a) Comparison of FLR estimation searching PXD007058 (synthetic
data set) using localization on different decoy amino acids: pAla,
pGly, pLeu, pAsy, pGlu, and pPro (TPP, fully tryptic, 1 %FDR); (b–g)
comparison of FLR estimation methods searching PXD007058 for each
of the different decoy amino acids (TPP, fully tryptic, 1% FDR). X-axis = count of sites, y-axis is global
FLR estimated as q-values.
Table 2
Counts at pX FLR (Calculated by the
Decoy Method) for 1, 5, and 10% Using Each FLR Method, Searching PXD007058
(Synthetic Data Set) (TPP, Fully Tryptic, 1% FDR)
count at 1% FLR
count at 5% FLR
count at 10% FLR
answer key
model
decoy
answer key
model
decoy
answer key
model
decoy
pAla
702
226
799
835
700
866
935
836
921
pGly
571
203
515
778
645
749
866
791
787
pLeu
632
229
665
843
710
823
938
844
914
pAsp
535
258
207
834
721
784
929
849
805
pGlu
722
232
841
858
734
952
975
866
n/a
pPro
636
206
621
817
692
768
883
832
823
(a) Comparison of FLR estimation searching PXD007058 (synthetic
data set) using localization on different decoy amino acids: pAla,
pGly, pLeu, pAsy, pGlu, and pPro (TPP, fully tryptic, 1 %FDR); (b–g)
comparison of FLR estimation methods searching PXD007058 for each
of the different decoy amino acids (TPP, fully tryptic, 1% FDR). X-axis = count of sites, y-axis is global
FLR estimated as q-values.It can
be seen that for pAla and pLeu decoys (Figure b,d), there is a close agreement
between the two “empirical” methods of estimating global
FLR, that is, answer key and decoy, with the model giving more conservative
estimates of FLR. The pGly and pPro methods (Figure c,g) have good agreement between the answer
key and decoy methods up to ∼700 counts, and then, the decoy
method gives more conservative estimates (rising steeply), compared
to the answer key method (and model). The pAsp decoy method agrees
well with the model FLR but is more conservative than the answer key
(Figure e). The pGlu
decoy method is the least conservative, apparently underestimating
global FLR compared to the answer key (Figure f). Overall, the matching of pAla and pLeu
decoy FLR estimation to the answer key FLR gives some supporting evidence
toward pAla and pLeu being appropriate choices for decoy amino acids.
The model FLR method is shown to be more conservative that the other
“empirical” FLR methods in most cases, especially in
the most important regions of distribution, that is, up to 5% global
FLR for example.The estimates from Figure b–g and Table demonstrate greater stability in the model
FLR and answer
key FLR across different decoy amino acids; that is, there is less
variation in sensitivity at a given estimated FLR. This is to be expected
since comparing the six different searches, many of the errors in
localization are due to incorrect localization to a target amino acid
(which largely behave the same across the six searches).On
the same synthetic data set, PXD007058, we also compared the
estimation methods across four different pipelines: TPP, PEAKS, MaxQuant,
and Mascot (Supporting Information Figure
S3 and Table S2), searched for pSTY pASTY and pLSTY (for decoy comparison).
The initial set of results from our analysis pipeline are the redundant
identification of phosphorylation sites; that is, if multiple PSMs
support the same site, these appear as multiple rows (not collapsed).
In general, as noted in the Methods, our preference
is to order these results by the final probability that a site has
been observed (PSM–probability X site localization probability).
This synthetic data set has a small database size and an overall small
count of identifications, which makes it difficult to model PSM probability
accurately. As such, for the synthetic data set only, we ordered results
by site localization probability, having first accepted only PSMs
with FDR <1%.FLR was calculated using the synthetic answer
key and the decoy
amino acid hits. For all four pipelines tested, both the pAla and
pLeu decoy methods agree well with the results from the answer key
FLR method across all three pipelines, demonstrating that our method
with these amino acids gives reliable FLR estimates in a software-independent
manner. There are differences in the total number of sites identified
at a fixed FLR threshold, depending on the pipeline applied. For this
data set, TPP gives the highest sensitivity, followed by PEAKS, Mascot/ptmRS,
and MaxQuant. However, our primary goal in this manuscript is not
extensively to benchmark different pipelines, as there are choices
of algorithm parameterization, which need to be optimized and could
affect conclusions, and thus, we do not make any general conclusions
about software performance for PTM analysis here.The synthetic
data set PXD000138 was also used to compare the FLR
calculations using localization on different decoy amino acids (Supporting Information Figure S5 and Table S3).
These results also showed a close agreement between the “empirical”
methods of global FLR estimation; that is, the answer key and decoy
estimation methods generally agreed with each other across the different
decoy amino acids, although as observed for PXD007058, the choice
of decoy amino acid affects the sensitivity (count of sites at a fixed
FLR). The model estimation gives more conservative estimates of FLR
across all decoy searches. In order to further compare between the
decoy amino acids investigated, the distribution of amino acids around
phosphorylation sites was compared for each of the synthetic sets
(Supporting Information Figures S4 and
S6). It would be assumed that if a phosphosite is wrongly localized,
it would usually be to the nearest other STY residue than the correct
site. We therefore assumed that a statistically reliable decoy amino
acid will follow a similar (normalized) frequency distribution to
the closest STY residue from correct hits. For the two synthetic data
sets, all decoy amino acids had substantial differences between their
distributions in relation to the assumed correct phosphosites, and
none, in particular, matched well the distribution of the nearest
STY residues. Given the artificial nature of synthetic data sets,
we do not expect these distributions to reflect the reality of biological
data sets, and thus, we next explored the behavior of the method on
real data sets.
Biological Data Set Analysis
Data Set
PXD008355
To investigate the effect of using
localization on different decoy amino acids on different data sets,
we compared the FLR estimations across the six different amino acids
using two experimental data sets from A. thaliana and human. Figure shows the decoy FLR comparisons searching the PXD008355 Arabidopsis data set with TPP. Here, we can see a
similar trend as previously seen in the synthetic data set PXD007058.
The FLR estimations with pGly and pPro give the most conservative
performance at higher FLR values; that is, a steep rise in global
FLR (Figure a) giving
the lowest sensitivity at 10% global FLR, although there is a more
complex picture at 1 and 5% FLR values (Table and Figure b). We assume that many studies will aim to threshold
at 5% global FLR; here, we observe lowest counts (of sites at 5% FLR)
for the pAsp decoy and intermediate counts for pGly, pGlu, and pPro
methods, and the highest counts of sites (sensitivity) for pAla and
pLeu decoys. One of the challenges with accurate FLR estimation is
that there can be some high-scoring incorrect localizations, and their
position in the ranked list can have significant implications on the
count of sites at 1% FLR (Figure b and Table ). We thus would not recommend general thresholding at 1%
global FLR but instead we recommend applying a 5% FLR where the global
FLR estimates are likely to be robust. There is additional discussion
of these high-scoring false hits in the Supporting Information (i), Figure S7 and Table S4. We also calculated
the model FLR for each decoy option, demonstrating good agreement
at 5% FLR between the two methods (decoy FLR vs model FLR) for pAla,
pLeu, and pGly decoy options but less good agreement for other decoys
(Supporting Information Figure S8).
Figure 2
Comparison
of FLR estimation searching PXD008355 (Arabidopsis data set) using different decoy amino
acids: pAla, pGly, pLeu, pAsy, pGlu, and pPro (TPP, fully tryptic,
1 %FDR). (a) all PSMs; (b) zoom 5% FLR all PSMs; (c) collapsed by
modified peptide, sorting by combined probability and count of supporting
PSM; (d) collapsed by modified peptide, sorting by combined probability
and count of supporting PSMs, zoom at 5% FLR. X-axis
= count of sites; y-axis is global FLR estimated
as q-values.
Table 4
Counts at pX FLR for 1, 5, and 10%
Using Each Decoy Method, Searching PXD000612 (Human Data Set) (TPP,
Fully Tryptic, 1% FDR) Showing all PSMs (Not Collapsed) and Collapsed
to Unique Sites
count
at 1% FLR
count at 5% FLR
count at 10% FLR
not collapsed
collapsed
not collapsed
collapsed
no collapsed
collapsed
pAla
62,050
5246
95,924
9504
103,609
11,491
pGly
58,367
5557
86,752
9267
94,536
11,193
pLeu
3822
6705
98,875
10172
106,453
12,030
pAsp
2650
809
71,968
8458
86,451
10,052
pGlu
22,563
608
78,708
7819
94,337
10,092
pPro
47,942
3256
79,821
7384
90,430
9482
Table 3
Counts at pX FLR for 1, 5, and 10%
Using Each Decoy Method, Searching PXD008355 (Arabidopsis Data Set) (TPP, Fully Tryptic, 1% FDR) Showing all PSMs (Not Collapsed)
and Collapsing Multiple PSMs to One Row per Modified Peptide, Sorting
by Combined Probability and Count of the Supporting PSMs
count
at 1% FLR
count at 5% FLR
count at 10% FLR
not collapsed
collapsed
not collapsed
collapsed
not collapsed
collapsed
pAla
23,104
1570
40,541
4704
44,556
5815
pGly
18,872
1469
35,939
3654
38,885
4990
pLeu
17,943
1017
42,157
4964
45,875
6068
pAsp
13,490
234
32,595
4151
39,262
5130
pGlu
21,923
2766
34,949
4385
40,532
5607
pPro
23,696
2771
35,170
4483
38,226
5221
Comparison
of FLR estimation searching PXD008355 (Arabidopsis data set) using different decoy amino
acids: pAla, pGly, pLeu, pAsy, pGlu, and pPro (TPP, fully tryptic,
1 %FDR). (a) all PSMs; (b) zoom 5% FLR all PSMs; (c) collapsed by
modified peptide, sorting by combined probability and count of supporting
PSM; (d) collapsed by modified peptide, sorting by combined probability
and count of supporting PSMs, zoom at 5% FLR. X-axis
= count of sites; y-axis is global FLR estimated
as q-values.We next explored data after collapsing multiple scores
from different
PSMs supporting the same site, taking the maximum probability for
a given site for ranking results, along with the greatest number of
supporting PSMs (see Methods). The results
following this collapse step are shown in Figure c,d, demonstrating relatively similar trends,
with considerable differences in sensitivity at 5 and 10% global FLR,
with pAla and pLeu giving highest sensitivity at a 5 and 10% FLR.
The statistical assumptions for the model FLR do not hold after collapse,
so this method was not used.To further investigate the selection
of decoy amino acid candidates,
the minimum distance between an assumed correctly localized phosphorylated
STY (<5% global FLR filtered) and the nearest candidate amino acid
were compared, alongside the minimum distance for the nearest STY
(Figure ). The rationale
for this comparison is that in a regular search, not employing a decoy,
if a phosphosite is wrongly localized, it will usually be to the nearest
other STY residue than the correct site. We assume that a statistically
reliable decoy amino acid will follow a similar (normalized) frequency
distribution to the closest STY residue from correct hits. When comparing
these distances in the Arabidopsis data
set, it can be seen that Ala, Leu, and Gly follow somewhat similar
frequency distributions to proximal STY, particularly, in the + positions
(i.e. toward the C-terminus of the protein). Asp, Glu, and Pro are
all enriched at the +1 position relative to STY, which likely partially
explains the higher FLR estimates observed for the same site counts
in Table and Figure ; that is, the pipeline
wrongly assigns sites to Asp, Glu, and Pro more frequently than it
would be other target sites (STY). We also observed in Table that using a glycine decoy
gave relatively low sensitivity at 5% FLR and steeply increasing FLR
at higher site counts. The results for Gly in Figure are thus an outlier with respective to Figure as well as for the
synthetic data set PXD007058 in Figure , in which we observed the lowest site count at 5%
FLR for estimates using a Gly decoy. Our starting expectation was
that Ala, Leu, and Gly would all make reliable choices as decoy amino
acids, and thus, we also conducted an analysis of amino acid frequencies
to attempt to explain the differences seen for Gly; results are shown
in Supporting Information (ii) and Table
S5. Gly and Ala have similar frequencies of observations in phosphopeptides,
so this also does not explain the disparity. We further explore this
phenomenon below.
Figure 3
Comparison of minimum distance between phosphorylated
STY and the
nearest target amino acid (Ala, Leu, Gly, Asp, Glu, and Pro), compared
to STY distribution, searching PXD008355 (Arabidopsis data set).
Comparison of minimum distance between phosphorylated
STY and the
nearest target amino acid (Ala, Leu, Gly, Asp, Glu, and Pro), compared
to STY distribution, searching PXD008355 (Arabidopsis data set).Similar to the PXD007058 synthetic
data set, a comparison between
FLR estimates was made across the different pipelines: TPP, PEAKS,
MaxQuant, and Mascot/ptmRS. The PXD008355 Arabidopsis data set was searched with an Ala decoy as well as a Leu and Gly
decoy, and FLR estimations were calculated in the same way as before
(Supporting Information Figure S9 and Table
S6). In general, highest sensitivity is achieved by TPP and Mascot/ptmRS,
whereas there are high-scoring decoy (amino acid) hits in the other
two pipelines that lead to much lower sensitivity at a given FLR cut-off.
For TPP and Mascot/ptmRS pipelines, the results from estimation with
the three decoys are largely reproducible; that is, pAla and pLeu
gives highest (and similar) counts of sites at a given FLR, whereas
pGly gives a lower count of sites at the same FLR threshold.In this approach, sites are ordered by final probability (PSM probability
* PTM probability). An alternative approach commonly used in the field
is to threshold first at say <1% FDR for PSMs or peptide, and then
order purely by PTM localization score or probability. We tested a
similar approach to see what effect there is on sensitivity at a given
FLR for the pAla results (Supporting Information Figure S10). While ordering site localizations by PTM probability
only rather than the combined PSM*PTM probability, we can see that
there is lower sensitivity at 1% FLR for the PTM probability option,
and almost identical sensitivity between the two options at 5 and
10% FLR (Supporting Information Table S7).
We therefore conclude that it is slightly superior to model both the
probability that a given PSM is correct as well as that the PTM has
been correctly localized to give the best ordering of results, particularly
for those highest scoring around 1% global FLR.
Data Set
PXD000612
Given that we see consistent trends
for the synthetic data set and Arabidopsis data set in terms of comparing decoys across different pipelines,
for the final validation, we focus only on the use of TPP on one further
validation set from a different species (human). We would expect some
different phosphorylation motifs comparing an animal species to a
plant species, which could affect decoy amino acid performance. Figure and Table illustrate the FLR comparison using the different decoy amino
acids, searching the PXD000612 human data set. A similar trend can
be seen here as in the Arabidopsis data
set with pAla and pLeu giving the highest site counts at 5% FLR and
pAsp giving lowest site count. On the zoomed plot (<5% FLR, Figure b), the same issue
as for data set PXD008355 can be observed, with unstable decoy estimation
at low counts due to a random factor from a few high-scoring decoys
(FLR < 1%). For this data set, there is also good agreement between
the model FLR and the decoy FLR for most amino acids except pGlu,
where the model FLR tends to be more conservative than decoy FLR (Supporting Information Figure S11). We also show
the data after collapsing multiple PSMs reporting on the same site
(Figure c,d), giving
similar trends in sensitivity at fixed FLR thresholds as for data
without collapse.
Figure 4
Comparison of decoy FLR estimation searching PXD000612
(human data
set) using different decoy amino acids: pAla, pGly, pLeu, pAsy, pGlu,
and pPro (TPP, fully tryptic, 1 % FDR). (a) no collapse, all sites
shown; (b) zoom on 5% FLR; (c) collapse to unique sites; (d) zoom
on <5 FLR for data collapsed to unique sites. X-axis is count of sites, y-axis is global FLR, estimated
as q-values by the decoy method.
Comparison of decoy FLR estimation searching PXD000612
(human data
set) using different decoy amino acids: pAla, pGly, pLeu, pAsy, pGlu,
and pPro (TPP, fully tryptic, 1 % FDR). (a) no collapse, all sites
shown; (b) zoom on 5% FLR; (c) collapse to unique sites; (d) zoom
on <5 FLR for data collapsed to unique sites. X-axis is count of sites, y-axis is global FLR, estimated
as q-values by the decoy method.The distance between the phosphorylated STY and the
nearest candidate
amino acid were again compared (Supporting Information Figure S12) to further investigate the effect of decoy amino acid
choice and to examine how the distributions differ between the different
data sets. When comparing these distances in the human data set, a
similar pattern is seen to that of the Arabidopsis data set. It can be seen that Ala and Leu again follow a somewhat
similar frequency distribution to proximal STY residues, again particularly
in the positive direction. Asp and Pro are again enriched at the +1
position relative to STY, which would be expected. Gly is also seen
to follow a similar distribution to STY residues and therefore would
be expected to be a reliable decoy amino acid, based on this measure.
However, looking at the FLR comparisons seen in Figure , Gly can be seen to give more conservative
FLR estimation (or lower site counts at 5% FLR e.g. than Ala or Leu),
as was also seen in the Arabidopsis data set.We next explored whether particular amino acids
in proximity to
true phosphorylation sites cause results to change. We plotted the
average final probabilities for the Arabidopsis data set searched via the TPP pipeline and split according to the
amino acid in the −1 (Figure ) and +1 (Supporting Information Figure S13) position relative to the assumed correct phosphorylation
site, for the data set searched with no decoy pSTY, pAla decoy, pLeu
decoy, and pGly decoy. In the search with no decoy, there is a particular
striking trend that sites have a lower probability when the −1
amino acid is Ser, Thr, or Tyr. This occurs because the site localization
algorithm (PTMProphet in this case) has fewer ions available to discriminate
the correct from incorrect localization. In the pAla and pLeu results,
we see that Ala and Leu in the −1 position cause sites to have
a similar reduction in final probability as Ser, Thr, and Tyr in the
no decoy search, that is, final probability shifts from around ∼0.96
to ∼0.91 (Ala and Leu decoy). We interpret this to mean that
they behave as statistically “good decoys”; that is,
when they are present in the −1 position relative to a true
site, they behave in a similar manner to STY residues. In the pGly
data, there is a much larger drop off in final probabilities when G is in the −1 position (∼0.96 to ∼0.87),
meaning that (most commonly) Gly-pSer sites are scored less well that
phosphoserines preceded by other amino acids, and excessive probability
space is being distributed to the pGly-Ser hypotheses. In the results,
we observe that around 47% (Arabidopsis data) and 35% (human data) of the high scoring pGly decoys have
the pGly-Ser motif. We see similar trends in the human data set (Supporting Information Figures S14 and S15).
It is unclear why this particular amino acid combination causes a
problem for PTM localization, but we hypothesize that the Gly-pSer
bond is perhaps particularly stable during fragmentation and hence
a discriminating y ion terminating with pSer is less
commonly observed. We thus conclude based on what we have observed
that pGly is not an ideal choice for a decoy amino acid.
Figure 5
Comparison
of averaged final site probabilities for all peptides
(final probability ≥0.68 split by amino acid in the −1
positions for the PXD008355 (Arabidopsis data set). (a) STY (no decoy); (b) STY with Ala decoy; (c) STY with
Leu decoy; and (d) STY Gly decoy.
Comparison
of averaged final site probabilities for all peptides
(final probability ≥0.68 split by amino acid in the −1
positions for the PXD008355 (Arabidopsis data set). (a) STY (no decoy); (b) STY with Ala decoy; (c) STY with
Leu decoy; and (d) STY Gly decoy.
Discussion
In the proteomics field, prior to the widespread
adoption of decoy
database searching, there was a general problem with false positive
results in the literature, as labs vied to report the largest number
of peptides and proteins, without control of FDR. It is now accepted
that all proteomics studies should control FDR at an appropriately
conservative level, for example, <1% FDR at the protein-level for
protein-centric studies. It has long been recognized that a similar
problem exists with reporting PTM sites. The accurate discovery of
a site can be crucially important for downstream interpretation since
the identity of the residue (STY for canonical phosphorylation) and
the proximal amino acids govern understanding of the kinase and phosphatase
that regulate it. Given the interest in understanding phosphorylation
(and other PTM) sites in most human diseases, adequate control of
false reporting is crucial. It has been recently reported that >80%
of the reported sites in a popular phosphorylation database are estimated
to be false positives.[26] This likely resulted
due to studies using overly weak FLR thresholds in publications, and
results then get deposited in databases. Correct identifications tend
to be reported from multiple studies, whereas random wrong site identifications
tend to be seen only one or twice, and thus over time, database-level
FLR creeps up.This study is, to our knowledge, the most detailed
attempt to understand
how best to estimate global FLR using localization on decoy amino
acids. We compare the method against the use of a statistical model,
based on summing local FLR values, and the results agree reasonably
but not perfectly well. We also demonstrate that the selection of
a particular amino acid, even when correcting for the frequency of
that amino acid in the results, does influence results more than would
be desirable. We believe that our results back up that either pAla
or pLeu make appropriate decoys based on their similar frequencies
proximal to real phosphorylation sites (in a test case from humans
and a model plant), as compared to target amino acids STY. The results
for pAla and pLeu decoys also agree well with the model FLR (for the
large data sets) and the answer key FLR, for the synthetic data sets.
We have a slight preference to use pAla as a decoy going forward since
there is a slight risk of confusion between Leu and Ile amino acids,
which often cannot be distinguished by MS. In rare cases, where there
are two peptides in the database, differing only by Ile/Leu, errors
or inconsistencies in decoy FLR estimation could be introduced.From the TPP pipeline, using iProphet and PTMProphet, it is possible
in theory to use either the model FLR or the decoy FLR for thresholding
final results. As noted above, performing a 1% global FLR threshold
may be unstable (based on the decoy FLR method), depending on the
chance of appearance of a few decoys high on the ranked list. If control
at this level is required, for example, in cases where the exact sites
of all modifications will be used in future validation studies, the
model FLR would thus be preferred. For a less conservative threshold,
say 5% FLR, for example, used when making more general biological
conclusions from a quantitative study or pathway analysis following
it, we believe that thresholding using the pAla decoy FLR method can
be recommended. The rationale is that this method can be straightforwardly
applied using any combination of tools and is simple to interpret.
Most other pipelines in current use do not report accurate PEP values
for PSMs and for site localization, allowing model FLR to be calculated
reliably. We also recommend that the scores per site (final probability
in the case of TPP-produced data) and the pAla “identifications”
get carried forward and reported. This allows for the potential for
meta-analyses and database submissions to estimate the resulting global
FLR once multiple data sets have been combined.We acknowledge
that there is a slight downside to searching with
a decoy amino acid in that the search spaces for PSM identification
and PTM localization are both increased, leading to a potential loss
in sensitivity. In our analyses, Ala residues are present at less
than 1/3 the total frequency of STY residues, leading to a relatively
modest increase in search spaces, say 30%. We also suggest that the
proteomics field has generally accepted that doubling the PSM search
database (and search time) through the inclusion of decoys is an acceptable
trade-off for gaining the ability to estimate global FDR straightforwardly
and transparently. It is also important to stress that the method
of searching for a decoy amino acid is intended to perform global
correction for different errors that can be introduced in a modification
discovery pipeline (e.g. incorrect peptide identification and incorrect
site localization). However, the approach does not enable calculation
of local statistics for false discovery since a given decoy amino
acid will not be present in all peptides and has a variable count
of decoy sites in other peptides, compared to target amino acids.
For local statistics, groups should rely on pipelines that can calculate
accurate statistics for peptide identification and site localization.
Nevertheless, our results show that adding an empirical decoy amino
gives confidence that appropriate thresholds and quality control has
been applied throughout the analytical pipeline, to deliver the final
results from a study. While we have presented results for pAla as
a decoy for phosphorylation studies, we also suggest that modified
Ala could also be an appropriate decoy for other modification types,
such as Lys modifications acetylation, methylation, ubiquitination
SUMOylation, and so forth or for cases where multiple modifications
are scored at the same time, although we have not yet profiled the
amino acid distributions sufficiently to conclude that Ala is more
suitable than other amino acids in these cases.
Conclusions
We
have assessed six different amino acids for their ability to
act as suitable decoy amino acids for the estimation of global FLR
in phosphoproteomics studies. We have analyzed four data sets, two
synthetic with a known answer and two biological-sample data sets.
We conclude that either Ala or Leu make appropriate decoys and give
reliable estimates of FLR above 1% FLR. Below 1% FLR, estimates can
be unstable due to a few random high-scoring decoys. We demonstrate
that the decoy-based FLR gives similar estimates to a modeled FLR
for Ala and Leu decoys, based on summing local FLR values per site
and based on the answer key for the synthetic data sets. We recommend
that phosphoproteomics investigators should adopt the “pAla”
decoy going forward, that is, the pASTY method, and report sites with
appropriate global FLR control.
Authors: Kirti Sharma; Rochelle C J D'Souza; Stefka Tyanova; Christoph Schaab; Jacek R Wiśniewski; Jürgen Cox; Matthias Mann Journal: Cell Rep Date: 2014-08-21 Impact factor: 9.423
Authors: Eric W Deutsch; Luis Mendoza; David Shteynberg; Joseph Slagel; Zhi Sun; Robert L Moritz Journal: Proteomics Clin Appl Date: 2015-04-02 Impact factor: 3.494
Authors: Jelle Van Leene; Chao Han; Astrid Gadeyne; Dominique Eeckhout; Caroline Matthijs; Bernard Cannoot; Nancy De Winne; Geert Persiau; Eveline Van De Slijke; Brigitte Van de Cotte; Elisabeth Stes; Michiel Van Bel; Veronique Storme; Francis Impens; Kris Gevaert; Klaas Vandepoele; Ive De Smet; Geert De Jaeger Journal: Nat Plants Date: 2019-03-04 Impact factor: 15.793
Authors: Yasset Perez-Riverol; Attila Csordas; Jingwen Bai; Manuel Bernal-Llinares; Suresh Hewapathirana; Deepti J Kundu; Avinash Inuganti; Johannes Griss; Gerhard Mayer; Martin Eisenacher; Enrique Pérez; Julian Uszkoreit; Julianus Pfeuffer; Timo Sachsenberg; Sule Yilmaz; Shivani Tiwary; Jürgen Cox; Enrique Audain; Mathias Walzer; Andrew F Jarnuczak; Tobias Ternent; Alvis Brazma; Juan Antonio Vizcaíno Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
Authors: Anton Kalyuzhnyy; Patrick A Eyers; Claire E Eyers; Emily Bowler-Barnett; Maria J Martin; Zhi Sun; Eric W Deutsch; Andrew R Jones Journal: J Proteome Res Date: 2022-05-09 Impact factor: 5.370
Authors: Juan A Vizcaíno; Eric W Deutsch; Rui Wang; Attila Csordas; Florian Reisinger; Daniel Ríos; José A Dianes; Zhi Sun; Terry Farrah; Nuno Bandeira; Pierre-Alain Binz; Ioannis Xenarios; Martin Eisenacher; Gerhard Mayer; Laurent Gatto; Alex Campos; Robert J Chalkley; Hans-Joachim Kraus; Juan Pablo Albar; Salvador Martinez-Bartolomé; Rolf Apweiler; Gilbert S Omenn; Lennart Martens; Andrew R Jones; Henning Hermjakob Journal: Nat Biotechnol Date: 2014-03 Impact factor: 54.908