James A Dowell1, Logan J Wright1, Eric A Armstrong1, John M Denu1,2. 1. Wisconsin Institute for Discovery, University of Wisconsin-Madison, 330 North Orchard Street, Madison, Wisconsin 53715, United States. 2. Department of Biomolecular Chemistry, University of Wisconsin-Madison, 420 Henry Mall Room 1135 Biochemistry Building, Madison, Wisconsin 53706, United States.
Abstract
Previous benchmarking studies have demonstrated the importance of instrument acquisition methodology and statistical analysis on quantitative performance in label-free proteomics. However, the effects of these parameters in combination with replicate number and false discovery rate (FDR) corrections are not known. Using a benchmarking standard, we systematically evaluated the combined impact of acquisition methodology, replicate number, statistical approach, and FDR corrections. These analyses reveal a complex interaction between these parameters that greatly impacts the quantitative fidelity of protein- and peptide-level quantification. At a high replicate number (n = 8), both data-dependent acquisition (DDA) and data-independent acquisition (DIA) methodologies yield accurate protein quantification across statistical approaches. However, at a low replicate number (n = 4), only DIA in combination with linear models for microarrays (LIMMA) and reproducibility-optimized test statistic (ROTS) produced a high level of quantitative fidelity. Quantitative accuracy at low replicates is also greatly impacted by FDR corrections, with Benjamini-Hochberg and Storey corrections yielding variable true positive rates for DDA workflows. For peptide quantification, replicate number and acquisition methodology are even more critical. A higher number of replicates in combination with DIA and LIMMA produce high quantitative fidelity, while DDA performs poorly regardless of replicate number or statistical approach. These results underscore the importance of pairing instrument acquisition methodology with the appropriate replicate number and statistical approach for optimal quantification performance. Not subject to U.S. Copyright. Published 2021 by American Chemical Society.
Proteomics is a powerful
tool to profile global changes in protein
expression and protein modification states.[1] For these “discovery” experiments to be useful in
guiding follow-up “validation” experiments, they must
have a high degree of quantitative accuracy. As with other “omics
technologies”, many factors can impact the accuracy of quantitative
proteomics, including the type of quantitative workflow, instrument
parameters, replicate number, and statistical approach.

In general, there are three different types of quantitative proteomics workflows: label-free, metabolic labeling, and chemical tagging.
Each of these techniques has certain advantages and disadvantages.
Label-free quantification (LFQ) is the simplest and most straightforward
to set up. However, LFQ generally exhibits higher variance than metabolic labeling or chemical tagging.[2,3] In
contrast, metabolic labeling via stable isotope labeling with amino
acids in cell culture (SILAC) yields highly accurate quantification,
but is limited to cell culture systems and produces lower proteome
coverage than LFQ workflows.[4,5] Chemical tagging via
isobaric tandem mass tags (TMTs) also yields highly accurate quantitation;
however, TMT workflows display quantitative bias unless triple-stage
isolation (MS3) or high-field asymmetric-waveform ion mobility (FAIMS)
is employed, which requires a specific type of mass spectrometer and
an electrospray source.[6−10]

Data-dependent acquisition (DDA) is a highly effective method for
identifying large numbers of peptides and proteins.[11−14] However, DDA presents some unique
quantitative challenges for LFQ. Specifically, DDA precursor ion selection
is based on relative peptide intensities at a given time in the LC–MS
run and thus exhibits run-to-run inconsistencies in peptide identification
and quantification. This stochastic peptide sampling results in a
large number of missing peptide intensities across the samples. These
missing values negatively affect downstream statistical analyses,
especially for low-abundance peptides/proteins.[2,15,16] While newer, faster instruments provide
immense sampling depth, they still suffer from undersampling when
near the limit of detection.[2]

To alleviate the issue of missing values in label-free DDA workflows,
data imputation has been employed.[17−20] A simple application of data
imputation uses the normal distribution to replace the missing values.[21] This methodology has been shown to increase
the quantitative fidelity for low-abundance proteins but not for high-abundance
proteins.[22] More recently, more sophisticated
techniques combining multiple types of data, for example, peptide
intensities and spectral counts, have been used to produce more robust
quantitative performance from DDA data.[23]

In addition to post hoc data imputation, instrument-centric methods
employing data-independent acquisition (DIA) have been developed to
reduce missing values in label-free workflows.[24−26] In contrast
to DDA workflows that produce tandem MS1/MS2 spectra via isolation
and fragmentation of specific peptide precursors, DIA methods cycle
through wide mass isolation windows in which all peptides within a
window are fragmented. These peptide spectra are then matched to a
previously generated peptide library across the LC–MS/MS run.
This unbiased isolation and fragmentation produces very few missing
values.[15] However, DIA produces complex,
multipeptide fragmentation spectra that are generally not suitable
for database searching. To circumvent this issue, DIA workflows perform
peptide identification and quantification using an external peptide
library.[24,25,27] These libraries
are typically generated via parallel DDA experiments and thus require
additional instrument time.

A popular alternative to LFQ is the use of isobaric labeling with
TMTs.[8,28] In this workflow, tryptic peptides are N-terminal-labeled
via amine-reactive isobaric mass tags that contain a sample-specific
reporter fragment ion. After labeling, the samples are combined into
a single sample and then assayed via standard DDA acquisition. Instead
of the quantification being performed on peptide intensities (MS1),
the signal intensity from each unique “reporter” fragment
is compared against the other “reporter” fragments to
yield relative peptide (and protein) quantification. This was traditionally
performed on the peptide fragmentation spectra (MS2); however, it
was discovered that peptide co-isolation leads to interference in
the MS2 channel. To alleviate these issues, triple-stage isolation
(MS3) is used, in which the most intense fragment ions of the precursor
peptides are isolated and further fragmented to produce peptide-specific,
interference-free “reporter” ions.[6−10] This results in almost no missing values because
any peptide that undergoes fragmentation yields a complete quantitative
profile across all samples.

In addition to instrument acquisition methods, quantification is
affected by post-acquisition data handling, including data normalization
and statistical analyses.[29] For protein-level
analyses, typically the peptide intensities for a specific protein
are combined (by sum, mean, or median) to obtain a protein-level intensity
value. Statistical analyses are then performed on these “rolled-up”
protein-level values.[30−32] More recently, peptide-centric statistical approaches
have been developed that perform statistical analysis at the peptide
level before combining results into protein values. These approaches
were shown to have superior quantitative accuracy in comparison with
protein-level statistical approaches.[22,23,33−36]

In addition to protein-level experiments, quantitative mass spectrometry
is employed to analyze the alterations in peptide modification states,
for example, phosphoproteomics and acetylomics.[37−40] The modified peptides are typically
enriched and then analyzed via label-free (DDA or DIA) or isobaric
labeling (TMT) workflows.[41] In contrast
to protein intensities that are a combination of multiple peptide
signals, peptide-level analyses rely on a single intensity of an individual
peptide analyte. This presents more demands on the quantitative accuracy
due to the use of a single intensity for downstream analysis.

While a number of benchmarking studies have evaluated the effects
of instrument acquisition, replicate numbers, and statistical approaches
independently, a holistic benchmarking approach that examines the
synergistic effects of these parameters on quantitative performance
has not been undertaken.[2,29,34,42−48] In this study, we have rigorously determined the combined impact
of these factors on the quantitative performance in label-free proteomics
(DDA and DIA). As a corollary, we also evaluated the impact of statistical
approach on a previously published data set employing isobaric labeling
via TMTs.
Results
Experimental Overview
To assess
the impact of data
acquisition, statistical approach, false discovery rate (FDR) calculations,
and replicate number on quantitative accuracy, we produced a benchmarking
data set by spiking an Escherichia coli lysate into a human HEK293 lysate at 1:4 (1x sample) and 1:2 (2x sample), where the 2x sample contains twice as much total E. coli protein as the 1x sample, while the HEK293 protein amount is held constant (Figure 1). Specifically, 100 μg (1x sample) or 200 μg (2x sample) of the E. coli lysate was spiked into 400 μg of the human lysate (HEK293 cell lysate). Since there is very little protein homology between these species, the expected peptide/protein fold change between the 1x and 2x conditions is known simply based on the species; thus, human peptides/proteins have an ideal fold change of 1 (log2 FC = 0), while bacterial proteins have an ideal fold change of 2 (log2 FC = +1). We analyzed our benchmark standard using
either label-free DIA (LFQ-DIA) or DDA (LFQ-DDA) in combination with
various replicate numbers and statistical approaches. The peptide
library for the DIA analysis was generated from the DDA replicates,
and thus, the identified peptides and proteins are identical between
the two data sets. In this manner, we obviated any identification
bias between data sets.
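The ground truth implied by this spike-in design is easy to encode. Below is a minimal sketch (the protein identifiers are hypothetical, and this is not the authors' analysis code) that maps a protein's species of origin to its expected log2 fold change:

```python
import math

# Spike-in design: E. coli protein doubles from 1x to 2x; human stays constant.
SPIKE_RATIO = {"ECOLI": 2.0, "HUMAN": 1.0}

def expected_log2_fc(protein_id: str) -> float:
    """Expected log2 fold change from the UniProt-style species suffix."""
    species = protein_id.rsplit("_", 1)[-1]
    return math.log2(SPIKE_RATIO[species])

# Hypothetical identifiers, for illustration only.
print(expected_log2_fc("P0A6F5_ECOLI"))  # bacterial: expected log2 FC = +1
print(expected_log2_fc("ALBU_HUMAN"))    # human: expected log2 FC = 0
```

Any protein or peptide can then be scored against this expectation, which is exactly how the true and false positive rates below are defined.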
Figure 1
Workflow overview. Benchmarking standards were
created with an E. coli protein digest
spiked into a human protein
digest (HEK293 cell lysate) at either a 1x or 2x concentration. Benchmark
standards were analyzed by LC–MS/MS using two different acquisition
modes: DDA or DIA. Statistical analysis was performed with four distinct
approaches: t-test, LIMMA, ROPECA, and ROTS.
To extend our evaluation beyond label-free quantitative
methodologies,
we evaluated the impact of statistical approaches on the quantitative
performance of a previously published data set that utilized tandem
mass tagging (TMT-DDA) and label-free DDA.[2] The inclusion
of this data set also served as a laboratory-independent control for
our evaluation of statistical approaches. Gygi et al. employed a similar
experimental design as the one employed in our study, but instead
of using an E. coli digest as their
spike-in proteome, they used a yeast digest. However, in contrast
to our “single-shot” label-free workflow, Gygi et al.
prefractionated their benchmarking sample via offline high pH reverse-phase
HPLC before chemical tagging and LC–MS/MS analysis. Gygi et
al. also employed a Thermo Scientific Lumos mass spectrometer coupled
to a nanoflow HPLC, whereas we used a first-generation Thermo Scientific
Q-Exactive coupled to a microflow HPLC.
Acquisition Methodology
To qualitatively assess the
impact of instrument acquisition modes on quantitative performance,
we analyzed the E. coli benchmarking
standard via DDA and DIA acquisition. These analyses were performed
using the MS1 signal intensities of eight replicates for both the
DDA and DIA acquisition methodologies. We performed moderated t-tests
with a Benjamini–Hochberg (BH) false discovery correction to
identify those peptides and proteins that exhibit significantly different
quantities between the 1x and 2x conditions. We plotted log2 fold change against the negative log10 q-value for each protein or peptide (Figure 2). All proteins and peptides above the horizontal dotted line are statistically significant (q < 0.05).
The expected fold change value is designated by the dotted vertical
trendline at log2 fold change = 1. We assessed the quantitative performance via the overall negative log10 q-values of the bacterial proteins/peptides and the number of bacterial proteins/peptides that most closely adhere to the expected fold change trendline. For the protein-level analysis, the LFQ-DIA exhibited the highest negative log10 q-values as well as a slightly tighter grouping around the expected log2 fold change trendlines (Figure 2A,B). For the peptide-level analysis, the LFQ-DIA data set also had the highest negative log10 q-values. However, in contrast to the protein data, the LFQ-DIA and LFQ-DDA peptide-level analyses displayed a similar data spread and adherence to the expected log2 fold change trendlines (Figure 2C,D). We also visualized
the data as a standard deviation plot and observed similar trends
in the adherence to the expected fold change and data spread as observed
in the scatter plots (Figure S1).
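The q-values used throughout this section come from the Benjamini-Hochberg step-up procedure. A self-contained sketch of that procedure (the p-values are illustrative, not drawn from the benchmarking data set):

```python
def benjamini_hochberg(pvals):
    """BH step-up: q_(i) = min over j >= i of p_(j) * m / j, in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

# Illustrative p-values only.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
qvals = benjamini_hochberg(pvals)
hits = sum(q < 0.05 for q in qvals)       # 2 tests survive q < 0.05
```

Note that three raw p-values sit below 0.05 here, but only two survive the correction; this gap between raw and adjusted significance is what the FDR comparisons later in the paper probe.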
Figure 2
Scatter plots.
Log2 fold change plotted against the negative log10 q-value for each protein/peptide.
Red dots are human proteins/peptides. Blue dots are E. coli proteins/peptides. All proteins above the
horizontal dotted line indicate statistical significance (n = 8, moderated t-test with the BH correction, q < 0.05). Vertical lines indicate a log2 fold
change of ±1. (A) Proteins analyzed via DIA, (B) proteins analyzed
via DDA, (C) peptides analyzed via DIA, and (D) peptides analyzed
via DDA.
As an extension of these observations,
we also qualitatively assessed
the protein- and peptide-level quantitative performance of the Gygi
et al. yeast benchmarking data set using the same metrics of the negative log10 p-values and adherence to expected fold change values (Figure S2). The yeast
TMT-DDA data set exhibited the tightest grouping around the expected
log2 fold change trendline, while the yeast LFQ-DDA data set exhibited a much larger data spread. Also, the negative log10 q-values were higher for the yeast TMT-DDA than for the yeast LFQ-DDA data set. For the peptide-level analysis, the yeast TMT-DDA data set also had higher negative log10 q-values and adhered more closely to the expected log2 fold change trendlines than the yeast LFQ-DDA data set.

Interestingly, all the protein data sets exhibited a tighter grouping
around the expected trendline than the peptide data sets, indicating
a higher quantification fidelity at the protein than at the peptide
level. This is likely due to protein-level quantification being a
composite of multiple peptide signals, essentially producing multiple
measurements of its intensity value, whereas peptide-level quantification
relies on a single intensity value, that of the individual peptide
analyte. Thus, peptide quantification is more prone to quantification
errors and requires a more robust experimental design than protein-level
quantification, a theme we will explore more thoroughly below.
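The protein-level "roll-up" described above can be sketched in a few lines; the data and the choice of combiner (sum here, with mean or median as drop-in replacements) are illustrative:

```python
from collections import defaultdict
from statistics import median

def roll_up(peptide_rows, combine=sum):
    """Combine peptide intensities into one protein-level value per sample."""
    by_protein = defaultdict(lambda: defaultdict(list))
    for protein, sample, intensity in peptide_rows:
        by_protein[protein][sample].append(intensity)
    return {prot: {s: combine(v) for s, v in samples.items()}
            for prot, samples in by_protein.items()}

# Hypothetical peptide intensities: (protein, sample, intensity).
rows = [("P1", "1x", 100.0), ("P1", "1x", 300.0),
        ("P1", "2x", 200.0), ("P1", "2x", 600.0)]
print(roll_up(rows))          # sum:    P1 -> {"1x": 400.0, "2x": 800.0}
print(roll_up(rows, median))  # median: P1 -> {"1x": 200.0, "2x": 400.0}
```

Because each protein value averages over several peptide measurements, random peptide-level error partially cancels, which is the intuition behind the tighter protein-level groupings seen above.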
Statistical Approaches
While the LFQ-DIA data display slightly less data spread and higher negative log10 p-values than the LFQ-DDA data, these observations are largely qualitative. To quantitatively assess the impact of acquisition methodology, statistical approach, and replicate number on protein-level quantitative performance, three distinct statistical methods were compared: t-tests, linear models for microarrays (LIMMA), and the reproducibility-optimized test statistic (ROTS and ROPECA). We
used true positive and false positive rates to assess the performance
of each workflow. True positive rate (TPR) measures the proportion
of actual positives that are identified in an experiment in which
the actual positives and negatives are known. In our experimental
design, a true positive is any E. coli protein/peptide that is identified as being significantly different
between the 1x and 2x benchmarking standards. The false positive rate
is the proportion of nonpositives identified as positive. In our experimental
design, a false positive is any human protein/peptide identified as
being significantly different between the 1x and 2x benchmarking standards.

LIMMA fits a linear model to the expression data.[49] ROTS and ROPECA use a version of bootstrapping to optimize
the parameters that maximize reproducibility to identify differentially
expressed proteins.[29,34] ROTS is specific to protein-level
data and ROPECA is the adaptation of ROTS that calculates peptide-level p-values and then aggregates expression values of each peptide
within a protein into a single value for that protein. To control for the multiple hypothesis testing problem, we applied the BH FDR correction with a q-value threshold of <0.05.[50]

To compare these various approaches for protein-level analysis, stacked bar graphs of true and false positives (Figure 3), confusion tables (Figure S3), and receiver operating characteristic (ROC) curves (Figure S4) were generated. The LFQ-DIA
workflow at high replicate numbers (n = 8) generated
a high number of true positives and a low number of false positives across all statistical approaches, with an average 90% TPR (∼500 out of 568 possible true positives) and a false positive rate of ∼3% (Figure 3A). In contrast, the quantitative performance of DIA at low replicate numbers (n = 4) was highly dependent on the statistical approach, with LIMMA and ROPECA performing extremely well (LIMMA had a 75% TPR, 427 out of 568 possible, and ROPECA had a 70% TPR, 396 out of 568 possible), while the t-statistic had only a 40% TPR and ROTS performed even more poorly, identifying a single true positive (Figure 3B).
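The TPR and false positive rate defined above reduce to simple counts over the species labels. A sketch with hypothetical significance calls:

```python
def rates(results):
    """TPR and FPR for the spike-in design: E. coli hits are true positives,
    human hits are false positives."""
    tp = sum(1 for species, sig in results if sig and species == "ECOLI")
    fp = sum(1 for species, sig in results if sig and species == "HUMAN")
    n_pos = sum(1 for species, _ in results if species == "ECOLI")
    n_neg = sum(1 for species, _ in results if species == "HUMAN")
    return tp / n_pos, fp / n_neg

# Hypothetical calls: (species, significant at q < 0.05).
calls = [("ECOLI", True)] * 9 + [("ECOLI", False)] + \
        [("HUMAN", False)] * 19 + [("HUMAN", True)]
tpr, fpr = rates(calls)  # 9/10 = 0.9 TPR, 1/20 = 0.05 FPR
```

Sweeping the q-value threshold and recording (FPR, TPR) pairs from this same function is what produces the ROC curves in Figure S4.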
Figure 3
Protein true and false positive stacked bar graphs. True positives
(dark blue) and false positives (light blue) were plotted as stacked
bar graphs for each protein workflow and statistical approach. (A)
Eight replicates analyzed via DIA, (B) four replicates analyzed via
DIA, (C) eight replicates analyzed via DDA, and (D) four replicates
analyzed via DDA.
For all replicate numbers
and statistical approaches, the LFQ-DDA
workflow had a lower TPR than the LFQ-DIA workflow. Across all statistical
conditions, the high replicate analysis (n = 8) produced
an average 54% TPR (311 out of 573 possible), with LIMMA slightly outperforming the other two statistical approaches (Figure 3C). However, LFQ-DDA exhibited a poor TPR at low replicate numbers, regardless of the statistical approach, with an average 10% TPR (59 out of 573 possible) (Figure 3D). Even so, there were still large differences in statistical performance at low replicate numbers in the LFQ-DDA data set: LIMMA had a 24% TPR (139 out of 573 possible), while the t-statistic and ROTS had TPRs well below 10%. While the reasons for the difference in quantitative
performances between the LFQ-DIA and LFQ-DDA workflows are unclear,
one of the major differences between the DDA and DIA data sets is
the number of missing values. The LFQ-DDA workflow exhibited a significantly
higher number of missing values (12%) versus the LFQ-DIA workflow
(0.078%), pointing to the possible impact of missing values on quantitative performance.

To examine if the trends we observed
in label-free protein quantification
extended to a chemical tagging approach, we compared the performance
of the TMT and DDA workflows from a previously published data set.[2] The results from the yeast LFQ-DDA data set from Gygi et al. exhibited very similar trends to the E. coli LFQ-DDA data set at low replicates (n = 4), with LIMMA strongly outperforming ROTS and the t-statistic (Figure S5). The yeast LFQ-DDA analysis
yielded a 45% TPR (432 out of 945 possible) and a 3% false positive
rate with LIMMA, while ROTS and the t-test performed
much more poorly. Interestingly, the yeast TMT-DDA workflow at low
replicate numbers (n = 4) had a very high average
TPR of 94% (1222 out of 1304 possible) (Figure S5). These results are in stark contrast to all other workflows
at low replicate numbers, which all exhibited highly variable TPRs
between statistical approaches. It also has the fewest missing values
of any data set (less than 0.15%), once again emphasizing the strong
impact of missing values on quantitative fidelity, especially at low
replicate numbers. In terms of statistics, LIMMA and ROPECA performed
the best across workflows and replicate numbers.[34] In summary, these data demonstrate the complex interplay
of replicate number, instrument acquisition, and statistical approach.

For the peptide-level analysis, two statistical approaches were performed: LIMMA and t-tests (Figure 4). ROTS and ROPECA can only be performed on protein-level analyses, so these statistical approaches were not included.[33,34] The LFQ-DIA analysis at high replicate numbers (n = 8) had a 67% TPR (1786 out of 2655 possible) (Figure 4A). In contrast,
the LFQ-DDA analysis, even at high replicates, performed poorly with
a 29% TPR (715 out of 2501 possible) using LIMMA and a 15% TPR using the t-statistic (Figure 4C). However, at low replicate numbers even DIA performed poorly, with LIMMA and t-tests exhibiting 23% and 1% TPRs, respectively (Figure 4B). The LFQ-DDA workflow at a low replicate number performed even worse, with fewer than 10 true positives using LIMMA and no true positives with the t-statistic (Figure 4D).
Figure 4
Peptide true
and false positive stacked bar graphs. True positives
(dark blue) and false positives (light blue) were plotted as stacked
bar graphs for each peptide workflow and statistical approach. (A)
Eight replicates analyzed via DIA, (B) four replicates analyzed via
DIA, (C) eight replicates analyzed via DDA, and (D) four replicates
analyzed via DDA.
To examine if the trends
observed in the peptide quantification
via LFQ-DDA and LFQ-DIA workflows extend to a chemical tagging approach,
we compared the performance of the Gygi et al. TMT and DDA data sets.
The yeast LFQ-DDA data exhibited a 44% TPR for LIMMA and a 2% TPR
for t-tests (Figure S5). In contrast, the yeast TMT-DDA workflow produced 97% true positives
with LIMMA (6434 out of a possible 6610) and a 75% TPR via the t-statistic (Figure S5). While
not directly comparable to our LFQ-DIA and LFQ-DDA data sets, the
statistical performance of LIMMA and the t-statistic
exhibited highly similar trends across all data sets, with LIMMA outperforming
the t-statistic regardless of acquisition methodology.

In summary, accurate peptide quantification is much more demanding
than protein analyses. The LFQ-DIA and TMT-DDA workflows produced
superior results to the LFQ-DDA workflows, with LIMMA producing the
highest TPRs of any statistical approach across all workflows.
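Much of LIMMA's robustness at low replicate numbers comes from empirical-Bayes moderation: the per-protein variance is shrunk toward a prior before the t-statistic is computed. The sketch below is a simplification of that idea; the prior degrees of freedom d0 and prior variance s0_sq are fixed constants here, whereas LIMMA estimates them from the whole data set.

```python
from statistics import mean, variance
import math

def moderated_t(group1, group2, d0=4.0, s0_sq=0.05):
    """Two-sample t with the pooled variance shrunk toward a prior
    (simplified LIMMA-style moderation; d0 and s0_sq are assumed, not fitted)."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    s_tilde_sq = (d0 * s0_sq + df * pooled) / (d0 + df)  # shrunken variance
    se = math.sqrt(s_tilde_sq * (1 / n1 + 1 / n2))
    return (mean(group2) - mean(group1)) / se

# With tiny within-group variance, the ordinary t on these data is roughly 73;
# moderation tempers it to about 3, damping spuriously confident calls.
a = [10.00, 10.01, 10.00, 10.01]
b = [10.30, 10.31, 10.30, 10.31]
print(round(moderated_t(a, b), 2))
```

This damping of implausibly small within-group variances is one reason LIMMA degrades gracefully at n = 4, where sample variances are poorly estimated.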
Replicate Numbers
To further examine the impact of
replicate numbers on statistical performance, we plotted TPR versus
replicate number for each statistical approach in combination with
either the LFQ-DDA or LFQ-DIA workflow (Figure 5). As with our previous analyses, the statistical
performance was highly variable across the replicate number. For LFQ-DIA
protein quantification, both LIMMA and ROPECA exhibited a high, flat
TPR across replicates, which is consistent with the previous comparison
of low and high replicate data that showed robust performance at both
low (n = 4) and high (n = 8) replicates
(Figure 3). However,
the t-statistic exhibited a steadily increasing curve
that approached a plateau near n = 8 for the LFQ-DIA
workflow, indicating that the optimal replicate number for the t-statistic is closer to n = 8. For the
LFQ-DDA data, both the t-statistic and LIMMA analysis
steadily increased with more replicates and did not appear to plateau,
indicating that the optimal replicate number is above n = 8. For the peptide LFQ-DIA data, both the LIMMA and t-statistic steadily increased to n = 8 without plateauing,
once again indicating that the ideal replicate number is above n = 8. For the LFQ-DDA peptide data, very few true positives
are identified until n = 5 for LIMMA and n = 7 for the t-statistic. These data show
that the statistical performance is highly dependent on the data acquisition
methodology and replicate number.
Figure 5
Replicate analysis. TPR plotted against
replicate number for (A)
each protein and (B) peptide workflow and statistical approach from
the E. coli benchmarking data set.
The legend indicates the instrument acquisition and statistical approach
for each line plot.
FDR Correction
Controlling for the multiple hypothesis testing problem is essential for the quantitative accuracy of large
data sets, including microarrays, RNA-seq, and quantitative proteomics
experiments.[50,51] A commonly used FDR correction
for type I errors (false positives) is the Benjamini–Hochberg
(BH) correction.[50] For all previously displayed
data, we applied the BH correction at a q-value threshold of <0.05
(see the Experimental Section). To assess
the potential impact of FDR corrections on quantitative accuracy,
we also applied the Storey correction at the same q-value cutoff.[51] The Storey correction
is slightly less conservative than the BH correction.[51] Overall, the Storey correction produced higher TPRs with
a concurrent increase in the false positive rates. However, the choice
of FDR correction unexpectedly exhibited variable behavior that depended
on replicate number, statistical approach, and acquisition methodology.

The BH correction (q-value <0.05) in combination
with the LFQ-DIA protein workflow at a high replicate number (n = 8) differed very little from the Storey corrected results,
regardless of statistical approach (Table S4). However, the FDR correction had a slightly larger impact on the
high replicate LFQ-DDA workflow with the Storey correction producing
a consistently higher TPR. The choice of FDR correction also noticeably
affected the results from the low replicate LFQ-DIA workflow. However,
the most dramatic effect of altering the FDR correction was manifested
in the low replicate (n = 4) LFQ-DDA workflow in
combination with LIMMA, with the Storey correction producing 37% more
true positives than the BH correction (Table S4).

These results indicate that FDR corrections can greatly
impact
quantitative performance, especially at low replicates using LFQ-DDA
approaches. It is unclear why the FDR correction produces large effects
in certain conditions, for example, low replicate LFQ-DDA using LIMMA,
and not others, for example, high replicate LFQ-DIA. However, the
high replicate workflows clearly exhibit a greatly reduced dependence
on the choice of FDR correction, again, highlighting the importance
of replicate choice in experimental design.
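The difference between the two corrections is the estimate of the null proportion pi0: BH implicitly sets pi0 = 1, while the Storey method estimates it from the upper tail of the p-value distribution. The sketch below uses a single tuning point lambda = 0.5 (a simplification of the published smoother) and illustrative p-values:

```python
def adjusted_q(pvals, pi0=1.0):
    """Step-up q-values: BH when pi0 = 1.0; Storey-style when pi0 < 1."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

def storey_pi0(pvals, lam=0.5):
    """Estimate the proportion of true nulls from p-values above lam
    (single-lambda simplification of Storey's estimator)."""
    return min(1.0, (sum(p > lam for p in pvals) / len(pvals)) / (1 - lam))

# Illustrative p-values only (not from the benchmarking data set).
pvals = [0.002, 0.009, 0.02, 0.04, 0.055, 0.3, 0.45, 0.6, 0.75, 0.9]
pi0 = storey_pi0(pvals)                       # 3/10 above 0.5 -> pi0 = 0.6
bh = adjusted_q(pvals)                        # BH: 2 hits at q < 0.05
storey = adjusted_q(pvals, pi0=pi0)           # Storey: 3 hits at q < 0.05
```

Because every Storey q-value is the corresponding BH q-value scaled by pi0 <= 1, Storey can only be less conservative, consistent with the higher TPRs (and false positive rates) reported above.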
Discussion
We have employed a quantitative proteomics benchmarking approach
to explore the synergistic impact of instrument acquisition methodology,
replicate number, and statistical approach on quantitative performance.
We have evaluated two LFQ workflows as well as a TMT workflow for protein-
and peptide-level quantitative performance. The relative advantages
of these quantitative workflows have been evaluated before.[2,29,34,42−48] However, the combined impact of replicate number and statistical
approach on these workflows has not been thoroughly explored. Interestingly,
we found a complex relationship between quantitative workflow, statistical
approach, and replicate number.

A common theme across our analyses is that acquisition methodologies with fewer missing values (TMT-DDA and LFQ-DIA) produced superior results at lower replicate numbers. Missing values have been
shown to have a detrimental effect on downstream statistical analyses
across disparate data types.[2,15,16] The effect of missing values is especially detrimental at low replicate
numbers. However, the impact of missing values and/or low replicate
numbers on quantitative performance depends on the statistical approach
employed. Specifically, conventional parametric statistical tests,
such as the t-statistic, are especially affected. In contrast, the
performance of a linear model fitted to expression values, as implemented
in the LIMMA package, was less impacted by low replicate numbers and/or
high numbers of missing values. This is not surprising since LIMMA
has been optimized for smaller sample sizes compared with the t-statistic
and its improved performance might be due to the elimination of small
within-group variance and/or the introduction of a fold-change criterion.[49] Similarly, the performance of the peptide-centric
bootstrapping algorithm employed by ROPECA is also relatively unaffected
by the low number of replicates.[34] Most
likely, this is due to the generation and aggregation of p-values
at the peptide level instead of at the protein level. We would expect
that other peptide-centric statistical methodologies would also exhibit
similar performance, including the algorithm employed in MSstats.[35]

In contrast to protein-level quantification,
which is based on
multiple peptide measurements, peptide-level quantification relies
on a single value. This produces a much more demanding scenario for
downstream statistical analysis, with missing values and/or low replicate
numbers being especially detrimental to the quantitative performance.
Furthermore, the use of a single peptide intensity for quantification
precludes the implementation of peptide-centric statistical methods,
such as those employed by ROPECA and MSstats, that aggregate multiple
peptide p-values into a single protein-level value.
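The peptide-to-protein p-value aggregation described above can be illustrated with a short sketch. This is a simplified stand-in for the PECA/ROPECA idea, not the published algorithm: it uses only the fact that, under the null, the median of n uniform peptide p-values follows a Beta order-statistic distribution, whose CDF for integer parameters equals a binomial tail.

```python
import math

def protein_p_from_peptides(peptide_pvals):
    """Aggregate peptide p-values into one protein-level p-value.

    Under the null, peptide p-values are Uniform(0,1), so the k-th order
    statistic of n of them follows Beta(k, n - k + 1).  Taking the median
    (k = ceil(n/2)) and evaluating its CDF, i.e. P(Binomial(n, x) >= k),
    scores how improbably small the median peptide p-value is.
    """
    p = sorted(peptide_pvals)
    n = len(p)
    k = (n + 1) // 2                      # 1-based index of the lower median
    x = p[k - 1]
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j)
               for j in range(k, n + 1))

# Peptides that agree on a change yield a strong protein-level call,
# while a single outlier peptide does not:
print(protein_p_from_peptides([0.01, 0.02, 0.03, 0.04, 0.05]))   # small
print(protein_p_from_peptides([0.001, 0.40, 0.55, 0.70, 0.90]))  # large
```

The use of the median is what makes this family of methods robust to individual aberrant peptide measurements, and it also explains why the approach cannot be applied when only a single peptide intensity per feature is available.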
For peptide-level quantification, LIMMA clearly outperforms the t-statistic. As mentioned above, this is not too surprising
given that LIMMA has been optimized for small replicate numbers. However,
even LIMMA does not perform well with low replicates in combination
with high numbers of missing values, such as those encountered in
the LFQ-DDA workflows. Thus, peptide-level quantification requires
more robust experimental designs, including the use of data acquisition
methodologies with fewer missing values (TMT-DDA and LFQ-DIA), higher
replicate numbers (n = 8), and more sophisticated
statistical approaches (LIMMA).

Another important consideration
is instrument requirements. DDA
approaches are the simplest to set up and can be performed on a wide
array of instrumentation. A quantitative DDA experiment can be performed
on almost any mass spectrometer, whether it is a quadrupole time-of-flight,
an ion trap (IT), an Orbitrap, or a hybrid instrument. In contrast
to DDA, DIA requires a high-resolution quadrupole instrument to obtain
suitable MS2 data for peptide library matching and quantification.
Although TMT workflows can theoretically be performed on many proteomics
instruments, quantitative accuracy is greatly enhanced when FAIMS
or MS3-based quantification is used; however, this requires an IT-based
instrument or a FAIMS-capable source. Also, TMT workflows generally
require prefractionation, usually via ion exchange or high pH reverse-phase
chromatography, which adds to the complexity and level of expertise
required to execute this workflow.

Another consideration is
the cost of executing these workflows.
The TMT-DDA workflow requires the purchase of chemical tagging reagents,
which increases the cost per sample. Some LFQ-DIA workflows require
the purchase of retention time standards, which also adds to the analysis
cost. While LFQ-DDA workflows do not require the purchase of these
reagents, they require a higher number of replicates to obtain adequate
quantitative performance. This requirement for higher replicate numbers
increases the upstream sample preparation costs as well as the instrument
time per experiment.

In summary, excellent quantitative performance
in bottom-up proteomics
can be achieved using LFQ-DDA, LFQ-DIA, or TMT-DDA workflows. However,
each workflow exhibits unique characteristics that require careful
attention to experimental design, including the choice of replicate
number and downstream statistical approach.
Experimental Section
Sample
Preparation
HEK293 cells were cultured in Dulbecco’s
modified Eagle’s medium with 10% fetal bovine serum until confluent.
The cells were rinsed once with phosphate-buffered saline, lifted
with trypsin/ethylenediaminetetraacetic acid, and then pelleted. The
pellet was resuspended in 1% sodium dodecyl sulfate in 100 mM Tris–HCl
(pH 7.5) and then tip sonicated at 20% amplitude three times for 5
s each. Cysteines were reduced with 25 mM dithiothreitol at 95 °C
for 10 min and then alkylated with 50 mM iodoacetamide for 20 min
in the dark at room temperature. Proteins were precipitated with 4x
volumes of ice-cold acetone for 60 min and then spun at 21,000 × g to
pellet proteins. Precipitated protein was resuspended in 2 M urea
in 25 mM ammonium bicarbonate with tip sonication (as above). Protein
concentration was determined via the bicinchoninic acid assay. The
protein extract was digested with trypsin at 1:50 overnight at 37
°C. The digest was quenched with 0.5% trifluoroacetic acid, and the protein
extracts were desalted via StageTips.[52] The
samples were dried down and resuspended in 0.1% formic acid. E. coli were cultured in 2xYT media and then pelleted
at 8000 × g and processed in the same manner as the HEK293 samples.

We prepared our benchmark sample as previously published by Mann
and co-workers.[53] Two separate samples
were created by mixing the HEK293 and E. coli digests. For the “1x” sample, 100 μg of E. coli digest was added to 400 μg of HEK293
sample and diluted to a total of 600 μL. For the “2x”
sample, 200 μg of E. coli digest
was added to 400 μg of HEK293 sample and diluted to a total
of 600 μL. iRT peptides (HRM Kit, Biognosys) were spiked into
each peptide digest according to the manufacturer’s instructions.
Eight technical replicates of each sample group (1x and 2x) and each
acquisition type (DDA and DIA) were analyzed by LC–MS/MS for
a total of 32 runs. For the mapping of sample IDs to raw files, the
sequence table was exported into an Excel file (Table S4).
LC–MS/MS Parameters
For both
DDA and DIA, 20
μL of each sample was injected onto a Dionex UltiMate 3000 microflow
HPLC and separated at a flow rate of 6 μL/min using a Thermo
Scientific Hypersil Gold C18 column (150 × 0.18 mm, particle
size 3 μm) at 60 °C coupled to a Thermo Scientific Q-Exactive
mass spectrometer with a HESI-II source equipped with a narrow-bore
spray needle. HPLC mobile phases consisted of water + 0.1% formic
acid and acetonitrile + 0.1% formic acid. Peptides were resolved with
a linear gradient of 2–30% ACN over 85 min. HESI parameters
were as follows: sheath gas = 5; spray voltage = 3 kV; capillary temperature
= 240 °C; S-lens RF = 50; auxiliary gas heater = 30 °C;
auxiliary gas flow = OFF; and sweep gas = OFF. The mass spectrometer
was operated in the DDA mode with dynamic exclusion enabled (exclusion
duration = 15 s), MS1 resolution = 70,000, MS1 automatic gain control
target = 1 × 106, MS1 maximum fill time = 100 ms,
MS2 resolution = 17,500, MS2 automatic gain control target = 2 ×
105, MS2 maximum fill time = 200 ms, and MS2 normalized
collision energy = 28. For each cycle, one full MS1 scan (range = 400–1000 m/z) was followed by MS2 scans with a loop
count of 20 and an isolation window size of 2.0 m/z. In DIA, the mass spectrometer was operated with
an MS1 scan resolution = 70,000, automatic gain control target = 1
× 106, MS1 maximum fill time = 100 ms, followed by
a DIA scan with a loop count of 30 using nonoverlapping windows from
400 to 1000 m/z. The DIA settings
were as follows: window size = 20 m/z, resolution = 17,500, automatic gain control target = 1
× 106, DIA maximum fill time = AUTO, and normalized
collision energy = 30. The total cycle time was 3.1 s.
Data Processing
DDA runs were analyzed with MaxQuant
(version 1.6.5.0) using default settings. The data were searched against
a combined human and E. coli Swiss-Prot
database (downloaded December 12, 2017). Search criteria included
carbamidomethylation of cysteine as a fixed modification, methionine
oxidation and N-terminal protein acetylation as variable modifications,
and trypsin/P with two missed cleavages as the enzyme. Mass tolerance
was set at 4.5 ppm for precursor ions and 20 ppm for fragment ions.
Peptide spectrum match and Protein FDR were set at 0.01. For DDA peptide
quantification, match between runs was performed with a matching time
window of 0.7 min and an alignment window of 20 min. MS1 summed isotope
intensities were used. For protein quantification, all peptide MS1
intensities for a given protein were summed to yield the total protein
intensity. The MaxQuant output file was exported as an Excel file
and is included in the Supporting Information (Table S4). Normalization was performed via the summation of the
total intensity of all the identified human peptides. These calculations
were performed using an internally generated R script (see below).
The normalized peptide and protein values were exported to an Excel
file (Table S4).

The DIA runs were
analyzed with Spectronaut Pulsar X (version 12.0.20491.13.20699) using
default parameters. The spectral libraries were generated from the
MaxQuant results from all 16 DDA runs. The spectral library contained
19,561 precursors, 17,571 modified peptides, and 3092 proteins. For
identification, a “mutated” decoy method was used, and
the precursor and protein q-value cutoffs were set
at 0.01, and the retention time prediction was set to dynamic. MS1
summed isotope intensities were used for peptide quantification. Protein
inference was set to Automatic and interference correction was enabled.
The Spectronaut output file was exported as an Excel file and is included
in the Supporting Information (Table S4).
Protein intensities and normalization were calculated as described
for the DDA analysis (see above). The normalized peptide and protein
values were exported to an Excel file (Table S4).
R-Script
The analyses were performed using the R programming
language (www.r-project.org), version 3.6. Our script accepts peptide- or protein-level intensity
data, performs sum normalization on total peptide intensities, rolls
up peptide-to-protein intensities via summation, runs statistical
tests, and generates plots. Note: Data imputation of missing
values was not performed. We have written a general user
guide for downloading and running the script in RStudio or an editor of
your choice, which includes the necessary parameters for the input files
(see below).
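As a rough illustration of the two central steps the script performs, sum normalization on the human "background" peptides and peptide-to-protein rollup by summation, the following Python sketch mirrors the description above. The actual implementation is the R script on GitHub; the function and variable names here are invented for this sketch, and missing values are simply absent (no imputation, matching the pipeline).

```python
from collections import defaultdict

def sum_normalize(runs, reference_peptides):
    """Scale each run so its summed intensity over the reference
    peptides (here, the constant human background) matches a common
    target, the mean reference total across runs."""
    totals = {run: sum(v for pep, v in peps.items() if pep in reference_peptides)
              for run, peps in runs.items()}
    target = sum(totals.values()) / len(totals)
    return {run: {pep: v * target / totals[run] for pep, v in peps.items()}
            for run, peps in runs.items()}

def rollup(peptide_to_protein, run):
    """Protein intensity = sum of the MS1 intensities of its peptides."""
    protein = defaultdict(float)
    for pep, v in run.items():
        protein[peptide_to_protein[pep]] += v
    return dict(protein)

# toy example: two runs, two peptides mapping to one protein
runs = {"run1": {"PEPA": 100.0, "PEPB": 50.0},
        "run2": {"PEPA": 220.0, "PEPB": 80.0}}
norm = sum_normalize(runs, {"PEPA", "PEPB"})
print(norm)
print(rollup({"PEPA": "P1", "PEPB": "P1"}, norm["run1"]))
```

Normalizing on the human peptides only is what keeps the constant background comparable across runs while leaving the deliberate spike-in difference intact.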
Statistical Analysis
Statistical
analyses were performed
on both protein- and peptide-level intensities using q-adjusted t-tests, LIMMA, and ROTS. FDR was controlled
via BH or Storey corrections.[50,51]
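The moderated t-statistic idea behind LIMMA can be sketched as follows. This is a minimal illustration, not LIMMA itself: the prior degrees of freedom d0 and prior variance s0_sq are fixed by hand here, whereas LIMMA estimates both empirically across all features, which is what stabilizes inference at small sample sizes.

```python
import math
from statistics import mean, variance

def moderated_t(group1, group2, d0=4.0, s0_sq=0.05):
    """LIMMA-style moderated t-statistic (simplified sketch).

    The pooled per-feature variance is shrunk toward a prior variance
    s0_sq carrying d0 prior degrees of freedom; the statistic is then
    referred to a t-distribution with d0 + d degrees of freedom.
    """
    n1, n2 = len(group1), len(group2)
    d = n1 + n2 - 2                                   # residual df
    s_sq = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / d
    s_tilde_sq = (d0 * s0_sq + d * s_sq) / (d0 + d)   # shrunken variance
    t = (mean(group1) - mean(group2)) / math.sqrt(s_tilde_sq * (1 / n1 + 1 / n2))
    return t, d0 + d                                  # statistic, moderated df

# toy example: n = 4 vs n = 4 log-intensities for one feature
t, df = moderated_t([10.1, 10.3, 10.2, 10.4], [11.0, 11.2, 10.9, 11.3])
print(t, df)
```

The shrinkage prevents features with an accidentally tiny within-group variance from producing inflated statistics, which is one reason LIMMA outperforms the ordinary t-statistic at low replicate numbers in these benchmarks.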
Yeast Benchmarking Label-Free
DDA and TMT Data Sets
We analyzed a previously published
data set from Gygi et al.[2] This benchmarking
data set is very similar to
our in-house generated sample except a yeast proteome spike-in was
used instead of an E. coli spike-in.
We analyzed both the DDA and TMT data sets using our analysis pipeline.
The original Excel files were downloaded and minimally reformatted
to conform to the data frame requirements of our R script. We only
analyzed the “2x” versus the “1x” spike-in
samples and did not include the “3x” samples. Normalization,
peptide-to-protein rollup, and statistical analysis were performed
as described above. The normalized peptide and protein values were
exported to an Excel file (Table S4).
Data and R Script Availability
The normalized intensity
values for all the experimental conditions are summarized in Table S4. The raw data, spectral libraries, and
output files are publicly available on the ProteomeXchange data repository
(PXD018408) and the MassIVE repository (MSV000085239). The R script
and associated output files are available on GitHub (https://github.com/DenuLab/LFQMagic).