Patrick Willems1,2,3, Ursula Fels1,4, An Staes4,5, Kris Gevaert4,5, Petra Van Damme1. 1. Department of Biochemistry and Microbiology, Ghent University, Ghent 9000, Belgium. 2. Department of Plant Biotechnology and Bioinformatics, Ghent University, Ghent 9000, Belgium. 3. VIB-UGent Center for Plant Systems Biology, Ghent 9052, Belgium. 4. VIB-UGent Center for Medical Biotechnology, Ghent 9052, Belgium. 5. Department of Biomolecular Medicine, Ghent University, Ghent 9000, Belgium.
Abstract
In the context of bacterial infections, it is imperative that physiological responses can be studied in an integrated manner, meaning a simultaneous analysis of both the host and the pathogen responses. To improve the sensitivity of detection, data-independent acquisition (DIA)-based proteomics was found to outperform data-dependent acquisition (DDA) workflows in identifying and quantifying low-abundant proteins. Here, by making use of representative bacterial pathogen/host proteome samples, we report an optimized hybrid library generation workflow for DIA mass spectrometry relying on the use of data-dependent and in silico-predicted spectral libraries. When compared to searching DDA experiment-specific libraries only, the use of hybrid libraries significantly improved peptide detection to an extent suggesting that infection-relevant host-pathogen conditions could be profiled in sufficient depth without the need of a priori bacterial pathogen enrichment when studying the bacterial proteome. Proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the dataset identifiers PXD017904 and PXD017945.
Among other goals, proteomics aims to identify and quantify changes
in protein levels. Even the relatively small genomes of microorganisms, such as that
of the bacterium Salmonella Typhimurium, contain approximately 4500 protein-coding genes, so bacterial proteomics
can be complex because a substantial number of proteins need to
be profiled. These numbers become considerably higher when considering
proteoforms, that is, multiple protein products arising from a single
gene.[1,2]
The preferred method for analyzing
proteomes remains
liquid chromatography coupled to tandem mass spectrometry (LC–MS/MS).
Specifically, in bottom-up proteomic approaches, proteins are digested
to peptides by proteases (e.g., trypsin) which are
subsequently analyzed by LC–MS/MS. Mass spectrometers are then
mostly operated in the so-called data-dependent acquisition (DDA)
mode. Here, following an MS1 scan, a number of peptide ions of sufficiently
high intensity are selected for further fragmentation to deliver tandem
MS (MS2) spectra for sequence identification following database searching.[3] Peptide quantification in DDA is routinely based
on the MS1 intensity of the precursor ion, which can be hampered by
interference with chemical noise, resulting in a decreased dynamic
range due to difficulties when trying to quantify low-abundant peptides.[4,5] On the other hand, in the so-called targeted proteomic approaches,
for example, selected reaction monitoring (SRM) and parallel reaction
monitoring (PRM), quantifications are based on MS2 scans that are
composed of multiple fragment ions and are therefore more robust and
reproducible.[6] However, only
a limited number of peptides can be quantified in this way, so targeted approaches do not permit
proteome-wide discovery. In contrast, mass spectrometers operating
in data-independent acquisition (DIA) mode aim to combine the broad
identification power of DDA with accurate quantification available
by targeted approaches.[7] In DIA mode, all
precursors in a predetermined m/z isolation window are fragmented together irrespective of their abundances.
This overcomes the stochastic and thus irreproducible nature of DDA
which biases DDA toward the identification of (highly) abundant peptides.[4] Furthermore, in DIA mode, peptide quantification
is performed at the MS2 level, resulting in more robust quantification
versus the interference-prone MS1 quantification in DDA mode. As the
direct relation between the precursor and fragment ions is lost, however, analysis
of DIA data requires more sophisticated algorithms than
the analysis of DDA data.[8]
Although
spectrum-centric DDA search algorithms are not suited
for the analysis of the complex MS2 spectra generated by DIA, annotated
MS2 spectra and retention times (RTs) from DDA searches are widely
used for constructing spectral libraries to query DIA data.[9−11] The sensitivity of these approaches is thus inherently limited by
the stochastic sampling of DDA. Spectral libraries can be created
in several ways, leading to more or less extensive libraries, with
the library size influencing the accuracy and specificity of identifications
and quantifications.[12] Besides the use
of DDA-based spectral libraries, an alternative way of creating spectral
libraries is by using algorithms that accurately predict MS2 fragmentation
spectra such as MS2PIP[13] and
Prosit.[14] MS2PIP (MS2 peak intensity
prediction) is a data-driven tool[13] that
is trained on different types of public DDA data to predict MS2 peak
intensities.[15] Prosit[14] on the other hand is trained on MS2 spectra of synthetic
peptides and mass spectrometry data generated in the context of the
ProteomeTools project, which aims to provide high-quality
reference MS2 data of synthetic peptides.[16] It is noteworthy that whole-proteome spectral libraries predicted
by MS2PIP and Prosit have been used to search narrow-window
DIA data and generate chromatogram libraries using EncyclopeDIA.[17−19] This DIA-only approach makes it possible to empirically correct peptide predictions
and to limit the spectral library search to peptides identified in the
highly sensitive narrow-window detection runs. Using DIA-only workflows
with chromatogram libraries shows even better performance than DDA-based
experimental libraries, thus bypassing the need for sample-specific
DDA analysis.[17,18] Next to predicted spectral libraries,
the so-called library-free approaches also overcome the stochastic
limitation of DDA by querying DIA data for the best supporting evidence
of peptide detection. The peptide-centric algorithm Peptide Centric
Analysis or PECAN[20] incorporates a sequence-based
RT predictor to improve the sensitivity of detection by filtering
DIA data based on expected RTs. In a DIA-only workflow, PECAN queries
DIA data with a peptide list derived from a background proteome and
outputs auxiliary scores of experimental and expected RTs,
which are further processed by Percolator[21] to report confident peptide and protein identifications. Other
library-free approaches, such as DIA-Umpire,
deconvolute DIA data into DDA-like pseudo-MS2 spectra,
which can then be identified with DDA-based database-searching approaches.[22]
In bacterial infection biology, a comprehensive
mapping of the
proteome profiles of both host and pathogen is needed.[23] Such dual-proteome profiles are currently lacking
likely because of the scarce amount of protein material originating
from the bacterial pathogen in most commonly used infection models.
Since most intracellular bacteria replicate inside the host, or their
viability is affected inside the host, bacterial numbers vary
during the time course of an infection. More specifically, in in vitro infection models using Salmonella-infected human epithelial HeLa cells, bacterial enumeration yields estimates of
<10 bacteria per HeLa cell at early time points and up to 100 bacteria
at late time points of infection.[24] Considering
these numbers, bacteria-to-host protein content varies from ∼1:1000
to 1:100 during the infection process. Despite the improved sensitivity
and speed of contemporary mass spectrometers, the current technologies
still do not permit us to study host and pathogen proteomes simultaneously
with sufficient depth, requiring prior enrichment of bacteria which
can typically be attained by selective host lysis and differential
centrifugation.[25,26] Moreover, from the host point
of view, bystander host cells can first be removed in order to
exclusively profile the proteomes of infected host cells. This is
commonly done using fluorescent bacterial reporter strains and fluorescence-activated
cell sorting.[27,28] Of note, the elimination of host
bystander contributions concomitantly enriches for bacterial content
as typically only a fraction of cells become infected.[29]
In the present work, we aimed to establish a DIA-MS
workflow to
improve the overall sensitivity of protein identification and quantification
of complex Salmonella–host mixtures
containing only a fairly low amount of peptide material derived from Salmonella without bacterial pre-enrichment. We compared
the performance of MS2PIP-predicted and DDA-based spectral
libraries, and, to improve on DDA sensitivity, we extended the spectral
library[30] by including MS2 spectra obtained
from pre-fractionated LC–MS/MS DDA data of Salmonella grown under different infection-relevant conditions. Finally, by
integrating (predicted) DDA libraries and library-independent approaches,
a hybrid spectral library was created,[31] which could achieve an up to 2- to 3-fold increase in consistently
quantified human or spiked-in Salmonella proteins and peptides.
Experimental Procedures
HeLa Cell Culture
HeLa cells (epithelial cervix adenocarcinoma,
American Type Culture Collection, Manassas, VA, USA; ATCC CCL-2) were
cultured in GlutaMAX containing Dulbecco’s modified Eagle medium
(Gibco, cat no. 31966047) supplemented with 10% fetal bovine serum
(Gibco, cat no. 10270-106) and 50 units/mL penicillin and 50 μg/mL
streptomycin (Gibco; cat no. 5070–063). Cells were cultured
at 37 °C in a humidified atmosphere with 5% CO2 and
passaged at a 1:8 ratio every 4 days.
Bacterial Strain and Salmonella Cultivation Conditions
The Salmonella enterica serovar Typhimurium wild-type
strain SL1344[32] (Genotype: hisG46, Phenotype:
His(-); biotype 26i), herein referred
to as Salmonella, was obtained from
the Salmonella Genetic Stock Center
(SGSC, Calgary, Canada; cat no. 438). Bacterial growth was performed
in liquid Lennox (L) growth medium (10 g/L Bacto tryptone, 5 g/L Bacto
yeast extract, 5 g/L NaCl), Luria-Bertani (LB)-Miller broth (10 g/L
Bacto tryptone, 5 g/L Bacto yeast extract, 10 g/L NaCl) or variants
of phosphate carbon nitrogen (PCN) medium[33] (InSPI2; pH 5.8, 0.4 mM Pi), SPI2-inducing PCN (pH 5.8, 0.4 mM inorganic
phosphate) containing low levels (10 μM) of magnesium sulfate
(PCN medium was stored at 4 °C and brought to room temperature
for cultivation). Given the histidine auxotrophy of the SL1344 strain
used, all PCN media were supplemented with histidine to a final concentration
of 5 mM. For bacterial cultivation, single colonies were picked from
LB plates, inoculated in 8 mL of liquid Lennox (L) growth medium (L-broth),
and grown overnight at 37 °C with agitation (180 rpm). Subsequently,
the overnight cultures with an optical density measured at 600 nm
(OD600) of ∼4.8 were diluted 1:200 (∼OD600 0.02) in T175 flasks in 50 mL of L-medium without antibiotics
and grown under ten different (infection-relevant) growth conditions
as reported previously.[34] More specifically,
bacteria were grown to early-exponential growth phase (EEP; OD600 0.1), mid-exponential growth phase (MEP; OD600 0.3), late-exponential growth phase (LEP; OD600 1.0),
early-stationary phase (ESP; OD600 2.0), and late-stationary
phase (LSP; OD600 2.0 + 6 h of extra growth). In addition,
environmental shocks in LB were performed on MEP-grown bacteria by
the addition of NaCl to a final concentration of 0.3 M and continued
growth for 10 min or, in the case of anaerobic shock, growth for an
additional 30 min without agitation in a filled and tightly closed
50 mL Falcon tube. For growth in variants of PCN minimal medium,[33] overnight-grown LB cultures were washed twice
in PCN medium before resuspension at OD600 0.02. Cells
were grown in SPI2-inducing PCN or low-magnesium SPI2-inducing PCN.
The nitric oxide shock conducted in PCN (InSPI2) was performed at
OD600 0.3 by the addition of the nitric oxide donor spermine
NONOate to a final concentration of 250 μM for 20 min (nitric
oxide shock (InSPI2)).[35] Bacterial cells
were collected by centrifugation (2600g, 10 min)
at 4 °C and the supernatant was discarded. Samples were flash-frozen
in liquid nitrogen and stored at −80 °C until further
processing.
Proteome Extractions and Sample Preparation
Cell pellets
were resuspended in guanidinium chloride (Gu.HCl)-containing lysis
buffer (4 M Gu.HCl, 50 mM ammonium bicarbonate (pH 7.9)) at 5 ×
10⁹ Salmonella and 1 ×
10⁷ HeLa cells per 500 μL of lysis buffer and mechanically
lysed by three rounds of freeze–thaw cycles in liquid nitrogen.
The lysates were sonicated (Branson probe sonifier output 4, 50% duty
cycle, 2 × 30 s, 1 s pulses) followed by centrifugation (16,100g, 10 min) at 4 °C, to remove cellular debris. The
protein concentration of the supernatant was determined by Bradford
measurement according to the manufacturer’s instructions (Bio-Rad,
cat no. 5000006).
Sample mixtures were made from trypsin-digested
total Salmonella and/or HeLa protein
lysate(s). More specifically, for the infection-relevant complex Salmonella sample, an equimolar mix of Salmonella protein samples originating from Salmonella grown in the 10 infection-relevant conditions
was made. Alternatively, in the case of complex Salmonella–host mixtures, Salmonella protein
lysate (S) dilution series in protein lysates of human HeLa cells
(H) (hereafter referred to as artificial mixtures) were made by mixing
proteome samples prior to digestion in a 1:9 ratio and making a dilution
series thereof to obtain complex S/H proteome mixtures with the corresponding Salmonella/HeLa protein ratios of 1:99, 1:999, and
1:9999. In addition, a sample containing equal amounts of Salmonella and HeLa proteins (1:1 ratio) was prepared.
All spiked-in samples were prepared in triplicate. For all protein
mixtures, an aliquot equivalent to 400 μg of total protein was
transferred to a 1.5 mL Eppendorf tube, twice diluted with liquid
chromatography (LC)-grade water and precipitated overnight with 4
volumes of −20 °C acetone. The precipitated protein material
was recovered by centrifugation (3500g, 15 min) at
4 °C, and pellets were washed twice with −20 °C 80%
acetone and air-dried upside down at room temperature until no residual
acetone odor remained. Pellets were resuspended in 200 μL of
TFE (2,2,2-trifluoroethanol) digestion buffer (10% TFE, 100 mM ammonium
bicarbonate, pH 7.9) with sonication at 4 °C (Branson probe output
20; 1 s pulses) until a homogeneous suspension was reached. All samples
were digested overnight at 37 °C using a Trypsin/Lys-C Mix (mass
spec grade, Promega, Madison, WI) (enzyme/substrate of 1:100 w/w)
while mixing (550 rpm). Samples were acidified with TFA to a final
concentration of 0.5% and cleared from insoluble particulates by centrifugation
(16,100g, 15 min) at 4 °C and the supernatant
was transferred to new Eppendorf tubes. Methionine oxidation was performed
by the addition of hydrogen peroxide to a final concentration of 0.5%
for 30 min at 30 °C. Solid-phase extraction of peptides was performed
using 100 μL pipette tips containing a C18 reversed-phase sorbent
according to the manufacturer’s instructions (Agilent,
Santa Clara, CA, USA, cat no. A57003100K). The pipette tip was conditioned
by aspirating the maximum pipette tip volume of water/acetonitrile
(ACN), 50:50 (v/v) and the solvent was discarded. After equilibration
of the tip by washing three times with the maximum pipette tip volume
in 0.1% TFA in water, 100 μL of the acidified peptide mixtures
(∼200 μg) was dispensed and aspirated for 10 cycles for
maximum binding efficiency. The tip was washed three times with the
maximum pipette tip volume of 0.1% TFA in water/ACN, 98:2 (v/v) and
the bound peptides were eluted in LC–MS/MS vials with the maximum
pipette tip volume of 0.1% TFA in water/ACN, 30:70 (v/v). The samples
were vacuum-dried in a SpeedVac concentrator and redissolved in 100
μL (infection-relevant Salmonella peptide mixture) for subsequent reversed-phase (RP-HPLC) fractionation
(see below) or, for LC–MS/MS analysis, in 50 μL (artificial
mixtures) or 20 μL (fractionated RP-HPLC samples obtained from
the infection-relevant Salmonella peptide
mixture) of 2 mM tris(2-carboxyethyl)phosphine (TCEP) in 2% ACN spiked
with an indexed RT (iRT) peptide mix (Biognosys, Schlieren, Switzerland)
according to the manufacturer’s instructions[36] for RT prediction (see below). Samples were stored at −20
°C until further analysis.
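The spike-in ratios of the artificial mixtures follow from a serial dilution of the 1:9 mixture into HeLa lysate. The sketch below reproduces that arithmetic, assuming each step is an exact 10-fold dilution by protein mass (the per-step dilution factor is not stated explicitly and is inferred here from the reported ratios; the function names are illustrative only).

```python
from fractions import Fraction


def dilution_series(start_fraction, fold, steps):
    """Salmonella mass fraction after each successive dilution into HeLa lysate."""
    fractions = [start_fraction]
    for _ in range(steps):
        fractions.append(fractions[-1] / fold)
    return fractions


def as_ratio(fraction):
    """Express a Salmonella mass fraction as an S:H ratio, e.g. 1/100 -> '1:99'."""
    s, total = fraction.numerator, fraction.denominator
    return f"{s}:{total - s}"


# A 1:9 mixture corresponds to a Salmonella fraction of 1/10 of total protein.
series = dilution_series(Fraction(1, 10), fold=10, steps=3)
ratios = [as_ratio(f) for f in series]
print(ratios)
```

Under this assumption, three successive 10-fold dilutions of the 1:9 mixture give the 1:99, 1:999, and 1:9999 ratios listed in the text.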
RP-HPLC Fractionation of
the Complex Salmonella Peptide Mixture
The infection-relevant, complex Salmonella peptide mixture (100 μL) was acidified
by the addition of 5 μL of glacial acetic acid and the peptide
mixture (corresponding to an equivalent of 400 μg of digested
protein) fractionated at pH 5 using an HPLC Agilent series 1100 instrument.
The sample was trapped for 16 min on a reversed-phase trapping column
(35 mm × 300 μm I.D., 5 μm beads C18 material (Dr.
Maisch, Ammerbuch, Germany), fritted and packed in-house). Next,
the sample was separated on an analytical column (150 mm × 250
μm I.D., 3 μm beads C18 material fritted and packed in-house)
using a 100 min gradient from solvent A (10 mM ammonium acetate, pH
5.5) to solvent B (10 mM ammonium acetate, 70% ACN, pH 5.5) at a constant
flow rate of 3 μL/min. The constant flow rate was achieved with
an Agilent 1100 series capillary pump in the microflow mode
with the flow controller at 20 μL/min. After the gradient, the
column was run at solvent B for 5 min, switched to solvent A, and
re-equilibrated for 20 min. One-minute fractions were collected in
MS vials over a time interval of 65 min and automatically pooled
by restarting the fraction collection cycle every 10
min, resulting in a total of 10 (pooled) fractions. Peptides were
detected by absorbance at 214 and 280 nm. The fractions were then
vacuum-dried in a SpeedVac concentrator and re-dissolved in 20 μL
of 2 mM TCEP in 2% ACN spiked with the iRT peptide mix as described
above. Samples, referred to as Salmonella pre-fractionated samples, were stored at −20 °C until
further analysis.
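The pooling scheme described above can be sketched as a simple modulo assignment. This is an illustration only, assuming that collection starts in the first vial and that the collector returns to the first vial after every tenth one-minute fraction; the vial numbering and function name are hypothetical.

```python
from collections import defaultdict


def pool_fractions(n_fractions=65, cycle=10):
    """Map each pooled vial (0-based) to the 1-based one-minute fractions it receives."""
    pools = defaultdict(list)
    for index in range(n_fractions):
        # Restarting the collection cycle every `cycle` minutes sends
        # fraction index, index + cycle, index + 2 * cycle, ... to the same vial.
        pools[index % cycle].append(index + 1)
    return dict(pools)


pools = pool_fractions()
for vial, minutes in pools.items():
    print(vial + 1, minutes)
```

With 65 fractions and a 10 min cycle, the first five vials each receive seven one-minute fractions and the last five receive six, so every pooled fraction samples the full elution range.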
LC–MS/MS Data Acquisition
From each artificial
mixture and Salmonella pre-fractionated
sample, 10 and 2 μL were injected onto the column, corresponding
to 2 and 4 μg peptide material, respectively, for LC–MS/MS
analysis on an Ultimate 3000 RSLC nano system in-line connected to
a Q Exactive HF BioPharma mass spectrometer (Thermo Fisher Scientific,
Bremen, Germany). Trapping was performed at 10 μL/min for 4
min in loading solvent A [0.1% TFA in water/ACN (98:2, v/v)] on a
20 mm trapping column [made in-house, 100 μm internal diameter
(I.D.), 5 μm beads, C18 Reprosil-HD, Dr. Maisch, Ammerbuch,
Germany]. After flushing from the trapping column, the peptides were
loaded and separated on an analytical 200 cm μPAC column with
C18-endcapped functionality (PharmaFluidics, Belgium) kept at a constant
temperature of 50 °C. Peptides were eluted using a non-linear
gradient reaching 9% MS solvent B [0.1% formic acid (FA) in water/ACN
(2:8, v/v)] in 15 min, 33% MS solvent B in 105 min, 55% MS solvent
B in 125 min, and 99% MS solvent B in 135 min at a constant flow rate
of 300 nL/min, followed by a 5 min wash with 99% MS solvent B and
re-equilibration with MS solvent A (0.1% FA in water). For both analyses,
a pneu-Nimbus dual-column ionization source was used (Phoenix S&T),
at a spray voltage of 2.6 kV and a capillary temperature of 275 °C.
For the first analysis (DDA mode), the mass spectrometer automatically
switched between MS and MS2 acquisition for the 16 most abundant ion
peaks per MS spectrum. Full-scan MS spectra (375–1500 m/z) were acquired at a precursor resolution
of 60,000 at 200 m/z in the Orbitrap
analyzer after accumulation to a target value of 3,000,000. The 16
most intense ions above a threshold value of 13,000 were isolated
for higher-energy collisional dissociation (HCD) fragmentation at
a normalized collision energy of 28% after filling the trap at a target
value of 100,000 for maximum 80 ms injection time using a dynamic
exclusion of 12 s. MS2 spectra (200–2000 m/z) were acquired at a resolution of 15,000 at 200 m/z in the Orbitrap analyzer.
Another
10 μL aliquot from each artificial mixture was analyzed using
the same mass spectrometer in the DIA mode. Nano LC conditions and
gradients were the same as used for DDA. Full-scan MS spectra ranging
from 375 to 1500 m/z with a target
value of 5 × 10⁶ were followed by 30 quadrupole isolations
with a precursor isolation width of 10 m/z for HCD fragmentation at a normalized collision energy
of 30% after filling the trap at a target value of 3 × 10⁶ for a maximum injection time of 45 ms. MS2 spectra were acquired
at a resolution of 15,000 at 200 m/z in the Orbitrap analyzer without multiplexing. The isolation intervals
ranged from 400 to 900 m/z with
an overlap of 5 m/z.
Processing of DDA Data
Raw data files corresponding
to 10 fractions of the complex, infection-relevant Salmonella peptide mixture and 5 artificial dual-proteome
mixtures (1:1, 1:9, 1:99, 1:999, and 1:9999 Salmonella/HeLa samples, in triplicate) were searched in parallel using MaxQuant[37] (version 1.6.10.43). Protein databases for searching
the obtained spectra were either the UniProt knowledgebase (UniProtKB)
proteomes for Salmonella pre-fractionated
samples (proteome UP000008962, 4657 proteins) or the Salmonella proteome database concatenated to the
human UniProtKB database for artificial mixtures (proteomes UP000008962
[4657 proteins] and UP000005640 [74,449 proteins]). In addition, MaxQuant
built-in contaminant proteins and the 11 iRT peptide sequences (Biognosys-11)
were included in the search.[36] Methionine
oxidation to methionine-sulfoxide was set as a fixed modification,
and in the case of artificial mixtures, protein N-terminal acetylation
was set as variable modification. To augment peptide quantification,
we performed matching-between-runs with a match time window of 1.2
min and an alignment time window of 20 min and performed label-free
quantitation with the LFQ algorithm using default settings in MaxQuant.
We used the enzymatic rule of trypsin/P with a maximum of two missed
cleavages. The peptide-to-spectrum match level was set at 1% FDR.
Protein FDR, calculated by employing a reverse database strategy, was
set at 1%. For protein quantification in the proteinGroups.txt file,
only unique peptides were considered and all modifications were allowed.
For other search parameters not specified here, default MaxQuant settings
were used.
Processing of DIA Data
DDA-Based Spectral Library
Construction
The “msms.txt”
files outputted using MaxQuant were used as input for the creation
of redundant BLIB spectral libraries (artificial mixtures and Salmonella pre-fractionated samples) using BiblioSpec
(version 2.1).[38] Redundant spectra were
subsequently filtered using the “BlibFilter” function,
requiring entries to have at least 5 peaks (“-n 5”).
The DDA-based spectral library created from the MaxQuant results of
the artificial mixtures is referred to as Library A1. Peptide sequences
uniquely identified in the Salmonella pre-fractionated samples were appended to Library A1 to augment
the detection capacity of Salmonella peptides. This extended DDA-based spectral library is referred to
as Library A2 and a detailed overview is given in Table S1. We transformed the peptide RTs present in the BLIB
library to iRTs using the spiked-in iRT peptides (Biognosys-11). To
this end, empirical RTs of the top-scoring iRT peptide identifications
(lowest posterior error probability, “msms.txt”) in
the DDA samples were used to fit a linear trend line and scale the
RTs. For the artificial mixtures and Salmonella pre-fractionation analyses, the corresponding trend lines were iRT
= 1.220 × RT − 74.566 and iRT = 1.189 × RT − 75.039, respectively.
The updated BLIB files were then converted to DLIB format using EncyclopeDIA,
using the combined human–Salmonella UniProtKB proteome FASTA as background.[19]
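The RT-to-iRT scaling described above amounts to a linear regression followed by a linear transform. The following is a minimal sketch, not the authors' script: it uses a plain least-squares fit, the two trend lines reported in the text as constants, and assumes RTs are expressed in minutes. The function names and the example RT values are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rt_to_irt(rt, slope, intercept):
    """Scale an empirical retention time to the iRT scale."""
    return slope * rt + intercept


# Trend lines reported in the text (slope, intercept).
ARTIFICIAL_MIXTURES = (1.220, -74.566)
PREFRACTIONATED = (1.189, -75.039)

# Illustrative check: points lying on a reported line are recovered by the fit.
example_rts = [60.0, 80.0, 100.0, 120.0]  # hypothetical RTs, assumed in minutes
example_irts = [rt_to_irt(rt, *ARTIFICIAL_MIXTURES) for rt in example_rts]
slope, intercept = fit_line(example_rts, example_irts)
print(round(slope, 3), round(intercept, 3))
```

In practice the fit would use the empirical RTs of the top-scoring iRT peptide identifications against their reference iRT values, and the resulting line would then be applied to every library entry.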
Library-free Searching of Proteome FASTA by PECAN/Walnut
DIA raw data files were converted to mzML by MSConvert using vendor
peakPicking and enabling the “SIM as spectra” option.
Pre-processed DIA samples were searched against a compilation of the Salmonella UniProtKB proteome (UP000008962, 4657
proteins) and human Swiss-Prot proteome (UP000005640, 20,367 proteins)
using the EncyclopeDIA built-in PECAN algorithm.[19,20] We opted to solely search the human Swiss-Prot protein database,
which resulted in a ∼3-fold reduction in the protein database
search space (25,024 vs 74,449 proteins when combining Salmonella [UP000008962] and human [UP000005640]
UniProtKB reference proteomes), in order to minimize the theoretical
peptide search space. This was desired to limit the size of a predicted
spectral library for all possible tryptic peptides (see below) and
overall runtime and memory usage. Since 36,494 out of 36,668 (i.e., 99.53%) of all human peptides identified using MaxQuant
matched a Swiss-Prot protein entry, no drastic loss in identifications
is anticipated. Default settings were used, except for methionine
oxidation (to methionine-sulfoxide) being set as fixed modification
and considering a maximum length of 25 amino acids and HCD as the
fragmentation type.
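The search-space figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text and takes the 74,449 figure as the larger search space exactly as reported; the variable names are illustrative.

```python
# Protein counts as stated in the text.
salmonella_proteins = 4657
human_swissprot_proteins = 20367
larger_search_space = 74449  # figure quoted in the text for the UniProtKB-based search

# Reduced search space: Salmonella proteome plus human Swiss-Prot entries.
reduced_search_space = salmonella_proteins + human_swissprot_proteins
fold_reduction = larger_search_space / reduced_search_space

# Share of MaxQuant-identified human peptides that match a Swiss-Prot entry.
matched, identified = 36494, 36668
retained_percent = 100 * matched / identified

print(reduced_search_space, round(fold_reduction, 2), round(retained_percent, 2))
```

The reduced database therefore holds 25,024 entries, roughly one third of the quoted larger space, while retaining about 99.5% of the human peptides identified by MaxQuant.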
Construction of an MS2PIP-Based
Spectral Library
MS2 spectra were predicted by MS2PIP (version 20190312)[13] for tryptic peptides
derived from an in silico digest of the Salmonella UniProtKB proteome and human Swiss-Prot
proteome (trypsin/P, peptide
length 7–25 AA, mass 500–5000 Da, one missed cleavage,
N-terminal initiator methionine removal considered), provided that the
2+ and/or 3+ peptide precursor fell within 400–900 m/z (the scanned DIA range). This yielded a total of 1,586,777
predicted MS2 spectra for 1,151,386 peptides solely matching human
proteins, 197,782 spectra for 144,156 peptides matching Salmonella proteins, and 117 spectra for 110 peptides
matching both species. We set methionine oxidation (to methionine-sulfoxide)
as a fixed modification for MS2 prediction by MS2PIP. Predicted
spectra were supplemented with DeepLC-predicted RTs using a model
trained on RTs of 35,206 non-redundant peptides identified in DIA PECAN searches (peptide Q-value < 0.01) (as described
in the above section). The MS2PIP-based spectral library
is referred to as Library B.
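The peptide selection rules listed above (trypsin/P, up to one missed cleavage, length and mass limits, and 2+/3+ precursors inside the DIA range) can be sketched as follows. This is a simplified illustration rather than the pipeline actually used: initiator methionine removal is omitted, only the 20 standard residues are handled, and the function names are hypothetical. Residue masses are standard monoisotopic values.

```python
PROTON = 1.007276
WATER = 18.010565
OXIDATION = 15.994915  # methionine sulfoxide, treated as a fixed modification

RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def digest(protein, missed=1):
    """Trypsin/P digest: cleave after every K or R, allowing missed cleavages."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        if aa in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    peptides = []
    for i in range(len(pieces)):
        for j in range(i, min(i + missed, len(pieces) - 1) + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass with every methionine oxidized."""
    mass = WATER + sum(RESIDUE[aa] for aa in peptide)
    return mass + OXIDATION * peptide.count("M")


def precursor_mz(mass, charge):
    return (mass + charge * PROTON) / charge


def library_charges(peptide, charges=(2, 3), mz_range=(400.0, 900.0)):
    """Charge states for which the peptide would enter the predicted library."""
    if not 7 <= len(peptide) <= 25:
        return []
    mass = peptide_mass(peptide)
    if not 500.0 <= mass <= 5000.0:
        return []
    low, high = mz_range
    return [z for z in charges if low <= precursor_mz(mass, z) <= high]


print(digest("AAKGGRCC"))
print(library_charges("PEPTIDE"))
```

Because a peptide contributes one spectrum per qualifying charge state, the number of predicted spectra exceeds the number of peptides, consistent with the spectra and peptide counts reported above.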
Hybrid Library Construction
DIA-based result (ELIB)
libraries generated by EncyclopeDIA after PECAN processing were combined
into a single redundant library. After conversion to BLIB format and
adding the “redundant” tag in library info (sqlite3),
the “BlibFilter” function was used to create a non-redundant
DIA spectral library as described above. All raw DIA data were searched
against this PECAN result library with EncyclopeDIA, and the outputted
ELIB library is referred to as Library C. Aiming at integrating the
results for the three searched libraries, we ran Percolator (v3.5)
independently on the combined EncyclopeDIA scoring features of the
three searches (Library A2, B, and C) for all samples with similar
options as in EncyclopeDIA internal Percolator processing (-y -N 200000
-no-terminate). Subsequently, we used the obtained Percolator score
as the discriminant score for MAYU FDR estimation for peptides matching
non-ambiguously to human Swiss-Prot or Salmonella proteins.[39] Afterward, spectral libraries
were filtered for entries matching proteins with a MAYU protein-FDR
≤ 1%. For the 7192 unique protein entries that passed this
criterion, all spectra corresponding to peptides with an EncyclopeDIA
Percolator peptide-level FDR ≤ 1% were retained for hybrid
spectral library construction. To avoid redundant entries in the hybrid
spectral library, we appended the EncyclopeDIA ELIB library of the
Library A2 search (39,120 peptides) with additional peptide entries
found in search results of Library B and/or Library C. Initially,
8795 additional peptide entries were appended from the filtered results
of Library C and afterward, another 11,541 additional peptide entries
from Library B results. A detailed overview of the hybrid spectral
library is given in Table S2.
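The filtering and appending logic described above can be sketched as an ordered, non-redundant merge. The example below is an assumption-laden illustration: libraries are modeled as plain dictionaries keyed by peptide sequence, the entry fields (`protein`, `q_value`, `source`) and the toy peptides are hypothetical, and the real workflow operates on EncyclopeDIA ELIB files rather than in-memory dictionaries.

```python
def filter_entries(entries, passing_proteins, max_q=0.01):
    """Keep peptides at <= 1% peptide-level FDR that match a protein passing protein-level FDR."""
    return {
        peptide: entry
        for peptide, entry in entries.items()
        if entry["protein"] in passing_proteins and entry["q_value"] <= max_q
    }


def build_hybrid(primary, *others):
    """Start from the primary library and append only peptides not yet present."""
    hybrid = dict(primary)
    appended = []
    for library in others:
        new = {pep: entry for pep, entry in library.items() if pep not in hybrid}
        hybrid.update(new)
        appended.append(len(new))
    return hybrid, appended


# Hypothetical toy libraries; priority order is Library A2, then C, then B.
passing = {"P1", "P2"}
lib_a2 = {"PEPA": {"protein": "P1", "q_value": 0.001, "source": "A2"},
          "PEPB": {"protein": "P2", "q_value": 0.002, "source": "A2"}}
lib_c = {"PEPB": {"protein": "P2", "q_value": 0.004, "source": "C"},
         "PEPC": {"protein": "P1", "q_value": 0.003, "source": "C"}}
lib_b = {"PEPC": {"protein": "P1", "q_value": 0.005, "source": "B"},
         "PEPD": {"protein": "P2", "q_value": 0.006, "source": "B"},
         "PEPE": {"protein": "P3", "q_value": 0.001, "source": "B"}}

hybrid, appended = build_hybrid(
    *(filter_entries(lib, passing) for lib in (lib_a2, lib_c, lib_b))
)
print(sorted(hybrid), appended)

# Sum of the peptide counts reported in the text for the three steps.
implied_total = 39120 + 8795 + 11541
print(implied_total)
```

The order of the arguments encodes the priority: a peptide present in several searches keeps its Library A2 entry first, then its Library C entry. Summing the three reported counts gives 59,456 peptide entries, assuming no further filtering is applied after appending.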
EncyclopeDIA
Spectral Library Searching and Peptide Quantification
The
resulting mzML files were searched against Library A2, B, or
hybrid spectral (DLIB) libraries using EncyclopeDIA software (version
0.90)[19] with default settings. Sample-specific
Percolator output files and EncyclopeDIA result (ELIB) libraries were
stored. Per setup, a combined EncyclopeDIA result library was created
consisting of the three replicates. This step re-runs Percolator
on the combined results and provides peptide and protein quantifications
at a 1% peptide and protein Q-value, respectively.
For quantification, the number of minimum required and quantifiable
ions was set at 5 with aligning between samples enabled.
SpectraST
Library Construction and Searches
We used
SpectraST[40] (version 5.0) for spectral
library searches of the DDA data. Library A2 and Library B were used
to generate a SpectraST spectral library, appending MS2PIP peptide ions to Library A2 (option -cJA). Afterward, a concatenated
target-decoy library was generated, using the precursor swap method[41] (options -cAD -cc -cy1 -c_DPS). The default
search parameters of SpectraST were used except for specifying
a precursor isolation window of 0.01 Th and outputting the top 3 ranked
hits. The FDR was estimated as $\mathrm{FDR} = n_d / n_t$, where $n_t$ and $n_d$ are the number of PSMs of target
and decoy peptides, respectively.
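The target-decoy estimate above can be illustrated with a short function. This is a sketch under stated assumptions: PSMs are represented as (score, is_decoy) pairs, higher scores are taken to be better, and the example scores are invented for illustration; the actual score used for thresholding is not specified here.

```python
def estimate_fdr(psms, threshold):
    """FDR = n_d / n_t among PSMs scoring at or above the threshold."""
    n_t = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    n_d = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return n_d / n_t if n_t else 0.0


# Hypothetical PSMs as (score, is_decoy) pairs.
example_psms = [(0.9, False), (0.8, False), (0.7, True), (0.6, False), (0.5, False)]

print(estimate_fdr(example_psms, threshold=0.5))
print(estimate_fdr(example_psms, threshold=0.8))
```

Raising the score threshold removes the single decoy hit in this toy example, which is how a score cutoff for a desired FDR would be chosen in practice.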
Results
DDA-Based Spectral
Library Searching of DIA Data Improves Detection
of Low-Abundant Peptides
To mimic the proteome complexity
of bacterial host cell infections, we generated five artificial Salmonella–human proteome mixtures (1:1, 1:9,
1:99, 1:999, and 1:9999) in triplicate, referred to as artificial
dual-proteome mixtures. Following trypsin digestion, each sample was
analyzed by LC–MS/MS in both DDA and DIA modes. First, DDA
data were analyzed using MaxQuant against a composite database containing Salmonella and human UniProtKB protein entries (see Experimental Procedures). As anticipated, with decreasing Salmonella protein content, the number of identified Salmonella peptides decreases, whereas the number
of identified human peptides increases (Figure 2, orange bars, top
to bottom). Indeed, although 5008 non-redundant Salmonella peptides are consistently (i.e., in all three replicates) identified in the 1:1 dilution, this number
decreases to 1226 (24.5%) and only 257 (5.1%) in the 1:9 and 1:99
dilutions, respectively.
Figure 2
Peptide
identifications in artificial dual-proteome mixtures acquired
in DDA and DIA modes. The number of identified non-redundant human
(left panel) or Salmonella (right panel)
peptide sequences (x-axis, peptide Q-value ≤ 0.01) is shown per replicate across artificial mixtures
(y-axis, 1:1 to 1:9999). Bars showing triplicate
samples are used to display data acquired in the DDA mode and searched
with MaxQuant (orange) and data acquired in the DIA mode and searched
with EncyclopeDIA against Library A1 (the non-extended DDA-based spectral
library, green) and Library A2 (extended DDA-based spectral library,
blue). The dark-colored portion of the bars and corresponding numbers
within indicate the number of peptide sequences consistently identified
in all three replicate samples.
In a next phase, MaxQuant results were used to create a DDA-based
spectral library designated as “Library A1” (see Experimental Procedures), encompassing a total of
57,056 MS2 spectra corresponding to 44,932 non-redundant peptide sequences
of which 34,475 (76.7%) uniquely matched to human proteins and 10,422
(23.2%) to Salmonella proteins. However,
when analyzing related shotgun samples exclusively consisting of Salmonella peptides on the same MS instrument and loading equal amounts of total peptides, as determined
by microfluidic spectroscopy,[43] we
previously routinely identified between 15,000 and 25,000 Salmonella peptides with MaxQuant,[42] suggesting
that mixing of human and Salmonella proteomes limits Salmonella peptide
identification and thus DDA-based spectral library construction. To
extend the number of Salmonella peptides
in this library, we performed an offline RP-HPLC pre-fractionation
of a digest of a complex proteome mixture obtained from mixing equal
proteome amounts from Salmonella grown in vitro under 10 different (infection relevant) conditions,
as reported in ref (34) (MaxQuant results, see Dataset PXD017904). This way, an additional
12,945 non-redundant Salmonella peptide
sequences (14,709 spectra) were appended to Library A1; we refer
to the extended DDA-based library as “Library A2” (for
entries see Table S1). DIA data from artificial
mixtures were searched against both spectral libraries using EncyclopeDIA[19] (Figure 2, green and blue bars, respectively). In the case of human peptides, DIA analysis approximately doubles the number of peptide
identifications in the 1:1 and 1:9 dilutions compared to DDA analysis
and identifies up to ∼6000 to 9000 additional human peptides
compared to DDA when decreasing Salmonella peptide input further. Overall, this suggests that with increasing
abundance, Salmonella peptide ions
obstruct selection and fragmentation of human peptide ions in the
DDA mode, while clearly much less interference is observed when samples
are analyzed in the DIA mode.[44] Furthermore,
when looking at artificial mixtures across all dilutions, DIA identifies
most peptides consistently (i.e., in all three replicates).
Notably, extending Library A1 with additional Salmonella peptides (Figure 2, left panel, blue vs green bars) increases human peptide identifications,
likely due to increased Salmonella peptide
identifications included in peptide-level FDR scoring. Logically,
a similar trend holds true when inspecting Salmonella peptide identifications (Figure 2, right panel). For instance, in all dilutions,
querying the DIA data with Library A2 more than doubles the number of consistently
identified Salmonella peptides compared
to MaxQuant DDA identifications. Taken together, offline fractionation
methods can provide a useful asset to improve DIA identification rates,
especially in the case of complex proteome mixes with components of
low abundance (e.g., dual-proteome mixes such as
bacterial pathogen infection of human hosts). Most likely, grasping
the whole multi-species peptide mixture complexity is impossible in
the given LC–MS/MS setting, as illustrated here by the
merely ∼10,000 Salmonella peptides
identified, whereas on average more than double that number of Salmonella peptides is identified in pure proteome
samples.
DIA-Only Workflows and Predicted Spectral Libraries Can Join Forces with DDA Spectral Libraries
Although our DIA spectral
library searches clearly outperformed MaxQuant DDA analysis in terms
of peptide identification, an important limitation is that only those
peptides originally identified in DDA analysis can be detected. To
further tap into the peptide discovery potential of DIA data, we implemented
two workflows independent of DDA data. First, we used the DIA library-free
PECAN software to search the human and Salmonella proteome. When using PECAN, the number of peptide identifications
is in-line with or just below DDA data-based identifications but lower
as compared to when our DIA data was queried with Library A2, in line
with previous reports.[19] Second, we generated
a spectral library of an in silico tryptic proteome
digest of the human and Salmonella proteomes
using MS2PIP.[13] In total, 1,784,677
MS2 spectra were predicted for 2+/3+ peptide precursors
(395 to 905 m/z) for 1,295,652 peptides
with enzymatic settings similar to PECAN search settings (Trypsin/P,
7 to 25 amino acid long peptides with a maximum of one missed cleavage).
DIA-based peptide RTs obtained from the PECAN search results were
used to train and predict RTs for all peptides using DeepLC.[45] We then used the MS2PIP-predicted
spectral library, referred to as “Library B”, to query
our artificial mixtures with EncyclopeDIA (Figure 3, left panel, grey bars). Overall, the searches
identified a reduced number of consistently identified human peptides
per dilution series, while nonetheless identifying a similar number
of peptides per run in the 1:999 and 1:9999 dilution series. This relatively
lower consistency is likely due to the increased number of peptide
sequences, ∼35-fold, in Library B compared to Library A2. The
use of large databases has already been shown to increase variation in
peptide and protein quantification.[12] Importantly,
and as illustrated in the respective Venn diagrams (insets Figure 3, left panel), thousands
of peptides present neither in the DDA results nor in Library A2 were
identified when searching Library B, demonstrating the potential of
this approach to identify peptides in DIA not found in DDA. Turning
our attention to Salmonella peptide
identifications, searching Library B shows relatively lower peptide identification
rates (∼60–80%) than Library A2. Nonetheless, a significant
agreement among results is observed, as for instance, in the 1:1 dilutions,
7545 out of 8520 peptide sequences (88.6%) identified in the Library
B search were also identified when querying the artificial mixtures
with Library A2. In a next phase, we inspected the properties of peptides
identified by searching Library B that were not found in the DDA searches
(and thus not part of Library A2). For instance, 1814 peptides were
discovered using Library B in the 1:1 sample, of which 933 human and
881 Salmonella peptides (Figure S1). In the less diluted human samples (1:999
and 1:9999), more than 5000 peptides are uniquely identified using
Library B. When comparing the intensity of peptides included in Library
A2 and those discovered by searching Library B, the intensity distributions of the novel MS2PIP-identified peptides are slightly lower, and this is
more pronounced in the 1:999 and 1:9999 samples (Figure S1A). Hence, these observations suggest that searching
Library B, and thus MS2PIP-predicted spectra, enables the
detection of relatively lower abundant peptide species missed in DDA.
Next, we checked whether the novel MS2PIP-based peptides
matched protein entries that were also matched by other peptides in
the DDA searches. In all dilutions, approximately 85% of MS2PIP-based peptides matched a protein identified in the MaxQuant searches
of the DDA data (Figure S1B). As such,
the majority of MS2PIP-based peptides increase the protein
sequence coverage for proteins present in the DDA spectral library
in turn increasing identification and quantification confidence.
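The in silico digestion settings used for Library B (Trypsin/P, 7 to 25 residues, at most one missed cleavage, 2+/3+ precursors between 395 and 905 m/z) can be sketched in Python as follows. This is only an illustration of the enumeration step, not the MS2PIP pipeline: the function names are hypothetical, the masses are standard monoisotopic residue masses, and cysteine is treated as unmodified, which is an assumption because the modification settings are not restated in this section.

```python
# Monoisotopic residue masses in Da; cysteine unmodified (assumption).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.007276


def tryptic_peptides(protein, min_len=7, max_len=25, missed_cleavages=1):
    """Trypsin/P digest: cleave after every K or R, also when followed by P."""
    pieces, start = [], 0
    for idx, residue in enumerate(protein):
        if residue in "KR":
            pieces.append(protein[start:idx + 1])
            start = idx + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece without K/R
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides


def precursor_mz(peptide, charge):
    """m/z of the protonated peptide at the given charge state."""
    mass = sum(MONO[residue] for residue in peptide) + WATER
    return (mass + charge * PROTON) / charge


def precursors_in_window(peptide, charges=(2, 3), low=395.0, high=905.0):
    """Return (charge, m/z) pairs that fall inside the acquisition range."""
    pairs = [(z, precursor_mz(peptide, z)) for z in charges]
    return [(z, mz) for z, mz in pairs if low <= mz <= high]
```

For example, the Salmonella peptide ILADIAVFDK discussed later in the text yields a doubly charged precursor inside the 395 to 905 m/z range, whereas its triply charged precursor falls below it.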
Figure 3
DIA-based
peptide identifications using different search strategies.
The number of identified non-redundant human (left panel) or Salmonella (right panel) peptide sequences (x-axis, peptide Q-value ≤ 0.01)
are shown per replicate across artificial mixtures (y-axis, 1:1 to 1:9999). Bars showing triplicate samples are used to
display data acquired in the DIA mode and searched with the Library
A2 (blue), Library B (ochre), and EncyclopeDIA built-in PECAN algorithm
(Walnut), using as input the human–Salmonella FASTA (brown). The dark-colored portion of the bars and corresponding
numbers within indicate the number of peptide sequences consistently
identified in all three samples. Venn diagrams indicate the overlap
of consistently identified peptide sequences between Library A2 and
Library B.
Next to a conventional MaxQuant
database search (Figures 2 and S2, orange bars), we combined
Library A2 and Library B to search DDA
data with SpectraST[40] (Figure S2, ochre bars). Spectral library searching has gained
popularity due to its high sensitivity and fast running time and
might therefore improve the detection of Salmonella peptides. At 1:1 and 1:9 dilutions, MaxQuant consistently identifies slightly
more Salmonella peptides, which largely overlap with SpectraST-identified peptides
(∼80%, Figure S2A). However, SpectraST
identifies more peptides at a 1:99 dilution (270 vs 257 peptides)
and at a 1:999 dilution (50 vs 20 peptides), suggesting spectral library
searching as a more sensitive method to identify low-abundant Salmonella peptides. For instance, peptides IVIRPLPGLPVIR
and TNVPHIFAIGDIVGQPMLAHK were identified by both SpectraST and MaxQuant
in the 1:1, 1:9, and 1:99 dilution series but solely by SpectraST in
the 1:999 dilution (Figure S2B,C).
In a next phase, we generated
a hybrid spectral library to assess its potential to further increase
proteome coverage (Experimental Procedures, Figure 1). Hybrid
spectral libraries are essentially merged libraries comprising results
of different DDA or DIA analysis workflows, for example, combining
predicted spectral libraries with experimental libraries,[46] or DDA-based and DIA-only spectral libraries.[31] Here, we merged the EncyclopeDIA ELIB libraries
obtained when querying Library A2, Library B, and a non-redundant spectral
library of PECAN-identified peptides (designated “Library C”)
to generate a single hybrid library (Library A2 + B + C, see Experimental Procedures). One of the key challenges
is to maintain comparable FDR control in merged spectral libraries.[47] To this end, we performed integrated Percolator
processing for all EncyclopeDIA scoring features of the three spectral
library searches combined. Afterward, the Percolator score was used
as the discriminant score for peptide- and protein-level error rate
control by MAYU.[39] It can be observed that
both curves did not yet reach saturation, and a total of 7194 proteins
(5086 human and 2108 Salmonella proteins)
were below a stringent protein FDR of 1% (Figure 4A,B). For hybrid library construction, we
first filtered the EncyclopeDIA ELIB library (i.e., DIA-calibrated results) obtained when searching Library A2, retaining
spectra of 39,120 peptides matching these 7194 proteins at an MAYU
1% protein FDR (Table S2), instead of the
initial 57,877 peptides (Table S1). After
similar protein FDR filtering, DIA-only identified peptides were appended
stepwise: 8795 additional peptides from Library C (i.e., the PECAN result library) and a further 11,541 peptides identified
solely by Library B searches (Figure 4C, Table S2). EncyclopeDIA
searches with the obtained hybrid spectral library (Library A2 + B
+ C) resulted in an increased number of human peptide identifications,
most pronounced in the 1:99 to 1:999 dilution series with an average of
3000–4000 additional peptides consistently identified (Figure 4D). In the case of Salmonella, an increased peptide identification rate
is only observed in the 1:999 and 1:9999 dilution series. This beneficial
effect might arise from the inclusion of more low-abundant peptide
species solely identified in DIA. In addition, the stricter protein FDR
control used for hybrid library construction might also be a plausible
reason, as in these low-input Salmonella samples relatively high proportions of “false Salmonella targets” (also referred to as π0[48,49]) can be anticipated, in our case due to
their (extremely) low abundance and the inclusion of a heterogeneous Salmonella pre-fractionation experiment. Such a larger
π0 necessitates stricter error rate control,[48] as implemented here in our hybrid spectral library
construction (Figure 4A,B).
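The stepwise merging logic described above (protein FDR filtering first, then appending only peptides not yet present, in the order Library A2, Library C, Library B) can be sketched in Python. This is a simplified illustration and not the EncyclopeDIA or MAYU implementation: the dictionary layout, the function name, and the precomputed set of proteins passing the 1% protein FDR are all assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_hybrid_library(
    search_results: List[Tuple[str, Dict[str, str]]],
    passing_proteins: Iterable[str],
) -> Tuple[Dict[str, Tuple[str, str]], Dict[str, int]]:
    """Merge search results into one non-redundant peptide library.

    `search_results` is an ordered list of (library name, {peptide: protein});
    earlier libraries take priority. `passing_proteins` holds the proteins
    that passed the protein-level FDR filter (computed elsewhere).
    """
    allowed: Set[str] = set(passing_proteins)
    hybrid: Dict[str, Tuple[str, str]] = {}
    added: Dict[str, int] = {}
    for name, entries in search_results:
        n_new = 0
        for peptide, protein in entries.items():
            # Keep a peptide only if its protein passed the FDR filter
            # and no earlier library already contributed this sequence.
            if protein in allowed and peptide not in hybrid:
                hybrid[peptide] = (protein, name)
                n_new += 1
        added[name] = n_new
    return hybrid, added
```

Recording the source library per peptide makes it straightforward to report how many entries each step contributes, in the manner of the stepwise counts given in the text.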
Figure 1
Proteome data analysis workflow of hybrid spectral
library construction.
DDA data from artificial dual-proteome mixtures and complex Salmonella pre-fractionated samples were searched
using MaxQuant,[37] and an extended DDA spectral
library (Library A2) of both datasets was constructed using BiblioSpec.[38] In parallel, entire proteomes of human and Salmonella were searched either by searching an MS2PIP-predicted spectral library (Library B) with EncyclopeDIA[19] or searching the combined FASTA using PECAN[20] for which an EncyclopeDIA ELIB library was constructed
(Library C). A combined hybrid spectral library was constructed from
EncyclopeDIA ELIB libraries when searching Library A2, B, and C searches.
The hybrid spectral library only contained non-redundant peptide entries
matching proteins with MAYU protein FDR ≤ 1% (see Experimental Procedures).
Figure 4
Hybrid spectral library construction. (A) Number of target and
true positive protein identifications (orange and blue, respectively)
as a function of the MAYU-estimated protein FDR. (B) Number of target and
true positive peptide identifications (orange and blue, respectively)
as a function of the MAYU-estimated protein FDR. (C) Hybrid spectral library
consisting of 59,475 non-redundant peptide entries matching 7194 proteins
(MAYU protein FDR ≤ 1%). The filtered EncyclopeDIA ELIB library
from the Library A2 search was appended stepwise with additional peptide
entries from Library C (PECAN results, step 1) and subsequently, with
additional peptide entries from Library B (MS2PIP-predicted
spectral library, step 2). For more details on the hybrid spectral
library construction, see Experimental Procedures. (D) Number of identified non-redundant human (left panel) or Salmonella (right panel) peptide sequences (x-axis, peptide Q-value ≤ 0.01)
are shown per replicate across artificial mixtures (y-axis, 1:1 to 1:9999). Bars showing triplicate samples are used to
display data acquired in the DIA mode and searched with Library A2
(blue) or the hybrid spectral library (red). The dark-colored portion
of the bars and corresponding numbers within indicate the number of
peptide sequences consistently identified in all three replicate samples
analyzed.
DIA Improves Quantification of Low-Abundant Proteins
Besides peptide identification rates,
we evaluated and compared protein
quantifications across dilutions between both acquisition modes. Aiming
at assessing the performance per dilution series independently, that
is, as a proxy for biological infection conditions where the bacterial
proteome content is limiting and for which the 1:1 and 1:9 dilution conditions
are thus typically not representative, we ran the three replicates
of each dilution in MaxQuant with matching-between-runs enabled and used
the alignment-between-runs function in EncyclopeDIA. In order to handle
protein inference identically for MaxQuant and EncyclopeDIA results,
we used the average intensity of the three most intense proteotypic
peptides (unambiguously matched and considering Leu and Ile as indistinguishable)
for determining the corresponding protein intensity, that is, the
TOP3 quantification method[50] also used
in the label-free quantification benchmark tool LFQbench.[51] Importantly, we did include protein quantifications
based on single peptides (given the low abundance of Salmonella proteins) and, unlike for DIA, accounted for
missing values in DDA peptide quantifications by averaging over the
non-zero quantifications across replicates. DIA hybrid spectral library
searches quantified 1124 Salmonella proteins with a median protein ratio of 0.067 in the 1:9 dilution
series, or an additional 345 proteins compared to MaxQuant (Figure 5A, top). In the more
representative infection-relevant host-pathogen 1:99 condition, the
hybrid spectral search quantified 329 proteins with a median protein
ratio of 0.012, while MaxQuant only quantified 202 proteins with a
0.03 median ratio (Figure 5A, middle). In the extremely challenging 1:999 dilution, more
than 5-fold the number of MaxQuant-quantified proteins was identified, although
with an increased number of outlying ratios and a median ratio of
0.041 (Figure 5A, bottom).
In general, in both the 1:99 and 1:999 dilution series, it can be observed
that quantified Salmonella proteins
are typically of higher abundance in the 1:1 dilution series (x-axis, log(B)), and the few protein quantification
outliers not in line with the anticipated dilution factor are relatively
low abundant in the 1:1 dilution series, thus likely representing
false targets. To tackle this issue, we tested two more stringent filtering
criteria for the 1:999 dilution series. First, requiring a 10-fold stricter
Percolator peptide Q-value of 0.1% delivers
47 quantified proteins at a median ratio of 0.051 (Figure 5B, top). Alternatively, requiring
at least two peptide quantifications per protein narrows down protein
quantifications to 24 proteins with a median ratio of 0.0016. Thus,
stricter filtering of DIA results here enables higher-confidence Salmonella protein quantifications matching the anticipated
dilution.
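The TOP3 quantification and ratio calculation described in this section can be illustrated with a minimal Python sketch. It assumes that peptides have already been assigned to proteins as proteotypic; the function names, the input layout (protein mapped to a list of peptide intensities), and the optional minimum-peptide filter are illustrative choices rather than the authors' code.

```python
from statistics import mean, median
from typing import Dict, List, Optional


def top3_intensity(peptide_intensities: List[float], min_peptides: int = 1) -> Optional[float]:
    """Average of the (up to) three most intense quantified peptides."""
    quantified = sorted((v for v in peptide_intensities if v > 0), reverse=True)
    if len(quantified) < min_peptides:
        return None  # too few peptide quantifications for this protein
    return mean(quantified[:3])


def mean_nonzero(values: List[float]) -> Optional[float]:
    """Average over non-zero replicate values, as applied to the DDA data."""
    nonzero = [v for v in values if v > 0]
    return mean(nonzero) if nonzero else None


def protein_ratios(
    sample_a: Dict[str, List[float]],
    sample_b: Dict[str, List[float]],
    min_peptides: int = 1,
) -> Dict[str, float]:
    """Ratio A/B of TOP3 protein intensities, with B the 1:1 reference."""
    ratios = {}
    for protein, peptides_b in sample_b.items():
        a = top3_intensity(sample_a.get(protein, []), min_peptides)
        b = top3_intensity(peptides_b, min_peptides)
        if a is not None and b is not None:
            ratios[protein] = a / b
    return ratios


def median_ratio(ratios: Dict[str, float]) -> Optional[float]:
    """Median protein ratio, or None when no protein was quantified."""
    return median(ratios.values()) if ratios else None
```

Setting `min_peptides=2` mimics the stricter filter that requires at least two peptide quantifications per protein, while the default of 1 retains the single-peptide quantifications included in the main analysis.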
Figure 5
Salmonella protein quantification
throughout Salmonella–human
proteome dilutions. (A) Log-transformed ratios (log(A/B), y-axis) of human (red) and Salmonella (blue) proteins with B being the protein intensity in the 1:1 dilution series mixture and A, the protein intensity in 1:9, 1:99, and 1:999 dilution
series (from top to bottom). MaxQuant DDA quantification (left) was
compared to EncyclopeDIA quantification searching the hybrid spectral
library (right). The number of plotted human and Salmonella proteins were indicated. Anticipated and median Salmonella protein ratios were plotted (black and blue dotted lines, respectively).
Protein quantification was performed by the average of the three most
abundant proteotypic peptides per protein (see Experimental
Procedures). (B) EncyclopeDIA protein quantification results
when applying a 0.1% Percolator peptide FDR filter (top) or when requiring
a minimum of two proteotypic peptide quantifications per protein.
We also manually inspected the more extreme cases
of quantifiable Salmonella peptides
in the 1:999+ dilutions in our
DIA workflow. As a representative example, the peptide ILADIAVFDK
(doubly charged) maintains similar MS2 peak
shapes throughout the 1:1, 1:9, and 1:99 and, to a lesser extent, the 1:999
and 1:9999 dilutions, while the intensity decreases according to the
dilution factor (Figure S3). However, at
increased dilutions, flanking noise peaks become evident and
dominate over genuine peptide peaks, similar to what is observable for
SpectraST-identified peptides at a 1:999 dilution in DDA (Figure S2B,C). Undoubtedly, low peptide abundance
represents a clear challenge for achieving correct identification
and quantification in the 1:999 and 1:9999 setups irrespective of
the acquisition mode used.
Discussion
MS acquisition of artificial human–Salmonella dual-proteome mixtures in the DIA mode significantly improved identification
and quantification of low-abundant Salmonella proteins compared to the DDA mode, increasing the discriminative
power of LC–MS/MS. Among the artificial mixtures (1:1 to 1:9999)
used in this study to effectuate comparative analysis between both
acquisition modes, the 1:99 dilution closely reflects the host/pathogen
ratio under actual infection conditions in the case of Salmonella,[23] where a
scarce amount of protein material comes from the bacterial pathogen,
making the simultaneous study of host and pathogen proteomes challenging.
Overall, the DIA mode improved detection and quantification of Salmonella peptides and proteins at
lower 1:99 and 1:999 dilutions (Figures and 5). Also, in
1:1 Salmonella–human proteome
mixtures, human and Salmonella peptide/protein
identification rates were significantly boosted when searching DDA-based
and/or hybrid spectral libraries compared to a conventional DDA database
search (Figure ).
Taken together, the increased identification of peptides with lower
abundance in DIA points to the stochastic nature of DDA. Notably,
to correct for DDA stochastic sampling, more advanced DDA precursor
selection algorithms have been developed such as MaxQuant.Live.[52] In addition, next to a conventional database
search, we performed a spectral search with SpectraST[40] on a library composed of Library A2 and Library B spectra.
Although not outperforming MaxQuant for human peptide identification,
it did improve peptide identification at a 1:999 dilution
(Figure S2). Another alternative is to
combine spectral library-based scoring features with conventional
scoring features as demonstrated by Prosit and MS2PIP-based
spectral libraries.[14,53] In contrast to a recent report,[11] offline fractionation of a Salmonella sample proved to increase Salmonella peptide identification, nearly doubling Salmonella peptide identification in the 1:1 Salmonella–human proteome mixture. This effect is likely attributable
to the dual-proteome complexity, rendering DDA unable to fully grasp Salmonella proteome complexity. Hence, in such multi-species
proteome analyses, comprehensive spectral libraries and pre-fractionation
of a species of interest can be advisable. However, this brings along
spectral library heterogeneity which may increase the risk of “false
targets” within the library,[48] especially
at lower dilutions where stricter filtering can be advisable (Figure 5B).
Besides assessing the performance of DDA-based spectral libraries,
we tested alternative DIA workflows that enable us to search both Salmonella and human proteomes. The library-free
EncyclopeDIA built-in PECAN algorithm (Walnut) delivered lower peptide
identifications, in-line with earlier observations.[19] In addition, we made use of MS2PIP-predicted
spectral libraries (Library B), predicting RT with DeepLC trained
on DIA identifications made by PECAN. Similar to recent studies reported
on the use of Prosit and Prism-predicted spectral libraries,[14,54] we achieved a performance comparable to that of searching DDA-based
spectral libraries (Figure 3, Library A2). Interestingly, both alternative approaches
allowed us to identify 18,736 novel peptides not found in DDA data.
Further inspection of non-DDA peptides identified by searching Library
B pointed to relatively lower-abundant peptides that might have been
missed in DDA precursor selection. Notably, the majority of these
peptides (85%) matched proteins identified by other peptides in DDA
(Figure S1). Combining the strengths of
both approaches, we generated a hybrid spectral library merging EncyclopeDIA
ELIB libraries from Library A2, B, and C searches, filtering for peptides
matching 7194 proteins with a protein FDR ≤ 1% as assessed
by MAYU[39] (Figure 4A–C). The resulting hybrid spectral
library identified ∼27,000 peptides in all 3 replicates of
concentrated (1:99+) human samples (Figure 4D), which is 1.35-fold higher than searching
Library A2 alone (∼20,000 peptides, Figure ), and 1.8-fold higher than a DDA analysis
(∼15,000 peptides, Figure ). As such, integrating DIA-based peptide identifications
from DDA-based spectral library searches with DIA-only approaches
resulted in a drastic increase in peptide identification. Hybrid spectral
libraries were very recently reported to improve proteome coverage.[19,31,46] Here, we showed that this greatly
facilitates dual-proteome profiling, as complex proteome mixtures
pose an enormous challenge to DDA, making DDA-only libraries far from
comprehensive. Taken together, making use of DIA-only approaches,
predicted spectral libraries, and perhaps publicly available DDA data
or DDA-based spectral libraries, DIA analyses no longer seem to depend
on sample-specific complementary DDA runs, although these can further
improve the performance, as also shown to be the case in our study.
Notably, other DIA-only analyses such as spectral deconvolution algorithms,
for example, DIA-Umpire,[22] could further
strengthen hybrid libraries. Besides EncyclopeDIA, other peptide-centric
DIA analysis software algorithms could be of value, such as the recently
developed DIA-NN algorithm that also supports library-free searches.[55] In this regard, the provided human–Salmonella datasets could serve as a valuable benchmarking
tool for wide-window DIA analysis. Another promising avenue would
be to run narrow-window detection DIA runs to generate a dual-proteome
chromatogram library as described previously by Searle et al.[19] Using predicted libraries to search narrow-window
DIA data greatly improves peptide detection and the empirical correction
for chromatogram library delivers highly performant libraries.[17,18] Hence, although DDA and library-free PECAN (Library A2 and C, respectively)
searches yield additional identifications and motivate the hybrid
library construction in our wide-window DIA analysis, these contributions
are expected to weaken and might possibly be negligible when making
use of narrow-window chromatogram search strategies. Nevertheless,
for DIA analysis not supported by additional detection-only chromatogram
libraries, hybrid library creation can serve as an alternative workflow
to augment DIA detection and quantification.
When judging protein
quantification, both MaxQuant (DDA) and EncyclopeDIA
(DIA) delivered Salmonella protein
quantifications in-line with dilution series up to 1:999 dilutions
(Figure 5A). In the
infection-relevant 1:99 sample, 329 proteins were quantified with
a relative median intensity ratio of 0.012 compared to 1:1 equal mixed
samples. Notably, MaxQuant delivers fewer quantified proteins (202, −39%)
but has overall better accuracy, which is clearer at a 1:999
ratio (Figure 5A).
However, applying a stricter FDR threshold or requiring a minimum of 2 peptide
quantifications per protein did improve the accuracy of DIA protein quantifications
(Figure 5B). Note that
we did not include 1:9999 protein quantifications as the few that
were found had inconsistent protein ratios. When judging, for instance,
the ILADIAVFDK/2+ precursor that was found across all dilutions (Figure S3), it is clear how peptide peaks at
a 1:9999 dilution become nearly impossible to distinguish from random
noise. However, the artificial mixtures do suggest that infection-relevant
host–pathogen conditions with an approximate 1:100 dilution
could be profiled in sufficient proteome depth without prior bacterial
enrichment, especially when increasing LC–MS/MS run time
by, for instance, offline pre-fractionation or the use of parallel
DIA runs with narrow m/z windows
in consecutive m/z ranges, as demonstrated
in the EncyclopeDIA workflow.[19]

In
our study, we compared DDA and DIA analysis workflows, which
is not straightforward given the evident differences between spectrum-centric
and peptide-centric searches. In addition, the data provided offer
an interesting and challenging benchmark case for label-free quantification
algorithms. For instance, artificial proteome mixes of human, yeast,
and Escherichia coli have been used
before for quantification by DIA algorithms.[51,56] In addition, a Plasmodium falciparum proteome was diluted to a 1:99 ratio in an uninfected human red
blood cell lysate for DIA quantification.[57] Here, we sampled more extreme dilutions, providing an interesting
challenge for state-of-the-art quantification algorithms. Moreover,
the spectral libraries created and provided online as Supporting Information will assist future DIA-based
research on Salmonella-infected epithelial
host cells.
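
As an illustration of how such a dilution benchmark might be scored, the following minimal Python sketch applies the stricter criterion discussed above (at least two quantified peptides per protein) and then computes the median intensity ratio of a diluted sample relative to the 1:1 reference. It is not the analysis code used in this study: the function names, the protein identifiers, and all intensity values are hypothetical, and the expected ratio is left as a user-supplied parameter because it depends on how the dilution series is defined.

```python
from math import log2
from statistics import median


def filter_proteins(peptide_counts, min_peptides=2):
    """Return the set of proteins with at least `min_peptides` quantified peptides."""
    return {prot for prot, n in peptide_counts.items() if n >= min_peptides}


def median_ratio(reference, diluted, proteins=None):
    """Median of diluted/reference intensity ratios over proteins found in both samples.

    `proteins` optionally restricts the comparison, e.g. to filtered proteins.
    """
    shared = set(reference) & set(diluted)
    if proteins is not None:
        shared &= set(proteins)
    ratios = [diluted[p] / reference[p] for p in shared if reference[p] > 0]
    if not ratios:
        raise ValueError("no shared proteins with a non-zero reference intensity")
    return median(ratios)


def log2_error(observed, expected):
    """Signed log2 deviation of an observed ratio from the expected ratio."""
    return log2(observed / expected)


# Hypothetical protein intensities (arbitrary units), for illustration only.
reference = {"P1": 1000.0, "P2": 2000.0, "P3": 500.0, "P4": 800.0}  # 1:1 mixture
diluted = {"P1": 12.0, "P2": 20.0, "P3": 50.0, "P4": 8.0}           # diluted sample
peptide_counts = {"P1": 3, "P2": 2, "P3": 1, "P4": 4}               # peptides per protein

kept = filter_proteins(peptide_counts, min_peptides=2)
ratio_all = median_ratio(reference, diluted)
ratio_filtered = median_ratio(reference, diluted, proteins=kept)
```

In this toy data set, the protein supported by a single peptide (P3) has an outlying ratio, so excluding it moves the median closer to the ratios of the remaining proteins. Passing the filtered median to `log2_error` together with the ratio expected for a given dilution would give one possible accuracy measure for comparing quantification algorithms.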