Swantje Lenz1, Sven H Giese1, Lutz Fischer2, Juri Rappsilber1,2. 1. Bioanalytics, Institute of Biotechnology , Technische Universität Berlin , 13355 Berlin , Germany. 2. Wellcome Centre for Cell Biology , University of Edinburgh , Edinburgh EH9 3BF , United Kingdom.
Abstract
Cross-linking/mass spectrometry has undergone a maturation process akin to standard proteomics by adapting key methods such as false discovery rate control and quantification. A poorly evaluated search setting in proteomics is the consideration of multiple (lighter) alternative values for the monoisotopic precursor mass to compensate for possible misassignments of the monoisotopic peak. Here, we show that monoisotopic peak assignment is a major weakness of current data handling approaches in cross-linking. Cross-linked peptides often have high precursor masses, which reduces the presence of the monoisotopic peak in the isotope envelope. Paired with generally low peak intensity, this generates a challenge that may not be completely solvable by precursor mass assignment routines. We therefore took an alternative route by '"in-search assignment of the monoisotopic peak" in the cross-link database search tool Xi (Xi-MPA), which considers multiple precursor masses during database search. We compare and evaluate the performance of established preprocessing workflows that partly correct the monoisotopic peak and Xi-MPA on three publicly available data sets. Xi-MPA always delivered the highest number of identifications with ∼2 to 4-fold increase of PSMs without compromising identification accuracy as determined by FDR estimation and comparison to crystallographic models.
Cross-linking/mass spectrometry has undergone a maturation process akin to standard proteomics by adapting key methods such as false discovery rate control and quantification. A poorly evaluated search setting in proteomics is the consideration of multiple (lighter) alternative values for the monoisotopic precursor mass to compensate for possible misassignments of the monoisotopic peak. Here, we show that monoisotopic peak assignment is a major weakness of current data handling approaches in cross-linking. Cross-linked peptides often have high precursor masses, which reduces the presence of the monoisotopic peak in the isotope envelope. Paired with generally low peak intensity, this generates a challenge that may not be completely solvable by precursor mass assignment routines. We therefore took an alternative route by '"in-search assignment of the monoisotopic peak" in the cross-link database search tool Xi (Xi-MPA), which considers multiple precursor masses during database search. We compare and evaluate the performance of established preprocessing workflows that partly correct the monoisotopic peak and Xi-MPA on three publicly available data sets. Xi-MPA always delivered the highest number of identifications with ∼2 to 4-fold increase of PSMs without compromising identification accuracy as determined by FDR estimation and comparison to crystallographic models.
Entities:
Keywords:
BS3; SDA; cross-linking; data processing; mass spectrometry; monoisotopic mass; peptides; proteomics; software; structural proteomics
Several approaches
have been utilized to increase the numbers of
identified cross-links, for example enriching for cross-linked peptides,[1−4] using different proteases[1,5,6] or optimizing fragmentation methods.[7,8] In parallel
with experimental developments, data analysis has also progressed
to extract even more cross-links to be used as distance restraints
for modeling of proteins and their complexes.[9,10] Search
software has been designed for the identification of cross-linked
peptides, for example Kojak,[11] xQuest,[12] pLink,[13] XlinkX,[14] or Xi.[5] In addition,
cross-linking workflows can make use of preprocessing methods to improve
data quality and reduce file sizes,[15] as
well as postprocessing methods to filter out false identifications[11,16] and custom-tailored false discovery rate (FDR) estimation.[17−19] Preprocessing can improve peptide identification by correcting the
MS1 precursor ion m/z and simplifying
MS2 fragment spectra. Established proteomics software perform such
preprocessing, including MaxQuant[20,21] and OpenMS.[22,23] For example, MaxQuant performs a variety of preprocessing steps:
it corrects the precursor m/z by
an intensity-weighted average if a suitable peptide feature is found,
reassigns the monoisotopic peak and contains options for intensity
filtering of MS2 peaks. Despite such correction of the precursor mass,
many linear search engines have integrated the possibility of considering
multiple monoisotopic peaks during search.[24−26] However, the
benefits of this search feature are currently unclear. It seems that
the assignment of monoisotopic mass for tryptic peptides is already
achieved adequately either during acquisition or as part of preprocessing.Cross-linked peptides have characteristics that may render MS1
monoisotopic precursor mass assignment as used for linear peptides
nonoptimal: high-charge states, large masses, and low abundances.
Several cross-link search engines include MS1 correction in their
pipeline: pLink corrects monoisotopic peaks based on previous work
with linear peptides,[27] however does not
include a parameter for searching multiple precursor masses. Kojak
averages precursor ion signals of neighboring scans to create a composite
spectrum and infer the true monoisotopic mass of the precursor. If
this step fails, precursor masses up to −2 Da lighter are searched.[11] For previous searches in Xi, MaxQuant was used
to perform preprocessing. Neither xQuest nor XLinkX describe precursor
correction in their workflow documentation and there is no option
for additionally searched masses available in the respective search
parameters. We are not aware of a detailed evaluation of the impact
of different preprocessing techniques for cross-link identification,
independent of the search software. Correcting the monoisotopic mass
of precursors, although acknowledged as an issue,[11,28] awaits systematic evaluation.In this study, we show that
errors in assigning monoisotopic peaks
during data acquisition are frequent for cross-linked peptides because
of their size and generally low abundance. This adversely affects
their identification. We show that standard software suites, MaxQuant
and OpenMS correct monoisotopic precursor masses of cross-linked peptides
with variable success. We then implement an option in Xi to consider
multiple precursor masses during search, to minimize the impact of
false monoisotopic precursor mass assignment on the identification
of cross-links.
Methods
Data Sets
In this
study, we used three publicly available
data sets (Table ).
The three data sets were chosen to reflect a range of applications
of cross-linking mass spectrometry as well as a range of data complexity:
the first data set is Human Serum Albumin (HSA) cross-linked with
succinimidyl 4,4-azipentanoate (SDA) and fragmented using five different
methods (PXD003737).[29] The second data
set is a pooled pseudocomplex sample with seven separately cross-linked
proteins with bis(sulfosuccinimidyl) suberate (BS3) (PXD006131).[7] This data set includes data from four different
fragmentation methods. The third data set is the most complex sample,
composed of 15 size exclusion chromatography fractions of Chaetomium thermophilum lysate cross-linked with BS3 and
fragmented only with HCD (PXD006626).[30] The first and last size exclusion fractions were used to optimize
the search parameters for this data set. All samples were analyzed
on an Orbitrap Fusion Lumos Tribrid mass spectrometer (Thermo Fisher
Scientific, San Jose, CA) using Xcalibur (version 2.0 and 2.1).
Table 1
Overview of Datasets Used
data set
sample
database sizea
reference
1
HSA
1
Giese et al. 2016
2
pseudocomplex
7
Kolbowski et al. 2017
3
C. thermophilum
198–400b
Kastritis et al. 2017
Database size refers to the number
of proteins in the database.
Database size refers to the number
of proteins in the database.Multiple size exclusion chromatography
fractions (n = 15).
Preprocessing
Raw files were preprocessed independently
using MaxQuant (1.5.5.30), OpenMS (2.0.1) and the ProteoWizard[31] tool msconvert (3.0.9576) for comparison. Scripts
automating the preprocessing, search and evaluation were written in
Python (2.7).The essential steps during the preprocessing can
be divided into two parts: (1) correction of the m/z or charge of the precursor peak for MS2 spectra
and (2) denoising of MS2 spectra. MaxQuant and OpenMS both try to
correct the precursor information via additional feature finding steps,
i.e. identifying a peptide feature from the retention time, m/z and intensity domain of the LC-MS run.
Additionally, denoising of the MS2 spectra is performed by simply
filtering the most intense peaks in defined m/z windows. The preprocessing is by default enabled in MaxQuant
and was run using the partial processing option (steps 1–5)
with default settings except for inactivated “deisotoping”
and “top peaks per 100 Da”, which was set to 20. The
OpenMS preprocessing workflow includes centroiding, feature finding,[32] precursor correction (mass and charge) using
the identified features and MS2 denoising as described above (Supporting Information (SI) Figure S1). Msconvert
was used to convert the raw files to mgf files without any correction.
These peak files were denoted as “uncorrected” and used
as our baseline to quantify improvements in the subsequent database
search. For the “in-search assignment of the monoisotopic peak”
in Xi (Xi-MPA), we used msconvert to convert raw files to mgf files
and included a MS2 peak filter for the 20 most intense peaks in a
100 m/z window.
Data Analysis
Peak files were searched separately in
Xi (1.6.731) with the following settings: MS accuracy 3 ppm, MS/MS
accuracy 10 ppm, oxidation of methionine as variable modification,
tryptic digestion, two missed cleavages. For samples cross-linked
with SDA, linkage sites were allowed on lysine, serine, tyrosine,
threonine, and protein n-terminus on one end and all amino acids on
the other end of the cross-linker. Variable modifications were monolink
SDA (110.048 Da), SDA loop-links (82.0419 Da), SDA hydrolyzed (100.0524
Da), SDA oxidized (98.0368 Da)[31] as well
as carbamidomethylation on cysteine. For searches with BS3, linkage
sites were lysine, serine, threonine, tyrosine, and the protein n-terminus.
Carbamidomethylation on cysteine was set as fixed modification. Allowed
variable modifications of the cross-linker were aminated BS3 (155.0946
Da), hydrolyzed BS3 (156.0786 Da) and loop-linked BS3 (138.0681 Da).
For collision-induced dissociation (CID) and beam-type CID, also referred
to as higher-energy C-trap dissociation (HCD), b- and y-ions were
searched for, whereas for electron transfer dissociation (ETD) c-
and z-ions were allowed. For ETciD and EThcD, b-, c-, z-, and y-ions
were allowed. The HSA and pseudocomplex data sets were searched against
the known proteins in the sample. For each protein fraction of the C. thermophilum data set, the databases of the original
publication were used, where a database was created for each fraction
by taking the most abundant proteins (iBAQ value above 106). For searches employing Xi-MPA, the parameter "missing_isotope_peaks"
was set to the respective mass range searched. Data sets 1 and
2 were searched with a reversed decoy database, whereas data set 3
was searched with a shuffled decoy database due to palindromic sequences.
For the reversed decoy database, lysines and arginines were swapped
with the preceding amino acid before peptide generation.[17,20]For cross-linking, there are different information levels:
PSMs, peptide pairs, residue pairs (links) and protein pairs. The
false discovery rate (FDR) can be calculated on each one of these
levels and should be reported for the level at which the information
is given.[12] The FDR was calculated as described
in Fischer et al.[17] using xiFDR (1.0.14.34)
according to the following equation: . A 5% PSM level cutoff was imposed.
The
setting “uniquePSMs” was enabled and the FDR was calculated
separately on self-and between links. Minimal peptide length was set
to 6. In data set 2, identified cross-linked residues were mapped
to the crystal structure of the respective protein and the Euclidian
distance between the alpha-carbons was calculated. Structures were
downloaded from the PDB (IDs: 1AO6, 5GKN, 2CRK, 3NBS, 1OVT, 2FRJ).
Kojak (1.5.5) was run via the Trans-Proteomic Pipeline (5.1.0)[33] using default settings except: MS1 resolution
120 000, BS3 allowed on lysine, serine, threonine, tyrosine, and the
protein n-terminus, aminated BS3 (155.0946 Da) as variable modification
of the cross-linker, 3 ppm mass tolerance on MS1 level. For the uncorrected
search, the isotope error was set to 0 and precursor refinement was
disabled. PSMs were validated using PeptideProphet[34] and FDR calculated as described above on the resulting
probability.The mass spectrometry data have been deposited
to the ProteomeXchange
Consortium via the PRIDE[35] partner repository
with the data set identifier PXD011121. For transparency, python scripts
are available on GitHub under: https://github.com/Rappsilber-Laboratory/Xi-MPA_scripts.
Results and Discussion
We evaluated the impact on cross-link
identification in Xi of changing
the precursor monoisotopic mass that was initially assigned during
data acquisition (“uncorrected”). In this analysis,
MaxQuant and OpenMS were used as preprocessing tools. We used three
different data sets that differ in complexity and fragmentation regimes.
To measure the improvements from using the preprocessing tools, a
simple conversion from raw files to mgf format was done with msconvert
and used as a baseline. Note that in the spectrum header, there are
two m/z values: the trigger mass
of the MS2 and the assigned monoisotopic peak of the isotope cluster.
Msconvert extracts the assigned monoisotopic mass. Processed data
were searched separately in Xi and evaluated on PSM level or link
(= unique residue pair) level, with a 5% FDR. Finally, the newly implemented
in-search assignment of monoisotopic peaks in Xi was compared to the
elaborate preprocessing pipelines in OpenMS and MaxQuant.
Preprocessing
Increases the Number of Cross-Link PSMs by Finding
the Correct Monoisotopic Peak
The data sets were preprocessed
in MaxQuant and OpenMS and numbers of identified PSMs were compared
to those obtained using uncorrected data. Data sets 1 (HSA) and 2
(pseudocomplex) were acquired with different acquisition methods.
For comparability to data set 3 (complex mixture), we focused on the
HCD acquired data. Cross-links between proteins were excluded, either
because they were experimentally not possible (data set 2) or observed
in too low numbers for reliable FDR calculation (data set 3).For uncorrected data, 672, 354, and 2157 cross-link PSMs resulted
for the HSA data set (data set 1), pseudocomplex (data set 2), and
first and last fractions of C. thermophilum respectively
(data set 3). Both preprocessing approaches improved numbers of identified
PSMs for all data sets: Preprocessing in MaxQuant led to 1127 (68%
increase), 966 (173% increase), and 2966 (38% increase), while for
OpenMS, 1044 (55% increase), 598 (69% increase) and 2394 (11% increase)
PSMs were identified (Figure A).
Figure 1
Correction of the monoisotopic peak is crucial in cross-link identification.
(A) The data sets were preprocessed using MaxQuant and OpenMS, leading
to more identified PSMs in all cases. Fold changes from uncorrected
data (msconvert conversion of Xcalibur data) were calculated for each
file separately and the mean plotted. Error bars represent the standard
error of the mean between different acquisitions (HSA: n = 3, pseudocomplex: n = 3, C. thermophilum: n = 8). (B) The majority of additional identifications
after preprocessing are due to correction of the precursor mass to
lighter monoisotopic masses. Spectra that are unique to MaxQuant preprocessed
searches of HCD acquisitions from data set 2 were evaluated in terms
of precursor correction. The main proportion of the gain was corrected
to lighter masses of up to −3 Da, while charge state correction
or correction to heavier masses rarely occurred. (C) Isotope cluster
of a corrected precursor of m/z 992.71
(z = 5, m = 4958.6 Da) was solely
identified in MaxQuant preprocessed results. In OpenMS preprocessed
and uncorrected data, the wrong monoisotopic mass was selected for
unknown reasons.
Correction of the monoisotopic peak is crucial in cross-link identification.
(A) The data sets were preprocessed using MaxQuant and OpenMS, leading
to more identified PSMs in all cases. Fold changes from uncorrected
data (msconvert conversion of Xcalibur data) were calculated for each
file separately and the mean plotted. Error bars represent the standard
error of the mean between different acquisitions (HSA: n = 3, pseudocomplex: n = 3, C. thermophilum: n = 8). (B) The majority of additional identifications
after preprocessing are due to correction of the precursor mass to
lighter monoisotopic masses. Spectra that are unique to MaxQuant preprocessed
searches of HCD acquisitions from data set 2 were evaluated in terms
of precursor correction. The main proportion of the gain was corrected
to lighter masses of up to −3 Da, while charge state correction
or correction to heavier masses rarely occurred. (C) Isotope cluster
of a corrected precursor of m/z 992.71
(z = 5, m = 4958.6 Da) was solely
identified in MaxQuant preprocessed results. In OpenMS preprocessed
and uncorrected data, the wrong monoisotopic mass was selected for
unknown reasons.We assessed the gains
in identified PSMs of preprocessed data compared
to uncorrected data (focusing on data set 2) regarding three forms
of precursor correction: (1) correction of the monoisotopic mass,
(2) charge state correction, and (3) small corrections of the m/z value based on averaging the m/z values across the peptide feature (Figure B). Precursor mass
and charge state of spectra identified solely in MaxQuant-Xi were
compared to their counterparts when searching uncorrected data in
Xi. Of the 756 newly identified spectra, 686 (91%) had a different
monoisotopic precursor mass. Precursors were primarily corrected to
lighter masses by MaxQuant, that is, the monoisotopic peak correction
by −1 (208 spectra), −2 (215 spectra), −3 (149
spectra), and −4 Da (62 spectra). Greater shifts (−5
to −7 Da) only occurred 30 times, and corrections to heavier
masses were observed 22 times. Only 30 spectra (4%) were corrected
in their charge state. For the 60 spectra (8%) without correction
in charge state or monoisotopic peak, we only identified nine spectra
that had a higher error than 3 ppm before preprocessing, indicating
a small correction of the initial precursor m/z (by averaging of peptide feature peaks). The main proportion
of these identifications is likely a product of noise removal in MS2
spectra or small changes in the score distribution. Similarly, for
OpenMS-Xi, the monoisotopic peak correction had the greatest impact:
Of the 314 spectra that OpenMS added over uncorrected data, 139 were
precursor corrected by −1 Da and 108 to −2 Da. In contrast
to MaxQuant, corrections to −3 or lighter were not observed,
which might explain the higher number of identifications obtained
with MaxQuant-Xi.For data set 1 and 3, the gains of preprocessing
are smaller than
for data set 2. The median peptide mass of data set 1 (3368 Da) is
smaller than the median mass of data set 2 (3946 Da) and we later
show that this is a major factor in precursor mass assignment. This
reflects in the distribution of lighter masses assigned: 46%, 33%,
and 21% were corrected by −1 Da, −2 Da, and to even
lighter peaks, respectively. Data set 3 was acquired with a different
version of Xcalibur, for which we saw better mass assignment than
for earlier versions (data not shown). As an implication of this,
already 67% of the lighter corrected masses are shifted by −1
Da, 21% by −2 Da, 12% to even lighter masses. However, we were
not able to follow up on this to our content, since the source code
of the vendor software is not available.In summary, preprocessing,
especially monoisotopic peak correction,
leads to a notable increase in identifications. Using the 3-dimensional
peptide feature is advantageous compared to on-the-fly detection of
the monoisotopic peak. If the preceding MS1 spectrum was acquired
during the beginning (or end) of the elution profile of a peptide,
the intensity will be low. Thus, the monoisotopic peak might not even
be detectable at the time of fragmentation. For large (cross-linked)
peptides, this effect might be exacerbated by the monoisotopic peak
usually being less intense than other isotope peaks. Therefore, using
the additional information from the retention time domain will be
beneficial. The same feature information can also be used to determine
or validate the assigned charge state of the precursor. However, the
instrument software almost always assigned the same charge state as
MaxQuant or OpenMS. Thus, the main advantage for identifying cross-linked
peptides arises from monoisotopic peak correction.Interestingly,
OpenMS and MaxQuant did not always agree on or find
the same monoisotopic peak (Figure C). Of the total MaxQuant-corrected spectra with a
different monoisotopic mass, 81% were not corrected and 6% corrected
differently with OpenMS. Vice versa, 15% of the monoisotopic peaks
corrected by OpenMS were not corrected by MaxQuant and 25% were corrected
differently. Both MaxQuant and OpenMS have their own implementations
for precursor correction - therefore, there might be instances where
MaxQuant is able to find a corresponding peptide feature where OpenMS
does not and vice versa. Although OpenMS did not lead to the same
improvements in the number of identifications as MaxQuant, it did
correct some precursors that the latter did not. We therefore suspect
that there are also precursors with a falsely assigned monoisotopic
peak that were corrected with neither algorithm. Furthermore, 3-dimensional
detection of peptide features is challenging for low intensity peptides.
In conclusion, there likely remain falsely assigned monoisotopic peaks
in the data, ultimately leading to missed or false identifications.
In-Search Monoisotopic Peak Assignment Increases the Number
of Identifications
We observed multiple cases where MaxQuant
and OpenMS disagreed in their monoisotopic peak choice, indicating
that the problem of monoisotopic peak assignment (MPA) cannot be solved
easily at MS1 level. Indeed, we found instances where the monoisotopic
peak is simply not distinguishable from noise, so a feature-based
correction would not be feasible. Nevertheless, the associated MS2
spectra could be matched to a cross-linked peptide when considering
multiple different monoisotopic masses during search. This shows that
the extra information on obtaining a peptide-spectrum match is advantageous
to MPA over considering MS1 information alone. Therefore, we implemented
a monoisotopic peak assignment in Xi: for each MS2 spectrum, multiple
precursor masses are considered during a single search and the highest
scoring peptide-pair assigns the precursor mass. Note that this is
different from simply searching with a wide mass error for MS1. The
mass accuracy of MS1 is minimally compromised as multiple candidates
for the monoisotopic mass are taken and considered with the original
mass accuracy of the measurement.To find a good trade-off between
increased search space and sensitivity, we tested different mass range
settings on the data sets. For data set 2 (HCD subset), the number
of PSMs increased with ranges up to −5 Da on the considered
monoisotopic masses (Figure A). However, the increase in identifications from −4
to −5 Da was only 3% and considering the increase in search
time, we continued with a maximal correction to −4 Da as the
optimal setting for this data set. Xi-MPA yielded 1508 PSMs, which
is a 326% increase compared to searching uncorrected data and a 56%
increase compared to MaxQuant-Xi. Similar improvements are observed
for the other fragmentation methods in this data set (SI Figure S2). Additionally, we corrected up
to −7 Da to test if a large increase in search space increases
random spectra matches as measured by the target-decoy approach. The
number of identifications at 5% FDR decreased only slightly compared
to −5 Da (−1%), but still led to more identifications
than up to −4 Da (3%). In the HSA data set, Xi-MPA with up
to −4 Da increased the number of identified PSMs by 170% compared
to uncorrected data (Figure S3).
Figure 2
In-search monoisotopic
peak assignment outperforms preprocessing.
(A) Performance of Xi-MPA on data set 2. HCD data from the pseudocomplex
data set were searched assuming different ranges of missing monoisotopic
peaks. With increased ranges, the number of identified PSMs also increases.
(B) Performance of Xi-MPA on the complete C. thermophilum data set. All 15 fractions were searched with the original preprocessed
data as well as with Xi-MPA. (C) Overlap of identified residue pairs
of MaxQuant-Xi and OpenMS-Xi to residue pairs gained from Xi-MPA (data
set 2). Numbers in brackets are the proportion of decoys in the respective
regions.
In-search monoisotopic
peak assignment outperforms preprocessing.
(A) Performance of Xi-MPA on data set 2. HCD data from the pseudocomplex
data set were searched assuming different ranges of missing monoisotopic
peaks. With increased ranges, the number of identified PSMs also increases.
(B) Performance of Xi-MPA on the complete C. thermophilum data set. All 15 fractions were searched with the original preprocessed
data as well as with Xi-MPA. (C) Overlap of identified residue pairs
of MaxQuant-Xi and OpenMS-Xi to residue pairs gained from Xi-MPA (data
set 2). Numbers in brackets are the proportion of decoys in the respective
regions.As a final evaluation of in-search
monoisotopic peak assignment,
we searched the complete data set of C. thermophilum. We used 0 to −3 Da as the range of Xi-MPA, since an initial
analysis of the first and last fraction of the C. thermophilum data set returned a similar number of identifications when running
Xi-MPA up to −4 Da or −3 Da (Figure
S4). As a comparison, we took the original peak files obtained
from PRIDE. The FDR was calculated separately on self-and between
links, enabled boosting (automatic prefiltering on PSM and peptide
pair level[17]), with a minimum of three
fragments per peptide and a minimal delta score of 0.5. For the original
peak files, which were preprocessed in MaxQuant, we identified 3848
PSMs, 2594 peptide pairs and 1653 cross-links, with a 5% FDR on each
respective level (Figure B). Xi-MPA resulted in 4952 PSMs (29% increase), 3566 peptide
pairs (37% increase), and 2273 cross-links (38% increase).Next,
we looked into the complementarity of search results with
the different approaches, using data set 2 at 5% link-FDR. Preprocessing
via MaxQuant and OpenMS led to 172 and 158 links, respectively, while
Xi-MPA resulted in 243 links. While the overlap between links of OpenMS-Xi
and MaxQuant-Xi is only 50%, Xi-MPA identifications cover 76% of both
searches (Figure C).
Nineteen and 23 links are uniquely found in MaxQuant and OpenMS preprocessed
data, respectively. However, there are five decoy links as well in
each unique set (resulting in a link-FDR of 26% and 22%). For Xi-MPA,
there are 75 unique target links with 12% link-FDR.Identification-based
monoisotopic peak assignment as employed by
Xi-MPA results in more identifications than the feature-based assignment
algorithms of OpenMS and MaxQuant. Neither OpenMS nor MaxQuant correct
all precursor masses that are incorrectly assigned during data acquisition.
In Xi-MPA, spectra are searched with multiple monoisotopic masses,
thereby relying less on the MS1 information. The quality of the precursor
isotope cluster does not contribute to the decision of monoisotopic
mass and spectra for which correction failed will be identifiable.
One could hypothesize that increasing the search space by considering
multiple masses will lead to more false positives, thereby reducing
the number of true identifications. This is not the case, as we match
substantially more PSMs at constant FDR by considering alternative
monoisotopic masses. As a second plausible caveat, this approach increases
the search time. However, the use of relatively cheap computational
time appears balanced by the notable increase in identified cross-links.
The optimal range of additional monoisotopic peaks to search will
however be dependent on complexity and quality of MS1 acquisition
and the instrument software. To reduce the mass range considered in
Xi-MPA, we developed a MS1 level-based approach. For each precursor,
we search lighter isotope peaks in MS1 and use this to narrow the
search space (explained in detail in the SI). This led to an average of 24% less values to be considered, while
only reducing the number of identifications by 3%. We hope that our
observation of the monoisotopic peak detection challenge in cross-linking
together with our publicly available data sets will lead to further
improvements in monoisotopic peak-assignment algorithms in the future,
possibly tailored to cross-link data.The cross-link search
engine Kojak employs a precursor correction
in its pipeline.[11] As we could not find
a detailed evaluation of the impact of precursor correction in Kojak,
we searched the HCD data of the pseudocomplex data set without correction
as well as with their default correction settings. We focused on FDR
10% data as there were too few identifications for a reliable calculation
of FDR 5%. Just 171 cross-linked PSMs passed for the unprocessed data,
whereas for the default search, 1088 PSMs passed (536% increase).
Of those, 862 (79%) were corrected in their monoisotopic precursor
peak. These results support our observations with Xi.
In-Search Monoisotopic
Peak Assignment Does Not Compromise Search
Accuracy
Changing the search could lead to several problems.
We already excluded that the increased search space leads to high-scoring
decoy matches that in turn reduce the number of identifications at
a given FDR cutoff. As an additional validation, we assessed our results
against known PDB structures using the HCD data from the pseudocomplex
data set (data set 2), at 5% link-FDR. Assuming a crystal structure
is correct, a cross-link can be unexpectedly long either because the
link is false or because of in-solution structural dynamics. If, however,
the proportion of long-distance links in results of two approaches
is identical, then at least the two results have equal quality.We first tested the results of all three approaches against crystal
structures. Residue pairs were mapped to PDB structures and the distance
between the two alpha-carbons was calculated (see Methods). Thirty Å was set as the maximal distance for
BS3, links with a greater distance were classified as long-distance.
In this evaluation, we excluded the protein C3B because its flexible
regions make it unsuitable for this analysis. For MaxQuant and OpenMS
preprocessed results, 11.8% and 6.1% long-distance cross-links were
identified, respectively. In Xi-MPA, 8.1% long distance cross-links
were identified (Figure A). Of the links uniquely identified through Xi-MPA, only 5.3% were
long distance links. Therefore, Xi-MPA as such does not lead to an
enrichment in long-distance cross-links. However, it could be that
mass-corrected precursors tend to have a higher proportion of long-distance
links. We therefore split the Xi-MPA results into five groups corresponding
to the monoisotopic mass change (0, −1, −2, −3,
−4 Da) and looked at their match to crystal structures. If
a link originated from PSMs with different mass corrections, all of
those were considered. We conducted a “nonparametric ANOVA”
(Kruskal–Wallis test) to detect any significant changes in
the distance distributions of Xi-MPA identifications with different
shifts and decoy distribution. However, we fail to reject the null
hypothesis at the predetermined significance level of α = 0.05
(p-value: 0.13), indicating that the distance distributions
for all subsets are similar. This matches the visual inspection of
distance distributions (Figure B). Furthermore, all individual distance distributions were
significantly smaller than the derived reference distribution (one-sided
Wilcoxon test, see SI Table S5). In conclusion,
we do not see any evidence of in-search monoisotopic mass assignment
leading to increased conflicts with crystal structures. We then evaluated
the effect of in-search monoisotopic mass assignment on PSM quality
as assessed by the search score. First, we compared the scores of
PSMs with a mass shift (Xi-MPA identifications) to the scores of the
same spectrum without a mass shift (uncorrected data). While scores
with shifted mass have a median of 6.7, the median score is 2.3 when
using the uncorrected masses (Figure C). As one would expect from an increased search space,
the scores of decoy hits also improve, albeit only marginally. We
find that the score difference of target PSMs is significantly larger
than of decoy PSMs (one-sided Wilcoxon test, p-value:
<2.2 × 10–16). We then turned to a “decoy
mass search” for which we not only searched the range from
0 Da to −4 Da, but also +1 Da to +4 Da. Assuming the monoisotopic
peak in the uncorrected data is rarely lighter than the true monoisotopic
peak, the new identifications should score like decoy identifications.
Indeed, the resulting score distributions for targets with a positive
mass shift follow the decoy distribution (Figure D). In contrast, identifications with a negative
shift are distributed like the identifications without mass shift.
In conclusion, in-search monoisotopic mass change leads to significantly
improved scores with a distribution that resembles that of precursors
that did not see a mass change (0 Da). Importantly, these improvements
are not random since an equally large search space increase (+1 Da
to +4 Da) results in a completely different score distribution that
resembles the decoy distribution but not the distribution of identifications
without a mass shift.
Figure 3
Matches with and without in-search mass shift show similar
quality
metrics. (A) Evaluation of Xi-MPA derived links on crystal structures
(data set 2). Distances between α carbon atoms of identified
cross-linked residues in the crystal structure of the proteins are
shown in light gray while a reference distribution of all possible
pairwise C-alpha distances of cross-linkable residues is shown in
dark gray. Thirty Å is set as a limit, above which links are
defined as long distance. (B) Distance distribution of identifications
with different mass corrections. There was no significant difference
between the different mass shifts, while all had a significant difference
to the decoy distribution. (C) PSM scores of spectra identified with
a mass shift are significantly higher than the corresponding score
in uncorrected data. Shown are the score distributions of uncorrected
and Xi-MPA results, as well as the corresponding decoy distribution.
(D) Score distribution of PSM matches of the “decoy mass search”.
Identifications with a positive mass shift generally follow the decoy
distribution (note that there are correct identifications with a positive
mass shift, albeit few, see Figure B) while identifications with a negative shift resemble
unshifted identifications. The scores of negative-shifted PSMs are
significantly higher than those of positive-shifted PSMs (one-sided
Wilcoxon test, p-value: <2.2× 10–16).
Matches with and without in-search mass shift show similar
quality
metrics. (A) Evaluation of Xi-MPA derived links on crystal structures
(data set 2). Distances between α carbon atoms of identified
cross-linked residues in the crystal structure of the proteins are
shown in light gray while a reference distribution of all possible
pairwise C-alpha distances of cross-linkable residues is shown in
dark gray. Thirty Å is set as a limit, above which links are
defined as long distance. (B) Distance distribution of identifications
with different mass corrections. There was no significant difference
between the different mass shifts, while all had a significant difference
to the decoy distribution. (C) PSM scores of spectra identified with
a mass shift are significantly higher than the corresponding score
in uncorrected data. Shown are the score distributions of uncorrected
and Xi-MPA results, as well as the corresponding decoy distribution.
(D) Score distribution of PSM matches of the “decoy mass search”.
Identifications with a positive mass shift generally follow the decoy
distribution (note that there are correct identifications with a positive
mass shift, albeit few, see Figure B) while identifications with a negative shift resemble
unshifted identifications. The scores of negative-shifted PSMs are
significantly higher than those of positive-shifted PSMs (one-sided
Wilcoxon test, p-value: <2.2× 10–16).
Heavy and Low Intensity
Peptides Are Corrected More Frequently
One would especially
expect to observe shifted mass assignment
for peptides of high mass and low abundance. For large peptides (approximately
>2000 Da), the monoisotopic peak will not be the most intense peak
in the isotope cluster. If the peptide is of low abundance, the monoisotopic
peak may be of too low intensity to be detected. We therefore analyzed
the monoisotopic peak assignment in Xi-MPA regarding the precursor
mass and intensity. Indeed, precursors with higher masses are more
often corrected to lighter monoisotopic peaks (Figure A). While the median precursor mass for uncorrected
matches is 2952 Da, for matches corrected by −2 Da it is already
4062 Da and for −4 Da it is 4684 Da. Of the identifications
with a mass above 3000 Da, 88% were identified with a lighter mass.
For precursors lighter than 3000 Da, the proportion was 42%. Like
for mass dependency, there is a trend toward larger correction ranges
for lower intensity peptides (SI Figure S5). However, this is less strong than it is for precursor mass.
Figure 4
Correction
is dependent on precursor mass and intensity. (A) Box
plot of the precursor mass and monoisotopic mass correction of identified
PSMs after Xi-MPA. PSMs with higher mass more often require monoisotopic
mass correction to lighter masses. Whiskers show the 5 and 95% quantiles
of the data. Asterisks denote the significance calculated by a one
tailed t test (****: p-value: <0.0001).
(B) Precursors of cross-linked PSMs identified in all three approaches,
MaxQuant-Xi, OpenMS-Xi, and Xi-MPA (“common”), are more
intense than precursors of PSMs that are only identified in Xi-MPA.
In other words, successful correction happens more often for abundant
precursors, whereas Xi-MPA identifies precursors of lower intensity.
(C) MS1 isotope cluster of a cross-linked peptide. The monoisotopic
peak of m/z 758.16 (z = 4, m = 3028.6 Da) was falsely assigned during
acquisition and not corrected in any preprocessing approach. Xi-MPA
identifies a PSM for a precursor with a mass that is 3 Da lighter.
Correction
is dependent on precursor mass and intensity. (A) Box
plot of the precursor mass and monoisotopic mass correction of identified
PSMs after Xi-MPA. PSMs with higher mass more often require monoisotopic
mass correction to lighter masses. Whiskers show the 5 and 95% quantiles
of the data. Asterisks denote the significance calculated by a one
tailed t test (****: p-value: <0.0001).
(B) Precursors of cross-linked PSMs identified in all three approaches,
MaxQuant-Xi, OpenMS-Xi, and Xi-MPA (“common”), are more
intense than precursors of PSMs that are only identified in Xi-MPA.
In other words, successful correction happens more often for abundant
precursors, whereas Xi-MPA identifies precursors of lower intensity.
(C) MS1 isotope cluster of a cross-linked peptide. The monoisotopic
peak of m/z 758.16 (z = 4, m = 3028.6 Da) was falsely assigned during
acquisition and not corrected in any preprocessing approach. Xi-MPA
identifies a PSM for a precursor with a mass that is 3 Da lighter.When evaluating the newly matched
precursors of Xi-MPA, the advantage
of not having to rely on MS1 identification is evident. Matches not
made through any of the preprocessing methods are generally much less
intense (Figure B)
and larger (SI Figure S6) than matches
that are common to all approaches. Manual analysis of isotope clusters
of corrected precursors from data set 2 revealed many cases where
the monoisotopic peak was present in the MS1 spectrum but was not
recognized during acquisition. For some, this might be due to the
peak being of low intensity and discarded as noise, or because of
other interfering peaks (Figure C). However, there are also cases where the cluster
is well resolved (Figure C). Without details of how the instrument software determines
the monoisotopic peak, a full evaluation is difficult. For a complete
list of precursor m/z for Xi-MPA
identifications and corresponding m/z of uncorrected, MaxQuant and OpenMS data, see SI Table S1–S3.Note that in many acquisition
methods, the machine only fragments
peaks where it can successfully identify a full isotope cluster. Therefore,
there might be instances of cross-linked peptides not being fragmented
because of insufficient isotopic cluster quality, leading to lost
identifications.
Conclusion
The size and low abundance
of cross-linked peptides leads to frequent
misassignment of the monoisotopic mass by instrument software, which
in some instances even escapes correction by sophisticated correction
approaches employed by MaxQuant and OpenMS. Considering multiple monoisotopic
masses during search increases the number of cross-link PSMs 1.8–4.2-fold,
without compromising search accuracy as judged by multiple assessment
strategies including comparison of the gains against solved protein
structures. The problem of wrongly assigned monoisotopic peaks will
have an impact on most cross-link search engines since these all rely
in some part on the precursor mass. The extent of the misassignment
will however be sample and software-dependent. Even with improved
acquisition or correction software, there will remain instances where
the monoisotopic peak cannot be determined correctly before searching
due to low intensity. Our search-assisted monoisotopic peak assignment
provides a general solution to this problem by relying on MS2 identification
in addition to precursor information.
Authors: Bernhard Y Renard; Marc Kirchner; Flavio Monigatti; Alexander R Ivanov; Juri Rappsilber; Dominic Winter; Judith A J Steen; Fred A Hamprecht; Hanno Steen Journal: Proteomics Date: 2009-11 Impact factor: 3.984
Authors: Eric W Deutsch; Luis Mendoza; David Shteynberg; Joseph Slagel; Zhi Sun; Robert L Moritz Journal: Proteomics Clin Appl Date: 2015-04-02 Impact factor: 3.494
Authors: Hendrik Weisser; Sven Nahnsen; Jonas Grossmann; Lars Nilse; Andreas Quandt; Hendrik Brauer; Marc Sturm; Erhan Kenar; Oliver Kohlbacher; Ruedi Aebersold; Lars Malmström Journal: J Proteome Res Date: 2013-02-22 Impact factor: 4.466
Authors: Alexander Leitner; Roland Reischl; Thomas Walzthoeni; Franz Herzog; Stefan Bohn; Friedrich Förster; Ruedi Aebersold Journal: Mol Cell Proteomics Date: 2012-01-27 Impact factor: 5.911
Authors: Juan Antonio Vizcaíno; Attila Csordas; Noemi del-Toro; José A Dianes; Johannes Griss; Ilias Lavidas; Gerhard Mayer; Yasset Perez-Riverol; Florian Reisinger; Tobias Ternent; Qing-Wei Xu; Rui Wang; Henning Hermjakob Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971
Authors: J Eduardo Fajardo; Rojan Shrestha; Nelson Gil; Adam Belsom; Silvia N Crivelli; Cezary Czaplewski; Krzysztof Fidelis; Sergei Grudinin; Mikhail Karasikov; Agnieszka S Karczyńska; Andriy Kryshtafovych; Alexander Leitner; Adam Liwo; Emilia A Lubecka; Bohdan Monastyrskyy; Guillaume Pagès; Juri Rappsilber; Adam K Sieradzan; Celina Sikorska; Esben Trabjerg; Andras Fiser Journal: Proteins Date: 2019-10-07
Authors: Maria Alba Abad; Tanmay Gupta; Michael A Hadders; Amanda Meppelink; J Pepijn Wopken; Elizabeth Blackburn; Juan Zou; Anjitha Gireesh; Lana Buzuk; David A Kelly; Toni McHugh; Juri Rappsilber; Susanne M A Lens; A Arockia Jeyaprakash Journal: J Cell Biol Date: 2022-07-01 Impact factor: 8.077
Authors: Alex A Makarov; Juan Zou; Douglas R Houston; Christos Spanos; Alexandra S Solovyova; Cristina Cardenal-Peralta; Juri Rappsilber; Eric C Schirmer Journal: Nat Commun Date: 2019-07-11 Impact factor: 14.919
Authors: Frank Bürmann; Byung-Gil Lee; Thane Than; Ludwig Sinn; Francis J O'Reilly; Stanislau Yatskevich; Juri Rappsilber; Bin Hu; Kim Nasmyth; Jan Löwe Journal: Nat Struct Mol Biol Date: 2019-03-04 Impact factor: 15.369
Authors: Swantje Lenz; Ludwig R Sinn; Francis J O'Reilly; Lutz Fischer; Fritz Wegner; Juri Rappsilber Journal: Nat Commun Date: 2021-06-11 Impact factor: 14.919