Lutz Fischer1, Juri Rappsilber1,2. 1. Wellcome Trust Centre for Cell Biology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, United Kingdom. 2. Chair of Bioanalytics, Institute of Biotechnology, Technische Universität Berlin, 13355 Berlin, Germany.
Abstract
Cross-linking/mass spectrometry is an increasingly popular approach to obtain structural information on proteins and their complexes in solution. However, methods for error assessment are still under development. We note that false-discovery rates can be estimated at different points during data analysis, and are most relevant for residue or protein pairs. Missing this point led, in our example analysis, to an actual error of 8.4% when 5% was targeted. In addition, prefiltering of peptide-spectrum matches and of identified peptide pairs substantially improved results. In our example, this prefiltering increased the number of residue pairs (5% FDR) by 33% (n = 108 to n = 144). This increase did not come at the expense of accuracy, as the added data agreed with an available crystal structure. We provide an open-source tool, xiFDR ( https://github.com/rappsilberlab/xiFDR ), that implements our observations for routine application. Data are available via ProteomeXchange with identifier PXD004749.
Cross-linking/mass spectrometry (CLMS) is emerging as a valuable tool to investigate protein structures,
protein complexes, and protein–protein interactions.[1−4] Like any method relying on measurement as well as interpretation,
CLMS has some level of error. One popular method in proteomics to
assess the expected error among reported results is the false discovery
rate (FDR) by the target-decoy approach.[5] A decoy database is generated, typically by inverting all target
sequences. This decoy database should not contain any peptide sequences
that are in the analyzed sample. Any match to this database is therefore
a false positive. Under the assumption that random identifications
fall with equal probability into the target and decoy section of the
database, the distribution of decoy hits reveals the distribution
of random target hits and allows the reporting of results with defined
FDR.
For CLMS, FDR estimation is complicated by the fact that every
match is a composite of two peptides, each with its own probability
to be false. Previously, FDR estimation of cross-links was addressed
by either inverting all possible cross-linked peptide pairs,[6] which does not model cases that have one correctly identified
peptide and one incorrectly identified peptide, or by using a decoy
(i.e., wrong mass) cross-linker.[6] While
the decoy cross-linker allows for one peptide being right and one
being wrong as well as both peptides being wrong, it does not provide
an easy way to model both cases separately. To model this, FDR calculations
have to take into account a set of two interdependent problems. For
the false identification of a single peptide, only a linear random
space needs to be considered; for two peptides, this needs to be extended
to a quadratic random space, as each peptide could come from either the
target or the decoy database. MS2-cleavable cross-linkers[7−11] may allow circumvention of a cross-linking specific FDR, at least
in part. The cross-link is cleaved in MS2, separating the two peptides
that can then be identified individually in MS3. As linear peptides
are being identified, standard proteomic peptide FDR estimation has
been applied,[12] possibly falling short
in considering errors from joining up peptides. Nevertheless, their
data can also be assessed jointly as cross-links within a spectrum.[13,14] A formalism for FDR estimation of cross-links has recently been
proposed.[15] However, some questions remain
open such as how to handle directionality of the cross-linker or what
levels to consider: peptide-spectrum matches (PSMs), peptide pairs,
or residue pairs.
Here we share our considerations regarding FDR estimation in CLMS,
based on the target-decoy approach. The FDR approach was tested using
a data set of RNA Polymerase II (Pol II) cross-linked with Bis[sulfosuccinimidyl]
suberate (BS3).[16] Our data was compared
against an available crystal structure of Pol II,[17] which served as a mass spectrometry-independent evaluation
of our FDR approach. We highlight the importance of considering the
different information levels, PSMs, peptide pairs, and residue pairs,
and how their relationship can be exploited productively.
Experimental Section
Dataset
The data set has been described previously[16] and was reprocessed here. In short, purified
RNA polymerase II (Pol II) from Saccharomyces cerevisiae was cross-linked with BS3. Cross-linked complexes were then digested
with trypsin and analyzed by LC-MS/MS. Mass spectrometric data was
acquired using a “high–high” strategy, meaning
both MS1 and MS2 spectra were acquired with high resolution (R = 100000 and R = 7500, respectively).
Data Processing
Mass spectrometric raw files were processed
into peak lists using MaxQuant version 1.2.2.5[18] with default parameters, except that “Top
MS/MS peaks per 100 Da” was set to 100. Peak lists were searched
against a target-decoy database of all Pol II proteins (Rpb1 to Rpb12,
4565 residues) and their decoy equivalents obtained by sequence inversion[18] using Xi[19] (http://github.com/Rappsilber-Laboratory/XiSearch) for identification of cross-linked peptides. Search parameters
were MS accuracy, 6 ppm; MS/MS accuracy, 20 ppm; enzyme, trypsin;
specificity, fully tryptic; allowed number of missed cleavages, four;
cross-linker, BS3; fixed modifications, carbamidomethylation on cysteine;
variable modifications, oxidation on methionine, hydrolyzed, amidated,
and loop-linked versions of BS3. The linkage specificity for BS3 was
assumed to be at lysine, serine, threonine, tyrosine, and protein
N-termini. The data have been deposited to the ProteomeXchange[20] Consortium via the PRIDE[21] partner repository with the data set identifier PXD004749.
Comparison to Crystal Structure
As a mass spectrometry-independent
assessment of identification success, the residue distance of identified
cross-linked residue pairs was measured in an available crystal structure
of Pol II (PDB|1WCM).[17] CLMS and X-ray
crystallography do not necessarily return identical results as CLMS
investigates proteins in solution where conformational flexibility
is likely much higher than in crystallized form. However, for our
data set, a good agreement of the two methods has been reported.[16] To compare decoy matches with the crystal structure,
the linked residue in the decoy was assigned the position of the same
residue in the forward sequence.
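As a minimal sketch of this distance check, the Python snippet below computes a Cα–Cα distance and maps a decoy position back to the forward sequence. The coordinates and helper names are hypothetical, and the mapping assumes that the decoy is a plain reversal of the target sequence, so that 1-based position i corresponds to position length + 1 − i; none of this is taken from the original analysis. The 27 Å limit is the approximate lysine–lysine value for BS3 quoted in the Results.

```python
import math

# Hypothetical Cα coordinates in Å, keyed by (chain, residue number).
# Real values would be read from the ATOM records of the PDB entry.
CA_COORDS = {
    ("A", 10): (0.0, 0.0, 0.0),
    ("A", 25): (3.0, 4.0, 12.0),
    ("B", 7): (30.0, 40.0, 0.0),
}

# Approximate upper limit for lysine-lysine links with BS3 (from the text).
MAX_CA_DISTANCE = 27.0


def decoy_to_forward_position(decoy_position, sequence_length):
    """Map a 1-based position in a reversed (decoy) sequence to the
    position of the same residue in the forward sequence.
    Assumption: the decoy is a plain reversal of the target."""
    return sequence_length + 1 - decoy_position


def ca_distance(site_a, site_b, coords=CA_COORDS):
    """Euclidean distance between the Cα atoms of two residues."""
    return math.dist(coords[site_a], coords[site_b])


def is_feasible(site_a, site_b, limit=MAX_CA_DISTANCE):
    """True if the residue pair lies within the cross-linkable distance."""
    return ca_distance(site_a, site_b) <= limit
```

With these toy coordinates, the pair ("A", 10)–("A", 25) falls within the limit, whereas ("A", 10)–("B", 7) would be counted as a long-distance link.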
xiFDR Software
All FDR calculations were done with
xiFDR. We provide xiFDR, an open-source program (https://github.com/lutzfischer/xiFDR), for researchers to analyze the results of their preferred cross-link
search engine. The input of xiFDR is either an mzIdentML file or a
table of PSMs (Table S1). The output is
either an amended mzIdentML file or a set of tables containing PSMs,
peptide pairs, residue pairs, and protein pairs that pass the requested
FDR thresholds. It supports two modes of operation for cross-links:
directional and nondirectional. Directional here refers to matches
where the spectrum of A being cross-linked to B would be significantly
different from that of B being cross-linked to A, and nondirectional refers
to cross-linking methods where there is practically no distinction
between A cross-linked to B and B cross-linked to A. The formula for directional
cross-links is

$$\mathrm{FDR} = \frac{TD - DD}{TT}$$

with TT being the number of target–target
matches, DD being the number of decoy–decoy matches, and TD
the number of target–decoy and decoy–target matches.
For nondirectional cross-links the formula is

$$\mathrm{FDR} = \frac{TD + DD\,\left(1 - TD_{DB}/DD_{DB}\right)}{TT}$$

with TD_DB being the number of
all possible unique target–decoy and decoy–target entries and DD_DB the number of all possible unique decoy–decoy entries.
The difference here is in how decoy–decoy matches model the false
peptide–false peptide matches among the target–target
matches. A detailed derivation of both formulas and their impact is
described in the Supporting Information (text and Figure S1). Both formulas converge quickly (at 200 linkable
entities the deviation is <1%, Supporting Information, Figure S2 and supplemental discussion). Both formulas are applicable
at PSM, peptide pair, residue pair, and protein pair level. However,
how directionality would look for residue and protein pairs is currently
unclear.
The calculated FDRs are reported with an attached resolution.
The resolution is defined here as the difference between the next
higher computable FDR and the next lower FDR. This is exemplified
in Supporting Information, Figure S3. While
not providing an actual accuracy it gives an indication of the range
into which the actual FDR might fall. xiFDR is described in more detail
as part of the Supporting Information.
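The calculations described in this section can be sketched in Python as follows. This is an illustration under stated assumptions and not the xiFDR implementation: the directional estimate is taken as (TD − DD)/TT; the nondirectional variant assumes that false–false matches distribute over target–decoy and decoy–decoy space in proportion to the number of possible entries (td_db and dd_db); and the resolution is computed according to one possible reading of the definition above, as the gap between the nearest computable FDR values below and above the requested FDR. All function names are hypothetical.

```python
def fdr_directional(tt, td, dd):
    """FDR estimate for directional cross-links (assumed form)."""
    return (td - dd) / tt


def fdr_nondirectional(tt, td, dd, td_db, dd_db):
    """FDR estimate for nondirectional cross-links (assumed form).
    td_db / dd_db: numbers of possible unique target-decoy and
    decoy-decoy entries in the search space."""
    return (td + dd * (1 - td_db / dd_db)) / tt


def computable_fdrs(labels_by_score):
    """FDR at every score cutoff, for labels ('TT', 'TD', 'DD')
    sorted from best to worst score. Cutoffs without any
    target-target match are skipped."""
    tt = td = dd = 0
    values = []
    for label in labels_by_score:
        if label == "TT":
            tt += 1
        elif label == "TD":
            td += 1
        else:
            dd += 1
        if tt > 0:
            values.append(max(0.0, fdr_directional(tt, td, dd)))
    return values


def fdr_resolution(labels_by_score, target_fdr):
    """Gap between the next higher and next lower computable FDR
    around target_fdr; None if either neighbour does not exist."""
    values = sorted(set(computable_fdrs(labels_by_score)))
    lower = max((v for v in values if v <= target_fdr), default=None)
    higher = min((v for v in values if v > target_fdr), default=None)
    if lower is None or higher is None:
        return None
    return higher - lower
```

For a small result list, the computable FDR values are coarse, so the resolution is large; it shrinks as more matches become available.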
Results and Discussion
Database searches of mass spectrometry data in proteomics return
peptide-spectrum matches (PSMs). Consequently, one may want to assess
the error made in this process and FDR calculations for PSMs have
been validated extensively for linear peptides based on a number of
tests.[22−24] However, for protein cross-linking, there are three
additional information levels. PSMs aggregate to peptide pairs, these
then aggregate to linked residue pairs, which in turn aggregate to
protein pairs. To assess if FDR estimation at the different information
levels is actually valid we used a crystal structure as “ground
truth”. We compared our search results for data of an RNA polymerase
II analysis[16] filtered to 50% FDR at different
levels (PSMs, peptide pairs, residue pairs) to the crystal structure
of Pol II (PDB|1WCM),[17] measuring the distance
of residue pairs that were identified as being cross-linked. If the
distance of a cross-linked residue pair is feasible the identification
is possibly right. If not, it is likely wrong. When looking at the
distance histogram of target and decoy matches, the distribution of
target and decoy matches should be distinct for the cross-linkable
distance with more targets than decoys (Figure 1a). This indicates that there are actually
true identifications among the target matches. On the other hand,
for long, structurally unfeasible, distances the curves should overlay.
Most of the identifications of residue pairs that are long distance
are structurally unfeasible and, hence, likely false positives, which
decoys are supposed to model. Indeed, we found that the decoy distributions
match the long-distance part of the target distribution for each observed
level of information: PSMs (Figure 1b), peptide pairs (Figure 1c), and residue pairs (Figure 1d). Decoys (always false) and long-distance
links (mostly false) agree for PSMs, peptide pairs, and residue pairs.
Consequently, FDRs of PSMs, peptide pairs, and residue pairs can be
obtained by target-decoy searches.
Figure 1
Validation of FDR on different levels
by crystal structure. (A)
Schematic distance-histograms showing the expected overlap of false
positive and decoys and resulting overlap of overlength cross-links
with decoy cross-links. (B) Residue-pair distance-histogram based
on identified PSMs for a PSM FDR of 50%. (C) Residue-pair distance-histogram
based on identified peptide pairs for a peptide-pair FDR cutoff of
50%, calculated at the level of peptide pairs. (D) Residue pair distance-histogram
for a residue-pair FDR of 50%. All distances are Cα–Cα
distances of the identified residue pairs in a crystal structure of
Pol II (PDB|1WCM).
In a cross-linking experiment,
the information of interest lies
with the cross-linked residue pairs and the cross-linked protein pairs.
Restricting FDR analysis to PSMs or peptide pairs leads to a problem:
A defined FDR for PSMs or peptide pairs gives an unpredictable and
typically larger FDR at the level of residue pairs or protein pairs
(Figure 2). For our
RNA polymerase II analysis, 5% FDR at the level of PSMs leads to 5.8%
FDR at the level of peptide pairs and 8.4% at the level of residue
pairs. While we can also look at protein pairs, and the trend seems
to persist, the limited number of possible protein pairs in Pol II does not
permit statistically meaningful results. At no FDR is the
PSM FDR a good guide for the accuracy of information at the level
of residue pairs. Likewise, peptide-pair FDR is not a good guide for the
situation at residue-pair level. Consequently, the error should be
estimated for the information that is of actual interest, that is,
linked residue and protein pairs. Similar arguments have been made
for protein identification:[25] correct matches
tend to aggregate when combining PSMs to peptides and peptides to
proteins. In contrast, false matches tend to stay alone. False matches
are random and have a low probability to fall by chance into the same
protein. Therefore, the proportion of false results increases when
combining results.
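The effect of aggregation on the FDR can be illustrated with a toy calculation. The sketch below uses made-up data (20 target PSMs that cluster into five peptide pairs and four residue pairs, plus one isolated target–decoy PSM) and assumes the simple estimate (TD − DD)/TT at every level; the function names and identifiers are hypothetical and the numbers are not from the Pol II data set.

```python
from collections import Counter


def estimate_fdr(labels):
    """(TD - DD) / TT over a collection of 'TT' / 'TD' / 'DD' labels.
    Returns 0.0 when there are no target-target entries (sketch only)."""
    counts = Counter(labels)
    if counts["TT"] == 0:
        return 0.0
    return max(0.0, (counts["TD"] - counts["DD"]) / counts["TT"])


def fdr_per_level(psms):
    """psms: list of (peptide_pair, residue_pair, label) tuples.
    Unique peptide pairs and residue pairs are counted once each."""
    peptide_pairs = {pep: label for pep, _, label in psms}
    residue_pairs = {res: label for _, res, label in psms}
    return {
        "psm": estimate_fdr(label for _, _, label in psms),
        "peptide_pair": estimate_fdr(peptide_pairs.values()),
        "residue_pair": estimate_fdr(residue_pairs.values()),
    }


# Toy data: true matches cluster, the false match stays alone.
PEPTIDE_TO_RESIDUE = {
    "pep0": "res0", "pep1": "res1", "pep2": "res2",
    "pep3": "res3", "pep4": "res3",
}
TOY_PSMS = [
    (f"pep{i % 5}", PEPTIDE_TO_RESIDUE[f"pep{i % 5}"], "TT")
    for i in range(20)
]
TOY_PSMS.append(("pep_false", "res_false", "TD"))

LEVELS = fdr_per_level(TOY_PSMS)
```

In this toy case the same single decoy hit corresponds to 5% at PSM level (1/20), 20% at peptide-pair level (1/5), and 25% at residue-pair level (1/4), because the targets collapse on aggregation while the decoy does not.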
Figure 2
FDR propagation from PSMs to peptide pairs and residue
pairs. (A)
Actual peptide-pair FDR (solid gray) and residue-pair FDR (solid black)
in dependence of PSM FDR (dashed gray line) for a cross-link data
set of RNA Pol II.[16] The protein-pair FDR
is plotted as a trend only, due to data sparseness. (B) Exemplification
of the error propagation, in form of wrong identifications, from PSMs
to peptide pairs and residue pairs. Correctly identified PSMs (true
positives = green) tend to cluster, for example, several correctly
identified PSMs support the same unique peptide pair and correctly
identified peptide pairs in turn support one residue pair. Incorrectly
identified PSMs (false positives = red) are random and do not cluster
to the same extent.
Given that residue-pair
FDRs should and can be calculated, the question remains
of how to treat PSMs and peptide pairs. One could ignore
their error and leave error estimation to the level of residue pairs
entirely. Instead, we restrict the number of false PSMs and peptide
pairs by applying an FDR threshold at their respective level as a prefilter.
Importantly, the way one handles PSMs and peptide pairs actually influences
the number of residue pairs passing a given FDR threshold. For example,
aiming for 5% FDR on residue pairs in our data we observe 108 hits
if only applying the cutoff at residue level, compared to 144 hits
if we apply 6% FDR cutoff at PSM level, and a 10.5% FDR cutoff at
peptide-pair level (Figure 3). Prefiltering on PSMs and peptide pairs added 36 (33%)
residue pairs without affecting their FDR. To test if our FDR is still
reflecting the likely accuracy of cross-links reported in our analysis,
we compared the initial as well as the number-improved set of cross-links
with an available crystal structure of Pol II (PDB|1WCM). Of the additional
36 residue pairs, 33 showed a distance in the crystal structure that
matched the possible cross-link length (∼27 Å for lysine–lysine
links with BS3[16]). In addition, two of
the three remaining residue pairs involve the very flexible N-terminal
loop-region of Rpb1, offering an explanation for seeing these cross-links
despite residues being distant in the crystal structure. In conclusion,
prefiltering added 35 plausible residue pairs (33%) at the expense
of adding one implausible one. Prefiltering therefore appears to be
a valid way of improving search sensitivity without compromising search
accuracy.
Figure 3
Increased search sensitivity by prefiltering. (A) The number of
identified residue pairs (at 5% FDR, z-axis) depends
on the FDR-threshold applied to PSMs (x-axis) and
peptide pairs (y-axis). (B) Optimal FDR thresholds
on PSMs and peptide pairs (left) return more cross-links (at 5% FDR)
than not applying prefilters (right). (C) Distance distribution of
the residue pairs (5% residue-pair FDR). The prefiltering does increase
the number of cross-links but does not lead to a notable increase
in long distance links (see text for a more detailed discussion).
The success of prefiltering by
applying FDR thresholds at lower
levels in improving search sensitivity depends on combining multiple
PSMs to support a peptide pair and multiple peptide pairs to support
a residue pair. We are not aware of a way to predict best filter settings,
or in fact if different filter settings at lower information levels
would always be beneficial. We, therefore, suggest exploring this
numerically by software. We supply such software here, xiFDR (see Experimental Section). Note that this tool uses
a CSV file or mzIdentML[26] version 1.2 (submitted)
as input and is therefore independent of the search software. xiFDR
reports the FDR interval (Supporting Information, Figure S3).
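A numerical exploration of prefilter settings could look like the following Python sketch. It is not the xiFDR algorithm: it assumes the estimate (TD − DD)/TT, scores a peptide pair or residue pair by the sum of its supporting scores (an arbitrary choice made for illustration), and simply tries every combination of thresholds from a small grid. The data, identifiers, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def filter_by_fdr(items, max_fdr):
    """items: (score, key, label) tuples. Keep the largest top-scoring
    set whose estimated FDR, (TD - DD) / TT, is at most max_fdr."""
    ranked = sorted(items, key=lambda item: item[0], reverse=True)
    tt = td = dd = 0
    keep = 0
    for index, (_, _, label) in enumerate(ranked, start=1):
        if label == "TT":
            tt += 1
        elif label == "TD":
            td += 1
        else:
            dd += 1
        if tt > 0 and (td - dd) / tt <= max_fdr:
            keep = index
    return ranked[:keep]


def aggregate(items, parent_of):
    """Merge items into their parent entity; the parent score is the
    sum of the supporting scores (assumption for this sketch)."""
    scores = defaultdict(float)
    labels = {}
    for score, key, label in items:
        parent = parent_of(key)
        scores[parent] += score
        labels[parent] = label
    return [(scores[p], p, labels[p]) for p in scores]


def residue_pairs_at(psms, pep_to_res, psm_fdr, pep_fdr, res_fdr=0.05):
    """Number of target residue pairs passing res_fdr after prefiltering
    PSMs at psm_fdr and peptide pairs at pep_fdr."""
    psms_kept = filter_by_fdr(psms, psm_fdr)
    pep_pairs = filter_by_fdr(aggregate(psms_kept, lambda key: key), pep_fdr)
    res_pairs = filter_by_fdr(
        aggregate(pep_pairs, pep_to_res.__getitem__), res_fdr
    )
    return sum(1 for _, _, label in res_pairs if label == "TT")


def best_prefilter(psms, pep_to_res, grid):
    """Try all threshold combinations; return (count, psm_fdr, pep_fdr)."""
    return max(
        ((residue_pairs_at(psms, pep_to_res, a, b), a, b)
         for a, b in product(grid, repeat=2)),
        key=lambda result: result[0],
    )


# Toy PSMs as (score, peptide_pair, label); float("inf") means "no prefilter".
TOY_PSMS = [(10.0, "p1", "TT"), (9.0, "p2", "TT"), (8.0, "p3", "TT")]
TOY_PSMS += [(3.0, "d1", "TD")] * 4
PEP_TO_RES = {"p1": "r1", "p2": "r2", "p3": "r3", "d1": "rd1"}
GRID = [0.01, 0.05, 0.1, float("inf")]

WITHOUT_PREFILTER = residue_pairs_at(
    TOY_PSMS, PEP_TO_RES, float("inf"), float("inf")
)
BEST = best_prefilter(TOY_PSMS, PEP_TO_RES, GRID)
```

In this deliberately extreme toy case, four weak decoy PSMs add up to the best-scoring residue pair and block every target at 5% residue-pair FDR unless they are removed by the PSM-level prefilter. Real data will behave less drastically, and the best thresholds have to be found per data set.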
Conclusion
Current FDR approaches in cross-linking/mass spectrometry stop
at the PSM or peptide-pair level, often failing to specify which one
was actually used. Consequently, the information of interest, links
between sites (residue pairs) or proteins (protein pairs), is reported
with an unknown and typically higher (potentially much higher) error.
Our data indicate that our FDR approach can be extended to assess
the error on residue-pair level and presumably also protein-pair level.
As contributions to finding the most sensitive but also fair report
of identified links we propose to prefilter on PSMs and peptide pairs,
and to report FDR together with the interval of uncertainty resulting
from limited data. FDR estimation played an important role in consolidating
proteomics and it has a similar role to play for cross-linking/mass
spectrometry.