
Quirks of Error Estimation in Cross-Linking/Mass Spectrometry.

Abstract

Cross-linking/mass spectrometry is an increasingly popular approach to obtain structural information on proteins and their complexes in solution. However, methods for error assessment are still under development. We note that false-discovery rates can be estimated at different points during data analysis, and are most relevant for residue or protein pairs. Missing this point led, in our example analysis, to an actual error of 8.4% when 5% was targeted. In addition, prefiltering of peptide-spectrum matches and of identified peptide pairs substantially improved results. In our example, this prefiltering increased the number of residue pairs (5% FDR) by 33% (n = 108 to n = 144). This gain in numbers did not come at the expense of accuracy, as the added data agreed with an available crystal structure. We provide an open-source tool, xiFDR ( https://github.com/rappsilberlab/xiFDR ), that implements our observations for routine application. Data are available via ProteomeXchange with identifier PXD004749.


Year: 2017 PMID: 28267312 PMCID: PMC5423704 DOI: 10.1021/acs.analchem.6b03745

Source DB: PubMed Journal: Anal Chem ISSN: 0003-2700 Impact factor: 6.986

Cross-linking/mass spectrometry (CLMS) is emerging as a valuable tool to investigate protein structures, protein complexes, and protein–protein interactions.[1−4] Like any method relying on measurement as well as interpretation, CLMS has some level of error. One popular method in proteomics to assess the expected error among reported results is the false discovery rate (FDR) by the target-decoy approach.[5] A decoy database is generated, typically by inverting all target sequences. This decoy database should not contain any peptide sequences that are in the analyzed sample. Any match to this database is therefore a false positive. Under the assumption that random identifications fall with equal probability into the target and decoy section of the database, the distribution of decoy hits reveals the distribution of random target hits and allows the reporting of results with defined FDR.

For CLMS, the FDR estimation is complicated by the fact that every match is a composite of two peptides, each with its own probability to be false. Previously, FDR estimation of cross-links was addressed by either inverting all possible cross-linked peptide pairs,[6] which does not model cases that have one correctly identified peptide and one incorrectly identified peptide, or by using a decoy (i.e., wrong mass) cross-linker.[6] While the decoy cross-linker permits one peptide to be right and one to be wrong as well as both peptides being wrong, it does not provide an easy way to model both cases separately. To model this, FDR calculations have to take into account a set of two interdependent problems. For the false identification of a single peptide, only a linear random space needs to be considered. For two peptides, this needs to be extended to a quadratic random space, as each peptide could come from either the target or the decoy database.

MS2-cleavable cross-linkers[7−11] may allow circumvention of a cross-linking specific FDR, at least in part. The cross-link is cleaved in MS2, separating the two peptides that can then be identified individually in MS3. As linear peptides are being identified, standard proteomic peptide FDR estimation has been applied,[12] possibly falling short in considering errors from joining up peptides. Nevertheless, their data can also be assessed jointly as cross-links within a spectrum.[13,14]

A formalism for FDR estimation of cross-links has recently been proposed.[15] However, some questions remain open, such as how to handle directionality of the cross-linker or what levels to consider: peptide-spectrum matches (PSMs), peptide pairs, or residue pairs. Here we share our considerations regarding FDR estimation in CLMS, based on the target-decoy approach. The FDR approach was tested using a data set of RNA Polymerase II (Pol II) cross-linked with Bis[sulfosuccinimidyl] suberate (BS3).[16] Our data was compared against an available crystal structure of Pol II,[17] which served as a mass spectrometry-independent evaluation of our FDR approach. We highlight the importance of considering the different information levels, PSMs, peptide pairs, and residue pairs, and how their relationship can be exploited productively.
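The target-decoy idea and the quadratic search space described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of any search engine: the function names `make_decoy` and `pair_space` are hypothetical, and the nondirectional count assumes that a peptide may also be paired with itself.

```python
def make_decoy(sequence: str) -> str:
    """Return the inverted (reversed) sequence used as a decoy entry."""
    return sequence[::-1]


def pair_space(n_target: int, n_decoy: int, directional: bool = False) -> dict:
    """Count the possible peptide pairs per class in a target-decoy search.

    TT = target-target, TD = target-decoy plus decoy-target, DD = decoy-decoy.
    Assumption: self-pairs (a peptide linked to itself) are allowed.
    """
    if directional:
        # A-B and B-A are distinct entries.
        return {
            "TT": n_target * n_target,
            "TD": 2 * n_target * n_decoy,
            "DD": n_decoy * n_decoy,
        }
    # A-B and B-A are the same entry, so only unordered pairs are counted.
    return {
        "TT": n_target * (n_target + 1) // 2,
        "TD": n_target * n_decoy,
        "DD": n_decoy * (n_decoy + 1) // 2,
    }


if __name__ == "__main__":
    print(make_decoy("MKTAY"))
    print(pair_space(200, 200))
```

The counts grow with the square of the number of peptides, which is why a single-peptide (linear) decoy model is not sufficient for cross-linked peptide pairs.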

Experimental Section

Dataset

The data set has been described previously[16] and was reprocessed here. In short, purified RNA polymerase II (Pol II) from Saccharomyces cerevisiae was cross-linked with BS3. Cross-linked complexes were then digested with trypsin and analyzed by LC-MS/MS. Mass spectrometric data was acquired using a “high–high” strategy, meaning both MS1 and MS2 spectra were acquired with high resolution (R = 100000 and R = 7500, respectively).

Data Processing

Mass spectrometric raw files were processed into peak lists using MaxQuant version 1.2.2.5[18] with default parameters, except that “Top MS/MS peaks per 100 Da” was set to 100. Peak lists were searched against a target-decoy database of all Pol II proteins (Rpb1 to Rpb12, 4565 residues) and their decoy equivalents obtained by sequence inversion[18] using Xi[19] (http://github.com/Rappsilber-Laboratory/XiSearch) for identification of cross-linked peptides. Search parameters were MS accuracy, 6 ppm; MS/MS accuracy, 20 ppm; enzyme, trypsin; specificity, fully tryptic; allowed number of missed cleavages, four; cross-linker, BS3; fixed modifications, carbamidomethylation on cysteine; variable modifications, oxidation on methionine, hydrolyzed, amidated, and loop-linked versions of BS3. The linkage specificity for BS3 was assumed to be at lysine, serine, threonine, tyrosine, and protein N-termini. The data have been deposited to the ProteomeXchange[20] Consortium via the PRIDE[21] partner repository with the data set identifier PXD004749.

Comparison to Crystal Structure

As a mass spectrometry-independent assessment of identification success, the residue distance of identified cross-linked residue pairs was measured in an available crystal structure of Pol II (PDB|1WCM).[17] CLMS and X-ray crystallography do not necessarily return identical results as CLMS investigates proteins in solution where conformational flexibility is likely much higher than in crystallized form. However, for our data set, a good agreement of the two methods has been reported.[16] To compare decoy matches with the crystal structure, the linked residue in the decoy was assigned the position of the same residue in the forward sequence.
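The distance check described above could look roughly like the following sketch. It assumes that Cα coordinates have already been extracted from the structure file, that the decoy is an exact reversal of the target sequence (so that decoy position i corresponds to forward position L − i + 1), and that the approximate 27 Å limit quoted later in the text for lysine–lysine links with BS3 is used as the cutoff. All function names are hypothetical.

```python
import math

# Approximate C-alpha to C-alpha limit for Lys-Lys links with BS3 (angstrom),
# taken from the value quoted in the text; treat it as an assumption.
BS3_MAX_CA_DISTANCE = 27.0


def ca_distance(coord_a, coord_b):
    """Euclidean distance between two C-alpha coordinates (x, y, z)."""
    return math.dist(coord_a, coord_b)


def decoy_to_forward_position(decoy_position, protein_length):
    """Map a 1-based position in an inverted sequence to the forward sequence.

    Assumption: the decoy is the exact reversal of the target sequence.
    """
    return protein_length - decoy_position + 1


def is_feasible(distance, limit=BS3_MAX_CA_DISTANCE):
    """True if the residue pair is within the cross-linkable distance."""
    return distance <= limit


if __name__ == "__main__":
    d = ca_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
    print(d, is_feasible(d))
    print(decoy_to_forward_position(1, 10))
```

Mapping decoy positions back to the forward sequence lets decoy matches be placed in the same distance histogram as target matches.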

xiFDR Software

All FDR calculations were done with xiFDR. We provide xiFDR, an open-source program (https://github.com/lutzfischer/xiFDR), for researchers to analyze the results of their preferred cross-link search engine. The input of xiFDR is either an mzIdentML file or a table of PSMs (Table S1). The output is either an amended mzIdentML file or a set of tables containing PSMs, peptide pairs, residue pairs, and protein pairs that pass the requested FDR thresholds. It supports two modes of operation for cross-links: directional and nondirectional. Directional here refers to matches where the spectra of A cross-linked to B would be significantly different than those of B cross-linked to A, and nondirectional refers to cross-linking methods where there is practically no distinction between A cross-linked to B and B cross-linked to A. The formula for directional cross-links is

$$\mathrm{FDR} = \frac{TD - DD}{TT}$$

with TT being the number of target–target matches, DD being the number of decoy–decoy matches, and TD the number of target–decoy and decoy–target matches. For nondirectional cross-links the formula is

$$\mathrm{FDR} = \frac{TD + DD\left(1 - \frac{TD_{DB}}{DD_{DB}}\right)}{TT}$$

with $TD_{DB}$ being the number of all possible unique target–decoy and decoy–target entries and $DD_{DB}$ the number of all possible unique decoy–decoy entries. The difference here is in how decoy–decoy matches model the false peptide–false peptide matches among the target–target matches. A detailed derivation of both formulas and their impact is described in the Supporting Information (text and Figure S1). Both formulas converge quickly (at 200 linkable entities the deviation is <1%, Supporting Information, Figure S2 and supplemental discussion). Both formulas are applicable at PSM, peptide pair, residue pair, and protein pair level. Even so, how directionality would look for residue and protein pairs is currently unclear.

The calculated FDRs are reported with an attached resolution. The resolution here is defined as the difference of the next higher computable FDR minus the next lower FDR. This is exemplified in Supporting Information, Figure S3. While not providing an actual accuracy, it gives an indication of the range into which the actual FDR might fall. xiFDR is described in more detail as part of the Supporting Information.
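The two estimators and the resolution can be illustrated with a minimal sketch. It is not the xiFDR implementation: it assumes that the directional estimate takes the form (TD − DD)/TT and the nondirectional one (TD + DD·(1 − TD_DB/DD_DB))/TT, with DD_DB taken as the number of possible unique decoy–decoy entries, and it reads the resolution as the gap between the nearest computable FDR values on either side of the requested value. Function names and the example counts are hypothetical.

```python
def fdr_directional(tt, td, dd):
    """FDR estimate when A-B and B-A are distinguishable matches."""
    return (td - dd) / tt


def fdr_nondirectional(tt, td, dd, td_db, dd_db):
    """FDR estimate when A-B and B-A are the same match.

    td_db, dd_db: numbers of possible unique target-decoy and decoy-decoy
    entries in the search database (assumed definitions).
    """
    return (td + dd * (1 - td_db / dd_db)) / tt


def fdr_resolution(computable_fdrs, requested):
    """Next higher computable FDR minus the next lower one, or None."""
    lower = [f for f in computable_fdrs if f <= requested]
    higher = [f for f in computable_fdrs if f > requested]
    if not lower or not higher:
        return None
    return min(higher) - max(lower)


if __name__ == "__main__":
    n = 200  # linkable entities in the target database (and in the decoy)
    td_db = n * n             # unique target-decoy entries
    dd_db = n * (n + 1) // 2  # unique decoy-decoy entries, self-pairs allowed
    print(fdr_directional(1000, 60, 10))
    print(fdr_nondirectional(1000, 60, 10, td_db, dd_db))
    print(fdr_resolution([0.01, 0.04, 0.07], 0.05))
```

Under these assumptions the ratio TD_DB/DD_DB equals 2n/(n + 1), which approaches 2 as the database grows, so the nondirectional estimate approaches the directional one.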

Results and Discussion

Database searches of mass spectrometry data in proteomics return peptide-spectrum matches (PSMs). Consequently, one may want to assess the error made in this process, and FDR calculations for PSMs have been validated extensively for linear peptides based on a number of tests.[22−24] However, for protein cross-linking, there are three additional information levels. PSMs aggregate to peptide pairs, these then aggregate to linked residue pairs, which in turn aggregate to protein pairs. To assess whether FDR estimation at the different information levels is actually valid, we used a crystal structure as “ground truth”. We compared our search results for data of an RNA polymerase II analysis[16] filtered to 50% FDR at different levels (PSMs, peptide pairs, residue pairs) to the crystal structure of Pol II (PDB|1WCM),[17] measuring the distance of residue pairs that were identified as being cross-linked. If the distance of a cross-linked residue pair is feasible, the identification is possibly right. If not, it is likely wrong. When looking at the distance histogram of target and decoy matches, the distribution of target and decoy matches should be distinct for the cross-linkable distance, with more targets than decoys (Figure 1a). This indicates that there are actually true identifications among the target matches. On the other hand, for long, structurally unfeasible distances the curves should overlay. Most of the identifications of residue pairs that are long distance are structurally unfeasible and, hence, likely false positives, which decoys are supposed to model. Indeed, we found that the decoy distributions match the long-distance part of the target distribution for each observed level of information: PSMs (Figure 1b), peptide pairs (Figure 1c), and residue pairs (Figure 1d). Decoys (always false) and long-distance links (mostly false) agree for PSMs, peptide pairs, and residue pairs. Consequently, FDRs of PSMs, peptide pairs, and residue pairs can be obtained by target-decoy searches.

Figure 1

Validation of FDR on different levels by crystal structure. (A) Schematic distance-histograms showing the expected overlap of false positive and decoys and resulting overlap of overlength cross-links with decoy cross-links. (B) Residue-pair distance-histogram based on identified PSMs for a PSM FDR of 50%. (C) Residue-pair distance-histogram based on identified peptide pairs for a peptide-pair FDR cutoff of 50%, calculated at the level of peptide pairs. (D) Residue pair distance-histogram for a residue-pair FDR of 50%. All distances are Cα–Cα distances of the identified residue pairs in a crystal structure of Pol II (PDB|1WCM).

In a cross-linking experiment, the information of interest lies with the cross-linked residue pairs and the cross-linked protein pairs. Restricting FDR analysis to PSMs or peptide pairs leads to a problem: A defined FDR for PSMs or peptide pairs gives an unpredictable and typically larger FDR at the level of residue pairs or protein pairs (Figure 2). For our RNA polymerase II analysis, 5% FDR at the level of PSMs leads to 5.8% FDR at the level of peptide pairs and 8.4% at the level of residue pairs. While we can also look at protein pairs, and the trend seems to persist, the actual number of possible pairs in Pol II does not permit any statistically meaningful results. At no FDR is the PSM FDR a good guide for the accuracy of information at the level of residue pairs. Peptide-pair FDR is also not a good guide for the situation at residue-pair level. Consequently, the error should be estimated for the information that is of actual interest, that is, linked residue and protein pairs. Similar arguments have been made for protein identification:[25] correct matches tend to aggregate when combining PSMs to peptides and peptides to proteins. In contrast, false matches tend to stay alone. False matches are random and have a low probability to fall by chance into the same protein. Therefore, the proportion of false results increases when combining results.
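A toy example may help to see why the error grows on aggregation. The numbers below are invented for illustration and are not the Pol II data. The sketch assumes the directional estimate (TD − DD)/TT and collapses PSMs to unique peptide pairs and unique residue pairs; the record layout and function names are hypothetical.

```python
from collections import Counter


def fdr(classes):
    """Estimate FDR as (TD - DD) / TT from an iterable of class labels."""
    counts = Counter(classes)
    return (counts["TD"] - counts["DD"]) / counts["TT"]


def aggregate(psms, level):
    """Collapse PSMs to unique entries at `level`, keeping each entry's class."""
    return {psm[level]: psm["cls"] for psm in psms}


# 20 correct-looking target-target PSMs that cluster onto
# 5 peptide pairs and 3 residue pairs (hypothetical data).
psms = [
    {"peptide_pair": f"pepT{i % 5}", "residue_pair": f"resT{(i % 5) % 3}", "cls": "TT"}
    for i in range(20)
]
# One random match involving a decoy; it does not cluster with anything.
psms.append({"peptide_pair": "pepD0", "residue_pair": "resD0", "cls": "TD"})

psm_fdr = fdr(p["cls"] for p in psms)
peptide_pair_fdr = fdr(aggregate(psms, "peptide_pair").values())
residue_pair_fdr = fdr(aggregate(psms, "residue_pair").values())

if __name__ == "__main__":
    print(psm_fdr, peptide_pair_fdr, residue_pair_fdr)
```

The single decoy match stays one entry at every level while the target matches shrink from 20 to 5 to 3 entries, so the estimated FDR rises at each aggregation step.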

Figure 2

FDR propagation from PSMs to peptide pairs and residue pairs. (A) Actual peptide-pair FDR (solid gray) and residue-pair FDR (solid black) in dependence of PSM FDR (dashed gray line) for a cross-link data set of RNA Pol II.[16] The protein-pair FDR is plotted as a trend only, due to data sparseness. (B) Exemplification of the error propagation, in form of wrong identifications, from PSMs to peptide pairs and residue pairs. Correctly identified PSMs (true positives = green) tend to cluster, for example, several correctly identified PSMs support the same unique peptide pair and correctly identified peptide pairs in turn support one residue pair. Incorrectly identified PSMs (false positives = red) are random and do not cluster to the same extent.

Given that residue-pair FDRs should and can be calculated, the question remains of how to treat PSMs and peptide pairs. One could ignore their error and leave error estimation to the level of residue pairs entirely. Instead, we restrict the number of false PSMs and peptide pairs by applying an FDR threshold at their respective level as a prefilter. Importantly, the way one handles PSMs and peptide pairs actually influences the number of residue pairs passing a given FDR threshold. For example, aiming for 5% FDR on residue pairs in our data, we observe 108 hits if only applying the cutoff at residue level, compared to 144 hits if we apply a 6% FDR cutoff at PSM level and a 10.5% FDR cutoff at peptide-pair level (Figure 3). Prefiltering in PSMs and peptide pairs added 36 residue pairs (33%) without affecting their FDR. To test if our FDR is still reflecting the likely accuracy of cross-links reported in our analysis, we compared the initial as well as the number-improved set of cross-links with an available crystal structure of Pol II (PDB|1WCM). Of the additional 36 residue pairs, 33 showed a distance in the crystal structure that matched the possible cross-link length (∼27 Å for lysine–lysine links with BS3[16]). In addition, two of the three remaining residue pairs involve the very flexible N-terminal loop-region of Rpb1, offering an explanation for seeing these cross-links despite residues being distant in the crystal structure. In conclusion, prefiltering added 35 plausible residue pairs (33%) at the expense of adding one implausible one. Prefiltering therefore appears to be a valid way of improving search sensitivity without compromising search accuracy.

Figure 3

Increased search sensitivity by prefiltering. (A) The number of identified residue pairs (at 5% FDR, z-axis) depends on the FDR-threshold applied to PSMs (x-axis) and peptide pairs (y-axis). (B) Optimal FDR thresholds on PSMs and peptide pairs (left) return more cross-links (at 5% FDR) than not applying prefilters (right). (C) Distance distribution of the residue pairs (5% residue-pair FDR). The prefiltering does increase the number of cross-links but does not lead to a notable increase in long distance links (see text for a more detailed discussion).

The success of prefiltering by applying FDR thresholds at lower levels in improving search sensitivity depends on combining multiple PSMs to support a peptide pair and multiple peptide pairs to support a residue pair. We are not aware of a way to predict the best filter settings, or in fact whether different filter settings at lower information levels would always be beneficial. We therefore suggest exploring this numerically by software. We supply such software here, xiFDR (see Experimental Section). Note that this tool uses a CSV file or mzIdentML[26] version 1.2 (submitted) as input and is therefore independent of the search software. xiFDR reports the FDR interval (Supporting Information, Figure S3).
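The numerical exploration of prefilter settings can be pictured as a plain grid search. The sketch below is not how xiFDR is implemented. It assumes a caller-supplied function `count_residue_pairs(psm_threshold, peptide_pair_threshold)` that returns the number of residue pairs passing the target residue-pair FDR for a given pair of prefilter thresholds, and the `toy_surface` function is an invented stand-in used only to show the call.

```python
from itertools import product


def best_prefilter(count_residue_pairs, psm_grid, peptide_pair_grid):
    """Return (n_pairs, psm_threshold, peptide_pair_threshold) with maximal n_pairs.

    count_residue_pairs is a hypothetical callable supplied by the caller.
    Ties are resolved in favor of the first combination encountered.
    """
    best = None
    for psm_t, pep_t in product(psm_grid, peptide_pair_grid):
        n_pairs = count_residue_pairs(psm_t, pep_t)
        if best is None or n_pairs > best[0]:
            best = (n_pairs, psm_t, pep_t)
    return best


def toy_surface(psm_percent, peptide_pair_percent):
    """Invented response surface with a single peak, for demonstration only."""
    return 100 - abs(psm_percent - 6) - abs(peptide_pair_percent - 10)


if __name__ == "__main__":
    # Thresholds are given in whole percent to keep the toy example exact.
    print(best_prefilter(toy_surface, range(1, 11), range(1, 21)))
```

An exhaustive grid is affordable here because each evaluation only re-filters an existing result list rather than repeating the database search.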

Conclusion

Current FDR approaches in cross-linking/mass spectrometry stop at the PSM or peptide-pair level, often without specifying which one was actually used. Consequently, the information of interest, links between sites (residue pairs) or proteins (protein pairs), is reported with an unknown and typically higher (potentially much higher) error. Our data indicate that our FDR approach can be extended to assess the error on residue-pair level and presumably also protein-pair level. As contributions to finding the most sensitive but also fair report of identified links, we propose to prefilter on PSMs and peptide pairs, and to report FDR together with the interval of uncertainty resulting from limited data. FDR estimation played an important role in consolidating proteomics and it has a similar role to play for cross-linking/mass spectrometry.

26 in total

1. False discovery rate estimation for cross-linked peptides identified by mass spectrometry.

Authors: Thomas Walzthoeni; Manfred Claassen; Alexander Leitner; Franz Herzog; Stefan Bohn; Friedrich Förster; Martin Beck; Ruedi Aebersold
Journal: Nat Methods Date: 2012-07-08 Impact factor: 28.547

Review 2. Crosslinking combined with mass spectrometry for structural proteomics.

Authors: Evgeniy V Petrotchenko; Christoph H Borchers
Journal: Mass Spectrom Rev Date: 2010 Nov-Dec Impact factor: 10.946

3. Proteome-wide profiling of protein assemblies by cross-linking mass spectrometry.

Authors: Fan Liu; Dirk T S Rijkers; Harm Post; Albert J R Heck
Journal: Nat Methods Date: 2015-09-28 Impact factor: 28.547

Review 4. Analysis of protein complexes using mass spectrometry.

Authors: Anne-Claude Gingras; Matthias Gstaiger; Brian Raught; Ruedi Aebersold
Journal: Nat Rev Mol Cell Biol Date: 2007-08 Impact factor: 94.444

5. An enhanced protein crosslink identification strategy using CID-cleavable chemical crosslinkers and LC/MS(n) analysis.

Authors: Fan Liu; Cong Wu; Jonathan V Sweedler; Michael B Goshe
Journal: Proteomics Date: 2012-01-18 Impact factor: 3.984

6. The mzIdentML data standard for mass spectrometry-based proteomics results.

Authors: Andrew R Jones; Martin Eisenacher; Gerhard Mayer; Oliver Kohlbacher; Jennifer Siepen; Simon J Hubbard; Julian N Selley; Brian C Searle; James Shofstahl; Sean L Seymour; Randall Julian; Pierre-Alain Binz; Eric W Deutsch; Henning Hermjakob; Florian Reisinger; Johannes Griss; Juan Antonio Vizcaíno; Matthew Chambers; Angel Pizarro; David Creasy
Journal: Mol Cell Proteomics Date: 2012-02-27 Impact factor: 5.911

7. Architecture of the RNA polymerase II-TFIIF complex revealed by cross-linking and mass spectrometry.

Authors: Zhuo Angel Chen; Anass Jawhari; Lutz Fischer; Claudia Buchen; Salman Tahir; Tomislav Kamenski; Morten Rasmussen; Laurent Lariviere; Jimi-Carlo Bukowski-Wills; Michael Nilges; Patrick Cramer; Juri Rappsilber
Journal: EMBO J Date: 2010-01-21 Impact factor: 11.598

8. The beginning of a beautiful friendship: cross-linking/mass spectrometry and modelling of proteins and multi-protein complexes.

Authors: Juri Rappsilber
Journal: J Struct Biol Date: 2010-10-26 Impact factor: 2.867

9. Improved results in proteomics by use of local and peptide-class specific false discovery rates.

Authors: Lau Sennels; Jimi-Carlo Bukowski-Wills; Juri Rappsilber
Journal: BMC Bioinformatics Date: 2009-06-12 Impact factor: 3.169

10. 2016 update of the PRIDE database and its related tools.

Authors: Juan Antonio Vizcaíno; Attila Csordas; Noemi del-Toro; José A Dianes; Johannes Griss; Ilias Lavidas; Gerhard Mayer; Yasset Perez-Riverol; Florian Reisinger; Tobias Ternent; Qing-Wei Xu; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971
