In-Search Assignment of Monoisotopic Peaks Improves the Identification of Cross-Linked Peptides.

Swantje Lenz¹, Sven H Giese¹, Lutz Fischer², Juri Rappsilber^1,2.

Abstract

Cross-linking/mass spectrometry has undergone a maturation process akin to standard proteomics by adapting key methods such as false discovery rate control and quantification. A poorly evaluated search setting in proteomics is the consideration of multiple (lighter) alternative values for the monoisotopic precursor mass to compensate for possible misassignments of the monoisotopic peak. Here, we show that monoisotopic peak assignment is a major weakness of current data handling approaches in cross-linking. Cross-linked peptides often have high precursor masses, which reduces the presence of the monoisotopic peak in the isotope envelope. Paired with generally low peak intensity, this generates a challenge that may not be completely solvable by precursor mass assignment routines. We therefore took an alternative route by implementing "in-search assignment of the monoisotopic peak" in the cross-link database search tool Xi (Xi-MPA), which considers multiple precursor masses during database search. We compare and evaluate the performance of established preprocessing workflows that partly correct the monoisotopic peak and Xi-MPA on three publicly available data sets. Xi-MPA always delivered the highest number of identifications, with an approximately 2- to 4-fold increase in PSMs, without compromising identification accuracy as determined by FDR estimation and comparison to crystallographic models.

Keywords: BS3; SDA; cross-linking; data processing; mass spectrometry; monoisotopic mass; peptides; proteomics; software; structural proteomics

Year: 2018 PMID: 30293428 PMCID: PMC6279313 DOI: 10.1021/acs.jproteome.8b00600

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466

Introduction

Several approaches have been utilized to increase the numbers of identified cross-links, for example enriching for cross-linked peptides,[1−4] using different proteases[1,5,6] or optimizing fragmentation methods.[7,8] In parallel with experimental developments, data analysis has also progressed to extract even more cross-links to be used as distance restraints for modeling of proteins and their complexes.[9,10] Search software has been designed for the identification of cross-linked peptides, for example Kojak,[11] xQuest,[12] pLink,[13] XlinkX,[14] or Xi.[5] In addition, cross-linking workflows can make use of preprocessing methods to improve data quality and reduce file sizes,[15] as well as postprocessing methods to filter out false identifications[11,16] and custom-tailored false discovery rate (FDR) estimation.[17−19] Preprocessing can improve peptide identification by correcting the MS1 precursor ion m/z and simplifying MS2 fragment spectra. Established proteomics software perform such preprocessing, including MaxQuant[20,21] and OpenMS.[22,23] For example, MaxQuant performs a variety of preprocessing steps: it corrects the precursor m/z by an intensity-weighted average if a suitable peptide feature is found, reassigns the monoisotopic peak and contains options for intensity filtering of MS2 peaks. Despite such correction of the precursor mass, many linear search engines have integrated the possibility of considering multiple monoisotopic peaks during search.[24−26] However, the benefits of this search feature are currently unclear. It seems that the assignment of monoisotopic mass for tryptic peptides is already achieved adequately either during acquisition or as part of preprocessing. Cross-linked peptides have characteristics that may render MS1 monoisotopic precursor mass assignment as used for linear peptides nonoptimal: high-charge states, large masses, and low abundances. 
Several cross-link search engines include MS1 correction in their pipeline: pLink corrects monoisotopic peaks based on previous work with linear peptides,[27] however does not include a parameter for searching multiple precursor masses. Kojak averages precursor ion signals of neighboring scans to create a composite spectrum and infer the true monoisotopic mass of the precursor. If this step fails, precursor masses up to −2 Da lighter are searched.[11] For previous searches in Xi, MaxQuant was used to perform preprocessing. Neither xQuest nor XLinkX describe precursor correction in their workflow documentation and there is no option for additionally searched masses available in the respective search parameters. We are not aware of a detailed evaluation of the impact of different preprocessing techniques for cross-link identification, independent of the search software. Correcting the monoisotopic mass of precursors, although acknowledged as an issue,[11,28] awaits systematic evaluation. In this study, we show that errors in assigning monoisotopic peaks during data acquisition are frequent for cross-linked peptides because of their size and generally low abundance. This adversely affects their identification. We show that standard software suites, MaxQuant and OpenMS correct monoisotopic precursor masses of cross-linked peptides with variable success. We then implement an option in Xi to consider multiple precursor masses during search, to minimize the impact of false monoisotopic precursor mass assignment on the identification of cross-links.

Methods

Data Sets

In this study, we used three publicly available data sets (Table 1). The three data sets were chosen to reflect a range of applications of cross-linking mass spectrometry as well as a range of data complexity. The first data set is human serum albumin (HSA) cross-linked with succinimidyl 4,4-azipentanoate (SDA) and fragmented using five different methods (PXD003737).[29] The second data set is a pooled pseudocomplex sample of seven proteins that were cross-linked separately with bis(sulfosuccinimidyl) suberate (BS3) (PXD006131).[7] This data set includes data from four different fragmentation methods. The third data set is the most complex sample, composed of 15 size exclusion chromatography fractions of Chaetomium thermophilum lysate cross-linked with BS3 and fragmented only with HCD (PXD006626).[30] The first and last size exclusion fractions were used to optimize the search parameters for this data set. All samples were analyzed on an Orbitrap Fusion Lumos Tribrid mass spectrometer (Thermo Fisher Scientific, San Jose, CA) using Xcalibur (versions 2.0 and 2.1).

Table 1

Overview of Datasets Used

data set	sample	database size (a)	reference
1	HSA	1	Giese et al. 2016
2	pseudocomplex	7	Kolbowski et al. 2017
3	C. thermophilum	198–400 (b)	Kastritis et al. 2017

(a) Database size refers to the number of proteins in the database.

(b) Multiple size exclusion chromatography fractions (n = 15).

Preprocessing

Raw files were preprocessed independently using MaxQuant (1.5.5.30), OpenMS (2.0.1) and the ProteoWizard[31] tool msconvert (3.0.9576) for comparison. Scripts automating the preprocessing, search and evaluation were written in Python (2.7). The essential steps during the preprocessing can be divided into two parts: (1) correction of the m/z or charge of the precursor peak for MS2 spectra and (2) denoising of MS2 spectra. MaxQuant and OpenMS both try to correct the precursor information via additional feature finding steps, i.e., identifying a peptide feature from the retention time, m/z and intensity domains of the LC-MS run. Additionally, MS2 spectra are denoised by retaining only the most intense peaks in defined m/z windows. Preprocessing is enabled by default in MaxQuant and was run using the partial processing option (steps 1–5) with default settings, except that “deisotoping” was inactivated and “top peaks per 100 Da” was set to 20. The OpenMS preprocessing workflow includes centroiding, feature finding,[32] precursor correction (mass and charge) using the identified features and MS2 denoising as described above (Supporting Information (SI) Figure S1). Msconvert was used to convert the raw files to mgf files without any correction. These peak files were denoted as “uncorrected” and used as our baseline to quantify improvements in the subsequent database search. For the “in-search assignment of the monoisotopic peak” in Xi (Xi-MPA), we used msconvert to convert raw files to mgf files and included an MS2 peak filter for the 20 most intense peaks in a 100 m/z window.
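
The MS2 denoising step described above (keeping the 20 most intense peaks per 100 m/z window) can be sketched in a few lines of Python. This is only an illustration and not the code used by MaxQuant, OpenMS, or msconvert: the function name filter_top_peaks is hypothetical, and the use of fixed, non-overlapping windows starting at m/z 0 is an assumption.

```python
from collections import defaultdict


def filter_top_peaks(peaks, top_n=20, window=100.0):
    """Keep the top_n most intense peaks in each consecutive m/z window.

    peaks: iterable of (mz, intensity) tuples.
    Returns the retained peaks sorted by m/z.
    """
    bins = defaultdict(list)
    for mz, intensity in peaks:
        # Assumption: fixed, non-overlapping windows starting at m/z 0.
        bins[int(mz // window)].append((mz, intensity))

    kept = []
    for window_peaks in bins.values():
        # Sort by descending intensity and keep the strongest top_n peaks.
        window_peaks.sort(key=lambda p: p[1], reverse=True)
        kept.extend(window_peaks[:top_n])
    return sorted(kept)


if __name__ == "__main__":
    # Synthetic spectrum: 250 peaks spread over three 100 m/z windows.
    spectrum = [(100.0 + i, float(i % 7 + 1)) for i in range(250)]
    filtered = filter_top_peaks(spectrum, top_n=20, window=100.0)
    print(len(spectrum), "peaks before,", len(filtered), "peaks after filtering")
```

A sliding-window or rank-based filter would behave differently at window borders; the fixed-bin variant is shown only because it is the simplest reading of the setting.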

Data Analysis

Peak files were searched separately in Xi (1.6.731) with the following settings: MS accuracy 3 ppm, MS/MS accuracy 10 ppm, oxidation of methionine as variable modification, tryptic digestion, two missed cleavages. For samples cross-linked with SDA, linkage sites were allowed on lysine, serine, tyrosine, threonine, and the protein N-terminus on one end and all amino acids on the other end of the cross-linker. Variable modifications were monolink SDA (110.048 Da), SDA loop-links (82.0419 Da), SDA hydrolyzed (100.0524 Da), SDA oxidized (98.0368 Da)[31] as well as carbamidomethylation on cysteine. For searches with BS3, linkage sites were lysine, serine, threonine, tyrosine, and the protein N-terminus. Carbamidomethylation on cysteine was set as fixed modification. Allowed variable modifications of the cross-linker were aminated BS3 (155.0946 Da), hydrolyzed BS3 (156.0786 Da) and loop-linked BS3 (138.0681 Da). For collision-induced dissociation (CID) and beam-type CID, also referred to as higher-energy C-trap dissociation (HCD), b- and y-ions were searched for, whereas for electron transfer dissociation (ETD) c- and z-ions were allowed. For ETciD and EThcD, b-, c-, z-, and y-ions were allowed. The HSA and pseudocomplex data sets were searched against the known proteins in the sample. For each protein fraction of the C. thermophilum data set, the databases of the original publication were used, where a database was created for each fraction by taking the most abundant proteins (iBAQ value above 10^6). For searches employing Xi-MPA, the parameter "missing_isotope_peaks" was set to the respective mass range searched. Data sets 1 and 2 were searched with a reversed decoy database, whereas data set 3 was searched with a shuffled decoy database due to palindromic sequences. 
For the reversed decoy database, lysines and arginines were swapped with the preceding amino acid before peptide generation.[17,20] For cross-linking, there are different information levels: PSMs, peptide pairs, residue pairs (links) and protein pairs. The false discovery rate (FDR) can be calculated on each one of these levels and should be reported for the level at which the information is given.[12] The FDR was calculated as described in Fischer et al.[17] using xiFDR (1.0.14.34) according to the following equation: FDR = (TD − DD)/TT, where TT, TD, and DD are the numbers of target-target, target-decoy, and decoy-decoy matches, respectively. A 5% PSM level cutoff was imposed. The setting “uniquePSMs” was enabled and the FDR was calculated separately on self- and between-links. Minimal peptide length was set to 6. In data set 2, identified cross-linked residues were mapped to the crystal structure of the respective protein and the Euclidean distance between the alpha-carbons was calculated. Structures were downloaded from the PDB (IDs: 1AO6, 5GKN, 2CRK, 3NBS, 1OVT, 2FRJ). Kojak (1.5.5) was run via the Trans-Proteomic Pipeline (5.1.0)[33] using default settings except: MS1 resolution 120 000, BS3 allowed on lysine, serine, threonine, tyrosine, and the protein N-terminus, aminated BS3 (155.0946 Da) as variable modification of the cross-linker, 3 ppm mass tolerance on MS1 level. For the uncorrected search, the isotope error was set to 0 and precursor refinement was disabled. PSMs were validated using PeptideProphet[34] and the FDR was calculated as described above on the resulting probability. The mass spectrometry data have been deposited to the ProteomeXchange Consortium via the PRIDE[35] partner repository with the data set identifier PXD011121. For transparency, Python scripts are available on GitHub under: https://github.com/Rappsilber-Laboratory/Xi-MPA_scripts.
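
As an illustration of target-decoy FDR estimation for cross-linked PSMs, the following sketch assumes the commonly used cross-link estimate FDR = (TD − DD)/TT, where TT, TD, and DD count target-target, target-decoy, and decoy-decoy matches. The function names crosslink_fdr and score_cutoff are hypothetical, and the sketch deliberately ignores features of xiFDR such as separate treatment of self- and between-links, unique PSMs, and boosting.

```python
def crosslink_fdr(tt, td, dd):
    """Estimate the FDR from target-target, target-decoy and decoy-decoy counts.

    Assumed formula: FDR = (TD - DD) / TT, floored at zero.
    """
    if tt <= 0:
        raise ValueError("need at least one target-target match")
    return max(td - dd, 0) / tt


def score_cutoff(psms, max_fdr=0.05):
    """Return the lowest score at which the cumulative FDR is <= max_fdr.

    psms: list of (score, label) with label in {"TT", "TD", "DD"}.
    Returns None if no cutoff satisfies the FDR limit.
    Simplification: tied scores are not treated specially.
    """
    counts = {"TT": 0, "TD": 0, "DD": 0}
    cutoff = None
    for score, label in sorted(psms, key=lambda p: p[0], reverse=True):
        counts[label] += 1
        if counts["TT"] > 0:
            fdr = crosslink_fdr(counts["TT"], counts["TD"], counts["DD"])
            if fdr <= max_fdr:
                cutoff = score
    return cutoff


if __name__ == "__main__":
    example = [(10, "TT"), (9, "TT"), (8, "TD"), (7, "TT")]
    print("cutoff at 5% FDR:", score_cutoff(example, max_fdr=0.05))
```

All PSMs scoring at or above the returned cutoff would be accepted at the chosen FDR level under these assumptions.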

Results and Discussion

We evaluated the impact on cross-link identification in Xi of changing the precursor monoisotopic mass that was initially assigned during data acquisition (“uncorrected”). In this analysis, MaxQuant and OpenMS were used as preprocessing tools. We used three different data sets that differ in complexity and fragmentation regimes. To measure the improvements from using the preprocessing tools, a simple conversion from raw files to mgf format was done with msconvert and used as a baseline. Note that in the spectrum header, there are two m/z values: the trigger mass of the MS2 and the assigned monoisotopic peak of the isotope cluster. Msconvert extracts the assigned monoisotopic mass. Processed data were searched separately in Xi and evaluated on PSM level or link (= unique residue pair) level, with a 5% FDR. Finally, the newly implemented in-search assignment of monoisotopic peaks in Xi was compared to the elaborate preprocessing pipelines in OpenMS and MaxQuant.

Preprocessing Increases the Number of Cross-Link PSMs by Finding the Correct Monoisotopic Peak

The data sets were preprocessed in MaxQuant and OpenMS and numbers of identified PSMs were compared to those obtained using uncorrected data. Data sets 1 (HSA) and 2 (pseudocomplex) were acquired with different acquisition methods. For comparability to data set 3 (complex mixture), we focused on the HCD acquired data. Cross-links between proteins were excluded, either because they were experimentally not possible (data set 2) or observed in too low numbers for reliable FDR calculation (data set 3). For uncorrected data, 672, 354, and 2157 cross-link PSMs resulted for the HSA data set (data set 1), the pseudocomplex (data set 2), and the first and last fractions of C. thermophilum (data set 3), respectively. Both preprocessing approaches improved numbers of identified PSMs for all data sets: preprocessing in MaxQuant led to 1127 (68% increase), 966 (173% increase), and 2966 (38% increase) PSMs, while for OpenMS, 1044 (55% increase), 598 (69% increase) and 2394 (11% increase) PSMs were identified (Figure 1A).

Figure 1

Correction of the monoisotopic peak is crucial in cross-link identification. (A) The data sets were preprocessed using MaxQuant and OpenMS, leading to more identified PSMs in all cases. Fold changes from uncorrected data (msconvert conversion of Xcalibur data) were calculated for each file separately and the mean plotted. Error bars represent the standard error of the mean between different acquisitions (HSA: n = 3, pseudocomplex: n = 3, C. thermophilum: n = 8). (B) The majority of additional identifications after preprocessing are due to correction of the precursor mass to lighter monoisotopic masses. Spectra that are unique to MaxQuant preprocessed searches of HCD acquisitions from data set 2 were evaluated in terms of precursor correction. The main proportion of the gain was corrected to lighter masses of up to −3 Da, while charge state correction or correction to heavier masses rarely occurred. (C) Isotope cluster of a corrected precursor. The precursor of m/z 992.71 (z = 5, m = 4958.6 Da) was identified solely in MaxQuant preprocessed results. In OpenMS preprocessed and uncorrected data, the wrong monoisotopic mass was selected for unknown reasons.

We assessed the gains in identified PSMs of preprocessed data compared to uncorrected data (focusing on data set 2) regarding three forms of precursor correction: (1) correction of the monoisotopic mass, (2) charge state correction, and (3) small corrections of the m/z value based on averaging the m/z values across the peptide feature (Figure 1B). Precursor mass and charge state of spectra identified solely in MaxQuant-Xi were compared to their counterparts when searching uncorrected data in Xi. Of the 756 newly identified spectra, 686 (91%) had a different monoisotopic precursor mass. Precursors were primarily corrected to lighter masses by MaxQuant, that is, the monoisotopic peak was corrected by −1 (208 spectra), −2 (215 spectra), −3 (149 spectra), and −4 Da (62 spectra). 
Greater shifts (−5 to −7 Da) only occurred 30 times, and corrections to heavier masses were observed 22 times. Only 30 spectra (4%) were corrected in their charge state. Of the 60 spectra (8%) without correction in charge state or monoisotopic peak, only nine had an error greater than 3 ppm before preprocessing, indicating a small correction of the initial precursor m/z (by averaging of peptide feature peaks). The main proportion of these identifications is likely a product of noise removal in MS2 spectra or small changes in the score distribution. Similarly, for OpenMS-Xi, the monoisotopic peak correction had the greatest impact: of the 314 spectra that OpenMS added over uncorrected data, 139 were precursor corrected by −1 Da and 108 by −2 Da. In contrast to MaxQuant, corrections by −3 Da or more were not observed, which might explain the higher number of identifications obtained with MaxQuant-Xi. For data sets 1 and 3, the gains of preprocessing are smaller than for data set 2. The median peptide mass of data set 1 (3368 Da) is smaller than the median mass of data set 2 (3946 Da) and we later show that this is a major factor in precursor mass assignment. This is reflected in the distribution of lighter masses assigned: 46%, 33%, and 21% were corrected by −1 Da, −2 Da, and to even lighter peaks, respectively. Data set 3 was acquired with a different version of Xcalibur, for which we saw better mass assignment than for earlier versions (data not shown). Consistent with this, 67% of the masses corrected to lighter values are shifted by −1 Da, 21% by −2 Da, and 12% to even lighter masses. However, we were not able to follow up on this to our satisfaction, since the source code of the vendor software is not available. In summary, preprocessing, especially monoisotopic peak correction, leads to a notable increase in identifications. Using the 3-dimensional peptide feature is advantageous compared to on-the-fly detection of the monoisotopic peak. 
If the preceding MS1 spectrum was acquired during the beginning (or end) of the elution profile of a peptide, the intensity will be low. Thus, the monoisotopic peak might not even be detectable at the time of fragmentation. For large (cross-linked) peptides, this effect might be exacerbated by the monoisotopic peak usually being less intense than other isotope peaks. Therefore, using the additional information from the retention time domain will be beneficial. The same feature information can also be used to determine or validate the assigned charge state of the precursor. However, the instrument software almost always assigned the same charge state as MaxQuant or OpenMS. Thus, the main advantage for identifying cross-linked peptides arises from monoisotopic peak correction. Interestingly, OpenMS and MaxQuant did not always agree on or find the same monoisotopic peak (Figure 1C). Of the total MaxQuant-corrected spectra with a different monoisotopic mass, 81% were not corrected and 6% were corrected differently by OpenMS. Vice versa, 15% of the monoisotopic peaks corrected by OpenMS were not corrected by MaxQuant and 25% were corrected differently. MaxQuant and OpenMS each have their own implementation of precursor correction; therefore, there might be instances where MaxQuant is able to find a corresponding peptide feature where OpenMS does not and vice versa. Although OpenMS did not lead to the same improvements in the number of identifications as MaxQuant, it did correct some precursors that the latter did not. We therefore suspect that there are also precursors with a falsely assigned monoisotopic peak that were corrected by neither algorithm. Furthermore, 3-dimensional detection of peptide features is challenging for low intensity peptides. In conclusion, there likely remain falsely assigned monoisotopic peaks in the data, ultimately leading to missed or false identifications.

In-Search Monoisotopic Peak Assignment Increases the Number of Identifications

We observed multiple cases where MaxQuant and OpenMS disagreed in their monoisotopic peak choice, indicating that the problem of monoisotopic peak assignment (MPA) cannot be solved easily at MS1 level. Indeed, we found instances where the monoisotopic peak is simply not distinguishable from noise, so a feature-based correction would not be feasible. Nevertheless, the associated MS2 spectra could be matched to a cross-linked peptide when considering multiple different monoisotopic masses during search. This shows that the extra information gained from obtaining a peptide-spectrum match is advantageous to MPA over considering MS1 information alone. Therefore, we implemented a monoisotopic peak assignment in Xi: for each MS2 spectrum, multiple precursor masses are considered during a single search and the highest-scoring peptide pair determines the assigned precursor mass. Note that this is different from simply searching with a wide mass error for MS1. The mass accuracy of MS1 is minimally compromised as multiple candidates for the monoisotopic mass are taken and each is considered with the original mass accuracy of the measurement. To find a good trade-off between increased search space and sensitivity, we tested different mass range settings on the data sets. For data set 2 (HCD subset), the number of PSMs increased with ranges up to −5 Da on the considered monoisotopic masses (Figure 2A). However, the increase in identifications from −4 to −5 Da was only 3% and, considering the increase in search time, we continued with a maximal correction to −4 Da as the optimal setting for this data set. Xi-MPA yielded 1508 PSMs, which is a 326% increase compared to searching uncorrected data and a 56% increase compared to MaxQuant-Xi. Similar improvements are observed for the other fragmentation methods in this data set (SI Figure S2). Additionally, we corrected up to −7 Da to test if a large increase in search space increases random spectra matches as measured by the target-decoy approach. 
With this range, the number of identifications at 5% FDR decreased only slightly compared to the −5 Da range (−1%) and remained higher than with the −4 Da range (3%). In the HSA data set, Xi-MPA with up to −4 Da increased the number of identified PSMs by 170% compared to uncorrected data (SI Figure S3).
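
The idea of considering several discrete monoisotopic masses, each at the original MS1 tolerance, rather than widening the tolerance, can be sketched as follows. This is a minimal illustration and not the Xi implementation: the function names are hypothetical, the isotope spacing of 1.00335 Da (the 13C to 12C mass difference) is an assumed value, and the scored candidates stand in for peptide pairs that a real search engine would score against the MS2 spectrum.

```python
PROTON_MASS = 1.007276466  # Da
ISOTOPE_SPACING = 1.00335  # Da, assumed spacing between isotope peaks


def neutral_mass(mz, charge):
    """Neutral mass of a precursor observed at m/z with the given charge."""
    return (mz - PROTON_MASS) * charge


def candidate_masses(mz, charge, missing_isotope_peaks=4):
    """Assigned monoisotopic mass plus up to N lighter alternatives."""
    assigned = neutral_mass(mz, charge)
    return [assigned - k * ISOTOPE_SPACING
            for k in range(missing_isotope_peaks + 1)]


def within_ppm(observed, theoretical, tol_ppm=3.0):
    """True if observed and theoretical masses agree within tol_ppm."""
    return abs(observed - theoretical) / theoretical * 1e6 <= tol_ppm


def best_assignment(mz, charge, scored_candidates,
                    missing_isotope_peaks=4, tol_ppm=3.0):
    """Pick the highest-scoring candidate matching any considered mass.

    scored_candidates: list of (theoretical_mass, score) pairs.
    Returns (shift_in_isotope_peaks, theoretical_mass, score) or None.
    """
    best = None
    masses = candidate_masses(mz, charge, missing_isotope_peaks)
    for shift, mass in enumerate(masses):
        for theo_mass, score in scored_candidates:
            if within_ppm(mass, theo_mass, tol_ppm):
                if best is None or score > best[2]:
                    best = (-shift, theo_mass, score)
    return best


if __name__ == "__main__":
    # Precursor m/z and charge taken from the Figure 4C example in the text.
    true_mass = neutral_mass(758.16, 4) - 3 * ISOTOPE_SPACING
    hits = [(true_mass, 7.5), (true_mass + 0.5, 9.0)]
    print(best_assignment(758.16, 4, hits))
```

In the usage example, the second candidate has the higher score but lies about 0.5 Da away from every considered mass, so it is rejected at 3 ppm. A single wide MS1 tolerance window would have admitted it.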

Figure 2

In-search monoisotopic peak assignment outperforms preprocessing. (A) Performance of Xi-MPA on data set 2. HCD data from the pseudocomplex data set were searched assuming different ranges of missing monoisotopic peaks. With increased ranges, the number of identified PSMs also increases. (B) Performance of Xi-MPA on the complete C. thermophilum data set. All 15 fractions were searched with the original preprocessed data as well as with Xi-MPA. (C) Overlap of identified residue pairs of MaxQuant-Xi and OpenMS-Xi to residue pairs gained from Xi-MPA (data set 2). Numbers in brackets are the proportion of decoys in the respective regions.

As a final evaluation of in-search monoisotopic peak assignment, we searched the complete data set of C. thermophilum. We used 0 to −3 Da as the range of Xi-MPA, since an initial analysis of the first and last fraction of the C. thermophilum data set returned a similar number of identifications when running Xi-MPA up to −4 Da or −3 Da (SI Figure S4). As a comparison, we took the original peak files obtained from PRIDE. The FDR was calculated separately on self- and between-links, with boosting enabled (automatic prefiltering on PSM and peptide pair level[17]), a minimum of three fragments per peptide and a minimal delta score of 0.5. For the original peak files, which were preprocessed in MaxQuant, we identified 3848 PSMs, 2594 peptide pairs and 1653 cross-links, with a 5% FDR on each respective level (Figure 2B). Xi-MPA resulted in 4952 PSMs (29% increase), 3566 peptide pairs (37% increase), and 2273 cross-links (38% increase). Next, we looked into the complementarity of search results with the different approaches, using data set 2 at 5% link-FDR. Preprocessing via MaxQuant and OpenMS led to 172 and 158 links, respectively, while Xi-MPA resulted in 243 links. While the overlap between links of OpenMS-Xi and MaxQuant-Xi is only 50%, Xi-MPA identifications cover 76% of both searches (Figure 2C). 
Nineteen and 23 links are uniquely found in MaxQuant and OpenMS preprocessed data, respectively. However, there are five decoy links as well in each unique set (resulting in a link-FDR of 26% and 22%). For Xi-MPA, there are 75 unique target links with 12% link-FDR. Identification-based monoisotopic peak assignment as employed by Xi-MPA results in more identifications than the feature-based assignment algorithms of OpenMS and MaxQuant. Neither OpenMS nor MaxQuant corrects all precursor masses that are incorrectly assigned during data acquisition. In Xi-MPA, spectra are searched with multiple monoisotopic masses, thereby relying less on the MS1 information. The quality of the precursor isotope cluster does not contribute to the decision on the monoisotopic mass, so spectra for which feature-based correction failed remain identifiable. One could hypothesize that increasing the search space by considering multiple masses will lead to more false positives, thereby reducing the number of true identifications. This is not the case, as we match substantially more PSMs at constant FDR by considering alternative monoisotopic masses. As a second plausible caveat, this approach increases the search time. However, the use of relatively cheap computational time appears balanced by the notable increase in identified cross-links. The optimal range of additional monoisotopic peaks to search will, however, depend on sample complexity, the quality of MS1 acquisition and the instrument software. To reduce the mass range considered in Xi-MPA, we developed an MS1 level-based approach. For each precursor, we search for lighter isotope peaks in MS1 and use this to narrow the search space (explained in detail in the SI). This led to an average of 24% fewer values to be considered, while only reducing the number of identifications by 3%. 
We hope that our observation of the monoisotopic peak detection challenge in cross-linking together with our publicly available data sets will lead to further improvements in monoisotopic peak-assignment algorithms in the future, possibly tailored to cross-link data. The cross-link search engine Kojak employs a precursor correction in its pipeline.[11] As we could not find a detailed evaluation of the impact of precursor correction in Kojak, we searched the HCD data of the pseudocomplex data set without correction as well as with their default correction settings. We focused on FDR 10% data as there were too few identifications for a reliable calculation of FDR 5%. Just 171 cross-linked PSMs passed for the unprocessed data, whereas for the default search, 1088 PSMs passed (536% increase). Of those, 862 (79%) were corrected in their monoisotopic precursor peak. These results support our observations with Xi.

In-Search Monoisotopic Peak Assignment Does Not Compromise Search Accuracy

Changing the search could lead to several problems. We have already ruled out that the increased search space leads to high-scoring decoy matches that in turn reduce the number of identifications at a given FDR cutoff. As an additional validation, we assessed our results against known PDB structures using the HCD data from the pseudocomplex data set (data set 2), at 5% link-FDR. Assuming a crystal structure is correct, a cross-link can be unexpectedly long either because the link is false or because of in-solution structural dynamics. If, however, the proportion of long-distance links in the results of two approaches is identical, then at least the two results have equal quality. We first tested the results of all three approaches against crystal structures. Residue pairs were mapped to PDB structures and the distance between the two alpha-carbons was calculated (see Methods). Thirty Å was set as the maximal distance for BS3; links with a greater distance were classified as long-distance. In this evaluation, we excluded the protein C3B because its flexible regions make it unsuitable for this analysis. For MaxQuant and OpenMS preprocessed results, 11.8% and 6.1% long-distance cross-links were identified, respectively. In Xi-MPA, 8.1% long-distance cross-links were identified (Figure 3A). Of the links uniquely identified through Xi-MPA, only 5.3% were long-distance links. Therefore, Xi-MPA as such does not lead to an enrichment in long-distance cross-links. However, it could be that mass-corrected precursors tend to have a higher proportion of long-distance links. We therefore split the Xi-MPA results into five groups corresponding to the monoisotopic mass change (0, −1, −2, −3, −4 Da) and looked at their match to crystal structures. If a link originated from PSMs with different mass corrections, all of those were considered. 
We conducted a “nonparametric ANOVA” (Kruskal–Wallis test) to detect any significant changes in the distance distributions of Xi-MPA identifications with different shifts and decoy distribution. However, we failed to reject the null hypothesis at the predetermined significance level of α = 0.05 (p-value: 0.13), indicating that the distance distributions for all subsets are similar. This matches the visual inspection of distance distributions (Figure 3B). Furthermore, all individual distance distributions were significantly smaller than the derived reference distribution (one-sided Wilcoxon test, see SI Table S5). In conclusion, we do not see any evidence of in-search monoisotopic mass assignment leading to increased conflicts with crystal structures. We then evaluated the effect of in-search monoisotopic mass assignment on PSM quality as assessed by the search score. First, we compared the scores of PSMs with a mass shift (Xi-MPA identifications) to the scores of the same spectrum without a mass shift (uncorrected data). While scores with shifted mass have a median of 6.7, the median score is 2.3 when using the uncorrected masses (Figure 3C). As one would expect from an increased search space, the scores of decoy hits also improve, albeit only marginally. We find that the score difference of target PSMs is significantly larger than that of decoy PSMs (one-sided Wilcoxon test, p-value: <2.2 × 10^−16). We then turned to a “decoy mass search” for which we not only searched the range from 0 Da to −4 Da, but also +1 Da to +4 Da. Assuming the monoisotopic peak in the uncorrected data is rarely lighter than the true monoisotopic peak, the new identifications should score like decoy identifications. Indeed, the resulting score distributions for targets with a positive mass shift follow the decoy distribution (Figure 3D). In contrast, identifications with a negative shift are distributed like the identifications without mass shift. 
In conclusion, in-search monoisotopic mass change leads to significantly improved scores with a distribution that resembles that of precursors that did not see a mass change (0 Da). Importantly, these improvements are not random since an equally large search space increase (+1 Da to +4 Da) results in a completely different score distribution that resembles the decoy distribution but not the distribution of identifications without a mass shift.
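
The crystal-structure check used above, which classifies links by the Cα–Cα distance with a 30 Å limit for BS3, can be sketched as follows. The function names, the residue identifiers, and the coordinate dictionary are hypothetical; in practice the coordinates would be parsed from the PDB entries listed in the Methods.

```python
import math

MAX_BS3_DISTANCE = 30.0  # Å, limit used in the text for BS3


def ca_distance(a, b):
    """Euclidean distance between two C-alpha coordinates (x, y, z) in Å."""
    return math.dist(a, b)


def long_distance_fraction(links, ca_coords, max_distance=MAX_BS3_DISTANCE):
    """Fraction of links whose C-alpha distance exceeds max_distance.

    links: iterable of (residue_id_1, residue_id_2) pairs.
    ca_coords: dict mapping residue id -> (x, y, z) coordinate.
    Links with a residue missing from the structure are skipped.
    Returns None if no link could be mapped.
    """
    distances = [
        ca_distance(ca_coords[i], ca_coords[j])
        for i, j in links
        if i in ca_coords and j in ca_coords
    ]
    if not distances:
        return None
    return sum(d > max_distance for d in distances) / len(distances)


if __name__ == "__main__":
    # Toy coordinates, not taken from any real structure.
    coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 4.0, 0.0), 3: (0.0, 0.0, 40.0)}
    print(long_distance_fraction([(1, 2), (1, 3)], coords))
```

Comparing this fraction between search approaches, or between mass-shift groups, mirrors the comparison reported in the text. The significance tests themselves are not reproduced here.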

Figure 3

Matches with and without in-search mass shift show similar quality metrics. (A) Evaluation of Xi-MPA derived links on crystal structures (data set 2). Distances between α carbon atoms of identified cross-linked residues in the crystal structure of the proteins are shown in light gray while a reference distribution of all possible pairwise C-alpha distances of cross-linkable residues is shown in dark gray. Thirty Å is set as a limit, above which links are defined as long distance. (B) Distance distribution of identifications with different mass corrections. There was no significant difference between the different mass shifts, while all had a significant difference to the decoy distribution. (C) PSM scores of spectra identified with a mass shift are significantly higher than the corresponding score in uncorrected data. Shown are the score distributions of uncorrected and Xi-MPA results, as well as the corresponding decoy distribution. (D) Score distribution of PSM matches of the “decoy mass search”. Identifications with a positive mass shift generally follow the decoy distribution (note that there are correct identifications with a positive mass shift, albeit few, see Figure 1B) while identifications with a negative shift resemble unshifted identifications. The scores of negative-shifted PSMs are significantly higher than those of positive-shifted PSMs (one-sided Wilcoxon test, p-value: <2.2 × 10^−16).

Heavy and Low Intensity Peptides Are Corrected More Frequently

One would especially expect to observe shifted mass assignment for peptides of high mass and low abundance. For large peptides (approximately >2000 Da), the monoisotopic peak is no longer the most intense peak in the isotope cluster. If the peptide is also of low abundance, the monoisotopic peak may be too weak to be detected. We therefore analyzed the monoisotopic peak assignment in Xi-MPA with respect to precursor mass and intensity. Indeed, precursors with higher masses are more often corrected to lighter monoisotopic peaks (Figure A). While the median precursor mass for uncorrected matches is 2952 Da, it is already 4062 Da for matches corrected by −2 Da and 4684 Da for matches corrected by −4 Da. Of the identifications with a mass above 3000 Da, 88% were identified with a lighter mass. For precursors lighter than 3000 Da, the proportion was 42%. As with the mass dependency, there is a trend toward larger correction ranges for lower intensity peptides (SI Figure S5). However, this trend is weaker than the one for precursor mass.
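Why the monoisotopic peak fades with increasing mass can be illustrated with a rough back-of-the-envelope model. The sketch below is a simplification and not part of the original analysis: it assumes an averagine-like composition (4.9384 carbon atoms per 111.1254 Da), considers only ¹³C (natural abundance 1.07%), and approximates the number of ¹³C atoms per molecule by a Poisson distribution. The function names are hypothetical.

```python
import math

AVERAGINE_MASS = 111.1254    # Da; average mass of one averagine unit (assumption)
AVERAGINE_CARBONS = 4.9384   # carbon atoms per averagine unit (assumption)
P_13C = 0.0107               # natural abundance of 13C


def isotope_envelope(mass_da, n_peaks=8):
    """Relative abundances of the first isotope peaks (carbon-only Poisson model).

    Index 0 is the monoisotopic peak, index k the peak with k 13C atoms.
    """
    expected_13c = mass_da / AVERAGINE_MASS * AVERAGINE_CARBONS * P_13C
    return [math.exp(-expected_13c) * expected_13c ** k / math.factorial(k)
            for k in range(n_peaks)]


def most_intense_index(mass_da):
    """Index of the most intense isotope peak (0 = monoisotopic)."""
    envelope = isotope_envelope(mass_da)
    return envelope.index(max(envelope))


# Masses taken from the text: a small peptide, the Figure 4C precursor,
# and the median mass of matches corrected by -4 Da.
for mass in (1000.0, 3028.6, 4684.0):
    envelope = isotope_envelope(mass)
    print(f"{mass:7.1f} Da: most intense peak = M+{most_intense_index(mass)}, "
          f"monoisotopic fraction = {envelope[0]:.2f}")
```

Under these assumptions, the first isotope peak overtakes the monoisotopic peak at roughly 2100 Da, in line with the approximate 2000 Da threshold given above, and for the heaviest precursors the monoisotopic peak carries only a small fraction of the total signal, which makes it easy to lose in low-intensity isotope clusters.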

Figure 4

Correction is dependent on precursor mass and intensity. (A) Box plot of the precursor mass and monoisotopic mass correction of identified PSMs after Xi-MPA. PSMs with higher mass more often require monoisotopic mass correction to lighter masses. Whiskers show the 5% and 95% quantiles of the data. Asterisks denote the significance calculated by a one-tailed t test (****: p-value: <0.0001). (B) Precursors of cross-linked PSMs identified in all three approaches, MaxQuant-Xi, OpenMS-Xi, and Xi-MPA (“common”), are more intense than precursors of PSMs that are only identified in Xi-MPA. In other words, successful correction happens more often for abundant precursors, whereas Xi-MPA identifies precursors of lower intensity. (C) MS1 isotope cluster of a cross-linked peptide. The monoisotopic peak of m/z 758.16 (z = 4, m = 3028.6 Da) was falsely assigned during acquisition and not corrected in any preprocessing approach. Xi-MPA identifies a PSM for a precursor with a mass that is 3 Da lighter.

When evaluating the newly matched precursors of Xi-MPA, the advantage of not having to rely on MS1 identification is evident. Matches not made through any of the preprocessing methods are generally much less intense (Figure B) and larger (SI Figure S6) than matches that are common to all approaches. Manual analysis of isotope clusters of corrected precursors from data set 2 revealed many cases where the monoisotopic peak was present in the MS1 spectrum but was not recognized during acquisition. For some, this might be because the peak was of low intensity and discarded as noise, or because of other interfering peaks (Figure C). However, there are also cases where the cluster is well resolved (Figure C). Without details of how the instrument software determines the monoisotopic peak, a full evaluation is difficult. For a complete list of precursor m/z for Xi-MPA identifications and corresponding m/z of uncorrected, MaxQuant and OpenMS data, see SI Tables S1–S3.
Note that in many acquisition methods, the instrument only fragments peaks for which it can successfully identify a full isotope cluster. Therefore, there might be instances of cross-linked peptides not being fragmented because of insufficient isotope cluster quality, leading to lost identifications.

Conclusion

The size and low abundance of cross-linked peptides lead to frequent misassignment of the monoisotopic mass by instrument software, which in some instances even escapes the sophisticated correction approaches employed by MaxQuant and OpenMS. Considering multiple monoisotopic masses during search increases the number of cross-link PSMs 1.8–4.2-fold, without compromising search accuracy as judged by multiple assessment strategies, including comparison of the gains against solved protein structures. The problem of wrongly assigned monoisotopic peaks will have an impact on most cross-link search engines, since these all rely to some extent on the precursor mass. The extent of the misassignment will, however, be sample- and software-dependent. Even with improved acquisition or correction software, there will remain instances where the monoisotopic peak cannot be determined correctly before searching due to low intensity. Our search-assisted monoisotopic peak assignment provides a general solution to this problem by relying on MS2 identification in addition to precursor information.
