
Reducing the haystack to find the needle: improved protein identification after fast elimination of non-interpretable peptide MS/MS spectra and noise reduction.

Nedim Mujezinovic, Georg Schneider, Michael Wildpaner, Karl Mechtler, Frank Eisenhaber.

Abstract

BACKGROUND: Tandem mass spectrometry (MS/MS) has become a standard method for identification of proteins extracted from biological samples, but the huge number and the noise contamination of MS/MS spectra obstruct swift and reliable computer-aided interpretation. Typically, only a minor fraction of the spectra per sample (most often, only a few percent) and about 10% of the peaks per spectrum contribute to the final result, provided that noise does not prevent protein identification altogether.
RESULTS: Two fast preprocessing screens can substantially reduce the haystack of MS/MS data. (1) Simple sequence ladder rules remove spectra that cannot be interpreted as peptide sequences. (2) Modified Fourier-transform-based criteria clear background in the remaining data. On average, only a remainder of 35% of the MS/MS spectra (each reduced in size by about one quarter) has to be handed over to the interpretation software for reliable protein identification, essentially without loss of information, with a trend to improved sequence coverage and with a proportional decrease in computer resource consumption.
CONCLUSIONS: The search for sequence ladders in tandem MS/MS spectra with subsequent noise suppression is a promising strategy to reduce the number of MS/MS spectra from electro-spray instruments and to enhance the reliability of protein matches. Supplementary material and the software are available from an accompanying WWW-site with the URL http://mendel.bii.a-star.edu.sg/mass-spectrometry/MSCleaner-2.0/.

Year:  2010        PMID: 20158870      PMCID: PMC2822527          DOI: 10.1186/1471-2164-11-S1-S13

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Liquid chromatography (LC) coupled with tandem mass spectrometry (MS/MS) is the method of choice for the identification of proteins extracted from biological samples. The standard procedure of post-MS/MS data processing involves computer-aided interpretation of the measured spectra with MASCOT [1], SEQUEST [2] or some other software for comparing theoretical spectra calculated for database sequences with the experimental ones. But modern instruments generate extremely large sets of MS/MS spectra (in the order of 10000 per sample), which are heavily contaminated with different types of background and noise. In addition to b-, y- and their derivative ions from peptides, spectra contain repeated shifted signals due to the natural isotope distribution (isotope clusters), multiply charged replicas, peaks from unknown fragmentation pathways, sample-specific or systematic chemical contaminations and random noise from the electronic detection system. Thus, the spectra consist mostly of background; typically, only a few percent of the spectra recorded have signals from target protein fragments and just about 10% of the peaks in such a spectrum contribute to the peptide identification. Thus, computer resources in mass spectrometry departments all over the world are mostly spent on analyzing non-relevant data if the identification of the protein with significance is possible within the background at all. This strategy clashes with limitations in compute server capacity in proteomics laboratories and seriously limits the access of less generously equipped teams to the field. With the broad availability of accurate MS/MS instruments with resolution in the order of tenths of a Dalton, automatic background removal procedures before interpretation software application became possible [3-5]. Various spectrum pre-processing rules, deconvolution of multiply charged peaks and deisotoping procedures have been described [6-15]. 
It should be noted that many spectra do not contain peaks from peptide fragmentations or are extremely noisy and, therefore, cannot be reliably interpreted as peptide sequences. Thus, the exclusion of non-interpretable spectra is a valid strategy for reducing the computational load. A well-performing method should remove clearly more than half, or even three quarters, of the experimental MS/MS spectra and, essentially, keep all interpretable ones. At the same time, the computation time for this task should be negligible or, at least, small compared to the processing time of an interpretation program such as MASCOT that is saved by unselecting a large subset of spectra. Published approaches to this problem differ in the criterion for spectrum selection, either with empirically defined score functions or with a classifier generated by automated learning approaches [16-23]. Although many of these methods apply quite sophisticated criteria, they either are not efficient filters or suffer from a substantial fraction of unselected but nevertheless interpretable MS/MS spectra (e.g., loss of ~10% of the interpretable spectra for removing ~75% of the total number of spectra in Figures 2 and 3 of Bern et al. [18]). Thus, substantial computational load reduction is traded for the risk of not finding the desired peptide hit. Consequently, none of the published techniques has routinely entered the laboratories so far. In an attempt to develop an alternative methodical approach, we propose to return to ideas from the beginning of mass spectrometry of proteins. Originally, interpretation of an MS/MS spectrum meant experts trying to manually find sequence ladders (i.e., sets of peaks with amino acid mass spacing between them) among the high-intensity peaks. The concept of searching mainly among the higher intensity peaks is still reflected in the formulas for evaluating the significance of a peptide hit as used in MASCOT [1].
Indeed, a peptide whose theoretical fragmentation spectrum matches exclusively low intensity peaks cannot serve as a convincing explanation of the experimental data. In this work, we explore the idea that at least some short oligopeptide segment of a significant peptide hit should be fully matched by the higher intensity peaks in the spectrum. In an efficient implementation, the computational costs are low if one just checks whether small peptide ladders of predefined length occur at all in an MS/MS spectrum among the top fraction of most intense peaks. The identity of the oligopeptide is not important in this context; the question is rather whether such an amino acid chain theoretically exists at all. It is reasonable to suggest that the spectrum is probably not interpretable into a peptide sequence with statistical significance if not even a short oligopeptide sequence is matched by this criterion. After this unselecting procedure, the remaining spectra still contain considerable background in the typical case. In a previous publication [24], we developed an approach based on techniques from electrical signal processing. Periodical band-reject and high-frequency filters as well as correlation analyses with etalons of multiply charged clusters can successfully be used for background suppression. In this work, we describe a workflow involving sequence ladder and improved signal processing criteria on a large MS/MS dataset, exemplified in MS Cleaner version 2.0, that efficiently reduces the number and the size of spectra and, subsequently, dramatically shrinks the computing time used by the interpretation software. To emphasize, the approach described in this work is intended to increase the efficiency of protein identification. It is not designed to process MS/MS data that is intended to be screened for protein posttranslational modifications.

Methods

Mass spectrometry

Commercially acquired proteins (α-amylase, amyloglucosidase, apo-transferrin, β-galactosidase, carbonic anhydrase, catalase, phosphorylase B, glutamic dehydrogenase, glutathione transferase, immunoglobulin γ, lactic dehydrogenase, lactoperoxidase, myoglobin) were used, each in two independent preparations (each with a concentration of 100 fmol). For chromatography, an UltiMate Plus Nano-LC system (LC Packings - A Dionex Co.) was used. Chromatographic mobile phases were: loading mobile phase 0.1% TFA in water, separation mobile phase A 5% acetonitrile in 0.1% aqueous formic acid and mobile phase B 80% acetonitrile, 20% water with 0.08% formic acid. The sample was loaded for 10 min onto a reversed phase trap column (PepMap C18, 300 μm ID × 5 mm length, 5 μm particle size, 100 Å pore size, LC Packings - A Dionex Co., not online with the separation column) at a flow rate of 20 μl/min and washed free of ion pairing agents and other impurities. The gradient for separation of analytes starts at 10 min when the trap column is switched online with the separation column (PepMap C18, 75 μm ID × 15 cm length, 3 μm particle size, 100 Å pore size) at 0.275 μl/min. The gradient used starts at 100% mobile phase A and changes to 50% mobile phase B from 10 minutes (trap column and separation column online) to 40 minutes. An additional wash step of 90% mobile phase B is incorporated in order to clean the separation column and elute hydrophobic analytes. After the separation, the trap column is switched offline and equilibrated with loading mobile phase. The analytical nano column is equilibrated with separation mobile phase A. The mass spectrometric data are only recorded for the time both columns are online. The mass spectra were recorded with a Thermo Finnigan LTQ (positive nano-ESI mode, ionizing spray voltage: 1.5 kV, enhanced mass-spec full-scan range: 220 - 2000 amu).
The much smaller datasets for bovine serum albumin (BSA), yeast alcohol dehydrogenase (ADH) and human transferrin (TRF) recorded with a 3D IT mass spectrometer (model DecaXP Thermo Finnigan) were reused from our previous work [24].

File processing and MS/MS data analysis

The MS/MS output was converted into mgf-files (MASCOT generic format). Each dataset was then separately processed using the MS Cleaner program (with default internal parameters), generating two new mgf-files with cleaned and bad (non-interpretable) spectra, respectively. The MASCOT search parameters were the same in all runs (enzyme: trypsin; fixed modifications: carbamidomethyl (at cysteines) for BSA, ADH and TRF, carboxymethyl (at cysteines) for other proteins; variable modifications: oxidation (at methionines); peptide charges: 1+, 2+ and 3+; mass values: monoisotopic; protein mass: unrestricted; peptide mass tolerance: ± 2 Da; fragment mass tolerance: ± 0.8 Da; max. missed cleavages: 1). The MASCOT search results output html-file was formatted with standard scoring, a significance threshold of p < 0.05, and an ion score cut-off for each peptide of 30. The non-redundant protein database (NCBI) was used (both for the local PC MASCOT installation and for the MASCOT Linux cluster). In this work, we compare the MASCOT interpretation results of non-pre-processed tandem MS datasets with those obtained in a two-step preprocessing. First, each spectrum (.dta-file) is analyzed with the sequence ladder algorithm. Only those spectra that pass this test are then processed with the background removal routines described in our previous publication [24].
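The file-level step described above (one input mgf-file, two output mgf-files) can be illustrated with a minimal Python sketch. It is not the MS Cleaner code: the function names `read_mgf`, `format_spectrum` and `split_spectra` are hypothetical, only the basic `BEGIN IONS` / `END IONS` block layout of the MASCOT generic format is handled, and the predicate `is_interpretable` merely stands in for the sequence ladder test. No background removal is attempted here.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def read_mgf(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[Peak]]]:
    """Yield (header_lines, peaks) for every BEGIN IONS ... END IONS block."""
    headers: List[str] = []
    peaks: List[Peak] = []
    inside = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            headers, peaks, inside = [], [], True
        elif line == "END IONS":
            if inside:
                yield headers, peaks
            inside = False
        elif inside and line:
            if "=" in line and not line[0].isdigit():
                headers.append(line)  # KEY=value lines such as TITLE=, PEPMASS=
            else:
                parts = line.split()  # peak line: m/z, intensity, optional charge
                peaks.append((float(parts[0]), float(parts[1])))


def format_spectrum(headers: List[str], peaks: List[Peak]) -> str:
    """Serialize one spectrum back into an MGF block."""
    body = [f"{mz} {intensity}" for mz, intensity in peaks]
    return "\n".join(["BEGIN IONS", *headers, *body, "END IONS"]) + "\n"


def split_spectra(lines: Iterable[str],
                  is_interpretable: Callable[[List[Peak]], bool]) -> Tuple[str, str]:
    """Return (cleaned_mgf_text, bad_mgf_text) according to the predicate."""
    cleaned: List[str] = []
    bad: List[str] = []
    for headers, peaks in read_mgf(lines):
        target = cleaned if is_interpretable(peaks) else bad
        target.append(format_spectrum(headers, peaks))
    return "".join(cleaned), "".join(bad)
```

In practice the two returned strings would be written to separate files, and only the "cleaned" one would be passed on to the interpretation software.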

The sequence ladder algorithm

For this algorithm, two parameters are critical: n (in amino acid residues), the minimal length of the sequence ladder, and s (in per cent), the fraction of peaks from the spectrum that is considered of high intensity. The number n can theoretically be just one (i.e., we would require just two high intensity peaks that are spaced by the mass difference corresponding to the mass of one of the amino acids); yet, larger values of n (for example, between two and six residues) represent stricter requirements to the sequence ladder. The other parameter, s, restricts the search space. For this purpose, the peaks in the spectrum considered (i.e., in one .dta-file) are sorted by intensity into a list with descending order. Only the first part of this list (the fraction s of the total set) is used for searching sequence ladders. The condition s = 100% implies that all peaks are included; yet, considerably smaller values of s are desirable since they would help unselecting more non-interpretable spectra. Once the set of high-intensity peaks is defined, their pair-wise mass differences are compared in a systematic enumeration with the masses of amino acid residues (to select pairs of peaks separated by the mass of any of the amino acids within a user-defined accuracy) and it is tested whether a subset of peaks forms a sequence ladder of the required minimal length. If at least one such ladder is found, the search is stopped and the procedure is restarted with the next tandem MS spectrum in the dataset.
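The test described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the MS Cleaner implementation: the function names, the default tolerance of 0.5 Da and the rounding of the peak fraction (ceiling) are hypothetical choices; mass differences are compared directly in m/z, which assumes singly charged fragment ions; and the residue table holds unmodified monoisotopic masses only (a fixed cysteine modification would need its own entry).

```python
import math
from typing import Sequence, Tuple

# Monoisotopic residue masses (Da) of the 20 standard amino acids;
# Leu and Ile share one value.
RESIDUE_MASSES = (
    57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768,
    103.00919, 113.08406, 114.04293, 115.02694, 128.05858, 128.09496,
    129.04259, 131.04049, 137.05891, 147.06841, 156.10111, 163.06333,
    186.07931,
)


def is_residue_gap(delta: float, tol: float) -> bool:
    """True if delta equals some residue mass within +/- tol."""
    return any(abs(delta - mass) <= tol for mass in RESIDUE_MASSES)


def has_sequence_ladder(peaks: Sequence[Tuple[float, float]],
                        n: int = 4, s: float = 20.0, tol: float = 0.5) -> bool:
    """Check for a ladder of at least n residues among the top s% peaks.

    peaks: (m/z, intensity) pairs of one spectrum.
    """
    if not peaks or n < 1:
        return False
    keep = max(1, math.ceil(len(peaks) * s / 100.0))
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:keep]
    mzs = sorted(mz for mz, _ in top)
    # longest[i]: number of residues in the longest ladder ending at peak i
    longest = [0] * len(mzs)
    for i, mz_i in enumerate(mzs):
        for j in range(i):
            if is_residue_gap(mz_i - mzs[j], tol):
                longest[i] = max(longest[i], longest[j] + 1)
        if longest[i] >= n:
            return True  # stop at the first ladder found
    return False
```

A ladder of n residues requires n + 1 peaks, so n = 1 corresponds to the two-peak case mentioned in the text. The dynamic-programming pass over peaks sorted by m/z is quadratic in the number of retained peaks, which stays small when s is well below 100%.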

Modifications of the noise detection algorithm

If a spectrum has passed the sequence ladder test, it is handed over to a series of routines for noise and background detection. The procedures for removing multiply charged peak clusters with the etalon method and for the suppression of high-frequency noise with a low-pass filter after Fourier transformation have been described in detail in a previous publication [24] and have been applied without changes here. The algorithm for the removal of latent periodic background (including deisotoping) received another option with respect to the determination of the base frequency of the noise. We observed that the determination of the base frequency f in the first power spectrum (see sections 3.3 and 3.5 in ref. [24]) is, in rare cases, not as unambiguous as in Figure 2A of ref. [24] since several almost equally intense peaks may appear in the second-level Fourier transform. Wrong detection of the base frequency f leads to the creation of a wrong multi-band rejection filter, and a few interpretable spectra can be lost after applying this technique. This ambiguity can be avoided by not simply choosing the frequency of the most intense peak in the second-level Fourier transform. Rather, we propose to iterate through all possible base frequencies detected in this spectrum. For each of these frequencies, the theoretical maxima and minima expected in the first-level Fourier transform are calculated. The best match between the theoretical and experimental maxima and minima (see Figure 3 in ref. [24]) confirms the right base frequency. We call this method "soft recognition" of latent periodic noise; it should be applied if minor improvements in sequence coverage (in rare cases, a single additional peptide) are more important than data size reduction. It increases the computation time by about 10% compared with the previous method [24].
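The selection loop behind "soft recognition" can be sketched as follows. This is only a toy illustration: the actual calculation of the theoretical maxima and minima is given in ref. [24] and is not reproduced here. The function names are hypothetical, the extrema predictor is passed in as a parameter, and the example predictor `harmonic_comb` simply assumes maxima at integer multiples of the candidate frequency (the usual harmonic pattern of a periodic signal), which is an assumption of this sketch rather than the authors' formula.

```python
from typing import Callable, List, Sequence


def mismatch(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Symmetric mean nearest-neighbour distance between two position sets."""
    if not predicted or not observed:
        return float("inf")

    def one_way(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(min(abs(x - y) for y in b) for x in a) / len(a)

    return one_way(predicted, observed) + one_way(observed, predicted)


def select_base_frequency(candidates: Sequence[float],
                          observed_extrema: Sequence[float],
                          predict_extrema: Callable[[float], List[float]]) -> float:
    """Iterate over all candidates and keep the one whose predicted
    extrema agree best with the observed ones."""
    return min(candidates,
               key=lambda f: mismatch(predict_extrema(f), observed_extrema))


def harmonic_comb(f: float, f_max: float) -> List[float]:
    """Toy predictor (assumption): maxima at k * f for k = 1, 2, ... up to f_max."""
    return [k * f for k in range(1, int(f_max / f) + 1)]
```

Scoring in both directions matters: a candidate at twice the true frequency predicts only positions that do exist, and a candidate at half the true frequency covers every observed position, so a one-sided score would not separate them from the correct value.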

Standalone implementation and cluster version

We created two implementations of MS Cleaner 2.0. A single-machine Windows version was used for most of the computations in this article and is available for free download at the associated WWW site. A Unix port of the MS Cleaner 2.0 software is deployed in a clustered environment in order to guarantee scalability. The spectrum file is partitioned into work packages, which are then handed over to a batch queuing system for scheduling on available nodes. Each node processes the spectra in its work package and transfers the results back to the controlling application, where they are post-processed into the final good/bad spectra output. This version is the engine behind the MS Cleaner 2.0 WWW server.
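The partition, process and merge pattern of the cluster version can be mimicked on a single machine. In the following sketch, which is not the deployed code, a thread pool stands in for the batch queuing system, and the names `partition`, `process_package` and `run_workpackages` as well as the default package size are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def partition(items: Sequence[T], package_size: int) -> List[List[T]]:
    """Cut the spectrum list into consecutive work packages."""
    if package_size < 1:
        raise ValueError("package_size must be positive")
    return [list(items[i:i + package_size])
            for i in range(0, len(items), package_size)]


def process_package(package: List[T],
                    is_good: Callable[[T], bool]) -> Tuple[List[T], List[T]]:
    """Work done on one node: classify every spectrum of the package."""
    good = [spectrum for spectrum in package if is_good(spectrum)]
    bad = [spectrum for spectrum in package if not is_good(spectrum)]
    return good, bad


def run_workpackages(spectra: Sequence[T], is_good: Callable[[T], bool],
                     package_size: int = 1000,
                     workers: int = 4) -> Tuple[List[T], List[T]]:
    """Controlling application: schedule packages, then merge the results."""
    packages = partition(spectra, package_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in submission order, so the merged output
        # keeps the original spectrum order.
        results = list(pool.map(lambda p: process_package(p, is_good), packages))
    good = [s for g, _ in results for s in g]
    bad = [s for _, b in results for s in b]
    return good, bad
```

Because each spectrum is classified independently of all others, the packages need no communication with each other, which is what makes the workload easy to distribute.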

WWW Supplement

At the WWW-site http://mendel.bii.a-star.edu.sg/mass-spectrometry/MSCleaner-2.0/, supplementary resources are available: all experimental mass-spectrometry data used in this work, the processed spectra, the user manual, default parameter datasets and a freely downloadable Windows version of the program MSCleaner 2.0 as well as free access to an MSCleaner 2.0 WWW server accessing a local Linux cluster. Other implementations can be obtained on request.

Results and discussion

For the initial determination of optimal parameter ranges (sequence ladder length n and peak intensity threshold s), we used the datasets for bovine serum albumin (BSA), yeast alcohol dehydrogenase (ADH) and human transferrin (TRF) from our previous work [24] since they are quite small (less than 3000 .dta-files per set). We checked the influence of the preprocessing procedures on the spectrum interpretation with the MASCOT tool. A systematic analysis was performed; sequence ladder length n was tested with values between 2 and 6 and the high-intensity threshold s was varied from 5% to 35% (the sequence ladder was searched for only among the 5%, 10%, 15%, ..., or 35% of most intense peaks). The goal is to have as many unselected "bad" spectra as possible (the savings in computing time are about proportional to the fraction of spectra that is not handed over to the spectrum interpretation program) without losses of (i) MASCOT score, (ii) spectra giving peptide matches and (iii) sequence coverage. Due to space limitations, only the results of a parameter subset are presented (Table 1). As expected, the number of detected bad spectra increases with growing sequence ladder length n and decreasing intensity threshold s. We observe that the MASCOT score of the non-preprocessed data (586 for BSA, 224 for ADH and 588 for TRF; see rows with n = 0 and s = 0%) is considerably smaller than that of the cleaned datasets (often, by a factor of 2-5) regardless of the severity of data pre-processing. Thus, the reliability of the top protein hit in the database searches greatly increases with background reduction, both by discarding bad spectra and by removing noise from spectra that can be interpreted as peptides. This alone is an interesting result.
Table 1

Influence of background removal on the recovery of BSA, ADH and TRF in MS/MS spectra of 100 fmol test samples. The original numbers of MS/MS spectra for the BSA (bovine serum albumin), ADH (yeast alcohol dehydrogenase) and TRF (human transferrin) datasets (recorded on a DecaXP machine) are 2679, 2325 and 2608, respectively. The intensity threshold s (column 3) describes the search of the sequence ladder (length n in column 2) within the 15%, 20%, 25% or 30% top peaks (100%: all peaks are considered). The following three columns show the MS Cleaner output: number of spectra with background removal, number of unselected spectra and the MS Cleaner CPU time on a single-processor Windows XP computer (Pentium IV 2.4 GHz; to get exact measurements of computation time, we did not use the cluster version). The remaining four columns present the MASCOT output: the CPU time on the same machine, the protein score, the number of spectra matching peptides in a MASCOT search and the final sequence coverage. For each dataset, the first line shows the results for the case when MS Cleaner is not used for pre-processing and the MS/MS data is immediately interpreted by MASCOT.

protein  n  s [%]  cleaned spectra  bad spectra  MS Cleaner time [min]  MASCOT time [min]  MASCOT score  queries matched  sequence coverage [%]
BSA      0  100    -                -            -                      61                 586           89               55

BSA      3  100    1664             1015         3.92                   44                 720           91               57
BSA      3  15     390              2289         1.21                   17                 1991          84               52
BSA      3  20     490              2189         1.40                   21                 2108          87               57
BSA      3  25     601              2078         1.61                   26                 2114          89               57
BSA      3  30     688              1991         1.75                   29                 2114          90               57

BSA      4  100    940              1739         3.80                   36                 2108          91               57
BSA      4  15     260              2419         0.91                   12                 1875          78               47
BSA      4  20     321              2358         1.06                   14                 1911          80               47
BSA      4  25     380              2299         1.25                   18                 2114          86               57
BSA      4  30     441              2238         1.30                   19                 2114          89               57

BSA      5  100    593              2086         3.82                   26                 2108          91               57
BSA      5  15     174              2505         0.60                   9                  1579          60               41
BSA      5  20     232              2447         0.85                   11                 1809          72               44
BSA      5  25     281              2398         1.00                   13                 1963          81               49
BSA      5  30     313              2366         0.85                   14                 2058          86               54

ADH      0  100    -                -            -                      64                 242           39               39

ADH      3  100    1446             879          4.15                   45                 327           34               39
ADH      3  15     269              2056         0.88                   12                 673           29               35
ADH      3  20     347              1978         1.10                   13                 696           31               37
ADH      3  25     440              1885         1.33                   17                 697           32               37
ADH      3  30     697              1628         1.53                   20                 697           33               37

ADH      4  100    902              1423         4.15                   35                 733           34               39
ADH      4  15     173              2152         0.58                   7                  562           26               28
ADH      4  20     216              2109         0.71                   9                  673           30               35
ADH      4  25     271              2054         0.90                   12                 607           28               33
ADH      4  30     325              2000         1.05                   13                 697           32               37

ADH      5  100    594              1731         4.20                   23                 712           33               39
ADH      5  15     94               2231         0.35                   5                  311           15               21
ADH      5  20     125              2200         0.46                   6                  366           17               25
ADH      5  25     145              2180         0.53                   7                  434           19               26
ADH      5  30     186              2139         0.66                   9                  589           24               31

TRF      0  100    -                -            -                      52                 588           86               47

TRF      3  100    1587             1021         3.57                   42                 768           87               49
TRF      3  15     373              2235         1.00                   17                 1988          86               49
TRF      3  20     485              2123         1.23                   20                 1988          86               49
TRF      3  25     568              2040         1.36                   24                 1998          87               49
TRF      3  30     639              1969         0.78                   27                 1998          87               49

TRF      4  100    864              1744         3.62                   34                 1973          87               49
TRF      4  15     231              2377         0.70                   11                 1987          81               49
TRF      4  20     298              2310         0.86                   13                 1988          84               49
TRF      4  25     360              2248         1.00                   16                 1988          85               49
TRF      4  30     414              2194         1.12                   19                 1998          87               49

TRF      5  100    540              2068         3.63                   23                 1973          87               49
TRF      5  15     164              2444         0.55                   9                  1785          68               45
TRF      5  20     194              2414         0.61                   10                 1890          74               47
TRF      5  25     245              2363         0.75                   12                 1957          80               48
TRF      5  30     286              2322         0.86                   14                 1968          84               48
The sequence coverage is more sensitive to the pre-processing parameters. For a sequence ladder length of n = 5 residues, we see a trend that sequence coverage is slightly decreased with respect to that of unprocessed data (41-54% instead of 55% for BSA, 21-31% instead of 39% for ADH, 45-48% instead of 47% for TRF). Sequence coverage is about the same as or even slightly higher than for non-preprocessed data for sequence ladder lengths n = 3 and n = 4 and intensity thresholds s at and above 20%. With regard to the number of spectra that lead to a significant peptide match in the MASCOT search, the settings n = 3, s = 20%; n = 3, s = 25%; n = 4, s = 20% and n = 4, s = 25% come close to reproducing the result achieved with the unprocessed data for the BSA and TRF cases.
Surprisingly, the number of peptide matches is slightly higher for s = 100% (all peaks are included in the sequence ladder search) than for the datasets without preprocessing. Thus, the number of spectra falsely rejected by the sequence-ladder algorithm is essentially zero in these two cases. For ADH, the number of spectra matching peptides is always somewhat lower if the tandem MS/MS data is pre-processed, although MASCOT score and sequence coverage do not suffer from choices of n = 3 or n = 4 and the higher values of s. To detect a considerable fraction of the bad spectra and to reduce the time for interpretation by MASCOT, these results support the selection of a sequence ladder length of n = 4 and an intensity threshold of s = 20%. If the sequence coverage is more important than computational time savings, softer parameters can be chosen, for example with an intensity threshold of s = 25%. With these parameters, it is possible to eliminate more than 80% of all spectra in the datasets BSA, ADH and TRF by declaring them non-interpretable as oligopeptides (see Table 1). Minor sequence coverage loss, if observed at all, does not affect the interpretation result. Yet, the total computing time required for interpretation shrinks to only about 20% of the original value. The computing time consumption for MS Cleaner alone in such a setting is ~2% of MASCOT time for non-preprocessed data (see Table 1); i.e., it is essentially negligible. For further analysis of the algorithm's performance, large MS/MS datasets are necessary that are recorded from samples with known protein composition. For this purpose, we used solutions of commercially available proteins at 100 fmol concentration. The behavior of the MS Cleaner algorithms was tested over this large dataset of about 270000 spectra from 26 samples of 13 proteins (Table 2) generated by an LTQ device.
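The percentages quoted above for n = 4 and s = 20% can be re-derived from the corresponding rows of Table 1. The short Python sketch below does this arithmetic; the dictionary and function names are hypothetical, and the numbers are copied from Table 1 as printed (total spectra, baseline MASCOT time, and for the n = 4, s = 20% rows the number of bad spectra, the MS Cleaner time and the MASCOT time).

```python
# Values copied from Table 1.
TOTAL_SPECTRA = {"BSA": 2679, "ADH": 2325, "TRF": 2608}
BASELINE_MASCOT_MIN = {"BSA": 61, "ADH": 64, "TRF": 52}
# n = 4, s = 20%: (bad spectra, MS Cleaner time [min], MASCOT time [min])
N4_S20 = {
    "BSA": (2358, 1.06, 14),
    "ADH": (2109, 0.71, 9),
    "TRF": (2310, 0.86, 13),
}


def summarize(protein: str) -> dict:
    """Express the n = 4, s = 20% row relative to the unprocessed run."""
    bad, cleaner_min, mascot_min = N4_S20[protein]
    baseline = BASELINE_MASCOT_MIN[protein]
    return {
        # share of spectra declared non-interpretable
        "removed_pct": 100.0 * bad / TOTAL_SPECTRA[protein],
        # MASCOT time on cleaned data relative to unprocessed data
        "mascot_time_pct": 100.0 * mascot_min / baseline,
        # MS Cleaner time relative to MASCOT time on unprocessed data
        "cleaner_overhead_pct": 100.0 * cleaner_min / baseline,
    }


if __name__ == "__main__":
    for name in TOTAL_SPECTRA:
        row = summarize(name)
        print(name, {key: round(value, 1) for key, value in row.items()})
```

For all three proteins the removed share comes out above 80%, the remaining MASCOT time lies between roughly 14% and 25% of the baseline, and the MS Cleaner overhead stays below 2% of the baseline MASCOT time.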
We used sequence ladder length n = 4 with intensity thresholds s = 20% and s = 25% and contrasted the results both (i) with the MASCOT-based interpretation of non-preprocessed data and (ii) with sequence ladder length 4 and the inclusion of all peaks (s = 100% threshold). We find that, as a rule, preprocessing reproduces or slightly improves the sequence coverage relative to the non-preprocessed data (100-110% for threshold s = 100% (columns A4 and A7), 100-108% for thresholds s = 20% and s = 25% (columns A4, A11 and A15)). Thus, the number of spectra falsely rejected by the sequence-ladder algorithm is essentially zero in these examples. This clear trend indicates that the preprocessing algorithm proposed here performs even better if it is supplied with more accurate data from the LTQ instrument as compared with those from the DecaXP. There is a trend for increased MASCOT scores (103-140% for threshold 100% (columns A3 and A6), 98-140% for s = 20% (A3 and A10) and 103-140% for s = 25% (A3 and A14)) with an average of 110% regardless of threshold. The reduction of the dataset by unselecting spectra is significant (on average, 11% for threshold 100% (column A5), 63% for threshold s = 20% (column A9) and 53% for threshold s = 25% (column A13)). This means that the interpretation time with MASCOT is reduced in a similar proportion.
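The average reductions quoted above can be checked against the "bad spectra" fractions of Table 2. In the sketch below the three lists hold the 26 per-dataset percentages of the columns labelled A5, A9 and A13 in the legend of Table 2, copied in row order as printed; the variable names are hypothetical.

```python
from statistics import mean

# Fraction of "bad" spectra in % (n = 4), one value per dataset of Table 2.
A5_S100 = [11.30, 9.82, 13.26, 13.31, 11.72, 12.13, 7.17, 8.12, 12.35,
           13.40, 11.13, 11.78, 10.30, 10.52, 11.29, 11.81, 10.36, 9.18,
           9.27, 8.61, 10.36, 9.07, 13.16, 9.09, 11.67, 12.15]
A9_S20 = [60.07, 50.22, 79.24, 72.62, 63.10, 53.12, 48.06, 42.90, 90.31,
          86.07, 67.26, 59.50, 63.49, 53.96, 79.55, 72.51, 71.64, 61.15,
          42.30, 37.06, 53.20, 40.16, 62.12, 51.70, 85.42, 80.83]
A13_S25 = [51.09, 20.25, 73.58, 63.95, 54.49, 44.32, 40.53, 33.10, 84.94,
           78.44, 57.89, 48.55, 54.46, 44.31, 73.42, 62.25, 62.78, 49.59,
           34.44, 28.47, 44.86, 31.67, 52.37, 41.76, 79.25, 70.92]

if __name__ == "__main__":
    for label, column in (("s = 100%", A5_S100),
                          ("s = 20%", A9_S20),
                          ("s = 25%", A13_S25)):
        removed = mean(column)
        print(f"{label}: removed {removed:.1f}%, remaining {100 - removed:.1f}%")
```

The column means round to 11%, 63% and 53%, in line with the averages given in the text; the complement for s = 20% is the share of spectra that still has to be passed to MASCOT.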
Table 2

Performance of the MSCleaner version 2.0 over a large test set.

A1             A2     A3     A4     A5     A6     A7     A8     A9     A10    A11    A12    A13    A14    A15    A16
alphaAmyl_col1 10108  633    24     11.30  667    24     31.65  60.07  667    24     15.13  51.09  667    24     18.07
alphaAmyl_col2 10184  698    35     9.82   780    35     34.20  50.22  780    35     19.05  20.25  780    35     22.76
AmylGlu_col1   10030  736    28     13.26  761    28     28.40  79.24  761    28     8.66   73.58  761    28     10.63
AmylGlu_col2   9870   801    36     13.31  860    37     29.50  72.62  860    37     11.70  63.95  860    37     14.29
apo_col1       10032  2606   63     11.72  2814   63     30.76  63.10  2814   63     13.93  54.49  2814   63     16.78
apo_col2       10090  2571   60     12.13  2761   60     32.95  53.12  2761   60     17.53  44.32  2761   60     21.03
betaGal_col1   10324  1459   56     7.17   1567   57     34.98  48.06  1567   57     22.05  40.53  1567   57     24.60
betaGal_col2   10368  1309   51     8.12   1508   56     36.71  42.90  1454   55     24.76  33.10  1454   55     28.61
CarAnly_col1   9946   586    49     12.35  616    49     26.35  90.31  573    49     3.65   84.94  607    49     5.48
CarAnly_col2   9534   582    52     13.40  616    52     26.27  86.07  616    52     5.08   78.44  616    52     7.66
Cat_col1       10098  1798   61     11.13  1886   61     30.88  67.26  1879   61     13.13  57.89  1879   61     16.50
Cat_col2       10034  1567   65     11.78  1693   65     31.90  59.50  1693   65     15.91  48.55  1693   65     19.56
phosB_col1     10118  2780   59     10.30  3079   61     35.13  63.49  3014   60     14.26  54.46  3047   61     17.25
phosB_col2     10096  2655   61     10.52  3116   65     32.58  53.96  3084   65     17.58  44.31  3116   65     21.16
GluDey_col1    10006  892    36     11.29  986    36     27.30  79.55  986    36     7.75   73.42  986    36     9.71
GluDey_col2    9886   850    34     11.81  962    34     28.73  72.51  962    34     10.13  62.25  962    34     13.51
GluTra_col1    10022  351    25     10.36  389    25     28.61  71.64  348    25     10.25  62.78  389    25     14.30
GluTra_col2    10156  341    33     9.18   384    33     31.31  61.15  384    33     14.25  49.59  384    33     28.11
Immo_col1      10330  506    35     9.27   565    35     36.20  42.30  565    35     24.95  34.44  565    35     27.66
Immo_col2      10334  356    66     8.61   500    66     38.05  37.06  500    66     27.31  28.47  500    66     30.31
LacDe_col1     10286  1549   58     10.36  1694   58     35.36  53.20  1694   58     20.03  44.86  1694   58     23.15
LacDe_col2     10250  1346   54     9.07   1483   54     36.48  40.16  1483   54     25.60  31.67  1483   54     28.31
LactoPee_col1  10242  1613   45     13.16  1764   45     34.78  62.12  1756   45     15.91  52.37  1764   45     19.53
LactoPee_col2  10402  1679   43     9.09   1890   44     35.18  51.70  1890   44     20.31  41.76  1890   44     23.85
Myo_col1       9958   561    66     11.67  594    66     27.26  85.42  594    66     5.46   79.25  594    66     7.45
Myo_col2       9744   530    66     12.15  584    66     28.01  80.83  584    66     6.95   70.92  584    66     10.35

A1: name of test set (.mgf file; see Methods)
A2: total number of spectra (.dta files)
A3: MASCOT score of the top protein hit with the original .mgf file (without application of MS Cleaner)
A4: sequence coverage (in %) without application of MS Cleaner
A5: fraction of non-interpretable "bad" spectra found with sequence ladder length n = 4 among all peaks (intensity threshold s = 100%)
A6: MASCOT score of the top protein hit for this search
A7: sequence coverage (in % of the whole protein length) for this search
A8: MS Cleaner processing time (in min) on a PC with a single Pentium IV (to achieve exact time consumption values, we did not use the cluster version and stopped the "soft frequency recognition option")
A9: fraction of non-interpretable "bad" spectra found with sequence ladder length n = 4 among the s = 20% most intense peaks
A10: MASCOT score of the top protein hit for this search
A11: sequence coverage (in % of the whole protein length) for this search
A12: MS Cleaner processing time (in min)
A13: fraction of non-interpretable "bad" spectra found with sequence ladder length n = 4 among the s = 25% most intense peaks (in % of A2; i.e., of all spectra)
A14: MASCOT score of the top protein hit for this search
A15: sequence coverage (in % of the whole protein length) for this MASCOT search
A16: MS Cleaner processing time on the same machine as described in the legend of Table 1 (in min)

The sequence ladder criterion (minimal ladder length 4 with varying peak intensity thresholds) and the noise suppression algorithms of MS Cleaner 2.0 have been applied over a large set of tandem MS results. For each of the test proteins (α-amylase, amyloglucosidase, apo-transferrin, β-galactosidase, carbonic anhydrase, catalase, phosphorylase B, glutamic dehydrogenase, glutathione transferase, immunoglobulin γ, lactic dehydrogenase, lactoperoxidase, myoglobin), two independent sample preparations and dataset recordings (marked with the suffixes _col1 and _col2 in the dataset name) were carried out. For these datasets, the MASCOT interpretation was carried out on a cluster in parallel with other jobs; therefore, no computation time is provided.

To summarize, the results support that testing spectra for interpretability as oligopeptides is a useful criterion for dataset reduction in protein mass spectrometry if a sequence ladder of a tetrapeptide segment is searched for among the 20% (or 25%) most intense peaks. This preprocessing is accompanied by an increase in MASCOT score and more significant top protein hits, and it does not significantly affect sequence coverage. Running MS Cleaner 2.0 as a standard preprocessing step in peptide tandem MS data analysis for protein identification is recommended. The idea of using short series of sequence ions (peptide sequence tags) as a specific identifier that speeds up searches for matches between spectra and sequences in databases (either by searching the database with the tag or by creating sequence tag database filters in order to reduce the size of a database via a preprocessing step) is extensively explored in the literature [25-27]. It is interesting to see that this simple idea, applied to the problem of recognizing spectra that are non-interpretable as oligopeptides, greatly reduces the complexity of analyzing protein mass spectrometry data.

List of abbreviations used

CID: collision-induced dissociation; Da: Dalton; ESI: electrospray ionization; LC-MS/MS: liquid chromatography coupled with tandem mass spectrometry; MS: mass spectrometry; MS/MS: tandem mass spectrometry; PS: power spectrum.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

NM programmed the single-processor prototype of MSCleaner and carried out all computational experiments. NM and FE together produced the WWW site associated with this publication. GS and MW were instrumental in creating the multi-processor version and the WWW server. KM provided the wet lab part of this work and participated in the discussion of the results. FE proposed the scientific task, guided the work and wrote the article.

References

1.  Automated deconvolution and deisotoping of electrospray mass spectra.

Authors:  Marco Wehofsky; Ralf Hoffmann
Journal:  J Mass Spectrom       Date:  2002-02       Impact factor: 1.982

2.  Automatic quality assessment of peptide tandem mass spectra.

Authors:  Marshall Bern; David Goldberg; W Hayes McDonald; John R Yates
Journal:  Bioinformatics       Date:  2004-08-04       Impact factor: 6.937

3.  Artificial neural network analysis for evaluation of peptide MS/MS spectra in proteomics.

Authors:  Tomasz Baczek; Adam Buciński; Alexander R Ivanov; Roman Kaliszan
Journal:  Anal Chem       Date:  2004-03-15       Impact factor: 6.986

4.  New data base-independent, sequence tag-based scoring of peptide MS/MS data validates Mowse scores, recovers below threshold data, singles out modified peptides, and assesses the quality of MS/MS techniques.

Authors:  Mikhail M Savitski; Michael L Nielsen; Roman A Zubarev
Journal:  Mol Cell Proteomics       Date:  2005-05-22       Impact factor: 5.911

5.  InsPecT: identification of posttranslationally modified peptides from tandem mass spectra.

Authors:  Stephen Tanner; Hongjun Shu; Ari Frank; Ling-Chi Wang; Ebrahim Zandi; Marc Mumby; Pavel A Pevzner; Vineet Bafna
Journal:  Anal Chem       Date:  2005-07-15       Impact factor: 6.986

6.  Cleaning of raw peptide MS/MS spectra: improved protein identification following deconvolution of multiply charged peaks, isotope clusters, and removal of background noise.

Authors:  Nedim Mujezinovic; Günther Raidl; James R A Hutchins; Jan-Michael Peters; Karl Mechtler; Frank Eisenhaber
Journal:  Proteomics       Date:  2006-10       Impact factor: 3.984

7.  Code developments to improve the efficiency of automated MS/MS spectra interpretation.

Authors:  Rovshan G Sadygov; Jimmy Eng; Eberhard Durr; Anita Saraf; Hayes McDonald; Michael J MacCoss; John R Yates
Journal:  J Proteome Res       Date:  2002 May-Jun       Impact factor: 4.466

8.  A universal algorithm for fast and automated charge state deconvolution of electrospray mass-to-charge ratio spectra.

Authors:  Z Zhang; A G Marshall
Journal:  J Am Soc Mass Spectrom       Date:  1998-03       Impact factor: 3.109

9.  Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database.

Authors:  J R Yates; J K Eng; A L McCormack; D Schieltz
Journal:  Anal Chem       Date:  1995-04-15       Impact factor: 6.986

10.  Error-tolerant identification of peptides in sequence databases by peptide sequence tags.

Authors:  M Mann; M Wilm
Journal:  Anal Chem       Date:  1994-12-15       Impact factor: 6.986

