Preprocessing of tandem mass spectrometric data based on decision tree classification.

Jing Fen Zhang¹, Si Min He, Jin Jin Cai, Xing Jun Cao, Rui Xiang Sun, Yan Fu, Rong Zeng, Wen Gao.

Abstract

In this study, we present a preprocessing method for quadrupole time-of-flight (Q-TOF) tandem mass spectra to increase the accuracy of database searching for peptide (protein) identification. Based on the natural isotopic information inherent in tandem mass spectra, we construct a decision tree after feature selection to classify the noise and ion peaks in tandem spectra. Furthermore, we recognize overlapping peaks to find the monoisotopic masses of ions for the following identification process. The experimental results show that this preprocessing method increases the search speed and the reliability of peptide identification.

Year: 2005 PMID: 16689691 PMCID: PMC5173242 DOI: 10.1016/s1672-0229(05)03032-9

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

Mass spectrometric analysis combined with database searching is a well-established approach to peptide and protein identification. During the experiment, the peptides separated by liquid chromatography are fragmented and ionized by collision-induced dissociation (CID), and the mass-to-charge ratios (m/z) of the resulting ions are measured by the mass spectrometer. The peptides are then identified (or sequenced) by searching a sequence database with the m/z values of the ions in their tandem spectra. Because CID produces a wide variety of fragment ions and the spectra contain a large amount of noise, it is difficult to determine the sequence of a peptide from its tandem spectrum. A quadrupole time-of-flight (Q-TOF) spectrum of a peptide generally has 500 to 8,000 or even more peaks, but only 1%–5% of them are real peaks, that is, peaks that correspond to important, known fragment ions and are useful for peptide identification. To increase the accuracy of peptide identification and reduce the computational complexity, tandem mass spectra are preprocessed before database searching in order to select the peaks corresponding to fragment ions while keeping the number of selected peaks small.

To date, several methods have been proposed for the preprocessing of tandem data, including threshold filtering, denoising transforms, and deisotoping. Threshold filtering is the most straightforward approach. As peaks with very small abundance values are unlikely to be real ones, this method selects the peaks above a given threshold or chooses a specific number of the most intense peaks in specified m/z intervals (3–7). However, abundance is not a fundamental attribute of real peaks: many important b-type ions have very low abundance. In addition, the quality of spectra, namely the intensity baseline of noise, differs greatly from one spectrum to another. Thresholds alone are therefore an unreliable way to remove noise.
In the denoising approach, well-known procedures such as the wavelet transform have been used to denoise raw tandem mass spectra. However, parameters such as the wavelet basis functions, the order, and the level of decomposition affect how much the procedure may distort the spectrum. In deisotoping, isotope peaks are removed so that every fragment ion is represented by a single peak, which greatly reduces the complexity of the spectra (6, 7). Peak overlapping, in which the isotope masses of two or more different ions coincide, is observed frequently in spectra. Deisotoping without determining whether a peak is the monoisotopic peak of one ion or an isotope peak of another therefore loses some overlapped but important fragment ions.

To address these issues, we present a new preprocessing method for Q-TOF tandem mass spectra based on decision tree classification. Firstly, instead of threshold filtering and denoising transforms, we use a Gaussian mixture model (GMM) to estimate the baseline of noise and treat the baseline as just one feature for distinguishing noise from real peaks. Secondly, the key concept of the isotope pattern vector (IPV) is introduced to characterize the isotope cluster of a fragment ion, and the complex overlapping of isotope peaks is considered before deisotoping. We then investigate the differences among noise, single fragment ions, and overlapping ions based on features such as the baseline of noise and the IPV. Finally, a decision tree is constructed to classify the peaks, and the monoisotopic masses of all potential ions are calculated. We applied our preprocessing method to four different datasets and conducted extensive experiments to evaluate the specificity and sensitivity of classification. We also evaluated the effect of the preprocessing on the speed and accuracy of Mascot and pFind searches. The experimental results show that this data preprocessing approach can increase the search speed and the reliability of peptide identification.

Methods

Gaussian mixture model

Factors such as the signal-to-noise ratio of the precursor and imperfect laboratory conditions (for example, temperature shifts) may all affect the quality of a spectrum. The intensity distribution of noise therefore differs between spectra. For example, Figures 1 and 2 show the spectra of peptides CCAADDKEACFAVEGPK and YLGYLEQLLR, respectively. The intensity baseline of the noise peaks in Figure 1 is much higher than that in Figure 2.

Fig. 1

The tandem mass spectrum of peptide CCAADDKEACFAVEGPK in which the precursor holds 3 charges.

Fig. 2

The tandem mass spectrum of peptide YLGYLEQLLR in which the precursor holds 2 charges.

The peaks corresponding to noise are produced randomly by the mass spectrometer during CID. The intensity of noise can therefore be treated as approximately normally distributed, and a GMM can be established in which the Gaussian curve represents the distribution of the noise intensity. Intuitively, the centroid of the Gaussian curve corresponding to noise is treated as the baseline. In practice, the mean and standard deviation are used to characterize the baseline of noise, denoted as $I_{\text{noise}} = (I_{\mu}, I_{\sigma})$, and its value is obtained by estimating the parameters of the GMM with the Expectation-Maximization (EM) algorithm. Note that we use relative rather than absolute peak intensities, with the highest intensity in a spectrum set to 100%. Using the MATLAB toolbox, the calculated values of $(I_{\mu}, I_{\sigma})$ for the data in Figures 1 and 2 are (2.290144%, 0.350236%) and (1.012099%, 0.076899%), respectively, which is consistent with the noise observed in the two spectra.
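The baseline estimation described above can be sketched in a few lines of Python. This is only an illustration, not the authors' MATLAB procedure: it assumes a mixture of exactly two Gaussians (one for noise, one broad component for everything else) and takes the component with the lower mean as the noise baseline. The function names `fit_two_gaussians` and `noise_baseline` are hypothetical.

```python
import math


def fit_two_gaussians(values, iterations=200):
    """Fit a 1-D mixture of two Gaussians with EM.

    Returns two (mean, std, weight) tuples sorted by mean.
    The choice of two components is an assumption of this sketch.
    """
    xs = sorted(values)
    n = len(xs)
    # Crude initialisation: one component per half of the sorted data.
    params = []
    for part in (xs[: n // 2], xs[n // 2:]):
        mu = sum(part) / len(part)
        var = sum((x - mu) ** 2 for x in part) / len(part)
        params.append((0.5, mu, max(var, 1e-6)))

    for _ in range(iterations):
        # E-step: responsibility of each component for each intensity.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - mu) ** 2) / (2 * var))
                / math.sqrt(2 * math.pi * var)
                for w, mu, var in params
            ]
            total = sum(dens)
            if total == 0.0:
                resp.append([0.5, 0.5])  # both densities underflowed
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        new_params = []
        for j in range(2):
            nj = max(sum(r[j] for r in resp), 1e-12)
            mu = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nj
            new_params.append((nj / n, mu, max(var, 1e-6)))
        params = new_params

    return sorted((mu, math.sqrt(var), w) for w, mu, var in params)


def noise_baseline(relative_intensities):
    """Return (mean, std) of the lower-mean component as the noise baseline.

    Intensities are expected as percentages of the most intense peak.
    """
    mean, std, _weight = fit_two_gaussians(relative_intensities)[0]
    return mean, std
```

Calling `noise_baseline` on the relative intensities of one spectrum returns a pair that plays the role of the baseline (mean, standard deviation) in the text. A production implementation would normally choose the number of components from the data and use a numerically safer log-domain E-step.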

Isotope pattern vector

Isotopes are atoms of the same element that contain the same number of protons and electrons but differ in the number of neutrons in the nucleus. The elements H, C, N, O, and S have stable isotope distributions in nature. Most proteins are composed of these five elements and therefore have relatively stable isotope patterns. We use the IPV to describe the isotope profile of an ion numerically. Suppose that the monoisotopic mass of a fragment ion $P$ (with molecular formula $\mathrm{C}_{n_C}\mathrm{H}_{n_H}\mathrm{N}_{n_N}\mathrm{O}_{n_O}\mathrm{S}_{n_S}$) is $M_0$, and that its first four isotopes (with one, two, three, and four extra neutrons, respectively) are $P_1$, $P_2$, $P_3$, and $P_4$. We define the IPV of $P$ as

$$\mathrm{IPV} = (M_0, T_1, T_2, T_3, T_4, \Delta m_1, \Delta m_2, \Delta m_3, \Delta m_4),$$

where $T_k$ is the relative abundance of $P_k$ with respect to $P$, and $\Delta m_k$ is the mass difference between $P_k$ and $P$, for $k = 1, \dots, 4$.

Theoretical IPV

Since the five elements H, C, N, O, and S have stable isotope distributions, the theoretical IPV (tIPV) of a fragment ion is fixed and can be deduced from its elemental composition, that is, from its molecular formula. We assume that each extra neutron of an atom in the peptide appears independently. The tIPV for a given formula $\mathrm{C}_{n_C}\mathrm{H}_{n_H}\mathrm{N}_{n_N}\mathrm{O}_{n_O}\mathrm{S}_{n_S}$ can then be deduced from the isotope probabilities of each element. The deduction of $M_0$, $T_1$, $T_2$, $\Delta m_1$, and $\Delta m_2$ uses the following quantities: $q_C$, $q_H$, and $q_N$ are the relative abundances of 13C to 12C, D to H, and 15N to 14N; $\Delta_C$, $\Delta_H$, and $\Delta_N$ are the mass differences between 13C and 12C, D and H, and 15N and 14N, respectively; $q_{O1}$, $q_{O2}$ ($q_{S1}$, $q_{S2}$) are the ratios of 17O to 16O and 18O to 16O (33S to 32S and 34S to 32S), respectively; and $\Delta_{O1}$, $\Delta_{O2}$ ($\Delta_{S1}$, $\Delta_{S2}$) are the mass differences between 17O and 16O and between 18O and 16O (33S and 32S, and 34S and 32S), respectively.
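Under the independence assumption, the relative abundances of the isotope peaks can be computed by multiplying one small polynomial per atom, where the coefficient of degree k is the probability that the atom carries k extra neutrons. The sketch below illustrates this idea only; it is not the authors' derivation. The abundance values are approximate natural abundances, the rare isotope 36S is ignored, and the names `ABUNDANCE`, `poly_mul`, and `relative_abundances` are hypothetical.

```python
# Approximate natural isotope abundances, indexed by the number of
# extra neutrons (0, 1, 2). 36S is ignored in this sketch.
ABUNDANCE = {
    "C": [0.9893, 0.0107],
    "H": [0.999885, 0.000115],
    "N": [0.99636, 0.00364],
    "O": [0.99757, 0.00038, 0.00205],
    "S": [0.9499, 0.0075, 0.0425],
}


def poly_mul(a, b, max_deg=4):
    """Multiply two coefficient lists, truncating above max_deg."""
    out = [0.0] * (max_deg + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j <= max_deg:
                out[i + j] += ai * bj
    return out


def relative_abundances(formula, max_deg=4):
    """Return [T_1, ..., T_max_deg] for a formula such as {"C": 2, "H": 5}.

    T_k is the abundance of the species with k extra neutrons relative to
    the monoisotopic species, assuming atoms are independent.
    """
    dist = [1.0] + [0.0] * max_deg
    for element, count in formula.items():
        for _ in range(count):
            dist = poly_mul(dist, ABUNDANCE[element], max_deg)
    return [dist[k] / dist[0] for k in range(1, max_deg + 1)]


# Example: glycine, C2H5NO2.
glycine = {"C": 2, "H": 5, "N": 1, "O": 2, "S": 0}
t_values = relative_abundances(glycine)
```

For the first isotope peak this reduces to the sum over elements of the atom count times the heavy-to-light abundance ratio, which is a quick sanity check on any implementation.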

Experimental IPV

We can calculate the experimental IPV (eIPV) of a fragment ion $P$ if the isotope peaks of the ion are measured by the mass spectrometer. We characterize an ion peak in a mass spectrum by the pair (m/z, intensity), where m/z is the mass-to-charge ratio and intensity is the relative height of the peak. For a group of isotope peaks $(p_0, p_1, p_2, p_3, p_4)$ corresponding to an ion, the spacing between the m/z values of $p_0$, $p_1$, $p_2$, $p_3$, and $p_4$ is about 1 Da when the ion carries a single charge and about 0.5 Da when it carries two charges. In general, the spacing is 1/z Da when the ion carries z charges. Conversely, the charge of an ion can be deduced from the m/z spacing of its isotope peaks. To calculate the eIPV for $P$, we find the corresponding isotope cluster $(p_0, p_1, p_2, p_3, p_4)$ in the tandem spectrum, with (m/z, intensity) pairs $(Mz_k, I_k)$, $k = 0, \dots, 4$, and calculate the charge $z$ from the spacing between the $Mz_k$. After normalizing to $z = 1$, the (m/z, intensity) pairs are converted to $(M_k, I_k)$, where $M_k = Mz_k \times z - (z - 1) \times 1.0078$, $k = 0, \dots, 4$. The eIPV is then obtained by applying the IPV definition to these normalized pairs.
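A minimal sketch of this calculation follows. The explicit eIPV formula did not survive in the text, so the sketch assumes that each $T_k$ is the intensity ratio $I_k / I_0$ and that each $\Delta m_k$ is measured from the normalized monoisotopic mass $M_0$. Both choices, and the function names, are assumptions made for illustration.

```python
PROTON_MASS = 1.0078  # value used in the normalization formula in the text


def infer_charge(mz_values):
    """Estimate the charge z from the mean spacing of consecutive peaks.

    Assumes the peaks are sorted and not all at the same m/z.
    """
    spacings = [b - a for a, b in zip(mz_values, mz_values[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    return max(1, round(1.0 / mean_spacing))


def normalize_mass(mz, z):
    """Convert an m/z value at charge z to its singly charged equivalent."""
    return mz * z - (z - 1) * PROTON_MASS


def experimental_ipv(cluster):
    """Compute (z, eIPV) from five (m/z, intensity) pairs p0..p4.

    Assumed layout: (M_0, T_1..T_4, dm_1..dm_4) with T_k = I_k / I_0 and
    dm_k = M_k - M_0.
    """
    z = infer_charge([mz for mz, _ in cluster])
    masses = [normalize_mass(mz, z) for mz, _ in cluster]
    i0 = cluster[0][1]
    ratios = [intensity / i0 for _, intensity in cluster[1:]]
    deltas = [m - masses[0] for m in masses[1:]]
    return z, tuple([masses[0]] + ratios + deltas)


# Example: a doubly charged cluster with 0.5 Da spacing.
cluster = [(500.0, 100.0), (500.5, 55.0), (501.0, 20.0),
           (501.5, 5.0), (502.0, 1.0)]
charge, eipv = experimental_ipv(cluster)
```

In the example the 0.5 Da spacing gives a charge of 2, and the monoisotopic m/z of 500.0 normalizes to 500.0 × 2 − 1.0078 = 998.9922.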

Feature selection and decision tree classification

The next step is to investigate the differences between noise and fragment ions using a set of features, and to construct a decision tree that classifies the peaks by the values of these features. Firstly, since peaks higher than the baseline of noise are more likely to be real peaks, the baseline of noise $I_{\text{noise}} = (I_{\mu}, I_{\sigma})$ must be found for each spectrum. Secondly, each fragment ion has theoretical isotopes whereas noise does not, so noise and real peaks can be distinguished using the IPV: allowing for the measurement error of the mass spectrometer, the isotope peaks of a fragment ion should be observed, and the experimental isotope pattern should match the theoretical one. Thirdly, ions with different charge states overlap with one another and with noise in complex ways, so it is very important to recognize the charge state of fragment ions and the overlapping cases in order to determine all the monoisotopic masses of ions. We therefore select features such as the charge state, the mass corresponding to the peak, the intensity distance between the peak and the baseline of noise, and the distance between eIPV and tIPV. Finally, we investigate the differences between noise and fragment ions, learn rules from training samples, and construct a decision tree that classifies the peaks into three classes: Class 1: noise; Class 2: real peaks corresponding to single ions; Class 3: real peaks corresponding to overlapping ions.

As described above, the spacing of the m/z values of the isotope peaks is about 1/z Da if the ion carries z charges. For a given peak $p_0$, we scan the spectrum and find all groups of potential isotope peaks by assuming three different charge states, z = 1, 2, or 3, with a tolerance of 0.05/z Da on the spacing. For such an isotope cluster $(p_0, p_1, p_2, p_3, p_4)$ with (m/z, intensity) pairs $(Mz_k, I_k)$, $k = 0, \dots, 4$, if there is no peak at the $k$th isotopic position within the given tolerance, we set virtual peaks $(p_k, p_{k+1}, \dots, p_4)$ with intensities $I_j = 0$, $j = k, \dots, 4$. We can therefore always obtain at least three groups of potential isotope peaks for $p_0$, and it is then judged which group corresponds to the fragment ion.

Although the formula of a fragment ion is unknown during preprocessing, the tIPV of an ion can be estimated by the expected (or mean) isotope pattern of an average peptide of the given mass. The average peptide is a peptide whose amino acid composition corresponds to the statistical distribution of amino acids in the non-redundant database, from which the expected tIPV = $(M_0, T_1, T_2, T_3, T_4, \Delta m_1, \Delta m_2, \Delta m_3, \Delta m_4)$ can be obtained. We then calculate the values of the features for each potential group of isotope peaks and obtain a feature vector $V$.

We select some peaks as training samples to observe how the values of $V$ for noise differ from those for real peaks. Specifically, when the peptide sequence corresponding to a spectrum is known, we judge whether a peak is noise, corresponds to a single ion, or involves overlapped ions. Four kinds of overlapping are considered: Case 1: two ions with 1 Da difference in mass; Case 2: three consecutive ions with 1 Da difference in mass; Case 3: two ions with 3 Da difference in mass; Case 4: two ions with 2 Da difference in mass. The four profiles of the overlapping cases are shown in Figure 3. We then select three classes of peaks corresponding to noise, single ions, and overlapped ions, respectively. Finally, the decision tree that classifies these peaks is constructed using the WEKA C4.5 toolbox.
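The search for candidate isotope groups can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: expected positions are measured from $p_0$ as m/z + k/z, the most intense peak inside the tolerance window is taken when several match, and the function name `candidate_groups` is hypothetical.

```python
def candidate_groups(peaks, p0_index, charges=(1, 2, 3), tol=0.05,
                     n_isotopes=4):
    """Build one candidate isotope group per assumed charge state.

    peaks: list of (m/z, intensity) pairs.
    Returns {z: [p0, p1, ..., p4]}; once an isotope position has no peak
    within tol / z, it and all later positions become virtual peaks with
    zero intensity, as described in the text.
    """
    mz0, i0 = peaks[p0_index]
    groups = {}
    for z in charges:
        group = [(mz0, i0)]
        gap_found = False
        for k in range(1, n_isotopes + 1):
            target = mz0 + k / z  # expected position of the k-th isotope
            match = None
            if not gap_found:
                in_window = [p for p in peaks if abs(p[0] - target) <= tol / z]
                if in_window:
                    # Assumption: keep the most intense peak in the window.
                    match = max(in_window, key=lambda p: p[1])
            if match is None:
                gap_found = True
                group.append((target, 0.0))  # virtual peak
            else:
                group.append(match)
        groups[z] = group
    return groups


# Example: a small spectrum scanned from its first peak.
spectrum = [(500.0, 100.0), (500.5, 60.0), (501.0, 30.0), (502.0, 5.0)]
groups = candidate_groups(spectrum, 0)
```

Each group can then be turned into a feature vector and passed to the decision tree. In this example the z = 2 hypothesis finds two isotope peaks before the first gap, while z = 3 finds none, so all four of its isotope positions are virtual.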

Fig. 3

Four profiles of the overlapping cases in which Ion 1, Ion 2, and Ion 3 represent the monoisotopes of each ion involved in overlapping.

According to the rules of the decision tree, every peak in a spectrum can be classified by the calculated values of $V$ for its potential isotope peak groups. Each peak is classified into one and only one class. Specifically, a given peak $p_0$ is judged to be noise if the values of $V$ for all of its groups of potential isotope peaks are classified into Class 1. If it is classified into Class 2, the monoisotopic mass $M_0 = Mz_0 \times z - (z - 1) \times 1.0078$ is selected to represent a potential fragment ion. If peak $p_0$ is classified into Class 3, two or three monoisotopic masses are obtained according to the overlapping case. Finally, masses corresponding to the peaks that have been classified into Classes 2 and 3 are selected prior to database searching.

Application

We applied our preprocessing method to four different datasets of Q-TOF mass spectra: 54 spectra of tryptic peptides, 20 spectra of Glu-Fibrinopeptide B, 9 spectra of a mixture of standard peptides measured at different times, and 7 spectra of tryptic peptides of bovine serum albumin (the Research Centre for Proteome Analysis, Key Laboratory of Proteomics, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences). These datasets are denoted as PepLutefisk, PepGFB, PepMix, and PepBSA, respectively.

For performance metrics, we use the following definitions. Firstly, a peak is called a real peak if its corresponding mass matches a known theoretical ion; otherwise, it is called an invalid peak. In this paper, the known theoretical ions include the predominant a-, b-, and y-type ions (12, 13), immonium ions (14, 15), and other less important ions such as c-, x-, and z-type ions (12, 13), internal fragment ions formed by a combination of a- and y-type cleavages (14, 15), and ions that have lost ammonia or water. Note that some peaks really do correspond to fragment ions, but their masses cannot be matched to any known theoretical ion because present knowledge of the fragmentation rules in CID is incomplete. Consequently, the invalid peaks include both noise peaks and peaks that correspond to fragment ions of currently unknown type. Secondly, a real peak that is classified correctly is called a true positive (TP); otherwise it is called a false negative (FN). Similarly, an invalid peak that is classified correctly is called a true negative (TN); otherwise it is called a false positive (FP). Finally, sensitivity and specificity are used to measure the performance of classification, where sensitivity is defined as TP/(TP+FN) and specificity as TN/(TN+FP).

In our experiment, 900 cases were selected as training samples and 429,156 cases as testing samples. The experimental results are summarized in Table 1. The ratios of peak selection in the four datasets are all lower than 5%. Such low selection ratios can greatly improve the speed of database searching, since the fewer peaks are selected, the simpler the computation in the subsequent identification process.

Table 1

Classification Performance of the Preprocessing

Data	No. of spectra	No. of total peaks/No. of selected peaks	Ratio of peak selection	Sensitivity	Specificity
PepLutefisk	54	89,256/3,721	4.168%	97.94%	99.06%
PepGFB	20	180,088/2,408	1.337%	97.77%	99.66%
PepMix	9	51,836/1,799	3.471%	93.68%	97.99%
PepBSA	7	18,720/789	4.215%	94.50%	97.76%

It is the real peaks that determine the identification of peptides: the more real peaks are selected, the higher the accuracy of identification. The sensitivity of classification is therefore very important for identification. The detailed results on sensitivity are given in Table 2, where the last column lists two kinds of FN samples: peaks corresponding to the predominant a-, b-, and y-type ions, and peaks corresponding to other, less important types of ions. The data show that the former are far fewer than the latter, which means that little important information is lost in classification. Compared with sensitivity, the specificity of preprocessing is less important for two reasons. Firstly, the number of invalid peaks depends on the purity of the testing samples and on the knowledge of fragmentation rules in CID, which is still insufficient and needs improvement, so the computed specificity is not entirely objective. Secondly, most peaks are invalid, so a small number of classification errors has little effect on the value of specificity.

Table 2

Detailed Performance on Sensitivity of the Preprocessing

Data	No. of selected peaks	No. of real peaks in spectra*	No. of TP	No. of FN a-, b-, y-type/other type
PepLutefisk	3,721	2,909	2,849	11/49
PepGFB	2,408	1,796	1,756	1/39
PepMix	1,799	775	726	9/40
PepBSA	789	379	358	3/18

* Peaks whose corresponding masses match the known types of theoretical ions.
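The sensitivity and specificity definitions given above can be checked directly against the counts in Table 2. The snippet below is a small sketch with hypothetical function names; it recomputes sensitivity from the TP and FN columns, and the result may differ from the published percentage in the last decimal place because of rounding.

```python
def sensitivity(tp, fn):
    """Fraction of real peaks that are classified correctly."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of invalid peaks that are classified correctly."""
    return tn / (tn + fp)


# TP and FN counts taken from Table 2 (FN = predominant + other types).
table2 = {
    "PepLutefisk": (2849, 11 + 49),
    "PepGFB": (1756, 1 + 39),
    "PepMix": (726, 9 + 40),
    "PepBSA": (358, 3 + 18),
}

for name, (tp, fn) in table2.items():
    print(f"{name}: sensitivity = {100 * sensitivity(tp, fn):.2f}%")
```

Table 2 does not list TN and FP, so `specificity` is shown only for completeness.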

We also evaluated the effect of the preprocessing on the speed and accuracy of Mascot and pFind searches. The first set of tests was performed with pFind. Under the same search parameters, the accuracy of identification increased slightly, while the search speed improved by a factor of up to 5–10. The second set of tests was performed by submitting the data to Mascot through the Internet, so only the accuracy of searching was compared, since speed could not be measured meaningfully. We submitted two kinds of data to Mascot: the original spectrum data and the spectrum data after our preprocessing. Comparing the search results, we found that: (1) If a peptide can be identified from the original data, that is, the expected peptide sequence is ranked first by the Mascot search, it can also be identified from the data after our preprocessing, which means that the preprocessing does not destroy the data. (2) In terms of the search scores, including “Score” and “Expectation value” in the Mascot search results, for 70% of the spectra the scores for the preprocessed data were much better than those for the original data. (3) For some spectra, such as the spectrum of peptide QNCDQFEK (in which the amino acid C is carbamidomethylated) and the spectrum of peptide DDPHACYSTVFDK, the query with the original data ranked the expected sequence below the fifth position, while the query with the preprocessed data ranked the correct answer first. The search after our preprocessing is therefore more reliable. In future research, we will focus on improving the sensitivity and specificity of the preprocessing.

Conclusion

In this study, we present a new preprocessing method for Q-TOF tandem mass spectra to increase the accuracy of database searching for peptide (protein) identification. Instead of threshold filtering and denoising transforms, we use a GMM to estimate the baseline of noise and treat the baseline as just one feature for distinguishing noise from real peaks. In addition, based on the natural isotopic information inherent in tandem mass spectra, we construct a decision tree after feature selection to classify the noise and ion peaks and to recognize overlapping peaks. The experimental results show that this preprocessing substantially increases the search speed and improves the reliability of peptide identification.

References (12 in total)

1. Probability-based protein identification by searching sequence databases using mass spectrometry data.

Authors: D N Perkins; D J Pappin; D M Creasy; J S Cottrell
Journal: Electrophoresis Date: 1999-12 Impact factor: 3.535

2. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry.

Authors: J A Taylor; R S Johnson
Journal: Anal Chem Date: 2001-06-01 Impact factor: 6.986

3. Mass spectrometry-based proteomics. (Review)

Authors: Ruedi Aebersold; Matthias Mann
Journal: Nature Date: 2003-03-13 Impact factor: 49.962

4. Exploiting the kernel trick to correlate fragment ions for peptide identification via tandem mass spectrometry.

Authors: Yan Fu; Qiang Yang; Ruixiang Sun; Dequan Li; Rong Zeng; Charles X Ling; Wen Gao
Journal: Bioinformatics Date: 2004-03-25 Impact factor: 6.937

5. Increased identification of peptides by enhanced data processing of high-resolution MALDI TOF/TOF mass spectra prior to database searching.

Authors: Tomas Rejtar; Hsuan-Shen Chen; Victor Andreev; Eugene Moskovets; Barry L Karger
Journal: Anal Chem Date: 2004-10-15 Impact factor: 6.986

6. Predicting molecular formulas of fragment ions with isotope patterns in tandem mass spectra.

Authors: Jingfen Zhang; Wen Gao; Jinjin Cai; Simin He; Rong Zeng; Runsheng Chen
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2005 Jul-Sep Impact factor: 3.710

7. A comparison of the peptide fragmentation obtained from a reflector matrix-assisted laser desorption-ionization time-of-flight and a tandem four sector mass spectrometer.

Authors: J C Rouse; W Yu; S A Martin
Journal: J Am Soc Mass Spectrom Date: 1995-09 Impact factor: 3.109

8. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database.

Authors: J K Eng; A L McCormack; J R Yates
Journal: J Am Soc Mass Spectrom Date: 1994-11 Impact factor: 3.109

9. Novel fragmentation process of peptides by collision-induced decomposition in a tandem mass spectrometer: differentiation of leucine and isoleucine.

Authors: R S Johnson; S A Martin; K Biemann; J T Stults; J T Watson
Journal: Anal Chem Date: 1987-11-01 Impact factor: 6.986

10. Proposal for a common nomenclature for sequence ions in mass spectra of peptides.

Authors: P Roepstorff; J Fohlman
Journal: Biomed Mass Spectrom Date: 1984-11