
Enhanced peptide quantification using spectral count clustering and cluster abundance.

Seungmook Lee, Min-Seok Kwon, Hyoung-Joo Lee, Young-Ki Paik, Haixu Tang, Jae K Lee, Taesung Park.

Abstract

BACKGROUND: Quantification of protein expression by means of mass spectrometry (MS) has been introduced in various proteomics studies. In particular, two label-free quantification methods, spectral counting and spectral feature analysis, have been extensively investigated in a wide variety of proteomic studies. The cornerstone of both methods is peptide identification based on a proteomic database search and subsequent estimation of peptide retention time. However, they often suffer from restrictive database searches and inaccurate estimation of the liquid chromatography (LC) retention time. Furthermore, conventional peptide identification methods based on the spectral library search algorithms such as SEQUEST or SpectraST have been found to provide neither the best match nor high-scored matches. Lastly, these methods are limited in the sense that target peptides cannot be identified unless they have been previously generated and stored in the database or spectral libraries. To overcome these limitations, we propose a novel method, namely Quantification method based on Finding the Identical Spectral set for a Homogenous peptide (Q-FISH), to estimate a peptide's abundance from its tandem mass spectrometry (MS/MS) spectra through the direct comparison of experimental spectra. Intuitively, our Q-FISH method compares all possible pairs of experimental spectra in order to identify both known and novel proteins, significantly enhancing identification accuracy by grouping replicated spectra from the same peptide targets.
RESULTS: We applied Q-FISH to Nano-LC-MS/MS data obtained from human hepatocellular carcinoma (HCC) and normal liver tissue samples to identify differentially expressed peptides between the normal and disease samples. For a total of 44,318 spectra obtained through MS/MS analysis, Q-FISH yielded 14,748 clusters. Among these, 5,777 clusters were identified only in the HCC sample, 6,648 clusters only in the normal tissue sample, and 2,323 clusters in both the HCC and normal tissue samples. While it will be interesting to investigate peptide clusters found in only one sample, we further examined the spectral clusters identified in both the HCC and normal samples, since our goal is to identify and assess differentially expressed peptides quantitatively. The next step was to perform a beta-binomial test to isolate differentially expressed peptides between the HCC and normal tissue samples. This test resulted in 84 peptides with significantly differential spectral counts between the HCC and normal tissue samples. We independently identified 50 and 95 peptides by SEQUEST, of which 24 and 56 peptides, respectively, were found to be known biomarkers for human liver cancer. Comparing the Q-FISH and SEQUEST results, we found that 22 of the 84 differentially expressed peptides found by Q-FISH were also identified by SEQUEST. Remarkably, of these 22 peptides discovered by both Q-FISH and SEQUEST, 13 are known for human liver cancer and the remaining 9 are known to be associated with other cancers.
CONCLUSIONS: We proposed a novel statistical method, Q-FISH, for accurately identifying protein species and simultaneously quantifying the expression levels of identified peptides from mass spectrometry data. Q-FISH analysis on human HCC and liver tissue samples identified many protein biomarkers that are highly relevant to HCC. Q-FISH can be a useful tool both for peptide identification and quantification on mass spectrometry data analysis. It may also prove to be more effective in discovering novel protein biomarkers than SEQUEST and other standard methods.

Year:  2011        PMID: 22034872      PMCID: PMC3234305          DOI: 10.1186/1471-2105-12-423

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

The main objective of functional proteomics analysis is often to estimate changes in the amount of proteins found in complex biological systems in response to physiological and clinical factors such as cell development, disease progression, or drug treatment. In particular, one of the key issues in proteomics research based on tandem mass spectrometry (MS/MS) is the identification of protein species and the characterization of their expression changes in normal and disease samples. Three analysis techniques are often required in an MS/MS study: expressed peptide identification, target protein characterization, and quantification [1]. With hundreds to tens of thousands of fragment ion spectra generated, the assignment of the fragment ion spectra to peptide sequences, the identification of the proteins represented by each peptide, and the estimation of their abundances in the analyzed sample require complex computations and remain statistically challenging [2]. Quantification of protein expression using mass spectrometry (MS) is often required for the discovery of protein biomarkers associated with cancer, their response to stimuli, cell signalling cascades and the function of cell cycle-promoting proteins, and various biomedical investigations [3]. Two categories of quantification methods for MS data have been used: stable isotope labelling quantification and label-free quantification [2]. Several stable isotope-based quantification methods have been introduced based on different labelling reagents that can be chemically bound to peptides [4]. It is, however, difficult to simultaneously quantify the amount of proteins/peptides in multiple samples because of the limited number of labelling reagents available [5]. Moreover, current practical applications can typically quantify, at most, a few hundred peptides, measuring relative expression values for each pair of contrasting samples.
Furthermore, the high cost of labelling reagents makes these quantification methods difficult to apply routinely to the characterization of the global proteome. On the other hand, label-free quantification, which does not require stable isotope labelling, has the advantages of low cost and simplicity. Currently, two label-free methods are available to measure expression levels of peptides: spectral counting and spectral feature analysis. The spectral counting method estimates peptide expression levels by counting spectra (from MS/MS data) or by estimating the integrated ion intensities [6,7]. The spectral feature analysis method quantitatively determines peptide expression levels by comparing three-dimensional patterns (retention time, m/z and intensity) between different samples [8-13]. However, these label-free quantitative methods have two main shortcomings. The first is the large number of false-positive discriminative peptides that result from chromatographic variability between LC-MS experiments. In spectral feature analysis, after two candidates with the same MS1 retention time and m/z are found, the difference in their MS1 intensities is used to define the peptide levels. Therefore, spectral feature analysis requires stringent reproducibility [3,8] and additional pre-processing in the form of LC normalization or retention time alignment [14,15]. The second limitation is that spectral counting cannot be performed without peptide identification, because relative peptide levels can be quantified only after the peptides have been identified. In peptide identification, MS/MS spectra are verified using a database search algorithm or a spectral library search algorithm such as SEQUEST, MASCOT, or SpectraST. Specifically, database search algorithms calculate score functions to compare the experimental MS/MS spectra with theoretical MS/MS spectra of peptides derived from protein sequence databases.
The pool of theoretical MS/MS spectra is restricted by user-specified criteria such as mass tolerance, proteolytic enzymes, and the types of post-translational modification [2,16]. A number of spectra may not be assigned to the correct peptides for diverse reasons, including deficiencies of the scoring scheme implemented in the database search tools, sequence variations (e.g., single nucleotide polymorphisms, SNPs), omissions in the database searched, post-translational or chemical modifications of the peptide analyzed, and the observation of genomic sequences that are not anticipated (e.g., splice forms, somatic rearrangement, and processed proteins) [17]. For all these reasons, a large number of important peptides may be lost during the database search. Instead of matching acquired MS/MS spectra against theoretically predicted spectra, MS/MS spectra can also be assigned to peptides by matching them against a spectral library. The spectral library is compiled from a large collection of experimentally observed MS/MS spectra identified in previous experiments [18]. Generally, a set of spectra of known peptide sequences is collected into a library and used as a reference. An experimental spectrum may then be identified by a similar match in the library. However, a spectrum can be identified by this method only if it was previously observed and entered into the library. Library searching methods are therefore well suited for targeted proteomics, in which one seeks not to discover previously unseen peptides but rather to find and quantify expected peptides of interest in the sample [19]. To overcome these limitations of label-free quantification methods, we propose a novel spectral counting method that estimates a peptide's abundance by counting MS/MS spectra after comparing and clustering all experimentally observed spectra. This approach has several advantages.
First, because the same peptide may be fragmented multiple times or repeatedly observed at different time points in an MS/MS run, multiple spectra may be acquired for the same peptide. In other words, duplicated spectra are ubiquitous in large-scale proteomics data [20]. Our method thus attempts to identify and group all the duplicate spectra, which allows us to quantify the amount of peptide found in complex biological systems without searching through databases or using LC normalization. For the given spectra, our method, referred to as the Quantification method derived by Finding the Identical Spectra set for a Homogenous peptide (Q-FISH), employs a two-stage clustering algorithm to determine whether they are from the same peptide with homogeneous spectral patterns. The Q-FISH algorithm employs two similarity measures: the difference between two precursor ions and the correlation coefficient of moving window averages. Subsequently, the algorithm clusters spectra from the same peptide through all plausible pair-wise comparisons. By counting the spectra in each cluster, we can estimate the amount of the corresponding peptide. Figure 1 summarizes the workflow of the proposed Q-FISH algorithm.
Figure 1

Work flow chart. This figure shows a flow schematic of the analysis process performed by Q-FISH algorithm

Our proposed algorithm was applied to identify differentially expressed peptides from real data obtained in a Nano-LC-MS/MS experiment performed on human HCC and normal liver tissue samples.
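The two similarity measures named in the Background (precursor ion difference and correlation of moving window averages) and the grouping of spectra by pairwise comparison can be illustrated with a minimal sketch. This is not the authors' implementation: the single-pass union-find grouping stands in for the two-stage clustering, and the binned intensity vectors, the window size and the thresholds `mz_tol` and `min_corr` are hypothetical values chosen only for illustration.

```python
from statistics import mean

def moving_average(values, window=3):
    """Moving-window average over a binned intensity vector."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def similar(s1, s2, mz_tol=2.5, min_corr=0.9, window=3):
    """Apply both measures; mz_tol and min_corr are assumed thresholds."""
    if abs(s1["precursor"] - s2["precursor"]) > mz_tol:
        return False
    r = pearson(moving_average(s1["bins"], window),
                moving_average(s2["bins"], window))
    return r >= min_corr

def cluster(spectra):
    """Group spectra linked by any similar pair (union-find over all pairs)."""
    parent = list(range(len(spectra)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(spectra)):
        for j in range(i + 1, len(spectra)):
            if similar(spectra[i], spectra[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(spectra)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Hypothetical toy spectra: precursor m/z plus binned fragment intensities.
spectra = [
    {"precursor": 1223.4, "bins": [0, 5, 9, 2, 0, 7, 1, 0]},
    {"precursor": 1222.6, "bins": [0, 6, 8, 2, 1, 7, 1, 0]},
    {"precursor": 1562.4, "bins": [4, 0, 1, 8, 3, 0, 0, 6]},
]
clusters = cluster(spectra)
spectral_counts = [len(c) for c in clusters]  # abundance estimate per cluster
print(clusters, spectral_counts)
```

In this sketch the spectral count of a cluster is simply the number of member spectra, which mirrors the idea of estimating peptide abundance by counting clustered spectra.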

Results & Discussion

We introduced and tested the Q-FISH algorithm to identify and quantify all expressed peptides in an MS/MS dataset by clustering and counting spectra with homogeneous spectral patterns. In order to test our algorithm, we performed a Nano-LC-MS/MS experiment with triplicated human hepatocellular carcinoma and normal liver tissue samples. For a total of 44,318 MS/MS spectra obtained through three MS/MS analyses for two samples, Q-FISH yielded 14,748 clusters. More specifically, 5,777 clusters were identified only in the hepatocellular carcinoma (HCC) sample, 6,648 clusters only in the normal sample, and 2,323 clusters in both HCC and normal samples. For the purpose of comparison, we also ran SEQUEST and SpectraST to identify peptides. However, only 4,824 of the 44,318 spectra were identified using SEQUEST, yielding a total of 1,326 peptides. Generally, most database search algorithms, including SEQUEST, assign experimental spectra to peptides by comparing the experimental data with theoretical spectra generated from peptide sequences. It should be noted that neither the best match nor a high-scoring match is necessarily a true match, especially for novel protein targets. Therefore, many peptides could be misidentified, or not identified at all, unless they were previously generated and stored in the sequence database. In our experiments, a large number of experimental spectra (89.12%, namely 39,494 of a total of 44,318 spectra) could not be used for peptide identification with SEQUEST. With SpectraST, 5,549 spectra and 3,295 peptides could be identified. That is, a large number of spectra still could not be used for peptide identification by SpectraST (87.48%, namely 38,769 of a total of 44,318 spectra). In contrast, our proposed method directly compares all observed experimental spectra to discover differentially expressed peptides without a loss of observed spectra.
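The proportions of unused spectra quoted for SEQUEST and SpectraST follow directly from the reported counts. The short check below reproduces them; the function name `unused_percent` is introduced here only for illustration.

```python
def unused_percent(total, identified):
    """Percentage of spectra that a search tool could not use."""
    return 100 * (total - identified) / total

TOTAL_SPECTRA = 44_318
identified = {"SEQUEST": 4_824, "SpectraST": 5_549}  # counts reported in the text

for tool, n in identified.items():
    print(f"{tool}: {TOTAL_SPECTRA - n} unused "
          f"({unused_percent(TOTAL_SPECTRA, n):.2f}%)")
```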
In Figure 2, the standardized intensities of the experimental spectra are plotted as positive values (upper part) and those of the reference spectrum as negative values (lower part). Specifically, Figure 2(a), which illustrates an example of one cluster with nine similar spectra, shows the spectral patterns of the MS/MS spectra as well as the reference spectrum for the clustered spectral set. The overall patterns look quite similar, and all nine spectra appear to have almost identical patterns. Table 1 shows the search results returned by SEQUEST and SpectraST. In the case of spectral set S366006, all nine spectra were assigned the same peptide sequence, "SIFSAVLDELK", by both SEQUEST and SpectraST, with XCorr above 1.97. In addition, the reference spectrum for the clustered spectral set was identified as the peptide sequence "SIFSAVLDELK" with a SEQUEST score of XCorr = 2.96. This analysis indicates that these spectra can be regarded as the spectra of a homogenous peptide. In other words, each cluster can be expected to be composed of spectra from the same peptide.
Figure 2

Pattern-plots of reference spectrum and experimental MS/MS spectra in clustered spectral sets. This figure shows pattern-plots of the experimental MS/MS spectra, plotted using positive intensities (upper part), and of the reference spectrum, plotted using negative intensities (lower part). In (a), all nine spectra were identified as the same peptide, while in (b) two of the eleven spectra were not identified by SEQUEST and in (c) only four of the seven spectra were identified by SpectraST, although the pattern-plots are very similar.

Table 1

Results of SEQUEST & SpectraST for spectra in clustered spectral sets

Spectral Set ID | Sample | Sequence | XCorr | RT | Precursor ion | Precursor intensity | SpectraST
S366006 | HCC-3 | SIFSAVLDELK | 2.35 | 6096.18 | 1223.4 | 25828.2 | 1
 | Normal-1 | SIFSAVLDELK | 2.38 | 6144.00 | 1222.6 | 12823.4 | 1
 | Normal-1 | SIFSAVLDELK | 2.12 | 6197.20 | 1224.2 | 385800.0 | 1
 | Normal-1 | SIFSAVLDELK | 2.30 | 6248.89 | 1224.2 | 145284.0 | 1
 | Normal-2 | SIFSAVLDELK | 1.99 | 6278.55 | 1223.4 | 6218.5 | 1
 | Normal-2 | SIFSAVLDELK | 2.35 | 6341.52 | 1222.3 | 14101.8 | 1
 | Normal-2 | SIFSAVLDELK | 1.98 | 6441.91 | 1223.1 | 2800.2 | 1
 | Normal-3 | SIFSAVLDELK | 2.56 | 6149.98 | 1224.4 | 421560.0 | 1
 | Normal-3 | SIFSAVLDELK | 2.37 | 6154.65 | 1222.3 | 23456.6 | 1

S1157004 | Normal-1 | VDFPQDQLTALTGR | 2.77 | 3724.74 | 1565.1 | 106099.0 | 1
 | Normal-2 | VDFPQDQLTALTGR | 2.33 | 3647.59 | 1562.4 | 143286.0 | 1
 | Normal-1 | VDFPQDQLTALTGR | 2.46 | 3779.98 | 1562.2 | 75465.1 | 1
 | HCC-3 | VDFPQDQLTALTGR | 2.10 | 3854.18 | 1562.3 | 34323.8 | 1
 | Normal-2 | VDFPQDQLTALTGR | 2.07 | 3695.20 | 1562.7 | 159244.0 | 1
 | Normal-3 | VDFPQDQLTALTGR | 2.07 | 3825.73 | 1562.4 | 69159.3 | 1
 | Normal-3 | VDFPQDQLTALTGR | 2.24 | 3775.23 | 1562.9 | 71196.3 | 1
 | HCC-1 | VDFPQDQLTALTGR | 2.02 | 3930.66 | 1562.2 | 25175.7 | 1
 | HCC-2 | VDFPQDQLTALTGR | 1.95 | 3977.91 | 1562.2 | 12849.6 | 1
 | Normal-2 | M#WLSSMCSMRSAR | 1.29 | 3629.94 | 1564.7 | 86816.8 | 1
 | HCC-1 | VDFPQDQLTALTGR | 2.20 | 3907.36 | 1564.2 | 19403.5 | 1

S65002 | HCC-1 | EILVGDVGQTVDDPYATFVK | 3.81 | 4619.64 | 1084.4 | 21166.7 | 1
 | HCC-3 | EILVGDVGQTVDDPYATFVK | 3.70 | 4573.01 | 1084.7 | 46939.4 | X
 | HCC-3 | EILVGDVGQTVDDPYATFVK | 2.32 | 4579.55 | 1083.5 | 22077.0 | 1
 | Normal-1 | EILVGDVGQTVDDPYATFVK | 2.87 | 4516.45 | 1084.0 | 26598.5 | 1
 | Normal-2 | EILVGDVGQTVDDPYATFVK | 4.08 | 4461.41 | 1084.5 | 19416.7 | X
 | Normal-2 | EILVGDVGQTVDDPYATFVK | 3.49 | 4514.74 | 1084.7 | 91100.7 | X
 | Normal-3 | EILVGDVGQTVDDPYATFVK | 3.37 | 4548.32 | 1084.5 | 23254.4 | 1

These spectra were clustered by the proposed Q-FISH algorithm. In spectral set S366006, all spectra were identified as the same peptide sequence, "SIFSAVLDELK", by both SEQUEST and SpectraST, while two spectra in S1157004 were not identified by SEQUEST (XCorr < 2.11). All spectra in S65002 were identified by SEQUEST with high scores, while only four were identified by SpectraST. If we relied only on SEQUEST or SpectraST, these spectra in S1157004 or S65002 would be excluded.

Similarly, Figures 2(b) and 2(c) each show the spectral patterns of the reference spectrum and the experimental spectra of a single cluster. It should be noted that the overall patterns look quite similar and that all spectra pairs are characterized by high correlation coefficients. However, while all spectra in S1157004 could be identified by SpectraST, two of the eleven spectra could not be identified by SEQUEST, as shown in Table 1. Conversely, all spectra in S65002 were identified by SEQUEST with high scores, while three spectra could not be identified by SpectraST. In other words, if we relied only on conventional peptide identification such as SEQUEST or SpectraST, these spectra would have been excluded despite their similar peak patterns. Our Q-FISH algorithm, on the other hand, was able to include these spectra without a loss of information.
In this study, we were interested in identifying proteins and characterizing their differential expression in normal and HCC samples. Hence, we first focused on the 2,323 clusters that were observed in both samples. Figure 3 and Table 2 show a scatter plot and a correlation matrix, respectively, of the number of spectra in the same cluster, obtained from the replicated experiments on HCC and normal tissue samples. It is worth noting that the number of spectra in the same cluster showed high correlations within replicated samples (0.7178 to 0.8315), while the number of spectra between different samples showed weak correlations (0.0654 to 0.1549). For a given spectral set, the reference spectrum was estimated by averaging the relative intensities of the spectra. Consequently, the reference spectrum corresponds to the number of expressed spectra in the normal and HCC samples. We computed the false clustering rate (FCR) on the 2,323 clusters shared by the HCC and normal samples. Among these clusters, 1,571 clusters had FCRs smaller than 0.05. Our next step was to perform a beta-binomial test to isolate differentially expressed peptides (DEPs) [21]. The result showed that only 84 of the 1,571 reference spectra were characterized by different spectral counts between the HCC and normal tissue samples. In addition, Q-FISH found 5,777 clusters only in the HCC sample and 6,648 clusters only in the normal sample. Among these clusters, 1,571 and 1,556 clusters, respectively, had FCRs smaller than 0.05.
Figure 3

Scatter plot between different samples and within replicated samples. This figure shows scatter plots of the number of spectra in clustered sets obtained from the replicated experiments on HCC and normal tissue samples. The two black boxes show the relationships among the replicates of the HCC samples and among the replicates of the normal samples, while the gray box shows the relationships between HCC and normal samples.

Table 2

Correlation matrix and the number of shared spectral clusters between different samples and within replicated samples

 | HCC1 | HCC2 | HCC3 | Normal1 | Normal2 | Normal3
HCC1 | 1.0000[a] (4,319)[b] | 0.8315 (2,117)[c] | 0.8125 (2,142)[c] | 0.1549 (1,108)[c] | 0.0828 (929)[c] | 0.1088 (1,022)[c]
HCC2 | | 1.0000 (4,144)[b] | 0.8048 (2,032)[c] | 0.1232 (1,025)[c] | 0.0654 (894)[c] | 0.0899 (947)[c]
HCC3 | | | 1.0000 (4,461)[b] | 0.1394 (1,144)[c] | 0.0654 (947)[c] | 0.0911 (1,061)[c]
Normal1 | | | | 1.0000 (4,710)[b] | 0.7178 (2,280)[c] | 0.7449 (2,286)[c]
Normal2 | | | | | 1.0000 (4,863)[b] | 0.7302 (2,128)[c]
Normal3 | | | | | | 1.0000 (4,560)[b]

a: correlation coefficient, b: # of spectral clusters, and c: # of shared spectral clusters.

This table shows the correlation matrix of the number of spectra in the same cluster, between different samples and within replicated samples. The number of spectra in the same cluster within replicated samples showed high correlations, while the number of spectra between different samples showed weak correlations.
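As a rough illustration of how the entries of Table 2 can be obtained, the sketch below computes, for a pair of samples, the Pearson correlation of per-cluster spectral counts and the number of shared clusters. The counts are hypothetical toy data, and computing the correlation over all clusters (including those with zero counts in one sample) is an assumption; the text does not state which clusters enter the calculation.

```python
from statistics import mean

# Hypothetical toy data: spectral counts of four clusters in each sample.
counts = {
    "HCC1":    [5, 1, 0, 2],
    "HCC2":    [4, 1, 1, 2],
    "Normal1": [0, 3, 2, 2],
}

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def shared_clusters(x, y):
    """Number of clusters with at least one spectrum in both samples."""
    return sum(1 for a, b in zip(x, y) if a > 0 and b > 0)

for a, b in [("HCC1", "HCC2"), ("HCC1", "Normal1")]:
    r = pearson(counts[a], counts[b])
    n = shared_clusters(counts[a], counts[b])
    print(f"{a} vs {b}: r = {r:.4f}, shared clusters = {n}")
```

With these toy counts the replicate pair correlates strongly while the HCC and normal pair does not, which is the qualitative pattern reported in Table 2.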

In order to compare the performance of Q-FISH with the spectral counting method based on SEQUEST, we used the human liver data and validated the results through a literature search. For the human liver data, Q-FISH provided 1,571 differentially expressed clusters for the HCC sample and 1,556 for the normal sample, among which 57 and 99 clusters were identified by SEQUEST in the HCC and normal samples, respectively. SEQUEST, on the other hand, provided 93 and 145 peptides for the HCC and normal tissue samples, respectively. Among the 57 identified clusters in the HCC sample, 37 were found to be over-expressed only by Q-FISH, and 20 peptides/clusters were found by both Q-FISH and SEQUEST; 73 peptides were identified only by SEQUEST. In the normal sample, 49 peptides/clusters were identified as over-expressed by both Q-FISH and SEQUEST, while 50 and 96 peptides/clusters were identified as over-expressed only by Q-FISH and only by SEQUEST, respectively. We compared the two sets of results through a literature search.
We assumed that a match is true if the peptide was reported in the previous cancer literature. While there is a certain degree of uncertainty in reported protein biomarkers, this assumption is not biased toward either of the two methods and allowed us to statistically compare their performance. For example, alpha-2-macroglobulin (A2M), annotated by "VSVQLEASPAFLAVPVEK", was reported to be over-expressed in HCC sample [22]. This peptide was found to be over-expressed by Q-FISH, but under-expressed by spectral counting analysis by SEQUEST. The full list of peptides is given in Additional file 1. Based on this report, the 2 × 2 confusion tables can be constructed as shown in Table 3.
Table 3

2 × 2 tables for literature search results of Q-FISH and SEQUEST

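Literature | Q-FISH HCC | Q-FISH Normal | Q-FISH Total | SEQUEST HCC | SEQUEST Normal | SEQUEST Total
Over-Expressed | 25 | 17 | 42 | 34 | 26 | 60
Under-Expressed | 6 | 17 | 23 | 9 | 24 | 33
Total | 31 | 34 | 65 | 43 | 50 | 93

Accuracy: Q-FISH 64.62%, SEQUEST 62.37%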

We assume that a peptide reported in the previous literature is correctly identified. We compared the two results (Q-FISH and SEQUEST) through a literature search. Based on this report, the 2 × 2 tables above can be constructed.
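The accuracy values in Table 3 can be reproduced from the cell counts. In the sketch below, a peptide assigned to the HCC sample counts as correct when the literature reports over-expression, and a peptide assigned to the normal sample counts as correct when the literature reports under-expression; the dictionary layout and the function name `accuracy` are illustrative choices, not part of the original analysis.

```python
# Cell counts from Table 3: literature direction x sample in which the peptide was found.
table3 = {
    "Q-FISH":  {"over": {"HCC": 25, "Normal": 17}, "under": {"HCC": 6, "Normal": 17}},
    "SEQUEST": {"over": {"HCC": 34, "Normal": 26}, "under": {"HCC": 9, "Normal": 24}},
}

def accuracy(cells):
    """Share of peptides whose sample assignment agrees with the literature."""
    correct = cells["over"]["HCC"] + cells["under"]["Normal"]
    total = sum(sum(row.values()) for row in cells.values())
    return correct / total

for method, cells in table3.items():
    print(f"{method}: accuracy = {100 * accuracy(cells):.2f}%")
```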

For the Q-FISH result, 65 peptides were found in the literature: 31 for the HCC sample and 34 for the normal sample. Among the 31 peptides for the HCC sample, 25 are reported as over-expressed in the literature and are assumed to be correctly identified. Among the 34 peptides for the normal sample, 17 are reported as under-expressed in the literature and are thus assumed to be correctly identified. The remaining 6 and 17 peptides, respectively, are assumed to be incorrectly identified. For the SEQUEST result, 93 peptides were reported in the literature: 43 for the HCC sample and 50 for the normal sample. Among them, 34 and 24 peptides were correctly identified, while 9 and 26 peptides were incorrectly identified. Based on these numbers, the accuracy of Q-FISH (64.62%) is slightly higher than that of SEQUEST (62.37%). This comparison showed that Q-FISH performed as reliably as SEQUEST, despite the comparison giving SEQUEST a natural advantage. Table 4 provides a list of potential protein biomarkers. Q scores were calculated by averaging the correlation coefficients between the moving averages of the reference spectrum and of the experimental spectra in the clustered spectral set. A relatively high Q score indicates that the reference spectrum represents the clustered spectral set well.
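The reference spectrum and the Q score described in the text can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: spectra are taken to be equal-length binned intensity vectors, "relative intensity" is assumed to mean scaling each spectrum by its maximum peak, and the window size of 3 is hypothetical.

```python
from statistics import mean

def moving_average(values, window=3):
    """Moving-window average over a binned intensity vector."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5

def relative(spectrum):
    """Scale intensities by the largest peak (assumes at least one nonzero peak)."""
    top = max(spectrum)
    return [v / top for v in spectrum]

def reference_spectrum(cluster):
    """Bin-wise average of the relative intensities of all spectra in a cluster."""
    scaled = [relative(s) for s in cluster]
    return [mean(column) for column in zip(*scaled)]

def q_score(cluster, window=3):
    """Mean correlation between smoothed reference and smoothed member spectra."""
    ref = moving_average(reference_spectrum(cluster), window)
    return mean([pearson(ref, moving_average(relative(s), window)) for s in cluster])

# Hypothetical cluster of two similar binned spectra.
toy_cluster = [[0, 5, 9, 2, 0, 7, 1, 0], [0, 6, 8, 2, 1, 7, 1, 0]]
print(f"Q score = {q_score(toy_cluster):.2f}")
```

A Q score close to 1 in this sketch means every member spectrum follows the smoothed reference pattern closely, which matches the interpretation given for Table 4.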
Table 4

Lists of differentially expressed peptides in HCC and normal sample.

HCC sample
Related Cancer | Gene Name | Shogun Sequence | #(HCC)a | XCorr | Q Score | Protein Name | PMIDc

HCCAKR1B10IVENIQVFDFK22.040.95Aldo-keto reductase family 1 member B1020388846
ALBDVFLGMFLYEYAR22.040.96Putative uncharacterized protein ALB20658536
ECHDC3VIIISAEGPVFSSGHDLK22.140.95Isoform 1 of Enoyl-CoA hydratase domain-containing protein 3, mitochondrial21495032
EEF1A2THINIVVIGHVDSGK32.290.83Elongation factor 1-alpha 218161050
EEF2AYLPVNESFGFTADLR33.300.96Elongation factor 218161940
ENO1FTASAGIQVVGDDLTVTNPK332.510.61Isoform alpha-enolase of Alpha-enolase18813785
FGGEGFGHLSPTGTTEFWLGNEK23.160.95Isoform Gamma-B of Fibrinogen gamma chain19596924
FN1SSPVVIDASTAIDAPSNLR22.450.96Isoform 1 of Fibronectin16820872
FTCDEAQELSLPVVGSQLVGLVPLK22.980.99Isoform A of Formimidoyltransferase-cyclodeaminase18571811
GAPDHWGDAGAEYVVESTGVFTTMEK53.570.96Glyceraldehyde-3-phosphate dehydrogenase20714864
HBDFFESFGDLSSPDAVMGNPK22.370.96Hemoglobin subunit delta9214599
HMOX1ALDLPSSGEGLAFFTFPNIASATK22.820.90Heme oxygenase 120664735
HRSP12IEIEAVAIQGPLTTASL22.310.98Ribonuclease UK11418349270
HSPA5NQLTSNPENTVFDAK42.510.97HSPA5 protein19445531
HSPA9VINEPTAAALAYGLDK22.040.93Stress-70 protein, mitochondrial18334731
DIVMTQSPDSLAVSLGER22.520.99
HSPD1ALMLQGVDLLADAVAVTMGPK32.660.9560 kDa heat shock protein, mitochondrial21533669
NME1VMLGETNPADSKPGTIR22.570.97Isoform 1 of Nucleoside diphosphate kinase A17594820
EISLWFKPEELVDYK22.270.95
P4HBVDATEESDLAQQYGVR22.380.81Protein disulfide-isomerase21207424
PRDX6LIALSIDSVEDHLAWSK33.480.93Peroxiredoxin-619893992
TKTILATPPQEDAPSVDIANIR32.160.98cDNA FLJ54957, highly similar to Transketolase17321041
VCPLIVDEAINEDNSVVSLSQPK22.490.98Transitional endoplasmic reticulum ATPase12560433
VIMEMEENFAVEAANYQDTIGR33.280.99Vimentin19843643

breast cancerEEF1DSLAGSSGPGASSGTSGDHGELVVR23.170.93Elongation factor 1-delta17997862
HBBGTFATLSELHCDK22.090.97Hemoglobin subunit beta20097481

colon cancerACTN1GYEEWLLNEIR32.030.99Alpha-actinin-117898132
ACLISLGYDVENDR22.090.94
ATP5BDQEGQDVLLFIDNIFR22.580.98ATP synthase subunit beta, mitochondrial20080835
HMGCS2LMFNDFLSASSDTQTSLYK32.870.93Hydroxymethylglutaryl-CoA synthase, mitochondrial16940161

colorectal cancerATP5A1NVQAEEMVEFSSGLK22.650.95ATP synthase subunit alpha, mitochondrial9261598
EVAAFAQFGSDLDAATQQLLSR32.880.87

LeukemiaIDH1SIEDFAHSSFQMALSK22.530.97Isocitrate dehydrogenase [NADP] cytoplasmic21205756

pancreatic cancerEPPK1LLEAQIATGGVIDPVHSHR22.640.97epiplakin 118498355

lung cancerFGBDNENVVNEYSSELEK32.570.97Fibrinogen beta chain20142248

cell migration.FLNBYAPTEVGLHEMHIK22.020.97Isoform 1 of Filamin-B20110358

XRCC5YAPTEAQLNAVDALIDSMSLAK53.600.94ATP-dependent DNA helicase 2 subunit 2
AP1B1LAPPLVTLLSAEPELQYVALR22.810.99Isoform A of AP-1 complex subunit beta-1
PLECAGTLSITEFADMLSGNAGGFR22.160.89Isoform 1 of Plectin-1
SDHAF2PAPEIFENEVMALLR32.410.93Protein EMI5 homolog, mitochondrial
TUBA4AAFVHWYVGEGMEEGEFSEAR22.400.98Tubulin alpha-4A chain
AVFVDLEPTVIDEVR22.230.98
TYMPDVTATVDSLPLITASILSK32.840.93Thymidine phosphorylase
UGP2TLDGGLNVIQLETAVGAAIK22.980.94Isoform 1 of UTP--glucose-1-phosphate uridylyltransferase
TUBBAILVDLEPGTMDSVR21.970.95Tubulin beta chain
TPIVTNGAFTGEISPGMIK22.520.95Triosephosphate isomerase (Fragment)
UnknownLFIGGLSFETTEESLR22.640.97Putative uncharacterized protein HNRNPA2B1
SVPTSTVFYPSDGVATEK32.770.93cDNA FLJ54957, highly similar to Transketolase
RHVFGESDELIGQK22.090.96
VFSNGADLSGVTEEAPLK22.240.90PRO2275

Normal sample

Related Cancer | Gene Name | Shogun Sequence | #(normal)b | XCorr | Q Score | Protein Name | PMIDc

HCCA2MVSVQLEASPAFLAVPVEK22.360.93Alpha-2-macroglobulin18959789
LLLQQVSLPELPGEYSMK32.250.96
ACTA2YPIEHGIITNWDDMEK32.420.96Actin, aortic smooth muscle21214675
ALBRPCFSALEVDETYVPK22.180.90Putative uncharacterized protein ALB20658536
ALDH2VAEQTPLTALYVANLIK22.550.86Aldehyde dehydrogenase, mitochondrial20186752
ALDH6A1ENTLNQLVGAAFGAAGQR22.460.89Methylmalonate-semialdehyde dehydrogenase [acylating], mitochondrial17786358
LFIHESIHDEVVNR22.610.96
VNAGDQPGADLGPLITPQAK23.270.98
ALDOBGILAADESVGTMGNR32.400.85Fructose-bisphosphate aldolase B17786358
ELSEIAQSIVANGK22.320.96
ASLINVLPLGSGAIAGNPLGVDR33.180.76Argininosuccinate lyase19138817
ASS1NPWSMDENLMHISYEAGILENPK22.740.96Argininosuccinate synthase20104527
BHMTISGQEVNEAACDIAR22.230.62Betaine--homocysteine S-methyltransferase 119960509
AGPWTPEAAVEHPEAVR22.620.93
C5orf33VATQAVEDVLNIAK22.230.97Isoform 2 of UPF0465 protein C5orf3321495032
CATGAGAFGYFEVTHDITK22.170.78Catalase21324921
FNTANDDNVTQVR22.400.92
FGGAIQLTYNPDESSKPNMIDAATLK33.760.92Fibrinogen gamma chain17018627
ETFALEVAPISDIIAIK52.730.89Electron transfer flavoprotein alpha-subunit20515076
CPS1TVLMNPNIASVQTNEVGLK32.420.99Isoform 1 of Carbamoyl-phosphate synthase [ammonia], mitochondrial12143053
FLGVAEQLHNEGFK32.670.97
AVNTLNEALEFAK22.580.96
VLGTSVESIMATEDR32.220.88
IEFEGQPVDFVDPNK22.520.98
GLNSESMTEETLK22.630.95
CYP3A7EMVPIIAQYGDVLVR22.370.80Cytochrome P450 variant 3A717978482
DCIDADVQNFVSFISK32.200.99Isoform 1 of 3,2-trans-enoyl-CoA isomerase, mitochondrial1903293
ECHS1ALNALCDGLIDELNQALK23.330.98Enoyl-CoA hydratase, mitochondrial15492826
EIF5AMGPLVLTEVLFNEK52.410.83Eukaryotic translation initiation factor 519175833
FBP1LDVLSNDLVMNMLK72.340.72Fructose-1,6-bisphosphatase 119637194
FHSGLGELILPENEPGSSIMPGK32.200.98Isoform Mitochondrial of Fumarate hydratase, mitochondrial1958270
AAAEVNQDYGLDPK32.230.97
IYELAAGGTAVGTGLNTR22.240.97
FLNAASGPGLNTTGVPASLPVEFTIDAK32.680.97Isoform 2 of Filamin-A21471709
HPDSQIQEYVDYNGGAGVQHIALK22.990.984-hydroxyphenylpyruvate dioxygenase8558370
HSPA5SQIFSTASDNQPTVTIK22.160.97HSPA5 protein19445531
KRT8LKLEAELGNMQGLVEDFK592.080.43Keratin, type II cytoskeletal 818932288
PBLDVNTENLLQVENTGK22.330.94Phenazine biosynthesis-like domain-containing protein20525558
PDIA4EVSQPDWTPPPEVTLVLTK32.490.98Protein disulfide-isomerase A419016532
PEBP1GNDISSGTVLSDYVGSGPPK63.510.96Phosphatidylethanolamine-binding protein 120739083
PHBNITYLPAGQSVLLQLPQ32.560.86Prohibitin21318481
PRDX6ELAILLGMLDPAEK42.000.94Peroxiredoxin-619893992
SELENBP1NTGTEAPDYLATVDVDPK22.060.96cDNA FLJ55757, highly similar to Selenium-binding protein 121338716
SORBS1LTPVQVLEYGEAIAK22.641.00Isoform 9 of Sorbin and SH3 domain-containing protein 111374898
SORDLENYPIPEPGPNEVLLR21.990.97Sorbitol dehydrogenase12848999
STIP1ALSVGNIDDALQCYSEAIK22.540.97Stress-induced-phosphoprotein 117627933
TPI1VAHALAEGLGVIACIGEK23.350.99Isoform 2 of Triosephosphate isomerase18813785
TXNDC5ALAPTWEQLALGLEHSETVK34.010.98Thioredoxin domain-containing protein 516574106
Vκ3EIVLTQSPATLSLSPGER22.970.97Rheumatoid factor D5 light chain (Fragment)15207089
ADH1AFSLDALITHVLPFEK62.600.92Alcohol dehydrogenase 1A16054971
ELGATECINPQDYK22.150.94
ADH4ISEAFDLMNQGK42.950.94Isoform 2 of Alcohol dehydrogenase 416054971
GGVDFALDCAGGSETMK33.250.96
FNLDALVTHTLPFDK82.760.95
AAIAWEAGKPLCIEEVEVAPPK32.710.99
DLHKPIQEVIIELTK53.080.99

prostate cancerCOL6A2YGGLHFSDQVEVFSPPGSDR22.330.86Isoform 2C2A' of Collagen alpha-2(VI) chain18353764
LLTPITTLTSEQIQK32.570.93
VAVVTYNNEVTTEIR52.380.67
IEDGVPQHLVLVLGGK22.010.86
RPS27ATITLEVEPSDTIENVK22.230.98ubiquitin and ribosomal protein S27a precursor15647830

breast cancerEMILIN1LVGSGLHTVEAAGEAR22.470.96EMILIN-116243817
MYH9NLPIYSEEIVEMYK22.060.97Isoform 1 of Myosin-918796164
QLLQANPILEAFGNAK32.800.86Isoform 1 of Myosin-9
IAEFTTNLTEEEEK132.290.65Isoform 1 of Myosin-9

colon cancerALDH1A1GYFVQPTVFSNVTDEMR33.180.97Retinal dehydrogenase 121435460
ATP5BTVLIMELINNVAK53.180.88ATP synthase subunit beta, mitochondrial20080835
ETFAAAVDAGFVPNDMQVGQTGK22.130.98Electron transfer flavoprotein subunit alpha, mitochondrial16708797
GTSFDAAATSGGSASSEK62.530.86
ANXA6GLGTDEDTIIDIITHR22.480.98annexin VI isoform 221137014

LeukemiaGLUD1HGGTIPIVPTAEFQDR22.480.98Glutamate dehydrogenase 1, mitochondrial19683518
IDH2LNEHFLNTTDFLDTIK32.770.98Isocitrate dehydrogenase [NADP], mitochondrial21205756

gastric carcinomaHIST4H4TVTAMDVVYALK22.030.96Histone H419139817

colorectal cancerRRBP1TLQEQLENGPNTQLAR22.740.88Isoform 3 of Ribosome-binding protein 119425502

pancreatic cancerARG1TGLLSGLDIMEVNPSLGK42.710.91Isoform 1 of Arginase-121347333
CALM1VFDKDGNGYISAAELR32.500.93Calmodulin18852131
EAFSLFDKDGDGTITTK22.620.98

ovarian cancerHAAOTQGSVALSVTQDPACK22.560.91Isoform 1 of 3-hydroxyanthranilate 3,4-dioxygenase19724865

Lung cancerACY1TVQPKPDYGAAVAFFEETAR22.500.99cDNA FLJ60317, highly similar to Aminoacylase-18394326

cell migration.FLNBLVSPGSANETSSILVESVTR23.210.99Isoform 1 of Filamin-B19915675
UGP2ILTTASSHEFEHTK23.300.93Isoform 1 of UTP--glucose-1-phosphate uridylyltransferase
IQRPPEDSIQPYEK42.380.95
ALDH4A1EEIFGPVLSVYVYPDDKYK33.340.95Delta-1-pyrroline-5-carboxylate dehydrogenase, mitochondrial
COL14A1HFLENLVTAFDVGSEK32.390.77Isoform 1 of Collagen alpha-1(XIV) chain
DCTN2LLGPDAAINLTDPDGALAK22.240.94dynactin 2
EEF1B2SPAGLQVLNDYLADK32.860.84Elongation factor 1-beta
GRHPRIAAAGLDVTSPEPLPTNHPLLTLK33.110.99Glyoxylate reductase/hydroxypyruvate reductase
HSD17B10VMTIAPGLFGTPLLTSLPEK32.800.91Isoform 1 of 3-hydroxyacyl-CoA dehydrogenase type-2
PCBD1VHITLSTHECAGLSER22.540.96Pterin-4-alpha-carbinolamine dehydratase
PDIA6GSTAPVGGGAFPTIVER32.050.87Isoform 2 of Protein disulfide-isomerase A6
PTGR1HFVGYPTNSDFELK22.240.93Prostaglandin reductase 1
TGPLPPGPPPEIVIYQELR72.560.96
TFSAGWNIPIGLLYCDLPEPR32.650.97Serotransferrin
EDPQTFYYAVAVVK42.490.92
unknownPAHVVVGDVLQAADVDK22.880.9622 kDa protein

HCC and normal sample

Related Cancer | Gene Name | Shotgun Sequence | #(HCC)a / #(normal)b | XCorr | Q Score | Protein Name | PMIDc
HCC | CPS1 | MEYDGILIAGGPGNPALAEPLIQNVR | 2/11 | 3.92 | 0.91 | carbamoyl-phosphate synthetase 1 | 12143053
 |  | SIFSAVLDELK | 1/8 | 3.87 | 0.92 |  |
 |  | IAPSFAVESIEDALK | 3/13 | 2.96 | 0.85 |  |
 |  | TAVDSGIPLLTNFQVTK | 1/10 | 2.50 | 0.45 |  |
 | HBA1 | VADALTNAVAHVDDMPNALSALSDLHAHK | 1/8 | 3.67 | 0.93 | Hemoglobin subunit alpha 1 | 20572306
 |  | VGAHAGEYGAEALER | 4/13 | 2.05 | 0.94 |  |
 | P4HB | ILFIFIDSDHTDNQR | 10/15 | 2.88 | 0.49 | Protein disulfide-isomerase | 21207424
 | HNRNPC | MIAGQVLDINLAAEPK | 21/9 | 2.31 | 0.46 | Heterogeneous nuclear ribonucleoprotein C (C1/C2), isoform CRA_b | 20572306
 | PGK1 | VSHVSTGGGASLELLEGK | 16/8 | 3.36 | 0.46 | Phosphoglycerate kinase 1 | 19200351
 | ACTB | DLYANTVLSGGTTMYPGIADR | 10/3 | 3.25 | 0.96 | Actin, cytoplasmic 1 | 16493704
 | GSTA1 | NDGYLMFQQVPMVEIDGMK | 2/6 | 2.24 | 0.83 | Glutathione S-transferase | 20604928
 | FABP1 | SVTELNGDIITNTMTLGDIVFK | 17/6 | 3.43 | 0.78 | Fatty acid-binding protein | 12245374
 | CES1 | EGYLQIGANTQAAQK | 13/1 | 2.21 | 0.76 | Isoform 1 of Liver carboxylesterase 1 | 19658107
Breast Cancer | LGALS7/LGALS7B | LVEVGGDVQLDSVR | 19/1 | 2.35 | 0.65 | Galectin-7/p53-induced gene 1 protein | 20382700
 | HBB | FFESFGDLSTPDAVMGNPK | 39/67 | 2.72 | 0.74 | Hemoglobin subunit beta | 20097481
 | MDH2 | VDFPQDQLTALTGR | 4/7 | 2.49 | 0.93 | Malate dehydrogenase 2 | 19485423
 | MYH9 | LQQELDDLLVDLDHQR | 9/15 | 2.54 | 0.56 | Myosin, heavy polypeptide 9, non-muscle, isoform CRA_a | 18796164
Ovarian cancer | PSMA2 | YNEDLELEDAIHTAILTLK | 3/5 | 4.48 | 0.84 | Proteasome subunit alpha type-2 | 14960231
Lung cancer | AKR1A1 | DPDEPVLLEEPVVLALAEK | 3/5 | 3.16 | 0.63 | Aldo-keto reductase family 1 | 17114299
Chromophobe renal cell carcinomas | ATP5H | NLIPFDQMTIEDLNEAFPETK | 3/5 | 2.48 | 0.95 | ATP synthase subunit d, mitochondrial | 20440404
Leukemia | IGKC | VDNALQSGNSQESVTEQDSK | 3/6 | 3.95 | 0.92 | Ig kappa chain C region | 12357370
 | RPS7 | TLTAVHDAILEDLVFPSEIVGK | 5/3 | 3.92 | 0.92 | 40S ribosomal protein S7 |

a The number of spectral sets in HCC samples.

b The number of spectral sets in normal samples.

c The PubMed index for MEDLINE.

Table 4 lists the DEPs found in the HCC sample, the normal sample, and both samples. In the HCC and normal samples, 57 and 115 reference spectra, respectively, were identified by SEQUEST. Among these, 29 and 59 peptides were known biomarkers for human liver cancer. For the clusters found in both samples, we performed a beta-binomial test to find DEPs. The result shows that only 84 out of 1,571 reference spectra exhibit heterogeneity of spectral counts between the HCC and normal tissue samples. Among these 84 reference spectra, only 22 were identified by SEQUEST.

Lists of differentially expressed peptides in HCC and normal samples.

To find potential biomarkers in each sample, we searched the reference spectra of the clusters using SEQUEST. We found 50 and 95 peptides as candidate biomarkers in the HCC and normal samples, respectively, as shown in Table 4. Among them, 24 peptides in the HCC sample and 56 peptides in the normal sample are known biomarkers for human liver cancer. In addition, 22 reference spectra among the 84 DEPs were identified by SEQUEST, and 13 of these peptides are also known markers for human liver cancer.

As shown in Table 4, carbamoyl-phosphate synthetase 1 (CPS1) is annotated by several sequences, such as "MEYDGILIAGGPGNPALAEPLIQNVR", "SIFSAVLDELK", "TAVDSGIPLLTNFQVTK" and "GLNSESMTEETLK". These sequences are underexpressed in the HCC sample. Kinoshita et al. [23] performed differential gene display analysis (DGDA) to compare the intensities of polymerase chain reaction (PCR) products and evaluated the degree of mRNA expression in HCC tissue samples and noncancerous hepatitis tissues, and they confirmed that CPS1 is underexpressed. CPS1 synthesizes carbamoyl phosphate from bicarbonate, adenosine triphosphate (ATP) and ammonia, and a genetic mutation of CPS1 has been identified as a source of hyperammonemia. Underexpression of the CPS1 gene in HCC tissue had been reported in rats, but their study was the first to report such a finding in humans [23].

Heterogeneous nuclear ribonucleoprotein C (HNRNPC), annotated as "MIAGQVLDINLAAEPK", and actin, cytoplasmic 1 (ACTB), annotated as "DLYANTVLSGGTTMYPGIADR", were found to be over-expressed in the HCC sample [24,25]. In contrast, glutathione S-transferase (GSTA1), annotated as "NDGYLMFQQVPMVEIDGMK", was down-regulated in the human HCC sample [26]. Moreover, fatty acid-binding protein (FABP1), annotated as "SVTELNGDIITNTMTLGDIVFK", and Isoform 1 of Liver carboxylesterase 1 (CES1), annotated as "EGYLQIGANTQAAQK", are both characteristic of the HCC sample [27,28].

As shown in Table 4, many peptides are also known to be associated with other cancers. Specifically, EMILIN-1 (EMILIN1), elongation factor 1-delta (EEF1D), galectin-7/p53-induced gene 1 protein (LGALS7), hemoglobin subunit beta (HBB) and malate dehydrogenase 2 (MDH2) are differentially expressed in breast cancer cells [29-31]. In particular, the LGALS7 gene is known to be over-expressed relative to control cells, and it was likewise over-expressed in our HCC sample. Table 4 lists the types of cancer associated with specific genes [28-34].

Figure 4 shows a scatter plot of the spectral counts of the normal and HCC samples. The x axis and y axis represent the number of spectra in the HCC and normal samples, respectively. The symbol "▲" indicates DEPs identified by SEQUEST, whereas the symbol "●" indicates unidentified DEPs. Notably, 62 DEPs were not identified by SEQUEST despite their significant differences in the beta-binomial test.
Figure 4

Scatter plot of spectral counts between normal and HCC samples. This figure plots the number of spectra in the clustered sets in the HCC and normal samples. The x axis and y axis represent the number of spectra in the HCC and normal samples, respectively. The grey triangle indicates DEPs identified by SEQUEST, whereas the black circle indicates unidentified DEPs.

We believe there are several reasons why 62 DEPs were not identified by SEQUEST. First, the "one-size-fits-all" search parameter values of SEQUEST may not have been appropriate for these targets. Second, the unidentified DEPs may carry other post-translational modifications or sequence variations (e.g., from alternative splicing), or may provide insufficient peptide ion information. We re-ran SEQUEST with several different parameter options, allowing phosphorylation and two missed cleavages, and using other sequence databases (NCBI nr and human EST). However, even with these parameter options, SEQUEST did not identify the remaining 62 DEPs. Next, we tried to identify the 62 reference spectra using other search engines, namely MASCOT and SpectraST. MASCOT identified 2 DEPs, Alcohol dehydrogenase 1A (ADH1A) and Isoform 2 of Myosin-9 (MYH9), but SpectraST did not identify any. The remaining 60 DEPs could not be identified by these search engines. Further experiments may be needed to identify them. For example, additional MS/MS experiments such as MRM (Multiple Reaction Monitoring) or SRM (Selected Reaction Monitoring) can be carried out within the range of the corresponding retention times for all the unidentified spectra in order to collect more detailed peptide information.

Conclusions

In this paper, we proposed a novel method to estimate peptide abundance by counting MS/MS spectra clustered through the direct comparison of all experimentally observed spectra. For a given pair of spectra, our method can answer the question of whether they come from the same peptide without computationally searching for them in a theoretical library of protein spectra. By examining all possible pair-wise comparisons, our method yields a set of spectra for each peptide and enables us to estimate the amount of each peptide in biological samples of interest by counting the spectra in its cluster. Since our method compares all possible pairs of experimental spectra, it can discover even modified and unknown peptides, which may not be searchable in a theoretical spectral library. For practical MS/MS experimental data, a large proportion of spectra are often misidentified or completely lost during a computational database search. In contrast, Q-FISH can identify these spectra without any loss of information. As demonstrated in our practical examples, the majority of DEPs derived by Q-FISH, which were not discovered by other methods, were found to be highly related to various cancers. We thus believe our Q-FISH algorithm will be highly useful in the identification of novel peptides [19].

Q-FISH also has the potential to find applications in many other practical proteomic studies. For example, it can be used to discover unknown biomarkers or drug targets by comparing proteins with statistically significant differences and by quantifying sets of identical peptides in multiple samples. Unknown spectral clusters can often come from non-peptide contaminants, as revealed by a recent publication [35]. Q-FISH can evaluate the significance of such unknown clusters, some of which may be novel biomarkers requiring further experimental confirmation by de novo sequencing, unrestricted sequence database search (using e.g. InsPecT [36]) or spectral library search (using e.g. pMatch [37]).

Methods

Sample Preparation, Nano-LC-ESI-MS/MS

Tissue samples of hepatocellular carcinoma (HCC) tumour tissue and adjacent healthy liver tissue were collected under the guidelines of the Institutional Review Board (IRB) established at Yonsei Medical Center (Seoul, Korea). All tissues were prepared, and in-solution tryptic digestion was subsequently performed, as previously described [20]. Nano-LC-MS/MS analysis was performed on an Agilent Nano HPLC 1100 system using a linear trap quadrupole (LTQ) mass spectrometer (Thermo Electron, San Jose, US), as previously described [38]. Peptide fractionation was performed by strong cation exchange (SCX) chromatography at a flow rate of 0.5 mL/min, with the absorbance of the column effluent monitored at 280 nm for 40 min. Fractions were automatically transferred every 0.5 min into a 96-well microplate. Nano-LC-MS/MS experiments were carried out three times on two different samples (human liver cancer and normal tissue), generating 44,318 MS/MS spectra.

These tandem mass spectrometry data were first analyzed with the database search software SEQUEST (Bioworks 3.2, ThermoFinnigan, San Jose, US). The sequence database was the International Protein Index (IPI) human version 3.61, downloaded from the European Bioinformatics Institute (EBI), and it was combined with its reversed sequences. The maximum number of missed cleavage sites was set to 1, and only tryptic cleavage after arginine and lysine was allowed. The mass tolerance of the precursor peptide ion was set to 3.0 Da, and the fragment ion tolerance to 0.5 Da. These tolerance values were chosen to minimize FDR when XCorr > 1.5 [39]. Carboxyamidomethylation of cysteine and oxidation of methionine were allowed as modifications [40]. All peptides assigned to reversed sequences were removed before peptide identification to limit false-positive identifications. We set the XCorr thresholds to 1.44 (+1), 1.97 (+2) and 3.13 (+3), each of which yielded an FDR close to 0.05 for the respective charge state, and required DeltaCn to be greater than or equal to 0.1. These score criteria were chosen to ensure high confidence in the protein identification results [41].

The spectra were also analyzed with the spectral library search software SpectraST, which was initially developed by the Institute for Systems Biology (ISB) and the National Institute of Standards and Technology (NIST). SpectraST is integrated with the Trans-Proteomic Pipeline (TPP) software suite, which provides the supporting functionality needed in a full proteomics data analysis pipeline. SpectraST results were validated against the NIST Human IT Library using SpectraST scores > 0.9 [18,38,42]. The precursor tolerance was set to 1.5 Da/z (Thomson).
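The way a score threshold can be tied to a target FDR with a reversed-sequence (decoy) database can be illustrated with a short sketch. The text does not state which FDR estimator was used, so the ratio below (decoy hits divided by target hits at or above the threshold) is an assumption based on one common convention, and the function names and the `(score, is_decoy)` tuple layout are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR at a score threshold as decoy hits / target hits.

    psms is a list of (score, is_decoy) tuples; this estimator is an
    assumed convention, not necessarily the one used in the paper.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def lowest_threshold_for_fdr(psms, max_fdr=0.05):
    """Return the smallest observed score whose estimated FDR is <= max_fdr."""
    for threshold in sorted({score for score, _ in psms}):
        if estimate_fdr(psms, threshold) <= max_fdr:
            return threshold
    return None


# Toy data: XCorr-like scores, True marks a hit to a reversed sequence.
example_psms = [(3.5, False), (3.0, False), (2.5, False), (2.2, True),
                (2.0, False), (1.5, True), (1.2, False)]
chosen = lowest_threshold_for_fdr(example_psms, max_fdr=0.05)
```

In practice such a search would be run separately for each precursor charge state, which is how per-charge thresholds like those quoted above could arise.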

Q-FISH algorithm for direct comparison of experimental spectra

We assumed that MS/MS spectra from the same peptide would present similar patterns. Under this assumption, the proposed Q-FISH algorithm can be applied to find DEPs both in normal and disease samples. As shown in Figure 1, to evaluate the similarities between two spectra, we use a correlation coefficient of the moving window averages. The analytical process is summarized as follows:

1. Scale Standardization

Perform scale standardization by dividing the intensity values by their maximum value.

2. Moving average

Compute the moving window average over the spectra using a window of fixed size.

3. Correlation index for moving average-based peak patterns

Calculate a summary statistic based on the correlation coefficient of the moving averages between two spectra.

4. Spectral count-based quantification using two-stage clustering

Cluster duplicated peptides with similar peak patterns and retention time using a two-stage clustering method.

5. Identification of differentially expressed peptides

Employ the beta-binomial test to identify DEPs among the experimental groups.

Similarity measure between pairs of MS/MS spectra

Scale standardization

Because the intensities of the acquired spectra may differ for various physical and chemical reasons, such as inconsistencies in the total ion current, we cannot use the raw intensities of the m/z peaks directly. We therefore used a scale-standardization method, which divides the intensities of all m/z peaks in a spectrum by their maximum value. Let x[i] be the intensity of the i-th m/z peak. Then the scale-standardized intensity, y[i], is defined by

$$y[i] = \frac{x[i]}{\max_{k} x[k]}$$

Moving window average

To reduce the background noise of the peak intensities, the moving window average (MWA) is used. The simplest moving average is the unweighted (uniformly weighted) average of the data points within a given window, whereas the weighted moving window average (WMWA) multiplies each data point by a weight factor before averaging. Among the various options for the weights of the WMWA, we selected the "Gaussian" kernel, which uses the probability density function (pdf) of the standard Gaussian distribution (mean 0, variance 1) as the weight function. For a given spectrum, the MWA is calculated by averaging the peak intensities within the sliding window sequentially for all m/z peaks. In other words, the MWA is not a single value but a set of averages. The next step is to calculate the correlation between the MWAs of two spectra and determine whether the spectra come from the same peptide.

We assume that there are N moving windows of fixed size K along the entire m/z range. The WMWA for the i-th moving window (i = 1, 2, ..., N) is defined by

$$\mathrm{WMWA}_i = \sum_{j=1}^{K} w_j \, y[i+j]$$

where y[i + j] is the j-th scale-standardized intensity in the i-th moving window and the $w_j$ are the weights. For the uniform kernel, $w_j = 1/K$; for the Gaussian kernel, $w_j = \phi(z_j)$, where $\phi$ is the pdf of the standard Gaussian distribution and $z_j$ is the value of y[i + j] standardized using the mean and variance of the m/z values in the i-th window. The total number of windows, N, is determined by the fixed window size K and the entire m/z range (200-2000 Da).

To determine the optimal window size, we randomly selected pairs of spectra from the same and from different peptides using a target-decoy sequence database, and performed receiver operating characteristic (ROC) analysis. Based on the ROC analysis, we chose a window size of K = 30 (3.0 Da) and accordingly N = 19,771 (20-2000 Da at intervals of 0.1 Da). However, the areas under the curve (AUC) did not differ much across window sizes, so the results were not very sensitive to this choice.
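The two preprocessing steps can be sketched in a few lines. The sketch assumes the spectrum has already been binned onto a regular m/z grid (for example 0.1 Da bins), which is not shown, and it implements only the uniform kernel; the function names are illustrative rather than taken from any Q-FISH implementation. The window count follows from the grid: 20-2000 Da at 0.1 Da intervals gives 19,800 bins, and sliding a window of K = 30 bins over them gives 19,800 - 30 + 1 = 19,771 windows, which agrees with the N reported above.

```python
def scale_standardize(intensities):
    """Divide every peak intensity by the maximum intensity of the spectrum."""
    peak_max = max(intensities)
    if peak_max <= 0:
        raise ValueError("spectrum has no positive intensity")
    return [x / peak_max for x in intensities]


def moving_window_average(y, k):
    """Uniform-kernel moving average: one value per window of k consecutive bins."""
    if k < 1 or k > len(y):
        raise ValueError("window size must be between 1 and len(y)")
    window_sum = sum(y[:k])
    averages = [window_sum / k]
    for start in range(1, len(y) - k + 1):
        # Slide the window by one bin: add the new bin, drop the old one.
        window_sum += y[start + k - 1] - y[start - 1]
        averages.append(window_sum / k)
    return averages


def number_of_windows(n_bins, k):
    """Number of windows of k bins that fit on a grid of n_bins bins."""
    return n_bins - k + 1
```

A Gaussian kernel would replace the plain mean inside each window by a weighted sum with the weights described above.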

Correlation index for moving average-based peak patterns

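For peptides p and q, the correlation coefficient is computed as follows:

$$r_{pq} = \frac{\sum_{i=1}^{N} \left(\mathrm{WMWA}^{p}_{i} - \overline{\mathrm{WMWA}}^{p}\right)\left(\mathrm{WMWA}^{q}_{i} - \overline{\mathrm{WMWA}}^{q}\right)}{\sqrt{\sum_{i=1}^{N} \left(\mathrm{WMWA}^{p}_{i} - \overline{\mathrm{WMWA}}^{p}\right)^{2}} \, \sqrt{\sum_{i=1}^{N} \left(\mathrm{WMWA}^{q}_{i} - \overline{\mathrm{WMWA}}^{q}\right)^{2}}}$$

where $\overline{\mathrm{WMWA}}^{p}$ and $\overline{\mathrm{WMWA}}^{q}$ are the means of the moving window averages for peptides p and q. The closer the correlation coefficient is to 1, the more similar the two spectra are and the more likely they come from the same peptide.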
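As a minimal sketch, the correlation index can be computed as an ordinary Pearson correlation over two equally long vectors of moving window averages. The function name is illustrative, and returning 0.0 for a constant vector is a choice made here for convenience, not something specified in the text.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation between two equally long sequences of window averages."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have the same length, at least 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a flat profile carries no pattern to correlate (assumed convention)
    return covariance / math.sqrt(var_a * var_b)
```

Because the correlation is invariant to rescaling, multiplying all window averages of one spectrum by a constant does not change the result.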

Quantification by counting spectra in clustered spectra set from a homogenous peptide

Two-stage cluster analysis is used to group spectra with similar patterns into peptide sets. As assumed above, spectra with approximately the same shape are taken to come from the same peptide, so each cluster is expected to consist of spectra obtained from a homogeneous peptide. The two-stage clustering analysis employs two similarity measures: the first is the difference between precursor ions and the second is the correlation coefficient between two MWAs. MS/MS spectra obtained from the same peptide are theoretically expected to have similar precursor ions. First, clusters are defined in terms of pair-wise differences between the precursor ions: for any pair of precursor ions in the same cluster, the difference is smaller than a threshold value. In our analysis, we set ± 1 Da as the threshold. The next step is to perform a hierarchical clustering analysis within each of these clusters. Specifically, we employ "single linkage", also known as the nearest neighbour technique, with the correlation coefficient of the MWAs as the similarity measure. Because this two-stage clustering analysis yields clustered spectral sets consisting of MS/MS spectra from the same peptide, the amount of each peptide can be quantified by counting the spectra included in its clustered set. Lastly, a representative spectrum, called the "reference spectrum", can be defined for each spectral set as the average spectrum of that set, based on the basic patterns of the precursor ions.
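The two stages can be sketched as follows, under stated assumptions. Stage one here sorts spectra by precursor m/z and starts a new group whenever a spectrum lies more than the tolerance above the first member of the current group, which is one simple way to guarantee that all pairs in a group differ by at most the tolerance. Stage two relies on the fact that cutting a single-linkage dendrogram at similarity ρ is equivalent to taking connected components of the graph that links every pair with similarity of at least ρ. The `similarity` callable and all function names are hypothetical; ρ = 0.6 and the 1 Da tolerance are the values quoted in this paper.

```python
def group_by_precursor(precursors, tolerance=1.0):
    """Stage 1: group spectrum indices so every pair in a group is within tolerance."""
    order = sorted(range(len(precursors)), key=lambda i: precursors[i])
    groups = []
    for idx in order:
        if groups and precursors[idx] - precursors[groups[-1][0]] <= tolerance:
            groups[-1].append(idx)
        else:
            groups.append([idx])
    return groups


def single_linkage(members, similarity, rho):
    """Stage 2: single-linkage clusters, cut at similarity rho, via union-find."""
    parent = {m: m for m in members}

    def find(m):
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for pos, a in enumerate(members):
        for b in members[pos + 1:]:
            if similarity(a, b) >= rho:
                parent[find(a)] = find(b)

    clusters = {}
    for m in members:
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())


def two_stage_clusters(precursors, similarity, tolerance=1.0, rho=0.6):
    """Run both stages; the spectral count of a cluster is simply its length."""
    clusters = []
    for group in group_by_precursor(precursors, tolerance):
        clusters.extend(single_linkage(group, similarity, rho))
    return clusters
```

In a full pipeline, `similarity(a, b)` would return the correlation coefficient between the moving window averages of spectra a and b.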

Validation of the clustering results using retention times

It is well known that the same peptides tend to elute continuously within a limited liquid chromatography (LC) interval. Thus, the clustering results can be validated using retention time (RT) information. To this end, we propose a new measure that estimates the clustering error rate from the spectral RTs. Note that the Q-FISH results provide a list of clusters. If a cluster contains only spectra from the same peptide, their RTs should have similar values; if it contains spectra from different peptides, the RTs should differ. As a measure of similarity, we consider statistics representing the variability of RTs within a cluster, such as the coefficient of variation (CV) and the standard deviation (SD). Since RT varies widely across spectra, CV is a better measure than SD. Using CV, we propose a new measure called the false clustering rate (FCR), which is similar in spirit to the false discovery rate (FDR). It measures how often a cluster is composed of spectra from different peptides. We use a threshold value of CV, Δ, to determine whether a cluster is well clustered: if the CV of a given cluster is smaller than Δ, we call it a good cluster. For a given value of Δ, FCR can be computed. The detailed procedure is as follows:

1) Calculate the coefficient of variation (CV) of the spectral RTs within each cluster from the Q-FISH results.
2) Permute the spectra while keeping the number of spectra in each cluster fixed.
3) Calculate the CV for each permuted cluster in the p-th permuted sample.
4) Compute FCR from these CV values, given the number of permutations P, the threshold value Δ, and the total number of clusters C.

For our HCC data, we computed FCR for various values of Δ, as summarized in Table 5. From our analysis, we chose Δ = 4.4, which yielded an FCR close to 0.05.
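The permutation procedure can be sketched as below. The exact FCR expression is not reproduced in this copy of the text, so the sketch assumes an FDR-style ratio: the average number of permuted clusters whose CV falls below Δ, divided by the number of observed clusters whose CV falls below Δ. That ratio, the expression of CV in percent (suggested by Δ values between 1 and 10 in Table 5), the skipping of single-spectrum clusters, and the function names are all assumptions of this sketch rather than the authors' definitions.

```python
import random
import statistics


def cv_percent(values):
    """Coefficient of variation in percent (sample SD / mean * 100)."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


def false_clustering_rate(cluster_rts, delta, n_permutations=200, seed=0):
    """Assumed FDR-style FCR: permuted 'good' clusters per observed 'good' cluster.

    cluster_rts is a list of clusters, each a list of positive retention times.
    Clusters with a single spectrum have no CV and are never counted as good.
    """
    rng = random.Random(seed)
    sizes = [len(cluster) for cluster in cluster_rts]
    pooled = [rt for cluster in cluster_rts for rt in cluster]

    def count_good(clusters):
        return sum(1 for c in clusters if len(c) >= 2 and cv_percent(c) < delta)

    observed_good = count_good(cluster_rts)
    if observed_good == 0:
        return 0.0

    permuted_good = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign RTs at random, keeping cluster sizes fixed
        start = 0
        permuted = []
        for size in sizes:
            permuted.append(pooled[start:start + size])
            start += size
        permuted_good += count_good(permuted)

    return min(1.0, permuted_good / (n_permutations * observed_good))
```

Sweeping `delta` over a grid and picking the value whose estimate is closest to 0.05 mirrors how a threshold such as Δ = 4.4 could be selected.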
Table 5

Validation for clustering result using the false clustering rate (FCR)

FCR using RT information |  | FCR for the cut-off value |
Δ | FCR | ρ | FCR
1 | 0.0288 | 0.0 | 1.0000
2 | 0.0307 | 0.1 | 0.9486
3 | 0.0380 | 0.2 | 0.8060
4 | 0.0467 | 0.3 | 0.6525
4.4 | 0.0500 | 0.4 | 0.4515
5 | 0.0553 | 0.5 | 0.3178
6 | 0.0639 | 0.6 | 0.0251
7 | 0.0719 | 0.7 | 0.0034
8 | 0.0806 | 0.8 | 0.0008
9 | 0.0895 | 0.9 | 0.0003
10 | 0.0981 | 1.0 | 0.0000

In order to validate the clustering results, we propose a new measure to estimate the clustering error rate using the spectral retention time (RT) information. We computed the false clustering rate (FCR) for various values of threshold Δ, as summarized. We also calculated FCR to determine the cut-off value of correlation coefficient for spectral clustering. We computed FCR for the various values of the given ρ, as summarized. We chose ρ = 0.6 which yielded FCR close to 0.05.

We also calculated FCR to determine the cut-off value of the correlation coefficient, ρ, for spectral clustering. For a given threshold value of ρ, FCR can be computed in a similar manner as for Δ. We computed FCR for various values of ρ, as summarized in Table 5, and chose ρ = 0.6, which yielded an FCR close to 0.05.

Differentially expressed peptides (DEPs)

To estimate peptide abundance in different samples, such as control and disease tissue samples, a spectral counting method like Q-FISH can be employed. Pham et al. [21] proposed the use of the beta-binomial distribution to test the significance of DEPs from spectral counts in label-free mass spectrometry-based proteomics. Their results showed that the beta-binomial test can be applied to experiments with one or more replicates, as well as to the comparison of multiple conditions. We applied the beta-binomial model to test for DEPs in the clustered spectral sets across three replicated MS/MS experiments.

Let x denote the spectral count of a clustered spectral set and n the total spectral count of all spectra in each sample. Assume that x follows a binomial distribution with true proportion π, 0 ≤ π ≤ 1:

$$P(x \mid \pi) = \binom{n}{x} \pi^{x} (1-\pi)^{n-x}$$

Rather than being fixed, π is treated as a random variable following the beta distribution with real parameters α > 0 and β > 0:

$$f(\pi \mid \alpha, \beta) = \frac{\pi^{\alpha-1} (1-\pi)^{\beta-1}}{B(\alpha, \beta)}$$

The marginal distribution of x is then the beta-binomial distribution [21],

$$P(x \mid \alpha, \beta) = \binom{n}{x} \frac{B(x+\alpha,\; n-x+\beta)}{B(\alpha, \beta)}$$

where B(·,·) is the beta function. The following parameterization is used:

$$\mu = \frac{\alpha}{\alpha+\beta} = h(Xb) = h(\eta), \qquad \phi = \frac{1}{\alpha+\beta+1}$$

where μ is the mean proportion, h is the inverse of the link function (logit or complementary log-log), X a design matrix, b a vector of fixed effects, η = Xb the linear predictor, and φ the overdispersion parameter. Based on this parameterization, the marginal mean and variance of the proportion x/n are:

$$E\left[\frac{x}{n}\right] = \mu, \qquad \mathrm{Var}\left[\frac{x}{n}\right] = \frac{1}{n}\, \mu (1-\mu) \left[1 + (n-1)\phi\right]$$

The parameters b and φ are estimated by maximizing the log-likelihood of the marginal model. Given the estimated coefficients, the hypothesis test reduces to whether the corresponding coefficient in b is 0 [43]. We also used Benjamini and Hochberg's method to correct for multiple testing of DEPs [44].
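The building blocks of this test can be illustrated with a small sketch: the beta-binomial probability mass function, the mapping from a mean proportion μ and overdispersion φ to (α, β) under the common parameterization μ = α/(α+β) and φ = 1/(α+β+1), and the Benjamini-Hochberg adjustment. It does not fit the regression model or perform the likelihood maximization; the function names are illustrative, and the parameterization is stated here as an assumption in case it differs in detail from the one used by the authors.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(x, n, alpha, beta):
    """P(x | n, alpha, beta) for the beta-binomial distribution."""
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))


def alpha_beta_from_mean_phi(mu, phi):
    """Map (mu, phi) to (alpha, beta), assuming mu = a/(a+b) and phi = 1/(a+b+1)."""
    total = (1.0 - phi) / phi  # alpha + beta
    return mu * total, (1.0 - mu) * total


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Summing x · P(x) over x = 0, ..., n and dividing by n recovers μ, which is a convenient check that the parameter mapping and the mass function agree.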

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

SML and MSK performed the statistical analysis and drafted the manuscript. HJL, YKP, and HT carried out mass spectrometry experiments. JKL and TP conceived of the study, and participated in coordination. All authors wrote, read and approved the final manuscript.

Additional file 1

Lists for identified peptides reported in the literature. In order to compare the performance of Q-FISH with the spectral counting method by SEQUEST, we used the human liver data and validated the results through literature search. For the human liver data, Q-FISH provided 1571 differentially expressed clusters for HCC sample and 1556 for normal sample, among which 57 and 99 clusters were identified by SEQUEST in HCC and normal samples, respectively. On the other hand, SEQUEST provided 93 and 145 peptides for HCC and normal tissue samples, respectively.
  41 in total

1.  Informatics platform for global proteomic profiling and biomarker discovery using liquid chromatography-tandem mass spectrometry.

Authors:  Dragan Radulovic; Salomeh Jelveh; Soyoung Ryu; T Guy Hamilton; Eric Foss; Yongyi Mao; Andrew Emili
Journal:  Mol Cell Proteomics       Date:  2004-07-21       Impact factor: 5.911

2.  Signal maps for mass spectrometry-based comparative proteomics.

Authors:  Amol Prakash; Parag Mallick; Jeffrey Whiteaker; Heidi Zhang; Amanda Paulovich; Mark Flory; Hookeun Lee; Ruedi Aebersold; Benno Schwikowski
Journal:  Mol Cell Proteomics       Date:  2005-11-03       Impact factor: 5.911

3.  Development and validation of a spectral library searching method for peptide identification from MS/MS.

Authors:  Henry Lam; Eric W Deutsch; James S Eddes; Jimmy K Eng; Nichole King; Stephen E Stein; Ruedi Aebersold
Journal:  Proteomics       Date:  2007-03       Impact factor: 3.984

4.  On the beta-binomial model for analysis of spectral count data in label-free tandem mass spectrometry-based proteomics.

Authors:  Thang V Pham; Sander R Piersma; Marc Warmoes; Connie R Jimenez
Journal:  Bioinformatics       Date:  2009-12-09       Impact factor: 6.937

5.  Overexpression of galectin-7, a myoepithelial cell marker, enhances spontaneous metastasis of breast cancer cells.

Authors:  Mélanie Demers; April A N Rose; Andrée-Anne Grosset; Katherine Biron-Pain; Louis Gaboury; Peter M Siegel; Yves St-Pierre
Journal:  Am J Pathol       Date:  2010-04-09       Impact factor: 4.307

Review 6.  A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics.

Authors:  Alexey I Nesvizhskii
Journal:  J Proteomics       Date:  2010-09-08       Impact factor: 4.044

7.  Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate.

Authors:  Ding Ye; Yan Fu; Rui-Xiang Sun; Hai-Peng Wang; Zuo-Fei Yuan; Hao Chi; Si-Min He
Journal:  Bioinformatics       Date:  2010-06-15       Impact factor: 6.937

8.  Underexpression of mRNA in human hepatocellular carcinoma focusing on eight loci.

Authors:  Moritoshi Kinoshita; Masahiko Miyata
Journal:  Hepatology       Date:  2002-08       Impact factor: 17.425

9.  Overexpression and elevated serum levels of phosphoglycerate kinase 1 in pancreatic ductal adenocarcinoma.

Authors:  Tsann-Long Hwang; Ying Liang; Kuan-Yi Chien; Jau-Song Yu
Journal:  Proteomics       Date:  2006-04       Impact factor: 3.984

10.  Identification of differentially expressed proteins in triple-negative breast carcinomas using DIGE and mass spectrometry.

Authors:  Daniela M Schulz; Claudia Böllner; Gerry Thomas; Mike Atkinson; Irene Esposito; Heinz Höfler; Michaela Aubele
Journal:  J Proteome Res       Date:  2009-07       Impact factor: 4.466
