
Estimating relative abundances of proteins from shotgun proteomics data.

Sean McIlwain, Michael Mathews, Michael S Bereman, Edwin W Rubel, Michael J MacCoss, William Stafford Noble.

Abstract

BACKGROUND: Spectral counting methods provide an easy means of identifying proteins with differing abundances between complex mixtures using shotgun proteomics data. The crux spectral-counts command, part of the Crux software toolkit, implements four previously reported spectral counting methods: the spectral index (SI(N)), the exponentially modified protein abundance index (emPAI), the normalized spectral abundance factor (NSAF), and the distributed normalized spectral abundance factor (dNSAF).
RESULTS: We compared the reproducibility and the linearity relative to each protein's abundance of the four spectral counting metrics. Our analysis suggests that NSAF yields the most reproducible counts across technical and biological replicates, and both SI(N) and NSAF achieve the best linearity.
CONCLUSIONS: With the crux spectral-counts command, Crux provides open-source modular methods to analyze mass spectrometry data for identifying and now quantifying peptides and proteins. The C++ source code, compiled binaries, spectra and sequence databases are available at http://noble.gs.washington.edu/proj/crux-spectral-counts.


Year:  2012        PMID: 23164367      PMCID: PMC3599300          DOI: 10.1186/1471-2105-13-308

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Existing methods for differential proteomics (reviewed by [1]) fall into two categories: spectral counting methods, which count the number of spectra that map to a given protein across multiple experiments, and peptide chromatographic peak intensity methods, which use the area under the peptide precursor ion peak as a measure of peptide abundance. In principle, methods based on mass spectrometry peak areas can be much more accurate, but they require highly reproducible liquid chromatography as well as accurate methods for chromatographic alignment and identification of peaks within the profile spectra. In contrast, spectral counting methods are straightforward to employ and have been shown to correctly detect known differences between samples [2], which contributes to their wide use. The command line tool crux spectral-counts implements four popular spectral counting methods: the spectral index (SIN) [3], the exponentially modified protein abundance index (emPAI) [4], the normalized spectral abundance factor (NSAF) [5], and the distributed normalized spectral abundance factor (dNSAF) [6]. The crux spectral-counts command is integrated within the Crux software toolkit, which provides actively maintained open-source methods to identify and now quantify peptides and proteins from shotgun mass spectrometry datasets. Crux supports a variety of input spectra formats, and the tools can easily be incorporated into proteomic analysis pipelines, such as the Trans-Proteomic Pipeline (TPP) [7]. Finally, the modular design of Crux allows improvements to one part of the toolkit to be propagated through downstream analyses.

Currently, several software packages offer spectral counting protein quantification methods [8]. ProteoIQ (http://www.bioinquire.com) and Scaffold [9] are commercial software products that post-process results from a variety of database search programs. Freely available tools such as APEX [10], emPAI Calc [11], and PepC [12] each offer a single spectral counting method. Table 1 compares the features of six spectral counting software tools. Crux offers more spectral counting methods than the other tools and is the only tool to provide peptide-level in addition to protein-level counts.
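Table 1

Spectral counting software

| | Crux | APEX | emPAI Calc | PepC | ProteoIQ | Scaffold |
| --- | --- | --- | --- | --- | --- | --- |
| Metrics provided | | | | | | |
| SIN | X | | | | | |
| emPAI | X | | X | | | |
| NSAF | X | | | | | |
| dNSAF | X | | | | | |
| Raw | X | | | | X | X |
| Other | | X | | X | X | |
| Other features | | | | | | |
| Parsimony analysis | X | | | | X | X |
| Peptide-level counting | X | | | | | |
| Free | X | X | X | X | | |
| Open source | X | X | | X | | |
| Web interface | | | X | X | | |
| Graphical user interface | | X | | | X | X |
| Scriptable | X | X | | X | | |

This table summarizes the features of various spectral counting software methods.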

Using crux spectral-counts, we compared and contrasted the reproducibility and linearity of the four spectral counting methods. Our experiments suggest that the NSAF metric provides the most reproducible protein quantification. In contrast, our linearity experiments show that SIN and NSAF provide the best performance, with dNSAF providing intermediate performance and emPAI yielding the worst linearity. The contributions of this paper are thus two-fold: we describe a performance comparison of the reproducibility and linearity of the SIN, emPAI, NSAF, and dNSAF protein quantification methods, and we provide to the proteomics community a flexible, open source spectral counting software tool.

Implementation

Software

The crux spectral-counts command is implemented as part of the Crux proteomics software toolkit [13]. The toolkit is implemented in C++ as a single binary that supports commands for database searching and a variety of downstream analyses [14-18]. The crux spectral-counts command takes as input a protein database in FASTA format and a collection of peptide-spectrum matches (PSMs) produced by a database search procedure. The PSMs may be in Crux’s tab-delimited text format, PeptideProphet’s PepXML, or mzIdentML [19]. To compute the SIN score, a set of spectra must also be provided as input in MS2, mzXML, or MGF format. By default, crux spectral-counts selects the PSMs in the input using a user-modifiable threshold of q-value ≤ 0.01. For each protein with at least one spectral count, the program then computes the NSAF, dNSAF, emPAI, or SIN score.

The NSAF metric is defined as

$$\mathrm{NSAF}_N = \frac{S_N / L_N}{\sum_{i=1}^{n} S_i / L_i},$$

where N is the protein index, $S_N$ is the number of spectra matched to protein N, $L_N$ is the length of protein N, and n is the total number of proteins in the input database. The dNSAF metric is given by

$$\mathrm{dNSAF}_N = \frac{\mathrm{dSAF}_N}{\sum_{i=1}^{n} \mathrm{dSAF}_i}, \qquad \mathrm{dSAF}_N = \frac{uSpC_N + \sum_{j=1}^{k} d_j \, sSpC_j}{L_N},$$

where $uSpC_N$ is the spectral count for the peptides uniquely mapping to protein N, $sSpC_j$ is the spectral count of degenerate peptide j (out of the protein’s k degenerate peptides) mapped to protein N, and $d_j$ is the distribution factor of peptide shared counts, defined by the equation

$$d_j = \frac{uSpC_N}{\sum_{m \in P_j} uSpC_m},$$

where $P_j$ is the set of proteins that share peptide j. The metric emPAI is defined as

$$\mathrm{emPAI}_N = 10^{\,U_N^{\mathrm{observed}} / U_N^{\mathrm{observable}}} - 1,$$

where $U_N^{\mathrm{observable}}$ is the number of unique peptides observable for protein N and $U_N^{\mathrm{observed}}$ is the number of unique peptides observed for protein N. Finally, the SIN score is calculated using

$$\mathrm{SI}_N = \frac{1}{L_N} \sum_{j=1}^{p} \sum_{k=1}^{s_j} i_k,$$

where p is the number of unique peptides in protein N, $s_j$ is the number of spectra assigned to peptide j, and $i_k$ is the total fragment ion intensity of spectrum k. Analogous scores can also be computed for each peptide, rather than for each protein. A detailed description of the peptide-level scoring metrics is available in the on-line documentation.
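
To make the count-based metrics concrete, here is a minimal Python sketch of NSAF, emPAI, and dNSAF. It is not the Crux implementation: the function names, the dictionary-based inputs, and the toy numbers are assumptions made for illustration, and the formulas follow the standard definitions from the cited papers. SIN is left out because it additionally needs fragment ion intensities read from the spectra.

```python
from collections import defaultdict


def nsaf(spectral_counts, lengths):
    """NSAF_N = (S_N / L_N) / sum_i (S_i / L_i)."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: value / total for p, value in saf.items()}


def empai(observed, observable):
    """emPAI_N = 10 ** (observed peptides / observable peptides) - 1."""
    return {p: 10 ** (observed[p] / observable[p]) - 1 for p in observed}


def dnsaf(unique_counts, shared_peptides, lengths):
    """dNSAF: split each shared peptide's count among its proteins in
    proportion to their unique spectral counts, then length-normalize.

    shared_peptides maps peptide -> (spectral count, proteins sharing it).
    """
    distributed = defaultdict(float)
    for count, proteins in shared_peptides.values():
        unique_total = sum(unique_counts[p] for p in proteins)
        for p in proteins:
            # Distribution factor d; zero if no sharing protein has unique evidence.
            d = unique_counts[p] / unique_total if unique_total else 0.0
            distributed[p] += d * count
    dsaf = {p: (unique_counts[p] + distributed[p]) / lengths[p]
            for p in unique_counts}
    total = sum(dsaf.values())
    return {p: value / total for p, value in dsaf.items()}


# Toy input (made-up numbers, for illustration only).
lengths = {"A": 100, "B": 100}
print(nsaf({"A": 10, "B": 5}, lengths))
print(empai({"A": 2, "B": 1}, {"A": 4, "B": 4}))
print(dnsaf({"A": 6, "B": 2}, {"PEPTIDEK": (4, ["A", "B"])}, lengths))
```

In the dNSAF toy case, protein A holds three quarters of the unique counts, so it receives three of the four shared spectra.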
As output, crux spectral-counts produces a tab-delimited file listing proteins and their corresponding counts, in reverse sorted order. The crux spectral-counts command also computes a parsimonious set of proteins, using the greedy set cover approach used by IDPicker [20]. Users thus have the option of considering spectral counts only for proteins within the parsimonious set.
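
The parsimony step can be illustrated with a generic greedy set cover. The sketch below is a textbook version written for this purpose, not the code used by Crux or IDPicker; the function name and the protein-to-peptide mapping are hypothetical.

```python
def greedy_parsimony(protein_to_peptides):
    """Return a small list of proteins that explains every observed peptide.

    protein_to_peptides maps a protein name to the set of observed peptides
    it contains.
    """
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Pick the protein explaining the most still-unexplained peptides.
        # Iterating in sorted order makes ties resolve by protein name.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Toy example: P2 is redundant because P1 explains all of its peptides.
example = {
    "P1": {"a", "b", "c"},
    "P2": {"b", "c"},
    "P3": {"c", "d"},
}
print(greedy_parsimony(example))
```

A greedy cover is not guaranteed to be the smallest possible set, but it is simple and fast, which is why it is a common choice for protein inference.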

Data Collection

For the reproducibility experiments, proteins were extracted from the cochlear nucleus of the developing mouse brain at postnatal day 7 and postnatal day 21. Two biological replicates were generated for each age by dissecting out the cochlear nuclei from two separate mice at each age. One of the 21-day mice was used to generate two samples, thereby providing a technical replicate in addition to a biological replicate. The samples prepared from the chicken brain were derived from nucleus laminaris, an auditory region in the brain stem. Samples were taken from the dorsal (D) and ventral (V) regions of this area. For each region, two biological replicates were generated, and one of those replicates was also subjected to technical replication. Each sample was digested with trypsin and subjected to liquid chromatography followed by tandem mass spectrometry. For the linearity experiments, we used eight samples that represent a dilution curve of 48 known proteins synthesized by Sigma (UPS1, http://www.sigmaaldrich.com). These data sets are mixtures (Std1–Std8) of the C. elegans lysate at equal concentrations and the 48 proteins, diluted two-fold in each successive standard. Std8 has the lowest concentration of the known proteins (6 fmol) and Std1 has the highest concentration (870 fmol). All three data sets are publicly available at http://noble.gs.washington.edu/proj/crux-spectral-counts.

Data analysis

The fragmentation spectra from the experiments were searched against their respective mouse, chicken, or the C. elegans+UPS1 protein database using crux sequest-search followed by crux q-ranker, with the default parameters. crux spectral-counts was applied to the peptide-spectrum matches (PSMs) that received q-values ≤ 0.01. The resulting data sets for the mouse and chicken replicates are summarized in Additional file 1: Table S1, and the UPS1 dilution curve data sets are summarized in Additional file 1: Table S2.

Results

Testing reproducibility between replicates

To investigate the reproducibility of the four spectral count methods, we analyzed mass spectrometry data from technical and biological replicates from chicken and mouse samples. We then produced a scatter plot for each pair of biological or technical replicates and computed the corresponding Spearman correlation. For these comparisons, proteins identified in only one of the two datasets were ignored. Figure 1 shows sixteen such plots, corresponding to one biological and one technical replicate for chicken and mouse, respectively. The complete collection of 76 plots is provided as Additional file 1: Figures S1–S2. From these analyses, as summarized in Table 2, we draw two primary conclusions. First, the spectral counts are generally reproducible: the mean correlation value across all 76 pairs is 0.867, and the minimum correlation is 0.719. Second, reassuringly, the technical replicates produce higher correlations than the biological replicates: the mean correlation among 24 pairs of technical replicates is 0.885, whereas the corresponding value for the 52 pairs of biological replicates is 0.859 (two-tailed Wilcoxon rank-sum test p-value=0.026).
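
The replicate comparison described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' analysis script: the function names and the toy scores are assumptions. It keeps only proteins present in both replicates and computes the Spearman correlation as the Pearson correlation of average ranks.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def replicate_spearman(rep_a, rep_b):
    """Spearman correlation over proteins identified in both replicates."""
    shared = sorted(set(rep_a) & set(rep_b))
    x = [rep_a[p] for p in shared]
    y = [rep_b[p] for p in shared]
    return pearson(average_ranks(x), average_ranks(y)), len(shared)


# Toy scores (made up): P4 and P5 appear in only one replicate and are ignored.
rep_a = {"P1": 0.5, "P2": 0.3, "P3": 0.2, "P4": 0.9}
rep_b = {"P1": 0.4, "P2": 0.35, "P3": 0.1, "P5": 0.7}
print(replicate_spearman(rep_a, rep_b))
```

Because Spearman correlation depends only on ranks, it is unaffected by the different numeric scales of the four metrics.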
Figure 1

Reproducibility of spectral counts across biological and technical replicate experiments. Each plot compares either the SIN, emPAI, NSAF or dNSAF measure for proteins that were reproducibly identified across two replicate experiments. For visualization purposes, the counts are plotted on a logarithmic scale. The number in the lower right corner of each panel is the corresponding Spearman correlation and the number in the upper left is the number of datapoints compared.

Table 2

Spectral-counting reproducibility performance on mouse and chicken replicates

| Metric | Technical | Biological | All Replicates |
| --- | --- | --- | --- |
| SIN | 0.885 | 0.848 | 0.859 |
| emPAI | 0.870 | 0.858 | 0.862 |
| NSAF | 0.899 | 0.876 | 0.884 |
| dNSAF | 0.886 | 0.852 | 0.863 |
| All Metrics | 0.885 | 0.859 | 0.867 |

This table summarizes the average correlation of the spectral-counting metrics across the technical and biological replicates.

To test whether the observed differences in correlations among the four metrics are significant, we applied a Wilcoxon signed-rank test to paired sets of correlations. With four metrics, there are six possible paired comparisons. Figure 2 shows the results of this analysis, where one metric attaining a significant increase over another (using a Bonferroni-corrected p-value threshold of 0.05/6 ≈ 0.0083) is indicated by a directed edge. From this graph we conclude that, for the biological and technical replicates, NSAF yields significantly more reproducible quantification values than SIN, dNSAF and emPAI. Our reproducibility results agree with Colaert et al., who claim that NSAF is more reproducible than SIN and emPAI [21]. However, in contrast to our results, Griffin et al. report better reproducibility across replicates for SIN compared to NSAF [3].
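
The six paired comparisons and the Bonferroni threshold can be reproduced with a short Python sketch. The exact signed-rank p-value below is a generic textbook computation and assumes there are no zero or tied absolute differences; the function name and the toy correlation values are hypothetical and are not taken from the study.

```python
from itertools import combinations


def signed_rank_p(x, y):
    """Exact two-sided Wilcoxon signed-rank p-value for paired samples.

    Assumes no tied absolute differences; zero differences are dropped.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    # counts[w] = number of sign assignments whose positive-rank sum equals w.
    max_w = n * (n + 1) // 2
    counts = [1] + [0] * max_w
    for r in range(1, n + 1):
        for w in range(max_w, r - 1, -1):
            counts[w] += counts[w - r]
    total = 2 ** n
    lower = sum(counts[: w_plus + 1]) / total
    upper = sum(counts[w_plus:]) / total
    return min(1.0, 2 * min(lower, upper))


metrics = ["SIN", "emPAI", "NSAF", "dNSAF"]
pairs = list(combinations(metrics, 2))  # six paired comparisons
threshold = 0.05 / len(pairs)           # Bonferroni-corrected threshold

# Made-up correlations for eight replicate pairs under two metrics.
metric_a = [0.90, 0.88, 0.91, 0.87, 0.89, 0.92, 0.86, 0.93]
metric_b = [0.87, 0.86, 0.86, 0.83, 0.83, 0.85, 0.78, 0.84]
p_value = signed_rank_p(metric_a, metric_b)
print(len(pairs), threshold, p_value, p_value <= threshold)
```

A directed edge in a graph like Figure 2 would be drawn only for pairs whose p-value falls at or below the corrected threshold.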
Figure 2

Comparison of spectral counts across replicates. This graph summarizes the statistical analysis of the reproducibility measurements. An edge leading out from node A to node B indicates a statistically significant improvement in reproducibility for method A relative to method B.


Testing linear response for protein abundance across samples

To determine the linear response of each of the spectral count metrics, we analyzed mass spectra from a set of samples that form a dilution curve of forty-eight proteins with known amounts spiked into a C. elegans lysate. We performed linear regression between each protein’s spectral count and the associated amounts across the dilution curve samples. For this analysis, we only included proteins that obtain a positive spectral count in three or more of the data sets, which results in a comparison of forty-two proteins across the four metrics. We then carried out a Wilcoxon signed-rank test analysis separately on the linear regression correlation (R²) and the mean percent error (MPE). The results of these tests (Figure 3) are fairly consistent with one another: NSAF significantly outperforms dNSAF, and dNSAF and SIN significantly outperform emPAI.
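
The per-protein linearity measurement can be sketched as an ordinary least-squares fit of score against known amount. The code below is an illustration only: the function names are hypothetical, and because the text does not spell out the MPE formula, the sketch assumes it is the mean absolute deviation of the fitted value from the observed value, expressed as a percentage of the observed value.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def linearity(amounts, scores, min_points=3):
    """Return (R^2, MPE) for one protein, or None if it cannot be evaluated.

    Only samples with a positive score are used, and at least min_points
    such samples are required, mirroring the filter described in the text.
    """
    points = [(a, s) for a, s in zip(amounts, scores) if s > 0]
    if len(points) < min_points:
        return None
    x = [a for a, _ in points]
    y = [s for _, s in points]
    mean_y = sum(y) / len(y)
    ss_tot = sum((obs - mean_y) ** 2 for obs in y)
    if ss_tot == 0:
        return None  # constant scores: R^2 is undefined
    slope, intercept = fit_line(x, y)
    fitted = [slope * a + intercept for a in x]
    ss_res = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    r_squared = 1 - ss_res / ss_tot
    # Assumed MPE definition: mean of |fitted - observed| / observed, in percent.
    mpe = 100 * sum(abs(fit - obs) / obs for obs, fit in zip(y, fitted)) / len(y)
    return r_squared, mpe


# Toy dilution series (made-up values): scores double with the amount.
print(linearity([1, 2, 4, 8], [2, 4, 8, 16]))
# Too few positive scores, so the protein is skipped.
print(linearity([1, 2, 4, 8], [0, 0, 5, 6]))
```

Fitting one line per protein, rather than one line across all proteins, matches the per-protein evaluation described in this section.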
Figure 3

Comparison of spectral counts across UPS1 dilution curve. This graph summarizes the statistical analysis of the linearity measurements. Two types of analysis were performed, using the linear regression correlation (R²) and the mean percent error (MPE) for the C. elegans + UPS1 dilution curve dataset. An edge leading out from node A to node B indicates a statistically significant improvement in linearity for method A relative to method B.

Colaert et al. (2011) claim that SIN is more accurate than both NSAF and emPAI [21], but we find evidence to support only the comparison with emPAI, even though our experiments employ a wider dynamic range of protein abundance (6–870 fmol versus 6.7–20 fmol) and more data sets (eight versus two). Based on our experiments, we conclude that NSAF or SIN are the methods of choice for ensuring an accurate linear response between a protein’s change in abundance across different samples. It is worth noting that Griffin et al. (2010) observe a good linear fit between SIN and protein quantification. However, their evaluation methodology fits a single line to all of the SIN values from many proteins, whereas we have fit a separate line for each protein. This difference reflects our belief that spectral counting methods are most useful as measures of the relative abundance of a single protein between two experiments. We did not test the claim that SIN provides an accurate absolute protein abundance metric.

Discussion

Overall, our experiments suggest a relative ordering of spectral counting methods according to their reproducibility and the linearity of their response, but we can only speculate as to the reasons for the ranking that we observe. For example, we note that NSAF outperforms the emPAI metric in both of our experiments. The emPAI measure takes into account the least information: not only does it ignore fragment ion intensities, but it also fails to account for the length of the protein. Apparently, this relatively simple approach is insufficient to accurately estimate protein abundance. The relative performance of NSAF and SIN, on the other hand, is less clear: NSAF yields more reproducible results than SIN, but the two methods are statistically indistinguishable with respect to linearity. The main difference between SIN and the other three metrics is that SIN is the only metric that takes into account the intensities of the fragment ion peaks. In this sense, SIN goes a bit beyond the strict definition of “spectral counting.” Our experiments do not support the claim that such intensity information is valuable for quantification. However, the conflicting results of our study and Colaert et al., on the one hand, versus Griffin et al., on the other hand, suggest that further comparison of these methods is warranted. An additional direction for future work involves quantifying the linearity and reproducibility of proteins in a segregated fashion according to protein abundance. For example, visual inspection of Figure 1 suggests that the SIN measure may yield more reproducible counts for high abundance proteins, with a corresponding decrease in reproducibility as the abundance drops. Arguably, in many studies, such low abundance proteins are of the greatest interest; hence, it may be worthwhile to investigate systematically the extent to which either the linearity or the reproducibility of a given spectral counting measure varies as a function of protein abundance.

Conclusions

Quantifying protein amounts in mass spectrometry by spectral counting is a simple and robust method for measuring the relative change of protein amounts across different samples; however, many different algorithms exist for assigning a score to each identified protein. Using crux spectral-counts, we compared and contrasted four spectral counting methods with respect to their reproducibility across replicates and their linear response relative to protein abundance. Crux provides a flexible, easy to use open source tool for performing protein quantification using spectral counting.

Availability and requirements

Project name: Crux tandem mass spectrometry analysis software
Project home page: http://noble.gs.washington.edu/proj/crux
Operating systems: Linux, MacOS, Windows + Cygwin
Programming language: C++
Other requirements: Crux has no requirements to install the binary version under Linux or MacOS. On Windows, Crux requires Cygwin. Compiling Crux requires a C++ compiler, cmake, and Subversion.
License: Apache
Any restrictions to use by non-academics: None

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

The chicken and mouse samples were provided by ER’s lab, and the LC-MS/MS data were collected by members of the MM lab. MB prepared and collected the UPS1 + C. elegans dilution sample datasets. MM wrote the initial code for crux spectral-counts and the initial draft of the manuscript. SM finished the coding of crux spectral-counts and the final draft with WSN. All authors revised and approved the final manuscript.

Additional file 1

Supplementary Information. Supplementary Tables 1 and 2 and Supplementary Figures 1 and 2 are provided as quantify-supplement.pdf.
  20 in total

1.  Refinements to label free proteome quantitation: how to deal with peptides shared by multiple proteins.

Authors:  Ying Zhang; Zhihui Wen; Michael P Washburn; Laurence Florens
Journal:  Anal Chem       Date:  2010-03-15       Impact factor: 6.986

2.  Exponentially modified protein abundance index (emPAI) for estimation of absolute protein amount in proteomics by the number of sequenced peptides per protein.

Authors:  Yasushi Ishihama; Yoshiya Oda; Tsuyoshi Tabata; Toshitaka Sato; Takeshi Nagasu; Juri Rappsilber; Matthias Mann
Journal:  Mol Cell Proteomics       Date:  2005-06-14       Impact factor: 5.911

3.  Proteomic parsimony through bipartite graph analysis improves accuracy and transparency.

Authors:  Bing Zhang; Matthew C Chambers; David L Tabb
Journal:  J Proteome Res       Date:  2007-08-04       Impact factor: 4.466

4.  emPAI Calc--for the estimation of protein abundance from large-scale identification data by liquid chromatography-tandem mass spectrometry.

Authors:  Kosaku Shinoda; Masaru Tomita; Yasushi Ishihama
Journal:  Bioinformatics       Date:  2009-12-22       Impact factor: 6.937

5.  Scaffold: a bioinformatic tool for validating MS/MS-based proteomic studies.

Authors:  Brian C Searle
Journal:  Proteomics       Date:  2010-03       Impact factor: 3.984

6.  Detecting cross-linked peptides by searching against a database of cross-linked peptide pairs.

Authors:  Sean McIlwain; Paul Draghicescu; Pragya Singh; David R Goodlett; William Stafford Noble
Journal:  J Proteome Res       Date:  2010-05-07       Impact factor: 4.466

7.  PepC: proteomics software for identifying differentially expressed proteins based on spectral counting.

Authors:  N L Heinecke; B S Pratt; T Vaisar; L Becker
Journal:  Bioinformatics       Date:  2010-04-22       Impact factor: 6.937

8.  The mzIdentML data standard for mass spectrometry-based proteomics results.

Authors:  Andrew R Jones; Martin Eisenacher; Gerhard Mayer; Oliver Kohlbacher; Jennifer Siepen; Simon J Hubbard; Julian N Selley; Brian C Searle; James Shofstahl; Sean L Seymour; Randall Julian; Pierre-Alain Binz; Eric W Deutsch; Henning Hermjakob; Florian Reisinger; Johannes Griss; Juan Antonio Vizcaíno; Matthew Chambers; Angel Pizarro; David Creasy
Journal:  Mol Cell Proteomics       Date:  2012-02-27       Impact factor: 5.911

9.  A uniform proteomics MS/MS analysis platform utilizing open XML file formats.

Authors:  Andrew Keller; Jimmy Eng; Ning Zhang; Xiao-jun Li; Ruedi Aebersold
Journal:  Mol Syst Biol       Date:  2005-08-02       Impact factor: 11.429

10.  Label-free, normalized quantification of complex mass spectrometry data for proteomic analysis.

Authors:  Noelle M Griffin; Jingyi Yu; Fred Long; Phil Oh; Sabrina Shore; Yan Li; Jim A Koziol; Jan E Schnitzer
Journal:  Nat Biotechnol       Date:  2009-12-13       Impact factor: 54.908
