
Computational mass spectrometry-based proteomics.

Lukas Käll, Olga Vitek.

Year: 2011 PMID: 22144880 PMCID: PMC3228769 DOI: 10.1371/journal.pcbi.1002277

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

This is an original tutorial.

Goals and Challenges of Proteomics

Proteomics is defined as the system-wide characterization of all the proteins in an organism in terms of their sequence, localization, abundance, post-translational modifications, and biomolecular interactions. Modern proteomic investigations are increasingly quantitative and comprehensive [1]. Examples include the relative quantification of over 4,000 proteins in haploid and diploid yeast, which identified the pheromone signaling pathway as enriched in differential abundance [2]; determination of site- and time-specific dynamics of more than 6,000 phosphorylation sites of HeLa cells stimulated with epidermal growth factor [3]; and characterization of 232 multiprotein complexes in Saccharomyces cerevisiae, which proposed new cellular roles for 344 proteins [4]. Such investigations are now successfully utilized in functional biology [5], [6], genomics [7], [8], and biomedical research [9]. Challenges of proteomic studies stem from the complexity of the proteome and from its broad dynamic range. For example, the human genome contains around 20,000 protein-coding genes. Their translation, combined with splicing or proteolysis, yields an estimated 50,000–500,000 proteins, and over 10 million different protein forms can be derived by somatic DNA rearrangements and post-translational modifications [10]. The abundance of protein species in human plasma spans more than 10 orders of magnitude [11]. Unlike oligonucleotides, proteins cannot be amplified, and therefore the objectives of proteomics are achieved by sensitive and scalable technologies that identify and quantify proteins [12]. The overall mass spectrometry–based proteomic workflow is summarized in Figure 1.

Figure 1

Quantitative mass spectrometry–based proteomic workflow.

The workflow requires a tight integration of biological and experimental (red) and computational and statistical (yellow) analysis steps.

Experimental Design

Quantitative proteomic investigations are conducted in the context of biological variation [13], technical variation due to sample processing and spectral acquisition, and ambiguities of spectral interpretation. Statistical experimental design [14], [15] accounts for these sources of variation. The first goal of experimental design is to avoid biases [16], [17] (i.e., systematic errors in interpretation) by clearly defining the populations of interest, matching the individuals with respect to the confounding factors, randomizing the selection of matched individuals from the population, and randomizing sample allocation to the processing steps. The second goal is to ensure efficiency (i.e., minimal random variation and uncertainty for a given cost) by choosing an appropriate number of biological and technical replicates, and by allocating the replicates to experimental resources in balanced blocks. The steps of the statistical experimental design are summarized in Figure 2.

Figure 2

Experimental design.

Statistical experimental design consists of (a) defining the populations of interest, (b) randomly selecting biological replicates from the population and (optionally) matching confounding factors, (c) randomly allocating biological samples to spectral acquisition and (optionally) grouping the samples in balanced blocks for joint profiling, and (d) (optionally) acquiring technical replicate measurements on the biological samples. Replication, randomization, and blocking are necessary to avoid biases and maximize the efficiency of the experiment.
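The randomization and blocking steps described above can be sketched in a few lines. This is a minimal illustration, not a procedure taken from the tutorial: the function name `randomized_blocks`, the group labels, and the sample identifiers are all hypothetical, and the sketch assumes equally sized groups so that every block is balanced.

```python
import random


def randomized_blocks(samples_by_group, per_group_per_block, seed=None):
    """Allocate samples to balanced blocks with a random run order.

    samples_by_group: dict mapping a group label to a list of sample IDs.
    per_group_per_block: number of samples from each group in every block.
    Returns a list of blocks; each block is a list of (group, sample_id)
    tuples in the randomized acquisition order.
    """
    rng = random.Random(seed)
    sizes = {len(ids) for ids in samples_by_group.values()}
    if len(sizes) != 1:
        raise ValueError("balanced blocking needs equally sized groups")
    n = sizes.pop()
    if n % per_group_per_block != 0:
        raise ValueError("group size must be a multiple of the block share")

    # Randomize which samples of each group end up in which block.
    shuffled = {g: rng.sample(ids, len(ids)) for g, ids in samples_by_group.items()}

    blocks = []
    for start in range(0, n, per_group_per_block):
        block = [
            (group, sample_id)
            for group, ids in shuffled.items()
            for sample_id in ids[start:start + per_group_per_block]
        ]
        # Randomize the acquisition order within the block.
        rng.shuffle(block)
        blocks.append(block)
    return blocks


if __name__ == "__main__":
    design = {
        "control": ["c1", "c2", "c3", "c4"],
        "disease": ["d1", "d2", "d3", "d4"],
    }
    for number, block in enumerate(randomized_blocks(design, 2, seed=1), start=1):
        print(number, block)
```

Each block here could stand for a day of acquisition or one multiplexed labeled run; fixing `seed` only makes the allocation reproducible for record keeping.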

Mass Spectrometry–Based Measurements

Global Label-Free LC-MS/MS Workflow

Mass spectrometry is currently the only technology for protein identification and quantification that is both high-accuracy and high-throughput [18]–[20]. Although many alternatives exist, shotgun liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS; overview in Figure 3) is most frequently used. Mass spectrometry is better suited to characterizing peptides than intact proteins; therefore, LC-MS/MS starts by enzymatically digesting proteins into a peptide mixture. Next, liquid chromatography (LC) separates the peptides, and the separated peptides are ionized and further separated by the mass spectrometer according to their mass-to-charge ratio in a mass spectrum (MS). The mass spectra obtained from the same sample at different elution times form an LC-MS run, and the intensities of MS peaks are related to peptide abundance. For identification, the mass spectrometer isolates the biological material of selected MS peaks, subjects it to collision energy or another type of fragmentation, and separates the resulting fragments in a secondary (MS/MS) mass spectrum. The distances between the MS/MS peaks are used to infer the amino acid sequence of the parent MS peak. Since abundant MS peaks are more likely to be selected for fragmentation, relative peptide quantification can also be achieved by counting the number of identified MS/MS spectra.
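The digestion and mass-to-charge steps can be made concrete with a short sketch. It assumes trypsin as the enzyme with the usual simplified rule (cleave after K or R unless the next residue is P), no missed cleavages, no modifications, and protonated ions, so that m/z = (M + z · 1.007276) / z for a peptide of monoisotopic mass M and charge z. The function names are illustrative and do not come from the tutorial or from any particular software package.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O added for the peptide termini
PROTON = 1.007276   # mass of one proton


def tryptic_digest(protein):
    """Split a protein sequence after K or R, except when followed by P."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    return peptides


def peptide_mz(peptide, charge):
    """Mass-to-charge ratio of an unmodified, protonated peptide ion."""
    mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (mass + charge * PROTON) / charge


if __name__ == "__main__":
    for pep in tryptic_digest("AKRPGK"):
        print(pep, round(peptide_mz(pep, 1), 4), round(peptide_mz(pep, 2), 4))
```

Real digests contain missed cleavages and modified residues, which is one reason the number of peptides per proteome grows as quickly as the text describes.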

Figure 3

Mass spectrometry–based measurements.

(a) Sample processing. Label-free quantification requires minimal sample manipulation, and acquires spectra from each sample in a separate mass spectrometry run. Label-based quantification varies in the timing and type of the labeling steps, but always simultaneously profiles two or more biological samples within a run. (b) Global label-free workflows achieve relative quantification by comparing counts of MS/MS spectra, or intensities of MS peaks between runs. Global label-based workflows compare intensities of reporter MS/MS fragments (iTRAQ) or MS peaks (SILAC, synthetic peptides). (c) Targeted workflows are an alternative to global quantification. They are most sensitive, but require an a priori knowledge of the proteins of interest, and of the technological characteristics of their peptides. Label-free targeted experiments compare intensities of transitions between runs, and label-based experiments within a run.

An LC-MS/MS experiment can identify and quantify thousands of proteins in complex mixtures. It requires minimal manipulation of the sample, and minimal prior information regarding its composition. However, the workflow has a number of deficiencies. Enzymatic digestion increases the complexity of the mixture. For example, a proteome comprising 5,000 proteins is expected to yield over 250,000 tryptic peptides, and minor cleavage and fragmentations of abundant proteins can obscure major events of low-abundant proteins, complicating the interpretation [21]. Dynamic range of mass spectrometers is limited to 3–4 orders of magnitude, and the direct LC-MS/MS analysis is biased towards most abundant peptides [22]. Technical variation can further undermine the identification and the quantification steps. A variety of extensions to this basic workflow have therefore been proposed.

Overcoming Between-Run Variation: Label-Based Quantification

The LC-MS/MS workflow is enhanced by labeling samples from different conditions metabolically (e.g., with SILAC [23], where stable isotopes are included in the growth medium of an organism), or chemically (e.g., with iTRAQ [24] or TMT [25], where reacting chemical labels are applied during sample processing). Samples with different labels are combined and analyzed by a mass spectrometer within a single LC-MS run. Peaks from the samples are subsequently recognized by label-induced mass shifts in MS (SILAC) or MS/MS (iTRAQ, TMT) spectra, and used for relative quantification. Labeling enables within-run comparisons of protein abundance, and improves the precision of quantification. Experimental design can further gain efficiency through optimal allocation of samples to the labels, e.g., in reciprocal or reference designs [26] or by using labeled synthetic peptides as references. However, labeling requires extra sample manipulation and increases the complexity of the sample.

Overcoming Limits of Dynamic Range: Targeted Workflows

The complexity of a biological mixture can be overcome by fractionation [27]; however, this severely undermines the throughput. A valuable alternative is selected reaction monitoring (SRM) (also referred to as multiple reaction monitoring, MRM), a targeted workflow where the mass spectrometer isolates a set of pre-defined peptides and their fragments during mass analysis [28]–[31]. The resulting peptide-fragment pairs (called transitions) are used for quantification. Since the isolation is highly specific, SRM enables the most sensitive mass spectrometry–based quantification currently available. For example, proteins expressed with fewer than 50 copies/cell were quantified in total yeast lysates [32]. As shown in Figure 3, SRM can be conducted in conjunction with both label-free and label-based workflows. The drawback of targeted workflows is that they only quantify a priori known proteins, require optimized experimental protocols, and limit the number of measurements per run to a few hundred. Further technological developments [33] and optimal experimental designs [34] will help alleviate these drawbacks.

Computation and Statistics

Identification of Peptides and Proteins

The computational and statistical analyses of the acquired spectra are illustrated in Figure 4. With the shotgun LC-MS/MS workflow, the first step is to identify sequences of amino acids that correspond to the MS/MS spectra. This has received much attention from both algorithmic and statistical viewpoints [35]–[37]. A predominant approach is the database search, which compares each observed spectrum to the theoretical spectra predicted from a genomic sequence database (or to the previously identified experimental spectra in a library [38]), and reports the best-scoring peptide-spectrum match (PSM). Emerging alternatives are de novo identifications and hybrid searches [39], [40].

Figure 4

Computation and statistics.

Analysis of the acquired spectra includes (a, b) signal processing, (c, d) significance analysis, and (e–h) downstream analysis. Methods in (a–d) must reflect the technological properties of the workflows. Methods in (e–h) are technology-independent and are similar to the analysis of gene expression microarrays, but their use is affected by uncertainty in protein identities and the incomplete sampling of the proteome.

Due to the stochastic nature of the MS/MS spectra [41], and to deficiencies of scoring functions and databases, the best-scoring PSMs are not necessarily correct. Statistical characterization of the identifications is necessary, and is now required by most journals [42]. This problem is frequently formalized as controlling the false discovery rate (FDR) in the list of reported PSMs [43], [44]. Representative methods for controlling FDR are two-group models, which view the reported PSMs as a mixture of correct and incorrect identifications [45], and methods utilizing decoy databases [46]. Typically, only around 30% of MS/MS spectra are confidently identified, and developing improved methods is an active area of research.

The task of identification extends to inferring peptides and proteins in the sample from the identified MS/MS spectra. This is challenging due to the “many-to-many” mapping of peptides to proteins, and of MS/MS spectra to peptides. Inference must enable parsimonious results, while maintaining the sensitivity and characterizing the confidence in the identifications. The problem of protein inference is not entirely solved. For example, arguments exist in favor [47] and against [48] reporting single-peptide protein identifications, and in favor [49] and against [50] the exclusive use of protease-specific peptides.
A typical experiment generates hundreds of thousands of MS/MS spectra, and open-source and commercial pipelines such as the Trans-Proteomic Pipeline [51] streamline spectral handling and interpretation through common infrastructure.
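As an illustration of decoy-based FDR estimation for PSMs, the sketch below uses a simple target-decoy ratio. It is one common textbook form of the idea and not necessarily the exact estimator of [46]. It assumes that target and decoy databases are of equal size, that higher scores indicate better matches, and that decoy hits behave like incorrect target hits; the function name and the input layout are hypothetical.

```python
def target_decoy_qvalues(psms):
    """Estimate q-values for target PSMs from a target-decoy search.

    psms: iterable of (score, is_decoy) pairs, higher score = better match.
    Returns a list of (score, q_value) for the target PSMs only, sorted by
    decreasing score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)

    # FDR estimate at each score threshold: decoys accepted / targets accepted.
    targets = decoys = 0
    fdr_at_rank = []
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdr_at_rank.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at which a PSM would still be accepted,
    # i.e. the running minimum of the FDR taken from the worst score upward.
    q_values = []
    running_min = 1.0
    for fdr in reversed(fdr_at_rank):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()

    return [
        (score, q)
        for (score, is_decoy), q in zip(ranked, q_values)
        if not is_decoy
    ]


if __name__ == "__main__":
    example = [(10, False), (9, False), (8, True), (7, False), (6, False), (5, True)]
    for score, q in target_decoy_qvalues(example):
        print(score, q)
```

Reporting all target PSMs with a q-value below a chosen threshold (for example 0.01) then yields a list whose estimated FDR does not exceed that threshold, under the stated assumptions.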

Quantification of Spectral Features

The next step in quantitative label-free LC-MS/MS experiments is to locate and quantify MS peaks, annotate them with peptide and sequence identities, and establish the correspondence of peaks between runs [52]. Label-based workflows with MS quantification (e.g., SILAC) search for pairs of peaks with known mass shifts that correspond to the same peptide. Workflows with MS/MS quantification (e.g., iTRAQ) locate and quantify reporter MS/MS fragments. All these tasks can be made difficult by irregular, overlapped, and missing peaks, chromatographic variations between runs, and incomplete and incorrect identifications. As a result, only a subset of the identified proteins is typically quantified [53]. A variety of signal processing software tools are reviewed in [54], and the representative ones are OpenMS [55] for label-free quantification and MaxQuant [56] for quantification with SILAC. Targeted SRM experiments sidestep the need for identifying and aligning peaks, and signal processing focuses on peak detection, quantification, and annotation. However, difficulties can arise with overlapped or suppressed signals or incorrectly calibrated transitions, and computational methods can help filter out poor quality transitions [57], [58]. Pipelines such as Skyline [59], [60] and ATAQS [61] streamline these tasks. Frequently, sample handling induces differences in the quantitative signals between runs, and global between-run normalization is necessary to distinguish true biological changes from these artifacts. Two common approaches to global normalization are sample-based and control-based. Sample-based normalization, e.g., quantile normalization or normalization based on the total ion current, makes the best use of the data, but assumes that the majority of features do not change in abundance [62]. Control-based normalization is preferred in experiments with few measurements or many biological changes.
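Quantile normalization, mentioned above as a sample-based approach, can be sketched as follows. The sketch assumes that every run reports the same features in the same order with no missing values, and that intensities are already on a suitable (for example log) scale; ties are broken by feature order instead of being averaged. The function name is illustrative.

```python
def quantile_normalize(runs):
    """Force all runs to share the same intensity distribution.

    runs: list of runs, each a list of feature intensities; runs[r][f] is
    the intensity of feature f in run r.
    Returns a new list of runs in the same layout.
    """
    n_features = len(runs[0])
    if any(len(run) != n_features for run in runs):
        raise ValueError("all runs must contain the same number of features")

    # Reference distribution: mean of the k-th smallest value across runs.
    sorted_runs = [sorted(run) for run in runs]
    reference = [sum(values) / len(runs) for values in zip(*sorted_runs)]

    normalized = []
    for run in runs:
        # Feature indices ordered from lowest to highest intensity.
        order = sorted(range(n_features), key=lambda f: run[f])
        out = [0.0] * n_features
        for rank, feature in enumerate(order):
            out[feature] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```

Because every run is mapped onto the same reference distribution, the method is only appropriate when most features are expected to be unchanged between samples, as the text notes.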

Finding Differentially Abundant Proteins

Typical statistical goals of quantitative proteomics are protein quantification, i.e., estimation of protein concentration in a sample on a relative or absolute scale, and class comparison, i.e., determination of proteins that change in average abundance between conditions. To achieve this, it is often necessary to summarize the quantitative information across all the features that pertain to a protein. One such approach is spectral counting [63], which is based on the insight that in global LC-MS/MS, peaks from abundant proteins are more frequently selected for fragmentation, and which uses the number of identified MS/MS spectra as a proxy for the abundance. The approach involves minimal signal processing; however, it requires specialized statistical modeling, is limited to finding large changes among abundant proteins, and is most successful with mixtures of low complexity, e.g., for determination of protein complexes [64]. Alternative approaches are based on summarizing signals from quantified spectral peaks. With other technologies such as gene expression microarrays, similar summarization is performed by some form of averaging, e.g., with Robust Multiarray Averaging (RMA) [65]. Unfortunately, averaging fails to produce accurate results in mass spectrometry–based proteomics. Length, charge, and other chemical properties of peptides greatly affect the quality of the signals, and averaging obscures these differences in information content. A more successful summarization requires probabilistic modeling, which represents all features of a protein and characterizes their variation. A diverse range of such models has been proposed, and there is no single generally accepted procedure. The models differ in using raw or log-transformed intensities, comparing groups in terms of ratios or differences, and using general-purpose [66] or specialized [67] classes of statistical models.
Important aspects are accurate representation of the experimental design and of within-run groupings of peaks in label-based workflows, treatment of missing data (e.g., using specialized [68] or general-purpose [69], [70] techniques), incorporating confidence in feature identifications [71], expanding the scope of conclusions to the underlying populations or restricting it to the selected samples [66], and controlling the FDR in the list of differentially abundant proteins. In some cases, e.g., in samples enriched in post-translational modifications, changes in peak intensities can be due to both differential abundance and differential modifications. Comparisons at the feature level are then more appropriate; however, they should be adjusted for the overall changes in protein abundance [72]. Given the diversity of experimental designs and analysis steps, all these tasks can rarely be performed in a fully automated fashion, and consultations with statisticians are highly recommended.
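Controlling the FDR in a list of differentially abundant proteins requires adjusting the per-protein p-values for multiple testing. The text does not prescribe a particular procedure; the sketch below uses the Benjamini-Hochberg adjustment as one widely used general-purpose choice, and assumes that one p-value per protein has already been obtained from whatever statistical model was fitted. The function name is illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Proteins whose adjusted p-value is at or below a threshold (e.g. 0.05)
    form a list with an expected FDR of at most that threshold, under the
    usual assumptions of the procedure.
    """
    m = len(pvalues)
    # Protein indices ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```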

Downstream Analysis

The high-throughput nature of proteomic data is similar to that of gene expression microarrays, and many downstream analysis methods can also be applied in proteomics [73]. In particular, all analyses benefit from data visualization [74]. Unsupervised class discovery helps find functionally related proteins, or biological samples homogeneous with respect to the quantitative protein profiles. Supervised class prediction, e.g., prediction of the disease status of a patient based on his or her protein abundance [75], and its thorough validation [76], are the required steps for discovery of biomarkers of disease. Enrichment analysis tests whether pre-specified sets of proteins, e.g., those sharing a function, change in abundance more systematically than expected by chance. This is referred to as pathway analysis when the protein set forms a pathway. The analysis investigates hypotheses that are more directly relevant to the biological function, and can help detect small but consistent changes in abundance within the set. Many enrichment analysis methods exist and are systematically reviewed in [77], [78], and representative examples are the hypergeometric (equivalently, Fisher's exact) test and Gene Set Enrichment Analysis (GSEA) [79]. A particular challenge in proteomics is to map the protein identifiers to gene-centric knowledge bases. The tools for this task are reviewed in [80], and a representative one is DAVID [81]. A frequently asked question is the correlation between the expression of protein-coding genes and the abundances of the corresponding proteins [82]–[84]. Many studies reported that in bacteria and uni-cellular eukaryotes, proteins and mRNA exhibit moderate correlation in a steady state (Pearson correlation of the order of 0.4), but the correlation improves to the order of 0.6–0.7 for proteins that are directly affected by a relevant condition or a stress [2].
An even lower correlation has been historically reported for multi-cellular eukaryotes; however, technological improvements now also point to a steady state correlation in human samples of the order of 0.4 [85]. The moderate correlation of transcript and protein abundance indicates a major role of post-translational regulation in the activity of the cell. Therefore, the best functional insight can be obtained by combining measurements across technologies, and searching for broader groups of genes, proteins, and metabolites forming regulatory relationships [86], [87]. Such integrative studies are increasingly appearing [88], [89]. They remain challenging, however, due to the complexity of the underlying processes, incomplete sampling of the proteome, uncertainty in protein identities and difficulties of resolving multiple proteomic, genomic, and technological identifiers across platforms. New specialized methods and algorithms are needed to address these challenges.
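The hypergeometric enrichment test named in this section can be written directly from its definition. The sketch assumes a one-sided test for over-representation, a background consisting of all quantified proteins, and a pre-computed list of differentially abundant proteins; the function and parameter names are illustrative. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_background, n_in_set, n_selected, n_overlap):
    """One-sided hypergeometric p-value for over-representation.

    n_background: number of proteins quantified in the experiment.
    n_in_set:     how many of them belong to the protein set (e.g. a pathway).
    n_selected:   number of differentially abundant proteins.
    n_overlap:    differentially abundant proteins that are in the set.
    Returns P(X >= n_overlap) when n_selected proteins are drawn at random
    without replacement from the background.
    """
    largest_overlap = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
        for k in range(n_overlap, largest_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # 10 quantified proteins, 4 in the pathway, 3 changed, 2 of those in the pathway.
    print(hypergeom_enrichment_pvalue(10, 4, 3, 2))
```

The choice of background matters in proteomics: because the proteome is incompletely sampled, using all annotated genes instead of the proteins actually quantified would distort the p-value.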

Outlook

Despite the challenges, mass spectrometry–based proteomics continues to bring high promise for basic science and clinical research [90]. Several studies recently demonstrated that with appropriate care and training, it is now possible to accurately and reproducibly identify and quantify proteins across laboratories and instrument platforms [91]–[93]. In shotgun proteomics, most repeatable peptide identifications corresponded to enzyme-specific cleavage sites, intense MS peaks, and proteins that generated many distinct peptides. Targeted quantification could reproducibly detect low µg/ml protein concentrations in unfractionated plasma. To date, only 65% of all predicted human proteins have been reliably observed by mass spectrometry [90]. Therefore, future experimental developments will focus on improving the sensitivity, reproducibility, and comprehensiveness of protein identifications, and the sensitivity and accuracy of quantification. All studies consistently emphasize the key role of computation [94]. Future computational efforts will involve the development of proteome-centric knowledge bases such as neXtProt (http://www.nextprot.org/), repositories of experimental data, and the development of methods for optimal experimental design and data interpretation. Venues such as RECOMB Satellite Conference on Computational Proteomics [95] aim at closing the communication gap between biologists, chemists, and statisticians, and enable integrative and collaborative research.

95 in total

Review 1. The human plasma proteome: history, character, and diagnostic prospects.

Authors: N Leigh Anderson; Norman G Anderson
Journal: Mol Cell Proteomics Date: 2002-11 Impact factor: 5.911

2. The need for guidelines in publication of peptide and protein identification data: Working Group on Publication Guidelines for Peptide and Protein Identification Data.

Authors: Steven Carr; Ruedi Aebersold; Michael Baldwin; Al Burlingame; Karl Clauser; Alexey Nesvizhskii
Journal: Mol Cell Proteomics Date: 2004-04-09 Impact factor: 5.911

Review 3. Proteomics: a pragmatic perspective.

Authors: Parag Mallick; Bernhard Kuster
Journal: Nat Biotechnol Date: 2010-07-09 Impact factor: 54.908

4. A stress test for mass spectrometry-based proteomics.

Authors: Ruedi Aebersold
Journal: Nat Methods Date: 2009-06 Impact factor: 28.547

Review 5. Bioinformatics analysis of mass spectrometry-based proteomics data sets.

Authors: Chanchal Kumar; Matthias Mann
Journal: FEBS Lett Date: 2009-03-21 Impact factor: 4.124

6. Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks.

Authors: Shao-Shan Carol Huang; Ernest Fraenkel
Journal: Sci Signal Date: 2009-07-28 Impact factor: 8.192

7. Protein quantification in label-free LC-MS experiments.

Authors: Timothy Clough; Melissa Key; Ilka Ott; Susanne Ragg; Gunther Schadow; Olga Vitek
Journal: J Proteome Res Date: 2009-11 Impact factor: 4.466

8. Use of stable isotope labeling by amino acids in cell culture as a spike-in standard in quantitative proteomics.

Authors: Tamar Geiger; Jacek R Wisniewski; Juergen Cox; Sara Zanivan; Marcus Kruger; Yasushi Ishihama; Matthias Mann
Journal: Nat Protoc Date: 2011-02 Impact factor: 13.491

9. The importance of peptide detectability for protein identification, quantification, and experiment design in MS/MS proteomics.

Authors: Yong Fuga Li; Randy J Arnold; Haixu Tang; Predrag Radivojac
Journal: J Proteome Res Date: 2010-11-10 Impact factor: 4.466

10. Repeatability and reproducibility in proteomic identifications by liquid chromatography-tandem mass spectrometry.

Authors: David L Tabb; Lorenzo Vega-Montoto; Paul A Rudnick; Asokan Mulayath Variyath; Amy-Joan L Ham; David M Bunk; Lisa E Kilpatrick; Dean D Billheimer; Ronald K Blackman; Helene L Cardasis; Steven A Carr; Karl R Clauser; Jacob D Jaffe; Kevin A Kowalski; Thomas A Neubert; Fred E Regnier; Birgit Schilling; Tony J Tegeler; Mu Wang; Pei Wang; Jeffrey R Whiteaker; Lisa J Zimmerman; Susan J Fisher; Bradford W Gibson; Christopher R Kinsinger; Mehdi Mesri; Henry Rodriguez; Stephen E Stein; Paul Tempst; Amanda G Paulovich; Daniel C Liebler; Cliff Spiegelman
Journal: J Proteome Res Date: 2010-02-05 Impact factor: 4.466
