Mass spectrometry-based proteomics has considerably extended our knowledge about the occurrence and dynamics of protein post-translational modifications (PTMs). So far, quantitative proteomics has been mainly used to study PTM regulation in cell culture models, providing new insights into the role of aberrant PTM patterns in human disease. However, continuous technological and methodological developments have paved the way for an increasing number of PTM-specific proteomic studies using clinical samples, which are often limited in amount. Thus, quantitative proteomics holds great potential to discover, validate and accurately quantify biomarkers in body fluids and primary tissues. A major effort will be to improve the integration of robust yet sensitive proteomics technology into clinical environments. Here, we discuss PTMs that are relevant for clinical research, with a focus on phosphorylation, glycosylation and proteolytic cleavage; furthermore, we give an overview of current developments and novel findings in mass spectrometry-based PTM research.
The role of post-translational modifications in biological systems
Biological systems maintain homeostasis by dynamic adaptation to the rapidly changing environment. While transcription, translation (and degradation) affect protein abundance, protein activity and function are mainly defined by structure (Figure 1). The latter can be regulated by post-translational modifications (PTMs), allowing rapid response to external/internal stimuli within (milli-)seconds.
Figure 1.
Frequently reported post-translational protein modifications.
ChaFRADIC: Charge-based fractional diagonal chromatography; COFRADIC: Combined fractional diagonal chromatography; PTM: Post-translational modification; SCX: Strong cation exchange chromatography; TAILS: Terminal amine isotopic labeling of substrates.
Currently, 469 PTMs are reported in the UniProt database (January 2015), of which 326 are reported in eukaryotes, 250 in bacteria, and 80 in archaea [1]. Some PTMs have only been found in specialized bacteria [2,3], but more than 100 different PTMs are reported in Homo sapiens. According to PhosphoSitePlus [4], protein phosphorylation is by far the most common PTM and has been detected on approximately 17,500 human gene products (Figure 2). Other frequently reported PTMs are ubiquitination (∼8100 proteins), lysine acetylation (∼6700 proteins), and lysine methylation (∼2400 proteins). Protein glycosylation, a heterogeneous group of modifications, has been reported for approximately 4500 proteins (PhosphoSitePlus and UniProt), but is estimated to occur on ≥50% of all human proteins [5]. Further PTMs such as succinylation, SUMOylation and citrullination [6,7] are increasingly added to the databases (Figure 2B). Importantly, >95% of these data are derived from mass spectrometry (MS)-based proteome studies [8], and with the advent of tools that allow automated re-processing of large-scale MS data sets [9] from repositories such as PRIDE [10] or ProteomeXchange [11], it can be assumed that our knowledge about the prevalence of PTMs will further increase [12].
Figure 2.
Frequency of human PTMs. Summary of human PTMs which, according to UniProt and PhosphoSitePlus, have been detected (A) frequently, (B) less frequently and (C) rarely. For UniProt, the percentage of entries with experimental evidence is given (ECO:0000269). (D) The high number of known PTMs is in stark contrast to the limited knowledge about their involvement in disease. PTM: Post-translational modification.
Notably, for a given protein, not only the individual PTMs but also the combinations of PTMs and PTM crosstalk define protein function [13-16]. Altered PTM patterns have been connected to various diseases. However, according to PhosphoSitePlus, this has been experimentally validated for only 350 proteins (Figure 2D), which makes studying the role of PTMs in the genesis and progression of human disease a major goal of current life science.
For decades, PTM characterization was mainly confined to individual proteins or defined pathways. With the advent of liquid chromatography coupled online to MS (LC-MS) and of PTM enrichment [17-20], the identification, localization and quantitation of hundreds to even thousands of PTMs have become possible.
From discovery to clinical research
To date, PTM analysis by MS is mainly based on the analysis of peptides in so-called ‘bottom-up’ approaches. Thus, sample preparation typically comprises protein extraction from the sample of interest (tissues, body fluids, cells, organelles), enzymatic digestion [21,22] and enrichment of modified peptides to deplete the bulk of non-modified peptides that hamper PTM analysis [17,23,24]. The employed techniques target either PTM structure, such as in affinity chromatography [25,26], or physicochemical characteristics of the modified peptides that result, for example, in specific retention characteristics in reversed phase (RP) [27] or hydrophilic interaction chromatography (HILIC) [28]. Strong cation exchange chromatography (SCX) can be used to selectively enrich charge-reduced peptides (phosphopeptides, glycopeptides or N-terminally acetylated peptides) [29-31]. Often, two or more enrichment techniques are combined (e.g., affinity chromatography and HILIC for the selective enrichment of phosphopeptides [32,33]) in order to increase specificity and additionally fractionate the complex samples.
Notably, most strategies for the enrichment of modified peptides have been developed, evaluated and applied using cell lines such as HeLa [33] or HEK 293 [34] rather than primary tissues [19]. Consequently, our knowledge about PTM involvement in disease mechanisms is mainly derived from in vitro or animal models. For instance, PanCa-1 cells can be stimulated with EGF or TGF-β to induce an epithelial-mesenchymal transition, an important mechanism involved in metastasis formation in various cancer types [35,36]. Such cell culture-based studies can provide high amounts of sample material (often in the milligram range [37]) and, therefore, allow a large-scale analysis of PTMs without the ultimate need for high enrichment specificity or sensitivity. In contrast, the availability of sample material in clinical research is often restricted [38]. Hence, one future challenge will be to move one step ahead by validating proposed models and identifying novel ones directly in clinical samples [39]. This might be imperative for the identification and verification of biomarker candidates and drug targets, and represents a current bottleneck in MS-based PTM research [40].
Whereas body fluids (such as blood, urine or tear fluid) are usually readily available in high amounts, tissue samples are often restricted and/or formalin-fixed and paraffin-embedded. In addition, the bulk of a clinical sample is usually needed for diagnostic purposes and sample storage in biobanks. Consequently, the amount of protein available for proteomics analyses is often in the microgram range, requiring the use of robust and sensitive analytical workflows. If feasible, developing standard operating procedures that are both applicable in a clinical environment and compatible with downstream proteomics is highly recommended. Here, the first challenge is an efficient and reproducible sample preparation [38], ideally in a single tube to maximize sample recovery [41,42]. Recently, Hughes et al. used a novel bead-based sample preparation protocol for proteome analysis of single fly embryos and of merely 1000 HeLa cells (<1 µg of protein) [43]. Although not evaluated for clinical samples, such novel methods might be the future for efficient sample preparation of ultra-low sample amounts. A second challenge is downscaling PTM enrichment, as currently only a few techniques allow large-scale analysis of PTMs from low sample amounts. In phosphoproteomics, recent advances enable the detection of thousands of phosphopeptides from ≤100 µg of protein [20,32,33]. For low sample amounts, as often obtained from microdissected tissue, several groups have set up platforms to analyze the samples in a single-shot analysis. For example, Masuda et al. presented an online fractionation approach for the identification of approximately 1000 phosphorylation sites from only 1 µg of protein by using optimized surfactant-aided sample preparation, hydroxy acid-modified metal oxide affinity chromatography (MOAC) and miniaturization of the HPLC system [44]. Lam et al. reported an online LC-MS platform employing two subsequent RP-HPLC fractionations for peptide separation, combined with porous graphitic carbon chromatography for retention of hydrophilic glycopeptides. This enabled an efficient proteomics and glycoproteomics analysis with only 25 µg of sample material [45]. Such online approaches are promising for restricted samples; however, high-throughput analysis of large sample cohorts demands considerable robustness of the entire analytical procedure, which is often at odds with ultra-high-sensitivity approaches. Currently, common offline enrichment strategies might still represent the more robust alternative.
Quantification strategies for PTM-focused clinical proteomics
In biological systems, proteins are usually expressed in several up to many copies, which may differ with regard to their PTM patterns. Thus, for each putatively modified amino acid, there is usually an equilibrium between different modification states that can be rapidly altered by enzymes such as kinases or phosphatases. Consequently, a protein can be simultaneously present in multiple variants, and therefore, regulation by PTMs is a function of site occupancy and full PTM patterns (also called the PTM code). Thus, given a sensitive detection system such as LC-MS and a cell population rather than individual cells, qualitative changes between zero and full site occupancy are rather rare. Hence, PTM studies focus mainly on quantifying relative changes (sometimes slight ones) of modified peptides between different sample states, in order to discover relevant biological features [46].
To quantify PTM peptides in clinical samples, the following points regarding study design and the quantification method should be considered beforehand: How many samples (including technical and biological replicates) can and have to be analyzed? Is the available LC-MS analysis time limited? Can the samples be analyzed in successive batches? Which accuracy is required for quantification? In general, it is recommended to randomize the order in which samples are analyzed and to place samples freshly on the LC autosampler prior to LC-MS analysis.
The following techniques may be used for relative quantification of PTMs in clinical samples: label-free quantification, super-stable isotope labeling by amino acids in cell culture (super-SILAC) or chemical labeling. They have certain advantages and disadvantages, and may differ in their compatibility with common PTM enrichment techniques, as summarized in the following sections.
In label-free quantification, samples are analyzed in separate LC-MS runs and peptides are quantified by comparing either the number of peptide spectrum matches (spectral counting [47]) or precursor ion signal intensities (i.e., the area under the curve) [48], the latter being more accurate and precise. In any case, a high run-to-run reproducibility substantially improves the confidence of peptide quantification. Thus, the quality of label-free quantification strongly depends on the reproducibility of sample preparation, LC and MS performance, and requires a high level of quality control [21,49] that may not be easily established in each environment. Even then, the comparability between two samples can suffer from instrumental variations, particularly when measured over a long period of time, such that measuring several replicates is important to improve statistical robustness. Notably, label-free quantification of previously fractionated samples is even more challenging, as slight variations in the fractionation, for example, small retention time shifts during offline chromatography, are virtually inevitable and can considerably impair quantification. Novel bioinformatics strategies promise a more sophisticated label-free comparison of such difficult examples [50,51].
Super-SILAC is similar to label-free quantification but makes use of a heavy SILAC-labeled internal standard that is spiked into each sample as a reference, in order to facilitate the comparison between different samples. A super-SILAC standard consists of either a mixture of different heavy SILAC-labeled cell lines, for example, when analyzing complex tissue samples, or simply the very cell type of interest [52]. Each biological sample is individually quantified relative to this standard, thus compensating for instrumental variations. Schweppe et al. used a super-SILAC standard of non-small cell lung cancer cell lines for quantitative phosphoproteomics of lung tissue [53]. Boersema et al. quantified 180 N-glycosylation sites from breast cancer patient plasma using a dedicated super-SILAC mix [54]. Notably, qualitative differences in the proteomes of tissue and cell lines may impair the quantification of proteins, which might be even more pronounced on the level of PTMs. Using mixtures of cell lines can partially compensate for this, however, at the expense of a substantially increased sample complexity and, consequently, impaired detection of low-abundant peptides.
Chemical labels can be used to introduce stable isotopes on the protein or peptide level for quantitative analyses. Depending on the employed label, currently up to 3 (dimethyl labeling [55,56]), 8 (isobaric tag for relative and absolute quantitation [iTRAQ] [57,58]) or 10 samples (tandem mass tags [TMT] [59,60]) can be multiplexed, considerably facilitating the use of extensive fractionation and enrichment protocols prior to LC-MS analysis. In the case of reporter ion-based iTRAQ and TMT, after labeling, the same peptide derived from different samples is always isobaric, such that even multiplexing 10 samples does not considerably increase sample complexity. Thus, for each differentially labeled peptide, only a single precursor isotope pattern is detected and isolated to generate MS/MS spectra. Upon fragmentation, the non-isobaric reporter ions are released from their precursors and their intensities reflect the relative abundance of the respective peptide in the different samples. Owing to the multiplexing capacity in conjunction with the isobaric nature of the labels, iTRAQ and TMT are ideally suited for studies in which the available amount of protein per sample/condition is limited [61].
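The way reporter ion intensities translate into relative abundances can be illustrated with a minimal Python sketch. The channel names and intensity values are hypothetical, and modeling co-isolation as a constant background added to every channel is a simplifying assumption made only for illustration; it is not a description of any vendor or search-engine algorithm.

```python
import math

def relative_abundance(reporters, reference):
    """Return log2 ratios of each reporter channel versus a reference channel."""
    ref = reporters[reference]
    return {ch: math.log2(value / ref) for ch, value in reporters.items()}

def add_interference(reporters, background):
    """Add the same co-isolated background signal to every channel (assumption)."""
    return {ch: value + background for ch, value in reporters.items()}

# Hypothetical reporter intensities of one labeled peptide in four samples.
clean = {"ch1": 1000.0, "ch2": 4000.0, "ch3": 1000.0, "ch4": 2000.0}

# Assumption: an unregulated, co-isolated peptide adds 1000 counts per channel.
mixed = add_interference(clean, 1000.0)

true_ratio = clean["ch2"] / clean["ch1"]
observed_ratio = mixed["ch2"] / mixed["ch1"]

print("log2 ratios vs ch1:", relative_abundance(clean, "ch1"))
print("true ch2/ch1 ratio:", true_ratio)
print("ratio with co-isolated background:", observed_ratio)
```

In this toy example the background pulls the ratio toward 1, which is why narrower isolation windows, more fractionation or MS3-based readout are used to limit co-isolation.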
Notably, for reporter-ion based strategies, MS/MS identification rates are below those for non-labeled samples, resulting partially from elevated charge states [62] and from the different fragmentation behavior of the labeled peptides [63]. However, the possibility to conduct extensive fractionation prior to LC-MS allows increasing the number of peptide spectrum matches and, importantly, even improves quantification accuracy owing to reduced co-isolation of precursor ions. As each labeled peptide releases the same m/z reporter ions, particularly in highly complex samples, the reporter ion intensities may reflect a mixture of co-isolated peptides, often distorting ratio estimation [64,65]. It was reported that, besides reduced isolation windows for precursor ions and extended fractionation [64], MS3 analysis can improve ratio determination [60,66], the latter, however, at the expense of sensitivity. iTRAQ and TMT are widely used for quantification [67-69], but the labels can alter the physicochemical properties of peptides and thus directly affect the efficiency of PTM enrichment. This can be circumvented by labeling after enrichment, which allows using higher amounts of starting material and/or saving expensive label reagents; this, however, introduces higher systematic biases from sample-to-sample variations [70,71].
Recently, a new strategy termed neutron encoding (NeuCode), utilizing the subtle mass differences (low millidalton range) of different isotopologues for quantification, has been developed to further expand multiplexing capacities in LC-MS-based quantitative proteomics [72]. In theory, as many as 39 different isotopologues of lysine can be incorporated in cell culture and multiplexed for analysis. If a resolution of 30,000–60,000 is used for the initial survey scan, the unresolved isotopologue signals do not increase spectral complexity, and thus, all channels are co-isolated for fragmentation and peptide identification. Quantification is done by an additional high-resolution survey scan which resolves the different isotopologues (R ≥ 480,000 to resolve 18 mDa differences in a 3-plex experiment). Notably, the NeuCode technology is experimentally challenging and currently limited to only a few mass spectrometers with sufficient resolution and scan speed. Nevertheless, the development of NeuCode-based chemical labels [73,74] and the rapid technological progress in MS instrumentation render NeuCode a most promising and valuable tool for future clinical (PTM) proteomics.
Whereas label-free analysis, super-SILAC and chemical labeling provide the repertoire for the large-scale detection of aberrant PTM regulation, validation and assay-based detection of potential biomarkers require more accurate and precise quantification. In a clinical environment, this can be done by ELISA-based methods, which, nevertheless, can only target single or few biomarkers simultaneously. The use of targeted MS-based proteomics overcomes this constraint, offering equal or even higher sensitivity for multiple analytes [40]. Here, a pre-defined set of peptide candidates can be relatively quantified over a high dynamic range with high precision and accuracy. Depending on the sensitivity of the equipment and the abundance of the peptides of interest, targeted MS can be utilized to monitor >100 different peptides from complete cell digests within a single LC-MS analysis, and without the need for fractionation or enrichment. This allows refining and validating the results obtained from ‘discovery’ experiments with larger sets of samples in a short time. Furthermore, spiked-in stable isotope-labeled reference peptides allow for absolute quantification in an assay-like manner, without the need for cost-intensive antibodies as required for ELISA.
For a targeted approach, peptides of interest have to be evaluated for their suitability, that is, stability and uniqueness [75]. Targeted MS is classically conducted with the help of triple-quadrupole mass spectrometers in multiple reaction monitoring (MRM) mode or, more recently, using parallel reaction monitoring, as reviewed elsewhere [76]. Whereas generating reliable high-quality targeted assays requires a substantial effort and the incorporation of robust quality control measures [77], once set up and validated, they enable analyzing larger sample cohorts in accordance with the demands of clinical research. The technology provides good inter-lab reproducibility and precision [78], and current research is aiming at more cost-effective and standardized MRM assays for widespread applicability [79].
Targeted methods have been used for the quantification of low-abundant modified peptides, as demonstrated for glycosylated [80] or phosphorylated species [81]. Recently, Yoneyama et al. developed a targeted assay to monitor the levels of proline hydroxylation in fibrinogen from serum samples of pancreatic cancer patients [82]. Such exemplary studies show that targeted LC-MS/MS, though not yet frequently employed for PTMs and still far from routine use, is a highly promising approach for future clinical research.
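The absolute quantification with spiked-in stable isotope-labeled reference peptides mentioned above reduces to a simple ratio calculation, sketched below in Python. The peak areas, the spiked amount and the function names are hypothetical. The sketch assumes a single-point calibration, that is, an identical and linear response for the endogenous (light) and the reference (heavy) peptide, which a validated assay would have to confirm.

```python
from statistics import mean, stdev

def light_to_heavy_ratio(light_areas, heavy_areas):
    """Ratio of summed transition peak areas: endogenous (light) vs reference (heavy)."""
    return sum(light_areas) / sum(heavy_areas)

def absolute_amount_fmol(light_areas, heavy_areas, spiked_fmol):
    """Single-point calibration: amount = light/heavy ratio x spiked heavy amount."""
    return light_to_heavy_ratio(light_areas, heavy_areas) * spiked_fmol

def coefficient_of_variation(values):
    """Percent CV across replicate measurements."""
    return stdev(values) / mean(values) * 100.0

# Hypothetical peak areas of three transitions in three replicate injections,
# each spiked with 50 fmol of the heavy reference peptide.
replicates = [
    ([2.1e5, 1.0e5, 0.52e5], [4.0e5, 2.0e5, 1.0e5]),
    ([2.0e5, 1.1e5, 0.50e5], [4.1e5, 2.0e5, 1.0e5]),
    ([1.9e5, 1.0e5, 0.48e5], [3.9e5, 1.9e5, 0.9e5]),
]
amounts = [absolute_amount_fmol(light, heavy, 50.0) for light, heavy in replicates]

print("amounts (fmol):", [round(a, 2) for a in amounts])
print("CV (%):", round(coefficient_of_variation(amounts), 2))
```

Summing the transition areas before taking the ratio is one of several possible choices; per-transition ratios can additionally reveal interfered transitions.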
Study design & statistics for biomarker discovery
In most clinical proteomics studies, sample size is one of the major issues. LC-MS analysis time is expensive, well-classified patient material is often not easy to obtain and, particularly in pathological samples, biological variation can be huge. The importance of sample size (n) to gain statistical power and maximize the chances of identifying true-positive candidates from high-throughput experiments has been demonstrated previously [83,84]. Skates et al. computed the probabilities for true biomarkers to pass the initial steps of proteomics-based biomarker discovery, namely, identification in a large-scale experiment to reach the subsequent verification step by MRM [85]. They considered sample size (n), shedding (percentage of patients showing up-regulation of a certain biomarker, e.g., due to heterogeneity of a cancer) and the distance of biomarker intensities between patients and controls. Their simulations demonstrate that the probability that a biomarker with 50% shedding and a median distance of 3 standard deviations is passed to the verification step in an MRM assay with n = 10 (controls = patients = 10) is no more than 15% when performing 20 targeted assays after the discovery experiment. The probability can be increased by increasing the number of targeted assays or the number of samples in the discovery phase. Indeed, this probability increases to 60%, 93% and almost 100% when increasing the sample sizes to n = 25, 50 and 100, respectively, or to 35% when performing 50 targeted assays with n = 10. Thus, this simulation clearly confirms the importance of cohort size in biomarker discovery.
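The dependence on cohort size can be explored with a small Monte Carlo sketch in Python. This is not the model of Skates et al. and will not reproduce their percentages: it assumes normally distributed intensities with unit variance, a single true marker among a hypothetical background of 200 unregulated proteins, ranking by a Welch t statistic, and that only the four top-ranked candidates are passed on to verification.

```python
import math
import random

def welch_t(cases, controls):
    """Welch t statistic for up-regulation in cases versus controls."""
    n1, n2 = len(cases), len(controls)
    m1, m2 = sum(cases) / n1, sum(controls) / n2
    v1 = sum((x - m1) ** 2 for x in cases) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in controls) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

def marker_is_selected(n, rng, shedding=0.5, shift=3.0, n_null=200, top_k=4):
    """Simulate one discovery experiment; True if the marker ranks within top_k."""
    controls = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Only a fraction of patients ("shedding") shows the shifted intensity.
    cases = [rng.gauss(shift if rng.random() < shedding else 0.0, 1.0)
             for _ in range(n)]
    t_marker = welch_t(cases, controls)
    better = 0
    for _ in range(n_null):
        null_cases = [rng.gauss(0.0, 1.0) for _ in range(n)]
        null_controls = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if welch_t(null_cases, null_controls) > t_marker:
            better += 1
            if better >= top_k:  # marker pushed out of the top_k ranks
                return False
    return True

def selection_probability(n, trials=60, seed=1):
    """Fraction of simulated experiments in which the marker reaches verification."""
    rng = random.Random(seed)
    hits = sum(marker_is_selected(n, rng) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (5, 10, 25):
        print(f"n = {n:3d}: estimated probability = {selection_probability(n):.2f}")
```

Raising `top_k` mimics running more targeted assays in the verification phase, while raising `n` mimics a larger discovery cohort.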
Importantly, when analyzing diseases with low prevalence where samples can be obtained from only a few individuals, sample sizes are often far below n = 10, considerably reducing the statistical power of such experiments, particularly when considering the difficulties in making a precise and correct diagnosis.
Therefore, choosing appropriate normalization methods and statistical tests is a major concern directly affecting the final selection of candidates. The often used two-sample t-test to estimate whether a regulation can be considered as significant or not has some inherent drawbacks, since it takes into account both the fold-change and the estimated variance of each potential candidate. If the sample size is small (n ≤ 5), estimation of the variance is quite uncertain. Consequently, the t-test often declares strongly regulated hits with a high variance as ‘not significant’ [86]. Figure 3 demonstrates how decisions for potentially regulated candidates would be made on the basis of a two-sample t-test (two-sided, unequal variance) for two simulated markers in an experiment with n = 3 (3 patients vs 3 controls). Marker 1 exhibits a clear fold-change (3.2-fold), but also a high variance, and thus would be rejected at a 5% significance level. In contrast, marker 2 does not exhibit a biologically meaningful regulation (1.1-fold), but the low variance would render it ‘significant’ at a 5% significance level. Thus, decisions should not be made merely based on p-values resulting from a two-sample t-test.
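The behavior of the two-sample t-test with n = 3 can be reproduced with a short, dependency-free Python sketch. The log2 intensities below are invented for illustration and are not the simulated data of Figure 3; they were chosen to give a roughly 3.2-fold change with high variance (marker 1) and a roughly 1.1-fold change with low variance (marker 2). The p-value is obtained by numerically integrating the density of the t distribution.

```python
import math
from statistics import mean, variance

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for Student's t with df degrees of freedom (t >= 0), Simpson's rule."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

def welch_t_test(a, b):
    """Two-sided two-sample t-test with unequal variances; returns (t, df, p)."""
    qa, qb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(qa + qb)
    df = (qa + qb) ** 2 / (qa ** 2 / (len(a) - 1) + qb ** 2 / (len(b) - 1))
    return t, df, 2 * t_upper_tail(abs(t), df)

# Hypothetical log2 intensities (3 patients vs 3 controls).
marker1 = {"patients": [11.7, 12.6, 10.7], "controls": [10.0, 10.8, 9.2]}
marker2 = {"patients": [10.14, 10.12, 10.16], "controls": [10.00, 10.02, 9.98]}

for name, m in (("marker 1", marker1), ("marker 2", marker2)):
    t, df, p = welch_t_test(m["patients"], m["controls"])
    fold = 2 ** (mean(m["patients"]) - mean(m["controls"]))
    print(f"{name}: fold-change = {fold:.2f}, t = {t:.2f}, df = {df:.2f}, p = {p:.4f}")
```

With these values, the strongly regulated but variable marker 1 fails the 5% threshold, whereas the barely regulated marker 2 passes it.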
Figure 3.
Using the classical t-test for biomarker research. Two simulated markers 1 and 2 (A) in a background of an iTRAQ-based phosphoproteomics experiment (B, C). Using the two-sample t-test, the not-promising marker 2 would be defined as significant, whereas marker 1 would not be considered. Using the moderated t-test provided in the Limma package [89,92,93], only the promising marker 1 remains significant. A detailed description for the use of this package was recently published by Kammers et al. [86].
Various studies have addressed the limitations of the t-test for omics experiments, which often have small replicate numbers [87,88]. In recent years, statistical methods that had already been used in other high-throughput technologies have been employed for proteomics experiments [89,90]. The inaccuracy of variance estimations in experiments with only few replicates can be compensated, for example, by ‘empirical Bayes method shrinking’ [86]. The shrunken variance is then used in a moderated t-test, in which the p-value depends more on the fold-change than on the initially estimated variance. Notably, empirical Bayes methods have been successfully used for approximately 10 years in the microarray field [91] and can be easily adapted to proteomics experiments [86,89,92]. Recently, Kammers et al. described the use of the R-package Limma [92,93] for iTRAQ 8-plex data, including a detailed online description (see ‘Methods’ section of [86]). We used the Limma package to assess p-values by the moderated t-test to illustrate the empirical Bayes method for the aforementioned simulated markers 1 and 2 (Figure 3). Indeed, using the moderated t-test, the promising marker 1 is declared as significant (5% significance level), whereas the non-promising marker 2 shifts into the background. Thus, the results of the moderated t-test are more in agreement with the expected outcome for these two biomarkers.
In summary, statistical methods might help researchers to identify more promising candidates, but cannot compensate for low sample sizes. When only few replicates are analyzed, especially during the discovery (and perhaps also verification) phase, appropriate statistical methods might increase the probability of selecting promising candidates. Continuous improvements might help to adapt statistical methods to the needs of clinical (proteomics) research and to identify and validate promising biomarker candidates in the future.
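The variance shrinkage behind the moderated t-test can be sketched in a few lines of Python. This is an illustration of the principle, not the Limma implementation: Limma estimates the prior variance and prior degrees of freedom from all features of the experiment, whereas here both are fixed by hand to hypothetical values (`prior_var = 0.05`, `prior_df = 4`). The marker intensities are likewise invented and are not the data of Figure 3.

```python
import math
from statistics import mean, variance

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for Student's t with df degrees of freedom (t >= 0), Simpson's rule."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda x: c * (1 + x * x / df) ** (-(df + 1) / 2)
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return 0.5 - total * h / 3

def moderated_t_test(a, b, prior_var, prior_df):
    """Two-group moderated t-test with a hand-set empirical Bayes prior (assumption)."""
    na, nb = len(a), len(b)
    d = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / d
    # Shrink the per-feature variance toward the prior variance.
    shrunk = (prior_df * prior_var + d * pooled) / (prior_df + d)
    t = (mean(a) - mean(b)) / math.sqrt(shrunk * (1 / na + 1 / nb))
    df = prior_df + d
    return t, df, 2 * t_upper_tail(abs(t), df)

# Hypothetical log2 intensities (3 patients vs 3 controls).
marker1 = {"patients": [11.7, 12.6, 10.7], "controls": [10.0, 10.8, 9.2]}
marker2 = {"patients": [10.14, 10.12, 10.16], "controls": [10.00, 10.02, 9.98]}

for name, m in (("marker 1", marker1), ("marker 2", marker2)):
    t, df, p = moderated_t_test(m["patients"], m["controls"],
                                prior_var=0.05, prior_df=4)
    print(f"{name}: moderated t = {t:.2f}, df = {df}, p = {p:.4f}")
```

Under these assumptions the high variance of marker 1 is pulled down and the very low variance of marker 2 is pulled up, so that marker 1 falls below and marker 2 above the 5% threshold.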
Analyzing protein phosphorylation in clinical samples
Protein phosphorylation plays a key role in many clinically relevant processes such as stem cell differentiation [94], platelet activation [95] and cell cycle regulation [96]. Abnormal phosphorylation patterns can be linked to several diseases such as Alzheimer’s disease [97-99], diabetes [100-102], cancer [103-105] or cardiovascular diseases [106-108]. Therefore, the detection of aberrant phosphorylation holds great potential for understanding the genesis and progression of diseases, discovering new biomarkers and evaluating treatment response in clinical research (reviewed in [109,110]).
Recent examples for the use of quantitative phosphoproteomics to characterize signaling events in clinically relevant targets include the comparison of phosphorylation patterns between plasma membrane proteins of sickle red blood cells and normal erythrocytes upon MEK1/2 inhibition or exogenous ERK2 addition. Here, ERK1/2 could be further confirmed as a potential therapeutic target, as the results indicated its connection to several dysfunctional aspects of sickle red blood cells [111]. A study on primary adipocyte cell cultures from healthy individuals and individuals with type 2 diabetes suggested a putative mechanism for insulin resistance in obese patients [112]. Recently, we used quantitative phosphoproteomics to study time-resolved changes upon inhibition of human platelets isolated from healthy donors by stimulation of the inhibitory cAMP/PKA pathway. This provided novel insights into the crosstalk of signaling pathways and pointed to potential new candidates for anti-platelet treatment [19].
State-of-the-art MS-based phosphoproteomic workflows rely on the enrichment of phosphorylated peptides. For this purpose, various affinity chromatography-based methods are frequently employed, with Fe3+-immobilized metal ion affinity chromatography (IMAC) [113,114], Ti4+-IMAC [33,115,116] and TiO2-MOAC [25,27,32,117,118] being prominent examples. Often, these techniques are combined with LC, either prior to or subsequent to enrichment, in order to fractionate and/or further enrich the sample. Successfully applied combinations comprise MOAC–IMAC–HILIC [32], SCX–IMAC [119], SCX–MOAC [120], high-pH–RP-MOAC [27] and HILIC–IMAC [121]. Schweppe et al. used SCX–MOAC to quantitatively profile non-small cell lung cancer tissue from human individuals in a super-SILAC approach to map substrates of the oncogenic kinase PLK1. Moreover, they conducted a large-scale comparison of cancer signaling between different individuals, with the goal of monitoring cancer progression and treatment response in a personalized manner [53]. Herskowitz et al. applied Ti4+-IMAC to characterize phosphorylation patterns in post-mortem brains of frontotemporal lobar degeneration patients and found GFAP, NDRG2, MAP1A, Nogo, PKCγ and HSP90AA1 abnormally regulated, compared to control brains [122].
In affinity chromatography-based phosphoproteomics, throughput and reproducibility are generally limited owing to a multitude of labor-intensive manual steps. Thus, HPLC-based methods that target distinct physicochemical properties of phosphopeptides are attractive alternatives regarding automation toward clinical applications. At pH 2–3, phosphopeptides, similar to glycopeptides, are more hydrophilic and have a lower net charge than unmodified peptides. However, targeting only hydrophilicity using HILIC is not sufficient for selective enrichment of phosphopeptides, unless combined with affinity chromatography (e.g., IMAC or MOAC), as phosphopeptides are distributed rather evenly throughout the obtained fractions [121]. Chromatographic modes targeting net charge proved to be more efficient, especially for tryptic peptides. SCX can be used to effectively enrich phosphopeptides in the very early fractions [123].
However, a drawback of charge-based separation is the co-elution of other charge state-reduced peptides, such as N-terminally acetylated peptides, or most C-terminal peptides, hampering the specificity of this method and demanding for further separation/enrichment. The probably most promising HPLC-based approach is termed electrostatic repulsion-hydrophilic interaction liquid chromatography (ERLIC) [124]. In ERLIC, an anion exchange column is used with a HILIC-type buffer system (70% acetonitrile) at low pH to superimpose two effects that help to separate phosphorylated from non-modified peptides. First, phosphopeptides are less repulsed by the positively charged stationary phase than unmodified (tryptic-) peptides, and second they are additionally retained by their more pronounced hydrophilicity. This ERLIC mechanism was exploited and characterized in various phosphoproteomics studies and yields highly selective phosphopeptide enrichment using a solely HPLC-based approach. The majority of unphosphorylated peptides are in the flow-through and early fractions [125], whereas phosphopeptides are not only efficiently retained [126] but furthermore separated according to the number of phosphoamino acids [127]. Consequently, ERLIC has proven very efficient for the enrichment of multiply phosphorylated peptides [128]. We recently demonstrated that a tailored strong cation exchange/reversed-phase solid-phase extraction (SCX/RP-SPE) further boosts the sensitivity and performance of ERLIC for efficient phosphopeptide enrichment from limited sample material. Thus, approximately 7500 highly confident phosphorylation sites could be identified from 100 µg of non-stimulated HeLa cells, by measuring only 50% per fraction on an Orbitrap Velos mass spectrometer [20]. Indeed, this approach can be further refined to include a global proteome analysis from the same sample [129]. 
Consequently, ERLIC ranks among the most sensitive phosphoproteomics workflows to date and, more importantly, can be conducted without any offline AC. In our opinion, ERLIC provides a flexible, sensitive and efficient platform with high reproducibility that may be used for high-throughput phosphoproteomics studies in the future.
Although phosphopeptide enrichment has improved with regard to efficiency and sensitivity, phosphoproteomics still faces considerable challenges, some of which are outlined below.
The overlap between biological replicates (even for cell culture) is often comparatively low. First of all, this might derive from low reproducibility and specificity during enrichment (e.g., presence of other modifications) and LC-MS analysis (e.g., retention time shifts, MS performance). In addition, high sample complexity and differences in phosphopeptide identification (e.g., undersampling, peptide and site localization scoring) can have a strong impact. These technical issues can be partially addressed by more extensive sample fractionation and, thus, increased analysis time, as well as the use of faster and more sensitive mass spectrometers to reduce undersampling. However, the biological variance between samples will remain a source of irreproducibility, hampering the comprehensive detection of ‘the phosphoproteome’, particularly owing to the highly dynamic and complex nature of PTM patterns.
The applied enrichment method can introduce a certain bias, preferentially enriching phosphopeptides with distinct physicochemical properties (e.g., acidic, basic, hydrophilic or hydrophobic phosphopeptides) and, therefore, subsets of the phosphoproteome. Targeting different physicochemical properties simultaneously, as done in ERLIC, could help to overcome this limitation.
Digestion efficiency can be reduced if a phosphorylation is in proximity to a proteolytic cleavage site, such that a mixture of fully cleaved and missed-cleavage phosphopeptides can be generated. 
This further impairs the detection and, even more, the quantification of phosphorylation sites, calling for adjusted digestion parameters to improve digestion efficiency [130]. The complementary use of alternative proteases is another promising approach toward a more comprehensive phosphoproteome [131,132], including the usage of non-specific proteases [133].
The localization of a phosphate moiety within a peptide sequence can be challenging and irreproducible, even though specific statistical tools have been developed to determine probabilities of site localization [134-138]. Importantly, even if a phosphopeptide is present in all analyzed replicates, it might not always yield confident and/or the same site localization, thus apparently reducing the overlap between replicates. The use of alternative fragmentation techniques such as electron-transfer dissociation (ETD) [139] and electron-transfer-higher energy collision-induced dissociation (etHCD) [140] for sequencing phosphopeptides can assist site assignment, particularly compared to ion trap collision-induced dissociation (CID), which is often dominated by neutral losses of the precursor ion [141].
The separation of phosphopeptides by RP chromatography can suffer from peak broadening in the case of multiply phosphorylated peptides. To improve peak shape and width (i.e., to obtain higher sensitivity), the use of chelating agents such as ethylenediaminetetraacetic acid to remove metal ion contaminations in the HPLC system [142] has been reported.
In summary, it is most important to consider possible pitfalls and sources of error (or bias) in the design of clinical phosphoproteomics studies; otherwise, important biological information can be easily misinterpreted or simply lost.
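The charge-based argument made for SCX in this section can be illustrated with a rough calculation. The following is a minimal sketch, not a validated model: it assumes that at pH 2.7 all carboxyl groups are neutral, that Lys, Arg, His and a free N-terminus each carry one positive charge, and that each phosphate group carries one negative charge. The function name and the example sequences are hypothetical.

```python
# Rough net-charge estimate of a peptide under acidic SCX conditions (pH ~2.7).
# Simplifying assumptions (illustrative only, not taken from the article):
#   - carboxyl groups (Asp, Glu, C-terminus) are protonated and neutral
#   - Lys, Arg, His and a free N-terminal amine each contribute +1
#   - each phosphate group contributes -1

def net_charge_low_ph(sequence: str, n_phospho: int = 0,
                      n_term_blocked: bool = False) -> int:
    """Return an integer net-charge estimate for a peptide at pH ~2.7."""
    basic_residues = sum(sequence.count(aa) for aa in "KRH")
    n_terminus = 0 if n_term_blocked else 1
    return basic_residues + n_terminus - n_phospho


# Hypothetical tryptic peptide with a C-terminal Lys
print(net_charge_low_ph("AGSLDEVTPK"))                       # unmodified
print(net_charge_low_ph("AGSLDEVTPK", n_phospho=1))          # singly phosphorylated
print(net_charge_low_ph("AGSLDEVTPK", n_term_blocked=True))  # N-terminally acetylated
print(net_charge_low_ph("AGSLDEVTPL"))                       # protein C-terminal peptide (no K/R)
```

Under these assumptions, the phosphorylated, the N-terminally acetylated and the C-terminal peptide all end up with the same estimated charge, one unit below the unmodified tryptic peptide, which is consistent with the co-elution problem described above.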
Analyzing protein glycosylation in clinical samples
Protein glycosylation, the attachment of glycan structures to proteins, is another well-known and clinically relevant PTM that has been found on asparagine in N-X-S/T (X≠P) motifs (N-glycosylation), as well as on serine, threonine and, recently, tyrosine (O-glycosylation) [143,144]. The attached glycan structures can be highly complex combinations of different carbohydrate-building blocks and act as dense information carriers [145]. Thus, not only the position within the protein and glycosylation site occupancy, but also the glycan structure has to be elucidated for a deeper understanding of its pathological role. Glycosylation is known to assist in protein folding [146,147], and folding quality control [148,149], protein sorting [150], protein degradation, cell–cell interaction and host–pathogen interaction [151].
Protein glycosylation has a major impact on protein–protein interaction and has been shown to play a primary role in various pathologies, for example, in certain types of cancer [152,153] and the hypoxia-induced invasiveness of cancer cells [71], neurodegenerative diseases such as Alzheimer’s [154,155], multiple sclerosis [156], atherosclerosis [157], bleeding disorders [158,159], diabetes [160] and inflammation [161]. Thus, glycosylation is a key target in biomarker discovery, underscored by the fact that several cancer biomarkers are indeed glycoproteins [162] and by the existence of cancer-specific glycan structures [163]. Differential glycoprotein expression might be used for cancer classification, as demonstrated for breast cancer cell lines [164]. Moreover, glycoproteins might be promising markers to reliably detect early cancer forms like hepatocellular carcinoma [165], allowing differentiation from other liver-related diseases such as fibrosis and cirrhosis [166]. 
However, the assessment of a single biomarker for diagnosis often results in insufficient specificity, whereas monitoring multiple glycoproteins might be the key to increase the specificity and, furthermore, sensitivity in diagnosis [167,168]. Semi-quantitative glycoproteomics is mostly conducted by enrichment of glycosylated peptides from the vast majority of non-glycosylated peptides, for instance, by targeting the strongly pronounced hydrophilicity of glycopeptides by HILIC [67,169]. The reduced net charge of glycopeptides carrying sialic acid-containing glycans can be exploited to separate them from the bulk of tryptic peptides by SCX [170]. This can be combined with glycopeptide enrichment based on hydrazide chemistry [171], lectins [172,173] and TiO2-MOAC [174], as well as pre-fractionation techniques such as high-pH-RP [28]. For N-glycosylation analysis, large-scale LC-MS–based profiling so far mostly focuses on the identification of sites rather than glycan structures, as glycopeptide fragment ion spectra can be extremely complex and hard to interpret. Hence, N-glycans are typically removed using peptide N-glycosidase F prior to LC-MS analysis, leading to a conversion of Asn to Asp. Such site-specific approaches can identify hundreds or thousands of N-glycosylation sites in a single experiment [5]. The conversion of Asn to Asp induces a mass shift of +1 Da, which is a key feature for the identification of the N-glycosylation sites using LC-MS. However, this deamidation might be an artifact from sample preparation that can even occur within the N-glycosylation consensus motif N-X-S/T [175]. The use of H₂¹⁸O during peptide N-glycosidase F digestion introduces a more specific mass shift of +3 Da, which is distinguishable from unspecific deamidation events.
Although recent advances enable large-scale identification of the O-glycoproteome in a similar manner (reviewed recently by Levery et al. [176]), we will here mainly focus on LC-MS–based N-glycoproteomics.
Strategies employing HPLC-based enrichment and fractionation are frequently used, especially for the quantitative profiling of N-glycosylation. HPLC columns for lectin-affinity chromatography exhibit high selectivity toward certain glycan structures and might be an excellent choice for studying specific sub-glycoproteomes, as demonstrated for fucosylation in the sera of patients with small cell lung cancer [70]. Zhao et al. recently used an online 2D-LC HILIC-RP setup for the detection of approximately 250 glycosylation sites from iTRAQ-labeled plasma samples from Macaca fascicularis [67].
As previously mentioned, SCX and TiO2 have been used for targeting sialic acid-containing glycopeptides [30], but neither enables a reasonable fractionation or separation from other PTMs, which might lead to signal suppression in LC-MS [17,177]. In contrast, ERLIC might allow overcoming these limitations, as it targets both the reduced net charge and the increased hydrophilicity, as first demonstrated by Lewandrowski et al. [178]. Since then, ERLIC has been used in various N-glycoproteomics profiling studies [179-181] and recently in conjunction with iTRAQ-based quantification, as demonstrated by Ren et al. [71]. After ERLIC-based glycopeptide enrichment and N-glycan release, deamidated peptides were iTRAQ-labeled and fractionated in a second ERLIC run, enabling the detection of approximately 200 N-glycosylation sites in an epidermoid carcinoma cell line. Thus, although not yet employed in large-scale glycosylation profiling studies, again ERLIC shows great potential for the enrichment and separation of glycosylated peptides and might complement the toolbox for large-scale N-glycoproteomics [5].
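The site-centric logic described in this section (an Asn within an N-X-S/T motif, X≠P, that is converted to Asp upon N-glycan release) can be sketched in a few lines. This is an illustrative sketch only: the function name and the example sequence are hypothetical, and the monoisotopic masses are standard values that are not taken from the article. The nominal +1 Da and +3 Da shifts mentioned in the text correspond to approximately +0.984 Da and +2.988 Da, respectively.

```python
import re

# Standard monoisotopic masses in Da (assumed values, not from the article)
MASS_H = 1.007825
MASS_N = 14.003074
MASS_O16 = 15.994915
MASS_O18 = 17.999160

# Asn -> Asp replaces the side-chain NH2 by OH: net gain of O, loss of N and H
SHIFT_H2_16O = MASS_O16 - MASS_N - MASS_H  # deglycosylation in normal water
SHIFT_H2_18O = MASS_O18 - MASS_N - MASS_H  # deglycosylation in 18O-labeled water

# N-X-S/T with X != P; the lookahead keeps overlapping motifs detectable
SEQUON = re.compile(r"N(?=[^P][ST])")


def sequon_positions(protein: str) -> list:
    """Return 1-based positions of Asn residues within N-X-S/T (X != P) motifs."""
    return [match.start() + 1 for match in SEQUON.finditer(protein)]


# Hypothetical sequence: N2 and N9/N10 sit in motifs, N6 is followed by Pro
print(sequon_positions("MNGSANPTNNTS"))
print(round(SHIFT_H2_16O, 4), round(SHIFT_H2_18O, 4))
```

A deamidated Asn that lies outside such a motif, or that shows only the smaller shift after digestion in H₂¹⁸O, would then be a candidate for an unspecific deamidation event rather than a former N-glycosylation site.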
Analyzing proteolytic cleavage in clinical samples
Proteolytic cleavage is an irreversible PTM that occurs on the global proteome scale and is known to determine the intra- or extracellular fate, function, activity and turnover rate of proteins. The enzymes involved in these processes are exo- and endopeptidases, together termed the ‘degradome’. N- or C-terminal signal sequences of newly synthesized proteins can determine their subcellular destination, as shown for the endoplasmic reticulum [182], nucleus [183] or mitochondria [184]. These signal sequences are, in most cases, enzymatically removed after translocation [185,186]. Besides, pro-proteins can mature to an active form by sequential trimming or can be cleaved into smaller functional proteins of the same or even completely different function. This is known for several hormones, such as preproinsulin processing into active insulin upon release of the C-peptide [187] or angiotensin II release by C-terminal cleavage of angiotensin I [188]. On the cellular level, the family of caspases (cysteine proteases), for example, activated upon apoptosis, is well known to process various proteins [189]. The cleavage-based inactivation of focal adhesion kinase proteins by caspases is known to suppress cell–cell-dependent survival signaling in the early stages of apoptosis [190]. Aberrant proteolytic cleavage is connected to a variety of diseases, including cardiovascular and neurodegenerative diseases [191,192], inflammation and impaired wound healing [193], as well as tumor metastasis [194].
The identification of mature protein C- and N-termini, as well as of the so-called ‘neo’ N-termini that are produced upon proteolytic cleavage, makes it possible to determine and monitor protease function and to identify substrates, their cleavage sites and, thus, potential consensus motifs. 
Whereas in the past protein termini were characterized by Edman degradation [195,196], nowadays specific methods for enrichment of N-terminal and C-terminal peptides are applied in conjunction with LC-MS.
The crucial step for a successful, unbiased terminomic analysis is a (more or less) complete derivatization (i.e., labeling) of free termini on the protein level, followed by proteolytic digestion. The labeling step is required to clearly distinguish ‘real’ termini (i.e., terminal peptides) from those generated upon in vitro digestion (i.e., internal peptides). Next, different methods can be applied to separate internal from terminal peptides [197]; however, C-terminal enrichment is complicated for several reasons. The similar reactivity of C-terminal and Asp/Glu carboxyl groups leads to side reactions, and the generally low reactivity of carboxylic acids reduces labeling efficiency [198].
The first method for large-scale N-terminomics was combined fractional diagonal chromatography (COFRADIC) [199]. In COFRADIC, free N-termini and lysines (primary amines) are blocked on the protein level by deutero-acetylation, followed by a tryptic digestion. Whereas the deutero-acetylation allows distinguishing endogenous from in vitro N-terminal acetylation, the blocked Lys residues cause an ArgC specificity of trypsin. Next, the complex mixture of internal and N-terminal peptides is fractionated by RP-LC. All fractions are individually treated with 2,4,6-trinitrobenzenesulfonic acid, which can only react with free N-termini of internal peptides and induces an increase in hydrophobicity. In a subsequent RP-LC fractionation under the same conditions, the 2,4,6-trinitrobenzenesulfonic acid-derivatized peptides shift to later retention times, whereas unaltered N-terminal peptides retaining their elution behavior can be specifically collected. 
COFRADIC has been applied to reveal the role of the MPP/Icp55 interplay in the stabilization of the mitochondrial proteome [200] or to characterize proteolytic processing in the secretome of gastric cancer associated myofibroblasts [201]. SCX pre-fractionation and Qcyclase/pGAPase treatment to remove N-pyroglutamyl modifications after tryptic digestion can be used prior to COFRADIC to further increase enrichment specificity [202]. In another powerful method, terminal amine isotopic labeling of substrates (TAILS), primary amines are blocked on the protein level, followed by proteolytic digestion; internal peptides are then depleted using an aldehyde-functionalized water-soluble polymer [203]. TAILS was successfully employed to investigate proteolytic events and the role of MMP2 during skin inflammation [204], as well as of dipeptidyl peptidases 8 and 9 in energy metabolism and homeostasis [205]. Recently, TAILS has been used for characterizing proteolytic events upon inflammation and wound healing [206], and during platelet storage [207].
We recently introduced charge-based FRADIC (ChaFRADIC), which makes use of the same principle as COFRADIC but uses a 2D SCX-based charge state separation [31]. This reduces the number of fractions obtained and, moreover, proved to be robust and highly sensitive for the identification of N-terminal peptides. After blocking of primary amines on the protein level and tryptic digestion (Arg-C specificity), the generated peptides are fractionated according to their charge state at pH 2.7. The internal peptides in each fraction (five fractions, charge states +1 to ≥+4) are subsequently deutero-acetylated, leading to a reduction in net charge. In a second SCX separation under the same conditions, the internal peptides consequently shift to an earlier charge state fraction, whereas N-terminal peptides retain their retention time window. 
ChaFRADIC allows a highly sensitive N-terminal enrichment, yielding considerable coverage of the N-terminome from less than 100 µg of cell lysate. Importantly, both FRADIC approaches can be adapted to enrich for other PTMs [208].
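The charge shift that ChaFRADIC relies on can be made explicit with a small model. The sketch below is a simplification under stated assumptions: Lys residues are fully blocked and carry no charge, Arg and His each carry +1 at pH 2.7, a free N-terminal amine carries +1, and other modifications are ignored. The function names and the example sequence are hypothetical and not part of the published method.

```python
# Simplified model of the two SCX runs in a ChaFRADIC-like workflow.
# Assumptions (illustrative only): Lys is blocked (no charge), Arg and His
# are +1 at pH 2.7, a free alpha-amine is +1, everything else is neutral.

def charge_at_ph_2_7(sequence: str, free_n_terminus: bool) -> int:
    """Estimated charge state of a Lys-blocked peptide at pH 2.7."""
    return sequence.count("R") + sequence.count("H") + (1 if free_n_terminus else 0)


def charge_in_both_runs(sequence: str, is_protein_n_terminus: bool) -> tuple:
    """Return (charge in first SCX run, charge in second SCX run)."""
    # First run: protein N-termini were blocked before digestion, whereas
    # internal peptides carry a free amine created by the tryptic cleavage.
    first = charge_at_ph_2_7(sequence, free_n_terminus=not is_protein_n_terminus)
    # Between runs, remaining free amines are deutero-acetylated.
    second = charge_at_ph_2_7(sequence, free_n_terminus=False)
    return first, second


# Hypothetical Arg-C-type peptide containing one blocked Lys
print(charge_in_both_runs("ASDLKTVER", is_protein_n_terminus=True))   # N-terminal peptide
print(charge_in_both_runs("ASDLKTVER", is_protein_n_terminus=False))  # internal peptide
```

In this model, only the internal peptide loses one charge between the two runs and therefore moves to an earlier fraction, while the N-terminal peptide stays in its original charge state window and can be collected.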
Top-down protein analysis – current use & future perspectives
In top-down proteomics, intact proteins rather than peptides are analyzed by MS. Top-down-MS has been classically applied to characterize purified proteins and low-complexity protein mixtures, mainly due to separation and sensitivity issues, as proteins are considerably more heterogeneous than peptides. In recent years, these limitations have been partly overcome by new, efficient methods and instrumental developments. For a comprehensive overview of this fascinating and still emerging field, we refer to excellent reviews by the groups of McLafferty and Kelleher [209,210]. To date, in clinical research, top-down-MS is mostly used to monitor distinct biomarker proteins, for example, those extracted from body fluids, such as monoclonal immunoglobulins from patient sera as a marker for monoclonal gammopathy [211], as well as diabetes marker proteins, metabolites and PTMs of blood proteins [212]. Moreover, matrix-assisted laser desorption/ionization (MALDI) top-down MS (as reviewed in [213]) has been applied for diagnostic imaging of patient tissue resections, such as HER2 receptor status classification in breast cancer tissues [214].
Probably the most exciting feature of intact protein analysis is that, given sufficient separation power, high mass accuracy, resolution and sensitivity, different ‘proteoforms’ [215] can be distinguished. Hence, it is possible to deduce whether certain PTM regulations derive from the same protein molecule or represent different proteoforms, a clear advantage over the currently mostly used bottom-up approaches. Consequently, top-down-MS also enables the determination of site occupancies and, even more, the analysis of PTM crosstalk [16].
Particularly the past 5 years have shown enormous progress in the applicability and power of top-down studies. Tran et al. utilized in-solution isoelectric focusing followed by gel elution liquid fraction entrapment electrophoresis (GELFrEE) and nano-LC-MS to identify 1043 protein accessions from 1045 genes (77% N-terminally acetylated), comprising 3039 proteoforms with different PTM patterns [216]. Whereas large-scale top-down analysis has been rather restricted to small proteins, more recent developments achieve good coverage of the proteome up to 50 kDa [217], including integral membrane proteins [218]. Lately, the field has focused on transferring powerful bottom-up quantification strategies such as label-free quantification [219] and NeuCode [220] to top-down proteomics. These recent achievements show great promise for further advancing this technology into a highly valuable tool for PTM-related clinical research.
Expert commentary
The LC-MS–based analysis of clinically relevant PTMs has considerably improved over the past 10 years. Nowadays, sophisticated strategies enable the analysis of hundreds to thousands of PTMs from as little as 100 µg of protein starting material. However, to further reduce the required sample material, PTM enrichment strategies have to be refined to yield a quantitative recovery and, thus, maximize sensitivity, as sensitivity is the key to identifying low-abundant modified peptides. Considering a detection limit of 100 amol on column and a fully quantitative recovery (which is far from reality), identifying a peptide derived from a protein expressed with 10 copies per cell would require starting with at least 6 × 10⁶ cells. In the case of HeLa, this would correspond to approximately 2–3 mg of protein as the starting material. However, if under the same conditions the sample material was limited to merely 600 cells, only proteins above 100,000 copies per cell could be identified. This is the case for less than half of a typical cancer cell line proteome (Figure 4) [221]. Notably, the low stoichiometry of PTMs further complicates detection and confident identification.
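The sensitivity estimate above follows directly from the Avogadro constant. The following minimal sketch reproduces the arithmetic under the same idealized assumptions as the text (100 amol limit of detection, fully quantitative recovery); the function names and the `recovery` parameter are illustrative additions.

```python
AVOGADRO = 6.02214076e23  # molecules per mol


def cells_required(copies_per_cell: float, lod_mol: float = 100e-18,
                   recovery: float = 1.0) -> float:
    """Number of cells needed so that a protein reaches the detection limit."""
    molecules_needed = lod_mol * AVOGADRO  # about 6 x 10^7 molecules for 100 amol
    return molecules_needed / (copies_per_cell * recovery)


def min_copies_detectable(n_cells: float, lod_mol: float = 100e-18,
                          recovery: float = 1.0) -> float:
    """Lowest copy number per cell that is detectable from a given number of cells."""
    return lod_mol * AVOGADRO / (n_cells * recovery)


print(f"{cells_required(10):.2e} cells for 10 copies per cell")
print(f"{min_copies_detectable(600):.0f} copies per cell detectable from 600 cells")
print(f"{cells_required(10, recovery=0.1):.2e} cells at an assumed 10% recovery")
```

Because the required cell number scales inversely with recovery, any loss during enrichment translates directly into proportionally more starting material, which is why quantitative recovery is emphasized above.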
Figure 4.
Protein copy number distribution in HeLa cells (*indicates a membrane protein). Dashed lines give the limits of detection when analyzing a certain number of cells, assuming a full quantitative recovery and a limit of detection of 100 amol. If only 600 cells are available, approximately 20% of the proteome will be covered.
Besides sensitivity, LC-MS analysis time is the major bottleneck for large sample cohorts since many strategies rely on extensive peptide fractionation. To reduce the number of LC-MS runs, using state-of-the-art MS instruments that provide excellent scan rates is mandatory. Current mass spectrometers with acquisition rates of approximately 20 Hz enable the detection of 10,000 peptides per hour of LC-MS time [222], indicating that owing to technical advances, extensive fractionation might become less and less mandatory. Considering recent and anticipated future improvements in sample preparation, enrichment strategies and LC-MS instrumentation, large-scale clinical proteomics of PTMs from limited sample material might be feasible in the near future.
Despite the numerous studies which efficiently target single PTMs, the knowledge about PTM crosstalk (reviewed in [16]) is rather limited. Several examples of crosstalk events have been reported, such as phosphorylation-dependent ubiquitination in EGFR/MAPK signaling [223], or phosphorylation-dependent SUMOylation on heat-shock proteins [224], as well as the crosstalk between O-GlcNAc and phosphorylation in the stabilization of p53 [225]. In particular, crosstalk-mediated regulation of histones is well investigated and involves many PTMs, including phosphorylation, lysine acetylation [226] and arginine methylation [227]. Thus, a single PTM-related discovery might not necessarily reveal the entire complexity of a corresponding PTM-dependent regulation. 
In-depth investigation of PTM crosstalk by combining several single-PTM analyses with highly sophisticated sample preparation strategies can be conducted, but so far has mostly required huge sample amounts and hundreds of hours of LC-MS time.
In conclusion, recent methodological advances in PTM research should be considered as a valuable toolbox for the future development of more sensitive and time-efficient strategies that allow analyzing putative PTM codes even in clinical samples. This is particularly important not only for the discovery of novel, clinically relevant pathways and their interconnections, but also for treatment response studies in clinical trials [228].
Five-year view
In the past two decades, MS-based proteomics has been a highly dynamic field with a strong impact on life science. This development has been boosted by the continuous development of more sensitive instrumentation, methodology and novel commercial applications. It can be expected that this trend will continue for the next 5 years, providing researchers with even faster, more sensitive and more powerful MS instrumentation. Faster acquisition rates will further reduce undersampling and, thus, improve the overlap between replicates. Data-independent acquisition methods are becoming increasingly popular and may also have a strong impact on clinical research [229,230], especially in early discovery phase experiments, as demonstrated in a recently published protocol for fast and reproducible quantitative proteome mapping with approximately 1 mg of tissue biopsy samples [231]. Quality control measures established in recent years will further enhance sample preparation and analysis, with an impact on recovery, robustness and sensitivity of future studies [43]. Ongoing efforts for automation and standardization of typical proteomics workflows and new developments such as NeuCode to expand multiplexing capabilities for quantitative proteomics show great promise to increase throughput and, thus, drive quantitative proteomics more toward clinical application. As an alternative to classical bottom-up proteomics, sophisticated top-down [232] and middle-down [233] approaches will allow a more detailed study of complex PTM patterns derived from the very same protein molecule. The development of novel MALDI-MS imaging techniques might make it possible to screen and visualize biomarkers directly from the tissue in order to aid pathological assessment in a clinical environment [234]. 
To conclude, the recent and upcoming developments in the field hold exciting promise and should help to exploit the full potential of MS-based proteomics for revealing disease mechanisms, identifying biomarker panels and developing diagnostic assays to pave the way for a new age of personalized medicine.