
In Silico Proteome Cleavage Reveals Iterative Digestion Strategy for High Sequence Coverage.

Abstract

In the postgenome era, biologists have sought to measure the complete complement of proteins, termed proteomics. Currently, the most effective method to measure the proteome is with shotgun, or bottom-up, proteomics, in which the proteome is digested into peptides that are identified followed by protein inference. Despite continuous improvements to all steps of the shotgun proteomics workflow, observed proteome coverage is often low; some proteins are identified by a single peptide sequence. Complete proteome sequence coverage would allow comprehensive characterization of RNA splicing variants and all posttranslational modifications, which would drastically improve the accuracy of biological models. There are many reasons for the sequence coverage deficit, but ultimately peptide length determines sequence observability. Peptides that are too short are lost because they match many protein sequences and their true origin is ambiguous. The maximum observable peptide length is determined by several analytical challenges. This paper explores computationally how peptide lengths produced from several common proteome digestion methods limit observable proteome coverage. Iterative proteome cleavage strategies are also explored. These simulations reveal that maximized proteome coverage can be achieved by use of an iterative digestion protocol involving multiple proteases and chemical cleavages that theoretically allow 92.9% proteome coverage.


Year: 2014 PMID: 30687733 PMCID: PMC6347401 DOI: 10.1155/2014/960902

Source DB: PubMed Journal: ISRN Comput Biol ISSN: 2314-5420

Introduction

In the postgenome era, biologists have sought system-wide measurements of RNA, proteins, and metabolites, termed transcriptomics, proteomics, and metabolomics, respectively. Shotgun, or bottom-up, proteomics has become the most comprehensive method for proteome identification and quantification [1]. However, observed protein sequence coverage is often low. The ability to cover 100% of protein sequences in a biological system was likened to surrealism in a recent review by Meyer et al. [2]. Multiple steps in the traditional shotgun proteomics workflow contribute to the deficit in observed sequence coverage, including proteome isolation, proteome digestion, peptide separation, peptide MS/MS, and identification by peptide-spectrum matching. Proteome isolation has been extensively evaluated [3, 4]. Several types of peptide separation have been explored [5-7]. Mass spectrometers are becoming more sensitive and versatile [8-10]. Peptide-spectrum matching algorithms are adapting to new data types [11] and becoming more sensitive [12, 13]. Proteome fragmentation into sequenceable peptides is one step with significant room for improvement. DNA sequencing relies on sequence fragmentation into readable pieces by mechanical force [14], which produces a nearly uniform distribution of fragment lengths. In comparison, proteome fragmentation is generally accomplished by targeting one or more amino acid residues for cleavage, and, therefore, the protein cleavage can be likened to a Poisson process that produces an exponential distribution of peptide lengths. Numerous papers have described the application of new digestion strategies for proteome analysis [15-18]; however, no single strategy has emerged as optimal. The greatest observed proteome coverage has plateaued around 25%. For example, 24.6% of the human proteome was recently observed [19], but this required over 1,000 MS/MS data files that allowed identification of over 260,000 peptide sequences using a new high performance data analysis package. Similarly, multiple protease digests of yeast resulted in 25.2% coverage [20]. Therefore, improved strategies for proteome digestion are needed to allow observation of a complete proteome. An innovative example demonstrating the application of multiple enzyme digestion (MED) was recently published by Wiśniewski and Mann [21], which demonstrated the utility of multienzyme digestion coupled to filter-aided sample preparation [22] (MED-FASP, Figure 1). This work extends a previous work that described size exclusion to isolate long tryptic peptides for additional digestion [18]. Wiśniewski and Mann compared gains afforded by iterative digestion using various proteases (i.e., GluC, ArgC, LysC, or AspN) followed by trypsin. Their work concluded that iterative digestion with LysC followed by trypsin allowed 31% more protein identifications and a 2-fold gain in observed phosphopeptides for a particular protein. Their work led me to optimize iterative digestion in silico with the hope of identifying a testable digestion strategy that can theoretically achieve complete proteome coverage.

Figure 1:

Cartoon describing the multiple-enzyme digestion, filter-aided sample preparation strategy (MED-FASP) from Wiśniewski and Mann. A proteome is digested on top of a size-based filter device and peptides are then spun through the filter. Undigested sequences are retained above the filter because of their length. The process is repeated with various cleavage agents and several peptide pools are collected separately. The peptides are then analyzed by nLC-MS/MS separately and the resulting data are then combined either before or after the database search.

Methods

The S. cerevisiae proteome file in FASTA format was downloaded from UniProt on June 20, 2012. Proteome digestion simulations were accomplished using scripts written in [R] [23]. Considered protease specificities include C-terminal of R/K (trypsin), L (LeuC theoretical cleavage agent), E (GluC), and K (LysC). Additionally, simulations utilized chemical digestion agents [24], including cyanogen bromide (CNBr) [25, 26] for cleavage C-terminal of M, 3-bromo-3-methyl-2-(2-nitrophenylthio)-3H-indole (BNPS-skatole) for cleavage C-terminal of W [27], and 2-nitro-5-thiocyanobenzoic acid (NTCB) for cleavage N-terminal of C [28, 29]. Peptide populations were filtered using both length and molecular weight constraints. Since the filtration thresholds affect the proteome coverage prediction, multiple cutoff values are compared. The [R] code is available at https://www.github.com/jgmeyerucsd/ProteomeDigestSim.
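The cleavage rules listed above can be illustrated with a minimal Python sketch. This is not the author's [R] code; the names `CLEAVAGE_RULES` and `digest` are hypothetical, and the sketch assumes complete cleavage at every target residue with no exceptions (for example, no proline rule for trypsin), which matches the specificities as stated in the text.

```python
# Cleavage rules as described in Methods: (target residues, side of cleavage).
# "C" = cleave C-terminal of (after) the residue, "N" = cleave N-terminal of (before) it.
# The dictionary and function names are illustrative assumptions.
CLEAVAGE_RULES = {
    "trypsin": ("KR", "C"),
    "LysC": ("K", "C"),
    "GluC": ("E", "C"),
    "LeuC": ("L", "C"),          # theoretical cleavage agent
    "CNBr": ("M", "C"),
    "BNPS-skatole": ("W", "C"),
    "NTCB": ("C", "N"),
}


def digest(sequence, agent):
    """Split a protein sequence at every site targeted by `agent`."""
    residues, side = CLEAVAGE_RULES[agent]
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in residues:
            # Cut after the residue for C-terminal agents, before it otherwise.
            cut = i + 1 if side == "C" else i
            if cut > start:
                peptides.append(sequence[start:cut])
                start = cut
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


toy_protein = "MAKRGCELK"  # made-up sequence for illustration
print(digest(toy_protein, "trypsin"))
print(digest(toy_protein, "NTCB"))
```

Note that the adjacent K and R in the toy sequence yield a single-residue peptide, the situation discussed later for peptides of length one.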

Results and Discussion

Minimum Unique Peptide Length.

The probability of a sequence being unique can be calculated assuming a random distribution of sequences in the library. The number of possible sequences of length n is $20^n$. Therefore, any given sequence of length five is expected to occur once in a library of 3,200,000 ($20^5$) random amino acid sequences (roughly the number of amino acids in the S. cerevisiae proteome). As the number of amino acids in the database grows, a peptide sequence must be longer to expect uniqueness. The human proteome contains 11,323,900 amino acids (not including isoforms, downloaded from UniProt on October 22, 2013), and, therefore, for a sequence to be unique, it must be of length six. Of course, due to common sequence motifs there are fewer unique peptide sequences in a proteome than would be found in a random library.
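The arithmetic behind these minimum lengths can be checked with a short sketch. The function name `min_unique_length` is hypothetical, and the calculation assumes a uniformly random library of the 20 standard amino acids, as in the paragraph above.

```python
def min_unique_length(n_residues, alphabet_size=20):
    """Smallest n for which alphabet_size**n is at least n_residues.

    Under a random-sequence model, a peptide of this length is expected
    to occur no more than about once in a library of n_residues residues.
    """
    n = 1
    while alphabet_size ** n < n_residues:
        n += 1
    return n


print(min_unique_length(3_200_000))   # yeast-sized library -> 5
print(min_unique_length(11_323_900))  # human proteome size from the text -> 6
```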

Peptide Length Distributions from Various Cleavages.

Initial in silico digestions using single cleavage agents were used to compare the resulting peptide lengths (Figure 2). Many peptide sequences are too short to uniquely match a protein. For all digestion agents, the most frequent peptide length produced is one. Generation of a single amino acid would arise when the target residue is next to itself in the protein. Notably, over 25% of theoretical peptides from trypsin digestion, which cleaves after 11.7% of all residues, are of length one. Not surprisingly, the observable proportion of the residue targeted for cleavage correlates with the resulting average peptide length (Figure 3); more common cleavage targets produce shorter average peptide lengths. Additionally, the residue-level coverage was found to depend on digestion. Proteome cleavage after more common residues results in depletion of the target residues (Figure 4), which is expected to result from production of peptides that are too short to uniquely match a protein sequence. However, cleavage after rare residues results in enriched coverage of the target residue. This result was also observed by amino acid analysis of proteome digestions in recent work [30].
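As a rough illustration of how peptide length frequencies and residue-level coverage could be tabulated, the sketch below digests a made-up toy sequence. The function names and the toy protein are assumptions for illustration only; the 7 to 35 residue window is taken from the size limits marked in Figure 2.

```python
from collections import Counter


def cleave_after(sequence, targets):
    """Cleave C-terminal of every residue in `targets` (complete digestion)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in targets:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def length_fractions(peptides):
    """Fraction of peptides observed at each length."""
    counts = Counter(len(p) for p in peptides)
    total = sum(counts.values())
    return {length: n / total for length, n in sorted(counts.items())}


def residue_coverage(proteins, targets, min_len=7, max_len=35):
    """Per-residue fraction that falls inside peptides within the length window."""
    total, covered = Counter(), Counter()
    for seq in proteins:
        total.update(seq)
        for pep in cleave_after(seq, targets):
            if min_len <= len(pep) <= max_len:
                covered.update(pep)
    return {aa: covered[aa] / total[aa] for aa in sorted(total)}


toy = ["AKK" + "G" * 7 + "K"]  # made-up sequence with adjacent target residues
print(length_fractions(cleave_after(toy[0], "K")))
print(residue_coverage(toy, "K"))
```

In this toy case two of the three lysines end up in peptides shorter than seven residues, so lysine is depleted relative to glycine, which mirrors the depletion of common target residues described above.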

Figure 2:

Theoretical peptide length distributions produced from various cleavage agents. (a) Size frequency distributions (density) of peptides from proteome digestion by five real cleavage agents (i.e., trypsin, LysC, GluC, CNBr, and NTCB) and one theoretical cleavage agent (LeuC). The vertical black lines at 7 and 35 indicate general peptide identification size limits. (b) The same distribution focused on the region from 1 to 10 amino acids. (c) The view focused on the region between 30 and 40 amino acids.

Figure 3:

Correlation between abundance of the residue targeted for cleavage and the resulting average peptide length. Proteome cleavage targeting abundant residues results in lower average peptide lengths; proteome cleavage targeting rare residues results in higher average peptide length. The line shows the data fit to an exponential equation.

Figure 4:

Residue-level coverage observed for various cleavage agents. Proteome cleavage of more common amino acids, such as with (a) trypsin or the theoretical cleavage after (b) leucine, results in residue-specific depletion of the target residues. However, cleavage of rare amino acids, such as (c) methionine or (d) cysteine, results in residue-specific enrichment of the target residues.

Comparison of Peptide Filtration Parameters.

The theoretical distribution of peptides passing through a MWCO ultrafilter certainly does not match the actual distribution. Denatured peptides and proteins are effectively larger than folded proteins, and, in fact, it was found that even 30 kDa or 50 kDa cutoff ultrafilters perform better for peptide yield than 10 kDa cutoff ultrafilters [31], despite the inability to identify such large peptide sequences by bottom-up proteomics. Therefore, multiple length constraints were compared for their influence on the predicted proteome coverage. Figure 5 shows how various minimum peptide length values affect residue-level depletion and theoretical proteome coverage. As the minimum length increases, total coverage decreases and depletion of R/K increases. Figure 6 shows how different upper length thresholds change theoretical coverage. Intuitively, raising the upper length limit of identifiable peptides increases total predicted proteome coverage. Interestingly, although total predicted coverage increases, the coverage of R/K stays around 60%. Since peptide MW also determines identifiable peptides and peptides above 5 kDa are unlikely to be identified with current MS/MS technology, an upper limit of 5 kDa was used for subsequent digest simulations. A lower length limit of 7 amino acids was used because this length is more likely to be relevant to actual proteomics experiments.
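The combined constraints chosen above (at least 7 residues, at most 5 kDa) could be applied with a filter along the following lines. This is a sketch under assumptions: the text does not say whether monoisotopic or average masses were used, so standard monoisotopic residue masses are assumed here, and the names `peptide_mass` and `is_identifiable` are hypothetical.

```python
# Standard monoisotopic residue masses in Da (an assumption; the source does
# not state which mass type its simulations used).
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide for the free N- and C-termini


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide in Da."""
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


def is_identifiable(peptide, min_len=7, max_mw=5000.0):
    """Apply the lower length limit and upper molecular weight limit."""
    return len(peptide) >= min_len and peptide_mass(peptide) <= max_mw


# Made-up peptides for illustration.
for pep in ["GGGGGG", "GGGGGGG", "W" * 30]:
    print(pep[:10], len(pep), round(peptide_mass(pep), 2), is_identifiable(pep))
```

A mass-based upper limit, rather than a length-based one, means the longest admissible peptide depends on composition: a tryptophan-rich peptide exceeds 5 kDa at a much shorter length than a glycine-rich one.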

Figure 5:

Effect of minimum peptide length on proteome coverage and residue-level depletion. Residue-level coverage predicted after trypsin digestion keeping all peptides with lengths between (a) 1 and 35, (b) 5 and 35, (c) 7 and 35, and (d) 10 and 35.

Figure 6:

Effect of upper length limit on predicted proteome coverage. Theoretical residue-level proteome coverage keeping peptides with lengths (a) 5–20, (b) 5–30, (c) 5–40, and (d) 5–100. As the maximum length of identifiable peptides increases, the total theoretical proteome coverage increases, but the depletion of K and R remains.

Comparison of Digestion Iterations.

Several combinations of cleavage agents were simulated to compute theoretical proteome coverage resulting from the iterative MED-FASP (iMED-FASP) strategy. Simulations confirm that iMED-FASP offers theoretically greater coverage of the proteome when the sequence of digestions starts with the agent targeting the rarest residue (Table 1). As expected, reversing the optimal digestion sequence yields only a small improvement in proteome coverage (78.9%) over the limit from trypsin digestion alone (74.0%).

Table 1:

Theoretical upper limits of coverage upon digestion with various cleavage agents using the iMED-FASP strategy. Iterative cleavage of the proteome starting with the rarest amino acids first results in the greatest theoretical proteome coverage of 92.9%. The reversed sequence of cleavage provides a minimal improvement to theoretical proteome coverage. Peptides were filtered after each digest keeping those with MW >5 kDa for additional digestion. The final “flowthrough” peptides were filtered keeping only sequences with at least 7 residues.

Digestion strategy	Theoretical coverage limit (%)
Trypsin	74.0
LysC	69.6
GluC	64.9
AspN	64.9
ArgC	53.7
CNBr	22.7
NTCB	13.8
TrpC	11.0
LysC, trypsin	82.9
GluC, trypsin	84.2
CNBr, LysC, trypsin	86.3
NTCB, CNBr, LysC, trypsin	88.2
TrpC, NTCB, CNBr, ArgC, GluC, trypsin	92.4
TrpC, NTCB, CNBr, ArgC, AspN, GluC, trypsin	92.9
Trypsin, GluC, AspN, ArgC, CNBr, NTCB, TrpC[a]	78.9

[a] Reversed order of cleavage starting with the most common residues instead of the rarest residues.
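The filtering logic described in the Table 1 caption can be sketched as follows. This is an illustrative reading of the caption, not the published [R] implementation: fragments above 5 kDa are retained for the next cleavage agent, everything else flows through and counts toward coverage only if it has at least 7 residues. The function names, the toy sequence, and the use of monoisotopic masses are assumptions, and the sketch ignores peptide uniqueness across proteins.

```python
# Standard monoisotopic residue masses in Da (assumed mass type).
MONO_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def mass(peptide):
    return sum(MONO_MASS[aa] for aa in peptide) + WATER


def cleave(sequence, targets, side="C"):
    """Cleave after (side 'C') or before (side 'N') every target residue."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in targets:
            cut = i + 1 if side == "C" else i
            if cut > start:
                peptides.append(sequence[start:cut])
                start = cut
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def iterative_coverage(proteins, steps, retain_mw=5000.0, min_len=7):
    """Fraction of residues in flow-through peptides of at least min_len residues."""
    covered = total = 0
    for seq in proteins:
        total += len(seq)
        retained = [seq]  # material sitting on top of the filter
        for step, (targets, side) in enumerate(steps):
            is_last = step == len(steps) - 1
            next_retained = []
            for fragment in retained:
                for pep in cleave(fragment, targets, side):
                    if mass(pep) > retain_mw:
                        # Too large to pass the filter; digest again if possible.
                        if not is_last:
                            next_retained.append(pep)
                    elif len(pep) >= min_len:
                        covered += len(pep)
            retained = next_retained
    return covered / total


# Made-up sequence: a lysine-rich stretch, one methionine, then a heavy region.
toy = ["GK" * 5 + "AAAAM" + "W" * 20 + "K" + "W" * 10]
rare_first = iterative_coverage(toy, [("M", "C"), ("K", "C")])
common_first = iterative_coverage(toy, [("K", "C"), ("M", "C")])
print(rare_first, common_first)
```

In this toy case, cleaving at the rare residue first keeps the lysine-rich stretch in one peptide long enough to identify, whereas cleaving at lysine first fragments it into dipeptides that are lost, which is the qualitative effect reported in Table 1.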

Proposed Iterative Digestion Strategy and Challenges Therein.

An ideal iterative cleavage strategy must limit sample processing steps and must take place under conditions that are compatible with the ultrafiltration device. Further, because tryptophan fluorescence can be used to quantify peptide yield from each digestion, chemical cleavage after tryptophan should initially be omitted since it destroys the fluorophore that can be used to monitor peptide yield. Therefore, a testable, ultrafilter-compatible strategy, with a balance between sample processing and predicted gains in coverage, is the sequence: NTCB, CNBr, LysC, and trypsin. Implementation of this method introduces several technical hurdles that must be addressed. First, the buffer conditions required for each separate digestion need to be planned. The requisite use of an ultrafiltration device fortunately allows easy buffer/denaturant exchange to accommodate the different conditions. However, researchers should carefully consider which conditions are best for each step and use controls to ensure efficient digestion at each step. Limitations of the ultrafilter must also be accounted for. For example, cleavage after methionine by CNBr is usually carried out at a formic acid concentration that would degrade the ultrafilter membrane. Instead, HCl could be substituted to enable use of CNBr with the iterative digestion MED-FASP strategy. Another key consideration is the choice of peptide fragmentation. Nontryptic peptides are less efficiently fragmented by commonly used peptide dissociation methods (e.g., collision-induced dissociation). Therefore, I recommend that any attempt to assess this theory should use electron-transfer dissociation (ETD) [32], which produces more complete fragment ion series that depend less on peptide sequence. Database searching also presents a challenge because the peptide pools will lack defined termini, which therefore requires that the database search be carried out with “no enzyme” specificity. A fast and effective choice for database searching with “no enzyme” specificity is MSGFDB [13], which can learn scoring parameters from a set of annotated peptide-spectrum matches in order to improve the sensitivity of peptide identification. Finally, it should be noted that missed cleavages, which occur in any real digestion, will result in deviations from these simulations. The feature to allow user-defined missed cleavage propensities has been implemented in the code, and an example of the effects is shown in supplemental Figure 1 in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/960902. The missed cleavages result in noisy length distributions. Missed cleavages help limit the proportion of short peptides, suggesting that optimization of partial digestions might further improve proteome coverage.
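One simple way to model missed cleavages is to skip each target site independently with a fixed probability. The sketch below illustrates that idea only; it is not the implementation in the published code, and the function name, the `miss_prob` parameter, and the toy sequence are assumptions.

```python
import random


def digest_with_missed(sequence, targets, miss_prob, rng):
    """Cleave after each target residue, skipping a site with probability miss_prob."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        # rng.random() is in [0, 1), so miss_prob=0 always cleaves
        # and miss_prob=1 never cleaves.
        if aa in targets and rng.random() >= miss_prob:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


rng = random.Random(0)  # fixed seed so the run is repeatable
toy = "AKKGGRAKDEKRLLK" * 4  # made-up sequence for illustration
complete = digest_with_missed(toy, "KR", 0.0, rng)
partial = digest_with_missed(toy, "KR", 0.3, rng)
print(sum(len(p) == 1 for p in complete), "single-residue peptides, complete digest")
print(sum(len(p) == 1 for p in partial), "single-residue peptides, 30% missed")
```

Because a missed site merges two neighboring peptides, partial digestion can only reduce the number of peptides, and it tends to shift the length distribution away from the very short peptides that cannot be uniquely matched.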

Conclusions

This work provides a publicly accessible computational framework for simulation of iterative proteome digestion that can be used with any input protein sequence database to optimize proteome coverage. Further, this work demonstrates how the choice of proteome digestion agent affects the predicted proteome coverage due to the distribution of peptide lengths that are produced. This work also shows how various digestion agents affect proteome coverage at the residue level. Proteome cleavage targeting common residues results in depletion of the cleaved residue, but proteome cleavage after rare residues results in enrichment of the target residue. Finally, this paper finds that the best theoretical proteome coverage is achieved by an iterative digestion strategy that limits production of short peptides by cleaving the rarest residues first.

30 in total

1. Peptide and protein sequence analysis by electron transfer dissociation mass spectrometry.

Authors: John E P Syka; Joshua J Coon; Melanie J Schroeder; Jeffrey Shabanowitz; Donald F Hunt
Journal: Proc Natl Acad Sci U S A Date: 2004-06-21 Impact factor: 11.205

2. Universal sample preparation method for proteome analysis.

Authors: Jacek R Wiśniewski; Alexandre Zougman; Nagarjuna Nagaraj; Matthias Mann
Journal: Nat Methods Date: 2009-04-19 Impact factor: 28.547

3. In-depth analysis of tandem mass spectrometry data from disparate instrument types.

Authors: Robert J Chalkley; Peter R Baker; Katalin F Medzihradszky; Aenoch J Lynn; A L Burlingame
Journal: Mol Cell Proteomics Date: 2008-07-24 Impact factor: 5.911

4. Elastase digests: new ammunition for shotgun membrane proteomics.

Authors: Benjamin Rietschel; Tabiwang N Arrey; Bjoern Meyer; Sandra Bornemann; Malte Schuerken; Michael Karas; Ansgar Poetsch
Journal: Mol Cell Proteomics Date: 2008-12-30 Impact factor: 5.911

Review 5. Multidimensional LC separations in shotgun proteomics.

Authors: Akira Motoyama; John R Yates
Journal: Anal Chem Date: 2008-10-01 Impact factor: 6.986

Review 6. Recent advances in DNA sequencing methods - general principles of sample preparation.

Authors: Sten Linnarsson
Journal: Exp Cell Res Date: 2010-03-06 Impact factor: 3.905

7. Value of using multiple proteases for large-scale mass spectrometry-based proteomics.

Authors: Danielle L Swaney; Craig D Wenger; Joshua J Coon
Journal: J Proteome Res Date: 2010-03-05 Impact factor: 4.466

8. Multiple enzymatic digestion for enhanced sequence coverage of proteins in complex proteomic mixtures using capillary LC with ion trap MS/MS.

Authors: Gargi Choudhary; Shiaw-Lin Wu; Paul Shieh; William S Hancock
Journal: J Proteome Res Date: 2003 Jan-Feb Impact factor: 4.466

9. Chemical cleavage-assisted tryptic digestion for membrane proteome analysis.

Authors: Mio Iwasaki; Takeshi Masuda; Masaru Tomita; Yasushi Ishihama
Journal: J Proteome Res Date: 2009-06 Impact factor: 4.466

10. A dual pressure linear ion trap Orbitrap instrument with very high sequencing speed.

Authors: Jesper V Olsen; Jae C Schwartz; Jens Griep-Raming; Michael L Nielsen; Eugen Damoc; Eduard Denisov; Oliver Lange; Philip Remes; Dennis Taylor; Maurizio Splendore; Eloy R Wouters; Michael Senko; Alexander Makarov; Matthias Mann; Stevan Horning
Journal: Mol Cell Proteomics Date: 2009-10-14 Impact factor: 5.911
