Jesse G Meyer
Department of Chemistry and Biochemistry, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0378, USA.
Abstract
In the postgenome era, biologists have sought to measure the complete complement of proteins, termed proteomics. Currently, the most effective method to measure the proteome is with shotgun, or bottom-up, proteomics, in which the proteome is digested into peptides that are identified followed by protein inference. Despite continuous improvements to all steps of the shotgun proteomics workflow, observed proteome coverage is often low; some proteins are identified by a single peptide sequence. Complete proteome sequence coverage would allow comprehensive characterization of RNA splicing variants and all posttranslational modifications, which would drastically improve the accuracy of biological models. There are many reasons for the sequence coverage deficit, but ultimately peptide length determines sequence observability. Peptides that are too short are lost because they match many protein sequences and their true origin is ambiguous. The maximum observable peptide length is determined by several analytical challenges. This paper explores computationally how peptide lengths produced from several common proteome digestion methods limit observable proteome coverage. Iterative proteome cleavage strategies are also explored. These simulations reveal that maximized proteome coverage can be achieved by use of an iterative digestion protocol involving multiple proteases and chemical cleavages that theoretically allow 92.9% proteome coverage.
Introduction
In the postgenome era, biologists have sought system-wide measurements of
RNA, proteins, and metabolites, termed transcriptomics, proteomics, and
metabolomics, respectively. Shotgun, or bottom-up, proteomics has become the most
comprehensive method for proteome identification and quantification [1]. However, observed protein sequence coverage is often
low. The ability to cover 100% of protein sequences in a biological system was
likened to surrealism in a recent review by Meyer et al. [2]. Multiple steps in the traditional shotgun proteomics
workflow contribute to the deficit in observed sequence coverage, including proteome
isolation, proteome digestion, peptide separation, peptide MS/MS, and identification
by peptide-spectrum matching. Proteome isolation has been extensively evaluated
[3, 4]. Several types of peptide separation have been explored [5-7]. Mass spectrometers are becoming more sensitive and versatile [8-10]. Peptide-spectrum matching algorithms are adapting to new data types
[11] and becoming more sensitive [12, 13].
Proteome fragmentation into sequenceable peptides is one step with significant room
for improvement. DNA sequencing relies on sequence fragmentation into readable
pieces by mechanical force [14], which
produces a nearly uniform distribution of fragment lengths. In comparison, proteome
fragmentation is generally accomplished by targeting one or more amino acid residues
for cleavage, and, therefore, the protein cleavage can be likened to a Poisson
process that produces an exponential distribution of peptide lengths.
Numerous papers have described the application of new digestion strategies
for proteome analysis [15-18]; however, no single strategy has emerged as
optimal. The greatest observed proteome coverage has plateaued around 25%. Recently, 24.6% of
the human proteome was observed [19], but this required over 1,000 MS/MS data files that allowed
identification of over 260,000 peptide sequences using a new high performance data
analysis package. Similarly, multiple protease digests of yeast resulted in 25.2%
coverage [20]. Therefore, improved strategies
for proteome digestion are needed to allow observation of a complete proteome.
An innovative example demonstrating the application of multiple enzyme
digestion (MED) was recently published by Wiśniewski and Mann [21], which demonstrated the utility of
multienzyme digestion coupled to filter-aided sample preparation [22] (MED-FASP, Figure
1). This work extends a previous work that described size exclusion to
isolate long tryptic peptides for additional digestion [18]. Wiśniewski and Mann compared gains afforded
by iterative digestion using various proteases (i.e., GluC, ArgC, LysC, or AspN)
followed by trypsin. Their work concluded that iterative digestion with LysC
followed by trypsin allowed 31% more protein identifications and a 2-fold gain in
observed phosphopeptides for a particular protein. Their work led me to optimize
iterative digestion in silico with the hope of identifying a
testable digestion strategy that can theoretically achieve complete proteome
coverage.
Figure 1:
Cartoon describing the multiple-enzyme digestion, filter-aided sample
preparation strategy (MED-FASP) from Wiśniewski and Mann. A proteome is digested
on top of a size-based filter device and peptides are then spun through the
filter. Undigested sequences are retained above the filter because of their
length. The process is repeated with various cleavage agents and several peptide
pools are collected separately. The peptides are then analyzed by nLC-MS/MS
separately and the resulting data is then combined either before or after the
database search.
Methods
The S. cerevisiae proteome file in FASTA format was
downloaded from UniProt on June 20, 2012. Proteome digestion simulations were
accomplished using scripts written in [R] [23]. Considered protease specificities include C-terminal of R/K (trypsin),
L (LeuC theoretical cleavage agent), E (GluC), and K (LysC). Additionally,
simulations utilized chemical digestion agents [24], including cyanogen bromide (CNBr) [25, 26] for cleavage C-terminal
of M, 3-bromo-3-methyl-2-(2-nitrophenylthio)-3H-indole (BNPS-skatole) for cleavage
C-terminal of W [27], and
2-nitro-5-thiocyanobenzoic acid (NTCB) for cleavage N-terminal of C [28, 29]. Peptide
populations were filtered using both length and molecular weight constraints. Since
the filtration thresholds affect the proteome coverage prediction, multiple cutoff
values are compared. The [R] code is available at https://www.github.com/jgmeyerucsd/ProteomeDigestSim.
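The digestion and filtering steps described above can be sketched as follows. This is a minimal Python illustration, not the author's R code: the names `AGENTS`, `digest`, `peptide_mass`, and `filter_peptides` are hypothetical, the masses are standard average residue masses, cleavage is assumed to be complete, and the default thresholds (7 residues, 5 kDa) are the ones chosen later in the text.

```python
# Average residue masses in Da (standard values); a free peptide adds one water.
AVG_RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.015

# (target residues, side of the target on which the bond is cut),
# following the specificities listed in the Methods.
AGENTS = {
    "trypsin": ("RK", "C"),
    "LysC": ("K", "C"),
    "GluC": ("E", "C"),
    "LeuC": ("L", "C"),          # theoretical agent
    "CNBr": ("M", "C"),
    "BNPS-skatole": ("W", "C"),
    "NTCB": ("C", "N"),
}


def digest(sequence, agent):
    """Split a protein at every target residue (complete cleavage assumed)."""
    targets, side = AGENTS[agent]
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in targets:
            cut = i + 1 if side == "C" else i
            # Skip cuts at the very ends, which would give empty peptides.
            if start < cut < len(sequence):
                peptides.append(sequence[start:cut])
                start = cut
    peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Average mass in Da; raises KeyError for non-standard residue letters."""
    return sum(AVG_RESIDUE_MASS[r] for r in peptide) + WATER


def filter_peptides(peptides, min_len=7, max_mass=5000.0):
    """Keep peptides long enough to be unique and light enough to identify."""
    return [p for p in peptides
            if len(p) >= min_len and peptide_mass(p) <= max_mass]


# Arbitrary example sequence, used only to exercise the functions.
peptides = digest("MKWVTFISLLLLFSSAYSRGVFRRDTHK", "trypsin")
print(filter_peptides(peptides))
```

Changing `min_len` and `max_mass` here is the simplest way to see how the filtration thresholds shift the set of peptides counted as observable.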
Results and Discussion
Minimum Unique Peptide Length.
The probability of a sequence being unique can be calculated assuming a
random distribution of sequences in the library. The number of possible sequences of
length n is $20^{n}$.
Therefore, because $20^{5}$ = 3,200,000, any given sequence of length five is expected to occur once in a
library of 3,200,000 random amino acids (roughly the number of amino
acids in the S. cerevisiae proteome). As the number of amino
acids in the database grows, a peptide sequence must be longer to expect
uniqueness. The human proteome contains 11,323,900 amino acids (not including
isoforms, downloaded from UniProt on October 22, 2013), which exceeds $20^{5}$ but is less than $20^{6}$ = 64,000,000; therefore, for a
sequence to be expected to be unique, it must be at least of length six. Of course, due to common
sequence motifs there are fewer unique peptide sequences in a proteome than would
be found in a random library.
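The arithmetic behind these length estimates can be checked with a few lines of Python. This is a sketch under the same random-library assumption as the text; the function names `min_unique_length` and `expected_occurrences` are hypothetical, and the number of positions in the database is approximated by its number of residues.

```python
def min_unique_length(n_residues, alphabet_size=20):
    """Smallest n such that alphabet_size**n >= n_residues."""
    n = 1
    while alphabet_size ** n < n_residues:
        n += 1
    return n


def expected_occurrences(n_residues, length, alphabet_size=20):
    """Expected number of times a given sequence appears in a random library."""
    return n_residues / alphabet_size ** length


print(min_unique_length(3_200_000))    # yeast-sized library: 5
print(min_unique_length(11_323_900))   # human proteome size from the text: 6
```

Real proteomes contain repeated motifs, so these values are lower bounds on the length needed for uniqueness rather than guarantees.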
Peptide Length Distributions from Various Cleavages.
Initial in silico digestions using single cleavage
agents were used to compare the resulting peptide lengths (Figure 2). Many peptide sequences are too short to
uniquely match a protein. For all digestion agents, the most frequent peptide
length produced is one. Generation of a single amino acid would arise when the
target residue is next to itself in the protein. Notably, over 25% of
theoretical peptides from trypsin digestion, which cleaves after 11.7% of all
residues, are of length one. Not surprisingly, the observable proportion of the
residue targeted for cleavage correlates with the resulting average peptide
length (Figure 3); more common cleavage
targets produce shorter average peptide lengths. Additionally, the residue-level
coverage was found to depend on digestion. Proteome cleavage after more common
residues results in depletion of the target residues (Figure 4), which is expected to result from production
of peptides that are too short to uniquely match a protein sequence. However,
cleavage after rare residues results in enriched coverage of the target residue.
This result was also observed by amino acid analysis of proteome digestions in
recent work [30].
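The quantities compared in this section (peptide length distribution, average peptide length, and abundance of the targeted residue) can be tabulated from any set of protein sequences. The sketch below is a hypothetical Python illustration, not the published R scripts; it handles only agents that cut C-terminal of their target and assumes complete cleavage. The two toy sequences are invented.

```python
from collections import Counter


def cleave_after(sequence, targets):
    """Cut C-terminal of every residue in `targets` (complete cleavage)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in targets and i + 1 < len(sequence):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


def length_summary(proteins, targets):
    """Summarize peptide lengths produced by one cleavage specificity."""
    lengths = Counter()
    for protein in proteins:
        lengths.update(len(p) for p in cleave_after(protein, targets))
    n_peptides = sum(lengths.values())
    n_residues = sum(len(p) for p in proteins)
    n_targets = sum(protein.count(t) for protein in proteins for t in targets)
    return {
        "target_fraction": n_targets / n_residues,   # abundance of cut residue
        "mean_length": n_residues / n_peptides,      # average peptide length
        "fraction_length_one": lengths[1] / n_peptides,
        "lengths": lengths,                          # histogram: length -> count
    }


toy_proteome = ["MKKAGR", "AAKLLLR"]  # invented sequences for illustration
for name, targets in [("trypsin", "KR"), ("CNBr", "M"), ("LeuC", "L")]:
    summary = length_summary(toy_proteome, targets)
    print(name, round(summary["target_fraction"], 3),
          round(summary["mean_length"], 2))
```

Run on a full proteome, the `target_fraction` and `mean_length` columns give the kind of relationship plotted in Figure 3.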
Figure 2:
Theoretical peptide length distributions produced from various cleavage
agents. (a) Size frequency distributions (density) of peptides from proteome
digestion by five real cleavage agents (i.e., trypsin, LysC, GluC, CNBr, and
NTCB) and one theoretical cleavage agent (LeuC). The vertical black lines at 7
and 35 indicate general peptide identification size limits. (b) The same
distribution focused on the region from 1 to 10 amino acids. (c) The view
focused on the region between 30 and 40 amino acids.
Figure 3:
Correlation between abundance of the residue targeted for cleavage and
the resulting average peptide length. Proteome cleavage targeting abundant
residues results in lower average peptide lengths; proteome cleavage targeting
rare residues results in higher average peptide length. The line shows the data
fit to an exponential equation.
Figure 4:
Residue-level coverage observed for various cleavage agents. Proteome
cleavage of more common amino acids, such as with (a) trypsin or the theoretical
cleavage after (b) leucine, results in residue-specific depletion of the target
residues. However, cleavage of rare amino acids, such as (c) methionine or (d)
cysteine, results in residue-specific enrichment of the target residues.
Comparison of Peptide Filtration Parameters.
The theoretical distribution of peptides passing through a molecular weight cutoff (MWCO)
ultrafilter certainly does not match the actual distribution. Denatured peptides
and proteins are effectively larger than folded proteins, and, in fact, it was
found that even 30 kDa or 50 kDa cutoff ultrafilters perform better for peptide
yield than 10 kDa cutoff ultrafilters [31], despite the inability to identify such large peptide sequences by
bottom-up proteomics. Therefore, multiple length constraints were compared for
their influence on the predicted proteome coverage.
Figure 5 shows how various minimum peptide length values affect
residue-level depletion and theoretical proteome coverage. As the minimum length
increases, total coverage decreases and depletion of R/K increases. Figure 6 shows how different upper length
thresholds change theoretical coverage. Intuitively, raising the upper length
limit of identifiable peptides increases total predicted proteome coverage.
Interestingly, although total predicted coverage increases, the coverage of R/K
stays around 60%. Since peptide MW also determines identifiable peptides and
peptides above 5 kDa are unlikely to be identified with current MS/MS technology,
an upper limit of 5 kDa was used for subsequent digest simulations. A lower
length limit of 7 amino acids was used because this length is more likely to be
relevant to actual proteomics experiments.
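The effect of the length window on total and residue-level coverage can be illustrated with a short Python sketch. It is a simplified, hypothetical stand-in for the simulations: `coverage_by_window` is not a function from the published code, only C-terminal cleavage is modeled, and a peptide counts as observable purely on the basis of its length. The toy sequence is invented.

```python
from collections import Counter


def cleave_after(sequence, targets):
    """Cut C-terminal of every residue in `targets` (complete cleavage)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in targets and i + 1 < len(sequence):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


def coverage_by_window(proteins, targets, min_len, max_len):
    """Return (total coverage, per-residue coverage) for a length window."""
    seen, kept = Counter(), Counter()
    for protein in proteins:
        for peptide in cleave_after(protein, targets):
            seen.update(peptide)                    # every residue in the proteome
            if min_len <= len(peptide) <= max_len:
                kept.update(peptide)                # residues in observable peptides
    total = sum(kept.values()) / sum(seen.values())
    per_residue = {aa: kept[aa] / seen[aa] for aa in seen}
    return total, per_residue


toy_proteome = ["AKKGGGGGGRLLK"]  # invented sequence for illustration
for min_len in (1, 5, 7, 10):
    total, per_residue = coverage_by_window(toy_proteome, "KR", min_len, 35)
    print(min_len, round(total, 3), per_residue.get("K"))
```

Even in this toy case, raising the minimum length removes the short K-containing peptides first, which is the residue-level depletion shown in Figure 5.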
Figure 5:
Effect of minimum peptide length on proteome coverage and residue-level
depletion. Residue-level coverage predicted after trypsin digestion keeping all
peptides with lengths between (a) 1 and 35, (b) 5 and 35, (c) 7 and 35, and (d)
10 and 35.
Figure 6:
Effect of upper length limit on predicted proteome coverage.
Theoretical residue-level proteome coverage keeping peptides with lengths (a)
5–20, (b) 5–30, (c) 5–40, and (d) 5–100. As the
maximum length of identifiable peptides increases, the total theoretical
proteome coverage increases, but the depletion of K and R remains.
Comparison of Digestion Iterations.
Several combinations of cleavage agents were simulated to compute
theoretical proteome coverage resulting from the iterative MED-FASP (iMED-FASP)
strategy. Simulations confirm that iMED-FASP offers theoretically greater
coverage of the proteome when the sequence of digestions starts with the
cleavage agent targeting the rarest residue (Table 1). As expected, reversal of the optimal digestion sequence
results in a negligible improvement to proteome coverage as compared to the
limit from using trypsin digestion alone.
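The iterative scheme can be sketched in Python as follows. This is an assumption-laden illustration rather than the published R implementation: the function names are hypothetical, cleavage is complete, peptides heavier than the cutoff are retained for the next agent, lighter peptides flow through, and only flow-through peptides of at least 7 residues count as covered. The toy protein and the 1.5 kDa cutoff in the usage lines are chosen only so that the example stays small.

```python
# Average residue masses in Da (standard values); a free peptide adds one water.
AVG_RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.015


def peptide_mass(peptide):
    return sum(AVG_RESIDUE_MASS[r] for r in peptide) + WATER


def cleave(sequence, targets, side):
    """Cut after ('C') or before ('N') every target residue."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        if residue in targets:
            cut = i + 1 if side == "C" else i
            if start < cut < len(sequence):
                peptides.append(sequence[start:cut])
                start = cut
    peptides.append(sequence[start:])
    return peptides


def iterative_coverage(proteins, agents, max_mass=5000.0, min_len=7):
    """Fraction of residues in flow-through peptides of observable length."""
    retained = list(proteins)   # material sitting above the filter
    covered = 0
    for targets, side in agents:
        still_retained = []
        for sequence in retained:
            for peptide in cleave(sequence, targets, side):
                if peptide_mass(peptide) > max_mass:
                    still_retained.append(peptide)   # too big: digest again
                elif len(peptide) >= min_len:
                    covered += len(peptide)          # flows through, observable
                # shorter flow-through peptides are lost
        retained = still_retained
    return covered / sum(len(p) for p in proteins)


toy = ["AAAKAAAKAAAM" * 3]            # invented protein: rare M, common K
cnbr, trypsin = ("M", "C"), ("KR", "C")
print(iterative_coverage(toy, [cnbr, trypsin], max_mass=1500.0))
print(iterative_coverage(toy, [trypsin, cnbr], max_mass=1500.0))
```

In this toy case cutting at the rare residue first releases 12-residue peptides that are all counted, whereas cutting at the common residue first produces many 4-residue peptides that are lost, which mirrors the ordering effect reported in Table 1.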
Table 1:
Theoretical upper limits of coverage upon digestion with various
cleavage agents using the iMED-FASP strategy. Iterative cleavage of the proteome
starting with the rarest amino acids first results in the greatest theoretical
proteome coverage of 92.9%. The reversed sequence of cleavage provides a minimal
improvement to theoretical proteome coverage. Peptides were filtered after each
digest keeping those with MW >5 kDa for additional digestion. The final
“flowthrough” peptides were filtered keeping only sequences with
at least 7 residues.
| Digestion strategy | Theoretical coverage limit (%) |
|---|---|
| Trypsin | 74.0 |
| LysC | 69.6 |
| GluC | 64.9 |
| AspN | 64.9 |
| ArgC | 53.7 |
| CNBr | 22.7 |
| NTCB | 13.8 |
| TrpC | 11.0 |
| LysC, trypsin | 82.9 |
| GluC, trypsin | 84.2 |
| CNBr, LysC, trypsin | 86.3 |
| NTCB, CNBr, LysC, trypsin | 88.2 |
| TrpC, NTCB, CNBr, ArgC, GluC, trypsin | 92.4 |
| TrpC, NTCB, CNBr, ArgC, AspN, GluC, trypsin | 92.9 |
| Trypsin, GluC, AspN, ArgC, CNBr, NTCB, TrpC [a] | 78.9 |

[a] Reversed order of cleavage starting with the most common residues instead of the rarest residues.
Proposed Iterative Digestion Strategy and Challenges Therein.
An ideal iterative cleavage strategy must limit sample processing steps
and must take place under conditions that are compatible with the
ultrafiltration device. Further, because tryptophan fluorescence can be used to
quantify peptide yield from each digestion, chemical cleavage after tryptophan
should initially be omitted since it destroys the fluorophore.
Therefore, a testable, ultrafilter-compatible
strategy, with a balance between sample processing and predicted gains in
coverage, is the sequence: NTCB, CNBr, LysC, and trypsin.
Implementation of this method introduces several technical hurdles that
must be addressed. First, the buffer conditions required for each separate
digestion need to be planned. The requisite use of an ultrafiltration device
fortunately allows easy buffer/denaturant exchange to accommodate the different
conditions. However, researchers should carefully consider which conditions are
best for each step and use controls to ensure efficient digestion at each
step. Limitations of the ultrafilter must also be accounted for. For example,
cleavage after methionine by CNBr is usually carried out at a formic acid
concentration that would degrade the ultrafilter membrane. Instead, HCl could be
substituted to enable use of CNBr with the iterative digestion MED-FASP
strategy. Another key consideration is the choice of peptide fragmentation.
Nontryptic peptides are less efficiently fragmented by commonly used peptide
dissociation methods (e.g., collision-induced dissociation). Therefore, I
recommend that any attempt to assess this theory should use electron-transfer
dissociation (ETD) [32], which produces
more complete fragment ion series that depend less on peptide sequence. Database
searching also presents a challenge because the peptide pools will lack defined
termini, which therefore requires that the database search be carried out with
“no enzyme” specificity. A fast and effective choice for database
searching with “no enzyme” specificity is MSGFDB [13], which can learn scoring parameters from a set of
annotated peptide-spectrum matches in order to improve the sensitivity of peptide
identification. Finally, it should be noted that missed cleavages, which
occur in any real digestion, will result in deviations from these simulations. The feature to allow
user-defined missed cleavage propensities has been implemented in the code, and
an example of the effects is shown in supplemental Figure 1 in the Supplementary Material available online at http://dx.doi.org/10.1155/2014/960902. The
missed cleavages result in noisy length distributions. Missed cleavages help
limit the proportion of short peptides, suggesting that optimization of partial
digestions might further improve proteome coverage.
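One simple way to model missed cleavages is to skip each cleavage site independently with a fixed probability. The sketch below is a hypothetical Python illustration of that idea; it is not the missed-cleavage feature of the published R code, whose parameterization is not described here, and the single uniform `missed_prob` is an assumption.

```python
import random


def digest_with_missed(sequence, targets, missed_prob, rng):
    """Cut C-terminal of target residues, skipping each site with
    probability `missed_prob` (assumed independent and uniform)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        is_site = residue in targets and i + 1 < len(sequence)
        if is_site and rng.random() >= missed_prob:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


rng = random.Random(2014)            # fixed seed so runs are repeatable
toy = "AKKGGGGGGRLLKAAAKR"           # invented sequence for illustration
for missed_prob in (0.0, 0.2, 0.5):
    peptides = digest_with_missed(toy, "KR", missed_prob, rng)
    short = sum(1 for p in peptides if len(p) < 7)
    print(missed_prob, len(peptides), short)
```

With `missed_prob` set to 0 the function reduces to complete digestion, and with 1 the protein is left intact, so intermediate values can be scanned to look for a partial digestion that reduces the share of short peptides.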
Conclusions
This work provides a publicly accessible computational framework for
simulation of iterative proteome digestion that can be used with any input protein
sequence database to optimize proteome coverage. Further, this work demonstrates how
the choice of proteome digestion agent affects the predicted proteome coverage due
to the distribution of peptide lengths that are produced. This work also shows how
various digestion agents affect proteome coverage at the residue level. Proteome
cleavage targeting common residues results in depletion of the cleaved residue, but
proteome cleavage after rare residues results in enrichment of the target residue.
Finally, this paper finds that the best theoretical proteome coverage is achieved by
an iterative digestion strategy that limits production of short peptides by cleaving
the rarest residues first.