Jian Wang1, Veronica G Anania, Jeff Knott, John Rush, Jennie R Lill, Philip E Bourne, Nuno Bandeira. 1. Bioinformatics Program, ∥Skaggs School of Pharmacy and Pharmaceutical Sciences, ⊥Center for Computational Mass Spectrometry, and ¶Department of Computer Science and Engineering, University of California, San Diego , La Jolla, California 92093, United States.
Abstract
The conjugation of complex post-translational modifications (PTMs) such as glycosylation and Small Ubiquitin-like Modification (SUMOylation) to a substrate protein can substantially change the resulting peptide fragmentation pattern compared to its unmodified counterpart, making current database search methods inappropriate for the identification of tandem mass (MS/MS) spectra from such modified peptides. Traditionally it has been difficult to develop new algorithms to identify these atypical peptides because of the lack of a large set of annotated spectra from which to learn the altered fragmentation pattern. Using SUMOylation as an example, we propose a novel approach to generate large MS/MS training data from modified peptides and derive an algorithm that learns properties of PTM-specific fragmentation from such training data. Benchmark tests on data sets of varying complexity show that our method is 80-300% more sensitive than current state-of-the-art approaches. The core concepts of our method are readily applicable to developing algorithms for the identifications of peptides with other complex PTMs.
The conjugation of complex post-translational modifications (PTMs) such as glycosylation and Small Ubiquitin-like Modification (SUMOylation) to a substrate protein can substantially change the resulting peptide fragmentation pattern compared to its unmodified counterpart, making current database search methods inappropriate for the identification of tandem mass (MS/MS) spectra from such modified peptides. Traditionally it has been difficult to develop new algorithms to identify these atypical peptides because of the lack of a large set of annotated spectra from which to learn the altered fragmentation pattern. Using SUMOylation as an example, we propose a novel approach to generate large MS/MS training data from modified peptides and derive an algorithm that learns properties of PTM-specific fragmentation from such training data. Benchmark tests on data sets of varying complexity show that our method is 80-300% more sensitive than current state-of-the-art approaches. The core concepts of our method are readily applicable to developing algorithms for the identifications of peptides with other complex PTMs.
In recent years the
focus in proteomics has shifted from cataloging
the “parts list” of gene products inside the cell toward
understanding the structural and functional properties of proteins
in a systematic and high-throughput manner.[1] A key step toward this goal is the comprehensive characterization
of protein post-translational modifications (PTMs). These “decorations”
on the protein surface have been shown to play crucial roles in determining
a protein’s activity state, localization, turnover rate, and
interactions with other proteins.[2,3] Recent advances
in mass spectrometry (MS) and enrichment protocols that selectively
capture peptides with specific PTMs have enabled the detection of
many PTMs on a large scale, thus providing scientists with a global
view on various PTMs and their interplay at a systems level.[4−8] However, such success has mostly been limited to PTMs that result
from the addition of a relatively simple chemical group to one or
more amino acid residues in the proteins. Common examples include
acetylation, deamidation, phosphorylation, and oxidation. These modifications
can be readily identified with tandem mass spectrometry (MS/MS) by
considering characteristic shifts in peptide precursor mass as well
as in modified fragment ion masses. However, more complex PTMs, such
as glycosylation,[9] Small Ubiquitin-like
Modification (SUMOylation),[10] PUPylation,[11] and ADPribosylation,[12] present a more difficult problem because the PTMs themselves are
large and complex molecules rather than simple chemical moieties,
creating unusual “branched” structures for the modified
peptides. As a result, these modified peptides display rather different
fragmentation pattern than their unmodified counterparts, and thus
new experimental and computational methods are needed for the analysis
of peptides with complex PTMs.We propose an automated approach, Specialize (Spectra
of complex-PTModified peptides identification tool), to derive new
algorithms for any type of modified peptide fragmentation using synthetic
peptide libraries. While previous studies have mostly used synthetic
peptide libraries to evaluate peptide identification algorithms,[13−15] Specialize learns the PTM-specific fragmentation patterns from the
peptide libraries and is able to generalize to peptides that are not
in the libraries across different biological samples. In addition,
the cost of generating the peptide libraries is kept very low by using
combinatorial peptide synthesis. A training data with thousands of
unique peptides was generated using only three peptide synthesis experiments
as opposed to synthesizing each peptide separately. To illustrate
the concept, we focus on one specific example of complex PTMs (SUMOylation)
and use it to demonstrate the feasibility and practicality of our
approach. Small Ubiquitin-like Modifiers (SUMO) are small proteins
of around 100 amino acids that reversibly attach to substrate proteins
to modify their functions. SUMOylation has been shown to be involved
in many cellular pathways such as cellular trafficking, cell cycle,
and DNA repair and replication.[16] It is
also implicated in several neurodegenerative diseases such as Alzheimer’s
disease and Huntington disease.[17,18] Similar to ubiquitination,
SUMOylation is regulated by a series of enzymatic reactions involving
SUMO-activating enzymes, conjugating enzymes, and SUMO E3 ligases
that covalently attach SUMO to substrate proteins via an iso-peptide
bond between the C-terminus of SUMO and a specific lysine residue
on the substrate protein. Previously it was thought that SUMOylation
occurred within a strict consensus motif [XK(D/E)],[19] but more recently it has been shown that several motifs
including an “inverted” consensus motif, a hydrophobic
patch motif, and a phosphorylation dictated motif exist as common
localizers of SUMOylation.[20] It has also
been observed in many cases that SUMOylation can occur on lysine residues
not located within any predetermined motif, hence the increased need
for unbiased methods to detect SUMOylated lysine residues. Upon enzymatic
digestion of SUMOylated proteins, peptides that contain the SUMO conjugation
site will be covalently linked to a C-terminal remnant or tag of SUMO,
resulting in ‘y-shaped’ linked peptides (see Figure 1a). Unbiased detection of this type of linked peptide
is the most informative as it provides direct evidence that a protein
is a SUMO substrate as well as revealing the specific amino acid site
of SUMOylation. However these branch-linked peptides present several
challenges when being analyzed by MS/MS methods to detect SUMOylated
lysine residues.
Figure 1
Conceptual model of SUMOylated peptides. (a) Small Ubiquitin-like
Modifiers (SUMO) are small proteins that reversibly attach to substrate
proteins to regulate their functions. Upon enzymatic digestion of
SUMO-conjugated proteins, peptides that contain the SUMO conjugation
site in the substrate protein have a SUMO C-terminus remnant (or SUMO
tag) covalently attached to the lysine residue, resulting in ‘y-shaped’
peptides. Here we use QQQTGG as an example of SUMO-tag, which is the
last six amino acid residues at the C-terminus of Human SUMO2 protein.
(b) SUMOylated peptides are modeled as two peptides: a substrate peptide
carrying a modification of mass +599 Da (the mass of the SUMO tag)
at the lysine residue and a peptide with sequence QQQTGG carrying
a modification at the C-terminus with mass equal to that of substrate
peptide (which is assumed to be 1650 Da for illustration purposes).
Theoretical fragments from a SUMOylated peptide are represented as
two sets of fragment ions, one set from the substrate peptide and
another set from the SUMO tag peptide. This way, different scoring
models can be used for the substrate peptide and the SUMO tag to account
for their distinct fragmentation patterns. (c) Fragment ions from
SUMOylated peptides are divided into two categories: linked-fragments
and unlinked fragments. Linked fragments are from peptide fragment
ions that are covalently linked to a second peptide. Linked and unlinked
fragments have different fragmentation statistics, and thus different
scoring models are used to score each type of fragment.
Conceptual model of SUMOylated peptides. (a) Small Ubiquitin-like
Modifiers (SUMO) are small proteins that reversibly attach to substrate
proteins to regulate their functions. Upon enzymatic digestion of
SUMO-conjugated proteins, peptides that contain the SUMO conjugation
site in the substrate protein have a SUMO C-terminus remnant (or SUMO
tag) covalently attached to the lysine residue, resulting in ‘y-shaped’
peptides. Here we use QQQTGG as an example of SUMO-tag, which is the
last six amino acid residues at the C-terminus of HumanSUMO2 protein.
(b) SUMOylated peptides are modeled as two peptides: a substrate peptide
carrying a modification of mass +599 Da (the mass of the SUMO tag)
at the lysine residue and a peptide with sequence QQQTGG carrying
a modification at the C-terminus with mass equal to that of substrate
peptide (which is assumed to be 1650 Da for illustration purposes).
Theoretical fragments from a SUMOylated peptide are represented as
two sets of fragment ions, one set from the substrate peptide and
another set from the SUMO tag peptide. This way, different scoring
models can be used for the substrate peptide and the SUMO tag to account
for their distinct fragmentation patterns. (c) Fragment ions from
SUMOylated peptides are divided into two categories: linked-fragments
and unlinked fragments. Linked fragments are from peptide fragment
ions that are covalently linked to a second peptide. Linked and unlinked
fragments have different fragmentation statistics, and thus different
scoring models are used to score each type of fragment.First, the attachment of a SUMO tag to the substrate
peptide inhibits
tryptic digestion due to steric hindrance and therefore inaccessibility
of the enzyme to the SUMO-conjugated lysine residue, generating peptides
with internal lysine that are unusual in unconjugated tryptic peptides.
In addition, the SUMO tag resulting from tryptic digestion is relatively
large, ranging from 20 to 30 residues depending on the isoforms of
SUMO. As a result the SUMO tag tends to dominate the MS/MS spectrum
and make it difficult to identify the substrate peptide using current
MS/MS methods. In order to address these issues, previous studies
have generated SUMO mutants by inserting a lysine or arginine at specific
positions along the SUMO C-termini tail so that shorter SUMO C-termini
tags (4–6 residues) can be generated upon trypsin digestion.[21−25] On the other hand, alternative enzymes such as chymotrypsin, GluC,
and LysC can also be used to generate shorter SUMO tags (4–12
residues) attached to the substrate peptides,[26] making them more suitable for MS/MS analysis.Even as we circumvent
these hurdles and manage to generate SUMOylated
peptides with favorable properties to be analyzed by tandem mass spectrometry,
it remains a challenge to interpret the resulting MS/MS spectra because
almost all mainstream database search algorithms are designed for
MS/MS spectra from linear, unlinked peptides. In contrast, an MS/MS spectrum from a SUMOylated peptide
contains a mixture of fragment ions from both the substrate peptide
and the SUMO tag. In addition, the linkage of two peptides together
results in fragmentation patterns that are different from those of
common linear peptides. While there have been several attempts to
address these issues, none of them captured the specific fragmentation
pattern of SUMOylated peptides due to the lack of appropriate training
data.[27,28] Here, we propose a novel experimental and
computational hybrid procedure to reliably generate large MS/MS reference
data for SUMOylated peptides, which are then used to derive a database
search algorithm capturing the PTM-specific fragmentation patterns
of SUMOylated peptides.
Results
Fragmentation Pattern of
SUMOylated Peptides
In order
to obtain a large MS/MS data set with identified SUMOylated peptides,
we designed and synthesized three combinatorial peptide libraries,
each with a SUMO C-terminus tag (QQQTGG) attached via a lysine
residue at different position along the peptide (see Figure 2). The peptide libraries are designed with the goal
of promoting sequence diversity while also representing a realistic
model of endogenous SUMOylated peptides. For example, in library III
the known consensus motif for SUMOylation is incorporated into the
sequence pattern. The synthetic SUMOylated peptide libraries were
analyzed using an LTQ-Orbitrap mass spectrometer, and MS/MS spectra
were identified using our proposed two-step search strategy that takes
advantage of the special design of the library peptides (see Figure 2). A total of 10216 MS/MS spectra from SUMOylated
peptides were identified, corresponding to 3492 unique peptides. To
our knowledge this is the largest mass spectral data set for SUMOylated
peptides known to date. From this training data, we studied the PTM-specific
fragmentation pattern of SUMOylated peptides. First the prominence
of the SUMO fragment ions presented in the MS/MS spectra was assessed
by the fraction of total ion intensity corresponding to SUMO tag fragments.
As shown in Figure 3, SUMO fragment ions can
contribute a large fraction of the total intensity in MS/MS spectra,
ranging from 10% to 60% of total intensity, with an average of 20%.
To put these statistics in context, we also show the fractions of
total intensity from linked substrate peptides (light blue) and from
common, unlinked peptides (dark blue line). Since the SUMO tag represents
a significant fraction of total ion intensity in the MS/MS spectra,
our new database search method, Specialize, considers all possible
fragment ions from both the SUMO tag and the substrate
peptide when matching a SUMOylated peptide against a query spectrum
rather than simply treating it as a peptide with a big mass offset
at lysine residue as is presently modeled in current database search
methods (see Figure 1b). Moreover, we use a
separate scoring model for the substrate peptide and SUMO tag to account
for their difference in fragmentation statistics (see Supplementary Figure 1a).
Figure 2
Generating training data
using combinatorial synthetic peptide
libraries. We designed and synthesized three combinatorial peptide
libraries, each with a SUMO tag (QQQTGG) attached via a lysine
residue at a different position along the library peptide. The sequence
pattern for each library is shown on the left. The symbols X and h stand for variable positions where
multiple amino acid residues are possible. MS/MS spectra from peptide
libraries were identified using a two-step search strategy. First
for library I, since the SUMO tag is attached to the library peptide
at the first residue, this is essentially equivalent to the substrate
peptide having a prefix extension of QQQTGG. Thus, we can identify
MS/MS spectra from library I by searching a database where the sequence
QQQTGG is concatenated to the N-terminus of every possible peptide
sequence in library I. This initial set of identified MS/MS spectra
from SUMOylated peptides was used to build a SUMO-specific database
search tool to identify MS/MS spectra from libraries II and III, which
are more a realistic representation of SUMOylated peptides. Identified
spectra from library II and III were then incorporated into the training
data to build an even better scoring model. This refined method was
used to search the spectra from all three libraries to obtain a final
set of MS/MS spectra from SUMOylated peptides.
Figure 3
Contribution of ion intensity from SUMO tag. We computed the fraction
of ion intensity corresponding to fragment ions from the SUMO tag
in our training data. As shown in panel a, the fragment ions from
the SUMO tag contribute a significant fraction (10–60%) of
total intensity in the MS/MS spectra (red line). The fractions of
total ion intensity from SUMO tag peptides are compared to those from
substrate peptides (cyan), substrate peptide and SUMO tag combined
(magenta), and linear, unlinked peptides (blue). (b–d) Examples
of identified MS/MS spectra from SUMOylated peptides from the synthetic
peptide libraries. Peaks explained by substrate peptides are colored
blue, while peaks explained by SUMO tag are colored red. Peaks corresponding
to neutral-losses or explained by both substrate peptide and SUMO
tag are colored in black. (d) An example in which peaks from the SUMO
tag dominate the observed MS/MS spectra.
Generating training data
using combinatorial synthetic peptide
libraries. We designed and synthesized three combinatorial peptide
libraries, each with a SUMO tag (QQQTGG) attached via a lysine
residue at a different position along the library peptide. The sequence
pattern for each library is shown on the left. The symbols X and h stand for variable positions where
multiple amino acid residues are possible. MS/MS spectra from peptide
libraries were identified using a two-step search strategy. First
for library I, since the SUMO tag is attached to the library peptide
at the first residue, this is essentially equivalent to the substrate
peptide having a prefix extension of QQQTGG. Thus, we can identify
MS/MS spectra from library I by searching a database where the sequence
QQQTGG is concatenated to the N-terminus of every possible peptide
sequence in library I. This initial set of identified MS/MS spectra
from SUMOylated peptides was used to build a SUMO-specific database
search tool to identify MS/MS spectra from libraries II and III, which
are more a realistic representation of SUMOylated peptides. Identified
spectra from library II and III were then incorporated into the training
data to build an even better scoring model. This refined method was
used to search the spectra from all three libraries to obtain a final
set of MS/MS spectra from SUMOylated peptides.Contribution of ion intensity from SUMO tag. We computed the fraction
of ion intensity corresponding to fragment ions from the SUMO tag
in our training data. As shown in panel a, the fragment ions from
the SUMO tag contribute a significant fraction (10–60%) of
total intensity in the MS/MS spectra (red line). The fractions of
total ion intensity from SUMO tag peptides are compared to those from
substrate peptides (cyan), substrate peptide and SUMO tag combined
(magenta), and linear, unlinked peptides (blue). (b–d) Examples
of identified MS/MS spectra from SUMOylated peptides from the synthetic
peptide libraries. Peaks explained by substrate peptides are colored
blue, while peaks explained by SUMO tag are colored red. Peaks corresponding
to neutral-losses or explained by both substrate peptide and SUMO
tag are colored in black. (d) An example in which peaks from the SUMO
tag dominate the observed MS/MS spectra.In addition to generating extra fragment ions, the conjugation
of a SUMO tag to a substrate peptide changes its physicochemical properties
and thus changes its fragmentation pattern in MS/MS spectra. Conceptually,
fragment ions from SUMOylated peptides can be divided into two categories:
linked-fragments and unlinked fragments (see Figure 1c). Linked fragment ions are from peptide fragments that remain
covalently linked to a second peptide. Assuming there is no double-fragmentation,
for substrate peptides, these are fragments that are linked to the
SUMO tag; for the SUMO-tag peptide, these are fragments that are linked
to the substrate peptide. In general, unlinked fragments result in
fragmentation patterns similar to those of common, unlinked peptides
(see Supplementary Figure 1b), while linked
fragments result in fragmentation patterns substantially different
from those of unlinked peptides (see Supplementary
Figure 1c). In particular, multiply charged fragments are more
prominent (i.e., fragment ions have more intense peaks). This makes
intuitive sense because linked fragments are covalently attached to
a second peptide that contains an additional N and C-terminus that
are also available to capture additional charges. Specialize accounts
for these characteristics by introducing different ion models for
linked and unlinked fragments (e.g., linked b-ions vs unlinked b-ions).
Therefore, during training of the Peptide-Spectrum-Match scoring function,
separate probabilistic models are used for linked and unlinked fragments.
Identifying SUMOylated Peptides in Combinatorial Peptide Library
To benchmark Specialize, we searched MS/MS spectra from the three
synthetic peptide libraries using a standard database search tool,
InsPecT,[29] with variable lysine modifications
+599.266 Da (for SUMO tag QQQTGG) and +582.239 Da (for SUMO tag with
a pyro-Q modification). To avoid overfitting, we split the data evenly
into two subsets and trained Specialize on one subset and tested it
on the other subset (i.e., 2-fold cross validation[30]). As shown in Table 1, Specialize
identified 3–7 times more MS/MS spectra from SUMOylated peptides
than InsPecT at 1% FDR. To provide some perspective, we also ran InsPecT
on a Yeast data set[31] representing a typical
proteomics experiment designed for linear, unlinked peptides. Out
of the 76,177 MS/MS spectra in the Yeast data set, InsPecT identified
22,658 spectra corresponding to an identification rate of 29.7%. However
in the three synthetic libraries of SUMOylated peptides, the identification
rate for InsPecT drops to 3.2–16.4% (see Table 1), substantially lower than that of the Yeast data set even
though the combinatorial peptide libraries are much less complex samples
than the Yeast lysate. This supports the observations that attachment
of SUMOylation tags to substrate peptides indeed changes peptide fragmentation
patterns in a way that limits the ability of current database search
tools to identify MS/MS spectra from SUMOylated peptides. In contrast,
using Specialize’s scoring models that capture SUMO-specific
fragmentation characteristics, the identification rate in the SUMOylated
peptide libraries was increased to 19–44.7%, which is comparable
to InsPecT’s identification rate for linear, unlinked peptides.
Table 1
Identified Spectra from SUMOylated
Peptides and Unique SUMOylated Peptides Identifieda
Identified
spectra from SUMOylated peptides
library I
library II
library III
Yeast
InsPecT
743 (6.1%)
1826 (16.4%)
531 (4.4%)
22658 (29.7%)
Specialize
2320 (19.0%)
4967 (44.7%)
2929 (24.2%)
N/A
no. of MS/MS spectra
12202
11113
12177
76177
MS/MS spectra from each synthetic
peptide library were analyzed by Specialize and InsPecT, and the number
of identified spectra and unique peptides from SUMOylated peptides
at 1% FDR are shown. Numbers inside the parentheses indicate the identification
rate, which is the percentage of total number of spectra that are
identified. The Yeast data set represents a typical proteomic experiment
designed for linear, unlinked peptides. In comparison InsPecT’s
identification rate is much lower for SUMOylated peptides as compared
to unlinked peptides. On the other hand, Specialize’s identification
rate for SUMOylated peptides is comparable to InsPecT’s identification
rate for unlinked peptides.
MS/MS spectra from each synthetic
peptide library were analyzed by Specialize and InsPecT, and the number
of identified spectra and unique peptides from SUMOylated peptides
at 1% FDR are shown. Numbers inside the parentheses indicate the identification
rate, which is the percentage of total number of spectra that are
identified. The Yeast data set represents a typical proteomic experiment
designed for linear, unlinked peptides. In comparison InsPecT’s
identification rate is much lower for SUMOylated peptides as compared
to unlinked peptides. On the other hand, Specialize’s identification
rate for SUMOylated peptides is comparable to InsPecT’s identification
rate for unlinked peptides.
Identifying SUMOylated Peptides from Cell Lysate
In
order to demonstrate Specialize’s ability to process biological
samples, we synthesized 20 peptides from the humanmyeloid cell leukemia
protein (MCL1_Human) with a SUMO tag QQQTGG attached to a lysine residue.
Since MCL-1 carries canonical SUMOylation motifs, these synthetic
peptides were used as a model for endogenous SUMOylated peptides.
The samples were analyzed using an LTQ-Orbitrap mass spectrometer,
and a total of 207 MS/MS spectra from SUMOylated peptides were identified
by Specialize. This corresponds to 18 out of the 20 SUMOylated peptides
synthesized. The remaining two peptides that Specialize was unable
to identify because they are very short (3 and 5 residues long), reflecting
the general limitation of database search methods in identifying short
peptides (short peptides tend to have relatively few fragment ions
in MS/MS spectra).To test the identification of SUMOylated
peptides in complex samples, the synthetic SUMOylated peptides from
MCL1 were spiked into a Jurkathuman cell lysate background and analyzed
by MS/MS (Jurkat data set). From this data set Specialize was able
to identify 13 unique MCL1 SUMOylated peptides, while Mascot was able
to identify 6 unique SUMOylated peptides at 5% FDR. To estimate an
upper bound for the number of SUMOylated peptides that could have
been identified from the set of acquired spectra, identified MS/MS
spectra from the pure MCL-1 data set described above were used to
build a spectral library of MCL1 SUMOylated peptides. This spectral
library was then used to search the Jurkat data to determine a list
of possible SUMOylated peptides that were selected by the instrument
for MS/MS analysis out of all acquired spectra. Spectral library search
identified a total of 16 unique MCL1 SUMOylated peptides in the Jurkat
lysate sample, which indicates that Specialize has a sensitivity of
approximately 13/16 ≈ 80%. We noted that this is not related
to the sensitivity of mass spectrometry to detect SUMOylated peptides
but rather reflects the ability of computational tools to identify
SUMOylated peptides given that such MS/MS spectra have already been
acquired.Specialize was also evaluated on two large-scale proteomic
experiments
from two previous studies on SUMOylation in Human[20] and Arabidopsis.[32] Most SUMOylation studies to date have focused on identifying potential
substrate proteins. While these studies often identified many potential
substrate proteins after immunoprecipitation, the number of SUMOylated
peptides identified is usually rather small, underscoring the current
challenges in distinguishing true SUMO substrates from immunoprecipitation
artifacts.[21−24,26,33] As shown in Figure 4, in both SUMO data sets
Specialize was able to increase the number of identified SUMOylated
peptides by 54–125% over what was identified by Mascot[34] at 5% FDR. A detailed comparison shows that
Specialize was able to identify 41 out of 68 SUMOylated peptides found
by Mascot. Further investigation into those SUMOylated peptides missed
by Specialize showed that either the substrate peptides are very long
(e.g., ≥ 27 aa, see Supplementary Figure
2a) or the fragment ions from the SUMO tag display relatively
low intensity in the MS/MS spectra: an average of only 5% of the total
intensity in the spectrum as compared to 10–20% of the intensity
for cases that were identified by Specialize (see Supplementary Figure 2b). It is perhaps not surprising that
Specialize did not identify this subgroup of SUMOylated peptides because
these observed characteristics highlight the limitations of the peptide
libraries used to train Specialize. For example, the peptides in the
libraries have a fixed length of 12 residues, which reflects the average
length of a tryptic peptide. However, this limited diversity in peptide
length may lead to a scoring model that does not capture the fragmentation
pattern for long peptides very well. Similarly, in the training data
the SUMO tag always contributes a significant fraction (on average
20–25%) of the total intensity in the MS/MS spectra (see Supplementary Figure 3). As a result Specialize
gives considerable weight to these fragment ions from the SUMO tag
when evaluating whether a SUMOylated peptide is a good match to an
MS/MS spectrum. When many fragment ions from the SUMO tag are missing
or of relatively low abundance in the MS/MS spectrum, Specialize is
likely to assign a low score. We argue that this may actually be a
desirable feature for automatic methods to have, since the presence
of these fragment ions from the SUMO tag help confirm that the PTM
is a SUMOylation rather than some other combination of sequence variation
and modifications that happen to result in the same peptide parent
mass. Nevertheless, by capturing the specific fragmentation characteristics
of SUMOylated peptides, Specialize was clearly shown to be able to
substantially increase the identification of SUMOylated peptides in
various data sets.
Figure 4
Comparison of identification of SUMOylated peptides between
Specialize
and Mascot. The ability of Specialize to identify SUMOylated peptides
was tested on three data sets. The Jurkat data set contains a set
of 20 synthetic SUMOylated peptides from the human MCL1 protein spiked
into a background of Jurkat human cell lysate. The Arabidopsis and Human SUMO data sets were obtained from two previous proteomic
studies on SUMO site identification. The numbers of MS/MS spectra
from SUMOylated peptides as well as the numbers of unique SUMOylated
peptides identified by Specialize are compared with those identified
by Mascot at 5% FDR. As shown in the figure, Specialize is able to
improve the identification of SUMOylated peptides by 54–125%.
Comparison of identification of SUMOylated peptides between
Specialize
and Mascot. The ability of Specialize to identify SUMOylated peptides
was tested on three data sets. The Jurkat data set contains a set
of 20 synthetic SUMOylated peptides from the humanMCL1 protein spiked
into a background of Jurkathuman cell lysate. The Arabidopsis and Human SUMO data sets were obtained from two previous proteomic
studies on SUMO site identification. The numbers of MS/MS spectra
from SUMOylated peptides as well as the numbers of unique SUMOylated
peptides identified by Specialize are compared with those identified
by Mascot at 5% FDR. As shown in the figure, Specialize is able to
improve the identification of SUMOylated peptides by 54–125%.
Discussion
A key
requirement for the development of efficient and accurate
computational tools for the automatic identification of MS/MS spectra
is the availability of a sufficiently large set of identified spectra
from distinct peptides. However, creating such a data set for atypical
classes of peptides is difficult without efficient informatics tools
to identify these spectra in the first place, a recurring “chicken-and-egg”
problem. Traditionally these reference data sets were only possible
when mass spectrometrists manually curated hundreds to thousands of
MS/MS spectra, but such an approach is very labor intensive and not
scalable. We demonstrated that combinatorial peptide libraries are
an efficient way to address this challenge by quickly generating large
numbers of unique modified peptides at very low cost. There is also
no need to enrich for modified peptides from a large background of
unmodified peptides because modifications are directly attached to
the peptides during synthesis. MS/MS spectra from the modified peptides
are readily identified using a search strategy that takes advantage
of the design in the peptide libraries. Using this approach, we observed
two main characteristics that make fragmentation of SUMOylated peptides
different from those of linear, unlinked peptides. First, the SUMO
tag fragments contribute significantly to the total ion intensity
in the spectrum and require search algorithms to consider both fragment
ions from the peptide and the PTM when matching spectra against SUMOylated
peptides. Second, the residual attachment of a SUMO tag to the substrate
peptide generates highly charged fragment ions that are not commonly
observed in linear, unlinked peptides. These differences in fragmentation
statistics makes current database search methods, which have mainly
been designed based on linear, unlinked peptides, inappropriate for
identification of MS/MS spectra from SUMOylated peptides. In our benchmark
analysis of synthetic peptides, we observed that the identification
rate for a regular database search tool dropped by 2–10-fold
when applied to MS/MS spectra from SUMOylated peptides. On the other
hand, the incorporation of PTM-specific fragmentation statistics into
Specialize increased the identification rate of SUMOylated peptides
and made it comparable to that of linear, unlinked peptides. Further
testing on several data sets unrelated to the training data demonstrated
that Specialize is able to identify significantly more SUMOylated
peptides from biological samples when compared to InsPecT or Mascot.
These samples originated from multiple species, and different techniques
were used to enrich for the SUMOylated peptides, leading to different
SUMO C-terminus tags present on substrate peptides. Specialize’s
ability to identify SUMOylated peptides across these samples demonstrates
its robustness in identifying SUMOylated peptides from various sources
in an unbiased manner. One current limitation of our approach is that
the training data may not generalize to all possible SUMOylated peptides;
this is illustrated by the handful of SUMOylated peptides identified
by Mascot that were not identified by Specialize. It is common for
different search engines to perform better on different subclasses
of peptides.[35−39] As with Specialize this is usually because they tend to perform
best on data similar to those used to develop the algorithm. However,
in universal tools such as PepNovo,[40] Percolator,[41] or MSGFDB,[42] it is
possible to retrain the models for different types of data and classes
of peptides.[43] Similarly, one could potentially
train a specific Specialize model for each subtype of SUMOylated peptides
in order to maximize the sensitivity of the search tool at detecting
SUMOylated peptides (similar to what is already done for peptides
with different charge states). In fact, one of the major advantages
of Specialize is that this would be readily addressable in our framework
as one could retrain it by synthesizing additional peptide libraries
with more diversity in sequence composition and length or by using
additional spectra of SUMOylated peptides from any other sources.
Finally, the core concepts of the proposed approach for developing
PTM-specific search methods are not specific to SUMOylation and can
potentially be used to develop new tools to identify peptides with
other complex PTMs by designing and synthesizing new peptide libraries
for each complex PTM.
Methods
Combinatorial Peptide Libraries
of SUMOylated Peptides
We used combinatorial peptide synthesis
to generate peptide libraries
with the following sequence patterns:(I) K(QQQTGG)A[X]D[X]ES[X]LRAK(II)
TALH[X]K(QQQTGG)[X]S[X]TFR(III) A[h]K(QQQTGG)[X][DE]T[X]FRAKEach peptide library was synthesized with
a SUMO tag
(QQQTGG) attached via a lysine residue at positions 1, 6, and
3 along the substrate peptides, respectively (see Figure 2). The letter in square brackets indicates that
multiple residues are possible at that position. The possible residue
choices are [X] = ARDEHLKMFPSTYV, and [h] = FILVYW. The
sequence patterns were designed to generate sufficient sequence variability
as well as to provide a realistic model for SUMOylated peptides seen
in real samples. In particular the sequence pattern for library III
contains the canonical sequence motif ([XK(D/E)]) for SUMOylated peptides.[19]After synthesis, the peptides libraries
were analyzed and identified
using tandem mass spectrometry. Samples from each library were injected
via an autosampler for separation by reverse phase chromatography
on a NanoAcquity UPLC system (Waters, Dublin, CA). Peptides were loaded
onto a Symmetry C18 column (1.7 m BEH-130, 0.1 mm × 100 mm, Waters,
Dublin, CA) with a flow rate of 1 μL a minute and a gradient
of 2% Solvent B to 25% Solvent B (where Solvent A is 0.1% formic acid/2%
ACN/water and Solvent B is 0.1% formic acid/2% water/ACN) applied
over 60 min with a total analysis time of 90 min. Peptides were eluted
directly into an Advance CaptiveSpray ionization source (Michrom BioResources/Bruker,
Auburn, CA) with a spray voltage of 1.4 kV and were analyzed using
an LTQ Velos Orbitrap mass spectrometer (ThermoFisher, San Jose, CA).
Precursor ions were analyzed in the FTMS at a resolution of 60,000.
MS/MS was performed in the LTQ with the instrument operated in data
dependent mode whereby the top 15 most abundant ions were subjected
for fragmentation.
Synthetic MCL1 Data Set
All possible
chymotryptic peptides
with internal lysine residues in the humanmyeloid cell leukemia protein
(MCL1_Human) were synthesized with a SUMO2 tag (QQQTGG) attached to
the lysine residue. This corresponds to a total of 20 SUMOylated peptides,
including variants of the same peptide with SUMO attached to different
lysine positions. This set of synthesized peptides serves as a benchmark
data set to test our algorithm and also as a reference spectral library
for identifying SUMOylated peptides in a real sample. To test our
algorithm’s ability to identify SUMOylated peptide in a complex
mixture, the synthetic SUMOylated peptides from MCL1 (125fmol/peptide)
were also spiked into 1 μg whole cell lysate of the humanJurkat
cell. The samples were then analyzed by LC–MS/MS as described
for the combinatorial peptide libraries.
Identification of SUMOylated
Peptides from Combinatorial Peptide
Libraries
As illustrated in Figure 2, a two-stage search strategy was used to identify the MS/MS spectra
from the three synthetic peptide libraries. For peptide library I,
the SUMO tag is attached to the library peptide at the first residue,
which is conceptually similar to library peptides having a prefix
extension of QQQTGG. Thus MS/MS spectra from peptide library I can
be identified by searching a custom database where a prefix QQQTGG
is added at the N-terminus of every possible peptide sequence in library
I. In addition to these target sequences, an E. coli protein sequence database (downloaded from NCBI with Taxonomy ID
511145, ver. 08/25/2009) was used as the decoy database. The E. coli database was appended to the target library peptide
sequences in order to generate a sufficient large database for the
target/decoy approach[44] to get a reasonable
estimation of FDR, since a sequence database containing only library
peptide sequences would be too small for TDA to accurately estimate
FDR.[45] While a decoy database of larger
size than the target database may result in overestimation of the
actual FDR, we opted for this more conservative approach because the
identified spectra were used as reference training data for our proposed
scoring models. The database search was performed using InsPecT[29] with a 1% spectrum-level false discovery rate
(FDR). This allows one to identify an initial set of MS/MS spectra
from SUMOylated peptides that are then used to build a SUMO-specific
database search method (see next section) to identify MS/MS spectra
from peptide libraries II and III, which are a more realistic representation
of endogenous SUMOylated peptides. Library II contains peptides with
a SUMO tag attached near the middle of the peptide, and library III
contains peptides whose sequence pattern conforms to the canonical
sequence motif [XK(D/E)][19] for SUMOylated
peptides. After spectra from SUMOylated peptides were identified from
libraries II and III, they were incorporated in our training data
and used to build a better scoring model for SUMOylated peptides.
Finally, this improved method was used to search the spectra from
all three libraries one more time to get a final list of MS/MS spectra
from synthetic SUMOylated peptides. To avoid overfitting and FDR underestimation
by training and testing Specialize on the same set of PSMs, we used
2-fold cross-validation.[30,41] The PSMs were randomly
split into two subsets of equal size, and Specialize was trained on
one subset and tested on the other subset (and vice versa, as in standard
2-fold cross-validation). Since InsPecT does not support small precursor
mass tolerance, it was run with 3 Da parent mass tolerance and 0.5
Da fragment mass tolerance. Specialize search was run with a 50 ppm
precursor mass tolerance and 0.5 Da fragment mass tolerance. To make
the search space comparable, when we compared the search result between
Specialize and InsPecT, Specialize was also run with a 3 Da parent
mass tolerance. With 50 ppm precursor mass tolerance Specialize identified
a total of 2357, 4967, and 2990 MS/MS spectra from libraries I, II,
and III, respectively, while with 3 Da precursor mass tolerance it
identified 2320, 4967, and 2929 spectra from SUMOylated peptides,
respectively.
Building a PTM-Specific Database Search Method
for SUMOylated
Peptides
In general MS/MS spectra from SUMOylated peptides
have two defining characteristics: (1) they tend to contain a mixture
of SUMO tag fragment ions and substrate peptide fragment ions, and
(2) the attachment of the SUMO tag to the substrate peptide makes
higher-charged fragment ions much more prominent than on spectra of
unlinked peptides. To model the first characteristic we assume that
each SUMOylated peptide can only fragment once and conceptually think
of a SUMOylated peptide as a mixture of two peptides: a substrate
peptide carrying a modification of mass +599 Da (the mass of QQQTGG)
at the lysine residue and a peptide with sequence QQQTGG carrying
a modification with mass of the substrate peptide at the C-terminus
(see Figure 1b). In common MS/MS database search,
one tries to evaluate how well a single candidate
peptide matches to an MS/MS spectrum; for SUMOylated peptides we evaluate
how well a pair of peptides (substrate peptide and
SUMO tag) matches to a MS/MS spectrum. In previous work (MixDB[46]) we introduced a probabilistic model that describes
how well a pair of peptides matches to a mixture MS/MS spectrum from
coeluting peptides. The statistical framework used here extends that
used in MixDB by further capturing the specific fragmentation pattern
of branch-linked peptides.Briefly, an MS/MS spectrum is represented
as a vector of n bins, each representing a mass interval
of width δ Da (δ depends on instrument resolution). An
experimental MS/MS spectrum is represented as a vector S = s1, s2, ..., s where s represents the peak intensity
rank (ranked from most to least intense) of the highest-intensity
peak in each bin. Similarly, a theoretical spectrum of a peptide P = p1, p2, ..., p is
represented as a vector where p indicates the ion-type of the fragment ion (e.g., b-ion or
y-ion) with mass in that bin. The model captures peptide fragmentation
statistics by using a set of annotated MS/MS spectra to learn the
probability that each type of ion generates an observed peak with
a given rank: Prob(s|p). Similarly, a noise model, Prob(s|0), can be learned using unannotated peaks in the spectrum (where
the symbol 0 represents noise). The scoring function for a Peptide
Spectrum Match (PSM) is thus defined as the likelihood ratio of the
probability that the observed spectrum S is generated
from the candidate peptide P versus the probability
that the observed spectrum is generated from noise: Score(S,P) = ∑Score(s,p) = ∑ log[(Prob(s|p))/(Prob(s|0))]. Since a spectrum from a SUMOylated peptide is a mixture
spectrum from two peptides, we can represent a SUMOylated peptide
as two vectors SUMO(P) = (U,T). The vector U = u1, u2, ..., u encodes all possible fragment
ions from the substrate peptide (having the SUMO tag as a lysine modification),
while the vector T = t1, t2, ..., t contains all possible fragments from the SUMO tag
(having the substrate peptide as a C-terminus modification). In order
to account for their different fragmentation patterns, separate scoring
models were learned to score U and T against S. For example, for b-ion Specialize will
uses a different scoring model for substrate peptides (Score(s,bsubstrate) and the
SUMO tag (Score(s,btag)). Thus, the likelihood score that a spectrum S is generated from a pair of peptides (U,T) is defined as: Score(S,(U,T)) = ∑ max(Score(u,p),Score(t,p)). The max operation is used to model the dependency
between the substrate peptide and the SUMO tag; when theoretical fragment
ions from both U and T match to
the same peak in the spectrum, the model assign the peak only to the
theoretical fragment ion with higher probability. This avoids using
the same peak twice to support the identification of substrate or
tag peptides, which if not explicitly prevented will incorrectly bias
toward unusually high scores for pairs of peptides with shared masses
for many of their theoretical fragment ions.In order to further
capture the fragmentation statistics of branch-linked
peptides Specialize separates the fragment ions from a SUMOylated
peptide into linked and unlinked fragments (see Figure 1). Linked fragments are defined as fragment ions that are
covalently linked to a second peptide. Specialize introduces new ion
types to account for linked fragments. The original MixDB scoring
model considered the standard ion types: b, b(iso), b–H2O, b–NH3, y, y(iso), y–H2O, y–NH3, where (iso) indicates the isotopic peak of b or y ions. Specialize further adds the ion types b, b(iso), b–H2O, b–NH3, y, y(iso), y–H2O, y–NH3 to represent the corresponding linked-fragment ions that can be
generated from SUMOylated peptides. For each ion type Specialize considers
charge states from one to the precursor charge of the observed MS/MS
spectrum. With these new ion types, the fragmentation properties of
linked-fragment ions were learned during training, and different probability/weights
were assigned to linked and nonlinked fragment ions when matching
a SUMOylated peptide against an MS/MS spectrum. We noted that Specialize
does not attempt to find new, nonstandard ion types from peptide libraries;
however, this capability could be implemented using the offset frequency
function.[42,47]Since it is not known in advance whether
each spectrum comes from
a SUMOylated peptide, both SUMOylated peptide candidates and non-SUMO
peptide candidates are considered during database search. SUMOylated
peptide candidates are scored using models with both linked and unlinked
fragment ions as described above, and unlinked peptides are scored
using only models with unlinked fragment ions. The top scoring peptide
candidate, whether SUMOylated or not, is taken as the final match
for the particular query spectrum. Because Specialize used the assigned
precursor charge state in the MS/MS spectrum to determine the list
of candidate peptides and the appropriate scoring model, spectra with
unassigned charge states were not considered. We note that it is important
to consider both SUMOylated and unlinked peptide candidates when searching
a spectrum against a database even though the main goal is to identify
SUMOylated peptides. This is because an MS/MS spectrum generated from
a long, unlinked peptide can be mistaken as a shorter peptide candidate
carrying a SUMO modification at a lysine site near the N or C-terminus
of the peptide. These incorrect SUMOylated candidates can sometime
obtain good scores, especially when they share a prefix/suffix with
the correct unmodified peptide. Thus considering both SUMOylated and
unlinked candidates for every query spectrum can reduce the chances
of such false positive IDs.After determining the highest-scoring
match for each spectrum,
top scoring peptide spectrum matches (PSMs) from SUMOylated peptides
are scored using a Support Vector Machine (SVM)[48,100] to distinguish true matches from false positive ones. Because the
current implementation of Specialize focuses on the identification
of peptides with complex PTMs, non-SUMOylated PSMs were not considered
for further scoring. The features used in SVM were (1) likelihood
score as described above; (2) normalized score: likelihood score from
(1) divided by the number of amino acids in the candidate peptide;
(3) explained MS/MS intensity: total intensity of annotated peaks
divided by total intensity of the spectrum; (4, 5) fraction of b and y ions present: number of b and y ions present in the spectrum divided
by the number of b/y ions possible
from the peptide (2 features); (6, 7) longest consecutive series of b and y ions (2 features); and (8) average
mass error between theoretical and observed masses.For SUMOylated
peptides, each of the above features can be computed
for the substrate peptide and SUMO tag, thus resulting in a total
of 16 features. Together with the combined likelihood score that considers
fragments from both the substrate peptide and SUMO tag (as described
above), this defines the final list of 17 features used in the SVM
model for SUMOylated peptides. The SVM model was trained using the
identified MS/MS spectra from the combinatorial libraries. For each
training data set, the correct PSMs were used as positive training
data while top-scoring PSMs from the decoy database were used as negative
training data.
Estimation of False Discovery Rate for SUMOylated
Peptides
All database searches were performed against a concatenated
database
consisting of the target database and a decoy database created by
reversing each protein sequence in the target database. For each database
search result, all top scoring PSMs with peptides having the SUMO
modifications were considered as SUMOylated PSMs. All SUMOylated PSMs
were extracted and a false discovery rate (FDR) specific for SUMOylated
PSMs was determined using the standard Target/Decoy Approach.[44] Briefly, let SUMOtarget be the number
of SUMOylated PSMs matched to the target database and SUMOdecoy be the number of SUMOylated PSMs matched to the decoy database,
the FDR for SUMOylated PSM was calculated as follows:
Identification
of SUMOylated Peptides in Biological Data Sets
For the synthetic
MCL1 SUMOylated peptides (pure MCL1 data
set), Specialize was run with same parameters while allowing
the following two SUMO tags: QQQTGG and Q(−17.0265)QQTGG where
Q(−17.0265) indicates pyro-glutamate formation. The data were
searched against a database containing all synthetic MCL1peptide
sequences with an appended E. coli sequence database
(downloaded from NCBI, ver . 08/25/2009) as decoy. All SUMOylated
peptide PSMs were extracted, and then a 5% FDR was enforced using
the standard target/decoy strategy (TDA). We used a FDR threshold
slightly higher than the 1% that is usually used in typical proteomic
experiment because the number of spectra from SUMOylated peptides
in the sample is usually small (i.e., 30–200 in the MCL1 data
set). As a result, it is difficult to get a robust estimation of FDR
using the TDA approach when only a very small number of decoy SUMOylated
PSMs are allowed to pass the FDR threshold (i.e., 0–2 PSMs).
For the humanJurkat cell lysate data set with spiked-in MCL1peptides
(Jurkat data set), searches were done against a database
containing all synthetic MCL1peptide sequences and a Human protein
sequences (downloaded from NCBI Refseq, ver. 10/29/2010). To estimate
the an upper bound for the number of SUMOylated peptides that could
have been identified from the set of acquired spectra, identified
MS/MS spectra from the pure MCL-1 data set described above were used
to build a spectral library of MCL1 SUMOylated peptides. This spectral
library was then used to search the Jurkat data to determine a list
of possible SUMOylated peptides that were selected by the instrument
for MS/MS analysis out of all acquired spectra.To estimate the an
upper bound for the number of SUMOylated peptides that could have
been identified from the set of acquired spectra, identified MS/MS
spectra from the pure MCL-1 data set described above were used to
build a spectral library of MCL1 SUMOylated peptides. This spectral
library was then used to search the Jurkat data to determine a list
of possible SUMOylated peptides that were selected by the instrument
for MS/MS analysis out of all acquired spectra. The search was done
using M-SPLIT[49] using default parameters.To compare the performance of Specialize and Mascot, the Jurkat,
the Arabidopsis(32) and
the Human SUMO[20] data sets were analyzed
by both Specialized and Mascot. For the Arabidopsis data set, because only a subset of the MS/MS data were provided
to us by the authors, only results in the following data files are
considered in this manuscript: MM_cold_091609o.mzXML; MM_Sumo_hot_091609qṁzXML;
Vierstra_Sumo_062209d.mzXML; Vierstra_sumo_070109dṁzXML. For
the Human SUMO data set,[20] we considered
only MS/MS data that were generated in the Collision-induced dissociation
(CID) mode since our training data was generated in CID only. All
searches were done using 50 ppm precursor mass tolerance and 0.5 Da
fragment mass tolerance with variable modification N-terminal acetylaton
(Nterm+42.011) and methionine oxidation (M+15.995). Because different
SUMO modifications are generated in different data sets the following
SUMO tags or modifications were allowed on lysine residue during the
searches. For the Jurkat data set, QQQTGG (+599.266) and Q(−17.0265)QQTGG
(+582.23) were considered; for the Arabidopsis SUMO
data set, QTGG (+343.33) and Q(−17)TGG (+326.12) modifications
were considered; and for the Human SUMO data set, QQTGG (+454.18)
and Q(−17.0265)QTGG (471.21) were considered. All top-scoring
PSMs with the SUMO modifications were extracted and filtered to have
a precursor mass error less than 15 ppm and a 5% FDR was enforced
using the TDA approach. The sequence databases used were the Arabidopsis thaliana protein sequence database (downloaded
from UniProt, ver. 5/13/2012) and the Human protein sequences (downloaded
from NCBI Refseq, ver. 10/29/2010). For the Human and Arabidopsis data sets decoy databases were generated as usual[44] by reversing each protein sequence in the target databases.
Authors: Sangtae Kim; Nikolai Mischerikow; Nuno Bandeira; J Daniel Navarro; Louis Wich; Shabaz Mohammed; Albert J R Heck; Pavel A Pevzner Journal: Mol Cell Proteomics Date: 2010-09-09 Impact factor: 5.911
Authors: Albrecht Moritz; Yu Li; Ailan Guo; Judit Villén; Yi Wang; Joan MacNeill; Jon Kornhauser; Kam Sprott; Jing Zhou; Anthony Possemato; Jian Min Ren; Peter Hornbeck; Lewis C Cantley; Steven P Gygi; John Rush; Michael J Comb Journal: Sci Signal Date: 2010-08-24 Impact factor: 8.192
Authors: Marcus J Miller; Gregory A Barrett-Wilt; Zhihua Hua; Richard D Vierstra Journal: Proc Natl Acad Sci U S A Date: 2010-09-02 Impact factor: 11.205
Authors: Klarisa Rikova; Ailan Guo; Qingfu Zeng; Anthony Possemato; Jian Yu; Herbert Haack; Julie Nardone; Kimberly Lee; Cynthia Reeves; Yu Li; Yerong Hu; Zhiping Tan; Matthew Stokes; Laura Sullivan; Jeffrey Mitchell; Randy Wetzel; Joan Macneill; Jian Min Ren; Jin Yuan; Corey E Bakalarski; Judit Villen; Jon M Kornhauser; Bradley Smith; Daiqiang Li; Xinmin Zhou; Steven P Gygi; Ting-Lei Gu; Roberto D Polakiewicz; John Rush; Michael J Comb Journal: Cell Date: 2007-12-14 Impact factor: 41.582
Authors: Jian Wang; Veronica G Anania; Jeff Knott; John Rush; Jennie R Lill; Philip E Bourne; Nuno Bandeira Journal: Mol Cell Proteomics Date: 2014-02-03 Impact factor: 5.911