Glycoproteins are biologically significant large molecules that participate in numerous cellular activities. To obtain site-specific protein glycosylation information, intact glycopeptides, with the glycan attached to the peptide sequence, are characterized by tandem mass spectrometry (MS/MS) methods such as collision-induced dissociation (CID) and electron transfer dissociation (ETD). Although several automated tools have emerged, the field has no consensus on the best way to determine the reliability of these tools or to report the false discovery rate (FDR). A common approach to calculating FDRs for glycopeptide analysis, adopted from the target-decoy strategy in proteomics, employs a decoy database that is created from the target protein sequence database. Nonetheless, this approach is not optimal for measuring the confidence of N-linked glycopeptide matches, because the glycopeptide data set is considerably smaller than that of peptides, and the requirement of a consensus sequence for N-glycosylation further limits the number of possible decoy glycopeptides tested in a database search. To address the need to accurately determine FDRs for automated glycopeptide assignments, we developed GlycoPep Evaluator (GPE), a tool that helps to measure FDRs in identifying glycopeptides without using a decoy database. GPE generates decoy glycopeptides de novo for every target glycopeptide, in a 1:20 target-to-decoy ratio. The decoys, along with target glycopeptides, are scored against the ETD data, from which FDRs can be calculated accurately, even for small data sets, based on the number of decoy matches and the ratio of the number of targets to decoys. GPE is freely accessible for download and can work with any search engine that interprets ETD data of N-linked glycopeptides. The software is provided at https://desairegroup.ku.edu/research.
Glycosylation
is commonly considered
the most extensive post-translational modification on proteins, and
it is estimated that 20%–50% of all proteins are glycoproteins.[1,2] Glycosylation is known to impact protein folding and function;[3,4] the interaction between proteins and glycans is a main route for
cellular communications and signaling.[5−7] In addition, changes
in glycosylation pattern on certain proteins are closely related to
the pathogenesis of diseases.[8,9] Therefore, protein glycosylation
analysis is a vital step toward understanding the role that carbohydrates
play in various biological events.
One common method of characterizing
the glycosylation on proteins
is to digest the protein and to analyze the resulting glycopeptides.
This strategy allows researchers to correlate the glycans to their
attachment sites in the protein(s).[10−12] In glycopeptide analysis,
the correct glycopeptide compositions usually cannot be determined
by high resolution MS data alone, and MS/MS data are needed for confident
glycopeptide assignments.[13] In order to
accelerate the analysis workflow for high-throughput glycopeptide
identifications, an increasing number of bioinformatics tools have been
developed to analyze MS/MS data of glycopeptides.[14−20] Strum et al. presented a program called GlycoPeptide Finder that
can interpret CID data of N- and O-linked glycopeptides generated from nonspecific proteolysis.[21] A computational framework was developed to implement
a software tool called GlycoFragwork, which is capable of scoring N-linked glycopeptide MS/MS data from multiple fragmentation
modes.[22] We recently introduced two web-based
utilities, GlycoPep Grader[23] and GlycoPep
Detector,[24] to determine the most likely N-linked glycopeptide compositions by scoring the CID and
ETD data against each of the possible glycopeptide candidates. In
all the applications described above, the glycopeptide analysis tool
returns a best glycopeptide match for each MS/MS spectrum by selecting
the candidate that receives the highest score under a certain scoring
algorithm. Although these matches are very helpful in guiding the
user, the top match is sometimes incorrect.
While automated
analysis tools are helpful for glycopeptide analysis,
users need to know the likelihood that the automated matches are correct.
Therefore, it is important for any tool to provide users with a reliable
false discovery rate (FDR), which estimates the proportion of accepted
matches that are incorrect, based on the program’s performance in analyzing
the entire data set.[25−28] The concept of calculating an FDR has been well established by the
proteomics community, and to determine the FDR value in proteomics,
a composite database is generated by combining the target protein
sequence database and a decoy sequence database. The decoy database
is nonsensical and created based on the target database such that
they contain an equivalent number of peptide sequences, which is often
accomplished by reversing the protein sequences in the target database.[26−30] Subsequently, the MS/MS data are scored against the composite database,
and the numbers of matches made against the target and decoy sequences
are used to calculate FDR. Following the assumption that the distribution
of incorrect matches to target sequences is the same as that of matches
to decoy sequences, the number of false positive identifications,
which directly translates to FDR, can be calculated by doubling the
number of decoy matches. This target-decoy approach is simple and
works well for peptide identifications based on large scale proteomics
data.[31−33]
Most of the currently available glycopeptide
analysis tools do
not have the capability to calculate FDRs for glycopeptide assignments,
and for those that are enabled with this functionality, the target-decoy
approach is adopted to estimate FDRs in glycopeptide identifications,
where an equal number of decoy glycopeptides are generated on the
basis of the target glycoprotein sequences to comprise the decoy database.[21,22] However, in a glycoproteomics experiment, the number of CID or ETD
spectra scored is considerably smaller than the number of spectra
scored in a proteomics experiment. This is expected since glycoproteomics
experiments are often conducted on a single protein, not thousands
of proteins. Even when the entire proteome is evaluated for glycopeptides,
the number of CID or ETD spectra that are verified to be from glycopeptides
is generally much less than 1000. As a result, using the conventional
approach for calculating FDRs, the distribution of decoy glycopeptide
matches may not accurately reflect that of incorrect matches to target
glycopeptides because the collected glycopeptide data set is not large
enough.[21,34,35] Furthermore,
for N-linked glycopeptides, a consensus sequence
of N-X-S/T (X can be any amino acid except proline) must be present,
which further limits the number of possible decoy glycopeptides being
tested. All these factors lead to inaccurate FDRs when the target-decoy
approach is applied to small to moderate size glycoproteomics data
sets.
In this work, we present a new method to determine FDRs
with high
accuracy for N-linked glycopeptide identifications
based on ETD data. Instead of creating a decoy database of the same
size as the target database, we developed a tool called GlycoPep Evaluator
(GPE) to generate decoy glycopeptides de novo for every target glycopeptide,
in a 1:20 target-to-decoy ratio. The decoys are made under specific
rules so that they contain the consensus sequence for N-linked glycosylation, while they have distinct glycopeptide sequences
and glycosylation sites. To determine the FDR, all the generated decoys
are scored against the ETD data along with target glycopeptides, and
the FDR is calculated accurately based on the number of decoy glycopeptide
matches and the ratio of targets to decoys. GPE is freely
available for download and can be used in conjunction with any scoring
scheme for assessing ETD data of glycopeptides. Please visit https://desairegroup.ku.edu/research for a copy of the software.
Experimental Section
Samples and Reagents
Bovine fetuin, RNase B, and human
serum proteins (IgG, AGP, transferrin) were obtained from Sigma-Aldrich
(St. Louis, MO). The HIV envelope protein, C.97ZA012 gp140, was provided
by the Duke Human Vaccine Research Institute (Durham, NC).[36] Sequencing grade trypsin was purchased from
Promega (Madison, WI). All chemical reagents used were either of analytical
grade or better.
Protease Digestion
Glycoproteins
of 72–100 μg
were dissolved in 100 mM Tris buffer at pH 8 with a concentration
of 2.4–3.3 μg/μL. Samples were denatured by addition
of urea so that the final urea concentration was 6 M, followed by
addition of 5 mM tris(2-carboxyethyl)-phosphine (TCEP) solution to
reduce the disulfide bonds (the molar ratio of TCEP to disulfide bond
was kept at 6:1), and 10 mM iodoacetamide (IAM) was subsequently added
to alkylate the free thiol groups using a molar ratio of 8:1. The
reaction was left to proceed for 1 h at room temperature in the dark.
Dithiothreitol (DTT) solution was then added to a final concentration
of 10 mM to quench the alkylation reaction. Prior to enzymatic digestion,
the urea concentration was decreased to 1 M by diluting the samples
with Tris buffer. Subsequently, trypsin was added at a 1:30 enzyme-to-protein
ratio, followed by 18 h incubation of the samples at 37 °C. Finally,
trypsin digestion was stopped by adding 1 μL of acetic acid
for every 100 μL of glycoprotein solution. The prepared samples
were stored at −20 °C before being subjected to LC/MS analysis.
LC/MS Analysis
Digested glycoprotein samples were analyzed
using a Waters Acquity Ultra Performance Liquid Chromatography system
(Milford, MA) coupled to an LTQ Velos linear ion trap mass spectrometer
(Thermo Scientific, San Jose, CA). For each run, 5 μL of a sample
was injected onto a capillary C18 column (300 μm
i.d. × 5 cm, 100 Å, Micro-Tech Scientific, Vista, CA). Two
mobile phases were employed for separation: solvent A consisted of
99.9% H2O plus 0.1% formic acid, and solvent B consisted
of 99.9% acetonitrile with 0.1% formic acid. The LC separation gradient
was as follows: 2% solvent B for 5 min, followed by a linear increase
to 40% B in 50 min, and a ramp to 90% B in 10 min.[37,38] The column was kept at 90% solvent B for an additional 10 min and
then re-equilibrated at 2% B for 10 min. The mass spectrometer was
operated in the positive ion mode, with the ESI source voltage at
3 kV and the capillary temperature set at 200 °C. For the data-dependent
acquisition, CID and ETD spectra were collected by selecting the five
most intense peaks in the full scan MS (m/z 500–2000) and the precursor ions were fragmented
in either CID or ETD mode. In the MS/MS settings, the automatic gain control
(AGC) function was enabled with a target value of 2 × 10^4 for the ion trap; the fluoranthene anions, employed for ETD
fragmentation, were set at an AGC target value of 2 × 10^5. The reaction time between anions and cations in ETD was set at
90 ms, and the supplemental activation was turned on for ETD so that
precursor ions and charge-reduced species could undergo further dissociation.
For CID, the normalized collision energy was set at 30%, with activation
time of 10 ms.
Glycopeptide MS/MS Data Set
In this
study, MS/MS data
were collected on glycoproteins that have been previously characterized
in the literature.[36,39−42] In silico trypsin digestion was
performed on the glycoprotein sequences with up to 2 missed cleavages
allowed, and carbamidomethylation was set as a fixed modification
on cysteine residues. Theoretical monoisotopic masses of potential N-linked glycopeptides were calculated by adding the site-specific
glycan masses to the masses of the corresponding peptides that contain
the glycosylation sites. The theoretical m/z values of these glycopeptides were then computed and searched
against the ETD data to see whether precursor ions of these m/z values were selected for ETD. Manual
analysis was then performed on every identified ETD spectrum that
may come from potential glycopeptides. If a match was found, CID data
were employed to further confirm the glycopeptide assignment. In this
way, a glycopeptide ETD data set with known glycopeptide compositions
was built that includes glycopeptides of diverse peptide sequences
and varying glycan types.
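The in silico steps described above (tryptic digestion with missed cleavages, sequon filtering, and calculation of theoretical glycopeptide m/z values) can be illustrated with a minimal Python sketch. It is not the code used to build the data set: the function names, the rule that trypsin does not cleave before proline, and the flanking residues in the example protein fragment are assumptions made for illustration, and the residue and monosaccharide masses are standard monoisotopic values.

```python
import re

# Standard monoisotopic residue masses (Da). Cys carries the fixed
# carbamidomethyl modification (+57.02146 Da).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919 + 57.02146,
    "L": 113.08406, "I": 113.08406, "N": 114.04293, "D": 115.02694,
    "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333,
    "W": 186.07931,
}
SUGAR = {"Hex": 162.05282, "HexNAc": 203.07937, "Fuc": 146.05791, "Neu5Ac": 291.09542}
WATER, PROTON = 18.010565, 1.007276


def tryptic_peptides(protein, max_missed=2):
    """Cleave C-terminal to K/R (assumed: not before P), allowing missed cleavages."""
    pieces, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])
    peptides = set()
    for i in range(len(pieces)):
        # Join up to max_missed additional neighbouring pieces.
        for j in range(i, min(i + max_missed + 1, len(pieces))):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def sequon_sites(peptide):
    """0-based indices of Asn residues in an N-X-S/T sequon (X is not Pro)."""
    return [m.start() for m in re.finditer(r"(?=N[^P][ST])", peptide)]


def peptide_mass(peptide):
    return sum(RESIDUE[aa] for aa in peptide) + WATER


def glycan_mass(composition):
    return sum(SUGAR[name] * count for name, count in composition.items())


def glycopeptide_mz(peptide, composition, charge):
    neutral = peptide_mass(peptide) + glycan_mass(composition)
    return (neutral + charge * PROTON) / charge


# Hypothetical protein fragment: the flanking residues are invented.
protein = "AAKDGGEDNKTEEIFRPGGGNMKVVR"
glycan = {"Hex": 3, "HexNAc": 4, "Fuc": 1}
for pep in sorted(tryptic_peptides(protein)):
    if sequon_sites(pep):
        print(pep, sequon_sites(pep), round(glycopeptide_mz(pep, glycan, 3), 4))
```

Only peptides that contain a sequon are kept, since a peptide without an N-X-S/T motif cannot carry an N-linked glycan.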
Decoy and Target Candidates Generation
For this study,
all of the glycopeptide assignments were known. However, to demonstrate
our approach, we simulated a case where the identity of the glycopeptide
was not known and the user had to choose between multiple feasible
candidates. Therefore, we needed mock candidates and decoys to score
against each spectrum. GlycoPep Evaluator (GPE) was used to generate
20 decoys per candidate. The correct “candidate” for
each spectrum is known, and the additional mock candidates were generated
using GlycoMod.[42] To generate the mock
candidates, sequences of the studied glycoproteins were entered into
GlycoMod, along with a polypeptide sequence, Titin, which contains
50 000 amino acid residues. The mock candidates contain the
consensus motif of N-X-S/T, and their glycan compositions are biologically
relevant. As a result, multiple glycopeptide compositions were produced
by GlycoMod for every glycopeptide peak that was subjected to ETD
(with a mass tolerance of 200 ppm), and a selection of the glycopeptides
were entered into GPE as (mock) target glycopeptide candidates. Typically,
five candidate glycopeptides were entered, where one of the candidates
was the true glycopeptide. For each target glycopeptide, GPE is used
to generate 20 decoy glycopeptides of isobaric masses, and these decoys
can be used for evaluating the false discovery rate (FDR) in automated
assignment of glycopeptides by a search engine. (GPE includes functionality
to generate any number of decoys, but 20 were used here.)
Scoring of Decoy and Target Candidates
GPE is a freely
available software tool that we developed to assist in determining
FDRs in glycopeptide analysis. The function to generate decoy glycopeptides
is the main innovation of this tool, and the algorithm used to generate
the decoys is described in detail in the Results section. GPE also incorporates an ETD algorithm that we described
previously,[24] and it can score each target
and decoy candidate against the ETD spectrum in an automated manner.
The software may be used as a standalone program simply for generating
decoys, or it can be used to score the input decoys and targets using
the embedded scoring tool. In order to use the scoring functionality
of GPE, the user needs to upload raw ETD data, specify the MS/MS
scan range and the ion types being scored, and submit the target and
decoy candidates for scoring. GPE then generates the result page where
the candidates are ranked by the scores that they are assigned.
FDR Study Using GPE
GPE was used to score a set of
ETD spectra from 77 different glycopeptides, which had been manually
assigned, as described above. The software generated decoy glycopeptides
for all the input target glycopeptides, and a target or a decoy match
was made depending on whether a target or a decoy candidate received
the highest score. Using the number of decoy matches made by GPE in
assessing the glycopeptide data set and the target-to-decoy ratio
(1:20 in our study), the FDRs in glycopeptide analysis could be calculated.
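The bookkeeping behind this step, in which each spectrum is counted as a target match or a decoy match according to its highest-scoring candidate, can be sketched as follows. This is an illustrative Python sketch and not part of GPE: the data layout (a dictionary mapping a spectrum identifier to a list of `(candidate, is_decoy, score)` tuples), the function names, and the example scores are all assumptions.

```python
from collections import Counter


def top_match(candidates):
    """Return the highest-scoring (candidate, is_decoy, score) tuple for one spectrum."""
    return max(candidates, key=lambda c: c[2])


def tally_matches(scored_spectra):
    """Count how many spectra have a target or a decoy as their top-scoring candidate."""
    counts = Counter()
    for candidates in scored_spectra.values():
        _, is_decoy, _ = top_match(candidates)
        counts["decoy" if is_decoy else "target"] += 1
    return counts


# Made-up scores for two hypothetical spectra, for illustration only.
scored_spectra = {
    "scan_0001": [("target_A", False, 42.0), ("decoy_A1", True, 12.5), ("decoy_A2", True, 9.1)],
    "scan_0002": [("target_B", False, 6.3), ("decoy_B1", True, 15.8), ("decoy_B2", True, 4.0)],
}
counts = tally_matches(scored_spectra)
print(counts["target"], counts["decoy"])
```

The decoy count produced this way is the quantity that enters the FDR calculation, together with the target-to-decoy ratio.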
Results and Discussion
Overview of GlycoPep Evaluator
GlycoPep
Evaluator (GPE)
is a freely downloadable software tool that can be used to generate
decoy glycopeptides for false discovery rate analysis. GPE is available
for download at https://desairegroup.ku.edu/research. It
has incorporated functionality to score all the targets and decoys
against imported spectra using a previously published scoring algorithm.[24] GPE was written in Java and developed with Java
Development Kit 7 (JDK 7). The program has been tested to perform
successfully under Windows and Linux systems, and Java Runtime Environment
7 (JRE 7) is recommended to be installed prior to running GPE.
The graphical user interface (GUI) of GPE is shown in Figure 1A. To generate decoy glycopeptides, the user needs
to enter the target glycopeptide sequence and to specify the N-glycosylation site location by entering the Glycosylated
Asn Index (if a default value of 0 is input, the software will automatically
locate the first Asn that meets the N-X-S/T sequon). Cysteine modifications
can be selected by the user as indicated in the GUI; if there is an
additional modification on any amino acid residue, the user can specify
the location and the mass of the modification as needed. For the glycan
portion, the user can either type in the number of each monosaccharide
unit (Hex, HexNAc, Neu5Ac, etc.) that constitutes the glycan or input
the glycan mass, as shown in Figure 1A. Other
parameters that are necessary to generate decoys include the precursor
ion’s m/z and charge state,
mass tolerance (in ppm), number of maximum missed cleavages, peptide
variation (in Da, see discussion below) and the number of decoys per
target. The mass tolerance is the mass range that the monoisotopic
mass of a decoy glycopeptide, as generated by GPE, is allowed to deviate
from the precursor ion’s mass (as calculated by the precursor
ion’s m/z and charge state).
The peptide variation, on the other hand, is the mass range that the
peptide portion of the decoy (calculated by subtracting the glycan
mass from the monoisotopic mass) is allowed to differ from that of
the peptide in the target glycopeptide. In our experiments, the mass
tolerance for decoys was set at 20 ppm, maximum missed cleavage number
was set to 2, peptide variation was set at 200 Da and the number of
decoys per target was set to 20. Currently, the tool specifically
generates tryptic peptides. If sufficient interest warrants future
development, other options for peptide generation could be included.
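To make the two mass windows concrete, the following Python sketch checks a candidate decoy against the mass tolerance (in ppm, relative to the precursor mass derived from m/z and charge) and against the peptide variation (in Da, relative to the target peptide mass). It is a sketch under stated assumptions rather than GPE's implementation: the function names and example numbers are hypothetical, and the defaults simply mirror the settings reported above (20 ppm and 200 Da).

```python
PROTON = 1.007276  # Da, mass of a proton


def neutral_mass(mz, charge):
    """Monoisotopic neutral mass of a precursor observed at m/z with the given charge."""
    return charge * (mz - PROTON)


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def decoy_is_acceptable(decoy_mass, decoy_glycan_mass, precursor_mz, charge,
                        target_peptide_mass, tol_ppm=20.0, peptide_variation=200.0):
    """Apply both windows: total mass within tol_ppm, peptide mass within peptide_variation."""
    precursor = neutral_mass(precursor_mz, charge)
    within_tolerance = abs(ppm_error(decoy_mass, precursor)) <= tol_ppm
    # Peptide portion = total monoisotopic mass minus the glycan mass.
    decoy_peptide_mass = decoy_mass - decoy_glycan_mass
    within_variation = abs(decoy_peptide_mass - target_peptide_mass) <= peptide_variation
    return within_tolerance and within_variation


# Hypothetical numbers, for illustration only.
print(decoy_is_acceptable(3594.50, 1500.0, 1199.17, 3, 2149.97))
```

The narrow ppm window keeps every decoy isobaric with the precursor, while the much wider peptide window leaves room for many distinct decoy sequences.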
Figure 1
(A) Graphical
user interface (GUI) of the GlycoPep Evaluator (GPE)
program. (B) The result of decoy generation completed by GPE that
contains the input target glycopeptide as well as 20 decoy glycopeptide
sequences generated by the program.
Once the required parameters are submitted to generate decoy
glycopeptides,
GPE will present the result page where 20 output decoys are listed,
as exemplified in Figure 1B. Several requirements
are met by GPE in producing the decoy glycopeptide candidates: First,
the decoy ends with either Arg or Lys on its C-terminus; second, the
missed cleavages on the decoy sequence must not exceed the number
of maximum missed cleavages specified by the user; third, the decoy
contains a consensus sequence, Asn-X-Ser/Thr (X is any random amino
acid, excluding proline), with the Asn being the glycosylation site;
fourth, the peptide portion of the decoy has a mass that is within
a user-specified range (termed “peptide variation”)
from the peptide mass of the target glycopeptide; finally, the glycan
portion of the decoy is assigned a mass that makes the m/z of the entire decoy within the user-specified
mass tolerance of the precursor ion’s m/z, and the glycan mass value is appended to the glycosylated
Asn as a modification of mass in the output of the decoy glycopeptide
(Figure 1B).
Following these rules, the
generated decoy glycopeptide can closely
mimic the target glycopeptide in terms of the glycosylation site,
protease specificity, and the approximate peptide length. On the other
hand, 20 decoy glycopeptides of distinct sequences and varying glycan
locations are produced for every single target glycopeptide, as demonstrated
in Figure 1B, thus providing a sufficient number
of decoy candidates that can compete with the target glycopeptides
in the scoring by a software tool.
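The five requirements listed above can be turned into a simple rejection-sampling generator. The Python sketch below is only an illustration of the rules, not the algorithm implemented in GPE (which is written in Java): the random sampling strategy, the length window of ±3 residues, the function names, and the choice to set the glycan mass to exactly the difference between the precursor mass and the decoy peptide mass are all assumptions.

```python
import random

# Standard monoisotopic residue masses (Da); Cys is carbamidomethylated (assumed).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 160.03065, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def peptide_mass(seq):
    return sum(RESIDUE[aa] for aa in seq) + WATER


def missed_cleavages(seq):
    """Internal K/R not followed by P (assumed trypsin rule)."""
    return sum(1 for i in range(len(seq) - 1) if seq[i] in "KR" and seq[i + 1] != "P")


def make_decoy(target_peptide, precursor_mass, rng, max_missed=2,
               peptide_variation=200.0, max_tries=100000):
    """Sample random sequences until one satisfies all five decoy rules."""
    amino_acids = list(RESIDUE)
    target_mass = peptide_mass(target_peptide)
    for _ in range(max_tries):
        length = max(4, len(target_peptide) + rng.randint(-3, 3))
        seq = [rng.choice(amino_acids) for _ in range(length)]
        seq[-1] = rng.choice("KR")                    # rule 1: tryptic C-terminus
        site = rng.randrange(0, length - 3)           # sequon must end before the C-terminus
        seq[site] = "N"                               # rule 3: N-X-S/T, X is not Pro
        seq[site + 1] = rng.choice([aa for aa in amino_acids if aa != "P"])
        seq[site + 2] = rng.choice("ST")
        seq = "".join(seq)
        if missed_cleavages(seq) > max_missed:        # rule 2
            continue
        mass = peptide_mass(seq)
        if abs(mass - target_mass) > peptide_variation:  # rule 4
            continue
        glycan = precursor_mass - mass                # rule 5: decoy is isobaric by construction
        return seq, site, glycan
    raise RuntimeError("no decoy found within max_tries")


def generate_decoys(target_peptide, precursor_mass, n=20, seed=0):
    """Collect n decoys with distinct sequences for one target glycopeptide."""
    rng = random.Random(seed)
    decoys = {}
    while len(decoys) < n:
        seq, site, glycan = make_decoy(target_peptide, precursor_mass, rng)
        decoys.setdefault(seq, (site, glycan))
    return [(seq, site, glycan) for seq, (site, glycan) in decoys.items()]


def format_decoy(seq, site, glycan):
    """Append the glycan mass to the glycosylated Asn as a mass modification."""
    return f"{seq[:site + 1]}[+{glycan:.4f}]{seq[site + 1:]}"


target = "DGGEDNKTEEIFRPGGGNMK"
precursor = peptide_mass(target) + 1444.53385  # peptide + [Hex]3[HexNAc]4[Fuc]1
for decoy in generate_decoys(target, precursor)[:3]:
    print(format_decoy(*decoy))
```

Because the glycan mass is written as a numerical modification rather than a monosaccharide composition, any peptide mass inside the peptide-variation window can be made isobaric with the precursor.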
False Discovery Rate Analysis
The false discovery rate
(FDR) is, by definition, the percentage of accepted peptide-spectral
matches that are incorrect.[28] When decoys
are included in database searching, the incorrect matches are comprised
of a proportion of the target matches as well as all the decoy matches.
The latter are used to estimate the number of target matches that
are incorrect. As such, FDR is calculated by the following equation:
$$\mathrm{FDR} = \frac{N_{\mathrm{ic}} + N_{\mathrm{d}}}{N_{\mathrm{total}}} \tag{1}$$
In the equation, Nic is the number of
incorrect assignments made to target candidates,
Nd is the number of decoy assignments, and Ntotal is the total number of accepted assignments.
Because both the incorrect target matches and the decoy matches
are made at random, the number of hits for incorrect target assignments
or decoy assignments is proportional to the number of the corresponding
target or decoy candidates scored by a program. Consequently, the
ratio of the number of incorrect target assignments to decoy assignments
is equal to the ratio of the number of target candidates to the number of decoy candidates:
$$\frac{N_{\mathrm{ic}}}{N_{\mathrm{d}}} = \frac{\text{number of target candidates}}{\text{number of decoy candidates}} \tag{2}$$
When eqs 1 and 2 are
combined, the FDR is determined by eq 3, in which Nic/Nd takes the value given by eq 2:
$$\mathrm{FDR} = \frac{N_{\mathrm{d}}}{N_{\mathrm{total}}}\left(1 + \frac{N_{\mathrm{ic}}}{N_{\mathrm{d}}}\right) \tag{3}$$
In a conventional
workflow, since an equal
number of decoy sequences are scored along with target sequences, Nic/Nd is 1. Therefore,
according to eq 3, the FDR is calculated by
doubling the number of decoy matches divided by the number of total
assignments. In our method, however, the target-to-decoy ratio is
1:20 rather than 1:1 because 20 decoy candidates are generated and
scored for each target, thus Nic/Nd is 0.05. Accordingly, FDR is determined by
eq 4:
$$\mathrm{FDR} = 1.05 \times \frac{N_{\mathrm{d}}}{N_{\mathrm{total}}} \tag{4}$$
Consequently,
using our method in which 20
decoy glycopeptides are created and tested for every target glycopeptide
composition, the FDR can be measured accurately based on the number
of decoy matches and the number of total accepted assignments, as
formulated in eq 4.
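The calculation described in this section amounts to multiplying the fraction of decoy matches by one plus the target-to-decoy ratio. A minimal Python sketch is given below; the function name `estimated_fdr` and its parameter names are hypothetical, and the function is an illustration of the formula rather than code taken from GPE.

```python
def estimated_fdr(n_decoy_matches, n_total_assignments, decoys_per_target=20):
    """FDR = (Nd / Ntotal) * (1 + target-to-decoy ratio).

    With decoys_per_target=1 this reduces to the conventional doubling of
    decoy matches; with 20 decoys per target the multiplier is 1.05.
    """
    if n_total_assignments <= 0:
        raise ValueError("n_total_assignments must be positive")
    if decoys_per_target <= 0:
        raise ValueError("decoys_per_target must be positive")
    target_to_decoy_ratio = 1.0 / decoys_per_target
    return n_decoy_matches / n_total_assignments * (1.0 + target_to_decoy_ratio)


# One decoy match among 77 assignments, with 20 decoys per target.
print(f"{estimated_fdr(1, 77):.2%}")
# Conventional 1:1 workflow: 5 decoy matches among 100 assignments.
print(f"{estimated_fdr(5, 100, decoys_per_target=1):.2%}")
```

Raising the number of decoys per target shrinks the correction term, so the estimate depends almost entirely on the directly observed decoy matches rather than on an extrapolated number of incorrect target matches.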
Target and Decoy Glycopeptides Analysis
Apart from
generating decoy glycopeptide candidates de novo, GPE was also implemented
with an algorithm that we developed to process and score ETD data
of N-linked glycopeptides.[24] After a list of decoy candidates is generated by GPE, the user
can load raw ETD data to the program and specify the MS/MS scan range;
GPE can score all the decoy candidates as well as the target glycopeptide
compositions against the input MS/MS data. For every glycopeptide
composition, GPE evaluates the match of different ion series (c, z,
and y-ions) to the processed ETD data and assigns a final score to
each candidate, as described in the algorithm published with ref (24). The decoy glycopeptides
can then be sorted from high score to low and be compared with the
scores of target glycopeptides.
To demonstrate the functionality
of GPE, we present, below, the CID and ETD spectra of a known glycopeptide
and show how GPE would process the ETD data, score the spectrum,
and then additionally calculate scores for decoy assignments. Figure 2A shows the ETD data of a glycopeptide from HIV gp140
that has a composition of DGGEDNKTEEIFRPGGGNMK
+ [Hex]3[HexNAc]4[Fuc]1 (where the first Asn, within the N-K-T sequon, is
the glycosylation site). In the ETD spectrum, c-ions (c4-c5) and z-ions (z8-z13) are observed
that can be used to determine the glycopeptide’s sequence,
as shown in the figure. Additionally, the CID data in Figure 2B further confirm that the precursor ion is a glycopeptide
peak, because glycan oxonium ions are present at m/z 366 and 528. Moreover, by assigning monosaccharide
losses, including losses of Hex, HexNAc and glycan dissociation patterns
in CID, the glycan portion of the glycopeptide can be deduced to be
[Hex]3[HexNAc]4[Fuc]1. It is noteworthy that, although CID data are
utilized to verify glycopeptide assignments, in our method, we did
not implement CID fragmentation rules in the scoring function, and
only ETD data should be submitted to GPE for appropriate FDR analysis.
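The oxonium ion values quoted above can be checked with a short calculation. The Python sketch below assumes that each oxonium ion is a singly protonated glycan fragment whose m/z equals the sum of its monosaccharide residue masses plus one proton; the function name `oxonium_mz` is hypothetical, and the assignment of m/z 366 and 528 to HexHexNAc and Hex2HexNAc is the conventional interpretation rather than a statement taken from this section.

```python
# Standard monoisotopic monosaccharide residue masses (Da).
SUGAR = {"Hex": 162.05282, "HexNAc": 203.07937, "Fuc": 146.05791, "Neu5Ac": 291.09542}
PROTON = 1.007276


def oxonium_mz(composition):
    """m/z of a singly charged oxonium ion: residue masses plus one proton."""
    return sum(SUGAR[name] * count for name, count in composition.items()) + PROTON


def glycan_mass(composition):
    """Mass added to the peptide by a glycan of the given composition."""
    return sum(SUGAR[name] * count for name, count in composition.items())


print(round(oxonium_mz({"Hex": 1, "HexNAc": 1}), 2))   # HexHexNAc
print(round(oxonium_mz({"Hex": 2, "HexNAc": 1}), 2))   # Hex2HexNAc
print(round(glycan_mass({"Hex": 3, "HexNAc": 4, "Fuc": 1}), 4))
```

On a linear ion trap these ions are reported at nominal mass, which is why the text lists them as m/z 366 and 528.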
Figure 2
(A) ETD-MS/MS
data of an HIV gp140 glycopeptide that has a core-fucosylated
biantennary complex-type glycan as shown in the figure. The peptide
backbone fragment ions (c- and z-ions) are labeled. (B) CID data of
the same glycopeptide in (A). Extensive dissociation at the glycan
portion is observed in CID; product ions containing partially cleaved
glycans and intact peptide sequences are present in the data. This
figure is an example of how spectra were manually assigned, prior
to testing of GPE. Please note: generally the glycan composition,
but not the structure, is confirmed.
To demonstrate that the glycopeptide composition described
above
can be correctly assigned by GPE, the true glycopeptide composition,
along with four isobaric glycopeptide “mock” candidates,
were entered into GPE as potential target glycopeptide candidates.
GPE then generated 20 decoy glycopeptides per target. The ETD data
were subsequently submitted to the software, and all the candidates
(including decoys) were scored by GPE. A total of 100 decoy glycopeptide
compositions were created by GPE for the 5 target glycopeptides, and
each decoy has its distinct sequence and glycosylation site. The true
glycopeptide composition, its 20 decoys, and the associated scores
are shown in Figure 3; the remaining 4 targets,
their 80 decoys, and their scores are shown in the Supporting Information, Table 1. The correct glycopeptide
composition, labeled as target in Figure 3,
receives the highest score of 61.7, which is significantly higher
than the score of any other candidate, including the other 4 target
and 100 decoy glycopeptides. By contrast, none of the other 4 input
target glycopeptides (which are incorrect candidates but still considered
“targets” for the purposes of this demonstration)
outscores the best decoy glycopeptide sequences generated by GPE. While
at least one of the 20 decoys in each of these sets outscores the incorrect
“target” candidate, the overall highest-scoring
decoy, with a score of 17.8, does not outscore the true assignment.
(Additional data are shown in Supplementary Table 1.) Therefore, the
first glycopeptide candidate, which is also the manually verified
correct assignment, is assigned to the ETD data by GPE, even when
four other incorrect candidates and 100 decoys are scored in parallel.
This example shows how to use GPE: Correct and incorrect target glycopeptides
can be readily differentiated by including a sufficient number of
decoy glycopeptides in the scoring process, which are generated by
GPE in an automated fashion.
Figure 3
For the input glycopeptide composition (labeled
as target) DGGEDNKTEEIFRPGGGNMK + [Hex]3[HexNAc]4[Fuc]1,
20 decoy glycopeptide
compositions were generated by GPE. Subsequently, GPE scored both
the target and decoy glycopeptides against the ETD data, and they
were ranked from high to low score as shown in this figure. The target
glycopeptide, which is also the correct assignment, received the highest
score of 61.7, outscoring all the other candidates. The scoring results
of the other four incorrect glycopeptide candidates are summarized
in Supporting Information, Table 1.
Is GPE Consistently Able to Identify the Correct Candidate, When It Is Present?
The above example illustrates that GPE
can be used to effectively identify a correct target candidate among
a large list of incorrect glycopeptides. To determine how consistently
GPE could generate these kinds of successful results, we tested a
larger data set. We employed GPE in analyzing a glycopeptide data
set that contains ETD data of 77 distinct glycopeptides generated
from multiple proteins (fetuin, IgG, HIV gp140, etc.). In these cases,
all 77 spectra were manually assigned using the same procedure described
above. After determining the correct assignment for each spectrum,
four other (incorrect) “target” assignments were also
generated. The software assigned 76 of the 77 MS/MS spectra to the
correct glycopeptide compositions, demonstrating that the approach
can consistently return the correct result, even when 20 decoys per
candidate are scored. These results are expected when a high-quality
algorithm is used for scoring glycopeptides, such as the one used
in GPE, and the spectra are of high enough quality such that manual
assignment is possible.
Is GPE Effective at Identifying Misassigned Spectra?
We next tested whether GPE can flag
target glycopeptides as incorrect when the true candidates are
absent from the target list. The correct glycoprotein sequences that
generated the ETD data were excluded from the search of target glycopeptide
compositions, so that all the target glycopeptides were incorrect
glycopeptide candidates from Titin. After the incorrect targets were
input into GPE, they were scored along with 20 decoys per target.
Only four out of the 77 ETD spectra were matched to the target glycopeptides
that are incorrect, whereas 73 spectra were assigned to decoy glycopeptides.
Therefore, the ratio of incorrect target matches to decoy matches, Nic/Nd, is 0.055
(4/73) in this case. This value is very close to the target-to-decoy
ratio of 0.05 (1/20).
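The closeness of these two numbers can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that when no correct candidate is present every target and decoy is equally likely to receive the top score; the function names are hypothetical and are not part of GPE.

```python
from fractions import Fraction

DECOYS_PER_TARGET = 20  # the 1:20 target-to-decoy ratio described in the text


def observed_ratio(incorrect_target_matches: int, decoy_matches: int) -> float:
    """Ratio of incorrect target matches to decoy matches (Nic/Nd)."""
    return incorrect_target_matches / decoy_matches


def expected_counts(n_spectra: int, decoys_per_target: int = DECOYS_PER_TARGET):
    """Expected (target, decoy) match counts.

    Assumption (not from the paper): each of the 1 + decoys_per_target
    candidates per target is equally likely to be the top-scoring match.
    """
    p_target = Fraction(1, 1 + decoys_per_target)
    expected_targets = float(n_spectra * p_target)
    return expected_targets, n_spectra - expected_targets


if __name__ == "__main__":
    print(f"observed Nic/Nd       = {observed_ratio(4, 73):.3f}")
    print(f"target-to-decoy ratio = {1 / DECOYS_PER_TARGET:.3f}")
    targets, decoys = expected_counts(77)
    print(f"expected for 77 spectra: {targets:.2f} target, {decoys:.2f} decoy matches")
```

Under that equal-likelihood assumption, roughly 3.7 of the 77 spectra would be expected to match an incorrect target, which is consistent with the 4 observed.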
Comparison of the Predicted FDR to the True FDR
Using the data above, we evaluated the true FDR for our data set of 77 spectra
compared to the FDR that would be predicted by eq 4. When the correct glycopeptide compositions are included
in the test, as mentioned above, 1 out of 77 assignments is a decoy
match, and the FDR, according to eq 4, is predicted
to be 1.36%. The actual FDR that is observed, on the other hand, is
the number of incorrect assignments divided by the total assignments.
In this case, only the decoy assignment is incorrect and the other
76 assignments are correct, so the observed FDR is 1.30% (1/77), which
is closely approximated by the predicted FDR value. On the other hand,
when the correct glycopeptide sequences are excluded from the target
list, 73 of 77 assignments are decoy matches, which leads to a calculated
FDR of 99.55%. (This calculation is done using eq 4: (73/77) × 1.05 = 0.995.) The actual FDR is 100% since
all the assignments are incorrect. In both circumstances, the predicted
FDRs are very close to the observed FDRs.

To further test if
FDR values for small data sets can be accurately determined by our
method, a proportion of the 77 ETD spectra were randomly selected,
and for those spectra, the correct glycoprotein sequences were excluded
for generating target glycopeptide candidates. For the remaining spectra,
the correct glycoproteins were included in the generation of target
compositions. Subsequently, GPE was employed to score each ETD spectrum
against the corresponding target glycopeptides, and the number of
decoy assignments was used to calculate the FDR based on eq 4. The experiment was conducted for 12 different cases, in
which 0, 3, 5, 10, 20, 30, 40, 50, 60, 70, 73, or 77 of the 77 correct
glycopeptide sequences were randomly excluded when their respective
spectra were being scored. In this way, different numbers of incorrect
assignments for the ETD data set were generated, and the predicted
FDR using our method can be compared to the observed FDR at different
levels. The comparison of the calculated versus observed FDRs for
the 77 tested ETD spectra is illustrated in Figure 4A, where a correlation curve is fitted to the blue data
points. The least-squares fitting line has a slope that only deviates
slightly from unity, and the curve has good linearity (R² above 0.99). These data demonstrate that for glycopeptide
data sets with a wide range of FDRs (from 1.3% to 100%),
the FDR values can be determined accurately using GPE and the method
that we developed.
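The two predicted FDRs quoted in this section can be reproduced with a short calculation. Eq 4 itself is defined earlier in the paper; this sketch assumes it has the form implied by the worked calculation (73/77) × 1.05, namely the fraction of decoy matches multiplied by (1 + target-to-decoy ratio). The function names are illustrative only.

```python
def predicted_fdr(decoy_matches: int, total_assignments: int,
                  target_to_decoy_ratio: float = 1 / 20) -> float:
    """Predicted FDR, assuming eq 4 is (Nd / N) * (1 + target-to-decoy ratio)."""
    return decoy_matches / total_assignments * (1 + target_to_decoy_ratio)


def observed_fdr(incorrect_assignments: int, total_assignments: int) -> float:
    """Observed FDR: incorrect assignments divided by all assignments."""
    return incorrect_assignments / total_assignments


if __name__ == "__main__":
    # Correct candidates included: 1 decoy match, 76 correct target matches.
    print(f"predicted {predicted_fdr(1, 77):.2%}, observed {observed_fdr(1, 77):.2%}")
    # Correct candidates excluded: 73 decoy matches and 4 incorrect target
    # matches, so all 77 assignments are incorrect.
    print(f"predicted {predicted_fdr(73, 77):.2%}, observed {observed_fdr(77, 77):.2%}")
```

The (1 + 1/20) factor accounts for the incorrect target matches that accompany the decoy matches, which is why the prediction sits slightly above the raw decoy fraction.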
Figure 4
Lines that are fitted based on the blue data points: correlation
curves between the predicted FDR values calculated using our method
versus the observed FDR values that are manually verified. Lines that
are fitted based on the red data points: correlation curves between
the FDRs calculated by the common approach where an equal number of
decoys are tested with the targets versus the observed FDRs that are
verified manually. The FDRs are based on the analysis of ETD data
sets of (A) 77 distinct glycopeptides and (B) 35 distinct glycopeptides
using the GPE program.
In glycopeptide-based
identifications, the MS/MS data set is frequently
of a small size, and a robust method needs to be able to determine
the FDRs for these types of data. To build a smaller glycopeptide
data set, we randomly selected 35 ETD spectra from the entire data
set, and performed the same experiment as described above, to test
whether using our method, the FDRs at different levels can be measured
with high accuracy for this limited size data set. The result is shown
in Figure 4B where the correlation curve is
fitted based on the blue data points; the best-fitting line between
the predicted and observed FDRs has a slope that is close to 1, with R² still above 0.99. Therefore, these experiments
show that the developed method is accurate in measuring the FDRs
in glycopeptide identifications, even for small glycopeptide data
sets.

Finally, the accuracy of our method in predicting the FDRs was
compared to that of the common approach where an equal number of decoy
glycopeptides were tested with the target glycopeptides. For the same
two data sets described above, correlation curves comparing the predicted
versus observed FDRs, when using a 1:1 target-to-decoy ratio, are
also shown in Figure 4. In this experiment,
an equal number of decoy glycopeptides were generated by GPE based
on the target candidates, and both the decoy and target glycopeptides
were analyzed in the same way as described previously. These data
are shown in red. For the 77 tested ETD spectra, the R² of the curve is below 0.99, and the slope of the curve
(0.83) deviates significantly from 1 (Figure 4A). Furthermore, using the conventional approach, the correlation
between the predicted and observed FDRs becomes much worse when the
size of the data set decreases, as evidenced by the correlation curve
in Figure 4B, which has an R² of only 0.90 and a flat slope of 0.58. The slopes of the curves
reflect the ratio of predicted FDRs to true FDRs, and the values,
which are significantly less than 1, indicate that the number of false
positive assignments would be considerably underestimated using the
conventional approach. By contrast, the FDRs are predicted accurately
using our method, especially under circumstances where only a small
glycopeptide data set is available.
Conclusion
The false discovery rate (FDR) is an important measurement of the confidence
of glycopeptide assignments when MS/MS data of glycopeptides are analyzed.
In order to accurately determine the FDR of glycopeptide identifications,
we developed a software program, GlycoPep Evaluator, to generate abundant
decoy glycopeptide compositions and to score the target and decoy
glycopeptide candidates in measuring the FDR. The target-to-decoy
ratio is 1:20 so that, even for a small number of target glycopeptide
sequences, sufficient decoy glycopeptides are available for scoring;
hence, false-positive identifications can be better contained. Moreover,
FDRs can be measured with high accuracy using GPE for small data sets,
which are commonly seen in glycoproteomics where tens to hundreds
of spectra are scored, as opposed to thousands of spectra scored in
a proteomics experiment. GPE's generation of decoy glycopeptide
candidates can be combined with any other data
analysis tool that scores ETD data of glycopeptides, so that FDRs
can be accurately determined.