Wai-Kok Choong1, Ting-Yi Sung1. 1. Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan.
Abstract
Adopting proteogenomics approach to validate single nucleotide variation events by identifying corresponding single amino acid variant peptides from mass spectrometry (MS)-based proteomics data facilitates translational and clinical research. Although variant peptides are usually identified from MS data with a stringent false discovery rate (FDR), FDR control could fail to eliminate dubious results caused by several issues; thus, postexamination to eliminate dubious results is required. However, comprehensive postexaminations of identification results are still lacking. Therefore, we propose a framework of three bottom-up levels, peptide-spectrum match, peptide, and variant event levels, that consists of rigorous 11-aspect examinations from the MS perspective to further confirm the reliability of variant events. As a proof of concept and showing feasibility, we demonstrate 11 examinations on the identified variant peptides from an HEK293 cell line data set, where various database search strategies were applied to maximize the number of identified variant PSMs with an FDR <1% for postexaminations. The results showed that only FDR criterion is insufficient to validate identified variant peptides and the 11 postexaminations can reveal low-confidence variant events detected by shotgun proteomics experiments. Therefore, we suggest that postexaminations of identified variant events based on the proposed framework are necessary for proteogenomics studies.
Adopting proteogenomics approach to validate single nucleotide variation events by identifying corresponding single amino acid variant peptides from mass spectrometry (MS)-based proteomics data facilitates translational and clinical research. Although variant peptides are usually identified from MS data with a stringent false discovery rate (FDR), FDR control could fail to eliminate dubious results caused by several issues; thus, postexamination to eliminate dubious results is required. However, comprehensive postexaminations of identification results are still lacking. Therefore, we propose a framework of three bottom-up levels, peptide-spectrum match, peptide, and variant event levels, that consists of rigorous 11-aspect examinations from the MS perspective to further confirm the reliability of variant events. As a proof of concept and showing feasibility, we demonstrate 11 examinations on the identified variant peptides from an HEK293 cell line data set, where various database search strategies were applied to maximize the number of identified variant PSMs with an FDR <1% for postexaminations. The results showed that only FDR criterion is insufficient to validate identified variant peptides and the 11 postexaminations can reveal low-confidence variant events detected by shotgun proteomics experiments. Therefore, we suggest that postexaminations of identified variant events based on the proposed framework are necessary for proteogenomics studies.
Single amino acid variations (SAVs) in proteins could affect protein
folding, protein–protein interaction, and protein domain functionality
that could cause diseases or cancers.[1−5] Moreover, somatic SAVs in tumor samples may become a neoantigen
for developing personalized therapeutic vaccines for cancers.[6,7] To identify SAVs at the proteomic level, a proteogenomics approach
is usually applied, which utilizes the single nucleotide variant (SNV)
information derived from next-generation sequencing of genomic or
transcriptomic data to generate SAV-harboring protein sequences and
identify SAV variant peptides from mass spectrometry (MS)-based proteomics
data.[8−11] To obtain confidently identified variant peptides, the target–decoy
database search approach is commonly applied to estimate a false discovery
rate (FDR) to filter out unconfident identification.Although
variant PSMs are obtained from a rigorous identification
procedure passing an FDR of 1%, it is mentioned in the literature
that we must still check their reliability to avoid false positives.[10−13] For example, database searches may yield false-positive PSMs, which
still pass the FDR threshold because an incomplete protein sequence
database is used for searches.[10] Or because
possible modifications are neglected in the search parameters,[14−16] spectra of modified peptides can be incorrectly identified as other
peptide sequences. Furthermore, considering 11 types of isobaric substitutions
such as isoleucine/leucine substitution and delaminated-glutamine/glutamate
substitution, more than 6% of variant peptides in neXtProt can be
interpreted as wild-type peptides of PE1 proteins, i.e., human proteins
with experimental evidence at the protein level as classified by neXtProt.[17] Thus, it is necessary to further examine the
reliability of identified variant peptides.Several criteria
or methods for postexamination of identified variant
peptides have been reported to eliminate dubious results. For instance,
the claimed variant peptides should not appear in any major reference
protein sequence database such as Ensembl and RefSeq[18,19] nor they match any modification or isobaric substitution of a wild-type
peptide.[10,12,13,17] Three bioinformatics tools—SpectrumAI,[20] SAVControl,[21] and
PepQuery[22]—have been developed to
validate the quality of claimed variant peptides. SpectrumAI and SAVControl
use different concepts to validate the reliability of a variant peptide
at the site level. SpectrumAI examines whether the mass difference
of the ions flanking both sides of the variant site on the MS/MS spectrum
equals the mass of the amino acid variant. SAVControl, in contrast,
adopts mass shift relocation to assess the reliability of the variant
site, where the mass shift, defined as the variant peptide precursor
mass minus the wild-type peptide mass, is relocalized on each position
of the wild-type peptide sequence to confirm that the mass shift localizing
at the variant site has the highest matching probability. PepQuery
adopts a peptide-centric strategy, which takes the user’s preselected
variant peptide sequences as input to retrieve the highest-scoring
variant peptide–spectrum matches (PSMs) and then performs unrestricted
modification searching with all of the modifications from Unimod to
confirm that the variant PSMs have no other interpretation.Identifying and confirming variant events from MS data sets is
a critical process for proving that SNVs or SAVs exist in biosamples.
From the perspective of MS-based qualitative proteomics, the procedure
of variant event identification is highly similar to the protein identification
procedure, which is conducted at three hierarchical levels—the
PSM, peptide, and protein levels—to filter out false-positive
identification. Furthermore, when publishing protein identification
results of large-scale MS-based proteomics data, journals generally
require that the reliability of identification results at each of
the aforementioned levels be reported; for example, for special issues
of the Human Proteome Project in Journal of Proteome Research, relevant
requirements are mentioned in MS-based data interpretation guidelines.[23,24] However, comprehensive guidelines that account for similar hierarchical
levels to eliminate dubious results of variant event identification
are still lacking.In this paper, we propose a framework around
three bottom-up levels—PSM,
peptide, and variant event—for the postexamination of identified
variant events. In this framework, rigorous 11-aspect examinations
at the three levels, i.e., five PSM-level, two peptide-level, and
four variant event-level examinations, are proposed to further confirm
the reliability of identified variant peptides. To demonstrate these
examinations, we first conducted a comprehensive variant peptide identification
study on the HEK293 cell line to acquire the maximized number of identified
variant peptides. Then, only the identified variant PSMs and their
variant peptides, not the whole MS data set, were further evaluated
for the reliability of the variant events by the postexaminations.
Our results reveal that variant peptide identification that only passes
the FDR threshold is insufficient for ensuring authenticity, and these
examinations can reveal low-confidence variant events. Thus, we suggest
using the proposed examination methods to further examine the reliability
of variant events at the PSM, peptide, and variant event levels.
Materials and Methods
MS/MS Data Set and Variant
Information of
the HEK293 Cell Line
An MS/MS data set (a total of 24 .raw
data files, PXD001468) of the HEK293 cell line was downloaded from
the PRIDE[25] (PRoteomics IDEntifications)
database. The data set was acquired using a Q-Exactive Orbitrap spectrometer
(Thermo Fisher Scientific, San Jose, CA) with higher-energy collisional
dissociation for peptide fragmentation. Detailed information about
sample collection, experimental preparation, and MS conditions was
reported in Chick et al.[26] We converted
the MS/MS files from .raw to .mgf using MSConvert (ProteoWizard 3.0.11110
64-bit).[27]We obtained 1336 genome-annotated
variant information of the HEK293 cell line from the supplementary
data (Supplementary Data 2; sheet name: “Known common homozygous
SNP”) in Lin et al.[28] Since we used
the Swiss-Prot database for sequence database searching but the variant
information was provided with RefSeq identifiers, we filtered out
any variants in the proteins with RefSeq identifiers unmatched to
Swiss-Prot identifiers. This yielded 1123 single amino acid variants
for this study. We henceforth use the variant to refer to “single
amino acid variant”. Note that this pair of genomic and proteomics
data sets of HEK293 cell line was used for proof of concept of the
proposed postexaminations on the identified variant peptides, regardless
of the complexity of their source samples.
Construction
of a Customized Target–Decoy
Protein Database
The target protein sequence database used
for database searches of the HEK293 cell line MS data set consisted
of the following entries: (1) 42 197 human protein sequences,
including isoforms, from the UniProt database (ver. 201707, human),[29] (2) 48 sequences of contaminants from the cRAP
(common Repository of Adventitious Proteins) database,[30] (3) 35 sequences of the human adenovirus C serotype
5 (HAdV-5) proteome from UniProt (ver. 201707), and (4) 1123 sequences
including HEK293 genome-annotated variants. The 1123 variant protein
sequences were generated by integrating exactly one variant into its
wild-type protein sequence, i.e., the 1123 variants yielded 1123 variant
protein sequences. The resulting target sequence database is called
RefP_V. All of the sequences in RefP_V were reversed to generate decoy
protein sequences, which were then concatenated with all of the target
sequences for target–decoy searches to estimate the FDR for
variant peptide identification.[31]
Database Search Types and FDR Estimation Approaches
for Variant Peptide Identification
Protein database searches
are usually performed using search engine(s) for shotgun proteomics
data analyses. Similarly, in proteogenomics analyses, either results
from a single search engine (SSe) or combined search results from
multiple search engines (CMSe) are coupled with global FDR (gFDR)
or class-specific FDR (cFDR) estimation for variant peptide identification.
Spectra matched to wild-type peptides of reference proteins, termed
wild-type peptide–spectrum matches (PSMs), and spectra matched
to variant peptides, termed variant PSMs, were combined for 1% gFDR
estimation at the PSM level using a statistical validation tool for
FDR estimation, called MAYU.[32] In contrast,
cFDR estimation used only the variant PSMs and was defined as the
number of decoy variant hits divided by the number of target variant
hits above a discrimination score threshold to determine the 1% FDR
at the PSM level. To be specific, all of the identified target and
decoy variant PSMs were ranked in a decreasing order of their discrimination
scores calculated by the search engine. The discrimination score threshold
was determined based on the above-mentioned cFDR estimation that reaches
1% FDR. In this paper, we consider FDR at the PSM level only; for
convenience, we simply term this FDR.We implemented the above
four different combinations of search types and FDR calculations,
i.e., SSe-gFDR, SSe-cFDR, CMSe-gFDR, and CMSe-cFDR, using the Comet,[33] MS-GF+,[34] and X!Tandem[35] search engines on the HEK293 cell line data
set for protein and variant peptide identification. The workflow is
shown in Figure S1. For SSe searching,
each of the three search engines was individually used to search against
RefP_V database, and its search results were further processed by
PeptideProphet,[36] followed by iProphet[37] and then by MAYU. Adopting the SSe-gFDR strategy,
we manually extracted target variant PSMs from MAYU-validated PSMs
that passed a gFDR of 1% as the result of variant peptide identification.
For the SSe-cFDR strategy, we used four scores, including the search
score and E-value from the respective search engine,
and PeptideProphet and iProphet probabilities of variant PSMs to evaluate
a cFDR of 1% and then manually retrieved all of the identified variant
peptides from variant PSMs with an FDR of 1% based on any of the four
scores.For CMSe searching, search results from each search
engine were
further processed by PeptideProphet, and the PeptideProphet results
from the three search engines were combined by iProphet. Particularly
for the CMSe-cFDR strategy, the iProphet probability of variant PSMs
was used as the discrimination score for estimating cFDR to acquire
PSMs passing a cFDR of 1%, from which we manually extracted the identified
variant peptides. In contrast, adopting the CMSe-gFDR strategy, we
further used MAYU to process the combined results from iProphet for
validation at a gFDR of 1% and manually extracted variant peptides
from the PSMs passing MAYU’s validation. Identified variant
PSMs with an FDR <1% are termed variant PSMs for convenience; similarly,
the spectra in the PSMs are termed variant spectra.
Database Search Parameters
The following
search parameters were used in the three search engines: a precursor
mass tolerance of ±10 ppm, a fragment mass tolerance of ±0.01
Da, carbamidomethylation of cysteine as a fixed modification, and
oxidation of methionine and acetylation of protein N-term as variable
modifications. The parameters used for PeptideProphet and MAYU were
“-OpdEAP-PPM” and “-P mFDR = 0.01:t -G 0.01 -H
51 -I 2”, respectively. In this study, trypsin was considered
the protease for digestion.
Overview of 11-Aspect Examinations
of Identified
Variant Events at Three Bottom-Up Levels
In this study, we
propose a bottom-up trilevel framework of postexaminations of identified
variant events from MS spectra at the variant PSM, peptide, and event
levels, in which 11-aspect examinations are involved, as shown in Figure S2. First, the foundation PSM level enhances
the peptide–spectrum match results via the following five examinations:
(1) open modification search, (2) explosive search, (3) combined open
modification and explosive search, (4) de novo peptide sequencing,
and (5) similarity between a variant spectrum and the predicted spectrum
of the corresponding variant peptide. With the first four examinations,
we seek to detect dubious PSMs caused by searching an incomplete protein
sequence database or neglecting possible modifications; they are designed
with increasing search spaces of peptides and modifications, i.e.,
(1) < (2) < (3) < (4). The last examination compares the
fragment ion peaks by checking the similarity between an identified
variant spectrum and the predicted spectrum of the identified variant
peptide obtained by an MS/MS peak intensity prediction tool because
database search tools usually ignore fragment ion intensities in a
spectrum.The middle peptide level disambiguates variant peptide
sequences by examining (1) isobaric substitutions and semitryptic
cleavage and (2) spectral counting. Variant peptides without the possibility
of isobaric substitution or semitryptic cleavage and with multiple
PSMs are more reliable. Variant peptides with multiple PSMs are more
reliable than those with single PSM.Lastly, the variant event
level at the top confirms variant events
by (1) checking for the occurrence of two consecutive b-ions or y-ions identifying the amino acid variant,
(2) checking for the existence of an identified wild-type counterpart
peptide, (3) checking that its parental protein is identified, and
(4) checking the variant peptide location in the protein when the
SAV involves lysine (K) or arginine (R), or proline (P) after K or
R.
PSM-Level Examination
Open
Modification Search Examination
When using conventional database
search tools for searches, to avoid
explosions of search space and search time, proteomics researchers
usually set only a few modifications in the database search process
on an MS/MS data set. However, more than a hundred types of in vivo
and in vitro modifications are reported in the Unimod database,[38] which may possibly exist in the samples of MS
experiments. To account for false-positive variant PSMs caused by
neglecting possible modifications in database searches, we utilized
three open modification search (OMS) tools—PIPI,[39] MSFragger,[40] and
SpecOMS[41]—with unrestricted modification
types when searching against the RefP_V database. When a variant spectrum
is identified by an OMS tool as the corresponding variant peptide,
the variant PSM is regarded as reliable.PIPI and SpecOMS mainly
allocate the mass of an unexpected but possible modification to an
amino acid of a peptide and thus can pinpoint the exact modification
site on the peptide. MSFragger enlarges the precursor mass tolerance
to 500 Da (default setting) to identify modified peptides but reports
neither the modification site nor the modification type in the peptide
sequence. To conduct the search, each tool was configured with the
following parameters:PIPI: peptide tolerance = 10 ppm,
fragment ion tolerance = 0.01 Da, PTM mass tolerance = ±500 Da,
fixed modification = carbamidomethylation (+57), missed cleavages
= 2, protease = trypsin, minimum peptide length = 7, maximum peptide
length = 50, q-value = 0.01.MSFragger: peptide tolerance = 500
Da, fragment ion tolerance = 0.01 Da, fixed modification = carbamidomethylation
(+57), variable modifications = methionine oxidation and acetylation
of protein N-term, missed cleavages = 2, protease = trypsin.SpecOMS: minimum peptide
length =
7, maximum peptide length = 50, threshold = 6, max masses count =
100, minimum peptide charge = 1, maximum peptide charge = 7, number
of decimals = 2, decimal value = 4, decoy base = false.
Explosive Search Examination with the Human-Associated
Tryptic Peptide Database (SuperPep_V)
In contrast to open
modification searching against RefP_V with large PTM or peptide tolerance,
explosive search conducts searches against a huge human tryptic peptide
database containing peptides from complete human proteome and variants;
this database is called SuperPep_V and is much larger than the peptide
space of RefP_V. We examine the consistency between variant PSMs by
searching the RefP_V and the huge SuperPep_V databases.To construct
SuperPep_V for explosive search, we needed to select very comprehensive
protein or peptide sequence databases and variant databases. To the
best of our knowledge, the PeptideAtlas Mapping Database (PAmap)[42] and the dbSAP[43] database
provide comprehensive human protein sequences and single amino acid
variants, respectively. The PAmap database integrates proteomics-based
protein sequences (such as sequences in Swiss-Prot, TrEMBL, and neXtProt[44]), genomics-based protein sequences (such as
sequences in Ensembl and RefSeq), SAVs listed in neXtProt but excluding
those in COSMIC,[45] as well as human-associated
microbiome and nonhuman contaminant protein sequences. The dbSAP database
combines sequence variant annotations from public databases such as
dbSNP,[46] COSMIC, UniProt, HPMD,[47] MS-CanProVar,[48] and
Ensembl. Therefore, we collected the human and associated protein
sequences from the PAmap and dbSAP databases and subsequently concatenated
them with the RefP_V database. For this concatenated database, we
performed in silico trypsin digestion, following the cleavage specificity
rules (cleaving K or R not before P at the C-terminal) provided in
Keil’s rules[49] and allowing up to
two missed cleavage sites. After in silico digestion, all nonredundant
tryptic peptides, including tryptic variant peptides, of 7–50
amino acids were collected to construct the SuperPep_V database for
explosive search. As a result, SuperPep_V contained 77,551,699 unique
tryptic peptides, a file of 3.5 GB. Each tryptic peptide entry in
the database was recorded in the FASTA format, which is suitable for
database searches. Notably, compared with the peptide space of RefP_V,
the SuperPep_V database includes an approximately 30-fold increase
in the number of unique peptides. The peptide length distributions
of RefP_V and SuperPep_V are shown in Figure S3; the number of peptides of each length in SuperPep_V is at least
seven times as much as that in RefP_V. Thus, the SuperPep_V database
is suitable for explosive search examination.To perform explosive
search on the variant spectra obtained by
searching against RefP_V, we used Comet, MS-GF+, and X!Tandem to search
the spectra against the SuperPep_V database. The search parameters
of each search engine were the same as those described in Section . The search
results of the explosive search were then compared with the variant
PSMs obtained from searching against RefP_V. For the comparison, we
used the original search score, instead of E-value,
from each search engine for the following two reasons. First, the
peptide search spaces of SuperPep_V and RefP_V can affect the E-values of a spectrum–peptide pair. Second, the
original search score from a search engine represents the similarity
between a variant spectrum and a theoretical spectrum generated from
a specific peptide in the database and thus is unlikely to be changed
for a given spectrum–peptide pair regardless of which database
the peptide is from. For results from each search engine, we compared
the search scores and the matched peptides of variant PSMs. Originally
identified variant PSMs having explosive search supports were regarded
as reliable.
Examination of Combined
Open Modification
and Explosive Search
We further propose an examination of
OMS combined with explosive search, i.e., using the three OMS tools
(MSFragger, PIPI, and SpecOMS) to search against the SuperPep_V database
to examine the reliability of variant PSMs.Due to the large
file size of SuperPep_V (3.5 GB), some OMS tools could not successfully
perform searches caused by insufficient memory when using a computing
server with 64G RAM. We propose solving this problem using a strategy
of searching specific scans against specific FASTA to reduce the FASTA
file size but maintain a proper peptide search space. First, we classified
the variant spectra into four spectral groups (denoted as SG) based
on the length of their corresponding variant peptides: 7–15
amino acids (a.a.) (SG-A), 16–24 a.a. (SG-B), 25–35
a.a. (SG-C), and 36–44 a.a. (SG-D). Considering the 500 Da
precursor tolerance (approximately ± 5 a.a.) for open modification
searches, we divided SuperPep_V by peptide length into four FASTA
sets (denoted as PepS) as search spaces for spectra groups as follows:
7–20 a.a. (PepS-A for SG-A searching), 11–30 a.a. (PepS-B
for SG-B searching), 20–40 a.a. (PepS-C for SG-C searching),
and 30–50 a.a. (PepS-D for SG-D searching). Table S1 in the Supporting Information shows the numbers of
spectra and peptides in the four spectral groups and four peptide
FASTA sets, respectively. The search parameters of the three OMS tools
were the same as those described above in the OMS examination.
De Novo Peptide Sequencing Examination
De novo peptide
sequencing is an alternative method to interpret
the peptide sequence directly from an MS/MS spectrum without using
a sequence database but based on the mass difference of consecutive
peaks. It is a good basis by which to verify variant PSMs. To this
end, we adopted three common de novo peptide sequencing tools—PepNovo+,[50] pNovo+,[51] and PEAKS[52]—to examine the interpretation consistency
of variant spectra obtained from sequence database searching and from
de novo peptide sequencing. The variant spectra with consistent interpretations
between the two approaches are regarded as reliable identification
of variant peptides.The general parameters used in de novo
peptide sequencing tools were a precursor mass tolerance of ±10
ppm, a fragment mass tolerance of ±0.01 Da, carbamidomethylation
of cysteine as a fixed modification, oxidation of methionine as a
variable modification, and output of the top 10 rank hits for each
spectrum. Furthermore, we used DeNovoGUI’s (version 1.15)[53] built-in “peptide matches” function
to reversely match the top 10 hits of each spectrum from de novo sequencing
tools to peptide sequences of the RefP_V database. This resulted in
the peptide sequences in RefP_V yielded by de novo sequencing tools,
which were used to examine the interpretation consistency between
sequence database searches and de novo peptide sequencing.
Examination of Similarity between Variant
Spectra and Predicted MS/MS Spectra
The peptide–spectrum
matching of the conventional database search approach usually does
not consider peak intensities in experimental spectra. We thus proposed
an examination that considers peak intensity to evaluate the reliability
of variant spectra. For each variant spectrum, we obtained its corresponding
variant peptide and charge state and used MS2PIP[54] and MS2PBPI[55] to generate a predicted MS/MS spectrum for the variant peptide sequence
and charge state. Then, we examined the cosine similarity between
the variant and predicted spectra.To determine the similarity
score threshold for a variant-predicted spectral pair to be similar
(i.e., a reliable variant spectrum supported by the predicted spectrum),
we adopted a target–decoy strategy in which similarity scores
were calculated between all pairs of a variant spectrum and a predicted
spectrum (workflow illustrated in Figure S4). We defined the similarity score of a correct pair of spectra,
i.e., having the same peptide sequence and charge state, as a matched-pair
score, which represents the score of a target case; otherwise, it
was defined as a mismatched-pair score, the score of a decoy case.
Note that we excluded mismatched pairs of spectra that corresponded
to a peptide and its truncated sequence because both peptide sequences
have overlapping b- and y-ions and
similar intensity patterns, possibly affecting the distribution of
similarity scores; for example, ALMDEGMK and ALMDEGMKEK have at least
70% b- and y-ions in common. Then,
all matched- and mismatched-pair scores form two respective distributions
to estimate the threshold to justify the confidence of all of the
matched pairs. For a similarity score s, we defined
the corresponding FDR as the number of mismatched pairs with a similarity
score less than or equal to s divided by the number
of matched pairs having a similarity score less than or equal to s. Then, we determined the similarity score threshold as
the score s* that yields an FDR of 5%. Thus, all
of the matched pairs that pass the specified threshold are considered
high-confidence results, i.e., the corresponding variant spectra identifying
variant peptide sequences are reliable, as they are supported by the
predicted MS/MS spectra from state-of-the-art tools.
Peptide-Level Examination
Isobaric
Substitution and Semitryptic Cleavage
Checks
As our previous study[17] showed that more variant PSMs can be interpreted as wild-type peptides
of proteins when considering 11 types of isobaric substitution, for
each variant PSM, we suggested verifying whether the variant peptide
can be obtained by isobaric substitution of a wild-type peptide. If
yes, the identified variant peptide is not reliable. In this study,
we conducted an examination of isobaric substitution and semitryptic
cleavage on the RefP_V database to verify the identified variant peptides.
The detailed algorithm for checking the 11 isobaric substitutions
is described in Choong et al.[17] We also
performed semitryptic cleavage examination to determine whether a
variant peptide can be derived from semicleavage or terminal truncation
of any wild-type peptide. To check, we first performed in silico trypsin
digestion on the RefP_V database with two missed cleavages and then
examined whether each variant peptide sequence maps to a subsequence
of any wild-type tryptic peptide.
Spectral
Counting
Spectral counting
is widely used in label-free quantification; it involves counting
the number of identified PSMs of a given peptide and integrating all
of the numbers for the identified peptides in a protein to represent
the protein abundance. Popular tools such as Scaffold,[56] CRUX,[57] and Proteome
Discoverer adopt this approach for label-free quantification. Some
proteins in an experiment are possibly identified by a single PSM.
However, such single PSMs may be false positives even though they
pass the FDR control, casting doubt on the proteins of such one-hit
wonders. Using a similar concept, we stratified the evidence levels
of variant peptides based on their spectral counts as follows: one
PSM, two PSMs, and multiple (≥3) PSMs to classify variant peptides,
denoted by 1_PSM, 2_PSM, and 3up_PSM, respectively. Variant peptides
having more PSMs are regarded as having greater confidence.
Variant Event-Level Examination
Checking Consecutive Fragment Ion Peaks
Checking the
mass difference of consecutive fragment ions in an
MS/MS spectrum is common in proteomics data analysis, for instance,
de novo peptide sequencing, PTM site localization, and glycopeptide
identification.[50,58,59] In Ivanov et al.,[60] the analysis of consecutive y-ions in an MS/MS spectrum is used to confirm the existence
of variant sites in the variant peptide sequence. Therefore, we adopted
a similar concept to examine the existence of consecutive variant
site-specific b- or y-ions in variant
spectra. If a variant spectrum reveals two consecutive variant site-specific b- or y-ions with a mass difference corresponding
to the variant residue, the variant event can be confirmed; otherwise,
the event cannot be confirmed, although the variant PSM and peptide
sequence can be still correct.For each variant spectrum, we
converted the spectrum information into a pNovo+ output format and
loaded the converted spectrum to a spectrum viewer, a built-in function
of DeNovoGUI to generate spectrum–peptide annotation. The spectrum
viewer parameter settings only consider matching b- and y-ions of charge +1 to spectrum peaks that
have intensities at least 10% of the most intense peak and are within
0.02 Da tolerance. Then, we manually examined the spectrum annotation
of each variant peptide–spectrum to confirm the existence and
mass difference of the two consecutive variant site-specific b- or y-ions of the variant peptide.
Checking the Existence of Wild-Type Counterparts
in PeptideAtlas and the Identification of Its Parental Protein
To enhance the reliability of the variant events derived from variant
PSMs, we inspected the existence of their wild-type counterparts and
parental proteins. For each variant site, we adopted the LeTE-fusion
pipeline proposed in Mamie Lih et al.[61] to check whether its wild-type counterpart peptides, fully digested
or miscleaved, could be found in PeptideAtlas (build human Jan 2018),[62] a public repository of experimentally observed
peptides. If a variant event could not find any wild-type counterpart
peptide in PeptideAtlas, the identification of this variant event
was doubtful. To check the existence of its parental protein, we used
MAYU to perform validation with an FDR of 1% at the PSM, peptide,
and protein levels on the CMSe-gFDR search results. If the parental
protein of the variant event did not pass the protein-level FDR, identification
of this variant event was dubious. Only variant events with evidence
of wild-type counterparts and parental proteins were regarded as reliable
identification results.
Variant Peptide Location
in the Protein
Variant peptides are identified from variant-harboring
protein
sequences with protease digestion in sequence database searches. Notably,
when an amino acid is mutated to K or R, in silico trypsin digestion
of SAV-harboring protein may generate peptides, which are regarded
as variant peptides by database search engines, but they do not include
the actual variant site. For example, in Q9BUP3, the fully digested
wild-type peptide containing position 197 is 186-KFFGSLPDSWAGHSVPVVTVVR-207. For its SAV S197R, the SAV-harboring protein contained in the
customized database for database searches yields fully digested peptides
187-FFGSLPDSWA-197 and 198-GHSVPVVTVVR-207, neither
of which belong to the wild-type peptide space and thus are regarded
as variant peptides by database search engines. The former peptide
is indeed a variant peptide, but the latter peptide does not contain
the variant site, and its identification cannot be used to support
the identification of the S197R variant event because this peptide
can come from peptide truncation or semitryptic digestion of a wild-type
peptide. Similarly, because a proline (P) after K or R may render
a missed cleavage, an SAV involving P, which is after K or R, mutated
to another amino acid can also lead to a misclassified variant peptide.
Thus, it is essential to exclude variant peptides that do not contain
the variant event.
Results and Discussion
Maximizing the Number of Identified Variant
PSMs of the HEK293 Cell Line Data Set
To demonstrate 11 postevaluation
examinations on identified variant peptides, we used the HEK293 cell
line MS data set to obtain as many variant PSMs with an FDR of 1%
as possible. In this proteogenomics study, we applied SSe-gFDR, SSe-cFDR,
CMSe-gFDR, and CMSe-cFDR, the four combinations of database search
type and FDR estimation approach, for database searches.To
explore the impact of the above four strategies, we compared the identification
results of different search types under the same FDR estimation approach.
To compare SSe-gFDR and CMSe-gFDR, we used Comet, MS-GF+, and X!Tandem
for database searches and obtained 227, 252, and 272 variant PSMs
passing a gFDR of 1%, respectively, that identified 78 (68 variant
events), 85 (76), and 89 (80) variant peptides. In contrast, using
the CMSe approach yielded 282 variant PSMs passing a gFDR of 1% that
corresponded to 93 variant peptides with 82 variant events, more than
using the SSe approach. Overall, a total of 302 variant PSMs corresponding
to 105 variant peptides with 94 variant events passing a gFDR of 1%
were obtained from the SSe-gFDR and CMSe-gFDR strategies. The results
are summarized in Table S2. Furthermore,
we used a Venn diagram to illustrate the relationship of variant PSMs,
variant peptides, and variant events among CMSe-gFDR, Comet-gFDR,
MS-GF+-gFDR, and X!Tandem-gFDR strategies in Figure S5A. Using the SSe-gFDR strategy, 5.3% (12/227), 4.4% (11/252),
and 11.4% (31/272) variant PSMs were exclusively obtained by Comet,
MS-GF+, and X!Tandem, respectively. Notably, 1.8% (4/227), 2.8% (7/252),
and 4.0% (11/272) of the PSMs obtained by Comet, MS-GF+, and X!Tandem,
respectively, were not included in the PSMs obtained using the CMSe-gFDR
strategy. This shows that the variant PSMs resulting from the use
of SSe-gFDR and CMSe-gFDR are slightly different, likewise for variant
peptides and variant events. More detailed results about the diversity
of SSe-gFDR versus CMSe-gFDR are provided in Table S3.To compare the SSe-cFDR and CMSe-cFDR strategies,
we followed the
cFDR estimation methods described in Section . The numbers of identified PSMs, variant
peptides, and variant events obtained from SSe-cFDR and CMSe-cFDR
are listed in Table S2. Specifically, using
CMSe-cFDR, Comet-cFDR, MS-GF+-cFDR, and X!Tandem-cFDR, we obtained
279, 227, 257, and 273 variant PSMs, respectively, and 308 variant
PSMs in total, passing a cFDR of 1% from any score cutoff. Based on
the 308 PSMs, 103 variant peptides and 90 variant events were identified.
Furthermore, the differences in the identification results of SSe-cFDR
and CMSe-cFDR are shown in Figure S5B and Table S4. The above results reflect a similar
phenomenon to that for the comparison between SSe-gFDR and CMSe-gFDR
and suggest that using different search engines and different discrimination
scores for cFDR estimation yields slightly different results. In summary,
observing the above comparisons of using the four strategies (Figure S5), CMSe is suggested for database searching.
When adopting the CMSe strategy, using gFDR can obtain more variant
PSMs than using cFDR.Finally, using the four strategies, we
identified a total of 320
unique variant PSMs, corresponding to 111 variant peptides and 98
variant events (Figure ), where 302 PSMs and 308 PSMs were obtained from both search types
using gFDR and cFDR, respectively. All 320 variant PSMs were used
in this study to demonstrate 11-aspect examinations to evaluate their
reliability at the PSM, peptide, and variant event levels (Table S5).
Figure 1
Variant peptide identification results
obtained by using the four
strategies of combining the type of database search and FDR estimation
approach (SSe-gFDR, SSe-cFDR, CMSe-gFDR, and CMSe-cFDR) on the HEK293
cell line MS data set.
Variant peptide identification results
obtained by using the four
strategies of combining the type of database search and FDR estimation
approach (SSe-gFDR, SSe-cFDR, CMSe-gFDR, and CMSe-cFDR) on the HEK293
cell line MS data set.
Examinations
of Variant PSMs of the HEK293
Data Set at the PSM Level
Examination by Open Modification
Search
(OMS)
To investigate the reliability of all 320 identified
variant PSMs, we first compared them with the results of the OMS tools;
the results are shown in Figure and Table S6. Of the 320
spectra, the three OMS tools reported 190–243 spectra having
results consistent with conventional protein database searches (Figure A). In other words,
24.1–40.6% of the 320 variant PSMs were not supported by the
three respective OMS tools. Notably, of the 320 variant PSMs, 149
(46.6%), 62 (19.4%), and 58 (18.1%) PSMs were supported by all three,
two, and one OMS tools, respectively. However, the remaining 51 (15.9%)
variant spectra were not supported by any OMS tool, i.e., they were
unidentified or mostly identified as other wild-type peptides with
modifications (Figure B). Thus, these 51 variant PSMs were very likely unreliable identification
results, as verified by the three OMS tools.
Figure 2
Results of open modification
search (OMS) on 320 variant PSMs obtained
from conventional database searches. We used three OMS tools to perform
the search against the RefV_P database. (A) Comparison of the consistency
of identified variant PSMs between OMS and conventional database searches.
(B) Distribution of 320 variant PSMs supported by 0–3 OMS tools.
Results of open modification
search (OMS) on 320 variant PSMs obtained
from conventional database searches. We used three OMS tools to perform
the search against the RefV_P database. (A) Comparison of the consistency
of identified variant PSMs between OMS and conventional database searches.
(B) Distribution of 320 variant PSMs supported by 0–3 OMS tools.
Examination of Explosive
Search against
the Huge SuperPep_V Database
After using Comet, MS-GF+, and
X!Tandem to search the 320 variant spectra against SuperPep_V, we
extracted the top PSMs and their search scores of the 320 spectra
from the search results (.t.xml, pep.xml, .mzid) of each search engine.
For a search engine, each spectrum had two PSM results and their search
scores obtained from the original search (searching against RefP_V)
and the explosive search, respectively. For each spectrum, we compared
its identified peptides of the two PSMs and the respective search
scores. If both search scores were the same, then it was denoted as
ΔScore = 0, and ΔScore ≠ 0 otherwise. The two associated
identified peptides could be variant peptide(s) of the HEK293 cells,
termed HEK_VP, or any peptide(s) other than the HEK293 variant peptides
from RefP_V or SuperPep_V, termed non-HEK_VP. Then, we classified
the 320 spectra according to the comparison results into the following
four groups:Both PSMs have the same search score
and identify the same variant peptide: denoted as ΔScore = 0
and HEK_VP → HEK_VPSuperPep_V;Both PSMs have the same search score
and identify non-HEK_VP peptides: denoted as ΔScore = 0 and
non-HEK_VP → non-HEK_VPSuperPep_V;Both PSMs have different search scores,
and one identifies HEK_VP but the other does not: denoted as ΔScore
≠ 0 and HEK_VP → non-HEK_VPSuperPep_V;Both PSMs have different
search scores
and identify different non-HEK_VPs: denoted as ΔScore ≠
0 and non-HEK_VP → non-HEK_VPSuperPep_V.Note that the 320 variant spectra were obtained
by integrating
results from the three single search engines and those using multiple
search engines. The second and fourth groups with non-HEK_VP →
non-HEK_VPSuperPep_V could occur when the search engine
identified the spectrum as a non-HEK_VP but at least one of the other
two search engines identified it as HEK_VP.Classifications
of the 320 variant spectra using Comet, MS-GF+,
and X!Tandem are shown in Figure A–C and Table S7.
Of the 320 spectra, 266 (83.13%), 254 (79.38%), and 280 (87.50%) spectra
are in Group 1, ΔScore = 0, and HEK_VP → HEK_VPSuperPep_V, as determined by using Comet, MS-GF+, and X!Tandem for searches,
respectively. In other words, the spectra in group 1, the majority
of the 320 spectra, still matched the same variant peptide even when
searching against the huge SuperPep_V database, and thus their identified
variant peptides were high confidence. In addition, the spectra in
the second and fourth groups (light blue and bisque color in Figure A–C) accounted
for 9.38–14.06% of the 320 spectra for each search engine.
Such spectra were not assigned to variant peptides by the specific
search engine when searching against RefP_V and against SuperPep_V,
but they were assigned to variant peptides by other search engine(s)
and could represent medium-confidence variant PSMs. In contrast, 7–24
(2.19–7.50%) spectra in the third group (dark orange color
in Figure A–C),
ΔScore ≠ 0 and HEK_VP → non-HEK_VPSuperPep_V, of the respective search engine, were identified as variant peptides
of HEK293 cells but not when searching against SuperPep_V. Notably,
this also suggests that some of these spectra may be misinterpreted
as a variant peptide due to an incomplete search space.
Figure 3
Results of
explosive search on 320 identified variant PSMs. We
used three conventional database search engines to search the 320
variant spectra against the huge SuperPep_V database. (A–C)
Based on each search engine’s results, classifying the 320
variant spectra into four groups. (D) Distribution of 320 variant
PSMs with support from explosive search by 0–3 search engines.
Results of
explosive search on 320 identified variant PSMs. We
used three conventional database search engines to search the 320
variant spectra against the huge SuperPep_V database. (A–C)
Based on each search engine’s results, classifying the 320
variant spectra into four groups. (D) Distribution of 320 variant
PSMs with support from explosive search by 0–3 search engines.Furthermore, a total of 292 spectra were classified
as group 1
by any of the search engines; the confidence levels of their identified
variant peptides are further summarized as follows (Figure D). Of the 320 variant spectra,
233 (72.81%) spectra were identified as variant peptides when searching
SuperPep_V by all of the three search engines, and these variant peptides
were considered to have the highest confidence. Moreover, 32 (10%)
and 27 (8.44%) spectra were identified as variant peptides in explosive
search by two and one search engines, respectively; these variant
peptides are regarded to have high and medium confidence, respectively.
The remaining 28 (8.75%) spectra were consistently annotated as non-HEK_VP
in explosive search by the three search engines, although they were
identified as variant peptides in the original search by some search
engines. The results show that the explosive search examination is
an effective method to inspect possible false-positive variant peptides
caused by using an incomplete database for searches.
Examination by Combined Open Modification
and Explosive Search
Using OMS tools, MSFragger, PIPI, and
SpecOMS to search the 320 variant spectra obtained from the original
search against SuperPep_V yielded 312, 253, and 314 PSMs, respectively,
as shown in Figure A and Table S8. Of the PSMs of the combined
open modification and explosive search results, 211, 163, and 136
spectra were identified as variant peptides by the three OMS tools
(light green bar), respectively, consistent with the original search,
and were considered high-confidence variant PSMs. Moreover, OMS against
SuperPep_V (light green bar), i.e., the combined search, had a decrease
of 13.2–28.4% variant PSMs compared with OMS against RefP_V
(light blue bar). Note that a number of spectra were not matched to
variant peptides in this combined search with open modifications and
a huge peptide database as the search space. Thus, the variant peptide
identification of these spectra was insufficiently confirmed.
Figure 4
Results of
combined open modification and explosive search on 320
variant PSMs. Three OMS tools were used to search the 320 variant
spectra against the huge SuperPep_V database. (A) Comparison of identified
PSMs and variant PSMs between OMS and combined search (OMS + explosive
search) results for each OMS tool. (B) Distribution of 320 variant
PSMs with the support of combined open modification and explosive
search results from 0 to 3 OMS tools.
Results of
combined open modification and explosive search on 320
variant PSMs. Three OMS tools were used to search the 320 variant
spectra against the huge SuperPep_V database. (A) Comparison of identified
PSMs and variant PSMs between OMS and combined search (OMS + explosive
search) results for each OMS tool. (B) Distribution of 320 variant
PSMs with the support of combined open modification and explosive
search results from 0 to 3 OMS tools.Furthermore, we integrated the combined search results of the 320
spectra to classify the reliability of the originally identified variant
PSMs (Figure B). Of
the 320 spectra, 99, 71, and 71 spectra were assigned to variant peptides
by three, two, and only one OMS tools, respectively. The remaining
79 spectra were unidentified or assigned to nonvariant peptides by
all of the three OMS tools. Thus, these 79 spectra of variant PSMs
could be considered to be less reliable results caused by possible
modifications and incomplete peptide search space.We used three
de novo peptide sequencing tools—PepNovo+,
pNovo+, and PEAKS—to identify the 320 variant spectra obtained
from database searches (Table S9). Of the
320 spectra, the top 10 hits of 89, 95, and 97 spectra obtained from
PepNovo+, pNovo+, and PEAKS (green bar), respectively, contained the
corresponding variant peptides (Figure A), of which more than 79.8% belonged to rank-1 or
-2 hits from the three tools (Figure S6). Integrating the results of the three tools to examine the interpretation
consistency of the identified spectra, we observed that of the 320
spectra, 59 (18.4%), 31 (9.7%), and 42 (13.1%) spectra had support
from three, two, and one tools, respectively, as shown in Figure B. However, 188 (58.8%)
spectra had no support from any tool and could be considered to be
unreliable identification results.
Figure 5
De novo peptide sequencing on the 320
variant spectra by PepNovo+,
pNovo, and PEAKS. (A) Identification results of three de novo sequencing
tools based on the top 10 hits of each of the 320 spectra. (B) Distribution
of 320 variant PSMs with support from de novo sequencing by 0–3
de novo sequencing tools.
De novo peptide sequencing on the 320
variant spectra by PepNovo+,
pNovo, and PEAKS. (A) Identification results of three de novo sequencing
tools based on the top 10 hits of each of the 320 spectra. (B) Distribution
of 320 variant PSMs with support from de novo sequencing by 0–3
de novo sequencing tools.
Similarity Examination by Predicted MS/MS
Spectra
We used the prediction tools MS2PIP and
MS2PBPI to generate the simulated MS/MS spectra of 111
identified variant peptide sequences while considering charge states
and modifications. We then used the target–decoy approach described
in Section to determine the similarity score threshold between variant spectra
and simulated spectra for the confidence results. As a result, the
similarity score thresholds between a variant spectrum and the corresponding
predicted spectrum to be considered similar at an FDR less than 5%
were set to 0.5 and 0.4 for MS2PIP and MS2PBPI,
respectively, as shown in Figure S7; i.e.,
a variant spectrum yielding a similarity score with the predicted
spectrum higher than the threshold was regarded as a high-confidence
variant PSM. Of the 320 variant spectra obtained from sequence database
searches, 179 and 111 of the variant spectra passed the similarity
score thresholds of MS2PIP and MS2PBPI, respectively,
and were considered to be high-confidence identification results (Table S10). Furthermore, 108 (33.75%) and 74
(23.125%) spectra were verified by both and only one tools, respectively
(Figure ). The remaining
138 (43.125%) spectra had no evidence support from any tool and were
likely to be less reliable results.
Figure 6
Distribution of 320 variant spectra with
support by spectral similarity
with predicted MS/MS spectra generated by 0–2 intensity prediction
tools. MS2PIP and MS2PBPI were used to generate
simulated spectra based on variant peptide sequence and charge state.
A variant spectrum has the support of a prediction tool if the spectrum
yields a similarity score with the predicted spectrum higher than
the threshold.
Distribution of 320 variant spectra with
support by spectral similarity
with predicted MS/MS spectra generated by 0–2 intensity prediction
tools. MS2PIP and MS2PBPI were used to generate
simulated spectra based on variant peptide sequence and charge state.
A variant spectrum has the support of a prediction tool if the spectrum
yields a similarity score with the predicted spectrum higher than
the threshold.
Examinations
of Variant PSMs of the HEK293
Data Set at the Peptide Level
Isobaric Substitution
and Semitryptic Check
For the 111 variant peptides identified
in 320 variant PSMs, we
conducted examinations of isobaric substitution and semitryptic cleavage
on the RefP_V database to further verify the variant peptides. Of
the 111 variant peptides, 3 variant peptides could be derived from
reference peptides with isobaric substitutions, as shown in Table S11. For example, the PSM of the variant
peptide “SLVQESLSTNSSDLVAPSPDAFR” of protein
Q12888 with the D353E variant could be interpreted as “SLVQD[methylated]SLSTNSSDLVAPSPDAFR” because
the masses of glutamate and methylated aspartate are equivalent. Also,
we found a variant peptide “TQDLLNQNHSANAVR.L”
of protein Q9H040 with the P296L variant possibly belonging to a reference
tryptic peptide “TQDLLNQNHSANAVR.PNSK” of Q9H040 by
semitryptic cleavage checking (Table S11). However, this case is not an actual semicleavage or terminal-truncation
case because the P is mutated to L although trypsin retains low cleavage
specificity at K or R before P.[63] Therefore,
these 4 out of the 111 variant peptides were ambiguous results at
the peptide sequence level but not at the PSM level.
Spectral Counting Evaluation
Following
the spectral counting described in Section , we grouped the 111 variant peptides
in the 320 variant PSMs into three peptide classes: 1_PSM, 2_PSM,
and 3up_PSM, which contained 54 (48.65%), 21, and 36 variant peptides,
respectively (Figure A). Note that 48.65% of the variant peptides in the 1_PSM class were
less confident than the other two peptide classes. Next, we grouped
identified variant events into three event classes based on the numbers
of peptides and PSMs detecting the event, defined as follows: a variant
event detected by a single peptide and a single PSM (denoted as 1P_1PSM),
a single peptide and multiple (≥2) PSMs (1P_mPSM), and multiple
(≥2) peptides and multiple PSMs (mP_mPSM). Of the 98 variant
events, 46, 40, and 12 variant events were grouped in the 1P_1PSM,
1P_mPSM, and mP_mPSM classes, respectively (as shown in Figure B), where 46.94% of variant
events were less confident than the other two event classes. The hierarchical
visualization of the variant event classification based on peptide
and spectral counting is shown in Figure S8; detailed results are provided in Table S12.
Figure 7
Evaluation of 320 variant PSMs at the peptide and variant event
levels. The 320 variant PSMs corresponded to 111 variant peptides
and 98 variant events. Spectral counting evaluation of (A) 111 variant
peptides and (B) 98 variant events. Checking two consecutive variant
site-specific fragment ions (C) in 320 variant spectra and (D) from
PSM, peptide, and variant event perspectives.
Evaluation of 320 variant PSMs at the peptide and variant event
levels. The 320 variant PSMs corresponded to 111 variant peptides
and 98 variant events. Spectral counting evaluation of (A) 111 variant
peptides and (B) 98 variant events. Checking two consecutive variant
site-specific fragment ions (C) in 320 variant spectra and (D) from
PSM, peptide, and variant event perspectives.
Variant Event Examination on the HEK293 Data
Set via Four Different Checks
Checking Consecutive
Fragment Ion Peaks
Following the procedure described in Section , we manually
examined the spectrum annotation
of each of the 320 variant spectra to confirm the existence and mass
difference of the two consecutive variant site-specific b- or y-ions. The detailed information is listed
in Table S13.Of the 320 variant
PSMs, 161 spectra contained two consecutive variant site-specific
fragment ions and more y-ion pairs, as shown in Figure C. Moreover, of the
98 variant events derived from the 320 PSMs, 47 variant events were
confirmed by consecutive fragment ions in MS/MS spectra (Figure D). These results
show that 48% of variant events are more reliable, as confirmed by
the consecutive variant site-specific b- or y-ions.
Checking Variant Peptide
Sequence Location
We examined whether the 111 variant peptides
in the 320 variant
PSMs contained the SAVs. As explained in Section , 198-GHSVPVVTVVR-207 cannot be regarded
as a variant peptide with S197R in Q9BUP3 and is indeed a wild-type
peptide, although it is not a digested peptide of the reference protein.
Of these variant peptides, we found that four peptides do not contain
the variant sites because the variant sites are trypsin cleavage sites
or occur after the C-terminal of the trypsin cleavage site, as shown
in Table S14. Of the 98 variant events
identified in the 320 variant PSMs, 96 variant events were supported
by variant-site-containing peptides and thus were more reliable than
the remaining two variant events, each of which was reported in only
one peptide but without containing the variant site.
Checking the Existence of Wild-Type Counterparts
in PeptideAtlas and the Identification of Parental Proteins
Of the 98 variant events in the 320 variant PSMs, for 70 events,
their wild-type counterpart peptides were found in PeptideAtlas; for
the remaining 28 events, no such peptides were found (Table S15). Based on the CMSe-gFDR results, the
parental proteins of 93 variant events passed an FDR of 1% at the
protein level (Table S16). By inspecting
the existence of wild-type counterpart peptides and parental proteins,
we found that the 98 variant events in the 320 variant PSMs reflected
different levels of reliability.
Discussion
Applying the 11-Aspect Examinations to Extract
the Highest-Confidence Variant Events
In the HEK293 cell
line MS data set, using both search strategies, 375 variant PSMs corresponding
to 135 variant peptides and 121 variant events were reported. After
applying FDR filtering, 320 variant PSMs passed an FDR <1%. It
showed that FDR filtering removed 55 (14.9%) PSMs, 24 (17.8%) peptides,
and 23 (19.0%) events. To identify variant events with the highest
confidence, we applied all of the 11-aspect examinations to the 320
variant PSMs containing 98 variant events (Table S17). Of these, 111 PSMs passed all of the five-aspect examinations
at the PSM level, where each aspect examination results from at least
one tool supporting the reliability of variant PSMs. When further
applying the two-aspect examinations at the peptide level on the 111
variant PSMs, 25 variant peptides satisfied both evaluations, i.e.,
ambiguous variant peptides caused by isobaric substitutions or semitryptic
cleavage, and variant peptides belonging to 1P_1PSM class were filtered
out. Finally, 14 variant events of the 17 variant peptides passed
the four evaluations at the variant event level, where each variant
peptide was verified by at least one pair of site-specific b- or y-ions, the existence of a wild-type
counterpart peptide and parental protein, and its sequence containing
the variant site. With such stringent examinations at the PSM, peptide,
and variant event levels, 14 variant events contained in 17 variant
peptides of 71 variant PSMs in the HEK293 data set achieved the highest
confidence.We noted that 249 (77.8%) out of 320 variant PSMs
were filtered out by the 11 postexaminations. Such a high filtration
rate was also observed in the nine deep proteome data sets of cancer
cell lines after applying two filtering strategies—filtration
against reference proteome and chemical modification filter—to
filter out unreliable identified variant peptides, as reported by
Alfaro et al.[12] based on their Additional
File 4 (CL9_exom_snv). To be specific, in their nine deep proteome
data sets, on average, 668 variant PSMs were reported. After filtering,
on average, 587 (87.9%) variant PSMs were filtered out and only 81
(12.1%) variant PSMs remained. It again emphasized the necessity of
checking the reliability of identified variant PSMs with 1% FDR. Furthermore,
to validate the results after 11 postexaminations, we used Lobas et
al.’s[64] results of three HEK293
MS data sets from three different sources using two search engines.
The authors classified identified variant peptides into four levels
of confidence by the number of their confirmations. All of our 17
variant peptides passing the 11 postexaminations were found in their
identification results. Notably, 10 (71.4%) variant events were found
in at least two MS data sets and regarded as more confident identification.
The high overlapping rate of variant events showed that our postexaminations
are reliable and effective. In addition, our postexamination methods
can remedy the situation without sufficient technical replicate data
due to limited sample amount for verifying the reliability of identified
variant peptides.
Flexibility of the Proposed
Methodology
However, performing the 11 postexaminations is
a very stringent
evaluation of the variant events at the three bottom-up levels that
can facilitate detecting the most reliable variant peptide identification.
We consider that the PSM-level examination is the fundamental examination.
Because the examinations at the PSM level greatly affect the ultimate
outcomes of verification, researchers can select performing specific
examinations or adopting a “voting” strategy to determine
reliable variant PSMs as passing at least k (say, k = 2 or 3) out of five examinations depending on their
stringency requirements. Among the five PSM-level examinations, we
suggest that examination by open modification search (OMS) is essential.
Variant spectra passing the PSM-level examination(s) will then proceed
for examinations at variant peptide and event levels. Researchers
can also select specific examinations to check. For peptide-level
examinations, we consider that isobaric substitution check is a must
because identified variant peptides that can be obtained by isobaric
substitutions of wild-type peptides are unreliable, and this examination
can be easily done. Event-level examinations are quite easy to perform
and can detect unreliable variant events.
Conclusions
In this paper, we present a framework for the postexamination of
variant peptide identification results to verify the reliability of
identified variant events. The framework consists of 11 examinations
at the PSM, peptide, and variant event levels based on MS proteomic
knowledge. Each examination is performed on identified variant PSMs
and their variant peptides, not on the whole MS data set, and can
be done using public software tools or in-house programs. As a proof
of concept and showing feasibility, we demonstrate the 11 examinations
on the identified variant peptides by sequence database searching
of a public MS data set of the HEK293 cell line. Although identified
variant peptides pass an FDR of 1%, the results of 11 examinations
reveal that the essential FDR criterion requirement is not sufficient
to validate identified variant peptides. These rigorous examinations
can serve to reveal low-confidence variant events from the shotgun
proteomics experiment. Moreover, in this framework, researchers can
replace some of the proposed examinations with other examinations
or add different examinations. We suggest that postexaminations of
identified variant events are essential and can be considered as additional
guidelines to evaluate the reliability of variant events in proteogenomics
studies.
Authors: Eric W Deutsch; Lydie Lane; Christopher M Overall; Nuno Bandeira; Mark S Baker; Charles Pineau; Robert L Moritz; Fernando Corrales; Sandra Orchard; Jennifer E Van Eyk; Young-Ki Paik; Susan T Weintraub; Yves Vandenbrouck; Gilbert S Omenn Journal: J Proteome Res Date: 2019-10-21 Impact factor: 4.466
Authors: X Gu; L Xing; G Shi; Z Liu; X Wang; Z Qu; X Wu; Z Dong; X Gao; G Liu; L Yang; Y Xu Journal: Cell Death Differ Date: 2011-08-05 Impact factor: 15.828
Authors: Andy T Kong; Felipe V Leprevost; Dmitry M Avtonomov; Dattatreya Mellacheruvu; Alexey I Nesvizhskii Journal: Nat Methods Date: 2017-04-10 Impact factor: 28.547
Authors: Nuala A O'Leary; Mathew W Wright; J Rodney Brister; Stacy Ciufo; Diana Haddad; Rich McVeigh; Bhanu Rajput; Barbara Robbertse; Brian Smith-White; Danso Ako-Adjei; Alexander Astashyn; Azat Badretdin; Yiming Bao; Olga Blinkova; Vyacheslav Brover; Vyacheslav Chetvernin; Jinna Choi; Eric Cox; Olga Ermolaeva; Catherine M Farrell; Tamara Goldfarb; Tripti Gupta; Daniel Haft; Eneida Hatcher; Wratko Hlavina; Vinita S Joardar; Vamsi K Kodali; Wenjun Li; Donna Maglott; Patrick Masterson; Kelly M McGarvey; Michael R Murphy; Kathleen O'Neill; Shashikant Pujar; Sanjida H Rangwala; Daniel Rausch; Lillian D Riddick; Conrad Schoch; Andrei Shkeda; Susan S Storz; Hanzhen Sun; Francoise Thibaud-Nissen; Igor Tolstoy; Raymond E Tully; Anjana R Vatsan; Craig Wallin; David Webb; Wendy Wu; Melissa J Landrum; Avi Kimchi; Tatiana Tatusova; Michael DiCuccio; Paul Kitts; Terence D Murphy; Kim D Pruitt Journal: Nucleic Acids Res Date: 2015-11-08 Impact factor: 16.971
Authors: John G Tate; Sally Bamford; Harry C Jubb; Zbyslaw Sondka; David M Beare; Nidhi Bindal; Harry Boutselakis; Charlotte G Cole; Celestino Creatore; Elisabeth Dawson; Peter Fish; Bhavana Harsha; Charlie Hathaway; Steve C Jupe; Chai Yin Kok; Kate Noble; Laura Ponting; Christopher C Ramshaw; Claire E Rye; Helen E Speedy; Ray Stefancsik; Sam L Thompson; Shicai Wang; Sari Ward; Peter J Campbell; Simon A Forbes Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971