Renee Salz1, Robbin Bouwmeester2,3, Ralf Gabriels2,3, Sven Degroeve2,3, Lennart Martens2,3, Pieter-Jan Volders2,3, Peter A C 't Hoen1.
Abstract
Discovery of variant peptides such as a single amino acid variant (SAAV) in shotgun proteomics data is essential for personalized proteomics. Both the resolution of shotgun proteomics methods and the search engines have improved dramatically, allowing for confident identification of SAAV peptides. However, it is not yet known whether these methods can accurately identify SAAV peptides without prior genomic information in the search database. We studied this in unprecedented detail by exploiting publicly available long-read RNA sequences and shotgun proteomics data from the gold standard reference cell line NA12878. Searching spectra from this cell line with the state-of-the-art open modification search engine ionbot against carefully curated search databases resulted in 96.7% false-positive SAAVs and an 85% lower true positive rate than searching with peptide search databases that incorporate prior genetic information. While adding genetic variants to the search database remains indispensable for correct peptide identification, inclusion of long-read RNA sequences in the search database contributes only 0.3% new peptide identifications. These findings reveal the differences in SAAV detection that result from various approaches, providing guidance to researchers studying SAAV peptides and developers of peptide spectrum identification tools.
Keywords: deep proteomics; direct RNA sequencing; long-read RNA sequence; open search
Year: 2021 PMID: 33998808 PMCID: PMC8280751 DOI: 10.1021/acs.jproteome.1c00264
Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466
Figure 1. Creation of the search databases. (A) Three databases were built to compare the use of different sources of sequences: one with only translations of transcriptome sequences (ONT), one with only the reference proteome (GENCODE), and one with the union of the two. This comparison is denoted with a blue square. Variants from NA12878 were incorporated into the combination database from A, and the result was compared to the combination database without variants. This comparison is denoted with a red square. (B) Number of (predicted) ORFs in the different sources used to construct the variant-free (VF) search database and their overlap. The sources included the GENCODE v29 reference ORFs and the predicted ORFs from ONT RNAseq. Two ORF prediction tools (ANGEL and SQANTI) were used to determine candidate ORFs, and the intersection was included in the final search database.
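The database construction described in Figure 1 can be sketched with plain set operations. This is a minimal illustration, not the authors' pipeline: the function name `build_search_databases` and the toy sequences are hypothetical, and ORFs are assumed to be compared as exact protein-sequence strings.

```python
def build_search_databases(gencode_orfs, angel_orfs, sqanti_orfs):
    """Return ONT-only, GENCODE-only, and combined ORF sets.

    Assumption: each argument is an iterable of protein sequences (strings),
    and two ORFs count as the same when their sequences are identical.
    """
    # Keep only ORFs that both prediction tools agree on (the intersection).
    ont = set(angel_orfs) & set(sqanti_orfs)
    reference = set(gencode_orfs)
    # The combination database is the union of both sources.
    return {"ONT": ont, "GENCODE": reference, "combined": ont | reference}


# Hypothetical toy sequences, for illustration only.
databases = build_search_databases(
    gencode_orfs=["MKT", "MAA"],
    angel_orfs=["MAA", "MLL", "MQQ"],
    sqanti_orfs=["MLL", "MQQ", "MWW"],
)
print(sorted(databases["combined"]))
```

Using sets keeps the overlap counts in panel B easy to derive, for example `len(databases["ONT"] & databases["GENCODE"])`.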
Figure 2. Detectable peptides per method. Theoretical (upper pie charts) and observed (lower pie charts) proportions of peptides when searching against the variant-containing (VC, right) or variant-free (VF, left) search databases. The charts show the percentages of matched peptides attributed only to GENCODE proteins, only to ONT proteins, and to proteins present in both databases.
Figure 3. Detection of variant peptides using (combination) VF and VC databases. (A) Variant PSMs (left) and unique peptides (right) attributed to genome-supported variant peptides. (B) PSM and peptide counts found by each method.
Figure 4. Properties of detected variants compared to those expected. (A) Groups of variant peptides being compared. All circles, including all overlaps, are compared to each other. (B) Differences in length distribution between variant peptides detected by the different variant detection methods. (C) Normalized (divided by max) frequency of variation per original (reference) amino acid.
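The "divided by max" normalization in panel C can be illustrated as follows. This is a sketch under assumptions: the function name `normalized_variant_frequency` and the input (one reference residue letter per observed variant) are hypothetical, not taken from the paper's code.

```python
from collections import Counter


def normalized_variant_frequency(reference_residues):
    """Count variants per reference amino acid and divide by the largest count."""
    counts = Counter(reference_residues)
    if not counts:
        return {}
    peak = max(counts.values())
    # The most frequently varied residue gets 1.0; the rest scale relative to it.
    return {residue: n / peak for residue, n in counts.items()}


# Hypothetical input: the reference residue of each detected variant.
frequencies = normalized_variant_frequency(["R", "R", "A", "G", "R", "A"])
print(frequencies)
```

Dividing by the maximum puts groups with different total variant counts on a common 0 to 1 scale, which is what makes the per-residue profiles comparable across detection methods.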
Figure 5. False-negative variant misidentifications. (A) Investigation of the causes of misidentification of peptides in the VF set. (B) Scores of those misidentified peptides in the VF vs VC set. Each point corresponds to one false-negative variant peptide. The Percolator PSM score is used. Color corresponds to delta retention time.
Figure 6. False-positive misidentifications. (A) False-positive misidentifications are genome-unsupported (US) variants predicted by the VF method. The Venn diagram highlights the subset of variants investigated in this figure. These 2998 variants were predicted by ionbot to be variant peptides but were not found with the variant-containing (VC) set. All but seven were variants unsupported by genome information. (B) Relative score distributions of genome-supported vs genome-unsupported variants in the VF set. (C) Unexpected modifications assigned by the VC set to all "false-positive" predicted variant PSMs in the VF set.
Figure 7. Underlying SNPs detected at the protein level. (A) Variant peptide abundance vs reference counterpart, split by zygosity and search database, square root-transformed. (B) Separating heterozygous variants in the variant-containing database by whether more variant peptide was found (variant-biased) or more of the reference counterpart was found (reference-biased) revealed differences in allele frequency distributions. (C) Ratio variability of genes with two or more variant peptides. Ratio is defined by the variant counterpart abundance divided by variant peptide abundance. The y axis shows max – min per gene.
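The per-gene ratio variability in panel C can be sketched as below. The function name `ratio_variability`, the record layout, and the decision to skip variants with zero abundance are assumptions made for illustration; the ratio follows the caption's definition (counterpart abundance divided by variant peptide abundance).

```python
from collections import defaultdict


def ratio_variability(records):
    """Return max - min of the abundance ratio for genes with >= 2 variant peptides.

    records: iterable of (gene, counterpart_abundance, variant_abundance).
    Assumption: variants with zero abundance are skipped to avoid dividing by zero.
    """
    ratios = defaultdict(list)
    for gene, counterpart, variant in records:
        if variant > 0:
            ratios[gene].append(counterpart / variant)
    # Only genes with two or more variant peptides have a meaningful spread.
    return {gene: max(r) - min(r) for gene, r in ratios.items() if len(r) >= 2}


# Hypothetical abundances, for illustration only.
spread = ratio_variability([("G1", 4, 2), ("G1", 3, 3), ("G2", 5, 1)])
print(spread)
```

A small spread means the variant peptides of a gene agree on the allelic ratio, while a large spread points to peptide-level quantification noise or misidentification.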