SUMMARY: The neXtProt peptide uniqueness checker allows scientists to determine which peptides can be used to validate the existence of human proteins, i.e. which peptides map uniquely rather than multiply to human protein sequences when isobaric substitutions, alternative splicing and single amino acid variants are taken into account.
AVAILABILITY AND IMPLEMENTATION: The pepx program is available at https://github.com/calipho-sib/pepx and can be launched from the command line or through a CGI web interface. Indexing requires a sequence file in FASTA format. The peptide uniqueness checker tool is freely available on the web at https://www.nextprot.org/tools/peptide-uniqueness-checker and from the neXtProt API at https://api.nextprot.org/.
CONTACT: lydie.lane@sib.swiss.
1 Introduction
Most proteomics experiments that aim to identify proteins in complex samples rely on proteolytic digestion, followed by separation of the resulting peptides by liquid chromatography and their identification by tandem mass spectrometry (MS). Since the link between peptides and their protein precursors is lost, peptide-to-protein mappings must be obtained and critically evaluated, even when there is strong evidence that the peptide identifications are correct. To ensure that a peptide maps unambiguously to a protein, isobaric substitutions of isoleucine to leucine, which cannot be distinguished by most current MS techniques, must be taken into account, as well as possible sequence variations arising from single amino acid variants (SAAVs) and alternative splicing. This is especially important for projects such as the HUPO Human Proteome Project (HPP), whose aim is to experimentally validate the existence of in silico-predicted human proteins (Omenn ).
The neXtProt knowledgebase on human proteins currently contains >42 000 isoforms produced by alternative splicing and >5 million variants (Gaudet ), which together generate a very large number of possible proteoforms. No existing tool currently takes this diversity into account when mapping peptides to proteins. The peptide uniqueness checker tool on the neXtProt platform was developed to meet this need.
2 Materials and methods
The peptide indexer (pepx) maps proteomics peptides to a given set of protein sequences using an n-mer-based index: all protein sequences are scanned with a sliding window of n amino acids, and each n-mer found is written to an index that points to the list of identifiers of the sequences containing it. Peptides to be mapped are scanned with the same n-mer sliding window. For a match to be returned, all n-mers of a peptide must be found in the same protein isoform sequence. This method allows protein variations to be taken into account comprehensively while minimizing combinatorial explosion.
For proteomic peptides from shotgun MS experiments, which are typically 7–30 aa in length, the index for 6-mers offers the best trade-off between speed, memory and performance. The pepx program also builds indexes for 3-, 4- and 5-mers, which can for instance be used to search for short linear motifs. To further increase confidence in uniqueness, pepx can build special indexes in which the isobaric amino acids I (Ile) and L (Leu) are merged and replaced with the ambiguity code J. With these indexes, a peptide is not wrongly reported as unique when peptides differing only by I/L substitutions exist in different proteins.
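The indexing and lookup scheme described above can be illustrated with a minimal Python sketch. It is not the pepx implementation: the function names (`merge_il`, `build_index`, `match_peptide`), the in-memory dictionary layout and the toy sequences are assumptions made for illustration only.

```python
from collections import defaultdict

N = 6  # window size; the text reports 6-mers as the best trade-off


def merge_il(seq):
    """Replace the isobaric residues I and L with the ambiguity code J."""
    return seq.replace("I", "J").replace("L", "J")


def nmers(seq, n=N):
    """Yield every n-mer of seq using a sliding window of width n."""
    for i in range(len(seq) - n + 1):
        yield seq[i:i + n]


def build_index(sequences, n=N, il_merged=True):
    """Map each n-mer to the set of sequence identifiers that contain it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        if il_merged:
            seq = merge_il(seq)
        for kmer in nmers(seq, n):
            index[kmer].add(seq_id)
    return index


def match_peptide(peptide, index, n=N, il_merged=True):
    """Return the identifiers of sequences containing all n-mers of the peptide."""
    if il_merged:
        peptide = merge_il(peptide)
    if len(peptide) < n:
        return set()  # too short to be looked up in an n-mer index
    candidates = None
    for kmer in nmers(peptide, n):
        hits = index.get(kmer, set())
        candidates = set(hits) if candidates is None else candidates & hits
        if not candidates:
            return set()  # one n-mer is missing, so no sequence can match
    return candidates


# Hypothetical toy sequences that differ by a single I/L substitution.
toy = {"iso1": "MKTAYIAKQRQISFVK", "iso2": "MKTAYLAKQRGG"}
print(match_peptide("TAYIAKQR", build_index(toy)))
print(match_peptide("TAYIAKQR", build_index(toy, il_merged=False), il_merged=False))
```

With the I/L-merged index the toy peptide matches both sequences, whereas the plain index reports only `iso1`, which is the situation the J ambiguity code is meant to expose.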
3 Usage
As the reference knowledgebase for HPP, neXtProt validates the existence of human proteins based on several criteria, including peptide identification data from mass spectrometry-based proteomics experiments. According to the latest HPP guidelines, a protein is validated if two unique peptides of at least 9 aa in length are reported (Deutsch ). At each neXtProt release, all splice isoforms are indexed, and pepx is used to assess peptide uniqueness: a peptide is considered to be unique if all the matching isoform sequences derive from a single neXtProt entry.
For the so-called ‘missing’ proteins, for which no experimental evidence has previously been reported, HPP requests that peptide uniqueness be further checked by taking all possible SAAVs into account (Deutsch ). Pepx can perform this check using an index containing the >5 million SAAVs from neXtProt; the index building step takes a few hours and the index size is about 4 Gb. Only a single substitution per 6-mer is accepted, in order to keep the index at a manageable size. For example, PNVLLA with known variants P->L, P->S and V->A will generate entries for sequences PNVLLA, LNVLLA, SNVLLA and PNALLA, but not SNALLA. The position of the SAAV within the 6-mer is recorded in the indexes, allowing the variant path to be rapidly reconstructed when displaying matches. Since pepx has successfully been used to validate the identification of missing proteins in human sperm (Vandenbrouck ), the HPP board requested that this tool be made accessible to the whole human proteomics community.
A dedicated web interface, the neXtProt peptide uniqueness checker, and a corresponding API (application programming interface) service have been developed. The list of peptides, which can be typed in a text area or imported from a text file, is sent via an HTTP request to the API (Fig. 1(1)), which then queries the 6-mer index created by pepx (Fig. 1(2)) and returns the identifiers of the matched isoforms (Fig. 1(3)).
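The single-substitution rule used for the SAAV index (the PNVLLA example above) can be sketched as follows. This is an illustrative reading of the rule, not the pepx code: the function name `variant_nmers` and the way variants are passed in (a dictionary from 0-based position to alternative residues) are assumptions.

```python
def variant_nmers(nmer, variants):
    """Return (sequence, variant_position) pairs with at most one substitution.

    `variants` maps a 0-based position in the n-mer to the residues that may
    replace the reference residue there. The reference n-mer is listed first
    with position None; combinations of two or more variants are never built.
    """
    entries = [(nmer, None)]
    for pos in sorted(variants):
        for alt in variants[pos]:
            entries.append((nmer[:pos] + alt + nmer[pos + 1:], pos))
    return entries


# Known variants from the text: P->L and P->S at position 0, V->A at position 2.
for seq, pos in variant_nmers("PNVLLA", {0: ["L", "S"], 2: ["A"]}):
    print(seq, pos)
```

Because each entry carries at most one substitution, the doubly substituted SNALLA is never generated, and the number of entries grows linearly with the number of variants in the window rather than combinatorially. Storing the position alongside the sequence mirrors the recorded SAAV position mentioned in the text.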
Because pepx retrieves isoform sequences that map to the 6-mers included in a given peptide and not to the full extent of the peptide, it occasionally returns false positives. An exact string search using the neXtProt API is therefore performed as a validation step to ensure that the entire sequence of the peptide is present in the retrieved sequences. A second validation step checks that the variant at the indicated position in the matching entry exists and justifies the match (Fig. 1(4)). The validated matches (Fig. 1(5)) are sent from the API server to the client in JSON (JavaScript Object Notation) format (Fig. 1(6)). The results are displayed in boxes, each box containing the resulting matches for one peptide, either taking SAAVs into account (lower panel) or not (upper panel). In MS data analysis, peptide uniqueness is usually not evaluated at the level of protein isoforms, but at the level of genes or protein entries. Therefore, by default, all the matching isoforms that belong to the same entry are merged, and matches are displayed as neXtProt entry accession numbers followed by the corresponding gene names. A button in each box allows the user to toggle between this default view and the isoform view, which displays matches at the level of isoform sequences and, in the case of additional mappings due to variants, the variant involved. A color code allows entry-specific peptides (in green) to be quickly distinguished from peptides matching several entries (in blue). Filters allow only entry-specific peptides to be displayed, either taking variants into account or not. The results displayed can be downloaded in CSV (comma-separated values) format.
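The exact-match validation and the entry-level merging described above can be sketched as below. This is a simplified illustration under stated assumptions: the function names, the isoform identifiers and the isoform-to-entry mapping are hypothetical, a local substring test stands in for the exact string search performed through the neXtProt API, and the variant validation step is omitted.

```python
def validate_matches(peptide, candidate_ids, sequences):
    """Keep only the candidates whose sequence contains the whole peptide."""
    return {sid for sid in candidate_ids if peptide in sequences[sid]}


def classify(isoform_ids, isoform_to_entry):
    """Merge isoforms by entry and report whether the peptide is entry-specific."""
    entries = sorted({isoform_to_entry[i] for i in isoform_ids})
    if not entries:
        status = "no match"
    elif len(entries) == 1:
        status = "unique"
    else:
        status = "not unique"
    return entries, status


# Hypothetical data: B-1 contains both 6-mers of ACDEFGH (ACDEFG and CDEFGH),
# but not contiguously, so a 6-mer lookup alone would report a false positive.
sequences = {
    "A-1": "MMACDEFGHKK",
    "A-2": "MMACDEFGHRR",
    "B-1": "ACDEFGKKCDEFGH",
}
isoform_to_entry = {"A-1": "A", "A-2": "A", "B-1": "B"}

kept = validate_matches("ACDEFGH", {"A-1", "A-2", "B-1"}, sequences)
print(sorted(kept))
print(classify(kept, isoform_to_entry))
```

In this toy case the exact search discards `B-1`, and the two remaining isoforms collapse into a single entry, so the peptide is reported as unique at the entry level even though it matches two isoform sequences.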
Fig. 1
neXtProt peptide uniqueness checker workflow. Of the two peptides submitted in this example, one is entry-specific (left) while the other loses its specificity if variants are taken into account (right)
The use of this tool is recommended in the latest HPP guidelines (Deutsch ).