The neXtProt peptide uniqueness checker: a tool for the proteomics community.

Mathieu Schaeffer1, Alain Gateau2, Daniel Teixeira2, Pierre-André Michel2, Monique Zahn-Zabal2, Lydie Lane1,2.   

Abstract

SUMMARY: The neXtProt peptide uniqueness checker allows scientists to define which peptides can be used to validate the existence of human proteins, i.e. map uniquely versus multiply to human protein sequences taking into account isobaric substitutions, alternative splicing and single amino acid variants.
AVAILABILITY AND IMPLEMENTATION: The pepx program is available at https://github.com/calipho-sib/pepx and can be launched from the command line or through a CGI web interface. Indexing requires a sequence file in FASTA format. The peptide uniqueness checker tool is freely available on the web at https://www.nextprot.org/tools/peptide-uniqueness-checker and from the neXtProt API at https://api.nextprot.org/.
CONTACT: lydie.lane@sib.swiss.
© The Author(s) 2017. Published by Oxford University Press.

Year:  2017        PMID: 28520855      PMCID: PMC5860159          DOI: 10.1093/bioinformatics/btx318

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Most proteomics experiments that aim to identify proteins in complex samples rely on proteolytic digestion, followed by separation of the resulting peptides by liquid chromatography and their identification by tandem mass spectrometry (MS). Since the link between peptides and their protein precursors is lost, peptide-to-protein mappings must be obtained and critically evaluated, even when there is strong evidence that peptide identifications are correct. To ensure that a peptide maps unambiguously to a protein, isobaric substitutions of isoleucine for leucine, which cannot be distinguished by most current MS techniques, must be taken into account, as well as possible sequence variations arising from single amino acid variants (SAAVs) and alternative splicing. This is especially important for projects such as the HUPO Human Proteome Project (HPP), whose aim is to experimentally validate the existence of in silico-predicted human proteins (Omenn et al.). The neXtProt knowledgebase on human proteins currently contains >42 000 isoforms produced by alternative splicing and >5 million variants (Gaudet et al.), which together generate a very large number of possible proteoforms. There is currently no tool that takes this diversity into account when mapping peptides to proteins. The peptide uniqueness checker tool on the neXtProt platform was developed to meet this need.

2 Materials and methods

The peptide indexer (pepx) maps proteomics peptides to a given set of protein sequences using an n-mer-based index: all protein sequences are scanned with a sliding window of n amino acids, and each n-mer found is written to an index that points to the list of identifiers of the sequences containing it. Peptides to be mapped are scanned with the same n-mer sliding window. For a match to be returned, all n-mers of a peptide must be found in the same protein isoform sequence. This method allows protein variations to be taken into account comprehensively, while minimizing combinatorial explosion. For proteomic peptides from shotgun MS experiments, which are typically 7–30 aa in length, the 6-mer index offers the best trade-off between speed, memory and performance. The pepx program also builds indexes for 3-, 4- and 5-mers, which can for instance be used to search for short linear motifs. To increase confidence in uniqueness, pepx can also build special indexes in which the isobaric amino acids I (Ile) and L (Leu) are merged and replaced with the ambiguity code J. A peptide is therefore not wrongly reported as unique when a peptide differing only by I/L substitutions exists in a different protein.
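The indexing and lookup principle described above can be illustrated with a minimal Python sketch. It is not the pepx implementation: the function names, the in-memory dictionary and the toy sequences (isoA, isoB, isoC) are assumptions made purely for illustration.

```python
from collections import defaultdict

N = 6  # window size; the text reports 6-mers as the best trade-off


def normalize(seq, merge_il=True):
    """Optionally replace I and L with the ambiguity code J."""
    return seq.replace("I", "J").replace("L", "J") if merge_il else seq


def nmers(seq, n=N):
    """Yield every n-mer found by a sliding window over seq."""
    for i in range(len(seq) - n + 1):
        yield seq[i:i + n]


def build_index(sequences, n=N, merge_il=True):
    """Map each n-mer to the set of identifiers of sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for kmer in nmers(normalize(seq, merge_il), n):
            index[kmer].add(seq_id)
    return index


def candidate_matches(peptide, index, n=N, merge_il=True):
    """Return identifiers whose sequence contains all n-mers of the peptide."""
    kmers = list(nmers(normalize(peptide, merge_il), n))
    if not kmers:
        return set()  # a peptide shorter than n cannot be looked up
    hits = set(index.get(kmers[0], set()))
    for kmer in kmers[1:]:
        hits &= index.get(kmer, set())
    return hits


# Hypothetical toy sequences; isoA and isoB differ only by I/L substitutions.
toy = {
    "isoA": "MKTAYIAKQRQISFVK",
    "isoB": "MKTAYLAKQRQLSFVK",
    "isoC": "GGGSFVKQRQIAYIAKMKT",
}
merged_index = build_index(toy, merge_il=True)
plain_index = build_index(toy, merge_il=False)
```

With the I/L-merged index, the peptide TAYIAKQR is reported against both isoA and isoB, whereas the plain index would suggest that it matches isoA only.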

3 Usage

As the reference knowledgebase for HPP, neXtProt validates the existence of human proteins based on several criteria, including peptide identification data from mass spectrometry-based proteomics experiments. According to the latest HPP guidelines, a protein is validated if two unique peptides of at least 9 aa in length are reported (Deutsch et al.). At each neXtProt release, all splice isoforms are indexed, and pepx is used to assess peptide uniqueness: a peptide is considered to be unique if all the matching isoform sequences derive from a single neXtProt entry. For the so-called ‘missing’ proteins, for which no experimental evidence has previously been reported, HPP requests that peptide uniqueness be further checked by taking all possible SAAVs into account (Deutsch et al.). Pepx can be used for this check with an index containing the >5 million SAAVs from neXtProt; the index building step takes a few hours and index size is about 4 Gb. Only a single substitution per 6-mer is accepted, to keep the index at a manageable size. For example, PNVLLA with known variants P->L, P->S and V->A will generate entries for sequences PNVLLA, LNVLLA, SNVLLA and PNALLA, but not SNALLA. The position of the SAAV within the 6-mer is recorded in the indexes, allowing the variant path to be rapidly reconstructed when displaying matches. Since pepx has successfully been used to validate the identification of missing proteins in human sperm (Vandenbrouck et al.), the HPP board requested that this tool be accessible to the whole human proteomics community. A dedicated web interface, the neXtProt peptide uniqueness checker, and a corresponding API (application programming interface) service have been developed. The list of peptides, which can be typed in a text area or imported from a text file, is sent via an HTTP request to the API (Fig. 1(1)), which then queries the 6-mer index created by pepx (Fig. 1(2)) and returns the identifiers of the isoforms matched (Fig. 1(3)).
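The single-substitution rule can be made concrete with a short Python sketch. The function name and the (position, reference, variant) tuple layout are assumptions chosen for this example and do not reflect the actual pepx index format.

```python
def expand_with_single_variants(kmer, variants):
    """Return (sequence, variant_position) index entries for one n-mer.

    variants is a list of (position, reference_aa, variant_aa) tuples, with
    0-based positions inside the n-mer. At most one substitution is applied
    per entry; the position is None for the unmodified n-mer.
    """
    entries = [(kmer, None)]
    for pos, ref, alt in variants:
        if kmer[pos] != ref:
            continue  # the variant does not apply to this window
        entries.append((kmer[:pos] + alt + kmer[pos + 1:], pos))
    return entries


# The example from the text: PNVLLA with known variants P->L, P->S and V->A.
known_variants = [(0, "P", "L"), (0, "P", "S"), (2, "V", "A")]
entries = expand_with_single_variants("PNVLLA", known_variants)
sequences_only = [seq for seq, _ in entries]
```

Because every entry carries at most one substitution, the number of entries per 6-mer grows linearly with the number of known variants rather than combinatorially, which is why SNALLA (two substitutions) is never generated.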
Because pepx retrieves isoform sequences that map to the 6-mers included in a given peptide and not to the full extent of the peptide, it occasionally returns false positives. An exact string search using the neXtProt API is performed as a validation step to ensure that the entire sequence of the peptide is present in the retrieved sequences. Another validation step is performed to check that the variant at the indicated position in the matching entry exists and justifies the match (Fig. 1(4)). The validated matches (Fig. 1(5)) are sent from the API server to the client in JSON (JavaScript Object Notation) format (Fig. 1(6)). The results are displayed in boxes, each box containing the resulting matches for one peptide, taking into account SAAVs (lower panel) or not (upper panel). Usually, in MS data analysis, peptide uniqueness is not evaluated at the level of protein isoforms, but at the level of genes or protein entries. Therefore, by default, all the matching isoforms that belong to the same entry are merged, and matches are displayed as neXtProt entry accession numbers followed by the corresponding gene names. A button in each box allows the user to toggle between this default view and the isoform view, which displays matches at the level of isoform sequences and, in the case of additional mappings due to variants, the variant involved. A color code allows entry-specific peptides (in green) to be quickly distinguished from peptides matching several entries (in blue). Appropriate filters allow only entry-specific peptides to be displayed, either taking variants into account or not. The results displayed can be downloaded in CSV (comma-separated values) format.
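The exact-match validation and the entry-level merging described above could be sketched as follows. This is an illustrative Python sketch, not the neXtProt API code: the function names are invented, the accessions and sequences are hypothetical, and it assumes that an isoform accession is the entry accession followed by a hyphen and an isoform number.

```python
def to_j(seq):
    """Replace the isobaric residues I and L with the ambiguity code J."""
    return seq.replace("I", "J").replace("L", "J")


def confirm_matches(peptide, candidates, sequences):
    """Keep only candidates whose sequence contains the whole peptide."""
    target = to_j(peptide)
    return sorted(c for c in candidates if target in to_j(sequences[c]))


def entry_of(isoform_id):
    """Assumed accession layout: '<entry>-<isoform number>'."""
    return isoform_id.rsplit("-", 1)[0]


def classify(isoform_ids):
    """Merge isoforms into entries and report entry-level specificity."""
    entries = sorted({entry_of(i) for i in isoform_ids})
    if not entries:
        status = "no match"
    elif len(entries) == 1:
        status = "entry-specific"
    else:
        status = "multiple entries"
    return status, entries


# Hypothetical isoforms. The third one contains every 6-mer of ACDEFGHK
# (ACDEFG, CDEFGH, DEFGHK) but never the full peptide, so a 6-mer lookup
# alone would report it as a false positive.
toy = {
    "NX_X00001-1": "MACDEFGHKR",
    "NX_X00001-2": "MACDEFGHKW",
    "NX_X00002-1": "ACDEFGHWWDEFGHK",
}
confirmed = confirm_matches("ACDEFGHK", list(toy), toy)
result = classify(confirmed)
```

In this toy case the false positive is discarded by the exact search, and the two remaining isoforms collapse into a single entry, so the peptide is classified as entry-specific.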
Fig. 1. neXtProt peptide uniqueness checker workflow. Of the two peptides submitted in this example, one is entry-specific (left) while the other loses its specificity if variants are taken into account (right)

The use of this tool is recommended in the latest HPP guidelines (Deutsch et al.).
References

1.  Metrics for the Human Proteome Project 2016: Progress on Identifying and Characterizing the Human Proteome, Including Post-Translational Modifications.

Authors:  Gilbert S Omenn; Lydie Lane; Emma K Lundberg; Ronald C Beavis; Christopher M Overall; Eric W Deutsch
Journal:  J Proteome Res       Date:  2016-09-20       Impact factor: 4.466

2.  Looking for Missing Proteins in the Proteome of Human Spermatozoa: An Update.

Authors:  Yves Vandenbrouck; Lydie Lane; Christine Carapito; Paula Duek; Karine Rondel; Christophe Bruley; Charlotte Macron; Anne Gonzalez de Peredo; Yohann Couté; Karima Chaoui; Emmanuelle Com; Alain Gateau; Anne-Marie Hesse; Marlene Marcellin; Loren Méar; Emmanuelle Mouton-Barbosa; Thibault Robin; Odile Burlet-Schiltz; Sarah Cianferani; Myriam Ferro; Thomas Fréour; Cecilia Lindskog; Jérôme Garin; Charles Pineau
Journal:  J Proteome Res       Date:  2016-08-23       Impact factor: 4.466

3.  Human Proteome Project Mass Spectrometry Data Interpretation Guidelines 2.1.

Authors:  Eric W Deutsch; Christopher M Overall; Jennifer E Van Eyk; Mark S Baker; Young-Ki Paik; Susan T Weintraub; Lydie Lane; Lennart Martens; Yves Vandenbrouck; Ulrike Kusebauch; William S Hancock; Henning Hermjakob; Ruedi Aebersold; Robert L Moritz; Gilbert S Omenn
Journal:  J Proteome Res       Date:  2016-08-24       Impact factor: 4.466

4.  The neXtProt knowledgebase on human proteins: 2017 update.

Authors:  Pascale Gaudet; Pierre-André Michel; Monique Zahn-Zabal; Aurore Britan; Isabelle Cusin; Marcin Domagalski; Paula D Duek; Alain Gateau; Anne Gleizes; Valérie Hinard; Valentine Rech de Laval; JinJin Lin; Frederic Nikitin; Mathieu Schaeffer; Daniel Teixeira; Lydie Lane; Amos Bairoch
Journal:  Nucleic Acids Res       Date:  2016-11-29       Impact factor: 16.971
