
The neXtProt knowledgebase on human proteins: 2017 update.

Pascale Gaudet^1,2, Pierre-André Michel³, Monique Zahn-Zabal³, Aurore Britan³, Isabelle Cusin³, Marcin Domagalski^3,2, Paula D Duek³, Alain Gateau³, Anne Gleizes³, Valérie Hinard³, Valentine Rech de Laval^3,2, JinJin Lin⁴, Frederic Nikitin³, Mathieu Schaeffer^3,2, Daniel Teixeira³, Lydie Lane^3,2, Amos Bairoch^3,2.

Abstract

The neXtProt human protein knowledgebase (https://www.nextprot.org) continues to add new content and tools, with a focus on proteomics and genetic variation data. neXtProt now has proteomics data for over 85% of human proteins, as well as new tools tailored to the proteomics community. Moreover, the neXtProt release 2016-08-25 includes over 8000 phenotypic observations for over 4000 variations in a number of genes involved in hereditary cancers and channelopathies. These changes are presented in the current neXtProt update. All of the neXtProt data are available via our user interface and FTP site. We also provide API access and a SPARQL endpoint for more technical applications.


Year: 2016 PMID: 27899619 PMCID: PMC5210547 DOI: 10.1093/nar/gkw1062

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

neXtProt (https://www.nextprot.org; (1–3)) is a knowledge platform that represents the current state of knowledge on human proteins. It complements UniProtKB (4) by extending the content and tools, with the aim of supporting applications specifically relevant to human proteins. To integrate data accurately and develop tools that are relevant to users, we collaborate closely with all our data and tool providers. neXtProt values high-quality data, which is achieved by manual annotation or review of all data and a stringent quality control process. Whenever possible, we provide our data, tools and code freely to all users. Since our last neXtProt update (1), we have continued to expand the database, focusing mainly on proteomics and genetic variations. One major new development is a large set of manual annotations that we have created to capture the phenotypic effect of genetic variations as described in the literature, as well as a corresponding new view that presents the impact of changes in the amino acid sequence on various characteristics of the protein. We have also developed tools to support our proteomics user community. This article describes the major features of our most recent release.

neXtProt contents overview

Our main data sources are listed in Table 1: UniProtKB (4), Bgee (5), HPA (6), PeptideAtlas (7), SRMAtlas (8), GOA (9), dbSNP (10), Ensembl (11), COSMIC (12), DKFZ GFP-cDNA localization (13,14), Weizmann Institute of Science's Kahn Dynamic Proteomics Database (15) and IntAct (16). In the past year we have worked closely with PeptideAtlas (17) to integrate their processed data (2016-01 build), including new phosphorylation data (2015-09 build). Moreover, ADP-ribosylation sites have been integrated for the first time (18), and new acetylation sites with their corresponding peptides have also been loaded (19). With these data, neXtProt now contains 142,453 post-translational modification sites and 1,150,170 peptides. Our own curation efforts have also led to the annotation of more than 4000 variants associated with more than 8000 observations at the molecular, cellular and organism levels. These data can be viewed on our user interface in a new phenotype view (see Section II) in each annotated entry, or on our new ‘Portal’ pages (see Section III).

Table 1.

Data content of the neXtProt 2016-08-25 release

Entries	Statistics	Change since previous release	Source
Protein entries / isoforms	22 061/42 024	+6/+32	UniProtKB
Binary interactions	140 270	+18 351	IntAct
Post-translational modifications (PTMs)	142 453	+172	PeptideAtlas, UniProtKB, neXtProt
Variants (including disease mutations)	4 943 914	+2 461 938	UniProtKB, COSMIC, dbSNP
Entries with an experimental 3D structure	5740	+119	PDB via UniProtKB
Entries with proteomics data	17 279	+340	PeptideAtlas
Entries with a disease	3916	+336	UniProtKB
Phenotypic annotations	8014	+8014	neXtProt
Cited publications	99 922	+22 850	All resources

Phenotypic impact of genetic variations

In the course of two separate projects to characterize genetic variants involved in hereditary cancers and channelopathies, we have annotated the phenotypic impact of genetic variants for proteins with known causative roles in these diseases: BRCA1, BRCA2, EPCAM, MLH1, MLH3, MSH2, MSH6, PMS2, SCN1A, SCN2A, SCN3A, SCN4A, SCN5A, SCN8A, SCN9A, SCN10A and SCN11A. The neXtProt variation phenotypes are curated in a highly structured model with complete traceability to the original experimental results. Our annotation statements are triplets composed of (i) a subject, corresponding to the protein variation being annotated; (ii) a predicate (or relation) describing how the object is affected (Table 2); and (iii) an object describing the phenotype tested. This phenotype can correspond to the protein's molecular function or its localization (captured with Gene Ontology (GO) terms (20)), effects at the level of the organism (captured with the mammalian phenotype ontology (21)), interactions with proteins (represented by neXtProt entries) or small molecules (captured using the ChEBI dictionary of molecular entities (22)). Changes in protein or mRNA stability are captured with in-house vocabularies available on our FTP site: ftp://ftp.nextprot.org/pub/current_release/controlled_vocabularies/cv_protein_property.obo. Finally, for ion channels, the impact on electrophysiological properties is captured with the ICEPO ontology (23). Experimental evidence for each statement is provided, including a reference, an evidence code from the Evidence and Conclusion Ontology (24), and, importantly, a qualitative assessment of the phenotype intensity: mild, moderate or severe. Variations producing no significant difference compared to the wild-type are annotated as having no impact. A detailed description of our annotation model and data will be published elsewhere.
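The triplet model described above can be pictured as a small data structure. The following is a minimal sketch only, not the neXtProt schema: the class and field names are assumptions made for illustration, and the example values (variant name, reference, evidence code) are placeholders rather than real neXtProt annotations. Only the relation names (Table 2) and the three intensity levels come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Relation(Enum):
    """Predicates listed in Table 2."""
    NO_IMPACT = "No impact"
    IMPACT = "Impact"
    INCREASE = "Increase"
    DECREASE = "Decrease"
    GAIN_OF_FUNCTION = "Gain of function"


class Intensity(Enum):
    """Qualitative phenotype intensity levels named in the text."""
    MILD = 1
    MODERATE = 2
    SEVERE = 3


@dataclass(frozen=True)
class Evidence:
    reference: str                  # publication identifier (placeholder below)
    eco_code: str                   # Evidence and Conclusion Ontology code (placeholder below)
    intensity: Optional[Intensity]  # assumption: left empty for "No impact" statements


@dataclass(frozen=True)
class PhenotypeAnnotation:
    subject: str        # the protein variation being annotated
    relation: Relation  # how the object is affected
    obj: str            # the phenotype tested (e.g. a GO term label)
    evidence: Evidence

    def describe(self) -> str:
        """Render the statement as 'subject | relation | object (intensity)'."""
        text = f"{self.subject} | {self.relation.value} | {self.obj}"
        if self.evidence.intensity is not None:
            text += f" ({self.evidence.intensity.name.lower()})"
        return text


# Hypothetical example; none of these values are taken from neXtProt.
example = PhenotypeAnnotation(
    subject="GENE1-p.Ala10Val",
    relation=Relation.DECREASE,
    obj="ATPase activity",
    evidence=Evidence("PMID:placeholder", "ECO:placeholder", Intensity.SEVERE),
)
print(example.describe())
```

Keeping the evidence in its own object mirrors the statement that each annotation remains traceable to an experimental result; a fuller model would probably allow several evidence items per statement.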

Table 2.

Relations used in the neXtProt phenotype annotations

Relation	Definition
No impact	No significant effect observed compared to wild-type control
Impact	A significant effect observed compared to wild-type control
Increase	A significant increase observed in a measured parameter compared to wild-type control
Decrease	A significant decrease observed in a measured parameter compared to wild-type control
Gain of function	Mutant protein acquires a property absent from the wild-type (new substrate, new cellular localization, etc.)

New phenotype view

We have deployed a new view to display the phenotype data. The view contains two main sections: Phenotypes and Variants. The Phenotypes section (Figure 1A) lists the different phenotypes observed for the protein, grouped by object type (GO molecular function, biological process, cellular component, binary interaction, protein property and mammalian phenotype), as well as the number of phenotypes in each group. Clicking on the object type opens the list of phenotypes annotated in that category. For example, the effect of MSH6 variants on four different GO molecular functions has been tested: ATPase activity, mismatched DNA binding, DNA clamp loader activity and adenyl-nucleotide exchange factor activity. The number of variants assayed for each specific phenotype is given: as shown in Figure 1A, nine MSH6 variants have been tested for ATPase activity, including Ser144Ile, Ser285Ile and Gly566Arg. Note that the names are shown with the specific isoform selected: for MSH6, the isoform displayed by default is MSH6-isoGTBP-N.

Figure 1.

New Phenotype View for the MSH6 entry. (A) Phenotypes section. (B) Variants section.

Any number of phenotypes can be selected. The variants associated with these phenotypes can be viewed by clicking on either of the ‘Apply filter’ buttons, at the top and bottom right corners of the phenotype box. The Variants section (Figure 1B) lists all variants ordered by their position along the sequence. Different depths of information can be viewed. At the top-most level, each variant name is shown with the intensity of its most deleterious phenotype on the right. The phenotype(s) assayed for a variant can be viewed by clicking on the arrow on the left; each phenotype can in turn be expanded to show the experimental evidence supporting the annotation: an evidence code (‘experimental evidence’, or ‘sequence similarity evidence’ for experiments carried out in non-human model systems), whether the evidence is Gold or Silver (general criteria are described in (1)), the intensity of the phenotype, the species from which the test protein was derived, and the reference. Alternatively, all details can be opened for all the variants at once by clicking the ‘Expand all’ button.
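The top-level summary of the variant list (variants ordered by sequence position, each shown with its most deleterious phenotype intensity) can be sketched as follows. This is an illustration under stated assumptions, not the website's code: the function names, the tuple layout and the numeric severity ranking are hypothetical, and the intensities attached to the three MSH6 variant names are invented for the example.

```python
import re

# Assumed ranking of intensities, from least to most deleterious.
SEVERITY = {"no impact": 0, "mild": 1, "moderate": 2, "severe": 3}


def variant_position(name):
    """Extract the residue position from a name such as 'Ser144Ile'."""
    match = re.search(r"\d+", name)
    if match is None:
        raise ValueError(f"no position found in variant name: {name!r}")
    return int(match.group())


def summarize(observations):
    """Return (variant, worst intensity) pairs ordered by sequence position.

    observations: iterable of (variant_name, phenotype, intensity) tuples.
    """
    worst = {}
    for variant, _phenotype, intensity in observations:
        current = worst.get(variant)
        if current is None or SEVERITY[intensity] > SEVERITY[current]:
            worst[variant] = intensity
    return sorted(worst.items(), key=lambda item: variant_position(item[0]))


# Made-up intensities, for illustration only.
observations = [
    ("Gly566Arg", "ATPase activity", "severe"),
    ("Ser144Ile", "ATPase activity", "mild"),
    ("Ser144Ile", "mismatched DNA binding", "moderate"),
    ("Ser285Ile", "ATPase activity", "no impact"),
]
for variant, intensity in summarize(observations):
    print(variant, intensity)
```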

Variant portals

For convenience, our phenotypic data are also available on our new data portals: the neXtProt Cancer variants portal and the Ion channels variants portal, both accessible from the top menu ‘Portals’. These portals contain the variants, their associated molecular, cellular and organism level phenotypes, and the associated experimental evidence for each observation (Figure 2). These data are presented in table format, in which each column is searchable and sortable. The data can also be downloaded in CSV format, copied or printed.

Figure 2.

Excerpt of the Ion channel variants portal.

Improvements to proteomics data representation and peptide unicity checker

In entries having proteomics data, we have implemented a new peptide view that lists all the peptides that match the entry, their position on the sequence, an indication as to whether or not they are unique to that entry (‘proteotypic’), and whether they have been detected in biological samples (‘natural’) or chemically synthesized as reagents for selected reaction monitoring experiments (‘synthetic’). One important need of the proteomics community is to be able to determine whether a peptide is unique to a protein or not. The ‘unicity checker’, which uses the pepx program (available at https://github.com/calipho-sib/pepx), was designed to help scientists determine which peptides are unambiguous and can thus be used to confidently identify protein entries. The ‘additional mappings with known variants’ mode takes into account all the SNPs and disease mutations in neXtProt to increase the search space. Making use of this tool for mass spectrometry data interpretation is now part of the recommendations of the HUPO Human Proteome Project (25). An example of the unicity checker user interface is shown in Figure 3.
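The idea behind a unicity check can be illustrated with a toy example. The sketch below is not the pepx algorithm, whose indexing strategy is not described here; it uses plain substring matching over a hypothetical three-entry "proteome" with made-up accessions and sequences, and models known variants as single amino acid substitutions. It shows how taking variants into account enlarges the search space, so that a peptide that looks unique against reference sequences may no longer be unambiguous.

```python
def apply_substitution(sequence, position, alt):
    """Return the sequence with the residue at a 1-based position replaced."""
    return sequence[:position - 1] + alt + sequence[position:]


def matching_entries(peptide, proteome, variants=None):
    """Return the sorted accessions whose sequence contains the peptide.

    proteome: dict mapping accession -> reference sequence.
    variants: optional dict mapping accession -> list of (position, alt)
              substitutions; each one is tested on its own.
    """
    hits = set()
    for accession, sequence in proteome.items():
        candidates = [sequence]
        for position, alt in (variants or {}).get(accession, []):
            candidates.append(apply_substitution(sequence, position, alt))
        if any(peptide in candidate for candidate in candidates):
            hits.add(accession)
    return sorted(hits)


def is_unique(peptide, proteome, variants=None):
    """A peptide is unique when it maps to exactly one entry."""
    return len(matching_entries(peptide, proteome, variants)) == 1


# Hypothetical accessions and sequences, for illustration only.
proteome = {
    "ENTRY_A": "MKTAYIAKQR",
    "ENTRY_B": "MKTLLVAGSR",
    "ENTRY_C": "MSDNGELEDK",
}
known_variants = {"ENTRY_B": [(4, "A")]}  # Leu4Ala in ENTRY_B

print(is_unique("MKTA", proteome))                  # reference sequences only
print(is_unique("MKTA", proteome, known_variants))  # variants taken into account
```

In this toy data set, "MKTA" maps only to ENTRY_A when reference sequences are searched, but also to ENTRY_B once its substitution is applied, so it would not be a safe identifier in the variant-aware mode.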

Figure 3.

Peptide unicity checker user interface. A list of peptides provided by the user (separated by a space, a comma, a semi-colon or a carriage return) is checked for unicity in neXtProt sequences by clicking the ‘Check’ button. Users have the option to take variants into account when determining unicity.
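The input format accepted by the form (peptides separated by a space, a comma, a semi-colon or a carriage return) is easy to reproduce when preparing a list programmatically. The helper below is a hypothetical sketch, not part of neXtProt; it splits on any run of those separators and discards empty tokens.

```python
import re


def parse_peptides(text):
    """Split user input on whitespace (including line breaks), commas and semi-colons."""
    tokens = re.split(r"[\s,;]+", text.strip())
    return [token for token in tokens if token]


# Made-up peptide strings, for illustration only.
print(parse_peptides("AYIAK, MKTA;LVAGSR\r\nKQR"))
```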

Revamped neXtProt website

Our latest release also features a renewed user interface. The neXtProt home page has been re-organized so that information and data are easier to find and use. Navigation on our website has also been re-organized so that the user has access to all the neXtProt content via menus in the page headers and footers. The new header menu offers quick access to tools, data and new documentation describing the neXtProt data model, how to access data (instructions for searching and downloading) and how to use our tools and API. The About menu provides information regarding neXtProt, the human proteome, the current data release (including the best evidence for protein existence of entries, broken down by chromosome) and how to cite neXtProt.

Data and software availability

All neXtProt annotations are available as XML and PEFF files (3) on our FTP site (ftp://ftp.nextprot.org/). Note that our XML format has changed to accommodate the new phenotypic data. Changes are documented in a comment at the beginning of the new XSD file (version 2), also on the FTP site. The former XML files are no longer provided due to technical constraints. Annotations can also be accessed through our API at https://api.nextprot.org and our SPARQL endpoint (https://www.nextprot.org/proteins/search?mode=advanced). The Cellosaurus, a knowledge resource on cell lines, is available at ftp://ftp.expasy.org/databases/cellosaurus/ (note that the files are no longer on the neXtProt FTP site). The neXtProt content is available under the Creative Commons Attribution-NoDerivs License. Our software is freely available from the GitHub repository (https://github.com/calipho-sib) or biojs (http://www.biojs.io/), as described in our documentation (https://www.nextprot.org/help/technical-corner/).

CONCLUSION

The neXtProt human protein knowledgebase integrates data to provide comprehensive, up-to-date, high-quality information, organized to give scientists around the world a resource that facilitates their research. neXtProt is continually evolving and, in terms of content, the focus in the immediate future will remain the incorporation of new variant and proteomics data. Concerning the quality of the data, global checks complementing the spot checks described in our previous paper (1) have been introduced. While content is important, it is just as critical that users be able to view, analyze and export the data in a useful manner. We have thus developed a number of tools, starting with the private lists and queries, the latest being the peptide unicity checker. We recently started introducing more user interaction features in our web interface to improve usability. In the new Phenotype view, the user can select the content that should be displayed in both the Phenotypes and Variants sections, thereby focusing exclusively on the data of interest. Another new feature that gives the user more flexibility is the possibility to download all or part of the data for the entry currently being viewed, in XML, JSON or FASTA format, by simply clicking on the ‘download’ icon at the top right of each page. While all these developments are undertaken in the hope of improving user access to and use of our knowledgebase, we count on feedback from our users to provide a high-quality resource.

25 in total

1. A microscope-based screening platform for large-scale functional protein analysis in intact cells.

Authors: Urban Liebel; Vytaute Starkuviene; Holger Erfle; Jeremy C Simpson; Annemarie Poustka; Stefan Wiemann; Rainer Pepperkok
Journal: FEBS Lett Date: 2003-11-20 Impact factor: 4.124

2. Towards a knowledge-based Human Protein Atlas.

Authors: Mathias Uhlen; Per Oksvold; Linn Fagerberg; Emma Lundberg; Kalle Jonasson; Mattias Forsberg; Martin Zwahlen; Caroline Kampf; Kenneth Wester; Sophia Hober; Henrik Wernerus; Lisa Björling; Fredrik Ponten
Journal: Nat Biotechnol Date: 2010-12 Impact factor: 54.908

3. Generation of a fluorescently labeled endogenous protein library in living human cells.

Authors: Alex Sigal; Tamar Danon; Ariel Cohen; Ron Milo; Naama Geva-Zatorsky; Gila Lustig; Yuvalal Liron; Uri Alon; Natalie Perzov
Journal: Nat Protoc Date: 2007 Impact factor: 13.491

4. neXtProt: organizing protein knowledge in the context of human proteome projects.

Authors: Pascale Gaudet; Ghislaine Argoud-Puy; Isabelle Cusin; Paula Duek; Olivier Evalet; Alain Gateau; Anne Gleizes; Mario Pereira; Monique Zahn-Zabal; Catherine Zwahlen; Amos Bairoch; Lydie Lane
Journal: J Proteome Res Date: 2012-12-03 Impact factor: 4.466

5. neXtProt: a knowledge platform for human proteins.

Authors: Lydie Lane; Ghislaine Argoud-Puy; Aurore Britan; Isabelle Cusin; Paula D Duek; Olivier Evalet; Alain Gateau; Pascale Gaudet; Anne Gleizes; Alexandre Masselot; Catherine Zwahlen; Amos Bairoch
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

6. The neXtProt knowledgebase on human proteins: current status.

Authors: Pascale Gaudet; Pierre-André Michel; Monique Zahn-Zabal; Isabelle Cusin; Paula D Duek; Olivier Evalet; Alain Gateau; Anne Gleizes; Mario Pereira; Daniel Teixeira; Ying Zhang; Lydie Lane; Amos Bairoch
Journal: Nucleic Acids Res Date: 2015-01 Impact factor: 16.971

7. COSMIC: exploring the world's knowledge of somatic mutations in human cancer.

Authors: Simon A Forbes; David Beare; Prasad Gunasekaran; Kenric Leung; Nidhi Bindal; Harry Boutselakis; Minjie Ding; Sally Bamford; Charlotte Cole; Sari Ward; Chai Yin Kok; Mingming Jia; Tisham De; Jon W Teague; Michael R Stratton; Ultan McDermott; Peter J Campbell
Journal: Nucleic Acids Res Date: 2014-10-29 Impact factor: 16.971

8. Uncovering hidden duplicated content in public transcriptomics data.

Authors: Marta Rosikiewicz; Aurélie Comte; Anne Niknejad; Marc Robinson-Rechavi; Frederic B Bastian
Journal: Database (Oxford) Date: 2013-03-13 Impact factor: 3.451

9. Standardized description of scientific evidence using the Evidence Ontology (ECO).

Authors: Marcus C Chibucos; Christopher J Mungall; Rama Balakrishnan; Karen R Christie; Rachael P Huntley; Owen White; Judith A Blake; Suzanna E Lewis; Michelle Giglio
Journal: Database (Oxford) Date: 2014-07-22 Impact factor: 3.451

10. The Ensembl gene annotation system.

Authors: Bronwen L Aken; Sarah Ayling; Daniel Barrell; Laura Clarke; Valery Curwen; Susan Fairley; Julio Fernandez Banet; Konstantinos Billis; Carlos García Girón; Thibaut Hourlier; Kevin Howe; Andreas Kähäri; Felix Kokocinski; Fergal J Martin; Daniel N Murphy; Rishi Nag; Magali Ruffier; Michael Schuster; Y Amy Tang; Jan-Hinnerk Vogel; Simon White; Amonida Zadissa; Paul Flicek; Stephen M J Searle
Journal: Database (Oxford) Date: 2016-06-23 Impact factor: 3.451
