Update on activities at the Universal Protein Resource (UniProt) in 2013.

Abstract

The mission of the Universal Protein Resource (UniProt) (http://www.uniprot.org) is to support biological research by providing a freely accessible, stable, comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase. It integrates, interprets and standardizes data from numerous resources to achieve the most comprehensive catalogue of protein sequences and functional annotation. UniProt comprises four major components, each optimized for different uses: the UniProt Archive, the UniProt Knowledgebase, the UniProt Reference Clusters and the UniProt Metagenomic and Environmental Sequence Database. UniProt is produced by the UniProt Consortium, which consists of groups from the European Bioinformatics Institute (EBI), the SIB Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR). UniProt is updated and distributed every 4 weeks and can be accessed online for searches or downloads.

Year: 2012 PMID: 23161681 PMCID: PMC3531094 DOI: 10.1093/nar/gks1068

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

UniProt’s goal is to provide the most comprehensive resource for protein sequence and functional annotation. The four UniProt databases are optimized for different uses as follows: the UniProt Knowledgebase (UniProtKB) is an expertly curated database; the UniProt Archive (UniParc) (1) is a comprehensive sequence repository, reflecting the history of all protein sequences not only in UniProtKB but also in all source databases; the UniProt Reference Clusters (UniRef) merge closely related sequences based on sequence identity to facilitate sequence similarity searches (2); and the UniProt Metagenomic and Environmental Sequence (UniMES) database was created to cater for the developing area of metagenomics. The aim of this article is to provide a status report on UniProt activities and on some of our plans for the near future, which will enable us to continue to play a critical role in bioinformatics discovery in the genomic and proteomic era.

NEW AND ONGOING DEVELOPMENTS

UniProtKB reorganization

As the cost of sequencing continues to fall, the number of organisms with complete proteomes in UniProtKB is increasing. It is also becoming more and more common for several groups to sequence the complete proteome of the same organism or of multiple strains of an organism. This means that users are presented with an increasingly large data set, which can be difficult to navigate and is largely redundant in biological knowledge. In response, the UniProt Consortium is developing a concept to provide a set of sequences from selected species based on UniProtKB manually reviewed entries, the Reference Proteomes and the Representative proteomes (3,4). We re-evaluated our manual annotation priorities and re-defined our organism focus list. For more information, please see http://www.uniprot.org/program. Curators continue to define complete proteomes and reference proteomes as they become available. To ensure comprehensiveness, several changes were required in the UniProt import pipeline. Historically, the great majority of UniProt sequences have been based on translations of genome sequence submissions to the International Nucleotide Sequence Database Consortium (INSDC) (5). Our longstanding collaboration has been deepened to include the joint definition of complete genomes and the grouping together of all the genome submissions (e.g. individual chromosomes, organelles) for an organism that originate from the same sequencing project under one unique set accession. In addition, we have extended the import pipeline to include Ensembl (6) and Ensembl Genomes (7) sequences. This was to ensure comprehensiveness, as the full and/or up-to-date annotation of genomes is sometimes not submitted to the INSDC, for example, Apis mellifera (http://metazoa.ensembl.org/Apis_mellifera/Info/Index). The Ensembl sequences are mapped to their UniProtKB counterparts under stringent conditions, requiring 100% identity for 100% of the length of the two sequences.
Ensembl sequences that are absent from UniProtKB are imported into UniProtKB/TrEMBL. The UniProtKB entries provide a cross-reference back to the appropriate Ensembl record(s) where available, enabling an easy transition to the genomic view. The one exception to this approach is the Homo sapiens complete proteome, where some cross-references to Ensembl in the UniProtKB/Swiss-Prot entries do not follow the aforementioned criteria. This is because the two resources use different evidence and sources for the sequence. The cross-reference mapping is, however, enhanced by the use of HUGO Gene Nomenclature Committee (HGNC) (http://www.genenames.org) identifiers. Of the 20 224 UniProtKB/Swiss-Prot entries, 18 696 entries have at least one sequence that has 100% identity for 100% of the length of an Ensembl transcript. The UniProt curators and the Ensembl curators and gene builders are progressively working through the rest of the differences, correcting them where appropriate and documenting agree-to-disagree decisions. This is part of the Consensus CDS (CCDS) project, which is a collaborative effort to identify a core set of human and mouse protein coding regions that are consistently annotated and of high quality (http://www.ncbi.nlm.nih.gov/CCDS/). The long-term goal is to support convergence towards a standard set of gene annotations. UniProt has also extended the pipeline to import RefSeq (8) sequences, and we are currently evaluating how to combine these data with the existing UniProtKB and Ensembl data. All of these developments have had the side benefit of establishing a close and mutually beneficial collaboration with the Ensembl and RefSeq groups. We import their sequences while they import our annotations into their records (in particular the protein nomenclature and sequence feature annotations), and their prediction pipelines learn from our manually reviewed and experimentally proven sequences.
There is a consensus that we should all provide the same complete proteomes (in sequence and annotation) and collaborate on the definition of Reference proteomes. Another outcome of this collaboration is the ongoing development of genome annotation standards (including protein nomenclature), and the promotion of these standards by the sequencing community (9).
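The stringent mapping criterion described in this section (100% identity over 100% of the length of both sequences) amounts to requiring identical sequence strings. The following is a minimal sketch of that idea, not the UniProt import pipeline itself; the function name, the identifiers and the toy sequences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_exact(ensembl, uniprot):
    """Map Ensembl IDs to UniProtKB accessions that have an identical sequence.

    Both arguments are dicts of {identifier: amino-acid sequence}.
    Returns (mapped, unmatched): a dict of Ensembl ID -> sorted accessions,
    and a list of Ensembl IDs with no identical counterpart.
    """
    # Index UniProtKB sequences by their full sequence string, so that a
    # lookup succeeds only for 100% identity over 100% of the length.
    by_sequence = defaultdict(list)
    for accession, sequence in uniprot.items():
        by_sequence[sequence.upper()].append(accession)

    mapped, unmatched = {}, []
    for ensembl_id, sequence in ensembl.items():
        hits = by_sequence.get(sequence.upper())
        if hits:
            mapped[ensembl_id] = sorted(hits)
        else:
            # In the workflow described above, such sequences would be
            # candidates for import into UniProtKB/TrEMBL.
            unmatched.append(ensembl_id)
    return mapped, unmatched


# Hypothetical toy data (identifiers and sequences are made up).
uniprot_toy = {"ACC_1": "MKTAYIAK", "ACC_2": "MKTAYIAKQ"}
ensembl_toy = {"ENS_A": "MKTAYIAK", "ENS_B": "MKTAYIA"}
mapped, unmatched = map_exact(ensembl_toy, uniprot_toy)
```

In this toy case ENS_A maps to ACC_1, while ENS_B is one residue shorter than any UniProtKB sequence and therefore stays unmatched.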

UniProt biocuration

UniProt’s central focus is the annotation—both manual and automatic—of the UniProt Knowledgebase.

Manual curation challenge

Historically, the sequences from the same gene (and more than one when the resulting protein sequences were 100% identical) from the same organism were merged into one UniProtKB/Swiss-Prot entry. Discrepancies between sequence reports were identified, and the underlying causes, such as alternative splicing, natural variations, frameshifts and so forth, were annotated. Journal articles provided the main source of experimental knowledge, with the full text of each article being read and the information extracted. The aim of this approach was to provide a central hub of information for each protein, but it also meant that many UniProtKB/Swiss-Prot entries contain sequences and annotations from many strains. In the era of complete genomes and proteomes at the strain level for so many organisms, UniProt has modified this policy. We are now providing entries that contain the protein products from a particular gene from a particular species + strain with the experimental literature being annotated to that species + strain and propagated as appropriate to other species and strains, ideally through the UniRule pipeline (see later in the text). This has the advantage of providing a gold standard experimental set in UniProtKB/Swiss-Prot and automatically propagating appropriate annotation to the ever increasing number of complete proteomes for which there is no experimental data in UniProtKB/TrEMBL.

Automatic annotation approaches

UniProt has developed two complementary systems to automatically annotate the protein sequences in UniProtKB/TrEMBL. The first system, UniRule, which incorporates the HAMAP (10), RuleBase (11) and PIR Rule (12,13) systems, consists of annotation rules created and monitored by experienced curators. Each annotation rule specifies a number of annotations and the conditions that must be satisfied for those annotations to be applied. These conditions may include family membership [as indicated by a match to a family defined by InterPro (14)], taxonomic constraints and the presence of particular sequence features. Rules are created by curators based on information from experimentally characterized template entries, and their predictions are evaluated against the content of manually annotated UniProtKB/Swiss-Prot entries, which serve as the gold standard. With each UniProt release, the monitoring system sends those rules that are inconsistent with UniProtKB/Swiss-Prot annotation to curators for review. This ensures that only high-quality predictions are added and prevents propagation of potentially erroneous data. The second system, the Statistical Automatic Annotation System [SAAS, previously named Spearmint (15)], supplements the labour-intensive UniRule system and generates automatic rules for functional annotation from UniProtKB/Swiss-Prot entries using the C4.5 decision-tree algorithm. This algorithm uses entropy gain to find the most concise rule for an annotation based on the criteria of sequence length, InterPro-group membership and taxonomy. Generating rules ‘on the fly’ ensures that they evolve along with UniProtKB with little or no manual intervention, while providing seed rules for exploitation in the UniRule system. This combined approach currently produces annotation for 34% of UniProtKB/TrEMBL entries. All predictions are refreshed with each UniProtKB release to ensure that they reflect the latest state of knowledge.
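To illustrate the entropy-gain criterion mentioned for SAAS, the following minimal sketch computes the information gain of a candidate attribute on a toy set of annotated entries. It is not the SAAS implementation: the attribute names, the family labels and the annotation values are hypothetical, and a full C4.5 implementation additionally normalizes this quantity into a gain ratio, which is omitted here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical training entries: family labels and annotations are made up.
entries = [
    {"family": "FAM_A", "kingdom": "Bacteria",  "annotation": "Kinase"},
    {"family": "FAM_A", "kingdom": "Eukaryota", "annotation": "Kinase"},
    {"family": "FAM_B", "kingdom": "Bacteria",  "annotation": "Other"},
    {"family": "FAM_B", "kingdom": "Eukaryota", "annotation": "Other"},
]

gain_family = information_gain(entries, "family", "annotation")
gain_kingdom = information_gain(entries, "kingdom", "annotation")
```

On this toy set, family membership separates the two annotations perfectly while the taxonomic attribute carries no information, so a decision-tree learner would choose the family condition as the most concise rule.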

Gene Ontology annotation

UniProt continues to be a major provider of Gene Ontology (GO) annotations to the GO Consortium (16). UniProt curators are actively involved in curating UniProtKB entries with GO terms, providing high-quality manual GO annotations in addition to their contributions to electronic GO annotation pipelines. Manual GO annotations are made during the UniProt literature curation process, and, at the time of writing, almost 214 000 annotations have been manually assigned to >37 000 proteins by UniProt curators. The curators also supply information to entries that is subsequently used in electronic GO annotation pipelines, such as UniProt keywords2GO, UniProt subcellular location2GO and InterPro2GO. A new automatic pipeline, UniPathway2GO [a collaboration between UniProt, INRIA (Rhone-Alpes) and Laboratoire d’Ecologie Alpine (Grenoble) (17)], was initiated in May 2012; it provides GO annotations describing the metabolic pathways in which proteins are involved. Altogether, the UniProt-supplied automatic annotation pipelines provide 42.5 million annotations to >14 million proteins. UniProt also incorporates annotations from other GO Consortium members and affiliates and displays these annotations in the relevant UniProt entries. Currently, the UniProt-GO annotation project provides GO annotations for 65% of UniProt entries.

Highlighting the UniProt website

As a result of recent usability testing with the UniProt user community, we would like to highlight the following features on the UniProt website (http://www.uniprot.org), which is the main access point for the data available in the UniProt databases and the tools to explore it. The tabbed bar on the top of each page includes multiple tools, such as free text ‘Search’, ‘BLAST’ sequence similarity search, ‘Align’ for multiple sequence alignment, ‘Retrieve’ for batch downloads and ‘ID mapping’. ID mapping is a tool to convert UniProt identifiers to corresponding identifiers from a number of other databases available in a dropdown list or vice versa. There is also functionality available to help users personalize their experience with the website. For example, the search results page contains the ‘Customize’ button above the results table to help modify the table. This allows removal or addition of data to the results table from a vast selection of available columns, such as Gene Ontology, Cross-references, Sequence features and so forth, to help users find their proteins of interest. Users can then click on checkboxes at the left of the results table to add their proteins of interest to a selection cart that appears at the bottom of the page. The cart provides tools to help analyse or download the selected entries and saves selections across searches. The protein entry page contains the ‘Customize order’ button on the grey navigation tool bar that allows users to reorder sections within the entry.

DATABASE ACCESS AND FEEDBACK

The http://www.uniprot.org website (18) is the primary access point to our data and documentation and offers tools, such as full text and field-based text search, sequence similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. The home page features a site tour as a quick introduction for novice users. The full text search allows quick and easy searching without previous knowledge of our data or search syntax. The results are sorted by relevance, and search suggestions are provided, where possible, to help filter searches that yield too many or no results. More complex queries can be built with the field-based text search, either iteratively with a query builder or by entering them manually in the query field, which can be faster and more powerful (http://www.uniprot.org/help/text-search). Searching with ontology terms is assisted by auto-completion, and search results can be browsed by ontologies. The display of the result sets, as well as database entries, is configurable; columns can be added to or removed from the result table to see more functional annotation than is available in the default display. Sequence similarity search results can be filtered by taxonomy to obtain a quick overview of the taxonomic distribution of the results, and the sequence annotations of the matched entries can be projected onto the sequence alignments to see at a glance whether important positions are conserved. The site has a simple and consistent URL scheme that allows the bookmarking of all searches to repeat them at a later time. All result sets can be downloaded to offer users the possibility to retrieve customized data sets. However, large downloads are given low priority to ensure that they do not interfere with interactive queries, and they can, therefore, be slow compared with downloads from the UniProt FTP server. We, therefore, recommend downloading complete data sets from ftp.uniprot.org/pub/databases. 
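Because every search corresponds to a URL, a query can be stored and repeated programmatically. The following minimal sketch only builds such a URL and does not contact the server. The path layout and the parameter names `query` and `format` are assumptions made for illustration and should be checked against the help pages cited in the text; the function name is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.uniprot.org"


def search_url(query, fmt=None, dataset="uniprot"):
    """Build a bookmarkable search URL for one data set.

    `dataset`, `query` and `format` follow an assumed URL layout of the form
    BASE_URL/<dataset>/?query=...&format=...; they are not taken from the text.
    """
    params = {"query": query}
    if fmt is not None:
        params["format"] = fmt
    # urlencode percent-encodes characters that are not safe in a query string.
    return f"{BASE_URL}/{dataset}/?{urlencode(params)}"


bookmark = search_url("insulin", fmt="fasta")
```

A URL built this way can be saved and reused later, although, as noted above, complete data sets are better downloaded from the FTP server.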
The website offers various download formats (e.g. plain text, XML, RDF, FASTA, GFF), which depend on the chosen data set. The tab-delimited and Excel formats can be customized by selecting the desired columns in the graphical view of the result table. All data are also available in RDF (http://www.w3.org/RDF/), a W3C standard for publishing data on the Semantic Web. Both data and search results can also be accessed programmatically, either through simple HTTP (REST) requests (http://www.uniprot.org/faq/28) or our Java API (UniProtJAPI) (19). Although the UniProt website provides a query interface for all UniProt data, some users also require facilities to search across related data in different databases. We have, therefore, set up a BioMart (20) (http://www.biomart.org) instance at http://www.ebi.ac.uk/uniprot/biomart/martview that allows complex queries between UniProt and other data resources, such as PRIDE (21), Ensembl and InterPro. To offer users even more flexibility, we are going to provide a SPARQL Protocol and RDF Query Language (SPARQL) (http://www.w3.org/TR/rdf-sparql-query/) end-point for all our data that can be linked with any remote data resource that has a SPARQL end-point, using SPARQL 1.1’s federated query capabilities. This new service is available for beta testing at http://beta.sparql.uniprot.org/. Your feedback is extremely valuable to help us improve our databases and services in terms of accuracy and usability. Please contact us if you have questions or suggestions through http://www.uniprot.org/contact or email us directly at help@uniprot.org. You can submit new data or updates at http://www.uniprot.org/help/submissions. Extensive documentation on how to best use our resource is available at http://www.uniprot.org/help/. UniProt is freely available for both commercial and non-commercial use. Please see http://www.uniprot.org/help/license for details.
New releases are published every 4 weeks except for UniMES, which is updated only when the underlying source data are updated. Release statistics are available at http://www.uniprot.org.
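As an illustration of the federated querying mentioned above, the following minimal sketch assembles a SPARQL 1.1 query that uses the `SERVICE` keyword and encodes it as a GET request URL in the style of the SPARQL protocol. Nothing is sent over the network. The remote end-point `http://example.org/sparql` is a placeholder, the graph pattern is purely illustrative, and the use of the `up:Protein` class from the `http://purl.uniprot.org/core/` vocabulary is an assumption about the data model rather than something stated in the text.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Beta end-point named in the text.
ENDPOINT = "http://beta.sparql.uniprot.org/"

# Illustrative federated query: the SERVICE block would be evaluated by a
# remote end-point (placeholder URL), the rest by the local one.
QUERY = """\
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein ?remote
WHERE {
  ?protein a up:Protein .
  SERVICE <http://example.org/sparql> {
    ?remote ?link ?protein .
  }
}
LIMIT 5
"""


def sparql_get_url(endpoint, query):
    """Encode a SPARQL query as a GET request URL (query sent as `query=`)."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = sparql_get_url(ENDPOINT, QUERY)

# Decoding the URL again recovers the original query text unchanged.
recovered = parse_qs(urlsplit(url).query)["query"][0]
```

Keeping the query text separate from the transport code makes it easy to test the same query against different end-points.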

FUNDING

National Institutes of Health (NIH) [1U41HG006104-03]; NIH GO [2P41HG02273-07, 5R01GM080646-07, 3R01GM080646-07S1, 5G08LM010720-03 and 8P20GM103446-12]; British Heart Foundation [SP/07/007/23671]; Swiss Federal Government through the Federal Office of Education and Science; EC [SLING (226073), GEN2PHEN (200754) and MICROME (222886)]; National Science Foundation (NSF) [DBI-1062520]. Funding for open access charge: NIH [1U41HG006104-03]. Conflict of interest statement. None declared.

20 in total

1. Structure-guided rule-based annotation of protein functional sites in UniProt knowledgebase.

Authors: Sona Vasudevan; C R Vinayaka; Darren A Natale; Hongzhan Huang; Robel Y Kahsay; Cathy H Wu
Journal: Methods Mol Biol Date: 2011

2. UniProtJAPI: a remote API for accessing UniProt data.

Authors: Samuel Patient; Daniela Wieser; Michael Kleen; Ernst Kretschmann; Maria Jesus Martin; Rolf Apweiler
Journal: Bioinformatics Date: 2008-04-04 Impact factor: 6.937

3. Infrastructure for the life sciences: design and implementation of the UniProt website.

Authors: Eric Jain; Amos Bairoch; Severine Duvaud; Isabelle Phan; Nicole Redaschi; Baris E Suzek; Maria J Martin; Peter McGarvey; Elisabeth Gasteiger
Journal: BMC Bioinformatics Date: 2009-05-08 Impact factor: 3.169

4. Representative proteomes: a stable, scalable and unbiased proteome set for sequence analysis and functional annotation.

Authors: Chuming Chen; Darren A Natale; Robert D Finn; Hongzhan Huang; Jian Zhang; Cathy H Wu; Raja Mazumder
Journal: PLoS One Date: 2011-04-27 Impact factor: 3.240

5. InterPro in 2011: new developments in the family and domain prediction database.

Authors: Sarah Hunter; Philip Jones; Alex Mitchell; Rolf Apweiler; Teresa K Attwood; Alex Bateman; Thomas Bernard; David Binns; Peer Bork; Sarah Burge; Edouard de Castro; Penny Coggill; Matthew Corbett; Ujjwal Das; Louise Daugherty; Lauranne Duquenne; Robert D Finn; Matthew Fraser; Julian Gough; Daniel Haft; Nicolas Hulo; Daniel Kahn; Elizabeth Kelly; Ivica Letunic; David Lonsdale; Rodrigo Lopez; Martin Madera; John Maslen; Craig McAnulla; Jennifer McDowall; Conor McMenamin; Huaiyu Mi; Prudence Mutowo-Muellenet; Nicola Mulder; Darren Natale; Christine Orengo; Sebastien Pesseat; Marco Punta; Antony F Quinn; Catherine Rivoire; Amaia Sangrador-Vegas; Jeremy D Selengut; Christian J A Sigrist; Maxim Scheremetjew; John Tate; Manjulapramila Thimmajanarthanan; Paul D Thomas; Cathy H Wu; Corin Yeats; Siew-Yit Yong
Journal: Nucleic Acids Res Date: 2011-11-16 Impact factor: 16.971

6. BioMart: driving a paradigm change in biological data management.

Authors: Arek Kasprzyk
Journal: Database (Oxford) Date: 2011-11-13 Impact factor: 3.451

7. Ensembl 2012.

Authors: Paul Flicek; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Denise Carvalho-Silva; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Laurent Gil; Leo Gordon; Maurice Hendrix; Thibaut Hourlier; Nathan Johnson; Andreas K Kähäri; Damian Keefe; Stephen Keenan; Rhoda Kinsella; Monika Komorowska; Gautier Koscielny; Eugene Kulesha; Pontus Larsson; Ian Longden; William McLaren; Matthieu Muffato; Bert Overduin; Miguel Pignatelli; Bethan Pritchard; Harpreet Singh Riat; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sobral; Y Amy Tang; Kieron Taylor; Stephen Trevanion; Jana Vandrovcova; Simon White; Mark Wilson; Steven P Wilder; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Richard Durbin; Xosé M Fernández-Suarez; Jennifer Harrow; Javier Herrero; Tim J P Hubbard; Anne Parker; Glenn Proctor; Giulietta Spudich; Jan Vogel; Andy Yates; Amonida Zadissa; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2011-11-15 Impact factor: 16.971

8. Ensembl Genomes: an integrative resource for genome-scale data from non-vertebrate species.

Authors: Paul J Kersey; Daniel M Staines; Daniel Lawson; Eugene Kulesha; Paul Derwent; Jay C Humphrey; Daniel S T Hughes; Stephan Keenan; Arnaud Kerhornou; Gautier Koscielny; Nicholas Langridge; Mark D McDowall; Karine Megy; Uma Maheswari; Michael Nuhn; Michael Paulini; Helder Pedro; Iliana Toneva; Derek Wilson; Andrew Yates; Ewan Birney
Journal: Nucleic Acids Res Date: 2011-11-08 Impact factor: 16.971

9. The International Nucleotide Sequence Database Collaboration.

Authors: Ilene Karsch-Mizrachi; Yasukazu Nakamura; Guy Cochrane
Journal: Nucleic Acids Res Date: 2011-11-12 Impact factor: 16.971

10. UniPathway: a resource for the exploration and annotation of metabolic pathways.

Authors: Anne Morgat; Eric Coissac; Elisabeth Coudert; Kristian B Axelsen; Guillaume Keller; Amos Bairoch; Alan Bridge; Lydie Bougueleret; Ioannis Xenarios; Alain Viari
Journal: Nucleic Acids Res Date: 2011-11-18 Impact factor: 16.971
