The 2010 Nucleic Acids Research Database Issue and online Database Collection: a community of data resources.

Abstract

The current issue of Nucleic Acids Research includes descriptions of 58 new and 73 updated data resources. The accompanying online Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, now lists 1230 carefully selected databases covering various aspects of molecular and cell biology. While most data resource descriptions remain very brief, the issue includes several longer papers that highlight recent significant developments in such databases as Pfam, MetaCyc, UniProt, ELM and PDBe. The databases described in the Database Issue and Database Collection, however, are far more than a distinct set of resources; they form a network of connected data, concepts and shared technology. The full content of the Database Issue is available online at the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).

Year: 2009 PMID: 19965766 PMCID: PMC2808992 DOI: 10.1093/nar/gkp1077

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

COMMENTARY

This Database Issue of Nucleic Acids Research (NAR) includes descriptions of 58 new data resources and updates to 73 previously published data resources. The online Database Collection that accompanies the issue holds 1230 data resources, a growth of 5% over last year (http://www.oxfordjournals.org/nar/database/a/). Continuing a decade-long tradition, the Database Issue and Database Collection serve two functions: (i) to introduce the molecular and cell biologists who make up the regular readership of NAR to the databases that might be useful to them and (ii) to provide database developers with a venue to publish articles that promote their resources and introduce their work to the community that might benefit from it. Based on a number of measures (such as the numbers of downloads, literature citations and web links from outside sources), the NAR Database Issue and Database Collection have been extremely successful. Despite rather strict acceptance criteria (1), the number of submitted articles greatly exceeds the capacity of a single annual issue. To accommodate this demand, Oxford University Press, the publisher of NAR, has recently launched the new journal Database: The Journal of Biological Databases and Curation (http://database.oxfordjournals.org/). We hope that the availability of this new journal, as well as that of our other sister journal, Bioinformatics, will provide a publication venue for databases that could not be accepted in the NAR Database Issue because of their limited scope, absence of manual curation or orientation to a limited readership.

The data resources of the Database Issue and the Database Collection make up an invaluable infrastructure upon which much of life science has come to rely. Far more than a collection of distinct information sources, the resources form an extensive and evolving network of connected data, common concepts and shared technologies, driven forward by the collective efforts of developers, curators and database managers. While there is no moderator of this network and no overall controller of its growth, the Database Issue and Database Collection provide, through peer review and editorial processes, a valuable quality assurance service to the reader. In this editorial, we first outline some of the new and updated databases that will be of interest to readers of the Database Issue. While individually these databases offer great utility, it is perhaps as part of the community of people, data and technology that the resources offer up some of their richest uses; we complete our introduction to the Database Issue with a commentary on this community.

NEW AND UPDATED DATABASES

In addition to the usual updates on the database services at the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI), this issue includes a comprehensive listing of Japanese databases provided by the Japanese National BioResource Project [http://www.nbrp.jp (2)]. The geographic distribution of the featured databases continues to grow; phiSITE (http://www.phisite.org), a database of gene regulation in bacteriophages (3), is the first database in the list from Slovakia. Several articles in this issue feature updates on the status of databases that were included in the Database Collection after first being described in other journals. These include the Eukaryotic Linear Motif database [ELM, http://elm.eu.org/ (4)], the Catalogue Of Somatic Mutations In Cancer [COSMIC, http://www.sanger.ac.uk/genetics/CGP/cosmic/ (5)], MicrobesOnline [http://www.MicrobesOnline.org (6)], the Immune Epitope Database [http://www.immuneepitope.org/ (7)] and PDBselect [http://bioinfo.tg.fh-giessen.de/pdbselect/ (8)]. Several other articles describe updated features of such popular resources as the Comprehensive Microbial Resource [CMR, http://cmr.jcvi.org/ (9)], PrimerBank [http://pga.mgh.harvard.edu/primerbank/ (10)] and the Therapeutic Target Database [http://xin.cz3.nus.edu.sg/group/cjttd/ttd.asp (11)], which were last described in NAR several years ago.

In previous issues, update articles were permitted only brief descriptions of the latest changes in the respective resource. We felt that this limitation was unnecessary and that readers might benefit from more extensive and detailed descriptions of key database resources. For this issue, we invited the authors responsible for several popular data resources to submit extended papers that would provide deeper insight into the organization and goals of their respective resources and put the recent changes in these resources into a broader context. We are very happy with the excellent papers that resulted from this initiative, including comprehensive descriptions of the recent changes in Pfam [http://pfam.sanger.ac.uk/ (12)], MetaCyc [http://metacyc.org/ (13)], UniProt [http://www.uniprot.org/ (14)], IntAct [http://www.ebi.ac.uk/intact/ (15)], the Eukaryotic Linear Motif database [ELM, http://elm.eu.org/ (4)], the Comprehensive Microbial Resource [CMR, http://cmr.jcvi.org/ (9)] and the Integrated Microbial Genomes system [IMG, http://img.jgi.doe.gov/ (16)]. In addition, we have included extensive descriptions of three key databases recently unveiled by the EBI: the Gene Expression Atlas [http://www.ebi.ac.uk/gxa (17)], Ensembl Genomes [http://www.ensemblgenomes.org (18)] and the Protein Data Bank in Europe [PDBe, http://www.ebi.ac.uk/pdbe/ (19)]. We expect to continue this approach with longer articles next year; database authors who would like to submit such descriptions of their resources are encouraged to contact Michael Galperin at nardatabase@gmail.com in advance.

A COMMUNITY OF DATA RESOURCES

Some of the greatest efforts that have brought connectivity between the records of distinct resources were conceived not to serve pre-defined sets of users, but rather as open-ended initiatives that would provide broad utility across multiple domains. Perhaps the best known of these is the Gene Ontology (GO) project [http://www.geneontology.org/ (20)], which finds itself at the heart of the community of resources described in this Database Issue and Database Collection, with many of them providing GO annotations. Annotation of a common GO term to data objects in distinct resources allows the user to infer a conceptual relationship between the objects that is described by the term; by extension, more distant relationships can be inferred by using ontological relationships within GO to reach terms common to distinct data objects.

A shared approach to the development of data models is a theme in the Database Collection. For example, many model organism databases, such as Flybase [http://flybase.org/ (21)], Beetlebase [http://beetlebase.org/ (22)], ParameciumDB [http://paramecium.cgm.cnrs-gif.fr/db/index (23)] and wFleaBase [http://wfleabase.org/ (24)], have adopted the Chado database schema as the underlying data structure for their resources. Chado, delivered and maintained by the GMOD community, provides database and interface technology for a broad range of information typically required by users of a model organism database, including genomic data, expression data, phenotypic information and literature collections (25). Centred on ontologies and controlled vocabularies, the schema is extensible through its system of domain-specific modules. Model organism databases that adopt Chado benefit from reduced development costs, as they are immediately able to use the substantial body of technologies (such as genome browsers, gene pages and search tools) that are freely available. In addition, users of these databases can apply their knowledge and experience of Chado-based interfaces to all model organism databases that have adopted the schema.

A key strategy for many resources in the Database Collection is partnering to exchange data, as a means of achieving comprehensive coverage and reducing overall effort in data management. Successful data exchange relies not least on agreement to structure information in compatible ways. In 1982, the databases of the International Nucleotide Sequence Database Collaboration (INSDC, http://www.insdc.org) established the Feature Table Definitions and continue to maintain the definitions to this day. The document defines a formalised text format in which biological features and their sequence coordinates can be described. INSDC feature table format, as defined in the Feature Table Definitions, remains the file format under which INSDC annotation data are kept in daily global synchrony.

An unsung hero, perhaps, in this community is the taxonomic backbone upon which almost all of its resources hang. The NCBI Taxonomy project (http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=handbook&part=ch4) was established in 1991 and has been adopted as the de facto standard taxonomic classification of biomolecular data. While there are minor deviations, and while for many resources with limited taxonomic range (such as the model organism databases) a model of taxonomy is not strictly required, those resources that deal with multiple taxa all share this one system. The heaviest users are the large generalist resources [INSDC, PDBe (19), UniProt (14)], but even specialist resources adopt the system. The benefit to the user of the resource, of course, is the ability to approach, filter and link biomolecular information for any given taxon or set of taxa.

Finally, there are the direct relationships between items of information in the resources. These take many forms: molecular sequences may be similar between objects, the same genes may be described, a functional correspondence between objects may exist or common technology for presenting data may bind two resources together. Some of these relationships are derived computationally, some are manually curated from the literature and some are shared between databases in reciprocal exchanges. The strengths of these relationships also vary; some resources share data directly, leading to exact mappings between their objects, while others report common attributes of their objects. Relationships can be asserted explicitly in cross-references or implied through common references to objects in tertiary resources, or the user can be left to inject their own knowledge and creativity to bring a specific area of the network into sharper focus.

For those generating and interpreting data to be fed into the resources of the Database Collection, shared technologies and concepts need to be used. For this to happen, they need to be understood and readily available. Community standardisation initiatives, with their repertoire of minimal reporting standards and technology development initiatives to support the use of their standards, make their contribution here. Seminal work in the microarray field under the Microarray Gene Expression Data (MGED) consortium led to the development and adoption of MIAME (Minimal Information About a Microarray Experiment), a checklist of items of information required to render microarray data reusable beyond the initial analysis of those who generated the data (26). Since then, a whole host of minimal reporting standards have been developed to improve the usability of data, across genomics and environmental sequencing [the MIGS, MIMS and MIENS standards of the Genomics Standards Consortium and the emerging MINSEQE standard from MGED (27), http://gensc.org/, http://www.mged.org/minseqe/], proteomics [the MIAPE standard (28)], cell-based assays (MIACA; http://miaca.sourceforge.net/), phylogenetics [MIAPA (29)], systems biology [MIRIAM (30)] and many more.

Plenty of additional cases of such collective investment exist, but from the examples of vocabulary, taxonomic backbone, shared data models, common file formats and cross-reference development initiatives alone, the value to the user is already clear. Continued attention to collective effort in such areas remains key to optimising the utility of our community of resources.

Thus, as we prepare grant proposals to generate, interpret and present our latest and greatest data, we make reference to interoperability, shared effort to develop technologies and the need for cooperation and collaboration. The community of resources described in this Database Issue and Database Collection provides a compelling example of the many successes that can be won with investment in interoperability and shared effort. Curators, developers, managers and users of life science databases know well the ongoing importance of developing and maintaining connectivity between our resources and cooperation between those involved.
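The ontology-based inference described earlier in this section (relating data objects held in distinct resources through a shared GO term, or through a more general term reached by following ontological relationships) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the term identifiers, resource names and annotations are hypothetical, only a single "is_a"-style relationship is modelled, and it is not how GO or any of the resources above implements such queries.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy ontology: each term maps to its direct parent terms.
# Real GO uses its own identifiers and several relationship types.
PARENTS: Dict[str, Set[str]] = {
    "T:kinase_activity": {"T:catalytic_activity"},
    "T:phosphatase_activity": {"T:catalytic_activity"},
    "T:catalytic_activity": {"T:molecular_function"},
    "T:dna_binding": {"T:molecular_function"},
    "T:molecular_function": set(),
}

# Hypothetical annotations: (resource, object) -> directly annotated terms.
DataObject = Tuple[str, str]
ANNOTATIONS: Dict[DataObject, Set[str]] = {
    ("resourceA", "geneX"): {"T:kinase_activity"},
    ("resourceB", "proteinY"): {"T:kinase_activity", "T:dna_binding"},
    ("resourceC", "proteinZ"): {"T:phosphatase_activity"},
}


def with_ancestors(term: str) -> Set[str]:
    """Return the term together with every term reachable via parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, set()))
    return seen


def expanded_terms(obj: DataObject) -> Set[str]:
    """All terms that apply to an object, directly or through ancestors."""
    terms: Set[str] = set()
    for term in ANNOTATIONS[obj]:
        terms |= with_ancestors(term)
    return terms


def direct_shared(a: DataObject, b: DataObject) -> Set[str]:
    """Terms annotated directly to both objects."""
    return ANNOTATIONS[a] & ANNOTATIONS[b]


def inferred_shared(a: DataObject, b: DataObject) -> Set[str]:
    """Terms common to both objects once ancestor terms are included."""
    return expanded_terms(a) & expanded_terms(b)


if __name__ == "__main__":
    gene_x = ("resourceA", "geneX")
    protein_y = ("resourceB", "proteinY")
    protein_z = ("resourceC", "proteinZ")
    print(sorted(direct_shared(gene_x, protein_y)))
    print(sorted(direct_shared(gene_x, protein_z)))
    print(sorted(inferred_shared(gene_x, protein_z)))
```

In this toy data, geneX and proteinY are related directly by a common term, whereas geneX and proteinZ share no direct annotation and are related only through the more general ancestor terms. That corresponds to the "more distant relationships" mentioned in the text.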

FUNDING

European Molecular Biology Laboratory (to G.R.C.); Intramural Research Program of the US National Institutes of Health (to M.Y.G.). Funding for open access charge: Oxford University Press. Conflict of interest statement. None declared.

30 in total

1. Ensembl Genomes: extending Ensembl across the taxonomic space.

Authors: P J Kersey; D Lawson; E Birney; P S Derwent; M Haimel; J Herrero; S Keenan; A Kerhornou; G Koscielny; A Kähäri; R J Kinsella; E Kulesha; U Maheswari; K Megy; M Nuhn; G Proctor; D Staines; F Valentin; A J Vilella; A Yates
Journal: Nucleic Acids Res Date: 2009-11-01 Impact factor: 16.971

2. The integrated microbial genomes system: an expanding comparative analysis resource.

Authors: Victor M Markowitz; I-Min A Chen; Krishna Palaniappan; Ken Chu; Ernest Szeto; Yuri Grechkin; Anna Ratner; Iain Anderson; Athanasios Lykidis; Konstantinos Mavromatis; Natalia N Ivanova; Nikos C Kyrpides
Journal: Nucleic Acids Res Date: 2009-10-28 Impact factor: 16.971

3. COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer.

Authors: Simon A Forbes; Gurpreet Tang; Nidhi Bindal; Sally Bamford; Elisabeth Dawson; Charlotte Cole; Chai Yin Kok; Mingming Jia; Rebecca Ewing; Andrew Menzies; Jon W Teague; Michael R Stratton; P Andrew Futreal
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

4. PrimerBank: a resource of human and mouse PCR primer pairs for gene expression detection and quantification.

Authors: Athanasia Spandidos; Xiaowei Wang; Huajun Wang; Brian Seed
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

5. phiSITE: database of gene regulation in bacteriophages.

Authors: Lubos Klucar; Matej Stano; Matus Hajduk
Journal: Nucleic Acids Res Date: 2009-11-09 Impact factor: 16.971

6. The immune epitope database 2.0.

Authors: Randi Vita; Laura Zarebski; Jason A Greenbaum; Hussein Emami; Ilka Hoof; Nima Salimi; Rohini Damle; Alessandro Sette; Bjoern Peters
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

7. The comprehensive microbial resource.

Authors: Tanja Davidsen; Erin Beck; Anuradha Ganapathy; Robert Montgomery; Nikhat Zafar; Qi Yang; Ramana Madupu; Phil Goetz; Kevin Galinsky; Owen White; Granger Sutton
Journal: Nucleic Acids Res Date: 2009-11-05 Impact factor: 16.971

8. MicrobesOnline: an integrated portal for comparative and functional genomics.

Authors: Paramvir S Dehal; Marcin P Joachimiak; Morgan N Price; John T Bates; Jason K Baumohl; Dylan Chivian; Greg D Friedland; Katherine H Huang; Keith Keller; Pavel S Novichkov; Inna L Dubchak; Eric J Alm; Adam P Arkin
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

9. Gene expression atlas at the European bioinformatics institute.

Authors: Misha Kapushesky; Ibrahim Emam; Ele Holloway; Pavel Kurnosov; Andrey Zorin; James Malone; Gabriella Rustici; Eleanor Williams; Helen Parkinson; Alvis Brazma
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

10. ELM: the status of the 2010 eukaryotic linear motif resource.

Authors: Cathryn M Gould; Francesca Diella; Allegra Via; Pål Puntervoll; Christine Gemünd; Sophie Chabanis-Davidson; Sushama Michael; Ahmed Sayadi; Jan Christian Bryne; Claudia Chica; Markus Seiler; Norman E Davey; Niall Haslam; Robert J Weatheritt; Aidan Budd; Tim Hughes; Jakub Pas; Leszek Rychlewski; Gilles Travé; Rein Aasland; Manuela Helmer-Citterich; Rune Linding; Toby J Gibson
Journal: Nucleic Acids Res Date: 2009-11-17 Impact factor: 16.971
