
The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection.

Michael Y Galperin, Xosé M Fernández-Suárez.

Abstract

The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).

Year: 2011 PMID: 22144685 PMCID: PMC3245068 DOI: 10.1093/nar/gkr1196

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

COMMENTARY

This current, 19th annual Database Issue of Nucleic Acids Research (NAR) features descriptions of 92 new online databases covering a variety of molecular biology data, 77 update papers on databases that have been previously described in the NAR Database Issue and 23 papers with updates on database resources whose descriptions have previously been published in other journals (Table 1). The accompanying NAR online Molecular Biology Database Collection (http://www.oxfordjournals.org/nar/database/a/) has been revised, which resulted in updating the URLs of more than 30 databases and exclusion of more than 20 obsolete web sites. This list now includes 1380 databases sorted into 14 categories and 41 subcategories.

Table 1.

New databases featured in the 2012 NAR Database issue

Database name	URL	Brief description
ApoHoloDB	http://ahdb.ee.ncku.edu.tw/	Apo- and Holo- structure pairs of proteins
AutismKB	http://autism.cbi.pku.edu.cn	Autism genetics knowledgebase
BGMUT	http://www.ncbi.nlm.nih.gov/projects/gv/mhc/xslcgi.cgi?cmd=bgmut	Blood Group antigen gene Mutation database
BitterDB	http://bitterdb.agri.huji.ac.il/bitterdb/dbbitter.php	Bitter taste: molecules and receptors
canSAR	http://cansar.icr.ac.uk	Integrated cancer research and drug discovery resource
CAPS-DB	http://www.bioinsilico.org/CAPSDB	Classification of helix cappings in protein structures
ccPDB	http://crdd.osdd.net/raghava/ccpdb/	Compilation and creation of datasets from Protein Data Bank
CharProtDB	http://www.jcvi.org/charprotdb/	Experimentally Characterized Protein annotations
COLT-Cancer	http://colt.ccbr.utoronto.ca/cancer	Essential gene profiles in human cancer cell lines
Crystallography Open Database	http://www.crystallography.net/	Crystal structures of small molecules
Cube-DB	http://epsf.bmad.bii.a-star.edu.sg/cube/db/html/home.html	Functional divergence in human protein families
DARC	http://darcsite.genzentrum.lmu.de/darc/	Database for Aligned Ribosomal Complexes
DBETH	http://www.hpppi.iicb.res.in/btox	Database for Bacterial ExoToxins for Humans
Death Domain database	http://www.deathdomain.org	Protein interaction data for Death Domain superfamily
DIGIT	http://www.biocomputing.it/digit4/	Database of ImmunoGlobulin sequences and Integrated Tools
Disease Ontology	http://diseaseontology.sf.net/	Ontology for a variety of human diseases
DiseaseMeth	http://202.97.205.78/diseasemeth	Human disease methylation database
DistiLD	http://distild.jensenlab.org/	Diseases and Traits In Linkage Disequilibrium blocks
DNAtraffic	http://dnatraffic.ibb.waw.pl/	DNA dynamics during the cell cycle
DOMMINO	http://dommino.org	Database of MacroMolecular INteractions
doRiNA	http://dorina.mdc-berlin.de	Database of RNA interactions in post-transcriptional regulation
DR.VIS	http://www.scbit.org/dbmi/drvis	Human Disease-Related Viral Integration Sites
EBI BioSample Database	http://www.ebi.ac.uk/biosamples/	Biological samples used as sources of sequence, structure or expression data
EcoliWiki	http://ecoliwiki.net	Community-based pages about non-pathogenic E. coli
eQuilibrator	http://equilibrator.weizmann.ac.il	Thermodynamics calculator for biochemical reactions
FungiDB	http://fungidb.org	Functional genomics of fungi
FunTree	http://www.ebi.ac.uk/thornton-srv/databases/FunTree/	Evolution of novel enzyme functions in enzyme superfamilies
GeneWeaver	http://www.GeneWeaver.org	Functional genomics analysis system
GONUTS	http://gowiki.tamu.edu	Gene Ontology Normal Usage Tracking System
GWASdb	http://jjwanglab.org/gwasdb	Human genetic variants identified by genome wide association studies
HaploReg	http://compbio.mit.edu/HaploReg	SNP-centric access to chromatin state information
HFV database	http://hfv.lanl.gov/	Hemorrhagic fever virus sequence database
hiPathDB	http://hipathdb.kobic.re.kr/	Human Integrated Pathway Database
Histome	http://www.histome.net/	Human histone database
HotRegion	http://prism.ccbb.ku.edu.tr/hotregion	Database of interaction Hotspots
Human OligoGenome Resource	http://oligogenome.stanford.edu/	Oligonucleotides for targeted resequencing of the human genome
ICEberg	http://db-mml.sjtu.edu.cn/ICEberg/	Integrative and Conjugative Elements in Bacteria
IDEAL	http://www.ideal.force.cs.is.nagoya-u.ac.jp/IDEAL/	Intrinsically Disordered proteins with Extensive Annotations and Literature
IGDB.NSCLC	http://igdb.nsclc.ibms.sinica.edu.tw	Integrated Genomic Database of Non-Small Cell Lung Cancer
IndelFR	http://indel.bioinfo.sdu.edu.cn	Indel Flanking Region database
InterEvol	http://biodev.cea.fr/interevol	Evolution of protein–protein Interfaces
LegumeIP	http://plantgrn.noble.org/LegumeIP/	Model Legumes Integrative database Platform
MetaBase	http://metadatabase.org	Wiki database of biological databases
MethylomeDB	http://epigenomics.columbia.edu/methylomedb/	DNA methylation profiles in human and mouse brain
MINAS	http://www.minas.uzh.ch	Metal Ions in Nucleic AcidS
MIPModDB	http://bioinfo.iitk.ac.in/MIPModDB	Major Intrinsic Protein superfamily Models
miREX	http://bioinfo.amu.edu.pl/mirex	Plant microRNA Expression data
miRNEST	http://mirnest.amu.edu.pl	microRNAs in animal and plant EST sequences
MMMDB	http://mmdb.iab.keio.ac.jp/	Mouse Multiple Tissue Metabolomics Database
modMine	http://intermine.modencode.org	Mining of modENCODE data
MOPED	http://moped.proteinspire.org	Model Organism Protein Expression Database
NCBI BioSample	http://www.ncbi.nlm.nih.gov/biosample	Biological samples used as sources of sequence, structure or expression data
NCBI BioProject	http://www.ncbi.nlm.nih.gov/bioproject	Linked data related to a single research project
Nematodes.org	http://www.nematodes.org/nematodegenomes/	Wiki for coordinating nematode sequencing projects
Newt-omics	http://newt-omics.mpi-bn.mpg.de	Data on red spotted newt Notophthalmus viridescens
neXtProt	http://www.nextprot.org/	A knowledgebase for human proteins
NRG-CING	http://nmr.cmbi.ru.nl/NRG-CING	Validated NMR structures of proteins and nucleic acid
OGEE	http://ogeedb.embl.de	Online GEne Essentiality database
PDBj	http://pdbj.org/	Protein Data Bank Japan
PhenoM	http://phenom.ccbr.utoronto.ca	Morphological database of essential yeast genes
Phytozome	http://www.phytozome.net/	JGI's platform for green plant genomics
PlantNATsDB	http://bis.zju.edu.cn/pnatdb/	Plant natural antisense transcripts
Polbase	http://polbase.neb.com	Biochemical, genetic, and structural information about DNA polymerases
PomBase	http://www.pombase.org/	Genome database on S. pombe
PoSSuM	http://possum.cbrc.jp/PoSSuM/	Ligand-binding POcket Similarity Search Using Multiple-Sketches
Predictive Networks	http://predictivenetworks.org	Integration, navigation, visualization, and analysis of gene interaction networks
ProGlycProt	http://www.proglycprot.org	Experimentally characterized Prokaryotic GlycoProteins
ProOpDB	http://operons.ibt.unam.mx/OperonPredictor/	Prokaryotic Operon DataBase
ProPortal	http://proportal.mit.edu/	Prochlorococcus marinus and its phages
ProRepeat	http://prorepeat.bioinformatics.nl/	Amino acid tandem Repeats in Proteins
ProtChemSI	http://pcidb.russelllab.org/	Protein-Chemical Structural Interactions
PSCDB	http://idp1.force.cs.is.nagoya-u.ac.jp/pscdb/	Protein Structural Change upon ligand binding
RecountDB	http://recountdb.cbrc.jp	Recalculated transcript amounts database
Rhea	http://www.ebi.ac.uk/rhea/	EBI's biochemical reaction database
RNA CoSSMos	http://cossmos.slu.edu	RNA Characterization of Secondary Structure Motifs
ScerTF	http://ural.wustl.edu/TFDB/	Binding sites for Saccharomyces cerevisiae Transcription Factors
SCRIPDB	http://dcv.uhnres.utoronto.ca/SCRIPDB/search	Search for Chemicals and Reactions In Patents
SEQanswers	http://seqanswers.com/wiki/SEQanswers	Wiki on all aspects of next-generation genomics
SitEx	http://www-bionet.sscc.ru/sitex/	Projections of protein functional Sites on Exons
SNPedia	http://www.SNPedia.com	Wiki on SNPs and genome annotation
SpliceDisease	http://cmbi.bjmu.edu.cn/Sdisease	Links between RNA splicing and disease
STAP refinement of NMRdb	http://psb.kobic.re.kr/STAP/refinement	Refined solution NMR structures
Stem Cell Discovery Engine	http://discovery.hsci.harvard.edu/	Comparison system for cancer stem cell analysis
TopFIND	http://clipserve.clip.ubc.ca/topfind	Protein N- and C-termini and protease processing
UMD-BRCA1/ BRCA2 databases	http://www.umd.be/BRCA1/	BRCA1 and BRCA2 mutations detected in France
UniPathway	http://www.grenoble.prabi.fr/obiwarehouse/unipathway	Metabolic pathway information in UniProt knowledge base
VIRsiRNAdb	http://crdd.osdd.net/servers/virsirnadb	Experimentally validated Viral siRNA/shRNA
YeTFaSCo	http://yetfasco.ccbr.utoronto.ca/	Yeast Transcription Factor binding Site sequence Collection
YMDB	http://www.ymdb.ca	Yeast Metabolome Database
zfishbook	http://zfishbook.org/	Transposon-labeled mutants in zebrafish

NEW AND UPDATED DATABASES

This issue contains an unusually high number of papers from the authors’ host institutions, NCBI and EMBL-EBI, respectively. In addition to the annual papers from the International Nucleotide Sequence Database collaboration [INSDC (1), which includes the DNA Data Bank of Japan, GenBank and the European Nucleotide Archive (2–4)], Ensembl (5), UniProtKB (6) and the Protein Data Bank in Europe (7), these include two papers that describe the BioSample database project, recently launched at both institutions. The BioSample databases [http://www.ncbi.nlm.nih.gov/biosample and http://www.ebi.ac.uk/biosamples/, (8) and (9), respectively] aim at capturing essential information about each biological sample used to obtain sequence, gene expression or protein expression data, as well as the relationship between different samples and their sources. The sample information includes the name of the source organism (or an environmental isolate) and the source material within that species, such as the organ, tissue and cell type. It will also contain information about the isolation source of the sample: some or all of locality, host, collection date, etc. For human sources, BioSample information will include any available, and ethically appropriate, additional data, such as the disease state and clinical information [clinical samples that may raise privacy concerns will continue to be kept at the NCBI's dbGaP database (10) and the EBI's European Genome-phenome Archive (http://www.ebi.ac.uk/ega/), with sanitized versions available in the BioSample databases]. While providing sample information will place an additional burden on the submitters, the availability of BioSample data should dramatically improve the experience of a typical user.
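The sample attributes listed above can be pictured as a simple record type. The following is only a minimal sketch under stated assumptions: the class name `SampleRecord`, the helper `find_samples`, the field names and the identifiers `S1` to `S3` are all invented for illustration and do not reflect the actual NCBI or EBI BioSample schemas or interfaces.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical sample description; NOT the NCBI or EBI schema."""
    sample_id: str                         # made-up identifier for this sketch
    organism: str                          # source organism or environmental isolate
    organ: Optional[str] = None
    tissue: Optional[str] = None
    cell_type: Optional[str] = None
    isolation_source: Optional[str] = None
    locality: Optional[str] = None
    host: Optional[str] = None
    collection_date: Optional[str] = None
    disease_state: Optional[str] = None    # only where ethically appropriate


def find_samples(records: Iterable[SampleRecord], **criteria: str) -> List[SampleRecord]:
    """Return the records whose attributes equal every given criterion."""
    return [
        record
        for record in records
        if all(getattr(record, name) == value for name, value in criteria.items())
    ]


# Invented example data, used only to exercise the sketch.
samples = [
    SampleRecord("S1", "Homo sapiens", organ="liver", cell_type="hepatocyte",
                 disease_state="healthy"),
    SampleRecord("S2", "Homo sapiens", organ="liver", cell_type="hepatocyte",
                 disease_state="hepatocellular carcinoma"),
    SampleRecord("S3", "Mus musculus", organ="brain", tissue="cortex"),
]

human_liver = find_samples(samples, organism="Homo sapiens", organ="liver")
print([record.sample_id for record in human_liver])
```

Once sample descriptions are recorded consistently, a query by organism, organ or disease state reduces to a plain attribute match of this kind.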
By consistently recording sample information for various kinds of data stored in the NCBI and EBI databases, the BioSample databases will allow smooth cross-database searching of all available information pertaining to a particular sample source, such as cell type, disease, or a tissue biopsy. Furthermore, since NCBI and EBI agreed to assign shared sample accession numbers, these numbers could now be used to query web sites of both institutions (8,9). The NCBI paper (8) also presents the BioProject database (http://www.ncbi.nlm.nih.gov/bioproject), another INSDC initiative, which aims to provide a higher-order organization of large-scale data submitted by a single organization or a consortium, funded from a single source, or relating to the same whole-genome assembly. Again, the availability of such metadata should simplify the task of retrieving related data sets from different kinds of databases held at NCBI, EBI and DDBJ.
Five papers in this issue describe database resources of the US Department of Energy's Joint Genome Institute (JGI, http://www.jgi.doe.gov). These include a description of the JGI Genome Portal (11) with its fungal (MycoCosm), plant (Phytozome), prokaryotic (IMG) and metagenomic (IMG/M) resources, and the Genomes OnLine Database (GOLD, http://www.genomesonline.org), which lists the ongoing genomic and metagenomic projects (12).
One of the major highlights of this issue is the first description of neXtProt, a knowledgebase on human proteins that has been created at the Swiss Institute of Bioinformatics (SIB) on the basis of the human protein set in the UniProtKB/Swiss-Prot and then expanded by including quality-assessed protein expression, localization, variation and proteomics data (13). Other highlights include CharProtDB, a database of experimentally characterized proteins that is used for genome annotation at the J. Craig Venter Institute (14); a detailed explanation of the basic principles behind the NCBI Taxonomy Database and the ways it ties together various DNA and protein sequence and gene expression data for all organisms and taxonomic groups represented in GenBank (15); the descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects (16,17), and updates on model organism databases SGD, MGD, FlyBase and WormBase (18–21) and on Pfam, SMART and InterPro domain databases (22–24).
With all the diversity of the databases featured in this issue, the major trend appears to be an increased focus on small molecules (ChEMBL, PubChem, BitterDB, SCRIPDB, Crystallography Open Database) and related topics, such as properties of enzyme-catalyzed reactions (Rhea, MACiE, eQuilibrator, SABIO-RK), protein–ligand binding (Pocketome, PoSSuM, ProtChemSI, STITCH), and the analysis of potential drugs and drug targets for human disease (canSAR, DAMPD, DBETH, SuperTarget, TDR Targets, Therapeutic Target Database). As in previous years, there is a strong representation of structure databases, including descriptions of the European and Japanese Protein Data Banks (PDBe, PDBj), two databases of refined NMR structures (NRG-CING and STAP Refinement of NMR database), and several other databases on protein structure and protein–protein interactions.
An unusually high number of databases, including ChEMBL, FunCoup, MitoMiner, PhosphoSitePlus, Pocketome, SABIO-RK and TDR Targets, are featured in this NAR Database Issue for the first time after having their descriptions published elsewhere (Table 2). All these databases have been available online for several years and have been accepted and valued by the community. Accordingly, they presented few, if any, problems with the database design, although some appeared somewhat less user-friendly than is required for the NAR Database Issue.
We consider publication of these papers in the NAR Database Issue a continuation of our efforts to bring the readers the best publicly available molecular biology databases, as well as a reflection of the unique status of this publication that introduces the databases to a very wide audience.

Table 2.

Database updates new for the NAR Database issue

Database name	URL	Brief description
BYKdb	http://bykdb.ibcp.fr/	Bacterial protein tYrosine Kinase database
BμG@Sbase	http://bugs.sgul.ac.uk/E-BUGS-PUB	Microarray datasets for microbial gene expression
ChEMBL	https://www.ebi.ac.uk/chembldb	EMBL's database of bioactive drug-like small molecules
ConoServer	http://www.conoserver.org/	Sequence and structures of peptides expressed by marine cone snails
CoryneRegNet	http://coryneregnet.cebitec.uni-bielefeld.de/	Corynebacterial Regulatory Network
ExoCarta	http://exocarta.ludwig.edu.au	Database on exosomes, membrane vesicles of endocytic origin released by diverse cell types
FunCoup	http://funcoup.sbc.su.se/	Networks of Functional Coupling of proteins
HmtDB	http://www.hmtdb.uniba.it/	Human mitochondrial genome variability
MimoDB	http://immunet.cn/mimodb	Mimotope database, active site-mimicking peptides from phage-display libraries
MIRIAM Registry	http://www.ebi.ac.uk/miriam/	Minimal Information Required In the Annotation of Models
MitoMiner	http://mitominer.mrc-mbu.cam.ac.uk/	Mitochondrial proteomics data
MitoZoa	http://www.caspur.it/mitozoa	Mitochondrial genomes in Metazoa
NAPP	http://rna.igmors.u-psud.fr/NAPP	Nucleic Acid Phylogenetic Profile database
OPMdb	http://opm.phar.umich.edu	Orientations of Proteins in Membranes database
PhosphoSitePlus	http://www.phosphosite.org/	Protein phosphorylation sites and other post-translational modifications
PINA	http://cbg.garvan.unsw.edu.au/pina/	Protein Interaction Network Analysis
Plant Metabolomics	http://plantmetabolomics.vrac.iastate.edu/	Arabidopsis metabolomics database
PLEXdb	http://www.plexdb.org	Gene Expression Resources for Plants and Plant Pathogens
Pocketome	http://www.pocketome.org	Small-molecule binding pockets in the structural proteome
SABIO-RK	http://sabiork.h-its.org/	System for the Analysis of Biochemical Pathways Reaction Kinetics
SubtiWiki	http://subtiwiki.uni-goettingen.de/	Collaborative resource for the Bacillus community
TDR Targets	http://tdrtargets.org/	Targets against neglected tropical diseases
WikiPathways	http://www.wikipathways.org	Community curation of biological pathways

In response to the growing popularity of Wikipedia (http://www.wikipedia.org) and wiki-based approaches to constructing and curating biological databases, this issue includes a special section with 10 papers describing various wiki-based databases. These papers are introduced in an accompanying editorial by Rob Finn, Paul Gardner and Alex Bateman (25), whose very popular Pfam (22) and Rfam (26) databases successfully incorporate wiki elements. It could be argued that the Pfam update paper (22) should have been placed in that section as well.

SUSTAINABILITY OF BIOINFORMATICS DATABASES

A joint paper in this issue from the three INSDC members (27) discusses the progress of the Sequence Read Archive (SRA, previously known as the Short Read Archive) without, however, mentioning the controversy that surrounded the SRA in the past year. Established in 2007 as a public repository of raw sequence data from next-generation sequencing platforms, SRA stores sequence data generated for RNA-Seq, ChIP-Seq and genotyping studies, as well as from several large-scale projects, such as the Human Microbiome project (https://commonfund.nih.gov/hmp) and the 1000 Genomes project (http://www.1000genomes.org) (27). In June 2011, its volume surpassed 100 Terabases (10^14 bases) of DNA. In February 2011, NCBI announced that, due to budget constraints, it would discontinue the SRA within the next 12 months (http://www.ncbi.nlm.nih.gov/About/news/16feb2011). This announcement caused a widespread response (28). One news source even claimed that NCBI ‘announced that it would slowly phase out its DNA archive due to federal budget cuts’. There has also been an extensive online discussion on the http://seqanswers.com wiki web site (which is described in a separate paper in this issue). However, the news of the SRA demise proved largely premature. Within days, EBI and DDBJ announced that they would continue supporting the SRA (http://www.ebi.ac.uk/ena/SRA_announcement_Feb_2011.pdf, http://www.ddbj.nig.ac.jp/whatsnew/2011/DRA20110222.html), and the NIH provided support to enable the continuation of the SRA (http://www.ncbi.nlm.nih.gov/About/news/13Oct2011.html). Still, given that the SRA keeps growing at a rapid pace and handling the data becomes increasingly complicated, the INSDC paper carefully states that ‘SRA partners actively discuss and pursue approaches together with user communities to maximize the benefit gained from archiving next-generation sequencing data while minimizing the infrastructure costs’ (27).
Despite its successful resolution, the SRA story highlights an important problem: whether public database providers should try to keep all sequence-related data or make certain choices about the kind of resources that they would like to maintain. The same news release in February 2011 announced the closure of Peptidome, the NCBI resource for tandem mass spectrometry peptide and protein identification data (29). The closure of Peptidome attracted far less attention than that of the SRA, probably because of the continued operation of EBI's PRIDE (30), Seattle Proteome Center's PeptideAtlas (31), the recently created MOPED (32) and other proteomics resources. Still, it is definitely a sign of things to come, as is the recently announced closure of the International Protein Index, which is to be replaced by the complete proteome sets in UniProtKB (33). Most importantly, the worldwide attention to the SRA story illuminates the deep concern that exists in the community with regard to the stability (viability) of the online databases that have become key resources enabling all kinds of biomedical research. Previously, we have seen a natural selection of databases that led to a relatively orderly succession: as some databases have grown obsolete, they were replaced by similar but more robust databases maintained elsewhere. For example, after termination of IRESdb, a database of the internal ribosome entry sites (34), the same data were still available through the IRESite database (35). Among the databases featured in this issue, MitoZoa provides the same coverage of metazoan mitochondrial genomes as the now-defunct AMmtDB, Gene3D fully replaces the no-longer-maintained 3D-Genomics, and Ensembl (5) provides the alternative splicing data that have previously been available through ASHESdb, EBI's ASD/ATD/ATSD and several other recently discontinued databases.
Unfortunately, owing to the difficult economic times, budget constraints are now leading to the termination (or commercialization) of truly unique resources, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/kegg) and The Arabidopsis Information Resource (TAIR, http://arabidopsis.org), both featured in this issue (36,37). The KEGG database, maintained by Minoru Kanehisa and his colleagues at the Bioinformatics Center of the Kyoto University Institute for Chemical Research, has been a permanent feature of the NAR Database Issue since 1997 and is now in its 60th release (36; see http://www.genome.jp/en/release.html). However, now that Kanehisa, who was one of the founders of GenBank and has been at the forefront of bioinformatics research ever since, has reached the mandatory retirement age, the future of KEGG has suddenly become uncertain (see http://www.genome.jp/kegg/docs/plea.html). Right now, KEGG continues to be publicly available but its funding mechanisms support a narrow focus on translational research (36), which is certainly important but is only a minor part of the enormous contribution of this database to the progress of genomics and bioinformatics around the world. The case of TAIR is even more troubling. Over the past 12 years, TAIR enjoyed generous support from the US National Science Foundation (NSF, http://www.nsf.gov) that helped it grow into a recognized source of sequence data and curated annotation of the model plant Arabidopsis thaliana. Three previous publications on TAIR in the NAR Database Issue in 2001, 2003 and 2008 were all extremely well cited, confirming the widespread use of this resource. With the completion of the Arabidopsis sequencing project, the focus of TAIR shifted from providing new annotation to improving the existing genome annotation, making it the ultimate source of gene annotation and expression data for A. thaliana.
Unfortunately, this new focus failed to win the NSF support, and the funding for a project that until recently has been heralded as one of the NSF's best success stories will end in August of 2013. This will likely mean termination of TAIR as we know it; the existing plans for corporate sponsorship of TAIR and/or for its shift to an International Arabidopsis Informatics Consortium (see http://www.arabidopsis.org/doc/about/tair_funding/410) are not going to prevent the demise of this useful genomic resource. These recent developments show that the importance of the public database resources, which is obvious to any biologist, needs to be constantly highlighted to the national and international financing bodies. We all remember the financial difficulties encountered in the 1990s by the Swiss-Prot database after it failed to secure sufficient support from the European Union (http://web.expasy.org/docs/crisis96/help-sprot.html) (38). Fortunately, in the end, the Swiss government recognized the value of that unique resource and provided funding to support Swiss-Prot (39). It now supports the UniProtKB/Swiss-Prot activities at the SIB, whereas funding for the UniProtKB activities at the EBI and PIR is provided by the NIH, NSF and the European Commission (6). The stories of Swiss-Prot, KEGG and TAIR also illustrate the need [clearly articulated in a recent paper by Julian Parkhill, Ewan Birney and Paul Kersey, (40)] for a comprehensive infrastructure that would (i) support the key bioinformatics resources, (ii) extend to the model organism databases and (iii) bring the genomic information into every biological lab. In the USA, such infrastructure includes the NCBI, the JGI and associated DOE labs, the NIH-funded Bioinformatics Resource Centers (this issue includes papers on VectorBase and ViPR, as well as on EuPathDB-associated databases, such as GeneDB, FungiDB, and TDR Targets) and comprehensive resources on model organisms, such as FlyBase, WormBase, SGD and MGD (18–21).
In Europe, coordination of the bioinformatics infrastructure is planned through the EU-sponsored ELIXIR (European Life Sciences Infrastructure for Biological Information, http://www.elixir-europe.org) project, which aims at guaranteeing seamless access to biological information by integrating data generators and data centers throughout Europe.

AN ECOSYSTEM OF DATABASES

Although this issue looks like a simple catalog, it is important to note that we are not dealing with isolated resources: many listed databases interact in a variety of ways, forming a network of interconnected (or at least hyperlinked) data resources. Obviously, UniProtKB provides a plethora of links to all kinds of databases, including ENA, GenBank, DDBJ, RefSeq, PDBe, PDBj, IntAct, MINT, Ensembl, KEGG, UCSC Genome Browser, neXtProt, SGD, FlyBase, WormBase, MGD, TAIR, eggNOG, MetaCyc, InterPro, Gene3D, Pfam, SMART and ProtoNet, which are featured in this issue. However, many database interactions are more subtle: for example, BioMart has been recently used to link protein annotation data from the Reactome database of metabolic networks (41) to phosphoproteomics data in PRIDE (30) and somatic mutations in COSMIC (42), which allowed putting cancer-related mutation data into a functional context (43). We believe that establishing connections between databases is an important way of improving the databases themselves, providing the user with additional search tools and, more generally, creating a live ecosystem that stores and expands knowledge. Accordingly, we consider it essential that the databases featured in the NAR Database Issue do their best in creating links to outside resources and providing an easy and straightforward way for the authors of other databases to link to their database content. Last year, we published a paper by the BioDBcore Working Group that proposed creating a resource of ‘minimal information about a biological database’, a community-defined, uniform, generic description of the core attributes of biological databases (44). Accordingly, submitters to this year's NAR Database Issue were asked to fill out a checklist of core attributes (available at http://www.biodbcore.org) of their databases and provide it as supplementary material to their manuscripts. 
Most of the authors complied with this request, which resulted in a stand-alone resource that contains machine-readable descriptions of the databases featured in this issue and is available from the BioSharing website (http://biosharing.org/biodbcore). We hope that this effort will illuminate the scope and general features of every listed database resource, including the community standards that these systems support, forge better contacts between their authors, simplify linking various data sets, and, eventually, bring greater clarity and integration to the whole field of molecular biology databases.
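To give a sense of what a machine-readable database description might look like, the sketch below turns three rows copied from Table 1 into JSON records. It is an illustration only: the keys `name`, `url` and `description` and the helper `rows_to_records` are assumptions made for this example and are not the BioDBcore checklist or the format used on the BioSharing site.

```python
import json
from typing import Dict, List

# Three rows copied from Table 1: name, URL and brief description, tab-separated.
TABLE_ROWS = """\
neXtProt\thttp://www.nextprot.org/\tA knowledgebase for human proteins
PomBase\thttp://www.pombase.org/\tGenome database on S. pombe
Rhea\thttp://www.ebi.ac.uk/rhea/\tEBI's biochemical reaction database
"""


def rows_to_records(text: str) -> List[Dict[str, str]]:
    """Parse tab-separated table rows into dictionaries (hypothetical keys)."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, url, description = line.split("\t", 2)
        records.append({"name": name, "url": url, "description": description})
    return records


records = rows_to_records(TABLE_ROWS)
print(json.dumps(records, indent=2))
```

A uniform record of this kind, however it is actually structured, is what allows database descriptions to be compared, searched and linked by software rather than read one paper at a time.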

FUNDING

Intramural Research Program of the US National Institutes of Health at the National Library of Medicine (to M.Y.G.); European Molecular Biology Laboratory (to X.M.F.S.). Funding for open access charge: waived by Oxford University Press. Conflict of interest statement. The authors’ opinions do not necessarily reflect the views of their respective institutions.

44 in total

1. Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times!

Authors: A Bairoch
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

2. IRESdb: the Internal Ribosome Entry Site database.

Authors: Sophie Bonnal; Christel Boutonnet; Leonel Prado-Lourenço; Stéphan Vagner
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

3. The NCBI dbGaP database of genotypes and phenotypes.

Authors: Matthew D Mailman; Michael Feolo; Yumi Jin; Masato Kimura; Kimberly Tryka; Rinat Bagoutdinov; Luning Hao; Anne Kiang; Justin Paschall; Lon Phan; Natalia Popova; Stephanie Pretel; Lora Ziyabari; Moira Lee; Yu Shao; Zhen Y Wang; Karl Sirotkin; Minghong Ward; Michael Kholodov; Kerry Zbicz; Jeffrey Beck; Michael Kimelman; Sergey Shevelev; Don Preuss; Eugene Yaschenko; Alan Graeff; James Ostell; Stephen T Sherry
Journal: Nat Genet Date: 2007-10 Impact factor: 38.330

4. Unique protein database imperiled.

Authors: N Williams
Journal: Science Date: 1996-05-17 Impact factor: 47.728

5. Genomic information infrastructure after the deluge.

Authors: Julian Parkhill; Ewan Birney; Paul Kersey
Journal: Genome Biol Date: 2010-07-26 Impact factor: 13.583

6. COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer.

Authors: Simon A Forbes; Nidhi Bindal; Sally Bamford; Charlotte Cole; Chai Yin Kok; David Beare; Mingming Jia; Rebecca Shepherd; Kenric Leung; Andrew Menzies; Jon W Teague; Peter J Campbell; Michael R Stratton; P Andrew Futreal
Journal: Nucleic Acids Res Date: 2010-10-15 Impact factor: 16.971

7. The PeptideAtlas project.

Authors: Frank Desiere; Eric W Deutsch; Nichole L King; Alexey I Nesvizhskii; Parag Mallick; Jimmy Eng; Sharon Chen; James Eddes; Sandra N Loevenich; Ruedi Aebersold
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. IRESite--a tool for the examination of viral and cellular internal ribosome entry sites.

Authors: Martin Mokrejs; Tomás Masek; Václav Vopálensky; Petr Hlubucek; Philippe Delbos; Martin Pospísek
Journal: Nucleic Acids Res Date: 2009-11-16 Impact factor: 16.971

9. NCBI Peptidome: a new repository for mass spectrometry proteomics data.

Authors: Li Ji; Tanya Barrett; Oluwabukunmi Ayanbule; Dennis B Troup; Dmitry Rudnev; Rolf N Muertter; Maxim Tomashevsky; Alexandra Soboleva; Douglas J Slotta
Journal: Nucleic Acids Res Date: 2009-11-26 Impact factor: 16.971

10. The Proteomics Identifications database: 2010 update.

Authors: Juan Antonio Vizcaíno; Richard Côté; Florian Reisinger; Harald Barsnes; Joseph M Foster; Jonathan Rameseder; Henning Hermjakob; Lennart Martens
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971
