
Databases for Microbiologists.

Abstract

Databases play an increasingly important role in biology. They archive, store, maintain, and share information on genes, genomes, expression data, protein sequences and structures, metabolites and reactions, interactions, and pathways. All these data are critically important to microbiologists. Furthermore, microbiology has its own databases that deal with model microorganisms, microbial diversity, physiology, and pathogenesis. Thousands of biological databases are currently available, and it becomes increasingly difficult to keep up with their development. The purpose of this minireview is to provide a brief survey of current databases that are of interest to microbiologists.

Year: 2015 PMID: 26013493 PMCID: PMC4505447 DOI: 10.1128/JB.00330-15

Source DB: PubMed Journal: J Bacteriol ISSN: 0021-9193 Impact factor: 3.490

INTRODUCTION

A database is an organized collection of data. Very few biologists cared about databases 25 years ago, simply because there was no need to organize biological data: apart from the biological literature, it was relatively scarce. That all changed with the arrival of genome sequencing technologies. The amount of biological data began to grow quickly as sequenced genomes became commonplace, and then it skyrocketed when next-generation sequencing (NGS) (1) spread the “omics” revolution. Since January 2008, the speed of DNA sequencing has been beating the infamous “Moore's law,” and all of a sudden, not only astrophysicists but also biologists are facing big data, and big data needs to be organized big time (2). As of March 2015, the genomes of more than 45,000 strains of bacteria and archaea have been sequenced or are in the process of being sequenced (3). Consequently, the role of databases in microbiology has increased dramatically in recent years and will become even more important in the foreseeable future. The purpose of this minireview is to focus on databases that are specific to microbiology rather than to provide a comprehensive view of the thousands of biological databases available today. On the other hand, interdisciplinary boundaries are blurry, and some of the comprehensive databases must be mentioned in the context of microbiology although they serve much broader biological communities. Inevitably, many important databases will be missing from this survey, and I would like to direct the more meticulous reader to a more comprehensive source: the annual Nucleic Acids Research database issues and molecular biology database collection. As stated by Michael Galperin and his colleagues (4), the 2015 Nucleic Acids Research database issue contains 172 papers describing new and updated databases, and the journal's online database collection (http://www.oxfordjournals.org/our_journals/nar/database/a/) provides links to more than 1,550 biological databases.
There is no single, unified way of classifying databases. For the purpose of this minireview, I classify them into three categories: global resources, comprehensive databases, and special-purpose or community databases. “Shopping for information” is quite similar to the shopping for goods and services that we do in our everyday lives. Accordingly, one can compare global resources to Wal-Mart or Amazon. Comprehensive databases are reminiscent of large retailers delivering a wide variety of goods in a certain category, such as IKEA or iTunes. Special-purpose databases could be viewed as fashion boutiques that target a very specific clientele. Analogies and metaphors aside, I hope that this minireview might serve as a database “consumer digest” or “shopping list” for Journal of Bacteriology readers. As shopping lists are rarely logical, this minireview lacks a single logical scheme. For example, some databases that are parts of large comprehensive resources are described in sections devoted to special-purpose databases when it seems more appropriate to do so. Links to key databases for microbiologists are provided in Table 1, and many more are briefly described in the text. One important point to keep in mind about biological databases is that although many of them contain very similar information (e.g., the same genomes), they were developed for various purposes, contain heterogeneous types of data accessible by different tools, and are curated by different people. Therefore, no two databases are alike.

TABLE 1

Key databases for microbiologists

Main subject	Database name	URL	Brief description
Microbial genomic resources	IMG	http://img.jgi.doe.gov/	Comprehensive platform for annotation and analysis of microbial genomes and metagenomes
	MicrobesOnline	http://www.microbesonline.org/	Portal for comparative and functional microbial genomics
	SEED	http://www.theseed.org/	Portal for curated genomic data and automated annotation of microbial genomes
	GOLD	https://gold.jgi-psf.org/	Resource for comprehensive information about genome and metagenome sequencing projects
Protein families	CDD	http://www.ncbi.nlm.nih.gov/cdd/	Conserved domain database
	Pfam	http://pfam.xfam.org/	Database of protein families
Protein-protein interactions	STRING	http://string-db.org/	Database of protein association networks
Microbial diversity	RDP	http://rdp.cme.msu.edu/	16S rRNA gene database
	SILVA	http://www.arb-silva.de/	rRNA gene database
	GREENGENES	http://greengenes.lbl.gov/	16S rRNA gene database
	BIGSdb	http://pubmlst.org/software/database/bigsdb/	Bacterial isolate genome sequence database
	EBI metagenomics	www.ebi.ac.uk/metagenomics/	Portal for submission and analysis of metagenomics data
Model organisms	EcoCyc	http://EcoCyc.org	E. coli genome and metabolism knowledge base
	RegulonDB	http://regulondb.ccg.unam.mx/	E. coli transcriptional regulation resource
	Pseudomonas	http://pseudomonas.com	Pseudomonas genome database
Pathogenesis	PATRIC	https://www.patricbrc.org/	Portal for many prokaryotic pathogens
	EuPathDB	http://eupathdb.org/	Portal for many eukaryotic pathogens
	TBDB	http://www.tbdb.org/	Integrated platform for tuberculosis research
Transport and metabolism	TCDB	http://www.tcdb.org/	Transporter classification database
	TransportDB	http://www.membranetransport.org/	Transporter protein analysis database
	MetaCyc	http://metacyc.org/	Metabolic pathway database
	KEGG	http://www.genome.jp/kegg/	Genome database with emphasis on metabolism
Signal transduction and gene regulation	MiST	http://mistdb.com/	Microbial signal transduction database
	SwissRegulon	http://swissregulon.unibas.ch/	Genome-wide annotations of regulatory sites in model organisms
	RegPrecise	http://regprecise.lbl.gov/RegPrecise/	Database of regulons in prokaryotic genomes

GLOBAL RESOURCES

Global resources consist of many interconnected databases and tools in order to provide “one-stop shopping” for the vast majority of users. The National Center for Biotechnology Information (NCBI) at the National Institutes of Health in the United States and the European Molecular Biology Laboratory/European Bioinformatics Institute (EMBL-EBI) are undisputed leaders that offer the most comprehensive suites of genomic and molecular biology data collections in the world. The key features of these resources are described by their developers and curators in corresponding publications (5, 6). Here, I will focus specifically on a question asked by a microbiologist: “What's in there for me?” First and foremost, all genomes of bacteria, archaea, eukaryotic microorganisms, and viruses that have been deposited in GenBank, EMBL Bank, or the DNA Data Bank of Japan (DDBJ) become an integral part of the NCBI and EMBL-EBI collections of databases and therefore are accessible to everyone through a variety of text-based and sequence-based search engines.

NCBI.

Like all other biologists, microbiologists depend on NCBI literature resources: PubMed and PubMed Central, which provides the full text of peer-reviewed journal articles, as well as NCBI Bookshelf, which provides free access to the full text of books and reports. The central features of the NCBI collection are nonredundant (NR) databases of nucleotide and protein sequences and their curated subset, known as Reference Sequences or RefSeq (7). The NCBI Genome database collects genome sequencing projects, including all sequenced microbial genomes, and provides links to corresponding records in NR databases and BioProject, which is a central access point to the primary data from sequencing projects. NCBI also maintains the Sequence Read Archive (SRA), which is a public repository for next-generation sequence data (8), and GEO, the archive for functional genomics data sets, which provides an R-based web application to help users analyze its data (9). BLAST (10) is the most popular sequence database search tool, and its NCBI web page now offers an option to restrict a sequence similarity search to any taxonomic group. For example, a user may choose to search for similarity only in Proteobacteria or Firmicutes, or even in a single organism, such as Escherichia coli or Bacillus subtilis. Alternatively, any taxonomic group or organism can be excluded from the search. NCBI BLAST also allows its users to search genomic data from environmental samples, thus providing a way to explore vast metagenomics data. NCBI Primer-BLAST (11) helps bench microbiologists design and analyze PCR primers. The NCBI Taxonomy database (12) is another useful resource for microbiologists, because it contains information for each taxonomic node, from superkingdoms to subspecies, for virtually all of the formally described species of prokaryotes and about 10% of eukaryotes.
The NCBI Virus Variation resource (13) links viral genome sequence data from influenza, dengue, and West Nile viruses to the corresponding literature, sequences, structures, and population studies.
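The taxonomic restriction described above for BLAST searches can be written down as an Entrez-style query string. The following is a minimal sketch rather than NCBI code: the helper name `organism_filter` is hypothetical, and it assumes that the `[Organism]` field tag, the Boolean operators `OR` and `NOT`, and the `all[filter]` term behave as they do in standard Entrez queries. It only builds the string and does not contact any server.

```python
def organism_filter(include=(), exclude=()):
    """Build an Entrez-style organism filter (hypothetical helper).

    include: taxa to which the search is limited (combined with OR)
    exclude: taxa removed from the search (each added with NOT)
    """
    def tag(name):
        # Quote the name and attach the Organism field tag.
        return f'"{name}"[Organism]'

    if include:
        query = "(" + " OR ".join(tag(name) for name in include) + ")"
    else:
        # Assumption: all[filter] matches every record, so that the
        # query does not have to start with a bare NOT.
        query = "all[filter]"
    for name in exclude:
        query += f" NOT {tag(name)}"
    return query


# Search only Proteobacteria, but leave out Escherichia coli.
print(organism_filter(include=["Proteobacteria"],
                      exclude=["Escherichia coli"]))
```

A string of this kind could be pasted into the Entrez query field of a search form. Whether a particular tool accepts it should be checked against that tool's own documentation.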

EMBL-EBI.

Similar to NCBI, EMBL-EBI implemented a user-centered design for its cross-linked databases and tools (6). Various databases are organized in several areas: DNA and RNA (genes, genomes, and expression data), proteins (sequences, families, and structures), metabolites (chemogenomics and metabolomics), and systems (reactions, interactions, and pathways). Universal Protein Resource (UniProt) and its curated knowledge base UniProtKB (14) are the most highly regarded EMBL-EBI databases. As at NCBI, information for microbiologists is scattered throughout this vast collection. One of the most interesting resources for microbiologists at EMBL-EBI is its Metagenomics portal (15). It allows researchers to submit, archive, and analyze genomic information from various environments, and its analysis pipeline enables feature (genes and rRNAs), function (families, structures, and ontologies), and taxonomic (operational taxonomic units) predictions. EMBL-EBI also contributes to the development of several special-purpose databases for microbiologists that will be considered below. All databases, but global resources in particular, have to deal with the recent flood of genomic and metagenomic data, which is by no means a simple task. Both NCBI and EMBL-EBI separated environmental sequencing samples from the nonredundant database in order to reduce the number of hypothetical and partial hits in BLAST searches. More recently, both resources consolidated redundant entries from different strains (sometimes even species) into single records. For example, such records in the NCBI databases start with the label “MULTISPECIES,” and a single record may contain sequences from hundreds of strains (e.g., in the case of E. coli).
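The consolidation of redundant entries described above can be illustrated by grouping records that share an identical sequence. This is only a sketch of the general idea under simplifying assumptions, not the procedure used by NCBI or EMBL-EBI: the function name `consolidate` is hypothetical, the peptide strings are invented toy data, and "redundant" is taken to mean exactly identical sequences.

```python
from collections import defaultdict


def consolidate(records):
    """Collapse identical sequences into single records.

    records: iterable of (strain_name, protein_sequence) pairs
    returns: dict mapping each distinct sequence to the sorted list
             of strains in which it was found
    """
    strains_by_sequence = defaultdict(set)
    for strain, sequence in records:
        strains_by_sequence[sequence.upper()].add(strain)
    return {seq: sorted(strains) for seq, strains in strains_by_sequence.items()}


# Invented toy data: two strains share one protein, a third differs.
records = [
    ("E. coli K-12", "MKTAYIAK"),
    ("E. coli O157:H7", "MKTAYIAK"),
    ("E. coli BL21", "MKTAYIAR"),
]
for sequence, strains in consolidate(records).items():
    print(sequence, strains)
```

With this layout a search hits each distinct sequence once, and the list of strains attached to the record shows how widely that sequence is shared.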

COMPREHENSIVE SPECIAL-PURPOSE DATABASES

Various microorganisms have several hundred to more than ten thousand proteins encoded in their genomes. Classifying them into distinct protein families characterized by conserved domains and evolutionary relationships helps define their biological function. Several resources aim at classifying proteins from their sequence or structure. Pfam (16), SMART (17), and TIGRFAM (18) are carefully curated collections of models that identify conserved proteins and protein domains. The Clusters of Orthologous Groups of Proteins (COGs) database (19) provides phylogenetic classification of proteins. The latest version of COGs is substantially improved by expanded microbial genome coverage (20). A Conserved Domain Database (CDD) at NCBI (21) established a unified collection of such models by combining data from all four above-mentioned databases in addition to its own models. The InterPro protein sequence analysis and classification database at EMBL-EBI (22) serves a similar purpose and provides comprehensive information on protein domains, regions, motifs, and functional sites by integrating data from various sources. Understanding the structure of biological macromolecules leads to deeper understanding of their function. In addition to the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB), which is a key central repository of atomic coordinates and relevant information on the three-dimensional (3D) structures of proteins, nucleic acids, and their complexes (23), the SCOP (Structural Classification of Proteins) (24) and CATH (Class, Architecture, Topology, and Homology) (25) databases are comprehensive resources for classification of protein structures. Finding orthologs (genes and their products in different species that evolved from a common ancestral gene by speciation) and paralogs (genes and their products originated by gene duplication) is yet another important aspect in predicting protein function. 
In addition to COGs, the OrthoDB database provides comprehensive information on gene/protein orthology (26). Reconstruction of metabolism from the genome sequence is a key process in modern biology, which combines the wealth of knowledge obtained by experimental biochemistry with the power of comparative genome analysis. Kyoto Encyclopedia of Genes and Genomes (KEGG) (27) is one of the leaders in this field. KEGG metabolic maps are part of the annotation platform for nearly every bacterial and archaeal genome sequencing project. MetaCyc is a metabolic arm of a large BioCyc collection (28). This database contains more than 2,000 experimentally elucidated metabolic pathways from more than 2,000 organisms from all three domains of life. Protein-protein interactions are critically important for metabolism, transport, signal transduction, and other cellular functions. The STRING database is the leading resource for functional protein networks (29). The BioGRID database provides information about protein and genetic interactions in many model organisms, including E. coli, B. subtilis, eukaryotic microorganisms, and viruses (30).
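The distinction between orthologs and paralogs defined earlier in this section can be made concrete with a small gene tree. The sketch below assumes that the tree has already been reconciled, so every internal node is labeled as either a speciation or a duplication. The nested-tuple tree format, the function names, and the gene names are all hypothetical, and the code does not reproduce the method of COGs, OrthoDB, or any other database mentioned here.

```python
def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _event, left, right = node
    return leaves(left) | leaves(right)


def relation(tree, a, b):
    """Classify genes a and b by the event at their last common ancestor."""
    present = leaves(tree)
    if a == b or a not in present or b not in present:
        raise ValueError("need two different genes that are both in the tree")
    node = tree
    while True:
        event, left, right = node
        for child in (left, right):
            below = leaves(child)
            if a in below and b in below:
                node = child  # both genes are deeper; keep descending
                break
        else:
            # The genes separate at this node: its label decides.
            return "orthologs" if event == "speciation" else "paralogs"


# Hypothetical family: one ancient duplication (copies A1 and A2),
# followed by a speciation that separated two organisms.
tree = ("duplication",
        ("speciation", "ecoli_A1", "bsub_A1"),
        ("speciation", "ecoli_A2", "bsub_A2"))

print(relation(tree, "ecoli_A1", "bsub_A1"))   # orthologs
print(relation(tree, "ecoli_A1", "ecoli_A2"))  # paralogs
```

Note that `ecoli_A1` and `bsub_A2` also come out as paralogs in this example, because their last common ancestor is the duplication even though they reside in different organisms.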

COMPREHENSIVE MICROBIAL RESOURCES

In this category are databases that aim to deliver comprehensive coverage not for a specific purpose but for a specific group of organisms: those that are of interest to the Journal of Bacteriology readership. Ironically, the database known as the Comprehensive Microbial Resource (CMR) (31) is no longer supported. Unfortunately, many other valuable resources also went offline due to lack of funding. For example, the number of obsolete databases removed from the Nucleic Acids Research online collection last year was similar to the number of new databases added (4). However, despite the loss of CMR, several other large databases still provide comprehensive information on microorganisms centering on comparative genomics. The Integrated Microbial Genomes (IMG) data warehouse (32) is a leading comprehensive resource devoted specifically to microbes. It integrates genomes and related metadata from bacteria, archaea, eukaryotic microbes, and viruses. IMG provides a way to analyze genome information in a comparative context. It uses a wide variety of bioinformatics tools and curation by domain experts to deliver high-quality annotation across thousands of microbial genomes. In addition, IMG incorporates newly available proteomics and high-throughput RNA sequencing (RNA-seq) data sets and contains information on biosynthetic clusters, that is, sets of genes encoding pathways for secondary metabolite production in selected bacterial genomes. Other resources aiming to provide the community of microbiologists with the means to analyze comparative genomic data include xBase (33), Microbial Genome Database (MBGD) (34), and the MicrobesOnline resource (35). MicrobesOnline not only contains information on thousands of bacterial, archaeal, and fungal genomes but also provides access to gene expression and fitness data.

Annotation and comparative analysis.

Genome annotation is a multistep process of taking the raw genomic DNA and adding the layers of analysis and interpretation necessary to extract its biological significance (36). Different annotation platforms use different tools at every step of this process, thus often creating discrepancies starting from gene calling to assignment of biological function. Unfortunately, these problems, which were recognized in the early days of genome sequencing (36), are persistent. While all comprehensive microbial resources implement annotation and allow comparative analysis, some databases make it a priority and/or provide a means to improve its quality. The SEED database implemented the subsystems approach (37) to genomic data, which is also extended to a popular platform for automated microbial genome annotation, RAST (Rapid Annotation using Subsystems Technology) (38). The MicroScope genome annotation and comparative analysis database (39) aims at improving annotation by enabling groups of investigators to collectively curate various aspects of a given microbial genome and enables cross-genome comparisons. The idea of bringing together the computational and experimental communities in the interest of improving our understanding of microbial gene function is behind the COMBREX database (40). The Gene Ontology resource (41) was designed to unify the representation of genes and proteins across all species, and it is often used as a part of the genome annotation process. The main goal of the POGO-DB database (42) is to allow pairwise comparisons of microbial genomes and identification of orthologous genes. Identification of orthologs in bacterial and archaeal genomes is also the main scope of the Ortholuge database (43). Similarly, the ATGC database (44) is the database of orthologous genes, but its distinctive feature is that orthologs are defined in closely related genomes, which is useful for studying microevolution in prokaryotes. 
Visualization of bacterial chromosome maps at scale delivered by the BacMap database (45) is another valuable tool for comparative analysis. Specialized databases useful for genome annotation and analysis include PSORTdb (46), which contains information on protein subcellular localizations for bacteria and archaea, DoriC (47), which catalogs experimentally identified and computationally predicted oriC regions in bacterial and archaeal genomes, ICEberg (48), which does a similar job for bacterial integrative and conjugative elements, and MICAS (49), which contains information on simple sequence repeats and short tandem repeats (microsatellites). For those looking for a quick automated annotation of a bacterial or archaeal genome, Prokka is a good solution. This recent, but already highly popular platform can deliver automated annotation in 10 min on a desktop computer (50).
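To make concrete what a simple sequence repeat (microsatellite) of the kind cataloged by MICAS looks like, the following sketch scans a DNA string for short units repeated in tandem. It is an illustration only and not the method used by MICAS: the function name `find_ssrs` and the thresholds (units of 1 to 6 bases, at least 4 copies) are assumptions chosen for the example, and only perfect repeats are reported.

```python
import re


def find_ssrs(dna, min_unit=1, max_unit=6, min_repeats=4):
    """Find perfect tandem repeats of short units in a DNA string.

    Returns a list of (start_position, repeat_unit, copy_number).
    """
    # ([ACGT]{a,b}?) lazily captures the shortest possible unit;
    # \1{n,} then requires at least n further copies right after it.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(dna.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Toy sequence containing five tandem copies of the dinucleotide AT.
print(find_ssrs("GGCATATATATATCCG"))  # [(3, 'AT', 5)]
```

Because the unit is matched lazily, a run such as ATATATATAT is reported with the unit AT and not with a longer unit such as ATAT.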

SPECIAL-PURPOSE DATABASES

Model organisms.

Escherichia coli is the most widely studied prokaryotic organism. Consequently, there are many resources for the large community of microbiologists, molecular biologists, and biotechnologists interested in this bacterium. Several E. coli resources deliver comprehensive coverage of genome information, literature-based curation, and experimental data: (i) EcoGene (51) and GenoBase (52) primarily focus on genome information but also contain other “omics” data, such as microarray data; (ii) EcoCyc (53) provides a powerful synthesis of genome and metabolism information; and (iii) PortEco (54) builds a platform where key E. coli data can be accessed through the same web portal and serves as a forum for community interactions. Finally, the E. coli genome project at ASAP (A Systematic Annotation Package) database (55) stores and distributes genome information and experimental data from functional genomics studies. These rich resources deliver most of the essentials for the E. coli community. In addition, several more specialized databases are dedicated to specific topics in E. coli research. RegulonDB is an authoritative resource for the E. coli transcriptional regulatory network, which is built on experimental and computationally predicted data sets (56). The Bacteriome database (57) has a goal of defining all known and predicting novel protein-protein interactions in E. coli, and STEPdb (58) catalogs subcellular localization and topology of all its proteins. The comparative proteomics database EcoProDB has experimental information on E. coli proteins and experimental and theoretical 2D maps (59). ECMDB contains comprehensive annotation and detailed information about the E. coli metabolome (60).
Bacillus subtilis, the best-studied Gram-positive bacterium, serves as a model organism for studying cell differentiation and chromosome replication, and it is widely used in biotechnology. Dedicated databases for B. subtilis include SubtiWiki, a collaborative resource of the Bacillus community, which links pathway, interaction, and expression information (61). The SporeWeb interactive knowledge platform captures relevant information about the B. subtilis sporulation cycle (62), whereas the BacillusRegNet database contains the known regulatory network of this organism as well as predicted interactions for other Bacillus species (63). Transcriptional regulation in B. subtilis and information about upstream intergenic conservation are also captured in the DBTBS database (64).
Cyanobacteria are widely studied due to their ability to derive energy through photosynthesis. ProPortal, which contains genome, metagenome, transcriptomics, and population dynamics information, serves as the main resource for the model cyanobacterium Prochlorococcus and its close relatives (65). CyanoBase (66) is one of the Kazusa genomic resources (Japan) that serve researchers studying cyanobacteria, including Synechocystis, Prochlorococcus, Synechococcus, Nostoc, and other model photosynthetic bacteria. Its sister database, RhizoBase (66), plays the same role for a large community of microbiologists interested in nitrogen-fixing bacteria, including various rhizobia as well as Azoarcus, Azospirillum, Klebsiella, and Frankia. Kazusa resources also include the Streptomyces database devoted to the organisms that produce more than two-thirds of clinically useful antibiotics of natural origin. A more specialized StreptomeDB database (67) explicitly focuses on natural bioactive compounds from Streptomyces that can potentially be used as pharmaceuticals: antibiotics and antitumor and immunosuppressant drugs. As the endosymbiont of aphids, Buchnera is the subject of investigation in the field of host-microbe interactions. The BuchneraBase database was constructed to facilitate the postgenomic analysis of several Buchnera strains and closely related insect endosymbionts (68).
Due to its extremely small genome, Mycoplasma genitalium became a model for synthetic genomics. The first whole-genome genotype-to-phenotype model was constructed using this organism, and the corresponding database WholeCellKB (69) offers a platform for whole-cell modeling in other organisms. The Saccharomyces Genome Database is the key resource for a model eukaryotic microorganism (70), which provides comprehensive biological information for the budding yeast Saccharomyces cerevisiae, including ontologies, biochemical pathways, expression data, and phenotypes in addition to the genome browser and similarity search tools. The yeast stress expression database, yStreX (71), is an online repository of analyzed gene expression data related to responses to diverse environmental transitions.

Diversity and metagenomics.

Research in microbial diversity has flourished owing to metagenomics approaches utilizing NGS technologies to characterize microbial communities in different ecosystems. Consequently, databases that assist researchers in environmental microbiology, microbial ecology, taxonomy, and phylogenetics become increasingly important. The Ribosomal Database Project (RDP) (72) maintains the largest collection of aligned and annotated rRNA gene sequences from bacteria, archaea, and fungi. It enables researchers to analyze their rRNA sequences in the RDP framework and provides tools to facilitate analysis of high-throughput data. Similar capabilities are provided by the SILVA taxonomic framework (73), which offers a set of rRNA gene sequence databases for bacteria, archaea, and eukaryota based on representative phylogenetic trees, and by the GREENGENES project (74). To help monitor microbial communities within complex environments, the PhylOPDb database offers a large collection of regular and explorative rRNA-targeted probes (75), and the rRNA operon copy number (rrnDB) database provides a means to interpret rRNA gene abundance in bacteria and archaea (76). The Bacterial Diversity Metadatabase (BacDive) contains detailed information on various aspects of more than 20,000 strains of bacteria and archaea, which includes taxonomy, physiology, sampling, and environmental conditions (77). The Global Catalog of Microorganisms (GCM) is a database for retrieval and analysis of relevant information for hundreds of thousands of microbial strains from different sources (78). Bacterial Isolate Genome Sequence Database (BIGSdb) from the PubMLST (collection of databases for molecular typing and microbial genome diversity) resource stores sequence data for bacterial isolates and enables analysis of genome variation at the population level (79). 
The List of Prokaryotic Names with Standing in Nomenclature (LPSN) (80) is an important resource that provides up-to-date classification of prokaryotes by listing the names of bacteria and archaea that have been validated and published in the International Journal of Systematic and Evolutionary Microbiology. The Human Microbiome Project (HMP) and other large-scale projects created the need to submit, store, analyze, and share numerous metagenomic data sets. In response to this challenge, platforms dedicated to metagenomics are now offered by several bioinformatics centers. EBI established a dedicated metagenomics resource (15), which allows users to submit raw nucleotide reads for taxonomic and functional analyses by an automated pipeline. Similar capabilities are provided by IMG in its dedicated IMG/M resource (81), which also enables expert review of metagenome annotations (IMG/M ER). The metaMicrobesOnline (82) and MetaRef (83) databases offer comparative analysis for microbial genomes and metagenomes with emphasis on gene trees, gene family conservation, and genome context comparisons. To promote inference of metagenomic functional networks, the MetaProx database (84) contains information on candidate operons in metagenomic data sets. The MetaBioME database (85) positions itself as a comprehensive metagenomic biomining engine by providing the opportunity to find novel homologs of known commercially useful enzymes in metagenomic data sets. The FOAM (Functional Ontology Assignments for Metagenomes) database offers classification of gene functions in environmental metagenomes based on ontology and orthologous relationships (86).
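The role of rRNA operon copy numbers in interpreting rRNA gene abundance, mentioned above in connection with rrnDB, can be shown with a short calculation. This is a minimal sketch under stated assumptions: the function name is hypothetical, the taxa, read counts, and copy numbers are invented placeholders, and in practice the copy numbers would be looked up in a curated resource such as rrnDB. Each read count is divided by the number of rRNA operons per genome, and the results are rescaled to sum to 1.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Convert rRNA gene read counts into estimated relative cell abundance.

    read_counts:  dict mapping taxon -> number of rRNA gene reads
    copy_numbers: dict mapping taxon -> rRNA operon copies per genome
    """
    # Reads per operon copy approximate the number of genomes sampled.
    genome_equivalents = {
        taxon: reads / copy_numbers[taxon]
        for taxon, reads in read_counts.items()
    }
    total = sum(genome_equivalents.values())
    return {taxon: value / total for taxon, value in genome_equivalents.items()}


# Invented placeholder values: taxon_A yields seven times more reads
# than taxon_B but also carries seven times more operon copies.
reads = {"taxon_A": 700, "taxon_B": 100}
copies = {"taxon_A": 7, "taxon_B": 1}
print(copy_number_corrected_abundance(reads, copies))
```

In this toy case the raw read counts suggest that taxon_A dominates the sample, whereas the corrected estimate assigns the two taxa equal shares.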

Pathogenesis.

Pathogen Portal (http://pathogenportal.org) is a rich resource on pathogenic microorganisms that includes both bacteria and eukaryotes. The Pathosystems Resource Integration Center (PATRIC) provides the community of microbiologists interested in pathogenic bacteria with access to a variety of data, including genomics, transcriptomics, protein-protein interactions, sequence typing, etc., covering more than 10,000 genomes of genera containing NIAID category A to C/emerging/reemerging pathogens (87). GeneDB (88) is an annotation database for many pathogenic bacteria, eukaryotes, and viruses. Databases that are smaller in scope focus on specific pathogens and their functions. The tuberculosis database (TBDB) integrates genomic sequences and data for Mycobacterium species relevant to drug discovery, vaccines, and biomarkers (89), whereas SITVITWEB (90), tbvar (91), and GMTV (92) focus on delivering comprehensive information on genome-wide variation in Mycobacterium tuberculosis. Other databases featuring specific pathogenic bacteria include the Pseudomonas (93), Vibrio (94), Corynebacterium (95), and Helicobacter (96) genome databases. The HoPaCI-DB resource focuses on host-pathogen interactions of Pseudomonas aeruginosa and Coxiella spp. and provides thousands of validated interactions between molecules and processes (97). Postanalysis data on the Staphylococcus aureus transcriptome can be found in the SATMD database (98). DBSecSys provides comprehensive information about secretion systems in a category B priority pathogen, Burkholderia mallei (99). Information about bacterial species with the most relevance to veterinary medicine can be found in the VetBact database (100). 
Resources on viral pathogens include the molecular and epidemiological ViralZone knowledge base (101), the HBVdb database for the hepatitis B virus (102), the Papillomavirus Episteme (103), and databases of experimentally validated viral small interfering RNA (siRNA)/small hairpin RNA (shRNA) VIRsiRNAdb (104) and microRNA (miRNA) VIRmiRNA (105). Eukaryotic pathogenic microorganisms are well represented in Eukaryotic Pathogen Database Resources (EuPathDB), a collection of individual databases, each focusing on specific pathogens, accessible through a common portal (106). FungiDB contains information on Candida, Aspergillus, Cryptococcus, and some other fungi (107). The EuPathDB collection includes databases for pathogenic amoeba and microsporidia (108), Cryptosporidium (109), Toxoplasma (110), Giardia and Trichomonas (111), Trypanosoma (112), and several species of Plasmodium malaria parasites (113).

Transport, secretion, and metabolism.

Several special-purpose databases are devoted to these functions. Two expert-led databases, the Transporter Classification Database (TCDB) and TransportDB, provide an overview, curated annotations, and detailed genomic comparisons of membrane transport proteins across selected genomes of bacteria, archaea, and eucarya (114, 115). More specialized community databases highlight specific transport systems, including archaeal and bacterial ABC transporters (116) and β-barrel outer membrane proteins (117) as well as type III (118), type IV (119, 120), and type VI (121) secretion systems. Microme (http://microme.eu) is a European resource for microbial metabolism whose main goal is to support the large-scale inference of metabolic flux directly from the genome sequence. It includes the well-regarded microbial genome annotation and analysis platform MicroScope (39) and serves as a portal to several analysis and data mining tools. Microorganisms contain many enzymes that assemble, modify, and break down oligo- and polysaccharides. The Carbohydrate-Active Enzymes (CAZy) database provides a classification platform linking the sequences of these enzymes to their specificities and 3D structures (122). Many other community databases are devoted to specific metabolic features and processes in microorganisms. AromaDeg is a database focusing on aerobic bacterial degradation of aromatics (123). The DEOP database on osmoprotectants and associated pathways provides curated information for many bacterial species (124). BacMet is a resource for antibacterial biocide and metal resistance genes (125). Information on microbial polyketide and nonribosomal peptide gene clusters can be found in the ClusterMine360 database (126). Detailed information on bacterial carbohydrate structures is provided by the BCSDB database (127); experimentally characterized glycoproteins from bacteria and archaea are listed in the ProGlycProt repository (128). CyanoLyase (129) and mVOC (130) are curated databases of phycobiliproteins and microbial volatile compounds, respectively.

Signal transduction and gene regulation.

The Microbial Signal Transduction (MiST) database offers detailed information on receptors, kinases, response regulators, and transcription factors in thousands of bacterial and archaeal genomes (131). A subset of these proteins is also available in the P2CS (prokaryotic two-component systems) database (132). The Quorumpeps database holds a collection of specific signaling molecules, namely quorum-sensing peptides (133). Transcriptional regulation is the main mode of bacterial responses to signals, and several databases are designed to capture known and predicted transcriptional regulators and their targets. The SwissRegulon database (134) provides genome-wide annotation of regulatory sites for model eukaryotes and prokaryotes, including E. coli, B. subtilis, S. aureus, Vibrio cholerae, and M. tuberculosis. RegTransBase is a database of regulatory sequences and interactions based on careful curation of thousands of scientific publications (135), and the RegPrecise database delivers inferred regulatory interactions across hundreds of bacterial genomes with an emphasis on phylogenetic, structural, and functional properties (136). Network Portal (137) provides analysis and visualization tools for selected gene regulatory networks in several bacterial species, including E. coli, B. subtilis, M. tuberculosis, P. aeruginosa, and Campylobacter jejuni. It serves as a modular database for analyzing user-uploaded data and public data. Operons predicted for more than 1,200 genomes of bacteria and archaea can be accessed in the ProOpDB database (138). A collection of experimentally verified transcription factor-binding sites is available in the CollecTF database (139). WebGeSTer DB is the largest database of intrinsic transcription terminators, identified in more than a thousand bacterial genomes (140).

CONCLUDING REMARKS

The advent of the Internet and high-throughput genome sequencing has opened new horizons for biomedical sciences. Sooner than we think, electronic patient records in hospitals will be merged with patient genomic information to build the foundation for truly personalized medicine. In a similar fashion, the wealth of knowledge about microorganisms that resides in electronic copies of journal articles will be merged with genomic (and other “omic”) information to build the foundation for a more vigorous and comprehensive analysis of microbes. While we are at the very beginning of this road, it becomes increasingly important for microbiologists to know how to use genomic resources, both databases and computational tools, to enhance their own research and to archive and share the knowledge they obtain in a robust and widely accessible form. One obvious way is to deposit the results of experimental research, especially high-throughput data, in public repositories. Submission of genome sequencing (141) and microarray (142) data to public repositories has become mandatory. However, there are many other ways to enhance the visibility of results and share them more efficiently at the same time. For example, the commentary published in this issue of the Journal of Bacteriology by Ivan Erill (143) shows how the CollecTF database (139) can be used by researchers to submit, archive, and share experimentally verified transcription factor-binding sites that are usually reported only in research articles and are hard to mine. Similarly, it is important for authors publishing their research papers to use database accession numbers that link the genes, proteins, and other data sets described in the paper to genomic data. Because many journal electronic editions now provide hyperlinks to genomic databases, readers can then access the relevant data in one click.
While some biological databases will come and go and others may change their substance and appearance, it is clear that in the grand scheme of things, biological databases—from giant repositories and comprehensive information portals to small community databases—will play an increasingly important role in biological discovery.
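The linking of accession numbers to database records mentioned above can be illustrated with a minimal Python sketch. It is not taken from any journal's production system. The regular expression covers only a few RefSeq-style prefixes, and the function name `link_accessions`, the prefix table, and the NCBI URL layout are assumptions made for illustration.

```python
import re

# Simplified pattern for a few RefSeq-style accession prefixes (an assumption
# for illustration; real accession formats are more varied than this).
ACCESSION_RE = re.compile(r"\b(NC|NZ|NP|WP|YP)_([A-Z]{0,6}\d+)(\.\d+)?\b")

# Hypothetical mapping from accession prefix to an NCBI web path.
PREFIX_TO_PATH = {
    "NC": "nuccore",  # nucleotide records
    "NZ": "nuccore",
    "NP": "protein",  # protein records
    "WP": "protein",
    "YP": "protein",
}


def link_accessions(text):
    """Return (accession, url) pairs for accession-like tokens in text.

    Pairs are listed in order of first appearance, without duplicates.
    """
    seen = set()
    links = []
    for match in ACCESSION_RE.finditer(text):
        accession = match.group(0)
        if accession in seen:
            continue
        seen.add(accession)
        path = PREFIX_TO_PATH[match.group(1)]
        url = f"https://www.ncbi.nlm.nih.gov/{path}/{accession}"
        links.append((accession, url))
    return links


if __name__ == "__main__":
    # Example strings only; the sentence is invented for the demonstration.
    sample = "The chromosome (NC_000913.3) encodes a protein (NP_414543.1)."
    for accession, url in link_accessions(sample):
        print(accession, url)
```

A production pipeline would more likely rely on identifiers that authors supply explicitly at submission than on pattern matching, since free-text matching can miss or misread accessions.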

143 in total

1. Bacterial carbohydrate structure database 3: principles and realization.

Authors: Philip V Toukach
Journal: J Chem Inf Model Date: 2010-12-14 Impact factor: 4.956

2. Big data: The future of biocuration.

Authors: Doug Howe; Maria Costanzo; Petra Fey; Takashi Gojobori; Linda Hannick; Winston Hide; David P Hill; Renate Kania; Mary Schaeffer; Susan St Pierre; Simon Twigger; Owen White; Seung Yon Rhee
Journal: Nature Date: 2008-09-04 Impact factor: 49.962

3. PSORTdb--an expanded, auto-updated, user-friendly protein subcellular localization database for Bacteria and Archaea.

Authors: Nancy Y Yu; Matthew R Laird; Cory Spencer; Fiona S L Brinkman
Journal: Nucleic Acids Res Date: 2010-11-10 Impact factor: 16.971

4. AmoebaDB and MicrosporidiaDB: functional genomic resources for Amoebozoa and Microsporidia species.

Authors: Cristina Aurrecoechea; Ana Barreto; John Brestelli; Brian P Brunk; Elisabet V Caler; Steve Fischer; Bindu Gajria; Xin Gao; Alan Gingle; Greg Grant; Omar S Harb; Mark Heiges; John Iodice; Jessica C Kissinger; Eileen T Kraemer; Wei Li; Vishal Nayak; Cary Pennington; Deborah F Pinney; Brian Pitts; David S Roos; Ganesh Srinivasamoorthy; Christian J Stoeckert; Charles Treatman; Haiming Wang
Journal: Nucleic Acids Res Date: 2010-10-24 Impact factor: 16.971

5. COMBREX: a project to accelerate the functional annotation of prokaryotic genomes.

Authors: Richard J Roberts; Yi-Chien Chang; Zhenjun Hu; John N Rachlin; Brian P Anton; Revonda M Pokrzywa; Han-Pil Choi; Lina L Faller; Jyotsna Guleria; Genevieve Housman; Niels Klitgord; Varun Mazumdar; Mark G McGettrick; Lais Osmani; Rajeswari Swaminathan; Kevin R Tao; Stan Letovsky; Dennis Vitkup; Daniel Segrè; Steven L Salzberg; Charles Delisi; Martin Steffen; Simon Kasif
Journal: Nucleic Acids Res Date: 2010-11-21 Impact factor: 16.971

6. ICEberg: a web-based resource for integrative and conjugative elements found in Bacteria.

Authors: Dexi Bi; Zhen Xu; Ewan M Harrison; Cui Tai; Yiqing Wei; Xinyi He; Shiru Jia; Zixin Deng; Kumar Rajakumar; Hong-Yu Ou
Journal: Nucleic Acids Res Date: 2011-10-18 Impact factor: 16.971

7. The Sequence Read Archive: explosive growth of sequencing data.

Authors: Yuichi Kodama; Martin Shumway; Rasko Leinonen
Journal: Nucleic Acids Res Date: 2011-10-18 Impact factor: 16.971

8. ProGlycProt: a repository of experimentally characterized prokaryotic glycoproteins.

Authors: Aadil H Bhat; Homchoru Mondal; Jagat S Chauhan; Gajendra P S Raghava; Amrish Methi; Alka Rao
Journal: Nucleic Acids Res Date: 2011-10-28 Impact factor: 16.971

9. xBASE2: a comprehensive resource for comparative bacterial genomics.

Authors: Roy R Chaudhuri; Nicholas J Loman; Lori A S Snyder; Christopher M Bailey; Dov J Stekel; Mark J Pallen
Journal: Nucleic Acids Res Date: 2007-11-05 Impact factor: 16.971

10. ToxoDB: an integrated Toxoplasma gondii database resource.

Authors: Bindu Gajria; Amit Bahl; John Brestelli; Jennifer Dommer; Steve Fischer; Xin Gao; Mark Heiges; John Iodice; Jessica C Kissinger; Aaron J Mackey; Deborah F Pinney; David S Roos; Christian J Stoeckert; Haiming Wang; Brian P Brunk
Journal: Nucleic Acids Res Date: 2007-11-14 Impact factor: 16.971
