
ODB: a database of operons accumulating known operons across multiple genomes.

Shujiro Okuda, Toshiaki Katayama, Shuichi Kawashima, Susumu Goto, Minoru Kanehisa.

Abstract

Operon structures play an important role in co-regulation in prokaryotes. Although over 200 complete genome sequences are now available, databases providing genome-wide operon information have been limited to certain specific genomes. Thus, we have developed ODB (Operon DataBase), which provides a data retrieval system for known operons across many complete genomes. Putative operons that are conserved in terms of known operons are also provided. The current version of our database contains information on about 2000 known operons in more than 50 genomes and about 13 000 putative operons in more than 200 genomes. This system integrates four types of associations: genome context, gene co-expression obtained from microarray data, functional links in biological pathways and the conservation of gene order across genomes. These associations are indicators of genes that form an operon, and combining these indicators allows us to predict operons more reliably. Furthermore, our system validates these predictions using known operon information obtained from the literature. This database integrates known literature-based information and genomic data. In addition, it provides an operon prediction tool, which makes the system useful for both bioinformatics researchers and experimental biologists. Our database is accessible at http://odb.kuicr.kyoto-u.ac.jp/.

Year: 2006 PMID: 16381886 PMCID: PMC1347400 DOI: 10.1093/nar/gkj037

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

With the increasing availability of completely sequenced genomes, comparative genomic approaches are becoming more important for deciphering the functions of genes. Powerful methods that use the conservation of gene proximity on genomes (i.e. determining potential operons) can reveal functional associations between genes (1–3). Genes in an operon are functionally associated with each other in prokaryotes; thus, various kinds of operon prediction methods have been developed to understand the functional relationships and to annotate genes (4–13). Databases that accumulate experimentally verified operon information should be useful for validating such prediction methods and also for understanding the functional associations between genes. However, databases providing genome-wide operon information have been limited to certain specific genomes (14,15). Although the STRING database was developed to identify functional associations between genes for multiple genomes, it uses gene neighborhood based on genome context methods (16). Here, we introduce a database called ODB (Operon DataBase), which provides operon data documented in the literature and putative operons that are conserved in terms of known operons. Furthermore, to characterize operons, it integrates genome context, gene co-expression obtained from microarray data, functional links in biological pathways and data on the conservation of gene order across genomes. ODB also provides operon prediction based on these various types of data as an application of our database. These datasets are fully pre-computed so that all information can be accessed quickly. The ODB database integrates known literature-based information and genomic data. In addition, it provides an operon prediction tool, which makes the system useful for both bioinformatics researchers and experimental biologists.

OPERON DATA SOURCES

We have collected information on known operons in multiple genomes from the literature. The experimentally verified operons that we have collected were verified by a variety of means, from direct measurements such as primer extension and northern blots to less direct methods such as gene knock-out experiments. Our database represents an ongoing effort to increase the coverage of operons. The current version of our database contains information on about 2000 known operons in more than 50 genomes, obtained from a total of 825 publications (Table 1). Note that although some of these operons overlap, we use the term ‘operon’ to refer to an individual ‘transcriptional unit’, as opposed to the generally understood usage of the term, which may include multiple overlapping transcriptional units. These data also include the operons of Caenorhabditis elegans. Operon structures are often observed in prokaryotes, but nematodes also have similar transcriptional systems (17,18). Thus, we added these eukaryotic operons to our database. The operons from Bacillus subtilis include operons obtained from transcriptional maps stored in BSORF. Because these maps were derived from the results of northern blotting experiments, we added these operons to our database. These entries can be distinguished from the operons obtained from the literature, as the origin of the source (BSORF) is annotated in the database.

Table 1

Statistics of operons in major genomes

Species	No. of operons	No. of putative operons
Eukaryotes: 7 species
Caenorhabditis elegans	628	149
Saccharomyces cerevisiae	–	7
Prokaryotes: 177 species
Bacillus subtilis	711	60
Escherichia coli	389	61
Pseudomonas aeruginosa	33	156
Agrobacterium tumefaciens	15	172
Synechocystis sp. PCC6803	12	26
Bradyrhizobium japonicum	10	190
Archaea: 19 species
Methanosarcina acetivorans	–	44
Pyrococcus furiosus	2	13
Total: 203 species	1957	13 258

Table 1 also shows putative operons that are conserved in terms of known operons. To calculate this conservation, we used KEGG OC as the ortholog gene set (19); KEGG OC is an ortholog gene clustering based on Smith–Waterman sequence similarity scores. If the genes in a known operon have ortholog genes in another genome and these ortholog genes are consecutively located on the same strand of that genome, we regarded them as a putative but highly reliable operon. This rule is not applied to known mono-cistronic genes. Putative operons were also explored from the viewpoint of paralog genes, and they were explored in eukaryotes as well. The term ‘operon’ is not usually used for eukaryotic gene clusters, but we use it operationally in our database. As a result, over 13 000 putative operons were found in over 200 genomes.
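
The conservation rule described above can be illustrated with a minimal Python sketch. It is not the ODB implementation: the function name `find_putative_operon`, the data layout and the requirement that every gene of the known operon has an ortholog in the target genome are assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical gene record: (gene_id, strand), listed in genome order.
Gene = Tuple[str, str]


def find_putative_operon(
    known_operon: List[str],
    orthologs: Dict[str, str],
    target_genome: List[Gene],
) -> Optional[List[str]]:
    """Return the target-genome genes forming a putative operon, or None.

    known_operon  : gene IDs of a known operon in the source genome
    orthologs     : source gene ID -> ortholog gene ID in the target genome
    target_genome : genes of the target genome in positional order
    """
    # Known mono-cistronic genes are excluded, as stated in the text.
    if len(known_operon) < 2:
        return None

    # Assumption: every gene of the known operon needs an ortholog.
    if any(g not in orthologs for g in known_operon):
        return None
    wanted = {orthologs[g] for g in known_operon}
    if len(wanted) < 2:
        return None

    index = {gene_id: i for i, (gene_id, _) in enumerate(target_genome)}
    if any(g not in index for g in wanted):
        return None
    positions = sorted(index[g] for g in wanted)

    # Consecutive location: the positions form an unbroken run.
    if positions[-1] - positions[0] != len(positions) - 1:
        return None

    # All orthologs must lie on the same strand.
    if len({target_genome[i][1] for i in positions}) != 1:
        return None

    return [target_genome[i][0] for i in positions]
```

For a target genome `[("t1", "+"), ("t2", "+"), ("t3", "+"), ("t4", "-")]`, a known operon whose orthologs are `t1`, `t2` and `t3` would be reported as a putative operon, whereas one mapping to `t3` and `t4` would be rejected because the strands differ.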

OVERVIEW OF THE DATABASE

ODB uses a relational database management system (MySQL) to store and manage all information, including not only known and putative operons but also primary data, such as gene locations and definitions, and associations between genes. The system contains four types of associations between genes that characterize an operon: (i) intergenic distances, (ii) functional links in biological pathways, (iii) gene co-expression obtained from microarray data and (iv) the conservation of gene order across multiple genomes. These four types of associations are considered indicators that the genes they link may form an operon. Therefore, we pre-calculated these associations among all genes in all available genomes to characterize operons. Genes in an operon are often located closer together on the genome than adjacent genes that do not belong to the same operon, so intergenic distance is one of the indicators used to characterize operons. Intergenic distance is defined as the number of bases between the end position of a gene and the start position of the next gene on the genome. In addition, genes in an operon are often functionally related. For example, genes appearing in a metabolic pathway are often clustered on the genome to be co-transcribed (20). Such functional links were obtained from KEGG pathway (19). We calculated the number of steps between genes in the pathway maps: when two genes are linked through a compound, the number of steps is one. In this way, we calculated the number of steps not only within the same pathway map but also across different pathway maps. The KEGG EXPRESSION database contains gene expression data derived from microarrays of four organisms: B. subtilis, Escherichia coli K-12 W3110, Synechocystis sp. PCC6803 and Saccharomyces cerevisiae (19). We used the information on co-expressed genes from this database and calculated Pearson's correlation coefficients between the gene expression profiles obtained from these microarray data.
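
As an illustration of two of these pre-computed associations, the following Python sketch counts pathway steps with a breadth-first search and computes Pearson's correlation coefficient between two expression profiles. The function names and the representation of pathway data as a mapping from each gene to the compounds it acts on are hypothetical; how ODB derives gene–compound links from KEGG is not described here.

```python
from collections import deque
from math import sqrt
from typing import Dict, Optional, Sequence, Set


def pathway_steps(
    gene_compounds: Dict[str, Set[str]], start: str, goal: str
) -> Optional[int]:
    """Shortest number of steps between two genes.

    Two genes that share a compound are one step apart (assumed model).
    Returns None when the genes are not connected.
    """
    if start not in gene_compounds or goal not in gene_compounds:
        return None
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        gene, dist = queue.popleft()
        for other, compounds in gene_compounds.items():
            # A non-empty intersection means the genes share a compound.
            if other not in seen and compounds & gene_compounds[gene]:
                if other == goal:
                    return dist + 1
                seen.add(other)
                queue.append((other, dist + 1))
    return None


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)
```

Because the search visits genes in order of increasing distance, the first time the target gene is reached gives the shortest step number, which is the value the pathway table reports.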
Because microarray data are considered to reflect actual gene transcription and to be powerful tools for predicting operons, co-expressed gene clusters on the genome are possible operons. However, limitations in the experimental conditions and in the quality of the experiments mean that certain operons are not transcribed and that the level of gene co-expression is not homogeneous. Therefore, there are cases where genes are not co-expressed even though they belong to a known operon. Gene order in an operon is often shuffled and disrupted over evolutionary history (21,22). Therefore, conservation of gene order across genomes is rather rare, especially in distantly related genomes. If such conservation is observed, the genes are probably related through a physical interaction such as a molecular complex (23). Therefore, this feature is also important in characterizing operons. We calculated the step number between gene pairs as follows. Given a gene pair, we took the ortholog genes of each gene from all genomes, calling each such pair an ‘ortholog gene pair’, and then calculated the step number between the two ortholog genes. When the gene pair is adjacently located on the genome, the step number is regarded as one. Here, we ignore genomes included in the same taxonomic group, as defined in KEGG. All-against-all calculations of these associations between genes were performed. The pre-computed results are stored in tables, allowing quick retrieval in response to user queries. Each table in our system corresponds to a particular genome to facilitate efficient access and retrieval of the information. When users search for a gene or an operon of interest, the gene cluster that includes it can be identified by its name and identifier. The user is then presented with a summary of the genes and the associations between genes in that region of the genome (Figure 1).
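
The step number between ortholog gene pairs described above might be computed along the following lines. This is a sketch under stated assumptions rather than the ODB code: the function name, the dictionary layouts and the use of positional index differences on a linear gene list (ignoring genome circularity and strand) are all hypothetical.

```python
from typing import Dict, List


def ortholog_step_numbers(
    gene_a: str,
    gene_b: str,
    query_genome: str,
    orthologs: Dict[str, Dict[str, str]],
    gene_order: Dict[str, List[str]],
    taxon: Dict[str, str],
) -> Dict[str, int]:
    """Step number of the ortholog gene pair in each other genome.

    orthologs  : genome -> {query gene ID -> ortholog gene ID in that genome}
    gene_order : genome -> gene IDs in positional order
    taxon      : genome -> taxonomic group label
    Adjacent genes have step number 1.
    """
    steps: Dict[str, int] = {}
    for genome, mapping in orthologs.items():
        # Skip the query genome and genomes of the same taxonomic group.
        if genome == query_genome or taxon[genome] == taxon[query_genome]:
            continue
        # Both genes of the pair need an ortholog in this genome.
        if gene_a not in mapping or gene_b not in mapping:
            continue
        order = gene_order[genome]
        try:
            i = order.index(mapping[gene_a])
            j = order.index(mapping[gene_b])
        except ValueError:
            continue  # ortholog missing from the positional list
        steps[genome] = abs(i - j)
    return steps
```

Counting the genomes in the returned dictionary whose step number falls within a chosen range gives a simple measure of how widely the gene order is conserved outside the query's taxonomic group.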
Primary data such as gene names, gene IDs, definitions, KO IDs as functional classes, KEGG pathway IDs and EC numbers are presented. These are linked to the KEGG database where available. A genomic view of the region of interest is also presented. This view includes graphical symbols for operons, genes, pathways and EC numbers, and each symbol is also linked to the KEGG database. The user can scroll and zoom within the region of interest on the genome. Finally, the four types of associations are shown as separate tables. For the biological pathway table, the shortest step numbers between genes are presented. For the ortholog gene table, the shortest step numbers between the ortholog gene pairs are shown. From these tables, additional pages showing the detailed information are accessible. For the gene expression table, the correlation coefficients between gene expression profiles are shown, and the strength of co-expression is illustrated by a color gradient ranging from blue to red.

Figure 1

An example view of an operon.

OPERON PREDICTION

Because the conditions used to determine putative operons are very strict and are not genome-wide, ODB also provides a system to predict operons using the four associations. Given a specific species, the operons predicted to exist within that species are returned. Two options are available: simple and advanced prediction modes. In simple mode, users obtain prediction results based on default parameter values that have been validated against known operons. In advanced mode, users can freely change these parameter values, which are based on the four types of associations described above. When genes linked by these associations are clustered on the genome, they are likely to form an operon. Thus, we benchmarked the accuracy of the predictions based on combinations of various values of intergenic distances, step numbers between ortholog genes and the number of genomes having conserved ortholog genes that are linked within a specific range of step numbers. The optimal values, which predict the largest number of operons while keeping the accuracy high, are provided as default values in simple prediction mode (Supplementary Data). When there is little or no known operon information for a genome, the default values of another genome that is in the same taxonomic group and has sufficient operon information are used instead. If no such genome is available, we use the values of B. subtilis (see Supplementary Data for details).
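
To make the role of the parameters concrete, the following Python sketch groups adjacent genes into predicted operons using only one of the four indicators, the intergenic distance. It is a deliberately simplified illustration, not the ODB predictor: the `Gene` record, the 1-based inclusive coordinates and the threshold value passed as `max_distance` are assumptions, and the real default values are those given in the Supplementary Data.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    gene_id: str
    start: int   # 1-based, inclusive (assumed coordinate convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def intergenic_distance(upstream: Gene, downstream: Gene) -> int:
    """Bases between the end of one gene and the start of the next.

    Negative values indicate overlapping genes.
    """
    return downstream.start - upstream.end - 1


def predict_operons(genes: List[Gene], max_distance: int) -> List[List[str]]:
    """Group neighbouring same-strand genes whose gap is <= max_distance."""
    operons: List[List[str]] = []
    current: List[Gene] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if (
            current
            and gene.strand == current[-1].strand
            and intergenic_distance(current[-1], gene) <= max_distance
        ):
            current.append(gene)
        else:
            # Close the previous group; single genes are not reported.
            if len(current) > 1:
                operons.append([g.gene_id for g in current])
            current = [gene]
    if len(current) > 1:
        operons.append([g.gene_id for g in current])
    return operons
```

In a fuller version, the test inside the loop would also consult the pathway step numbers, the co-expression coefficients and the ortholog step numbers, each with its own user-adjustable threshold, which corresponds to what the advanced prediction mode exposes.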

CONCLUDING REMARKS

ODB provides a platform for searching known operons and the putative operons derived from them, and for predicting operons with high accuracy validated by literature-based operon data. It includes about 2000 literature-based operons in over 50 genomes and about 13 000 putative operons in over 200 genomes. In addition, the data provided from KEGG pathway and related resources allow analyses not only within a specific genomic context but also across genomes. Thus, it is the first database of its kind to integrate operon data from a variety of genomes, providing wide-ranging coverage of operons. This integrated system of known literature-based information and genomic data is useful for both bioinformatics researchers and experimental biologists.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

23 in total

1. Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models.

Authors: T Yada; M Nakao; Y Totoki; K Nakai
Journal: Bioinformatics Date: 1999-12 Impact factor: 6.937

2. Prediction of operons in microbial genomes.

Authors: M D Ermolaeva; O White; S L Salzberg
Journal: Nucleic Acids Res Date: 2001-03-01 Impact factor: 16.971

3. A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters.

Authors: H Ogata; W Fujibuchi; S Goto; M Kanehisa
Journal: Nucleic Acids Res Date: 2000-10-15 Impact factor: 16.971

4. A probabilistic learning approach to whole-genome operon prediction.

Authors: M Craven; D Page; J Shavlik; J Bockhorst; J Glasner
Journal: Proc Int Conf Intell Syst Mol Biol Date: 2000

5. Computational identification of operons in microbial genomes.

Authors: Yu Zheng; Joseph D Szustakowski; Lance Fortnow; Richard J Roberts; Simon Kasif
Journal: Genome Res Date: 2002-08 Impact factor: 9.043

6. Conservation of gene co-regulation in prokaryotes and eukaryotes.

Authors: Sarah A Teichmann; M Madan Babu
Journal: Trends Biotechnol Date: 2002-10 Impact factor: 19.536

7. A global analysis of Caenorhabditis elegans operons.

Authors: Thomas Blumenthal; Donald Evans; Christopher D Link; Alessandro Guffanti; Daniel Lawson; Jean Thierry-Mieg; Danielle Thierry-Mieg; Wei Lu Chiu; Kyle Duke; Moni Kiraly; Stuart K Kim
Journal: Nature Date: 2002-06-20 Impact factor: 49.962

8. Co-expression pattern from DNA microarray experiments as a tool for operon prediction.

Authors: Chiara Sabatti; Lars Rohlin; Min-Kyu Oh; James C Liao
Journal: Nucleic Acids Res Date: 2002-07-01 Impact factor: 16.971

9. Coexpression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes.

Authors: Martin J Lercher; Thomas Blumenthal; Laurence D Hurst
Journal: Genome Res Date: 2003-02 Impact factor: 9.043

10. A novel method for accurate operon predictions in all sequenced prokaryotes.

Authors: Morgan N Price; Katherine H Huang; Eric J Alm; Adam P Arkin
Journal: Nucleic Acids Res Date: 2005-02-08 Impact factor: 16.971
