OGEE: an online gene essentiality database.

Wei-Hua Chen, Pablo Minguez, Martin J Lercher, Peer Bork.

Abstract

OGEE is an Online GEne Essentiality database. Its main purpose is to enhance our understanding of the essentiality of genes. This is achieved by collecting not only experimentally tested essential and non-essential genes, but also associated gene features such as expression profiles, duplication status, conservation across species, evolutionary origins and involvement in embryonic development. We focus on large-scale experiments and complement our data with text-mining results. Genes are organized into data sets according to their sources. Genes with variable essentiality status across data sets are tagged as conditionally essential, highlighting the complex interplay between gene functions and environments. Linked tools allow the user to compare gene essentiality among different gene groups, or compare features of essential genes to non-essential genes, and visualize the results. OGEE is freely available at http://ogeedb.embl.de.

Year: 2011 PMID: 22075992 PMCID: PMC3245054 DOI: 10.1093/nar/gkr986

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Large-scale efforts to link genotypes to phenotypes are among the most important and challenging tasks in the post-omics era. Essential genes, whose removal results in inviability or infertility, are of particular interest because of their theoretical and practical applications, for example, in studying the robustness of a biological system (1), defining a minimal set of genes for a free-living organism (2) and identifying effective drug targets (3). Essentiality often depends on the environment (4), especially for bacterial genes, or for eukaryotic genes that were tested in cell lines. For example, genes coding for proteins involved in the biosynthesis of amino acids, nucleic acids and vitamins are essential for cell survival in minimal media, but not in rich media where the corresponding metabolites are supplied (4). However, the concept of ‘conditional essentiality’ has so far not been widely adopted by existing essential gene databases. Gene essentiality depends not only on individual gene functions, but also on global factors. Duplicated genes are typically less essential than the genomic average because they often overlap in gene function and expression profile; genes forming hubs in PPI networks (those connected to many direct neighbors) are more often essential (5); and genes involved in development and tissue differentiation in higher eukaryotes are also more likely to be essential (6). However, given the complex nature of biological systems, gene essentiality is often affected by multiple factors simultaneously; studying one factor at a time may generate conflicting results among species. For example, in a biased data set, mouse duplicates and singletons were reported to be equally essential (7), which disagreed with theoretical expectations and experimental findings in yeast (8). Experimental biases could only partially explain the contradiction (6).
In a previous study, we showed that considering both the duplication status of genes and their evolutionary origins could resolve the discrepancies (Chen, W.-H., Trachana, K., Lercher, M.J., and Bork, P., unpublished data). Our understanding of gene essentiality is still limited. Progress can be enhanced by collecting the following information into a central database: (i) tested essential and non-essential genes, allowing comparisons between the two groups; (ii) essentiality information obtained from large-scale studies, facilitating genome-wide analyses, as well as more precise information from small-scale studies, more suited for gene-centered biological research; (iii) additional gene features that are either known or hypothesized to influence gene essentiality. Ideally, such a database should come with a set of tools that allow the user to systematically explore and analyze the raw data. Existing essential gene databases either include data for only a specific species (9) or contain only essential genes (10). This provided the motivation to develop OGEE, an online gene essentiality database that combines points (i)–(iii) with a set of tools for large-scale data analysis. This should make OGEE useful to both biologists and bioinformaticians.

DATA GENERATION

Collection and organization of genes tested for essentiality

We collected 91 436 protein-coding genes from 8 eukaryotic and 16 prokaryotic organisms tested for essentiality in genome-wide studies (2,3,9,11–37). For data sets in which both essential and non-essential genes are publicly available, the genomic proportion of essential genes (PE) ranges from ∼2% [data set 347 (11) of Drosophila melanogaster] to 66.04% [Aspergillus fumigatus Af293, data set 361 (3)] in eukaryotes and from 5.46% [Bacillus subtilis 168, data set 352 (17)] to 80% [Mycoplasma genitalium G37, data set 357 (2)] in prokaryotes. Overall PE in eukaryotes appears to be most strongly influenced by organism complexity and by the methods employed for testing, in particular by the experimental conditions surveyed. Gene knockout techniques [data sets 349 (14) and 350 (37) of Mus musculus and Saccharomyces cerevisiae, respectively] generate higher PE than siRNA-based methods [data sets 348 (12) and 347 (11) of Homo sapiens and D. melanogaster, respectively]. Multi-cellular organisms have higher PE than single-celled eukaryotes (M. musculus versus S. cerevisiae) if similar techniques were used. Experiments in cell lines generate lower PE than in vivo experiments if the same multi-cellular organism is used [data sets 347 (11) and 363 (25) of D. melanogaster]. In prokaryotes, overall PE is affected by details of the survey technology as well as by genome size and life style (free living versus parasitic). In addition to the collection of large-scale data, we also employed text-mining to obtain 3543 genes from 38 species that were tested in small-scale studies. We applied a customized text-mining pipeline based on the one used for data collection by the STRING database (38). We searched for a set of terms related to essentiality (Supplementary Table S1) in PubMed abstracts (as of February 2011), manually checked the results and removed some false positives. We divided the identified genes into essential and non-essential genes according to their associated terms.
Due to a strong reporting bias, most genes identified in this way were essential. Among those, 3168 (89.4% of 3543) genes overlapped with those tested in genome-wide studies. Please note that although substantial efforts have been made to improve the quality of the text-mining data, there might still be a significant fraction of false-positive results; please use these data with caution. We organized genes in each organism into distinct data sets according to the data source; a gene can have multiple entries within a data set or in different data sets. Two entries of a gene would be included in two distinct data sets if the gene was tested in a large-scale study as well as in a small-scale study; if a gene was tested by several small-scale studies, multiple entries of this gene would be included in the text-mining data set, with each entry corresponding to a distinct PubMed record. A gene was marked as ‘conditionally essential’ if multiple entries for this gene exist in OGEE and its essentiality status varies among them (see, e.g. the essentiality status of gene ‘FBgn0001112’ in Figure 1 and the supporting evidence in Figure 2).
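The tagging rule for ‘conditionally essential’ genes can be illustrated with a minimal Python sketch. The tuple layout, the function name and the gene and data set identifiers below are hypothetical and chosen for illustration only; they do not reflect the actual OGEE schema or pipeline.

```python
from collections import defaultdict

def essentiality_status(entries):
    """Assign one overall status per gene from all of its entries.

    entries: iterable of (gene_id, dataset_id, is_essential) tuples.
    A gene may have several entries, within one data set or across
    data sets. (Hypothetical layout, not the OGEE schema.)
    """
    calls = defaultdict(set)
    for gene_id, _dataset_id, is_essential in entries:
        calls[gene_id].add(bool(is_essential))

    status = {}
    for gene_id, observed in calls.items():
        if len(observed) > 1:
            # Both essential and non-essential calls were recorded.
            status[gene_id] = "conditionally essential"
        elif True in observed:
            status[gene_id] = "essential"
        else:
            status[gene_id] = "non-essential"
    return status

# Made-up entries for three genes.
entries = [
    ("geneA", "large_scale_1", True),
    ("geneA", "text_mining", False),
    ("geneB", "large_scale_1", True),
    ("geneB", "text_mining", True),
    ("geneC", "large_scale_1", False),
]
status = essentiality_status(entries)
```

In this sketch, a gene with a single entry can never be tagged as conditionally essential, which matches the requirement that multiple entries exist.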

Figure 1.

Interface of the ‘Browse’ module.

Figure 2.

Extra gene features shown in a popup window. This window will show up when clicking locus IDs in the ‘Browse’ or ‘Search’ modules.

Collection of gene features influencing gene essentiality

We collected several gene features that are known to influence gene essentiality, including duplication status, connectivity in protein–protein interaction (PPI) networks (defined as the number of direct neighbors) (5) and evolutionary origins of genes (defined as the age of the evolutionarily most distant species group where homologs can be found (39); see the web Q&As for more details). We also collected several extra features that might influence gene essentiality, including the number of homologous genes (family size) in the same genome, and the earliest expression stage during embryonic development [for multi-cellular organisms only; data were obtained from the NCBI UniGene database (40)]. It is known that duplicates are often less essential than singletons. This may be due to a range of factors, including the ability of duplicates to provide a functional backup for each other, lower expression abundances of duplicates (41,42), or a lower duplicability of the genes in certain important functional classes (43). It is thus conceivable that duplicates in large gene families are even less likely to be essential than duplicates in smaller families. In multicellular organisms, embryonic development is a tightly regulated chain of events. Disruption of genes expressed earlier may affect all subsequent events, thereby causing more severe phenotypes in the host. Both gene family size and earliest expression in development are indeed correlated with PE in mouse (Figure 3A and B).
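As an illustration of the connectivity feature, the following Python sketch counts the direct neighbors of each gene from a list of pairwise interactions. The input format and gene names are hypothetical, and ignoring self-interactions and duplicate pairs is an assumption made here, not a documented OGEE rule.

```python
from collections import defaultdict

def connectivity(interactions):
    """Return {gene: number of distinct direct neighbors}.

    interactions: iterable of (gene_a, gene_b) pairs, treated as
    undirected. Self-interactions are skipped and repeated pairs are
    counted once (assumptions of this sketch).
    """
    neighbors = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {gene: len(partners) for gene, partners in neighbors.items()}

# Made-up interaction list; ("g3", "g1") repeats ("g1", "g3").
interactions = [
    ("g1", "g2"),
    ("g1", "g3"),
    ("g1", "g4"),
    ("g2", "g3"),
    ("g3", "g1"),
]
degree = connectivity(interactions)
```

Genes that appear in no interaction are absent from the result; a caller would need to decide whether to treat them as having connectivity 0 or as lacking PPI data.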

Figure 3.

Screen shots taken from the ‘Analyze’ module. With integrated tools, the user can easily explore and analyze the collected data, including the visualization of results. Shown here are the results of the following analyses: (A) the proportion of essential genes (PE) as a function of family size (number of homologous genes within the genome) in mouse, (B) PE as a function of the earliest expression stage during mouse development, (C) the effects of gene duplication status and involvement in development on gene essentiality in Caenorhabditis elegans and (D) the effects of gene connectivity and involvement in development on gene essentiality in C. elegans.

USAGE OF OGEE

The functionalities of OGEE have been divided into six different modules (tabs): ‘Summary’, ‘Browse’, ‘Search’, ‘Analyze’, ‘Download’ and ‘Q&As’. We provide inline help messages displayed as ‘tooltips’ within each module; we also provide detailed help contents and answers to frequently asked questions in ‘Q&As’. Below, we introduce several of the most interesting features of OGEE.

Viewing details of individual genes

In the ‘Browse’ and ‘Search’ modules, by default only some gene features such as essentiality, duplication status and data sources are displayed (Figure 1). To view more details of individual genes, the user can simply mouse over or click the locus names; a popup window containing all available information for the corresponding gene will appear. As shown in Figure 2, extra information including gene description, type of evidence for gene essentiality and corresponding links to original data sources, involvement in development, evolutionary origin (phyletic age), connectivity in the PPI network, as well as nucleotide and protein sequences is available. Links to other databases, including Gene Ontology (44), EGGNOG2 (45), NCBI taxonomy, as well as NCBI BLAST (40) are also integrated (Figure 2). For example, if the gene of interest is involved in development, several corresponding GO IDs and terms will be shown; clicking a GO ID redirects the user to the corresponding page at the Gene Ontology website. Similarly, clicking on the organism name redirects the user to the corresponding NCBI taxonomy page, and clicking on the NCBI BLAST links opens the NCBI BLAST website in a new window. The popup window also features in-site data integration. For example, if a query gene has orthologs in other species collected by OGEE, not only the corresponding orthologs [based on EGGNOG2 (45)], but also their essentiality status will be shown (Figure 2). This way, the conservation of a gene as well as the conservation of its essentiality across species can be checked easily.

Analyzing collected gene features using linked tools

One of the most interesting features of OGEE is that users can analyze the data systematically and visualize the results with integrated tools from the ‘Analyze’ module. With ‘Analyze’, the user can divide genes into distinct groups according to one of the available features, calculate the proportion of essential genes (PE) in each group and then plot the results as either a bar chart or a line plot. To illustrate this feature, Figures 3A and B show average mouse PE values as functions of gene family size and the earliest expression stage during development, respectively; both factors affect PE values globally. Users can also investigate two gene features simultaneously to study their effects individually or in combination. For example, the user can divide genes first into developmental and non-developmental genes, and then further divide each group into duplicates and singletons (Figure 3C). Similarly, one could first divide genes according to their connectivity in the PPI network and then according to their involvement in development (Figure 3D). By default, predefined breaks by which genes can be divided into distinct groups and matching labels are used. However, if desired, the user can change the default settings by providing customized breaks and labels.
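The grouping step described above can be sketched in Python as follows. This is not the code behind the ‘Analyze’ module: the function name, the convention that each break is an inclusive upper bound, and the example values are all assumptions made for illustration.

```python
from bisect import bisect_left

def pe_by_group(genes, breaks, labels):
    """Return {label: PE} for genes binned on one numeric feature.

    genes  : iterable of (feature_value, is_essential) pairs
    breaks : ascending upper bounds; a value v falls into the first
             bin whose break is >= v, or into the last bin otherwise
    labels : one label per bin, so len(labels) == len(breaks) + 1
    """
    if len(labels) != len(breaks) + 1:
        raise ValueError("need exactly one more label than breaks")
    essential = [0] * len(labels)
    total = [0] * len(labels)
    for value, is_essential in genes:
        i = bisect_left(breaks, value)
        total[i] += 1
        essential[i] += bool(is_essential)
    # PE is undefined (None) for groups without any tested gene.
    return {
        label: (essential[i] / total[i] if total[i] else None)
        for i, label in enumerate(labels)
    }

# Made-up (family size, is_essential) pairs, not OGEE data.
genes = [
    (1, True), (1, True), (1, False),
    (3, True), (4, False), (5, False),
    (8, False), (12, False),
]
pe = pe_by_group(genes, breaks=[1, 5], labels=["1", "2-5", ">5"])
```

To study two features in combination, one possible approach is to split the genes on the first feature (for example developmental versus non-developmental) and call the same function on each subset with breaks for the second feature.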

Open access to all data contained in OGEE

Our data are freely accessible to all academic users. We provide an SQL-dump file of the whole database as well as several selected data sections as tab-delimited flat files in the ‘Download’ module. Users can also download individual gene essentiality data sets for a selected species in ‘Browse’ and raw data used in data analysis in ‘Analyze’.
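As a hedged illustration of how a downloaded tab-delimited file might be processed, the Python sketch below computes PE per data set, that is, the number of essential genes divided by the number of tested genes. The column names (`dataset_id`, `locus`, `essential`), the `E`/`NE` codes and the sample rows are hypothetical; the actual layout of the OGEE flat files should be checked in the ‘Download’ module.

```python
import csv
import io
from collections import Counter

def pe_per_dataset(handle):
    """Return {dataset_id: PE} from a tab-delimited file handle.

    Expects a header row with the hypothetical columns 'dataset_id'
    and 'essential', where 'E' marks an essential gene.
    """
    tested = Counter()
    essential = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        tested[row["dataset_id"]] += 1
        if row["essential"] == "E":
            essential[row["dataset_id"]] += 1
    return {ds: essential[ds] / tested[ds] for ds in tested}

# Made-up file content standing in for a downloaded flat file.
sample = (
    "dataset_id\tlocus\tessential\n"
    "ds1\tgene1\tE\n"
    "ds1\tgene2\tNE\n"
    "ds1\tgene3\tNE\n"
    "ds1\tgene4\tNE\n"
    "ds2\tgene1\tE\n"
    "ds2\tgene5\tNE\n"
)
pe = pe_per_dataset(io.StringIO(sample))
```

For a real file, the `io.StringIO` wrapper would be replaced by `open(path, newline="")`, with the column names adjusted to the downloaded file.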

CONCLUSIONS

OGEE introduces several novel features compared with existing gene essentiality databases. For example, (i) OGEE provides both essential and non-essential genes from large-scale as well as small-scale studies; (ii) OGEE introduces ‘conditional essentiality’ to reflect the complexity of biological systems and the interplay between gene functions and environments; (iii) OGEE lists a variety of gene features known or suspected to influence gene essentiality; and (iv) OGEE provides a set of online tools to explore and analyze the data and to visualize the results. We thus believe that OGEE should be highly useful to biologists and bioinformaticians studying gene essentiality, whether focusing on individual genes or on genome-wide analyses.

FUTURE DIRECTIONS

Future development of OGEE will include the incorporation of essential non-coding genes, and the possibility for users to submit additional essentiality data.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Table S1. Key words used to search for essential and non-essential genes in PubMed abstracts.

FUNDING

Funding for open access charge: BMBF (Bundesministerium für Bildung und Forschung) MedSys grant #0315450C to Peer Bork. Conflict of interest statement. None declared.

45 in total

1. Global transposon mutagenesis and essential gene analysis of Helicobacter pylori.

Authors: Nina R Salama; Benjamin Shepherd; Stanley Falkow
Journal: J Bacteriol Date: 2004-12 Impact factor: 3.490

2. Higher duplicability of less important genes in yeast genomes.

Authors: Xionglei He; Jianzhi Zhang
Journal: Mol Biol Evol Date: 2005-09-08 Impact factor: 16.240

3. Elucidation of essential and nonessential genes in the Haemophilus influenzae Rd cell wall biosynthetic pathway by targeted gene disruption.

Authors: Catherine M Trepod; John E Mott
Journal: Antimicrob Agents Chemother Date: 2005-02 Impact factor: 5.191

4. Cell size and nucleoid organization of engineered Escherichia coli cells with a reduced genome.

Authors: Masayuki Hashimoto; Toshiharu Ichimura; Hiroshi Mizoguchi; Kimie Tanaka; Kazuyuki Fujimitsu; Kenji Keyamura; Tomotake Ote; Takehiro Yamakawa; Yukiko Yamazaki; Hideo Mori; Tsutomu Katayama; Jun-ichi Kato
Journal: Mol Microbiol Date: 2005-01 Impact factor: 3.501

5. A comprehensive transposon mutant library of Francisella novicida, a bioweapon surrogate.

Authors: Larry A Gallagher; Elizabeth Ramage; Michael A Jacobs; Rajinder Kaul; Mitchell Brittnacher; Colin Manoil
Journal: Proc Natl Acad Sci U S A Date: 2007-01-10 Impact factor: 11.205

6. Essential genes of a minimal bacterium.

Authors: John I Glass; Nacyra Assad-Garcia; Nina Alperovich; Shibu Yooseph; Matthew R Lewis; Mahir Maruf; Clyde A Hutchison; Hamilton O Smith; J Craig Venter
Journal: Proc Natl Acad Sci U S A Date: 2006-01-03 Impact factor: 11.205

7. Identification of 315 genes essential for early zebrafish development.

Authors: Adam Amsterdam; Robert M Nissen; Zhaoxia Sun; Eric C Swindell; Sarah Farrington; Nancy Hopkins
Journal: Proc Natl Acad Sci U S A Date: 2004-07-15 Impact factor: 11.205

8. Large-scale identification of essential Salmonella genes by trapping lethal insertions.

Authors: Karin Knuth; Heide Niesalla; Christoph J Hueck; Thilo M Fuchs
Journal: Mol Microbiol Date: 2004-03 Impact factor: 3.501

9. An ordered, nonredundant library of Pseudomonas aeruginosa strain PA14 transposon insertion mutants.

Authors: Nicole T Liberati; Jonathan M Urbach; Sachiko Miyata; Daniel G Lee; Eliana Drenkard; Gang Wu; Jacinto Villanueva; Tao Wei; Frederick M Ausubel
Journal: Proc Natl Acad Sci U S A Date: 2006-02-13 Impact factor: 11.205

10. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection.

Authors: Tomoya Baba; Takeshi Ara; Miki Hasegawa; Yuki Takai; Yoshiko Okumura; Miki Baba; Kirill A Datsenko; Masaru Tomita; Barry L Wanner; Hirotada Mori
Journal: Mol Syst Biol Date: 2006-02-21 Impact factor: 11.429
