New insights on human essential genes based on integrated analysis and the construction of the HEGIAP web-based platform.

Hebing Chen¹, Zhuo Zhang¹, Shuai Jiang¹, Ruijiang Li¹, Wanying Li¹, Chenghui Zhao¹, Hao Hong¹, Xin Huang¹, Hao Li¹, Xiaochen Bo¹.

Abstract

Essential genes are those whose loss of function compromises organism viability or results in profound loss of fitness. Recent gene-editing technologies have provided new opportunities to characterize essential genes. Here, we present an integrated analysis that comprehensively and systematically elucidates the genetic and regulatory characteristics of human essential genes. First, we found that essential genes act as 'hubs' in protein-protein interaction networks, chromatin structure and epigenetic modification. Second, essential genes represent conserved biological processes across species, although gene essentiality changes differently among species. Third, essential genes are important for cell development due to their discriminate transcription activity in embryo development and oncogenesis. In addition, we developed an interactive web server, the Human Essential Genes Interactive Analysis Platform (http://sysomics.com/HEGIAP/), which integrates abundant analytical tools to enable global, multidimensional interpretation of gene essentiality. Our study provides new insights that improve the understanding of human essential genes.

Keywords: cell development; human essential genes; integrated analysis; multi-omic; web server

Year: 2020; PMID: 31504171; PMCID: PMC7373178; DOI: 10.1093/bib/bbz072

Source DB: PubMed; Journal: Brief Bioinform; ISSN: 1467-5463; Impact factor: 11.622

Introduction

Essential genes are indispensable for organism survival and the maintenance of basic cell and tissue functions [1-3]. The systematic identification of essential genes in different organisms [4] has provided critical insights into the molecular bases of many biological processes [5]. Such information may be useful for applications in areas such as synthetic biology [6] and drug target identification [7, 8]. The identification of human essential genes is a particularly attractive area of research because of the potential for medical applications [9, 10]. Utilizing gene-editing technologies based on CRISPR-Cas9 and retroviral gene-trap screens, three independent genome-wide studies [11-13] identified essential genes that are indispensable for human cell viability. The results agreed very well among the three studies, confirming the robustness of the evaluation approaches. All of these studies [11-13] showed that ∼10% of the ∼20 000 genes in human cells are essential for cell survival, highlighting the intrinsic buffering mechanisms of eukaryotic genomes against genetic and environmental insults [14]. In a recent review, Pavelka et al. [15] stated that gene essentiality is not a fixed property but instead depends strongly on environmental and genetic contexts and can be altered during short- or long-term evolution. However, this paradigm leaves some questions unresolved, as we do not know how essential genes are interconnected within cells, why these genes are essential or what their underlying mechanisms may be. Furthermore, we do not know whether these genes are associated with disease (e.g. cancer) or have the potential to be exploited as targets for therapeutic strategies. With the recent and rapid development of next-generation sequencing and other experimental technologies, we now have access to myriad data on genomic sequences, epigenetic modifications, structures and disease-related information. 
These data will enable researchers to examine human essential genes from multiple perspectives. In light of this background, we performed a comprehensive study of human essential genes, including their genomic, epigenetic, proteomic, evolutionary and embryonic patterning characteristics. Genetic and regulatory characteristics were studied to understand what makes these genes essential for cell survival. We analyzed the evolutionary status of human essential genes and their profiles during embryonic development. Our findings suggest that human essential genes are important for lineage segregation. Essential genes have important implications for drug discovery, which may inform the next generation of cancer therapeutics (Figure 1). Finally, we developed a new web server, the Human Essential Genes Interactive Analysis Platform (HEGIAP), to facilitate the global research community’s comprehensive exploration of human essential genes.

Figure 1

Comprehensive overview of the integrated analysis of human essential genes. In this study, we performed a systematic analysis of human essential genes. By integrating multi-omics (genome, epigenome and proteome) data, we characterized the evolutionary nature of essential genes and provided new insights into embryonic development and tumorigenesis.

Results

Multi-level essentiality of human essential genes

Essential genes define the key biological functions that are required for cell growth, proliferation and survival. To characterize human essential genes, we first compared three essential gene sets generated by different experimental methods [11-13]; more than 60% of essential genes were cataloged in at least two data sets (Venn diagram in Figure 1). Here, we focused on the essential genes detected by Wang et al. [11], who defined a CRISPR score (CS) for the assessment of gene essentiality. Briefly, low CSs indicate high degrees of essentiality and vice versa. Here, we used data from the KBM7 cell line for further analysis. We first evaluated the cell-line specificity of CSs to ensure the robustness of our analysis in terms of reflecting the general properties of gene essentiality in different cell types. First, according to Wang et al. [11], a predominant part (98.99%) of essential genes in the KBM7 cell line were not cell-line specific; they identified only 19 such genes. Essential genes in the KBM7 cell line also represent more than two-thirds of the essential genes identified in the other two studies. Using these data, we calculated Pearson coefficients of CS correlations between cell lines for all screened genes to assess the similarity of essentiality between cell lines. The essentialities of all screened genes in multiple cell lines showed high degrees of correlation (Pearson coefficient: maximum, 0.83; minimum, 0.79). In addition, the results of our analyses using CSs for different cell lines and using multi-omics data from different cell types were robust, demonstrating the reliability of the data sets used. We divided all protein-coding genes into 10 groups (CS0–CS9) according to ascending CS values, where group CS0 was composed of essential genes. This detailed classification provided a good representation of the various features in subsequent analyses.
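The grouping and cross-cell-line comparison described above can be illustrated with a minimal sketch. It assumes equal-sized rank bins for CS0–CS9 (the exact cutoffs used in the study are not stated here) and uses made-up scores for two hypothetical cell lines; the function names are illustrative, not part of any published pipeline.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def assign_cs_groups(cs_by_gene, n_groups=10):
    """Rank genes by ascending CS and split them into equal-sized groups.

    Returns {gene: "CS0".."CS9"}; CS0 holds the lowest scores (most essential).
    Equal-sized bins are an assumption made for this sketch.
    """
    ranked = sorted(cs_by_gene, key=cs_by_gene.get)
    n = len(ranked)
    return {
        gene: f"CS{min(i * n_groups // n, n_groups - 1)}"
        for i, gene in enumerate(ranked)
    }

# Toy CRISPR scores for two hypothetical cell lines (not real data).
line_a = {f"G{i}": -3.0 + 0.15 * i for i in range(20)}
line_b = {gene: 0.9 * score + 0.1 for gene, score in line_a.items()}

groups = assign_cs_groups(line_a)
genes = sorted(line_a)
r = pearson([line_a[g] for g in genes], [line_b[g] for g in genes])
print(groups["G0"], groups["G19"], round(r, 3))
```

With real screening data, the same correlation would be computed for each pair of cell lines over all screened genes.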

Protein essentiality: high transcription activity and stability, 'hub' of the PPI network

We first analyzed the expression levels of human genes in 2916 individuals from the Genotype-Tissue Expression (GTEx) Program [16]. Essential genes were highly expressed compared with nonessential genes (P < 1 × 10⁻⁵⁰, Welch’s t-test; Cohen’s d = 8.76), and the expression level decreased as gene essentiality decreased (i.e. as the CS value increased; r² = 0.42, P = 0.04; Figure 2A). Furthermore, an analysis using publicly available human protein stability data [17] showed that proteins encoded by essential genes were more stable than other proteins (P = 1.10 × 10⁻¹⁸, Welch’s t-test; Supplementary Figure S1A). This observation was consistent with the findings of a recent study [18], which showed that highly expressed proteins are stable because they are designed to tolerate translational errors that would lead to the accumulation of toxic misfolded species.
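For readers unfamiliar with the two statistics reported above, the following sketch computes Welch's t statistic (with its Welch–Satterthwaite degrees of freedom) and Cohen's d with a pooled standard deviation. The expression values are invented for illustration, and the pooled-SD form of Cohen's d is an assumption; the study may have used a different variant. A P-value would additionally require the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    return (mean(a) - mean(b)) / pooled

# Toy log-expression values (not real data).
essential = [8.0, 9.0, 10.0, 11.0]
nonessential = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

t_stat, dof = welch_t(essential, nonessential)
d = cohens_d(essential, nonessential)
print(round(t_stat, 3), round(dof, 2), round(d, 3))
```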

Figure 2

General properties of human essential genes. (A) Violin plot showing gene expression for 2916 individuals from the GTEx program. The mean expression level was calculated for each group of genes (CS0–CS9). (B) Degree of connectivity for each gene group. Inset, relationship between the degree of connectivity and CS value in group CS0 (human essential genes). (C) Relationship between gene essentiality and gene length. R² and P-values from linear regression are shown. (D) Relationship between gene length and number of transcript types. R² and P-values from linear regression are shown. (E) Heatmap showing the colocalization of gene TSSs. (F) TSS density surrounding TAD boundary (IMR90 cell line). Inset, average TSS densities within regions (50 kb upstream, TAD boundary and 50 kb downstream). (G–I) Profiles showing mean signals for chromatin accessibility (G), methylation level (H) and H3K4me3 density (I).

In many model organisms, essential genes tend to encode abundant proteins that engage extensively in protein–protein interactions (PPIs) [19]. We constructed a PPI network for each CS group (Supplementary Figure S1B). Essential genes showed significantly more connectivity than did other genes (P = 2.73 × 10⁻²⁶⁶, Welch’s t-test; Figure 2B), and the degree of connectivity was correlated negatively with the CS in the essential gene set (r = −0.29, P = 1.22 × 10⁻³⁷; inset in Figure 2B). We then calculated the distribution of genes over the range of connectivity. As the degree of connectivity increased, the number of genes decreased (Supplementary Figure S2A) but the proportion of essential genes increased (Supplementary Figure S2B). As a recent study showed that long noncoding RNAs (lncRNAs) are important players in regulatory networks [20], we investigated genes whose expression levels were altered significantly following CRISPRi-mediated knockdown of lncRNAs, identified in a previous study [21]. We found a significantly greater proportion of essential genes among differentially expressed genes (DEGs) than among genes that showed no significant alteration in expression (Supplementary Figure S2C). This result suggests that essential genes are more likely to be regulated by lncRNAs. Finally, we performed a gene ontology (GO) analysis for each group. Essential genes were enriched in fundamental biological processes, such as rRNA processing, translational initiation, mRNA splicing and DNA replication, whereas nonessential genes were less significantly enriched in other processes (Supplementary Figure S3). In summary, essential genes are highly expressed and associated with important biological processes. Proteins encoded by essential genes are stable and located at connection hubs in PPI networks. Taken together, these results show the essentiality of essential genes at the protein level.
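The notion of "degree of connectivity" used above can be made concrete with a small sketch: count the distinct interaction partners of each gene in an undirected edge list, then average the degree within each CS group. The edge list, gene names and group labels below are hypothetical, and treating genes absent from the network as degree 0 is an assumption of this sketch.

```python
from collections import defaultdict

def node_degrees(edges):
    """Count distinct interaction partners per gene in an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {gene: len(partners) for gene, partners in neighbors.items()}

def mean_degree_by_group(degrees, group_of):
    """Average degree per group; genes missing from the network count as 0."""
    totals, counts = defaultdict(int), defaultdict(int)
    for gene, group in group_of.items():
        totals[group] += degrees.get(gene, 0)
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in counts}

# Toy PPI edges and CS group labels (not real data); ("B", "A") duplicates ("A", "B").
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E"), ("B", "A")]
group_of = {"A": "CS0", "B": "CS0", "C": "CS5", "D": "CS5", "E": "CS9", "F": "CS9"}

degrees = node_degrees(edges)
means = mean_degree_by_group(degrees, group_of)
print(means)
```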

Structural essentiality: high density in the genome and 3D structure lead to a 'hub' of chromatin organization

In general, gene length affects the stability of the kinetics of genetic switches and thus the dynamics of gene expression [22]. We found that human essential genes were much shorter than nonessential genes (P = 1.38 × 10⁻⁵³, Welch’s t-test; Figure 2C), consistent with the results of a previous study of Escherichia coli [22]. Generally, long genes were likely to contain more diverse transcripts in the human genome (r = 0.34, P = 1.0 × 10⁻¹⁰⁰, Pearson correlation; Supplementary Figure S4A). Therefore, fewer types of transcripts were expected in essential genes, as they were short. However, a greater variety of transcripts was found in essential genes compared with nonessential genes (P = 7.02 × 10⁻⁴², Welch’s t-test; Figure 2D), and the number of transcript types decreased as essentiality decreased (r² = 0.90, P = 2.70 × 10⁻⁵, Pearson correlation; inset in Figure 2D), indicating that mRNAs transcribed from essential genes were highly variable. GC content is associated with DNA stability, and variations in GC content within the genome result in variations in staining intensity in chromosomes [19]. We thus examined the distribution of GC content. Our results showed that genes with moderate to high degrees of essentiality (CS0–CS4) tended to have a slightly higher GC content than did other genes (CS5–CS9) in both the promoter regions and gene bodies (Supplementary Figure S4B). Furthermore, as Alu repetitive elements in the genome have been found to be regulators of gene expression [23-26] and as the Alu repetitive component may contribute to the prevention of DNA damage [27], we investigated repetitive elements within essential genes. DNA sequences for Alu elements were first identified and masked using RepeatMasker [28]. We found that Alu repetitive elements were significantly enriched in essential genes compared with nonessential genes (P < 1 × 10⁻⁵⁰, Welch’s t-test; Supplementary Figure S4C). 
Together, these results indicate that essential genes have high GC content and colocalize with Alu repetitive elements, suggesting high DNA stability of essential genes. We next examined the genomic distribution of essential genes and found that the transcription start sites (TSSs) of these genes tended to cluster (Figure 2E; Supplementary Figure S5A). We then investigated the three-dimensional (3D) structural organization of essential genes. Previous studies have shown that topologically associating domains (TADs) are highly conserved between cell types and species and that proximity to the TAD boundary likely contributes to the stabilization of gene expression [29-33]. We detected TADs using high-throughput/high-resolution chromosome conformation capture (Hi-C) data from Jin et al. [34] and Rao et al. [35]. We observed a significantly greater density of essential than nonessential genes within TAD boundaries (P < 1 × 10⁻¹⁶, Welch’s t-test; Figure 2F; Supplementary Figure S5B–E). As the mutual affinity of chromatin-associated proteins may result in chromatin loops [36], we calculated intra-TAD local (<100 kb) Hi-C contacts for each gene set and examined the distribution of gene TSSs in chromatin loop anchors. Essential genes contained more local contacts (P < 1 × 10⁻⁴³, Welch’s t-test; Supplementary Figure S6A and B), and the TSS density of essential genes located at chromatin loop anchors was greater than that of nonessential gene groups (Supplementary Figure S6C). As the formation of architectural loops depends strongly on the protein CTCF [37], we examined the distribution of CTCF signals detected by ChIP-seq [38]. As expected, CTCF was more likely to bind near essential genes (Supplementary Figure S6D). We used the recently introduced SPRITE experimental protocol [39] to identify active hubs of inter-chromosomal interactions that are arranged around nuclear speckles. 
We found that essential genes were enriched on these experimentally verified structural hubs (Supplementary Figure S6E). In summary, essential genes are structurally essential due to their high GC content, highly enriched Alu repetitive elements and central location in the chromosomal scaffold.
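The TSS density comparison around TAD boundaries (50 kb upstream, boundary, 50 kb downstream) can be sketched as follows for a single chromosome. Representing each boundary as a half-open interval and reporting TSSs per kb are assumptions of this sketch, and the coordinates are invented.

```python
from bisect import bisect_left

def count_in(sorted_pos, start, end):
    """Number of positions p with start <= p < end."""
    return bisect_left(sorted_pos, end) - bisect_left(sorted_pos, start)

def tss_density(tss_positions, boundaries, flank=50_000):
    """TSSs per kb upstream of, inside and downstream of boundary intervals.

    `boundaries` is a list of (start, end) intervals with end > start,
    all on one chromosome; this interval representation is an assumption.
    """
    pos = sorted(tss_positions)
    counts = {"upstream": 0, "boundary": 0, "downstream": 0}
    lengths = {"upstream": 0, "boundary": 0, "downstream": 0}
    for b_start, b_end in boundaries:
        regions = {
            "upstream": (max(0, b_start - flank), b_start),
            "boundary": (b_start, b_end),
            "downstream": (b_end, b_end + flank),
        }
        for name, (start, end) in regions.items():
            counts[name] += count_in(pos, start, end)
            lengths[name] += end - start
    return {name: 1000 * counts[name] / lengths[name] for name in counts}

# Toy coordinates (not real data): one 40 kb boundary and seven TSSs.
boundaries = [(100_000, 140_000)]
tss = [60_000, 105_000, 110_000, 120_000, 139_999, 150_000, 300_000]

density = tss_density(tss, boundaries)
print(density)
```

Running the same function separately on the TSSs of each CS group would give per-group profiles that could then be compared.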

Epigenetic essentiality: the enrichment of epigenetic marks leads to an epigenetic regulatory network 'hub'

The epigenetic modification of chromatin provides the necessary plasticity for cells to respond to environmental and positional cues and enables the maintenance of acquired information without changing the DNA sequence [40]. To study the epigenetic information on essential genes, we took advantage of recent high-throughput genomic assays [41, 42]. We first examined the chromatin accessibility of essential genes. Strong DNase I hypersensitivity (DHS) signals were observed in the promoters of essential genes, and this enrichment decreased as gene essentiality decreased (Figure 2G). Moreover, more transcription factor binding sites were detected in the promoters of essential genes (Supplementary Figure S7). We then examined the DNA methylation pattern of essential genes. By analyzing data obtained with two sequencing-based methods, methylated DNA immunoprecipitation sequencing (MeDIP-seq) and methylation-sensitive restriction enzyme sequencing (MRE-seq) [43], we found that the methylation levels of gene promoters increased as gene essentiality decreased (MRE-seq: r² = 0.93, P = 8.6 × 10⁻⁶; MeDIP-seq: r² = 0.93, P = 5.8 × 10⁻⁶, Welch’s t-test; Figure 2H; Supplementary Figure S8A). Methylation levels were higher in the gene body regions of essential genes than in other genes (MRE-seq, P = 1.5 × 10⁻⁵; MeDIP-seq, P = 9.6 × 10⁻²⁰), supporting the highly transcribed nature of essential genes [44]. Next, we examined two histone modifications, trimethylation of H3 lysine 4 (H3K4me3) and trimethylation of H3 lysine 27 (H3K27me3), which are associated with transcription activation and gene repression [45-47], respectively. Similar to chromatin accessibility, H3K4me3 signals were strongly enriched in the promoters of essential genes, and H3K4me3 density in gene promoters increased as gene essentiality increased (Figure 2I). In contrast, H3K27me3 signals were weaker in the promoters of essential genes compared with nonessential genes (Supplementary Figure S8B). 
Finally, we studied the abundance of noncoding RNAs (ncRNAs), which play a key role in regulating gene expression [48, 49]. The density of ncRNAs in gene promoters decreased as essentiality decreased (r² = 0.40, P = 0.049; Supplementary Figure S8C). The observations for both histone modification marks show that essential genes tend to be located in the most active chromatin regions. Therefore, essential genes are hubs of epigenetic modification. As these modifications have important regulatory functions, these epigenetic hubs may, in turn, contribute to gene essentiality. For instance, inactive genes with low expression levels in the essential gene group seem to be 'outliers' because most essential genes are highly expressed. We hypothesize that a fraction of these inactive genes are essential, not because they are highly expressed and are direct modulators of critical cellular processes, but because they are located at epigenetic hubs. CRISPR-based screening may negatively affect the integrity of these epigenetic hubs, hampering cell proliferation. Based on the hypothesis that epigenetic hub location contributes to a gene’s high degree of essentiality, we believe that some inactive genes located near epigenetic hubs, for instance in highly accessible chromatin regions, are more likely to be essential, even if they are not actively transcribed like typical essential genes; hereafter, we refer to these genes as inactive epigenetic hub genes. A high degree of essentiality of an inactive epigenetic hub gene may be due predominantly to its location near an epigenetic hub. Therefore, such genes may be more enriched in the CS0 group than among genes with lower degrees of essentiality (CS1–CS9). In the analysis (Materials and methods) performed to test this hypothesis, we found that inactive epigenetic hub genes were more likely to be in the essential gene group than in the nonessential gene groups (Supplementary Figure S8D). 
Thus, disruption of these genes may hamper cell proliferation not through the loss of their expression, but through the disruption of epigenetic hub integrity. This result further supports the role of essential genes as epigenetic hubs. In summary, these results provide epigenetic evidence for the role of essential genes as hubs of active epigenetic modification.

Evolutionary nature of human essential genes

Highly and widely expressed genes have been found to have originated early and to be conserved across species [50]. We next investigated the universal distribution of evolutionary rates of essential genes. Using gene age categories defined in a study [51] based on inferred gene origination times, we showed that, as expected, essential genes on average were older (Figure 3A) and significantly more conserved (P = 9.40 × 10⁻⁹, Welch’s t-test; Supplementary Figure S9) than were nonessential genes. However, a small subset of essential genes was notably young (Figure 3B); we found that the proportion of essential genes among human-specific genes (gene age group 13) was significantly larger than that of other genes (gene age groups 1–11; Figure 3C), indicating that human-specific genes are more likely to be essential in humans. Similar results were obtained for the other three cell types (Supplementary Figure S10). GO analysis of essential genes in the two youngest groups (human and chimpanzee) showed significant enrichment in the regulation of GTPase activity (Supplementary Figure S11A). Enrichment of these young essential genes in immune-related functions may indicate that they comprise a few cell-type-specific essential genes in hematopoietic lineages, which account for a considerable proportion of this group because young essential genes are very few in number. We also found that these youngest essential genes were shorter, less conserved and not as actively expressed as other essential genes (Supplementary Figure S11B). In addition, 'old' genes have been reported to be longer than 'young' genes [50, 52, 53]; however, although most essential genes in this study were 'old', they were shorter than other genes on average (Figure 3D).

Figure 3

Evolutionary nature of human essential genes. (A) Relationship between gene essentiality and gene age. R² and P-values from linear regression are shown. (B) Scatterplot of gene length and evolutionary age. Circle size indicates the number of genes, and color represents the essential gene proportion. (C) Essential gene proportions in all 13 gene age groups. (D) Analysis of age differences between essential and nonessential genes grouped by gene length. Red boxes, age data for essential genes. Blue boxes, age data for nonessential genes. Violet line, mean value. Gene groups containing <100 genes are not shown. (E) Essential gene-associated specific functional modules. The network of specific functional interactions among the 1878 human essential genes was clustered using a graph theory clustering algorithm to elucidate gene modules. Six clusters containing ≥40 genes (H1–H6) were tested for functional enrichment by using genes annotated with GO biological process terms. Representative processes and pathways enriched within each cluster are presented alongside the cluster label. Enriched functions provide a landscape of the potential effects of cellular functions for essential genes. Similar functional processes were shared by essential genes in mouse (four subnet modules) and S. cerevisiae (five subnet modules).

Evolutionary age is defined based on the presence of a homolog in a wide range of species from single-celled organisms to primates [54]. However, the essentiality of a gene can change during the course of evolution [15]. We investigated the essentiality of homologous genes in humans and four other species (mouse, Danio rerio, Drosophila melanogaster and Saccharomyces cerevisiae). 
We observed 329, 66, 15 and 143 essential genes in humans that were also essential in the four other species, respectively, but only 150, 9, 2 and 2 shared genes in random controls, indicating that essential genes were significantly more conserved than were other genes (P < 1 × 10⁻⁵, permutation test), consistent with previous findings [55]. Interestingly, more than half of the genes found to be essential in humans were nonessential in other species and vice versa (Supplementary Figure S12), also consistent with the findings of Hart et al. [55]. Gene essentiality may change in different species because genes or functions could arise separately or be lost or replaced by others during evolution; in this way, the biological network could become more robust [15]. To test this hypothesis, we compared PPI networks among species using essential genes annotated in various species retrieved from the DEG database [4]. We first constructed a PPI network with essential genes for each species using a previously described network topology method [56] and detected subnet modules (densely connected regions that can represent molecular complexes) using the MCODE algorithm [57]. We then performed gene set enrichment analysis [58] for each subnet module. Interestingly, similar biological processes were observed between human and other species, although the essential genes were quite different (Figure 3E); for instance, rRNA processing was enriched in human, mouse and S. cerevisiae, but fewer than 18% of human essential genes were essential in these species (Supplementary Figure S12). Protein localization analysis using existing annotation data [59] also showed that a larger fraction of human essential genes than of human nonessential genes encode proteins located in the cytoplasm, whereas the percentages of proteins located in the membrane and the extracellular space are lower for human essential genes than for human nonessential genes (Supplementary Figure S13). 
These observations showed a similar protein localization propensity in humans and in prokaryotes [60] for essential genes in comparison with nonessential genes and may also support similar functions of essential genes between species. In addition, the percentage of proteins located in the nucleus is higher for human essential genes than for human nonessential genes (Supplementary Figure S13). Our observations suggest that essentialomes are enriched in genes required for essential processes.
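The comparison of observed shared essential genes against random controls can be illustrated with a simple permutation scheme: draw random gene sets of the same size as the human essential set from a background of genes, count how many fall in the other species' essential set, and report how often the random overlap reaches the observed one. The sampling background, the number of permutations and the add-one P-value estimate are assumptions of this sketch, and the gene identifiers are synthetic.

```python
import random

def overlap_permutation_test(essential, essential_in_other, background,
                             n_perm=2000, seed=0):
    """Permutation test for the overlap between two gene sets.

    Returns (observed overlap, mean random overlap, one-sided P-value).
    `background` must contain at least len(essential) genes.
    """
    rng = random.Random(seed)
    essential = set(essential)
    other = set(essential_in_other)
    observed = len(essential & other)
    pool = sorted(background)  # fixed order for reproducible sampling
    hits = 0
    total = 0
    for _ in range(n_perm):
        sample = rng.sample(pool, len(essential))
        k = sum(1 for gene in sample if gene in other)
        total += k
        if k >= observed:
            hits += 1
    p_value = (hits + 1) / (n_perm + 1)  # add-one estimate avoids P = 0
    return observed, total / n_perm, p_value

# Synthetic identifiers (not real data): 1000 genes with orthologs,
# 100 essential in species 1 and 100 essential in species 2, 60 shared.
background = [f"G{i}" for i in range(1000)]
essential_1 = [f"G{i}" for i in range(100)]
essential_2 = [f"G{i}" for i in range(60)] + [f"G{i}" for i in range(500, 540)]

observed, mean_random, p_value = overlap_permutation_test(
    essential_1, essential_2, background
)
print(observed, round(mean_random, 1), p_value)
```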

Transcription activation of essential genes during cell development

Dynamic expression of essential genes in early embryo development

Cell fate decisions contribute fundamentally to the development and homeostasis of complex tissue structures in multicellular organisms. The key to understanding the different fates of apparently identical cells lies in the emergence of transcriptional programs [61]. We next characterized essential genes in mammalian embryonic development. Due to the lack of experimental data from human embryos, we used data from mouse preimplantation embryos. To examine whether genetic and epigenetic features were consistent in human and mouse, we calculated the distributions of gene expression and epigenetic information in human and mouse embryonic development. We observed strong correlations (Supplementary Figure S14), which supported the use of mouse data to study human essential genes. During embryo development, the expression levels of essential genes increased progressively, and two significant increases were observed, in two-cell embryos and in the inner cell mass (ICM), corresponding to zygotic genome activation [62, 63] and the first cell fate decision, respectively. In contrast, the expression levels of nonessential genes were significantly lower than those of essential genes during the entire preimplantation period (P < 1 × 10⁻¹⁰⁰, Welch’s t-test), and genes in groups CS7–CS9, the least essential groups, were silent after the two-cell embryo stage, similar to the maternal mRNA degradation process (Figure 4A). To further understand the dynamic changes in the transcription activity of essential genes during embryo development, we investigated chromatin state dynamics. Accessible chromatin and active histone modifications were highly enriched in the promoters of essential genes compared with those of nonessential genes (Figures 2G and I and 4B and C). In addition, chromatin became progressively more accessible and the H3K4me3 density progressively increased (Figure 4B and C). However, essential genes were the least methylated during the preimplantation period (Figure 4E). 
These observations suggest that the expression of essential genes is required for embryo development and that both chromatin accessibility and epigenetic modifications contribute to the formation of transcriptional programs in essential genes.

Figure 4

Essential genes in mouse embryo development. (A) Expression levels of essential genes (red lines) and other gene groups at each developmental stage. (B) Density of accessible chromatin surrounding TSSs and TESs of genes at each developmental stage. (C) Density of active H3K4me3 modifications surrounding TSSs and TESs of genes at each developmental stage. (D) Dynamics of H3K4me3 density in gene bodies in preimplantation mouse embryos (left) and postimplantation human embryos (right). (E) Profiles of CG methylation surrounding TSSs of genes at each developmental stage. (F) Dynamics of gene expression for essential and other genes at each developmental stage. TE, trophectoderm; ICM, inner cell mass; VE, visceral endoderm; Epi, epiblast; Ect, ectoderm; End, endoderm; Mes, mesoderm; PS, primitive streak.

To gain insight into the potential function of essential genes during embryo preimplantation, we examined the gene expression pattern at each developmental stage. Interestingly, differential patterns of transcription were observed during early lineage specification (Figure 4F). Essential genes were highly expressed in the ICM, which gives rise to the entire fetus, but much less expressed in the trophectoderm (TE), the outer layer of the blastocyst-stage embryo. During the subsequent formation of the primitive endoderm (PE) and epiblast (Epi), essential genes were also more highly expressed in embryonic tissues than in extraembryonic tissues. Finally, during the formation of the three germ layers, essential genes were highly expressed in the ectoderm, which was derived from the anterior epiblast by embryonic day 6.5 (E6.5), but weakly expressed in the primitive streak (PS) and PS-derived mesoderm and endoderm [61]. Compared with essential genes, nonessential genes were weakly expressed and showed no apparent pattern during embryo development. 
Thus, the transcription of essential genes is required for lineage segregation, especially for the development of the fetal-origin part of the placenta.

Essential genes are differentially expressed in cancer and normal tissues

Given the fundamental role played by essential genes, it is unsurprising that they represent current and potential novel targets of many antimicrobial and anticancer compounds [64-66]. To further investigate the therapeutic implications of essential genes, we examined the relationship between essential genes and cancer genes. The identification of cancer genes has varied markedly among studies [67-71]; for instance, more than two-thirds of cancer genes identified in one study were not identified as such in another study (Supplementary Figure S15A). Thus, we compared genes from these studies separately and found that essential genes were significantly enriched among cancer genes relative to all protein-coding genes (Figure 5A). For instance, five well-known cancer genes (BRCA1, BRCA2, MYC, EZH2 and SMARCB1) were essential in terms of human cell survival and were associated with chromatin stability, remodeling and modification.
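An enrichment of essential genes within a cancer gene list can be tested with Fisher's exact test on a 2 × 2 table. The sketch below computes the one-sided (enrichment) P-value as a hypergeometric tail using only integer arithmetic; whether the study used a one- or two-sided test is not stated here, so the one-sided form is an assumption, and the counts are invented.

```python
from math import comb

def fisher_enrichment_p(overlap, n_cancer, n_essential, n_total):
    """One-sided Fisher exact P-value for enrichment.

    Probability of seeing at least `overlap` essential genes when drawing
    `n_cancer` genes without replacement from `n_total` genes, of which
    `n_essential` are essential (hypergeometric upper tail).
    """
    upper = min(n_cancer, n_essential)
    tail = sum(
        comb(n_essential, k) * comb(n_total - n_essential, n_cancer - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(n_total, n_cancer)

# Toy counts (not real data): 20 genes, 5 essential, 4 cancer genes, 3 shared.
p_enriched = fisher_enrichment_p(overlap=3, n_cancer=4, n_essential=5, n_total=20)
p_trivial = fisher_enrichment_p(overlap=0, n_cancer=4, n_essential=5, n_total=20)
print(round(p_enriched, 4), p_trivial)
```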

Figure 5

Relationships between human essential genes and cancer. (A) Proportions of human essential genes in five cancer gene lists and in all human genes. ***P < 0.0001, Fisher exact test. (B) Differential expression of genes in normal and tumor tissues from the TCGA database. Each colored line represents a mean differential expression fold change (among all donors, from normal to cancerous for each donor) by tumor type.

As reported above, essential genes were more strongly expressed than nonessential genes in cancer and normal cells (Supplementary Figure S15B). We next determined whether essential genes were differentially expressed between cancer and normal cells. Twenty-three cancer types were examined using gene expression data from the TCGA project. Interestingly, the expression levels of essential genes were significantly higher in cancer cells than in normal cells (P = 8.5 × 10⁻⁶, paired-sample t-test; Figure 5B), whereas nonessential genes exhibited similar transcription activity in both cancer and normal cells. These results suggest that essential genes are more sensitive to tumorigenesis and may be superior targets for further drug screening and development. To further understand the potential function of essential genes in drug screening, we identified 297 significant DEGs using TCGA data (Materials and methods and Supplementary Table S1). Using the DrugBank database [72], we then identified 135 candidate drugs for these 297 DEGs (Supplementary Table S2). Some of these candidate drugs (e.g. the antineoplastic agents pemetrexed, decitabine, doxorubicin, mitoxantrone and capecitabine) are already well established in drug-targeting strategies for oncology programs. Other candidate drugs (e.g. the anti-infective agents trifluridine and fleroxacin and the antibacterial agents enoxacin, pefloxacin and ciprofloxacin) may also play roles in cancer treatment, although more research and clinical experiments are required.
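The paired-sample comparison between tumor and normal expression can be sketched as a t statistic on per-pair differences, here with one pair per tumor type. The pairing unit and the numbers are assumptions made for illustration; obtaining a P-value would additionally require the t distribution with n − 1 degrees of freedom (for example `scipy.stats.ttest_rel`).

```python
import math
from statistics import mean, stdev

def paired_t(normal, tumor):
    """Paired-sample t statistic and degrees of freedom (tumor minus normal)."""
    diffs = [t - n for n, t in zip(normal, tumor)]
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1

# Toy mean expression of one gene group in four tumor types (not real data).
normal_expr = [5.0, 6.0, 7.0, 8.0]
tumor_expr = [6.0, 7.5, 8.0, 9.5]

t_paired, dof_paired = paired_t(normal_expr, tumor_expr)
print(round(t_paired, 3), dof_paired)
```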

HEGIAP: an interactive web server for the study of essential genes

We developed the interactive web server HEGIAP (http://sysomics.com/HEGIAP/), which integrates abundant analytical and visual tools to provide multi-level interpretation of the essentiality of single genes. HEGIAP provides an overall gene property graph, which shows gene length, number of transcript types and the distributions of exons and introns of each transcript. Boxplots are provided that describe properties of the 10 gene groups (CS0–CS9), including but not limited to gene length, protein length, exon count and counts of repetitive elements near promoter regions. Graphs show the corresponding value and group number for each selected gene. For a selected gene, the web server provides histone modification, methylation and chromatin accessibility profiles and a Hi-C contact map of chromatin structure, all of which have been shown to be correlated significantly with the CS value. Multigene analysis is available for the comprehensive examination of groups of genes (Figure 6).

Figure 6

Integrative analysis of individual genes using HEGIAP. HEGIAP provides different analysis modules to enable a multi-view exploration of gene essentiality measured by the CS.

HEGIAP supports both feature- and gene-oriented analyses. In feature-oriented analysis, users can obtain all of the genes that meet chosen screening thresholds for multiple features. They can examine the CS distribution or any other property, enabling exploration of possible correlations between the CS and other genomic features. In classic gene-oriented analysis, users specify their chosen genes and are provided with a comprehensive view of their essentiality and genomic features. A comparative analysis of two different gene lists is provided to facilitate free exploration of the variation in genomic features between genes of interest or between genes screened by degree of essentiality or any other property. Tools are provided to identify genes with aberrant epigenetic modification levels or genomic features based on their essentiality. To facilitate examination of the difference in essential gene expression level between cancer cells and normal cells, HEGIAP provides a tool for the direct visualization of expression profiles across multiple TCGA tumor types for any group of genes uploaded. Genes are also grouped into essential and nonessential subgroups whose expression profiles are shown for further comparison. HEGIAP identifies genes that are differentially expressed in tumor and normal tissues. It has a drug-screening tool that is based on the assumption that essential genes have predictive power for the identification of candidate drugs for cancer. Users can set a CS threshold and acquire a list of drugs that significantly suppress the expression of cancer-specific highly expressed genes filtered by this threshold. The user-friendly interface of HEGIAP was constructed using R Shiny, and it requires no plug-in installation for users running any popular web browser.

Discussion

Three types of ‘hub’ location of essential genes

Essential genes have been recognized as important in multiple life-science research domains, including those of genetic networks [20, 73, 74], developmental phenotypes [75], evolution [76], cancer therapy [77] and drug discovery [78]. Our work extends previous findings by revealing three ‘hub’ locations of essential genes. First, essential genes are ‘hubs’ of PPI networks. As described in the ‘centrality-lethality’ rule, genes and proteins with high degrees of connectivity tend to be essential because their inactivation is more likely to disrupt overall network architecture [15]. Our statistical analysis confirmed that gene essentiality is correlated significantly with the degree of connectivity and that proteins encoded by essential genes are very stable, tolerating translational errors. Second, our work revealed a structural ‘hub’ of essential genes. Not only are essential genes clustered densely in the genome; these clusters also have a 3D structural organization. Third, essential genes are sensitive ‘hubs’ of epigenetic modification, which contributes to their high expression levels. As essential genes are centers of epigenetic modification and chromatin structure, their high transcription activity levels may further promote the expression of surrounding genes; that is, essential genes may act as the ‘seeds’ of a transcription factory, where endogenous genes are replicated, transcribed and repaired [79-81]. Furthermore, this three-‘hub’ model for essential genes indicates that gene knockdown in CRISPR experiments has the following effects on the transcriptional regulatory system in a cell: the gene will be down-regulated; the expression of other genes in the same PPI network will change; and the chromatin structure and epigenetic signal surrounding the CRISPR site may also change, which may affect the regulation of many other genes.

No gene is absolutely essential; only functions can be so

Consistent with previous findings [50], we confirmed that most essential genes are old. However, we also found that an unexpectedly high proportion of the youngest, human-specific genes are essential and play a role in the regulation of GTPase activity, although we cannot rule out the possibility that this may reflect enrichment of the very few cell-type-specific essential genes in hematopoietic cells, which could easily comprise a considerable fraction of all young and essential genes. Although essential genes are highly expressed and genes with high expression levels tend to be conserved across species, we noticed great variation in essential genes among species. By further examining the PPI networks constructed by essential genes, we found that although gene essentiality changes across species, the biological processes were conserved. This observation provides new insight supporting the idea that no gene is absolutely essential; only functions can be so [15].

Implications for gene editing and synthetic biology

Major innovations in our ability to edit genome sequences have enabled cost-effective and straightforward genome editing in yeasts, plants and animals [82, 83]. The identification of three ‘hub’ locations of essential genes suggests that the effects of gene editing (or gene therapy) on cells should be examined with consideration not only of the target gene and its signaling pathways but also of the associated epigenetic environment and the context of chromatin structure. Furthermore, essential genes can be used as a preferred gene set or important reference for gene interactions in synthetic biology. Additionally, in cancer research, they could facilitate drug discovery, offer promise as markers and be useful for the identification of clinical therapeutic applications. In summary, our work provides information that improves our understanding of human essential genes. Due to the limitations of experimental approaches, further work is required not only to understand the evolutionary plasticity of essential genes across various species but also to gain more evidence on the three ‘hub’ locations of essential genes. These studies will facilitate our understanding of the design principles of transcription regulatory networks, higher-level organization of vital processes and principles underlying drug resistance.

Materials and methods

Data set

Data descriptions for each evaluation and figure, the software or packages used to generate the figures and other related details are provided in Supplementary Table S3.

For human essential genes

Data on DHS, DNA methylation, histone modifications, CTCF and evolutionary conservation were downloaded from the ENCODE project and RoadMap database. Hi-C data were obtained from GSE43070 [34] and GSE63525 [35]. TAD boundaries were detected according to a previously described protocol [84]. Position-specific weight matrices of transcription factors were downloaded from the TRANSFAC and JASPAR databases. Data on ncRNAs were downloaded from the NONCODE database [85] (http://www.noncode.org). Essential gene data for different species were obtained from the DEG database [86] (http://www.essentialgene.org/). Cancer data were downloaded from the TCGA project. Drug data were obtained from the DrugBank database. Human gene annotations were obtained from the GENCODE database (V21).

For essential genes during mouse embryo preimplantation

Data from the assay for transposase-accessible chromatin followed by sequencing (ATAC-seq) were obtained from GSE66390 [87]. Histone modification H3K4me3 data for the early two-cell, four-cell and eight-cell stages of mouse embryos and ICMs were obtained from GSE71434 [88] and the ENCODE project. Histone modification H3K4me3 data for H1hESC and H1hESC-derived cells were obtained from a previous study [89]. Histone modification H3K4me3 data for H7es and H7es-derived cells were obtained from the ENCODE project. Mouse gene annotations were obtained from the Mouse Genome Informatics database [90].

Division of genes into groups by CS value

Given the high consistency of essential genes in different cell lines, we used essential genes in the KBM7 cell line for this study. CS values for the KBM7 cell line were provided by Wang et al. [11]. Genes were sorted by ascending CS value in the KBM7 cell line and divided into 10 groups (CS0–CS9). Specifically, group CS0 is composed of the 1878 essential genes reported by Wang et al. The first nine gene groups (CS0–CS8) contain the same number of genes. The remaining genes, which have the highest CS values, are assigned to group CS9. The corresponding CS threshold of each group is shown in Supplementary Table S4.
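The grouping rule described above (sort by ascending CS, fill CS0–CS8 with equal numbers of genes, place the remainder in CS9) can be sketched as follows. The function name, the dictionary input and the toy data are assumptions for illustration; the actual thresholds are those listed in Supplementary Table S4.

```python
def assign_cs_groups(cs_by_gene, group_size, n_groups=10):
    """Map each gene to a group label CS0..CS(n_groups - 1).

    cs_by_gene: dict of gene name -> CS value
    group_size: number of genes in each of the first n_groups - 1 groups
                (1878 in the study, the size of the essential gene set)
    """
    # Ascending CS: the genes with the lowest CS come first.
    ranked = sorted(cs_by_gene, key=cs_by_gene.get)
    groups = {}
    for rank, gene in enumerate(ranked):
        # Every gene beyond the equally sized groups falls into the last one.
        index = min(rank // group_size, n_groups - 1)
        groups[gene] = "CS%d" % index
    return groups


# Toy example: 25 hypothetical genes, two genes per group for CS0-CS8.
toy_scores = {"g%d" % i: float(i - 12) for i in range(25)}
toy_groups = assign_cs_groups(toy_scores, group_size=2)
print(toy_groups["g0"], toy_groups["g24"])
```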

Hi-C data processing

For H1hESC cells, Hi-C contact matrices were constructed and then normalized using HOMER (http://homer.ucsd.edu/homer/). For the GM12878, IMR90 and K562 cell lines, Hi-C contact matrices and loops were obtained from a previous study [35] and normalized using the SQRTVC method.

Calculating relative abundances of inactive epigenetic hub genes

The observed proportion of genes in each essentiality group with expression levels lower than a specific threshold and whose promoter regions were highly accessible (defined by a specific cutoff of DNase-seq tag density in a 2 kb region upstream of the TSS) was calculated. Then, for each group, the expected probability that a gene was an inactive epigenetic hub gene (defined as the product of the proportion of genes with low expression levels and that of genes located at highly accessible chromatin sites, using the same cutoffs) was calculated. The relative possibility that a gene was an inactive epigenetic hub gene was defined using the observed/expected ratio. To ensure that the analysis was not biased toward the selection of particular cutoff values, we applied widely ranging cutoff selection parameters [expression cutoff: minimum = 2, maximum = 10 (FPKM), step = 0.5; chromatin accessibility cutoff: minimum = top 50% of all genes, maximum = top 95% of all genes, step = 1%]. All 170 cutoff combinations yielded similar and significant results.
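The observed/expected ratio defined above can be written compactly for a single essentiality group and a single cutoff pair. This is a hedged sketch: the function name and the tuple layout are hypothetical, and the accessibility cutoff is passed as an absolute value, whereas the study derives it from a percentile of all genes.

```python
def inactive_hub_ratio(genes, expr_cutoff, access_cutoff):
    """Observed/expected ratio of inactive epigenetic hub genes in one group.

    genes: list of (expression_fpkm, promoter_accessibility) tuples
    A gene counts as an inactive hub gene when its expression is below
    expr_cutoff and its promoter accessibility is at or above access_cutoff.
    """
    n = len(genes)
    if n == 0:
        return float("nan")
    low = sum(1 for expr, acc in genes if expr < expr_cutoff)
    accessible = sum(1 for expr, acc in genes if acc >= access_cutoff)
    both = sum(
        1 for expr, acc in genes if expr < expr_cutoff and acc >= access_cutoff
    )
    observed = both / n
    # Expected proportion if low expression and accessibility were independent.
    expected = (low / n) * (accessible / n)
    return observed / expected if expected > 0 else float("nan")


# Toy group of four hypothetical genes.
toy_group = [(1.0, 10.0), (1.0, 10.0), (5.0, 10.0), (5.0, 1.0)]
ratio = inactive_hub_ratio(toy_group, expr_cutoff=2.0, access_cutoff=5.0)
print(ratio)
```

A ratio above 1 indicates that lowly expressed genes with accessible promoters occur more often than expected under independence; repeating the call over a grid of cutoffs mirrors the robustness check described in the text.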

PPI network construction

Protein–protein associations were obtained from the STRING database (version 10.5) [91]. Based on these associations, a PPI network was constructed using Cytoscape [92]. Using CentiScaPe [56], we computed specific centrality parameters to describe the network topology and then calculated the degree of connectivity for each node in the PPI network. Densely connected regions in large PPI networks were detected using the molecular complex detection method [57]. A GO analysis was performed using DAVID [93].
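The degree of connectivity used above is simply the number of distinct interaction partners of each node. The study computed it with CentiScaPe in Cytoscape; the sketch below is a stand-in that shows the same quantity on a plain edge list, with a hypothetical function name and toy edges.

```python
from collections import defaultdict


def node_degrees(edges):
    """Return a dict of node -> number of distinct neighbours.

    edges: iterable of (protein_a, protein_b) pairs from an undirected network.
    Duplicate edges and self-loops are ignored.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # a self-interaction does not add a partner
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: len(partners) for node, partners in neighbours.items()}


# Toy undirected network with one duplicated edge.
toy_edges = [("A", "B"), ("A", "C"), ("B", "A"), ("C", "D")]
degrees = node_degrees(toy_edges)
print(degrees)
```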

Profiling of epigenetic information

For each gene, the gene body and 10 kb upstream and downstream segments were each broken into 50 bins. The ChIP-seq density (RPKM) in these regions was calculated and combined to obtain 150 bins spanning 10 kb upstream, the gene body and 10 kb downstream. The average combined profiles for genes are shown.
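The binning scheme above (50 bins for each of the upstream flank, the gene body and the downstream flank, giving 150 bins per gene) can be sketched as follows. The function names are hypothetical, strand orientation is ignored for simplicity, and the RPKM helper uses the standard definition (reads per kilobase per million mapped reads); none of this is taken from the authors' code.

```python
def bin_edges(start, end, n_bins=50):
    """Split [start, end) into n_bins contiguous intervals using integer edges."""
    span = end - start
    return [start + span * i // n_bins for i in range(n_bins + 1)]


def profile_bins(gene_start, gene_end, flank=10_000, n_bins=50):
    """Return 3 * n_bins (start, end) bins: upstream, gene body, downstream."""
    segments = [
        (gene_start - flank, gene_start),  # upstream flank
        (gene_start, gene_end),            # gene body
        (gene_end, gene_end + flank),      # downstream flank
    ]
    bins = []
    for seg_start, seg_end in segments:
        edges = bin_edges(seg_start, seg_end, n_bins)
        bins.extend(zip(edges[:-1], edges[1:]))
    return bins


def rpkm(read_count, bin_length_bp, total_mapped_reads):
    """Reads per kilobase of bin per million mapped reads."""
    return read_count * 1e9 / (bin_length_bp * total_mapped_reads)


# Hypothetical gene spanning 20,000-25,000 on the forward strand.
toy_bins = profile_bins(20_000, 25_000)
print(len(toy_bins), toy_bins[0], toy_bins[-1])
```

Because every gene yields the same number of bins regardless of its length, the per-bin densities can be averaged across genes to produce the combined profiles.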

Screening of pan-cancer candidate genes

Twelve tumor types (COAD, KICH, BLCA, KIRC, CHOL, UCEC, PRAD, KIRP, LIHC, CESC, LUAD and BRCA) from the TCGA project were used to screen pan-cancer candidate genes. DEGs were first quantified in each cancer–normal tissue pair. The 5000 top-scoring DEGs among all genes and the 1000 top-scoring DEGs among essential genes were further combined, and genes in both of these sets were used as candidate genes for each cancer type. Pan-cancer candidate genes were defined as those that were candidates in at least eight cancer types.
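The screening rule above can be expressed in a few lines: for each cancer type, intersect the top-scoring DEGs among all genes with the top-scoring DEGs among essential genes, then keep genes that are candidates in enough cancer types. The sketch below assumes that a higher score means stronger differential expression; the function names and toy scores are hypothetical.

```python
def top_genes(scores, n):
    """Return the n genes with the highest scores as a set."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


def pan_cancer_candidates(scores_by_cancer, essential,
                          top_all=5000, top_essential=1000, min_types=8):
    """Genes that are candidates in at least min_types cancer types.

    scores_by_cancer: dict of cancer type -> {gene: DEG score}
    essential: set of essential gene names
    """
    counts = {}
    for scores in scores_by_cancer.values():
        all_top = top_genes(scores, top_all)
        essential_scores = {g: s for g, s in scores.items() if g in essential}
        essential_top = top_genes(essential_scores, top_essential)
        # A candidate must appear in both top lists for this cancer type.
        for gene in all_top & essential_top:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene for gene, count in counts.items() if count >= min_types}


# Toy example with three hypothetical cancer types and relaxed thresholds.
toy_scores = {
    "T1": {"A": 5, "B": 4, "C": 3},
    "T2": {"A": 1, "B": 5, "C": 4},
    "T3": {"A": 5, "B": 1, "C": 4},
}
candidates = pan_cancer_candidates(
    toy_scores, essential={"A", "B"}, top_all=2, top_essential=1, min_types=2
)
print(candidates)
```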

Key Points

We performed a very detailed classification of human protein-coding genes (including essential and nonessential genes) and found general correlations between gene essentiality level and major features.

We extended previous work by integrating multi-omics (genome, epigenome and proteome) data into a new model of the three types of ‘hub’ location of essential genes in PPI networks, chromatin structure and epigenetic regulation, which enables multidimensional understanding of essential genes.

We conducted, to our knowledge, the first systematic analysis of essential genes from the view of 3D chromatin structure, dissecting chromatin loop anchors, TAD boundaries and intra-TAD local contacts, and revealed the ‘hub’ location of essential genes in spatial chromatin conformation.

Our work extended knowledge on two features of essential genes. First, although previous studies indicated that ‘old’ genes are generally longer than ‘young’ genes, we found that essential genes are old but unexpectedly short. Second, although long genes are generally more likely to contain more diverse transcripts than short genes, the short essential genes contained a larger variety of transcripts. These two characteristics of essential genes may be important for the stability of cell functions.

We developed HEGIAP, a web server that provides multiple tools for further visualization and analysis of essential genes.
