
OverGeneDB: a database of 5' end protein coding overlapping genes in human and mouse genomes.

Wojciech Rosikiewicz1, Yutaka Suzuki2, Izabela Makalowska1.   

Abstract

Gene overlap plays various regulatory roles at the transcriptional and post-transcriptional levels. Most current studies focus on protein-coding genes overlapping with non-protein-coding counterparts, the so-called natural antisense transcripts. Considerably less is known about the role of gene overlap in the case of two protein-coding genes. Here, we provide OverGeneDB, a database of human and mouse 5' end protein-coding overlapping genes. The database contains 582 human and 113 mouse gene pairs that are transcribed using overlapping promoters in at least one analyzed library. Gene pairs were identified based on the analysis of the transcription start site (TSS) coordinates in 73 human and 10 mouse organs, tissues and cell lines. Besides TSS data, resources for 26 human lung adenocarcinoma cell lines also contain RNA-Seq and ChIP-Seq data for seven histone modifications and RNA Polymerase II activity. The collected data revealed that the overlap region is rarely conserved between the studied species and tissues. In ∼50% of the overlapping genes, transcription started explicitly in the overlap regions. In the remaining half of overlapping genes, transcription was initiated both from overlapping and non-overlapping TSSs. OverGeneDB is accessible at http://overgenedb.amu.edu.pl.
© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

Year:  2018        PMID: 29069459      PMCID: PMC5753363          DOI: 10.1093/nar/gkx948

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Gene overlap in eukaryotes, which is defined here as sharing of the same DNA sequence by at least two different genes (1,2), was discovered over 30 years ago (3–5). For a long time, this phenomenon was believed to be rare. However, over the last three decades, a growing number of examples of gene overlap have been reported in various animal, plant and fungal species (6–19). It is estimated that >30% of human and mouse genes overlap with another gene (20–22). Large-scale projects, such as those from the FANTOM Consortium, revealed that 72% of transcription events might proceed in both directions (23). Genes may overlap in various manners (1), including complete overlap when one gene is nested within the other or partial overlap when only the 3′ or 5′ end(s) of genes are overlapping. Gene overlap is currently intensively studied in the context of protein coding genes regulated by their antisense non-protein coding counterparts, i.e. natural antisense transcripts (NATs), which exhibit various regulatory functions. Briefly, NATs were suggested to regulate protein coding gene expression levels during transcription by various mechanisms of transcriptional interference (TI), including promoter competition, occlusion, ‘sitting duck’ interference or polymerase collisions (24). Presence of the antisense RNA may also regulate gene expression post-transcriptionally via double-stranded RNA formation (25), leading to RNA editing (26), interference (27–31) or masking (32–39). NATs also regulate protein-coding gene expression levels epigenetically by inducing repressive chromatin modifications within the sense gene promoters or even the entire genomic loci and downregulating the expression levels of neighboring genes (40–43).
Nevertheless, the extent to which gene expression is regulated by antisense transcription remains a matter of debate and needs to be further investigated (25,44–48), especially given that numerous NATs were connected with various pathological states, such as Parkinson’s or Alzheimer’s diseases (34,49), cancer (50,51) and numerous other disorders (52,53). Researchers are interested in many of these NATs as therapeutic targets given that their artificial up- or downregulation directly influences the expression levels of the sense, protein-coding genes (42). Although numerous researchers are working on the gene overlap phenomenon, relatively few databases dedicated to overlapping genes are available. Currently, the most comprehensive database is PlantNATsDB (54), which focuses on antisense transcripts in plants and predicts >2 million NATs in 70 plant species. Until recently, the best equivalent for animal species was NATsDB (55), in which authors deposited thousands of antisense transcripts identified by mapping EST sequences for 11 model organisms. Unfortunately, NATsDB and other databases, such as EVOG (56), antiCODE (57) or the database created by Veeramachaneni et al. (16), are no longer maintained. Nevertheless, the abovementioned databases were primarily dedicated to protein-coding genes that overlap with non-protein coding counterparts, and none of these databases are suitable for large-scale tissue-specific studies of gene overlap. As described in this paper, OverGeneDB is a database of 5′ end(s) protein coding overlapping genes in human and mouse genomes. OverGeneDB contains information regarding 582 human and 113 mouse overlapping protein-coding gene pairs that were identified based on the exact genomic coordinates of the alternative transcription start sites (TSSs) in 73 human and 10 mouse TSS-Seq libraries from various organs, tissues and cell lines. 
For 26 human lung adenocarcinoma tissue samples, studies of overlapping genes were strengthened by analyses of RNA-Seq data and of ChIP-Seq data for seven histone modifications and RNA Polymerase II activity. OverGeneDB is a platform that offers easy access to and visualization of all identified overlapping gene pairs and associated data. In addition, these data can be downloaded for further analysis.

MATERIALS AND METHODS

Representative gene coordinates

Coordinates of all known RefSeq transcripts (58) for human GRCh38/hg38 and mouse NCBI37/mm9 genome versions were downloaded from the UCSC database using Table Browser (59). Coordinates of splice variants were further used to determine side-to-side gene positions as shown in Figure 1A. TSS coordinates were downloaded for a total of 73 human and 4 mouse libraries from DBTSS database versions 9 (60) and 8 (61) for human and mouse, respectively. TSS coordinates from six additional mouse organs were sequenced for the purpose of this study and processed using the same protocol used for other data from the DBTSS database (60). All human and mouse libraries are listed in Supplementary Table S1. The downloaded data were filtered to ensure that only TSSs with the ‘confident’ status, in which the normalized expression level is ≥5 parts per million (ppm), were considered, as suggested by Yamashita and coworkers after the detailed validation of the TSS-Seq method (62). Additionally, the maximum distance between a TSS and the closest gene on the same DNA strand was limited to 5000 bp upstream of the gene’s annotated 5′ end. Next, each gene was analyzed separately in every TSS library to identify representative gene coordinates. The 3′ end was based on the RefSeq annotations, whereas the 5′ end was determined based on the position of the TSS in a given library. If more than one TSS was assigned to a gene, the coordinates of the distal TSS were considered to represent the 5′ end of the gene (Figure 1B).
Figure 1. Representative gene coordinates computation strategy. (A) Gene A coordinates are based on the distal 5′ and 3′ annotated coordinates of all of the gene's mRNAs; (B) Representative gene coordinates in different libraries are based on the annotated gene A 3′ end and the distal alternative transcription start site.
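The filtering and coordinate selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: the Gene and TSS classes and the representative_coords function are hypothetical names, each gene is treated on its own (the "closest gene" assignment is not modelled), TSSs inside the annotated gene body are assumed to belong to the gene, and "distal" is taken to mean the TSS farthest upstream.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_PPM = 5.0        # 'confident' TSS threshold given in the text
MAX_UPSTREAM = 5000  # maximum distance (bp) upstream of the annotated 5' end


@dataclass
class Gene:          # hypothetical container for annotated coordinates
    name: str
    strand: str      # '+' or '-'
    start: int       # leftmost annotated coordinate
    end: int         # rightmost annotated coordinate


@dataclass
class TSS:           # hypothetical container for one TSS in one library
    pos: int
    strand: str
    ppm: float


def representative_coords(gene: Gene, tss_list: List[TSS]) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the gene in one library, or None if no TSS qualifies."""
    usable = []
    for t in tss_list:
        # Keep only confident TSSs on the same strand as the gene.
        if t.strand != gene.strand or t.ppm < MIN_PPM:
            continue
        if gene.strand == "+":
            in_range = gene.start - MAX_UPSTREAM <= t.pos <= gene.end
        else:
            in_range = gene.start <= t.pos <= gene.end + MAX_UPSTREAM
        if in_range:
            usable.append(t.pos)
    if not usable:
        return None
    # The 3' end comes from the annotation; the 5' end is the distal TSS.
    if gene.strand == "+":
        return (min(usable), gene.end)
    return (gene.start, max(usable))
```

Running the function once per gene and per library gives the library-specific coordinates that the overlap detection step works on.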

Overlapping genes detection and characterization

The identification of genes overlapping at 5′ end(s) was performed for genes expressed in a given library based on the representative gene coordinates. The genes were required to overlap by at least one base. The procedure was performed for each library independently. Genes may be simultaneously expressed using one or more TSSs. In numerous cases, this results in genes that overlap only with respect to a subset of their alternative TSSs. To determine to what degree a gene is transcribed from the overlapping TSSs, the overlap ratio (OR) was developed. This value is simply the fraction of the total gene expression assigned to the overlap region (Figure 2). Similarly, to estimate to what extent transcripts in the gene pair are transcribed from the overlapping region, the JoinedOR ratio was calculated. JoinedOR is the product of the OR values of the two genes in an overlapping pair (Figure 2). OR and JoinedOR values equal to 1 indicate that all transcripts originated from the overlapping TSSs in the gene and pair, respectively. The lower the values, the smaller the subset of transcripts that originated from the overlapped region. A value of 0 indicates that no expression was assigned to the overlapping TSSs.
Figure 2. OR and JoinedOR values for the example gene pair expressed from the non-overlapping TSS in library 1 and the overlapping TSS in libraries 2 and 3. Blue and green narrow solid boxes represent the annotated coordinates of genes on the plus and minus strand, respectively. Wider light blue and green boxes indicate representative gene coordinates in a particular library, whereas arrows represent alternative transcription start sites together with their assigned expression levels, normalized to parts per million (ppm).
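As a worked illustration of the OR and JoinedOR definitions, the following Python sketch computes both values for one hypothetical gene pair in one library. The function names and the example coordinates are invented for illustration, and a TSS is assumed to count as "overlapping" when it lies inside the overlap region of the two representative gene intervals.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[int, int]  # inclusive (start, end) coordinates


def overlap_region(a: Interval, b: Interval) -> Optional[Interval]:
    """Shared interval of two genes, or None if they overlap by less than one base."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def overlap_ratio(tss_ppm: Dict[int, float], region: Optional[Interval]) -> float:
    """Fraction of a gene's total TSS expression that falls inside the overlap region."""
    total = sum(tss_ppm.values())
    if region is None or total == 0:
        return 0.0
    inside = sum(ppm for pos, ppm in tss_ppm.items() if region[0] <= pos <= region[1])
    return inside / total


def joined_or(or_a: float, or_b: float) -> float:
    """JoinedOR is the product of the two per-gene OR values."""
    return or_a * or_b


# Hypothetical pair: gene A on the plus strand, gene B on the minus strand.
gene_a_tss = {1100: 30.0, 1400: 10.0}   # position -> ppm
gene_b_tss = {1250: 20.0, 1050: 20.0}
gene_a = (1100, 3000)                   # distal TSS .. annotated 3' end
gene_b = (200, 1250)                    # annotated 3' end .. distal TSS

region = overlap_region(gene_a, gene_b)      # (1100, 1250)
or_a = overlap_ratio(gene_a_tss, region)     # 30 / 40 = 0.75
or_b = overlap_ratio(gene_b_tss, region)     # 20 / 40 = 0.5
print(region, or_a, or_b, joined_or(or_a, or_b))
```

In this made-up case three quarters of gene A expression and half of gene B expression start inside the overlap, giving a JoinedOR of 0.375.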

Expression level estimation using RNA-seq data

Raw Illumina RNA-Seq paired-end reads from 26 lung adenocarcinoma samples, which were the same samples used for TSS sequencing, were downloaded from the ENA database (63) where these data are stored under accession number PRJDB2256. All reads were subjected to quality filtering using the Trimmomatic program (version 0.36) (64) with the following parameters: -phred33; ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10; LEADING:20; TRAILING:20; SLIDINGWINDOW:5:20; and MINLEN:50. Quality control before and after filtering was performed using FastQC (version 0.11.5) (65). Filtered reads were aligned to the human reference hg38 genome downloaded from UCSC (66) using the HISAT2 program (version 2.0.5) (67) with the --downstream-transcriptome-assembly parameter. The numbers of mapped and unmapped reads for each library are presented in supplementary Table ST2. Next, all SAM files were sorted and converted to BAM format using SAMtools (version 1.3.1) (68). Expression levels of individual transcripts were further estimated in FPKM (fragments per kilobase of exon per million fragments mapped) using StringTie (version 1.3.1c) (69) with -e and -B flags guided by GENCODE (version 24) genome reference annotations (70). The expression levels of individual transcripts were summed to represent the total expression levels of genes. No minimal expression level was required in this step.
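The processing chain described above could be assembled roughly as in the Python sketch below, which only builds the command lines and sums transcript-level FPKM per gene. The parameters quoted in the text are reproduced as given; every file name, the index path, the jar location and the helper function names are placeholders, and the authors' exact invocations are not stated in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def trimmomatic_cmd(r1: str, r2: str, prefix: str,
                    jar: str = "trimmomatic-0.36.jar") -> List[str]:
    """Paired-end quality filtering with the parameters listed in the text."""
    return ["java", "-jar", jar, "PE", "-phred33", r1, r2,
            f"{prefix}_1P.fq.gz", f"{prefix}_1U.fq.gz",
            f"{prefix}_2P.fq.gz", f"{prefix}_2U.fq.gz",
            "ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10",
            "LEADING:20", "TRAILING:20", "SLIDINGWINDOW:5:20", "MINLEN:50"]


def hisat2_cmd(index: str, r1: str, r2: str, sam: str) -> List[str]:
    """Alignment tailored for downstream transcript assemblers."""
    return ["hisat2", "--downstream-transcriptome-assembly",
            "-x", index, "-1", r1, "-2", r2, "-S", sam]


def samtools_sort_cmd(sam: str, bam: str) -> List[str]:
    """Sort a SAM file and write it as BAM."""
    return ["samtools", "sort", "-o", bam, sam]


def stringtie_cmd(bam: str, annotation_gtf: str, out_gtf: str) -> List[str]:
    """Estimate expression of annotated transcripts only (-e), with Ballgown tables (-B)."""
    return ["stringtie", bam, "-e", "-B", "-G", annotation_gtf, "-o", out_gtf]


def gene_fpkm(transcripts: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum transcript-level FPKM values into gene-level totals."""
    totals: Dict[str, float] = defaultdict(float)
    for gene_id, fpkm in transcripts:
        totals[gene_id] += fpkm
    return dict(totals)


if __name__ == "__main__":
    # Print the commands for one hypothetical sample instead of running them.
    for cmd in (trimmomatic_cmd("s1_1.fq.gz", "s1_2.fq.gz", "s1"),
                hisat2_cmd("hg38_index", "s1_1P.fq.gz", "s1_2P.fq.gz", "s1.sam"),
                samtools_sort_cmd("s1.sam", "s1.bam"),
                stringtie_cmd("s1.bam", "gencode.v24.gtf", "s1.gtf")):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run once the tools are installed; printing them first makes it easy to check the parameters against the text.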

Histone modifications and RNA polymerase II activity studies

Pre-aligned reads from ChIP-Seq experiments for RNA Polymerase II and seven histone modification types, including H3ac, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3 and H3K9me3, with their controls were downloaded from the DBTSS (version 9) (60) FTP server in BED3+1 format. The additional column represents the number of tags mapped at a given position. Data were available for 26 tissue samples from patients with lung adenocarcinoma. Each library was converted to standard BED6 format and subsequently screened for peak enrichment using the MACS2 program (version 2.1.0) (71) with the --nomodel flag used for all libraries. In addition, parameters --broad and --broad-cutoff 0.1 were used for data regarding the seven histone modifications. For the purpose of visualization in the web browser, MACS2 output peaks were converted to the bigBed format using the bedToBigBed program from UCSC (72). All BED6 files representing coordinates of mapped reads were converted to bigWig format using genomecov from the BEDTools package (version 2.25.0) (73) with the -bg flag and the bedGraphToBigWig program from UCSC (72).
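The difference between the peak-calling settings for RNA Polymerase II and for the histone marks can be made explicit with a small Python helper that assembles a MACS2 callpeak command. This is a sketch under assumptions: the function name and file names are hypothetical, the input format is taken to be BED because the libraries were converted to BED6, and options not mentioned in the text are left at their defaults.

```python
from typing import List

# Histone modification types listed in the text.
HISTONE_MARKS = {"H3ac", "H3K27ac", "H3K27me3", "H3K36me3",
                 "H3K4me1", "H3K4me3", "H3K9me3"}


def macs2_cmd(target: str, treatment_bed: str, control_bed: str, name: str) -> List[str]:
    """Build a 'macs2 callpeak' command for one ChIP-Seq library."""
    cmd = ["macs2", "callpeak",
           "-t", treatment_bed, "-c", control_bed,
           "-f", "BED", "-n", name,
           "--nomodel"]                      # used for all libraries
    if target in HISTONE_MARKS:
        # Broad peak calling was applied to histone modification data only.
        cmd += ["--broad", "--broad-cutoff", "0.1"]
    return cmd


if __name__ == "__main__":
    print(" ".join(macs2_cmd("H3K27me3", "s1_H3K27me3.bed", "s1_input.bed", "s1_H3K27me3")))
    print(" ".join(macs2_cmd("PolII", "s1_PolII.bed", "s1_input.bed", "s1_PolII")))
```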

Association of transcription factors with TSSs

To associate transcription factors (TFs) with particular promoters, transcription start sites were first clustered across all human and separately all mouse TSS-Seq libraries. In this step, all TSSs assigned to the same gene and located <300 bp from each other were joined in a single cluster. Next, nucleotide sequences of all clusters together with up to 500 bp in both directions were screened for the presence of transcription factor binding site (TFBS) motifs obtained from the JASPAR database (74). The search was conducted using the searchSeq function from TFBSTools (version 1.14.0) (75) with the minimal score set to 95%. All TFs potentially associated with promoters were subsequently filtered based on their expression, i.e. only TFs that were expressed in a particular library were considered as potentially regulating a certain promoter. Finally, hierarchical clustering of all potential TFs in all libraries was performed for each promoter.
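The clustering step lends itself to a short example. The Python sketch below joins the TSS positions of one gene into clusters and derives the window that would be scanned for TFBS motifs. The text does not say how chains of nearby TSSs were handled, so single-linkage joining of sorted neighbours is an assumption here, as are the function names; the motif search itself (done with TFBSTools in R) is not reproduced.

```python
from typing import List, Tuple


def cluster_tss(positions: List[int], max_gap: int = 300) -> List[List[int]]:
    """Join TSSs of one gene that lie less than max_gap bp from their sorted neighbour."""
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < max_gap:
            clusters[-1].append(pos)   # close enough: extend the current cluster
        else:
            clusters.append([pos])     # otherwise start a new cluster
    return clusters


def scan_window(cluster: List[int], flank: int = 500) -> Tuple[int, int]:
    """Cluster span extended by up to `flank` bp on both sides (clamped at 0)."""
    return (max(0, cluster[0] - flank), cluster[-1] + flank)


if __name__ == "__main__":
    clusters = cluster_tss([700, 100, 399, 350, 1500])
    for c in clusters:
        print(c, scan_window(c))
```

With these example positions, 100, 350 and 399 form one cluster, while 700 (301 bp from 399) and 1500 each start their own.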

Database implementation

The OverGeneDB database was implemented in MySQL (https://www.mysql.com/). The publicly accessible interface was generated using HTML, PHP and JavaScript. The detailed overlapping gene pair view was additionally equipped with an embedded Dalliance (76) genome browser and interactive charts from plot.ly (https://plot.ly/).

DATABASE COMPOSITION AND USAGE

Data stored in OverGeneDB may be accessed via three different methods: Browse, Search or sequence similarity search. The Browse page lists all human and mouse gene pairs that overlap in at least one library. This page also provides information regarding the number of libraries in which genes in a given pair are identified as overlapping and whether both or only one gene from a pair is expressed. The Search option allows the user to specify the libraries, or the number of libraries, in which genes overlap, in which only one or neither gene from a pair is expressed, or in which both genes are expressed regardless of their overlap status. It is also possible to perform a sequence-based similarity search using the BLAST program (77,78) against the overlapping regions or the representative gene sequences in all or selected libraries. The above-described Search and Browse methods generate lists of gene pairs meeting the specified criteria. Detailed information about a given pair can be obtained by clicking on the ‘Details’ button. The overlapping gene pair view is separated into six sections displayed in tabs, described below. Genome context: annotated genes and transcripts can be examined using the built-in Dalliance genome browser (76). Additional tracks may be selected for human and mouse libraries using a button above the browser. These tracks contain alternative TSSs and overlap regions displayed as blocks and positions of predicted TFBS. For the 26 human lung adenocarcinoma libraries, it is also possible to display tracks of raw BAM files of mapped RNA-Seq reads and raw ChIP-Seq signals in bigWig format and peaks for RNA Polymerase II and seven types of histone modifications. Overlap summary table (Figure 3): this table displays detailed information about overlapping genes, including OR and JoinedOR ratios and TSS-Seq-based expression levels. Upon clicking the library name, a simple visualization of the gene overlap in the selected library is displayed, where one may also inspect the expression levels of individual TSSs.
Figure 3. Overlap summary table for the H2AFJ and HIST4H4 gene pair. Clicking on a library name opens a small pop-up window with a schematic of the gene arrangement within the selected library.

Gene expression: this tab provides detailed information about the TSS-Seq and RNA-Seq expression levels of genes in an overlapping gene pair when available. Detailed TSS information (Figure 4): this tab provides a visualization of the gene pair and positions of TSS clusters. For each gene in a pair, a summary table with the TSS usage is provided. The user may easily inspect the libraries in which a particular TSS was utilized and what normalized expression level was assigned to it. In addition, for each promoter, a hierarchically clustered summary of the TFs potentially responsible for the regulation of this particular promoter is displayed. The user may download this summary in PNG, SVG or Tab Separated Value (TSV) formats.
Figure 4. Detailed TSS information for the HIST4H4 gene example. The ‘HIST4H4 gene TSS usage summary table’ panel displays expression levels assigned to individual TSSs in various libraries. This summary lists only the libraries in which the inspected gene was expressed. The ‘TFBS summary table’ displays TFs for which binding sites were identified in the region <500 bp from a given TSS. These TFs were hierarchically clustered based on whether each TF is expressed (marked in green) or not expressed (marked in white) in a given library.

Gene summary: this tab displays basic gene information with links to cross-database references, gene references into functions (Gene RIF), Gene Ontology and PubMed literature references associated with NCBI Gene IDs, which were all downloaded from the NCBI Gene database (79). Download: the user may download the most essential information associated with the selected overlapping gene pair in TSV, BED or FASTA formats.

DISCUSSION

To the best of the authors’ knowledge, OverGeneDB is the first database strictly dedicated to protein coding genes overlapping at their 5′ end(s). The database contains 582 human and 113 mouse gene pairs that were identified as overlapping with a minimum of one TSS pair in at least one library. These gene pairs included 1150 human and 225 mouse genes, among which 14 genes overlapped with more than one other gene. A total of 4075 promoters in human and 518 promoters in mouse were assigned to these genes. The average overlapping region size is 1570 bp, and the longest overlap region, which was ∼50 kb, was reported for gene pair RUNX2 and SUPT3H. The collected data revealed that the overlap region is rarely conserved among the studied species, organs, tissues and cell lines. Surprisingly, for up to 300 human and mouse gene pairs, a >100-bp difference in the overlap region length across libraries was identified. In the case of 159 gene pairs, this difference is >1000 bp. Moreover, in 85 human overlapping pairs, the overlap region differs not only in length but is also substantially shifted and covers a different genomic region. The HTRA2 and AUP1 gene pair serves as an example. In this gene pair, the overlapping region shifts between libraries from exclusively covering the HTRA2 annotated gene body in Adenocarcinoma VMRCLCD to an overlap that is mainly located within the AUP1 gene in Adenocarcinomas PC3 and PC14. Among all identified overlapping gene pairs, 90 human and mouse pairs were identified as always overlapping whenever both genes were expressed. The remaining 605 pairs were occasionally expressed from the overlapping and non-overlapping promoters. In total, 203 human and mouse gene pairs were overlapping only in one library, whereas both genes were expressed without gene overlap in almost all other libraries. Genes often utilize multiple alternative promoters simultaneously, among which only a subset may be overlapping.
Therefore, if regulation by transcriptional interference or double stranded RNA formation occurs, only transcripts initiated within the overlap region may be subjected to regulation via these mechanisms. To assess this phenomenon, OR and JoinedOR ratios were introduced, and these values represent the fraction of the expression initiated within the overlap region in a gene and a gene pair, respectively. In 57% of human and 44% of mouse overlap events, transcription started explicitly in the overlap regions, as reflected by a JoinedOR value equal to one. In contrast, in the remaining cases, transcription was initiated both from overlapping and non-overlapping TSSs by at least one of the genes in a pair. Human genes FBXL15 and PSD, which overlap in brain tissue, among others, serve as an example. Both genes simultaneously utilize two alternative TSSs, among which only distal TSSs are overlapping. Assuming equal expression level of both genes and utilizing a JoinedOR ratio, <13% of FBXL15 and PSD gene transcripts were estimated to be possibly subjected to overlap-related regulation on transcriptional or post-transcriptional levels. Studies of the overlapping gene pairs in OverGeneDB were strengthened by the large-scale in silico prediction of TFBS within the overlapping gene promoter regions. Moreover, overlapping gene expression levels in 26 lung adenocarcinoma samples were independently estimated based on RNA-Seq data. Finally, the same 26 tissue samples were also studied using ChIP-Seq experiment results aimed at RNA Polymerase II activity and seven histone modifications, which exhibit significant potential in overlapping promoter studies (44). Taking all of these features together, OverGeneDB is a valuable source of data for anyone interested in antisense transcription, the regulation of promoter usage, and the mechanisms of gene expression regulation.

AVAILABILITY

The OverGeneDB database is freely available at http://overgenedb.amu.edu.pl. Scripts for the automated identification of overlapping gene pairs are available at https://github.com/forrest1988/OverGeneDB.