Wojciech Rosikiewicz1, Yutaka Suzuki2, Izabela Makalowska1. 1. Department of Integrative Genomics, Institute of Anthropology, Faculty of Biology, Adam Mickiewicz University in Poznan, 61-712 Poznan, Poland. 2. Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Chiba, 272-8562, Japan.
Abstract
Gene overlap serves various regulatory functions at the transcriptional and post-transcriptional levels. Most current studies focus on protein-coding genes overlapping with non-protein-coding counterparts, the so-called natural antisense transcripts. Considerably less is known about the role of gene overlap in the case of two protein-coding genes. Here, we provide OverGeneDB, a database of human and mouse 5' end protein-coding overlapping genes. The database contains 582 human and 113 mouse gene pairs that are transcribed using overlapping promoters in at least one analyzed library. Gene pairs were identified based on the analysis of the transcription start site (TSS) coordinates in 73 human and 10 mouse organs, tissues and cell lines. Besides TSS data, resources for 26 human lung adenocarcinoma cell lines also contain RNA-Seq and ChIP-Seq data for seven histone modifications and RNA Polymerase II activity. The collected data revealed that the overlap region is rarely conserved between the studied species and tissues. In ∼50% of the overlapping genes, transcription started exclusively in the overlap regions. In the remaining half of overlapping genes, transcription was initiated both from overlapping and non-overlapping TSSs. OverGeneDB is accessible at http://overgenedb.amu.edu.pl.
Gene overlap in eukaryotes, which is defined here as sharing of the same DNA sequence by at least two different genes (1,2), was discovered over 30 years ago (3–5). For a long time, this phenomenon was believed to be rare. However, over the last three decades, an increasing number of examples of gene overlap have been reported in various animal, plant and fungal species (6–19). It is estimated that >30% of human and mouse genes overlap with another gene (20–22). Large-scale projects, such as those from the FANTOM Consortium, revealed that 72% of transcription events might proceed in both directions (23).
Genes may overlap in various manners (1), including complete overlap, when one gene is nested within the other, or partial overlap, when only the 3′ or 5′ end(s) of genes overlap. Gene overlap is currently intensively studied in the context of protein-coding genes regulated by their antisense non-protein-coding counterparts, i.e. natural antisense transcripts (NATs), which exhibit various regulatory functions. Briefly, NATs were suggested to regulate protein-coding gene expression levels during transcription by various mechanisms of transcriptional interference (TI), including promoter competition, occlusion, ‘sitting duck’ interference or polymerase collisions (24). The presence of antisense RNA may also regulate gene expression post-transcriptionally via double-stranded RNA formation (25), leading to RNA editing (26), interference (27–31) or masking (32–39). NATs also regulate protein-coding gene expression levels epigenetically by inducing repressive chromatin modifications within the sense gene promoters or even the entire genomic loci and downregulating the expression levels of neighboring genes (40–43).
Nevertheless, the extent to which gene expression is regulated by antisense transcription remains a matter of debate and needs to be further investigated (25,44–48), especially given that numerous NATs have been connected with various pathological states, such as Parkinson’s or Alzheimer’s diseases (34,49), cancer (50,51) and numerous other disorders (52,53). Researchers are interested in many of these NATs as therapeutic targets given that their artificial up- or downregulation directly influences the expression levels of the sense, protein-coding genes (42).
Although numerous researchers are working on the gene overlap phenomenon, relatively few databases dedicated to overlapping genes are available. Currently, the most comprehensive database is PlantNATsDB (54), which focuses on antisense transcripts in plants and predicts >2 million NATs in 70 plant species. Until recently, the best equivalent for animal species was NATsDB (55), in which the authors deposited thousands of antisense transcripts identified by mapping EST sequences for 11 model organisms. Unfortunately, NATsDB and other databases, such as EVOG (56), antiCODE (57) or the database created by Veeramachaneni et al. (16), are no longer maintained. Moreover, the abovementioned databases were primarily dedicated to protein-coding genes that overlap with non-protein-coding counterparts, and none of these databases are suitable for large-scale tissue-specific studies of gene overlap. As described in this paper, OverGeneDB is a database of protein-coding genes that overlap at their 5′ end(s) in the human and mouse genomes. OverGeneDB contains information regarding 582 human and 113 mouse overlapping protein-coding gene pairs that were identified based on the exact genomic coordinates of the alternative transcription start sites (TSSs) in 73 human and 10 mouse TSS-Seq libraries from various organs, tissues and cell lines.
For 26 human lung adenocarcinoma tissue samples, studies of overlapping genes were strengthened by RNA-Seq data and by ChIP-Seq data for seven histone modifications and RNA Polymerase II activity. OverGeneDB is a platform that offers easy access to and visualization of all identified overlapping gene pairs and associated data. In addition, these data can be downloaded for further analysis.
MATERIALS AND METHODS
Representative gene coordinates
Coordinates of all known RefSeq transcripts (58) for human GRCh38/hg38 and mouse NCBI37/mm9 genome versions were downloaded from the UCSC database using Table Browser (59). Coordinates of splice variants were further used to determine side-to-side gene positions as shown in Figure 1A. TSS coordinates were downloaded for a total of 73 human and 4 mouse libraries from DBTSS database versions 9 (60) and 8 (61) for human and mouse, respectively. TSS coordinates from six additional mouse organs were sequenced for the purpose of this study and processed using the same protocol used for other data from the DBTSS database (60). All human and mouse libraries are listed in Supplementary Table S1. The downloaded data were filtered to ensure that only TSSs with the ‘confident’ status, in which the normalized expression level is ≥5 parts per million (ppm), were considered, as suggested by Yamashita and coworkers after the detailed validation of the TSS-Seq method (62). Additionally, the maximum distance between a TSS and the closest gene on the same DNA strand was limited to 5000 bp upstream of the gene’s annotated 5′ end. Next, each gene was analyzed separately in every TSS library to identify representative gene coordinates. The 3′ end was based on the RefSeq annotations, whereas the 5′ end was determined based on the position of the TSS in a given library. If more than one TSS was assigned to a gene, the coordinates of the distal TSS were considered to represent the 5′ end of the gene (Figure 1B).
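The selection of representative coordinates described above can be illustrated with a minimal Python sketch. It is not the authors' script: the `Gene` and `Tss` classes and the function names are hypothetical, the assignment of a TSS to the "closest gene" is simplified to a single gene, and it is assumed that TSSs inside the annotated gene body are also eligible. Only the 5 ppm threshold, the 5000 bp upstream limit and the distal-TSS rule come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_PPM = 5.0        # 'confident' TSS threshold stated in the text
MAX_UPSTREAM = 5000  # maximum distance upstream of the annotated 5' end (bp)


@dataclass
class Gene:
    strand: str  # '+' or '-'
    start: int   # leftmost annotated coordinate over all RefSeq transcripts
    end: int     # rightmost annotated coordinate over all RefSeq transcripts


@dataclass
class Tss:
    pos: int
    strand: str
    ppm: float


def usable_tss(gene: Gene, tss_list: List[Tss]) -> List[Tss]:
    """Keep confident same-strand TSSs in the gene body or <= 5000 bp upstream."""
    if gene.strand == "+":
        lo, hi = gene.start - MAX_UPSTREAM, gene.end
    else:
        lo, hi = gene.start, gene.end + MAX_UPSTREAM
    return [t for t in tss_list
            if t.strand == gene.strand and t.ppm >= MIN_PPM and lo <= t.pos <= hi]


def representative_coords(gene: Gene, tss_list: List[Tss]) -> Optional[Tuple[int, int]]:
    """Return (start, end) for one library, or None if the gene has no usable TSS."""
    kept = usable_tss(gene, tss_list)
    if not kept:
        return None
    if gene.strand == "+":
        # Distal TSS = most upstream = smallest coordinate on the plus strand.
        return min(t.pos for t in kept), gene.end
    # On the minus strand the most upstream TSS has the largest coordinate.
    return gene.start, max(t.pos for t in kept)


# Hypothetical library: one TSS is below 5 ppm and one lies too far upstream.
plus_gene = Gene("+", 10_000, 20_000)
library = [Tss(9_000, "+", 10.0), Tss(9_500, "+", 3.0),
           Tss(4_000, "+", 50.0), Tss(12_000, "+", 8.0)]
print(representative_coords(plus_gene, library))  # (9000, 20000)
```

In this toy library the 3 ppm TSS is discarded as not confident and the TSS at position 4000 is discarded because it lies more than 5000 bp upstream, so the 5′ end is set by the remaining distal TSS at position 9000.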
Figure 1.
Representative gene coordinates computation strategy. (A) Gene A coordinates are based on the distal 5′ and 3′ annotated coordinates of all gene's mRNAs; (B) Representative gene coordinates in different libraries are based on the annotated gene A 3′ end and distal alternative transcription start site.
Overlapping genes detection and characterization
The identification of genes overlapping at their 5′ end(s) was performed for genes expressed in a given library based on the representative gene coordinates. The genes were required to overlap by at least one base. The procedure was performed for each library independently.
Genes may be simultaneously expressed using one or more TSSs. In numerous cases, this phenomenon results in genes that overlap only with respect to a subset of their alternative TSSs. To determine to what degree a gene is transcribed from the overlapping TSSs, the overlap ratio (OR) was developed. This value is simply the fraction of the total gene expression assigned to the overlap region (Figure 2). Consequently, to estimate to what extent transcripts in the gene pair are transcribed from the overlapping region, the JoinedOR ratio was calculated. JoinedOR is the product of the OR values of the two genes in an overlapping pair (Figure 2). OR and JoinedOR values equal to 1 indicate that all transcripts originated from the overlapping TSSs in the gene and pair, respectively. The lower the values, the smaller the subset of transcripts that originated from the overlap region. A value of 0 indicates that no expression was assigned to the overlapping TSSs.
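As a worked illustration of the two ratios, the following Python sketch computes OR and JoinedOR for a hypothetical gene pair. The function names, the inclusive coordinate convention and the example numbers are assumptions made for this sketch; it also assumes that a TSS counts as overlapping when it falls inside the region shared by the two representative gene intervals.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]         # inclusive (start, end)
TssList = List[Tuple[int, float]]  # (position, expression in ppm)


def overlap_region(a: Interval, b: Interval) -> Optional[Interval]:
    """Shared interval of two genes, or None if they do not share a base."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start <= end else None


def overlap_ratio(tss: TssList, region: Optional[Interval]) -> float:
    """OR: fraction of a gene's total expression assigned to overlapping TSSs."""
    total = sum(ppm for _, ppm in tss)
    if region is None or total == 0:
        return 0.0
    inside = sum(ppm for pos, ppm in tss if region[0] <= pos <= region[1])
    return inside / total


def joined_or(tss_a: TssList, tss_b: TssList, region: Optional[Interval]) -> float:
    """JoinedOR: product of the OR values of both genes in the pair."""
    return overlap_ratio(tss_a, region) * overlap_ratio(tss_b, region)


# Hypothetical pair in one library: gene A on the plus strand, gene B on the minus strand.
tss_a = [(1000, 30.0), (1400, 10.0)]   # distal TSS of A at 1000
tss_b = [(1200, 20.0), (900, 20.0)]    # distal TSS of B at 1200
coords_a = (1000, 5000)                # representative coordinates of A
coords_b = (100, 1200)                 # representative coordinates of B

region = overlap_region(coords_a, coords_b)
print(region)                          # (1000, 1200)
print(overlap_ratio(tss_a, region))    # 0.75
print(overlap_ratio(tss_b, region))    # 0.5
print(joined_or(tss_a, tss_b, region)) # 0.375
```

Here 30 of the 40 ppm of gene A and 20 of the 40 ppm of gene B start inside the shared region, so the OR values are 0.75 and 0.5 and their product, JoinedOR, is 0.375.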
Figure 2.
OR and JoinedOR values for the example gene pair expressed from the non-overlapping TSS in library 1 and the overlapping TSS in libraries 2 and 3. Blue and green narrow solid boxes represent the annotated coordinates of genes on the plus and minus strand, respectively. Wider light blue and green boxes indicate representative gene coordinates in a particular library, whereas arrows represent alternative transcription start sites, each labeled with its expression level normalized to parts per million (ppm).
Expression level estimation using RNA-seq data
Raw Illumina RNA-Seq paired-end reads from 26 lung adenocarcinoma samples, which were the same samples used for TSS sequencing, were downloaded from the ENA database (63), where these data are stored under accession number PRJDB2256. All reads were subjected to quality filtering using the Trimmomatic program (version 0.36) (64) with the following parameters: -phred33; ILLUMINACLIP:adapters/TruSeq3-PE.fa:2:30:10; LEADING:20; TRAILING:20; SLIDINGWINDOW:5:20; and MINLEN:50. Quality control before and after filtering was performed using FastQC (version 0.11.5) (65). Filtered reads were aligned to the human reference hg38 genome downloaded from UCSC (66) using the HISAT2 program (version 2.0.5) (67) with the --downstream-transcriptome-assembly parameter. The numbers of mapped and unmapped reads for each library are presented in Supplementary Table ST2. Next, all SAM files were sorted and converted to BAM format using SAMtools (version 1.3.1) (68). Expression levels of individual transcripts were then estimated in FPKM (fragments per kilobase of exon per million fragments mapped) using StringTie (version 1.3.1c) (69) with the -e and -B flags, guided by GENCODE (version 24) genome reference annotations (70). The expression levels of individual transcripts were summed to represent the total expression levels of genes. No minimal expression level was required in this step.
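The final summation step can be sketched in Python as follows. This is an illustration rather than the authors' code: it assumes a StringTie-style GTF in which each `transcript` line carries `gene_id` and `FPKM` attributes, and the function name and example records are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

GENE_ID = re.compile(r'gene_id "([^"]+)"')
FPKM = re.compile(r'FPKM "([^"]+)"')


def gene_level_fpkm(gtf_lines: Iterable[str]) -> Dict[str, float]:
    """Sum the FPKM of all transcript records that share a gene_id."""
    totals: Dict[str, float] = defaultdict(float)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header or comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # exon lines carry no FPKM attribute
        gene = GENE_ID.search(fields[8])
        fpkm = FPKM.search(fields[8])
        if gene and fpkm:
            totals[gene.group(1)] += float(fpkm.group(1))
    return dict(totals)


# Hypothetical records: gene G1 has two transcripts, gene G2 has one.
example = [
    '# comment line',
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\tgene_id "G1"; transcript_id "T1"; FPKM "1.5";',
    'chr1\tStringTie\texon\t100\t900\t1000\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tStringTie\ttranscript\t100\t700\t1000\t+\t.\tgene_id "G1"; transcript_id "T2"; FPKM "2.5";',
    'chr2\tStringTie\ttranscript\t50\t400\t1000\t-\t.\tgene_id "G2"; transcript_id "T3"; FPKM "0.0";',
]
print(gene_level_fpkm(example))  # {'G1': 4.0, 'G2': 0.0}
```

Because no minimal expression level is required at this step, genes with a summed FPKM of 0 are retained in the result rather than filtered out.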
Histone modifications and RNA polymerase II activity studies
Pre-aligned reads from ChIP-Seq experiments for RNA Polymerase II and seven histone modification types, including H3ac, H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3 and H3K9me3, with their controls were downloaded from the DBTSS (version 9) (60) FTP server in BED3+1 format. The additional column represents the number of tags mapped at a given position. Data were available for 26 tissue samples from patients with lung adenocarcinoma. Each library was converted to standard BED6 format and subsequently screened for peak enrichment using the MACS2 program (version 2.1.0) (71) with the --nomodel flag used for all libraries. In addition, the parameters --broad and --broad-cutoff 0.1 were used for data regarding the seven histone modifications. For the purpose of visualization in the web browser, MACS2 output peaks were converted to the bigBed format using the bedToBigBed program from UCSC (72). All BED6 files representing coordinates of mapped reads were converted to bigWig format using genomecov from the BEDTools package (version 2.25.0) (73) with the -bg flag and the bedGraphToBigWig program from UCSC (72).
Association of transcription factors with TSSs
To associate transcription factors (TFs) with particular promoters, transcription start sites were first clustered across all human and separately all mouse TSS-Seq libraries. In this step, all TSSs assigned to the same gene and located <300 bp from each other were joined in a single cluster. Next, nucleotide sequences of all clusters together with up to 500 bp in both directions were screened for the presence of transcription factor binding site (TFBS) motifs obtained from the JASPAR database (74). The search was conducted using the searchSeq function from TFBSTools (version 1.14.0) (75) with the minimal score set to 95%. All TFs potentially associated with promoters were subsequently filtered based on their expression, i.e. only TFs that were expressed in a particular library were considered as potentially regulating a certain promoter. Finally, hierarchical clustering of all potential TFs in all libraries was performed for each promoter.
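The clustering and window definition described above can be sketched in Python as shown here. The text does not state how "<300 bp from each other" was evaluated, so this sketch assumes single-linkage joining of sorted positions within one gene; the function names and the clipping of the scan window at position 1 are likewise assumptions.

```python
from typing import List, Optional, Tuple


def cluster_tss(positions: List[int], max_gap: int = 300) -> List[List[int]]:
    """Join TSSs of one gene when the gap to the previous TSS is < max_gap bp."""
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] < max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def scan_window(cluster: List[int], flank: int = 500,
                chrom_len: Optional[int] = None) -> Tuple[int, int]:
    """Cluster span extended by up to `flank` bp on both sides (1-based, inclusive)."""
    start = max(1, cluster[0] - flank)
    end = cluster[-1] + flank
    if chrom_len is not None:
        end = min(end, chrom_len)
    return start, end


# Hypothetical TSS positions of a single gene pooled across libraries.
tss_positions = [1600, 100, 1299, 350, 1000]
clusters = cluster_tss(tss_positions)
print(clusters)                          # [[100, 350], [1000, 1299], [1600]]
print([scan_window(c) for c in clusters])
# [(1, 850), (500, 1799), (1100, 2100)]
```

The sequence of each resulting window would then be the input for the motif scan; that step relies on TFBSTools and JASPAR and is not reproduced here.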
Database implementation
The OverGeneDB database was implemented in MySQL (https://www.mysql.com/). The publicly accessible interface was generated using HTML, PHP and JavaScript. The detailed overlapping gene pair view was additionally equipped with an embedded Dalliance (76) genome browser and interactive charts from plot.ly (https://plot.ly/).
DATABASE COMPOSITION AND USAGE
Data stored in OverGeneDB may be accessed via three different methods: Browse, Search or sequence similarity search. The Browse page lists all human and mouse gene pairs that overlap in at least one library. This page also provides information regarding the number of libraries in which genes in a given pair are identified as overlapping and whether both or only one gene from a pair is expressed. The Search option allows the user to specify the libraries, or the number of libraries, in which genes overlap, in which only one or neither gene of a pair is expressed, or in which both genes are expressed regardless of their overlap status. It is also possible to perform a sequence-based similarity search using the BLAST program (77,78) against the overlapping regions or the representative gene sequences in all or selected libraries.
The Search and Browse methods described above generate lists of gene pairs meeting the specified criteria. Detailed information about a given pair can be obtained by clicking on the ‘Details’ button. The overlapping gene pair view is separated into six sections displayed in tabs as follows:
Genome context: annotated genes and transcripts can be examined using a built-in Dalliance genome browser (76). Additional tracks may be selected for human and mouse libraries using a button above the browser. These tracks contain alternative TSSs and overlap regions displayed as blocks, as well as the positions of predicted TFBSs. For the 26 human lung adenocarcinoma libraries, it is also possible to display tracks of raw BAM files of mapped RNA-Seq reads, raw ChIP-Seq signals in bigWig format, and peaks for RNA Polymerase II and seven types of histone modifications.
Overlap summary table (Figure 3): this table displays detailed information about overlapping genes, including OR and JoinedOR ratios and TSS-Seq-based expression levels. Upon clicking the library name, a simple visualization of the gene overlap in the selected library is displayed, where one may also inspect the expression levels of individual TSSs.
Figure 3.
Overlap summary table for the H2AFJ and HIST4H4 gene pair. Clicking on a library name opens a small pop-up window with a schematic of the gene arrangement in the selected library.
Gene expression: this tab provides detailed information about the TSS-Seq and RNA-Seq expression levels of genes in an overlapping gene pair when available.
Detailed TSS information (Figure 4): this tab provides a visualization of the gene pair and positions of TSS clusters. For each gene in a pair, a summary table with the TSS usage is provided. The user may easily inspect the libraries in which a particular TSS was utilized and what normalized expression level was assigned to it. In addition, for each promoter, a hierarchically clustered summary of the TFs potentially responsible for the regulation of this particular promoter is displayed. The user may download this summary in PNG, SVG or Tab Separated Value (TSV) formats.
Figure 4.
Detailed TSS information for the HIST4H4 gene example. The ‘HIST4H4 gene TSS usage summary table’ panel displays the expression levels assigned to individual TSSs in various libraries. This summary lists only the libraries in which the inspected gene was expressed. The ‘TFBS summary table’ displays TFs for which binding sites were identified in the region <500 bp from a given TSS. These TFs were hierarchically clustered according to whether each TF is expressed (marked in green) or not expressed (marked in white) in a given library.
Gene summary: this tab displays basic gene information with links to cross-database references, Gene References Into Function (GeneRIF), Gene Ontology and PubMed literature references associated with NCBI Gene IDs, which were all downloaded from the NCBI Gene database (79).
Download: the user may download the most essential information associated with the selected overlapping gene pair in TSV, BED or FASTA formats.
DISCUSSION
To the best of the authors’ knowledge, OverGeneDB is the first database strictly dedicated to protein-coding genes overlapping at their 5′ end(s). The database contains 582 human and 113 mouse gene pairs that were identified as overlapping with a minimum of one TSS pair in at least one library. These gene pairs included 1150 human and 225 mouse genes, among which 14 genes overlapped with more than one other gene. A total of 4075 promoters in human and 518 promoters in mouse were assigned to these genes. The average overlapping region is 1570 bp long, and the longest overlap region, which was ∼50 kb, was reported for the gene pair RUNX2 and SUPT3H. The collected data revealed that the overlap region is rarely conserved among the studied species, organs, tissues and cell lines. Surprisingly, for up to 300 human and mouse gene pairs, a >100-bp difference in the overlap region length across libraries was identified. In the case of 159 gene pairs, this difference is >1000 bp. Moreover, in 85 human overlapping pairs, the overlap region differs not only in length but is also markedly shifted and covers a different genomic region. The HTRA2 and AUP1 gene pair serves as an example. In this gene pair, the overlapping region shifts between libraries from exclusively covering the HTRA2 annotated gene body in Adenocarcinoma VMRCLCD to an overlap that is mainly located within the AUP1 gene in Adenocarcinomas PC3 and PC14. Among all identified overlapping gene pairs, 90 human and mouse pairs were identified as always overlapping whenever both genes were expressed. The remaining 605 pairs were expressed from overlapping promoters in some libraries and from non-overlapping promoters in others. In total, 203 human and mouse gene pairs were overlapping in only one library, whereas both genes were expressed without gene overlap in almost all other libraries.
Genes often utilize multiple alternative promoters simultaneously, among which only a subset may be overlapping. Therefore, if regulation by transcriptional interference or double-stranded RNA formation occurs, only transcripts initiated within the overlap region may be subjected to regulation via these mechanisms. To assess this phenomenon, OR and JoinedOR ratios were introduced, and these values represent the fraction of the expression initiated within the overlap region in a gene and a gene pair, respectively. In 57% of human and 44% of mouse overlap events, transcription started exclusively in the overlap regions, as reflected by a JoinedOR value equal to one. In the remaining cases, transcription was initiated both from overlapping and non-overlapping TSSs by at least one of the genes in a pair. The human genes FBXL15 and PSD, which overlap in brain tissue among others, serve as a good example. Both genes simultaneously utilize two alternative TSSs, among which only the distal TSSs are overlapping. Assuming an equal expression level of both genes and utilizing the JoinedOR ratio, <13% of FBXL15 and PSD gene transcripts were estimated to be possibly subjected to overlap-related regulation at the transcriptional or post-transcriptional levels.
Studies of the overlapping gene pairs in OverGeneDB were strengthened by the large-scale in silico prediction of TFBSs within the overlapping gene promoter regions. Moreover, overlapping gene expression levels in 26 lung adenocarcinoma samples were independently estimated based on RNA-Seq data. Finally, the same 26 tissue samples were also studied using ChIP-Seq results for RNA Polymerase II activity and seven histone modifications, which exhibit significant potential in overlapping promoter studies (44). Taken together, these features make OverGeneDB a valuable source of data for anyone interested in antisense transcription, the regulation of promoter usage, and the mechanisms of gene expression regulation.
AVAILABILITY
The OverGeneDB database is freely available at http://overgenedb.amu.edu.pl. Scripts for automated identification of overlapping gene pairs are available at https://github.com/forrest1988/OverGeneDB.