Literature DB >> 30624654

RAEdb: a database of enhancers identified by high-throughput reporter assays.

Zena Cai¹, Ya Cui^2,3, Zhiying Tan⁴, Gaihua Zhang⁵, Zhongyang Tan¹, Xinlei Zhang⁶, Yousong Peng¹.

Abstract

High-throughput reporter assays have been recently developed to directly and quantitatively assess enhancer activity for thousands of regulatory elements. However, there is still no database to collect these enhancers. We developed RAEdb, the first database to collect enhancers identified by high-throughput reporter assays. RAEdb includes 538 320 enhancers derived from eight studies, most of which were from six human cell lines. An activity score was assigned to each enhancer based on reporter assays. Based on these enhancers, 7658 epromoters (promoters with enhancer activity) were identified and stored in the database. RAEdb provides two ways of searches: the first is to search studies by species and cell line; the other is to search enhancers or epromoters by position, activity score, sequence and gene. RAEdb also provides a genome browser to query, visualize and compare enhancers. All data in RAEdb is freely available for download.

Entities: CellLine Chemical Disease Gene Mutation Species

Mesh：

Year: 2019 PMID： 30624654 PMCID： PMC6323319 DOI： 10.1093/database/bay140

Source DB: PubMed Journal: Database (Oxford) ISSN： 1758-0463 Impact factor: 3.451

Introduction

The enhancer is a short (50~1500 bps) region of DNA sequence that recruits transcription factors to regulate the transcription of target genes in a cell-type-specific manner (1). Enhancers play a central role in regulating a wide range of important biological functions and processes, such as embryogenesis, development and homeostasis (1–3). They were reported to be associated with many human complex diseases (4). There are several databases collecting enhancers, including EnhancerAtlas (5), VISTA Enhancer Browser (6), FANTOM5 (7), DENdb (8), dbSUPER (9) and SEA (10). Although millions of enhancers are collected in these databases, lots of enhancers may be `false positives’ because most of them were inferred by indirect methods that require further reporter assays to determine enhancer activity (11, 12). Moreover, most enhancers in previous databases lack of quantitative activity scores (11, 12), which hinders computational modeling of gene regulations. In recent years, high-throughput reporter assays including STARR-seq (self-transcribing active regulatory region sequencing) (11, 12) and MPRA (massively parallel reporter assay) (13) have been developed to directly and quantitatively assess enhancer activity for thousands of regulatory elements. For example, Arnold et al. (11) identified thousands of cell-type-specific enhancers across a broad continuum of strengths and created a genome-wide quantitative enhancer map in Drosophila by STARR-seq. Despite of widespread applications of these new methods in identifying and characterizing enhancers, there is still a lack of database collecting these enhancers. Herein, we present RAEdb, the first database for hosting, analyzing and visualizing enhancers identified by high-throughput reporter assays. Besides the enhancers, thousands of epromoters were also inferred based on these enhancers, and were stored in the database. Workflow for enhancer and epromoter identification.

Materials and methods

Data sources

STARR-seq and MPRA data were collected from NCBI Gene Expression Omnibus (GEO) database (14) and the European Nucleotide Archive (15). We used a set of keywords (`STARR-seq’, `STARRseq’, `Self-transcribing active regulatory region sequencing’, ‘MPRA’ and `Massively parallel reporter assay’) to search these databases and screened the datasets manually. After manual check, four studies were collected for both STARR-seq and MPRA, respectively. As a comparison, several H3K27ac and H3K4me1 ChIP-seq datasets in some common cell lines were also downloaded from the ENCODE Consortium (16), and were displayed on genome browser derived from UCSC Genome Browser database (17).

Enhancer peak calling from STARR-seq data

The workflow for calling enhancer peak from STARR-seq data were shown in Figure 1. Sequencing reads were aligned to human genome hg38 by Hisat2 (18). To exclude potential PCR amplification bias, fragments (inferred from paired-end reads) that have the same start and end positions were collapsed into distinct fragments by SAMtools (19). An enhancer peak is called when there is significant enrichment of fragments from one region in output library than the representation of that region in input library based on Poisson distribution using MACS2 (false discovery rate < 0.05) (20). The genome coverage of the plasmid library was used as input when calculating the enrichment of STARR-seq reads (11). We took the enrichment score reported by MACS2 as enhancer activity.

Figure 1

Workflow for enhancer and epromoter identification.

Enhancer identification from MPRA data

The workflow for identifying enhancers from MPRA data was adapted from previous studies (21, 22), which was shown in Figure 1. To infer the barcode copy numbers generated for each sample, all sequence reads were examined, regardless of their quality scores. If the first N (N equal to the length of barcode) nucleotides of a read perfectly matched any of the barcode, this was counted as one occurrence of that barcode. The mRNA and plasmid counts for each barcode were then normalized by the size of cDNA and plasmid library, respectively, to facilitate comparing counts across samples with different sequencing depths. The library size was calculated as the total number of reads in each sample. For samples in multiple independent replications, the median of normalized barcode numbers were used. Finally, a ratio of mRNA to plasmid barcode counts was calculated to measure enhancer activity of the sequence linked to the barcode. If multiple different barcodes were linked to the same sequence, the average ratio was used. Only the sequences with the ratio greater than 1 were considered as enhancers.

Epromoter identification

We defined the region of 500 bps upstream and downstream of the Transcription Start Site as promoter based on the Ensembl Genome Browser (23). The promoters overlapping with one or more enhancers were considered as epromoters (promoters with enhancer activity) with the help of bedtools (version 2.27.0) (24, 25).

Implementation

An open source platform, LAMP (Linux, Apache, MySQL and PHP), is used to implement RAEdb database. The JavaScript Object Notation format is employed to implement the communication between the client-side and server-side layer. The JavaScript library JQuery and the web framework bootstrap were employed for producing dynamic and interactive data visualization in the web interface. In addition, DataTable library was applied to construct tables in web pages. In addition, we integrated UCSC Genome Browser (17) in our database for visualizing enhancers. The annotation of genes and references are hosted in the genome browser.

Results

Data summary

RAEdb contains 538 320 enhancers derived from eight studies; 55 415 and 482 905 of these enhancers were derived from data generated by STARR-seq and MPRA method (Figure 2A and B), respectively. All the enhancers generated by STARR-seq belong to four human cell lines. Over three-fourth of them were found in HeLaS3 cell line, then the LNCaP, K562 and HeLa cell line. The enhancers generated by MPRA mostly belong to two human cell lines, i.e. HepG2 and K562. In addition, there are a few enhancers belonging to a human cell line (Fibroblasts) and a mouse cell line (C2C12). Besides for enhancers, RAEdb also includes 7658 epromoters; 4673 and 2985 epromoters were identified from data generated by SATRR-seq and MPRA method, respectively. The cell line distribution of epromoters is similar to that for enhancers (Figure 2C and D).

Figure 2

Data distribution of enhancers and epromoters by cell line and by methods.

Usage and access

The RAEdb mainly includes Home, Browse, Genome Browser, Search, Download, Help and Contact Us pages (Figure 3).

Figure 3

Structure of RAEdb database.

Structure of RAEdb database. A systemic lupus erythematosus associated enhancer displayed in genome browser of RAEdb. It shows the activity measured by STARR-seq (two upper panels, from the dataset of GSE100423 in NCBI GEO database) and histone modification (two bottom panels, from the dataset of GSE29611 in NCBI GEO database) methods. The enhancer peaks called based on STARR-seq data were highlighted in red bars. The red perpendicular line refers to the SNP (rs2230926) associated with systemic lupus erythematosus.

Browse

All enhancers and epromoters were organized by sample in studies. For each study, this page displays the meta-information of the study, including title and abstract, cell lines, treatment for samples, accession number of data in NCBI GEO database and all enhancers or epromoters identified in the study. All the enhancers or epromoters from a sample are displayed in a table. For each enhancer, RAEdb displays the start and end positions of it on the chromosome, the activity score, and its sequences; for each epromoter, besides the start and end positions of it and the sequence, RAEdb also displays the name and id of its target gene.

Genome Browser

To explore the distribution of enhancers on the human or mouse genome, RAEdb provides a genome browser to query and visualize all enhancers collected in the database. As a comparison, enhancers that were identified by histone modification methods were also presented for some common cell lines (such as K562 and HeLaS3) in the genome browser. Users can select multiple samples simultaneously for comparing the distribution of enhancers identified in different cell lines or by different methods.

Search

RAEdb provides two ways to search the database. The first is to search studies by keywords, including data type (enhancer or epromoter), species, cell line and study name. The output is directed to the selected studies in the Browse page. The second is to search enhancers or epromoters by start and end positions, activity score, sequence and target gene (for epromoters) in the search box of the table in the Browse page.

Download

All data in RAEdb is freely available for download in either fasta or bed format. They are organized by study, method, species and cell line.

A case study

Here we show how to explore enhancers of interest. The enhancer ranging from 137 869 065 to 137 884 663 on chromosome 6 was reported to be associated with systemic lupus erythematosus (4), which is an autoimmune disease in which the immune system mistakenly attacks healthy tissues. We want to know the activity of this enhancer in different cell lines. On the Genome Browser page, we could search the region of this enhancer. The output would display the activity score of this enhancer in multiple cell lines identified in multiple studies. It shows that this enhancer shows activity only in HeLaS3 cell line (Figure 4). The STARR-seq method identified two adjoining regions (137 872 975–137 873 278 and 137 873 347–137 875 412) of 2.3 kb (highlighted in red rectangle) with enhancer activity within the enhancer. Moreover, a single-nucleotide polymorphism (SNP) (marked by a red perpendicular line) (4) associated with the systemic lupus erythematosus is also located within this small region. As a comparison, the histone modification method presented multiple regions with enhancer activity (two bottom panels), which makes it difficult to identify the precise region acting as an enhancer.

Figure 4

A systemic lupus erythematosus associated enhancer displayed in genome browser of RAEdb. It shows the activity measured by STARR-seq (two upper panels, from the dataset of GSE100423 in NCBI GEO database) and histone modification (two bottom panels, from the dataset of GSE29611 in NCBI GEO database) methods. The enhancer peaks called based on STARR-seq data were highlighted in red bars. The red perpendicular line refers to the SNP (rs2230926) associated with systemic lupus erythematosus.

Discussion

The high-throughput reporter assays, especially the STARR-seq method, were reported to be more accurate and quantitative in identifying enhancers than methods developed in previous studies. The number of studies using STARR-seq technique has been growing significantly in the recent years. There is an urgent need for a database collecting enhancers identified in these studies. Here we present RAEdb, the first database for hosting and analyzing these enhancers. The main functions of RAEdb include searching studies and enhancers/epromoters of interest and visualizing enhancers in a genome browser. Owing to the increasing interest in enhancers, the STARR-seq technology will be applied to a broader set of species, cell lines and conditions, and more data will be released in future. RAEdb will be updated in a timely manner (every season) with new released data from public studies. Continuous efforts would be devoted to improve the database. It could help much in understanding enhancers for both experimental and computational biologists.

23 in total

1. The UCSC Genome Browser Database.

Authors: D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

2. EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types.

Authors: Tianshun Gao; Bing He; Sheng Liu; Heng Zhu; Kai Tan; Jiang Qian
Journal: Bioinformatics Date: 2016-08-10 Impact factor: 6.937

Review 3. Decoding enhancers using massively parallel reporter assays.

Authors: Fumitaka Inoue; Nadav Ahituv
Journal: Genomics Date: 2015-06-10 Impact factor: 5.736

4. Transcriptional enhancers in animal development and evolution.

Authors: Mike Levine
Journal: Curr Biol Date: 2010-09-14 Impact factor: 10.834

5. SEA: a super-enhancer archive.

Authors: Yanjun Wei; Shumei Zhang; Shipeng Shang; Bin Zhang; Song Li; Xinyu Wang; Fang Wang; Jianzhong Su; Qiong Wu; Hongbo Liu; Yan Zhang
Journal: Nucleic Acids Res Date: 2015-11-17 Impact factor: 16.971

6. Systematic dissection and optimization of inducible enhancers in human cells using a massively parallel reporter assay.

Authors: Alexandre Melnikov; Anand Murugan; Xiaolan Zhang; Tiberiu Tesileanu; Li Wang; Peter Rogov; Soheil Feizi; Andreas Gnirke; Curtis G Callan; Justin B Kinney; Manolis Kellis; Eric S Lander; Tarjei S Mikkelsen
Journal: Nat Biotechnol Date: 2012-02-26 Impact factor: 54.908

7. Gateways to the FANTOM5 promoter level mammalian expression atlas.

Authors: Marina Lizio; Jayson Harshbarger; Hisashi Shimoji; Jessica Severin; Takeya Kasukawa; Serkan Sahin; Imad Abugessaisa; Shiro Fukuda; Fumi Hori; Sachi Ishikawa-Kato; Christopher J Mungall; Erik Arner; J Kenneth Baillie; Nicolas Bertin; Hidemasa Bono; Michiel de Hoon; Alexander D Diehl; Emmanuel Dimont; Tom C Freeman; Kaori Fujieda; Winston Hide; Rajaram Kaliyaperumal; Toshiaki Katayama; Timo Lassmann; Terrence F Meehan; Koro Nishikata; Hiromasa Ono; Michael Rehli; Albin Sandelin; Erik A Schultes; Peter A C 't Hoen; Zuotian Tatum; Mark Thompson; Tetsuro Toyoda; Derek W Wright; Carsten O Daub; Masayoshi Itoh; Piero Carninci; Yoshihide Hayashizaki; Alistair R R Forrest; Hideya Kawaji
Journal: Genome Biol Date: 2015-01-05 Impact factor: 13.583

8. dbSUPER: a database of super-enhancers in mouse and human genome.

Authors: Aziz Khan; Xuegong Zhang
Journal: Nucleic Acids Res Date: 2015-10-04 Impact factor: 16.971

9. Archiving next generation sequencing data.

Authors: Martin Shumway; Guy Cochrane; Hideaki Sugawara
Journal: Nucleic Acids Res Date: 2009-12-03 Impact factor: 16.971

10. Ensembl 2018.

Authors: Daniel R Zerbino; Premanand Achuthan; Wasiu Akanni; M Ridwan Amode; Daniel Barrell; Jyothish Bhai; Konstantinos Billis; Carla Cummins; Astrid Gall; Carlos García Girón; Laurent Gil; Leo Gordon; Leanne Haggerty; Erin Haskell; Thibaut Hourlier; Osagie G Izuogu; Sophie H Janacek; Thomas Juettemann; Jimmy Kiang To; Matthew R Laird; Ilias Lavidas; Zhicheng Liu; Jane E Loveland; Thomas Maurel; William McLaren; Benjamin Moore; Jonathan Mudge; Daniel N Murphy; Victoria Newman; Michael Nuhn; Denye Ogeh; Chuang Kee Ong; Anne Parker; Mateus Patricio; Harpreet Singh Riat; Helen Schuilenburg; Dan Sheppard; Helen Sparrow; Kieron Taylor; Anja Thormann; Alessandro Vullo; Brandon Walts; Amonida Zadissa; Adam Frankish; Sarah E Hunt; Myrto Kostadima; Nicholas Langridge; Fergal J Martin; Matthieu Muffato; Emily Perry; Magali Ruffier; Dan M Staines; Stephen J Trevanion; Bronwen L Aken; Fiona Cunningham; Andrew Yates; Paul Flicek
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

7 in total

1. EnhancerAtlas 2.0: an updated resource with enhancer annotation in 586 tissue/cell types across nine species.

Authors: Tianshun Gao; Jiang Qian
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

2. EpiRegio: analysis and retrieval of regulatory elements linked to genes.

Authors: Nina Baumgarten; Dennis Hecker; Sivarajan Karunanithi; Florian Schmidt; Markus List; Marcel H Schulz
Journal: Nucleic Acids Res Date: 2020-07-02 Impact factor: 16.971

3. scEnhancer: a single-cell enhancer resource with annotation across hundreds of tissue/cell types in three species.

Authors: Tianshun Gao; Zilong Zheng; Yihang Pan; Chengming Zhu; Fuxin Wei; Jinqiu Yuan; Rui Sun; Shuo Fang; Nan Wang; Yang Zhou; Jiang Qian
Journal: Nucleic Acids Res Date: 2022-01-07 Impact factor: 16.971

4. PCRMS: a database of predicted cis-regulatory modules and constituent transcription factor binding sites in genomes.

Authors: Pengyu Ni; Zhengchang Su
Journal: Database (Oxford) Date: 2022-04-22 Impact factor: 4.462

5. Analysis of the landscape of human enhancer sequences in biological databases.

Authors: Juan Mulero Hernández; Jesualdo Tomás Fernández-Breis
Journal: Comput Struct Biotechnol J Date: 2022-05-30 Impact factor: 6.155

Review 6. Computational Biology Solutions to Identify Enhancers-target Gene Pairs.

Authors: Judith Mary Hariprakash; Francesco Ferrari
Journal: Comput Struct Biotechnol J Date: 2019-06-14 Impact factor: 7.271

7. ENdb: a manually curated database of experimentally supported enhancers for human and mouse.

Authors: Xuefeng Bai; Shanshan Shi; Bo Ai; Yong Jiang; Yuejuan Liu; Xiaole Han; Mingcong Xu; Qi Pan; Fan Wang; Qiuyu Wang; Jian Zhang; Xuecang Li; Chenchen Feng; Yanyu Li; Yuezhu Wang; Yiwei Song; Ke Feng; Chunquan Li
Journal: Nucleic Acids Res Date: 2020-01-08 Impact factor: 16.971

7 in total