MPromDb update 2010: an integrated resource for annotation and visualization of mammalian gene promoters and ChIP-seq experimental data.

Ravi Gupta, Anirban Bhattacharyya, Francisco J Agosto-Perez, Priyankara Wickramasinghe, Ramana V Davuluri.

Abstract

MPromDb (Mammalian Promoter Database) is a curated database that strives to annotate gene promoters identified from ChIP-seq results, with the goal of providing an integrated resource for mammalian transcriptional regulation and epigenetics. We analyzed 507 million uniquely aligned RNAP-II ChIP-seq reads from 26 different data sets that include six human cell types and 10 distinct mouse cell/tissue types. The updated MPromDb version consists of computationally predicted (novel) and known active RNAP-II promoters (42,893 human and 48,366 mouse promoters) from various data sets freely available at the NCBI GEO database. We found that 36% and 40% of protein-coding genes have alternative promoters in the human and mouse genomes, respectively, and that ∼40% of promoters are tissue/cell specific. The identified RNAP-II promoters were annotated using various known and novel gene models. Additionally, for novel promoters we looked for other supporting evidence: GenBank mRNAs, spliced ESTs, CAGE promoter tags and mRNA-seq reads. Users can search the database by gene id/symbol or by specific tissue/cell type, and filter results based on any combination of tissue/cell specificity, Known/Novel, CpG/NonCpG and protein-coding/non-coding gene promoters. We have also integrated the GBrowse genome browser with MPromDb for visualization of ChIP-seq profiles and to display the annotations. The current release of MPromDb can be accessed at http://bioinformatics.wistar.upenn.edu/MPromDb/.

Year: 2010; PMID: 21097880; PMCID: PMC3013732; DOI: 10.1093/nar/gkq1171

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971


INTRODUCTION

The mammalian transcriptome and proteome are far more diverse than expected from the one gene→one mRNA→one protein paradigm (1). This diversity arises from the generation of multiple transcripts from a gene through alternative transcriptional and splicing events. Alternative transcriptional events that involve the use of multiple promoters and/or transcriptional termination sites result in multiple pre-mRNAs from the same gene, which can further undergo alternative splicing to generate a plethora of transcript variants corresponding to a single gene (2). Therefore, a gene can yield transcript variants that differ in their regulatory UTRs and/or protein-coding regions, thereby expanding the complexity of mammalian genomes (3–5). In particular, alternative promoter activity is critical in transcriptional regulation, as the precise utilization of alternative promoters allows the balanced expression of the corresponding pre-mRNA variants in different cell and/or developmental contexts. In fact, recent evidence suggests that at least half of mammalian genes use alternative promoters to generate multiple transcript variants (3,5). Identifying all possible gene promoters, their usage and their epigenetic modification states in specific cell populations, tissues, developmental stages and disease conditions is therefore critical to understanding the diversity of physiological processes associated with normal and diseased states. Several high-throughput technologies, such as cap analysis gene expression (CAGE), chromatin immunoprecipitation (ChIP) followed by microarray analysis (ChIP–chip) (6,7) and, more recently, ChIP coupled with sequencing (ChIP-seq) (8) and sequencing of cDNAs (RNA-seq) (5), are enabling the genome-wide identification of alternative promoters and their patterns of use. However, these high-throughput approaches need to be applied with caution because of the inherent problems with each method (9).
In our recent studies, we have shown that a combination of ChIP-seq and computational techniques provides a better approach to annotating active promoters (9,10). Although the EPD database (11) provides curated promoter sequences for eukaryotic organisms, it does not provide promoter activity information at the tissue/cell level. In this update of MPromDb we have removed the ChIP–chip results and added active RNAP-II promoters identified from ChIP-seq experiments performed with an RNAP-II antibody on six different human cell types and 10 different mouse cell/tissue types. In addition, we have added enrichment profiles of various transcription factors obtained from ChIP-seq data sets. These promoters, along with their annotations, are provided as a user-friendly database, where each known and ChIP-seq promoter is linked to a new interface for visualization of its enrichment profile. Here, we describe the updates to MPromDb, which enable users to study promoter activity at the tissue/cell level for the human and mouse genomes.

NEW FEATURES

Statistics of the promoters identified using ChIP-seq data sets

In this update, we have added (i) a comprehensive knowledgebase of known and novel promoters, (ii) promoters identified from RNAP-II ChIP-seq experiments, (iii) advanced search and filter options and (iv) visualization of ChIP-seq profiles and promoters using GBrowse (12). The comprehensive promoter knowledgebase was generated from various known gene models (RefSeq, Vega, Ensembl, MGI and UCSC Known genes), predicted gene models (AceView, Tromer, MGC, SGP, SIB, Genscan, Geneid, N-SCAN and Augustus Abinitio), an orthologous gene model (XenoRef), GenBank mRNAs, spliced ESTs, CAGE promoters and mRNA-seq tags (Figure 1). The gene models, mRNAs and spliced ESTs were downloaded from the UCSC Genome Browser database (13), CAGE promoter locations were downloaded from the FANTOM4 project (14) and mRNA-seq raw reads were downloaded from the NCBI GEO database. We have also added promoter regions of a recently discovered class of non-coding genes (lincRNA) transcribed by RNAP-II (15,16). The total number of records in the knowledgebase can be found in Table S1.
Figure 1.

The block diagram and workflow of updated MPromDb database. Deep sequencing datasets were downloaded from NCBI GEO server and processed by our analysis and annotation pipeline. The identified promoters are deposited in MPromDb tables. Novel promoters are compared to various existing experimental and predicted gene promoter regions and status of novel promoters is deposited in the relational tables. The database is accessed through a user-friendly webpage. The database is integrated with open source genome browser (GBrowse) to visualize the promoter and various ChIP-seq enrichment profiles.

The RNAP-II ChIP-seq data sets include the data generated at our lab (9) and data sets from various published and unpublished studies freely available at the NCBI GEO database. The human RNAP-II ChIP-seq data sets include six different cell lines: CD4+ T, HeLa S3, K562, NB4, Lymphoblastoid and Jurkat, whereas the mouse samples include five different tissues and five different cell types: brain, liver, lung, spleen, kidney, Embryonic Stem Cell (V6.5), Mouse Embryonic Fibroblasts B4, Mouse Embryonic Fibroblasts B6, Bone Marrow-derived macrophages and 3T3-L1 (9,17–23). The NCBI GEO accession numbers of the data sets are provided in Table S2.
On the downloaded ChIP-seq data sets, we applied our pipeline (Figure 1), which includes alignment, identification of significantly enriched regions, promoter prediction and annotation. The Bowtie program (24) was used to map reads to the reference genome (mm9 for mouse and hg18 for human), allowing up to two mismatches. Only uniquely mapped reads were considered for further analysis. We obtained 174 777 943 and 333 192 049 uniquely mapped reads for the mouse and human genomes, respectively (Table S3). Significant peaks were identified using our three-step procedure as described in (9) at P-value = 0.01. After identification of significant RNAP-II bound peaks, we applied our recently published program for prediction of RNAP-II bound promoters (10).
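As an illustration of the alignment step, the following minimal sketch builds a Bowtie 1 command line that keeps only reads with at most two mismatches and a single reportable alignment. The exact options used for MPromDb are not stated in the text, so the choice of `-v 2` and `-m 1`, the function name `bowtie_unique_cmd`, the thread count and the file names are all assumptions made for illustration.

```python
from typing import List


def bowtie_unique_cmd(index_base: str, reads_fastq: str, out_sam: str,
                      max_mismatches: int = 2, threads: int = 4) -> List[str]:
    """Build a Bowtie 1 command for unique alignments with few mismatches.

    Assumption: the options below are one plausible way to express
    "up to two mismatches, uniquely mapped reads only"; the paper does
    not list the options that were actually used.
    """
    return [
        "bowtie",
        "-v", str(max_mismatches),  # allow at most this many mismatches per read
        "-m", "1",                  # drop reads with more than one reportable alignment
        "-S",                       # write alignments in SAM format
        "-p", str(threads),         # number of alignment threads
        index_base,                 # prebuilt index base name (e.g. for mm9 or hg18)
        reads_fastq,                # input reads (hypothetical file name)
        out_sam,                    # output file (hypothetical file name)
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Bowtie installed.
    print(" ".join(bowtie_unique_cmd("mm9", "sample.fastq", "sample.sam")))
```

The list form can be passed directly to `subprocess.run` on a machine where Bowtie and the corresponding genome index are available.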
The peak identification and promoter prediction results for each sample are summarized in Table S3. Following promoter prediction, we performed promoter annotation using our reference promoter knowledgebase, as summarized in Figures S1 and S2. A predicted ChIP-seq promoter is defined as a ‘Known promoter’ if it lies within −1 to +0.5 kb of a known TSS or within the first exon of a known transcript; otherwise it is considered a ‘Novel promoter’. In total, we identified 48 366 mouse and 42 893 human promoters bound by RNAP-II, of which 39% and 42%, respectively, were annotated as ‘Novel promoters’ (Table 1). It is worth noting that 65% and 90% of novel promoters in mouse and human, respectively, are supported by additional sources (novel gene models, mRNAs, spliced ESTs, CAGE tags and the orthologous gene model) (Table S4).
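The Known/Novel rule can be sketched as a small classifier. This is only an illustration under stated assumptions: the predicted promoter is reduced to a single genomic position, the −1 to +0.5 kb window is taken to be strand-aware, and the names `KnownTranscript` and `classify_promoter` are hypothetical rather than part of the MPromDb pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

UPSTREAM_BP = 1000    # -1 kb relative to the TSS
DOWNSTREAM_BP = 500   # +0.5 kb relative to the TSS


@dataclass(frozen=True)
class KnownTranscript:
    chrom: str
    strand: str                  # "+" or "-"
    tss: int                     # transcription start site coordinate
    first_exon: Tuple[int, int]  # (start, end), inclusive, start <= end


def classify_promoter(chrom: str, pos: int,
                      transcripts: Iterable[KnownTranscript]) -> str:
    """Return 'Known' if pos falls in the TSS window or first exon of any transcript."""
    for t in transcripts:
        if t.chrom != chrom:
            continue
        # Signed distance from the TSS, measured in the direction of transcription.
        offset = pos - t.tss if t.strand == "+" else t.tss - pos
        near_tss = -UPSTREAM_BP <= offset <= DOWNSTREAM_BP
        in_first_exon = t.first_exon[0] <= pos <= t.first_exon[1]
        if near_tss or in_first_exon:
            return "Known"
    return "Novel"


if __name__ == "__main__":
    # Toy transcripts with made-up coordinates.
    models = [
        KnownTranscript("chr1", "+", 10_000, (10_000, 10_800)),
        KnownTranscript("chr2", "-", 5_000, (4_200, 5_000)),
    ]
    for chrom, pos in [("chr1", 9_200), ("chr1", 12_000), ("chr2", 6_100)]:
        print(chrom, pos, classify_promoter(chrom, pos, models))
```

A real implementation would work on promoter intervals and use an interval index over all gene models instead of a linear scan.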
Table 1.

Summary of RNAP-II bound promoters identified in various tissues/cell types for human and mouse using ChIP-seq data sets

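| Species | Tissue/cell type | No. of known promoters | No. of novel promoters | No. of tissue/cell-specific promoters | No. of CpG promoters | No. of bidirectional promoters | No. of total promoters |
|---|---|---|---|---|---|---|---|
| Mouse | Brain | 15 948 | 5270 | 3978 | 13 864 | 1373 | 21 218 |
| Mouse | Liver | 12 319 | 3189 | 1642 | 10 421 | 1250 | 15 508 |
| Mouse | Kidney | 15 059 | 4632 | 1995 | 12 879 | 1348 | 19 691 |
| Mouse | Spleen | 9089 | 2121 | 806 | 8273 | 1067 | 11 210 |
| Mouse | Lung | 15 373 | 5142 | 1935 | 13 986 | 1374 | 20 515 |
| Mouse | Embryonic stem cell (V6.5) | 11 895 | 2880 | 2745 | 12 063 | 1314 | 14 775 |
| Mouse | Mouse embryonic fibroblasts B4 | 10 558 | 2261 | 273 | 10 898 | 1241 | 12 819 |
| Mouse | Mouse embryonic fibroblasts B6 | 11 887 | 2761 | 706 | 10 886 | 1237 | 14 648 |
| Mouse | Bone marrow-derived macrophages (untreated) | 13 320 | 3977 | 870 | 12 038 | 1298 | 17 297 |
| Mouse | Bone marrow-derived macrophages (2 h) | 12 647 | 3713 | 566 | 11 846 | 1294 | 16 260 |
| Mouse | Bone marrow-derived macrophages (4 h) | 13 119 | 4041 | 688 | 11 926 | 1292 | 17 160 |
| Mouse | 3T3-L1 cells (untreated) | 8489 | 1373 | 113 | 8597 | 1038 | 9862 |
| Mouse | 3T3-L1 cells (Day 1) | 8684 | 1626 | 154 | 8803 | 1072 | 9310 |
| Mouse | 3T3-L1 cells (Day 2) | 8508 | 1593 | 174 | 8415 | 1042 | 10 101 |
| Mouse | 3T3-L1 cells (Day 3) | 8374 | 1540 | 136 | 8371 | 1035 | 9914 |
| Mouse | 3T3-L1 cells (Day 4) | 6976 | 1422 | 194 | 6793 | 848 | 8398 |
| Mouse | 3T3-L1 cells (Day 6) | 4039 | 1443 | 927 | 4030 | 511 | 5482 |
| Mouse | Total | 29 517 | 18 849 | 17 902 | 24 587 | 1801 | 48 366 |
| Human | Jurkat cells | 7417 | 1403 | 541 | 7653 | 792 | 8820 |
| Human | K562 cells | 16 410 | 8012 | 6422 | 16 918 | 1320 | 24 422 |
| Human | Lymphoblastoid cells | 19 617 | 8998 | 6629 | 20 682 | 1311 | 28 615 |
| Human | NB4 cells | 12 925 | 2944 | 916 | 13 650 | 1156 | 15 869 |
| Human | HeLa_S3 cells | 13 982 | 3502 | 2101 | 14 812 | 1212 | 17 484 |
| Human | CD4+ T cells | 14 329 | 4137 | 1336 | 15 740 | 1220 | 18 466 |
| Human | CD4+ T cells (2 h) | 7354 | 1267 | 174 | 7955 | 882 | 8621 |
| Human | CD4+ T cells (12 h) | 11 470 | 2389 | 377 | 12 740 | 1172 | 13 859 |
| Human | Total | 24 967 | 17 926 | 18 496 | 27 488 | 1501 | 42 893 |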
Furthermore, our analysis identified promoters for 15 493 and 14 266 protein-coding genes in mouse and human, respectively. A gene is defined as protein coding if it has at least one protein-coding transcript in the RefSeq/Vega gene models; otherwise it is a non-coding gene. Note that a protein-coding gene can generate transcript variants that are non-coding RNAs. We also observed that 40% and 36% of protein-coding genes in mouse and human, respectively, are expressed from alternative promoters (Table 2). Surprisingly, 37% of mouse promoters and 43% of human promoters were identified in a single cell/tissue type, suggesting that they are cell/tissue-specific promoters. Additionally, we analyzed the CpG-richness and bidirectionality of the promoters and found that 51% and 64% of promoters are CpG-rich, and that there are 1801 and 1501 bidirectional promoters, in mouse and human, respectively.
We also provide significant enrichment profiles of various factors (Mouse: OCT4, CEBPa, CHD7, c-Myc, CTCF, ESRRB, FOXA1, FOXA2, GFP, KLF4, n-Myc, NR5A2, P300, Rbbp5, SETDB1, SIRT1, SOX2, STAT3, STAT4, STAT6, SUZ12, TBP, TBX3, TCFCP2I1, WDR5, ZFX; Human: OCT4, CBP, CTCF, ETS1, KLF4, NANOG, P300, PCAF, PHF8, PPARG, RUNX, SOX2, STAT1, TFII, Tip60, ZNF263, SUZ12, MOF, IGF1R, NFkB) calculated from different published and unpublished ChIP-seq data sets (Table S5A and B).
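The tissue/cell-specificity criterion (a promoter identified in a single cell/tissue type) and the quoted percentages can be sketched as follows. The promoter identifiers and the dictionary layout are hypothetical; the counts are the species totals reported in Table 1.

```python
from typing import Dict, Set


def tissue_specific(promoter_tissues: Dict[str, Set[str]]) -> Set[str]:
    """Return the promoters detected in exactly one tissue/cell type."""
    return {pid for pid, tissues in promoter_tissues.items() if len(tissues) == 1}


def percent(part: int, whole: int) -> float:
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # Toy input with made-up promoter ids.
    toy = {"P1": {"brain"}, "P2": {"brain", "liver"}, "P3": {"spleen"}}
    print(sorted(tissue_specific(toy)))

    # Species totals from Table 1: (total, tissue/cell-specific, CpG).
    totals = {"mouse": (48366, 17902, 24587), "human": (42893, 18496, 27488)}
    for species, (total, specific, cpg) in totals.items():
        print(species,
              f"specific={percent(specific, total):.1f}%",
              f"CpG={percent(cpg, total):.1f}%")
```

Rounded to whole numbers, these ratios give the 37%/43% tissue-specific and 51%/64% CpG-rich figures quoted in the text.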
Table 2.

Alternative promoter usage for active protein-coding genes in mouse and human

| Protein-coding genes | Mouse (%) | Human (%) |
|---|---|---|
| 1-promoter genes | 9290 (60) | 9051 (63.44) |
| 2-promoter genes | 3490 (22.5) | 3192 (22.37) |
| ≥3-promoter genes | 2707 (17.5) | 2023 (14.18) |
| Total | 15 493 | 14 266 |
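The binning behind Table 2 can be illustrated with a short sketch that counts promoters per gene and derives the share of genes with alternative promoters. The function names and the toy gene labels are hypothetical; the numeric check uses the counts reported in Table 2.

```python
from collections import Counter
from typing import Dict, Iterable


def promoter_usage(gene_of_each_promoter: Iterable[str]) -> Dict[str, int]:
    """Bin genes by promoter count: exactly one, exactly two, or three and more."""
    per_gene = Counter(gene_of_each_promoter)
    bins: Counter = Counter()
    for n in per_gene.values():
        label = "1" if n == 1 else "2" if n == 2 else ">=3"
        bins[label] += 1
    return dict(bins)


def alternative_share(two: int, three_plus: int, total_genes: int) -> float:
    """Percentage of genes that have more than one promoter."""
    return 100.0 * (two + three_plus) / total_genes


if __name__ == "__main__":
    # One entry per promoter, naming its gene (toy labels).
    toy = ["geneA", "geneB", "geneB", "geneC", "geneC", "geneC", "geneC"]
    print(promoter_usage(toy))

    # Counts from Table 2.
    print(f"mouse: {alternative_share(3490, 2707, 15493):.2f}%")
    print(f"human: {alternative_share(3192, 2023, 14266):.2f}%")
```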

Database search and visualization

MPromDb as a web-based application has several layers: the core application (designed in Django), a backend database (MySQL), a visualization component (GBrowse) and a web server (Apache) (see Supplementary File 1). The promoter information corresponding to a particular gene can be retrieved from the database using the Entrez gene id or gene symbol. We also provide additional search and filter options, such as selection of tissue/cell type, tissue/cell-specific promoters, known/novel promoters and coding/non-coding gene promoters. The gene search query returns results at two levels (see Figure 2, Supplementary File 2, Supplementary Tables S6 and S7). The first level provides information (promoter position, CpG type and bidirectional type) on all promoters of the queried gene that are present in the promoter knowledgebase. The second level lists all promoters identified from ChIP-seq data sets for the queried gene. The search results can be downloaded as an Excel file, and each promoter in the results is linked to the visualization module. Further, the complete list of annotated promoters can be downloaded from the download link.
Visualization of the promoter position and ChIP-seq enrichment profile is implemented using GBrowse (12), an open source genome browser platform. GBrowse is a simple but highly configurable web-based genome browser that provides a fast and customizable interface for visualizing data stored in a backend database, as well as data uploaded by the user. GBrowse is lighter than the UCSC genome browser and offers many advantages, especially in displaying results and tracks. Some of the features unique to GBrowse are: glyphs and balloons to represent different features, deeper organization of feature subcategories, multi-language support, viewing of GenBank, Chado and BioSQL feature databases, and third-party loading.
GBrowse displays the identified promoter locations and the enrichment profiles of the analyzed ChIP-seq data sets (Figure 2D). Users can also type genome coordinates or a gene symbol directly into GBrowse to search, and can turn the displayed tracks on or off.
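The combinable search filters described above can be sketched in plain Python. This is not the MPromDb schema or its Django code: the `PromoterRecord` fields, the `filter_promoters` function and the example gene symbols are assumptions chosen to mirror the filter options named in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PromoterRecord:
    gene_symbol: str
    tissue: str
    known: bool            # True = Known, False = Novel
    cpg: bool              # True = CpG, False = NonCpG
    tissue_specific: bool
    protein_coding: bool


def filter_promoters(records: Iterable[PromoterRecord],
                     gene_symbol: Optional[str] = None,
                     tissue: Optional[str] = None,
                     known: Optional[bool] = None,
                     cpg: Optional[bool] = None,
                     tissue_specific: Optional[bool] = None,
                     protein_coding: Optional[bool] = None) -> List[PromoterRecord]:
    """Keep records matching every criterion that is not None."""
    criteria = {
        "gene_symbol": gene_symbol,
        "tissue": tissue,
        "known": known,
        "cpg": cpg,
        "tissue_specific": tissue_specific,
        "protein_coding": protein_coding,
    }
    active = {name: value for name, value in criteria.items() if value is not None}
    return [r for r in records
            if all(getattr(r, name) == value for name, value in active.items())]


if __name__ == "__main__":
    # Toy records with made-up gene symbols.
    db = [
        PromoterRecord("GENE_A", "brain", True, True, False, True),
        PromoterRecord("GENE_A", "liver", False, False, True, True),
        PromoterRecord("GENE_B", "brain", False, True, True, False),
    ]
    for rec in filter_promoters(db, tissue="brain", known=False):
        print(rec)
```

In a Django application the same idea would typically be expressed by chaining keyword filters on a model's query set; the dictionary of active criteria maps naturally onto such keyword arguments.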
Figure 2.

Screenshots of MPromDb and search results. (A) MPromDb main search page where a user can perform search based on either Entrez gene id/symbol or specific tissue/cell type and the resulting page is shown in (B) and (C), respectively. (D) User can visualize the ChIP-seq profile for any promoter displayed on (B) or (C) by clicking on the promoter position link.

FUTURE PLANS

In the future, we plan to include epigenetic histone modification profiles identified from ChIP-seq data sets that are currently available at NCBI GEO and to integrate them into our promoter knowledgebase. We will also continue to collect RNAP-II and transcription factor ChIP-seq data sets from a wider variety of tissues and cell types in order to update MPromDb routinely. We also plan to include other mammalian data sets and to add further features and search options to the frontend of the database. In conclusion, MPromDb will provide integrated transcriptional regulatory information for mammalian genomes in an easily accessible way. We believe that the updates will facilitate large-scale ChIP-seq data analysis and contribute toward the elucidation of mammalian transcriptional regulatory networks.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

NHGRI/NIH grant (# R01HG003362); American Cancer Society Research Scholar Grant (# RSG-07-097-01 to R.D.); and Philadelphia Healthcare Trust. R.D. holds a Philadelphia Healthcare Trust Endowed Chair Position. Funding for open access charge: National Institutes of Health grant (#R01HG003362 to R.D.).

Conflict of interest statement. None declared.

REFERENCES (10 of 24 shown)

1.  The generic genome browser: a building block for a model organism system database.

Authors:  Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal:  Genome Res       Date:  2002-10       Impact factor: 9.043

Review 2.  Alternative splicing: global insights.

Authors:  Martina Hallegger; Miriam Llorian; Christopher W J Smith
Journal:  FEBS J       Date:  2010-01-15       Impact factor: 5.542

3.  Chromatin poises miRNA- and protein-coding genes for expression.

Authors:  Artem Barski; Raja Jothi; Suresh Cuddapah; Kairong Cui; Tae-Young Roh; Dustin E Schones; Keji Zhao
Journal:  Genome Res       Date:  2009-08-27       Impact factor: 9.043

4.  Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes.

Authors:  Kouichi Kimura; Ai Wakamatsu; Yutaka Suzuki; Toshio Ota; Tetsuo Nishikawa; Riu Yamashita; Jun-ichi Yamamoto; Mitsuo Sekine; Katsuki Tsuritani; Hiroyuki Wakaguri; Shizuko Ishii; Tomoyasu Sugiyama; Kaoru Saito; Yuko Isono; Ryotaro Irie; Norihiro Kushida; Takahiro Yoneyama; Rie Otsuka; Katsuhiro Kanda; Takahide Yokoi; Hiroshi Kondo; Masako Wagatsuma; Katsuji Murakawa; Shinichi Ishida; Tadashi Ishibashi; Asako Takahashi-Fujii; Tomoo Tanase; Keiichi Nagai; Hisashi Kikuchi; Kenta Nakai; Takao Isogai; Sumio Sugano
Journal:  Genome Res       Date:  2005-12-12       Impact factor: 9.043

5.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors:  Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-03-04       Impact factor: 13.583

6.  Annotation of gene promoters by integrative data-mining of ChIP-seq Pol-II enrichment data.

Authors:  Ravi Gupta; Priyankara Wikramasinghe; Anirban Bhattacharyya; Francisco A Perez; Sharmistha Pal; Ramana V Davuluri
Journal:  BMC Bioinformatics       Date:  2010-01-18       Impact factor: 3.169

7.  EPD in its twentieth year: towards complete promoter coverage of selected model organisms.

Authors:  Christoph D Schmid; Rouaïda Perier; Viviane Praz; Philipp Bucher
Journal:  Nucleic Acids Res       Date:  2006-01-01       Impact factor: 16.971

8.  FANTOM4 EdgeExpressDB: an integrated database of promoters, genes, microRNAs, expression dynamics and regulatory interactions.

Authors:  Jessica Severin; Andrew M Waterhouse; Hideya Kawaji; Timo Lassmann; Erik van Nimwegen; Piotr J Balwierz; Michiel Jl de Hoon; David A Hume; Piero Carninci; Yoshihide Hayashizaki; Harukazu Suzuki; Carsten O Daub; Alistair Rr Forrest
Journal:  Genome Biol       Date:  2009-04-19       Impact factor: 13.583

9.  Jmjd3 contributes to the control of gene expression in LPS-activated macrophages.

Authors:  Francesca De Santa; Vipin Narang; Zhei Hwee Yap; Betsabeh Khoramian Tusi; Thomas Burgold; Liv Austenaa; Gabriele Bucci; Marieta Caganova; Samuele Notarbartolo; Stefano Casola; Giuseppe Testa; Wing-Kin Sung; Chia-Lin Wei; Gioacchino Natoli
Journal:  EMBO J       Date:  2009-09-24       Impact factor: 11.598

10.  The UCSC Genome Browser database: update 2010.

Authors:  Brooke Rhead; Donna Karolchik; Robert M Kuhn; Angie S Hinrichs; Ann S Zweig; Pauline A Fujita; Mark Diekhans; Kayla E Smith; Kate R Rosenbloom; Brian J Raney; Andy Pohl; Michael Pheasant; Laurence R Meyer; Katrina Learned; Fan Hsu; Jennifer Hillman-Jackson; Rachel A Harte; Belinda Giardine; Timothy R Dreszer; Hiram Clawson; Galt P Barber; David Haussler; W James Kent
Journal:  Nucleic Acids Res       Date:  2009-11-11       Impact factor: 16.971
