CAGE Basic/Analysis Databases: the CAGE resource for comprehensive promoter analysis.

Hideya Kawaji¹, Takeya Kasukawa, Shiro Fukuda, Shintaro Katayama, Chikatoshi Kai, Jun Kawai, Piero Carninci, Yoshihide Hayashizaki.

Abstract

Cap-analysis gene expression (CAGE) Basic and Analysis Databases store an original resource produced by CAGE, which measures expression levels of transcription starting sites by sequencing large numbers of transcript 5′ ends, termed CAGE tags. Millions of high-quality human and mouse CAGE tags, derived from different conditions in >20 tissues and >250 RNA samples, are essential for identifying novel promoters and for characterizing promoters in terms of their expression profiles. CAGE Basic Database is a primary database of the CAGE resource: RNA samples, CAGE libraries, CAGE clone and tag sequences and so on. CAGE Analysis Database stores promoter-related information, such as counts of related transcripts, CpG islands and conserved genomic regions. It also provides expression profiles at the base pair and promoter levels. Both databases are built on the same framework: CAGE tag starting sites, tag clusters for defining promoters, and transcriptional units (TUs). Their associations and TU attributes can be used to find promoters of interest. These databases were provided for Functional Annotation Of Mouse 3 (FANTOM3), an international collaborative research project focusing on expanding the transcriptome and on subsequent analyses. Access is now free for all users through the World Wide Web at http://fantom3.gsc.riken.jp/.

Year: 2006 PMID: 16381948 PMCID: PMC1347397 DOI: 10.1093/nar/gkj034

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Cap-analysis gene expression (CAGE) is a high-throughput method that measures expression levels by counting large numbers of sequenced capped 5′ ends of transcripts, termed CAGE tags (1). A similar approach has been proposed as 5′ end SAGE (2). These 5′ end tags are on average 20 bp long and are aligned directly with the genome, whereas original SAGE (3) tags are aligned with the 3′ ends of transcripts (4). CAGE tags are an essential resource for profiling transcriptional starting sites, and they can be used to profile gene expression by counting the CAGE tags associated with each gene. Millions of high-quality mouse and human CAGE tags, derived from different conditions in >20 tissues and >250 RNA samples, were analyzed in the international collaborative research project Functional Annotation Of Mouse 3 (FANTOM3). The CAGE tags were used for the analysis of the transcriptional landscape of the mammalian genome (5), antisense transcription in the mammalian transcriptome (6), comprehensive promoter analysis (P.Carninci, A.Sandelin, B.Lenhard, S.Katayama, K.Shimokawa, J.Ponjavic, C.A.Semple, M.S.Taylor, P.Engstrom, M.C.Frith, A.R.Forrest, W.B.Alkema, S.L.Tan, C.Plessy, R.Kodzius, T.Ravasi, T.Kasukawa, S.Fukuda, M.Kanamori-Katayama, Y.Kitazume, H.Kawaji, C.Kai, H.Konno, K.Nakano, S.Mottagui-Tabar, P.Arner, A.Chesi, S.Gustincich, F.Persichetti, H.Suzuki, S.M.Grimmond, C.Wells, V.Orlando, C.Wahlestedt, E.T.Liu, M.Harbers, J.Kawai, V.B.Bajic, D.A.Hume and Y.Hayashizaki, manuscript submitted) and subsequent analyses. We constructed two database systems to make use of the CAGE resource: CAGE Basic Database and CAGE Analysis Database. Their aims are (i) to manage and trace the CAGE data consistently and (ii) to demonstrate promoter usage (using CAGE and other data). The former is required to support the novel experimental processes of CAGE and to manage the large number of RNA samples provided in the FANTOM3 collaboration. The latter supports subsequent analyses by using all of the required data, independently of how we manage the CAGE data. Additionally, we constructed the CAGE Expression 3D Viewer for a novel type of expression view (K.Shimokawa, Y.Okamura-Oho, C.Kai, P.Carninci and Y.Hayashizaki, manuscript in preparation). The database systems described here were used in FANTOM3 and are now publicly accessible. Here, we present an overview of the systems and their functions to facilitate the use of the CAGE resource.

DATA BASIS

A consistent and comprehensive dataset is crucial for biological analyses from different viewpoints. Our two database systems are built on the same basis: CAGE tag starting site (CTSS), tag cluster (TC) and transcriptional unit (TU). A CTSS is a nucleotide position on the genome at which the alignment of a CAGE tag starts. Counts of CAGE tags sharing the same starting site represent expression profiles at the base pair level. A TC is an operationally defined unit for characterizing promoters. It is constructed by clustering the overlapping 5′ end regions of transcripts (P.Carninci, A.Sandelin, B.Lenhard, S.Katayama, K.Shimokawa, J.Ponjavic, C.A.Semple, M.S.Taylor, P.Engstrom, M.C.Frith, A.R.Forrest, W.B.Alkema, S.L.Tan, C.Plessy, R.Kodzius, T.Ravasi, T.Kasukawa, S.Fukuda, M.Kanamori-Katayama, Y.Kitazume, H.Kawaji, C.Kai, H.Konno, K.Nakano, S.Mottagui-Tabar, P.Arner, A.Chesi, S.Gustincich, F.Persichetti, H.Suzuki, S.M.Grimmond, C.Wells, V.Orlando, C.Wahlestedt, E.T.Liu, M.Harbers, J.Kawai, V.B.Bajic, D.A.Hume and Y.Hayashizaki, manuscript submitted), such as the 5′-most 20 bp of RIKEN full-length cDNAs and RIKEN 5′ expressed sequence tags (ESTs), the 5′ end tags of GIS (7) and GSC (4) ditags, DBTSS (8), 5′ end SAGE and CAGE. Sequences that overlap on the genome by at least 1 bp are clustered together, and each cluster defines a TC. Counts of CAGE tags within TCs represent expression profiles at the promoter level. A TU is also an operationally defined unit, proposed in FANTOM2 (9) and defined as a region or a set of discontinuous regions in the genome from which all exons of a mature full-length mRNA are derived (10). Counts of CAGE tags within TUs represent expression profiles at the gene level. Where possible, TUs are associated with Entrez Gene (11) and Gene Ontology terms (12) by means of the transcripts belonging to them. CTSSs are associated with TCs, and TCs are associated with TUs. Users can access the CAGE resource of interest by searching TUs with their own keywords.
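
The two lowest levels of this hierarchy can be illustrated with a minimal Python sketch. It is not the pipeline used to build the databases: the `Tag` record, the 0-based half-open coordinates and the restriction of clustering to tags on the same strand are assumptions made for the example, and all names are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class Tag(NamedTuple):
    """A mapped 5' end tag (hypothetical record; 0-based, half-open coordinates)."""
    chrom: str
    strand: str
    start: int
    end: int


def ctss_position(tag):
    """Genome position where the alignment of the tag starts (its 5' end)."""
    return tag.start if tag.strand == "+" else tag.end - 1


def ctss_counts(tags):
    """Count tags sharing the same starting site (base-pair-level profile)."""
    return Counter((t.chrom, t.strand, ctss_position(t)) for t in tags)


def cluster_tags(tags):
    """Group tags that overlap by at least 1 bp into tag clusters.

    Assumption: only tags on the same chromosome and strand are clustered.
    """
    clusters = []
    # Sorting a NamedTuple orders by chrom, strand, start, end.
    for tag in sorted(tags):
        last = clusters[-1] if clusters else None
        if (last is not None
                and last["chrom"] == tag.chrom
                and last["strand"] == tag.strand
                and tag.start < last["end"]):  # at least 1 bp shared
            last["end"] = max(last["end"], tag.end)
            last["tags"].append(tag)
        else:
            clusters.append({"chrom": tag.chrom, "strand": tag.strand,
                             "start": tag.start, "end": tag.end,
                             "tags": [tag]})
    return clusters
```

In this sketch the number of CAGE tags collected in a cluster corresponds to the promoter-level count, and summing over the TCs assigned to a TU would give the gene-level count.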

SYSTEM OVERVIEW

Figure 1 gives an overview of the CAGE database systems. CAGE Basic Database is a primary database of the CAGE resource, and provides a view centered on the CAGE resource itself. CAGE Analysis Database stores TC-related information, and provides a view centered on promoters. As a complementary system, Genomic Elements Database was constructed to provide a view centered on genome positions. Their main contents are listed in Table 1. CAGE Analysis Database is likely to be the most convenient gateway for users, especially those new to the CAGE data. Hyperlinks lead from it to the other two databases according to the user's interest: to CAGE Basic Database for the CAGE sequences themselves, and to Genomic Elements Database for a conventional genome view.

Figure 1

An overview of the CAGE supporting systems and data flow among them.

Table 1

Main contents of the database systems

Database	Contents
CAGE Basic Database	RNA sample information
	CAGE library information
	CAGE clone plate/spot
	CAGE clone sequence
	CAGE clone sequence quality
	CAGE tag sequence
	CAGE tag mapping status
	Associations of CAGE tags with CTSS
	Associations of CTSS with TCs
	Associations of TCs with transcripts and TUs
CAGE Analysis Database	Base pair level expression profile
	TC expression profile within TUs
	Statistical significance of expression fluctuations
	Presence of predicted core promoter elements (a) in upstream region
	Presence of conserved genome region between human and mouse (axtNet)
	Presence of CpG islands
	Counts of TC related mRNA, 5′-EST, GIS/GSC ditags
Genomic Elements Database	TC
	Predicted core promoter elements (a)
	mRNA
	GIS/GSC ditag
	5′- and 3′-ESTs
	Candidates of imprinted transcripts in EICO DB
	Transcription factors listed in TFdb
	Gene prediction (b)
	CpG islands (b)
	Repeats detected by RepeatMasker and Tandem Repeats Finder (b)
	Assembly gap (b)
	Conserved genome region between human and mouse (axtNet) (b)

(a) TATA box, CCAAT box, GC box and initiator.

(b) Retrieved from the UCSC Genome Browser Database.

CAGE BASIC DATABASE

In the CAGE protocol, the 5′ ends of full-length cDNA synthesized from RNA samples are cleaved with MmeI, a class II restriction enzyme that cleaves 20/18 bp outside its recognition sequence. The cleaved 5′ end cDNA tags (CAGE tags) are ligated to form concatemers and cloned as CAGE clones in a CAGE library. After a CAGE clone is sequenced, the CAGE tag sequences are extracted and mapped computationally onto the genome. The CAGE clone sequence, the location of each CAGE tag on the clone and its genome mapping information are stored so that every tag can be traced back to its source. To manage the broad range of RNA samples provided in the FANTOM3 collaboration, the RNA sample ID, tissue name, developmental stage, sample treatment, cell type and collaboration name are stored. The amount of CAGE data derived from each RNA sample is presented so that users can check whether a sample of interest has been analyzed with CAGE and to what extent its CAGE tags were sequenced.

CAGE ANALYSIS DATABASE

Expression levels are measured by counting associated CAGE tags, and they can be used to build expression profiles at different levels, from base pair to chromosomal band. Two levels of expression profile are presented in the CAGE Analysis Database for each RNA sample: base pair scale expression inside a TC is displayed as a histogram (Figure 2A), and TC expression within a TU is displayed in a heat-map-like representation (Figure 2B). CAGE tag counts and transcripts per million, (tag counts)/(total mapped tag counts in the sample) × 1 000 000, are used as units of expression level. Additionally, the statistical significance of expression fluctuations between RNA samples is accessible in a matrix (Figure 2C). Together, these views give users a graphical picture of transcriptional start variation, promoter variation and expression fluctuation of promoters.
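
The normalization above is simple arithmetic; the following Python sketch shows it for one TC across several RNA samples. The function names, the dictionary layout and the sample names are hypothetical and chosen only for illustration.

```python
def transcripts_per_million(tag_count, total_mapped_tags):
    """(tag counts) / (total mapped tag counts in the sample) x 1 000 000."""
    if total_mapped_tags <= 0:
        raise ValueError("total_mapped_tags must be positive")
    # Multiply first so that exact ratios stay exact in floating point.
    return tag_count * 1_000_000 / total_mapped_tags


def expression_profile(tc_counts, sample_totals):
    """Normalize the tag counts of one TC for every RNA sample.

    tc_counts: sample name -> CAGE tags of this TC in that sample
    sample_totals: sample name -> total mapped CAGE tags in that sample
    Samples without any tag in the TC are reported as 0.0.
    """
    return {
        sample: transcripts_per_million(tc_counts.get(sample, 0), total)
        for sample, total in sample_totals.items()
    }


# Hypothetical example: 12 tags out of 400 000 mapped tags in one sample.
profile = expression_profile({"liver": 12}, {"liver": 400_000, "lung": 100_000})
```

Normalizing by the total mapped tag count of each sample makes values comparable between libraries that were sequenced to different depths.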

Figure 2

Screenshots of CAGE Analysis Database: (A) a view of base pair scale expression within a TC, where the CAGE tag count at each genome position is displayed as a histogram; (B) a view of TC expression within a TU, in which expression levels are shown in a heat-map-like representation; (C) a view of the statistical significance of expression fluctuations between RNA samples, with E-values displayed in a matrix.

Rarely expressed promoters contain only a few tags. Although our RACE experiment using an oligo-capping method supported 91% of the tested cases (5), some CAGE tags could be artifacts caused by errors in library preparation, sequencing or genome mapping. To provide supporting evidence for promoters, the associations of TCs with conserved genome regions (13), CpG islands (14), predicted core promoter elements (15–17) and counts of different transcript types are stored. Users can search for TCs at different reliability levels by specifying search conditions.
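
A search of this kind can be sketched in Python as a filter over TC records. The record fields, the thresholds and the idea of counting independent lines of evidence are assumptions made for illustration; they do not describe the actual search form or schema of the CAGE Analysis Database.

```python
from dataclasses import dataclass


@dataclass
class TagCluster:
    """Hypothetical TC record holding the kinds of evidence listed above."""
    tc_id: str
    cage_tag_count: int
    conserved: bool = False                  # overlaps a human-mouse conserved region
    cpg_island: bool = False                 # overlaps a CpG island
    core_elements: frozenset = frozenset()   # e.g. {"TATA box", "GC box"}
    supporting_transcripts: int = 0          # mRNA, 5' EST and ditag counts combined


def evidence_count(tc):
    """Number of independent lines of evidence supporting the TC."""
    return sum([
        tc.conserved,
        tc.cpg_island,
        bool(tc.core_elements),
        tc.supporting_transcripts > 0,
    ])


def search_tcs(tcs, min_tags=1, min_evidence=0):
    """Return the TCs that satisfy both user-specified thresholds."""
    return [
        tc for tc in tcs
        if tc.cage_tag_count >= min_tags and evidence_count(tc) >= min_evidence
    ]
```

Raising `min_tags` or `min_evidence` trades sensitivity for reliability: a TC with a single tag and no other support is kept only under the most permissive settings.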

GENOMIC ELEMENTS DATABASE

Genomic Elements Database is a supplementary database to the two CAGE databases. The aim is to integrate TCs and other data onto the genome and display them in a conventional way. Generic Genome Browser (18) with MySQL DBMS is used to present a genome view. Candidates of imprinted transcripts in EICO DB (19,20), transcription factors in TFdb (21) and other data in the UCSC Genome Browser Database (22) are stored in addition to the utilized data above. This system is also utilized in full-length cDNA annotation in FANTOM3 (5).

CONCLUSION

The CAGE database systems have provided the FANTOM3 project with a large number of mouse and human CAGE tags derived from various RNA samples, enabling biological analyses from various viewpoints. The systems have supported these analyses by providing views centered on the CAGE resource, on promoters and on genome positions, according to the interests of researchers. They are now publicly available, and are expected to promote further analyses of the CAGE resource in the scientific research community.

AVAILABILITY

The database systems described here are hyperlinked from . Their user's guide, glossary and/or database schema are available from their help pages, and their raw data files, table definitions in SQL and tab-delimited data files, are also available for download from .

22 in total

1. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors: M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal: Nat Genet Date: 2000-05 Impact factor: 38.330

2. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs.

Authors: Y Okazaki; M Furuno; T Kasukawa; J Adachi; H Bono; S Kondo; I Nikaido; N Osato; R Saito; H Suzuki; I Yamanaka; H Kiyosawa; K Yagi; Y Tomaru; Y Hasegawa; A Nogami; C Schönbach; T Gojobori; R Baldarelli; D P Hill; C Bult; D A Hume; J Quackenbush; L M Schriml; A Kanapin; H Matsuda; S Batalov; K W Beisel; J A Blake; D Bradt; V Brusic; C Chothia; L E Corbani; S Cousins; E Dalla; T A Dragani; C F Fletcher; A Forrest; K S Frazer; T Gaasterland; M Gariboldi; C Gissi; A Godzik; J Gough; S Grimmond; S Gustincich; N Hirokawa; I J Jackson; E D Jarvis; A Kanai; H Kawaji; Y Kawasawa; R M Kedzierski; B L King; A Konagaya; I V Kurochkin; Y Lee; B Lenhard; P A Lyons; D R Maglott; L Maltais; L Marchionni; L McKenzie; H Miki; T Nagashima; K Numata; T Okido; W J Pavan; G Pertea; G Pesole; N Petrovsky; R Pillai; J U Pontius; D Qi; S Ramachandran; T Ravasi; J C Reed; D J Reed; J Reid; B Z Ring; M Ringwald; A Sandelin; C Schneider; C A M Semple; M Setou; K Shimada; R Sultana; Y Takenaka; M S Taylor; R D Teasdale; M Tomita; R Verardo; L Wagner; C Wahlestedt; Y Wang; Y Watanabe; C Wells; L G Wilming; A Wynshaw-Boris; M Yanagisawa; I Yang; L Yang; Z Yuan; M Zavolan; Y Zhu; A Zimmer; P Carninci; N Hayatsu; T Hirozane-Kishikawa; H Konno; M Nakamura; N Sakazume; K Sato; T Shiraki; K Waki; J Kawai; K Aizawa; T Arakawa; S Fukuda; A Hara; W Hashizume; K Imotani; Y Ishii; M Itoh; I Kagawa; A Miyazaki; K Sakai; D Sasaki; K Shibata; A Shinagawa; A Yasunishi; M Yoshino; R Waterston; E S Lander; J Rogers; E Birney; Y Hayashizaki
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

3. The UCSC Genome Browser Database.

Authors: D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

4. The generic genome browser: a building block for a model organism system database.

Authors: Lincoln D Stein; Christopher Mungall; ShengQiang Shu; Michael Caudy; Marco Mangone; Allen Day; Elizabeth Nickerson; Jason E Stajich; Todd W Harris; Adrian Arva; Suzanna Lewis
Journal: Genome Res Date: 2002-10 Impact factor: 9.043

5. 5'-end SAGE for the analysis of transcriptional start sites.

Authors: Shin-ichi Hashimoto; Yutaka Suzuki; Yasuhiro Kasai; Kei Morohoshi; Tomoyuki Yamada; Jun Sese; Shinichi Morishita; Sumio Sugano; Kouji Matsushima
Journal: Nat Biotechnol Date: 2004-08-08 Impact factor: 54.908

6. A genome-wide and nonredundant mouse transcription factor database.

Authors: Mutsumi Kanamori; Hideaki Konno; Naoki Osato; Jun Kawai; Yoshihide Hayashizaki; Harukazu Suzuki
Journal: Biochem Biophys Res Commun Date: 2004-09-24 Impact factor: 3.575

7. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences.

Authors: P Bucher
Journal: J Mol Biol Date: 1990-04-20 Impact factor: 5.469

8. CpG islands in vertebrate genomes.

Authors: M Gardiner-Garden; M Frommer
Journal: J Mol Biol Date: 1987-07-20 Impact factor: 5.469

9. EICO (Expression-based Imprint Candidate Organizer): finding disease-related imprinted genes.

Authors: Itoshi Nikaido; Chika Saito; Akiko Wakamoto; Yasuhiro Tomaru; Takahiro Arakawa; Yoshihide Hayashizaki; Yasushi Okazaki
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

10. Discovery of imprinted transcripts in the mouse transcriptome using large-scale expression profiling.

Authors: Itoshi Nikaido; Chika Saito; Yosuke Mizuno; Makiko Meguro; Hidemasa Bono; Moritoshi Kadomura; Tomohiro Kono; Gerard A Morris; Paul A Lyons; Mitsuo Oshimura; Yoshihide Hayashizaki; Yasushi Okazaki
Journal: Genome Res Date: 2003-06 Impact factor: 9.043
