
DBTSS: database of transcription start sites, progress report 2008.

Hiroyuki Wakaguri, Riu Yamashita, Yutaka Suzuki, Sumio Sugano, Kenta Nakai.

Abstract

DBTSS is a database of transcriptional start sites, based on our unique collection of precise, experimentally determined 5'-end sequences of full-length cDNAs. Since its first release in 2002, several major updates have been made. In this update, we expanded the human transcriptional start site dataset by 19 million uniquely mapped, and RefSeq-associated, 5'-end sequences, which were generated by a newly introduced Solexa sequencer. Moreover, in order to provide means for interpreting those massive TSS data, we implemented two new analytical tools: one for connecting expression information with predicted transcription factor binding sites; the other for examining evolutionary conservation or species-specificity of promoters and transcripts, which can be browsed by our own comparative genome viewer. With the expanded dataset and the enhanced functionalities, DBTSS provides a unique platform that enables in-depth transcriptome analyses. DBTSS is accessible at http://dbtss.hgc.jp/.

Substances: Transcription Factors

Year: 2007 PMID: 17942421 PMCID: PMC2238895 DOI: 10.1093/nar/gkm901

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

One of the most challenging subjects of genome science in the post-genome-sequencing era is to understand transcriptional regulatory networks, by which the timing and quantity of transcription are collectively controlled. To eventually decode genome information into a macromolecular system of the cell, in-depth knowledge of its higher-order regulatory mechanisms is indispensable (1,2). Detailed evolutionary comparisons of transcriptional networks are also expected to explain the molecular mechanisms underlying phenotypic divergence and speciation (3). Therefore, the identification and analysis of promoters, which contain most transcription factor binding sites, are essential. To define promoter regions, positional information about transcriptional start sites (TSSs) is the first clue (4). To provide precise TSS information, we launched DBTSS in 2002 (5). DBTSS contains TSS information determined by our own experiments. The 5′-end sequences of full-length cDNA clones, isolated from libraries that are mainly constructed by the oligo-capping method and are enriched in full-length cDNAs, are mapped onto genome sequences. Each TSS is defined as a genomic position to which the 5′-end of one or more full-length cDNAs corresponds (6). Since the initial release of DBTSS, we have made several updates, including expansion of the data volume and of the species covered, the addition of information on so-called alternative promoters (7), and the implementation of an analytical tool that enables promoter sequence comparison between human and mouse. In this article, we introduce the updates and additions made since DBTSS 2006, the most important being the addition of TSS information produced on a massive scale by a new-generation sequencer.
Newly developed, massively parallel sequencing technologies, such as the GS20 and Solexa sequencer systems, have made it possible to determine millions of sequences in a single run (8). We recently adapted our full-length cDNA technique, the oligo-capping method, to the Solexa sequencer (a detailed protocol will be published elsewhere). Using the extremely high-throughput Solexa sequencer, we generated TSS information from 19 million sequences and incorporated it into DBTSS. At the same time, we considered that powerful analytical tools would be essential to extract the maximum biological information from TSS data of this size. Therefore, in this update, together with the expansion of the TSS data, we put two analytical tools into service: the first for correlating expression information with promoter elements; the second for examining the evolutionary conservation or species-specificity of promoters and transcripts, which can be browsed with a dynamic and flexible comparative genome browser. Both new functionalities are closely linked to the Solexa sequence browser and enhance the interpretation of the TSS data.

NEW FEATURES

Incorporation of the TSS data produced by the Solexa sequencer

A major update in the current version of DBTSS, version 6, since the previous report is a significant increase in the amount of human TSS data. We recently started to determine the 5′-end sequences of oligo-capped cDNAs using the Solexa sequencer. Briefly, the 5′- and 3′-adaptor sequences necessary for Solexa sequencing were introduced as the 5′-end oligo at the RNA ligation step and as a random hexamer primer at first-strand cDNA synthesis, respectively. Because the procedure is this straightforward, the collected data should increase in size without loss of quality. The data obtained were processed as previously described for classical Sanger sequences, with minor modifications. The genomic positions to which the 5′-ends of the Solexa sequences were mapped were defined as putative TSSs. Clustering by 500 bp bins was also applied to separate putative alternative promoters (7). As a result, 10 000 349 and 8 633 345 sequences uniquely mapped to the RefSeq mRNA regions of the human genome (hg18) were obtained from MCF7 cells (a human cell line of breast cancer origin; ATCC#HTB-22) and human embryonic kidney 293 cells (HEK293; ATCC#CRL-1573), respectively. These sequences collectively represent 29 210 and 41 238 putative (alternative) promoters of 12 133 and 11 598 RefSeq genes in the two cell lines, respectively (Table 1). This is one of the largest collections of human TSS information collected from a single cell type (9).
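The TSS definition and clustering step described above can be illustrated with a minimal sketch. It is not the DBTSS pipeline: the function name `cluster_tss`, the gap-based interpretation of the 500 bp rule (positions closer than 500 bp are merged), and the choice to apply the five-sequence support threshold before clustering are all assumptions made for illustration.

```python
from collections import Counter


def cluster_tss(five_prime_ends, max_gap=500, min_support=5):
    """Group mapped 5'-end positions (one chromosome, one strand) into clusters.

    five_prime_ends: iterable of integer genomic positions, one per mapped read.
    Returns a list of (start, end, read_count) tuples, one per cluster.
    The gap rule and the support threshold are illustrative assumptions.
    """
    # Count how many reads start at each genomic position.
    support = Counter(five_prime_ends)
    # Keep only positions backed by at least `min_support` reads.
    positions = sorted(p for p, n in support.items() if n >= min_support)

    clusters = []
    for pos in positions:
        if clusters and pos - clusters[-1][1] < max_gap:
            # Close to the previous position: extend the current cluster.
            start, _, count = clusters[-1]
            clusters[-1] = (start, pos, count + support[pos])
        else:
            # Far from the previous position: start a new putative promoter.
            clusters.append((pos, pos, support[pos]))
    return clusters


# Hypothetical 5'-end positions for a single gene locus.
reads = [1000] * 6 + [1040] * 5 + [1100] * 2 + [2500] * 7
print(cluster_tss(reads))
```

In this toy input the position with only two supporting reads is discarded, and the remaining positions fall into two clusters that would be treated as two putative alternative promoters.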

Table 1.

	Total no. of mapped sequences	No. of sequences associated with NMs	No. of represented NMs	Total no. of putative promoters
MCF7 (Solexa)	11 919 330	10 000 349	12 133	29 210
HEK293 (Solexa)	10 062 560	8 633 345	11 598	41 238
cDNA (Sanger, total)	1 540 411	1 370 985	15 194	32 122

Statistics of the datasets produced by the Solexa sequencer (upper two rows) and by the original Sanger sequencer (bottom row) in humans. For the Solexa datasets, only TSSs supported by five or more sequences were counted.

The Solexa data can be retrieved in parallel with the original Sanger data (Figure 1). In other words, we did not mix the new Solexa-derived TSS information with the original dataset, so that in-depth transcriptome analysis focusing on a single cell type is possible while the previous data, which represent general features of promoters drawn from various cell types and tissues, are preserved. Otherwise, the clustering and representation of promoters in the original dataset could be severely biased by the size of the new data.

Figure 1.

Screenshot from the Solexa sequence viewer. Its basic utility is similar to that of the previous version (5). Users can choose the database to search, either Sanger or Solexa dataset, in the left panel and then retrieve the results (red circle). Users can also switch the browsers between the Sanger and Solexa results (blue circle).

Connecting expression information with promoter elements

To further investigate the biological significance or the underlying regulatory mechanisms of the observed TSSs, information on the changes in expression levels induced by various environmental stimuli would provide important clues. For this purpose, at least until the expression information produced by the Solexa and GS20 sequencers [where millions of collected sequences are subjected to SAGE-like analysis; see (10)] has accumulated to a sufficient level, previously compiled microarray data are conveniently available. In particular, for MCF7 there is a series of microarray data, recently produced by the Connectivity Map project, that represents the expression profiles of this cell line after administration of over 200 kinds of drugs in several time courses (11). In addition to incorporating the MCF7 expression data, we implemented an analytical tool that looks for putative transcription factor binding sites commonly occurring in promoters that behave similarly under various drug perturbations. The assumption is that such promoters are regulated by similar transcription factors and therefore share particular transcription factor binding sites. As shown in Figure 2, for a user-specified group of promoters, or for promoters that satisfy specified search conditions on the expression profiles, transcription factor binding sites are predicted by a matrix search of the TRANSFAC database (12) under any chosen cut-off value. The degree of enrichment of the predicted transcription factor binding sites is evaluated assuming a hypergeometric distribution. The enriched binding sites are then listed, and users can further retrieve which promoters contain these sites and where within each promoter they are located.
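The hypergeometric evaluation mentioned above can be sketched as a one-sided tail probability. This is a generic illustration, not the code used by DBTSS; the function name and argument names are hypothetical, and the choice of the upper tail as the enrichment score is an assumption.

```python
from math import comb


def hypergeom_enrichment_p(total, total_with_site, selected, selected_with_site):
    """Upper-tail hypergeometric probability P(X >= selected_with_site).

    total:              number of promoters in the background set
    total_with_site:    background promoters with a predicted binding site
    selected:           promoters in the user-specified group
    selected_with_site: promoters in that group with the predicted site
    """
    upper = min(total_with_site, selected)
    # Sum the probabilities of observing k site-containing promoters, for all
    # k at least as large as the observed value. math.comb returns 0 when the
    # second argument exceeds the first, so impossible terms vanish.
    tail = sum(
        comb(total_with_site, k) * comb(total - total_with_site, selected - k)
        for k in range(selected_with_site, upper + 1)
    )
    return tail / comb(total, selected)


# Hypothetical numbers: 10 promoters overall, 4 carry the site,
# 5 promoters are selected and 3 of those carry the site.
print(hypergeom_enrichment_p(10, 4, 5, 3))
```

A small value suggests that the site occurs in the selected promoters more often than expected by chance; in practice a correction for testing many binding-site matrices would also be needed.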
Although the MCF7 dataset is currently the only source of expression data, we will expand the dataset to cover expression profiles in various cell types and conditions. Moreover, further increases in the amount of Solexa data for TSS determination will themselves enrich the source of expression information. Such data will directly couple positional information within promoters with changes in expression levels, as expression levels should be reflected in the number of allocated sequences.
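As a rough illustration of how sequence counts could stand in for expression levels, the sketch below normalizes per-promoter counts to tags per million and compares two conditions. The normalization scheme, the function names and the promoter identifiers are assumptions for illustration; the text does not specify how DBTSS will compute expression from counts.

```python
def tags_per_million(counts):
    """Scale raw per-promoter sequence counts to tags per million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequences in this dataset")
    return {promoter: n * 1_000_000 / total for promoter, n in counts.items()}


def fold_changes(treated_counts, control_counts):
    """Ratio of normalized counts (treated / control) for shared promoters."""
    treated = tags_per_million(treated_counts)
    control = tags_per_million(control_counts)
    # Promoters with no control sequences are skipped to avoid division by zero.
    return {
        promoter: treated[promoter] / control[promoter]
        for promoter in treated
        if control.get(promoter, 0) > 0
    }


# Hypothetical counts of 5'-end sequences allocated to two promoters.
control = {"P1": 30, "P2": 70}
treated = {"P1": 60, "P2": 40}
print(fold_changes(treated, control))
```

Normalizing by the total number of sequences makes datasets of different sequencing depth comparable before any fold-change threshold is applied.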

Figure 2.

Screenshot from the search engine for enrichment of the putative transcription factor binding sites (A). The figure exemplifies the search for common TF binding sites appearing in the promoters of genes with which more than 10 Solexa sequences are associated and whose relative expression levels are elevated more than 2-fold by both the 1 μM trichostatin A treatment and the 1 μM wortmannin treatment (B). From the resultant list of the enriched sites, the link can be followed to the main viewer to retrieve further detailed information (C).

Evolutionary conservation of promoters and their downstream transcripts

Another approach to inferring the biological relevance of TSSs is to examine their evolutionary conservation or turnover in both the upstream promoters and the downstream transcribed regions. Recently, it has become clear that some (alternative) promoters are well conserved, while others are rapidly evolving (13). Whether the non-conserved (alternative) promoters are transcriptional noise or actually play species-specific roles remains unclear. However, the subsequent approach to investigating those (alternative) promoters should differ from the outset, depending on whether they are associated with conserved, possibly principal, biological roles or with species-specific phenotypic features. As shown in Figure 3, our new comparative browser enables users to examine the evolutionary conservation of the regions surrounding TSSs, based on the genomic sequences of various mammals, such as mice, rats and monkeys, and on their mutual base-pair alignments from the UCSC Genome Browser (14). Furthermore, users can search for promoters or transcripts according to their degree of evolutionary conservation, using variable parameters including the coverage of alignable regions and the base substitution rate therein. More detailed searches focusing on limited regions of the transcripts, such as UTRs, CDSs or the ‘coding exons’, are also supported. The results are browsed with our new comparative browser, which provides dynamic magnification from the sequence level to the overview level. The viewer currently supports comparisons of up to four genomes of the user's choice.
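The two search parameters named above, coverage of alignable regions and base substitution rate, can be computed from a pairwise alignment roughly as follows. This is a sketch under stated assumptions: the input is a pair of equal-length gapped strings (with "-" as the gap character), coverage is measured relative to the reference (for example, human) sequence, and the function name is hypothetical. The exact definitions used by DBTSS are not given in the text.

```python
def conservation_metrics(ref_aln, other_aln):
    """Return (aligned_bases, coverage, substitution_rate) for one alignment.

    ref_aln, other_aln: gapped sequences of equal length, '-' marks a gap.
    coverage:           aligned reference bases / all reference bases
    substitution_rate:  mismatching columns / aligned columns
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned sequences must have the same length")

    ref_len = aligned = mismatches = 0
    for r, o in zip(ref_aln.upper(), other_aln.upper()):
        if r == "-":
            continue  # insertion in the other genome; not a reference base
        ref_len += 1
        if o == "-":
            continue  # reference base with no aligned counterpart
        aligned += 1
        if r != o:
            mismatches += 1

    coverage = aligned / ref_len if ref_len else 0.0
    substitution_rate = mismatches / aligned if aligned else 0.0
    return aligned, coverage, substitution_rate


# Hypothetical human-mouse alignment of a short promoter fragment.
print(conservation_metrics("ACGT-ACGTA", "ACGTTAC--C"))
```

A search such as the one in Figure 3 would then amount to filtering promoters by thresholds on these values, for example an alignable length below 300 bp or a substitution rate below 20%.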

Figure 3.

Screenshot from the search engine for the evolutionary conservation of the promoters and transcripts (A). The figure exemplifies the results of the search for promoters with which more than 10 Solexa sequences are associated, in which the alignable region of the promoter in the human–mouse comparison is <300 bp and the overall base substitution rate of the downstream transcript region is <20% (B). The regions specified between red (one selection) and green (second selection) vertical lines (or one left click) can be magnified up to the sequence level.

FUTURE PERSPECTIVE

We will continue Solexa sequencing in various tissues under various environmental conditions, which will further increase not only the TSS data but also the expression data. We will also take a similar approach for model organisms other than humans. The resulting data will be incorporated into DBTSS and made publicly available. In many organisms in particular, cDNA resources remain scarce while genome sequences are being released rapidly. Our massive transcriptome data will be helpful not only for determining TSSs and upstream promoters but also for accurately annotating the compiled genome sequences. DBTSS, with its expanded data and several new analytical tools, provides a unique platform for in-depth analyses of transcriptomes. We believe DBTSS will serve as a firm foundation for deeper insights into how the transcriptional regulatory network is realized by the code of genomic DNA sequences.

14 in total

1. Identification and characterization of the potential promoter regions of 1031 kinds of human genes.

Authors: Y Suzuki; T Tsunoda; J Sese; H Taira; J Mizushima-Sugano; H Hata; T Ota; T Isogai; T Tanaka; Y Nakamura; A Suyama; Y Sakaki; S Morishita; K Okubo; S Sugano
Journal: Genome Res Date: 2001-05 Impact factor: 9.043

2. Construction of a full-length enriched and a 5'-end enriched cDNA library using the oligo-capping method.

Authors: Yutaka Suzuki; Sumio Sugano
Journal: Methods Mol Biol Date: 2003

Review 3. Transcriptional regulatory elements in the human genome.

Authors: Glenn A Maston; Sara K Evans; Michael R Green
Journal: Annu Rev Genomics Hum Genet Date: 2006 Impact factor: 8.929

Review 4. Whole-genome re-sequencing.

Authors: David R Bentley
Journal: Curr Opin Genet Dev Date: 2006-10-18 Impact factor: 5.578

5. Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes.

Authors: Kouichi Kimura; Ai Wakamatsu; Yutaka Suzuki; Toshio Ota; Tetsuo Nishikawa; Riu Yamashita; Jun-ichi Yamamoto; Mitsuo Sekine; Katsuki Tsuritani; Hiroyuki Wakaguri; Shizuko Ishii; Tomoyasu Sugiyama; Kaoru Saito; Yuko Isono; Ryotaro Irie; Norihiro Kushida; Takahiro Yoneyama; Rie Otsuka; Katsuhiro Kanda; Takahide Yokoi; Hiroshi Kondo; Masako Wagatsuma; Katsuji Murakawa; Shinichi Ishida; Tadashi Ishibashi; Asako Takahashi-Fujii; Tomoo Tanase; Keiichi Nagai; Hisashi Kikuchi; Kenta Nakai; Takao Isogai; Sumio Sugano
Journal: Genome Res Date: 2005-12-12 Impact factor: 9.043

6. Accelerated evolution of conserved noncoding sequences in humans.

Authors: Shyam Prabhakar; James P Noonan; Svante Pääbo; Edward M Rubin
Journal: Science Date: 2006-11-03 Impact factor: 47.728

7. Distinct class of putative "non-conserved" promoters in humans: comparative studies of alternative promoters of human and mouse genes.

Authors: Katsuki Tsuritani; Takuma Irie; Riu Yamashita; Yuta Sakakibara; Hiroyuki Wakaguri; Akinori Kanai; Junko Mizushima-Sugano; Sumio Sugano; Kenta Nakai; Yutaka Suzuki
Journal: Genome Res Date: 2007-06-13 Impact factor: 9.043

8. The transcriptional landscape of the mammalian genome.

Authors: P Carninci; T Kasukawa; S Katayama; J Gough; M C Frith; N Maeda; R Oyama; T Ravasi; B Lenhard; C Wells; R Kodzius; K Shimokawa; V B Bajic; S E Brenner; S Batalov; A R R Forrest; M Zavolan; M J Davis; L G Wilming; V Aidinis; J E Allen; A Ambesi-Impiombato; R Apweiler; R N Aturaliya; T L Bailey; M Bansal; L Baxter; K W Beisel; T Bersano; H Bono; A M Chalk; K P Chiu; V Choudhary; A Christoffels; D R Clutterbuck; M L Crowe; E Dalla; B P Dalrymple; B de Bono; G Della Gatta; D di Bernardo; T Down; P Engstrom; M Fagiolini; G Faulkner; C F Fletcher; T Fukushima; M Furuno; S Futaki; M Gariboldi; P Georgii-Hemming; T R Gingeras; T Gojobori; R E Green; S Gustincich; M Harbers; Y Hayashi; T K Hensch; N Hirokawa; D Hill; L Huminiecki; M Iacono; K Ikeo; A Iwama; T Ishikawa; M Jakt; A Kanapin; M Katoh; Y Kawasawa; J Kelso; H Kitamura; H Kitano; G Kollias; S P T Krishnan; A Kruger; S K Kummerfeld; I V Kurochkin; L F Lareau; D Lazarevic; L Lipovich; J Liu; S Liuni; S McWilliam; M Madan Babu; M Madera; L Marchionni; H Matsuda; S Matsuzawa; H Miki; F Mignone; S Miyake; K Morris; S Mottagui-Tabar; N Mulder; N Nakano; H Nakauchi; P Ng; R Nilsson; S Nishiguchi; S Nishikawa; F Nori; O Ohara; Y Okazaki; V Orlando; K C Pang; W J Pavan; G Pavesi; G Pesole; N Petrovsky; S Piazza; J Reed; J F Reid; B Z Ring; M Ringwald; B Rost; Y Ruan; S L Salzberg; A Sandelin; C Schneider; C Schönbach; K Sekiguchi; C A M Semple; S Seno; L Sessa; Y Sheng; Y Shibata; H Shimada; K Shimada; D Silva; B Sinclair; S Sperling; E Stupka; K Sugiura; R Sultana; Y Takenaka; K Taki; K Tammoja; S L Tan; S Tang; M S Taylor; J Tegner; S A Teichmann; H R Ueda; E van Nimwegen; R Verardo; C L Wei; K Yagi; H Yamanishi; E Zabarovsky; S Zhu; A Zimmer; W Hide; C Bult; S M Grimmond; R D Teasdale; E T Liu; V Brusic; J Quackenbush; C Wahlestedt; J S Mattick; D A Hume; C Kai; D Sasaki; Y Tomaru; S Fukuda; M Kanamori-Katayama; M Suzuki; J Aoki; T Arakawa; J Iida; K Imamura; M Itoh; T Kato; H Kawaji; N Kawagashira; T Kawashima; 
M Kojima; S Kondo; H Konno; K Nakano; N Ninomiya; T Nishio; M Okada; C Plessy; K Shibata; T Shiraki; S Suzuki; M Tagami; K Waki; A Watahiki; Y Okamura-Oho; H Suzuki; J Kawai; Y Hayashizaki
Journal: Science Date: 2005-09-02 Impact factor: 47.728

9. DBTSS: DataBase of Human Transcription Start Sites, progress report 2006.

Authors: Riu Yamashita; Yutaka Suzuki; Hiroyuki Wakaguri; Katsuki Tsuritani; Kenta Nakai; Sumio Sugano
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

10. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.

Authors: V Matys; O V Kel-Margoulis; E Fricke; I Liebich; S Land; A Barre-Dirrie; I Reuter; D Chekmenev; M Krull; K Hornischer; N Voss; P Stegmaier; B Lewicki-Potapov; H Saxel; A E Kel; E Wingender
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971
