TRED: a Transcriptional Regulatory Element Database and a platform for in silico gene regulation studies.

Fang Zhao, Zhenyu Xuan, Lihua Liu, Michael Q Zhang.

Abstract

To understand gene regulation, accurate and comprehensive knowledge of transcriptional regulatory elements is essential. Here, we report our efforts in building a mammalian Transcriptional Regulatory Element Database (TRED) with associated data analysis functions. It collects cis- and trans-regulatory elements and is dedicated to easy data access and analysis for both single-gene-based and genome-scale studies. Distinguishing features of TRED include: (i) relatively complete genome-wide promoter annotation for human, mouse and rat; (ii) availability of gene transcriptional regulation information, including transcription factor binding sites and experimental evidence; (iii) data accuracy ensured by hand curation; (iv) an efficient user interface for easy and flexible data retrieval; and (v) implementation of on-the-fly sequence analysis tools. TRED can provide good training datasets for further genome-wide cis-regulatory element prediction and annotation, assist detailed functional studies and facilitate the deciphering of gene regulatory networks (http://rulai.cshl.edu/TRED).

Year:  2005        PMID: 15608156      PMCID: PMC539958          DOI: 10.1093/nar/gki004

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Understanding gene regulatory mechanisms and networks requires accurate and comprehensive knowledge of transcriptional regulatory elements. These include cis-elements, such as promoters, and trans-elements, such as transcription factors. A number of databases have been created to facilitate such studies. However, most of them are dedicated to either promoter annotation or transcription factor binding and functional information, which makes data access disconnected and the correlation of different types of data difficult. Hence, we were motivated to build a unique resource for both cis- and trans-regulatory elements, and to provide easy access to the correlation between promoter sequences and transcription factor binding information.

Although current promoter databases have provided much value for the development of promoter-finding programs and gene regulation studies (1,2), many have their own limitations. These include incomplete datasets, inadequate data accuracy, restricted accessibility of the data and lack of sequence analysis functionalities. In addition to single-gene-based and whole-genome experimental promoter identification, computational methods are greatly needed for efficient genome-wide promoter annotation. However, in higher eukaryotes, promoter finding in silico has turned out to be one of the most difficult problems in computational biology (3). Therefore, accurate promoter annotation for all the genes in higher eukaryotes is still an outstanding challenge.

Collecting comprehensive and precise information on currently known transcription factor binding and regulation is a daunting task. It involves painstaking and time-consuming literature curation by experts in transcription studies. Although a limited number of databases (4,5) are dedicated to this aspect of data collection, they often do not conveniently correlate functional information with the relevant promoter sequences and their genomic context, which are required in most regulation studies. Furthermore, data completeness is inevitably always an issue.

Here, we report our efforts in building a Transcriptional Regulatory Element Database (TRED) with associated data analysis functions. With the availability of complete genome sequences for human and draft sequences for mouse and rat, we have mapped out and documented in the database gene transcription start sites (TSSs) and core promoters for the whole genomes through both an automated pipeline and hand curation. In addition, we have been carrying out continuous expert curation of transcription factor binding and regulation information on these promoters. Our short-term goal is to provide comprehensive and accurate trans-regulatory information for target genes of cancer-related transcription factors. We have so far included binding data for a few transcription factors, with emphasis on two major cell cycle regulators, E2F and Myc. For each, we have recorded thousands of target genes of different binding qualities as demonstrated by various experiments. A web-based user interface has been implemented for easy data visualization, retrieval and analysis for both single-gene-based studies and large-scale sequence manipulation and gene regulatory network studies. We intend to build TRED to contain information on both cis- and trans-regulatory elements for every annotated gene, and to serve as a one-stop data provider for researchers interested in gene regulation studies.

DATA SOURCES

Promoter annotation

Promoters in TRED came from two sources: automated genome-wide annotation and hand curation. The two complement each other, and together they make the data relatively complete and accurate. The automated annotation pipeline was built to extract and merge known promoters from databases such as EPD and DBTSS (1,6), to predict promoters using promoter-finding programs such as FirstEF (7) combined with mRNA/EST information and cross-species comparisons, and to associate the promoters with known or predicted genes (Z. Xuan, F. Zhao, J. Wang, G. Chen and M. Q. Zhang, submitted for publication). Given the difficulty and complexity of promoter prediction in higher eukaryotes, the accuracy of computational promoter annotation is limited. Therefore, hand curation was applied as a crucial part of our data collection to assess computational prediction and ensure data accuracy. After we pooled data from both sources, further data cleaning and integration were carried out. Based on the reliability of the supporting evidence for each promoter, a quality level was assigned.

Transcription factor binding curation

Curation was carried out for transcriptional regulation information on promoters. An exhaustive literature search for target genes of individual transcription factors was carried out, binding motifs and experimental evidence were recorded, and transcription factor binding motifs were mapped onto the promoters of the target genes. Binding quality levels were assigned based on the definitiveness of the binding evidence, which was determined by the experimental approaches employed to demonstrate the binding and by expert data interpretation. A standardized curation format has been developed for easy data entry and automated data loading into the database. To best preserve the curated association between motifs and promoters through changes such as new genome assembly releases and genome annotations, we also record motif flanking sequences. Curation is a time-consuming and laborious process, and we started out by focusing on target genes of cancer-related transcription factors. In keeping with the broad interest in cell cycle regulatory network studies, we have completed curation of the target genes of the transcription factors E2F and Myc. These factors are involved in various biological pathways and have profound effects on cell proliferation (8–13). Many E2F and Myc target genes have been identified by traditional transcription studies as well as by newly developed, large-scale functional genomics studies.
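The role of the recorded flanking sequences can be illustrated with a minimal sketch. It assumes, purely for illustration, that a curated motif is re-located in an updated promoter sequence by exact string matching of flank + motif + flank; the function name, the exact-match strategy and the toy sequences are hypothetical and are not taken from TRED.

```python
def relocate_motif(new_sequence, left_flank, motif, right_flank):
    """Return 0-based start positions of `motif` in `new_sequence`.

    Only occurrences surrounded by the recorded flanks are reported, so a
    short motif that occurs several times can still be placed unambiguously.
    Exact matching is an assumption made for this sketch.
    """
    anchored = left_flank + motif + right_flank
    hits = []
    start = new_sequence.find(anchored)
    while start != -1:
        # Report the position of the motif itself, not of the left flank.
        hits.append(start + len(left_flank))
        start = new_sequence.find(anchored, start + 1)
    return hits


# Toy sequence in which the motif occurs twice.
toy_sequence = "GGATTTCGCGCAAGGCCTTTCGCGCTTACC"
ambiguous = relocate_motif(toy_sequence, "", "TTTCGCGC", "")
resolved = relocate_motif(toy_sequence, "GCC", "TTTCGCGC", "TTA")
```

Without flanks the motif matches at two positions; with the flanks only one candidate remains, which is the kind of ambiguity the flanking sequences help to resolve.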

DATABASE CONSTRUCTION AND IMPLEMENTATION

A MySQL relational database was constructed for storage and query of the data. It includes three key entities: ‘Promoter’, ‘Gene’ and ‘Factor’. ‘Promoter’ is a weak entity because our model does not allow a promoter to exist without the associated gene. There are two key relationships: (i) a promoter regulates a gene, which is a many-to-one relationship; and (ii) a factor binds a promoter, which is a many-to-many relationship. Other entities in the relational schema include promoter qualities, binding motifs, binding qualities and external data sources. Other relationships include gene annotation, promoter supporting evidence, factor annotation and binding supporting evidence. An automated data look-up, integration and loading pipeline has been developed for easy population and updating of the database.
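The entity-relationship model described above might be sketched as follows. This is only an illustration: the table and column names are hypothetical, the actual TRED schema is not given in the text, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not TRED's actual tables.
SCHEMA = """
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE promoter (
    promoter_id INTEGER PRIMARY KEY,
    -- Many-to-one: each promoter must belong to exactly one gene.
    gene_id     INTEGER NOT NULL REFERENCES gene(gene_id),
    quality     INTEGER NOT NULL
);
CREATE TABLE factor (
    factor_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- Junction table for the many-to-many 'factor binds promoter' relationship.
CREATE TABLE binding (
    factor_id       INTEGER NOT NULL REFERENCES factor(factor_id),
    promoter_id     INTEGER NOT NULL REFERENCES promoter(promoter_id),
    binding_quality TEXT NOT NULL,
    PRIMARY KEY (factor_id, promoter_id)
);
"""


def build_demo_db():
    """Create an in-memory database populated with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO gene VALUES (1, 'GENE_A')")
    conn.executemany("INSERT INTO promoter VALUES (?, ?, ?)",
                     [(10, 1, 1), (11, 1, 3)])
    conn.executemany("INSERT INTO factor VALUES (?, ?)",
                     [(100, "E2F"), (101, "Myc")])
    conn.executemany("INSERT INTO binding VALUES (?, ?, ?)",
                     [(100, 10, "known"), (101, 10, "known"), (100, 11, "maybe")])
    conn.commit()
    return conn


def promoters_for_factor(conn, factor_name, binding_quality=None):
    """List promoters bound by a factor, optionally filtered by binding quality."""
    sql = ("SELECT p.promoter_id, g.name, p.quality, b.binding_quality "
           "FROM binding b "
           "JOIN factor f ON f.factor_id = b.factor_id "
           "JOIN promoter p ON p.promoter_id = b.promoter_id "
           "JOIN gene g ON g.gene_id = p.gene_id "
           "WHERE f.name = ?")
    params = [factor_name]
    if binding_quality is not None:
        sql += " AND b.binding_quality = ?"
        params.append(binding_quality)
    sql += " ORDER BY p.quality, p.promoter_id"
    return conn.execute(sql, params).fetchall()
```

The NOT NULL foreign key on `promoter.gene_id` is one way to express the weak-entity constraint, and the junction table carries the binding quality as an attribute of the relationship rather than of either entity.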

DATABASE CONTENT

TRED contains whole-genome promoter annotation for human, mouse and rat from both curation and computational prediction. The numbers of genes and promoters in the various quality categories are listed in Table 1. From our extensive literature curation, TRED holds functional annotations of hundreds of direct target genes for E2F and Myc in human, mouse and rat with concrete binding evidence (high binding quality levels) (Table 2). Many of them have experimentally verified promoter sequences and known E2F or Myc binding motifs. It also has a collection of thousands of genes shown to be regulated by E2F and Myc with lower binding confidence (e.g. only demonstrated by expression experiments or computational prediction). This is a more comprehensive collection than that recorded for these two transcription factors in the Transfac database (4). Some target genes for a few other transcription factors are also included in the current release of TRED. To provide users with further information on the genes, cross-references to other well-known databases such as GenBank, PubMed, GeneCards (14) and Transfac were established.

Table 1. Promoter and gene numbers in TRED, with gene numbers in parentheses

Promoter quality  1            2              3                4                5 + 6            Sum
Human             1971 (1779)  13 120 (9769)  24 563 (14 363)  9217 (7214)      9358 (8877)      58 229 (30 981)
Mouse             214 (179)    8385 (6675)    20 318 (12 122)  13 252 (10 812)  8595 (8442)      50 764 (31 683)
Rat               91 (84)      808 (534)      7157 (3987)      819 (614)        21 511 (21 437)  30 386 (26 064)

Promoter qualities ranked from high to low: 1, known, curated promoters; 2, known, pipeline-collected promoters; 3, predicted promoters with RefSeq evidence and putative promoters taking 5′ ends of RefSeq as TSSs; 4, predicted promoters with mRNA (other than RefSeq and EST) evidence; 5, predicted promoters with EST evidence; 6, predicted promoters supported only by gene prediction.

Promoters included in a higher-ranking category are automatically excluded from the lower-ranking categories.

Table 2. Numbers of curated E2F and Myc target promoters and genes, with gene numbers in parentheses

                             E2F targets                      Myc targets
                             Human        Mouse      Rat      Human       Mouse     Rat
Promoter quality (a)
  1                          221 (164)    59 (47)    9 (9)    298 (263)   5 (5)     4 (3)
  2                          388 (355)    14 (14)    0 (0)    1125 (651)  43 (31)   26 (15)
  3                          496 (454)    29 (29)    2 (2)    1230 (730)  59 (34)   90 (54)
  4                          249 (229)    18 (18)    0 (0)    15 (12)     7 (5)     4 (4)
  5 + 6                      239 (231)    21 (21)    0 (0)    8 (7)       10 (5)    4 (3)
  Sum                        1593 (1329)  141 (127)  11 (11)  2676 (785)  124 (38)  128 (62)
Binding quality (b)
  Known                      166 (127)    20 (13)    0 (0)    2667 (782)  70 (28)   108 (54)
  Likely                     10 (10)      0 (0)      0 (0)    4 (1)       10 (3)    9 (3)
  Maybe                      1217 (1048)  70 (69)    0 (0)    5 (3)       28 (7)    11 (5)
  Computationally predicted  200 (175)    51 (48)    11 (11)  0 (0)       0 (0)     0 (0)
  Sum                        1593 (1329)  141 (127)  11 (11)  2676 (785)  108 (38)  128 (62)

(a) Number breakdown by promoter quality (promoter quality definitions are the same as in Table 1).

(b) Number breakdown by E2F and Myc binding quality.

WEB INTERFACE

Data access and retrieval

A CGI/Perl-based web interface was built to facilitate easy visualization and retrieval of both single-gene-based and batch data. It provides the following major functions. Search promoters for a gene or a list of genes by gene name, GenBank ID or chromosome location (Figure 1). The resulting page contains all annotated promoters for the gene, ranked from the highest quality to the lowest. Links to gene information and promoter information (including the location of transcription factor binding sites) are provided by the hotlinks in the ‘Gene ID’ and ‘Promoter ID’ columns, respectively (see Figure 1). Sequences of desired promoters can be retrieved by checking the box to the left of each entry. The sequence length for retrieval can be set by the user, with the default being 1 kb (700 bp upstream and 299 bp downstream of the TSS). Promoter sequences of interest can also be conveniently sent to the ‘on-the-fly analysis’ page for further analysis (see below).

Figure 1. Sample pages showing the TRED user interface for gene promoter search, promoter search results, gene information and promoter information.
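The default retrieval window mentioned above (700 bp upstream and 299 bp downstream of the TSS) can be sketched as a coordinate calculation. The sketch assumes 1-based inclusive coordinates and counts the TSS itself as one base, so that 700 + 1 + 299 = 1000 bp; these conventions and the function names are assumptions for illustration, not details confirmed by the text.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def promoter_window(tss, strand, upstream=700, downstream=299):
    """Return (start, end), 1-based inclusive, of the window around a TSS.

    On the minus strand 'upstream' lies at higher genomic coordinates.
    The start is clamped to 1 near the beginning of a chromosome.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end


def extract_promoter(chrom_seq, tss, strand, upstream=700, downstream=299):
    """Slice the window out of a chromosome string, in transcript orientation."""
    start, end = promoter_window(tss, strand, upstream, downstream)
    end = min(end, len(chrom_seq))
    seq = chrom_seq[start - 1:end]  # convert 1-based inclusive to a Python slice
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
    return seq
```

For a TSS at position 1000 on the plus strand this gives the window 300 to 1299; on the minus strand it gives 701 to 1700, and the extracted sequence is reverse-complemented so that it reads in the direction of transcription.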

The gene information page displays the annotation and promoter links for a particular gene, as well as the transcription factors that regulate the gene, experimental evidence and literature references. A link is provided to locate the gene on the UCSC Genome Browser and access additional annotations (Figure 1). The promoter information page includes the genomic location of the promoter, annotation references and the sequence, with transcription factor binding sites marked and hotlinked to detailed binding information and literature references. A link is provided to locate the promoter on the UCSC Genome Browser and access its genomic context (Figure 1). Retrieve promoter sequences for all target genes of a transcription factor, with the option of filtering sequences by desired promoter and binding qualities (Figure 2). This conveniently produces good datasets for computational studies on transcriptional regulons and networks, as well as for the development and training of computational tools such as motif-finding programs.

Figure 2. Sample pages showing the TRED user interface for promoter retrieval for target genes of a transcription factor, on-the-fly sequence analysis results, transcription factor binding motif retrieval and binding site information.

Retrieve all binding motifs for a transcription factor (Figure 2). This can greatly facilitate the construction of transcription factor binding positional weight matrices (PWMs) for target gene identification and gene regulation studies. Browse the genome for genes/promoters located in a particular chromosome. Search for orthologous genes based on the annotation in Ensembl.
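As an illustration of how retrieved binding motifs could feed PWM construction, the following minimal sketch builds a log-odds matrix from aligned sites of equal length. The pseudocount of 1, the uniform background of 0.25 per base and all function names are assumptions made here; the text does not specify how PWMs are built from TRED data, and the three sites are made up.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per position) from aligned sites.

    Sites must contain only A, C, G or T; other symbols raise KeyError.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background)
                    for base in BASES})
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores for one window of PWM width."""
    return sum(column[base] for column, base in zip(pwm, window.upper()))


def best_hit(pwm, sequence):
    """Return (score, 0-based position) of the best-scoring window."""
    width = len(pwm)
    scored = [(score_window(pwm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)


# Made-up aligned sites and a toy promoter fragment.
toy_sites = ["TTTCGCGC", "TTTGGCGC", "TTTCCCGC"]
toy_pwm = build_pwm(toy_sites)
top_score, top_position = best_hit(toy_pwm, "AAAATTTCGCGCAAAA")
```

The pseudocount keeps unseen bases from producing a log of zero, which matters when only a handful of curated sites are available for a factor.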

On-the-fly analysis tools

On-the-fly analysis tools were implemented for sequences retrieved from TRED or imported from other resources (Figure 2). They currently include simple sequence manipulation and analysis tools for users' convenience, and motif-matching programs based on regular expressions and PWMs. DWE (15), a word-counting-based motif searching tool, and PromoterWise, a program specifically for pair-wise local alignment of promoters (E. Birney, unpublished), are also implemented. Promoters on various TRED sub-pages can be sent directly to these analysis tools at the click of a button. In addition to the on-the-fly tools, TRED also provides links to many other sequence analysis and motif-finding programs such as MEME (16) and Gibbs sampler (17).
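Regular-expression-based motif matching of the kind mentioned above can be sketched as follows. The sketch assumes the motif is given as an IUPAC consensus string and translates it into a regular expression; this is a generic approach, not a description of TRED's own tool, and the function names and example sequences are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[symbol] for symbol in consensus.upper())


def find_motif(sequence, consensus):
    """Return (0-based start, matched text) for every match on the given strand.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = re.compile("(?=(" + iupac_to_regex(consensus) + "))", re.IGNORECASE)
    return [(match.start(), match.group(1)) for match in pattern.finditer(sequence)]


# Toy sequences; the consensus strings are used only as examples.
exact_hits = find_motif("AACACGTGTTCACGTGA", "CACGTG")
degenerate_hits = find_motif("GGTTTCGCGCAA", "TTTSSCGC")
```

Only the given strand is scanned here; searching the reverse complement as well would be a natural extension.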

FUTURE DEVELOPMENTS

Updating of the genome-wide promoter annotation based on newer genome assembly releases can be automated and will be done for the next release. Promoter annotation for mammals other than human, mouse and rat will be carried out and included in TRED. For transcription factor binding and regulation information, literature curation is a continuing effort. We hope to complete curation of the target genes of cancer-related transcription factors in the near future, and eventually to expand to targets of other transcription factors.
REFERENCES (17 in total)

1.  Role for E2F in control of both DNA replication and mitotic functions as revealed from DNA microarray analysis.

Authors:  S Ishida; E Huang; H Zuzan; R Spang; G Leone; M West; J R Nevins
Journal:  Mol Cell Biol       Date:  2001-07       Impact factor: 4.272

2.  GeneCards 2002: towards a complete, object-oriented, human gene compendium.

Authors:  Marilyn Safran; Irina Solomon; Orit Shmueli; Michal Lapidot; Shai Shen-Orr; Avital Adato; Uri Ben-Dor; Nir Esterman; Naomi Rosen; Inga Peter; Tsviya Olender; Vered Chalifa-Caspi; Doron Lancet
Journal:  Bioinformatics       Date:  2002-11       Impact factor: 6.937

3.  DBTSS, DataBase of Transcriptional Start Sites: progress report 2004.

Authors:  Yutaka Suzuki; Riu Yamashita; Sumio Sugano; Kenta Nakai
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

4.  JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Authors:  Albin Sandelin; Wynand Alkema; Pär Engström; Wyeth W Wasserman; Boris Lenhard
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

Review 5.  Computational prediction of eukaryotic protein-coding genes.

Authors:  Michael Q Zhang
Journal:  Nat Rev Genet       Date:  2002-09       Impact factor: 53.242

6.  Computational identification of promoters and first exons in the human genome.

Authors:  R V Davuluri; I Grosse; M Q Zhang
Journal:  Nat Genet       Date:  2001-12       Impact factor: 38.330

7.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.

Authors:  C E Lawrence; S F Altschul; M S Boguski; J S Liu; A F Neuwald; J C Wootton
Journal:  Science       Date:  1993-10-08       Impact factor: 47.728

Review 8.  The role of E2F in the mammalian cell cycle.

Authors:  P J Farnham; J E Slansky; R Kollmar
Journal:  Biochim Biophys Acta       Date:  1993-08-23

9.  Genomic targets of the human c-Myc protein.

Authors:  Paula C Fernandez; Scott R Frank; Luquan Wang; Marianne Schroeder; Suxing Liu; Jonathan Greene; Andrea Cocito; Bruno Amati
Journal:  Genes Dev       Date:  2003-04-14       Impact factor: 11.361

10.  TRANSFAC: transcriptional regulation, from patterns to profiles.

Authors:  V Matys; E Fricke; R Geffers; E Gössling; M Haubrock; R Hehl; K Hornischer; D Karas; A E Kel; O V Kel-Margoulis; D-U Kloos; S Land; B Lewicki-Potapov; H Michael; R Münch; I Reuter; S Rotert; H Saxel; M Scheer; S Thiele; E Wingender
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971
