
ABS: a database of Annotated regulatory Binding Sites from orthologous promoters.

Enrique Blanco, Domènec Farré, M Mar Albà, Xavier Messeguer, Roderic Guigó.

Abstract

Information about the genomic coordinates and sequences of experimentally identified transcription factor binding sites is scattered across a variety of formats. Standard collections of such high-quality data are important for designing, evaluating and improving novel computational approaches that identify binding motifs in promoter sequences from related genes. ABS (http://genome.imim.es/datasets/abs2005/index.html) is a public database of known binding sites identified in promoters of orthologous vertebrate genes that have been manually curated from the literature. We have annotated 650 experimental binding sites from 68 transcription factors and 100 orthologous target genes in human, mouse, rat or chicken genome sequences. Computational predictions and promoter alignment information are also provided for each entry. A simple, easy-to-use web interface facilitates data retrieval and offers different views of the information. In addition, release 1.0 of ABS includes a customizable generator of artificial datasets based on the known sites contained in the collection and an evaluation tool to aid in the training and assessment of motif-finding programs.

Substances:
Transcription Factors

Year: 2006 PMID: 16381947 PMCID: PMC1347478 DOI: 10.1093/nar/gkj116

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Expression of genes is regulated at many different levels, transcription of DNA being one of the most critical stages. Specific configurations of transcription factors (TFs) that interact with gene promoter regions are recruited to activate or modulate the production of a given transcript. Many of these TFs are able to recognize a small set of genomic sequence footprints called TF-binding sites (TFBSs). These motifs are typically 6–15 bp long and, in some cases, show a high degree of variability. In addition, many motifs may be recognized ambiguously by members of different TF families. Because of these flexible binding rules, computational methods for the identification of regulatory elements in a promoter sequence tend to produce an overwhelming number of false positives. However, the identification of conserved regulatory elements present in orthologous gene promoters (also called phylogenetic footprinting) has proved more effective for characterizing such sequences (1–3). In fact, the ever-growing availability of genomes and the constant improvement of bioinformatics algorithms hold great promise for unveiling the overall network of gene interactions of each organism (4). Typically, each computational method for detecting regulatory elements uses its own training set of experimentally annotated TFBSs. These annotations are usually collected from the literature or from general repositories of gene regulation information, such as JASPAR (5) and TRANSFAC (6). However, each program establishes different criteria and formats to retrieve and display the data that form the final training set, which makes comparison between methods very difficult. The construction of a good benchmark to evaluate the accuracy of several pattern discovery methods is therefore not a trivial procedure (7).
Although important efforts are being made to standardize the construction of collections of promoter regions (8) and the presentation of experimental data (9), there is a clear need for stable and common datasets for future algorithmic developments. To this end, we present here release 1.0 of the ABS database, constructed from literature annotations that have been experimentally verified in human, mouse, rat or chicken.

DATABASE CONSTRUCTION

We have gathered from the literature a collection of experimentally validated binding sites that are conserved in at least two orthologous vertebrate promoters. The sites and the promoter sequences have been manually curated to ensure data consistency. The compiled data are suitable for training both classical pattern discovery programs and emerging comparative methods. Flat files complying with the standard GFF format were used to store and query the information. The GenBank accession number of the sequences in each bibliographical reference was used to retrieve the promoters. These sequences were mapped onto the corresponding RefSeq annotations to ensure that we were retrieving the actual promoter. Finally, the DBTSS database (10) was used to refine the annotation of the transcription start sites (TSSs). Because most experimental studies have focused on this region, we considered the 500 bp immediately upstream of the annotated TSS as the promoter region in this first release. For each annotated promoter, we only included experimentally tested sites in this proximal region whose motifs were correctly identified in at least two species, i.e. orthologous sites. Every known binding site was mapped onto the corresponding promoter sequence by BLASTN (11). Matches with <80% identity between the sequence of the original site and the mapped motif in the promoter region were rejected. We computed BLASTN (11), CLUSTALW (12), AVID (13) and LAGAN (14) alignments of the orthologous promoters from each gene. Moreover, we produced a dotplot of word matches with EMBOSS (15) to visualize unusually conserved regions. For comparative assessments, computational predictions were calculated using the JASPAR (5), TRANSFAC (6) and PROMO (16) collections of position weight matrices. A very restrictive threshold of 0.85 was applied, removing predicted TFBSs that scored below this value, to create a first group of more reliable predictions.
A second group of predictions was produced using a more permissive threshold of 0.70 (see the ABS website for further information about the scoring method).
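
The filtering rules described in this section can be illustrated with a minimal Python sketch. It is not the ABS pipeline: the function names, the 0-based forward-strand coordinate convention, the ungapped identity calculation (BLASTN alignments may contain gaps) and the assumption that prediction scores lie between 0 and 1 are all hypothetical choices made for illustration. Only the 500 bp window, the 80% identity cut-off and the 0.85 and 0.70 thresholds come from the text.

```python
PROMOTER_LENGTH = 500          # bp upstream of the TSS, as stated in the text
MIN_IDENTITY = 0.80            # matches below 80% identity were rejected
STRICT, PERMISSIVE = 0.85, 0.70  # the two prediction thresholds from the text


def promoter_window(sequence, tss, length=PROMOTER_LENGTH):
    """Return the bases immediately upstream of the TSS.

    Assumption: `tss` is a 0-based index on the forward strand.
    """
    start = max(0, tss - length)
    return sequence[start:tss]


def identity(site, mapped):
    """Fraction of identical positions between two equal-length strings.

    Assumption: the comparison is ungapped, which simplifies what an
    alignment program such as BLASTN reports.
    """
    if not site or len(site) != len(mapped):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(site.upper(), mapped.upper()))
    return matches / len(site)


def keep_mapping(site, mapped):
    """Keep a mapped site only if it reaches the identity cut-off."""
    return identity(site, mapped) >= MIN_IDENTITY


def split_by_threshold(predictions):
    """Split (name, score) predictions into the two groups.

    The permissive group is a superset of the strict group.
    """
    strict = [p for p in predictions if p[1] >= STRICT]
    permissive = [p for p in predictions if p[1] >= PERMISSIVE]
    return strict, permissive
```

For instance, a 6 nt site that differs from its mapped motif at one position has an identity of 5/6 (about 83%) and would be kept, whereas two mismatches (about 67%) would lead to rejection.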

DATABASE CONTENTS

Release 1.0 of the ABS database contains 100 annotated orthologous genes, each entry corresponding to two or more species. The total number of promoter sequences is 211 (105 500 nt). Human and mouse annotations clearly predominate: 73 entries contain annotations for at least the human–mouse ortholog pair. A total of 650 experimental binding sites from 68 different TFs are associated with ABS entries, covering 8624 nt. On average, three TFBSs have been mapped onto each promoter sequence, with an average length of 13.3 nt per site. As expected, the majority of the TFBSs are found near the TSS. The most frequent TFs are TBP (14.6% of sites), SP1 (13.6%) and CEBP (5.6%). These TFs are known to be part of the core of many eukaryotic promoters (see the ABS documentation for further details about the contents of the database).
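
The summary figures above follow from four reported counts, as the short Python check below shows. The variable names are illustrative, and the only inputs are numbers stated in the text.

```python
promoters = 211            # promoter sequences in release 1.0
promoter_length_nt = 500   # each promoter is the 500 bp upstream of the TSS
sites = 650                # experimental binding sites
site_nt = 8624             # nucleotides covered by those sites

total_nt = promoters * promoter_length_nt   # total promoter sequence
sites_per_promoter = sites / promoters      # average sites per sequence
mean_site_length = site_nt / sites          # average site length

print(f"total promoter sequence: {total_nt} nt")
print(f"sites per promoter: {sites_per_promoter:.2f}")
print(f"mean site length: {mean_site_length:.1f} nt")
```

The results agree with the reported 105 500 nt, roughly three sites per promoter and 13.3 nt per site.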

WEB INTERFACE

Data retrieval

The ABS database can be accessed through a simple CGI/Perl-based web interface. On-line documentation and tutorials are provided for each web service. The following functionalities are implemented in the current release (see Figure 1):

Figure 1

Examples of the ABS data retrieval system showing the annotation of a gene, the set of binding motifs from a given TF in human and mouse and the extraction of the promoter sequences containing such annotations.

(i) For each gene in the collection, show the orthologous promoters and a list of experimentally verified TFBSs annotated on the corresponding sequence. Promoter sequence alignments, computational predictions, dotplots and cross-references to other well-known databases, such as GenBank, Entrez Gene and PubMed, are also provided for each annotation.
(ii) Retrieve all of the binding motifs associated with a given TF, filtering by species. Moreover, a global alignment of the motifs is provided and the corresponding sequence logo representation is displayed by using WebLogo (17). This information could be used to produce new profiles for subsequent detection of this TF in other promoters.
(iii) Retrieve all of the promoter sequences in which at least one binding site for a given TF was annotated. These sequences and the associated motifs could be used to generate datasets based on known sites to train motif-finding programs.

The gene catalogue, the promoter sequences, the collection of annotations, the sequence alignments and the computational predictions are also individually distributed in several flat files.
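
As an illustration of how aligned motifs retrieved for one TF could be turned into a new profile, the following Python sketch builds a position count matrix, a pseudocount-smoothed frequency matrix and a consensus string. It is a sketch under stated assumptions, not a procedure described by ABS: the function names and the pseudocount are hypothetical, the motifs are assumed to be ungapped, of equal length and restricted to A, C, G and T, and the sequences in the usage note are invented rather than taken from the database.

```python
from collections import Counter

BASES = "ACGT"


def count_matrix(motifs):
    """Column-wise base counts for pre-aligned, equal-length motifs."""
    if not motifs:
        raise ValueError("at least one motif is required")
    width = len(motifs[0])
    if any(len(m) != width for m in motifs):
        raise ValueError("motifs must be aligned to the same length")
    columns = []
    for i in range(width):
        counts = Counter(m[i].upper() for m in motifs)
        columns.append({base: counts.get(base, 0) for base in BASES})
    return columns


def frequency_matrix(motifs, pseudocount=1.0):
    """Per-column base frequencies, smoothed with a pseudocount.

    Assumption: every character is one of A, C, G or T, so each column
    sums to 1 after smoothing.
    """
    total = len(motifs) + 4 * pseudocount
    return [
        {base: (column[base] + pseudocount) / total for base in BASES}
        for column in count_matrix(motifs)
    ]


def consensus(motifs):
    """Most frequent base per column; ties resolve in A, C, G, T order."""
    return "".join(
        max(BASES, key=lambda base: column[base])
        for column in count_matrix(motifs)
    )
```

With the invented motifs `["TATAAA", "TATATA", "TATAAG"]`, `consensus` yields `TATAAA`, and the first column of `frequency_matrix` assigns T a frequency of 4/7 under the default pseudocount.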

Benchmarking and evaluation tools

The ABS database aims to become a platform to evaluate new algorithms for the discovery of novel regulatory elements in a set of related gene promoters (e.g. orthologous promoters or co-regulated genes from microarray experiments). In addition to the data retrieval functions, two on-line applications are available to perform the benchmarking of such algorithms (see Figure 2):

Figure 2

Protocol to evaluate the accuracy of an external motif-finding program on a synthetic dataset generated by planting motifs from ABS in randomly generated sequences.

Constructor is a web server that produces synthetic datasets based on the ABS annotations. The design of the benchmark is highly flexible: users can customize the number of sequences, their length, the background nucleotide distribution, the number of motifs to plant in them, the probability of planting a motif, and the species and TFs whose associated motifs will be randomly selected from the collection of known sites. The output consists of the artificial sequences with the embedded motifs, the list of motifs and a graphical representation of their occurrences in the sequences, produced with the program gff2ps (18).

Evaluator is a web server that determines the accuracy of a set of predicted motifs in several sequences, using a list of known binding sites as a reference set. Both sets must be provided by the user in GFF format. A complete accuracy assessment at both nucleotide and site levels is computed using the standard measures in the field (7,19).
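
The protocol of planting known motifs and then scoring predictions against them can be sketched in Python as follows. This is not the code behind Constructor or Evaluator, and every function name is hypothetical. The sketch assumes at most one planted motif per sequence, takes sensitivity and positive predictive value as the accuracy measures, and counts a site as found when any predicted site overlaps it by at least one base. The measures and overlap rule actually used by ABS are not specified in the text.

```python
import random

BASES = "ACGT"


def random_sequence(length, rng, weights=(0.25, 0.25, 0.25, 0.25)):
    """Draw a background sequence from a fixed nucleotide distribution."""
    return "".join(rng.choices(BASES, weights=weights, k=length))


def build_dataset(motifs, n_sequences, length, plant_probability, seed=0):
    """Plant at most one motif per sequence (a simplification).

    Sites are (sequence index, start, end), 0-based and half-open.
    """
    rng = random.Random(seed)
    sequences, known = [], []
    for index in range(n_sequences):
        sequence = random_sequence(length, rng)
        if rng.random() < plant_probability:
            motif = rng.choice(motifs)
            start = rng.randrange(length - len(motif) + 1)
            end = start + len(motif)
            sequence = sequence[:start] + motif + sequence[end:]
            known.append((index, start, end))
        sequences.append(sequence)
    return sequences, known


def parse_gff_line(line):
    """Read name, start and end from one tab-separated GFF record.

    GFF coordinates are 1-based and inclusive; they are converted to the
    0-based, half-open convention used above.
    """
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[3]) - 1, int(fields[4])


def covered(sites):
    """Set of (sequence, position) pairs covered by the sites."""
    return {(seq, pos) for seq, start, end in sites for pos in range(start, end)}


def nucleotide_accuracy(known, predicted):
    """Nucleotide-level sensitivity and positive predictive value."""
    k, p = covered(known), covered(predicted)
    tp = len(k & p)
    return (tp / len(k) if k else 0.0, tp / len(p) if p else 0.0)


def overlaps(a, b):
    """True if two sites share a sequence and at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def site_accuracy(known, predicted):
    """Site-level sensitivity and positive predictive value (assumed rule)."""
    found = sum(any(overlaps(k, p) for p in predicted) for k in known)
    correct = sum(any(overlaps(p, k) for k in known) for p in predicted)
    return (found / len(known) if known else 0.0,
            correct / len(predicted) if predicted else 0.0)
```

A known site at positions 10 to 20 and a prediction at 15 to 25 on the same sequence share five bases, so both nucleotide-level measures equal 0.5, while the site-level measures equal 1.0 under the one-base overlap rule.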

CONCLUSIONS AND FUTURE WORK

The ABS database has been developed to fill the existing gap in the availability of consistent datasets to train and compare different pattern discovery programs. The lack of standard collections of TFBSs is especially serious in the case of phylogenetic footprinting data. The collection described here contains 650 experimental TFBSs identified in human, mouse, rat and chicken genes. Orthologous promoter sequences and their binding sites have been manually curated from the literature. Supplementary information about the promoters is also provided for each entry. In addition, two web applications (Constructor and Evaluator) are included in this first release to facilitate the development of new motif-finding programs using the ABS annotations. In the next release, we plan to increase the number of annotations by adding known sites in regulatory regions other than the proximal promoter and, eventually, to incorporate binding motifs from other species.

19 in total

1. EMBOSS: the European Molecular Biology Open Software Suite.

Authors: P Rice; I Longden; A Bleasby
Journal: Trends Genet Date: 2000-06 Impact factor: 11.639

2. gff2ps: visualizing genomic annotations.

Authors: J F Abril; R Guigó
Journal: Bioinformatics Date: 2000-08 Impact factor: 6.937

3. Discovery of regulatory elements by a computational method for phylogenetic footprinting.

Authors: Mathieu Blanchette; Martin Tompa
Journal: Genome Res Date: 2002-05 Impact factor: 9.043

4. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA.

Authors: Michael Brudno; Chuong B Do; Gregory M Cooper; Michael F Kim; Eugene Davydov; Eric D Green; Arend Sidow; Serafim Batzoglou
Journal: Genome Res Date: 2003-03-12 Impact factor: 9.043

5. DBTSS, DataBase of Transcriptional Start Sites: progress report 2004.

Authors: Yutaka Suzuki; Riu Yamashita; Sumio Sugano; Kenta Nakai
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

6. AVID: A global alignment program.

Authors: Nick Bray; Inna Dubchak; Lior Pachter
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

7. Identification of patterns in biological sequences at the ALGGEN server: PROMO and MALGEN.

Authors: Domènec Farré; Romà Roset; Mario Huerta; José E Adsuara; Llorenç Roselló; M Mar Albà; Xavier Messeguer
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

8. Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover.

Authors: Emmanouil T Dermitzakis; Andrew G Clark
Journal: Mol Biol Evol Date: 2002-07 Impact factor: 16.240

9. Distinguishing regulatory DNA from neutral sites.

Authors: Laura Elnitski; Ross C Hardison; Jia Li; Shan Yang; Diana Kolbe; Pallavi Eswara; Michael J O'Connor; Scott Schwartz; Webb Miller; Francesca Chiaromonte
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

10. TRANSFAC: transcriptional regulation, from patterns to profiles.

Authors: V Matys; E Fricke; R Geffers; E Gössling; M Haubrock; R Hehl; K Hornischer; D Karas; A E Kel; O V Kel-Margoulis; D-U Kloos; S Land; B Lewicki-Potapov; H Michael; R Münch; I Reuter; S Rotert; H Saxel; M Scheer; S Thiele; E Wingender
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971
