CTCFBSDB: a CTCF-binding site database for characterization of vertebrate genomic insulators.

Lei Bao, Mi Zhou, Yan Cui.

Abstract

Recent studies on transcriptional control of gene expression have pinpointed the importance of long-range interactions and three-dimensional organization of chromatin within the nucleus. Distal regulatory elements such as enhancers may activate transcription over long distances; hence, their action must be restricted within appropriate boundaries to prevent illegitimate activation of non-target genes. Insulators are DNA elements with enhancer-blocking and/or chromatin-bordering functions. In vertebrates, the versatile transcription regulator CCCTC-binding factor (CTCF) is the only identified trans-acting factor that confers enhancer-blocking insulator activity. CTCF-binding sites were found to be commonly distributed along the vertebrate genomes. We have constructed a CTCF-binding site database (CTCFBSDB) to characterize experimentally identified and computationally predicted CTCF-binding sites. Biological knowledge and data from multiple resources have been integrated into the database, including sequence data, genetic polymorphisms, function annotations, histone methylation profiles, gene expression profiles and comparative genomic information. A web-based user interface was implemented for data retrieval, analysis and visualization. In silico prediction of CTCF-binding motifs is provided to facilitate the identification of candidate insulators in the query sequences submitted by users. The database can be accessed at http://insulatordb.utmem.edu/

Year:  2007        PMID: 17981843      PMCID: PMC2238977          DOI: 10.1093/nar/gkm875

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

CCCTC-binding factor (CTCF) is a versatile transcription regulator that is evolutionarily conserved from fruit fly to human (1). CTCF binds to different DNA sequences by combinatorial use of its 11 zinc fingers and plays a key role in many chromatin insulation events [reviewed in (2)]. In eukaryotic genomes, chromatin is organized into distinct domains. The chromatin domain architecture is critical for transcription control. Insulators are the key DNA sequence elements that establish and maintain such domain boundaries (2–12). They represent a class of divergent DNA sequences capable of shielding genes from inappropriate cis-regulatory signals from the genomic neighborhood. There are two types of insulators: enhancer-blocking insulators that block enhancer–promoter communication and barrier insulators that protect against heterochromatin-mediated silencing (13). Many recent studies have been devoted to the identification and characterization of insulators. The CTCF-binding site is of particular interest because CTCF is the only protein identified so far in vertebrates that binds to enhancer-blocking insulators and shows enhancer-blocking activity. Recent studies also linked the CTCF-binding site to epigenetic processes, such as imprinting (14–16), X-chromosome inactivation (17,18) and interchromosomal colocalization (19). Despite their obvious importance, to our knowledge, there is no public database cataloging this type of regulatory element. In addition to dozens of well-characterized CTCF-binding sites with validated insulation functions that are scattered in the biomedical literature, several recent high-throughput ChIP-chip analyses and comparative genomic studies (20–23) have identified tens of thousands of potential CTCF-binding sites in human and mouse genomes. Here we report our effort in creating a CTCF-binding site database, a collection of experimentally identified and computationally predicted CTCF-binding sites. Biological knowledge and data from multiple resources were integrated to annotate the CTCF-binding sites. The database is designed to facilitate studies of insulators and their roles in regulating gene expression and demarcating functional genomic domains.

DATA SOURCES AND PROCESSING

Data sources

Experimentally identified and computationally predicted CTCF-binding sites are processed separately. First, 34 417 experimentally identified CTCF-binding sites are collected from four sources: (i) 110 manually curated CTCF-binding sites from biomedical literature, denoted by identifiers starting with ‘INSUL_MAN’, (ii) 244 mouse CTCF-binding sites identified by Ohlsson and coworkers using ChIP-chip assay (21), denoted by identifiers starting with ‘INSUL_OHL’, (iii) 13 801 human CTCF-binding sites identified by Ren and coworkers using ChIP-chip assay (20), denoted by identifiers starting with ‘INSUL_REN’ and (iv) 20 262 human CTCF-binding sites identified by Zhao and coworkers using massive direct sequencing of ChIP DNA (23), denoted by identifiers starting with ‘INSUL_ZHAO’. Second, we collected the conserved CTCF-specific sequence motifs (∼20 bp) in the human and mouse genomes that were predicted using motif scan (20,22). We excluded those (∼40%) that overlapped with any of the experimentally determined CTCF-binding sites. The resulting 18 905 entries include 7736 human and 5504 mouse CTCF-binding sites predicted in (20) and 5665 human CTCF-binding sites predicted in (22). The computationally predicted CTCF-binding sites have identifiers beginning with ‘INSUL_PRE’.
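
The identifier prefixes and entry counts listed above can be checked with a short sketch. It is only an illustration and not part of the database software: the function names are hypothetical, and the only identifier format assumed is that an entry ID begins with one of the prefixes named in the text (as in INSUL_MAN0004).

```python
# Identifier prefixes and entry counts as stated in the text.
EXPERIMENTAL_SOURCES = {
    "INSUL_MAN": 110,     # manually curated from the literature
    "INSUL_OHL": 244,     # mouse, ChIP-chip (21)
    "INSUL_REN": 13801,   # human, ChIP-chip (20)
    "INSUL_ZHAO": 20262,  # human, direct sequencing of ChIP DNA (23)
}
PREDICTED_PREFIX = "INSUL_PRE"
PREDICTED_COUNTS = [7736, 5504, 5665]  # human (20), mouse (20), human (22)


def source_prefix(entry_id: str) -> str:
    """Return the source prefix of an identifier (hypothetical helper)."""
    for prefix in list(EXPERIMENTAL_SOURCES) + [PREDICTED_PREFIX]:
        if entry_id.startswith(prefix):
            return prefix
    raise ValueError(f"unrecognized identifier: {entry_id}")


def is_experimental(entry_id: str) -> bool:
    """True for experimentally identified sites, False for predicted ones."""
    return source_prefix(entry_id) in EXPERIMENTAL_SOURCES


experimental_total = sum(EXPERIMENTAL_SOURCES.values())  # 34417
predicted_total = sum(PREDICTED_COUNTS)                  # 18905
```

The two sums reproduce the totals of 34 417 experimental and 18 905 predicted entries given in the text.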

Annotation of the CTCF-binding sites

Table 1 shows the major data fields of the database. For the 110 manually curated CTCF-binding sites, we used a set of controlled vocabularies to describe their properties. The ‘Validation Method’ field specifies whether CTCF binding was validated by in vitro and/or in vivo assays and whether this CTCF-binding sequence showed enhancer-blocking function in transgenic experiments (24). The ‘In situ Function’ field annotates the biological roles of a CTCF-binding site (CTCFBS) in its natural genomic context (enhancer-blocking, chromatin boundary, etc.). The ‘Description’ field contains other features of the CTCF-binding site (e.g. methylation-sensitivity of the CTCF binding). Genomic coordinates of the CTCF-binding sequences were determined using the BLAT alignment program (25). The genome assemblies used are hg18 for human, mm8 for mouse, rn3 for rat and galGal2 for chicken. CTCF-binding sequences that lack chromosome location information most likely map to unsequenced portions (e.g. heterochromatic regions) of the genome (21).

Table 1. Description of the fields

| Field name | Description |
| --- | --- |
| ID (a) | Unique identifier of an entry |
| Species (a) | Species name |
| Name | Name used by the authors in the original paper |
| Chromosome location | CTCF-binding site position |
| Orientation | Forward (+) or reverse (−) strand |
| 5′-Flanking gene | 5′-Flanking gene of the CTCF-binding site along the genome |
| 3′-Flanking gene | 3′-Flanking gene of the CTCF-binding site along the genome |
| Validation method (a) | The validation methods including in vitro binding, in vivo binding, enhancer-blocking assay and sequence analysis |
| In situ function | In situ function of the CTCF-binding site |
| Description | Other important features of the CTCF-binding site |
| Reference (a) | PubMed reference |
| Sequence (a) | DNA sequence of the CTCF-binding site |

(a) A mandatory field.

Sequence features of the CTCF-binding sites

The sequences of CTCF-binding sites in the database vary from 20 bp to several hundred bp. There are two reasons for this length heterogeneity. First, different experimental methods may locate CTCF-binding sites at different base-pair resolution. Second, different laboratories often have different research goals when they publish the original sequences. Some researchers may stop at a 500-bp region encapsulating the CTCF-binding sites while others may further narrow down to the sequences physically covered by the CTCF protein. Most CTCF-binding sites were found to share a 20-bp motif (20), which the database highlights using consecutive arrows. The direction of the arrows shows the genomic orientation of the motif. We also highlighted, using vertical indicators, all the single nucleotide polymorphisms (SNPs) in the dbSNP database (26) that are located in a CTCF-binding site. Mutations that disrupt CTCF-binding sites may lead to abnormal gene expression and cause diseases. Indeed, a recent study showed that inherited mutations that abolish CTCF-binding sites in the human H19 differentially methylated region (DMR) can cause Beckwith–Wiedemann syndrome (27,28). Thus, the naturally occurring mutations in CTCF-binding sites may represent new types of genetic variations that underlie phenotypes including disease status. To get more information about any of the SNPs, the user can click the SNP indicator to browse the corresponding dbSNP webpage (26).

Genomic context track

The genomic context of a CTCF-binding site provides clues for its in situ functions. The CTCF-binding site (red) and flanking genes within 100 kb distance are displayed using the UCSC genome browser (29) (Figure 1). Other CTCF-binding sites located in this genomic region are also displayed and different colors are used to distinguish the sources of the CTCF-binding sites: yellow for INSUL_MAN, blue for INSUL_OHL, green for INSUL_REN, cyan for INSUL_ZHAO and black for INSUL_PRE. An important function of CTCF-bound insulators is to demarcate transcriptionally active and silent chromatin domains, which are marked by distinct histone methylation patterns. A recent study provided high-resolution maps of histone methylations (chromatin domains) in the human genome (23). H3K4 trimethylation (H3K4me3) and H3K27 trimethylation (H3K27me3) are a pair of ‘Yin-Yang’ modifications: high levels of H3K4me3 and H3K27me3 mark gene activation and silencing, respectively (23). We integrated H3K4me3 and H3K27me3 maps with our genomic context track of CTCF-binding sites using the genome browser to facilitate the use of this valuable information.
Figure 1.

The genomic context of a few CTCF-binding sites. The CTCF-binding sites reside at the boundary between the two histone methylation domains (H3K4me3 and H3K27me3).

Flanking gene expression track

Another in situ function of insulators is to maintain independent expression patterns of neighboring genes. Suppose there is a tissue-specific enhancer that should control the transcription of one gene but not that of the other in a pair of neighboring genes. The CTCF-binding site located between the enhancer and the promoter of the second gene may function as an enhancer-blocking insulator to protect against illegitimate transcriptional activation. In this scenario, the neighboring genes may have very different expression status in that tissue. We created a flanking gene expression track to compare the expression patterns of the genes flanking the CTCF-binding site. The data were obtained from The Genomics Institute of the Novartis Research Foundation (GNF) Gene Expression Atlas 2 (30), which contains genome-wide gene expression profiles of 61 mouse tissues and 79 human tissues. The raw data were log-transformed (base 2) and normalized to have a mean of 0 and SD of 1. The expression images were created using the Slcview software (http://slcview.stanford.edu), in which red indicates overexpression and green indicates underexpression. An example of a gene expression track is shown in Figure 2.
Figure 2.

A CTCF-binding site (INSUL_MAN0004) webpage displays the flanking gene expression profiles and links to tracks of SNPs, genomic context and orthologous regions.
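
The normalization applied to the expression data for this track (log base 2, then scaling to mean 0 and SD 1) can be sketched as follows. The text does not say whether scaling was done per gene or per array, nor which SD estimator was used; this sketch assumes one profile per gene across tissues and the population SD, and the input values are made up.

```python
import math
import statistics


def normalize_profile(raw_values):
    """Log2-transform one expression profile, then scale to mean 0 and SD 1.

    Assumes all raw values are positive (log2 is undefined otherwise) and
    uses the population SD, which is an assumption rather than a detail
    given in the text.
    """
    logged = [math.log2(v) for v in raw_values]
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    if sd == 0:
        raise ValueError("profile is constant; SD is zero")
    return [(x - mean) / sd for x in logged]


# Made-up raw intensities for one gene across four tissues.
profile = normalize_profile([100.0, 200.0, 400.0, 800.0])
```

After this transformation, positive values correspond to above-average expression of the gene in a tissue (shown in red) and negative values to below-average expression (shown in green).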

Mammalian orthologous region track

Comparative genomic studies on human, mouse and rat may provide insights into the evolution of CTCF-binding sites. To this end, we created a track of mammalian orthologous regions. For any of the three genomes, the regions containing CTCF-binding sites and flanking genes were used to query orthologous regions in the other two genomes from the UCSC precomputed block chains (31,32). Only the DNA blocks with the maximal alignment score against the query region were retained as orthologous regions. The aligned orthologous sequences in up to 16 vertebrate genomes can be displayed by clicking the ‘view alignment’ button (Figure 2).
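
The rule of keeping only the block with the maximal alignment score can be illustrated with a minimal sketch. The record type below is hypothetical and does not follow the UCSC chain file layout; the coordinates and scores are made up.

```python
from typing import NamedTuple, Optional


class ChainBlock(NamedTuple):
    # Hypothetical record for one aligned block in the target genome.
    target_chrom: str
    target_start: int
    target_end: int
    score: int


def best_orthologous_block(blocks) -> Optional[ChainBlock]:
    """Return the block with the highest alignment score, or None if empty."""
    return max(blocks, key=lambda b: b.score, default=None)


# Made-up candidate blocks for a single query region.
candidates = [
    ChainBlock("chr7", 1000, 5000, 12000),
    ChainBlock("chr7", 900000, 903000, 48000),
    ChainBlock("chr2", 40000, 41000, 3000),
]
best = best_orthologous_block(candidates)
```

If several blocks share the top score, `max` returns the first one encountered; the text does not say how ties were resolved.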

CTCF-binding site prediction

CTCF uses different combinations of its zinc fingers to recognize divergent DNA sequences. Recent studies have identified core motifs for CTCFBS sequences (20,22). The motifs are represented by position weight matrices (PWMs). Altogether, four closely related PWMs have been derived to accommodate the sequence divergences in CTCF-binding sites (20,22). The database provides a simple web tool to search for the core CTCF-binding motifs in a query sequence. It uses the STORM program (33) to scan for each of the four PWMs in the query sequences and reports the best hits.
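
To make the idea of scanning a query sequence with a PWM concrete, here is a generic log-odds scan over both strands. It is not the STORM program, and the 4-bp matrix is a toy example rather than one of the four published CTCF matrices; the uniform background and all function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_odds(pwm_probs, background=0.25):
    """Convert per-position base probabilities to log2-odds scores."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in pwm_probs]


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def best_hit(sequence, pwm):
    """Return (score, start, strand) of the best window, or None if too short.

    Assumes an uppercase sequence containing only A, C, G and T.
    """
    width = len(pwm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(col[base] for col, base in zip(pwm, site))
            if best is None or score > best[0]:
                best = (score, start, strand)
    return best


def column(preferred):
    # Toy column: 0.85 for the preferred base, 0.05 for each other base.
    return {b: (0.85 if b == preferred else 0.05) for b in BASES}


TOY_PWM = log_odds([column(b) for b in "CCCT"])  # toy motif, not a CTCF PWM
hit = best_hit("AATTACCCTAAT", TOY_PWM)
```

Scanning with several matrices, as the web tool does with its four PWMs, amounts to calling `best_hit` once per matrix and reporting each result.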

UTILITY AND DISCUSSION

First, a web interface was developed for browsing the experimentally identified and computationally predicted CTCF-binding sites. Users can focus on entries of interest using four selection controls: Species, Validation Method, In Situ Function and Description. The in situ function of most known CTCF-binding sites is to act as a boundary element. However, in some biological contexts, CTCF-binding sites may also function as elements for transcription activation/repression [reviewed in (34)]. Second, a text search interface was developed for querying the database. Users can search for CTCF-binding sites by element name or by the PubMed identifier of the original literature. A useful approach is to retrieve the CTCF-binding sites adjacent to a gene of interest by entering an official gene symbol or words used in the gene description. Third, the database provides sequence similarity search (35) for the comparison between query sequences and CTCF-binding sequences. Finally, an option of genomic range search is provided. Users can specify a genomic interval and retrieve all the CTCF-binding sites in the interval. To maintain an up-to-date resource, we encourage researchers to submit newly identified CTCFBS sequences to the database. Data can be submitted directly through a web interface. The submissions will be manually checked before being added to the database. The database is an integrative platform for storing, retrieving and characterizing vertebrate genomic insulators. We envision that as more experimentally validated CTCFBS sequences become available in the database, a comprehensive analysis of these sequences may facilitate the extraction of meaningful sequence signals, uncover the functional basis of insulators, and ultimately enable the mapping of every distinct transcription domain along the genomes.
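
The genomic range search described above can be pictured as an interval query. The sketch below is an assumption about how such a query might behave, not the database's implementation: the record layout and site names are hypothetical, the coordinates are made up, and a site is returned if it overlaps the interval (the text does not say whether full containment is required).

```python
from typing import NamedTuple


class Site(NamedTuple):
    # Hypothetical record; real entries carry more fields (see Table 1).
    site_id: str
    chrom: str
    start: int
    end: int


def sites_in_range(sites, chrom, start, end):
    """Return sites on `chrom` that overlap the closed interval [start, end]."""
    return [
        s for s in sites
        if s.chrom == chrom and s.start <= end and s.end >= start
    ]


# Made-up entries for illustration only.
example_sites = [
    Site("site_a", "chr11", 1000, 1020),
    Site("site_b", "chr11", 5000, 5400),
    Site("site_c", "chr7", 1000, 1020),
]
found = sites_in_range(example_sites, "chr11", 900, 5100)
```

A linear scan is enough for a sketch; for tens of thousands of entries, a sorted index or interval tree per chromosome would be a natural choice.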
  35 in total

1.  dbSNP: the NCBI database of genetic variation.

Authors:  S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

Review 2.  Insulators and boundaries: versatile regulatory elements in the eukaryotic genome.

Authors:  A C Bell; A G West; G Felsenfeld
Journal:  Science       Date:  2001-01-19       Impact factor: 47.728

Review 3.  CTCF is a uniquely versatile transcription regulator linked to epigenetics and disease.

Authors:  R Ohlsson; R Renkawitz; V Lobanenkov
Journal:  Trends Genet       Date:  2001-09       Impact factor: 11.639

4.  CTCF, a candidate trans-acting factor for X-inactivation choice.

Authors:  Wendy Chao; Khanh D Huynh; Rebecca J Spencer; Lance S Davidow; Jeannie T Lee
Journal:  Science       Date:  2001-12-06       Impact factor: 47.728

5.  BLAT--the BLAST-like alignment tool.

Authors:  W James Kent
Journal:  Genome Res       Date:  2002-04       Impact factor: 9.043

Review 6.  Insulators: many functions, many mechanisms.

Authors:  Adam G West; Miklos Gaszner; Gary Felsenfeld
Journal:  Genes Dev       Date:  2002-02-01       Impact factor: 11.361

Review 7.  Setting the boundaries of chromatin domains and nuclear organization.

Authors:  Mariano Labrador; Victor G Corces
Journal:  Cell       Date:  2002-10-18       Impact factor: 41.582

8.  Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene.

Authors:  A C Bell; G Felsenfeld
Journal:  Nature       Date:  2000-05-25       Impact factor: 49.962

9.  The UCSC Genome Browser Database.

Authors:  D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

10.  High-resolution profiling of histone methylations in the human genome.

Authors:  Artem Barski; Suresh Cuddapah; Kairong Cui; Tae-Young Roh; Dustin E Schones; Zhibin Wang; Gang Wei; Iouri Chepelev; Keji Zhao
Journal:  Cell       Date:  2007-05-18       Impact factor: 41.582
