
A new generation of JASPAR, the open-access repository for transcription factor binding site profiles.

Dominique Vlieghe, Albin Sandelin, Pieter J De Bleser, Kris Vleminckx, Wyeth W Wasserman, Frans van Roy, Boris Lenhard.

Abstract

JASPAR is the most complete open-access collection of transcription factor binding site (TFBS) matrices. In this new release, JASPAR grows into a meta-database of collections of TFBS models derived by diverse approaches. We present JASPAR CORE--an expanded version of the original, non-redundant collection of annotated, high-quality matrix-based transcription factor binding profiles, JASPAR FAM--a collection of familial TFBS models and JASPAR phyloFACTS--a set of matrices computationally derived from statistically overrepresented, evolutionarily conserved regulatory region motifs from mammalian genomes. JASPAR phyloFACTS serves as a non-redundant extension to JASPAR CORE, enhancing the overall breadth of JASPAR for promoter sequence analysis. The new release of JASPAR is available at http://jaspar.genereg.net.

Year:  2006        PMID: 16381983      PMCID: PMC1347477          DOI: 10.1093/nar/gkj115

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Methods for computational discovery and analysis of regulatory sequences are becoming increasingly important for the interpretation of genome and transcriptome data. Reliable prediction of cis-regulatory elements is critically dependent upon access to high-quality models for the binding specificity of transcription factors (TFs) (1). These models are predominantly defined by ungapped alignments of bona fide TF binding sites (TFBSs), summarized as count matrices (also referred to as matrix profiles) (2). The JASPAR database, the largest open-access collection of TFBS matrix profiles, is used as a fundamental component within a growing number of bioinformatic tools (3–8). The initial release of the JASPAR database (9) contained extensively curated, non-redundant profiles derived from published collections of TFBSs from multicellular eukaryotes. Those high-quality profiles remain the collection of choice for the detection of putative binding sites resembling target sequences of known TFs. Since laboratory-based elucidation of bona fide TFBSs is time-consuming and labor-intensive, only a fraction of all TFs have a defined binding profile. Because conclusively validated data arrive slowly, expansion of the curated JASPAR binding profiles is correspondingly slow. In the meantime, binding model collections based on other approaches, such as in silico pattern discovery, have emerged. While researchers may prefer to use highly curated profiles from bona fide TFBSs, the new collections offer great utility for genome-scale analysis. Compared with the original JASPAR collection, such datasets differ both in the methods used for their generation and in the level of biological evidence. They should therefore not be indiscriminately added to the collection, but are nevertheless valuable for the exploration of the content of regulatory regions.
To address the need and desire for access to a broader range of binding profiles, we present an expansion of JASPAR into a meta-repository for TF binding profiles, within which profiles are divided into distinct subsets that differ in data generation methodology.
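The count matrices mentioned above can be illustrated with a minimal sketch. It assumes a set of equal-length, ungapped binding sites; the function name `count_matrix` and the example sites are hypothetical and are not taken from JASPAR.

```python
def count_matrix(sites, alphabet="ACGT"):
    """Summarize equal-length, ungapped sites as {base: [count per position]}."""
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length (ungapped alignment)")
    matrix = {base: [0] * width for base in alphabet}
    for site in sites:
        for pos, base in enumerate(site.upper()):
            if base not in matrix:
                raise ValueError(f"unexpected symbol {base!r} in site {site!r}")
            matrix[base][pos] += 1
    return matrix


# Hypothetical aligned sites, for illustration only.
sites = ["TATAAA", "TATATA", "TATAAG", "CATAAA"]
matrix = count_matrix(sites)
for base in "ACGT":
    print(base, matrix[base])
```

Each column of the resulting matrix sums to the number of aligned sites, which corresponds to the sequence depth of the profile.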

APPLYING THE JASPAR PARTITIONS FOR GENOME ANALYSIS

In this release, JASPAR contains three distinct subsets (Table 1):
Table 1. Summary of the database components

| Data collection | JASPAR CORE | JASPAR FAM | JASPAR phyloFACTS |
| --- | --- | --- | --- |
| Keywords | Non-redundant, literature curated models (9) | Meta-models for structural classes of TFs (10) | Data-mined profiles using phylogenetic pattern finding (11) |
| Number of models | 123 | 11 | 174 |
| Mean information content (bits) | 12.1 | 8.1 | 15.6 |
| Mean profile sequence depth | 33.9 | 100 (a) | 1598.5 |
| Number of structural TF classes | 26 | 11 | NA (b) |
| Anonymous MySQL access (c) | JASPAR_CORE_2005 | JASPAR_FAM_2005 | JASPAR_PHYLOFACTS_2005 |

(a) The sequence depths for the meta-models in JASPAR FAM are normalized to 100.

(b) As the patterns in JASPAR phyloFACTS are not experimentally linked to cognate factors, the structural classes are unknown.

(c) The MySQL server is at jaspar.genereg.net (user: anonymous, password: jaspar).
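Table 1 reports a mean information content in bits for each collection. The sketch below shows one common way to compute the information content of a single count matrix. It assumes a uniform background over the four bases and applies no small-sample correction; the exact procedure used for the values in Table 1 is not stated here, so this is an illustration rather than a reproduction. The function names and the example matrix are hypothetical.

```python
import math


def column_information(counts):
    """Information content (bits) of one column, assuming a uniform background."""
    total = sum(counts)
    if total == 0:
        raise ValueError("column has no observations")
    # Maximum possible information is log2(alphabet size), i.e. 2 bits for DNA.
    info = math.log2(len(counts))
    for count in counts:
        if count > 0:
            p = count / total
            info += p * math.log2(p)  # subtracts the column entropy
    return info


def information_content(matrix):
    """Sum the per-column information content over all positions."""
    columns = zip(*matrix.values())  # one tuple of base counts per position
    return sum(column_information(column) for column in columns)


# Hypothetical count matrix built from four aligned sites.
example = {
    "A": [0, 4, 0, 4, 3, 3],
    "C": [1, 0, 0, 0, 0, 0],
    "G": [0, 0, 0, 0, 0, 1],
    "T": [3, 0, 4, 0, 1, 0],
}
print(round(information_content(example), 2))
```

A fully conserved position contributes 2 bits and a position with equal counts for all four bases contributes 0 bits, so longer and more conserved profiles score higher.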

JASPAR CORE. This collection corresponds to an expansion of the original set of JASPAR profiles. The 123 binding models in JASPAR CORE are non-redundant and based on experimentally defined TFBSs from published reports, subject to rigorous curation. The methods used for collection and alignment have been described previously (9). In this release, the names of some of the profiles were changed to match the official HGED or MGED symbols of the corresponding TFs where applicable; their JASPAR IDs were not changed.
JASPAR FAM. The JASPAR FAM partition houses familial binding profiles (also referred to as ‘consensus profiles’) for 11 major structural classes of factors. The collection facilitates prediction of TF binding domain structures based on profile information alone (10). These models are especially suitable for gene- and genome-wide exploratory searches in cases where there is no prior knowledge of cognate factors. As only a fraction of TFs are well characterized, factor-specific profiles are lacking for most TFs. The familial models in JASPAR FAM can be used as proxy profiles for uncharacterized TFs within TF structural families known to bind similar target sequences. The construction and application of familial binding profiles are described in Ref. (10).
JASPAR phyloFACTS. This new subset of the database contains a set of matrices that are derived from evolutionarily conserved sequences in the regulatory regions of mammalian genes. The profiles were based on a recent comprehensive systematic survey of regulatory motifs (11), which used the phylogenetic relationship between human, mouse, rat and dog to discover conserved and overrepresented sequence motifs in the region 2 kb upstream and downstream from the RefSeq-based transcription start site of human genes. To construct the 174 matrix models, we scanned multiple sequence alignments from Ref. (11) for the conserved motifs reported in the paper and used the detected sites to derive matrices that can be regarded as common mammalian matrix profiles. The resulting matrices represent numerous putative binding sites, providing potentially high matrix granularity.
We compared the 174 phyloFACTS matrices to the existing matrices in JASPAR CORE using the Pearson correlation coefficient (PCC) as a measure of matrix similarity. In total, a similarity higher than 0.8 (an empirical PCC value often used to indicate strong correlation) was observed for 27% of the mammalian JASPAR CORE matrices. Inspection of this significant but limited overlap indicates that potential binding sites for many TFs computationally detected in Ref. (11) have not been experimentally characterized to date; on the other hand, their computational method is unable to detect many of the experimentally verified binding sites from JASPAR CORE that either have low information content or are predominantly found in long-range enhancers. This and the validation procedure (see below) strongly indicate that the JASPAR phyloFACTS and CORE sets complement each other.
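The PCC-based comparison can be sketched as follows. This is a simplified illustration that assumes the two matrices have the same width and are already aligned position by position; the comparison described above may additionally involve aligning matrices of different widths, which is not shown. Counts are converted to per-column frequencies first so that profiles with very different sequence depths remain comparable. All function names and example matrices are hypothetical.

```python
import math


def to_frequencies(matrix):
    """Flatten a count matrix into per-column base frequencies (bases sorted)."""
    bases = sorted(matrix)
    width = len(matrix[bases[0]])
    freqs = []
    for pos in range(width):
        column = [matrix[base][pos] for base in bases]
        total = sum(column)
        if total == 0:
            raise ValueError(f"column {pos} has no observations")
        freqs.extend(count / total for count in column)
    return freqs


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


def matrix_pcc(m1, m2):
    """PCC between two pre-aligned count matrices of identical width."""
    f1, f2 = to_frequencies(m1), to_frequencies(m2)
    if len(f1) != len(f2):
        raise ValueError("matrices must have the same width and alphabet")
    return pearson(f1, f2)


# Hypothetical three-column matrices, for illustration only.
first = {"A": [4, 0, 0], "C": [0, 4, 0], "G": [0, 0, 4], "T": [0, 0, 0]}
deeper_copy = {base: [10 * c for c in counts] for base, counts in first.items()}
second = {"A": [0, 0, 0], "C": [0, 0, 4], "G": [4, 0, 0], "T": [0, 4, 0]}

print(matrix_pcc(first, deeper_copy))  # same motif at a different depth
print(matrix_pcc(first, second))       # unrelated motif
```

Under the 0.8 threshold quoted above, the first pair would count as similar and the second pair would not.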

COMBINATION OF JASPAR CORE AND phyloFACTS DATABASES ENHANCES BINDING SPACE COVERAGE

We wanted to assess the predictive power of the combination of phyloFACTS and JASPAR CORE, and to compare the coverage of the union against the coverage of the vertebrate subset of JASPAR CORE itself, as well as the TRANSFAC database (version 8.4) (12). Co-regulated gene expression is a consequence of the co-occurrence of similar features, such as common TFBSs, in the promoter regions of a gene set. If a collection of matrix profiles contains relevant data for these features, we should be able to distinguish a set of co-regulated genes from a random set. We compiled two sets of co-regulated genes: one set of 16 genes known to be important in the Wnt signalling pathway and one set of 20 histone genes [the results on the Wnt dataset are explained in more detail in P.J. de Bleser et al. (submitted for publication)]. By applying a feature selection and classification procedure described in detail in Supplementary Data, we were able to show that the use of JASPAR CORE/phyloFACTS increases both specificity and sensitivity of predictions compared with either JASPAR CORE alone or TRANSFAC. The complementarity of JASPAR CORE and phyloFACTS is further shown in the selection of matrix profiles used as classification attributes. Of the four chosen attributes for the Wnt dataset, two are from JASPAR CORE and two are from phyloFACTS. One of the selected phyloFACTS matrices (JASPAR ID: PF0073) comprises a motif that is associated with the TF Lef1, known to be essential in the Wnt signalling pathway. This motif is absent from the attributes selected from either JASPAR CORE or TRANSFAC. For the histones, the 44 selected attributes are roughly evenly distributed between the two matrix sets (20 from JASPAR CORE and 24 from phyloFACTS), indicating that phyloFACTS serves as a complementary profile collection to JASPAR CORE (see Supplementary Data for details of the validation procedure).

AVAILABILITY, API AND DISTRIBUTION

The JASPAR web portal (http://jaspar.genereg.net) provides a graphical interface for casual users, enabling browsing and database search functions, as well as basic sequence search functionality for selected profiles. In addition, novel profiles entered by users can be compared to profiles in the three datasets using matrix alignment algorithms (10). The TFBS module for the Perl programming language (13) has extensive support for the JASPAR database and can be considered an application programming interface to the database. This approach is recommended for power users. The JASPAR database and underlying datasets are available for download with no restrictions from the JASPAR portal. In addition, users can access the underlying MySQL database anonymously (Table 1).

FUTURE DIRECTIONS: EXPANSION, USER SUBMISSION AND UNIVERSAL DATA MODEL

In the future, we shall see an increasing number of TFBS models produced by diverse approaches. Those methods will vary in scope, reliability and the depth of biological validation. For instance, the recently launched ENCODE project (14) is anticipated to produce a large number of genome-wide ChIP–chip (15) experiments for a wide selection of TFs, potentially leading to a new cadre of profiles. The new meta-database model therefore provides the flexibility required for the growth of the JASPAR collections. JASPAR CORE will remain faithful to its original purpose as the central open-access repository of high-quality, experimentally verified profiles that will continue to expand by expert curation. As a community resource, JASPAR is introducing a user data submission mechanism which enables users to submit (i) individual models with sufficient experimental evidence to JASPAR CORE, or (ii) whole collections of annotated matrix profiles that share the scope, origin and mechanism of generation. The submission form is available at . As part of a larger effort, the development of a universal data model and the associated input and curation tools for annotated TF binding data is under way. The ultimate goal is to have not only an open-access binding profile repository, but also the most comprehensive collection of information about TFs and their binding sites. We strongly encourage researchers supporting the open-access database model to use JASPAR as a means for sharing models and datasets with the research community.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.
  15 in total

Review 1.  DNA binding sites: representation and discovery.

Authors:  G D Stormo
Journal:  Bioinformatics       Date:  2000-01       Impact factor: 6.937

2.  JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Authors:  Albin Sandelin; Wynand Alkema; Pär Engström; Wyeth W Wasserman; Boris Lenhard
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

3.  TFBS: Computational framework for transcription factor binding site analysis.

Authors:  Boris Lenhard; Wyeth W Wasserman
Journal:  Bioinformatics       Date:  2002-08       Impact factor: 6.937

Review 4.  Applied bioinformatics for the identification of regulatory elements.

Authors:  Wyeth W Wasserman; Albin Sandelin
Journal:  Nat Rev Genet       Date:  2004-04       Impact factor: 53.242

5.  ConSite: web-based prediction of regulatory elements using cross-species comparison.

Authors:  Albin Sandelin; Wyeth W Wasserman; Boris Lenhard
Journal:  Nucleic Acids Res       Date:  2004-07-01       Impact factor: 16.971

6.  Constrained binding site diversity within families of transcription factors enhances pattern discovery bioinformatics.

Authors:  Albin Sandelin; Wyeth W Wasserman
Journal:  J Mol Biol       Date:  2004-04-23       Impact factor: 5.469

7.  Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays.

Authors:  Sonali Mukherjee; Michael F Berger; Ghil Jona; Xun S Wang; Dale Muzzey; Michael Snyder; Richard A Young; Martha L Bulyk
Journal:  Nat Genet       Date:  2004-11-14       Impact factor: 38.330

8.  The ENCODE (ENCyclopedia Of DNA Elements) Project.

Authors: 
Journal:  Science       Date:  2004-10-22       Impact factor: 47.728

9.  Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals.

Authors:  Xiaohui Xie; Jun Lu; E J Kulbokas; Todd R Golub; Vamsi Mootha; Kerstin Lindblad-Toh; Eric S Lander; Manolis Kellis
Journal:  Nature       Date:  2005-02-27       Impact factor: 49.962

10.  TRANSFAC: transcriptional regulation, from patterns to profiles.

Authors:  V Matys; E Fricke; R Geffers; E Gössling; M Haubrock; R Hehl; K Hornischer; D Karas; A E Kel; O V Kel-Margoulis; D-U Kloos; S Land; B Lewicki-Potapov; H Michael; R Münch; I Reuter; S Rotert; H Saxel; M Scheer; S Thiele; E Wingender
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971
