
JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles.

Anthony Mathelier, Xiaobei Zhao, Allen W Zhang, François Parcy, Rebecca Worsley-Hunt, David J Arenillas, Sorana Buchman, Chih-yu Chen, Alice Chou, Hans Ienasescu, Jonathan Lim, Casper Shyr, Ge Tan, Michelle Zhou, Boris Lenhard, Albin Sandelin, Wyeth W Wasserman.

Abstract

JASPAR (http://jaspar.genereg.net) is the largest open-access database of matrix-based nucleotide profiles describing the binding preference of transcription factors from multiple species. The fifth major release greatly expands the heart of JASPAR, the JASPAR CORE subcollection of curated, non-redundant profiles, with 135 new curated profiles (74 in vertebrates, 8 in Drosophila melanogaster, 10 in Caenorhabditis elegans and 43 in Arabidopsis thaliana; a 30% increase in total) and 43 older updated profiles (36 in vertebrates, 3 in D. melanogaster and 4 in A. thaliana; a 9% update in total). The new and updated profiles are mainly derived from published chromatin immunoprecipitation-seq experimental datasets. In addition, the web interface has been enhanced with advanced capabilities in browsing, searching and subsetting. Finally, the new JASPAR release is accompanied by a new BioPython package, a new R tool package and a new R/Bioconductor data package to facilitate access for both manual and automated methods.

Substances: Transcription Factors

Year: 2013 PMID: 24194598 PMCID: PMC3965086 DOI: 10.1093/nar/gkt997

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Transcription factors (TFs) influence gene expression by binding to specific cis-acting elements in a genomic sequence. Thus, accurate models for describing the binding properties of TFs are essential in modeling transcription. From a set of known transcription factor binding sites (TFBSs) for a given TF, the binding preference is generally represented in the form of a position weight matrix (PWM) (also called position-specific scoring matrix) derived from a position frequency matrix (PFM). A PFM is essentially an occurrence table, summarizing the number of each nucleotide observed at each position of a set of aligned TFBSs (1,2). Compared with simpler models like consensus sequences, PWMs allow for an additive probabilistic description of binding preferences (3). The JASPAR database holds collections of PFM nucleotide profiles based on published experiments from diverse sources, and has grown gradually from its inception (4–7). The most widely used JASPAR collection is JASPAR CORE, which is a curated non-redundant set of TFBS profiles for multicellular eukaryotes, based on experimental evidence. The JASPAR database aims to provide the best canonical DNA binding profile per TF, as assessed by expert curators. Non-redundancy of TFBS profiles (i.e. one profile per TF) is intended, with the exception of cases in which curators observe a clear difference in the sequence (e.g. Nkx2-5) or length (e.g. JUND) at the core of a profile. Other JASPAR motif collections, with characteristics different from those of the CORE database, are available (7). Over the years, JASPAR has been equipped with functions aimed at casual and power users. The web-based graphical user interface functionality includes browsing, searching, subsetting and downloading, as well as basic sequence searching tools, dynamic clustering of matrices and generation of random PFMs by sampling selected profiles (4–7).
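The relationship between a PFM and a PWM described above can be illustrated with a minimal Python sketch. The pseudocount value, the uniform background and the toy count matrix are assumptions chosen for illustration; they are not taken from JASPAR or from the curation pipeline.

```python
import math

def pfm_to_pwm(pfm, pseudocount=0.8, background=None):
    """Convert a position frequency matrix into a log-odds position weight matrix.

    pfm maps each nucleotide to a list of counts, one per aligned position.
    The pseudocount and the uniform background are illustrative assumptions.
    """
    background = background or {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    length = len(pfm["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(length):
        total = sum(pfm[base][i] for base in "ACGT")
        for base in "ACGT":
            # The pseudocount is shared out according to the background frequencies.
            prob = (pfm[base][i] + pseudocount * background[base]) / (total + pseudocount)
            pwm[base].append(math.log2(prob / background[base]))
    return pwm

def score(pwm, site):
    """Additive score of a site: the sum of the per-position weights."""
    return sum(pwm[base][i] for i, base in enumerate(site))

# Toy PFM summarising 10 hypothetical aligned sites (invented for this example).
toy_pfm = {
    "A": [8, 0, 1, 0],
    "C": [1, 0, 0, 9],
    "G": [1, 10, 0, 1],
    "T": [0, 0, 9, 0],
}

toy_pwm = pfm_to_pwm(toy_pfm)
print(score(toy_pwm, "AGTC"), score(toy_pwm, "TTTT"))
```

Because the score is a sum over positions, each position contributes independently, which is what the text means by an additive description.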
Historically, JASPAR was populated by PFMs generated by in vitro site selection assays or collections of in-depth characterized sites, limiting both the number of TFs with binding profiles and the number of sites contributing to the profiles. With the development of high-throughput techniques that can assess in vitro or in vivo binding (8–10), it is now possible to generate binding models for most regulators, in multiple species. To this end, we have, in this fifth release, expanded the JASPAR CORE collection substantially, as well as updated the profiles of several existing ones with new data from high-throughput experiments.

EXTENSIVE EXPANSION AND IMPROVEMENT OF JASPAR CORE

The JASPAR CORE database has been substantially expanded. In total, 135 new PFMs have been added (a 30% increase), and 43 older PFMs (9% of the last release) have been updated with new data, from vertebrate, insect, nematode and plant species (Table 1). These additions are described in more detail below.

Table 1.

Summary of content and growth of the JASPAR CORE database

Subset	Number of non-redundant profiles in JASPAR 4.0	New non-redundant profiles in JASPAR 5.0	Updated profiles	Removed profiles	Total profiles (including older versions of profiles)	Total profiles (non-redundant)
Vertebrates	130	74	36	1	260	202
Plants	21	43	3		67	64
Insects	123	8	4	1	136	131
Nematodes	5	10			15	15
Fungi	177				177	177
Urochordata	1				1	1
Total	457	135	43	2	656	590
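As a quick arithmetic check, the growth percentages quoted in the text can be recomputed from the counts in Table 1. The sketch below only copies numbers from the table; the variable names are arbitrary.

```python
# Counts copied from Table 1: (profiles in JASPAR 4.0, new profiles in JASPAR 5.0).
table1 = {
    "Vertebrates": (130, 74),
    "Plants": (21, 43),
    "Insects": (123, 8),
    "Nematodes": (5, 10),
}
previous_total, new_total, updated_total = 457, 135, 43

def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

# Growth of each subset relative to its size in the previous release.
growth = {subset: percent(new, old) for subset, (old, new) in table1.items()}
overall_growth = percent(new_total, previous_total)
overall_updated = percent(updated_total, previous_total)

for subset, value in growth.items():
    print(f"{subset}: +{value:.1f}%")
print(f"All subsets: +{overall_growth:.1f}% new, {overall_updated:.1f}% updated")
```

The totals come out close to the rounded 30% and 9% figures given in the text.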

We compiled published sequence-specific DNA binding TF chromatin immunoprecipitation (ChIP)-seq data collections into the PAZAR database (11,12) along with TF ChIP-seq datasets from the ENCODE (13–15) and modENCODE (16,17) consortia for Homo sapiens, Mus musculus, Drosophila melanogaster and Caenorhabditis elegans. From these studies, we extracted the bound regions, identified over-represented motifs close to the ChIP-seq peak max position (corresponding to the region where the maximum number of ChIP-seq reads are mapped) using the MEME suite (18) and constructed PFMs describing the binding preferences of the TFs (see Supplementary Text for details). As in previous JASPAR CORE additions, we manually curated the profiles. To confirm the putative binding patterns, we identified independent publications with TFBSs or profiles consistent with the candidates, as described in (7). To gain additional profiles, we considered bound regions derived from ChIP-chip experiments from modENCODE and (19) for D. melanogaster. A strategy similar to that used for ChIP-seq datasets was applied to derive PFMs from ChIP-chip data (see Supplementary Text for details). In total, we obtained 45, 28, 8 and 10 high-quality PFMs in H. sapiens, M. musculus, D. melanogaster and C. elegans, respectively, for TFs that have never been described previously in JASPAR (see Supplementary Table S1). This represents a 57, 6 and 200% increase when compared with the previous release for vertebrates, insects and nematodes, respectively. The newly introduced vertebrate profiles are derived from 34 and 40 ChIP-seq experiments collected from PAZAR and ENCODE, respectively. The fact that almost 50% of the new PFMs are from individual studies collected in PAZAR highlights the importance of our manual retrieval of published ChIP-seq data.
From ChIP-seq data sets of the vertebrate sequence-specific TFs not previously described in JASPAR, we obtained 71 (∼60%) canonical motifs satisfying our literature-based manual curation (see Supplementary Table S2). The rich data from ChIP experiments allowed replacement of 39 existing profiles for TFs in mammals (36 PFMs updated) and in D. melanogaster (3 PFMs updated). As part of the curation of ChIP-seq data, and as introduced earlier, we computed a centrality score as described in (20), based on our expectation that the positions where the maximum number of ChIP-seq reads map on the reference genome will be strongly enriched for binding sites corresponding to the ChIPed TF (21). We provide the centrality plot and log(P-value) for each newly introduced PFM in vertebrates (see Figure 1), showing the propensity of the motif to be found close to the peak-max position in the corresponding peaks of the ChIP-seq dataset used to generate the profile (see Supplementary Figure S1). The high quality of the vertebrate PFMs and of the ChIP-seq datasets used to construct them is reflected by the low centrality log(P-values), which are all below −200, with the exception of the Bach1::Mafk, ESRRA, FOXP1, FOXP2, Hoxa9, Sox6, SP2, SREBF1, SREBF2, and THAP1 binding profiles (see Supplementary Table S1).
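The idea behind a centrality score can be sketched as follows. This is a simplified stand-in, not the procedure of (20): it assumes one best motif hit per peak, a uniform null model for where that hit falls within the peak, and a one-sided binomial test reported as a natural logarithm. The function names, window sizes and offsets are all hypothetical.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability of exactly k successes."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_binom_tail(k, n, p):
    """Natural log of P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0:
        return 0.0
    terms = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def centrality_log_p(offsets, half_window, half_peak):
    """Log P-value for enrichment of motif hits near the peak-max position.

    offsets: signed distance (bp) from the peak max to the best hit in each peak.
    Under the assumed null, a hit is equally likely anywhere within the peak,
    so half_window must be smaller than half_peak.
    """
    n = len(offsets)
    k = sum(1 for d in offsets if abs(d) <= half_window)
    p_null = (2 * half_window + 1) / (2 * half_peak + 1)
    return log_binom_tail(k, n, p_null)

# Hypothetical data: 80 of 100 hits sit exactly at the peak max.
centered = [0] * 80 + [90] * 20
# Hypothetical data: hits spread evenly over a 201-bp peak.
spread = list(range(-100, 101))

print(centrality_log_p(centered, half_window=10, half_peak=100))
print(centrality_log_p(spread, half_window=10, half_peak=100))
```

A strongly negative value indicates that hits cluster around the peak max far more often than the uniform null would predict, which is the property the centrality plots are meant to show.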

Figure 1.

Screenshot of an example TFBS profile in new layout.

Moreover, we expanded the collection of PFMs for Arabidopsis thaliana TFs in JASPAR, with the first targeted JASPAR curation effort for plant TFs. We have included 43 new DNA-binding profiles for A. thaliana TFs, more than tripling the plant content in JASPAR CORE, and we updated three previous PFMs. The profiles are derived from in vitro and in vivo experiments (8 new profiles are constructed from ChIP-seq experiments, 8 from ChIP-chip experiments, 6 from protein binding microarray experiments and 24 from SELEX experiments).

MODELS FOR DUAL BINDING BY THE SAME TF

In this release, in extremely select cases, we introduce multiple binding profiles for the same TF, motivated by the fact that some TFs display diverse target specificity that cannot be represented using a single PFM model. For instance, JUND has been previously shown to bind DNA with motifs of flexible lengths (22) with a core composed of either TGACGTCA or TGAC/GTCA, where C/G stands for C or G. The two new profiles introduced for JUND (see Figure 2A) are derived from the same ChIP-seq dataset, confirming the binding to the two subclasses. Similarly, we introduce two profiles for JUN (see Figure 2B), displaying equivalent characteristics to the JUND profiles. A new profile for Nkx2-5 (see Figure 2C) derived from ChIP-seq data has been introduced. It differs substantially from an in vitro SELEX experiment-based profile but has been confirmed to reflect binding properties of Nkx2-5 (23). Finally, we introduce two binding profiles associated with the plant TF RAV1, as it can bind to two unrelated motifs by using two distinct DNA-binding domains (24) (see Figure 2D). The philosophy of maintaining JASPAR as a non-redundant collection remains a driving approach to curation. In these special cases in which we allow unique pairs of profiles for the same TF, the TF presents distinct binding capacities that cannot be captured within a single PFM.
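The two JUND core variants differ in length (8 bp versus 7 bp), which is why a single fixed-width PFM cannot represent both. A minimal Python sketch, using regular expressions as a stand-in for the actual profiles, shows how a sequence could be assigned to one subclass or the other. The function name and the example sequences are invented for illustration.

```python
import re

# The two core variants named in the text; [CG] encodes "C or G".
CORE_8BP = re.compile(r"TGACGTCA")
CORE_7BP = re.compile(r"TGA[CG]TCA")

def classify_core(sequence):
    """Return which core variant, if any, occurs in the sequence."""
    sequence = sequence.upper()
    if CORE_8BP.search(sequence):
        return "8-bp core"
    if CORE_7BP.search(sequence):
        return "7-bp core"
    return None

# Hypothetical sequences, lower-case flanks around an upper-case core.
for seq in ["ccTGACGTCAgg", "aaTGAGTCAtt", "aaTGACTCAtt", "acgtacgt"]:
    print(seq, classify_core(seq))
```

Exact matching is far stricter than a PFM, which tolerates mismatches; the sketch only illustrates why two models of different widths are needed.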

Figure 2.

TFBSs with two different profiles. (A) JUN, (B) JUND, (C) Nkx2-5 and (D) RAV1.

ENHANCED WEB INTERFACE AND NEW RESOURCES FOR POWER USERS

For casual users, we have enhanced the web search interface to the JASPAR database. Fuzzy searching is now enabled to search one or multiple profiles by gene name, species official or common name, protein accession ID, DNA-binding domain family or class, experiment type (e.g. ChIP-seq) and any other keyword associated with the profile(s) in the underlying database. This fuzzy searching performs approximate, case-insensitive string matching and offers suggestions below the search box while typing. It also includes the gene name aliases from HGNC (PMID:23161694) for searching gene synonyms. Furthermore, for each TF profile, we have now included links to the Transcription Factor Encyclopedia (25) and to the protein structures from the Protein Data Bank when available (26). Each binding profile links to the corresponding TFBSshape profile of DNA structural analysis (27). For power users, we have developed an open source Python package (freely available at https://github.com/biopython/) within the extensively used tools of the BioPython Project (28). We implemented the jaspar package as part of the ‘motifs’ BioPython package, which provides functions such as reading profiles, writing profiles, scanning sequences for motif instances and more. The specific jaspar ‘motif’ class stores all the metadata related to the profiles in JASPAR, and specific functions allow the user to retrieve profiles from the database. We also developed an R/Bioconductor (PMID: 15461798) software package TFBSTools, available at http://www.bioconductor.org/packages/2.13/bioc/html/TFBSTools.html under the General Public License-2 (GPL-2), to provide developers with handy tools to generate, read and convert the JASPAR template, an internal data format that describes each motif instance and its meta information.
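To make concrete what reading a profile involves, the sketch below parses a PFM written in a JASPAR-style flat-file layout (a header line followed by one bracketed row of counts per nucleotide) using only the Python standard library. It is not the BioPython or TFBSTools API; the function name, the record identifier and the counts are invented for illustration, and real files may differ in detail.

```python
import re

def parse_jaspar(text):
    """Parse JASPAR-style PFM records into {matrix_id: {"name", "counts"}}."""
    profiles = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: ">ID name"
            parts = line[1:].split(None, 1)
            current = {"name": parts[1] if len(parts) > 1 else "", "counts": {}}
            profiles[parts[0]] = current
        elif current is not None and line[0] in "ACGT":
            # Count row: nucleotide letter followed by bracketed numbers.
            numbers = re.findall(r"\d+(?:\.\d+)?", line[1:])
            current["counts"][line[0]] = [int(float(x)) for x in numbers]
    return profiles

# Made-up record; the ID, name and counts are hypothetical.
example = """
>TOY0001.1 ExampleTF
A  [ 8  0  1  0 ]
C  [ 1  0  0  9 ]
G  [ 1 10  0  1 ]
T  [ 0  0  9  0 ]
"""

profiles = parse_jaspar(example)
print(profiles["TOY0001.1"]["name"], profiles["TOY0001.1"]["counts"]["G"])
```

In practice the packages described above should be preferred, since they also carry the metadata attached to each profile.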
An R/Bioconductor (29) data package JASPAR2014Data is freely available at http://www.bioconductor.org/packages/devel/data/experiment/html/JASPAR2014.html to provide users with tools for data analysis using the JASPAR profiles. In addition, a web-based curator interface was developed for JASPAR, focusing on giving super-users the ability to edit and update the database: this capacity is released for users wishing to produce custom PFM databases using the JASPAR framework.
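The case-insensitive approximate matching offered by the web search box can be approximated with Python's standard difflib module. How the JASPAR web interface implements fuzzy search is not described here, so the sketch below is only an analogy; the name list, the cutoff and the function name are assumptions.

```python
import difflib

# Hypothetical index of profile names; a real index would also hold aliases,
# species names, accession IDs and other keywords.
PROFILE_NAMES = ["JUND", "JUN", "Nkx2-5", "RAV1", "FOXP1", "FOXP2", "SREBF1"]

def suggest(query, names=PROFILE_NAMES, limit=3, cutoff=0.6):
    """Return up to `limit` names that approximately match the query, best first."""
    # Compare in lower case so that matching is case-insensitive.
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=limit, cutoff=cutoff)
    return [lowered[hit] for hit in hits]

print(suggest("jun"))
print(suggest("nkx25"))
```

Lowering the cutoff returns more, but looser, suggestions; the value 0.6 is simply difflib's default.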

CONCLUSIONS AND FUTURE DEVELOPMENTS

In this release of JASPAR, we have focused on the CORE database and expanded it primarily with new ChIP-based data. Although these types of expansions are important and will continue, the increasing availability of rich data sources highlights important questions for the future development of JASPAR, which need to be discussed with its user base. Two such larger questions are as follows.

Non-redundancy versus species-specific matrix models?

JASPAR CORE was originally designed with the clear goal of finding the ‘best’ PFM for a certain TF, unlike other databases that can hold several models for the same factor. Although many users have appreciated the clarity, it is not established how to resolve cases where the same factor has been characterized in-depth in two or more species. While this situation was rare in the early JASPAR versions, new experimental methods allow for probing binding specificity in several species with comparative ease (30). In general, the binding specificity for orthologous TFs rarely changes to a substantial degree, but exceptions exist (31). Thus, future curation of JASPAR will have to resolve whether the non-redundancy approach should be within each species or within larger clades.

New types of models?

Likewise, the sheer number of sites that the new laboratory methods generate provides sufficient information to produce predictive models that address more aspects than can be readily handled within the classic PWM framework, in particular dependencies between positions and variable-length motifs, which basic PWM models ignore. Here, one will have to consider the trade-off between possible higher specificity in binding predictions [see (32) for a detailed discussion] and the comfort of the community with the simpler PWM models. It is our plan to introduce newly designed Transcription Factor Flexible Models (33) derived from ChIP-seq data within JASPAR in the near future.
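A toy example makes the limitation concrete. Suppose a factor binds only AT or CG at two adjacent positions; a model that treats positions independently, as a PWM does, then assigns the same probability to the unobserved AG as to the observed AT. The sketch below uses invented sites and is not an implementation of the Transcription Factor Flexible Models mentioned above.

```python
from collections import Counter

# Hypothetical binding sites: the two positions are perfectly coupled.
sites = ["AT"] * 10 + ["CG"] * 10

def independent_prob(site, sites):
    """Probability under a PWM-like model: product of per-position frequencies."""
    prob = 1.0
    for i, base in enumerate(site):
        column = Counter(s[i] for s in sites)
        prob *= column[base] / len(sites)
    return prob

def joint_prob(site, sites):
    """Probability under a model that keeps the dependency between positions."""
    return Counter(sites)[site] / len(sites)

for candidate in ["AT", "CG", "AG"]:
    print(candidate, independent_prob(candidate, sites), joint_prob(candidate, sites))
```

The joint model needs far more parameters as motifs grow longer, which is one side of the trade-off discussed in the text.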

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

32 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Diversity and complexity in DNA recognition by transcription factors.

Authors: Gwenael Badis; Michael F Berger; Anthony A Philippakis; Shaheynoor Talukder; Andrew R Gehrke; Savina A Jaeger; Esther T Chan; Genita Metzler; Anastasia Vedenko; Xiaoyu Chen; Hanna Kuznetsov; Chi-Fong Wang; David Coburn; Daniel E Newburger; Quaid Morris; Timothy R Hughes; Martha L Bulyk
Journal: Science Date: 2009-05-14 Impact factor: 47.728

3. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

4. ChIP-seq: advantages and challenges of a maturing technology.

Authors: Peter J Park
Journal: Nat Rev Genet Date: 2009-09-08 Impact factor: 53.242

5. Identification of novel DNA binding targets and regulatory domains of a murine tinman homeodomain factor, nkx-2.5.

Authors: C Y Chen; R J Schwartz
Journal: J Biol Chem Date: 1995-06-30 Impact factor: 5.157

6. Inferring direct DNA binding from ChIP-seq.

Authors: Timothy L Bailey; Philip Machanick
Journal: Nucleic Acids Res Date: 2012-05-18 Impact factor: 16.971

7. The transcription factor encyclopedia.

Authors: Dimas Yusuf; Stefanie L Butland; Magdalena I Swanson; Eugene Bolotin; Amy Ticoll; Warren A Cheung; Xiao Yu Cindy Zhang; Christopher T D Dickman; Debra L Fulton; Jonathan S Lim; Jake M Schnabl; Oscar H P Ramos; Mireille Vasseur-Cognet; Charles N de Leeuw; Elizabeth M Simpson; Gerhart U Ryffel; Eric W-F Lam; Ralf Kist; Miranda S C Wilson; Raquel Marco-Ferreres; Jan J Brosens; Leonardo L Beccari; Paola Bovolenta; Bérénice A Benayoun; Lara J Monteiro; Helma D C Schwenen; Lars Grontved; Elizabeth Wederell; Susanne Mandrup; Reiner A Veitia; Harini Chakravarthy; Pamela A Hoodless; M Michela Mancarelli; Bruce E Torbett; Alison H Banham; Sekhar P Reddy; Rebecca L Cullum; Michaela Liedtke; Mario P Tschan; Michelle Vaz; Angie Rizzino; Mariastella Zannini; Seth Frietze; Peggy J Farnham; Astrid Eijkelenboom; Philip J Brown; David Laperrière; Dominique Leprince; Tiziana de Cristofaro; Kelly L Prince; Marrit Putker; Luis del Peso; Gieri Camenisch; Roland H Wenger; Michal Mikula; Marieke Rozendaal; Sylvie Mader; Jerzy Ostrowski; Simon J Rhodes; Capucine Van Rechem; Gaylor Boulay; Sam W Z Olechnowicz; Mary B Breslin; Michael S Lan; Kyster K Nanan; Michael Wegner; Juan Hou; Rachel D Mullen; Stephanie C Colvin; Peter John Noy; Carol F Webb; Matthew E Witek; Scott Ferrell; Juliet M Daniel; Jason Park; Scott A Waldman; Daniel J Peet; Michael Taggart; Padma-Sheela Jayaraman; Julien J Karrich; Bianca Blom; Farhad Vesuna; Henriette O'Geen; Yunfu Sun; Richard M Gronostajski; Mark W Woodcroft; Margaret R Hough; Edwin Chen; G Nicholas Europe-Finner; Magdalena Karolczak-Bayatti; Jarrod Bailey; Oliver Hankinson; Venu Raman; David P LeBrun; Shyam Biswal; Christopher J Harvey; Jason P DeBruyne; John B Hogenesch; Robert F Hevner; Christophe Héligon; Xin M Luo; Marissa Cathleen Blank; Kathleen Joyce Millen; David S Sharlin; Douglas Forrest; Karin Dahlman-Wright; Chunyan Zhao; Yuriko Mishima; Satrajit Sinha; Rumela Chakrabarti; Elodie Portales-Casamar; Frances M Sladek; Philip H Bradley; Wyeth W Wasserman
Journal: Genome Biol Date: 2012 Impact factor: 13.583

8. A new generation of JASPAR, the open-access repository for transcription factor binding site profiles.

Authors: Dominique Vlieghe; Albin Sandelin; Pieter J De Bleser; Kris Vleminckx; Wyeth W Wasserman; Frans van Roy; Boris Lenhard
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

9. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update.

Authors: Jan Christian Bryne; Eivind Valen; Man-Hung Eric Tang; Troels Marstrand; Ole Winther; Isabelle da Piedade; Anders Krogh; Boris Lenhard; Albin Sandelin
Journal: Nucleic Acids Res Date: 2007-11-15 Impact factor: 16.971

10. TFBSshape: a motif database for DNA shape features of transcription factor binding sites.

Authors: Lin Yang; Tianyin Zhou; Iris Dror; Anthony Mathelier; Wyeth W Wasserman; Raluca Gordân; Remo Rohs
Journal: Nucleic Acids Res Date: 2013-11-07 Impact factor: 16.971
