
JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles.

Elodie Portales-Casamar, Supat Thongjuea, Andrew T Kwon, David Arenillas, Xiaobei Zhao, Eivind Valen, Dimas Yusuf, Boris Lenhard, Wyeth W Wasserman, Albin Sandelin.

Abstract

JASPAR (http://jaspar.genereg.net) is the leading open-access database of matrix profiles describing the DNA-binding patterns of transcription factors (TFs) and other proteins interacting with DNA in a sequence-specific manner. Its fourth major release is the largest expansion of the core database to date: the database now holds 457 non-redundant, curated profiles. The new entries include the first batch of profiles derived from ChIP-seq and ChIP-chip whole-genome binding experiments, and 177 yeast TF binding profiles. The introduction of a yeast division brings the convenience of JASPAR to an active research community. As binding models are refined by newer data, the JASPAR database now uses versioning of matrices: in this release, 12% of the older models were updated to improved versions. Classification of TF families has been improved by adopting a new DNA-binding domain nomenclature. A curated catalog of mammalian TFs is provided, extending the use of the JASPAR profiles to additional TFs belonging to the same structural family. The changes in the database set the system ready for more rapid acquisition of new high-throughput data sources. Additionally, three new special collections provide matrix profile data produced by recent alternative high-throughput approaches.


Year: 2009 PMID: 19906716 PMCID: PMC2808906 DOI: 10.1093/nar/gkp950

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

The wide availability of TF affinity data is becoming essential for an increasing number of research efforts to understand gene regulation in the post-genomic era. The increasing amount of assembled genome sequences, transcriptome data (1), as well as high-throughput studies revealing genome-wide locations of core promoters (2) and enhancer elements (3,4) have resulted in a great demand for TF binding site content analyses. TF binding affinities are typically modeled as position frequency matrices (PFMs, also known as raw count matrices or simply binding profiles), summarizing nucleotide counts in an alignment of active binding sites. These can be used to scan genomes for new binding sites (5). Since the first official release of JASPAR in 2004 (6), the research community has embraced it as the leading open-access database of such matrix profiles for TF binding sites. From the beginning, the aim of its core collection has been to provide a non-redundant set of curated, high-quality matrix profiles derived from experimental binding data in the form of position frequency matrices (7); in other words, the goal is to present the best currently available DNA binding model for a given TF, decided by expert curators. The availability of potentially useful matrices derived by other means (e.g. using a number of genome-wide computational approaches), as well as non-TF binding profiles, prompted the addition of separate JASPAR Collections in the second release (8): the intention was to provide those matrix profiles in the same format, and hence usable with the same tools as the core JASPAR database, while keeping the latter reserved for profiles representing experimentally derived data. While the community has valued the open-access policy and non-redundant nature of JASPAR, a common complaint was that the size of the core collection was small compared to the commercial TransFac database (9), currently the only comprehensive alternative to JASPAR.
In this update, our goal was to narrow this gap by performing a major expansion of the core database while maintaining its popular non-redundant, curated quality. As a result, this fourth major release introduces a wealth of new and improved matrix profiles and represents the largest expansion of the core database since its inception, with new data coming either from high-throughput methods such as ChIP-seq or assembled from TF binding site databases, particularly PAZAR (10), as described below.
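The use of a position frequency matrix to scan a sequence, mentioned above, can be illustrated with a minimal sketch. The counts below are invented for illustration and are not a JASPAR profile; the pseudocount of 1 and the uniform background of 0.25 are assumptions, and the function names are hypothetical.

```python
import math

# Toy position frequency matrix (PFM): counts per base at each motif position.
# These counts are invented for illustration; they are not a JASPAR profile.
PFM = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 9, 1],
    "G": [1, 10, 0, 0],
    "T": [1, 0, 1, 8],
}

def pfm_to_pwm(pfm, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights (assumed uniform background)."""
    width = len(pfm["A"])
    pwm = {base: [] for base in "ACGT"}
    for i in range(width):
        total = sum(pfm[b][i] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            freq = (pfm[b][i] + pseudocount) / total
            pwm[b].append(math.log2(freq / background))
    return pwm

def scan(sequence, pwm):
    """Score every window of an upper-case ACGT sequence on the forward strand."""
    width = len(pwm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[base][i] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

def best_hit(sequence, pwm):
    """Return the (offset, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])
```

A real scan would also score the reverse strand and apply a score threshold; both are omitted here to keep the sketch short.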

NEW AND IMPROVED MATRIX PROFILES IN JASPAR CORE DATABASE

Profiles from ChIP-seq

Several recent genome-wide studies have revealed thousands of TF binding sites for individual TFs. Compared to the original matrices, the larger number of representative target sequences provides potentially more accurate profiles and brings the added benefit that (unlike in DNA SELEX), all the binding sites come from the actual genome sequence to which the TFs in question are bound in vivo. To make the derivation of matrices uniform, we extracted the original sets of bound regions from published experiments (11–19). We retrieved 200 bp sequences centered on each peak and performed de novo motif discovery on them using parallelized MEME (20) on a Cray XT4 supercomputing platform, which can handle inputs of many thousands of sequences in manageable time. In most cases, the resulting matrices closely resemble those reported in the original publications, produced using various motif discovery tools. The single exception was the Zfx profile, where our profile obtained with MEME from sites reported in (13) differed reproducibly from the profile reported therein. In this case, we chose to include the newly derived matrix. In most cases, the ChIP-seq data resulted in improved matrices with higher information content than the original ones derived from either compiled single promoter assays or from DNA SELEX (Figure 1). This contradicts the widely held view that SELEX is prone to producing over-specified models since many selection rounds are commonly used. Also, somewhat surprisingly, the resulting matrices did not differ much as thresholds were varied for the inclusion of ChIP identified regions (e.g. top 100 highest confidence bound regions versus top 1000).
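Two steps in this paragraph lend themselves to a short sketch: taking a 200 bp window centred on a peak, and comparing matrices by information content. The code below is only an illustration under stated assumptions: peaks are given as 0-based summit coordinates, information content is computed against a uniform background without small-sample correction, and the function names are hypothetical. It does not reproduce the MEME runs described above.

```python
import math

def peak_window(summit, width=200):
    """Half-open interval of `width` bp centred on a 0-based peak summit.

    The start is clamped at 0 so windows near a sequence start stay valid.
    """
    half = width // 2
    start = max(0, summit - half)
    return start, start + width

def information_content(pfm):
    """Total information content in bits of a count matrix.

    Each column contributes 2 - H(column), where H is the Shannon entropy
    of the observed base frequencies (uniform background assumed).
    """
    width = len(pfm["A"])
    total_ic = 0.0
    for i in range(width):
        counts = [pfm[b][i] for b in "ACGT"]
        n = sum(counts)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
        total_ic += 2.0 - entropy
    return total_ic
```

A fully conserved column contributes 2 bits and a uniform column contributes 0 bits, so a profile with more informative flanking positions scores higher overall.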

Figure 1.

Examples of SELEX-derived matrix profiles replaced by ChIP-seq-derived profiles. (A) The previous MYCN matrix profile (MA0104.1) derived by DNA SELEX. (B) The new MYCN profile (MA0104.2) derived from ChIP-seq binding shows general agreement with the SELEX profile, with additional information derived from hundreds of sites at flanking positions. (C) The previous KLF4 profile (MA0039.1) is an example of a SELEX-derived profile that did not correspond well to the handful of individually characterized KLF4 sites. (D) The new KLF4 profile derived from ChIP-seq data (13) shows a dramatic increase in information content and good agreement with individually characterized binding sites.

Profiles from ChIP-chip experiments

ChIP-chip-derived TF binding sites, while not providing the resolution of ChIP-seq data, are a rich source of binding data. Even though ChIP-chip is currently being superseded by ChIP-seq (21), the published sets contain high-quality binding data for a number of TFs for which no ChIP-seq data are currently available. As with ChIP-seq, we use the enriched regions reported by the authors of the study in question, and then apply MEME to find the pattern.

Yeast profiles in core collection

Previous versions of JASPAR did not include any matrix profiles for yeast TFs. Responding to community requests, we have compiled results from several large-scale binding profile projects to produce a non-redundant set of matrix profiles for TFs from Saccharomyces cerevisiae. The sources used, in order of preference, were a recent in vitro binding screen (22), a protein-binding microarray (PBM) experiment (23), the compiled SCPD binding profile database (24), the SwissRegulon computational re-analysis of multiple data collections (25) and a motif discovery-based collection from a widely used ChIP-chip data collection (26). The prioritization of the contributions, as well as the indicated deviations, reflects the curators’ personal perspective. The preferred set, from Badis et al. (22), appeared to offer matrices of consistently high quality, likely reflecting the curated nature of the effort (new experimental data were compared against existing data for consistency). All matrices were manually curated to remove redundancies and converted to count matrices. In curating the collection, the curators identified a few instances in which profiles were preferred in contradiction with the source priority: GAL4 (SwissRegulon), GCR1 (SwissRegulon), MATALPHA2 (SCPD), PHO4 (UniPROBE with the six leftmost and rightmost nucleotides trimmed) and ROX1 (SCPD). The resulting non-redundant set represents a comprehensive open-access compilation of yeast binding profiles, facilitating genome-wide computational studies of yeast regulatory inputs. We are grateful for the commitment of all of the data providers to open information, without which the compilation would have been impossible.
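The selection rule described above (a fixed source preference with a few curator overrides) can be sketched as follows. This is an illustration only: the source labels are shorthand chosen here, the profile values are placeholders, and the function is hypothetical rather than part of any JASPAR tooling. The additional trimming applied to the PHO4 profile is not modelled.

```python
# Source labels in the order of preference stated in the text.
# The labels themselves are shorthand invented for this sketch.
SOURCE_PRIORITY = [
    "in_vitro_screen",   # Badis et al. (22)
    "PBM",               # protein-binding microarray experiment (23)
    "SCPD",              # compiled SCPD database (24)
    "SwissRegulon",      # computational re-analysis (25)
    "ChIP-chip_motifs",  # motif discovery on ChIP-chip data (26)
]

# Curator exceptions listed in the text (TF -> preferred source).
CURATOR_OVERRIDES = {
    "GAL4": "SwissRegulon",
    "GCR1": "SwissRegulon",
    "MATALPHA2": "SCPD",
    "PHO4": "UniPROBE",
    "ROX1": "SCPD",
}

def choose_profile(tf, candidates):
    """Pick one profile for a TF.

    `candidates` maps a source label to a profile object. A curator
    override wins when that source is present; otherwise the first
    available source in SOURCE_PRIORITY is used.
    """
    preferred = CURATOR_OVERRIDES.get(tf)
    if preferred in candidates:
        return preferred, candidates[preferred]
    for source in SOURCE_PRIORITY:
        if source in candidates:
            return source, candidates[source]
    raise KeyError(f"no profile available for {tf}")
```

Keeping the overrides in a separate table makes the curators' deviations from the default order explicit and easy to audit.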

New literature-based profiles from PAZAR

Recently, annotations of hundreds of experimentally validated TF binding sites from published studies have accumulated in the PAZAR database (27), allowing us to produce additional matrices similar in nature to the original JASPAR release (DNA SELEX or compiled from multiple studies on individual binding sites). The PAZAR database was mined to identify TFs with more than 15 annotated binding sites. The resulting data were manually curated, selecting only the results from the highest-quality data collections (i.e. collections manually annotated from the literature by specialists) and discarding any redundant sequences to build the profiles. The resulting set of compiled binding sites for each TF was used as input to the MEME software to obtain a profile. If non-informative positions were obtained on the edges of the matrices, the profiles were trimmed accordingly.
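The filtering and trimming steps above can be sketched in a few lines. The text does not state how "non-informative" was decided, so the 0.3-bit cut-off below is purely an assumption, as are the function names and the choice to remove duplicate sequences before applying the site-count threshold. The MEME step itself is not reproduced.

```python
import math

def eligible_tfs(sites_by_tf, min_sites=15):
    """Keep TFs with more than `min_sites` distinct annotated site sequences."""
    result = {}
    for tf, sites in sites_by_tf.items():
        unique = sorted(set(s.upper() for s in sites))  # drop redundant sequences
        if len(unique) > min_sites:
            result[tf] = unique
    return result

def column_ic(counts):
    """Information content in bits of one non-empty [A, C, G, T] count column."""
    n = sum(counts)
    return 2.0 + sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def trim_edges(columns, min_ic=0.3):
    """Drop low-information columns from both edges; interior columns are kept.

    `min_ic` is an assumed threshold, not a value taken from the text.
    """
    start, end = 0, len(columns)
    while start < end and column_ic(columns[start]) < min_ic:
        start += 1
    while end > start and column_ic(columns[end - 1]) < min_ic:
        end -= 1
    return columns[start:end]
```

Only the edges are trimmed, so a weakly conserved spacer inside a motif is preserved.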

Additional model organism core profiles

For this new release, two major sources of Drosophila melanogaster matrix profiles have been used: DNaseI footprinting data by Bergman et al. (28) and bacterial one-hybrid data by Wolfe and colleagues (29–31). The profiles from these data sets have been curated by the authors to remove redundancies among the results and with the existing profiles in the previous version of the JASPAR database. In addition, any profile based on fewer than 10 sequences has been discarded. This new insect sub-section of JASPAR core includes 123 curated profiles; however, these are heavily dominated by the homeodomain profiles (29). For Caenorhabditis elegans, no large sources of data are currently available. Through literature searches, we identified only five profiles suitable for inclusion in the core database (32–36). In summary, the JASPAR core database now numbers 457 non-redundant matrix profiles (Table 1). New core profiles are summarized in Supplementary Table S1.

Table 1.

Summary of the content and growth of the JASPAR database

JASPAR	Brief description	Subset	Number of profiles in JASPAR 3.0	New profiles in JASPAR 4.0	Updated profiles	Removed profiles	Total profiles (including all versions)	Total profiles (non-redundant)
Core
	Non-redundant, literature-derived, curated models	Vertebrates	101	29	16	1	145	130
		Plants	21	–	–	–	21	21
		Insects	14	109	1	–	124	123
		Nematoda	–	5	–	–	5	5
		Fungi	–	177	–	–	177	177
		Urochordata	1	–	–	–	1	1
Total core			137	321	17	1	474	457
Collections
POLII	Core promoter element profiles	–	13	–	–	–	13	13
FAM	Familial ‘consensus’ profiles for major structural families of transcription factors	–	11	–	–	–	11	11
CNE	Profiles overrepresented in vertebrate highly conserved non-coding elements	–	233	–	–	–	233	233
PHYLOFACTS	Evolutionarily conserved profiles in 5′ promoter regions	–	174	–	–	–	174	174
SPLICE	Splice sites	–	6	–	–	–	6	6
PBM	Protein binding microarray profiles	–	–	208	–	–	208	208
PBM_HOMEO	Protein binding microarray profiles focused on homeodomain TFs	–	–	176	–	–	176	176
PBM_BHLH	Protein binding microarray profiles focused on bHLH domain TFs	–	–	19	–	–	19	19
Total collections			437	403	–	–	840	840

NEW COLLECTIONS

In addition to the expansion of the core database, we remain committed to providing other collections of matrix profiles within JASPAR. Recently, the PBM technology has emerged as a new in vitro method for the characterization of TF binding affinities (37). The UniPROBE database hosts the PBM datasets and makes the derived matrix profiles available to the community (38). We have selected three of these new datasets as new collections in JASPAR:

PBM, the set derived by (39) from the binding preferences of 104 mouse TFs. For each TF, both the primary and secondary motifs identified in the study were incorporated.

PBM_HOMEO, the set derived by (40), which includes 176 profiles from mouse homeodomains. From the original 168 TFs analyzed, two were discarded because they could not be identified (Dobox4 and Dobox5) and ten have two alternative profiles.

PBM_HLH, the set derived from the binding preferences of dimers of C. elegans bHLH TFs, including nine homodimers and ten heterodimers (41).

With these additions, JASPAR now holds 840 profiles within collections outside of the core database.
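The profile counts for the three PBM-derived collections follow from the figures given in the text, and together with the older collection sizes listed in Table 1 they give the stated total of 840. The short check below only restates that arithmetic; the function name is hypothetical.

```python
def collection_counts():
    """Recompute collection sizes from the numbers quoted in the text and Table 1."""
    pbm = 104 * 2               # primary and secondary motif for each of 104 TFs
    pbm_homeo = (168 - 2) + 10  # 166 identified TFs, ten with a second profile
    pbm_hlh = 9 + 10            # nine homodimers and ten heterodimers
    older = 13 + 11 + 233 + 174 + 6  # POLII, FAM, CNE, PHYLOFACTS, SPLICE
    new = pbm + pbm_homeo + pbm_hlh
    return {
        "PBM": pbm,
        "PBM_HOMEO": pbm_homeo,
        "PBM_HLH": pbm_hlh,
        "new": new,
        "older": older,
        "total": older + new,
    }
```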

GENERAL ORGANIZATIONAL CHANGES

Version control and taxonomic categories

In line with our goal of presenting the best currently available binding model for any TF, we updated some previous JASPAR entries in light of newly available data. Seventeen entries of the previous release were updated. The replacement of existing matrices with new ones led us to introduce version numbers in matrix IDs, in a manner equivalent to the management of sequence versions in GenBank. For example, the old GATA1 profile MA0035 is replaced with a new one, and the full identifier of the new matrix is MA0035.2, while the old one becomes MA0035.1. By default, the non-redundant database includes the latest version of each profile. A search for ‘MA0035’ also retrieves the newest version, with an option to view older versions. Older versions can also be downloaded from the JASPAR web site. The addition of 177 yeast matrices to the core collection means that the JASPAR matrices now span the entire eukaryote crown group. Even before that, a typical user scenario included the selection of only a subset of matrices derived from a particular taxonomic category of organisms, across which the TFs are strictly orthologous and their binding activities largely unchanged (e.g. vertebrates). For that reason, both the JASPAR web interface and the download section now present the database content split into major taxonomic categories (vertebrates, insects, nematodes, (higher) plants and fungi) within which most of the binding sites are transferable across species. The option to search with and download the entire core collection is still available and behaves as before.
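The versioned identifier scheme described above (a base ID such as MA0035 resolving to its newest version, with older versions still addressable) can be sketched as follows. The helper names are hypothetical and this is not the JASPAR web interface or API; it only illustrates the lookup behaviour the text describes.

```python
def split_matrix_id(matrix_id):
    """Split 'MA0035.2' into ('MA0035', 2); a bare base ID gives version None."""
    base, sep, version = matrix_id.partition(".")
    return base, (int(version) if sep else None)

def resolve(query, available_ids):
    """Return the full versioned ID for a query.

    A bare base ID resolves to the highest available version; a fully
    versioned ID is returned only if that exact version exists.
    """
    base, version = split_matrix_id(query)
    versions = sorted(
        v for b, v in map(split_matrix_id, available_ids)
        if b == base and v is not None
    )
    if not versions:
        raise KeyError(query)
    if version is None:
        version = versions[-1]
    elif version not in versions:
        raise KeyError(query)
    return f"{base}.{version}"
```

For instance, with MA0035.1 and MA0035.2 both stored, a query for MA0035 resolves to MA0035.2, while MA0035.1 remains retrievable by its full identifier.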

A standardized TF classification

Up to now, JASPAR used an ad hoc structural class annotation for the TFs associated with each matrix profile. In this release, we have updated the structural class annotation using our recently published catalog for mouse and human TFs (42), in which DNA-binding proteins are associated with a structural classification system. We adopted the two-level classification described by Luscombe et al. (43) and extended it to accommodate additional binding domain structures. For the TFs from other species, we extrapolated the structural class and family based on the PFAM annotation of the DNA-binding domains. This addition to JASPAR provides a standardized system for the classification of TFs and allows a better grouping into families (or sub-families) with potentially similar binding preferences. A curated list of putative mouse/human DNA-binding proteins is provided at the JASPAR web site. It is also possible to browse the catalog by structure within the web interface to see which profiles are available.

Changes in the underlying database structure and interface

The underlying database schema was updated to accommodate matrix versions and to allow multiple species and TF accession numbers, as well as to allow the storage of multiple collections in the same SQL database. A Perl API (JASPAR5) for the new schema is available as part of the open-source TFBS Perl framework (44).

FUTURE DEVELOPMENTS

In the forthcoming months and years, a large amount of whole-genome binding data from ChIP-seq and related techniques will become available. We have taken the first steps towards a standardized way of including these new data in JASPAR, which is expected to expand significantly, with a concomitant increase in the quality of matrix data. At the same time, JASPAR collections outside the core will continue to include interesting matrix sets derived by other means.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

EU Framework Programme 6 integrated project EuTRACC (to S.T.); YFF grant 180435 from the Norwegian Research Council (NFR) and the Bergen Research Foundation (BFS) (to B.L.); Novo Nordisk Foundation to the Bioinformatics Centre (to X.Z., E.V. and A.S.); the European Research Council under the EU 7th Framework Programme (FP7/2007-2013)/ERC grant agreement 204135 (to A.S.); Scholar of the Michael Smith Foundation for Health Research (to W.W.); Canadian Institutes for Health Research, GenomeCanada (via the Pleiades Promoter Project), GenomeBritishColumbia and the Canada Foundation for Innovation (to W.W. research laboratory). Funding for open access charge: Norwegian Research Council (NFR) (project no. 180435).

Conflict of interest statement. None declared.

44 in total

1. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities.

Authors: Michael F Berger; Anthony A Philippakis; Aaron M Qureshi; Fangxue S He; Preston W Estep; Martha L Bulyk
Journal: Nat Biotechnol Date: 2006-09-24 Impact factor: 54.908

2. High-resolution profiling of histone methylations in the human genome.

Authors: Artem Barski; Suresh Cuddapah; Kairong Cui; Tae-Young Roh; Dustin E Schones; Zhibin Wang; Gang Wei; Iouri Chepelev; Keji Zhao
Journal: Cell Date: 2007-05-18 Impact factor: 41.582

3. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites.

Authors: Marcus B Noyes; Ryan G Christensen; Atsuya Wakabayashi; Gary D Stormo; Michael H Brodsky; Scot A Wolfe
Journal: Cell Date: 2008-06-27 Impact factor: 41.582

4. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences.

Authors: Michael F Berger; Gwenael Badis; Andrew R Gehrke; Shaheynoor Talukder; Anthony A Philippakis; Lourdes Peña-Castillo; Trevis M Alleyne; Sanie Mnaimneh; Olga B Botvinnik; Esther T Chan; Faiqua Khalid; Wen Zhang; Daniel Newburger; Savina A Jaeger; Quaid D Morris; Martha L Bulyk; Timothy R Hughes
Journal: Cell Date: 2008-06-27 Impact factor: 41.582

5. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells.

Authors: Xi Chen; Han Xu; Ping Yuan; Fang Fang; Mikael Huss; Vinsensius B Vega; Eleanor Wong; Yuriy L Orlov; Weiwei Zhang; Jianming Jiang; Yuin-Han Loh; Hock Chuan Yeo; Zhen Xuan Yeo; Vipin Narang; Kunde Ramamoorthy Govindarajan; Bernard Leong; Atif Shahab; Yijun Ruan; Guillaume Bourque; Wing-Kin Sung; Neil D Clarke; Chia-Lin Wei; Huck-Hui Ng
Journal: Cell Date: 2008-06-13 Impact factor: 41.582

6. Genome-wide mapping of in vivo protein-DNA interactions.

Authors: David S Johnson; Ali Mortazavi; Richard M Myers; Barbara Wold
Journal: Science Date: 2007-05-31 Impact factor: 47.728

7. The molecular signature and cis-regulatory architecture of a C. elegans gustatory neuron.

Authors: John F Etchberger; Adam Lorch; Monica C Sleumer; Richard Zapf; Steven J Jones; Marco A Marra; Robert A Holt; Donald G Moerman; Oliver Hobert
Journal: Genes Dev Date: 2007-07-01 Impact factor: 11.361

8. SwissRegulon: a database of genome-wide annotations of regulatory sites.

Authors: Mikhail Pachkov; Ionas Erb; Nacho Molina; Erik van Nimwegen
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971

9. PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation.

Authors: Elodie Portales-Casamar; Stefan Kirov; Jonathan Lim; Stuart Lithwick; Magdalena I Swanson; Amy Ticoll; Jay Snoddy; Wyeth W Wasserman
Journal: Genome Biol Date: 2007 Impact factor: 13.583

10. A systematic characterization of factors that regulate Drosophila segmentation via a bacterial one-hybrid system.

Authors: Marcus B Noyes; Xiangdong Meng; Atsuya Wakabayashi; Saurabh Sinha; Michael H Brodsky; Scot A Wolfe
Journal: Nucleic Acids Res Date: 2008-03-10 Impact factor: 16.971
