
UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions.

Abstract

The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) database is a centralized repository of information on the DNA-binding preferences of proteins as determined by universal protein-binding microarray (PBM) technology. Each entry for a protein (or protein complex) in UniPROBE provides the quantitative preferences for all possible nucleotide sequence variants ('words') of length k ('k-mers'), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In this update, we describe >130% expansion of the database content, incorporation of a protein BLAST (blastp) tool for finding protein sequence matches in UniPROBE, the introduction of UniPROBE accession numbers and additional database enhancements. The UniPROBE database is available at http://uniprobe.org.


Substances:
DNA-Binding Proteins
DNA

Year: 2010 PMID: 21037262 PMCID: PMC3013812 DOI: 10.1093/nar/gkq992

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

A comprehensive understanding of gene-expression regulation requires the thorough characterization of transcription factor (TF)–DNA binding properties. TFs play central roles in transcriptional regulatory networks by binding specific DNA sequences and activating or repressing gene expression. Consequently, TF–DNA-binding specificities have broad impact on cell physiology and development and in evolution (1,2). Advances in DNA microarray synthesis and the development of protein-binding microarray (PBM) technology (3,4) led to the development of universal PBMs (5), which allow high-throughput measurement of comprehensive data on protein–DNA binding specificities, resulting in large data sets requiring curation and searchability. The Universal PBM Resource for Oligonucleotide-Binding Evaluation (UniPROBE) (6) database was created to satisfy these requirements. Please refer to the original UniPROBE publication (6) for a description of major differences between UniPROBE and the JASPAR (7), TRANSFAC (8) and PAZAR (9) databases. The original UniPROBE publication (6) also provides a detailed description of PBM technology and data types.

Since its inception 2 years ago, the UniPROBE database has continued to expand in size, utility and user base. UniPROBE previously housed data for 177 non-redundant proteins (6). That number has recently grown to over 400 non-redundant proteins or protein complexes, with additional, unpublished PBM data sets already planned for future deposition. Currently, the UniPROBE database averages 933 unique visitors per month (classified by IP address) from over 40 different countries and 3558 page views per month. UniPROBE is the standard for curating universal PBM data, and we invite other researchers generating universal PBM data to contact us about depositing their data in UniPROBE.

DATABASE ADDITIONS

UniPROBE has more than doubled in size since its introduction in January 2009 (6) (Table 1). As of this writing, in addition to the data deposited from the initial set of four publications (5,10–12), PBM data are included from six newer publications (13–18) with additional published (19) and soon to be published data currently in planning for deposition. The new additions include data on TFs from Caenorhabditis elegans, Saccharomyces cerevisiae, Mus musculus and Homo sapiens. The UniPROBE database now houses PBM data for 415 individual proteins or protein complexes, nearly all of which are TFs, corresponding to 404 non-redundant proteins or protein complexes.

Table 1.

UniPROBE database contents, with indication of additions in PBM data sets since its introduction in 2009

Reference	Number of proteins or protein complexes	Species
Berger et al. (5)	5	Saccharomyces cerevisiae, Homo sapiens, Mus musculus, Caenorhabditis elegans
Berger et al. (10)	168	Mus musculus
Pompeani et al. (11)	1	Vibrio harveyi
De Silva et al. (12)	3	Plasmodium falciparum, Cryptosporidium parvum
Grove et al. 2009 (13)^a	21	Caenorhabditis elegans
Scharer et al. (14)^a	1	Homo sapiens
Lesch et al. (15)^a	1	Caenorhabditis elegans
Zhu et al. (16)^a	89	Saccharomyces cerevisiae
Badis et al. (17)^a	104	Mus musculus
Wei et al. (18)^a	22	Mus musculus
Total number:	415
Non-redundant proteins or protein complexes:	404
Total, last described (6):	177
Total added:	238
Percent increase:	134%

^a Indicates data sets that have been added since the last published description of UniPROBE.


NEW BLASTP SEARCH FEATURE

In the latest version of UniPROBE, the available online search features have been augmented with a new search tool that permits a user to perform a blastp (20) search of a protein sequence of interest (the ‘query protein’) against all protein sequences in the UniPROBE database (the ‘subject proteins’). This feature incorporates NCBI’s Protein–Protein BLAST tool (21), blastp v.2.2.23+, for accurate and efficient alignments. This blastp tool returns a list of links to the Details page for each subject protein that either exactly matches or is similar to the query protein(s) according to user-specified search parameter settings. The Details pages allow further exploration and provide links to download the PBM data for the matching proteins.

Query protein sequences may be entered manually into a web-page form or uploaded as a text file. The sequence is parsed using fail-safe rules to interpret the format. Multiple sequences can be processed in batch either by specifying one sequence per line, or by entry of FASTA-formatted sequences, which may cross multiple lines but are separated by header lines. Numbers and unnecessary white-space are stripped from the sequence prior to performing the search. For the subject proteins, the blastp search tool uses a database comprising all the clone insert sequences corresponding to all the PBM experiments with data curated in UniPROBE.

For example, consider a search for the human TF GATA4, which is not currently in UniPROBE. Running the blastp tool on the human GATA4 sequence with default parameter settings (Figure 1) results in eight hits, four from yeast and four from mouse, all with the GATA DNA-binding domain (Figure 2). Among the hits, the tool correctly retrieves two hits to Gata3, which is represented in the database by two proteins: the full-length TF and just the DNA-binding domain.
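
The input-handling rules described above (one sequence per line, or FASTA records separated by header lines, with numbers and white-space stripped) can be illustrated with a minimal Python sketch. This is not the UniPROBE implementation: the function names `clean_sequence` and `parse_queries`, the upper-casing of residues and the `query_N` fallback names are assumptions made for illustration only.

```python
import re


def clean_sequence(raw):
    """Strip digits and white-space; upper-casing is an assumption of this sketch."""
    return re.sub(r"[0-9\s]+", "", raw).upper()


def parse_queries(text):
    """Return a list of (name, sequence) pairs from pasted or uploaded text.

    If any line starts with '>', the input is treated as FASTA and sequence
    lines are joined until the next header. Otherwise every non-empty line
    is treated as one query sequence.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    records = []

    def flush(name, chunks):
        # Store a record only if some residues remain after cleaning.
        seq = clean_sequence("".join(chunks))
        if seq:
            records.append((name or f"query_{len(records) + 1}", seq))

    if any(ln.startswith(">") for ln in lines):
        name, chunks = None, []
        for ln in lines:
            if ln.startswith(">"):
                flush(name, chunks)          # close the previous record
                name, chunks = ln[1:].strip(), []
            else:
                chunks.append(ln)
        flush(name, chunks)                  # close the final record
    else:
        for ln in lines:
            flush(None, [ln])
    return records
```

For instance, `parse_queries(">a\nMKT 10\nLLV\n>b\nGGA")` yields two records, with the digits and the line break inside the first record removed.
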
The blastp search parameter settings (E-value threshold, species, substitution matrix and word size) are passed directly to a local instance of NCBI’s blastp executable.
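
As a rough illustration of how such settings might be handed to a local blastp executable, the sketch below assembles an argument list using the standard BLAST+ options `-query`, `-db`, `-evalue`, `-matrix`, `-word_size` and `-outfmt`. The function name, the default values, the matrix whitelist and the idea of selecting a species by choosing a per-species database path are assumptions of this sketch, not details taken from UniPROBE.

```python
# Substitution matrices accepted by this sketch (an illustrative whitelist).
ALLOWED_MATRICES = {
    "BLOSUM45", "BLOSUM50", "BLOSUM62", "BLOSUM80", "BLOSUM90",
    "PAM30", "PAM70", "PAM250",
}


def build_blastp_command(query_path, db_path, evalue=10.0,
                         matrix="BLOSUM62", word_size=3):
    """Build an argument list for a local blastp run.

    db_path is a hypothetical BLAST database of UniPROBE protein sequences;
    a species restriction could be modelled by pointing it at a
    species-specific database (an assumption of this sketch).
    """
    if evalue <= 0:
        raise ValueError("E-value threshold must be positive")
    if matrix not in ALLOWED_MATRICES:
        raise ValueError(f"unsupported substitution matrix: {matrix}")
    if not isinstance(word_size, int) or word_size < 1:
        raise ValueError("word size must be a positive integer")
    return [
        "blastp",
        "-query", query_path,
        "-db", db_path,
        "-evalue", str(evalue),
        "-matrix", matrix,
        "-word_size", str(word_size),
        "-outfmt", "6",   # tabular output, one line per hit
    ]
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`. Building a list rather than a shell string keeps user-supplied form values from being interpreted by a shell.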

Figure 1.

Blastp search of UniPROBE with human GATA4 protein sequence and default parameter settings for the advanced search options.

Figure 2.

Results from blastp search of all protein sequences in UniPROBE using the human full-length GATA4 protein sequence as the query.

Results are output with the sequence matches within matching subject proteins rendered with yellow highlighting on all the residues within the confines of the alignment. Also provided is the offset of the first aligned residue of the query protein. As defined by blastp, the score provided is a measure of similarity, and the E-value is the number of expected matches if the subject protein sequences were generated randomly.

UNIPROBE ACCESSION NUMBERS

A significant new feature is the addition of UniPROBE accession numbers. Each TF PBM data set now has its own UniPROBE accession number, regardless of whether or not its protein is unique in the database. Accession numbers are five digits prefixed with ‘UP’ (abbreviation for ‘universal PBM’), e.g. UP00350. Accession numbers are returned as part of the search results and are also listed on each protein’s Details page. A user can use the ‘Quick Search’ tool to find TFs by accession number. Accession numbers can be requested prior to publication of new PBM data sets, such as for unpublished PBM data sets in new article submissions.
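
The accession format described above (‘UP’ followed by five digits) is simple to check programmatically. The following Python sketch is illustrative only; the helper names are hypothetical, and treating the prefix as case-sensitive is an assumption.

```python
import re

# 'UP' followed by exactly five ASCII digits, e.g. UP00350.
ACCESSION_RE = re.compile(r"UP[0-9]{5}")


def is_uniprobe_accession(text):
    """Return True if text (ignoring surrounding white-space) looks like a UniPROBE accession."""
    return ACCESSION_RE.fullmatch(text.strip()) is not None


def format_accession(number):
    """Zero-pad an integer to the five-digit accession form."""
    if not 0 <= number <= 99999:
        raise ValueError("accession number must fit in five digits")
    return f"UP{number:05d}"
```

Here `format_accession(350)` gives `UP00350`, the example accession cited above, while a string such as `UP350` is rejected by `is_uniprobe_accession`.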

OTHER NEW FEATURES

New to this version of UniPROBE is the inclusion of PBM data for protein complexes. This functionality was implemented to accommodate homodimer and heterodimer data for bHLH TFs from C. elegans (13). This feature allows the Details page to render data sets for the protein of interest and for each of the proteins with which the protein of interest dimerizes. The UniPROBE statistics cited here were derived with the aid of several minor but useful enhancements. It is now possible to use ‘Text Search’ to find TFs by publication; TFs can be searched by species using the same tool. The search results now include the total number of TFs returned. To easily distinguish between separate, published PBM data sets for the same protein, a reference to the publication for each separate data set has been added to the bottom of all TF Details pages, along with the array design number(s). For convenience a new, shorter URL (http://uniprobe.org) has been registered, which redirects to the legacy UniPROBE URL (http://thebrain.bwh.harvard.edu/uniprobe).

FUTURE DIRECTIONS

Future updates planned for UniPROBE include additional user and administrative tools. Currently in development is a negative control sequence generator which, given an E-score threshold indicative of DNA-binding preference, will generate random sequence of user-specified length that does not include any 8-mer with scores exceeding the given threshold for user-selected TFs and species in UniPROBE. Another planned feature is the display of sequence alignments resulting from the blastp searches of UniPROBE. Also under development are administrative tools to allow for self-deposition and automated pre-publication UniPROBE accession number requests. The template for the Details page will be generalized to support self-deposition of PBM data for protein complexes. These tools and others will be facilitated, and system performance will generally improve, with the implementation of a newly designed database schema. As always, we continue to encourage user registration and feedback for error reports and feature requests, some of which motivated the development of the new features described here.
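
One way the planned negative control sequence generator could work is sketched below in Python. This is an illustration under stated assumptions, not the tool under development: E-scores are supplied as one hypothetical dictionary per selected TF mapping 8-mers to scores, a k-mer and its reverse complement are treated as equivalent, k-mers missing from a table are treated as unscored, and a simple extend-and-back-off strategy is used (so the output is not a uniform sample over all permissible sequences).

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def is_bound(kmer, escore_tables, threshold):
    """True if the k-mer (or its reverse complement) exceeds threshold for any TF."""
    rc = reverse_complement(kmer)
    for table in escore_tables:
        score = max(table.get(kmer, float("-inf")),
                    table.get(rc, float("-inf")))
        if score > threshold:
            return True
    return False


def negative_control(length, escore_tables, threshold, k=8,
                     rng=None, max_steps=100000):
    """Grow a random DNA sequence in which no k-mer window exceeds threshold."""
    rng = rng or random.Random()
    seq = []
    steps = 0
    while len(seq) < length:
        if steps == max_steps:
            raise RuntimeError("could not build a sequence under this threshold")
        steps += 1
        seq.append(rng.choice("ACGT"))
        # Only the newest window can be unchecked; earlier windows already passed.
        if len(seq) >= k and is_bound("".join(seq[-k:]), escore_tables, threshold):
            # Back off by a random number of bases (1..k) to escape dead ends.
            del seq[-rng.randint(1, k):]
    return "".join(seq)
```

A call such as `negative_control(60, [{"AAAAAAAA": 0.49}], 0.45)` returns a 60-nt sequence that contains neither `AAAAAAAA` nor its reverse complement. The table contents and threshold here are made-up values for demonstration.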

AVAILABILITY AND LICENSE

All data hosted by the PBM database are freely available for distribution at the database website. The sequences of the 60-mer DNA probes synthesized on the custom-designed universal arrays are available under the terms of the academic research use license available at http://thebrain.bwh.harvard.edu/uniprobe/academic-license.php.

FUNDING

Funding for open access charge: National Institutes of Health (grant number R01 HG003985 to M.L.B.).

Conflict of interest statement. None declared.

21 in total

Review 1. DNA binding sites: representation and discovery.

Authors: G D Stormo
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

2. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays.

Authors: M L Bulyk; X Huang; Y Choo; G M Church
Journal: Proc Natl Acad Sci U S A Date: 2001-06-12 Impact factor: 11.205

3. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays.

Authors: Sonali Mukherjee; Michael F Berger; Ghil Jona; Xun S Wang; Dale Muzzey; Michael Snyder; Richard A Young; Martha L Bulyk
Journal: Nat Genet Date: 2004-11-14 Impact factor: 38.330

4. Basic local alignment search tool.

Authors: S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal: J Mol Biol Date: 1990-10-05 Impact factor: 5.469

5. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities.

Authors: Michael F Berger; Anthony A Philippakis; Aaron M Qureshi; Fangxue S He; Preston W Estep; Martha L Bulyk
Journal: Nat Biotechnol Date: 2006-09-24 Impact factor: 54.908

6. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences.

Authors: Michael F Berger; Gwenael Badis; Andrew R Gehrke; Shaheynoor Talukder; Anthony A Philippakis; Lourdes Peña-Castillo; Trevis M Alleyne; Sanie Mnaimneh; Olga B Botvinnik; Esther T Chan; Faiqua Khalid; Wen Zhang; Daniel Newburger; Savina A Jaeger; Quaid D Morris; Martha L Bulyk; Timothy R Hughes
Journal: Cell Date: 2008-06-27 Impact factor: 41.582

Review 7. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

8. Specific DNA-binding by apicomplexan AP2 transcription factors.

Authors: Erandi K De Silva; Andrew R Gehrke; Kellen Olszewski; Ilsa León; Jasdave S Chahal; Martha L Bulyk; Manuel Llinás
Journal: Proc Natl Acad Sci U S A Date: 2008-06-09 Impact factor: 11.205

9. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes.

Authors: V Matys; O V Kel-Margoulis; E Fricke; I Liebich; S Land; A Barre-Dirrie; I Reuter; D Chekmenev; M Krull; K Hornischer; N Voss; P Stegmaier; B Lewicki-Potapov; H Saxel; A E Kel; E Wingender
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

Review 10. Computational prediction of transcription-factor binding site locations.

Authors: Martha L Bulyk
Journal: Genome Biol Date: 2003-12-23 Impact factor: 13.583
