UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein-DNA interactions.

Maxwell A Hume, Luis A Barrera, Stephen S Gisselbrecht, Martha L Bulyk.

Abstract

The Universal PBM Resource for Oligonucleotide Binding Evaluation (UniPROBE) serves as a convenient source of information on published data generated using universal protein-binding microarray (PBM) technology, which provides in vitro data about the relative DNA-binding preferences of transcription factors for all possible sequence variants of a length k ('k-mers'). The database displays important information about the proteins and displays their DNA-binding specificity data in terms of k-mers, position weight matrices and graphical sequence logos. This update to the database documents the growth of UniPROBE since the last update 4 years ago, and introduces a variety of new features and tools, including a new streamlined pipeline that facilitates data deposition by universal PBM data generators in the research community, a tool that generates putative nonbinding (i.e. negative control) DNA sequences for one or more proteins and novel motifs obtained by analyzing the PBM data using the BEEML-PBM algorithm for motif inference. The UniPROBE database is available at http://uniprobe.org.

Year: 2014 PMID: 25378322 PMCID: PMC4383892 DOI: 10.1093/nar/gku1045

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Characterizing and predicting transcription factor (TF) DNA-binding specificities are crucial tasks for understanding the functioning of cellular regulatory networks. The particular binding affinities of a TF govern its set of target genes and thus play an important role in cellular functions and differentiation. The development of universal protein-binding microarray (PBM) technology (1) has allowed for comprehensive high-resolution profiling of the DNA-binding specificity of a given TF by evaluating its binding affinity for all possible k-mer DNA sequences. The Universal PBM Resource for Oligonucleotide Binding Evaluation (UniPROBE) database (2) was created to provide appropriate curation, easy searching and an informative display interface for universal PBM data. An update to UniPROBE was published in 2011 (3). Since that time, many new features have been added to the web interface. In addition, numerous data sets have been deposited into UniPROBE. Here, we discuss these data and features, which include a new data deposition pipeline, a negative control sequence generation tool and motifs derived using BEEML-PBM (4).

DATABASE ADDITIONS

Table 1 describes 12 new publications whose PBM data sets have been introduced into UniPROBE since the last update (5–16). The 96 TFs from these publications come from 19 highly diverse species, many of which are new to the database. At the time of this manuscript's preparation, UniPROBE hosts 515 non-redundant proteins and complexes. A number of additional data depositions are planned for the near future: e.g. Nowak-Lovato et al., 2012 (17); Weirauch et al., 2013 (18); Siggers et al., 2014 (19); Lindemose et al., 2014 (20); Oberstaller et al., 2014 (21). We anticipate that most future depositions will be performed by the authors themselves using our new data deposition pipeline.

Table 1.

New PBM data sets added into UniPROBE

Reference	Number of proteins or complexes	Species
Alibés et al. (5)	2	Homo sapiens, Saccharomyces cerevisiae
Campbell et al. (6)	19	Plasmodium falciparum
Gordân et al. (7)	27	S. cerevisiae
Del Bianco et al. (8)	9	H. sapiens
Cheatle Jarvela et al. (9)	2	Patiria miniata, Strongylocentrotus purpuratus
Busser et al. (Development) (10)	10	Drosophila melanogaster
Nakagawa et al. (11)	20	Acanthamoeba castellanii, Allomyces macrogynus, Ashbya gossypii, Aspergillus nidulans, D. melanogaster, H. sapiens, Kluyveromyces lactis, Monosiga brevicollis, Mus musculus, Mycosphaerella graminicola, Nematostella vectensis, S. purpuratus, Trichoplax adhaerens, Tuber melanosporum
Soruco et al. (12)	1	D. melanogaster
Busser et al. (PNAS) (13)	1	D. melanogaster
Peterson et al. (14)	3	M. musculus
De Masi et al. (15)	1	Caenorhabditis elegans
Helfer et al. (16)	1	Arabidopsis thaliana
Total number of new proteins/complexes:	96
Total, last described (3):	404
Total number of non-redundant proteins/complexes in UniPROBE:	515

DATA DEPOSITION PIPELINE

Among the most significant features recently added to UniPROBE is a web-based pipeline for deposition of new PBM data sets. The tool can be reached from a link in the header near the top of the front page or directly at http://thebrain.bwh.harvard.edu/pbms/webworks_pub_dev/admin.php. Previously, uploading data manually into the MySQL database was inefficient and error-prone; therefore, we designed several linked scripts to automate the process. Figure 1A shows the main page for this pipeline, which also outlines the control flow of the deposition for users. In the first five steps, the user enters information about the proteins involved in their study into the database. The most convenient way to do this is to prepare an appropriately formatted spreadsheet file (for steps 2, 4 and 5; see Figure 1B); alternatively, the input can be entered one entry at a time using an HTML form. Currently, the user must prepare a folder with all of the data files they wish to make public. Instructions for data file preparation are given (and are also provided in Supplementary Text 1), and several helpful scripts are available for download to aid the process. The user then uploads the folder to the UniPROBE server as a zip file. The remaining steps fully integrate the data files into the web interface, including constructing sequence logos for each protein and making all the data easily searchable and available for download. The UniPROBE administrator will then finalize the deposition by ensuring proper insertion and moving the new data into the public version of the web site. Data depositors may contact the UniPROBE administrator to specify a release date for prepublication data submissions.

Figure 1.

Data deposition pipeline. (A) The main page for the UniPROBE data deposition pipeline provides an outline of the data deposition procedure. The user successively clicks each link and follows the instructions in each step. Some steps require only the click of a button, whereas others require either submission of an input file or some extra actions on the command line. (B) The instructions for file-based input in step 5. Steps 2 and 4 have similar instructions. File-based input makes it easy for the user to simultaneously provide all the relevant information to add to the database, and has formatting, error checking and rollback functionality built in.

INCORPORATION OF BEEML-PBM MOTIFS

All of the raw PBM data posted in UniPROBE until recently have been handled in the same manner: the Seed-and-Wobble algorithm, introduced jointly with universal PBM technology (1,22), is used to generate a position weight matrix (PWM) (23,24), which in turn is used to generate sequence logos (25) that are displayed on the protein's Details page (e.g. see Figure 2A). Since the development of universal PBM technology, other algorithms have been developed to derive PWMs from the PBM data. BEEML-PBM employs a maximum likelihood approach, using a weighted nonlinear least-squares regression to infer free energy parameters for TF–DNA interactions (4). BEEML-PBM was one of the top two algorithms in the DREAM5 challenge (18) and provided PWMs with better performance than Seed-and-Wobble for the majority of TFs. We have generated PWMs using BEEML-PBM for the PBM data from all publications whose data have been incorporated into UniPROBE, including those mentioned in this paper (1,5–16,26–32). The free energy parameters derived from BEEML-PBM were converted into PWM frequencies by applying a Boltzmann distribution probability mass function to each matrix column. Figure 2 shows an example of Seed-and-Wobble and BEEML-PBM logos in UniPROBE. All of the new logos are currently viewable on the appropriate protein pages and the PWMs are available for download either individually on these pages or in bulk on the Downloads page.
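
The conversion from free energy parameters to PWM frequencies can be illustrated with a minimal sketch. It assumes that each matrix column holds one energy per base expressed in units of RT, and that lower energy corresponds to stronger binding; both conventions, along with the function name `energies_to_pwm` and the example values, are assumptions made for illustration and are not taken from the UniPROBE or BEEML-PBM code.

```python
import math

BASES = "ACGT"


def energies_to_pwm(energy_columns, rt=1.0):
    """Turn per-position base energies into per-position base probabilities.

    energy_columns: list of dicts mapping each base to an energy
    (assumed here to be in units of RT, lower = more favourable).
    """
    pwm = []
    for column in energy_columns:
        # Shift by the column minimum so the exponentials stay well scaled;
        # the shift cancels out in the normalisation below.
        e_min = min(column[b] for b in BASES)
        weights = {b: math.exp(-(column[b] - e_min) / rt) for b in BASES}
        total = sum(weights.values())
        pwm.append({b: weights[b] / total for b in BASES})
    return pwm


# Hypothetical two-position energy matrix, for illustration only.
example_energies = [
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 2.0},  # A strongly preferred
    {"A": 1.0, "C": 1.0, "G": 1.0, "T": 1.0},  # no preference
]
pwm = energies_to_pwm(example_energies)
```

Because each column is normalised independently, adding a constant to all four energies of a column leaves the resulting frequencies unchanged.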

Figure 2.

Seed-and-Wobble and BEEML-PBM motif displays. Examples of displays for data generated using the (A) Seed-and-Wobble and (B) BEEML-PBM algorithms for the Erg protein, from Wei et al., 2010 (31). (A) The Seed-and-Wobble data displays a sequence logo, links for downloading the PWM data and the top-scoring k-mer along with its PBM enrichment score. (B) The BEEML-PBM data display format is essentially the same, but because k-mers and enrichment scores are not utilized in this algorithm, an IUPAC consensus sequence derived from the PWM is instead displayed above the motif. The reverse complement sequence orientation can be displayed for either data set individually by clicking the appropriate button; this changes the logo, the PWM file link and the displayed sequence. Assignment of ‘forward’ versus ‘reverse complement’ orientation is arbitrary for each PWM—here, the BEEML-PBM data have been switched to ‘reverse complement’ mode in order to display a more obvious comparison between the logos, since its ‘forward’ orientation happens to correspond more closely to the Seed-and-Wobble data's ‘reverse complement’ orientation.
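
The orientation switch and the IUPAC consensus mentioned in this legend can be sketched as follows. Reverse-complementing a PWM amounts to reversing the column order and swapping A with T and C with G. The consensus rule shown here (keep every base whose frequency is at least `min_prob`) is only an assumed, illustrative rule; the rule actually used by UniPROBE is not described in the text.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases covered.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def reverse_complement_pwm(pwm):
    """Reverse the column order and relabel each base by its complement."""
    return [{COMPLEMENT[b]: p for b, p in col.items()} for col in reversed(pwm)]


def iupac_consensus(pwm, min_prob=0.25):
    """Assumed rule: keep each base whose frequency is >= min_prob."""
    symbols = []
    for col in pwm:
        kept = frozenset(b for b, p in col.items() if p >= min_prob)
        if not kept:
            # Fall back to the single most frequent base.
            kept = frozenset(max(col, key=col.get))
        symbols.append(IUPAC[kept])
    return "".join(symbols)


# Hypothetical three-position PWM, for illustration only.
example_pwm = [
    {"A": 0.90, "C": 0.05, "G": 0.03, "T": 0.02},
    {"A": 0.45, "C": 0.05, "G": 0.45, "T": 0.05},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
forward = iupac_consensus(example_pwm)
reverse = iupac_consensus(reverse_complement_pwm(example_pwm))
```

Applying `reverse_complement_pwm` twice returns the original matrix, which is why the choice of 'forward' orientation carries no information in itself.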

NEGATIVE CONTROL SEQUENCE GENERATOR

UniPROBE's main ‘toolbox’, found on the front and Browse pages, includes: a basic text search with different options; a tool that finds proteins with a sufficiently close match to a query DNA motif; a tool that scans a DNA sequence for putative TF-binding sites (2); and a blastp search tool for matching protein sequences (3). In addition to predicting specific protein–DNA interactions, it is sometimes desirable to find a sequence that is predicted not to be bound by one or more given proteins, e.g. when designing negative controls for in vivo reporter experiments or nonspecific competitor DNA for in vitro assays. An important new addition to this toolbox is a negative control (nonbinding) sequence generator for such purposes; the search interface for this tool is displayed in Figure 3A. This tool takes as input a list of proteins stored in UniPROBE along with a few parameters: the PBM k-mer enrichment score threshold for TF binding, and minimum and maximum length cutoffs for the sequence to be generated. The output is a DNA sequence that is predicted to have little to no specific binding by any of the selected proteins, based on the PBM data available for those proteins in UniPROBE.

Figure 3.

Examples of input and output from the Negative Control Sequence Generator tool. (A) Form for the Negative Control Sequence Generator tool. In this example, the user has selected two proteins using the pulldown menu, but alternatively, the user can select all proteins in the database from a given species or enter the proteins he/she wants into the text area. The user has requested two sequences between 50 and 150 bp in length. The enrichment score threshold and ‘maximum number of tries’ parameter values used here are the defaults. Clicking on the ‘Help’ link in this box on the web page provides more information about the various parameters. (B) The text of an email reply containing the results from the Negative Control Sequence Generator tool for the input shown in (A).

Briefly, the algorithm works as follows. First, it assembles a list of all contiguous 8-mers such that every selected protein has scored below the enrichment score threshold for binding to that 8-mer in every PBM data set for that protein. Then, it generates putative nonbinding DNA sequences by randomly concatenating suitable k-mers such that no disallowed 8-mer (i.e. no 8-mer not in the input list) will appear at any point in the sequence. This is ensured by the construction and use of a mapping in which every 7-mer corresponds to a list of the bases allowed to directly follow it in the next sequential nucleotide. During each addition to the sequence, the next nucleotide added is selected from this list to ensure that no disallowed 8-mer is created. Note that since the addition of k-mers is performed randomly, this algorithm is non-deterministic; thus, the user can also specify the number of sequences to be generated. The results are emailed to the user once the computation has finished; an example is provided in Figure 3B.
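
The 7-mer mapping at the core of this procedure can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the UniPROBE implementation: the function names, the single target `length` (instead of separate minimum and maximum cutoffs), the restart-on-dead-end use of `max_tries`, and the toy allowed list (all 8-mers lacking the substring GATA, standing in for 8-mers that score below the enrichment threshold) are all assumptions. Strand handling is not addressed.

```python
import itertools
import random
from collections import defaultdict

K = 8  # length of the k-mers scored on the arrays


def build_next_base_map(allowed_kmers):
    """Map each 7-mer prefix to the bases that may follow it."""
    next_bases = defaultdict(list)
    for kmer in allowed_kmers:
        next_bases[kmer[:-1]].append(kmer[-1])
    return next_bases


def generate_sequence(allowed_kmers, length, max_tries=100, rng=None):
    """Randomly grow a sequence in which every 8-mer window is allowed.

    Returns None if no sequence of the requested length is found
    within max_tries attempts.
    """
    if length < K:
        raise ValueError("length must be at least K")
    rng = rng or random.Random()
    allowed = sorted(allowed_kmers)
    if not allowed:
        return None
    next_bases = build_next_base_map(allowed)
    for _ in range(max_tries):
        seq = rng.choice(allowed)  # seed with a random allowed 8-mer
        while len(seq) < length:
            options = next_bases.get(seq[-(K - 1):])
            if not options:
                break  # dead end: no base can follow this 7-mer
            seq += rng.choice(options)
        if len(seq) == length:
            return seq
    return None


# Toy stand-in for the filtered 8-mer list (hypothetical criterion).
allowed_set = {
    "".join(p)
    for p in itertools.product("ACGT", repeat=K)
    if "GATA" not in "".join(p)
}
seq = generate_sequence(allowed_set, length=50, rng=random.Random(0))
```

Because each new base is drawn only from the bases permitted after the current 7-mer suffix, every 8-mer window of the growing sequence is in the allowed list by construction; no post hoc scan is needed.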

OTHER NEW FEATURES

The blastp search feature introduced in the last published update (3) has been further improved by adding a visualization of the alignment between the query and result sequences within the search results. Links to the TFBSshape database (33) have been included in the Details pages of proteins with available TFBSshape data. TFBSshape describes the structural features of DNA at TF binding sites, and has entries for proteins corresponding to entries in JASPAR (34,35) and UniPROBE. Figure 4 shows an example of a link and its corresponding TFBSshape web page. Publications with data in UniPROBE whose protein pages currently link to TFBSshape (and vice versa) are: Berger et al., 2006 (1); Berger et al., 2008 (26); Zhu et al., 2009 (27); Badis et al., 2009 (28); Lesch et al., 2009 (30); Scharer et al., 2009 (32). We will continue to correspond with the TFBSshape administrators and provide links for additional publications as they become available in the TFBSshape database.

Figure 4.

TFBSshape links. (A) An example of a link to the TFBSshape database from the Protein Details page for Hoxa6, from Berger et al., 2008 (26). (B) The TFBSshape page for Hoxa6, to which the link in (A) leads.

Finally, migration to a new, faster server has been completed, and we expect a concomitant speedup in web operation times.

DISCUSSION

There are many opportunities for further improvements to UniPROBE in the near future. To start, there are additional published PBM data sets still awaiting deposition into the database. To expedite data deposition, we encourage authors of such studies to submit their data themselves into UniPROBE using our new data deposition pipeline. Additional improvements could be made to the data deposition pipeline. Currently, the pipeline does not account explicitly for variations in the structure of the data files available for download from each publication; for example each protein from a publication may have different sets of PBM data reflecting distinct binding activity for different clones, protein complexes of which a particular protein is a component or data from replicate PBM experiments. In some cases, data are available from experiments using different PBM array versions (which may themselves have multiple replicates). Similarly, the structure of the Protein Details page must properly match the file structure in order to optimally display the data. The default template for the Details page has not yet been configured to handle the amount of potential variability in the file structure, and currently a new template page must be generated by the database administrator for any publication newly deposited into UniPROBE that does not have strictly one set of files per protein, without any complexes. In the future, we hope to automate this process by creating one or more pre-written page templates that can handle variation in data file structure. Users should still be able to request customization of their publication's Details pages if necessary. Further planned improvements to the deposition process include the ability to request specific UniPROBE accession numbers for proteins (see Robasky and Bulyk, 2011 (3) for a description of protein accession numbers in UniPROBE). 
We also plan to assign accession numbers to publications for reference and to allow users to restrict searches to particular publication data sets. PWM data derived using motif-finding algorithms other than Seed-and-Wobble and BEEML-PBM will also be added. Among those on which we may choose to focus initially are FeatureREDUCE (manuscript in preparation) and MatrixREDUCE (36), which also performed well in the DREAM5 challenge (18). BEEML-PBM data will also be generated for the remaining publications that have been deposited in UniPROBE. Finally, users will soon be able to bulk download a FASTA file of the protein sequences of all the TF clones used in the PBM experiments. We welcome feedback and suggestions for further improvements from our users. Questions, comments or suggestions can be sent to the new UniPROBE administrative email account at uniprobe@genetics.med.harvard.edu.

AVAILABILITY

As before, the data in UniPROBE are freely available at the database web site (http://uniprobe.org), and the sequences of the 60-mer DNA probes on the custom-designed oligonucleotide arrays are available under the terms of an academic research use license available at http://thebrain.bwh.harvard.edu/uniprobe/academic-license.php.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

36 in total

Review 1. DNA binding sites: representation and discovery.

Authors: G D Stormo
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

2. DNA-binding specificity changes in the evolution of forkhead transcription factors.

Authors: So Nakagawa; Stephen S Gisselbrecht; Julia M Rogers; Daniel L Hartl; Martha L Bulyk
Journal: Proc Natl Acad Sci U S A Date: 2013-07-08 Impact factor: 11.205

3. Diversification of transcription factor paralogs via noncanonical modularity in C2H2 zinc finger DNA binding.

Authors: Trevor Siggers; Jessica Reddy; Brian Barron; Martha L Bulyk
Journal: Mol Cell Date: 2014-07-17 Impact factor: 17.970

4. Neural-specific Sox2 input and differential Gli-binding affinity provide context and positional information in Shh-directed neural patterning.

Authors: Kevin A Peterson; Yuichi Nishi; Wenxiu Ma; Anastasia Vedenko; Leila Shokri; Xiaoxiao Zhang; Matthew McFarlane; José-Manuel Baizabal; Jan Philipp Junker; Alexander van Oudenaarden; Tarjei Mikkelsen; Bradley E Bernstein; Timothy L Bailey; Martha L Bulyk; Wing H Wong; Andrew P McMahon
Journal: Genes Dev Date: 2012-12-15 Impact factor: 11.361

5. Evaluation of methods for modeling transcription factor sequence specificity.

Authors: Matthew T Weirauch; Atina Cote; Raquel Norel; Matti Annala; Yue Zhao; Todd R Riley; Julio Saez-Rodriguez; Thomas Cokelaer; Anastasia Vedenko; Shaheynoor Talukder; Harmen J Bussemaker; Quaid D Morris; Martha L Bulyk; Gustavo Stolovitzky; Timothy R Hughes
Journal: Nat Biotechnol Date: 2013-01-27 Impact factor: 54.908

6. The Cryptosporidium parvum ApiAP2 gene family: insights into the evolution of apicomplexan AP2 regulatory systems.

Authors: Jenna Oberstaller; Yoanna Pumpalova; Ariel Schieler; Manuel Llinás; Jessica C Kissinger
Journal: Nucleic Acids Res Date: 2014-06-23 Impact factor: 16.971

7. TFBSshape: a motif database for DNA shape features of transcription factor binding sites.

Authors: Lin Yang; Tianyin Zhou; Iris Dror; Anthony Mathelier; Wyeth W Wasserman; Raluca Gordân; Remo Rohs
Journal: Nucleic Acids Res Date: 2013-11-07 Impact factor: 16.971

8. A DNA-binding-site landscape and regulatory network analysis for NAC transcription factors in Arabidopsis thaliana.

Authors: Søren Lindemose; Michael K Jensen; Jan Van de Velde; Charlotte O'Shea; Ken S Heyndrickx; Christopher T Workman; Klaas Vandepoele; Karen Skriver; Federico De Masi
Journal: Nucleic Acids Res Date: 2014-06-09 Impact factor: 16.971

9. Modular evolution of DNA-binding preference of a Tbrain transcription factor provides a mechanism for modifying gene regulatory networks.

Authors: Alys M Cheatle Jarvela; Lisa Brubaker; Anastasia Vedenko; Anisha Gupta; Bruce A Armitage; Martha L Bulyk; Veronica F Hinman
Journal: Mol Biol Evol Date: 2014-07-12 Impact factor: 16.240

10. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles.

Authors: Anthony Mathelier; Xiaobei Zhao; Allen W Zhang; François Parcy; Rebecca Worsley-Hunt; David J Arenillas; Sorana Buchman; Chih-yu Chen; Alice Chou; Hans Ienasescu; Jonathan Lim; Casper Shyr; Ge Tan; Michelle Zhou; Boris Lenhard; Albin Sandelin; Wyeth W Wasserman
Journal: Nucleic Acids Res Date: 2013-11-04 Impact factor: 16.971
