HTPSELEX--a database of high-throughput SELEX libraries for transcription factor binding sites.

Vidhya Jagannathan, Emmanuelle Roulet, Mauro Delorenzi, Philipp Bucher.

Abstract

HTPSELEX is a public database providing access to primary and derived data from high-throughput SELEX experiments aimed at characterizing the binding specificity of transcription factors. The resource is primarily intended to serve computational biologists interested in building models of transcription factor binding sites from large sets of binding sequences. The guiding principle is to make available all information that is relevant for this purpose. For each experiment, we try to provide accurate information about the protein material used, details of the wet lab protocol, an archive of sequencing trace files, assembled clone sequences (concatemers) and complete sets of in vitro selected protein-binding tags. In addition, we offer in-house derived binding site models. HTPSELEX also offers reasonably large SELEX libraries obtained with conventional low-throughput protocols. The FTP site contains the trace archives and database flatfiles. The web server offers user-friendly interfaces for viewing individual entries and quality-controlled download of SELEX sequence libraries according to a user-defined sequencing quality threshold. HTPSELEX is available from ftp://ftp.isrec.isb-sib.ch/pub/databases/htpselex/ and http://www.isrec.isb-sib.ch/htpselex.

Substances:
Transcription Factors
DNA

Year: 2006 PMID: 16381982 PMCID: PMC1347412 DOI: 10.1093/nar/gkj049

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

SELEX (systematic evolution of ligands by exponential enrichment) is an in vitro protocol for isolating nucleic acid ligands to a specific protein from DNA or RNA sequences (1). The technique is frequently used for the purpose of characterizing the binding specificity of a transcription factor. In such an experiment, SELEX yields a library of double-stranded DNA molecules binding to the protein, which can then be used to generate a computational model, e.g. a position-specific scoring or weight matrix (2) that serves to predict binding sites in regulatory DNA sequences. Alternatively, the technique can also be used to select a single high-affinity ligand of a particular protein for biotechnological applications. A so-called ‘high-throughput’ (HTP) SELEX experiment generates large numbers (>1000) of ligands using mass sequencing technology. We recently developed a HTPSELEX protocol, incorporating the concatemerization step of SAGE in order to increase the sequencing throughput (3). This technology development was motivated by computer simulations showing that several thousands of individual sequences are required to derive a reasonably accurate description of the sequence specificity of a typical transcription factor. Published SELEX sequence collections obtained with conventional methods rarely exceed 100 sequences. HTPSELEX serves as a public repository for HTPSELEX data. There is a need for such a resource because the data volumes originating from HTPSELEX experiments are too large to be presented in scientific articles. In contrast, smaller SELEX sequence collections obtained with conventional methods have traditionally been disseminated through the journal literature. SELEX_DB (4) and TRANSFAC (5) are databases that offer these data in machine-readable form. Other related databases, such as JASPAR (6), only distribute the SELEX-based computational models (weight matrices) of transcription factor binding sites, but not the SELEX sequences from which these models were derived.
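To make the notion of a weight matrix concrete, the following is a minimal sketch of how a log-odds matrix could be derived from a set of aligned, equal-length binding sequences. It is not the model-building procedure used for HTPSELEX; the function names, the pseudocount and the uniform background frequency of 0.25 are assumptions chosen for illustration, and the sequences in the usage note are invented toy data.

```python
import math
from collections import Counter

BASES = "ACGT"


def weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from aligned, equal-length sites.

    Returns one {base: score} dictionary per alignment column.
    The pseudocount and uniform background are illustrative assumptions.
    """
    if not sites:
        raise ValueError("need at least one site")
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")

    total = len(sites) + 4 * pseudocount
    matrix = []
    for col in range(length):
        counts = Counter(s[col].upper() for s in sites)
        matrix.append({
            b: math.log2((counts[b] + pseudocount) / total / background)
            for b in BASES
        })
    return matrix


def score(matrix, seq):
    """Sum the column scores of a sequence as long as the matrix."""
    if len(seq) != len(matrix):
        raise ValueError("sequence and matrix lengths differ")
    return sum(col[base] for col, base in zip(matrix, seq.upper()))
```

With toy input such as `weight_matrix(["TTGGC", "TTGGC", "TTGGA", "CTGGC"])`, a sequence matching the majority base in every column scores higher than an unrelated one. With only a handful of sequences the pseudocount dominates the estimates, which is one way to see why large libraries are needed for an accurate model.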

SCOPE AND LEADING CONCEPTS

The main purpose of the HTPSELEX database is to make the primary data from an experiment available in a form suitable for re-analysis in the future. We consider this important because the methods for characterizing the binding specificity of transcription factors are under development and continuously improving. Along with the raw data, we also make derived information available, including transcription factor binding site descriptions represented as hidden Markov models (HMMs) (7). Nevertheless, our resource is primarily intended to serve computational biologists interested in analyzing SELEX sequence libraries, either for methodological developments or for deriving better binding site models for given transcription factors. Effective analysis of HTPSELEX data requires not only access to raw data but also precise knowledge of the experimental protocols used to generate them. A given SELEX method may introduce a specific and predictable bias in the binding site collection, which could be compensated for by a customized computational model building procedure, even though this is currently not performed in practice. The leading principle in the design of a standardized experiment description for HTPSELEX was to provide all technical details that are relevant for the downstream analysis of the data.

OVERVIEW OF A HTPSELEX EXPERIMENT

A complete SELEX experiment starts with a purified nucleic acid binding protein and terminates with a computational model of its binding specificity. Our HTPSELEX protocol, which is schematized in Figure 1, was specifically designed for DNA binding transcription factors. The transcription factor, typically, is a complex composed of several polypeptide chains produced by a recombinant organism. Note that the name of the factor is not necessarily identical with the name of one of its components deposited in the protein sequence database. Sometimes, the polypeptides used in the experiment contain only a part of the native protein. On the DNA side, the SELEX protocol uses a library of synthetic oligonucleotides consisting of a random internal part and constant flanking regions as starting material. The latter serve as PCR primers and provide restriction sites for concatemerization and insertion into a cloning vector.
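The structure of the input library (constant flank, random part, constant flank) can be illustrated with a small sketch that recovers the random part of an oligonucleotide. The function name and the flank sequences in the usage note are hypothetical; the actual constant regions of each library are documented in the corresponding experiment entry.

```python
def extract_random_region(oligo, left_flank, right_flank):
    """Return the part of `oligo` between the two constant flanks.

    Returns None if the oligonucleotide does not start and end with
    the expected flanks. The flank sequences are supplied by the caller.
    """
    oligo = oligo.upper()
    left_flank = left_flank.upper()
    right_flank = right_flank.upper()

    # The two flanks must fit into the oligonucleotide without overlapping.
    if len(oligo) < len(left_flank) + len(right_flank):
        return None
    if not oligo.startswith(left_flank) or not oligo.endswith(right_flank):
        return None
    return oligo[len(left_flank):len(oligo) - len(right_flank)]
```

For example, with the made-up flanks `GGATCC` and `TCTAGA`, the call `extract_random_region("GGATCCACGTACGTTCTAGA", "GGATCC", "TCTAGA")` yields the internal part `ACGTACGT`.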

Figure 1

HTPSELEX protocol. The flowchart shows the HTPSELEX data acquisition and analysis steps starting from a random pool of DNA oligonucleotides. Experimental details for each HTPSELEX experiment are recorded in the corresponding entry in htpselex.doc. The chromatograms for each experiment are made available on our FTP server. The clone sequences obtained after the Phred/Phrap processing of trace files are recorded in htpselex.dat. The ‘tag’ sequences corresponding to the binding sites are available in the fasta format in htpselex.seq along with quality information. Binding site models obtained after initial analysis of these experiments are also available on our FTP server.

Once made double-stranded, the random DNA library is mixed with protein and protein–DNA complexes are subsequently isolated by some biochemical method, for instance by preparative electrophoretic mobility shift assay. After purification, the protein–DNA complexes are dissociated and the DNA fraction is amplified by PCR before the next selection-amplification cycle. The sequences of the constant regions of the input library are relevant for downstream analysis inasmuch as they may overlap with the in vitro selected binding sites. The SELEX libraries obtained after each cycle are subjected to analytical sequencing and, if judged useful, to HTP sequencing. For this purpose, the random parts of the in vitro selected oligonucleotides in addition to a few flanking bases are cut out with a restriction enzyme and concatemerized before ligation into a cloning vector. Knowledge of the restriction enzyme used in this step is important because it could induce a bias in the SELEX library: binding sequences containing the corresponding restriction site are automatically destroyed during this processing step. The insert-containing vectors are then transfected into bacteria. Individual colonies are sent to the sequencing facility. At this stage, the wet laboratory protocol ends and the computational data analysis pipeline starts. The raw data obtained from the sequencing laboratory consist of trace files (electropherograms) associated with a colony identifier and a sequencing direction (forward or reverse). There may be several trace files for each colony. Individual reads from the same colony are processed and assembled with Phred and Phrap (8,9), resulting in a consensus clone insert sequence with base quality scores. Upon preliminary analysis of these sequences, one usually detects some colonies containing the same insert sequences. In the HTPSELEX jargon, these ‘colonies’ are said to represent the same ‘clone’. Sequencing reads from the same clone are pooled and subjected to a second round of Phred/Phrap processing (Figure 2).
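The restriction-site bias described above can be checked computationally. The sketch below tests whether a candidate binding sequence contains a given recognition site on either strand and would therefore be lost during concatemerization. The function names are assumptions, and the site `GGATCC` used in the usage note is only an example; the enzyme actually used is recorded in each experiment entry.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def contains_site(seq, site):
    """True if the recognition site occurs on either strand of `seq`."""
    seq = seq.upper()
    site = site.upper()
    return site in seq or reverse_complement(site) in seq


def sites_lost_to_digestion(candidates, site):
    """Return the candidate sequences that would be cut by the enzyme."""
    return [c for c in candidates if contains_site(c, site)]
```

Calling `sites_lost_to_digestion(["AAGGATCCTT", "TTGGCACTGAGCCAA"], "GGATCC")` returns only the first sequence. A model builder could use such a check to flag binding sites that cannot be represented in the library.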

Figure 2

Example of an experiment entry. Data items appearing in Figure 1 are shown in gray colour.

Individual repeat units (called ‘tags’ in the HTPSELEX jargon) are parsed out from the clone sequences with the aid of a HMM representing the repetitive insert structure and some flanking vector sequences (references to HMM decoding programs are given in the next section). For each tag, a per-base error rate is computed using the base quality scores returned by Phred or Phrap. The complete tag collection is subsequently quality filtered using the error rate estimates and scanned for duplicate tags. Duplicate tags are usually observed after about five SELEX cycles and are the consequence of the loss of diversity caused by repeated reduction and expansion of the population. The quality-filtered tag sequence collection is finally used to derive a binding site model. There are different types of computational models to represent the binding site, and for each type there are different algorithms to derive the corresponding parameters from the data. A survey of these methods is beyond the scope of this article. Note, however, that many of the smaller SELEX libraries published in journal articles contain mainly high-affinity binding sites and thus are not expected to produce binding site models of high predictive value, regardless of the model-building method used (3).
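The quality-filtering step can be sketched as follows. A Phred quality score Q corresponds to an estimated base-calling error probability of 10^(-Q/10). The sketch assumes that the per-base error rate of a tag is the mean of these probabilities over its bases; the exact definition used by the HTPSELEX pipeline is not specified here, and the function names and the input layout (identifier, sequence, list of quality scores) are hypothetical.

```python
def error_probability(q):
    """Convert a Phred quality score into an error probability."""
    return 10 ** (-q / 10)


def per_base_error_rate(qualities):
    """Mean error probability over the bases of one tag.

    Taking the mean is an assumption made for this sketch.
    """
    if not qualities:
        raise ValueError("a tag needs at least one quality score")
    return sum(error_probability(q) for q in qualities) / len(qualities)


def quality_filter(tags, max_error_rate):
    """Keep the tags whose per-base error rate is at most the threshold.

    `tags` is an iterable of (identifier, sequence, qualities) tuples.
    """
    return [
        (tag_id, seq)
        for tag_id, seq, quals in tags
        if per_base_error_rate(quals) <= max_error_rate
    ]
```

For instance, a tag whose bases all have quality 40 has an estimated error rate of 0.0001 and passes a threshold of 0.01, whereas a tag with quality 10 throughout (error rate 0.1) is rejected.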

STRUCTURE AND FORMAT OF THE HTPSELEX DATABASE

The core of HTPSELEX is the flat file release, which is distributed from our FTP site jointly with the compressed archives of the trace files. There are three main files, each containing a collection of a particular entry type:

htpselex.doc: contains experiment entries.
htpselex.dat: contains clone sequence entries.
htpselex.seq: contains tag sequences.

HTPSELEX entries have composite identifiers reflecting the hierarchical relationships between them. The components are alphanumeric strings separated by underscore characters. Experiment entries are identified by a short alphanumeric string, e.g. ‘NF1’ for the CTF/NF1 experiment. They contain information about the protein source, the structure of the partly random input library, the restriction enzymes used in the concatemerization step and the vector used for cloning. In addition, the number of traces, clone sequences and tags obtained from each SELEX cycle can be found there. The information is presented in a format similar to that of an EMBL or Swiss-Prot sequence entry. The clone sequence entries contain either a complete insert sequence or a partial sequence from the 5′ or 3′ end. The latter occurs when the complete sequence of the insert could not be assembled from the sequencing reads. The clone sequence identifiers consist of the experiment ID, the cycle number, the clone number and optionally the sequencing direction (e.g. NF1_3_00001 and NF1_3_0500_F). The feature tables are used to indicate the location of the individual tag sequences. In addition, the annotation part of these entries contains cross-references to corresponding colony names and trace files. The tag sequences are stored in a fasta-formatted sequence file. The header line contains the tag identifier consisting of the experiment ID, cycle number, clone number and tag serial number (e.g. NF1_3_00001_1). The location in the corresponding sequence file and the estimated per-base error rate are also recorded. The tag sequence file is made non-redundant such that tags which were sequenced multiple times in the same SELEX cycle appear only once, with accessory information referring to the highest quality version. The FTP server also provides for each SELEX cycle the trace files as a compressed archive, and a HMM representing the binding specificity of the corresponding transcription factor in two different formats suitable as input to the programs decodeanhmm (developed by Anders Krogh) and MAMOT (developed by Mauro Delorenzi), respectively. Besides HTPSELEX entries, the database also offers entries containing data from conventional SELEX experiments published in journal articles. For convenience, these data are presented in the same format, but many fields remain empty as they are not applicable to this class. The clone insert sequence entries and trace files are missing altogether. To be acceptable for inclusion in this section, a SELEX library must contain at least 50 sequences per cycle. The partly random input library used in our HTPSELEX experiments was also subjected to HTP sequencing. The resulting data are contained in a special experiment entry missing all fields related to the DNA-binding protein. So far, the HTP section of our database contains data for five different transcription factors totaling 38 254 tags. These factors are: CTF/NF1, Lef1, Lef1 in complex with β-catenin, TCF3 and TCF4. There are 26 additional entries covering conventional SELEX experiments and totaling 2278 tags. The current growth rate of the HTP section is ∼3000 tags per month and four new factors per year.
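The composite identifiers and the non-redundancy rule lend themselves to a short illustration. The sketch below splits an identifier into its components and keeps, for each tag sequence within a SELEX cycle, the version with the lowest error rate. Several points are assumptions rather than documented behaviour: the regular expression, the use of ‘R’ for the reverse direction (only the ‘_F’ suffix appears in the examples above), the (identifier, sequence, error rate) input layout, and the choice not to treat reverse-complementary tags as duplicates.

```python
import re
from typing import NamedTuple, Optional


class HtpselexId(NamedTuple):
    experiment: str
    cycle: int
    clone: str                 # kept as text to preserve leading zeros
    direction: Optional[str]   # 'F' or 'R' (assumed), None if absent
    tag: Optional[int]         # tag serial number, None for clone entries


# Assumed pattern: experiment_cycle_clone[_direction][_tag]
_ID_RE = re.compile(
    r"^(?P<experiment>[A-Za-z0-9]+)"
    r"_(?P<cycle>\d+)"
    r"_(?P<clone>\d+)"
    r"(?:_(?P<direction>[FR]))?"
    r"(?:_(?P<tag>\d+))?$"
)


def parse_id(identifier):
    """Split a clone or tag identifier into its components."""
    m = _ID_RE.match(identifier)
    if m is None:
        raise ValueError(f"unrecognized identifier: {identifier!r}")
    tag = m["tag"]
    return HtpselexId(
        experiment=m["experiment"],
        cycle=int(m["cycle"]),
        clone=m["clone"],
        direction=m["direction"],
        tag=int(tag) if tag is not None else None,
    )


def non_redundant(tags):
    """Keep one entry per (experiment, cycle, sequence): the best-quality one.

    `tags` is an iterable of (identifier, sequence, error_rate) tuples.
    """
    best = {}
    for identifier, seq, error_rate in tags:
        parsed = parse_id(identifier)
        key = (parsed.experiment, parsed.cycle, seq.upper())
        if key not in best or error_rate < best[key][2]:
            best[key] = (identifier, seq, error_rate)
    return list(best.values())
```

Parsing `NF1_3_00001_1` gives experiment `NF1`, cycle 3, clone `00001` and tag serial number 1, while `NF1_3_0500_F` gives a clone entry with direction `F` and no tag number. Identical sequences from different cycles are kept as separate entries, in line with the per-cycle rule stated above.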

RELATIONSHIP TO OTHER DATABASES

A part of the data contained in HTPSELEX is being submitted to other databases. The trace files are currently processed by the trace archive at the NCBI (10). It has further been agreed that the tag sequences will be deposited in a special section of the EMBL database in a format similar to MGA (Mass Genome Annotation) sequences (11). The trace identifiers given by the NCBI will be cross-referenced within the clone insert sequence entries. Currently, the experiment entries contain cross-references to Swiss-Prot, EMBL (vector sequence), REBASE (12), SELEX_DB (4) and TRANSFAC (5).

ACCESS

HTPSELEX can be accessed freely via FTP or through various web pages (both addresses are given in the Abstract). The contents of the FTP release have been described in detail above. The website offers the following additional services: hyperlinked documentation entries for individual HTPSELEX and conventional SELEX experiments, quality-controlled download of tag sequences from individual or multiple HTPSELEX libraries with user-defined error probability thresholds, download of tag sequences from conventional SELEX experiments and detailed statistics for HTPSELEX experiments.

PERSPECTIVES

The HTPSELEX database is still at an early stage of its development. Several changes and extensions are anticipated for the near future. A large part of the bulk data will probably soon be available from larger public data repositories at the NCBI or EBI. If this happens, the contents of HTPSELEX may be reduced to those parts not available from other sources, in the extreme case to the experiment entries only. In fact, our initiatives to submit parts of the data stored in HTPSELEX to other databases have already stimulated a broader discussion among experts on how to store such information. Currently, HTPSELEX contains in-house generated data and manually curated entries from journal articles. We are, however, open to accepting direct submissions from authors and are prepared to work out guidelines and automatic submission tools for this purpose in response to demand. We are further considering the inclusion of protein-binding affinity measurements for individual oligonucleotides. Such data constitute a very useful complement to SELEX sequences for building transcription factor binding site models intended to predict the affinity of a given sequence to the protein.

11 in total

1. SELEX_DB: an activated database on selected randomized DNA/RNA sequences addressed to genomic sequence annotation.

Authors: J V Ponomarenko; G V Orlova; M P Ponomarenko; S V Lavryushev; A S Frolov; S V Zybova; N A Kolchanov
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

Review 2. DNA binding sites: representation and discovery.

Authors: G D Stormo
Journal: Bioinformatics Date: 2000-01 Impact factor: 6.937

3. JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Authors: Albin Sandelin; Wynand Alkema; Pär Engström; Wyeth W Wasserman; Boris Lenhard
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase.

Authors: C Tuerk; L Gold
Journal: Science Date: 1990-08-03 Impact factor: 47.728

5. Base-calling of automated sequencer traces using phred. I. Accuracy assessment.

Authors: B Ewing; L Hillier; M C Wendl; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

6. Base-calling of automated sequencer traces using phred. II. Error probabilities.

Authors: B Ewing; P Green
Journal: Genome Res Date: 1998-03 Impact factor: 9.043

7. High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites.

Authors: Emmanuelle Roulet; Stéphane Busso; Anamaria A Camargo; Andrew J G Simpson; Nicolas Mermod; Philipp Bucher
Journal: Nat Biotechnol Date: 2002-07-08 Impact factor: 54.908

8. TRANSFAC: transcriptional regulation, from patterns to profiles.

Authors: V Matys; E Fricke; R Geffers; E Gössling; M Haubrock; R Hehl; K Hornischer; D Karas; A E Kel; O V Kel-Margoulis; D-U Kloos; S Land; B Lewicki-Potapov; H Michael; R Münch; I Reuter; S Rotert; H Saxel; M Scheer; S Thiele; E Wingender
Journal: Nucleic Acids Res Date: 2003-01-01 Impact factor: 16.971

9. DDBJ in collaboration with mass-sequencing teams on annotation.

Authors: Y Tateno; N Saitou; K Okubo; H Sugawara; T Gojobori
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Database resources of the National Center for Biotechnology Information.

Authors: David L Wheeler; Tanya Barrett; Dennis A Benson; Stephen H Bryant; Kathi Canese; Deanna M Church; Michael DiCuccio; Ron Edgar; Scott Federhen; Wolfgang Helmberg; David L Kenton; Oleg Khovayko; David J Lipman; Thomas L Madden; Donna R Maglott; James Ostell; Joan U Pontius; Kim D Pruitt; Gregory D Schuler; Lynn M Schriml; Edwin Sequeira; Steven T Sherry; Karl Sirotkin; Grigory Starchenko; Tugba O Suzek; Roman Tatusov; Tatiana A Tatusova; Lukas Wagner; Eugene Yaschenko
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971
