RSAT 2011: regulatory sequence analysis tools.

Morgane Thomas-Chollier, Matthieu Defrance, Alejandra Medina-Rivera, Olivier Sand, Carl Herrmann, Denis Thieffry, Jacques van Helden.

Abstract

RSAT (Regulatory Sequence Analysis Tools) comprises a wide collection of modular tools for the detection of cis-regulatory elements in genome sequences. Thirteen new programs have been added to the 30 described in the 2008 NAR Web Software Issue, including an automated sequence retrieval from EnsEMBL (retrieve-ensembl-seq), two novel motif discovery algorithms (oligo-diff and info-gibbs), a 100-times faster version of matrix-scan enabling the scanning of genome-scale sequence sets, and a series of facilities for random model generation and statistical evaluation (random-genome-fragments, random-motifs, random-sites, implant-sites, sequence-probability, permute-matrix). Our most recent work also focused on motif comparison (compare-matrices) and evaluation of motif quality (matrix-quality) by combining theoretical and empirical measures to assess the predictive capability of position-specific scoring matrices. To process large collections of peak sequences obtained from ChIP-seq or related technologies, RSAT provides a new program (peak-motifs) that combines several efficient motif discovery algorithms to predict transcription factor binding motifs, match them against motif databases and predict their binding sites. Availability (web site, stand-alone programs and SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services): http://rsat.ulb.ac.be/rsat/.

Year: 2011 PMID: 21715389 PMCID: PMC3125777 DOI: 10.1093/nar/gkr377

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

This article presents an update of RSAT (Regulatory Sequence Analysis Tools), a software suite integrating a wide collection of modular tools for the detection of cis-regulatory elements in genome sequences. The web site has been running without interruption since 1998 (1–4). It includes various algorithms for sequence retrieval, motif discovery, sequence scanning with regular expressions or position-specific scoring matrices, random model generation, visualization and conversion utilities (sequences, matrices, background models and feature lists). As of December 2010, the web site supports 1794 genomes (including 1120 bacteria, 88 archaea, 98 fungi, 16 metazoa and 461 phages). The web server offers an intuitive interface, where each program can be accessed either separately, or connected to the other tools via predefined analysis flows. Programs are documented at four levels: (i) manual pages give a systematic description of the functionalities and options; (ii) ‘demo’ buttons propose typical test cases; (iii) tutorial pages provide online practical courses, with a problem-based explanation of the biological questions and the bioinformatics approaches; (iv) a series of protocols have been published for the most popular tools (5–9), to provide step-by-step instructions about option choices and result interpretation. Furthermore, the web site hosts a forum enabling direct interactions between users and developers (announcements, bug reports, wish list, help and discussion). The tools can also be used as stand-alone applications (Unix shell) and invoked remotely as web services (SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) interface), enabling diverse combinations in programmatic workflows. We describe hereafter 13 new programs (Table 1 and Figure 1) added to the 30 tools described in the 2008 NAR Web Software Issue (1).

Table 1.

Short description of the new programs supported on the RSAT web site (since the publication in the 2008 web software issue of this journal)

Task	Program name	Input	Output	Description
Sequences	retrieve-ensembl-seq	Gene names	Sequences	Retrieve upstream, downstream, intronic, exonic, UTR, transcript, mRNA, CDS or gene sequences for a list of genes from the EnsEMBL database. Multi-genome queries are supported, enabling automatic retrieval of sequences for all orthologs of query genes in selected taxa.
Motif discovery	oligo-diff	Two sequence sets	Differentially represented oligonucleotides	Compare oligonucleotide occurrences between two input sequence files, and return oligos that are significantly enriched in one of the files respective to the other one.
	info-gibbs	Sequences	Over-represented motifs (matrices)	An enhanced gibbs sampler, based on a stochastic optimization of the information content of PSSMs.
Pattern matching	matrix-scan-quick	Sequences + motifs (PSSM)	Matching positions in input sequences	Scan a DNA sequence with a profile matrix. This implementation has restricted capabilities with respect to matrix-scan, but runs 100 times faster.
Motif comparisons	compare-matrices	Two sets of PSSM	Similarity scores + matrix alignments	Compare two collections of PSSMs, and return various similarity statistics + matrix alignments (pairwise, one-to-n).
Random model generation	random-genome-fragments	A genome supported in either RSAT or EnsEMBL	Randomly selected genome fragments	Select a set of fragments with random positions in a given genome, and return their coordinates and/or sequences.
	random-motif		Randomly generated motifs (PSSM)	Generate random motifs with a given level of conservation in each column.
	random-sites	Motif (PSSM)	Randomly generated sites (sequences)	Generate random sites given a motif (PSSM).
	implant-sites	Sequences + sites	Sequences with sites implanted	Implant given sites at random positions into given sequences.
	permute-matrix	1 set of PSSM	Randomized PSSMs	Randomize a set of input matrices by permuting their columns. The resulting motifs have the same nucleotide composition and information content as the original ones.
	seq-proba	Sequences + background model	Sequence probability	Calculate the probability of a sequence, given a background model. Bernoulli or Markov models are supported.
Work flows	matrix-quality	Motif (PSSM) + one or several sequence sets	Statistical analysis of score distributions	Evaluate the quality of a PSSM, by comparing score distributions obtained with this matrix in various sequence sets (positive set, negative set, etc.). Computes ROC curves indicating tradeoff between sensitivity and predictive value.
	peak-motifs	Sequences	Discovered motifs + correspondences with motif databases + predicted binding sites + sequence composition	Pipeline for discovering motifs in massive ChIP-seq peak sequences.

Note that additional programs are available as SOAP Web Services and/or with the stand-alone tools. PSSM: position-specific scoring matrix; ROC: receiver operating characteristic.

Figure 1.

Flow chart of the Regulatory Sequence Analysis Tools (RSAT).

NEW PROGRAMS IN RSAT

Retrieving sequences from EnsEMBL on the fly

The tool retrieve-ensembl-seq (10) retrieves promoter (upstream), downstream, intronic, exonic, UTR, transcript, mRNA, coding sequence (CDS) and gene sequences for all the organisms supported in the popular EnsEMBL database (11), and supports automated retrieval of sequences from orthologous or paralogous genes in a given taxon. Users can mask repeats, whenever these are annotated for the organism(s) of interest, as well as the coding part of retrieved sequences. Upstream and downstream sequences of any chosen size can be retrieved relative to gene, transcript or CDS limits. By default, sequences of the chosen type are retrieved for each alternative transcript, but a specific option restricts the retrieval to the non-redundant portions of such a sequence set.

Motif discovery

A strong focus of the RSAT suite is the development of algorithms for ab initio motif discovery in sequence sets. Three of the original algorithms have recently been enhanced in order to support the massive sets of sequences produced by next-generation sequencing: oligo-analysis (4) detects over- or underrepresented words; dyad-analysis (12) detects overrepresented spaced motifs, which are typically bound by dimeric transcription factors; position-analysis (13) detects oligonucleotides with heterogeneous positional distributions in a given sequence set. Since 2008, two novel motif discovery algorithms have been added to the RSAT suite: oligo-diff (Defrance, M., unpublished data) detects oligonucleotides differentially represented between two input sequence sets and estimates their significance with the hypergeometric test; info-gibbs (14) discovers position-specific scoring matrices with high information content using a Gibbs sampling optimization strategy.
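
The differential analysis described for oligo-diff can be illustrated with a minimal Python sketch. It is not the oligo-diff implementation: the counting scheme (overlapping occurrences on a single strand), the function names and the way the hypergeometric parameters are derived from the counts are all assumptions made for illustration.

```python
from collections import Counter
from math import comb

def count_oligos(seqs, k):
    """Count overlapping k-mers on the given strand only (assumed scheme)."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked, n draws)."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def oligo_diff(test, control, k=3):
    """Return (oligo, test count, control count, P-value), most enriched first."""
    ct, cc = count_oligos(test, k), count_oligos(control, k)
    n_test, n_ctrl = sum(ct.values()), sum(cc.values())
    N = n_test + n_ctrl  # all oligo occurrences in both sets
    results = []
    for oligo, occ in ct.items():
        K = occ + cc[oligo]  # occurrences of this oligo in both sets
        results.append((oligo, occ, cc[oligo], hypergeom_sf(occ, N, K, n_test)))
    return sorted(results, key=lambda r: r[3])
```

In practice the P-values would also need a correction for the number of oligonucleotides tested, which this sketch omits.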

Sequence scanning

The new tool matrix-scan-quick implements a subset of matrix-scan functionalities (9). This quick version, currently restricted to the detection of individual binding sites and their score distributions, has been optimized (with a 100-fold gain in execution time) to enable the scanning of genome-scale sequence sets. The program supports Bernoulli and higher order Markov background models, and can report the P-values of predicted sites. The additional functionalities of matrix-scan (enrichment analysis, prediction of cis-regulatory modules) are still supported by the original program, and will be optimized in the near future.
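
As a rough illustration of matrix-based scanning, the sketch below converts a count matrix into log-odds weights under a Bernoulli background and scores every window of a sequence. It is a simplified sketch rather than the matrix-scan-quick algorithm: it scans a single strand, does not compute P-values, and the pseudo-count scheme and function names are assumptions.

```python
from math import log

def log_odds_matrix(counts, background, pseudo=1.0):
    """counts: one dict per column, e.g. {"A": 3, "C": 0, "G": 1, "T": 6}."""
    weights = []
    for col in counts:
        total = sum(col.values()) + pseudo
        weights.append({
            # pseudo-count distributed according to the background (assumed)
            b: log(((col[b] + pseudo * background[b]) / total) / background[b])
            for b in "ACGT"
        })
    return weights

def scan(seq, weights):
    """Score each window of the matrix width; returns (position, site, score)."""
    w = len(weights)
    hits = []
    for i in range(len(seq) - w + 1):
        site = seq[i:i + w]
        if any(b not in "ACGT" for b in site):
            continue  # skip windows containing ambiguous residues
        score = sum(weights[j][b] for j, b in enumerate(site))
        hits.append((i, site, score))
    return hits
```

A higher-order Markov background would replace the per-residue background term with a probability conditioned on the preceding residues.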

Assessing matrix quality

A common issue when working with position-specific matrices is to assess their quality, i.e. whether a matrix is able to separate correctly the true signal from the background. We have developed a workflow called matrix-quality (15) that computes theoretical and empirical score distributions to assess the reliability of position-specific matrices for predicting transcription factor binding sites. The underlying principle is to compare the score distributions obtained from various datasets in order to estimate their respective enrichment in binding sites, over all possible score threshold values. The theoretical distribution first provides an estimate of the false prediction rate. Empirical distributions then measure the enrichment of binding sites in various collections of sequences: known binding sites (positive control), all upstream regions of a genome, clusters of co-expressed genes, ChIP-seq peaks. As negative controls, empirical distributions are computed in the same sequence collections with column-permuted matrices. The comparison of those distributions permits the definition of score thresholds that optimize the tradeoff between sensitivity and positive predictive value. Typical applications of matrix-quality are (i) choosing the most accurate predictor among alternative matrices for the same transcription factor (e.g. coming from different databases, or built with different sets of sites); (ii) estimating the enrichment of ChIP-seq peaks for reference motifs (e.g. the pulled-down transcription factor) or for motifs discovered in the peak sequences themselves.
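
The tradeoff between sensitivity and positive predictive value can be made concrete with a small sketch. It assumes two plain lists of site scores, one from a positive set and one from a negative control, and is only meant to illustrate the principle; matrix-quality itself works on full theoretical and empirical score distributions, and the function name is hypothetical.

```python
def threshold_tradeoff(pos_scores, neg_scores):
    """For each observed score used as threshold, return
    (threshold, sensitivity, positive predictive value)."""
    rows = []
    for t in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= t for s in pos_scores)  # positives retained
        fp = sum(s >= t for s in neg_scores)  # negatives wrongly retained
        sensitivity = tp / len(pos_scores)
        # tp + fp >= 1 because t is itself one of the observed scores
        ppv = tp / (tp + fp)
        rows.append((t, sensitivity, ppv))
    return rows
```

Raising the threshold typically lowers the sensitivity while increasing the positive predictive value, so a threshold can be chosen where both remain acceptable.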

Motif comparison

The tool compare-matrices enables extensive comparisons between one or two collections of position-specific scoring matrices. A typical use is to compare a set of discovered motifs with databases of known transcription factor binding motifs. The web site includes collections from JASPAR (16), RegulonDB (17), UniPROBE (18) and DMMPMM (19). Users can also upload custom motifs, enabling the use of in-house collections or license-protected databases such as TRANSFAC (20). Another use of the custom motifs option is to compare motifs predicted by two different motif discovery algorithms. The tool integrates a wide variety of similarity/dissimilarity scoring metrics featured by other matrix comparison tools such as STAMP (21) or TOMTOM (22): sum of squared distances, Euclidean distance/similarity, Sandelin–Wasserman similarity (23), Kullback–Leibler distance as defined in (24), covariance, Pearson’s correlation. The program also computes length-normalized metrics, in order to avoid trivial alignments covering a small fraction of the motifs (e.g. the leftmost column of a query matrix aligned with the rightmost column of a reference matrix). Instead of having to choose between those metrics, the user can select several of them (or all) in order to compare their respective scores and compute a mean rank. Multiple thresholds can be specified, for instance a minimum of five aligned columns, a minimal correlation of 0.7 and a minimal normalized correlation of 0.4. Results are exported in various formats: tab-delimited file (one row per matrix comparison), motif similarity graph, HTML reports with pairwise or one-to-n aligned logos (Figure 2).
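
The role of width normalization can be sketched as follows. The code slides one frequency matrix along another, computes Pearson’s correlation over the aligned columns, and multiplies it by the fraction of the total alignment width covered by those columns. This normalization formula is an assumption chosen for illustration (the text only states that the metrics are length-normalized), reverse-complement alignments are ignored, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation of two equal-length lists (0.0 if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def best_alignment(m1, m2, min_cols=1):
    """m1, m2: lists of columns, each a list of four frequencies (A, C, G, T).
    Returns (offset, aligned width, cor, normalized cor) with the highest
    normalized correlation, or None if no alignment has min_cols columns."""
    w1, w2 = len(m1), len(m2)
    results = []
    for offset in range(-(w2 - 1), w1):  # position of m2's first column in m1
        start, end = max(0, offset), min(w1, offset + w2)
        w = end - start
        if w < min_cols:
            continue
        x = [v for col in m1[start:end] for v in col]
        y = [v for col in m2[start - offset:end - offset] for v in col]
        cor = pearson(x, y)
        ncor = cor * w / (w1 + w2 - w)  # assumed normalization
        results.append((offset, w, cor, ncor))
    return max(results, key=lambda r: r[3]) if results else None
```

With this normalization, a perfect correlation over a single pair of terminal columns scores much lower than the same correlation over most of both motifs.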

Figure 2.

Example of result from compare-matrices. Only the four best matches are displayed in the figure; the original Web page displayed five more matches. The second column displays a one-to-n alignment of matrix-logos. The next columns display multiple matching statistics, the corresponding ranks, and the mean rank. cor: Pearson’s correlation; Ncor: alignment width-normalized correlation; dEucl: Euclidean distance; NSW: width-normalized Sandelin–Wasserman similarity; rcor, rNcor, rdEucl, rNSW: ranks on the corresponding metrics; rank_mean: mean of these ranks; match_rank: rank of the alignments sorted by rank_mean.

Generating random data sets

Random data sets are highly useful to control the reliability of predictive programs. Since the early versions of RSAT, the programs random-seq and random-genes were used to build negative control sets, i.e. data sets supposed to contain no significant site (pattern matching) or motif (motif discovery). Several new tools have been added to these two programs in order to support other control types. We describe hereafter the ways to combine the previous and new tools in order to generate negative and positive control sets. An essential parameter for building random sets is the choice of a suitable background model. The web site supports Markov models of any order between 0 and 7, calibrated with upstream non-coding sequences of all genes for each supported organism. The new program sequence-probability computes the probability of input sequences according to any of the supported background models, or to user-specified models. The program random-seq generates random sequences according to Markov chains of any order. Such sequences are typically used to check the false positive rate of pattern matching algorithms (matrix-scan, matrix-scan-quick), and assess their capability to handle dependencies between adjacent nucleotides (higher order Markov models). The program random-genes enables another type of negative control, by selecting random gene sets from which natural genomic sequences (e.g. upstream non-coding) can be retrieved. Each of those genes may be regulated by some factors, but a random selection of sufficient size is unlikely to contain a significant proportion of co-regulated genes. Random gene selections thus provide a realistic framework for testing empirically the false positive rate of motif discovery algorithms. The new program random-genome-fragments selects sequences at random positions from a given genome, which can be used as negative controls for genome-wide location approaches such as ChIP-on-chip and ChIP-seq.
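
The probability of a sequence under a Markov background model can be sketched in a few lines. The example below estimates a model of a chosen order from training sequences and returns the log-probability of a query sequence; order 0 corresponds to a Bernoulli model. The pseudo-count handling, the treatment of the initial prefix and the function names are assumptions, and this is not the code of the RSAT program.

```python
from collections import Counter
from math import log

def train_markov(seqs, order, pseudo=1.0):
    """Count prefixes of length `order` and the residue that follows each."""
    prefix_counts, trans_counts = Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - order):
            prefix = s[i:i + order]
            prefix_counts[prefix] += 1
            trans_counts[(prefix, s[i + order])] += 1
    return prefix_counts, trans_counts, order, pseudo

def log_prob(seq, model):
    """Natural-log probability of seq (assumes len(seq) >= order, ACGT only)."""
    prefix_counts, trans_counts, order, pseudo = model
    total = sum(prefix_counts.values())
    # probability of the first `order` residues, from prefix frequencies
    lp = log((prefix_counts[seq[:order]] + pseudo) / (total + pseudo * 4 ** order))
    for i in range(order, len(seq)):
        prefix = seq[i - order:i]
        lp += log((trans_counts[(prefix, seq[i])] + pseudo)
                  / (prefix_counts[prefix] + 4 * pseudo))
    return lp
```

Working with log-probabilities avoids numerical underflow for long sequences; the probability itself is obtained by exponentiation.
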
In addition to these negative controls, positive control sets can be built by inserting (artificial or natural) transcription factor binding sites at random positions in (artificial or natural) sequences: random-motifs generates random position-specific scoring matrices; random-sites generates binding sites on the basis of a matrix model; implant-sites inserts (real or fake) binding site sequences at random positions in (biological or randomly generated) sequences. The program permute-matrix performs random permutations among the columns of one or several input matrices. This method generates ‘realistic’ random models of motifs conserving the nucleotide composition, intra-column variability and information content of the original motifs.
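
The following sketch illustrates three of the operations described above: permuting matrix columns, sampling a site from a matrix, and implanting a site at a random position. It also checks the stated property that column permutation preserves information content. All function names are hypothetical, the information content is computed against an equiprobable background, and the implanted site overwrites existing residues; these choices are assumptions and may differ from the RSAT programs.

```python
import random
from math import log2

def information_content(matrix, background=0.25):
    """Sum over columns of sum_b p_b * log2(p_b / background)."""
    return sum(p * log2(p / background) for col in matrix for p in col if p > 0)

def permute_columns(matrix, rng):
    """Return a copy of the matrix with its columns in random order."""
    permuted = list(matrix)
    rng.shuffle(permuted)
    return permuted

def random_site(matrix, rng):
    """Draw one residue per column according to the column frequencies."""
    return "".join(rng.choices("ACGT", weights=col)[0] for col in matrix)

def implant_site(seq, site, rng):
    """Overwrite a random window of seq with site; returns (sequence, position)."""
    pos = rng.randrange(len(seq) - len(site) + 1)
    return seq[:pos] + site + seq[pos + len(site):], pos

rng = random.Random(0)  # fixed seed so that the example is reproducible
motif = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25], [0.1, 0.1, 0.1, 0.7]]
shuffled = permute_columns(motif, rng)
site = random_site(motif, rng)
control, position = implant_site("A" * 20, site, rng)
```

Because permutation only reorders the columns, per-column variability and overall nucleotide composition are unchanged, while the motif itself is destroyed.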

A specialized workflow for analyzing motifs in ChIP-seq peak sets

peak-motifs combines several efficient motif discovery algorithms to extract transcription factor binding motifs and sites from large collections of peak sequences obtained from ChIP-seq or related technologies. Taking a full set of peak sequences as input (without size restriction), peak-motifs discovers exceptional motifs, compares them with motif databases, predicts binding site positions, and enables visualization in genome browsers (Thomas-Chollier, M., et al., submitted). In all studied cases, peak-motifs swiftly identified multiple relevant motifs. Like its constitutive modules, the whole workflow can be used as a stand-alone application or invoked as SOAP/WSDL web services.

CONCLUSIONS

RSAT is one of the most comprehensive academic software suites for the analysis of cis-regulatory sequences to date. It integrates diverse, well-documented motif discovery and pattern matching modules and greatly facilitates their application to sequence sets belonging to numerous genomes, while offering particularly sophisticated means to statistically evaluate the returned motifs or sites, as well as to compare them with current knowledge (annotated genomes and motif collections). The modular conception of RSAT enables flexible and seamless module chaining to answer a variety of biological questions, problems and data types, and to address challenges coming from novel technologies. This point is particularly well illustrated by peak-motifs, which combines some of the very early tools of the suite (4,12,13) with some of the most recent ones (e.g. compare-matrices) to perform a comprehensive analysis of the huge sequence sets resulting from ChIP-seq experiments.

AVAILABILITY

The main server is located in Belgium (http://rsat.bigre.ulb.ac.be/rsat/). Mirror servers are available in Mexico (http://embnet.ccg.unam.mx/rsa-tools/), Sweden (http://liv.bmc.uu.se/rsa-tools/), France (http://tagc.univ-mrs.fr/rsa-tools/; http://rsat01.biologie.ens.fr/rsa-tools/), and South Africa (http://anjie.bi.up.ac.za/rsa-tools/). These RSAT Web servers can be freely accessed by all users without login requirement.

FUNDING

M.T-C. is supported by the Alexander von Humboldt foundation. A.M-R. was supported during her Ph.D. studies (Programa de Doctorado en Ciencias Biomédicas, Universidad Nacional Autónoma de México) by a fellowship from the Consejo Nacional de Ciencia y Tecnología (Mexico). The BiGRe laboratory is funded by the European Commission through the FP7 MICROME Collaborative Project (thematic area ‘BIO-INFORMATICS-Microbial genomics and bio-informatics’, contract number 222886-2), while the collaboration between BiGRe and TAGC laboratory is supported by the Belgian Program on Interuniversity Attraction Poles, initiated by the Belgian Federal Science Policy Office, project P6/25 (BioMaGNet). The collaboration between BiGRe and ENS has been stimulated by a 2-month invitation of J.v.H. as visiting Professor at ENS. Funding for open access charge: Publication costs were covered by the European Commission through the FP7 MICROME Collaborative Project (thematic area ‘BIO-INFORMATICS-Microbial genomics and bio-informatics’, contract number 222886-2). Conflict of interest statement. None declared.

24 in total

1. Using RSAT oligo-analysis and dyad-analysis tools to discover regulatory signals in nucleic sequences.

Authors: Matthieu Defrance; Rekin's Janky; Olivier Sand; Jacques van Helden
Journal: Nat Protoc Date: 2008 Impact factor: 13.491

2. Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules.

Authors: Jean-Valery Turatsinze; Morgane Thomas-Chollier; Matthieu Defrance; Jacques van Helden
Journal: Nat Protoc Date: 2008 Impact factor: 13.491

3. Retrieve-ensembl-seq: user-friendly and large-scale retrieval of single or multi-genome sequences from Ensembl.

Authors: Olivier Sand; Morgane Thomas-Chollier; Jacques van Helden
Journal: Bioinformatics Date: 2009-08-31 Impact factor: 6.937

4. info-gibbs: a motif discovery algorithm that directly optimizes information content during sampling.

Authors: Matthieu Defrance; Jacques van Helden
Journal: Bioinformatics Date: 2009-08-18 Impact factor: 6.937

5. Motif discovery and motif finding from genome-mapped DNase footprint data.

Authors: Ivan V Kulakovskiy; Alexander V Favorov; Vsevolod J Makeev
Journal: Bioinformatics Date: 2009-07-15 Impact factor: 6.937

6. Ensembl 2011.

Authors: Paul Flicek; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Yuan Chen; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Leo Gordon; Maurice Hendrix; Thibaut Hourlier; Nathan Johnson; Andreas Kähäri; Damian Keefe; Stephen Keenan; Rhoda Kinsella; Felix Kokocinski; Eugene Kulesha; Pontus Larsson; Ian Longden; William McLaren; Bert Overduin; Bethan Pritchard; Harpreet Singh Riat; Daniel Rios; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sobral; Giulietta Spudich; Y Amy Tang; Stephen Trevanion; Jana Vandrovcova; Albert J Vilella; Simon White; Steven P Wilder; Amonida Zadissa; Jorge Zamora; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Richard Durbin; Xosé M Fernández-Suarez; Javier Herrero; Tim J P Hubbard; Anne Parker; Glenn Proctor; Jan Vogel; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971

7. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units).

Authors: Socorro Gama-Castro; Heladia Salgado; Martin Peralta-Gil; Alberto Santos-Zavaleta; Luis Muñiz-Rascado; Hilda Solano-Lira; Verónica Jimenez-Jacinto; Verena Weiss; Jair S García-Sotelo; Alejandra López-Fuentes; Liliana Porrón-Sotelo; Shirley Alquicira-Hernández; Alejandra Medina-Rivera; Irma Martínez-Flores; Kevin Alquicira-Hernández; Ruth Martínez-Adame; César Bonavides-Martínez; Juan Miranda-Ríos; Araceli M Huerta; Alfredo Mendoza-Vargas; Leonardo Collado-Torres; Blanca Taboada; Leticia Vega-Alvarado; Maricela Olvera; Leticia Olvera; Ricardo Grande; Enrique Morett; Julio Collado-Vides
Journal: Nucleic Acids Res Date: 2010-11-04 Impact factor: 16.971

8. Theoretical and empirical quality assessment of transcription factor-binding motifs.

Authors: Alejandra Medina-Rivera; Cei Abreu-Goodger; Morgane Thomas-Chollier; Heladia Salgado; Julio Collado-Vides; Jacques van Helden
Journal: Nucleic Acids Res Date: 2010-10-04 Impact factor: 16.971

9. UniPROBE, update 2011: expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions.

Authors: Kimberly Robasky; Martha L Bulyk
Journal: Nucleic Acids Res Date: 2010-10-30 Impact factor: 16.971

10. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles.

Authors: Elodie Portales-Casamar; Supat Thongjuea; Andrew T Kwon; David Arenillas; Xiaobei Zhao; Eivind Valen; Dimas Yusuf; Boris Lenhard; Wyeth W Wasserman; Albin Sandelin
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971
