Morgane Thomas-Chollier, Matthieu Defrance, Alejandra Medina-Rivera, Olivier Sand, Carl Herrmann, Denis Thieffry, Jacques van Helden.
Abstract
RSAT (Regulatory Sequence Analysis Tools) comprises a wide collection of modular tools for the detection of cis-regulatory elements in genome sequences. Thirteen new programs have been added to the 30 described in the 2008 NAR Web Software Issue, including an automated sequence retrieval from EnsEMBL (retrieve-ensembl-seq), two novel motif discovery algorithms (oligo-diff and info-gibbs), a 100-times faster version of matrix-scan enabling the scanning of genome-scale sequence sets, and a series of facilities for random model generation and statistical evaluation (random-genome-fragments, random-motifs, random-sites, implant-sites, sequence-probability, permute-matrix). Our most recent work also focused on motif comparison (compare-matrices) and evaluation of motif quality (matrix-quality) by combining theoretical and empirical measures to assess the predictive capability of position-specific scoring matrices. To process large collections of peak sequences obtained from ChIP-seq or related technologies, RSAT provides a new program (peak-motifs) that combines several efficient motif discovery algorithms to predict transcription factor binding motifs, match them against motif databases and predict their binding sites. Availability (web site, stand-alone programs and SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services): http://rsat.ulb.ac.be/rsat/.
Year: 2011 PMID: 21715389 PMCID: PMC3125777 DOI: 10.1093/nar/gkr377
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Short description of the new programs supported on the RSAT web site since the publication of the 2008 Web Software Issue of this journal.
| Task | Program name | Input | Output | Description |
|---|---|---|---|---|
| Sequences | retrieve-ensembl-seq | Gene names | Sequences | Retrieve upstream, downstream, intronic, exonic, UTR, transcript, mRNA, CDS or gene sequences for a list of genes from the EnsEMBL database. Multi-genome queries are supported, enabling automatic retrieval of sequences for all orthologs of query genes in selected taxa. |
| Motif discovery | oligo-diff | Two sequence sets | Differentially represented oligonucleotides | Compare oligonucleotide occurrences between two input sequence files, and return oligos that are significantly enriched in one of the files relative to the other. |
| Motif discovery | info-gibbs | Sequences | Over-represented motifs (matrices) | An enhanced Gibbs sampler, based on a stochastic optimization of the information content of PSSMs. |
| Pattern matching | matrix-scan (fast version) | Sequences + motifs (PSSM) | Matching positions in input sequences | Scan a DNA sequence with a profile matrix. This implementation has restricted capabilities with respect to matrix-scan, but runs 100 times faster. |
| Motif comparisons | compare-matrices | Two sets of PSSM | Similarity scores + matrix alignments | Compare two collections of PSSMs, and return various similarity statistics + matrix alignments (pairwise, one-to- |
| Random model generation | random-genome-fragments | A genome supported in either RSAT or EnsEMBL | Randomly selected genome fragments | Select a set of fragments with random positions in a given genome, and return their coordinates and/or sequences. |
| Random model generation | random-motifs | | Randomly generated motifs (PSSM) | Generate random motifs with a given level of conservation in each column. |
| Random model generation | random-sites | Motif (PSSM) | Randomly generated sites (sequences) | Generate random sites given a motif (PSSM). |
| Random model generation | implant-sites | Sequences + sites | Sequences with sites implanted | Implant given sites at random positions into given sequences. |
| Random model generation | permute-matrix | One set of PSSM | Randomized PSSMs | Randomize a set of input matrices by permuting their columns. The resulting motifs have the same nucleotide composition and information content as the original ones. |
| Random model generation | sequence-probability | Sequences + background model | Sequence probability | Calculate the probability of a sequence, given a background model. Bernoulli or Markov models are supported. |
| Work flows | matrix-quality | Motif (PSSM) + one or several sequence sets | Statistical analysis of score distributions | Evaluate the quality of a PSSM, by comparing score distributions obtained with this matrix in various sequence sets (positive set, negative set, etc.). Computes ROC curves indicating tradeoff between sensitivity and predictive value. |
| Work flows | peak-motifs | Sequences | Discovered motifs + correspondences with motif databases + predicted binding sites + sequence composition | Pipeline for discovering motifs in massive ChIP-seq peak sequences. |
Note that additional programs are available as SOAP Web Services and/or with the stand-alone tools. PSSM: position-specific scoring matrix; ROC: receiver operating characteristic.
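The permute-matrix entry states that permuting columns preserves nucleotide composition and information content. The following Python sketch illustrates why, under stated assumptions: it is not the RSAT implementation, the count matrix is hypothetical, and information content is computed here as the per-column relative entropy against a uniform prior (in bits, without pseudo-counts), which may differ from the definition RSAT uses.

```python
import math
import random

BASES = "ACGT"  # row order assumed for every column below


def information_content(columns, prior=0.25):
    """Sum over columns of sum_b f_b * log2(f_b / prior); zero counts contribute 0.

    Assumption: uniform prior and no pseudo-counts (illustrative definition only).
    """
    total = 0.0
    for col in columns:
        n = sum(col)
        for count in col:
            if count > 0:
                f = count / n
                total += f * math.log2(f / prior)
    return total


def composition(columns):
    """Total count of each residue (A, C, G, T) summed over all columns."""
    return [sum(col[i] for col in columns) for i in range(len(BASES))]


def permute_columns(columns, rng):
    """Return a copy of the matrix with its columns in shuffled order."""
    shuffled = list(columns)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical 4-column count matrix; each column lists counts for A, C, G, T.
motif = [
    [8, 1, 0, 1],
    [0, 0, 10, 0],
    [2, 2, 3, 3],
    [0, 9, 1, 0],
]

rng = random.Random(0)  # fixed seed so the run is reproducible
permuted = permute_columns(motif, rng)

print("composition preserved:", composition(permuted) == composition(motif))
print("information content preserved:",
      math.isclose(information_content(permuted), information_content(motif)))
```

Both quantities are sums over columns, so reordering the columns cannot change them; only the order in which positions appear differs, which is what makes permuted matrices usable as negative controls.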
Figure 1. Flow chart of the Regulatory Sequence Analysis Tools (RSAT).
Figure 2. Example of result from compare-matrices. Only the four best matches are displayed in the figure; the original Web page displayed five more matches. The second column displays a one-to-n alignment of matrix-logos. The next columns display multiple matching statistics, the corresponding ranks, and the mean rank. cor: Pearson’s correlation; Ncor: alignment width-normalized correlation; dEucl: Euclidean distance; NSW: width-normalized Sandelin–Wasserman similarity; rcor, rNcor, rdEucl, rNSW: ranks on the corresponding metrics; rank_mean: mean of these ranks; match_rank: rank of the alignments sorted by rank_mean.
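The rank aggregation described in this caption can be sketched in Python as follows. This is a simplified illustration, not the compare-matrices code: it uses only two of the four metrics (cor and dEucl), it assumes the matrices have equal width and are already aligned, and the query and candidate frequency matrices are hypothetical.

```python
import math


def flatten(matrix):
    """Concatenate all columns of a frequency matrix into one vector."""
    return [v for col in matrix for v in col]


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def ranks(values, larger_is_better):
    """Rank 1 is the best value; ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=larger_is_better)
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


# Hypothetical aligned frequency matrices (columns list A, C, G, T).
query = [[0.8, 0.1, 0.05, 0.05], [0.05, 0.05, 0.85, 0.05], [0.1, 0.1, 0.1, 0.7]]
candidates = {
    "far": [[0.25, 0.25, 0.25, 0.25], [0.1, 0.6, 0.2, 0.1], [0.4, 0.3, 0.2, 0.1]],
    "near": [[0.7, 0.1, 0.1, 0.1], [0.1, 0.05, 0.8, 0.05], [0.1, 0.1, 0.2, 0.6]],
    "mid": [[0.6, 0.2, 0.1, 0.1], [0.2, 0.1, 0.5, 0.2], [0.2, 0.2, 0.2, 0.4]],
}

names = list(candidates)
q = flatten(query)
cor = [pearson(q, flatten(candidates[n])) for n in names]
d_eucl = [euclidean(q, flatten(candidates[n])) for n in names]

rcor = ranks(cor, larger_is_better=True)         # high correlation is better
rd_eucl = ranks(d_eucl, larger_is_better=False)  # small distance is better
rank_mean = [(a + b) / 2 for a, b in zip(rcor, rd_eucl)]

# match_rank: position of each candidate after sorting by rank_mean.
results = sorted(
    ({"name": n, "cor": c, "dEucl": d, "rank_mean": m}
     for n, c, d, m in zip(names, cor, d_eucl, rank_mean)),
    key=lambda r: r["rank_mean"],
)
for match_rank, r in enumerate(results, start=1):
    print(match_rank, r["name"], round(r["cor"], 3), round(r["dEucl"], 3), r["rank_mean"])
```

Averaging ranks rather than raw scores lets metrics on different scales (a correlation bounded by 1, an unbounded distance) contribute equally to the final ordering.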