Nga Thi Thuy Nguyen1, Bruno Contreras-Moreira2,3, Jaime A Castro-Mondragon4,5, Walter Santana-Garcia6, Raul Ossio6, Carla Daniela Robles-Espinoza6,7, Mathieu Bahin1, Samuel Collombet1, Pierre Vincens1, Denis Thieffry1, Jacques van Helden4, Alejandra Medina-Rivera6, Morgane Thomas-Chollier1.
Abstract
RSAT (Regulatory Sequence Analysis Tools) is a suite of modular tools for the detection and the analysis of cis-regulatory elements in genome sequences. Its main applications are (i) motif discovery, including from genome-wide datasets like ChIP-seq/ATAC-seq, (ii) motif scanning, (iii) motif analysis (quality assessment, comparisons and clustering), (iv) analysis of regulatory variations, (v) comparative genomics. Six public servers jointly support 10 000 genomes from all kingdoms. Six novel or refactored programs have been added since the 2015 NAR Web Software Issue, including updated programs to analyse regulatory variants (retrieve-variation-seq, variation-scan, convert-variations), along with tools to extract sequences from a list of coordinates (retrieve-seq-bed), to select motifs from motif collections (retrieve-matrix), and to extract orthologs based on Ensembl Compara (get-orthologs-compara). Three use cases illustrate the integration of new and refactored tools into the suite. This Anniversary update gives a 20-year perspective on the software suite. RSAT is well-documented and available through Web sites, SOAP/WSDL (Simple Object Access Protocol/Web Services Description Language) web services, virtual machines and stand-alone programs at http://www.rsat.eu/.
Year: 2018 PMID: 29722874 PMCID: PMC6030903 DOI: 10.1093/nar/gky317
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Overview of the main applications of RSAT.
Main tools available on RSAT Web servers (2018 update)
| Application | Program name | Input | Output | Description |
|---|---|---|---|---|
| | retrieve-seq | Gene names | Sequences | Given a set of gene names, returns upstream, downstream (relative to ORF start) or unspliced ORF sequences. Segments overlapping an upstream ORF can be excluded or included. |
| | fetch-sequences (from UCSC) | Genomic coordinates | Sequences | From a set of genomic coordinates (bed file), collects the sequences from the UCSC genome browser. |
| | *retrieve-seq-bed | Genomic coordinates | Sequences | From a set of genomic coordinates (bed/gff/vcf file), collects the sequences from installed organisms. Supports repeat masking option. |
| | retrieve-ensembl-seq | Gene names | Sequences | Returns upstream, downstream, intronic, exonic, UTR, mRNA or CDS for a list of genes from EnsEMBL vertebrates. |
| | oligo-analysis | Sequences | Over/under-represented oligonucleotides + PSSM | Analyses oligonucleotide occurrences in a set of sequences, and detects over/under-represented oligonucleotides, using various background models and scoring statistics. |
| | dyad-analysis | Sequences | Over/under-represented dyads + PSSM | Detects over-represented dyads (spaced pairs of oligonucleotides) within a set of sequences. |
| | peak-motifs | Sequences | Discovered motifs + predicted sites | Discovers motifs in ChIP-seq peak sequence sets, and returns detailed information on sequence composition and discovered motifs, with correspondence in databases and predicted binding sites. |
| | crer-scan | Transcription factor binding sites | Cis-regulatory enriched regions (CRER) | Given a set of cis-regulatory elements (predicted sites, annotated sites, ChIP-seq peaks), detects regions presenting a significant enrichment in CRERs. |
| | matrix-scan (-quick) | Sequences + PSSMs | Matching positions in input sequences | Scans sequences with one or several PSSMs to identify instances of the corresponding motifs (putative sites). Supports a variety of background models (Bernoulli, Markov chains of any order). |
| | *retrieve-matrix | Motif collection + motif name/ID | Motif (PSSM) | From a chosen motif collection (supported external database), extracts the PSSMs specified by the provided name or identifier. |
| | matrix-quality | Motif (PSSM) + sequence set(s) | Score distribution statistics + ROC curves | Evaluates the quality of a PSSM by comparing score distributions obtained with this matrix in control sequence sets. |
| | compare-matrices | Two sets of PSSM | Similarity scores + matrix alignments | Compares two collections of PSSMs, and returns various similarity statistics + matrix alignments. |
| | matrix-clustering | One or several sets of PSSM | Clusters of matrices + similarity trees | Clusters similar PSSMs and builds consensus matrices for each cluster. |
| | get-orthologs | Gene names + taxon | List of homologous genes with percentage of identity, alignment length, and e-value | Given a list of genes from a query organism, and a reference taxon, returns the orthologs of the query gene(s) in all the organisms belonging to the reference taxon. |
| | *get-orthologs-compara | Ensembl gene ids | Ensembl gene ids + homology relation information | Given a list of Ensembl stable gene IDs from one or more query organisms, returns orthologs (optionally paralogs and homologs). Relies on primary data from Ensembl Compara. |
| | footprint-discovery | Sequences | Conserved dyads + PSSM | Detects phylogenetic footprints by applying |
| | footprint-scan | Sequences + PSSM | Conserved motifs + binding sites | Scans promoters of orthologous genes with one or several PSSMs to detect enriched motifs and predict phylogenetically conserved target genes. |
| | retrieve-variation-seq | Identifier of variations | Sequences of the variants | Given a set of IDs for genetic variations, returns the corresponding variants and their flanking sequences. The output file can be scanned with the tool |
| | *variation-scan | Variant sequences | Regulatory variants | Scans variant sequences with PSSMs and reports variations that affect the binding score, in order to predict regulatory variants. Faster version with novel support for indels. |
| | *convert-variations | File with genetic variants | File with genetic variants in the specified format | Converts between different file formats that store genetic variation information. The most commonly used formats are VCF and GVF; the varBed format presents several advantages for scanning variations with matrices using |
| Visualisation | *feature-map2 | Coordinates (relative or absolute) | Image depicting features over lines representing sequences | Generates a graphical map of features localized on one or several sequences. Several maps can be drawn in parallel, making it possible to detect conserved positions. Exports in svg, png/jpeg. |
This table presents a selection of key tools equipped with a Web interface. Connect to the RSAT Web site to obtain the complete list of available tools. Novel tools and major updates since the 2015 Web software issue are emphasized by an asterisk (*).
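The kind of PSSM-based scanning that the table describes for matrix-scan can be illustrated with a minimal Python sketch. This is not RSAT code and does not reproduce the scoring of matrix-scan: it assumes a simple log-odds weight score against a Bernoulli (position-independent) background, with a pseudocount distributed according to background frequencies. The function names (`weights_from_counts`, `scan`), the toy count matrix, the sequence and the threshold are all hypothetical and chosen for illustration only.

```python
import math

BASES = "ACGT"


def weights_from_counts(counts, background, pseudo=1.0):
    """Turn per-position count columns into log-odds weights.

    Assumption: the pseudocount is shared among bases in proportion
    to the Bernoulli background frequencies.
    """
    weights = []
    for col in counts:
        total = sum(col[b] for b in BASES) + pseudo
        weights.append({
            b: math.log(((col[b] + pseudo * background[b]) / total) / background[b])
            for b in BASES
        })
    return weights


def scan(sequence, weights, threshold):
    """Return (start, site, score) for each window scoring >= threshold.

    Positions are 0-based; only the given strand is scanned.
    """
    width = len(weights)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if any(ch not in BASES for ch in site):
            continue  # skip windows containing N or other symbols
        score = sum(w[ch] for w, ch in zip(weights, site))
        if score >= threshold:
            hits.append((start, site, score))
    return hits


# Toy 4-column count matrix favouring the word TATA (made-up numbers).
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

weights = weights_from_counts(counts, background)
hits = scan("GGCTATAGGC", weights, threshold=4.0)
for start, site, score in hits:
    print(start, site, round(score, 3))
```

With this toy input, the only window above the threshold is the TATA occurrence starting at 0-based position 3. A Markov-chain background, as mentioned in the table, would replace the fixed `background[b]` term with a probability conditioned on the preceding bases; that extension is left out of the sketch.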