Walter Santana-Garcia1,2, Maria Rocha-Acevedo2, Lucia Ramirez-Navarro2, Yvon Mbouamboua3,4, Denis Thieffry1, Morgane Thomas-Chollier1, Bruno Contreras-Moreira5,6, Jacques van Helden4,7, Alejandra Medina-Rivera2.
Abstract
Gene regulatory regions contain short and degenerate DNA binding sites recognized by transcription factors (TFBS). When TFBS harbor SNPs, the DNA binding site may be affected, thereby altering the transcriptional regulation of the target genes. Such regulatory SNPs have been implicated as causal variants in Genome-Wide Association Studies (GWAS). In this study, we describe improved versions of the Variation-tools programs, designed to predict regulatory variants, and present four case studies to illustrate their usage and applications. In brief, Variation-tools facilitate i) obtaining variation information, ii) interconversion of variation file formats, iii) retrieval of sequences surrounding variants, and iv) calculating the change in predicted transcription factor affinity scores between alleles, using motif scanning approaches. Notably, the tools support the analysis of haplotypes. The tools are included within the well-maintained suite Regulatory Sequence Analysis Tools (RSAT, http://rsat.eu), and are accessible through a web interface that currently enables analysis of five metazoan and ten plant genomes. Variation-tools can also be used on the command line with any locally installed Ensembl genome. Users can input personal collections of variants and motifs, providing flexibility in the analysis.
Keywords: Binding motifs; CEU, Northern Europeans from Utah; CRM, Cis-Regulatory Module; GWAS, Genome Wide Association Studies; LD, Linkage Disequilibrium; MPRA, Massively Parallel Reporter Assays; PSSM, Position Specific Scoring Matrix; Position specific scoring matrix; ROC, Receiver Operating Characteristic; RSAT, Regulatory Sequence Analysis Tools; Regulatory variants; SNP, Single Nucleotide Polymorphism; SNPs; SOIs, SNPs of Interest; TF, Transcription Factor; TFBS, Transcription Factor Binding Site; Transcription factors; eQTL, Expression Quantitative Trait Loci; rsID, Reference SNP Identifier
Year: 2019 PMID: 31871587 PMCID: PMC6906655 DOI: 10.1016/j.csbj.2019.09.009
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Tools similar to variation-scan with available implementation. PM stands for Pattern Matching, ML stands for Machine Learning.
| Name | PMID | Source | Approach | Organism | Input | Output | Matrix flexibility | Type | Last update |
|---|---|---|---|---|---|---|---|---|---|
| deltaSVM | 26075791 | | Gapped k-mer SVM classifier. | Any organism | DNaseI-seq data; putative regulatory regions as positive training set and randomized sequences as negative training set. | deltaSVM, the predicted impact of a variant on chromatin accessibility, measured by adding up the contributions to chromatin accessibility of all 10-mers that contain the SNP. | It can only be trained for one TF at a time. | ML, non-static. | Last update Sept 2015. |
| DeepSea | 26301843 | | Deep convolutional network. | Human | SNPs in VCF format. | Chromatin feature probabilities for reference and alternative alleles, chromatin feature probability log fold changes for each variant, chromatin feature probability differences for each variant, e-values for chromatin feature effects, functional significance score for each variant. There are 919 chromatin features evaluated. | It contains 690 TF binding profiles for 160 different TFs, but does not support the addition of new matrices. | ML, non-static. | Last update May 2017. |
| atSNP | 26092860 | | Importance sampling algorithm for p-value calculation, first-order Markov model to generate random background sequences. | Any organism whose genome is included in the Bioconductor BSgenome package. | SNP list, motif file. | p-value for binding affinity with alternative and reference allele, p-value for binding affinity change based on log-likelihood ratio and log-rank ratio. It also provides composite logo plots for directly visualizing the SNP effects on motif matches. | It accepts several matrices, and several different formats. It includes a motif library of 2,065 PSSMs from ENCODE and JASPAR, but also allows user-defined motif libraries. | PM, non-static. | Last update Nov 2018. |
| BayesPI-BAR | 26202972 | | Biophysical modeling of protein-DNA interaction, estimation of TF chemical potential (through a Bayesian nonlinear regression model) and differential binding affinity. | Any organism | ChIP-seq experiment for TFs to be tested, DNA sequences for selected SNPs, PSSMs for selected TFs. | Given a SNP and a PSSM list, it produces two lists sorted by significance: one composed of binding motifs disrupted by the SNP, and one of sites whose affinity to the TF is increased by the SNP. | Can use several PSSMs simultaneously. | PM, biophysical modeling, non-static. | No updates listed, software created July 2015. |
| GWAS4D | 29771388 | | Variant prioritization method, followed by an integrative analysis of genome-wide association. | Human | Accepts VCF-like, coordinate only, dbSNP ID and PLINK-like formats. | Regulatory variant prioritization table: includes the motif most likely affected by the alternative allele. | The model includes motifs of 1,480 transcriptional regulators from 13 different resources. It is not possible to upload user-specified matrices. | PM, static. | Last update Sept 2018. |
| sTRAP | 20127973 | | Prediction of local binding affinity followed by a normalization of binding affinities to determine the difference between reference allele and SNP. | Organisms available in TRANSFAC. | Accepts only two sequences in FASTA format. | List of TFs ranked according to changes induced by the SNP. | There is no option for user-specified matrices; matrices from TRANSFAC versions can be selected. | PM, non-static. | No updates listed, software created in 2011. |
| SNP2TFBS | 27899579 | | Estimation based on PSSM model. | Human. | When working with the code, the input required is the reference genome, a SNP catalogue and a PSSM collection. | List of affected TFBSs, sorted by the magnitude of the effects. | On the web interface, only matrices from JASPAR can be used. Nonetheless, it is possible to download the code used to generate the database and use a different input. | PM, static. | Last update July 2017. |
| atSNP Search | 30534948 | | Used the atSNP algorithm with dbSNP build 144 for human genome assembly 38 against JASPAR and ENCODE motifs to create a repository of all the SNP-motif combinations resulting from those resources. | Human. | It can receive a set of rsIDs, a rsID and a window size around the SOI, genomic coordinates, a gene symbol and a window size around the gene of interest, or a TF name. | Table including p-values for motif matches for both reference and alternate alleles, as well as the change in the motif matching and the direction of said change. Output includes logo plots, displaying the sequence logos aligned to best motif matches with reference and SNP alleles. | Only JASPAR or ENCODE matrices can be selected, and it is possible to select only one transcription factor at a time. | PM, static. | Last update Jan 2018. |
| HaploReg | 22064851, | | It contains data from multiple genome annotation resources. PSSMs are scored against reference and alternative alleles, and the change in log-odds is calculated. | Human | Users can provide a list of rsIDs or chromosome regions. Users can also select GWAS studies from the NHGRI catalog. | Provides data on allelic frequencies, conservation, chromatin states, and nearby genes. For each of the regulatory motifs altered by the SNP, it provides the change in log-odds and a logo. | HaploReg contains a library created from literature sources, TRANSFAC, JASPAR and PBM experiments. There is no option for user-specified matrices. | PM, static. | Last update November 2015. |
| RegulomeDB | 22955989 | | RegulomeDB uses information from several datasets, as well as manual curation and a heuristic method to distinguish between functional and non-functional variants. | Human. | Users can provide a list of dbSNP IDs, hg19 coordinates in BED, VCF or GFF3 format, or hg19 chromosomal regions in the same formats. | Table sorted by likely functionality, containing variant coordinates, score assigned by the algorithm, and evidence of function including protein binding, motifs, chromatin structure, eQTLs and histone modifications. | RegulomeDB includes all PSSMs from TRANSFAC, JASPAR CORE, and UniProbe. There is no option for user-specified matrices. | PM, static. | No updates listed, software created in Sept 2012. |
| motifbreakR | 26272984 | | It offers three scoring algorithms: the standard sum of log probabilities, a weighted sum, and an information content method. | Organisms included in BSgenome. | SNPs can be imported from an R package or provided to the algorithm in BED or VCF format. PSSMs can be selected from the MotifDb package or be user-specified. | Table containing statistics describing the percent of maximum score for a matrix and matrix values for both alleles, as well as the strand. It also reports whether the TFBS is disrupted strongly or weakly. | PSSMs can be imported from the MotifDb package or be user-specified. More than one matrix can be used at a time. | PM, non-static. | Last update Jul 2018. |
| variation-scan | | | Estimation based on PSSM model. | Web interface: installed Ensembl organisms. | A collection of PSSMs and a set of variants in varSeq format. This format can be obtained using retrieve-variation-seq. | A table with one line per pair of alleles per motif (if there are more than two alleles, there is one line per possible pair) reporting the position, weight and p-value of each allele, the weight difference and the p-value ratio. | Users can select from the collections available in RSAT (JASPAR, HOCOMOCO, CisBP), but they can also use personal collections. | PM, non-static. | April 2019. |
Fig. 1. Schematic representation of Variation-tools: This set of tools, included in the Regulatory Sequence Analysis Tools (RSAT), focuses on assessing the impact of different allelic variants on transcription factor binding sites. A) convert-variations allows users to input their own variants and convert them to other formats (VCF, GVF and varBed; the latter is the format used in the next step), while variation-info retrieves the annotated information of Ensembl variants installed in RSAT servers. B) The tool retrieve-variation-seq retrieves the surrounding sequence of variants (including possible haplotypes) and generates a text file with one line per allele and per variant or haplotype (varSeq format). C) Users can input their variants in varSeq format and a collection of motifs (direct input by the user or selected from RSAT available collections) to variation-scan; the tool then scans the corresponding sequences with all motifs and performs pairwise comparisons between the binding scores of each transcription factor on all alleles of a variant or haplotype.
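The pairwise comparison described in panel C can be illustrated with a minimal Python sketch. It is not the RSAT implementation: the function names, the toy count matrix, the pseudocount, the uniform background, the natural-log weights and the forward-strand-only scan are all assumptions made for illustration.

```python
import math
from itertools import combinations

BASES = "ACGT"

# Hypothetical 4-column count matrix whose consensus is ACGT (illustration only).
TOY_COUNTS = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
]


def weight_matrix(counts, pseudo=1.0, background=None):
    """Turn per-position counts into log-odds weights against a background."""
    background = background or {b: 0.25 for b in BASES}
    weights = []
    for column in counts:
        total = sum(column[b] for b in BASES)
        weights.append({
            b: math.log(
                (column[b] + pseudo * background[b]) / (total + pseudo) / background[b]
            )
            for b in BASES
        })
    return weights


def best_weight(seq, weights):
    """Score every window of the sequence (forward strand) and keep the best."""
    width = len(weights)
    return max(
        sum(weights[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    )


def compare_alleles(allele_seqs, weights):
    """Return (allele_1, allele_2, weight difference) for every pair of alleles."""
    scores = {name: best_weight(seq, weights) for name, seq in allele_seqs.items()}
    return [(a, b, scores[a] - scores[b]) for a, b in combinations(scores, 2)]


if __name__ == "__main__":
    weights = weight_matrix(TOY_COUNTS)
    # The alternative allele changes the G of the consensus to T.
    alleles = {"ref": "TTACGTTT", "alt": "TTACTTTT"}
    for first, second, diff in compare_alleles(alleles, weights):
        print(f"{first} vs {second}: weight difference = {diff:.3f}")
```

A variant with three alleles yields three pairs, matching the "one line per possible pair" layout described for the variation-scan output. Passing haplotype sequences instead of single-variant sequences works the same way, since only the allele sequences change.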
Fig. 2. Identification of experimentally validated regulatory variants using variation-scan. A) Correlation of the Massively Parallel Reporter Assays (MPRA) p-value of the mRNA/DNA ratio of positive variants and the variation-scan weight difference for the MPRA variants with significant change. B) Receiver Operating Characteristic (ROC) curve comparing the performance when aiming to classify MPRA experimentally analyzed variants using variation-scan (turquoise), DeepSea (purple), deltaSVM (green), and a negative control which consists of permuted motifs scored with variation-scan (red). (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
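A ROC comparison such as the one in panel B can be computed from one score per variant and a binary label. The sketch below is a generic ROC and area-under-curve calculation in Python, not the authors' evaluation code; using the absolute weight difference as the score, and the toy values in the usage example, are assumptions.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the score threshold.

    labels are 1 for positive variants and 0 for negatives;
    tied scores are processed together so they share a single ROC point.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


if __name__ == "__main__":
    # Hypothetical absolute weight differences and validation labels.
    toy_scores = [4.2, 3.1, 0.8, 0.3]
    toy_labels = [1, 1, 0, 0]
    print(auc(roc_points(toy_scores, toy_labels)))
```

An area of 0.5 corresponds to a random classifier, which is the behaviour a permuted-motif negative control is expected to approach.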
Fig. 3. Haplotype analysis in high-quality human genomes. A) Heterozygous variants located within the same putative binding site tend to have a greater impact on the TF binding probability as their number (X-axis) increases. This is expected, as the increase in weight difference observed in the violin plot corresponds to the cumulative impact of variations affecting different positions of the same binding site. B) Number of predicted disrupted Transcription Factor Binding Sites (TFBSs) with Cis-Regulatory Modules (CRMs) and TF ChIP-seq peak annotation (blue), with only peak annotation (yellow), and non-annotated predictions (grey). C) University of California Santa Cruz (UCSC) browser [48] screenshot, showing a locus encompassing two SNPs that compose a heterozygous haplotype in one of the Northern Europeans from Utah (CEU) individuals. The figure shows the reference genome haplotype. The variants are located in the FUT10 promoter (top). variation-scan predicts an effect on three motifs that represent binding sites for GABPA, ETS1 and ELF2, factors that have been shown by the ENCODE project to have binding sites in this region. The variant rs2732317 has been associated with effects on gene expression by the GTEx project. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
Fig. 4. Enrichment of the set of SNPs of Interest (SOIs) for diseases. The SNPs of interest include SNPs reported by a GWAS to be associated with resistance to Mycobacterium tuberculosis infection and the SNPs in linkage disequilibrium with those. The genes associated with these SNPs were compared to each term of a catalogue of diseases.
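The caption does not state which statistic was used to compare the gene set with each disease term. As an assumption, the sketch below uses a hypergeometric upper-tail test, a common choice for this kind of gene set comparison; the function name and all numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(overlap, universe, term_size, query_size):
    """P(X >= overlap) when query_size genes are drawn without replacement
    from a universe in which term_size genes are annotated to the term."""
    upper = min(term_size, query_size)
    favourable = sum(
        comb(term_size, i) * comb(universe - term_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(universe, query_size)


if __name__ == "__main__":
    # Hypothetical example: 20,000 genes in total, 150 annotated to a disease
    # term, 40 genes linked to the SOIs, 5 of them shared with the term.
    p_value = hypergeom_upper_tail(5, 20000, 150, 40)
    print(f"enrichment p-value: {p_value:.3g}")
```

Because every disease term in the catalogue is tested, the resulting p-values would normally need a multiple-testing correction before being interpreted.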