InMoDe: tools for learning and visualizing intra-motif dependencies of DNA binding sites.

Ralf Eggeling¹, Ivo Grosse²,³, Jan Grau²

Abstract

Summary: Recent studies have shown that the traditional position weight matrix model is often insufficient for modeling transcription factor binding sites, as intra-motif dependencies play a significant role in an accurate description of binding motifs. Here, we present the Java application InMoDe, a collection of tools for learning, leveraging and visualizing such dependencies of putative higher order. The distinguishing feature of InMoDe is a robust model selection from a class of parsimonious models, which takes dependencies into account only if they are justified by the data and opts for simplicity otherwise.

Availability and Implementation: InMoDe is implemented in Java and is available as a command-line application, as an application with a graphical user interface, and as an integration into Galaxy on the project website at http://www.jstacs.de/index.php/InMoDe .

Contact: ralf.eggeling@cs.helsinki.fi.

Substances:
Transcription Factors
DNA

Year: 2017 PMID: 28035026 PMCID: PMC5408807 DOI: 10.1093/bioinformatics/btw689

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The position weight matrix (PWM) model (Stormo et al., 1982) has been the standard choice for representing the statistical properties of functional DNA elements such as transcription factor binding sites for more than three decades. While easy to interpret, use and visualize, the original PWM model assumes that all nucleotides contribute independently to the total binding affinity. Recent studies have shown that taking into account dependencies among nucleotides can yield a better motif representation (Eggeling et al., 2014; Keilwagen and Grau, 2015; Siebert and Söding, 2016; Zhao et al., 2012), and that DNA shape (Mathelier et al., 2016) could serve as one biophysical explanation for deviations from the independence assumption.

In this article, we present InMoDe, a user-friendly suite of tools for learning intra-motif dependencies in various scenarios. InMoDe is based on parsimonious context trees (PCTs) as proposed by Bourguignon and Robelin (2004), which provide a sparse parameterization of the conditional probability distribution in order to avoid overfitting. The position-specific use of PCTs yields an inhomogeneous parsimonious Markov model (iPMM). Both the model structure and the parameters of an iPMM can be robustly learned without resorting to computationally expensive parameter tuning, even when latent variables are involved (Eggeling et al., 2015). While finding optimal PCTs is computationally hard, recent algorithmic advances allow us to solve typical instances fast enough to effectively consider dependencies up to order six (Eggeling and Koivisto, 2016). InMoDe also provides tools for applying learned models, such as scanning given sequences for statistically significant hits and classifying binding sites. The learned dependencies of an iPMM can be directly visualized by a conditional sequence logo (CSL, Fig. 1A), which can be viewed as a PCT-based extension of a traditional sequence logo (Schneider and Stephens, 1990).

InMoDe is implemented in Java using the Jstacs library (Grau et al., 2012) and provides three user interfaces: a command line interface, a GUI and an integration into Galaxy workflows. In the next two sections, we briefly discuss the different components of InMoDe, consisting of four learning modes and three application tools. For detailed documentation of all features, we refer to the user guide on the project website.
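
To make the difference between the PWM independence assumption and a PCT-based conditional distribution concrete, the following is a minimal Python sketch. It is not taken from InMoDe or Jstacs: all probabilities, the motif length, the single first-order PCT at the third position, and the function names are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy PWM for a motif of length 3: one distribution per position.
PWM = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.10, "C": 0.40, "G": 0.40, "T": 0.10},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

# Hypothetical first-order PCT for the third position. The four possible
# contexts (the preceding nucleotide) are partitioned into two leaves,
# {C, G} and {A, T}; all contexts in a leaf share one distribution.
PCT_POS3 = [
    (frozenset("CG"), {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05}),
    (frozenset("AT"), {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}),
]


def pwm_log_prob(site):
    """Log-probability under the PWM: positions contribute independently."""
    return sum(math.log(PWM[i][nt]) for i, nt in enumerate(site))


def pct_cond_prob(nt, context):
    """Conditional probability of nt given the preceding nucleotide."""
    for contexts, dist in PCT_POS3:
        if context in contexts:
            return dist[nt]
    raise ValueError(f"context {context!r} is not covered by the PCT")


def ipmm_log_prob(site):
    """Positions 1 and 2 as in the PWM; position 3 conditioned on position 2."""
    return (
        math.log(PWM[0][site[0]])
        + math.log(PWM[1][site[1]])
        + math.log(pct_cond_prob(site[2], site[1]))
    )


if __name__ == "__main__":
    for site in ("ACG", "ACA", "AAT"):
        print(site, pwm_log_prob(site), ipmm_log_prob(site))
```

In this toy setting the PCT needs two leaf distributions (six free parameters) instead of the four (twelve free parameters) required by a full first-order conditional distribution, which is the kind of sparse parameterization described above.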

Fig. 1.

Exemplary conditional sequence logo as generated by VisualisationApp (A) and screenshot of the graphical user interface (B). The examples use the E2F1 dataset from JASPAR (Sandelin et al., 2004). The top row of nucleotide stacks in the CSL corresponds to a classical sequence logo (displayed in the top-right box of the GUI), whereas the bottom row shows conditional nucleotide probabilities given the context represented by the PCT (middle). The motif is refined by first-order dependencies at positions 2, 3 and 8, and by a higher-order dependency at position 10, while the model opts for simplicity where nothing else is justified by the data, such as at positions 4–7 and 11.

2 Learning modes

In the simplest scenario, the input data is a set of gapless aligned binding sites of identical length, obtained either experimentally or from a database. Structure and parameters of a single iPMM can be learned exactly using the tool SimpleMoDe, which implements the methodology of Eggeling et al. (2014).

In the case of, e.g., ChIP-seq positive fragments, the input sequences are of arbitrary length and the location of the binding sites is not known, leading to the problem of de novo motif discovery. Here, learning is computationally more expensive, as it involves latent variables that indicate motif occurrence and strand orientation of a putative binding site in each sequence. The tool DeNovoMoDe implements a recent algorithm for learning iPMMs in this scenario (Eggeling et al., 2015), which is a variant of a stochastic EM algorithm (Nielsen, 2000) but uses the BIC-score (Schwarz, 1978) of the optimal iPMM as target function.

In a third scenario, we again assume the input sequences to be pre-aligned and of the same length. In contrast to the first scenario, however, not a single model but a mixture of K component models is to be learned. Here, latent variables are involved as well and indicate the mixture component. A possible use case is to investigate to what degree complex intra-motif dependencies could be explained by two (or more) different models of lower complexity. The tool MixtureMoDe allows this type of analysis, using essentially the same algorithm as DeNovoMoDe with the sum of the K BIC-scores as target function.

Finally, the tool FlexibleMoDe combines the functionality of the three previous tools by taking into account latent variables for motif position, strand orientation and motif type. While it contains the other three tools as special cases, it also covers a wide range of additional, less common use cases. In addition, FlexibleMoDe allows for learning from weighted data, which can be useful if sequences should contribute to the learned model to different degrees, e.g. according to ChIP-scores.
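
The role of a BIC-type score in deciding whether a dependency is justified by the data can be illustrated with a small Python sketch. It uses the textbook form of the score, maximized log-likelihood minus (k/2)·ln N, and compares only two fixed structures at one position: an independent distribution and a full first-order conditional distribution. This is an assumption-laden simplification: InMoDe optimizes over PCT structures, and its exact scoring details may differ. The function names and toy datasets are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def max_log_lik(counts):
    """Maximized multinomial log-likelihood for a table of counts."""
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values() if c > 0)


def bic_independent(sites, pos):
    """BIC of a single distribution at pos (3 free parameters)."""
    counts = Counter(site[pos] for site in sites)
    k = len(ALPHABET) - 1
    return max_log_lik(counts) - 0.5 * k * math.log(len(sites))


def bic_first_order(sites, pos):
    """BIC of one distribution per preceding nucleotide (12 free parameters)."""
    per_context = {a: Counter() for a in ALPHABET}
    for site in sites:
        per_context[site[pos - 1]][site[pos]] += 1
    log_lik = sum(max_log_lik(c) for c in per_context.values())
    k = len(ALPHABET) * (len(ALPHABET) - 1)
    return log_lik - 0.5 * k * math.log(len(sites))


if __name__ == "__main__":
    # Toy data in which the second nucleotide always copies the first.
    dependent = ["AA"] * 50 + ["CC"] * 50 + ["GG"] * 50 + ["TT"] * 50
    # Toy data in which all 16 dinucleotides are equally frequent.
    unrelated = [a + b for a in ALPHABET for b in ALPHABET] * 10
    for name, data in (("dependent", dependent), ("unrelated", unrelated)):
        print(name, bic_independent(data, 1), bic_first_order(data, 1))
```

On the strongly dependent toy data the first-order structure attains the higher score, whereas on the data without dependency the penalty term makes the simpler structure win, mirroring the idea of including dependencies only where the data justify them.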

3 Application tools

ScanApp scans a given dataset for high-scoring occurrences of a previously learned iPMM. In order to assess which likelihood value is sufficient for declaring a hit, it determines a cutoff threshold based on a user-specified false positive rate (FPR) for occurrences on a negative dataset. The negative dataset is either given as user input or constructed as a randomized version of the positive dataset.

ClassificationApp performs a binary classification of sequences, where each class is represented by an iPMM; it thus accepts only sequences of the same length as the models. If no background iPMM is available, either a pseudo-background based on a homogeneous Markov chain learned from a user-specified negative dataset or a PWM model with uniform parameters can be used instead.

All learning tools automatically produce a standard visualization of structure and parameters of the learned iPMM(s) in terms of a CSL. The tool VisualisationApp allows plotting customized CSLs in order to obtain an appropriate degree of detail according to publication format and target audience. For example, Figure 1A displays a default CSL, which additionally contains descriptions of the main plot elements on the left.
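
The idea of deriving a cutoff from a user-specified FPR on a negative dataset can be sketched in Python as follows. This is a generic illustration and not ScanApp's implementation: the shuffling strategy, the tie handling, the forward-strand-only scan and all function names are assumptions, and `score_fn` is a placeholder for whatever likelihood-based score a learned model would provide.

```python
import random


def shuffled_negatives(sequences, seed=0):
    """Build a negative dataset by shuffling the letters of each sequence."""
    rng = random.Random(seed)
    negatives = []
    for seq in sequences:
        letters = list(seq)
        rng.shuffle(letters)
        negatives.append("".join(letters))
    return negatives


def fpr_cutoff(negative_scores, fpr):
    """Return a cutoff t such that at most a fraction fpr of the
    negative scores is strictly greater than t."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(fpr * len(ranked))  # number of tolerated false positives
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def scan(sequence, width, score_fn, cutoff):
    """Report (start, score) for every window scoring above the cutoff."""
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_fn(sequence[start:start + width])
        if score > cutoff:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Placeholder score: number of A's in the window (hypothetical).
    def count_a(window):
        return window.count("A")

    positives = ["CCAAACC", "GTAAAGT"]
    negatives = shuffled_negatives(positives)
    neg_scores = [count_a(s[i:i + 3]) for s in negatives for i in range(len(s) - 2)]
    cutoff = fpr_cutoff(neg_scores, 0.1)
    for seq in positives:
        print(seq, scan(seq, 3, count_a, cutoff))
```

Because ties at the cutoff are not counted as hits, the empirical FPR on the negative scores can be lower than the requested value; other conventions are equally possible.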

4 Conclusions

InMoDe provides multiple tools for learning, utilizing and visualizing intra-motif dependencies within functional DNA elements. While the main use case is the modeling of transcription factor binding sites, other types of functional nucleotide elements that can be represented by sequence motifs can be analyzed as well.

9 in total

1. JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Authors: Albin Sandelin; Wynand Alkema; Pär Engström; Wyeth W Wasserman; Boris Lenhard
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. Improved models for transcription factor binding site identification using nonindependent interactions.

Authors: Yue Zhao; Shuxiang Ruan; Manishi Pandey; Gary D Stormo
Journal: Genetics Date: 2012-04-13 Impact factor: 4.562

3. Sequence logos: a new way to display consensus sequences.

Authors: T D Schneider; R M Stephens
Journal: Nucleic Acids Res Date: 1990-10-25 Impact factor: 16.971

4. Characterization of translational initiation sites in E. coli.

Authors: G D Stormo; T D Schneider; L M Gold
Journal: Nucleic Acids Res Date: 1982-05-11 Impact factor: 16.971

5. Varying levels of complexity in transcription factor binding motifs.

Authors: Jens Keilwagen; Jan Grau
Journal: Nucleic Acids Res Date: 2015-06-26 Impact factor: 16.971

6. DNA Shape Features Improve Transcription Factor Binding Site Predictions In Vivo.

Authors: Anthony Mathelier; Beibei Xin; Tsu-Pei Chiu; Lin Yang; Remo Rohs; Wyeth W Wasserman
Journal: Cell Syst Date: 2016-08-18 Impact factor: 10.304

7. Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences.

Authors: Matthias Siebert; Johannes Söding
Journal: Nucleic Acids Res Date: 2016-06-09 Impact factor: 16.971

8. On the value of intra-motif dependencies of human insulator protein CTCF.

Authors: Ralf Eggeling; André Gohr; Jens Keilwagen; Michaela Mohr; Stefan Posch; Andrew D Smith; Ivo Grosse
Journal: PLoS One Date: 2014-01-22 Impact factor: 3.240

9. Inferring intra-motif dependencies of DNA binding sites from ChIP-seq data.

Authors: Ralf Eggeling; Teemu Roos; Petri Myllymäki; Ivo Grosse
Journal: BMC Bioinformatics Date: 2015-11-09 Impact factor: 3.169

