GimmeMotifs: a de novo motif prediction pipeline for ChIP-sequencing experiments.

Simon J van Heeringen, Gert Jan C Veenstra

Abstract

SUMMARY: Accurate prediction of transcription factor binding motifs that are enriched in a collection of sequences remains a computational challenge. Here we report on GimmeMotifs, a pipeline that incorporates an ensemble of computational tools to predict motifs de novo from ChIP-sequencing (ChIP-seq) data. Similar redundant motifs are compared using the weighted information content (WIC) similarity score and clustered using an iterative procedure. A comprehensive output report is generated with several different evaluation metrics to compare and evaluate the results. Benchmarks show that the method performs well on human and mouse ChIP-seq datasets. GimmeMotifs consists of a suite of command-line scripts that can be easily implemented in a ChIP-seq analysis pipeline. AVAILABILITY: GimmeMotifs is implemented in Python and runs on Linux. The source code is freely available for download at http://www.ncmls.eu/bioinfo/gimmemotifs/. CONTACT: s.vanheeringen@ncmls.ru.nl SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2010 PMID: 21081511 PMCID: PMC3018809 DOI: 10.1093/bioinformatics/btq636

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The spectacular development of sequencing technology has enabled rapid, cost-efficient profiling of DNA-binding proteins. Chromatin immunoprecipitation followed by high-throughput deep sequencing (ChIP-seq) delivers high-resolution binding profiles of transcription factors (TFs) (Park, 2009). Elucidating the binding characteristics of these TFs is an obvious follow-up question. However, the de novo identification of DNA sequence motifs remains a challenging computational task. Although many methods have been developed with varying degrees of success, no single method consistently performs well on real biological eukaryotic data (Tompa et al., 2005). The combination of different algorithmic approaches, each with its own strengths and weaknesses, has been shown to improve prediction accuracy and sensitivity over single methods (Hu et al., 2005). Here, we report on GimmeMotifs, a motif prediction pipeline using an ensemble of existing computational tools (Supplementary Fig. S1). This pipeline has been specifically developed to predict TF motifs from ChIP-seq data. It uses the wealth of sequences (binding peaks) usually resulting from ChIP-seq experiments both to predict motifs de novo and to validate these motifs in an independent fraction of the dataset. GimmeMotifs incorporates the weighted information content (WIC) similarity metric in an iterative clustering procedure to cluster similar motifs and reduce the redundancy that results from combining the output of different tools (see Supplementary Material). It produces an extensive graphical report with several evaluation metrics to enable interpretation of the results (Fig. 1).

Fig. 1.

An example of the GimmeMotifs output for p63 (Kouwenhoven et al., 2010). Shown are the sequence logo of the predicted motif (Schneider and Stephens, 1990), the best matching motif in the JASPAR database (Sandelin et al., 2004), the ROC curve, the positional preference plot and several statistics to evaluate the motif performance. See the Supplementary Material for a complete example.

2 METHODS

2.1 Overview

The input for GimmeMotifs is either a file in BED format containing genomic coordinates (e.g. peaks from a ChIP-seq experiment) or a FASTA file. This dataset is split in two: a prediction set, containing randomly selected sequences from the input dataset (20% of the sequences by default), is used for motif prediction with several different computational tools. Predicted motifs are then filtered for significance using all remaining sequences (the validation set) and clustered using the WIC score as described below, yielding a list of non-redundant motifs.
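The split into a prediction set and a validation set can be illustrated with a minimal Python sketch. This is not the GimmeMotifs code itself: the function name `split_prediction_validation`, the `seed` parameter and the toy peak names are hypothetical, and only the 20% default fraction is taken from the text.

```python
import random


def split_prediction_validation(sequences, fraction=0.2, seed=None):
    """Randomly split sequences into (prediction set, validation set).

    `fraction` is the share of sequences used for motif prediction;
    the remainder is kept apart for validating the predicted motifs.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)          # seeded for reproducible splits
    shuffled = list(sequences)         # copy, so the input is left untouched
    rng.shuffle(shuffled)
    n_pred = max(1, round(len(shuffled) * fraction))
    return shuffled[:n_pred], shuffled[n_pred:]


# Toy input: 100 hypothetical peak identifiers
peaks = [f"peak_{i}" for i in range(100)]
prediction, validation = split_prediction_validation(peaks, fraction=0.2, seed=42)
print(len(prediction), len(validation))  # 20 80
```

Keeping the two sets disjoint is what allows the validation set to serve as an independent check on motifs predicted from the prediction set.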

2.2 Motif similarity and clustering

The WIC similarity score is based on the information content (IC) and is defined for position i in motif X compared with position j of motif Y as:

$$\mathrm{WIC}(X_i, Y_j) = \sqrt{\mathrm{IC}(X_i) \cdot \mathrm{IC}(Y_j)} - c \cdot \mathrm{DIC}(X_i, Y_j) \tag{1}$$

where c is 2.5, and DIC(X_i, Y_j) is the differential IC defined in Equation (3). The IC of a specific motif position is defined as:

$$\mathrm{IC}(X_i) = \sum_{n \in \{A,C,G,T\}} f_{x_i,n} \log_2 \frac{f_{x_i,n}}{f_{bg}} \tag{2}$$

where IC(X_i) is the IC of position i of motif X, f_{x_i,n} is the frequency of nucleotide n at position i and f_{bg} is the background frequency (0.25). The differential IC (DIC) of position i in motif X and position j in motif Y is defined as:

$$\mathrm{DIC}(X_i, Y_j) = \sum_{n \in \{A,C,G,T\}} \left| f_{x_i,n} \log_2 \frac{f_{x_i,n}}{f_{bg}} - f_{y_j,n} \log_2 \frac{f_{y_j,n}}{f_{bg}} \right| \tag{3}$$

The WIC scores of all individual positions in the alignment are summed to determine the total WIC score of two aligned motifs. To calculate the maximum WIC score of two motifs, the scores of all possible alignments are calculated, and the maximum scoring alignment is kept. Similar motifs are clustered using an iterative pair-wise clustering procedure (Supplementary Material).
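The following Python sketch illustrates how such a score could be computed. It is an illustration under stated assumptions, not the C extension shipped with GimmeMotifs. It assumes that the per-position score is sqrt(IC(X_i) · IC(Y_j)) − c · DIC(X_i, Y_j), that DIC sums the absolute per-nucleotide differences of the IC terms, and that motifs are lists of [A, C, G, T] frequency columns. The pseudocount, the function names, and the choices to score only overlapping columns and to ignore reverse-complement alignments are all assumptions made for this example.

```python
import math

BACKGROUND = 0.25   # uniform background frequency, as stated in the text
C = 2.5             # weight of the differential IC term, as stated in the text
PSEUDO = 1e-6       # assumption: small pseudocount to avoid log2(0)


def _terms(col):
    """Per-nucleotide contributions f * log2(f / f_bg) for one motif column."""
    total = sum(col) + 4 * PSEUDO
    freqs = [(f + PSEUDO) / total for f in col]
    return [f * math.log2(f / BACKGROUND) for f in freqs]


def ic(col):
    """Information content of a single motif position."""
    return sum(_terms(col))


def dic(col_x, col_y):
    """Differential IC: summed absolute difference of the per-nucleotide terms."""
    return sum(abs(a - b) for a, b in zip(_terms(col_x), _terms(col_y)))


def wic(col_x, col_y):
    """WIC score of two motif positions."""
    # max(0, ...) guards against tiny negative products from rounding
    return math.sqrt(max(0.0, ic(col_x) * ic(col_y))) - C * dic(col_x, col_y)


def max_wic(motif_x, motif_y):
    """Best summed WIC over all ungapped offsets with at least one overlapping column."""
    best = -math.inf
    for offset in range(-(len(motif_y) - 1), len(motif_x)):
        score = 0.0
        for i, col_x in enumerate(motif_x):
            j = i - offset
            if 0 <= j < len(motif_y):   # assumption: overhanging columns are not scored
                score += wic(col_x, motif_y[j])
        best = max(best, score)
    return best


A, Cc, G = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]
print(round(wic(A, A), 2))               # identical columns: score is close to the IC
print(round(max_wic([A, Cc, G], [Cc, G]), 2))  # best offset aligns C with C and G with G
```

Identical, highly informative columns score close to their IC (about 2 bits), whereas mismatching columns are penalised heavily by the DIC term, so the best alignment is the one that lines up matching informative positions.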

2.3 Evaluation

The motifs can be evaluated using several different statistics: the absolute enrichment, the hypergeometric P-value, a receiver operating characteristic (ROC) graph, the ROC area under the curve (AUC) and the mean normalized conditional probability (MNCP) (Clarke and Granek, 2003). In addition to these evaluation metrics, GimmeMotifs generates a histogram of the motif position relative to the peak summit, the positional preference plot. Especially in the case of high-resolution ChIP-seq data, this gives valuable information on the motif location.
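Two of these statistics can be sketched with textbook definitions. The code below is illustrative only and is not taken from GimmeMotifs: the function names are hypothetical, the comparison against a set of background (negative) sequences is an assumption, and MNCP is not covered.

```python
from math import comb


def roc_auc(pos_scores, neg_scores):
    """ROC AUC as the probability that a random positive sequence
    outscores a random negative one (ties count as 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def hypergeom_pvalue(k, n_pos, K, N):
    """P(X >= k): chance of seeing at least k motif-containing sequences
    among n_pos positives, given K motif-containing sequences among N total."""
    upper = min(n_pos, K)
    tail = sum(comb(K, i) * comb(N - K, n_pos - i) for i in range(k, upper + 1))
    return tail / comb(N, n_pos)


# Toy motif scores: peaks (positives) versus assumed background sequences
peak_scores = [0.9, 0.8, 0.4]
background_scores = [0.5, 0.3, 0.2]
auc = roc_auc(peak_scores, background_scores)
print(round(auc, 3))  # 0.889

# 5 of 5 peaks contain the motif; 5 of all 10 sequences contain it
p_value = hypergeom_pvalue(5, 5, 5, 10)
print(p_value == 1 / 252)
```

The pairwise AUC computation is quadratic in the number of sequences; for large datasets a rank-based formulation would be the usual choice.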

2.4 Implementation

The GimmeMotifs package is implemented in Python, while the similarity metrics are written as a C extension module for performance reasons. It is freely available under the MIT license. Sequence logos are generated using WebLogo (Schneider and Stephens, 1990).

3 BENCHMARK RESULTS

We performed a benchmark study of GimmeMotifs on 18 TF ChIP-seq datasets. The ROC AUC and MNCP of the best performing motif were calculated and compared with the best motif of two other ensemble methods: SCOPE (Carlson et al., 2007) and W-ChIPMotifs (Jin et al., 2009) (Supplementary Tables S1 and S2). The results show that GimmeMotifs consistently produces accurate results (median ROC AUC 0.830). The method also significantly improves on the results of SCOPE (ROC AUC 0.613). The recently developed W-ChIPMotifs shows comparable results to GimmeMotifs (ROC AUC 0.824), although this tool does not cluster similar redundant motifs. In addition, the focus of GimmeMotifs is different. While the web interface of W-ChIPMotifs is very useful for casual use, the command-line tools of GimmeMotifs can be integrated in more sophisticated analysis pipelines.

4 CONCLUSION

We present GimmeMotifs, a de novo motif prediction pipeline ideally suited to predict transcription factor binding motifs from ChIP-seq datasets. GimmeMotifs clusters the results of several different tools and produces a comprehensive report to evaluate the predicted motifs. We show that GimmeMotifs performs well on biologically relevant datasets of different TFs and compares favorably to other methods.

REFERENCES

1. JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Authors: Albin Sandelin; Wynand Alkema; Pär Engström; Wyeth W Wasserman; Boris Lenhard
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

2. Rank order metrics for quantifying the association of sequence features with gene regulation.

Authors: Neil D Clarke; Joshua A Granek
Journal: Bioinformatics Date: 2003-01-22 Impact factor: 6.937

3. Sequence logos: a new way to display consensus sequences.

Authors: T D Schneider; R M Stephens
Journal: Nucleic Acids Res Date: 1990-10-25 Impact factor: 16.971

4. W-ChIPMotifs: a web application tool for de novo motif discovery from ChIP-based high-throughput data.

Authors: Victor X Jin; Jeff Apostolos; Naga Satya Venkateswara Ra Nagisetty; Peggy J Farnham
Journal: Bioinformatics Date: 2009-10-01 Impact factor: 6.937

5. Genome-wide profiling of p63 DNA-binding sites identifies an element that regulates gene expression during limb development in the 7q21 SHFM1 locus.

Authors: Evelyn N Kouwenhoven; Simon J van Heeringen; Juan J Tena; Martin Oti; Bas E Dutilh; M Eva Alonso; Elisa de la Calle-Mustienes; Leonie Smeenk; Tuula Rinne; Lilian Parsaulian; Emine Bolat; Rasa Jurgelenaite; Martijn A Huynen; Alexander Hoischen; Joris A Veltman; Han G Brunner; Tony Roscioli; Emily Oates; Meredith Wilson; Miguel Manzanares; José Luis Gómez-Skarmeta; Hendrik G Stunnenberg; Marion Lohrum; Hans van Bokhoven; Huiqing Zhou
Journal: PLoS Genet Date: 2010-08-19 Impact factor: 5.917

6. ChIP-seq: advantages and challenges of a maturing technology.

Authors: Peter J Park
Journal: Nat Rev Genet Date: 2009-09-08 Impact factor: 53.242

7. Assessing computational tools for the discovery of transcription factor binding sites.

Authors: Martin Tompa; Nan Li; Timothy L Bailey; George M Church; Bart De Moor; Eleazar Eskin; Alexander V Favorov; Martin C Frith; Yutao Fu; W James Kent; Vsevolod J Makeev; Andrei A Mironov; William Stafford Noble; Giulio Pavesi; Graziano Pesole; Mireille Régnier; Nicolas Simonis; Saurabh Sinha; Gert Thijs; Jacques van Helden; Mathias Vandenbogaert; Zhiping Weng; Christopher Workman; Chun Ye; Zhou Zhu
Journal: Nat Biotechnol Date: 2005-01 Impact factor: 54.908

8. Limitations and potentials of current motif discovery algorithms.

Authors: Jianjun Hu; Bin Li; Daisuke Kihara
Journal: Nucleic Acids Res Date: 2005-09-02 Impact factor: 16.971

9. SCOPE: a web server for practical de novo motif discovery.

Authors: Jonathan M Carlson; Arijit Chakravarty; Charles E DeZiel; Robert H Gross
Journal: Nucleic Acids Res Date: 2007-05-07 Impact factor: 16.971
