W-ChIPMotifs: a web application tool for de novo motif discovery from ChIP-based high-throughput data.

Victor X Jin, Jeff Apostolos, Naga Satya Venkateswara Ra Nagisetty, Peggy J Farnham.

Abstract

W-ChIPMotifs is a web application tool that provides a user-friendly interface for de novo motif discovery. The web tool is based on our previous ChIPMotifs program, a de novo motif finding tool developed for ChIP-based high-throughput data. ChIPMotifs incorporates several ab initio motif discovery tools, such as MEME, MaMF and Weeder, and optimizes the significance of the detected motifs using a bootstrap resampling statistical method and a Fisher test. Use of a randomized statistical model such as bootstrap resampling can significantly increase the accuracy of the detected motifs. In our web tool, we have modified the program in two respects: (i) we have refined the P-value with a Bonferroni correction; (ii) we have incorporated the STAMP tool to infer phylogenetic information and to determine whether the detected motifs are novel or known using the TRANSFAC and JASPAR databases. A comprehensive result file is emailed to users. AVAILABILITY: http://motif.bmi.ohio-state.edu/ChIPMotifs. Data used in the article may be downloaded from http://motif.bmi.ohio-state.edu/ChIPMotifs/examples.shtml.

Year: 2009 PMID: 19797408 PMCID: PMC2778340 DOI: 10.1093/bioinformatics/btp570

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

DNA motifs are short sequences, varying from 6 to 25 bp, that can be highly variable and degenerate. Understanding how transcription factors selectively bind to these motifs is important for understanding the logic and mechanisms of gene regulation. One major approach is to use position weight matrices (PWMs; Stormo et al., 1982) to represent the information content of regulatory sites. However, when used as the sole means of identifying binding sites, this approach suffers from the limited amount of training data available (Roulet et al., 1998) and a high rate of false positive predictions (Tompa et al., 2005). Many de novo motif finding tools have been developed to detect unknown motifs. Typical approaches include hidden Markov models (Pedersen and Moult, 1996), Gibbs sampling (Lawrence et al., 1993), exhaustive enumeration [i.e. detecting the set of all nucleotide n-mers, then reporting the most frequent or overrepresented; e.g. Weeder (Pavesi et al., 2004)], greedy alignment algorithms [e.g. CONSENSUS (Hertz and Stormo, 1999)], expectation-maximization [MEME (Bailey and Elkan, 1995)] and probabilistic mixture modeling [NestedMica (Down and Hubbard, 2005)].

ChIP-based high-throughput techniques such as ChIP-chip (Ren et al., 2000; Weinmann et al., 2002), ChIP-seq (Barski et al., 2007; Robertson et al., 2007) and ChIP-PET (Loh et al., 2006) have been used to interrogate protein–DNA interactions in intact cells and are well documented in many comprehensive reviews (Hanlon and Lieb, 2004). The enriched DNA sequences identified by these techniques, usually ranging from ∼150 to ∼1500 bases, are currently considered highly reliable datasets for detecting novel motifs. Many computational tools, including ours (Ettwiller et al., 2007; Gordon et al., 2005; Hong et al., 2005; Jin et al., 2007), have recently been developed for de novo motif discovery in data generated by these techniques. However, most of them are platform-dependent stand-alone executable programs and are not easily used by biologists.

In this application, we have built a web-based de novo motif discovery tool for identifying novel motifs from ChIP-based high-throughput data. Although the web tool is based on our previous program, ChIPMotifs, we have significantly modified the program by refining the P-value computation with a Bonferroni correction and by incorporating the STAMP tool (Mahony and Benos, 2007) to find phylogenetic information and similar motifs in the TRANSFAC (Wingender et al., 2000) and JASPAR (Sandelin et al., 2004) databases. The web interface is user friendly and accessible to the research community.

2 DESCRIPTION OF W-ChIPMotifs

Using the W-ChIPMotifs web service is simple and does not require any knowledge of the underlying software. The structure of W-ChIPMotifs is shown in Figure 1. Three inputs are required from the user: the DNA sequence data, contact information and a transcription factor name. DNA sequences must be in FASTA format and can be supplied either by uploading an existing file or by pasting the data directly into the form. Results are emailed to the address given in the contact information. The transcription factor name is used as a label in the results. Control data can also be specified as an optional input and are used to infer the statistical significance of detected motifs. If the user provides no control data, default control datasets are used: 5000 promoter sequences randomly selected per run from all human or mouse promoter sequences, depending on the species selected by the user.
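The input handling described above can be illustrated with a minimal Python sketch. It is not the server's code (W-ChIPMotifs itself is written in Perl); the function names `parse_fasta`, `validate_dna` and `sample_controls`, the accepted alphabet and the fallback when fewer than 5000 promoters are available are all assumptions made for illustration.

```python
import random


def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def validate_dna(records, alphabet="ACGTN"):
    """Raise ValueError if a record is empty or has non-DNA characters."""
    allowed = set(alphabet)
    for header, seq in records:
        if not seq:
            raise ValueError(f"record '{header}' has no sequence")
        bad = set(seq) - allowed
        if bad:
            raise ValueError(f"record '{header}' has invalid characters: {sorted(bad)}")


def sample_controls(promoters, k=5000, seed=None):
    """Randomly pick k control sequences (all of them if fewer than k exist)."""
    rng = random.Random(seed)
    if len(promoters) <= k:
        return list(promoters)
    return rng.sample(promoters, k)
```

For example, `parse_fasta(">peak1\nACGT\nacgt")` yields one record whose sequence is `ACGTACGT`, and `sample_controls` would be called once per run on the promoter set of the selected species.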

Fig. 1. A schematic view of W-ChIPMotifs.

After the server validates and retrieves the input, the DNA sequences are processed by a group of existing ab initio motif discovery programs. This group currently comprises MEME (Bailey and Elkan, 1995), MaMF (Hon and Jain, 2006) and Weeder (Pavesi et al., 2004). These three are frequently used by the community and have proven to be relatively accurate in detecting motifs. The programs are included in a modular fashion, which enables the easy addition of other components in the future. Using these programs, we identify a set of n candidate motifs (usually <10 motifs) and construct a PWM for each candidate motif, giving n PWMs.

A bootstrap resampling method is then used to infer the optimized PWM scores. In this method, a new dataset is created by randomizing each of the user's input sequences 100 times. This new set no longer corresponds to the original ChIP-identified binding sequences but shares the same nucleotide frequencies, and can therefore be used as a negative control set. The negative control set is scanned with the identified motifs at a minimal core score of 0.5 and a minimal PWM score of 0.5. We then retrieve the core and PWM scores at the top 0.1, 0.5 and 1% percentiles. A Fisher test is applied, and its P-value is used to define the significance cutoff for these scores. We also apply the Bonferroni correction, adjusting the P-value by multiplying it by the number of input samples. If the adjusted P-value exceeds 1.0, it is set to 1.0.

To provide users with more flexible and useful information about detected motifs, W-ChIPMotifs also uses the STAMP tool (Mahony and Benos, 2007) to determine whether the motifs are known or novel by finding phylogenetic information and motif similarity matches in the TRANSFAC and JASPAR databases. The phylogenetic analysis implemented in STAMP is based on two tree-building algorithms: an agglomerative method and a divisive method. Both take the input motifs' PWMs, aligned by multiple alignment strategies, and iteratively build tree nodes until each leaf node contains a single PWM.

The results from W-ChIPMotifs consist of two files, both in PDF format. The first file contains the detected motifs with their sequence logos, PWMs, core and PWM scores, P-values and Bonferroni-corrected P-values at the different percentile levels. The second file contains the matched similar motifs from the STAMP tool. In the future, we plan to add more accurate and efficient motif detection programs and to optimize the running time of the statistical methods.
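The resampling and significance steps described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the ChIPMotifs implementation: the function names are hypothetical, the shuffle preserves only single-nucleotide composition, the Fisher test is taken to be a one-sided exact test on a 2×2 table of sequences with and without a motif hit, and the correction factor `n` stands for the number of input samples mentioned in the text.

```python
import random
from math import comb


def shuffled_controls(sequences, n_shuffles=100, seed=0):
    """Build a negative control set by shuffling each input sequence.

    Shuffling keeps the nucleotide composition of each sequence but
    destroys any motif occurrences it contained.
    """
    rng = random.Random(seed)
    controls = []
    for seq in sequences:
        letters = list(seq)
        for _ in range(n_shuffles):
            rng.shuffle(letters)
            controls.append("".join(letters))
    return controls


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact P-value for the 2x2 table [[a, b], [c, d]].

    a, b: ChIP sequences with / without a motif hit
    c, d: control sequences with / without a motif hit
    Returns P(X >= a) under the hypergeometric null (enrichment in ChIP).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1))
    return tail / total


def bonferroni(p_value, n):
    """Multiply the P-value by n and cap the result at 1.0."""
    return min(1.0, p_value * n)
```

In this sketch, the hit counts passed to `fisher_exact_greater` would come from scanning the ChIP and control sets at a given score cutoff; that scanning step depends on the PWM scoring scheme and is left out here.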

3 IMPLEMENTATION

W-ChIPMotifs is written in Perl and uses a web interface developed with PHP. Multiple scripts are used to produce output from the included motif discovery programs, parse this output and apply the statistical techniques. The sequence logos for the motifs are generated using the WEBLOGO tool (Crooks et al., 2004). The open-source HTMLDOC program (http://www.htmldoc.org/) is used to convert these logos to PDF format. The phylogenetic tree is drawn from its Newick-format description with the DRAWTREE tool. The PHPGmailer package is used to send results to the user from the W-ChIPMotifs email account.
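As background for the logo step, the height of each column in a sequence logo reflects its information content in bits. The Python sketch below computes that quantity from per-position nucleotide counts. It is an illustration only, not the WEBLOGO code: the function names are hypothetical, and the small-sample correction that logo generators may apply is omitted.

```python
from math import log2


def column_information(counts):
    """Information content (bits) of one motif column.

    counts maps each nucleotide to its observed count at this position.
    For DNA the maximum is log2(4) = 2 bits (a fully conserved column).
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column has no observations")
    entropy = -sum(
        (c / total) * log2(c / total) for c in counts.values() if c > 0
    )
    return 2.0 - entropy


def motif_information(count_matrix):
    """Per-column information content for a list of count dictionaries."""
    return [column_information(col) for col in count_matrix]
```

A fully conserved column scores 2 bits and a column with equal counts of all four nucleotides scores 0 bits, which is why conserved motif positions appear as tall stacks in the logo.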

4 SAMPLE TESTS

The W-ChIPMotifs server was tested with well-known datasets of different input sizes from ChIP-seq and ChIP-chip experiments, including E2F4, FOXA1, NRSF and OCT4. The test data and results are available online at http://motif.bmi.ohio-state.edu/ChIPMotifs/examples.shtml.

24 in total

1. TRANSFAC: an integrated system for gene expression regulation.

Authors: E Wingender; X Chen; R Hehl; H Karas; I Liebich; V Matys; T Meinhardt; M Prüss; I Reuter; F Schacherer
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Evaluation of computer tools for the prediction of transcription factor binding sites on genomic DNA.

Authors: E Roulet; I Fisch; T Junier; P Bucher; N Mermod
Journal: In Silico Biol Date: 1998

3. JASPAR: an open-access database for eukaryotic transcription factor binding profiles.

Authors: Albin Sandelin; Wynand Alkema; Pär Engström; Wyeth W Wasserman; Boris Lenhard
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

4. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes.

Authors: Giulio Pavesi; Paolo Mereghetti; Giancarlo Mauri; Graziano Pesole
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

5. WebLogo: a sequence logo generator.

Authors: Gavin E Crooks; Gary Hon; John-Marc Chandonia; Steven E Brenner
Journal: Genome Res Date: 2004-06 Impact factor: 9.043

6. The value of prior knowledge in discovering motifs with MEME.

Authors: T L Bailey; C Elkan
Journal: Proc Int Conf Intell Syst Mol Biol Date: 1995

7. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment.

Authors: C E Lawrence; S F Altschul; M S Boguski; J S Liu; A F Neuwald; J C Wootton
Journal: Science Date: 1993-10-08 Impact factor: 47.728

8. Use of the 'Perceptron' algorithm to distinguish translational initiation sites in E. coli.

Authors: G D Stormo; T D Schneider; L Gold; A Ehrenfeucht
Journal: Nucleic Acids Res Date: 1982-05-11 Impact factor: 16.971

9. Isolating human transcription factor targets by coupling chromatin immunoprecipitation and CpG island microarray analysis.

Authors: Amy S Weinmann; Pearlly S Yan; Matthew J Oberley; Tim Hui-Ming Huang; Peggy J Farnham
Journal: Genes Dev Date: 2002-01-15 Impact factor: 11.361

10. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing.

Authors: Gordon Robertson; Martin Hirst; Matthew Bainbridge; Misha Bilenky; Yongjun Zhao; Thomas Zeng; Ghia Euskirchen; Bridget Bernier; Richard Varhol; Allen Delaney; Nina Thiessen; Obi L Griffith; Ann He; Marco Marra; Michael Snyder; Steven Jones
Journal: Nat Methods Date: 2007-06-11 Impact factor: 28.547
