Morgane Thomas-Chollier, Carl Herrmann, Matthieu Defrance, Olivier Sand, Denis Thieffry, Jacques van Helden.
Abstract
ChIP-seq is increasingly used to characterize transcription factor binding and chromatin marks at a genomic scale. Various tools are now available to extract binding motifs from peak data sets. However, most approaches are only available as command-line programs, or via a website but with size restrictions. We present peak-motifs, a computational pipeline that discovers motifs in peak sequences, compares them with databases, exports putative binding sites for visualization in the UCSC genome browser and generates an extensive report suited for both naive and expert users. It relies on time- and memory-efficient algorithms enabling the treatment of several thousand peaks within minutes. Regarding time efficiency, peak-motifs outperforms all comparable tools by several orders of magnitude. We demonstrate its accuracy by analyzing data sets ranging from 4000 to 128,000 peaks for 12 embryonic stem cell-specific transcription factors. In all cases, the program finds the expected motifs and returns additional motifs potentially bound by cofactors. We further apply peak-motifs to discover tissue-specific motifs in peak collections for the p300 transcriptional co-activator. To our knowledge, peak-motifs is the only tool that performs a complete motif analysis and offers a user-friendly web interface without any restriction on sequence size or number of peaks.
Year: 2011 PMID: 22156162 PMCID: PMC3287167 DOI: 10.1093/nar/gkr1104
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Features of software tools used for analyzing motifs in ChIP-seq peak sequences
| Program | Peak-motifs | ChIPMunk | CompleteMotifs | MEME-ChIP | MICSA | GimmeMotifs |
|---|---|---|---|---|---|---|
| Web interface | Yes | Yes | Yes | Yes | No | No |
| Size limitation | Unrestricted (website tested with 22 Mb) | 100 kb (website) | 500 kb (website) | Unrestricted, but analysis limited to 600 peaks clipped to 100 bp | Motif discovery restricted to a few hundred base pairs | – |
| Stand-alone version | Yes | Yes | No | Yes | Yes | Yes |
| Tasks | | | | | | |
| Peak finding | No | No | No | No | Yes | No |
| Annotation of peak-flanking genes | No | No | Yes | No | No | |
| Sequence composition (mono- and di-nucleotides) | Yes | No | No | No | No | |
| Motif discovery | Yes | Yes | Yes | Yes | Yes | Yes |
| Enrichment in motifs from databases | No | No | Yes | Yes | No | |
| Enrichment in discovered motifs | Yes | No | No | No | No | |
| Peak scoring | No | No | No | Yes | Yes | No |
| Motif clustering | No | No | No | No | Yes | |
| Comparison discovered motifs/motif DB | Yes | No | No | Yes | Yes | |
| Sequence scanning for site prediction | Yes | No | No | Yes | No | |
| Positional distribution of sites inside peaks | Yes | No | Yes | No | Yes | |
| Visualization in genome browsers | Yes | No | Yes | No | No | |
| Motif discovery algorithms | RSAT oligo-analysis, RSAT dyad-analysis, RSAT local-word-analysis, MEME, ChIPMunk | ChIPMunk | ChIPMunk, MEME, Weeder | MEME, DREME | MEME | MEME, Weeder, MotifSampler, BioProspector, Gadem, Improbizer, MDmodule, Trawler, MoAn |
| Pattern matching algorithms | RSAT matrix-scan-quick | No | patser | MAST + AME (enrichment) | No | |
| Motif comparison algorithm | RSAT compare-motifs | No | STAMP | TOMTOM | STAMP | |
| Motif clustering algorithm | STAMP | | | | | |
| Comparison between discovered motifs | Yes | No | Yes | No | Yes | |
| Motif database comparisons | JASPAR, UNIPROBE, DMMPMM, RegulonDB, user-uploaded database | No | JASPAR, TRANSFAC | JASPAR, TRANSFAC, UNIPROBE, FLYREG, DPINTERACT, SCPD, DMMPMM and many others | No | |
| Motif sizes | Variable (multiple word assembly) | User-specified | ≤25 for MEME, ≤12 for Weeder, ≤13 for ChIPMunk | Predefined ranges (small, medium, large, extra-large) | | |
| Multiple motifs | Yes | Yes | Yes | Yes | | |
| Ref (PMID) | This article | 20736340 | 21183585 | 21486936 | 20375099 | 21081511 |
The table summarizes the tasks, algorithms and usability properties of each program so that users can compare the available software options. Most programs offer a web interface but impose restrictive limits on the size of the data sets they process. Although all programs support motif discovery, the other tasks vary widely and no single program covers them all.
Figure 1. Schematic flow chart of the peak-motifs pipeline. For the sake of clarity, only the main analysis steps are depicted. The pipeline takes a set of peak sequences as input and runs several de novo motif discovery algorithms based on different detection criteria: over-representation, differential representation (test versus control), global position bias or local over-representation along the centered peaks. Transcription factors are predicted by matching discovered motifs against several public motif databases and/or against user-uploaded motif collections. Peak sequences are scanned with the discovered motifs to predict precise binding positions. These positions are then automatically exported as an annotation track for the UCSC genome browser, enabling flexible visualization in their genomic context.
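The export step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that each predicted site is known by its offset inside a peak whose genomic start is given, and it writes standard six-column BED lines (0-based start, exclusive end) under a UCSC track line. The function name `sites_to_bed`, the dictionary keys and the example coordinates are hypothetical and are not taken from peak-motifs itself.

```python
def sites_to_bed(sites, track_name="predicted_sites"):
    """Render predicted sites as BED text preceded by a UCSC track line.

    Assumption: peak_start is the 0-based genomic start of the peak and
    offset is the 0-based position of the site within the peak sequence.
    """
    lines = [f'track name="{track_name}" description="Predicted binding sites"']
    for site in sites:
        start = site["peak_start"] + site["offset"]   # genomic start, 0-based
        end = start + site["length"]                  # exclusive end
        fields = [site["chrom"], str(start), str(end),
                  site["motif"], "0", site["strand"]]  # score fixed at 0 here
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Hypothetical sites, for illustration only.
example_sites = [
    {"chrom": "chr1", "peak_start": 1000000, "offset": 120,
     "length": 10, "motif": "motif_1", "strand": "+"},
    {"chrom": "chr2", "peak_start": 500, "offset": 0,
     "length": 8, "motif": "motif_2", "strand": "-"},
]
bed_text = sites_to_bed(example_sites)
print(bed_text, end="")
```

The resulting text can be saved to a file and uploaded as a custom track; a real export would probably put a match score in the fifth column rather than a constant.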
Figure 4. Logos of the motifs discovered by peak-motifs for the factors Oct4, Sox2, Nanog and E2f1, adapted from the ChIP-seq data set by Chen et al. (20).
Figure 2. Time efficiency of motif discovery algorithms integrated in peak-motifs (solid lines) compared to alternative algorithms (dotted lines). The abscissa indicates sequence size, the ordinate processing time. The programs oligo-analysis, dyad-analysis, position-analysis and DREME show linear time complexity (power ∼1), ChIPMunk has quasi-linear complexity (power 1.27) and MEME more than quadratic complexity (power 2.21). See Supplementary File S1 for the detailed analysis.
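The "power" values quoted in the Figure 2 caption are exponents of a relation of the form time ≈ a · size^b. One common way to estimate such an exponent, sketched below, is a least-squares line fit on log-transformed data. This may or may not match the procedure used in Supplementary File S1, and the timings in the example are synthetic, not measurements from the article.

```python
import math


def fit_power_law(sizes, times):
    """Fit time = a * size**b by linear regression on log-log values.

    Returns (a, b), where b is the estimated power (slope in log-log space).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic timings (assumed values, for illustration only).
sizes = [1e5, 1e6, 1e7]                          # sequence sizes in bp
linear_times = [0.002 * s for s in sizes]        # behaves like power 1
quadratic_times = [1e-9 * s ** 2 for s in sizes] # behaves like power 2

_, linear_power = fit_power_law(sizes, linear_times)
_, quadratic_power = fit_power_law(sizes, quadratic_times)
print(round(linear_power, 3), round(quadratic_power, 3))
```

A fitted power near 1 means that processing time grows in proportion to input size, whereas a power above 2 means that a tenfold larger data set takes more than a hundred times longer.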
Figure 3. Most significant motifs discovered with the different algorithms encompassed by peak-motifs for ChIP-seq peak collections pulled down with 12 transcription factors involved in ES cell pluripotency (20). The first three columns indicate the studied transcription factor and the size of the data set (in number of peaks and in Mb). The fourth and fifth columns display the ID and consensus of the chosen reference motif. The sixth column shows the best motif found by peak-motifs, followed by two estimates of the correlation between the discovered and the matched motifs (Cor and Cov). The following columns detail which algorithm(s) detected this motif, and which motifs from the JASPAR and TRANSFAC databases were similar to the found motif.
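To give an intuition for the correlation between a discovered motif and a reference motif, the sketch below computes a plain Pearson correlation over the cells of two already aligned frequency matrices of equal width. This is an assumption made for illustration: the exact definitions of Cor and Cov used by the pipeline are not given in this section, and the matrices below are invented.

```python
import math


def matrix_correlation(m1, m2):
    """Pearson correlation between two aligned motif matrices.

    Each matrix is a list of columns; each column holds the frequencies
    of A, C, G and T at that position. Both must have the same shape.
    """
    x = [v for col in m1 for v in col]
    y = [v for col in m2 for v in col]
    if len(m1) != len(m2) or len(x) != len(y):
        raise ValueError("matrices must have the same shape")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented two-column motifs (A, C, G, T frequencies per column).
discovered = [[0.90, 0.03, 0.03, 0.04], [0.10, 0.10, 0.70, 0.10]]
reference = [[0.85, 0.05, 0.05, 0.05], [0.05, 0.15, 0.70, 0.10]]
unrelated = [[0.10, 0.70, 0.10, 0.10], [0.70, 0.10, 0.10, 0.10]]

print(matrix_correlation(discovered, reference) >
      matrix_correlation(discovered, unrelated))
```

In practice a comparison tool also has to search over alignment offsets and strands before scoring, which this sketch leaves out.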
Figure 5. Network of motifs discovered in the p300 data set. Each node represents a motif; the shape and color of the node denote the tissue (for the p300 data sets) and the ChIPed factor (for the HL1 cell-line data sets, used as validation), respectively. Two motifs are joined by a line if their normalized correlation is above 0.75; the width of the line denotes the degree of correlation. Node labels refer to the algorithm used to discover the motif: L (local-words), P (position-analysis), O (oligo-analysis), D (dyad-analysis), as well as the considered word length (6 or 7). The names of the transcription factor(s) likely associated with the motif clusters are also indicated, together with a representative logo.
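The network construction described in the Figure 5 caption can be sketched as follows: motifs become nodes, an edge is drawn when the normalized correlation exceeds 0.75, and connected components give candidate motif clusters. The motif labels and correlation values are hypothetical, and the use of connected components to define clusters is an assumption of this sketch rather than a procedure stated in the caption.

```python
def build_motif_network(correlations, threshold=0.75):
    """Build an undirected graph from pairwise normalized correlations.

    correlations maps (motif_a, motif_b) to a correlation value; an edge
    is kept only when the value is strictly above the threshold.
    """
    graph = {}
    for (a, b), cor in correlations.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if cor > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return clusters of motifs as sorted lists, via depth-first search."""
    seen = set()
    clusters = []
    for node in sorted(graph):
        if node in seen:
            continue
        component = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(graph[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical labels: algorithm letter, word length, motif rank.
correlations = {
    ("O6_m1", "P7_m1"): 0.91,
    ("P7_m1", "D_m1"): 0.80,
    ("O6_m1", "L6_m3"): 0.40,
    ("D_m1", "L6_m3"): 0.75,   # exactly at the threshold, so no edge
}
clusters = connected_components(build_motif_network(correlations))
print(clusters)
```

Here the three motifs found by oligo-, position- and dyad-analysis end up in one cluster, while the local-words motif stays isolated because none of its correlations is strictly above 0.75.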