Xiaotu Ma, Ashwinikumar Kulkarni, Zhihua Zhang, Zhenyu Xuan, Robert Serfling, Michael Q Zhang.
Abstract
Identification of DNA motifs from chromatin immunoprecipitation (ChIP) data, whether ChIP-seq or ChIP-chip, is a powerful method for understanding transcriptional regulatory networks. However, most established methods are designed for small sample sizes and are inefficient for ChIP data. Here we propose a new k-mer occurrence model that reflects the fact that functional DNA k-mers often cluster around ChIP peak summits. Based on this model, we introduce a new measure for discovering functional k-mers. Using simulation, we demonstrate that our method is more robust to noise in ChIP data than available methods. A novel word clustering method is also implemented to group similar k-mers into position weight matrices (PWMs). We applied our method to a diverse set of ChIP experiments to demonstrate its high sensitivity and specificity. Importantly, our method is much faster than several other methods for large sample sizes. Thus, we have developed an efficient and effective motif discovery method for ChIP experiments.
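The idea that functional k-mers cluster around peak summits can be illustrated with a minimal Python sketch. This is not the POSMO measure itself, which the abstract does not specify. The function names `kmer_offsets` and `centrality`, the `window` parameter, and the assumption that each input sequence is centered on its peak summit are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def kmer_offsets(sequences, k):
    """Map each k-mer to the offsets of its occurrences from the summit.

    Assumption: every sequence is centered on its peak summit, so the
    summit sits at len(seq) / 2. The offset is measured from the k-mer
    center to the summit.
    """
    offsets = defaultdict(list)
    for seq in sequences:
        seq = seq.upper()
        summit = len(seq) / 2
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip windows containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                offsets[kmer].append(i + k / 2 - summit)
    return offsets


def centrality(kmer_offset_list, window):
    """Fraction of occurrences lying within `window` bases of the summit."""
    if not kmer_offset_list:
        return 0.0
    near = sum(1 for d in kmer_offset_list if abs(d) <= window)
    return near / len(kmer_offset_list)


# Toy data: the k-mer GATAAG sits exactly at the center of both sequences.
toy_peaks = ["TTTTTTGATAAGTTTTTT", "CCCCCCGATAAGCCCCCC"]
toy_offsets = kmer_offsets(toy_peaks, 6)
```

In this toy input, `centrality(toy_offsets["GATAAG"], 2)` is 1.0 while the flanking `TTTTTT` scores 0.0. A real measure would also need a background model to judge whether such concentration is statistically significant.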
Year: 2012 PMID: 22228832 PMCID: PMC3326300 DOI: 10.1093/nar/gkr1135
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
POSMO is robust to large sample sizes
Top five k-mers ranked by POSMO are related to the underlying transcription factor-DNA interaction
Dark-shaded cells mark either k-mers called significant by the POSMO algorithm (Z columns) or k-mers with a significant PWM score (PWM columns; score > µ + 3σ over all 4^k k-mers). Light-gray cells mark k-mers that are shifted versions of the genuine motif and therefore have an insignificant PWM score. The POSMO Z score of a k-mer is the average of its own Z score and that of its reverse complement. For each transcription factor, k-mers are sorted by POSMO Z score (Z columns).
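The two conventions in this legend, strand averaging and the µ + 3σ cutoff, can be sketched in a few lines of Python. The helper names are hypothetical. Two choices here are assumptions rather than details taken from the paper: the population standard deviation is used, and a k-mer's own score is reused when its reverse complement has no score.

```python
from statistics import mean, pstdev

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an uppercase DNA k-mer."""
    return kmer.translate(_COMPLEMENT)[::-1]


def strand_averaged(z_scores):
    """Average each k-mer's score with its reverse complement's score.

    Assumption: if the reverse complement has no score, the k-mer's own
    score is used in its place.
    """
    return {
        kmer: (z + z_scores.get(reverse_complement(kmer), z)) / 2
        for kmer, z in z_scores.items()
    }


def significant(scores, n_sigma=3.0):
    """Return the keys whose score exceeds mean + n_sigma * std."""
    values = list(scores.values())
    cutoff = mean(values) + n_sigma * pstdev(values)
    return {key for key, value in scores.items() if value > cutoff}
```

For example, `strand_averaged({"AAAC": 4.0, "GTTT": 2.0})` assigns 3.0 to both k-mers, since each is the reverse complement of the other.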
Performance comparison of POSMO, MEME, DME, ChIPMunk, HMS and DREME on ChIP data
| Transcription factor | Rank by POSMO | Rank by MEME | Rank by DME | Rank by ChIPMunk | Rank by HMS | Rank by DREME |
|---|---|---|---|---|---|---|
| STAT1 | 1 | 1 | 1 | 1 | 1 | 1 |
| NRSF | 1, 2 | 1, 2 | 1, 2 | 1 | 1 | 1 |
| CTCF | 1 | 1 | 1 | 1 | 1 | 1 |
| CTCF | 1 | 3 | 1 | 1 | NA | 1 |
| FOXA2 | 1 | 1 | 1 | 1 | 1 | 1 |
| CRX | 1 | 1 | 1 | 1 | No match | 1 |
| BCD | 1 | 4 | 4 | 1 | No match | 2 |
| CAD | No match | No match | No match | No match | No match | No match |
| HB1 | 1 | 1 | 1 | 1 | No match | 2 |
| HB2 | 1 | 1 | 1 | 1 | No match | 3 |
| KR1 | 1 | 1 | 1 | 1 | No match | 3 |
| KR2 | 1 | 1 | 1 | 1 | No match | 3 |
| KNI | No match | No match | No match | No match | No match | No match |
| GT | 1 | 3 | 7 | 1 | No match | No match |
| c-MYC | 1 | 1 | 1 | 1 | 1 | 1 |
| n-MYC | 1 | 1 | 1 | 1 | 1 | 1 |
| CTCF | 1 | 1 | 1 | 1 | No match | 1 |
| ESRRB | 1 | 1 | 1 | 1 | 1 | 1 |
| STAT3 | 1 | 1 | 2 | 1 | No match | 1 |
| OCT4 | 1 | 1 | 1 | 1 | 1 | 1 |
| SOX2 | 1 | 1 | 1 | 1 | 1 | 3 |
| KLF4 | 1 | 1 | 2 | 1 | 1 | 1 |
| E2F1 | No match | No match | No match | No match | No match | No match |
| TCFCP2L1 | 1 | 1 | 1 | 1 | 1 | 1 |
| ZFX | 1 | 1 | 1 | 1 | No match | 1 |
| NANOG | 1 | 1 | 1 | 1 | No match | 1 |
| SMAD1 | 1 | No match | No match | 1 | 1 | 1 |
| Total successes | 24/27 | 23/27 | 23/27 | 24/27 | 12/26 | 23/27 |
| Average rank | 1 | 1.30 | 1.47 | 1 | 1 | 1.43 |
For MEME, DREME and DME, the rank (by P-value) of the known binding motif among the top five reported motifs is listed. For NRSF, there are two known motifs and their ranks are counted separately. No match: the software did not report any motif similar to the known motif.
a: Top 3 peaks removed to get the correct motif.
b: Triangle intensity profile used.
c: Motif length of 20 used.
d: ±200 bases flanking peak summit.
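The summary rows of the comparison table can be recomputed from a single column. The Python sketch below uses the MEME column as transcribed above. It rests on two assumptions that are one possible reading of the table, not something the source states: "No match" entries are excluded from the average, and NRSF contributes only its best rank. Under that reading the column reproduces the reported 23/27 successes and an average rank of 1.30.

```python
# MEME ranks per data set, in table order; None stands for "No match".
meme_ranks = [
    ("STAT1", [1]), ("NRSF", [1, 2]), ("CTCF", [1]), ("CTCF", [3]),
    ("FOXA2", [1]), ("CRX", [1]), ("BCD", [4]), ("CAD", None),
    ("HB1", [1]), ("HB2", [1]), ("KR1", [1]), ("KR2", [1]),
    ("KNI", None), ("GT", [3]), ("c-MYC", [1]), ("n-MYC", [1]),
    ("CTCF", [1]), ("ESRRB", [1]), ("STAT3", [1]), ("OCT4", [1]),
    ("SOX2", [1]), ("KLF4", [1]), ("E2F1", None), ("TCFCP2L1", [1]),
    ("ZFX", [1]), ("NANOG", [1]), ("SMAD1", None),
]


def summarize(column):
    """Return (successes, data sets, average best rank over successes)."""
    # Assumption: take the best (lowest) rank when several are listed.
    best = [min(ranks) for _, ranks in column if ranks]
    return len(best), len(column), sum(best) / len(best)


successes, total, average_rank = summarize(meme_ranks)
print(f"{successes}/{total}, average rank {average_rank:.2f}")
```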
Sequence motifs discovered by POSMO
The DNA motifs after word clustering are listed. As a comparison, the DNA motifs from the literature are also listed. For NRSF, two motifs are reported by POSMO, which correspond to the left and right half-sites reported by Hu et al. (24).
POSMO is robust to input parameter k
Input pattern lengths of 7, 8 and 9 are compared.
Figure 1. POSMO is more efficient than DME for large sample sizes. The y-axis shows the running time for the number of top peaks given on the x-axis. Results are shown for POSMO (dashed line with boxes) and DME (dashed line with triangles for a smaller peak window and solid line with circles for a larger peak window). Here k = 8 for both POSMO and DME.