Xin He, A Ercument Cicek, Yuhao Wang, Marcel H Schulz, Hai-Son Le, Ziv Bar-Joseph.
Abstract
Methods for the analysis of chromatin immunoprecipitation sequencing (ChIP-seq) data start by aligning the short reads to a reference genome. While often successful, they are not appropriate for cases where a reference genome is not available. Here we develop methods for de novo analysis of ChIP-seq data. Our methods combine de novo assembly with statistical tests, enabling motif discovery without the use of a reference genome. We validate the performance of our method using human and mouse data. Analysis of fly data indicates that our method outperforms alignment-based methods that utilize closely related species.
Year: 2015 PMID: 26400819 PMCID: PMC4579611 DOI: 10.1186/s13059-015-0756-4
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1. De novo ChIP-seq analysis pipeline. Top: Reads from the TF experiment are assembled using a de novo assembly method (SEECER or Velvet), leading to longer ChIPtigs, each of which is based on several (often hundreds or thousands of) assembled reads. Bottom right: Each of the assembled ChIPtigs is scored to determine its enrichment for experiment vs. control reads. ChIPtigs are ranked based on their enrichment scores. Bottom left: Top-ranking ChIPtigs are used as input for a motif discovery method, resulting in a set of motifs for the experiments
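The scoring and ranking step in the caption can be illustrated with a minimal sketch. The caption does not specify the enrichment statistic, so the library-size-normalized log2 ratio with a pseudocount used here is an assumption, and the function names, ChIPtig names, and read counts are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def enrichment_score(exp_reads: int, ctrl_reads: int,
                     exp_total: int, ctrl_total: int,
                     pseudocount: float = 1.0) -> float:
    """Assumed statistic: log2 ratio of library-size-normalized read counts.

    The pseudocount avoids division by zero when a ChIPtig has no control reads.
    """
    exp_rate = (exp_reads + pseudocount) / exp_total
    ctrl_rate = (ctrl_reads + pseudocount) / ctrl_total
    return math.log2(exp_rate / ctrl_rate)


def rank_chiptigs(counts: Dict[str, Tuple[int, int]],
                  exp_total: int, ctrl_total: int) -> List[Tuple[str, float]]:
    """Score every ChIPtig and return (name, score) pairs, most enriched first."""
    scored = [
        (name, enrichment_score(exp, ctrl, exp_total, ctrl_total))
        for name, (exp, ctrl) in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical counts: ChIPtig name -> (experiment reads, control reads).
counts = {
    "chiptig_a": (300, 10),
    "chiptig_b": (50, 45),
    "chiptig_c": (120, 5),
}
ranked = rank_chiptigs(counts, exp_total=1_000_000, ctrl_total=1_000_000)
for name, score in ranked:
    print(f"{name}\t{score:.2f}")
```

Any statistic that orders ChIPtigs by experiment-versus-control enrichment could replace `enrichment_score` without changing the ranking step.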
Table 1. ChIPtig statistics and results of motif finding in the mouse ESC dataset using Velvet
| TF | No. ChIPtigs | Mapped ChIPtigs (%) | Motif rank with peak-calling | Motif rank with top ChIPtigs | Motif rank with random ChIPtigs |
|---|---|---|---|---|---|
| c-MYC | 5,159 | 92.9 | 1 | 1 | 4 |
| CTCF | 2,152 | 92.9 | 1 | 2 | 1 |
| ESRRB | 30,278 | 95.9 | 1 | 1 | 1 |
| KLF4 | 1,660 | 90.4 | 1 | 1 | 1 |
| NANOG | 5,163 | 94.3 | 1 | N | N |
| n-MYC | 3,610 | 86.8 | 1 | 1 | 1 |
| POU5F1 | 2,528 | 92.9 | 1 | 1 | N |
| SMAD1 | 596 | 96.1 | 7 | N | N |
| SOX2 | 2,511 | 92.5 | 1 | 1 | N |
| STAT3 | 4,329 | 94.8 | 1 | 1 | N |
| TCFCP2L1 | 20,566 | 95.7 | N | N | N |
| ZFX | 3,348 | 93.5 | 1 | 1 | 1 |
Three settings were evaluated for motif finding performance: peak calling against a reference genome (MACS), the top 1,000 ChIPtigs from the de novo pipeline, and 1,000 random ChIPtigs from the same experiment assembled by Velvet. The rank of the known motif (from JASPAR) in the DREME results is shown for each TF. ‘N’ means that either DREME did not find any motif, or none of the motifs found by DREME matches the known motif for the TF in that row
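The two assembly-based settings differ only in how the 1,000 input ChIPtigs are chosen. The sketch below shows one way to prepare both inputs as FASTA text for a motif finder. It rests on assumptions: the `(name, sequence, score)` tuple layout, the helper names, and the fixed random seed are hypothetical and not taken from the source.

```python
import random
from typing import List, Tuple

# Hypothetical record layout: (name, sequence, enrichment score).
ChIPtig = Tuple[str, str, float]


def top_chiptigs(chiptigs: List[ChIPtig], n: int = 1000) -> List[ChIPtig]:
    """Return the n ChIPtigs with the highest enrichment scores."""
    return sorted(chiptigs, key=lambda c: c[2], reverse=True)[:n]


def random_chiptigs(chiptigs: List[ChIPtig], n: int = 1000,
                    seed: int = 0) -> List[ChIPtig]:
    """Return n ChIPtigs sampled uniformly without replacement (baseline)."""
    rng = random.Random(seed)  # fixed seed so the baseline is reproducible
    return rng.sample(chiptigs, min(n, len(chiptigs)))


def to_fasta(chiptigs: List[ChIPtig]) -> str:
    """Format ChIPtigs as FASTA text suitable as motif-finder input."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in chiptigs)


# Toy example with five hypothetical ChIPtigs.
toy = [
    ("c1", "ACGTACGT", 0.5),
    ("c2", "CACGTGAA", 4.2),
    ("c3", "TTTTAAAA", -1.0),
    ("c4", "GCACGTGC", 3.9),
    ("c5", "GGGGCCCC", 0.1),
]
top_fasta = to_fasta(top_chiptigs(toy, n=2))
random_fasta = to_fasta(random_chiptigs(toy, n=2))
print(top_fasta)
```

When an experiment yields fewer than `n` ChIPtigs (SMAD1 has 596 in the Velvet table), both helpers in this sketch return every available ChIPtig. The source does not say how that case was handled.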
Table 2. ChIPtig statistics and results of motif finding in the mouse ESC dataset using SEECER
| TF | No. ChIPtigs | Mapped ChIPtigs (%) | Motif rank with peak-calling | Motif rank with top ChIPtigs | Motif rank with random ChIPtigs |
|---|---|---|---|---|---|
| c-MYC | 15,987 | 86.5 | 1 | 1 | N |
| CTCF | 8,209 | 39.6 | 1 | N | N |
| ESRRB | 41,620 | 90.7 | 1 | 1 | 1 |
| KLF4 | 10,144 | 73.5 | 1 | 1 | N |
| NANOG | 19,106 | 43.7 | 1 | N | N |
| n-MYC | 13,663 | 67.2 | 1 | N | N |
| POU5F1 | 12,939 | 75.5 | 1 | N | N |
| SMAD1 | 9,914 | 39.8 | 7 | N | N |
| SOX2 | 12,797 | 77.7 | 1 | 1 | N |
| STAT3 | 17,394 | 84.7 | 1 | 1 | N |
| TCFCP2L1 | 31,701 | 89.4 | N | N | N |
| ZFX | 10,569 | 80.4 | 1 | 1 | 2 |
Columns are the same as in Table 1
Fig. 2. Motif discovery results for the human validation data. The table presents the results obtained for each of the TFs (rows) using the de novo assembly pipeline with SEECER and Velvet and the results for the peak-calling method MACS. For each TF we present the known motif (if it exists in the database). For each method we show: (1) the predicted motif that best matches the known motif; (2) whether it matches the known motif in the JASPAR database; and (3) the motif rank in the DREME results for that method and the TOMTOM P value for the match with the known motif. We also include experiment-specific comments in the last column
Table 3. Co-factors identified in the top 10 motifs predicted by each method for the human validation dataset
| TF | SEECER | Velvet | Alignment |
|---|---|---|---|
| MAX | | | |
| HCFC1 | | | |
| CEBPB | | | |
| SREBF1 | | | |
| TCF7L2 | | | |
| STAT1 | | | |
| TAL1 | | | |
Proteins that are found to be interacting with the target and whose motifs are predicted in the top 10 by each method are shown
Fig. 3. Motif discovery results for the fly data. The table presents the results obtained for each of the TFs (rows) using genome-based motif discovery with the D. melanogaster genome and with the D. pseudoobscura genome, and using the de novo assembly pipeline with SEECER and Velvet. For each TF we present the known motif (if it exists in the database). For each method we show: (1) the predicted motif and how it matches the known motif (if at all); (2) whether it matches the known motif in the JASPAR database; and (3) the motif rank in the DREME results for that method and the TOMTOM P value for the match with the known motif