| Literature DB >> 25128017 |
Huihuang Yan, Jared Evans, Mike Kalmbach, Raymond Moore, Sumit Middha, Stanislav Luban, Liguo Wang, Aditya Bhagwate, Ying Li, Zhifu Sun, Xianfeng Chen, Jean-Pierre A Kocher1.
Abstract
BACKGROUND: Chromatin immunoprecipitation (ChIP) followed by next-generation sequencing (ChIP-Seq) has been widely used to identify genomic loci of transcription factor (TF) binding and histone modifications. ChIP-Seq data analysis involves multiple steps from read mapping and peak calling to data integration and interpretation. It remains challenging and time-consuming to process large amounts of ChIP-Seq data derived from different antibodies or experimental designs using the same approach. To address this challenge, there is a need for a comprehensive analysis pipeline with flexible settings to accelerate the utilization of this powerful technology in epigenetics research.Entities:
Mesh:
Substances:
Year: 2014 PMID: 25128017 PMCID: PMC4152589 DOI: 10.1186/1471-2105-15-280
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1Flowchart of HiChIP. It contains five key functions (see implementation), developed by using both public tools and in-house scripts.
Figure 2Irreproducible discovery rate between replicates. Left panel: number of significant peaks versus number of shared peaks called from both replicates. Right panel: number of significant peaks at different IDR cutoffs. 1: between biological replicates; 2: between pseudoreplicates from biological replicate 1, generated by randomly splitting the mapped reads into two equal-sized partitions; 3: between pseudoreplicates from biological replicate 2; 4: between pseudoreplicates from mapped reads merged from the two biological replicates. MACS called peaks at a p value cutoff of 1e-3, instead of the default 1e-5.
Number of NFKB peaks from BWA and Novoalign mapped reads
| Type | GM12878_NFKB_IP_rep1 | GM12878_NFKB_IP_rep2 | GM12891_NFKB_IP_rep1 | GM12891_NFKB_IP_rep2 | ||||
|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
|
|
| |
| Overlapped peaks | 1512 | 1513 | 9942 | 9960 | 2257 | 2264 | 6082 | 6082 |
| Unique peaks w/ motif | 176 | 18 | 409 | 90 | 88 | 19 | 36 | 30 |
| Unique peaks w/o motif | 134 | 28 | 590 | 249 | 442 | 74 | 21 | 7 |
| Total peaks | 1822 | 1559 | 10941 | 10299 | 2787 | 2357 | 6139 | 6119 |
Peaks were identified using MACS (−f BAM -g hs --keep-dup 1 -q 0.01) and motif finding by MEME suite (−dna -mod zoops -nmotifs 5 -minw 10 -maxw 20 -maxsize 999999999 -revcomp). Shown were the number of overlapped peaks between BWA mapped reads (A) and Novoalign mapped reads (B), and number of aligner-specific peaks with or without NFKB binding motif. ChIP-Seq from each cell line had two biological replicates (rep1 and rep2).
BWA and Novoalign mapping of paired-end reads
| Library | Type | Total (million) | BWA | Novoalign | ||||
|---|---|---|---|---|---|---|---|---|
|
|
|
|
|
|
| |||
| GSM850526 | IP | 8 | 86 | 5.09 | 8.91 | 92.71 | 4.75 | 2.54 |
| GSM850528 | Control | 7.29 | 82.32 | 7.09 | 10.59 | 89.68 | 6.62 | 3.7 |
The 45-bp paired-end ChIP-Seq data were from TF RUNX1. Raw reads were mapped to hg19 by both BWA and Novoalign. BWA parameters are: bwa aln -o 1 -l 32 -t 4 -k 2 and bwa sampe -n 10 -a 500 -o 10000 -N 10 -s. Novoalign parameters are: novoalign -i PE 250,30 -r Random --hdrhd off -c 1 -d reference.nix -F STDFQ -f end1.fastq end2.fastq -o SAM, where reference.nix is the reference sequence index file. A: percentage of pairs with at least one uniquely mapped end; B: percentage of pairs with multiple mapping locations; C: others, including unmapped pairs, pairs with only one mapped end and improperly mapped pairs with small insertion size or wrong orientation.
RUNX1 peaks from BWA and Novoalign mapped reads
| Type | BWA | Novoalign |
|---|---|---|
| Overlapped peaks | 818 | 809 |
| Unique peaks w/ motif | 85 | 86 |
| Unique peaks w/o motif | 28 | 138 |
| Total peaks | 931 | 1033 |
Peaks were called by MACS (−f BAM -g hs --keep-dup all -q 0.01), after removal of duplicates by Picard. Motif was identified by MEME suite.
ER ChIP-Seq and chip-chip peaks in MCF-7 cell line
| ChIP-Seq | Chip-chip | ||||
|---|---|---|---|---|---|
| Library | Total peaks | Total peaks | Overlapped peaks | Unique peaks | Overlap (%) |
| IP_1 | 69390 | 4621 | 4492 | 129 | 97.2 |
| IP_2 | 46688 | 4621 | 4391 | 230 | 95.0 |
| IP_3 | 57657 | 4621 | 4450 | 171 | 96.3 |
A list of 4,621 highly confident (p value < =1e-5) chip-chip peaks was extracted from the file “ER_binding_p-value.xls” downloaded from http://research4.dfci.harvard.edu/brownlab//datasets/index.php?dir=ER_MCF7_whole_human_genome/. ChIP-Seq peaks were identified by MACS at the q value cutoff of 0.01. Both ChIP-Seq and chip-chip analysis used hg18 reference sequence.
Number of ER peaks from MCF-7 cell line
| Type | IP_1 vs. control | IP_2 vs. control | IP_3 vs. control | |||
|---|---|---|---|---|---|---|
| A | B | A | B | A | B | |
| Unique reads (million) | 16.9 | 22.27 | 21.08 | 22.66 | 25.36 | 29.15 |
| Overlapped peaks | 62184 | 62407 | 41455 | 41551 | 54309 | 54354 |
| Unique peaks | 7206 | 1689 | 5233 | 238 | 3348 | 857 |
| Unique peaks w/ motif | 2552 | 670 | 1359 | 35 | 1121 | 362 |
| Unique peaks w/o motif | 4654 | 1019 | 3874 | 203 | 2227 | 495 |
| Total peaks | 69390 | 64096 | 46688 | 41789 | 57657 | 55211 |
Peaks were called by MACS (−f BAM -g hs --keep-dup all -q 0.01) from BWA-mapped reads (to hg18). Motif was identified using MEME as described above. A: number of ER peaks identified by this pipeline, using duplicate-filtered reads (by Picard); B: number of peaks from reads without duplicate removal. IP_1 to IP_3 are three replicates, IP_1: GSM798423, IP_2: GSM798424, IP_3: GSM798425; control (input): GSM798440.
Duplicate level in three ER ChIP-Seq libraries
| Library | Type* | Size (Mb) | Non-duplicates (M) | Duplicates (M) | Sum (M) | Duplicates (%) | Portion (%)** |
|---|---|---|---|---|---|---|---|
| IP_1 | All peaks | 17.08 | 4.34 | 3.24 | 7.58 | 42.73 | 60.23 |
| IP_1 | Top peaks | 3.33 | 1.92 | 2.54 | 4.46 | 57.03 | 47.3 |
| IP_1 | Non-peak | 3049.89 | 12.27 | 2.08 | 14.35 | 14.51 | 38.71 |
| IP_1 | Total | 3080.44 | 16.9 | 5.38 | 22.27 | 24.13 | 100 |
| IP_2 | All peaks | 12.54 | 2.94 | 1.1 | 4.04 | 27.3 | 69.79 |
| IP_2 | Top peaks | 2.31 | 1.2 | 0.9 | 2.11 | 42.84 | 57.12 |
| IP_2 | Non-peak | 3058.77 | 17.9 | 0.47 | 18.36 | 2.54 | 29.55 |
| IP_2 | Total | 3080.44 | 21.08 | 1.58 | 22.66 | 6.98 | 100 |
| IP_3 | All peaks | 15.94 | 4.87 | 2.69 | 7.56 | 35.53 | 70.87 |
| IP_3 | Top peaks | 3.02 | 1.97 | 2.05 | 4.02 | 51.04 | 54.1 |
| IP_3 | Non-peak | 3053.24 | 20.11 | 1.08 | 21.19 | 5.09 | 28.46 |
| IP_3 | Total | 3080.44 | 25.36 | 3.79 | 29.15 | 13.01 | 100 |
*The non-peak regions are these after excluding regions from −100 bp to +100 bp of all peaks; the top peaks refer to the top 10% of the peaks with the smallest p values.
**Percentage over total duplicates in the library.
Figure 3Peak value versus level of duplicate reads in three ER ChIP-Seq libraries. Peaks were ranked based on –log10 (p value) in descending order and split into 10 groups (x-axis) of equal size. The y-axis represents the level of duplicate reads. The horizontal dashed lines indicate the level of duplication from non-peak regions, which are regions not covered by peaks +/−100 bp.
Figure 4Summary of peak calling for H3K27me3 ChIP-Seq. H3K27me3 ChIP-Seq data were downloaded from the ENCODE project (see implementation) and duplicates were filtered out by Picard. Each cell line has two biological replicates (r1 and r2). SICER software was used to call peaks using both duplicate-filtered reads and reads without duplicate removal.
Figure 5Box plot of FDR from shared and unique H3K27me3 peaks. Peaks were called from duplicate-filtered reads and also from reads without duplicate filtering. Peaks shared between the two methods were randomly sampled to generate the same number of peaks as the unique peaks and used in the plot. The y-axis represents –log10 (FDR). The two biological replicates are indicated by r1 and r2, respectively.
Figure 6Peak FDR versus level of duplicate reads in six H3K27me3 ChIP-Seq libraries. Peaks were ranked based on –log10 (FDR) in descending order and split into 10 groups (x-axis) of equal size. The y-axis represents the level of duplicate reads. The horizontal dashed lines indicate the level of duplication from non-peak regions, as defined in Figure 3. The two biological replicates are indicated by r1 and r2, respectively.