Jeremiah Suryatenggara1, Kol Jia Yong1,2, Danielle E Tenen3, Daniel G Tenen1,4, Mahmoud A Bassal1,4.
Abstract
Chromatin immunoprecipitation coupled with sequencing (ChIP-seq) is a technique used to identify protein-DNA interaction sites through antibody pull-down, sequencing and analysis, with enrichment 'peak' calling being the most critical analytical step. Benchmarking studies have consistently shown that peak callers have distinct selectivity and specificity characteristics that are not additive and seldom completely overlap in many scenarios, even after parameter optimization. We therefore developed ChIP-AP, an integrated ChIP-seq analysis pipeline utilizing four independent peak callers, which seamlessly processes raw sequencing files to final results. This approach enables (1) better gauging of peak confidence through detection by multiple algorithms, and (2) a more thorough survey of the binding landscape by capturing peaks not detected by individual callers. Final analysis results are then integrated into a single output table, enabling users to explore their data by applying selectivity and sensitivity thresholds that best address their biological questions, without needing any additional reprocessing. ChIP-AP therefore presents investigators with a more comprehensive coverage of the binding landscape without requiring additional wet-lab observations.
Keywords: ChIP-seq; automated analysis pipeline; histone mark; integrated analysis pipeline; multiple peak callers; transcription factor binding
Year: 2022 PMID: 34965583 PMCID: PMC8769893 DOI: 10.1093/bib/bbab537
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 4. Union peak set comprehensiveness and accuracy. (A) Fingerprint plot for aligned sequence files. Negligible separation between the SALL4 and control curves indicates poor enrichment in the SALL4 samples, an experiment typically considered as having failed. (B) UpSet plot describing the distribution of peaks observed by each peak caller. The left histogram represents the total number of called peaks per caller. The top histograms represent the size of the sub-sets in question. The connected circles represent the highlighted overlap. (C) Venn diagram showing the number of overlapping peaks between the SALL4 union ChIP-seq dataset and the Cut&Run dataset, both of which were generated from SNU-398 cells. (D) The motif sequence used for the directed motif search in the SALL4 ChIP-seq union set, which was found in 55.2% of the union set. (E) The STREME de novo motif search for the SALL4 union peak set identified the AT-rich binding motif as the second result. (F) The HOMER de novo motif search for the SALL4 union peak set identified the AT-rich binding motif as the third result. (G) The STREME identified motif (shown in D) was found centrally enriched in the union peak set as compared to background sequences, an accepted indicator of a valid motif result. (H) Venn diagram depicting the degree of overlap in the number of genes between the SALL4 ChIP-seq union peak set, the SALL4 Cut&Run peak set and the SALL4 KD DEG list.
Figure 1. ChIP-AP functionality and characteristics. An overview of ChIP-AP’s structure. For input, ChIP-AP requires a settings table (ST), either customized for a specific analysis or the default (DST), together with the location of the raw sequencing FASTQ files. Data can be entered through the wizard (Supplementary Figure 1A available online at http://bib.oxfordjournals.org/), the dashboard (Supplementary Figure 1B available online at http://bib.oxfordjournals.org/) or the command line. Once all data are entered, ChIP-AP constructs a folder hierarchy, each folder containing sub-scripts with defined input/output characteristics that perform a single task. The core stages of analysis include QC & Filtering, Alignment and Peak Calling. For output, ChIP-AP generates an integrated results table that can be viewed in any spreadsheet program such as Excel, from which users can select the union peak set, the consensus peak set or any sub-set in between as required. The output file lists, for each peak, all the peak callers that called it as well as the peak’s IDR. Optionally, users can run pathway and GO analysis, whose results (if run) are merged into the main text output file as well. Users can also optionally run motif enrichment analysis using HOMER and/or MEME-ChIP, which report their results in separate output folders for viewing.
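As an illustration of how a union, consensus or intermediate peak set could be selected from an integrated table that records which callers reported each peak, the sketch below filters peaks by the number of supporting callers. The peak IDs, caller assignments and the `select_peaks` helper are hypothetical and invented for this example; they do not reflect ChIP-AP's actual output column names or code.

```python
from typing import Dict, List, Set


def select_peaks(peak_callers: Dict[str, Set[str]], min_callers: int) -> List[str]:
    """Return the IDs of peaks reported by at least `min_callers` peak callers."""
    return sorted(
        peak_id
        for peak_id, callers in peak_callers.items()
        if len(callers) >= min_callers
    )


# Invented example data: peak ID -> set of callers that reported the peak.
peaks = {
    "peak_1": {"MACS2", "GEM", "HOMER", "Genrich"},
    "peak_2": {"MACS2", "HOMER"},
    "peak_3": {"Genrich"},
}
n_callers = 4  # assumed number of peak callers used in the run

union_set = select_peaks(peaks, 1)              # called by any caller
intermediate_set = select_peaks(peaks, 2)       # called by two or more callers
consensus_set = select_peaks(peaks, n_callers)  # called by every caller

print(union_set, intermediate_set, consensus_set)
```

Raising `min_callers` trades sensitivity for confidence, which mirrors the thresholding on the single output table described in the abstract.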
ChIP-AP default ST
| Program | Argument |
|---|---|
| fastqc1 | -q |
| clumpify | dedupe spany addcount qout=33 fixjunk |
| bbduk | ktrim=l hdist=2 |
| trimmomatic | LEADING:20 SLIDINGWINDOW:4:20 TRAILING:20 MINLEN:20 |
| fastqc2 | -q |
| bwa_mem | |
| samtools_view | -q 20 |
| plotfingerprint | |
| fastqc3 | -q |
| macs2_callpeak | |
| gem | -Xmx10G --k_min 8 --k_max 12 |
| sicer2 | |
| homer_findPeaks | |
| genrich | --adjustp -v |
| homer_mergePeaks | |
| homer_annotatePeaks | |
| fold_change_calculator | --normfactor uniquely_mapped |
| homer_findMotifsGenome | -size given -mask |
| meme_chip | -meme-nmotifs 25 |
Note: The default ST used by ChIP-AP when no user-supplied table is provided. The left column lists the constituent programs of ChIP-AP; the right column lists their optional parameters/flags. A copy of the ST used by any ChIP-AP run can always be found in the analysis output folder.
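A two-column table of this kind can be read into a program-to-arguments mapping with a few lines of code. The sketch below assumes a tab-separated text file with a `program`/`argument` header row; that file layout and the `parse_settings_table` helper are assumptions made for illustration, not a description of ChIP-AP's own parser.

```python
import shlex
from typing import Dict, List


def parse_settings_table(text: str) -> Dict[str, List[str]]:
    """Map each program name to its list of extra command-line arguments."""
    settings: Dict[str, List[str]] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        program, _, argument = line.partition("\t")
        program = program.strip()
        if program.lower() == "program":
            continue  # skip the assumed header row
        # shlex.split returns an empty list for an empty argument cell
        settings[program] = shlex.split(argument)
    return settings


# A few rows copied from the default ST above, in the assumed layout.
example = (
    "program\targument\n"
    "fastqc1\t-q\n"
    "bwa_mem\t\n"
    "gem\t-Xmx10G --k_min 8 --k_max 12\n"
)
st = parse_settings_table(example)
print(st["gem"])
```

Programs with an empty argument cell, such as `bwa_mem`, map to an empty list, meaning no flags beyond the program's own defaults.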
Figure 3. ChIP-AP peak set robustness. (A) Venn diagrams highlighting the number and percentage of overlapping consensus peak set peaks in inter-lab ChIP-seq datasets. Two datasets were obtained from ChIP-seq experiments on the same DNA-binding proteins of interest in the same cell lines but performed and uploaded to the ENCODE project database by different labs. Five pairs among the most common ChIP-seq transcription factor datasets were picked for this analysis. All datasets were processed with ChIP-AP in default settings and their resulting consensus peak sets were merged in a pair-wise manner. This panel shows Venn diagrams depicting the consensus peak set overlap within each of the five transcription factor dataset pairs (CTCF, JUND, MYC, REST and YY1). (B) Venn diagrams highlighting the number and percentage of overlapping union peak set peaks in inter-lab ChIP-seq datasets. Two datasets were obtained from ChIP-seq experiments on the same histone marks of interest in the same cell lines but performed and uploaded to the ENCODE project database by different labs. Five commonly studied histone mark datasets were picked for this analysis. All datasets were processed with ChIP-AP in default settings and their resulting consensus peak sets were merged in a pair-wise manner. This panel shows Venn diagrams depicting the union peak set overlap within each of the five histone mark dataset pairs (H3K4me1, H3K9me3, H3K27me3, H3K36me3 and H3K79me2). Above each set is the name of the lab from which the dataset was derived. (C) Heatmaps depicting the consensus peak set Jaccard Index (JI) score when one peak calling parameter was modified, compared to when the peak calling parameters are all at default values. Comparable parameters across different peak callers were modified and compared. The heatmap color range for the JI extends between 80 and 100% overlap to better show variability in differences. High JIs represent a high degree of overlap between the modified parameter peak set and the default consensus peak set.
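For reference, the Jaccard Index of two sets is the size of their intersection divided by the size of their union. The caption does not state whether the JI was computed over base pairs or over peak counts, so the sketch below is only one possible reading: it assumes a base-pair definition on half-open intervals of a single chromosome, and the `merge`, `total_length` and `jaccard` helpers are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def merge(intervals: List[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals into disjoint ones."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals: List[Interval]) -> int:
    """Sum of interval lengths; the input must already be disjoint."""
    return sum(end - start for start, end in intervals)


def jaccard(a: List[Interval], b: List[Interval]) -> float:
    """Base pairs shared by both peak sets divided by base pairs covered by either."""
    a_merged, b_merged = merge(a), merge(b)
    union = total_length(merge(a_merged + b_merged))
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    intersection = total_length(a_merged) + total_length(b_merged) - union
    return intersection / union


# Invented peak coordinates for a default run and a modified-parameter run.
default_peaks = [(100, 200), (300, 400)]
modified_peaks = [(150, 250), (300, 400)]
ji = jaccard(default_peaks, modified_peaks)
print(f"{ji:.2f}")
```

In this invented example the two sets share 150 bp out of 250 bp covered in total, giving a JI of 0.60; identical peak sets give 1.0.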
The SALL4 ChIP-seq was processed with ChIP-AP (v5.1) using hg38. Below is the settings table (ST) used.
| Program | Argument |
|---|---|
| fastqc1 | -q |
| clumpify | dedupe spany addcount qout=33 |
| bbduk | ktrim=l hdist=2 |
| trimmomatic | LEADING:20 SLIDINGWINDOW:4:20 TRAILING:20 MINLEN:20 |
| fastqc2 | -q |
| bwa_mem | |
| samtools_view | |
| plotfingerprint | -bs 50 --centerReads --ignoreDuplicates |
| fastqc3 | -q |
| macs2_callpeak | |
| gem | -Xmx30G --k_min 8 --k_max 12 |
| sicer2 | |
| homer_findPeaks | |
| genrich | --adjustp -v |
| homer_mergePeaks | |
| homer_annotatePeaks | |
| fold_change_calculator | --normfactor uniquely_mapped |
| homer_findMotifsGenome | -size given -mask |
| meme_chip | -neg background.fa -meme-nmotifs 25 target.fa |
A number of ENCODE datasets were downloaded and used; their experiment IDs are listed below. Data were downloaded from ENCODE in March 2021.
| Cell line | Transcription factor or histone mark | ChIP experiment IDs | Control experiment IDs |
|---|---|---|---|
| GM12878 | MAX | ENCFF000VXY ENCFF000VYA | ENCFF000VWF ENCFF000VWH |
| | SPI1 | ENCFF000OBS ENCFF000OB | |
| H1-hESC | CTCF (Bernstein) | ENCFF000AVU ENCFF000AVS | ENCFF036EGF ENCFF191QXK |
| | CTCF (Myers) | ENCFF000ONR ENCFF000OOF | ENCFF000OSP |
| | REST (Bernstein) | ENCFF794IGW ENCFF079NVQ | ENCFF036EGF ENCFF191QXK |
| | REST (Myers) | ENCFF000OQQ ENCFF000OQY | ENCFF000OSP |
| | H3K79me2 (Bernstein) | ENCFF000AYD ENCFF000AYF | ENCFF036EGF ENCFF191QXK |
| | H3K79me2 (Ren) | ENCFF580OHZ ENCFF519ZRJ | ENCFF835IOE ENCFF094GYG |
| HepG2 | ZBTB33 | ENCFF000PSP ENCFF000PSW | ENCFF000POC ENCFF000POH |
| | CEBPB | ENCFF000XQM ENCFF000XQN | |
| | JUND (Myers) | ENCFF000PKK ENCFF000PKR | |
| | JUND (Snyder) | ENCFF000XTQ ENCFF000XTR | ENCFF002ECQ ENCFF002ECU |
| K562 | MAFF | ENCFF000YSQ ENCFF000YSS | ENCFF002EFF ENCFF002EFD |
| | JUN | ENCFF000YJJ ENCFF000YJL | |
| | GATA1 | ENCFF000YND ENCFF000YNF | |
| | MYC (Iyer) | ENCFF000RWD ENCFF000RWE ENCFF000RWG | ENCFF000RWS |
| | MYC (Snyder) | ENCFF000YKO ENCFF000YKR | ENCFF002ECS ENCFF002ECW |
| | YY1 (Farnham) | ENCFF000ZEK ENCFF000ZEJ | ENCFF000VEK |
| | YY1 (Myers) | ENCFF000QKF ENCFF000QKI | ENCFF000QET ENCFF000QEU |
| | H3K4me1 (Bernstein) | ENCFF000BXX ENCFF000BYG | ENCFF000BWK ENCFF994FIB ENCFF283HQV ENCFF156ECZ ENCFF561WFK |
| | H3K4me1 (Farnham) | ENCFF000VDV ENCFF000VDU | ENCFF000VEK |
| | H3K27me3 (Bernstein) | ENCFF000BXP ENCFF000BXN | ENCFF000BWK ENCFF994FIB ENCFF283HQV ENCFF156ECZ ENCFF561WFK |
| | H3K27me3 (Farnham) | ENCFF000VDN ENCFF000VDP | ENCFF000VEK |
| | H3K36me3 (Bernstein) | ENCFF000BXR ENCFF000BXO | ENCFF000BWK ENCFF994FIB ENCFF283HQV ENCFF156ECZ ENCFF561WFK |
| | H3K36me3 (Stamatoyannopoulos) | ENCFF001FWV ENCFF001FWW | ENCFF001HTT |
| | MEIS2 | R1: ENCFF002EIU ENCFF002EIW; R2: ENCFF002EIV ENCFF002EIX | R1: ENCFF002EFF |
| | RUNX1 | R1: ENCFF002DOZ ENCFF002EGD; R2: ENCFF002EGE ENCFF002DPH | |
| | ATF4 | R1: ENCFF081USS ENCFF565KLI; R2: ENCFF069VNL ENCFF682IGK | |
| MCF7 | H3K9me3 (Bernstein) | ENCFF656BIN ENCFF517BLK ENCFF600JOS | ENCFF318ZNB |
| | H3K9me3 (Farnham) | ENCFF000VFE ENCFF000VFJ ENCFF000VFG | ENCFF000VHL |
All ENCODE datasets were processed with ChIP-AP (v5.1), using hg38 with the following ST.
| Program | Argument |
|---|---|
| fastqc1 | -q |
| clumpify | dedupe spany addcount qout=33 fixjunk |
| bbduk | ktrim=l hdist=2 |
| trimmomatic | LEADING:20 SLIDINGWINDOW:4:20 TRAILING:20 MINLEN:20 |
| fastqc2 | -q |
| bwa_mem | |
| samtools_view | |
| plotfingerprint | -bs 50 --centerReads --ignoreDuplicates |
| fastqc3 | -q |
| macs2_callpeak | |
| gem | -Xmx30G --k_min 8 --k_max 12 |
| sicer2 | |
| homer_findPeaks | |
| genrich | --adjustp -v |
| homer_mergePeaks | |
| homer_annotatePeaks | |
| fold_change_calculator | --normfactor uniquely_mapped |
| homer_findMotifsGenome | -size given -mask |
| meme_chip | -meme-nmotifs 25 |