Justin G Chitpin, Aseel Awdeh, Theodore J Perkins.
Abstract
MOTIVATION: Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, so the true significance or reliability of peak calls remains unknown.
Year: 2019 PMID: 30824903 PMCID: PMC6761936 DOI: 10.1093/bioinformatics/btz150
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. MACS, SICER and diffReps peak callers produce biased P-values. (A) Visualization of part of a simulated ChIP-seq read dataset, with 500 bp foreground regions every 20–25 kb, where read density is greater. Control data was generated similarly, with matching foreground regions, so a null hypothesis of no enrichment in ChIP-seq versus control is true for every possible genomic region. (B) Peaks called by the three algorithms have P-values that are not uniformly distributed between zero and one, as should be the case for this null hypothesis data if P-values were well calibrated. Empirical CDFs on linear (C) and log (D) axes also show the discrepancy from the uniform distribution.
Fig. 2. RECAP recalibrates peak callers' P-values to a near-uniform distribution. (A) Log-log plot of the empirical CDF of recalibrated P-values for MACS, SICER and diffReps, on the simulated, 10% foreground, null-hypothesis data. (B) Similar plot for a representative ChIP-seq replicate pair from ENCODE. (C) Reductions in deviation statistic, which measures difference from the uniform distribution, for the RECAP-recalibrated P-values for several types of simulated data (10 datasets each) and 50 matched pairs of ENCODE replicate ChIP-seq data (10 pairs for each of five cell types).
Fig. 3. FDR and reproducibility analyses based on recalibrated P-values. (A) Empirical CDF of recalibrated P-values for each algorithm on simulated non-null data. (B) Empirical CDFs of MACS’s P-values on ENCODE datasets. (C) Deviation statistics before and after recalibration. (D) Raw versus recalibrated P-values for MACS on ENCODE data. (E) FDR estimates based on recalibrated versus raw P-values for MACS on ENCODE data. (F) Peak reproducibility rates versus FDR estimates based on raw or recalibrated P-values.
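The double use of data described in the abstract, and the non-uniform null P-values shown in Fig. 1, can be illustrated with a toy simulation. The sketch below is not the RECAP method and does not reproduce any of the peak callers. It assumes a much simpler setup: Poisson read counts in fixed windows, identical rates for ChIP and control (so the null hypothesis holds everywhere), and a one-sided binomial test. All function names and parameters (`simulate`, `n_windows`, `lam`, and so on) are hypothetical and chosen for illustration only.

```python
import math
import random


def poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def binom_upper_tail(successes, trials):
    """One-sided P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    if trials == 0:
        return 1.0
    tail = sum(math.comb(trials, i) for i in range(successes, trials + 1))
    return tail / 2 ** trials


def simulate(n_datasets=1000, n_windows=50, lam=10.0, seed=0):
    """Simulate null data and test one window per dataset in two ways.

    fixed:    the window is chosen before looking at the data.
    selected: the window is chosen because it looks most enriched,
              then tested on the same counts (the double use of data).
    """
    rng = random.Random(seed)
    fixed, selected = [], []
    for _ in range(n_datasets):
        # ChIP and control share the same rate, so no window is truly enriched.
        chip = [poisson(rng, lam) for _ in range(n_windows)]
        ctrl = [poisson(rng, lam) for _ in range(n_windows)]
        fixed.append(binom_upper_tail(chip[0], chip[0] + ctrl[0]))
        # Pick the window with the largest ChIP-minus-control difference.
        best = max(range(n_windows), key=lambda i: chip[i] - ctrl[i])
        selected.append(binom_upper_tail(chip[best], chip[best] + ctrl[best]))
    return fixed, selected


def fraction_below(pvals, alpha=0.05):
    """Fraction of P-values at or below alpha (near alpha if calibrated)."""
    return sum(p <= alpha for p in pvals) / len(pvals)


if __name__ == "__main__":
    fixed, selected = simulate()
    print("pre-chosen window, fraction with P <= 0.05:", fraction_below(fixed))
    print("selected window,   fraction with P <= 0.05:", fraction_below(selected))
```

In this toy setting, P-values for the pre-chosen window stay at or below the nominal rate, while P-values for the data-selected window fall below 0.05 far more often than 5% of the time, even though no region is truly enriched. This is the kind of miscalibration that recalibration is meant to correct.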