Laura L Elo, Aleksi Kallio, Teemu D Laajala, R David Hawkins, Eija Korpelainen, Tero Aittokallio.
Abstract
We developed a computational procedure for optimizing the binding site detections in a given ChIP-seq experiment by maximizing their reproducibility under bootstrap sampling. We demonstrate how the procedure can improve the detection accuracies beyond those obtained with the default settings of popular peak calling software, or inform the user whether the peak detection results are compromised, circumventing the need for arbitrary re-iterative peak calling under varying parameter settings. The generic, open-source implementation is easily extendable to accommodate additional features and to promote its widespread application in future ChIP-seq studies. The peakROTS R-package and user guide are freely available at http://www.nic.funet.fi/pub/sci/molbio/peakROTS.
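The idea of selecting peak calling parameters by their reproducibility under bootstrap sampling can be illustrated with a minimal sketch. This is not the peakROTS implementation: the toy peak caller `call_peaks`, the overlap measure `top_k_overlap`, the parameter names `bin_size` and `min_count`, and the read positions are all hypothetical, and any normalization of the reproducibility score that the actual method may apply is omitted.

```python
import random


def call_peaks(reads, bin_size, min_count):
    """Toy stand-in for a peak caller (hypothetical, not MACS or PeakSeq).

    Counts read positions per fixed-width bin and returns the start
    coordinates of bins with at least min_count reads, best first.
    """
    counts = {}
    for pos in reads:
        start = (pos // bin_size) * bin_size
        counts[start] = counts.get(start, 0) + 1
    kept = [s for s, c in counts.items() if c >= min_count]
    return sorted(kept, key=lambda s: (-counts[s], s))


def top_k_overlap(peaks_a, peaks_b, k):
    """Fraction of the top-k peaks shared by two ranked peak lists."""
    return len(set(peaks_a[:k]) & set(peaks_b[:k])) / k


def reproducibility(reads, params, k, n_boot, rng):
    """Mean top-k overlap between pairs of bootstrap resamples of the reads."""
    total = 0.0
    for _ in range(n_boot):
        # Resample the reads with replacement, twice, and call peaks on each.
        sample_a = rng.choices(reads, k=len(reads))
        sample_b = rng.choices(reads, k=len(reads))
        total += top_k_overlap(
            call_peaks(sample_a, **params), call_peaks(sample_b, **params), k
        )
    return total / n_boot


def select_parameters(reads, grid, k=3, n_boot=20, seed=0):
    """Return (score, params) for the most reproducible parameter set in grid."""
    rng = random.Random(seed)
    scored = [(reproducibility(reads, p, k, n_boot, rng), p) for p in grid]
    return max(scored, key=lambda item: item[0])


# Made-up read positions: three enriched sites plus uniform background.
data_rng = random.Random(1)
reads = [data_rng.randint(s, s + 40) for s in (1000, 3000, 5000) for _ in range(60)]
reads += [data_rng.randint(0, 6000) for _ in range(100)]
grid = [{"bin_size": 50, "min_count": 5}, {"bin_size": 200, "min_count": 5}]
score, params = select_parameters(reads, grid)
```

In this sketch a low best score across the whole grid would play the role of the warning mentioned above, that the peak detection results may be compromised regardless of the parameter choice.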
Year: 2011; PMID: 22009681; PMCID: PMC3245948; DOI: 10.1093/nar/gkr839
Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971
The ChIP datasets used in the present study
| Transcription factor data | Study organism | ChIP-seq test data: Reference(s) | ChIP-seq test data: #ChIP tags | ChIP-seq test data: #Control tags | ChIP-qPCR validation data: Reference(s) | ChIP-qPCR validation data: #Positives | ChIP-qPCR validation data: #Negatives |
|---|---|---|---|---|---|---|---|
| STAT1_1 | human | ( | 2 430 958 | 4 513 107 | ( | 120 | 160 |
| STAT1_2 | human | ( | 8 184 450 | 4 513 107 | ( | 120 | 160 |
| NRSF_1 | human | ( | 1 697 991 | 2 319 153 | ( | 83 | 30 |
| NRSF_2 | human | ( | 5 349 088 | 10 162 151 | ( | 83 | 30 |
| FoxA1 | human | ( | 3 909 804 | 5 233 682 | ( | 26 | 12 |
| FoxA2 | mouse | ( | 2 813 847 | 4 428 744 | ( | 55 | 11 |
(a) From the STAT1 study, two replicate datasets were downloaded from the Gene Expression Omnibus (GEO accession GSE12782). STAT1_1: ChIP (rep1 lane A), control (rep1 lane C); STAT1_2: ChIP (rep2 lane B), control (rep1 lane C).
(b) The NRSF_1 data was downloaded from the Illumina website: http://www.illumina.com/downloads/Illumina_ChIPSeq_Demo_Data_Johnson_Science_2007.zip.
(c) The NRSF_2 data corresponding to the monoclonal antibody was downloaded from the QuEST website: http://mendel.stanford.edu/sidowlab/downloads/quest/.
(d) The FoxA1 data was downloaded from the MACS website: http://liulab.dfci.harvard.edu/MACS/.
(e) From the mouse FoxA2 study, the first replicate pair of the ChIP and control samples was used (kindly provided by Dr Geetu Tuteja).
Figure 1. Accuracy of the binding site detections in the STAT1_1 data set as a function of top peaks identified by the MACS algorithm. The accuracy of the peak calling parameter combinations was evaluated with respect to independent qPCR validations using the F-score (see ‘Materials and Methods’ section). The grey traces show the variability in the accuracy when different parameter combinations were used. The red and blue traces, respectively, indicate the accuracy of the parameter values learned by the reproducibility optimization procedure (ROTS), compared to the default settings of the software package. The inset shows the F-score levels at the cut-off point at which the increase in the accuracy stabilizes (the arrow). The green and black bars, respectively, indicate the highest and lowest F-scores among all the parameter combinations tested at the given cut-off point (the green and black points, respectively). The trace graphs were smoothed for display purposes. MACS detections in STAT1 were used here as an example; all the MACS and PeakSeq results are provided as Supplementary Figure S1.
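As an illustration of how an F-score can be computed for the top-k peaks against qPCR validations, here is a minimal sketch. It assumes the standard F-score (harmonic mean of precision and recall) and assumes that a validated site counts as detected when any top-k peak overlaps it; the exact definition is given in the ‘Materials and Methods’ section and may differ. The function names and the example regions are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def f_score_at_top_k(ranked_peaks, positives, negatives, k):
    """F-score of the k best peaks against validated sites.

    Assumed definitions: a qPCR-positive site overlapped by a top-k peak
    is a true positive, a qPCR-negative site overlapped by a top-k peak
    is a false positive, and a missed positive site is a false negative.
    """
    top = ranked_peaks[:k]
    tp = sum(any(overlaps(site, peak) for peak in top) for site in positives)
    fp = sum(any(overlaps(site, peak) for peak in top) for site in negatives)
    fn = len(positives) - tp
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up peaks (best first) and validated regions, for illustration only.
peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
positives = [("chr1", 150, 160), ("chr2", 100, 110)]
negatives = [("chr1", 550, 560)]
curve = [f_score_at_top_k(peaks, positives, negatives, k) for k in (1, 2, 3)]
```

Evaluating the function over increasing k gives one trace of the kind plotted in the figure; repeating it for every parameter combination would give the full set of traces.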
Figure 2. Accuracy of the binding site detections when using the ROTS or default parameter settings in MACS (A) and PeakSeq (B). The detection accuracy was evaluated using the scaled F-score (see ‘Materials and Methods’ section), which shows the practical difference between the two parameter combinations with respect to the highest and lowest possible accuracies that can be obtained, given the independent qPCR validations and the pre-defined parameter space. The scaled F-score was used here to compare the relative performance across the different data sets (FoxA2 data set is from a mouse system, while the others are human data sets); all the original F-scores (Default, ROTS, Min and Max) are available in Supplementary Figure S2. To summarize the detection accuracies across all six data sets in one histogram, the stable F-scores are shown, which correspond to the top-k levels at which the increase in the accuracy stabilized (indicated by arrows in Figure 1 and Supplementary Figure S1). The overall difference between the ROTS and default parameters was statistically significant across the data sets (paired t-test, P < 0.05).
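The scaling and the paired comparison described in this caption can be sketched as follows. The sketch assumes that the scaled F-score is a min-max rescaling between the lowest and highest F-scores in the parameter space; the exact formula is in the ‘Materials and Methods’ section and may differ. The function names and all numeric values are made up for illustration and are not results from the study.

```python
import math
from statistics import mean, stdev


def scaled_f_score(f, f_min, f_max):
    """Min-max scale f into [0, 1] relative to the worst and best F-scores."""
    if f_max == f_min:
        raise ValueError("f_max and f_min must differ")
    return (f - f_min) / (f_max - f_min)


def paired_t_statistic(x, y):
    """t statistic of a paired t-test on equal-length samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Made-up per-data-set values: (F_default, F_rots, F_min, F_max).
example = [
    (0.60, 0.70, 0.50, 0.80),
    (0.55, 0.65, 0.40, 0.70),
    (0.72, 0.75, 0.60, 0.80),
]
scaled_default = [scaled_f_score(d, lo, hi) for d, _, lo, hi in example]
scaled_rots = [scaled_f_score(r, lo, hi) for _, r, lo, hi in example]
t_stat = paired_t_statistic(scaled_rots, scaled_default)
```

The statistic has n - 1 degrees of freedom, where n is the number of data sets; a P-value would then come from the t distribution, for example via `scipy.stats.ttest_rel` if SciPy is available.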