Wei Sun, Michael J Buck, Mukund Patel, Ian J Davis.
Abstract
BACKGROUND: Microarray analysis of immunoprecipitated chromatin (ChIP-chip) has evolved from a novel technique to a standard approach for the systematic study of protein-DNA interactions. In ChIP-chip, sites of protein-DNA interactions are identified by signals from the hybridization of selected DNA to tiled oligomers and are graphically represented as peaks. Most existing methods were designed for the identification of relatively sparse peaks, in the presence of replicates.
Year: 2009 PMID: 19500407 PMCID: PMC2700807 DOI: 10.1186/1471-2105-10-173
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. GC-dependent normalization of one sample. Scatter plots of log intensities of Cy3 and Cy5 signals (from array GSM254806), grouped by the number of GC base pairs in each 50-mer probe: 15 (a), 20 (b) or 30 (c). Density plots of raw data (d), MA2C (robust, C = 2) normalized data (e) and Lowess normalized data (f). Three curves are overlaid on panels (a)–(c): the blue line depicts the baseline model of MA2C normalization, the red line is fitted by median regression, and the yellow line is the Lowess fit. In panels (d)–(f), vertical lines indicate the mode and the median of all probes. In the raw and MA2C normalized data, the mode is larger than the median (d, e), indicating a heavier tail on the left. This unexpected feature usually indicates a problematic array or insufficient normalization.
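The mode-versus-median check described in this caption can be sketched in a few lines. This is only an illustration, not part of MA2C or the authors' software: the function names `histogram_mode` and `tail_check` and the bin width of 0.1 are assumptions, and the mode is approximated crudely by the centre of the fullest histogram bin.

```python
import math
from collections import Counter
from statistics import median


def histogram_mode(values, bin_width=0.1):
    """Crude mode estimate: centre of the most populated histogram bin."""
    bins = Counter(math.floor(v / bin_width) for v in values)
    # Highest count wins; ties go to the lowest bin index.
    best_bin, _ = max(bins.items(), key=lambda kv: (kv[1], -kv[0]))
    return (best_bin + 0.5) * bin_width


def tail_check(log_ratios, bin_width=0.1):
    """Return (mode, median, side of the heavier tail)."""
    med = median(log_ratios)
    mode = histogram_mode(log_ratios, bin_width)
    if mode > med:
        heavier = "left"    # the situation flagged in the caption
    elif mode < med:
        heavier = "right"
    else:
        heavier = "none"
    return mode, med, heavier
```

Applied to the log ratios of one array, a result of `"left"` would correspond to the pattern the caption reports for the raw and MA2C normalized data.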
Figure 2. Dissection of the mixture distribution for probe-level and window-level data. Mixture distributions for the original spike-in data (a, b), the first augmented data with ~4.3% spike-ins (c, d) and the second augmented data with ~10.2% spike-ins (e, f).
Comparison of different methods for the original data set
| GSM254930 | 108 | 0.28 | 503 | 0.84 | 241 | 0.66 | 85 | 0.08 | 84 | 0.07 | 111 | 0.20 |
| GSM254971 | 100 | 0.28 | 113 | 0.37 | 227 | 0.64 | 86 | 0.09 | 85 | 0.09 | N/A | N/A |
| GSM254972 | 98 | 0.29 | 195 | 0.61 | 178 | 0.53 | 84 | 0.07 | 88 | 0.08 | N/A | N/A |
| GSM254973 | 98 | 0.24 | 92 | 0.23 | 146 | 0.45 | 71 | 0.07 | 73 | 0.07 | N/A | N/A |
| GSM254805 | 66 | 0.20 | 153 | 0.56 | 116 | 0.43 | 81 | 0.22 | 52 | 0.09 | N/A | N/A |
| GSM254806 | 89 | 0.19 | 184 | 0.61 | 85 | 0.19 | 236 | 0.66 | 143 | 0.42 | 89 | 0.18 |
| GSM254807 | 97 | 0.24 | 102 | 0.26 | 100 | 0.21 | 76 | 0.08 | 91 | 0.13 | 123 | 0.32 |
The first four samples (GSM254930, GSM254971, GSM254972, and GSM254973) were spiked with unamplified DNA, while the last three samples (GSM254805, GSM254806, and GSM254807) were spiked with amplified DNA. Of the 385,149 probes in total, about 820 (~0.2%) are from spike-in regions. HGMM results are unavailable for some arrays (N/A) because the HGMM function failed.
Comparison of different methods for the simulated data set with 2,100/2,058 spike-in regions for unamplified/amplified samples, respectively
| GSM254930 | 2159 | 0.23 | 1694 | 0.17 | 2219 | 0.28 | 1475 | 0.004 | 1619 | 0.004 | 1605 | 0.03 |
| GSM254971 | 1965 | 0.21 | 2033 | 0.22 | 2187 | 0.30 | 1395 | 0.003 | 1578 | 0.006 | 1577 | 0.03 |
| GSM254972 | 2015 | 0.19 | 2226 | 0.27 | 2151 | 0.27 | 1553 | 0.003 | 1713 | 0.009 | 1553 | 0.03 |
| GSM254973 | 1575 | 0.14 | 1929 | 0.19 | 2094 | 0.28 | 1334 | 0.004 | 1504 | 0.008 | 1520 | 0.02 |
| GSM254805 | 1982 | 0.30 | 1764 | 0.24 | 1671 | 0.27 | 1034 | 0.013 | 1140 | 0.019 | 939 | 0.03 |
| GSM254806 | 2180 | 0.27 | 2344 | 0.33 | 1910 | 0.23 | 1404 | 0.008 | 1687 | 0.027 | 1372 | 0.03 |
| GSM254807 | 1495 | 0.14 | 1926 | 0.18 | 2034 | 0.27 | 1486 | 0.003 | 1655 | 0.009 | 1519 | 0.03 |
See main text for the simulation methods. Approximately 4.3% of the probes are from spike-in regions.
Comparison of different methods for the simulated data set with 5,100/4,998 spike-in regions for unamplified/amplified samples, respectively
| GSM254930 | 4359 | 0.16 | 5753 | 0.28 | 4829 | 0.19 | 2775 | 0.001 | 3872 | 0.003 | 3707 | 0.02 |
| GSM254971 | 4969 | 0.23 | 5110 | 0.23 | 4697 | 0.22 | 2758 | 0.001 | 3682 | 0.005 | 3615 | 0.02 |
| GSM254972 | 5135 | 0.22 | 3957 | 0.19 | 4738 | 0.19 | 2978 | 0.001 | 4114 | 0.011 | 3558 | 0.03 |
| GSM254973 | 4714 | 0.18 | 4795 | 0.20 | 4560 | 0.20 | 2695 | 0.001 | 3534 | 0.003 | 3493 | 0.02 |
| GSM254805 | 4537 | 0.25 | 4784 | 0.27 | 3860 | 0.22 | 1946 | 0.003 | 2744 | 0.022 | 2237 | 0.03 |
| GSM254806 | 4878 | 0.21 | 5826 | 0.32 | 4284 | 0.17 | 2672 | 0.0004 | 3924 | 0.022 | 3085 | 0.03 |
| GSM254807 | 4957 | 0.21 | 5157 | 0.24 | 4569 | 0.20 | 2487 | 0.0004 | 3802 | 0.003 | 3508 | 0.02 |
See main text for the simulation methods. About 10.2% of the probes are from spike-in regions.
Figure 3. Comparison of Mixer and MA2C by ROC-like curves. Peaks were detected by Mixer (with Lowess normalization) or MA2C (with MA2C normalization). Some curves appear to be truncated on the left side because we restrict the cutoff to FDR or lfdr values smaller than 0.5. A larger cutoff is rarely used in practice.
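A rough sketch of how such a truncated ROC-like curve could be tabulated on spike-in data is given below. It assumes each detected peak carries the error value reported by the method (FDR or lfdr) and a flag saying whether it falls in a known spike-in region. The function name `roc_like_points`, the input layout, and the choice of true-positive and false-positive counts as the two axes are assumptions for illustration, not the authors' procedure.

```python
def roc_like_points(peaks, max_cutoff=0.5):
    """Tabulate (cutoff, true_positives, false_positives) as the cutoff loosens.

    peaks: iterable of (reported_error, is_true), where reported_error is the
    FDR or lfdr a method assigns to the peak and is_true marks a spike-in hit.
    Peaks whose reported error is not below max_cutoff are ignored, which is
    what truncates the curve.
    """
    points = []
    tp = fp = 0
    for reported_error, is_true in sorted(peaks, key=lambda p: p[0]):
        if reported_error >= max_cutoff:
            break
        if is_true:
            tp += 1
        else:
            fp += 1
        points.append((reported_error, tp, fp))
    return points
```

Because no cutoff at or above 0.5 is evaluated, a method that reports few peaks below that threshold yields a curve covering only part of the plotting range.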
Comparison of the peaks identified by Mixer, MA2C, and TileMap with sites identified by ChIP-seq.
| 5,000 | 2974 (59.5%) | 0 | 2421 (48.4%) | 0 | 2090 (41.8%) | ≤ 6 × 10⁻⁶ |
| 10,000 | 5909 (59.1%) | ≤ 2.4 × 10⁻⁴ | 4840 (48.4%) | 0 | 4049 (40.5%) | ≤ 7 × 10⁻⁶ |
| 20,000 | 8931 (44.7%) | ≤ 0.046 | 8217 (41.1%) | ≤ 0.032 | 7270 (36.4%) | ≤ 3.1 × 10⁻⁵ |
In each cell, the number of overlapped peak regions and the percentage among the top k peak regions are shown, where k = 5,000, 10,000, or 20,000.
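The overlap count and percentage reported in each cell could be computed along the following lines. This is a minimal sketch under stated assumptions: regions are `(chromosome, start, end)` tuples with half-open coordinates, a peak region counts as overlapped if it shares at least one base with any ChIP-seq site, and the names `count_overlaps` and `overlap_summary` are hypothetical. The overlap criterion actually used in the study is not stated here.

```python
from collections import defaultdict


def count_overlaps(peaks, sites):
    """Count peak regions that share at least one base with any site."""
    by_chrom = defaultdict(list)
    for chrom, start, end in sites:
        by_chrom[chrom].append((start, end))
    hits = 0
    for chrom, start, end in peaks:
        # Half-open intervals [start, end) overlap when each starts before
        # the other ends.
        if any(s < end and start < e for s, e in by_chrom.get(chrom, [])):
            hits += 1
    return hits


def overlap_summary(ranked_peaks, sites, k):
    """Return (overlap count, percentage) for the top k ranked peak regions."""
    top = ranked_peaks[:k]
    if not top:
        return 0, 0.0
    hits = count_overlaps(top, sites)
    return hits, 100.0 * hits / len(top)
```

For instance, 2974 overlapped regions among the top 5,000 corresponds to the 59.5% shown in the first row.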