Zhifu Sun, High Seng Chai, Yanhong Wu, Wendy M White, Krishna V Donkena, Christopher J Klein, Vesna D Garovic, Terry M Therneau, Jean-Pierre A Kocher.
Abstract
BACKGROUND: Genome-wide methylation profiling has led to more comprehensive insights into gene regulation mechanisms and potential therapeutic targets. The Illumina Human Methylation BeadChip is one of the most commonly used genome-wide methylation platforms. As in other microarray experiments, methylation data are susceptible to various technical artifacts, particularly batch effects. To date, little attention has been given to normalization and batch effect correction for this kind of data.
Year: 2011 PMID: 22171553 PMCID: PMC3265417 DOI: 10.1186/1755-8794-4-84
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Figure 1. Dataset 1 before and after normalization and batch effect correction. A: PCA plot for all 93 samples using all 27,578 CpGs. Different colors indicate different batches. Nine pairs of technical replicates are marked R1 to R9. The samples on Chip12 (circled with a dashed line) tend to separate from the other samples. B: Density plot of samples from Chip12 and Chip26, showing minor distribution biases between the two chips. C: PCA plot of the 24 samples from Chip22 and Chip26 using all 27,578 CpGs. The two samples marked with a crossbar are technical replicates. D: Box plot of pair-wise CpG errors between the 9 pairs of technical replicates for unnormalized average β (red), QNβ (green), lumi (blue), and ABnorm (cyan). The unnormalized data have wider interquartile ranges and medians shifted away from the zero line. All normalized data have narrower interquartile ranges, with medians close to the zero line. E: Error means (lower panel) and average absolute deviations (upper panel) of the 9 pairs of technical replicates before (red) and after the three normalizations. Unnormalized data have the largest average absolute deviation for every replicate pair and a shifted mean for most pairs. All normalized data show reduced average absolute deviations. F: Error means (lower panel) and average absolute deviations (upper panel) of the 9 pairs of technical replicates before and after the three normalizations plus EB correction. The normalized, EB-corrected data have almost identical error means and average absolute deviations to the data that were normalized only.
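The replicate-pair statistics in panels D to F can be illustrated with a minimal sketch. It assumes that the "error" is the per-CpG difference in β between the two replicates and that the "average absolute deviation" is measured around the error mean; the legend does not spell out either definition. The function name and the β values below are hypothetical.

```python
from statistics import mean


def replicate_error_summary(beta_a, beta_b):
    """Summarize per-CpG differences between two technical replicates.

    beta_a, beta_b: beta values for the same CpGs, in the same order.
    Returns (error mean, average absolute deviation around that mean).
    """
    if len(beta_a) != len(beta_b) or not beta_a:
        raise ValueError("replicates must cover the same non-empty set of CpGs")
    # Per-CpG error: difference between the two replicate measurements.
    errors = [a - b for a, b in zip(beta_a, beta_b)]
    error_mean = mean(errors)
    # Assumption: deviations are taken around the error mean, not around zero.
    avg_abs_dev = mean(abs(e - error_mean) for e in errors)
    return error_mean, avg_abs_dev


# Hypothetical beta values for four CpGs measured on two replicate arrays.
rep1 = [0.10, 0.50, 0.90, 0.30]
rep2 = [0.12, 0.46, 0.93, 0.33]
error_mean, avg_abs_dev = replicate_error_summary(rep1, rep2)
print(f"error mean = {error_mean:.3f}, average absolute deviation = {avg_abs_dev:.3f}")
```

An error mean near zero suggests no systematic shift between replicates, while a smaller average absolute deviation suggests tighter agreement across CpGs.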
Figure 2. Dataset 2 before and after normalization and batch correction. A: Density plot of average β values for the samples on two chips. All samples on Chip12 shift to the left, with more CpGs at lower methylation values. B: PCA plot (first two components) for the 24 samples using 26,486 CpGs after excluding CpGs on sex chromosomes; samples on each chip cluster closely, with the first component explaining 50.3% of the variance. C: Density plot of lumi-normalized data. The distribution bias is greatly reduced, but a significant portion remains. D: PCA plot of lumi-normalized data, using 26,486 CpGs after excluding CpGs on sex chromosomes, still shows clear sample separation by batch. E: Density plot of average β after ABnorm. The distribution bias has been successfully removed. F: PCA plot of ABnorm data using 26,486 CpGs shows that clear batch effects remain. G: Profiles of 20 selected CpGs that are associated with the batch effects after normalization. X-axis: samples ordered by chip (11 or 12). Y-axis: methylation average β. Each line represents one CpG across samples. These CpGs are consistently higher or lower on one chip than on the other. H: Profiles of the same 20 CpGs as in G after normalization and EB correction. The systematic differences between the two chips have been removed.
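The separation of samples by chip along a principal component can be checked with a rank-based test; the table footnote names the Wilcoxon test for this purpose. The following is a minimal sketch of a two-sided Wilcoxon rank-sum test using the normal approximation, without tie or continuity correction. It is an illustration only, not the authors' implementation: the function names and the component scores are hypothetical, and for small groups an exact test (as offered by statistical packages) would be preferable.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(scores_a, scores_b):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (U statistic of the first group, approximate p-value).
    """
    n1, n2 = len(scores_a), len(scores_b)
    ranks = average_ranks(list(scores_a) + list(scores_b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u1 - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


# Hypothetical first-component scores for 12 samples on each of two chips.
chip_a = [-5.1, -4.8, -4.6, -4.3, -4.0, -3.9, -3.5, -3.2, -3.0, -2.8, -2.5, -2.1]
chip_b = [2.0, 2.3, 2.7, 3.1, 3.3, 3.6, 3.8, 4.1, 4.4, 4.7, 5.0, 5.2]
u_stat, p_value = rank_sum_test(chip_a, chip_b)
print(f"U = {u_stat}, p = {p_value:.2g}, batch-associated: {p_value < 0.01}")
```

Repeating the test for each of the top components and keeping those with p < 0.01 would give a list comparable to the "PCs associated with batch" rows of the table, under the assumptions stated above.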
Figure 3. Dataset 3 before and after normalization and batch effect correction. A: Box plot of raw average β for the 24 samples on two chips. The medians of the two chips are similar, but the 3rd quartile values of Chip36 are much lower than those of Chip54. B: Box plot for the 24 samples on two chips after lumi normalization. C: Density plot of average β for the 24 samples before normalization, colored by batch. The distribution clearly differs between the chips. D: Density plot of lumi-normalized average β for the 24 samples, showing that a large portion of the batch effects is not corrected. E: Unsupervised clustering using all 27,578 CpGs before normalization and EB correction shows clear separation of samples by chip (Chip54 or Chip36); samples from the same tissue type tend to cluster within the same batch (* for normal prostates and others for tumors). F: Unsupervised clustering using all 27,578 CpGs after normalization and EB correction shows that the separation between batches is removed; samples from the same tissue type cluster closely (* for normal prostates and others for tumors). G: Methylation profiles of 20 selected CpGs that are significantly associated with the batch effects after normalization, showing dramatic differences between the two chips. The letters "T" and "N" on the x-axis denote tumor and normal samples. H: Methylation profiles of the same 20 CpGs as in G after the additional EB correction. The systematic biases are successfully removed.
Statistical measures of batch effects and performance evaluation of normalization and batch correction
| Dataset | Statistical measure | Raw β | QNβ | Lumi | ABnorm | QNβ+ | Lumi+ | ABnorm+ |
|---|---|---|---|---|---|---|---|---|
| | Number (%) of CpGs associated with batch at p < 0.01 | 17,458 | 6,466 | 8,478 | 6,926 | 12 | 25 | 23 |
| | PCs associated with batch (% variance explained)* | 1 | 1 | 1 | 1 | None | None | None |
| | Number (%) of differentially methylated CpGs between case and control at p < 0.01 | 345 | 759 | 714 | 763 | 1,155 | 1,146 | 1,229 |
| | Number (%) of CpGs associated with batch at p < 0.01 | 13,881 | 10,300 | 12,668 | 9,694 | 2 | 6 | 8 |
| | PCs associated with batch (% variance explained)* | 1 | 1 | 1 | 1 | None | None | None |
| | Number (%) of differentially methylated CpGs between cancer and normal at p < 0.01 | 794 | 1,877 | 1,131 | 1,635 | 2,799 | 2,400 | 2,289 |
Raw β: raw average β without any correction; QNβ: quantile normalization of average β values; lumi: two-step quantile normalization of probe signals as implemented in the R package "lumi"; ABnorm: quantile normalization of the A and B signals separately; EB: Empirical Bayes batch correction; +: the normalization followed by EB correction. * Principal components (PCs), among the top 10, that are significantly associated with batch effects (p < 0.01, Wilcoxon test), and the percentage of variance each PC explains.
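The difference between QNβ and ABnorm can be made concrete with a minimal sketch of quantile normalization. It assumes that each sample is forced onto a shared reference distribution (the mean of the k-th smallest values across samples) and that ties are broken by input order. For ABnorm, the recalculation of β as B / (A + B + 100) follows the usual Illumina convention, with B taken as the methylated signal; that formula and the offset are assumptions not stated in this record. The function names and signal values are hypothetical, and this is not the "lumi" implementation.

```python
def quantile_normalize(samples):
    """Quantile-normalize samples so they share one value distribution.

    samples: list of samples, each a list of values for the same probes
    in the same order. Returns new lists; the input is not modified.
    """
    n_probes = len(samples[0])
    if any(len(sample) != n_probes for sample in samples):
        raise ValueError("every sample must cover the same probes")
    sorted_samples = [sorted(sample) for sample in samples]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [
        sum(sample[k] for sample in sorted_samples) / len(samples)
        for k in range(n_probes)
    ]
    normalized = []
    for sample in samples:
        # Probe indices ordered by value; ties keep their input order.
        order = sorted(range(n_probes), key=lambda i: sample[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


def abnorm_beta(signal_a, signal_b, offset=100.0):
    """Normalize A and B signals separately, then recompute beta.

    Assumption: beta = B / (A + B + offset), with B the methylated signal.
    """
    norm_a = quantile_normalize(signal_a)
    norm_b = quantile_normalize(signal_b)
    return [
        [b / (a + b + offset) for a, b in zip(sample_a, sample_b)]
        for sample_a, sample_b in zip(norm_a, norm_b)
    ]


# QNβ-style use: normalize the average beta values directly (two samples, three CpGs).
betas = [[0.2, 0.8, 0.5], [0.1, 0.9, 0.6]]
print(quantile_normalize(betas))

# ABnorm-style use: hypothetical A and B signal intensities (two samples, two CpGs).
signal_a = [[1000.0, 200.0], [900.0, 300.0]]
signal_b = [[100.0, 1800.0], [200.0, 1600.0]]
print(abnorm_beta(signal_a, signal_b))
```

After normalization every sample has the same set of values, so density plots overlap exactly. Any batch difference that remains lies in which CpGs occupy which ranks, which is what the batch correction step is meant to address.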