Zhuang Xiong1,2,3, Mengwei Li1,2,3, Yingke Ma1,2, Rujiao Li1,2, Yiming Bao1,2,3.
Abstract
The Illumina HumanMethylation BeadChip is one of the most cost-effective methods for quantifying DNA methylation levels at single-base resolution across the human genome, which makes it a routine platform for epigenome-wide association studies. Tens of thousands of DNA methylation array samples have accumulated in public databases, providing strong support for data integration and further analysis. However, the majority of public DNA methylation data are deposited as processed data without the background probes that are widely used in data normalization. Here, we present Gaussian mixture quantile normalization (GMQN), a reference-based method for correcting batch effects as well as probe bias in HumanMethylation BeadChip data. Availability and implementation: https://github.com/MengweiLi-project/gmqn.
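As a rough illustration of what "Gaussian mixture quantile normalization" against a reference can mean, the sketch below fits a two-component Gaussian mixture to one sample's signal intensities and maps each value, by quantile, onto the matching component of a reference mixture. It is not the authors' implementation (see the repository above for that). The two-component choice, the hard assignment of values to components, the function names `fit_two_gaussians` and `normalize_to_reference`, and the simulated intensities are all assumptions made for this example.

```python
import math
import random
from statistics import NormalDist


def fit_two_gaussians(values, n_iter=50):
    """Fit a 1-D two-component Gaussian mixture by EM.

    Returns two (weight, mean, sd) tuples sorted by mean.
    """
    xs = sorted(values)
    half = len(xs) // 2
    comps = []
    # Initialise one component on each half of the sorted intensities.
    for part in (xs[:half], xs[half:]):
        mu = sum(part) / len(part)
        sd = math.sqrt(sum((x - mu) ** 2 for x in part) / len(part)) or 1.0
        comps.append((0.5, mu, sd))
    for _ in range(n_iter):
        dists = [NormalDist(mu, sd) for _, mu, sd in comps]
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            p = [comps[k][0] * dists[k].pdf(x) for k in range(2)]
            total = sum(p) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean and sd of each component.
        new = []
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            mu = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, values)) / nk
            new.append((nk / len(values), mu, max(math.sqrt(var), 1e-6)))
        comps = new
    return sorted(comps, key=lambda c: c[1])


def normalize_to_reference(values, ref_comps):
    """Map each value onto the reference mixture, component by component."""
    comps = fit_two_gaussians(values)
    dists = [NormalDist(mu, sd) for _, mu, sd in comps]
    refs = [NormalDist(mu, sd) for _, mu, sd in ref_comps]
    out = []
    for x in values:
        # Hard-assign the value to its most probable fitted component.
        k = max(range(2), key=lambda j: comps[j][0] * dists[j].pdf(x))
        # Quantile within that component, clipped so inv_cdf stays defined.
        q = min(max(dists[k].cdf(x), 1e-9), 1 - 1e-9)
        out.append(refs[k].inv_cdf(q))
    return out


# Simulated example: a batch whose intensities are shifted relative to a
# hypothetical reference mixture given as (weight, mean, sd) per component.
random.seed(0)
ref_comps = [(0.5, 500.0, 100.0), (0.5, 5000.0, 800.0)]
batch = ([random.gauss(800, 150) for _ in range(500)]
         + [random.gauss(7000, 1000) for _ in range(500)])
normalized = normalize_to_reference(batch, ref_comps)
```

Because both the fitted and the reference components are Gaussian, the quantile mapping here reduces to rescaling each value's z-score. A real pipeline would also need to handle the red and green channels and the conversion back to methylation levels, which this sketch leaves out.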
Keywords: DNA methylation; HumanMethylation BeadChip; batch effect; epigenome-wide association studies; probe bias
Year: 2022 PMID: 35069703 PMCID: PMC8777061 DOI: 10.3389/fgene.2021.810985
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. (A) Statistics of 450 k and EPIC data by year and project number in the NCBI GEO database. (B) Distribution of data types of DNA methylation chip projects submitted to the GEO database as of December 2020. A total of 1,114 items contained the original “idat” files and 1,349 items contained TXT or CSV files, indicating that most items lacked the original files. (C) The workflow of GMQN.
Overview of the benchmark test datasets.
| Project id | Number of samples | Benchmark test | Annotation | Platform |
|---|---|---|---|---|
| GSE52731 | 56 | batch effects detection | — | 450 k |
| GSE139687 | 27 | batch effects detection | — | EPIC |
| GSE42861 | 689 | case-control study | Rheumatoid Arthritis | 450 k |
| GSE128235 | 537 | case-control study | Depression | 450 k |
| GSE125105 | 210 | regression analysis | Age | 450 k |
| GSE42861 | 335 | regression analysis | Age | 450 k |
| GSE87571 | 732 | regression analysis | Age | 450 k |
| GSE87571 | 732 | comparison of the methylation levels of adjacent sites | — | 450 k |
| GSE42861 | 689 | case-control study (reference evaluation) | Rheumatoid Arthritis | 450 k |
| GSE125105 | 210 | regression analysis (reference evaluation) | Age | 450 k |
| GSE42861 | 335 | regression analysis (reference evaluation) | Age | 450 k |
| GSE87571 | 732 | regression analysis (reference evaluation) | Age | 450 k |
FIGURE 2. Signal intensity distribution characteristics of Infinium I probes in 450 k data (A,B) and EPIC data (C,D), and clustering results of different batches of samples based on the fitted parameters of the Gaussian distributions (E).
FIGURE 3. Signal intensity distributions of the red and green channels of Infinium I probes in 450 k data before and after GMQN normalization. Before correction, the signal intensities of the red (A) and green (B) channels clearly separated the samples into two batches. These differences were not due to biological differences (tumor and normal) (E,F). After GMQN correction, the batch effect is significantly reduced (C,D).
FIGURE 4. Results of the benchmark test. (A,B) Batch effects detection. (C,D) Case-control study. (E) Regression analysis. (F) Comparison of the methylation levels of adjacent CpG sites (****p < 10⁻⁴).