Hokeun Sun (1), Shuang Wang.
(1) Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY 10032, USA.
Abstract
MOTIVATION: DNA methylation is a molecular modification of DNA that plays a crucial role in the regulation of gene expression. In particular, CpG-rich regions are frequently hypermethylated in cancer tissues but unmethylated in normal tissues. However, compared with microarray gene expression data, few methodological studies address case-control association analysis of high-dimensional DNA methylation data. One key feature of DNA methylation data is its grouped structure: CpG sites from the same gene may be highly correlated. In this article, we proposed a penalized logistic regression model for correlated CpG sites within genes from high-dimensional array data. Our regularization procedure combines the l1 penalty with a squared l2 penalty on degree-scaled differences of the coefficients of CpG sites within one gene, so it induces both sparsity and smoothness in the correlated regression coefficients. We combined the penalized procedure with stability selection, which provides a selection probability for each regression coefficient and helps us make a stable and confident selection of CpG sites that may be truly associated with the outcome.
RESULTS: Using simulation studies, we demonstrated that the proposed procedure outperforms existing mainstream regularization methods such as the lasso and elastic net when data are correlated within a group. We also applied our method to identify important CpG sites and corresponding genes for ovarian cancer from over 20 000 CpGs generated with the Illumina Infinium HumanMethylation27K BeadChip. Some of the identified genes are potentially associated with cancers.
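The two ingredients named in the abstract, the combined penalty and the selection probability, can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything specific here is an assumption made for illustration: the function names `group_graph_penalty` and `selection_probabilities`, the tuning parameters `lam` and `alpha`, the choice to treat all CpG sites within a gene as fully linked (so each site's degree is the group size minus one), and the use of random half-subsamples for stability selection. It is not the authors' implementation.

```python
import numpy as np

def group_graph_penalty(beta, groups, lam=1.0, alpha=0.5):
    """Illustrative penalty: lam * (alpha * l1 + (1 - alpha) * smoothness).

    ASSUMPTION: CpG sites in the same gene (same label in `groups`) are
    fully linked, so every site in a gene of size m has degree m - 1.
    The smoothness term sums squared differences of degree-scaled
    coefficients over all pairs of sites within each gene.
    """
    beta = np.asarray(beta, dtype=float)
    groups = np.asarray(groups)
    l1 = np.abs(beta).sum()
    smooth = 0.0
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        degree = len(idx) - 1
        if degree == 0:
            continue  # a single-site gene has no pairs to smooth over
        scaled = beta[idx] / np.sqrt(degree)
        diffs = scaled[:, None] - scaled[None, :]
        smooth += np.triu(diffs ** 2, k=1).sum()  # count each pair once
    return lam * (alpha * l1 + (1.0 - alpha) * smooth)

def selection_probabilities(X, y, select, n_subsamples=100, seed=0):
    """Fraction of random half-subsamples in which each variable is selected.

    `select` is a hypothetical user-supplied callable taking (X_sub, y_sub)
    and returning a boolean mask of length p, e.g. the nonzero pattern of a
    penalized logistic regression fit.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        counts += np.asarray(select(X[idx], y[idx]), dtype=float)
    return counts / n_subsamples
```

Under these assumptions, two sites in one gene with equal coefficients incur no smoothness cost, while coefficients of opposite sign are penalized more heavily. That is the sense in which such a penalty encourages smooth coefficients within a gene.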