Kimberly D Siegmund (1), Peter W Laird, Ite A Laird-Offringa.
(1) Department of Preventive Medicine, Norris Cancer Center, Keck School of Medicine, University of Southern California, Los Angeles 90089, USA. kims@usc.edu
Abstract
MOTIVATION: Aberrant DNA methylation is common in cancer. DNA methylation profiles differ between tumor types and subtypes and provide a powerful diagnostic tool for identifying clusters of samples and/or genes. DNA methylation data obtained with the quantitative, highly sensitive MethyLight technology are not normally distributed; they frequently contain an excess of zeros. Established tools to analyze this type of data do not exist. Here, we evaluate a variety of methods for cluster analysis to determine which is most reliable.

RESULTS: We introduce a Bernoulli-lognormal mixture model for clustering DNA methylation data obtained using MethyLight. We model the outcomes using a two-part distribution having discrete and continuous components. We compare this model with standard cluster analysis approaches for continuous data and for discrete data. In a simulation study, we find that the two-part model has the lowest classification error rate for mixture outcome data compared with other approaches. The methods are illustrated using DNA methylation data from a study of lung cancer cell lines. Compared with competing hierarchical clustering methods, the mixture model approaches have the lowest cross-validation error for detecting lung cancer subtype (non-small cell versus small cell). The Bernoulli-lognormal mixture assigns observations to subgroups with the lowest uncertainty.

AVAILABILITY: Software is available upon request from the authors.

SUPPLEMENTARY INFORMATION: http://www-rcf.usc.edu/~kims/SupplementaryInfo.html
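As a rough illustration of the two-part idea described in RESULTS, the sketch below evaluates a Bernoulli-lognormal log-density (a point mass at zero plus a lognormal for positive values) and uses it to compute cluster membership probabilities for each sample. The abstract does not give the authors' parametrization or fitting procedure, so everything here is an assumption made for illustration: the function names `two_part_logpdf` and `responsibilities`, the parameters `p_zero`, `mu`, and `sigma`, and the simplification that genes are independent within a cluster. It is not the authors' software.

```python
import numpy as np


def two_part_logpdf(y, p_zero, mu, sigma):
    """Log-density of a two-part outcome (illustrative parametrization).

    P(Y = 0) = p_zero; otherwise log(Y) ~ Normal(mu, sigma^2).
    Parameters broadcast against y, so they may be scalars or per-gene arrays.
    """
    y = np.asarray(y, dtype=float)
    p_zero = np.asarray(p_zero, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    is_pos = y > 0
    # Substitute 1.0 for zeros so log() is defined; those entries are discarded below.
    log_y = np.log(np.where(is_pos, y, 1.0))
    log_cont = (
        np.log1p(-p_zero)              # probability of a nonzero value
        - log_y - np.log(sigma)        # lognormal Jacobian and scale
        - 0.5 * np.log(2.0 * np.pi)
        - 0.5 * ((log_y - mu) / sigma) ** 2
    )
    return np.where(is_pos, log_cont, np.log(p_zero))


def responsibilities(Y, weights, p_zero, mu, sigma):
    """Posterior cluster probabilities for each sample (one E-step).

    Y is (n_samples, n_genes); p_zero, mu, sigma are (n_clusters, n_genes).
    Assumes genes are independent within a cluster (a simplification).
    """
    Y = np.asarray(Y, dtype=float)
    log_post = np.stack(
        [
            np.log(w) + two_part_logpdf(Y, p_zero[k], mu[k], sigma[k]).sum(axis=1)
            for k, w in enumerate(weights)
        ],
        axis=1,
    )
    # Subtract the row maximum before exponentiating for numerical stability.
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)


# Hypothetical toy data: 2 samples, 2 genes, 2 clusters.
Y = np.array([[0.0, 0.0], [5.0, 8.0]])
weights = np.array([0.5, 0.5])
p_zero = np.array([[0.9, 0.9], [0.1, 0.1]])
mu = np.array([[0.0, 0.0], [2.0, 2.0]])
sigma = np.ones((2, 2))

post = responsibilities(Y, weights, p_zero, mu, sigma)
labels = post.argmax(axis=1)
```

In this toy setup the all-zero sample is assigned to the cluster with the high zero probability, and the sample with large positive values to the other. Each row of `post` sums to one, and how close its largest entry is to one gives a simple reading of how uncertain an assignment is.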
Authors: Mulugeta Gebregziabher; Matthew S Shotwell; Jane M Charles; Joyce S Nicholas Journal: Comput Stat Data Anal Date: 2012-01-01 Impact factor: 1.681
Authors: Brock C Christensen; E Andres Houseman; Graham M Poage; John J Godleski; Raphael Bueno; David J Sugarbaker; John K Wiencke; Heather H Nelson; Carmen J Marsit; Karl T Kelsey Journal: Cancer Res Date: 2010-06-29 Impact factor: 12.701
Authors: Carmen J Marsit; E Andres Houseman; Brock C Christensen; Luc Gagne; Margaret R Wrensch; Heather H Nelson; Joseph Wiemels; Shichun Zheng; John K Wiencke; Angeline S Andrew; Alan R Schned; Margaret R Karagas; Karl T Kelsey Journal: PLoS One Date: 2010-08-23 Impact factor: 3.240
Authors: Stacia M DeSantis; E Andrés Houseman; Brent A Coull; David N Louis; Gayatry Mohapatra; Rebecca A Betensky Journal: Biometrics Date: 2009-12 Impact factor: 2.571
Authors: Brock C Christensen; E Andres Houseman; Carmen J Marsit; Shichun Zheng; Margaret R Wrensch; Joseph L Wiemels; Heather H Nelson; Margaret R Karagas; James F Padbury; Raphael Bueno; David J Sugarbaker; Ru-Fang Yeh; John K Wiencke; Karl T Kelsey Journal: PLoS Genet Date: 2009-08-14 Impact factor: 5.917