Pietro Di Lena (1), Claudia Sala (2), Andrea Prodi (3), Christine Nardini (4, 5, 6)

1. Department of Computer Science and Engineering, University of Bologna, Mura Anteo Zamboni 7, Bologna, Italy.
2. Department of Physics and Astronomy, University of Bologna, Viale Berti Pichat 6/2, Bologna, Italy.
3. Smart Cities Living Lab, Istituto per la Sintesi Organica e la Fotoreattività, Consiglio Nazionale delle Ricerche (CNR), Via P. Gobetti 101, Bologna, Italy.
4. Department of Laboratory Medicine, Karolinska Institutet, Alfred Nobels Allé 8, Stockholm, Sweden.
5. Istituto per le applicazioni del calcolo Mauro Picone, Consiglio Nazionale delle Ricerche (CNR), Via dei Taurini 19, Roma, Italy.
6. Scientific and Medical Direction, SOL group SpA, Via Borgazzi 27, Monza, Italy.
Abstract
MOTIVATION: DNA methylation is a stable epigenetic mark with major implications in both physiological (development, aging) and pathological conditions (cancers and numerous diseases). Recent research involving methylation focuses on the development of molecular age estimation methods based on DNA methylation levels (mAge). An increasing number of studies indicate that divergences between mAge and chronological age may be associated with age-related diseases. Advances in high-throughput technologies have allowed the characterization of DNA methylation levels throughout the human genome. However, experimental methylation profiles often contain multiple missing values that can affect both the analysis of the data and mAge estimation. Although several imputation methods exist, a major deficiency is their inability to cope with large datasets, such as those produced by DNA methylation chips. Specific methods for imputing missing methylation data are therefore needed.

RESULTS: We present a simple and computationally efficient imputation method, methyLImp, based on linear regression. The rationale of the approach lies in the observation that methylation levels show a high degree of inter-sample correlation. We performed a comparative study of our approach with other imputation methods on DNA methylation data of healthy and disease samples from different tissues. Performance was assessed both in terms of imputation accuracy and in terms of the impact that imputed values have on mAge estimation. Our linear regression model performs as well as or better than existing methods, with good computational efficiency. The results of our analysis provide recommendations for accurate estimation of missing methylation values.

AVAILABILITY AND IMPLEMENTATION: The R package methyLImp is freely available at https://github.com/pdilena/methyLImp.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
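
The general idea of regression-based imputation mentioned in the abstract can be illustrated with a minimal sketch. This is not the methyLImp implementation and does not use its API. It assumes a small matrix of methylation beta values with samples as rows and CpG sites as columns, and it fits ordinary least squares on the fully observed columns to predict a column with gaps. The abstract does not say whether this orientation matches the one used by the package. The function names `solve` and `impute_linear`, the toy data, and the clamping of predictions to [0, 1] are all illustrative assumptions.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: collinear predictors or too few samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def impute_linear(beta):
    """Return a copy of beta (rows = samples, columns = CpG sites) with each
    None replaced by a least-squares prediction from the complete columns.

    Hypothetical helper: assumes more observed samples than predictors."""
    n_cols = len(beta[0])
    complete = [j for j in range(n_cols)
                if all(row[j] is not None for row in beta)]
    result = [row[:] for row in beta]
    for j in range(n_cols):
        if j in complete:
            continue
        # Training rows: samples where the target column is observed.
        train = [row for row in beta if row[j] is not None]
        X = [[1.0] + [row[k] for k in complete] for row in train]  # with intercept
        y = [row[j] for row in train]
        p = len(X[0])
        # Normal equations: (X^T X) w = X^T y
        xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
        xty = [sum(r[a] * t for r, t in zip(X, y)) for a in range(p)]
        w = solve(xtx, xty)
        for i, row in enumerate(beta):
            if row[j] is None:
                pred = w[0] + sum(wk * row[k] for wk, k in zip(w[1:], complete))
                # Beta values lie in [0, 1]; clamp the prediction (an assumption).
                result[i][j] = min(1.0, max(0.0, pred))
    return result


# Toy data: the third column equals 0.5*c0 + 0.2*c1 + 0.1, with one gap.
beta = [
    [0.10, 0.90, 0.33],
    [0.20, 0.50, 0.30],
    [0.40, 0.70, None],
    [0.60, 0.20, 0.44],
    [0.80, 0.30, 0.56],
]
imputed = impute_linear(beta)
print(round(imputed[2][2], 4))  # close to 0.44, since the toy column is exactly linear
```

On real array data the number of CpG sites far exceeds the number of samples, so the normal equations used here would be singular. A practical method needs a different solver or a way of selecting predictors, which this sketch deliberately leaves out.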