Pietro Di Lena, Claudia Sala, Andrea Prodi, Christine Nardini.
Abstract
BACKGROUND: High-throughput technologies enable the cost-effective collection and analysis of DNA methylation data across the human genome. Such data inevitably contain missing values, and handling them can complicate the analysis. Several general-purpose and methylation-specific imputation methods are suitable for DNA methylation data. However, there are no detailed studies of their performance under different missing data mechanisms (missing completely at random, at random, or not at random) and different representations of DNA methylation levels (β-value and M-value).
Keywords: DNA methylation; Imputation; M-value; MAR; MCAR; MNAR; Missing data mechanisms; β-value
Year: 2020 PMID: 32600298 PMCID: PMC7325236 DOI: 10.1186/s12859-020-03592-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
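The abstract contrasts the β-value and M-value representations of methylation levels. As a minimal sketch, the standard logit relation M = log2(β / (1 − β)) and its inverse can be written as below. The function names and the clipping constant `eps` are illustrative assumptions, not taken from the paper.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a beta-value in [0, 1] to an M-value: M = log2(beta / (1 - beta))."""
    # eps is a hypothetical clipping constant that keeps the logit finite at 0 and 1.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))

def m_to_beta(m):
    """Inverse transform: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)

# Round trip for a low, a mid-range, and a high beta-value.
for beta in (0.05, 0.5, 0.95):
    m = beta_to_m(beta)
    print(f"beta={beta:.2f}  M={m:+.3f}  back={m_to_beta(m):.2f}")
```

The M-value scale stretches the extremes of the β range, which is one reason the same imputation method can behave differently under the two representations.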
Benchmark datasets
| ID | GEO ID | Tissue | Disease status | # Samples | # Missing values (21k) | % Missing values (21k) |
|---|---|---|---|---|---|---|
| D1 | GSE32146 | Colon mucosa | Crohn’s disease | 10 | 175 | 0.08% |
| D2 | GSE32146 | Colon mucosa | Ulcerative colitis | 5 | 161 | 0.15% |
| D3 | GSE32146 | Colon | Normal | 10 | 171 | 0.08% |
| D4 | GSE32148 | Blood | Normal | 19 | 325 | 0.08% |
| D5 | GSE40005 | Blood | Normal | 12 | 324 | 0.13% |
| D6 | GSE42921 | Colon mucosa | Crohn’s disease | 5 | 192 | 0.18% |
| D7 | GSE42921 | Colon mucosa | Ulcerative colitis | 6 | 331 | 0.26% |
| D8 | GSE42921 | Colon | Normal | 12 | 874 | 0.34% |
| D9 | GSE43091 | Liver | Cancer | 50 | 1,980 | 0.19% |
| D10 | GSE43091 | Liver | Normal | 4 | 125 | 0.15% |
| D11 | GSE44684 | Cerebellum | Normal | 6 | 67 | 0.05% |
| D12 | GSE49393 | Prefrontal Cortex | Normal | 25 | 54,000 | 10.11% |
| D13 | GSE51388 | Blood | Normal | 60 | 292,200 | 22.79% |
| D14 | GSE52113 | Blood | Normal | 24 | 0 | 0.00% |
| D15 | GSE53051 | Breast | Cancer | 14 | 0 | 0.00% |
| D16 | GSE53051 | Colon | Cancer | 35 | 0 | 0.00% |
| D17 | GSE53051 | Colon, Pancreas | Normal | 9 | 0 | 0.00% |
| D18 | GSE53051 | Lung | Cancer | 9 | 0 | 0.00% |
| D19 | GSE53051 | Pancreas | Cancer | 29 | 0 | 0.00% |
| D20 | GSE53051 | Thyroid | Cancer | 70 | 0 | 0.00% |
| D21 | GSE53162 | Brain, Cerebellum, Prefrontal Cortex | Normal | 21 | 0 | 0.00% |
| D22 | GSE53740 | Blood | Normal | 165 | 0 | 0.00% |
| D23 | GSE57360 | Brain | Normal | 5 | 0 | 0.00% |
| D24 | GSE61151 | Blood | Normal | 184 | 7,544 | 0.19% |
| D25 | GSE61257 | Adipose | Non-alcoholic fatty liver disease (NAFLD) | 8 | 88 | 0.05% |
| D26 | GSE61257 | Adipose | Non-alcoholic steatohepatitis (NASH) | 9 | 142 | 0.07% |
| D27 | GSE61257 | Adipose | Normal | 15 | 241 | 0.08% |
| D28 | GSE61258 | Liver | Non-alcoholic fatty liver disease (NAFLD) | 14 | 370 | 0.12% |
| D29 | GSE61258 | Liver | Non-alcoholic steatohepatitis (NASH) | 7 | 218 | 0.15% |
| D30 | GSE61258 | Liver | Normal | 32 | 966 | 0.14% |
| D31 | GSE61258 | Liver | Primary biliary cholangitis (PBC) | 12 | 251 | 0.10% |
| D32 | GSE61258 | Liver | Primary sclerosing cholangitis (PSC) | 14 | 352 | 0.12% |
| D33 | GSE61259 | Muscle | Non-alcoholic fatty liver disease (NAFLD) | 9 | 90 | 0.05% |
| D34 | GSE61259 | Muscle | Non-alcoholic steatohepatitis (NASH) | 7 | 49 | 0.03% |
| D35 | GSE61259 | Muscle | Normal | 10 | 96 | 0.04% |
| D36 | GSE61380 | Brain | Normal | 15 | 24,671 | 7.70% |
| D37 | GSE62003 | Blood | Normal | 35 | 0 | 0.00% |
| D38 | GSE64495 | Blood | Normal | 106 | 32 | 0.00% |
| D39 | GSE67477 | Liver | Cancer | 6 | 461 | 0.36% |
| D40 | GSE67484 | Liver, Intestine-Small | Normal | 4 | 45 | 0.05% |
| D41 | GSE69502 | Brain, Spinal Cord | Normal | 20 | 37,781 | 8.84% |
| D42 | GSE71955 | Blood | Normal | 62 | 260,245 | 19.64% |
| D43 | GSE73103 | Blood | Normal | 268 | 1,005,268 | 17.55% |
| D44 | GSE73747 | Brain | Normal | 9 | 7,069 | 3.68% |
| D45 | GSE79122 | Brain | Normal | 7 | 99 | 0.07% |
| D46 | GSE80970 | Prefrontal Cortex | Normal | 68 | 1,324 | 0.09% |
| D47 | GSE82218 | Blood | Normal | 25 | 398 | 0.07% |
| D48 | GSE84003 | Blood | Normal | 6 | 275 | 0.21% |
| D49 | GSE88821 | Colon, Rectum | Cancer | 63 | 36,995 | 2.75% |
| D50 | GSE88821 | Colon, Rectum | Normal | 8 | 4,680 | 2.74% |
| D51 | GSE88821 | Liver | Cancer | 4 | 2,349 | 2.75% |
| D52 | GSE89093 | Blood | Normal | 46 | 65,044 | 6.62% |
| D53 | GSE89472 | Blood | Normal | 5 | 245 | 0.23% |
| D54 | GSE89702 | Cerebellum | Normal | 17 | 49,572 | 13.65% |
| D55 | GSE89703 | Hippocampus | Normal | 13 | 37,557 | 13.52% |
| D56 | GSE89705 | Putamen | Normal | 17 | 49,215 | 13.55% |
| D57 | GSE89706 | Putamen | Normal | 28 | 78,736 | 13.16% |
| D58 | GSE97362 | Blood | Normal | 123 | 2,333 | 0.09% |
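The percentage column appears to be the number of missing entries divided by the total number of entries (samples × CpG probes). The sketch below illustrates that reading. `N_PROBES = 21_000` is a rough stand-in for the "21k" in the column header, not an exact probe count, so results only approximate the table.

```python
def percent_missing(n_missing, n_samples, n_probes):
    """Share of missing entries in an n_probes x n_samples matrix, in percent."""
    return 100.0 * n_missing / (n_samples * n_probes)

# Rough stand-in for the "21k" in the column header (assumption, not an exact count).
N_PROBES = 21_000

# (dataset ID, samples, missing values) copied from rows D1 and D9 of the table.
for ds_id, n_samples, n_missing in [("D1", 10, 175), ("D9", 50, 1980)]:
    print(ds_id, f"{percent_missing(n_missing, n_samples, N_PROBES):.2f}%")
```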
MCAR missing values
| Method | MAE (M-value) | MAE (β-value) | RMSE (M-value) | RMSE (β-value) |
|---|---|---|---|---|
| (a) Healthy datasets | ||||
| mean | 0.030 ±0.001∗ | 0.030 ±0.001 | 0.051 ±0.001 | 0.050 ±0.001∗ |
| impute.knn | 0.039 ±0.007∗ | 0.059 ±0.012 | 0.079 ±0.015∗ | 0.112 ±0.019 |
| softImpute | 0.031 ±0.002 | 0.032 ±0.006∗ | 0.055 ±0.004 | 0.059 ±0.017∗ |
| imputePCA | 0.025 ±0.001∗ | 0.025 ±0.001 | 0.045 ±0.001 | 0.043 ±0.001∗ |
| SVDmiss | 0.035 ±0.001 | 0.027 ±0.001∗ | 0.063 ±0.002 | 0.048 ±0.002∗ |
| missForest | 0.026 ±0.001 | 0.026 ±0.001∗ | 0.044 ±0.003 | 0.043 ±0.002∗ |
| methyLImp | 0.029 ±0.001 | 0.025 ±0.001∗ | 0.050 ±0.002 | 0.047 ±0.002∗ |
| (b) Disease datasets | ||||
| mean | 0.048 ±0.001∗ | 0.048 ±0.001 | 0.080 ±0.002 | 0.079 ±0.002∗ |
| impute.knn | 0.059 ±0.008∗ | 0.082 ±0.013 | 0.107 ±0.014∗ | 0.142 ±0.018 |
| softImpute | 0.050 ±0.004 | 0.051 ±0.010∗ | 0.084 ±0.007 | 0.091 ±0.026∗ |
| imputePCA | 0.041 ±0.001∗ | 0.041 ±0.001 | 0.072 ±0.002 | 0.070 ±0.002∗ |
| SVDmiss | 0.055 ±0.001 | 0.045 ±0.001∗ | 0.093 ±0.002 | 0.080 ±0.003∗ |
| missForest | 0.042 ±0.001∗ | 0.042 ±0.001 | 0.071 ±0.002 | 0.070 ±0.002∗ |
| methyLImp | 0.043 ±0.001 | 0.037 ±0.001∗ | 0.074 ±0.002 | 0.066 ±0.002∗ |
Average Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) imputation performance ± standard deviation. For each method, the ∗ symbol indicates the measure (either M-value or β-value) for which the Wilcoxon signed-rank test p-value is < 0.05. Best results per metric with respect to the Wilcoxon signed-rank test are highlighted in bold.
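For reference, the two error measures reported in these tables can be computed as in the following minimal sketch. It uses the textbook definitions of MAE and RMSE over the imputed entries only. The toy values are made up for illustration and are not data from the study.

```python
import math

def mae(true_vals, imputed_vals):
    """Mean absolute error between true and imputed values."""
    return sum(abs(t - p) for t, p in zip(true_vals, imputed_vals)) / len(true_vals)

def rmse(true_vals, imputed_vals):
    """Root mean square error between true and imputed values."""
    squared = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals))
    return math.sqrt(squared / len(true_vals))

# Hypothetical beta-values that were masked, and the values an imputer returned.
true_beta = [0.10, 0.50, 0.90, 0.30]
imputed_beta = [0.12, 0.45, 0.88, 0.36]

print(f"MAE  = {mae(true_beta, imputed_beta):.4f}")
print(f"RMSE = {rmse(true_beta, imputed_beta):.4f}")
```

RMSE penalizes large individual errors more than MAE does, so RMSE is never smaller than MAE on the same set of entries.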
Fig. 1 Healthy datasets. Average RMSE with respect to β-value range. Average RMSE for M-value and β-value imputation across different β-value ranges under the MCAR, MAR, and MNAR (low, mid, high) missing data mechanisms.
MAR missing values
| Method | MAE (M-value) | MAE (β-value) | RMSE (M-value) | RMSE (β-value) |
|---|---|---|---|---|
| (a) Healthy datasets | ||||
| mean | 0.041 ±0.001 | 0.040 ±0.001∗ | 0.073 ±0.002 | 0.070 ±0.001∗ |
| impute.knn | 0.043 ±0.004∗ | 0.061 ±0.009 | 0.082 ±0.009∗ | 0.110 ±0.015 |
| softImpute | 0.042 ±0.002∗ | 0.043 ±0.007 | 0.077 ±0.005 | 0.082 ±0.017∗ |
| imputePCA | 0.037 ±0.001 | 0.036 ±0.001∗ | 0.069 ±0.002 | 0.066 ±0.002∗ |
| SVDmiss | 0.043 ±0.001 | 0.036 ±0.001∗ | 0.079 ±0.003 | 0.067 ±0.002∗ |
| missForest | 0.035 ±0.001 | 0.035 ±0.001∗ | 0.064 ±0.002 | 0.061 ±0.002∗ |
| methyLImp | 0.037 ±0.001 | 0.033 ±0.001∗ | 0.068 ±0.002 | 0.063 ±0.002∗ |
| (b) Disease datasets | ||||
| mean | 0.060 ±0.001 | 0.060 ±0.001∗ | 0.101 ±0.002 | 0.097 ±0.002∗ |
| impute.knn | 0.067 ±0.005∗ | 0.087 ±0.010 | 0.115 ±0.009∗ | 0.144 ±0.014 |
| softImpute | 0.062 ±0.003 | 0.065 ±0.011 | 0.106 ±0.006∗ | 0.116 ±0.026 |
| imputePCA | 0.054 ±0.001 | 0.053 ±0.001∗ | 0.095 ±0.002 | 0.090 ±0.002∗ |
| SVDmiss | 0.067 ±0.001 | 0.057 ±0.001∗ | 0.114 ±0.003 | 0.104 ±0.004∗ |
| missForest | 0.053 ±0.001 | 0.053 ±0.001∗ | 0.093 ±0.002 | 0.088 ±0.002∗ |
| methyLImp | 0.053 ±0.001 | 0.049 ±0.001∗ | 0.092 ±0.002 | 0.089 ±0.002∗ |
Average Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) imputation performance ± standard deviation. For each method, the ∗ symbol indicates the measure (either M-value or β-value) for which the Wilcoxon signed-rank test p-value is < 0.05. Best results per metric with respect to the Wilcoxon signed-rank test are highlighted in bold.
Fig. 2 β-value distributions of different missingness mechanisms. Comparison of the β-value distribution against the distribution of simulated MCAR, MAR, and MNAR missing values.
Fig. 3 β-value distributions of CpGs with frequently missing values. Comparison of the β-value distribution against the β-value distribution of CpGs with missing values in >20%, >25%, and >30% of samples. The simulated MAR distribution is included.
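To make the difference between the mechanisms concrete, the sketch below masks entries of a toy β-value matrix in two ways. Under MCAR every entry is dropped with the same probability. Under MNAR the chance of being dropped depends on the (unobserved) value itself, here on whether it falls in a chosen β range. This is not the simulation protocol used in the study: the function names, the range bounds, and the rates are hypothetical, and MAR (where missingness depends on other observed values) is omitted.

```python
import random

def mcar_mask(matrix, rate, rng):
    """MCAR: every entry is masked independently with the same probability."""
    return [[rng.random() < rate for _ in row] for row in matrix]

def mnar_mask(matrix, rate, lo, hi, rng):
    """MNAR: only entries whose own value lies in [lo, hi) can be masked."""
    # rate is the masking probability among eligible entries, not over the whole matrix.
    return [[(lo <= v < hi) and rng.random() < rate for v in row] for row in matrix]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
betas = [[rng.random() for _ in range(6)] for _ in range(4)]  # toy matrix, rows = CpGs

# Hypothetical bounds for a "low" range; the study's actual ranges are not given here.
low_mask = mnar_mask(betas, rate=0.5, lo=0.0, hi=0.2, rng=rng)
any_mask = mcar_mask(betas, rate=0.1, rng=rng)

print("MCAR masked entries:", sum(map(sum, any_mask)))
print("MNAR (low) masked entries:", sum(map(sum, low_mask)))
```

Because MNAR masking concentrates on one part of the β range, the masked values follow a different distribution from the full data, which is the contrast the figure captions describe.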
MNAR:low missing values
| Method | MAE (M-value) | MAE (β-value) | RMSE (M-value) | RMSE (β-value) |
|---|---|---|---|---|
| (a) Healthy datasets | ||||
| mean | 0.022 ±0.001∗ | 0.023 ±0.001 | 0.043 ±0.001∗ | 0.044 ±0.001 |
| impute.knn | 0.041 ±0.012 | 0.033 ±0.006∗ | 0.086 ±0.021 | 0.077 ±0.014∗ |
| softImpute | 0.026 ±0.002 | 0.023 ±0.003 | 0.052 ±0.006 | 0.046 ±0.010∗ |
| imputePCA | 0.019 ±0.001∗ | 0.020 ±0.001 | 0.039 ±0.002∗ | 0.039 ±0.001 |
| SVDmiss | 0.029 ±0.001 | 0.021 ±0.001∗ | 0.061 ±0.003 | 0.041 ±0.002∗ |
| missForest | 0.019 ±0.001∗ | 0.020 ±0.001 | 0.037 ±0.002∗ | 0.038 ±0.001 |
| methyLImp | 0.022 ±0.001∗ | 0.019 ±0.001 | 0.040 ±0.002∗ | 0.039 ±0.002 |
| (b) Disease datasets | ||||
| mean | 0.036 ±0.001∗ | 0.037 ±0.001 | 0.068 ±0.002∗ | 0.069 ±0.002 |
| impute.knn | 0.063 ±0.014 | 0.048 ±0.008∗ | 0.120 ±0.020 | 0.102 ±0.015∗ |
| softImpute | 0.040 ±0.005 | 0.036 ±0.004∗ | 0.076 ±0.010 | 0.072 ±0.013∗ |
| imputePCA | 0.031 ±0.001∗ | 0.032 ±0.001 | 0.061 ±0.002∗ | 0.062 ±0.002 |
| SVDmiss | 0.047 ±0.001 | 0.035 ±0.001∗ | 0.089 ±0.003 | 0.070 ±0.003∗ |
| missForest | 0.031 ±0.001∗ | 0.032 ±0.001 | 0.060 ±0.002∗ | 0.061 ±0.002 |
| methyLImp | 0.032 ±0.001 | 0.028 ±0.001∗ | 0.063 ±0.003 | 0.058 ±0.002∗ |
Average Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) imputation performance ± standard deviation. For each method, the ∗ symbol indicates the measure (either M-value or β-value) for which the Wilcoxon signed-rank test p-value is < 0.05. Best results per metric with respect to the Wilcoxon signed-rank test are highlighted in bold.
MNAR:mid missing values
| Method | MAE (M-value) | MAE (β-value) | RMSE (M-value) | RMSE (β-value) |
|---|---|---|---|---|
| (a) Healthy datasets | ||||
| mean | 0.053 ±0.001 | 0.051 ±0.001∗ | 0.082 ±0.001 | 0.076 ±0.001∗ |
| impute.knn | 0.041 ±0.002∗ | 0.050 ±0.004 | 0.067 ±0.005∗ | 0.085 ±0.010 |
| softImpute | 0.051 ±0.001 | 0.050 ±0.006∗ | 0.078 ±0.003 | 0.080 ±0.012∗ |
| imputePCA | 0.045 ±0.001 | 0.043 ±0.001∗ | 0.072 ±0.001 | 0.067 ±0.001∗ |
| SVDmiss | 0.052 ±0.001 | 0.043 ±0.001∗ | 0.081 ±0.002 | 0.069 ±0.002∗ |
| missForest | 0.044 ±0.001 | 0.042 ±0.001∗ | 0.068 ±0.001 | 0.064 ±0.001∗ |
| methyLImp | 0.044 ±0.001 | 0.040 ±0.001∗ | 0.068 ±0.001 | 0.064 ±0.001∗ |
| (b) Disease datasets | ||||
| mean | 0.076 ±0.001 | 0.072 ±0.001∗ | 0.109 ±0.001 | 0.101 ±0.001∗ |
| impute.knn | 0.060 ±0.002∗ | 0.073 ±0.006 | 0.091 ±0.005∗ | 0.116 ±0.010 |
| softImpute | 0.075 ±0.002 | 0.072 ±0.010∗ | 0.108 ±0.003 | 0.111 ±0.021∗ |
| imputePCA | 0.066 ±0.001 | 0.062 ±0.001∗ | 0.098 ±0.001 | 0.091 ±0.001∗ |
| SVDmiss | 0.075 ±0.001 | 0.064 ±0.001∗ | 0.112 ±0.002 | 0.100 ±0.002∗ |
| missForest | 0.066 ±0.001 | 0.062 ±0.001∗ | 0.098 ±0.001 | 0.090 ±0.001∗ |
| methyLImp | 0.065 ±0.001 | 0.057 ±0.001∗ | 0.095 ±0.002 | 0.088 ±0.001∗ |
Average Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) imputation performance ± standard deviation. For each method, the ∗ symbol indicates the measure (either M-value or β-value) for which the Wilcoxon signed-rank test p-value is < 0.05. Best results per metric with respect to the Wilcoxon signed-rank test are highlighted in bold.
MNAR:high missing values
| Method | MAE (M-value) | MAE (β-value) | RMSE (M-value) | RMSE (β-value) |
|---|---|---|---|---|
| (a) Healthy datasets | ||||
| mean | 0.026 ±0.001∗ | 0.026 ±0.001 | 0.044 ±0.001∗ | 0.044 ±0.001 |
| impute.knn | 0.054 ±0.013∗ | 0.092 ±0.020 | 0.103 ±0.022∗ | 0.152 ±0.023 |
| softImpute | 0.028 ±0.002∗ | 0.033 ±0.010 | 0.049 ±0.005∗ | 0.063 ±0.026 |
| imputePCA | 0.022 ±0.001∗ | 0.022 ±0.001 | 0.039 ±0.001 | 0.038 ±0.001 |
| SVDmiss | 0.032 ±0.001 | 0.025 ±0.001∗ | 0.056 ±0.004 | 0.043 ±0.002∗ |
| missForest | 0.023 ±0.001∗ | 0.023 ±0.001 | 0.039 ±0.001 | 0.038 ±0.001∗ |
| methyLImp | 0.027 ±0.001 | 0.022 ±0.001∗ | 0.044 ±0.001 | 0.039 ±0.002∗ |
| (b) Disease datasets | ||||
| mean | 0.041 ±0.001∗ | 0.043 ±0.001 | 0.069 ±0.002∗ | 0.069 ±0.002 |
| impute.knn | 0.085 ±0.017∗ | 0.134 ±0.025 | 0.148 ±0.023∗ | 0.203 ±0.024 |
| softImpute | 0.044 ±0.005∗ | 0.053 ±0.018 | 0.075 ±0.009∗ | 0.098 ±0.043 |
| imputePCA | 0.035 ±0.001∗ | 0.036 ±0.001 | 0.061 ±0.002∗ | 0.061 ±0.002 |
| SVDmiss | 0.050 ±0.001 | 0.041 ±0.001∗ | 0.084 ±0.002 | 0.073 ±0.003∗ |
| missForest | 0.036 ±0.001∗ | 0.038 ±0.001 | 0.062 ±0.002∗ | 0.062 ±0.001 |
| methyLImp | 0.037 ±0.001 | 0.032 ±0.001∗ | 0.062 ±0.002 | 0.057 ±0.002∗ |
Average Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) imputation performance ± standard deviation. For each method, the ∗ symbol indicates the measure (either M-value or β-value) for which the Wilcoxon signed-rank test p-value is < 0.05. Best results per metric with respect to the Wilcoxon signed-rank test are highlighted in bold.
Global imputation performances across all datasets (healthy and disease) and all missingness mechanisms
| Method | MAE (M-value) | MAE (β-value) | RMSE (M-value) | RMSE (β-value) |
|---|---|---|---|---|
| mean | 0.041 ±0.001∗ | 0.040 ±0.001 | 0.068 ±0.001 | 0.066 ±0.001∗ |
| impute.knn | 0.052 ±0.008∗ | 0.068 ±0.011 | 0.095 ±0.014∗ | 0.119 ±0.016 |
| softImpute | 0.042 ±0.003 | 0.043 ±0.008∗ | 0.072 ±0.006 | 0.077 ±0.020∗ |
| imputePCA | 0.035 ±0.001 | 0.035 ±0.001∗ | 0.062 ±0.002 | 0.059 ±0.001∗ |
| SVDmiss | 0.046 ±0.001 | 0.037 ±0.001∗ | 0.079 ±0.003 | 0.065 ±0.002∗ |
| missForest | 0.036 ±0.001 | 0.036 ±0.001∗ | 0.061 ±0.002 | 0.059 ±0.002∗ |
| methyLImp | 0.037 ±0.001 | 0.033 ±0.001∗ | 0.062 ±0.002 | 0.058 ±0.002∗ |
For each method, the ∗ symbol indicates the measure (either M-value or β-value) for which the Wilcoxon signed-rank test p-value is <0.05. Best results per metric with respect to the Wilcoxon signed-rank test are highlighted in bold
Average time and memory usage
| Method | Avg time | Avg RAM |
|---|---|---|
| mean | <1s | 27MB |
| impute.knn | 2s | 81MB |
| softImpute | <1s | 74MB |
| imputePCA | 19s | 204MB |
| SVDmiss | 2m | 4GB |
| missForest | 18h | 280MB |
| methyLImp | 21m | 129MB |
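Timings and memory figures like those above depend on how they are measured. The following is one possible way to record wall-clock time and peak memory for a single imputation call, sketched with Python's standard library. It is not the benchmarking setup used in the study. `mean_impute` and `profile` are hypothetical helpers, rows are assumed to be CpG probes, and the 0.0 fallback for fully missing rows is an arbitrary choice.

```python
import time
import tracemalloc

def mean_impute(matrix):
    """Replace None entries with the mean of the observed values in the same row."""
    out = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # Arbitrary fallback (assumption) when a row has no observed value at all.
        fill = sum(observed) / len(observed) if observed else 0.0
        out.append([fill if v is None else v for v in row])
    return out

def profile(func, *args):
    """Return (result, elapsed seconds, peak traced bytes) for one call of func."""
    tracemalloc.start()
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak

toy = [[0.1, None, 0.3], [None, 0.8, 0.6]]  # made-up beta-values with gaps
imputed, seconds, peak_bytes = profile(mean_impute, toy)
print(imputed, f"{seconds:.6f}s", f"{peak_bytes} bytes")
```

`tracemalloc` only sees allocations made through the Python allocator, so numbers obtained this way are not directly comparable with process-level RAM figures.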