Harriet L Mills, Jon Heron, Caroline Relton, Matt Suderman, Kate Tilling.
Abstract
Multiple imputation (MI) is a well-established method for dealing with missing data. MI is computationally intensive when imputing missing covariates with high-dimensional outcome data (e.g., DNA methylation data in epigenome-wide association studies (EWAS)), because every outcome variable must be included in the imputation model to avoid biasing associations towards the null. Instead, EWAS analyses are reduced to only complete cases, limiting statistical power and potentially causing bias. We used simulations to compare 5 MI methods for high-dimensional data under 2 missingness mechanisms. All imputation methods had increased power over complete-case (C-C) analyses. Imputing missing values separately for each variable was computationally inefficient, but dividing sites at random into evenly sized bins improved efficiency and gave low bias. Methods imputing solely using subsets of sites identified by the C-C analysis suffered from bias towards the null. However, if these subsets were added into random bins of sites, this bias was reduced. The optimal methods were applied to an EWAS with missingness in covariates. All methods identified additional sites over the C-C analysis, and many of these sites had been replicated in other studies. These methods are also applicable to other high-dimensional data sets, including the rapidly expanding area of "-omics" studies.
Keywords: Accessible Resource for Integrated Epigenomics Studies; Avon Longitudinal Study of Parents and Children; epigenetic data; imputation; missing data
Year: 2019 PMID: 31504104 PMCID: PMC6825836 DOI: 10.1093/aje/kwz186
Source DB: PubMed Journal: Am J Epidemiol ISSN: 0002-9262 Impact factor: 4.897
Table 1. Characteristics and Performance of Different Imputation Methods Used to Impute Smoking Status From Epigenetic Data^a
| Imputation Method | Method: No. of Imputation Procedures | Method: Imputation Model | Results: True Positives | Results: False Positives | Results: Bias | Results: Speed (1 = Fastest) |
|---|---|---|---|---|---|---|
| Complete-case | 0 | N/A | Poor | Very good | Unbiased^b | 1 |
| Separate CpG sites | 482,739 | Smoking ~ single CpG site + age + sex | Good | Good | Unbiased | 8 |
| Random bins (3:1) | 3,219^c | For each bin: smoking ~ 150 CpG sites + age + sex | Good | Poor | Unbiased | 6 |
| Random bins (10:1) | 10,728^c | For each bin: smoking ~ 45 CpG sites + age + sex | Good | Good | Unbiased | 4 |
| Naive method | 1 | Smoking ~ C-C CpG sites + age + sex | Good | Poor | Biased towards the null for non–C-C CpG sites | 3 |
| Wu method | 1 | Smoking ~ selected CpG sites + age + sex | Good | Good | Biased towards the null for nonselected CpG sites | 2 |
| Wu bins (3:1) | 3,353^c,d | For each bin: smoking ~ 150 CpG sites (including Wu-selected CpG sites) + age + sex | Good | Poor | Unbiased | 7 |
| Wu bins (10:1) | 12,378^c,d | For each bin: smoking ~ 45 CpG sites (including Wu-selected CpG sites) + age + sex | Good | Good | Unbiased | 5 |
Abbreviations: C-C, complete-case; N/A, not applicable.
a This table provides details on the imputation methods described in the text and their results for the simulations only. NCpG is the number of CpG sites included in the analysis (NCpG = 482,739).
b Methods classified as “unbiased” in this table are only unbiased if the imputation model is correct and data are missing at random.
c This is approximately NCpG/bin size.
d Recall that the bins for the “Wu bins” method always contain the subset of CpG sites selected in the forward-stepwise selection process, so there are slightly more bins for the “Wu bins” method than for the “random bins” method, in order to accommodate the extra sites.
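The bin counts in footnote c can be reproduced with a short calculation. The sketch below is illustrative only: the function names and the shuffle-then-slice binning are assumptions, not the authors' implementation, and the final bin may be smaller than the others when the number of sites is not a multiple of the bin size.

```python
import math
import random

N_CPG = 482_739  # number of CpG sites, taken from footnote a


def n_bins(n_sites: int, bin_size: int) -> int:
    """Number of bins needed to hold n_sites with at most bin_size sites each."""
    return math.ceil(n_sites / bin_size)


def random_bins(n_sites: int, bin_size: int, seed: int = 0) -> list[list[int]]:
    """Shuffle site indices, then cut them into consecutive bins.

    Hypothetical helper: each site lands in exactly one bin.
    """
    order = list(range(n_sites))
    random.Random(seed).shuffle(order)
    return [order[i:i + bin_size] for i in range(0, n_sites, bin_size)]


print(n_bins(N_CPG, 150))  # 3219, matching "Random bins (3:1)"
print(n_bins(N_CPG, 45))   # 10728, matching "Random bins (10:1)"
```

Each bin would then feed one imputation model of the form smoking ~ CpG sites in the bin + age + sex, as listed in the table.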
Table 2. Performance of Different Imputation Methods for Imputing Smoking Status, Assessed by Comparing the EWAS on the Resulting Data Sets With an EWAS on the Complete Data^a
| Imputation Method | MM1: Total No. of Associated Sites | MM1: No. in Complete Data Set | MM1: % of Complete Data Set | MM1: No. Not in Complete Data Set | MM1: % of Total Found | MM1: Mean SE of β Coefficients (SD) | MM2: Total No. of Associated Sites | MM2: No. in Complete Data Set | MM2: % of Complete Data Set | MM2: No. Not in Complete Data Set | MM2: % of Total Found | MM2: Mean SE of β Coefficients (SD) | Time Relative to Separate CpG Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Complete data^b | 298.0 | | | | | 0.0953 (0.0081) | 298.0 | | | | | 0.0953 (0.0081) | |
| Complete-case | 169.7 | 139.3 | 46.7 | 30.4 | 16.7 | 0.1073 (0.0099) | 147.1 | 122.5 | 41.1 | 24.6 | 13.6 | 0.1069 (0.0099) | 0.002 |
| Separate CpG sites | 330.0 | 188.3 | 63.2 | 141.7 | 40.5 | 0.1052 (0.0093) | 282.8 | 169.9 | 57.0 | 112.9 | 34.1 | 0.1049 (0.0093) | 1.000 |
| Random bins (3:1) | 537.2 | 189.6 | 63.6 | 347.6 | 63.5 | 0.0997 (0.0086) | 482.2 | 170.5 | 57.2 | 311.7 | 58.8 | 0.0985 (0.0085) | 0.517 |
| Random bins (10:1) | 373.6 | 192.0 | 64.4 | 181.6 | 46.1 | 0.1031 (0.0090) | 326.3 | 176.6 | 59.3 | 149.7 | 38.9 | 0.1028 (0.0090) | 0.339 |
| Naive method | 863.3 | 215.8 | 72.4 | 647.5 | 65.1 | 0.0984 (0.0084) | 433.9 | 180.1 | 60.4 | 253.8 | 45.2 | 0.0974 (0.0083) | 0.069 |
| Wu method | 312.0 | 183.8 | 61.7 | 128.2 | 37.3 | 0.1002 (0.0087) | 290.4 | 170.8 | 57.3 | 119.6 | 28.2 | 0.1001 (0.0087) | 0.059 |
| Wu bins (3:1) | 516.7 | 196.9 | 66.1 | 319.8 | 59.9 | 0.0984 (0.0084) | 432.6 | 175.2 | 58.8 | 257.4 | 53.4 | 0.0972 (0.0083) | 0.527 |
| Wu bins (10:1) | 410.2 | 202.2 | 67.9 | 208.0 | 48.2 | 0.0996 (0.0086) | 349.2 | 187.1 | 62.8 | 162.1 | 38.3 | 0.0995 (0.0086) | 0.412 |
Abbreviations: EWAS, epigenome-wide association study; MM, missingness mechanism; SD, standard deviation; SE, standard error.
a The table shows the number of CpG sites associated with former or current smoking that were identified as significant in the regression analysis for each method, for both MMs. We report the number of these sites which were also significant in the EWAS on the complete data set (presented with the percentage, i.e., the true-positive rate) and the number of those which were not significant in the EWAS on the complete data set (presented with the percentage of those found to be significant which were “incorrect,” i.e., the false-positive rate). The mean and SD of the SEs (for the coefficients for the association of each CpG site with being a former or current smoker) are reported for each method. Note that this is the mean value across repeats of the mean and SD of the SEs within each repeat. Recall that the analysis on the complete data and C-C analysis did not require any imputation, making their computation time very low. Relative times are calculated from computation times averaged over example runs for MM1 and MM2; raw times are provided in Web Table 5. The table shows the results of the 10 repeats on the full (unreduced) data set.
b Reference model.
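The true-positive and false-positive percentages defined in footnote a can be sketched as set operations for a single repeat. This is a minimal illustration: the function name and the CpG identifiers are hypothetical, and the tabulated values are means across 10 repeats, so they need not equal ratios of the tabulated mean counts.

```python
def tp_fp_percent(found: set[str], reference: set[str]) -> tuple[float, float]:
    """Return (true-positive %, false-positive %) for one repeat.

    found:     sites significant after imputation (or in the C-C analysis)
    reference: sites significant in the EWAS on the complete data set
    """
    true_pos = found & reference   # also significant in the complete data
    false_pos = found - reference  # not significant in the complete data
    tp_pct = 100 * len(true_pos) / len(reference)
    fp_pct = 100 * len(false_pos) / len(found) if found else 0.0
    return tp_pct, fp_pct


# Toy example with made-up site identifiers.
reference = {"cg0001", "cg0002", "cg0003", "cg0004"}
found = {"cg0001", "cg0002", "cg0009"}
tp, fp = tp_fp_percent(found, reference)
print(tp)  # 50.0: 2 of the 4 reference sites were recovered
```

Note the different denominators: the true-positive percentage is relative to the complete-data set of sites, whereas the false-positive percentage is relative to the sites the method itself found.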
Figure 1. True-positive and false-positive percentages of CpG sites identified by different imputation methods in 10 repeats on the full (unreduced) data set. Values are listed in Table 2. Black symbols are for missingness mechanism 1, and gray symbols are for missingness mechanism 2.
Table 3. Detailed Analysis of the Performance of the Naive Method for Imputing Smoking Status, Assessed by Comparing the EWAS on the Resulting Data Sets With an EWAS on the Complete Data^a
| MM and Scenario | Significant in C-C Analysis, Positive: Mean β (SE) | Significant in C-C Analysis, Positive: Mean Bias (SD) | Significant in C-C Analysis, Negative: Mean β (SE) | Significant in C-C Analysis, Negative: Mean Bias (SD) | Significant in Complete Data and Not in C-C Analysis, Positive: Mean β (SE) | Significant in Complete Data and Not in C-C Analysis, Positive: Mean Bias (SD) | Significant in Complete Data and Not in C-C Analysis, Negative: Mean β (SE) | Significant in Complete Data and Not in C-C Analysis, Negative: Mean Bias (SD) | All Other CpG Sites, Positive: Mean β (SE) | All Other CpG Sites, Positive: Mean Bias (SD) | All Other CpG Sites, Negative: Mean β (SE) | All Other CpG Sites, Negative: Mean Bias (SD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MM1^b | | | | | | | | | | | | |
| No. of CpG sites | 63.2 | | 106.5 | | 85.0 | | 73.7 | | 235,482.8 | | 246,927.8 | |
| Complete data (“truth”) | 0.5641 (0.09207) | | −0.5962 (0.09115) | | 0.5261 (0.09249) | | −0.5239 (0.09252) | | 0.1479 (0.09473) | | −0.1468 (0.09578) | |
| Naive method | 0.6187 (0.09473) | 0.0546 (0.0541) | −0.6400 (0.09405) | −0.0438 (0.0603) | 0.5187 (0.09602) | −0.0074 (0.0505) | −0.5150 (0.09609) | 0.0089 (0.0530) | 0.1625 (0.09782) | 0.0147 (0.0568) | −0.1611 (0.09892) | −0.0143 (0.0559) |
| MM2^c | | | | | | | | | | | | |
| No. of CpG sites | 54.2 | | 92.9 | | 91.7 | | 83.8 | | 235,485.1 | | 246,931.3 | |
| Complete data (“truth”) | 0.5562 (0.09214) | | −0.6010 (0.09091) | | 0.5301 (0.09242) | | −0.5287 (0.09252) | | 0.1479 (0.09473) | | −0.1468 (0.09578) | |
| Naive method | 0.5722 (0.09424) | 0.0160 (0.0530) | −0.6136 (0.09338) | −0.0126 (0.0586) | 0.4983 (0.09517) | −0.0318 (0.0479) | −0.4978 (0.09529) | 0.0309 (0.0498) | 0.1465 (0.09686) | −0.0013 (0.0481) | −0.1466 (0.09794) | 0.0002 (0.0485) |
Abbreviations: C-C, complete-case; EWAS, epigenome-wide association study; MM, missingness mechanism; SD, standard deviation; SE, standard error.
a Average β coefficient (with average SE) and average bias (with SD) for former smokers specifically for the naive method as compared with the EWAS on the complete data set (n = 263). The table shows the results of 10 repeats on the full (unreduced) data set. CpG sites were divided into 3 groups: 1) sites identified as significant in the C-C analysis; 2) sites identified as significant in the complete data set and not in the C-C analysis; and 3) all other sites. We divided the β coefficients into positive (>0) and negative (<0) coefficients according to their value in the EWAS on the complete data set. Web Table 8 shows the equivalent results for current smokers (n = 22).
b Missing with probability 75% for males aged 57 years or over.
c Missing with probability 50% for males aged 57 years or over and with probability 12.5% for all remaining persons.
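The two missingness mechanisms in footnotes b and c can be written as a probability rule. The sketch below is an assumption-laden illustration: the function names and the sex coding are hypothetical, and it reads footnote b as implying that everyone other than males aged 57 years or over is always observed under MM1.

```python
import random


def p_missing(mechanism: str, sex: str, age: float) -> float:
    """Probability that smoking status is set to missing for one person."""
    older_male = sex == "male" and age >= 57
    if mechanism == "MM1":
        # Assumption: all other persons are never missing under MM1.
        return 0.75 if older_male else 0.0
    if mechanism == "MM2":
        return 0.50 if older_male else 0.125
    raise ValueError(f"unknown mechanism: {mechanism}")


def apply_missingness(smoking, sex, age, mechanism, seed=0):
    """Return a copy of smoking with values replaced by None at random."""
    rng = random.Random(seed)
    return [
        None if rng.random() < p_missing(mechanism, s, a) else y
        for y, s, a in zip(smoking, sex, age)
    ]
```

Because missingness depends only on the fully observed variables sex and age, both rules are missing-at-random mechanisms, which is the condition footnote b of Table 1 attaches to the "unbiased" label.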
Table 4. Detailed Analysis of the Performance of the Wu Method for Imputing Smoking Status, Assessed by Comparing the EWAS on the Resulting Data Sets With an EWAS on the Complete Data^a
| MM and Scenario | Selected From C-C Analysis, Positive: Mean β (SE) | Selected From C-C Analysis, Positive: Mean Bias (SD) | Selected From C-C Analysis, Negative: Mean β (SE) | Selected From C-C Analysis, Negative: Mean Bias (SD) | Significant in Complete Data and Not Selected, Positive: Mean β (SE) | Significant in Complete Data and Not Selected, Positive: Mean Bias (SD) | Significant in Complete Data and Not Selected, Negative: Mean β (SE) | Significant in Complete Data and Not Selected, Negative: Mean Bias (SD) | All Other CpG Sites, Positive: Mean β (SE) | All Other CpG Sites, Positive: Mean Bias (SD) | All Other CpG Sites, Negative: Mean β (SE) | All Other CpG Sites, Negative: Mean Bias (SD) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MM1^b | | | | | | | | | | | | |
| No. of CpG sites^c | 1.4 | | 6.2 | | 134.9 | | 156.5 | | 235,494.7 | | 246,945.3 | |
| Complete data (“truth”) | 0.6112 (0.0907) | | −0.7925 (0.0857) | | 0.5465 (0.09223) | | −0.5666 (0.09173) | | 0.1479 (0.09473) | | −0.1469 (0.09578) | |
| Wu method | 0.6616 (0.09551) | 0.0504 (0.0272) | −0.8456 (0.08873) | −0.0531 (0.0357) | 0.5411 (0.09744) | −0.0054 (0.0441) | −0.5743 (0.09622) | −0.0077 (0.0408) | 0.1606 (0.09961) | 0.0127 (0.0471) | −0.1591 (0.1008) | −0.0122 (0.0462) |
| MM2^d | | | | | | | | | | | | |
| No. of CpG sites^c | 1.8 | | 6.2 | | 134.4 | | 156.3 | | 235,494.8 | | 246,945.5 | |
| Complete data (“truth”) | 0.5735 (0.09118) | | −0.7211 (0.08725) | | 0.5465 (0.09222) | | −0.5690 (0.09166) | | 0.1479 (0.09473) | | −0.1469 (0.09578) | |
| Wu method | 0.6313 (0.09471) | 0.0578 (0.0513) | −0.7522 (0.09058) | −0.0311 (0.0424) | 0.5362 (0.09736) | −0.0103 (0.0482) | −0.5648 (0.09632) | 0.0042 (0.0423) | 0.1559 (0.09948) | 0.0080 (0.0461) | −0.1552 (0.1007) | −0.0083 (0.0461) |
Abbreviations: C-C, complete-case; EWAS, epigenome-wide association study; MM, missingness mechanism; SD, standard deviation; SE, standard error.
a Average β coefficient (with average SE) and average bias (with SD) for former smokers specifically for the Wu method as compared with the EWAS on the complete data set (n = 263). The table shows the results of 10 repeats on the full (unreduced) data set. CpG sites were divided into 3 groups: 1) sites selected by means of the Bayesian Information Criterion from those identified as significant in the C-C analysis; 2) sites identified as significant in the EWAS on the complete data set which were not selected by the Bayesian Information Criterion; and 3) all other sites. We divided the β coefficients into positive (>0) and negative (<0) coefficients according to their value in the EWAS on the complete data set. Web Table 10 shows the equivalent results for current smokers (n = 22).
b Missing with probability 75% for males aged 57 years or over.
c Average number of CpG sites in that group, across the 10 repeats.
d Missing with probability 50% for males aged 57 years or over and with probability 12.5% for all remaining persons.
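The tabulated mean biases are consistent with defining bias as the imputed estimate minus the complete-data estimate. The sketch below checks this against the MM1 rows for the Wu method; the definition is inferred from the numbers, and the helper names are hypothetical.

```python
def bias(estimate: float, truth: float) -> float:
    """Bias of an imputed-data coefficient relative to the complete-data value."""
    return estimate - truth


def toward_null(estimate: float, truth: float) -> bool:
    """True when the estimate is closer to zero than the complete-data value."""
    return abs(estimate) < abs(truth)


# MM1, sites selected from the C-C analysis (values copied from the table).
print(round(bias(0.6616, 0.6112), 4))    # 0.0504
print(round(bias(-0.8456, -0.7925), 4))  # -0.0531

# MM1, significant in complete data but not selected, positive coefficients.
print(toward_null(0.5411, 0.5465))  # True: shrunk slightly towards zero
```

A positive bias on a positive coefficient, or a negative bias on a negative one, moves the estimate away from the null, so the sign of the bias has to be read together with the sign of the complete-data coefficient.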
Table 5. Number of CpG Sites Identified as Associated With Former or Current Smoking in the ARIES Data Set and Their Replication in the Literature^a
| Imputation Method | No. of CpG Sites Identified by the Imputation Method | % of Sites Identified by the Imputation Method That Were Also Reported by Joehanes et al. (36)^b: At the 2,568 Sites | % of Sites Identified by the Imputation Method That Were Also Reported by Joehanes et al. (36)^b: At the 185 Sites After BC | % of the 185 Sites Identified After BC in Joehanes et al. (36)^b |
|---|---|---|---|---|
| Complete-case | 18 | 94.4 | 83.3 | 8.1 |
| Random bins | 36 | 72.2 | 50.0 | 9.7 |
| Wu method | 29 | 93.1 | 82.8 | 13.0 |
| Wu bins | 46 | 63.0 | 47.8 | 11.9 |
| Total unique | 60 | 61.7 | 43.3 | 14.0 |
Abbreviations: ARIES, Accessible Resource for Integrated Epigenomics Studies; BC, Bonferroni correction.
a Shown are the number of CpG sites which were significantly associated with former or current smoking in the ARIES data set, the percentage of these which were replicated at the 2,568 CpG sites reported by Joehanes et al. (36) (current smokers vs. never smokers (false discovery rate < 0.05)), the percentage of those which were replicated at the 185 CpG sites reported by Joehanes et al. (36) after BC, and the percentage of those 185 which were identified by each method.
b A 2016 review of other epigenome-wide association studies of smoking (36).
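The last two percentage columns describe the same overlap from two directions, so they should imply the same number of replicated sites. The sketch below back-calculates that count both ways; the implied counts are derived from the rounded percentages and are not reported in the source.

```python
N_BC_SITES = 185  # sites reported by Joehanes et al. after Bonferroni correction

# method: (no. of sites found, % of found among the 185, % of the 185 found)
rows = {
    "Complete-case": (18, 83.3, 8.1),
    "Random bins": (36, 50.0, 9.7),
    "Wu method": (29, 82.8, 13.0),
    "Wu bins": (46, 47.8, 11.9),
    "Total unique": (60, 43.3, 14.0),
}

implied = {}
for method, (n_found, pct_of_found, pct_of_185) in rows.items():
    from_found = round(n_found * pct_of_found / 100)   # overlap via column 4
    from_185 = round(N_BC_SITES * pct_of_185 / 100)    # overlap via column 5
    implied[method] = (from_found, from_185)

for method, counts in implied.items():
    print(method, counts)
```

For every row the two back-calculated counts agree, which is a useful sanity check when transcribing the table.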