Paul Madley-Dowd, Rachael Hughes, Kate Tilling, Jon Heron.
Abstract
OBJECTIVES: Researchers are concerned whether multiple imputation (MI) or complete case analysis should be used when a large proportion of data are missing. We aimed to provide guidance for drawing conclusions from data with a large proportion of missingness. STUDY DESIGN AND …
Keywords: ALSPAC; Bias; Methods; Missing data; Multiple imputation; Simulation
Year: 2019 PMID: 30878639 PMCID: PMC6547017 DOI: 10.1016/j.jclinepi.2019.02.016
Source DB: PubMed Journal: J Clin Epidemiol ISSN: 0895-4356 Impact factor: 6.437
Description of the imputation models used for both MCAR and MAR data
| Imputation model | Variables included | R² |
|---|---|---|
| 1 (least auxiliary information) | | 0.36 |
| 2 | | 0.40 |
| 3 | | 0.52 |
| 4 | | 0.76 |
| 5 (most auxiliary information) | | 0.92 |
R², the total coefficient of multiple correlation with the outcome for all variables included in the imputation model, is displayed as a measure of the strength of the auxiliary information in each imputation model.
Fig. 1. Empirical SE of the MI exposure coefficient plotted against FMI for simulated MCAR data. Error bars are 95% confidence intervals based on Monte Carlo standard errors across simulations. FMI = fraction of missing information; MCAR = missing completely at random; MI = multiple imputation; SE = standard error.
Percentage reduction in empirical SE and bias compared with CCA for MCAR and MAR results of the exposure coefficient in the simulation study
| % Missing | Imputation model | % Reduction in SE compared with CCA (MCAR data) | % Reduction in SE compared with CCA (MAR data) | % Reduction in bias compared with CCA (MAR data) |
|---|---|---|---|---|
| 1 | 1: R² = 0.36 (No aux info) | 0.00% | −0.01% | 1.46% |
| | 2: R² = 0.40 | 0.16% | 0.24% | 1.91% |
| | 3: R² = 0.52 | 0.24% | 0.11% | 79.03% |
| | 4: R² = 0.76 | 0.55% | 0.41% | 79.54% |
| | 5: R² = 0.92 | 0.52% | 0.58% | 81.42% |
| 5 | 1: R² = 0.36 (No aux info) | 0.02% | −0.03% | 0.16% |
| | 2: R² = 0.40 | 0.19% | 0.03% | −1.26% |
| | 3: R² = 0.52 | 1.04% | 0.93% | 97.92% |
| | 4: R² = 0.76 | 1.99% | 2.63% | 94.91% |
| | 5: R² = 0.92 | 1.57% | 3.64% | 93.74% |
| 10 | 1: R² = 0.36 (No aux info) | −0.05% | −0.06% | 0.40% |
| | 2: R² = 0.40 | 0.37% | 0.75% | −0.35% |
| | 3: R² = 0.52 | 0.58% | 1.12% | 97.38% |
| | 4: R² = 0.76 | 2.59% | 4.61% | 96.73% |
| | 5: R² = 0.92 | 2.89% | 6.76% | 96.41% |
| 20 | 1: R² = 0.36 (No aux info) | 0.03% | −0.05% | −0.19% |
| | 2: R² = 0.40 | 1.08% | 1.03% | −0.65% |
| | 3: R² = 0.52 | 2.59% | 3.42% | 97.94% |
| | 4: R² = 0.76 | 8.28% | 7.94% | 97.33% |
| | 5: R² = 0.92 | 10.53% | 10.26% | 97.29% |
| 40 | 1: R² = 0.36 (No aux info) | 0.05% | −0.06% | −0.21% |
| | 2: R² = 0.40 | 2.00% | 1.25% | 0.10% |
| | 3: R² = 0.52 | 5.37% | 5.06% | 97.84% |
| | 4: R² = 0.76 | 15.56% | 14.11% | 98.56% |
| | 5: R² = 0.92 | 21.10% | 22.86% | 98.64% |
| 60 | 1: R² = 0.36 (No aux info) | −0.04% | −0.02% | 0.21% |
| | 2: R² = 0.40 | 2.55% | 1.68% | 0.02% |
| | 3: R² = 0.52 | 5.48% | 6.74% | 99.77% |
| | 4: R² = 0.76 | 21.02% | 18.45% | 99.43% |
| | 5: R² = 0.92 | 31.59% | 31.96% | 98.22% |
| 80 | 1: R² = 0.36 (No aux info) | −0.03% | −0.14% | 0.00% |
| | 2: R² = 0.40 | 2.16% | 1.57% | 1.34% |
| | 3: R² = 0.52 | 8.18% | 9.86% | 96.47% |
| | 4: R² = 0.76 | 27.56% | 28.21% | 99.62% |
| | 5: R² = 0.92 | 45.88% | 44.66% | 98.77% |
| 90 | 1: R² = 0.36 (No aux info) | 0.03% | 0.11% | 0.04% |
| | 2: R² = 0.40 | 1.40% | 2.18% | 0.89% |
| | 3: R² = 0.52 | 12.44% | 8.86% | 99.97% |
| | 4: R² = 0.76 | 34.82% | 33.76% | 95.78% |
| | 5: R² = 0.92 | 53.09% | 52.96% | 98.73% |
Abbreviations: CCA, complete case analysis; MAR, missing at random; MCAR, missing completely at random; SE, standard error.
R² refers to the squared coefficient of multiple correlation, which is used as a measure of auxiliary information.
Models 1 and 2 do not include all variables in the missingness mechanism and so are biased (as expected) for the MAR data. Models 3–5 do include all variables in the missingness mechanism and so are unbiased (as expected).
Percentage reduction in SE was calculated using 100 × (se_CCA − se_MI)/se_CCA, where se_CCA and se_MI are the empirical standard errors of the CCA model and the MI model, respectively.
Percentage reduction in bias was calculated using 100 × (abs(bias_CCA) − abs(bias_MI))/abs(bias_CCA), where abs(.) is a function giving the absolute value and bias_CCA and bias_MI are the biases of the CCA model and the MI model, respectively.
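The two percentage-reduction formulas in the footnotes above can be written out as a short Python sketch. The function names and the input values are hypothetical, chosen only for illustration; they are not results from the study.

```python
def pct_reduction_se(se_cca: float, se_mi: float) -> float:
    """Percentage reduction in empirical SE of MI relative to CCA:
    100 * (se_CCA - se_MI) / se_CCA."""
    return 100.0 * (se_cca - se_mi) / se_cca


def pct_reduction_bias(bias_cca: float, bias_mi: float) -> float:
    """Percentage reduction in absolute bias of MI relative to CCA:
    100 * (|bias_CCA| - |bias_MI|) / |bias_CCA|."""
    return 100.0 * (abs(bias_cca) - abs(bias_mi)) / abs(bias_cca)


# Hypothetical inputs (not taken from the paper).
print(round(pct_reduction_se(0.050, 0.040), 2))    # MI has the smaller SE
print(round(pct_reduction_bias(-0.20, 0.01), 2))   # sign of the bias is ignored
print(round(pct_reduction_se(0.050, 0.060), 2))    # negative: MI has the larger SE
```

A negative value, as in several rows for imputation model 1, means that the MI estimate had a slightly larger SE (or absolute bias) than the CCA estimate.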
Fig. 2. Bias of the CCA and MI exposure coefficient plotted against the proportion of missing data for simulated MAR data. Error bars are 95% confidence intervals based on Monte Carlo standard errors across simulations. CCA = complete case analysis; MI = multiple imputation; FMI = fraction of missing information; SE = standard error.
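The figure captions refer to bias, empirical SE, and Monte Carlo standard errors across simulations. A minimal sketch of how such quantities are commonly computed is given below. It assumes the widely used formulas for simulation studies (Monte Carlo SE of the bias = empirical SE/√n_sim; Monte Carlo SE of the empirical SE = empirical SE/√(2(n_sim − 1))); the source does not state which formulas the authors used, and the function name and example estimates are hypothetical.

```python
import math
import statistics


def performance(estimates, true_value, z=1.96):
    """Bias and empirical SE of a set of simulated estimates, each with a
    Monte Carlo standard error and a normal-approximation 95% interval.
    Formulas are the commonly used ones (an assumption, see lead-in)."""
    n_sim = len(estimates)
    emp_se = statistics.stdev(estimates)            # SD across simulations (n - 1)
    bias = statistics.fmean(estimates) - true_value
    mcse_bias = emp_se / math.sqrt(n_sim)
    mcse_emp_se = emp_se / math.sqrt(2 * (n_sim - 1))
    return {
        "bias": bias,
        "bias_ci": (bias - z * mcse_bias, bias + z * mcse_bias),
        "emp_se": emp_se,
        "emp_se_ci": (emp_se - z * mcse_emp_se, emp_se + z * mcse_emp_se),
    }


# Hypothetical exposure-coefficient estimates from four simulated data sets.
result = performance([0.9, 1.0, 1.1, 1.0], true_value=1.0)
print(result)
```

In practice the number of simulated data sets would be far larger than four; the short list is used only to keep the example readable.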
Imputation models for the applied example, Bristol, United Kingdom, 1991–2007
| Model | Variables included | % Missing data |
|---|---|---|
| A | No extra variables | 62.47% |
| B | IQ at age 8 | 66.64% |
| C | Intelligibility and fluency at age 9 | 66.68% |
| D | Maths assessment score | 76.59% |
| E | Learning difficulties | 78.84% |
| F | Streaming for maths and English | 81.75% |
| G | IQ at age 8 and intelligibility | 69.34% |
| H | IQ at age 8 and maths assessment | 79.11% |
| I | IQ at age 8, intelligibility, and maths assessment | 80.62% |
| J | IQ at age 8, intelligibility, maths assessment and LD | 84.17% |
| K | IQ at age 8, intelligibility, maths assessment and streaming groups | 86.42% |
| L | IQ at age 8, intelligibility, maths assessment, LD, and streaming groups | 86.51% |
Abbreviations: IQ, intelligence quotient; LD, learning difficulties.
All models additionally contained IQ at the age of 15 years, a binary measure of maternal smoking in pregnancy and the set of all confounders. Continuous variables (IQ at age of 8 and 15 years, intelligibility, and maths assessment score) were imputed using a linear regression model, binary variables (sex and learning difficulties) were imputed using logistic regression, and ordinal variables (maternal age and education, parity, and maths and literacy streaming group) were imputed using ordinal logistic regression.
Variable description, including the proportion of missing data and relationship with observed and missing values in the outcome variable for the applied example, Bristol, United Kingdom, 1991–2007
| Variable | Type | % Missing data | R² | OR for missing data in outcome | 95% CI |
|---|---|---|---|---|---|
| IQ at age 15 | Continuous | 62.47 | | | |
| Maternal smoking in pregnancy | Binary | 0.00 | 0.01 | 2.18 | 1.98, 2.39 |
| Maternal age | Categorical | 0.00 | 0.04 | | |
| ≤ 24 years | | | | Reference | Reference |
| 25–29 years | | | | 0.57 | 0.51, 0.64 |
| 30–34 years | | | | 0.42 | 0.38, 0.47 |
| ≥ 35 years | | | | 0.41 | 0.35, 0.47 |
| Parity | Categorical | 0.00 | 0.01 | | |
| 0 | | | | Reference | Reference |
| 1 | | | | 1.18 | 1.09, 1.29 |
| 2 | | | | 1.46 | 1.30, 1.64 |
| ≥ 3 | | | | 2.06 | 1.72, 2.48 |
| Sex | Binary | 0.00 | <0.01 | | |
| Female | | | | Reference | Reference |
| Male | | | | 1.27 | 1.18, 1.37 |
| Maternal education | Categorical | 0.00 | 0.11 | | |
| Vocational | | | | Reference | Reference |
| CSE/O level | | | | 0.91 | 0.80, 1.05 |
| A level/degree | | | | 0.45 | 0.39, 0.52 |
| IQ at age 8 | Continuous | 44.49 | 0.37 | 0.98 | 0.98, 0.98 |
| Intelligibility and fluency at age 9 | Continuous | 37.96 | 0.01 | 0.95 | 0.93, 0.97 |
| Maths assessment score | Continuous | 44.39 | 0.24 | 0.15 | 0.12, 0.19 |
| Ever had learning difficulties | Binary | 48.57 | 0.08 | 2.02 | 1.75, 2.33 |
| Maths streaming group | Ordinal | 52.76 | 0.20 | | |
| Lowest | | | | Reference | Reference |
| Middle | | | | 0.58 | 0.50, 0.69 |
| Highest | | | | 0.42 | 0.36, 0.49 |
| Literacy streaming group | Ordinal | 55.03 | 0.16 | | |
| Lowest | | | | Reference | Reference |
| Middle | | | | 0.59 | 0.50, 0.69 |
| Highest | | | | 0.39 | 0.33, 0.45 |
Abbreviations: CCA, complete case analysis; CI, confidence interval; IQ, intelligence quotient; OR, odds ratio; R², variance explained in the outcome.
R² was obtained by regressing IQ at the age of 15 years on each variable, with no adjustment for other variables. CCA was used in all models.
Using logistic regression, the odds of having a missing value for the outcome were regressed on each variable with no adjustment for other variables. CCA was used in all models.
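For a single binary predictor, the unadjusted logistic regression described above reduces to a 2 × 2 table: the estimated OR is the cross-product ratio, and the Wald 95% CI uses the standard error √(1/a + 1/b + 1/c + 1/d) on the log-odds scale. The sketch below illustrates this special case only. The function name and the counts are hypothetical and are not the study data.

```python
import math


def odds_ratio_missing(exp_missing, exp_observed, unexp_missing, unexp_observed,
                       z=1.96):
    """Odds ratio (with Wald 95% CI) for a missing outcome, comparing an
    exposed group with an unexposed group. Equivalent to unadjusted logistic
    regression of the missingness indicator on one binary predictor."""
    a, b, c, d = exp_missing, exp_observed, unexp_missing, unexp_observed
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: outcome missing / observed, by binary predictor.
or_, lo, hi = odds_ratio_missing(20, 10, 10, 20)
print(f"OR = {or_:.2f}, 95% CI {lo:.2f}, {hi:.2f}")
```

An OR above 1 indicates that the outcome is more likely to be missing in the exposed group. Continuous and multi-category predictors, as in the table, require fitting the logistic model itself rather than this shortcut.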
Fig. 3. Estimate, standard error, and FMI for the exposure coefficient in the applied example adjusted analysis model. Reduction in SE is relative to CCA. CCA = complete case analysis; FMI = fraction of missing information; SE = standard error.
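The pooled estimate, its standard error, and the FMI reported for MI analyses are conventionally obtained with Rubin's rules. The sketch below assumes the standard formulas, including the degrees-of-freedom-adjusted FMI; the source does not specify which FMI variant was used. The function name and the per-imputation inputs are hypothetical.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their squared SEs with Rubin's
    rules. Returns (pooled estimate, pooled SE, FMI)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)         # pooled point estimate
    within = statistics.fmean(variances)        # average within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (m - 1)
    total = within + (1 + 1 / m) * between
    if between == 0:
        return q_bar, math.sqrt(total), 0.0     # no between-imputation variation
    r = (1 + 1 / m) * between / within          # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2             # Rubin's degrees of freedom
    fmi = (r + 2 / (df + 3)) / (r + 1)
    return q_bar, math.sqrt(total), fmi


# Hypothetical estimates and squared SEs from m = 3 imputed data sets.
est, se, fmi = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.04, 0.04])
print(f"estimate = {est:.3f}, SE = {se:.3f}, FMI = {fmi:.3f}")
```

Auxiliary variables that predict the missing values well shrink the between-imputation variance, which lowers both the pooled SE and the FMI.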