Marianne Riksheim Stavseth, Thomas Clausen, Jo Røislien.
Abstract
OBJECTIVES: Missing data is a recurrent issue in many fields of medical research, particularly in questionnaires. The aim of this article is to describe and compare six conceptually different multiple imputation methods, alongside the commonly used complete case analysis, and to explore whether the choice of methodology for handling missing data might impact clinical conclusions drawn from a regression model when data are categorical.
Keywords: Missing data; categorical data; complete case analysis; hot deck imputation; latent class analysis; multiple correspondence analysis; multiple imputation; random forests
Year: 2019 PMID: 30671242 PMCID: PMC6329020 DOI: 10.1177/2050312118822912
Source DB: PubMed Journal: SAGE Open Med ISSN: 2050-3121
Figure 1. Flow chart illustrating the sampling process of the simulation study.
The mean difference between the true and the estimated regression coefficients after imputation for small (n = 200) and large (n = 1000) samples, at four levels of missingness (5%, 10%, 20% and 40%).
| | | n = 200 | | | | n = 1000 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Gender | Medication† | Stimulants | Drug use† | Gender | Medication† | Stimulants | Drug use† |
| Full data | Estimate (SE) | 0.59 (0.06) | 0.31 (0.04) | 0.87 (0.05) | 1.52 (0.06) | 0.59 (0.06) | 0.31 (0.04) | 0.87 (0.05) | 1.52 (0.06) |
| 5% missing | Hot deck | 0.07 | −0.09 | 0.06 | −0.13 | 0.01 | 0.03 | 0.09 | −0.09 |
| | Random forest | 0.05 | −0.08 | 0.04 | −0.10 | 0.00 | 0.04 | 0.07 | −0.06 |
| | Latent class | 0.06 | −0.09 | 0.06 | −0.11 | 0.01 | 0.02 | 0.09 | −0.07 |
| | MI EMB | 0.05 | −0.08 | 0.01 | 0.00 | −0.01 | 0.05 | 0.03 | −0.05 |
| | MICE LOG | 0.03 | −0.07 | −0.02 | 0.07 | −0.02 | 0.05 | 0.00 | 0.05 |
| | MIMCA | 0.04 | −0.07 | −0.02 | 0.06 | −0.02 | 0.06 | 0.00 | 0.00 |
| | Complete case | −0.04 | −0.08 | −0.04 | 0.14 | −0.10 | 0.05 | 0.00 | 0.04 |
| 10% missing | Hot deck | 0.08 | −0.11 | 0.12 | −0.21 | 0.03 | −0.01 | 0.16 | −0.08 |
| | Random forest | 0.06 | −0.09 | 0.09 | −0.16 | 0.01 | 0.00 | 0.13 | −0.15 |
| | Latent class | 0.07 | −0.11 | 0.12 | −0.19 | 0.03 | −0.02 | 0.17 | −0.13 |
| | MI EMB | 0.05 | −0.08 | 0.04 | −0.10 | 0.00 | 0.02 | 0.06 | −0.08 |
| | MICE LOG | 0.03 | −0.07 | −0.02 | 0.09 | −0.02 | 0.03 | 0.00 | 0.06 |
| | MIMCA | 0.04 | −0.07 | −0.01 | 0.04 | −0.01 | 0.04 | 0.01 | −0.03 |
| | Complete case | −0.09 | −0.07 | −0.02 | 0.10 | −0.16 | 0.03 | −0.03 | 0.05 |
| 20% missing | Hot deck | 0.11 | −0.11 | 0.19 | −0.57 | 0.06 | −0.04 | 0.24 | −0.34 |
| | Random forest | 0.08 | −0.09 | 0.15 | −0.45 | 0.03 | −0.02 | 0.20 | −0.31 |
| | Latent class | 0.09 | −0.12 | 0.20 | −0.34 | 0.05 | −0.05 | 0.26 | −0.29 |
| | MI EMB | 0.07 | −0.06 | 0.07 | −0.22 | 0.02 | 0.03 | 0.10 | −0.21 |
| | MICE LOG | 0.04 | −0.04 | −0.02 | 0.12 | −0.02 | 0.03 | 0.00 | 0.11 |
| | MIMCA | 0.04 | −0.02 | 0.00 | 0.09 | −0.01 | 0.07 | 0.02 | −0.08 |
| | Complete case | 0.23 | −0.07 | −0.03 | 0.83 | −0.31 | 0.04 | −0.01 | 0.11 |
| 40% missing | Hot deck | 0.13 | −0.20 | 0.29 | −0.87 | 0.08 | −0.09 | 0.32 | −0.52 |
| | Random forest | 0.10 | −0.18 | 0.24 | −0.75 | 0.05 | −0.07 | 0.29 | −0.49 |
| | Latent class | 0.12 | −0.21 | 0.21 | −0.67 | 0.07 | −0.12 | 0.36 | −0.49 |
| | MI EMB | 0.08 | −0.13 | 0.14 | −0.41 | 0.03 | 0.01 | 0.16 | −0.32 |
| | MICE LOG | 0.04 | −0.12 | 0.01 | 0.38 | −0.02 | 0.04 | 0.01 | 0.15 |
| | MIMCA | 0.05 | −0.06 | 0.03 | 0.03 | −0.01 | 0.11 | 0.03 | −0.12 |
| | Complete case | −0.13 | −0.07 | −1.77 | 1.84 | −0.47 | 0.07 | −0.08 | 0.13 |

SE: standard error; MI EMB: multiple imputation using expectation–maximization with bootstrapping; MICE LOG: multivariate imputation by chained equations–based logistic regression; MIMCA: multiple imputation using multiple correspondence analysis.
† Covariate with missing values.
The standard deviation of the bias calculated for small (n = 200) and large (n = 1000) samples, at four levels of missingness (5%, 10%, 20% and 40%).
| | | n = 200 | | | | n = 1000 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Gender | Medication† | Stimulants | Drug use† | Gender | Medication† | Stimulants | Drug use† |
| Full data | Estimate (SE) | 0.59 (0.06) | 0.31 (0.04) | 0.87 (0.05) | 1.52 (0.06) | 0.59 (0.06) | 0.31 (0.04) | 0.87 (0.05) | 1.52 (0.06) |
| 5% missing | Hot deck | 0.47 | 0.32 | 0.49 | 0.40 | 0.19 | 0.17 | 0.23 | 0.20 |
| | Random forest | 0.47 | 0.33 | 0.49 | 0.45 | 0.19 | 0.17 | 0.23 | 0.22 |
| | Latent class | 0.47 | 0.32 | 0.49 | 0.41 | 0.19 | 0.17 | 0.23 | 0.20 |
| | MI EMB | 0.47 | 0.33 | 0.49 | 0.49 | 0.19 | 0.18 | 0.23 | 0.23 |
| | MICE LOG | 0.48 | 0.34 | 0.49 | 0.57 | 0.19 | 0.18 | 0.23 | 0.27 |
| | MIMCA | 0.47 | 0.34 | 0.49 | 0.55 | 0.19 | 0.18 | 0.23 | 0.25 |
| | Complete case | 0.49 | 0.36 | 0.54 | 0.78 | 0.20 | 0.19 | 0.25 | 0.28 |
| 10% missing | Hot deck | 0.47 | 0.33 | 0.49 | 0.36 | 0.19 | 0.16 | 0.23 | 0.18 |
| | Random forest | 0.47 | 0.33 | 0.49 | 0.41 | 0.19 | 0.16 | 0.23 | 0.22 |
| | Latent class | 0.47 | 0.31 | 0.49 | 0.35 | 0.19 | 0.16 | 0.23 | 0.19 |
| | MI EMB | 0.48 | 0.35 | 0.49 | 0.43 | 0.20 | 0.18 | 0.23 | 0.23 |
| | MICE LOG | 0.48 | 0.36 | 0.49 | 0.53 | 0.20 | 0.18 | 0.23 | 0.31 |
| | MIMCA | 0.48 | 0.37 | 0.49 | 0.48 | 0.20 | 0.18 | 0.23 | 0.27 |
| | Complete case | 0.56 | 0.44 | 0.58 | 1.55 | 0.25 | 0.20 | 0.28 | 0.33 |
| 20% missing | Hot deck | 0.47 | 0.33 | 0.48 | 0.33 | 0.19 | 0.17 | 0.23 | 0.15 |
| | Random forest | 0.47 | 0.34 | 0.49 | 0.41 | 0.19 | 0.18 | 0.23 | 0.20 |
| | Latent class | 0.47 | 0.31 | 0.49 | 0.32 | 0.19 | 0.16 | 0.23 | 0.16 |
| | MI EMB | 0.48 | 0.38 | 0.50 | 0.48 | 0.20 | 0.20 | 0.23 | 0.22 |
| | MICE LOG | 0.48 | 0.41 | 0.50 | 0.65 | 0.20 | 0.21 | 0.23 | 0.33 |
| | MIMCA | 0.48 | 0.42 | 0.50 | 0.72 | 0.20 | 0.22 | 0.23 | 0.28 |
| | Complete case | 0.67 | 0.54 | 0.79 | 3.31 | 0.26 | 0.26 | 0.32 | 0.38 |
| 40% missing | Hot deck | 0.46 | 0.29 | 0.47 | 0.28 | 0.19 | 0.14 | 0.22 | 0.13 |
| | Random forest | 0.47 | 0.31 | 0.48 | 0.37 | 0.19 | 0.15 | 0.23 | 0.18 |
| | Latent class | 0.46 | 0.25 | 0.47 | 0.25 | 0.19 | 0.11 | 0.22 | 0.13 |
| | MI EMB | 0.47 | 0.41 | 0.50 | 0.46 | 0.19 | 0.19 | 0.24 | 0.23 |
| | MICE LOG | 0.48 | 0.44 | 0.51 | 1.67 | 0.19 | 0.21 | 0.25 | 0.43 |
| | MIMCA | 0.48 | 0.51 | 0.52 | 0.79 | 0.19 | 0.23 | 0.25 | 0.33 |
| | Complete case | 3.17 | 0.62 | 6.06 | 5.37 | 0.34 | 0.35 | 1.78 | 0.52 |

SE: standard error; MI EMB: multiple imputation using expectation–maximization with bootstrapping; MICE LOG: multivariate imputation by chained equations–based logistic regression; MIMCA: multiple imputation using multiple correspondence analysis.
† Covariate with missing values.
The median width and coverage (%) of the confidence intervals calculated for small (n = 200) and large (n = 1000) samples, at four levels of missingness (5%, 10%, 20% and 40%).
| | | n = 200 | | | | n = 1000 | | | |
|---|---|---|---|---|---|---|---|---|---|
| | | Gender | Medication† | Stimulants | Drug use† | Gender | Medication† | Stimulants | Drug use† |
| 5% missing | Hot deck | 1.99 (97) | 1.70 (98) | 1.85 (89) | 2.25 (97) | 0.87 (97) | 0.72 (96) | 0.79 (92) | 0.99 (79) |
| | Random forest | 1.99 (99) | 1.70 (98) | 1.85 (89) | 2.34 (99) | 0.87 (97) | 0.73 (96) | 0.79 (90) | 1.02 (94) |
| | Latent class | 1.99 (99) | 1.70 (98) | 1.85 (89) | 2.31 (99) | 0.87 (97) | 0.72 (96) | 0.79 (89) | 1.01 (83) |
| | MI EMB | 1.99 (100) | 1.70 (98) | 1.85 (90) | 2.42 (100) | 0.87 (97) | 0.73 (95) | 0.79 (90) | 1.04 (97) |
| | MICE LOG | 2.00 (100) | 1.71 (98) | 1.84 (90) | 2.47 (100) | 0.87 (97) | 0.74 (95) | 0.79 (90) | 1.07 (98) |
| | MIMCA | 2.00 (100) | 1.71 (98) | 1.84 (90) | 2.45 (100) | 0.87 (97) | 0.73 (95) | 0.79 (90) | 1.05 (98) |
| | Complete case | 2.11 (96) | 1.80 (97) | 2.02 (87) | NA (96) | 0.93 (96) | 0.77 (95) | 0.88 (92) | 1.08 (95) |
| 10% missing | Hot deck | 1.98 (90) | 1.75 (98) | 1.85 (93) | 2.24 (90) | 0.86 (95) | 0.75 (98) | 0.79 (87) | 0.97 (77) |
| | Random forest | 2.00 (98) | 1.75 (99) | 1.85 (92) | 2.32 (98) | 0.87 (95) | 0.75 (98) | 0.79 (90) | 1.03 (81) |
| | Latent class | 1.99 (94) | 1.75 (99) | 1.85 (92) | 2.31 (94) | 0.86 (95) | 0.74 (98) | 0.79 (85) | 0.99 (83) |
| | MI EMB | 2.00 (99) | 1.77 (98) | 1.86 (89) | 2.41 (99) | 0.87 (96) | 0.76 (98) | 0.79 (91) | 1.09 (89) |
| | MICE LOG | 2.00 (100) | 1.80 (98) | 1.85 (88) | 2.53 (100) | 0.87 (97) | 0.80 (98) | 0.79 (89) | 1.19 (96) |
| | MIMCA | 2.00 (100) | 1.79 (97) | 1.86 (88) | 2.48 (100) | 0.87 (96) | 0.76 (98) | 0.79 (90) | 1.12 (99) |
| | Complete case | 2.24 (99) | 1.94 (95) | 2.26 (91) | 2.61 (99) | 1.01 (90) | 0.85 (98) | 1.01 (92) | 1.19 (94) |
| 20% missing | Hot deck | 1.98 (74) | 1.81 (98) | 1.87 (90) | 2.21 (74) | 0.86 (94) | 0.78 (97) | 0.79 (78) | 0.96 (42) |
| | Random forest | 1.99 (93) | 1.81 (98) | 1.87 (89) | 2.36 (93) | 0.87 (95) | 0.79 (96) | 0.79 (80) | 1.05 (61) |
| | Latent class | 1.99 (80) | 1.80 (98) | 1.85 (89) | 2.32 (80) | 0.86 (94) | 0.79 (96) | 0.79 (74) | 0.98 (47) |
| | MI EMB | 2.01 (95) | 1.85 (98) | 1.89 (89) | 2.52 (95) | 0.87 (96) | 0.80 (94) | 0.80 (91) | 1.10 (79) |
| | MICE LOG | 2.03 (98) | 1.90 (98) | 1.89 (89) | 2.84 (98) | 0.88 (97) | 0.92 (95) | 0.81 (91) | 1.41 (96) |
| | MIMCA | 2.02 (97) | 1.88 (97) | 1.90 (88) | 2.70 (97) | 0.87 (96) | 0.83 (92) | 0.80 (90) | 1.21 (97) |
| | Complete case | 2.50 (99) | 2.24 (95) | 2.78 (90) | NA‡ | 1.13 (81) | 1.02 (95) | 1.30 (97) | 1.41 (96) |
| 40% missing | Hot deck | 1.97 (49) | 1.95 (99) | 1.85 (90) | 2.18 (49) | 0.86 (94) | 0.84 (99) | 0.78 (67) | 0.94 (40) |
| | Random forest | 1.98 (71) | 1.96 (99) | 1.85 (89) | 2.33 (71) | 0.86 (95) | 0.85 (100) | 0.79 (69) | 1.06 (44) |
| | Latent class | 1.98 (56) | 1.96 (99) | 1.83 (89) | 2.25 (56) | 0.86 (94) | 0.84 (99) | 0.77 (55) | 0.97 (40) |
| | MI EMB | 2.00 (92) | 2.12 (99) | 1.93 (90) | 2.63 (92) | 0.87 (96) | 0.91 (98) | 0.81 (87) | 1.17 (56) |
| | MICE LOG | 2.03 (97) | 2.25 (98) | 1.95 (91) | 3.48 (97) | 0.90 (98) | 1.17 (99) | 0.85 (91) | 1.90 (100) |
| | MIMCA | 2.02 (97) | 1.88 (97) | 1.90 (91) | 2.70 (97) | 0.87 (98) | 0.83 (98) | 0.80 (92) | 1.21 (98) |
| | Complete case | 3.30 (98) | 3.12 (97) | 4.66 (84) | NA‡ | 1.53 (82) | 1.45 (96) | NA‡ | 2.00 (95) |

MI EMB: multiple imputation using expectation–maximization with bootstrapping; MICE LOG: multivariate imputation by chained equations–based logistic regression; MIMCA: multiple imputation using multiple correspondence analysis.
† Covariate with missing values.
‡ The confidence interval could not be computed for all subsets due to the amount of missing data.
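The four quantities reported in the tables above (mean bias, standard deviation of the bias, median confidence interval width, and coverage) can be summarised from simulation replicates roughly as follows. This is a minimal sketch, not the authors' code: the function name `performance`, the sign convention (estimate minus true value), and the replicate values in the usage example are assumptions made for illustration. Only the true coefficient 0.59 is taken from the "Full data" row for Gender.

```python
from statistics import mean, median, stdev


def performance(true_beta, estimates, intervals):
    """Summarise simulation replicates for a single regression coefficient.

    true_beta : coefficient estimated on the full data
    estimates : one coefficient estimate per simulated data set
    intervals : one (lower, upper) confidence interval per simulated data set
    """
    # Assumed sign convention: estimate minus true value.
    errors = [b - true_beta for b in estimates]
    widths = [upper - lower for lower, upper in intervals]
    covered = [lower <= true_beta <= upper for lower, upper in intervals]
    return {
        "bias": mean(errors),                  # mean difference
        "sd": stdev(errors),                   # spread of the bias
        "median_width": median(widths),        # median CI width
        "coverage_pct": 100 * sum(covered) / len(covered),
    }


# Hypothetical replicates, for illustration only.
estimates = [0.50, 0.70, 0.60, 0.64]
intervals = [(0.40, 0.60), (0.60, 0.80), (0.50, 0.70), (0.54, 0.74)]
print(performance(0.59, estimates, intervals))
```

Coverage here is the percentage of replicate intervals that contain the full-data coefficient, so a method can show small bias and still have poor coverage if its intervals are too narrow.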
Figure 2. An illustration of the estimated regression coefficients and 95% confidence intervals for all covariates after handling missing data with six different imputation methods and complete case analysis (CCA) on data from Trondheim (n = 199) and Oslo (n = 838). The horizontal line indicates a regression coefficient equal to 0, and a confidence interval that includes 0 indicates a statistically non-significant result.
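The reading rule in the Figure 2 caption can be illustrated with a short sketch. It assumes a normal-approximation (Wald) interval, which the source does not confirm was the interval type used; `wald_interval` and `excludes_zero` are hypothetical helper names. The first input pair is the "Full data" estimate and standard error for Gender from the tables above, and the second pair is invented.

```python
from statistics import NormalDist


def wald_interval(estimate, se, level=0.95):
    """Normal-approximation confidence interval (an assumed interval type)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


def excludes_zero(interval):
    """True when the whole interval lies on one side of 0."""
    lower, upper = interval
    return lower > 0 or upper < 0


# Gender, full data: estimate 0.59, SE 0.06 (from the tables above).
print(excludes_zero(wald_interval(0.59, 0.06)))
# Invented coefficient close to zero, for contrast.
print(excludes_zero(wald_interval(0.05, 0.06)))
```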