Peter C Austin, Ian R White, Douglas S Lee, Stef van Buuren.
Abstract
Missing data are common in clinical research. They occur when the values of the variables of interest are not measured or recorded for all subjects in the sample. Common approaches to addressing missing data include complete-case analysis, in which subjects with missing data are excluded, and mean-value imputation, in which missing values are replaced with the mean of that variable among subjects for whom it was observed. However, in many settings, these approaches can lead to biased estimates of statistics (eg, regression coefficients) and/or confidence intervals that are artificially narrow. Multiple imputation (MI) is a popular approach for addressing missing data. With MI, multiple plausible values of a given variable are imputed, or filled in, for each subject who has missing data for that variable. This results in multiple completed data sets. Identical statistical analyses are conducted in each completed data set, and the results are pooled across the completed data sets. We provide an introduction to MI and discuss issues in its implementation, including developing the imputation model, choosing how many imputed data sets to create, and handling derived variables. We illustrate the application of MI through an analysis of data on patients hospitalised with heart failure, focusing on developing a model to estimate the probability of 1-year mortality in the presence of missing data. Statistical software code for conducting MI in R, SAS, and Stata is provided.
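The pooling step mentioned in the abstract is commonly carried out with Rubin's rules. The abstract itself does not name the pooling method, so the following is a minimal sketch under that assumption. The function name `pool_rubin` and its arguments are hypothetical; the inputs are the point estimate and squared standard error of one coefficient from each of the M completed data sets.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across M completed data sets (Rubin's rules).

    estimates: point estimates of the coefficient, one per imputed data set
    variances: squared standard errors of those estimates
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)            # average within-imputation variance
    between = variance(estimates)       # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, sqrt(total)


# Example with made-up numbers for three imputed data sets
estimate, std_error = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what single imputation methods such as mean-value imputation leave out, which is one reason their confidence intervals can be too narrow.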
Year: 2020 PMID: 33276049 PMCID: PMC8499698 DOI: 10.1016/j.cjca.2020.11.010
Source DB: PubMed Journal: Can J Cardiol ISSN: 0828-282X Impact factor: 5.223
Multivariate imputation by chained equations (MICE) algorithm for multiple imputation
1. Specify an imputation model for each of the …
2. For each of the …
3. For the first variable that is subject to missing data:
   (a) Regress this first variable on all the other variables, using those subjects with complete data on the first variable and the observed or currently imputed values of the other variables.
   (b) Extract the estimated regression coefficients and their variance-covariance matrix from the regression model estimated in (a), together with the estimated variance of the residual distribution if a linear regression model was fit for a continuous variable.
   (c) Using the quantities obtained in (b), randomly perturb the estimated regression coefficients in a way that reflects the degree of uncertainty arising from the data.
   (d) Using the set of perturbed regression coefficients obtained in (c), determine the conditional distribution of the first variable for each subject with missing data on that variable.
   (e) Draw a value of the variable from this conditional distribution for each subject with missing data on the first variable.
4. Repeat step 3 for each of the variables that is subject to missing data. Steps 3 and 4 form 1 cycle of the imputation process for creating 1 imputed data set.
5. Repeat steps 3 and 4 the desired number of times (suggested 5 to 20 cycles). The values imputed in the final cycle are used as the imputed values in the first imputed data set.
6. Repeat steps 2-5 M times to produce M imputed data sets (the choice of M, the number of imputed data sets, is discussed in the section How Many Imputations: How Large Should M Be?).
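The algorithm above can be sketched in Python with NumPy. This is an illustration under several assumptions, not the authors' code: every variable is treated as continuous and imputed with a Bayesian linear regression (binary variables would need, for example, a logistic imputation model), the initial fill-in before the first cycle uses random draws from the observed values, and the names `draw_imputations` and `mice` are hypothetical.

```python
import numpy as np


def draw_imputations(y_obs, X_obs, X_mis, rng):
    """Steps 3(a)-(e) for one continuous variable (linear imputation model)."""
    n, p = X_obs.shape
    # (a) regress the variable on the others among subjects where it is observed
    xtx_inv = np.linalg.inv(X_obs.T @ X_obs)
    beta_hat = xtx_inv @ X_obs.T @ y_obs
    resid = y_obs - X_obs @ beta_hat
    # (b)-(c) perturb the residual variance, then the coefficients
    sigma2 = resid @ resid / rng.chisquare(n - p)
    cov = sigma2 * xtx_inv
    cov = (cov + cov.T) / 2          # guard against tiny numerical asymmetry
    beta = rng.multivariate_normal(beta_hat, cov)
    # (d)-(e) draw from the conditional distribution for subjects with missing data
    noise = rng.normal(0.0, np.sqrt(sigma2), size=X_mis.shape[0])
    return X_mis @ beta + noise


def mice(data, n_imputations=5, n_cycles=10, seed=0):
    """Return a list of completed copies of `data` (NaN marks a missing value)."""
    rng = np.random.default_rng(seed)
    missing = np.isnan(data)
    incomplete = [j for j in range(data.shape[1]) if missing[:, j].any()]
    completed_sets = []
    for _ in range(n_imputations):                     # step 6
        filled = data.copy()
        for j in incomplete:                           # assumed initial fill-in
            donors = data[~missing[:, j], j]
            filled[missing[:, j], j] = rng.choice(donors, size=missing[:, j].sum())
        for _ in range(n_cycles):                      # step 5
            for j in incomplete:                       # steps 3 and 4
                others = np.delete(filled, j, axis=1)
                X = np.column_stack([np.ones(len(filled)), others])
                obs = ~missing[:, j]
                filled[~obs, j] = draw_imputations(filled[obs, j], X[obs], X[~obs], rng)
        completed_sets.append(filled)
    return completed_sets
```

Calling `mice(data, n_imputations=5)` on a two-dimensional float array returns five completed arrays in which the observed entries are unchanged and only the missing entries differ between data sets.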
Descriptive statistics of case study data
| Variable | Mean (SD) or % | No. of subjects with observed data | No. of subjects with missing data | Percentage of subjects with missing data |
|---|---|---|---|---|
| Continuous variables | | | | |
| Age, y | 76.7 (11.6) | 8338 | 0 | 0% |
| Respiratory rate at admission, breaths per minute | 24.5 (7.0) | 8138 | 200 | 2.4% |
| Glucose (initial lab test), mmol/L | 8.6 (4.1) | 8051 | 287 | 3.4% |
| Urea (initial lab test), mmol/L | 10.3 (6.6) | 8028 | 310 | 3.7% |
| LDL cholesterol, mmol/L | 2.2 (0.9) | 2272 | 6066 | 72.8% |
| Binary variables | | | | |
| Female | 50.9% | 8338 | 0 | 0% |
| S3 | 6.2% | 8126 | 212 | 2.5% |
| S4 | 2.7% | 8135 | 203 | 2.4% |
| Neck vein distension | 66.1% | 7586 | 752 | 9.0% |
| Cardiomegaly on chest X-ray | 47.7% | 7711 | 627 | 7.5% |
| Outcome | | | | |
| Death within 1 year | 31.7% | 8338 | 0 | 0% |
LDL, low-density lipoprotein; S3, third heart sound; S4, fourth heart sound.
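The last column of the table follows directly from the two count columns. As a small check, the sketch below recomputes the percentage of subjects with missing data from the counts reported in the table; the dictionary name `counts` and the shortened variable labels are illustrative only.

```python
# (observed, missing) counts copied from the table above
counts = {
    "Respiratory rate": (8138, 200),
    "Glucose": (8051, 287),
    "Urea": (8028, 310),
    "LDL cholesterol": (2272, 6066),
    "S3": (8126, 212),
    "S4": (8135, 203),
    "Neck vein distension": (7586, 752),
    "Cardiomegaly": (7711, 627),
}

# Percentage missing = missing / (observed + missing), rounded to 1 decimal place
percent_missing = {
    name: round(100 * missing / (observed + missing), 1)
    for name, (observed, missing) in counts.items()
}

for name, pct in percent_missing.items():
    print(f"{name}: {pct}%")
```

Every row sums to 8338 subjects. A complete-case analysis that includes LDL cholesterol could therefore retain at most the 2272 subjects for whom it was observed.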
Figure 1. Estimated log-odds ratios and 95% confidence intervals for variables in the logistic regression model fit in the case study. There are 3 estimates/confidence intervals for each of the 10 variables: analyses using complete cases (grey); multiple imputation analyses when using parametric imputation (blue); and multiple imputation analyses when using predictive mean matching (PMM) (red). LDL, low-density lipoprotein; S3, third heart sound; S4, fourth heart sound.
Figure 2. Distribution of continuous variables in complete cases and in those with imputed data when using parametric imputation. The solid black line denotes the distribution of the given continuous variable in those subjects for whom that variable was not missing. The dashed red lines denote the distribution of the imputed value for that variable in those subjects for whom the variable was missing. There is 1 red line for each of the imputed data sets. LDL, low-density lipoprotein.
Figure 3. Distribution of continuous variables in complete cases and in those with imputed data when using predictive mean matching (PMM). The solid black line denotes the distribution of the given continuous variable in those subjects for whom that variable was not missing. The dashed red lines denote the distribution of the imputed value for that variable in those subjects for whom the variable was missing. There is 1 red line for each of the imputed data sets. LDL, low-density lipoprotein.