Alma B Pedersen1, Ellen M Mikkelsen1, Deirdre Cronin-Fenton1, Nickolaj R Kristensen1, Tra My Pham2, Lars Pedersen1, Irene Petersen3.
Abstract
Missing data are ubiquitous in clinical epidemiological research. Individuals with missing data may differ from those with no missing data in terms of the outcome of interest and prognosis in general. Missing data are often categorized into the following three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In clinical epidemiological research, missing data are seldom MCAR. Missing data can constitute considerable challenges in the analyses and interpretation of results and can potentially weaken the validity of results and conclusions. A number of methods have been developed for dealing with missing data. These include complete-case analyses, missing indicator method, single value imputation, and sensitivity analyses incorporating worst-case and best-case scenarios. If applied under the MCAR assumption, some of these methods can provide unbiased but often less precise estimates. Multiple imputation is an alternative method to deal with missing data, which accounts for the uncertainty associated with missing data. Multiple imputation is implemented in most statistical software under the MAR assumption and provides unbiased and valid estimates of associations based on information from the available data. The method affects not only the coefficient estimates for variables with missing data but also the estimates for other variables with no missing data.
Keywords: MAR; MCAR; MNAR; missing data; multiple imputation; observational study
Year: 2017 PMID: 28352203 PMCID: PMC5358992 DOI: 10.2147/CLEP.S129785
Source DB: PubMed Journal: Clin Epidemiol ISSN: 1179-1349 Impact factor: 4.790
An example of a situation when data are MAR rather than MCAR
| Observed data | Patients with BMI value (%) | Patients with missing BMI value (%) |
|---|---|---|
| Smoking | | |
| Yes | 80 | 20 |
| No | 60 | 40 |
| Comorbidity prior to diagnosis | | |
| Yes | 85 | 15 |
| No | 25 | 75 |
Abbreviations: BMI, body mass index; MAR, missing at random; MCAR, missing completely at random.
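The MAR pattern in the table above can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: the missingness probabilities (20% for smokers, 40% for non-smokers) come from the smoking rows of the table, while the 50% smoking prevalence, the BMI distribution, and all function names are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Probability that BMI is missing, by smoking status (from the table above).
P_MISSING_BY_SMOKING = {True: 0.20, False: 0.40}


def simulate_patient():
    """Return (smoker, bmi); bmi is None when the value is missing."""
    smoker = random.random() < 0.5   # assumed smoking prevalence (illustrative)
    bmi = random.gauss(27.0, 4.0)    # assumed BMI distribution (illustrative)
    # MAR: the chance of a missing BMI depends only on the observed
    # smoking status, not on the BMI value itself.
    missing = random.random() < P_MISSING_BY_SMOKING[smoker]
    return smoker, (None if missing else bmi)


def missing_fraction(rows):
    """Share of rows whose BMI is missing."""
    return sum(bmi is None for _, bmi in rows) / len(rows)


patients = [simulate_patient() for _ in range(20_000)]
frac_smokers = missing_fraction([p for p in patients if p[0]])
frac_non_smokers = missing_fraction([p for p in patients if not p[0]])

print(f"Missing BMI among smokers:     {frac_smokers:.2%}")
print(f"Missing BMI among non-smokers: {frac_non_smokers:.2%}")
```

Under MCAR the two fractions would be roughly equal. Here they differ by smoking status, which is what makes the data MAR given that smoking is observed.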
Proposed methods for dealing with missing data in the analytic phase
| Methods | Brief description | Assumption to achieve unbiased estimates | Advantages | Limitation(s) |
|---|---|---|---|---|
| Complete-case analysis | Include only individuals with complete information on all variables in the dataset | MCAR | • Simplicity | • Data may not be representative. Reduction of sample size and thereby of statistical power |
| Missing indicator method | For categorical variables, missing values are grouped into a “missing” category. For continuous variables, missing values are set to a fixed value (usually zero), and an extra indicator or dummy (1/0) variable is added to the main analytic model to indicate whether the value for that variable is missing | None | • Uses all available | • The magnitude and direction of bias difficult to predict |
| Single value imputation | Replace missing values by a single value (eg, mean score of the observed values or the most recently observed value for a given variable if data are measured longitudinally) | MCAR, only when estimating mean | • Run analyses as if data are complete | • Too small standard error (overestimation of precision of the results) |
| Sensitivity analyses with worst- and best-case scenarios | Missing data values are replaced with the highest or lowest value observed in the dataset | MCAR | • Simplicity | • Too small standard error and thereby overestimation of precision of the results |
| Multiple imputation | Missing data values are imputed based on the distribution of other variables in the dataset | MAR (but can handle both MCAR and MNAR) | • Variability more accurate for each missing value since it considers variability due to sampling and due to imputation (standard error close to that of having full dataset with true values) | • Room for error when specifying models |
Abbreviations: MAR, missing at random; MCAR, missing completely at random; MNAR, missing not at random.
Figure 1. Distribution of BMI values by outcome in full dataset (A) and in a dataset with 35% missing values (B) for BMI handled by creating a missing BMI category.
Abbreviation: BMI, body mass index.
Figure 2. Normal distribution of observed BMI in a full dataset of 10,000 observations.
Abbreviation: BMI, body mass index.
Figure 3. Distribution of BMI in a dataset of 10,000 observations, where 35% of BMI values are missing and replaced by the observed mean BMI value.
Abbreviation: BMI, body mass index.
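The effect described in the Figure 3 caption, where mean imputation piles 35% of the values onto a single point, can be sketched numerically. The sample size (10,000) and the missing fraction (35%) follow the caption. The BMI mean and standard deviation, and the choice to delete values completely at random, are assumptions made only for this illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

N = 10_000               # dataset size, as in the figure caption
MISSING_FRACTION = 0.35  # share of BMI values set to missing

# Assumed BMI distribution (illustrative, not taken from the source).
full = [random.gauss(27.0, 4.0) for _ in range(N)]

# Delete 35% of values completely at random (a simplifying assumption).
missing_idx = set(random.sample(range(N), int(N * MISSING_FRACTION)))
observed = [x for i, x in enumerate(full) if i not in missing_idx]

# Single value imputation: every missing BMI gets the observed mean.
observed_mean = statistics.fmean(observed)
imputed = [observed_mean if i in missing_idx else x for i, x in enumerate(full)]

sd_full = statistics.stdev(full)
sd_imputed = statistics.stdev(imputed)

print(f"SD of full data:         {sd_full:.2f}")
print(f"SD after mean imputation: {sd_imputed:.2f}")
print(f"Ratio: {sd_imputed / sd_full:.2f}")
```

Because the imputed values add no spread, the standard deviation shrinks by a factor of roughly the square root of the observed fraction (about 0.81 when 65% is observed). This is why analyses run on such data overstate precision.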
Figure 4. Selection of variables in order to create multiple imputed datasets when looking into the association between body mass index and transfusion risk.
Figure 5. The three main stages of implementing multiple imputation.
An example of the imputed missing BMI values generated with five imputed datasets
| Patient number | Imputed data set 1 | Imputed data set 2 | Imputed data set 3 | Imputed data set 4 | Imputed data set 5 |
|---|---|---|---|---|---|
| 10 | 25.3 | 26.4 | 27.0 | 24.8 | 29.7 |
| 25 | 19.7 | 21.3 | 22.3 | 20.5 | 23.8 |
| 23 | 22.1 | 27.6 | 22.9 | 28.1 | 25.8 |
| 150 | 20.1 | 22.5 | 23.4 | 21.7 | 23.0 |
| 175 | 19.7 | 20.2 | 21.2 | 22.4 | 21.9 |
Abbreviation: BMI, body mass index.
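Once the analysis model has been fitted in each imputed dataset, the per-dataset estimates are combined into one result. The source does not state which pooling rule was used. The sketch below assumes Rubin's rules, the standard choice, and uses hypothetical log odds ratios and standard errors for five imputed datasets. None of the numbers or names in it come from the source.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine per-dataset estimates with Rubin's rules.

    estimates: point estimates, one per imputed dataset
    variances: squared standard errors, one per imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)       # average point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between     # total variance
    return pooled, total


# Hypothetical log odds ratios for BMI and their standard errors,
# one pair per imputed dataset (illustrative values only).
log_ors = [-0.022, -0.025, -0.020, -0.027, -0.024]
ses = [0.0088, 0.0089, 0.0087, 0.0090, 0.0088]

pooled_log_or, total_var = pool_rubin(log_ors, [s ** 2 for s in ses])
pooled_or = math.exp(pooled_log_or)
pooled_se_log = math.sqrt(total_var)

print(f"Pooled OR: {pooled_or:.3f}")
print(f"Pooled SE (log scale): {pooled_se_log:.4f}")
```

The between-imputation term is what carries the uncertainty due to the missing values into the pooled standard error. With a single imputed dataset that term cannot be estimated.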
Association between BMI and risk of blood transfusion adjusted for age and gender
| Patient characteristics | Full data (n=3,500) | | | Complete case analysis | | | Multiple imputation | | | Multiple imputation | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | OR | SE | 95% CI | OR | SE | 95% CI | OR | SE | 95% CI | OR | SE | 95% CI |
| BMI | 0.980 | 0.0085 | (0.963, 0.997) | 0.978 | 0.0098 | (0.959, 0.997) | 0.976 | 0.0087 | (0.959, 0.994) | 0.978 | 0.0098 | (0.959, 0.997) |
| Age (years) | | | | | | | | | | | | |
| <75 | Baseline | | | | | | | | | | | |
| ≥75 | 2.100 | 0.1928 | (1.754, 2.514) | 2.244 | 0.2421 | (1.816, 2.772) | 2.097 | 0.1927 | (1.752, 2.511) | 2.098 | 0.1928 | (1.752, 2.511) |
| Gender | | | | | | | | | | | | |
| Female | Baseline | | | | | | | | | | | |
| Male | 0.815 | 0.0630 | (0.700, 0.948) | 0.906 | 0.0779 | (0.765, 1.072) | 0.818 | 0.0633 | (0.702, 0.952) | 0.817 | 0.0634 | (0.702, 0.951) |
Note: Results are presented for full-observed data, complete-case analysis, and multiple imputation and contain point estimates for ORs, SEs, and 95% CIs.
Abbreviations: BMI, body mass index; CI, confidence interval; OR, odds ratio; SE, standard error.
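The relationship between the OR, SE, and 95% CI columns can be checked with a short calculation. The sketch assumes that each SE is reported on the odds ratio scale, so that the standard error of the log odds ratio is approximately SE/OR (delta method), and that the interval is a Wald interval built on the log scale. The source does not state either point, and the function name is hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def or_confidence_interval(odds_ratio, se_or, z=Z_95):
    """Wald CI for an odds ratio, given its SE on the OR scale (assumed)."""
    se_log_or = se_or / odds_ratio   # delta method: SE(log OR) ~ SE(OR) / OR
    log_or = math.log(odds_ratio)
    return (math.exp(log_or - z * se_log_or),
            math.exp(log_or + z * se_log_or))


# OR and SE values from the full-data columns of the table above.
full_data = {
    "BMI": (0.980, 0.0085),
    "Age >=75": (2.100, 0.1928),
    "Male": (0.815, 0.0630),
}

for name, (odds_ratio, se_or) in full_data.items():
    lower, upper = or_confidence_interval(odds_ratio, se_or)
    print(f"{name}: ({lower:.3f}, {upper:.3f})")
```

Under these assumptions the computed limits agree with the full-data intervals in the table to about three decimals, which suggests the table's SEs are indeed on the OR scale.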