Andrea Marshall, Douglas G Altman, Patrick Royston, Roger L Holder.
Abstract
BACKGROUND: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore, a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.
Year: 2010 PMID: 20085642 PMCID: PMC2824146 DOI: 10.1186/1471-2288-10-7
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Table 1: Data structure for the breast cancer dataset and associated means and standard deviations (SDs) after suitable transformation
| Covariate | Variable Type | Groupings/Measurement | Label | Mean (SD) |
|---|---|---|---|---|
| X1 | Continuous | Years | Age | 53.05 (10.12) |
| X2 | Continuous | Number of | LN | 1.16 (0.94) |
| X3 | Continuous | fmol | PGR | 3.35 (1.93) |
| X4 | Continuous | fmol | ER | 3.35 (1.84) |
| X5 | Binary | 1 = Yes, | TRT | 0.36 (0.48) |
| X6 | Binary | 0 = Pre, | MENO | 0.58 (0.49) |
| X7 | Binary | 0 = Grade I, | TG | 0.88 (0.32) |
| X8 | Continuous variable categorised | 1 = ≤20 mm, | TS | 3.27 (0.46) |
Note: Data from the breast cancer dataset for X2 and X8 were log transformed; X3 and X4 were transformed using log(X+1).
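The transformations in the note can be sketched in a few lines. This is a minimal illustration only: the record layout (a dict keyed by `X1` to `X8`) and the function name `transform_covariates` are hypothetical, and the example values are made up rather than taken from the dataset.

```python
import math


def transform_covariates(row):
    """Apply the transformations from the table note to one record.

    `row` is a hypothetical dict keyed by covariate label (X1..X8).
    """
    out = dict(row)
    for key in ("X2", "X8"):       # plain log transform
        out[key] = math.log(row[key])
    for key in ("X3", "X4"):       # log(X + 1), which stays defined when X is 0
        out[key] = math.log1p(row[key])
    return out


# Made-up record, for illustration only.
example = {"X1": 53, "X2": 3, "X3": 0, "X4": 25, "X8": 30}
print(transform_covariates(example))
```

Using log(X + 1) for X3 and X4 keeps zero measurements usable, since the plain logarithm is undefined at zero.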
Figure 1a: Distribution of the covariates for the German breast cancer dataset; b: Distribution of the transformed continuous covariates in the German breast cancer dataset.
Table 2: Specification of the missing data mechanisms to be imposed
| Mechanism | X3 (PGR) | X2 (LN) | X5 (TRT) | X8 (TS) |
|---|---|---|---|---|
| MCAR | β0 | β0 + ln(OR)MX3 | β0 + ln(OR)MX2 | β0 + ln(OR)MX3 |
| MAR | ln(0.8)X4 | ln(3)X1 | ln(0.7)ln(t) | ln(7)X7 |
| MNAR | ln(1.3)X3 | ln(0.6)X2 | ln(8)X5 | ln(0.9)X8 |
| Combined | ln(0.7)ln(t) + | ln(3)X1 | ln(0.9)X8 | |
Note: A logistic regression model was used to model the probability of missingness for each incomplete covariate. The entries in the table represent the variables associated with the missingness of each incomplete covariate. For the MAR, MNAR, and combined mechanisms, the terms given are additional to those for the MCAR mechanism, e.g. the MAR mechanism for X2 is

logit P(MX2 = 1) = β0 + ln(OR)MX3 + ln(3)X1

where β0 is the intercept, estimated by solving the above equation using the specified probabilities of missingness for X2 and X3 and the average covariate value of X1; MX3 is the missingness indicator for covariate X3, which equals 1 if an observation is missing and 0 if the value is observed; and OR is the odds ratio for the relationship between the missingness of X2 and X3, obtained from Table 3.

The coefficients for the variables associated with each mechanism were modified from relationships with missing data seen in another study [27] to provide significant associations. All continuous variables, including survival (t), were standardised by dividing by the standard deviation. When the mechanisms included other covariates subjected to missingness, the original complete data were used.
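The MAR mechanism for X2 can be sketched as follows. This is a minimal sketch, not the authors' code. It assumes the model logit P(MX2 = 1) = β0 + ln(OR)·MX3 + ln(3)·X1, read from the MCAR and MAR table entries for X2, and one reading of the note in which β0 is obtained by plugging the target probabilities and the average of X1 into that equation. The function names, the odds ratio of 5, the 10% missingness targets and the simulated ages are all illustrative assumptions.

```python
import math
import random
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve_intercept(p_x2, p_x3, mean_x1, odds_ratio):
    # Assumed reading of the note: substitute the target probability for X2,
    # the probability of missingness for X3 and the average of X1 into
    #   logit P(MX2 = 1) = b0 + ln(OR) * MX3 + ln(3) * X1
    # and solve for b0.
    return logit(p_x2) - math.log(odds_ratio) * p_x3 - math.log(3) * mean_x1


def impose_mar_on_x2(x1, m_x3, p_x2, p_x3, odds_ratio, rng):
    """Return missingness indicators for X2 (1 = missing, 0 = observed)."""
    sd = statistics.stdev(x1)
    x1_std = [v / sd for v in x1]  # standardise by dividing by the SD
    beta0 = solve_intercept(p_x2, p_x3, statistics.mean(x1_std), odds_ratio)
    indicators = []
    for x, m in zip(x1_std, m_x3):
        prob = inv_logit(beta0 + math.log(odds_ratio) * m + math.log(3) * x)
        indicators.append(1 if rng.random() < prob else 0)
    return indicators


# Illustrative inputs only: simulated ages and a 10% MCAR indicator for X3.
rng = random.Random(2010)
x1 = [rng.gauss(53.05, 10.12) for _ in range(2000)]
m_x3 = [1 if rng.random() < 0.10 else 0 for _ in range(2000)]
m_x2 = impose_mar_on_x2(x1, m_x3, p_x2=0.10, p_x3=0.10, odds_ratio=5.0, rng=rng)
print("proportion of X2 set to missing:", sum(m_x2) / len(m_x2))
```

Because the logistic function is non-linear, solving for the intercept at the average predictor values gives a realised proportion of missing values that is close to, but not exactly, the target.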
Table 3: Odds ratios (OR) to be specified in the missing data mechanisms given in Table 2
| Mechanism for | OR for | 5% missingness | 10% missingness | 25% missingness | 50% missingness | 75% missingness |
|---|---|---|---|---|---|---|
| | | 101.12 | 45.5 | 15.68 | 5.50 | 2.17 |
| | | 42.04 | 20.78 | 7.41 | 3.00 | 1.51 |
| | | 45.14 | 14.23 | 5.44 | 1.92 | 0.92 |
Note: MXi is the missingness indicator for covariate Xi, which equals 1 if an observation is missing and 0 if the value is observed.
OR represents the odds of having two variables with missing observations, and was calculated using the proportion of missing values for each variable and the degree of overlap between variables for each of the five overall amounts of missingness to be imposed.
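The note does not give the formula used, so the following is only a sketch under an assumption: that the OR is the usual odds ratio of the 2×2 table formed by two missingness indicators, built from each variable's proportion of missing values and the proportion of observations missing on both. The function name and the example proportions are hypothetical.

```python
def missingness_odds_ratio(p_a, p_b, p_both):
    """Odds ratio between two missingness indicators.

    p_a, p_b : proportion of observations missing on each variable
    p_both   : proportion missing on both variables (the overlap)
    """
    p11 = p_both                     # missing on both
    p10 = p_a - p_both               # missing on A only
    p01 = p_b - p_both               # missing on B only
    p00 = 1.0 - p_a - p_b + p_both   # observed on both
    if min(p11, p10, p01, p00) <= 0:
        raise ValueError("every cell of the 2x2 table must be positive")
    return (p11 * p00) / (p10 * p01)


# Hypothetical proportions: 10% missing on each variable, 5% missing on both.
print(missingness_odds_ratio(0.10, 0.10, 0.05))
```

Under this definition, an OR of 1 means the two indicators are independent, and larger values mean the missing values are concentrated in the same observations.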
Table 4: Summary of the missing data methods investigated
| Method Label | Method Description | Library used within R statistical software | Number of iterations |
|---|---|---|---|
| | Complete case analysis: analyses only cases with complete data for all covariates | - | |
| | Single imputation performed using PMM | | 20 |
| | Multiple imputation (MI) using data augmentation approach | 'norm' | 100 |
| | MI using data augmentation approach using a general location model | 'mix' | 100 |
| | MI using data augmentation approach using a general location model, but imputed values are not truncated to within plausible range | 'mix' | 100 |
| | MI using regression switching imputation | 'mice' | 20 |
| | MI using MICE with PMM | | 20 |
| | MI using MICE with PMM without transforming the incomplete covariates | | 20 |
| | MI using flexible additive imputation models | | 1 |
Key: PMM = predictive mean matching; MI = multiple imputation
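As an illustration of predictive mean matching, the sketch below imputes one incomplete variable from one fully observed predictor. It is a simplified sketch and not the procedure of any of the R libraries in the table: it fits a single least-squares line and omits the random draw of regression parameters that full implementations usually include. The function names, the donor pool size `k` and the toy data are assumptions.

```python
import random


def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pmm_impute(x, y, k=3, rng=None):
    """Return a copy of y with None entries filled by predictive mean matching.

    x is fully observed; y uses None for missing values.
    """
    rng = rng or random.Random()
    observed = [(a, b) for a, b in zip(x, y) if b is not None]
    intercept, slope = fit_simple_ols([a for a, _ in observed],
                                      [b for _, b in observed])
    # Each donor carries its predicted mean and its actually observed value.
    donors = [(intercept + slope * a, b) for a, b in observed]
    completed = []
    for a, b in zip(x, y):
        if b is not None:
            completed.append(b)
            continue
        target = intercept + slope * a
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        completed.append(rng.choice(nearest)[1])  # borrow an observed value
    return completed


# Toy data, for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [1.1, 2.0, None, 4.2, 5.1, None]
print(pmm_impute(x, y, k=2, rng=random.Random(0)))
```

Because every imputed value is borrowed from an observed case, PMM imputations always fall within the range of the observed data.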
Figure 2: Regression coefficient estimates for different missing data methods for increasing percentage of MAR missingness.
Figure 3: Average standard error (SE) estimates for different missing data methods for increasing percentage of MAR missingness.
Figure 4: Coverage of the regression coefficient estimates for different missing data methods for increasing percentage of MAR missingness.
Figure 5: Significance of the covariates in the prognostic model for different missing data methods and increasing percentage of MAR missingness.
Figure 6: Model performance measures for different missing data methods for increasing percentage of MAR missingness. a) Likelihood ratio test, b) Nagelkerke R² statistic, c) Prognostic separation D statistic and d) Predicted 2-year survival from Cox model.
Figure 7: Comparison of the regression coefficient estimates for the different MI methods after imposing MAR and MNAR mechanisms.
Figure 8: Comparison of coverage estimates for the different MI methods after imposing MAR and MNAR mechanisms.