Catherine A Welch, Irene Petersen, Jonathan W Bartlett, Ian R White, Louise Marston, Richard W Morris, Irwin Nazareth, Kate Walters, James Carpenter.
Abstract
Most implementations of multiple imputation (MI) of missing data are designed for simple rectangular data structures and ignore the temporal ordering of data. Therefore, when applying MI to longitudinal data with intermittent patterns of missing data, alternative strategies must be considered. One approach is to divide the data into time blocks and implement MI independently at each block. An alternative approach is to include all time blocks in the same MI model. With increasing numbers of time blocks, this approach is likely to break down because of collinearity and over-fitting. The new two-fold fully conditional specification (FCS) MI algorithm addresses these issues by conditioning only on measurements that are local in time. We describe and report the results of a novel simulation study to critically evaluate the two-fold FCS algorithm and its suitability for imputation of longitudinal electronic health records. After generating a full data set, approximately 70% of selected continuous and categorical variables were made missing completely at random in each of ten time blocks. Subsequently, we applied a simple time-to-event model. We compared the efficiency of estimated coefficients from a complete records analysis, MI of data in the baseline time block and the two-fold FCS algorithm. The results show that the two-fold FCS algorithm maximises the use of the data available, with the gain relative to baseline MI depending on the strength of correlations within and between variables. Using this approach also increases the plausibility of the missing at random assumption by using repeated measures over time of variables whose baseline values may be missing.
Keywords: longitudinal electronic health records; missing data; multiple imputation; partially observed
Year: 2014 PMID: 24782349 PMCID: PMC4285297 DOI: 10.1002/sim.6184
Source DB: PubMed Journal: Stat Med ISSN: 0277-6715 Impact factor: 2.373
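The simulation design summarised in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the data are hypothetical, the function names `make_mcar` and `local_blocks` are invented for illustration, and a window of one adjacent time block on each side is an assumption, since the abstract only says that the two-fold FCS algorithm conditions on measurements that are local in time.

```python
import random

def make_mcar(records, p_missing=0.7, seed=0):
    """Copy `records`, setting each value to None independently with
    probability `p_missing` (missing completely at random)."""
    rng = random.Random(seed)
    return [[None if rng.random() < p_missing else value for value in row]
            for row in records]

def local_blocks(t, n_blocks, width=1):
    """Indices of the time blocks within `width` of block `t`, excluding `t`.
    The default width of 1 is an assumption, not taken from the paper."""
    start, stop = max(0, t - width), min(n_blocks, t + width + 1)
    return [s for s in range(start, stop) if s != t]

# Hypothetical full data: 1000 patients, one continuous measurement
# in each of ten time blocks.
n_patients, n_blocks = 1000, 10
rng = random.Random(1)
full = [[rng.gauss(80, 10) for _ in range(n_blocks)] for _ in range(n_patients)]

observed = make_mcar(full, p_missing=0.7)
frac_missing = sum(v is None for row in observed for v in row) / (n_patients * n_blocks)
print(f"fraction missing: {frac_missing:.2f}")

# Blocks an imputation model for block t would condition on under this sketch.
print(local_blocks(0, n_blocks), local_blocks(4, n_blocks), local_blocks(9, n_blocks))
# → [1] [3, 5] [8]
```

Restricting the predictors to neighbouring blocks keeps the number of covariates in each imputation model fixed as the number of time blocks grows, which is how the approach avoids the collinearity and over-fitting mentioned in the abstract.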
Log hazard ratios from fitting an exponential model to predict risk of coronary heart disease: estimates from a data sample from The Health Improvement Network, followed by the average of estimates across 1000 simulations based on full data, complete records, 'baseline fully conditional specification (FCS) multiple imputation', and missing values imputed using the two-fold FCS algorithm.
| Variable | Level | THIN cohort | Full data | Complete records | Baseline FCS imputation | Two-fold FCS, DGM I | Two-fold FCS, DGM II |
|---|---|---|---|---|---|---|---|
| Townsend deprivation score quintile | 1 | Reference | | | | | |
| | 2 | 0.1520 | 0.1503 | 0.1425 | 0.1497 | 0.1498 | 0.1588 |
| | 3 | 0.2377 | 0.2367 | 0.2431 | 0.2366 | 0.2366 | 0.2422 |
| | 4 | 0.2433 | 0.2400 | 0.2279 | 0.2391 | 0.2401 | 0.2535 |
| | 5 | 0.4034 | 0.4024 | 0.3935 | 0.4020 | 0.4023 | 0.4017 |
| Weight (kg) | | 0.0019 | 0.0019 | 0.0015 | 0.0016 | 0.0019 | 0.0017 |
| Systolic blood pressure (mmHg) | | 0.0048 | 0.0049 | 0.0051 | 0.0048 | 0.0051 | 0.0053 |
| Anti-hypertensive drug treatment | | 0.2935 | 0.2868 | 0.2852 | 0.2897 | 0.2855 | 0.2915 |
| Smoking status | Non-smoker | Reference | | | | | |
| | Ex-smoker | 0.0679 | 0.0692 | 0.0633 | 0.0672 | 0.0579 | 0.0567 |
| | Current smoker | 0.2386 | 0.2385 | 0.2307 | 0.2342 | 0.2325 | 0.2261 |
| Age group (years) | 40–44 | −1.2820 | −1.2872 | −1.3167 | −1.2869 | −1.2880 | −1.2890 |
| | 45–49 | −1.0632 | −1.0652 | −1.0892 | −1.0655 | −1.0662 | −1.0623 |
| | 50–54 | −0.6402 | −0.6392 | −0.6467 | −0.6398 | −0.6408 | −0.6330 |
| | 55–59 | −0.3589 | −0.3597 | −0.3700 | −0.3598 | −0.3605 | −0.3536 |
| | 60–64 | −0.2485 | −0.2473 | −0.2545 | −0.2480 | −0.2481 | −0.2423 |
| | 65–69 | −0.0396 | −0.0416 | −0.0470 | −0.0418 | −0.0409 | −0.0348 |
| | 70–74 | Reference | | | | | |
| | 75–79 | 0.1108 | 0.1039 | 0.1116 | 0.1043 | 0.1057 | 0.1129 |
| | 80+ | 0.1387 | 0.1421 | 0.1255 | 0.1383 | 0.1414 | 0.1368 |
| Constant term | | −5.1993 | −5.2297 | −5.2550 | −5.2098 | −5.2552 | −5.2833 |
DGM, data generation mechanism; FCS, fully conditional specification; THIN, The Health Improvement Network.
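Because the table reports coefficients on the log hazard scale, exponentiating them gives hazard ratios relative to the reference category. The following is a small illustrative sketch using three values copied from the THIN cohort column; the dictionary keys are labels chosen here for readability, not names from the paper.

```python
import math

# Log hazard ratios copied from the THIN cohort column of the table above.
log_hr = {
    "Townsend quintile 5 vs 1": 0.4034,
    "Current smoker vs non-smoker": 0.2386,
    "Age 40-44 vs 70-74": -1.2820,
}

# A hazard ratio is the exponential of the log hazard ratio.
hazard_ratio = {name: math.exp(beta) for name, beta in log_hr.items()}

for name, hr in hazard_ratio.items():
    print(f"{name}: HR = {hr:.2f}")
```

On this scale, the most deprived quintile has a hazard roughly 1.5 times that of the least deprived, and the 40–44 age group has a hazard a little under 0.3 times that of the 70–74 reference group.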
Standard errors (SE) from fitting the exponential model to predict risk of coronary heart disease to the full simulated data, together with the SE and empirical SE from fitting the same model to (i) the complete records after values in the full simulated data sets were set to missing, (ii) data imputed using baseline fully conditional specification (FCS) imputation and (iii) data imputed using the two-fold FCS algorithm.
| Variable | Level | Full data: SE | Complete records: SE | Complete records: empirical SE | Baseline FCS imputation: SE | Baseline FCS imputation: empirical SE | Two-fold FCS, DGM I: SE | Two-fold FCS, DGM I: empirical SE | Two-fold FCS, DGM II: SE | Two-fold FCS, DGM II: empirical SE |
|---|---|---|---|---|---|---|---|---|---|---|
| Townsend deprivation score quintile | 1 | Reference | | | | | | | | |
| | 2 | 0.1286 | 0.2528 | 0.2472 | 0.1290 | 0.1304 | 0.1281 | 0.1295 | 0.1249 | 0.1267 |
| | 3 | 0.1326 | 0.2621 | 0.2630 | 0.1349 | 0.1366 | 0.1340 | 0.1363 | 0.1266 | 0.1322 |
| | 4 | 0.1419 | 0.2793 | 0.2817 | 0.1442 | 0.1462 | 0.1432 | 0.1473 | 0.1341 | 0.1404 |
| | 5 | 0.1541 | 0.3087 | 0.3133 | 0.1600 | 0.1627 | 0.1586 | 0.1606 | 0.1535 | 0.1563 |
| Weight (kg) | | 0.0032 | 0.0064 | 0.0062 | 0.0063 | 0.0067 | 0.0041 | 0.0041 | 0.0043 | 0.0041 |
| Systolic blood pressure (mmHg) | | 0.0026 | 0.0055 | 0.0055 | 0.0054 | 0.0056 | 0.0050 | 0.0049 | 0.0053 | 0.0050 |
| Anti-hypertensive drug treatment | | 0.0957 | 0.1923 | 0.1933 | 0.1113 | 0.1133 | 0.1060 | 0.1033 | 0.1109 | 0.1060 |
| Smoking status | Non-smoker | Reference | | | | | | | | |
| | Ex-smoker | 0.1074 | 0.2117 | 0.2153 | 0.2104 | 0.2289 | 0.2064 | 0.2180 | 0.1330 | 0.1326 |
| | Current smoker | 0.1143 | 0.2260 | 0.2302 | 0.2221 | 0.2410 | 0.2161 | 0.2312 | 0.1538 | 0.1489 |
| Age group (years) | 40–44 | 0.2311 | 0.4659 | 0.4936 | 0.2484 | 0.2538 | 0.2425 | 0.2448 | 0.2410 | 0.2409 |
| | 45–49 | 0.2137 | 0.4321 | 0.4395 | 0.2287 | 0.2328 | 0.2231 | 0.2236 | 0.2228 | 0.2220 |
| | 50–54 | 0.1872 | 0.3673 | 0.3762 | 0.1954 | 0.2021 | 0.1907 | 0.1962 | 0.1871 | 0.1895 |
| | 55–59 | 0.1734 | 0.3576 | 0.3657 | 0.1867 | 0.1817 | 0.1832 | 0.1791 | 0.1827 | 0.1825 |
| | 60–64 | 0.1783 | 0.3589 | 0.3706 | 0.1846 | 0.1848 | 0.1819 | 0.1831 | 0.1753 | 0.1815 |
| | 65–69 | 0.1764 | 0.3545 | 0.3671 | 0.1802 | 0.1815 | 0.1790 | 0.1801 | 0.1727 | 0.1784 |
| | 70–74 | Reference | | | | | | | | |
| | 75–79 | 0.1914 | 0.3885 | 0.3879 | 0.1998 | 0.1959 | 0.1966 | 0.1955 | 0.1936 | 0.1957 |
| | 80+ | 0.2028 | 0.4122 | 0.4272 | 0.2132 | 0.2106 | 0.2076 | 0.2071 | 0.2139 | 0.2103 |
| Constant term | | 0.4554 | 0.9369 | 0.9516 | 0.8885 | 0.9092 | 0.7481 | 0.7406 | 0.7930 | 0.7371 |
DGM, data generation mechanism; FCS, fully conditional specification.
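One way to read the standard errors above is as efficiency relative to the full-data analysis, taken here as the squared ratio of the full-data SE to each method's SE. This summary measure is a convention adopted for this sketch rather than a calculation reported in the source. The example uses the model-based SEs for weight (kg); the dictionary keys and the helper `relative_efficiency` are hypothetical names.

```python
# Model-based SEs for weight (kg), copied from the table above.
se_weight = {
    "full_data": 0.0032,
    "complete_records": 0.0064,
    "baseline_fcs": 0.0063,
    "two_fold_fcs_first_dgm": 0.0041,
    "two_fold_fcs_second_dgm": 0.0043,
}

def relative_efficiency(se_method, se_reference):
    """Variance ratio: values near 1 mean little information was lost
    relative to the reference analysis."""
    return (se_reference / se_method) ** 2

efficiency = {
    name: relative_efficiency(se, se_weight["full_data"])
    for name, se in se_weight.items()
}

for name, value in efficiency.items():
    print(f"{name}: {value:.2f}")
```

For weight, the complete records and baseline FCS analyses retain about a quarter of the full-data efficiency, while the two-fold FCS analyses retain noticeably more, consistent with the abstract's statement that the algorithm makes fuller use of the available data.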