| Literature DB >> 35177653 |
Lin Lawrence Guo1, Stephen R Pfohl2, Jason Fries2, Alistair E W Johnson1, Jose Posada2, Catherine Aftandilian3, Nigam Shah2, Lillian Sung4,5.
Abstract
Temporal dataset shift associated with changes in healthcare over time is a barrier to deploying machine learning-based clinical decision support systems. Algorithms that learn robust models by estimating invariant properties across time periods for domain generalization (DG) and unsupervised domain adaptation (UDA) might be suitable to proactively mitigate dataset shift. The objective was to characterize the impact of temporal dataset shift on clinical prediction models and to benchmark DG and UDA algorithms for improving model robustness. In this cohort study, intensive care unit patients from the MIMIC-IV database were categorized by year group (2008-2010, 2011-2013, 2014-2016 and 2017-2019). Tasks were predicting mortality, long length of stay, sepsis and invasive ventilation. Feedforward neural networks were used as prediction models. The baseline experiment trained models using empirical risk minimization (ERM) on 2008-2010 (ERM[08-10]) and evaluated them on subsequent year groups. The DG experiment trained models on 2008-2016 using algorithms that estimate invariant properties across year groups and evaluated them on 2017-2019. The UDA experiment additionally leveraged unlabelled samples from 2017-2019 for unsupervised distribution matching. DG and UDA models were compared to ERM[08-16] models trained on 2008-2016. Main performance measures were area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve and absolute calibration error. Threshold-based metrics, including false positives and false negatives, were used to assess the clinical impact of temporal dataset shift and its mitigation strategies. In the baseline experiments, dataset shift was most evident for sepsis prediction (maximum AUROC drop, 0.090; 95% confidence interval (CI), 0.080-0.101).
Considering a scenario of 100 consecutively admitted patients showed that ERM[08-10] applied to 2017-2019 was associated with one additional false negative among 11 patients with sepsis, compared to the model applied to 2008-2010. When compared with ERM[08-16], the DG and UDA experiments failed to produce more robust models (range of AUROC difference, −0.003 to 0.050). In conclusion, DG and UDA failed to produce more robust models compared to ERM in the setting of temporal dataset shift. Alternate approaches are required to preserve model performance over time in clinical medicine.
Year: 2022 PMID: 35177653 PMCID: PMC8854561 DOI: 10.1038/s41598-022-06484-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Data splitting procedure for baseline, domain generalization (DG) and unsupervised domain adaptation (UDA) experiments. Different shades of the same color indicate that they were used to train or evaluate different models. For instance, in the baseline experiment, the training set of each year group was used to learn models for that year group. In the DG experiment, the training year groups were kept separate to allow DG algorithms to estimate invariance across the year groups. In comparison, in the UDA experiment, data from the training year groups were pooled, and unlabeled samples from the target year group were leveraged for unsupervised distribution matching between training and target year groups. In addition, ERM[8–16] models were learned on pooled data from the training year groups (2008–2016) to be used as ERM comparators for DG and UDA models.
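The splitting scheme in Figure 1 can be sketched as follows. This is an illustrative sketch, not the authors' code; the `year_group` field and the plain-hyphen label strings are assumptions for the example (MIMIC-IV records a similar era label per patient):

```python
from collections import defaultdict

YEAR_GROUPS = ["2008-2010", "2011-2013", "2014-2016", "2017-2019"]

def split_by_year_group(stays):
    """Bucket ICU stays by the year-group label attached to each stay."""
    buckets = defaultdict(list)
    for stay in stays:
        if stay["year_group"] in YEAR_GROUPS:
            buckets[stay["year_group"]].append(stay)
    return buckets

def dg_uda_splits(buckets, target="2017-2019"):
    """DG keeps the training year groups as separate environments;
    UDA pools them (labels kept) and borrows unlabelled target samples."""
    train_groups = [g for g in YEAR_GROUPS if g != target]
    dg_environments = {g: buckets[g] for g in train_groups}
    pooled_train = [s for g in train_groups for s in buckets[g]]
    return dg_environments, pooled_train, buckets[target]
```

The same buckets serve all three experiments: the baseline trains and tests within one group, DG consumes `dg_environments`, and UDA consumes `pooled_train` plus unlabelled samples from the target bucket.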
Cohort characteristics by year group.
| Characteristic | 2008–2010 | 2011–2013 | 2014–2016 | 2017–2019 |
|---|---|---|---|---|
| **Mortality** | | | | |
| Patients, no. (% pos) | 9042 (7.4%) | 9476 (7.1%) | 10,289 (7.4%) | 10,060 (7.2%) |
| Age, mean ± SD | 63 ± 18 | 63 ± 18 | 64 ± 17 | 64 ± 17 |
| Sex, no. (%) | | | | |
| Female | 3864 (43%) | 4090 (43%) | 4430 (43%) | 4170 (41%) |
| Male | 5178 (57%) | 5386 (57%) | 5859 (57%) | 5890 (59%) |
| Race, no. (%) | | | | |
| White | 6784 (75%) | 6217 (66%) | 6468 (63%) | 6129 (61%) |
| Other | 2258 (25%) | 3259 (34%) | 3821 (37%) | 3931 (39%) |
| **Long length of stay** | | | | |
| Patients, no. (% pos) | 9042 (29.8%) | 9476 (28.4%) | 10,289 (31.0%) | 10,060 (35.2%) |
| Age, mean ± SD | 63 ± 18 | 63 ± 18 | 64 ± 17 | 64 ± 17 |
| Sex, no. (%) | | | | |
| Female | 3864 (43%) | 4090 (43%) | 4430 (43%) | 4170 (41%) |
| Male | 5178 (57%) | 5386 (57%) | 5859 (57%) | 5890 (59%) |
| Race, no. (%) | | | | |
| White | 6784 (75%) | 6217 (66%) | 6468 (63%) | 6129 (61%) |
| Other | 2258 (25%) | 3259 (34%) | 3821 (37%) | 3931 (39%) |
| **Sepsis** | | | | |
| Patients, no. (% pos) | 6692 (10.2%) | 7181 (10.1%) | 7447 (12.1%) | 7311 (11.4%) |
| Age, mean ± SD | 64 ± 18 | 64 ± 18 | 64 ± 17 | 64 ± 17 |
| Sex, no. (%) | | | | |
| Female | 2947 (44%) | 3133 (44%) | 3318 (45%) | 3124 (43%) |
| Male | 3745 (56%) | 4048 (56%) | 4129 (55%) | 4187 (57%) |
| Race, no. (%) | | | | |
| White | 5078 (76%) | 4839 (67%) | 4920 (66%) | 4727 (65%) |
| Other | 1614 (24%) | 2342 (33%) | 2527 (34%) | 2584 (35%) |
| **Invasive ventilation** | | | | |
| Patients, no. (% pos) | 5410 (12.7%) | 5557 (10.2%) | 6217 (10.8%) | 7161 (10.6%) |
| Age, mean ± SD | 62 ± 19 | 62 ± 18 | 62 ± 18 | 63 ± 17 |
| Sex, no. (%) | | | | |
| Female | 2334 (43%) | 2442 (44%) | 2809 (45%) | 2948 (41%) |
| Male | 3076 (57%) | 3115 (56%) | 3408 (55%) | 4213 (59%) |
| Race, no. (%) | | | | |
| White | 4032 (75%) | 3648 (66%) | 3945 (63%) | 4447 (62%) |
| Other | 1378 (25%) | 1909 (34%) | 2272 (37%) | 2714 (38%) |
pos: positive labels; SD: standard deviation.
Figure 2. Mean performance (AUROC, AUPRC, and ACE) of models in the baseline experiment. Solid blue lines depict models trained using 2008–2010 (ERM[8–10]) and evaluated in each year group. Dashed lines depict models trained and evaluated in each year group separately (comparators). Error bars indicate 95% confidence intervals obtained from 10,000 bootstrap iterations. Black circles indicate statistically significant differences in performance based on the 95% confidence interval of the difference over 10,000 bootstrap iterations when comparing ERM[8–10] and comparators for each year group. The figure shows temporal dataset shift that is larger for the Long LOS and Sepsis tasks. ERM: empirical risk minimization; LOS: length of stay; AUROC: area under the receiver operating characteristics curve; AUPRC: area under the precision recall curve; ACE: absolute calibration error.
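The bootstrap comparison described in the caption can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the study code: AUROC is computed from pairwise rank comparisons, and the 95% CI of the difference between two models' scores comes from percentiles of paired bootstrap resamples.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the probability that a random positive outranks a random negative."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auroc_diff(y, score_a, score_b, n_boot=10_000, seed=0):
    """Percentile 95% CI for AUROC(score_b) - AUROC(score_a), paired bootstrap."""
    y, score_a, score_b = map(np.asarray, (y, score_a, score_b))
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():  # resample must contain both classes
            continue
        diffs.append(auroc(y[idx], score_b[idx]) - auroc(y[idx], score_a[idx]))
    return tuple(np.percentile(diffs, [2.5, 97.5]))
```

A difference is flagged as significant, as in the figure, when the resulting interval excludes zero.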
Figure 3. Difference in mean performance of DG and UDA approaches relative to ERM[8–16] in the target year group (2017–2019). Performance of ERM[8–10] (train set 2008–2010 and test set 2017–2019, dashed line) and ERM[17–19] (train and test sets 2017–2019, solid line) models are also shown for comparison. Error bars indicate 95% confidence intervals obtained from 10,000 bootstrap iterations. Here, we show results from three of the four experimental conditions using differing numbers of unlabelled samples for UDA; we did not observe meaningful differences across the numbers of unlabelled samples evaluated. Numerical representation of the performance measures relative to ERM[8–16] is presented in Supplementary Table S3. LOS: length of stay; ERM: empirical risk minimization; IRM: invariant risk minimization; AL: adversarial learning; GroupDRO: group distributionally robust optimization; CORAL: correlation alignment; MMD: maximum mean discrepancy; AUROC: area under the receiver operating characteristics curve; AUPRC: area under the precision recall curve; ACE: absolute calibration error; DG: domain generalization; UDA: unsupervised domain adaptation.
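Of the distribution-matching penalties listed in the caption, CORAL (correlation alignment; Sun & Saenko, 2016) has a particularly compact form: it penalises the squared Frobenius distance between source and target feature covariances. A minimal numpy sketch of the penalty, not the study implementation:

```python
import numpy as np

def coral_penalty(source_feats, target_feats):
    """CORAL penalty: squared Frobenius distance between the feature
    covariance matrices of source and target, scaled by 1/(4 d^2)."""
    d = source_feats.shape[1]
    cov_s = np.cov(source_feats, rowvar=False)
    cov_t = np.cov(target_feats, rowvar=False)
    return float(np.sum((cov_s - cov_t) ** 2) / (4 * d ** 2))

# During training the penalty is added to the ordinary task loss, e.g.
#   total_loss = erm_loss + lam * coral_penalty(h_source, h_target)
# where h_* are hidden-layer activations and lam is a tuned weight.
```

In the UDA setting of this study, the "target" activations would come from the unlabelled 2017–2019 samples, pulling the learned representation toward one whose second-order statistics match across eras.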
Clinical interpretation of temporal dataset shift in sepsis prediction.
Scenario set-up: there are 100 consecutive patients admitted to the ICU between 2017 and 2019. Management will differ depending on whether the predicted risk of sepsis is greater than 10%a at 4 h after admission. The table below illustrates the anticipated performance of ERM[8–10] based on its performance in 2008–2010 (first column), the implications of dataset shift indicated by its actual performance in 2017–2019 (second column), the results of updating the model (ERM[8–16]b) and of mitigation by a representative UDA approach (third and fourth columns), and ERM[17–19] (last column).

| Training set | 2008–2010 | 2008–2010 | 2008–2016 | 2008–2016 + 1500 unlabelled samples from 2017–2019 | 2017–2019 |
|---|---|---|---|---|---|
| Test set | 2008–2010 | 2017–2019 | 2017–2019 | 2017–2019 | 2017–2019 |
| Learning algorithm | ERM | ERM | ERM | AL (UDA) | ERM |
| **Diagnostic properties in test set** | | | | | |
| Sensitivity | 0.65 | 0.57 | 0.76 | 0.75 | 0.61 |
| Specificity | 0.72 | 0.64 | 0.68 | 0.68 | 0.74 |
| PPV | 0.22 | 0.16 | 0.22 | 0.22 | 0.23 |
| NPV | 0.94 | 0.92 | 0.95 | 0.95 | 0.94 |
| False positives among 89c patients who did not develop sepsis | 25 | 32 | 29 | 28 | 23 |
| False negatives among 11d patients who developed sepsis | 4 | 5 | 3 | 3 | 4 |
Sepsis was chosen as this task exhibited the most discrimination deterioration due to temporal dataset shift.
ERM: empirical risk minimization; AL: domain adversarial learning; UDA: unsupervised domain adaptation.
a10% threshold for sepsis was chosen as a clinically reasonable value although results across thresholds from 5 to 45% are shown in Supplementary Table S5.
bERM[8–16] models are the fair ERM comparators to DG and UDA models given they were trained using 2008–2016.
c,dOutcome prevalence was estimated from the average sepsis prevalence across 2008–2019. The table illustrates initial model development with training and evaluation in the earliest period, 2008–2010 (first column), which represents the performance clinicians would anticipate when applying the model to patients admitted in 2017–2019 if the model were not updated. The second column shows the actual performance of that model on those patients, i.e. the impact of temporal dataset shift. Together, the first two columns illustrate the clinical impact of temporal dataset shift for the task with the most extreme shift, sepsis: among the 11 patients who developed sepsis, the number of false negatives increased by one. The table also shows the impact of retraining with more recent data (third column) and of one approach to mitigating dataset shift, AL (UDA) (fourth column); results of AL (UDA) were almost identical to those of ERM[8–16]. For illustrative purposes, the last column shows ERM[17–19], in which the training and test sets both come from 2017–2019.
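The threshold-based counts in the table follow directly from sensitivity, specificity, and the assumed 11/89 split; a quick sketch of that arithmetic (illustrative only; published counts derive from unrounded metrics, so rounding can differ by one in some columns):

```python
def expected_errors(sensitivity, specificity, n_pos=11, n_neg=89):
    """Expected false negatives and false positives among 100 admissions,
    11 with sepsis and 89 without (the split assumed in the table)."""
    false_negatives = round((1 - sensitivity) * n_pos)
    false_positives = round((1 - specificity) * n_neg)
    return false_negatives, false_positives

# ERM[8-10] evaluated in 2008-2010: sensitivity 0.65, specificity 0.72
# ERM[8-10] evaluated in 2017-2019: sensitivity 0.57, specificity 0.64
```

Applying this to the first two columns reproduces the 4 → 5 false-negative and 25 → 32 false-positive changes that quantify the clinical cost of the shift.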