Joseph Futoma, Morgan Simons, Finale Doshi-Velez, Rishikesan Kamaleswaran.
Abstract
OBJECTIVE: Specific factors affecting generalizability of clinical prediction models are poorly understood. Our main objective was to investigate how measurement indicator variables affect external validity in clinical prediction models for predicting onset of vasopressor therapy.
Keywords: decision support tools; external validity; generalizability; machine learning; statistical modeling; vasopressor therapy
Year: 2021 PMID: 34235453 PMCID: PMC8238368 DOI: 10.1097/CCE.0000000000000453
Source DB: PubMed Journal: Crit Care Explor ISSN: 2639-8028
Figure 1. Results on all three cohorts, as a function of hours in advance of potential vasopressor onset. Models were trained to predict potential onset at each hour from 1 hr in advance up to 12 hr in advance, and models were evaluated in the same fashion (i.e., each 4-hr model was then evaluated internally and externally at 4 hr in advance across datasets). Top row shows areas under the receiver operating characteristic curves (AUROCs, also known as C-statistics), and the bottom row shows areas under the precision-recall (AUPR) curves as metrics assessing overall discrimination. Each column shows the performance of all fitted models on one cohort: Methodist floor (left), Methodist ICU (center), and Beth Israel ICU (right). Results within a column for models trained on that data source are in-sample results measuring internal validity, whereas results for models learned from other data sources are out-of-sample and measure external validity. For each evaluation data source, results on 12 different models are shown. Models with a name beginning with “A” were fit from the Methodist floor data and appear in blue throughout. Models with a name beginning with “B” were fit from the Methodist ICU data and appear in green throughout. Models with a name beginning with “C” were fit from the Beth Israel ICU data and appear in red throughout. Models with a name ending in “−1” are the combined models that use both physiologic and measurement indicator variables, both when fitting models and during evaluation; their lines are solid. Models with a name ending in “−2” are the combined models that use both physiologic and measurement indicator variables during model fitting but only use physiologic variables during evaluation; their lines are dashed-dotted. Models with a name ending in “−3” are the models solely using physiologic variables; their lines are dashed. Models with a name ending in “−4” are the models solely using the measurement indicator variables; their lines are dotted. An important finding in the figure is that models learned on a data source always perform better in-sample on that data source than models learned from other data sources; this is seen by the clustering of blue, green, and red lines at the top of each relevant pane. Another key finding is that the combined models (solid lines) always perform best in-sample but not out-of-sample.
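The distinction between the “−1” and “−2” evaluation variants in the caption can be illustrated with a small sketch. This is not the authors' code: the feature names (`map_z`, `lactate_measured`), the coefficient values, and the toy rows are hypothetical, and the sketch assumes that using only physiologic variables at evaluation means dropping the indicator terms from the fitted linear predictor. Because AUROC depends only on the ranking of scores, the log-odds are scored directly and the intercept is omitted.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_odds(row, coefs, keep):
    """Linear predictor restricted to the features named in `keep` (intercept omitted)."""
    return sum(coefs[name] * row[name] for name in keep)


# Hypothetical coefficients of one fitted combined model (illustrative values only).
coefs = {"map_z": -0.8, "lactate_measured": 1.2}
physiologic = ["map_z"]
indicators = ["lactate_measured"]

# Toy evaluation rows: standardized MAP and whether lactate was measured.
rows = [
    {"map_z": -1.5, "lactate_measured": 1},
    {"map_z": -0.5, "lactate_measured": 1},
    {"map_z": -1.0, "lactate_measured": 0},
    {"map_z": 0.5, "lactate_measured": 0},
    {"map_z": 1.0, "lactate_measured": 1},
]
labels = [1, 1, 0, 0, 0]  # 1 = vasopressor onset, 0 = no onset

combined_scores = [log_odds(r, coefs, physiologic + indicators) for r in rows]
physio_part_scores = [log_odds(r, coefs, physiologic) for r in rows]

auroc_combined = auroc(combined_scores, labels)        # "-1" style evaluation
auroc_physio_part = auroc(physio_part_scores, labels)  # "-2" style evaluation
print(auroc_combined, auroc_physio_part)
```

In this toy setting the indicator carries real signal, so dropping it lowers discrimination; the caption's point is that on an external cohort, where measurement practice differs, the same term may hurt rather than help.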
Figure 3. Learned coefficients from the 4-hr models for each of the three cohorts, restricted to features where there is at least one statistically significant difference in sign between coefficients for two different cohorts (as indicated by Wald test p values of <0.05 for both coefficients and with both coefficients of opposite sign). Points are shown for the point estimate of each regression coefficient, along with 95% CIs. Top row shows physiologic features, and the bottom row shows the indicator features. Models trained on Methodist floor data are shown in blue, Methodist ICU data in green, and Beth Israel ICU data in red. Left column shows coefficients from the combined models that use both physiologic and indicator variables during model fitting, whereas the right column shows results from the models fit separately to only physiology variables (top) and only indicators (bottom). There are six statistically significant sign changes among the physiology-only model coefficients, and all six involved a Beth Israel-derived model. This number decreases to only two when examining the combined model, which is evidence that the use of indicators during model fitting helps learn more robust physiologic relationships that generalize better. There are five statistically significant sign changes in both the indicators-only models and the indicator components of the combined models. In both cases, all five again involved a Beth Israel model, along with one significant change between a Methodist floor and ICU model. ASBP = arterial systolic blood pressure, BIDMC = Beth Israel Deaconess Medical Center, MAP = mean arterial pressure, MLH = Methodist LeBonheur Healthcare, Plt = platelets, RR = respiration rate.
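The selection rule in the caption (both coefficients significant by Wald test at p < 0.05, and of opposite sign) can be written down compactly. The sketch below is illustrative only: it assumes the usual normal approximation for the Wald statistic, and the function names and the (estimate, standard error) pairs are hypothetical rather than values from the study.

```python
import math


def wald_summary(beta, se, z_crit=1.959964):
    """Wald z statistic, two-sided normal p value, and 95% CI for one coefficient."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of the standard normal
    return {"z": z, "p": p, "ci": (beta - z_crit * se, beta + z_crit * se)}


def significant_sign_flip(coef_a, coef_b, alpha=0.05):
    """True when both (beta, se) pairs are significant and the estimates have opposite signs."""
    (beta_a, se_a), (beta_b, se_b) = coef_a, coef_b
    both_significant = (
        wald_summary(beta_a, se_a)["p"] < alpha
        and wald_summary(beta_b, se_b)["p"] < alpha
    )
    return both_significant and beta_a * beta_b < 0


# Hypothetical (estimate, standard error) pairs for one feature in two cohorts.
cohort_a = (0.50, 0.10)
cohort_c = (-0.40, 0.10)
print(significant_sign_flip(cohort_a, cohort_c))
```

Requiring significance on both sides keeps noisy coefficients that merely straddle zero from being counted as sign changes.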
Figure 2. Differences in performance between the combined models (lines ending in “−1” in Figure 1) compared with the physiology-only models (lines ending in “−3” in Figure 1) and models using the physiologic component of the combined model but discarding the indicators component at evaluation (lines ending in “−2” in Figure 1). The difference in performance between the combined model and physiology-only models is shown in dashed lines, and the difference in performance between the full combined model and just using the combined model’s physiologic components is shown in solid lines. Blue lines denote models fit to Methodist floor data, green lines denote models fit to Methodist ICU data, and red lines denote models fit to Beth Israel ICU data. The top row shows differences in areas under the receiver operating characteristic curves (AUROCs, also known as C-statistics), and the bottom row shows differences in areas under the precision-recall (AUPR) curves as metrics assessing overall discrimination. Values above 0 indicate that the model under evaluation performed better than the combined model; values less than 0 indicate that the combined model performed better. Each column shows the performance of all fitted models on one cohort: Methodist floor at left, Methodist ICU at center, and Beth Israel ICU at right. Results within a column for models trained on that data source are in-sample results measuring internal validity, whereas results for models learned from other data sources are out-of-sample and measure external validity. The right column shows that the physiology-only models and the physiologic components of the combined models both perform worse than the combined model when validated internally on Beth Israel data, with the physiology-only models a bit better. However, out-of-sample, this is flipped: the combined model typically fares worst and using only the physiologic component of the combined model is best, with physiology-only models faring somewhere in the middle. For models fit to Methodist data, there are less obvious differences between the physiology-only models and using just the physiologic components of the combined models, and in fact, the full combined models typically fare best.
Figure 4. Visualizations of model predictions for different physiologic variables, from the 4-hr onset models. Each pane shows a different predictor variable. From top left, clockwise, they are: systolic blood pressure (SBP), heart rate (HR), respiration rate (RR), mean arterial pressure (MAP), Fio2, and lactate. Models with a name beginning with “A” were fit from the Methodist floor data and appear in blue throughout. Models with a name beginning with “B” were fit from the Methodist ICU data and appear in green throughout. Models with a name beginning with “C” were fit from the Beth Israel ICU data and appear in red throughout. Models with a name ending in “−1” are the combined models that use both physiologic and measurement indicator variables; their lines are solid. Models with a name ending in “−2” are the models that use only physiologic variables during model fitting; their lines are dashed. The curves indicate a model’s change in log-odds of risk of vasopressor as a function of that predictor variable on the x-axis. The y-axis is shifted such that 0 coincides with the mean of each feature value; the units on the y-axis are relative and not absolute, only denoting change in log-odds as a function of modifying this single predictor. Variables displayed in the top row are examples of predictors where there were no major changes between the combined model and physiology-only models. The difference in RR between the Methodist floor cohort and the two ICU cohorts likely reflects the fact that ICU patients are more likely to be on ventilators. The bottom row shows examples of predictors where there were large changes between the combined and physiology-only models. The bizarre MAP relationship learned by the Beth Israel ICU physiology-only model is corrected in the combined model, with low MAP associated with higher risk of vasopressor need. Likewise, the strange fitted curves learned by the Methodist floor model for Fio2 and for lactate appear more reasonable in the combined model.
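The relative y-axis described in the caption can be illustrated as follows. This is a sketch under stated assumptions, not the authors' plotting code: it reads the caption as meaning that each curve is shifted so the change in log-odds is 0 at the feature's mean, and the `map_contribution` function, its 65 mm Hg threshold, its slope, and the assumed mean of 75 are all hypothetical.

```python
def centered_curve(contribution, grid, feature_mean):
    """Change in log-odds relative to the feature mean, for each grid value."""
    baseline = contribution(feature_mean)
    return [contribution(x) - baseline for x in grid]


# Hypothetical contribution of MAP (mm Hg) to the log-odds: risk rises as MAP
# falls below 65 and is flat above it. Illustrative values only.
def map_contribution(map_value):
    return 0.05 * max(0.0, 65.0 - map_value)


grid = [45.0, 55.0, 65.0, 75.0, 85.0]
curve = centered_curve(map_contribution, grid, feature_mean=75.0)
for x, y in zip(grid, curve):
    print(f"MAP {x:5.1f}: change in log-odds {y:+.2f}")
```

Centering each curve this way makes shapes comparable across models and cohorts, since only the change attributable to the single predictor is displayed.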
Background Characteristics of Cohorts
| Variable | Methodist Floor: 539 Inpatient Stays, Vasopressor Administered (0.9%) | Methodist Floor: 59,211 Inpatient Stays, No Vasopressor Administered (99.1%) | Methodist ICU Stays: 265 ICU Stays, Vasopressor Administered (12.4%) | Methodist ICU Stays: 1,872 ICU Stays, No Vasopressor Administered (87.6%) | Beth Israel: 1,499 ICU Stays, Vasopressor Administered (11.5%) | Beth Israel: 11,500 ICU Stays, No Vasopressor Administered (88.5%) |
|---|---|---|---|---|---|---|
| Age, median (5%, 25%, 75%, and 95% quantiles) | 66.0 (36.6, 56.0, 75.0, 87.0) | 59.0 (25.0, 43.0, 72.0, 88.0) | 64.0 (35.0, 54.0, 71.0, 86.0) | 62.0 (31.0, 53.0, 72.0, 85.0) | 67.1 (37.8, 56.6, 77.9, 88.3) | 64.1 (27.9, 51.1, 77.8, 90.0) |
| Male sex, n (%) | 314 (58.3) | 24,575 (41.5) | 147 (55.5) | 961 (51.3) | 861 (57.4) | 6,311 (54.9) |
| Inhospital mortality, n (%) | 213 (39.5) | 843 (1.4) | 118 (44.5) | 242 (12.9) | 547 (36.5) | 2,214 (19.3) |
| LOS (ICU, for ICU cohorts; admission for floor cohort), hr, median (5%, 25%, 75%, and 95% quantiles) | 225.8 (50.2, 124.9, 345.2, 574.7) | 69.4 (22.9, 44.3, 119.2, 268.8) | 205.1 (17.0, 93.0, 324.5, 497.9) | 72.2 (12.8, 32.8, 189.7, 421.3) | 130.8 (28.9, 67.4, 255.3, 474.0) | 42.9 (18.3, 26.2, 70.9, 185.3) |
| LOS ≥7 d, n (%) | 363 (67.3) | 8,349 (14.1) | 152 (57.4) | 511 (27.3) | 600 (40.0) | 696 (6.0) |
| Self-reported race, n (%) | | | | | | |
| Black/African-American | 265 (49.2) | 31,695 (53.5) | 163 (61.5) | 1,123 (60.0) | 102 (6.8) | 1,130 (9.8) |
| White/Caucasian | 256 (47.5) | 24,995 (42.2) | 89 (33.6) | 693 (37.0) | 1,098 (73.2) | 8,392 (73.0) |
| Other | 6 (1.1) | 625 (1.0) | 3 (1.1) | 21 (1.1) | 44 (2.9) | 348 (3.0) |
| Asian | 4 (0.7) | 444 (0.7) | 1 (0.4) | 6 (0.3) | 56 (3.7) | 301 (2.6) |
| Hispanic/Latino | 5 (0.9) | 1,128 (1.9) | 8 (3.0) | 26 (1.4) | 42 (2.8) | 488 (4.2) |
| Unknown/unable/declined | 3 (0.6) | 324 (0.5) | 1 (0.4) | 3 (0.2) | 157 (10.5) | 841 (7.3) |
| Acute Physiology and Chronic Health Evaluation II score in first 24 hr (no chronic health points), median (5%, 25%, 75%, and 95% quantiles) | 9 (2, 6, 14, 23) | 5 (0, 3, 8, 13) | 13 (4, 8, 18, 26) | 12 (4, 8, 17, 25) | 20 (9, 15, 25, 31) | 15 (7, 11, 20, 27) |
| Highest lactate in first 24 hr, median (5%, 25%, 75%, and 95% quantiles) | 3.6 (1.2, 2.1, 9.5, 15.2) | 2.0 (1.1, 1.4, 3.0, 7.2) | 2.5 (1.2, 1.8, 3.9, 9.4) | 2.2 (1.1, 1.5, 3.4, 8.0) | 2.3 (1.0, 1.7, 3.5, 7.3) | 1.8 (0.8, 1.3, 2.6, 4.8) |
| Presence of lactate measurement in first 24 hr, n (%) | 48 (8.9) | 527 (0.9) | 52 (19.6) | 186 (9.9) | 1,268 (84.6) | 7,282 (63.3) |
| Lowest mean arterial pressure in first 24 hr, median (5%, 25%, 75%, and 95% quantiles) | 65 (45, 57, 77.3, 99) | 84 (59, 73, 95, 114) | 65 (49, 57.5, 75, 101.8) | 71 (46.1, 62, 83, 100) | 58 (40, 50.5, 67, 85) | 61 (41, 54, 69, 83) |
| Lowest Glasgow Coma Scale in first 24 hr, median (5%, 25%, 75%, and 95% quantiles) | 15 (3, 14, 15) | 15 (13, 15) | 15 (3, 11, 15) | 15 (3.6, 10, 15, 15) | 11 (3, 5, 15) | 14 (3, 9, 15) |
LOS = length of stay.
Background characteristics of the three cohorts: the Methodist LeBonheur Healthcare (MLH) floor cohort, the MLH ICU cohort, and the Beth Israel Deaconess Medical Center ICU cohort. Each cohort is further broken down by the primary outcome in this study: whether vasopressor therapy was ever initiated. Median values along with 5%, 25%, 75%, and 95% quantiles are presented for continuous variables. There is a higher proportion of African Americans at MLH, with no other major demographic differences. The ICU cohorts have higher overall acuity, as evidenced by their higher in-hospital mortality and Acute Physiology and Chronic Health Evaluation (APACHE)-II scores. Note that the APACHE-II score was calculated without using chronic health points due to data availability.