Adrian G Zucco, Rudi Agius, Rebecka Svanberg, Kasper S Moestrup, Ramtin Z Marandi, Cameron Ross MacPherson, Jens Lundgren, Sisse R Ostrowski, Carsten U Niemann.
Abstract
Interpretable risk assessment of SARS-CoV-2 positive patients can help clinicians implement precision medicine. Here we trained a machine learning model to predict mortality within 12 weeks of a first positive SARS-CoV-2 test. By leveraging data on 33,938 confirmed SARS-CoV-2 cases in eastern Denmark, we considered 2723 variables extracted from electronic health records (EHR) including demographics, diagnoses, medications, laboratory test results and vital parameters. A discrete-time framework for survival modelling enabled us to predict personalized survival curves and explain individual risk factors. Performance on the test set was measured with a weighted concordance index of 0.95 and an area under the curve for precision-recall of 0.71. Age, sex, number of medications, previous hospitalizations and lymphocyte counts were identified as top mortality risk factors. Our explainable survival model developed on EHR data also revealed temporal dynamics of the 22 selected risk factors. Upon further validation, this model may allow direct reporting of personalized survival probabilities in routine care.
Year: 2022 PMID: 35974050 PMCID: PMC9380679 DOI: 10.1038/s41598-022-17953-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996
Figure 1. Overview of the data sources, feature engineering and modelling approach for predicting 12-week mortality in SARS-CoV-2 positive patients. (a) Electronic Health Records (EHR) of 33,938 patients from 17th of March 2020 to 2nd of March 2021 (incidence curve) in eastern Denmark (geographical region visualized in red) were used to predict 12-week mortality from the first positive SARS-CoV-2 test (FPT). (b) Features were engineered as the last value observed prior to FPT within the last month for vitals and laboratory values. To encode hospital admissions, medications and diagnoses, the count of occurrences within three or one year(s) prior to FPT was used. (c) Machine learning algorithms were trained for survival modelling using a discrete-time approach. Time-to-event data were transformed longitudinally into patient-weeks up to the loss of follow-up (0) or death (1). With the augmented data, binary classification was performed by gradient boosting decision trees to predict personalized survival distributions for each patient and provide explanations of individual risk factors using SHAP values.
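The feature engineering in panel (b) and the patient-week transformation in panel (c) can be illustrated with a minimal sketch. The function names, the 30-day reading of "within the last month" and the input layouts are assumptions made for illustration; they are not taken from the authors' code.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional, Sequence, Tuple


def last_value_before(measurements: Sequence[Tuple[date, float]], fpt: date,
                      window_days: int = 30) -> Optional[float]:
    """Most recent value measured in [fpt - window_days, fpt], or None if absent.

    Assumption: "within the last month" is approximated as 30 days.
    """
    start = fpt - timedelta(days=window_days)
    in_window = [(d, v) for d, v in measurements if start <= d <= fpt]
    if not in_window:
        return None  # the feature stays missing for this patient
    return max(in_window, key=lambda dv: dv[0])[1]


def count_before(event_dates: Sequence[date], fpt: date, window_days: int) -> int:
    """Number of events (admissions, diagnoses, prescriptions) in the window before FPT."""
    start = fpt - timedelta(days=window_days)
    return sum(1 for d in event_dates if start <= d <= fpt)


def to_patient_weeks(patient_id: str, weeks_observed: int, died: bool,
                     horizon: int = 12) -> List[Dict]:
    """Expand one patient into one row per week at risk.

    weeks_observed is the (1-based) week of death or loss of follow-up.
    Every row is labelled 0 except the week in which death occurred (1).
    """
    rows = []
    for week in range(1, min(weeks_observed, horizon) + 1):
        event = int(died and week == weeks_observed)
        rows.append({"patient_id": patient_id, "week": week, "event": event})
    return rows
```

A patient who died in week 3 contributes three rows labelled 0, 0, 1, while a patient followed beyond the horizon contributes twelve rows labelled 0. A binary classifier trained on these rows, with the week as an input feature, then estimates a weekly probability of death.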
Summary statistics of the cohort based on the final feature set.
| Level | Overall | Censored | Died | Survived |
|---|---|---|---|---|
| n | 33,938 | 14,907 | 1662 | 17,369 |
| Age, median [Q1, Q3] | 49.0 [33.0,64.0] | 50.0 [34.0,66.0] | 83.0 [75.0,89.0] | 45.0 [31.0,59.0] |
| Female, n (%) | 19,581 (57.7) | 8800 (59.0) | 787 (47.4) | 9994 (57.5) |
| | 0.0 [0.0,4.0] | 0.0 [0.0,4.0] | 16.0 [4.0,27.0] | 0.0 [0.0,2.0] |
| 0 | 20,207 (59.5) | 8854 (59.4) | 168 (10.1) | 11,185 (64.4) |
| ≥ 1 | 13,731 (40.5) | 6053 (40.6) | 1494 (89.9) | 6184 (35.6) |
| | 4.0 [2.0,8.0] | 4.0 [2.0,8.0] | 11.0 [6.0,16.0] | 4.0 [2.0,7.0] |
| 0 | 5596 (16.5) | 2277 (15.3) | 54 (3.2) | 3265 (18.8) |
| ≥ 1 | 28,342 (83.5) | 12,630 (84.7) | 1608 (96.8) | 14,104 (81.2) |
| Admitted at the time of first positive test, n (%) | 2485 (7.3) | 927 (6.2) | 534 (32.1) | 1024 (5.9) |
| Previous admissions in the last 3 years, median [Q1, Q3] | 0.0 [0.0,1.0] | 0.0 [0.0,1.0] | 1.0 [0.0,3.0] | 0.0 [0.0,0.0] |
| Cumulative days in hospital within the last 3 years, median [Q1, Q3] | 0.0 [0.0,1.0] | 0.0 [0.0,1.0] | 7.0 [0.0,19.0] | 0.0 [0.0,0.0] |
| Pandemic week, median [Q1, Q3] | 39.0 [19.0,42.0] | 42.0 [41.0,44.0] | 40.0 [7.0,43.0] | 26.0 [6.0,35.0] |
| Body Mass Index, median [Q1, Q3] | 25.7 [22.6,29.7] | 25.6 [22.5,29.8] | 24.3 [21.3,28.0] | 26.0 [23.0,29.9] |
| Absolute Lymphocyte count (LYM), laboratory, last value, median [Q1, Q3] | 1.1 [0.7,1.6] | 1.1 [0.8,1.7] | 0.9 [0.6,1.3] | 1.2 [0.8,1.6] |
| 0 | 31,131 (91.7) | 13,639 (91.5) | 919 (55.3) | 16,573 (95.4) |
| ≥ 1 | 2807 (8.3) | 1268 (8.5) | 743 (44.7) | 796 (4.6) |
| 0 | 26,453 (77.9) | 11,430 (76.7) | 527 (31.7) | 14,496 (83.5) |
| ≥ 1 | 7485 (22.1) | 3477 (23.3) | 1135 (68.3) | 2873 (16.5) |
| 0 | 31,403 (92.5) | 13,753 (92.3) | 965 (58.1) | 16,685 (96.1) |
| ≥ 1 | 2535 (7.5) | 1154 (7.7) | 697 (41.9) | 684 (3.9) |
| 0 | 30,920 (91.1) | 13,470 (90.4) | 1314 (79.1) | 16,136 (92.9) |
| ≥ 1 | 3018 (8.9) | 1437 (9.6) | 348 (20.9) | 1233 (7.1) |
| 0 | 33,216 (97.9) | 14,515 (97.4) | 1497 (90.1) | 17,204 (99.1) |
| ≥ 1 | 722 (2.1) | 392 (2.6) | 165 (9.9) | 165 (0.9) |
| 0 | 33,432 (98.5) | 14,655 (98.3) | 1529 (92.0) | 17,248 (99.3) |
| ≥ 1 | 506 (1.5) | 252 (1.7) | 133 (8.0) | 121 (0.7) |
| 0 | 24,170 (71.2) | 10,568 (70.9) | 788 (47.4) | 12,814 (73.8) |
| ≥ 1 | 9768 (28.8) | 4339 (29.1) | 874 (52.6) | 4555 (26.2) |
| 0 | 18,284 (53.9) | 7485 (50.2) | 637 (38.3) | 10,162 (58.5) |
| ≥ 1 | 15,654 (46.1) | 7422 (49.8) | 1025 (61.7) | 7207 (41.5) |
| 0 | 31,501 (92.8) | 13,783 (92.5) | 1256 (75.6) | 16,462 (94.8) |
| ≥ 1 | 2437 (7.2) | 1124 (7.5) | 406 (24.4) | 907 (5.2) |
| 0 | 32,384 (95.4) | 14,135 (94.8) | 1177 (70.8) | 17,072 (98.3) |
| ≥ 1 | 1554 (4.6) | 772 (5.2) | 485 (29.2) | 297 (1.7) |
| 0 | 32,530 (95.9) | 14,200 (95.3) | 1267 (76.2) | 17,063 (98.2) |
| ≥ 1 | 1408 (4.1) | 707 (4.7) | 395 (23.8) | 306 (1.8) |
Values up to the day of the first positive SARS-CoV-2 test used for training and prediction were considered. Continuous variables were summarized by the median and interquartile range (Q1, Q3). Diagnoses and medicines, with their ICD-10 and ATC codes in parentheses respectively, were summarized as the number of patients with at least one code assigned. Only body mass index and absolute lymphocyte count had missing values, for 17,823 and 32,803 patients respectively. Patients who had a positive test from the 8th of December 2020 (12 weeks before data generation) and did not die before the 2nd of March 2021 were censored.
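The censoring rule in this footnote can be written as a small sketch. The function name and the handling of boundary dates (a death exactly on the extraction date, or exactly 12 weeks after the test) are assumptions, since the footnote does not specify them.

```python
from datetime import date, timedelta
from typing import Optional

DATA_GENERATION = date(2021, 3, 2)                      # date stated in the footnote
HORIZON = timedelta(weeks=12)
CENSOR_START = DATA_GENERATION - HORIZON                # 8 December 2020


def outcome_label(first_positive_test: date, death_date: Optional[date] = None) -> str:
    """Assign 'died', 'censored' or 'survived' for the 12-week outcome."""
    if (death_date is not None
            and death_date < DATA_GENERATION
            and death_date <= first_positive_test + HORIZON):
        return "died"
    if first_positive_test >= CENSOR_START:
        # Fewer than 12 weeks of follow-up were available at data generation.
        return "censored"
    return "survived"
```

Under these assumptions, a patient who tested positive in January 2021 and was alive at data generation is censored, because the full 12-week window had not yet elapsed.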
Figure 2. Binary performance metrics for 12-week mortality prediction. Precision-recall area under the curve (PR-AUC) and Matthews correlation coefficient (MCC) were calculated for each predicted week, considering only non-censored patients in the test set. The lower panel of each plot depicts the mean values of PR-AUC and MCC at each week based on all patients (a), patients not admitted to the hospital at the time of first positive test (b) and patients who were admitted at the time of first positive test (c). The upper panels of each subfigure contain bar plots showing the number of patients who died (red) during the given week, while patients censored due to lack of follow-up (grey) were omitted from the performance metrics.
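For reference, the two metrics named in this caption can be computed as in the sketch below. Average precision is used here as one common estimator of PR-AUC; whether the authors used this estimator, and how they thresholded probabilities for the MCC, is not stated in the caption, so both are assumptions.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """MCC from binary labels and binary predictions (0.0 if undefined)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each positive case.

    Ties in the scores are broken by input order, which is a simplification.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return float("nan")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos
```

To reproduce the per-week evaluation, these functions would be applied separately to the non-censored patients at risk in each week.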
Figure 3. Predicted individual discrete and cumulative death probabilities. Weekly discrete and cumulative probabilities of death were predicted for all patients in the test set using data prior to their first positive test. Individual probabilities were summarized by the median and the 80th and 20th percentiles for patients who died (red) or survived (green) (a). Predicted cumulative death probabilities were summarized by the median (b) for patients who died before 4 weeks (pink), between 4 and 8 weeks (yellow) and after 8 weeks (blue). Individual examples of predicted cumulative (c) and discrete (d) death probabilities for three patients are depicted, indicating the time of death (black dot) or censoring (x).
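In a discrete-time survival model, weekly probabilities and cumulative probabilities are linked by a product over weeks. The sketch below assumes that the weekly prediction is a conditional hazard, that is, the probability of dying in week t given survival up to week t. This is the standard discrete-time formulation, but the caption does not confirm that the plotted discrete probabilities are defined this way.

```python
from typing import List, Sequence


def survival_curve(hazards: Sequence[float]) -> List[float]:
    """S(t) = prod_{k<=t} (1 - h_k) for weekly conditional hazards h_k."""
    curve, alive = [], 1.0
    for h in hazards:
        alive *= 1.0 - h
        curve.append(alive)
    return curve


def cumulative_death_probability(hazards: Sequence[float]) -> List[float]:
    """Probability of having died by the end of each week: 1 - S(t)."""
    return [1.0 - s for s in survival_curve(hazards)]
```

For example, weekly hazards of 0.1 and 0.2 give a survival probability of 0.9 × 0.8 = 0.72 after two weeks, and therefore a cumulative death probability of 0.28.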
Figure 4. Global and local explanations of feature contributions to the risk of death in SARS-CoV-2 positive patients. SHAP values for each patient-week in the test set were calculated to explain the contribution of features to the discrete probability of death. A beeswarm plot (a) was generated to aggregate all individual SHAP values for each patient-week, with features coloured according to their normalised feature values. To explore the temporal dynamics, heatmaps were generated to show the maximum feature importance represented as the max(|SHAP|) across all patients (b) for each predicted week. The total feature importance of each feature was calculated as the mean(|SHAP|) across all weeks and shown as a bar plot (b). To exemplify personalized explanations, SHAP values for two patients (c, d) were depicted as heatmaps with their corresponding predicted discrete probabilities of death on top. The original feature values for each patient were reported inside round brackets next to the feature names. In all heatmaps, features were ordered by hierarchical clustering of the original feature values using Pearson correlation as the distance metric and average linkage.
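The two aggregations described for panel (b) can be sketched as follows. The input layout (one record per patient-week, holding the week number and a feature-to-SHAP mapping) and the reading of "mean(|SHAP|) across all weeks" as a mean over all patient-weeks are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[int, Dict[str, float]]  # (week, {feature name: SHAP value})


def max_abs_shap_by_week(records: Iterable[Record]) -> Dict[str, Dict[int, float]]:
    """max(|SHAP|) across patients, per feature and per predicted week."""
    out: Dict[str, Dict[int, float]] = defaultdict(dict)
    for week, shap_values in records:
        for feature, value in shap_values.items():
            out[feature][week] = max(out[feature].get(week, 0.0), abs(value))
    return dict(out)


def mean_abs_shap(records: Iterable[Record]) -> Dict[str, float]:
    """mean(|SHAP|) per feature over all patient-weeks."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for _, shap_values in records:
        for feature, value in shap_values.items():
            totals[feature] += abs(value)
            counts[feature] += 1
    return {f: totals[f] / counts[f] for f in totals}
```

The maximum highlights features that matter strongly for at least one patient in a given week, whereas the mean summarizes typical importance across the cohort.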
Figure 5. Individual feature explanations by survival status. Partial dependence plots (PDP) of SHAP values versus age (a), body mass index (b), sex (c), lymphocyte levels (d), cumulative days in hospital (e) and the number of admissions (f) in the last 3 years, admission status at the time of first positive test (g) and the number of ordered medicines (h). Each dot shows a patient-week value coloured by survival status, indicating patients who survived (green) or died (red). Total SHAP values are represented as explained contributions in terms of probability (y-axis) given all the feature values for a patient, whereas features (x-axis) are represented by their corresponding value. The top and left panels of each PDP depict letter-value plots of the distribution of the x and y axes by survival status. Top panels were substituted by bar plots for categorical variables. Additional PDPs for the remaining features can be found in Supplementary Fig. 2–4.
Figure 6. Summary of relevant feature interactions in explaining early and late mortality in SARS-CoV-2 positive patients. For each patient who died within 12 weeks, the SHAP interaction values between all 22 features were calculated. Only interaction values with an absolute value greater than 0.01 were considered relevant and counted. Counts were averaged across all patients to show the percentage of patients for which a given pair of features was relevant. The diagonal represents the percentage of patients for which each feature had a SHAP value higher than 0.01. Relevant feature interactions are shown for patients who died within 4 weeks (a) and for those who died between 8 and 12 weeks (b), thus visualizing the difference in feature interactions between early and late mortality in SARS-CoV-2 positive patients. In both heatmaps, features were ordered by hierarchical clustering using Euclidean distance as the metric and average linkage.
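The counting procedure in this caption can be sketched as below. It assumes one square interaction matrix per patient (how patient-weeks were combined per patient is not stated) and applies the absolute-value threshold to the diagonal as well, which is a simplification of the caption's wording.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def relevant_interaction_rate(matrices: Sequence[Matrix],
                              threshold: float = 0.01) -> List[List[float]]:
    """Percentage of patients for which |interaction value| exceeds the threshold.

    matrices holds one k x k SHAP interaction matrix per patient.
    """
    n_patients = len(matrices)
    k = len(matrices[0])
    counts = [[0] * k for _ in range(k)]
    for m in matrices:
        for i in range(k):
            for j in range(k):
                if abs(m[i][j]) > threshold:
                    counts[i][j] += 1
    return [[100.0 * c / n_patients for c in row] for row in counts]
```

Running this separately on patients who died within 4 weeks and on those who died between 8 and 12 weeks would give the two heatmaps being compared.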