Joy Tzung-Yu Wu, Miguel Ángel Armengol de la Hoz, Po-Chih Kuo, José Maria Castellano, Leo Anthony Celi, Joseph Alexander Paguio, Jasper Seth Yao, Edward Christopher Dee, Wesley Yeung, Jerry Jurado, Achintya Moulick, Carmelo Milazzo, Paloma Peinado, Paula Villares, Antonio Cubillo, José Felipe Varona, Hyung-Chul Lee, Alberto Estirado.
Abstract
The unprecedented global crisis brought about by the COVID-19 pandemic has sparked numerous efforts to create predictive models for the detection and prognostication of SARS-CoV-2 infections with the goal of helping health systems allocate resources. Machine learning models, in particular, hold promise for their ability to leverage patient clinical information and medical images for prediction. However, most of the published COVID-19 prediction models thus far have little clinical utility due to methodological flaws and lack of appropriate validation. In this paper, we describe our methodology to develop and validate multi-modal models for COVID-19 mortality prediction using multi-center patient data. The models for COVID-19 mortality prediction were developed using retrospective data from Madrid, Spain (N = 2547) and were externally validated in patient cohorts from a community hospital in New Jersey, USA (N = 242) and an academic center in Seoul, Republic of Korea (N = 336). The models we developed performed differently across various clinical settings, underscoring the need for a guided strategy when employing machine learning for clinical decision-making. We demonstrated that using features from both the structured electronic health records and chest X-ray imaging data resulted in better 30-day mortality prediction performance across all three datasets (areas under the receiver operating characteristic curves: 0.85 (95% confidence interval: 0.83-0.87), 0.76 (0.70-0.82), and 0.95 (0.92-0.98)). We discuss the rationale for the decisions made at every step in developing the models and have made our code available to the research community. We employed the best machine learning practices for clinical model development. Our goal is to create a toolkit that would assist investigators and organizations in building multi-modal models for prediction, classification, and/or optimization.
Keywords: COVID-19; Mortality prediction; Multi-center; Multi-modal
Year: 2022 PMID: 35789446 PMCID: PMC9255527 DOI: 10.1007/s10278-022-00674-z
Source DB: PubMed Journal: J Digit Imaging ISSN: 0897-1889 Impact factor: 4.903
Fig. 1 The proposed multi-modal models for mortality prediction. The extracted EHR data were first preprocessed and then used to train the EHR-based model. For the CXR-based model, an anatomical bounding box extraction pipeline was used to automatically extract the coordinates for the left lung, right lung, mediastinum, and trachea anatomies from each of the CXR images. The CXR images with augmentation were then used to train the CXR-based model. The probability computed from the CXR-based model along with EHR data were used to train the proposed EHR-CXR fusion model, by which the final prediction was generated. The predictions from the EHR- and CXR-based models were also generated for comparison
High level descriptive summary of datasets used in this study
| Fourfold training and internal validation for building and tuning the models | Multi-centered hospital network, Madrid, Spain, from 12/2019 to 06/2020 [ | Under age (< 16); missing admission time; missing admission chest X-ray | 1628/2547 |
| Test (external validation) | Community hospital, Hoboken, NJ, USA, from 03/2020 to 04/2020 [ | Under age (< 16); missing admission time; missing admission chest X-ray | 201/242 |
| Test (external validation) | Academic tertiary hospital, Seoul, Republic of Korea, from 1/1/2020 to 12/31/2020 | Under age (< 16); missing admission time; missing admission chest X-ray | 315/336 |
*These are unique patients
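As an illustration of the exclusion criteria listed in the table above, the following minimal sketch filters admission records. The record layout and field names (`age`, `admission_time`, `admission_cxr`) are hypothetical and not taken from the study's code; treating a missing age as ineligible is likewise an assumption.

```python
from typing import List, Optional, TypedDict


class Admission(TypedDict):
    """Hypothetical admission record; field names are illustrative only."""
    patient_id: str
    age: Optional[int]
    admission_time: Optional[str]  # e.g. an ISO timestamp; None if missing
    admission_cxr: Optional[str]   # identifier of the admission CXR; None if missing


def is_eligible(rec: Admission, min_age: int = 16) -> bool:
    """Return True if the record passes all three exclusion criteria."""
    if rec["age"] is None or rec["age"] < min_age:
        return False  # under age (assumption: unknown age is also excluded)
    if rec["admission_time"] is None:
        return False  # missing admission time
    if rec["admission_cxr"] is None:
        return False  # missing admission chest X-ray
    return True


def apply_exclusions(records: List[Admission]) -> List[Admission]:
    """Keep only the records that satisfy every inclusion requirement."""
    return [rec for rec in records if is_eligible(rec)]
```

The ratio of the filtered list's length to the original length corresponds to an included/total count of the kind shown in the last column of the table.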
Summary of clinical characteristics for the 3 different datasets used in the study
| 1439 | 189 | 114 | 87 | 310 | 5 | ||||
| 65.7 | 79.6 | < 0.001 | 61.9 | 69.1 | 0.003 | 45.7 | 64.0 | 0.053 | |
| 41.3 | 28.6 | < 0.001 | 48.2 | 32.2 | 0.032 | 48.4 | 0 | 0.062 | |
| 88.4 | 11.6 | < 0.001 | 56.7 | 43.3 | < 0.001 | 98.4 | 1.6 | < 0.001 | |
| 16.1 | 24.1 | 0.008 | 39.4 | 35.6 | 0.682 | 11.0 | 40.0 | 0.103 | |
| 6.1 | 10.2 | 0.055 | 54.3 | 55.2 | 0.974 | 13.9 | 80.0 | 0.002 | |
| 26.0 | 35.3 | 0.01 | 32.5 | 34.5 | 0.880 | 6.1 | 20.0 | 0.283 | |
| 4.3 | 7.0 | 0.156 | 16.7 | 14.9 | 0.891 | 1.6 | 20.0 | 0.093 | |
| 6.0 | 13.4 | < 0.001 | 16.7 | 14.9 | 0.891 | 2.3 | 20.0 | 0.122 | |
| 2.6 | 7.5 | < 0.001 | 2.6 | 4.6 | 0.469 | 2.6 | 0.0 | - | |
| 4.7 | 8.0 | 0.083 | 9.6 | 10.3 | 0.941 | 0.6 | 20.0 | 0.047 | |
| 5.1 | 11.8 | < 0.001 | 7.0 | 20.7 | 0.008 | 1.3 | 20.0 | 0.078 | |
| 0.6 | 4.8 | < 0.001 | 0.9 | 0.0 | 1.000 | 1.0 | 0.0 | - | |
| 4.0 | 12.3 | < 0.001 | 6.1 | 3.4 | 0.519 | 0.0 | 0.0 | - | |
COPD chronic obstructive pulmonary disease, CKD chronic kidney disease
Internal validation on Madrid dataset with 95% confidence intervals
| Metric (95% CI) | EHR-based model | CXR-based model | Fusion model (EHR + CXR) |
| AUROC | 0.82 (0.79–0.84) | 0.81 (0.78–0.83) | 0.85 (0.83–0.87) |
| Sensitivity | 0.77 (0.71–0.82) | 0.76 (0.71–0.82) | 0.79 (0.74–0.84) |
| Specificity | 0.71 (0.66–0.76) | 0.72 (0.67–0.75) | 0.74 (0.71–0.78) |
| PPV | 0.24 (0.21–0.28) | 0.25 (0.21–0.28) | 0.27 (0.23–0.31) |
| NPV | 0.96 (0.95–0.97) | 0.96 (0.95–0.97) | 0.97 (0.96–0.98) |
| F1-score | 0.36 (0.32–0.41) | 0.37 (0.33–0.41) | 0.40 (0.36–0.45) |
| Accuracy | 0.71 (0.68–0.76) | 0.73 (0.68–0.76) | 0.75 (0.72–0.78) |
AUROC area under the receiver operating characteristic curve, PPV positive predictive value, NPV negative predictive value, CI confidence interval
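The threshold-dependent metrics reported for the models (sensitivity, specificity, PPV, NPV, F1-score, and accuracy) all derive from the four cells of a confusion matrix. The sketch below shows those standard definitions; the function and class names are illustrative and not from the study's code. AUROC is omitted because it requires predicted scores rather than binary labels.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float
    f1: float
    accuracy: float


def _safe_div(num: float, den: float) -> float:
    """Return num/den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def binary_metrics(y_true, y_pred) -> BinaryMetrics:
    """Compute metrics with 1 = expired (positive) and 0 = alive (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return BinaryMetrics(
        sensitivity=_safe_div(tp, tp + fn),   # recall on expired patients
        specificity=_safe_div(tn, tn + fp),   # recall on surviving patients
        ppv=_safe_div(tp, tp + fp),           # precision
        npv=_safe_div(tn, tn + fn),
        f1=_safe_div(2 * tp, 2 * tp + fp + fn),
        accuracy=_safe_div(tp + tn, len(pairs)),
    )


# Toy labels for illustration only (not study data).
example = binary_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 1, 0, 0, 0, 0, 0],
)
print(example)
```

With a low mortality prevalence, PPV (and therefore F1-score) can stay small even when sensitivity and specificity are moderate, which is the pattern visible in the reported values.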
Fig. 3 Model performance using EHR-based model, CXR-based model, and fusion model (EHR + CXR). (A) Internal validation on Madrid dataset; (B) external testing on Hoboken dataset; and (C) external testing on Seoul dataset
External testing on Hoboken and Seoul datasets with 95% confidence intervals
| Metric (95% CI) | Hoboken: EHR-based model | Hoboken: CXR-based model | Hoboken: fusion model | Seoul: EHR-based model | Seoul: CXR-based model | Seoul: fusion model |
| AUROC | 0.74 (0.68–0.80) | 0.72 (0.66–0.78) | 0.76 (0.70–0.82) | 0.92 (0.88–0.96) | 0.90 (0.86–0.94) | 0.95 (0.92–0.98) |
| Sensitivity | 0.68 (0.59–0.77) | 0.68 (0.57–0.80) | 0.68 (0.60–0.76) | 0.64 (0.25–0.86) | 0.63 (0.20–0.86) | 0.64 (0.20–0.86) |
| Specificity | 0.72 (0.62–0.82) | 0.65 (0.55–0.78) | 0.78 (0.70–0.85) | 0.88 (0.85–0.93) | 0.86 (0.80–0.93) | 0.93 (0.89–0.96) |
| PPV | 0.65 (0.56–0.75) | 0.60 (0.52–0.69) | 0.71 (0.61–0.79) | 0.09 (0.02–0.17) | 0.07 (0.02–0.15) | 0.13 (0.03–0.25) |
| NPV | 0.75 (0.68–0.81) | 0.73 (0.66–0.80) | 0.76 (0.70–0.82) | 0.99 (0.99–1.00) | 0.99 (0.99–1.00) | 1.00 (0.99–1.00) |
| F1-score | 0.66 (0.59–0.73) | 0.64 (0.57–0.70) | 0.69 (0.62–0.76) | 0.15 (0.04–0.28) | 0.13 (0.03–0.25) | 0.21 (0.06–0.38) |
| Accuracy | 0.70 (0.65–0.76) | 0.67 (0.61–0.72) | 0.74 (0.68–0.79) | 0.92 (0.88–0.96) | 0.90 (0.86–0.94) | 0.95 (0.92–0.98) |
AUROC area under the receiver operating characteristic curve, PPV positive predictive value, NPV negative predictive value, CI confidence interval
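As a worked check of how the F1-score relates to PPV and sensitivity, the standard definition can be applied to the reported point estimates. This assumes that the rows of the table above follow the metric order AUROC, sensitivity, specificity, PPV, NPV, F1-score, accuracy, so that the Hoboken fusion model has sensitivity 0.68 and PPV 0.71; because the inputs are rounded, such a check will only agree to about two decimals.

```latex
\[
F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{Sensitivity}}{\mathrm{PPV} + \mathrm{Sensitivity}}
\]
\[
F_1^{\text{Hoboken, fusion}} = \frac{2 \times 0.71 \times 0.68}{0.71 + 0.68}
  = \frac{0.9656}{1.39} \approx 0.69
\]
```

The result matches the 0.69 reported for the Hoboken fusion model under that row assumption.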
Fig. 4 Feature importance of the EHR-based model revealed by a SHAP plot. Features on the y-axis are ranked by their mean absolute SHAP values and each point represents a patient
Fig. 5 Feature importance of the fusion model revealed by a SHAP plot. Features on the y-axis are ranked by their mean absolute SHAP values and each point represents a patient
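The ranking used on the y-axis of these SHAP plots (mean absolute SHAP value per feature) can be sketched as follows. This is a minimal illustration that assumes a precomputed matrix of SHAP values with one row per patient and one column per feature; the feature names in the example are hypothetical, and no SHAP library call is shown.

```python
import numpy as np


def rank_features_by_mean_abs_shap(shap_values, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important."""
    values = np.asarray(shap_values, dtype=float)
    if values.ndim != 2 or values.shape[1] != len(feature_names):
        raise ValueError("expected shape (n_patients, n_features) matching feature_names")
    mean_abs = np.abs(values).mean(axis=0)          # average magnitude over patients
    order = np.argsort(-mean_abs, kind="stable")    # descending importance
    return [(feature_names[i], float(mean_abs[i])) for i in order]


# Toy SHAP matrix for two patients and three hypothetical features.
toy_shap = [[0.1, -0.5, 0.0],
            [-0.3, 0.4, 0.05]]
ranking = rank_features_by_mean_abs_shap(toy_shap, ["age", "spo2", "crp"])
print(ranking)
```

Taking the absolute value before averaging keeps features whose contributions push predictions in opposite directions for different patients from cancelling out.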
Fig. 6 Explainability: heatmaps produced with the Grad-CAM algorithm show that the model primarily uses imaging features from the lungs and mediastinum region for mortality prediction. The image was produced by averaging the heatmaps from the expired patients with prediction probability larger than 0.6 and overlaying the average on an actual CXR so that the physiologic area is easier to highlight
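The averaging step described in this caption can be sketched as below, assuming the per-patient Grad-CAM heatmaps have already been computed and resized to a common shape. The min-max rescaling and the blending weight `alpha` are illustrative assumptions, not details given in the source.

```python
import numpy as np


def mean_heatmap(heatmaps, probs, expired, threshold=0.6):
    """Average heatmaps over expired patients whose predicted probability exceeds threshold."""
    heatmaps = np.asarray(heatmaps, dtype=float)   # shape (n_patients, H, W)
    probs = np.asarray(probs, dtype=float)
    expired = np.asarray(expired, dtype=bool)
    mask = expired & (probs > threshold)
    if not mask.any():
        raise ValueError("no patient satisfies the selection criteria")
    avg = heatmaps[mask].mean(axis=0)
    # Assumption: rescale to [0, 1] so the map can be blended with an image.
    span = avg.max() - avg.min()
    return (avg - avg.min()) / span if span > 0 else np.zeros_like(avg)


def overlay(cxr, heatmap, alpha=0.4):
    """Alpha-blend a [0, 1] heatmap onto a [0, 1] grayscale CXR of the same shape."""
    cxr = np.asarray(cxr, dtype=float)
    if cxr.shape != heatmap.shape:
        raise ValueError("CXR and heatmap must have the same shape")
    return (1.0 - alpha) * cxr + alpha * heatmap
```

Restricting the average to confidently predicted expired patients, as the caption describes, means the resulting map reflects regions the model relied on when it predicted mortality.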
| Optimize for F1-score for all three models for the 30-day mortality prediction task and report all metrics including areas under the receiver operating characteristic curve (AUROC), sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1-score, and accuracy | |
| 30-day mortality is chosen as the target outcome in accordance with clinical precedent [ | |
| We included only cases with admission CXRs and results of laboratory tests taken within the first 24 h of hospital admission in Tables | |
| Our clinical goal is to develop an early assessment algorithm. Admission time is needed to establish the 30-day mortality cutoff for this study. Patients under 16 need more privacy protection (they are very rare and more easily re-identifiable), and their CXR imaging appearance (anatomically) and disease outcome distributions are very different. Not all CXR exam orders include lateral images; hence, lateral images are not included as an input for the models | |
| An anatomical bounding box (Bbox) extraction pipeline was used to automatically extract the coordinates for the left lung, right lung, mediastinum, and trachea anatomies from each of the frontal CXR images [ | |
| As compared to simply post hoc assessing the explainability of models with Gradient-weighted Class Activation Mapping (Grad-CAM), we tried to force the CXR model to learn features from key CXR anatomies that should be relied on more heavily for prediction during the model training stage as well [ | |
| The whole Madrid dataset was randomly divided into four subsets to conduct fourfold cross-validation and select the best model (by F1-score) for each of the three model types. We ensured similar numbers of mortality cases in each split, and the same four-way split was used for all experiments. Finally, we trained each of the three models on all of the Madrid data once we had identified the best hyperparameters from the fourfold cross-validation tuning experiments. The final models were then validated on the two external datasets (Hoboken and Seoul). See supplementary materials for the details of model training and selection for the EHR, CXR, and fusion models. All models were trained on Google Cloud TPUs via Colab notebooks. Code for training with both paid and free TPUs is available. Software packages used were tensorflow==2.4.1, sklearn-pandas==1.8.0, and xgboost==0.90. To ensure repeatability, a random seed of 2020 was used for all experiments | |
| In this setup, for each experiment, three folds are combined and used for training and the fourth fold is used for validation, so that each data subset gets an equal opportunity to validate models. Model parameters that perform best on one validation subset might just be “lucky.” Rotating the validation subset and picking model parameters that perform the best on average across all subsets helps in selecting a model that has hopefully learned more reliable features and may generalize better on external validation sets. Having a similar number of positive mortality cases (expired patients) in each split makes the validation sets more likely to be equally difficult. We had to use the whole Madrid dataset for model development (training and validation); otherwise, the number of positive cases (mortality) would be too small for tuning. We used only open-source Python packages so that others can easily re-use and build on our work with no cost barriers | |
| Four different types of machine learning algorithms (logistic regression, random forest, gradient boosting, and XGBoost) implemented in scikit-learn 0.21 were tried in a tuning setting to select for the best EHR-based model. A randomized grid-search method was used to sample different hyperparameter settings from prespecified ranges for each optimization experiment, as shown in Supplementary Table | |
| The goal of this modeling is not causality analysis but simply to select a model that performs the best for the given dataset and prediction task. We picked the four most common machine learning algorithms suitable for modeling tabular data and tuned their hyperparameters | |
| CXR images were randomly flipped left–right (i.e., mirrored about the vertical axis) and had their brightness adjusted (0–0.05). Together with the preprocessed anatomical Bbox augmentation, a random set of CXRs used for training the model is illustrated in Fig. | |
| The goal of image augmentation is to automatically increase training sample variety so that the model can learn to discern features that are more generalizable for the downstream prediction task. This step is particularly important if the training dataset is small. The online augmentations (flip and brightness) try to simulate how real-life variations in CXR acquisition might alter the image appearance. Only small augmentation ranges are chosen so that the CXR images remain radiologically interpretable. Augmentation is not used during internal and external validation because (1) there is no need to update model weights during evaluation and (2) models need to be compared against a consistent benchmark, whereas augmentation introduces randomness | |
| Two different previously published pre-trained DenseNet-121 CXR models [ | |
| The Madrid dataset is too small to train deep learning networks from scratch. The pre-trained CXR models chosen have already been trained on much larger CXR datasets (MIMIC-CXR) [ | |
| After CXR features are extracted from a pre-trained model, we added a classification block consisting of a tunable number of hidden linear layers, followed by a final activation function (choice between ReLU and LeakyReLU), a dropout layer, and a single binary output layer. The output layer represents whether a patient is alive or expired at 30 days. An initial bias for the final output layer was optionally added and tuned along with the choices for activation function (ReLU or LeakyReLU) and the number and sizes of the hidden layers | |
| The feature size extracted from both pre-trained models is 1024 in length. Additional classification layers were added to learn the new mortality classification task. Since the layer numbers and sizes are arbitrary, we picked a few common sizes to tune. We tried LeakyReLU as an activation function in the classification block because the CXR features extracted from the (− 2) and (− 4) layers can have many zeros due to the DenseNet-121 architecture. Adding an initial bias to the output layer can help with performance for very unbalanced datasets | |
| Binary cross entropy was used as the loss function and the Adam optimizer was used for parameter optimization. We did not tune for these settings | |
| Binary cross entropy is an appropriate loss for the binary mortality classification task. Adam is a fast optimizer, helps to avoid overfitting, and has shown good performance over a range of tasks | |
| Supplementary Table 4 provides a summary of all the hyperparameters we experimented with on the Madrid dataset to select the final best-performing CXR-based COVID-19 30-day mortality prediction model. An experiment is defined by one unique combination of hyperparameters. Due to limited training resources and a large hyperparameter search space (345,600 unique combinations), we first ran a coarse search and manually narrowed down the search space; for example, early observations suggested that most experiments did better with smaller batch sizes, LeakyReLU activation, and Bbox augmentation. We then fine-tuned the model on the other, more important parameters such as the learning rate. Early stopping was used to end experiments that did not show loss reduction after 2 or 5 epochs. Overall, we performed over 300 experiments. For each experiment, we plotted the training and validation curves for multiple metrics (recall, precision, accuracy, AUC, and F1-score) against the number of epochs. We combined automatic and manual model selection by (1) evaluating experiments with F1-scores above 0.25 for all four folds and (2) manually examining the train-vs.-validation learning curves to pick the hyperparameter setting that showed improvement of the model’s precision and recall from baseline for both the training and validation data, while ensuring that the chosen model did not show evidence of overfitting | |
| The standard practice for hyperparameter tuning is to update model weights on the training dataset and evaluate the updated model on the validation dataset at the end of each epoch, which is when the model has “seen” all examples in the training set once. Despite using all of the Madrid dataset for training and validation, the number of positive cases in the validation set is still small. Simply picking the best F1-score automatically without inspecting all the learning curves could just end up picking a “lucky” epoch | |
| We took a late fusion approach that uses the output probability from the CXR model as a feature along with the EHR features for the 30-day mortality classification. With the Madrid train dataset, we again tuned four different machine learning models (logistic regression, random forest, gradient boosting, and XGBoost) in a fourfold cross-validation setting and the best model along with the best hyperparameters were selected using randomized grid search via the same methodology as that for training the EHR-based model | |
| A late fusion approach is used because it can be implemented with traditional machine learning methods, which can avoid overfitting for smaller datasets. On the other hand, intermediate (joint) fusion implemented by neural networks requires more data for training (the implementation of the intermediate fusion model can also be found in Supplementary Table | |
| We made a clear separation between model developers and final model testers. Development of models includes programming, feature selection, and model training. External testing of models requires institutional access to the Hoboken and Seoul data, which was obtained upon request with submission of our study protocol | |
| This is the best practice to avoid repeated testing on the final test datasets, which could invalidate the reported results. It is also a common setting in real life model evaluation scenarios | |
| We packaged the inference code for the three different models for testing in an end-to-end Colab Notebook for the model testers to run on their datasets | |
| All datasets had been de-identified and are hosted on different HIPAA-compliant cloud servers with access granted to different researchers based on institutional affiliation, data access approvals, and IRBs. Running via Colab, which has access management protocols, allows the clinical researchers to run the inference code without setting up Python and other required packages on their local machines, which can be a technical barrier | |
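The late fusion and fourfold cross-validation steps described in the table above can be sketched together as follows. This is a minimal illustration on synthetic stand-in data, not the study's code: logistic regression is used as one of the four candidate algorithms named in the text, the helper names are hypothetical, and the only settings taken from the source are the four-way stratified split, F1-score as the selection metric, and the random seed of 2020.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def fuse_features(x_ehr, p_cxr):
    """Late fusion: append the CXR model's output probability as one extra feature column."""
    x_ehr = np.asarray(x_ehr, dtype=float)
    p_cxr = np.asarray(p_cxr, dtype=float).reshape(-1, 1)
    if x_ehr.shape[0] != p_cxr.shape[0]:
        raise ValueError("EHR rows and CXR probabilities must align per patient")
    return np.hstack([x_ehr, p_cxr])


def fourfold_f1(x, y, seed=2020):
    """Rotate the validation fold over a stratified four-way split; return per-fold F1."""
    splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in splitter.split(x, y):
        model = LogisticRegression(max_iter=1000)
        model.fit(x[train_idx], y[train_idx])          # three folds for training
        y_hat = model.predict(x[val_idx])              # fourth fold for validation
        scores.append(f1_score(y[val_idx], y_hat, zero_division=0))
    return scores


# Synthetic stand-in data (NOT study data): about 12% positives, 5 EHR features.
rng = np.random.default_rng(2020)
n_patients = 400
y = (rng.random(n_patients) < 0.12).astype(int)
x_ehr = rng.normal(size=(n_patients, 5)) + 0.8 * y[:, None]
p_cxr = np.clip(0.2 + 0.5 * y + rng.normal(scale=0.15, size=n_patients), 0.0, 1.0)

x_fused = fuse_features(x_ehr, p_cxr)
fold_scores = fourfold_f1(x_fused, y)
print("fused shape:", x_fused.shape)
print("mean F1 across folds:", float(np.mean(fold_scores)))
```

Averaging the F1-score over the four rotated validation folds, rather than trusting a single split, is what guards against selecting a configuration that was merely "lucky" on one subset.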