
Clinical prediction system of complications among patients with COVID-19: A development and validation retrospective multicentre study during first wave of the pandemic.

Ghadeer O Ghosheh¹, Bana Alamad¹, Kai-Wen Yang¹, Faisil Syed², Nasir Hayat¹, Imran Iqbal², Fatima Al Kindi², Sara Al Junaibi², Maha Al Safi², Raghib Ali¹, Walid Zaher³, Mariam Al Harbi², Farah E Shamout¹.

Abstract

Clinical evidence suggests that some patients diagnosed with coronavirus disease 2019 (COVID-19) experience a variety of complications associated with significant morbidity, especially in severe cases during the initial spread of the pandemic. To support early interventions, we propose a machine learning system that predicts the risk of developing multiple complications. We processed data collected from 3,352 patient encounters admitted to 18 facilities between April 1 and April 30, 2020, in Abu Dhabi (AD), United Arab Emirates. Using data collected during the first 24 h of admission, we trained machine learning models to predict the risk of developing any of three complications after 24 h of admission. The complications include Secondary Bacterial Infection (SBI), Acute Kidney Injury (AKI), and Acute Respiratory Distress Syndrome (ARDS). To assess the generalizability of the proposed system, the hospitals were grouped by geographical proximity into two sets: the AD Middle region (set A) and the AD Western & Eastern regions (set B). The overall system includes a data filtering criterion, hyperparameter tuning, and model selection. In test set A, consisting of 587 patient encounters (mean age: 45.5), the system achieved a good area under the receiver operating characteristic curve (AUROC) for the prediction of SBI (0.902 AUROC), AKI (0.906 AUROC), and ARDS (0.854 AUROC). Similarly, in test set B, consisting of 225 patient encounters (mean age: 42.7), the system performed well for the prediction of SBI (0.859 AUROC), AKI (0.891 AUROC), and ARDS (0.827 AUROC). The performance results and feature importance analysis highlight the system's generalizability and interpretability. The findings illustrate how machine learning models can achieve strong performance even when using a limited set of routine input variables. Since our proposed system is data-driven, we believe it can be easily repurposed for different outcomes considering the changes in COVID-19 variants over time.


Year: 2022 PMID: 35721825 PMCID: PMC9188985 DOI: 10.1016/j.ibmed.2022.100065

Source DB: PubMed Journal: Intell Based Med ISSN: 2666-5212

Introduction

The Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has led to a global health emergency since the emergence of the coronavirus disease 2019 (COVID-19) [1]. Despite containment efforts, more than 491 million confirmed cases have been reported globally, including 892,170 cases in the United Arab Emirates (UAE) as of April 4, 2022 [1]. Due to unexpected burdens on healthcare systems, identifying high risk groups using prognostic models has become vital to support patient triage and resource allocation. Most of the published prognostic models for patients with COVID-19 focus on predicting mortality or the need for intubation [2]. While the prediction of such adverse events is important for patient triage, clinical evidence suggests that patients with COVID-19 may also experience a variety of complications in organ systems that could lead to severe morbidity and mortality [3,4], especially amongst severe cases during the early waves of the pandemic. In this study, we identified three such complications associated with poor patient outcomes based on clinical evidence, prior to the emergence of the less severe variants [5]: Acute Respiratory Distress Syndrome (ARDS) [6], Acute Kidney Injury (AKI) [7], and Secondary Bacterial Infection (SBI) [8]. ARDS-related pneumonia has been reported as a major complication among patients with COVID-19 that have poor prognosis [9] and was a major cause of ventilator shortages worldwide [6,10,11]. In a Chinese study published in 2020, 31.0% of patients developed ARDS within a median of 12 days from the onset of COVID-19, and ARDS was the second most frequently observed complication after sepsis [10]. Additionally, only a few patients manifest clear clinical symptoms in the early stages of developing ARDS [6,12], so it is difficult to suspect ARDS unless it occurs. 
Hence, we identified early prediction of the risk of developing ARDS, prior to its onset, as being of high importance, since ARDS was considered one of the main risk factors of death among hospitalized patients with COVID-19 [13]. Although COVID-19 primarily emerged as a respiratory disease, some patients with COVID-19 experience both respiratory and extra-respiratory complications, including renal complications such as AKI [7,14]. Patients with AKI require special care and resources such as renal replacement therapy and dialysis [15]. It was estimated that AKI developed in 36.6% of patients admitted with COVID-19 in metropolitan New York in 2020, of whom 35% died [15]. Therefore, risk prediction of AKI can help in initiating preventive interventions to avoid poor patient prognosis. Moreover, several studies reported alarming percentages of hospitalized patients with COVID-19 who develop SBI [10]. SBI is known for poor outcomes in several respiratory viral infections, and it increased the burden on hospitals during the 1918 influenza pandemic, the 2009 H1N1 influenza pandemic, and seasonal flu [16,17,18]. Patients with COVID-19 who developed SBI have shown worse outcomes, including admission to the Intensive Care Unit (ICU) and mortality, compared to those who did not develop SBI [19]. Therefore, early prediction of SBI can potentially improve patient prognosis, for example by prompting aseptic procedures, especially when hospitals are crowded [8]. In recent years, machine learning has gained popularity for the development of algorithms for clinical decision support tools [20,21,22]. In the context of COVID-19, most machine learning studies have focused either on diagnosis or prognosis based on adverse events, mostly mortality and intubation [2,23,24]. We summarize a few examples in Table 1.
Since ARDS is considered a major manifestation of the COVID-19 disease, some studies focused on developing machine learning models to predict ARDS as an outcome [6,12], such as by using a large set of hematological and biochemical markers [12]. One limitation of such approaches is that they rely on laboratory-test results that may not be routinely measured. In another study, the authors used both statistical machine learning models and deep neural networks for the prediction of ARDS, by combining a large feature set of chest Computed Tomography (CT) findings, demographics, epidemiology, clinical symptoms, and laboratory-test results [6]. Similarly, for AKI prediction amongst patients with COVID-19, a multivariate logistic regression was developed using findings of CT imaging, laboratory-test results, vital-sign measurements, and patient demographics [25]. While recent work on SBI mainly focused on its clinical manifestations and occurrence [16,26,27], one study investigated sepsis risk prediction among patients with COVID-19 using hematological parameters and other biomarkers [28]. To summarize, existing work tends to predict a single complication at a time, which is less informative than predicting multiple complications known to be common among patients with COVID-19, use costly input features that may not be readily available, or rely on training deep neural networks that require high computational resources and large training datasets.

Table 1

Examples of machine learning studies that aim to predict various outcomes for in-patients with confirmed COVID-19 diagnosis. We refer the readers to extensive published literature reviews [2,23,24].

Reference	Outcome	Input Data	Models	Study Location
[29]	Deterioration (intubation or ICU admission or mortality)	Chest X-ray images and clinical data (patient demographics, seven vital-sign variables, and 24 laboratory-test results)	Convolutional neural network for chest X-ray images and gradient boosting model for clinical data	United States
[30]	Mortality	Five laboratory-test results	Support vector machine	United States
[31]	Severe progression (high oxygen flow rate, mechanical ventilation or mortality)	Chest CT scans, patient demographics, five vital-sign variables, symptoms, comorbidities, 14 laboratory-test results, and chest CT radiology report findings	Deep neural network and logistic regression	France
[32]	Prognostication (intubation or hospital admission, or mortality)	Chest X-ray images, two vital-sign variables, and nine laboratory-test results	Convolutional neural network	United States
[28]	Sepsis	Eight laboratory-test results	Gradient boosting model	China
[25]	AKI	Findings of abdominal CT scans, demographics, vital signs, comorbidities, and three laboratory-test results	Logistic regression	United States
[33]	ARDS	Demographics, interventions, comorbidities, 17 laboratory-test results, and eight vital signs	Gradient boosting model	United States

Therefore, there is a pressing need for a low-cost predictive system that uses routine clinical data to predict complications and support patient management. In this work, we address this need by developing and evaluating a machine learning system that predicts the risk of ARDS, SBI, and AKI among patients with COVID-19 admitted to the Abu Dhabi Health Services (SEHA) facilities, UAE, from April 1st, 2020 to April 30th, 2020, during the first wave of the pandemic. While we focus on three complications only, mainly because their occurrence could be identified retrospectively using clinical criteria, the system and proposed training framework can be scaled to incorporate predictions of other complications, and can be fine-tuned using datasets of other patient cohorts. An overview of the pipeline is shown in Fig. 1. Next, we describe our methodology in Section 2 and the performance and explainability results in Section 3. We then discuss the limitations and strengths of the study in Section 4, and conclude by highlighting the potential of our system in clinical settings in Section 5. To allow for reproducibility and external validation, we made our code and one of the evaluation test sets publicly available at: https://github.com/nyuad-cai/COVID19Complications.

Fig. 1

Overview of our proposed model development approach and expected application in practice. As shown in the first row, we develop our complication-specific models by first preprocessing the data, identifying the occurrences of the complications based on the criteria shown in Table 2, training and selecting the best-performing models on the validation set, and then evaluating the performance on the test set, retrospectively. As for the application (second row), we expect our system to predict the risk of developing any of the three complications for any patient after 24 h of admission.

Table 2

Criteria used to define the occurrence of complications.

Complication	Definition	Reference
SBI	Positive blood, urine, throat or sputum cultures within 24 h of sample collection	(a)
AKI	Based on the Kidney Disease: Improving Global Outcomes (KDIGO) classification, increase in Serum Creatinine by ≥ 0.3 mg/dl within 48 h	[35]
	OR
	Increase in Serum Creatinine to ≥ 1.5 times baseline
	OR
	Urine volume < 0.5 ml/kg/h for 6 h (b)
ARDS	Based on the Berlin definition, presence of bilateral opacity in radiology reports	[36]
	AND
	Oxygenation: PaO₂/FiO₂ ≤ 300 mm Hg
	AND
	Timing: ≤ one week
	AND
	Origin: pulmonary

(a) Based on SEHA's clinical standards.

(b) Urine output was not measured in our dataset because it is collected in the intensive care unit.

Methods

This study is reported following the TRIPOD guidance [34].

Data source

This study is a retrospective multicentre study that includes anonymized data recorded within 3,493 COVID-19 hospital encounters at 18 Abu Dhabi Health Services (SEHA) healthcare facilities in Abu Dhabi, United Arab Emirates. The study was approved by the Institutional Review Board (IRB) of the Department of Health (Ref: DOH/CVDC/2020/1125) and of New York University Abu Dhabi (Ref: HRPP-2020-70). Informed consent was not required for this study as it was determined to be exempt. All methods were performed in accordance with the relevant guidelines and regulations. There were nine facilities in the Middle region, which includes the capital city, and nine facilities in the Eastern and Western regions. Those regions are highlighted in Fig. 2(a). Fig. 2(b) shows the flowchart of how the exclusion criteria were applied to obtain the final data splits. We excluded 127 non-adult encounters and 14 pregnant encounters, leaving 3,352 encounters, and split the dataset into training and test sets. The training sets were used for model training and selection, while the test sets were used for evaluation. Training set A consisted of 1,829 encounters recorded in the Middle region between April 1, 2020 and April 25, 2020. To evaluate temporal generalizability, test set A included 587 encounters recorded in the Middle region between April 26, 2020 and April 30, 2020. Training set B included 711 encounters admitted to the Eastern and Western regions between April 1, 2020 and April 25, 2020, and test set B included 225 encounters admitted to the same hospitals between April 26, 2020 and April 30, 2020.
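The regional and temporal split described above can be sketched as follows. This is a minimal illustration, assuming each encounter carries a region label and an admission date; the function name `assign_split` and the region strings are hypothetical and are not taken from the study's code. The dates and encounter counts are those reported in the text.

```python
from datetime import date

# Study window and training cut-off as reported in the text.
STUDY_START = date(2020, 4, 1)
TRAIN_END = date(2020, 4, 25)
STUDY_END = date(2020, 4, 30)

# Hypothetical region labels: the Middle region forms set A,
# the Eastern and Western regions together form set B.
REGION_TO_SET = {"middle": "A", "eastern": "B", "western": "B"}


def assign_split(region: str, admitted: date) -> str:
    """Return 'train_A', 'test_A', 'train_B' or 'test_B' for one encounter."""
    if region not in REGION_TO_SET:
        raise ValueError(f"unknown region: {region!r}")
    if not (STUDY_START <= admitted <= STUDY_END):
        raise ValueError("admission date outside the study window")
    phase = "train" if admitted <= TRAIN_END else "test"
    return f"{phase}_{REGION_TO_SET[region]}"


# Consistency check on the counts reported in the text:
# 3,493 encounters minus 127 non-adult and 14 pregnant encounters.
reported = {"train_A": 1829, "test_A": 587, "train_B": 711, "test_B": 225}
total_after_exclusion = 3493 - 127 - 14
```

Splitting by admission date rather than at random is what lets the test sets probe temporal generalizability, since every test encounter is admitted after every training encounter from the same region.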

Fig. 2

(a) The UAE map showcasing the location of the healthcare facilities included in this study. (b) Flowchart for the overall dataset showing how the inclusion and exclusion criteria were applied to obtain the final training and test sets, where n represents the number of patient encounters, and p represents the number of unique patients.

Outcomes

Based on clinical evidence and in collaboration with clinical experts, we focused on predicting three clinically diagnosed events (SBI, AKI [35], and ARDS [36]) that are associated with poor patient prognosis. For each patient encounter in the training and test sets, we identified the first occurrence (i.e., date and time), if any, of each complication based on the criteria shown in Table 2. SBI is defined based on positive cultures within 24 h of sample collection, AKI is defined based on the KDIGO classification criteria [35], and ARDS is defined based on the Berlin definition [36], which required the processing of free-text chest radiology reports. Further details on the processing of those reports are described in Supplementary Section A.

Input features

We considered data recorded within the first 24 h of admission as input features for the predictive models. This data included continuous and categorical features related to the patient baseline information, demographics, and vital signs. Within the patient's baseline and demographic information, age and Body Mass Index (BMI) were treated as continuous features, whereas pre-existing medical conditions (i.e., hypertension, diabetes, chronic kidney disease, and cancer), symptoms recorded at admission (i.e., cough, fever, shortness of breath, sore throat, and rash) and patient sex were treated as binary features. As for the vital signs, we included seven continuous features: systolic blood pressure, diastolic blood pressure, respiratory rate, peripheral pulse rate, oxygen saturation, axillary temperature, and the Glasgow Coma Score. We selected those features as they are commonly used in early warning score systems [37]. All vital-sign measurements were processed into minimum, maximum, and mean statistics. We summarized patient demographics, prevalence of the complications, and the distributions of the input features across the training and test sets in Tables 3 and 4.
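The reduction of repeated vital-sign measurements to minimum, maximum, and mean statistics can be sketched as below. This is an illustrative sketch only: the function `summarize_vitals`, the feature naming scheme, and the use of `None` for vital signs without any reading are assumptions, not the study's implementation.

```python
from statistics import mean


def summarize_vitals(readings):
    """Collapse repeated measurements into min, max and mean per vital sign.

    `readings` maps a vital-sign name to the list of values recorded during
    the first 24 h of admission. A vital sign with no readings yields None
    for all three statistics so that a later imputation step can fill it in.
    """
    features = {}
    for name, values in readings.items():
        if values:
            features[f"{name}_min"] = min(values)
            features[f"{name}_max"] = max(values)
            features[f"{name}_mean"] = mean(values)
        else:
            for stat in ("min", "max", "mean"):
                features[f"{name}_{stat}"] = None
    return features


# Toy encounter with three respiratory-rate readings and no SpO2 reading.
example = summarize_vitals(
    {"respiratory_rate": [18, 20, 22], "oxygen_saturation": []}
)
```

With seven vital signs, this step produces 21 continuous features per encounter, in addition to age, BMI, and the binary features.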

Predictive modeling

The proposed system predicts the risk of developing each of the three complications during the patient's stay after 24 h of admission. This is represented by a vector $\mathbf{y}$ consisting of three predictions, where each prediction is computed by a complication-specific model, such that $\mathbf{y} = [y_{\mathrm{SBI}}, y_{\mathrm{AKI}}, y_{\mathrm{ARDS}}]$, where each element of $\mathbf{y}$ lies in $[0, 1]$. The overall workflow of the model development is depicted in Fig. 1. For each complication-specific model, we excluded from its training and test sets patients who developed that complication prior to the time of prediction. For AKI, we also excluded patients with chronic kidney disease. Then for each complication, our system trains four model ensembles based on four types of base learners: logistic regression (LR), k-nearest neighbors (KNN), support vector machine (SVM) and a light gradient boosting model (LGBM). Missing data was imputed using median imputation for all models except for LGBM, which can natively learn from missing data, and the data was further scaled using min-max scaling for LR and standard scaling for SVM and KNN. For each type of base learner, the system performs stratified k-fold cross-validation using the complication's respective training set with k = 3. We performed random hyperparameter search for each base learner [38] with 30 iterations, resulting in three trained models (one per fold) for each hyperparameter set selected per iteration. The choice of random search was motivated by its relative simplicity, and its high efficiency and performance compared to other hyperparameter tuning methods [38]. The hyperparameter search ranges are summarized in Supplementary Section B. The ranges were defined based on initial experiments with manually chosen hyperparameters. We then selected the top two hyperparameter sets whose models achieved the highest average area under the receiver operating characteristic curve (AUROC) on the validation sets, resulting in six trained models (two hyperparameter sets, each with three fold-specific models).
We created an ensemble of those six models, and each model within the ensemble was further calibrated using isotonic regression on its respective validation set to ensure non-harmful decision making [39], except for the LR models. Isotonic regression takes a trained model's raw predictions as inputs, and computes well-calibrated output probabilities. This is done by grouping the raw predictions into bins associated with estimates of empirical probabilities [40]. The final prediction of each complication consisted of an average of the calibrated predictions of all models within an ensemble. All analysis was performed using Python (version 3.7.3). The LR, KNN, and SVM models were implemented using the Python scikit-learn package and the LGBM models were implemented using the LightGBM package [41].
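The calibration and averaging steps described above can be illustrated with a small, dependency-free sketch. It fits a monotone step function with the pool-adjacent-violators algorithm and then averages the calibrated outputs of several models. This is a simplified stand-in written for illustration: the study used scikit-learn, whose isotonic regression interpolates between points rather than returning a pure step function, and the names `fit_isotonic`, `calibrate`, and `ensemble_predict` are hypothetical.

```python
from bisect import bisect_left


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators fit of labels on raw scores.

    Returns (uppers, values): block i covers scores up to uppers[i] and maps
    them to values[i], the empirical event rate in that block.
    """
    blocks = []  # each block: [sum of labels, count, largest score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge while scores tie or block means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][2] == blocks[-1][2]
            or blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values


def calibrate(score, uppers, values):
    """Map one raw score to the event rate of the block it falls in."""
    i = bisect_left(uppers, score)
    return values[min(i, len(values) - 1)]


def ensemble_predict(raw_scores, calibrators):
    """Average the calibrated predictions of all models in an ensemble."""
    calibrated = [
        calibrate(s, uppers, values)
        for s, (uppers, values) in zip(raw_scores, calibrators)
    ]
    return sum(calibrated) / len(calibrated)


# Toy validation data (invented for illustration).
uppers, values = fit_isotonic(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0, 1, 0, 0, 1, 1]
)
risk = ensemble_predict([0.3, 0.55], [(uppers, values), (uppers, values)])
```

In the study, each of the six models in an ensemble is calibrated on its own validation fold, so in practice every model would have its own `(uppers, values)` pair; the LR models were not calibrated.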

Model interpretability

We performed post-hoc feature importance analysis using the SHapley Additive exPlanations (SHAP) [42,43]. SHAP values are indicative of the relative importance of the input variables and their impact on the predictions. The analysis was conducted using the open-source SHAP package [43], where we obtained the mean absolute SHAP values of the features for the six models per ensemble. For each feature, the six SHAP values were averaged and then ranked to reveal the overall importance of the features with respect to the ensembled prediction. We present the four top ranked features per complication ensemble for each test set using bar plots.
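The aggregation step described above (average the per-model mean absolute SHAP values, then rank) can be sketched as follows. The sketch does not call the SHAP package; it assumes the mean absolute SHAP value of each feature has already been computed per model. The function name `rank_features` and all numbers are invented for illustration, and only two models are shown instead of six to keep the example short.

```python
def rank_features(per_model_importance, top_k=4):
    """Average per-model mean |SHAP| values and return the top_k features.

    `per_model_importance` is a list with one dict per model in the
    ensemble, mapping feature name -> mean absolute SHAP value.
    """
    n_models = len(per_model_importance)
    features = per_model_importance[0].keys()
    averaged = {
        f: sum(model[f] for model in per_model_importance) / n_models
        for f in features
    }
    ranked = sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Invented importances for two hypothetical models.
toy = [
    {"age": 0.30, "systolic_bp": 0.20, "respiratory_rate": 0.10,
     "pulse_rate": 0.05, "bmi": 0.02},
    {"age": 0.10, "systolic_bp": 0.40, "respiratory_rate": 0.10,
     "pulse_rate": 0.11, "bmi": 0.04},
]
top_four = [name for name, _ in rank_features(toy)]
```

Averaging before ranking means a feature that is important in only one model of the ensemble can still be outranked by a feature that is moderately important in all of them.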

Performance assessment

We evaluated each complication ensemble using the AUROC and the area under the precision-recall curve (AUPRC) on the test set. The AUROC is a measure of the model's ability to discriminate between positive (complications) and negative cases (no complication) [44], while the AUPRC is a measure of model robustness when dealing with imbalanced datasets, i.e. unequal distribution of positive and negative cases [45]. The closer the AUROC and AUPRC are to 1, the better the performance of the model. Confidence intervals for all of the evaluation metrics were computed using bootstrapping with 1,000 iterations [46]. We also assessed the calibration of the ensemble, after post-hoc calibration of its trained models, using reliability plots and reported calibration intercepts and slopes [39].
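As an illustration of the evaluation described above, the sketch below computes the AUROC from its rank interpretation and a bootstrap confidence interval with 1,000 resamples. The text does not state which bootstrap interval was used, so the percentile interval here is an assumption, as are the function names `auroc` and `bootstrap_ci` and the choice to redraw resamples that contain a single class.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_iter=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # redraw resamples that contain one class only
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_iter)]
    upper = stats[int((1 - alpha / 2) * n_iter) - 1]
    return lower, upper


# Toy data (invented for illustration).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
point = auroc(y_true, y_score)
low, high = bootstrap_ci(y_true, y_score)
```

The pairwise loop is quadratic in the number of encounters, which is acceptable for test sets of a few hundred encounters; a rank-based formula would be preferable for much larger data.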

Results

A total of 3,352 encounters were included in the study, and the characteristics of the final data splits are presented in Table 3. Across all the data splits, the mean age ranges between 39.3 and 45.5 years and the proportion of males ranges between 84.8% and 88.9%. The mortality rate was also less than 4% across all data splits, ranging between 1.3% and 3.7%. ARDS was the most prevalent complication developed in the first 24 h of admission across all datasets. The incidence of the complications developed after 24 h was higher in the test sets than in their respective training sets. The distributions of the vital signs and demographics, in terms of the mean and interquartile ranges, are shown in Table 4.

Table 3

	Training set A	Test set A	Training set B	Test set B
Patient Cohort
Encounters, n	1829	587	711	225
Age, mean (IQR)	41.7 (17.0)	45.5 (18.0)	39.3 (17.0)	42.7 (20.0)
Male, n (%)	1582 (86.5)	522 (88.9)	622 (87.5)	191 (84.8)
Arab, n (%)	295 (16.1)	89 (15.2)	120 (16.9)	43 (19.1)
Non-Arab, n (%)	1534 (83.9)	498 (84.8)	591 (83.1)	182 (80.9)
Mortality, n (%)	36 (2.0)	22 (3.7)	9 (1.3)	3 (1.3)
Complications
SBI, n (%)	92 (5.0)	45 (7.7)	23 (3.2)	17 (7.6)
Developed within 24 h from admission, n (%)	1 (0.1)	3 (0.5)	1 (0.1)	1 (0.4)
Developed after 24 h from admission, n (%)	91 (5.0)	42 (7.2)	22 (3.1)	16 (7.1)
AKI, n (%)	126 (6.9)	52 (8.9)	32 (4.5)	16 (7.1)
Developed within 24 h from admission, n (%)	28 (1.5)	9 (1.5)	14 (2.0)	3 (1.3)
Developed after 24 h from admission, n (%)	98 (5.4)	43 (7.3)	18 (2.5)	13 (5.8)
ARDS, n (%)	117 (6.4)	57 (9.7)	45 (6.3)	24 (10.7)
Developed within 24 h from admission, n (%)	61 (3.3)	26 (4.4)	23 (3.2)	13 (5.8)
Developed after 24 h from admission, n (%)	56 (3.1)	31 (5.3)	22 (3.1)	11 (4.9)

Table 4

Characteristics of the variables that were used as input features to our models. The mean and interquartile ranges are shown for the demographic features and vital-sign measurements. For the comorbidities and symptoms at admission, n denotes the number of patients and % denotes the percentage of patients in the respective dataset.

Variable, unit	Training set A	Test set A	Training set B	Test set B
Demographics, mean (IQR)
Age	41.7 (17.0)	45.5 (18.0)	39.3 (17.0)	42.7 (20.0)
BMI	26.9 (5.2)	26.7 (5.7)	26.5 (5.7)	27.9 (6.2)
Male, n (%)	1582 (86.5)	522 (88.9)	622 (87.5)	191 (84.8)
Comorbidities, n (%)
Hypertension	550 (30.1)	213 (36.3)	168 (23.6)	71 (31.6)
Diabetes	427 (23.3)	221 (37.6)	121 (17.0)	73 (32.4)
Chronic kidney disease	68 (3.7)	30 (5.1)	20 (2.8)	7 (3.1)
Cancer	30 (1.6)	7 (1.2)	12 (1.7)	8 (3.6)
Symptoms at admission, n (%)
Cough	851 (46.5)	338 (57.6)	259 (36.4)	99 (44.0)
Fever	28 (1.5)	20 (3.4)	3 (0.4)	3 (1.3)
Shortness of breath	190 (10.4)	99 (16.9)	71 (10.0)	34 (15.1)
Sore throat	238 (13.0)	89 (15.2)	118 (16.6)	28 (12.4)
Rash	29 (1.6)	10 (1.7)	15 (2.1)	5 (2.2)
Vital-sign measurements, mean (IQR)
Systolic blood pressure, mmHg	126.3 (15.0)	126.8 (16.0)	128.8 (15.5)	128.2 (15.7)
Diastolic blood pressure, mmHg	77.5 (9.8)	76.9 (9.9)	77.9 (10.3)	77.5 (10.7)
Respiratory rate, breaths per minute	18.9 (1.0)	20.2 (2.5)	18.1 (0.7)	18.7 (0.8)
Peripheral pulse rate, beats per minute	82.6 (11.5)	85.4 (11.6)	81.7 (13.4)	82.5 (12.5)
Oxygen saturation, %	98.4 (1.6)	97.5 (2.1)	98.5 (1.0)	98.2 (1.4)
Axillary temperature, °C	36.9 (0.4)	37.0 (0.7)	36.9 (0.4)	37.1 (0.6)
Glasgow Coma Score	14.8 (0.0)	15.0 (0.0)	14.8 (0.0)	14.8 (0.0)

Table 3 summarizes the baseline characteristics of the patient cohort in the training sets and test sets and the prevalence of the predicted complications, where n represents the total number of patients and % is the proportion of patients within the respective dataset. The performance results of the models selected by our system across the two test sets, in terms of the AUROC and AUPRC, are shown in Table 5. The Receiver Operating Characteristic (ROC) curves, Precision-Recall Curves (PRC), and reliability plots are visualized in Fig. 3(a), 3(b), and 3(c), respectively. Across both test sets, our data-driven approach achieved good performance (> 0.82 AUROC) for all of the complications. In test set A, AKI was the best discriminated endpoint at 24 h from admission, with 0.906 AUROC, followed by SBI (0.902 AUROC) and ARDS (0.854 AUROC). In test set B, AKI was again the best discriminated endpoint with 0.891 AUROC, followed by SBI (0.859 AUROC) and ARDS (0.827 AUROC).

Table 5

Complication	Result	Test Set A	Test Set B
SBI	Model Type	LR	LR
	AUROC	0.902 (0.862, 0.939)	0.859 (0.762, 0.932)
	AUPRC	0.436 (0.297, 0.609)	0.387 (0.188, 0.623)
	Calibration Slope	0.933 (0.321, 1.370)	1.031 (−0.066, 1.550)
	Calibration Intercept	0.031 (−0.111, 0.213)	0.010 (−0.164, 0.273)
AKI	Model Type	LR	LR
	AUROC	0.906 (0.856, 0.948)	0.891 (0.804, 0.961)
	AUPRC	0.436 (0.278, 0.631)	0.387 (0.115, 0.679)
	Calibration Slope	0.655 (0.043, 1.292)	1.370 (−0.050, 2.232)
	Calibration Intercept	0.059 (−0.136, 0.251)	−0.072 (−0.183, 0.154)
ARDS	Model Type	LR	LGBM
	AUROC	0.854 (0.789, 0.909)	0.827 (0.646, 0.969)
	AUPRC	0.288 (0.172, 0.477)	0.399 (0.150, 0.760)
	Calibration Slope	0.598 (0.028, 1.149)	0.742 (−0.029, 1.560)
	Calibration Intercept	0.000 (−0.159, 0.164)	0.050 (−0.166, 0.243)

Fig. 3

The (a) ROC curves, (b) PRC curves, and (c) calibration curves are shown for all model ensembles evaluated on test set A (top) and test set B (bottom). The color legend for all figures is shown on the right. The numerical values for the AUROC, AUPRC, calibration slopes and intercepts can be found in Table 5. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Table 5 reports the performance of the best performing models on test sets A & B, which were selected based on the average AUROC performance on the validation sets, as shown in Supplementary Section C. Model type indicates the type of the base learners within the final selected ensemble. All the metrics were computed using bootstrapping with 1,000 iterations [46]. The prevalence of the predicted complications ranged between 3.2%–6.9% and 7.1%–10.7% in the training and test sets, respectively. This high class imbalance is reflected in the AUPRC results, since the AUPRC depends on the prevalence of the outcome and tends to have a low value when there is class imbalance [47]. We also observe that LR was selected as the best performing model on the validation sets for most complications, highlighting its predictive power despite its simplicity compared to the other machine learning models. LGBM was selected for ARDS in test set B, as shown in Supplementary Section C. The top four important features for each complication are shown in Fig. 4 across the two test sets. Age was among the top predictive features for all the complications in both test sets. Similarly, systolic blood pressure was one of the top features for predicting SBI and AKI across both sets. Other features such as peripheral pulse rate and respiratory rate were among the top predictive features across both sets, for AKI and ARDS respectively.
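The dependence of the AUPRC on prevalence mentioned above can be made concrete with a short sketch. It computes average precision (a common step-wise estimate of the AUPRC, assumed here; the text does not state which estimator was used) and shows that a model whose scores carry no information scores exactly the prevalence. The counts of 42 events among 587 encounters are the SBI figures for test set A in Table 3 and are used only as an approximation, since the complication-specific test set excluded encounters with earlier onset. The function name `average_precision` is hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Thresholds are the distinct scores in descending order; tied scores
    enter the ranking together.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive case is required")
    tp = fp = 0
    ap = 0.0
    prev_recall = 0.0
    for threshold in sorted(set(scores), reverse=True):
        for y, s in zip(labels, scores):
            if s == threshold:
                tp += y
                fp += 1 - y
        precision = tp / (tp + fp)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# An uninformative model: every encounter receives the same score.
n_total, n_events = 587, 42
labels = [1] * n_events + [0] * (n_total - n_events)
no_skill = average_precision(labels, [0.5] * n_total)
prevalence = n_events / n_total  # about 0.07

# A small informative example (invented data).
toy_ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Read against a no-skill baseline of roughly 0.07, an AUPRC of about 0.4 reflects substantial enrichment even though it looks low next to the AUROC.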

Fig. 4

The four most important features are shown for each complication in (a) test set A and (b) test set B. Feature importance was computed using the average SHAP values of the six models per ensemble.

The calibration results show that our ensemble models were reasonably calibrated across the complications: the calibration intercepts were close to 0 and the calibration slopes ranged from 0.598 to 1.370, albeit with wide confidence intervals, as shown in Table 5 and Fig. 3(c). This is also reflected in the sample patient timelines visualized in Fig. 5, where the predicted risks for the patient who experienced the complications were relatively higher than those predicted for the patient who did not experience any complications. In Fig. 5(a), the patient shown developed all three complications during their hospital stay of 44 days. This highlights the importance of predicting all complications simultaneously, especially for patients who may develop more than one complication. In Fig. 5(b), the patient did not develop any complications during their hospital stay of two days. Comparing the two patients, the system's predictions for patient (a) were considerably higher than those for patient (b). For example, the AKI predictions were 0.73 and 0.002, respectively, even though patient (a) developed AKI only around 20 days after admission. This demonstrates the value of our system in predicting the risk of developing complications early during the patient's stay.

Fig. 5

Timeline showing the development of complications with respect to number of days from admission (x-axis) for two sample patients. (a) For [ySBI, yAKI, yARDS], our system predictions (multiplied by 100 to obtain percentages) were [64%, 73%, 51%]. (b) This patient did not develop any complications and our model predictions were [0.2%, 0.2%, 2%].

Discussion

In this study, we developed an automated prognostic system to support patient assessment and triage early on during the patient's stay. We demonstrate that the system can predict the risk of multiple complications simultaneously and achieves a good performance across all complications across two geographically independent datasets. The feature importance analysis revealed that age, systolic blood pressure and respiratory rate are highly predictive of several complications across the two datasets. Since COVID-19 was predominantly a pulmonary illness especially in its early variants [48], it was not surprising that respiratory rate ranked among the highest predictive features. We also identified age and systolic blood pressure as markers for severity among patients with COVID-19, which is aligned with clinical literature [49,50]. Specifically, systolic blood pressure has been determined as an important covariate of morbidity and mortality in patients with COVID-19 [51]. This analysis demonstrates that our system's learning is clinically meaningful and relevant. In addition, we assessed our models' calibration through reporting the calibration slopes and intercepts and visualized the calibration curves. Sufficiently large datasets are usually needed to produce stable calibration curves at model validation stage [39]. Despite the size of our dataset, we found that reporting the calibration slopes and intercepts would provide a concise summary of potential problems with our system's calibration, to avoid harmful decision-making [39]. One of the main strengths of this study is that we used multicentre data collected at 18 facilities across several regions in Abu Dhabi, UAE. COVID-19 treatment is free for all patients in the UAE, hence there were no obvious gaps in terms of access to healthcare services in our dataset. Across the training and test sets in regions A & B, 15.2%–19.1% of encounters were for Arab patients. 
This reflects the diversity of our dataset, since Abu Dhabi is home to more than 200 nationalities and only 19.0% of its population is Emirati. This diversity makes our findings relevant to a global audience. While most previous studies have focused on European or Chinese patient cohorts [52,53,54], our study is one of the few studies with a large sample size (3,352 COVID-19 patient encounters) that focus on the patient cohort in the UAE. Compared to other international patient cohorts, our cohort is relatively younger (39.3–45.5 years across training and test sets), with a lower overall mortality rate (1.3%–3.7% across the training and test sets), suggesting that our system needs to be further validated on populations with different demographic distributions [10,55,56]. Our data-driven approach and open-access code can be easily adapted for such purposes.

Another strength is that our system simultaneously predicts three complications that are indicative of patient severity, which can help avoid poor patient outcomes. From a clinical perspective, several studies reported worse prognosis among patients with COVID-19 who had multi-organ failure and co-infections [6,8,10,57]. Most of the existing COVID-19 prognostic studies focus on predicting mortality as an adverse event outcome [2]. The low mortality rates in our dataset strongly discouraged the development of a mortality risk prediction score, as such small sample sizes may lead to biased models [2].

An important aspect of this study is that the labeling criteria of the complications rely on established clinical standards and hospital-acquired data to identify the exact time of the occurrence of such complications. In consultation with the clinical experts, we considered this approach more reliable than using International Classification of Diseases (ICD) codes [58,59].
Despite the development of new ontologies [60], ICD codes are generally used for billing purposes and their derivation may vary across facilities, especially during a pandemic [61]. We also introduce new benchmark results on test set B against which competing models can be compared. Future work should also investigate the use of multi-label deep learning classifiers for larger datasets, while accounting for the exclusion criteria during training.

Moreover, our system uses routinely collected data and does not incur high data collection costs. Other prognostic machine learning studies have also adopted this strategy to predict adverse outcomes [62,63]. By using routinely collected data rather than hematologic, cardiac, or biochemical laboratory tests that are associated with high processing times, our system is suitable for low-cost deployment.

Existing studies achieved performance comparable to that of our system. For example, an AKI prediction model achieved 0.78 AUROC using findings of abdominal CT scans, vital-sign measurements, comorbidities, and laboratory-test results [25]. Although the results are not directly comparable due to differences in study design, our system achieved 0.91 and 0.89 AUROC in test sets A & B, respectively, without needing any imaging or laboratory-test results. In another study, an ARDS prediction model achieved 0.89 AUROC using patient demographics, interventions, comorbidities, 17 laboratory-test results, and eight vital signs [33]. In comparison, our system achieved 0.85 AUROC in test set A and 0.83 AUROC in test set B. This implies that we should consider including additional variables, such as laboratory-test results, to improve the performance of the model. One other study highlighted the predictive ability of eight laboratory-test results for the prediction of sepsis, where it achieved 0.93 AUROC [28].
We avoided the use of laboratory-test results to ensure that there is no overlap between the set of input features and the variables used to define the output complications (i.e. label leakage); however, this is an area for future work.

Our study also has several limitations. One limitation of the labeling procedure is that it could miss patients for whom the data used in identifying a particular complication were not collected. However, this issue is more closely related to data collection practices at institutions, as clinical data are often not missing completely at random. Another limitation is that, since we relied on a minimal feature set, our system does not account for possible effects of treatment on the predicted outcomes or for feature interactions, which is an area for future study. Moreover, the models are not perfectly calibrated, likely because of the small dataset size; this could also be attributed to the fact that the final predictions are based on model ensembles rather than on an individually calibrated model. Future work should investigate how to further improve the calibration of ensemble models. Furthermore, we utilized a dataset collected during the first wave of the pandemic, which did not include any information indicating the type of variant. Hence, the results presented here may not be directly applicable to patients with new COVID-19 variants. However, the system can be easily reused, fine-tuned, and validated using new datasets.
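
The two evaluation quantities discussed above, AUROC and the calibration slope and intercept, can be illustrated with a minimal Python sketch. It is not taken from the study's codebase: the function names are hypothetical, the data are toy values, and it assumes one common definition in which the slope and intercept are estimated jointly by a logistic regression of the observed outcome on the logit of the predicted probability (some authors instead estimate the intercept with the slope fixed at 1).

```python
import math


def auroc(y_true, y_prob):
    """AUROC via the rank-sum (Mann-Whitney U) formulation; ties get average ranks."""
    order = sorted(range(len(y_prob)), key=lambda i: y_prob[i])
    ranks = [0.0] * len(y_prob)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < len(order) and y_prob[order[j + 1]] == y_prob[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative label")
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 to keep the value finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def calibration_slope_intercept(y_true, y_prob, n_iter=50):
    """Fit logit(P(y=1)) = a + b * logit(p_hat) by Newton-Raphson; return (b, a).

    A slope b near 1 and an intercept a near 0 indicate good calibration.
    Assumes the labels are not perfectly separated by the predictions.
    """
    x = [logit(p) for p in y_prob]
    a, b = 0.0, 1.0  # start from the "perfectly calibrated" solution
    for _ in range(n_iter):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, y_true):
            mu = 1 / (1 + math.exp(-(a + b * xi)))
            w = mu * (1 - mu)
            g_a += yi - mu           # gradient of the log-likelihood w.r.t. a
            g_b += (yi - mu) * xi    # gradient of the log-likelihood w.r.t. b
            h_aa += w                # entries of the 2x2 information matrix
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        if abs(det) < 1e-12:
            break
        # Newton step: solve the 2x2 system (information matrix) * delta = gradient.
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a += da
        b += db
        if abs(da) + abs(db) < 1e-10:
            break
    return b, a


if __name__ == "__main__":
    # Toy labels and predicted risks for a single complication (illustrative only).
    labels = [0, 0, 1, 0, 1, 1, 0, 1]
    risks = [0.05, 0.20, 0.30, 0.35, 0.60, 0.70, 0.75, 0.90]
    slope, intercept = calibration_slope_intercept(labels, risks)
    print(f"AUROC={auroc(labels, risks):.3f} slope={slope:.3f} intercept={intercept:.3f}")
```

In a multi-label setting such as the one described here, these functions would be applied separately to each complication's predictions. A slope below 1 suggests predictions that are too extreme, while a slope above 1 suggests predictions that are too conservative.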

Conclusion

Our data-driven approach and results highlight the promise of machine learning in risk prediction in general and COVID-19 complications in particular. The proposed approach performs well when applied to two independent multicentre training and test sets in the UAE. The system can be easily implemented in practice due to several factors. First, the input features that our system uses are routinely collected by hospitals that accommodate patients with COVID-19 as recommended by the World Health Organization. Second, training the machine learning models within our system does not require high computational resources. Finally, through feature importance analysis, our system can offer interpretability, and is also fully automated as it does not require any manual interventions. To conclude, we propose a clinically applicable system that predicts complications among patients with COVID-19. Our system can serve as a guide to anticipate the course of patients with COVID-19 and to help initiate more targeted and complication-specific decision-making on treatment and triage.

Contributors

GOG, BA, and KWY managed and analyzed the data. FS and II extracted, anonymized, and provided the dataset for analysis. GOG, KWY, and NH developed and maintained the experimental codebase. FAK, SAJ, MAS, RA, and MAH provided clinical expertise. WZ, FS, MAH and FES designed the study. MAH and FES supervised the work. GOG, BA, KWY, and FES wrote the manuscript. All authors interpreted the results and revised and approved the final manuscript.

Funding

The work of FES, GOG, BA, KWY, RA and NH is funded by NYU Abu Dhabi, and the work of FS, II, FAK, SAJ, MAS, and MAH is funded by Abu Dhabi Health Services. This work was also supported by the NYUAD Center for Interacting Urban Networks (CITIES), funded by Tamkeen under the NYUAD Research Institute Award CG001.

Patient and public involvement

No patient involvement.

Data availability statement

To allow for reproducibility and benchmarking on our dataset, we are sharing test set B (n = 225) at https://github.com/nyuad-cai/COVID19Complications. We are unable to share the full dataset used in this study due to restrictions by the data provider. The trained models and the source code of the pipeline are also included in the repository.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

56 in total

1. From Local Explanations to Global Understanding with Explainable AI for Trees.

Authors: Scott M Lundberg; Gabriel Erion; Hugh Chen; Alex DeGrave; Jordan M Prutkin; Bala Nair; Ronit Katz; Jonathan Himmelfarb; Nisha Bansal; Su-In Lee
Journal: Nat Mach Intell Date: 2020-01-17

2. The Development of a Machine Learning Inpatient Acute Kidney Injury Prediction Model.

Authors: Jay L Koyner; Kyle A Carey; Dana P Edelson; Matthew M Churpek
Journal: Crit Care Med Date: 2018-07 Impact factor: 7.598

3. Bacterial coinfection in influenza: a grand rounds review.

Authors: Daniel S Chertow; Matthew J Memoli
Journal: JAMA Date: 2013-01-16 Impact factor: 56.272

4. Acute respiratory distress syndrome: the Berlin Definition.

Authors: V Marco Ranieri; Gordon D Rubenfeld; B Taylor Thompson; Niall D Ferguson; Ellen Caldwell; Eddy Fan; Luigi Camporota; Arthur S Slutsky
Journal: JAMA Date: 2012-06-20 Impact factor: 56.272

Review 5. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD Statement.

Authors: G S Collins; J B Reitsma; D G Altman; K G M Moons
Journal: Br J Surg Date: 2015-02 Impact factor: 6.939

6. Development of a prognostic model for mortality in COVID-19 infection using machine learning.

Authors: Adam L Booth; Elizabeth Abels; Peter McCaffrey
Journal: Mod Pathol Date: 2020-10-16 Impact factor: 7.842

7. Clinical course and outcomes of critically ill patients with SARS-CoV-2 pneumonia in Wuhan, China: a single-centered, retrospective, observational study.

Authors: Xiaobo Yang; Yuan Yu; Jiqian Xu; Huaqing Shu; Jia'an Xia; Hong Liu; Yongran Wu; Lu Zhang; Zhui Yu; Minghao Fang; Ting Yu; Yaxin Wang; Shangwen Pan; Xiaojing Zou; Shiying Yuan; You Shang
Journal: Lancet Respir Med Date: 2020-02-24 Impact factor: 30.700

8. Risk factors analysis of COVID-19 patients with ARDS and prediction based on machine learning.

Authors: Wan Xu; Nan-Nan Sun; Hai-Nv Gao; Zhi-Yuan Chen; Ya Yang; Bin Ju; Ling-Ling Tang
Journal: Sci Rep Date: 2021-02-03 Impact factor: 4.379

9. Integrating deep learning CT-scan model, biological and clinical variables to predict severity of COVID-19 patients.

Authors: Nathalie Lassau; Samy Ammari; Emilie Chouzenoux; Hugo Gortais; Paul Herent; Matthieu Devilder; Samer Soliman; Olivier Meyrignac; Marie-Pauline Talabard; Jean-Philippe Lamarque; Remy Dubois; Nicolas Loiseau; Paul Trichelair; Etienne Bendjebbar; Gabriel Garcia; Corinne Balleyguier; Mansouria Merad; Annabelle Stoclin; Simon Jegou; Franck Griscelli; Nicolas Tetelboum; Yingping Li; Sagar Verma; Matthieu Terris; Tasnim Dardouri; Kavya Gupta; Ana Neacsu; Frank Chemouni; Meriem Sefta; Paul Jehanno; Imad Bousaid; Yannick Boursin; Emmanuel Planchet; Mikael Azoulay; Jocelyn Dachary; Fabien Brulport; Adrian Gonzalez; Olivier Dehaene; Jean-Baptiste Schiratti; Kathryn Schutte; Jean-Christophe Pesquet; Hugues Talbot; Elodie Pronier; Gilles Wainrib; Thomas Clozel; Fabrice Barlesi; Marie-France Bellin; Michael G B Blum
Journal: Nat Commun Date: 2021-01-27 Impact factor: 14.919

10. Multivariate analysis of CT imaging, laboratory, and demographical features for prediction of acute kidney injury in COVID-19 patients: a Bi-centric analysis.

Authors: Stefanie J Hectors; Sadjad Riyahi; Hreedi Dev; Karthik Krishnan; Daniel J A Margolis; Martin R Prince
Journal: Abdom Radiol (NY) Date: 2020-10-24