Asif Hassan Syed, Tabrej Khan, Nashwan Alromema.
Abstract
The increase in coronavirus disease 2019 (COVID-19) infections caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has placed pressure on healthcare services worldwide. Therefore, it is crucial to identify critical factors for assessing the severity of COVID-19 infection and optimizing individual treatment strategies. In this regard, the present study leverages a dataset of blood samples from 485 COVID-19 individuals in the region of Wuhan, China, to identify essential blood biomarkers that predict the mortality of COVID-19 individuals. For this purpose, a hybrid of filter, statistical, and heuristic-based feature selection approaches was used to select the best subset of informative features. As a result, minimum redundancy maximum relevance (mRMR), a two-tailed unpaired t-test, and the whale optimization algorithm (WOA) together selected the three most informative blood biomarkers: international normalized ratio (INR), platelet large cell ratio (P-LCR), and D-dimer. In addition, various machine learning (ML) algorithms (random forest (RF), support vector machine (SVM), extreme gradient boosting (EGB), naïve Bayes (NB), logistic regression (LR), and k-nearest neighbor (KNN)) were trained. The performance of the trained models was compared to determine the model that predicts the mortality of COVID-19 individuals with the highest accuracy, F1 score, and area under the curve (AUC) values. The best-performing RF-based model, built using the three most informative blood parameters, predicts the mortality of COVID-19 individuals with an accuracy of 0.96 ± 0.062, an F1 score of 0.96 ± 0.099, and an AUC value of 0.98 ± 0.024 on the independent test data. Furthermore, the performance of our proposed RF-based model in terms of accuracy, F1 score, and AUC was significantly better than the known blood biomarkers-based ML models built using the Pre_Surv_COVID_19 data. Therefore, the present study provides a novel hybrid approach to screen the most informative blood biomarkers and develop an RF-based model that accurately and reliably predicts in-hospital mortality of confirmed COVID-19 individuals during surge periods. An application based on our proposed model was implemented and deployed at Heroku.
Keywords: COVID-19; blood biomarkers; filter-based feature selection; hybrid-feature selection; machine learning models; meta-heuristic method; mortality risk prediction; two-tailed unpaired t-test
Year: 2022 PMID: 35885508 PMCID: PMC9316550 DOI: 10.3390/diagnostics12071604
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
Studies of blood biomarker-based COVID-19 mortality risk prediction.
| Study | Blood Biomarkers | Machine Learning Method | Accuracy | Area Under the Curve (AUC) | F1 Score |
|---|---|---|---|---|---|
| Banerjee et al. 2020 | Full blood counts | RF and Artificial Neural Network (ANN) based models | 90–91% | 94–95% | NA |
| Brinati et al. 2020 | White Blood Cell (WBC) count, platelets, and plasma levels of High Sensitivity C-Reactive Protein (hs-CRP), Aspartate Aminotransferase (AST), Alanine Transaminase (ALT), Gamma-Glutamyl Transferase (GGT), Alkaline Phosphatase (ALP), and Lactate Dehydrogenase (LDH) | RF and Three-way Random Forest (TWRF) based models | 82–86% | 84–86% | NA |
| Thell et al. 2021 | Eosinophils, ferritin, leukocytes, and erythrocytes | Univariate and multivariate binomial logistic regression-based models | 72.3–79.4% | 0.915 | NA |
| Yang et al. 2020 | Patient demographic features (age, sex, race) with 27 routine laboratory tests | Gradient boosting decision tree (GBDT) | NA | 0.854 | NA |
| Rahman et al. 2021 | Age, lymphocyte count, D-dimer, CRP, and creatinine | LR and a nomogram developed with the LR algorithm | 0.91 ± 0.03 | 0.992 for the external validation cohort dataset | 0.92 ± 0.03 |
| Chowdhury et al. 2021 | LDH, neutrophils (%), lymphocytes (%), hs-CRP, and age | Multi-tree XGBoost model and a nomogram developed using multi-tree XGBoost | 100% | 0.991 for the validation cohort dataset | NA |
| Vaid et al. 2020 | Biomarkers for mortality at 7 days: age, anion gap, hs-CRP, LDH, Oxygen Saturation (SpO2), Blood Urea Nitrogen (BUN), ferritin, Red Cell Distribution Width (RDW), and diastolic blood pressure | XGBoost classifier-based model | NA | In external validation, the XGBoost classifier obtained an AUC-ROC of 0.88 at 3 days, 0.86 at 5 days, 0.86 at 7 days, and 0.84 at 10 days for mortality prediction | NA |
| Aladağ et al. 2020 | Intubated patients, a lower Glomerular Filtration Rate (GFR) value, and N-terminal pro-brain natriuretic peptide (NT-proBNP) values | Multiple Logistic Regression (MLR) | NA | NA | NA |
| Terwangne et al. 2020 | Age, acute kidney injury, lymphocytes, activated partial thromboplastin time (aPTT), and LDH levels | Bayesian network analysis for severity classification of COVID-19 | NA | 83.8% AUC obtained from a Bayesian network trained and evaluated using the entire set of patients | NA |
| Huang et al. 2020 | Epidemiological exposure histories, weakness/fatigue, heart rate < 100 beats/min, bilateral pneumonia, neutrophil count ≤ 6.3 × 10⁹/L, eosinophil count ≤ 0.02 × 10⁹/L, glucose ≥ 6 mmol/L, D-dimer ≥ 0.5 mg/L, and CRP < 5 mg/L | Multivariate logistic regression model-based novel risk score | NA | 0.921 | NA |
| Cia et al. 2020 | LDH, Neutrophil to Lymphocyte Ratio (NLR), D-dimer, and CRP score on admission and severity of COVID-19 infection | LR model | NA | The AUC values were 0.716 for NLR, 0.650 for D-dimer, 0.612 for CT score, and 0.740 for LDH, which indicate a specific diagnostic value for the severity of COVID-19 infection | NA |
| Wang et al. 2020 | The clinical model was developed using a history of hypertension, age, and coronary heart disease; the laboratory model was developed using peripheral capillary oxygen saturation, neutrophils, hs-CRP, D-dimer, lymphocyte count, GFR, AST, and age | Stepwise Akaike information criterion and ensemble XGBoost (extreme gradient boosting) model | NA | AUC values were 0.88 for the clinical model and 0.98 for the laboratory model | NA |
| Xie et al. 2020 | LDH, age, SpO2, and lymphocyte count | Multivariable logistic regression model and a nomogram developed using multivariable logistic regression | NA | Independent validation cohort with an AUC of 0.98 | NA |
| Bolourani et al. 2021 | Body mass index (BMI), age, and hypertension, to build a mortality prediction model from COVID-19 data from the United Kingdom and Denmark | XGBoost model | 0.919 | 0.77 | NA |
| Jimenez-Solem et al. 2021 | BMI, age, and hypertension | RF-based model | NA | The model showed a higher discriminative power with an AUC of 0.818 at hospital admission, 0.906 at diagnosis, and 0.721 during ICU admission | NA |
| Karthikeyan et al. 2021 | Neutrophils, lymphocytes, LDH, hs-CRP, and age | XGBoost feature importance and neural network classification | 96.526 ± 0.637 | 0.9895 ± 0.0057 | 0.9687 ± 0.006 |
| Yan et al. 2020 | LDH, hs-CRP, and lymphocyte count | Interpretable single tree XGBoost model | NA | Predicts the mortality of COVID-19 individuals with 94% accuracy as early as 3 days before the patient outcome | NA |
Figure 1. A pictorial representation of the ML pipeline for implementing an application that predicts the mortality of positive COVID-19 cases.
Figure 2. (a) Pictorial representation of the proposed hybrid feature selection method and (b) the workflow of the proposed hybrid feature selection method to screen the most informative ML model for predicting the mortality risk of COVID-19 individuals.
Figure 3. Workflow of genetic algorithm.
List of informative attributes from the COVID-19 clinical dataset using the mRMR algorithm.
| Sl.no. | Clinical Attributes |
|---|---|
| 1 | Serum chloride |
| 2 | Monocytes (%) |
| 3 | Serum sodium |
| 4 | Serum potassium |
| 5 | Calcium |
| 6 | Corrected calcium |
| 7 | Indirect bilirubin |
| 8 | Prothrombin Time (PT) |
| 9 | Total Protein (TP) |
| 10 | Neutrophils (%) |
| 11 | Basophil count (BC) |
| 12 | High sensitivity C-reactive protein (hs-CRP) |
| 13 | Hemoglobin |
| 14 | International Normalized Ratio (INR) |
| 15 | Platelet Large Cell Ratio (P-LCR) |
| 16 | Mean Platelet Volume (MPV) |
| 17 | Procalcitonin (PCT) |
| 18 | D-Dimer |
| 19 | Platelet Distribution Width (PDW) |
| 20 | Serum Glutamic-Pyruvic Transaminase (SGPT) |
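The filter step that produces a ranked list like the one above can be illustrated with a minimal mRMR sketch. The record does not state which mRMR variant or mutual-information estimator the authors used, so this example assumes the difference form (relevance minus mean redundancy) with equal-width binning. The function names `discretize`, `mutual_information`, and `mrmr`, and the toy data, are hypothetical.

```python
import math
from collections import Counter


def discretize(column, bins=3):
    """Equal-width binning of one numeric feature column (assumed estimator)."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / bins or 1.0  # constant columns fall into a single bin
    return [min(int((v - lo) / width), bins - 1) for v in column]


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr(features, labels, k, bins=3):
    """Greedy mRMR: repeatedly add the feature with the best
    relevance-minus-mean-redundancy score."""
    cols = {name: discretize(col, bins) for name, col in features.items()}
    relevance = {name: mutual_information(col, labels) for name, col in cols.items()}
    selected = []
    while len(selected) < min(k, len(cols)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(cols[name], cols[s]) for s in selected)
            return relevance[name] - redundancy / len(selected)
        selected.append(max((n for n in cols if n not in selected), key=score))
    return selected


# Toy data (hypothetical): "a_copy" duplicates "a", "noise" is unrelated to the label.
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
features = {
    "a":      [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "b":      [1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0],
    "noise":  [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
}
print(mrmr(features, labels, k=2, bins=2))  # -> ['a', 'b']
```

In this toy case the duplicate feature is skipped in the second round because its redundancy with the already selected feature outweighs its relevance, which is the behaviour mRMR is designed to produce.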
List of features (blood biomarkers) obtained using mRMR and the corresponding mean difference between two classes of population (survivor and non-survivor) at a significance level of 0.05.
| Sl.no. | Name of Blood Biomarkers | Non-Survivor (Mean ± SD) | Survivor (Mean ± SD) | Two-Tailed p-Value |
|---|---|---|---|---|
| 1 | Serum chloride | 0.448291385 ± 0.155 | 0.3763732 ± 0.119 | |
| 2 | Monocytes (%) | 0.017486858 ± 0.007 | 0.011148923 ± 0.051 | |
| 3 | Serum sodium | 0.3911567 ± 0.166 | 0.325273619 ± 0.122 | |
| 4 | Serum potassium | 0.255000148 ± 0.140 | 0.234565716 ± 0.070 | 0.0709 |
| 5 | Calcium | 0.556278701 ± 0.119 | 0.64440659 ± 0.125 | |
| 6 | Corrected calcium | 0.587374724 ± 0.131 | 0.625911132 ± 0.104 | 0.0018 |
| 7 | Indirect Bilirubin | 0.129711839 ± 0.125 | 0.111754649 ± 0.096 | 0.1199 |
| 8 | Prothrombin Time (PT) | 0.089973693 ± 0.102 | 0.055744667 ± 0.010 | |
| 9 | Total protein (TP) | 0.58374466 ± 0.162 | 0.648346818 ± 0.147 | |
| 10 | Neutrophils (%) | 0.902449663 ± 0.097 | 0.757378019 ± 0.177 | |
| 11 | Basophil count (#) | 0.186996944 ± 0.183 | 0.179833248 ± 0.163 | 0.6907 |
| 12 | High sensitivity C-Reactive Protein (hs-CRP) | 0.398503 ± 0.238 | 0.036965 ± 0.078 | |
| 13 | Hemoglobin | 0.668481 ± 0.143 | 0.686667 ± 0.118 | 0.1821 |
| 14 | International Normalized Ratio (INR) | 0.069874 ± 0.095 | 0.018222 ± 0.007 | |
| 15 | Platelet Large Cell Ratio (P-LCR) | 0.513142 ± 0.179 | 0.414974 ± 0.178 | |
| 16 | Mean Platelet Volume (MPV) | 0.482952 ± 0.184 | 0.383799 ± 0.178 | |
| 17 | Procalcitonin (PCT) | 0.037908 ± 0.102 | 0.018682 ± 0.073 | 0.0366 |
| 18 | D-Dimer | 0.571878041 ± 0.408 | 0.280508629 ± 0.085 | |
| 19 | Platelet Distribution Width (PDW) | 0.393129813 ± 0.204 | 0.222059325 ± 0.112 | |
| 20 | Serum Glutamic-Pyruvic Transaminase (SGPT) | 0.034611327 ± 0.091 | 0.016735918 ± 0.014 | 0.0070 |
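The statistical step compares each biomarker between the two outcome groups with a two-tailed unpaired t-test. The following is a minimal sketch of that test, assuming the pooled-variance (Student) form; the record does not say whether equal variances were assumed. The p-value is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed. The function names and the sample values are hypothetical and are not taken from the dataset.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value: 1 minus the probability mass in (-|t|, |t|),
    computed with Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3
    return max(0.0, 1.0 - central_mass)


def unpaired_t_test(group_a, group_b):
    """Student's unpaired t-test with pooled variance (assumes equal variances)."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_tailed_p(t, df)


# Hypothetical normalized biomarker values for two small groups.
non_survivor = [0.52, 0.61, 0.47, 0.58, 0.66]
survivor = [0.31, 0.28, 0.35, 0.25, 0.30]
t_stat, p_value = unpaired_t_test(non_survivor, survivor)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

A biomarker would be retained for the next stage when its p-value falls below the chosen significance level.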
List of features obtained using the four state-of-the-art meta-heuristic methods.
| Meta-Heuristic Methods | Global Optimal Feature Subset |
|---|---|
| WOA | ‘INR’, ‘P-LCR’, ‘D-Dimer’ |
| GA | ‘hsCRP’, ‘SGPT’, ‘INR’ |
| GWO | ‘Monocytes (%)’, ‘TP’, ‘INR’, ‘D-Dimer’, ‘PDW’ |
| SCA | ‘TP’, ‘INR’, ‘PDW’ |
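A wrapper search such as WOA explores feature subsets by moving continuous "whale" positions and mapping them to boolean masks. The sketch below is a simplified binary WOA, not the authors' implementation. It assumes scalar coefficients per whale, a sigmoid threshold for binarization, positions clamped to [-4, 4], and a caller-supplied `fitness` function to minimise (for example, classification error plus a penalty on subset size). All names, including `woa_select` and `toy_fitness`, are hypothetical.

```python
import math
import random


def binarize(position):
    """Sigmoid transfer: keep a feature when sigmoid(x) > 0.5, i.e. x > 0."""
    return tuple(x > 0 for x in position)


def woa_select(fitness, n_features, n_whales=10, n_iter=40, b=1.0, seed=0):
    """Minimise fitness(mask) over boolean feature masks with a basic WOA."""
    rng = random.Random(seed)
    whales = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_whales)]
    best_pos = min(whales, key=lambda w: fitness(binarize(w)))[:]
    best_score = fitness(binarize(best_pos))
    for it in range(n_iter):
        a = 2 - 2 * it / n_iter  # decreases linearly from 2 towards 0
        for w in whales:
            A = 2 * a * rng.random() - a
            C = 2 * rng.random()
            if rng.random() < 0.5:
                # Encircle the best whale when |A| < 1, else explore around a random whale.
                ref = best_pos if abs(A) < 1 else rng.choice(whales)
                new = [r - A * abs(C * r - x) for r, x in zip(ref, w)]
            else:
                # Spiral (bubble-net) update around the best whale.
                ell = rng.uniform(-1, 1)
                spiral = math.exp(b * ell) * math.cos(2 * math.pi * ell)
                new = [abs(r - x) * spiral + r for r, x in zip(best_pos, w)]
            w[:] = [max(-4.0, min(4.0, v)) for v in new]  # assumed bound on positions
            score = fitness(binarize(w))
            if score < best_score:
                best_pos, best_score = w[:], score
    return binarize(best_pos), best_score


# Toy objective (hypothetical): features 0 and 2 are informative,
# and every selected feature costs 0.1.
INFORMATIVE = {0, 2}


def toy_fitness(mask):
    missed = sum(1 for i in INFORMATIVE if not mask[i])
    return missed + 0.1 * sum(mask)


mask, score = woa_select(toy_fitness, n_features=5)
print(mask, round(score, 2))
```

In a real pipeline the fitness function would train and validate a classifier on the masked features, and it should assign a high cost to the empty mask so that the search never returns zero features.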
Figure 4. The histograms represent the frequency distribution across the two classes of population (non-survivor and survivor) of the three blood biomarkers in the optimal subset obtained using the WOA meta-heuristic technique. (a) Platelet large cell ratio, (b) D-dimer, and (c) international normalized ratio.
Figure 5. Comparison of the performance metrics. (a) Accuracy, (b) F1 score, and (c) AUC value of the three predictive models built using the four optimal feature subsets obtained by the WOA, GA, GWO, and SCA meta-heuristic algorithms.
Comparative performance evaluation of the seven predictive models built in terms of (a) accuracy, (b) F1 score, and (c) AUC value using the four optimal feature subsets obtained by the WOA, GA, GWO, and SCA meta-heuristic algorithms.
(a) Accuracy

| | | | | | | |
|---|---|---|---|---|---|---|
| 0.96 ± 0.062 | 0.92 ± 0.024 | 0.93 ± 0.047 | 0.91 ± 0.025 | 0.95 ± 0.037 | 0.89 ± 0.053 | 0.92 ± 0.024 |
| 0.95 ± 0.024 | 0.92 ± 0.034 | 0.92 ± 0.019 | 0.92 ± 0.027 | 0.88 ± 0.034 | 0.89 ± 0.029 | 0.92 ± 0.034 |
| 0.91 ± 0.044 | 0.84 ± 0.039 | 0.87 ± 0.032 | 0.85 ± 0.032 | 0.87 ± 0.027 | 0.87 ± 0.036 | 0.88 ± 0.039 |
| 0.85 ± 0.040 | 0.79 ± 0.045 | 0.84 ± 0.034 | 0.81 ± 0.049 | 0.81 ± 0.037 | 0.8 ± 0.049 | 0.75 ± 0.045 |
| 0.84 ± 0.044 | 0.80 ± 0.032 | 0.81 ± 0.019 | 0.82 ± 0.027 | 0.82 ± 0.036 | 0.75 ± 0.025 | 0.81 ± 0.024 |

(b) F1 score

| | | | | | | |
|---|---|---|---|---|---|---|
| 0.96 ± 0.099 | 0.91 ± 0.034 | 0.93 ± 0.060 | 0.9 ± 0.036 | 0.94 ± 0.048 | 0.88 ± 0.053 | 0.91 ± 0.034 |
| 0.94 ± 0.024 | 0.91 ± 0.035 | 0.91 ± 0.017 | 0.91 ± 0.035 | 0.87 ± 0.037 | 0.88 ± 0.031 | 0.91 ± 0.035 |
| 0.9 ± 0.064 | 0.83 ± 0.052 | 0.87 ± 0.038 | 0.84 ± 0.043 | 0.85 ± 0.034 | 0.84 ± 0.053 | 0.87 ± 0.052 |
| 0.84 ± 0.058 | 0.77 ± 0.062 | 0.83 ± 0.040 | 0.78 ± 0.064 | 0.79 ± 0.047 | 0.78 ± 0.075 | 0.73 ± 0.062 |
| 0.80 ± 0.034 | 0.77 ± 0.017 | 0.78 ± 0.036 | 0.77 ± 0.053 | 0.77 ± 0.035 | 0.80 ± 0.064 | 0.78 ± 0.058 |

(c) AUC value

| | | | | | | |
|---|---|---|---|---|---|---|
| 0.98 ± 0.024 | 0.92 ± 0.004 | 0.99 ± 0.015 | 0.93 ± 0.009 | 0.95 ± 0.011 | 0.94 ± 0.020 | 0.97 ± 0.004 |
| 0.97 ± 0.026 | 0.92 ± 0.027 | 0.97 ± 0.015 | 0.97 ± 0.015 | 0.88 ± 0.024 | 0.96 ± 0.030 | 0.92 ± 0.027 |
| 0.96 ± 0.020 | 0.84 ± 0.025 | 0.97 ± 0.024 | 0.93 ± 0.020 | 0.86 ± 0.050 | 0.93 ± 0.014 | 0.96 ± 0.025 |
| 0.90 ± 0.052 | 0.79 ± 0.050 | 0.90 ± 0.025 | 0.91 ± 0.025 | 0.91 ± 0.025 | 0.91 ± 0.054 | 0.84 ± 0.050 |
| 0.78 ± 0.004 | 0.79 ± 0.015 | 0.81 ± 0.027 | 0.81 ± 0.011 | 0.82 ± 0.027 | 0.80 ± 0.026 | 0.79 ± 0.025 |
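The three metrics reported above can be computed from true labels, predicted labels, and predicted scores as in the following minimal sketch. It assumes a binary outcome coded 1 for the positive class and treats the "±" values as a standard deviation over repeated evaluations such as cross-validation folds, which the record does not confirm. The function names and the fold values are hypothetical.

```python
from statistics import mean, stdev


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def auc(y_true, scores, positive=1):
    """AUC as the probability that a positive case outranks a negative one
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_sd(values):
    """Format repeated measurements as 'mean ± SD' (sample standard deviation)."""
    return f"{mean(values):.2f} ± {stdev(values):.3f}"


# Hypothetical labels and scores for one evaluation fold.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in scores]
print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), auc(y_true, scores))

# Hypothetical per-fold accuracies summarised in the table's format.
print(mean_sd([0.90, 1.00, 0.95]))  # -> 0.95 ± 0.050
```

The pairwise AUC shown here is quadratic in the number of samples, which is acceptable for a dataset of a few hundred patients; a rank-based formulation would scale better on larger cohorts.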
Figure 6. Confusion matrix of the RF-based COVID-19 mortality prediction model tested on the 20% independent test dataset.
Comparative performance evaluation between models built using the same test dataset.
| Sl.no. | Author | Machine Learning Model | Blood Biomarker (Features) | Accuracy (%) | F1 score | AUC Value |
|---|---|---|---|---|---|---|
| 1 | Yan et al. 2020 | Single tree XGBoost model | LDH, hs-CRP, and lymphocytes | 90 ± 0.537 | 95 ± 0.06 | 97.77 ± 1.82 |
| 2 | Karthikeyan et al. 2021 | Neural Network (NN)-based classification model | Lymphocytes, Neutrophils, hs-CRP, LDH, and age | 96.526 ± 0.637 | 0.9687 ± 0.006 | 0.9895 ± 0.0057 |
| 3 | Rahman et al. 2021 | LR model | Age, Lymphocyte count, D-dimer, CRP, and Creatinine | 0.92 ± 0.03 | 0.93 ± 0.03 | 0.992 ± 0.008 |
| 4 | Our Proposed RF-based model | RF model | INR, P-LCR, and D-dimer | 0.96 ± 0.062 | 0.96 ± 0.099 | 0.98 ± 0.024 |