Tawsifur Rahman, Amith Khandakar, Farhan Fuad Abir, Md Ahasan Atick Faisal, Md Shafayet Hossain, Kanchon Kanti Podder, Tariq O Abbas, Mohammed Fasihul Alam, Saad Bin Kashem, Mohammad Tariqul Islam, Susu M Zughaier, Muhammad E H Chowdhury.
Abstract
The reverse transcription-polymerase chain reaction (RT-PCR) test is the current gold standard for detecting coronavirus disease (COVID-19), but it suffers from several shortcomings: a comparatively long turnaround time, a high false-negative rate of around 20-25%, and expensive equipment. Therefore, finding an efficient, robust, accurate, widely available, and accessible alternative to RT-PCR for COVID-19 diagnosis is of utmost importance. This study proposes a complete blood count (CBC) biomarker-based COVID-19 detection system using a stacking machine learning (SML) model, which could be a faster and less expensive alternative. Seven different publicly available datasets were used; the largest, consisting of fifteen CBC biomarkers collected from 1624 patients (52% COVID-19 positive) admitted to San Raphael Hospital, Italy from February to May 2020, was used to train and validate the proposed model. White blood cell count, monocytes (%), lymphocytes (%), and age, collected at hospital admission, were identified as important biomarkers for COVID-19 prediction by five different feature selection techniques. The stacking model produced the best performance, with weighted precision, sensitivity, specificity, overall accuracy, and F1-score of 91.44%, 91.44%, 91.44%, 91.45%, and 91.45%, respectively, improving on other state-of-the-art machine learning classifiers. Finally, a nomogram-based scoring system (QCovSML) was constructed from this stacking approach to predict COVID-19 patients, with a cut-off value of 4.8 for classifying COVID-19 versus Non-COVID patients. Six datasets from three different countries were used to externally validate the proposed model and evaluate its generalizability and robustness.
The nomogram demonstrated good calibration and discrimination, with an area under the curve (AUC) of 0.961 for the internal cohort and an average AUC of 0.967 across all external validation cohorts. The external validation shows an average weighted precision, sensitivity, F1-score, specificity, and overall accuracy of 92.02%, 95.59%, 93.73%, 90.54%, and 93.34%, respectively.
Keywords: COVID-19; Complete blood count (CBC); Detection; RT-PCR; Stacking machine learning
Year: 2022 PMID: 35180500 PMCID: PMC8839805 DOI: 10.1016/j.compbiomed.2022.105284
Source DB: PubMed Journal: Comput Biol Med ISSN: 0010-4825 Impact factor: 4.589
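The stacking design described in the abstract (base learners whose cross-validated probability outputs feed a logistic-regression meta-learner) can be sketched with scikit-learn. The hyperparameters and synthetic data below are illustrative assumptions, and sklearn's `GradientBoostingClassifier` stands in for XGBoost so the sketch needs no extra dependency:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 15 CBC biomarkers (the real data is not bundled here).
X, y = make_classification(n_samples=600, n_features=15, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# Base learners mirror the paper's M1-M3 roles; a logistic-regression
# meta-learner combines their cross-validated probability outputs.
stack = StackingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(),
    stack_method="predict_proba",
    cv=5,
)
stack.fit(X_tr, y_tr)
print(round(accuracy_score(y_te, stack.predict(X_te)), 3))
```

`stack_method="predict_proba"` makes the meta-learner see probabilities rather than hard labels, which is what the nomogram coefficients later in this record operate on.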
Fig. 1 Step-by-step overview of the methodology.
Dataset description for Model development, internal and external validation.
| Phase | Dataset | COVID-19 | Non-COVID | Total |
|---|---|---|---|---|
| Training | Italy (OSR) | 629 | 670 | 1299 |
| Augmented training | Italy (OSR) | 670 | 670 | 1340 |
| Internal validation | Italy (OSR) | 157 | 168 | 325 |
| External validation | Italy-1 | 163 | 174 | 337 |
| | Italy-2 | 104 | 145 | 249 |
| | Italy-3 | 118 | 106 | 224 |
| | Brazil-1 | 352 | 949 | 1301 |
| | Brazil-2 | 334 | 11 | 345 |
| | Ethiopia | 200 | – | 200 |
Five-fold cross-validation was used for performance evaluation. The number of samples per fold for training, augmented training, and internal validation is reported here.
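The five-fold split can be reproduced with a stratified splitter, which preserves the roughly 52% positive rate in every fold; the labels below are simulated only to match the cohort size reported above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 1624 simulated labels with ~52% positives, mimicking the OSR cohort.
rng = np.random.default_rng(0)
y = (rng.random(1624) < 0.52).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(len(y)), y), 1):
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}, "
          f"val positives={int(y[val_idx].sum())}")
```

Each validation fold ends up with 324-325 samples, consistent with the 325-sample internal-validation folds in the table.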
Statistical analysis of the COVID-19 and Non-COVID groups' characteristics using the internal training dataset.
| Features | Unit | Acronym | Missing rate (%) | COVID-19 (mean ± std) | Non-COVID (mean ± std) | Overall (mean ± std) | p-value |
|---|---|---|---|---|---|---|---|
| Age | years | Age | 0 | 61.85 ± 16.3 | 59.3 ± 22.24 | 60.54 ± 19.61 | <0.05 |
| White blood cells | 10⁹/L | WBC | 2.4 | 7.66 ± 3.88 | 9.72 ± 5.17 | 8.7 ± 4.69 | <0.05 |
| Red blood cells | 10¹²/L | RBC | 3.6 | 4.65 ± 0.68 | 4.44 ± 0.74 | 4.54 ± 0.72 | 0.112 |
| Hemoglobin | g/dL | HGB | 2.4 | 13.54 ± 1.89 | 12.86 ± 2.09 | 13.2 ± 2.02 | 0.545 |
| Hematocrit | % | HCT | 2.4 | 40.27 ± 5.3 | 38.55 ± 5.7 | 39.41 ± 5.56 | <0.05 |
| Mean corpuscular volume | fL | MCV | 3.6 | 86.9 ± 6.67 | 87.5 ± 7.41 | 87.2 ± 7.06 | 0.288 |
| Mean corpuscular hemoglobin | pg/cell | MCH | 3.6 | 29.23 ± 2.63 | 29.18 ± 2.8 | 29.2 ± 2.73 | 0.655 |
| Mean corpuscular hemoglobin concentration | g Hb/dL | MCHC | 2.4 | 33.62 ± 1.33 | 33.32 ± 1.36 | 33.47 ± 1.35 | 0.881 |
| Platelets | 10⁹/L | PLT1 | 3.6 | 222.73 ± 90.78 | 246.16 ± 97.5 | 234.5 ± 94.8 | <0.05 |
| Neutrophils count | 10⁹/L | NET | 18.9 | 5.88 ± 3.6 | 7.2 ± 5.4 | 6.47 ± 4.52 | <0.05 |
| Lymphocytes count | 10⁹/L | LYT | 15.2 | 1.15 ± 0.84 | 1.64 ± 1.04 | 1.37 ± 0.96 | <0.05 |
| Monocytes count | 10⁹/L | MOT | 15.2 | 0.54 ± 0.6 | 0.71 ± 0.39 | 0.61 ± 0.5 | <0.05 |
| Eosinophils count | 10⁹/L | EOT | 15.2 | 0.023 ± 0.08 | 0.11 ± 0.18 | 0.064 ± 0.14 | <0.05 |
| Basophils count | 10⁹/L | BAT | 15.2 | 0.005 ± 0.02 | 0.029 ± 0.052 | 0.016 ± 0.04 | 0.075 |
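The per-feature p-values in the table compare the COVID-19 and Non-COVID distributions. The test statistic used by the authors is not stated in this excerpt, so the Mann-Whitney U test below is an assumption, and the sampled values are synthetic stand-ins drawn to resemble the WBC group means above:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Hypothetical WBC counts (10^9/L) mimicking the reported group statistics.
wbc_covid = rng.normal(7.66, 3.88, 300).clip(min=0.5)
wbc_non = rng.normal(9.72, 5.17, 300).clip(min=0.5)

for name, x in [("COVID-19", wbc_covid), ("Non-COVID", wbc_non)]:
    print(f"{name}: {x.mean():.2f} ± {x.std(ddof=1):.2f}")
stat, p = mannwhitneyu(wbc_covid, wbc_non)
print(f"p-value: {p:.4g}")
```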
Fig. 2 The number of missing values for each feature in the OSR dataset (training data). Missing entries appear as blank bands, and the spark-line at the right summarizes the completeness of each record.
Fig. 3 Heatmap of correlation among different features: (A) using all features, and (B) after removing highly correlated features.
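Fig. 3(B) results from dropping one member of each highly correlated feature pair. A minimal sketch of that filtering step, using a hypothetical 0.9 threshold and synthetic data in which HCT is constructed to track HGB (as it does physiologically):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
hgb = rng.normal(13.2, 2.0, n)
df = pd.DataFrame({
    "HGB": hgb,
    "HCT": hgb * 3 + rng.normal(0, 0.5, n),  # hematocrit tracks hemoglobin
    "WBC": rng.normal(8.7, 4.7, n),
    "Age": rng.normal(60.5, 19.6, n),
})

# Keep only the upper triangle of the absolute correlation matrix, then
# drop one feature from each pair whose correlation exceeds 0.9.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print(to_drop)  # HCT is nearly collinear with HGB in this synthetic data
```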
Fig. 4 Pair plot of the distribution of the dataset.
Fig. 5 Proposed stacking model architecture.
Features ranked according to different feature selection algorithms.
| Feature | Pearson | Chi-2 | RFE | Logistics | Random Forest | Total |
|---|---|---|---|---|---|---|
| | | | | | | 5 |
| | | | | | | 5 |
| | | | | | | 5 |
| | | | | | | 5 |
| | | | | | | 4 |
| | | | | | | 4 |
| | | | | | | 4 |
| | | | | | | 3 |
| | | | | | | 3 |
| | | | | | | 3 |
| | | | | | | 3 |
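The table above aggregates votes from five selectors (Pearson, Chi-2, RFE, logistic regression, random forest): a feature's "Total" is how many selectors rank it in their top set. The sketch below illustrates the same majority-vote idea with three scikit-learn selectors on synthetic data; the ANOVA F-filter stands in for the Pearson/Chi-2 filters, and k = 4 is an arbitrary choice:

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)
names = [f"f{i}" for i in range(10)]
k = 4
votes = Counter()

# Selector 1: univariate ANOVA F-score filter.
top = np.argsort(SelectKBest(f_classif, k=k).fit(X, y).scores_)[-k:]
votes.update(names[i] for i in top)

# Selector 2: recursive feature elimination with logistic regression.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=k).fit(X, y)
votes.update(n for n, keep in zip(names, rfe.support_) if keep)

# Selector 3: random-forest impurity importance.
imp = RandomForestClassifier(random_state=0).fit(X, y).feature_importances_
votes.update(names[i] for i in np.argsort(imp)[-k:])

# Features ranked by total votes, as in the table above.
print(votes.most_common())
```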
Comparison of the average performance metrics (mean ± 95% CI) from five-fold cross-validation for different classifiers and the stacking classifier.
| Classifier | Overall Accuracy | Weighted Precision | Weighted Recall | Weighted F1-score | Weighted Specificity |
|---|---|---|---|---|---|
| Linear Discriminant Analysis (LDA) | 67.88 ± 2.27 | 67.69 ± 2.27 | 67.88 ± 2.27 | 67.88 ± 2.27 | 67.77 ± 2.27 |
| XGBoost (XGB) | 81.43 ± 1.89 | 81.37 ± 1.89 | 81.43 ± 1.89 | 81.43 ± 1.89 | 81.39 ± 1.89 |
| Random Forest (RF) | 82.91 ± 1.83 | 82.87 ± 1.83 | 82.91 ± 1.83 | 82.91 ± 1.83 | 82.74 ± 1.84 |
| Logistic Regression (LR) | 68.37 ± 2.26 | 68.63 ± 2.26 | 68.37 ± 2.26 | 68.37 ± 2.26 | 68.47 ± 2.26 |
| Support Vector Machine (SVM) | 62.28 ± 2.36 | 70.03 ± 2.23 | 62.28 ± 2.36 | 62.28 ± 2.36 | 61.53 ± 2.37 |
| AdaBoost | 74.66 ± 2.12 | 74.45 ± 2.12 | 74.66 ± 2.12 | 74.66 ± 2.12 | 74.22 ± 2.13 |
| K-Nearest Neighbors (KNN) | 79.17 ± 1.97 | 79.11 ± 1.98 | 79.17 ± 1.97 | 79.17 ± 1.97 | 79.13 ± 1.98 |
| Gradient Boosting (GB) | 89.88 ± 1.47 | 89.86 ± 1.47 | 89.88 ± 1.47 | 89.88 ± 1.47 | 89.87 ± 1.47 |
Fig. 6 ROC curves for different ML classifiers and the stacking ML model.
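The ROC curves in Fig. 6, and the AUCs of 0.961 (internal) and 0.967 (external average) quoted in the abstract, are computed from predicted class probabilities. A minimal sketch with scikit-learn, using made-up labels and probabilities:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical ground-truth labels and predicted COVID-19 probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

# roc_curve sweeps the decision threshold; roc_auc_score integrates it.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc = roc_auc_score(y_true, y_prob)
print(f"AUC = {auc:.4f}")
```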
The logistic regression analysis used to construct the stacking-ML-based nomogram.
| Outcome | Coef. | Bootstrap Std. Err. | Z | P>|z| | [95% conf. Interval] | |
|---|---|---|---|---|---|---|
| Gradient Boosting (M1) | 6.685314 | 0.7587142 | 8.81 | 0.000 | 5.198262 | 8.172367 |
| Random Forest (M2) | 1.3158 | 0.4752868 | 2.77 | 0.006 | 0.3842555 | 2.247345 |
| XGBoost (M3) | 0.6338573 | 0.4685463 | 1.35 | 0.176 | −0.2844766 | 1.552191 |
| _cons | −3.516128 | 0.205643 | −17.10 | 0.000 | −3.919181 | −3.113076 |
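Given the fitted coefficients above, the meta-model's output for a patient is a logistic function of the three base-model probabilities M1-M3; the nomogram rescales this linear predictor onto a point scale, on which 4.8 is the reported cut-off. The function below evaluates the underlying logistic model directly; the example inputs are hypothetical:

```python
import math

# Coefficients taken from the logistic-regression table above.
COEF_M1 = 6.685314   # Gradient Boosting
COEF_M2 = 1.3158     # Random Forest
COEF_M3 = 0.6338573  # XGBoost
INTERCEPT = -3.516128

def covid_probability(m1, m2, m3):
    """Combine the three base-model probabilities via the fitted meta-model."""
    z = INTERCEPT + COEF_M1 * m1 + COEF_M2 * m2 + COEF_M3 * m3
    return 1.0 / (1.0 + math.exp(-z))

# A patient whose base learners all output high probabilities...
print(round(covid_probability(0.9, 0.9, 0.9), 3))
# ...versus one whose base learners all output low probabilities.
print(round(covid_probability(0.1, 0.1, 0.1), 3))
```

Note how the Gradient Boosting term dominates, consistent with its much larger coefficient and z-statistic in the table.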
Fig. 7 Multivariate logistic regression-based nomogram to detect COVID-19 patients. The nomogram for prediction of COVID-19 was created using Gradient Boosting (M1), Random Forest (M2), and XGBoost (M3).
Fig. 8 Nomogram scores corresponding to the classification probabilities of COVID-19 and Non-COVID subjects.
Comparison of the average and weighted performance metrics (mean ± 95% CI) using the logistic regression-based nomogram for the internal and external validation datasets.
| Validation | Dataset | Overall Accuracy | Weighted Precision | Weighted Recall | Weighted F1-score | Weighted Specificity |
|---|---|---|---|---|---|---|
| Internal validation | Italy (OSR) | 91 ± 1.41 | 89.5 ± 1.49 | 91.12 ± 1.38 | 90.3 ± 1.44 | 89.71 ± 1.54 |
| External validation | Italy-1 | 91.69 ± 1.93 | 88.02 ± 2.27 | 97.13 ± 1.17 | 92.35 ± 1.86 | 85.89 ± 2.43 |
| | Italy-2 | 95.18 ± 1.98 | 94.63 ± 2.08 | 97.24 ± 1.51 | 95.92 ± 1.83 | 92.31 ± 2.46 |
| | Italy-3 | 92.86 ± 1.97 | 88.79 ± 2.42 | 97.17 ± 1.27 | 92.79 ± 1.98 | 88.98 ± 2.4 |
| | Brazil-1 | 92.7 ± 2.75 | 94.25 ± 2.46 | 95.26 ± 2.24 | 93.25 ± 2.65 | 87.27 ± 3.52 |
| | Brazil-2 | 95.11 ± 1.22 | 94.42 ± 2.06 | 94.25 ± 2.12 | 94.33 ± 2.14 | 95.67 ± 2.11 |
| | Ethiopia | 92.5 ± 3.65 | – | 92.5 ± 3.65 | – | – |
| Average | | 93.34 ± 2.19 | 92.02 ± 2.26 | 95.59 ± 2.18 | 93.73 ± 2.33 | 90.54 ± 2.47 |
The Ethiopia dataset contains only COVID-19 patients, so precision, F1-score, and specificity cannot be computed for it.
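The weighted metrics reported throughout can be computed with scikit-learn; since sklearn has no direct specificity function, it is obtained as the recall of the negative (Non-COVID) class. The labels below are made up for illustration:

```python
from sklearn.metrics import precision_recall_fscore_support, recall_score

# Hypothetical ground truth (1 = COVID-19) and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Weighted averaging weights each class's metric by its support.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
# Specificity = recall of the negative (Non-COVID) class.
spec = recall_score(y_true, y_pred, pos_label=0)
print(f"precision={prec:.3f} recall={rec:.3f} f1={f1:.3f} spec={spec:.3f}")
```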
Fig. 9 Comparison of predicted and actual probability of COVID-19 patients: (A) internal validation, (B) external validation.
Fig. 10 Comparison of decision curve analysis of different models. The net benefit balances the probability scores for COVID-19 patients.
Fig. 11 Illustration of a generic framework for the COVID-19 detection tool.