Marcos Antonio Alves, Giulia Zanon Castro, Bruno Alberto Soares Oliveira, Leonardo Augusto Ferreira, Jaime Arturo Ramírez, Rodrigo Silva, Frederico Gadelha Guimarães.
Abstract
The sudden outbreak of coronavirus disease 2019 (COVID-19) revealed the need for fast and reliable automatic tools to help health teams. This paper presents understandable solutions based on Machine Learning (ML) techniques for COVID-19 screening from routine blood tests. We tested different ML classifiers on a public dataset from the Hospital Albert Einstein, São Paulo, Brazil. After cleaning and pre-processing, the dataset contains 608 patients, of whom 84 are positive for COVID-19 as confirmed by RT-PCR. To understand the model decisions, we introduce (i) a local Decision Tree Explainer (DTX) for local explanation and (ii) a Criteria Graph to aggregate these explanations and portray a global picture of the results. The Random Forest (RF) classifier achieved the best results (accuracy 0.88, F1-score 0.76, sensitivity 0.66, specificity 0.91, and AUROC 0.86). Applying DTX and the Criteria Graph to cases confirmed by the RF revealed patterns among the individuals that can help clinicians understand the interconnections among the blood parameters, either globally or on a case-by-case basis. The results are in accordance with the literature, and the proposed methodology may be embedded in an electronic health record system.
Keywords: COVID-19; Criteria graph; Decision tree; Explainable artificial intelligence; Machine learning
Year: 2021 PMID: 33812263 PMCID: PMC7962588 DOI: 10.1016/j.compbiomed.2021.104335
Source DB: PubMed Journal: Comput Biol Med ISSN: 0010-4825 Impact factor: 6.698
Papers that applied ML models to the prediction of COVID-19: datasets and models used (the best model reported is the bold one), features analyzed, interpretability (Inter.), and metric results in each paper. The methods are BN: Bayesian Networks, CRT: Classification and Regression Tree, DNN: Deep Neural Networks, DT: Decision Trees, ET: Extremely Randomized Trees, GBT: Gradient Boosting Trees, KNN: k-Nearest Neighbors, LR: Logistic Regression, MLP: Multilayer Perceptron, MLR: Multivariate Logistic Regression, NB: Naive Bayes, NN: Neural Networks, RF: Random Forest, SVM: Support Vector Machine, TWRF: Three-Way RF, XGBoost: Extreme Gradient Boosting Machine.
| Ref | Description | Dataset | Methods | Features | Inter. | Metric results |
|---|---|---|---|---|---|---|
| [ | Predict the risk of positive cases using as predictors only results from emergency care admission exams | 235 patients from Hospital Israelita Albert Einstein in São Paulo, Brazil. | NN, RF, GBTrees, LR, | 15 blood parameters | No | AUC 0.85, SE 0.68, SP 0.85, PPV 0.74, NPV 0.77 |
| [ | ML-based diagnosis model and a COVID-19 diagnosis aid application | 620 patients from West China Hospital | | Age, gender, and 35 other indicators | No | AUC 0.87, PPV 0.86, NPV 0.85 |
| [ | ML models using hematochemical values from routine blood exams | 279 patients from San Raffaele Hospital in Milan, Italy | DT, ET, KNN, LR, NB, | Several | DT | For RF: ACC 0.82, AUC 0.84, SE 0.92, SP 0.65, PPV 0.83. For TWRF: ACC 0.86, SE 0.95, SP 0.75, PPV 0.86 |
| [ | Smart Blood Analytics (SBA) predictive model on patients with various bacterial and viral infections, and COVID-19 patients | 5333 patients from Department of Infectious Diseases, University Medical Center Ljubljana, Slovenia. | RF, DNN, | 35 blood parameters | No | AUC 0.97, SE 0.82, SP 0.98 |
| [ | RF model and an online assistant tool. | 253 samples from 169 suspected patients collected from multiple sources. | | 49 clinically available blood test values | No | ACC 0.96, AUC 0.96, SE 0.95, SP 0.97, MCC 0.96, Related AUC 1.00 |
| [ | Heg.IA: An intelligent system to support the diagnosis of Covid-19 based on blood tests | 5644 patients provided by Hospital Israelita Albert Einstein (São Paulo, Brazil). 559 had positive diagnosis. | MLP, SVM, RT, RF, | 24 blood tests | No | ACC 0.95, PR 0.94, SE 0.97, SP 0.94, Kappa index 0.90 |
| [ | Predict the mortality risk and explain the model. | 2779 validated or suspected COVID-19 patients from Tongji Hospital in Wuhan, China. | | Several | Single Tree XGB | F1 0.93, PR 0.95, SE 0.92 |
| [ | Detect the COVID-19 severely ill patients from those with only mild symptoms. | 137 clinically confirmed cases from the Tongji Hospital Affiliated to Huazhong University of Science and Technology. | LR, | 100 features (8 clinical, 76 blood, and 16 urine) | No | ACC 0.79, SE 0.76, SP 0.70 |
| [ | Predict mortality risk | 70 survivors from SMS Medical College, Jaipur (Rajasthan, India). | | Several | No | ACC 0.70, AUC 0.95, SE 0.90, SP 0.89 |
| [ | Identify patients at risk for deterioration during their hospital stay | 6995 patients evaluated at Sheba Medical Center, Israel | RF, NN, CRT | Several | No | ACC 0.79, AUC 0.79, SE 0.68, SP 0.81. All of them with Apache II |
| [ | Prediction of the diagnosis based on blood count results and age | 1157 patients made available by the repository COVID-19 Data Sharing/BR | | Several | No | ACC 0.80, F1 0.70, AUC 0.81, SE 0.76, PPV 0.65, NPV 0.88 |
Fig. 1 On the left, a noise set η generated by DTX surrounds the instance to be explained, x. The decision boundary is based on the DTX output. On the right, a tree structure represents the rules that explain the black-box prediction.
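The idea in Fig. 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' DTX implementation: `black_box` is a hypothetical stand-in for a trained classifier, the noise set is drawn from an isotropic Gaussian (the source does not say how η is sampled), the feature values are made-up normalized numbers, and the shallow Gini-based tree is a generic CART-style fit.

```python
import random

def black_box(x):
    """Hypothetical stand-in for a trained classifier (assumption)."""
    return int(x[0] < 0.3 and x[1] > 0.5)

def make_noise_set(x, n=300, scale=0.3, seed=0):
    """Sample n Gaussian perturbations around x (sampling scheme assumed)."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, scale) for v in x] for _ in range(n)]

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(X, y):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t)
    return None if best is None else best[1:]

def fit_tree(X, y, depth=0, max_depth=2):
    """Fit a shallow tree to the black-box labels of the noise set."""
    stop = depth == max_depth or len(set(y)) <= 1
    split = None if stop else best_split(X, y)
    if split is None:
        return {"leaf": int(2 * sum(y) >= len(y))}  # majority label
    j, t = split
    left = [(r, yi) for r, yi in zip(X, y) if r[j] <= t]
    right = [(r, yi) for r, yi in zip(X, y) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([r for r, _ in left], [yi for _, yi in left], depth + 1, max_depth),
        "right": fit_tree([r for r, _ in right], [yi for _, yi in right], depth + 1, max_depth),
    }

def explain(tree, x, names):
    """Follow x down the tree and collect the rules on its path."""
    rules, node = [], tree
    while "leaf" not in node:
        j, t = node["feature"], node["threshold"]
        if x[j] <= t:
            rules.append(f"{names[j]} <= {t:.2f}")
            node = node["left"]
        else:
            rules.append(f"{names[j]} > {t:.2f}")
            node = node["right"]
    return rules, node["leaf"]

names = ["EOS", "CRP"]   # illustrative feature names only
x = [0.1, 0.8]           # instance to explain (made-up values)
noise = make_noise_set(x)
labels = [black_box(z) for z in noise]   # the tree learns the model, not the ground truth
tree = fit_tree(noise, labels)
rules, predicted = explain(tree, x, names)
print(" AND ".join(rules), "=>", predicted)
```

The surrogate is trained on the black-box outputs for the noise set, so the rules describe the model's local behaviour around x and not the underlying clinical truth.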
Fig. 2 Criteria graph.
Fig. 3 Diagram of the proposed method of generating ensemble classifiers with local explainability.
Description of the features used, abbreviation (Abb.) often used/adopted, reference values for male and female, missing rates (Miss. %) and some related references that reported the feature's relationship with COVID-19.
| Abb. | Feature | Description | Reference value (female) | Reference value (male) | Miss. % | Ref |
|---|---|---|---|---|---|---|
| HCT | Hematocrit | The amount of whole blood that is made up of red blood cells | 36–46% | 41–53% | 0.82 | [ |
| HGB | Hemoglobin | It is the oxygen-carrying component of red blood cells | 12–16 g/dL | 13.5–17.5 g/dL | 0.82 | [ |
| PLT | Platelets | A tiny, disc-shaped piece of cell that helps form blood clots to slow or stop bleeding and to help wounds heal | 150–400 | 150–400 | 0.98 | [ |
| RBC | Red blood cells | The blood cell that carries oxygen | 3.5–5.5 | 4.3–5.9 | 0.98 | [ |
| LYM | Lymphocytes | A type of white blood cells | 0.5–4.0 | 0.5–4.0 | 0.98 | [ |
| MCH | Mean corpuscular hemoglobin | It corresponds to the average hemoglobin weight in a population of erythrocytes | 25.4–34.6 pg/cell | 25.4–34.6 pg/cell | 0.98 | [ |
| MCHC | MCH concentration | Mean of the internal hemoglobin concentration in a population of erythrocytes | 31–36% Hb/cell | 31–36% Hb/cell | 0.98 | [ |
| WBC | Leukocytes | White Blood Cells that help the body fight infections and other diseases. | 4500–11000 | 4500–11000 | 0.98 | [ |
| BAY | Basophils | Type of white blood cell (leukocyte) with coarse, bluish-black granules of uniform size within the cytoplasm | 0.0–0.1 | 0.0–0.1 | 0.98 | [ |
| EOS | Eosinophils | Normal type of white blood cell that has coarse granules within its cytoplasm | 0.1–0.5 | 0.1–0.5 | 0.98 | [ |
| LDH | Lactate dehydrogenase | Enzyme of the anaerobic metabolic pathway, that catalyzes the conversion of lactate to pyruvate, important in energy production | 140–280 U/L | 140–280 U/L | 0.98 | [ |
| MCV | Mean corpuscular volume | Average volume of an erythrocyte population | 80–100 | 80–100 | 0.98 | [ |
| RWD | Red blood cell distribution width | A measurement of the range in the volume and size of red blood cells | <15% | <15% | 0.98 | [ |
| MONO | Monocytes | A type of immune cell that has a single nucleus and fights off bacteria, viruses and fungi | 0.3–0.8 | 0.3–0.8 | 1.15 | [ |
| MPV | Mean platelet volume | Average size of platelets | 7.2–11.7 fL | 7.2–11.7 fL | 1.48 | [ |
| NEU | Neutrophils | A type of immune cell that is one of the first cell types to travel to the site of an infection and help by ingesting microorganisms and releasing enzymes that kill them | 1.8–7.7 | 1.8–7.7 | 15.62 | [ |
| CRP | C-reactive protein | Plasma protein produced by the liver and induced by various inflammatory mediators such as interleukin-6 | <10 mg/L | <10 mg/L | 16.77 | [ |
| CREAT | Creatinine | A chemical waste molecule generated from muscle metabolism. | 44–97 μmol/L | 53–106 μmol/L | 30.26 | [ |
| UREA | Urea | A nitrogen-containing substance normally cleared from the blood by the kidney into the urine. | 2.5–7.1 mmol/L | 2.5–7.1 mmol/L | 34.70 | [ |
| K+ | Potassium | A metallic element that is important in body functions such as regulation of blood pressure | 3.5–5.5 mEq/L | 3.5–5.5 mEq/L | 38.98 | [ |
| Na | Sodium | A mineral needed by the body to keep body fluids in balance | 135–145 mmol/L | 135–145 mmol/L | 39.14 | [ |
| AST | Aspartate transaminase | An enzyme found in the liver, heart, and other tissues. A high level of AST released into the blood may be a sign of liver or heart damage, cancer, or other diseases | 0–35 U/L | 0–35 U/L | 62.82 | [ |
| ALT | Alanine transaminase | An enzyme that is normally present in liver and heart cells and it is released into blood when the liver or heart is damaged | <41.0 U/L | <31.0 U/L | 62.99 | [ |
Fig. 4 The nested cross-validation method.
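For readers unfamiliar with nested cross-validation, the following sketch shows how the index sets nest: an outer loop holds out a test fold, and an inner loop splits only the remaining training indices for model selection. The fold counts (5 outer, 3 inner) and the helper names `k_folds` and `nested_cv` are assumptions for illustration; the source does not state the configuration used, and a real pipeline would typically also shuffle and stratify.

```python
def k_folds(indices, k):
    """Split a list of indices into k disjoint, nearly equal folds."""
    return [indices[i::k] for i in range(k)]

def nested_cv(n_samples, outer_k=5, inner_k=3):
    """Yield (train, test, inner_splits) for each outer fold."""
    idx = list(range(n_samples))
    for test in k_folds(idx, outer_k):
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        inner = []
        for val in k_folds(train, inner_k):
            val_set = set(val)
            fit = [i for i in train if i not in val_set]
            inner.append((fit, val))  # tune hyperparameters on these pairs
        yield train, test, inner      # then refit on train, score on test

splits = list(nested_cv(20))
for train, test, inner in splits:
    print(len(train), len(test), [len(val) for _, val in inner])
```

The outer test fold never appears in any inner split, which is what keeps the reported score independent of hyperparameter selection.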
Fig. 5 Example of a synthetic sample generated by SMOTE.
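As a rough illustration of how SMOTE creates a synthetic minority sample, the sketch below interpolates between a minority instance and one of its nearest minority neighbours. It is a simplified version of the general technique, not the library code used in the paper; the function name `smote_sample`, the value of `k`, and the two-feature data points are hypothetical.

```python
import random

def smote_sample(x, minority, k=3, rng=None):
    """Create one synthetic point between x and a random near neighbour."""
    rng = rng or random.Random(0)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # k nearest minority-class neighbours of x, excluding x itself
    others = [m for m in minority if m is not x]
    neighbours = sorted(others, key=lambda m: dist2(x, m))[:k]
    neighbour = rng.choice(neighbours)
    u = rng.random()  # interpolation factor in [0, 1)
    return [xi + u * (ni - xi) for xi, ni in zip(x, neighbour)]

# Made-up minority-class points with two features
minority = [[1.0, 2.0], [1.2, 2.1], [0.9, 1.8], [3.0, 4.0], [1.1, 2.4]]
x = minority[0]
synthetic = smote_sample(x, minority)
print(synthetic)
```

Because the new point lies on the segment joining two real minority samples, oversampling stays inside the region already occupied by the minority class.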
Results of the classification of COVID-19.
| Model/Score | Accuracy | F1–score | Sensitivity | Specificity | AUROC |
|---|---|---|---|---|---|
| LR | 0.82 | 0.71 | 0.74 | 0.84 | 0.85 |
| RF | 0.88 | 0.76 | 0.66 | 0.91 | 0.86 |
| XGBoost | 0.87 | 0.73 | 0.60 | 0.91 | 0.85 |
| SVM | 0.84 | 0.70 | 0.56 | 0.89 | 0.85 |
| MLP | 0.85 | 0.68 | 0.42 | 0.92 | 0.81 |
| Ensemble | | | 0.67 | 0.91 | |
Normalized confusion matrices for the ML methods tested. For each actual class, the sum of the corresponding row is 1.00
| Model | Actual class | Predicted negative | Predicted positive |
|---|---|---|---|
| (a) LR | Negative | 0.84 | 0.16 |
| (a) LR | Positive | 0.26 | 0.74 |
| (b) RF | Negative | 0.91 | 0.09 |
| (b) RF | Positive | 0.34 | 0.66 |
| (c) XGBoost | Negative | 0.91 | 0.09 |
| (c) XGBoost | Positive | 0.40 | 0.60 |
| (d) SVM | Negative | 0.89 | 0.11 |
| (d) SVM | Positive | 0.44 | 0.56 |
| (e) MLP | Negative | 0.92 | 0.08 |
| (e) MLP | Positive | 0.58 | 0.42 |
| (f) Ensemble | Negative | 0.90 | 0.10 |
| (f) Ensemble | Positive | 0.35 | 0.65 |
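The normalized confusion matrices above give specificity (negative row, negative column) and sensitivity (positive row, positive column) directly. Accuracy additionally depends on class prevalence: accuracy = specificity × (1 − p) + sensitivity × p. The sketch below checks the reported accuracies against this identity, assuming the evaluation data share the overall prevalence p = 84/608 from the abstract (an assumption; the per-fold class balance is not given here). The reported accuracies are taken from the results table, and the RF value from the abstract.

```python
# (specificity, sensitivity) read from the normalized confusion matrices
rates = {
    "LR": (0.84, 0.74),
    "RF": (0.91, 0.66),
    "XGBoost": (0.91, 0.60),
    "SVM": (0.89, 0.56),
    "MLP": (0.92, 0.42),
}
# Accuracies as reported (RF from the abstract, the rest from the results table)
reported = {"LR": 0.82, "RF": 0.88, "XGBoost": 0.87, "SVM": 0.84, "MLP": 0.85}

prevalence = 84 / 608  # assumed equal to the overall share of positive patients

def accuracy(specificity, sensitivity, p):
    """Accuracy as the prevalence-weighted mean of the two per-class rates."""
    return specificity * (1 - p) + sensitivity * p

estimated = {m: accuracy(spec, sens, prevalence) for m, (spec, sens) in rates.items()}
for model, value in estimated.items():
    print(f"{model}: estimated {value:.3f}, reported {reported[model]:.2f}")
```

Under this assumption the estimates agree with the reported accuracies to within rounding, which also shows why a model with low sensitivity can still post a high accuracy on a dataset with few positive cases.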
Fig. 6 AUROC for each algorithm.
Fig. 7 Explanations provided by SHAP and LIME.
Fig. 8 Kernel density estimation of WBC and PLT.
Fig. 9 Marginal effect of blood features on the target variable.
Explanations for the COVID-19 inference of the 12 COVID-19 positive patients in the test set.
| ID | Decision Tree Explanation |
|---|---|
| 1 | EOS |
| 2 | CRP |
| 3 | EOS |
| 4 | AST |
| 5 | EOS |
| 6 | CRP |
| 7 | CRP |
| 8 | HGB |
| 9 | EOS |
| 10 | EOS |
| 11 | EOS |
| 12 | PLT |
Fig. 10 Criteria Graph for the decision tree explanations. Only factors and interactions that appeared in more than one third of the patients are depicted.
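One way to picture the aggregation and the one-third filter is the sketch below. It is an assumed reading, not the authors' construction: each patient's explanation is reduced to the set of features in its rule, nodes are features, and an edge counts how many patients have both features in the same explanation. The patient explanations are invented for illustration and do not reproduce the table above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-patient explanations: features appearing in each rule
explanations = {
    1: ["EOS", "CRP"],
    2: ["CRP", "LYM"],
    3: ["EOS", "CRP", "PLT"],
    4: ["EOS", "LYM"],
    5: ["EOS", "CRP"],
    6: ["AST"],
}

n_patients = len(explanations)
node_counts = Counter()
edge_counts = Counter()
for features in explanations.values():
    unique = sorted(set(features))
    node_counts.update(unique)                   # one vote per patient per feature
    edge_counts.update(combinations(unique, 2))  # one vote per co-occurring pair

# Keep only what appears in more than one third of the patients
nodes = {f: c for f, c in node_counts.items() if 3 * c > n_patients}
edges = {pair: c for pair, c in edge_counts.items() if 3 * c > n_patients}

print(nodes)
print(edges)
```

With these made-up inputs only EOS, CRP, and their interaction survive the filter, which mirrors how the figure suppresses rarely used criteria to keep the global picture readable.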