Krishnaraj Chadaga, Chinmay Chakraborty, Srikanth Prabhu, Shashikiran Umakanth, Vivekananda Bhat, Niranjana Sampathila.
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, has had a significant impact on both the economy and health infrastructure worldwide. The infection is conventionally diagnosed with the RT-PCR (Reverse Transcription Polymerase Chain Reaction) test. This approach, however, can produce false-negative and otherwise erroneous results. According to recent studies, COVID-19 can also be diagnosed using X-rays, CT scans, blood tests and cough sounds. In this article, we use blood tests and machine learning to predict the diagnosis of this deadly virus. We also present an extensive review of existing machine-learning applications that diagnose COVID-19 from clinical and laboratory markers. Four different classifiers, along with the Synthetic Minority Oversampling Technique (SMOTE), were used for classification. The Shapley Additive Explanations (SHAP) method was used to calculate the importance of each feature, and eosinophils, monocytes, leukocytes and platelets were found to be the most critical blood parameters for distinguishing COVID-19 infection in our dataset. These classifiers can be used in conjunction with RT-PCR tests to improve sensitivity, and in emergency situations such as a pandemic outbreak caused by new strains of the virus. The positive results indicate the prospective use of an automated framework that could help clinicians and medical personnel diagnose and screen patients.
Keywords: Artificial Intelligence; Blood tests; COVID-19; Machine Learning; RT-PCR
Year: 2022 PMID: 35133633 PMCID: PMC8846962 DOI: 10.1007/s12539-021-00499-4
Source DB: PubMed Journal: Interdiscip Sci ISSN: 1867-1462 Impact factor: 2.233
Fig. 1 Integral learning steps required for the development of ML classifiers
List of ML models that diagnose COVID-19
| References | Source | Size | Total attributes | Models used | Accuracy of best model | Sensitivity of best model | Specificity of best model | AUC of best model |
|---|---|---|---|---|---|---|---|---|
| [ | Hospital Israelita Albert Einstein, Brazil | 5644, 559 COVID-19 | 24 attributes | MLP (Multi-layer perceptron), SVM, DT, NB | 95% | 96% | 93% | |
| [ | Three Open access datasets | – | Many features | Machine learning and Deep Learning models | 92% | 82% | 92% | |
| [ | 18 hospitals from Zhejiang, China | 914 patients | 10 features | LR, SVM, DT, RF, RL | 95% | 87% | 97% | |
| [ | Tongji Hospital, China | 413 patients | 21 categorical, 21 continuous | XGBoost | 92.5% | 97.5% | | |
| [ | West China Hospital, China | 620 samples | 9 features | Multivariate logistic regression | – | – | – | – |
| [ | 11 regions in China | 659 patients | Many biochemical and clinical features | Decision trees | 89% | – | – | 88% |
| [ | SMART hospitals | – | – | NB, RF, SVM | 93.33% | – | – | – |
| [ | Hospital Israelita Albert Einstein, Brazil | 5644, 559 COVID-19 patients | Many blood parameters | ERLX, an ensemble learning model | 99.60% | 98.72% | 98.99% | 99.38% |
| [ | UK Biobank | 4510 patients | – | Linear discriminant analysis | – | – | – | 97% |
| [ | Hospital Israelita Albert Einstein, Brazil | 5644 patients, 598 COVID-19 patients | Many blood parameters | RF, Shallow learning, flexible ANN | – | – | – | 95% |
| [ | Hospital Israelita Albert Einstein, Brazil | 5644 patients, 598 COVID-19 patients | Many blood parameters | Er-CoV | – | 70% | 85% | 86% |
| [ | Kepler University Hospital | 1357 patients | 28 unique features | Random forest | 86% | – | – | 74% |
| [ | Three Brazilian hospitals | 815 (442 COVID-19) | 19 features | AdaBoost, gradient boosting, random forest, extreme gradient boosting, SVM, partial least squares | – | 96% | 93% | – |
| [ | – | 1521 patients | 130 clinical features | HUST-19 (CNN based framework) | 94% | – | – | – |
| [ | Oxford University hospitals | 114,957 negative, 437 COVID-19 | – | Various ML classifiers | 77% | 95% | 93% | |
| [ | Five hospitals in New York | 4098 COVID-19 patients | Many blood parameters | XGBoost | – | – | – | 89% |
| [ | – | 279 cases | 13 features | KNN, DT, RF, SVM, RF | – | – | – | 91% |
Fig. 2 Null values present in attributes
Feature description of the final selected parameters
| Sl.no | Abbreviation | Feature | Description | References |
|---|---|---|---|---|
| 1 | AGE | Patient age quantile | Specifies the age of the individual | – |
| 2 | MPV | Mean platelet volume | Mean size of the platelets present in the blood. It is known to increase in the presence of COVID-19 | [ |
| 3 | RBC | Red blood cells | The bone marrow produces fresh red blood cells. Red blood cells carry oxygen and remove carbon dioxide from the body | [ |
| 4 | LYM | Lymphocytes | These are part of the person's immune system and are created by the lymph nodes and bone marrow. They tend to decrease in patients with severe COVID-19 | [ |
| 5 | MCHC | Mean Corpuscular haemoglobin concentration | Average quantity of haemoglobin present in each of the red blood cells | [ |
| 6 | WBC | Leukocytes | They are also called white blood cells. They defend the body against various infections and threats. The count has increased in COVID-19 patients according to numerous studies | [ |
| 7 | BAY | Basophils | They are a part of white blood cells | [ |
| 8 | EOS | Eosinophils | They help in promoting inflammation that controls the infection. Eosinophil count is reduced for COVID-19 patients | [ |
| 9 | MCV | Mean Corpuscular volume | Average volume of red blood cells. They increase or decrease depending on the average red cell size | [ |
| 10 | MON | Monocytes | They are white blood cells that focus on healing and repair | [ |
| 11 | PLT | Platelets | They form clots and prevent bleeding. COVID-19 patients often have mild thrombocytopenia | [ |
| 12 | RBCDW | Red blood cell distribution width | The range of volume and size of red blood cells | [ |
| 13 | - | Has_disease | A variable that has been created by combining all the other disease columns for this research. It specifies whether the patient suffers from other viral diseases | – |
Fig. 3 Pearson correlation matrix
Correlation coefficient and r value of the final blood parameters
| Dependent features | Result label | Correlation coefficient (r) | Relationship |
|---|---|---|---|
| Age | RT-PCR test | 0.15 | Weak positive correlation |
| MPV | RT-PCR test | 0.11 | Weak positive correlation |
| RBC | RT-PCR test | 0.12 | Weak positive correlation |
| LYM | RT-PCR test | −0.015 | Very weak negative correlation |
| MCHC | RT-PCR test | 0.046 | Very weak positive correlation |
| WBC | RT-PCR test | −0.29 | Weak negative correlation |
| BAY | RT-PCR test | −0.063 | Very weak negative correlation |
| EOS | RT-PCR test | −0.19 | Weak negative correlation |
| MCV | RT-PCR test | −0.055 | Very weak negative correlation |
| MON | RT-PCR test | 0.2 | Weak positive correlation |
| PLT | RT-PCR test | −0.28 | Weak negative correlation |
| RBCDW | RT-PCR test | −0.04 | Very weak negative correlation |
| Has_Disease | RT-PCR test | −0.25 | Weak negative correlation |
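A minimal sketch of how a Pearson coefficient between one blood parameter and the binary RT-PCR label could be computed and labelled. The function names are hypothetical, and the strength cut-offs (0.1, 0.3, 0.5) are assumptions chosen only so that the labels agree with the table above; the source does not state which thresholds were used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def describe(r):
    """Map r to a verbal label. The cut-offs are assumptions, not from the source."""
    size = abs(r)
    if size < 0.1:
        strength = "Very weak"
    elif size < 0.3:
        strength = "Weak"
    elif size < 0.5:
        strength = "Moderate"
    else:
        strength = "Strong"
    sign = "positive" if r >= 0 else "negative"
    return f"{strength} {sign} correlation"


# Labels for two coefficients taken from the table (WBC and MCHC).
print(describe(-0.29))
print(describe(0.046))
```

With a binary label such as the RT-PCR result, this quantity is the point-biserial correlation, which is numerically the same as Pearson's r.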
Fig. 4 An example of synthetic sampling by SMOTE [34]
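The core idea of SMOTE can be sketched as follows: pick a minority-class sample, pick one of its nearest minority-class neighbours, and create a synthetic point at a random position on the line segment between them. This is an illustrative stdlib-only sketch, not the implementation used in the study; the function name `smote_sketch`, the default `k`, and the toy data are assumptions.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by neighbour interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Interpolate: new = base + gap * (neighbour - base), with gap in [0, 1).
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class with two hypothetical features.
minority_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote_sketch(minority_samples, n_new=5)
print(len(new_samples))
```

Oversampling of this kind is normally applied to the training split only, so that synthetic points do not leak into the test data.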
Fig. 5 Block diagram describing the proposed method for the classification of COVID-19 based on blood sample data
Classification results
| Model | Accuracy | Specificity | Sensitivity | F1-score | AUC | Brier score | Best parameters |
|---|---|---|---|---|---|---|---|
| Simple random forest | 0.60 | 0.71 | 0.33 | 0.66 | 0.69 | 0.23 | – |
| Random forest after feature selection | 0.89 | 0.97 | 0.35 | 0.84 | 0.90 | 0.115 | – |
| Random forest after hyperparameter tuning (RandomizedSearchCV) | 0.88 | 0.96 | 0.41 | 0.84 | 0.92 | 0.115 | {'n_estimators': 10, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 10, 'max_depth': 128} |
| Optimal random forest (After SMOTE) | 0.92 | 0.96 | 0.71 | 0.85 | 0.92 | 0.082 | {'n_estimators': 500, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_features': '8', 'max_depth': 32} |
| Optimal LR | 0.85 | 0.87 | 0.70 | 0.81 | 0.89 | 0.157 | {'penalty': 'l2', 'C': 100} |
| Optimal KNN | 0.75 | 0.77 | 0.59 | 0.73 | 0.68 | 0.25 | {'weights': 'distance', 'p': 1, 'n_neighbors': 2} |
| Optimal XGBoost | 0.88 | 0.93 | 0.65 | 0.83 | 0.88 | 0.123 | {'n_estimators': 100, 'max_depth': 8, 'gamma': 0, 'colsample_bytree': 0.8} |
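The columns of the table above can be related to predictions with a short sketch. It computes accuracy, sensitivity, specificity, F1-score and Brier score from true labels and predicted probabilities, using the standard definitions. The function name, the 0.5 threshold and the toy data are assumptions; the source does not say how its F1-score was averaged, so this sketch is not expected to reproduce the reported values.

```python
def classification_metrics(y_true, y_prob, threshold=0.5):
    """Binary metrics from labels (0/1) and predicted probabilities of class 1."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_prob):
        pred = 1 if p >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero when a class is absent.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # true positive rate (recall)
    specificity = ratio(tn, tn + fp)   # true negative rate
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    # Brier score: mean squared error of the predicted probabilities.
    brier = sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "brier": brier,
    }


# Toy example with hypothetical labels and probabilities.
print(classification_metrics([1, 1, 0, 0, 0], [0.9, 0.4, 0.2, 0.6, 0.1]))
```

A lower Brier score indicates better-calibrated probabilities, which is why it falls as the other metrics rise in the table.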
Normalized confusion matrices for test data set
| (a) Random forest | | Actual | |
|---|---|---|---|
| | | Negative | Positive |
| Predicted | Negative | 0.94 | 0.06 |
| | Positive | 0.41 | 0.59 |
For each actual class, the corresponding row sum is 1.0
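Row normalization of a confusion matrix can be sketched as below: each row of raw counts is divided by its own total, so every row sums to 1.0. The counts used here are hypothetical and were chosen only so that the output matches the proportions shown above; the actual counts are not given in this section.

```python
def normalize_rows(matrix):
    """Divide each row of a count matrix by its row total."""
    normalized = []
    for row in matrix:
        total = sum(row)
        if total == 0:
            raise ValueError("cannot normalize a row with no samples")
        normalized.append([value / total for value in row])
    return normalized


# Hypothetical counts (100 samples per row), not taken from the study.
counts = [[94, 6], [41, 59]]
for row in normalize_rows(counts):
    print([round(v, 2) for v in row])
```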
Fig. 6 AUROC curves of the various ML algorithms: (a) initial RF model; (b) RF model after pre-processing; (c) RF model after hyperparameter tuning; (d) model after SMOTE analysis; (e) optimized RF model; (f) logistic regression; (g) KNN; (h) XGBoost
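The area under the ROC curve summarized in these plots has a simple probabilistic reading: it is the chance that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. A minimal sketch of that pairwise definition follows; the function name and toy data are assumptions, and the quadratic loop is meant for illustration rather than large datasets.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```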
Fig. 7 Feature importance using SHAP
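SHAP attributions are based on Shapley values. For a model with only a few features they can be computed exactly by enumerating every coalition of features, which the sketch below does. It is a brute-force illustration against a single baseline point, not the SHAP library and not the procedure used in the study; the function name, the toy linear model and its weights are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to one baseline point."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take their value from x, the rest from baseline.
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_model(z):
    # Hypothetical linear score over three features; weights are made up.
    return 2.0 * z[0] - 3.0 * z[1] + 0.5 * z[2]


print(shapley_values(toy_model, x=[1.0, 2.0, 4.0], baseline=[0.0, 0.0, 0.0]))
```

The attributions sum to the difference between the prediction at `x` and the prediction at the baseline. The cost grows exponentially with the number of features, which is why practical tools rely on approximations or model-specific algorithms.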
Fig. 8 Feature importance using random forest
Fig. 9 Marginal effect of leukocytes, monocytes and platelets on COVID-19 outcome
Comparison between the related studies and the proposed work
| Reference | Accuracy of best model | Sensitivity of best model | Specificity of best model | AUC of best model | ML models used |
|---|---|---|---|---|---|
| [ | – | 83% | 82% | – | Statistical analysis only |
| [ | 85% | 91% | 43% | 80% | ANN, RF, glmnet |
| [ | – | 76% | 76% | 84% | Naïve Bayes |
| [ | – | 93% | 63% | 95% | Many models |
| [ | – | – | – | 85% | XGBoost |
| Proposed | 91% | 94% | 71% | 91% | RF, XGBoost, LR, KNN, SVM |