Olusola O Abayomi-Alli, Robertas Damaševičius, Rytis Maskeliūnas, Sanjay Misra.
Abstract
Current research endeavors in the application of artificial intelligence (AI) methods to the diagnosis of COVID-19 have proven indispensable, with very promising results. Despite these results, real-time detection of COVID-19 using reverse transcription polymerase chain reaction (RT-PCR) test data still faces limitations, such as limited datasets, imbalanced classes, a high misclassification rate of models, and the need for specialized research to identify the best features and thus improve prediction rates. This study investigates and applies the ensemble learning approach to develop prediction models for effective detection of COVID-19 using routine laboratory blood test results. To that end, an ensemble machine learning-based COVID-19 detection system is presented to help clinicians diagnose the disease effectively. The experiment was conducted using custom convolutional neural network (CNN) models as the first-stage classifier and 15 supervised machine learning algorithms as the second-stage classifier: K-Nearest Neighbors, Support Vector Machine (Linear and RBF), Naive Bayes, Decision Tree, Random Forest, MultiLayer Perceptron, AdaBoost, ExtraTrees, Logistic Regression, Linear and Quadratic Discriminant Analysis (LDA/QDA), Passive, Ridge, and Stochastic Gradient Descent Classifier. Our findings show that, on the San Raffaele Hospital dataset, an ensemble learning model based on DNN and ExtraTrees achieved a mean accuracy of 99.28% and an area under curve (AUC) of 99.4%, while AdaBoost gave a mean accuracy of 99.28% and an AUC of 98.8%. A comparison of the proposed COVID-19 detection approach with other state-of-the-art approaches on the same dataset shows that the proposed method outperforms several other COVID-19 diagnostic methods.
Keywords: COVID-19; blood tests; deep learning; diagnostic model; ensemble learning; small data
Year: 2022 PMID: 35336395 PMCID: PMC8955536 DOI: 10.3390/s22062224
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Summary of related work on COVID-19 identification from blood samples.
| Ref. | Methods | Feature Selection Methods | Metrics (Value) | Data Samples (COVID-19 Samples) |
|---|---|---|---|---|
| [ | Ensemble learning extra trees, random forest (RF), logistic regression (LR), extreme gradient boosting (ERLX) classifier | Manual | Accuracy: 99.88% | 5644 |
| [ | Categorical gradient boosting (CatBoost), support vector machine (SVM), and LR | Manual | AUC: 89.9–95.8% | 5148 |
| [ | Ensemble learning with RF, LR, XGBoost, Support Vector Machine (SVM), MLP | Decision Tree Explainer (DTX) | Accuracy | 608 |
| [ | Artificial Neural Network (ANN) predictive model | Pearson and Kendall correlation coefficient | AUC: 0.953 (0.889–0.982) | 151 |
| [ | ANN, RF, gradient boosting trees, LR and SVM | NA | AUC: 0.85; Sensitivity: 0.68; Specificity: 0.85; Brier Score: 0.16 | 235 |
| [ | RF classifier | Manual | Accuracy: 96.95% | 253 |
| [ | ANN, Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), CNN-LSTM, and CNN-RNN | CNN and LSTM | AUC: 0.90, Accuracy: 0.9230, F1-score: 0.93, Precision: 0.9235, Recall: 0.9368 | 600 |
| [ | SVM, LR, DT, RF and deep neural network (DNN) | Logistic regression (LR) | Accuracy: 91% | 921 |
| [ | ANN, CNN, RNN | SMOTE | Accuracy: 94.95% | 600 |
| [ | LR | Maximum relevance minimum redundancy (mRMR) algorithm | Sensitivity: 98% | 110 |
| [ | LR, DT, RF, gradient boosted decision tree | NA | Sensitivity: 75.8% | 3346 |
Figure 1. Visual summary of the proposed methodology.
Summary and description of the dataset.
| S/N | Features | Data Types | Number of Missing Values | Mean/Average |
|---|---|---|---|---|
| 1 | Gender | Nominal | 0 | - |
| 2 | Age | Numeric | 0 | 61.3 |
| 3 | WBC 1 | Numeric | 2 | 8.6 |
| 4 | Platelets | Numeric | 2 | 226.5 |
| 5 | CRP 2 | Numeric | 6 | 90.9 |
| 6 | AST 3 | Numeric | 2 | 54.2 |
| 7 | ALT 4 | Numeric | 13 | 44.9 |
| 8 | GGT 5 | Numeric | 143 | 82.5 |
| 9 | ALP 6 | Numeric | 148 | 89.9 |
| 10 | LDH 7 | Numeric | 85 | 380.5 |
| 11 | Neutrophils | Numeric | 70 | 6.2 |
| 12 | Lymphocytes | Numeric | 70 | 1.2 |
| 13 | Monocytes | Numeric | 70 | 0.6 |
| 14 | Eosinophils | Numeric | 70 | 0.05 |
| 15 | Basophils | Numeric | 71 | 0 |
| 16 | Swab | Nominal | 0 | - |
1 WBC = Leukocytes; 2 CRP = C-Reactive Protein; 3 AST = Aspartate Transaminase; 4 ALT = Alanine Transaminase; 5 GGT = γ-Glutamyl Transferase; 6 ALP = Alkaline Phosphatase; 7 LDH = Lactate Dehydrogenase.
Figure 2. Correlation matrix for the different features of the analyzed blood sample dataset.
Figure 3. Algorithm of ensemble learning in pseudocode.
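The pseudocode of Figure 3 is not reproduced in this record, so the following is only a minimal sketch of one plausible two-stage arrangement, not the authors' algorithm. It assumes that the first-stage network's class probabilities are appended to the original features before the second-stage classifier is trained. scikit-learn's `MLPClassifier` stands in for the custom CNN, the data are synthetic, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the blood-test table: 15 predictors, binary swab label.
X, y = make_classification(n_samples=300, n_features=15, n_informative=8,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)


def make_first_stage():
    """Assumption: a small MLP replaces the paper's custom neural network."""
    return make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0),
    )


# Out-of-fold probabilities for the training rows, so the second stage
# never sees first-stage outputs produced on data the network was fit on.
train_proba = cross_val_predict(make_first_stage(), X_train, y_train,
                                cv=5, method="predict_proba")

# Refit the first stage on all training data to score the held-out rows.
first_stage = make_first_stage().fit(X_train, y_train)
test_proba = first_stage.predict_proba(X_test)

# Second stage: original features plus first-stage class probabilities.
X_train_aug = np.hstack([X_train, train_proba])
X_test_aug = np.hstack([X_test, test_proba])

second_stage = ExtraTreesClassifier(n_estimators=100, random_state=0)
second_stage.fit(X_train_aug, y_train)
accuracy = second_stage.score(X_test_aug, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.3f}")
```

Swapping `ExtraTreesClassifier` for any other second-stage model only changes the last few lines, which is how an ablation over final-stage classifiers could be organized under this assumption.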
Default parameter values for the machine learning models.
| Model | Parameter Values |
|---|---|
| KNN | n_neighbors = 3, weights = 'uniform', algorithm = 'auto', leaf_size = 30 |
| SVM (linear) | C = 0.025, kernel = 'linear' |
| SVM (RBF) | C = 1, gamma = 2, kernel = 'rbf' |
| Decision Tree | criterion = 'gini', max_depth = 5, max_features = None, max_leaf_nodes = None, min_samples_leaf = 1, min_samples_split = 2, random_state = None, splitter = 'best', min_weight_fraction_leaf = 0.0 |
| Naïve Bayes (Gaussian) | priors = None, var_smoothing = 1e-9 |
| Neural Network (MLP Classifier) | activation = 'relu', alpha = 1, batch_size = 1024, hidden_layer_sizes = 100, learning_rate_init = 0.001, max_iter = 1000, max_iter = 200, power_t = 0.5, random_state = None, shuffle = True, solver = 'adam', tol = 0.0001 |
| Linear Discriminant Analysis | n_components = None, priors = None, shrinkage = None, solver = 'svd' |
| Quadratic Discriminant Analysis | tol = 0.0001, store_covariance = False, reg_param = 0.0, priors = None |
| Passive | C = 1.0, n_iter_no_change = 5, max_iter = 1000, random_state = None |
| Ridge | fit_intercept = True, alpha = 1.0, normalize = False, max_iter = None, random_state = None, solver = 'auto' |
| SGDC | loss = 'hinge', penalty = 'l2', alpha = 0.0001, fit_intercept = True, max_iter = 1000 |
| Logistic Regression | C = 1.0, cv = None, dual = False, fit_intercept = True, max_iter = 100, penalty = 'l2', random_state = None, solver = 'lbfgs', tol = 0.0001 |
| Ensemble Learner | |
| Random Forest | max_features = 1, n_estimators = 10, max_depth = 5, criterion = 'gini', random_state = None, verbose = 0 |
| AdaBoost | algorithm = 'SAMME.R', learning_rate = 1, n_estimators = 50, random_state = None |
| Extra Trees | criterion = 'gini', max_depth = None, max_features = 12, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 100 |
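The parameter names in the table match scikit-learn estimator arguments, so a subset of the configurations can be sketched as follows. This is an illustration under assumptions: the data are synthetic (15 features, so that `max_features = 12` is valid), the `models` dictionary and the 5-fold cross-validation are hypothetical choices, and only estimators whose listed arguments are stable across recent scikit-learn releases are included.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Parameter values copied from the table; anything not listed keeps
# the library default.
models = {
    "KNN": KNeighborsClassifier(n_neighbors=3, weights="uniform",
                                algorithm="auto", leaf_size=30),
    "Linear SVM": SVC(C=0.025, kernel="linear"),
    "RBF SVM": SVC(C=1, gamma=2, kernel="rbf"),
    "Decision Tree": DecisionTreeClassifier(criterion="gini", max_depth=5,
                                            min_samples_leaf=1,
                                            min_samples_split=2,
                                            splitter="best"),
    "Naive Bayes": GaussianNB(priors=None, var_smoothing=1e-9),
    "Random Forest": RandomForestClassifier(max_features=1, n_estimators=10,
                                            max_depth=5, criterion="gini"),
    "Extra Trees": ExtraTreesClassifier(criterion="gini", max_depth=None,
                                        max_features=12, min_samples_leaf=1,
                                        min_samples_split=2,
                                        n_estimators=100),
}

# Synthetic placeholder data, not the hospital dataset.
X, y = make_classification(n_samples=200, n_features=15, n_informative=8,
                           random_state=0)

# Mean 5-fold cross-validated accuracy per model (hypothetical protocol).
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:14s} {score:.3f}")
```

Because `random_state` is left at `None`, as in the table, the tree-based scores vary slightly from run to run.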
Mathematical definition of performance metrics.
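| Metrics | Definition |
|---|---|
| Accuracy (Acc) | $\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$ |
| False Negative Rate (FNR) | $\mathrm{FNR} = \frac{FN}{FN + TP}$ |
| False Positive Rate (FPR) | $\mathrm{FPR} = \frac{FP}{FP + TN}$ |
| Matthews Correlation Coefficient (MCC) | $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ |
| Cohen Kappa | $\kappa = \frac{p_o - p_e}{1 - p_e}$ |

TP = true positives; FP = false positives; TN = true negatives; FN = false negatives; $p_o$ = observed accuracy; $p_e$ = expected accuracy.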
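As a worked illustration, the five metrics named in the table can be computed from confusion-matrix counts using their standard textbook definitions. The function name `binary_metrics`, the helper `_ratio`, and the example counts are hypothetical and are not taken from the study.

```python
import math


def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute Acc, FNR, FPR, MCC and Cohen's kappa from confusion counts."""
    n = tp + fp + tn + fn
    accuracy = _ratio(tp + tn, n)
    fnr = _ratio(fn, fn + tp)  # share of actual positives that were missed
    fpr = _ratio(fp, fp + tn)  # share of actual negatives that were flagged
    mcc = _ratio(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    # Expected (chance) agreement from the row and column marginals.
    p_e = _ratio((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    kappa = _ratio(accuracy - p_e, 1 - p_e)
    return {"accuracy": accuracy, "fnr": fnr, "fpr": fpr,
            "mcc": mcc, "kappa": kappa}


# Made-up counts for a 100-sample test set.
m = binary_metrics(tp=45, fp=5, tn=40, fn=10)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these example counts the observed accuracy is 0.85 and the expected accuracy is 0.5, so kappa comes out at 0.7.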
Description of the convolutional neural network model.
| Parameters | Description |
|---|---|
| Activation Function | Input layer: ReLU; Hidden layer: ReLU; Output layer: Softmax |
| Loss / Optimizer | Loss = sparse_categorical_crossentropy, optimizer = adam |
| Epochs | 10 |
| Epoch 2 | 50 |
| Batch Size | 1024 |
| Dropout ratio (Input) | 0.5 |
| Dropout ratio (Output) | 0.3 |
The results of ablation study: performance of the proposed model using different final stage ML classifiers. Best values are shown in bold.
| ML Model | Accuracy (%) | FPR (%) | FNR (%) | AUC (%) | MCC (%) | Kappa (%) |
|---|---|---|---|---|---|---|
| Nearest Neighbors | 78.9 | 39.86 | 11.02 | 74.56 | 51.48 | 50.8 |
| Linear SVM | 64.66 |  |  | 50 | 0 | 0 |
| RBF SVM | 71.44 | 79.06 | 1.72 | 59.62 | 26.34 | 21.66 |
| Decision Tree | 94.64 | 9.36 | 3.2 | 93.72 | 88.24 | 87.94 |
| Random Forest | 90.74 | 22.38 | 2.2 | 87.72 | 79.54 | 78.64 |
| Neural Net | 65.02 | 99.04 |  | 50.48 | 3.48 | 1.18 |
| AdaBoost | **99.28** | 2.24 |  | 98.88 | 98.36 | 98.32 |
| ExtraTrees | **99.28** | 0 | 1.04 |  |  |  |
| Naive Bayes | 72.14 | 54.14 | 13.78 | 66.06 | 35.48 | 34.26 |
| LDA | 70.34 | 67.62 | 8.86 | 61.8 | 30.08 | 26.44 |
| QDA | 91.44 | 18.4 | 3.3 | 89.14 | 81.26 | 80.46 |
| Logistic | 65.02 | 99.04 |  | 50.48 | 3.48 | 1.18 |
| Passive | 59.64 | 60 | 29.48 | 55.26 | 11.48 | 9.74 |
| Ridge | 67.18 | 92.2 | 0.52 | 53.62 | 17.24 | 8.82 |
| SGDC | 58.96 | 52.38 | 34.88 | 56.36 | 13.12 | 15.1 |
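One way to sanity-check the table is to compare each reported AUC with the balanced accuracy 100 − (FPR + FNR)/2, which is what a ROC AUC reduces to when a classifier is scored at a single operating point (hard labels). The sketch below runs that check on the rows that are fully populated. It is an observation about the reported numbers, not a statement of how the authors computed AUC, and the names `rows`, `balanced_accuracy`, and `gaps` are hypothetical.

```python
# (model, FPR %, FNR %, reported AUC %) copied from the fully populated rows.
rows = [
    ("Nearest Neighbors", 39.86, 11.02, 74.56),
    ("RBF SVM", 79.06, 1.72, 59.62),
    ("Decision Tree", 9.36, 3.2, 93.72),
    ("Random Forest", 22.38, 2.2, 87.72),
    ("Naive Bayes", 54.14, 13.78, 66.06),
    ("LDA", 67.62, 8.86, 61.8),
    ("QDA", 18.4, 3.3, 89.14),
    ("Passive", 60, 29.48, 55.26),
    ("Ridge", 92.2, 0.52, 53.62),
    ("SGDC", 52.38, 34.88, 56.36),
]


def balanced_accuracy(fpr, fnr):
    """Mean of the true negative rate and true positive rate, in percent."""
    return 100.0 - (fpr + fnr) / 2.0


# Absolute difference between the single-point AUC and the reported AUC.
gaps = {name: abs(balanced_accuracy(fpr, fnr) - auc)
        for name, fpr, fnr, auc in rows}
max_gap = max(gaps.values())

for name, gap in gaps.items():
    print(f"{name:18s} gap = {gap:.2f} percentage points")
print(f"largest gap: {max_gap:.2f}")
```

For these rows the two quantities agree to within a few hundredths of a percentage point, which is compatible with rounding of fold-averaged values.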
Figure 4. Performance of machine learning models.
Figure 5. Critical difference diagram of the final-stage classifiers (meta-learners) based on their performances.
Figure 6. Comparison of results with previous studies. Our proposed model is compared with a three-way random forest classifier (TWFR) approach [27], Random Forest (RF) [67], SMOTE + RF [68], and a Hybrid Fuzzy inference engine and deep neural network (HDS) [66].