Farshad Saberi-Movahed, Mahyar Mohammadifard, Adel Mehrpooya, Mahtab Mohammadifard, Farid Saberi-Movahed, Iman Tavassoly, Mohammad Rezaei-Ravari, Kamal Berahmand, Mehrdad Rostami, Saeed Karami, Mohammad Najafzadeh, Davood Hajinezhad, Mina Jamshidi, Farshid Abedi, Elnaz Farbod, Farinaz Safavi, Mohammadreza Dorvash, Shahrzad Vahedi, Mahdi Eftekhari.
Abstract
One of the most critical challenges in managing complex diseases like COVID-19 is to establish an intelligent triage system that can optimize clinical decision-making during a global pandemic. The clinical presentation and patients' characteristics are usually used to identify those patients who need more critical care. However, clinical evidence points to an unmet need for more accurate and optimal clinical biomarkers to triage patients under a condition like the COVID-19 crisis. Here we present a machine learning approach to find a group of clinical indicators from the blood tests of a set of COVID-19 patients that are predictive of poor prognosis and morbidity. Our approach consists of two interconnected schemes: Feature Selection and Prognosis Classification. The former is based on different Matrix Factorization (MF)-based methods, and the latter is performed using the Random Forest algorithm. Our model reveals that Arterial Blood Gas (ABG) O2 Saturation and C-Reactive Protein (CRP) are the most important clinical biomarkers determining poor prognosis in these patients. Our approach paves the way toward building quantitative and optimized clinical management systems for COVID-19 and similar diseases.
Year: 2021 PMID: 34268522 PMCID: PMC8282111 DOI: 10.1101/2021.07.07.21259699
Source DB: PubMed Journal: medRxiv
Figure 1: Chronological and detailed illustration of the basic framework for the methods MFFS, MPMR, SGFS, RMFFS, and SLSDR.
A summary of the taxonomy and references related to the feature selection methods revisited in this paper.
| Category | Description | References |
|---|---|---|
|  | To decompose a given (non-negative) matrix into the product of two low-rank (non-negative) matrices | [ |
|  | To learn a low-dimensional representation of a high-dimensional space | [ |
|  | To uncover low-dimensional manifolds that are embedded in the high-dimensional space of the input data | [ |
|  | To take into account the duality between samples and features and exploit the manifold structures of both samples and features of the original data | [ |
|  | To identify the relationships between two variables | [ |
The MFFS method.
1: Initialize …
2: …
3: Fix …
4: Fix …
5: …
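The alternating "fix one factor, update the other" steps above follow the standard multiplicative-update pattern for constrained matrix factorization. The sketch below illustrates that pattern in NumPy for an MFFS-style objective X ≈ XWH with a soft orthogonality penalty on W; the update rules and the penalty weight `alpha` are generic illustrations, not the paper's exact formulas.

```python
import numpy as np

def mffs_sketch(X, k, alpha=1.0, n_iter=200, seed=0):
    """Illustrative MFFS-style feature selection (X ~= X W H).

    W (d x k) is a non-negative feature-weight matrix pushed toward
    orthogonality by the alpha penalty; H (k x d) is the coefficient
    matrix.  Features are ranked by the row norms of W.  These are
    generic multiplicative-update rules, not the original paper's
    exact steps.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.random((d, k))
    H = rng.random((k, d))
    XtX = X.T @ X
    eps = 1e-10
    for _ in range(n_iter):
        # Fix W, update H (standard NMF-style multiplicative rule).
        H *= (W.T @ XtX) / (W.T @ XtX @ W @ H + eps)
        # Fix H, update W, including the orthogonality penalty term.
        num = XtX @ H.T + alpha * W
        den = XtX @ W @ H @ H.T + alpha * W @ W.T @ W + eps
        W *= num / den
    scores = np.linalg.norm(W, axis=1)     # one importance score per feature
    return np.argsort(scores)[::-1][:k]    # indices of the top-k features

# Tiny usage example: two large-magnitude columns and two near-zero
# noise columns (synthetic data, for illustration only).
rng = np.random.default_rng(1)
base = rng.random((30, 2))
X = np.hstack([base * 5.0, rng.random((30, 2)) * 0.01])
selected = mffs_sketch(X, k=2)
```

The row-norm ranking is the common read-out for MF-based selectors: a feature whose W row is large contributes strongly to the learned subspace.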
The MPMR method.
1: Initialize …
2: …
3: Fix …
4: Fix …
5: …
The SGFS method.
1: Compute the feature Laplacian matrix …
2: Initialize …
3: …
4: Fix …
5: Fix …
6: Update the diagonal matrix …
7: …
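Step 1 of SGFS requires a graph Laplacian over the features. A common construction is a Gaussian-kernel k-nearest-neighbour graph on the feature vectors (the paper's exact graph construction may differ); a minimal NumPy sketch:

```python
import numpy as np

def feature_laplacian(X, k=3, sigma=1.0):
    """Graph Laplacian over *features* (columns of X), as used by
    graph-regularized selectors such as SGFS.  Edges connect each
    feature to its k most similar features under a Gaussian kernel;
    this is one common construction, not necessarily the paper's."""
    F = X.T                                   # features as rows (d x n)
    d = F.shape[0]
    # Pairwise squared Euclidean distances between features.
    sq = np.sum(F**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2 * F @ F.T
    np.maximum(dist2, 0, out=dist2)           # clip tiny negatives
    S = np.exp(-dist2 / (2 * sigma**2))
    np.fill_diagonal(S, 0.0)
    # Keep only each feature's k strongest neighbours, then symmetrize.
    keep = np.argsort(S, axis=1)[:, -k:]
    mask = np.zeros_like(S, dtype=bool)
    rows = np.repeat(np.arange(d), k)
    mask[rows, keep.ravel()] = True
    S = np.where(mask | mask.T, S, 0.0)
    D = np.diag(S.sum(axis=1))
    return D - S                              # L = D - S

# Usage on a small synthetic matrix: 20 samples, 6 features.
X = np.random.default_rng(0).random((20, 6))
L = feature_laplacian(X, k=2)
```

By construction L is symmetric and positive semidefinite with zero row sums, which is what the graph-regularization term in such methods relies on.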
The RMFFS method.
1: Initialize …
2: …
3: Fix …
4: Fix …
5: …
The SLSDR method.
1: Compute the feature Laplacian matrix …
2: Compute the data Laplacian matrix …
3: Initialize …
4: …
5: Fix …
6: Fix …
7: Update the diagonal matrix …
8: …
Details of ten gene expression datasets used in the experiments.
| Dataset | # Samples | # Features | # Classes | Reference |
|---|---|---|---|---|
|  | 60 | 7129 | 2 | [ |
|  | 62 | 2000 | 2 | [ |
|  | 47 | 4026 | 2 | [ |
|  | 50 | 4434 | 4 | [ |
|  | 72 | 7070 | 2 | [ |
|  | 203 | 3312 | 5 | [ |
|  | 96 | 4026 | 9 | [ |
|  | 102 | 10509 | 2 | [ |
|  | 83 | 2328 | 4 | [ |
|  | 171 | 5748 | 4 | [ |
Figure 2:The ACC average values (the y-axis) versus the seven datasets (the x-axis). (A bigger value of ACC indicates a better clustering performance.)
Figure 3:The NMI average values (the y-axis) versus the seven datasets (the x-axis). (A bigger value of NMI indicates a better clustering performance.)
The results of clustering accuracy (ACC±STD%) corresponding to five feature selection techniques computed on the ten datasets. In every row, the first- and second-best outcomes are boldfaced and underscored, respectively. The number of selected features giving the best clustering outcome is shown in parentheses. (A bigger ACC value indicates better clustering performance.)
| Dataset | Baseline | MFFS | MPMR | SGFS | RMFFS | SLSDR | |
|---|---|---|---|---|---|---|---|
| 53.33 ± 1.45 | 61.75 ± 2.78 (50) | 61.66 ± 0.00 (70) | |||||
| 78.54 ± 6.95 | 74.19 ± 0.29 (10) | 74.21 ± 2.91 (90) | 79.41 ± 1.45 (40) | ||||
| 59.57 ± 1.23 | 85.10 ± 0.14 (40) | 87.23 ± 0.14 (70) | 87.76 ± 2.57 (70) | ||||
| 44.00 ± 0.00 | 48.50 ± 2.66 (80) | 46.40 ± 0.97 (30) | 47.50 ± 2.89 (40) | ||||
| 23.61 ± 0.01 (30) | |||||||
| 83.10 ± 0.94 | 69.72 ± 7.63 (100) | 69.70 ± 2.45 (20) | 73.39 ± 0.22 (90) | ||||
| 57.96 ± 4.55 | 56.19 ± 2.77 (90) | 57.81 ± 3.53 (90) | 57.34 ± 3.99 (70) | ||||
| 31.37 ± 0.36 | 31.37 ± 3.64 (20) | 30.39 ± 0.72 (30) | 47.05 ± 2.18 (60) | ||||
| 25.66 ± 2.82 | 43.97 ± 2.15 (60) | 37.77 ± 5.14 (90) | 45.27 ± 3.18 (100) | ||||
| 41.25 ± 0.72 | 42.13 ± 0.13 (90) | 41.49 ± 1.65 (60) | 42.26 ± 0.87 (100) | ||||
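Clustering accuracy as reported in tables like the one above is conventionally defined as the best agreement between cluster labels and ground-truth classes over all relabelings of the clusters. A small sketch under that assumption (brute-force matching, which is fine for the 2–9 classes in these datasets; the paper does not state its implementation):

```python
import itertools
import numpy as np

def clustering_acc(y_true, y_pred):
    """Clustering accuracy: the best fraction of agreements between
    predicted cluster labels and true classes over all permutations
    of the class labels.  Large class counts would call for the
    Hungarian algorithm instead of brute force."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    clusters = np.unique(y_pred)
    classes = np.unique(y_true)
    best = 0
    for perm in itertools.permutations(classes):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == t for c, t in zip(y_pred, y_true))
        best = max(best, hits)
    return best / len(y_true)

# Example: the clustering is perfect up to a renaming of the clusters.
print(clustering_acc([0, 0, 1, 1, 2, 2], [1, 1, 2, 2, 0, 0]))  # 1.0
```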
The results of normalized mutual information (NMI±STD%) corresponding to five feature selection techniques computed on the ten datasets. In every row, the first- and second-best outcomes are boldfaced and underscored, respectively. The number of selected features giving the best clustering outcome is shown in parentheses. (A bigger NMI value indicates better clustering performance.)
| Dataset | Baseline | MFFS | MPMR | SGFS | RMFFS | SLSDR | |
|---|---|---|---|---|---|---|---|
| 1.17 ± 0.00 | 2.93 ± 0.45 (100) | 1.60 ± 0.80 (30) | 8.79 ± 0.00 (30) | ||||
| 27.35 ± 9.96 | 24.30 ± 0.14 (100) | 21.08 ± 0.54 (40) | 30.03 ± 0.00 (10) | ||||
| 2.93 ± 1.09 | 39.45 ± 0.00 (40) | 48.46 ± 0.72 (70) | 49.51 ± 0.72 (70) | ||||
| 17.94 ± 0.69 | 23.69 ± 2.07 (80) | 25.24 ± 2.08 (30) | 27.59 ± 6.86 (90) | ||||
| 21.47 ± 0.00 | 66.56 ± 0.06 (80) | 62.07 ± 0.03 (30) | 72.87 ± 0.00 (30) | ||||
| 68.12 ± 0.56 | 55.61 ± 6.88 (100) | 55.77 ± 2.11 (70) | 59.71 ± 2.31 (70) | ||||
| 69.55 ± 3.41 | 63.47 ± 3.03 (60) | 67.13 ± 3.12 (90) | 65.19 ± 2.29 (80) | ||||
| 8.56 ± 0.21 | 6.09 ± 0.91 (60) | 8.83 ± 0.95 (20) | 9.45 ± 0.03 (80) | ||||
| 11.55 ± 3.47 | 38.41 ± 7.38 (50) | 24.27 ± 5.26 (40) | 47.82 ± 4.13 (90) | ||||
| 13.54 ± 0.23 | 12.82 ± 1.52 (80) | 16.22 ± 1.89 (60) | 13.96 ± 1.02 (100) | ||||
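NMI can likewise be computed from the contingency table of the two labelings. A sketch using the common square-root-of-entropies normalization (the exact normalization used in the paper is not stated here):

```python
import numpy as np

def nmi(y_true, y_pred):
    """Normalized mutual information between two labelings,
    NMI = I(U;V) / sqrt(H(U) * H(V)) -- one common normalization."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = len(y_true)
    classes, ti = np.unique(y_true, return_inverse=True)
    clusters, ci = np.unique(y_pred, return_inverse=True)
    # Contingency table of joint label counts.
    C = np.zeros((len(classes), len(clusters)))
    np.add.at(C, (ti, ci), 1)
    P = C / n
    pu = P.sum(axis=1, keepdims=True)   # marginal of the true labels
    pv = P.sum(axis=0, keepdims=True)   # marginal of the clusters
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / (pu @ pv)[nz]))
    hu = -np.sum(pu[pu > 0] * np.log(pu[pu > 0]))
    hv = -np.sum(pv[pv > 0] * np.log(pv[pv > 0]))
    return mi / np.sqrt(hu * hv) if hu > 0 and hv > 0 else 0.0

# Identical labelings up to renaming give maximal NMI.
print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 6))  # 1.0
```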
Figure 4:The ACC and NMI average values (the y-axis) versus six feature selection methods (the x-axis). (A bigger value of ACC or NMI indicates that the corresponding method has a better performance.)
Figure 5: Average ranks obtained by the Friedman test for each method with respect to the different evaluation metrics on the datasets. (The lower the rank, the better the method performs.)
Post hoc comparisons on the ACC metric using the significance level α = 0.05. Here, the control method is SLSDR, and Holm's procedure rejects a null hypothesis when the p-value of a pairwise comparison is ≤ 0.025.
| Method | Holm's α | Reject |
|---|---|---|
| MPMR | 0.01 | Yes |
| Baseline | 0.0125 | Yes |
| MFFS | 0.016667 | Yes |
| SGFS | 0.025 | Yes |
| RMFFS | 0.05 | No |
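The values in the Holm's column are the step-down significance levels α/(m − i) of Holm's procedure with m = 5 pairwise comparisons, which is how the table arrives at 0.01, 0.0125, 0.016667, 0.025 and 0.05. A compact generic sketch of the procedure (not tied to the paper's code):

```python
def holm_thresholds(alpha, m):
    """Step-down significance levels for Holm's procedure with m
    pairwise comparisons: the i-th smallest p-value (0-indexed) is
    compared against alpha / (m - i)."""
    return [alpha / (m - i) for i in range(m)]

def holm_reject(p_values, alpha=0.05):
    """Per-hypothesis reject decisions, in the original order.
    Sort p-values ascending, compare each against its step-down
    level, and stop at the first non-significant comparison."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# The five step-down levels used in the tables above.
print(holm_thresholds(0.05, 5))  # ~[0.01, 0.0125, 0.0167, 0.025, 0.05]
```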
Post hoc comparisons on the NMI metric using the significance level α = 0.05. Here, the control method is SLSDR, and Holm's procedure rejects a null hypothesis when the p-value of a pairwise comparison is ≤ 0.05.
| Method | Holm's α | Reject |
|---|---|---|
| MFFS | 0.01 | Yes |
| Baseline | 0.0125 | Yes |
| MPMR | 0.016667 | Yes |
| SGFS | 0.025 | Yes |
| RMFFS | 0.05 | Yes |
The per-iteration computational complexity comparison among different feature selection methods. Note that n is the number of samples, d is the number of features, and k is the number of selected features.
| Method | Computational complexity |
|---|---|
Figure 6: The runtime of the various feature selection methods on the different gene expression datasets as a function of k, where k ∈ {10, 40, 80, 100}.
Figure 7:Performance metrics of the Random Forest classifier: (a) Classification ACC, (b) TPR, (c) TNR, (d) PPV, and (e) NPV.
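The panels of Figure 7 are standard confusion-matrix ratios. A small self-contained helper computing all five (the example labels are illustrative, not the study's data):

```python
def binary_metrics(y_true, y_pred):
    """ACC, TPR (sensitivity), TNR (specificity), PPV (precision)
    and NPV from binary labels, matching the panels of Figure 7."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    n = len(y_true)
    return {
        "ACC": (tp + tn) / n,
        "TPR": tp / (tp + fn) if tp + fn else 0.0,
        "TNR": tn / (tn + fp) if tn + fp else 0.0,
        "PPV": tp / (tp + fp) if tp + fp else 0.0,
        "NPV": tn / (tn + fn) if tn + fn else 0.0,
    }

# Illustrative usage: tp=2, tn=2, fp=1, fn=1.
m = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```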
Figure 8: Frequency of pairs of features (biomarkers) selected by the various feature selection methods at each iteration of the 10-fold CV of the Random Forest classifier. The feature selection methods are (a) MFFS, (b) MPMR, (c) RMFFS, (d) SGFS, and (e) SLSDR. (ALT: Alanine Aminotransferase, AST: Aspartate Aminotransferase, CRP: C-Reactive Protein, K: Potassium, Lymph: Lymphocyte Count, Na: Sodium, O2 Sat: O2 Saturation on ABG, PLT: Platelet Count, PMH: Past Medical History (Cancer, Diabetes, Ischemic Heart Disease, Renal Failure, Immunodeficiency), PTT: Partial Thromboplastin Time, WBC: White Blood Cell Count)
Figure 9: Aggregate frequency of features (biomarkers) selected by all feature selection methods together across all iterations of the 10-fold CV of the Random Forest classifier, where at each iteration only two features (k = 2) are selected. (ALT: Alanine Aminotransferase, AST: Aspartate Aminotransferase, CRP: C-Reactive Protein, K: Potassium, Lymph: Lymphocyte Count, Na: Sodium, O2 Sat: O2 Saturation on ABG, PLT: Platelet Count, PMH: Past Medical History (Cancer, Diabetes, Ischemic Heart Disease, Renal Failure, Immunodeficiency), PTT: Partial Thromboplastin Time, WBC: White Blood Cell Count)
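The tallies behind Figures 8 and 9 amount to simple counting over the per-fold selections. A sketch with hypothetical fold selections (the feature names follow the abbreviations above, but the selections and counts are illustrative, not the study's results):

```python
from collections import Counter

# Hypothetical selections: each 10-fold CV iteration picks k = 2
# biomarkers.  Pair frequencies correspond to Figure 8; aggregate
# single-feature frequencies correspond to Figure 9.
folds = [("O2 Sat", "CRP"), ("O2 Sat", "CRP"), ("CRP", "Lymph"),
         ("O2 Sat", "CRP"), ("O2 Sat", "WBC")]

# Sort each pair so ("O2 Sat", "CRP") and ("CRP", "O2 Sat") coincide.
pair_counts = Counter(tuple(sorted(f)) for f in folds)
feature_counts = Counter(name for f in folds for name in f)

print(pair_counts.most_common(1))  # [(('CRP', 'O2 Sat'), 3)]
```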