Houwu Gong, Miye Wang, Hanxue Zhang, Md Fazla Elahe, Min Jin.
Abstract
Background: Artificial intelligence-based disease prediction models have greater potential to screen COVID-19 patients than conventional methods. However, their application has been restricted because of their underlying black-box nature. Objective: To address this issue, an explainable artificial intelligence (XAI) approach was developed to screen patients for COVID-19.
Keywords: COVID-19; artificial intelligence; disease prediction; ensemble learning; explainable
Year: 2022 PMID: 35801239 PMCID: PMC9253566 DOI: 10.3389/fpubh.2022.874455
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
| Algorithm: LIME |
|---|
| Input: (1) Complex Model |
| Steps: |
| Output: The weight of the linear model |
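The algorithm box above names only the input (a complex model) and the output (the weights of a linear model); the steps are not preserved here. As a rough illustration of the general LIME idea, the sketch below perturbs one instance, weights the perturbed samples by proximity, and fits a weighted linear surrogate. The function names, the Gaussian perturbation, the exponential kernel, and the toy `black_box` are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random

def solve(a, b):
    """Solve a small linear system with Gaussian elimination and partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def lime_weights(predict, x, n_samples=500, scale=1.0, kernel_width=1.0, seed=0):
    """Return [intercept, w_1, ..., w_d] of a local weighted linear surrogate."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # Perturb the instance (assumed: independent Gaussian noise per feature).
        z = [xi + rng.gauss(0.0, scale) for xi in x]
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        rows.append([1.0] + z)                 # leading 1.0 is the intercept term
        targets.append(predict(z))             # query the complex model
        weights.append(math.exp(-dist2 / kernel_width ** 2))  # proximity kernel
    k = len(x) + 1
    # Weighted normal equations: (A^T W A) beta = A^T W y
    ata = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights)) for j in range(k)]
           for i in range(k)]
    aty = [sum(w * r[i] * y for r, w, y in zip(rows, weights, targets))
           for i in range(k)]
    return solve(ata, aty)

# Hypothetical "complex model": linear on purpose, so the surrogate can be checked.
def black_box(z):
    return 1.0 + 2.0 * z[0] - 3.0 * z[1]

weights = lime_weights(black_box, [0.5, 1.0])
```

Because the toy model is itself linear, the surrogate recovers its intercept and coefficients; with a real classifier the weights only describe behavior near the chosen instance.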
Characteristics of the positive and negative COVID-19 patients.
| Variable | All patients | Negative | Positive | P-value |
|---|---|---|---|---|
| Age, year | 60.40 ± 20.83 | 60.40 ± 20.83 | 62.27 ± 15.84 | 0.066 |
| Female | 583 (42.43%) | 304 (49.43%) | 279 (36.76%) | <0.001 |
| CA, mmol/L | 2.20 ± 0.751 | 2.29 ± 0.74 | 2.14 ± 0.14 | <0.001 |
| CREA, mg/dl | 1.18 ± 1.01 | 1.22 ± 1.20 | 1.14 ± 0.82 | 0.180 |
| ALP, U/L | 87.74 ± 64.26 | 94.18 ± 77.16 | 82.53 ± 50.95 | 0.001 |
| GGT, U/L | 66.12 ± 101.95 | 58.52 ± 118.90 | 72.27 ± 85.40 | 0.013 |
| GLU, mg/dl | 119.03 ± 55.85 | 112.19 ± 49.85 | 124.58 ± 59.73 | <0.001 |
| AST, U/L | 47.11 ± 51.37 | 34.60 ± 33.44 | 57.25 ± 60.37 | <0.001 |
| ALT, U/L | 40.15 ± 40.67 | 32.23 ± 35.22 | 46.56 ± 43.58 | <0.001 |
| LDH, U/L | 336.86 ± 210.61 | 280.76 ± 243.48 | 382.33 ± 166.44 | <0.001 |
| PCR | 72.22 ± 79.59 | 52.86 ± 70.90 | 89.72 ± 82.43 | <0.001 |
| KAL | 4.22 ± 0.51 | 4.25 ± 0.50 | 4.20 ± 0.52 | 0.101 |
| NAT | 138.58 ± 4.66 | 139.10 ± 3.92 | 138.15 ± 5.15 | <0.001 |
| WBC, 10⁹/L | 8.56 ± 4.75 | 9.73 ± 5.45 | 7.62 ± 3.85 | <0.001 |
| RBC, 10¹²/L | 4.53 ± 0.73 | 4.40 ± 0.75 | 4.64 ± 0.69 | <0.001 |
| HGB, g/dl | 13.18 ± 2.05 | 12.80 ± 2.13 | 13.49 ± 1.94 | <0.001 |
| HCT, % | 39.32 ± 5.64 | 38.32 ± 5.79 | 40.14 ± 5.39 | <0.001 |
| MCV, fl | 87.33 ± 6.93 | 87.76 ± 7.23 | 86.97 ± 6.65 | <0.001 |
| MCH, pg/cell | 29.25 ± 2.69 | 29.27 ± 2.76 | 29.23 ± 2.63 | 0.783 |
| MCHC, g Hb/dl | 33.48 ± 1.34 | 33.34 ± 1.35 | 33.60 ± 1.32 | <0.001 |
| PLT1, 10⁹/L | 234.74 ± 95.89 | 246.55 ± 98.70 | 225.17 ± 92.51 | <0.001 |
| NE, % | 72.35 ± 13.26 | 70.33 ± 13.47 | 73.98 ± 12.86 | <0.001 |
| LY, % | 18.58 ± 11.00 | 19.73 ± 11.37 | 17.65 ± 10.62 | 0.001 |
| MO, % | 7.83 ± 3.88 | 8.06 ± 3.61 | 7.65 ± 4.08 | 0.045 |
| EO, % | 0.88 ± 1.62 | 1.43 ± 2.02 | 0.44 ± 1.00 | <0.001 |
| BA, % | 0.34 ± 0.327 | 0.43 ± 0.31 | 0.26 ± 0.21 | <0.001 |
| NET, 10⁹/L | 6.45 ± 4.48 | 7.15 ± 5.28 | 5.88 ± 3.60 | <0.001 |
| LYT, 10⁹/L | 1.37 ± 0.95 | 1.64 ± 1.02 | 1.15 ± 0.83 | <0.001 |
| MOT, 10⁹/L | 0.62 ± 0.54 | 0.72 ± 0.45 | 0.54 ± 0.59 | <0.001 |
| EOT, 10⁹/L | 0.07 ± 0.14 | 0.12 ± 0.18 | 0.03 ± 0.08 | <0.001 |
| BAT, 10⁹/L | 0.02 ± 0.04 | 0.03 ± 0.05 | 0.01 ± 0.02 | <0.001 |
| Suspect, % | 0.83 ± 0.33 | 0.71 ± 0.39 | 0.92 ± 0.23 | <0.001 |
CA, calcium; CREA, creatinine; ALP, alkaline phosphatase; GGT, gamma-glutamyl transferase, an enzyme that transfers gamma-glutamyl groups between molecules; GLU, glucose; AST, aspartate aminotransferase; ALT, alanine aminotransferase; LDH, lactate dehydrogenase, an enzyme that interconverts lactate and pyruvate; WBC, white blood cell; RBC, red blood cell; HGB, hemoglobin, a protein that transports oxygen throughout the body; HCT, hematocrit, a metric representing the proportion of RBCs in the blood; MCV, mean corpuscular volume; MCH, mean corpuscular hemoglobin; MCHC, mean corpuscular hemoglobin concentration; PLT1, platelets; NE, neutrophil count (%); LY, lymphocyte count (%); MO, monocyte count (%); EO, eosinophil count (%); BA, basophil count (%); NET, neutrophil count; LYT, lymphocyte count; MOT, monocyte count; EOT, eosinophil count; BAT, basophil count; Suspect, suspected COVID-19.
Figure 1. Correlation coefficient matrix heatmap of all 29 variables. The obtained numerical matrix is visually displayed through a heatmap. Orange indicates a positive correlation, and green indicates a negative correlation. Color depth indicates the value of the coefficient, with deeper colors indicating stronger correlations. Specifically, redder colors indicate correlation coefficients closer to 1, and greener colors indicate coefficients closer to −1.
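The numerical matrix behind such a heatmap can be sketched as follows. The caption does not state which correlation coefficient was used, so Pearson's coefficient is an assumption here, and the column names and values in `toy_columns` are hypothetical rather than study data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Map each pair of column names to its correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy data: "b" rises with "a", "c" falls as "a" rises.
toy_columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}
matrix = correlation_matrix(toy_columns)
```

Each entry lies between −1 and 1, which is the range the heatmap colors encode.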
Performance of random forest, AdaBoost, GBDT, and XGBoost models in screening COVID-19.
| Model | Accuracy | Precision | Sensitivity | Specificity | F1 score | |
|---|---|---|---|---|---|---|
| | 74.2% | 70.8% | 90.8% | 53.7% | 0.795 | 0.589 |
| | 76.7% | 78.2% | 80.3% | 72.4% | 0.792 | 0.553 |
| | 80.4% | 80.3% | 85.5% | 74.0% | 0.828 | 0.615 |
| | 75.3% | 73.3% | 86.8% | 61.0% | 0.795 | 0.565 |
Performance of random forest, AdaBoost, GBDT, and XGBoost models to screen COVID-19.
| Model | AUC | Confidence interval for AUC | | P-value | Confusion matrix |
|---|---|---|---|---|---|
| | 85.7% | 0.813, 0.902 | 0.02 | <0.001 | [66, 57], [14, 138] |
| | 85.4% | 0.810, 0.899 | 0.02 | <0.001 | [89, 34], [30, 122] |
| | 86.4% | 0.821, 0.907 | 0.02 | <0.001 | [91, 32], [22, 130] |
| | 84.9% | 0.803, 0.894 | 0.02 | <0.001 | [75, 48], [20, 132] |
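The confusion matrices in this table can be turned back into threshold-based screening metrics. The sketch below assumes the layout `[[TN, FP], [FN, TP]]` (the scikit-learn convention), which the source does not state. Under that assumption, the first matrix reproduces the first five values of the first row of the preceding performance table (74.2%, 70.8%, 90.8%, 53.7%, 0.795).

```python
def screening_metrics(cm):
    """Compute common screening metrics from a 2x2 confusion matrix.

    Assumed layout: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # recall for the positive class
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

first_row = screening_metrics([[66, 57], [14, 138]])
# Rounded: accuracy 0.742, precision 0.708, sensitivity 0.908,
# specificity 0.537, F1 0.795
```

The same function applied to the other three matrices reproduces the corresponding rows in the same way.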
Figure 2. Receiver operating characteristic (ROC) curves for the machine learning models in screening COVID-19.
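For readers unfamiliar with how an ROC curve and its area are obtained from predicted scores, a minimal sketch follows. It sweeps the decision threshold from high to low and integrates with the trapezoidal rule. The labels and scores are hypothetical toy values, not study data.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pairs = sorted(zip(scores, y_true), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every sample that shares this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical toy labels (1 = positive) and predicted scores.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(roc_points(toy_labels, toy_scores))  # 0.75
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of positives from negatives.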
Figure 3. Calibration curve for the internal validation set. The calibration curve was plotted using the bucket method (continuous data discretization) to observe whether the prediction probability of the classification model is close to the empirical probability (that is, the real probability). Ideally, the calibration curve lies along the diagonal (i.e., the prediction probability is equal to the empirical probability).
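The bucket method described in this caption can be sketched as below: predicted probabilities are grouped into bins, and each bin's mean prediction is compared with the observed fraction of positives. Equal-width bins and the bin count are assumptions, and the labels and probabilities are hypothetical toy values.

```python
def calibration_curve(y_true, probs, n_bins=5):
    """Return (mean predicted probability, observed positive fraction) per bin."""
    bins = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, probs):
        # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, label))
    curve = []
    for members in bins:
        if members:  # skip empty bins
            mean_pred = sum(p for p, _ in members) / len(members)
            frac_pos = sum(y for _, y in members) / len(members)
            curve.append((mean_pred, frac_pos))
    return curve

# Hypothetical toy labels (1 = positive) and predicted probabilities.
toy_y = [0, 0, 1, 1, 1, 0]
toy_p = [0.1, 0.15, 0.9, 0.95, 0.55, 0.45]
toy_curve = calibration_curve(toy_y, toy_p, n_bins=2)
```

Points lying on the diagonal (mean prediction equal to observed fraction) indicate a well-calibrated model.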
Figure 4. Influence of input features on the outcome of the XGBoost model. The top three features are LDH, WBC, and EOT, which indicates that they have important auxiliary diagnostic significance for COVID-19. The model found that patients with a higher WBC count, higher LDH level, or higher EOT count were more likely to have COVID-19. This finding might assist physicians in making decisions.
Figure 5. Influence of nine variables on the outcome of the XGBoost model. Because PCR ≤ 9.30 and CA > 2.29 were the most significant features, the classification of this sample was confirmed as positive.
Figure 6. Simplified decision tree model based on the top three features.