Elisa Ferrari, Luna Gargani, Greta Barbieri, Lorenzo Ghiadoni, Francesco Faita, Davide Bacciu.
Abstract
We present a workflow for clinical data analysis that relies on Bayesian Structure Learning (BSL), an unsupervised learning approach that is robust to noise and biases, allows prior medical knowledge to be incorporated into the learning process, and provides explainable results in the form of a graph showing the causal connections among the analyzed features. The workflow is a multi-step approach that goes from identifying the main causes of the patient's outcome through BSL to building a tool suitable for clinical practice, based on a Binary Decision Tree (BDT), that recognizes high-risk patients using information already available at hospital admission. We evaluate our approach on a feature-rich dataset of Coronavirus disease (COVID-19) patients, showing that the proposed framework provides a schematic overview of the multi-factorial processes that jointly contribute to the outcome. We compare our findings with the current literature on COVID-19, showing that this approach re-discovers established cause-effect relationships about the disease. Further, our approach yields a highly interpretable tool that correctly predicts the outcome of 85% of subjects based exclusively on 3 features: age, a previous history of chronic obstructive pulmonary disease, and the PaO2/FiO2 ratio at the time of arrival at the hospital. Including additional information from 4 routine blood tests (creatinine, glucose, pO2 and sodium) increases predictive accuracy to 94.5%.
Year: 2022 PMID: 35588440 PMCID: PMC9119448 DOI: 10.1371/journal.pone.0268327
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Summary of the dataset features.
In this table the 63 features analyzed in this work, reported in the second column, are grouped into 6 classes, listed in the first column. The third and fourth columns show the feature occurrence in the dataset in dead and recovered subjects, respectively. The last column reports a description of the feature values: for continuous variables the 5th and the 95th percentiles indicated as [P5, P95] are reported, while for categorical variables their values are reported within curly brackets.
| Category | Feature | Available data for dead subjects (max 71) | Available data for recovered subjects (max 194) | Values |
|---|---|---|---|---|
| Demographic data | Age (years) | 71 | 194 | [ |
| | SEX | 71 | 194 | {M, F} = {0, 1} |
| | Smoke(y/n) | 66 | 187 | {No, Yes} = {0, 1} |
| | Smoke(ex/y/n) | 64 | 187 | {Yes, Ex, No} = {0, 1, 2} |
| Prior respiratory problems | COPD (chronic obstructive pulmonary disease) | 71 | 194 | {No, Yes} = {0, 1} |
| | Asma | 70 | 194 | {No, Yes} = {0, 1} |
| | Other resp. disease | 71 | 194 | {No, Yes} = {0, 1} |
| Prior diseases | Diabetes | 71 | 194 | {No, Yes} = {0, 1} |
| | Hypertension | 71 | 194 | {No, Yes} = {0, 1} |
| | Cardio.disease | 71 | 194 | {No, Yes} = {0, 1} |
| | Hypercolest | 71 | 194 | {No, Yes} = {0, 1} |
| | Cerebrovasc. disease | 71 | 194 | {No, Yes} = {0, 1} |
| | Neuro. disease | 71 | 194 | {No, Yes} = {0, 1} |
| | Dementia | 71 | 194 | {No, Yes} = {0, 1} |
| | Cancer | 71 | 194 | {No, Yes} = {0, 1} |
| | Blood cancer | 71 | 194 | {No, Yes} = {0, 1} |
| | Kidney disease | 71 | 194 | {No, Yes} = {0, 1} |
| | Liver disease | 71 | 194 | {No, Yes} = {0, 1} |
| | Cirrhosis | 71 | 194 | {No, Yes} = {0, 1} |
| | Autoimmune disease | 71 | 194 | {No, Yes} = {0, 1} |
| Ongoing treatments | Anticoag | 71 | 194 | {No, Yes} = {0, 1} |
| | RAAS BLOCK (renin-angiotensin-aldosterone system) | 71 | 194 | {No, Yes} = {0, 1} |
| | Immunos. therapy | 71 | 194 | {No, Yes} = {0, 1} |
| | Dialysis | 71 | 194 | {No, Yes} = {0, 1} |
| Symptoms on admission | Fever | 71 | 194 | {No, Yes} = {0, 1} |
| | Conjunct. congest. | 71 | 194 | {No, Yes} = {0, 1} |
| | Nasal congestion | 71 | 194 | {No, Yes} = {0, 1} |
| | Headache | 71 | 194 | {No, Yes} = {0, 1} |
| | Cough | 71 | 194 | {No, Yes} = {0, 1} |
| | Sore throat | 71 | 194 | {No, Yes} = {0, 1} |
| | Sputum | 71 | 194 | {No, Yes} = {0, 1} |
| | Fatigue | 71 | 194 | {No, Yes} = {0, 1} |
| | Hemoptysis | 71 | 194 | {No, Yes} = {0, 1} |
| | Short breath | 71 | 194 | {No, Yes} = {0, 1} |
| | Nausea | 71 | 194 | {No, Yes} = {0, 1} |
| | Diarrhea | 71 | 194 | {No, Yes} = {0, 1} |
| | Myalgia | 71 | 194 | {No, Yes} = {0, 1} |
| | Rash | 71 | 194 | {No, Yes} = {0, 1} |
| | FC (cardiac frequency, bpm) | 63 | 176 | [ |
| | PAS (systolic arterial pressure, mmHg) | 65 | 174 | [ |
| | PAD (diastolic arterial pressure, mmHg) | 65 | 174 | [ |
| | Chest pain | 71 | 194 | {No, Yes} = {0, 1} |
| | Confusion | 71 | 194 | {No, Yes} = {0, 1} |
| Blood analysis on admission | Haemoglobin (g/dl) | 68 | 188 | [ |
| | WBC (white blood cells, cells/ | 69 | 192 | [ |
| | Lymphocyte (cells/ | 68 | 191 | [ |
| | Neutrophils (cells/ | 69 | 188 | [ |
| | Haematocrit (%) | 69 | 187 | [ |
| | Platelets (cells/ | 69 | 191 | [ |
| | INR (international normalized ratio) | 65 | 184 | [ |
| | Bilirubin (mg/dl) | 64 | 186 | [ |
| | AST (aspartate aminotransferase, IU/l) | 59 | 181 | [ |
| | ALT (alanine aminotransferase, IU/l) | 66 | 187 | [ |
| | Glucose (mg/dl) | 64 | 186 | [ |
| | Creatinine (mg/dl) | 67 | 191 | [ |
| | BUN (blood urea nitrogen, mg/dl) | 58 | 178 | [ |
| | Sodium (mEq/l) | 68 | 190 | [ |
| | Potassium (mmol/l) | 67 | 187 | [ |
| | pH | 57 | 162 | [ |
| | pO2 (O2 partial pressure, mmHg) | 62 | 173 | [ |
| | pCO2 (CO2 partial pressure, mmHg) | 61 | 167 | [ |
| | PF (pO2/FIO2 ratio, %) | 65 | 188 | [ |
| | PCR (C-reactive protein, mg/dl) | 67 | 177 | [ |
| | Outcome | 71 | 194 | {Death, Recovery} = {0, 1} |
Fig 1. Flow chart of the proposed approach.
Our empirical analysis is mainly based on three steps: (1) an explorative analysis applied to different classes of features separately; (2) an integrative and interpretative step; (3) a quantitative validation.
Fig 2. BSL analysis on separate classes of features.
Image showing the BSL analysis applied to different categories of features. All the illustrated graphs are generated taking the information provided by clinicians into account.
Fig 3. BSL analysis on the most relevant features.
Graph generated with the most relevant features found from the graphs shown in Fig 2, taking the information provided by clinicians into account.
Results of the bivariate statistical analysis for categorical variables.
Contingency table and results of the Fisher test between the categorical variables present in Fig 3 and the outcome. For the most impactful features, the fold increase in deaths (second column) or recoveries (third column) with respect to the dataset average is reported in brackets.
| Feature X | % of deaths in patients with X | % of recoveries in patients with X | % of deaths in patients without X | % of recoveries in patients without X | P-Value (Fisher test) |
|---|---|---|---|---|---|
| COPD | 69.0% (×2.6) | 31.0% | 21.6% | 78.4% | 5×10⁻⁷ |
| Kidney disease | 58.3% (×2.2) | 41.7% | 23.7% | 76.3% | 6×10⁻⁴ |
| Cerebrovasc. disease | 59.3% (×2.2) | 40.7% | 23.1% | 76.9% | 1.7×10⁻⁴ |
| Cardio. disease | 42.5% (×1.6) | 57.5% | 19.1% | 80.9% | 6.4×10⁻⁵ |
| Anticoag. | 51.5% (×1.9) | 48.5% | 23.3% | 76.7% | 1.1×10⁻³ |
| Myalgia | 2.7% | 97.3% (×1.3) | 30.7% | 69.3% | 6.1×10⁻⁵ |
| Confusion | 71.9% (×2.7) | 28.1% | 20.6% | 79.4% | 1.4×10⁻⁸ |
| Short breath | 34.1% (×1.3) | 65.9% | 20.1% | 79.9% | 7.5×10⁻³ |
| Dialysis | 50.0% (×1.9) | 50.0% | 26.4% | 73.6% | 0.29 |
| Hypercholesterolemia | 37.0% (×1.4) | 63.0% | 24.7% | 75.3% | 6.6×10⁻² |
| Hypertension | 35.5% (×1.3) | 64.5% | 19.1% | 80.9% | 2.1×10⁻³ |
| Diarrhea | 11.8% | 88.2% (×1.2) | 30.4% | 69.6% | 4.0×10⁻³ |
| Fatigue | 23.3% | 76.7% | 27.5% | 72.5% | 0.36 |
| Headache | 4.3% | 95.7% (×1.3) | 28.9% | 71.1% | 5.5×10⁻³ |
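As an illustration of how one row of this table can be read, the sketch below recomputes the fold increase and a two-sided Fisher exact test for COPD. The counts (20 deaths and 9 recoveries with COPD, 51 deaths and 185 recoveries without) are not given in the source; they are an assumption reconstructed from the reported percentages and the dataset totals of 71 deaths and 194 recoveries. The function `fisher_exact_two_sided` is a hypothetical helper written with the standard library only, not the authors' code.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Assumed counts, reconstructed from the percentages in the COPD row
deaths_with, recov_with = 20, 9        # 20/29  = 69.0% deaths
deaths_without, recov_without = 51, 185  # 51/236 = 21.6% deaths

total = deaths_with + recov_with + deaths_without + recov_without
dataset_death_rate = (deaths_with + deaths_without) / total  # 71/265
fold = (deaths_with / (deaths_with + recov_with)) / dataset_death_rate
p_value = fisher_exact_two_sided(deaths_with, recov_with, deaths_without, recov_without)

print(f"fold increase in deaths: x{fold:.1f}")
print(f"Fisher exact p-value: {p_value:.1e}")
```

The fold increase is the death rate among patients with the feature divided by the death rate of the whole dataset, which is why it is reported only next to the column it refers to.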
Results of the bivariate statistical analysis for continuous variables.
Point-biserial correlation values and significance tests between the continuous variables present in Fig 3 and the outcome.
| Feature | Correlation | P-Value |
|---|---|---|
| Age | -0.46 | 1.3×10⁻¹⁵ |
| PAS | 0.21 | 4.6×10⁻⁴ |
| PAD | 0.24 | 8.1×10⁻⁵ |
| AST | -0.11 | 0.052 |
| Glucose | -0.19 | 1.5×10⁻³ |
| Creatinine | -0.20 | 5.6×10⁻⁴ |
| BUN | -0.45 | 1.1×10⁻¹³ |
| Sodium | -0.16 | 5.0×10⁻³ |
| Potassium | -0.18 | 1.6×10⁻³ |
| PCR | -0.21 | 4.4×10⁻⁴ |
| pO2 | 0.12 | 0.032 |
| PF | 0.46 | 1.1×10⁻¹⁴ |
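The sign of these correlations depends on the outcome coding. With the coding given in the dataset summary ({Death, Recovery} = {0, 1}), a negative value means that higher values of the feature are associated with death. The sketch below shows the standard point-biserial formula on hypothetical toy data (the ages and outcomes are invented for illustration and are not taken from the study); `point_biserial` is an illustrative helper, not the authors' code.

```python
from math import sqrt

def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 variable and a continuous one."""
    n = len(values)
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    mean_all = sum(values) / n
    # Population standard deviation of the continuous variable
    sd = sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    return (mean1 - mean0) / sd * sqrt(len(group1) * len(group0) / n**2)

# Hypothetical toy data: outcome coded as 0 = death, 1 = recovery
outcome = [0, 0, 0, 1, 1, 1, 1, 1]
age = [85, 79, 90, 60, 72, 55, 48, 66]

r = point_biserial(outcome, age)
print(f"point-biserial r = {r:.2f}")  # negative: older toy patients died more often
```

The point-biserial coefficient is numerically the same as the Pearson correlation computed between the 0/1 outcome and the continuous feature.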
Fig 4. BDT obtained with a train-all setting.
BDT trained with the features included in the graph analysis reported in Fig 3. The color of the squares represents the class prevalence: red for deaths, green for recoveries and grey in case of parity. The number of subjects is indicated in the first line of each square, while the last line reports deaths/recoveries. Black edges denote leaves.
BDT performance evaluation.
Comparison between the performance of the classifier developed in this study and classifiers trained on a random set of 7 features. The results of the permutation tests are the average of those obtained from 1,000 permutations.
| Input data | Sensitivity (train all) | Sensitivity (10-fold CV) | Specificity (train all) | Specificity (10-fold CV) | F1 score (train all) | F1 score (10-fold CV) |
|---|---|---|---|---|---|---|
| 7 feat of the tree in | 0.99 | 0.90 | 0.95 | 0.95 | 0.97 | 0.92 |
| 7 random feat | 0.99 | 0.88 | 0.82 | 0.49 | 0.97 | 0.86 |
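For reference, the three metrics in this table follow from the counts of a binary confusion matrix. The sketch below uses the standard definitions with hypothetical counts; which outcome class is treated as positive is not stated in this excerpt, so the counts and the helper `binary_metrics` are assumptions for illustration only.

```python
def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and F1 score from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of positives correctly identified
    specificity = tn / (tn + fp)   # fraction of negatives correctly identified
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

# Hypothetical counts, not taken from the study
sens, spec, f1 = binary_metrics(tp=90, fn=10, tn=45, fp=5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} F1={f1:.2f}")
```

Because F1 ignores true negatives, a classifier can keep a high F1 score while its specificity drops, which is the pattern visible in the random-feature row under cross-validation.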
Fig 5. BDT permutation evaluation.
Histogram reporting the percentage of misclassified subjects in the permutation test. The red bar shows the performance of the BDT reported in Fig 4. All the misclassification rates are calculated as the average over a 10-fold cross-validation.