| Literature DB >> 35453898 |
Petr Šín1, Alica Hokynková1, Marie Nováková2, Andrea Pokorná3, Rostislav Krč4, Jan Podroužek4.
Abstract
Increasingly available open medical and health datasets encourage data-driven research, with the promise of improving patient care through knowledge discovery and algorithm development. Machine learning methods are among the most efficient approaches to such high-dimensional problems, and they are applied in this paper to pressure ulcer prediction in modular critical care data. An inherent property of many health-related datasets is a high number of irregularly sampled, time-variant and scarcely populated features, often exceeding the number of observations. Although machine learning methods are known to work well under such circumstances, many choices regarding model and data processing exist. In particular, this paper addresses both theoretical and practical aspects of applying six classification models to pressure ulcer prediction, utilizing the Medical Information Mart for Intensive Care (MIMIC-IV) database, one of the largest available. Random forest, with an accuracy of 96%, is the best-performing of the machine learning algorithms considered.
Keywords: MIMIC database; MIMIC-IV; artificial neural network; machine learning; open data; pressure injury; pressure ulcer; random forest
Year: 2022 PMID: 35453898 PMCID: PMC9030498 DOI: 10.3390/diagnostics12040850
Source DB: PubMed Journal: Diagnostics (Basel) ISSN: 2075-4418
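The workflow described in the abstract (train tabular classifiers on ICU features, compare them, random forest best) can be sketched with scikit-learn. This is an illustrative sketch only: MIMIC-IV is access-controlled, so synthetic data stands in for the real features, and the feature names and model settings below are assumptions, not the authors' configuration.

```python
# Hypothetical sketch: random-forest classification on MIMIC-IV-style
# tabular features. Synthetic data replaces the real database.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
# Synthetic stand-ins for features such as ICU length of stay,
# fluid input/output and Braden subscores (6 columns, arbitrary).
X = rng.normal(size=(n, 6))
# Synthetic label loosely driven by the first feature, mimicking
# a dominant predictor like "ICU length" in the table below.
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)

# 80:20 split with a fixed seed, as stated in the evaluation table notes.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

On this easy synthetic task the accuracy is high; the 96% figure in the paper applies only to the authors' real MIMIC-IV experiment.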
Characterization of input parameters and their importance for best performing RF model.
| Parameter | Count dec | Count ndec | Ratio | Mean dec | Mean ndec | Data Type | FI | FI Rank |
|---|---|---|---|---|---|---|---|---|
| age | 1979 | 4497 | 2.27 | n/a | n/a | int64 | 9.41 × 10⁻³ | 12 |
| gender | 1979 | 4497 | 2.27 | n/a | n/a | category | 3.64 × 10⁻⁴ | 21 |
| ethnicity | 1979 | 4497 | 2.27 | n/a | n/a | category | 1.01 × 10⁻³ | 19 |
| ICU length | 1979 | 4497 | 2.27 | 0.37 | 0.25 | float64 | 2.72 × 10⁻¹ | 1 |
| input | 1979 | 793 | 0.40 | 2.69 × 10³ | 4615.48 | float64 | 1.27 × 10⁻¹ | 3 |
| output | 1952 | 784 | 0.40 | 1.38 × 10³ | 3543.23 | float64 | 1.73 × 10⁻¹ | 2 |
| height | 1356 | 442 | 0.33 | 168.33 | 170.02 | float64 | 1.59 × 10⁻² | 11 |
| weight | 1472 | 475 | 0.32 | 84.25 | 85.23 | float64 | 3.09 × 10⁻² | 9 |
| blood pressure | 413 | 185 | 0.45 | 118.23 | 119.24 | float64 | 2.60 × 10⁻³ | 15 |
| glucose | 311 | 149 | 0.48 | 159.25 | 148.07 | float64 | 2.84 × 10⁻³ | 13 |
| o2sat | 224 | 112 | 0.50 | 95.35 | 96.39 | float64 | 1.27 × 10⁻³ | 17 |
| Braden sensory | 1519 | 456 | 0.30 | 2.73 | 3.31 | float64 | 3.37 × 10⁻² | 8 |
| Braden moisture | 1518 | 456 | 0.30 | 3.36 | 3.66 | float64 | 4.33 × 10⁻² | 7 |
| Braden activity | 1518 | 456 | 0.30 | 1.21 | 1.65 | float64 | 1.17 × 10⁻¹ | 4 |
| Braden mobility | 1517 | 456 | 0.30 | 2.28 | 2.85 | float64 | 6.28 × 10⁻² | 6 |
| Braden nutrition | 1517 | 456 | 0.30 | 2.16 | 2.51 | float64 | 7.85 × 10⁻² | 5 |
| Braden friction | 1513 | 456 | 0.30 | 1.80 | 2.36 | float64 | 2.13 × 10⁻² | 10 |
| albumin | 344 | 138 | 0.40 | 2.88 | 3.27 | float64 | 2.60 × 10⁻³ | 16 |
| protein | 23 | 9 | 0.39 | 5.65 | 5.81 | float64 | 2.88 × 10⁻⁴ | 22 |
| bilirubin | 491 | 188 | 0.38 | 1.74 | 1.06 | float64 | 2.71 × 10⁻³ | 14 |
| diag. spinal injury | 1979 | 4497 | 2.27 | n/a | n/a | bool | 7.30 × 10⁻⁵ | 23 |
| diag. diarrhea | 1979 | 4497 | 2.27 | n/a | n/a | bool | 1.04 × 10⁻³ | 18 |
| diag. fracture | 1979 | 4497 | 2.27 | n/a | n/a | bool | 5.97 × 10⁻⁴ | 20 |
FI, feature importance; dec, patients with PU; ndec, patients without PU.
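The FI column above is consistent with the impurity-based importances that scikit-learn exposes on a fitted random forest as `feature_importances_` (which sum to 1 across features). A minimal sketch of extracting and ranking them, using synthetic data and hypothetical feature names, not the paper's actual pipeline:

```python
# Sketch: ranking features by random-forest impurity importance.
# Data and names are synthetic; "icu_length" is deliberately made
# the strongest predictor, mirroring the table's top-ranked feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 800
names = ["icu_length", "output", "input", "braden_activity"]
X = rng.normal(size=(n, len(names)))
y = (2 * X[:, 0] + X[:, 1] + rng.normal(size=n) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranked = sorted(zip(names, rf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for name, fi in ranked:
    print(f"{name:>15s}  {fi:.3f}")
```

Note that impurity-based importances are biased toward high-cardinality features; permutation importance is a common cross-check.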
Figure 1. Histograms of non-debiased input parameters before normalization. The PU group is shown in blue, the non-PU group in orange.
Figure 2. Correlation matrix of non-debiased input parameters.
Figure 3. Performance of the six classification models considered at all classification thresholds (ROC curves): (a) k-nearest neighbors, (b) logistic regression, (c) multi-layer perceptron, (d) naïve Bayes, (e) random forest and (f) support vector machines.
Evaluation of machine learning algorithms: scalar performance measures and confusion matrix terms. Values are color-coded on a green (favorable values)-to-red (adverse values) scale.
| Model | Accuracy | Precision (PPV) | Recall (TPR) | F1-Score | AUC | Time [s] | TPR | TNR | FPR | FNR |
|---|---|---|---|---|---|---|---|---|---|---|
| Random Forest | 0.960 | 0.946 | 0.916 | 0.930 | 0.947 | 0.437 | 0.92 | 0.98 | 0.02 | 0.08 |
| Multi-layer Perceptron | 0.944 | 0.899 | 0.911 | 0.905 | 0.934 | 24.130 | 0.91 | 0.96 | 0.04 | 0.09 |
| k-Nearest Neighbors | 0.921 | 0.890 | 0.832 | 0.860 | 0.895 | 0.001 | 0.83 | 0.96 | 0.04 | 0.17 |
| SVM (linear kernel) | 0.873 | 0.785 | 0.779 | 0.782 | 0.845 | 7.825 | 0.78 | 0.91 | 0.09 | 0.22 |
| Naïve Bayes | 0.851 | 0.752 | 0.734 | 0.743 | 0.817 | 0.004 | 0.73 | 0.90 | 0.10 | 0.27 |
| Logistic Regression | 0.842 | 0.816 | 0.595 | 0.688 | 0.770 | 0.042 | 0.59 | 0.94 | 0.06 | 0.41 |
Based on 80:20 split and fixed seed. PPV, positive predictive value; TPR, true positive rate; TNR, true negative rate; FPR, false positive rate; FNR, false negative rate.
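The scalar measures and confusion-matrix terms in the table follow their standard definitions (PPV = TP/(TP+FP), TPR = TP/(TP+FN), TNR = TN/(TN+FP), FNR = 1 − TPR, AUC from the ranking of predicted probabilities). A small self-contained check on a toy prediction vector, not the paper's data:

```python
# Sketch: computing the table's metrics for one model from predicted
# probabilities, using scikit-learn and a hand-made toy example.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.4])
y_pred = (y_prob >= 0.5).astype(int)  # default 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)   # recall / sensitivity
tnr = tn / (tn + fp)   # specificity
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)

acc = accuracy_score(y_true, y_pred)
ppv = precision_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_prob)  # threshold-independent
print(f"acc={acc:.2f} ppv={ppv:.2f} tpr={tpr:.2f} "
      f"tnr={tnr:.2f} f1={f1:.2f} auc={auc:.3f}")
```

Note that accuracy, PPV, TPR, TNR, FPR and FNR all depend on the chosen threshold, while AUC summarizes performance across all thresholds, which is why the paper reports both the scalar measures and the ROC curves of Figure 3.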