Mallory Sheth1,2, Albert Gerovitch1, Roy Welsch1, Natasha Markuzon2.
Abstract
In many data classification problems, a number of methods will give similar accuracy. However, when working with people who are not experts in data science, such as doctors, lawyers, and judges, finding interpretable algorithms can be a critical success factor. Practitioners have a deep understanding of the individual input variables but far less insight into how they interact with each other. For example, there may be ranges of an input variable for which the observed outcome is significantly more or less likely. This paper describes an algorithm for automatic detection of such thresholds, called the Univariate Flagging Algorithm (UFA). The algorithm searches for a separation that optimizes the difference between separated areas while obtaining a high level of support. We evaluate its performance using six sample datasets and demonstrate that thresholds identified by the algorithm align well with published results and known physiological boundaries. We also introduce two classification approaches that use UFA and show that the performance attained on unseen test data is comparable to or better than traditional classifiers when confidence intervals are considered. We identify conditions under which UFA performs well, including applications with large amounts of missing or noisy data, applications with a large number of inputs relative to observations, and applications where incidence of the target is low. We argue that ease of explanation of the results, robustness to missing data and noise, and detection of low-incidence adverse outcomes are desirable features for clinical applications that can be achieved with a relatively simple classifier such as UFA.
Year: 2019 PMID: 31603902 PMCID: PMC6788700 DOI: 10.1371/journal.pone.0223161
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Body temperature for adult sepsis patients.
List of variables for specification of UFA algorithm.
For the purpose of formulation, we consider candidate thresholds below the median value of x.
| Variable | Definition |
|---|---|
| Binary target | |
| Continuous explanatory variable | |
| Values of | |
| Number of observations in | |
| Percent of | |
| Candidate threshold below median of | |
| Number of observations below candidate threshold | |
| Percent of | |
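The quantities listed above (overall target rate, candidate threshold below the median, support and target rate below that threshold) are enough to sketch a univariate threshold search. The Python below is only an illustration under stated assumptions: the scoring rule (a z-statistic comparing the rate below the threshold with the overall rate), the `min_support` parameter, and the name `find_low_threshold` are hypothetical and are not taken from the source.

```python
import math


def find_low_threshold(x, y, min_support=5):
    """Scan candidate thresholds below the median of x; return the best one.

    ASSUMPTION: candidates are scored with a z-statistic comparing the target
    rate at or below the threshold with the overall target rate. The source
    does not give the exact objective, so treat this as a stand-in.
    """
    # Skip observations where x is missing instead of imputing them.
    pairs = sorted((xi, yi) for xi, yi in zip(x, y) if xi is not None)
    n = len(pairs)
    if n == 0:
        return None
    p = sum(yi for _, yi in pairs) / n  # overall rate of the binary target
    if p in (0.0, 1.0):
        return None  # no contrast to detect
    median = pairs[n // 2][0]  # upper median when n is even

    best = None
    below_n = below_pos = 0
    for i, (xi, yi) in enumerate(pairs):
        if xi >= median:
            break  # only thresholds below the median are considered
        below_n += 1
        below_pos += yi
        # Evaluate only at the last occurrence of each distinct x value.
        if i + 1 < n and pairs[i + 1][0] == xi:
            continue
        if below_n < min_support:
            continue  # not enough observations to support a flag
        rate = below_pos / below_n
        z = (rate - p) / math.sqrt(p * (1 - p) / below_n)
        if best is None or abs(z) > abs(best["z"]):
            best = {"threshold": xi, "support": below_n, "rate": rate, "z": z}
    return best


# Toy data: the target is concentrated at low values of x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
print(find_low_threshold(x, y, min_support=2))
```

A threshold above the median could be searched the same way by negating `x`. Requiring a minimum support keeps the search from flagging a handful of extreme observations.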
Fig 2. Number of high mortality and low mortality flags for adult sepsis patients.
Patients who died are indicated by red squares while patients who lived are indicated by blue triangles. For each patient, we counted the number of flags that are associated with a high likelihood of mortality and the number of flags that are associated with a low likelihood of mortality; the solid line represents a prototype of a linear decision boundary that would be designed to minimize the misclassification rate along these two dimensions.
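The per-patient flag counts described in this caption can be sketched as follows. This is a hedged illustration: the three high-mortality flags reuse the thresholds reported for the MIMIC II data in this document, while the temperature flag, the variable names, and the weights in `linear_rule` are hypothetical placeholders, not values from the source.

```python
# Each flag: (variable name, direction, cutoff, associated risk).
FLAGS = [
    ("phosphorus", "gt", 4.5, "high"),
    ("sodium", "lt", 134.9, "high"),
    ("mean_arterial_bp", "lt", 67.4, "high"),
    ("temperature", "gt", 37.0, "low"),  # HYPOTHETICAL flag, for illustration only
]


def count_flags(patient, flags=FLAGS):
    """Return (high_risk_count, low_risk_count) for one patient record.

    Missing measurements (absent key or None) raise no flag.
    """
    high = low = 0
    for name, direction, cutoff, risk in flags:
        value = patient.get(name)
        if value is None:
            continue
        raised = value < cutoff if direction == "lt" else value > cutoff
        if raised:
            if risk == "high":
                high += 1
            else:
                low += 1
    return high, low


def linear_rule(high, low, w_high=1.0, w_low=1.0, bias=0.0):
    """Toy linear decision boundary on the two counts (weights are assumptions)."""
    return w_high * high - w_low * low + bias > 0


patient = {"sodium": 130.0, "mean_arterial_bp": 80.0, "phosphorus": None, "temperature": 37.5}
high, low = count_flags(patient)
print(high, low, linear_rule(high, low))
```

In practice the weights of such a boundary would be fitted to training data. The point of the sketch is that each patient is reduced to two interpretable counts.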
Comparison of error rates on previously unseen data.
Results show out of sample error rate averaged over 1000 runs.
| Algorithm | Pima Indian Diabetes | Wisconsin Breast Cancer | ALL/AML Cancer |
|---|---|---|---|
| N-UFA | 18.2 ± 2.2 | 6.8 ± 1.6 | 0.9 ± 1.9 |
| R-UFA | 19.3 ± 2.3 | 7.2 ± 1.6 | 0.6 ± 1.6 |
| Logistic Regression | 23.0 ± 2.3 | 5.8 ± 1.9 | 44.9 ± 12.9 |
| Random Forest | 24.1 ± 2.5 | 4.1 ± 1.4 | 5.9 ± 5.8 |
| Conditional Inference Tree | 26.1 ± 3.0 | 6.7 ± 1.9 | 9.9 ± 7.0 |
| SVM | 24.1 ± 2.4 | 2.7 ± 1.2 | 20.6 ± 11.4 |
| k-NN | 30.3 ± 2.5 | 7.5 ± 1.7 | 8.8 ± 5.5 |
Robustness to missing data.
Results show out of sample error rate averaged over 1000 runs.
| Dataset | Algorithm | Full Set | 10% Missing | 50% Missing |
|---|---|---|---|---|
| Pima Indian Diabetes | N-UFA | 18.2 ± 2.2 | 24.6 ± 2.4 | 28.0 ± 2.6 |
| Pima Indian Diabetes | R-UFA | 19.3 ± 2.3 | 26.7 ± 2.7 | 28.2 ± 2.7 |
| Pima Indian Diabetes | Logistic Regression | 23.0 ± 2.3 | 27.3 ± 2.7 | 33.7 ± 2.6 |
| Pima Indian Diabetes | Random Forest | 24.1 ± 2.5 | 25.1 ± 2.5 | 31.5 ± 2.6 |
| Pima Indian Diabetes | Conditional Inference Tree | 26.1 ± 3.0 | 27.1 ± 2.9 | 33.6 ± 3.0 |
| Pima Indian Diabetes | SVM | 24.1 ± 2.4 | 24.7 ± 2.6 | 32.8 ± 2.6 |
| Pima Indian Diabetes | k-NN | 30.3 ± 2.5 | 31.0 ± 2.6 | 33.9 ± 2.6 |
| Wisconsin Breast Cancer | N-UFA | 6.8 ± 1.6 | 7.0 ± 1.7 | 6.8 ± 1.8 |
| Wisconsin Breast Cancer | R-UFA | 7.2 ± 1.6 | 6.4 ± 1.6 | 6.2 ± 1.6 |
| Wisconsin Breast Cancer | Logistic Regression | 5.8 ± 1.9 | 8.1 ± 2.1 | 16.3 ± 2.4 |
| Wisconsin Breast Cancer | Random Forest | 4.1 ± 1.4 | 4.8 ± 1.6 | 7.4 ± 1.9 |
| Wisconsin Breast Cancer | Conditional Inference Tree | 6.7 ± 1.9 | 8.8 ± 2.4 | 13.0 ± 2.5 |
| Wisconsin Breast Cancer | SVM | 2.7 ± 1.2 | 6.3 ± 1.8 | 12.4 ± 2.2 |
| Wisconsin Breast Cancer | k-NN | 7.5 ± 1.7 | 9.2 ± 1.8 | 13.3 ± 2.4 |
| ALL/AML Cancer | N-UFA | 0.9 ± 1.9 | 0.8 ± 1.8 | 0.0 ± 0.1 |
| ALL/AML Cancer | R-UFA | 0.6 ± 1.6 | 0.8 ± 1.7 | 0.0 ± 0.0 |
| ALL/AML Cancer | Random Forest | 5.9 ± 5.8 | 9.8 ± 8.7 | 27.2 ± 9.8 |
| ALL/AML Cancer | Conditional Inference Tree | 9.9 ± 7.0 | 11.7 ± 7.3 | 34.3 ± 8.8 |
| ALL/AML Cancer | SVM | 20.6 ± 11.4 | 27.6 ± 13.8 | 35.0 ± 8.3 |
| ALL/AML Cancer | k-NN | 8.8 ± 5.5 | 18.5 ± 9.8 | 31.7 ± 8.8 |
Examples of UFA-defined thresholds, MIMIC II data.
For each variable in the table, the UFA-identified threshold aligns with the known physiological bound, as established by the National Institutes of Health. The mortality rates for patients who violated these thresholds range from 52.7% to 55.9%, much higher than the 30.9% death rate in the septic population overall.
| Variable | Normal Range | Threshold Direction | Threshold Value | Support | Mortality |
|---|---|---|---|---|---|
| Phosphorus Level | 2.4–4.1 mg/dL | More Than | 4.5 | 93 | 52.7% |
| Sodium Level | 135–145 mEq/L | Less Than | 134.9 | 59 | 55.9% |
| Mean Arterial BP | 70–110 mmHg | Less Than | 67.4 | 86 | 55.8% |
Comparison of different classifiers for in-hospital mortality of adult sepsis patients.
The two UFA-based classifiers have predictive performance better than or equal to other commonly used classification techniques when confidence intervals are considered.
| Group | Classifier | Accuracy | AUROC |
|---|---|---|---|
| UFA-based | N-UFA | 77.5% (75.1, 79.9) | 0.819 (0.797, 0.841) |
| UFA-based | RF-UFA | 78.1% | 0.800 (0.779, 0.821) |
| Other | Logistic Regression | 68.7% (65.7, 71.6) | 0.698 (0.642, 0.753) |
| Other | Support Vector Machine | 79.4% (76.2, 82.6) | 0.555 (0.331, 0.780) |
| Other | Decision Tree | 68.8% (66.0, 71.7) | 0.626 (0.575, 0.677) |
| Other | Random Forest | 79.0% (76.9, 81.1) | 0.823 (0.796, 0.851) |
Comparison of different classifiers with varying amounts of missing data.
This table compares the performance of different classifiers for the original MIMIC II data and a version of the MIMIC II data where 50% of observations were replaced randomly with missing values. We see that N-UFA is robust to missing data, with accuracy decreasing just 1.3% as the amount of missing data increases to 50%. An expanded version of Table 6 including confidence intervals and results for 5–25% missing data is available in the S1 Table.
| Group | Classifier | Accuracy (0%) | Accuracy (50%) | Accuracy Δ | AUROC (0%) | AUROC (50%) | AUROC Δ |
|---|---|---|---|---|---|---|---|
| UFA-Based | N-UFA | 77.5% | 76.2% | | 0.819 | 0.790 | |
| Other | Random Forest | 79.0% | 71.9% | | 0.823 | 0.771 | |
| Other | Logistic Regression | 68.7% | 58.3% | | 0.698 | 0.598 | |
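The missing-data experiment described in the caption above can be approximated with a short sketch. It assumes that "observations" means individual cell values and that cells are chosen uniformly at random without replacement; the function name `inject_missing` and the toy rows are hypothetical, not from the source.

```python
import random


def inject_missing(rows, fraction, seed=0):
    """Return a copy of rows with the given fraction of cells set to None.

    ASSUMPTION: cells are sampled uniformly at random without replacement
    across the whole table; the source does not spell out the procedure.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    chosen = rng.sample(cells, round(fraction * len(cells)))
    out = [list(row) for row in rows]  # leave the input untouched
    for i, j in chosen:
        out[i][j] = None
    return out


# Toy measurements (made up): 4 patients x 3 variables.
rows = [
    [36.8, 140.0, 72.0],
    [38.9, 133.0, 65.0],
    [37.2, 138.0, 80.0],
    [35.9, 131.0, 60.0],
]
for row in inject_missing(rows, 0.5):
    print(row)
```

A flag-based classifier can skip a missing value when counting flags, whereas many conventional classifiers need imputation or row deletion first.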
Comparison of different classifiers with varying amounts of imprecise data.
This table compares the performance of different classifiers for the original MIMIC II data and a version of the MIMIC II data where 50% of observations were randomly perturbed by a value ϵ, distributed normally with mean zero and the empirical variance of the variable in question. We see that N-UFA is robust to imprecise data, with accuracy decreasing 1.7% as the amount of imprecise data increases to 50%. An expanded version of Table 7 including confidence intervals and results for 5–25% imprecise data is available in the S2 Table.
| Group | Classifier | Accuracy (0%) | Accuracy (50%) | Accuracy Δ | AUROC (0%) | AUROC (50%) | AUROC Δ |
|---|---|---|---|---|---|---|---|
| UFA-Based | N-UFA | 77.5% | 75.8% | | 0.819 | 0.796 | |
| Other | Random Forest | 79.0% | 76.3% | | 0.823 | 0.802 | |
| Other | Logistic Regression | 68.7% | 68.8% | | 0.698 | 0.681 | |
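The perturbation described in the caption above (adding ϵ drawn from a normal distribution with mean zero and the variable's empirical variance to a random 50% of observations) can be sketched for a single variable. The function name `perturb`, the use of the sample variance, and the toy values are assumptions made for illustration.

```python
import random
import statistics


def perturb(values, fraction, seed=0):
    """Add zero-mean Gaussian noise to a random fraction of the values.

    The noise standard deviation is the empirical (sample) standard deviation
    of the variable, so the noise variance matches the variable's variance.
    ASSUMPTION: sample variance (n - 1 denominator) is used; the source only
    says "empirical variance".
    """
    rng = random.Random(seed)
    sd = statistics.stdev(values)
    picked = rng.sample(range(len(values)), round(fraction * len(values)))
    out = list(values)  # leave the input untouched
    for i in picked:
        out[i] += rng.gauss(0.0, sd)
    return out


# Toy body temperatures (made up).
temps = [36.5, 37.0, 38.2, 39.1, 35.8, 36.9, 37.4, 38.0]
print(perturb(temps, 0.5))
```

Noise of this size is large relative to the variable's spread, which makes it a demanding test of how much a classifier depends on precise input values.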
Fig 3. Landslide days stratified by precipitation and wind.
The percentage of days with a landslide in each quadrant is displayed in red. Thresholds can be combined to find conditions under which the relative risk of a landslide is significantly elevated.
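Combining two thresholds into quadrants, as in this figure, can be sketched as below. Everything numeric here is made up for illustration: the cutoffs, the toy days, and the definition of relative risk as the quadrant rate divided by the overall rate are assumptions, not values or definitions from the source.

```python
def quadrant_rates(days, precip_cut, wind_cut):
    """Stratify days by two thresholds and report the landslide rate per quadrant.

    days: iterable of (precipitation, wind, landslide) with landslide in {0, 1}.
    Keys of the result are (precip_above_cut, wind_above_cut).
    ASSUMPTION: relative risk = quadrant rate / overall rate.
    """
    counts = {}
    for precip, wind, landslide in days:
        key = (precip > precip_cut, wind > wind_cut)
        n, k = counts.get(key, (0, 0))
        counts[key] = (n + 1, k + int(landslide))
    total_n = sum(n for n, _ in counts.values())
    if total_n == 0:
        return {}
    overall = sum(k for _, k in counts.values()) / total_n
    return {
        key: {
            "days": n,
            "rate": k / n,
            "relative_risk": (k / n) / overall if overall > 0 else None,
        }
        for key, (n, k) in counts.items()
    }


# Toy data (made up): (precipitation, wind, landslide).
days = [
    (10, 5, 0), (10, 5, 0), (10, 20, 0), (10, 20, 1),
    (50, 5, 0), (50, 5, 1), (50, 20, 1), (50, 20, 1),
]
for key, stats in sorted(quadrant_rates(days, precip_cut=30, wind_cut=10).items()):
    print(key, stats)
```

In this toy data the quadrant above both cutoffs has twice the overall landslide rate, which is the kind of combined condition the caption refers to.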