Yang Zhao, Zoie Shui-Yee Wong, Kwok Leung Tsui.
Abstract
Identifying rare but significant healthcare events in massive unstructured datasets has become a common task in healthcare data analytics. However, imbalanced class distribution in many practical datasets greatly hampers the detection of rare events, as most classification methods implicitly assume an equal occurrence of classes and are designed to maximize the overall classification accuracy. In this study, we develop a framework for learning healthcare data with imbalanced distribution by incorporating different rebalancing strategies. The evaluation results showed that the developed framework can significantly improve the detection accuracy of medical incidents due to look-alike sound-alike (LASA) mix-ups. Specifically, logistic regression combined with the synthetic minority oversampling technique (SMOTE) produces the best detection results, with a significant 45.3% increase in recall (recall = 75.7%) compared with pure logistic regression (recall = 52.1%).
Year: 2018 PMID: 29951182 PMCID: PMC5987310 DOI: 10.1155/2018/6275435
Source DB: PubMed Journal: J Healthc Eng ISSN: 2040-2295 Impact factor: 2.682
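The framework pairs a rebalancing step with a standard classifier. As a rough illustration of the SMOTE step described in the abstract, the sketch below generates synthetic minority samples by interpolating between a minority point and one of its k nearest minority-class neighbours. This is a minimal NumPy-only sketch, not the paper's implementation; the function name `smote` and all parameters are illustrative.

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating each randomly chosen seed point toward one of its k
    nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                      # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]                # k nearest neighbours per point
    base = rng.integers(0, n, size=n_new)            # random seed points
    nbr = nn[base, rng.integers(0, k, size=n_new)]   # one neighbour per seed
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

# toy minority cluster: 20 points in 2-D, oversampled to 60 points
X = np.random.default_rng(1).normal(size=(20, 2))
synthetic = smote(X, n_new=40, k=5)
balanced_minority = np.vstack([X, synthetic])
```

Because each synthetic point is a convex combination of two existing minority points, the new samples stay inside the minority class's region of the feature space; the rebalanced data can then be fed to any classifier, such as the logistic regression used in the study.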
Figure 1. Bias of a linear separator.
Cost matrix.
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | 0 | C(FN) |
| Negative | C(FP) | 0 |
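Under this cost matrix, a probabilistic classifier minimises expected cost by predicting positive whenever p · C(FN) > (1 − p) · C(FP), which is equivalent to lowering the decision threshold below 0.5 when false negatives are costlier than false positives. A small sketch of this decision rule (function names are illustrative, not from the paper):

```python
def cost_threshold(c_fp, c_fn):
    """Expected-cost-minimising probability threshold: predict positive
    when P(positive) >= C(FP) / (C(FP) + C(FN))."""
    return c_fp / (c_fp + c_fn)

def predict(prob_positive, c_fp=1.0, c_fn=1.0):
    """Cost-sensitive decision for one instance (1 = positive)."""
    return int(prob_positive >= cost_threshold(c_fp, c_fn))
```

For example, if a missed incident costs four times as much as a false alarm (C(FN) = 4 · C(FP)), the threshold drops from 0.5 to 0.2, so borderline cases are flagged as positive.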
Figure 2. Framework of the learning procedure.
Confusion matrix.
| | Condition positive | Condition negative |
|---|---|---|
| Test outcome positive | True positive (TP) | False positive (FP) |
| Test outcome negative | False negative (FN) | True negative (TN) |
Performance of conventional classifiers.
| | LR | L.SVM | DT | R.SVM |
|---|---|---|---|---|
| Recall | 0.521 | 0.479 | 0.375 | 0.396 |
| Precision | 0.694 | 0.767 | 0.750 | 0.792 |
| F1 score | 0.595 | 0.590 | 0.500 | 0.528 |
| Accuracy | 0.850 | 0.859 | 0.841 | 0.850 |
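The third metric reported for each classifier is the F1 score, the harmonic mean of precision and recall, and it can be reproduced directly from the recall and precision rows of the table:

```python
def f1(recall, precision):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# LR column of the table: recall 0.521, precision 0.694
print(round(f1(0.521, 0.694), 3))  # → 0.595
```

The same check works for the other columns, e.g. the decision tree (recall 0.375, precision 0.750) yields 0.500.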
Figure 3. Comparison of data-level approaches.
Comparison of classifiers with the best performance (values in parentheses are relative changes from the base classifier).
| | Base classifier | Oversampling | Undersampling | SMOTE |
|---|---|---|---|---|
| Recall | 0.521 | 0.732 (40.50%) | 0.555 (6.53%) | 0.757 (45.30%) |
| Precision | 0.694 | 0.577 (−16.86%) | 0.598 (−13.83%) | 0.597 (−13.98%) |
| F1 score | 0.595 | 0.645 (8.40%) | 0.575 (−3.36%) | 0.665 (11.76%) |
| Overall classification accuracy | 0.850 | 0.829 (−2.47%) | 0.826 (−2.82%) | 0.837 (−1.53%) |
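The parenthesised percentages are relative changes against the base classifier; for example, SMOTE's recall gain is (0.757 − 0.521)/0.521 ≈ +45.3%, the figure quoted in the abstract. A quick arithmetic check:

```python
def rel_change(base, new):
    """Relative change (%) of a metric versus the base classifier."""
    return round((new - base) / base * 100, 2)

print(rel_change(0.521, 0.757))  # → 45.3  (recall, SMOTE vs. base LR)
print(rel_change(0.694, 0.597))  # → -13.98 (precision, SMOTE vs. base LR)
```

The same formula reproduces the remaining entries, e.g. the −1.53% drop in overall accuracy under SMOTE (0.837 vs. 0.850).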
Figure 4. ROC curves of LR with different resampling strategies.
Figure 5. Comparison of cost-sensitive learning methods with various thresholds.
Figure 6. Summary of changes in recall and accuracy for different classifiers.
| A.1 Random oversampling + LR |||||||||
|---|---|---|---|---|---|---|---|---|
| Settings | 1.5 | 2.0 | 2.5 | 3.0 | 3.5 | 4.0 | 4.5 | 5.0 |

| A.2 Random undersampling + LR |||||||||
|---|---|---|---|---|---|---|---|---|
| Settings | 1/1.5 | 1/2.0 | 1/2.5 | 1/3.0 | 1/3.5 | 1/4.0 | 1/4.5 | 1/5.0 |

| A.3 SMOTE + LR ||||||||
|---|---|---|---|---|---|---|---|
| Settings | 400 | 300 | 200 | 100 | 200 | 100 | 100 |
| | 100 | 100 | 100 | 100 | 200 | 200 | 300 |

| A.4 Cost-sensitive learning + LR |||||||||
|---|---|---|---|---|---|---|---|---|
| Settings | 0.5 | 0.25 | 0 | −0.25 | −0.5 | −1 | −1.5 | −2 |