Yan Gao1, Min Wang1, Guogang Zhang2, Lingjun Zhou1, Jingming Luo2, Lijue Liu1.
Abstract
Aortic dissection (AD) is a rare, high-risk cardiovascular disease with high mortality. Because its clinical manifestations are complex and variable, it is easily missed or misdiagnosed. In this paper, we propose a clustering-based ensemble learning model, Cluster Random under-sampling Smote-Tomek Bagging (CRST-Bagging), to help clinicians screen for AD patients at an early stage and save their lives. Within this model, the CRST method combines the advantages of K-means++ and the Smote-Tomek sampling method to handle an extremely imbalanced AD dataset, and the Bagging algorithm is then used to predict AD patients. We built the AD dataset from routine examination data of AD patients and other cardiovascular patients collected at Xiangya Hospital. Experiments on the original AD dataset verified the effectiveness of the CRST method for resampling. We compared our model with RUSBoost and SMOTEBagging on the original dataset and on a test dataset, and our model performed better. On the test dataset, our model's precision and recall were 83.6% and 80.7%, respectively, and its F1-score was 82.1%, which is 4.8 and 1.6 percentage points higher than those of RUSBoost and SMOTEBagging, respectively, demonstrating our model's effectiveness in AD screening.
Keywords: aortic dissection; bagging; clustering; imbalanced data; screening
Year: 2022 PMID: 35565052 PMCID: PMC9102711 DOI: 10.3390/ijerph19095657
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 4.614
Figure 1. The structure of the CRST-Bagging model for AD screening.
Figure 2. t-SNE dimensionality reduction map of the AD dataset. (a) Data distribution of non-AD patients after dimensionality reduction. (b) Data distribution of AD patients after dimensionality reduction. (c) Data distribution of non-AD and AD samples in the same space after dimensionality reduction.
Figure 3. Feature missing rate statistics.
Figure 4. Feature importance analysis.
Figure 5. Schematic diagram of the undersampling process of majority samples.
Figure 6. (a) Raw dataset; (b) dataset after SMOTE oversampling; (c) Tomek-link recognition process; (d) dataset after the boundary and noise samples are removed.
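The Tomek-link recognition step named in the Figure 6 caption can be illustrated with a small sketch. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The function name `tomek_links` and the toy points are hypothetical and are not taken from the paper; which member of a link is then removed is a design choice this sketch does not make.

```python
import numpy as np


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances; the diagonal is masked
    # so that a point is never its own nearest neighbour.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    links = []
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j] and i < j:
            links.append((i, int(j)))
    return links


# Toy data (illustrative only): two tight same-class pairs and one
# cross-class pair sitting on the class boundary.
X = np.array([[0.0, 0.0], [0.1, 0.0],
              [1.0, 1.0], [1.2, 1.0],
              [0.55, 0.5], [0.6, 0.5]])
y = np.array([0, 0, 1, 1, 0, 1])
links = tomek_links(X, y)
print(links)  # [(4, 5)]
```

The dense distance matrix keeps the sketch short but grows quadratically with the number of samples; a nearest-neighbour index would be the usual substitute on a larger dataset.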
CRST Sampling Method.
| Method: |
|---|
| 1. Divide the input training sample set P into the majority sample set and the minority sample set. |
| 2. Use the K-means++ algorithm to cluster the training sample set into K clusters. |
| 3. Randomly take p% of the samples from each cluster to obtain a new sample set of K clusters. |
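The cluster-then-sample steps listed above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the clustering is applied to the majority (non-AD) samples, expresses p as a fraction rather than a percentage, uses scikit-learn's `KMeans` with `init="k-means++"`, and runs on synthetic data. The function name `cluster_random_undersample` and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_random_undersample(X_major, n_clusters=5, p=0.2, seed=0):
    """Keep a random fraction p of the samples in each K-means++ cluster."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, init="k-means++",
                n_init=10, random_state=seed)
    labels = km.fit_predict(X_major)
    kept = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        if idx.size == 0:
            continue
        # Keep at least one sample so that no cluster disappears.
        n_keep = max(1, int(round(p * idx.size)))
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    kept = np.sort(np.concatenate(kept))
    return X_major[kept], kept


# Synthetic stand-in for the majority class (assumption, not the AD data).
data_rng = np.random.default_rng(42)
X_major = data_rng.normal(size=(500, 4))
X_small, kept_idx = cluster_random_undersample(X_major, n_clusters=5, p=0.2)
print(X_small.shape)
```

Sampling within each cluster, rather than from the majority class as a whole, is meant to preserve the spread of the majority data while shrinking it; the reduced set would then be passed, together with the minority samples, to the Smote-Tomek step.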
Figure 7. CRST-Bagging algorithm structure diagram.
Experimental results for the XGBoost, Smote, S-T, CCST and CRST methods.
| Method | Precision | Recall | F1 |
|---|---|---|---|
| XGBoost | 0.546 | 0.157 | 0.243 |
| Smote | 0.789 | 0.711 | 0.748 |
| S-T | | 0.723 | 0.749 |
| CCST | 0.778 | 0.765 | 0.771 |
| CRST | 0.782 | | |
Note: The best scores are in bold.
The experimental results of RUSBoost, SMOTEBagging, and CRST-Bagging on the original dataset using seven-fold cross-validation.
| Method | Precision | Recall | F1 | Training Time/Predicting Time (s) |
|---|---|---|---|---|
| RUSBoost | 0.774 | 0.751 | 0.762 | 935.970/0.129 |
| SMOTEBagging | 0.791 | 0.780 | 0.785 | 98.458/0.133 |
| CRST-Bagging | | | | |
Note: The best scores are in bold.
The experimental results of RUSBoost, SMOTEBagging and CRST-Bagging on the test set.
| Method | Precision | Recall | F1 | Predicting Time (s) |
|---|---|---|---|---|
| RUSBoost | 0.788 | 0.759 | 0.773 | 0.036 |
| SMOTEBagging | **0.842** | 0.771 | 0.805 | 0.093 |
| CRST-Bagging | 0.838 | **0.807** | **0.821** | |
Note: The best scores are in bold.
Confusion matrix of RUSBoost (a), SMOTEBagging (b), CRST-Bagging (c) on the test set.
(a) RUSBoost

| Confusion matrix on the test set | Predicted non-AD patient | Predicted AD patient |
|---|---|---|
| Actual non-AD patient | TN 135 | FP 17 |
| Actual AD patient | FN 20 | TP 63 |

(b) SMOTEBagging

| Confusion matrix on the test set | Predicted non-AD patient | Predicted AD patient |
|---|---|---|
| Actual non-AD patient | **TN 140** | FP 12 |
| Actual AD patient | FN 19 | TP 64 |

(c) CRST-Bagging

| Confusion matrix on the test set | Predicted non-AD patient | Predicted AD patient |
|---|---|---|
| Actual non-AD patient | TN 139 | FP 13 |
| Actual AD patient | FN 16 | **TP 67** |
Note: The best scores are in bold.
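As a check on how the confusion-matrix counts relate to the reported metrics, the sketch below computes precision, recall, and F1 from the counts of confusion matrix (a), RUSBoost. The helper name `prf1` is hypothetical; the formulas are the standard definitions, and the result agrees with the RUSBoost row of the test-set table (0.788, 0.759, 0.773) up to rounding.

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 for the positive (AD) class."""
    precision = tp / (tp + fp)   # share of predicted AD that is truly AD
    recall = tp / (tp + fn)      # share of true AD that was detected
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from confusion matrix (a), RUSBoost on the test set.
p, r, f1 = prf1(tp=63, fp=17, fn=20)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

True negatives do not enter any of the three metrics, which is why they remain informative on an imbalanced dataset where non-AD patients dominate.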
Figure 8. ROC graph of RUSBoost, SMOTEBagging, and CRST-Bagging on the test set.