Lijue Liu, Xiaoyu Wu, Shihao Li, Yi Li, Shiyang Tan, Yongping Bai.
Abstract
BACKGROUND: Imbalance between positive and negative outcomes, known as class imbalance, is a common problem in medical data. Despite extensive study, it remains difficult to handle. The main objective of this study was to find an effective integrated approach to the problems posed by class imbalance and to validate the method in an early screening model for a rare cardiovascular disease, aortic dissection (AD).
Keywords: Aortic dissection; Class imbalance; Ensemble learning; SVM
Year: 2022 PMID: 35346181 PMCID: PMC8962101 DOI: 10.1186/s12911-022-01821-w
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Fig. 1 Data flow diagram of the proposed method
Significance test analysis of the indicators used to predict AD
| Variables | AD | Non-AD | Test statistic | P value |
|---|---|---|---|---|
| 1.1 MCV | 91.84 ± 6.82 | 92.10 ± 7.17 | | 0.161 |
| 1.3 HGB | 119.76 ± 21.57 | 119.95 ± 22.47 | | 0.18 |
| 1.4 A/G | 1.40 ± 0.36 | 1.49 ± 0.37 | | 0.88 |
| 1.8 LYMPH | 1.36 ± 0.60 | 1.57 ± 2.03 | | 0.22 |
| 2.3 GIOB | 27.57 ± 5.19 | 26.94 ± 5.32 | 3.31 | 0.13 |
| 2.4 TB | 16.19 ± 21.62 | 13.20 ± 26.81 | 3.13 | 0.07 |
| 2.5 DB | 6.65 ± 11.52 | 5.39 ± 13.53 | 2.63 | 0.09 |
| 2.13 CHO | 4.33 ± 0.43 | 4.37 ± 0.55 | | 0.81 |
| 2.15 LDL | 2.60 ± 0.35 | 2.63 ± 0.46 | | 0.51 |
| 2.24 AG | 15.24 ± 3.68 | 14.95 ± 3.35 | 2.38 | 0.7 |
| 2.30 TSH | 3.18 | 3.52 | | 0.442 |
| 3.9 AT-III | 271.19 ± 17.23 | 271.18 ± 21.52 | 0.01 | 0.77 |
| 4.7 Family history of aortic dissection | 0 (0.00) | 2 (0.00) | 0.031 | 0.861 |
The bold items were features selected by significance test
The underlined items were features selected by logistic regression and not by significance test
Fig. 2 A box plot of randomly selected dataset features
Fig. 3 Scatter diagrams of dataset features
Fig. 4 Flowchart of Algorithm 1
Experimental parameters of models
| Models | Parameters |
|---|---|
| Logistic regression | C = 1, penalty = 'l2' |
| KNN | n_neighbors = 17 |
| SVM | kernel = rbf, C = 4, degree = 3, gamma = 0.004 |
| Decision tree | max_depth = 3 |
| RF | n_estimators = 69 |
| BP | hidden_layer_sizes = 142 |
| AdaBoost | n_estimators = 65 |
| Easy-Ensemble | n_estimators = 65 |
| Proposed model | T = 65 |
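
The parameter names in the table match scikit-learn keyword arguments, although the source does not name the library. As a minimal sketch under that assumption, the settings could be collected and passed to estimators as shown below. `MODEL_PARAMS` and `build_models` are hypothetical names, and Easy-Ensemble and the proposed model are left out because they are not scikit-learn estimators.

```python
# Hyperparameters copied from the table above. Mapping them onto
# scikit-learn classes is an assumption, not something the source states.
MODEL_PARAMS = {
    "Logistic regression": {"C": 1, "penalty": "l2"},
    "KNN": {"n_neighbors": 17},
    "SVM": {"kernel": "rbf", "C": 4, "degree": 3, "gamma": 0.004},
    "Decision tree": {"max_depth": 3},
    "RF": {"n_estimators": 69},
    # Assumed to mean a single hidden layer with 142 units.
    "BP": {"hidden_layer_sizes": (142,)},
    "AdaBoost": {"n_estimators": 65},
}


def build_models(params=MODEL_PARAMS):
    """Instantiate one estimator per table row (scikit-learn imported lazily)."""
    from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    classes = {
        "Logistic regression": LogisticRegression,
        "KNN": KNeighborsClassifier,
        "SVM": SVC,
        "Decision tree": DecisionTreeClassifier,
        "RF": RandomForestClassifier,
        "BP": MLPClassifier,
        "AdaBoost": AdaBoostClassifier,
    }
    return {name: classes[name](**kwargs) for name, kwargs in params.items()}
```

Calling `build_models()` returns unfitted estimators keyed by the model names used in the table.
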
Logistic regression analysis of the indicators used to predict AD
| Variable | B | OR | 95% CI | P value |
|---|---|---|---|---|
| 1.1 MCV | 0.009 | 1.009 | (0.998–1.020) | 0.129 |
| 1.3 HGB | | 0.998 | (0.994–1.002) | 0.316 |
| 1.4 A/G | | 0.642 | (0.332–1.240) | 0.187 |
| 1.5 NEUT | 0.006 | 1.006 | (0.989–1.022) | 0.497 |
| 1.8 LYMPH | | 0.989 | (0.920–1.063) | 0.77 |
| 2.1 TP | 0.044 | 1.045 | (0.985–1.109) | 0.145 |
| 2.2 AIB | | 0.999 | (0.935–1.067) | 0.975 |
| 2.3 GIOB | | 0.954 | (0.892–1.021) | 0.173 |
| 2.4 TB | 0.012 | 1.012 | (1.000–1.024) | 0.053 |
| 2.5 DB | | 0.986 | (0.962–1.010) | 0.257 |
| 2.7 ALT | 0 | 1 | (1.000–1.001) | 0.306 |
| 2.8 AST | 0 | 1 | (1.000–1.000) | 0.475 |
| 2.13 CHO | | 0.749 | (0.523–1.072) | 0.114 |
| 2.15 LDL | 0.167 | 1.182 | (0.787–1.774) | 0.421 |
| 2.16 LDH | 0 | 1 | (1.000–1.000) | 0.702 |
| 2.17 CK | 0 | 1 | (1.000–1.000) | 0.189 |
| 2.18 CK-MB | 0 | 1 | (0.999–1.001) | 0.687 |
| 2.19 MYOG | 0.001 | 1.001 | (1.000–1.002) | 0.088 |
| 2.21 Na+ | | 0.967 | (0.919–1.019) | 0.208 |
| 2.22 Cl- | 0.001 | 1.001 | (0.953–1.052) | 0.956 |
| 2.23 CO2CP | 0.03 | 1.03 | (0.977–1.086) | 0.273 |
| 2.24 AG | | 0.998 | (0.947–1.052) | 0.94 |
| 2.30 TSH | | 0.991 | (0.980–1.003) | 0.154 |
| 3.2 INR | | 0.821 | (0.337–2.002) | 0.665 |
| 3.3 APTT | 0 | 1 | (0.991–1.009) | 0.96 |
| 3.4 FIB | | | | |
| 3.8 PT | | 0.995 | (0.913–1.084) | 0.902 |
| 3.9 AT-III | 0.002 | 1.002 | (0.999–1.006) | 0.214 |
| 4.2 Stomach ache | 0.084 | 1.088 | (0.826–1.433) | 0.55 |
| 4.3 Heart palpitations | | 0.774 | (0.588–1.017) | 0.066 |
| 4.6 Family history of hypertension | 0.07 | 1.073 | (0.846–1.36) | 0.56 |
| 4.7 Family history of aortic dissection | | 0 | 0 | 1 |
| 4.13 Hypertension and duration | | 0.994 | (0.978–1.009) | 0.424 |
| 4.14 Smoking and duration | | 0.998 | (0.991–1.006) | 0.679 |
| 4.15 Stop smoking and duration | | 0.995 | (0.966–1.026) | 0.759 |
| 4.16 Drinking and duration | 0.004 | 1.004 | (0.994–1.015) | 0.41 |
B, unstandardized regression weight; OR, odds ratio; CI, confidence interval
The bold items were features selected by logistic regression
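
In logistic regression the odds ratio is the exponential of the coefficient, OR = exp(B), which is how the B and OR columns of the table relate. The short check below reproduces a few reported rows; the helper name `odds_ratio` is illustrative only.

```python
import math


def odds_ratio(b: float) -> float:
    """Odds ratio implied by an unstandardized logistic regression weight B."""
    return math.exp(b)


# (variable, B, reported OR) triples copied from the table above.
rows = [
    ("1.1 MCV", 0.009, 1.009),
    ("2.1 TP", 0.044, 1.045),
    ("2.15 LDL", 0.167, 1.182),
    ("4.2 Stomach ache", 0.084, 1.088),
]

for name, b, reported in rows:
    print(f"{name}: exp({b}) = {odds_ratio(b):.3f} (reported {reported})")
```

By the same relation, an OR below 1 corresponds to a negative B.
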
Feature importance ranking
| Features | Importance |
|---|---|
| 2.17 CK | 2.74% |
| 3.6 PLGAg | 2.68% |
| 4.21 Age | 2.63% |
| 2.19 MYOG | 2.53% |
| 3.7 TT | 2.37% |
| 4.18 Systolic pressure | 2.36% |
| 3.4 FIB | 2.30% |
| 2.20 K+ | 2.16% |
| 2.28 ESR | 2.14% |
| 1.6 NEUT% | 2.03% |
Sensitivity (Se) and specificity (Sp) of SVM models with different weights on positive and negative samples
| Fold | SVM (1,1) Se | SVM (1,1) Sp | SVM (1.3,1) Se | SVM (1.3,1) Sp | SVM (1.6,1) Se | SVM (1.6,1) Sp | SVM (2,1) Se | SVM (2,1) Sp |
|---|---|---|---|---|---|---|---|---|
| 1st | 0.772 | 0.792 | 0.825 | 0.746 | 0.842 | 0.697 | 0.868 | 0.653 |
| 2nd | 0.746 | 0.807 | 0.754 | 0.751 | 0.754 | 0.691 | 0.789 | 0.669 |
| 3rd | 0.781 | 0.768 | 0.816 | 0.727 | 0.860 | 0.675 | 0.868 | 0.644 |
| 4th | 0.746 | 0.790 | 0.781 | 0.751 | 0.807 | 0.696 | 0.816 | 0.666 |
| 5th | 0.772 | 0.805 | 0.781 | 0.756 | 0.798 | 0.684 | 0.851 | 0.646 |
| 6th | 0.728 | 0.795 | 0.763 | 0.741 | 0.781 | 0.687 | 0.833 | 0.648 |
| 7th | 0.771 | 0.779 | 0.847 | 0.727 | 0.873 | 0.679 | 0.898 | 0.642 |
| Average | 0.759 | 0.791 | 0.795 | 0.734 | 0.816 | 0.687 | 0.846 | 0.653 |
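
Sensitivity is the fraction of positive (AD) samples that are detected, and specificity is the fraction of negative samples that are correctly rejected. The sketch below shows one way to compute both from labels and predictions. The function name and the toy labels are illustrative and are not data from the study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (Se, Sp) for binary labels; assumes both classes are present."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # positive sample correctly detected
            else:
                fn += 1  # positive sample missed
        else:
            if pred == positive:
                fp += 1  # negative sample wrongly flagged
            else:
                tn += 1  # negative sample correctly rejected
    return tp / (tp + fn), tn / (tn + fp)


# Toy example: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
se, sp = sensitivity_specificity(y_true, y_pred)
print(f"Se = {se:.3f}, Sp = {sp:.3f}")
```

In the table, raising the weight on positive samples from SVM (1,1) to SVM (2,1) trades specificity for sensitivity: the averages move from 0.759 / 0.791 to 0.846 / 0.653. In scikit-learn such weights would typically be passed to `SVC` through `class_weight`; whether the authors did so is not stated in the source.
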
Training time of different models (unit: s)
| Fold | AdaBoost | EasyEnsemble | Ensemble model | Random Forest |
|---|---|---|---|---|
| 1st | 3.4 | 185.3 | 55.0 | 0.36 |
| 2nd | 4.2 | 191.2 | 58.4 | 0.31 |
| 3rd | 4.0 | 191.2 | 54.8 | 0.31 |
| 4th | 3.6 | 188.2 | 56.0 | 0.32 |
| 5th | 3.7 | 191.2 | 57.6 | 0.39 |
| 6th | 3.8 | 179.4 | 55.2 | 0.39 |
| 7th | 4.7 | 185.3 | 58.1 | 0.31 |
| Average | 3.9 | 187.3 | 56.4 | 0.34 |
Sensitivity (Se) and specificity (Sp) of ensemble learning models
| Fold | AdaBoost Se | AdaBoost Sp | EasyEnsemble Se | EasyEnsemble Sp | Ensemble model Se | Ensemble model Sp | Random forest Se | Random forest Sp |
|---|---|---|---|---|---|---|---|---|
| 1st | 0.736 | 0.742 | 0.798 | 0.802 | 0.816 | 0.705 | 0.781 | 0.791 |
| 2nd | 0.675 | 0.759 | 0.737 | 0.794 | 0.798 | 0.733 | 0.737 | 0.792 |
| 3rd | 0.772 | 0.744 | 0.825 | 0.793 | 0.842 | 0.704 | 0.807 | 0.775 |
| 4th | 0.631 | 0.765 | 0.702 | 0.816 | 0.807 | 0.724 | 0.693 | 0.789 |
| 5th | 0.754 | 0.748 | 0.798 | 0.803 | 0.860 | 0.717 | 0.754 | 0.821 |
| 6th | 0.631 | 0.762 | 0.781 | 0.802 | 0.825 | 0.730 | 0.728 | 0.810 |
| 7th | 0.711 | 0.765 | 0.693 | 0.818 | 0.847 | 0.715 | 0.695 | 0.810 |
| Average | 70.1% | 75.5% | 76.1% | 80.4% | 82.8% | 71.9% | 74.2% | 79.8% |
| Variance ( | 57.23 | 10.3 | 51.49 | 9.75 | 19.58 | 9.89 | 42.27 | 15.79 |
Sensitivity (Se) and specificity (Sp) of logistic regression, decision tree, KNN and BP
| Fold | Logistic regression Se | Logistic regression Sp | Decision tree Se | Decision tree Sp | KNN Se | KNN Sp | BP Se | BP Sp |
|---|---|---|---|---|---|---|---|---|
| 1st | 0.789 | 0.771 | 0.702 | 0.690 | 0.728 | 0.715 | 0.737 | 0.760 |
| 2nd | 0.754 | 0.786 | 0.596 | 0.660 | 0.684 | 0.709 | 0.754 | 0.773 |
| 3rd | 0.798 | 0.754 | 0.684 | 0.653 | 0.711 | 0.700 | 0.789 | 0.736 |
| 4th | 0.711 | 0.783 | 0.658 | 0.680 | 0.789 | 0.693 | 0.746 | 0.749 |
| 5th | 0.789 | 0.774 | 0.702 | 0.682 | 0.772 | 0.656 | 0.781 | 0.765 |
| 6th | 0.754 | 0.774 | 0.667 | 0.667 | 0.667 | 0.679 | 0.711 | 0.765 |
| 7th | 0.788 | 0.773 | 0.644 | 0.679 | 0.720 | 0.683 | 0.788 | 0.737 |
| Average | 0.769 | 0.774 | 0.665 | 0.673 | 0.724 | 0.691 | 0.758 | 0.755 |
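
The "Average" row is consistent with the arithmetic mean over the seven folds. As a small check, the logistic regression columns can be averaged as follows; the values are copied from the table and the variable names are illustrative.

```python
from statistics import mean

# Per-fold sensitivity and specificity of logistic regression (folds 1 to 7).
lr_se = [0.789, 0.754, 0.798, 0.711, 0.789, 0.754, 0.788]
lr_sp = [0.771, 0.786, 0.754, 0.783, 0.774, 0.774, 0.773]

avg_se = round(mean(lr_se), 3)
avg_sp = round(mean(lr_sp), 3)
print(avg_se, avg_sp)
```
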
Fig. 5 Seven-fold cross-validation results for the sensitivity of AdaBoost, EasyEnsemble (Easy), RF, the ensemble model and SVM (1.3, 1)