Kung-Jeng Wang, Bunjira Makond, Kung-Min Wang.
Abstract
BACKGROUND: Breast cancer is one of the most critical cancers and a major cause of cancer death among women. Knowing a patient's survivability is essential to ease decision making about medical treatment and financial preparation. Breast cancer data sets are typically imbalanced (i.e., survival patients greatly outnumber non-survival patients), and standard classifiers perform poorly on imbalanced data. Methods to improve the survivability prognosis of breast cancer therefore warrant study.
Year: 2013 PMID: 24207108 PMCID: PMC3829096 DOI: 10.1186/1472-6947-13-124
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Breast cancer survival prognosis studies using SEER data
| Study | Class distribution | Method | Accuracy |
| Delen et al. [8] | Survival: 46%; Non-survival: 54% | C5 DT | 93.62% |
| | | ANN | 91.21% |
| | | LR | 89.20% |
| Bellaachia and Guven [ | Survival: 76.80%; Non-survival: 23.20% | C4.5 DT | 86.70% |
| | | ANN | 86.50% |
| | | Naïve BN | 84.50% |
| Endo et al. [10] | Survival: 81.50%; Non-survival: 18.50% | LR | 85.80% |
| | | J48 DT | 85.60% |
| | | DT (with naïve Bayes) | 84.20% |
| | | ANN | 84.50% |
| | | Naïve BN | 83.90% |
| | | BN | 83.90% |
| | | ID3 DT | 82.30% |
| Liu et al. [ | Survival: 86.52%; Non-survival: 13.48% | C5 DT | 88.05% (AUC = 0.607) |
| | | Under-sampling + C5 DT | 74.22% (AUC = 0.748) |
| | | Bagging algorithm + C5 DT | 76.59% (AUC = 0.768) |
Cancer survivability class distribution
| Class | Number of cases | Percentage |
| Survival (denoted as 0) | 195,172 | 90.68% |
| Non-survival (denoted as 1) | 20,049 | 9.32% |
| Total | 215,221 | 100% |
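Given this 90.68/9.32 split, plain accuracy is uninformative: a classifier that predicts "survival" for every patient is 90.68% accurate yet never identifies a non-survival case. A minimal Python sketch, using the class counts from the table above and the standard g-mean (geometric mean of sensitivity and specificity) reported throughout this study:

```python
import math

# Class counts from the table above (SEER breast cancer data, this study).
n_survival, n_non_survival = 195_172, 20_049
total = n_survival + n_non_survival  # 215,221

# Degenerate classifier: predict "survival" (class 0) for every patient.
tp, fn = 0, n_non_survival  # non-survival (class 1) is the positive class
tn, fp = n_survival, 0

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)            # recall on non-survival cases
specificity = tn / (tn + fp)            # recall on survival cases
g_mean = math.sqrt(sensitivity * specificity)

print(f"accuracy    = {accuracy:.4f}")     # 0.9068 -- looks deceptively strong
print(f"sensitivity = {sensitivity:.4f}")  # 0.0000 -- every non-survivor missed
print(f"g-mean      = {g_mean:.4f}")       # 0.0000 -- exposes the failure
```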
Predictor variables selected in this study
| Variable | Description | Mean | S.D. / no. of categories | Min | Max |
| re_v4 | Race | | 27 | | |
| re_v20 | Grade | | 4 | | |
| re_v24 | Extension of disease | | 32 | | |
| re_sss | Site-specific surgery code | | 9 | | |
| re_v26 | Lymph node involvement | | 9 | | |
| re_v102 | Stage of cancer | | 4 | | |
| re_v104 | SEER modified AJCC stage, 3rd edition | | 9 | | |
| v23 | Tumor size | 20.70 | 16.24 | 0 | 200 |
| v27 | Number of positive nodes | 1.44 | 3.69 | 0 | 79 |
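The predictor set mixes recoded categorical SEER fields with two numeric measures. As a hedged illustration only (the study itself was run in WEKA, so the encoding below is an assumption about how these variables could be prepared for scikit-learn models, not the authors' pipeline):

```python
# Hedged sketch: one-hot encode the recoded categorical SEER fields and
# pass the two numeric measures through unchanged.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical = ["re_v4", "re_v20", "re_v24", "re_sss",
               "re_v26", "re_v102", "re_v104"]  # recoded SEER categories
numeric = ["v23", "v27"]                        # tumor size, positive nodes

preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",  # v23 and v27 pass through as-is
)
```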
Cost matrix
| | Predicted non-survival (1) | Predicted survival (0) |
| Actual non-survival (1) | C(1,1), or TP | C(1,0), or FN |
| Actual survival (0) | C(0,1), or FP | C(0,0), or TN |
Confusion matrix
| | Predicted non-survival (1) | Predicted survival (0) |
| Actual non-survival (1) | TP | FN |
| Actual survival (0) | FP | TN |
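Both matrices are laid out with rows as the actual class and columns as the predicted class, with non-survival coded as 1. A small sketch of the standard minimum-expected-cost rule that a cost-sensitive wrapper such as WEKA's CostSensitiveClassifier applies; the 9:1 cost ratio here is an illustrative assumption, not the paper's tuned value:

```python
import numpy as np

# Cost matrix indexed as cost[actual][predicted], class 1 = non-survival.
# The 9:1 FN:FP cost ratio is an illustrative assumption.
cost = np.array([
    [0.0, 1.0],   # actual 0: C(0,0) = TN cost, C(0,1) = FP cost
    [9.0, 0.0],   # actual 1: C(1,0) = FN cost, C(1,1) = TP cost
])

def cost_sensitive_predict(p_non_survival: float) -> int:
    """Pick the class whose expected misclassification cost is lowest."""
    p = np.array([1.0 - p_non_survival, p_non_survival])  # [P(0), P(1)]
    expected = p @ cost  # expected cost of predicting 0 and of predicting 1
    return int(np.argmin(expected))

# A 9x FN cost moves the decision threshold from 0.5 down to 0.1, so even
# a 15% estimated non-survival risk yields a "non-survival" prediction.
print(cost_sensitive_predict(0.15))  # -> 1
```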
Figure 1. Flow diagram of SMOTE, CSC, under-sampling, bagging, and boosting implementation.
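The five branches in Figure 1 have close analogues in Python's scikit-learn and imbalanced-learn libraries. The sketch below maps each technique to one such component; it is an analogy, not the paper's WEKA setup, and the hyperparameters (e.g., the 1:9 class weights standing in for CSC) are assumptions:

```python
# Analogous Python setup for Figure 1's five branches (not the WEKA setup).
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

models = {
    # SMOTE: synthesize minority-class (non-survival) samples before training.
    "S_DT": Pipeline([("smote", SMOTE()), ("clf", DecisionTreeClassifier())]),
    # Under-sampling: shrink the majority class to rebalance.
    "U_LR": Pipeline([("under", RandomUnderSampler()),
                      ("clf", LogisticRegression(max_iter=1000))]),
    # Cost-sensitive stand-in: class_weight plays the role of a CSC wrapper.
    "C_DT": DecisionTreeClassifier(class_weight={0: 1, 1: 9}),
    # Ensembles over the base decision tree.
    "Ba_DT": BaggingClassifier(DecisionTreeClassifier(), n_estimators=10),
    "Ad_DT": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                n_estimators=10),
}
# Each model is then fit and scored per fold, e.g. models["S_DT"].fit(X, y).
```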
Acronyms and descriptions of all studied models
| Acronym | Description |
| DT_9 | Decision tree algorithm with 9 predictor variables |
| LR_9 | Logistic regression algorithm with 9 predictor variables |
| S_DT_9 | Decision tree algorithm with 9 predictor variables, pre-processed using SMOTE |
| S_LR_9 | Logistic regression algorithm with 9 predictor variables, pre-processed using SMOTE |
| S_DT_10 | Decision tree algorithm with the 10 predictor variables proposed by Endo et al. [10], pre-processed using SMOTE |
| S_LR_10 | Logistic regression algorithm with the 10 predictor variables proposed by Endo et al. [10], pre-processed using SMOTE |
| S_DT_16 | Decision tree algorithm with the 16 predictor variables proposed by Delen et al. [8], pre-processed using SMOTE |
| S_LR_16 | Logistic regression algorithm with the 16 predictor variables proposed by Delen et al. [8], pre-processed using SMOTE |
| S_DT_20 | Decision tree algorithm with 20 predictor variables, pre-processed using SMOTE |
| S_LR_20 | Logistic regression algorithm with 20 predictor variables, pre-processed using SMOTE |
| S_pDT | Pruned decision tree algorithm, pre-processed using SMOTE |
| S_rLR | Logistic regression algorithm pre-processed using SMOTE (built with the same predictor variables as S_pDT) |
| C_DT_9 | Decision tree algorithm with 9 predictor variables, wrapped with CSC |
| C_LR_9 | Logistic regression algorithm with 9 predictor variables, wrapped with CSC |
| C_DT_10 | Decision tree algorithm with the 10 predictor variables proposed by Endo et al. [10], wrapped with CSC |
| C_LR_10 | Logistic regression algorithm with the 10 predictor variables proposed by Endo et al. [10], wrapped with CSC |
| C_DT_16 | Decision tree algorithm with the 16 predictor variables proposed by Delen et al. [8], wrapped with CSC |
| C_LR_16 | Logistic regression algorithm with the 16 predictor variables proposed by Delen et al. [8], wrapped with CSC |
| C_DT_20 | Decision tree algorithm with 20 predictor variables, wrapped with CSC |
| C_LR_20 | Logistic regression algorithm with 20 predictor variables, wrapped with CSC |
| C_pDT | Pruned decision tree algorithm wrapped with CSC |
| C_rLR | Logistic regression algorithm wrapped with CSC (built with the same predictor variables as C_pDT) |
| U_DT_9 | Decision tree algorithm with 9 predictor variables, pre-processed using under-sampling |
| U_LR_9 | Logistic regression algorithm with 9 predictor variables, pre-processed using under-sampling |
| U_DT_10 | Decision tree algorithm with the 10 predictor variables proposed by Endo et al. [10], pre-processed using under-sampling |
| U_LR_10 | Logistic regression algorithm with the 10 predictor variables proposed by Endo et al. [10], pre-processed using under-sampling |
| U_DT_16 | Decision tree algorithm with the 16 predictor variables proposed by Delen et al. [8], pre-processed using under-sampling |
| U_LR_16 | Logistic regression algorithm with the 16 predictor variables proposed by Delen et al. [8], pre-processed using under-sampling |
| U_DT_20 | Decision tree algorithm with 20 predictor variables, pre-processed using under-sampling |
| U_LR_20 | Logistic regression algorithm with 20 predictor variables, pre-processed using under-sampling |
| U_pDT | Pruned decision tree algorithm pre-processed using under-sampling |
| U_rLR | Logistic regression algorithm pre-processed using under-sampling (built with the same predictor variables as U_pDT) |
| Ba_DT_9 | Decision tree algorithm with 9 predictor variables, combined with bagging |
| Ba_LR_9 | Logistic regression algorithm with 9 predictor variables, combined with bagging |
| Ba_DT_10 | Decision tree algorithm with the 10 predictor variables proposed by Endo et al. [10], combined with bagging |
| Ba_LR_10 | Logistic regression algorithm with the 10 predictor variables proposed by Endo et al. [10], combined with bagging |
| Ba_DT_16 | Decision tree algorithm with the 16 predictor variables proposed by Delen et al. [8], combined with bagging |
| Ba_LR_16 | Logistic regression algorithm with the 16 predictor variables proposed by Delen et al. [8], combined with bagging |
| Ba_DT_20 | Decision tree algorithm with 20 predictor variables, combined with bagging |
| Ba_LR_20 | Logistic regression algorithm with 20 predictor variables, combined with bagging |
| Ba_pDT | Pruned decision tree algorithm combined with bagging |
| Ba_rLR | Logistic regression algorithm combined with bagging (built with the same predictor variables as Ba_pDT) |
| Ad_DT_9 | Decision tree algorithm with 9 predictor variables, combined with AdaBoostM1 |
| Ad_LR_9 | Logistic regression algorithm with 9 predictor variables, combined with AdaBoostM1 |
| Ad_DT_10 | Decision tree algorithm with the 10 predictor variables proposed by Endo et al. [10], combined with AdaBoostM1 |
| Ad_LR_10 | Logistic regression algorithm with the 10 predictor variables proposed by Endo et al. [10], combined with AdaBoostM1 |
| Ad_DT_16 | Decision tree algorithm with the 16 predictor variables proposed by Delen et al. [8], combined with AdaBoostM1 |
| Ad_LR_16 | Logistic regression algorithm with the 16 predictor variables proposed by Delen et al. [8], combined with AdaBoostM1 |
| Ad_DT_20 | Decision tree algorithm with 20 predictor variables, combined with AdaBoostM1 |
| Ad_LR_20 | Logistic regression algorithm with 20 predictor variables, combined with AdaBoostM1 |
| Ad_pDT | Pruned decision tree algorithm combined with AdaBoostM1 |
| Ad_rLR | Logistic regression algorithm combined with AdaBoostM1 (built with the same predictor variables as Ad_pDT) |
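The naming scheme is systematic: a prefix for the imbalance technique, a core for the learner, and a suffix for the feature set. A names-only sketch enumerating the grid from those parts (purely illustrative; estimators would be attached as in the earlier pipeline sketch):

```python
# Names-only sketch of the model grid implied by the acronym scheme above
# (the baselines DT_9 and LR_9 sit outside the technique grid).
techniques = ["S", "C", "U", "Ba", "Ad"]  # SMOTE, CSC, under-sampling,
                                          # bagging, AdaBoostM1
learners = ["DT", "LR"]
feature_sets = ["9", "10", "16", "20"]

grid = [f"{t}_{l}_{k}" for t in techniques for l in learners for k in feature_sets]
grid += [f"{t}_pDT" for t in techniques] + [f"{t}_rLR" for t in techniques]
print(len(grid))  # 50 technique models; +2 baselines = 52 in total
```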
Comparative results of models using all techniques versus the standard data-mining models
| Model | Accuracy | Sensitivity | Specificity | g-mean | AUC |
| DT_9 | 0.912 | 0.140 | 0.991 | 0.374 | 0.772 |
| LR_9 | 0.913 | 0.156 | 0.990 | 0.394 | 0.829 |
| S_DT_9 | 0.791 | 0.475 | 0.823 | 0.626 | 0.700 |
| S_LR_9 | 0.759 | 0.645 | 0.771 | 0.705 | 0.783 |
| C_DT_9 | 0.772 | 0.669 | 0.792 | 0.727 | 0.758 |
| C_LR_9 | 0.752 | 0.752 | 0.752 | 0.752 | 0.829 |
| U_DT_9 | 0.748 | 0.748 | 0.749 | 0.748 | 0.798 |
| U_LR_9 | 0.749 | 0.732 | 0.767 | 0.749 | 0.825 |
| Ba_DT_9 | 0.911 | 0.151 | 0.990 | 0.386 | 0.797 |
| Ba_LR_9 | 0.913 | 0.157 | 0.990 | 0.394 | 0.829 |
| Ad_DT_9 | 0.902 | 0.197 | 0.974 | 0.438 | 0.752 |
| Ad_LR_9 | 0.913 | 0.157 | 0.990 | 0.394 | 0.787 |
ANOVA for average g-mean
| Source | Sum of squares | df | Mean square | F | Sig. |
| Between groups | 3.239 | 11 | 0.294 | 4335.466 | 0.000 |
| Within groups | 0.007 | 108 | 0.000 | | |
| Total | 3.246 | 119 | | | |
Tukey’s HSD test for g-mean
| Model | Subset 1 | Subset 2 | Subset 3 | Subset 4 | Subset 5 | Subset 6 | Subset 7 |
| DT_9 | 0.374 | | | | | | |
| Ba_DT_9 | 0.386 | 0.386 | | | | | |
| LR_9 | | 0.394 | | | | | |
| Ba_LR_9 | | 0.394 | | | | | |
| Ad_LR_9 | | 0.394 | | | | | |
| Ad_DT_9 | | | 0.438 | | | | |
| S_DT_9 | | | | 0.626 | | | |
| S_LR_9 | | | | | 0.705 | | |
| C_DT_9 | | | | | | 0.727 | |
| U_DT_9 | | | | | | | 0.748 |
| U_LR_9 | | | | | | | 0.749 |
| C_LR_9 | | | | | | | 0.752 |
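The ANOVA degrees of freedom (11 between, 108 within) imply twelve models with ten g-mean observations each, consistent with ten runs per model. A sketch of how such a comparison could be reproduced with SciPy and statsmodels, using placeholder g-mean samples rather than the paper's per-run values:

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Placeholder samples: ten g-mean values per model, centred on the table's
# means; a real analysis would use the per-run g-means from the experiments.
means = {"DT_9": 0.374, "S_DT_9": 0.626, "U_LR_9": 0.749, "C_LR_9": 0.752}
gmeans = {m: rng.normal(mu, 0.01, size=10) for m, mu in means.items()}

# One-way ANOVA across the models.
f_stat, p_val = f_oneway(*gmeans.values())
print(f"F = {f_stat:.1f}, p = {p_val:.2e}")

# Tukey's HSD groups models into homogeneous subsets at alpha = 0.05.
values = np.concatenate(list(gmeans.values()))
labels = np.repeat(list(gmeans.keys()), 10)
print(pairwise_tukeyhsd(values, labels, alpha=0.05))
```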
Comparative results of models with feature selection using all techniques
| Model | Accuracy | Sensitivity | Specificity | g-mean | AUC |
| S_DT_9 | 0.791 | 0.475 | 0.823 | 0.626 | 0.700 |
| S_LR_9 | 0.759 | 0.645 | 0.771 | 0.705 | 0.783 |
| S_DT_10 | 0.835 | 0.363 | 0.884 | 0.566 | 0.726 |
| S_LR_10 | 0.772 | 0.492 | 0.800 | 0.627 | 0.720 |
| S_DT_16 | 0.869 | 0.310 | 0.926 | 0.536 | 0.731 |
| S_LR_16 | 0.796 | 0.471 | 0.830 | 0.623 | 0.727 |
| S_DT_20 | 0.871 | 0.311 | 0.929 | 0.537 | 0.733 |
| S_LR_20 | 0.791 | 0.476 | 0.824 | 0.626 | 0.726 |
| C_DT_9 | 0.772 | 0.669 | 0.792 | 0.727 | 0.758 |
| C_LR_9 | 0.752 | 0.752 | 0.752 | 0.752 | 0.829 |
| C_DT_10 | 0.723 | 0.734 | 0.722 | 0.728 | 0.774 |
| C_LR_10 | 0.723 | 0.766 | 0.719 | 0.742 | 0.818 |
| C_DT_16 | 0.804 | 0.557 | 0.829 | 0.679 | 0.673 |
| C_LR_16 | 0.662 | 0.810 | 0.647 | 0.724 | 0.814 |
| C_DT_20 | 0.805 | 0.552 | 0.831 | 0.677 | 0.672 |
| C_LR_20 | 0.591 | 0.824 | 0.567 | 0.684 | 0.787 |
| U_DT_9 | 0.748 | 0.748 | 0.749 | 0.748 | 0.798 |
| U_LR_9 | 0.749 | 0.732 | 0.767 | 0.749 | 0.825 |
| U_DT_10 | 0.744 | 0.767 | 0.720 | 0.743 | 0.795 |
| U_LR_10 | 0.743 | 0.762 | 0.724 | 0.743 | 0.817 |
| U_DT_16 | 0.746 | 0.745 | 0.748 | 0.746 | 0.786 |
| U_LR_16 | 0.749 | 0.727 | 0.771 | 0.749 | 0.826 |
| U_DT_20 | 0.746 | 0.744 | 0.748 | 0.746 | 0.785 |
| U_LR_20 | 0.753 | 0.743 | 0.764 | 0.753 | 0.829 |
| Ba_DT_9 | 0.911 | 0.151 | 0.990 | 0.386 | 0.797 |
| Ba_LR_9 | 0.913 | 0.157 | 0.990 | 0.394 | 0.829 |
| Ba_DT_10 | 0.911 | 0.117 | 0.992 | 0.341 | 0.784 |
| Ba_LR_10 | 0.911 | 0.126 | 0.991 | 0.354 | 0.818 |
| Ba_DT_16 | 0.912 | 0.186 | 0.987 | 0.429 | 0.801 |
| Ba_LR_16 | 0.913 | 0.177 | 0.989 | 0.418 | 0.829 |
| Ba_DT_20 | 0.912 | 0.189 | 0.987 | 0.432 | 0.801 |
| Ba_LR_20 | 0.914 | 0.187 | 0.989 | 0.430 | 0.835 |
| Ad_DT_9 | 0.902 | 0.197 | 0.974 | 0.438 | 0.752 |
| Ad_LR_9 | 0.913 | 0.157 | 0.990 | 0.394 | 0.787 |
| Ad_DT_10 | 0.905 | 0.146 | 0.983 | 0.379 | 0.773 |
| Ad_LR_10 | 0.911 | 0.112 | 0.993 | 0.334 | 0.783 |
| Ad_DT_16 | 0.890 | 0.247 | 0.956 | 0.486 | 0.749 |
| Ad_LR_16 | 0.914 | 0.177 | 0.989 | 0.418 | 0.779 |
| Ad_DT_20 | 0.891 | 0.247 | 0.958 | 0.487 | 0.748 |
| Ad_LR_20 | 0.914 | 0.180 | 0.990 | 0.422 | 0.794 |
ANOVA for average g-mean of models using feature selection
| Source | Sum of squares | df | Mean square | F | Sig. |
| Between groups | 8.994 | 39 | 0.231 | 2266.112 | 0.000 |
| Within groups | 0.037 | 360 | 0.000 | | |
| Total | 9.031 | 399 | | | |
Tukey’s HSD test for g-mean of models using feature selection
| Model | Subset 1 | Subset 2 | Subset 3 | Subset 4 | Subset 5 | Subset 6 | Subset 7 | Subset 8 | Subset 9 | Subset 10 | Subset 11 | Subset 12 | Subset 13 | Subset 14 |
| Ad_LR_10 | 0.334 | | | | | | | | | | | | | |
| Ba_DT_10 | 0.341 | 0.341 | | | | | | | | | | | | |
| Ba_LR_10 | | 0.354 | | | | | | | | | | | | |
| Ad_DT_10 | | | 0.379 | | | | | | | | | | | |
| Ba_DT_9 | | | 0.386 | | | | | | | | | | | |
| Ba_LR_9 | | | 0.394 | | | | | | | | | | | |
| Ad_LR_9 | | | 0.394 | | | | | | | | | | | |
| Ad_LR_16 | | | | 0.418 | | | | | | | | | | |
| Ba_LR_16 | | | | 0.418 | | | | | | | | | | |
| Ad_LR_20 | | | | 0.422 | 0.422 | | | | | | | | | |
| Ba_DT_16 | | | | 0.429 | 0.429 | | | | | | | | | |
| Ba_LR_20 | | | | 0.430 | 0.430 | | | | | | | | | |
| Ba_DT_20 | | | | 0.432 | 0.432 | | | | | | | | | |
| Ad_DT_9 | | | | | 0.438 | | | | | | | | | |
| Ad_DT_16 | | | | | | 0.486 | | | | | | | | |
| Ad_DT_20 | | | | | | 0.487 | | | | | | | | |
| S_DT_16 | | | | | | | 0.536 | | | | | | | |
| S_DT_20 | | | | | | | 0.537 | | | | | | | |
| S_DT_10 | | | | | | | | 0.566 | | | | | | |
| S_LR_16 | | | | | | | | | 0.623 | | | | | |
| S_DT_9 | | | | | | | | | 0.626 | | | | | |
| S_LR_20 | | | | | | | | | 0.626 | | | | | |
| S_LR_10 | | | | | | | | | 0.627 | | | | | |
| C_DT_20 | | | | | | | | | | 0.677 | | | | |
| C_DT_16 | | | | | | | | | | 0.679 | | | | |
| C_LR_20 | | | | | | | | | | 0.684 | | | | |
| S_LR_9 | | | | | | | | | | | 0.705 | | | |
| C_LR_16 | | | | | | | | | | | | 0.724 | | |
| C_DT_9 | | | | | | | | | | | | 0.727 | 0.727 | |
| C_DT_10 | | | | | | | | | | | | 0.728 | 0.728 | |
| C_LR_10 | | | | | | | | | | | | | 0.742 | 0.742 |
| U_LR_10 | | | | | | | | | | | | | 0.743 | 0.743 |
| U_DT_10 | | | | | | | | | | | | | 0.743 | 0.743 |
| U_DT_16 | | | | | | | | | | | | | | 0.746 |
| U_DT_20 | | | | | | | | | | | | | | 0.746 |
| U_DT_9 | | | | | | | | | | | | | | 0.748 |
| U_LR_9 | | | | | | | | | | | | | | 0.749 |
| U_LR_16 | | | | | | | | | | | | | | 0.749 |
| C_LR_9 | | | | | | | | | | | | | | 0.752 |
| U_LR_20 | | | | | | | | | | | | | | 0.753 |
Comparative results of models using feature pruning
| Model | Accuracy | Sensitivity | Specificity | g-mean | AUC |
| S_DT_9 | 0.791 | 0.475 | 0.823 | 0.626 | 0.700 |
| S_LR_9 | 0.759 | 0.645 | 0.771 | 0.705 | 0.783 |
| S_pDT | 0.728 | 0.703 | 0.731 | 0.717 | 0.770 |
| S_rLR | 0.747 | 0.717 | 0.750 | 0.734 | 0.811 |
| C_DT_9 | 0.772 | 0.669 | 0.792 | 0.727 | 0.758 |
| C_LR_9 | 0.752 | 0.752 | 0.752 | 0.752 | 0.829 |
| C_pDT | 0.740 | 0.748 | 0.740 | 0.744 | 0.795 |
| C_rLR | 0.770 | 0.719 | 0.776 | 0.747 | 0.824 |
| U_DT_9 | 0.748 | 0.748 | 0.749 | 0.748 | 0.798 |
| U_LR_9 | 0.749 | 0.732 | 0.767 | 0.749 | 0.825 |
| U_pDT | 0.740 | 0.749 | 0.731 | 0.740 | 0.791 |
| U_rLR | 0.745 | 0.703 | 0.787 | 0.743 | 0.823 |
| Ba_DT_9 | 0.911 | 0.151 | 0.990 | 0.386 | 0.797 |
| Ba_LR_9 | 0.913 | 0.157 | 0.990 | 0.394 | 0.829 |
| Ba_pDT | 0.911 | 0.107 | 0.994 | 0.324 | 0.724 |
| Ba_rLR | 0.912 | 0.142 | 0.991 | 0.377 | 0.823 |
| Ad_DT_9 | 0.902 | 0.197 | 0.974 | 0.438 | 0.752 |
| Ad_LR_9 | 0.913 | 0.157 | 0.990 | 0.394 | 0.787 |
| Ad_pDT | 0.911 | 0.161 | 0.988 | 0.397 | 0.822 |
| Ad_rLR | 0.910 | 0.130 | 0.990 | 0.359 | 0.745 |
ANOVA for average g-mean value of models using feature pruning
| Source | Sum of squares | df | Mean square | F | Sig. |
| Between groups | 5.896 | 19 | 0.310 | 2062 | 0.000 |
| Within groups | 0.027 | 180 | 0.000 | | |
| Total | 5.923 | 199 | | | |
Tukey’s HSD test for g-mean of models using feature pruning
| Model | Subset 1 | Subset 2 | Subset 3 | Subset 4 | Subset 5 | Subset 6 | Subset 7 | Subset 8 | Subset 9 |
| Ba_pDT | 0.324 | | | | | | | | |
| Ad_rLR | | 0.359 | | | | | | | |
| Ba_rLR | | 0.377 | 0.377 | | | | | | |
| Ba_DT_9 | | | 0.386 | | | | | | |
| Ba_LR_9 | | | 0.394 | | | | | | |
| Ad_LR_9 | | | 0.394 | | | | | | |
| Ad_pDT | | | 0.397 | | | | | | |
| Ad_DT_9 | | | | 0.438 | | | | | |
| S_DT_9 | | | | | 0.626 | | | | |
| S_LR_9 | | | | | | 0.705 | | | |
| S_pDT | | | | | | 0.717 | 0.717 | | |
| C_DT_9 | | | | | | | 0.727 | 0.727 | |
| S_rLR | | | | | | | 0.734 | 0.734 | 0.734 |
| U_pDT | | | | | | | | 0.740 | 0.740 |
| U_rLR | | | | | | | | 0.743 | 0.743 |
| C_pDT | | | | | | | | 0.744 | 0.744 |
| C_rLR | | | | | | | | | 0.747 |
| U_DT_9 | | | | | | | | | 0.748 |
| U_LR_9 | | | | | | | | | 0.749 |
| C_LR_9 | | | | | | | | | 0.752 |
Comparative results (data from the same database and period as used by Delen et al. [8])
| Model | Accuracy | Sensitivity | Specificity | g-mean | AUC |
| Proposed method (C_rLR) | 0.751 | 0.762 | 0.750 | 0.756 | 0.842 |
| Previous method (LR) | 0.903 | 0.272 | 0.985 | 0.517 | 0.849 |
| Proposed method (C_pDT) | 0.758 | 0.756 | 0.758 | 0.757 | 0.820 |
| Previous method (DT) | 0.903 | 0.279 | 0.984 | 0.524 | 0.769 |
Comparative results (data from the same database and period as used by Endo et al. [10])
| Model | Accuracy | Sensitivity | Specificity | g-mean | AUC |
| Proposed method (C_rLR) | 0.723 | 0.748 | 0.719 | 0.733 | 0.814 |
| Previous method (LR) | 0.897 | 0.226 | 0.988 | 0.472 | 0.832 |
| Proposed method (C_pDT) | 0.747 | 0.756 | 0.746 | 0.752 | 0.812 |
| Previous method (DT) | 0.896 | 0.214 | 0.988 | 0.460 | 0.793 |