LVQ-SMOTE – Learning Vector Quantization based Synthetic Minority Over-sampling Technique for biomedical data
Munehiro Nakamura, Yusuke Kajiwara, Atsushi Otsuka, Haruhiko Kimura.
Abstract
BACKGROUND: Over-sampling methods based on the Synthetic Minority Over-sampling Technique (SMOTE) have been proposed for classification problems involving imbalanced biomedical data. However, the existing over-sampling methods achieve only slightly better, and sometimes worse, results than the simplest SMOTE. To improve on SMOTE, this paper presents a novel over-sampling method that uses codebooks obtained by learning vector quantization. In general, even when an existing SMOTE variant is applied to a biomedical dataset, the empty feature space remains so large that most classification algorithms cannot estimate the borderlines between classes well. To tackle this problem, our over-sampling method generates synthetic samples that occupy more of the feature space than those produced by other SMOTE algorithms. In short, our method generates useful synthetic samples by referring to actual samples taken from real-world datasets.
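For context, the baseline being improved upon, classic SMOTE (Chawla et al., 2002), generates each synthetic sample by interpolating between a minority sample and one of its k nearest minority neighbours. A minimal sketch, not the authors' implementation; function and parameter names are illustrative:

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Classic SMOTE: each synthetic sample lies on the line segment
    between a minority sample and one of its k nearest minority
    neighbours (Euclidean distance)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, dtype=float)
    # pairwise distances among minority samples only
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude the sample itself
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X))
        j = neighbours[i, rng.integers(min(k, len(X) - 1))]
        gap = rng.random()                    # position along the segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)
```

Because the new samples are convex combinations of existing minority pairs, they never leave the segments between real samples, which is exactly the "empty feature space" limitation the abstract refers to.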
Year: 2013 PMID: 24088532 PMCID: PMC4016036 DOI: 10.1186/1756-0381-6-16
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. Example of codebooks obtained by learning vector quantization. These codebooks were extracted from samples in the Iris dataset [12]. Each colored point represents the numerical values of a codebook.
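A codebook in learning vector quantization is a labelled prototype vector trained to represent a region of a class. The basic LVQ1 update, which could produce codebooks like those in Figure 1, can be sketched as follows (a minimal illustration; the paper's exact LVQ variant and parameters are not reproduced here):

```python
import numpy as np

def lvq1(X, y, codebooks_per_class=2, lr=0.1, epochs=30, rng=None):
    """Minimal LVQ1: a codebook vector is pulled toward samples of its
    own class and pushed away from samples of other classes."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    classes = np.unique(y)
    # initialise codebooks from random samples of each class
    W, Wy = [], []
    for c in classes:
        idx = rng.choice(np.flatnonzero(y == c), codebooks_per_class,
                         replace=False)
        W.extend(X[idx]); Wy.extend([c] * codebooks_per_class)
    W, Wy = np.array(W), np.array(Wy)
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            j = np.argmin(np.linalg.norm(W - X[i], axis=1))  # nearest codebook
            sign = 1.0 if Wy[j] == y[i] else -1.0
            W[j] += sign * lr * (X[i] - W[j])
    return W, Wy
```

After training, each codebook sits near the centre of a sub-cluster of its class, which is what Figure 1 visualises for Iris.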
Figure 2. Flow of the proposed over-sampling method. The numbered steps are executed in ascending order.
Figure 3. Example of the distance measure. Euclidean distance is used as the distance measure.
Figure 4. Example of synthetic samples generated by the proposed method. The four synthetic samples in T1 are the actual four samples taken from R1, where T is a target dataset and R is a reference dataset.
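Figures 2–4 suggest the core idea: instead of interpolating between minority samples, the method adopts actual samples from a reference dataset R that lie close, under Euclidean distance (Figure 3), to the target minority samples T. A loose, illustrative sketch of that idea only; the published algorithm's details (codebook construction, neighbourhood size) are not reproduced here:

```python
import numpy as np

def lvq_smote_sketch(target_minority, reference, n_synthetic, rng=None):
    """Sketch of the idea in Figures 2-4: pick a target minority sample,
    find its nearest reference samples by Euclidean distance, and adopt
    one of them as a synthetic sample."""
    rng = np.random.default_rng(rng)
    T = np.asarray(target_minority, dtype=float)
    R = np.asarray(reference, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        t = T[rng.integers(len(T))]
        d = np.linalg.norm(R - t, axis=1)      # Figure 3: Euclidean distance
        near = np.argsort(d)[:4]               # a few nearest reference samples
        synthetic.append(R[rng.choice(near)])
    return np.array(synthetic)
```

Because the adopted points are real samples rather than convex combinations, they can occupy feature space that interpolation-based SMOTE never reaches, which is the advantage the abstract claims.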
Table 1. Benchmark datasets used for our experiments
| Dataset | #Samples | #Features | Class ratio (minority : majority) |
| Breast-w | 683 | 10 | 0.35 : 0.65 |
| Blood | 748 | 4 | 0.23 : 0.77 |
| Colon-cancer | 62 | 2000 | 0.35 : 0.65 |
| Ionosphere | 351 | 34 | 0.36 : 0.64 |
| Leukemia | 72 | 7129 | 0.34 : 0.66 |
| Pima | 768 | 8 | 0.35 : 0.65 |
| Satimage | 6435 | 36 | 0.097 : 0.903 |
| Yeast | 1484 | 8 | 0.034 : 0.966 |
Table 2. Average G-mean for three cases
| NaiveBayes | 76.25% | 77.34% | 78.54% | 78.94% |
| Logistic Tree | 72.88% | 74.21% | 81.21% | 83.64% |
| Neural Network | 75.24% | 79.62% | 80.44% | 80.24% |
| SVM | 72.65% | 73.31% | 80.92% | 83.22% |
| RandomForest | 75.34% | 78.96% | 79.47% | 80.68% |
| OLVQ3 | 75.76% | 74.35% | 80.88% | 82.55% |
"Nothing" indicates that the dataset was left unchanged, i.e., the class imbalance remained. For SMOTE and LVQ-SMOTE, minority samples were over-sampled until their number equalled the number of majority samples.
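As a concrete illustration of the "increased up to the number of the majority samples" rule, the benchmark table implies roughly the following synthetic-sample counts (approximate, since the listed class ratios are rounded):

```python
def synthetic_needed(n_samples, minority_fraction):
    """Synthetic samples required to raise the minority class to the
    size of the majority class."""
    minority = round(n_samples * minority_fraction)
    majority = n_samples - minority
    return majority - minority

# (total samples, minority fraction) taken from the benchmark table
for name, n, frac in [("Blood", 748, 0.23),
                      ("Satimage", 6435, 0.097),
                      ("Yeast", 1484, 0.034)]:
    print(f"{name}: {synthetic_needed(n, frac)} synthetic samples")
```

The most skewed dataset, Yeast, needs on the order of 1,400 synthetic samples, more than twenty times its original minority class, which is why the quality of the synthetic samples matters so much there.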
Table 3. Sensitivity, specificity, and G-mean for each of the datasets
| Dataset | Sens. (SMOTE) | Sens. (LVQ-SMOTE) | Spec. (SMOTE) | Spec. (LVQ-SMOTE) | G-mean (SMOTE) | G-mean (LVQ-SMOTE) |
| Breast-w | 76.40% | 74.16% | 64.21% | 67.89% | 70.31% | 71.03% |
| Blood | 95.44% | 95.00% | 97.38% | 99.04% | 96.41% | 97.02% |
| Colon-cancer | 80.00% | 85.00% | 63.64% | 72.73% | 71.82% | 78.86% |
| Ionosphere | 80.16% | 86.51% | 91.56% | 92.44% | 85.86% | 89.48% |
| Leukemia | 95.65% | 100.0% | 95.92% | 100.0% | 95.79% | 100.0% |
| Pima | 72.76% | 71.27% | 77.60% | 80.20% | 75.18% | 75.73% |
| Satimage | 78.75% | 75.76% | 68.53% | 75.67% | 73.64% | 75.71% |
| Yeast | 74.51% | 71.72% | 86.81% | 90.81% | 80.66% | 81.27% |
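The G-mean reported above is the geometric mean of sensitivity and specificity, a standard metric for imbalanced classification because it punishes a classifier that sacrifices the minority class. A row of the table can be roughly checked (small discrepancies arise because the paper averages per cross-validation fold):

```python
import math

def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (both in %)."""
    return math.sqrt(sensitivity * specificity)

# Leukemia row under SMOTE: sensitivity 95.65%, specificity 95.92%
print(round(g_mean(95.65, 95.92), 2))  # close to the tabulated 95.79% G-mean
```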
These results are for the Logistic Tree, which showed the highest G-mean among the basic classification algorithms in Table 2.
Table 4. G-mean for our proposed method (LVQ-SMOTE) when MWMOTE is used instead of SMOTE in our algorithm
| Dataset | SMOTE | MWMOTE | LVQ-SMOTE | LVQ-SMOTE (MWMOTE) |
| Breast-w | 70.31% | 70.59% | 71.03% | 70.69% |
| Blood | 96.41% | 96.50% | 97.02% | 96.40% |
| Colon-cancer | 71.82% | 71.08% | 78.86% | 79.09% |
| Ionosphere | 85.86% | 85.92% | 89.48% | 91.28% |
| Leukemia | 95.79% | 95.92% | 100.0% | 100.0% |
| Pima | 75.18% | 74.07% | 75.73% | 75.69% |
| Satimage | 73.64% | 73.92% | 75.71% | 77.01% |
| Yeast | 80.66% | 81.20% | 81.27% | 81.38% |
The algorithm used in this experiment is Logistic Tree.
Table 5. Results of β-turn prediction on the BT547 and BT823 datasets
| Dataset | Turn type | MCC (DEBT) | Sens. (DEBT) | Spec. (DEBT) | MCC (DEBT + our method) | Sens. (DEBT + our method) | Spec. (DEBT + our method) |
| BT547 | I | 0.38 | 71.6% | 82.6% | 0.40 | 73.7% | 85.0% |
| | II | 0.33 | 63.0% | 90.8% | 0.31 | 66.7% | 86.1% |
| | IV | 0.27 | 69.8% | 73.3% | 0.38 | 81.6% | 75.2% |
| | VIII | 0.14 | 47.8% | 84.4% | 0.26 | 60.3% | 84.1% |
| | Non-turn | 0.37 | 21.1% | 99.7% | 0.39 | 30.4% | 97.6% |
| BT823 | I | 0.39 | 70.6% | 84.2% | 0.37 | 71.3% | 82.5% |
| | II | 0.33 | 62.7% | 91.2% | 0.30 | 61.4% | 92.1% |
| | IV | 0.27 | 68.3% | 74.4% | 0.35 | 78.4% | 78.9% |
| | VIII | 0.14 | 42.2% | 87.2% | 0.17 | 47.9% | 86.3% |
| | Non-turn | 0.38 | 23.6% | 99.7% | 0.40 | 27.5% | 97.4% |
Table 6. Comparison of MCC scores between DEBT + our method, DEBT, and another β-turn type prediction method
| Dataset | Method | Type I | Type II | Type IV | Type VIII |
| BT547 | DEBT + our method | 0.40 | 0.31 | 0.38 | 0.26 |
| | DEBT | 0.38 | 0.33 | 0.27 | 0.14 |
| | X. Shi et al. | 0.53 | 0.55 | 0.31 | 0.04 |
| BT823 | DEBT + our method | 0.37 | 0.30 | 0.35 | 0.17 |
| | DEBT | 0.38 | 0.33 | 0.27 | 0.14 |
| | X. Shi et al. | 0.64 | 0.63 | 0.32 | 0.13 |