Yongqing Zhang, Shaojie Qiao, Rongzhao Lu, Nan Han, Dingxiang Liu, Jiliu Zhou.
Abstract
BACKGROUND: Imbalanced datasets are common in bioinformatics classification problems; that is, the number of negative samples is much larger than the number of positive samples. In particular, data imbalance can cause the performance on the minority class of positive samples to be underestimated. Balancing bioinformatics data is therefore a challenging problem.
Keywords: Imbalanced data; Max-relevance; Min-redundancy; Pearson correlation coefficients; Pseudo-negative sampling
Year: 2019 PMID: 31874622 PMCID: PMC6929457 DOI: 10.1186/s12859-019-3269-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A flow chart of the MMPCC algorithm
Description of datasets
| Dataset | Positive | Negative | Attributes | Ratio |
|---|---|---|---|---|
| CMC | 333 | 1140 | 9 | 3.4 |
| Haberman | 81 | 225 | 3 | 2.7 |
| Solar Flare | 69 | 1320 | 10 | 19.1 |
| Oil | 41 | 896 | 49 | 21.9 |
| PDNA-543 | 9549 | 134995 | 180 | 14.1 |
| PDNA-316 | 5609 | 67109 | 180 | 11.9 |
| SNP | 183 | 2891 | 25 | 15.7 |
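The Ratio column appears to be the number of negative samples divided by the number of positive samples. The minimal sketch below recomputes it from the counts in the table. The names `DATASETS` and `imbalance_ratio` are illustrative and not from the source, and the computed values may differ slightly from the table in the last digit.

```python
# Class counts (positive, negative) copied from the dataset table.
DATASETS = {
    "CMC": (333, 1140),
    "Haberman": (81, 225),
    "Solar Flare": (69, 1320),
    "Oil": (41, 896),
    "PDNA-543": (9549, 134995),
    "PDNA-316": (5609, 67109),
    "SNP": (183, 2891),
}


def imbalance_ratio(positive, negative):
    """Assumed definition: negatives per positive sample."""
    return negative / positive


if __name__ == "__main__":
    for name, (pos, neg) in DATASETS.items():
        print(f"{name}: {imbalance_ratio(pos, neg):.2f}")
```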
Performance comparison of classifiers under different percentages of pseudo-negative samples on the CMC data
| Percentage | Classifier | Sen(%) | Spe(%) | Acc(%) | MCC |
|---|---|---|---|---|---|
| 0 | DA | 9.38 | 97.81 | 77.8 | 0.156 |
| 0 | AdaBoost | 21.37 | 94.48 | 77.94 | 0.226 |
| 0 | RF | 28.19 | 92.8 | 78.2 | 0.27 |
| 0 | NN | 27.01 | 87.09 | 73.52 | 0.161 |
| 10 | DA | 17.6 | 94.85 | 75.7 | 0.198 |
| 10 | AdaBoost | 25.76 | 93.58 | 76.78 | 0.266 |
| 10 | RF | 39.22 | 91.77 | 78.75 | 0.369 |
| 10 | NN | 40.92 | 86.98 | 75.55 | 0.302 |
| 20 | DA | 37.35 | 91.71 | 76.94 | 0.351 |
| 20 | AdaBoost | 40.03 | 91.24 | 77.33 | 0.36 |
| 20 | RF | 43.94 | 91.24 | 78.41 | 0.404 |
| 20 | NN | 47.28 | 87.22 | 76.38 | 0.368 |
| 30 | DA | 52.46 | 88.34 | 77.8 | 0.438 |
| 30 | AdaBoost | 50.89 | 88.83 | 77.67 | 0.431 |
| 30 | RF | 50.87 | 89.98 | 78.48 | 0.448 |
| 30 | NN | 53.39 | 87.86 | 77.73 | 0.439 |
| 40 | DA | 59.46 | 87.21 | 78.43 | 0.485 |
| 40 | AdaBoost | 56.01 | 87.61 | 77.61 | 0.461 |
| 40 | RF | 56.45 | 90.27 | 79.57 | 0.505 |
| 40 | NN | 54.94 | 86.68 | 76.64 | 0.439 |
| 50 | DA | 66.78 | 85.42 | 79.08 | 0.530 |
| 50 | AdaBoost | 64.01 | 87.37 | 79.42 | 0.531 |
| 50 | RF | 62 | 88.71 | 79.63 | 0.532 |
| 50 | NN | 61.02 | 87.38 | 78.42 | 0.505 |
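The result tables report Sen, Spe, Acc and MCC. This excerpt does not spell out the formulas, so the sketch below assumes the standard confusion-matrix definitions of sensitivity, specificity, accuracy and the Matthews correlation coefficient. The function name `binary_metrics` and the example counts are illustrative only.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sen = tp / (tp + fn)                   # sensitivity: recall on positives
    spe = tn / (tn + fp)                   # specificity: recall on negatives
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sen": sen, "Spe": spe, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical classifier that labels every CMC sample as negative
    # (333 positives, 1140 negatives, as in the dataset table).
    for name, value in binary_metrics(tp=0, fn=333, tn=1140, fp=0).items():
        print(f"{name}: {value:.3f}")
```

With the CMC class counts, predicting every sample as negative already gives an accuracy of about 77% while Sen and MCC are both 0, which suggests why Sen and MCC are reported alongside Acc for imbalanced data.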
Performance comparison of classifiers under different percentages of pseudo-negative samples on the Haberman data
| Percentage | Classifier | Sen(%) | Spe(%) | Acc(%) | MCC |
|---|---|---|---|---|---|
| 0 | DA | 17.33 | 95.42 | 74.79 | 0.212 |
| 0 | AdaBoost | 29.19 | 90.89 | 74.71 | 0.266 |
| 0 | RF | 34.2 | 82.84 | 70.07 | 0.197 |
| 0 | NN | 27.98 | 87.28 | 71.68 | 0.202 |
| 10 | DA | 21.77 | 93.96 | 72.96 | 0.236 |
| 10 | AdaBoost | 32.72 | 86.12 | 70.58 | 0.214 |
| 10 | RF | 33.38 | 83.91 | 69.38 | 0.197 |
| 10 | NN | 30.37 | 82.01 | 67.04 | 0.144 |
| 20 | DA | 30.51 | 94.41 | 74.2 | 0.340 |
| 20 | AdaBoost | 46.68 | 87.54 | 74.26 | 0.370 |
| 20 | RF | 45.01 | 81.32 | 69.59 | 0.272 |
| 20 | NN | 37.42 | 82.97 | 68.57 | 0.222 |
| 30 | DA | 31.73 | 95.1 | 73.32 | 0.36 |
| 30 | AdaBoost | 51.81 | 87.15 | 75.65 | 0.422 |
| 30 | RF | 51.06 | 79.6 | 70 | 0.311 |
| 30 | NN | 42.39 | 84.54 | 70.36 | 0.291 |
| 40 | DA | 37.13 | 94.38 | 72.93 | 0.404 |
| 40 | AdaBoost | 50.73 | 86.1 | 72.87 | 0.396 |
| 40 | RF | 56.81 | 78.38 | 69.95 | 0.359 |
| 40 | NN | 53.63 | 81 | 70.6 | 0.35 |
| 50 | DA | 38.61 | 93.83 | 71.74 | 0.405 |
| 50 | AdaBoost | 61.46 | 82.26 | 73.81 | 0.447 |
| 50 | RF | 60.75 | 78.22 | 70.95 | 0.395 |
| 50 | NN | 52.41 | 79.81 | 68.56 | 0.339 |
Fig. 2 Performance comparison of RF and NN classifiers on PDNA-543 data under different percentages of pseudo-negative samples
Fig. 3 Performance comparison of RF and NN classifiers on PDNA-316 data under different percentages of pseudo-negative samples
Fig. 4 Performance comparison of RF and NN classifiers on SNP data under different percentages of pseudo-negative samples
Fig. 5 Comparison of the MMPCC, MAXR and MINR algorithms on the RF and NN classifiers in terms of Sen, Spe, Acc and MCC
Performance comparison between MMPCC and SMOTE under different percentages of pseudo-negative samples
| Percentage(%) | Methods | Classifiers | Sen(%) | Spe(%) | Acc(%) | MCC |
|---|---|---|---|---|---|---|
| 10 | MMPCC | RF | 17.13 | 99.44 | 92.46 | 0.333 |
| 10 | MMPCC | NN | 17.05 | 99.23 | 92.25 | 0.312 |
| 10 | SMOTE | RF | 16.01 | 98.27 | 91.34 | 0.235 |
| 10 | SMOTE | NN | 5.2 | 99.69 | 91.68 | 0.16 |
| 20 | MMPCC | RF | 17.28 | 99.45 | 91.84 | 0.337 |
| 20 | MMPCC | NN | 24.6 | 99.22 | 92.31 | 0.405 |
| 20 | SMOTE | RF | 17.07 | 98.14 | 90.75 | 0.246 |
| 20 | SMOTE | NN | 8.05 | 99.49 | 91.16 | 0.2 |
| 30 | MMPCC | RF | 18.18 | 99.44 | 91.29 | 0.351 |
| 30 | MMPCC | NN | 30.38 | 99.16 | 92.27 | 0.464 |
| 30 | SMOTE | RF | 17.69 | 97.95 | 90.08 | 0.25 |
| 30 | SMOTE | NN | 10.16 | 99.23 | 90.5 | 0.216 |
| 40 | MMPCC | RF | 19.09 | 99.43 | 90.75 | 0.363 |
| 40 | MMPCC | NN | 35.94 | 99.08 | 92.26 | 0.513 |
| 40 | SMOTE | RF | 18.54 | 97.8 | 89.5 | 0.258 |
| 40 | SMOTE | NN | 12.07 | 99.14 | 90.02 | 0.243 |
| 50 | MMPCC | RF | 19.56 | 99.39 | 90.16 | 0.367 |
| 50 | MMPCC | NN | 38.82 | 99.13 | 92.15 | 0.543 |
| 50 | SMOTE | RF | 18.5 | 97.72 | 88.9 | 0.258 |
| 50 | SMOTE | NN | 14.05 | 99.01 | 89.55 | 0.266 |
Classification results of the Solar Flare dataset with a highly imbalanced ratio
| Percentage | Classifier | Sen(%) | Spe(%) | Acc(%) | MCC |
|---|---|---|---|---|---|
| 0 | RF | 1.43 | 99.02 | 94.24 | 0.01 |
| 0 | NN | 7.25 | 96.90 | 92.51 | 0.05 |
| 10 | RF | 4.00 | 99.16 | 94.02 | 0.06 |
| 10 | NN | 8.01 | 96.73 | 91.94 | 0.06 |
| 20 | RF | 13.53 | 99.39 | 94.31 | 0.23 |
| 20 | NN | 20.88 | 97.63 | 93.09 | 0.24 |
| 30 | RF | 25.03 | 99.08 | 94.39 | 0.39 |
| 30 | NN | 32.03 | 97.08 | 92.95 | 0.33 |
| 40 | RF | 20.68 | 98.92 | 93.59 | 0.32 |
| 40 | NN | 28.33 | 96.44 | 91.79 | 0.28 |
| 50 | RF | 32.57 | 98.84 | 93.95 | 0.44 |
| 50 | NN | 35.48 | 97.05 | 92.51 | 0.38 |
Classification results of the Oil dataset with a highly imbalanced ratio
| Percentage | Classifier | Sen(%) | Spe(%) | Acc(%) | MCC |
|---|---|---|---|---|---|
| 0 | RF | 14.50 | 99.68 | 96.07 | 0.27 |
| 0 | NN | 52.18 | 98.90 | 96.83 | 0.58 |
| 10 | RF | 19.60 | 99.55 | 95.74 | 0.32 |
| 10 | NN | 51.95 | 98.54 | 96.37 | 0.54 |
| 20 | RF | 33.53 | 98.98 | 95.60 | 0.43 |
| 20 | NN | 41.26 | 98.65 | 95.72 | 0.48 |
| 30 | RF | 39.83 | 98.99 | 95.65 | 0.50 |
| 30 | NN | 45.83 | 98.29 | 95.32 | 0.51 |
| 40 | RF | 50.36 | 99.31 | 96.26 | 0.63 |
| 40 | NN | 54.96 | 97.83 | 95.08 | 0.55 |
| 50 | RF | 49.76 | 98.75 | 95.52 | 0.59 |
| 50 | NN | 48.09 | 97.84 | 94.58 | 0.51 |
Fig. 6 Classification results of the Solar Flare dataset with a highly imbalanced ratio, in terms of Sen and MCC
Fig. 7 Classification results of the Oil dataset with a highly imbalanced ratio, in terms of Sen and MCC