Sushruta Mishra, Pradeep Kumar Mallick, Lambodar Jena, Gyoo-Soo Chae
Abstract
In the past few years, classification has evolved considerably. With the constant surge in the amount of data gathered from different sources, efficient processing and analysis of data is becoming difficult. When data are unevenly distributed among classes, classification with machine-learning techniques becomes harder: most algorithms focus on the majority-class samples and ignore the minority class. This data-skewing issue is therefore a critical problem that needs the attention of researchers. The paper emphasizes data preprocessing with sampling techniques to overcome the data-skewing problem. Three sampling techniques, Resampling, SpreadSubSampling, and SMOTE, are applied to reduce the uneven class distribution, and the resulting data are classified with the K-nearest neighbor algorithm. Classification performance is evaluated with several metrics to determine the efficiency of classification.
Keywords: F-score; KNN algorithm; SMOTE; SpreadSubSampling; best first search; data skewing problem; machine learning
Year: 2020 PMID: 32766193 PMCID: PMC7378392 DOI: 10.3389/fpubh.2020.00274
Source DB: PubMed Journal: Front Public Health ISSN: 2296-2565
Figure 1. Under-sampling process.
Figure 2. Over-sampling process.
Figure 3. Our proposed hybrid model based on sampling.
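The under-sampling and over-sampling processes named in Figures 1 and 2 can be illustrated with a minimal pure-Python sketch. This is not the paper's implementation: the function names `undersample` and `oversample_interpolate`, the `max_ratio` cap, and the neighbour count `k` are assumptions introduced here for illustration. The first function randomly drops rows from larger classes until no class exceeds a chosen multiple of the smallest class; the second creates SMOTE-style synthetic minority points by interpolating between a minority row and one of its nearest minority neighbours.

```python
import random
from collections import defaultdict


def undersample(rows, labels, max_ratio=1.0, seed=0):
    """Randomly drop rows so that no class has more than
    max_ratio * (size of the smallest class) rows.
    Hypothetical helper, not the paper's code."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    smallest = min(len(group) for group in by_class.values())
    cap = int(max_ratio * smallest)
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Keep small classes whole; sample larger ones down to the cap.
        kept = group if len(group) <= cap else rng.sample(group, cap)
        out_rows.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_rows, out_labels


def oversample_interpolate(minority_rows, n_new, k=3, seed=0):
    """SMOTE-style over-sampling: each synthetic row lies on the line
    segment between a minority row and one of its k nearest minority
    neighbours. Assumes numeric features and at least two minority rows."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        # Rank the other minority rows by squared Euclidean distance.
        neighbours = sorted(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(base, r)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic
```

With `max_ratio=1.0` the under-sampler yields a uniform class distribution; larger values allow a proportionally larger majority class. The balanced output of either function would then be passed to the classifier, which in the paper is K-nearest neighbor.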
Parameters of a confusion matrix.
| Parameter | Description |
|---|---|
| True positives (TP) | Positive instances correctly identified as positive by the classifier |
| True negatives (TN) | Negative instances correctly identified as negative by the classifier |
| False positives (FP) | Negative instances incorrectly identified as positive by the classifier |
| False negatives (FN) | Positive instances incorrectly identified as negative by the classifier |
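The four confusion-matrix parameters above are enough to compute the metrics named in this record (PPV, sensitivity, F-score, and prediction accuracy). The sketch below uses the standard textbook definitions; the function name `classification_metrics` and the example counts are hypothetical, and this section does not state whether the paper's reported values are per-class or averaged across classes.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard metrics from confusion-matrix counts.
    Hypothetical helper; returns 0.0 where a ratio is undefined."""
    # PPV (precision): share of predicted positives that are truly positive.
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity (recall): share of actual positives that were found.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score: harmonic mean of PPV and sensitivity.
    denom = ppv + sensitivity
    f_score = 2 * ppv * sensitivity / denom if denom else 0.0
    # Accuracy: share of all instances classified correctly.
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "ppv": ppv,
        "sensitivity": sensitivity,
        "f_score": f_score,
        "accuracy": accuracy,
    }


# Illustrative counts only, not taken from the paper.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

On skewed data, accuracy alone can look high even when the minority class is mostly missed, which is why PPV, sensitivity, and F-score are reported alongside it.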
Figure 4. Skeleton of a confusion matrix.
Figure 5. Confusion matrix analysis in breast cancer dataset.
Figure 6. Confusion matrix analysis in diabetes dataset.
Figure 7. Confusion matrix analysis in hepatitis dataset.
Analysis of different sampling techniques on sample disease data sets yielding optimum performance.
| Dataset | Metric | Value | Sampling technique |
|---|---|---|---|
| Breast cancer | PPV | 0.814 | SMOTE and SpreadSubSampling |
| Breast cancer | Sensitivity | 0.820 | SMOTE and SpreadSubSampling |
| Breast cancer | F-score | 0.812 | SMOTE and SpreadSubSampling |
| Breast cancer | Prediction accuracy | 0.814 | SMOTE and SpreadSubSampling |
| Diabetes | PPV | 0.921 | SpreadSubSampling |
| Diabetes | Sensitivity | 0.906 | SpreadSubSampling |
| Diabetes | F-score | 0.905 | SpreadSubSampling |
| Diabetes | Prediction accuracy | 0.884 | SMOTE and SpreadSubSampling |
| Hepatitis | PPV | 0.839 | Resampling |
| Hepatitis | Sensitivity | 0.837 | Resampling |
| Hepatitis | F-score | 0.849 | SMOTE |
| Hepatitis | Prediction accuracy | 0.867 | SMOTE |