Shahzad Ashraf1, Sehrish Saleem2, Tauqeer Ahmed3, Zeeshan Aslam4, Durr Muhammad5.
Abstract
An imbalanced dataset is one in which at least one class is heavily outnumbered by the others. A machine learning algorithm (classifier) trained on an imbalanced dataset favors the majority (frequently occurring) class over the minority (rarely occurring) classes. Training with an imbalanced dataset therefore poses challenges for classifiers; however, applying suitable techniques to reduce class imbalance can enhance classifier performance. In this study, we consider an imbalanced dataset from an educational context. Initially, we examine the shortcomings of classification with an imbalanced dataset. Then, we apply data-level algorithms for class balancing and compare the performance of classifiers. The performance of the classifiers is measured using the underlying information in their confusion matrices, such as accuracy, precision, recall, and F measure. The results show that classification with an imbalanced dataset may produce high accuracy but low precision and recall for the minority class. The analysis confirms that both undersampling and oversampling are effective for balancing datasets, with oversampling performing better.
Keywords: Class imbalance; Classification; Machine learning; Spread subsampling
Year: 2020 PMID: 32779031 PMCID: PMC7417470 DOI: 10.1186/s42492-020-00055-9
Source DB: PubMed Journal: Vis Comput Ind Biomed Art ISSN: 2524-4442
Comparative analysis of previous work in relation to the class balancing ratio
| Ref. | Class distribution / imbalance ratio |||||||
|---|---|---|---|---|---|---|---|
| Fatima and Mahgoub [ | Class | A | B | | | | |
| | Instances | 62 | 195 | | | | |
| | Imbalance ratio | 1 | 3.14 | | | | |
| Xie et al. [ | Class | A | B | C | D | E | |
| | Instances | 2 | 22 | 38 | 8 | 2 | |
| | Imbalance ratio | 1 | 11 | 19 | 4 | 1 | |
| Xie et al. [ | Class | A | B | C | D | E | |
| | Instances | 1 | 41 | 46 | 14 | 4 | |
| | Imbalance ratio | 1 | 41 | 46 | 14 | 4 | |
| Ashraf et al. [ | Class | Excellent | Very good | Good | Average | Bad | |
| | Instances | 539 | 4336 | 4543 | 347 | 564 | |
| | Imbalance ratio | 1.55 | 12.5 | 13.10 | 1 | 1.60 | |
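The ratios in the table above follow one simple rule: each class count is divided by the count of the rarest class, so the rarest class always has ratio 1. A minimal sketch (the helper name `imbalance_ratios` is ours, not from the paper), using the class counts reported for Ashraf et al., reproduces the listed ratios to within rounding:

```python
# Hypothetical helper: imbalance ratio of each class relative to the
# rarest class, i.e. count / min(count).
def imbalance_ratios(counts):
    smallest = min(counts.values())
    return {cls: n / smallest for cls, n in counts.items()}

# Class counts reported for Ashraf et al. in the table above.
counts = {"Excellent": 539, "Very good": 4336, "Good": 4543,
          "Average": 347, "Bad": 564}
ratios = imbalance_ratios(counts)
# "Average" is the rarest class, so its ratio is 1;
# e.g. "Excellent" is 539 / 347 ≈ 1.55.
```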
Fig. 1 Information flow chart
Representation of the standard confusion matrix
| Actual \ Predicted | Positive | Negative |
|---|---|---|
| Positive | True positive | False negative |
| Negative | False positive | True negative |
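The four cells of this matrix determine every measure used in the paper. A minimal sketch of the standard definitions (accuracy = correct / total; precision = TP / (TP + FP); recall = TP / (TP + FN); F measure = harmonic mean of precision and recall):

```python
# Derive accuracy, precision, recall, and F measure from the four
# confusion-matrix cells (true/false positives and negatives).
def metrics(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

# Example with 27 true positives, 9 false negatives,
# 14 false positives, and 101 true negatives.
acc, prec, rec, f1 = metrics(27, 9, 14, 101)
```

Note that accuracy is computed over all instances, while precision, recall, and F measure are specific to whichever class is treated as positive; this is why a classifier can score high accuracy yet poor minority-class precision and recall.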
Fig. 2 Depiction of oversampling and undersampling
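A minimal sketch of the two data-level strategies depicted in Fig. 2: oversampling replicates minority-class examples, and undersampling discards majority-class examples, until the classes are even. (The paper applies filter-based resampling, e.g. spread subsampling; plain random resampling is shown here only for illustration, and the class sizes 36 and 115 are taken from the imbalanced dataset used later.)

```python
import random

# Random oversampling: duplicate minority examples until the
# minority class matches the majority class in size.
def random_oversample(minority, majority, rng):
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return minority + extra, majority

# Random undersampling: keep only as many majority examples
# as there are minority examples.
def random_undersample(minority, majority, rng):
    return minority, rng.sample(majority, k=len(minority))

rng = random.Random(0)
low = ["Low"] * 36     # minority class
high = ["High"] * 115  # majority class
over_low, over_high = random_oversample(low, high, rng)
under_low, under_high = random_undersample(low, high, rng)
```

Oversampling keeps all the information in the majority class at the cost of duplicated minority instances, while undersampling shrinks the training set and may discard useful majority examples; this trade-off motivates the comparison that follows.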
Results of classification with the imbalanced dataset
| Classifier | Accuracy | Classes | Precision | Recall | F-Measure | Confusion matrix |
|---|---|---|---|---|---|---|
| Naïve Bayes | 84.77% | Low | 0.659 | 0.750 | 0.701 | a b ← classified as: 27 9 (a = Low); 14 101 (b = High) |
| | | High | 0.918 | 0.878 | 0.898 | |
| | | Average | 0.856 | 0.848 | 0.851 | |
| Multilayer perceptron | 80.79% | Low | 0.600 | 0.583 | 0.592 | a b ← classified as: 21 15 (a = Low); 14 101 (b = High) |
| | | High | 0.871 | 0.878 | 0.874 | |
| | | Average | 0.806 | 0.808 | 0.807 | |
| SVM | 89.40% | Low | 0.833 | 0.694 | 0.758 | a b ← classified as: 25 11 (a = Low); 5 110 (b = High) |
| | | High | 0.909 | 0.957 | 0.932 | |
| | | Average | 0.891 | 0.894 | 0.891 | |
| IBk | 78.81% | Low | 0.559 | 0.528 | 0.543 | a b ← classified as: 19 17 (a = Low); 15 100 (b = High) |
| | | High | 0.855 | 0.870 | 0.862 | |
| | | Average | 0.784 | 0.788 | 0.786 | |
| Random forest | 86.09% | Low | 0.727 | 0.667 | 0.696 | a b ← classified as: 24 12 (a = Low); 9 106 (b = High) |
| | | High | 0.898 | 0.922 | 0.910 | |
| | | Average | 0.858 | 0.861 | 0.859 | |
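The reported accuracies can be cross-checked directly from the confusion matrices in the table above; each matrix is written here as ((TP, FN), (FP, TN)) with "Low" as the positive class:

```python
# Confusion matrices from the imbalanced-dataset results,
# one ((TP, FN), (FP, TN)) pair per classifier.
matrices = {
    "Naive Bayes": ((27, 9), (14, 101)),
    "Multilayer perceptron": ((21, 15), (14, 101)),
    "SVM": ((25, 11), (5, 110)),
    "IBk": ((19, 17), (15, 100)),
    "Random forest": ((24, 12), (9, 106)),
}

def accuracy(matrix):
    (tp, fn), (fp, tn) = matrix
    return (tp + tn) / (tp + fn + fp + tn)

# Accuracy as a percentage, rounded to two decimals.
accuracies = {name: round(100 * accuracy(m), 2) for name, m in matrices.items()}
```

Every matrix sums to the same 151 test instances (36 Low, 115 High), which makes the accuracies directly comparable across classifiers.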
Fig. 3 Accuracy comparison and the F measures of classifiers for the minority and majority classes over the imbalanced dataset
Fig. 4 Imbalanced dataset
Fig. 5 Balanced dataset
Classification results after oversampling
| Classifier | Accuracy | Classes | Precision | Recall | F-Measure | Confusion matrix |
|---|---|---|---|---|---|---|
| Naïve Bayes | 87.89% | Low | 0.852 | 0.907 | 0.879 | a b ← classified as: 98 10 (a = Low); 17 98 (b = High) |
| | | High | 0.907 | 0.852 | 0.879 | |
| | | Average | 0.881 | 0.879 | 0.879 | |
| Multilayer perceptron | 91.03% | Low | 0.873 | 0.954 | 0.912 | a b ← classified as: 103 5 (a = Low); 15 100 (b = High) |
| | | High | 0.952 | 0.870 | 0.909 | |
| | | Average | 0.914 | 0.910 | 0.910 | |
| SVM | 88.79% | Low | 0.849 | 0.935 | 0.890 | a b ← classified as: 101 7 (a = Low); 18 97 (b = High) |
| | | High | 0.933 | 0.843 | 0.886 | |
| | | Average | 0.892 | 0.888 | 0.888 | |
| IBk | 83.86% | Low | 0.805 | 0.880 | 0.841 | a b ← classified as: 95 13 (a = Low); 23 92 (b = High) |
| | | High | 0.876 | 0.800 | 0.836 | |
| | | Average | 0.842 | 0.839 | 0.838 | |
| Random forest | 90.13% | Low | 0.898 | 0.898 | 0.898 | a b ← classified as: 97 11 (a = Low); 11 104 (b = High) |
| | | High | 0.904 | 0.904 | 0.904 | |
| | | Average | 0.901 | 0.901 | 0.901 | |
Fig. 6 Performance comparison before and after undersampling
Fig. 7 Class distribution after oversampling
Fig. 8 Performance comparison using the average F measure
Fig. 9 Precision increase with oversampling