| Literature DB >> 33266581 |
Khalil El Hindi, Hussien AlSalman, Safwan Qasem, Saad Al Ahmadi.
Abstract
Text classification is one domain in which the naive Bayesian (NB) learning algorithm performs remarkably well. However, making further improvement in performance using ensemble-building techniques proved to be a challenge because NB is a stable algorithm. This work shows that, while an ensemble of NB classifiers achieves little or no improvement in terms of classification accuracy, an ensemble of fine-tuned NB classifiers can achieve a remarkable improvement in accuracy. We propose a fine-tuning algorithm for text classification that is both more accurate and less stable than the NB algorithm and the fine-tuning NB (FTNB) algorithm. This improvement makes it more suitable than the FTNB algorithm for building ensembles of classifiers using bagging. Our empirical experiments, using 16 benchmark text-classification data sets, show significant improvement for most data sets.
Keywords: ensembles of classifiers; fine-tuning naive Bayesian algorithm; machine learning; naive Bayesian learning; text classification
Year: 2018 PMID: 33266581 PMCID: PMC7512419 DOI: 10.3390/e20110857
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
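To make the abstract's point concrete — bagging pays off only when the base learner is unstable, so bagging plain NB changes little — the sketch below bags a multinomial NB text classifier over bag-of-words counts. It is a minimal illustration assuming scikit-learn and the 20 Newsgroups corpus as stand-ins; it is not the authors' pipeline and does not implement FTNB/GFTNB.

```python
# Illustrative sketch only: bagging a multinomial naive Bayes text classifier.
# The corpus, vectorizer settings, and ensemble size are assumptions for
# demonstration, not the paper's exact experimental setup.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Word-count (bag-of-words) representation, analogous to the *.wc data sets.
corpus = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = CountVectorizer(max_features=2000).fit_transform(corpus.data)
y = corpus.target

single_nb = MultinomialNB()
# scikit-learn >= 1.2 uses `estimator=`; older versions use `base_estimator=`.
bagged_nb = BaggingClassifier(estimator=MultinomialNB(), n_estimators=10, random_state=0)

# Because NB is a stable learner, the bagged ensemble typically scores about
# the same as the single classifier -- the effect the paper reports for ENB.
print("NB        :", cross_val_score(single_nb, X, y, cv=5).mean())
print("Bagged NB :", cross_val_score(bagged_nb, X, y, cv=5).mean())
```

With a stable base learner the two cross-validated scores usually come out nearly identical, which is the motivation for fine-tuning the base classifiers before bagging them.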
A description of the data sets used in the experiments.
| Dataset | #Documents | #Words | #Classes |
|---|---|---|---|
| Fbis.wc | 2463 | 2000 | 17 |
| La1s.wc | 3204 | 13,196 | 6 |
| La2s.wc | 3075 | 12,433 | 6 |
| Oh0.wc | 1003 | 3182 | 10 |
| Oh10.wc | 1050 | 3238 | 10 |
| Oh15.wc | 913 | 3100 | 10 |
| Oh5.wc | 918 | 3012 | 10 |
| Re1.wc | 1657 | 3758 | 25 |
| Re0.wc | 1504 | 2886 | 13 |
| Tr11.wc | 414 | 6429 | 9 |
| Tr12.wc | 313 | 5804 | 8 |
| Tr21.wc | 336 | 7902 | 6 |
| Tr31.wc | 927 | 10,128 | 7 |
| Tr41.wc | 878 | 7454 | 10 |
| Tr45.wc | 690 | 8261 | 10 |
| Wap.wc | 1560 | 8460 | 20 |
Figure 1. Building ensembles using the naive Bayesian (NB) learning algorithm for text classification. ENB: Ensemble of NB classifiers.
Figure 2. Building ensembles of fine-tuning NB (FTNB) classifiers for text classification. EFTNB: Ensemble of FTNB classifiers.
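The FTNB-style fine-tuning the figure refers to can be pictured as a post-training correction loop: after ordinary NB training, misclassified training documents trigger small adjustments to the term-class probabilities. The sketch below shows this general idea only; the step size `eta`, the simple plus/minus-eta update, and the unnormalized adjusted log-probabilities are simplifying assumptions, not the paper's exact FTNB/GFTNB update rules.

```python
# Minimal sketch of the fine-tuning idea behind FTNB-style algorithms.
# NOTE: the update rule and step size are simplifying assumptions for
# illustration; they are not the formulas proposed in the paper.
import numpy as np

def train_nb(X, y, n_classes, alpha=1.0):
    """Plain multinomial NB: class log-priors and term-given-class log-probabilities."""
    n_terms = X.shape[1]
    class_count = np.zeros(n_classes)
    term_count = np.zeros((n_classes, n_terms))
    for c in range(n_classes):
        rows = X[y == c]
        class_count[c] = rows.shape[0]
        term_count[c] = rows.sum(axis=0)
    log_prior = np.log(class_count / class_count.sum())
    smoothed = term_count + alpha                      # Laplace smoothing
    log_cond = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return log_prior, log_cond

def predict(X, log_prior, log_cond):
    """Argmax of the NB joint log-likelihood for each document (row of term counts)."""
    return np.argmax(X @ log_cond.T + log_prior, axis=1)

def fine_tune(X, y, log_prior, log_cond, max_iter=10, eta=0.01):
    """Nudge term log-probabilities on misclassified training documents:
    up for the true class, down for the wrongly predicted class.
    The adjusted values are no longer normalized probabilities -- a simplification."""
    log_cond = log_cond.copy()
    for _ in range(max_iter):
        pred = predict(X, log_prior, log_cond)
        wrong = np.where(pred != y)[0]
        if wrong.size == 0:                  # no training errors left
            break
        for i in wrong:
            terms = X[i] > 0                 # terms occurring in the document
            log_cond[y[i], terms] += eta     # strengthen the true class
            log_cond[pred[i], terms] -= eta  # weaken the predicted (wrong) class
    return log_cond
```

Given a dense count matrix `X` and class labels `y`, one would call `train_nb` and then `fine_tune` on the training data. Because the adjusted model depends on which documents end up misclassified in each bootstrap sample, such classifiers are less stable than plain NB, which is what makes them better candidates for bagging.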
The results of GFTNB and EGFTNB vs. NB and FTNB.
| Data Set | FTNB % | GFTNB % | #Iterations FTNB | #Iterations GFTNB | NB % | GFTNB % | GFTNB % | EGFTNB % | #Iterations EGFTNB | NB % | EGFTNB % |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fbis.wc | 77.55 | 10.3 | 13.1 | 69.96 | 77.67 | 68.1 | 69.96 | ||||
| La1s.wc | 86.33 | 2.8 | 5 | 86.55 | 89.79 | 57.1 | 86.55 | ||||
| La2s.wc | 84.26 | 3.1 | 6.1 | 87.48 | 89.63 | 53.7 | 87.48 | ||||
| Oh0.wc | 89.63 | 3.3 | 4.1 | 91.43 | 91.63 | 46.3 | 91.43 | ||||
| Oh5.wc | 85.73 | 3.2 | 5.9 | 84.42 | 88.13 | 50.6 | 84.42 | ||||
| Oh10.wc | 77.62 | 2.8 | 5.7 | 82.10 | 82.10 | 50.1 | 83.05 | ||||
| Oh15.wc | 83.24 | 2.6 | 4.9 | 85.21 | 85.21 | 52.3 | 85.43 | ||||
| Re0.wc | 79.06 | 6.7 | 5 | 74.47 | 80.19 | 54.3 | 74.47 | ||||
| Re1.wc | 78.76 | 3 | 3.5 | 77.37 | 79.72 | 47 | 77.37 | ||||
| Tr11.wc | 86.96 | 4 | 4.4 | 77.29 | 87.68 | 45.3 | 77.29 | ||||
| Tr12.wc | 92.33 | 2.9 | 3 | 92.97 | 92.65 | 39.5 | 92.65 | ||||
| Tr21.wc | 88.39 | 88.39 | 4.1 | 2.6 | 58.04 | 88.39 | 41.5 | 58.04 | |||
| Tr31.wc | 94.07 | 5.3 | 5.6 | 90.61 | 94.50 | 40 | 90.61 | ||||
| Tr41.wc | 91.23 | 2.6 | 4.2 | 92.14 | 92.14 | 92.14 | 43.6 | 92.14 | |||
| Tr45.wc | 87.97 | 3.1 | 5.1 | 77.25 | 87.97 | 44.3 | 77.25 | ||||
| Wap.wc | 78.91 | 3.1 | 5.4 | 81.35 | 81.67 | 54.9 | 81.35 | ||||
Figure 3. Comparing the FTNB, GFTNB, and NB algorithms.
Figure 4. Comparing GFTNB with an ensemble of 10 GFTNB classifiers.
Figure 5. Ensembles of 10 (GFTNB-10) and 20 (GFTNB-20) classifiers after modifying the termination condition.
The performance of each algorithm compared to NB.
| Algorithm | Average Accuracy Gain over NB | Wins | Ties | Losses |
|---|---|---|---|---|
| ENB | −0.02% | 1 | 12 | 3 |
| FTNB | 3.16% | 7 | 4 | 5 |
| EFTNB | 3.28% | 5 | 9 | 2 |
| GFTNB | 4.92% | 10 | 5 | 1 |
| EGFTNB | 6.69% | 12 | 3 | 1 |
| GFTNB-10 | 6.85% | 12 | 4 | 0 |
| GFTNB-20 | 7.19% | 13 | 3 | 0 |