Demeke Endalie, Getamesay Haile, Wondmagegn Taye Abebe.
Abstract
Text classification is the process of assigning documents to a predefined set of categories based on their content. Text classification algorithms typically represent documents as collections of words and therefore deal with a large number of features. Selecting appropriate features becomes important when the initial feature set is large. In this paper, we present a feature selection method for Amharic text classification that is a hybrid of document frequency (DF) and a genetic algorithm (GA). We evaluate this method on Amharic news documents obtained from the Ethiopian News Agency (ENA), divided into 13 categories. Our experimental results show that the proposed feature selection method outperforms other feature selection methods used for Amharic news document classification. Combined with the Extra Tree Classifier (ETC), the proposed method achieves classification accuracy about 1% higher than a hybrid of DF, information gain (IG), chi-square (CHI), and principal component analysis (PCA), 2.47% higher than GA alone, and 3.86% higher than a hybrid of DF, IG, and CHI.
Keywords: Chi-square; Document frequency; Extra tree classifier; Feature selection; Genetic algorithm; Information gain; Text classification
Year: 2022 PMID: 35634124 PMCID: PMC9137894 DOI: 10.7717/peerj-cs.961
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 1. The architecture of the proposed Amharic text classifier.
List of consonants normalized in the study.
| Canonical form | Characters to be replaced |
|---|---|
| hā(ሀ) | hā(ሃ፣ኃ፣ኀ፣ሐ፣ሓ) |
| se(ሰ) | se(ሠ) |
| ā(አ) | ā(ኣ፣ዐ፣ዓ) |
| ts’e(ጸ) | ts’e(ፀ) |
| wu(ው) | wu(ዉ) |
| go(ጐ) | go(ጎ) |
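The replacements in the table above can be applied with a simple character translation. The following is a minimal sketch, not the authors' implementation: it covers only a subset of the listed base-form characters, and the names `CANONICAL_TO_VARIANTS` and `normalize` are hypothetical.

```python
# Canonical character -> variants to be replaced (a subset of the table above).
CANONICAL_TO_VARIANTS = {
    "ሀ": "ሃኃኀሐሓ",
    "ሰ": "ሠ",
    "አ": "ኣዓ",
    "ጸ": "ፀ",
    "ው": "ዉ",
    "ጐ": "ጎ",
}

# Build a translation table mapping every variant to its canonical form.
TRANSLATION = str.maketrans(
    {variant: canonical
     for canonical, variants in CANONICAL_TO_VARIANTS.items()
     for variant in variants}
)


def normalize(text: str) -> str:
    """Replace each variant consonant with its canonical form."""
    return text.translate(TRANSLATION)
```

Characters that do not appear in the mapping pass through unchanged, so the function is safe to apply to mixed text.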
Figure 2. State chart for feature selection using genetic algorithm.
Figure 3. A pictorial description of the proposed feature selection.
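The document frequency (DF) part of the hybrid method can be illustrated with a short sketch. It assumes DF acts as a first-pass filter before the GA search; the thresholds `min_df` and `max_df_ratio`, the function names, and the English placeholder tokens are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def select_by_df(tokenized_docs, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither too rare nor present in almost every document."""
    df = document_frequency(tokenized_docs)
    n_docs = len(tokenized_docs)
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


# Placeholder documents; a real run would use tokenized Amharic news text.
docs = [
    ["market", "price", "rise"],
    ["market", "team", "win"],
    ["team", "match", "win"],
    ["market", "price", "team"],
]
selected = select_by_df(docs)
```

In this toy corpus the terms that occur in only one document (`rise`, `match`) are dropped, and the remaining terms form the candidate set that a subsequent search would refine.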
Genetic algorithm parameters used for this study.
| Parameters | Value |
|---|---|
| Generation | 5 |
| Population | 100 |
| Verbosity | 2 |
| Other parameters | Default |
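To make the parameters above concrete, here is a minimal sketch of GA-based feature selection over binary feature masks. Only the population size (100) and number of generations (5) come from the table; the tournament selection, uniform crossover, mutation rate, elitism, and the toy fitness function are assumptions for illustration. In a real pipeline the fitness would more plausibly be a classifier's validation accuracy on the selected features.

```python
import random


def tournament(pop, fitness, rng, k=3):
    """Return the fittest of k randomly drawn individuals."""
    return max(rng.sample(pop, k), key=fitness)


def genetic_feature_selection(n_features, fitness, population=100,
                              generations=5, mutation_rate=0.05, seed=0):
    """Search for a 0/1 feature mask that maximizes `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(population)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        next_pop = [best[:]]  # elitism: carry the best mask forward
        while len(next_pop) < population:
            a = tournament(pop, fitness, rng)
            b = tournament(pop, fitness, rng)
            # Uniform crossover: take each gene from either parent.
            child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop
        best = max(pop, key=fitness)
    return best


# Hypothetical fitness: reward three "informative" features, penalize mask size.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(1 for i, g in enumerate(mask) if g and i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask = genetic_feature_selection(10, toy_fitness)
```

Because the best individual is always carried into the next generation, the best fitness never decreases from one generation to the next.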
News categories and the number of news documents in each category.
| News category | No. of news | Category label |
|---|---|---|
| Business | 257 | 1 |
| Education | 269 | 2 |
| Sport | 251 | 3 |
| Technology | 267 | 4 |
| Diplomatic relation | 270 | 5 |
| Military force | 278 | 6 |
| Politics | 244 | 7 |
| Health | 275 | 8 |
| Agriculture | 256 | 9 |
| Justice | 212 | 10 |
| Accidents | 275 | 11 |
| Tourism | 239 | 12 |
| Environmental protection | 265 | 13 |
Performance evaluation of Amharic news document classification using the proposed feature selection method.
| Evaluation metric | Accuracy | Precision | Recall | F-measure |
|---|---|---|---|---|
| Experimental results (%) | 89.68 | 89.52 | 89.65 | 89.56 |
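For reference, the four metrics in the table can be computed from true and predicted category labels as sketched below. The sketch assumes macro-averaging across categories, which the record does not state; the function name and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F-measure."""
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    precisions, recalls, f_measures = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_measures) / n,
    }


# Toy labels for three categories (not data from the study).
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
metrics = classification_metrics(y_true, y_pred)
```

With a different averaging scheme (for example, weighting each category by its number of documents) the precision, recall, and F-measure values would differ slightly, while accuracy would stay the same.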
Comparison of ETC, RFC, and GBC.
| No. | Machine learning model | Accuracy (%) |
|---|---|---|
| 1 | ETC | 89.68 |
| 2 | RFC | 87.80 |
| 3 | GBC | 87.58 |
Comparison of the proposed feature selection method with existing methods.
| Learning model | Feature selection | Accuracy (%) |
|---|---|---|
| ETC | IG | 82.73 |
| ETC | CHI | 74.85 |
| ETC | DF | 84.42 |
| ETC | Hybrid of (IG, CHI, and DF) | 85.82 |
| ETC | Genetic algorithm | 87.21 |
| ETC | PCA | 84.56 |
| ETC | Hybrid of (IG, CHI, DF, and PCA) | 88.67 |
| ETC | DFGA | 89.68 |
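The improvements quoted in the abstract follow directly from the accuracies in this table. The short sketch below recomputes them as differences in percentage points; the dictionary and function names are hypothetical, and the values are copied from the table.

```python
# Accuracy (%) of ETC with each feature selection method, from the table above.
accuracy = {
    "IG": 82.73,
    "CHI": 74.85,
    "DF": 84.42,
    "Hybrid of (IG, CHI, and DF)": 85.82,
    "Genetic algorithm": 87.21,
    "PCA": 84.56,
    "Hybrid of (IG, CHI, DF, and PCA)": 88.67,
    "DFGA": 89.68,
}


def gains_over_baselines(acc, proposed="DFGA"):
    """Return the accuracy gain of `proposed` over every other method."""
    return {
        name: round(acc[proposed] - value, 2)
        for name, value in acc.items()
        if name != proposed
    }


gains = gains_over_baselines(accuracy)
for name, gain in gains.items():
    print(f"DFGA vs {name}: +{gain:.2f} percentage points")
```

The gains over the two hybrids and over the genetic algorithm alone are the three figures reported in the abstract.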
Comparison of feature selection methods in terms of the number of features.
| Feature selection methods | Number of features |
|---|---|
| Hybrid of IG, CHI, and DF | 405 |
| Hybrid of IG, CHI, DF, and PCA | 194 |
| PCA | 1,226 |
| GA | 230 |
| DF | 393 |
| DFGA | 100 |