Demeke Endalie, Getamesay Haile.
Abstract
Machine learning techniques have been used to process Amharic texts for decades, but the potential of deep learning for Amharic document classification has not been exploited because of a lack of language resources. In this paper, we present a deep learning model for Amharic news document classification. The proposed model uses fastText to generate text vectors that represent the semantic meaning of texts, which addresses shortcomings of traditional methods. The matrix of text vectors is then fed into the embedding layer of a convolutional neural network (CNN), which automatically extracts features. We conducted experiments on a data set with six news categories, and our approach achieved a classification accuracy of 93.79%. We compared our method with well-known machine learning algorithms, namely support vector machine (SVM), multilayer perceptron (MLP), decision tree (DT), XGBoost (XGB), and random forest (RF); the best of these baselines reached 88.88%.
Year: 2021 PMID: 34354742 PMCID: PMC8331297 DOI: 10.1155/2021/3774607
Source DB: PubMed Journal: Comput Intell Neurosci
Figure 1. A general process of text classification using deep learning models.
Algorithm 1. The framework for the proposed system.
Figure 2. Architecture of the proposed model of Amharic text classifier.
Figure 3. Data set composition.
List of consonants used by the study.
| Canonical form | Characters to be replaced |
|---|---|
| hā(ሀ) | hā(ሃ፣ኃ፣ኀ፣ሐ፣ሓ) |
| se(ሰ) | se(ሠ) |
| ā(አ) | ā(ኣ፣ዐ፣ዓ) |
| ts'e(ጸ) | ts'e(ፀ) |
| wu(ው) | wu(ዉ) |
| go(ጐ) | go(ጎ) |
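The replacement table above can be applied as a simple character-level normalization step. The following is a minimal sketch, not the authors' preprocessing code: the function name `normalize` is hypothetical, the ā row is read here as ኣ, ዐ, ዓ (an assumption), and only the characters listed in the table are covered.

```python
# Variant character -> canonical character, transcribed from the table above.
# Assumption: only the forms listed in the table are handled; other vowel
# orders of the same consonants are left untouched in this sketch.
CANONICAL = {
    "ሃ": "ሀ", "ኃ": "ሀ", "ኀ": "ሀ", "ሐ": "ሀ", "ሓ": "ሀ",
    "ሠ": "ሰ",
    "ኣ": "አ", "ዐ": "አ", "ዓ": "አ",
    "ፀ": "ጸ",
    "ዉ": "ው",
    "ጎ": "ጐ",
}

# str.translate needs a code-point table; maketrans builds it from the dict.
_TRANSLATION = str.maketrans(CANONICAL)


def normalize(text: str) -> str:
    """Replace each variant character with its canonical form."""
    return text.translate(_TRANSLATION)


print(normalize("ሠላም"))  # the first character is mapped to its canonical form
```

Because the mapping is one character to one character, `str.translate` handles it in a single pass and leaves all other text unchanged.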
Sample BoW representation.
| Documents | The | Club | Is | Effective |
|---|---|---|---|---|
| doc1 | 10 | 3 | 4 | 4 |
| doc2 | 5 | 1 | 7 | 2 |
| doc3 | 12 | 0 | 2 | 6 |
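A bag-of-words matrix like the sample above simply counts how often each vocabulary term occurs in each document. A minimal sketch follows; the two toy documents and the helper name `bag_of_words` are hypothetical and are not the documents behind the table.

```python
from collections import Counter


def bag_of_words(docs, vocabulary):
    """Return one row of term counts per document, in vocabulary order."""
    rows = []
    for doc in docs:
        # Lowercase and split on whitespace; a real pipeline would use
        # a language-specific tokenizer instead.
        counts = Counter(doc.lower().split())
        rows.append([counts[term] for term in vocabulary])
    return rows


vocabulary = ["the", "club", "is", "effective"]
docs = ["The club is effective", "The club the club is"]  # toy documents
matrix = bag_of_words(docs, vocabulary)
print(matrix)
```

Each row ignores word order entirely, which is one reason count-based representations lose the semantic information that learned text vectors try to keep.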
Algorithm 2. fastText implementation.
Figure 4. Word embedding model.
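The body of Algorithm 2 is not reproduced here, so the following is only an illustration of the subword idea that fastText is generally known for, not the authors' implementation. fastText represents a word through its character n-grams, with `<` and `>` marking the word boundaries; the helper name `char_ngrams` is hypothetical, and the range of 3 to 6 characters follows the setup described in the original fastText work.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of a word, fastText style.

    The word is wrapped in boundary markers so that prefixes and
    suffixes produce n-grams distinct from word-internal ones.
    """
    token = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))
```

In fastText, a word vector is built from the vectors of these n-grams together with a vector for the whole word, which lets the model produce representations for words that were rare or unseen during training.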
Figure 5. Block diagram of the proposed method.
CNN model hyperparameters.
| Hyperparameter | Value |
|---|---|
| Embedding dimension | 300 |
| Filter size | [3, 4, 5] |
| Number of filters | 256 |
| Batch size | 64 |
| Dropout | 0.7 |
| Activation | Softmax |
| Optimization | Adam |
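To get a feel for what these hyperparameters imply, the sketch below works out layer sizes for one plausible layout. Several things are assumptions rather than details confirmed by the table: one parallel 1D convolution branch per filter size, stride 1 without padding, global max pooling per filter, concatenation of the branches, and a final dense softmax layer over the six news categories. The function names and the sequence length of 100 tokens are hypothetical.

```python
EMBEDDING_DIM = 300       # from the table
FILTER_SIZES = [3, 4, 5]  # from the table
NUM_FILTERS = 256         # from the table
NUM_CLASSES = 6           # six news categories (from the abstract)


def conv_branch_params(filter_size):
    """Trainable parameters of one convolution branch.

    Each filter spans `filter_size` tokens across the full embedding
    dimension and has one bias term.
    """
    return (filter_size * EMBEDDING_DIM + 1) * NUM_FILTERS


def feature_map_length(seq_len, filter_size):
    """Output length of a stride-1 convolution without padding (assumed)."""
    return seq_len - filter_size + 1


def pooled_feature_size():
    """Size of the concatenated vector after global max pooling (assumed)."""
    return NUM_FILTERS * len(FILTER_SIZES)


def dense_params():
    """Weights plus biases of the assumed final softmax layer."""
    return pooled_feature_size() * NUM_CLASSES + NUM_CLASSES


for size in FILTER_SIZES:
    print(size, conv_branch_params(size), feature_map_length(100, size))
print(pooled_feature_size(), dense_params())
```

Under these assumptions the convolution branches dominate the parameter count outside the embedding layer, while the classifier head stays small.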
Performance of model in terms of accuracy, precision, recall, and F-measure.
| Evaluation metrics | Performance (%) |
|---|---|
| Accuracy | 93.79 |
| Precision | 93.63 |
| F-measure | 93.67 |
| Recall | 93.76 |
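The four metrics in the table can be computed from true and predicted labels as sketched below. The source does not state how the per-class scores were averaged; macro-averaging is assumed here, and the function name `macro_metrics` and the toy labels are hypothetical.

```python
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F-measure."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_scores) / n,
    }


# Toy labels for illustration only.
metrics = macro_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(metrics)
```

With macro-averaging, the F-measure is the mean of the per-class F-scores, so it need not equal the harmonic mean of the averaged precision and recall.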
Model evaluation using unlabeled documents.
| No. | Sample documents translated from Amharic | Predicted label | Label by the expert | Status |
|---|---|---|---|---|
| 1. | Every year, in Ethiopia, more people die of tuberculosis, according to the World Health Organization. | 2 (Education) | 3 (Health) | Failed |
| 2. | Sports championship held in Oromia North Shoa and East Welega zones is over. | 6 (Sport) | 6 (Sport) | Accepted |
| 3. | Prime Minister Dr. Abiy Ahmed spoke to the robot at his office. | 5 (Technology) | 5 (Technology) | Accepted |
| 4. | Addis Ababa Ethio-Telecom has recently signed an agreement with eight companies to distribute Internet services. | 5 (Technology) | 5 (Technology) | Accepted |
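The Status column follows directly from comparing the predicted label with the expert's label. A minimal sketch of that check, using the four rows of the table (the names `status` and `agreement` are hypothetical):

```python
# (document number, predicted label, expert label), from the table above.
samples = [(1, 2, 3), (2, 6, 6), (3, 5, 5), (4, 5, 5)]


def status(predicted, expert):
    """A prediction is accepted only when it matches the expert's label."""
    return "Accepted" if predicted == expert else "Failed"


statuses = [status(pred, expert) for _, pred, expert in samples]
agreement = statuses.count("Accepted") / len(samples)
print(statuses, agreement)
```

Four documents are far too few to estimate accuracy; the check only illustrates how the model's output was compared with expert judgment.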
Figure 6. Evaluation of our model in terms of number of epochs.
Classification accuracy of SVM, MLP, and RF classifiers with different hyperparameter values.
| Classifier | Hyperparameters | Accuracy (%) |
|---|---|---|
| SVM | Kernel = “Linear”, C = 1.0, Gamma = 1 | 86.6 |
| SVM | Kernel = “rbf”, C = 1.0, Gamma = 1 | 88.88 |
| SVM | Kernel = “rbf”, C = 100, Gamma = 1 | 88.56 |
| SVM | Kernel = “rbf”, C = 10, Gamma = 1 | 88.51 |
| SVM | Kernel = “Linear”, C = 10, Gamma = 1 | 87.58 |
| MLP | hidden_layer_sizes = 100, max_iter = 1000 | 87.57 |
| MLP | hidden_layer_sizes = 200, max_iter = 1000 | 88.48 |
| RF | n_estimators = 100 | 87.28 |
| RF | n_estimators = 200 | 88.23 |
| RF | n_estimators = 300 | 88.88 |
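Selecting the best configuration per classifier from these runs is a simple maximum over the reported accuracies. The sketch below transcribes the table into a dictionary and picks the top run for each classifier; the data structure and the name `best_configuration` are hypothetical, and no model is trained here.

```python
# Reported runs per classifier: (hyperparameters, accuracy in %).
results = {
    "SVM": [
        ({"kernel": "linear", "C": 1.0, "gamma": 1}, 86.6),
        ({"kernel": "rbf", "C": 1.0, "gamma": 1}, 88.88),
        ({"kernel": "rbf", "C": 100, "gamma": 1}, 88.56),
        ({"kernel": "rbf", "C": 10, "gamma": 1}, 88.51),
        ({"kernel": "linear", "C": 10, "gamma": 1}, 87.58),
    ],
    "MLP": [
        ({"hidden_layer_sizes": 100, "max_iter": 1000}, 87.57),
        ({"hidden_layer_sizes": 200, "max_iter": 1000}, 88.48),
    ],
    "RF": [
        ({"n_estimators": 100}, 87.28),
        ({"n_estimators": 200}, 88.23),
        ({"n_estimators": 300}, 88.88),
    ],
}


def best_configuration(runs):
    """Return the (hyperparameters, accuracy) pair with the highest accuracy."""
    return max(runs, key=lambda run: run[1])


best = {name: best_configuration(runs) for name, runs in results.items()}
for name, (params, accuracy) in best.items():
    print(name, params, accuracy)
```

The best accuracies found this way (88.88 for SVM, 88.48 for MLP, 88.88 for RF) are the values carried into the classifier comparison.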
Comparison of the CNN model with RF, XGB, MLP, SVM, and DT.
| Classifiers | Testing accuracy (%) |
|---|---|
| CNN | 93.79 |
| RF | 88.88 |
| XGB | 87.58 |
| SVM | 88.88 |
| MLP | 88.48 |
| DT | 77.45 |