Hozayfa El Rifai, Leen Al Qadi, Ashraf Elnagar.
Abstract
The process of tagging a given text or document with suitable labels is known as text categorization or classification. The aim of this work is to automatically tag a news article based on its vocabulary features. To accomplish this objective, two large datasets were constructed from various Arabic news portals. The first dataset consists of 90k single-labeled articles from four domains (Business, Middle East, Technology and Sports). The second dataset has over 290k multi-tagged articles. To examine the single-label dataset, we employed an array of ten shallow learning classifiers. Furthermore, we added an ensemble model that adopts the majority-voting technique over all studied classifiers. The performance of the classifiers on the first dataset ranged between 87.7% (AdaBoost) and 97.9% (SVM). Analyzing some of the misclassified articles confirmed the need for multi-label as opposed to single-label categorization for better classification results. For the second dataset, we tested both shallow learning and deep learning multi-labeling approaches. A custom accuracy metric, designed for the multi-labeling task, was developed for performance evaluation, alongside the Hamming loss metric. First, we used classifiers compatible with multi-labeling tasks, such as Logistic Regression and XGBoost, by wrapping each in a OneVsRest classifier. XGBoost gave the higher accuracy, scoring 84.7%, while Logistic Regression scored 81.3%. Second, ten neural networks were constructed (CNN, CLSTM, LSTM, BILSTM, GRU, CGRU, BIGRU, HANGRU, CRF-BILSTM and HANLSTM). CGRU proved to be the best multi-labeling classifier, scoring an accuracy of 94.85%, higher than the rest of the classifiers.
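The OneVsRest multi-label setup described in the abstract can be sketched with scikit-learn. The toy documents, tags, and the Logistic Regression base model below are illustrative stand-ins, not the paper's Arabic corpus or exact configuration:

```python
# Sketch of wrapping a base classifier in OneVsRest for multi-label
# tagging; one independent binary classifier is trained per label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Placeholder documents and multi-tag annotations
docs = [
    "oil prices rise in gulf markets",
    "real madrid wins the league match",
    "new android phone released by tech firm",
    "egypt economy grows as oil exports climb",
]
tags = [["Business", "oil"], ["Sports"], ["Tech", "Android"],
        ["Business", "Egypt", "oil"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)          # binary indicator matrix, one column per label
X = TfidfVectorizer().fit_transform(docs)

clf = OneVsRestClassifier(LogisticRegression(max_iter=100))
clf.fit(X, Y)
pred = clf.predict(X)                # shape: (n_docs, n_labels), 0/1 entries
```

Each column of `Y` gets its own binary classifier, so an article can receive any subset of the label set at prediction time.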
Keywords: Arabic datasets; Arabic text classification; Deep learning classifiers; Multi-label classification; Shallow learning classifiers; Single-label classification
Year: 2021 PMID: 34483495 PMCID: PMC8408369 DOI: 10.1007/s00521-021-06390-z
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.606
Number of documents collected from 7 news portals
| Websites | Classes | Articles count |
|---|---|---|
| Sky News Arabia | Sports | 7923 |
| CNN Arabia | Sports | 3800 |
| | Tech | 1680 |
| | Middle East | 21,516 |
| | Business | 3908 |
| Bein Sports | Sports | 6603 |
| Tech-wd | Tech | 23,682 |
| Arabic RT | Business | 896 |
| Youm7 | Business | 14,478 |
| CNBC Arabia | Business | 4653 |
Fig. 1 Some statistics on the single-label dataset categories
News portals scraped for the multi-label dataset
| Websites scraped | |
|---|---|
| CNBC Arabia | Bein Sports |
| CNN Arabia | Tech-wd |
| Masrawy | aitnews |
| Youm7 | Arabic RT |
| Al Arabiya | SkyNewsArabia |
Fig. 2 Some statistics on the multi-label dataset categories
TF-IDF vectorizer versus count vectorizer: performance evaluation
| Algorithms | TF-IDF vectorizer | Count vectorizer |
|---|---|---|
| LR | 96.4 | |
| SVM | 97.0 | |
| DT | 91.7 | |
| MNB | 91.1 | |
| XG | ||
| KN | 69.9 | |
| RF | 94.5 |
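The vectorizer comparison above contrasts two standard text-feature schemes. A minimal sketch of the two on toy data (placeholder documents, not the paper's dataset):

```python
# TF-IDF vs. count vectorization: same vocabulary and matrix shape,
# different weighting of term occurrences.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["oil prices rise", "oil exports fall", "league match tonight"]

count_X = CountVectorizer().fit_transform(docs)  # raw term frequencies
tfidf_X = TfidfVectorizer().fit_transform(docs)  # frequencies down-weighted
                                                 # by document frequency
```

TF-IDF discounts terms that appear in many documents, which is generally why it edges out raw counts for the classifiers in the table.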
Fig. 3 Generic system flow-diagram
Fig. 4 Performance of all classifiers on our proposed dataset
Fig. 5 The confusion matrices for the highest (SVM) and lowest (ADB) performers
Evaluation metrics for all classifiers on the testing dataset
| Classifier | Accuracy | Ham. loss | F1 Score | Precision | Recall | ROC |
|---|---|---|---|---|---|---|
| LR | 97.50 | 0.025 | 97.57 | 97.58 | 97.57 | 99.84 |
| SVM | 97.92 | 0.021 | 97.98 | 97.99 | 97.98 | 99.87 |
| DT | 90.76 | 0.092 | 90.92 | 90.94 | 90.92 | 93.90 |
| MNB | 96.30 | 0.037 | 96.37 | 96.50 | 96.31 | 99.76 |
| XGB | 93.47 | 0.065 | 93.64 | 93.89 | 93.51 | 99.27 |
| KNN | 95.87 | 0.041 | 95.95 | 95.97 | 95.94 | 99.14 |
| RF | 94.46 | 0.055 | 94.61 | 94.64 | 94.60 | 99.21 |
| NC | 91.16 | 0.088 | 91.32 | 92.53 | 91.37 | 92.40 |
| ADB | 87.73 | 0.123 | 88.15 | 89.01 | 87.84 | 94.08 |
| MLP | 97.93 | 97.93 | 97.92 | |||
| Ensemble | 97.92 | 0.021 | 97.98 | 97.99 | 97.98 | 99.88 |
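The ensemble row above uses majority voting over the studied classifiers. A hedged sketch with scikit-learn's `VotingClassifier` on synthetic data; the three base models stand in for the paper's full set of ten:

```python
# Majority-voting ("hard") ensemble: each base model casts one vote
# per sample, and the most-voted class wins.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic 4-class data, mirroring the four news domains in spirit only
X, y = make_classification(n_samples=200, n_features=20, n_classes=4,
                           n_informative=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=200)),
        ("svm", SVC(kernel="linear")),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",  # majority vote over predicted labels
)
ensemble.fit(X, y)
pred = ensemble.predict(X)
```

Hard voting needs only class predictions from each member, which is why it pairs naturally with a heterogeneous pool of shallow classifiers.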
Fig. 6 A news article correctly tagged as ’Technology’
Fig. 7 A news article misclassified as ’Business’; it was originally tagged as ’Technology’
Classifier accuracy scores on the ’Akhbarona’ dataset
| Classifiers | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| LR | 93.9 | 0.94 | 0.94 | 0.94 |
| SVM | | 0.94 | 0.94 | 0.94 |
| DT | 83.0 | 0.83 | 0.83 | 0.83 |
| MNB | 88.0 | 0.91 | 0.88 | 0.88 |
| XGB | 88.4 | 0.89 | 0.88 | 0.88 |
| KNN | 90.8 | 0.91 | 0.91 | 0.91 |
| RF | 87.8 | 0.88 | 0.88 | 0.88 |
| NC | 86.2 | 0.89 | 0.86 | 0.87 |
| ADB | | 0.80 | 0.78 | 0.78 |
| MLP | 94.1 | 0.94 | 0.94 | 0.94 |
| Ensemble | 94.3 | 0.94 | 0.94 | 0.94 |
Fig. 8 Count of the labels used in the CGRU experiment
Evaluation metrics of the SL classifiers for the multi-label classification task
| Classifier | Accuracy | Ham. loss | F1 Score | Precision | Recall | ROC |
|---|---|---|---|---|---|---|
| OVR-LR | 81.34 | 2.50 | 75.62 | 88.67 | 69.01 | 98.40 |
| OVR-XGB | 84.73 | 2.24 | 78.86 | 87.59 | 74.87 | 98.47 |
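The tables report both Hamming loss and accuracy for the multi-label task. A sketch of both on toy label matrices; the per-sample Jaccard-style accuracy below is one common multi-label accuracy and is an assumption here, since the paper's custom metric is not spelled out in this record:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Toy true/predicted indicator matrices (rows = articles, cols = labels)
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])

# Hamming loss: fraction of individual label slots predicted wrongly
hl = hamming_loss(Y_true, Y_pred)  # 2 wrong slots out of 12 -> 1/6

def per_sample_accuracy(yt, yp):
    """Mean per-article overlap |true ∩ pred| / |true ∪ pred| (Jaccard).
    One common multi-label accuracy; the paper's metric may differ."""
    inter = np.logical_and(yt, yp).sum(axis=1)
    union = np.logical_or(yt, yp).sum(axis=1)
    return float(np.mean(np.where(union == 0, 1.0,
                                  inter / np.maximum(union, 1))))

acc = per_sample_accuracy(Y_true, Y_pred)
```

Hamming loss penalizes every wrong label slot equally, while the per-sample accuracy rewards partial overlap between the predicted and true tag sets, which suits multi-tagged news articles.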
Fig. 9 Accuracy scores for all deep neural networks
Evaluation metrics for all deep learning classifiers for the multi-label classification task
| Classifier | Accuracy | Ham. loss | F1 Score | Precision | Recall | ROC |
|---|---|---|---|---|---|---|
| CNN | 91.34 | 1.61 | 89.13 | 90.99 | 87.51 | 95.08 |
| CNNLSTM | 91.34 | 1.61 | 89.13 | 90.99 | 87.51 | 95.08 |
| BILSTM | 94.03 | 1.27 | 90.25 | 92.31 | 88.48 | 97.06 |
| BIGRU | 91.34 | 1.61 | 89.13 | 90.99 | 87.51 | 95.08 |
| GRU | 94.28 | 90.55 | 88.70 | 98.04 | ||
| LSTM | 90.17 | 1.78 | 86.85 | 90.61 | 83.92 | |
| CRF-BILSTM | 91.34 | 1.61 | 89.13 | 90.99 | 87.51 | 95.08 |
| HANLSTM | 92.92 | 1.45 | 90.60 | 91.05 | 90.57 | 83.83 |
| HANGRU | 92.96 | 1.43 | 90.66 | 91.16 | 90.40 | 93.52 |
| CGRU | 92.06 | 97.74 | ||||
Evaluation metrics of the CNN-GRU classifier for each of the 21 labels
| Label | Precision | Recall | F1-score |
|---|---|---|---|
| Business | 97.81 | 97.95 | 97.88 |
| oil | 90.58 | 94.27 | 92.39 |
| Business America | 89.12 | 76.30 | 82.21 |
| Business Egypt | 93.24 | 88.91 | 91.02 |
| Business SA | 81.23 | 87.18 | 84.10 |
| ME | 99.16 | 98.87 | 99.01 |
| Syria | 94.72 | 92.99 | 93.84 |
| Egypt | 91.84 | 93.57 | 92.69 |
| Yaman | 95.91 | 88.47 | 92.04 |
| Saudi Arabia | 89.94 | 79.47 | 84.38 |
| Iraq | 94.31 | 89.28 | 91.73 |
| Sports | 99.67 | 99.88 | 99.77 |
| Premier League | 88.84 | 95.73 | 92.16 |
| Real Madrid | 90.53 | 90.78 | 90.65 |
| Barca | 89.47 | 90.79 | 90.13 |
| Football | 82.62 | 57.37 | 67.72 |
| Tech | 99.52 | 99.78 | 99.65 |
| Android | 89.49 | 89.45 | 89.47 |
| Apple | 93.11 | 89.44 | 91.24 |
| | 87.84 | 88.86 | 88.35 |
| Social Media | 94.19 | 95.31 | 94.75 |
Average evaluation scores of the CNN-GRU classifier
| CNNGRU | Precision | Recall | F1-score |
|---|---|---|---|
| Micro avg | 94.72 | 93.48 | 94.09 |
| Macro avg | 92.06 | 89.74 | 90.72 |
| Weighted avg | 94.63 | 93.48 | 93.95 |
| Samples avg | 95.67 | 94.85 | 94.69 |
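The micro/macro/weighted/samples averages above follow scikit-learn's averaging conventions. A sketch on toy label matrices (placeholder data, not the paper's predictions):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# micro: pool all label decisions; macro: unweighted mean over labels;
# weighted: label mean weighted by support; samples: mean over articles.
micro = precision_recall_fscore_support(Y_true, Y_pred,
                                        average="micro", zero_division=0)
macro = precision_recall_fscore_support(Y_true, Y_pred,
                                        average="macro", zero_division=0)
weighted = precision_recall_fscore_support(Y_true, Y_pred,
                                           average="weighted", zero_division=0)
samples = precision_recall_fscore_support(Y_true, Y_pred,
                                          average="samples", zero_division=0)
```

Micro averaging favors frequent labels, while macro treats all 21 labels equally, which is why the two rows in the table can differ noticeably.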
Fig. 10 Relationships of true labels in the testing dataset
Fig. 11 Relationships of predicted labels by CNN
Fig. 12 An example of a news article correctly tagged, with 5 tags, by the CGRU model
Fig. 13 An example of a misclassified news article whose predicted tags turn out to be reasonable
Approximate computational cost for the classifiers
| Classifier | Training | Prediction |
|---|---|---|
| LR | ||
| SVM | ||
| DT | ||
| MNB | ||
| XGB | ||
| KNN | ||
| RF | ||
| NC | ||
| ADB | ||
| MLP/NN |
Fig. 14 Average training time for the standard classifiers for multiple n samples and f features; logarithmic scale
Classifiers parameters
| Classifier | Parameter settings |
|---|---|
| LR | C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', penalty='l2', solver='lbfgs', tol=0.0001, warm_start=False |
| SVM | C=1.0, break_ties=False, cache_size=200, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear', max_iter=-1, probability=True, shrinking=True, tol=0.001 |
| DT | criterion='gini', min_samples_split=2, min_samples_leaf=1 |
| MNB | alpha=1.0, fit_prior=True |
| XGB | loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, max_depth=3, warm_start=False, validation_fraction=0.1, tol=0.0001 |
| KNN | n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski' |
| RF | n_estimators=100, criterion='gini', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0 |
| NC | metric='euclidean' |
| ADB | algorithm='SAMME.R', learning_rate=1.0, n_estimators=50 |
| MLP | activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9, beta_2=0.999, epsilon=1e-08, hidden_layer_sizes=(100,), learning_rate='constant', learning_rate_init=0.001, max_iter=200, momentum=0.9, n_iter_no_change=10, power_t=0.5, solver='adam', tol=0.0001, validation_fraction=0.1, warm_start=False |