Yahya Albalawi1,2,3, Jim Buckley1,3, Nikola S Nikolov1,3.
Abstract
This paper presents a comprehensive evaluation of data pre-processing and word-embedding techniques in the context of Arabic document classification in the domain of health-related communication on social media. We evaluate 26 text pre-processing techniques applied to Arabic tweets within the process of training a classifier to identify health-related tweets. For this task we use the (traditional) machine learning classifiers KNN, SVM, Multinomial NB and Logistic Regression. Furthermore, we report experimental results with the deep learning architectures BLSTM and CNN for the same text classification problem. Since word embeddings are more typically used as the input layer in deep networks, in the deep learning experiments we evaluate several state-of-the-art pre-trained word embeddings with the same text pre-processing applied. To achieve these goals, we use two data sets: one for both training and testing, and another for testing the generality of our models only. Our results point to the conclusion that only four out of the 26 pre-processing techniques improve the classification accuracy significantly. For the first data set of Arabic tweets, we found that Mazajak CBOW pre-trained word embeddings as the input to a BLSTM deep network led to the most accurate classifier, with an F1 score of 89.7%. For the second data set, Mazajak Skip-Gram pre-trained word embeddings as the input to a BLSTM led to the most accurate model, with an F1 score of 75.2% and accuracy of 90.7%, compared to an F1 score of 70.89% achieved by Mazajak CBOW with the same architecture. Our results also show that the performance of the best of the traditional classifiers we trained is comparable to the deep learning methods on the first data set, but significantly worse on the second data set.
Keywords: Deep learning; Health information; Machine learning; Natural language processing; Pre-trained word embeddings; Social media; Twitter
Year: 2021 PMID: 34249602 PMCID: PMC8253467 DOI: 10.1186/s40537-021-00488-w
Source DB: PubMed Journal: J Big Data ISSN: 2196-1115
Pre-trained word embedding models
| Pre-trained word embedding | Number of documents | Sources | Techniques | Availability | Pre-processing |
|---|---|---|---|---|---|
| fastText | 400 million tokens from Wikipedia plus 24 terabytes of raw text from Common Crawl | Common Crawl and Wikipedia | CBOW with subword techniques | Open | Only tokenization |
| AraVec | 66.9 million tweets and 320,636 Wikipedia documents | Twitter and Wikipedia | CBOW and Skip-Gram with different n-gram and unigram features | Open | Remove non-Arabic letters; replace ة with ه; normalize alef; remove duplicates; normalize mentions, URLs and emojis |
| Mazajak | 250 million tweets | Twitter | CBOW and Skip-Gram with different n-grams | Open | Remove URLs, tashkeel, emojis and punctuation |
| ArWordVec | 55 million tweets | Twitter | CBOW and Skip-Gram | Open | Normalize mentions and URLs; remove tashkeel and punctuation; normalize bare alef; replace ى with ي, ؤ with ء, ئ with ء, ة with ه |
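All four embedding sets are commonly distributed in the word2vec text format (one token followed by its vector per line, optionally preceded by a `vocab_size dim` header). As an illustration of how such a file can be loaded without gensim, here is a minimal sketch; the file name used below is a placeholder, not a file shipped with the paper.

```python
import numpy as np

def load_text_embeddings(path, encoding="utf-8"):
    """Parse a word2vec-style text file: one "token v1 v2 ... vn" line per word.

    A leading "vocab_size dim" header line, if present, is skipped.
    """
    vectors = {}
    with open(path, encoding=encoding) as fh:
        for line in fh:
            parts = line.rstrip("\n").split(" ")
            if len(parts) == 2:  # header line "<vocab_size> <dim>"
                continue
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
```

An embedding matrix for a deep network's input layer is then built by looking up each token of the tokenizer's vocabulary in the returned dict, e.g. `vectors = load_text_embeddings("embeddings.vec")`.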
Fig. 1 Study overview
Normalization techniques used by different researchers
| Replace | With | Relevant studies |
|---|---|---|
| Bare-alif | | |
Fig. 2 Overview flowchart for the first experiment. (In the best combination, we only tried MNB, as it was the best algorithm from the previous steps.)
Techniques explored by brute-force search in each attempt, for all algorithms
| Techniques used | Parameters or range used |
|---|---|
| Number of features | Ranges from 7,000 to 18,000 |
| N-gram | (1, 1), (1, 2), (1, 3), (1, 4) |
| Feature selection | Count vectorizer and TF-IDF |
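In scikit-learn terms, this brute-force search amounts to a grid search over the vectorizer and its settings. The sketch below is a plausible rendering of that setup, not the authors' published code; the toy corpus and labels are invented stand-ins for the labelled Arabic tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Toy stand-in corpus; the paper's data are labelled Arabic tweets.
texts = ["headache and fever today", "football match tonight",
         "flu symptoms and a bad cough", "new movie released this week"]
labels = [1, 0, 1, 0]  # 1 = health-related

search = GridSearchCV(
    Pipeline([("vec", TfidfVectorizer()), ("clf", MultinomialNB())]),
    param_grid={
        "vec": [CountVectorizer(), TfidfVectorizer()],  # "feature selection"
        "vec__ngram_range": [(1, 1), (1, 2), (1, 3), (1, 4)],
        "vec__max_features": [7000, 12000, 18000],      # number of features
    },
    cv=2,
)
search.fit(texts, labels)
```

Passing a list of estimators as the `"vec"` grid entry lets one grid search compare the count vectorizer against TF-IDF alongside the n-gram and feature-count ranges.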
Hyperparameters optimized for NB
| | tfidf__norm | clf__alpha |
|---|---|---|
| Range | ['l1', 'l2'] | [1, 1e-1, 1e-2] |
Hyperparameters optimized for SVM
| | clf__C | clf__loss |
|---|---|---|
| Range | [0.05, 0.12, 0.25, 0.5, 1, 2, 4] | ['squared_hinge', 'hinge'] |
Hyperparameters optimized for LR
| | clf__C | clf__penalty |
|---|---|---|
| Range | [0.001, 0.01, 0.1, 1, 10, 100] | ['l1', 'l2'] |
Hyperparameters optimized for KNN
| | clf__algorithm | clf__n_neighbors | clf__leaf_size |
|---|---|---|---|
| Range | ['auto', 'kd_tree'] | [2, 4, 6, 8, 9, 10, 11, 12] | [2, 9, 16, 20, 26, 31, 50, 70] |
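Expressed as scikit-learn parameter grids for a Pipeline with steps named `tfidf` and `clf`, the four tables above correspond to dictionaries along the following lines. This is a sketch of how such grids are conventionally written, not the authors' code; note that scikit-learn's LogisticRegression calls the l1/l2 choice `penalty`, even though the LR table labels it `clf__loss`.

```python
# Parameter grids mirroring the four tables; step names assume a
# Pipeline([("tfidf", TfidfVectorizer()), ("clf", <classifier>)]).
param_grids = {
    "MultinomialNB": {
        "tfidf__norm": ["l1", "l2"],
        "clf__alpha": [1, 1e-1, 1e-2],
    },
    "LinearSVC": {
        "clf__C": [0.05, 0.12, 0.25, 0.5, 1, 2, 4],
        "clf__loss": ["squared_hinge", "hinge"],
    },
    "LogisticRegression": {
        "clf__C": [0.001, 0.01, 0.1, 1, 10, 100],
        "clf__penalty": ["l1", "l2"],  # labelled clf__loss in the table
    },
    "KNeighborsClassifier": {
        "clf__algorithm": ["auto", "kd_tree"],
        "clf__n_neighbors": [2, 4, 6, 8, 9, 10, 11, 12],
        "clf__leaf_size": [2, 9, 16, 20, 26, 31, 50, 70],
    },
}
```

Each dictionary can be fed directly to `GridSearchCV(pipeline, param_grid=...)` for the corresponding classifier.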
Baseline results for four algorithms used in this study
| Algorithm | N-gram | Feature selection | F1 score |
|---|---|---|---|
| LinearSVC | 1 | Term frequency | 84.0 |
| Logistic regression | 1, 2 | TF-IDF | 84.0 |
| Multinomial NB | 1, 2 | TF-IDF | 86.0 |
| KNN | 1, 2 | TF-IDF | 77.6 |
F1 score (in percentages) for each of the pre-processing techniques applied to the extracted tweets
| # | Technique | MNB | Logistic regression | LinearSVC | KNN |
|---|---|---|---|---|---|
| | Baseline models | 86.0 | 84.0 | 84.0 | 77.6 |
| 1 | Remove non-Arabic letters | 85.4 − | 83.6 − | 82.8 − | 76.7 − |
| 2 | Remove numbers | 85.5 − | 82.9 − | 84.3 | 77.4 − |
| 3 | Remove usernames, external links, and hashtags | 85.2 − | 83.2 − | 83.4 − | 78.1 + |
| 4 | Remove punctuation | 86.0 | 84.0 | 84.0 | 77.6 |
| 5 | Remove diacritics | 86.0 | 83.6 | 83.8 − | 76.6 − |
| 6 | Remove repeated characters | 86.4 + | 84.3 + | 84.9 + | 79.2 + |
| 7 | Remove duplicate letters | 86.0 | 83.2 − | 84.1 + | 78.7 + |
| 8 | Remove Kashida | 86.3 + | 83.8 − | 84.6 + | 78.0 + |
| 9 | Replace | 85.8 − | 83.6 − | 84.1 + | 77.4 − |
| 10 | Replace | 86.7 + | 84.0 | 84.6 + | 77.9 + |
| 11 | Replace | 86.8 + | 84.0 | 84.2 + | 78.0 + |
| 12 | Replace | 86.0 | 83.0 − | 84.3 + | 77.8 + |
| 13 | Replace | 85.8 − | 83.8 − | 83.9 | 77.7 + |
| 14 | Replace | 86.0 | 84.0 | 84.3 + | 77.6 |
| 15 | Replace | 86.7 + | 83.8 − | 84.8 + | 77.1 − |
| 16 | Replace | 86.0 | 84.0 | 84.0 | 77.6 |
| 17 | Replace | 86.0 | 82.8 + | 84.0 | 77.6 |
| 18 | Replace | 86.0 | 84.0 | 84.0 | 77.6 |
| 19 | Replace | 85.7 − | 83.7 − | 83.6 − | 78.0 + |
| 20 | Replace | 86.0 | 83.6 − | 84.0 | 77.9 + |
| 21 | Replace | 85.8 − | 82.8 − | 84.2 + | 77.6 |
| 22 | Replace | 86.0 | 84.0 | 84.2 + | 77.6 |
| 23 | Remove stop words | 85.2 − | 84.4 + | 83.4 − | 76.6 − |
| 24 | Light Stemming | 86.6 + | 85.3 + | 86.2 + | 79.1 + |
| 25 | Root stemming | 84.4 − | 85.2 + | 85.1 + | 77.8 + |
| 26 | Lemmatization | 86.7 + | 86.2 + | 86.5 + | 80.1 + |
Plus sign (+) indicates that the technique improved the baseline model's F1 score; minus sign (−) indicates that it decreased the F1 score; cells without a sign indicate that the technique had no impact on the algorithm's F1 score
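Several of the helpful techniques above (removing diacritics, kashida, and repeated characters; normalizing alef) are essentially one-line regular expressions. The sketch below shows plausible implementations of this kind of rule, not the authors' exact pre-processing code.

```python
import re

DIACRITICS = re.compile("[\u064B-\u0652]")  # Arabic tashkeel marks
KASHIDA = re.compile("\u0640")              # tatweel (kashida) character

def remove_diacritics(text):
    return DIACRITICS.sub("", text)

def remove_kashida(text):
    return KASHIDA.sub("", text)

def collapse_repeats(text, max_run=2):
    # Cap any character run at max_run, e.g. "ههههه" -> "هه"
    return re.sub(r"(.)\1{%d,}" % max_run, r"\1" * max_run, text)

def normalize_alef(text):
    # Map hamza/madda alef variants to bare alef
    return re.sub("[أإآ]", "ا", text)
```

For example, `remove_kashida("مـــرحبا")` yields `"مرحبا"` and `normalize_alef("أحمد")` yields `"احمد"`.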
Results for MNB classifier with the best combinations
| Model | Baseline classifier | | | | Optimized classifier | | | |
|---|---|---|---|---|---|---|---|---|
| Metrics | Precision | Recall | F1 score | Accuracy | Precision | Recall | F1 score | Accuracy |
| First Dataset | 87.5 | 84.6 | 86.0 | 91.6 | 89.1 | 86.6 | 87.9 | 92.7 |
| Second Dataset | 61.5 | 55.0 | 58.1 | 85.1 | 66.5 | 55.6 | 60.5 | 86.4 |
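The precision, recall, F1 and accuracy figures reported here and in the tables below follow the standard positive-class definitions. For reference, a minimal computation from confusion-matrix counts (the counts in the example are invented):

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from positive-class confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy
```

For instance, `classification_metrics(8, 2, 2, 88)` gives precision 0.8, recall 0.8, F1 0.8 and accuracy 0.96; since F1 is the harmonic mean of precision and recall, it always lies between them.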
Fig. 3 BLSTM architecture
Fig. 4 CNN architecture
Optimal hyperparameters of the CNN
| | CNN_Mazajak sg | CNN_Mazajak cbow | CNN_fastText | CNN_ArWordVec sg | CNN_ArWordVec cbow | CNN_AraVec sg | CNN_AraVec cbow |
|---|---|---|---|---|---|---|---|
| cov_filter | 32 | 32 | 1 | 1 | 32 | 1 | 32 |
| cov_filter1 | 32 | 32 | 1 | 32 | 32 | 32 | 32 |
| cov_filter2 | 1 | 32 | 1 | 32 | 32 | 1 | 32 |
| cov_kernel | 32 | 32 | 32 | 32 | 32 | 1 | 1 |
| cov_kernel1 | 1 | 1 | 32 | 1 | 32 | 32 | 32 |
| cov_kernel2 | 1 | 1 | 1 | 1 | 32 | 1 | 1 |
| pool_filter | 32 | 32 | 1 | 32 | 1 | 1 | 1 |
| cov1_activation | relu | sigmoid | relu | relu | relu | sigmoid | relu |
| cov1_activation1 | relu | relu | relu | relu | sigmoid | relu | relu |
| cov1_activation2 | sigmoid | relu | relu | relu | relu | relu | relu |
| dropout_1 | 0.0 | 0.0 | 0.6 | 0.6 | 0.6 | 0.3 | 0.0 |
| dense_units | 380 | 20 | 20 | 20 | 380 | 20 | 380 |
| dense_activation | relu | relu | sigmoid | relu | relu | sigmoid | sigmoid |
| dropout_2 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 |
| learning_rate | 0.01 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| batch_size | 40 | 40 | 40 | 40 | 40 | 40 | 300 |
Optimal hyperparameters of the BLSTM
| | Mazajak sg | Mazajak cbow | fastText | ArWordVec sg | ArWordVec cbow | AraVec sg | AraVec cbow |
|---|---|---|---|---|---|---|---|
| Neurons | 280 | 360 | 20 | 500 | 500 | 500 | 500 |
| Drop rate 1 | 0.2 | 0 | 0.5 | 0.3 | 0.2 | 0.5 | 0.2 |
| Drop rate 2 | 0.2 | 0 | 0 | 0.2 | 0 | 0 | 0 |
| Learning rate | 0.001 | 0.001 | 0.01 | 0.001 | 0.001 | 0.01 | 0.01 |
| Batch size | 120 | 40 | 40 | 300 | 300 | 40 | 300 |
Results of BLSTM using different pre-trained word embeddings on the first and second data sets
| | First data set | | | | Second data set | | | |
|---|---|---|---|---|---|---|---|---|
| | Precision | Recall | F1 score | Accuracy | Precision | Recall | F1 score | Accuracy |
| AraVec Skip-Gram | 89.14 | 89 | | 93.3 | 82.19 | 63.49 | 71.64 | 90.5 |
| AraVec CBOW | 87.09 | 86.23 | 86.66 | 91.9 | 79.35 | 65.08 | 71.51 | 90.27 |
| Mazajak Skip-Gram | 90.27 | 88.2 | | | | | 75.81 | |
| Mazajak CBOW | | | | | | 59.26 | 70.89 | |
| fastText | 89 | 87.54 | 88.26 | 92.9 | | 57.67 | 68.13 | 89.8 |
| ArWordVec Skip-Gram | 89.6 | 87.54 | 88.56 | 93.1 | 71.76 | 64.55 | 67.97 | 88.5 |
| ArWordVec CBOW | 85.25 | 88.44 | | 93.2 | 76.92 | | | 90.2 |
Bold numbers indicate the best value while underlined numbers represent the second-best value
Results of the CNN model using different pre-trained word embeddings on the first and second data sets
| | First data set | | | | Second data set | | | |
|---|---|---|---|---|---|---|---|---|
| | Precision | Recall | F1 score | Accuracy | Precision | Recall | F1 score | Accuracy |
| AraVec Skip-Gram | 88.16 | 76.84 | ||||||
| AraVec CBOW | 87.46 | 84.59 | 86 | 91.6 | 78.31 | 63.49 | 70.18 | 89.8 |
| Mazajak Skip-Gram | 89.04 | | | | 78.26 | 66.67 | 72 | 90.2 |
| Mazajak CBOW | | 83.28 | 86.25 | 91.9 | | 65.61 | | |
| fastText | 87.46 | 84.59 | 86 | 91.6 | | 56.08 | 67.3 | 89.7 |
| ArWordVec Skip-Gram | 85.76 | 82.95 | 84.33 | 90.6 | 70.56 | | 68.83 | 88.5 |
| ArWordVec CBOW | | 79.67 | 84.38 | 91 | 78.57 | 64.02 | 70.55 | 89.9 |
Bold numbers indicate the best value while underlined numbers represent the second-best value