Lal Khan, Ammar Amjad, Noman Ashraf, Hsien-Tsung Chang.
Abstract
Sentiment analysis (SA) is an important task because of its vital role in analyzing people's opinions. However, existing research is largely based on the English language, with limited work on low-resource languages. This study introduces a new multi-class Urdu dataset of user reviews for sentiment analysis, gathered from various domains such as food and beverages, movies and plays, software and apps, politics, and sports. The proposed dataset contains 9312 reviews manually annotated by human experts into three classes: positive, negative, and neutral. The main goal of this study is to create a manually annotated dataset for Urdu sentiment analysis and to set baseline results using rule-based, machine learning (SVM, NB, AdaBoost, MLP, LR, and RF), and deep learning (CNN-1D, LSTM, Bi-LSTM, GRU, and Bi-GRU) techniques. Additionally, we fine-tuned multilingual BERT (mBERT) for Urdu sentiment analysis. We used four text representations to train our classifiers: word n-grams, char n-grams, pre-trained fastText embeddings, and BERT word embeddings. We trained these models on two different datasets for evaluation purposes. Findings show that the proposed mBERT model with BERT pre-trained word embeddings outperformed the deep learning, machine learning, and rule-based classifiers and achieved an F1 score of 81.49%.
Year: 2022 PMID: 35361890 PMCID: PMC8971433 DOI: 10.1038/s41598-022-09381-9
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Summary of existing Urdu datasets.
| Corpus | Publicly available | Classes | Algorithms | Acc (%) |
|---|---|---|---|---|
| 6025 (various genres) | Yes | 3 | SVM, Lib, NB, (KNN, IBK), PART and decision tree | 67 |
| 650 (movies) | No | 2 | Language preprocessing | 40 |
| 700 (electronic appliances) | No | 2 | Language preprocessing | 38 |
| 26,057 documents | No | – | NB and SVM for language preprocessing | – |
| 1000 opinions of Urdu news data | No | 3 | Unsupervised (lexicon based) | 86 |
| 9601 (various domains) | Yes | 2 | Machine and deep learning | 81 |
| 6000 | No | 2 | Deep learning | 77.9 |
| 9312 reviews of various domains (proposed study) | Yes | 3 | Rule-based, deep learning and machine learning | 78 |
Online collection sources for Urdu user reviews.
| Domain | Websites |
|---|---|
| Appliances, software and blogs | mobilesmspk.net, itforumpk.com, baazauq.blogspot.com, dufferistan.com, mbilalm.com, urduweb.org, urdudaan.blogspot.com, itdunya.com, achidosti.com, itdarasgah.com, tafrehmella.com, sachiidosti.com, urdupoint.com |
| Movies, news talk shows, and Pakistani and Indian dramas | Hamriweb.com, youtube.com, facebook.com, hamariweb.net, zemtv.com, dramasonline.com, fashionuniverse.net, tweettunnel.com |
| Sports and entertainments | twitter.com, youtube.com, facebook.com |
| Politics | Facebook.com, siasat.pk, twitter.com, youtube.com |
| Food and recipes | Urduweb.org, facebook.com, friendscorner.com, Pakistan.web.pk, kfoods.com |
Figure 1. Examples of customer reviews labeled as neutral, positive and negative.
Details of the proposed and UCSA Urdu corpora.
| Characteristics | Proposed corpus | UCSA corpus |
|---|---|---|
| Total number of reviews | 9312 | 9601 |
| Positive reviews | 3422 | 4843 |
| Negative reviews | 2787 | 4758 |
| Neutral reviews | 3103 | – |
| Minimum review length in words | 1 | 1 |
| Maximum review length in words | 149 | – |
| Total number of tokens | 179,791 | 1,141,716 |
| Average tokens per review | 19.30 | – |
Statistics of proposed dataset.
| Genres | Total reviews | Positive reviews | Negative reviews | Neutral reviews |
|---|---|---|---|---|
| Food and recipes | 1250 | 386 | 317 | 547 |
| Movies and dramas | 1977 | 590 | 677 | 710 |
| Politics | 1873 | 479 | 744 | 650 |
| Software and gadgets | 2325 | 1326 | 455 | 544 |
| Sports and entertainment | 1887 | 641 | 594 | 652 |
| Total | 9312 | 3422 | 2787 | 3103 |
Figure 2. Proposed abstract-level architecture for Urdu sentiment analysis.
Figure 3. Rule-based Urdu sentiment analysis algorithm using an Urdu lexicon.
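The figure's lexicon-driven procedure is not reproduced in the text, but a minimal sketch of one common rule-based scheme (summing polarity-lexicon hits and flipping the polarity of a token that follows a negation word) might look as follows. The tiny lexicon and negation list below are illustrative placeholders, not the paper's actual Urdu lexicon.

```python
# Minimal rule-based sentiment sketch: sum lexicon polarities per token,
# inverting the polarity of a token that follows a negation word.
# The tiny lexicon below is a placeholder, not the paper's Urdu lexicon.
LEXICON = {"اچھا": 1, "بہترین": 1, "برا": -1, "خراب": -1}  # good/excellent/bad/poor
NEGATIONS = {"نہیں"}  # "not"

def rule_based_sentiment(tokens):
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True        # flip the polarity of the next token
            continue
        polarity = LEXICON.get(tok, 0)
        score += -polarity if negate else polarity
        negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

The aggregate score maps directly onto the study's three classes; reviews with no lexicon hits fall through to neutral.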
Figure 4. Multilingual BERT high-level architecture for Urdu sentiment analysis.
mBERT model hyper-parameters.
| Hyper-parameter | Value |
|---|---|
| Learning rate | 2e-5 |
| Batch size | 16 |
| Number of epochs | 15 |
| Attention heads | 12 |
| Gradient accumulation steps | 16 |
| Hidden size | 768 |
| Hidden layers | 12 |
| Maximum sequence length | 128 |
| Parameters | 110 M |
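One detail worth spelling out from the table: with a batch size of 16 and 16 gradient-accumulation steps, the optimizer steps once per 256 examples. A small sketch collecting the table's values (the dictionary keys are illustrative names, not a specific library's API):

```python
# Hyper-parameters from the table above, collected in one place.
mbert_config = {
    "learning_rate": 2e-5,
    "batch_size": 16,
    "num_epochs": 15,
    "attention_heads": 12,
    "gradient_accumulation_steps": 16,
    "hidden_size": 768,
    "hidden_layers": 12,
    "max_seq_length": 128,
}

# With gradient accumulation, the optimizer steps once every
# `gradient_accumulation_steps` mini-batches, so the effective
# batch size per update is batch_size * accumulation steps.
effective_batch = (mbert_config["batch_size"]
                   * mbert_config["gradient_accumulation_steps"])
print(effective_batch)  # → 256
```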
Urdu sentiment analysis results using machine learning models with word n-gram features.
| Feature | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Unigram | KNN | 67.23 | 63.31 | 70.34 | 66.64 |
| RF | 65.80 | 62.07 | 69.12 | 65.40 | |
| NB | 68.70 | 65.45 | 70.19 | 67.73 | |
| LR | 64.70 | 61.90 | 67.01 | 64.35 | |
| MLP | 67.81 | 65.01 | 70.22 | 67.46 | |
| SVM | 71.66 | 69.02 | 72.76 | 70.84 | |
| AdaBoost | 69.23 | 66.99 | 71.01 | 68.94 | |
| Bigram | KNN | 61.73 | 59.21 | 63.04 | 61.06 |
| RF | 60.58 | 58.97 | 62.10 | 60.49 | |
| NB | 64.39 | 62.05 | 66.20 | 64.05 | |
| LR | 60.24 | 58.10 | 61.98 | 59.97 | |
| MLP | 63.30 | 60.01 | 65.02 | 62.28 | |
| SVM | 67.96 | 64.45 | 69.00 | 66.64 | |
| AdaBoost | 64.03 | 61.90 | 66.10 | 63.93 | |
| Trigram | KNN | 58.13 | 48.88 | 68.04 | 57.19 |
| RF | 55.39 | 47.00 | 67.20 | 55.31 | |
| NB | 59.20 | 51.05 | 70.20 | 59.11 | |
| LR | 55.00 | 47.09 | 65.80 | 54.89 | |
| MLP | 57.40 | 49.10 | 68.78 | 57.29 | |
| SVM | 61.66 | 50.00 | 68.10 | 61.25 | |
| AdaBoost | 58.50 | 51.01 | 67.80 | 58.21 | |
| Combination (1–2) | KNN | 67.62 | 66.02 | 69.30 | 67.62 |
| RF | 66.95 | 65.07 | 68.89 | 66.92 | |
| NB | 70.10 | 68.06 | 71.97 | 69.96 | |
| LR | 66.30 | 64.16 | 67.32 | 65.70 | |
| MLP | 69.91 | 67.23 | 70.98 | 69.05 | |
| SVM | 72.71 | 71.05 | 74.10 | 72.54 | |
| AdaBoost | 70.60 | 69.00 | 72.11 | 70.52 | |
| Combination (1–3) | KNN | 67.80 | 66.80 | 68.33 | 67.55 |
| RF | 66.70 | 65.70 | 67.32 | 66.50 | |
| NB | 69.50 | 68.44 | 70.12 | 69.26 | |
| LR | 66.00 | 64.70 | 66.39 | 65.53 | |
| MLP | 69.80 | 68.09 | 70.30 | 69.17 | |
| SVM | 71.30 | 70.30 | 72.20 | 71.23 | |
| AdaBoost | 71.00 | 69.70 | 71.59 | 70.63 |
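A sketch of the kind of pipeline behind these numbers: TF-IDF-weighted word n-gram features feeding a linear SVM, shown here with scikit-learn on a toy three-class corpus (the Urdu strings and labels are illustrative, not drawn from the 9312-review dataset). Changing `ngram_range` to `(1, 2)` gives the "Combination (1–2)" setting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy three-class corpus; the real experiments use the full dataset.
texts = ["کھانا بہت اچھا تھا", "فلم خراب تھی", "میچ کل ہو گا",
         "سروس بہترین ہے", "ایپ بہت بری ہے", "خبر آج آئی"]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]

# Word unigram features; ngram_range=(1, 2) gives the combined
# 1-2 gram setting reported in the table.
model = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 1)),
    LinearSVC(),
)
model.fit(texts, labels)
print(model.predict(["کھانا اچھا تھا"]))
```

The same pipeline shape covers the other classical learners in the table by swapping `LinearSVC` for, e.g., a naive Bayes or logistic regression estimator.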
Urdu sentiment analysis results using machine learning models with char n-gram features.
| Feature | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| Char-3-Gram | KNN | 65.23 | 61.31 | 68.34 | 64.63 |
| RF | 64.70 | 61.07 | 67.12 | 63.95 | |
| NB | 68.29 | 63.45 | 70.19 | 66.65 | |
| LR | 64.60 | 62.90 | 66.01 | 64.41 | |
| MLP | 66.71 | 63.01 | 68.22 | 65.51 | |
| SVM | 67.50 | 64.02 | 68.76 | 66.30 | |
| AdaBoost | 64.90 | 62.99 | 66.01 | 64.66 | |
| Char-4-Gram | KNN | 60.75 | 59.21 | 62.04 | 60.59 |
| RF | 60.30 | 57.97 | 60.10 | 59.01 | |
| NB | 63.40 | 60.05 | 64.20 | 62.05 | |
| LR | 60.24 | 57.10 | 60.98 | 58.98 | |
| MLP | 62.10 | 58.15 | 64.10 | 60.98 | |
| SVM | 65.90 | 62.35 | 67.10 | 64.63 | |
| AdaBoost | 62.90 | 60.70 | 64.20 | 62.40 | |
| Char-5-Gram | KNN | 60.00 | 58.10 | 61.10 | 59.56 |
| RF | 58.70 | 56.90 | 59.00 | 57.93 | |
| NB | 62.46 | 59.05 | 62.10 | 60.53 | |
| LR | 58.40 | 55.10 | 59.90 | 57.39 | |
| MLP | 60.10 | 56.01 | 62.00 | 58.85 | |
| SVM | 63.55 | 60.45 | 64.10 | 62.22 | |
| AdaBoost | 61.00 | 59.60 | 61.00 | 60.29 |
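Char n-grams slide a window over characters rather than words, which can help with Urdu's rich morphology and spelling variation. A small illustration of how char-3-gram features are extracted (the sample word is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character 3-grams over the raw string; switch ngram_range to
# (4, 4) or (5, 5) for the other settings in the table.
vec = CountVectorizer(analyzer="char", ngram_range=(3, 3))
vec.fit(["پاکستان"])
print(sorted(vec.vocabulary_))  # the five overlapping 3-grams of the word
```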
Urdu sentiment analysis results using rule-based algorithm.
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Rule-based | 64.20 | 60.50 | 68.09 | 64.07 |
Urdu sentiment analysis results using deep learning models for UCSA-21 Corpus.
| Word Embedding | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| fastText | Bi-LSTM | 76.50 | 75.01 | 77.14 | 76.06 |
| Bi-GRU | 75.60 | 73.10 | 76.70 | 74.85 | |
| CNN-1D | 72.10 | 69.79 | 72.70 | 71.21 | |
| CNN-1D+MP | 70.09 | 68.79 | 70.70 | 69.73 | |
| CNN-1D+ATT | 73.80 | 71.79 | 75.70 | 73.69 | |
| LSTM | 73.15 | 71.40 | 74.28 | 72.49 | |
| LSTM+MP | 72.15 | 70.40 | 73.28 | 71.81 | |
| LSTM+ATT | 74.80 | 72.40 | 76.28 | 74.41 | |
| GRU | 72.50 | 71.00 | 72.00 | 71.49 | |
| BERT | Proposed model | 77.61 | 76.15 | 78.25 | 77.18 |
Urdu sentiment analysis results using deep learning models for UCSA corpus.
| Word Embedding | Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|---|
| fastText | Bi-LSTM | 81.10 | 80.20 | 80.55 | 80.37 |
| Bi-GRU | 80.55 | 80.05 | 80.15 | 80.09 | |
| CNN-1D | 78.10 | 78.43 | 76.78 | 77.59 | |
| CNN-1D+MP | 77.60 | 77.05 | 75.25 | 76.13 | |
| CNN-1D+ATT | 79.05 | 78.00 | 78.30 | 78.15 | |
| LSTM | 78.85 | 77.76 | 77.83 | 77.79 | |
| LSTM+MP | 77.55 | 76.50 | 76.45 | 76.47 | |
| LSTM+ATT | 79.05 | 79.80 | 78.50 | 78.67 | |
| GRU | 78.35 | 77.30 | 77.15 | 77.22 | |
| BERT | Proposed model | 82.50 | 81.35 | 81.65 | 81.49 |
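The F1 scores in these tables follow from the reported precision and recall as their harmonic mean; for example, the fastText Bi-LSTM row above (P = 80.20, R = 80.55) yields the tabulated 80.37. A quick check:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (values in percent)."""
    return 2 * precision * recall / (precision + recall)

# Bi-LSTM with fastText embeddings on the UCSA corpus.
print(round(f1(80.20, 80.55), 2))  # → 80.37
```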
Figure 5. Accuracy comparison of machine learning, deep learning, and rule-based approaches with the proposed model using the UCSA-21 corpus.
Figure 6. Confusion matrix of our proposed model using our proposed UCSA-21 corpus.
Figure 7. Confusion matrix of our proposed model using the UCSA corpus.