Latifah Almuqren, Alexandra Cristea.
Abstract
Compared with other languages, Arabic lacks large corpora for Natural Language Processing (Assiri, Emam & Al-Dossari, 2018; Gamal et al., 2019). A number of scholars have relied on translation from one language to another to construct their corpora (Rushdi-Saleh et al., 2011). This paper presents how we constructed, cleaned, pre-processed, and annotated our 20,000-tweet Gold Standard Corpus (GSC) AraCust, the first Telecom GSC for Arabic Sentiment Analysis (ASA) for Dialectal Arabic (DA). AraCust contains Saudi dialect tweets, processed from a self-collected Arabic tweets dataset, and has been annotated for sentiment analysis, i.e., manually labelled (k=0.60). In addition, we illustrate AraCust's power by performing an exploratory data analysis of the features that arise from the nature of our corpus, to assist with choosing the right ASA methods for it. To evaluate AraCust, we first ran a simple experiment using a supervised classifier, to offer benchmark outcomes for future work. We then applied the same supervised classifier to a publicly available Arabic dataset created from Twitter, ASTD (Nabil, Aly & Atiya, 2015). The results show that the classifier performs better on AraCust than on ASTD, reaching 91% accuracy and an 89% F1avg score. The AraCust corpus will be released, together with code useful for its exploration, via GitHub as a part of this submission. ©2021 Almuqren and Cristea.
Keywords: Arabic; Gold Standard Corpus; Sentiment analysis; Supervised approach
Year: 2021 PMID: 34084924 PMCID: PMC8157250 DOI: 10.7717/peerj-cs.510
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Comparison between different Arabic corpora.
| Corpus | Source | Size | Type | Availability |
| --- | --- | --- | --- | --- |
| Al-Hayat Corpus | Al-Hayat newspaper articles | 42,591 | MSA | Available for a fee |
| Arabic Lexicon for Business Reviews | Reviews | 2,000 URLs | MSA | Not Available |
| AWATIF (a multi-genre corpus of Modern Standard Arabic) | Wikipedia Talk Pages (WTP), The Web forum (WF) and Part 1 V 3.0 (ATB1V3) of the Penn Arabic TreeBank (PATB) | 2,855 sentences from PATB, 5,342 sentences from WTP and 2,532 sentences from WF | MSA/Dialect | Not Available |
| The Arabic Opinion Holder Corpus | News articles | 1 MB news documents | MSA | Available at |
| Large Arabic Book Review Corpus (LABR) | Book reviews from GoodReads.com | 63,257 book reviews | MSA/Dialect | Freely available at |
| Arabic Twitter Corpus | | 8,868 tweets | Arabic dialect | Available via the ELRA repository |
| An-Nahar Corpus | Newspaper text | | MSA | Available for a fee |
| Tunisian Arabic Railway Interaction Corpus (TARIC) | Dialogues in the Tunisian Railway Transport Network | 4,662 | Tunisian dialect | Not Available |
| DARDASHA | Chat Maktoob (Egyptian website) | 2,798 | Arabic dialect | Not Available |
| TAGREED | | 3,015 | MSA/Dialect | |
| TAHRIR | Wikipedia Talk pages | 3,008 | MSA | |
| MONTADA | Forums | 3,097 | MSA/Dialect | |
| Hotel Reviews (HTL) | TripAdvisor.com | 15,572 | MSA/Dialect | Not Available |
| Restaurant Reviews (RES) | Qaym.com | 10,970 | MSA/Dialect | |
| Movie Reviews (MOV) | Elcinemas.com | 1,524 | MSA/Dialect | |
| Product Reviews (PROD) | Souq.com | 4,272 | MSA/Dialect | |
| MIKA | Twitter and different forum websites for TV shows, product and hotel reservation | 4,000 topics | MSA and Egyptian dialect | Not Available |
| Arabic Sentiment Tweets Dataset (ASTD) | | 10,000 | Egyptian dialect | Freely available at |
| Health dataset | | 2,026 tweets | Arabic dialect | Not Available |
| SUAR (Saudi corpus for NLP Applications and Resources) | Different social media sources such as Twitter, YouTube, Instagram and WhatsApp | 104,079 words | Saudi dialect | Not Available |
| Twitter Benchmark Dataset for Arabic Sentiment Analysis | | 151,000 sentences | MSA/Egyptian dialect | Not Available |
Comparison between different Saudi dialect corpora for ASA.
| Corpus | Source | Size | Sentiment labels | Availability |
| --- | --- | --- | --- | --- |
| AraSenti-Tweet Corpus of Arabic SA | | 17,573 tweets | Positive, negative, neutral, or mixed | Not Available |
| Saudi Dialects Twitter Corpus (SDTC) | | 5,400 tweets | Positive, negative, neutral, objective, spam, or not sure | Not Available |
| Sentiment corpus for Saudi dialect | | 4,000 tweets | Positive or negative | Not Available |
| Corpus for Sentiment Analysis | | 4,700 tweets | | Not Available |
| Saudi public opinion | Two Saudi newspapers | 815 comments | Strongly positive, positive, negative, or strongly negative | Available upon request |
| Saudi corpus | | 5,500 tweets | Positive, negative, or neutral | Not Available |
| Saudi corpus | | 1,331 tweets | Positive, negative, or neutral | Not Available |
Figure 1. Percentage of Arabic corpora based on the type of corpus, from 2002 to 2019.
Companies and the total number of unique tweets from each in AraCust.
| Company | Twitter accounts | Unique tweets |
| --- | --- | --- |
| STC | @STC_KSA, @STCcare, @STCLive | 7,590 |
| Mobily | @Mobily, @Mobily1100, @MobilyBusiness | 6,460 |
| Zain | @ZainKSA, @ZainHelpSA | 5,950 |
| Total | | 20,000 |
Figure 2. AraCust corpus collection, filtering and pre-processing.
Subset of the corpus before and after pre-processing.
| Tweet before pre-processing | Sentiment | Company | English translation | Tweet after pre-processing |
| --- | --- | --- | --- | --- |
| @So2019So @STCcare الشركهغيري | Negative | STC | Change the Company | غيريشركه |
| @alrakoo @mmshibani @GOclub @Mobilyاشكرك | Positive | Mobily | Thank you | اشكرك |
| @ZainKSA @CITC_withU يعوضنيعنالخسايرمين | Negative | Zain | Who will compensate me for the losses | مينيعوضنيعنخساير |
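The examples above show that account handles are dropped during pre-processing. A minimal sketch of that one step follows; the function name `strip_twitter_noise`, the regular expressions, and the extra URL removal are assumptions made for illustration and are not taken from the authors' pipeline, which performs further normalisation not reproduced here.

```python
import re

# Twitter handles are ASCII, so the pattern deliberately excludes Arabic letters.
MENTION = re.compile(r"@[A-Za-z0-9_]+")
URL = re.compile(r"https?://\S+")
WHITESPACE = re.compile(r"\s+")

def strip_twitter_noise(tweet: str) -> str:
    """Remove links and handles, then collapse runs of whitespace."""
    text = URL.sub(" ", tweet)
    text = MENTION.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()

print(strip_twitter_noise("@alrakoo @mmshibani @GOclub @Mobily اشكرك"))
```

Restricting the handle pattern to ASCII keeps it from swallowing Arabic text that directly follows a handle without a space.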
Companies and the total number of positive and negative tweets.
| Company | Negative | Positive | Total |
| --- | --- | --- | --- |
| STC | 5,065 | 2,525 | 7,590 |
| Mobily | 4,530 | 1,930 | 6,460 |
| Zain | 3,972 | 1,978 | 5,950 |
| Total | 13,567 | 6,433 | 20,000 |
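The per-company counts can be checked against the totals row, and the class imbalance quantified, with a few lines of arithmetic. The sketch below copies the counts from the table; the variable names are hypothetical.

```python
# (first sentiment column, second sentiment column) per company, as listed in the table
counts = {
    "STC": (5065, 2525),
    "Mobily": (4530, 1930),
    "Zain": (3972, 1978),
}

first_total = sum(a for a, _ in counts.values())
second_total = sum(b for _, b in counts.values())
corpus_total = first_total + second_total

# Share of the larger class across the whole corpus
majority_share = first_total / corpus_total

print(first_total, second_total, corpus_total, round(majority_share, 3))
```

Roughly two thirds of the tweets fall in the larger class, which is worth keeping in mind when reading accuracy figures for this corpus.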
Figure 3. Distribution of negative and positive sentiment.
Figure 4. Tweet length distribution across sentiment.
Figure 5. Tweet length distribution across companies.
Most frequent words in the AraCust corpus.
| Word | Frequency | English translation |
| --- | --- | --- |
| نت | 1770 | Internet |
| الله | 1760 | God |
| سلام | 1363 | Hello |
| والله | 1179 | Swear God |
| خاص | 1315 | Private |
| حسبي | 637 | Pray |
| عملاء | 599 | Customers |
| شكرا | 560 | Thank you |
| مشكلة | 549 | Problem |
| شريحة | 515 | Sim card |
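A frequency table of this kind can be produced with a simple token count. The sketch below assumes whitespace tokenisation of already pre-processed tweets; the helper name `top_words` and the three toy tweets are hypothetical and are not drawn from AraCust.

```python
from collections import Counter

def top_words(tweets, n=10):
    """Count whitespace-separated tokens and return the n most frequent."""
    counter = Counter()
    for tweet in tweets:
        counter.update(tweet.split())
    return counter.most_common(n)

# Toy input for illustration only
sample = ["شكرا عملاء", "مشكلة شريحة", "شكرا"]
print(top_words(sample, 1))
```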
Figure 6. Most frequent Bigrams in the AraCust corpus.
Most frequent words in the AraCust corpus and their sentiment probability.
| Word | English translation | Negative | Positive | Total | P(positive) | P(negative) |
| --- | --- | --- | --- | --- | --- | --- |
| نت | Internet | 975 | 795 | 1,770 | 0.44 | 0.55 |
| الله | God | 977 | 783 | 1,760 | 0.44 | 0.55 |
| سلام | Hello | 765 | 895 | 1,363 | 0.65 | 0.56 |
| والله | Swear God | 567 | 704 | 1,179 | 0.59 | 0.48 |
| خاص | Private | 656 | 659 | 1,315 | 0.50 | 0.49 |
| حسبي | Pray | 425 | 212 | 637 | 0.33 | 0.66 |
| عملاء | Customers | 413 | 186 | 599 | 0.31 | 0.68 |
| شريحه | Sim card | 271 | 289 | 560 | 0.51 | 0.48 |
| مشكله | Problem | 279 | 270 | 549 | 0.49 | 0.50 |
| شكرا | Thank you | 235 | 280 | 515 | 0.54 | 0.45 |
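For the rows whose two counts add up to the listed total, the probabilities appear to be each class count divided by that total. This is an inference from the numbers, not a formula stated in this excerpt, and the function name `class_shares` is hypothetical. A short check on the first row:

```python
def class_shares(count_a, count_b):
    """Return each count as a fraction of their sum."""
    total = count_a + count_b
    return count_a / total, count_b / total

# First row of the table: 975 and 795 occurrences out of 1,770
share_a, share_b = class_shares(975, 795)
print(round(share_a, 4), round(share_b, 4))
```

The two shares come out near 0.55 and 0.45, matching the listed 0.55 and 0.44 if the second value was truncated rather than rounded. Rows whose counts do not sum to the listed total (for example 765 + 895 against 1,363) cannot be reproduced this way.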
Character-based features.
| Feature | Value |
| --- | --- |
| Punctuation marks | 8.0 |
| Numbers | 6.03 |
| Symbol | 0.0 |
Sentence-based features.
| Feature | Value |
| --- | --- |
| Words per sentence | 16.23 |
| Sentence standard deviation | 7.17 |
| Range | 30 |
Word-based features.
| Feature | Value |
| --- | --- |
| Word standard deviation | 6.51 |
| Word range | 30 |
| Chars per word | 5.22 |
| Vocabulary richness | 1.0 |
| Stop words | 0.0 |
| Proper nouns | 0.11 |
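Length-based features like those listed can be sketched with the standard library. The definitions below are assumptions, since this excerpt does not give the authors' formulas: tokens are split on whitespace, the standard deviation is the population form, and vocabulary richness is taken to be the type-token ratio. The function name `length_features` and the toy input are hypothetical.

```python
import statistics

def length_features(tweets):
    """Compute simple length statistics over whitespace-tokenised tweets."""
    lengths = [len(t.split()) for t in tweets]
    words = [w for t in tweets for w in t.split()]
    return {
        "words_per_tweet": statistics.mean(lengths),
        "length_std": statistics.pstdev(lengths),
        "length_range": max(lengths) - min(lengths),
        "chars_per_word": statistics.mean(len(w) for w in words),
        # Assumed definition: distinct tokens divided by all tokens
        "type_token_ratio": len(set(words)) / len(words),
    }

# Toy input for illustration only
features = length_features(["a bb", "ccc dd ee f"])
print(features)
```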
Figure 7. The included annotation guidelines in the XLSX file.
Figure 8. The annotation file.
Two-by-two agreement for binary classification between the three annotators.
| Annotators | Agreement |
| --- | --- |
| A1 & A2 | 0.7 |
| A2 & A3 | 0.74 |
| A1 & A3 | 0.87 |
| Avg A | 0.77 |
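The average row is the mean of the three pairwise scores. The sketch below reproduces that mean and, for context, shows how Cohen's kappa is computed for one annotator pair. Whether the table reports kappa or another agreement statistic is not stated in this excerpt, so kappa is an assumption here, and the label lists passed to `cohen_kappa` are toy data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's label proportions, summed over labels
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Pairwise scores copied from the table
pairwise = {"A1-A2": 0.70, "A2-A3": 0.74, "A1-A3": 0.87}
average = sum(pairwise.values()) / len(pairwise)

print(round(average, 2))
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```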
Datasets used in the evaluation.
| Dataset | Positive | Negative | Total |
| --- | --- | --- | --- |
| AraCust | 6,433 | 13,567 | 20,000 |
| ASTD | 797 | 1,682 | 2,479 |
Evaluation results of using the SVM on the datasets.
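| Dataset | Positive precision | Positive recall | Positive F1 | Negative precision | Negative recall | Negative F1 | F1avg | Accuracy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AraCust | 93.0 | 76.0 | 83.6 | 91.0 | 98.0 | 94.4 | 89.0 | 91.0 |
| ASTD | 79.0 | 65.0 | 71.3 | 76.0 | 96.0 | 84.4 | 77.9 | 85.0 |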
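The AraCust row is internally consistent if the third and sixth values are per-class F1 scores (the harmonic mean of the preceding precision and recall) and the seventh is their unweighted average. That reading is inferred from the numbers and from the F1avg figure quoted in the abstract; the variable names below are hypothetical.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs copied from the AraCust row (percentages)
f1_first_class = f1(93.0, 76.0)
f1_second_class = f1(91.0, 98.0)

# Unweighted (macro) average over the two classes
f1_avg = (f1_first_class + f1_second_class) / 2

print(round(f1_first_class, 1), round(f1_second_class, 1), round(f1_avg, 1))
```

The three printed values agree with the 83.6, 94.4 and 89.0 listed for AraCust.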
Percentage of predicted vs. actual customer satisfaction.
| 40.01% | 20.1% | |
| 39.00% | 22.89% | |
| 34.06% | 22.91% |
Figure 9. Snapshot from the Python code for the tweet generator.
Figure 10. Number of participants based on telecom companies.
Figure 11. Number of satisfied and unsatisfied users for STC company.
Figure 12. Number of satisfied and unsatisfied users for Mobily company.
Figure 13. Number of satisfied and unsatisfied users for Zain company.