Abstract
Bengali is a low-resource language that lacks tools and resources for various natural language processing (NLP) tasks, such as sentiment analysis and profanity identification. For Bengali, only translated versions of English sentiment lexicons are available. Moreover, no dictionary exists for detecting profanity in Bengali social media text. This study introduces a Bengali sentiment lexicon, BengSentiLex, and a Bengali swear lexicon, BengSwearLex. To create BengSentiLex, a cross-lingual methodology is proposed that uses a machine translation system, a review corpus, two English sentiment lexicons, pointwise mutual information (PMI), and supervised machine learning (ML) classifiers in various stages. A semi-automatic methodology is presented to develop BengSwearLex that leverages an obscene corpus, word embeddings, and part-of-speech (POS) taggers. The performance of BengSentiLex is compared with that of the translated English lexicons on three evaluation datasets. BengSentiLex achieves a 5%-50% improvement over the translated lexicons. For identifying profanity, BengSwearLex achieves document-level coverage of around 85% on the evaluation dataset. The experimental results imply that BengSentiLex and BengSwearLex are effective resources for classifying sentiment and identifying profanity in Bengali social media content, respectively. © 2021 Sazzed.
Keywords: Profanity detection; Sentiment lexicon
Year: 2021 PMID: 34901419 PMCID: PMC8627231 DOI: 10.7717/peerj-cs.681
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
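The abstract names pointwise mutual information (PMI) as one ingredient of the BengSentiLex pipeline but does not spell out the scoring. The following is a minimal sketch of one common way to score word polarity with PMI from a labeled corpus, not the author's exact procedure. The function name `pmi_polarity`, the add-one smoothing, and the English toy tokens are all assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_polarity(labeled_docs, smoothing=1.0):
    """Score each word as PMI(word, pos) - PMI(word, neg).

    Since PMI(w, c) = log2(P(w | c) / P(w)), the difference reduces to
    log2(P(w | pos) / P(w | neg)). Smoothing is an assumption of this sketch.
    """
    word_class = Counter()    # (word, label) -> token count
    class_tokens = Counter()  # label -> total tokens in that class
    for tokens, label in labeled_docs:
        for token in tokens:
            word_class[(token, label)] += 1
            class_tokens[label] += 1

    vocab = {word for (word, _) in word_class}
    scores = {}
    for word in vocab:
        p_pos = (word_class[(word, "pos")] + smoothing) / (
            class_tokens["pos"] + smoothing * len(vocab))
        p_neg = (word_class[(word, "neg")] + smoothing) / (
            class_tokens["neg"] + smoothing * len(vocab))
        scores[word] = math.log2(p_pos / p_neg)
    return scores

# Hypothetical toy corpus (English stand-ins for Bengali review tokens).
toy_docs = [
    (["good", "story"], "pos"),
    (["good", "acting"], "pos"),
    (["bad", "story"], "neg"),
    (["bad", "ending"], "neg"),
]
scores = pmi_polarity(toy_docs)
```

Words with clearly positive scores would be candidates for the positive side of a lexicon and clearly negative ones for the negative side; a word that is equally frequent in both classes scores zero and carries no polarity signal.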
Figure 1. Sample reviews from the training dataset Drama-Train.
Evaluation datasets for BengSentiLex.
| | | | | |
|---|---|---|---|---|
| | Drama Review | 1000 | 1000 | 2000 |
| | News Comments | 2000 | 2000 | 4000 |
| | News Comments | 5205 | 5205 | 10410 |
Figure 2. The various phases of sentiment lexicon generation in Bengali.
Figure 3. The lexicon building block.
Performances of supervised ML classifiers in annotated corpus.
| | | | | |
|---|---|---|---|---|
| | 0.939 | 0.901 | 0.920 | 93.61% |
| | 0.908 | 0.924 | 0.916 | 93.00% |
| | 0.889 | 0.922 | 0.905 | 91.80% |
| | 0.901 | 0.849 | 0.875 | 90.18% |
| | 0.878 | 0.870 | 0.874 | 89.91% |
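The column headers of the classifier table did not survive extraction. As a hedged reading aid, the sketch below checks whether the third numeric column is consistent with an F1 score, that is, the harmonic mean of the first two numeric columns. Treating those two columns as precision and recall is an assumption; because the harmonic mean is symmetric, their order does not affect the check.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (first column, second column, third column) copied from the table rows.
rows = [
    (0.939, 0.901, 0.920),
    (0.908, 0.924, 0.916),
    (0.889, 0.922, 0.905),
    (0.901, 0.849, 0.875),
    (0.878, 0.870, 0.874),
]

# Absolute gap between the recomputed harmonic mean and the reported value.
gaps = [abs(f1_score(a, b) - reported) for a, b, reported in rows]
consistent = all(gap < 0.0015 for gap in gaps)
```

Under that assumption, every row agrees with the reported value to within the rounding of the three-decimal inputs.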
Figure 4. Class-label assignment of unlabeled reviews using a supervised ML classifier and labeled data.
Figure 5. Examples of Bengali obscene comments and corresponding English translations.
Description of drama review evaluation corpus.
| | | |
|---|---|---|
| 664 | 2643 | 3307 |
Figure 6. The proposed methodology.
Accuracy of various lexicons for sentiment classification.
| | | | | |
|---|---|---|---|---|
| Drama | | 145/1000 (14.50%) | 488/1000 (48.80%) | 633/2000 (31.65%) |
| Drama | | 241/1000 (24.10%) | 598/1000 (59.80%) | 839/2000 (41.95%) |
| Drama | | 225/1000 (22.5%) | 707/1000 (70.70%) | 932/2000 (46.60%) |
| Drama | | 533/1000 (53.30%) | 775/1000 (77.50%) | 1308/2000 (65.40%) |
| News1 | | 626/2000 (31.3%) | 628/2000 (31.4%) | 1254/4000 (31.35%) |
| News1 | | 590/2000 (29.5%) | 833/2000 (41.65%) | 1423/4000 (35.57%) |
| News1 | | 566/2000 (28.30%) | 1070/2000 (53.50%) | 1636/4000 (40.90%) |
| News1 | | 932/2000 (46.6%) | 960/2000 (48.00%) | 1892/4000 (47.30%) |
| News2 | | 2004/5660 (35.4%) | 1826/5205 (35.08%) | 3830/10865 (35.2%) |
| News2 | | 1763/5660 (31.14%) | 2274/5205 (43.68%) | 4037/10865 (37.15%) |
| News2 | | 1662/5205 (31.93%) | 2827/5205 (54.31%) | 4489/10410 (43.12%) |
| News2 | | 2086/5205 (40.08%) | 2721/5205 (52.27%) | 4807/10410 (46.17%) |
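The accuracy figures above are reported as correct/total ratios for lexicon-based classification. The record does not state the decision rule, so the following is only a minimal sketch of a typical one: count lexicon hits of each polarity and pick the majority. The names `classify` and `accuracy`, the rule that a tie or no match counts as an incorrect prediction, and the English toy words are assumptions.

```python
def classify(tokens, pos_lex, neg_lex):
    """Return 'pos', 'neg', or None when the lexicon gives no decision."""
    score = sum(t in pos_lex for t in tokens) - sum(t in neg_lex for t in tokens)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return None  # tie or no lexicon word found

def accuracy(labeled_docs, pos_lex, neg_lex):
    """Return (correct, total); undecided documents count as incorrect."""
    correct = sum(
        classify(tokens, pos_lex, neg_lex) == label
        for tokens, label in labeled_docs
    )
    return correct, len(labeled_docs)

# Hypothetical toy lexicon and documents (English stand-ins).
pos_lex = {"good", "great"}
neg_lex = {"bad", "boring"}
toy_docs = [
    (["great", "story"], "pos"),
    (["good", "but", "boring", "and", "bad"], "neg"),
    (["nothing", "matched"], "pos"),
]
correct, total = accuracy(toy_docs, pos_lex, neg_lex)
```

Under a rule like this, a lexicon with low vocabulary coverage scores poorly even when its entries are accurate, because documents containing no lexicon word cannot be classified correctly.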
Document-level coverage of various methods for profanity detection.
| Type | | | |
|---|---|---|---|
| Unsupervised | | 564/664 | 84.93% |
| | | 161/664 | 24.5% |
| Supervised (Unbalanced) | | 345/664 | 53.4% |
| | | 366/664 | 58.8% |
| | | 433/664 | 65.21% |
| | | 462/664 | 70.4% |
| | | 444/664 | 66.86% |
| | | 609/664 | 91.71% |
| Supervised (Balanced) | | 594/664 | 89.45% |
| | | 589/664 | 88.70% |
| | | 610/664 | 91.67% |
| | | 624/664 | 94.0% |
| | | 609/664 | 91.71% |
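Document-level coverage for a lexicon, as reported in ratios such as 564/664, can be read as the share of profane documents in which at least one lexicon entry occurs. That reading is an assumption, and the sketch below illustrates it with a hypothetical function `document_coverage` and placeholder tokens rather than entries from BengSwearLex.

```python
def document_coverage(docs, lexicon):
    """Return (hits, total): documents containing at least one lexicon word."""
    hits = sum(1 for tokens in docs if any(t in lexicon for t in tokens))
    return hits, len(docs)

# Placeholder lexicon and tokenized documents, for illustration only.
swear_lex = {"swear1", "swear2"}
docs = [
    ["a", "swear1"],
    ["clean", "text"],
    ["swear2", "swear1"],
    ["x"],
]
hits, total = document_coverage(docs, swear_lex)
coverage_pct = 100 * hits / total
```

A document counts once no matter how many lexicon words it contains, which is what distinguishes document-level coverage from a token-level match rate.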