Ahmad Alsharef, Karan Aggarwal, Deepika Koundal, Hashem Alyami, Darine Ameyed.
Abstract
The automated identification of toxicity in text is a crucial area of text analysis, since social media is replete with unfiltered content ranging from mildly abusive to downright hateful. Researchers have found that unintended bias and unfairness introduced by training datasets can cause inaccurate classification of toxic words in context. In this paper, several approaches for locating toxicity in text are assessed and presented, aiming to enhance the overall quality of text classification. General unsupervised methods, built on state-of-the-art models and external embeddings, were used to improve accuracy while mitigating bias and enhancing the F1-score. The suggested approaches combine a long short-term memory (LSTM) deep learning model with GloVe word embeddings and with word embeddings generated by the Bidirectional Encoder Representations from Transformers (BERT) model, respectively. These models were trained and tested on large secondary datasets containing comments labeled as toxic or nontoxic. An accuracy of 94% and an F1-score of 0.89 were achieved using LSTM with BERT word embeddings in the binary classification of comments (toxic and nontoxic). The combination of LSTM and BERT performed better than both LSTM alone and LSTM with GloVe word embeddings. This paper addresses the problem of classifying comments with high accuracy by pretraining models on larger corpora of text (high-quality word embeddings) rather than on the training data alone.
Year: 2022 PMID: 35211168 PMCID: PMC8863472 DOI: 10.1155/2022/8467349
Source DB: PubMed Journal: Comput Intell Neurosci
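The abstract pairs an LSTM classifier with pretrained word embeddings. As a minimal sketch of that setup, assuming Keras and illustrative sizes (the paper does not publish its exact architecture, vocabulary size, sequence length, or layer widths, so all of those are assumptions here), the model below places a frozen GloVe-initialized embedding layer in front of an LSTM and a sigmoid output for the binary toxic/nontoxic decision:

```python
# Minimal sketch: LSTM binary classifier over pretrained GloVe vectors.
# VOCAB_SIZE, MAX_LEN, and layer widths are illustrative assumptions,
# not the paper's reported configuration.
import numpy as np
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout

VOCAB_SIZE = 50_000  # assumed tokenizer vocabulary size
EMBED_DIM = 300      # dimensionality of the pretrained GloVe vectors
MAX_LEN = 200        # assumed padded comment length

def build_model(embedding_matrix: np.ndarray):
    """LSTM binary classifier whose embedding layer is initialized with
    pretrained GloVe vectors and kept frozen during training."""
    model = Sequential([
        Input(shape=(MAX_LEN,)),
        Embedding(
            VOCAB_SIZE,
            EMBED_DIM,
            embeddings_initializer=initializers.Constant(embedding_matrix),
            trainable=False,  # keep the pretrained vectors fixed
        ),
        LSTM(128),
        Dropout(0.2),
        Dense(1, activation="sigmoid"),  # toxic vs. nontoxic probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# A random matrix stands in for real GloVe vectors in this sketch.
model = build_model(np.random.rand(VOCAB_SIZE, EMBED_DIM).astype("float32"))
model.summary()
```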
Figure 1. Experimental workflow.
Classification models used in this work.
| Experiment | Neural network | Word embedding |
|---|---|---|
| 1 | LSTM | GloVe |
| 2 | LSTM | BERT |
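The second experiment replaces static GloVe vectors with contextual BERT embeddings. The paper does not publish its extraction code; the sketch below shows one common way to obtain per-token BERT embeddings with the HuggingFace transformers library, where the bert-base-uncased checkpoint and the 128-token limit are illustrative assumptions:

```python
# Sketch: obtaining contextual BERT token embeddings to feed an LSTM.
# Checkpoint and truncation length are illustrative choices.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def embed(comment: str) -> torch.Tensor:
    """Return per-token embeddings of shape (1, seq_len, 768)."""
    inputs = tokenizer(comment, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state  # the sequence fed to the LSTM

print(embed("Haha, you guys are a bunch of losers.").shape)
```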
Sample from training dataset 1.
| Comment_text | Target (toxicity score) |
|---|---|
| This is so cool. It's like, “would you want your mother to read this??'” | 0 |
| Thank you!! This would make my life a lot less anxiety-inducing. | 0 |
| Haha, you guys are a bunch of losers. | 0.8936 |
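Note that the Target column is a continuous toxicity score, while the reported task is binary, so a binarization step is implied. The 0.5 cutoff in the sketch below is an assumed threshold, not one stated in the paper:

```python
# Sketch: binarizing the continuous toxicity target for the binary task.
# The 0.5 cutoff is an assumption.
import pandas as pd

df = pd.DataFrame({
    "comment_text": [
        "This is so cool. It's like, 'would you want your mother to read this??'",
        "Thank you!! This would make my life a lot less anxiety-inducing.",
        "Haha, you guys are a bunch of losers.",
    ],
    "target": [0.0, 0.0, 0.8936],
})
df["label"] = (df["target"] >= 0.5).astype(int)  # 1 = toxic, 0 = nontoxic
print(df[["target", "label"]])
```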
Figure 2. The relation between selected comment features and toxicity in the first training dataset.
Figure 3. The relation between selected comment features and toxicity.
Label distribution of the second training dataset.
| Class | No. of occurrences |
|---|---|
| Clean | 201,081 |
| Toxic | 21,384 |
| Obscene | 12,140 |
| Insult | 11,304 |
| Identity hate | 2,117 |
| Severe toxic | 1,962 |
| Threat | 689 |
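The distribution above is heavily skewed toward clean comments. A common remedy for such imbalance is class weighting; the paper does not state whether it was applied, so the sketch below only illustrates deriving weights from the clean and toxic counts in the table:

```python
# Sketch: deriving class weights from the label counts above to offset
# imbalance. Whether the authors used weighting is not stated.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

labels = np.array([0] * 201_081 + [1] * 21_384)  # clean vs. toxic counts
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
print(dict(zip([0, 1], weights)))  # roughly {0: 0.55, 1: 5.2}
```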
Sample from the testing dataset.
| Comment_text | Rating |
|---|---|
| Sorry, you missed high school. Eisenhower sent troops to Vietnam after the French withdrew in 1954 | Approved |
| Our oils read; President IS taking different tactics to deal with a corrupt malignant, hypocritical …. | Rejected |
| Why would 90% of articles print fake news to discredit Trump? Where are you getting your new” … | Approved |
Figure 4. Design of the LSTM model layers.
Figure 5. Word embedding types.
Accuracy and F1-score of LSTM with different word embeddings in classifying toxic words.
| Model | Accuracy (%) | F1-score |
|---|---|---|
| LSTM + GloVe | 93 | 0.841 |
| LSTM + BERT | 94 | 0.894 |
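For reference, both reported metrics can be computed with scikit-learn. The labels below are toy values for illustration, not the paper's test outputs:

```python
# Sketch: computing accuracy and F1-score from binary predictions.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]  # toy ground-truth labels
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]  # toy model predictions
print("accuracy:", accuracy_score(y_true, y_pred))  # 0.875
print("F1-score:", f1_score(y_true, y_pred))        # ~0.857
```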