S Abarna, J I Sheeba, S Jayasrilakshmi, S Pradeep Devaneyan.
Abstract
Due to the coronavirus disease outbreak in 2020, countries went into lockdown to combat the spread of the pandemic. Schools and institutions remained closed and students' screen time surged. Classes moved to digital platforms, which led to an increase in social media usage. Many children became victims of cyber harassment, which includes threatening comments directed at young students, sexual harassment through digital platforms, people insulting one another, and the use of fake accounts to harass others. The growing effort on automated cyber harassment detection draws on many AI-related components, such as natural language processing techniques and machine learning approaches. Because machine learning models built on different algorithms often fail to converge to higher accuracy, it is important to use suitable natural language processing steps and efficient classifiers to detect cyberbullying comments on social media. In this proposed work, the lexical meaning of the text is analysed by a conventional scheme, and the word order of the text is handled by the Fast Text model to improve the computational efficiency of the model. The intention of the text is analysed by various feature extraction methods, and the score for intention detection is calculated from the frequency of words together with a bully-victim participation score. Finally, the proposed model's performance is measured by different evaluation metrics, which show that its accuracy is higher and its error rate lower than those of many existing classification methods.
Keywords: Buzzwords; Cyberbullying; Fast text; Insulting vocabulary; Intention detection; Natural language processing; Word2vec
Year: 2022 PMID: 35968532 PMCID: PMC9364757 DOI: 10.1016/j.engappai.2022.105283
Source DB: PubMed Journal: Eng Appl Artif Intell ISSN: 0952-1976 Impact factor: 7.802
Fig. 1: Percentage of different types of cyberharassment.
Fig. 2: Impact of cyberharassment on age diversity.
Summary of existing work on cyberharassment detection.
| Authors | Methodology | Measures | Dataset | Features | Pros & Cons |
|---|---|---|---|---|---|
| | Occupational Safety and Health (OSH) standards | Preventing sexual cyberbullying | Organizations | Textual, User-based | Examines how the association between dark personality traits and sexual cyberbullying activities may be influenced by factors such as gender, age, or socioeconomic class. |
| | Deep decision trees and multi-feature artificial intelligence for classification | Automatic detection of cyberbullying | | Textual, User-based | Validates the improved text classification accuracy of the novel deep classifier. |
| | Conectado, a serious game against bullying | Evaluates the game’s efficacy and analyses the data produced during the game | Social media | Textual, Sentimental, Contextual | Raises awareness. |
| | Big Five and Dark Triad features | Cyberbullying recognition | | Textual, User-based | Evidence of an association between user personality traits and cyberbullying perpetration, used to detect online bullying patterns. |
| | AdaBoost | Cyberbullying detection | Social media | Textual | Provides an effective estimate of performance on the variant containing social media features. |
| | Supervised learning method | Cyberbullying detection | Vine social network | Textual | Proposes two groups of features, threshold and dual, for two early-detection methods. |
| | Feature Density (FD) using different linguistically-backed features | Automatic cyberbullying detection | Social media, Yelp business reviews | Textual, Sentimental, Contextual | Estimates dataset complexity. |
| | Feature engineering techniques and machine learning | Exploring irony and satire | | Textual, User-based | Evaluates the contribution of irony and sarcasm recognition to cyberbullying detection tasks. |
| | Supervised machine learning | Automated detection and prevention systems | | Textual, Sentimental, User-based | Bullies’ relative popularity, collective and automated efficacy, and incident interpretation. |
| | The Critical Appraisal Skills Programme (CASP) assessment tool | Assessing the quality of the included studies | Social media | Textual, Sentimental, Contextual | Five major themes were identified: intent, repetition, accessibility, anonymity, and disclosure barriers. |
| | Robust methodology | Separating bullies and aggressors from regular Twitter users | | Textual, User-based, Network-based | Classifies accounts with an accuracy of over 90% and a high AUC. |
| | Socio-contextual approach | Creating and testing a model for automatic detection | | Textual, Sentimental, Contextual | Provides information about several scenarios for detecting cyberbullying. |
Fig. 3: The process for detecting online harassment along with its intention model.
Example of the data cleaning process.
| Sample text: @jackSimson_$Your mother looks so bitch UGLY ass nappy which make scenes nigga**# | |
|---|---|
| Techniques | Specified outcome |
| Lowercase conversion | @jacksimson_$your mother looks so bitch ugly ass nappy which make scenes nigga**# |
| Removal of symbols | jacksimson your mother looks so bitch ugly ass nappy which make scenes nigga |
| After tokenization | [‘jacksimson’, ‘your’, ‘mother’, ‘looks’, ‘so’, ‘bitch’, ‘ugly’, ‘ass’, ‘nappy’, ‘which’, ‘make’, ‘scenes’, ‘nigga’] |
| After lemmatization | [‘jacksimson’, ‘your’, ‘mother’, ‘look’, ‘so’, ‘bitch’, ‘ugly’, ‘ass’, ‘nappy’, ‘which’, ‘make’, ‘scene’, ‘nigga’] |
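The cleaning steps in the table above can be reproduced with standard NLP tooling. The sketch below is a minimal, assumed implementation using NLTK (the paper does not name a specific library); function and variable names are illustrative.

```python
# Minimal data-cleaning sketch: lowercase, strip symbols, tokenize, lemmatize.
# Requires: nltk.download('punkt'); nltk.download('wordnet')
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def clean_comment(text):
    text = text.lower()                                   # lowercase conversion
    text = re.sub(r'[^a-z\s]', ' ', text)                 # removal of symbols and digits
    tokens = word_tokenize(text)                          # tokenization
    return [lemmatizer.lemmatize(tok) for tok in tokens]  # lemmatization

print(clean_comment("@jackSimson_$Your mother looks so bitch UGLY ass nappy which make scenes nigga**#"))
```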
Sample working example for feature extraction.
| Sample Text: ‘‘Everybody makes mistakes and he did so what like shit it’’ | |
|---|---|
| Techniques | Specified Outcome |
| Tagged tokens | [(Everybody/NN), (makes/VBP), (mistakes/NN), (and/CC), (he/PRP), (did/VBG), (so/CC), (what/WDT), (like/NN), (shit/FW), (it/PRP)] |
| Filter ‘NN’ & ‘PRP’ | [(Everybody/NN), (mistakes/NN), (he/PRP), (like/NN), (it/PRP)] |
| Finding target user | [(he/PRP), (it/PRP)] |
| Mapping function for finding intention | ‘Individual person’ |
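As a rough illustration of this feature-extraction step, the sketch below tags the tokens, keeps only nouns (NN*) and personal pronouns (PRP), and treats the pronouns as pointers to the target user. It assumes NLTK's part-of-speech tagger; the helper name is illustrative.

```python
# POS-based feature extraction sketch: tag tokens, filter NN*/PRP, collect pronouns.
# Requires: nltk.download('averaged_perceptron_tagger')
import nltk

def extract_candidates(tokens):
    tagged = nltk.pos_tag(tokens)                                  # e.g. ('mistakes', 'NNS')
    kept = [(w, t) for w, t in tagged if t.startswith('NN') or t == 'PRP']
    pronouns = [w for w, t in kept if t == 'PRP']                  # candidate target users
    return kept, pronouns

tokens = ['Everybody', 'makes', 'mistakes', 'and', 'he', 'did',
          'so', 'what', 'like', 'shit', 'it']
print(extract_candidates(tokens))
```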
Pseudocode for finding targeted users.
| Input: ti |
|---|
| Output: (C1, T1), where T1 is the target identified for comment C1 |
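The body of this pseudocode is not reproduced in the record. The sketch below is an assumed reconstruction of the idea: the pronouns kept by the previous step decide whether the comment targets an individual or a group. The pronoun sets and the mapping rule are placeholders, not the authors' exact logic.

```python
# Assumed target-finding sketch: map kept pronouns to a target type for each comment.
SINGULAR = {'he', 'she', 'it', 'you', 'him', 'her'}              # placeholder pronoun sets
PLURAL = {'they', 'them', 'we', 'us', 'everyone', 'everybody'}

def find_target(comment_id, pronouns):
    words = {p.lower() for p in pronouns}
    if words & PLURAL:
        target = 'group of people'
    elif words & SINGULAR:
        target = 'individual person'
    else:
        target = 'unknown'
    return comment_id, target                                    # the (Ci, Ti) pair above

print(find_target('C1', ['he', 'it']))                           # ('C1', 'individual person')
```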
Pseudocode for determining intention behind each text group.
| Input: ti |
|---|
| Output: (C1, I1), where I1 is the intention identified for text group C1 |
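This pseudocode body is also not included in the record. As a rough illustration of the idea described in the abstract (an intention score derived from the frequency of insulting words, weighted by a bully-victim participation score), here is an assumed sketch; the lexicon, weighting, and normalization are placeholders.

```python
# Assumed intention-scoring sketch: frequency of lexicon hits times a participation weight.
from collections import Counter

INSULT_LEXICON = {'bitch', 'shit', 'ugly', 'kill', 'stupid'}     # placeholder vocabulary

def intention_score(tokens, participation_score=1.0):
    hits = Counter(t for t in tokens if t in INSULT_LEXICON)
    return sum(hits.values()) / max(len(tokens), 1) * participation_score

tokens = ['like', 'i', 'said', 'kill', 'yourself', 'you', 'dumb', 'shit']
print(intention_score(tokens, participation_score=0.8))
```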
Fig. 6: BOW example for sample sentences.
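A small bag-of-words example along the lines of Fig. 6; scikit-learn's CountVectorizer is used here only for illustration (the paper does not name a library), and the sentences echo the worked example further below.

```python
# Bag-of-words sketch: build a vocabulary and per-sentence word counts.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ['you stupid hoe', 'you broke ass stupid crack head']
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(bow.toarray())                        # word counts per sentence
```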
Fig. 4: Word cloud sample of harassment words with higher frequency in the vocabulary.
Fig. 5: Word cloud sample of non-harassment words with higher frequency in the vocabulary.
Fig. 7: Skip-gram example for finding the target vector.
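A minimal skip-gram example in the spirit of Fig. 7, using gensim's Word2Vec with sg=1 (an assumption about tooling; the corpus, vector size, and window are illustrative).

```python
# Skip-gram sketch: train word vectors and read off the vector for a target word.
from gensim.models import Word2Vec

corpus = [['you', 'worthless', 'shit'],
          ['you', 'stupid', 'hoe'],
          ['kill', 'yourself', 'you', 'dumb']]

model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)  # sg=1 -> skip-gram
target_vector = model.wv['you']                                        # target word vector
print(model.wv.most_similar('you', topn=3))
```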
Input and output for the conventional intention detection model.
| Input | Training dataset S |
|---|---|
| Output | Results R |
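The internals of the conventional model are not reproduced in this record. As one illustration of the Word2Vec-plus-cosine-similarity comparison that appears in the workflow table further below, here is a minimal sketch; the vectors and helper name are assumptions.

```python
# Cosine-similarity sketch: compare a vocabulary word's vector with a reference vector.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

word_vec = np.array([0.73, 0.18, 0.98])         # vector of a vocabulary word
reference_vec = np.array([0.70, 0.25, 0.90])    # e.g. a known harassment term's vector
print(cosine_similarity(word_vec, reference_vec))
```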
Fig. 8: Sample working procedure of the Fast Text classification model.
Fig. 9: Text classification using a Huffman tree with probabilities.
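A sketch of supervised fastText classification with hierarchical softmax (the Huffman-tree probability scheme of Fig. 9), assuming the `fasttext` Python package; the training file name and hyperparameters are illustrative.

```python
# Supervised fastText sketch: loss='hs' selects hierarchical softmax (Huffman tree).
import fasttext

# train.txt: one comment per line, prefixed with __label__positive or __label__negative
model = fasttext.train_supervised(input='train.txt', loss='hs', epoch=25, wordNgrams=2)

label, prob = model.predict('you worthless shit')
print(label, prob)   # e.g. (('__label__positive',), array([0.97...]))
```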
Pseudocode for calculating word frequency in the BOW.
| Input: Ti |
|---|
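The body of this pseudocode is not shown in the record; counting term frequencies over the bag-of-words can be done with a standard counter, as in this minimal sketch.

```python
# Word-frequency sketch over the bag-of-words tokens.
from collections import Counter

tokens = ['you', 'stupid', 'hoe', 'you', 'broke', 'stupid']
frequency = Counter(tokens)            # word -> number of occurrences
print(frequency.most_common(3))        # [('you', 2), ('stupid', 2), ('hoe', 1)]
```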
Pseudocode for combining WS and FT module.
| Input: |
|---|
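The combination rule for the word-similarity (WS) and FastText (FT) modules is not reproduced here. A simple weighted fusion of the two scores is sketched below as an assumption, not the authors' exact formula.

```python
# Assumed score-fusion sketch: weighted sum of the WS similarity and the FT probability.
def combine_scores(ws_score, ft_probability, alpha=0.5):
    """Fuse the cosine-similarity score with the FastText class probability."""
    return alpha * ws_score + (1 - alpha) * ft_probability

print(combine_scores(ws_score=0.80, ft_probability=0.92))   # -> 0.86
```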
Sample workflow of the model using an example.
| Steps | Input data | Desired Outcome |
|---|---|---|
| After preprocessing | I1 | ‘skeezedez’, ‘listen’, ‘up’, ‘bitch’, ‘I’, ‘will’, ‘just’, ‘fuck’, ‘on’, ‘you’, ‘worthless’, ‘shit’ |
| | I2 | ‘like’, ‘i’, ‘said’, ‘kill’, ‘yourself’, ‘you’, ‘dumb’, ‘like’, ‘you’, ‘took’, ‘like’, ‘a’, ‘damn’, ‘shit’, ‘and’, ‘take’, ‘a’, ‘shower’, ‘cause’, ‘youre’, ‘attitude’, ‘stink’, ‘you’, ‘stanky’, ‘ugly’, ‘cunt’ |
| | I3 | ‘when’, ‘you’, ‘broke’, ‘ass’, ‘learn’, ‘to’, ‘spell’, ‘and’, ‘get’, ‘to’, ‘coked’, ‘up’, ‘crack’, ‘headed’, ‘get’, ‘to’, ‘asshole’, ‘out’, ‘of’, ‘a’, ‘dawn’, ‘apartment’, ‘and’, ‘for’, ‘starters’, ‘at’, ‘least’, ‘i’, ‘wear’, ‘name’, ‘brand’, ‘you’, ‘stupid’, ‘hoe’ |
| After feature extraction | I1 | |
| | I2 | [‘attitude/NN’] [‘yourself/PRP’] |
| | I3 | [‘apartment/NN’, ‘name/NN’, ‘brand/NN’] [‘you/PRP’] |
| Building vocabulary | I1 | ‘skeezedez’, ‘listen’, ‘bitch’, ‘fuck’, ‘worthless’, ‘shit’ |
| | I2 | ‘kill’, ‘shower’, ‘attitude’, ‘stink’, ‘stanky’, ‘ugly’ |
| | I3 | ‘broke’, ‘learn’, ‘coked’, ‘crack’, ‘headed’, ‘asshole’, ‘apartment’, ‘starters’, ‘brand’, ‘stupid’ |
| Vector representation (word2vec model) | I1 | [0.7357, 0.1835, 0.9826, 0.9402, 0.7411, 0, 1] |
| | I2 | [−0.1985, 0, 0.1925, 0.6217, 0.4298, 0.8605] |
| | I3 | [0, 0.2314, −0.1934, 0.7384, 0.1392, 0.9325, 0.5687, 0.4218, 0.2165, 1] |
| Cosine similarity measure | I1 | 0.5021, −0.1972, 1, 0.9811, 0, 0.8090 |
| | I2 | 0.8320, 0.3217, 0.5532, −0.2188, 0.6321, 1 |
| | I3 | 0.3772, 0.4287, 0.3219, 0.5079, −0.1338, −0.9782, −1.0558, −0.4156, 0.5008, 1 |
| Training Fast Text model | I1 | ‘positive’, (0.8357) |
| | I2 | ‘positive’, (0.9218) |
| | I3 | ‘positive’, (0.9745) |
| Classified output | I1 | ‘positive’, 1, ‘sexual attention’, ‘IDS’ |
| | I2 | ‘positive’, 1, ‘trolling’, ‘IDS’ |
| | I3 | ‘positive’, 1, ‘sexual attention’, ‘IDS’ |
Fig. 10: Performance evaluation based on sensitivity values.
Fig. 11: Mean squared error rate for the training sample.
Fig. 12: Performance analysis of various algorithms.
Performance on the training sample of harassment and non-harassment classes.
| Algorithm used | Recall | Precision | F1 Score |
|---|---|---|---|
| J48 | 0.8280 | 0.9049 | 0.2944 |
| Naïve Bayes | 0.8352 | 0.8644 | 0.2021 |
| Logistic regression | 0.8188 | 0.8742 | 0.6168 |
| MLP | 0.8451 | 0.8933 | 0.5286 |
| Intention detection model | | | |
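The recall, precision, and F1 values in the table above can be computed from predicted versus true labels; the sketch below uses scikit-learn (an assumption about tooling) with made-up labels purely for illustration.

```python
# Metric-computation sketch: recall, precision, and F1 for the harassment class.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # 1 = harassment, 0 = non-harassment (toy labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print('Recall   :', recall_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('F1 score :', f1_score(y_true, y_pred))
```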
Fig. 13: Number of training samples classified into the two labels.
Fig. 14: ROC curve of bag-of-words features for different algorithms.
Training/Prediction time for two classes of the dataset.
| S.no | Algorithms used | Training/prediction time |
|---|---|---|
| 1 | J48 | 2.57 s |
| 2 | Naïve Bayes | 2.63 s |
| 3 | Logistic regression | 3.01 s |
| 4 | MLP | 5.94 s |
Statistical distribution of Instagram dataset.
| Dataset used | Total comments | Cyberharassment (CH) | Non-cyberharassment (NCH) |
|---|---|---|---|
| Instagram comments | 10,957 | 2754 | 8203 |