Lulwah M Al-Harigy1,2, Hana A Al-Nuaim2, Naghmeh Moradpoor1, Zhiyuan Tan1.
Abstract
The increased use of social media among digitally anonymous users, who share their thoughts and opinions, can facilitate participation and collaboration. However, this anonymity, which gives users freedom of speech and allows them to act without being judged by others, can also encourage cyberbullying and hate speech. Predators can hide their identity and reach a wide audience anytime and anywhere. Given the detrimental effects of cyberbullying, there is a growing need for cyberbullying detection approaches. This survey paper presents a comparative analysis of automated cyberbullying detection techniques from different perspectives, including data annotation, data preprocessing, and feature engineering. In addition, because emojis are important in expressing emotions and influence sentiment classification and text comprehension, we discuss the role of incorporating emojis in cyberbullying detection and their influence on detection performance. Furthermore, the different domains for using self-supervised learning (SSL) as an annotation technique for cyberbullying detection are explored.
Year: 2022 PMID: 35789611 PMCID: PMC9250443 DOI: 10.1155/2022/4794227
Source DB: PubMed Journal: Comput Intell Neurosci
Summary of the related work surveys.
| Research | ML approach | Dataset | Labelling approach | Feature engineering | Preprocessing technique | Classifiers | Using emojis | Using SSL |
|---|---|---|---|---|---|---|---|---|
| [ | ML, SL | ✓ | ✓ | ✓ | ✓ | ✓ | N/A | N/A |
| [ | DL with UL and semi-supervised learning | ✓ | N/A | N/A | N/A | ✓ | N/A | N/A |
| [ | ML and DL with SL and UL | ✓ | N/A | ✓ | N/A | ✓ | N/A | N/A |
| [ | ML and DL | ✓ | ✓ | ✓ | N/A | ✓ | N/A | N/A |
| [ | ML and DL | ✓ | N/A | ✓ | N/A | ✓ | N/A | N/A |
| This survey | ML | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Figure 1: A taxonomy of data labelling approaches.
Types of data labelling used in the recent research.
| Research | Platform | No. of datasets | Language | Type of cyberbullying | Type of annotation | Annotation method/platform |
|---|---|---|---|---|---|---|
| Reynolds et al. [ | Formspring.me | One | English | Cyberbullying | Manual | Amazon Mechanical Turk |
| Potha and Maragoudakis [ | Perverted-Justice | One | English | Cyberbullying | Manual | Experts |
| Ptacek et al. [ | | One | English and Czech | Sarcasm | Manual | Experts |
| Rafiq et al. [ | Vine | One | English | Cyberbullying | Manual | CrowdFlower |
| Rajadesingan et al. [ | | One | English | Sarcasm | Automated | Self-annotating using hashtags |
| Wallace et al. [ | | One | English | Irony | Manual | Experts |
| Amir et al. [ | | One | English | Sarcasm | Automated | Self-annotating using hashtags |
| Capua et al. [ | Formspring.me, YouTube, and Twitter | Three | English | Cyberbullying | Manual | Amazon Mechanical Turk |
| Farias et al. [ | | Six | English | Irony | Manual and automated | Crowdsourcing and self-annotating using hashtags |
| Van Hee et al. [ | | One | English | Irony | Manual | Participants |
| Hosseinmardi et al. [ | | One | English | Cyberbullying | Manual | Crowdsourcing |
| Nand et al. [ | | One | English | Cyberbullying | Manual | Experts |
| Zhang et al. [ | | One | English | Sarcasm | Manual | Experts |
| Waseem and Hovy [ | | One | English | Hate speech | Manual | Authors themselves |
| Zhao and Mao [ | Twitter and MySpace | Two | English | Cyberbullying | Manual | Experts |
| Bharti et al. [ | | One | English | Sarcasm | Automated | Self-annotating using hashtags |
| Davidson et al. [ | | One | English | Hate speech | Manual | CrowdFlower |
| Mishra et al. [ | Twitter + snippets with eye movement | Two | English | Sarcasm | Manual | Participants |
| Romsaiyud et al. [ | Twitter and Perverted-Justice | Two | English | Cyberbullying | Manual | Participants |
| Samghabadi et al. [ | Ask.fm | One | English | Nastiness | Manual | CrowdFlower |
| Wulczyn et al. [ | Wikipedia | One | English | Personal attack | Manual | CrowdFlower |
| Agrawal and Awekar [ | Formspring, Twitter, and Wikipedia | Three | English | Cyberbullying | Manual | Participants |
| Van Hee et al. [ | | One | English | Irony | Manual | Linguistic students and second-language speakers of English |
| Mishra et al. [ | | One | English | Abuse | Manual | Authors themselves |
| Rosa et al. [ | Google News, Twitter, and Formspring | Three | English | Cyberbullying | Manual | Amazon Mechanical Turk |
| Cai et al. [ | | One | English | Sarcasm | Automated | Self-annotating using hashtags |
| Cheng et al. [ | Instagram and Vine | Two | English | Cyberbullying | Manual | Crowdsourcing |
| Drishya et al. [ | | One | English | Cyberbullying | Not mentioned | |
| Mozafari et al. [ | | Two | English | Hate speech | Manual | Authors themselves |
| Patro et al. [ | Book snippets and tweets | One | English | Sarcasm | Manual | Experts |
| Samghabadi et al. [ | Ask.fm, Wikipedia, Kaggle, and Curious Cat | Four | English | Abuse | Manual and automated | Using CrowdFlower and pretraining |
| Hettiarachchi and Ranasinghe [ | | One | English | Offense | Followed Zampieri et al.'s [ | |
| Ortega-Bueno et al. [ | | One | Spanish | Irony | Manual | Participants |
| Subramanian et al. [ | Facebook and Twitter | Two | English | Sarcasm | Not mentioned | |
| Zampieri et al. [ | | One | English | Offense | Manual | Figure Eight |
| Zhang et al. [ | | Seven | English | Irony | Manual and automated | Experts and self-annotating using hashtags |
| Fortunatus et al. [ | | One | English | Cyberbullying | Manual | Experts |
| González et al. [ | | Two | English and Spanish | Irony | Followed Van Hee et al. [ | |
| Iwendi et al. [ | Wikipedia | One | English | Cyberbullying | Followed Wulczyn et al.'s [ | |
| Paul and Saha [ | Formspring, Twitter, and Wikipedia | Three | English | Cyberbullying | Followed Waseem and Hovy [ | |
| Potamias et al. [ | Twitter and Reddit | Four | English | Irony and sarcasm | Not mentioned | |
| Rezvani et al. [ | Instagram and Twitter | Two | English | Cyberbullying | Followed Hosseinmardi et al. [ | |
| Tripathy et al. [ | | One | English | Cyberbullying | Followed Davidson et al.'s [ | |
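Several rows in the table above list "Self-annotating using hashtags" as the automated annotation method. The idea can be illustrated with a minimal Python sketch: a tweet is labelled positive when it carries a marker hashtag, and the marker is stripped so that a classifier cannot simply memorize it. The marker set (`#sarcasm`, `#not`) and the function name `self_annotate` are assumptions chosen for illustration; each surveyed paper used its own markers and filtering rules.

```python
import re
from typing import Tuple

# Hypothetical marker hashtags (an assumption for this sketch).
SARCASM_TAGS = ("#sarcasm", "#not")

# Match a marker only as a whole hashtag, so "#nothing" is not treated as "#not".
_TAG_RE = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(t) for t in SARCASM_TAGS) + r")\b",
    flags=re.IGNORECASE,
)


def self_annotate(tweet: str) -> Tuple[str, int]:
    """Return (tweet without marker hashtags, label).

    The label is 1 when a marker hashtag was present and 0 otherwise.
    """
    label = 1 if _TAG_RE.search(tweet) else 0
    cleaned = _TAG_RE.sub("", tweet)
    # Collapse the whitespace left behind by the removed hashtag.
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned, label


if __name__ == "__main__":
    print(self_annotate("Great, another Monday #sarcasm"))
    print(self_annotate("I love #nothing"))
```

Labels produced this way are noisy, since authors do not always tag sarcastic posts and sometimes use the tags literally, which is one reason some studies in the table combine automated and manual annotation.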
Summary of preprocessing steps.
| Research | Replace @, URL, #, RT | Tokenization | Stop word/punctuation removal | Other techniques | Preprocessing steps |
|---|---|---|---|---|---|
| Potha and Maragoudakis [ | ✓ | ✓ | ✓ | Converted letters to lower case | They applied tokenization based on the space character, stop word removal, and a case transformation as preprocessing steps |
| Ptacek et al. [ | ✓ | ✓ | ✓ | Stemming | They replaced user, URL, and hashtag in tweets by “user,” “link,” and “hashtag.” They removed retweets starting with “RT” and removed diacritics from all Czech tweets. They also used tokenization, POS tagging, stem, stop word removal, and phonetics. |
| Rajadesingan et al. [ | ✓ | N/A | N/A | Removed tweets with three words or less | They removed non-English tweets, retweets, tweets that contained mentions and URLs, and tweets containing three words or less such as yeah, right, and so on as they found that these words were very noisy. |
| Capua et al. [ | N/A | N/A | ✓ | Stemming | They applied stop word and punctuation removal and stemming. |
| Van Hee et al. [ | ✓ | ✓ | N/A | Replaced emojis, POS, and lemmatization | They replaced all emojis with their name or description, normalized hyperlinks, and retweets to |
| Nand et al. [ | ✓ | N/A | N/A | Replaced word abbreviation | They used Nand et al. techniques for eliminating noise such as word variations e.g., replacing tmro and 2moro by tomorrow, multi-word abbreviation, e.g., replacing lol with laugh out loud, slangs, e.g., replacing gonna by going to, and removed duplicates, retweets, @usernames, #hashtags, hyperlinks. |
| Zhang et al. [ | N/A | N/A | N/A | Removed sarcasm hashtags | They removed the hashtags #sarcasm and #not from the tweets and assigned those tweets the sarcasm output tags for training and evaluation. |
| Zhao and Mao [ | ✓ | ✓ | ✓ | N/A | They followed Xu et al.'s preprocessing steps for the Twitter dataset, using only tokenization and replacing special characters such as mentions (@) and URLs with the tokens “@USERNAME” and “HTTPLINK,” respectively. They included hashtags and emoticons as tokens. For the MySpace dataset, their focus was content-based, so they preprocessed the text only, using tokenization and deletion of punctuation and special characters. |
| Bharti et al. [ | ✓ | N/A | N/A | Converted letters to lower case | They removed retweets, hashtags, URLs, and @usernames and converted letters to lower case. |
| Felbo et al. [ | ✓ | N/A | N/A | Removed characters repeated more than twice | They used English tweets without URLs, removed characters repeated more than twice, and replaced mentions and numbers with special tokens. |
| Romsaiyud et al. [ | N/A | N/A | N/A | Removed non-printable and special characters and duplicate words | They preprocessed their datasets using a method to remove non-printable and special characters and duplicate words. |
| Mishra et al. [ | N/A | N/A | ✓ | Converted letters to lower case | They changed all letters to lower case and removed stop words to normalize the data. |
| Rosa et al. [ | ✓ | N/A | N/A | Removed characters repeated in the words more than twice | They removed “Q” and “A” markers, “html” tags, and repeated characters in the words more than twice. |
| Cai et al. [ | N/A | N/A | N/A | Separated words, emoticons, and hashtags | They cleaned up their dataset by rejecting tweets using the words sarcasm, sarcastic, irony, and ironic as regular words, tweets containing URLs, and tweets with words that frequently co-occur with sarcastic tweets. In the preprocessing phase, mentions of (@user) were replaced with <user>, and then they used the NLTK toolkit to separate words, emoticons, and hashtags. |
| Drishya et al. [ | N/A | N/A | ✓ | Stemming | They cleaned their dataset by conducting stemming and removing stop words. |
| Samghabadi et al. [ | ✓ | N/A | N/A | Converted letters to lower case and padding | They changed all letters to lower case, replaced all links and user mentions with the words “url” and “@username,” respectively, truncated the posts to 200 tokens, and left-padded shorter sequences with zeros. |
| Singh et al. [ | ✓ | ✓ | ✓ | Converted letters to lower case and removed repeated words | They used tokenization, changed all letters to lower case, removed stop words, numbers, URLs, consecutive repeated words, and user mentions, and expanded hashtags. |
| Subramanian et al. [ | ✓ | N/A | ✓ | N/A | They preprocessed their datasets using removal of hyperlinks, special characters, hashtags, retweets, etc. |
| Fortunatus et al. [ | N/A | N/A | ✓ | Replaced emojis and emoticons with their Unicode, normalization, and POS | They started their cleaning process by separating emojis and emoticons from plain text, utilizing the emoji Unicode representation from emoji sentiment ranking and the emoticon Unicode from an emoticon lexicon. The other preprocessing steps they used included punctuation removal and text normalization, which comprised five steps: pronoun spelling resolution, slang resolution, laughter text resolution, elongated character reduction, and similar word replacement. After the normalization step, they used NLTK's POS tagger to perform part-of-speech tagging. The last preprocessing step was stop word removal, which was done after POS tagging so that the POS tagger could work as effectively as possible; otherwise, some words might already have been removed and the POS tags would be incorrect because the sentence would no longer be grammatically sensible. |
| González et al. [ | ✓ | ✓ | N/A | Removed characters repeated in the words more than twice | They applied a case-folding process for all the tweets, used TokTokTokenizer from NLTK to tokenize the tweets, replaced user mentions, hashtags, and URLs by the token user, hashtag, and URL, respectively, and removed repeated characters in the words more than twice. |
| Gupta et al. [ | N/A | N/A | ✓ | N/A | They used a stop word removal algorithm and a filtration technique as preprocessing steps to remove stop words from the dataset. |
| Iwendi et al. [ | N/A | ✓ | ✓ | Stemming, lemmatization, and converted letters to lower case | They started the preprocessing by removing punctuation and non-letter characters and changing all letters to lower case. Then, they tokenized the text by separating it into smaller tokens that may include words, numbers, and punctuation marks. Next, they used stemming to reduce each token to its root, such as by removing plurals and verb tense, e.g., converting the words “running,” “ran,” and “runner” to “run.” After stemming, they used lemmatization. The Twitter dataset that was used by Paul and Saha was preprocessed by its authors, who normalized the data by removing stop words, special markers such as “RT” (retweet) and screen names, and punctuation. |
| Potamias et al. [ | N/A | N/A | N/A | Converted letters to lower case | They used only decapitalization as a preprocessing step. |
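The steps that recur most often in the table above (dropping the "RT" marker, replacing URLs, mentions, and hashtags with placeholder tokens, lower-casing, shortening characters repeated more than twice, tokenizing, and removing stop words) can be sketched as one small Python function. This is a minimal illustration under stated assumptions, not the pipeline of any surveyed paper: the placeholder tokens (`url`, `user`, `hashtag`), the tiny `STOP_WORDS` set, and the function name `preprocess` are all chosen here for the example.

```python
import re

# Tiny illustrative stop word list (an assumption); real studies use fuller lists.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "so"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
RETWEET_RE = re.compile(r"^RT\s+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # any character repeated three or more times
TOKEN_RE = re.compile(r"[a-z0-9']+")


def preprocess(text):
    """Return a list of normalized tokens for one post."""
    text = RETWEET_RE.sub("", text)           # drop a leading retweet marker
    text = URL_RE.sub(" url ", text)          # replace links before mentions
    text = MENTION_RE.sub(" user ", text)     # replace @username
    text = HASHTAG_RE.sub(" hashtag ", text)  # replace #hashtag
    text = text.lower()                       # case folding
    text = REPEAT_RE.sub(r"\1\1", text)       # "sooooo" becomes "soo"
    tokens = TOKEN_RE.findall(text)           # simple tokenization, drops punctuation
    return [t for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("RT @bob: this is sooooo funny!!! http://t.co/x #lol"))
```

The order of the steps matters: URLs are replaced before mentions so that an "@" inside a link is not rewritten, and stop words are removed last, which mirrors the observation in the Fortunatus et al. row that earlier removal can disturb later steps such as POS tagging.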