Sanaa Kaddoura, Ganesh Chandrasekaran, Daniela Elena Popescu, Jude Hemanth Duraisamy.
Abstract
Spam content on social media is increasing rapidly, and its detection has therefore become vital. Spam grows as people make extensive use of social media and messaging platforms such as Facebook, Twitter, YouTube, and e-mail. The time people spend on social media is growing rapidly, especially during the pandemic. Users receive many text messages through social media and often cannot recognize the spam content in them. Spam messages contain malicious links, apps, fake accounts, fake news, reviews, rumors, etc. To improve social media security, the detection and control of spam text are essential. This paper presents a detailed survey of the latest developments in spam text detection and classification in social media. It discusses the techniques used for spam detection and classification, including Machine Learning, Deep Learning, and text-based approaches. We also present the challenges encountered in identifying spam, the corresponding control mechanisms, and the datasets used in existing work on spam detection.
Keywords: Classification; Data mining; Deep learning; Machine learning; Natural language processing; Social media analysis; Spam Content; Text mining
Year: 2022 PMID: 35174265 PMCID: PMC8802784 DOI: 10.7717/peerj-cs.830
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Description of academic databases and their links.
| Academic Data sources | Search string | Links |
|---|---|---|
| WoS | Social spam | |
| Scopus | Spam AND Twitter | |
| Springer | Spam AND Artificial Intelligence | |
| IEEE Xplore | Social spam AND Artificial Intelligence | |
| ACM Digital Library | Online spam AND Review Spam | |
| Science Direct | Social media AND Spam | |
Figure 1. Articles distribution based on publication type.
Figure 2. Steps in spam detection.
E-mail spam datasets with their description.
| S. No | Dataset name | Description | Reference | Web link |
|---|---|---|---|---|
| 1 | Spam Assassin | 1,897 spam and 4,150 ham messages | | |
| 2 | Princeton Spam Image Benchmark | 1,071 spam images | | |
| 3 | Dredze Image Spam Dataset | 3,927 spam and 2,006 spam images | | |
| 4 | ZH1–Chinese email spam dataset | 1,205 spam and 428 ham text emails | | |
| 5 | Enron-Spam | 13,496 spam and 16,545 non-spam email texts | | |
Twitter spam datasets with their description.
| S. No | Dataset name | Description | Reference | Web link |
|---|---|---|---|---|
| 1 | Bzzfeednews dataset | 11,000 labeled users, 1,000 spammers and 10,000 non-spammer users | | |
| 2 | Fake election news dataset with 36 real and 35 fake news stories | | | |
| 3 | Twitter ground labeled ground truth dataset | 6.5 million spam and 6 million non-spam tweets | | |
| 4 | Twitter social honeypot dataset | 22,223 spammers and 19,276 non-spammer users | | |
| 5 | Stanford Twitter sentiment 140 dataset | 1.6 million tweets for spam detection with a total tweet id of 4435. | | |
Spam review datasets with their description.
| S. No | Dataset name | Description | Reference | Web link |
|---|---|---|---|---|
| 1 | Single Domain hotel review | 1,600 hotel reviews (800 spam and 800 ham) from the TripAdvisor website, covering 20 popular hotels in Chicago | | |
| 2 | Multi-Domain review dataset | Hotel, restaurant and doctor reviews dataset (2,840 reviews) | | |
| 3 | Yelp Review Dataset | 85 hotels and 130 restaurant reviews in and around Chicago | | |
| 4 | Store Review Dataset | 408,470 reviews on 14,651 stores obtained from | | |
| 5 | Amazon e-commerce Dataset | 40,000 samples for training and 10,000 samples for testing, collected from various categories such as Beauty, Fashion and Automotive | | |
| 6 | Hotel reviews dataset | 42 fake and 40 hotel reviews | | |
| 7 | Trustpilot company review dataset | 9,000 fake and real reviews from the online company Trustpilot | | |
Most often used spam terms in e-mail, Facebook, and Twitter.
| S. No | Social network | Words |
|---|---|---|
| 1 | E-mail | Full refund, Get it Now, Order now, Order status, Make money, Earn extra cash, 100% free, Apply now, Click here, Sign up free, Winner, Lose weight, Lifetime, Gift certificate |
| 2 | Facebook | Amazing, Hear, Watch, Hunt, Win, ipad |
| 3 | Twitter | Money, Marketing, Mobi, Free |
Figure 3. Various text-preprocessing techniques.
Illustration of a sentence and its generated tokens.
| Sentence | Tokens |
|---|---|
| “I went to the library to read books” | “I”, “went”, “to”, “the”, “library”, “to”, “read”, “books” |
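The tokenization step illustrated in the table can be sketched in a few lines of Python. This is a minimal sketch that assumes tokens are runs of letters, digits, and apostrophes. The `tokenize` function is a hypothetical helper written for illustration, not the tokenizer used in any of the surveyed works, and library tokenizers such as those in NLTK or spaCy handle punctuation and contractions more carefully.

```python
import re


def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens.

    Assumption: a token is a run of letters, digits, or apostrophes;
    everything else (spaces, punctuation) acts as a separator.
    """
    return re.findall(r"[A-Za-z0-9']+", sentence)


tokens = tokenize("I went to the library to read books")
print(tokens)
```

Repeated words such as "to" are kept as separate tokens, which matters for later frequency-based features.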
Existing research on spam text pre-processing.
| S.No | Authors | Pre-Processing technique used | Dataset | Classifier | Result |
|---|---|---|---|---|---|
| 1 | | Tokenization, | e-mail text corpora | Support Vector Machine (SVM) | Classification accuracy is improved with pre-processing |
| 2 | | Stemming, Lemmatization, Stopwords removal and noise removal | Ling-spam | Naïve Bayes (NB) and Support Vector Machine (SVM) | Pre-processing with NB gives better results than SVM |
| 3 | | Data Normalization and discretization methods | Twitter dataset | SVM, Neural Networks (NN) and Random Forests (RF) | Overall classification rate of 84.30% is obtained |
| 4 | | Tokenization and Segmentation | 1.5 million posts from real time Facebook data | NB, SVM and RF classifiers | RF classifier outperformed the others with a F-measure of |
| 5 | | Stemming and Stopwords removal | Honeypot dataset with 2 million spam and non-spam tweets | Multilayer Perceptron (MLP), NB and RF | SVM outperformed others with a precision of 0.98 and an accuracy of 0.96 |
Tools available for pre-processing of spam text.
| Library/Package | Description | Link |
|---|---|---|
| TextBlob | TextBlob is a Python text processing package. It provides a straightforward API for typical NLP tasks such as part-of-speech tagging and sentiment analysis. | |
| spaCy | spaCy is a Python Natural Language Processing (NLP) package with a number of built-in features. | |
| NLTK | The Natural Language Toolkit, or NLTK for short, is a Python-based set of tools and programs for performing natural language processing. | |
| RapidMiner | Simplifies accessing and analysing various types of data, both structured and unstructured. | |
| Memory-Based Shallow Parser | Determines the grammatical structure of a sentence by parsing a string of letters or words using Python. | |
A bag of words illustration (BoW).
| Words | Doc-1 | Doc-2 | Doc-3 | Doc-4 |
|---|---|---|---|---|
| Sentiment | 2 | 3 | 2 | |
| Processing | 2 | 4 | 1 | |
| Classification | 1 | 2 | | |
| Algorithm | 1 | 3 | 4 | |
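A bag-of-words matrix like the one illustrated above can be built by counting how often each vocabulary term occurs in each document. The following is a minimal sketch under stated assumptions: the two documents and the `bag_of_words` helper are hypothetical and chosen only for illustration, so the counts do not reproduce the table. Note that the sketch stores one row per document, whereas the table lists one row per word.

```python
from collections import Counter


def bag_of_words(docs: list[str], vocabulary: list[str]) -> list[list[int]]:
    """Return one row of term counts per document, in vocabulary order."""
    rows = []
    for doc in docs:
        # Lowercase and split on whitespace; Counter returns 0 for absent terms.
        counts = Counter(doc.lower().split())
        rows.append([counts[term] for term in vocabulary])
    return rows


# Hypothetical toy documents, not taken from any dataset in this survey.
docs = [
    "sentiment processing sentiment algorithm",
    "classification processing",
]
vocabulary = ["sentiment", "processing", "classification", "algorithm"]

matrix = bag_of_words(docs, vocabulary)
for row in matrix:
    print(row)
```

Word order is discarded in this representation; only the per-document frequencies are kept.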
An N-grams illustration.
| S. No | Type of N-Gram | Example |
|---|---|---|
| 1 | Unigram | “I”, “Like”, “to”, “Play”, “Cricket” |
| 2 | Bi-gram | “I Like”, “Like to”, “to Play”, “Play Cricket” |
| 3 | Tri-gram | “I Like to”, “Like to Play”, “to Play Cricket” |
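N-grams are obtained by sliding a window of n consecutive tokens over the sentence. The sketch below assumes whitespace tokenization and uses a hypothetical `ngrams` helper; it is an illustration of the idea, not the feature extractor of any surveyed work.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I Like to Play Cricket".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence with k tokens yields k - n + 1 n-grams, so the five-token example gives five unigrams, four bigrams, and three trigrams.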
Existing works that employ various text feature extraction techniques.
| S.No | Author | Dataset | Classification approach | Merits | Limitations | Result |
|---|---|---|---|---|---|---|
| 1 | | Honeypot, SPD manually and automatically annotated spam dataset | Support Vector Machine (SVM), Random Forest (RF), Multi-Layer Perceptron (MLP), Gradient Boosting and Max. Entropy | Real-time spam detection is possible and the proposed feature set increases the system accuracy | Need to deal with the presence of lengthy tweets in spamming activity. | Accuracy-97.71% |
| 2 | | 13,000 comments from YouTube channels | RF, SVM, Naive Bayes (NB) with N-grams based features | Machine Learning (ML) models with N-grams has helped to improve | The use of better word representations like Word2Vec is needed to improve system performance | F1-Score-0.97 |
| 3 | | 774 spam campaigns in 131,000 Tweets | RF, Decision Trees (DT), Decision Table, Random Tree, KStar, Bayes Net and Simple Logistic | Content and behaviour features were combined to build an automatic spam detection model. | Need to explore more features to build a robust model for spam classification | Accuracy-94.5% |
| 4 | | More than 10,000 Arabic tweets collected with Twitter API | Long Short Term Memory (LSTM) with word embedding feature representation | The time required to classify the tweets is much lower than for state-of-the-art methods | System classification accuracy depends on tweet length | Accuracy-0.97 |
| 5 | | 97,839 Restaurant (RES) and 31,317 Hotel review dataset (HOS) | Machine Learning (ML) techniques and Bi-LSTM | Could capture sophisticated spammer activities using a multimodal neural network model | There is a need to analyze the use of other effective features to improve the performance | Recall-0.80 |
| 6 | | Hotel review | SVM, K-Nearest Neighbor and Naïve Bayes (NB) | Lexical content and stylistic information were captured better using character n-grams | Need to build a hybrid feature set combining character and word n-grams | F1-score-0.87 |
| 7 | | 10-day real-life Twitter dataset of 1,376,206 spam and 673,836 non-spam tweets | RF, Multi-Layer Perceptron (MLP) and Naïve Bayes | Variations in spamming activities are captured within a short span of time. | The model needs to be adaptable to new characteristics | Accuracy-99.35 |
Figure 4. Various text-preprocessing techniques.
Existing research works on spam classification using rule-based systems.
| S.No | Author | Dataset | Classification approach | Merits | Limitations | Result |
|---|---|---|---|---|---|---|
| 1 | | | Rule-based spam detection filter with some assigned weights | Combining a Genetic Algorithm with e-mail filtering methods facilitates efficient spam detection | Need to increase the size of the dataset; in-depth analysis of the Genetic Algorithm parameters is required | Accuracy-82.7% |
| 2 | | 1,260 Facebook messages from Italian groups | Flexible rule-based system is used to customize the filtering criteria. | Automatic filtering of unwanted messages from Online Social Networks is made possible. | Care should be taken to handle the extraction of contextual features for better discrimination of samples. | Precision-81% |
| 3 | | Enron | Manually and automatically extracted rules from labelled emails | Domain categorization used in this work has helped to improve the filter performance | Continuous enhancement and updating of semantic features is needed. | Accuracy-0.98 |
| 4 | | SpamAssassin | Rule extraction, optimization and rule filtering models are used | Dynamic adjustment of static rules for improving the spam filter is made possible. | The threshold value affects classification performance and has to be chosen with care. | Accuracy-98.5% |
| 5 | | Email | Fuzzy Inference System with a set of fuzzy rules | The system is made adaptive by making use of effective fuzzy rules. | Need to train the system with a large | Accuracy-90% |
Existing research works on spam classification using machine learning.
| S.No | Author | Dataset | Classification approach | Merits | Limitations | Result |
|---|---|---|---|---|---|---|
| 1 | | 4,360 non-spam and 1,368 spam samples from the Kaggle Dataset | Logistic Regression (LR), Naïve Bayes (NB), K-Nearest Neighbor (K-NN) and Decision Trees (DT) | Presented a comparative analysis of different ML algorithms | Better DL-based feature learning strategies can be employed for extracting relevant features. | Accuracy-0.99 |
| 2 | | Email-1,431 dataset | SVM, K-NN, NB and DT | Instead of using spam trigger words, which may fail, a lexicon-based approach is used to filter the data. | Few training samples are used (272 ham and 1,219 spam). Need for a better feature extraction technique | Accuracy-85.96% |
| 3 | | 1,200 labelled posts crawled from Facebook using a web crawler | Random Forest (RF) | Social features such as comments, combined with textual features, yield better results | Need to use image features to get improved results | Precision-98.19% |
| 4 | | 25,847 Twitter users with 500K tweets collected using Twitter API and a web crawler | DT, NN, SVM, NB | Graph and content based features extracted from Twitter aid in improving the model’s performance | Need to analyze the use of Deep Learning (DL) techniques and bring in more metrics for performance evaluation. | Precision-1 |
| 5 | | Textual data collected from Twitter and Facebook with spam and non-spam content | SVM & NN | Hybrid architecture of SVM with NN helped to improve the classification results | Only a few performance metrics are evaluated to determine the model’s efficiency | Precision-85% |
| 6 | | 4.4 million Facebook posts acquired using Graph API | RF | Automatic identification of spam text is done with 42 features using ML techniques | The labelled spam dataset was gathered through crowdsourcing and may be biased. | Accuracy-86.9% |
| 7 | | Restaurant reviews from Yelp.com | LR, K-NN, NB, RF, SVM | For effective spam identification, uses both univariate and multivariate distributions across user ratings. | It is necessary to adjust the model to new characteristics and improve its efficiency. | Accuracy-0.76 |
| 8 | | Opinion spam | Rule-based and Machine Learning classifiers (NB, SVM, K-NN, RF and NN) | The model’s performance was increased by using N-gram feature extraction and negation handling. | Spam detection efficiency could be improved using Deep Learning (DL) techniques | Accuracy-95.25% |
| 9 | | Opinion spam | NB, RF and SVM | The ensemble strategy aided in obtaining a higher accuracy score. | It is necessary to develop a control mechanism to reduce the propagation of fraudulent reviews. | Accuracy-87.68% |
| 10 | | Random collection of tweets from 1,000 Twitter accounts containing both spam and non-spam text | RF, NB and K-NN | User and content based features with the RF classifier were successful in identifying spam and non-spam tweets | Need a larger Twitter dataset for evaluating the effectiveness of the model | Precision-95.97 |
Existing research works on spam classification using deep learning.
| S.No | Author | Dataset | Classification approach | Merits | Limitations | Result |
|---|---|---|---|---|---|---|
| 1 | | 1. Twitter social honeypot dataset | Convolutional Neural Network (CNN) | Combining tweet text with metadata has helped to attain good performance for spam classification | Using only textual data, i.e., tweets, the system could not perform well | Accuracy-99.32% |
| 2 | | Sina Weibo dataset with 12,500 malicious URLs and 12,500 normal URLs | Convolutional Neural Network (CNN) with Word2Vec | Detects spam content while using low computing resources | Complexity of the model | Accuracy-91.36% |
| 3 | | Open source SpamBase dataset with 5,569 emails and Kaggle spam filter dataset | Fine-tuned BERT (Bidirectional Encoder Representations from Transformers) with Word2Vec approach | Spam detection efficiency is improved with the help of the BERT word embedding approach | Need to utilize a large input sequence for better training of the model. | Accuracy-0.98 |
| 4 | | Image dataset with 1,521 spam images and 1,500 ham images. | CNN with multimodal data (Image and Text) | Multimodal (Image+Text) technique helped to achieve greater accuracy compared to unimodal inputs | Need to improve the neural network model for achieving better accuracy by tuning the hyperparameters | Accuracy-98.11% |
| 5 | | MicroblogPCU dataset-2,000 spam and non-spam data | Self-attention BiLSTM with ALBERT model-word vector model of BERT | Semantic and contextual data from tweets are captured using the Bi-LSTM model with self-attention mechanism | Computational time and resources required by the model have to be reduced. | Accuracy-0.91 |
| 6 | | Twitter and SinaWeibo datasets with 2,313 and 2,351 rumors | Recurrent Neural Networks (RNN) with extra hidden layers | RNN model with multiple hidden and embedding layers helps to reduce the spam detection time. | Massive unlabeled data from social media reduces the system performance. Works better on the Weibo dataset than on Twitter | Accuracy-0.88 |
| 7 | | Single domain hotel review dataset with 800 reviews (Dataset1) | Unsupervised Self Organized Maps (SOM) with CNN | Semantic information is captured well with the help of SOM to enhance the spam detection performance | Need to improve the performance of the SOM model by including additional layers and features. | Accuracy-0.87 |
| 8 | | Single domain hotel review dataset with 800 reviews and Yelp spam review dataset with 2,000 reviews | CNN and Bi-LSTM with Word2Vec method | Word2Vec approach has helped to get better feature vector representations and efficient results. | The data labelling process needs to be improved, and more training samples (1,600 reviews) are required to improve the classification performance. | Accuracy-94.56% |
| 9 | | WEBSPAM-2007 dataset containing 222 spam and 3,776 non-spam web pages. | LSTM model | It provides cognitive ability to the search engine for automatic webspam detection. | Need to tune the algorithm to handle large scale data from the web | Accuracy-96.96% |
| 10 | | WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets with spam and non-spam labels | Deep Belief Networks (DBN)-Stacked Restricted Boltzmann Machine (RBM) | Algorithm’s performance is improved by employing a preference function which is based on DBN | Proposed algorithm’s performance is dependent on selection of appropriate reference examples. | Accuracy-0.94 |