| Literature DB >> 36217359 |
Surendra Singh Gangwar, Santosh Singh Rathore, Satyendra Singh Chouhan, Sanskar Soni.
Abstract
The wide popularity of Twitter as a medium for exchanging activities, entertainment, and information has attracted spammers, who exploit it as a platform to spam users and spread misinformation. This poses a challenge for researchers: identifying malicious content and user profiles on Twitter so that timely action can be taken. Many previous works have used different strategies to address this challenge and combat spammer activity on Twitter. In this work, we develop various models that use profile-based, content-based, and hybrid features to identify malicious content and classify it as spam or not-spam. In the first step, we collect and label a large dataset from Twitter to create a spam detection corpus. Then, we build a rich feature set by extracting various features from the collected dataset. Further, we apply different machine learning, ensemble, and deep learning techniques to build the prediction models. We performed a comprehensive evaluation of the different techniques over the collected dataset and assessed their performance using accuracy, precision, recall, and F1-score. The results showed that the learning techniques achieved high performance for tweet spam classification; in most cases, the values are above 90% across the different performance measures. These results show that using profile, content, user, and hybrid features for suspicious tweet detection helps build better prediction models.
Keywords: Machine learning techniques; Natural language processing; Social network; Suspicious content detection; User-content features
Year: 2022 PMID: 36217359 PMCID: PMC9534460 DOI: 10.1007/s13278-022-00977-7
Source DB: PubMed Journal: Soc Netw Anal Min
Fig. 1 Twitter data collection, feature extraction procedure, and ML/DL model evaluation
Description of the features collected for the Twitter spam dataset
| Feature name | Feature type | Feature description |
|---|---|---|
| AccountAge | Profile | Days since account creation to date of collection |
| FollowersCount | Profile | In user profile meta-data |
| FriendsCount | Profile | In user profile meta-data |
| StatusesCount | Profile | In user profile meta-data |
| DigitsCountInName | Content | Number of digits in screen name |
| TweetLen | Content | Number of characters in tweet |
| UserNameLen | Content | Number of characters in user name |
| ScreenNameLen | Content | Number of characters in screen name |
| Metric entropy of each textual feature (tweet, user profile description, user name, and screen name) | Hybrid | Measures randomness in text |
| URIsRatio | Hybrid | |
| MentionsRatio | Hybrid | |
| NameSim | Hybrid | Proportion of similarity in user name and screen name |
| Friendship | Hybrid | |
| Followership | Hybrid | |
| Interestingness | Hybrid | |
| Activeness | Hybrid | |
| VerifiedAccount | Profile | In tweet meta-data |
| FavouritesCount | Profile | In user profile meta-data |
| NamesRatio | Hybrid | |
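The table above lists metric entropy as the randomness measure for the textual features, but the excerpt does not give its formula. A common definition is Shannon entropy normalized by string length; the sketch below uses that definition (the function name and character-level treatment are assumptions, not taken from the paper):

```python
import math
from collections import Counter

def metric_entropy(text: str) -> float:
    """Shannon entropy per character: a proxy for randomness in a string.
    Returns 0.0 for empty or single-symbol strings."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # Shannon entropy over the character distribution, in bits
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / n  # normalize by length -> "metric" entropy

# A random-looking screen name scores higher than a repetitive one
random_name = metric_entropy("x7kq9z2p")
plain_name = metric_entropy("aaaaaaaa")
```

Under this definition, screen names generated by spam bots (random character strings) tend to score noticeably higher than natural user names.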
Word categories
| Category | Example words |
|---|---|
| Ads | Ads, images, banners, Hedberg, RealMedia, img, announcer, popup, offer, adserver, sales, gifs, media, exit, out, adv, splash, pub, pop, graphics |
| Books | Catalog, book, patterns, weaving, product, sniacademic, news, ebook, educator, library, store, wilecyda |
| E-commerce | Shop, store, catalog, tickets, art, users, business |
| Games | Juegos, Jeux, category, game, Xbox, jeunesse, pc, online, Comunidad, consoles, flash, PSP, arcade, Wii, emulator, gratis, Nintendo, PlayStation |
| Medical | Health, conditions, article, content, diseases, meds, group |
| News | News, newspapers, media, publications, section, feed, opinion, business, community, archive, papers, profile |
| Sport | Sport, athletics, team, basketball, football, college, women, track, tennis, soccer, baseball, golf, mens |
Description of the performance measures
| Measure | Description |
|---|---|
| Accuracy | It is defined as the ratio of correctly predicted examples to the total examples. |
| Precision | It is calculated as the proportion of correctly predicted positive examples to all examples predicted as positive. |
| Recall | It is defined as the proportion of correctly predicted positive examples to all positive examples in the actual class. |
| F1-score | It is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives. |
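All four measures follow directly from the confusion-matrix counts. As a quick reference, a minimal sketch (the example counts are invented for illustration):

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Invented example: 90 spam tweets caught, 10 false alarms, 5 missed, 95 clean
acc, p, r, f1 = classification_metrics(tp=90, fp=10, fn=5, tn=95)
```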
Different ML models with bag of words (BOW) on DS1 and DS2
| Classifier | Dataset | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Logistic Regression | DS1 | 0.8763 | 0.87351 | 0.86263 | 0.87543 |
| Naive Bayes | DS1 | 0.68367 | 0.67542 | 0.67324 | 0.68453 |
| KNN | DS1 | 0.83106 | 0.83225 | 0.84751 | 0.82257 |
| Decision Tree | DS1 | 0.90127 | 0.90543 | 0.91248 | 0.89112 |
| Random Forest | DS1 | 0.91602 | 0.90251 | 0.92152 | 0.91358 |
| Logistic Regression | DS2 | 0.916 | 0.953 | 0.855 | 0.901 |
| Naive Bayes | DS2 | 0.544 | 0.495 | 0.932 | 0.647 |
| KNN | DS2 | 0.839 | 0.894 | 0.726 | 0.802 |
| Decision Tree | DS2 | 0.99 | 0.99 | 0.98 | 0.991 |
| Random Forest | DS2 | 0.992 | 0.99 | 0.985 | 0.992 |
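The bag-of-words representation behind these models maps each tweet to a vector of term counts over a shared vocabulary. A minimal pure-Python sketch (simple whitespace tokenization is assumed; the paper's actual preprocessing may differ):

```python
from collections import Counter

def bag_of_words(tweets):
    """Build a shared vocabulary and per-tweet term-count vectors."""
    vocab = sorted({tok for t in tweets for tok in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in tweets:
        vec = [0] * len(vocab)
        for tok, count in Counter(t.lower().split()).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors

# Invented example tweets
vocab, X = bag_of_words(["free offer click now",
                         "meeting at noon",
                         "free free prize"])
```

These count vectors are what the classifiers in the table (logistic regression, naive Bayes, etc.) consume as input features.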
Different ML models with TF-IDF on DS1 and DS2
| Classifier | Dataset | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Logistic Regression | DS1 | 0.91375 | 0.91256 | 0.90145 | 0.90628 |
| Naive Bayes | DS1 | 0.6912 | 0.68845 | 0.69014 | 0.68158 |
| KNN | DS1 | 0.83219 | 0.82156 | 0.84751 | 0.83348 |
| Decision Tree | DS1 | 0.9241 | 0.92147 | 0.91254 | 0.91469 |
| Random Forest | DS1 | 0.94666 | 0.93458 | 0.94375 | 0.94112 |
| Logistic Regression | DS2 | 0.753 | 0.83 | 0.567 | 0.671 |
| Naive Bayes | DS2 | 0.544 | 0.495 | 0.932 | 0.647 |
| KNN | DS2 | 0.839 | 0.895 | 0.726 | 0.801 |
| Decision Tree | DS2 | 0.99 | 0.988 | 0.989 | 0.989 |
| Random Forest | DS2 | 0.99 | 0.99 | 0.98 | 0.99 |
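TF-IDF reweights raw term counts so that words appearing in nearly every tweet contribute little, while distinctive words are emphasized. A from-scratch sketch using the classic idf = log(N / df) variant (the paper's exact weighting and smoothing scheme is not specified in this excerpt):

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF weights: tf = count/doc_len, idf = log(N/df)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # document frequency: in how many docs does each term appear?
    df = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: (c / len(toks)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

# Invented example: "the" appears in every doc, so its weight collapses to 0
w = tfidf(["the free prize", "the meeting notes", "the free offer"])
```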
Ensemble techniques with BOW on DS1 and DS2
| Classifier | Dataset | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Bagging | DS1 | 0.997 | 0.99 | 0.99 | 0.99 |
| Boosting | DS1 | 0.986 | 0.986 | 0.987 | 0.986 |
| Stacking | DS1 | 0.92 | 0.869 | 0.993 | 0.927 |
| Bagging | DS2 | 0.783 | 0.878 | 0.598 | 0.712 |
| Boosting | DS2 | 0.823 | 0.79 | 0.824 | 0.807 |
| Stacking | DS2 | 0.932 | 0.876 | 0.98 | 0.929 |
Ensemble techniques with TF-IDF on DS1 and DS2
| Classifier | Dataset | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Bagging | DS1 | 0.94368 | 0.94283 | 0.93451 | 0.93457 |
| Boosting | DS1 | 0.90354 | 0.90228 | 0.9134 | 0.91586 |
| Stacking | DS1 | 0.93765 | 0.92506 | 0.92355 | 0.93679 |
| Bagging | DS2 | 0.95242 | 0.94525 | 0.95221 | 0.94625 |
| Boosting | DS2 | 0.93691 | 0.93542 | 0.92231 | 0.92042 |
| Stacking | DS2 | 0.91969 | 0.90125 | 0.90589 | 0.91287 |
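The ensemble techniques in the two tables above all combine several base learners into one prediction. As a concept illustration only, here is a hard-voting sketch with toy rule-based classifiers standing in for the trained models (the rules, labels, and example tweets are invented, and bagging/boosting/stacking each combine learners in more elaborate ways than a plain vote):

```python
from collections import Counter

def majority_vote(classifiers, tweet):
    """Hard-voting ensemble: each base classifier casts a label;
    the most common label wins (ties broken by first occurrence)."""
    votes = [clf(tweet) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy rule-based base learners standing in for trained models
has_url  = lambda t: "spam" if "http" in t.lower() else "ham"
has_free = lambda t: "spam" if "free" in t.lower() else "ham"
shouty   = lambda t: "spam" if t.isupper() else "ham"

label = majority_vote([has_url, has_free, shouty], "FREE prize http://x.co")
```

Bagging trains the base learners on bootstrap resamples, boosting reweights misclassified examples between rounds, and stacking feeds the base predictions into a meta-learner; the voting step above is the simplest common denominator.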
Deep learning models on DS1 and DS2
| Model | Dataset | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|
| Basic ANN (64- and 32-unit layers) | DS1 | 0.979 | 0.969 | 0.988 | 0.978 |
| LSTM | DS1 | 0.673 | 0.652 | 0.637 | 0.646 |
| Single convolution layer | DS1 | 0.979 | 0.974 | 0.982 | 0.978 |
| Two convolution layers | DS1 | 0.986 | 0.997 | 0.972 | 0.985 |
| GRU | DS1 | 0.983 | 0.978 | 0.986 | 0.982 |
| VDCNN | DS1 | 0.938 | 0.99 | 0.868 | 0.929 |
| Convolution + LSTM | DS1 | 0.923 | 0.88 | 0.954 | 0.92 |
| Basic ANN (64- and 32-unit layers) | DS2 | 0.928 | 0.948 | 0.868 | 0.916 |
| LSTM | DS2 | 0.612 | 0.606 | 0.378 | 0.464 |
| Single convolution layer | DS2 | 0.906 | 0.951 | 0.832 | 0.88 |
| Two convolution layers | DS2 | 0.87 | 0.896 | 0.801 | 0.846 |
| GRU | DS2 | 0.558 | 0.559 | 0.047 | 0.086 |
| VDCNN | DS2 | 0.829 | 0.977 | 0.631 | 0.767 |
| Convolution + LSTM | DS2 | 0.558 | 0.682 | 0.021 | 0.041 |
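The GRU rows refer to the gated recurrent unit. As a reference for the cell's mechanics only (this is the standard Cho et al. formulation with random illustrative weights, not the paper's architecture or hyperparameters), a single GRU step in NumPy:

```python
import numpy as np

def gru_step(x, h, params):
    """One GRU time step: the update gate z decides how much of the previous
    hidden state h to keep versus overwrite with the candidate state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(Wz @ x + Uz @ h + bz)                # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)                # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)    # candidate state
    return (1 - z) * h + z * h_tilde

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # toy input and hidden sizes
params = [rng.standard_normal(shape) * 0.1 for shape in
          [(d_h, d_in), (d_h, d_h), (d_h,)] * 3]  # Wz,Uz,bz, Wr,Ur,br, Wh,Uh,bh
h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):  # run over a 5-step input sequence
    h = gru_step(x, h, params)
```

In a spam classifier, the input vectors would be learned word embeddings of the tweet tokens, and the final hidden state would feed a dense output layer.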
Fig. 2 Comparison of the different ML, ensemble, and deep learning techniques used with bag of words (BOW) on the DS1 and DS2 datasets (LR = Logistic Regression, NB = Naive Bayes, KNN = K-nearest neighbors, DT = Decision Tree, RF = Random Forest, SLC = Single-layer convolution, TLC = Two-layer convolution, GRU = Gated recurrent unit, Cov_LSTM = Convolution + LSTM, VDCNN = Very deep convolutional neural network)
Fig. 3 Comparison of the different ML, ensemble, and deep learning techniques used with TF-IDF on DS1 and DS2