Weiling Chen, Chai Kiat Yeo, Chiew Tong Lau, Bu Sung Lee.
Abstract
Techniques for detecting malicious content such as spam and phishing on Online Social Networks (OSN) are common, but little attention has been paid to other types of low-quality content, which actually affect users' content browsing experience the most. The aim of our work is to detect low-quality content from the users' perspective in real time. To define low-quality content comprehensibly, the Expectation Maximization (EM) algorithm is first used to coarsely classify low-quality tweets into four categories. Based on this preliminary study, a survey is carefully designed to gather users' opinions on the different categories of low-quality content. Both direct and indirect features, including newly proposed features, are identified to characterize all types of low-quality content. We then combine word-level analysis with the identified features and build a keyword blacklist dictionary to improve detection performance. We manually label an extensive Twitter dataset of 100,000 tweets and perform low-quality content detection in real time based on the characterized significant features and word-level analysis. The results show that our method, based on a random forest classifier, achieves a high accuracy of 0.9711 and a good F1 of 0.8379 with real-time performance in detecting low-quality content in tweets. Our work therefore has a positive impact on the user experience of browsing social media content.
Year: 2017 PMID: 28793347 PMCID: PMC5549928 DOI: 10.1371/journal.pone.0182487
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Overview of the low-quality content detection system.
How much do content polluters affect your user experience when using social network sites?
| Options | Number | Ratio |
|---|---|---|
| Very much. | 48 | 22.75% |
| A bit but still bearable. | 141 | 66.82% |
| A little. | 16 | 7.58% |
| They don’t affect my user experience. | 6 | 2.84% |
If someone follows you, will you follow back?
| Options | Number | Ratio |
|---|---|---|
| follow back out of courtesy | 42 | 19.91% |
| follow those I know or share common interests | 149 | 70.62% |
| don’t follow back | 15 | 7.11% |
| others | 5 | 2.37% |
How often will you clean up your followees/friends?
| Options | Number | Ratio |
|---|---|---|
| Seldom or never. | 153 | 72.51% |
| More than once a month. | 41 | 19.43% |
| At least once a month. | 11 | 5.21% |
| Almost every week. | 6 | 2.84% |
Fig 2. Users’ habits about following and cleaning up friends.
Fig 3. Users’ definition for low-quality content (Abstract categories).
Fig 4. Users’ definition for low-quality content (Specific examples).
What is the maximum threshold of low-quality content (as a percentage of your recently received messages) you can bear before considering unfollowing a user?
| Options | Number | Ratio |
|---|---|---|
| Too bothersome to unfollow | 28 | 13.27% |
| Nearly 100% | 15 | 7.11% |
| More than 75% | 52 | 24.64% |
| More than 50% | 75 | 35.55% |
| More than 25% | 41 | 19.43% |
Direct features.
| Index | Feature | Comments |
|---|---|---|
| 1 | Source | Tweeting tools |
| 2 | Type | Regular, Replies, Mentions and Retweets. |
| 3 | Retweet_count | The number of times the tweet is retweeted. |
| 4 | Favorite_count | The number of times the tweet is favorited. |
| 5 | Hashtags_count | The number of hashtags in the tweet. |
| 6 | Urls_count | The number of urls in the tweet. |
| 7 | Mentions_count | The number of mentions in the tweet. |
| 8 | Media_count | The number of media in the tweet. |
| 9 | Symbols_count | The number of cashtags in the tweet. |
| 10 | Possibly_sensitive | If the tweet possibly contains sensitive content. |
| 11 | Location | If the location field of profile is null. |
| 12 | URL | If the URL field of profile is null. |
| 13 | Description_len | The length of the description field of the profile. |
| 14 | Verified | If the user is verified by Twitter. |
| 15 | Ff_ratio | Followers_count / Friends_count |
| 16 | Followers_count | The number of followers of the user. |
| 17 | Friends_count | The number of friends of the user. |
| 18 | Statuses_count | The number of statuses the user has posted. |
| 19 | Favourites_count | The number of tweets the user has favorited. |
| 20 | Listed_count | The number of lists the user has created. |
| 21 | Account_age | The lifespan of the account. |
| 22 | Default_profile | If the user is using a default profile. |
| 23 | Default_profile_image | If the user is using a default avatar. |
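As a rough illustration of how several of the direct features listed above could be read off a single tweet, the following sketch assumes the tweet is available as a dictionary in the style of Twitter's v1.1 JSON (`entities`, `user`, `created_at`, and so on). The function name `direct_features`, the output key names, and the zero-friends guard in `ff_ratio` are illustrative assumptions and are not taken from the paper.

```python
from datetime import datetime, timezone

# Assumed timestamp layout of the user's `created_at` field (v1.1-style JSON).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def direct_features(tweet, now=None):
    """Derive a subset of the direct features from one tweet dictionary.

    `now` must be timezone-aware; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    entities = tweet.get("entities", {})
    user = tweet.get("user", {})
    followers = user.get("followers_count", 0)
    friends = user.get("friends_count", 0)
    created = datetime.strptime(user["created_at"], TWITTER_TIME_FORMAT)
    return {
        "retweet_count": tweet.get("retweet_count", 0),
        "favorite_count": tweet.get("favorite_count", 0),
        "hashtags_count": len(entities.get("hashtags", [])),
        "urls_count": len(entities.get("urls", [])),
        "mentions_count": len(entities.get("user_mentions", [])),
        "media_count": len(entities.get("media", [])),
        "symbols_count": len(entities.get("symbols", [])),
        "possibly_sensitive": bool(tweet.get("possibly_sensitive", False)),
        # True when the profile field is missing or empty.
        "location_is_null": not user.get("location"),
        "url_is_null": not user.get("url"),
        "description_len": len(user.get("description") or ""),
        "verified": bool(user.get("verified", False)),
        # Assumption: avoid division by zero when the user follows nobody.
        "ff_ratio": followers / max(friends, 1),
        # Account age expressed in whole days (the unit is an assumption).
        "account_age_days": (now - created).days,
    }
```

All of these values come from the tweet object itself, so no additional requests are needed to compute them.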
Indirect features.
| Index | Feature | Comments |
|---|---|---|
| 1 | Source_count | No. of sources used for posting n latest tweets. |
| 2 | Type_count | No. of types of the latest n tweets posted. |
| 3 | Hashtags_proportion | % of tweets with hashtags in the latest n tweets. |
| 4 | Urls_proportion | % of tweets with urls in the latest n tweets. |
| 5 | Mentions_proportion | % of tweets with mentions in the latest n tweets. |
| 6 | Media_proportion | % of tweets with media in the latest n tweets. |
| 7 | Symbols_proportion | % of tweets with symbols in the latest n tweets. |
| 8 | Sensitive_proportion | % of possibly sensitive tweets in the latest n tweets. |
| 9 | Nonfriends_interaction | If the tweet is an interaction between non-friends. |
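The indirect features above are aggregates over a user's latest n tweets. A minimal sketch of that aggregation follows, again assuming v1.1-style tweet dictionaries. The helper `tweet_type` and its rules for telling apart regular tweets, replies, mentions, and retweets are assumptions made for illustration, and proportions are returned as fractions in [0, 1] rather than percentages. `Nonfriends_interaction` is omitted because it depends on follower relationships that are not part of the tweet dictionaries.

```python
def tweet_type(tweet):
    """Classify a tweet as retweet, reply, mention, or regular (assumed rules)."""
    if "retweeted_status" in tweet:
        return "retweet"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    if tweet.get("entities", {}).get("user_mentions"):
        return "mention"
    return "regular"


def indirect_features(latest_tweets):
    """Aggregate the latest n tweets of one user into indirect features."""
    n = len(latest_tweets)
    if n == 0:
        raise ValueError("at least one tweet is required")

    def proportion(entity_key):
        # Fraction of tweets whose entity list for `entity_key` is non-empty.
        hits = sum(1 for t in latest_tweets if t.get("entities", {}).get(entity_key))
        return hits / n

    return {
        "source_count": len({t.get("source") for t in latest_tweets}),
        "type_count": len({tweet_type(t) for t in latest_tweets}),
        "hashtags_proportion": proportion("hashtags"),
        "urls_proportion": proportion("urls"),
        "mentions_proportion": proportion("user_mentions"),
        "media_proportion": proportion("media"),
        "symbols_proportion": proportion("symbols"),
        "sensitive_proportion": sum(
            1 for t in latest_tweets if t.get("possibly_sensitive")
        ) / n,
    }
```

Unlike the direct features, these values require the user's recent timeline, which is one plausible reason the feature subsets that include them take longer to evaluate.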
Fig 5. F1 measure with and without stemming.
Fig 6. Accuracy of different subsets of features.
Feature rank.
| IG | CHI | AUC | RFE |
|---|---|---|---|
| mention_prop | mention_prop | favourites_count | followers_count |
| url_prop | url_prop | type_cnt | friends_count |
| media_prop | media_prop | urls_cnt | statuses_count |
| type_cnt | favourites_count | url_prop | url_prop |
| favourites_count | type_cnt | mention_prop | listed_count |
| friends_count | friends_count | mentions_count | urls_count |
| urls_count | followers_count | type | mention_prop |
| hashtag_prop | urls_count | default_profile | media_prop |
| followers_count | hashtag_prop | ff_ratio | favourites_count |
| type | type | hashtags_count | hashtag_prop |
Detection performance of different feature subsets.
| Feature Subset | Acc | FPR | F1 | Time(s) |
|---|---|---|---|---|
| I | 0.9526 | 0.0103 | 0.7124 | 0.0002 |
| II | 0.9599 | 0.0089 | 0.7634 | 1.9327 |
| III | 0.9711 | 0.0075 | 0.8379 | 1.9342 |

| Feature Subset | Acc | FPR | F1 | Time(s) |
|---|---|---|---|---|
| I | 0.9335 | 0.003 | 0.4981 | 0.0003 |
| II | 0.9418 | 0.0074 | 0.6089 | 1.9328 |
| III | 0.9562 | 0.0037 | 0.7199 | 1.9343 |
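For readers who want to relate the Acc, FPR, and F1 columns to one another, the sketch below computes the three metrics from the counts of a binary confusion matrix using their standard definitions. The function name `detection_metrics` is a hypothetical choice, and low-quality content is assumed to be the positive class.

```python
def detection_metrics(tp, fp, tn, fn):
    """Return accuracy, false positive rate, and F1 for a binary detector.

    tp/fp/tn/fn are confusion-matrix counts with low-quality content
    treated as the positive class (an assumption of this sketch).
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Share of normal tweets that were wrongly flagged as low quality.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "fpr": fpr, "f1": f1}
```

When the positive class is small relative to the negative class, accuracy can stay high while F1 is noticeably lower, because true negatives count toward accuracy but not toward F1.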
Comparisons of different methods.
| Method | Acc | FPR | F1 |
|---|---|---|---|
| | 0.9711 | 0.0075 | 0.8379 |
| | 0.9580 | 0.0056 | 0.7538 |
| | 0.8514 | 0.0919 | 0.7025 |