Siaw Ling Lo, Raymond Chiong, David Cornforth.
Abstract
The vast amount and diversity of the content shared on social media can pose a challenge for any business wanting to use it to identify potential customers. In this paper, our aim is to investigate the use of both unsupervised and supervised learning methods for target audience classification on Twitter with minimal annotation effort. Topic domains were automatically discovered from content shared by followers of an account owner using Twitter Latent Dirichlet Allocation (LDA). A Support Vector Machine (SVM) ensemble was then trained using content from different account owners of the various topic domains identified by Twitter LDA. Experimental results show that the methods presented are able to successfully identify a target audience with high accuracy. In addition, we show that using a statistical inference approach such as bootstrapping in over-sampling, instead of random sampling, to construct training datasets can yield a better classifier in an SVM ensemble. We conclude that such an ensemble system can take advantage of data diversity, which enables real-world applications for differentiating prospective customers from the general audience, leading to business advantage in the crowded social media space.
Year: 2015; PMID: 25874768; PMCID: PMC4395415; DOI: 10.1371/journal.pone.0122855
Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240
Fig 1. Followers’ domains discovery using Twitter LDA.
The domains identified from followers’ tweets using Twitter LDA.
| Number of topics | Topic | Domain | Topical words | |
|---|---|---|---|---|
| 10 | Topic 9 | Daily musing | love, people, life, god, things, feel | 1 |
| 20 | Topic 6 | Food | singapore, food, lunch, dinner, coffee, tea, chicken | 1 |
| | Topic 7 | Football, English premier league (EPL) | united, manchester, league, chelsea, david, goal | 1 |
| | Topic 8 | Daily musing | people, love, life, things, god, feel | 1 |
| | Topic 12 | Singapore related | singapore, airport, points, club, changi | 0.75 |
| | Topic 0 | Daily musing | happy, video, birthday, love, mothers | 0.75 |
| 30 | Topic 10 | Daily musing | day, good, happy, morning, mothers, birthday, dinner | 1 |
| | Topic 15 | Daily musing | time, work, sleep, school, long | 1 |
| | Topic 18 | Daily musing | people, life, love, happy, things, god | 1 |
| | Topic 28 | Football, EPL | chelsea, league, united, match, madrid | 1 |
| | Topic 1 | Social media marketing | social, media, marketing, twitter, facebook, business | 0.75 |
| | Topic 14 | Music | singapore concert, tour, fans, tickets, album | 0.75 |
| | Topic 16 | Transport | singapore, mrt, blk, bus, wifi | 0.75 |
| | Topic 25 | News | indonesia, model, tokyo, festival | 0.75 |
Fig 2. The bootstrapping algorithm.
The configuration of bootstrapping using a single SVM model.
| Method | Training dataset (a) | Models |
|---|---|---|
| SVM with bootstrapping sampling | samsungsg (1978) and others (1978) | 1 SVM model |

(a) The number in brackets is the number of records. Data collection and pre-processing are described in S1 Methods. The ‘others’ training dataset consists of 1978 records extracted from 10 different domains. To account for temporal effects, at most the 200 most recent records were extracted from each domain, taken from the same period as the samsungsg records.
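The configuration above pairs 1978 samsungsg records with 1978 ‘others’ records, and the abstract describes this as bootstrapping in over-sampling. The following is a minimal sketch of that idea, assuming the smaller class is resampled with replacement until it matches the larger one. The function name `bootstrap_oversample`, the seed, the placeholder record strings, and the starting size of 200 samsungsg records (taken from the ensemble configurations) are illustrative assumptions, not the authors' code.

```python
import random

def bootstrap_oversample(records, target_size, seed=0):
    """Draw target_size records from `records` with replacement."""
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)
    return [rng.choice(records) for _ in range(target_size)]

# Hypothetical stand-ins for tweets: 200 target records, 1978 'others'.
target = [f"samsungsg_tweet_{i}" for i in range(200)]
others = [f"other_tweet_{i}" for i in range(1978)]

# Over-sample the target class so both classes have 1978 records.
balanced_target = bootstrap_oversample(target, len(others))

# Label 1 = target audience, label 0 = others.
training_set = [(t, 1) for t in balanced_target] + [(t, 0) for t in others]
```

Because sampling is done with replacement, individual target records appear several times in `balanced_target`; the resulting `training_set` would then be passed to a single SVM.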
Fig 3. A general architecture of bootstrapping using a single SVM model.
Fig 4. A general architecture of the ensemble system using multiple SVM models.
The configuration of various multiple SVM ensembles.
| No. | Method | Target training data (a) | ‘Others’ training data (a) | Models |
|---|---|---|---|---|
| 1. | SVM with 10 random sampling with majority vote | samsungsg (200) | others (~200) x 10 | 10 SVM models |
| 2. | SVM with majority vote | samsungsg (200) | 10 others | 10 SVM models |
| 3. | SVM with bagging | samsungsg (200) | others (1978) | 10 SVM models |
| 4. | SVM with stacking | samsungsg (200) | 10 others | 10 SVM models with Naïve Bayes (kernel) as the tier two classifier |

(a) The number in brackets is the number of records. Data collection and pre-processing are described in S1 Methods.
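Configuration 1 pairs the 200 samsungsg records with ten random samples of roughly 200 ‘others’ records, one per SVM model. A minimal sketch of how such training sets could be assembled is given here, assuming each sample is drawn without replacement from the 1978-record ‘others’ pool and contains exactly 200 records. The function name `random_training_sets` and the placeholder record strings are hypothetical.

```python
import random

def random_training_sets(target, others, n_models=10, sample_size=200, seed=0):
    """Build one training set per model: every target record (label 1)
    plus a random sample of the 'others' pool (label 0)."""
    rng = random.Random(seed)
    training_sets = []
    for _ in range(n_models):
        negatives = rng.sample(others, sample_size)  # without replacement
        training_sets.append(
            [(r, 1) for r in target] + [(r, 0) for r in negatives]
        )
    return training_sets

# Hypothetical stand-ins for tweets.
target = [f"samsungsg_{i}" for i in range(200)]
others = [f"other_{i}" for i in range(1978)]
training_sets = random_training_sets(target, others)
```

Each of the ten sets would train one SVM, and the ten predictions would then be combined by majority vote.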
Fig 5. The majority vote algorithm.
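As a rough illustration of majority voting over the outputs of several classifiers, the sketch below returns the most frequent label. The tie-breaking rule (fall back to label 0, i.e. not target audience) and the example votes are assumptions made for this sketch and are not taken from the figure.

```python
from collections import Counter

def majority_vote(predictions, tie_label=0):
    """Return the most common label; fall back to tie_label on a tie."""
    counts = Counter(predictions).most_common()
    if not counts:
        raise ValueError("no predictions to combine")
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return tie_label
    return counts[0][0]

# Hypothetical outputs of 10 SVM models for one follower (1 = target audience).
votes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
decision = majority_vote(votes)
```

With an even number of models, as in the 10-model ensembles, a 5 to 5 split is possible, so some explicit tie rule is needed.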
Fig 6. The bagging algorithm.
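Bagging in its general form trains each model on a bootstrap replicate of the training data and combines the models by voting. The sketch below shows that general scheme only. To stay dependency-free it uses a nearest-class-mean rule on a one-dimensional feature as a stand-in base learner, not an SVM, and the data points are invented for illustration.

```python
import random
from collections import Counter

def train_centroid(sample):
    """Stand-in base learner (not an SVM): nearest class mean on a 1-D feature."""
    means = {}
    for label in {y for _, y in sample}:
        xs = [x for x, y in sample if y == label]
        means[label] = sum(xs) / len(xs)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

def bagging(data, train, n_models=10, seed=0):
    """Train one model per bootstrap replicate and combine them by majority vote."""
    rng = random.Random(seed)
    models = [train([rng.choice(data) for _ in data]) for _ in range(n_models)]

    def predict(x):
        return Counter(model(x) for model in models).most_common(1)[0][0]

    return predict

# Invented 1-D data: class 0 lies in [0, 0.9], class 1 in [2.0, 2.9].
data = [(x / 10, 0) for x in range(10)] + [(2 + x / 10, 1) for x in range(10)]
classify = bagging(data, train_centroid)
```

Each replicate has the same size as the original data but is drawn with replacement, so the ten models see slightly different training sets.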
Fig 7. The stacking algorithm.
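In stacking, the predictions of the tier-one models become the input features of a tier-two classifier. The sketch below illustrates that data flow under several simplifying assumptions: the tier-one models are threshold rules standing in for SVMs, the tier-two classifier is a Naive Bayes model over binary votes with add-one smoothing rather than the kernel variant named in the configuration table, and the tier-two model is fitted on the same points the tier-one models are applied to. All names and data are hypothetical.

```python
import math

def train_naive_bayes(rows, labels):
    """Fit Naive Bayes with add-one smoothing on binary feature rows."""
    classes = sorted(set(labels))
    totals = {c: labels.count(c) for c in classes}
    # ones[c][j] counts class-c rows whose j-th feature equals 1.
    ones = {c: [0] * len(rows[0]) for c in classes}
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            ones[c][j] += v

    def log_posterior(row, c):
        score = math.log(totals[c] / len(labels))
        for j, v in enumerate(row):
            p_one = (ones[c][j] + 1) / (totals[c] + 2)
            score += math.log(p_one if v == 1 else 1 - p_one)
        return score

    return lambda row: max(classes, key=lambda c: log_posterior(row, c))

def stack(tier_one, tier_two):
    """Feed the tier-one predictions for x into the tier-two classifier."""
    return lambda x: tier_two([model(x) for model in tier_one])

# Hypothetical tier-one models: threshold rules standing in for SVMs.
tier_one = [lambda x, t=t: int(x > t) for t in (1, 2, 3)]
xs = [0, 0.5, 1.5, 2.5, 3.5, 4]
ys = [0, 0, 0, 1, 1, 1]
meta_rows = [[model(x) for model in tier_one] for x in xs]
classify = stack(tier_one, train_naive_bayes(meta_rows, ys))
```

In practice the tier-two classifier is usually fitted on held-out tier-one predictions to limit overfitting; that step is omitted here for brevity.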
Topic groups identified via the seed words-fuzzy match approach and some of their topical words.
| Number of topics | Topic | Topical words |
|---|---|---|
| 10 | Topic 1 | samsung, galaxy, phone, iphone, app, mobile |
| | Topic 8 | singapore, android, ipad, Samsung, sg |
| 20 | Topic 9 | tv, led, Samsung, contest, giveaway |
| | Topic 10 | galaxy, Samsung, android, tablet, sony, xperia |
| | Topic 16 | samsung, galaxy, android, phone, mobile, iphone, app |
| 30 | Topic 0 | samsung, galaxy, android, phone, note, iphone, htc |
| | Topic 2 | tv, Samsung, led, video, review, hd |
| | Topic 12 | android, touch, tablet, pc |
| | Topic 17 | galaxy, Samsung, video |
| | Topic 23 | app, google, ipad, android, iphone |
Fig 8. F measures of 10 SVM models generated from random samples.
The x-axis represents individual SVM models.
Fig 9. AUC of 10 SVM models generated from random samples.
The x-axis represents individual SVM models.
Results of 10 fold cross-validation of various SVM ensembles.
| Method | | | | |
|---|---|---|---|---|
| SVM with bootstrapping sampling | 1 | 0.98 | 0.99 | 0.99 |
| SVM with 10 random sampling with majority vote | 0.31 | 0.46 | 0.37 | 0.54 |
| SVM with majority vote | 0.84 | 0.38 | 0.52 | 0.85 |
| SVM with bagging | 0.69 | 0.97 | 0.80 | 0.83 |
| SVM with stacking | 0.96 | 0.90 | 0.93 | 0.95 |
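The metric names for this table did not survive in this record. One reading that the numbers support is that the first two numeric columns are precision and recall (in either order, since the formula is symmetric) and the third is the F measure, their harmonic mean. The sketch below checks that reading against the tabulated values. It is an assumption about the column meanings, not a statement from the source, and it says nothing about the fourth column.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (first value, second value, reported third value) for each table row.
rows = {
    "bootstrapping sampling": (1.00, 0.98, 0.99),
    "10 random sampling with majority vote": (0.31, 0.46, 0.37),
    "majority vote": (0.84, 0.38, 0.52),
    "bagging": (0.69, 0.97, 0.80),
    "stacking": (0.96, 0.90, 0.93),
}

# Largest gap between the recomputed and the reported third column.
max_gap = max(abs(f_measure(a, b) - reported) for a, b, reported in rows.values())

for name, (a, b, reported) in rows.items():
    print(f"{name}: computed {f_measure(a, b):.3f}, reported {reported:.2f}")
```

Every row agrees to within 0.01, which is the size of gap expected when the inputs themselves are rounded to two decimals.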
Results of various SVM ensembles on the testing dataset.
| Method | | |
|---|---|---|
| SVM with bootstrapping sampling | 0.76 | 1932±61 |
| SVM with 10 random sampling with majority vote | 0.62 | 722±29 |
| SVM with majority vote | 0.64 | 723±16 |
| SVM with bagging | 0.89 | 482±22 |
| SVM with stacking | 0.73 | 629±25 |
Fig 10. ROC curves of various SVM ensembles on the testing dataset.