Daniel Preoţiuc-Pietro1, Svitlana Volkova2, Vasileios Lampos3, Yoram Bachrach4, Nikolaos Aletras3.
Abstract
Automatically inferring user demographics from social media posts is useful for both social science research and a range of downstream applications in marketing and politics. We present the first extensive study where user behaviour on Twitter is used to build a predictive model of income. We apply non-linear methods for regression, i.e. Gaussian Processes, achieving strong correlation between predicted and actual user income. This allows us to shed light on the factors that characterise income on Twitter and analyse their interplay with user emotions and sentiment, perceived psycho-demographics and language use expressed through the topics of their posts. Our analysis uncovers correlations between different feature categories and income. Some reflect common belief (e.g. higher perceived education and intelligence indicate higher earnings) or known differences (e.g. gender and age differences), while others are novel findings (e.g. higher income users express more fear and anger, whereas lower income users express more of the time emotion and opinions).
Year: 2015 PMID: 26394145 PMCID: PMC4578862 DOI: 10.1371/journal.pone.0138717
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Subset of the SOC classification hierarchy.
| • Job titles: engineering manager, managing director, production manager, construction manager, quarry manager, operations manager |
| • Job titles: conservation officer, ecologist, energy conservation officer, heritage manager, marine conservationist, energy manager, environmental consultant, environmental engineer, environmental protection officer, environmental scientist, landfill engineer |
| • Job titles: architectural assistant, architectural technician, construction planner, planning enforcement officer, cartographer, draughtsman, CAD operator |
| • Job titles: administrative assistant, civil servant, government clerk, revenue officer, benefits assistant, trade union official, research association secretary |
| • Job titles: knitter, weaver, carpet weaver, curtain maker, upholsterer, curtain fitter, cobbler, leather worker, shoe machinist, shoe repairer, hosiery cutter, dressmaker, fabric cutter, tailor, tailoress, clothing manufacturer, embroiderer, hand sewer, sail maker, upholstery cutter |
| • Job titles: barber, colourist, hair stylist, hairdresser, beautician, beauty therapist, nail technician, tattooist |
| • Job titles: sales supervisor, section manager, shop supervisor, retail supervisor, retail team leader |
| • Job titles: assembler, line operator, solderer, quality assurance inspector, quality auditor, quality controller, quality inspector, test engineer, weighbridge operator, type technician |
| • Job titles: factory cleaner, hygiene operator, industrial cleaner, paint filler, packaging operator, material handler, packer |
Fig 1. The distribution of yearly income for the users in our dataset.
The red dotted line represents the mean.
Description of the user level features.
| Group | Feature |
|---|---|
| Profile | number of followers |
| Profile | number of friends |
| Profile | number of times listed |
| Profile | follower/friend ratio |
| Profile | no. of favourites the account made |
| Profile | avg. number of tweets/day |
| Profile | total number of tweets |
| Profile | proportion of tweets in English |
| Demo | gender (male, female) |
| Demo | age (18–70) |
| Demo | political (independent, conservative, liberal, unaffiliated) |
| Demo | intelligence (> average, average, ≤ average, ≫ average, ≪ average) |
| Demo | relationship (married, in a relationship, single, other) |
| Demo | ethnicity (Asian, African American, Indian, Hispanic, Other, Caucasian) |
| Demo | education (bachelor, graduate, high school) |
| Demo | religion (Christian, Jewish, Muslim, Hindu, unaffiliated, other) |
| Demo | children (yes, no) |
| Demo | income (below average, above average, very high) |
| Demo | life satisfaction (satisfied, dissatisfied, very satisfied, very dissatisfied, neither) |
| Demo | optimism (optimist, pessimist, extreme optimist, extreme pessimist, neither) |
| Demo | narcissism (agree strongly, agree, disagree, disagree strongly, neither) |
| Demo | excited (agree strongly, agree, disagree, disagree strongly, neither) |
| Demo | anxious (agree strongly, agree, disagree, disagree strongly, neither) |
| Emo | proportion of tweets with positive sentiment |
| Emo | proportion of tweets with neutral sentiment |
| Emo | proportion of tweets with negative sentiment |
| Emo | proportion of joy tweets |
| Emo | proportion of sadness tweets |
| Emo | proportion of disgust tweets |
| Emo | proportion of anger tweets |
| Emo | proportion of surprise tweets |
| Emo | proportion of fear tweets |
| Shallow | proportion of non-duplicate tweets |
| Shallow | proportion of retweeted tweets |
| Shallow | average no. of retweets/tweet |
| Shallow | proportion of retweets done |
| Shallow | proportion of hashtags |
| Shallow | proportion of tweets with hashtags |
| Shallow | proportion of tweets with @-mentions |
| Shallow | proportion of @-replies |
| Shallow | no. of unique @-mentions in tweets |
| Shallow | proportion of tweets with links |
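To make a few of the listed features concrete, the following is a minimal sketch of how some shallow textual features and the follower/friend ratio could be computed from raw tweet text. The function names, the regular expressions and the zero-friends guard are illustrative assumptions, not the authors' extraction pipeline, and the example tweets are made up.

```python
import re

# Hypothetical patterns; real tweet tokenisation is more involved.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#\w+")
LINK_RE = re.compile(r"https?://\S+")


def shallow_features(tweets):
    """Compute a handful of shallow textual features from tweet strings."""
    n = len(tweets)
    if n == 0:
        raise ValueError("need at least one tweet")
    unique_mentions = set()
    with_hashtag = with_mention = with_link = replies = 0
    for text in tweets:
        mentions = MENTION_RE.findall(text)
        unique_mentions.update(m.lower() for m in mentions)
        with_mention += bool(mentions)
        with_hashtag += bool(HASHTAG_RE.search(text))
        with_link += bool(LINK_RE.search(text))
        # Assumption: a tweet that starts with "@" is treated as an @-reply.
        replies += text.startswith("@")
    return {
        "prop_tweets_with_hashtags": with_hashtag / n,
        "prop_tweets_with_mentions": with_mention / n,
        "prop_replies": replies / n,
        "prop_tweets_with_links": with_link / n,
        "n_unique_mentions": len(unique_mentions),
    }


def follower_friend_ratio(followers, friends):
    """Ratio of followers to friends; the zero-friends guard is an assumption."""
    return followers / max(friends, 1)


# Made-up example tweets, for illustration only.
example_tweets = [
    "@alice great talk on #analytics http://example.com/a",
    "Heading home",
    "@bob @alice see you there",
    "New post #roi #metrics",
]
example_features = shallow_features(example_tweets)
```

Each proportion is taken over all of a user's tweets, so the values are comparable across users with very different activity levels.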
Prediction of income with our groups of features.
Pearson correlation (left columns) and Mean Absolute Error (right columns) between income and our models under 10-fold cross-validation, using three different regression methods: Linear regression (LR), Support Vector Machines with RBF kernel (SVM) and Gaussian Processes (GP), and the sets of features described in the User Features section.
| Feature set | No. of features | LR r | LR MAE | SVM r | SVM MAE | GP r | GP MAE |
|---|---|---|---|---|---|---|---|
| Profile | 8 | .205 | | .331 | | .372 | |
| Demo | 15 | .278 | | .257 | | .364 | |
| Emo | 9 | .271 | | .358 | | .371 | |
| Shallow | 10 | .200 | | .261 | | .355 | |
| Topics | 200 | .498 | | .606 | | .608 | |
| All features (Linear ensemble) | 5 | .506 | | .614 | | .633 | |
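The two evaluation measures named in the caption can be sketched in a few lines. The example below is a simplified illustration on synthetic data: it uses a one-feature least-squares line as a stand-in for the regression models, and it pools the held-out predictions from all folds before scoring. Both choices, along with every function name, are assumptions rather than the paper's setup.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def cross_validate(xs, ys, k=10, seed=0):
    """k-fold CV: every point is predicted by a model that never saw it."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [0.0] * len(xs)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        for i in fold:
            preds[i] = slope * xs[i] + intercept
    return pearson(ys, preds), mean_absolute_error(ys, preds)


# Synthetic data (not from the paper): a noisy linear relationship.
rng = random.Random(1)
feature = [i / 10 for i in range(200)]
target = [2 * x + 1 + rng.gauss(0, 1) for x in feature]
r, mae = cross_validate(feature, target, k=10)
```

Correlation measures how well the predictions track the ordering and relative spread of the targets, while MAE reports the typical error in the target's own units, so the two can rank models differently.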
Fig 2. Mean income with confidence intervals for psycho-demographic groups.
All group mean differences are statistically significant (Mann-Whitney test, p < .001).
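For readers unfamiliar with the test named above, here is a minimal sketch of a two-sided Mann-Whitney U test using the normal approximation. It omits the tie and continuity corrections that statistical packages normally apply, so treat it as an illustration of the idea, not a replacement for a library routine. The function names and the sample values are made up.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U statistic for sample a, two-sided p-value).

    Uses the large-sample normal approximation without tie or
    continuity correction (a simplifying assumption).
    """
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:n1])
    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_two_sided


# Made-up samples: two clearly separated groups and two interleaved groups.
u_sep, p_sep = mann_whitney_u([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
u_mix, p_mix = mann_whitney_u([1, 3, 5, 7], [2, 4, 6, 8])
```

Because the test works on ranks instead of raw values, it does not assume normally distributed data, which suits a skewed quantity such as income.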
Fig 3. Linear and non-linear (GP) fit for Profile features.
Variation of income as a function of user profile features. Linear fit in red, non-linear Gaussian Process fit in black. Brackets show the GP lengthscales—the lower the value, the more important the feature is for prediction.
Fig 4. Linear and non-linear (GP) fit for emotions and sentiments.
Variation of income as a function of user emotion and sentiment scores. Linear fit in red, non-linear Gaussian Process fit in black. Brackets show the GP lengthscales—the lower the value, the more important the feature is for prediction.
Fig 5. Linear and non-linear (GP) fit for shallow textual features.
Variation of income as a function of user shallow textual features. Linear fit in red, non-linear Gaussian Process fit in black. Brackets show the GP lengthscales—the lower the value, the more important the feature is for prediction.
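The captions above read a low lengthscale as a sign of feature importance. One way to see why is a kernel with a separate lengthscale per feature (automatic relevance determination, ARD). The sketch below assumes a squared-exponential form, which the captions do not confirm; the function name and the lengthscale values are illustrative only.

```python
import math


def se_ard_kernel(x, y, lengthscales, signal_variance=1.0):
    """Squared-exponential kernel with one lengthscale per feature (assumed form).

    k(x, y) = s^2 * exp(-0.5 * sum_d ((x_d - y_d) / l_d)^2)
    """
    scaled_sq_dist = sum(((a - b) / l) ** 2
                         for a, b, l in zip(x, y, lengthscales))
    return signal_variance * math.exp(-0.5 * scaled_sq_dist)


# Illustrative lengthscales: feature 0 is "important", feature 1 is not.
lengthscales = [0.5, 50.0]
origin = [0.0, 0.0]

k_same = se_ard_kernel(origin, origin, lengthscales)
# Moving one unit along the short-lengthscale feature decorrelates the points.
k_short = se_ard_kernel(origin, [1.0, 0.0], lengthscales)
# The same move along the long-lengthscale feature barely changes the kernel.
k_long = se_ard_kernel(origin, [0.0, 1.0], lengthscales)
```

With a large lengthscale, two users who differ only in that feature remain almost perfectly correlated under the kernel, so the feature has little influence on the prediction. A small lengthscale makes the prediction sensitive to that feature.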
Topics, represented by top 15 words, sorted by their ARD lengthscale.
Most predictive topics for income. Topic labels are manually added. Lower lengthscales (l) denote more predictive topics.
| Rank | Topic ID | Label | Top words | l |
|---|---|---|---|---|
| 1 | 139 | Politics | republican democratic gop congressional judiciary hearings abolishing oppose legislation governors congress constitutional lobbyists democrat republicans | 3.10 |
| 2 | 163 | NGOs | advocacy organization organizations advocates disadvantaged communities organisations participation outreach associations non-profit nonprofit orgs educators initiative | 3.44 |
| 3 | 196 | Web analytics / Surveys | #measure analytics #mrx #crowdsourcing crowdsourcing #socialmedia #analytics whitepaper #li metrics #roi startup #social #smm segmentation | 3.68 |
| 5 | 124 | Corporate 1 | consortium institutional firm’s acquisition enterprises subsidiary corp telecommunications infrastructure partnership compan aims telecom strategic mining | 6.48 |
| 6 | 91 | Corporate 2 | considerations provides comprehensive cost-effective enhance advantages selecting utilizing resource essential additionally specialized benefits provide enhancing | 7.44 |
| 7 | 107 | Justice | allegations prosecution indictment alleged convicted allegation alleges accused charges extortion defendant investigated prosecutor sentencing unlawful | 7.84 |
| 8 | 92 | Link words | otherwise unless wouldn’t whatever either maybe pretend anyone’s assume eventually assuming or bother couldn’t however | 8.39 |
| 9 | 173 | Beauty | hair comb bleached combed slicked hairs eyebrows ponytail trimmed curlers dye dyed curls waxed bangs | 9.75 |
| 10 | 40 | Sport shows | first-ever roundup sport’s round-up rundown poised previewing spotlight thursday’s com’s long-running joins concludes prepares observer | 10.57 |
| 11 | 99 | Swearing | messed f’d picking effed cracking f*cked hooking tearing catching lighten picked cracks ganging warmed fudged | 11.09 |
Fig 6. Linear and non-linear (GP) fit for topics.
Variation of income as a function of user topic usage. Linear fit in red, non-linear Gaussian Process fit in black.