Vivek Kulkarni, Margaret L. Kern, David Stillwell, Michal Kosinski, Sandra Matz, Lyle Ungar, Steven Skiena, H. Andrew Schwartz.
Abstract
Over the past century, personality theory and research have successfully identified core sets of characteristics that consistently describe and explain fundamental differences in the way people think, feel, and behave. Such characteristics were derived through theory, dictionary analyses, and survey research using explicit self-reports. The availability of social media data spanning millions of users now makes it possible to automatically derive characteristics from behavioral data (language use) at large scale. Taking advantage of linguistic information available through Facebook, we study the process of inferring a new set of potential human traits based on unprompted language use. We subject these new traits to a comprehensive set of evaluations and compare them with a popular five-factor model of personality. We find that our language-based trait construct is often more generalizable, in that it often predicts non-questionnaire-based outcomes (e.g., entities someone likes, income, and intelligence quotient) better than questionnaire-based traits, while the factors remain nearly as stable as traditional factors. Our approach suggests the value of new constructs of personality derived from everyday human language use.
Year: 2018 PMID: 30485276 PMCID: PMC6261386 DOI: 10.1371/journal.pone.0201703
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Word clouds showing the most/least correlated words for each factor (with rotation).
Word clouds showing the most/least correlated words for each FA factor (with rotation), obtained using Differential Language Analysis [11]. The larger the word, the more strongly it correlates with the factor. For all word clouds shown, FDR correction has been applied so that only significant words are displayed; spatial location does not encode anything. Color indicates frequency (grey = low use, blue = moderate use, red = frequent use).
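The analysis behind such word clouds (correlating each word's relative use with a factor score and keeping only words that survive FDR correction) can be sketched as below. This is a generic illustration with synthetic data, not the paper's code; all variable names and the Benjamini-Hochberg choice of FDR procedure are assumptions.

```python
import numpy as np
from scipy import stats

def dla_correlations(word_freq, factor_scores, fdr_alpha=0.05):
    """Correlate each word's relative frequency with a factor score and
    flag the words that survive Benjamini-Hochberg FDR correction."""
    n_words = word_freq.shape[1]
    results = [stats.pearsonr(word_freq[:, j], factor_scores)
               for j in range(n_words)]
    rs = np.array([r for r, _ in results])
    ps = np.array([p for _, p in results])
    # Benjamini-Hochberg step-up procedure on the sorted p-values
    order = np.argsort(ps)
    thresholds = fdr_alpha * np.arange(1, n_words + 1) / n_words
    below = ps[order] <= thresholds
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    significant = np.zeros(n_words, dtype=bool)
    significant[order[:k]] = True
    return rs, significant

# Synthetic data: 200 users, 50 words; word 0 drives the factor score.
rng = np.random.default_rng(0)
word_freq = rng.random((200, 50))
factor_scores = 2 * word_freq[:, 0] + rng.normal(size=200)
rs, sig = dla_correlations(word_freq, factor_scores)
```

The surviving words (`sig`) would then be sized by `rs` in a word cloud; only word 0 should reliably pass correction here.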
Fig 2. Correlations between the learned factors and the Big5 factors.
Fig 3. Individual factor correlations with outcomes.
Note how F4, which captures the use of swear words, correlates negatively with Satisfaction with Life (SWL).
Fig 4. Questions (left of each factor) and Likes (right of each factor) that correlate highest (green) and lowest (pink) with each of our five behavioral-linguistic trait (BLT) factors.
Fig 5. Word clouds showing the effect of a rotation.
Word clouds showing the effect of a rotation. A rotation yields markedly distinct factors. Note the absence of words like "paste this" across factors in the rotated version, whereas in the unrotated version multiple factors are characterized by phrases like "paste this" and "status update". The larger the word, the more strongly it correlates with the factor. Color indicates frequency (grey = low use, blue = moderate use, red = frequent use) [11].
Predictive performance for behavioral/economic outcomes.
| Method | FriendSize | Income | IQ | Likes (AUC) |
|---|---|---|---|---|
| DEMOG | 0.052 | 0.283 | 0.162 | 55.50 |
| Big5 | 0.183 | 0.037 | 0.179 | 52.60 |
| Big5 + DEMOG | 0.192 | 0.278 | 0.269 | 56.90 |
| FA5 | 0.125 | 0.362 | 0.361 | 60.11 |
| FA5 + DEMOG | 0.148 | 0.375 | 0.423 | 61.86 |
Behavioral/economic outcomes: predictive performance on the behavioral and economic outcomes for the Big5 factors, our learned language-based factors (FA5), and demographics (age and gender; DEMOG). We report the mean Pearson's r over 10 random train-test splits for FriendSize, Income, and IQ; for Likes we report the mean area under the curve (AUC) over all 20 categories. Language-based factors (FA5) perform competitively and even outperform the questionnaire-based factors.
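The evaluation protocol (mean Pearson's r over 10 random train-test splits, predicting an outcome from factor scores) can be sketched as follows. The choice of ridge regression as the predictive model is an assumption, and the data are synthetic; the paper's actual model and split sizes may differ.

```python
import numpy as np
from scipy import stats

def ridge_predict(X_train, y_train, X_test, alpha=1.0):
    """Closed-form ridge regression on centered data:
    w = (X'X + alpha*I)^-1 X'y."""
    mu, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - mu
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X_train.shape[1]),
                        Xc.T @ (y_train - y_mean))
    return (X_test - mu) @ w + y_mean

def mean_pearson_r(X, y, n_splits=10, test_frac=0.2, seed=0):
    """Mean Pearson's r between predictions and held-out outcomes
    over random train-test splits."""
    rng = np.random.default_rng(seed)
    n_test = int(len(y) * test_frac)
    rs = []
    for _ in range(n_splits):
        idx = rng.permutation(len(y))
        test, train = idx[:n_test], idx[n_test:]
        pred = ridge_predict(X[train], y[train], X[test])
        rs.append(stats.pearsonr(pred, y[test])[0])
    return float(np.mean(rs))

# Synthetic data: 500 users, 5 factor scores, linear outcome plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=500)
score = mean_pearson_r(X, y)
```

For the Likes outcome, the same loop would instead train one classifier per Like category and average AUC across the 20 categories.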
Predictive performance for questionnaire based outcomes.
| Method | B | S | D |
|---|---|---|---|
| DEMOG | 0.072 | 0.053 | 0.103 |
| Big5 | 0.178 | 0.486 | 0.407 |
| Big5 + DEMOG | 0.191 | 0.524 | 0.424 |
| FA5 | 0.178 | 0.165 | 0.293 |
| FA5 + DEMOG | 0.186 | 0.207 | 0.227 |
Questionnaire-based outcomes: predictive performance on the questionnaire outcomes for the Big5 factors, our learned language-based factors (FA5), and demographics (age and gender; DEMOG). We report the mean Pearson's r over 10 random train-test splits. Language-based factors (FA5) do not outperform the questionnaire-based factors.
Questions with the best and worst predictive performance.
| Question | r |
|---|---|
| 54 | 0.230 |
| 71 | 0.224 |
| 64 | 0.222 |
| 51 | 0.220 |
| 90 | 0.215 |
| 28 | 0.094 |
| 63 | 0.131 |
| 43 | 0.133 |
| 29 | 0.135 |
| 88 | 0.139 |
The Big5 questions that the BLT factors predict best (top) and worst (bottom).
Likes with the best and worst predictive performance.
| Like cluster | AUC |
|---|---|
| 8 | 71.04 |
| 12 | 68.50 |
| 11 | 68.50 |
| 10 | 68.05 |
| 14 | 65.85 |
| 3 | 50.73 |
| 0 | 51.46 |
| 16 | 54.23 |
| 4 | 54.49 |
| 1 | 54.78 |
The categories of Likes that the BLT factors predict best (top) and worst (bottom). We show the top Likes from each cluster for interpretation. AUC = area under the curve.
Fig 6. Test-retest validity of our learned factors.