Kokil Jaidka, Salvatore Giorgi, H Andrew Schwartz, Margaret L Kern, Lyle H Ungar, Johannes C Eichstaedt.
Abstract
Researchers and policy makers worldwide are interested in measuring the subjective well-being of populations. When users post on social media, they leave behind digital traces that reflect their thoughts and feelings. Aggregation of such digital traces may make it possible to monitor well-being at large scale. However, social media-based methods need to be robust to regional effects if they are to produce reliable estimates. Using a sample of 1.53 billion geotagged English tweets, we provide a systematic evaluation of word-level and data-driven methods for text analysis for generating well-being estimates for 1,208 US counties. We compared Twitter-based county-level estimates with well-being measurements provided by the Gallup-Sharecare Well-Being Index survey through 1.73 million phone surveys. We find that word-level methods (e.g., Linguistic Inquiry and Word Count [LIWC] 2015 and Language Assessment by Mechanical Turk [LabMT]) yielded inconsistent county-level well-being measurements due to regional, cultural, and socioeconomic differences in language use. However, removing as few as three of the most frequent words led to notable improvements in well-being prediction. Data-driven methods provided robust estimates, approximating the Gallup data at up to r = 0.64. We show that the findings generalized to county socioeconomic and health outcomes and were robust when poststratifying the samples to be more representative of the general US population. Regional well-being estimation from social media data seems to be robust when supervised data-driven methods are used.
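The word-level approach described in the abstract can be illustrated with a minimal sketch: score each county by the share of its tokens that appear in an emotion dictionary, correlate the scores with survey means, and repeat after excluding a frequent word. Everything in the block is an assumption made for illustration: the four-word dictionary, the three toy counties, the survey values, and the helper names `pearson_r` and `dictionary_score` are hypothetical and are not the study's data, dictionaries, or pipeline.

```python
import math
from collections import Counter


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def dictionary_score(tokens, dictionary, excluded=frozenset()):
    """Share of all tokens that are dictionary words, ignoring excluded words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    hits = sum(c for w, c in counts.items()
               if w in dictionary and w not in excluded)
    return hits / total if total else 0.0


# Hypothetical toy data: three counties, a four-word "positive" dictionary,
# and made-up survey means. None of these values come from the study.
positive_words = {"love", "good", "happy", "lol"}
county_tokens = {
    "A": "lol lol lol lol work is bad".split(),
    "B": "happy good day with family lol".split(),
    "C": "love my happy good life".split(),
}
survey_mean = {"A": 6.1, "B": 6.8, "C": 7.2}

counties = sorted(county_tokens)
survey = [survey_mean[c] for c in counties]

# County scores with the full dictionary and with one frequent word excluded.
raw = [dictionary_score(county_tokens[c], positive_words) for c in counties]
trimmed = [dictionary_score(county_tokens[c], positive_words, excluded={"lol"})
           for c in counties]

r_raw = pearson_r(raw, survey)
r_trimmed = pearson_r(trimmed, survey)
print(f"r (full dictionary):   {r_raw:.2f}")
print(f"r (one word excluded): {r_trimmed:.2f}")
```

In this toy setup a single frequent word that is used differently across counties dominates the raw score, so excluding it raises the correlation with the survey values. The sketch only mirrors the logic of the word-removal result and says nothing about its size in real data.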
Keywords: Twitter; big data; language analysis; machine learning; subjective well-being
Year: 2020 PMID: 32341156 PMCID: PMC7229753 DOI: 10.1073/pnas.1906364117
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
The language-based emotion measures used in this study, which span four main methods: word-level methods and data-driven methods applied at the sentence, user, or county level
| Type | Method (source) | No. of features | Categories |
| Word-level methods | | | |
| | LIWC 2015 | 1,364 | Positive emotion, negative emotion, anxiety, anger, sadness |
| | PERMA dictionary | 402 | Positive emotion, negative emotion |
| Word-level annotations | ANEW | 1,034 | Valence |
| Word-level annotations | LabMT | 10,218 | Valence |
| Data-driven methods | | | |
| Sentence-level annotations | WWBP affect | 7,265 | Affect |
| Sentence-level annotations | Swiss Chocolate | 7,168 | Positive, neutral, and negative emotion |
| Person-level models | WWBP life satisfaction (this study) | 2,000 | Cantril Ladder score |
| Direct prediction Cantril Ladder | County life satisfaction (this study) | 2,000 | Cantril Ladder score |
Pearson correlations (r) between Twitter-based emotions and Gallup-Sharecare Well-Being Index estimates across 1,208 US counties
The gray column headers identify the modified LIWC (removed 3 words), LabMT (removed 15 words), and ANEW (removed 2 words) dictionaries (in the text). The color indicates the direction and magnitude of correlation; white cells are nonsignificant, and all others are P < 0.05 corrected for multiple comparisons.
Fig. 1. Sources of error in the LIWC positive and negative emotion dictionaries. The matrix illustrates the 25 most frequent words from the two dictionaries that were correlated as expected (green indicates true LIWC positives and true negatives) or opposite to expectation (red indicates false positives and false negatives) with the Gallup happiness item. The size of the word denotes the magnitude of its correlation (r between 0.06 and 0.34; P < 0.05 corrected for multiple comparisons). The shade indicates the normalized frequency, with darker shades reflecting higher frequencies relative to other words.
Fig. 2. The relative frequency of false LIWC positive emotion words across the United States. States with a darker shade of red had relatively higher numbers of positive emotion words that correlated negatively with county Gallup happiness (Fig. 1, Upper Right) at P < 0.05, controlling for multiple comparisons.
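The true/false classification behind Figs. 1 and 2 can be sketched as a per-word check: correlate each dictionary word's county-level relative frequency with the survey outcome and compare the sign with what the dictionary predicts. This is a simplified illustration under stated assumptions. The words, frequencies, and outcome values are hypothetical, `flag_words` is an illustrative helper name, and the significance filter with multiple-comparison correction mentioned in the captions is omitted.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either sequence has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def flag_words(word_freqs, outcome, expected_sign):
    """Split words by whether their correlation sign matches the expected one.

    expected_sign is +1 for a positive emotion dictionary, -1 for a negative one.
    Words with exactly zero correlation are skipped.
    """
    as_expected, opposite = [], []
    for word, freqs in word_freqs.items():
        r = pearson_r(freqs, outcome)
        if r == 0:
            continue
        target = as_expected if r * expected_sign > 0 else opposite
        target.append((word, r))
    return as_expected, opposite


# Hypothetical per-county relative frequencies for two "positive" words and
# made-up county happiness values. None of these values come from the study.
happiness = [6.1, 6.8, 7.2]
positive_word_freqs = {
    "lol": [0.04, 0.02, 0.01],
    "great": [0.01, 0.02, 0.03],
}

true_pos, false_pos = flag_words(positive_word_freqs, happiness, expected_sign=+1)
print("as expected:", [w for w, _ in true_pos])
print("opposite to expectation:", [w for w, _ in false_pos])
```

A fuller version would keep only words whose correlations remain significant after correcting for the number of words tested, as the captions describe.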
Pearson correlations (r) between Gallup-Sharecare Well-Being Index-based estimates and Twitter use of subsets of LIWC positive emotion words that co-occur with other LIWC dictionaries across 1,208 US counties
The color indicates the direction and magnitude of correlation; white cells are nonsignificant, and all others are P < 0.05 corrected for multiple comparisons.
Pearson correlations (r) between Facebook-based emotions and survey responses across 2,321 Facebook users
The color indicates the direction and magnitude of correlation; white cells are nonsignificant, and all others are P < 0.05 corrected for multiple comparisons.