Robert Chew, Caroline Kery, Laura Baum, Thomas Bukowski, Annice Kim, Mario Navarro.
Abstract
BACKGROUND: Social media are important for monitoring perceptions of public health issues and for educating target audiences about health. However, limited information about the demographics of social media users makes it challenging to identify conversations among target audiences and limits how well social media can be used for public health surveillance and education outreach. Certain social media platforms provide demographic information on the followers of a user account when users supply it, but this information is not always disclosed. Researchers have therefore developed machine learning algorithms to predict social media users' demographic characteristics, mainly for Twitter. To date, there has been limited research on predicting the demographic characteristics of Reddit users.
Keywords: Reddit; age; classification; machine learning; social media
Year: 2021 PMID: 33724195 PMCID: PMC8087286 DOI: 10.2196/25807
Source DB: PubMed Journal: JMIR Public Health Surveill ISSN: 2369-2960
Number of manually labeled Reddit posts by age category (December 2019).
| Category | Coded posts or comments, n |
| Included | 2156 |
| 13-17 years | 683 |
| 18-20 years | 642 |
| 21-54 years | 831 |
| Excluded^a | |
| Cannot determine age | 252 |
| Not relevant | 154 |
^a We excluded posts if age could not be determined (eg, the number provided was not related to age, multiple ages were provided, or the post was in a different language) or if the post was not relevant (eg, the user explicitly stated they were based outside of the United States, or the post came from a throwaway account).
Variables for modeling age of Reddit users in the training set derived from the comment, post, and user data collected for users whose ages were confirmed by manual labeling.
| Variable group | Metadata used | Example | Variables (N=1523), n |
| Summary statistics | All | Median post score | 189 |
| Subreddit frequencies | Posts and comments | Frequency user posted to “Teenagers” subreddit | 624 |
| Emoji frequencies | Comments | Frequency of “ | 101 |
| Literary characteristics | Posts and comments | Average Flesch Reading Ease score | 28 |
| Patterns in posting | Posts and comments | Percentage of user’s posts that were videos | 42 |
| Term usage | Comments | TF-IDF^a score for the term “school” | 539^b |
^a TF-IDF: term frequency–inverse document frequency.
^b The number of TF-IDF terms varied across the cross-validation folds based on the comments and submissions vocabulary present in the training portion of each fold. The value presented here is the number of TF-IDF features when calculated on text from the full training set.
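The note about the TF-IDF feature count changing from fold to fold can be illustrated with a small sketch. It assumes whitespace tokenization, a common smoothed IDF formula, and a toy two-fold split; the comments, the helper names (`fit_idf`, `transform`, `k_fold_indices`), and the number of folds are hypothetical and are not taken from the study.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training text only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def transform(doc, idf):
    """Unnormalized TF-IDF weights; terms unseen during fitting are dropped."""
    counts = Counter(t for t in doc.lower().split() if t in idf)
    return {term: tf * idf[term] for term, tf in counts.items()}


def k_fold_indices(n_items, k):
    """Assign items to k held-out folds in round-robin order."""
    return [list(range(i, n_items, k)) for i in range(k)]


# Hypothetical toy comments, for illustration only.
comments = [
    "school starts monday",
    "need to finish homework for school",
    "home after work",
    "totally need coffee at work",
]

vocab_sizes = []
for held_out in k_fold_indices(len(comments), k=2):
    train = [c for i, c in enumerate(comments) if i not in held_out]
    idf = fit_idf(train)              # vocabulary comes from this fold's training text
    vocab_sizes.append(len(idf))      # so the feature count differs by fold
    features = [transform(comments[i], idf) for i in held_out]
```

Fitting the vocabulary inside each fold keeps held-out text from influencing the features, which is why the feature count is only fixed once the full training set is used.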
Classifier performance for predicting manually labeled age of Reddit users based on the full derived variable set.
| Age group | Precision, mean (SD) | Recall, mean (SD) | F1 score, mean (SD) | AUROC^a, mean (SD) | Support, n |
| |
| 13-20 years | 0.75 (0.04) | 0.91 (0.01) | 0.82 (0.02) | —^b | 202 |
| 21-54 years | 0.79 (0.03) | 0.53 (0.04) | 0.63 (0.03) | — | 128 |
| Overall | — | — | 0.75 (0.03) | 0.83 (0.02) | 331 |
| |
| 13-20 years | 0.65 (0.04) | 0.98 (0.01) | 0.78 (0.03) | — | 202 |
| 21-54 years | 0.84 (0.04) | 0.18 (0.04) | 0.29 (0.05) | — | 128 |
| Overall | — | — | 0.59 (0.04) | 0.75 (0.03) | 331 |
| |
| 13-20 years | 0.75 (0.04) | 0.89 (0.03) | 0.82 (0.03) | — | 202 |
| 21-54 years | 0.76 (0.06) | 0.54 (0.05) | 0.63 (0.04) | — | 128 |
| Overall | — | — | 0.74 (0.04) | 0.83 (0.02) | 331 |
| |
| 13-20 years | 0.79 (0.04) | 0.85 (0.02) | 0.82 (0.02) | — | 202 |
| 21-54 years | 0.74 (0.05) | 0.64 (0.04) | 0.68 (0.03) | — | 128 |
| Overall | — | — | 0.77 (0.03) | 0.84 (0.02) | 331 |
^a AUROC: area under the receiver operating characteristics curve.
^b Not available or not applicable.
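For readers less familiar with AUROC, the sketch below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The labels and scores are hypothetical; treating the older age group as the positive class is an assumption made only for the example.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities of belonging to the 21-54 group (label 1).
example_labels = [0, 0, 1, 1]
example_scores = [0.10, 0.40, 0.35, 0.80]
example_auroc = auroc(example_labels, example_scores)
```

Because AUROC depends only on how cases are ranked, it is reported once per model rather than once per age group, which matches the single "Overall" value in each block of the table.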
Model performance on the test set for predicting age group of Reddit users using gradient boosted trees classifier with all features vs select features.
| Age group | Precision | Recall | F1 score | AUROC^a | Support, n |
| All features |
| 13-20 years | 0.81 | 0.87 | 0.84 | —^b | 254 |
| 21-54 years | 0.77 | 0.67 | 0.72 | — | 161 |
| Overall | — | — | 0.79 | 0.87 | 415 |
| Select features |
| 13-20 years | 0.79 | 0.89 | 0.84 | — | 254 |
| 21-54 years | 0.78 | 0.63 | 0.70 | — | 161 |
| Overall | — | — | 0.78 | 0.86 | 415 |
^a AUROC: area under the receiver operating characteristics curve.
^b Not available or not applicable.
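The F1 scores in this table follow from the reported precision and recall, since F1 is their harmonic mean. The sketch below reproduces the first block of rows (precision 0.81 and recall 0.87 for ages 13-20 years; 0.77 and 0.67 for ages 21-54 years). Combining the two class scores with a support-weighted mean is an assumption; the source does not state which averaging scheme produced the "Overall" row.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def support_weighted_mean(values, supports):
    """Average per-class values, weighting each class by its number of cases."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


f1_younger = f1_score(0.81, 0.87)   # 13-20 years, first block of rows
f1_older = f1_score(0.77, 0.67)     # 21-54 years, first block of rows

# Assumption: the overall F1 is a support-weighted mean of the class F1 scores.
f1_overall = support_weighted_mean([f1_younger, f1_older], [254, 161])
```

Rounded to two decimals, these give 0.84, 0.72, and 0.79, in line with the table. The lower recall for the 21-54 years group is what pulls its F1 score below that of the younger group.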
Figure 1. Permutation feature importance scores for top variables in the reduced model for predicting age of Reddit users.
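Permutation feature importance measures how much a model's score drops when the values of one variable are shuffled across users, breaking that variable's link to the outcome. The sketch below is a minimal illustration under stated assumptions: the threshold rule `toy_model` stands in for the gradient boosted trees classifier, accuracy stands in for whatever score the authors used, and the rows are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + (v,) + r[col + 1:]
                        for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical stand-in classifier that looks only at column 0."""
    return "21-54" if row[0] > 2.0 else "13-20"


# Hypothetical rows: (sentences per comment, an unrelated noise variable).
rows = [(1.0, 5.0), (1.5, 3.0), (3.0, 4.0), (3.5, 1.0), (1.2, 2.0), (2.8, 6.0)]
labels = [toy_model(r) for r in rows]
importance_scores = permutation_importance(toy_model, rows, labels)
```

A variable the model never consults gets an importance of exactly zero, because shuffling it cannot change any prediction; variables the model relies on get a positive score.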
Age group comparisons for the top performing variables (by permutation feature importance).
| Variable | Type | Age 13-20 years (n=1014), mean^a (95% CI) | Age 21-54 years (n=643), mean^a (95% CI) | t (df) | P value |
| Sentences per comment | Literary characteristics | −0.43 (−0.52, −0.35) | 0.37 (0.27, 0.46) | −12.31 (1489.89) | <.001 |
| Year account created | Summary statistics | 2.96 (2.79, 3.14) | 1.35 (1.10, 1.59) | 10.65 (1272.76) | <.001 |
| Proportion of user’s posts or comments in | Subreddit frequencies | −3.49 (−3.67, −3.31) | −5.11 (−5.17, −5.05) | 16.69 (1208.10) | <.001 |
| 75th percentile subscriber count for subreddits user posted | Summary statistics | −0.14 (−0.23, −0.04) | −0.33 (−0.46, −0.20) | 2.41 (1276.15) | .02 |
| Average comment Coleman Liau Index | Literary characteristics | −0.25 (−0.34, −0.16) | 0.08 (−0.01, 0.17) | −5.06 (1571.31) | <.001 |
| Comment karma | Summary statistics | −0.19 (−0.25, −0.13) | 0.30 (0.22, 0.38) | −9.85 (1309.14) | <.001 |
| TF-IDF^b weight for “school” | Term usage | −2.46 (−2.65, −2.27) | −2.63 (−2.86, −2.40) | 1.14 (1406.18) | .25 |
| Frequency of WWBP^c 23-29 word set used | Term usage | −1.23 (−1.39, −1.08) | −0.05 (−0.20, 0.10) | −10.97 (1592.33) | <.001 |
| TF-IDF weight for “need” | Term usage | −1.58 (−1.75, −1.41) | −0.31 (−0.48, −0.15) | −10.48 (1577.27) | <.001 |
| Normalized count of WWBP 23-29 word set used | Term usage | −1.29 (−1.44, −1.14) | 0.04 (−0.11, 0.19) | −12.35 (1569.17) | <.001 |
| Proportion of comments posted in a thread user started | Summary statistics | −0.74 (−0.90, −0.57) | −1.41 (−1.62, −1.20) | 4.95 (1366.37) | <.001 |
| TF-IDF weight for “look like” | Term usage | −3.04 (−3.22, −2.86) | −1.84 (−2.08, −1.61) | −7.85 (1327.70) | <.001 |
| TF-IDF weight for “home” | Term usage | −3.45 (−3.62, −3.28) | −1.89 (−2.14, −1.65) | −10.24 (1252.67) | <.001 |
| TF-IDF weight for “totally” | Term usage | −4.23 (−4.37, −4.09) | −2.95 (−3.19, −2.71) | −8.92 (1092.36) | <.001 |
| Proportion of user’s posts or comments in | Subreddit frequencies | −5.03 (−5.10, −4.97) | −4.29 (−4.48, −4.11) | −7.38 (810.43) | <.001 |
^a Quantile transformed means.
^b TF-IDF: term frequency–inverse document frequency.
^c WWBP: World Well-Being Project.
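The fractional degrees of freedom in this table are consistent with an unequal-variance (Welch) t test, although the source does not name the test. Under that assumption, and assuming each 95% CI is the mean plus or minus 1.96 standard errors, the test statistic can be approximately recovered from the reported summaries. The sketch below does this for the first row (sentences per comment); the helper names are hypothetical.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Standard error implied by a symmetric CI (assumes a normal-based 95% CI)."""
    return (upper - lower) / (2 * z)


def welch_t(mean1, se1, n1, mean2, se2, n2):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = se1 ** 2, se2 ** 2          # squared standard errors of the two means
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# First row of the table: ages 13-20 years (n=1014) vs ages 21-54 years (n=643).
t_stat, dof = welch_t(
    -0.43, se_from_ci(-0.52, -0.35), 1014,
    0.37, se_from_ci(0.27, 0.46), 643,
)
```

The result lands close to the reported −12.31 (1489.89); the remaining gap comes from working with means and interval bounds that were rounded to two decimals.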