Antonio A Morgan-Lopez, Annice E Kim, Robert F Chew, Paul Ruddle.
Abstract
Health organizations are increasingly using social media, such as Twitter, to disseminate health messages to target audiences. Determining the extent to which the target audience (e.g., age groups) was reached is critical to evaluating the impact of social media education campaigns. The main objective of this study was to examine the separate and joint predictive validity of linguistic and metadata features in predicting the age of Twitter users. We created a labeled dataset of Twitter users across different age groups (youth, young adults, adults) by collecting publicly available birthday announcement tweets using the Twitter Search application programming interface. We manually reviewed the results and, for each age-labeled handle, collected the 200 most recent publicly available tweets and the handle's metadata. The labeled data were split into training and test datasets. We created separate models to examine the predictive validity of language features only, metadata features only, language and metadata features, and words/phrases from another age-validated dataset. We estimated accuracy, precision, recall, and F1 metrics for each model. An L1-regularized logistic regression model was fit for each age group, and predicted probabilities between the training and test sets were compared for each age group. Cohen's d effect sizes were calculated to examine the relative importance of significant features. Models containing both tweet language features and metadata features performed the best (74% precision, 74% recall, 74% F1), while the model containing only Twitter metadata features was the least accurate (58% precision, 60% recall, and 57% F1). Top predictive features included use of terms such as "school" for youth and "college" for young adults. Overall, it was more challenging to predict older adults accurately.
These results suggest that examining linguistic and Twitter metadata features to predict youth and young adult Twitter users may be helpful for informing public health surveillance and evaluation research.
Year: 2017 PMID: 28850620 PMCID: PMC5574558 DOI: 10.1371/journal.pone.0183537
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Number of unique Twitter users identified from birthday tweets by age group.
| Age Group | N |
|---|---|
| Youth: 13–17 | 1,036 |
| Young adults: 18–24 | 1,634 |
| Adults: 25 or older | 514 |
Precision and recall results from validation of multiple age classification models.
| Age Group | Tweet Language Use Only | | | Twitter Handle Metadata Only | | | Tweet Language Use and Twitter Handle Metadata | | | WWBP Words | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Precision | Recall | F1 | Precision | Recall | F1 | Precision | Recall | F1 | Precision | Recall | F1 |
| 13–17 | 69% | 71% | 70% | 59% | 51% | 55% | 71% | 75% | 73% | 62% | 72% | 67% |
| 18–24 | 78% | 74% | 76% | 61% | 78% | 68% | 80% | 73% | 76% | 77% | 65% | 71% |
| 25 or older | 60% | 65% | 65% | 47% | 17% | 25% | 63% | 73% | 67% | 52% | 59% | 55% |
WWBP = World Well Being Project [4].
Confusion matrix.
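| Actual age group | Predicted 13–17 | Predicted 18–24 | Predicted 25 or older |
|---|---|---|---|
| 13–17 | 155 | 42 | 9 |
| 18–24 | 53 | 239 | 35 |
| 25 or older | 11 | 17 | 74 |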
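Per-class precision, recall, and F1 can be recomputed directly from a confusion matrix. The sketch below assumes that rows are actual classes and columns are predicted classes, ordered youth, young adults, adults. That orientation is an assumption, since the extracted matrix carries no labels, and the names `MATRIX`, `LABELS`, and `per_class_metrics` are illustrative only.

```python
# Confusion matrix copied from the text.
# ASSUMPTION: rows = actual class, columns = predicted class,
# ordered youth (13-17), young adults (18-24), adults (25 or older).
LABELS = ["youth", "young_adults", "adults"]
MATRIX = [
    [155, 42, 9],
    [53, 239, 35],
    [11, 17, 74],
]


def per_class_metrics(matrix, labels):
    """Return {label: (precision, recall, f1)} for a square confusion matrix."""
    metrics = {}
    for k, label in enumerate(labels):
        true_pos = matrix[k][k]
        predicted_k = sum(row[k] for row in matrix)  # column total
        actual_k = sum(matrix[k])                    # row total
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


if __name__ == "__main__":
    for label, (p, r, f) in per_class_metrics(MATRIX, LABELS).items():
        print(f"{label:13s} precision={p:.0%} recall={r:.0%} F1={f:.0%}")
```

Under that assumption, the rounded values (71/75/73, 80/73/76, and 63/73/67 percent) agree with the "Tweet Language Use and Twitter Handle Metadata" columns of the validation table above, which suggests the matrix belongs to that model.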
Top predictive features for each age group in tweet language use and Twitter handle metadata models.
| Predictive Features | Youth | | Young Adults | | Adults | |
|---|---|---|---|---|---|---|
| | Cohen’s d | Direction of Association | Cohen’s d | Direction of Association | Cohen’s d | Direction of Association |
| Age of Twitter Account | 0.336 | − | 0.193 | + | ||
| Count of the term “school” | 0.210 | 0.194 | − | |||
| Count of WWBP words positively correlated with 23–29 age category, in tweet | 0.222 | − | ||||
| Count of the stems of “ili” (e.g. “I like”) | 0.186 | − | ||||
| Count of the term “college” | 0.236 | − | 0.232 | |||
| Percent of WWBP words negatively correlated with 19–22 age category, in tweet | 0.171 | 0.331 | ||||
| Count of stems of 18 | 0.210 | |||||
| Count of stems of 21 | 0.209 | |||||
| Count of the term “drunkard” | 0.194 | |||||
| Count of the term “semester” | 0.179 | |||||
| Count of kissyheart emoji | 0.162 | |||||
| Count of smiley emoji | 0.170 | − | ||||
| Count of stems of “via” | 0.172 | + | ||||
| Mean absolute deviation of count of URLs in tweet | 0.174 | + | ||||
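The table reports a Cohen's d for each feature. One common definition divides the difference in group means by the pooled standard deviation. The source does not say which variant was used, so the sketch below is an assumption, and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation.

    ASSUMPTION: pooled-SD variant; the source does not specify the formula.
    A positive value means group_a has the larger mean.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical per-user counts of a term, in-group vs. everyone else.
    in_group = [2, 4, 6]
    out_group = [1, 3, 5]
    d = cohens_d(in_group, out_group)
    print(f"magnitude={abs(d):.3f} direction={'+' if d > 0 else '-'}")
```

The table lists magnitude and direction in separate columns, which corresponds to the absolute value and sign of the quantity computed here.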
a To capture the distributional properties of a user’s tweeting behavior, we created tweet-level features and then calculated descriptive statistics of those features across a user’s tweets. For example, for the “Average Percent Characters in Tweet that are Emoji” feature, we calculated the percentage of characters that are emoji for each tweet and then took the average across all the user’s collected tweets.
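The two-step aggregation described in this note (compute a feature per tweet, then summarize across a user's tweets) can be sketched as follows. The emoji test is a crude code-point range check and the URL pattern is a simple regular expression. Both are assumptions made for illustration, as are the function and key names; the authors' actual feature extraction is not given here.

```python
import re
from statistics import mean

# ASSUMPTION: a rough emoji test covering a few Unicode blocks only.
EMOJI_RANGES = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF)]
# ASSUMPTION: a simple URL matcher, not a full URL parser.
URL_PATTERN = re.compile(r"https?://\S+")


def is_emoji(ch):
    code_point = ord(ch)
    return any(lo <= code_point <= hi for lo, hi in EMOJI_RANGES)


def percent_emoji(tweet):
    """Percentage of characters in one tweet that are emoji."""
    if not tweet:
        return 0.0
    return 100.0 * sum(is_emoji(ch) for ch in tweet) / len(tweet)


def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


def user_features(tweets):
    """Tweet-level features summarized across one user's tweets."""
    emoji_pcts = [percent_emoji(t) for t in tweets]
    url_counts = [len(URL_PATTERN.findall(t)) for t in tweets]
    return {
        "avg_pct_emoji": mean(emoji_pcts),
        "mad_url_count": mean_absolute_deviation(url_counts),
    }
```

Calling `user_features` on the list of a user's collected tweets yields one row of user-level features. Other descriptive statistics can be added to the returned dictionary in the same way.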
b To group common categories of words together, terms were stemmed, that is, reduced to their base form. For example, a stemming algorithm would reduce the words “hunting,” “hunter,” “hunts,” and “hunters” to the stem “hunt.”
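To make the idea of stem counts concrete, here is a deliberately toy suffix stripper. It is not the stemmer the authors used (which is not named here), and real stemming algorithms apply more careful rules that may not collapse all of these forms. The suffix list and the names `toy_stem` and `count_stems` are hypothetical.

```python
from collections import Counter

# ASSUMPTION: a tiny hand-picked suffix list, checked longest first.
SUFFIXES = ("ers", "ing", "er", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def count_stems(tokens):
    """Count how often each stem occurs in a list of tokens."""
    return Counter(toy_stem(token) for token in tokens)


if __name__ == "__main__":
    tokens = ["hunting", "hunter", "hunts", "hunters", "school"]
    print(count_stems(tokens))
```

Grouping by stem means a feature such as "count of the stem" pools every surface form of a word into a single count per user.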