Su Golder, Robin Stevens, Karen O'Connor, Richard James, Graciela Gonzalez-Hernandez.
Abstract
BACKGROUND: A growing amount of health research uses social media data. Critics of social media research often argue that such data may be unrepresentative of the population; however, the suitability of social media data in digital epidemiology is more nuanced. Identifying the demographics of social media users can help establish representativeness.
Keywords: ethnicity; race; social media; twitter
Year: 2022 PMID: 35486433 PMCID: PMC9107046 DOI: 10.2196/35788
Source DB: PubMed Journal: J Med Internet Res ISSN: 1438-8871 Impact factor: 7.076
Databases searched with number of records retrieved.
| Database | Total results, n |
| ACL Anthology | Screened first 50 records from 2 searches |
| ACM Digital Library | 150 |
| CINAHL | 200 |
| Conference Proceedings Citation Index—Science | 84 |
| Conference Proceedings Citation Index—Social Science | 7 |
| Emerging Sources Citation Index | 41 |
| Google Scholar | Screened first 100 records from 2 searches |
| IEEE Xplore | 186 |
| Library and Information Science Abstracts | 120 |
| LISTA | 79 |
| OpenGrey | 0 |
| ProQuest dissertations and theses—United Kingdom and Ireland | 195 |
| PsycINFO | 72 |
| PubMed | 84 |
| Science Citation Index | 56 |
| Social Science Citation Index | 111 |
| Zetoc | 50 |
Figure 1. Flow diagram for included studies.
Top system performance within studies using machine learning or natural language processing (result metrics are reflected here as reported in the original publications).
| Study | Classifier | ML^a model | Features | Results reported | | |
| | | | | Accuracy | Area under curve | |
| Pennacchiotti and Popescu, 2011 | Binary | GBDT^b | Images, text, topics, and sentiment | N/A^c | 0.66 | N/A |
| Pennacchiotti and Popescu, 2011 | Binary | GBDT | Images, text, topics, sentiment, and network | N/A | 0.70 | N/A |
| Bergsma et al, 2013 | Binary | SVM^d | Names and name clusters | 0.85 | N/A | N/A |
| Ardehaly and Culotta, 2017 | Binary | DLLP^e | Text and images | N/A | 0.95 (image); 0.92 (text) | N/A |
| Volkova and Bachrach, 2018 | Binary | LR^f | Text, sentiment, and emotion | N/A | N/A | 0.97 |
| Wood-Doughty et al, 2018 | Binary | CNN^g | Name | 0.73 | 0.72 | N/A |
| Saravanan, 2017 | Ternary | CNN | Text | NR^h | NR | NR |
| Ardehaly and Culotta, 2017 | Ternary | DLLP | Text and images | N/A | 0.84 (image); 0.83 (text) | N/A |
| Gunarathne et al, 2019 | Ternary | CNN | Text | N/A | 0.88 | N/A |
| Wood-Doughty et al, 2018 | Ternary | CNN | Name | 0.62 | 0.43 | N/A |
| Culotta et al, 2016 | Quaternary | Regression | Network and text | N/A | 0.86 | N/A |
| Chen et al, 2015 | Quaternary | SVM | n-grams, topics, self-declarations, and image | 0.79 | 0.79 | 0.72 |
| Markson, 2017 | Quaternary | CNN | Synonym expansion and topics | 0.76 | N/A | N/A |
| Wang et al, 2016 | Quaternary | CNN | Images | 0.84 | N/A | N/A |
| Xu et al, 2016 | Quaternary | SVM | Synonym expansion and topics | 0.76 | N/A | N/A |
| Ardehaly and Culotta, 2015 | Quaternary | Multinomial logistic regression | Census, name, network, and tweet language | 0.83 | N/A | N/A |
| Ardehaly, 2014 | Quaternary | LR | Census and image tweets | 0.82 | 0.81 | N/A |
| Barbera, 2016 | Quaternary | LR with EN^i | Tweets, emojis, and network | 0.81 | N/A | N/A |
| Wood-Doughty et al, 2020 | Quaternary | CNN | Name, profile metadata, and text | 0.83 | 0.46 | N/A |
| Preotiuc-Pietro and Ungar, 2018 | Quaternary | LR with EN | Text, topics, sentiment, part-of-speech tagging, name, perceived race labels, and ensemble | N/A | N/A | 0.88 (African American), 0.78 (Latino), 0.83 (Asian), and 0.83 (White) |
| Mueller et al, 2021 | Quaternary | CNN | Text and accounts followed | N/A | 0.25 (Asian), 0.63 (African American or Black), 0.28 (Hispanic), and 0.90 (White) | N/A |
| Bergsma et al, 2013 | Multinomial (>4) | SVM | Name and name clusters | 0.81 | N/A | N/A |
| Nguyen et al, 2018 | Multinomial (>4) | Neural network | Images | 0.53 | N/A | N/A |
^a ML: machine learning.
^b GBDT: gradient-boosted decision tree.
^c N/A: not applicable.
^d SVM: support vector machine.
^e DLLP: deep learning from label proportions.
^f LR: logistic regression.
^g CNN: convolutional neural network.
^h NR: not reported.
^i EN: elastic net.
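The table above reports accuracy and area under curve as given in the original publications. As a reminder of what these two metrics measure, the following is a minimal sketch in plain Python for a binary classifier. The function names (`accuracy`, `binary_auc`), the toy labels and scores, and the 0.5 decision threshold are all hypothetical illustrations and are not drawn from any of the included studies.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_auc(y_true, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative example,
    with ties counted as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and classifier scores for six users.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

# Assumed decision threshold of 0.5 to turn scores into predicted labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print("accuracy:", accuracy(y_true, y_pred))
print("AUC:", binary_auc(y_true, scores))
```

Accuracy depends on the chosen threshold and on class balance, whereas the area under curve summarizes how well the scores rank positives above negatives across all thresholds, which is one reason the two figures for a single study are not directly interchangeable.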
Figure 2. Summary of our best practice recommendations.