Yuan-Chi Yang1, Mohammed Ali Al-Garadi1, Jennifer S Love2, Jeanmarie Perrone3, Abeed Sarker1,4.
Abstract
OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media users' demographic information (eg, gender) is often not explicitly known from profiles. Here, we present an automatic gender classification system for social media and illustrate how gender information can be incorporated into a social media-based health-related study.
Keywords: Twitter; gender detection; machine learning; natural language processing; toxicovigilance; user profiling
Year: 2021 PMID: 34169232 PMCID: PMC8220305 DOI: 10.1093/jamiaopen/ooab042
Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531
Figure 1. Gender classification pipeline, from user profile to gender label.
Data distributions for the training, validation and test sets from Dataset-1
| Dataset | F | M | Total |
|---|---|---|---|
| Training (Dataset-1) | 21 521 | 18 788 | 40 309 |
| Validation (Dataset-1) | 7133 | 6303 | 13 436 |
| Test (Dataset-1) | 7158 | 6278 | 13 436 |
| Total (Dataset-1) | 35 812 | 31 369 | 67 181 |
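As a quick consistency check on the table above, the following sketch recomputes each split's share of Dataset-1 and its female share from the reported counts. The variable and function names are illustrative and do not come from the source.

```python
# Counts (F, M) copied from the Dataset-1 table above.
SPLITS = {
    "Training": (21521, 18788),
    "Validation": (7133, 6303),
    "Test": (7158, 6278),
}


def summarize(splits):
    """Return {split: (share_of_all_users, female_share_within_split)}."""
    grand_total = sum(f + m for f, m in splits.values())
    summary = {}
    for name, (f, m) in splits.items():
        total = f + m
        summary[name] = (total / grand_total, f / total)
    return summary


if __name__ == "__main__":
    for name, (share, female) in summarize(SPLITS).items():
        print(f"{name}: {share:.3f} of all users, {female:.3f} female")
```

Under these counts the splits are close to 60/20/20, with a female share of roughly 53% in each split.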
Test results (on Dataset-1) for classifiers and meta-classifiers
| Feature/method | F1 score, F (95% CI) (0.XXX) | F1 score, M (95% CI) (0.XXX) | Coverage (%) | Accuracy (95% CI) (%) | AUROC |
|---|---|---|---|---|---|
| Name/DG | 802 (795–810) | 802 (795–809) | 98.1 | 80.2 (79.5–80.9) | 0.878 |
| Screen name/SVM | 748 (740–756) | 719 (710–728) | 100.0 | 73.4 (72.7–74.2) | 0.817 |
| Description/SVM | 728 (719–736) | 693 (683–703) | 88.9 | 71.1 (70.3–72.0) | 0.796 |
| Description/BLSTM | 724 (716–733) | 665 (655–675) | 88.9 | 69.7 (68.9–70.6) | 0.781 |
| Description/BERT | 790 (782–797) | 757 (748–766) | 88.9 | 77.4 (76.7–78.2) | 0.873 |
| Tweets/SVM | 893 (888–898) | 879 (872–885) | 100.0 | 88.6 (88.1–89.2) | 0.933 |
| Tweets/Lexicon | 874 (868–880) | 856 (849–862) | 100.0 | 86.5 (86.0–87.1) | 0.917 |
| Profile/M3 | 903 (897–908) | 898 (893–903) | 100.0 | 90.0 (89.5–90.5) | 0.968 |
| Colors/SVM | 671 (662–682) | 649 (640–659) | 100.0 | 66.1 (65.3–66.9) | 0.712 |
| Colors/RF | 660 (651–669) | 640 (630–649) | 100.0 | 65.0 (64.2–65.8) | 0.692 |
| Meta-1 | | | 100.0 | | |
| Meta-2 | | | 100.0 | | |
| Meta-3 | | | 100.0 | | |
| Meta-4 | 930 (925–934) | 920 (915–925) | 100.0 | 92.5 (92.1–92.9) | 0.953 |
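The columns in the table above (per-class F1, coverage, accuracy) can be illustrated with a minimal sketch. It assumes that a classifier may abstain (represented here as `None`), that coverage is the fraction of users receiving a prediction, and that accuracy and F1 are computed over covered users only. The source does not state these conventions or how the confidence intervals were obtained, so the function below is hypothetical and not the authors' evaluation code.

```python
from typing import Dict, Optional, Sequence


def evaluate(y_true: Sequence[str], y_pred: Sequence[Optional[str]]) -> Dict[str, float]:
    """Compute coverage, accuracy and per-class F1 for labels 'F' and 'M'.

    A prediction of None means the classifier abstained for that user.
    """
    covered = [(t, p) for t, p in zip(y_true, y_pred) if p is not None]
    coverage = len(covered) / len(y_true) if y_true else 0.0
    accuracy = sum(t == p for t, p in covered) / len(covered) if covered else 0.0

    result = {"coverage": coverage, "accuracy": accuracy}
    for label in ("F", "M"):
        tp = sum(t == label and p == label for t, p in covered)
        fp = sum(t != label and p == label for t, p in covered)
        fn = sum(t == label and p != label for t, p in covered)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        result[f"f1_{label}"] = 2 * tp / denom if denom else 0.0
    return result


if __name__ == "__main__":
    truth = ["F", "F", "M", "M", "F"]
    preds = ["F", "M", "M", None, "F"]  # toy data, one abstention
    print(evaluate(truth, preds))
```

On the toy data, four of five users are covered, so coverage is 0.8 while accuracy is computed over those four users.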
Test results (on Dataset-2, for users who have revealed gender information on Facebook) for classifiers and meta-classifiers
| Feature/method | F1 score, F (95% CI) (0.XXX) | F1 score, M (95% CI) (0.XXX) | Coverage (%) | Accuracy (95% CI) (%) | AUROC |
|---|---|---|---|---|---|
| Name/DG | 717 (655–773) | 833 (796–867) | 94.9 | 79.0 (74.9–82.9) | 0.844 |
| Screen name/SVM | 692 (634–745) | 776 (732–816) | 100.0 | 74.0 (69.7–78.2) | 0.838 |
| Description/BERT | 674 (616–727) | 715 (663–762) | 94.9 | 69.6 (65.0–74.2) | 0.839 |
| Tweets/SVM | 821 (772–865) | 894 (864–921) | 100.0 | 86.7 (83.3–89.8) | 0.916 |
| Tweets/Lexicon | 770 (717–818) | 846 (810–879) | 100.0 | 81.6 (77.7–85.2) | 0.889 |
| Profile/M3 | 894 (855–928) | 936 (913–956) | 100.0 | 92.0 (89.3–94.4) | |
| Meta-1 | | | 100.0 | | |
| Meta-4 | 885 (846–919) | 926 (902–949) | 100.0 | 91.0 (88.1–93.7) | 0.955 |
Gender distributions for selected medication categories (inferred by the classifier/NSDUH 2018/overdose EDV 2016)
| Medication category | Number of users (geo-located in the US) | Proportion male/female, inferred (geo-located in the US) | Proportion male/female, NSDUH 2018 | Proportion male/female, overdose EDV 2016 |
|---|---|---|---|---|
| Tranquilizers | 62 471 (20 863) | 0.499/0.501 (0.490/0.510) | 0.499/0.501 | — |
| Stimulants | 93 598 (36 323) | 0.504/0.496ᵃ (0.514/0.486ᵃ) | 0.551/0.449 | — |
| Pain relievers | 38 299 (12 077) | 0.621/0.379ᵃ (0.635/0.365ᵃ) | 0.518/0.482ᵇ | 0.630/0.370 |
ᵃ The difference between this female proportion and the corresponding female proportion in NSDUH 2018 is statistically significant.
ᵇ According to the Glossary in Appendix A of NSDUH 2018 (35), “Although the specific pain relievers listed above are classified as opioids, use or misuse of any other pain reliever could include prescription pain relievers that are not opioids. For misuse in the past year or past month, estimates could include small numbers of respondents whose only misuse involved other drugs that are not opioids.”
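The significance footnote to the table above can be illustrated with a one-sample z-test for a proportion. This is a sketch under stated assumptions: the source does not say which test was used, the NSDUH 2018 proportion is treated as a fixed reference value (ignoring its own survey uncertainty), and the second value in each pair is read as the female proportion, as the column header suggests. The function name is hypothetical.

```python
import math
from typing import Tuple


def one_sample_proportion_z(p_hat: float, p_ref: float, n: int) -> Tuple[float, float]:
    """Two-sided z-test of an observed proportion against a fixed reference.

    p_hat: observed (inferred) proportion in the sample
    p_ref: reference proportion, assumed known exactly
    n:     number of users in the sample
    """
    se = math.sqrt(p_ref * (1.0 - p_ref) / n)  # standard error under H0
    z = (p_hat - p_ref) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return z, p_value


if __name__ == "__main__":
    # Values taken from the table: inferred female proportion vs NSDUH 2018.
    print("Stimulants:", one_sample_proportion_z(0.496, 0.449, 93598))
    print("Tranquilizers:", one_sample_proportion_z(0.501, 0.501, 62471))
```

With these inputs the stimulant difference is far beyond conventional significance thresholds, while the tranquilizer proportions are identical and give a z statistic of zero, which agrees with where the table places the significance marker.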