Automatic gender detection in Twitter profiles for health-related cohort studies.

Yuan-Chi Yang¹, Mohammed Ali Al-Garadi¹, Jennifer S Love², Jeanmarie Perrone³, Abeed Sarker¹,⁴.

Abstract

OBJECTIVE: Biomedical research involving social media data is gradually moving from population-level to targeted, cohort-level data analysis. Though crucial for biomedical studies, social media users' demographic information (eg, gender) is often not explicitly known from profiles. Here, we present an automatic gender classification system for social media and illustrate how gender information can be incorporated into a social media-based health-related study.
MATERIALS AND METHODS: We used a large Twitter dataset composed of public, gender-labeled users (Dataset-1) for training and evaluating the gender detection pipeline. We experimented with machine learning algorithms including support vector machines (SVMs) and deep-learning models, and public packages including M3. We considered users' information including profile and tweets for classification. We also developed a meta-classifier ensemble that strategically uses the predicted scores from the classifiers. We then applied the best-performing pipeline to Twitter users who have self-reported nonmedical use of prescription medications (Dataset-2) to assess the system's utility.
RESULTS AND DISCUSSION: We collected 67 181 and 176 683 users for Dataset-1 and Dataset-2, respectively. A meta-classifier involving SVM and M3 performed the best (Dataset-1 accuracy: 94.4% [95% confidence interval: 94.0-94.8%]; Dataset-2: 94.4% [95% confidence interval: 92.0-96.6%]). Including automatically classified information in the analyses of Dataset-2 revealed gender-specific trends: the proportions of females closely resemble data from the National Survey on Drug Use and Health 2018 (tranquilizers: 0.50 vs 0.50; stimulants: 0.50 vs 0.45) and the opioid overdose-related emergency department visits estimated from the Nationwide Emergency Department Sample (pain relievers: 0.38 vs 0.37).
CONCLUSION: Our publicly available, automated gender detection pipeline may aid cohort-specific social media data analyses (https://bitbucket.org/sarkerlab/gender-detection-for-public).

Keywords: Twitter; gender detection; machine learning; natural language processing; toxicovigilance; user profiling

Year: 2021 PMID: 34169232 PMCID: PMC8220305 DOI: 10.1093/jamiaopen/ooab042

Source DB: PubMed Journal: JAMIA Open ISSN: 2574-2531

INTRODUCTION

Social media data are increasingly being used for health-related research. Users often discuss personal experiences or opinions regarding a variety of health topics, such as health services or medications. Such information can be categorized, aggregated, and analyzed to obtain population-level insights at low cost and in close to real time. It has thus been used as a resource for population health tasks such as influenza surveillance, pharmacovigilance, and toxicovigilance. While early research mostly attempted to conduct observational studies on entire populations (eg, Twitter users discussing flu), some recent studies have moved to targeted cohorts (eg, pregnant women, people in certain geo-locations, cancer patients, and people suffering from mental health issues). Demographic information about such cohorts can help researchers investigate what roles demographics play in a given study, understand whether social media is biased toward specific cohorts, and explicitly address these biases. Due to the importance of explicitly considering biological sex or gender in health research, funding agencies, including the National Institutes of Health, have emphasized the necessity of describing the sex/gender information of the cohorts included in research studies (eg, through inclusion of women). This, however, presents a challenge for social media-based studies because the demographic information of the users is often not explicitly known. One solution is to infer the demographic information from the users’ metadata. In the past two decades, researchers have developed various automatic methods for characterizing users.
Taking gender detection on Twitter as an example, researchers have investigated classification schemes based on the users’ (screen) names, profile descriptions, tweets, profile colors, and even images, with machine learning algorithms such as support vector machine (SVM), Naive Bayes, Decision Tree, Deep Neural Network, and Bidirectional Encoder Representations from Transformers (BERT). Some have made their pipelines publicly available, and these have since been applied to social media mining tasks. For example, Sap et al released a lexicon for gender and age detection, which was applied in mental health research. Knowles et al released a package named Demographer to infer gender based on users’ first names, which was later employed to infer gender in studies of influenza vaccination and mental health. Wang et al also released a multimodal deep learning system (M3) to infer gender based on users’ profile information, including pictures, (screen) names, and descriptions. Though these existing pipelines can be directly applied to biomedical tasks, there is still room for improvement, particularly for Twitter data. First, none of these pipelines used all four of the users’ textual attributes: names, screen names, descriptions, and tweets. A pipeline capable of incorporating these four attributes, or more, may therefore improve upon these models. Second, these pipelines have not been validated on the same data, making it impossible to compare their performances directly. Third, to the best of our knowledge, these pipelines were developed based on general users and have not been tested on gender-labeled, domain-specific datasets. Benchmarking the performance variations due to domain change can inform researchers about the applicability of these pipelines to their specific tasks.
In this work, we aimed to develop a high-accuracy, automatic gender classification system and to evaluate its performance and utility on a domain-specific dataset. In the following sections, we first describe our experiments with various unimodal and multimodal strategies and existing pipelines, and compare their performances on a unified platform. We then discuss the benchmarking of the best strategies on our domain-specific (Toxicovigilance) dataset, consisting of a Twitter cohort of self-reported nonmedical consumers of prescription medications (PMs). The benchmarking involves evaluating performance scores on an annotated subset. To illustrate the utility of this pipeline, we applied the best-performing approach to compare the inferred gender proportions of a Twitter cohort with traditional, trusted sources. The source code for the gender detection experiments described will be made open source (https://bitbucket.org/sarkerlab/gender-detection-for-public).

LAY SUMMARY

To perform biomedical research using social media data on a targeted cohort, the user’s demographic information (eg, gender) is typically required. However, the information is often not explicitly known from the user profile. One solution is to infer the information from the user’s public data via natural language processing and machine-learning techniques. In this work, we focused on estimating the user’s gender and developed a highly accurate pipeline. We then applied the pipeline to a Toxicovigilance cohort of Twitter users who have self-reported misuse of prescription medications (PMs), including tranquilizers, stimulants, and opioids. We found that the pipeline performs with high accuracy on this dataset. Additionally, the inferred gender proportions of those users are consistent with traditional surveys, including the National Survey on Drug Use and Health (NSDUH) 2018 by the Substance Abuse and Mental Health Services Administration and the estimated overdose-related Emergency Department visits in 2016 from the Nationwide Emergency Department Sample. The results support that social media data can be harnessed as a complementary source to traditional surveys and can be used to understand the demographics of PM misuse in the United States. Our gender detection pipeline will be made publicly available to ensure transparency and support community-driven development.

MATERIALS AND METHODS

This study was approved by the Emory University institutional review board (IRB00114235).

Gender detection pipeline development

Data collection

We collected gender-labeled datasets for general Twitter users, released by previous work. The data from Liu and Ruths consist of 12 681 users with binary annotations obtained via crowdsourcing through Amazon Mechanical Turk. Each instance was coded by three annotators and a label was accepted only if all three annotators agreed. The data from Volkova et al consist of 1 000 000 tweets, randomly sampled from the data in Burger et al, which are labeled using users’ self-specified genders on Facebook or MySpace profiles linked to their Twitter accounts. Both datasets provide the users’ IDs and gender labels. Our focus is to develop the informatics infrastructure to detect gender as Twitter users self-identify on the social media platform, and we consider the two annotation methods to fall within this definition. We combined the two datasets and extracted users’ publicly available data using the Twitter API, including profile metadata, such as handle names, descriptions, and profile colors, as well as the users’ timelines (only English tweets were collected and retweets were excluded; users who had no original English tweets were dropped). We call this dataset Dataset-1 and split it into training (60%), validation (20%), and test (20%) sets for pipeline development.
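The 60/20/20 split described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_users`, the fixed seed, the plain (non-stratified) shuffle, and the rounding rule are all assumptions. With 67 181 users, this rounding happens to give the same set sizes as Table 1.

```python
import random


def split_users(user_ids, seed=0):
    """Shuffle user IDs and split them 60/20/20 into train/validation/test.

    Hypothetical helper: the source does not say how the split was drawn.
    """
    ids = list(user_ids)
    random.Random(seed).shuffle(ids)       # reproducible shuffle
    n_train = round(0.6 * len(ids))        # 60% for training
    n_val = (len(ids) - n_train) // 2      # half of the remainder for validation
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]           # whatever is left goes to the test set
    return train, val, test


# Integer placeholders stand in for real Twitter user IDs.
train, val, test = split_users(range(67181))
print(len(train), len(val), len(test))
```

A stratified split by gender label would be a reasonable alternative if class balance across the three sets matters.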

Classification

We first developed classifiers based on single attributes (ie, unimodal), including names and screen names, descriptions, tweets, and profile colors. We then experimented with building meta-classifiers based on the predicted scores from these classifiers (ie, multimodal). The flowchart in Figure 1 illustrates our processing pipeline. In the experiments, we considered machine learning algorithms including SVMs, Random Forest (RF), Bi-directional Long Short-Term Memory (BLSTM), and BERT, as well as existing resources including the lexica released by Sap et al, the Demographer system by Knowles et al, and the M3 system (without profile picture) by Wang et al. Below we briefly outline each experiment, with further details in Supplementary Table S1.

Figure 1.

Gender classification pipeline, from user profile to gender label.

Name and screen name

We applied the Demographer (DG) package to the users’ names. DG attempts to identify gender using character n-grams of a user’s first name and is trained on the list of given names from US Social Security data. Following a similar approach, we trained an SVM classifier for screen names using character n-grams.
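To make the character n-gram idea concrete, the sketch below counts n-grams of a screen name. It is only an illustration: the function name `char_ngrams`, the boundary markers `^` and `$`, the lowercasing, and the n-gram range are assumptions, since the source does not specify these details. The resulting counts would serve as the sparse feature vector passed to a classifier such as an SVM.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of a lowercased string.

    '^' and '$' mark the start and end of the name so that prefixes and
    suffixes become distinct features (an assumed design choice).
    """
    padded = "^" + text.lower() + "$"
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


# Hypothetical screen name, bigrams and trigrams only.
features = char_ngrams("Anna_92", n_min=2, n_max=3)
print(features.most_common(5))
```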

Description

To classify gender using a user’s description, we experimented with SVM, BLSTM, and BERT, approaches suited to free-text data. BERT is a transformer-based model that produces contextual vector representations of words and achieves state-of-the-art performance on many tasks. Many models with similar architectures have since been implemented and released. Each description was pre-processed by lowercasing and anonymizing URLs and user names. For SVM, the features are the normalized term frequencies of unigrams. For BLSTM and BERT, each word or character sequence was replaced with a dense vector, and the vectors were fed into the algorithms for training.
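The pre-processing and unigram features described above might look like the following sketch. The regular expressions, the placeholder tokens `<url>` and `<user>`, whitespace tokenization, and normalization by total token count are all assumptions; the source only states that text was lowercased, URLs and user names were anonymized, and unigram term frequencies were normalized.

```python
import re
from collections import Counter

# Assumed patterns; the source does not specify how URLs and mentions were matched.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def preprocess(text):
    """Lowercase the text and replace URLs and @-mentions with placeholders."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<user>", text)
    return text


def normalized_tf(text):
    """Return unigram term frequencies divided by the total token count."""
    tokens = preprocess(text).split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


# Hypothetical profile description.
tf = normalized_tf("Mom of two. Coffee lover, follow @JaneDoe https://example.com coffee")
print(tf["coffee"])
```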

Tweets

Focusing on users who have a substantial number of tweets, we selected users in the training data with at least 100 tweets and merged all collected tweets as the training texts for experiments on SVMs. The pre-processing is the same as that for the SVM classifier using description. The regularization parameter was optimized according to the validation accuracy.

Colors

We utilized 5 features associated with profile colors, including background color, link color, sidebar border color, sidebar fill color, and text color. Each profile color is represented using RGB values, each ranging from 0 to 255. We collapsed each value into 4 groups, yielding 64 groups for each color. We then experimented with SVM and RF.
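Collapsing each 8-bit channel into 4 groups gives 4 × 4 × 4 = 64 groups per color. The sketch below shows one way to do this. It assumes equal-width bins and colors supplied as 6-digit hexadecimal strings; both are assumptions, and the function names are hypothetical.

```python
def bin_channel(value, n_bins=4):
    """Map an 8-bit channel value (0-255) to one of n_bins equal-width bins."""
    return value * n_bins // 256


def color_group(hex_color, n_bins=4):
    """Map a hex color such as '0084B4' to a group index in [0, n_bins**3)."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    rb, gb, bb = (bin_channel(v, n_bins) for v in (r, g, b))
    return (rb * n_bins + gb) * n_bins + bb  # combine the three bin indices


print(color_group("0084B4"))
```

Each of the 5 profile colors would then contribute one categorical feature with 64 possible values.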

Meta-classifier

We experimented with building SVM models on the predicted scores from 4 different combinations of the classifiers:
meta-1: SVM on tweets and M3.
meta-2: SVM on tweets, M3, DG on names, and BERT on description.
meta-3: SVM on tweets, M3, and SVM on colors.
meta-4: DG on names, SVM on screen names, BERT on description, and SVM on tweets.
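The stacking idea, in which a second-level model is trained on the base classifiers' predicted scores, can be sketched as follows. Note the substitution: the source uses an SVM as the meta-classifier, whereas this sketch uses a small logistic regression trained by gradient descent so that it runs with the standard library alone. The scores, labels, function names, and hyperparameters are all hypothetical.

```python
import math


def train_meta(scores, labels, lr=0.5, epochs=2000):
    """Fit a logistic-regression meta-classifier on base-classifier scores.

    scores: list of tuples, one predicted score per base classifier (in [0, 1]).
    labels: 1 for female, 0 for male. Scores are centered at 0.5.
    """
    n_feat = len(scores[0])
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(scores, labels):
            z = b + sum(wi * (xi - 0.5) for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad_w[j] += (y - p) * (xj - 0.5)
            grad_b += y - p
        # Gradient ascent on the mean log-likelihood.
        w = [wi + lr * g / len(scores) for wi, g in zip(w, grad_w)]
        b += lr * grad_b / len(scores)
    return w, b


def predict_meta(model, x):
    """Return 1 (female) or 0 (male) for one tuple of base-classifier scores."""
    w, b = model
    z = b + sum(wi * (xi - 0.5) for wi, xi in zip(w, x))
    return 1 if z > 0 else 0


# Hypothetical (tweets-SVM score, M3 score) pairs, as in the meta-1 combination.
train_scores = [(0.9, 0.8), (0.8, 0.95), (0.7, 0.9), (0.2, 0.1), (0.3, 0.25), (0.1, 0.3)]
train_labels = [1, 1, 1, 0, 0, 0]
model = train_meta(train_scores, train_labels)
print(predict_meta(model, (0.85, 0.9)), predict_meta(model, (0.15, 0.2)))
```

In practice the meta-classifier should be fitted on scores from data the base classifiers were not trained on, so that it does not learn from overconfident scores.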

Classification performance evaluation

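The classification performance evaluation is based on class-specific precision, recall, and F1 score, as well as accuracy (male and female combined). For a given class, let TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, and let TN denote the number of true negatives. These metrics are defined as follows:

\[ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \]
\[ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \]

where F1 score is the harmonic mean of precision and recall. We also calculate the area under the receiver operating characteristic curve (AUROC). The receiver operating characteristic curve presents the relationship between the true positive rate and the false positive rate under different thresholds, and the AUROC summarizes the performance across thresholds. The AUROC ranges from 0 to 1, with 1 being the best.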
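As an illustration of these metrics, the sketch below computes class-specific precision, recall, and F1, overall accuracy, and AUROC for a toy example. The labels and scores are hypothetical. AUROC is computed here through its rank interpretation: the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half.

```python
def class_metrics(y_true, y_pred, positive):
    """Return (precision, recall, F1) for the class treated as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores, positive):
    """Pairwise AUROC: P(score of a positive > score of a negative), ties = 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels, predictions, and scores for the female class.
y_true = ["F", "F", "F", "M", "M"]
y_pred = ["F", "F", "M", "M", "F"]
f_scores = [0.9, 0.8, 0.4, 0.3, 0.6]

precision_f, recall_f, f1_f = class_metrics(y_true, y_pred, "F")
acc = accuracy(y_true, y_pred)
auc = auroc(y_true, f_scores, "F")
print(precision_f, recall_f, f1_f, acc, auc)
```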

Coverage

Some users have missing profile information such as name or description or use non-English characters in the name field. This may make the inference using the specific information impossible. Therefore, for each classifier, we show the percentage of users whose genders can be inferred from the relevant profile information (as “coverage”) while the performance is evaluated using this subset of users.

Application on Toxicovigilance dataset

To conduct Toxicovigilance research using social media, we collected publicly available English tweets mentioning over 20 PMs that have the potential for nonmedical use or misuse. The list of PMs can be found in Supplementary Table S2. In our prior work, we developed annotation guidelines with our domain expert (JP) and annotated a subset consisting of 16 433 tweets. A brief description of the annotation guideline and example tweets are given in Supplementary Tables S3 and S4, respectively. Based on this set, we then developed automatic classification schemes to detect whether tweets describe self-reported nonmedical use (referred to as “misuse tweets” in the following). In this work, we used this classifier to classify a dataset collected from March 6, 2018 to January 14, 2020 and extracted the users’ publicly available data. We refer to this set as Dataset-2. We also grouped users whose misuse tweets could be geo-located in the United States as a subset (Dataset-2-US). Since Dataset-2 did not have manual binary annotations, we relied on a secondary source to identify a user’s gender whenever possible: the self-identified gender information on the user’s linked public Facebook profile. These users make up the test set of Dataset-2 for benchmarking.

Classification performance

We applied the best-performing classification strategies to the test set of Dataset-2 to evaluate their performances. This not only benchmarks how those pipelines perform on the Toxicovigilance dataset (Dataset-2) but also provides a measure of the transferability of our pipelines across research problems.

Gender distribution inference

To assess the utility of our cohort characterization pipeline on a health surveillance-related task, we applied the best-performing classification pipeline to Dataset-2 (and Dataset-2-US) and analyzed the gender distributions of the users who had self-reported misuse/abuse of one of three abuse-prone PM categories: stimulants (eg, Adderall®), which can increase alertness, attention, and energy and are mostly prescribed to treat Attention Deficit Hyperactivity Disorder; tranquilizers (eg, alprazolam/Xanax®), which slow brain activity and are mostly used to treat anxiety; and pain relievers (eg, oxycodone/OxyContin®), specifically those containing opioids. We then compared the distributions with metrics from the 2018 National Survey on Drug Use and Health (NSDUH), as well as the overdose-related Emergency Department Visits (EDV) in 2016 from the Nationwide Emergency Department Sample (NEDS). The details of the calculation are given in the Supplementary Materials. We performed Pearson’s Chi-squared test on contingency tables to determine whether the differences in the proportions of females inferred from the different sources (Twitter vs survey data) are statistically significant, defined as P-value < 0.05.
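A Pearson Chi-squared test on a 2 × 2 contingency table (source by gender) can be sketched as below. The counts are hypothetical, and the sketch applies no continuity correction; the source does not state whether one was used. For one degree of freedom, the P-value equals erfc(√(χ²/2)), which lets the example rely on the standard library only.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson Chi-squared test for the table [[a, b], [c, d]].

    Rows are data sources (eg, Twitter vs survey), columns are female/male counts.
    Returns (statistic, p_value) with 1 degree of freedom, no continuity correction.
    """
    observed = ((a, b), (c, d))
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))  # survival function of chi-squared, df = 1
    return stat, p_value


# Hypothetical counts: (female, male) from two sources.
stat, p_value = chi2_2x2(10, 20, 20, 10)
print(stat, p_value, p_value < 0.05)
```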

RESULTS

Data Collection (Dataset-1)

In total, we were able to retrieve the user data from 67 181 users, consisting of 35 812 (53.3%) females (F) and 31 369 (46.7%) males (M), which is close to the distribution estimated by Burger et al and Heil and Piskorski (55% female and 45% male) but deviates from the distribution estimated by Liu and Ruths (65% female and 35% male). The distribution is presented in Table 1.

Table 1.

Data distributions for the training, validation and test sets from Dataset-1

Dataset	F	M	Total
Training (Dataset-1)	21 521	18 788	40 309
Validation (Dataset-1)	7133	6303	13 436
Test (Dataset-1)	7158	6278	13 436
Total (Dataset-1)	35 812	31 369	67 181

The performance (F1-score, accuracy, and AUROC) for each classifier and meta-classifier is presented in Table 2, while the precisions and recalls are presented in Supplementary Table S5. We now highlight the main findings.

Table 2.

Test results (on Dataset-1) for classifiers and meta-classifiers

Feature/method	F₁ score, F (95% CI) (0.XXX)	F₁ score, M (95% CI) (0.XXX)	Coverage (%)	Accuracy (95% CI) (%)	AUROC
Name/DG	802 (795–810)	802 (795–809)	98.1	80.2 (79.5–80.9)	0.878
Screen name/SVM	748 (740–756)	719 (710–728)	100.0	73.4 (72.7–74.2)	0.817
Description/SVM	728 (719–736)	693 (683–703)	88.9	71.1 (70.3–72.0)	0.796
Description/BLSTM	724 (716–733)	665 (655–675)	88.9	69.7 (68.9–70.6)	0.781
Description/BERT	790 (782–797)	757 (748–766)	88.9	77.4 (76.7–78.2)	0.873
Tweets/SVM	893 (888–898)	879 (872–885)	100.0	88.6 (88.1–89.2)	0.933
Tweets/Lexicon	874 (868–880)	856 (849–862)	100.0	86.5 (86.0–87.1)	0.917
Profile/M3	903 (897–908)	898 (893–903)	100.0	90.0 (89.5–90.5)	0.968
Colors/SVM	671 (662–682)	649 (640–659)	100.0	66.1 (65.3–66.9)	0.712
Colors/RF	660 (651–669)	640 (630–649)	100.0	65.0 (64.2–65.8)	0.692
Meta-1	947 (944–951)	940 (936–944)	100.0	94.4 (94.0–94.8)	0.965
Meta-2	947 (943–951)	939 (935–944)	100.0	94.3 (93.9–94.7)	0.971
Meta-3	948 (944–952)	941 (937–945)	100.0	94.5 (94.1–94.9)	0.966
Meta-4	930 (925–934)	920 (915–925)	100.0	92.5 (92.1–92.9)	0.953

The best-performing classification schemes were the meta-classifiers using predicted scores from M3 and SVM on tweets (meta-1, meta-2, and meta-3), with accuracies around 94.4%. The second-best scheme was meta-4, with an accuracy of 92.5%. These all performed better than the existing pipelines, including the lexicon (86.5%), Demographer (80.2%), and M3 (90.0%), and the other unimodal classifiers.
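The source does not state how the 95% confidence intervals were computed. As one common option, the sketch below uses the normal-approximation (Wald) interval for a proportion; this choice is an assumption, as is the function name. Applied to the meta-1 accuracy of 94.4% on the 13 436 test users of Dataset-1, it gives roughly 94.0% to 94.8%, in line with the interval reported in Table 2. A bootstrap over test users would be another reasonable choice.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    p_hat: observed proportion (eg, accuracy); n: number of test instances;
    z: standard normal quantile (1.96 for a 95% interval).
    """
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Meta-1 accuracy and test-set size taken from Tables 1 and 2.
low, high = wald_interval(0.944, 13436)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```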

Application on toxicovigilance dataset

Data Collection (Dataset-2)

We were able to retrieve past data from 176 683 users for Dataset-2 (63 306 users for Dataset-2-US). Fewer than 0.3% of the users (412) had publicly available gender information on linked Facebook profile pages. Of these 412 users, 155 (37.6%) were female and 257 (62.4%) were male. The performances of the pipelines on the test set of Dataset-2 are shown in Table 3 (precisions and recalls are in Supplementary Table S6). The best-performing pipeline was meta-1 (accuracy 94.4%). Apart from M3 and meta-1, all the classifiers experienced performance drops, possibly due to domain change. Here, we left out meta-2 and meta-3 because meta-1 provides comparable performance while being simpler. We also note that the accuracy of meta-1 is 95.8% (95% confidence interval 93.3–98.3%) when restricted to users whose misuse tweets could be geo-located in the United States (239 users: 103 females and 136 males).

Table 3.

Test results (on Dataset-2, for users who have revealed gender information on Facebook) for classifiers and meta-classifiers

Feature/method	F₁ score, F (95% CI) (0.XXX)	F₁ score, M (95% CI) (0.XXX)	Coverage (%)	Accuracy (95% CI) (%)	AUROC
Name/DG	717 (655–773)	833 (796–867)	94.9	79.0 (74.9–82.9)	0.844
Screen name/SVM	692 (634–745)	776 (732–816)	100.0	74.0 (69.7–78.2)	0.838
Description/BERT	674 (616–727)	715 (663–762)	94.9	69.6 (65.0–74.2)	0.839
Tweets/SVM	821 (772–865)	894 (864–921)	100.0	86.7 (83.3–89.8)	0.916
Tweets/Lexicon	770 (717–818)	846 (810–879)	100.0	81.6 (77.7–85.2)	0.889
Profile/M3	894 (855–928)	936 (913–956)	100.0	92.0 (89.3–94.4)	0.974
Meta-1	927 (894–954)	955 (935–972)	100.0	94.4 (92.0–96.6)	0.964
Meta-4	885 (846–919)	926 (902–949)	100.0	91.0 (88.1–93.7)	0.955

We applied meta-1 to all the users and analyzed the gender distributions for the users who have self-reported abuse/misuse of tranquilizers, stimulants, or pain relievers (opioids). In Table 4, we report the number of users for each category and the proportions of males and females, as inferred through the classification results (meta-1) and as reported by NSDUH 2018.

Table 4.

Gender distributions for selected medication categories (inferred by the classifier/NSDUH 2018/overdose EDV 2016)

Medication category	Number of users (geo-located in the US)	Male/female proportion, inferred (geo-located in the US)	Male/female proportion, NSDUH 2018	Male/female proportion, overdose EDV 2016
Tranquilizers	62 471 (20 863)	0.499/0.501 (0.490/0.510)	0.499/0.501	—
Stimulants	93 598 (36 323)	0.504/0.496^a (0.514/0.486^a)	0.551/0.449	—
Pain relievers	38 299 (12 077)	0.621/0.379^a (0.635/0.365^a)	0.518/0.482^b	0.630/0.370

^a The female proportion whose difference with the corresponding female proportion in NSDUH 2018 is statistically significant.

^b According to the Appendix A in NSDUH 2018 (35), Glossary, “Although the specific pain relievers listed above are classified as opioids, use or misuse of any other pain reliever could include prescription pain relievers that are not opioids. For misuse in the past year or past month, estimates could include small numbers of respondents whose only misuse involved other drugs that are not opioids.”

Although the users in Dataset-2-US are only roughly one-third of all users in Dataset-2, the gender proportions are close to each other. For tranquilizer and stimulant users, the gender proportions inferred from Twitter are very close to the comparator from NSDUH 2018 (with no statistically significant difference for tranquilizer users). In contrast, the gender proportions of pain reliever users are quite different from the comparator from NSDUH 2018, but much closer to the overdose EDV from NEDS. This suggests that Twitter data could be an indicator of the gender distribution of opioid overdoses and might provide complementary information to better understand the discrepancies between the aforementioned two traditional data sources.

DISCUSSION

Model performance and improvement

Meta-1 performs with high accuracy consistently across Dataset-1 (94.4%) and Dataset-2 (94.4%), better than all the existing pipelines and other classifiers on this platform. This shows that building the gender detection pipeline on the four prominent textual features (name, screen name, description, and tweets) can improve performance over existing approaches. Also, except for meta-1 and M3, all classifiers performed worse on the domain-specific data. This illustrates the importance of benchmarking existing machine learning systems on the targeted cohorts in order to evaluate their applicability to the desired tasks. It also indicates that multimodal strategies could enhance the robustness of the system against unseen data and are thus desirable when building similar user-characterization pipelines. Moving forward, there are two directions for further improving the pipeline: including the targeted cohort in the training data and experimenting with additional classification algorithms/architectures. For example, incorporating multiple features in one system, similar to the M3 system, might further improve the performance. We chose our architecture based on model simplicity and development efficiency. We note that, though potentially complex and time-consuming, it is possible to design and train a model that learns from all the user’s attributes simultaneously and performs well, in contrast to our architecture, which learns this information through transformed knowledge (the predicted scores). We leave this investigation to future work.

Potential pipeline utility

Given that our pipeline performs well across domains and shows promising results on the external task (ie, inferring gender proportions), we believe that this pipeline is well suited for application to medical/health tasks harnessing Twitter data. This pipeline can be used to infer the gender proportions in targeted cohorts and potentially help investigate gender disparities in health topics of interest. For example, social media has been shown to be a potentially excellent resource for conducting large-scale mental health surveillance, and our methods can be used to derive gender-specific insights from such surveillance tasks. Tasks commonly performed using social media data, such as sentiment analyses regarding targeted topics, may also benefit from the gender-specific characterizations enabled by our system. Combined with other recently developed methods, such as geolocation-based characterization of social media chatter, our methods can provide unique insights into a given population of social media users.

Toxicovigilance

Our post-classification analyses of the PM cohort illustrated the utility of automatic gender classification on social media data. The closeness of the gender proportions of tranquilizer and stimulant misusers from Twitter to those from NSDUH 2018 supports the effectiveness of applying social media mining for Toxicovigilance. The inferred gender proportion of pain reliever users, though different from NSDUH 2018, is almost identical to that of the overdose EDV according to the NEDS. This association between self-reports of drug use on Twitter and overdose EDV rates is consistent with our past research, in which we identified significant associations between opioid misuse reports on Twitter and overdose deaths over specific geolocations (eg, counties and sub-states). Social media provides the opportunity to combine multiple types of information, including past tweets, social connections, and geolocation. All the information combined can provide geolocation-, gender-, and time-specific trends from which to extract insights, for example into gender inequalities in medical treatment for substance use disorder. It could also potentially be used to test hypotheses such as the association between mental health issues and PM misuse. The development of sophisticated models for social media mining may even provide broad insights into how nonmedical users of pain relievers become victims of overdose over time, and may even serve as an early warning system. Furthermore, the surveillance can be done in close to real time, a great improvement over the turnaround time for curating overdose statistics and conducting the NSDUH, which may make timely public health intervention possible. For example, the system can provide trends and statistics to local health departments and hospitals to help them prepare for PM misuse prevention and treatment, and can highlight cohorts at higher risk. Note that we do not envision that social media data analytics can replace the traditional resources, but the current state of research indicates that it provides excellent complementary data and the opportunity to provide information/intervention beyond the traditional health services.

Limitations

Our pipeline may inherit the biases and errors introduced by the data and resources used in the pipeline development, leading to significant limitations. The lack of information related to the biases (eg, race, primary language, or location) limits the performance and our ability to address them. For example, the users in Dataset-1, though having at least one English tweet, may not be representative for US Twitter users (eg, by racial distribution). Our pipeline may inherit this undetected bias. Also, using Demographer might introduce bias toward racial majority. Though its effect on the test performance might be detected, we are not able to remedy such biases when the racial coding is absent. Also biases might be introduced during annotation. For example, Dataset-1 and the test set of Dataset-2 may be biased toward those whose gender identities are public. Therefore, though the evaluation provides a measure of the pipeline’s performance against human interpretation, it may not be accurate on users whose genders are difficult to identify. Besides, merging the two individually labeled datasets when constructing Dataset-1, though essential for obtaining acceptable training power and generalizability, could also affect annotation quality by introducing inconsistency. Though the annotation methods adopted in Liu and Ruths and Burger et al both fall within our definition (gender identified on social media), we caution that these methods are different and are not perfect. For example, some users might use different gender identities on different platforms. Crucially, limited by the annotations, our methods are only applicable to populations with binary gender identities. While this covers the majority of the population, our methods do not work for the non-binary gender minorities—a community that has been shown to be particularly vulnerable from a public health perspective. 
Despite this limitation, our proposed system serves as an important stepping stone for future work by establishing strong performance on the simplistic binary classification task, and it already allows us to investigate the inequalities that women experience in medical treatment (eg, for substance use disorder). Including the non-binary population in our model would require collecting data from this population using coding schemes tailored to the differences within it. Obtaining comprehensive demographic information could also help extend our methods to include non-binary users. We are currently in the early stages of exploring how best to address these issues.
There are also significant limitations associated with the analysis of nonmedical PM users. First, not all people living in the United States primarily use English on social media. Limited by our infrastructure, we are currently unable to capture Twitter users who use languages other than English, but extending to other languages, specifically Spanish, is a planned future direction. Second, Twitter users might choose not to include geo information in tweets, which makes geo-locating impossible. For example, Dredze et al estimated that less than 25% of public tweets could be geo-located by their system. We caution that, because of this low proportion, it is not clear whether the tweets geo-located in the United States represent US tweets well. For Dataset-2, we found that roughly 40% of the users’ misuse tweets could be geo-located, about 84% of which were located in the United States, and the gender proportions inferred using Dataset-2 and Dataset-2-US are very similar. Though this suggests that the two might represent similar populations, they may still not be representative of all US Twitter users. Third, the detection of misuse tweets is based on a classification pipeline, so the inference is also limited by this NLP pipeline’s performance. Fourth, the data are limited to Twitter users who are accessible via the Twitter API and should not be considered a random sample of the US population.
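As a rough illustration of how the similarity between gender proportions in a US-located group and the remaining users could be quantified, the sketch below computes the difference between two proportions with a normal-approximation (Wald) 95% interval. It is not the analysis used in this study: the function name `proportion_diff_ci` and all counts are hypothetical, and the two groups are assumed to be disjoint so that they can be treated as independent samples.

```python
import math


def proportion_diff_ci(k1, n1, k2, n2, z=1.96):
    """Return (diff, lower, upper) for p1 - p2 with a Wald interval.

    k1, k2: number of users classified as female in each group.
    n1, n2: total number of users in each group.
    Assumes the two groups are disjoint (independent samples).
    """
    p1, p2 = k1 / n1, k2 / n2
    # Standard error of the difference between two independent proportions
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se


# Hypothetical counts for illustration only, NOT taken from the study:
# group 1 = users geo-located in the US, group 2 = all remaining users.
diff, lower, upper = proportion_diff_ci(k1=500, n1=1000, k2=240, n2=500)
print(f"difference = {diff:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

An interval that is narrow and contains zero would be consistent with similar gender proportions in the two groups, but it would not by itself show that either group is representative of all US Twitter users.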

Ethics

Though we limit this work to observational research on publicly available data and adhere to the Twitter API’s terms of use, there is still concern over the protection of Twitter users and their perceptions. To avoid potential harm to users, we study and report only on aggregated data; no user’s data will be released. We will also make the NLP pipeline publicly available (without the data) to ensure reproducibility and transparency to researchers and Twitter users, and to support community-driven development. Only the scripts for the gender detection pipeline and our best-performing pipeline will be made available with this manuscript.

CONCLUSIONS

As the focus of social media-based health research moves from population-level to cohort-level studies, incorporating user demographic information is becoming more important. In this work, we developed a gender detection pipeline and evaluated its performance on a general dataset and a domain-specific dataset. Our proposed pipeline shows high accuracy even when applied to a health-specific dataset. We further showed that the pipeline can be used to infer the gender distributions of nonmedical PM users, which are consistent with the statistical data reported by NSDUH 2018 (stimulants and tranquilizers) and by NEDS (overdose EDV due to opioids). With the much-needed growing attention on explicitly incorporating demographic information, such as gender and race/ethnicity, in research, it is crucial to be able to conduct aggregated gender-specific analyses of health-related social media data. Our pipeline is readily usable by social media researchers who need to infer users’ gender from their data. We note that, besides gender, other demographic information, such as race or age, is also important for research, and developing pipelines for these user characterization tasks and evaluating them on domain-specific datasets are part of our planned future work.

FUNDING

Research reported in this publication was supported by the National Institute on Drug Abuse (NIDA) of the National Institutes of Health (NIH) under award number R01DA046619. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

AUTHOR CONTRIBUTIONS

YY conducted and directed the machine learning experiments, evaluations, and data analyses, with assistance from MAA and AS. AS provided supervision for the full study. JSL and JP provided toxicology domain expertise for interpreting the results. YY drafted the manuscript, and all authors contributed to the final manuscript.

SUPPLEMENTARY MATERIAL

Supplementary material is available at Journal of the American Medical Informatics Association online.
