Recent Advances in Using Natural Language Processing to Address Public Health Research Questions Using Social Media and Consumer-Generated Data.

Mike Conway¹, Mengke Hu¹, Wendy W Chapman¹.

Abstract

OBJECTIVE: We present a narrative review of recent work on the utilisation of Natural Language Processing (NLP) for the analysis of social media (including online health communities) specifically for public health applications.
METHODS: We conducted a literature review of NLP research that utilised social media or online consumer-generated text for public health applications, focussing on the years 2016 to 2018. Papers were identified in several ways, including PubMed searches and the inspection of recent conference proceedings from the Association of Computational Linguistics (ACL), the Conference on Human Factors in Computing Systems (CHI), and the International AAAI (Association for the Advancement of Artificial Intelligence) Conference on Web and Social Media (ICWSM). Popular data sources included Twitter, Reddit, various online health communities, and Facebook.
RESULTS: In the recent past, communicable diseases (e.g., influenza, dengue) have been the focus of much social media-based NLP health research. However, mental health and substance use and abuse (including the use of tobacco, alcohol, marijuana, and opioids) have been the subject of an increasing volume of research in the 2016 - 2018 period. Associated with this trend, the use of lexicon-based methods remains popular given the availability of psychologically validated lexical resources suitable for mental health and substance abuse research. Finally, we found that in the period under review "modern" machine learning methods (i.e. deep neural-network-based methods), while increasing in popularity, remain less widely used than "classical" machine learning methods. Georg Thieme Verlag KG Stuttgart.

Year: 2019 PMID： 31419834 PMCID： PMC6697505 DOI： 10.1055/s-0039-1677918

Source DB: PubMed Journal: Yearb Med Inform ISSN： 0943-4747

1 Introduction

Social media is a valuable source of data for public health research. It is estimated that 75% of Internet users have read or watched online health information content, and 26% of Internet users have posted (or shared) their personal health information online 1. This large-scale sharing of health information makes social media and Online Health Communities (OHC) a valuable and abundant source of data for addressing public health questions. Social media – including online consumer-generated OHC data – provide a ready source of timely, abundant data that can serve as a valuable resource for several broad types of public health applications, including surveillance, health communication, sentiment analysis, and understanding the natural history of a disease, injury, or health behaviour. Research on utilising social media in conjunction with Natural Language Processing (NLP) for public health applications is a robust and growing area of study, with dedicated meetings 1 and a now well-established research community 2.

Regarding surveillance, the importance of mental health and substance abuse surveillance is increasingly recognised 3. This growth is unsurprising given that it is estimated that mental health and substance abuse constitute approximately 10.4% of the global burden of disease and are the leading cause of years lived with disability, imposing direct and indirect costs on the world economy of around US$2.5 trillion 4. The study of health communication is another area of research that uses social media in conjunction with NLP methods, particularly in the area of understanding and quantifying vaccine hesitancy and refusal. NLP can support public health researchers in identifying common health-related misconceptions, and in turn, devising more effective health communication methods 5. Similarly, sentiment analysis with respect to products relevant to public health (e.g. marijuana-related products, e-cigarettes) and the health behaviours that they facilitate is a further area of research 6. Finally, social media provide a valuable data source for studies focussed on understanding and analysing the natural history of a disease, illness or injury, especially in the context of new and re-emerging diseases and rapid changes in health behaviour 7.

The key changes we have observed since 2016 – apart from the growth in research related to mental health and substance abuse and the increasing interest in “modern” machine learning methods – include a move towards integrating social media analysis with the Electronic Health Record (EHR) 8, in part as a means of obtaining valuable diagnostic “ground truth”. A further shift of note is the increased interest in elucidating ethical issues in the application of NLP (and machine learning more generally) to social media for public health applications, particularly with respect to protecting the rights of those users suffering from potentially stigmatising conditions 9.

Challenges in developing high performance NLP methods for social media have been extensively enumerated, but in summary, major outstanding problems include the use of non-standard grammar, the use of rapidly changing and often non-standard slang terms, spelling variation in informal consumer-generated text, the rapidly changing nature of social media language, and finally the identification (and filtering) of jokes, memes, and advertising 2.

In this paper, we review literature from the period 2016-2018 regarding the application of NLP methods to social media data as a means of addressing public health research questions, focussing specifically on new application areas and the adoption of new methods. A distinctive feature of this review is an emphasis on the increasing volume of research focussed on ethics-related issues involved in using consumer-generated data for public health research.

2 Methods

Our paper selection process involved the following steps. First, we searched PubMed, the Association for Computational Linguistics Anthology, the Proceedings of the Conference on Human Factors in Computing Systems (CHI), and the Proceedings of the International AAAI (Association for the Advancement of Artificial Intelligence) Conference on Web and Social Media (ICWSM) using a variety of social media and NLP-related keywords. Second, we manually inspected Tables of Contents for the Journal of the American Medical Informatics Association, the Journal of Biomedical Informatics, and the Journal of Medical Internet Research. In this first pass, over 1,800 papers were identified. After reviewing abstracts, we reduced the number of papers reviewed to 130. In order to increase the tractability of the reviewing task, we further winnowed the papers to 71. This winnowing process was designed to capture a large swathe of both application areas and methods, and cannot be interpreted as a comment on the quality of research. Only the papers that both demonstrated a clear public health focus and explicitly utilised NLP or text mining methods were retained. Papers that reported on the results of qualitative content analysis or professional standards for health communication using social media without reference to NLP were excluded. Papers that discussed ethical issues pertaining to the use of social media for public health applications and research were retained. References dated outside the period 2016-2018 have been included in order to provide important context. The use of these references does not imply that they form part of the document set defined by the inclusion criteria. The papers reviewed utilise social media from several different sources, including Twitter, Reddit, Weibo, Facebook, and online discussion forums (see Figure 1 and Tables 1 & 2).

Fig. 1

Social media data sources. Note that this list is not exhaustive.

Table 1

Number of papers by topic and data source. Note that papers can occur in several categories

Data Source	Vac a	Comm b	Cancer c	SA d	Pharmaco e	STI f	MH g	Total
Reddit	-	1	-	3	-	1	13	18
Twitter	3	3	1	17	7	1	9	41
Instagram	-	-	-	-	-	-	1	1
Facebook	1	-	-	-	-	-	3	4
OHC h	1	-	2	2	1	-	6	12
Weibo	-	1	-	-	-	-	1	2
WhatsApp	-	-	-	1	-	-	-	1
Youtube	-	-	-	1	-	-	-	1
Yik-Yak	-	-	-	1	-	-	-	1
Tumblr	-	-	-	-	-	-	1	1

a: Vaccination hesitancy and refusal
b: Health communication
c: Cancer
d: Substance Abuse
e: Pharmacovigilance
f: Sexually transmitted infections
g: Mental health
h: Online Health Communities

Table 2

Data Sources and Topics [Note that ethics-related papers are excluded from this table as they are frequently concerned with social media in general.]

Data Source	Vac a	Comm b	Cancer	SA c	Pharmaco d	STI e	MH f
Reddit	-	10	-	11 12 13	-	14	15 16 17 18 19 20 21 22 23 24 25 26 27
Twitter	28 29 30	31 32 33	34	6 , 12 , 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49	50 51 52 53 54 55 56	57	18 , 58 59 60 61 62 63 64 65
Instagram	-	-	-	-	-	-	18
Facebook	66	-	-	-	-	-	8 , 18 , 67
OHC g	5	-	68 , 69	12 , 13	50	-	70 71 72 73 74 75
Weibo	-	32	-	-	-	-	76
Tumblr	-	-	-	-	-	-	18

a: Vaccination hesitancy and refusal
b: Communicable diseases
c: Substance Abuse
d: Pharmacovigilance
e: Sexually transmitted infections
f: Mental health
g: Online Health Communities

The vast majority of the papers reviewed focussed on analysing English language text (68 papers), with two papers focussing on Chinese text 76, 77 and one paper focussing on Japanese text 31. With respect to the geographical location of first authors, most of the articles emerged from North America (55), with Europe (7) and Asia (including Australasia and Turkey) (6) also represented. The reviewed papers can be grouped into several health-related categories, including vaccine hesitancy and refusal, communicable diseases surveillance (including sexually transmitted infections [STIs]), cancer, substance abuse, pharmacovigilance, and mental health (see Table 2). A wide range of methods were used, including “classical” machine learning (e.g., Random Forests, Support Vector Machines [SVM]), “modern” machine learning (e.g., Convolutional Neural Networks [CNN], Recurrent Neural Networks [RNN] 2), and lexicon-based approaches. Among the lexicon-based approaches, the Linguistic Inquiry and Word Count (LIWC) lexicon, a dictionary of words arranged into numerous psychological dimensions, is used extensively in many of the papers reviewed, especially in the areas of mental health and substance abuse 79.
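To make the idea of a lexicon-based approach concrete, the following is a minimal sketch of how text can be scored against a dictionary of words grouped into categories. The lexicon, category names, and function names here are hypothetical and chosen for illustration only; this is not the LIWC dictionary or its scoring procedure.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: category -> set of words.
# This is NOT the LIWC dictionary; the entries are illustrative only.
TOY_LEXICON = {
    "negative_emotion": {"sad", "hopeless", "afraid", "worried"},
    "positive_emotion": {"happy", "glad", "relieved"},
    "health": {"vaccine", "flu", "doctor", "symptoms"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_proportions(text, lexicon=TOY_LEXICON):
    """Return, for each category, the share of tokens found in that category."""
    tokens = tokenize(text)
    if not tokens:
        return {category: 0.0 for category in lexicon}
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return {category: counts[category] / len(tokens) for category in lexicon}


if __name__ == "__main__":
    print(category_proportions("I am worried about the flu vaccine"))
```

Reporting proportions rather than raw counts keeps posts of different lengths comparable, which is one plausible reason such scores are convenient for the short texts typical of social media.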

3 Results

3.1 Vaccine Hesitancy and Refusal

Vaccine hesitancy – defined by the World Health Organisation as referring to a “delay in acceptance or refusal of vaccines despite availability of vaccination services” 3 – has been a growing subject of research during the review period, with NLP methods applied to social media data in an attempt to develop insights into how best to understand and improve health communication as well as quantifying the degree of vaccine hesitancy in a community. Of the five papers reviewed in this section (see Table 3), three utilised Twitter data 28-30, one utilised Facebook data 66, and one further paper utilised data derived from an online health community, in this case mothering.com 5. Supervised machine learning 30 and unsupervised machine learning 5, 28, 29 were both represented. Three of the papers reviewed used classical machine learning methods 5, 29, 30, and one used modern machine learning methods 30, with surveillance 28-30, health communication 5, 28-30, 66, and sentiment analysis 28-30, 66 all frequently studied topics.

The LIWC lexicon has been used either to characterise public attitudes towards vaccination in general 66, or as a tool to explore the purported link between autism and the Measles, Mumps, and Rubella vaccine 28. This last study aimed at investigating key differences between users who are longstanding vaccination advocates, longstanding anti-vaccination advocates, or users who had recently adopted an anti-vaccination orientation. Vaccination to protect against the Human Papillomavirus (HPV) – a vaccine typically administered to adolescent boys and girls to prevent future sexual transmission of the disease – was also the subject of reviewed research, with high performance sentiment classifiers developed (AUC: 0.92) 30, and LDA (Latent Dirichlet Allocation) topic modeling used to identify a number of vaccine-hesitancy-related topics, including clinical evidence and vaccination harms 29.

Table 3

Summary of vaccine-related papers

Data Source	SML a	UML b	CML c	MML d	Surv e	HC f	Senti g	Lexicon h
Twitter	30	28 , 29	29 , 30	30	28-30	28-30	28-30	28
Facebook	-	-	-	-	-	66	66	66
OHC i	-	5	5	-	-	5	-	-

a: Supervised machine learning (e.g., Support Vector Machines, Random Forests)
b: Unsupervised machine learning (e.g., Latent Dirichlet Allocation, K-means)
c: Classical machine learning (e.g., Random Forests, Support Vector Machines)
d: Modern machine learning (e.g., Convolutional Neural Networks)
e: Surveillance
f: Health communication
g: Sentiment analysis
h: Lexicon-based methods
i: Online health communities

In a further example of novel research, Tangherlini et al. produced a statistical-mechanical network model representing relationships between “actants” (actors) that is used to automatically extract typical narratives and “story fragments” related to vaccination issues, evidencing a narrative framework related to a pronounced distrust of government and medical authority 5.
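The AUC figure reported in this section for HPV sentiment classifiers can be read as the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one. The sketch below computes AUC directly from that definition; the function name and the toy labels and scores are assumptions made for illustration and are not taken from the reviewed study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the share of (positive, negative) pairs in which the positive
    example has the higher score; ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data, invented for illustration: 1 = positive class, 0 = negative class.
    toy_labels = [1, 1, 0, 0]
    toy_scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(toy_labels, toy_scores))
```

The pairwise form is quadratic in the number of examples, so it suits small evaluation sets; a rank-based computation would be the usual choice at scale.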

3.2 Communicable Diseases and Sexually Transmitted Infections

Systems designed to use social media data for pandemic public health surveillance have existed for almost 13 years 80, 81, and approaches that are variously referred to as infodemiology 82, digital disease detection 83, and digital epidemiology 84 are by now well established, particularly for dengue, influenza, and more recently, Ebola. In addition, significant research efforts have centered on the study of STIs, despite some methodological concerns regarding the willingness of users with STIs to disclose their status on social media. In a study investigating the changing prevalence of a number of health-related topics, Park et al. 10 observed that Ebola discussions were characterised by concerns about risks and symptoms, while influenza was associated with terms like “CDC” and “H1N1”. Another study focussed on influenza misdiagnoses 33, achieving an F-score of 0.76. Regarding STIs, one study demonstrated statistically significant associations between Twitter data from 2012 and official Centers for Disease Control syphilis prevalence data from 2013 57, with a related study discovering that the most frequent STIs discussed were intermediate (non-reportable) STIs like genital herpes and HPV, with more serious (reportable) diseases like syphilis and gonorrhoea discussed less frequently 14. Of the six papers reviewed (see Table 4), four used Twitter data 31-33, 57, and two used Reddit data 10, 14, while Al-Garadi et al. provided a review that concentrated on Twitter and Weibo, the Chinese language microblog service 32. Two of the papers reviewed described the use of supervised machine learning methods 31, 32, three papers used unsupervised machine learning methods 10, 14, 32, and one used a lexicon-based approach 57. Machine learning methods were used to perform a variety of tasks, including surveillance 10, 14, 31-33, 57, health communication 32, and sentiment analysis 32. Several studies concentrated on influenza surveillance using English 10, 33 and Japanese 31 Twitter data.
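F-scores such as the 0.76 quoted above combine precision and recall into a single number. As a minimal sketch, assuming binary gold labels and binary predictions (the function name and the toy data are hypothetical), the balanced F-score (F1) can be computed as follows.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for binary labels, where 1 is the target class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy labels, invented for illustration (1 = post reports the condition).
    gold = [1, 1, 1, 0, 0]
    predicted = [1, 1, 0, 1, 0]
    print(precision_recall_f1(gold, predicted))
```

Because true negatives do not enter the calculation, the F-score is a common choice when the class of interest is rare, as health mentions usually are in general social media streams.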

Table 4

Summary of communicable diseases and STI-related papers

Data Source	SML a	UML b	CML c	MML d	Surv e	HC f	Senti g	Lexicon h
Reddit	-	10 , 14	10 , 14	-	10 , 14	-	-	-
Twitter	31 , 32	32	[31-33]	-	[31-33, 57]	32	32	57
Weibo	32	32	32	-	32	32	32	-

a: Supervised machine learning
b: Unsupervised machine learning
c: Classical machine learning
d: Modern machine learning
e: Surveillance
f: Health communication
g: Sentiment analysis
h: Lexicon-based methods

3.3 Cancer

Work on using NLP and text-mining methods to understand issues directly related to cancer (diagnosis, treatment, and management) is less well developed than some of the other areas considered in this review (e.g., mental health and substance abuse). Of the three cancer-related papers reviewed (see Table 5), one utilised Twitter data 34, and two utilised data derived from an online health community 68, 69. All the papers discussed used both classical and modern machine learning methods, with modern machine learning methods performing better than classical machine learning methods, albeit by a narrow margin in the case of Zhang et al.’s work on identifying chemotherapy-related Twitter accounts by account type 34. Zhang et al. observed that Twitter accounts belonging to individuals focussed on “personal chemotherapy experience and emotions”, whereas professional accounts typically provided a neutral presentation of chemotherapy side effects 34. Two of the papers were centred on health communication, broadly conceived 68, 69, with one paper focusing on sentiment analysis 34. Concentrating specifically on the patient experience of breast cancer, one study 68 aimed at characterizing how forum topics changed over time depending on the individual’s time since diagnosis and cancer state, and found that diagnosis is the most frequent class in the early stages of cancer treatment, with diagnosis (and treatment) related discussions declining over the course of a user’s cancer journey.

Table 5

Summary of cancer-related papers

Data Source	SML a	UML b	CML c	MML d	Surv e	HC f	Senti g	Lexicon h
Twitter	34	34	34	34	-	-	34	-
OHC i	[68, 69]	68	[68, 69]	[68, 69]	-	[68, 69]	-	-

a: Supervised machine learning
b: Unsupervised machine learning
c: Classical machine learning
d: Modern machine learning
e: Surveillance
f: Health communication
g: Sentiment analysis
h: Lexicon
i: Online Health Communities

3.4 Substance Abuse

This section is concerned with reviewing work centred on the use of social media, in conjunction with NLP methods, to address substance abuse research questions, focussing on opioid abuse, tobacco, e-cigarette and marijuana use, and alcohol abuse. Interesting work on drug abuse – particularly new and emerging products – is increasingly evident in the literature. NLP methods are needed to deal with ambiguity and colloquial expressions used on social media (such as “bath salts”, “kitty cat”, or “miaow miaow” for mephedrone 44). Of the twenty-two papers discussed in this section, three are focussed on opioid abuse [35, 41, 42], eight on tobacco and marijuana use [6, 12, 13, 40, 43, 45, 46, 49], one on alcohol abuse 36, and one on the street drug, mephedrone 44. Twitter is the most popular source of data (18 papers) [6, 11, 12, 35-49], with Reddit [11-13] and online health communities 12, 13 both represented. Supervised machine learning (8 papers, all utilising Twitter data) and unsupervised machine learning (11 papers) were both evident in the reviewed papers, with classical machine learning approaches more common than modern neural-network-based approaches (17 and 2 papers, respectively). Two of the papers reviewed utilized a rule-based approach. Table 6 summarises the reviewed substance abuse-related papers.
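As a rough illustration of why colloquial drug terms are awkward to handle, the sketch below maps slang expressions to a canonical substance name using simple dictionary matching. The three mephedrone terms are those quoted above; the mapping structure and function name are hypothetical. Note that this kind of matching cannot tell a literal use of “bath salts” from a drug reference, which is precisely the ambiguity that motivates more capable NLP methods.

```python
import re

# Hypothetical mapping from colloquial terms to a canonical substance name.
# The three terms below are the mephedrone examples quoted in the text.
SLANG_TO_SUBSTANCE = {
    "bath salts": "mephedrone",
    "kitty cat": "mephedrone",
    "miaow miaow": "mephedrone",
}


def find_substance_mentions(post, slang_map=SLANG_TO_SUBSTANCE):
    """Return the sorted canonical names whose slang terms occur in the post."""
    # Lowercase and collapse runs of whitespace so "Miaow  Miaow" still matches.
    normalised = re.sub(r"\s+", " ", post.lower())
    found = set()
    for term, substance in slang_map.items():
        # Word boundaries prevent matches inside longer words.
        if re.search(r"\b" + re.escape(term) + r"\b", normalised):
            found.add(substance)
    return sorted(found)


if __name__ == "__main__":
    print(find_substance_mentions("Tried  Miaow Miaow last night"))
    print(find_substance_mentions("My kitty catches mice"))
```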

Table 6

Summary of substance abuse-related papers

Data source	SML a	UML b	CML c	MML d	Surv e	HC f	Senti g	Lexicon h
Reddit	-	[11-13]	[11-13]	-	12	-	-	13
Twitter	[6, 36, 40, 45-49]	[6, 12, 35, 37, 39, 41, 42, 43, 45]	[6, 12, 35, 36, 38-43, 45-49]	[6, 37]	[11, 12, 35, 36, 38, 39, 42, 44, 47-49]	43	[46-48]	44
OHC i	-	[12, 13]	[12, 13]	-	12	-	-	13

a: Supervised machine learning
b: Unsupervised machine learning
c: Classical machine learning
d: Modern machine learning
e: Surveillance
f: Health communication
g: Sentiment analysis
h: Lexicon
i: Online Health Communities

3.4.1 Opioid Abuse

Opioid abuse is now recognised as one of the leading public health problems in the United States 4, and an important – albeit slightly less pressing – concern in many developed and developing countries. The crisis in the US is due to historical changes in drug prescription policies and practices that have encouraged both the licit and illicit use of highly addictive opioid-based painkillers 5. Every year in the United States, over 72,000 people die as a direct consequence of using opioids 6, making the need to understand emerging opioid-related behaviours and user trajectories especially pressing. One study concentrated on identifying public reactions to the opioid epidemic by identifying the most popular opioid-related topics tweeted by users 41. Topics identified included discussions related to the possibility of promoting marijuana as a substitute for opioids, discussions related to the growing opioid market in North America, and discussions related to news reports advocating the use of buprenorphine – a narcotic used to treat opioid addiction – for adolescents experiencing opioid use disorders. Another study 35 aimed at detecting marketing and sale of opioids by illicit online sellers. The authors observed that the frequency of tweets directly related to illegal activity was relatively low when compared with other kinds of opioid mentions. A similar observation was made for tweets promoting the illegal online sale of fentanyl 42. In this context, unsupervised approaches are of significant value for understanding changes in a rapidly developing online environment.

3.4.2 Tobacco, E-Cigarette, and Marijuana Use and Abuse

Tobacco use is declining in popularity in much of the developed world (the proportion of smokers in the US has declined by over half since 1964 and now stands at 16.8% among adults, and approximately half that among high school students 85). However, despite this decrease in tobacco use, there has been a dramatic increase – now plateauing – in the use of e-cigarettes since their introduction to developed world markets in around 2007 86. This increase has occurred in the context of a lack of consensus regarding both the safety of the product 87 and its potential efficacy as a smoking cessation device 88. In addition to these shifts in tobacco use, there have also been substantial changes in the regulation of marijuana products, particularly in the US context, and these changes have led – it has been suggested 89 – to an increase in marijuana use 90. Given these public health concerns, using NLP to investigate tobacco, e-cigarette, and marijuana use has become an active research area, especially to classify discussions [6, 12, 43, 45, 46] or to determine whether a particular user is above or below 21 years of age 40.

Reported findings included evidence that Twitter users frequently discussed ways in which e-cigarettes can be used in the workplace in a bid to circumvent smoking bans 43, and evidence that hookah was discussed more frequently at the weekend, indicating its use is associated with leisure activities, while reported tobacco use tends to be more consistent across the week 40. In addition, authors observed that different social media services manifested distinctly different cultures regarding e-cigarette use, e.g., sensory experiences vs. psychological factors associated with quitting 13. Rule-based approaches were used to identify where people reported using e-cigarettes, with 39% of posts referring to e-cigarette use in the classroom 49. Other studies aimed at describing strategies for marketing Little Cigars & Cigarillos (LCC) and observed that 83% of identified LCC tweets referred to marijuana, and 29% of LCC tweets referenced memes 45.
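A rule-based approach to identifying where people report using e-cigarettes might, in its simplest form, pair a vaping-term pattern with a small set of location patterns. The sketch below is an assumption-laden illustration: the patterns, labels, and function name are hypothetical and are not the rules used in the cited study.

```python
import re

# Hypothetical rules: location label -> pattern. These are illustrative only
# and are not the rule set used in the study discussed above.
LOCATION_RULES = {
    "classroom": re.compile(r"\b(?:in|during) (?:class|school|the classroom)\b"),
    "workplace": re.compile(r"\b(?:at|in) (?:work|the office)\b"),
    "car": re.compile(r"\bin (?:my|the) car\b"),
}

# Hypothetical pattern for e-cigarette mentions.
VAPING_TERMS = re.compile(r"\b(?:vape|vaping|vaped|e-?cig(?:arette)?s?)\b")


def tag_use_location(post):
    """Return location labels for posts that mention vaping, else an empty list."""
    text = post.lower()
    if not VAPING_TERMS.search(text):
        return []
    return [label for label, pattern in LOCATION_RULES.items() if pattern.search(text)]


if __name__ == "__main__":
    print(tag_use_location("Vaping in class again lol"))
    print(tag_use_location("Stuck at work all day"))
```

Rules of this kind are transparent and easy to audit, but they only fire on phrasings their author anticipated, so coverage on informal text is likely to be partial.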

3.4.3 Alcohol Abuse

Alcohol abuse was the seventh leading risk-factor worldwide for both death and disability in 2016. In the same year, among males aged 15-49, alcohol was a causal factor in 12% of deaths 91. One of the reviewed studies 36 yielded the surprising result that – in the US at least – a positive correlation exists between excessive county-level alcohol consumption and higher education, suggesting that highly educated counties drink more, or at least tweet more about their drinking.

3.5 Pharmacovigilance

Pharmacovigilance – i.e. the post-market surveillance of drugs – was an early health-related focus for social media NLP 92, 93 and has remained an important subject of research, with applications including the identification of mentions of Adverse Drug Reactions (ADRs) 51, 55. One recent study focussed on topics related to Thyroid Hormone Replacement Therapy (THRT), particularly on the identification of side effects 50. It was discovered that male and female users of THRT had different experiences and concerns regarding side effects, with women primarily concerned about the effect of the drug on personal appearance and men more concerned about potential pain symptoms associated with the drug. A recent significant development in pharmacovigilance research was the instigation of the SMM4H 2017 shared task. The shared task consisted of three subtasks: automatic identification of ADRs, automatic classification of tweets that explicitly mentioned medication consumption, and normalization of ADR mentions. Important outputs of this effort included a publicly available corpus 51 and language models 55 for future research.

In addition to this work on ADR identification and normalization, the identification of semantic relationships – chiefly causal relationships – between drug and symptom mentions has been a focus of research 52, 53. A key challenge associated with this task is the difficulty involved in distinguishing between drug use as a response to a particular symptom (“I have a horrible headache and just took some ibuprofen”) and the existence of a symptom as a side effect of a drug (“Ever since I started taking Sertraline I’ve felt like crap”). Despite the difficulty of this task, Bollegala et al. achieved a moderately high F-score (0.74) using a skip-gram based method 52. Six of the pharmacovigilance papers reviewed used Twitter as a data source 51-56, while one used an online health community (see Table 7). Four of the papers used supervised methods 51-54 and five used unsupervised methods 50, 53-56, with five using classical machine learning methods 50-53, 56 and three using modern machine learning methods 51, 54, 55, with (unsurprisingly given the topic of pharmacovigilance) surveillance being the main application area.
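The distinction between a drug taken in response to a symptom and a symptom arising as a side effect can be illustrated with a deliberately naive cue-phrase heuristic. The sketch below is not the skip-gram method referred to above; the cue patterns, labels, and function name are hypothetical and were written only to mirror the two example sentences quoted in this section.

```python
import re

# Hypothetical cue patterns, modelled only on the two example sentences in
# the text. A real system would need far broader coverage than this.
SIDE_EFFECT_CUES = re.compile(
    r"\b(?:ever since|since) i (?:started|began) (?:taking|using)\b"
)
TREATMENT_CUES = re.compile(r"\b(?:took|taking|take) (?:some|an?|my)\b")


def guess_relation(post):
    """Guess the drug-symptom relation expressed in a post from surface cues."""
    text = post.lower()
    # Check the side-effect cue first, since it also contains "taking".
    if SIDE_EFFECT_CUES.search(text):
        return "symptom_as_side_effect"
    if TREATMENT_CUES.search(text):
        return "drug_taken_for_symptom"
    return "unknown"


if __name__ == "__main__":
    print(guess_relation("I have a horrible headache and just took some ibuprofen"))
    print(guess_relation("Ever since I started taking Sertraline I've felt like crap"))
```

Even on these two sentences the ordering of the checks matters, which hints at why hand-written cues scale poorly and why learned representations are attractive for this task.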

Table 7

Summary of pharmacovigilance-related papers

Data Source	SML a	UML b	CML c	MML d	Surv e	HC f	Senti g	Lexicon h
Twitter	[51-54]	[53-56]	[51-53, 56]	[51, 54, 55]	[51-54, 56]	-	-	-
OHC i	-	50	50	-	-	-	-	-

a: Supervised machine learning
b: Unsupervised machine learning
c: Classical machine learning
d: Modern machine learning
e: Surveillance
f: Health communication
g: Sentiment analysis
h: Lexicon-based methods
i: Online Health Communities

3.6 Mental Health

Mental health problems are estimated to account for 13% of the global burden of disease, as measured in Disability Adjusted Life Years 95. Using social media as a resource to understand mental health is a research area that has experienced substantial growth in recent years 96, given the burden of disease associated with mental health problems and the fact that social media provides ready access to first person reports of behaviour, thoughts, and feelings. Reviewed studies covered a range of mental health topics, including predicting depression diagnosis 8, assessing suicide risk [16, 18, 24, 74-76, 98, 99], and developing a better understanding of users’ experiences of eating disorders 15, schizophrenia 59, 61, grief processes among gang-involved youth 58, relaxation 62, stress 63, pathological empathy 67, 72, and negative emotional effects associated with campus-based mass murders 64. Related to this, a range of metrics have been used to characterize language use associated with specific mental health conditions, with lexical diversity, readability scores, sentence complexity, negation, uncertainty, and degree of repetition all used during the review period [23, 26, 27, 60].

In novel work focussing on the relationship between clinical guidelines and actual treatments, Zhang et al. 71 created a catalogue of real-world treatments used – as opposed to merely discussed – by parents of children with autistic spectrum disorder, and then automatically identified their frequency of mention in two online autism forums. With a view to improving how mental health forums are designed, one study applied textual cluster analysis to forums related to the conditions anxiety, depression, and post-traumatic stress disorder (PTSD) 19, showing that – consistent with current thinking regarding the relationship between PTSD and anxiety 97 – anxiety and PTSD forums shared more similarities to each other than to the depression forum. Related to this, another study found that different communities provided different degrees of emotional and informational support 20, with some communities (e.g., depression forums) focussed primarily on emotional support, and other communities (e.g. obsessive compulsive disorder forums) offering a greater proportion of informational support. Furthermore, the same study found that at the user level, the provision of social support was correlated with demonstrated linguistic accommodation, suggesting that those users who were able to “match” the linguistic culture of a particular community were likely to receive a greater volume of social support. Finally, a further study 100 involved the development of a classifier capable of distinguishing respectful uses of a mental-health related term (e.g. “I’m fuming. How dare a TV show portray folks suffering from mental health issues so unfairly”) from less-respectful usage.

Of the thirty-one mental health-related papers reviewed (see Table 8), thirteen involved the use of Reddit data [15-27], ten used Twitter data [18, 24, 58-65], one used Instagram 18, three used Facebook [8, 18, 67], six used OHC data [70-75], and one used data derived from Weibo 76, with twenty-two of the papers utilising supervised machine learning methods [8, 16, 18, 20-22, 24, 25, 58-62, 65, 67, 70-76], and twelve papers utilising unsupervised machine learning [8, 15, 18-22, 27, 59, 60, 70, 72]. The majority of the papers reported on the use of classical machine learning approaches [8, 15, 16, 18-20, 22, 24, 25, 27, 58-62, 65, 67, 71, 73-76], with a minority using modern machine learning methods [18, 21, 22, 67, 70, 72]. Four of the mental health papers reviewed utilised primarily lexicon-based methods [17, 23, 63, 64].
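Two of the language-use metrics mentioned in this section, lexical diversity and degree of repetition, have simple surface-level approximations. The sketch below computes a type-token ratio and a repetition rate; these particular definitions and the function names are assumptions made for illustration and may differ from the measures used in the reviewed papers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Lexical diversity: distinct word types divided by total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def repetition_rate(text):
    """Share of tokens that repeat a word already used earlier in the text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    repeats = sum(count - 1 for count in Counter(tokens).values())
    return repeats / len(tokens)


if __name__ == "__main__":
    post = "I feel tired and I feel alone"
    print(type_token_ratio(post), repetition_rate(post))
```

Under these definitions the two values always sum to one, so they carry the same information; studies that report both would typically define repetition over longer units such as phrases. The type-token ratio also falls as texts get longer, so comparisons are safest between posts of similar length.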

Table 8

Summary of mental health-related papers

Data Source	SML a	UML b	CML c	MML d	Surv e	HC f	Senti g	Lexicon h
Reddit	[16, 18, 20-22, 24, 25]	[15, 18-22, 27]	[15, 16, 18-20, 22, 24, 25, 27]	[21, 22]	-	-	26	17 , 23
Twitter	[18, 58-62, 65]	[18, 59, 60]	[58-62, 65]	18	-	-	[63, 64, 24]	[63, 64]
Instagram	18	18	-	18	-	-	-	-
Facebook	[8, 18, 67]	[8, 18]	[8, 67]	[18, 67]	-	-	-	-
OHC i	[70-75]	[70, 72]	[71, 73-75]	[70, 72]	-	-	-	-
Weibo	76	-	76	-	-	-	-	-

Supervised machine learning;

Unsupervised machine learning;

Classical machine learning;

Modern machine learning;

Surveillance;

Health communication;

Sentiment analysis;

Lexicon-based methods;

Online Health Communities

3.7 Ethical Issues

Two types of ethics-related papers are discussed in this section: those that are focussed on empirical ethics (i.e. the empirical investigation of ethical beliefs and practices) [101, 102], and those that are focussed on ethical guideline development (i.e. the generation of theoretical frameworks and practical guidelines for conducting health-related NLP research with social media) [9, 103, 104]. Reviewed studies highlighted the need for both transparency in the development of algorithms and an ethical framework to guide the appropriate use of social media for computational public health research. Focussing specifically on research ethics from the perspective of social media users, one study [102] pointed to a generally favourable view of the use of computational methods for public health research among social media users, provided that data was highly aggregated, and the goal of the work was of significant public health value (e.g. opioid abuse surveillance was acceptable in a public health context, but not when used for employment screening). However, among some users, concerns remained regarding the robustness of both the data and the research methods, due to the fact that the data was not representative of the general population, and was subject to impression management (i.e. many users did not tweet about stigmatising health problems [105]). Related to this work, one paper – a systematic review of attitudes towards the ethics of computational social media research [106] – found a range of different views on appropriate research ethics, depending on the particular research topic discussed, suggesting that a “blanket” approach to research ethics is currently not appropriate, and instead ethical deliberations ought to take into account the particular context of the research under review.
As noted by Vayena et al. [104], the research regulation infrastructure in most jurisdictions was developed in the period prior to social media, and hence is not well-equipped to manage the review of computational social media research. This point is reinforced by a qualitative study conducted with Research Ethics Committee (Institutional Review Board) members in the United Kingdom. This study outlines the challenges faced by ethics committees in the application of existing research ethics regulation to computational work and emphasises the need to protect research participants (i.e. social media users), even in the context of research using publicly available data [101]. Finally, practical guidelines have recently been developed to guide NLP research using social media data [103], with eight principles outlined, including the stipulation that as most social media based NLP research can be defined as human subjects research [107], ethical approval or exemption ought to be gained from an Institutional Review Board or Research Ethics Committee; that data ought to be de-identified for use in publications and presentations; and that caution ought to be exercised in linking data. In recent years there has been a move away from the commonly held view that in social media research “anything goes”, towards a more sophisticated perspective that acknowledges both the existence and importance of the ethical and regulatory issues involved in the application of NLP to social media for health research. Further, the provision of ethical guidelines developed specifically for NLP researchers, as described above [103], is a new and welcome development in the period since 2016.
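As one illustration of the de-identification principle noted above, the following Python sketch masks user handles, URLs, and email addresses in a post before it is quoted. It is a hedged example only: the placeholder tokens, the regular expressions, and the function name `deidentify` are assumptions made here and are not taken from the cited guidelines. Pattern-based masking should be read as a first step rather than sufficient protection, since verbatim quotes may remain traceable through search.

```python
import re

# Simple patterns for common direct identifiers in social media text.
# These are illustrative assumptions, not an exhaustive or validated set.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def deidentify(text):
    """Replace emails, URLs, and @-mentions with neutral placeholders."""
    # Emails are masked first so their "@domain" part is not
    # mistaken for a user mention by the later pattern.
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<USER>", text)
    return text


if __name__ == "__main__":
    post = "@jane_doe see https://example.org/post/1 or mail me at jane@example.org"
    print(deidentify(post))
```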

4 Discussion and Conclusion

In this survey, we have presented recent advances in the application of NLP to social media to address public health research questions. We observed a substantial growth in the area of mental health and substance abuse research, and a continuing sustained interest in the use of social media for studying communicable diseases (particularly in the area of vaccine hesitancy). The widespread use of lexical resources developed in the psychology research communities – specifically, LIWC – is also notable, as is the relatively low frequency of “modern” (as opposed to “classical”) machine learning approaches. While predicting future trends is not a straightforward task, we tentatively suggest four directions in which current work is evolving. First, linking data – with appropriate consent – from the EHR and social media, both in the context of public health research and clinical care. Examples of this type of work in the research context already exist (e.g. [8]), and will likely be a focus of considerable research effort over the next few years. Second, further utilisation of social media in public health surveillance. Currently, while advances have been made in research using NLP and social media, substantial barriers still exist to implementing social media health surveillance in the context of public health practice. These barriers include costs (public health agencies are frequently underfunded), limited expertise in NLP, and difficulties in integrating social media analysis with existing surveillance methods and pipelines. However, even given these challenges, considerable strides have been made, particularly in the area of pharmacovigilance (e.g. the Food & Drug Administration Center for Drug Evaluation and Research). Third, much social media research relies on the identification of appropriate keywords to construct a data sample suitable for the research question at hand. This keyword selection process has typically relied on intuition.
However, recently there has been a move towards a more data-driven means of iteratively identifying and evaluating keywords (and their associated synonyms), with word embeddings and other empirical synonym discovery methods (e.g. [108]). This shift towards a more principled method of selecting keywords for data sampling is to be welcomed. Fourth, while we believe that Twitter will remain a valuable (and popular) data source for NLP research, we suspect that Reddit will become increasingly popular as a research resource, partly due to its “research-friendly” terms and conditions and its increasing user base. Related to this, the dynamism of the social media ecosystem should not be underestimated, with new services (e.g. TikTok) attracting users – especially new adolescent users – away from existing services. Given this rapidly changing social media environment, there is little reason to believe that currently popular social media platforms will maintain their current level of popularity.
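The data-driven keyword selection described above can be sketched as follows. This is an illustrative Python example, not the procedure of the cited work: the three-dimensional vectors are invented toy values standing in for embeddings that would normally be trained on the target corpus, and the function names and the 0.9 similarity threshold are assumptions.

```python
import math

# Toy vectors invented purely for illustration; real embeddings would be
# learned from the social media corpus under study.
TOY_EMBEDDINGS = {
    "vape": [0.9, 0.1, 0.0],
    "ecig": [0.8, 0.2, 0.1],
    "juul": [0.85, 0.15, 0.05],
    "flu": [0.0, 0.9, 0.3],
    "pizza": [0.1, 0.0, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, embeddings, threshold=0.9):
    """Return non-seed terms close to any seed, most similar first."""
    known_seeds = [s for s in seeds if s in embeddings]
    scored = []
    for term, vec in embeddings.items():
        if term in seeds:
            continue
        # Score each candidate by its best similarity to any seed term.
        best = max((cosine(vec, embeddings[s]) for s in known_seeds), default=0.0)
        if best >= threshold:
            scored.append((best, term))
    return [term for _, term in sorted(scored, reverse=True)]


if __name__ == "__main__":
    print(expand_keywords(["vape"], TOY_EMBEDDINGS))
```

In an iterative workflow, the returned candidates would be reviewed manually, accepted terms added to the seed list, and the expansion repeated until no useful new terms appear.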

83 in total

1. HealthMap: the development of automated real-time internet surveillance for epidemic intelligence.

Authors: J S Brownstein; C C Freifeld
Journal: Euro Surveill Date: 2007-11-29

2. The apomediated world: regulating research when social media has changed research.

Authors: Dan O'Connor
Journal: J Law Med Ethics Date: 2013 Impact factor: 1.718

3. Assessing the effects of medical marijuana laws on marijuana use: the devil is in the details.

Authors: Rosalie L Pacula; David Powell; Paul Heaton; Eric L Sevigny
Journal: J Policy Anal Manage Date: 2015

Review 4. Opioid epidemic in the United States.

Authors: Laxmaiah Manchikanti; Standiford Helm; Bert Fellows; Jeffrey W Janata; Vidyasagar Pampati; Jay S Grider; Mark V Boswell
Journal: Pain Physician Date: 2012-07 Impact factor: 4.965

5. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet.

Authors: Gunther Eysenbach
Journal: J Med Internet Res Date: 2009-03-27 Impact factor: 5.428

6. Comorbidity of posttraumatic stress disorder, anxiety and depression: a 20-year longitudinal study of war veterans.

Authors: Karni Ginzburg; Tsachi Ein-Dor; Zahava Solomon
Journal: J Affect Disord Date: 2009-09-18 Impact factor: 4.839

7. Digital disease detection--harnessing the Web for public health surveillance.

Authors: John S Brownstein; Clark C Freifeld; Lawrence C Madoff
Journal: N Engl J Med Date: 2009-05-07 Impact factor: 91.245

8. BioCaster: detecting public health rumors with a Web-based text mining system.

Authors: Nigel Collier; Son Doan; Ai Kawazoe; Reiko Matsuda Goodwin; Mike Conway; Yoshio Tateno; Quoc-Hung Ngo; Dinh Dien; Asanee Kawtrakul; Koichi Takeuchi; Mika Shigematsu; Kiyosu Taniguchi
Journal: Bioinformatics Date: 2008-10-15 Impact factor: 6.937

Review 9. Digital epidemiology.

Authors: Marcel Salathé; Linus Bengtsson; Todd J Bodnar; Devon D Brewer; John S Brownstein; Caroline Buckee; Ellsworth M Campbell; Ciro Cattuto; Shashank Khandelwal; Patricia L Mabry; Alessandro Vespignani
Journal: PLoS Comput Biol Date: 2012-07-26 Impact factor: 4.475

Review 10. E-cigarettes: a scientific review.

Authors: Rachel Grana; Neal Benowitz; Stanton A Glantz
Journal: Circulation Date: 2014-05-13 Impact factor: 29.690


