Julia Walsh1, Christine Dwumfour2, Jonathan Cave3, Frances Griffiths2,4.
Abstract
PURPOSE: Social media has led to fundamental changes in the way that people look for and share health-related information. There is increasing interest in using spontaneously generated online patient experience (SGOPE) data as a data source for health research. The aim was to summarise the state of the art regarding how and why SGOPE data has been used in health research. We determined the sites and platforms used as data sources, the purposes of the studies, the tools and methods being used, and any identified research gaps.
Keywords: Health research; Machine learning; Methods; Natural language processing; Social media; Text analysis; Umbrella review
Year: 2022 PMID: 35562661 PMCID: PMC9106384 DOI: 10.1186/s12874-022-01610-z
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.612
Fig. 1 Search term clusters
Components of CERQual appraisal tool (GRADE CERQual, 2017)
| Component | What it assesses |
|---|---|
| Methodological limitations | Are the methods suitable for this project? |
| Relevance | Do the findings relate to the research question? |
| Coherence | How well does the data relate to the finding? |
| Adequacy | Richness & quantity of data supporting the finding |
Characteristics of included review papers
| Ref | Title | Review Aims/Objectives | Area | Data Source | Review type | No. of papers |
|---|---|---|---|---|---|---|
| Abbe 2016 [ | Text mining applications in psychiatry: a systematic literature review | Two specific objectives: (1) to collect and analyse applications from the studies reviewed to assess the benefits and limitations of using TM; and (2) to identify new opportunities for use of TM in psychiatry. | Mental health | Online posts, qual studies, EHRs, biomed literature | Systematic | 38 |
| Abd Rahman 2020 [ | Application of Machine Learning Methods in Mental Health Detection: A Systematic Review | The main purpose of this paper is to explore the adequacy, challenges, and limitations of a mental health problem detection based on OSNs data. The objective of this systematic literature review is to conduct a critical assessment analysis on detection of mental health problems using OSNs. We also investigated the appropriateness of this pre-mental health detection by identifying its data analysis method, comparison, challenges, and limitations. | Mental Health | Mostly Twitter or Sina Weibo (Chinese Twitter) | Systematic | 22 |
| Al-Garadi 2016 [ | Using online social networks to track a pandemic: A systematic review | This study aims to investigate the adequacy and limitations of pandemic surveillances based on OSN data. | Infectious disease | Mostly Twitter | Systematic | 20 |
| Allen 2016 [ | Long-Term Condition Self-Management Support in Online Communities: A Meta-Synthesis of Qualitative Papers | To understand the negotiation of long-term condition illness-work in patient online communities and how such work may assist the self-management of long-term conditions in daily life. | Chronic | Mostly disease specific / Gen health sites / FB | Systematic | 21 |
| Barros 2020 [ | The Application of Internet-Based Sources for Public Health Surveillance (Infoveillance): Systematic Review | To assess research findings regarding the application of IBSs for public health surveillance (infodemiology or infoveillance). | Public Health | SM, search queries | Systematic | 162 |
| Calvo 2017 [ | Natural language processing in mental health applications using non-clinical texts | To highlight areas of research where NLP has been applied in the mental health literature and to help develop a common language that draws together the fields of mental health, human-computer interaction, and NLP. | Mental health | Mostly Twitter | Scoping | 23 |
| Castillo-Sanchez 2020 [ | Suicide Risk Assessment Using Machine Learning and Social Networks: A Scoping Review | To identify the machine learning techniques used to predict suicide risk based on information posted on social networks. | Mental Health | Any but mostly Twitter | Scoping | 16 |
| Charles-Smith 2015 [ | Using Social Media for Actionable Disease Surveillance and Outbreak Management: A Systematic Literature Review | Q1: Can social media be integrated into disease surveillance practice and outbreak management to support and improve public health? Q2: Can social media be used to effectively target populations, specifically vulnerable populations, to test an intervention and interact with a community to improve health outcomes? | Infectious disease | Mostly Twitter (81%) | Systematic | 33 |
| Cheerkoot-Jalim 2020 [ | A systematic review of text mining approaches applied to various application areas in the biomedical domain | To identify the different text mining approaches used in different application areas of the biomedical domain, the common tools used, and the challenges of biomedical text mining as compared to generic text mining algorithms. | Any | EHR, Biomed literature, SM | Systematic | 34 |
| Convertino 2018 [ | The usefulness of listening social media for pharmacovigilance purposes: a systematic review | To evaluate the usefulness and quality of signals from social media listening. | ADR | Varied | Systematic | 38 |
| Demner-Fushman 2016 [ | Aspiring to Unintended Consequences of Natural Language Processing: A Review of Recent Developments in Clinical and Consumer-Generated Text Processing | To review work over the past two years in Natural Language Processing (NLP) applied to clinical and consumer-generated texts | Any | Clinical & UG texts. | General review | NS |
| Dobrossy 2020 [ | “Clicks, likes, shares and comments” a systematic review of breast cancer screening discourse in social media | Two aims: first, to assess the volume, participants, and content of breast screening social media communication; and second, to find out whether social media can be used by screening organisers as a channel of patient education. | Breast Cancer | Any but mostly Twitter | Systematic | 17 |
| Dol 2019 [ | Health Researchers’ Use of Social Media: Scoping Review | To explore how social media is used by health researchers professionally, as reported in the literature | Any | Varied | Scoping | 414 |
| Dreisbach 2019 [ | A systematic review of natural language processing and text mining of symptoms from electronic patient-authored text data | To synthesize the literature on the use of natural language processing (NLP) and text mining as they apply to symptom extraction and processing in electronic patient-authored text (ePAT) | Symptoms | Varied | Systematic | 21 |
| Drewniak 2020 [ | Risks and Benefits of Web-Based Patient Narratives: Systematic Review | This review aimed to evaluate whether research-generated Web-based patient narratives have quantifiable risks or benefits for (potential) patients, relatives, or health care professionals | Any | Any SM | Systematic | 17 |
| Edo-Osagie 2020 [ | A scoping review of the use of Twitter for public health research | Aims to review and synthesize the literature on Twitter applications for public health, highlighting current research and products in practice. | Any | Twitter | Scoping | 92 |
| Falisi 2017 [ | Social media for breast cancer survivors: a literature review | To provide a systematic synthesis of the current literature in order to inform cancer health communication practice and cancer survivorship research. | Breast cancer | Online support groups | Systematic | 98 |
| Filannino 2018 [ | Advancing the State of the Art in Clinical Natural Language Processing through Shared Tasks | To review the latest scientific challenges organized in clinical Natural Language Processing (NLP) by highlighting the tasks, the most effective methodologies used, the data, and the sharing strategies. | Any | Twitter/ReachOut forum | General review | 17 |
| Fung 2016 [ | Ebola virus disease and social media: A systematic review | To review research on Ebola virus disease and social media, especially to identify the research questions and the methods used to collect and analyse social media data | Infectious disease | Mostly Twitter & YouTube | Systematic | 12 |
| Gianfredi 2018 [ | Harnessing Big Data for Communicable Tropical and Sub-Tropical Disorders: Implications from a Systematic Review of the Literature | To systematically assess the feasibility of exploiting novel data streams (NDS) for surveillance purposes and/or their potential for capturing public reaction to epidemic outbreaks. | Infectious disease | Varied but mostly Twitter | Systematic | 47 |
| Giuntini 2020 [ | A review on recognizing depression in social networks: challenges and opportunities | To investigate the state of the art in how sentiment and emotion analysis approaches can identify depressive disorders in social networks. | Mental Health | Any: mostly Twitter | Systematic | 26 |
| Gohil 2018 [ | Sentiment Analysis of Health Care Tweets: Review of the Methods Used | To review the methods used to measure sentiment for Twitter-based health care studies. | Any | Twitter | Systematic | 12 |
| Golder 2015 [ | Systematic review on the prevalence, frequency, and comparative value of adverse events data in social media | To summarize prevalence, frequency, and comparative value of information on the adverse events of healthcare interventions from user comments and videos in social media. | ADR | Mostly discussion forums | Systematic | 51 |
| Gonzalez-Hernandez 2017 [ | Capturing the Patient’s Perspective: a Review of Advances in Natural Language Processing of Health-Related Text | To review the recently published literature discussing the application of NLP techniques for mining health-related information from EHRs and social media posts. To provide a scope of the trends and advances in capturing the patient’s perspective on health within the last three years. | Any | SM & EHRs | General review | 87 |
| Gupta 2020 [ | Social media-based surveillance systems for healthcare using machine learning: A systematic review | We review the recent work, trends, and machine learning (ML) text classification approaches used by surveillance systems seeking social media data in the healthcare domain. We also highlight the limitations and challenges followed by possible future directions that can be taken further in this domain. | Any | Twitter 64% | Systematic | 26 |
| Hamad 2016 [ | Toward a Mixed-Methods Research Approach to Content Analysis in The Digital Age: The Combined Content-Analysis Model and its Applications to Health Care Twitter Feeds | To identify studies on health care and social media that used Twitter feeds as a primary data source and CA as an analysis technique. | ADR | Twitter | Narrative review | 18 |
| Ho 2016 [ | Data-driven Approach to Detect and Predict Adverse Drug Reactions | Compares omics, social media and EHRs as sources of ADR knowledge | ADR | Any SM | General review | 22 |
| Injadat 2016 [ | Data mining techniques in social media: A survey | Techniques, areas, performance, comparison of techniques, strengths and weaknesses of data mining methods | Any | Any SM | Survey | 66 |
| Karmegan 2020 [ | A Systematic Review of Techniques Employed for Determining Mental Health Using Social Media in Psychological Surveillance During Disasters | Our review aims to analyse the possibility, effectiveness, and procedures of using social media data to understand the emotional and psychological impact of an unforeseen disaster on the community. | Mental Health | Any SM: mostly Twitter | Systematic | 18 |
| Kim 2017 [ | Scaling Up Research on Drug Abuse and Addiction Through Social Media Big Data | To determine how social media big data can be used to understand communication and behavioural patterns of problematic use of prescription drugs. | Substance misuse | | Critical | 8 |
| Lafferty 2015 [ | Perspectives on social media in and as research: A synthetic review | To summarize findings, opinions and discussion about the use of SoMe in research, including examples from psychiatry. | Mental health | Varied | Systematic | 56 |
| Lardon 2015 [ | Adverse Drug Reaction Identification and Extraction in Social Media: A Scoping Review | To explore the breadth of evidence about the use of social media as a new source of knowledge for pharmacovigilance. | ADR | Mainly online forums + Twitter/blogs | Scoping | 24 |
| Lau 2019 [ | Artificial Intelligence in Health: New Opportunities, Challenges, and Practical Implications | To summarise the state of the art during the year 2018 in consumer health informatics | Any | Any SM | General review | 14 |
| Lopez-Castroman 2019 [ | Mining social networks to improve suicide prevention: A scoping review | Narrative review of possible suicidal behaviours on social networks | Mental health | NS | Scoping | NS |
| Mavragani 2020 [ | Infodemiology and Infoveillance: Scoping Review | The aim of this paper is to provide a scoping review of the state-of-the-art in infodemiology along with the background and history of the concept, to identify sources and health categories and topics, to elaborate on the validity of the employed methods, and to discuss the gaps identified in current research. | Any | Mostly Twitter | Scoping | 338 |
| Neveol 2017 [ | Making Sense of Big Textual Data for Health Care: Findings from the Section on Clinical Natural Language Processing | To identify the best clinical NLP papers of 2016 | Any | SM + EHRs | General review | 5 |
| Neveol 2018 [ | Expanding the Diversity of Texts and Applications: Findings from the Section on Clinical Natural Language Processing of the International Medical Informatics Association Yearbook | Summarize recent research / best papers for clinical NLP in 2017 | Any | Any SM | General review | 15 |
| Patel 2015 [ | Social Media Use in Chronic Disease: A Systematic Review and Novel Taxonomy | To evaluate clinical outcomes from applications of contemporary social media in chronic disease; to develop a conceptual taxonomy to categorize, summarize, and then analyse the current evidence base; and to suggest a framework for future studies on this topic | Chronic | Any SM | Systematic | 42 |
| Pourebrahim 2020 [ | Adverse Drug Reaction Detection Using Data Mining Techniques: A Review Article | The aim of this study is to study, review and challenge the methods of ADR diagnosis by data mining on social media, especially Twitter. | ADR | Any SM: mostly Twitter | General | 0 |
| Qiao 2020 [ | A Systematic Review of Machine Learning Approaches for Mental Disorder Prediction on Social Media | The purpose of this paper is to provide a systematic overview of SM studies in the mental disorder detection field. | Mental Health | Facebook, Twitter, Reddit, Tumblr, Instagram | General | 0 |
| Ru & Yao 2019 [ | A Literature Review of Social Media-Based Data Mining for Health Outcomes Research | To summarize key points of the area in data accessibility, textual data pre-processing methods, analysis methods, opportunities, and challenges. | Any | Any SM | General review | 19 |
| Santos 2019 [ | Datamining and machine learning techniques applied to public health problems: A bibliometric analysis from 2009 to 2018 | To: (i) analyse the number of papers published from 2009 to 2018 (10 years) due to the increasing number of publications and dissemination of ML in public health; (ii) identify the journals with the greatest number of papers; (iii) determine which techniques, programming languages and software tools are most widely used in the field of DM applied to public health; (iv) identify which countries and databases were targeted by these studies; (v) analyse which public health classes were tackled by these papers; and (vi) identify which papers were most frequently cited in the literature. | Public health | Any SM | Bibliometric | 250 |
| Sarker 2019 [ | Mining social media for prescription medication abuse monitoring: a review and proposal for a data-centric framework | To present a methodological review of social media-based PM abuse or misuse monitoring studies, and to propose a potential generalizable, data-centric processing pipeline for the curation of data from this resource. | Substance misuse | Twitter / Facebook / Reddit | General review | 39 |
| Sharma 2016 [ | Identifying Complementary and Alternative Medicine Usage Information from Internet Resources. A Systematic Review | Identify and highlight research issues and methods used in studying Complementary and Alternative Medicine (CAM) information needs, access, and exchange over the Internet. | CAM | Any SM | Systematic | 120 |
| Sharma 2020 [ | Sentiment analysis of social media posts on pharmacotherapy: A Scoping Review | The aim of this scoping review was to describe the available evidence as it pertains to SA of Social Media specifically about pharmacotherapy. Themes will be generated about the published uses of SA and the real-world implications of the knowledge generated. | Any | Any SM: mostly Twitter | Scoping | 10 |
| Sinnenberg 2017 [ | Twitter as a Tool for Health Research: A Systematic Review | To systematically review the use of Twitter in health research, define a taxonomy to describe Twitter use, and characterize the current state of Twitter in health research. | Health research | Twitter | Systematic | 137 |
| Skaik 2020 [ | Using Social Media for Mental Health Surveillance: A Review | This systematic review aims to analyse the literature on using social media posts to predict mental disorders using ML and NLP methods that could be useful for mental health surveillance and presents the cutting-edge techniques in predictive analysis of suicide ideation and depression at the population-level. It also points at the gaps that need further research from the perspective of the data, the models, and evaluation procedures. | Mental Health | Any SM | General | 110 |
| Staccini 2017 [ | Secondary Use of Recorded or Self-expressed Personal Data: Consumer Health Informatics and Education in the Era of Social Media and Health Apps | To summarize the state of the art during the year 2016 in the areas related to consumer health informatics and education with a special emphasis in secondary use of patient data. | Any | Any SM | Systematic | 5 |
| Su 2020 [ | Deep learning in mental health outcome research: a scoping review | The goal of this study is to review existing research on applications of DL algorithms in mental health outcome research. | Mental Health | SM, EHR, etc | Scoping | 57 |
| Tricco 2018 [ | Utility of social media and crowd-intelligence data for pharmacovigilance: a scoping review | Review the literature regarding using SM conversations for ADR detection | ADR | Any SM | Scoping | 70 |
| Vilar 2018 [ | Detection of drug-drug interactions through data mining studies using clinical sources, scientific literature, and social media | To review datamining as a method of detecting drug-drug interactions | ADR | SM, EHRs, FAERS, WHO | General review | NS |
| Wilson 2015 [ | Using blogs as a qualitative health research tool: A scoping review | To identify how blogs are being used in health research to date and whether blogging has potential as a useful qualitative tool for data collection. Our purpose was to summarize the extent, range, and nature of research activity using blogs. | Any | Blogs | Scoping | 44 |
| Wong 2018 [ | Natural Language Processing and Its Implications for the Future of Medication Safety: A Narrative Review of Recent Advances and Challenges | To review methods of identifying adverse events from free text | ADR | SM + EHRs | General review | 12 |
| Wongkoblap 2017 [ | Researching Mental Health Disorders in the Era of Social Media: Systematic Review | To explore the scope and limits of cutting-edge techniques that researchers are using for predictive analytics in mental health and to review associated issues, such as ethical concerns, in this area of research. | Mental health | Various SM | Systematic | 48 |
| Yin 2019 [ | A systematic literature review of machine learning in online personal health data | To systematically review the effectiveness of applying machine learning (ML) methodologies to UGC for personal health investigations. | Any | Any SM: mostly Twitter | Systematic | 103 |
| Zhang 2018 [ | Using Twitter for Data Collection with Health-Care Consumers: A Scoping Review | To provide an overview of previously published literature describing Twitter as a data collection method with health-care consumers and provide researchers with considerations when potentially using this data collection approach. | Any | Twitter | Scoping | 17 |
| Zhang 2020 [ | When Public Health Research Meets Social Media: Knowledge Mapping From 2000 to 2018 | Aims to examine research themes, the role of social media, and research methods in social media–based public health research published from 2000 to 2018 | Any | Any SM | Review | 3419 |
| Zunic 2020 [ | Sentiment Analysis in Health and Well-Being: Systematic Review | This study aimed to establish the state of the art in SA related to health and well-being by conducting a systematic review of the recent literature. To capture the perspective of those individuals whose health and well-being are affected, we focused specifically on spontaneously generated content and not necessarily that of health care professionals. | Any | Various SM | Systematic | 86 |
Fig. 2 PRISMA flow diagram
Fig. 3 Included review papers per year
Included papers by journal
| Review | Journal |
|---|---|
| Allen 2016 [ | Journal of Medical Internet Research |
| Barros 2020 [ | Journal of Medical Internet Research |
| Dol 2019 [ | Journal of Medical Internet Research |
| Drewniak 2020 [ | Journal of Medical Internet Research |
| Hamad 2016 [ | Journal of Medical Internet Research |
| Kim 2017 [ | Journal of Medical Internet Research |
| Mavragani 2020 [ | Journal of Medical Internet Research |
| Zhang 2020 [ | Journal of Medical Internet Research |
| Lardon 2015 [ | Journal of Medical Internet Research |
| Wongkoblap 2017 [ | Journal of Medical Internet Research |
| Lopez-Castroman 2019 [ | Journal of Medical Internet Research |
| Demner-Fushman 2016 [ | Yearbook of Medical Informatics |
| Filannino 2018 [ | Yearbook of Medical Informatics |
| Gonzalez-Hernandez 2017 [ | Yearbook of Medical Informatics |
| Lau 2019 [ | Yearbook of Medical Informatics |
| Neveol 2017 [ | Yearbook of Medical Informatics |
| Neveol 2018 [ | Yearbook of Medical Informatics |
| Staccini 2017 [ | Yearbook of Medical Informatics |
| Pourebrahim 2020 [ | IEEE |
| Qiao 2020 [ | IEEE |
| Abd Rahman 2020 [ | IEEE |
| Dobrossy 2020 [ | PLoS ONE |
| Charles-Smith 2015 [ | PLoS ONE |
| Wilson 2015 [ | International Journal of Qualitative Methods |
| Zhang 2018 [ | International Journal of Qualitative Methods |
| Al-Garadi 2016 [ | Journal of Biomedical Informatics |
| Gupta 2020 [ | Journal of Biomedical Informatics |
| Skaik 2020 [ | ACM Computing Surveys |
| Fung 2016 [ | American Journal of Infection Control |
| Sinnenberg 2017 [ | American Journal of Public Health |
| Patel 2015 [ | American Journal of Medicine |
| Tricco 2018 [ | BMC Medical Informatics & Decision Making |
| Golder 2015 [ | British Journal of Clinical Pharmacology |
| Vilar 2018 [ | Briefings in Bioinformatics |
| Edo-Osagie 2020 [ | Computers in Biology & Medicine |
| Santos 2019 [ | Computers & Industrial Engineering |
| Ho 2016 [ | Current Pharmaceutical Design |
| Karmegan 2020 [ | Disaster Medicine and Public Health Preparedness |
| Convertino 2018 [ | Expert Opinion on Drug Safety |
| Gianfredi 2018 [ | Frontiers in Public Health |
| Dreisbach 2019 [ | International Journal of Medical Informatics |
| Lafferty 2015 [ | International Review of Psychiatry |
| Abbe 2016 [ | International Journal of Methods in Psychiatric Research |
| Sarker 2019 [ | Journal of the American Medical Informatics Association |
| Falisi 2017 [ | Journal of Cancer Survivorship |
| Castillo-Sanchez 2020 [ | Journal of Medical Systems |
| Cheerkoot-Jalim 2020 [ | Journal of Knowledge Management |
| Yin 2019 [ | Journal of the American Medical Informatics Association |
| Zunic 2020 [ | JMIR Medical Informatics |
| Gohil 2018 [ | JMIR Public Health Surveillance |
| Giuntini 2020 [ | Journal of Ambient Intelligence and Humanized Computing |
| Sharma 2016 [ | Methods of Information in Medicine |
| Calvo 2017 [ | Natural Language Engineering |
| Injadat 2016 [ | Neurocomputing |
| Sharma 2020 [ | Pharmacology Research and Perspectives |
| Wong 2018 [ | Pharmacotherapy |
| Ru & Yao 2019 [ | Social Web and Health Research (book) |
| Su 2020 [ | Translational Psychiatry |
Fig. 4 Research area of included reviews – WoS bibliometric analysis (WoS 2020)
Categorisation of the main purpose of the review
| Uses | [ |
| Methods | [ |
| Both | [ |
Fig. 5 Word cloud of author generated keywords
Aims, outcomes, key findings, methods used, and future research suggestions
| Ref | Paraphrased Aims | Area | Outcomes Assessed | Key Findings Paraphrased | Methods Mentioned | Future Research |
|---|---|---|---|---|---|---|
| Abbe 2016 [ | Benefits & limitations. Current and potential uses in psych. | Mental health | Objectives of studies, and topic modelling methods/tools used for pre-processing and analysis. | Identified four main areas of application: psychopathology, patient perspective, medical records, medical literature. A data source that cannot be ignored. Techniques and topics heterogeneous. Basic capabilities at present but will get better and become a core method. | Mostly rule-based systems but some classification. | Improved techniques, apply to more languages than English. |
| Abd Rahman 2020 [ | Adequacy, challenges, and limitations of SGOPE data for detecting MH problems | Mental Health | Data sources, condition, location, feature extraction methods, analysis methods | 22 studies: stress 8, depression 7, suicide 3, MH disorders 4. Geographical: China 6, US 4, Japan 1, Greece 1, unspecified 10. Source: Twitter 8, Sina Weibo 5, Facebook 2, others 7. The keywords used to select data were often not specified. SVM (13/22) most popular classifier, LR & RF (5/22), NB (4/22). | Text analysis, multi-method incl. questionnaires, accessing respondents' OSN accounts. Feature extraction: TF-IDF, n-grams, BOW. | Multiple sources, other languages, inclusion of audio, video, photos. Better methods. |
| Al-Garadi 2016 [ | Adequacy / limitations of SM for pandemic surveillance | Infectious disease | Data source and volume, analysis method, study aims and outcomes. Features and classifier performance of supervised methods. | Can complement existing systems but still problems with representativeness. Need better algorithms and computational linguistic methods. | Mostly supervised, classification. SVM. Most used n-grams as features. | Better algorithms / computational linguistics |
| Allen 2016 [ | Better understanding of how patients with chronic disease share knowledge in online spaces. Possibilities for improving self-management. | Chronic | Network themes and mechanisms | Helpful in encouraging patients to self-manage long-term conditions through sharing collective knowledge, gifting relationships, sociability and disinhibition. Need to understand why people do or do not post. | Qualitative: thematic, grounded theory, content & thematic, IPA, ethnography | Find out why people are reluctant to post and illuminate how these communities help people manage their condition in daily life. |
| Barros 2020 [ | To assess research findings regarding the application of IBSs for public health surveillance (infodemiology or infoveillance). Sources, purposes, methods | Public Health | Paper type, year, disease, health topic, forecasting, surveillance, disease characterisation, first person health mention, diagnosis prediction | Infectious disease the biggest area. Also identified limitations in representativeness and biased user age groups, as well as high susceptibility to media events by search queries, social media, and web encyclopaedias. | Correlation analysis (59/162), regression models (46/162). Machine learning (27/162), statistical models (20/162). Manual analysis (18/162), topic analysis (12/162). Deep learning (10/162), linguistic analysis (10/162). Rule-based techniques ( | Updating keywords to reflect changing search behaviours and health trends. Susceptibility of SM content to media events. Creation of standard datasets to improve method development. |
| Calvo 2017 [ | What NLP methods used on user generated data in mental health? | Mental health | Objectives of studies, data sources, features extracted | Triaging MH issues seems like a great use but need to find how to react to it in practice. Ethics/privacy issues. Very interdisciplinary. | LIWC most widely used both for feature extraction and sentiment analysis. Good methods often a combination of methods/algorithms. Lots of different tools/techniques available; could not determine whether any one was superior. | Need to do research into using NLP in different languages. Also think about how to make contact with people identified during the process as being at risk of mental health problems. |
| Castillo-Sanchez 2020 [ | What ML techniques used to predict suicide from SM data? | Mental Health | Methods, Tools, Techniques | Text classification main objective for 75%. 8/16 studies report explicit datamining techniques. 10/16 using SVM. Papers not reporting time spans of data collection, or number of participants. | LIWC, LDA, LSA for feature extraction, sentiment analysis | Other languages. Use annotated corpus. Develop new tools. Do temporal studies. |
| Charles-Smith 2015 [ | Can SM be used for disease surveillance? Or to test interventions to improve health outcomes? | Infectious disease | Correlation between social media data and national health statistics. Prediction times. Topic / theme identification. Influence on health behaviours. | Earlier prediction of outbreaks. Correlation with existing methods. Topic modelling good for broad topics, but not for lower frequency themes. Lots of gaps in knowledge. Need to look for ways to incorporate SM into PH surveillance. | Topic modelling (LDA). Query selection and thematic analysis to detect lower frequency topics. | Work on who uses what types of social media, so as to get representative data. SM platforms/ preferences change. |
| Cheerkoot-Jalim 2020 [ | Identify the text mining approaches, tools used in biomedical text. Who benefits? Application areas? What are the challenges? | Any | Data Sources, Techniques, Tools and Potential Beneficiaries of research | Looked at who could benefit from SGOPE research | MetaMap, UMLS used - mainly on EHRs and biomed literature. NLP methods; NER and relationship extraction. | Big data paradigms, methods that can scale with the volume of text. Methods of standardising data across sources. Improving accuracy. |
| Convertino 2018 [ | Summarise strategies, assess quality of information, potential for early detection from SM. | ADR | Sources, study population, drug Proto-ADE pairs, clinical features, extraction method. | Lots of potential to complement existing regulatory agencies. But utility, validity and implementation are all under-studied. Need standardised methods. Fast moving field. No causality assessment so far. | Keywords, dictionary most popular 37/38. | More work to improve methods. Use in conjunction with other signal detection methods. |
| Demner-Fushman 2016 [ | Improvements in NLP on patient language, and new opportunities. | Any | SM as a source for quality assessment. Methods | Much more to be done both in clinical and SM NLP. Research moving from capturing trends to addressing individual health-related posts, thus showing potential to become a tool for precision medicine and a valuable addition to the standard healthcare quality evaluation tools. | Sentiment analysis. Rule-based RegEx or supervised event extraction most used. More work needed on semantic processing. Using sentences better than words. | Need more publicly available clinical datasets. Work on semantics. Work on porting pipelines across domains. Collaboration between NLP research and EHR suppliers. |
| Dobrossy 2020 [ | Assess volume, participants and content of SM data about breast screening. Potential for patient education. | Breast Cancer | Platforms, volume of discourse, participant roles, discourse content, themes. | Looked at age, role of user types, and the content of the posts. Good source to understand beliefs, attitudes, and literacy of the target population. | NS | NS |
| Dol 2019 [ | How health researchers are using SM data. | Any | Journals, study country, first author discipline, health topic covered, platforms, study purpose. | 81/414 analysing content. Biggest use was recruitment. Generally seen as positive but concerns re ethics. | NS | Need methods to optimise usage and demonstrate potential. |
| Dreisbach 2019 [ | Using NLP methods to extract symptoms from SM text | Symptoms | Study purpose, data source, symptom categorisation, evaluation, and performance metrics | Pain and fatigue most evaluated symptoms. Variety of sources. NLP primary methodology for 15/21 papers. Current focus on extraction of terms. Need to share lexicons to move forward. | 21 papers: 14 NLP, 6 text mining, 1 NLP + TM. No breakdown of type of methods. | Future research should consider the needs of patients expressed through ePAT and its relevance to symptom science. Understanding the role that ePAT plays in health communication and real-time assessment of symptoms is critical to a patient-centred health system. |
| Drewniak 2020 [ | Does SGOPE research have quantifiable risks or benefits for patients, relatives, or HCPs? | Any | Purposes of the narrative: inform, engage, model behaviour, persuade, comfort | Generally positive benefits although potential risks from misinformation | NS | Future research is needed to define the optimal standards for quantitative approaches to narrative-based interventions. |
| Edo-Osagie 2020 [ | Current uses of Twitter data in public health | Any | Conditions, data sources, analysis methods, geographical and time trends | Twitter a good data source for 6 aspects of public health: surveillance, event detection, pharmacovigilance, forecasting, disease tracking and geographical identification. | Numerous | Unsupervised methods. Do research into less studied areas |
| Falisi 2017 [ | What role does SM play in the health of breast cancer survivors? | Breast cancer | Platforms, ethnicity of study population, analysis method, which aspects analysed, connection between SM content and health outcome. | Focus on psychosocial wellbeing. Mostly online support forums/ message boards. Few non-Caucasian. Content analyses of social media interactions prevalent, but few articles linked content to health outcomes | 40/98 did content analysis. Some manual / some M/L. Pre 2011 = LIWC, post 2011 = LDA etc. 37 quant. 3 qual | Should consider connecting SM content to psychosocial, behavioural, and physical health outcomes. None of the content analysis articles attempted to do this. |
| Filannino 2018 [ | What tasks and methods included in the shared tasks? | Any | Task description, data type, data source, dataset size, best performance, measure. | NER & classification the most used tasks. Clear trend to data-driven solutions. Need more and varied datasets to explore. | NER and classification most common tasks. | Bigger and more varied datasets to share |
| Fung 2016 [ | What research questions and methods used on Ebola related social media? | Infectious disease | Study design, qual or quant, study aim, data collection method, time frame, keywords used, analysis method, main findings, and limitations. | 12 papers: 8 from Twitter/ Weibo, 1 from Facebook, 3 from YouTube, and 1 from Instagram and Flickr. All studies were cross-sectional. 11/12 articles studied one or more of themes / topics of SM content, post meta-data and characteristics of the SM account. Twitter content analysis methods included text mining (n = 3) and manual coding (n = 1). Two studies involved mathematical modelling. YouTube / Instagram / Flickr studies used manual coding of videos and images. Published Ebola virus disease-related social media research focused on Twitter and YouTube. The utility of social media research to public health practitioners is warranted. No evaluation of the studies' utility performed. | Mix of manual coding and frequency analysis using LIWC. | Need a new checklist to appraise quality of SM papers. Future research in the direction of analysing multiple cross-sectional social media datasets or conducting prospective cohort studies of social media users will provide useful data for analysis of temporal change of social media contents or social media users’ behaviours. Need to bridge research and practice. |
| Gianfredi 2018 [ | Can SM be used for disease surveillance / predictions? Can they capture public reactions to epidemic outbreaks? | Infectious disease | Data source, disease, study period, geographical location, study purpose, type of analysis and main findings | Out of the 47 articles included, only 7 were focusing on neglected tropical diseases, while all the other covered communicable tropical/sub-tropical diseases, and the main determinant of this unbalanced coverage seems to be the media impact and resonance. | Qualitative, narrative analysis, content analysis, mathematical modelling, correlational analysis, geospatial. | Lots of gaps, possibly due to the media impact of the specific disease. Need further research into ways of integrating diverse data sources. |
| Giuntini 2020 [ | Sentiment and emotion analysis for identifying depressive disorders. What types of SM data? Which networks? Which methods? | Mental Health | Platform, type of SM, emotion or feeling detection, other disorders inferred, methodology | Most used media is text, then emoticons. Twitter most employed platform. Supervised methods with off-the-shelf classifiers combined with lexicons such as LIWC. | Supervised (NB, DT, SVM etc.) plus LIWC, NRC Word Emoticon, WordNet-Affect lexicons | More multidisciplinary studies. |
| Gohil 2018 [ | What sentiment analysis tools for Twitter / healthcare. Any health specific training, validation or justification | Any | Health area, sentiment towards, type of method, tool used, manual annotation sample size, sample size | Multiple methods mix of open source, commercial and bespoke tools. Very few tested for accuracy. | Sentiment analysis. Mix of tools. | This study suggests that there is a need for an accurate and tested tool for sentiment analysis of tweets trained using a health care setting–specific corpus of manually annotated tweets first. |
| Golder 2015 [ | Prevalence, frequency and value of ADR comments from SM | ADR | Data source type, ADR type, search strategy used, post selection, study aim, ADR prevalence, comparison method | 51 studies, discussion forums most used source type. ADR prevalence varied from 0.2 to 8%. General agreement that a higher frequency of adverse events was found in social media and that this was particularly true for ‘symptom’ related and ‘mild’ adverse events. | 8/12 used Consumer Health Vocab dictionary. Few evaluation methods | A cost-effectiveness analysis of all pharmacovigilance systems, including social media is urgently required. |
| Gonzalez-Hernandez 2017 [ | Show how NLP is developing in regard to capturing the patient perspective from unstructured text. | Any | Types of SM sites, analysis type, types of tasks. | Move from rule based to learning based systems. Work needed on noise reduction and normalisation/mapping. Shortage of annotated shared datasets. Shared tasks useful development tool. | Move from rule based to learning methods. Over 50% papers used lexical content analysis. In SM NLP: regex, LDA topic modelling. Supervised classification. Sentiment analysis | Normalisation of data, co-reference and temporal relation extraction. Need to create and release annotated datasets and targeted unlabelled data sets in distinct languages. |
| Gupta 2020 [ | What methods, sources, are used for SM based health surveillance. Potential applications, and challenges. | Any | ML Methods, Data Sources, Diseases, Limitations of SM systems | Twitter most used source (64%). SVM most used method (33%) - better at binary classification. | SVM, Decision trees, random forest, NB, Logistic Regression | Noise reduction, Combining SM with other data, theme detection, develop better predictive models for epidemic prediction. Only 3 studies included ethical debate. |
| Hamad 2016 [ | How is content analysis used in health-related SM studies? | ADR | Keywords and hashtags, sampling and data collection, analysis methods, validation, and presentation of results | Methods used were not purely quantitative or qualitative, and the mixed-methods design was not explicitly chosen for data collection and analysis. Proposes CCA analysis as straightforward method for Twitter analysis | Content analysis (quantitative and qualitative). Infoveillance. Combined content analysis (mix of mixed methods and content analysis) | NS |
| Ho 2016 [ | Compares omics, social media and EHRs as sources of ADR knowledge | ADR | Study aims, Data & Tool, Method | Data driven approach essential to detect /predict ADRs. Omics data, EHRs and SM all new opportunities. | Datamining. NLP, NER, ontology building. Classification to exclude noise. Aims to reduce false positive rate. Yang = mix of topic + classification. Classification to link effect to drug. UMLS & MetaMap | NS |
| Injadat 2016 [ | Techniques, areas, performance, comparison of techniques, strengths, and weaknesses of data mining methods. | Any | Domains, Techniques, Research objectives, Strengths, and weaknesses of techniques. | 19 data mining techniques used to address 9 different research objectives in 6 different industrial and services domains. Most used methods: SVM, NB & DT. Most used in business and social network analysis. Medical/health use only 8% | Datamining. SVM, BN, DT | Research into how techniques are implemented. Need more statistical tests of results. But - many of the tests applied required a normal distribution which was not the case. Health researchers not good about writing about the methods used. Could learn a lot from CRM and HRM domains. |
| Karmegan 2020 [ | Aims to analyse the possibility, effectiveness, and procedures of using SM data to understand the emotional and psychological impact of unforeseen disaster on the community. | Mental Health | Platform, methods | Twitter most used source. Sentiment analysis used for psychological surveillance. Could not conclude that any one method was superior. | Feature extraction using classification algorithms. Sentiment analysis | Combine text and image processing. Incorporate social network analysis with post content. |
| Kim 2017 [ | How SM data can be used to understand communication and behavioural patterns of nonmedical or problematic use of prescription drugs | Substance misuse | User characteristics, communication characteristics, outcomes, methodological domain, ethical domain | See lots of potential, but more work needed. | Mixture: manual, qualitative, supervised / non supervised ML to identify themes, patterns, sentiment. | Lots more - sees their review as a base to build on. Identified a lack of theoretical framework for substance misuse monitoring. Consequences of SM engagement understudied. |
| Lafferty 2015 [ | How is SM being used in psychiatry? Tools, benefits, and challenges. | Mental health | SM as data, methodological considerations, ethical considerations, SM for recruitment | Observational, real time patient experiences. Can help with development of practice, policy, and provision. Opportunities for co-creation of research, patient centric care. | Grounded theory, Social network analysis | Ethical issues. Analyse SM data through different socio-cultural lenses to build theoretical frameworks. |
| Lardon 2015 [ | Can SM be a new source of knowledge for pharmacovigilance? | ADR | Language, data source, data volume, methods, lexicon | For the identification theme, all 11 papers used manual methods. Identified heterogeneity of methods, but also gaps. Included studies failed to assess the completeness, quality, or reliability of the data. | RQ1: All manual / mixed. RQ2: Web scraping, pre-processing, various rule-based methods. | Additional studies are required to precisely determine the role of social media in the pharmacovigilance system. Need methods to assess data quality. |
| Lau 2019 [ | 2018 SOTA of opportunities, challenges, and implication of AI in health informatics | Any | NS | Few 2018 papers reported Artificial Intelligence (AI) research for patients and consumers. No studies that elicited patient and consumer input on AI. Most common use is secondary analysis of social media data (e.g., online discussion forums). The 3 best papers shared a common methodology of using data-driven algorithms (such as text mining, topic modelling, Latent Dirichlet allocation modelling), combined with insight-led approaches (e.g., visualisation, qualitative analysis, and manual review), to uncover patient and consumer experiences of health and illness in online communities. There is a lack of direction and evidence on how AI could actually benefit patients and consumers. | Best papers shared a common methodology of using data-driven algorithms (such as text mining, topic modelling, Latent Dirichlet allocation modelling), combined with insight-led approaches (e.g., visualisation, qualitative analysis, and manual review), to uncover patient and consumer experiences of health and illness in online communities | See what patients want from AI in health. More patient involvement to ensure that research is asking the right questions. |
| Lopez-Castroman 2019 [ | Detecting suicide ideation from SM | Mental health | NS | Early days, but SM has important role in suicide prevention. Lots more work needed. | Various: Sentiment analysis, topic modelling, data mining | Add demographic data to text to improve results. |
| Mavragani 2020 [ | Current state of SM based infodemiology. Validity of methods and research gaps. | Any | Timeline & journals. Data sources, Health topics, Advantages & Disadvantages of SM data | JMIR most used journal. Increasing interest since 2018. Twitter most used platform. Most researched subjects were conditions/diseases, epidemics, healthcare, drugs, smoking/alcohol. | NS | Combine SM data with traditional sources for more complete assessment. |
| Neveol 2017 [ | Best clinical NLP papers of 2016 | Any | Applications of NLP, Directions of progress | Developing applications rather than methods. Starting on the more complex tasks e.g., semantics, coreference resolution, and discourse analysis. | Classification of useful sentences, Information extraction, abbreviation disambiguation, coreference resolution, grounding of gradable adjectives | NS |
| Neveol 2018 [ | Summarize recent research / best papers for clinical NLP in 2017 | Any | NLP of SM data, NLP of HCP text, methods | 2017 trends - revisiting old problems such as SM classification and negation with deep learning & neural nets. Production of annotated corpora. Continuing applications rather than methods. Beginning of deep learning. Start of language variants. | Negation detection, corpus annotation, deep learning. | Work in other languages. Increase generalisability. |
| Patel 2015 [ | Categorise & summarise existing papers about chronic disease outcomes from SM. Suggest framework for future research. | Chronic | Platform, Taxonomy category, disease, study aim, study design, sample size & description, Method summary, SM effect | 85% either Facebook or blogs. 40% for support (social, emotional, or experiential). | Quantitative, Thematic qualitative, Content analysis. | Understand how disease, patient factors and tech can interact to improve outcomes. Reduce potential for bias. Target studies to specific diseases might be the best way to improve clinical care. |
| Pourebrahim 2020 [ | Datamining methods for ADR detection from SM | ADR | Analysis and evaluation metrics | SM good for early identification of ADRs. Three main stages; Pre-processing, feature extraction and classification | Supervised, regression, unsupervised | NS |
| Qiao 2020 [ | Overview of SM studies relating to mental disorder detection. | Mental Health | Platforms, collection methods, feature extraction, algorithms, evaluation metrics | Facebook, Twitter, Reddit, Tumblr, Instagram. Most used supervised methods, especially SVM | SVM, Decision trees, random forest, NB, Logistic Regression | Develop systems with lower computational cost to increase speed. Multi-language systems. |
| Ru & Yao 2019 [ | SGOPE data - methods/analysis opportunities and challenges | Any | Data type, volume, pre-processing method, analysis method, health outcomes | Variety of methods. Outcomes included side effects / effectiveness / adherence / hrqol | NER, mapping, identify concepts, text mining (Ngram, LDA, topic modelling), content analysis, hypothesis testing, supervised, unsupervised | Suggested further research on treatment effectiveness, adverse drug events, perceived value of treatment, and health-related quality of life. The challenge lies in the further improvement and customization of text mining methods. Only 6 discussed ethics. |
| Santos 2019 [ | Numbers of papers / journals, countries / databases, methods/tools, which public health issues looked at | Public health | Year, Journal, Study purpose, health area, techniques, software/ programming language, study country | Results showed a slight increase in the number of papers published in 2014 and a significant increase since 2017, focusing mostly on infectious, parasitic, and communicable diseases, chronic diseases, and risk factors for chronic diseases. JMIR and PLoS ONE published the highest number of papers. Support Vector Machines (SVM) were the most common technique, while R and WEKA were the most common programming language and software application, respectively. The U.S. was the most common country where the studies were conducted. In addition, Twitter was the most frequently used source of data by researchers. | SVM, Decision trees, random forest, NB, most used techniques. R, WEKA, and Python most used languages/ apps. | In depth analysis of variations in techniques (deep learning / ensemble etc.) |
| Sarker 2019 [ | Look at existing methods of SM based medication abuse or misuse, propose new data centric pipeline. | Substance misuse | Data source, dataset size, medication studied, study objectives, methods, and findings. | 39 studies, 80% published since 2015. Twitter most used source. Earlier studies manual qualitative, but growing trend towards NLP methods. | Supervised, unsupervised | Develop shared annotation guidelines and annotated datasets. Will help the direct project and enable comparison across methods. Show agreement for manual annotation. Reduce noise in data. |
| Sharma 2016 [ | Identify and highlight research issues and methods used in studying Complementary and Alternative Medicine (CAM) information needs, access, and exchange over the Internet. | CAM | NS | Significant interest in developing methodologies for identifying CAM treatments, including the analysis of search query data and social media platform discussions | Qualitative, thematic, content analysis, keyword searches, regex, Consumer health vocabulary | Little work done on using SGOPE to understand CAM user’s perspectives / prevalence of CAM use. Lots more work required. |
| Sharma 2020 [ | Can sentiment analysis be conducted on social media platforms to understand public sentiment held towards pharmacotherapy? | Any | Author, Year, Journal, data source, conditions, pharmacotherapy, SA method used, potential clinical use. | Lack of consistent approach. Opinion on particular medication (7/10) and ADRs (3/10) Lexicon based more used than ML for sentiment. (Lexicon 6, ML 1). Combining SA with other ADR methods improved results. Lots of untapped potential. | Lexicon, ML. Combining | No gold standard methods yet. Early stage of development. Accuracy rarely assessed. |
| Sinnenberg 2017 [ | How and why health researchers using Twitter? | Health research | Ways Twitter data used by researchers, ways that Twitter platform used in health research, Publication date, research topic, ethics, and funding | The primary approaches for using Twitter in health research that constitute a new taxonomy were content analysis (56%; | Content analysis, network analysis | Future work should develop standardized reporting guidelines for health researchers who use Twitter and policies that address privacy and ethical concerns in social media research. New opportunities to characterise users from metadata such as demographics. |
| Skaik 2020 [ | Recent trends and tools for using social media posts to predict mental disorders using ML and NLP methods. Identifying research gaps. | Mental Health | Collection methods, applications, best practices, and gaps | 25 papers looking at population level mental health classification techniques. 15/25 depressive disorders, 10/25 suicide-ideation. Twitter most used data source, SVM most used model. Heterogeneity of methods and feature selection. | Models: SVM, Ensemble, LR, RF, DT, LSTM. Features: WEKA, LDA, TF-IDF, Sentiment, Lexical, Syntactic, Demographics, Word embedding, Topic modelling | Improve identification of risk factors. |
| Staccini 2017 [ | Uses and challenges for secondary use of health data | Any | Data donation, uses of SGOPE data | Secondary use of patient data (apart from personal health care record data) can be expressed according to many ways. Requirements to allow this secondary use should be harmonized between countries, and social media platforms can be efficiently used to explore and create knowledge on patient experience with health problems or activities. Machine learning algorithms can explore those massive amounts of data to support health care professionals, and institutions provide more accurate knowledge about use and usage, behaviour, sentiment, or satisfaction about health care delivery. | NS | Very early days, lots to work on. Socio-ethical concerns, increased adoption in health care. Need to check AI /SM is asking the right questions. Need a formal framework for consent and secondary use of data. Far from massive adoption in health practice. |
| Su 2020 [ | Deep learning in Mental Health | Mental Health | Methods, Tools, Techniques | A growing number of studies using DL models for studying mental health outcomes. Particularly, multiple studies have developed disease risk prediction models using both clinical and non-clinical data and have achieved promising initial results. Lots of potential but lots of challenges | CNN, RNN, Autoencoders | Reduce bias, improve methods |
| Tricco 2018 [ | Using SGOPE for ADR detection. Types / characteristics of platforms? How valid or reliable are the conversations? | ADR | Data sources, document characteristics, health conditions, methods, types of listening system, outcome results | 46/70 documents (66%) described an automated or semi-automated information extraction system to detect health product AEs from social media conversations (in the developmental phase). Seven pre-existing information extraction systems to mine social media data were identified in eight documents. 19/70 documents compared SM reported AEs with validated data: consistent AE discovery in 17/19. No evaluation of methods or reliability. | Supervised 15/70, Rule based 6/70, unsupervised 4/70, deep learning 1/70, other ML 5/70, Manual or NA 32/70. Dictionary/ lexicon based most used. | Further research is required to strengthen and standardize the approaches as well as to ensure that the findings are valid, for the purpose of pharmacovigilance. Studies required to look at uses / utility over a longer time period. Need standardised methods. Fast moving field. |
| Vilar 2018 [ | To review datamining as a method of detecting drug-drug interactions from pharmacovigilance sources, scientific literature. Challenges and limitations compared. | ADR | Data source, methods | SGOPE offers new possibilities for identifying DDIs. Current emphasis has been on ADRs not DDIs. | Dictionary matching, association mining, supervised LR. | More studies are necessary to really prove and understand the potential of social media resources and their role in pharmacovigilance. |
| Wilson 2015 [ | Understanding how blogs could be used for qualitative health research | Any | Geographical location, study aims, how data used in health research. | Used for data collection and recruitment. Good for accessing out of reach populations. Potential for significant improvement of health equity. Sees blogs as ‘central part of global transformation’. Need to develop knowledge and skills to take advantage of this new resource. | Purely qualitative | Look for innovative methods to develop qualitative research. |
| Wong 2018 [ | To review methods of identifying adverse events from free text | ADR | Definition of NLP tasks, evaluation metrics, challenges in applying NLP to medication safety, data source, methods | Time saving/ real time. Limited by lack of data sharing inhibiting large-scale monitoring across populations. SM good for groups such as children, pregnant women, often not included in trials. Data is Pt reported outcomes, values / preferences - more patient focused. | Supervised, CRF classifier, unsupervised k-means clustering. Linguistic based, standardising text with UMLS. Statistical based. | Integrate data sources from different domains to improve ADR detection. Ethical issues. Increased volume of open-source data. |
| Wongkoblap 2017 [ | Scope & limitations of new predictive method using SM. Ethical concerns. | Mental health | Key characteristics, data collection techniques, data pre-processing, feature extraction, feature selection, model construction, and model verification. | Methods work across languages. Despite an increasing number of studies investigating mental health issues using social network data, some common problems persist. Assembling large, high-quality datasets of social media users with mental disorder is problematic, not only due to biases associated with the collection methods, but also with regard to managing consent and selecting appropriate analytics techniques. | Most common method was text analysis with LIWC. Sentiment analysis. Supervised / predictive models. Only 1/58 used deep learning. | Move towards open science standards - share datasets / workflow / code. Ethical aspects of using SM data not clearly defined. Lack of models for detecting stress or anxiety disorders. Combining SM content with confirmed patients rather than self-reported ones. Network analysis to investigate prevalence. |
| Yin 2019 [ | To systematically review the effectiveness of applying machine learning (ML) methodologies to UGC for personal health investigations. | Any | Methods, Objectives, Data Source, Health issue, Language, Dataset size | 103 eligible studies, summarized with respect to 5 research categories, 3 data collection strategies, 3 gold standard dataset creation methods, and 4 types of features applied in ML models. Popular off-the-shelf ML models were logistic regression ( | Logistic regression (22), SVM (18), Naïve Bayes (17), ensemble learning (12), deep learning (11) | Ethical aspects of analysing personally contributed data, bias induced when building study cohorts and dealing with natural language, interpretation of modelling results, and reliability of the findings. |
| Zhang 2018 [ | Consideration of Twitter as a data source for health researchers. | Any | Research design, collection techniques, analytic methods, tools, author’s opinion on Twitter as a health research method. | 17 papers: Quantitative (n = 2), qualitative (n = 7), and mixed methods ( | Qualitative, quantitative, mixed methods | Creates new questions about data collection, verification, ethics for researchers. |
| Zhang 2020 [ | Role of SM, themes and methods used in SM based public health research. | Any | Publication trends, themes, role of SM, research methods | Growing number of publications and journals including studies. | Still mostly qual or quant, with little use of computational methods. | Need to develop the methodological potential. |
| Zunic 2020 [ | Data sources, roles, motivations, and demographics of posters. Topic areas. Practical applications, methods and current performance levels of sentiment analysis. | Any | Data sources, role of post author, demographic features recorded, health area, ML algorithms used for SA, classification performance, lexical resources | 86 studies. Majority of data from social networking/ Web-based retailing platforms. Primary purpose of online conversations is information exchange/social support. Communities tend to form around health conditions with high severity / chronicity rates. Topics include medications, vaccination, surgery, orthodontic services, individual physicians, and health care services in general. 5 poster roles identified: sufferer, addict, patient, carer, and suicide victim. Only 4 reported demographic characteristics. Many methods used for SA. Mainly supervised. Only 1 study used deep learning. Performance less than achieved by general sentiment analysis methods. F-score, below 60% on average. Few domain-specific corpora and lexica are shared publicly for research purposes. Unclear if performance issues are because of the intrinsic differences between the domains and their respective sublanguages, the size of training datasets, the lack of domain-specific sentiment lexica, or the choice of algorithms. | Sentiment analysis. Mix of tools. A wide range of methods were used to perform SA. Most common choices included support vector machines, naïve Bayesian learning, decision trees, logistic regression, and adaptive boosting. Only 1 study used deep learning. | Improved methods. Performance less than achieved by general sentiment analysis methods. Lack of domain specific datasets / lexicons. Need to create and share large, anonymised domain specific datasets. More inclusion of demographic data. |
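
Several of the reviews summarised above report lexicon-based sentiment analysis (for example with LIWC) as a common approach. As a minimal sketch of that general idea, and not the method of any reviewed study, the Python below scores a post by summing word polarities from a lexicon and flipping the polarity of a word that directly follows a negator. The lexicon `TOY_LEXICON`, the `NEGATORS` set, and the function names are hypothetical and were invented purely for illustration.

```python
import re

# Hypothetical toy lexicon, invented for illustration only. Real studies use
# curated resources such as LIWC rather than a hand-written word list.
TOY_LEXICON = {
    "better": 1, "relief": 1, "helped": 1, "great": 1,
    "worse": -1, "pain": -1, "nausea": -1, "awful": -1,
}
# Assumed negation cues; a word directly after one of these has its sign flipped.
NEGATORS = {"not", "no", "never"}


def sentiment_score(post: str) -> int:
    """Sum lexicon polarities over the tokens of a post."""
    tokens = re.findall(r"[a-z']+", post.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = TOY_LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score


def sentiment_label(post: str) -> str:
    """Map the numeric score to a coarse three-way label."""
    score = sentiment_score(post)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for text in ("The new tablets helped and I feel better",
                 "No relief, just nausea and pain"):
        print(sentiment_label(text), "<-", text)
```

A general-purpose word list like this one has no knowledge of health-specific language, which is one plausible reason the reviews call for domain-specific lexicons and annotated corpora.
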
Shared dataset NLP challenges since 2015
| Event | Data Source | Task | No. of tweets / posts | Best result | Methods used | Data availability |
|---|---|---|---|---|---|---|
| 2015 CLPsych | Twitter | Binary classification of users based on depression / PTSD. 1. Depression vs control 2. PTSD vs control 3. Depression vs PTSD | 7.857 million | Average precision 80% | SVM / TF-IDF weighting | With IRB approval & privacy agreement |
| 2016 CLPsych | ReachOut forum | Classify triage level (1–4) for professional support | 65,024 | F1 42% | Variety of classifiers | With IRB approval & privacy agreement |
| 2017 CLPsych | ReachOut forum | Classify triage level (1–4) for professional support | 157,963 | F1 46.7% | Variety of classifiers | With IRB approval & privacy agreement |
| 2016 SMM | Twitter | 1. Classify ADRs. 2. Map to UMLS (NER). 3. Concept normalisation | 10,822 | 1. F1 42% 2. F1 61% 3. No result | Random forest (n-gram), CRF | Yes |
| 2017 SMM | Twitter | 1. Classify ADRs. 2. Classify drug intake. 3. Concept normalisation | 15,717 training, 9,961 testing | 1. F1 43.5% 2. F1 69.3% 3. Accuracy 88.5% | SVM, CNN, LR / deep learning | Yes |
| 2017 NTCIR-13 | Label disease / symptoms | 2560 (English, Japanese & Chinese) | Exact match accuracy of 88% | Hierarchical attention networks (HAN) plus CNNs | Training data only |
Adapted from [39, 44]
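
The best results in the table above are reported mostly as F1 scores, with precision and accuracy also appearing. As a reminder of how these metrics relate, the sketch below computes precision, recall, F1, and accuracy for a binary task. The function names and the example labels are hypothetical and invented for illustration, and the shared tasks themselves may have used averaged (for example macro- or micro-averaged) variants rather than this plain binary form.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Return (precision, recall, F1) for the class given by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(1 for g, p in zip(gold, predicted) if g == p) / len(gold)


if __name__ == "__main__":
    # Invented labels: 1 = post mentions an adverse reaction, 0 = it does not.
    gold = [1, 1, 1, 0, 0, 0, 0, 0]
    pred = [1, 0, 0, 1, 0, 0, 0, 0]
    p, r, f1 = precision_recall_f1(gold, pred)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} "
          f"accuracy={accuracy(gold, pred):.3f}")
```

On these invented labels the classifier is right on five of eight posts, yet F1 for the positive class is lower than the accuracy because two of the three true positives are missed. This illustrates why F1 and accuracy figures in the table are not directly comparable.
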