
The Importance of Debiasing Social Media Data to Better Understand E-Cigarette-Related Attitudes and Behaviors.

Jon-Patrick Allem; Emilio Ferrara

Abstract

Keywords:  Internet; Twitter; electronic cigarettes; social media; surveillance

Year:  2016        PMID: 27507563      PMCID: PMC5037931          DOI: 10.2196/jmir.6185

Source DB:  PubMed          Journal:  J Med Internet Res        ISSN: 1438-8871            Impact factor:   5.428


In a recent issue of JMIR, Kim and colleagues described a framework for data collection, quality assessment, and reporting standards for social media data used in health research [1]. The authors’ framework was based on two principles: retrieval precision, or “how much of retrieved data is relevant,” and retrieval recall, or “how much of the relevant data is retrieved.” The authors suggested that, with in-depth knowledge of the subject matter under investigation and refinement of keywords into reliable search filters, irrelevant content could be weeded out and high-quality data collection assured. Using the topic of electronic cigarettes (e-cigarettes), as discussed on Twitter, as a case study, the authors demonstrated how reporting standards could be made systematic and transparent. While the authors cogently argued for better reporting standards, and their principles of retrieval precision and retrieval recall were thoughtfully laid out, they overlooked the importance of identifying the sources of the content captured during data collection. For example, Twitter has quickly become subject to third-party manipulation, with automated accounts created by industry groups and private companies that aim to influence discussions and promote specific ideas or products [2]. This fact is absent from the framework of Kim and colleagues [1], and under their principle of retrieval precision, researchers could classify tweets about e-cigarettes as high-quality data regardless of their origin. Recent research has suggested that between 70% and 80% of tweets mentioning e-cigarettes stem from automated accounts [3]. Studies that use tweets to gain insight into individual-level attitudes and behaviors therefore face data with substantial bias and noise.
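The two principles from Kim and colleagues can be stated directly as ratios. A minimal sketch; the tweet counts are illustrative and do not come from the study:

```python
def retrieval_precision(retrieved_relevant, retrieved_total):
    """How much of the retrieved data is relevant."""
    return retrieved_relevant / retrieved_total

def retrieval_recall(retrieved_relevant, relevant_total):
    """How much of the relevant data is retrieved."""
    return retrieved_relevant / relevant_total

# Illustrative numbers: a keyword filter returns 1,000 tweets,
# 600 of which are truly about e-cigarettes, out of 800 relevant
# tweets present in the full stream.
precision = retrieval_precision(600, 1000)  # 0.6
recall = retrieval_recall(600, 800)         # 0.75
```

Note that neither ratio says anything about who authored the retrieved tweets, which is the gap the letter identifies.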
Any results drawn from these data without de-noising preprocessing lose validity and significance. Ignoring this bias in Twitter data would be akin to a public health researcher, in a survey-based study on tobacco-related attitudes, ignoring the fact that 700 of the 1000 participants happened to be gainfully employed by a tobacco company. The survey researcher would be forced to rethink their sampling frame, and the same dilemma applies to the social media researcher relying on Twitter as a data source. We propose herein that appropriate analyses be implemented to obtain valid data sets that remove sources of bias and noise before applying the framework of Kim and colleagues. The Twitter screen name responsible for each tweet in a data set should be obtained, and each account’s recent history, interactions, and metadata should be analyzed to determine whether the account is a social bot: a computer algorithm designed to automatically produce content and engage with humans on Twitter [2]. These social bots are meant to appear to be individuals operating Twitter accounts, complete with metadata (name, location, pithy quote) and a photo or image. Tweets from these accounts pollute social and health research data sets and need to be identified and removed. Programs like “Bot or Not?” [2] use a classification system that groups each Twitter account’s features into six main classes: Network (diffusion patterns), User (metadata), Friends (account contacts), Temporal (tweet rate), Content (language features), and Sentiment (message content). This classification system ultimately generates a score on a spectrum that can be used to determine the likelihood that any one account is a social bot. If an account is identified as a social bot, then that account and any tweets it produced should be removed from the data set.
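The proposed debiasing step amounts to scoring each account and dropping tweets from likely bots before analysis. A sketch under stated assumptions: `bot_score` is a placeholder standing in for a real classifier such as “Bot or Not?”, and the 0.5 threshold is an assumption, not a value from the letter:

```python
# Threshold on the bot-likelihood score; 0.5 is an assumed cutoff.
BOT_THRESHOLD = 0.5

def bot_score(account):
    # Placeholder: a real system would combine network, user, friends,
    # temporal, content, and sentiment features into a score in [0, 1].
    return account.get("score", 0.0)

def debias(tweets):
    """Keep only tweets whose author scores below the bot threshold."""
    return [t for t in tweets if bot_score(t["account"]) < BOT_THRESHOLD]

# Toy data set: one promotional bot, one human user.
tweets = [
    {"text": "quit smoking with brand X!",
     "account": {"name": "promo_bot", "score": 0.92}},
    {"text": "trying an e-cig for the first time",
     "account": {"name": "jane", "score": 0.08}},
]
clean = debias(tweets)  # only the human-authored tweet survives
```

The key design point is that filtering happens at the account level, not the tweet level: one scoring pass over the distinct screen names in a data set classifies every tweet those accounts produced.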
This platform is freely available, easy to use, and has been shown to reduce bias and noise in data sets in earlier studies led by computer scientists [2]. Using Twitter to examine e-cigarette-related discussion is a novel approach; however, the signal-to-noise ratio has become increasingly low [3]. In other words, the proportion of content representative of individuals’ perceptions, sentiments, and behavior is low compared with the content from social bots. Prior studies have attempted to increase the signal-to-noise ratio with crude techniques (eg, removing any tweet accompanied by a URL [4]). However, this and other blunt approaches (eg, methods relying solely on community detection, or on innocent-by-association paradigms in which an account interacting with a human user is considered human) result in misclassification (eg, removing a valid tweet from the data set simply because it was accompanied by a URL, or keeping an invalid tweet because a human interacted with the account it originated from) [5]. The debiasing techniques proposed herein can be used to overcome these limitations. Social bots are only one source of bias in studies of Twitter posts. For example, the population of Twitter users overrepresents young people and ethnic minority groups compared with the general population of the United States. This source of bias cannot be easily resolved by machine algorithms, and correcting such biases should be a focus of future research. The use of social bots is not confined to discussions of e-cigarettes; bots have been found to infiltrate political discourse, manipulate the stock market, acquire personal information, and disseminate misinformation [5].
“Bot or Not?” is not a perfect system for bot detection; however, it achieves a detection accuracy above 95%, suggesting that bias from the inappropriate removal of legitimate accounts is minimal, especially compared with earlier approaches [5]. Researchers need to take advantage of the resources designed to reliably identify and remove the third-party accounts responsible for the noise in social media data. Once debiasing techniques have been applied, frameworks for data collection, quality assessment, and reporting standards for social media data used in health research should be employed.
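A back-of-the-envelope calculation shows why an imperfect detector still helps at the bot prevalence reported above. Assumptions: ~75% of e-cigarette tweets come from bots (the midpoint of the 70%–80% range in [3]), and the 95% accuracy figure applies both to catching bots and to correctly retaining humans, which is a simplification:

```python
# Composition of a toy 1,000-tweet data set before and after filtering.
total = 1000
bots = 0.75 * total            # 750 bot tweets (assumed prevalence)
humans = total - bots          # 250 human tweets

bots_kept = bots * 0.05        # missed bots that survive the filter
humans_kept = humans * 0.95    # humans correctly retained

bot_fraction_after = bots_kept / (bots_kept + humans_kept)
# The bot share drops from 75% to roughly 14% after filtering.
```

Even under these simplified assumptions, the cleaned data set is dominated by human-authored tweets rather than bot content, which is the condition the letter argues must hold before applying the framework of Kim and colleagues.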
References:  3 in total

1.  Garbage in, Garbage Out: Data Collection, Quality Assessment and Reporting Standards for Social Media Data Use in Health Research, Infodemiology and Digital Disease Detection.

Authors:  Yoonsang Kim; Jidong Huang; Sherry Emery
Journal:  J Med Internet Res       Date:  2016-02-26       Impact factor: 5.428

2.  Vaporous Marketing: Uncovering Pervasive Electronic Cigarette Advertisements on Twitter.

Authors:  Eric M Clark; Chris A Jones; Jake Ryland Williams; Allison N Kurti; Mitchell Craig Norotsky; Christopher M Danforth; Peter Sheridan Dodds
Journal:  PLoS One       Date:  2016-07-13       Impact factor: 3.240

3.  A cross-sectional examination of marketing of electronic cigarettes on Twitter.

Authors:  Jidong Huang; Rachel Kornfield; Glen Szczypka; Sherry L Emery
Journal:  Tob Control       Date:  2014-07       Impact factor: 7.552

