An unsupervised and customizable misspelling generator for mining noisy health-related text sources.

Abeed Sarker, Graciela Gonzalez-Hernandez.

Abstract

BACKGROUND: Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources.
MATERIALS AND METHODS: The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system.
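The recursive generate-and-filter loop described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the `generate_variants` function, the `TOY_NEIGHBORS` lookup table (a stand-in for nearest-neighbor queries against a dense vector model), the normalized edit-distance similarity, and the 0.75 threshold are all assumptions made for the example. The weighting of intra-word character sequence similarities is not covered.

```python
from collections import deque
from typing import Callable, Iterable, Set


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lexical_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def generate_variants(
    seed: str,
    neighbors: Callable[[str], Iterable[str]],
    threshold: float = 0.75,  # assumed value, for illustration only
) -> Set[str]:
    """Expand from the seed until no new lexically similar neighbors appear."""
    variants: Set[str] = set()
    seen = {seed}
    queue = deque([seed])
    while queue:  # converges when a pass adds no new terms
        term = queue.popleft()
        for candidate in neighbors(term):
            if candidate in seen:
                continue
            seen.add(candidate)
            # Keep only semantic neighbors that are also close in spelling
            # to the original seed keyword.
            if lexical_similarity(seed, candidate) >= threshold:
                variants.add(candidate)
                queue.append(candidate)
    return variants


# Hypothetical semantic neighbors; a real system would query a vector model.
TOY_NEIGHBORS = {
    "metformin": ["metformine", "insulin", "metphormin"],
    "metformine": ["metformin", "metforman"],
    "metphormin": ["metformin", "glipizide"],
    "metforman": ["metformin"],
}

found = generate_variants("metformin", lambda t: TOY_NEIGHBORS.get(t, []))
print(sorted(found))
```

In this toy run, "insulin" and "glipizide" are semantically related to the seed but fail the lexical filter, while "metforman" is reached only through a second round of expansion from "metformine".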
RESULTS: On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator, with a best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms demonstrated an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included.
DISCUSSION: Our proposed spelling variant generator has several advantages over past spelling variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations.
CONCLUSION: The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.
Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.

Keywords:  Clinical notes; Data science; Misspelling generation; Natural language processing; Social media; Spelling variant generation; Text mining

Year:  2018        PMID: 30445220      PMCID: PMC6322919          DOI: 10.1016/j.jbi.2018.11.007

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


