Alan R Shapiro. Department of Medicine, New York University School of Medicine, 5 Pheasant Run, Pleasantville, NY 10570, USA. alan.shapiro@med.nyu.edu
Abstract
INTRODUCTION: Use of free text in syndromic surveillance requires managing the substantial word variation that results from use of synonyms, abbreviations, acronyms, truncations, concatenations, misspellings, and typographic errors. Failure to detect these variations results in missed cases, and traditional methods for capturing these variations require ongoing, labor-intensive maintenance.
OBJECTIVES: This paper examines the problem of word variation in chief-complaint data and explores three semi-automated approaches for addressing it.
METHODS: Approximately 6 million chief complaints from patients reporting to emergency departments at 54 hospitals were analyzed. A method of text normalization that models the similarities between words was developed to manage the linguistic variability in chief complaints. Three approaches based on this method were investigated: 1) automated correction of spelling and typographical errors; 2) use of International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes to select chief complaints to mine for overlooked vocabulary; and 3) identification of overlooked vocabulary by matching words that appeared in similar contexts.
RESULTS: The prevalence of word errors was high. For example, such words as diarrhea, nausea, and vomiting were misspelled 11.0%-18.8% of the time. Approximately 20% of all words were abbreviations or acronyms whose use varied substantially by site. Two methods, use of ICD-9-CM codes to focus searches and the automated pairing of words by context, both retrieved relevant but previously unexpected words. Text normalization simultaneously reduced the number of false positives and false negatives in syndrome classification, compared with commonly used methods based on word stems. In approximately 25% of instances, using text normalization to detect lower respiratory syndrome would have improved the sensitivity of current word-stem approaches by approximately 10%-20%.
CONCLUSIONS: Incomplete vocabulary and word errors can have a substantial impact on the retrieval performance of free-text syndromic surveillance systems. The text normalization methods described in this paper can reduce the effects of these problems.
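The first approach named in METHODS, automated correction of spelling and typographical errors, can be illustrated with a minimal sketch. It is not the method developed in the paper, whose details the abstract does not give. It assumes a small hypothetical vocabulary (CANONICAL) and uses the generic string-similarity ratio from Python's standard difflib module, with an arbitrary cutoff of 0.75, to map a misspelled word to its closest canonical term.

```python
import difflib

# Hypothetical canonical vocabulary, for illustration only; a real
# surveillance system would need a far larger, curated term list.
CANONICAL = ["diarrhea", "nausea", "vomiting", "fever", "cough"]


def normalize_word(word, vocabulary=CANONICAL, cutoff=0.75):
    """Return the most similar canonical term, or the lowercased word
    unchanged when nothing in the vocabulary is similar enough.

    The cutoff is an assumed value, not one taken from the paper.
    """
    word = word.lower()
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def normalize_complaint(text):
    """Normalize each whitespace-separated word of a chief complaint."""
    return " ".join(normalize_word(w) for w in text.split())


if __name__ == "__main__":
    print(normalize_complaint("Vomitting and diarhea"))
```

A similarity cutoff trades missed variants against false corrections: a lower value catches more distorted spellings but risks merging unrelated short words, which is one reason abbreviations and acronyms generally need separate handling.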