Herman D Tolentino, Michael D Matters, Wikke Walop, Barbara Law, Wesley Tong, Fang Liu, Paul Fontelo, Katrin Kohl, Daniel C Payne.
Abstract
BACKGROUND: The Institute of Medicine has identified patient safety as a key goal for health care in the United States. Detecting vaccine adverse events is an important public health activity that contributes to patient safety. Reports about adverse events following immunization (AEFI) from surveillance systems contain free-text components that can be analyzed using natural language processing. To extract Unified Medical Language System (UMLS) concepts from free text and classify AEFI reports based on concepts they contain, we first needed to clean the text by expanding abbreviations and shortcuts and correcting spelling errors. Our objective in this paper was to create a UMLS-based spelling error correction tool as a first step in the natural language processing (NLP) pipeline for AEFI reports.
Year: 2007 PMID: 17295907 PMCID: PMC1805499 DOI: 10.1186/1472-6947-7-3
Source DB: PubMed Journal: BMC Med Inform Decis Mak ISSN: 1472-6947 Impact factor: 2.796
Figure 1. Framework for concept extraction using the UMLS. The natural language processing pipeline for this project is made up of the steps in this diagram. For this paper, we are focusing on step 1, spelling correction.
Column descriptions of the custom dictionary
| Column | Description |
|---|---|
| word_id | Unique identifier |
| word_str | Dictionary word |
| word_ngram | Bigrams of the dictionary word. Example: "pediatrician" would have the following bigrams: |
| word_metaphone | The metaphone value of the dictionary word. Example: pediatrician would have the metaphone |
| word_header | The first 4 characters of the word. Example: "pediatrician" would have the header |
| word_anterior | The 4 characters after the first character of the dictionary word |
| word_posterior | The 4 characters before the last character of the dictionary word |
| word_fragment | If the dictionary word is longer than 10 characters the first 10 characters of the dictionary word |
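The lookup columns described above can be derived mechanically from each dictionary word. The following is a minimal Python sketch of that derivation, not the authors' code: the function names `bigrams` and `dictionary_keys` are hypothetical, the slicing follows the column descriptions literally, and `word_metaphone` is left out because it would require a separate phonetic-encoding routine that the record does not specify.

```python
def bigrams(word):
    """Return the overlapping two-character substrings of a word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dictionary_keys(word):
    """Derive the lookup columns for one dictionary word (hypothetical helper).

    word_metaphone is intentionally omitted: it needs a phonetic encoder.
    """
    w = word.lower()
    return {
        "word_str": w,
        "word_ngram": bigrams(w),
        "word_header": w[:4],         # first 4 characters
        "word_anterior": w[1:5],      # 4 characters after the first character
        "word_posterior": w[-5:-1],   # 4 characters before the last character
        "word_fragment": w[:10] if len(w) > 10 else None,  # only for long words
    }


if __name__ == "__main__":
    for column, value in dictionary_keys("pediatrician").items():
        print(column, value)
```

Precomputing these keys once per dictionary word lets candidate lookup become a set of exact-match queries on short strings rather than a scan of the whole dictionary.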
Figure 2. Screenshots from the spell checker showing sample regular expression code and its effect on free text. This figure combines two screenshots. Box A shows the interface that displays the changes made to free text by regular expressions. Box B shows examples of regular expression code that changes abbreviated month names to their full form. Note how several instances of the abbreviated month "Nov" are detected and converted to the full form "November".
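The month expansion described in the caption can be illustrated with a short Python sketch. The actual expressions used by the tool are not reproduced in this record, so the pattern, the `MONTHS` mapping, and the `expand_months` function below are assumptions chosen only to show the idea; the match is case-sensitive and restricted to whole words so that full month names are left untouched.

```python
import re

# Hypothetical mapping; "May" is omitted because it is already in full form.
MONTHS = {
    "Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April",
    "Jun": "June", "Jul": "July", "Aug": "August", "Sep": "September",
    "Oct": "October", "Nov": "November", "Dec": "December",
}

# \b on both sides keeps "Nov" inside "November" from matching.
_MONTH_PATTERN = re.compile(r"\b(" + "|".join(MONTHS) + r")\b")


def expand_months(text):
    """Replace whole-word month abbreviations with their full names."""
    return _MONTH_PATTERN.sub(lambda m: MONTHS[m.group(1)], text)


if __name__ == "__main__":
    print(expand_months("Vaccinated Nov 3; fever noted Nov 5."))
```

A production rule set would likely also need to handle trailing periods and lowercase variants, which this sketch deliberately ignores.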
Figure 3. Stages of spelling correction.
Mean contribution of algorithms to word list generation
| N-Gram | 20% | 13% |
| Header | 55% | 59% |
| Metaphone | 8% | 4% |
| Transposition | 1% | 3% |
| Deletion | 5% | 6% |
| Substitution | 5% | 5% |
| Insertion | 6% | 10% |
| TOTAL | 100% | 100% |
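Four of the word list generators named in the table (transposition, deletion, substitution, insertion) correspond to single-character edits. The Python sketch below shows one common way to generate such variants and keep only those found in a dictionary. It is an illustration under assumptions rather than the authors' implementation: the function names and the toy dictionary are hypothetical, each edit is applied to the misspelled word, and the record does not say whether the tool names its edits relative to the misspelling or to the dictionary word.

```python
import string


def single_edit_variants(word, alphabet=string.ascii_lowercase):
    """Group all strings one edit away from `word` by edit type."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    variants = {
        "transposition": {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1},
        "deletion": {a + b[1:] for a, b in splits if b},
        "substitution": {a + c + b[1:] for a, b in splits if b for c in alphabet},
        "insertion": {a + c + b for a, b in splits for c in alphabet},
    }
    # Drop the unchanged word, which substitution and transposition can reproduce.
    return {name: found - {word} for name, found in variants.items()}


def candidates(misspelling, dictionary):
    """Map each dictionary word reachable by one edit to the edit types that reach it."""
    found = {}
    for name, variants in single_edit_variants(misspelling).items():
        for word in variants & dictionary:
            found.setdefault(word, set()).add(name)
    return found


if __name__ == "__main__":
    toy_dictionary = {"fever", "never", "rash", "fevers"}  # hypothetical word list
    print(candidates("fevre", toy_dictionary))
```

Recording which generator produced each candidate is what makes a per-algorithm contribution breakdown like the one in the table possible.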
Mean contribution of algorithms to word sense disambiguation (smoothing) and ranking
| Concept | 12% | 13% |
| Homonym | 1% | 1% |
| N-Gram | 55% | 53% |
| Metaphone | 5% | 4% |
| Length | 14% | 14% |
| Part-of-speech | 10% | 11% |
| History | 3% | 4% |
| TOTAL | 100% | 100% |
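N-gram similarity accounts for the largest share of the ranking step in the table above. The record does not state which similarity measure the tool uses, so the Python sketch below substitutes the Dice coefficient over bigram sets as a stand-in; `dice_similarity` and `rank_candidates` are hypothetical names, and the tie-breaking by length difference is likewise an assumption loosely motivated by the "Length" row.

```python
def bigram_set(word):
    """Return the set of overlapping two-character substrings of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(a, b):
    """Dice coefficient over bigram sets: 2 * |X & Y| / (|X| + |Y|)."""
    x, y = bigram_set(a), bigram_set(b)
    if not x and not y:
        # Words shorter than two characters have no bigrams.
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def rank_candidates(misspelling, candidate_words):
    """Order candidates by similarity, then by closeness in length, then alphabetically."""
    return sorted(
        candidate_words,
        key=lambda c: (-dice_similarity(misspelling, c),
                       abs(len(c) - len(misspelling)),
                       c),
    )


if __name__ == "__main__":
    print(rank_candidates("fevre", ["never", "fever", "fear"]))
```

In a fuller system this score would be one feature among several (concept match, part of speech, correction history) rather than the sole ranking criterion.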
Spell checker performance measures
| Sensitivity (%) | 93 | 93 | 94 | 74 | 74 | 75 |
| Specificity (%) | 100 | 100 | 100 | 100 | 100 | 100 |
| Recovery (%) | 85 | 84 | 85 | 68 | 67 | 69 |
| Positive predictive value (%) | 64 | 63 | 65 | 47 | 46 | 48 |
| Regular expression transformations | 1,217 (10%) | | | 770 (9%) | | |
| Words corrected | 105 (1%) | | | 68 (1%) | | |
| Processing time per word (seconds) | 0.07 | | | 0.06 | | |
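Sensitivity, specificity, and positive predictive value in the table follow the standard confusion-matrix definitions. The Python sketch below states them explicitly; the counts in the usage example are invented purely for illustration and are not taken from the study. "Recovery" is not defined in this record, so it is not computed here.

```python
def sensitivity(tp, fn):
    """Share of actual misspellings that were flagged: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of correctly spelled words left unflagged: TN / (TN + FP)."""
    return tn / (tn + fp)


def positive_predictive_value(tp, fp):
    """Share of flagged words that really were misspelled: TP / (TP + FP)."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the formulas.
    tp, fn, fp, tn = 90, 10, 30, 9870
    print(f"sensitivity: {sensitivity(tp, fn):.2%}")
    print(f"specificity: {specificity(tn, fp):.2%}")
    print(f"PPV:         {positive_predictive_value(tp, fp):.2%}")
```

Because correctly spelled words vastly outnumber misspellings in free text, specificity can round to 100% even when false positives are numerous enough to pull the positive predictive value well below that, which is consistent with the pattern in the table.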