Rachel Chasin (1), Anna Rumshisky (2), Ozlem Uzuner (3), Peter Szolovits (1)
1. Massachusetts Institute of Technology, Cambridge, Massachusetts, USA.
2. Department of Computer Science, University of Massachusetts, Lowell, Massachusetts, USA.
3. Department of Information Studies, University at Albany, SUNY, Albany, New York, USA.
Abstract
OBJECTIVE: To evaluate state-of-the-art unsupervised methods on the word sense disambiguation (WSD) task in the clinical domain. In particular, to compare graph-based approaches relying on a clinical knowledge base with bottom-up topic-modeling-based approaches. We investigate several enhancements to the topic-modeling techniques that use domain-specific knowledge sources.

MATERIALS AND METHODS: The graph-based methods use variations of PageRank and distance-based similarity metrics, operating over the Unified Medical Language System (UMLS). Topic-modeling methods use unlabeled data from the Multiparameter Intelligent Monitoring in Intensive Care (MIMIC II) database to derive models for each ambiguous word. We investigate the impact of using different linguistic features for topic models, including UMLS-based and syntactic features. We use a sense-tagged clinical dataset from the Mayo Clinic for evaluation.

RESULTS: The topic-modeling methods achieve 66.9% accuracy on a subset of the Mayo Clinic's data, while the graph-based methods only reach the 40-50% range, with a most-frequent-sense baseline of 56.5%. Features derived from the UMLS semantic type and concept hierarchies do not produce a gain over bag-of-words features in the topic models, but identifying phrases from UMLS and using syntax does help.

DISCUSSION: Although topic models outperform graph-based methods, semantic features derived from the UMLS prove too noisy to improve performance beyond bag-of-words.

CONCLUSIONS: Topic modeling for WSD provides superior results in the clinical domain; however, integration of knowledge remains to be effectively exploited.

Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
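The results are reported against a most-frequent-sense baseline, that is, the accuracy obtained by always predicting each ambiguous word's most common sense. The abstract does not describe how that baseline was computed, so the following is only a minimal sketch of one common way to do it. The function name, the sense labels, and the toy instances are hypothetical and are not taken from the Mayo Clinic dataset.

```python
from collections import Counter, defaultdict


def most_frequent_sense_baseline(instances):
    """Compute a most-frequent-sense baseline.

    instances: list of (ambiguous_word, gold_sense) pairs.
    Returns (majority, accuracy), where majority maps each word to its
    most common gold sense and accuracy is the fraction of instances
    labeled correctly when that sense is always predicted.
    """
    if not instances:
        raise ValueError("need at least one labeled instance")

    # Count how often each gold sense occurs for each ambiguous word.
    senses_by_word = defaultdict(Counter)
    for word, sense in instances:
        senses_by_word[word][sense] += 1

    # The baseline prediction for a word is its most common sense.
    majority = {
        word: counts.most_common(1)[0][0]
        for word, counts in senses_by_word.items()
    }

    # Score the baseline over all instances.
    correct = sum(1 for word, sense in instances if majority[word] == sense)
    return majority, correct / len(instances)


# Hypothetical toy data, for illustration only.
toy_instances = [
    ("cold", "common_cold"),
    ("cold", "common_cold"),
    ("cold", "low_temperature"),
    ("discharge", "hospital_release"),
    ("discharge", "body_fluid"),
    ("discharge", "hospital_release"),
    ("discharge", "hospital_release"),
]

majority, accuracy = most_frequent_sense_baseline(toy_instances)
print(majority)
print(f"baseline accuracy: {accuracy:.3f}")
```

A disambiguation method is only informative if it beats this number, which is why the abstract reads the graph-based range of 40-50% as falling short of the 56.5% baseline and the topic-modeling result of 66.9% as exceeding it.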
Keywords: Clinical Language Processing; Medical Language Processing; Natural Language Processing; Word Sense Disambiguation