Dmitriy Dligach1,2,3, Majid Afshar2,3, Timothy Miller4. 1. Department of Computer Science, Loyola University Chicago, Chicago, Illinois, USA. 2. Department of Public Health Sciences, Stritch School of Medicine, Loyola University, Maywood, Illinois, USA. 3. Center for Health Outcomes and Informatics Research, Loyola University, Maywood, Illinois, USA. 4. Computational Health Informatics Program (CHIP), Boston Children's Hospital and Harvard Medical School, Boston, Massachusetts, USA.
Abstract
OBJECTIVE: Our objective is to develop algorithms for encoding clinical text into representations that can be used for a variety of phenotyping tasks. MATERIALS AND METHODS: Obtaining large datasets to take advantage of highly expressive deep learning methods is difficult in clinical natural language processing (NLP). We address this difficulty by pretraining a clinical text encoder on billing code data, which is typically available in abundance. We explore several neural encoder architectures and deploy the text representations obtained from these encoders in the context of clinical text classification tasks. While our ultimate goal is learning a universal clinical text encoder, we also experiment with training a phenotype-specific encoder. A universal encoder would be more practical, but a phenotype-specific encoder could perform better for a specific task. RESULTS: We successfully train several clinical text encoders, establish a new state of the art on comorbidity data, and observe good performance gains on substance misuse data. DISCUSSION: We find that pretraining using billing codes is a promising research direction. The representations generated by this type of pretraining have universal properties, as they are highly beneficial for many phenotyping tasks. Phenotype-specific pretraining is a viable route for trading the generality of the pretrained encoder for better performance on a specific phenotyping task. CONCLUSIONS: We successfully applied our approach to many phenotyping tasks. We conclude by discussing potential limitations of our approach.
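The two-stage setup described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all dimensions, data, and the bag-of-words encoder are hypothetical stand-ins. Stage 1 pretrains an encoder on a multi-label billing-code prediction objective; stage 2 freezes the encoder and reuses its note representations as features for a downstream phenotype classifier.

```python
# Hypothetical sketch of billing-code pretraining followed by phenotype
# classification; synthetic data and toy dimensions, not the paper's model.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 100 notes as bag-of-words vectors over a 50-term vocabulary,
# each note tagged with a multi-hot vector over 10 billing codes.
X = rng.random((100, 50))
codes = (rng.random((100, 10)) > 0.7).astype(float)

# Stage 1: pretrain a tanh encoder plus a sigmoid billing-code head with
# gradient descent on binary cross-entropy over the codes.
W_enc = rng.normal(scale=0.1, size=(50, 16))   # encoder weights
W_head = rng.normal(scale=0.1, size=(16, 10))  # billing-code head
lr = 0.1
for _ in range(200):
    H = np.tanh(X @ W_enc)            # note representations
    P = sigmoid(H @ W_head)           # predicted code probabilities
    G = (P - codes) / len(X)          # BCE gradient at the logits
    grad_head = H.T @ G
    grad_enc = X.T @ ((G @ W_head.T) * (1.0 - H**2))  # backprop through tanh
    W_head -= lr * grad_head
    W_enc -= lr * grad_enc

# Stage 2: freeze the encoder; its representations become input features
# for a downstream phenotype classifier (here, logistic regression).
features = np.tanh(X @ W_enc)
phenotype = codes[:, 0]                        # stand-in phenotype label
w = np.zeros(16)
for _ in range(200):
    p = sigmoid(features @ w)
    w -= lr * features.T @ (p - phenotype) / len(X)

acc = np.mean((sigmoid(features @ w) > 0.5) == phenotype)
```

A phenotype-specific variant, as the abstract notes, would instead pretrain the encoder only on billing codes related to the target phenotype, trading generality for task-specific performance.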