Bethany Percha (1,2), Yuhao Zhang (2), Selen Bozkurt (3), Daniel Rubin (4), Russ B Altman (5,6), Curtis P Langlotz (4).
1. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
2. Biomedical Informatics Training Program, Stanford University, Stanford, CA, USA.
3. Department of Biostatistics and Medical Informatics, Akdeniz University Faculty of Medicine, Antalya, Turkey.
4. Department of Radiology, Stanford University, Stanford, CA, USA.
5. Department of Medicine, Stanford University, Stanford, CA, USA.
6. Department of Genetics and Bioengineering, Stanford University, Stanford, CA, USA.
Abstract
Objective: Distributional semantics algorithms, which learn vector space representations of words and phrases from large corpora, identify related terms based on contextual usage patterns. We hypothesize that distributional semantics can speed up lexicon expansion in a clinical domain, radiology, by unearthing synonyms from the corpus.

Materials and Methods: We apply word2vec, a distributional semantics software package, to the text of radiology notes to identify synonyms for RadLex, a structured lexicon of radiology terms. We stratify performance by term category, term frequency, number of tokens in the term, vector magnitude, and the context window used in vector building.

Results: Ranking candidates based on distributional similarity to a target term results in high curation efficiency: on a ranked list of 775 249 terms, more than 50% of synonyms occurred within the first 25 terms. Synonyms are easier to find if the target term is a phrase rather than a single word, if it occurs at least 100 times in the corpus, and if its vector magnitude is between 4 and 5. Synonyms are also easier to identify for some RadLex categories, such as anatomical substances, than for others.

Discussion: The unstructured text of clinical notes contains a wealth of information about human diseases and treatment patterns. However, searching and retrieving information from clinical notes often suffer from variation in how similar concepts are described in the text. Biomedical lexicons address this challenge, but are expensive to produce and maintain. Distributional semantics algorithms can assist lexicon curation, saving researchers time and money.
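The ranking step described in the Results can be illustrated with a minimal sketch. It assumes cosine similarity as the measure of distributional similarity, which the abstract does not specify, and it uses small hand-written toy vectors in place of embeddings trained with word2vec. The function names (`magnitude`, `cosine_similarity`, `rank_candidates`), the example terms, and all vector values are hypothetical and are not taken from the study.

```python
import math


def magnitude(vec):
    """Euclidean length of a vector (the 'vector magnitude' of a term)."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = magnitude(a) * magnitude(b)
    if denom == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denom


def rank_candidates(target, vectors):
    """Return (term, similarity) pairs for every other term, most similar first."""
    target_vec = vectors[target]
    scored = [
        (term, cosine_similarity(target_vec, vec))
        for term, vec in vectors.items()
        if term != target
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors, invented for illustration only.
# Real word2vec embeddings typically have far more dimensions.
toy_vectors = {
    "pleural_effusion": [0.9, 0.1, 0.3],
    "pleural_fluid": [0.8, 0.2, 0.35],
    "fracture": [0.1, 0.9, 0.0],
    "normal": [0.0, 0.2, 0.9],
}

ranked = rank_candidates("pleural_effusion", toy_vectors)

# A curator would review only the head of the list, e.g. the first 25 terms.
top_terms = [term for term, _ in ranked[:25]]
print(top_terms)
```

In this sketch a curator reviews candidates from the top of `ranked` downward, so the earlier a true synonym appears in the list, the less reviewing is needed to find it.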