Word embeddings trained on published case reports are lightweight, effective for clinical tasks, and free of protected health information.

Zachary N Flamholz, Andrew Crane-Droesch, Lyle H Ungar, Gary E Weissman.

Abstract

OBJECTIVE: Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings.
MATERIALS AND METHODS: We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested embeddings in six clinically relevant tasks, including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively.
RESULTS: Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45) while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings de-identified notes most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%). Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS).
DISCUSSION: Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks and so the optimal training corpus depends on intended use.
CONCLUSION: Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.
Copyright © 2021 Elsevier Inc. All rights reserved.
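
Note on the scaled Brier score: the abstract reports mortality prediction as SBS but does not spell out the formula. A common definition, assumed here and not confirmed by the abstract, is SBS = 1 - BS / BS_ref, where BS is the mean squared difference between predicted probabilities and binary outcomes, and BS_ref is the Brier score of a reference model that always predicts the observed event rate. Under that assumption, 1 is a perfect prediction, 0 matches the reference model, and negative values are worse than it. The function names below are illustrative and are not taken from the paper.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) == 0 or len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def scaled_brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Scaled Brier score, assuming the definition 1 - BS / BS_ref.

    BS_ref is the Brier score of a reference model that predicts the
    observed event rate for every case.
    """
    event_rate = sum(y_true) / len(y_true)
    bs_ref = brier_score(y_true, [event_rate] * len(y_true))
    if bs_ref == 0:
        # All outcomes are identical, so the reference model is already perfect.
        raise ValueError("scaled Brier score is undefined when all outcomes are equal")
    return 1.0 - brier_score(y_true, y_prob) / bs_ref


if __name__ == "__main__":
    # Toy data for illustration only; these are not values from the study.
    outcomes = [1, 0, 0, 1]
    predictions = [0.8, 0.2, 0.3, 0.6]
    print(round(scaled_brier_score(outcomes, predictions), 2))
```

Read this way, the reported SBS of 0.30 for UPHS embeddings would mean a Brier score 30% lower than that of the event-rate reference model, and the Wikipedia confidence interval crossing zero would mean performance not clearly better than that reference.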

Keywords:  Clinical informatics; Natural language processing; Protected health information; Word embeddings

Year:  2021        PMID: 34920127      PMCID: PMC8766939          DOI: 10.1016/j.jbi.2021.103971

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317

