Zachary N Flamholz (1), Andrew Crane-Droesch (2), Lyle H Ungar (3), Gary E Weissman (4).
1. Medical Scientist Training Program, Albert Einstein College of Medicine, Bronx, NY, USA. Electronic address: zachary.flamholz@einsteinmed.edu.
2. Penn Medicine Predictive Healthcare, University of Pennsylvania Health System, Philadelphia, PA, USA; Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA.
3. Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA; Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA.
4. Palliative and Advanced Illness Research (PAIR) Center, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, University of Pennsylvania, Philadelphia, PA, USA; Pulmonary, Allergy, and Critical Care Division, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA.
Abstract
OBJECTIVE: Quantify tradeoffs in performance, reproducibility, and resource demands across several strategies for developing clinically relevant word embeddings.
MATERIALS AND METHODS: We trained separate embeddings on all full-text manuscripts in the PubMed Central (PMC) Open Access subset, case reports therein, the English Wikipedia corpus, the Medical Information Mart for Intensive Care (MIMIC) III dataset, and all notes in the University of Pennsylvania Health System (UPHS) electronic health record. We tested embeddings in six clinically relevant tasks, including mortality prediction and de-identification, and assessed performance using the scaled Brier score (SBS) and the proportion of notes successfully de-identified, respectively.
RESULTS: Embeddings from UPHS notes best predicted mortality (SBS 0.30, 95% CI 0.15 to 0.45) while Wikipedia embeddings performed worst (SBS 0.12, 95% CI -0.05 to 0.28). Wikipedia embeddings de-identified notes most consistently (78% of notes) and the full PMC corpus embeddings least consistently (48%). Across all six tasks, the full PMC corpus demonstrated the most consistent performance, and the Wikipedia corpus the least. Corpus size ranged from 49 million tokens (PMC case reports) to 10 billion (UPHS).
DISCUSSION: Embeddings trained on published case reports performed at least as well as embeddings trained on other corpora in most tasks, and clinical corpora consistently outperformed non-clinical corpora. No single corpus produced a strictly dominant set of embeddings across all tasks, so the optimal training corpus depends on intended use.
CONCLUSION: Embeddings trained on published case reports performed comparably on most clinical tasks to embeddings trained on larger corpora. Open access corpora allow training of clinically relevant, effective, and reproducible embeddings.
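The abstract reports mortality prediction performance as a scaled Brier score (SBS) but does not spell out the formula. The following is a minimal sketch, assuming the commonly used definition SBS = 1 - BS / BS_ref, where BS is the mean squared difference between predicted probabilities and observed binary outcomes and BS_ref is the Brier score of a reference model that always predicts the observed outcome prevalence. The function names and the toy data are hypothetical and are not taken from the study.

```python
from typing import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(y_true) == 0 or len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def scaled_brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Scaled Brier score under the assumed definition 1 - BS / BS_ref.

    BS_ref is the Brier score of a constant prediction equal to the
    observed outcome prevalence (an assumption; the abstract does not
    state which reference model the authors used).
    """
    prevalence = sum(y_true) / len(y_true)
    bs_ref = brier_score(y_true, [prevalence] * len(y_true))
    if bs_ref == 0:
        raise ValueError("SBS is undefined when all outcomes are identical")
    return 1.0 - brier_score(y_true, y_prob) / bs_ref


if __name__ == "__main__":
    # Hypothetical toy data: four patients, two of whom died.
    outcomes = [0, 0, 1, 1]
    predictions = [0.2, 0.2, 0.8, 0.8]
    print(scaled_brier_score(outcomes, predictions))
```

Under this assumed definition, a score of 1 corresponds to perfect predictions, 0 to a model no better than predicting the prevalence for everyone, and negative values to a model that does worse than that reference, which is how a confidence interval can extend below zero.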