Andrew L Beam, Benjamin Kompa, Allen Schmaltz, Inbar Fried, Griffin Weber, Nathan Palmer, Xu Shi, Tianxi Cai, Isaac S Kohane.
Abstract
Word embeddings are a popular approach to unsupervised learning of word relationships that is widely used in natural language processing. In this article, we present a new set of embeddings for medical concepts learned using an extremely large collection of multimodal medical data. Leaning on recent theoretical insights, we demonstrate how an insurance claims database of 60 million members, a collection of 20 million clinical notes, and 1.7 million full text biomedical journal articles can be combined to embed concepts into a common space, resulting in the largest ever set of embeddings for 108,477 medical concepts. To evaluate our approach, we present a new benchmark methodology based on statistical power specifically designed to test embeddings of medical concepts. Our approach, called cui2vec, attains state-of-the-art performance relative to previous methods in most instances. Finally, we provide a downloadable set of pre-trained embeddings for other researchers to use, as well as an online tool for interactive exploration of the cui2vec embeddings.
Year: 2020 PMID: 31797605 PMCID: PMC6922053
Source DB: PubMed Journal: Pac Symp Biocomput ISSN: 2335-6928
Comparison of GloVe, PCA, and word2vec for an embedding dimension of 500. Columns 1–4 report power to detect known relationships and column 5 reports the Spearman correlation between human assessments of concept similarity and cosine similarity from the embeddings. The best result for each benchmark/dataset combination is shown in bold. The claims dataset contained only diagnosis codes and no drugs, so no results are reported for it on the NDFRT benchmark.
| Data Source | Algorithm | Causative | Comorbidity | Semantic Type | NDFRT | Human Assessment |
|---|---|---|---|---|---|---|
| Claims | GloVe | 0.29 | - | | | |
| | PCA | 0.40 | 0.15 | 0.32 | - | 0.19 |
| | word2vec (SVD) | 0.54 | 0.50 | - | | |
| PMC Articles | GloVe | 0.59 | 0.57 | 0.28 | 0.54 | 0.60 |
| | PCA | 0.30 | 0.24 | 0.24 | 0.29 | 0.29 |
| | word2vec (SVD) | | | | | |
| | word2vec (original) | 0.75 | 0.51 | 0.48 | 0.74 | 0.59 |
| Clinical Notes | GloVe | 0.39 | 0.51 | 0.11 | 0.34 | |
| | PCA | 0.36 | 0.31 | 0.47 | 0.14 | 0.53 |
| | word2vec (SVD) | 0.52 | | | | |
| Combined Data | GloVe | 0.40 | 0.37 | 0.50 | 0.39 | |
| | PCA | 0.24 | 0.23 | 0.30 | 0.37 | 0.47 |
| | word2vec (SVD) | 0.52 | | | | |
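The Human Assessment column is described as the Spearman correlation between human similarity ratings and cosine similarity from the embeddings. A minimal sketch of that computation follows, using only the Python standard library. The concept names, vectors, and ratings are hypothetical toy values for illustration and are not taken from the cui2vec data or from any benchmark used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy embeddings keyed by made-up concept identifiers.
embeddings = {
    "concept_a": [1.0, 0.0, 0.0],
    "concept_b": [0.9, 0.1, 0.0],
    "concept_c": [0.0, 1.0, 0.0],
    "concept_d": [0.0, 0.0, 1.0],
    "concept_e": [0.5, 0.5, 0.0],
}

# Hypothetical human similarity ratings for concept pairs (higher = more similar).
rated_pairs = [
    ("concept_a", "concept_b", 4.0),
    ("concept_a", "concept_e", 3.0),
    ("concept_b", "concept_c", 1.0),
    ("concept_a", "concept_d", 2.0),
]

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [rating for _, _, rating in rated_pairs]
rho = spearman(model_scores, human_scores)
print(f"Spearman correlation: {rho:.2f}")
```

Because Spearman correlation depends only on ranks, it rewards embeddings that order concept pairs the way human raters do, regardless of the absolute scale of the cosine values.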
Fig. 1. Upset visualization of the intersection of medical concepts found in the insurance claims, clinical notes, and biomedical journal articles (PMC).
Comparison of the performance of cui2vec to previously published embeddings. Columns 1–4 report power to detect known relationships and column 5 reports the Spearman correlation between human assessments of concept similarity and cosine similarity from the embeddings. The best result for each comparison is shown in bold.
| Source | Causative | Comorbidity | NDFRT | Semantic Type | Human Assessment |
|---|---|---|---|---|---|
| Choi et al. (claims) | 0.25 | 0.63 | 0.24 | ||
| cui2vec | 0.31 | 0.35 | |||
| Choi et al. (notes) | 0.29 | 0.23 | 0.15 | 0.43 | |
| cui2vec | 0.42 | ||||
| Devine et al. (PMC abstracts) | 0.29 | 0.05 | 0.18 | 0.22 | 0.45 |
| cui2vec | |||||
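Columns 1–4 of both tables report power to detect known relationships. The record does not spell out the procedure, so the sketch below shows one plausible reading, stated as an assumption: build a null distribution of cosine similarities from randomly drawn concept pairs, take its 95th percentile as a cutoff, and report the fraction of known related pairs whose similarity exceeds that cutoff. The function names, the random-pair null, the 0.95 quantile, and the synthetic data are all illustrative choices, not the authors' implementation.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def null_threshold(embeddings, pool, n_draws=1000, quantile=0.95, seed=0):
    """Cutoff taken as a quantile of cosine similarity over random pairs from `pool`.

    Assumption: random pairs stand in for "unrelated" concepts.
    """
    rng = random.Random(seed)
    sims = []
    for _ in range(n_draws):
        a, b = rng.sample(pool, 2)
        sims.append(cosine(embeddings[a], embeddings[b]))
    sims.sort()
    return sims[min(int(quantile * n_draws), n_draws - 1)]


def power(embeddings, known_pairs, threshold):
    """Fraction of known related pairs whose similarity exceeds the cutoff."""
    hits = sum(
        1 for a, b in known_pairs
        if cosine(embeddings[a], embeddings[b]) > threshold
    )
    return hits / len(known_pairs)


# Synthetic toy data: 30 random "concepts" plus 10 near-copies that play
# the role of known related partners. All names are hypothetical.
data_rng = random.Random(42)
dim = 8
base = [f"base_{i}" for i in range(30)]
embeddings = {name: [data_rng.gauss(0, 1) for _ in range(dim)] for name in base}

known_pairs = []
for name in base[:10]:
    partner = name + "_related"
    embeddings[partner] = [x + 0.01 * data_rng.gauss(0, 1) for x in embeddings[name]]
    known_pairs.append((name, partner))

threshold = null_threshold(embeddings, pool=base)
detected = power(embeddings, known_pairs, threshold)
print(f"cutoff={threshold:.2f}, power={detected:.2f}")
```

In this toy setup the related partners are near-copies, so nearly all of them should clear the cutoff. With real embeddings the same quantity falls between 0 and 1, which is the scale of the values reported in columns 1–4.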