Youngduck Choi, Chill Yi-I Chiu, David Sontag.
Abstract
We show how to learn low-dimensional representations (embeddings) of a wide range of concepts in medicine, including diseases (e.g., ICD9 codes), medications, procedures, and laboratory tests. We expect that these embeddings will be useful across medical informatics for tasks such as cohort selection and patient summarization. These embeddings are learned using a technique called neural language modeling from the natural language processing community. However, rather than learning the embeddings solely from text, we show how to learn the embeddings from claims data, which is widely available both to providers and to payers. We also show that with a simple algorithmic adjustment, it is possible to learn medical concept embeddings in a privacy-preserving manner from co-occurrence counts derived from clinical narratives. Finally, we establish a methodological framework, arising from standard medical ontologies such as UMLS, NDF-RT, and CCS, to further investigate the embeddings and precisely characterize their quantitative properties.
Year: 2016 PMID: 27570647 PMCID: PMC5001761
Source DB: PubMed Journal: AMIA Jt Summits Transl Sci Proc
Figure 1. Illustration of a low-dimensional representation (in this case, 2 dimensions) of medical concepts. Similar concepts are close to each other in Euclidean space.
Figure 2. Illustration of the data used to learn embeddings of medical concepts, for a single patient.
Figure 3. Our simple modified algorithm takes as input temporal data such as that shown in Figure 2 (just one temporal record is shown, but typically there would be many, e.g. one per patient). It first partitions the data into time intervals of size Τ (here, 1/3 of a year). Then it removes duplicate concepts, and finally shuffles the concepts so that they are in a random order. Each partition is then treated as a single sentence, and stochastic gradient descent of a bilinear skip-gram model is performed using word2vec [10].
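The preprocessing described in the Figure 3 caption (partition by time interval, remove duplicates, shuffle) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `build_sentences`, the `(patient_id, day, concept)` tuple layout, the `ICD9:`/`NDC:`/`LOINC:` prefixes, and the 122-day interval (an approximation of 1/3 of a year) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_sentences(events, interval_days=122, seed=0):
    """Turn timestamped concept events into shuffled 'sentences'.

    events: iterable of (patient_id, day, concept) tuples (hypothetical layout).
    interval_days: assumed interval size T, here roughly 1/3 of a year.
    """
    rng = random.Random(seed)
    # A set per (patient, interval index) removes duplicate concepts.
    buckets = defaultdict(set)
    for patient_id, day, concept in events:
        buckets[(patient_id, day // interval_days)].add(concept)

    sentences = []
    for key in sorted(buckets):
        concepts = sorted(buckets[key])  # fixed order before shuffling
        rng.shuffle(concepts)            # discard within-interval ordering
        sentences.append(concepts)
    return sentences


# Toy record for a single patient; codes are borrowed from the tables in this record.
events = [
    ("p1", 3, "ICD9:710.0"),
    ("p1", 40, "NDC:00024156210"),
    ("p1", 41, "ICD9:710.0"),      # duplicate within the first interval
    ("p1", 200, "LOINC:4498-2"),   # falls into the second interval
]
sentences = build_sentences(events)
```

Each resulting list would then be handed to a skip-gram trainer as one sentence; that training step is left out here.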
Figure 4. Shown on the left is the input, a weighted graph where the nodes correspond to each concept. We sample edges with probability proportional to the edge weight, with replacement. This results in the intermediate representation shown on the right. Each line corresponds to two prediction tasks, where the goal is to predict one of the concepts given the other. The learning algorithm iterates through the intermediate representation and performs gradient updates to the embeddings to minimize the loss on the corresponding prediction problems.
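The edge-sampling step in the Figure 4 caption might look something like the sketch below. It assumes the co-occurrence graph is stored as a dictionary from concept pairs to counts; the name `sample_pairs` and the toy weights are hypothetical, and the gradient updates themselves are not shown.

```python
import random


def sample_pairs(edge_weights, num_samples, seed=0):
    """Sample edges with replacement, proportional to their weight.

    edge_weights: dict mapping (concept_a, concept_b) -> co-occurrence count.
    Returns the sampled edges and the two prediction tasks each edge yields.
    """
    rng = random.Random(seed)
    edges = list(edge_weights)
    weights = [edge_weights[edge] for edge in edges]
    # random.choices draws with replacement, proportional to the weights.
    sampled = rng.choices(edges, weights=weights, k=num_samples)

    tasks = []
    for a, b in sampled:
        tasks.append((a, b))  # predict b given a
        tasks.append((b, a))  # predict a given b
    return sampled, tasks


# Toy graph; the weights are made up for illustration.
edge_weights = {
    ("carcinoma, non-small-cell lung", "tarceva"): 30,
    ("carcinoma, non-small-cell lung", "lung mass"): 12,
    ("tarceva", "alimta"): 5,
}
sampled, tasks = sample_pairs(edge_weights, num_samples=10)
```

Working from sampled pairs means the learner only needs the aggregated counts, not the underlying narratives.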
Display of a sub-computation for MCSM(MCECN, Neoplastic Process, 8). The sub-computation concerns the neighborhood of the medical concept 4003436 (Carcinoma, non-small-cell lung). The medical concept type annotations are shown in the square brackets. The numerical values represent the cosine distance of the corresponding medical concept from the query 4003436.
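| Neighbors of CUI 4003436 (Carcinoma, non-small-cell lung) [‘Neoplastic Process’] |
|---|
| 4069419 (small cell carcinoma of lung, C0149925, [‘Neoplastic Process’]): 0.956 |
| 4394316 (carcinoma of lung, C0684249, [‘Neoplastic Process’]): 0.934 |
| 4125384 (malignant neoplasm of lung, C0242379, [‘Neoplastic Process’]): 0.929 |
| 4070138 (adenocarcinoma of lung (disorder), C0152013, [‘Neoplastic Process’]): 0.925 |
| 4555365 (tarceva, C1135136, [‘Organic Chemical’, ‘Pharmacologic Substance’]): 0.918 |
| 4069342 (lung mass, C0149726, [‘Finding’]): 0.914 |
| 4542086 (alimta, C1101816, [‘Organic Chemical’, ‘Pharmacologic Substance’]): 0.903 |
| 4148168 (non-small cell lung cancer metastatic, C0278987, [‘Neoplastic Process’]): 0.900 |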
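A neighbor list like the one above can be produced by ranking all other concepts by the cosine of the angle between their embedding and the query's. The sketch below assumes the reported values are cosine similarities (higher means closer), which is how the list appears to be ordered; the two-dimensional vectors and the helper names are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(embeddings, query, k):
    """Return the k concepts most similar to `query`, best first."""
    q = embeddings[query]
    scored = [
        (name, cosine(q, vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional embeddings; real embeddings would have many more dimensions.
embeddings = {
    "carcinoma, non-small-cell lung": [0.9, 0.1],
    "carcinoma of lung": [0.8, 0.2],
    "lung mass": [0.5, 0.5],
    "raynaud's syndrome": [-0.2, 0.9],
}
top = nearest_neighbors(embeddings, "carcinoma, non-small-cell lung", k=2)
```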
Medical conceptual similarity property comparison of MCEMJ and MCECN-SGD through MCSM. We display the evaluations of six different medical concept types with their standard deviations. Overall, we observe that MCEMJ has a stronger medical conceptual similarity property in comparison to MCECN-SGD.
| | MCSMUMLS(MCEMJ[ | MCSMUMLS (MCECN-SGD, -, 40) |
|---|---|---|
| Pharmacologic Substance | | 2.95 ± 2.15 |
| Disease or Syndrome | | 4.28 ± 1.60 |
| Neoplastic Process | | 4.54 ± 0.11 |
| Clinical Drug | | 0.12 ± 0.18 |
| Finding | | 2.15 ± 1.35 |
| Injury or Poisoning | 2.67 ± 2.40 | |
Display of a sub-computation for MRMNDF-RT(MCECN, May-Treat, 8, —). In contrast to Table 1, we now have the medical concepts 4555365 (tarceva) and 4542086 (alimta) highlighted, as they are medications that are used to treat lung cancers, which the NDF-RT May-Treat relation encodes. The use of NDF-RT allows us to quantitatively and quickly evaluate the Medical Relatedness Property on a large number of test cases.
| Neighbors of CUI 4003436 (Carcinoma, non-small-cell lung) [‘Neoplastic Process’] |
|---|
| 4069419 (small cell carcinoma of lung, C0149925, [‘Neoplastic Process’]): 0.956 |
| 4394316 (carcinoma of lung, C0684249, [‘Neoplastic Process’]): 0.934 |
| 4125384 (malignant neoplasm of lung, C0242379, [‘Neoplastic Process’]): 0.929 |
| 4070138 (adenocarcinoma of lung (disorder), C0152013, [‘Neoplastic Process’]): 0.925 |
| 4555365 (tarceva, C1135136, [‘Organic Chemical’, ‘Pharmacologic Substance’]): 0.918 |
| 4069342 (lung mass, C0149726, [‘Finding’]): 0.914 |
| 4542086 (alimta, C1101816, [‘Organic Chemical’, ‘Pharmacologic Substance’]): 0.903 |
| 4148168 (non-small cell lung cancer metastatic, C0278987, [‘Neoplastic Process’]): 0.900 |
The Medical Relatedness Property comparison of various embeddings through MRMNDF-RT. The results are of the form (neighbors/avg-seed/max-seed).
| | MRMNDF-RT(-, May Treat, 40, -) | MRMNDF-RT(-, May Prevent, 40, -) |
|---|---|---|
| MCEMJr=5,d=200 [ | 12.59% / 31.56% / 53.92% | 18.12% / |
| MCEMCmonth,ns20 | 10.93% / 28.67% / 57.01% | 5.88% / 29.45% / |
| MCEMCmonth,hs | 19.24% / | 8.82% / 30.20% / |
| MCECN-SGD1Bil,7d,ns20 | 36.81% / 33.94% / 57.48% | 27.94% / 30.42% / 45.59% |
| MCECN-SGD10Bil,7d,ns20 | 38.72% / 34.90% / 57.95% | 32.95% / 31.99% / 48.53% |
| MCECN-SVD7d,ns10 | | |
The Medical Relatedness Property comparison of various embeddings through MRMCCS.
| | MRMCCS (-, Fine-grained, 40) | MRMCCS (-, Coarse-grained, 40) |
|---|---|---|
| MCEMJr=5,d=200 [ | 0.2293 | 0.2490 |
| MCEMCmonth,ns20 | 0.4127 | 0.4422 |
| MCEMCmonth,hs | | |
| MCECN-SGD1Bil,7d,ns20 | 0.2966 | 0.3319 |
| MCECN-SGD10Bil,7d,ns20 | 0.3087 | 0.3420 |
| MCECN-SVD7d,ns10 | 0.3461 | 0.3776 |
A few neighborhood examples from MCECN illustrating genotypic-phenotypic relations.
| | (cd52, C2733653) | | (bcl1, C2599665) |
|---|---|---|---|
| 1 | (cd52 protein, human, C0376272) | 1 | (cyclins, C0079183) |
| 2 | (mycosis fungoides/sezary syndrome nos, C0862196) | 2 | (proliferating cell nuclear antigen, C0072108) |
| 3 | (t-cell receptor, C0034790) | 3 | (lymphoplasmacytic lymphoma, C2700641) |
| 4 | (lymphoma, t-cell, cutaneous, C0079773) | 4 | (paired box 5 protein, C0167636) |
| 5 | (pralatrexate, C1721300) | 5 | (cyclin d1, C0174680) |
| | | | |
| 1 | (refractory anemia with ringed sideroblasts, C1264195) | 1 | (mesothelioma, C0025500) |
| 2 | (large platelets (finding), C1148412) | 2 | (cdx2 protein, human, C1505661) |
| 3 | (anagrelide, C0051809) | 3 | (cdx2 antigen, C1829706) |
| 4 | (hypercellular bone marrow, C1334068) | 4 | (pleural mass, C1709576) |
| 5 | (myeloid metaplasia, C0027013) | 5 | (braf protein, human, C1259929) |
The neighborhood of the diagnosis code 710.0 in the MCEMC. We display the top 5 neighbors for each type of code, filtering duplicates.
| Nearest Neighbors of ICD9 710.0 (Systemic lupus erythematosus) in MCEMC | |
|---|---|
| | |
| 1 | 695.4 (Lupus erythematosus) |
| 2 | 710.9 (Unspecified diffuse connective tissue disease) |
| 3 | 710.2 (Sicca syndrome) |
| 4 | 795.79 (Other and unspecified nonspecific immunological findings) |
| 5 | 443.0 (Raynaud’s syndrome) |
| | |
| 1 | 4498-2 (Complement C4 in Serum or Plasma) |
| 2 | 4485-9 (Complement C3 in Serum or Plasma) |
| 3 | 5130-0 (DNA Double Strand Ab in Serum) |
| 4 | 14030-1 (Smith Extractable Nuclear Ab+Ribonucleoprotein Extractable Nuclear Ab in Serum) |
| 5 | 11090-8 (Smith Extractable Nuclear Ab in Serum) |
| | |
| 1 | 00378037301 (Hydroxychloroquine Sulfate 200mg) |
| 2 | 00024156210 (Plaquenil 200mg) |
| 3 | 51927105700 (Fluocinolone Acetonide Miscell Powder) |
| 4 | 00062331300 (All-flex Contraceptive Diaphragm Arcing Spring Ortho All-flex 80mm) |
| 5 | 00054412925 (Cyclophosphamide 25mg) |