| Literature DB >> 35365092 |
Bolin Wang, Yuanyuan Sun, Yonghe Chu, Di Zhao, Zhihao Yang, Jian Wang.
Abstract
BACKGROUND: Electronic medical records (EMR) contain detailed information about patient health. Developing an effective representation model is of great significance for downstream EMR applications. However, processing the data directly is difficult because EMR data are incomplete, unstructured, and redundant, so preprocessing the original data is a key step in EMR data mining. Classic distributed word representations ignore the geometric structure of the word vectors when representing EMR data: they often underestimate the similarity between similar words and overestimate the similarity between distant words. As a result, the word similarities obtained from embedding models are inconsistent with human judgment, and much valuable medical information is lost.
Keywords: Distributed word representation; Electronic medical records; Geometric structure; Manifold
Year: 2022 PMID: 35365092 PMCID: PMC8973530 DOI: 10.1186/s12859-022-04653-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Medical term pair similarity under different methods

| Medical term pair | UMNSRS-Sim (ground truth) | Glove | Ours |
|---|---|---|---|
| P1: "peripheral edema"; P2: "pulmonary edema" | sim(P1, P2) = 3.92 | sim(P1, P2) = 0.55 | sim(P1, P2) = 0.15 |
| P3: "kidney stone"; P4: "ureteral obstruction" | sim(P3, P4) = 4.69 | sim(P3, P4) = 0.37 | sim(P3, P4) = 0.32 |
Fig. 1The relationship between high-dimensional space and low-dimensional embedding
Pearson and Spearman correlation coefficient scores (%) between model predictions and human ratings on three evaluation datasets
| Method | MayoSRS Pearson | MayoSRS Spearman | UMNSRS-sim Pearson | UMNSRS-sim Spearman | UMNSRS-rel Pearson | UMNSRS-rel Spearman |
|---|---|---|---|---|---|---|
| BERT | 24.7 | 24.5 | 28.3 | 26.2 | 31.4 | 28.2 |
| Zhang | 62.5 | 61.1 | 64.9 | 62.5 | 57.0 | 57.0 |
| Chiu | 60.4 | 61.5 | 66.3 | 65.2 | 60.0 | 60.1 |
| ALBERT | 24.9 | 25.0 | 28.7 | 26.6 | 31.5 | 28.7 |
| BioBERT | 26.0 | 25.5 | 29.8 | 27.4 | 33.4 | 29.4 |
| BlueBERT | 26.5 | 27.6 | 31.2 | 28.9 | 33.9 | 30.4 |
| Ours | ||||||
Bold values denote the best result for each column of data
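The scores above correlate model-predicted similarities with human ratings. A minimal sketch of this evaluation protocol, assuming cosine similarity as the model's similarity measure; the vectors and rated pairs below are toy stand-ins, not data from the evaluation sets:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlation_scores(vectors, rated_pairs):
    """Correlate model similarities with human ratings (scores scaled to 0-100)."""
    preds, human = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            preds.append(cosine(vectors[w1], vectors[w2]))
            human.append(rating)
    return pearsonr(preds, human)[0] * 100, spearmanr(preds, human)[0] * 100

# Toy 2-d vectors and ratings, for illustration only
vecs = {"edema": np.array([1.0, 0.1]),
        "effusion": np.array([0.9, 0.2]),
        "stone": np.array([0.1, 1.0])}
pairs = [("edema", "effusion", 3.9), ("edema", "stone", 1.2),
         ("effusion", "stone", 1.0)]
p, s = correlation_scores(vecs, pairs)
```

Pairs with an out-of-vocabulary word are skipped rather than scored, a common convention in word-similarity benchmarks.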
Prediction performance of three basic models using different types of pre-trained word embeddings
| Method | Embedding | Macro AUC | Micro AUC | Macro F1 | Micro F1 | Test loss value | Top-10 recall |
|---|---|---|---|---|---|---|---|
| RNN | Random | 0.854 | 0.972 | 0.204 | 0.653 | | 0.772 |
| | FastText | 0.842 | 0.973 | 0.149 | 0.628 | 0.032 | 0.774 |
| | Glove | | 0.974 | | 0.656 | 0.031 | 0.788 |
| | Word2Vec | 0.851 | 0.974 | 0.165 | 0.642 | 0.031 | 0.783 |
| | BERT | 0.500 | 0.908 | 0.000 | 0.000 | 0.061 | 0.442 |
| | ALBERT | 0.503 | 0.915 | 0.026 | 0.018 | 0.054 | 0.446 |
| | BioBERT | 0.513 | 0.923 | 0.051 | 0.038 | 0.052 | 0.457 |
| | BlueBERT | 0.533 | 0.939 | 0.075 | 0.043 | 0.050 | 0.471 |
| | Ours | 0.857 | | 0.182 | | 0.030 | |
| CNN | Random | 0.825 | 0.968 | 0.214 | 0.626 | 0.040 | 0.753 |
| | FastText | 0.665 | 0.921 | 0.012 | 0.223 | 0.053 | 0.488 |
| | Glove | 0.842 | 0.972 | 0.188 | 0.622 | 0.034 | 0.767 |
| | Word2Vec | 0.692 | 0.925 | 0.021 | 0.313 | 0.052 | 0.492 |
| | BERT | 0.549 | 0.906 | 0.000 | 0.000 | | 0.442 |
| | ALBERT | 0.556 | 0.914 | 0.014 | 0.012 | 0.053 | 0.453 |
| | BioBERT | 0.559 | 0.921 | 0.015 | 0.041 | 0.047 | 0.459 |
| | BlueBERT | 0.567 | 0.929 | 0.021 | 0.047 | 0.042 | 0.464 |
| | Ours | | | | | 0.038 | |
| CAML | Random | 0.855 | 0.978 | 0.257 | 0.656 | 0.032 | 0.806 |
| | FastText | 0.856 | 0.980 | 0.270 | 0.656 | 0.031 | 0.809 |
| | Glove | 0.867 | 0.978 | 0.272 | 0.647 | | 0.801 |
| | Word2Vec | 0.855 | 0.980 | | 0.662 | 0.030 | 0.813 |
| | BERT | 0.497 | 0.908 | 0.000 | 0.000 | 0.058 | 0.442 |
| | ALBERT | 0.505 | 0.916 | 0.026 | 0.022 | 0.054 | 0.457 |
| | BioBERT | 0.513 | 0.924 | 0.045 | 0.041 | 0.048 | 0.465 |
| | BlueBERT | 0.534 | 0.934 | 0.060 | 0.076 | 0.042 | 0.478 |
| | Ours | | | 0.270 | | 0.029 | |
Bold values denote the best result for each row of data (%)
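The metrics in this table are standard multi-label measures for medical code prediction. A minimal sketch of how they can be computed with scikit-learn on toy labels and probabilities (not the paper's data); `top_k_recall` is a hypothetical helper for the Top-10 recall column, shown here with k = 1 on a 3-code example:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def top_k_recall(y_true, y_prob, k):
    """Fraction of true codes that appear among each sample's top-k predictions."""
    top_k = np.argsort(-y_prob, axis=1)[:, :k]
    hits = sum(y_true[i, j] for i in range(len(y_true)) for j in top_k[i])
    return hits / y_true.sum()

# Toy multi-label setup: 4 samples x 3 medical codes
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_prob = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.2],
                   [0.7, 0.6, 0.3], [0.1, 0.2, 0.9]])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities for F1

macro_auc = roc_auc_score(y_true, y_prob, average="macro")
micro_auc = roc_auc_score(y_true, y_prob, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")
micro_f1 = f1_score(y_true, y_pred, average="micro")
recall_at_1 = top_k_recall(y_true, y_prob, k=1)
```

Macro averaging weights every code equally, so rare codes dominate the gap between macro and micro scores seen in the table.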
Average performance on clinical sentence pair similarity tasks
| Space | Metric | Glove | Ours |
|---|---|---|---|
| 6B300d | Pearson | 69.2 | |
| 6B300d | Spearman | 64.6 | |
| 6B200d | Pearson | 69.9 | |
| 6B200d | Spearman | 64.6 | |
| 6B100d | Pearson | 68.3 | |
| 6B100d | Spearman | 63.5 | |
Bold values represent the best result for each row of data. (Window start [0, 1000]; number of MLLE local neighbors = 500; manifold dimensionality = space dimensionality.)
Words with the highest weights from the manifold method and Word2Vec for a frequent diabetes medical code
| Word (Ours) | Weight (Ours) | Word (Word2Vec) | Weight (Word2Vec) |
|---|---|---|---|
| Hemodialysis | 0.7856 | Disease | 0.4320 |
| Found | 0.0235 | Hemodialysis | 0.2576 |
| Disease | 0.0347 | Renal | 0.0726 |
| Stage | 0.0043 | Found | 0.0123 |
| Job | 0.0052 | Hypertension | 0.0026 |
| Hypertension | 0.0071 | Job | 0.0010 |
| Renal | 0.0046 | Stage | 0.0009 |
| Name | 0.0083 | End | 0.0005 |
| Mellitus | 0.0008 | Initial | 0.0004 |
| Diabetes | 0.0005 | Declared | 0.0003 |
Words with the highest weights from the manifold method and Word2Vec for a rare asbestosis medical code
| Word (Ours) | Weight (Ours) | Word (Word2Vec) | Weight (Word2Vec) |
|---|---|---|---|
| Pneumothorax | 0.00535 | Old | 0.0617 |
| Silhouette | 0.0241 | Service | 0.0345 |
| Mediastinal | 0.0336 | Evidence | 0.0187 |
| Opacity | 0.0184 | Partially | 0.0171 |
| Tissue | 0.0173 | Present | 0.0162 |
| Tobacco | 0.0102 | Without | 0.0137 |
| Meet | 0.0085 | Speaking | 0.0095 |
| Without | 0.0091 | Brief | 0.0084 |
| Remains | 0.0075 | Stable | 0.0064 |
| Partially | 0.0059 | Associated | 0.0063 |
Fig. 2Visualization of word vectors on MayoSRS. The abscissa is the first dimension of vectors, and the ordinate is the second dimension of vectors
Results of different embedding dimensions on medical code classification for our method and Word2Vec
| Dimension | Metric | Word2Vec | Ours |
|---|---|---|---|
| 100 | Pearson | 68.8 | |
| 100 | Spearman | 63.5 | |
| 200 | Pearson | 69.2 | |
| 200 | Spearman | 63.8 | |
| 250 | Pearson | 69.2 | |
| 250 | Spearman | 63.8 | |
| 300 | Pearson | 69.2 | |
| 300 | Spearman | 63.8 | |
Bold values represent the best result for each row of data (%). (Original space dimension is 300d; window start [0, 1000]; number of MLLE local neighbors = 500; manifold dimensionality = space dimensionality.)
Results of different numbers of MLLE local neighbors on medical code classification for our method and Word2Vec
| Neighbors | Metric | Word2Vec | Ours |
|---|---|---|---|
| 300 | Pearson | 68.5 | |
| 300 | Spearman | 63.8 | |
| 400 | Pearson | 69.2 | |
| 400 | Spearman | 63.8 | |
| 500 | Pearson | 69.2 | |
| 500 | Spearman | 63.8 | |
| 600 | Pearson | 69.2 | |
| 600 | Spearman | 63.8 | |
Bold values represent the best result for each row of data. (Space is Glove 840B 300d)
Results of different window lengths on medical code classification for our method and Word2Vec
| Win | Metric | Word2Vec | Ours |
|---|---|---|---|
| 1000 | Pearson | 69.2 | |
| 1000 | Spearman | 63.8 | |
| 1500 | Pearson | 69.2 | |
| 1500 | Spearman | 63.8 | |
| 2000 | Pearson | 69.2 | |
| 2000 | Spearman | 63.8 | |
| 3000 | Pearson | 69.2 | |
| 3000 | Spearman | 63.8 | |
Bold values represent the best result for each row of data. (Space is Glove 840B 300d)
Fig. 3Biomedical word re-embedding via manifold learning
Algorithm: Electronic Medical Records Representation with Manifold Embedding

1. Train the Word2Vec and Glove models on the electronic medical records to obtain a word embedding for each word.
2. Select a word-vector window from the pre-trained word vectors as the sample set for manifold learning.
3. Train the MLLE model on the samples obtained in step 2 using Eqs. (1) and (4).
4. Use the trained MLLE model to re-embed the electronic medical record word embeddings using Eqs. (5) and (6).
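The algorithm above can be sketched with scikit-learn's Modified LLE. This is a minimal illustration, not the paper's implementation: the word vectors are random stand-ins for pre-trained Word2Vec/Glove embeddings, and the window size, neighbor count, and dimensionality are scaled down from the paper's settings (window [0, 1000], 500 neighbors, 300d):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

rng = np.random.default_rng(0)

# Stand-in for step 1: pre-trained word vectors (500 "words", 20 dims)
word_vectors = rng.standard_normal((500, 20))

# Step 2: select a word-vector window as the manifold-learning sample
window = word_vectors[0:200]

# Steps 3-4: train Modified LLE (MLLE) and re-embed the window;
# manifold dimensionality is kept equal to the space dimensionality.
# MLLE requires n_neighbors >= n_components.
mlle = LocallyLinearEmbedding(n_neighbors=50, n_components=20,
                              method="modified", eigen_solver="dense")
re_embedded = mlle.fit_transform(window)
```

Vectors outside the selected window could afterwards be projected with the estimator's `transform` method, which maps new points through the learned neighborhood weights.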