Yijia Zhang1,2, Qingyu Chen1, Zhihao Yang2, Hongfei Lin2, Zhiyong Lu3.
Abstract
Distributed word representations have become an essential foundation for biomedical natural language processing (BioNLP), text mining and information retrieval. Word embeddings are traditionally computed at the word level from a large corpus of unlabeled text, ignoring both the information present in the internal structure of words and any information available in domain-specific structured resources such as ontologies. However, such information holds potential for greatly improving the quality of word representations, as suggested by some recent studies in the general domain. Here we present BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely used biomedical controlled vocabulary called Medical Subject Headings (MeSH). We assess both the validity and utility of our generated word embeddings over multiple NLP tasks in the biomedical domain. Our benchmarking results demonstrate that our word embeddings can result in significantly improved performance over the previous state of the art in those challenging tasks.
Year: 2019 PMID: 31076572 PMCID: PMC6510737 DOI: 10.1038/s41597-019-0055-0
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Schematic of learning word embedding based on PubMed literature and MeSH.
Fig. 2 Illustration of the MeSH sequences sampling strategy. (a) An example of MeSH term graph. (b) Random sampling strategy.
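The random sampling strategy in Fig. 2 can be pictured as a walk over the term graph that emits a sequence of nodes. The following is a minimal sketch only: it assumes a uniform choice among neighbors, which the caption does not state, and the graph, the node labels and the function name `sample_mesh_sequence` are hypothetical placeholders rather than real MeSH identifiers or the authors' code.

```python
import random


def sample_mesh_sequence(graph, start, length, rng):
    """Walk up to `length` nodes through `graph`, picking each next neighbor uniformly."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1], [])
        if not neighbors:  # dead end: stop the walk early
            break
        walk.append(rng.choice(neighbors))
    return walk


# Hypothetical toy graph; labels are placeholders, not real MeSH IDs.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sequences = [sample_mesh_sequence(toy_graph, node, 5, rng) for node in toy_graph]
for seq in sequences:
    print(" ".join(seq))
```

Each printed line is one node sequence that could be treated like a sentence when training embeddings.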
The cosine similarity of the word pair examples by different word embeddings.
| Word pair | UMNSRS-Sim[20] | UMNSRS-Rel[20] | Mikolov | Pyysalo | Chiu | BioWordVec (win20) |
|---|---|---|---|---|---|---|
| thalassemia, hemoglobinopathy | 1307 | 1218 | — | 0.713 | 0.754 | 0.834 |
| mycosis, histoplasmosis | 1137.25 | 1185.75 | 0.353 | 0.544 | 0.595 | 0.706 |
| thirsty, hunger | 935.75 | 1249 | 0.252 | 0.425 | 0.59 | 0.629 |
| influenza, pneumoniae | 898.5 | 1354 | 0.482 | 0.252 | 0.514 | 0.611 |
| atherosclerosis, angina | 936 | 1357.75 | 0.503 | 0.506 | 0.506 | 0.589 |
“win20” denotes that BioWordVec was trained with a context window size of 20. The “UMNSRS-Sim[20]” and “UMNSRS-Rel[20]” columns give the mean score of each word pair in UMNSRS-Sim[20] and UMNSRS-Rel[20], respectively.
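For readers unfamiliar with the measure, the cosine similarity reported in the table is the dot product of two word vectors divided by the product of their lengths. A minimal sketch follows; the three-dimensional vectors are hypothetical stand-ins, not values taken from any of the embedding sets.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "word_a": [0.2, 0.7, 0.1],
    "word_b": [0.3, 0.6, 0.2],
    "word_c": [0.9, -0.1, 0.0],
}

print(cosine_similarity(toy_vectors["word_a"], toy_vectors["word_b"]))
print(cosine_similarity(toy_vectors["word_a"], toy_vectors["word_c"]))
```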
The top 5 most similar words of “deltaproteobacteria”.
| BioWordVec (win20): top 5 similar words | BioWordVec (win20): similarity score | Chiu: top 5 similar words | Chiu: similarity score |
|---|---|---|---|
| deltaproteobacterial | 0.985 | magnetospirilla | 0.861 |
| deltaproteobacterium | 0.963 | Thermales | 0.857 |
| betaproteobacteria | 0.952 | Acidiphilium-like | 0.854 |
| zetaproteobacteria | 0.945 | nirK1 | 0.85 |
| delta-proteobacteria | 0.939 | nostoc | 0.847 |
“win20” denotes that BioWordVec was trained with a context window size of 20.
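The BioWordVec neighbors in this table are all spelling variants of the query, which is what one might expect when words share subword units. The sketch below assumes fastText-style character n-grams with `<` and `>` boundary markers and lengths 3 to 6; those settings, and the use of Jaccard overlap as a rough proxy, are assumptions for illustration and not the trained model.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Set of character n-grams of `word` wrapped in boundary markers."""
    marked = f"<{word}>"
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }


def jaccard(a, b):
    """Share of n-grams that two sets have in common."""
    return len(a & b) / len(a | b)


query = char_ngrams("deltaproteobacteria")
for other in ["deltaproteobacterial", "betaproteobacteria", "nostoc"]:
    print(other, round(jaccard(query, char_ngrams(other)), 3))
```

Words with high n-gram overlap start from similar subword vectors, so they tend to end up close together in the embedding space.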
Evaluation results on UMNSRS datasets.
| Method | Corpus | UMNSRS-Sim # | UMNSRS-Sim Pearson | UMNSRS-Sim Spearman | UMNSRS-Rel # | UMNSRS-Rel Pearson | UMNSRS-Rel Spearman |
|---|---|---|---|---|---|---|---|
| Mikolov | Google news | 336 | 0.421 | 0.409 | 329 | 0.359 | 0.347 |
| Pyysalo | PubMed + PMC | 493 | 0.549 | 0.524 | 496 | 0.495 | 0.488 |
| Chiu | PubMed | 462 | 0.662 | 0.652 | 467 | 0.600 | 0.601 |
| BioWordVec (win20) | PubMed | | 0.665 | 0.654 | | 0.608 | 0.607 |
| BioWordVec (win20) | PubMed + MeSH | | | | | | |
“#” denotes the number of term pairs that can be mapped by each set of word embeddings. “Pearson” and “Spearman” denote Pearson’s and Spearman’s correlation coefficients, respectively. “win20” denotes that BioWordVec was trained with a context window size of 20. The highest value is shown in bold.
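As an illustration of how such correlations are obtained, the sketch below computes Pearson’s and Spearman’s coefficients between reference scores and embedding cosine similarities. It reuses the five UMNSRS-Sim scores and BioWordVec (win20) cosines from the word pair examples table; five pairs are far too few to say anything about the full benchmark, and the helper names are illustrative rather than taken from the authors' code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Values copied from the word pair examples table (five pairs only).
reference_scores = [1307, 1137.25, 935.75, 898.5, 936]
model_cosines = [0.834, 0.706, 0.629, 0.611, 0.589]

print("Pearson:", pearson(reference_scores, model_cosines))
print("Spearman:", spearman(reference_scores, model_cosines))
```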
Comparison results on UMNSRS datasets using the common term pairs.
| Method | Corpus | UMNSRS-Sim # | UMNSRS-Sim Pearson | UMNSRS-Sim Spearman | UMNSRS-Rel # | UMNSRS-Rel Pearson | UMNSRS-Rel Spearman |
|---|---|---|---|---|---|---|---|
| Chiu | PubMed | 459 | 0.661 | 0.651 | 461 | 0.600 | 0.601 |
| BioWordVec (win20) | PubMed | 459 | 0.679 | 0.665 | 461 | 0.624 | 0.626 |
| BioWordVec (win20) | PubMed + MeSH | 459 | | | 461 | | |
“#” denotes the number of term pairs that can be mapped by each set of word embeddings. “Pearson” and “Spearman” denote Pearson’s and Spearman’s correlation coefficients, respectively. “win20” denotes that BioWordVec was trained with a context window size of 20. The highest value is shown in bold.
Sentence pair similarity results on BioCreative/OHNLP STS dataset.
| Similarity measures | Mikolov | Pyysalo | Chiu | BioWordVec (win20) w/o MeSH | BioWordVec (win20) w/MeSH |
|---|---|---|---|---|---|
| Cosine | 0.768 | 0.755 | 0.757 | 0.770 | |
| Euclidean | 0.725 | 0.723 | 0.727 | 0.751 | |
| Block | 0.725 | 0.722 | 0.727 | 0.750 | |
“win20” denotes that BioWordVec was trained with a context window size of 20.
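The three measures in this table can be sketched as follows. Two things here are assumptions not stated in this record: each sentence is represented by the average of its word vectors, and the Euclidean and block (Manhattan) distances are turned into similarities with `1 / (1 + d)`. The toy vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def block_distance(u, v):
    """City-block (Manhattan, L1) distance."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_to_similarity(d):
    """Assumed mapping from a distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)


# Hypothetical word vectors for two short sentences.
sentence_1 = mean_vector([[0.2, 0.4], [0.6, 0.0]])
sentence_2 = mean_vector([[0.3, 0.3], [0.5, 0.1]])

print("cosine:", cosine(sentence_1, sentence_2))
print("euclidean:", distance_to_similarity(euclidean_distance(sentence_1, sentence_2)))
print("block:", distance_to_similarity(block_distance(sentence_1, sentence_2)))
```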
The statistics of the PPI corpora.
| Corpus | Sentences | Positive | Negative | Total |
|---|---|---|---|---|
| AIMed | 1,955 | 1,000 | 4,834 | 5,834 |
| BioInfer | 1,100 | 2,534 | 7,132 | 9,666 |
| IEPA | 145 | 335 | 482 | 817 |
| HPRD50 | 486 | 163 | 270 | 433 |
| LLL | 77 | 164 | 166 | 330 |
PPI extraction evaluation results on five PPI corpora.
| Data Set | Mikolov F-Score | Mikolov σ | Pyysalo F-Score | Pyysalo σ | Chiu F-Score | Chiu σ | BioWordVec (win5) w/o MeSH F-Score | BioWordVec (win5) w/o MeSH σ | BioWordVec (win5) w/MeSH F-Score | BioWordVec (win5) w/MeSH σ |
|---|---|---|---|---|---|---|---|---|---|---|
| AIMed | 0.445 | 0.076 | 0.457 | 0.087 | | 0.064 | 0.484 | 0.101 | 0.487 | 0.081 |
| BioInfer | 0.524 | 0.038 | 0.532 | 0.044 | 0.545 | 0.053 | 0.543 | 0.041 | | 0.039 |
| IEPA | 0.603 | 0.062 | 0.597 | 0.062 | 0.615 | 0.061 | 0.617 | 0.049 | | 0.064 |
| HPRD50 | 0.484 | 0.187 | 0.499 | 0.121 | 0.481 | 0.145 | 0.504 | 0.136 | | 0.13 |
| LLL | 0.679 | 0.12 | 0.688 | 0.093 | 0.684 | 0.124 | 0.708 | 0.092 | | 0.095 |
The highest value is shown in bold. “σ” denotes the standard deviation of the F-score. “win5” denotes that BioWordVec was trained with a context window size of 5.
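A mean F-score with its standard deviation can be summarized as below. This is a sketch under two assumptions: the scores come from repeated evaluation runs such as cross-validation folds, and σ is the sample standard deviation. Neither detail is stated in this record, and the per-run scores are hypothetical.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of F-scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical per-run F-scores, for illustration only.
run_f_scores = [0.41, 0.52, 0.47, 0.55, 0.44]

mean_f, sigma = summarize(run_f_scores)
print(f"F-score = {mean_f:.3f}, sigma = {sigma:.3f}")
```

If σ were instead the population standard deviation, `statistics.pstdev` would be the matching call.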
DDI extraction evaluation results on DDI 2013 corpus.
| Method | Corpus | CNN Precision | CNN Recall | CNN F-score | Hierarchical RNN Precision | Hierarchical RNN Recall | Hierarchical RNN F-score |
|---|---|---|---|---|---|---|---|
| Mikolov | Google news | 0.698 | 0.584 | 0.636 | 0.681 | 0.699 | 0.691 |
| Pyysalo | PubMed + PMC | 0.689 | 0.624 | 0.655 | 0.692 | | 0.709 |
| Chiu | PubMed | | 0.650 | 0.677 | 0.749 | 0.691 | 0.719 |
| BioWordVec (win5) | PubMed | 0.696 | 0.669 | 0.683 | 0.744 | 0.702 | 0.722 |
| BioWordVec (win5) | PubMed + MeSH | 0.694 | | | | 0.696 | |
The highest value is shown in bold. “win5” denotes that BioWordVec was trained with a context window size of 5.
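The F-score column is the harmonic mean of precision and recall. The short sketch below recomputes it for two CNN rows of the table; because the table values are rounded to three decimals, recomputing other rows this way can differ from the reported F-score by about 0.001.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the CNN columns of the table.
rows = {
    "Mikolov": (0.698, 0.584),
    "Pyysalo": (0.689, 0.624),
}

for method, (p, r) in rows.items():
    print(method, round(f_score(p, r), 3))
```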
| Design Type(s) | data transformation objective • data integration objective • text processing and analysis objective |
| Measurement Type(s) | word representation |
| Technology Type(s) | Text_Mining |
| Factor Type(s) | |
| Sample Characteristic(s) |