Billy Chiu, Sampo Pyysalo, Ivan Vulić, Anna Korhonen.
Abstract
BACKGROUND: Word representations support a variety of Natural Language Processing (NLP) tasks. The quality of these representations is typically assessed by comparing the distances in the induced vector spaces against human similarity judgements. Whereas comprehensive evaluation resources have recently been developed for the general domain, similar resources for biomedicine currently suffer from a lack of coverage, both in the word types included and in the semantic distinctions they capture. Notably, verbs have been excluded, although they are essential for the interpretation of biomedical language. Further, current resources do not distinguish between semantic similarity and semantic relatedness, although this distinction has been shown to be an important predictor of the usefulness of word representations and of their performance in downstream applications.
Keywords: Downstream tasks; Intrinsic evaluation; Word similarity
Year: 2018 PMID: 29402212 PMCID: PMC5800055 DOI: 10.1186/s12859-018-2039-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Biomedical- and general-domain word samples in Bio-SimVerb and Bio-SimLex
| Biomedical | General |
|---|---|
| Depolymerize | Automate |
| Electrophorese | Study |
| Phosphorylate | Argue |
| Centrosome | Idea |
| Pathophysiology | People |
| Endothelium | River |
Fourteen Ontologies used for sampling synonymous pairs in Bio-SimVerb and Bio-SimLex
| Ontology | Reference |
|---|---|
| Chemical Entities of Biological Interest (ChEBI) | [ |
| Gene Ontology (GO) | [ |
| NCI Thesaurus (NCIT) | [ |
| Foundational Model of Anatomy (FMA) | [ |
| Disease Ontology (DOID) | [ |
| Uberon multi-species anatomy ontology (UBERON) | [ |
| Plant Ontology (PO) | [ |
| Plant Phenotypes and Traits (PATO) | [ |
| Ontology for Biomedical Investigations (OBI) | [ |
| Molecular Process Ontology (MOP) | [ |
| Zebrafish anatomy and development (ZFA) | [ |
| Protein modification (PSI-MOD) | [ |
| Common Anatomy Reference Ontology (CARO) | [ |
| Xenopus anatomy and development (XAO) | [ |
Hyper-parameter values for word representation models. Parameters specific to individual models are set to their defaults
| Parameters | Values |
|---|---|
| Context window size | 5 |
| Vector dimension | 200 |
| Learning rate | 0.05 |
| Negative sampling | 5 |
| Min-count | 5 |
| Sampling rate | 1e-5 |
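As an illustration only, the settings in the table above could be passed to a word2vec-style trainer as follows. The record does not say which toolkit was used, so the gensim 4.x argument names, the `train_skipgram` helper, and the choice `sg=1` (skip-gram) are assumptions made for this sketch.

```python
# Hyper-parameters copied from the table above, keyed by gensim 4.x argument
# names (the mapping to gensim is an assumption, not taken from the source).
W2V_PARAMS = {
    "window": 5,         # context window size
    "vector_size": 200,  # vector dimension
    "alpha": 0.05,       # initial learning rate
    "negative": 5,       # negative samples per positive example
    "min_count": 5,      # ignore words seen fewer than 5 times
    "sample": 1e-5,      # sub-sampling threshold for frequent words
}


def train_skipgram(sentences):
    """Train a model on an iterable of tokenised sentences (hypothetical helper).

    sg=1 selects skip-gram; that choice is an assumption for this sketch.
    """
    # Imported lazily so the parameter dictionary is usable without gensim.
    from gensim.models import Word2Vec  # requires gensim >= 4.0

    return Word2Vec(sentences=sentences, sg=1, **W2V_PARAMS)
```

All remaining arguments are left at the library defaults, in line with the caption's note that model-specific parameters keep their default values.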
Hyper-parameters used in NER
| Parameters | Values |
|---|---|
| Vector dimension | 200 |
| Hidden layer dimension | 300 |
| Context window size | 5 |
| Learning rate | 0.01 |
| Dropout probability | 0.2 |
| Epochs | 20 |
| Minibatch size | 50 |
Intrinsic (left 5 columns) and extrinsic scores (right 4 columns) of different representation models trained on the biomedical corpus
| Model | UMN-rel (ρ) | UMN-sim (ρ) | MayoSRS (ρ) | Bio-SimVerb (ρ) | Bio-SimLex (ρ) | BC4CHEMD (F-score) | BC2GM (F-score) | AnatEM (F-score) | JNLPBA (F-score) |
|---|---|---|---|---|---|---|---|---|---|
| Attention | 0.5248 | 0.5551 | | 0.471 | 0.7155 | 79.11 | 65.91 | 80.49 | 62.3 |
| SSG | 0.5189 | 0.552 | 0.6003 | | 0.7181 | 79.62 | 67.3 | 81.3 | 63.78 |
| SG | **0.5767** | **0.6271** | 0.5744 | 0.4638 | 0.7151 | 81.37 | 70.2 | 81.32 | 65.16 |
| CBOW | 0.5 | 0.5348 | 0.5146 | 0.4367 | 0.702 | 78.41 | 64.05 | 80.3 | 61.9 |
| Dependency | 0.3934 | 0.4622 | 0.3445 | 0.3978 | | | | | 65.01 |
| PubMed-w2v | 0.506 | 0.549 | 0.5133 | 0.4376 | 0.6984 | 80.71 | 67.4 | 81.1 | 64.86 |
| BioASQ | 0.5092 | 0.5893 | 0.4729 | 0.4228 | 0.6982 | 56.95 | 48.86 | 53.34 | 50.51 |
Bold text marks the best-performing models of their kind
Pearson’s correlation between word-similarity/Bio-SimVerb and Bio-SimLex scores and the NER tasks evaluated on biomedical representation models trained with different approaches. None of the scores are statistically significant
| Dataset | BC4CHEMD | BC2GM | AnatEM | JNLPBA |
|---|---|---|---|---|
| UMN-rel | -0.15 | -0.14 | -0.08 | -0.07 |
| UMN-sim | -0.38 | -0.34 | -0.34 | -0.3 |
| MayoSRS | 0.08 | 0.04 | 0.18 | 0.12 |
| Bio-SimVerb | 0.2 | 0.18 | 0.29 | 0.24 |
| Bio-SimLex | | | | |
Bold: best scores
Intrinsic (left 5 columns) and extrinsic scores (right 4 columns) of the biomedical representation models trained using different window sizes
| Window Size | UMN-rel (ρ) | UMN-sim (ρ) | MayoSRS (ρ) | Bio-SimVerb (ρ) | Bio-SimLex (ρ) | BC4CHEMD (F-score) | BC2GM (F-score) | AnatEM (F-score) | JNLPBA (F-score) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.5317 | 0.5759 | 0.5551 | 0.4594 | | | 70.06 | 82.16 | 65.34 |
| 2 | 0.563 | 0.6144 | 0.6238 | 0.4696 | 0.7207 | 81.44 | 70 | | 65.51 |
| 4 | 0.5768 | 0.6247 | 0.581 | 0.464 | 0.7188 | 81.5 | 70.04 | 82 | 65.75 |
| 5 | 0.5767 | 0.6271 | 0.5744 | 0.4638 | 0.7151 | 81.37 | **70.2** | 81.32 | 65.16 |
| 8 | 0.582 | 0.6377 | 0.5975 | 0.4611 | 0.7086 | 81.24 | 69.56 | 80.99 | 65.53 |
| 16 | 0.5888 | 0.6431 | 0.6123 | 0.4667 | 0.7034 | 81.02 | 69.39 | 80.72 | 64.78 |
| 20 | 0.5896 | 0.6418 | 0.6319 | 0.4584 | 0.7031 | 81.12 | 69.62 | 80.49 | 65.19 |
| 25 | | | 0.6188 | 0.4519 | 0.7004 | 81.07 | 69.93 | 80.92 | 65.14 |
| 30 | 0.6007 | 0.6457 | | 0.4502 | 0.7043 | 80.71 | 69.2 | 81.03 | 64.79 |
Bold text marks the best-performing models of their kind
Pearson’s correlation between word-similarity/Bio-SimVerb and Bio-SimLex scores and the NER tasks evaluated on biomedical representation models trained with different window sizes
| Dataset | BC4CHEMD | BC2GM | AnatEM | JNLPBA |
|---|---|---|---|---|
| UMN-rel | -0.78ᵃ | -0.56 | -0.78ᵃ | -0.46 |
| UMN-sim | -0.73 | -0.57ᵃ | -0.81 | -0.42ᵃ |
| MayoSRS | -0.78 | -0.69 | -0.54ᵃ | -0.47ᵃ |
| Bio-SimVerb | 0.63 | 0.36 | 0.42 | 0.40 |
| Bio-SimLex | | | | |
Bold: best scores
ᵃ Statistically significant
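The correlations in this table relate a column of intrinsic scores to a column of NER F-scores across the nine window sizes. A minimal sketch of that computation, assuming a plain sample Pearson coefficient, is shown below for Bio-SimVerb against JNLPBA; the two lists are copied from the window-size results table, and the `pearson` helper is written for illustration rather than taken from the authors' code.

```python
from math import sqrt


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


# Scores for window sizes 1, 2, 4, 5, 8, 16, 20, 25, 30 (window-size table).
bio_simverb = [0.4594, 0.4696, 0.464, 0.4638, 0.4611, 0.4667, 0.4584, 0.4519, 0.4502]
jnlpba = [65.34, 65.51, 65.75, 65.16, 65.53, 64.78, 65.19, 65.14, 64.79]

r = pearson(bio_simverb, jnlpba)
print(f"Pearson r (Bio-SimVerb vs JNLPBA): {r:.2f}")
```

If the table was produced this way, the printed value should land close to the 0.40 listed for that cell. With only nine data points per column, coefficients of this size can easily fail a significance test.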
Pearson’s correlation between general-domain datasets/Bio-SimVerb and Bio-SimLex scores and the NER tasks evaluated on general-domain representation models benchmarked in SimVerb and SimLex. None of the scores are statistically significant
| Dataset | BC4CHEMD | BC2GM | AnatEM | JNLPBA |
|---|---|---|---|---|
| SimVerb | -0.31 | -0.09 | -0.41 | -0.12 |
| SimLex | -0.36 | -0.20 | -0.49 | -0.19 |
| Bio-SimVerb | -0.38 | -0.18 | -0.47 | -0.22 |
| Bio-SimLex | | | | |
Bold: best scores
Fig. 1 Subset-based evaluation for Bio-SimVerb (y axis unit: ρ), where subsets are created based on word frequency in PMC. A pair is included in a group only if both of its words fall in the same frequency interval (x axis)
Fig. 2 Subset-based evaluation for Bio-SimLex (y axis unit: ρ), where subsets are created based on word frequency in PMC. A pair is included in a group only if both of its words fall in the same frequency interval (x axis)
Fig. 3 Subset-based evaluation for Bio-SimVerb (y axis unit: ρ), where subsets are created based on each word's number of unique Broad Subject Terms. A word can have multiple Broad Subject Terms when it appears in journals from different areas of biomedicine. A pair is included in a group only if both of its words fall in the same Subject Term interval (x axis)
Fig. 4 Subset-based evaluation for Bio-SimLex (y axis unit: ρ), where subsets are created based on each word's number of unique Broad Subject Terms. A word can have multiple Broad Subject Terms when it appears in journals from different areas of biomedicine. A pair is included in a group only if both of its words fall in the same Subject Term interval (x axis)
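The subset-based evaluation described in these captions can be sketched as follows: pairs are grouped by interval, a pair is kept only when both words fall in the same interval, and ρ (taken here to be Spearman's rank correlation, which is an assumption) is computed per group. The word names, counts, intervals and scores below are invented toy values, and the helper functions are hypothetical, not the authors' implementation.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else float("nan")


def interval_of(count, intervals):
    """Return the half-open interval [lo, hi) containing count, or None."""
    for lo, hi in intervals:
        if lo <= count < hi:
            return (lo, hi)
    return None


def subset_eval(pairs, counts, intervals):
    """pairs holds (word1, word2, human_rating, model_similarity) tuples."""
    groups = {iv: [] for iv in intervals}
    for w1, w2, human, model in pairs:
        iv = interval_of(counts[w1], intervals)
        # Keep the pair only when both words share the same interval.
        if iv is not None and iv == interval_of(counts[w2], intervals):
            groups[iv].append((human, model))
    return {
        iv: spearman([h for h, _ in g], [m for _, m in g])
        for iv, g in groups.items()
        if len(g) >= 2
    }


# Toy data (hypothetical): corpus counts, intervals, and scored pairs.
counts = {"a": 10, "b": 20, "c": 50, "d": 5000, "e": 8000, "f": 9000}
intervals = [(0, 100), (100, 10000)]
pairs = [
    ("a", "b", 1.0, 0.1), ("a", "c", 2.0, 0.3), ("b", "c", 3.0, 0.2),
    ("d", "e", 1.0, 0.2), ("d", "f", 2.0, 0.5), ("e", "f", 3.0, 0.9),
    ("a", "d", 5.0, 0.7),  # mixed intervals, so this pair is dropped
]
scores = subset_eval(pairs, counts, intervals)
print(scores)
```

The same grouping applies to the Broad Subject Term figures by replacing the corpus counts with each word's number of unique Broad Subject Terms.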
Average standard deviation of ratings per subset by word frequency.
| Frequency subset | Bio-SimVerb | Bio-SimLex |
|---|---|---|
| Low | 0.9848 | 1.5621 |
| Medium | 0.8059 | 0.6784 |
| High | 1.2352 | 1.0237 |
| Average | 1.009 | 1.088 |
We use low, medium and high to label subsets for brevity. Range values of corresponding subsets can be found in Figs. 1 and 2
Average standard deviation of ratings per subset by the number of Broad Subject Terms.
| Subject subset | Bio-SimVerb | Bio-SimLex |
|---|---|---|
| Low | 0.8941 | 1.2395 |
| Medium | 0.9084 | 0.7585 |
| High | 1.25 | 1.1204 |
| Average | 1.018 | 1.039 |
We use low, medium and high to label subsets for brevity. Range values of corresponding subsets can be found in Figs. 3 and 4