Maciej Rybinski, José Francisco Aldana-Montes.
Abstract
BACKGROUND: Semantic relatedness is a measure that quantifies the strength of a semantic link between two concepts. It can often be approximated efficiently with methods that operate on the words representing these concepts. Approximating semantic relatedness between texts, and between the concepts these texts represent, is an important part of many text and knowledge processing tasks in the ever-growing domain of biomedical informatics. The main problem with most state-of-the-art methods for calculating semantic relatedness is their dependence on highly specialized, structured knowledge resources, which makes them poorly adaptable to many usage scenarios. At the same time, domain knowledge in the Life Sciences has become increasingly accessible, but mostly in unstructured form, as texts in large document collections, which makes it harder to use in automated processing. In this paper we present tESA, an extension to the well-known Explicit Semantic Analysis (ESA) method.
Keywords: Bioinformatics; Biomedical semantics; Distributional linguistics; Explicit semantic analysis; Knowledge extraction; Semantic relatedness; Semantic similarity
Year: 2016 PMID: 28031037 PMCID: PMC5192592 DOI: 10.1186/s13326-016-0109-6
Source DB: PubMed Journal: J Biomed Semantics
Fig. 1 Overview. Overview of the method’s components
Table 1. General characteristics of the corpora used in the experiments
| | MEDLINE | PMC OA | Wikipedia |
|---|---|---|---|
| Size | 14073912 | 1024890 | 3807314 |
| Type | Scientific | Scientific | Encyclopedic |
| Documents | Abstracts and titles | Mostly fulltext + abstracts + titles | Fulltext + titles |
| Snapshot date | Autumn 2015 | September 2015 | December 2015 |
| Token count [M] (contents; titles) | 2531.14; 264.84 | 3684.89; 15.8 | 2434.55; 11.13 |
| Unique token count [M] (contents; titles) | 3.85; 1.24 | 35.57; 0.48 | 12.53; 0.98 |
Token counts and unique token counts are expressed in millions. These statistics were collected for raw texts (before preprocessing) and raw corpora (e.g. there might be unequal numbers of titles and abstracts in MEDLINE). For each corpus and count type we provide two values: one for the documents’ textual contents (abstracts or full articles) and one for the titles. The statistics are included to highlight the compositional differences between the corpora
Table 2. General characteristics of the datasets used in the experiments. The number of pairs and the number of distinct items describe the size of each dataset; the "Focus of the dataset" column gives the type of relationship captured in the reference results
| Dataset | No of pairs | Distinct items | Reference | Focus of the dataset | Annotators | Scale | ICC(2,1) |
|---|---|---|---|---|---|---|---|
| umnsrsSim | 566 | 375 | [ | Similarity | Residents | 0 - 1600 | 0.47 |
| umnsrsRelate | 587 | 397 | [ | Relatedness | Residents | 0 - 1600 | 0.5 |
| mayo101 | 101 | 191 | [ | Relatedness | Medical coders | 1 - 10 | 0.5 |
| mayo29c | 29 | 56 | [ | Relatedness | Medical coders | 1 - 10 | 0.78 |
| mayo29ph | 29 | 56 | [ | Relatedness | Physicians | 1 - 10 | 0.68 |
ICC(2,1) denotes the intraclass correlation coefficient, which provides an objective measure of inter-annotator agreement; the issues of inter-annotator reliability are covered in more detail in the corresponding reference papers
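For orientation, ICC(2,1) is commonly written in the Shrout and Fleiss two-way random-effects form shown below. The symbols are not defined in this document and are introduced here only as the usual ANOVA quantities: $\mathit{BMS}$ is the between-items mean square, $\mathit{JMS}$ the between-annotators mean square, $\mathit{EMS}$ the residual mean square, $k$ the number of annotators and $n$ the number of rated items. Whether the reference papers computed it in exactly this way is an assumption.

$$
\mathrm{ICC}(2,1) = \frac{\mathit{BMS} - \mathit{EMS}}{\mathit{BMS} + (k-1)\,\mathit{EMS} + \frac{k}{n}\,(\mathit{JMS} - \mathit{EMS})}
$$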
Table 3. Overview of the results for different experimental settings (corpus and benchmark pairs). ESA and tESA were run with M=10000; DS (the method described in [28]) was run with M=200 and a cutoff at 0.02 (robust parameters that can be expected to provide decent results in different experimental settings)
| Corpus | Method | umnsrsRelate | umnsrsSim | mayo101 | mayo29ph | mayo29c |
|---|---|---|---|---|---|---|
| Medline | ESA | 0.608 | 0.621 | 0.546 | 0.835 | 0.734 |
| Medline | tESA | | | 0.549 | 0.783 | 0.687 |
| Medline | DS | 0.46 | 0.438 | 0.511 | 0.483 | 0.493 |
| PMC | ESA | 0.588 | 0.597 | 0.543 | 0.855 | 0.75 |
| PMC | tESA | 0.595 | 0.607 | 0.484 | 0.796 | 0.7 |
| PMC | DS | 0.574 | 0.626 | 0.504 | 0.738 | 0.673 |
| Wiki | ESA | 0.501 | 0.5 | 0.548 | 0.822 | 0.722 |
| Wiki | tESA | 0.484 | 0.484 | 0.502 | 0.801 | 0.755 |
| Wiki | DS | 0.444 | 0.463 | 0.413 | 0.627 | 0.597 |
| Best reported (citation) | | 0.54 [ | 0.58 [ | | 0.84 [ | |
The row of best reference results was compiled from results reported in the domain literature for the respective datasets, regardless of the type of method used to achieve them. The best reported results for umnsrsRelate, umnsrsSim and mayo101 were attained with specific parameter combinations in our experiments (presented in [28]), whereas for the two smaller datasets the best results were previously obtained with knowledge-rich methods (distributional and IC-based for mayo29ph and mayo29c, respectively). Updated best results are highlighted in bold
Fig. 2 Performance changes for different M (cutoff limit for the maximum number of documents considered in the distributional representation). The figure shows the correlation with human judgement of ESA and tESA with different corpora as a function of M; the values were obtained for the umnsrsRelate dataset
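The role of M can be illustrated with a minimal sketch. It assumes that a label is represented as a sparse vector of document weights, that the cutoff keeps the M highest-weighted documents, and that relatedness is the cosine of the two truncated vectors. The function names, document ids and weights are hypothetical, and the actual weighting scheme of ESA or tESA is not reproduced here.

```python
import math

def truncate_top_m(vector, m):
    """Keep only the m highest-weighted documents of a sparse vector (dict)."""
    top = sorted(vector.items(), key=lambda kv: kv[1], reverse=True)[:m]
    return dict(top)

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is unrelated to everything
    return dot / (norm_u * norm_v)

# Hypothetical document weights for two input labels (illustration only).
vec_a = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.1}
vec_b = {"doc1": 0.8, "doc3": 0.5, "doc4": 0.3}

M = 2  # cutoff: maximum number of documents kept per vector
relatedness = cosine(truncate_top_m(vec_a, M), truncate_top_m(vec_b, M))
```

With a small M only the strongest document associations contribute to the score; a larger M admits weaker associations, which is the trade-off the figure explores.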
Fig. 3 Performance as a function of increased inter-annotator agreement (umnsrsRelate). The figure shows the correlation with human judgement of ESA and tESA as a function of a decreasing threshold on the standard deviation, which is used to model the inter-annotator agreement, calculated for the umnsrsRelate reference dataset
Fig. 4 Performance as a function of increased inter-annotator agreement (umnsrsSim). The figure shows the correlation with human judgement of ESA and tESA as a function of a decreasing threshold on the standard deviation, which is used to model the inter-annotator agreement, calculated for the umnsrsSim reference dataset
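The thresholding behind these two figures can be sketched as a filter that keeps only the pairs whose annotator ratings have a standard deviation at or below a given limit. This is a minimal sketch under assumptions: the data layout, the term names and the ratings are hypothetical, and the use of the population standard deviation is a choice made here, not something the captions specify.

```python
from statistics import pstdev

def filter_by_agreement(pairs, max_std):
    """Keep pairs whose rating spread does not exceed max_std."""
    return [p for p in pairs if pstdev(p["ratings"]) <= max_std]

# Hypothetical annotated pairs on a 0 - 1600 scale (illustration only).
pairs = [
    {"terms": ("term_a", "term_b"), "ratings": [1200, 1250, 1180]},
    {"terms": ("term_c", "term_d"), "ratings": [200, 1400, 800]},
    {"terms": ("term_e", "term_f"), "ratings": [100, 150, 90]},
]

# Lowering max_std keeps only pairs the annotators agreed on more closely.
kept = filter_by_agreement(pairs, max_std=100)
```

The correlation with human judgement would then be recomputed on the retained pairs for each threshold value.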
Table 4. Recall for different dataset-corpus pairs. Recall is measured as the ratio of unique items (single input labels) represented by non-zero vectors to the total number of unique items in their respective datasets. As mayo29ph and mayo29c contain the same set of item pairs, the recall is identical for both datasets
| Dataset | Medline | PMC | Wiki |
|---|---|---|---|
| umnsrsRelate | 0.985 | 0.977 | 0.95 |
| umnsrsSim | 0.989 | 0.981 | 0.963 |
| mayo101 | 0.957 | 0.951 | 0.929 |
| mayo29 | 0.982 | 0.982 | 0.982 |
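The recall definition given in the caption translates directly into a short computation. The sketch below assumes that each label maps to a sparse vector of document weights and that a label with no vector, or with only zero weights, counts as not represented; the function name, labels and weights are hypothetical.

```python
def recall(items, vectors):
    """Share of unique items that are represented by a non-zero vector."""
    unique_items = set(items)
    if not unique_items:
        raise ValueError("recall is undefined for an empty dataset")
    covered = sum(
        1
        for item in unique_items
        if any(w != 0 for w in vectors.get(item, {}).values())
    )
    return covered / len(unique_items)

# Hypothetical labels and vectors (illustration only); "a" occurs twice
# in the dataset but is counted once.
items = ["a", "b", "c", "d", "a"]
vectors = {
    "a": {"doc1": 0.5},
    "b": {"doc2": 0.2},
    "c": {"doc1": 0.1, "doc3": 0.7},
    "d": {},  # no document matched this label
}

dataset_recall = recall(items, vectors)
```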
Table 5. Statistical tests (confidence intervals) for differences between correlations reported in Table 3
| tESA config | Other method | Dataset | CI | Comparison |
|---|---|---|---|---|
| Medline | DS (Medline) | mayo29ph | (0.09; 0.59) | + |
| Medline | ESA (Medline) | umnsrsRel | (0.003; 0.09) | + |
| Medline | ESA (PMC) | umnsrsRel | (0.025; 0.099) | + |
| Medline | ESA (Wiki) | umnsrsRel | (0.097; 0.2) | + |
| Medline | DS (Medline) | umnsrsRel | (0.13; 0.25) | + |
| Medline | DS (PMC) | umnsrsRel | (0.026; 0.12) | + |
| Medline | DS (Wiki) | umnsrsRel | (0.15; 0.26) | + |
| Medline | tESA (PMC) | umnsrsRel | (0.02; 0.09) | + |
| Medline | tESA (Wiki) | umnsrsRel | (0.11; 0.22) | + |
| Medline | ESA (PMC) | umnsrsSim | (0.004; 0.08) | + |
| Medline | ESA (Wiki) | umnsrsSim | (0.09; 0.19) | + |
| Medline | DS (Medline) | umnsrsSim | (0.14; 0.26) | + |
| Medline | DS (Wiki) | umnsrsSim | (0.11; 0.24) | + |
| Medline | tESA (Wiki) | umnsrsSim | (0.1; 0.21) | + |
| PMC | DS (Medline) | mayo29ph | (0.1; 0.61) | + |
| PMC | ESA (Wiki) | umnsrsRel | (0.04; 0.15) | + |
| PMC | DS (Medline) | umnsrsRel | (0.07; 0.2) | + |
| PMC | DS (Wiki) | umnsrsRel | (0.096; 0.21) | + |
| PMC | tESA (Wiki) | umnsrsRel | (0.06; 0.16) | + |
| PMC | ESA (Wiki) | umnsrsSim | (0.056; 0.16) | + |
| PMC | DS (Medline) | umnsrsSim | (0.1; 0.24) | + |
| PMC | DS (Wiki) | umnsrsSim | (0.09; 0.2) | + |
| PMC | tESA (Wiki) | umnsrsSim | (0.07; 0.18) | + |
| Wiki | DS (Medline) | mayo29c | (0.04; 0.55) | + |
| Wiki | DS (Medline) | mayo29ph | (0.11; 0.62) | + |
| Wiki | DS (Wiki) | mayo29ph | (0.01; 0.41) | + |
| Wiki | ESA (Medline) | umnsrsRel | (-0.18; -0.07) | - |
| Wiki | ESA (PMC) | umnsrsRel | (-0.15; -0.05) | - |
| Wiki | DS (PMC) | umnsrsRel | (-0.16; -0.025) | - |
| Wiki | ESA (Medline) | umnsrsSim | (-0.19; -0.086) | - |
| Wiki | ESA (PMC) | umnsrsSim | (-0.16; -0.06) | - |
| Wiki | DS (PMC) | umnsrsSim | (-0.21; -0.07) | - |
The CIs were constructed for pairs of correlations involving at least one tESA setup. The table provides all the information necessary to track the CI back to Table 3, i.e. the corpus of the tESA method, the method (and corpus) to which the tESA results are being compared and the reference dataset. We also provide the CI itself, additionally indicating if the result is positive or negative
Table 6. Average vector ‘length’
| | Medline | PMC | Wiki |
|---|---|---|---|
| tESA | 3222.7 | 3547.4 | 535.8 |
| ESA | 4579.4 | 3391.9 | 751 |
The table shows the average number of non-zero elements in tESA and ESA vectors, calculated over the reference datasets for each of the corpora