Juan J Lastra-Díaz, Alicia Lara-Clares, Ana Garcia-Serrano.
Abstract
BACKGROUND: Ontology-based semantic similarity measures based on SNOMED-CT, MeSH, and Gene Ontology are used extensively in biomedical text mining and genomics, respectively, which has encouraged the development of semantic measures libraries based on these ontologies. However, current state-of-the-art semantic measures libraries suffer from performance and scalability drawbacks that stem from ontology representations based on relational databases or naive in-memory graph representations. Likewise, a recent reproducible survey on word similarity shows that a hybrid IC-based measure that integrates a shortest-path computation sets the state of the art in the family of ontology-based semantic measures. However, the lack of an efficient shortest-path algorithm for real-time computation prevents the practical use of this measure, and of any other path-based semantic similarity measure, in applications.
Keywords: Gene ontology; HESML; Information content models; MeSH; Ontology-based semantic similarity measures; SNOMED-CT; Semantic measures library; WordNet
Year: 2022 PMID: 34991460 PMCID: PMC8734250 DOI: 10.1186/s12859-021-04539-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Pairwise ontology-based semantic similarity measures implemented by the three main publicly available software libraries for the biomedical domain
| Measure | UMLS::Similarity | SML | HESML |
|---|---|---|---|
| Banerjee and Pedersen [ | x | ||
| Patwardhan and Pedersen [ context vector | x | ||
| Rada et al. [ | x | x | x* |
| Wu and Palmer [ | x | x | |
| Wu and Palmer [ (depth-based approximation) | x | x | |
| Leacock and Chodorow [ | x | x | x* |
| Stojanovic et al. [ | x | x* | |
| Maedche and Staab [ | x | ||
| Zhong et al. [ | x | ||
| Pekar and Staab [ | x | x | x* |
| Li et al. [ | x* | ||
| Li et al. [ | x* | ||
| Liu et al. [ | x* | ||
| Liu et al. [ | x* | ||
| Pedersen et al. [ reciprocal Rada | x | x* | |
| Al-Mubaid and Nguyen [ | x | x* | |
| Kyogoku et al. [ | x | ||
| Batet et al. [ | x | ||
| Hao et al. [ | x* | ||
| Hadj Taieb et al. [ | x | ||
| Hadj Taieb et al. [ | x | ||
| McInnes et al. [ | x | ||
| Resnik [ | x | x | x |
| Jiang and Conrath [ | x | x | x |
| Lin [ | x | x | x |
| Schlicker et al. [ | x | x | |
| Pirró and Seco [ | x | ||
| FaITH [ | x | x | |
| Garla and Brandt [ | x | ||
| Meng and Gu [ | x | ||
| Gao et al. [ | x | ||
| Lastra and García [ | x | ||
| Cai et al. [ | x | ||
| Li et al. [ [ | x* | ||
| Zhou et al. [ | x* | ||
| Meng et al. [ | x* | ||
| Gao et al. [ | x* | ||
| Lastra and García [ | x* | ||
| Lastra and García [ | x* | ||
| Cai et al. [ | x* | ||
| Sánchez et al. [ | x | x | |
(*) Real-time reformulation of all path-based measures based on the AncSPL algorithm
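This record names AncSPL without describing it. Going only by the phrase "ancestor-based subgraph" in the Fig. 3 caption, the sketch below illustrates how a shortest-path search could be confined to the subgraph induced by the ancestors of two concepts, yielding a Rada-style edge count. It is an illustrative assumption, not the authors' algorithm; the toy `PARENTS` ontology and all function names are hypothetical.

```python
from collections import deque

# Hypothetical toy ontology: child -> list of parents (a DAG with one root).
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a", "b"],  # node with multiple parents
    "d": ["a"],
    "e": ["c"],
}


def ancestors(node, parents=PARENTS):
    """Return the node together with all of its ancestors."""
    seen, stack = {node}, [node]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def restricted_path_length(src, dst, parents=PARENTS):
    """Count edges on a shortest path that only visits ancestors of src or dst."""
    allowed = ancestors(src, parents) | ancestors(dst, parents)
    # Undirected adjacency limited to the allowed nodes. Every parent of an
    # allowed node is itself allowed, because ancestor sets are closed upwards.
    adj = {n: set() for n in allowed}
    for child in allowed:
        for parent in parents[child]:
            adj[child].add(parent)
            adj[parent].add(child)
    dist, queue = {src: 0}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None  # only possible when the two concepts share no ancestor


print(restricted_path_length("d", "e"))  # path d - a - c - e
```

Because paths through nodes outside the two ancestor sets are excluded, a search of this kind can overestimate the exact shortest-path length, which would be consistent with the record reporting a length-error distribution for AncSPL.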
Groupwise ontology-based semantic similarity measures implemented by SML and HESML (this work), which are mainly used for genomics applications based on the GO ontology
| Groupwise similarity measures | SML | HESML |
|---|---|---|
| Maximum [ | x | |
| Average [ | x | |
| Best-Match-Average (BMA) [ | x | |
| SimUI [ | x | x |
| SimLP [ | x | x |
| SimGIC [ | x | x |
| Ali and Deane [ | x | |
| Lee et al. [ | x | |
| Term Overlap (TO) [ | x | |
| Normalized Term Overlap (NTO) [ | x | |
| NTO | x |
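To make two of the groupwise measures listed above concrete, the sketch below uses their usual definitions from the GO literature: SimUI as the Jaccard index of the ancestor-extended annotation sets, and Best-Match-Average (BMA) as the mean of the two directional best-match averages. The function names, the `ancestors` mapping, and the toy data are hypothetical and are not taken from SML or HESML.

```python
def sim_ui(terms_a, terms_b, ancestors):
    """Jaccard index over annotation sets extended with their ancestors."""
    ext_a = set().union(*(ancestors[t] | {t} for t in terms_a))
    ext_b = set().union(*(ancestors[t] | {t} for t in terms_b))
    union = ext_a | ext_b
    return len(ext_a & ext_b) / len(union) if union else 0.0


def best_match_average(terms_a, terms_b, pair_sim):
    """Mean of the two directional best-match averages."""
    if not terms_a or not terms_b:
        return 0.0
    a_to_b = sum(max(pair_sim(a, b) for b in terms_b) for a in terms_a) / len(terms_a)
    b_to_a = sum(max(pair_sim(a, b) for a in terms_a) for b in terms_b) / len(terms_b)
    return (a_to_b + b_to_a) / 2


# Hypothetical toy data: term -> set of proper ancestors.
toy_ancestors = {"root": set(), "t1": {"root"}, "t2": {"root", "t1"}, "t3": {"root"}}


def toy_pair_sim(a, b):
    """Placeholder pairwise measure; any pairwise similarity could be used here."""
    return 1.0 if a == b else 0.5


print(sim_ui(["t2"], ["t3"], toy_ancestors))
print(best_match_average(["t1", "t2"], ["t2"], toy_pair_sim))
```

BMA takes any pairwise measure as a parameter, which is why the performance table for protein comparison pairs it with a specific one (BMA-Lin-Seco).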
Information Content models implemented by the main publicly available software libraries for the biomedical domain
| IC models | UMLS::Similarity | SML | HESML |
|---|---|---|---|
| Resnik [ | x | x | x |
| CPCorpus [ | x | ||
| CPRefCorpus [ | x | ||
| Seco et al. [ | x | x | x |
| Blanchard et al. [ | x | ||
| Zhou et al. [ | x | x | |
| Sebti and Barfroush [ | x | ||
| Sánchez et al. [ | x | x | x |
| Sánchez and Batet [ | x | ||
| Meng et al. [ | x | ||
| Harispe et al. [ | x | x | |
| Yuan et al. [ | x | ||
| Hadj Taieb et al. [ | x | ||
| Adhikari et al. [ | x | ||
| Ben Aouicha and Hadj Taieb [ | x | ||
| Ben Aouicha et al. [ | x | ||
| CondProbHyponyms [ | x | ||
| CondProbUniform [ | x | ||
| CondProbLeaves [ | x | ||
| CondProbCosine [ | x | ||
| CondProbLogistic [ | x | ||
| CondProbRefHyponyms [ | x | ||
| CondProbRefUniform [ | x | ||
| CondProbRefLeaves [ | x | ||
| CondProbRefCosine [ | x | ||
| CondProbRefLogistic [ | x | ||
| CondProbCosineLeaves [ | x | ||
| CondProbRefLogistic-Leaves [ | x | ||
| CondProbRefLeaves-SubsumerRatio [ | x | ||
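As an illustration of how an intrinsic IC model feeds an IC-based measure, the sketch below combines the Seco et al. formula as it is usually stated, IC(c) = 1 − log(hypo(c) + 1) / log(N), with the Lin similarity, 2·IC(MICA) / (IC(a) + IC(b)), where MICA is the most informative common ancestor. This is the kind of pairing the performance tables call Lin-Seco. The toy ontology and function names are hypothetical, and the code is a minimal sketch rather than the implementation used by any of the libraries above.

```python
import math

# Hypothetical toy ontology: child -> list of parents.
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "c": ["a"], "d": ["a"]}


def ancestors(node):
    """Return the node together with all of its ancestors."""
    seen, stack = {node}, [node]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def descendant_counts():
    """Number of proper descendants (hyponyms) of every node."""
    counts = {n: 0 for n in PARENTS}
    for node in PARENTS:
        for anc in ancestors(node) - {node}:
            counts[anc] += 1
    return counts


def seco_ic(node, counts):
    """Intrinsic IC: leaves get 1, the root of a single-rooted ontology gets 0."""
    return 1.0 - math.log(counts[node] + 1) / math.log(len(PARENTS))


def lin_similarity(x, y, counts):
    """Lin similarity using the most informative common ancestor (MICA)."""
    common = ancestors(x) & ancestors(y)
    mica_ic = max(seco_ic(n, counts) for n in common)
    denom = seco_ic(x, counts) + seco_ic(y, counts)
    return 2.0 * mica_ic / denom if denom > 0 else 0.0


counts = descendant_counts()
print(lin_similarity("c", "d", counts))  # siblings under "a"
print(lin_similarity("c", "b", counts))  # only the root in common
```

No shortest path is involved here, only ancestor sets and precomputed IC values, which fits the much higher throughput the record reports for Lin-Seco compared with the Rada measure.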
Ontologies and thesauri implemented by the three main semantic measures libraries for the biomedical domain
| Ontology | UMLS::Similarity | SML | HESML |
|---|---|---|---|
| MeSH | x | x | x |
| SNOMED | x | x | x |
| WordNet | x | x | |
| OBO file format | x | x | |
| Gene Ontology | x | x | |
| OWL file format | x | ||
| RDF triples files | x |
Fig. 1 HESML V1R5 architecture showing the main functional blocks and abstract interfaces. Boxes in yellow show the main abstract objects and interfaces contained in the HESML library, whilst boxes in turquoise blue show the main HESML client programs, whose aim is to evaluate semantic similarity measures implemented in HESML on the SNOMED-CT, MeSH, GO, and WordNet ontologies
Collection of pre-trained word embedding (WE and WEC) models and ontology-based vector models (OVM) evaluated in a previous series of experiments [58–60] by using the Java classes implementing their evaluation
| WN | Family | Word embedding model |
|---|---|---|
| Yes | WEC | Attract-repel [ |
| No | WE | FastText [ |
| No | WE | GloVe [ |
| No | WE | CBOW [ |
| Yes | WEC | SymPatterns (SP-500d) [ |
| No | WEC | Paragram-ws [ |
| No | WEC | Paragram-sl [ |
| Yes | WEC | Counter-fitting (CF) [ |
| Yes | OVM | WN-RandomWalks [ |
| Yes | OVM | WN-UKB [ |
| Yes | OVM | Nasari [ |
The first column (WN) indicates which methods use WordNet during their training
Fig. 2 Cumulative distribution function (CDF) of the signed AncSPL length error, measured against the exact length of the shortest path between concept pairs in the SNOMED-CT, GO, and WordNet ontologies
Fig. 3 Average running time in microseconds (μs) obtained when evaluating the AncSPL-Rada similarity measure on groups of random concept pairs in SNOMED-CT, GO, and WordNet, which are grouped by the dimension of their corresponding ancestor-based subgraph
Average speed in CUI concept pairs per second (pairs/s) for the evaluation of random CUI pairs with three representative ontology-based similarity measures based on the SNOMED-CT US 2019AB ontology (357,406 nodes) implemented by the three UMLS-based semantic measures libraries reported in the literature
| Similarity measure | UMLS::Similarity | SML | HESML |
|---|---|---|---|
| | Avg. speed (pairs/s) | Avg. speed (pairs/s) | Avg. speed (pairs/s) |
| Rada [ | xxx | 0.041 (15) | |
| AncSPL-Rada (this work) | – | – | |
| Lin-Seco [ | 0.744 (500) | 202160 | |
| Wu-Palmer | 0.035 (15) | – |
Best performing values are shown in bold. Methods that are not implemented are marked (–), and methods that take more than 1 h per pair are marked (xxx). UMLS::Similarity uses caching for the shortest path computations. The number of random CUI pairs evaluated to measure each value is shown in parentheses
Average speed in CUI concept pairs per second (pairs/s) for the evaluation of random CUI pairs with three representative ontology-based similarity measures based on the MeSH ontology (Nov. 2019, 59,747 nodes) implemented by the three UMLS-based semantic measures libraries reported in the literature
| Similarity measure | UMLS::Similarity | SML | HESML |
|---|---|---|---|
| | Avg. speed (pairs/s) | Avg. speed (pairs/s) | Avg. speed (pairs/s) |
| Rada [ | 30.43 (15) | 0.096 (15) | |
| AncSPL-Rada (this work) | – | – | |
| Lin-Seco [ | 140.82 (500) | 532913 | |
| Wu-Palmer | 21.34 (15) | – |
Best performing values are shown in bold. Non-implemented methods (–). The number of random CUI pairs evaluated to measure each value is shown between parentheses
Average speed in GO concept pairs per second (pairs/s) for the evaluation of two representative ontology-based similarity measures based on the Gene Ontology [1, 2] (2020-05-02 version, 44509 nodes) implemented by the state-of-the-art SML [34] library and HESML
| Similarity measure | Measure type | SML | HESML |
|---|---|---|---|
| | | Avg. speed | Avg. speed |
| Rada [ | Edge-counting | 0.077 (20) | |
| AncSPL-Rada (this work) | Edge-counting | – | |
| Lin-Seco [ IC model | IC-based | 372140 |
Best performing values are shown in bold. The number of random GO concept pairs evaluated to measure each value is shown between parentheses
Average speed in sentence pairs per second (sent/s) and CUI pairs per second (CUIs/s) for the evaluation of the UBSM [39] sentence similarity measure combined with three representative ontology-based similarity measures based on MeSH (Nov. 2019) on 30 sentence pairs extracted from the MedSTS [135] sentence similarity dataset, and on 1 million sentence pairs extracted from the BioC corpus [136]
| Pairwise sentence comparison based on MeSH | UMLS::Sim (30 pairs) | SML (30 pairs) | HESML (30 pairs) | |||||
|---|---|---|---|---|---|---|---|---|
| Similarity measure | Avg. speed | Avg. speed | Avg. speed | Avg. speed | Avg. speed | Avg. speed | Avg. speed | Avg. speed |
| Rada et al. [ | 0.441 | 36.63 | 0.126 | 10.478 | 235000 | 7982.222 | 337843.826 | |
| AncSPL-Rada (this work) | – | – | – | – | 211101.695 | 7958.742 | 336850.041 | |
| Lin-Seco [ | 0.782 | 64.956 | 2586.207 | 214741.379 | 259479.167 | 8166.185 | 345629.98 | |
| Wu-Palmer | 0.181 | 15.067 | – | – | 259479.167 | 7892.959 | 334065.805 | |
We provide the average evaluation in normalized CUI pairs per second to allow a fair and unbiased comparison of the results reported for 30 and 1 million sentence pairs. The dataset with 30 sentence pairs requires 2491 pairwise CUI comparisons, whilst the 1 million sentence pairs dataset requires 42324534 pairwise CUI comparisons. Best performing values are shown in bold. Non-implemented methods (–)
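The normalization described in this note can be checked with a few lines of arithmetic: the speed in CUI pairs per second is the speed in sentence pairs per second multiplied by the average number of CUI comparisons per sentence pair. The sketch below applies this to two (sent/s, CUIs/s) value pairs from the table; the function name is hypothetical.

```python
def cui_pairs_per_second(sentence_pairs_per_s, n_sentence_pairs, n_cui_comparisons):
    """Convert sentence pairs/s into CUI pairs/s for a given dataset."""
    return sentence_pairs_per_s * n_cui_comparisons / n_sentence_pairs


# 30-pair dataset: 2491 pairwise CUI comparisons in total.
small = cui_pairs_per_second(2586.207, 30, 2491)
# 1-million-pair dataset: 42324534 pairwise CUI comparisons in total.
large = cui_pairs_per_second(7982.222, 1_000_000, 42_324_534)

print(f"{small:.3f} CUI pairs/s (table value: 214741.379)")
print(f"{large:.3f} CUI pairs/s (table value: 337843.826)")
```

The small dataset averages about 83 CUI comparisons per sentence pair and the large one about 42, which is why raw sentence pairs per second are not directly comparable across the two datasets.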
This table shows the Pearson (r) and Spearman (ρ) correlation values between the similarity values returned by a set of path-based similarity measures and those returned by their reformulation based on the new AncSPL algorithm for a sequence of 1000 random CUI pairs in SNOMED-CT 2019AB, GO (2020-05-02), and WordNet 3.0
| Base measure | AncSPL reformulation | 50 samples | | 100 samples | | 200 samples | | 1000 samples | |
|---|---|---|---|---|---|---|---|---|---|
| | | r | ρ | r | ρ | r | ρ | r | ρ |
| Correlation values in SNOMED-CT ( | |||||||||
| Rada [ | AncSPL-Rada | 0.9214 | 0.9412 | 0.9413 | 0.9444 | 0.9357 | 0.9352 | 0.9231 | 0.9217 |
| Leacock and Chodorow [ | AncSPL-Leacock | 0.9409 | 0.9412 | 0.9479 | 0.9444 | 0.9422 | 0.9352 | 0.9217 | 0.9217 |
| coswJ&C [ | AncSPL-coswJ&C | 0.9136 | 0.9506 | 0.9583 | 0.9747 | 0.9761 | 0.9775 | 0.941 | 0.9714 |
| Correlation values in GO ( | |||||||||
| Rada [ | AncSPL-Rada | 0.8571 | 0.8277 | 0.9133 | 0.9085 | 0.8883 | 0.8868 | 0.9074 | 0.8947 |
| Leacock and Chodorow [ | AncSPL-Leacock | 0.8542 | 0.8277 | 0.9109 | 0.9085 | 0.9007 | 0.8868 | 0.9191 | 0.8947 |
| coswJ&C [ | AncSPL-coswJ&C | 0.9679 | 0.9848 | 0.9372 | 0.9894 | 0.9654 | 0.9888 | 0.9533 | 0.977 |
| Correlation values in WordNet ( | |||||||||
| Rada [ | AncSPL-Rada | 0.9072 | 0.8882 | 0.9151 | 0.8855 | 0.9225 | 0.8994 | 0.9168 | 0.9038 |
| Leacock and Chodorow [ | AncSPL-Leacock | 0.9354 | 0.8882 | 0.9375 | 0.8855 | 0.937 | 0.8994 | 0.9345 | 0.9038 |
| coswJ&C [ | AncSPL-coswJ&C | 0.9993 | 0.9906 | 0.998 | 0.9916 | 0.9644 | 0.9859 | 0.9815 | 0.9807 |
We show the results obtained in the evaluation of the first 50, 100, 200, and 1000 random CUI pairs. All similarity measures are implemented in HESML V1R5 [63]. CoswJ&C [35] sets the current state of the art in the family of ontology-based semantic similarity measures based on WordNet [58]. We define the tree-like deviation as the ratio of nodes with multiple parents to the overall number of ontology nodes. The tree-like deviation is 0 for MeSH, whilst it is 2213/82115 ≈ 0.027 for WordNet 3.0, 151916/357406 ≈ 0.425 for SNOMED-CT, and 19680/44509 ≈ 0.442 for GO
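The tree-like deviation defined in this note is easy to compute from a child-to-parents mapping. The sketch below does so for a hypothetical four-node ontology and then evaluates the three fractions reported in the note; the function name and the toy data are assumptions made for illustration.

```python
def tree_like_deviation(parents):
    """Ratio of nodes with more than one parent to the total number of nodes."""
    multi_parent = sum(1 for ps in parents.values() if len(ps) > 1)
    return multi_parent / len(parents)


# Hypothetical toy ontology: only "c" has two parents.
toy = {"r": [], "a": ["r"], "b": ["r"], "c": ["a", "b"]}
print(tree_like_deviation(toy))

# Fractions reported in the note: (multi-parent nodes, total nodes).
reported = {
    "WordNet 3.0": (2213, 82115),
    "SNOMED-CT": (151916, 357406),
    "GO": (19680, 44509),
}
for name, (multi, total) in reported.items():
    print(f"{name}: {multi / total:.3f}")
```

A value of 0 means every node has at most one parent, so the ontology is a tree, as the note reports for MeSH.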
Experimental confirmation of the factor impacting the linear scalability of AncSPL for the non-tree-like ontologies shown in Fig. 3
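| Ontology | Avg. adjacent nodes per ancestor set | Estimated slope factor (fit to Fig. 3) | Theoretical ratio vs. WN | Experimental ratio vs. WN |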
|---|---|---|---|---|
| SNOMED-CT | 72.02 | 1.191 | 7.79 | 5.39 |
| GO | 31.14 | 0.3277 | 1.46 | 1.48 |
| WordNet (WN) | 25.80 | 0.2210 | 1 | 1 |
The first column shows the average number of adjacent nodes per ancestor set for each node in the ontology. The second column shows the estimated value of the scalability factor, obtained by fitting the scalability plot shown in Fig. 3 to a line. The third and fourth columns compare the theoretical and experimental ratios between the time complexity (slope) of each ontology and that of WordNet (WN), which is used as the baseline
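The last two columns of the table can be reproduced from the first two. The fourth column is the ratio of the fitted slope factors against WordNet. The third column matches the squared ratio of the first-column values against WordNet; this squared relationship is inferred here from the numbers alone, since the formula itself is missing from this record. The function name below is hypothetical.

```python
# (avg. adjacent nodes per ancestor set, fitted slope factor) from the table.
ROWS = {
    "SNOMED-CT": (72.02, 1.191),
    "GO": (31.14, 0.3277),
    "WordNet": (25.80, 0.2210),
}


def ratios_vs_baseline(rows, baseline="WordNet"):
    """Return {ontology: (squared adjacency ratio, slope ratio)} against the baseline."""
    base_adj, base_slope = rows[baseline]
    return {
        name: ((adj / base_adj) ** 2, slope / base_slope)
        for name, (adj, slope) in rows.items()
    }


for name, (squared_adj_ratio, slope_ratio) in ratios_vs_baseline(ROWS).items():
    print(f"{name}: inferred theoretical {squared_adj_ratio:.2f}, experimental {slope_ratio:.2f}")
```

Rounded to two decimals, the results agree with the table: 7.79 and 5.39 for SNOMED-CT, 1.46 and 1.48 for GO.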
Overall running time in seconds (s) and average speed in protein pairs per second (prot. pairs/s) obtained by four groupwise GO-based similarity measures (GO, 2020-05-02 version) implemented by HESML in the evaluation of the pairwise protein similarity between the Homo sapiens and Canis lupus familiaris organisms
| Pairwise protein comparison between two large organisms | |||
|---|---|---|---|
| Measure | Type | HESML running time (s) | Avg. speed (prot. pairs/s) |
| SimLP [ | Common ancestors ratio | 28243 | 12038 |
| SimUI [ | Common ancestor max depth | 31922 | 10651 |
| SimGIC-Seco [ | IC-based | 30754 | 11055 |
| BMA-Lin-Seco [ | IC-based | 7981 | 42604 |
We used the 542193 and 120720 GO annotations for both organisms provided by the “goa_human.gaf” and “go_dog.gaf” files, respectively. Approximately 340 million protein pairs and GO-annotation pairs are compared
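As a consistency check on the table, multiplying each running time by its average speed should recover the number of protein pairs compared, which the note gives as approximately 340 million. A minimal sketch, with a hypothetical function name:

```python
# (overall running time in s, average speed in protein pairs/s) from the table.
RESULTS = {
    "SimLP": (28243, 12038),
    "SimUI": (31922, 10651),
    "SimGIC-Seco": (30754, 11055),
    "BMA-Lin-Seco": (7981, 42604),
}


def implied_pairs(seconds, pairs_per_second):
    """Number of protein pairs implied by a running time and an average speed."""
    return seconds * pairs_per_second


for measure, (seconds, speed) in RESULTS.items():
    print(f"{measure}: {implied_pairs(seconds, speed) / 1e6:.1f} million pairs")
```

All four rows imply the same workload of roughly 340 million pairs, so the speed differences between measures reflect the cost of each measure rather than differences in the amount of work.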