Catia Pesquita, Daniel Faria, André O Falcão, Phillip Lord, Francisco M Couto.
Abstract
In recent years, ontologies have become a mainstream topic in biomedical research. When biological entities are described using a common schema, such as an ontology, they can be compared by means of their annotations. This type of comparison is called semantic similarity, since it assesses the degree of relatedness between two entities by the similarity in meaning of their annotations. The application of semantic similarity to biomedical ontologies is recent; nevertheless, several studies have been published in the last few years describing and evaluating diverse approaches. Semantic similarity has become a valuable tool for validating the results drawn from biomedical studies such as gene clustering, gene expression data analysis, prediction and validation of molecular interactions, and disease gene prioritization. We review semantic similarity measures applied to biomedical ontologies and propose their classification according to the strategies they employ: node-based versus edge-based and pairwise versus groupwise. We also present comparative assessment studies and discuss the implications of their results. We survey the existing implementations of semantic similarity measures, and we describe examples of applications to biomedical research. This will clarify how biomedical researchers can benefit from semantic similarity measures and help them choose the approach most suitable for their studies.

Biomedical ontologies are evolving toward increased coverage, formality, and integration, and their use for annotation is increasingly becoming a focus of both effort by biomedical experts and application of automated annotation procedures to create corpora of higher quality and completeness than are currently available. Given that semantic similarity measures are directly dependent on these evolutions, we can expect to see them gaining more relevance and even becoming as essential as sequence similarity is today in biomedical research.
Year: 2009 PMID: 19649320 PMCID: PMC2712090 DOI: 10.1371/journal.pcbi.1000443
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Figure 1. Section of the GO graph showing the three aspects (molecular function, biological process, and cellular component) and some of their descendant terms.
The fact that GO is a directed acyclic graph (DAG) rather than a tree is illustrated by the term "transcription factor activity", which has two parents. An example of a "part of" relationship is also shown between the terms "cell part" and "cell".
Figure 2. Main approaches for comparing terms (node-based and edge-based) and the techniques used by each approach.
DCA, disjoint common ancestors; IC, information content; MICA, most informative common ancestor.
Figure 3. Main approaches for comparing gene products (pairwise and groupwise) and the techniques used by each approach.
Summary of term measures, their approaches, and their techniques.
| Measure | Approach | Techniques |
| Resnik | Node-based | MICA |
| Lin | Node-based | MICA |
| Jiang and Conrath | Node-based | MICA |
| GraSM | Node-based | DCA |
| Schlicker et al. | Node-based | MICA |
| Wu et al. | Edge-based | Shared path |
| Wu et al. | Edge-based | Shared path; distance |
| Bodenreider et al. | Node-based | Shared annotations |
| Othman et al. | Hybrid | IC/depth/number of children; distance |
| Wang et al. | Hybrid | Shared ancestors |
| Riensche et al. | Node-based | IC/MICA; shared annotations |
| Yu et al. | Edge-based | Shared path |
| Cheng et al. | Edge-based | Shared path |
| Pozo et al. | Edge-based | Shared path |
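The node-based, MICA-driven rows of the table above (Resnik, Lin, Jiang and Conrath) can be illustrated with a minimal sketch. It assumes the usual definitions: IC(t) = -log p(t), Resnik similarity as the IC of the most informative common ancestor, Lin similarity as 2·IC(MICA) / (IC(t1) + IC(t2)), and Jiang and Conrath distance as IC(t1) + IC(t2) - 2·IC(MICA). The toy graph, the term names, and the probabilities are hypothetical and do not come from GO.

```python
import math

# Hypothetical toy DAG (child -> parents); these are not real GO terms.
PARENTS = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "c": ["a"],
    "d": ["a", "b"],  # two parents, so the graph is a DAG rather than a tree
}

# Hypothetical annotation probabilities p(t): share of annotations at t or below.
PROB = {"root": 1.0, "a": 0.5, "b": 0.5, "c": 0.25, "d": 0.25}


def ancestors(term):
    """Return the term itself plus every ancestor reachable through PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content: IC(t) = -log2 p(t)."""
    return -math.log2(PROB[term])


def mica_ic(t1, t2):
    """IC of the most informative common ancestor (MICA) of two terms."""
    return max(ic(t) for t in ancestors(t1) & ancestors(t2))


def sim_resnik(t1, t2):
    """Resnik similarity: the IC of the MICA."""
    return mica_ic(t1, t2)


def sim_lin(t1, t2):
    """Lin similarity: 2 * IC(MICA) / (IC(t1) + IC(t2))."""
    denom = ic(t1) + ic(t2)
    return 2 * mica_ic(t1, t2) / denom if denom > 0 else 1.0


def dist_jiang_conrath(t1, t2):
    """Jiang and Conrath distance: IC(t1) + IC(t2) - 2 * IC(MICA)."""
    return ic(t1) + ic(t2) - 2 * mica_ic(t1, t2)


if __name__ == "__main__":
    print(sim_resnik("c", "d"), sim_lin("c", "d"), dist_jiang_conrath("c", "d"))
```

In this toy graph, "c" and "d" share the ancestors "a" and "root", so "a" is the MICA. Because Jiang and Conrath define a distance, it has to be converted before it can be compared with the two similarity scores.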
Summary of pairwise approaches.
| Measure | Approach | Techniques | Term Comparison |
| Lord et al. | All pairs | Average | Resnik/Lin/Jiang |
| Sevilla et al. | All pairs | Maximum | Resnik/Lin/Jiang |
| Riensche et al. | All pairs | Maximum | XOA |
| Azuaje et al. | Best pairs | Average | Resnik/Lin/Jiang |
| Couto et al. | Best pairs | Average | GraSM+(Resnik/Lin/Jiang) |
| Schlicker et al. | Best pairs | Average | simRel |
| Wang et al. | Best pairs | Average | Wang |
| Tao et al. | Best pairs | Average, Min. threshold | Lin |
| Pozo et al. | Best pairs | Average | Pozo |
| Lei et al. | All pairs, Best pairs [a] | Average, Max, Sum | Depth of LCA |
[a] Lei et al. also consider exact matches only.
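The "all pairs" and "best pairs" strategies in the table above differ only in how term-to-term scores are combined into one score for two gene products. The sketch below is a minimal illustration under stated assumptions: the `TERM_SIM` scores and the term names are hypothetical placeholders for any term measure, and `best_match_average` implements one common variant of the best-pairs average (the mean of the two directional averages), which is not necessarily the exact formula used by every study listed.

```python
from itertools import product

# Hypothetical symmetric term-to-term scores, standing in for any term measure
# such as Resnik, Lin, or Jiang and Conrath.
TERM_SIM = {
    ("t1", "t1"): 1.0, ("t2", "t2"): 1.0, ("t3", "t3"): 1.0,
    ("t1", "t2"): 0.5, ("t1", "t3"): 0.2, ("t2", "t3"): 0.8,
}


def term_sim(a, b):
    """Look up a term pair in either order; unknown pairs score 0."""
    return TERM_SIM.get((a, b), TERM_SIM.get((b, a), 0.0))


def all_pairs_average(terms_a, terms_b):
    """All pairs, average: mean score over every cross-product term pair."""
    scores = [term_sim(a, b) for a, b in product(terms_a, terms_b)]
    return sum(scores) / len(scores)


def all_pairs_maximum(terms_a, terms_b):
    """All pairs, maximum: the single best-scoring term pair."""
    return max(term_sim(a, b) for a, b in product(terms_a, terms_b))


def best_match_average(terms_a, terms_b):
    """Best pairs, average: each term keeps only its best match in the other set."""
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) / len(best_a) + sum(best_b) / len(best_b)) / 2


if __name__ == "__main__":
    product_a, product_b = ["t1", "t2"], ["t3"]
    print(all_pairs_average(product_a, product_b))
    print(all_pairs_maximum(product_a, product_b))
    print(best_match_average(product_a, product_b))
```

The all-pairs average is pulled down by every unrelated term pair, whereas the maximum and the best-match average reward gene products that share at least one closely related annotation.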
Summary of groupwise approaches.
| Measure | Approach | Techniques | Weighting |
| Lee et al. | Graph-based | Term overlap | None |
| Mistry et al. | Graph-based | Term overlap, Normalized | None |
| Gentleman | Graph-based | Shared-path | None |
| Gentleman | Graph-based | Jaccard | None |
| Martin et al. | Graph-based | Czekanowski-Dice, Jaccard | None |
| Pesquita et al. | Graph-based | Jaccard | IC |
| Ye et al. | Graph-based | LCA, Normalized | None |
| Cho et al. | Graph-based | LCA | IC |
| Lin et al. | Graph-based | Intersection | Annotation set probability |
| Yu et al. | Graph-based | LCA | Annotation set probability |
| Sheehan et al. | Graph-based | Resnik, Lin | Annotation set probability |
| Huang et al. | Vector-based | Kappa-statistic | None |
| Chabalier et al. | Vector-based | Cosine | IC |
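The graph-based Jaccard rows of the table above compare the full annotation graphs of two gene products rather than individual term pairs. As a rough sketch, the unweighted form divides the number of shared terms by the number of terms in the union, where each graph contains the annotated terms plus all of their ancestors, and the IC-weighted form sums the information content of the terms instead of counting them. The toy graph, the term names, and the IC values are hypothetical.

```python
# Hypothetical toy DAG (child -> parents) and information content values;
# these are not real GO terms.
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "c": ["a"], "d": ["a", "b"]}
IC = {"root": 0.0, "a": 1.0, "b": 1.0, "c": 2.0, "d": 2.0}


def extended_terms(terms):
    """Direct annotations plus all of their ancestors (the annotation graph)."""
    seen = set(terms)
    stack = list(terms)
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def jaccard(terms_a, terms_b):
    """Unweighted graph Jaccard: shared terms divided by terms in the union."""
    graph_a, graph_b = extended_terms(terms_a), extended_terms(terms_b)
    return len(graph_a & graph_b) / len(graph_a | graph_b)


def ic_weighted_jaccard(terms_a, terms_b):
    """IC-weighted graph Jaccard: each term counts with its information content."""
    graph_a, graph_b = extended_terms(terms_a), extended_terms(terms_b)
    union_ic = sum(IC[t] for t in graph_a | graph_b)
    if union_ic == 0:
        return 1.0  # both graphs contain only zero-IC terms such as the root
    return sum(IC[t] for t in graph_a & graph_b) / union_ic


if __name__ == "__main__":
    print(jaccard(["c"], ["d"]))
    print(ic_weighted_jaccard(["c"], ["d"]))
```

Weighting by IC reduces the contribution of very general shared terms: the root is shared by every pair of gene products, but it carries no information and adds nothing to the weighted score.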
Summary of assessment studies performed on semantic similarity measures in GO, detailing the properties used in the evaluation and the best performing measures.
| Study | Standard | Best Measure |
| Lord et al. | Sequence similarity | Resnik (average) |
| Wang et al. | Gene expression | None |
| Sevilla et al. | Gene expression | Resnik (max) |
| Couto et al. | Family similarity | Jiang and Conrath |
| Schlicker et al. | Sequence and family similarity | Schlicker et al. |
| Lei et al. | Subnuclear location | TO |
| Guo et al. | Human regulatory pathways | Resnik (average) |
| Wang et al. | Clustering | Wang et al. |
| Pesquita et al. | Sequence similarity | simGIC |
| Xu et al. | PPI/gene expression | Resnik (max) |
| Mistry et al. | Sequence similarity | TO/Resnik (max) |
Tools for GO-based semantic similarity measures.
| Tool | Format Available | Measures Implemented | Input Size | Annotation Types | Extras |
| FuSSiMeG | Web | Several | 2 | All | None |
| GOToolBox | Web | Several | Unlimited | Single | Representation, Clustering, Semantic retrieval |
| ProteInOn | Web | Several | 10 | All/manual | Protein interaction |
| G-SESAME | Web | Wang et al. | 2 | All manual, Single manual | Clustering, Filter by species |
| FunSimMat | Web | Several | Unlimited | All | Filter by protein family, Filter by species |
| DynGO | Standalone | AVG(Resnik) | Unlimited | All ECs | Visualization, Browsing, Semantic retrieval |
| UTMGO | Standalone | Othman et al. | NA | | Semantic retrieval of terms |
| SemSim | R | Several | NA | all/non- | Support for clustering, filter by species |
| GOvis | R | simLP+simUI | NA | All | Visualization |
| csbl.go | R | Several | NA | NA | Clustering |
Input Size: acceptable number of terms or gene products.
Best measures for the main applications of GO-based semantic similarity measures.
| Application | Best Measure | Reference |
| Function | BMA(Resnik)/simGIC | |
| Protein-protein interaction p/v | Max(Resnik) | |
| Cellular location prediction | SUM(EM) | |
Identified by sequence similarity.
p/v, prediction/validation.