Rezvan Ehsani, Finn Drabløs.
Abstract
BACKGROUND: The Gene Ontology (GO) is a dynamic, controlled vocabulary that describes the cellular function of genes and proteins according to three major categories: biological process, molecular function and cellular component. It is widely used in bioinformatics applications for annotating genes and for measuring their semantic similarity rather than their sequence similarity. In general, semantic similarity measures use the GO tree topology, the information content of GO terms, or a combination of both.
Keywords: Gene ontology; Semantic similarity measure; Tree topology
Year: 2016 PMID: 27473391 PMCID: PMC4966780 DOI: 10.1186/s12859-016-1160-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
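The abstract refers to the information content (IC) of GO terms. As a minimal sketch of the commonly used definition, IC(t) = -log p(t), where p(t) is the fraction of annotations attached to term t or any of its descendants, the following Python example computes IC on a toy graph. The term names and annotation counts are hypothetical, and this illustrates only the general concept, not the TopoICSim measure itself.

```python
import math

# Toy ontology (hypothetical term names): each term maps to its parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
}

# Hypothetical number of genes annotated directly to each term.
DIRECT_COUNTS = {"root": 0, "A": 2, "B": 5, "C": 3}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log(p(t)); p(t) counts annotations to t or its descendants."""
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        # An annotation to a term also counts for every ancestor of that term.
        for anc in ancestors(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {term: -math.log(c / total) for term, c in cumulative.items()}


if __name__ == "__main__":
    for term, ic in information_content().items():
        print(f"{term}: {ic:.3f}")
```

In this sketch the root has IC 0 and more specific terms receive higher values, which is why IC is often combined with the graph topology when scoring term similarity.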
List of human KEGG pathways and Pfam clans used for benchmarking
| KEGG pathway | Name | #genes | Pfam clan accession | Name | #genes |
|---|---|---|---|---|---|
| hsa00040 | Pentose and glucuronate interconversions | 26 | CL0099.10 | ALDH-like | 18 |
| hsa00920 | Sulfur metabolism | 13 | CL0106.10 | 6PGD_C | 8 |
| hsa00140 | C21-Steroid hormone metabolism | 17 | CL0417.1 | BIR-like | 9 |
| hsa00290 | Valine, leucine and isoleucine biosynthesis | 11 | CL0165.8 | Cache | 5 |
| hsa00563 | Glycosylphosphatidylinositol (GPI)-anchor biosynthesis | 23 | CL0149.9 | CoA-acyltrans | 7 |
| hsa00670 | One carbon pool by folate | 16 | CL0085.11 | FAD_DHS | 12 |
| hsa00232 | Caffeine metabolism | 7 | CL0076.9 | FAD_Lum_binding | 18 |
| hsa03022 | Basal transcription factors | 38 | CL0289.3 | FBD | 6 |
| hsa03020 | RNA polymerase | 29 | CL0119.10 | Flavokinase | 7 |
| hsa04130 | SNARE interactions in vesicular transport | 38 | CL0042.9 | Flavoprotein | 10 |
| hsa03450 | Non-homologous end-joining | 14 | | | |
| hsa03430 | Mismatch repair | 23 | | | |
| hsa04950 | Maturity onset diabetes of the young | 25 | | | |
| Total #genes | | 280 | | | 100 |
These datasets were obtained directly from [22]
Fig. 1. Pseudocode for the TopoICSim algorithm
Fig. 2. Sample GO structure illustrating the main computations used in TopoICSim
Fig. 3. IntraSet similarities for the Pfam clan dataset using MF annotations. The IntraSet similarity is estimated for all pairs of genes within each clan using MF annotations, for all considered similarity measures
Fig. 4. IntraSet similarities for the KEGG pathways dataset using BP annotations. The IntraSet similarity is estimated for all pairs of genes within each KEGG pathway using BP annotations, for all considered similarity measures
Fig. 5. Comparison of the discriminating power of six similarity measures using Pfam clans and MF annotations. The discriminating power values estimated with all considered similarity measures are plotted for all Pfam clans
Fig. 6. Comparison of the discriminating power of six similarity measures using KEGG pathways and BP annotations. The discriminating power values estimated with all considered similarity measures are plotted for all KEGG pathways
Fig. 7. Comparison of the IDP values of six similarity measures using Pfam clans and MF annotations
Fig. 8. Comparison of the IDP values of six similarity measures using KEGG pathways and BP annotations
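The figure captions refer to IntraSet similarity and discriminating power without giving formulas. The Python sketch below shows one way these quantities could be computed. It assumes that IntraSet similarity is the mean similarity over all ordered gene pairs in a set (self-pairs included) and that discriminating power is the IntraSet similarity of a set divided by its mean InterSet similarity to the other sets; both definitions are assumptions made here for illustration and should be checked against the full text. The function names and `toy_sim` are hypothetical.

```python
from itertools import product


def set_similarity(set_a, set_b, sim):
    """Mean of sim(g, h) over all ordered pairs (g, h) in set_a x set_b."""
    pairs = list(product(set_a, set_b))
    return sum(sim(g, h) for g, h in pairs) / len(pairs)


def intraset_similarity(genes, sim):
    """Assumed definition: mean pairwise similarity within one gene set."""
    return set_similarity(genes, genes, sim)


def discriminating_power(index, gene_sets, sim):
    """Assumed definition: IntraSet of a set over its mean InterSet to the others."""
    target = gene_sets[index]
    others = [s for i, s in enumerate(gene_sets) if i != index]
    mean_inter = sum(set_similarity(target, o, sim) for o in others) / len(others)
    return intraset_similarity(target, sim) / mean_inter


def toy_sim(g, h):
    """Hypothetical similarity: identical genes 1.0, same family 0.8, else 0.2."""
    if g == h:
        return 1.0
    return 0.8 if g[0] == h[0] else 0.2


if __name__ == "__main__":
    sets = [["a1", "a2"], ["b1", "b2"]]
    print(intraset_similarity(sets[0], toy_sim))
    print(discriminating_power(0, sets, toy_sim))
```

Any gene-level similarity measure with the signature `sim(g, h)` can be passed in place of `toy_sim`. Under these assumed definitions, a discriminating power above 1 means that genes are more similar within their own set than to genes in other sets.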
Correlation between expression and annotation similarities
| | G2M | | | DNA_REPAIR | | | STAT3 | | |
|---|---|---|---|---|---|---|---|---|---|
| | TopoICSim | IntelliGO | Wang | TopoICSim | IntelliGO | Wang | TopoICSim | IntelliGO | Wang |
| Pearson | 0.932 | 0.572 | 0.849 | 0.890 | 0.879 | 0.867 | 0.833 | 0.795 | 0.824 |
| Spearman | 0.914 | 0.548 | 0.871 | 0.876 | 0.890 | 0.813 | 0.872 | 0.766 | 0.793 |
| DC | | 0.594 | 0.885 | | 0.887 | 0.863 | | 0.801 | 0.827 |
Numbers in bold indicate the best correlation for each subset when comparing TopoICSim, IntelliGO and Wang
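The Pearson and Spearman rows of this table are standard correlation coefficients. As a self-contained sketch of how two vectors of pairwise similarities could be compared, the Python example below implements both using only the standard library. The input vectors are hypothetical, and how the expression similarities were derived is not stated in this record; the DC row is not covered here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical expression-based and annotation-based similarities for four gene pairs.
    expression_sim = [0.1, 0.4, 0.5, 0.9]
    annotation_sim = [0.2, 0.3, 0.7, 0.8]
    print(pearson(expression_sim, annotation_sim))
    print(spearman(expression_sim, annotation_sim))
```

Spearman depends only on the ordering of the values, so it equals 1 for any strictly increasing relationship, whereas Pearson equals 1 only for a linear one.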
Results obtained with the CESSM benchmarking tool
| Annotation | Metric | SimGIC | SimUI | RA | RM | RB | LA | LM | LB | JA | JM | JB | TopoICSim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MF | ECC | 0.62 | 0.63 | 0.39 | 0.45 | 0.60 | 0.42 | 0.45 | 0.64 | 0.34 | 0.36 | 0.56 | |
| MF | Pfam | | 0.61 | 0.44 | 0.18 | 0.57 | 0.44 | 0.18 | 0.56 | 0.33 | 0.12 | 0.49 | 0.62 |
| MF | SeqSim | | 0.59 | 0.50 | 0.12 | 0.66 | 0.46 | 0.12 | 0.60 | 0.29 | 0.10 | 0.54 | 0.55 |
| BP | ECC | 0.39 | 0.40 | 0.30 | 0.30 | 0.44 | 0.30 | 0.31 | 0.43 | 0.19 | 0.25 | 0.37 | |
| BP | Pfam | 0.45 | 0.45 | 0.32 | 0.26 | 0.45 | 0.28 | 0.20 | 0.37 | 0.17 | 0.16 | 0.33 | |
| BP | SeqSim | | 0.73 | 0.40 | 0.30 | 0.73 | 0.34 | 0.25 | 0.63 | 0.21 | 0.23 | 0.58 | 0.68 |
Pearson correlation coefficients are shown for the ECC, Pfam, and SeqSim datasets, using the MF and BP annotations. Numbers in bold show the best correlation for each dataset. The column headings represent the following methods: SimGIC, Similarity Graph Information Content; SimUI, Union Intersection similarity; RA, Resnik Average; RM, Resnik Max; RB, Resnik Best match; LA, Lord Average; LM, Lord Max; LB, Lord Best match; JA, Jaccard Average; JM, Jaccard Max; JB, Jaccard Best match
Running time
| Gene set | Interactions | TopoICSim (min) | IntelliGO (min) | Wang (min) |
|---|---|---|---|---|
| STAT3 | 7569 | 112 | 132 | 15 |
| DNA_REPAIR | 22801 | 312 | 426 | 45 |
| G2M | 40000 | 595 | 815 | 83 |
Running times in minutes for calculating similarities over all gene pairs in each of the gene sets
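To compare the methods independently of gene set size, the running times can be normalized by the number of interactions. The Python sketch below does this arithmetic on the values copied from the table. Each interaction count is a perfect square (for example 7569 = 87 × 87), which suggests that it counts all ordered gene pairs; that reading is an inference from the numbers, not something stated in the table, and the function names are hypothetical.

```python
import math

# (gene set, interactions, TopoICSim min, IntelliGO min, Wang min), copied from the table.
RUNNING_TIMES = [
    ("STAT3", 7569, 112, 132, 15),
    ("DNA_REPAIR", 22801, 312, 426, 45),
    ("G2M", 40000, 595, 815, 83),
]


def seconds_per_pair(minutes, interactions):
    """Convert a total running time in minutes to seconds per gene pair."""
    return minutes * 60 / interactions


def summarize(rows):
    """Per-pair cost for each method, plus the gene count if interactions is a square."""
    summary = {}
    for name, pairs, topo, intelligo, wang in rows:
        root = math.isqrt(pairs)
        summary[name] = {
            # Assumption: interactions = n * n for a set of n genes.
            "genes_if_square": root if root * root == pairs else None,
            "TopoICSim": seconds_per_pair(topo, pairs),
            "IntelliGO": seconds_per_pair(intelligo, pairs),
            "Wang": seconds_per_pair(wang, pairs),
        }
    return summary


if __name__ == "__main__":
    for name, stats in summarize(RUNNING_TIMES).items():
        print(name, stats)
```

On these figures, TopoICSim takes roughly 0.8 to 0.9 s per pair, IntelliGO roughly 1.0 to 1.2 s, and Wang about 0.12 s.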