Gaston K Mazandu, Nicola J Mulder.
Abstract
The current increase in Gene Ontology (GO) annotations of proteins in the existing genome databases and their use in different analyses have fostered the improvement of several biomedical and biological applications. To integrate this functional data into different analyses, several protein functional similarity measures based on GO term information content (IC) have been proposed and evaluated, especially in the context of annotation-based measures. In the case of topology-based measures, each approach was set with a specific functional similarity measure depending on its conception and applications for which it was designed. However, it is not clear whether a specific functional similarity measure associated with a given approach is the most appropriate, given a biological data set or an application, i.e., achieving the best performance compared to other functional similarity measures for the biological application under consideration. We show that, in general, a specific functional similarity measure often used with a given term IC or term semantic similarity approach is not always the best for different biological data and applications. We have conducted a performance evaluation of a number of different functional similarity measures using different types of biological data in order to infer the best functional similarity measure for each different term IC and semantic similarity approach. The comparisons of different protein functional similarity measures should help researchers choose the most appropriate measure for the biological application under consideration.
Year: 2014 PMID: 25474538 PMCID: PMC4256219 DOI: 10.1371/journal.pone.0113859
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of different IC-based functional similarity and term semantic similarity measures.
| Measure | Model | Approach | Reference |
| Functional similarity | IC-based direct term | SimGIC | |
| | | SimDIC | |
| | | SimUIC | |
| | | SimUI | |
| | Pair-wise term or IC-based non direct term | Avg | |
| | | Max | |
| | | BMA | |
| | | ABM | |
| Term Semantic Similarity | Annotation-based | Resnik | |
| | | XGraSM-Resnik | |
| | | Nunivers | |
| | | XGraSM-Nunivers | |
| | | Lin | |
| | | XGraSM-Lin | |
| | | Li et al. | |
| | | Relevance | |
| | Topology-based | GO-universal | |
| | | Wang et al. | |
| | | Zhang et al. | |
These measures were used to build 57 different functional similarity measures, which are assessed using different types of biological data, including Enzyme Commission (EC), Pfam domain, Sequence Similarity (Seq. Sim.), Protein-Protein Interaction (PPI) and Co-expression Network (CN) or Gene Expression (microarray) data.
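As an illustration of the pair-wise term model in the table above, the sketch below combines a matrix of term-to-term semantic similarity scores into protein-level Avg, Max, BMA and ABM scores. It follows the definitions commonly given for these measures (BMA as the mean of the two directional best-match averages, ABM as the average over all best matches); the source does not spell the formulas out, so treat them as assumptions. The function name `pairwise_scores` and the input layout are hypothetical.

```python
from statistics import mean


def pairwise_scores(sim):
    """Combine term-to-term similarities into protein-level scores.

    sim[i][j] is assumed to hold the semantic similarity between the
    i-th GO term of protein p and the j-th GO term of protein q
    (a hypothetical input layout, not taken from the source).
    """
    row_best = [max(row) for row in sim]        # best match for each term of p
    col_best = [max(col) for col in zip(*sim)]  # best match for each term of q
    flat = [s for row in sim for s in row]
    return {
        "Avg": mean(flat),
        "Max": max(flat),
        # Assumed definition: mean of the two directional best-match averages.
        "BMA": 0.5 * (mean(row_best) + mean(col_best)),
        # Assumed definition: average over all best matches from both sides.
        "ABM": (sum(row_best) + sum(col_best)) / (len(row_best) + len(col_best)),
    }


# Made-up similarity values for a protein with 2 terms and one with 3 terms.
example_scores = pairwise_scores([[0.9, 0.1, 0.3],
                                  [0.2, 0.8, 0.4]])
```

BMA and ABM coincide when both proteins carry the same number of terms and differ otherwise, which is one reason the two can rank protein pairs differently.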
Figure 1. Performance evaluation in terms of Pearson's correlation values.
These Pearson's correlation values with Enzyme Commission (EC), Pfam and Sequence similarity are obtained from the CESSM online tool. For x-axis labels, the prefixes R, N, L, Li, S, X, A, Z, W, and U represent the approaches and stand for Resnik, Nunivers, Lin, Li, Relevance, XGraSM, Annotation-based, Zhang, Wang and GO-universal, respectively. The suffixes GIC, UIC and DIC represent SimGIC, SimUIC and SimDIC measures, respectively. In cases where the prefix X is used, it is immediately followed by the approach prefix. Refer to Tables 2 and 3 for the description of these different measures.
Pearson's correlation values of different measures.
| Approach | Measure | Molecular Function (MF) | | | Cellular Component (CC) | | | Biological Process (BP) | | |
| | | EC | PFAM | Seq Sim | EC | PFAM | Seq Sim | EC | PFAM | Seq Sim |
| R | Avg (RAvg) | 0.37532 | 0.38905 | 0.36071 | 0.28501 | 0.41217 | 0.31369 | 0.21121 | 0.25342 | 0.21123 |
| | ABM (RABM) | 0.53917 | 0.46098 | 0.50052 | 0.32731 | 0.45986 | 0.54093 | 0.34236 | 0.31019 | 0.46918 |
| | BMA (RBMA) | 0.54787 | 0.46651 | 0.50675 | 0.32117 | 0.46045 | 0.52959 | 0.34478 | 0.30893 | 0.46692 |
| | Max (RMax) | 0.55531 | 0.52123 | | 0.27177 | 0.39069 | 0.51783 | 0.30957 | 0.30944 | 0.40344 |
| | XGraSM-Avg (XRAvg) | 0.44532 | 0.49719 | 0.45670 | 0.30590 | 0.41945 | 0.40519 | 0.27856 | 0.32978 | 0.35064 |
| | XGraSM-ABM (XRABM) | 0.62228 | | 0.43396 | | | | 0.48034 | 0.51387 | 0.71515 |
| | XGraSM-BMA (XRBMA) | | 0.52562 | 0.42832 | 0.35322 | 0.46763 | 0.61850 | | | |
| | XGraSM-Max (XRMax) | 0.31849 | 0.21078 | 0.09714 | 0.18628 | 0.24576 | 0.23365 | 0.36543 | 0.23418 | 0.12051 |
| N | Avg (NAvg) | 0.40437 | 0.40605 | 0.30088 | 0.28388 | 0.41696 | 0.32773 | 0.24504 | 0.29716 | 0.24824 |
| | ABM (NABM) | 0.52989 | 0.42083 | 0.32264 | 0.31967 | 0.45185 | 0.53365 | 0.40960 | 0.40157 | 0.53209 |
| | BMA (NBMA) | 0.53717 | 0.41589 | 0.31800 | 0.31157 | 0.44377 | 0.51035 | 0.41764 | 0.39956 | 0.52862 |
| | Max (NMax) | 0.20693 | 0.18493 | 0.07917 | 0.15753 | 0.23744 | 0.21049 | 0.26021 | 0.21171 | 0.10015 |
| | XGraSM-Avg (XNAvg) | 0.38562 | 0.42789 | 0.34098 | 0.29236 | 0.41989 | 0.36626 | 0.26348 | 0.30804 | 0.30011 |
| | XGraSM-ABM (XNABM) | 0.56160 | | | | | | 0.45603 | 0.46176 | 0.65137 |
| | XGraSM-BMA (XNBMA) | | 0.48224 | 0.39241 | 0.33988 | 0.46313 | 0.59236 | | | |
| | XGraSM-Max (XNMax) | 0.23379 | 0.19230 | 0.08402 | 0.17896 | 0.24475 | 0.22948 | 0.32608 | 0.22527 | 0.11304 |
| L | Avg (LAvg) | 0.37960 | 0.38149 | 0.26975 | 0.27358 | 0.40980 | 0.30420 | 0.23344 | 0.29618 | 0.22678 |
| | ABM (LABM) | 0.47794 | 0.37214 | 0.27193 | 0.29969 | 0.43507 | 0.49146 | 0.38369 | 0.37405 | 0.47976 |
| | BMA (LBMA) | 0.48346 | 0.36783 | 0.26797 | 0.28974 | 0.42621 | 0.46926 | 0.38909 | 0.37171 | 0.47449 |
| | Max (LMax) | 0.18341 | 0.17780 | 0.07476 | 0.14639 | 0.23298 | 0.19865 | 0.23248 | 0.20287 | 0.09363 |
| | XGraSM-Avg (XLAvg) | 0.35730 | 0.40170 | 0.30816 | 0.28454 | 0.41735 | 0.34196 | 0.25193 | 0.30720 | 0.27799 |
| | XGraSM-ABM (XLABM) | 0.52692 | | | | | | 0.44261 | 0.44548 | 0.62035 |
| | XGraSM-BMA (XLBMA) | | 0.45679 | 0.36643 | 0.33214 | 0.45976 | 0.57860 | | | |
| | XGraSM-Max (XLMax) | 0.21668 | 0.18726 | 0.08100 | 0.17668 | 0.24590 | 0.22795 | 0.36543 | 0.23418 | 0.12051 |
| S | Avg (SAvg) | 0.39895 | | | 0.27509 | 0.40934 | 0.31267 | 0.24007 | 0.29585 | 0.23224 |
| | ABM (SABM) | 0.49846 | 0.37641 | 0.27502 | | | | 0.38575 | | |
| | BMA (SBMA) | | 0.37236 | 0.27109 | 0.29257 | 0.42855 | 0.47516 | | 0.37108 | 0.47462 |
| | Max (SMax) | 0.20848 | 0.18507 | 0.07914 | 0.14737 | 0.23302 | 0.20005 | 0.23424 | 0.20336 | 0.09398 |
| Li | Avg (LiAvg) | 0.42024 | 0.40930 | 0.30788 | 0.28658 | 0.41761 | 0.33494 | 0.25799 | 0.31039 | 0.25784 |
| | ABM (LiABM) | 0.53691 | | | | | | 0.41396 | 0.40670 | |
| | BMA (LiBMA) | | 0.41010 | 0.30739 | 0.31239 | 0.44524 | 0.51221 | | | 0.52966 |
| | Max (LiMax) | 0.24125 | 0.19425 | 0.08499 | 0.16030 | 0.24041 | 0.21243 | 0.26839 | 0.21407 | 0.10168 |
| A | SimGIC (AGIC) | 0.59941 | | | | | | 0.44164 | 0.49011 | |
| | SimDIC (ADIC) | | 0.54614 | 0.49134 | 0.36469 | 0.51438 | 0.66385 | | | 0.69403 |
| | SimUIC (AUIC) | 0.57433 | 0.54488 | 0.50643 | 0.35844 | 0.50424 | 0.67929 | 0.44573 | 0.48520 | 0.69341 |
| Z | Avg (ZAvg) | 0.42242 | 0.39074 | 0.27595 | 0.26767 | 0.40746 | 0.32121 | 0.21181 | 0.31769 | 0.19658 |
| | ABM (ZABM) | 0.49670 | 0.38912 | 0.28048 | 0.29104 | 0.41915 | 0.48201 | 0.39965 | 0.43449 | 0.51446 |
| | BMA (ZBMA) | 0.50184 | 0.38219 | 0.27446 | 0.28135 | 0.41131 | 0.46165 | 0.40097 | 0.42915 | 0.50697 |
| | Max (ZMax) | 0.21496 | 0.18623 | 0.08015 | 0.14434 | 0.22535 | 0.19262 | 0.24156 | 0.20658 | 0.09524 |
| | SimGIC (ZGIC) | | | | | | | 0.45672 | 0.50121 | |
| | SimDIC (ZDIC) | 0.54733 | 0.44010 | 0.36048 | 0.36433 | 0.51140 | 0.66128 | 0.48173 | | 0.67914 |
| | SimUIC (ZUIC) | 0.52587 | 0.44723 | 0.37719 | 0.35847 | 0.50159 | 0.67704 | | 0.49626 | 0.67906 |
| W | Avg (WAvg) | 0.32939 | 0.39711 | 0.31829 | 0.27822 | 0.39797 | 0.28790 | 0.24429 | 0.37518 | 0.31967 |
| | ABM (WABM) | 0.43759 | 0.37805 | 0.26197 | 0.30419 | 0.42580 | 0.47450 | 0.42471 | 0.47434 | 0.59775 |
| | BMA (WBMA) | 0.43853 | 0.36980 | 0.25558 | 0.29551 | 0.41827 | 0.45501 | 0.43182 | 0.46893 | 0.59284 |
| | Max (WMax) | 0.17071 | 0.17392 | 0.07249 | 0.14920 | 0.22981 | 0.19250 | 0.27792 | 0.21691 | 0.10267 |
| | SimGIC (WGIC) | | | | | | | 0.46808 | 0.49629 | |
| | SimDIC (WDIC) | 0.53335 | 0.40794 | 0.29754 | 0.35580 | 0.50564 | 0.61990 | | | 0.65003 |
| | SimUIC (WUIC) | 0.52018 | 0.42227 | 0.31293 | 0.35396 | 0.49725 | 0.64186 | 0.46571 | 0.48716 | 0.65164 |
| U | Avg (UAvg) | 0.36584 | | | 0.31186 | 0.41240 | 0.32592 | 0.29650 | 0.38034 | 0.37786 |
| | ABM (UABM) | 0.51354 | 0.42361 | 0.31259 | | 0.47023 | 0.56028 | 0.46424 | | |
| | BMA (UBMA) | 0.51740 | 0.41406 | 0.30507 | 0.35088 | 0.45819 | 0.53448 | 0.47364 | 0.50134 | 0.67084 |
| | Max (UMax) | 0.21967 | 0.18836 | 0.08140 | 0.17511 | 0.23499 | 0.20663 | 0.32326 | 0.22605 | 0.11111 |
| | SimGIC (UGIC) | | 0.39141 | 0.30800 | 0.35707 | 0.49532 | | 0.45891 | 0.46904 | 0.66193 |
| | SimDIC (UDIC) | 0.51113 | 0.33578 | 0.23846 | 0.35868 | | 0.68552 | | 0.47434 | 0.63920 |
| | SimUIC (UUIC) | 0.50018 | 0.34776 | 0.24722 | 0.35052 | 0.48685 | 0.69761 | 0.46210 | 0.46436 | 0.63865 |
| SimUI | SimUI | 0.56126 | 0.49980 | 0.41280 | 0.36520 | 0.52065 | 0.64969 | 0.45463 | 0.49754 | 0.69992 |
Comparing performance of 57 different functional similarity measures using Pearson's correlation with Enzyme Commission (EC), Pfam and Sequence similarity. Results are obtained from the CESSM online tool and the best scores are in bold. R, N, L, Li, S, X, A, Z, W, and U represent the approaches and stand for Resnik, Nunivers, Lin, Li, Relevance, XGraSM, Annotation-based, Zhang, Wang and GO-universal, respectively. The double middle bold line separates annotation-based approaches above from the topology-based approaches below.
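For readers who want to reproduce this kind of comparison on their own data, the sketch below computes Pearson's correlation between a list of functional similarity scores and a list of reference scores (for example, sequence similarity for the same protein pairs). It is a plain-Python illustration of the statistic only; it does not reproduce the CESSM tool, and the function name and sample values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and each list's root sum of squares.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up scores for four protein pairs (not taken from the table above).
functional_sim = [0.10, 0.40, 0.55, 0.90]
reference_sim = [0.05, 0.35, 0.60, 0.80]
r = pearson(functional_sim, reference_sim)
```

A value close to 1 indicates that the functional similarity measure ranks protein pairs much as the reference does; each table entry is one such coefficient for a given measure, ontology and reference data type.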
Area under the curve (AUC), Rand Index (RI) and Normalized Mutual Information (NI) values of different measures.
| Approach | Measure | Protein-Protein Interaction | | Gene Expression | |
| | | AUC (CC) | AUC (BP) | RI (BP) | NI (BP) |
| R | Avg (RAvg) | | | 0.9814900 | 0.9202300 |
| | ABM (RABM) | 0.9999815 | 0.9997248 | 0.9819800 | 0.9159100 |
| | BMA (RBMA) | 0.9999656 | 0.9995277 | | |
| | Max (RMax) | 0.9823696 | 0.8355199 | 0.9699500 | 0.8729600 |
| | XGraSM-Avg (XRAvg) | 0.9715316 | 0.9965294 | 0.9804600 | 0.9218900 |
| | XGraSM-ABM (XRABM) | 0.9191044 | 0.9466970 | 0.9732500 | 0.8811700 |
| | XGraSM-BMA (XRBMA) | 0.8933883 | 0.9367340 | 0.9740800 | 0.8815500 |
| | XGraSM-Max (XRMax) | 0.3196787 | 0.4527575 | 0.9612000 | 0.7056700 |
| N | Avg (NAvg) | 0.9281535 | 0.9912221 | 0.9811400 | 0.9151300 |
| | ABM (NABM) | 0.6994310 | 0.8056306 | 0.9710300 | 0.8690500 |
| | BMA (NBMA) | 0.6194493 | 0.7257469 | 0.9716100 | 0.8731000 |
| | Max (NMax) | 0.2628194 | 0.2725935 | 0.9604600 | 0.7017200 |
| | XGraSM-Avg (XNAvg) | | | | |
| | XGraSM-ABM (XNABM) | 0.8500164 | 0.9140909 | 0.9747400 | 0.8935100 |
| | XGraSM-BMA (XNBMA) | 0.7977606 | 0.8885191 | 0.9722000 | 0.8758500 |
| | XGraSM-Max (XNMax) | 0.3166060 | 0.4174917 | 0.9613600 | 0.7065700 |
| L | Avg (LAvg) | 0.8635273 | 0.9838635 | 0.9825000 | 0.9181300 |
| | ABM (LABM) | 0.5666561 | 0.7222728 | 0.9716000 | 0.8665000 |
| | BMA (LBMA) | 0.4853167 | 0.6194642 | 0.9693300 | 0.8667700 |
| | Max (LMax) | 0.2174561 | 0.2274708 | 0.9606800 | 0.7028400 |
| | XGraSM-Avg (XLAvg) | | | | |
| | XGraSM-ABM (XLABM) | 0.7982155 | 0.8935707 | 0.9738000 | 0.8850100 |
| | XGraSM-BMA (XLBMA) | 0.7292282 | 0.8566720 | 0.9715700 | 0.8729400 |
| | XGraSM-Max (XLMax) | 0.3053099 | 0.3774761 | 0.9611800 | 0.7048300 |
| S | Avg (SAvg) | | | | |
| | ABM (SABM) | 0.6036674 | 0.7332584 | 0.9670700 | 0.8505500 |
| | BMA (SBMA) | 0.5220448 | 0.6330507 | 0.9693700 | 0.8665600 |
| | Max (SMax) | 0.2278649 | 0.2332203 | 0.9606000 | 0.7031900 |
| Li | Avg (LiAvg) | | | | |
| | ABM (LiABM) | 0.7209326 | 0.8204703 | 0.9685900 | 0.8598000 |
| | BMA (LiBMA) | 0.6436113 | 0.7460765 | 0.9698300 | 0.8640500 |
| | Max (LiMax) | 0.2710713 | 0.2850380 | 0.9607700 | 0.7032300 |
| A | SimGIC (AGIC) | 0.9173889 | | | |
| | SimDIC (ADIC) | 0.8486233 | 0.9514534 | 0.9748200 | 0.8893700 |
| | SimUIC (AUIC) | | 0.9654985 | 0.9752600 | 0.8937500 |
| Z | Avg (ZAvg) | 0.8628325 | | | |
| | ABM (ZABM) | 0.5564467 | 0.7571073 | 0.9726600 | 0.8718700 |
| | BMA (ZBMA) | 0.4756021 | 0.6578980 | 0.9762600 | 0.8847200 |
| | Max (ZMax) | 0.2142027 | 0.2341097 | 0.9605400 | 0.7016000 |
| | SimGIC (ZGIC) | | 0.9680424 | 0.9755300 | 0.8946900 |
| | SimDIC (ZDIC) | 0.8468629 | 0.9494608 | 0.9743600 | 0.8889200 |
| | SimUIC (ZUIC) | 0.9041007 | 0.9642283 | 0.9777800 | 0.9071400 |
| W | Avg (WAvg) | 0.8261012 | | | |
| | ABM (WABM) | 0.4524186 | 0.8706998 | 0.9710900 | 0.8786900 |
| | BMA (WBMA) | 0.3719390 | 0.8287918 | 0.9767100 | 0.8861300 |
| | Max (WMax) | 0.1190595 | 0.2833496 | 0.9606800 | 0.7068500 |
| | SimGIC (WGIC) | | 0.9659196 | 0.9747400 | 0.8909700 |
| | SimDIC (WDIC) | 0.7811533 | 0.9451399 | 0.9741200 | 0.8908800 |
| | SimUIC (WUIC) | 0.8678077 | 0.9615032 | 0.9733400 | 0.8892300 |
| U | Avg (UAvg) | | | | |
| | ABM (UABM) | 0.8335202 | 0.9513584 | 0.9740500 | 0.8819600 |
| | BMA (UBMA) | 0.7798275 | 0.9377088 | 0.9706300 | 0.8707200 |
| | Max (UMax) | 0.2943297 | 0.3982111 | 0.9607000 | 0.7050400 |
| | SimGIC (UGIC) | 0.9178478 | 0.9691239 | 0.9767900 | 0.9014500 |
| | SimDIC (UDIC) | 0.8758490 | 0.9595382 | 0.9751300 | 0.8882600 |
| | SimUIC (UUIC) | 0.9104333 | 0.9673649 | 0.9733100 | 0.8914700 |
| SimUI | SimUI | 0.8483416 | 0.9582268 | 0.9731600 | 0.8890300 |
Comparing performance of 57 different functional similarity measures in terms of AUC values for CC and BP ontologies, RI and NI values for the BP ontology using Protein-Protein Interaction (PPI) and Co-expression Network (CN) or Gene Expression (microarray) data. The double middle bold line separates annotation-based approaches above from the topology-based approaches below.
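To make the AUC and RI columns concrete, the sketch below gives minimal plain-Python versions of both statistics. It assumes that AUC is computed by treating interacting protein pairs as positives and non-interacting pairs as negatives, and that RI compares a clustering derived from functional similarity with a reference clustering; the source does not describe the exact protocol, so both the setup and the function names are assumptions.

```python
from itertools import combinations


def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win (rank-based AUC).
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Made-up similarity scores for interacting and non-interacting pairs.
example_auc = auc([0.9, 0.8], [0.1, 0.85])
# Made-up cluster labels for four genes under two clusterings.
example_ri = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Both statistics range from 0 to 1. An AUC near 0.5 means the similarity scores do not separate positives from negatives, which gives some context for the low AUC values of the Max-based measures in the table.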
Figure 2. Performance evaluation in terms of clustering power (RI and NI) and Area Under the Curve (AUC) values.
Different x-axis labels are the same as in Fig. 1, where different prefixes and suffixes stand for different term semantic similarity approaches and functional similarity measures.
Summary of overall ‘best’ performing measures for different biological data.
| Ontology | EC | Pfam | Seq. Sim. | PPI | CN |
| MF | XRBMA | AGIC | AGIC | | |
| CC | ZGIC | WGIC | UGIC | RAvg | |
| BP | XRBMA | XRBMA | AGIC | RAvg | WAvg |
List of overall ‘best’ performing functional similarity measures for MF, CC and BP ontologies given biological data. Refer to Tables 2 and 3 for the description of these different measures.
Summary of the best performing measures for different applications.
| Model | Approach | EC | Pfam | Seq. Sim. | PPI | CN |
| IC-based direct term | Annotation-based (A) | SimDIC | SimGIC | SimGIC | SimGIC | SimGIC |
| | GO-universal (U) | SimDIC | SimDIC | SimGIC | SimGIC | SimGIC |
| | Wang et al. (W) | SimGIC | SimGIC | SimGIC | SimGIC | SimGIC |
| | Zhang et al. (Z) | SimGIC | SimGIC | SimGIC | SimGIC | SimUIC |
| Pair-wise term or IC-based non direct term | Resnik (R) | BMA | Max | Max | Avg | BMA |
| | XGraSM-Resnik (XR) | BMA | ABM | ABM | Avg | Avg |
| | Nunivers (N) | BMA | ABM | ABM | Avg | Avg |
| | XGraSM-Nunivers (XN) | BMA | ABM | ABM | Avg | Avg |
| | Lin (L) | BMA | ABM | ABM | Avg | Avg |
| | XGraSM-Lin (XL) | BMA | ABM | ABM | Avg | Avg |
| | Li et al. (Li) | BMA | ABM | ABM | Avg | Avg |
| | Relevance (S) | BMA | ABM | ABM | Avg | Avg |
| | GO-universal (U) | BMA | BMA | ABM | Avg | Avg |
| | Wang et al. (W) | BMA | ABM | ABM | Avg | Avg |
| | Zhang et al. (Z) | BMA | ABM | ABM | Avg | Avg |
List of the best performing functional similarity measures, term specificity and semantic similarity approaches for different biological data, including Enzyme Commission (EC), Pfam domain, Sequence Similarity (Seq. Sim.), Protein-Protein Interaction (PPI) and Co-expression Network (CN) or Gene Expression (microarray) data.
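The IC-based direct term measures named in this table can be sketched as set operations over the GO terms annotating two proteins. The formulas below follow the definitions commonly given for these measures (SimUI as a Jaccard index over term sets, SimGIC as its IC-weighted form, SimDIC as a Dice-style ratio, SimUIC as the shared IC divided by the larger of the two IC totals); the source does not state them, so treat them as assumptions. The function name, the term identifiers and the IC values are hypothetical, and each term set is assumed to already include ancestor terms.

```python
def direct_term_scores(terms_p, terms_q, ic):
    """IC-based direct term similarity between two annotated proteins.

    terms_p, terms_q: sets of GO term identifiers, assumed to include
    ancestor terms; ic: dict mapping each term to its information content.
    """
    shared = terms_p & terms_q
    union = terms_p | terms_q
    ic_shared = sum(ic[t] for t in shared)
    ic_union = sum(ic[t] for t in union)
    ic_p = sum(ic[t] for t in terms_p)
    ic_q = sum(ic[t] for t in terms_q)
    return {
        "SimUI": len(shared) / len(union),            # unweighted Jaccard
        "SimGIC": ic_shared / ic_union,               # IC-weighted Jaccard
        "SimDIC": 2 * ic_shared / (ic_p + ic_q),      # assumed Dice-style form
        "SimUIC": ic_shared / max(ic_p, ic_q),        # assumed "universal" form
    }


# Made-up IC values and annotations for two proteins.
toy_ic = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
toy_scores = direct_term_scores({"a", "b", "c"}, {"b", "c", "d"}, toy_ic)
```

With these toy inputs SimUI and SimGIC both equal 0.5, while SimDIC and SimUIC come out higher, showing how the choice of denominator alone can change a score for the same pair of proteins.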