Wenting Liu, Jianjun Liu, Jagath C Rajapakse.
Abstract
There exists a plethora of measures to evaluate functional similarity (FS) between genes, which is widely used in many bioinformatics applications including detecting molecular pathways, identifying co-expressed genes, predicting protein-protein interactions, and prioritizing disease genes. Measures of FS between genes are mostly derived from the information content (IC) of the Gene Ontology (GO) terms annotating the genes. However, existing measures, which evaluate the IC of terms based either on the representation of terms in the annotating corpus or on the knowledge embedded in the GO hierarchy, do not consider the enrichment of GO terms by the querying pair of genes. The enrichment of a GO term by a pair of genes depends on whether the term annotates one gene (i.e., partial annotation) or both genes (i.e., complete annotation) in the pair. In this paper, we propose a method that incorporates the enrichment of GO terms by a gene pair in computing their FS, and show that GO enrichment improves the performance of 46 existing FS measures in the prediction of sequence homologies, gene expression correlations, protein-protein interactions, and disease-associated genes.
Year: 2018 PMID: 30108262 PMCID: PMC6092333 DOI: 10.1038/s41598-018-30455-0
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Illustration of a DAG representing GO terms annotating two genes. (A) The DAG representing the GO terms that annotate the two genes, and (B) the two genes g1 and g2 and their respective GO term sets. The FS is derived as the semantic similarity, or the common information content (IC), of the two term sets. Our approach accounts for the differential enrichment of GO terms by the querying gene pair: for example, blue terms annotate only gene g1, green terms annotate only gene g2, and red terms annotate both genes g1 and g2.
Details of the five datasets and the statistical significance of the performance improvement of FS* measures over the corresponding FS measures in predicting disease genes, protein interactions on the yeast and human PPI datasets, gene co-expression on the yeast GE dataset in different ontological domains (BP, MF, and CC), and sequence similarities on the CESSM datasets (ECC, Pfam, and SeqSim).
| Data type | Datasets | #Protein pairs | Ontology | #Experiments | p-value |
|---|---|---|---|---|---|
| Disease Genes | DG_BP; DG_MF; DG_CC | 6084 | BP; MF; CC | 138 | 5.619e-08 |
| Yeast PPI | PPI_BP; PPI_MF; PPI_CC | 8654; 7166; 8852 | BP; MF; CC | 138 | 2.885e-07 |
| Human PPI | PPI_BP; PPI_MF; PPI_CC | 2408; 2576; 2108 | BP; MF; CC | 138 | 3.528e-03 |
| Yeast GE | GE_BP; GE_MF; GE_CC | 4800 | BP; MF; CC | 138 | 6.912e-09 |
| CESSM | ECC; Pfam; SeqSim | 13430 | BP; MF | 276 | 4.94e-16 |
| Total | DG; PPI; GE; ECC; Pfam; SeqSim | 35376; 34056; 35274 | BP; MF; CC | 828 | <2.2e-16 |
Details of the 46 FS measures. The types of IC/SS and the methods used to compute the FS measures: GIC[1] (Jaccard index), DIC[8] (Dice index), and UIC[8] (universal index) for individual terms; Average (AVG), Maximum (MAX), Best-Match Average (BMA), and Average Best-Matches (ABM) for measures based on pairs of terms; and Overlap Ratio (OR) and Intersection to Union Ratio (IUR) for measures based on sets of terms.
| Acronyms | IC/SS | FS measures |
|---|---|---|
| U | GO-universal | UABM, UBMA, UMAX, UAVG, UDIC, UGIC, UUIC |
| Z | Zhang | ZABM, ZBMA, ZMAX, ZAVG, ZDIC, ZGIC, ZUIC |
| W | Wang | WABM, WBMA, WMAX, WAVG, WDIC, WGIC, WUIC |
| N | Nunivers | NABM, NBMA, NMAX, NAVG |
| XN | Extended Nunivers | XNABM, XNBMA, XNMAX, XNAVG |
| L | Lin | LABM, LBMA, LMAX, LAVG |
| XL | Extended Lin | XLABM, XLBMA, XLMAX, XLAVG |
| S | Schlicker | SABM, SBMA, SMAX, SAVG |
| D | Direct-term based | DDIC, DGIC, DUIC |
| R | SORA | ROR |
| I | WIS | IIUR |
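The combination methods named in the caption above can be illustrated with a minimal sketch. It assumes the commonly used forms of these indices: GIC as an IC-weighted Jaccard index, DIC as an IC-weighted Dice index, UIC as the shared IC divided by the larger of the two set totals, BMA as the mean of the two directional best-match averages, and ABM as the pooled average of all best matches. Whether these match the exact definitions behind the 46 measures is an assumption; the function names and the toy IC values are hypothetical.

```python
from typing import Dict, Sequence, Set


def _ic_sum(terms: Set[str], ic: Dict[str, float]) -> float:
    """Sum the information content of a set of GO terms."""
    return sum(ic[t] for t in terms)


def sim_gic(a: Set[str], b: Set[str], ic: Dict[str, float]) -> float:
    """Jaccard-style index: shared IC over IC of the union (assumed form)."""
    union = _ic_sum(a | b, ic)
    return _ic_sum(a & b, ic) / union if union > 0 else 0.0


def sim_dic(a: Set[str], b: Set[str], ic: Dict[str, float]) -> float:
    """Dice-style index: twice the shared IC over the two set totals (assumed form)."""
    total = _ic_sum(a, ic) + _ic_sum(b, ic)
    return 2 * _ic_sum(a & b, ic) / total if total > 0 else 0.0


def sim_uic(a: Set[str], b: Set[str], ic: Dict[str, float]) -> float:
    """Universal-style index: shared IC over the larger set total (assumed form)."""
    largest = max(_ic_sum(a, ic), _ic_sum(b, ic))
    return _ic_sum(a & b, ic) / largest if largest > 0 else 0.0


def aggregate(sim: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Combine a term-by-term similarity matrix (rows: terms of gene 1,
    columns: terms of gene 2) into AVG, MAX, BMA, and ABM scores."""
    n, m = len(sim), len(sim[0])
    row_best = [max(row) for row in sim]
    col_best = [max(sim[i][j] for i in range(n)) for j in range(m)]
    return {
        "AVG": sum(sum(row) for row in sim) / (n * m),
        "MAX": max(row_best),
        "BMA": 0.5 * (sum(row_best) / n + sum(col_best) / m),
        "ABM": (sum(row_best) + sum(col_best)) / (n + m),
    }


# Hypothetical toy inputs, for illustration only.
toy_ic = {"t1": 1.0, "t2": 2.0, "t3": 3.0}
terms_g1, terms_g2 = {"t1", "t2"}, {"t2", "t3"}
toy_matrix = [[1.0, 0.2], [0.4, 0.6], [0.0, 0.8]]

print(sim_gic(terms_g1, terms_g2, toy_ic))
print(aggregate(toy_matrix))
```

BMA and ABM differ only when the two genes have different numbers of terms: BMA weights each gene equally, while ABM weights each term equally.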
Top five performers of FS and FS* measures on predicting ECC, Pfam, and SeqSim similarities of protein pairs of CESSM datasets, using BP and MF ontologies.
| Datasets | Methods | Correlation | Datasets | Methods | Correlation | Datasets | Methods | Correlation |
|---|---|---|---|---|---|---|---|---|
| ECC_BP | | | Pfam_BP | | | SeqSim_BP | | |
| | | | | WBMA* | 0.5223 | | IIUR | 0.7927 |
| | XLBMA* | 0.4748 | | ROR* | 0.5199 | | ROR | 0.7884 |
| | XLBMA | 0.4708 | | IIUR* | 0.5005 | | WABM* | 0.7741 |
| | NBMA* | 0.4651 | | ROR | 0.4933 | | ROR* | 0.7738 |
| ECC_MF | | | Pfam_MF | | | SeqSim_MF | | |
| | WBMA* | 0.7665 | | ROR* | 0.6829 | | IIUR | 0.7165 |
| | XNBMA* | 0.7567 | | IIUR | 0.6627 | | ROR | 0.6505 |
| | XNBMA | 0.7525 | | ROR | 0.6565 | | DGIC* | 0.6358 |
| | NBMA* | 0.7525 | | WABM* | 0.6283 | | DGIC | 0.6285 |
Top five performers of FS and FS* measures predicting protein-protein interactions of human and yeast PPI datasets, using three ontologies: BP, MF, and CC.
| Datasets | Methods | AUC | Datasets | Methods | AUC | Datasets | Methods | AUC |
|---|---|---|---|---|---|---|---|---|
| human PPI_BP | | | human PPI_MF | SABM | 0.7787 | human PPI_CC | | |
| | ZBMA | 0.8747 | | LABM | 0.7777 | | WBMA* | 0.7697 |
| | NBMA* | 0.8739 | | SABM* | 0.7771 | | IIUR* | 0.7678 |
| | NBMA | 0.8737 | | LABM* | 0.7762 | | UABM* | 0.7658 |
| | SBMA* | 0.8721 | | NABM | 0.7718 | | UABM | 0.7657 |
| yeast PPI_BP | | | yeast PPI_MF | DUIC | 0.6930 | yeast PPI_CC | | |
| | XNMAX* | 0.8563 | | DUIC* | 0.6928 | | IIUR | 0.8158 |
| | XLMAX* | 0.8561 | | DDIC* | 0.6926 | | ROR* | 0.8143 |
| | XNMAX | 0.8559 | | DGIC* | 0.6926 | | UABM* | 0.8072 |
| | XNBMA | 0.8559 | | DGIC | 0.6916 | | NABM* | 0.8068 |
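The AUC values in the table above summarize how well an FS score ranks interacting protein pairs ahead of non-interacting ones. As a minimal sketch, the AUC can be computed as the fraction of (positive, negative) pairs in which the positive receives the higher score, with ties counted as half. The function name and the toy scores are hypothetical, and the evaluation pipeline actually used for the table is not described here.

```python
from typing import Sequence


def auc_from_scores(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison: the probability
    that a randomly chosen positive outscores a randomly chosen negative,
    counting ties as one half."""
    if not pos or not neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical FS scores for interacting and non-interacting pairs.
interacting = [0.9, 0.8, 0.4]
non_interacting = [0.7, 0.3, 0.4]
print(auc_from_scores(interacting, non_interacting))
```

The pairwise loop is quadratic in the number of pairs; for datasets with thousands of pairs, a rank-based formulation or a library routine would be the more practical choice.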
Top five performers of FS and FS* measures predicting gene co-expressions on yeast GE dataset, using three ontologies: BP, MF, and CC.
| Datasets | Methods | Correlation | Datasets | Methods | Correlation | Datasets | Methods | Correlation |
|---|---|---|---|---|---|---|---|---|
| yeast GE_BP | | | yeast GE_MF | | | yeast GE_CC | | |
| | | | | ROR | 0.2087 | | ZDIC | 0.4253 |
| | ROR | 0.2876 | | DGIC* | 0.2023 | | ZGIC* | 0.4236 |
| | ZGIC* | 0.2875 | | DGIC | 0.2022 | | ZGIC | 0.4233 |
| | DGIC | 0.2873 | | DDIC* | 0.2008 | | ZUIC | 0.4229 |
Top five performers of FS and FS* measures predicting disease genes on benchmark dataset, using three ontologies: BP, MF, and CC.
| Datasets | Methods | AUC | Datasets | Methods | AUC | Datasets | Methods | AUC |
|---|---|---|---|---|---|---|---|---|
| Disease Genes_BP | | | Disease Genes_MF | | | Disease Genes_CC | | |
| | ROR | 0.8062 | | UBMA* | 0.7357 | | UBMA | 0.7032 |
| | XNBMA | 0.8058 | | UBMA | 0.7357 | | ROR | 0.7031 |
| | IIUR | 0.8030 | | NBMA* | 0.7344 | | UBMA* | 0.7029 |
| | NBMA* | 0.8019 | | NBMA | 0.7330 | | WBMA | 0.7006 |
Figure 2. The DAG representing the sets of GO terms annotating proteins P01906 and P17693. Protein P01906 is annotated by terms GO:0006955, GO:0019882, and GO:0002504; protein P17693 is annotated by terms GO:0006955, GO:0019882, GO:0002474, and GO:0006968. Blue terms annotate only protein P01906, green terms annotate only protein P17693, and red terms annotate both proteins.
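The colour coding in this caption amounts to a three-way partition of the two annotation sets. The sketch below applies that partition to the direct annotations listed in the caption only; the ancestor terms that the DAG would add are not listed here and are therefore left out, and the function name `partition_terms` is hypothetical.

```python
from typing import Dict, Set


def partition_terms(first: Set[str], second: Set[str]) -> Dict[str, Set[str]]:
    """Split two annotation sets into terms annotating both proteins
    (complete annotation) and terms annotating only one (partial annotation)."""
    return {
        "both": first & second,          # red in the figure
        "only_first": first - second,    # blue in the figure
        "only_second": second - first,   # green in the figure
    }


# Direct annotations as given in the caption.
p01906 = {"GO:0006955", "GO:0019882", "GO:0002504"}
p17693 = {"GO:0006955", "GO:0019882", "GO:0002474", "GO:0006968"}

parts = partition_terms(p01906, p17693)
for label, terms in parts.items():
    print(label, sorted(terms))
```

On these direct annotations, two terms are completely annotated by the pair and three are partially annotated, which is the distinction the enrichment-based approach draws on.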
Figure 3. Venn diagram illustrating a GO term annotating genes in the corpus (blue) and in the querying set (yellow): the GO term annotates M of the N genes in the corpus and k of the n genes in the querying set.
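The quantities N, M, n, and k in this caption are the inputs of a standard over-representation calculation. As a minimal sketch, and assuming a hypergeometric model (the caption itself does not state which formula is used), the probability of seeing at least k annotated genes in a querying set of n genes can be computed as follows. The function name and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_tail(N: int, M: int, n: int, k: int) -> float:
    """P(X >= k) when n genes are drawn without replacement from a corpus
    of N genes, M of which are annotated by the GO term."""
    if not (0 <= M <= N and 0 <= n <= N and 0 <= k <= n):
        raise ValueError("require 0 <= M <= N, 0 <= n <= N, 0 <= k <= n")
    upper = min(n, M)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical corpus: 10 genes, 4 of them annotated by the term.
# For a querying gene pair n = 2, so k = 1 corresponds to partial
# annotation and k = 2 to complete annotation.
print(hypergeom_tail(N=10, M=4, n=2, k=1))
print(hypergeom_tail(N=10, M=4, n=2, k=2))
```

A smaller tail probability means the term is less likely to annotate that many genes of the pair by chance, so complete annotation by a rare term is stronger evidence than partial annotation by a common one.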