Literature DB >> 27814675

SGFSC: speeding the gene functional similarity calculation based on hash tables.

Zhen Tian1, Chunyu Wang1, Maozu Guo2, Xiaoyan Liu1, Zhixia Teng1,3.   

Abstract

BACKGROUND: In recent years, many measures of gene functional similarity have been proposed and widely used in all kinds of essential research. These methods are mainly divided into two categories: pairwise approaches and group-wise approaches. However, a common problem with these methods is their time consumption, especially when measuring the gene functional similarities of a large number of gene pairs. The problem of computational efficiency for pairwise approaches is even more prominent because they are dependent on the combination of semantic similarity. Therefore, the efficient measurement of gene functional similarity remains a challenging problem.
RESULTS: To speed current gene functional similarity calculation methods, a novel two-step computing strategy is proposed: (1) establish a hash table for each method to store essential information obtained from the Gene Ontology (GO) graph and (2) measure gene functional similarity based on the corresponding hash table. There is no need to traverse the GO graph repeatedly for each method with the help of the hash table. The analysis of time complexity shows that the computational efficiency of these methods is significantly improved. We also implement a novel Speeding Gene Functional Similarity Calculation tool, namely SGFSC, which is bundled with seven typical measures using our proposed strategy. Further experiments show the great advantage of SGFSC in measuring gene functional similarity on the whole genomic scale.
CONCLUSIONS: The proposed strategy is successful in speeding current gene functional similarity calculation methods. SGFSC is an efficient tool that is freely available at http://nclab.hit.edu.cn/SGFSC . The source code of SGFSC can be downloaded from http://pan.baidu.com/s/1dFFmvpZ .

Entities:  

Keywords:  Gene functional similarity; Gene ontology; Hash table

Mesh:

Substances:

Year:  2016        PMID: 27814675      PMCID: PMC5096311          DOI: 10.1186/s12859-016-1294-0

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

In the functional genomic era, measuring gene functional similarity is a fundamental task because it is the foundation of much essential research such as gene clustering [1-4], protein-protein interaction prediction [5-8], gene function prediction [9-12] and disease gene prioritisation [13-15]. Comparing functional similarity between genes provides more information for understanding the biological roles and functions of genes, although sometimes it may be less objective compared with sequence and structure similarity [16]. Gene Ontology (GO) is a standardised and controlled vocabulary of terms that comprises three orthogonal ontologies: cellular component (CC), molecular function (MF) and biological process (BP). These three ontologies are structured as three directed acyclic graphs (DAGs), which are also called GO graphs sometimes. Semantic similarity applied to the GO annotations of gene products provides a measure of their functional similarity. Therefore, functional similarity between genes can be inferred from the semantic relationships of GO terms. In this article, ‘functional similarity’ refers to the similarity between genes or gene products, and ‘semantic similarity’ refers to the similarity between two GO terms. In recent years, many gene functional similarity calculation measures [15, 17–30] have been proposed and widely used in biology research. They are mainly divided into two categories: pairwise approaches and group-wise approaches, both of which must rely on GO graphs [31]. Pairwise methods measure the gene functional similarity via two steps. The first step computes the semantic similarities of GO term pairs using term comparison techniques. The second step measures gene functional similarity between genes using the results of semantic similarity scores from the first step. Maximum rule, average rule and best-match average rule (BMA) are three kinds of strategies widely used in the second step. The characteristics of the three rules have been discussed in detail [16, 17]. In contrast, group-wise methods measure gene functional similarity by comparing the terms that annotate the genes in groups. Overall, there are three types of approaches for measuring the functional similarities of genes: set, graph and vector [31]. However, the common issue of current functional similarity calculation methods is time consumption, especially when they measure similarity on a whole genomic scale. There are two main reasons for the low computational efficiency, which becomes very prominent. One is that more and more GO terms are added into the GO graphs because of the daily evolution of the GO database. The other is that the number of annotated genes in the Gene Ontology Annotation (GOA) database has greatly increased. Although some online tools [23, 25, 32–36] have achieved great success on a variety of applications such as constructing gene semantic similarity networks [37, 38] and disease gene prioritisation [39-41], few of them pay attention to the problem of computational efficiency. Thus, improving the computational efficiency of functional similarity methods has become a challenging problem. In the remainder of this section, we review seven typical methods that use the proposed strategy and some other methods. These methods include those of Resnik [20], Lin [19], Jiang and Conrath (hereafter referred to as Jiang) [18], Pekar and Staab (hereafter referred to as Pekar) [30] and Wang et al. (hereafter referred to as Wang) [21], which are pairwise approaches, and simUI [29] and simGIC [23], which are group-wise approaches. It should be noted that we emphasise the analysis of semantic similarity between terms for pairwise approaches and highlight functional similarity between genes for group-wise approaches. The methods of Resnik, Lin and Jiang belong to node-based approaches, which rely on comparing the properties of terms in the GO graph. One concept commonly used in these approaches is information content (IC), which gives a measure of how specific and informative a term is. The IC of a term t can be quantified as the negative log likelihood: where p (t) is the probability of occurrence of t in a specific corpus (such as the UniProt knowledgebase) being normally estimated by its frequency of annotation [31]. According to Eq. (1), terms that are more genetic will have a larger p (t) and hence a smaller IC value. When applying this traditional measure, one important problem is that the specificity of a term is fully dependent on the number of genes taken in a given annotation corpus [24]. The detail definition of p (t) can be represented by Eq. (2). Here, annotation (t) is the number of genes annotated by term t, and descendent (t) is a term set that contains descendants of t in the GO graph. Resnik [21] elaborated that edges do not represent the same uniform distance in the GO graph because the terms at the same level do not have the same specificity. Some terms in the GO graph have more children even though they belong to the same level. Therefore, an IC value of the term t in the GO graph can be used as a metric to measure the relationship between terms. Resnik defined the similarity between two terms t 1 and t 2 as the IC value of the lowest common ancestor term (LCA), which was given as follows: According to Eq. (3), the similarity between two terms only depends on the IC value of their LCA. Sometimes, LCA is also called the most informative common ancestor. Lin [19] pointed out one serious drawback of the Resnik method, which is that two different pairs of terms that locate on different levels in the GO graph will have the same LCA, so they have the same similarity value. Apparently, this is not a reasonable result that meets the human perspective. Therefore, both Lin and Jiang invented two other measures, which are formulated as Eqs. (4) and (5), respectively: As is pointed out by Wang [21], the methods of both Lin and Jiang have the problem of “shallow annotation”, i.e. if two genes are well annotated near the root of the ontology, their semantic similarity will always be measured very highly (close to 1), and their semantic distance will always be computed close to nil, thus providing a misleading result. In contrast, edge-based approaches are also popular in measuring semantic similarity between GO terms. Pekar [30] proposed a measure based on the length of the longest path between the lowest common ancestor of two terms and the root, and on the length of the longest path between each term and their common ancestor. This is given by the following expression: where δ(t 1, t 2) denotes the longest distance between term t 1 and term t 2 in the GO graph, and t a is the LCA of t 1 and t 2. Three distances are used in Eq. (6), and thus the functional similarity computed by this method is more reasonable than that of Resnik’s results. In addition, Cheng et al. [27] proposed a maximum common ancestor depth measure and weighted each edge to reflect its depth. Wu et al. [26] introduced the distance to the nearest leaf node of a term and the distance to the LCA to take the specificity of terms into account. Wang [21] developed a hybrid method in which the edge was assigned a fixed weight named the semantic contribution factor (ω ) according to the type of relationship in the GO database. A GO term A is represented as a DAG DAG  = (A, T , E ), a sub-graph of GO where T A is the set of all ancestors for term A, and E is the set of corresponding links. The contribution of any term t to the semantics of a term A is defined as the S-value of the term t related to term A, which can be represented by where (ω ) is the semantic contribution factor for edge e∈EA linking term t with its child term t′ [23]. Then, we calculate the semantic value of the GO term A, SV(A), which is represented as: Given DAG  = (A, T , E ) and DAG  = (B, T , E ) for GO terms A and B, respectively, the semantic similarity between them, S (A, B), is defined as: where S (t) is the S-value of GO term t related to term A, and S (t) is the S-value of GO term t related to term B. There are two main disadvantages of Wang’s method. One is that the semantic contribution factor (ω ) is fixed according the linking types of GO terms, and the other is that the semantic contribution only depends on the maximum products of all of the paths linking the two terms. According to Eqs. (7) and (8), computing the SV(A) and SV(B) is difficult because they have to traverse their corresponding DGAs of term A and B, respectively. As for group-wise approaches, Pesquita et al. [31] pointed out that purely set-based approaches are not common because few measures consider only direct annotations, whereas graph-based approaches are suitable for computing the similarity with the help of graph matching. Indeed, simUI and simGIC are two typical group-wise methods that measure gene functional similarity from the graph-based perspective. To compute gene functional similarity, these methods usually make use of Tversky’s ratio model or its variants. Genes g 1 and g 2 are annotated with term sets A  = {t 1, t 2, ⋯ t } and A  = {t 1, t 2, ⋯ t }, respectively. Therefore, simUI calculates similarity as the number of GO terms shared by two genes divided by the number of GO terms they have together. The functional similarity between g 1 and g 2 is According to [42], simGIC is an expansion of simUI that sums the IC value of annotation terms. For two genes g 1 and g 2, simGIC is given by Although simUI does not consider the specificity of the term in the GO graph, simGIC takes the IC value of a term as its specificity. As pointed out by Teng et al. [16], simGIC ignores the shared IC value of two terms in the GO graph and may also result in misjudgements of gene functional similarity. Teng et al. [16] proposed a new method called SORA (semantic overlap ration of annotation) to overcome the limitations of simGIC. However, obtaining A(g1) and A(g2) from the GO graph directly for group-wise methods is also difficult and time consuming. The rest of this paper is organised as follows. In the Methods section, we begin by analysing the problems leading to high time consumption of each method. In the following subsections, we describe how to speed the gene functional similarity calculation methods based on hash tables. Then, taking Wang’s method as an example, we show how to establish the hash table and measure the functional similarity with the table. Finally, complexity analysis is presented for each method. In the Results section, we present the experimental results, including the running times to establish hash tables and measure the semantic similarity between GO terms and the functional similarity between genes. We also compare the developed Speeding Gene Functional Similarity Calculation tool (SGFSC) with other online tools. In the Discussion section, we discuss the implications and limitations of our method, and finally, we draw our conclusions in the Conclusions section.

Methods

In this section, we first analyze what factors affect computational efficiency and establish a hash table to speed the gene functional similarity calculation for each method. Then we take Wang’s method as an example to show how to speed the computing process using the proposed strategy. Finally, we present a complexity analysis for each method adopting our proposed strategy.

Analysing the problems leading to time consumption for each method

Methods that measure gene functional similarity must traverse the GO graph repeatedly to obtain the information they need. However, traversing the GO graph is time consuming because the topological structure of the GO graph is complex due to multiple inheritances of GO. In addition, the problem of low computational efficiency tends to be more prominent, especially when gene functional similarity needs to be measured on the genomic scale. To speed the gene functional similarity calculation, we should analyse the calculation process of each method and then determine the key information that affects the computational efficiency. It is need to calculate key information that lead to traversing the GO graph repeatedly for these methods. The key information for each method is listed in Table 1. IC(t) denotes the IC value of term t; LCA(t 1, t 2) denotes the lowest common ancestor of terms t 1 and t 2; Dep(t) denotes the deepest depth of term t in the GO graph; SV(t) denotes the semantic value of term t; and A(g) denotes the term set containing all of the terms that annotate gene g.
Table 1

Key information affecting computational efficiency for each method

MethodsKey information affecting computational efficiency
ResnikLCA(t 1,t 2) and IC(LCA(t 1,t 2))
LinLCA(t 1,t 2) and IC(LCA(t 1,t 2))
JiangLCA(t 1,t 2) and IC(LCA(t 1,t 2))
PekarLCA(t 1,t 2) and Dep (LCA(t 1,t 2))
WangSV(t 1) and SV(t 2)
simUIA(g 1) and A(g 2)
simGICA(g 1) and A(g 2)

For pairwise approaches, we focus on the semantic similarity between t 1 and t 2. For group-wise approaches, the functional similarity between gene g 1 and g 2 requires special attention

Key information affecting computational efficiency for each method For pairwise approaches, we focus on the semantic similarity between t 1 and t 2. For group-wise approaches, the functional similarity between gene g 1 and g 2 requires special attention For example, with the Resnik method, determining the LCA of terms t 1 and t 2 from the GO graph requires traversing of the DAGs of t 1 and t 2, respectively. The method also has to calculate the IC of each term in the DAGs to obtain the IC of LCA(t 1, t 2). Therefore, the key information that affects computational efficiency in the Resnik method is LCA(t 1, t 2) and IC(LCA (t 1, t 2)). The key information for the other methods is also listed in Table 1.

Speeding the gene functional similarity calculation for each method

As we know, traversing the GO graph repeatedly is the main reason for the reduced computational efficiency of each method. Therefore, if we can avoid traversing the GO graph repeatedly, the computational efficiency for each method will improve greatly. Hence, we can convert the storage form of information from the GO graph into a hash table. Then, these methods could measure the functional similarity based on the hash tables instead of traversing the GO graph, eventually reaching the goal of improving computational efficiency. As a result, we propose a novel two-step computing strategy: (1) establish hash tables to store essential information that is obtained from the GO graph and (2) measure gene functional similarity based on the hash tables. The flowchart of our proposed strategy is shown in Fig. 1. For ease of description, two definitions are given below.
Fig. 1

The flowchart of the proposed strategy

The flowchart of the proposed strategy DEFINITION 1. Direct information: information that occurs in corresponding equations for each method. For example, the IC values of LCA(t 1, t 2), IC(t 1) and IC(t 2) are direct information for Eq. (4). DEFINITION 2. Essential information: information that will be stored in the corresponding hash tables. For example, the ancestors of term t 1, t 2 and their corresponding IC values are essential information for Eq. (4). It is important to note that the essential information should have the ability to substitute the GO graph. This is critical for the proposed strategy. Therefore, before using our proposed strategy, each method must first analyse the direct information and extract the essential information from the original GO graph. The direct information and essential information of pairwise approaches for measuring semantic similarity between t 1 and t 2 are shown in Table 2. T(t) denotes the ancestor set of term t including t itself. S (t) is the S-value of GO term t related to term A. The direct information and essential information of group approaches for measuring functional similarity between genes g 1 and g 2 are also shown in Table 2. A(g) denotes a term set containing all of the terms that annotate gene g. In the next subsection, we take the Wang method as an example to show how to establish the hash table for each method.
Table 2

Direct information and essential information for each method

MethodDirect informationEssential informationExplanation
ResnikIC(LCA(t 1,t 2))IC(t), t ∈ T(t 1), IC(t), t ∈ T(t 2)The IC values of t, t ∈ T(t 1)The IC values of t, t ∈ T(t 2)
LinIC(t 1), IC(t 2), IC(LCA(t 1,t 2))
JiangIC(t 1), IC(t 2), IC(LCA(t 1,t 2))
PekarDep (LCA (t 1, t 2)), Dep(t 1), Dep(t 2)Depth (t), t∈ T(t 1)Depth (t), t∈ T(t 2)The depth of t, t ∈ T(t 1)The depth of t, t ∈ T(t 2)
Wang∑(S t1(t) + S t2(t)), t ∈ T t1 ∩ T t2 SV(t 1), SV(t 2) \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \begin{array}{l}{S}_{t_1}(t),t\in T\left({t}_1\right)\\ {}{S}_{t_2}(t),t\in T\left({t}_2\right)\end{array} $$\end{document}St1t,tTt1St2t,tTt2 The S-values of t, t ∈ T(t 1)The S-values of t, t ∈ T(t 2)
simUI|A(g i) ∩ A(g j)|, |A(g i) ∪ A(g j)| A(g i), A(g 2)
simGIC \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \begin{array}{l}{\displaystyle \sum IC(t),t}\in A\left({g}_i\right)\cap A\left({g}_j\right)\\ {}{\displaystyle \sum IC(t),t}\in A\left({g}_i\right)\cup A\left({g}_j\right)\end{array} $$\end{document}ICt,tAgiAgjICt,tAgiAgj \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \begin{array}{l}IC\left({t}_1\right),{t}_1\in A\left({g}_i\right)\\ {}IC\left({t}_2\right),{t}_2\in A\left({g}_j\right)\end{array} $$\end{document}ICt1,t1AgiICt2,t2Agj The IC values of t, t ∈ A(g i)The IC values of t, t ∈ A(g j)
Direct information and essential information for each method Finally, why do we select the hash table as the data structure to store the essential information extracted from the GO graph? A hash table is a commonly used data structure that satisfies the requirement for quick searches. Its search efficiency is very high, and its structure is also convenient to program and implement. What’s more, with the help of the hash tables, there is no need to repeatedly traverse the GO graph to obtain direct information. Each method can obtain essential information from the hash table and calculate the direct information by making use of the essential information. As a result, the computational efficiency increases dramatically.

Speeding the calculation of functional similarity for the Wang method

In this subsection, we take Wang’s method as an example to show how to speed the functional similarity calculation. The main idea is illustrated in Fig. 2. The proposed strategy comprises two main steps.
Fig. 2

The main idea of the proposed strategy adopted for the Wang method. a Depicts a DAG for GO term Intracellular Membrane-bound Organelle: 0043231. b Depicts the hash table established from (a). Each row in (b) is called a record. For each record, the key of the record is the ID of the GO term, and the value of the record is a link list that contains all of the S-values of the key. For each term in (a), there is a corresponding record in (b). We can obtain the essential information from the hash table directly instead of from the DAG in (a). The proposed strategy converts the storage form of information from the GO graph into hash tables to speed the calculation process

The main idea of the proposed strategy adopted for the Wang method. a Depicts a DAG for GO term Intracellular Membrane-bound Organelle: 0043231. b Depicts the hash table established from (a). Each row in (b) is called a record. For each record, the key of the record is the ID of the GO term, and the value of the record is a link list that contains all of the S-values of the key. For each term in (a), there is a corresponding record in (b). We can obtain the essential information from the hash table directly instead of from the DAG in (a). The proposed strategy converts the storage form of information from the GO graph into hash tables to speed the calculation process Step one: Establish a hash table for the Wang method According to Eq. (9), measuring the semantic similarity between two terms A and B is only based on three parts: SV(A), SV(B) and the numerator of Eq. (9). With further analysis, we fortunately find that the values of SV(A) and SV(B) are only dependent on the S-values for all terms in DAGA and the S-values for all terms in DAGB, respectively. Besides, the numerator of Eq. (9) can also be calculated quickly based on the S-values of terms A and B. From the analysis above, S (t), t ∈ T(A) and S (t), t ∈ T(B) are essential information for Eq. (9). To adopt the proposed strategy for calculating Eq. (9), we can compute all of the S-values of A and B in their corresponding DAGs and then store the results into a hash table. The S-values for GO:0043231 are listed in Table 3. The hash table is established based on Fig. 2a.
Table 3

S-values for GO terms in the DAG for intracellular membrane-bound organelle: 0043231

GO terms0043231004322900432270005622
S-value1.00.80.80.48
GO terms000562300432260005575
S-value0.2880.640.512
S-values for GO terms in the DAG for intracellular membrane-bound organelle: 0043231

Step two: Measure the gene functional similarity based on the hash table

Once the hash table is established, measuring the semantic similarity is based on the hash table instead of the GO graph. In other words, because the hash table contains all of the essential information that Eq. (9) needs, the Wang method can obtain the information needed directly from the hash table. After obtaining the semantic similarity between GO pairs, it can use the BMA rule to further measure the functional similarity between two genes. In Fig. 2a represents a DAG for the GO term Intracellular Membrane-bound Organelle: 0043231, and (b) represents the hash table established on the basis of (a). The semantic contribution factors for the ‘is-a’ and ‘part-of’ relations are 0.8 and 0.6, respectively. We use Eq. (7) to calculate all of the S-values for term GO:0043231 and list the results in Table 3. The storage form for term GO:0043231 is listed in the last row of Fig. 2b. For the other terms in Fig. 2a, the handling process is similar to term GO:0043231. The results are also listed in Fig. 2b. It is noteworthy that the order of records shown in Fig. 2b may differ from the actual order because of the special storage features of hash tables.

Example: Measuring the semantic similarity for the Wang method

We take two terms GO:0043227, named A, and GO:0005622, named B, as an example to measure the semantic similarity based on the hash table. The relationship between A and B can be obtained from Fig. 2a. The semantic similarity of the two terms A and B is calculated in three steps based on the hash table as follows: Therefore, the semantic similarity between terms A and B is: Because the S-values of terms A and B can be obtained directly from the hash table represented in Fig. 2b, there is no need to search the DAGs of terms A and B. Therefore, the computational efficiency of measuring the semantic similarity has sharply improved. The proposed strategy has achieved the desired result. To obtain the corresponding hash table from the GO graph for the Wang method, we design an algorithm for establishing the hash table, namely EHT. The algorithm is described in Fig. 3. To simplify the description of the algorithm, we briefly explain the notations used in the algorithm. An adjacency matrix M represents one of three GO graphs MF, BP and CC. RHT represents the hash table that stores the essential information extracted from the GO graph. DDec(t) represents a set that contains all of the direct descendants of term t, and DAnc(t) represents a set that contains all of the direct ancestors of term t.
Fig. 3

The algorithm for establishing a hash table from the GO graph for the Wang method

The algorithm for establishing a hash table from the GO graph for the Wang method

Complexity analysis

Without loss of generality, suppose there are m pairs of genes that require computation of their functional similarities. Each gene has been annotated by an average k GO terms. There are n GO terms in the GO graph totally. Each pairwise approach uses the BMA rule to measure gene functional similarity. The time complexity for the seven methods is listed in Table 4.
Table 4

Time complexity for measuring gene functional similarity of each method

MethodTime complexity with the proposed strategyTime complexity without the proposed strategy
Step oneStep two
ResnikO(n3)O(m*k2*n*logn)O(m*n3*k2*n*logn)
JiangO(n3)O(m*k2*n*logn)O(m*n3*k2*n*logn)
LinO(n3)O(m*k2*n*logn)O(m*n3*k2*n*logn)
PekarO(n4)O(m*k2*n*logn)O(m*n4*k2*n*logn)
WangO(n4)O(m*k2*n*logn)O(m*n4*k2*n*logn)
simUIO(n3)O(m*k2*n*logn)O(m*n3*k2*n*logn)
simGICO(n3)O(m*k2*n*logn)O(m*n3*k2*n*logn)
Time complexity for measuring gene functional similarity of each method In step one, we can find that the time complexity for each method has no relationship with m. The time complexity for establishing the hash table is relatively low. In step two, once the hash tables are established, the efficiency of measurement of the gene functional similarity will be improved. It is noteworthy that the total time complexity for each gene functional similarity method equals the time complexity of step two as long as the GO graph remains unchanged. To further differentiate SGFSC from other tools in terms of time complexity, we also list the time complexity when our proposed strategy is not adopted. From the Table 4, we can find that, if m ≫ n, SGFSC has a higher computational efficiency comparing with other methods that don’t adopt the proposed strategy. Regarding space complexity, the proposed strategy occupies some memory space to store the hash tables. Suppose the storage space of a GO term is l, then the space complexity for storing the hash table is O(n*k*l). The actual amount of memory required to store the hash table is about 5 MB, which has been verified through experimentation. There is no doubt that it will be quicker to store and read the hash table on a laptop computer.

Results

In this section, we provide the running time to establish the hash table from the GO graph first. Then by using the proposed strategy, the time needed to measure the semantic similarity between term pairs is listed in Table 5. The results of measuring gene functional similarities on the whole genomic scale for model organisms are listed in Table 6. The results of comparisons with other tools are listed in Tables 7 and 8.
Table 5

Time in seconds required to establish the hash table for each method on BP, CC and MF ontologies

TypeResnikPekarWangsimUIsimGIC
BP44187618184562
CC264462.31.6267
MF3791794.94.7384
Table 6

Running time to measure semantic similarity between term pairs for three ontologies

Method# of term pairsBPMFCC
# of related termsTime (s)# of related termsTime (s)# of related termsTime (s)
Resnik106 27,8645.99,9432.53,8174.4
Jiang106 27,8646.79,9432.93,8173.8
Lin106 27,8646.09,9432.73,8173.6
Wang106 27,8646.29,9432.63,8174.1
Pekar106 27,8645.89,9432.63,8173.7
Table 7

Time in seconds to measure gene functional similarity of five organisms

OrganismType# of annotated genes# of average annotations# of gene pairsTime ResnikTimeJiangTimeLinTimeWangTimePekarTimesimUITimesimGIC
HumanBP393374.357.74 × 108 36702364873636636427369721908819762
CC359752.566.47 × 108 5843581558145810586949215815
MF384042.327.37 × 108 32332632732132645194828
ArabidopsisBP255323.243.26 × 108 6945695759486956695352385590
CC176832.001.56 × 108 812808841816816962989
MF203051.922.06 × 108 63261661962262810481091
YeastBP59063.201.74 × 107 578580586583579353383
CC56602.221.60 × 107 143146145146144432142
MF59022.301.74 × 107 666466656494100
RatBP233194.622.65 × 108 136501369013641137141370572197954
CC222172.602.47 × 108 2319231422992411231919112083
MF230652.492.66 × 108 1334131513511339133017501874
OryzaBP19091.441.82 × 106 887881313
CC399951.068.00 × 108 1262134313071258125429763115
MF20411.602.08 × 106 55545910
Table 8

Running time in seconds for each tool to measure semantic similarity

Tool# of term pairs
102 104 106
SGFSC<129.4
GFSAT686,387X
GOSemSim523,634X
Time in seconds required to establish the hash table for each method on BP, CC and MF ontologies Running time to measure semantic similarity between term pairs for three ontologies Time in seconds to measure gene functional similarity of five organisms Running time in seconds for each tool to measure semantic similarity

Running time to establish hash tables for each method

The key goal of the strategy is to extract the essential information from the GO graph and then establish a hash table to replace the GO graph. Therefore, we provide the running times to establish the hash table for each method in Table 9.
Table 9

Running time in seconds for each tool to measure gene functional similarity

Tool# of gene pairs
102 104 106
SGFSC<129768
GFSAT16336,514X
GOSemSim7813,056X
Running time in seconds for each tool to measure gene functional similarity From the results, we can find that SGFSC can establish the hash table within a few minutes. For example, the running times for the Resnik method on BP, CC and MF ontologies are 441 s, 264 s and 379 s, respectively. The running times of the other methods are close to those of the Resnik method, indicating that the computational efficiency of the proposed strategy for establishing the hash table is high. SGFSC is efficient in updating the content of the hash tables. Therefore, SGFSC is well adapted to the daily evolution of the GO database, which may change with the deletion of obsolete terms and the addition of new terms. It should be noted that the hash tables for the Resnik, Lin and Jiang methods are the same. The experiments were conducted on Windows with an i5-2600 K CPU @ 3.30 GHz with 16 GB memory.

Running time to measure semantic similarity for each method

For pairwise approaches, measuring the functional similarity between genes is mainly dependent on the sematic similarity between term pairs. Therefore, the computational efficiency of sematic similarity plays a key role in the pairwise methods. We randomly select 106 term pairs that are related to all of the terms in the corresponding GO graphs for the BP, MF and CC ontologies. The computing time for each method on BP, MF and CC graphs is listed in Table 5. For example, the running times for the Wang method on BP, CC and MF ontologies are 5.9 s, 2.5 s and 4.4 s, respectively. Because the BP ontology has the most terms, its running time is longer than that of the MF and CC ontologies. The results show that SGFSC can complete the calculation within a few seconds. In addition, the bar plots of running time for each method are presented in Fig. 4. We can clearly find that the computation time is within 10s.
Fig. 4

Bar plots of running time in measuring semantic similarity between term pairs

Bar plots of running time in measuring semantic similarity between term pairs

Running time to measure gene functional similarity for each method

To give a comprehensive comparison of the computational efficiency of each method, we select annotation data of five organisms representing Human, Arabidopsis, Yeast, Rat and Oryza. In Table 6, the number of annotated genes, number of average annotations for each gene, number of gene pairs and the running time in seconds for seven typical methods are presented for the corresponding ontologies of BP, CC and MF of five organisms. The experiments select annotation data for five model organisms: Human, Arabidopsis, Yeast, Rat and Oryza. The UniProt-GOA data for each species were downloaded from http://geneontology.org/page/download-annotations in August 2015. Number of gene pairs refers to the number of gene pairs that will be measured for gene functional similarity using SGFSC. We obtain the gene pairs by combining all of the annotated genes in the GOA database. The experiments were conducted on Linux with an E5-2609 CPU @2.40 GHz and 64 GB memory. For each organism, SGFSC computes the respective functional similarity on BP, CC and MF ontologies. For the whole genomic scale of human, the experimental results show that the computing time for each method is no more than 11 h. For the other organisms, the computing time is shorter (within a few hours) because the number of gene pairs and the number of average annotations of genes are relatively smaller. For example, the number of annotated genes for the human in BP ontology is 39337, whereas that for yeast is 5906. Therefore, the running time for the two organisms is greatly different. However, the running times of SGFSC for all selected organisms are within an acceptable range. Therefore, SGFSC shows its outstanding advantage for measuring functional similarity on the genomic scale. Even more, it has the ability to measure the similarity of a combination of all annotated genes in the GOA database for the model organisms. The proposed strategy thus can effectively solve the problem of large-scale computing of gene functional similarity. To represent the experimental results more intuitive, we add two bar plots which are depicted using Figs. 5 and 6. Figure 5 is the bar plots for running time using Wang method on selected organisms. Wang method can finish the gene functional similarity calculation on the genome scale for all selected organism in a relatively short time. Figure 6 shows the running time for each selected method on human genome scale. Results also indicates that all these methods can complete the calculation in an acceptable period of time.
Fig. 5

Bar plots for running time in measuring gene functional similarity using Wang method on each organism

Fig. 6

Bar plots for running time for each method in measuring functional similarity on human genomic scale

Bar plots for running time in measuring gene functional similarity using Wang method on each organism Bar plots for running time for each method in measuring functional similarity on human genomic scale

Comparison with other tools

To compare the computational efficiency of SGFSC with other tools, we select two tools: GFSAT [23] and GOSemSim [33]. These two tools can be conveniently installed on a laptop, and therefore, we can accurately measure running times. It is difficult to accurately measure the running times of other tools because they can only be used online. Therefore, we decided to compare running time of these two tools only. We use the running time to compute the similarities of a large number of term pairs and gene pairs to evaluate the computational efficiency of these three tools. In the experiment, we use SGFSC, GFSAT and GOSemSim to compute the similarities of three sets of term pairs and gene pairs, respectively. The numbers of term pairs and gene pairs in these sets are both 102, 104 and 106, respectively. The experiments were conducted on Windows with an i5-2600 K CPU @ 3.30 GHz and 16 GB memory. The GO data in Table 7 were downloaded in August 2015, and the term pairs were randomly generated from the MF ontology. The Wang method was used for each tool. The GOA data in Table 8 of Arabidopsis was downloaded in August 2015. We selected the well-annotated genes and then generated gene pairs for testing. The Wang method was also used for each tool. ‘X’ in Tables 7 and 8 indicates that the process took >12 h. The running time of each tool to measure the semantic similarity results is listed in Table 7. The running times of GFSAT and GOSemSim to measure the semantic similarity of 104 term pairs were 6387 s and 3634 s, respectively, whereas that for SGFSC took only 2 s. Furthermore, the running times of both GFSAT and GOSemSim to calculate 106 term pairs were greater than 12 h, whereas that for SGFSC was only 9.4 s. The running time of each tool to measure gene functional similarity is listed in Table 8. Similar to the findings for semantic similarity, SGFSC performed considerably better than the other two tools on the similarity calculation of gene pairs. The running time of SGFSC was 29 s for 104 gene pairs, whereas GFSAT and GOSemSim required 36154 s and 13056 s, respectively. As the results clearly show, in comparison with the other two tools, SGFSC has a considerable speed advantage especially when calculating a large number of gene pairs. Besides, we add one merged bar plots Figure for Tables 7 and 8. Figure 7a and b show the running time of measuring semantic similarity on 102 and 104 term pair datasets. Figure 7c and d depict the running time of measuring gene functional similarity on 102 and 104 gene pair datasets respectively. All of the four subfigures clearly show the advantages of SGFSC in the computation time. Therefore, our proposed strategy achieves the desired results in speeding up the gene functional similarity calculation.
Fig. 7

Bar plots for running time on different datasets using SGFSC, GFSAT and GOSemSim

Bar plots for running time on different datasets using SGFSC, GFSAT and GOSemSim

Discussion

First, we emphasise that the aim of the proposed two-step strategy is only to speed up the methods used to measure gene functional similarity because these methods tend to be time consuming if they are not implemented with a proper data structure. The problem may be extremely prominent especially when gene functional similarity needs to be measured on the genome scale for some applications. Therefore, the proposed strategy for speeding up the functional similarity calculation is quite meaningful. Second, the proposed two-step strategy adopts hash tables as the data structure to store essential information to avoid traversing the GO graph. Furthermore, the hash table is used only to satisfy the requirement of a quick search. Hence, the computational efficiency of these methods improves significantly. We particularly highlight that the computational efficiency for these methods varies widely according to different data structures and implementation strategies. Therefore, it is critical to choose an appropriate implementation strategy to increase the computational efficiency of these methods. Third, we also implemented an online tool, SGFSC, for adoption of the two-step strategy. The SGFSC could speed the functional similarity calculation methods on the whole genomic scale. However, there are some other issues need to note.

Precision

SGFSC extracts the essential information for each method from the GO graph and then establish hash tables, which have a special structure to store the information. Therefore, the precision of the final computing result is not lost with SGFSC because the function of the hash tables is only to store intermediate results to avoid repeatedly traversing the GO graph for each method.

Correctness

SGFSC was implemented with seven typical methods. It is not difficult to understand that SGFSC only achieves the computation speeds with these methods when our proposed two-step strategy is used. The computing process of the original methods is unchanged. Therefore, the computing results of SGFSC are identical to those of other tools if the same versions of the GO and GOA databases are used. As a result, the computing results are absolutely correct.

Space

Because SGFSC adopts a two-step computing strategy, it first needs to read the hash table into memory. Therefore, it occupies some additional memory space compared with other tools that do not adopt the proposed strategy. Experimentation showed that the memory space for storing the hash table is about 5 MB. The continuing development of computer hardware technology has made this a simple problem to solve. The core idea of the strategy comes from making the best use of memory space to reduce the computing time, which is widely used in computer science.

Application

SGFSC offers powerful computational capability to compute gene functional similarity on a genomic scale. We have provided a friendly online tool for the convenient use of SGFSC. In addation, our proposed strategy also offers good versatility for use in other research. For example, in the area of natural language process research, because the size of the WordNet is large, computing the IC value of a concept is also time consuming. Therefore, researchers measuring the semantic similarity between concepts could adopt our proposed two-step strategy to improve computational efficiency.

Drawback

To speed up the gene functional similarity calculation, SGFSC first has to read the corresponding information hash table into memory, which will take a few seconds. Therefore, the computational efficiency of SGFSC is lower than that of other tools if the number of gene pairs requiring measurement of functional similarity is small. Therefore, the outstanding advantage of SGFSC is in its measurement of gene functional similarity for a large number of gene pairs.

Conclusion

First, a novel two-step computing strategy is proposed to speed up gene functional similarity calculation. These methods measure gene functional similarity based on hash tables. Therefore, the time complexity is obviously decreased because there is no need to traverse the GO graph repeatedly, which primarily affects the computational efficiency. Second, we have implemented an online tool called SGFSC that is bundled with seven typical gene functional similarity calculation methods and is freely available at http://nclab.hit.edu.cn/SGFSC. The computational efficiency of SGFSC offers a significant improvement in computing time. Our experiments show that SGFSC has a great advantage in measuring gene functional similarity on the whole genomic scale. Third, the key point in our proposed strategy is the transformative idea that the information required can be obtained directly from the hash tables instead of the original GO graph. The proposed strategy converts the storage form of information from a GO graph into a hash table structure that can meet the requirements of a quick query. As a result, the proposed strategy achieved a desired result. The proposed strategy may also be applied to other areas of bioinformatics to improve computational efficiency.
  34 in total

1.  Semantic similarity analysis of protein data: assessment with biological features and issues.

Authors:  Pietro H Guzzi; Marco Mina; Concettina Guerra; Mario Cannataro
Journal:  Brief Bioinform       Date:  2011-12-02       Impact factor: 11.622

2.  Functional evaluation of domain-domain interactions and human protein interaction networks.

Authors:  Andreas Schlicker; Carola Huthmacher; Fidel Ramírez; Thomas Lengauer; Mario Albrecht
Journal:  Bioinformatics       Date:  2007-04-01       Impact factor: 6.937

3.  Semantic similarity based feature extraction from microarray expression data.

Authors:  Young-Rae Cho; Aidong Zhang; Xian Xu
Journal:  Int J Data Min Bioinform       Date:  2009       Impact factor: 0.667

4.  Co-clustering and visualization of gene expression data and gene ontology terms for Saccharomyces cerevisiae using self-organizing maps.

Authors:  Markus Brameier; Carsten Wiuf
Journal:  J Biomed Inform       Date:  2006-05-20       Impact factor: 6.317

5.  Improving disease gene prioritization using the semantic similarity of Gene Ontology terms.

Authors:  Andreas Schlicker; Thomas Lengauer; Mario Albrecht
Journal:  Bioinformatics       Date:  2010-09-15       Impact factor: 6.937

6.  An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology.

Authors:  Shobhit Jain; Gary D Bader
Journal:  BMC Bioinformatics       Date:  2010-11-15       Impact factor: 3.169

7.  Information content-based gene ontology semantic similarity approaches: toward a unified framework theory.

Authors:  Gaston K Mazandu; Nicola J Mulder
Journal:  Biomed Res Int       Date:  2013-09-02       Impact factor: 3.411

Review 8.  Semantic similarity in biomedical ontologies.

Authors:  Catia Pesquita; Daniel Faria; André O Falcão; Phillip Lord; Francisco M Couto
Journal:  PLoS Comput Biol       Date:  2009-07-31       Impact factor: 4.475

9.  G-SESAME: web tools for GO-term-based gene similarity analysis and knowledge discovery.

Authors:  Zhidian Du; Lin Li; Chin-Fu Chen; Philip S Yu; James Z Wang
Journal:  Nucleic Acids Res       Date:  2009-06-02       Impact factor: 16.971

10.  Probabilistic protein function prediction from heterogeneous genome-wide data.

Authors:  Naoki Nariai; Eric D Kolaczyk; Simon Kasif
Journal:  PLoS One       Date:  2007-03-28       Impact factor: 3.240

View more
  1 in total

1.  Refine gene functional similarity network based on interaction networks.

Authors:  Zhen Tian; Maozu Guo; Chunyu Wang; Xiaoyan Liu; Shiming Wang
Journal:  BMC Bioinformatics       Date:  2017-12-28       Impact factor: 3.169

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.