
Exploiting ontology graph for predicting sparsely annotated gene function.

Sheng Wang¹, Hyunghoon Cho¹, ChengXiang Zhai¹, Bonnie Berger², Jian Peng¹.

Abstract

MOTIVATION: Systematically predicting gene (or protein) function based on molecular interaction networks has become an important tool in refining and enhancing the existing annotation catalogs, such as the Gene Ontology (GO) database. However, functional labels with only a few (<10) annotated genes, which constitute about half of the GO terms in yeast, mouse and human, pose a unique challenge in that any prediction algorithm that independently considers each label faces a paucity of information and thus is prone to capture non-generalizable patterns in the data, resulting in poor predictive performance. There exist a variety of algorithms for function prediction, but none properly address this 'overfitting' issue of sparsely annotated functions, or do so in a manner scalable to tens of thousands of functions in the human catalog.
RESULTS: We propose a novel function prediction algorithm, clusDCA, which transfers information between similar functional labels to alleviate the overfitting problem for sparsely annotated functions. Our method is scalable to datasets with a large number of annotations. In a cross-validation experiment in yeast, mouse and human, our method greatly outperformed previous state-of-the-art function prediction algorithms in predicting sparsely annotated functions, without sacrificing the performance on labels with sufficient information. Furthermore, we show that our method can accurately predict genes that will be assigned a functional label that has no known annotations, based only on the ontology graph structure and genes associated with other labels, which further suggests that our method effectively utilizes the similarity between gene functions.
AVAILABILITY AND IMPLEMENTATION: https://github.com/wangshenguiuc/clusDCA.

Year: 2015 PMID: 26072504 PMCID: PMC4542782 DOI: 10.1093/bioinformatics/btv260

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Automated annotation of gene (or protein) function has become a critical task in the post-genomic era (Radivojac ). Fortunately, an increasing compendium of genomic, proteomic and interactomic data allows us to extract patterns from functionally well-characterized genes (or proteins) to accurately infer functional properties of lesser-known ones. In particular, recently developed high-throughput experimental techniques, such as yeast two-hybrid screens and genetic interaction assays, have helped to build molecular interaction networks in bulk. Topological structures of these networks can be exploited for function prediction using the ‘guilt-by-association’ principle, which states that genes (or proteins) that share similar neighbors or other topological properties in interaction networks are more likely to be functionally related. To this end, a variety of graph-theoretic and machine learning algorithms (Karaoz ; Letovsky and Kasif, 2003; Murali ; Sefer and Kingsford, 2011) have been developed to provide a way of refining and enhancing existing functional annotations [e.g. Gene Ontology (GO) database (Ashburner )] based on network data. A popular class of graph-theoretic algorithms uses a diffusion process to examine the local topology of nodes, exploiting both direct and indirect linkages (Cao , 2013; Cho ; Kohler ; Nabieva ). Alternatively, the number of occurrences of different elementary subgraphs (known as graphlets) in the neighborhood can be used to characterize each node and to establish pairwise affinity scores (Gligorijevic ; Milenkovic ; Milenkovic and Przulj, 2008). More sophisticated machine learning algorithms, such as GeneMANIA (Mostafavi and Morris, 2010; Mostafavi ), have also been proposed. GeneMANIA uses label propagation on an integrated network specifically constructed for each functional label, and is currently available as the state-of-the-art web interface for gene function prediction in multiple organisms. 
Despite the success of existing algorithms, a major difficulty that has not been sufficiently addressed is that of predicting rare labels. Because many molecular functions (MFs) are inherently specific in their scope, a large number of functional labels have only a few annotated genes (or positive annotations); for instance, in the human GO annotation database (Ashburner ), there are currently 8626 GO labels with at least 3 annotations, 4178 of which have <10 annotated genes and 7905 of which have <100 annotated genes. The distributions of GO labels with different numbers of annotations in yeast and human are shown in Figure 1. Nearly half of the GO labels have <10 annotations in both species.

Fig. 1.

A breakdown of GO labels by the number of annotated genes in (a) human and (b) yeast

Predicting new associations for these sparsely annotated labels is substantially more challenging than for labels with many annotations, because patterns extracted from the few known genes are more likely to be statistical artifacts that cannot be generalized, which is commonly known as the ‘overfitting’ problem in machine learning and statistics. One way to mitigate overfitting is to take similarities between labels into account. For instance, if we have a priori knowledge that two labels reflect similar MFs (e.g. they are both children of the same parent in the ontology graph), we would also expect the two corresponding sets of genes (or proteins) to be similar. If one of the gene sets contains only a few genes, then the other may provide valuable information about missing associations. Thus, by propagating information along the edges in the ontology graph, one can pool available data together for increased robustness to overfitting. Notably, previous efforts to incorporate label similarity into function prediction algorithms have largely been unsuccessful. They formulated the problem as a single structured-output hierarchical classification (HC) instead of binary classification, but their predictive performance for sparsely annotated functional labels is far from satisfactory (Clark and Radivojac, 2013; Eisner ; Guan ; Jiang ; Kim ; Obozinski ; Sokolov and Ben-Hur, 2010). Another related work (Wang ) exploited the similarity between 17 Munich Information Center for Protein Sequences (MIPS) functional categories via a regularization scheme. However, this approach does not scale to tens of thousands of sparse GO annotations in human, as it employs a computationally expensive optimization.
Our approach to this problem is based on diffusion component analysis (DCA) (Cho ), a recently developed algorithm that combines network diffusion, such as random walk with restart (RWR) (Kohler ), with dimensionality reduction to obtain low-dimensional vector representations of nodes in a graph that capture topological properties. Topological features extracted from interaction networks in this manner can be used in conjunction with k-nearest neighbors (kNNs) or support vector machines (SVMs) to outperform the corresponding state-of-the-art for predicting hundreds of MIPS labels in yeast (Cho ). However, DCA also suffers from overfitting for sparsely annotated labels when predicting GO labels for a larger human interactome, if we want to train label-wise classifiers for all labels. In this work, we introduce clusDCA, an improved function prediction algorithm based on DCA (we attached a two-page abstract of DCA in the Supplementary Data), which (i) incorporates the similarity between functional labels and (ii) scales to a large number of annotations. The key idea of clusDCA is to perform DCA also on the ontology graph to obtain compact vector representations of labels. The gene vectors from the original method are then projected onto the space of label vectors so that the projections of positively annotated genes are geometrically close to their assigned labels. Because labels that are similar to each other in the ontology graph are co-localized in the label vector space, classifiers for sparsely annotated labels will now favor genes associated with other similar labels in the neighborhood. This is how information is transferred between labels to avoid overfitting in our approach. 
When compared with state-of-the-art methods that do not incorporate label similarity, our experiments on yeast, mouse and human datasets demonstrate that our method substantially improves the predictive accuracy of sparsely annotated labels while achieving comparable performance for GO labels with sufficiently many genes. We also demonstrate the performance improvement of clusDCA over an alternative approach to utilizing the ontology graphs based on HC. Furthermore, our method can be used to predict new genes for a given GO label, even in the extreme case where there are no existing gene annotations and the only information available is the label’s position in the ontology graph and the genes associated with other labels. In addition to improving function prediction on its own, this demonstrates the potential for our method to be used in conjunction with recent methods that extract new ontology terms from data (Dutkowski ; Kramer ) to provide an improved way of refining and extending our knowledge of gene or protein function.

2 Methods

As an overview, clusDCA first computes the ‘diffusion state’ of each node by performing an RWR on each input network, and subsequently finds a low-dimensional vector representation for each gene via an efficient matrix factorization of the diffusion states. A key contribution is that clusDCA then follows an analogous procedure to obtain a low-dimensional vector representation of each functional label based on the ontology graph. Intuitively, the gene vectors encode the topology of the interactome, which in turn reflects gene function, while the label vectors encode the topology of the ontology graph, which reflects the semantic and relational properties of the labels. Given both the gene and the label vectors, clusDCA then finds the best projection of the gene vectors onto the label vector space, keeping the projected gene vectors geometrically close to their known labels. In the final step, clusDCA computes its predictions for an uncharacterized gene by sorting the candidate functions by their proximity to the projected gene vector, based on the optimal projection. An illustration of this pipeline is given in Figure 2. We give a more detailed description of this pipeline below.

Fig. 2.

Overview of clusDCA

2.1 Low-dimensional vector representations of genes

2.1.1 Review of DCA

The first step of clusDCA is to use DCA to compute low-dimensional vector representations of genes in molecular networks. To achieve this goal, DCA first runs RWR from each node in each molecular network (e.g. protein–protein interaction or co-expression network) to compute the ‘diffusion state’ of each node, which summarizes local topology. RWR is different from conventional random walks in that it introduces a pre-defined probability of restarting at the initial node after every iteration. Formally, let $A$ denote the weighted adjacency matrix of a molecular interaction network with $n$ genes (or proteins). Each entry $B_{ij}$ in the transition matrix $B$ represents the probability of a transition from node $i$ to node $j$ and is defined as:

$$B_{ij} = \frac{A_{ij}}{\sum_{j'} A_{ij'}} \quad (1)$$

Next, letting $s_i^t$ be an $n$-dimensional distribution vector in which each entry stores the probability of a node being visited from node $i$ after $t$ steps, RWR from node $i$ with restart probability $p$ is defined as:

$$s_i^{t+1} = (1 - p)\, s_i^t B + p\, e_i \quad (2)$$

where $e_i$ is an $n$-dimensional distribution vector with $e_i(i) = 1$ and $e_i(j) = 0$, $\forall j \neq i$. Note that the restart probability $p$ controls the relative influence of global and local topological information in the diffusion, where a larger value places greater emphasis on the local structure. We can obtain the stationary distribution $s_i^{\infty}$ of RWR at the fixed point of this iteration, and we refer to this as the ‘diffusion state’ $s_i$ of node $i$ (i.e. $s_i = s_i^{\infty}$), using the same definition as previous work (Cao ). Intuitively, the $j$th entry $s_{ij}$ stores the probability that RWR starts at node $i$ and ends up at node $j$ in equilibrium. Two nodes having similar diffusion states implies that they are in similar positions with respect to other nodes in the graph, which may reflect functional similarity. However, the diffusion states are not entirely accurate, partially due to the noisy and incomplete nature of interactomes. Moreover, high dimensionality imposes additional computational constraints on directly using the diffusion states as features for classification or regression tasks.
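The RWR iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name, the convergence tolerance and the self-loop given to isolated nodes are assumptions of the sketch.

```python
import numpy as np

def diffusion_states(A, restart_prob=0.5, tol=1e-10, max_iter=1000):
    """Return S, where row i is the stationary RWR distribution started at node i."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    row_sums = A.sum(axis=1, keepdims=True)
    # Row-normalize A into a transition matrix B. Isolated nodes get a
    # self-loop (an assumption of this sketch) so every row is a distribution.
    safe = np.where(row_sums > 0, row_sums, 1.0)
    B = np.where(row_sums > 0, A / safe, np.eye(n))
    E = np.eye(n)  # row i is the restart vector e_i
    S = E.copy()
    for _ in range(max_iter):
        # One RWR step for all start nodes at once.
        S_next = (1.0 - restart_prob) * (S @ B) + restart_prob * E
        if np.abs(S_next - S).max() < tol:
            return S_next
        S = S_next
    return S
```

Because the update is a contraction with factor (1 - restart_prob), the loop converges quickly for the restart probabilities used in this work; the fixed point also has the closed form restart_prob * (I - (1 - restart_prob) B)^{-1}, which is convenient for small graphs.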
To address this issue, DCA employs the following dimensionality reduction scheme. The probability assigned to node $j$ in the diffusion state of node $i$ is modeled as

$$\hat{s}_{ij} = \frac{\exp(x_i^T w_j)}{\sum_{j'} \exp(x_i^T w_{j'})} \quad (3)$$

where $x_i, w_i \in \mathbb{R}^d$ for $d \ll n$. DCA refers to $w_i$ as the context feature and $x_i$ as the node feature of node $i$, both capturing the topological properties of the network. If $x_i$ and $w_j$ are close in direction and have a large inner product, then it is likely that node $j$ is frequently visited in the random walk starting from node $i$. DCA takes a set of observed diffusion states $\{s_1, \ldots, s_n\}$ as input and optimizes over $w$ and $x$ for all nodes, using KL-divergence as the objective function:

$$\min_{w, x} \; \frac{1}{n} \sum_{i=1}^{n} D_{KL}(s_i \,\|\, \hat{s}_i) \quad (4)$$

The original framework uses a standard quasi-Newton method, L-BFGS (Zhu ), to solve this optimization problem. Although the learnt low-dimensional vector representation can effectively capture the network structure, we found that optimizing in this way is time consuming.

2.1.2 New contributions

To make DCA more scalable to large molecular networks, we developed a fast, matrix factorization-based approach to decompose the diffusion states. Based on the definition of $\hat{s}_{ij}$, we have:

$$\log \hat{s}_{ij} = x_i^T w_j - \log \sum_{j'} \exp(x_i^T w_{j'}) \quad (5)$$

The first term in the above equation corresponds to the low-dimensional approximation of $\hat{s}_{ij}$, while the second term is the normalization factor that enforces $\hat{s}_i \in \Delta^n$, where $\Delta^n$ is the $n$-dimensional probability simplex. In our new formulation, we relax the constraint that the entries in $\hat{s}_i$ sum to one by dropping the second term; while the resulting low-dimensional approximations of diffusion states are no longer strictly valid probability distributions, we find that the approximations are close enough to the true distribution that the relaxation has a negligible impact. As a result, $\log \hat{s}_{ij}$ can be simplified as:

$$\log \hat{s}_{ij} = x_i^T w_j \quad (6)$$

In addition, instead of optimizing the relative entropy between the true and the approximated diffusion states, we use the sum of squared errors as the new objective function:

$$\min_{w, x} \; \sum_{i=1}^{n} \sum_{j=1}^{n} \left( x_i^T w_j - \log s_{ij} \right)^2 \quad (7)$$

Now, the resulting optimization problem can be easily solved by the classic singular value decomposition (SVD) (Golub and Reinsch, 1970). To avoid taking a logarithm of zeros, we added a small positive constant $1/n$ to $s_{ij}$ and computed the logarithm diffusion state matrix $L$ as:

$$L = \log(S + Q) - \log(Q) \quad (8)$$

where $Q \in \mathbb{R}^{n \times n}$ with $Q_{ij} = 1/n$, $\forall i, j$, and $S$ is the concatenation of $s_1, \ldots, s_n$. With SVD, we decompose $L$ into three matrices $U$, $\Sigma$ and $V$:

$$L = U \Sigma V^T \quad (9)$$

where $U \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{n \times n}$ and $\Sigma \in \mathbb{R}^{n \times n}$ is the diagonal singular value matrix. To obtain the low-dimensional vectors $w_i$ and $x_i$ with $d$ dimensions, we simply choose the first $d$ singular vectors $U_d$, $V_d$ and the first $d$ singular values $\Sigma_d$. More precisely, let $X = [x_1, \ldots, x_n]^T$ denote the low-dimensional vector representation matrix, and $W = [w_1, \ldots, w_n]^T$ denote the context feature matrix. $X$ and $W$ can be computed as:

$$X = U_d \Sigma_d^{1/2} \quad (10)$$

$$W = V_d \Sigma_d^{1/2} \quad (11)$$
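The SVD-based factorization can be illustrated with the following sketch. It is not the authors' code: the function name is hypothetical, and the smoothing constant is taken to be 1/n here, which is an assumption of this sketch. Splitting the singular values evenly between the two factors makes the product of the node and context features the best rank-d approximation of the log diffusion matrix under squared error.

```python
import numpy as np

def dca_svd(S, d):
    """Factor the log diffusion states into d-dimensional node and context features."""
    S = np.asarray(S, dtype=float)
    n = S.shape[0]
    # Smooth before taking logs so zero probabilities do not produce -inf.
    # The constant 1/n is an assumption of this sketch.
    L = np.log(S + 1.0 / n) - np.log(1.0 / n)
    U, sigma, Vt = np.linalg.svd(L, full_matrices=False)
    root = np.sqrt(sigma[:d])
    X = U[:, :d] * root      # node features, one row per node
    W = Vt[:d, :].T * root   # context features, one row per node
    return X, W, L
```

With d equal to n the product X @ W.T reproduces L exactly; for smaller d the squared reconstruction error equals the sum of the discarded squared singular values.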

2.1.3 Runtime improvements

The key benefit of this new optimization procedure is significantly reduced computational time. For example, decomposing the STRING yeast network into 500-dimensional vectors takes <5 min on a standard server (with six 3.07 GHz Intel Xeon CPUs and 32GB RAM) for SVD, while the original approach with L-BFGS takes >2 h. We noticed that the prediction accuracies for both methods are almost identical in predicting yeast gene function.

To integrate heterogeneous network data, we extend the above single-network DCA to multiple networks. Let $\{L^1, \ldots, L^k\}$ denote the set of logarithm diffusion state matrices based on the diffusion states from the $k$ input networks. Here, we optimize the following objective function:

$$\min_{w, x} \; \sum_{r=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{n} \left( x_i^T w_j^r - L^r_{ij} \right)^2 \quad (12)$$

where for each node $i$ in network $r$, we assign a network-specific context feature $w_i^r$, which encodes the intrinsic topological properties of node $i$ in network $r$. The node features $x$ are shared across all $k$ networks to be able to capture more global patterns. This objective function can also be optimized by SVD. It is worth noting that it is possible to weight each network differently when concatenating the networks, but we give equal importance to each network in this work for simplicity. In the following sections, we use $X$ as the low-dimensional vector representations of genes. Note that one can also use $W$, but we observed that the performances of these two representations are quite similar.
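One way to realize the shared-node-feature factorization with a single SVD is to concatenate the per-network log diffusion matrices side by side, as sketched below. The function name and the horizontal concatenation layout are assumptions of this sketch; networks are weighted equally, as in the text.

```python
import numpy as np

def joint_embedding(log_states, d):
    """Shared node features X and per-network context features from k networks.

    log_states: list of k arrays, each (n, n), holding log diffusion states.
    """
    n = log_states[0].shape[0]
    L_all = np.hstack(log_states)  # shape (n, k * n), equal weight per network
    U, sigma, Vt = np.linalg.svd(L_all, full_matrices=False)
    root = np.sqrt(sigma[:d])
    X = U[:, :d] * root            # node features shared across networks
    W_all = Vt[:d, :].T * root     # stacked context features, shape (k * n, d)
    # Slice the stacked context features back into one block per network.
    contexts = [W_all[r * n:(r + 1) * n] for r in range(len(log_states))]
    return X, contexts
```

Each block satisfies X @ contexts[r].T ≈ L^r, so minimizing the squared error of the concatenated matrix is the same as minimizing the sum of per-network squared errors.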

2.2 Low-dimensional vector representations of functional labels

The GO graph is a directed acyclic graph (DAG) over functional labels where the edges represent various semantic relationships. In this work, we only consider the ‘is a’ and ‘part of’ edges, which results in a hierarchy of labels with edges going from the more specific to the more generic terms. As a consequence of this hierarchical structure, which is generally not present in molecular networks, a naive application of RWR on the ontology graph where the edges are treated as undirected unfairly favors high-level nodes, which tend to have higher centrality. On the other hand, allowing a random walk to only move from high- to low-level nodes would greatly restrict the portion of the graph a random walk can explore. To address these issues, we allow both edge directions but with different weights, whose ratio is controlled by the ‘back propagation’ parameter $\alpha$. With $B$ denoting the transition matrix of the original graph with unidirectional edges, our modified RWR for the ontology graph is defined as We chose a value for $\alpha$ that generally shrinks the diffusion scores of high-level nodes, and confirmed that the final prediction performance is stable for different values of $\alpha$ between 0.5 and 0.8. Based on the diffusion states from this modified random walk, we learned a low-dimensional vector representation of the ontology graph using the same procedure as the one for molecular networks. Importantly, our representation captures not only single-hop parent–child relationships, but also more global patterns such as long-range sibling relationships in the network. In the following sections, we use $Y$ to denote the low-dimensional vector representation matrix of functional labels; $y_j$ is the vector for function $j$.
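To make the idea of a direction-weighted random walk concrete, the sketch below mixes the two edge directions before row-normalizing. It should be read as one plausible instantiation under stated assumptions, not as the exact formula used in this work: here `alpha` weights the original child-to-parent direction and `1 - alpha` the reverse direction (the opposite convention is equally plausible), the mixture is row-normalized so each row stays a probability distribution, and nodes without any edge receive a self-loop.

```python
import numpy as np

def ontology_diffusion(child_to_parent, alpha=0.8, restart_prob=0.8):
    """Diffusion states on an ontology DAG with unequal weights per edge direction.

    child_to_parent[i, j] = 1 if there is an edge from label i to its parent j.
    """
    A = np.asarray(child_to_parent, dtype=float)
    n = A.shape[0]
    # Assumption: alpha weights the original direction, (1 - alpha) the reverse.
    mixed = alpha * A + (1.0 - alpha) * A.T
    row_sums = mixed.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)
    # Row-normalize; rows without outgoing weight become self-loops.
    M = np.where(row_sums > 0, mixed / safe, np.eye(n))
    # Closed-form fixed point of s <- (1 - p) s M + p e_i for all start nodes.
    return restart_prob * np.linalg.inv(np.eye(n) - (1.0 - restart_prob) * M)
```

Setting `alpha=1.0` recovers a walk that only follows the original edges, which illustrates the restriction discussed above: a walk started at the root never reaches any other label.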

2.3 Projecting gene vectors into ontology label space

After obtaining the low-dimensional vector representations of both genes and functional labels, we use these vectors to predict gene function. Because the vectors reflect the topological structure of nodes in the network, genes that are close in their vector directions are more likely to be similar in their functions. Analogously, functional labels that are close in their vector directions may be more semantically similar. Based on this intuition, we use a transformation matrix $W$ to project genes from the gene vector space to the function vector space, which allows us to match genes to functions based on geometric proximity. Let $x'_i$ be the projection of the gene vector $x_i$:

$$x'_i = W^T x_i \quad (14)$$

Then we define the pairwise affinity score $z_{ij}$ between gene $i$ and function $j$, whose label vector is $y_j$, to be used for function prediction as:

$$z_{ij} = {x'_i}^T y_j \quad (15)$$

A larger $z_{ij}$ indicates that gene $i$ is more likely to be annotated with function $j$. We want to optimize $W$ so that positively annotated genes are geometrically similar to their assigned GO labels. In this work, we use the inner product as the similarity function. We also explored the L2 distance, but it performed generally worse than the inner product, possibly due to the fact that the inner product explicitly models both positive and negative annotations. Next, we define $P_j$ ($N_j$) as the set of genes that are positively (negatively) annotated with function $j$. Note that whenever a gene is positively annotated with a particular function, we also annotate it with all ancestors of that function. Our constrained optimization problem for finding the best projection that incorporates both positive and negative annotations is given by where the weights and correct for the imbalance in the training data. Instead of simply maximizing the affinity scores of positive annotations, this formulation aims at maximizing the margin between the affinity scores of positive and negative annotations. This problem can be solved analytically by a closed-form solution: where is a weight matrix with and .
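As an illustration of how a margin objective of this kind admits a closed-form solution, the sketch below assumes a specific form: positives are weighted by 1/|P_j|, negatives by 1/|N_j|, and W is constrained to unit Frobenius norm. These weights, the constraint and the function names are assumptions of the sketch and may differ from the exact formulation used in this work. Under them the objective is linear in W, so the maximizer is the normalized matrix X^T C Y, where C holds the signed per-label weights.

```python
import numpy as np

def fit_projection(X, Y, annotations):
    """Closed-form projection W under a unit Frobenius-norm constraint (assumed).

    X: (n_genes, d_gene) gene vectors; Y: (n_labels, d_label) label vectors;
    annotations: (n_genes, n_labels) binary matrix of training annotations.
    """
    T = np.asarray(annotations, dtype=float)
    n_pos = T.sum(axis=0)
    n_neg = T.shape[0] - n_pos
    if np.any(n_pos == 0) or np.any(n_neg == 0):
        raise ValueError("every label needs at least one positive and one negative gene")
    # Signed weights: +1/|P_j| for positives, -1/|N_j| for negatives.
    C = T / n_pos - (1.0 - T) / n_neg
    G = X.T @ C @ Y
    return G / np.linalg.norm(G)

def affinity(X, Y, W):
    """Affinity scores z_ij = x_i^T W y_j for all gene-label pairs."""
    return X @ W @ Y.T
```

Sorting each column of `affinity(X_test, Y, W)` gives a ranked list of candidate genes per label; sorting each row ranks candidate labels per gene.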
Because modeling the complex relationship between genes and functional labels with a single transformation matrix may be overly restrictive, we group functions into different clusters and learn a separate projection model for each cluster. For the main results, we divide the GO labels into the following four groups based on the number of annotated genes in the training data: [3-10], [11-30], [31-100] and [101-300]. We also tested using the partition given by the clustering of our learned label vectors and obtained comparable prediction performance (see Supplementary Data).
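The grouping of labels by sparsity level amounts to a simple binning step, sketched below. The function name and the dictionary-based input are illustrative assumptions; labels whose counts fall outside every bin are simply left out, and a separate projection would then be learned for each group.

```python
# Sparsity levels used for the main results (inclusive bounds).
BINS = [(3, 10), (11, 30), (31, 100), (101, 300)]

def group_labels_by_sparsity(annotation_counts, bins=BINS):
    """Map each (low, high) bin to the labels whose training count falls in it."""
    groups = {b: [] for b in bins}
    for label, count in annotation_counts.items():
        for low, high in bins:
            if low <= count <= high:
                groups[(low, high)].append(label)
                break
    return groups
```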

3 Results

3.1 Networks and annotations

We obtained a collection of six molecular networks each for human, yeast and mouse from the STRING database v9.1 (Franceschini ). These networks are built from heterogeneous data sources, including high-throughput interaction assays, curated protein-protein interaction databases, and conserved co-expression. We excluded text mining-based networks to avoid confounding by links based on functional similarity. There were 16 662 nodes in human, 6311 in yeast and 18 248 in mouse. The number of edges in these networks varied from 1183 to 673 410 in human, from 1059 to 293 921 in yeast and from 3917 to 1 638 107 in mouse. Note that each edge is associated with a weight between 0 and 1 representing the confidence of interaction. Next, we obtained gene-function associations and the ontology of functional labels from the GO Consortium (Ashburner ). We built a DAG of GO labels from all three categories [biological process (BP), MF, cellular component] based only on the ‘is a’ and ‘part of’ relationships for each species. Labels without any associated genes were removed, resulting in three species-specific ontology graphs for yeast, human and mouse. The human ontology graph had 13 708 functions and 19 206 edges, the yeast ontology graph had 4240 functions and 4804 edges and the mouse ontology graph had 13 807 functions and 19 704 edges.

3.2 Experimental setting

Following previous work (Mostafavi ), we used 3-fold cross-validation to evaluate our method, where a randomly chosen subset of one-third of the genes are held out as the test set. After computing the optimal projection of gene vectors into the functional label space based only on the training data, we calculated the affinity scores (see Equation 15) to get a ranked list of test genes for each function. Then we measured the extent to which true annotations are concentrated near the top of the list by calculating the area under the receiver operating characteristic curve (AUROC) and the area under the precision recall curve (AUPRC), which are standard performance metrics in this field (Mostafavi ). To summarize results across different labels, we used both micro- and macro-averages. The micro-average directly combines the entries in the confusion matrix constructed from different labels prior to calculating the predictive performance, and the macro-average calculates the areas under the curves for each label independently and then takes the average. We compared clusDCA to two state-of-the-art network-based function prediction algorithms, GeneMANIA (Mostafavi ) and DCA (Cho ), and another algorithm based on HC (Sokolov and Ben-Hur, 2010), which exploits the hierarchical structure of functional labels. For consistency, we used the same dataset (i.e. annotations, genes, networks) and the same evaluation scheme for every method we tested. We obtained the original MATLAB implementation of GeneMANIA from http://morrislab.med.utoronto.ca/Data/GB08/. For DCA, we tested only the kNN version, because the SVM version seriously suffers from overfitting for sparsely annotated labels and also does not scale to the human dataset. Importantly, neither GeneMANIA nor DCA leverages topological information from the GO graph. Noting that the original formulation of HC does not scale to large datasets, we instead implemented an efficient version that utilizes the clusDCA framework. 
In particular, we associate each gene with a ‘macro-label’ $\ell = (\ell_1, \ldots, \ell_m)$, where $\ell_i$ is a ‘micro-label’ which is set to 1 when the gene is positively annotated with label $i$ and 0 otherwise. Following HC, we impose a constraint that if $\ell_i = 1$ then $\ell_j = 1$ for every ancestor $j$ of $i$. We then find the optimal projection from DCA gene vectors to the space of macro-labels using the same optimization problem we introduced in Equation 16. After solving the optimization problem, we use the optimized transformation matrix to compute the pairwise affinity score $z_{ij}$ between gene $i$ and function $j$. For clusDCA, we set the back propagation parameter $\alpha$ to 0.8 and the restart probability to 0.8 for the GO graph. We observed that our performance is stable for different values of $\alpha$ between 0.5 and 0.8. For the molecular networks, we used a restart probability of 0.5, adopted from our previous work (Cho ). We set a larger restart probability for the GO graphs because they are generally much sparser. We used d = 2500 as the dimensionality of the learned vectors for the main results. The effect of varying this parameter is analyzed in Section 3.7. We followed the same procedure as the one in GeneMANIA to group the GO labels into two major gene ontologies: ‘BP’ and ‘MF’. For both ontologies, we further binned GO labels into four sparsity levels, each consisting of GO labels with [3-10], [11-30], [31-100] and [101-300] annotated genes (see Table 1). Although we used all of the GO terms in human and yeast, for mouse, we used only the GO terms with evidence codes that are also used in the evaluation of GeneMANIA (Mostafavi et al., 2008): TAS, RCA, ND, NAS, ISS, IPI, IMP, IGI, IEP, IEA, IDA and IC (Peña-Castillo et al., 2008).
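The difference between the micro- and macro-averaged AUROC used in this evaluation can be illustrated with a small pure-Python sketch. The function names are hypothetical, and the rank-based AUROC below counts ties as half a win, which is a common convention assumed here. Pooling all gene-label pairs before ranking is equivalent to combining the per-label confusion matrices at every score threshold.

```python
def auroc(labels, scores):
    """Rank-based AUROC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def macro_auroc(label_cols, score_cols):
    """Average of per-label AUROCs; every label counts equally."""
    return sum(auroc(y, s) for y, s in zip(label_cols, score_cols)) / len(label_cols)

def micro_auroc(label_cols, score_cols):
    """AUROC over all gene-label pairs pooled into a single ranking."""
    pooled_y = [y for col in label_cols for y in col]
    pooled_s = [s for col in score_cols for s in col]
    return auroc(pooled_y, pooled_s)

# Two toy labels scored on the same three test genes.
toy_labels = [[1, 0, 0], [1, 0, 0]]
toy_scores = [[0.9, 0.2, 0.1], [0.3, 0.4, 0.1]]
print(macro_auroc(toy_labels, toy_scores))  # 0.75
print(micro_auroc(toy_labels, toy_scores))  # 0.875
```

The two averages differ because the micro-average compares scores across labels, so it also rewards affinity scores that are calibrated on a common scale.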

Table 1.

Number of GO terms in different sparsity levels

	3–10	11–30	31–100	101–300
Human MF	886	390	222	99
Human BP	2940	1677	1122	553
Yeast MF	351	156	92	29
Yeast BP	815	408	235	87
Mouse MF	188	215	165	84
Mouse BP	337	568	678	329

3.3 GO label vectors capture semantic similarity

Unlike some of the previous work (Mostafavi ; Wang ) that did not explicitly incorporate the GO graph, our approach exploits the ontology structure to learn a low-dimensional vector representation of each GO label. If these vectors can be clustered into semantically meaningful clusters, it would validate our attempt to enforce gene assignments to be similar between labels that are geometrically close in the label vector space as being well founded. To test this hypothesis, we used K-means to cluster the GO labels based on the cosine similarity of our low-dimensional vector representations of labels. We determined the number of clusters by restricting the largest cluster to have at most 80% of the total number of labels. In Supplementary Figure S1 (Supplementary Data), two of the 41 clusters we identified are visualized with Cytoscape (Smoot ). The complete list of clusters can be found in Supplementary Data. In the visualization, the node size reflects the number of genes that the corresponding function is annotated with, and the edge width reflects the cosine similarity between the vector representations of the two nodes. The first cluster represents functions related to molecular binding, such as cation binding, metal ion binding, nucleotide binding and NAD binding, whereas the second cluster represents functions related to different transmembrane transporter activity. The fact that the set of GO labels in each of these clusters is highly consistent in function provides evidence that the learned vectors faithfully reflect the semantic relationships among the labels.

3.4 clusDCA substantially improves prediction of sparsely annotated GO labels

To evaluate clusDCA, we performed large-scale function prediction for human, yeast and mouse. The results are summarized in Figure 3 and Supplementary Figure S2 (Supplementary Data). It is clear that our approach significantly outperforms other methods on sparsely annotated labels in all three datasets. For example, in human, our method achieved 0.8491 micro-AUROC and 0.8648 macro-AUROC on BP labels with 3–10 annotations, which is much higher than 0.5815 (micro), 0.5857 (macro) for DCA and 0.7288 (micro), 0.8002 (macro) for GeneMANIA. It is worth noting that DCA performs consistently worse than GeneMANIA at this task, possibly due to the fact that GeneMANIA adaptively integrates the input networks for each functional label to optimize performance on training data. In yeast, clusDCA achieved 0.9025 micro-AUROC on BP labels with 3–10 annotations, which is again substantially higher than 0.6645 for DCA and 0.8504 for GeneMANIA. In mouse, clusDCA achieved 0.8627 micro-AUROC and 0.8802 macro-AUROC on BP labels with 3–10 annotations, which is again substantially higher than 0.5873 (micro), 0.5937 (macro) for DCA and 0.7609 (micro), 0.8245 (macro) for GeneMANIA. A similar improvement was observed for functional labels with 11–30 annotations and also for the MF labels in human, yeast and mouse (Fig. 3 and Supplementary Fig. S2). We found most of the improvements to be statistically significant (P < 0.05; paired Wilcoxon signed-rank test). The improvement was most pronounced in human overall, presumably because the human dataset is much sparser than the other two.

Fig. 3.

Comparison of our approach with other methods in terms of micro-AUROC. An asterisk indicates that the improvement of our approach over GeneMANIA is statistically significant. Performance is evaluated for different subsets of GO labels with varying sparsity levels as shown on the x-axis.

The above results suggest that the topological information in ontology graphs can be exploited to greatly improve function prediction performance for sparse labels. It remains to be shown whether clusDCA is better than other approaches to incorporating the ontology. To this end, we found that clusDCA substantially outperforms HC. For instance, in human, our method achieved 0.8984 micro-AUROC and 0.9135 macro-AUROC on MF labels with 3–10 annotations, which is much higher than 0.7435 (micro), 0.7580 (macro) for HC. The improvement was more pronounced where the number of GO labels was large (e.g. human BP). This is likely because the number of candidate predictions for HC grows exponentially with the number of GO labels. As a result, HC is highly prone to overfitting in a dataset with a large number of labels. Notably, HC also performed worse than GeneMANIA in most of our experiments.

In addition, we observed consistent improvements over GeneMANIA with respect to the AUPRC. In human, our method achieved 0.0429 macro-AUPRC on BP labels with 3–10 annotations, which is higher (better) than 0.0368 AUPRC for GeneMANIA. In yeast, clusDCA achieved 0.1360 AUPRC on MF labels with 3–10 annotations, which is substantially higher than 0.1075 macro-AUPRC for GeneMANIA. Similarly, in mouse, clusDCA achieved 0.0516 macro-AUPRC on BP labels with 3–10 annotations, which is substantially higher than 0.0389 macro-AUPRC for GeneMANIA.

Interestingly, we note that the improvement of our method is negatively correlated with the number of annotations of the GO labels. In other words, we observed a greater improvement of clusDCA over previous methods for sparser labels. This observation suggests that clusDCA indeed addresses the overfitting issue, which has a more significant impact on the sparsely annotated labels.

3.5 clusDCA achieves comparable performance for labels with a large number of annotations

In addition to sparsely annotated labels, our approach also achieved performance comparable to GeneMANIA and greatly outperformed HC and DCA on labels with a large number of annotations (i.e. 31–100 and 101–300) with respect to both AUROC and AUPRC. The difference between clusDCA and GeneMANIA is not statistically significant in this case, but clusDCA is still marginally better than GeneMANIA in most categories.

3.6 clusDCA accurately predicts genes for new GO labels

Given that the current GO database is likely incomplete, in the event that a new GO label is created, we hope to automatically find genes that this label may be associated with. Remarkably, our framework can be directly used to tag genes with a newly created GO label using only the topological information from the ontology graph and other annotated labels. When the new GO labels are added to the ontology graph, we first obtain the low-dimensional vectors of these labels with DCA. Then, given the low-dimensional vectors for both genes and functions, we can inversely project function vectors onto the gene vector space and predict associated genes for the new GO labels. This approach can also potentially help to refine and enhance the current GO annotation database, thus serving as a verification platform.

As a proof of concept, we repeatedly held out one-third of the GO labels as the validation set of ‘uncharacterized’ labels. We then used the remaining two-thirds of the GO labels to learn the projection model and to predict genes that are associated with the held-out labels. Figure 4 shows the result of this experiment in yeast. We observed that our framework achieves promising performance in all categories, with micro-AUROC ranging from 0.81 to 0.87.

It is worth noting that, to the best of our knowledge, no other existing method is able to predict associated genes for new GO labels without any existing annotations. Disease gene prioritization is a closely related task where the goal is to predict genes associated with a particular disease, but most algorithms proposed for this problem also require an initial set of associated genes to be able to make predictions.
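
The idea of scoring genes for a label that has no annotations can be sketched in a few lines. This section does not state the projection objective, so the sketch substitutes a deliberately simple stand-in: the projection matrix W is taken as the sum of outer products of gene and label vectors over known annotations, and a gene is scored for a new label by the bilinear form x^T W y. The function names, the 2-dimensional vectors and the annotations are all hypothetical and are not taken from clusDCA.

```python
def learn_projection(gene_vecs, label_vecs, annotations):
    """Illustrative stand-in (an assumption): W = sum of x_i y_j^T over known pairs (i, j)."""
    d_gene, d_label = len(gene_vecs[0]), len(label_vecs[0])
    W = [[0.0] * d_label for _ in range(d_gene)]
    for gene_idx, label_idx in annotations:
        x, y = gene_vecs[gene_idx], label_vecs[label_idx]
        for a in range(d_gene):
            for b in range(d_label):
                W[a][b] += x[a] * y[b]
    return W


def score_genes(gene_vecs, W, label_vec):
    """Project a label vector into gene space (W y), then score each gene by x . (W y)."""
    projected = [sum(w_ab * y_b for w_ab, y_b in zip(row, label_vec)) for row in W]
    return [sum(x_a * p_a for x_a, p_a in zip(x, projected)) for x in gene_vecs]


# Toy low-dimensional vectors (made up for illustration).
gene_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
label_vecs = [[1.0, 0.0], [0.0, 1.0]]   # two labels with known annotations
annotations = [(0, 0), (1, 0), (2, 1)]  # (gene index, label index) pairs

W = learn_projection(gene_vecs, label_vecs, annotations)

# A new, unannotated label whose vector lies close to label 0 in the ontology embedding.
new_label_vec = [0.9, 0.1]
new_scores = score_genes(gene_vecs, W, new_label_vec)
ranking = sorted(range(len(gene_vecs)), key=lambda i: -new_scores[i])
print(ranking)
```

Because the new label's vector is close to that of label 0, genes annotated with label 0 are ranked ahead of the gene annotated only with label 1, even though the new label itself contributed no annotations to W.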

Fig. 4. Micro ROC curve of predicting genes for new GO labels on MF in yeast

3.7 Choice of the dimensionality of low-dimensional representations

Here, we examined the impact of the number of dimensions used for clusDCA on the prediction performance. To this end, we calculated the micro-AUROC of BP label prediction in yeast with different numbers of dimensions (Supplementary Fig. S3 in Supplementary Data). We observed that our method is quite robust over a wide range of dimensions. Good performance is achieved with 1000 dimensions and above, with the notable exception of the group of labels with 3–10 annotations, which seems to improve further at 2000 dimensions. Interestingly, we found that labels with more annotations are less affected by this parameter. In our previous work (Cho ), DCA achieved good performance with only 500 dimensions. We think this is because the GO graph is sparser than the yeast interactome and thus exhibits weaker local clustering; random walks are therefore more localized on the GO graph, which requires more dimensions to be accurately modeled.
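
To make the notion of a localized random walk concrete, the sketch below computes the stationary distribution of a random walk with restart from one seed node on a small path graph. It assumes the diffusion takes this standard form; the restart probability of 0.5, the function name and the toy graph are illustrative choices, not values taken from this work.

```python
def random_walk_with_restart(adj, seed, restart=0.5, iters=100):
    """Power iteration for s = (1 - restart) * s P + restart * e_seed (row-stochastic P)."""
    n = len(adj)
    degrees = [sum(row) for row in adj]
    if any(d == 0 for d in degrees):
        raise ValueError("every node needs at least one neighbour")
    # Transition probabilities: move to a uniformly chosen neighbour.
    P = [[adj[i][j] / degrees[i] for j in range(n)] for i in range(n)]
    state = [1.0 if i == seed else 0.0 for i in range(n)]
    for _ in range(iters):
        walked = [sum(state[i] * P[i][j] for i in range(n)) for j in range(n)]
        state = [(1 - restart) * w + (restart if j == seed else 0.0)
                 for j, w in enumerate(walked)]
    return state


# Toy path graph 0 - 1 - 2 - 3 (made up); a sparse graph with no triangles.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]

diffusion = random_walk_with_restart(adj, seed=0)
print([round(p, 4) for p in diffusion])
```

On this sparse graph most of the probability mass stays at or next to the seed, which is the kind of localization the paragraph above refers to.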

4 Conclusion

We introduced a novel algorithm, clusDCA, for gene function prediction. The main idea of clusDCA is to leverage similarity between functional labels, in addition to similarity between genes, to prevent overfitting on sparsely annotated GO labels. We achieve this goal by learning low-dimensional vector representations of genes and functions and matching gene vectors to function vectors via a projection that best preserves known gene-function associations. Similar labels are co-localized in the vector space, which allows the transfer of information between neighboring functions when genes are assigned to them. Since learning the projection solves the prediction of all labels simultaneously, our method has the added benefit of being scalable to datasets with a large number of annotations. Although based on DCA, clusDCA is a substantial advance, as evidenced by its superior performance over DCA, as well as GeneMANIA and HC, on sparsely annotated GO labels, while maintaining comparable performance on labels with many genes. Moreover, we demonstrated clusDCA’s ability to identify putatively associated genes for newly created GO labels without any annotations, which suggests that our method can be used to improve poorly annotated labels and thus takes a significant step toward a more comprehensive understanding of gene or protein function in various organisms.

Funding

This research was partially supported by grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Conflict of Interest: none declared.
