Sheng Wang1, Emily R Flynn2, Russ B Altman1,2,3. 1. Department of Bioengineering, Stanford University, Stanford, CA 94305, USA. 2. Biomedical Informatics Training Program, Stanford University, Stanford, CA 94305, USA. 3. Department of Genetics, Stanford University, Stanford, CA 94305, USA.
Abstract
Gene sets, including protein complexes and signaling pathways, have proliferated greatly, in large part as a result of high-throughput biological data. Leveraging gene sets to gain insight into biological discovery requires computational methods for converting them into a useful form for available machine learning models. Here, we study the problem of embedding gene sets as compact features that are compatible with available machine learning codes. We present Set2Gaussian, a novel network-based gene set embedding approach, which represents each gene set as a multivariate Gaussian distribution rather than a single point in the low-dimensional space, according to the proximity of these genes in a protein-protein interaction network. We demonstrate that Set2Gaussian improves gene set member identification, accurately stratifies tumors, and finds concise gene sets for gene set enrichment analysis. We further show how Set2Gaussian allows us to identify a previously unknown clinical prognostic and predictive subnetwork around NEFM in sarcoma, which we validate in independent cohorts.
Gene sets, including protein complexes and signaling pathways, have proliferated greatly, in large part as a result of high-throughput biological data. Leveraging gene sets to gain insight into biological discovery requires computational methods for converting them into a useful form for available machine learning models. Here, we study the problem of embedding gene sets as compact features that are compatible with available machine learning codes. We present Set2Gaussian, a novel network-based gene set embedding approach, which represents each gene set as a multivariate Gaussian distribution rather than a single point in the low-dimensional space, according to the proximity of these genes in a protein-protein interaction network. We demonstrate that Set2Gaussian improves gene set member identification, accurately stratifies tumors, and finds concise gene sets for gene set enrichment analysis. We further show how Set2Gaussian allows us to identify a previously unknown clinical prognostic and predictive subnetwork around NEFM in sarcoma, which we validate in independent cohorts.
One of the most basic outcomes of biological data analysis is the discovery of gene sets. Such sets come from many sources[1-4]: Genome-wide association studies (GWAS) produce sets of genes associated with a disease or other phenotype. Gene expression analyses identify gene sets by examining differential expression between conditions, or clustering genes by expression similarity. Proteomics and metabolomics data also produce lists of proteins and metabolites. Biological network analysis associates genes, proteins, or metabolites that interact with one another into network neighborhoods. In all of these cases, the hypothesis is that genes in the set are likely involved in the same biological process or function. Because of their ability to boost signal-to-noise and increase explanatory power, gene sets have been used in various downstream analyses, including disease signature identification[5,6], drug pathway association prediction[7], survival analysis[8], and drug response prediction[9]. Unfortunately, a bottleneck for gene set-based analysis is a lack of computational tools that convert these gene sets into informative forms which can distinguish completely different gene sets deemed as the same by averaging embedding (Figure 1a). In particular, existing machine learning codes require input features in the form of fixed length vectors, and prefer these vectors to be compact and high-quality in order to avoid overfitting.
Figure 1.
Overview of Set2Gaussian. (a). 2-D t-SNE plot showing two very different gene sets embedded in the same point (0,0) by simply averaging embeddings of individual genes. (b). Pie chart showing the percent of 150 drug response correlated gene sets that are significantly enriched with different numbers of Gene Ontology functions (P-value < 0.05 after Bonferroni correction). (c). Flowchart showing Set2Gaussian embedding process and downstream applications. First, RWR is used to calculate the diffusion states of each gene and gene set. These diffusion states are then embedded into a low-dimensional space where genes are represented as single points and gene sets are represented as Gaussian distributions. These representations are then applied to a variety of gene set-based analyses.
Molecular interaction networks provide novel insights into the functional interdependencies of genes and proteins[10]. In particular, high-throughput experimental techniques, such as yeast two-hybrid screens and genetic interaction assays, have enabled researchers to piece together large-scale interaction networks in bulk[11]. Consequently, network-based approaches, including network propagation[12-18], network clustering[19], network integration[20,21], and network regularization[22], have been developed to efficiently analyze these networks. Among them, network embedding has emerged as a powerful network analysis approach because it generates a highly informative and compact vector representation for each node in the network[21,23]. Molecular interaction networks are noisy and incomplete, especially as they increase in size[21,23]. Network embedding typically leverages dimensionality reduction techniques to regularize high-dimensional network data. Prior to the advent of network embedding approaches, researchers manually identified network features for machine learning, which was time-consuming and often required expert knowledge. By contrast, network embedding automates this process by embedding each node in the network as a vector. These node embeddings have shown good performance in machine learning classifiers, and have become the building blocks in a large number of systems biology applications[21,23]. In this paper, we examine networks with genes as nodes; however, it is important to note that these methods can be applied to networks with any type of node, such as networks with disease nodes[10] or drug nodes[24].Gene embeddings represent each gene as a fixed length vector in a common low-dimensional continuous vector space. The dimension of these gene embeddings is typically much smaller than the original data dimensions (e.g., the number of nodes in the network). While many useful gene embeddings are now available[21,23], learning representations for gene sets remains challenging. One simple approach to represent a gene set is to averaging individual gene representations that comprise the gene set. In natural language processing, such averaging of word vectors has been used to construct sentence embeddings[25]. In computer vision, deep learning architectures rely on pooling which aggregates nearby pixels by taking the average, max, or weighted average of their values[26]. In network modeling, graph neural network models also use pooling to aggregate information from the neighborhood to effectively propagate information throughout the network[27]. However, although simple and intuitive, averaging and other pooling operations may not be sufficiently expressive for representing gene sets and many underfit the data. By contrast with sentences that only have a few words or images where we only average a few nearby pixels, gene sets can be arbitrarily large. In part due to this size difference, simple aggregation methods are not expressive enough to represent such gene sets. Figure 1a provides an example where the average embedding approach is unable to distinguish two completely different gene sets. A second intuitive approach to represent a gene set is to add new nodes representing the entire gene sets into the existing molecular network and connect these “gene set nodes” to their gene members. One can then run node embedding on this heterogeneous network to obtain the representation of each gene set. However, adding these likely high-degree nodes to the network can substantially change the topological structure and connectivity of the network, leading to inaccuracy in the embedding space. Other methods aim at embedding densely connected subnetworks: ComE performs community embedding and community detection simultaneously using a community-aware high-order proximity[28]; PathEmb models pathways as documents and then applies document embedding models to calculate pathway similarity[29]. However, both ComE and PathEmb require gene sets to be fully connected in the network, which is not the case for most of the biologically meaningful gene sets.Fundamentally, all of these approaches assume that genes in the same set tend to have similar properties. This is intuitively the case for protein complexes or biological processes; however, this is not true for a large number of other biologically meaningful gene sets. To examine the diversity of functions in gene sets, we calculated the Gene Ontology (GO) enrichment of 150 drug response-related gene sets derived from two pharmacogenomics datasets (Figure 1b)[30]. We found that 86% of these gene sets were significantly enriched with more than one GO function (P-value <0.05 after Bonferroni correction). This indicates that genes in the same set frequently had different functions and were involved in multiple biological processes; however, this diversity is ignored in average embedding and other simple aggregation methods. Recently, Gaussian embedding, which represents each node as a multivariate Gaussian distribution in the low-dimensional space, has been proposed to model the uncertainty of nodes[31-33]. Motivated by prior work on Gaussian embedding of nodes, we propose to represent each gene set as a multivariate Gaussian distribution according to its network topology. This allows us to model the diversity by the uncertainty of each dimension in this vector. To our knowledge, our method is the first approach that learns compact representations for gene sets.Thus, we present Set2Gaussian, a novel computational method that summarizes genes in the same set closely as a multivariate Gaussian distribution in the low-dimensional space (Figure 1c). Set2Gaussian takes biological networks and a collection of gene sets as input. Each gene is represented as a single point and each gene set as a multivariate Gaussian distribution parameterized by a low-dimensional mean vector and a low-dimensional covariance matrix. The mean vector of each gene set describes the joint contribution of genes in this gene set, and the covariance matrix characterizes the agreement among individual genes in each dimension. Dimensions that have small variance across different genes in the set should have higher weights when they are used to calculate the similarity between two gene sets. In contrast to using a diagonal matrix to represent the covariance matrix in previous work[31], we propose to use a low-rank matrix in order to avoid underfitting. Set2Gaussian is able to differentiate between gene sets that would be considered equivalent by conventional approaches, such as mean pooling. We demonstrate that Set2Gaussian significantly improves gene set member identification in three large-scale gene set collections. We further show how the embeddings generated by Set2Gaussian allows us to identify a previously unknown clinical prognostic and predictive subnetwork around NEFM in sarcoma, providing insight into the treatment of sarcoma. Finally, we use Set2Gaussian to select concise previously defined gene sets that substantially enhance and accelerate gene set enrichment analysis.
Results
Overview of Set2Gaussian
A key observation behind our approach is that gene sets can have diverse molecular functions and/or biological processes. Set2Gaussian explicitly models this diversity as a low-dimensional Gaussian distribution which summarizes both location and uncertainty of each dimension, improving upon the expressive power of existing node embedding models. Set2Gaussian takes a network and a collection of gene sets as input (Figure 1c). It then finds a low-dimensional space in which genes and gene sets preserve their distances in the network. Each gene is represented as a single point in the low-dimensional space. Each gene set is represented as a multivariate Gaussian distribution which is parameterized by a mean vector and covariance matrix. In our work, we approximated the covariance matrix through low-rank matrix factorization in order to avoid underfitting, as we demonstrated in our experiments.
Set2Gaussian improves gene set member identification
We first sought to examine whether Set2Gaussian could accurately identify gene set members. Genes in the same gene set are analyzed together to study the underlying biological processes. However, gene sets, especially those generated from high-throughput experiments, may contain a substantial number of false positives and false negatives due to the robustness and quality of the assay. Therefore, reducing the noise by identifying gene set members is an important task for large-scale gene set analysis. Because labeling manually is expensive and tedious, many computational approaches have been proposed to identify gene set members, each utilizing different information, including domain signatures[34], gene expression[35], and sequence data[36].Set2Gaussian significantly outperformed the proposed baseline gene set representation approaches on identifying gene set members in all three datasets at all size categories (P-value<0.05; Wilcoxon signed-rank test) (Figure 2a, b, c). We first observed clear improvements over three pooling-based approaches in all three datasets. For example, In Reactome, at the medium set [11-30], Set2Gaussian obtained an AUPRC of 0.48, outperforming mean (0.42), weighted mean (0.40), and max (0.22). The above results suggest that representing a gene set through mean or max pooling is unable to model the uncertainty and capture the diverse functions within a gene set, leading to worse performance. We further noticed increasing improvement with larger gene set size. For example, Set2Gaussian obtained 36% improvement on large sets [31-1000] versus 21% improvement on small sets [3-10] in comparison to mean on MSigDB. Larger gene sets may be noisier and contain more diverse functions. Overall, this comparison indicated that using a single vector to represent a gene set, as in these pooling approaches, is not expressive enough to capture gene set diversity. By adopting a covariance matrix to model uncertainty, Set2Gaussian is able to model this diversity and thus substantially improves gene set identification performance.
Figure 2.
Application of Set2Gaussian to gene set member identification. Comparison of Set2Gaussian with four other approaches in identifying gene set members in NCI (a), Reactome (b), and MSigDB (c). Gene sets are grouped into three categories according to the number of genes in the gene set (small [3–10], medium [11–30], large [31–1000]). (d). Comparison of Set2Gaussian with Set2Gaussian-diag on NCI for medium gene sets ([11–30]). Error bars represent the standard error on the mean.
Additionally, Set2Gaussian outperformed hypergraph embedding on gene set identification across all gene set sizes in all three datasets. For example, in MSigDB, Set2Gaussian obtained 0.13 AUPRC for small sets [3-10], which was significantly better than hypergraph embedding (AUPRC = 0.032). Interestingly, we found that hypergraph embedding was worse than mean and weighted mean pooling approaches in all three datasets. Network embedding approaches rely on finding similar contexts (e.g., similar neighbors or similar diffusion states) to accurately embed nodes. However, in a hypergraph, a “gene set node” could have a noisy neighborhood structure due to a large number of neighbors, which may make it difficult to find enough nodes with similar contexts to support an accurate embedding. Constructing a hypergraph may also introduce too many high degree nodes, substantially changing the topological structure of the network.A key feature of Set2Gaussian is that it leverages a low-rank matrix instead of a diagonal matrix to parametrize the covariance in the Gaussian distribution of each gene set. A low-rank covariance matrix is more expressive but could also be more prone to overfitting. To examine whether it is necessary to increase model complexity given the risk of overfitting, we compared Set2Gaussian with Set2Gaussian-diag, which forced the covariance matrix to be a diagonal matrix. We found that Set2Gaussian had improved performance over Set2Gaussian-diag (AUPRC of 0.24 vs. 0.20 respectively) for medium sets [11-30] in Reactome (Figure 2d). Moreover, the improvement of Set2Gaussian against Set2Gaussian-diag was larger on MSigDB than the other two datasets (Supplementary Figure 1). We postulated that because MSigDB had the largest number of gene sets, a diagonal covariance matrix does not accurately project these gene sets into the same low-dimensional space and thus underfits the data. Notably, although worse than Set2Gaussian, Set2Gaussian-diag still performed better than other gene set representation approaches including Mean. This demonstrates the importance of modeling the uncertainty of each dimension by using a Gaussian distribution.In addition to accurately recovering existing gene set collections, the top predictions made by Set2Gaussian also revealed novel gene set members that could supplement existing pathway databases. For example, the NLRP1 inflammasome pathway in Reactome is known to have three genes, BCL2, BCL2L1, and NLRP1. All these three genes were included in the top ten predictions of Set2Gaussian. In addition, Set2Gaussian predicted that BAX, CASP1, MCL1, BCL2A1, CASP9, CASP3, and NLRP1 were within the multidimensional Gaussian distribution described by the original set and thus might be the members of this pathway. These predictions were supported by the literature; NLRP1 activates a downstream protease Caspase‐1 (CASP1)[37]. BAX, BCL2A1, and MCL1 belong to the BLC2 family, which inhibits NLRP1 induced CASP1 activation in a concentration-dependent manner[38].
Set2Gaussian identifies novel subtypes and subnetworks in Sarcoma
We next applied Set2Gaussian to tumor stratification and subnetwork identification in Sarcoma. Tumor stratification aims at dividing a heterogeneous population of tumors into clinically and biologically meaningful subtypes. We modeled each tumor as a gene set denoted by the tumor’s set of mutations and then clustered these gene sets based on the gene set representations derived from Set2Gaussian. We found that Set2Gaussian’s subtypes had significantly different survival in Sarcoma across groups; while subtypes from the other three approaches did not (Figure 3a). For example, at k=2 subtypes, the survival times were significantly different between subtypes (P-value <0.03, log-rank test); and five-year survival rates varied greatly (41% and 71% respectively) (Figure 3b). We attributed Set2Gaussian’s superior performance against comparison approaches to its projection of mutation profiles as low-dimensional Gaussian distributions, thus obtaining more accurate similarity between tumors.
Figure 3.
Application of Set2Gaussian to cancer subtyping. (a). Comparison of Set2Gaussian with other approaches in tumor stratification on 235 sarcoma tumors at varying numbers of subtypes. (b). KM-plot showing the clustering of tumors into 2 subtypes using all the genes. (c). NEFM-subnetwork identified by Set2Gaussian in sarcoma. (d). KM-plot showing the clustering of tumors into 2 subtypes by only using the NEFM-subnetwork. (e). Scatter plot showing the comparison between the expression values of FREM2 and the drug responses to Paclitaxel on 16 soft-tissue cell lines collected from CTRP. (f). Box plot showing the Paclitaxel sensitivity of two groups of cell lines clustered by using the expression of NEFM-subnetwork.
By finding the top differentially mutated genes between two subtypes that have the highest and lowest survival rates, we further identified a connected subnetwork of five genes in protein-protein interaction networks from the STRING database, including NEFM, FREM2, KIF1A, NRXN1, and DSCAM (Figure 3c). We named this subnetwork NEFM-subnetwork since it was central around NEFM. The subtype with a larger NEFM-subnetwork mutation rate tends to have worse survival in comparison to other subtypes. Since this subnetwork seemed to be associated with Sarcoma survival, we divided the 235 Sarcoma tumors into two subtypes based on whether at least one of NEFM-subnetwork genes was mutated in a given tumor. We found that dividing tumors using only these five genes obtained subtypes that had significantly different overall survival (P-value < 0.03, log-rank test) (Figure 3d), and subtypes had very different five-year survival rates (49% for NEFM-subnetwork mutated vs. 87% for NEFM-subnetwork not mutated).After demonstrating that the NEFM-subnetwork stratifies sarcoma tumors by survival, we investigated if this subnetwork could be used as a biomarker for chemosensitivity prediction. To this end, we first collected gene expression profiles of 16 soft tissue cell lines from GDSC[39]. Among genes within the NEFM subnetwork, we found that FREM2 expression was significantly correlated with Paclitaxel sensitivity (Figure 3e). Paclitaxel is a microtubule-stabilizing drug that has been used to treat Kaposi’s sarcoma, lung, ovarian, and breast cancer[40]. We clustered these 16 cell lines into two groups according to the gene expression values of the NEFM-subnetwork. We observed that these two clusters had significantly different Paclitaxel sensitivities (indicated by the area under the dose-response curve (P-value < 0.0019, t-test)) (Figure 3f). Notably, clustering on any of these five genes alone did not yield clusters with significantly different Paclitaxel sensitivities. When using this subnetwork to cluster a larger collection of 990 cell lines across 25 tissue lineages, we also observed significantly different Paclitaxel sensitivities (P-value < 0.004). Additionally, gene expression of Neurofilament Medium (NEFM) itself was significantly overexpressed in the Paclitaxel sensitive group relative to the Paclitaxel resistant group (log2 fold change 7.93 vs 4.00 respectively, P-value < 3.12e-8, rank-sum test). Neurofilaments are known to modulate microtubule stability[41], and higher neurofilament levels have been observed to destabilize microtubule network[42], thus may increase sensitivity to Paclitaxel. The significant differences in overall survival and chemosensitivity indicate that this previously unrecognized NEFM-subnetwork may be a biomarker for sarcoma treatment.
Set2Gaussian finds concise previously defined gene sets for GSEA
With the rapid development of sequencing technology, high-throughput experiments have been used to generate large numbers of gene sets or pathways. Given a newly generated gene set, the immediate question is its function. Gene set enrichment analysis (GSEA) is a method for finding gene sets that are significantly different between two biological states[43]. Additionally, GSEA is extensively used to find significantly enriched gene sets for a novel gene set, from a large collection of previously defined gene sets[44] which are obtained from gene set databases such as MSigDB[3]. While GSEA provides a list of possible enriched gene sets, it does not resolve or de-duplicate these gene sets. However, gene set databases are continually growing, inevitably leading to a substantial amount of redundant information in previously defined gene sets. The redundancy not only slows down the enrichment analysis but also masks the true signal. Hence, there would be great potential value to decrease the number of previously defined gene sets and keep only the most informative and representative sets. Notably, data-reduction techniques have been applied to other computational biology tasks and obtain promising results[45,46]. Thus, we investigated whether representations generated by Set2Gaussian can be used to effectively downsize existing previously defined gene sets. To filter existing gene sets, we first used Set2Gaussian to project all gene sets of NCI, MSigDB, and Reactome into the same low-dimensional space and then selected the ones that are close to a large number of genes (see Methods).Our results are summarized in Figure 4 a,b,c. We found that using previously defined gene sets created by Set2Gaussian achieved the best performance for GSEA. Specifically, we found significant enrichment for 72.1% of queried gene sets, which was significantly larger than 68.5% of all, 50.5% of Standard, 16.4% of Proportional, and 15.1% of Hitting. A detailed comparison of Set2Gaussian with baselines is shown in Figure 4b,c, which demonstrates the number of significantly enriched gene sets for each cell line. For all cell lines, Set2Gaussian enriched for more gene sets in comparison to Standard. And for 60 among 69 cell lines, Set2Gaussian enriched for more gene sets in comparison to All. We observed significant improvement across a large range of K from 100 to 1000 with a step size of 10 (all P-values < 0.05) (Figure 4d). At K > 700 gene sets, the improvement remained significant but started to decrease; we expected that this was because a K that was too large might include redundant gene sets in this collection. The highest percent improvement came from K=200 on 69 cell lines (P-value < 7e-10, Wilcoxon signed-rank test). In addition to increasing the number of enriched gene sets, Set2Gaussian-filtered previously defined gene sets also substantially enhanced the enrichment analysis computational speed in comparison to all previously defined gene sets (almost 65-fold). We provide Set2Gaussian-filtered previously defined gene sets in Supplementary Table 1, which covers a variety of biological processes and molecular functions, such as cytokine receptor interaction, cell cycle, and olfactory signaling pathway.
Figure 4.
Application of Set2Gaussian to finding concise gene sets. (a). Comparison of using five different previously defined gene sets in gene set enrichment analysis. (b). Comparison of using Set2Gaussian-filtered previously defined gene set to using all previously defined gene sets. Each point is a cell line, where x-axis (y-axis) shows how many gene sets of this cell line can find at least one significant enriched gene set from Set2Gaussian (all) previously defined gene sets. (c). Comparison of using Set2Gaussian-filtered previously defined gene set to using standard cover set-derived previously defined gene sets. (d). The number of significantly enriched gene sets at varying number of previously defined gene sets selected by Set2Gaussian. Error bars represent the standard error on the mean.
Discussion
Tens of thousands of genes sets are discovered from high-throughput experimental screening assays, providing novel insights into the underlying biological processes and molecular functions. Set2Gaussian is developed to facilitate the usage of these gene sets by creating high-quality and compact gene set features that can be used as input for machine learning codes. Our analysis of 11,892 gene sets indicates that Set2Gaussian is able to accurately recover gene set members, thus substantially reducing noise in experimentally derived gene sets. An important output of Set2Gaussian is the Set2Gaussian-derived feature representation of these 11,892 gene sets, which is made available by us to facilitate future gene set-based analysis.We further showed how Set2Gaussian can be applied to high-dimensional somatic mutation data. Set2Gaussian allows us to stratify tumors into biologically meaningful subtypes and obtain an unrecognized NEFM-subnetwork, whereas none of the compared methods is able to achieve. We validated the NEFM-subnetwork in an independent cohort and found that it could be used as the biomarker for Paclitaxel sensitivity. Finally, we used Set2Gaussian to enhance and accelerate the widely used GSEA, by reducing the redundancy in previously identified gene sets. Set2Gaussian reduces the number of previously identified gene sets from 11,892 to 200. These 200 most informative gene sets obtained better GSEA hits when we used them to analyze large-scale cell line perturbation experiments.One limitation of Set2Gaussian is that it requires an input network of genes to embed gene sets. In this work, we used the protein-protein interaction networks from the STRING database. Although it limits the applications of Set2Gaussian, the network provides additional information to the functional interdependencies between genes and has been extensively used to embed genes. In fact, biological networks have been massively generated these days from high-throughput experimental techniques[11] and scientific papers[47]. Even when there is no existing network available, one can still create a network based on the co-occurrence or mutual exclusivity in the gene set collection[8]. In addition, the quality of embeddings obtained by Set2Gaussian might be affected by noise in hub genes. Hub genes, which are genes that have a large number of neighbors in the network, could lead to biased diffusion states. Consequently, gene sets that have hub genes tend to have large variances and have less accurate low-dimensional representations.
Methods
Set2Gaussian framework
Problem definition. Let be the adjacency matrix of a given network , where is the number of genes. denotes the set of all genes. Let be gene sets defined on , where each set of genes . Set2Gaussian aims to find a low-dimensional multivariate Gaussian distribution for each gene set with mean and covariance matrix , where .
Random walk with restart from a gene set
In order to define the objective function, we first need to characterize the network topology that we want to preserve in the low-dimensional space. Here, we use the random walk with restart (RWR) to capture the network topology. RWR captures fine-grain topological properties that lie beyond direct neighbors[12,13]. When there are missing and spurious genes in a given gene set, RWR can correct the noise using network neighbors. RWR differs from the conventional random walk in that it introduces a predefined probability of restarting at the initial gene after every iteration.Formally, we first calculate a transition matrix , which represents the probability of a transition from gene to gene . B is defined as:To run RWR from gene , we define as an -dimensional distribution vector in which each entry contains the probability of gene being visited from gene after steps. RWR from gene with restart probability is defined as:
where is an -dimensional distribution vector with and . We can obtain the stationary distribution of RWR at the fixed point of this iteration. Consistent with the previous work[12,14,21,23], we define the diffusion state of each gene to be the stationary distribution of an RWR starting at each gene. Here, the restart probability controls the relative influence of global and local topological information in the diffusion, where a larger value places greater emphasis on the local structure.To run RWR from gene set , we define as an -dimensional distribution vector in which each entry contains the probability of a gene being visited from gene set after steps. RWR from gene set with restart probability is defined as:
where is an -dimensional distribution vector with and . We can obtain the stationary distribution of RWR at the fixed point of this iteration. we define the diffusion state of each gene set to be the stationary distribution of an RWR starting at each gene in uniformly. When genes in the set are rank-ordered by importance, we can adjust according to the gene weights.Notably, a gene set could have missing or spurious genes. RWR can account for the noisy gene sets using network neighbors to characterize the network topology. The restart probability reflects our uncertainty of this gene set, where a smaller value encourages the gene set to extend its members with network neighbors. Set2Gaussian uses the diffusion state ) to represent the topological information of gene (gene set ) in the network. The th entry stores the probability that RWR starts at gene (gene set ) and ends up at gene in equilibrium.
Representing gene sets as multivariate Gaussian distributions
The diffusion states of each gene and each gene set are used to find the low-dimensional representation. Set2Gaussian embeds genes and gene sets in the same low-dimensional space, where each gene is represented as a single point and each gene set is represented as a multivariate Gaussian distribution parameterized by a mean vector and covariance matrix.Set2Gaussian optimizes two criteria to find the low-dimensional representation: 1) genes with similar diffusion states should be close to each other in the low-dimensional space, and 2) genes in a given gene set in the network should have higher probabilities in the Gaussian distribution of that gene set. The first criterion preserves the distance between genes and has been widely used in conventional node embedding approaches. The second criterion is unique to Set2Gaussian, and groups genes in the same set together as a multivariate Gaussian distribution. Through the use of the second criteria, Set2Gaussian explicitly leverages the prior knowledge that genes in the same set are likely to have similarities and thus should be closely located in the low-dimensional space. Formally, let and represent the loss function based on the above two criteria. The loss function can be defined as:To preserve the gene distance (criteria 1), we define as:where is the Kullback-Leibler (KL) divergence[48] and is defined as:Here, is the representation of gene in the low-dimensional space and is the context feature describing the network topology of gene . If and are close in direction and have a large inner product, then it is likely that is frequently visited in the random walk restarting from gene . We optimize over and for all genes, using KL divergence as the objective function. One key improvement of Set2Gaussian is that all gene sets are trained jointly and the representation of genes is shared across all gene sets.Similar to previous work, we relax the constraint that the entries in sum to one by dropping the normalization factor in the above equation[21,23]. As a result, can be simplified as:This simplification substantially reduces the computational complexity while still achieving comparable performance[21,23]. Since is no longer an -dimensional probability simplex, we use the sum of squared errors instead of KL divergence as the new objective function. Therefore, is defined as:Next, to preserve the distance between genes and gene sets, we define as:
where is defined as :
is the multivariate Gaussian probability density function and is the probability density of gene :Here, we can optimize over the mean vector and the covariance matrix to obtain the multivariate Gaussian distribution of gene set .Same as the simplification in , we also drop the normalization factor in the above equation. As a result, is simplified as:Notably, can also be viewed as the Mahalanobis distance of gene from the mean and covariance matrix . The Mahalanobis distance can account for different variances in each direction and reduces to Euclidean distance when is an identity matrix. While matrix factorization approaches, such as singular value decomposition (SVD), also calculate a diagonal matrix , Set2Gaussian improves on this by optimizing different for each gene set in order to model the uncertainty of each gene set differently.We then use the sum of squared errors as the objective function:Summing up two parts, the new loss function of our model is defined:While the first term preserves gene distance in the network, the second term forces genes in the same set to form a multivariate Gaussian distribution. Therefore, these biologically meaningful gene sets are used as prior knowledge by Set2Gaussian to infer the embedding of genes. By contrast, other methods, such as average embedding, are unable to leverage this prior knowledge.
Low-rank approximation of the covariance matrix
Set2Gaussian has the following parameters: , and . The parameters , and can be directly estimated with gradient descent. By contrast, since is the covariance matrix of a multivariate Gaussian distribution, we need to assure that it is positive semi-definite. To achieve this, let be the precision matrix of the multivariate Gaussian distribution for gene set :Instead of directly estimating the covariance matrix , we estimate the precision matrix to avoid numerical problems that arise in matrix inversion. We define to force to be positive semi-definite:Since a matrix multiplied by its transpose is positive semi-definite, is thus a positive semi-definite matrix. This further ensures that its inverse is also a positive semi-definite matrix. Since there is no constraint on , we can use gradient descent to estimate . However, directly optimizing over introduces a substantial memory complexity of , which counteracts a key benefit of using a low-dimensional representation. To address this problem, we propose to factorize by using a low-rank approximation:
where , and . In our experiment, we found that setting to 3 is expressive enough to obtain a good performance. This reduces Set2Gaussian’s parameters to , and . We estimate these parameters using Adam to find a local optimum[49].After finding the low-dimensional representation using Set2Gaussian, we can calculate the distance between genes and gene sets in this space. The distance between gene and gene set is calculated according to the probabilistic density function of the multivariate Gaussian distribution for gene set k:Using this formulation, the distance between a gene and a gene set depends not only on the mean vector (the location of this gene set) but also on the covariance matrix . To calculate the distance between gene set and gene set , we take the average asymmetric Kullback-Leibler divergence according to their Gaussian distributions:
where is calculated as:
Sources of gene sets and molecular networks
We considered three public gene set databases widely used in gene set and pathway analyses, The Reactome Knowledgebase (Reactome)[4], National Cancer Institute Pathway Interaction Database (NCI)[1] and The Molecular Signatures Database (MSigDB)[3]. Reactome is a manually curated pathway collection composed of classical intermediary metabolism, signaling transduction, transport, DNA replication, and other key cellular processes. The pathways are classified into a pathway hierarchy. We obtained the lowest level of Homo sapiens pathways from this pathway hierarchy, resulting in 1771 pathways. We obtained all 223 pathways from NCI, which is a database of human cellular signaling and regulatory pathways that might be relevant to cancer research and treatment. We ignored interactions in these pathways from Reactome and NCI and used them as gene sets in our analysis. MSigDB contains annotated gene sets from eight sources, including oncogenic signatures, immunologic signatures, computationally inferred gene sets, curated gene sets, positional gene sets, gene ontology gene sets, motif gene sets, and hallmark gene sets. All sources except gene ontology gene sets were included in our analysis, resulting in 11892 gene sets.Protein-protein interaction networks from the STRING database[11] were used for the PPI network structure. STRING uses a Bayesian integration algorithm to integrate many different types of evidence for protein-protein interactions, including literature curation, computationally predicted interactions, interactions transferred from model organisms by orthology, interactions computed from genomic features such as gene-gene fusion events, and interactions based on functional or co-expression similarity. We downloaded the STRING v10, which has 15,108 genes and 3,621,168 edges. Each edge is associated with a confidence score between 0 and 1 assigned by the Bayesian integration algorithm.
Source of tumor somatic mutation, drug sensitivity, and differentially expressed gene sets
The Genomics of Drug Sensitivity in Cancer (GDSC) compound screening dataset spanned 990 humancancer cell lines[39]. We downloaded the compound response data, and gene expression profiles, and corresponding tissue of origin of the 990 cancer cell lines. To examine the identified sarcoma subnetwork, we used all 16 soft tissue cell lines that were screened under the exposure to Paclitaxel.We downloaded somatic mutation profiles of TCGA sarcoma tumors from the GDAC Firehose website (http://gdac.broadinstitute.org, February 11th 2016)[50]. In each tumor, a gene was classified as either wild type (0) or altered (1), where altered is any non-silent mutation. We excluded tumors with less than 10 mutations, leaving a total of 237 tumors with 10,618 mutations in 15,108 genes. The survival of patients with these tumors was downloaded from TCGA.The LINCS project consists of gene expression profiles of human cell lines exposed to perturbations[51]. These perturbations included treatment with more than 20,000 unique compounds and 69 cell lines. For each cell line, we first randomly selected 200 perturbation experiments from LINCS level 4 signature which normalized across samples based on two or more characteristics. We then obtained a gene set from each experiment by filtering for genes with an absolute level 4 signature scores greater than 2, resulting in 13,800 gene sets across 69 cell lines.
Gene set member identification
Intuitively, when projecting gene sets and genes into the same low-dimensional space, a gene set should be closer to its member genes than others in the low-dimensional space. As a result, we anticipate that a good gene set representation can accurately identify gene set members according to the distance between a gene set and a candidate gene in this space. We are not aware of any other methods for learning compact gene set representations; thus, for comparison, we proposed four competitive approaches. These approaches were derived according to the existing literature of aggregating set information in other research areas[26,27,52]. Pooling methods relied on a two-step optimization approach. In the first stage, these methods used an existing network embedding approach to obtain the embedding vector of genes in the network. In the second stage, these methods used the mean (Mean), max (Max), or weighted mean (Weighted mean) of the embedding vectors of gene set members to represent that gene set. Weighted mean used the random walk diffusion score as the weight for each gene set member. Because all of these three pooling-based approaches used a single vector to represent a gene set, they could not model the uncertainty in each dimension. Moreover, such two-stage optimization approaches might lead to suboptimal results. Hypergraph embedding (Hypergraph) jointly optimized the gene and gene set embedding in a newly constructed hypergraph. In this method, each gene set was added as a new node to the original network that was connected to its member genes. Network embedding was applied to this heterogeneous network to embed gene set nodes and gene nodes into the same low-dimensional space. Similar to these methods, our approach took protein-protein interaction networks as input and generated embedding representations for all gene sets and genes. We evaluated Set2Gaussian on three gene set collections (NCI, Reactome, and MSigDB), each was further grouped into three categories according to the number of genes in the gene set (small [3-10], medium [11-30], and large [31-1000]). We chose the cutoffs that enabled the numbers of gene sets in each category close to each other. We used the Area Under Precision-Recall Curve (AURPC) as the evaluation metric which was shown to be a robust metric on different sparsity levels [53].Mashup, a recently developed node embedding algorithm, was the underlying node embedding method for all comparison approaches[21]. Similar to Mashup’s approach, we used cosine similarity to calculate the proximity between a gene set and a gene in the low-dimensional space. We set the dimension of Mashup as 500 which achieved the best performance in a range of dimensions from 100 to 1000 in our experiments. For our method, we set to 3, to 300 and to 0.8. We observed that the performance of our method is stable for between 1 and 5, greater than 100, and greater than 0.5.
Tumor stratification and subnetwork identification
Computationally, tumor stratification is performed by assessing the similarity between molecular profiles such as somatic mutation profiles. However, stratifying tumors based on somatic mutation is challenging due to the extreme sparsity and heterogeneity in somatic mutation profiles[54]. To address this sparsity and heterogeneity, a large number of network-based approaches have been proposed to aggregate gene mutations into higher-level functions[15,22,55].Here, we asked whether Set2Gaussian could enhance tumor stratification accuracy and identify the underlying driver subnetworks. Formally, we modeled each tumor as a gene set denoted by the tumor’s set of mutations. We then used Set2Gaussian to project these gene sets into Gaussian distributions located in the same low-dimensional space. These low-dimensional Gaussian distributions were later used as features to divide tumors into subtypes. By forcing each tumor’s mutation set to form a Gaussian distribution, Set2Gaussian might be less biased by noise from networks and passenger mutations; and therefore, identify clinically meaningful subtypes. We compared Set2Gaussian with three comparison tumor stratification approaches. Network-based stratification (NBS) used protein-protein interaction networks to aggregate mutation signals within network regions[22]. Mutation Profile directly clustered tumors based on the somatic mutation profile without considering the network structure[56]. Mutation Load clustered tumors according to the number of mutations in each tumor[57]. We applied Set2Gaussian and the three comparison methods to stratify 235 Sarcoma tumors from TCGA. We used the Protein-protein interaction networks from the STRING database to identify the submodule for Sarcoma. Since the co-expression values between genes in the STRING network are measured based on another cohort, the same co-expression may not be observed on GDSC cell lines.For tumor stratification, we obtained the implementation of NBS from a recent paper[8]. To cluster tumors, we adopted the same clustering approach as described in Wang et al.[8] We first projected the representations of all tumors into a low dimensional space using truncated SVD. A tumor similarity matrix was then constructed by calculating the cosine similarity between the columns of truncated SVD. We adopted the k-means++ clustering algorithm to cluster tumors using the cosine tumor similarity matrix[58]. For K-means, the maximum number of iterations was set to 100 and the number of random starts was set to 200. Predefining the correct number of subtypes in a cancer cohort is difficult. Therefore, we compared a variety of subtypes from 2 to 6.
Identification of concise previously defined gene sets
Several computational approaches focused on either finding a subset of existing previously defined gene sets[59] or dynamically selecting gene sets according to the queried gene set[60,61]. The key insight behind these approaches is to find gene sets that cover a large number of genes and reduce redundancy within selected gene sets. Therefore, greedy algorithms were used to select gene sets sequentially, leading to heuristic and sensitive results. For example, Stoney et al. proposed three set cover approaches based on different objectives: removal of pathway redundancy, controlling pathway size, and coverage of the gene set[59]. Intuitively, these informative gene sets can be effectively derived from the Gaussian distributions of Set2Gaussian, because these Gaussian distributions inherently capture the redundancy, size, and coverage of gene sets.To generate such previously defined gene sets, we first used Set2Gaussian to project all gene sets of NCI, MSigDB, and Reactome into the same low-dimensional space. We then defined a gene set informative score for each gene set according to its distance to all existing genes. A gene set will have a higher informative score if it is close to a large number of genes. The intuition is that informative gene sets are surrounded by many genes and located in the dense region in the low-dimensional space. The top K gene sets with the highest informative scores then became the Set2Gaussian-filtered concise previously defined gene sets. We set K to 200 and observed a stable performance of K from 100 to 1000, as demonstrated by our experiments.To evaluate our approach, we ran GSEA for a given new gene set by comparing it with previously defined gene sets. We compared five different previously defined gene sets: Set2Gaussian-derived previously defined gene sets (Set2Gaussian, 200 gene sets), standard set cover-derived previously defined gene sets (Standard, 291 gene sets)[59], proportional set cover-derived previously defined gene sets (Proportional, 890 gene sets)[59], hitting set cover-derived previously defined gene sets (Hitting, 711 gene sets)[59], and all previously defined gene sets by combining sets from NCI, Reactome and MSigDB (All, 13868 gene sets). previously defined gene sets of Standard set cover, proportional set cover, and hitting set cover are obtained from Stoney et al.[59] which selected gene sets based on different objectives: removal of pathway redundancy, controlling pathway size and coverage of the gene set. Although a larger number of previously defined gene sets leads to a better chance to find a significant enrichment, it is also more likely to commit a Type I error. Therefore, good previously defined gene sets should be able to find significant enrichment for more queried gene sets after correcting for Type I error using Bonferroni correction. We generated queried gene sets by mining large-scale gene expression data. In particular, we collected 69 cell lines from a large gene expression compendium LINCS. For each cell line, we used differential expression analysis to identify 200 differentially expressed gene sets grouped by drug exposure across 41,729 small molecules. We then ran GSEA on these 13800 gene sets by comparing to the above five previously defined gene sets using Fisher’s exact test.
Authors: Micheal Hewett; Diane E Oliver; Daniel L Rubin; Katrina L Easton; Joshua M Stuart; Russ B Altman; Teri E Klein Journal: Nucleic Acids Res Date: 2002-01-01 Impact factor: 16.971
Authors: Bo Wang; Aziz M Mezlini; Feyyaz Demir; Marc Fiume; Zhuowen Tu; Michael Brudno; Benjamin Haibe-Kains; Anna Goldenberg Journal: Nat Methods Date: 2014-01-26 Impact factor: 28.547
Authors: Michael S Lawrence; Petar Stojanov; Paz Polak; Gregory V Kryukov; Kristian Cibulskis; Andrey Sivachenko; Scott L Carter; Chip Stewart; Craig H Mermel; Steven A Roberts; Adam Kiezun; Peter S Hammerman; Aaron McKenna; Yotam Drier; Lihua Zou; Alex H Ramos; Trevor J Pugh; Nicolas Stransky; Elena Helman; Jaegil Kim; Carrie Sougnez; Lauren Ambrogio; Elizabeth Nickerson; Erica Shefler; Maria L Cortés; Daniel Auclair; Gordon Saksena; Douglas Voet; Michael Noble; Daniel DiCara; Pei Lin; Lee Lichtenstein; David I Heiman; Timothy Fennell; Marcin Imielinski; Bryan Hernandez; Eran Hodis; Sylvan Baca; Austin M Dulak; Jens Lohr; Dan-Avi Landau; Catherine J Wu; Jorge Melendez-Zajgla; Alfredo Hidalgo-Miranda; Amnon Koren; Steven A McCarroll; Jaume Mora; Brian Crompton; Robert Onofrio; Melissa Parkin; Wendy Winckler; Kristin Ardlie; Stacey B Gabriel; Charles W M Roberts; Jaclyn A Biegel; Kimberly Stegmaier; Adam J Bass; Levi A Garraway; Matthew Meyerson; Todd R Golub; Dmitry A Gordenin; Shamil Sunyaev; Eric S Lander; Gad Getz Journal: Nature Date: 2013-06-16 Impact factor: 49.962
Authors: Preeti Yadav; Bhuvaneish T Selvaraj; Florian L P Bender; Marcus Behringer; Mehri Moradi; Rajeeve Sivadasan; Benjamin Dombert; Robert Blum; Esther Asan; Markus Sauer; Jean-Pierre Julien; Michael Sendtner Journal: Acta Neuropathol Date: 2016-03-28 Impact factor: 17.088
Authors: Rayees Rahman; Nicole Zatorski; Jens Hansen; Yuguang Xiong; J G Coen van Hasselt; Eric A Sobie; Marc R Birtwistle; Evren U Azeloglu; Ravi Iyengar; Avner Schlessinger Journal: Proc Natl Acad Sci U S A Date: 2021-05-11 Impact factor: 11.205