Literature DB >> 27597310

NEAT: an efficient network enrichment analysis test.

Mirko Signorelli1,2, Veronica Vinciotti3, Ernst C Wit4.   

Abstract

BACKGROUND: Network enrichment analysis is a powerful method, which allows to integrate gene enrichment analysis with the information on relationships between genes that is provided by gene networks. Existing tests for network enrichment analysis deal only with undirected networks, they can be computationally slow and are based on normality assumptions.
RESULTS: We propose NEAT, a test for network enrichment analysis. The test is based on the hypergeometric distribution, which naturally arises as the null distribution in this context. NEAT can be applied not only to undirected, but to directed and partially directed networks as well. Our simulations indicate that NEAT is considerably faster than alternative resampling-based methods, and that its capacity to detect enrichments is at least as good as the one of alternative tests. We discuss applications of NEAT to network analyses in yeast by testing for enrichment of the Environmental Stress Response target gene set with GO Slim and KEGG functional gene sets, and also by inspecting associations between functional sets themselves.
CONCLUSIONS: NEAT is a flexible and efficient test for network enrichment analysis that aims to overcome some limitations of existing resampling-based tests. The method is implemented in the R package neat, which can be freely downloaded from CRAN ( https://cran.r-project.org/package=neat ).

Entities:  

Keywords:  Enrichment analysis; Gene expression; Hypergeometric; Network

Mesh:

Year:  2016        PMID: 27597310      PMCID: PMC5011912          DOI: 10.1186/s12859-016-1203-6

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

The advent of high throughput technologies has driven the development of cell biology over the last decades. The diffusion of microarrays and next generation sequencing techniques has made available a large amount of data that can be used to increase our understanding of gene expression. The need to analyse and interpret these data has led to the development of new methods to infer relationships between genes, which require a combination of biological knowledge, statistical modelling and computational techniques. When the first data on gene expression became available, they were usually analysed considering each gene separately. However, researchers soon realized that genes act in a concerted manner, and that cellular processes are the result of complex interactions between different genes and molecules. Nowadays, sets of genes that are responsible for many cellular functions have been identified, and are collected in publicly available databases [1, 2]. One of the advantages of these sets of genes, whose function is already known, is that they can be used to interpret the results of new experiments: this has led to the implementation of a large number of methods for gene enrichment analysis [3]. Their aim is to compare gene expression levels under two different conditions (experimental vs control), and to detect which sets of genes are differentially expressed (enriched) in the experimental condition. To this end, genes are ordered in a list L in decreasing order of differential expression, and enrichment is then tested in different ways. Singular enrichment analysis [4, 5] tests the over or under-representation of functional gene sets within the set of genes defined by the first k top genes in L. The major limitations of this approach lie in the fact that the choice of k is arbitrary, and that the test does not take into account gene expression levels. Gene set enrichment analysis [6, 7] overcomes these limitations, by making use of the whole list L of genes, and testing the tendency of genes belonging to a functional set to occupy positions at the top (or at the bottom) of L. A limitation that is common to both single and gene set enrichment analysis, however, is that these methods base computations on the level of overlap between sets of genes only, without considering associations and interactions between genes. Gene networks are an established tool to represent these interactions. In network inference [8, 9], genes or molecules are represented as nodes of a graph and their interactions are modelled as links between the nodes. These links can be represented as either a directed or an undirected edge, and a graph is called directed if all edges are directed, undirected if every edge is undirected and partially directed (or mixed) otherwise [10]. An undirected edge displays association between two genes, while a directed edge posits a direction in the relationship between them. Network estimation represents a difficult task, and many different estimation methods have been proposed [11, 12]. Marback et al. [13] classified them into six groups and pointed out that their predictive performance can vary a lot within each group and according to the structure of the network. In order to integrate evidence on gene associations unveiled by a number of experimental and computational studies into a single network, curated gene networks for different species have been proposed, including YeastNet [14] and FunCoup [15]. In an attempt to integrate the information on interactions between genes provided by gene networks into enrichment analyses, researchers have recently developed methods for network enrichment analysis [16-19]. The idea, here, is to test enrichment between sets of genes in a network. Shojaie and Michaidilis [16] focus mainly on network inference, proposing to represent the gene network with a linear mixed model, so that enrichment tests can be then computed by testing a system of linear hypotheses on the fixed effect parameters of the model. Glaab et al. [17], Alexeyenko et al. [18] and McCormack et al. [19], instead, assume that a gene network is already available (either from the literature or as the result of a tailored inferential process) and focus their attention on the strategy that can be used to assess enrichment between sets of nodes. In particular, Glaab et al. [17] propose a network enrichment score based on a suitably defined network distance between two sets of nodes, alongside an empirical method for setting a cut-off on this distance. In contrast to this, Alexeyenko et al. [18] and McCormack et al. [19] derive network enrichment scores on the basis of statistical tests against the null distribution of no enrichment. The advantage of the approach proposed by Alexeyenko et al. and McCormack et al. is that the assessment of enrichment is based on a significance testing procedure. The idea of [18, 19] is that the presence of enrichment between two sets of genes, say A and B, can be assessed by comparing the number of links connecting nodes in A and B with a reference distribution, which models the number of links between the same two sets in the absence of enrichment. Both [18] and [19] assume that the reference distribution is approximately normal, and they obtain its mean and variance by means of permutations, i.e., computing the mean and variance of the number of links between A and B in a sequence of random replications of the network. Their tests rely on algorithms that permute the network, and mainly differ between themselves for the fact that each algorithm aims to preserve different topological properties of the original network in the generation of network replicates. These methods, however, suffer from three limitations. First of all, they require the simulation of a large number of permuted networks, an activity that can be computationally intensive and highly time consuming (especially for big networks). Furthermore, they base the computation of the test on a normal approximation for the reference distribution, whose nature is discrete. McCormack et al. [19] show that such an approximation is inaccurate when the expected number of links between A and B is small. A further drawback of these methods is that they have been implemented so far only for undirected networks. In this work we build upon the approach of [18, 19] and propose an alternative test which we call NEAT (Network Enrichment Analysis Test). The main idea behind this test is that, under the null hypothesis of no enrichment, the number of links between two gene sets A and B follows an hypergeometric distribution. This enables us to model the reference distribution directly via a discrete distribution, without having to resort to a normal approximation. NEAT does not require network permutations to compute mean and variance under the null hypothesis, and is therefore faster than the existing resampling-based methods. Moreover, we develop NEAT not only for undirected, but also for directed and partially directed networks, thus providing a common framework for the analysis of different types of networks.

Methods

The starting point of enrichment analyses is the identification of one or more gene sets of interest. These target gene sets are typically groups of genes that are differentially expressed between experimental conditions, but they can also be different types of gene sets: e.g., clusters of genes that are functionally similar in a given time course, or genes that are bound by a particular protein in a ChIP-chip or ChIP-seq experiment. Enrichment analysis provides a characterization of each target gene set by testing whether some known functional gene sets can be related to it. Methods for gene enrichment analysis assess the relationship between a target gene set and each functional gene set simply by considering the overlap of these two groups. In contrast to this, network enrichment analysis incorporates an evaluation of the level of association between genes in the target set and genes in the functional gene set into the test. Information on associations and dependences between genes is represented by a network, which consists of a set of N nodes V={v1,…,v} that are connected by edges (links). Each gene is thus represented as a node v of the network, and a link between two nodes is drawn to signify interaction between the corresponding genes. Examples of genome-wide curated networks that collect known gene associations are YeastNet [14] and FunCoup [15]. A natural way to study the relation between two sets of genes A and B in a network is to consider the presence or absence of links connecting nodes in the two groups [18, 19]. In the inferred network, we expect that individual links may be slightly unstable and noisy. However, we do expect that the inferred links contain a sign of the relationships between gene sets. So, although links between individual genes in sets A and B may be noisy, if there is a functional relationship between functions described by sets A and B we expect the number of links between the two groups to be larger (or smaller) than expected by chance. If this is the case, we say that there is enrichment between A and B. Links between two nodes of a network can be either directed (arrows) or undirected. The presence of an arrow between two genes implies a directionality in the relation between them, whereas an undirected edge does not provide information on the direction of the relation. The upcoming subsection considers directed networks. In this case, one can distinguish two cases: whether genes in the target set regulate genes of the functional set, or genes in the functional gene set regulate genes in the target set (enrichment from A to B, or from B to A). This distinction does not occur for undirected networks, which are the subject of the next subsection: in this case, A and B are exchangeable, and we simply talk of enrichment “between” A and B. A workflow diagram summarizing the input and the output of NEAT is shown in Fig. 1.
Fig. 1

Workflow diagram of a typical network enrichment analysis with NEAT

Workflow diagram of a typical network enrichment analysis with NEAT

Enrichment test for directed networks

In a directed network, we assess the presence of enrichment from A to B by considering the number of arrows going from genes in A to genes belonging to B. We denote this by n. The observed n can be thought of as a realization from a random variable N, with expected value μ. To assess the relation from A to B, we compare μ with the number of arrows that we would expect to observe from A to B by chance, which we denote as μ0. We say that there is enrichment from A to B if μ is different from μ0. Furthermore, we say that there is over-enrichment from A to B if μ is higher than μ0, and under-enrichment (or depletion) if μ is lower than μ0. We propose a test based on the hypergeometric distribution to assess the significance of this difference. The motivation behind this choice is the following. The hypergeometric distribution models the number of successes in a random sample without replacement: in our case, we can mark arrows in the network that reach genes in B as “successful”, and the remaining ones as “unsuccessful”. Then, we can view the arrows that go out from genes in A as a random sample without replacement from the population of arrows present in the graph: if there is no relation (i.e., no enrichment) between A and B, then the distribution of N (the number of successes in the sample) is where the sample size o is the outdegree of A (the total number of arrows going out from genes that belong to A), the number of successful cases in the population i is the indegree (number of incoming arrows) of B and the population size i is the total indegree of the network (which is equal to the total number of arrows). It is certainly possible to imagine alternative choices for the null distribution of N. Alexeyenko et al. [18] and McCormack et al. [19] assume that N is normal with mean μ0 and variance , and they use network permutations to estimate μ0 and . However, the normal distribution is continuous and symmetric, so that their choice implies somehow that the behaviour of N should be roughly symmetric, and could be well approximated with a continuous random variable. In addition, estimation of μ0 and by means of network permutations can be highly time consuming. Alternatively, one could consider for N an hypergeometric distribution with different parameters, defined for example, by considering all possible edges in the network (instead of the edges that are actually present in the network) as a population. We prefer model (1) over this alternative, because the choice of the parameters therein allows to condition on two quantities that we consider crucial, which are the outdegree of A and the indegree of B. Moreover, in our experience so far, we have observed that tests based on alternative parametrizations often result in poor performances. The null mean and variance of N can be immediately derived from model (1). In particular, in the absence of enrichment we expect to observe, on average, arrows from nodes in A to nodes in B. Thus, we expect μ0 to increase as the number of arrows leaving A, or reaching B, increases. Biological assessment of enrichment can therefore be carried out by testing the null hypothesis of no enrichment against the alternative hypothesis of enrichment In a test with a discrete test statistic and two-sided alternative, such as the one that we propose, the p-value can be computed in different ways [20-22]. Let T be a discrete test statistic and t be the observed value of T. A first possibility is to compute the p-value for the two-tailed test by doubling the one-tailed p-value, p1=2 min[P0(T≤t), P0(T≥t)], where P0 denotes the distribution of T under the null hypothesis. An evident drawback of this formula, however, is that p1 can exceed 1, and therefore p1 does not represent a probability. Even though a simple modification p2= min(p1,1) could avoid the problem, we prefer to subtract P0(T=t) from p1 (P0(T=t) is non-null for discrete T, and this is the reason why p1 can exceed 1) and to compute the p-value using which always lies within the interval [0,1] and differs from p1 by a factor equal to P0(T=t). A p-value close to 0 can be regarded as evidence of enrichment, because it entails that the number of links from A to B is significantly smaller or higher than we would expect it to be in the absence of enrichment. Therefore, for a given type I error probability α, we conclude that there is evidence of enrichment from A to B if p<α, while if p≥α there is not enough evidence of enrichment. As an example, consider the network in Fig. 2. Suppose that we are interested to test whether there is enrichment from the set A={1,4} to the set B={3,5,7}. It can be observed that there are 5 arrows going out from A, and 2 of them reach B. The whole network consists of 15 arrows, of which 4 reach B. Thus, n=2, o=5, i=4 and i=15. The idea behind (1) is that, if the 5 arrows that are going out from A are a random sample (without replacement) from the 15 arrows that are present in the network, then the proportion of arrows reaching B from A should be close to the proportion of arrows reaching B in the whole network, and in the absence of enrichment we should observe on average μ0=1.33 edges. In this case, it seems that arrows going out from A tend to reach B more frequently (40 %) than other arrows do (27 % of the 15 arrows in the network reach B). However, the computation of the p-value leads to p=0.48: the observed n=2 does not provide enough evidence to reject the null hypothesis, so that the conclusion of the test is that there is no enrichment from A to B.
Fig. 2

Example: NEAT in directed networks. Left: directed network consisting of 8 nodes connected by 15 arrows. Set A contains nodes 1 and 4 (red) and set B nodes 3, 5 and 7 (orange). Right: bipartite representation of the same network: it can be observed that n =2, o =5, i =4 and i =15. It follows that μ 0=1.07 and p=0.48

Example: NEAT in directed networks. Left: directed network consisting of 8 nodes connected by 15 arrows. Set A contains nodes 1 and 4 (red) and set B nodes 3, 5 and 7 (orange). Right: bipartite representation of the same network: it can be observed that n =2, o =5, i =4 and i =15. It follows that μ 0=1.07 and p=0.48 We can also consider sets B={3,5,7} and C={2,5} (note that the two groups share gene 5), and test enrichment from B to C. In this case, n=3 arrows out of o=4 (75 %) reach C from B, whereas in the whole network i=4 arrows out of d=15 (27 %) reach C. The null expectation is here μ0=1.07; if we fix the type I error probability equal to α=5 %, the p-value p=0.03 leads to the conclusion that there is enrichment from B to C.

Enrichment test for undirected networks

When dealing with undirected networks, the presence of enrichment between A and B is assessed considering the number of edges that connect genes in A to genes in B. We denote this by n. Given the undirected nature of the links in the network, there is no distinction between indegree and outdegree of a node, and it only makes sense to consider the degree of a node, which is the number of vertices that are linked to that node. The null distribution (1) should thus be adapted accordingly. Let us define the total degree d of a set S as the sum of the degrees of nodes that belong to it: then, in the absence of enrichment we can view n as the number of successes in a random sample of size d, drawn from a population of size d. The null distribution of N for undirected networks is thus where d, d and d are the total degrees of sets A,B and V. The null hypothesis is then that , the alternative that μ≠μ0. The p-value is computed using formula (2). As an example, consider the network in Fig. 3a and suppose that we are interested to test the presence of enrichment between the pairs of sets (A,B), (A,C) and (B,C). Sets A and B are linked by n=4 edges, and their degrees are d=4 and d=15, while d=36. Thus, μ0=1.67 and p=0.023. In the same way, it is possible to compute p=0.465 and p=0.038. Figure 3b shows the relation between the three sets fixing α=5 %: enrichment is present between the pairs (A,B) and (B,C), but not between sets A and C.
Fig. 3

Example: NEAT in undirected networks. Left: undirected network with 12 nodes. We are interested to infer the relation between sets A (nodes 1 and 5), B (2, 4 and 7) and C (6 and 8). Right: representation of the relations between sets: enrichment is detected between sets A and B (p=0.023) and between sets B and C (p=0.038), but not between sets A and C (p=0.465)

Example: NEAT in undirected networks. Left: undirected network with 12 nodes. We are interested to infer the relation between sets A (nodes 1 and 5), B (2, 4 and 7) and C (6 and 8). Right: representation of the relations between sets: enrichment is detected between sets A and B (p=0.023) and between sets B and C (p=0.038), but not between sets A and C (p=0.465)

Enrichment test for partially directed networks

A partially directed network (or “mixed” network) is a network where both directed and undirected edges are present. It is possible to view such a network as a directed network, where every undirected edge connecting two nodes v and w represents in fact a pair of arrows, the former going from v to w and the latter from w to v. If such an adaptation is adopted, model (1) can be applied and partially directed networks can be analysed within neat as directed networks.

Software

NEAT is implemented in the R package neat [23], which can be freely downloaded from CRAN: https://cran.r-project.org/package=neat. The manual and a vignette illustrating the package are also available from the same URL. The package allows users to specify the network in different formats, it includes functions to plot and summarize the results of the analysis and is accompanied by a set of data and examples, including the enrichment analysis of the ESR gene sets that we discuss in the upcoming section.

Results

Performance evaluation

We assess the performance of NEAT by means of simulations. Table 1 summarizes some aspects of these simulations, that are the subject of the next two subsections. The R scripts and data files for each simulation can be found at https://github.com/m-signo/neat. We first consider directed networks, and check whether the performance of NEAT is influenced by the degree distribution of the network, or by the level of overlap between sets of nodes. We then consider undirected networks, and carry out a comparison of NEAT with the NEA test of [18] and with the LP, LA, LA+S and NP tests of [19].
Table 1

An overview of simulations S1–S5

SimulationNetwork typeDegree distributionGraph densityMean overlapMaximum overlap
S1DirectedPower law3 %4 %11.3 %
S2DirectedMixture of 2 Poisson4 %3.6 %9.5 %
S3DirectedMixture of 2 Poisson4 %
S4UndirectedPower law3 %3.8 %12 %
S5UndirectedMixture of 2 Poisson4 %3.6 %11 %

In Simulations S1 and S2, we compare the performance of NEAT in two directed networks with different degree distribution. In simulation S3, we check the performance of the test for different levels of overlap, ranging from 0 to 100 %. In Simulations S4 and S5, we compare NEAT to alternative tests in two undirected networks with different degree distribution

An overview of simulations S1–S5 In Simulations S1 and S2, we compare the performance of NEAT in two directed networks with different degree distribution. In simulation S3, we check the performance of the test for different levels of overlap, ranging from 0 to 100 %. In Simulations S4 and S5, we compare NEAT to alternative tests in two undirected networks with different degree distribution We compare the performance of the methods under the null hypothesis by checking whether the empirical distribution of p-values in the absence of enrichment is uniform using the Kolmogorov-Smirnov test, and by computing the following ratios: and The idea behind R1 and R5 is that if the null hypothesis H0 is true, we expect a good test to reject it with a frequency that is close to α. So, the target value for R1 and R5 is 1. Furthermore, we compare the capacity of different tests to correctly detect enrichments and non-enrichments by computing specificity and sensitivity at α=5 % level, and the area under the ROC curve (AUC). The specificity is the proportion of correctly detected non-enrichments, and we expect it to be as close as possible to 1−α. The sensitivity indicates the proportion of correctly detected enrichments, whereas the AUC is a measure of the overall capacity of a test to discriminate enrichments and non-enrichments across all values of α. Therefore, a test will show a good performance whenever it achieves a specificity close to 1−α, and values of sensitivity and AUC as high as possible (ideally 1).

Simulation with directed networks

In simulations S1 and S2, we generate two random networks with 1000 nodes and with fixed indegree and outdegree distributions using the algorithm implemented by [24]. The indegree and outdegree distributions of nodes are power law with exponent 4 and minimum degree 20 in simulation S1, and a mixture of two Poisson distributions, with parameters λ1=40 and λ2=100 and weights q1=99 % and q2=1 %, in simulation S2. We consider 50 sets of nodes whose size ranges between 50 and 100, and we test enrichment from A to B and from B to A for every pair of sets: this means that, in total, we compute 50×49=2450 tests. In the original networks, no preferential attachment (i.e., no enrichment) between any couple of these sets is present; we generate enrichments by increasing or reducing the number of arrows for 200 pairs of sets. In each case, enrichment is created by adding or removing arrows randomly from one group to the other, in such a way that n increases or reduces by a proportion uniformly ranging from 10 to 50 %. Table 2 shows that the empirical distribution of p-values in absence of enrichment is approximately uniform both in simulation S1 and S2. The sensitivity is higher in simulation S2, whereas the specificity is close to the target value (95 %) in both cases. As a result, the area under the ROC curve is slightly higher in simulation S2. Overall, the test shows in both cases a good capacity to discriminate enrichments and non-enrichments.
Table 2

Performance of NEAT in simulations S1 and S2

Simulation p KS R 1 R 5 SensitivitySpecificityAUC
S10.5101.561.1773 %94 %0.894
S20.1251.201.1278 %94 %0.927

p denotes the p-value of the Kolmogorov-Smirnov test for uniform distribution, AUC is an abbreviation for “area under the ROC curve”. In both simulations, the distribution of p-values under H 0 is uniform and the specificity is close to the expected 95 % value. Sensitivity and AUC are higher in simulation S2

Performance of NEAT in simulations S1 and S2 p denotes the p-value of the Kolmogorov-Smirnov test for uniform distribution, AUC is an abbreviation for “area under the ROC curve”. In both simulations, the distribution of p-values under H 0 is uniform and the specificity is close to the expected 95 % value. Sensitivity and AUC are higher in simulation S2 In simulation S3 we check whether the proportion of overlap between sets A and B, that we measure with the Jaccard index could have an effect on specificity and sensitivity. We consider the same network used in simulation S2, and we test enrichment between pairs of sets with fixed size |A|=|B|=50, but with increasing overlap (we consider |A∩B|∈{0,5,10,15,…,50}). Under H0 we do not modify the network, whereas under H1 we introduce enrichments adding 35 arrows going from genes in A to genes in B. For every value of overlap, we consider 2000 test (H0 is true in 1000 cases, and false in the remaining 1000). Figure 4 shows that the specificity remains constant and close to 95 % for any level of overlap; the sensitivity, on the other hand, is slightly higher when the level of overlap is moderate.
Fig. 4

Specificity and sensitivity in simulation S3. The plot shows the values of specificity and sensitivity for different levels of overlap (every point in the plot is computed on the basis of 1000 tests). We observe that the specificity of the test does not vary substantially for different levels of overlap, and is always close to 95 % as expected. The sensitivity, instead, slightly reduces as the percentage of overlap increases

Specificity and sensitivity in simulation S3. The plot shows the values of specificity and sensitivity for different levels of overlap (every point in the plot is computed on the basis of 1000 tests). We observe that the specificity of the test does not vary substantially for different levels of overlap, and is always close to 95 % as expected. The sensitivity, instead, slightly reduces as the percentage of overlap increases

Simulation with undirected networks

As alternative methods for network enrichment analysis are available for undirected networks only, we compare NEAT with them in two simulations where we consider undirected networks with 1000 nodes. We generate two random networks with fixed degree distribution, using the algorithm implemented by [24]; the degree distribution follows a power law in simulation S4 and a mixture of Poisson distributions in simulation S5, with the same parameters used in simulations S1 and S2. Likewise, we consider 50 sets of nodes, whose sizes vary between 50 and 100 nodes. We test enrichment between every pair of sets A and B, so that the total number of comparisons is here 50×49/2=1225. We introduce enrichments for 100 pairs of sets by adding or removing edges randomly between them, in such a way that n is increased or reduced by a proportion uniformly ranging from 10 to 50 %. Tables 3 and 4 show the results for simulations S4 and S5, respectively. As concerns the behaviour under the null hypothesis, the distribution of p-values is uniform in both cases for NEAT and LA, and in one case for LA+S (simulation S4) and NP (S5). NEA and LP, instead, do not produce uniform distributions: as it can be observed from Fig. 5, the reason is that the distribution is strongly left-skewed for NEA, whereas for LP the distribution is right-skewed (the same patterns occur also in simulation S5). In both simulations, most of the methods achieve a specificity close to 95 % as expected; comparison with the other tests shows that the sensitivity and AUC of NEAT are overall good.
Table 3

Results of simulation S4

Test p KS R 1 R 5 SensitivitySpecificityAUC
NEAT 0.399 1.33 1.14 69 % 94 % 0.920
NEA0.0010 0.87 68 % 96 % 0.918
LP02.131.51 68 % 92 %0.908
LA 0.255 1.601.1760 % 94 % 0.897
LA+S 0.409 1.871.1763 % 94 % 0.913
NP0.037 1.24 1.2858 % 94 % 0.884

The best results for each indicator are in bold. p denotes the p-value of the Kolmogorov-Smirnov test for uniform distribution, AUC is an abbreviation for “area under the ROC curve”. The distribution of p-values under H 0 is evidently not uniform for NEA and LP. NEAT shows the highest values of sensitivity and AUC, and its specificity is close to the target value (95 %)

Table 4

Results of simulation S5

Test p KS R 1 R 5 SensitivitySpecificityAUC
NEAT 0.343 0.62 0.98 79 % 95 % 0.925
NEA0.02400.8273 %96 %0.912
LP01.331.51 78 % 92 %0.904
LA 0.111 1.16 1.3373 %93 %0.908
LA+S0.024 1.16 1.1376 %94 %0.910
NP 0.323 1.421.1670 %94 %0.908

The best results for each indicator are in bold. p denotes the p-value of the Kolmogorov-Smirnov test for uniform distribution, AUC is an abbreviation for “area under the ROC curve”. The distribution of p-values under H 0 can be considered uniform for NEAT, LA and NP, and is questionable for LA+S. NEAT shows the highest values of sensitivity and AUC, and its specificity is exactly equal to the target value (95 %)

Fig. 5

Histogram of p-values in absence of enrichment in simulation S4. The test of Kolmogorov-Smirnov indicates that the distribution is uniform for NEAT (p=0.34), LA (p=0.11) and NP (p=0.32). The distribution of p-values is highly left-skewed for NEA, and right-skewed for LP

Histogram of p-values in absence of enrichment in simulation S4. The test of Kolmogorov-Smirnov indicates that the distribution is uniform for NEAT (p=0.34), LA (p=0.11) and NP (p=0.32). The distribution of p-values is highly left-skewed for NEA, and right-skewed for LP Results of simulation S4 The best results for each indicator are in bold. p denotes the p-value of the Kolmogorov-Smirnov test for uniform distribution, AUC is an abbreviation for “area under the ROC curve”. The distribution of p-values under H 0 is evidently not uniform for NEA and LP. NEAT shows the highest values of sensitivity and AUC, and its specificity is close to the target value (95 %) Results of simulation S5 The best results for each indicator are in bold. p denotes the p-value of the Kolmogorov-Smirnov test for uniform distribution, AUC is an abbreviation for “area under the ROC curve”. The distribution of p-values under H 0 can be considered uniform for NEAT, LA and NP, and is questionable for LA+S. NEAT shows the highest values of sensitivity and AUC, and its specificity is exactly equal to the target value (95 %) Table 5 compares the speed of computation for the different methods. NEAT turns out to be the fastest method by far, being 22 times faster than NP (the fastest alternative) and more than 3000 times faster than NEA (the slowest alternative). This result is mostly due to the fact that NEAT does not require the generation of a large number of permuted networks to compute the test.
Table 5

Speed comparison

TestSoftwareSimulation S4Simulation S5
NEAT R package neat 0.60.7
NEA R package neaGUI 2125.42151.5
LPCrossTalkZ28.644.7
LACrossTalkZ14.418.0
LA+SCrossTalkZ21.827.6
NPCrossTalkZ12.915.8

The table compares the time (in seconds) that each method required to compute 1225 tests for enrichment in simulations S4 and S5, using a processor with 2.5 GhZ CPU frequency. NEAT turns out to be by far the fastest method

Speed comparison The table compares the time (in seconds) that each method required to compute 1225 tests for enrichment in simulations S4 and S5, using a processor with 2.5 GhZ CPU frequency. NEAT turns out to be by far the fastest method

Network enrichment analysis: an application to yeast

The budding yeast Saccharomyces cerevisiae is a unicellular eukaryote organism that can be easily grown in laboratory. Because of these features, it represents a model organism that has been extensively studied, and it was the first eukaryote whose genome was completely sequenced [25]. Since then, a large number of studies has aimed to detect associations between genes. In an attempt to collect these results into a unique source, Kim et al. [14] developed YeastNet, an undirected gene network that aims to integrate the results of a large number of high-throughput studies on Saccharomyces cerevisiae. In its most recent version (v3), YeastNet comprises 362512 edges connecting 5808 genes. We use this network of known associations in the following analyses.

Network enrichment analysis of environmental stress response in yeast

After analysing gene expression patterns of yeast Saccharomyces cerevisiae in response to different stressful stimuli, Gasch et al. [26] inferred the existence of a set of 868 genes that reacted in a similar way to different, hostile environmental changes. This set of genes, called Environmental Stress Response (ESR), is believed to constitute a coordinated, initial reaction to the emergence of any hostile condition in the cell. It consists of two subgroups of genes, containing genes that are repressed and induced under stressful conditions, respectively. We take these two gene sets as target sets, and for each of them we test enrichment with the following functional gene sets: 99 gene sets that are part of the GO Slim biological process ontology (we do not consider the groups “biological process” and “other” in the analysis) and 106 known KEGG pathways. At α=1 % level, NEAT detects over-enrichment between 23 GO Slim sets and the set of repressed genes, and between 25 GO Slim sets and the set of induced genes. Furthermore, 15 KEGG pathways are found to be over-enriched with the set of repressed ESR genes, and 47 with the set of induced genes. Gasch et al. [26] reports that genes that are repressed in the ESR are involved in growth related processes, various aspects of RNA metabolism, nucleotide biosyntesis, secretion, encoding of ribosomal proteins and other metabolic processes. These results are in strong agreement with the list of over-enrichments detected by NEAT, shown in Table 6. As a matter of fact, most of the over-enrichments detected by NEAT are related to RNA transcription, nucleotide secretion and translation of ribosomal proteins (rows 1-18 and 24-35 in Table 6), growth-related processes (row 22) and further metabolic processes (rows 23 and 33-35).
Table 6

Network enrichment analysis of the repressed ESR gene set

Gene set n AB μ 0 log 10 (p-value)
Go Slim BP sets:
1Cytoplasmic translation68782641.9<-300
2Ribosomal large subunit biogenesis34081097.8<-300
3Ribosomal small subunit biogenesis58612073.7<-300
4Ribosome assembly1782621.9<-300
5RNA modification29441062.0<-300
6rRNA processing91873290.2<-300
7tRNA processing2037901.0<-300
8Translational elongation1786782.3–283.8
9Ribosomal subunit export from nucleus1420561.4–281.8
10Translational initiation939462.5–112.1
11Transcription from RNA polymerase III promoter565228.4–107.7
12SnoRNA processing634303.3–82.0
13Regulation of translation19521328.6–73.5
14DNA-dependent transcription, termination774447.0–57.5
15Transcription from RNA polymerase I promoter1005646.4–49.5
16Protein alkylation1063759.4–31.4
17tRNA aminoacylation for protein translation400233.1–29.4
18Peptidyl-amino acid modification1088883.0–13.2
19Nuclear transport31542003.5–162.4
20Organelle assembly20901362.7–96.1
21Nucleobase-containing compound transport14531155.4–20.8
22Cytokinesis1024806.9–16.0
23Vitamin metabolic process325274.0–3.1
KEGG pathways:
24Ribosome biogenesis in eukaryotes98243661.0<-300
25Ribosome186408731.7<-300
26RNA polymerase30571541.2<-300
27RNA transport43412906.4–177.6
28Aminoacyl-tRNA biosynthesis1433960.9–58.2
29RNA degradation25601939.3–51.9
30mRNA surveillance pathway17681413.5–24.0
31Pentose phosphate pathway1126947.1–9.7
32Spliceosome26492523.6–2.3
33Purine metabolism55793623.0–263.6
34Pyrimidine metabolism45412884.5–234.9
35Cyanoamino acid metabolism218158.8–6.3
36One carbon pool by folate541392.5–15.0
37Sulfur relay system238196.5–2.9
38Carbapenem biosynthesis11789.8–2.7

The table lists the 23 Go Slim BP gene sets and the 15 KEGG pathways which the set of repressed ESR genes is found to be over-enriched with at 1 % significance level

Network enrichment analysis of the repressed ESR gene set The table lists the 23 Go Slim BP gene sets and the 15 KEGG pathways which the set of repressed ESR genes is found to be over-enriched with at 1 % significance level Gasch et al. [26] observed that inference for the set of genes that are induced by the ESR is more complicated, because most of the genes in this group lack functional annotations. It is worthwhile to observe that NEAT detects a large number of enriched KEGG pathways (47 out of 106). This preliminary observation points out a major feature of the Environmental Stress Response: the cell reacts to the emergence of different hostile conditions by activating a number of known cellular pathways that involve energy production, metabolic reactions and molecular transportation (see Table 8).
Table 8

Network enrichment analysis of the induced ESR gene set (KEGG pathways)

KEGG pathway n AB μ 0 log 10 (p-value)
1Starch and sucrose metabolism1436394.2<-300
2Pentose and glucuronate interconversions414110.7–119.9
3Glycolysis/Gluconeogenesis1235616.3–116.5
4Fructose and mannose metabolism562200.0–106.7
5Galactose metabolism511173.9–104.5
6Amino sugar and nucleotide sugar metabolism567264.2–63.4
7Other glycan degradation7911.7–44.2
8Pyruvate metabolism633355.9–42.8
9Propanoate metabolism189107.3–12.9
10Glycerolipid metabolism444172.1–72.7
11Peroxisome633313.3–61.2
12Fatty acid degradation419215.0–37.2
13Arachidonic acid metabolism11736.7–28.1
14Sphingolipid metabolism227103.6–27.3
15Glycerophospholipid metabolism450270.9–24.5
16alpha-Linolenic acid metabolism6927.1–11.7
17Fatty acid elongation13875.3–10.8
18Biosynthesis of unsaturated fatty acids134103.9–2.5
19Glutathione metabolism467204.8–59.9
20Citrate cycle (TCA cycle)487267.3–35.6
21Ubiquinone and other terpenoid-quinone biosynthesis9641.8–13.1
22Protein processing in endoplasmic reticulum1121866.0–17.4
23Longevity regulating pathway987544.0–70.6
24beta-Alanine metabolism397104.0–118.0
25Taurine and hypotaurine metabolism13224.3–59.4
26Tyrosine metabolism382163.5–51.8
27Tryptophan metabolism292113.3–48.2
28Valine, leucine and isoleucine degradation276107.5–45.3
29Alanine, aspartate and glutamate metabolism488262.2–38.0
30Histidine metabolism267127.4–28.8
31Arginine and proline metabolism301154.3–27.0
32Lysine degradation294150.4–26.6
33Phenylalanine metabolism17171.4–25.0
34Glycine, serine and threonine metabolism350264.3–6.7
35Cysteine and methionine metabolism338285.3–2.8
36Arginine biosynthesis167134.0–2.4
37Butanoate metabolism46084.8–202.8
38Pentose phosphate pathway604288.0–64.0
39Regulation of autophagy303126.7–43.3
40Insulin resistance337172.8–30.1
41Glyoxylate and dicarboxylate metabolism368201.6–27.3
42Methane metabolism435254.2–26.2
43Nicotinate and nicotinamide metabolism15499.8–6.7
44Nitrogen metabolism8852.8–5.4
45Thiamine metabolism5732.9–4.1
46Selenocompound metabolism12289.3–3.2
47Sulfur metabolism133105.3–2.2

The table lists the 47 KEGG pathways which the set of induced ESR genes is found to be over-enriched with at 1 % significance level

Our results for this gene set do not only match the ones of the original study - identifying many processes and pathways that are related to carbohydrate metabolism (rows 1–3 in Table 7 and 1–9 in Table 8), fatty acid metabolism (rows 4–6 in Table 7 and 10–18 in Table 8), mythocondrial functions and cellular redox reactions (rows 5–9 in Table 7 and 19–21 in Table 8), protein folding and degradation (10 in Table 7 and 22 in Table 8) and cellular protection during stressful conditions (rows 11–13 in Table 7 and 23 in Table 8) - but they also unveil further enrichments that involve molecular transportation (rows 3, 6, 14–18 in Table 7) and amino-acid metabolism (rows 24–36 in Table 8).
Table 7

Network enrichment analysis of the induced ESR gene set (GO Slim sets)

GO Slim BP gene set n AB μ 0 log 10 (p-value)
1Carbohydrate metabolic process1296671.2–110.9
2Oligosaccharide metabolic process442165.3–77.3
3Carbohydrate transport20265.8–45.0
4Lipid metabolic process693484.4–19.9
5Peroxisome organization181124.8–6.0
6Lipid transport12079.7–4.9
7Generation of precursor metabolites and energy585294.8–54.0
8Cellular respiration210118.4–14.5
9Proteolysis involved in cellular protein catabolic process639488.5–10.9
10Protein folding476296.9–22.7
11Response to oxidative stress813242.2–202.7
12Response to chemical stimulus1489885.1–83.4
13Response to starvation459331.4–11.2
14Transmembrane transport910644.4–24.2
15Endocytosis395245.5–19.3
16Protein targeting628478.8–10.9
17Ion transport464380.2–4.8
18Amino acid transport137109.4–2.1
19Cofactor metabolic process523219.0–73.7
20Nucleobase-containing small molecule metabolic process722404.5–49.2
21Membrane invagination278120.6–37.0
22Vacuole organization335200.2–18.9
23Protein maturation4927.7–3.9
24Cell morphogenesis11379.4–3.6
25Sporulation352306.4–2.1

The table lists the 25 Go Slim BP gene sets which the set of induced ESR genes is found to be over-enriched with at 1 % significance level

Network enrichment analysis of the induced ESR gene set (GO Slim sets) The table lists the 25 Go Slim BP gene sets which the set of induced ESR genes is found to be over-enriched with at 1 % significance level Network enrichment analysis of the induced ESR gene set (KEGG pathways) The table lists the 47 KEGG pathways which the set of induced ESR genes is found to be over-enriched with at 1 % significance level Tables 9, 10 and 11 compare the p-values obtained with NEAT with those obtained with LA+S [19], which, according to the conclusions of [19] and to our own simulations, can be considered as the main competitor of NEAT. The tables show a large overlap between the over-enrichments detected by the two methods at a 1 % significance level: the two methods jointly detect 34 over-enrichments (19 GO Slim sets and 15 KEGG pathways) for the set of repressed ESR genes, and 67 (24 GO Slim sets and 43 KEGG pathways) for the set of induced ESR genes. There is only a small number of discrepancies between the two methods and these are mostly borderline cases. In particular, LA+S detects 4 over-enrichments that are not detected by NEAT (rows 39 in Table 9, 26–27 in Table 10 and 48 in Table 11), whereas NEAT detects 9 over-enrichments that are not detected by LA+S (rows 19–22 in Table 9, 25 in Table 10 and 43–46 in Table 11). As concerns computing time, NEAT computed the required task (410 tests in total) in 23 s, whereas the same computation with LA+S required 1,171 s. In summary, the two methods lead to very similar conclusions, but NEAT is considerably more efficient.
Table 9

Repressed ESR gene set: comparison between NEAT and LA+S

μ 0 log10 (p-value)
Gene setNEATLA+SNEATLA+S
GO Slim BP sets:
1Cytoplasmic translation2641.93583.5<-300–290.9
2Ribosomal large subunit biogenesis1097.81602.4<-300–269.2
3Ribosomal small subunit biogenesis2073.73013.2<-300–236.8
4Ribosome assembly621.9872.1<-300–95.9
5RNA modification1062.01422.7<-300–213.7
6rRNA processing3290.24623.2<-300<-300
7tRNA processing901.01137.6<-300–103.3
8Translational elongation782.31019.5–283.8–71.2
9Ribosomal subunit export from nucleus561.4693.4–281.8–151.2
10Nuclear transport2003.52452.5–162.4–33.0
11Translational initiation462.5594.8–112.1–33.6
12Transcription from RNA polymerase III promoter228.4281.6–107.7–43.6
13Organelle assembly1362.71719.2–96.1–8.0
14SnoRNA processing303.3349.8–82.0–26.5
15Regulation of translation1328.61577.5–73.5–12.9
16DNA-dependent transcription, termination447.0575.2–57.5–11.7
17Transcription from RNA polymerase I promoter646.4874.2–49.5–5.2
18tRNA aminoacylation for protein translation233.1256.7–29.4–11.2
19Protein alkylation759.41000.0–31.4–1.2
20Nucleobase-containing compound transport1155.41445.1–20.8–0.1
21Cytokinesis806.9925.9–16.0–1.8
22Peptidyl-amino acid modification883.01102.4–13.2–0.1
23Vitamin metabolic process274.0245.8–3.1–5.5
KEGG pathways:
24Ribosome biogenesis in eukaryotes3661.05212.5<-300<-300
25Ribosome8731.711954.0<-300–283.3
26RNA polymerase1541.22058.0<-300–76.1
27Purine metabolism3623.04136.9–263.6–66.9
28Pyrimidine metabolism2884.53402.5–234.9–61.0
29RNA transport2906.43193.2–177.6–75.4
30Aminoacyl-tRNA biosynthesis960.9934.2–58.2–49.8
31RNA degradation1939.32051.3–51.9–19.9
32mRNA surveillance pathway1413.51477.3–24.0–12.7
33One carbon pool by folate392.5344.2–15.0–19.5
34Pentose phosphate pathway947.1979.2–9.7–4.6
35Cyanoamino acid metabolism158.8132.2–6.3–7.2
36Sulfur relay system196.5172.7–2.9–3.9
37Carbapenem biosynthesis89.875.1–2.7–4.1
38Spliceosome2523.62432.2–2.3–4.1
39Synthesis and degradation of ketone bodies39.829.8–0.3–2.2

The table reports the gene sets that are found to be over-enriched (α=1 %) by at least one of the two methods. μ 0 denotes the expected value of N in the absence of enrichment. The last two columns report log 10 p-values for the proposed NEAT and the LA+S test of [19], respectively

Table 10

Induced ESR gene set: comparison between NEAT and LA+S (GO Slim sets)

μ 0 log10 (p-value)
GO Slim BP setNEATLA+SNEATLA+S
1Response to oxidative stress242.2248.5–202.7–253.7
2Carbohydrate metabolic process671.2663.9–110.9–123.3
3Response to chemical stimulus885.1912.4–83.4–92.8
4Oligosaccharide metabolic process165.3158.1–77.3–104.5
5Cofactor metabolic process219.0225.6–73.7–76.2
6Generation of precursor metabolites and energy294.8293.4–54.0–56.1
7Nucleobase-containing small molecule metabolic process404.5417.4–49.2–41.0
8Carbohydrate transport65.877.7–45.0–52.8
9Membrane invagination120.6118.3–37.0–51.7
10Transmembrane transport644.4684.7–24.2–16.2
11Protein folding296.9296.3–22.7–26.6
12Lipid metabolic process484.4495.7–19.9–23.3
13Endocytosis245.5248.7–19.3–19.3
14Vacuole organization200.2199.7–18.9–22.4
15Cellular respiration118.4125.2–14.5–14.1
16Response to starvation331.4318.4–11.2–15.8
17Protein targeting478.8485.1–10.9–15.8
18Proteolysis involved in cellular protein catabolic process488.5494.1–10.9–9.8
19Peroxisome organization124.8123.5–6.0–6.0
20Lipid transport79.790.4–4.9–2.8
21Ion transport380.2410.7–4.8–2.1
22Protein maturation27.730.9–3.9–3.0
23Cell morphogenesis79.480.8–3.6–3.7
24Sporulation306.4301.7–2.1–2.5
25Amino acid transport109.4113.0–2.1–1.6
26Response to osmotic stress181.8178.3–1.6–2.1
27Protein phosphorylation587.6564.3–1.4–2.7

The table reports the gene sets that are found to be over-enriched (α=1 %) by at least one of the two methods. μ 0 denotes the expected value of N in the absence of enrichment. The last two columns report log 10 p-values for the proposed NEAT and the LA+S test of [19], respectively

Table 11

Induced ESR gene set: comparison between NEAT and LA+S (KEGG pathways)

μ 0 log10 (p-value)
KEGG pathwayNEATLA+SNEATLA+S
1Starch and sucrose metabolism394.2400.6<-300<-300
2Butanoate metabolism84.898.0–202.8<-300
3Pentose and glucuronate interconversions110.7127.5–119.9–185.7
4beta-Alanine metabolism104.0122.9–118.0–209.8
5Glycolysis/Gluconeogenesis616.3618.7–116.5–149.3
6Fructose and mannose metabolism200.0206.2–106.7–160.7
7Galactose metabolism173.9193.2–104.5–126.4
8Glycerolipid metabolism172.1193.2–72.7–103.2
9Longevity regulating pathway - multiple species544.0508.2–70.6–79.1
10Pentose phosphate pathway288.0284.2–64.0–105.8
11Amino sugar and nucleotide sugar metabolism264.2277.6–63.4–66.7
12Peroxisome313.3332.9–61.2–55.8
13Glutathione metabolism204.8221.6–59.9–77.8
14Taurine and hypotaurine metabolism24.328.5–59.4–92.8
15Tyrosine metabolism163.5169.9–51.8–62.6
16Tryptophan metabolism113.3130.9–48.2–59.4
17Valine, leucine and isoleucine degradation107.5124.8–45.3–56.8
18Other glycan degradation11.712.9–44.2–66.3
19Regulation of autophagy126.7135.2–43.3–45.5
20Pyruvate metabolism355.9388.8–42.8–41.6
21Alanine, aspartate and glutamate metabolism262.2284.5–38.0–36.7
22Fatty acid degradation215.0225.0–37.2–43.7
23Citrate cycle (TCA cycle)267.3299.5–35.6–32.9
24Insulin resistance172.8176.5–30.1–30.4
25Histidine metabolism127.4147.8–28.8–25.8
26Arachidonic acid metabolism36.744.1–28.1–40.6
27Glyoxylate and dicarboxylate metabolism201.6224.8–27.3–23.7
28Sphingolipid metabolism103.6116.3–27.3–26.2
29Arginine and proline metabolism154.3180.2–27.0–24.8
30Lysine degradation150.4160.2–26.6–31.5
31Methane metabolism254.2262.7–26.2–23.7
32Phenylalanine metabolism71.481.5–25.0–26.4
33Glycerophospholipid metabolism270.9285.1–24.5–22.3
34Protein processing in endoplasmic reticulum866.0857.1–17.4–20.7
35Ubiquinone and other terpenoid-quinone biosynthesis41.847.1–13.1–12.3
36Propanoate metabolism107.3122.9–12.9–9.9
37alpha-Linolenic acid metabolism27.130.5–11.7–11.2
38Fatty acid elongation75.376.1–10.8–12.9
39Glycine, serine and threonine metabolism264.3281.1–6.7–3.5
40Nicotinate and nicotinamide metabolism99.8111.9–6.7–4.7
41Nitrogen metabolism52.860.7–5.4–4.0
42Thiamine metabolism32.936.8–4.1–3.2
43Selenocompound metabolism89.397.0–3.2–1.9
44Cysteine and methionine metabolism285.3310.6–2.8–1.0
45Arginine biosynthesis134.0154.2–2.4–0.6
46Sulfur metabolism105.3121.9–2.2–0.5
47Biosynthesis of unsaturated fatty acids103.9102.1–2.5–3.1
48Regulation of mitophagy - yeast554.4510.4–1.6–5.1

The table reports the gene sets that are found to be over-enriched (α=1 %) by at least one of the two methods. μ 0 denotes the expected value of N in absence of enrichment. The last two columns report log 10 p-values for the proposed NEAT and the LA+S test of [19], respectively

Repressed ESR gene set: comparison between NEAT and LA+S The table reports the gene sets that are found to be over-enriched (α=1 %) by at least one of the two methods. μ 0 denotes the expected value of N in the absence of enrichment. The last two columns report log 10 p-values for the proposed NEAT and the LA+S test of [19], respectively Induced ESR gene set: comparison between NEAT and LA+S (GO Slim sets) The table reports the gene sets that are found to be over-enriched (α=1 %) by at least one of the two methods. μ 0 denotes the expected value of N in the absence of enrichment. The last two columns report log 10 p-values for the proposed NEAT and the LA+S test of [19], respectively Induced ESR gene set: comparison between NEAT and LA+S (KEGG pathways) The table reports the gene sets that are found to be over-enriched (α=1 %) by at least one of the two methods. μ 0 denotes the expected value of N in absence of enrichment. The last two columns report log 10 p-values for the proposed NEAT and the LA+S test of [19], respectively

Network enrichment analysis of GO Slim sets: overlap does not imply enrichment

Gene ontologies [1] consist of a large number of gene sets, which are involved in different cellular functions or biological processes, or that are active in a specific component of the cell. These sets of genes are typically employed to enrich sets of differentially expressed genes that have been experimentally detected (the analysis of the ESR gene sets in the previous subsection provides an example of this). However, network enrichment analysis is a more general instrument, which allows to assess the relation between pairs of gene sets in a network. One might wonder, for instance, whether gene sets within an ontology tend to be strongly related to each other, or whether there is a strong separation between them. We consider gene sets in the GO Slim biological process ontology for Saccharomyces cerevisiae (we once more exclude the two general groups “biological process” and “other” from the analysis). As a result of the hierarchical structure of Gene Ontologies, 12 gene sets are nested within another group. We exclude these 12 sets from the analysis: the remaining 87 gene sets do not have hierarchical relations with each other, and pairs of these sets display overall a low overlap (1.7 % on average), which is null in most cases (62 % of pairs of sets do not share genes). If overlapping of sets was taken by itself as evidence of a relation between two gene sets, one would therefore conclude that most of these gene sets are unrelated. If, however, we do not limit our attention to the overlap between pairs of sets, but consider also known associations between genes in the two sets as represented in YeastNet [14], we obtain a different conclusion. We have used NEAT to test whether there is enrichment between each pair of sets. In a random network where no relations between the sets are present, we would expect to detect 37 enrichments (on average) out of 3741 tests for α=1 %; instead, we detect 1409 enrichments, 38 times more than expected. Out of these, 710 are under-enrichments, and 699 are over-enrichments. An under-enrichment, here, indicates that two GO Slim sets are poorly connected to each other: the high number of under-enrichments, therefore, might be not particularly surprising or interesting, as we do expect that unrelated gene sets within the ontology are poorly connected. The high number of over-enrichments, on the other hand, is striking: this indicates that many groups within the ontology are highly connected to each other - something that would occur rather rarely, if there was no relation between the sets. This result points out a major difference between gene enrichment analysis and network enrichment analysis: whereas in the first case the extent of overlapping between two gene sets is taken by itself as evidence of enrichment, network enrichment analysis bases the evaluation of enrichment on the level of connectivity that exists between the two sets in a network. Of course, the two facts are not completely unrelated. Figure 6 shows that there is a certain correlation between overlap of gene sets (Jaccard index) and network enrichment, so that we tend to find network enrichment in the presence of higher levels of overlap. This correlation is, however, low (the Pearson correlation coefficient between J and p is −0.15), pointing out that there does not necessarily have to be enrichment for highly overlapping gene sets, and vice versa. As an example, the GO Slim sets “cytokinesis” and “nuclear organization” do not share genes, but are detected as enriched (p=0.0003) in YeastNet. This result can be explained by the fact that “nuclear organization” includes genes involved in the assembly and disassembly of the nucleus, which is a preliminary step in cell cytokinesis.
Fig. 6

Relation between overlap (J ) and p-values. Note that p-values are represented on a negative log-scale to enhance readability

Relation between overlap (J ) and p-values. Note that p-values are represented on a negative log-scale to enhance readability

Conclusion

Network enrichment analysis is a powerful extension of traditional methods of gene enrichment analysis, that allows to integrate them with the information on connectivity between genes provided by genetic networks. Whereas gene enrichment analysis bases the test for enrichment solely on the overlap between two gene sets and ignores the relationships between individual genes, network enrichment analysis exploits a larger amount of information by making use of gene networks, and it is thus capable to detect enrichment even between two gene sets that do not share genes. In this paper, we have presented a Network Enrichment Analysis Test (NEAT) that aims to overcome some limitations which affect the network enrichment tests of [18, 19]. First of all, we believe that a normal approximation does not make justice to the discrete nature of N. We have shown that this approximation can be avoided if one models N directly, using a hypergeometric distribution with suitably specified parameters. In addition, the normal approximation employed by [18, 19] requires the computation of a large number of network permutations to obtain the mean and variance under H0: this operation can be very time consuming for big networks and it makes the computation of the test rather slow. The use of the hypergeometric distribution, instead, allows to specify the null distribution of N without resorting to permutations, thus speeding up computations considerably. A further drawback of existing methods for network enrichment analysis [16-19] is that they have been implemented only for undirected networks. We address this problem by considering different types of networks (directed, undirected and partially directed) and by proposing two different parametrizations, which take into account the different nature of directed and undirected links. We believe that NEAT could constitute a flexible and computationally efficient test for network enrichment analysis. Our simulations show that NEAT has a good capacity to correctly classify enrichments and non-enrichments. Comparison of NEAT with other methods points out an overall good performance in terms of sensitivity and of specificity, as well as the computational efficiency of the proposed method. The examples illustrated in the previous section show that NEAT can retrieve enrichments that were detected with gene enrichment analysis, but it can also unveil further enrichments that would be overlooked, if known associations between genes were ignored. Even though the focus of this paper is on gene regulatory networks, NEAT is a rather general test: it can be applied to networks that arise in different contexts and disciplines, whenever the interest is to infer the relationship between groups of vertices. This can include, for example, other types of biological networks, as well as social, economic or technological networks.
  21 in total

1.  KEGG: kyoto encyclopedia of genes and genomes.

Authors:  M Kanehisa; S Goto
Journal:  Nucleic Acids Res       Date:  2000-01-01       Impact factor: 16.971

2.  Gene ontology: tool for the unification of biology. The Gene Ontology Consortium.

Authors:  M Ashburner; C A Ball; J A Blake; D Botstein; H Butler; J M Cherry; A P Davis; K Dolinski; S S Dwight; J T Eppig; M A Harris; D P Hill; L Issel-Tarver; A Kasarskis; S Lewis; J C Matese; J E Richardson; M Ringwald; G M Rubin; G Sherlock
Journal:  Nat Genet       Date:  2000-05       Impact factor: 38.330

3.  Revealing strengths and weaknesses of methods for gene network inference.

Authors:  Daniel Marbach; Robert J Prill; Thomas Schaffter; Claudio Mattiussi; Dario Floreano; Gustavo Stolovitzky
Journal:  Proc Natl Acad Sci U S A       Date:  2010-03-22       Impact factor: 11.205

4.  Network enrichment analysis in complex experiments.

Authors:  Ali Shojaie; George Michailidis
Journal:  Stat Appl Genet Mol Biol       Date:  2010-05-22

5.  FunSpec: a web-based cluster interpreter for yeast.

Authors:  Mark D Robinson; Jörg Grigull; Naveed Mohammad; Timothy R Hughes
Journal:  BMC Bioinformatics       Date:  2002-11-13       Impact factor: 3.169

6.  Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors:  Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal:  Proc Natl Acad Sci U S A       Date:  2005-09-30       Impact factor: 11.205

7.  EnrichNet: network-based gene set enrichment analysis.

Authors:  Enrico Glaab; Anaïs Baudot; Natalio Krasnogor; Reinhard Schneider; Alfonso Valencia
Journal:  Bioinformatics       Date:  2012-09-15       Impact factor: 6.937

8.  Statistical assessment of crosstalk enrichment between gene groups in biological networks.

Authors:  Theodore McCormack; Oliver Frings; Andrey Alexeyenko; Erik L L Sonnhammer
Journal:  PLoS One       Date:  2013-01-23       Impact factor: 3.240

9.  YeastNet v3: a public database of data-specific and integrated functional gene networks for Saccharomyces cerevisiae.

Authors:  Hanhae Kim; Junha Shin; Eiru Kim; Hyojin Kim; Sohyun Hwang; Jung Eun Shim; Insuk Lee
Journal:  Nucleic Acids Res       Date:  2013-10-27       Impact factor: 16.971

10.  FunCoup 3.0: database of genome-wide functional coupling networks.

Authors:  Thomas Schmitt; Christoph Ogris; Erik L L Sonnhammer
Journal:  Nucleic Acids Res       Date:  2013-10-31       Impact factor: 16.971

View more
  13 in total

1.  Osteogenesis depends on commissioning of a network of stem cell transcription factors that act as repressors of adipogenesis.

Authors:  Alexander Rauch; Anders K Haakonsson; Jesper G S Madsen; Mette Larsen; Isabel Forss; Martin R Madsen; Elvira L Van Hauwaert; Christian Wiwie; Naja Z Jespersen; Michaela Tencerova; Ronni Nielsen; Bjørk D Larsen; Richard Röttger; Jan Baumbach; Camilla Scheele; Moustapha Kassem; Susanne Mandrup
Journal:  Nat Genet       Date:  2019-03-04       Impact factor: 38.330

2.  NEAT: an efficient network enrichment analysis test.

Authors:  Mirko Signorelli; Veronica Vinciotti; Ernst C Wit
Journal:  BMC Bioinformatics       Date:  2016-09-05       Impact factor: 3.169

3.  A simple null model for inferences from network enrichment analysis.

Authors:  Gustavo S Jeuken; Lukas Käll
Journal:  PLoS One       Date:  2018-11-09       Impact factor: 3.240

4.  Two critical positions in zinc finger domains are heavily mutated in three human cancer types.

Authors:  Daniel Munro; Dario Ghersi; Mona Singh
Journal:  PLoS Comput Biol       Date:  2018-06-28       Impact factor: 4.475

5.  Integrative Network Analysis of Differentially Methylated and Expressed Genes for Biomarker Identification in Leukemia.

Authors:  Robersy Sanchez; Sally A Mackenzie
Journal:  Sci Rep       Date:  2020-02-07       Impact factor: 4.379

6.  Evaluation of blood gene expression levels in facioscapulohumeral muscular dystrophy patients.

Authors:  M Signorelli; A G Mason; K Mul; T Evangelista; H Mei; N Voermans; S J Tapscott; R Tsonaka; B G M van Engelen; S M van der Maarel; P Spitali
Journal:  Sci Rep       Date:  2020-10-16       Impact factor: 4.379

7.  Functional enrichment of alternative splicing events with NEASE reveals insights into tissue identity and diseases.

Authors:  Markus List; Olga Tsoy; Zakaria Louadi; Maria L Elkjaer; Melissa Klug; Chit Tong Lio; Amit Fenn; Zsolt Illes; Dario Bongiovanni; Jan Baumbach; Tim Kacprowski
Journal:  Genome Biol       Date:  2021-12-02       Impact factor: 13.583

8.  Pathway-specific model estimation for improved pathway annotation by network crosstalk.

Authors:  Miguel Castresana-Aguirre; Erik L L Sonnhammer
Journal:  Sci Rep       Date:  2020-08-12       Impact factor: 4.379

9.  Collateral Sensitivity to β-Lactam Drugs in Drug-Resistant Tuberculosis Is Driven by the Transcriptional Wiring of BlaI Operon Genes.

Authors:  A S Trigos; B W Goudey; J Bedő; T C Conway; N G Faux; K L Wyres
Journal:  mSphere       Date:  2021-05-28       Impact factor: 4.389

10.  OscoNet: inferring oscillatory gene networks.

Authors:  Luisa Cutillo; Alexis Boukouvalas; Elli Marinopoulou; Nancy Papalopulu; Magnus Rattray
Journal:  BMC Bioinformatics       Date:  2020-08-21       Impact factor: 3.307

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.