MOTIVATION: Even within well studied organisms, many genes lack useful functional annotations. One way to generate such functional information is to infer biological relationships between genes/proteins, using a network of gene coexpression data that includes functional annotations. However, the lack of trustworthy functional annotations can impede the validation of such networks. Hence, there is a need for a principled method to construct gene coexpression networks that capture biological information and are structurally stable even in the absence of functional information. RESULTS: We introduce the concept of signed distance correlation as a measure of dependency between two variables, and apply it to generate gene coexpression networks. Distance correlation offers a more intuitive approach to network construction than commonly used methods such as Pearson correlation and mutual information. We propose a framework to generate self-consistent networks using signed distance correlation purely from gene expression data, with no additional information. We analyse data from three different organisms to illustrate how networks generated with our method are more stable and capture more biological information compared to networks obtained from Pearson correlation or mutual information. SUPPLEMENTARY INFORMATION: Supplementary Information and code are available at Bioinformatics and https://github.com/javier-pardodiaz/sdcorGCN online.
MOTIVATION: Even within well studied organisms, many genes lack useful functional annotations. One way to generate such functional information is to infer biological relationships between genes/proteins, using a network of gene coexpression data that includes functional annotations. However, the lack of trustworthy functional annotations can impede the validation of such networks. Hence, there is a need for a principled method to construct gene coexpression networks that capture biological information and are structurally stable even in the absence of functional information. RESULTS: We introduce the concept of signed distance correlation as a measure of dependency between two variables, and apply it to generate gene coexpression networks. Distance correlation offers a more intuitive approach to network construction than commonly used methods such as Pearson correlation and mutual information. We propose a framework to generate self-consistent networks using signed distance correlation purely from gene expression data, with no additional information. We analyse data from three different organisms to illustrate how networks generated with our method are more stable and capture more biological information compared to networks obtained from Pearson correlation or mutual information. SUPPLEMENTARY INFORMATION: Supplementary Information and code are available at Bioinformatics and https://github.com/javier-pardodiaz/sdcorGCN online.
Gene expression data, while noisy, contains key information about biological processes (Kothapalli ). Such data are often represented as gene coexpression networks, where nodes are genes and edges represent correlations in their expression across multiple samples (Lee ). Representing gene coexpression as networks eases the study and visualization of the expression data (Magwene and Kim, 2004; Weirauch, 2011). One motivation behind creating these networks is that genes which are coexpressed across multiple samples are likely to have related functions (Hughes ; Makrodimitris ; Stuart ; van Noort ), allowing inference of gene function using guilt by association approaches (Wolfe ). This procedure is especially useful if the studied organism is poorly annotated. For example, Rhizobium leguminosarum, a soil bacteria important in agriculture that can infect plants of the legume family and provide them organic nitrogenous compound, has no functional information available for ∼25% of its predicted genes. Most methods to generate and validate gene coexpression networks use exogenous biological information (such as gene ontologies and metabolic information) to select which edges need to or do not have to be present (e.g. Bar-Joseph ; Ihmels ; Ucar ). Therefore, the lack of reliable genomic functional information may hinder the construction of gene coexpression networks and the validation of their accuracy.The most widely used methods to generate gene coexpression networks in the absence of exogenous information are based on the absolute value of the Pearson correlation of the expression of gene pairs across samples (Weirauch, 2011). After computing the correlation, there are two alternatives: (i) construct a fully connected weighted network, or (ii) impose a threshold to construct unweighted networks with edges connecting genes whose expression correlation is high enough. The former approach is widely used thanks to the R package WGCNA (Langfelder and Horvath, 2008), but results in noisy networks where gene relationships may not be easy to identify. The latter approach (e.g. George ) keeps the strongest relationships; however, it is not obvious which threshold value strikes the right balance. A threshold too low, results in overly dense networks that are difficult to analyse; a threshold that is too high risks discarding valuable information.A natural way of studying gene expression data is to compare the expression of a gene in different samples to assess how it changes. For any two genes, the most intuitive approach should follow the same straightforward procedure: evaluate how the expression of each gene changes across the different samples and then compare the patterns of changes between the two genes. This same idea underpins the concept of distance correlation (Székely ). While Pearson correlation measures linear relationships, distance correlation measures the dependence, both linear and nonlinear, between two vectors, and provides a non-negative score that is zero if and only if the vectors are statistically independent (Székely ). Thus, distance correlations allow us to identify relationships between the expression of genes beyond linearity; this type of correlation has been successfully used in bioinformatics settings to predict miRNA-disease associations (Zhao ) and to generate gene regulatory networks from expression data (Ghanbari ; Guo ). Gene regulatory networks are different from gene coexpression networks because they are directed, much sparser and aim to identify regulatory pathways rather than functional associations. We give a detailed definition of distance correlation in Section 2 below.Distance correlation is always non-negative; that is, it does not capture whether an association between the expression of two genes is positive or negative. Naturally, this information may be biologically relevant; to overcome this shortcoming, we introduce a signed distance correlation. After calculating the distance correlation between each pair of genes, we impose a sign, which corresponds to the sign of the Pearson correlation between the expression of the genes.Here, we propose a method to construct gene coexpression networks using signed distance correlation as an intuitive alternative to networks from Pearson correlation, Spearman correlations and mutual information. To highlight the strengths of our approach, we construct and compare networks using the four methods. This work is to our knowledge the first work using distance correlation to construct gene coexpression networks.We construct networks by including only edges between genes for which the signed distance correlation of their expression exceeds a threshold. We select the threshold based on the internal consistency of the networks using the R package COGENT (Bozhilova ) instead of using exogenous biological information known (or imposed) a priori. We evaluate our method in data from three different organisms. First, we generate an unweighted gene coexpression network for the bacteria R.leguminosarum from microarray data. We then analyse RNA-Seq data from the yeast Saccharomyces cerevisiae, and single-cell RNA-Seq data from human liver cells. The results of our analysis of yeast and human data can be found in the Supplementary Material.We evaluate the biological information in our networks using the STRING database (Szklarczyk ), which is a protein–protein interaction database with scores for pairs of proteins. The higher the STRING score, the more likely they are to have a biologically meaningful functional interaction. Using STRING, we show that networks from signed distance correlation capture more biological information and are structurally more stable than networks based on Pearson or Spearman correlation or mutual information.While we apply our method to gene expression data, our method to construct networks from signed distance correlations (in combination with COGENT) can be used in applications beyond bioinformatics.Data and source code are available from https://github.com/javier-pardodiaz/sdcorGCN.
2 Materials and methods
The method we propose generates an unweighted coexpression network from gene expression data that may come from different sources, such as microarrays, RNA-Seq and single-cell RNA-Seq assays. Figure 1 illustrates the main steps of the method, which includes data pre-processing, computing correlations, and thresholding to create networks.
Fig. 1.
Pipeline to construct networks from gene expression data using signed distance correlation. (A) We pre-process the input matrix M with raw gene expression data using quantile normalization and setting the lowest 20% values from each sample to the minimum value in M to obtain . (B) We compute the expression distance matrices and for each gene , and we double center them to obtain and . (C) We compute the distance correlation matrix D, whose entries are the positive root of the Pearson correlation between and , for every pair of genes. (D) We compute the Pearson correlation between each pair of rows in the to obtain the Pearson correlation matrix P. (E) To construct the signed distance correlation matrix S, we multiply every distance correlation between the expression of two genes by the sign of their Pearson correlation . (F) Using COGENT (Bozhilova ), we find the optimal threshold that produces the most self-consistent network from S. (F′) Analogously, we find the optimal threshold to generate the network from P; this step is not part of the pipeline and only necessary to be able to compare Pearson and signed distance correlation networks.
Pipeline to construct networks from gene expression data using signed distance correlation. (A) We pre-process the input matrix M with raw gene expression data using quantile normalization and setting the lowest 20% values from each sample to the minimum value in M to obtain . (B) We compute the expression distance matrices and for each gene , and we double center them to obtain and . (C) We compute the distance correlation matrix D, whose entries are the positive root of the Pearson correlation between and , for every pair of genes. (D) We compute the Pearson correlation between each pair of rows in the to obtain the Pearson correlation matrix P. (E) To construct the signed distance correlation matrix S, we multiply every distance correlation between the expression of two genes by the sign of their Pearson correlation . (F) Using COGENT (Bozhilova ), we find the optimal threshold that produces the most self-consistent network from S. (F′) Analogously, we find the optimal threshold to generate the network from P; this step is not part of the pipeline and only necessary to be able to compare Pearson and signed distance correlation networks.The input to the method is a m × n gene expression matrix M, where each of the m rows correspond to a gene, each of the n columns is a different sample, and the entries are the expression values of each gene in each sample.
2.1 Data
We analyse gene expression data from three different organisms: R.leguminosarum, Yeast (S.cerevisiae), and Human (Homo sapiens), obtained using different experimental techniques (microarrays, RNA-Seq and single-cell RNA-Seq). Below, we present our results on the R.leguminosarum dataset. The description and analysis of yeast and human datasets can be found in Supplementary Sections 3 and 4.The R.leguminosarum bv. viciae 3841 data contains gene expression information observed under 18 different growth conditions. These data come from n = 54 microarray channels with 3 independent samples per condition (Karunakaran ; Pini ; Ramachandran ). The complete list of the conditions is in Supplementary Section 1. From the total 7263 genes in the current genome annotation (Young ), we remove genes that do not appear in all the microarrays or appear as pseudogenes, leaving m = 7, 077 genes for which we calculate the mean expression within each microarray. These data are encoded in the matrix M.The data for all three organisms in the form of expression matrices (genes in rows and samples in columns) are available at https://github.com/javier-pardodiaz/sdcorGCN and http://opig.stats.ox.ac.uk/resources.
2.2 Pre-processing
Gene expression data is noisy and the raw values are only comparable within the same experiment due to their arbitrary scales. Therefore, gene expression data requires some pre-processing before we can use it to generate networks (Libralon ). We apply quantile normalization (Bolstad ) to the gene expression matrix M. This normalization step renders the distribution of the expression values in different samples (i.e. the columns of M) identical in their statistical properties, such as maximum value and quantiles. This normalization enables us to compare data from different experiments.To avoid interference from low expression values in the quantile normalization, we ignore the 20% least expressed genes from each sample before the normalization step. After the quantile normalization, we set the ignored values to the lowest expression value in M to decrease the level of noise. In practice, we have observed that 20% offers a good balance between preserving as much information as possible, and weeding out noisy measurements. We denote the pre-processed expression data by , and its i-th row by .
2.3 Computing correlations between the expression of the genes
Distance correlation is as a measure of association between random vectors that addresses some of the limitations of linear measures such as Pearson’s (Székely ). To compute the distance correlation between the expression of two genes i and j, let the vectors and contain their expression values across the n samples. For each gene , we calculate the n × n expression distance matrix whose entries are the absolute differences between the expression values across samples:
where are all the samples in the data. Then, we compute the double-centred expression distance matrix :
where . The distance covariance of the expression of genes i and j is:
where . Finally, the distance correlation between the expression of the genes isThis expression is the non-negative square root of the Pearson correlation between and . We store the pairwise distance correlation between the expression of all genes in the m × m symmetric matrix D with entries .The distance correlation in Equation (4) is always non-negative; however, the association between the expression of two genes can be either positive (i.e. both genes are expressed at the same time) or negative (i.e. one gene is expressed when the other is not). Thus, the sign of the association may contain crucial biological information that is lost if we only use distance correlation. We introduce sign into distance correlation by using the correlation matrix P that contains the Pearson correlation coefficients between the expression of each pair of genes:The matrix P also has size m × m but, unlike D, it may have negative values: values close to 1 mean strong positive correlation, values close to −1 mean strong negative correlation, and values around 0 mean no correlation. We generate a signed distance correlation matrix S whose values are the entries in D multiplied by the sign of the corresponding entries in P:For comparison purposes, we create the matrix I which contains the mutual information values for each pair of gene expression vectors from the preprocessed dataset . To compute I, we use the R package minet (Meyer ). We select the empirical probability distribution as the entropy estimator, and use bins with equal frequencies for the discretization.
2.4 Thresholding correlation matrices
Once we have a signed distance correlation S, we generate an unweighted network with adjacency matrix by applying a threshold θ to S:The matrix encodes an unweighted undirected network in which pairs of genes are connected if there is a strong and positive signed distance correlation in their expression. Naturally, different values of θ result in networks with different properties. We want to find the θ that minimizes the number of edges between genes that are not coexpressed, and maximizes the number of edges between genes that are coexpressed. For this purpose, we use the R package COGENT (Consistency of Gene Expression NeTworks) (Bozhilova ).The main COGENT functions evaluate the internal consistency of a method to generate networks from a specific dataset. First, COGENT splits the gene expression data in two possibly overlapping groups of samples, and then constructs a network with the same node set from each group, G1 and G2. Then, COGENT measures the similarity between G1 and G2; the more similar the networks, the higher the internal consistency of the method. COGENT helps to find a threshold that results in the most internally consistent networks. The COGENT function getEdgeSimilarityCorrected provides a score of edge similarity between G1 and G2 and adjusts it so that results obtained for different values of θ are comparable. We use the semi-random density adjustment implemented in COGENT and select the function parameter that allows to keep the isolated nodes during the analysis.The edge similarity between G1 and G2 is the Jaccard index of the set of edges. This index is adjusted using the randomized networks and from a a configuration-type model from the degree sequences and . The degree sequences and are random permutations of the degree sequences of G1 and G2. More precisely, the similarity between G1 and G2 is:
where
and E, V are the set of edges and vertices for network .The value of β is the expected edge overlap between G1 and G2 if they were random networks. In general, β is higher for denser networks. The similarity in Eq. 8 is a value between −1/3 and 1; this value is high if G1 is more similar to G2 than to a randomization of G2 and vice versa (Bozhilova ).In our analysis, we create two random subsets of columns from M to obtain two matrices M1 and M2, such that half of the total number of samples n (i.e. the columns of M) are shared between both subsets, and 1/3 of their columns are different. Here, denotes the greatest integer less or equal to . We choose this amount of overlap to evaluate how the relatively small change of about 1/3 of the data affects the final result. The choice of amount of overlap is user defined and may also depend on the research question. When the interest lies in clustering of genes, then one may like to choose an amount of overlap which leads to a moderately sparse network. Our previous in silico experiments show that if the proportion of samples from M shared between M1 and M2 is much smaller than 50%, the similarity between the obtained networks is low, and proportions much larger than 50% produce almost identical networks. We pre-process M1 and M2 and compute the signed distance correlation matrices S1 and S2 as outlined in Section 2.3. We test different values of θ, and for each of them, we obtain two unweighted networks and their similarity with COGENT. We repeat this whole process 25 times, every time with different subsets of columns of M. Finally, we compute the similarity score , which is the average of the similarity of the networks (in Eq. 8) over the 25 samples.To favour signal over noise, we create a score that balances the similarity of the networks in with the density of :We select the threshold that retrieves the highest and use this value to generate the unweighted gene coexpression network from the signed distance correlation matrix S. We also calculate in the same manner the optimal thresholds and to construct the unweighted networks and from the Pearson correlation matrix P and the mutual information matrix I.
2.5 Network comparison
We compare gene coexpression networks obtained from thresholding the signed distance correlation matrix S, the Pearson correlation matrix P, and the MI matrix I. The density of a network may influence the amount of biological information it captures and hence we cannot compare networks with different densities. Instead, we generate additional networks from each matrix which match the different optimal edge densities. Letting the edge density of , and be d, d, and d, we compare the following nine networks:: Network from S with edge density d (i.e. ).: Network from P with edge density d (i.e. ).: Network from I with edge density d (i.e. ).: Network from P with edge density d.: Network from I with edge density d.: Network from S with edge density d.: Network from I with edge density d.: Network from S with edge density d.: Network from P with edge density d.To construct the last six networks, we simply find a threshold manually that produces networks with density d, d, and d. Since for the R.leguminosarum dataset d is roughly the same as d (see Section 3), we use the network as a proxy for the network and omit networks and .We first evaluate the internal consistency of each network with COGENT. We also evaluate the biological information contained in the networks using STRING, a database of known and predicted protein–protein interactions (Szklarczyk ). STRING collects information from numerous sources, including experimental data, computational predictions and textmining. The association evidence in STRING is categorized into independent channels, weighted, and integrated to produce a confidence score C for all recorded protein interactions. Interactions with high C score are more likely to be true than those with a low score.We work with three different sets of confidence scores:C: Total scores provided by STRING.: Scores that only consider coexpression information (coexpression channel combined with coexpression transferred channel).: Scores that exclude coexpression information.We provide details of how to retrieve and in Supplementary Section 5.For each network, we add the confidence score associated with each pair of connected genes and divide by the number of edges. We perform this operation independently for the three sets of confidence scores to obtain three aggregate confidence scores per network. These scores represent the amount of biological information that the networks capture. Then we compare these scores to the expected amount of biological information captured by chance. To do so, we generate three sets of 30 random networks with edge densities d, d, d, and evaluate their biological content following the approach detailed above. We compare the distribution of the aggregate C, and scores from the random networks with density d with the scores from networks , and , the scores from the random networks with density d with and , and the scores from the random networks with density d with and . For the R.leguminosarum dataset, we omit the random networks with density d and compare with the network with density d due to the similarity in their densities. In Supplementary Section 2, we also perform the biological evaluation of the networks and with edge density d and d, respectively, obtained from a Spearman correlation matrix R.
3 Results
Here, we present our analysis of the R.leguminosarum dataset; the results for the yeast and human data are in Supplementary Sections 3 and 4.
3.1 Correlation matrices
The correlation matrices P (Pearson), D (distance), and S (signed distance) are all symmetric with m = 7, 077 rows and columns. As we show in Table 1 and Figure 2, the distribution of the absolute values in P are different to those in D. In particular, the distribution of the values of P and S are different. Supplementary Figures S2 and S5 and Tables S4 and S7 contain the same analysis for the yeast and human data.
Table 1.
Statistical summary of the correlation matrices from the R.leguminosarum data
Correlation
Min
1st q
Median
3rd q
Max
Mean
D
0.02
0.22
0.28
0.36
0.99
0.30
|P|
0.00
0.08
0.17
0.29
1.00
0.20
P
−0.86
−0.17
−0.01
0.17
1.00
0.00
S
−0.92
−0.27
−0.14
0.28
0.99
0.00
Fig. 2.
Distribution of the entries of the correlation matrices from the R.leguminosarum dataset. (A) Distribution of distance correlations in D (blue), and the absolute of the Pearson correlations (red). (B) Distribution of signed distance correlation S (blue), and Pearson correlation P (red).
Distribution of the entries of the correlation matrices from the R.leguminosarum dataset. (A) Distribution of distance correlations in D (blue), and the absolute of the Pearson correlations (red). (B) Distribution of signed distance correlation S (blue), and Pearson correlation P (red).Statistical summary of the correlation matrices from the R.leguminosarum data
3.2 Gene coexpression networks
We first estimate, using COGENT (Bozhilova ), the optimal thresholds , and to construct the networks , and from the matrices S, P, and I, respectively. To analyse and compare the networks, it is perhaps more intuitive to compare the networks using their edge density than with the threshold used to produce them. For this, we construct networks for a range of values of θ. For each density (and the θ that produced it), we evaluate its consistency score using Eq. 9. This score is related to the self-consistency of the network. Figure 3 shows the value of as a function of the density of the networks.
Fig. 3.
Self-consistency scores of R.leguminosarum networks with edge densities between 0 and 0.3. The blue line with circles shows the scores of networks obtained using signed distance correlations; the red line with triangles shows the score of networks using Pearson correlations; the green line shows the scores of the networks obtained using MI. The dashed vertical lines indicate the density of the most self-consistent network for each type of correlation.
Self-consistency scores of R.leguminosarum networks with edge densities between 0 and 0.3. The blue line with circles shows the scores of networks obtained using signed distance correlations; the red line with triangles shows the score of networks using Pearson correlations; the green line shows the scores of the networks obtained using MI. The dashed vertical lines indicate the density of the most self-consistent network for each type of correlation.The highest score from the signed distance correlation networks is , where , and density . The highest score from the Pearson correlation networks is , where , and density . The highest score from the mutual information networks is , where , and density . We also create the networks from P and from I to match the edge density of , and the network from S to match the edge density of and . Table 2 contains a statistical summary of the networks. The discrepancy in the number of edges in and is due to several pairs of genes having the same mutual information value. For both edge densities, the networks retrieved using our signed distance correlation matrix S ( and ) have a smaller, denser, and with higher global clustering coefficient largest connected component (LCC) than the networks obtained using Pearson correlation ( and ). The LCC of the networks obtained using mutual information ( and ) are smaller and denser than those from density-matching networks but they show the lowest global clustering coefficient. See Supplementary Tables S5 and S8 for the gene coexpression networks for the yeast and human data.
Table 2.
Statistical summary of R.leguminosarum coexpression networks. LCC denotes largest connected component
Network
Number of edges
Number of vertices in LCC
Edge density LCC×100
Global clustering coeff. LCC
NS(dS)
313 348
6431
1.52
0.570
NS(dP)
406 977
6688
1.82
0.570
NP(dS)
313 348
6697
1.40
0.544
NP(dP)
406 977
6880
1.72
0.543
NI(dS)
317 014
5993
1.77
0.279
NI(dI)
414 140
6084
2.24
0.284
Statistical summary of R.leguminosarum coexpression networks. LCC denotes largest connected component
3.3 Network evaluation
We perform two evaluations of the networks from signed distance correlation, Pearson correlation, and mutual information: self-consistency and biological content. We are interested in the self-consistency of the networks because it reflects their ability to cope with changes in the data used to generate them. If our network is more self-consistent, then we can have greater confidence in the biological conclusions we draw from it, even if the data is imperfect or noisy. The Jaccard similarity in Eq. 8 measures the similarity between networks generated from overlapping, non-identical subsets of a dataset using the same network construction method. The higher the similarity, the higher the internal consistency of the method and the more self-consistent the network obtained from applying it is. We measure the self-consistency , obtained from computing the average Jaccard similarity of networks from different randomized subsets of data and subtracting the density of ( and for Pearson correlation and mutual information, respectively). As we show in Figure 3, the self-consistency scores of the networks from signed distance correlations are consistently higher than in the Pearson and mutual information networks over an interval of densities that includes the maxima for all three methods. The scores obtained for the mutual information networks are the lowest. From this analysis, we conclude that signed distance correlation networks are more self-consistent than networks based on Pearson correlation or mutual information. Even the optimal threshold for the Pearson matrix P produces a less self-consistent network than a signed distance correlation network with a matching edge density (); the same applies for mutual information. See Supplementary Figures S3 and S6 for the same analysis on the yeast and human data, with similar results.To evaluate the biological content of a network, we add the STRING scores of the edges using: all the information in STRING (C), only coexpression information (), and everything except coexpression information (), and then divide by the number of edges in the network, as described in Section 2.5. Table 3 and Figure 4 (and Supplementary Fig. S1) show the results for all the networks, and the mean scores from random networks. In every case, the signed distance correlation networks contain more biological information than networks from Pearson correlation, those obtained using mutual information, and the randomized networks. See Supplementary Tables S6 and S9 and Figures S4 and S7 for a similar analysis on the human and yeast data, with similar results. The highest difference of the networks with the randomized networks occurs when we only use coexpression information. The scores of , and are 10, 8.5, and 8.7 times higher than the expected ones for random networks. This is not surprising because we have built the networks using gene expression information. However, the coexpression data used to construct these networks is different to the data in STRING. The scores excluding the coexpression information are 3.7, 3.2, and 3.1 times higher than in the random networks. This is remarkable because this assessment is performed on data that is completely different than the data used to construct the networks. Hence, by applying our pipeline to gene expression data, we can identify new types of relationships between proteins (and genes). These results demonstrate the power of gene coexpression networks to predict functional interactions, especially when constructed using signed distance correlation.
Table 3.
Evaluation of the biological content of the networks with STRING
Network
All of STRING information (C)
Only coexpression information (C†)
All information except coexpression (C‡)
NS(dS)
29.61
9.02
26.55
NP(dS)
25.23
7.63
22.72
NI(dS)
24.68
7.81
22.01
RE dS
7.49 ± 0.08
0.90 ± 0.02
7.09 ± 0.08
NS(dP)
26.79
7.64
24.12
NP(dP)
23.32
6.61
21.08
NI(dI)
22.39
6.60
20.07
RE dP
7.54 ± 0.08
0.91 ± 0.03
7.13 ± 0.08
Note: RE indicates the expected (mean) result based on random networks with the indicated edge density and its SD. The values in bold correspond to the highest scores for each set of networks and confidence scores.
Fig. 4.
Scores obtained for the R.leguminosarum gene coexpression networks using STRING. All panels show the score for the different networks in the y axis, and the network density on the x axis. The scores are the result of adding up the confidence scores with all evidence (C), only coexpression evidence () and everything excluding coexpression () from STRING associated with the edges in the networks, each computed using different of information. The black box plots correspond to the scores obtained by 30 random networks. Blue circles, red triangles, and green diamonds represent signed distance correlation, Pearson correlation, and mutual information respectively.
Scores obtained for the R.leguminosarum gene coexpression networks using STRING. All panels show the score for the different networks in the y axis, and the network density on the x axis. The scores are the result of adding up the confidence scores with all evidence (C), only coexpression evidence () and everything excluding coexpression () from STRING associated with the edges in the networks, each computed using different of information. The black box plots correspond to the scores obtained by 30 random networks. Blue circles, red triangles, and green diamonds represent signed distance correlation, Pearson correlation, and mutual information respectively.Evaluation of the biological content of the networks with STRINGNote: RE indicates the expected (mean) result based on random networks with the indicated edge density and its SD. The values in bold correspond to the highest scores for each set of networks and confidence scores.The analysis for networks with d and d tells a similar story; the scores obtained by the signed distance correlation network are higher than the scores from the Pearson correlation network and the mutual information network , despite d and d being the optimal edge density for Pearson correlation and mutual information. We highlight the fact that in absolute terms (i.e. not dividing the scores by the number of edges in the network) when we use only coexpression information (), the score of is higher than the score of , even though the former has fewer edges than the latter, and therefore its score is the result of adding fewer terms.For the three datasets we analyse, signed distance correlation networks perform better than Pearson correlation and mutual information networks with matching densities, according to both evaluations and for all tested edge densities.
4 Discussion
In this work, we have introduced signed distance correlation, and presented a method to construct networks in a self-consistent way exclusively from gene expression data. This method has three main steps: data pre-processing, computing correlations, and thresholding. These steps combine well-established methods such as quantile normalization, and the use of COGENT (Bozhilova ) to identify the optimal threshold.Distance correlation is an intuitive approach to study gene expression because it relies on the differences in the expression between samples. By incorporating signs into distance correlation, we can also differentiate between positive and negative relationships, and maintain the advantages of distance correlation.We apply our method to data from R.leguminosarum, yeast, and human. In all cases, our method produces networks that are more self-consistent than using Pearson correlation and mutual information. The reason why self-consistency is so important is that it ensures that our results are robust to changes or noise in the data. Therefore, when we cannot assess the biological significance of a network directly, we can use its self-consistency as an indication of biological significance. Networks from signed distance correlation also capture more biological information than networks from Pearson correlation and mutual information, as shown by our evaluation using STRING. In the case of R.leguminosarum, the signed distance correlation network (using an optimal threshold found with COGENT) captures almost four times more biological information than random networks. The amount of captured biological information is less if we use Pearson or Spearman correlations (Supplementary Fig. S1) or mutual information to build the networks.These results give a weak indication that self-consistent networks derived from biological data might capture more biological information than those with less stability. Therefore, if we cannot assess the biological significance of a network directly, measuring its self-consistency may serve as a proxy. The use of COGENT to select the threshold values requires a sufficiently strong signal in the association between the expression values of different genes. For example, using Euclidean distance, scaled by the square root of the sample size, did not yield sufficient signal, so that networks created by the subsampling method had no more overlap than expected at random. These networks also did not capture much biological information (data not shown).We applied our method to construct, to our knowledge, the first gene coexpression network for R.leguminosarum. This network promises to reveal rich biological information that will illuminate our understanding of plant–bacteria interactions and nitrogen fixation, and it is therefore the starting point for further investigations of the biological mechanisms of this organism. In particular, we plan to identify groups of genes in R.leguminosarum which are highly connected in the network and associate them with specific biological processes. To do so, we will make use of community detection techniques and new experimental data. For the human liver dataset one could explore predicting disease-related biological information; see for example Song ; Chen ; Li .Finally, we have showcased our method on gene expression datasets from different organisms obtained using different techniques: microarrays (R.leguminosarum), RNA-Seq (yeast), and single-cell RNA-Seq (human). However, the methods that we have developed are general, and can also be used to construct networks in a vast range of domains, such as, for example, economics (Wang ), neuroscience (Bernhardt ), climatology (Donges ), or any discipline where networks can be constructed from correlation data.Click here for additional data file.
Authors: Ziv Bar-Joseph; Georg K Gerber; Tong Ihn Lee; Nicola J Rinaldi; Jane Y Yoo; François Robert; D Benjamin Gordon; Ernest Fraenkel; Tommi S Jaakkola; Richard A Young; David K Gifford Journal: Nat Biotechnol Date: 2003-10-12 Impact factor: 54.908
Authors: R Karunakaran; V K Ramachandran; J C Seaman; A K East; B Mouhsine; T H Mauchline; J Prell; A Skeffington; P S Poole Journal: J Bacteriol Date: 2009-04-17 Impact factor: 3.490
Authors: Francesco Pini; Alison K East; Corinne Appia-Ayme; Jakub Tomek; Ramakrishnan Karunakaran; Marcela Mendoza-Suárez; Anne Edwards; Jason J Terpolilli; Joshua Roworth; J Allan Downie; Philip S Poole Journal: Plant Physiol Date: 2017-05-11 Impact factor: 8.340
Authors: Damian Szklarczyk; Annika L Gable; David Lyon; Alexander Junge; Stefan Wyder; Jaime Huerta-Cepas; Milan Simonovic; Nadezhda T Doncheva; John H Morris; Peer Bork; Lars J Jensen; Christian von Mering Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
Authors: Javier Pardo-Diaz; Philip S Poole; Mariano Beguerisse-Díaz; Charlotte M Deane; Gesine Reinert Journal: Netw Sci (Camb Univ Press) Date: 2022-06-16