Literature DB >> 25823587

Gene Network Reconstruction by Integration of Prior Biological Knowledge.

Abstract

With the development of high-throughput genomic technologies, large, genome-wide datasets have been collected, and the integration of these datasets should provide large-scale, multidimensional, and insightful views of biological systems. We developed a method for gene association network construction based on gene expression data that integrate a variety of biological resources. Assuming gene expression data are from a multivariate Gaussian distribution, a graphical lasso (glasso) algorithm is able to estimate the sparse inverse covariance matrix by a lasso (L1) penalty. The inverse covariance matrix can be seen as direct correlation between gene pairs in the gene association network. In our work, instead of using a single penalty, different penalty values were applied for gene pairs based on a priori knowledge as to whether the two genes should be connected. The a priori information can be calculated or retrieved from other biological data, e.g., Gene Ontology similarity, protein-protein interaction, gene regulatory network. By incorporating prior knowledge, the weighted graphical lasso (wglasso) outperforms the original glasso both on simulations and on data from Arabidopsis. Simulation studies show that even when some prior knowledge is not correct, the overall quality of the wglasso network was still greater than when not incorporating that information, e.g., glasso.

Entities: Chemical Species

Keywords: gene network; graphical model; prior knowledge; systems biology and gene expression

Mesh：

Year: 2015 PMID： 25823587 PMCID： PMC4478538 DOI： 10.1534/g3.115.018127

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

A key challenge for biology is to understand the complex molecular interactions of genes in a living cell (Barabasi and Oltvai 2004). Analysis of gene networks provides a global view of these interactions and can provide biologists with a better understanding of complex biological systems (Friedman 2004; Karlebach and Shamir 2008). Also, by the use of gene networks, the guilt-by-association paradigm can be applied to infer the biological function of unknown genes (Wolfe ). Gene expression data, which is relatively easy to generate or to collect from databases, has been used to infer gene networks. A variety of gene network reconstruction methods based on gene expression data have been developed, e.g., regression, mutual information, correlation Bayesian network, meta predicators, and others (Marbach ). Here, we focus on the Gaussian graphical model (Dempster 1972). In this approach, gene expression is assumed to follow a multivariate Gaussian distribution N(μ, Σ), and Gaussian Markov random fields have been used to infer the structure of networks from gene expression data (Liu and Ihler 2011). A gene association network can be seen as an undirected graph G = (V, E), where V = {v is the vertex set representing genes and E = {e is the edge set representing association relations between pairs of genes. In an unweighted network, if e = 0, genes i and j are conditionally independent given all other genes. The Hammersley–Clifford theorem implies that zeros in the inverse covariance matrix of a multivariate Gaussian distribution indicate absent edges in the corresponding graphical model (Besag 1974; Lauritzen 1996). Therefore, the problem of estimating the gene association network based on gene expression data can be transferred to estimating Σ or selecting nonzero entries in Σ. Many studies, however, have used Pearson correlation coefficients between pairs of genes to infer network structure; Pearson correlation coefficients correspond to the covariance matrix Σ and cannot infer the true structure. An accurate inference of biological network using Gaussian graphical model is challenging for two main reasons. The first is that most genome-scale datasets are highly dimensional. Given p genes, there are possible p(p − 1)/2 edges, but gene expression data often have a limited number of samples. When the gene or locus number is much higher than the sample size in a dataset, a traditional maximum-likelihood estimate for covariance variance is often not appropriate (Schafer and Strimmer 2005; Uhler 2012). Alternative methods have been developed for highly dimensional datasets (Meinshausen and Buhlmann 2006; Friedman ; Yuan 2010; Cai ; Ravikumar ), most of which have used Lasso, a shrinkage and selection method using L regularization, a popular approach to deal with highly dimensional datasets (Tibshirani 1996; Hastie ). These methods have been shown to be able to asymptotically and consistently estimate the set of non-zero elements of Σ. The second major difficulty is the lack of efficient methods to integrate multiple levels of biological data to enhance model accuracy. Increasing amounts of biological data have been collected, especially with the development of high-throughput technologies, e.g., microarrays and next-generation sequencing, which provide an unprecedented opportunity to explore biological systems. We now have access to genomic, epigenomic, transcriptomic, proteomic, metabolomic, and phenomic data, and careful analysis and integration of these genome-scale datasets should provide large-scale, multidimensional, and insightful views into biological systems (Joyce and Palsson 2006; Hawkins ). Also, we have a variety of resources that serve as indicators of the functional association of any two genes, e.g., Kyoto Encyclopedia of Genes and Genomes pathway, Gene Ontology (GO) similarity, protein−protein interaction, co-occurrence in literature, gene network generated from other methods, and association of orthologous genes. Integrating potentially reliable information from other sources should increase the accuracy of network reconstruction (Imoto ; Mostafavi ; Christley ; Wang ; Chen ). Here, we present a statistical algorithm based on the Gaussian graphical model for gene association network reconstruction using gene expression data, which is able to solve these two challenges to allow a more thorough understanding of complex biological systems.

Materials and Methods

Let Θ = Σ and S be the empirical covariance matrix, then by applying L penalty to the original log-likelihood for estimating Θ, the problem of estimating Θ becomes finding the Θ which maximize the formulawhere tr is the trace, i.e., the sum of the elements on the matrix diagonal, and is the L norm, i.e., the sum of the absolute values of the elements of Σ, and ρ is the penalty parameter. When ρ = 0, it is the normal maximum likelihood. When ρ > 0, some elements in Θ will be shrunk to zero. The sparsity of the estimated graph increases when ρ increases. For a fixed ρ, the graphical lasso (glasso) algorithm can be used to quickly solve the equation using a block-coordinate method (Friedman ). The optimal ρ can be chosen empirically or tuned by cross-validation, Bayesian information criterion (BIC), or other methods (Friedman ; Foygel and Drton 2010; Liu ). The Lasso regression can be interpreted as a Bayesian regression with a Laplace prior distribution, Lp(0, ρ). In Bayesian statistics, prior information can be integrated into the model by changing the parameter in the prior distribution, and the posterior distribution should better approximate the true distribution of the data. Therefore, for the glasso algorithm, instead of using a single penalty parameter, it is reasonable to specify different amounts of penalties for different elements in Θ based on a priori information as to whether two genes are associated, or not. A smaller penalty can be given if a priori information indicates that they are linked. Then, the log-likelihood becomeswhere P is the prior information matrix and * indicates component-wise multiplication, and . Larger values for an element in P generate larger penalty values and represent weaker association of two genes based on priori information. We name this updated version of glasso as weighted graphical lasso (wglasso), which is not only stable for high-dimensional datasets by utilization of Lasso but also more accurate than the original glasso through the integration of prior information. The prior matrix can be obtained in numerous ways. For example, the GO semantic similarity between genes can be calculated using tools GOSemSim (Yu ) or GOssTo (Yu ), Then, the inverse of similarities can be implemented in the prior matrix, as a high similarity means a low penalty in our model. Other types of networks, e.g., protein-protein interaction and gene regulatory networks, can be directly transferred to a prior matrix after inverse transformation of the weights of the network. Some databases estimate the functional association between genes using data mining and text mining and can be valuable and reliable resources for generating a prior matrix, e.g., STING (Von Mering ) and AraNet (Lee ). Although previous efforts to use lasso for incorporating a priori knowledge have been made, none is identical to our proposed method. For example, most of the previously described algorithms partitioned the gene network reconstruction into a number of linear regression problems (Anjum ; Charbonnier ; Wang ); however, it has been shown this type of algorithms is less accurate than glasso (Friedman ).

Results and Discussions

Simulation studies demonstrate that wglasso is more reliable than glasso

Because most biological networks are scale-free (Barabasi and Oltvai 2004), a scale-free network with p genes and gene expression data with n samples based on the network topology are generated using the “huge” package in R (Zhao ). A prior network with precision ratio x, x ∈ [0,1] also was generated. A prior network is a weighted network, in which each edge corresponds to the prior information that two genes are functionally associated and the edge weight indicates the strength of the information. Precision ratio = x with x > 0 means p*100 percent of prior edges are correct, and the remaining (1 − p)*100 edges are incorrect. The edge number of the prior network is set to the same number of true edges. The edge weights of the prior network were randomly generated from the uniform distribution, U(0,1). A special case is precision ratio = 0, which indicates no prior information and the prior network will not contain any edges. In this case, wglasso is equivalent to glasso. Once the empirical covariance matrix S is calculated from the expression data and the prior information matrix P retrieved from the prior network, we estimate the true network using glasso and wglasso. In order to find the optimal penalty parameter, ρ, wglasso networks were estimated under a sequence of ρ values from 0 to 1 with 0.01 intervals. Then we selected the network most similar to the true network. The Matthews correlation coefficient (MCC) was used to measure the similarity between the estimated and true networks (Matthews 1975), and was calculated as follows based on 2 × 2 contingency table,where TP, TN, FP, and FN are the number of true positives, true negatives, false positives, and false negatives, respectively. A penalty parameter that maximized MCC (maxMCC) was considered optimal. A single simulation with 100 genes, 50 samples and a 0.7 precision ratio is shown in Figure 1 and Figure 2.

Figure 1

Figure 2

Parameter selection process of glasso and wglasso Matthews correlation coefficient (MCC; solid lines) and Bayesian information criterion (BIC; dashed lines) values of the estimated networks under different penalty parameters (Rho) using glasso (blue lines) and wglasso (red lines). Triangles are points that maximize MCC values and minimize BIC values under the corresponding penalty parameters, considered to be optimal penalty parameters.

Demonstration of a single simulation using glasso and wglasso. (A) The true network with scale-free property. (B) Heatmap of gene expression data. The X-axis represents genes and the Y-axis represents samples. (C) The prior network, representing prior information of genes’ associations with precision ratio 0.7. Black edges are correct association information; edges that exist in prior information but not in the true network are in red, and gray edges are missed associations in the prior information. (D) The estimated network using glasso. The edge colors have the same meaning as prior network. (E) The estimated network using wglasso. The edge colors have the same meaning as prior network. Parameter selection process of glasso and wglasso Matthews correlation coefficient (MCC; solid lines) and Bayesian information criterion (BIC; dashed lines) values of the estimated networks under different penalty parameters (Rho) using glasso (blue lines) and wglasso (red lines). Triangles are points that maximize MCC values and minimize BIC values under the corresponding penalty parameters, considered to be optimal penalty parameters. Additional simulations were performed with a variety of sample sizes and precision ratios in order to systematically evaluate wglasso. The simulation was repeated 100 times for each combination of sample size and precision ratio. The results show that reconstructed networks with a priori information had a significantly greater maxMCC than without a priori information, indicating superior performance of wglasso (Figure 3). In most cases, wglasso outperformed glasso, even when most of the prior knowledge was incorrect. If the sample size was high and the precision ratio low, e.g., sample size = 300 and precision ratio = 0.2, it is possible that an excess of incorrect prior information would be harmful. However, in real situations, the sample size is often much lower than the gene number and misleading a priori information based on experimental studies is likely to be low. Moreover, highly accurate prior knowledge results in more accurately reconstructed networks. In practice, it would be possible to integrate multiple resources to increase the reliability of prior information.

Figure 3

Simulation to compare glasso and wglasso. The simulation was repeated 100 times for each combination of sample sizes (n) and precision ratio of the prior information. Gene number = 100. In each simulation, the maximum Matthews correlation coefficient (maxMCC) of estimated networks from different penalty parameters is recorded. The Y-axis shows the mean of maxMCC from the 100 simulations. The error bars are 95% confidence intervals.

Tuning the penalty parameter using BIC

In real situations, the true network is unknown; thus, maxMCC cannot be used to select the optimal penalty parameter. Instead, one can use cross-validation, BIC, extended BIC (Foygel and Drton 2010), and a stability approach for regularization selection (Liu ). Cross-validation, stability approach for regularization selection, and BIC conduct many subsampling or permutations and thus are computationally intensive. Extended BIC tends to select very sparse networks, and often no edges are found. For a scale-free network, another method is to test whether the log transformed degree distribution has a linear relationship (Langfelder and Horvath 2008). The R-squared values from linear regressions generally increase as the network gets sparser, and the optimal fit is selected at the point where the increase trend slows down. The selection is usually done visually, so the method is not practical for simulations. Also, the optimal parameter varies by dataset and individual interpretation. BIC is one of the standard methods for choosing the regularization parameter, and works well in our scale-free network simulations (Figure 2 and Figure 4). The BIC for Gaussian graphical model takes the formwhere is the edge number, n the sample size, and the log-likelihood function simplified fromSimulation showed that the difference between the MCC of reconstructed networks based on BIC and the maxMCC was small, especially when sample size was relatively small (Figure 2 and Figure 4).

Figure 4

Selection of the optimal penalty parameter based on minimum Bayesian information criterion (BIC) simulation was repeated 100 times for each combination of sample sizes (n) and precision ratio of the prior information. Gene number = 100. For each simulation, the maximum Matthews correlation coefficient (MCC) value (maxMCC) of estimated networks from different penalty parameters is shown. The MCC value of the estimated network based on minimum BIC is also recorded, called BICMCC. The difference between two values (maxMCC − BICMCC) is used to evaluate two parameter selection methods. The Y-axis shows the mean of maxMCC − BICMCC from the 100 simulations. The error bars represent 95% confidence intervals.

Application to experimental data

We applied wglasso to gene expression data from the Eukaryotic species, Arabidopsis thaliana. Gene expression data of 795 genes related to isoprenoid pathways from 118 microarray experiments were collected (Wille ). Prior information was obtained from AraNet, a probabilistic network of functional associations among 19,647 Arabidopsis genes (Lee ). A total of 701 of 795 genes have functional associations in AraNet. The edge weight is the likelihood score, calculated from variety of resources that indicates a functional association between two genes. The likelihood scores range from 0 to 5, so they were rescaled to a range of 0.5 to 1 using the formula,High likelihood scores will become low penalty values after rescaling. Pairs of genes without prior information have zero values in the prior matrix. The optimal penalty parameter was selected based on BIC. The network reconstructed using wglasso has 16,759 edges, whereas the glasso network has 19,331 edges. The MCC values were calculated by comparing the reconstructed networks with an independent benchmark network, which is the same benchmark set that was used to test the prediction performance of AraNet (Lee ). The MCC of the estimated network with wglasso was 0.184, higher than that with glasso, which was 0.033. By incorporating prior knowledge, weighted graphical lasso (wglasso) outperforms glasso both on simulation studies and data from Arabidopsis. Simulation studies showed that even when some prior knowledge was incorrect, the overall quality of network from wglasso network was higher than that from glasso. Moreover, the more accurate the prior knowledge, the better the reconstructed network. This method increases gene network reconstruction accuracy and will allow researchers to better study networks in complex biological systems and their interaction with external factors, e.g., the environment.

23 in total

Review 1. Next-generation genomics: an integrative approach.

Authors: R David Hawkins; Gary C Hon; Bing Ren
Journal: Nat Rev Genet Date: 2010-07 Impact factor: 53.242

Review 2. Network biology: understanding the cell's functional organization.

Authors: Albert-László Barabási; Zoltán N Oltvai
Journal: Nat Rev Genet Date: 2004-02 Impact factor: 53.242

3. Comparison of the predicted and observed secondary structure of T4 phage lysozyme.

Authors: B W Matthews
Journal: Biochim Biophys Acta Date: 1975-10-20

4. Rational association of genes with traits using a genome-scale gene network for Arabidopsis thaliana.

Authors: Insuk Lee; Bindu Ambaru; Pranjali Thakkar; Edward M Marcotte; Seung Y Rhee
Journal: Nat Biotechnol Date: 2010-01-31 Impact factor: 54.908

5. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics.

Authors: Juliane Schäfer; Korbinian Strimmer
Journal: Stat Appl Genet Mol Biol Date: 2005-11-14

Review 6. The model organism as a system: integrating 'omics' data sets.

Authors: Andrew R Joyce; Bernhard Ø Palsson
Journal: Nat Rev Mol Cell Biol Date: 2006-03 Impact factor: 94.444

7. Weighted-LASSO for structured network inference from time course data.

Authors: Camille Charbonnier; Julien Chiquet; Christophe Ambroise
Journal: Stat Appl Genet Mol Biol Date: 2010-02-01

8. Systematic survey reveals general applicability of "guilt-by-association" within gene coexpression networks.

Authors: Cecily J Wolfe; Isaac S Kohane; Atul J Butte
Journal: BMC Bioinformatics Date: 2005-09-14 Impact factor: 3.169

9. STRING: known and predicted protein-protein associations, integrated and transferred across organisms.

Authors: Christian von Mering; Lars J Jensen; Berend Snel; Sean D Hooper; Markus Krupp; Mathilde Foglierini; Nelly Jouffre; Martijn A Huynen; Peer Bork
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Incorporating existing network information into gene network inference.

Authors: Scott Christley; Qing Nie; Xiaohui Xie
Journal: PLoS One Date: 2009-08-27 Impact factor: 3.240

12 in total

1. Intra- and interregional coregulation of opioid genes: broken symmetry in spinal circuits.

Authors: Olga Kononenko; Vladimir Galatenko; Malin Andersson; Igor Bazov; Hiroyuki Watanabe; Xing Wu Zhou; Anna Iatsyshyna; Irina Mityakina; Tatiana Yakovleva; Daniil Sarkisyan; Igor Ponomarev; Oleg Krishtal; Niklas Marklund; Alex Tonevitsky; DeAnna L Adkins; Georgy Bakalkin
Journal: FASEB J Date: 2017-01-25 Impact factor: 5.191

2. piMGM: incorporating multi-source priors in mixed graphical models for learning disease networks.

Authors: Dimitris V Manatakis; Vineet K Raghu; Panayiotis V Benos
Journal: Bioinformatics Date: 2018-09-01 Impact factor: 6.937

3. Correlation Imputation in Single cell RNA-seq using Auxiliary Information and Ensemble Learning.

Authors: Luqin Gan; Giuseppe Vinci; Genevera I Allen
Journal: ACM BCB Date: 2020-09

4. CONE: Community Oriented Network Estimation Is a Versatile Framework for Inferring Population Structure in Large-Scale Sequencing Data.

Authors: Markku O Kuismin; Jon Ahlinder; Mikko J Sillanpӓӓ
Journal: G3 (Bethesda) Date: 2017-10-05 Impact factor: 3.154

Review 5. Gaussian and Mixed Graphical Models as (multi-)omics data analysis tools.

Authors: Michael Altenbuchinger; Antoine Weihs; John Quackenbush; Hans Jörgen Grabe; Helena U Zacharias
Journal: Biochim Biophys Acta Gene Regul Mech Date: 2019-10-19 Impact factor: 4.490

6. An Augmented High-Dimensional Graphical Lasso Method to Incorporate Prior Biological Knowledge for Global Network Learning.

Authors: Yonghua Zhuang; Fuyong Xing; Debashis Ghosh; Farnoush Banaei-Kashani; Russell P Bowler; Katerina Kechris
Journal: Front Genet Date: 2022-01-27 Impact factor: 4.599