
Application of graphical lasso in estimating network structure in gene set.

Yu-Jyun Huang1, Tzu-Pin Lu1,2, Chuhsing Kate Hsiao1,2.   


Year:  2020        PMID: 33437755      PMCID: PMC7791228          DOI: 10.21037/atm-20-6490

Source DB:  PubMed          Journal:  Ann Transl Med        ISSN: 2305-5839



Introduction: the importance of structure learning in gene set

Gene set analysis or pathway analysis tools play an important role in exploring the relationship between a group of genes and phenotypes of interest (1,2). How the genes in such a group work cooperatively to regulate or stimulate a complex biological function in different cellular states, however, often remains a mystery. Based on scientific studies or text mining techniques (3,4), several public databases, such as KEGG (5), BioGRID (6) and STRING (7), have already annotated biological functions as pathways and the interactions within the molecular network. It is therefore possible to examine whether the correlation estimated from the raw data agrees with the information retrieved from those public databases. The following questions may arise: “Can we directly estimate the interactions or learn the structural relationship among a group of genes from the data alone?”; “Is there any statistical implementation that can help to answer this question?” Answers to these questions may allow researchers to construct the gene network and, most importantly, to discover novel relationships within a group of genes (8-11). The graphical lasso (12) is a widely used approach in structure learning research and a useful tool for answering the above questions (13). It was proposed to estimate a sparse graph by applying the lasso penalty to the precision matrix of a multivariate normal distribution. Here we discuss how to estimate the network structure based on the multivariate normal distribution, and then introduce the rationale and the estimation procedure of the graphical lasso. Finally, we demonstrate the graphical lasso algorithm with a real cancer application and conclude with a brief summary.

Structure learning with graphical lasso

Gene set analysis is often applied to microarray gene expression levels to investigate the association between a set of genes and a complex trait after a collection of differentially expressed genes has been identified (14-16). It is common to assume that the gene expression values in the gene set follow a multivariate normal distribution, also known as the Gaussian graphical model for a gene network. This assumption is popular because of its theoretical statistical properties. For a group of P genes, assume the P-dimensional vector X follows a multivariate normal distribution, $X \sim N_P(\mu, \Sigma)$. Each component $X_i$ of this vector is a random variable representing the expression value of gene i. This distribution can be used to construct the network of these P genes. If a network follows this distribution, then the absence of an edge between two nodes (two random variables) implies that the two random variables are conditionally independent given all other variables. In fact, information about this conditional independence can be obtained from the precision matrix $\Theta$, where $\Theta = \Sigma^{-1}$, in this multivariate normal distribution for $X = (X_1, X_2, \ldots, X_P)$. Specifically, if the (i,j) entry of $\Theta$ equals zero, then $X_i$ and $X_j$ are conditionally independent given the remaining variables. By assuming a multivariate normal distribution for the multi-dimensional gene expression values, the construction of the gene network structure can be based on the estimation of the precision matrix of the multivariate normal distribution. The mathematical proof and descriptions are detailed in (17). The graphical lasso is a fast and efficient algorithm for estimating inverse covariance matrices (12,18). It is similar to the original lasso approach (19), but the graphical lasso selects which edges exist in a network rather than which variables enter a regression model.
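The role of the precision matrix can be made concrete through partial correlations. The identity below is a standard property of the multivariate normal distribution rather than a result specific to this article. Here $\theta_{ij}$ denotes the (i,j) entry of the precision matrix $\Theta = \Sigma^{-1}$, and $\rho_{ij \mid \text{rest}}$ is notation introduced here (an assumption of this sketch) for the partial correlation between $X_i$ and $X_j$ given all other components.

```latex
\rho_{ij \mid \text{rest}}
  \;=\; -\,\frac{\theta_{ij}}{\sqrt{\theta_{ii}\,\theta_{jj}}},
\qquad\text{so}\qquad
\theta_{ij} = 0
  \;\iff\; \rho_{ij \mid \text{rest}} = 0
  \;\iff\; X_i \perp X_j \mid X_{\setminus \{i,j\}} .
```

The last equivalence relies on the normality assumption: for other distributions, a zero partial correlation does not in general imply conditional independence.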
The graphical lasso adopts a convex optimization strategy to estimate the precision matrix $\Theta$ by maximizing the following penalized log-likelihood:

$$\log \det \Theta - \mathrm{tr}(S\Theta) - \lambda \lVert \Theta \rVert_1,$$

where $\lVert \Theta \rVert_1$ is the element-wise $L_1$ norm of the precision matrix, S is the sample covariance matrix, and λ is the tuning parameter controlling the sparsity of the network. After obtaining the estimated precision matrix $\hat{\Theta}$ from the graphical lasso algorithm, the network can be constructed based on the non-zero elements in $\hat{\Theta}$. Figure 1 is a simple example illustrating the equivalence between the estimated precision matrix and the corresponding network structure. Note that no edge appears between nodes X1 and X3 or between X2 and X4, since the corresponding two entries in $\hat{\Theta}$ are zero.
Figure 1

A simple example for illustration.

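The mapping from an estimated precision matrix to a network can be sketched in a few lines of Python. This is only an illustration: the matrix values below are hypothetical (they are not the values shown in Figure 1), and the function name `edges_from_precision` and the tolerance `tol` are assumptions made for this sketch. Only the zero pattern, with no edge between X1 and X3 or between X2 and X4, follows the example described in the text.

```python
from itertools import combinations


def edges_from_precision(theta, names, tol=1e-8):
    """List the undirected edges implied by non-zero off-diagonal entries."""
    return [(names[i], names[j])
            for i, j in combinations(range(len(names)), 2)
            if abs(theta[i][j]) > tol]


# Hypothetical 4 x 4 precision matrix (illustrative values only).
# The (X1, X3) and (X2, X4) entries are zero, so those pairs get no edge.
theta_hat = [
    [ 2.0, 0.6, 0.0, -0.5],
    [ 0.6, 2.0, 0.4,  0.0],
    [ 0.0, 0.4, 2.0,  0.7],
    [-0.5, 0.0, 0.7,  2.0],
]
genes = ["X1", "X2", "X3", "X4"]

edges = edges_from_precision(theta_hat, genes)
print(edges)
```

A small tolerance is used instead of an exact comparison with zero because some solvers return tiny non-zero values for entries that are effectively zero.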

Real data application: the lung cancer study

The expression data from a lung cancer study (20) are used here to demonstrate the graphical lasso in estimating the network structure for a selected gene set. This data set was downloaded from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/) and the corresponding accession number in the NCBI data portal is “GSE19804”. The data set contains gene expression values extracted from 60 paired tumor and normal tissues. Forty-seven tumor tissue samples categorized as tumor stage 1 or 2 were selected for the following analysis. The STRING database (https://string-db.org/; Version 11.0) was used to determine the gene set forming the protein-protein interaction (PPI) network of the EGFR gene. EGFR was chosen because many studies have shown that it is associated with tumor progression in lung cancer (21,22). In addition, several therapeutic drugs have already been developed to target EGFR for lung cancer treatment (23-25). A novel interaction between these genes may help to unravel the underlying mechanism or improve therapeutic treatments for cancer patients. The following analysis used the gene expression values of 11 genes in the 47 tumor tissues. The expression value is the average probe log2 RMA signal intensity. The analysis can be conducted with the function “glasso” in the R package “glasso” (26). The input is the sample variance-covariance matrix, which can be calculated directly with the base R function “var”, and the lambda tuning parameter is set through the option “rho” of the “glasso” function. Figure 2 shows the resulting network structures constructed by the graphical lasso approach for different values of lambda. As the lambda value increases, the degree of sparsity in the network also increases.
Some degree of sparsity in the network can reflect the underlying biological reality, and a sparse network is often easier to interpret, particularly in the high-dimensional setting (27). Some edges in the estimated network, e.g., the connection between EGFR and GRB2, are consistent with the reports in (5) and (7). Furthermore, the results indicate that GRB2 and CBL have more connections than the other genes in the estimated graph, implying that these two genes and their immediate neighboring nodes may form a potential target for future lung cancer genetics research.
Figure 2

Estimated network structure of the 11 protein-protein interaction (PPI) genes of epidermal growth factor receptor (EGFR) with the graphical lasso. The upper panel contains heatmaps of the estimated precision matrices with different lambda values. The lower panel lists the corresponding graph structures. Note that if the entry in the estimated precision matrix is zero, then the corresponding paired nodes will not have a connecting edge between them in the network structure.

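The effect of the tuning parameter can be illustrated with a minimal sketch. The article's analysis uses the R function “glasso” with its “rho” option; the Python code below is not that implementation but a simplified, hand-written block coordinate descent in the spirit of the graphical lasso, written here only for illustration. The data are simulated for four hypothetical genes (they are not the GSE19804 values), the penalty is applied to every entry of the precision matrix, and all function names and iteration counts are assumptions of this sketch.

```python
import random


def soft_threshold(x, t):
    """Soft-thresholding operator used by the lasso."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def sample_covariance(rows):
    """Sample covariance matrix (divisor n - 1) of a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


def graphical_lasso(S, lam, sweeps=50, inner=50):
    """Simplified block coordinate descent for the penalized log-likelihood."""
    p = len(S)
    # W estimates the covariance matrix; its diagonal is fixed at S_ii + lam.
    W = [[S[i][j] + (lam if i == j else 0.0) for j in range(p)] for i in range(p)]
    # B[j][k] is the lasso coefficient of gene k when gene j is the response.
    B = [[0.0] * p for _ in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            others = [k for k in range(p) if k != j]
            beta = B[j]
            for _ in range(inner):  # coordinate-descent lasso for column j
                for k in others:
                    fit = sum(W[k][m] * beta[m] for m in others if m != k)
                    beta[k] = soft_threshold(S[k][j] - fit, lam) / W[k][k]
            for k in others:  # refresh row and column j of W
                W[k][j] = W[j][k] = sum(W[k][m] * beta[m] for m in others)
    # Convert the fitted coefficients into precision-matrix entries.
    theta = [[0.0] * p for _ in range(p)]
    for j in range(p):
        others = [k for k in range(p) if k != j]
        theta[j][j] = 1.0 / (W[j][j] - sum(W[k][j] * B[j][k] for k in others))
        for k in others:
            theta[k][j] = -B[j][k] * theta[j][j]
    return theta


def count_edges(theta):
    """Number of non-zero entries above the diagonal, i.e. edges in the graph."""
    p = len(theta)
    return sum(1 for i in range(p) for j in range(i + 1, p) if theta[i][j] != 0.0)


# Simulated expression values for 4 hypothetical genes in 47 samples.
random.seed(1)
rows = []
for _ in range(47):
    g1 = random.gauss(0, 1)
    g2 = 0.8 * g1 + random.gauss(0, 0.6)
    g3 = 0.8 * g2 + random.gauss(0, 0.6)
    g4 = random.gauss(0, 1)  # generated independently of the other three
    rows.append([g1, g2, g3, g4])

S = sample_covariance(rows)
max_offdiag = max(abs(S[i][j]) for i in range(4) for j in range(4) if i != j)

# The last value equals the largest absolute off-diagonal entry of S,
# which is enough to remove every edge in this sketch.
lambdas = [0.05, 0.2, max_offdiag]
counts = [count_edges(graphical_lasso(S, lam)) for lam in lambdas]
for lam, c in zip(lambdas, counts):
    print(f"lambda = {lam:.3f}: {c} edge(s)")
```

Once lambda reaches the largest absolute off-diagonal entry of S, every soft-thresholded coefficient is zero and the estimated graph is empty. For real analyses a tested implementation such as the R package cited in the text is preferable to this sketch.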

Brief summary

This report discusses the importance of structure learning in gene set analysis. The graphical lasso approach was introduced for constructing the network structure, and a real data set from a lung cancer study was used to demonstrate it. The main advantage of the graphical lasso is that it can reconstruct the network from the raw data without incorporating other existing network profiles. By applying the graphical lasso in gene set analysis, we may discover novel interactions among a set of genes and gain insight into the complex biological mechanism.