Literature DB >> 25077716

Inference of radio-responsive gene regulatory networks using the graphical lasso algorithm.

Jung Hun Oh, Joseph O Deasy.   

Abstract

BACKGROUND: Inference of gene regulatory networks (GRNs) from gene microarray expression data is of great interest and remains a challenging task in systems biology. Despite many efforts to develop efficient computational methods, the successful modeling of GRNs thus far has been quite limited. To tackle this problem, we propose a novel framework to reconstruct radio-responsive GRNs based on the graphical lasso algorithm. In our attempt to study radiosensitivity, we reviewed the literature and analyzed two publicly available gene microarray datasets. The graphical lasso algorithm was applied to expression measurements for genes commonly found to be significant in these different analyses.
RESULTS: Assuming that a protein-protein interaction network obtained from a reliable pathway database is a gold-standard network, a comparison between the networks estimated by the graphical lasso algorithm and the gold-standard network was performed. Statistically significant p-values were achieved when comparing the gold-standard network with networks estimated from one microarray dataset and when comparing the networks estimated from two microarray datasets.
CONCLUSION: Our results show the potential to identify new interactions between genes that are not present in a curated database and GRNs using microarray datasets via the graphical lasso algorithm.

Entities:  

Mesh:

Year:  2014        PMID: 25077716      PMCID: PMC4110733          DOI: 10.1186/1471-2105-15-S7-S5

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

In recent years, there has been a great interest in identifying radio-responsive genes across the whole genome using gene microarray data in the field of radiation oncology. To develop new biomarkers for radiation exposure, Templin et al. used whole genome microarray and miRNA data generated from blood samples of patients who underwent total body irradiation in preparation for stem cell transplantation [1]. Rieger and Chu utilized oligonucleotide microarrays with cell lines collected from 15 healthy individuals to investigate the transcriptional response of 10,000 genes in DNA damage to ionizing radiation (IR) and ultraviolet (UV) radiation [2]. In another study, Rieger et al. explored transcriptional responses to radiation in lymphoblastoid cells collected from 57 patients and found 20 IR-responsive and 4 UV radiation-responsive genes predictive of radiation toxicity [3]. Eschrich et al. employed systems biology modeling to better understand radiosensitivity by identifying novel radiation-specific biomarkers [4]. With gene expression profiles from 48 human cancer cell lines, this biomarker discovery platform resulted in a key radiosensitivity network with 10 hub genes. In our previous study [5], we identified important radio-responsive genes using literature review and gene microarray data analysis. With a systems biology approach, we found a core radio-responsive protein interaction network and its key biological processes using gene ontology analysis. Gene regulatory networks (GRNs) provide simplified representations and easy interpretation of biological processes in an organism under given conditions [6]. However, inference of GRNs remains a major challenging problem in systems biology, although a number of approaches have been proposed [7]. Cai et al. employed sparse structural equation models (SEMs) that integrate gene expression and cis-expression quantitative trait loci data for improving inference accuracy and proposed a systematic inference method for SEM estimation [8]. Based on Bayesian analysis using a non-parametric Gaussian process modeling, a novel method for inferring GRNs was proposed by Aijö and Lähdesmäki [9]. This approach enables the use of both time-series and steady-state gene expression measurements to improve the inference of GRNs. Menéndez et al. used a Gaussian Markov Random Field (GMRF) to deal with the problem of reverse engineering in GRNs and applied the graphical lasso algorithm to effectively estimate undirected graphical models [10]. Applying a lasso penalty to the problem of inverse covariance matrix estimation facilitated a fast and efficient calculation. The graphical lasso algorithm, which uses a blockwise coordinate descent approach to estimate a sparse graphical model, was proposed by Friedman et al. [11]. In this study, we employed a multi-component filtering process, based on a systems biology approach that was proposed in our previous study [5], that narrows down the list of gene candidates and applied the graphical lasso algorithm to microarray datasets to infer radio-responsive GRNs. To estimate the accuracy of the network modeling, the estimated networks were compared with a reliable protein interaction network produced by a manually curated protein interaction database.

Methods

Datasets

In our previous study [5], we attempted to investigate all putative genes implicated in radiation response through literature review using PubMed and Scopus search engines and found a list of 221 genes associated with radiation response. In addition, to identify significant radio-responsive genes, biological processes, and pathways, further analysis was performed using two publicly available microarray datasets (GSE23393 [1] and GSE1977 [2]) downloaded from Gene Expression Omnibus (GEO) database. In GSE1977, lymphoblastoid cells collected from 15 healthy individuals were exposed to 5-Gy radiation doses and harvested 4 hours later. To examine the change in gene expression after IR, only mock and IR cases were used. In GSE23393, blood was obtained from eight radiotherapy patients treated at our institution immediately before irradiation and at 4 hours after total body irradiation with 1.25-Gy X-rays. In this study, to explore GRNs associated with radiation response, we used the resulting information obtained from our literature review and reanalyzed the two microarray datasets.

Identification of significant genes

In the preprocessing stage, gene expression measurements were log-base-2 transformed, followed by quantile normalization across all samples. In our previous study [5], to identify significant genes that have considerable differential expression between before and after irradiation, we used a permutation test. In this study, we employed Significant Analysis of Microarrays (SAM) that resulted in a false discovery rate (FDR) and fold-change for each gene [12]. To minimize the possibility of omitting interesting genes, significant genes identified in the two different techniques were combined for further analysis.

Pathway analysis

Significant genes identified in our analysis were entered into a manually curated pathway database (MetaCore™, GeneGo, Inc.). This system leads to the most probable protein interaction network based on a list of genes uploaded by the user. We converted this resulting network to an undirected network and used it as a gold-standard network to assess the GRNs estimated by the graphical lasso algorithm.

Graphical lasso algorithm

In recent years, increasing attention has been paid to the estimation of sparse inverse covariance using L1 (lasso) regularization [11,13,14]. This approach has been efficiently applied to the investigation of sparse undirected graphical models. In the graphical model, a node represents a feature (gene or protein in this study) and an edge between two nodes represents the relationship between the two corresponding features. In particular, each nonzero off-diagonal element in the inverse covariance matrix indicates that there is a dependency between the two features. That is, as the number of zero off-diagonal elements in the inverse covariance matrix increases, sparser graphs are yielded. The underlying assumption behind the graphical lasso is that a data matrix X×consisting of p features measured on n observations follows a multivariate Gaussian distribution with mean μ and covariance Σ [10]. Let be the precision matrix, and let S denote the covariance matrix of the data. The problem is to maximize the penalized log-likelihood over nonnegative definite matrices , taking the form where is the L1 norm that is the sum of the absolute values of the elements of , and is a nonnegative tuning parameter that controls the sparsity of the estimated network. More specifically, large values of lead to sparse networks due to the lasso-type penalty, whereas small values of lead to dense networks. Note that the problem that maximizes the penalized log-likelihood is convex [11]. The subgradient equation for Eq. (1) is where . The block coordinate descent approach partitions the rows and columns such that the target column is always the last, cycling through all features in turn. The partition of , , and is defined as: where the size of , , and is (p − 1) × (p − 1); the size of , , and is (p − 1) × 1; and , , and are scalars. Inspired by this approach, Friedman et al. proposed to use a conventional lasso algorithm to solve Eq. (2). Here we describe how to solve Eq. (2). The upper-right block of Eq. (2) is The lower-right block of Eq. (2) is Since , we have from which we derive Then, we have where . Since in Eq. (3) is equal to = Therefore, Eq. (3) can be rewritten as We note that Eq. (6) is equivalent to the gradient equation of the regular lasso problem. At each partition, a lasso regression is fitted. Then, and are calculated and inserted into W before a new partition is made. This procedure is repeated until W is converged. Table 1 summarizes the graphical lasso algorithm.
Table 1

Graphical lasso algorithm.

1. Initialize W = S + λI. Note that the diagonal of W is fixed in what follows.2. Repeat the following steps until W is converged.   a. Partition the matrix W such that the target column is last.   b. Solve the lasso problem using the coordinate descent algorithm.   c. Update w12=W11β.3. In the final cycle, calculateθ12=-βθ22 with θ22=1/(w22-w12Tβ).
Graphical lasso algorithm.

Evaluation of estimated gene regulatory networks

To compare the GRNs estimated using the graphical lasso algorithm with a gold-standard network produced by MetaCore software, we used Recall, Precision, and f-score metrics defined as follows [15]: where TP indicates the true positives (correctly inferred edges); FP represents the false positives (edges inferred in the estimated network, but not present in the gold-standard network); and FN signifies the false negatives (edges present in the gold-standard network, but not inferred).

Results

Identification of significant genes using microarray datasets

Using SAM, significant genes were identified for the GSE1977 and GSE23393 datasets. For GSE1977, 61 genes were identified, with a cutoff of FDR < 20%, including 44 induced genes and 17 repressed genes. For GSE23393, 64 genes were identified, with a cutoff of FDR < 20%, including 19 induced genes and 45 repressed genes. Note that more genes were induced in GSE1977, whereas more genes were repressed in GSE23393. We examined commonly found genes in the two datasets. Table 2 shows the 21 overlapping genes with their fold-change and FDR. For these genes, considerable fold-changes were observed: averaged fold-changes for induced and repressed genes were 3.02 and 0.53, respectively, in GSE1977 and 3.12 and 0.60, respectively, in GSE23393. There was no significant difference in fold-change between GSE1977 and GSE23393 (p = 0.64 using Wilcoxon signed-rank test). It is worthwhile to note that induced genes had relatively lower FDR than repressed genes in both datasets.
Table 2

A list of 21 genes commonly identified in both microarray datasets using Significant Analysis of Microarrays.

Gene SymbolEntrez Gene IDGSE1977GSE23393

Fold-changeFDR (%)Fold-changeFDR (%)
Induced GenesACTA2592.020.001.850.00
ATF34673.050.001.409.79
BAX5811.530.001.800.00
BBC3271134.000.001.489.79
CCNG19001.720.001.510.00
CD709702.070.004.850.00
DDB216432.380.005.330.00
EI2495381.920.001.593.20
FDXR22322.340.0013.120.00
GADD45A16474.500.002.340.00
MMP943182.942.061.859.79
PCNA51112.040.003.540.00
PLK21076911.540.003.630.00
PLK312631.470.001.543.20
PPM1D84932.460.001.640.00
TNFRSF10B87952.330.002.460.00

Repressed GenesBIRC53320.6316.090.433.52
CCNB18910.360.000.500.00
HUS133640.3016.090.8116.52
MDC196560.8816.680.643.52
MYC46090.500.000.6516.52
A list of 21 genes commonly identified in both microarray datasets using Significant Analysis of Microarrays. In our previous study [5], we identified 20 genes that were significant in both datasets using a permutation test. It was found that 15 genes out of 20 were re-identified in this study, with 5 genes (BTG2, CDKN1A, MDM2, MR1, and XPC) excluded in SAM analysis. To minimize the possibility of excluding important genes, we combined the two gene lists identified in the SAM analysis and the permutation test, resulting in a list of 26 genes. Note that all 26 genes are in the list of 221 genes found in our literature review [5].

Pathway analysis results

These 26 genes were fed into MetaCore software to identify a key interacting network. Figure 1 illustrates a directly connected protein-protein interaction network produced by MetaCore. This network consisted of 16 directly connected nodes and 28 edges. For the remaining 10 genes, there was no single connection. In this network, the MYC gene seems to play a key role as a hub gene with 12 connections. We assume that this network is reliable, considering it as a gold-standard network to assess the accuracy of the GRNs estimated by the graphical lasso algorithm. Due to the static nature of the microarray datasets and the unidirectional property of GRNs resulting from the graphical lasso algorithm, we removed the directionality from the network in Figure 1 to ease the comparison between the unidirectional gold-standard network and the estimated GRNs.
Figure 1

A directly connected protein interaction network. This protein interaction network was obtained when 26 genes were entered into MetaCore software. This network consists of 16 nodes and 28 edges. The MYC gene has 12 connections. The table on the right shows corresponding gene symbols for proteins in the network. Red, green, and gray lines indicate inhibitory, stimulatory, and unspecified interactions, respectively.

A directly connected protein interaction network. This protein interaction network was obtained when 26 genes were entered into MetaCore software. This network consists of 16 nodes and 28 edges. The MYC gene has 12 connections. The table on the right shows corresponding gene symbols for proteins in the network. Red, green, and gray lines indicate inhibitory, stimulatory, and unspecified interactions, respectively.

Identification of gene regulatory networks

For the 16 genes shown in Figure 1, expression measurements after irradiation were extracted from the GSE1977 and GSE23393 microarray datasets and were used in the graphical lasso algorithm. Using different values, a large range of candidate networks were yielded. Table 3 summarizes our experimental results. Table 3(A) and (B) show comparison results of networks estimated from GSE1977 and GSE23393 datasets, respectively, with the gold-standard network. To calculate a p-value of a result obtained for each in Table 3, a simulation was performed. For example, using in GSE1977, the estimated network had 27 edges with precision = 0.41, recall = 0.39, and f-score = 0.4. The common number of edges between the estimated network and the gold-standard network was 11. For the simulation, we randomly created two networks, both having 16 nodes: one with 28 edges (the number of edges in the gold-standard network) and the other with 27 edges, as in the estimated network. Then, a f-score was calculated between the two networks. This procedure was repeated 10,000 times. From our simulation results, the number of times that the f-score was larger than 0.4 (note that the f-score in the above example was 0.4) was 50. Therefore, its p-value was 50/10000 = 0.005. As shown in Table 3, the networks estimated from GSE1977 had larger f-scores than those estimated from GSE23393 and overall, their p-values were statistically significant.
Table 3

Results obtained using the graphical lasso algorithm.

-log10(λ)TPFNFP# of edgesPrecisionRecallf-scorep-value
0.80424480.500.140.220.055
0.965237120.420.180.250.062
1.0582011190.420.290.340.017
1.10101815250.400.360.380.007
1.20111716270.410.390.400.005
1.31121620320.380.430.400.010
1.48151323380.390.540.450.001
1.52161225410.390.570.460.001
1.72171132490.350.610.440.002
1.78171136530.320.610.420.007
(A) Comparison of the gold-standard network and networks estimated from GSE1977,

-log10(λ)TPFNFP# of edgesPrecisionRecallf-scorep-value

0.55127560.170.040.060.737
0.582268100.200.070.110.596
0.7132512150.200.110.140.599
0.8052317220.230.180.200.420
1.0472123300.230.250.240.337
1.1391926350.260.320.290.194
1.15101827370.270.360.310.150
1.31121632440.270.430.330.110
1.39131535480.270.460.340.096
1.46131538510.250.460.330.135
(B) comparison of the gold-standard network and networks estimated from GSE23393, and

-log10(λ)in GSE1977-log10(λ)in GSE23393# of edges in GSE1977# of edges in GSE23393# of common edgesf-score p-value

0.800.558610.140.324
0.960.58121020.180.228
1.050.71191540.240.123
1.100.80252280.340.022
1.201.042730120.420.004
1.311.133235170.51< 0.001
1.481.153837190.51< 0.001
1.521.314144220.520.001
1.721.394948260.540.001
1.781.465351290.56<0.001
(C) comparison of networks estimated from GSE1977 and GSE23393.
Results obtained using the graphical lasso algorithm. Table 3(C) shows results of the comparison between the networks estimated from GSE1977 and GSE23393 datasets. We note that the f-scores were greater than those between the gold-standard network and the networks estimated using GSE1977 or GSE23393. Their p-values were statistically significant when in GSE1977 and in GSE23393. However, as values increased, the complexity of models (the number of edges in estimated networks) also increased. We compared the gold-standard network with estimated networks that have the number of edges that was similar to the gold-standard network (the number of edges: 28). In comparison between the gold-standard network and networks estimated from GSE1977 with the number of edges was 27, f-score was 0.4, and p-value was 0.005. In comparison between the gold-standard network and networks estimated from GSE23393 with the number of edges was 30, f-score was 0.24, and p-value was 0.337. In a comparison of the two networks estimated using GSE1977 (with ) and GSE23393 (with ), the f-score was 0.42 and p-value was 0.004. Figure 2 shows the change in f-scores calculated from two networks estimated using GSE1977 and GSE23393 with different values. Figure 3 illustrates the change in networks estimated when the graphical lasso algorithm was applied to GSE1977 with 6 different values.
Figure 2

Accuracy of estimated networks. The f-scores calculated from two networks estimated using GSE1977 and GSE23393 with different values.

Figure 3

Change in estimated networks. Networks estimated when the graphical lasso algorithm was applied to GSE1977 with 6 different values.

Accuracy of estimated networks. The f-scores calculated from two networks estimated using GSE1977 and GSE23393 with different values. Change in estimated networks. Networks estimated when the graphical lasso algorithm was applied to GSE1977 with 6 different values.

Discussion

To identify radio-responsive GRNs, we employed the graphical lasso algorithm using a list of radio-responsive genes selected from literature review and microarray data analysis. For the identification of significant genes with two publicly available microarray datasets (GSE1977 and GSE23393), we used SAM analysis. As a result, we found 21 genes to be important with FDR < 20%. These 21 genes included 15 of 20 genes that we identified in our previous study using a permutation test. This result suggests that we may miss important genes by using only a single statistical approach. In this study, we combined the two gene sets for further analysis, resulting in a list of 26 genes, including genes related to DNA repair: GADD45A, XPC, DDB2, and PCNA [2]. These 26 genes were entered into MetaCore software for a systems biology analysis. We identified a protein-protein interaction network that was used as a gold-standard network in this study. This network consisted of 16 nodes and 28 edges. Interestingly, the MYC gene had 12 connections, implying that this gene may play an important role in DNA-damage-related biological functions. It has been known that MYC is a key regulator of cell proliferation and apoptosis [16-19]. However, the role of MYC is not fully understood, due to its contradictory effect in enhancing or reducing radioresponsiveness [20,21]. The f-scores calculated to compare the gold-standard network with the networks estimated from GSE1977 were larger than those found in comparing the gold-standard network with the networks estimated from GSE23393. It was noted that the f-scores calculated from the two networks estimated from GSE1977 and GSE23393 were larger than those calculated from the gold-standard network and the networks estimated from GSE1977. This means that there are some common edges between the two networks estimated from GSE1977 and GSE23393, which are not present in the gold-standard network, suggesting that these edges could represent unknown associations between the genes. In this study, we compared the estimated networks with a gold-standard network to investigate the change in the accuracy and number of interactions between genes with different values. However, in general, one needs to find the best regularization parameter , taking into account a tradeoff between model prediction and model complexity [10,22]. Bayesian information criterion (BIC) and Akaike criterion (AIC) are widely used for model selection. Despite the lack of available datasets regarding radiation response, we demonstrated the presented methodology has the potential to identify radio-responsive GRNs via the graphical lasso algorithm based on literature review and microarray data analysis.

Conclusions

We demonstrated that the graphical lasso algorithm can be a useful tool to reconstruct GRNs. We used a biomarker filtering method proposed in our previous study based on literature review and microarray data analysis. To evaluate the accuracy of radio-responsive GRNs estimated from two publicly available microarray datasets, we used a reliable protein interaction network generated from a curated database as a gold-standard network. It is expected that our proposed method can be efficiently used to identify not only significant radio-responsive genes, but also radio-responsive GRNs.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JHO performed data analysis and wrote the manuscript. JOD supervised the study and edited the paper. Both authors read and approved the final manuscript.
  22 in total

1.  Portrait of transcriptional responses to ultraviolet and ionizing radiation in human cells.

Authors:  Kerri E Rieger; Gilbert Chu
Journal:  Nucleic Acids Res       Date:  2004-09-08       Impact factor: 16.971

2.  Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics.

Authors:  Tarmo Aijö; Harri Lähdesmäki
Journal:  Bioinformatics       Date:  2009-08-25       Impact factor: 6.937

3.  Robust Gaussian graphical modeling via l1 penalization.

Authors:  Hokeun Sun; Hongzhe Li
Journal:  Biometrics       Date:  2012-09-28       Impact factor: 2.571

4.  Oncogene Expression and Cellular Radiation Resistance: A Modulatory Role for c-myc.

Authors: 
Journal:  Mol Diagn       Date:  1998-03

5.  MYC prevents apoptosis and enhances endoreduplication induced by paclitaxel.

Authors:  Giuliana Gatti; Giovanna Maresca; Manuela Natoli; Fulvio Florenzano; Angelo Nicolin; Armando Felsani; Igea D'Agnano
Journal:  PLoS One       Date:  2009-05-06       Impact factor: 3.240

6.  Mediation of c-Myc-induced apoptosis by p53.

Authors:  H Hermeking; D Eick
Journal:  Science       Date:  1994-09-30       Impact factor: 47.728

Review 7.  Apoptosis. Its significance in cancer and cancer therapy.

Authors:  J F Kerr; C M Winterford; B V Harmon
Journal:  Cancer       Date:  1994-04-15       Impact factor: 6.860

8.  Systems biology modeling of the radiation sensitivity network: a biomarker discovery platform.

Authors:  Steven Eschrich; Hongling Zhang; Haiyan Zhao; David Boulware; Ji-Hyun Lee; Gregory Bloom; Javier F Torres-Roca
Journal:  Int J Radiat Oncol Biol Phys       Date:  2009-10-01       Impact factor: 7.038

9.  Gene regulatory networks from multifactorial perturbations using Graphical Lasso: application to the DREAM4 challenge.

Authors:  Patricia Menéndez; Yiannis A I Kourmpetis; Cajo J F ter Braak; Fred A van Eeuwijk
Journal:  PLoS One       Date:  2010-12-20       Impact factor: 3.240

10.  A bioinformatics filtering strategy for identifying radiation response biomarker candidates.

Authors:  Jung Hun Oh; Harry P Wong; Xiaowei Wang; Joseph O Deasy
Journal:  PLoS One       Date:  2012-06-29       Impact factor: 3.240

View more
  6 in total

1.  A Poisson Log-Normal Model for Constructing Gene Covariation Network Using RNA-seq Data.

Authors:  Yoonha Choi; Marc Coram; Jie Peng; Hua Tang
Journal:  J Comput Biol       Date:  2017-05-30       Impact factor: 1.479

2.  Integration of multi-omics data for integrative gene regulatory network inference.

Authors:  Neda Zarayeneh; Euiseong Ko; Jung Hun Oh; Sang Suh; Chunyu Liu; Jean Gao; Donghyun Kim; Mingon Kang
Journal:  Int J Data Min Bioinform       Date:  2017-10-03       Impact factor: 0.667

3.  Discriminating cancer-related and cancer-unrelated chemoradiation-response genes for locally advanced rectal cancers.

Authors:  You Guo; Jun Cheng; Lu Ao; Xiangyu Li; Qingzhou Guan; Juan Zhang; Haidan Yan; Hao Cai; Qiao Gao; Weizhong Jiang; Zheng Guo
Journal:  Sci Rep       Date:  2016-11-15       Impact factor: 4.379

4.  Bayesian differential analysis of gene regulatory networks exploiting genetic perturbations.

Authors:  Yan Li; Dayou Liu; Tengfei Li; Yungang Zhu
Journal:  BMC Bioinformatics       Date:  2020-01-09       Impact factor: 3.169

5.  Reproducibility of radiomic features using network analysis and its application in Wasserstein k-means clustering.

Authors:  Jung Hun Oh; Aditya P Apte; Evangelia Katsoulakis; Nadeem Riaz; Vaios Hatzoglou; Yao Yu; Usman Mahmood; Harini Veeraraghavan; Maryam Pouryahya; Aditi Iyer; Amita Shukla-Dave; Allen Tannenbaum; Nancy Y Lee; Joseph O Deasy
Journal:  J Med Imaging (Bellingham)       Date:  2021-04-30

6.  Pan-Cancer Prediction of Cell-Line Drug Sensitivity Using Network-Based Methods.

Authors:  Maryam Pouryahya; Jung Hun Oh; James C Mathews; Zehor Belkhatir; Caroline Moosmüller; Joseph O Deasy; Allen R Tannenbaum
Journal:  Int J Mol Sci       Date:  2022-01-19       Impact factor: 6.208

  6 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.