Xiang Zhang, Wei Cheng, Jennifer Listgarten, Carl Kadie, Shunping Huang, Wei Wang, David Heckerman.
Abstract
Understanding the organization and function of transcriptional regulatory networks by analyzing high-throughput gene expression profiles is a key problem in computational biology. The challenges in this work are 1) incomplete knowledge of the regulatory relationships between regulators and their associated genes, 2) the potential for spurious associations due to confounding factors, and 3) a number of parameters to learn that usually exceeds the number of available microarray experiments. We present a sparse (L1-regularized) graphical model to address these challenges. Our model incorporates known transcription factors and introduces hidden variables to represent possible unknown transcription factors and confounding factors. The expression level of a gene is modeled as a linear combination of the expression levels of known transcription factors and hidden factors. Using gene expression data covering 39,296 oligonucleotide probes from 1,109 human liver samples, we demonstrate that our model predicts out-of-sample data better than a model with no hidden variables. We also show that some of the gene sets associated with hidden variables are strongly correlated with Gene Ontology categories. The software, including source code, is available at http://grnl1.codeplex.com.
Year: 2012 PMID: 22586449 PMCID: PMC3346750 DOI: 10.1371/journal.pone.0035762
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The graphical model.
Known and potential TFs are assumed to be mutually independent. Regulated genes are assumed to be mutually independent given the TFs.
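The modeling idea, a gene's expression written as a sparse linear combination of known TF and hidden factor levels, can be illustrated with a small L1-regularized regression. The following is a minimal sketch and not the authors' implementation: it fits a single gene by coordinate descent on synthetic data, and it treats the hidden factor values as already known, whereas the actual model must infer them. The function names, the penalty weight `lam`, and all the simulated numbers are assumptions made for illustration.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; the L1 proximal step."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw; equals y because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the residual that excludes factor j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Synthetic data (hypothetical): columns 0-7 stand in for known TFs,
# columns 8-9 for hidden factors whose values are assumed given here.
random.seed(0)
n_samples, n_known, n_hidden = 200, 8, 2
n_factors = n_known + n_hidden
X = [[random.gauss(0.0, 1.0) for _ in range(n_factors)] for _ in range(n_samples)]

# The simulated gene is regulated by one known TF and one hidden factor.
true_w = [0.0] * n_factors
true_w[1] = 2.0
true_w[8] = -1.5
y = [sum(x * w for x, w in zip(row, true_w)) + random.gauss(0.0, 0.1) for row in X]

w_hat = lasso_coordinate_descent(X, y, lam=0.1)
print([round(w, 2) for w in w_hat])
```

Because the gene-specific regressions are independent once the factor levels are fixed, a loop over genes could reuse the same routine; the L1 penalty is what keeps most regulator-to-gene weights at exactly zero when parameters outnumber samples.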
Figure 2. Out-of-sample prediction accuracy of the three models across the 10 folds of the data.
GO enrichment analysis of the gene sets associated with hidden variables.
| Gene Set Size | Raw p-value | Adjusted p-value | FDR | GO Categories |
| 19649 | 1.17×10^-15 | 0 | 0 | cellular protein metabolic process |
| 19431 | 2.31×10^-13 | 0 | 0 | protein metabolic process |
| 22301 | 1.71×10^-10 | 0 | 0 | transport |
| 23608 | 2.53×10^-9 | 0 | 0 | transport |
| 20500 | 9.47×10^-9 | 0 | 0 | cellular protein metabolic process |
| 26332 | 1.55×10^-8 | 0 | 0 | transport |
| 21264 | 2.20×10^-5 | 0.001 | 0.003 | response to chemical stimulus |
| 19395 | 1.87×10^-5 | 0.004 | 0.01 | organic acid metabolic process |
| 21098 | 1.51×10^-4 | 0.01 | 0.022 | organic acid metabolic process |
| 29240 | 2.03×10^-3 | 0.026 | 0.052 | synaptic transmission |
| 20199 | 3.76×10^-4 | 0.03 | 0.054 | positive regulation of phosphate metabolic process |
| 24175 | 1.04×10^-3 | 0.048 | 0.08 | phosphoinositide mediated signaling |
| 17480 | 6.73×10^-4 | 0.064 | 0.1 | cation homeostasis |
| 20331 | 9.45×10^-4 | 0.07 | 0.1 | digestion |
| 22477 | 1.29×10^-3 | 0.075 | 0.1 | locomotory behavior |
| 22644 | 2.74×10^-3 | 0.204 | 0.255 | organic acid transport |
| 18732 | 4.00×10^-3 | 0.393 | 0.462 | positive regulation of T cell proliferation |
| 16294 | 7.86×10^-3 | 0.707 | 0.786 | inorganic anion transport |
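The text here does not state which test produced the raw p-values in this table. A common choice for GO enrichment, assumed here purely for illustration, is the one-sided hypergeometric test: given a population of genes, some of which carry a GO annotation, it asks how likely a gene set of a given size is to contain at least the observed number of annotated genes by chance. The function name and the example counts below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, drawn, overlap):
    """Return P(X >= overlap) when `drawn` genes are sampled without
    replacement from `population` genes, `annotated` of which carry the
    GO category of interest."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1000 genes, 50 annotated, a gene set of 100
# containing 12 annotated genes (5 would be expected by chance).
p_value = hypergeom_enrichment_p(1000, 50, 100, 12)
print(p_value)
```

Raw p-values of this kind still need correction for the many categories tested, which is what the adjusted p-value and FDR columns report.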
Figure 3. Histogram of sizes of the gene sets associated with known and putative regulators.
GO enrichment analysis of the gene sets identified by NCA and our model.
| Method | Average raw p-value | Number of gene sets with calibrated p-values |
| NCA | 0.024 | 5 |
| Our model | 0.007 | 219 |