Size matters: network inference tackles the genome scale.

Boris Hayete, Timothy S Gardner, James J Collins.

Year: 2007 PMID: 17299414 PMCID: PMC1828748 DOI: 10.1038/msb4100118

Source DB: PubMed Journal: Mol Syst Biol ISSN: 1744-4292 Impact factor: 11.429

The growing importance of microarray data challenges biologists, and especially the systems biology community, to come up with genome-scale analysis methods that can convert the large quantity of available high-throughput data into high-quality systems-level insights. One area of systems-level analysis that has received considerable attention in recent years is the inference of molecular-level regulation, with frequent focus on transcriptional regulatory networks (Kholodenko et al; Tavazoie et al; Gardner et al; Segal et al; Beer and Tavazoie, 2004; Yu et al; di Bernardo et al; Gardner and Faith, 2005; Woolf et al; Margolin et al; Faith et al). As microarrays provide a tool for measuring transcript levels of the whole genome, recent interest has shifted to inferring networks on a genome scale. The less-studied organisms are a natural starting point for such mapping, as it is for these organisms that the rapid, genome-scale identification of regulatory structure is most needed. In a recent study, Bonneau et al apply the Inferelator, their elegant new algorithm for inferring gene networks, to precisely such a little-studied but important organism. Specifically, the authors focus on Halobacterium NRC-1, a model archaeon (DasSarma et al), to show that, at least for a small genome, it is possible to determine a sizeable portion of the transcriptional regulatory network from microarrays without much prior knowledge. This choice of organism has two practical advantages. First, the salt-loving NRC-1 is one of a handful of Halobacteria for which transformation techniques have been well studied, allowing in vivo validation of network predictions. Second, NRC-1's genome is relatively small, and thus its regulation ought to be comparatively easy to reconstruct. Small genome or not, putting high-throughput profiling technologies to work on the genome scale requires a confluence of robust algorithms, biologically plausible simplifying assumptions, and a rigorous verification strategy.
The work of Bonneau et al is a good example, using multiple tools in the bioinformatics toolbox to build a credible blueprint of a transcriptional-regulatory network involving thousands of genes and more than 100 transcription factors. In order to appreciate the need for a well-structured approach to regulatory mapping, consider the mathematical and biological scope of this cross-disciplinary problem. The tiny archaeon Halobacterium NRC-1 contains about 2400 genes. For each one of these, the goal is to understand the transcriptional regulatory apparatus; that is about 2400 question marks, each with thousands of possible answers in the form of a set of transcriptional regulators. Put that against a typical compendium size of several hundred chips for a given organism, and you get what is known as a ‘small n, large p’ problem, where the number of possible parameters (regulators), p, dwarfs the number of data points (microarrays), n, available to define them. This problem gets considerably worse for complex organisms, where a larger number of available microarrays is more than offset by the vast complexity of large genomes, alternate splice variants, and multiple layers of regulation. For network inference algorithms, ‘small n, large p’ means a dearth of data and very high computational demands. As if this computational complexity were not bad enough, there is the inherent high dimensionality in the biological realm. Regulation happens in the domains of mRNA, proteins, metabolites, kinases, acetylases, and so on, and through a variety of pleiotropic perturbations and influences, such as salinity, temperature, and cell-wall permeability. As the best high-throughput data capture only mRNA, one must make simplifying assumptions and skip many important parameters. Bonneau and colleagues' best simplifying assumption is to focus on predicting the targets of transcription factors in the network, along with some key environmental influences.
When only transcription factors are allowed to regulate other genes, the ‘p’ in the ‘small n, large p’ problem is no longer so big. In fact, at 120, it is smaller than the number of chips (268) used in this study. To further constrain the network learning problem, the Inferelator performs a pre-processing step of bi-clustering, that is, organizing experimental data by both genes and conditions. This algorithm, cMonkey (Reiss et al), allows further reduction of dimensionality by collapsing genes into conditionally coexpressed modules. cMonkey identified 300 such bi-clusters, and 159 individual genes that could not be grouped, a nearly six-fold reduction in dimensionality. Crucially, as the composition of the culture medium used for the microarray-profiled experiments is known, each bi-cluster's grouping of genes by experimental condition suggests plausible metabolic or environmental effectors of regulation. The authors exploit this benefit of their approach in one of their verifying experiments. Bi-clustering, therefore, serves two ends: it limits the number of genes, and thus variables to reconstruct, to fewer than 500 (including only 80 TFs and metabolites), and it places each predicted regulatory interaction into an experiment-specific context. The problem now becomes mathematically well-posed, and the authors solve it using LASSO regression, a sparse regression method designed for just such computationally difficult problems (Tibshirani, 1996). LASSO selects a small set of the most likely regulators of a given gene and simultaneously determines a quantitative influence function relating regulator expression to target expression (Figure 1). In addition, the authors extend the LASSO algorithm beyond its typical linear domain by including piecewise and nonlinear terms in the regression to model saturation effects and pairwise combinatorial regulation.
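The shrinking search space described above can be made concrete with a little arithmetic. The sketch below is only an illustration: it uses the figures quoted in the text (about 2400 genes, 120 candidate regulators, 300 bi-clusters plus 159 ungrouped genes), while the cap of two regulators per target (`MAX_SET_SIZE`) and the helper name `candidate_regulator_sets` are assumptions introduced here, not part of the authors' method. Because it starts from the approximate 2400-gene figure, the ratio it computes need not match the rounded fold-reduction quoted in the text.

```python
from math import comb


def candidate_regulator_sets(num_candidates: int, max_set_size: int) -> int:
    """Count the non-empty regulator sets of size 1..max_set_size."""
    return sum(comb(num_candidates, k) for k in range(1, max_set_size + 1))


# Figures quoted in the text.
GENES = 2400            # approximate gene count of the organism
TFS = 120               # candidate regulators when only TFs may regulate
BICLUSTERS = 300        # co-expressed modules found by bi-clustering
UNGROUPED_GENES = 159   # genes left outside any bi-cluster

# Illustrative assumption: at most two regulators per target.
MAX_SET_SIZE = 2

# Candidate answers per target if any gene may act as a regulator.
sets_all_genes = candidate_regulator_sets(GENES, MAX_SET_SIZE)
# Candidate answers per target if only transcription factors may regulate.
sets_tfs_only = candidate_regulator_sets(TFS, MAX_SET_SIZE)

# Number of response variables left to model after bi-clustering.
targets_after_biclustering = BICLUSTERS + UNGROUPED_GENES
fold_reduction = GENES / targets_after_biclustering

print(f"regulator sets per target, any gene: {sets_all_genes}")
print(f"regulator sets per target, TFs only: {sets_tfs_only}")
print(f"targets after bi-clustering: {targets_after_biclustering}")
print(f"fold reduction in targets: {fold_reduction:.2f}")
```

Even with this modest cap, restricting regulators to transcription factors cuts the number of candidate answers per target by more than two orders of magnitude.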
With this approach, the authors construct a model of transcription regulation in Halobacterium that matches 80 transcription factors to 500 predicted gene targets and captures the putative metabolic controllers of these pathways. This is an impressive result, both in size and regulatory complexity, particularly in light of the relatively modest size of the experimental data set (i.e., 268 microarrays). Moreover, this represents a dramatic leap in our understanding of this little-studied organism.
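To illustrate how LASSO picks a small set of regulators when candidates outnumber measurements, here is a minimal sketch using textbook cyclic coordinate descent on synthetic data. It is not the Inferelator's implementation: the data are simulated, the model is purely linear (none of the piecewise or pairwise terms the authors add), and all names and settings (`lasso_coordinate_descent`, `lam`, the 40-by-60 design, the two planted regulators) are assumptions made for the example.

```python
import random


def soft_threshold(value: float, threshold: float) -> float:
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    residual = list(y)  # equals y - X @ coef, kept up to date below
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * coef[j])
                      for i in range(n)) / n
            new_value = soft_threshold(rho, lam) / col_scale[j]
            delta = new_value - coef[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                coef[j] = new_value
    return coef


# Synthetic "compendium": fewer chips than candidate regulators (n < p).
rng = random.Random(0)
N_CHIPS, N_CANDIDATES = 40, 60
X = [[rng.gauss(0.0, 1.0) for _ in range(N_CANDIDATES)]
     for _ in range(N_CHIPS)]

# Planted truth (hypothetical): only two candidates regulate the target.
TRUE_WEIGHTS = {3: 2.0, 17: -1.5}
y = [sum(w * row[j] for j, w in TRUE_WEIGHTS.items()) + rng.gauss(0.0, 0.1)
     for row in X]

coef = lasso_coordinate_descent(X, y, lam=0.3)
selected = [j for j, b in enumerate(coef) if b != 0.0]
print("selected candidate regulators:", selected)
```

The L1 penalty sets most coefficients exactly to zero, which is what makes the fit usable for regulator selection; the surviving coefficients are shrunk toward zero, so they understate the planted weights.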

Figure 1

(A) Schematic diagram of a hypothetical bacterial operon, represented by a single gene Y, which is regulated by a protein X1 and a protein complex X2X3. (B) Within its dynamic range, the level of the transcript y may be modeled as a function of transcripts of the regulatory proteins X1, X2, and X3. The min function captures the notion of cooperativity, and the general form of g incorporates saturation effects. On the genome scale, the initial model for regulation of y would involve all possible transcription factors, and would greatly benefit from parameter shrinkage by LASSO. (C) This table illustrates the representational power of the chosen design matrix. The model can capture AND, OR, and XOR logical functions and saturation effects (not shown). Assigning the shown values to the coefficients from (B) would cause the model to represent the corresponding logical function for the interaction of X2 and X3.
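The claim in panel (C) can be checked with a small sketch. It assumes a linear combination of the two regulator levels plus a min term, restricted to on/off (0 or 1) inputs, and leaves out X1 and the saturating function g. The coefficient triples below are one choice that reproduces each logical function; they are worked out here for illustration and are not copied from the figure's table, which did not survive in this text.

```python
from itertools import product


def pairwise_response(x2: int, x3: int, b2: float, b3: float, b23: float) -> float:
    """Linear terms for each regulator plus a min term for their interaction."""
    return b2 * x2 + b3 * x3 + b23 * min(x2, x3)


# Hypothetical coefficient choices (b2, b3, b23) for each logical function.
COEFFICIENTS = {
    "AND": (0, 0, 1),    # only the min term: active when both are present
    "OR": (1, 1, -1),    # min term removes the double count at (1, 1)
    "XOR": (1, 1, -2),   # min term cancels both linear terms at (1, 1)
}

# Reference truth tables to compare against.
LOGIC = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def truth_table(name: str) -> dict:
    """Model output for every on/off combination of X2 and X3."""
    b2, b3, b23 = COEFFICIENTS[name]
    return {(a, b): pairwise_response(a, b, b2, b3, b23)
            for a, b in product((0, 1), repeat=2)}


for name in COEFFICIENTS:
    print(name, truth_table(name))
```

For binary inputs, min(x2, x3) equals the AND of the two regulators, which is why a single extra term is enough to express all three functions.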

Having obtained the first-pass transcriptional blueprint, Bonneau and colleagues ask the obligatory next question: how much do we trust this network? In network inference, three broad types of verification are possible: computational verification through cross-validation, in vivo verification, and literature-driven curation. To be effective, the last approach should leverage a large data set documenting connectivity known in the literature, such as TransFac (Matys et al) or RegulonDB (Salgado et al). Because this type of verification is not available for Halobacterium, the authors vigorously pursue the first two, including knockout experimentation and ChIP-chip analysis, demonstrating that their network can serve as a reliable and useful blueprint of Halobacterium NRC-1's transcriptional regulation. Bonneau et al show the feasibility of mapping a genome-scale regulatory network from a modestly sized compendium of microarrays, an important success for the systems biology community. As microarray technology continues to improve and costs drop, growing databases of microarrays present an opportunity to infer ever more complex regulatory networks in both microbes and higher organisms. This abundance of data fuels the need for a network inference case study that would clearly map the boundaries of what is possible with today's network mapping algorithms. To this end, we believe that the once and future model organisms like Escherichia coli and Saccharomyces cerevisiae, buoyed by extensive bodies of literature and large databases such as RegulonDB, SGD (Christie et al), and TransFac, may represent attractive short-term targets for network inference studies. In addition to the use of curated data sets, it may be possible to seed organisms with small synthetic in vivo networks, the connectivity of which is known by design, and to measure the success of network reconstruction on the whole by success or failure to reconstruct the seed. We are aware of at least one lab doing such work (Cantone).
Biological yardsticks in general will gain in importance, as they supplement in silico testing and usher in algorithms' transition from design to practical use, and from simple organisms to higher eukaryotes. Challenges remain, but we see the immediate future of network inference as promising and bright. Molecular biologists have long been looking for ways to generate more oomph from their microarrays. Systems biology may have some answers, and we laud Bonneau and colleagues for providing an illuminating step in that direction.
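One simple way to turn such a yardstick into numbers is to score predicted regulator-target edges against a reference set, whether a curated database or a seed network known by design. The sketch below computes precision and recall for that comparison; the function name `score_network` and the gene names are hypothetical, and this is only one possible scoring scheme, not one taken from the studies discussed here.

```python
def score_network(predicted_edges, reference_edges):
    """Return (precision, recall) of predicted edges against a reference set.

    Each edge is a (regulator, target) pair; direction matters.
    """
    predicted = set(predicted_edges)
    reference = set(reference_edges)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical seed network whose connectivity is known by design.
seed_edges = [
    ("tfA", "geneX"),
    ("tfA", "geneY"),
    ("tfB", "geneZ"),
    ("tfC", "geneX"),
]

# Hypothetical output of a network inference algorithm.
inferred_edges = [
    ("tfA", "geneX"),
    ("tfB", "geneZ"),
    ("tfC", "geneX"),
    ("tfB", "geneY"),
    ("tfC", "geneZ"),
]

precision, recall = score_network(inferred_edges, seed_edges)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

With a synthetic seed, every edge outside the reference is a genuine false positive. With a literature-derived reference, a predicted edge missing from the database may simply be undiscovered, so precision measured that way is better read as a lower bound.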

17 in total

1. Systematic determination of genetic network architecture.

Authors: S Tavazoie; J D Hughes; M J Campbell; R J Cho; G M Church
Journal: Nat Genet Date: 1999-07 Impact factor: 38.330

2. Advances to Bayesian network inference for generating causal networks from observational biological data.

Authors: Jing Yu; V Anne Smith; Paul P Wang; Alexander J Hartemink; Erich D Jarvis
Journal: Bioinformatics Date: 2004-07-29 Impact factor: 6.937

3. Bayesian analysis of signaling networks governing embryonic stem cell fate decisions.

Authors: Peter J Woolf; Wendy Prudhomme; Laurence Daheron; George Q Daley; Douglas A Lauffenburger
Journal: Bioinformatics Date: 2004-10-12 Impact factor: 6.937

4. Reverse-engineering transcription control networks.

Authors: Timothy S Gardner; Jeremiah J Faith
Journal: Phys Life Rev Date: 2005-03 Impact factor: 11.025

5. Quantification of information transfer via cellular signal transduction pathways.

Authors: B N Kholodenko; J B Hoek; H V Westerhoff; G C Brown
Journal: FEBS Lett Date: 1997-09-08 Impact factor: 4.124

6. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms.

Authors: Karen R Christie; Shuai Weng; Rama Balakrishnan; Maria C Costanzo; Kara Dolinski; Selina S Dwight; Stacia R Engel; Becket Feierbach; Dianna G Fisk; Jodi E Hirschman; Eurie L Hong; Laurie Issel-Tarver; Robert Nash; Anand Sethuraman; Barry Starr; Chandra L Theesfeld; Rey Andrada; Gail Binkley; Qing Dong; Christopher Lane; Mark Schroeder; David Botstein; J Michael Cherry
Journal: Nucleic Acids Res Date: 2004-01-01 Impact factor: 16.971

7. RegulonDB (version 5.0): Escherichia coli K-12 transcriptional regulatory network, operon organization, and growth conditions.

Authors: Heladia Salgado; Socorro Gama-Castro; Martín Peralta-Gil; Edgar Díaz-Peredo; Fabiola Sánchez-Solano; Alberto Santos-Zavaleta; Irma Martínez-Flores; Verónica Jiménez-Jacinto; César Bonavides-Martínez; Juan Segura-Salazar; Agustino Martínez-Antonio; Julio Collado-Vides
Journal: Nucleic Acids Res Date: 2006-01-01 Impact factor: 16.971

8. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles.

Authors: Jeremiah J Faith; Boris Hayete; Joshua T Thaden; Ilaria Mogno; Jamey Wierzbowski; Guillaume Cottarel; Simon Kasif; James J Collins; Timothy S Gardner
Journal: PLoS Biol Date: 2007-01 Impact factor: 8.029

9. Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks.

10. Post-genomics of the model haloarchaeon Halobacterium sp. NRC-1.
