Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Manolis Kellis, James J Collins, Gustavo Stolovitzky.
Abstract
Reconstructing gene regulatory networks from high-throughput data is a long-standing challenge. Through the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we performed a comprehensive blind assessment of over 30 network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae and in silico microarray data. We characterize the performance, data requirements and inherent biases of different inference approaches, and we provide guidelines for algorithm application and development. We observed that no single inference method performs optimally across all data sets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse data sets. We thereby constructed high-confidence networks for E. coli and S. aureus, each comprising ~1,700 transcriptional interactions at a precision of ~50%. We experimentally tested 53 previously unobserved regulatory interactions in E. coli, of which 23 (43%) were supported. Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks.
Year: 2012 PMID: 22796662 PMCID: PMC3512113 DOI: 10.1038/nmeth.2016
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547
Figure 1. The DREAM5 network inference challenge
Assessment involved the following steps (from left to right). (1) Participants were challenged to infer the genome-wide transcriptional regulatory networks of E. coli, S. cerevisiae, and S. aureus, as well as an in silico (simulated) network. (2) Gene expression datasets for a wide range of experimental conditions were compiled. Anonymized datasets were released to the community, hiding the identities of the genes. (3) 29 participating teams inferred gene regulatory networks. In addition, we applied 6 “off-the-shelf” inference methods. (4) Network predictions from individual teams were integrated to form community networks. (5) Network predictions were assessed using experimentally supported interactions from E. coli and S. cerevisiae, as well as the known in silico network.
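Step (4) can be illustrated with a small sketch. The caption does not state how the team predictions were integrated, so the code below assumes one simple scheme, averaging the rank each method assigns to an edge. The function name `integrate_by_average_rank`, the penalty rank for edges a method did not report, and the toy edge lists are all hypothetical.

```python
from collections import defaultdict


def integrate_by_average_rank(predictions):
    """Combine ranked edge lists into one community ranking.

    predictions: list of ranked lists, one per method. Each ranked list holds
    (regulator, target) tuples ordered from most to least confident.
    Assumption: an edge missing from a method's list receives the rank
    len(list) + 1 for that method.
    """
    # Collect every edge predicted by at least one method.
    edges = set()
    for ranked in predictions:
        edges.update(ranked)

    # Sum the rank of each edge over all methods.
    total_rank = defaultdict(float)
    for ranked in predictions:
        position = {edge: i + 1 for i, edge in enumerate(ranked)}
        missing_rank = len(ranked) + 1
        for edge in edges:
            total_rank[edge] += position.get(edge, missing_rank)

    # Sort by average rank; ties are broken by edge name for determinism.
    n_methods = len(predictions)
    return sorted(edges, key=lambda e: (total_rank[e] / n_methods, e))


# Toy predictions from three hypothetical methods.
method_a = [("tf1", "g1"), ("tf1", "g2"), ("tf2", "g3")]
method_b = [("tf1", "g2"), ("tf2", "g3"), ("tf1", "g1")]
method_c = [("tf1", "g2"), ("tf1", "g1")]

community = integrate_by_average_rank([method_a, method_b, method_c])
print(community)
```

In this toy case the edge ("tf1", "g2") comes out on top because two of the three methods rank it first, even though one method ranks a different edge higher.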
Table 1. Network inference methods.
| Class | ID | Synopsis | Reference |
|---|---|---|---|
| Regression | 1 | Trustful Inference of Gene REgulation using Stability Selection (TIGRESS): (1) Lasso; (2) the regularization parameter selects five transcription factors per target gene in each bootstrap sample. | |
| Regression | 2 | (1) Steady state and time series data are combined by group lasso; (2) bootstrapping. | |
| Regression | 3 | Combination of lasso and Bayesian linear regression models learned using Reversible Jump Markov Chain Monte Carlo simulations. | |
| Regression | 4 | (1) Lasso; (2) bootstrapping. | |
| Regression | 5 | (1) Lasso; (2) area under the stability selection curve. | |
| Regression | 6 | Application of the Lasso toolbox GENLAB using standard parameters. | |
| Regression | 7 | Lasso models are combined by the maximum regularization parameter selecting a given edge for the first time. | |
| Regression | 8 | Linear regression determines the contribution of transcription factors to the expression of target genes. | — |
| Mutual information | 1 | Context likelihood of relatedness (CLR): (1) Spline estimation of mutual information; (2) the likelihood of each mutual information score is computed based on its local network context. | |
| Mutual information | 2 | (1) Mutual information is computed from discretized expression values. | |
| Mutual information | 3 | Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE): (1) kernel estimation of mutual information; (2) the data processing inequality is used to identify direct interactions. | |
| Mutual information | 4 | (1) Fast kernel-based estimation of mutual information; (2) Bayesian Local Causal Discovery (BLCD) and Markov blanket (HITON-PC) algorithm to identify direct interactions. | |
| Mutual information | 5 | (1) Mutual information and Pearson’s correlation are combined; (2) BLCD and HITON-PC algorithm. | |
| Correlation | 1 | Absolute value of Pearson’s correlation coefficient. | |
| Correlation | 2 | Signed value of Pearson’s correlation coefficient. | |
| Correlation | 3 | Signed value of Spearman’s correlation coefficient. | |
| Bayesian networks | 1 | Simulated annealing (catnet R package, | — |
| Bayesian networks | 2 | Simulated annealing (catnet R package, | — |
| Bayesian networks | 3 | Max-Min Parents and Children algorithm (MMPC), bootstrapped datasets. | |
| Bayesian networks | 4 | Markov blanket algorithm (HITON-PC), bootstrapped datasets. | |
| Bayesian networks | 5 | Markov boundary induction algorithm (TIE*), bootstrapped datasets. | |
| Bayesian networks | 6 | Models transcription factor perturbation data and time series using dynamic Bayesian networks (Infer.NET toolbox, | — |
| Other approaches | 1 | Genie3: A random forest is trained to predict target gene expression. Putative transcription factors are selected as tree nodes if they consistently reduce the variance of the target. | |
| Other approaches | 2 | Co-dependencies between transcription factors and target genes are detected by the non-linear correlation coefficient η² (two-way ANOVA). Transcription factor perturbation data are up-weighted. | |
| Other approaches | 3 | Transcription factors are selected maximizing the conditional entropy for target genes, which are represented as Boolean vectors with probabilities to avoid discretization. | |
| Other approaches | 4 | Transcription factors are preselected from transcription factor perturbation data or by Pearson’s correlation and then tested by iterative Bayesian Model Averaging (BMA). | |
| Other approaches | 5 | A Gaussian noise model is used to estimate if the expression of a target gene changes in transcription factor perturbation measurements. | |
| Other approaches | 6 | After scaling, target genes are clustered by Pearson’s correlation. A neural network is trained (genetic algorithm) and parameterized (back-propagation). | |
| Other approaches | 7 | Data are discretized by Gaussian mixture models and clustering (Ckmeans); interactions are detected by generalized logical network modeling (χ² test). | |
| Other approaches | 8 | The χ² test is applied to evaluate the probability of a shift in transcription factor and target gene expression in transcription factor perturbation experiments. | |
| Meta predictors | 1 | (1) Z-scores for target genes in transcription factor knockout data, time-lagged CLR for time series, and linear ordinary differential equation models constrained by lasso (Inferelator); (2) resampling approach. | |
| Meta predictors | 2 | (1) Pearson’s correlation, mutual information, and CLR; (2) rank average. | — |
| Meta predictors | 3 | (1) Calculates target gene responses in transcription factor knockout data, applies full-order, partial correlation and transcription factor-target co-deviation analysis; (2) weighted average with weights trained on simulated data. | — |
| Meta predictors | 4 | (1) CLR filtered by negative Pearson’s correlation, least angle regression (LARS) of time series, and transcription factor perturbation data; (2) combination by z-scores. | |
| Meta predictors | 5 | (1) Pearson’s correlation, differential expression (limma), and time series analysis (maSigPro); (2) Naïve Bayes. | — |
Methods have been manually categorized based on participant-supplied descriptions. Within each class, methods are sorted by overall performance (see Figure 2a). Note that generic references have been used if more specific ones were not available.
Detailed method description included in Supplementary Note 10.
Off-the-shelf algorithm applied by challenge organizers.
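As a concrete illustration of the simplest entry in the table, scoring candidate edges by the absolute value of Pearson's correlation coefficient, here is a minimal sketch. It is not the code used in the challenge. The function names, the dictionary layout of the expression data, and the toy values are assumptions made for this example.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def rank_edges_by_abs_correlation(expression, regulators):
    """Rank (regulator, target) pairs by |Pearson correlation|, strongest first.

    expression: dict mapping gene name -> list of expression values,
    one value per experimental condition (hypothetical layout).
    regulators: names of the candidate transcription factors.
    """
    scored = []
    for tf in regulators:
        for gene, profile in expression.items():
            if gene == tf:
                continue  # skip self-edges
            scored.append((abs(pearson(expression[tf], profile)), tf, gene))
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(tf, gene, score) for score, tf, gene in scored]


# Toy expression profiles over four conditions.
expression = {
    "tfA": [1.0, 2.0, 3.0, 4.0],
    "g1": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with tfA
    "g2": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with tfA
    "g3": [1.0, 3.0, 2.0, 2.0],  # weakly correlated with tfA
}

ranking = rank_edges_by_abs_correlation(expression, ["tfA"])
for tf, gene, score in ranking:
    print(tf, gene, round(score, 3))
```

Taking the absolute value puts the anti-correlated target `g2` on the same footing as `g1`, whereas the signed variants listed in the table would place it at the bottom of the ranking.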
Figure 2. Evaluation of network inference methods
Inference methods are indexed according to Table 1. (a) The plots depict the performance for the individual networks (area under precision-recall curve, AUPR) and the overall score summarizing the performance across networks (Methods). R indicates performance of random predictions. C indicates performance of the integrated community predictions. (b) Methods are grouped according to the similarity of their predictions via principal component analysis. Shown are the 2nd vs. 3rd principal components; the 1st principal component accounts mainly for the overall performance (Supplementary Note 4). (c) The heatmap depicts method-specific biases in predicting network motifs. Rows represent individual methods and columns represent different types of regulatory motifs. Red and blue show interactions that are easier and harder to detect, respectively.
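The AUPR used in panel (a) can be sketched for a single ranked edge list as follows. The caption does not give the exact interpolation used in the evaluation, so this example assumes a trapezoidal rule between successive precision-recall points, with the curve starting at recall 0 and the precision of the first prediction. The function names and toy edges are hypothetical.

```python
def precision_recall_points(ranked_edges, gold_standard):
    """Return (recall, precision) after each prediction in the ranked list."""
    gold = set(gold_standard)
    if not gold:
        raise ValueError("gold standard must contain at least one edge")
    points = []
    true_positives = 0
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in gold:
            true_positives += 1
        points.append((true_positives / len(gold), true_positives / k))
    return points


def aupr(ranked_edges, gold_standard):
    """Area under the precision-recall curve (trapezoidal rule, an assumption)."""
    points = precision_recall_points(ranked_edges, gold_standard)
    if not points:
        return 0.0
    area = 0.0
    # Start the curve at recall 0 with the precision of the first prediction.
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


# Toy gold standard with two true edges.
gold = [("tf1", "g1"), ("tf1", "g2")]
perfect = [("tf1", "g1"), ("tf1", "g2"), ("tf2", "g3"), ("tf2", "g4")]
mixed = [("tf1", "g1"), ("tf2", "g3"), ("tf1", "g2"), ("tf2", "g4")]

print(aupr(perfect, gold))
print(round(aupr(mixed, gold), 4))
```

Linear interpolation in precision-recall space tends to be optimistic, so scores from this sketch should only be compared with scores computed the same way.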
Figure 3. Analysis of community networks vs. individual inference methods
(a) The plot shows the overall score, which summarizes performance across the E. coli, S. cerevisiae, and in silico networks, for individual inference methods or various combinations of integrated methods. The first boxplot depicts the performance distribution of individual inference methods (K=1). Subsequent boxplots show the performance when integrating K>1 randomly sampled methods. The red bar shows the performance when integrating all methods (K=29). Boxplots depict performance distributions with respect to the minimum, the maximum and the three quartiles. (b) The probability that the community network ranks among the top x% of the K individual methods used to construct the community network. The diagonal shows the expected performance when choosing an individual method (K=1). (c) The integration of complementary methods is particularly beneficial. The first boxplot shows the performance of individual methods from clusters 1–3 (as defined in Fig. 2b). The second and third boxplots show performance of community networks obtained by integrating three randomly selected inference methods: (i) from the same cluster, or (ii) from different clusters. (d) The plots show the overall score for an initial community network formed by integrating all individual methods (open circles, blue) except for the best five and worst five. One by one, the worst five (left panel) and best five (right panel) methods are added to form additional community networks (filled circles, red).
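The quantity plotted in panel (b) can be estimated from repeated samples of K methods. The sketch below assumes that "ranks among the top x%" means the community network's rank, counted as one plus the number of individual methods with a strictly higher score, is at most ceil(x · K). That definition, the function names, and the toy scores are assumptions and are not taken from the paper.

```python
import math


def ranks_in_top(community_score, method_scores, top_fraction):
    """True if the community score places within the top fraction of K methods.

    Assumption: the community's rank is 1 + (number of individual methods
    with a strictly higher score), and it must be <= ceil(top_fraction * K).
    """
    k = len(method_scores)
    better = sum(1 for score in method_scores if score > community_score)
    allowed = math.ceil(top_fraction * k)
    return better + 1 <= allowed


def top_probability(trials, top_fraction):
    """Fraction of trials in which the community ranks in the top fraction.

    trials: list of (community_score, method_scores) pairs, one pair per
    random sample of K methods.
    """
    hits = sum(ranks_in_top(c, m, top_fraction) for c, m in trials)
    return hits / len(trials)


# Toy trials with K=4 individual method scores each (made-up numbers).
individual = [0.10, 0.20, 0.25, 0.35]
trials = [(0.30, individual), (0.40, individual), (0.15, individual)]

for fraction in (0.25, 0.5, 1.0):
    print(fraction, top_probability(trials, fraction))
```

With real data, each trial would come from integrating one random sample of K methods and scoring the resulting community network alongside its K constituents.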
Figure 4. E. coli and S. aureus community networks
(a, b) At a cutoff of 1,688 edges, the (a) E. coli community network connects 1,505 genes (including 204 transcription factors, shown as diamonds), and the (b) S. aureus network connects 1,084 genes (85 transcription factors). Network modules were identified and tested for Gene Ontology term enrichment, as indicated (grey-colored genes do not show enrichment). A network module enriched for Gene Ontology terms related to pathogenesis is highlighted in the S. aureus network. (c) The schematics depict newly predicted E. coli regulatory interactions that were experimentally tested. The pie chart depicts the breakdown of strongly and weakly supported targets (Methods). The positive controls were six known interactions from RegulonDB.