Craig Disselkoen, Brian Greco, Kaitlyn Cook, Kristin Koch, Reginald Lerebours, Chase Viss, Joshua Cape, Elizabeth Held, Yonatan Ashenafi, Karen Fischer, Allyson Acosta, Mark Cunningham, Aaron A Best, Matthew DeJongh, Nathan Tintle.
Abstract
Numerous methods for classifying gene activity states based on gene expression data have been proposed for use in downstream applications, such as incorporating transcriptomics data into metabolic models in order to improve resulting flux predictions. These methods often attempt to classify gene activity for each gene in each experimental condition as belonging to one of two states: active (the gene product is part of an active cellular mechanism) or inactive (the cellular mechanism is not active). These existing methods of classifying gene activity states suffer from multiple limitations, including enforcing unrealistic constraints on the overall proportions of active and inactive genes, failing to leverage a priori knowledge of gene co-regulation, failing to account for differences between genes, and failing to provide statistically meaningful confidence estimates. We propose a flexible Bayesian approach to classifying gene activity states based on a Gaussian mixture model. The model integrates genome-wide transcriptomics data from multiple conditions and information about gene co-regulation to provide activity state confidence estimates for each gene in each condition. We compare the performance of our novel method to existing methods on both simulated data and real data from 907 E. coli gene expression arrays, as well as a comparison with experimentally measured flux values in 29 conditions, demonstrating that our method provides more consistent and accurate results than existing methods across a variety of metrics.
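As a rough illustration of the Gaussian-mixture idea described in the abstract, the sketch below fits a two-component mixture to the expression values of a single gene and reports the posterior probability that each observation belongs to the higher-mean ("active") component. It is a plain univariate EM fit, not the authors' Bayesian MultiMM model, and it ignores operon information; the function names `fit_two_component_gmm` and `posterior_active`, the initialisation, and the iteration count are all assumptions made for this example.

```python
import numpy as np

def _normal_pdf(x, mu, sigma):
    """Gaussian density evaluated for every (observation, component) pair."""
    return np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_two_component_gmm(x, n_iter=200):
    """Fit a 1-D two-component Gaussian mixture by EM (illustrative only)."""
    x = np.asarray(x, dtype=float)
    # Assumed initialisation: component means at the lower and upper quartiles.
    mu = np.percentile(x, [25, 75]).astype(float)
    sigma = np.full(2, x.std() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each observation.
        dens = pi * _normal_pdf(x, mu, sigma)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and standard deviations.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return pi, mu, sigma

def posterior_active(x, pi, mu, sigma):
    """Posterior probability of the higher-mean component for each value in x."""
    x = np.asarray(x, dtype=float)
    dens = pi * _normal_pdf(x, mu, sigma)
    active = int(np.argmax(mu))  # assumption: "active" is the component with the larger mean
    return dens[:, active] / dens.sum(axis=1)
```

Calling `fit_two_component_gmm` on one gene's log expression values across experiments and passing the result to `posterior_active` gives one probability per experiment, which plays the role of the activity estimate in this simplified setting.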
Keywords: Bayesian model; bacteria; gene activity; gene expression; metabolic modeling
Year: 2016 PMID: 27555837 PMCID: PMC4977825 DOI: 10.3389/fmicb.2016.01191
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Figure 1. Boxplots of deviations from true activity state by approach. Boxplots represent the value of 1−d across each of the experiments, i = 1,…, 907, where d is the deviation averaged across all genes j = 1,…,m in experiment i (see Section Methods: Statistical analysis for details). Larger numbers on the y-axis represent less deviation from true activity states, illustrating that MultiMM has the best performance, followed by UniMM, with MT yielding the worst performance. This figure illustrates the results on simulated data using the Unif simulation approach. Performance with the Fitted approach was similar.
Overall consistency.
| Gene-experiment combinations | N (thousands) | MT | TT | RB | UniMM | MultiMM |
| --- | --- | --- | --- | --- | --- | --- |
| Among gene-experiment combinations the simulator assigned as active | 1569 | 69.1 | 68.4 | 69.1 | 76.5 | 81.4 |
| Among gene-experiment combinations the simulator assigned as inactive | 1547 | 69.4 | 68.7 | 69.4 | 81.8 | 86.4 |
| Sensitivity + Specificity | – | 138.5 | 137.1 | 138.5 | 158.3 | 167.8 |
| Only data points from genes flagged as 2-component | 2209 | 69.3 | 68.6 | 69.3 | 84.8 | 89.4 |
| Only data points from genes flagged as 1-component | 907 | 69.2 | 68.4 | 69.2 | 65.3 | 70.3 |
Values in this table are reported as 100% times average consistency (cij).
Instead of maximizing the Sensitivity + Specificity, a researcher could choose to maximize the average cij.
By definition these approaches will yield the same result since we are dichotomizing the RB aij's when computing cij.
These values are based on the genes flagged as one component by the MultiMM method; when using genes flagged by the UniMM method, results were comparable.
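The Sensitivity + Specificity row is the sum of the two rows above it (for example, 69.1 + 69.4 = 138.5 in the first method column). The sketch below shows one way such a summary could be computed from activity estimates and simulated states. The consistency definition used here (the estimate itself when the gene is truly active, one minus the estimate otherwise) is an assumption for illustration, as are the function and variable names.

```python
import numpy as np

def consistency_summary(a, z):
    """Average consistency (as a percentage) among truly active and truly inactive points.

    a : activity estimates in [0, 1], one per gene-experiment combination
    z : simulated activity states, 1 = active, 0 = inactive
    """
    a = np.asarray(a, dtype=float)
    z = np.asarray(z, dtype=int)
    # Assumed definition: consistency is a when active, 1 - a when inactive.
    c = np.where(z == 1, a, 1.0 - a)
    sensitivity = 100.0 * c[z == 1].mean()
    specificity = 100.0 * c[z == 0].mean()
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "sum": sensitivity + specificity,
    }
```

With binary (0/1) estimates, as produced by a hard threshold, the two averages reduce to the usual sensitivity and specificity.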
Overall method consistency with simulated gene activity state assignments stratified by confidence.
| Method | High confidence | Medium confidence | Low confidence |
| --- | --- | --- | --- |
| MT | 69.3% (2158/3116) | − | − |
| TT | 73.2% (1824/2429) | − | 50% (312/624) |
| RB | 80.9% (1009/1247) | 65.5% (816/1246) | 53.5% (333/622) |
| UniMM | 93.4% (1623/1737) | 65.4% (577/883) | 53.7% (266/495) |
| MultiMM | 95.3% (1849/1941) | 70.1% (530/757) | 55.8% (233/418) |
For example, 2158 is the number of consistent gene-experiment combinations at high confidence for the MT approach (in thousands), and 3116 is the total number of gene-experiment combinations at high confidence for the MT approach (in thousands).
Method consistency within operons with simulated gene activity state assignments.
| Method | Inconsistent | Consistent | Correct |
| --- | --- | --- | --- |
| MT | 34.5% (207/599) | 65.5% (392/599) | 50.1% (300/599) |
| TT | 16.9% (101/599) | 83.2% (498/599) | 58.9% (353/599) |
| RB | 16.9% (101/599) | 83.0% (497/599) | 50.1% (300/599) |
| UniMM | 20.0% (120/599) | 80.0% (479/599) | 58.2% (349/599) |
| MultiMM | 0% (0/599) | 100% (599/599) | 85.3% (511/599) |
An operon call is inconsistent when one or more genes within the same operon are indicated likely to be active and one or more genes within that same operon are indicated likely to be inactive.
Correct means that the consistent operon activity calls are also identified correctly as active or inactive (based on the underlying simulation model).
All counts in the table are reported in 1000s, representing the number of operon-experiment combinations.
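The footnote definitions above can be sketched as a small classification function for one operon in one experiment. The 0.5 cutoff for "likely active" versus "likely inactive", the handling of estimates exactly at the cutoff, and the name `operon_call` are assumptions for this example and are not taken from the source.

```python
def operon_call(a_values, truly_active, cutoff=0.5):
    """Classify the activity calls for one operon in one experiment.

    a_values     : activity estimates for the genes in the operon
    truly_active : simulated state of the operon (True = active)
    cutoff       : assumed boundary between "likely active" and "likely inactive"
    """
    calls = set()
    for a in a_values:
        if a > cutoff:
            calls.add("active")
        elif a < cutoff:
            calls.add("inactive")
        # Estimates exactly at the cutoff indicate neither state.
    if "active" in calls and "inactive" in calls:
        return "inconsistent"
    if not calls:
        return "uncalled"  # assumption: no gene leans either way
    called_active = "active" in calls
    if called_active == bool(truly_active):
        return "consistent, correct"
    return "consistent, incorrect"
```

Counting the returned labels over all operon-experiment combinations would give proportions of the kind shown in the table, under the stated assumptions.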
Overall consistency.
| Gene-experiment combinations | N (thousands) | MT | TT | RB | UniMM | MultiMM |
| --- | --- | --- | --- | --- | --- | --- |
| Among gene-experiment combinations the model predicted as active | 116 | 83.8 | 82.1 | 83.8 | 66.0 | 70.3 |
| Among gene-experiment combinations the model predicted as inactive | 729 | 40.1 | 40.2 | 40.1 | 62.8 | 58.9 |
| Sensitivity + Specificity | – | 123.9 | 122.3 | 123.9 | 128.8 | 129.2 |
| Only data points from genes flagged as 2-component | 721 | 45.3 | 45.2 | 45.3 | 62.6 | 59.4 |
| Only data points from genes flagged as 1-component | 124 | 50.9 | 50.3 | 50.9 | 66.9 | 67.1 |
Values in this table are reported as 100% times average consistency (cij).
Instead of maximizing the Sensitivity + Specificity, a researcher could choose to maximize the average cij.
By definition these approaches will yield the same result since we are dichotomizing the RB aij's when computing cij.
These values are based on the genes flagged as one component by the MultiMM method; likewise, the values in the row above are based on the genes flagged as two component by the MultiMM method. When using genes flagged by the UniMM method, results were comparable.
Overall method consistency vs. metabolic model predictions stratified by confidence.
| Method | High confidence | Medium confidence | Low confidence |
| --- | --- | --- | --- |
| MT | 46.1% (390/845) | − | − |
| TT | 44.8% (295/658) | − | 50.0% (94/187) |
| RB | 39.2% (115/293) | 49.4% (180/364) | 50.6% (95/187) |
| UniMM | 64.0% (368/574) | 65.2% (117/179) | 54.4% (50/92) |
| MultiMM | 60.1% (395/656) | 64.2% (77/120) | 57.5% (39/68) |
For example, 390 is the number of consistent gene-experiment combinations at high confidence for the MT approach (in thousands), and 845 is the total number of gene-experiment combinations at high confidence for the MT approach (in thousands).
Method consistency within operons.
| MT | 66.6% (455/683) | − | 33.4% (228/683) |
| TT | 47.8% (327/683) | 35.7% (244/683) | 16.5% (113/683) |
| RB | 18.0% (123/683) | 65.5% (447/683) | 16.5% (113/683) |
| UniMM | 37.7% (258/683) | 43.3% (296/683) | 19.0% (130/683) |
| MultiMM | 79.8% (545/683) | 20.2% (138/683) | 0% (0/683) |
MultiMM calls for the L-arabinose operon.
| Yes | 219 | 35 |
| No | 8 | 645 |
| Total | 227 | 680 |
All 8 experiments were from the same series of experiments (experimenter, lab, and condition), a series of experiments on wild-type E. coli in the presence of varying amounts of Norfloxacin. See Faith et al.
Of these 35 experiments, 8 were from mutant E. coli strains without the ara operon. (See Faith et al.)
Figure 2. Performance of different activity state inference methods on araB. Expression values are across 907 E. coli experiments. (A) shows the raw expression data with an overlaid Gaussian mixture distribution from MultiMM for the araB gene. The remaining three panels (B–D) graph the posterior probability that araB is active vs. the log expression for each experiment using three different methods of generating posterior probabilities. The UniMM and MultiMM methods (C,D) yield results which more intuitively agree with the observed raw expression values than the rank-based approach (B). The MultiMM method, by leveraging information from all genes in the operon, is able to provide improved certainty over the other methods.
Figure 3. Performance of different activity state inference methods on gene pair araA and araB. (A) shows the raw expression data with an overlaid Gaussian mixture distribution from the MultiMM method for the araA gene (the corresponding histogram for araB is in Figure 2A). (B–D) graph the observed expression values for araA vs. araB indicating how consistent or inconsistent the calls are for genes within an operon for each of three different methods of estimating gene activity states. Blue dots represent experiments for which the approach is very sure the gene pair is inactive (a < 0.2) and red triangles represent experiments for which the approach is very sure the gene pair is active (a > 0.8). Open squares represent unsure (0.2 < a < 0.8) calls for both genes, and black “X's” represent situations where either a < 0.2 for one gene and a > 0.2 for the other, or a < 0.8 for one gene and a > 0.8 for the other. The MultiMM method, by leveraging information from all genes in the operon, is able to provide the most consistent calls of the three methods.
Figure 4. Performance of different activity state inference methods on gene pair cysM and cysP. (A,B) show raw expression data with overlaid Gaussian mixture distributions from the MultiMM method for cysM and cysP, respectively. (C–E) graph the observed expression values for these two genes, indicating how consistent the expression values are with three different methods of estimating gene activity states. Blue dots represent experiments for which the approach is very sure the gene pair is inactive (a < 0.2) and red triangles represent experiments for which the approach is very sure the gene pair is active (a > 0.8). Open squares represent unsure (0.2 < a < 0.8) calls for both genes, and black “X's” represent situations where either a < 0.2 for one gene and a > 0.2 for the other, or a < 0.8 for one gene and a > 0.8 for the other. The MultiMM method, by leveraging information from all genes in the operon, is able to provide the most consistent calls of the three methods.
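The marker scheme used in the gene-pair scatterplots of Figures 3 and 4 can be written as a short rule over the two activity estimates. The sketch below follows the thresholds stated in the captions (0.2 and 0.8); the function name `classify_pair`, the label strings, and the treatment of values that fall exactly on a threshold are assumptions, since the captions do not specify them.

```python
def classify_pair(a1, a2, low=0.2, high=0.8):
    """Assign a plotting category to a gene pair from its two activity estimates."""
    if a1 < low and a2 < low:
        return "inactive"      # blue dot: both genes confidently inactive
    if a1 > high and a2 > high:
        return "active"        # red triangle: both genes confidently active
    if low < a1 < high and low < a2 < high:
        return "unsure"        # open square: both genes in the unsure range
    # Black X: the two genes fall in different confidence categories.
    # Assumption: values exactly equal to a threshold also land here.
    return "inconsistent"
```

For example, `classify_pair(0.05, 0.9)` yields the inconsistent category, which is the situation the MultiMM method is designed to avoid for genes in the same operon.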
Association between inferred gene activity states and experimentally measured fluxes.
| Predictor | Predictor alone | With raw expression: predictor | With raw expression: raw expression | With MultiMM: predictor | With MultiMM: MultiMM |
| --- | --- | --- | --- | --- | --- |
| Raw expression (ϵij) | 0.223 (0.170 to 0.277) | − | − | 0.018 (−0.046 to 0.082) | 0.344 (0.280 to 0.408) |
| MT | 0.255 (0.202 to 0.308) | 0.189 (0.123 to 0.255) | 0.111 (0.045 to 0.177) | −0.032 (−0.110 to 0.047) | 0.378 (0.300 to 0.457) |
| TT | 0.311 (0.25 to 0.364) | 0.296 (0.225 to 0.367) | 0.023 (−0.045 to 0.094) | 0.085 (0.002 to 0.168) | 0.287 (0.204 to 0.370) |
| RB | 0.305 (0.253 to 0.358) | 0.417 (0.318 to 0.517) | −0.132 (−0.231 to −0.032) | 0.093 (0.017 to 0.170) | 0.285 (0.208 to 0.361) |
| UniMM | 0.351 (0.300 to 0.403) | 0.336 (0.273 to 0.399) | 0.026 (−0.037 to 0.089) | − | − |
| MultiMM | 0.354 (0.303 to 0.406) | 0.344 (0.280 to 0.408) | 0.018 (−0.046 to 0.082) | − | − |
p < 0.001;
p < 0.01;
p < 0.05.
These standardized beta coefficients (i.e., correlations) result from predicting flux values by either the raw expression data or gene activity state estimates.
These standardized beta coefficients (i.e., partial correlations) result from predicting flux values by one of the gene activity estimates and the raw expression data. When the partial correlation for expression data is significant it suggests that the corresponding gene activity estimating method is not sufficiently capturing the variation in expression data that explains changes in flux.
These standardized beta coefficients (i.e., partial correlations) result from predicting flux values by one of the gene activity estimates and the MultiMM approach. The partial correlations for the MultiMM method are always much larger and more significant compared to other gene activity approaches, suggesting that the MultiMM method is explaining significantly more variation in flux values than other approaches.
The correlation between UniMM and MultiMM activity estimates on this dataset is 0.998 (essentially equivalent), so linear models containing both UniMM and MultiMM activity estimates are not robust.