Scott Powers, Matt DeJongh, Aaron A Best, Nathan L Tintle.
Abstract
BACKGROUND: Rapid growth in the availability of genome-wide transcript abundance levels through gene expression microarrays and RNAseq promises to provide deep biological insights into the complex, genome-wide transcriptional behavior of single-celled organisms. However, this promise has not yet been fully realized.
Keywords: Pearson correlation; co-regulation; mutual information; operon prediction; regulatory network inference
Year: 2015 PMID: 26167162 PMCID: PMC4481165 DOI: 10.3389/fmicb.2015.00650
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Full compendium size and number of partial compendia for each bacterial genome in the analysis.
| Organism | Full compendium size (samples) | Number of partial compendia (samples in each) |
| | 195 | 3 (A = 83, B = 60, C = 52) |
| E. coli | 907 | 5 (A = 331, B = 212, C = 137, D = 133, E = 94) |
| | 176 | 3 (A = 70, B = 54, C = 52) |
| | 245 | 4 (A = 72, B = 61, C = 59, D = 53) |
| | 852 | 5 (A = 263, B = 228, C = 193, D = 90, E = 78) |
| | 407 | 3 (A = 222, B = 99, C = 86) |
| Total | 2782 samples | 23 partial compendia |
Figure 1. Expression values for lexA and dinF across 907 microarray samples in a publicly available repository. The genes lexA and dinF show a strong pattern of association in RMA-normalized gene expression across the 907 samples available in the M3D repository (full compendium). In particular, as expression levels of lexA increase, expression levels of dinF also increase. This strong association is reflected in the high values of common statistical measures of association (Pearson correlation = 0.86, Spearman correlation = 0.79, Mutual Information = 0.70).
Figure 2. Expression values for lexA and dinF across 331 microarray samples in a publicly available repository. The genes lexA and dinF show a much weaker pattern of association in RMA-normalized gene expression across the 331 samples available in a subset of the M3D repository consisting of gene expression data collected from GEO (28) (partial compendium A). In particular, as expression levels of lexA increase, expression levels of dinF increase only modestly. This weaker association is reflected in the modest values of common statistical measures of association (Pearson correlation = 0.56, Spearman correlation = 0.60, Mutual Information = 0.37).
Figure 3. Expression values for lexA and dinF across 212 microarray samples in a publicly available repository. The genes lexA and dinF show a strong pattern of association in RMA-normalized gene expression across these 212 microarray samples (partial compendium B), similar to the pattern seen in the larger set of all 907 samples. In particular, as expression levels of lexA increase, expression levels of dinF increase. This strong association is reflected in the high values of common statistical measures of association (Pearson correlation = 0.87, Spearman correlation = 0.81, Mutual Information = 0.71).
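The three association measures quoted in Figures 1 to 3 can be illustrated with a minimal, dependency-free sketch. The function names are hypothetical, and the mutual information estimator here (equal-width histogram bins, result in nats) is an assumption: the source does not state which estimator or scale produced the reported values, so this code is not expected to reproduce them.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mutual_information(x, y, bins=4):
    """Histogram estimate of mutual information in nats (binning is an assumption)."""
    def discretize(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0  # avoid zero width for constant input
        return [min(int((a - lo) / width), bins - 1) for a in v]
    bx, by = discretize(x), discretize(y)
    n = len(x)
    pxy, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in pxy.items())
```

Calling each function on the lexA and dinF expression vectors of one compendium gives the three per-compendium values of the kind reported in the captions; repeating this on a partial compendium shows how the estimates move with the sample set.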
Proportion of 95% bootstrap confidence intervals that do not include zero, comparing partial compendia.
| AB (46%, 35%, 32%) | AB (67%, 69%, 55%) | AB (25%, 22%, 5%) | AB (38%, 37%, 8%) | AB (40%, 44%, 13%) | AB (40%, 39%, 67%) |
For each organism, each pair of partial compendia is compared, and the percentages of Pearson correlation, Spearman correlation, and mutual information confidence intervals that do not include zero are provided. If the partial compendia are estimating the same underlying parameter, as we would expect and desire, then on average only 5% of the confidence intervals should exclude zero. However, in almost all cases substantially more than 5% of the confidence intervals exclude zero, suggesting that for many gene pairs the different partial compendia are estimating different parameters, regardless of the gene association metric being used.
Figure 4. Pearson correlations according to partial compendia A and B in 1000 random gene pairs. For large compendia (as we use here: 331 samples and 212 samples, respectively) we expect limited variability in gene expression correlation measures between the compendia. In particular, we expect most correlations to be close to the line y = x if the same underlying parameter is being estimated. We computed 95% confidence intervals on the difference in correlation estimates for 1000 random E. coli gene pairs. Black circles indicate pairs of genes for which the correlation estimates are similar (95% confidence interval includes zero), while red x's indicate gene pairs for which the correlation estimates are not similar (95% confidence interval does not include zero). As shown in the figure by the preponderance of red x's, the majority of pairwise correlation values are well outside the expected range, representing substantially more variability than is expected due to chance alone. The Pearson correlation computed on the scatterplot shown in this figure is only 0.33, representing a very weak association between the Pearson correlations of gene pairs in partial compendia A and B for E. coli.
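A bootstrap confidence interval on the difference in correlation between two compendia might be computed along the following lines. This is a sketch under stated assumptions: it uses a percentile bootstrap that resamples samples independently within each compendium, and the function names, `n_boot`, and `seed` are hypothetical. The source does not specify which bootstrap variant was used.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a degenerate (zero-variance) resample."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # simplification, not a statistical recommendation
    return sxy / math.sqrt(sxx * syy)

def bootstrap_diff_ci(xa, ya, xb, yb, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r_A - r_B (assumed variant).

    xa, ya: expression of gene 1 and gene 2 across samples of compendium A.
    xb, yb: the same two genes across samples of compendium B.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample sample indices with replacement, separately per compendium.
        ia = [rng.randrange(len(xa)) for _ in xa]
        ib = [rng.randrange(len(xb)) for _ in xb]
        ra = pearson([xa[i] for i in ia], [ya[i] for i in ia])
        rb = pearson([xb[i] for i in ib], [yb[i] for i in ib])
        diffs.append(ra - rb)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def excludes_zero(ci):
    """True when the interval lies entirely above or below zero."""
    lo, hi = ci
    return lo > 0 or hi < 0
```

Applying `excludes_zero` to each of the 1000 random gene pairs and averaging would give the kind of percentage discussed above; roughly 5% is the expectation when both compendia estimate the same correlation.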
Predicting pairwise gene correlations using a plausibly informative vs. uninformative approach.
| 37.4% | 37.1% | 49.9% | |
| 40.7% | 40.6% | 60.7% | |
| 32.9% | 34.3% | 45.8% | |
| 40.6% | 42.5% | 50.4% | |
| 34.2% | 36.2% | 65.2% | |
| 27.8% | 28.3% | 67.5% | |
The naïve approach uses the pairwise gene correlation value in partial compendium l to predict the value for the same pair of genes in partial compendium k; the uninformative approach uses a random value to make the prediction. If the naïve approach were no better than the uninformative approach, we would expect the values in the table to be ~50%. Since the values in the table are fairly close to 50% (27.8% to 67.5%), there is a high degree of noise in the correlation estimates and limited signal. If the values were close to 0, the naïve approach would typically be outperforming the uninformative approach.
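One way to read the comparison above is as the percentage of gene pairs for which the uninformative prediction lands closer to the compendium k value than the naïve prediction does. The sketch below implements that reading. Both the scoring rule and the choice of random value (the correlation of a randomly chosen gene pair in compendium l) are assumptions, since the source does not spell them out, and the function name is hypothetical.

```python
import random

def percent_uninformative_wins(corr_l, corr_k, seed=0):
    """Percent of gene pairs where a random guess beats the naive prediction.

    corr_l, corr_k: correlations for the same gene pairs, in the same order,
    estimated in partial compendia l and k respectively.
    """
    rng = random.Random(seed)
    wins = 0
    for naive_guess, target in zip(corr_l, corr_k):
        naive_error = abs(naive_guess - target)
        # Assumed "uninformative" guess: correlation of a random pair in l.
        random_guess = rng.choice(corr_l)
        if abs(random_guess - target) < naive_error:
            wins += 1
    return 100.0 * wins / len(corr_k)
```

Under this reading, a result near 50 means the compendium l estimate carries little information about compendium k, and a result near 0 means the naïve prediction almost always wins.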
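Values of the association metrics for the lexA and dinF gene pair.
| Metric | Full compendium (n = 907) | A (n = 331) | B (n = 212) | C (n = 137) | D (n = 133) | E (n = 94) |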
| Pearson | 0.86 | 0.56 | 0.87 | 0.40 | 0.92 | 0.34 |
| Spearman | 0.79 | 0.60 | 0.81 | 0.37 | 0.92 | 0.23 |
| Mutual Information | 0.70 | 0.37 | 0.71 | 0.43 | 0.99 | 0.35 |
| Percent of samples in which the expression level of lexA is “on” (above 11) | 24% (216/907) | 1% (2/331) | 88% (186/212) | 8% (11/137) | 13% (17/133) | 0% (0/94) |
| Percent of samples in which the expression level of lexA is “on” (above 11) and the expression level of dinF is “on” (above 9.5) | 20% (185/907) | 0% (1/331) | 80% (169/212) | 2% (3/137) | 9% (12/133) | 0% (0/94) |
| Standard deviation of absolute expression for lexA | 1.13 | 0.62 | 0.64 | 0.99 | 0.92 | 0.55 |
| Standard deviation of absolute expression for dinF | 0.71 | 0.36 | 0.53 | 0.47 | 0.55 | 0.26 |
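The on/off percentages and standard deviations in this table can be sketched as follows, using the thresholds given in the row labels (lexA above 11, dinF above 9.5). The function name is hypothetical, and the use of the sample standard deviation is an assumption, since the source does not say which form was used.

```python
import statistics

def on_off_summary(lexa, dinf, lexa_cut=11.0, dinf_cut=9.5):
    """Summarize one compendium for the lexA/dinF pair.

    lexa, dinf: expression values of each gene across the same samples.
    """
    n = len(lexa)
    lexa_on = [v > lexa_cut for v in lexa]
    # A sample counts as "both on" only if lexA and dinF both exceed their cutoffs.
    both_on = [a and (d > dinf_cut) for a, d in zip(lexa_on, dinf)]
    return {
        "pct_lexA_on": 100.0 * sum(lexa_on) / n,
        "pct_both_on": 100.0 * sum(both_on) / n,
        "sd_lexA": statistics.stdev(lexa),  # sample SD (assumption)
        "sd_dinF": statistics.stdev(dinf),
    }
```

For the full compendium, the table's first percentage is the same arithmetic: 216 of 907 samples with lexA above 11 gives 216/907, or about 24%.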
Average Pearson correlation across all pairs of genes, cross-classified by biological grouping and percent on/off.
| Likely operons | 0.72 (294) | 0.75 (320) | 0.81 (610) | 0.84 (493) | 0.87 (310) | 0.88 (123) |
| Likely non-operons | 0.20 (254) | 0.17 (238) | 0.16 (186) | 0.21 (186) | 0.19 (86) | 0.06 (28) |
| Gene pairs in same pathway | 0.23 (1128) | 0.20 (1280) | 0.28 (1994) | 0.38 (1779) | 0.35 (711) | 0.47 (266) |
In parentheses is the number of gene pairs satisfying the conditions to be included in each cell. The full compendium of expression data is considered.
The minimum, over both genes and both states, of the percent of samples (arrays) in which a gene is on or off in the full compendium of samples.
Posterior probability of being in an operon based on genomic evidence is more than 99% according to MicrobesOnline (Bockhorst et al., .
Posterior probability of being in an operon based on genomic evidence is less than 1% according to MicrobesOnline (Bockhorst et al., .
Pathway definitions provided by the SEED (Tintle et al., .
Figure 5. Pairwise Pearson correlation versus minimum standard deviation of gene expression value: operons. As the minimum standard deviation of the gene pair increases, the correlation between genes likely to be in the same operon also increases. The Pearson correlation for the scatterplot is 0.58.
Figure 6. Pairwise Pearson correlation versus minimum standard deviation of gene expression value: non-operons. As the minimum standard deviation of the gene pair increases, the correlation between genes unlikely to be in the same operon shows little discernible pattern. The Pearson correlation for the scatterplot is 0.10.
Figure 7. Pairwise Pearson correlation versus minimum standard deviation of gene expression value: pathways. As the minimum standard deviation of the gene pair increases, the correlation between genes in the same pathway shows a generally increasing pattern. The Pearson correlation for the scatterplot is 0.38.
Sensitivity and specificity of different standard deviation cutoffs.
| Standard deviation | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
| Sensitivity | 0 (0/287) | 0.36 (103/287) | 0.77 (221/287) | 0.95 (274/287) | 0.99 (283/287) |
| Specificity | 1 (2448/2448) | 0.93 (2272/2448) | 0.72 (1772/2448) | 0.49 (1193/2448) | 0.31 (753/2448) |
| Sensitivity + specificity | 1.00 | 1.29 | 1.49 | 1.44 | 1.30 |
Of 2735 pairs of genes likely to be in the same operon, 287 have correlations below 0.6, suggesting that these correlation estimates are substantially biased, since the true correlation between two genes in the same operon should be close to 1. Sensitivity is the proportion of these 287 low-correlation pairs with standard deviations below a given threshold. 77% (221/287) of the operon pairs with low correlation also have low (<0.5) standard deviation. Specificity is the proportion of the remaining 2448 operon pairs, those with larger correlations (≥0.6), whose standard deviations are at or above the threshold; at a threshold of 0.5, nearly three-fourths (72%) of them qualify.
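The sensitivity and specificity calculation described above can be sketched in a few lines. The function name and the handling of values exactly at a cutoff are assumptions; the counts in `TABLE_COUNTS` are copied from the table, and the loop simply recomputes sensitivity + specificity to see which cutoff maximizes it.

```python
def sens_spec(pairs, sd_cutoff, corr_cutoff=0.6):
    """Sensitivity and specificity of a minimum-SD cutoff.

    pairs: list of (correlation, min_sd) tuples for likely-operon gene pairs.
    A pair is "flagged" when its minimum standard deviation is below sd_cutoff.
    """
    low = [sd for r, sd in pairs if r < corr_cutoff]    # low-correlation pairs
    high = [sd for r, sd in pairs if r >= corr_cutoff]  # remaining pairs
    sensitivity = sum(sd < sd_cutoff for sd in low) / len(low)
    specificity = sum(sd >= sd_cutoff for sd in high) / len(high)
    return sensitivity, specificity

# Counts from the table: cutoff -> (flagged low-correlation pairs out of 287,
#                                   unflagged higher-correlation pairs out of 2448)
TABLE_COUNTS = {
    0.1: (0, 2448),
    0.3: (103, 2272),
    0.5: (221, 1772),
    0.7: (274, 1193),
    0.9: (283, 753),
}

def best_cutoff(counts, n_low=287, n_high=2448):
    """Cutoff maximizing sensitivity + specificity."""
    return max(counts, key=lambda c: counts[c][0] / n_low + counts[c][1] / n_high)
```

With the table's counts, `best_cutoff(TABLE_COUNTS)` selects 0.5, where 221/287 + 1772/2448 is about 1.49.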
Sensitivity and specificity of different state cutoff rules.
| Minimum represented cutoff | 0.025 | 0.05 | 0.1 | 0.2 |
| Sensitivity | 0.11 (26/244) | 0.25 (61/244) | 0.51 (125/244) | 0.77 (188/244) |
| Specificity | 0.95 (1938/2042) | 0.87 (1782/2042) | 0.74 (1511/2042) | 0.45 (915/2042) |
| Sensitivity + specificity | 1.06 | 1.12 | 1.25 | 1.22 |
Of 2286 pairs of genes likely to be in the same operon and for which the two-state model is a good fit, 244 have correlations below 0.6, suggesting that these correlation estimates are substantially biased, since the true correlation between two genes in the same operon should be close to 1. Sensitivity is the proportion of these 244 low-correlation pairs for which the percent of samples in the less represented state falls below a given threshold. 51% (125/244) of the low-correlation pairs also have fewer than 10% of the experiments classified to one of the two states (on or off). Of the operon pairs with larger correlations (≥0.6), nearly three-fourths (74%) have at least 10% of the experiments in each of the two states.
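The "minimum represented" quantity used by the state cutoff rule can be sketched as below. The on/off calls here come from fixed per-gene thresholds, which is an assumption made for illustration: the source refers to a two-state model, and how that model assigns states is not described in this text. Function and parameter names are hypothetical.

```python
def min_represented(expr_a, expr_b, cut_a, cut_b):
    """Smallest fraction of samples in any state (on or off) for either gene.

    expr_a, expr_b: expression values of the two genes across the same samples.
    cut_a, cut_b: per-gene thresholds above which a gene is called "on"
                  (stand-in for the two-state model, an assumption).
    """
    n = len(expr_a)
    on_a = sum(v > cut_a for v in expr_a) / n
    on_b = sum(v > cut_b for v in expr_b) / n
    return min(on_a, 1 - on_a, on_b, 1 - on_b)

def flag_unreliable(expr_a, expr_b, cut_a, cut_b, min_cutoff=0.10):
    """Flag a gene pair whose less represented state covers under min_cutoff of samples."""
    return min_represented(expr_a, expr_b, cut_a, cut_b) < min_cutoff
```

As an illustration from the lexA and dinF table, lexA is on in only 2 of 331 samples in partial compendium A (about 0.6%), so a 10% rule would flag that pair's correlation in compendium A as potentially unreliable.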