Abstract
In microarray gene expression data analysis, it is often of interest to identify genes that share similar expression profiles with a particular gene, such as a key regulatory protein. Multiple studies have used various correlation measures to identify co-expressed genes. While these measures work well for small datasets, the heterogeneity introduced by larger sample sizes reduces their sensitivity and specificity, because most co-expression relationships do not extend to all experimental conditions. With the rapid increase in the size of microarray datasets, identifying functionally related genes from large and diverse microarray gene expression datasets is a key challenge. We develop a model-based gene expression query algorithm built under the Bayesian model selection framework. It is capable of detecting co-expression profiles under a subset of samples/experimental conditions. In addition, it allows linearly transformed expression patterns to be recognized and is robust against sporadic outliers in the data. Both features are critically important for increasing the power to identify co-expressed genes in large-scale gene expression datasets. Our simulation studies suggest that this method outperforms existing query tools based on correlation coefficients or mutual information. When we apply this new method to the Escherichia coli microarray compendium data, it identifies a majority of known regulons as well as novel potential target genes of numerous key transcription factors.
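The loss of sensitivity described above can be illustrated with a small sketch of a baseline correlation query. This is not the Bayesian algorithm from the paper; it only shows, on hypothetical toy data, how a Pearson-based ranking weakens when a gene tracks the query in only a subset of conditions. The function names and the data are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_query(query, genes):
    """Rank gene names by |Pearson r| with the query profile, best first."""
    scores = {name: abs(pearson(query, profile)) for name, profile in genes.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: 8 conditions. Gene "g_sub" follows the query only
# in the first 4 conditions; in the last 4 it is unrelated to the query.
query = [1.0, 2.0, 3.0, 4.0, 3.0, 1.0, 4.0, 2.0]
g_sub = [1.1, 2.1, 2.9, 4.2, 1.0, 4.0, 2.0, 3.0]
# Gene "g_full" is a linear transformation of the query in all conditions.
g_full = [2.0 * v + 1.0 for v in query]

r_all = pearson(query, g_sub)              # correlation over all conditions
r_subset = pearson(query[:4], g_sub[:4])   # correlation over the shared subset
ranking = correlation_query(query, {"g_sub": g_sub, "g_full": g_full})

print(f"all conditions: {r_all:.3f}, subset only: {r_subset:.3f}")
print(ranking)
```

In this toy case the correlation over the four shared conditions is close to 1, while the correlation over all eight conditions is close to 0, so a whole-dataset query would rank "g_sub" poorly even though it is co-expressed with the query in half of the conditions.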
Year: 2009 PMID: 19214232 PMCID: PMC2637418 DOI: 10.1371/journal.pone.0004495
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Illustration of the model-based gene expression query algorithm.
Each row represents a gene, and each column represents a sample/experimental condition. The query gene is at the bottom. The blue boxes indicate the genes and experimental conditions in which co-expression with the query gene is observed.
Performance comparison among various methods for querying simulated microarray gene expression datasets. Best results are displayed in bold.
| Case | Sub-case | Pearson | Spearman | Kendall | QDB | Mutual | BEST A | BEST B | BEST C |
| Case 1: 100% foreground | I | | | | | | | | |
| | II | 0.67 (0.12) | 0.68 (0.12) | 0.68 (0.12) | 0.59 (0.13) | | | | |
| | III | | | | | | | | |
| | IV | 0.62 (0.09) | 0.70 (0.09) | 0.70 (0.09) | 0.51 (0.11) | 0.78 (0.08) | 0.97 (0.04) | | |
| Case 2: 75% foreground | I | 0.89 (0.10) | 0.96 (0.05) | 0.99 (0.03) | | 0.87 (0.09) | | | |
| | II | 0.66 (0.12) | 0.71 (0.10) | 0.70 (0.09) | 0.70 (0.10) | 0.81 (0.09) | | | |
| | III | 0.91 (0.09) | 0.97 (0.04) | 0.99 (0.03) | | 0.87 (0.09) | | | |
| | IV | 0.61 (0.11) | 0.68 (0.11) | 0.70 (0.11) | 0.53 (0.12) | 0.70 (0.11) | | | |
| Case 3: 50% foreground | I | 0.66 (0.17) | 0.73 (0.14) | 0.80 (0.13) | 0.97 (0.16) | 0.61 (0.14) | | | |
| | II | 0.51 (0.11) | 0.59 (0.11) | 0.62 (0.12) | 0.71 (0.13) | 0.52 (0.13) | | | |
| | III | 0.63 (0.14) | 0.70 (0.13) | 0.77 (0.12) | 0.91 (0.25) | 0.59 (0.15) | | | |
| | IV | 0.42 (0.12) | 0.49 (0.12) | 0.53 (0.11) | 0.53 (0.17) | 0.43 (0.16) | 0.92 (0.06) | 0.92 (0.06) | |
| Case 4: 25% foreground | I | 0.36 (0.13) | 0.38 (0.12) | 0.40 (0.12) | 0.29 (0.29) | 0.29 (0.13) | 0.79 (0.34) | 0.95 (0.15) | |
| | II | 0.25 (0.10) | 0.26 (0.09) | 0.28 (0.09) | 0.19 (0.08) | 0.27 (0.09) | 0.73 (0.36) | 0.86 (0.28) | |
| | III | 0.34 (0.09) | 0.36 (0.09) | 0.38 (0.09) | 0.21 (0.14) | 0.29 (0.10) | 0.85 (0.29) | 0.95 (0.17) | |
| | IV | 0.25 (0.08) | 0.26 (0.07) | 0.26 (0.07) | 0.22 (0.13) | 0.22 (0.11) | 0.57 (0.28) | 0.66 (0.25) | |
Performance was measured as the proportion of true positives among the top T genes, where T is the number of true positives in each simulated dataset. The mean and standard deviation (in parentheses) of these proportions over the 50 simulated datasets are reported.
Each simulated case has four sub-cases, all with the same number of foreground columns.
Sub-case I: no linear transformation, no cell-level noise;
Sub-case II: linear transformation only;
Sub-case III: cell-level noise only;
Sub-case IV: both linear transformation and cell-level noise.
Pearson: query method using the Pearson correlation coefficient.
Spearman: query method using the Spearman correlation coefficient.
Kendall: query method using Kendall's τ.
QDB: query method using QDB.
Mutual: query method using mutual information.
BEST A: query method using BEST.
BEST B: query method using BEST, allowing exclusion of individual cells from the foreground.
BEST C: query method using BEST, with the indicator variables of five true target genes and five true experimental conditions fixed at 1.
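The performance measure used in the table above (the proportion of true positives among the top T genes, with T equal to the number of true positives) can be sketched as follows. The function names, gene labels, and scores are hypothetical, and the use of the sample standard deviation is an assumption, since the source does not say which form was used.

```python
from statistics import mean, stdev


def top_t_precision(scores, true_targets):
    """Proportion of true targets among the top T ranked genes.

    scores: dict mapping gene name -> query score (higher = stronger match).
    true_targets: set of gene names that are true positives; T = len(true_targets).
    """
    t = len(true_targets)
    if t == 0:
        raise ValueError("at least one true target is required")
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits = sum(1 for gene in ranked[:t] if gene in true_targets)
    return hits / t


def summarize(proportions):
    """Mean and sample standard deviation across simulated datasets (assumed form)."""
    return mean(proportions), stdev(proportions)


# Hypothetical scores for one simulated dataset with three true targets.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
true_targets = {"a", "c", "d"}
precision = top_t_precision(scores, true_targets)  # top 3 = a, b, c -> 2 hits of 3

# Hypothetical per-dataset proportions, summarized as "mean (sd)".
avg, sd = summarize([0.5, 1.0])
print(f"{precision:.2f}", f"{avg:.2f} ({sd:.2f})")
```

Because the cutoff equals the number of true positives, a perfect query scores 1 on this measure regardless of how many targets a dataset contains.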
Figure 2. ROC curves for various query methods applied to synthetic datasets simulated under different settings with 25% foreground columns.
BEST A: default setting; BEST B: allowing exclusion of individual cells from the foreground; BEST C: fixing the indicator variables of five true target genes and five true experimental conditions at 1. Panels: A, neither linear transformation nor cell-level noise; B, linear transformation only; C, cell-level noise only; D, both linear transformation and cell-level noise.
Figure 3. ROC curves for various query methods applied to the 100-gene test set selected from the E. coli microarray compendium.
The areas under the curves (AUC) are: Pearson correlation: 0.69; Spearman correlation: 0.69; Kendall's τ: 0.66; QDB: 0.70; Mutual information: 0.56; BEST: 0.87; Random control: 0.52.
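For readers unfamiliar with the AUC values quoted above, the following minimal sketch computes the area under a ROC curve through its rank interpretation: the probability that a randomly chosen true target receives a higher score than a randomly chosen non-target, with ties counted as one half. This is a generic formulation, not the authors' evaluation code, and the scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: query scores, higher meaning a stronger predicted match.
    labels: truthy for true targets, falsy for non-targets.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical example: two targets and two non-targets.
# Three of the four (target, non-target) pairs are ordered correctly.
auc = roc_auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc)  # 0.75
```

Under this reading, a value near 0.5 (such as the random control) means the ranking carries almost no information about which genes are targets, while values closer to 1 indicate that targets are consistently ranked ahead of non-targets.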
Figure 4. The original (blue line) and inverted (red line) expression profiles of gcvB, lysU, kbl and tdh compared with the query gene Lrp.
Black lines indicate the query gene, Lrp. Only the 143 foreground experimental conditions identified by BEST are shown in these plots. Results are from the 100-gene test set selected from the E. coli microarray compendium.
Information on the four genes showing inverse correlation patterns with Lrp, identified by BEST in the 100-gene test set selected from the E. coli microarray compendium.
| Rank | Gene Name | Log Bayes Ratio | Positive/negative | RegulonDB | CLR | Motif distance | Empirical p-value |
| 16 | gcvB | 107.8 | negative | | | 414 | 0.0047 |
| 23 | lysU | 84.52 | negative | X | | 138 | 0.0044 |
| 24 | kbl | 81.47 | negative | X | | 33 | 0.0019 |
| 25 | tdh | 80.09 | negative | X | | | |
All but the first one, gcvB, are in the RegulonDB target set.
Genes displayed here are sorted by the Log Bayes ratio (target gene versus non-target gene).
Positive/negative: blank indicates that the target gene shows the same pattern as the query gene; "negative" indicates that the target gene shows the inverse of the query gene's pattern.
RegulonDB: "X" indicates that the predicted gene is in the RegulonDB target set.
CLR: "X" indicates that the gene is predicted by CLR as a target gene.
Motif distance is defined as the distance between the start position of the gene and the closest motif in the upstream intergenic region.
Empirical p-value indicates the significance of conservation of the current motif. It is calculated as the proportion of all possible motif locations in the complete E. coli genome whose likelihood ratio (Lrp motif versus background) is higher than that of the current motif.
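The empirical p-value defined above reduces to a simple proportion once a likelihood ratio is available for every candidate motif location. The sketch below assumes those ratios have already been computed; how the Lrp motif and background models produce them is not described here, so the function name and the numbers are purely hypothetical.

```python
def empirical_p_value(observed_lr, genome_lrs):
    """Proportion of candidate motif locations scoring higher than the observed one.

    observed_lr: likelihood ratio (motif vs. background) of the motif of interest.
    genome_lrs: likelihood ratios for all candidate locations in the genome
                (assumed to be precomputed by some motif-scanning step).
    """
    if not genome_lrs:
        raise ValueError("genome_lrs must not be empty")
    higher = sum(1 for lr in genome_lrs if lr > observed_lr)
    return higher / len(genome_lrs)


# Hypothetical likelihood ratios for ten candidate locations.
genome_lrs = [0.1, 0.5, 2.0, 3.5, 8.0, 12.0, 0.2, 1.0, 0.7, 4.0]
p = empirical_p_value(3.5, genome_lrs)  # 3 of 10 locations score above 3.5
print(p)
```

A small value therefore means that few locations in the genome match the motif model better than the site in question.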