Tracy L Bergemann, Jason Wilson.
Abstract
BACKGROUND: In genetic transcription research, gene expression is typically reported in a test sample relative to a reference sample. Laboratory assays that measure gene expression levels, from Q-RT-PCR to microarrays to RNA-Seq experiments, will compare two samples to the same genetic sequence of interest. Standard practice is to use the log2-ratio as the measure of relative expression. There are drawbacks to using this measurement, including unstable ratios when the denominator is small. This paper suggests an alternative estimate based on a proportion that is just as simple to calculate, just as intuitive, with the added benefit of greater numerical stability.
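The instability described in the abstract can be illustrated numerically. The following is a minimal sketch, assuming the proportion takes the form R / (R + G) for a test intensity R and a reference intensity G; that form, the function names, and the intensity values are illustrative assumptions, not the paper's exact definitions.

```python
import math

def log2_ratio(test, ref):
    """Log2-ratio of test to reference intensity (the standard measure)."""
    return math.log2(test / ref)

def proportion(test, ref):
    """Share of the total signal from the test sample (assumed form R / (R + G))."""
    return test / (test + ref)

# Shrink the reference (denominator) intensity while the test intensity stays fixed.
test = 100.0
for ref in (100.0, 10.0, 1.0, 0.1):
    print(f"ref={ref:>6}: log2-ratio={log2_ratio(test, ref):6.2f}, "
          f"proportion={proportion(test, ref):.4f}")
```

For non-negative intensities the proportion stays within [0, 1], whereas the log2-ratio grows without bound as the reference intensity approaches zero.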
Year: 2011 PMID: 21649912 PMCID: PMC3224106 DOI: 10.1186/1471-2105-12-228
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The competitive hybridization process for a two-color system: The number of PCR products equals the number of possible hybridizations. A proportion of the sequences will bind with matching red labeled strands and the remainder bind with the matching green labeled strands. Some sequences will not match (marked with X's) and should not hybridize.
Figure 2. The relationship between the log2-ratio .
Simulation comparing test statistics for , , , limma/EBA, edgeR, and DESeq with a sample size of n = 20 under four distributional assumptions.
| | Exponential (fc = 1) | Exponential (fc = 3) | Poisson (fc = 1) | Poisson (fc = 3) | Binomial (fc = 1) | Binomial (fc = 3) | Normal (fc = 1) | Normal (fc = 3) |
|---|---|---|---|---|---|---|---|---|
| | 0.051 | 0.742 | 0.004 | 0.116 | 0.047 | 1.000 | 0.050 | 1.000 |
| | 0.051 | 0.742 | 0.038 | 0.757 | 0.047 | 1.000 | 0.050 | 1.000 |
| | 0.051 | 0.742 | 0.044 | 0.943 | 0.047 | 1.000 | 0.050 | 1.000 |
| | 0.975 | 1.000 | 0.045 | 1.000 | 0.048 | 1.000 | 0.003 | 1.000 |
| | 0.055 | 0.773 | 0.047 | 0.881 | 0.047 | 1.000 | 0.050 | 1.000 |
| | 0.975 | 1.000 | 0.045 | 1.000 | 0.048 | 1.000 | 0.003 | 1.000 |
| EBA | 0.051 | 0.781 | 0.048 | 1.000 | 0.047 | 1.000 | 0.052 | 1.000 |
| edgeR | NA | NA | 0.033 | 1.000 | 0.014 | 1.000 | NA | NA |
| DESeq | NA | NA | 0.042 | 1.000 | 0.047 | 1.000 | NA | NA |
The exponential distribution has rate parameter 1/4000, the Poisson has rate parameter 3, the binomial has size 10000, and the normal has mean 10 and standard deviation 2. Each entry is the proportion of times the null hypothesis was rejected at α = 0.05, out of 1000 simulations. The null hypothesis of no differential expression is equivalent to a fold change of one (fc = 1), so entries in those columns estimate the Type I error rate. When the fold change is three (fc = 3), the entries estimate the power to detect differential expression. Tables for other distributional parameters may be found in Additional file 1. These tables also include a greater range of sample sizes and fold changes.
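The rejection proportions described in this note can be estimated with a small Monte Carlo loop. The sketch below is illustrative only: it uses the normal setting (mean 10, standard deviation 2, n = 20, 1000 simulations), assumes the fold change multiplies the mean of the test channel, assumes a proportion of the form R / (R + G), and substitutes a plain one-sample t-test against 0.5 for the paper's test statistics, which are not reproduced here. The function name and the seed are hypothetical.

```python
import math
import random
import statistics

T_CRIT_DF19 = 2.093  # two-sided 5% critical value of Student's t with 19 df

def rejection_rate(fold_change, n=20, n_sims=1000, mean=10.0, sd=2.0, seed=0):
    """Fraction of simulated data sets in which H0 (mean proportion = 0.5) is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        props = []
        for _ in range(n):
            ref = rng.gauss(mean, sd)
            # Assumption: the fold change scales the mean of the test channel.
            test = rng.gauss(fold_change * mean, sd)
            props.append(test / (test + ref))
        se = statistics.stdev(props) / math.sqrt(n)
        t_stat = (statistics.mean(props) - 0.5) / se
        if abs(t_stat) > T_CRIT_DF19:
            rejections += 1
    return rejections / n_sims

print("Type I error (fc = 1):", rejection_rate(1.0))
print("Power (fc = 3):", rejection_rate(3.0))
```

The critical value is hard-coded for n = 20; other sample sizes would need the matching t quantile.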
Comparison of estimators from the simulations.
| Good | Acceptable | Unacceptable | |
|---|---|---|---|
| Exponential | |||
| Poisson | |||
| Binomial | edgeR | DESeq | |
| Normal |
Four estimators (, and ) and three methods (EBA, edgeR, and DESeq) were used under four distributional assumptions (Exponential, Poisson, Binomial, and Normal). The performance rating (Good, Acceptable, Unacceptable) was judged on the basis of Type I error and power. See Table 1 for an example of the estimators and why the ratings were judged as shown. Additional data for judging the ratings are given in Additional file 1.
Figure 3. Scatterplot of p-values for log. The general ordering of the genes is similar, although not identical, using the two methods.
Table of raw p-values for Welch t-statistics and limma methods using and from the apoAI control treatment expression data.
| rank | p-value | rank | p-value | rank | p-value (limma) | rank | p-value (limma) |
|---|---|---|---|---|---|---|---|
| 1 | 7.3 × 10⁻⁷ | 1 | 4.2 × 10⁻⁶ | 1 | 3.8 × 10⁻¹² | 1 | 1.5 × 10⁻⁹ |
| 2 | 2.4 × 10⁻⁵ | 2 | 2.4 × 10⁻⁵ | 4 | 5.2 × 10⁻⁷ | 4 | 1.3 × 10⁻⁶ |
| 3 | 3.4 × 10⁻⁵ | 4 | 4.0 × 10⁻⁵ | 2 | 2.6 × 10⁻⁸ | 3 | 1.2 × 10⁻⁷ |
| 4 | 5.0 × 10⁻⁵ | 3 | 2.8 × 10⁻⁵ | 3 | 5.1 × 10⁻⁸ | 2 | 7.6 × 10⁻⁸ |
| 5 | 1.0 × 10⁻⁴ | 6 | 1.2 × 10⁻⁴ | 5 | 1.4 × 10⁻⁶ | 6 | 7.5 × 10⁻⁶ |
| 6 | 1.0 × 10⁻⁴ | 5 | 5.8 × 10⁻⁵ | 7 | 9.6 × 10⁻⁶ | 7 | 1.2 × 10⁻⁵ |
| 7 | 2.9 × 10⁻⁴ | 7 | 2.7 × 10⁻⁴ | 12 | | 11 | |
| 8 | 5.9 × 10⁻⁴ | 9 | | 8 | | 16 | |
| 9 | 7.4 × 10⁻⁴ | 8 | | 10 | | 61 | |
| 10 | 1.3 × 10⁻³ | 10 | 1.1 × 10⁻³ | 6 | | 5 | |
| 1 | -16.5 | 1 | -12.8 | 1 | -23.1 | 1 | -14.4 |
| 2 | -9.8 | 2 | -9.8 | 4 | -9.0 | 4 | -8.3 |
| 3 | -9.3 | 4 | -9.1 | 2 | -11.5 | 3 | -10.1 |
| 4 | -8.8 | 3 | -9.6 | 3 | -10.9 | 2 | -10.5 |
| 5 | -7.9 | 6 | -7.7 | 5 | -8.2 | 6 | -7.0 |
| 6 | -7.8 | 5 | -8.5 | 7 | -6.7 | 7 | -6.7 |
| 7 | 6.6 | 7 | 6.7 | 12 | 5.9 | 11 | 6.0 |
| 8 | -5.9 | 9 | -5.8 | 8 | -6.7 | 16 | -5.9 |
| 9 | -5.7 | 8 | -5.9 | 10 | -5.2 | 61 | -4.8 |
| 10 | 5.2 | 10 | 5.3 | 6 | 8.0 | 5 | 8.2 |
The limma method was developed for log-ratios, not proportions, but we show the results using proportions for comparison. The first ten rows give the p-values and the last ten rows give the corresponding t-statistics, for reference. In the original paper, the top 8 probes were selected using the maxT multiple testing procedure with Welch t-statistics on in [12]. This selection is the first column, ranked 1 through 10. The p-values/t-statistics for the probes using and limma correspond to the first column, with their ranks shown. Using t-statistics with , the selection is similar to , but not identical, since probes ranked 8th and 9th switch places. Using limma with and , the selection begins to vary widely at the 7th probe.
Significant genes detected from the dataset in Marioni et al. (2008).
| Significant Genes in Intersecting Sets | ||||||||
|---|---|---|---|---|---|---|---|---|
| EBA (Affymetrix) | 3641 | 3096 | 672 | 3127 | 3037 | 251 | 3183 | 960 |
| | | 8641 | 790 | 8461 | 5400 | 308 | 8641 | 1116 |
| EBA (RNA-Seq) | | | 790 | 790 | 789 | 275 | 790 | 700 |
| edgeR | | | | 8697 | 5516 | 308 | 8697 | 1351 |
| DESeq | | | | | 7083 | 301 | 5644 | 1305 |
| | | | | | | 315 | 308 | 315 |
| | | | | | | | 11915 | 3324 |
| | | | | | | | | 3331 |
For each row and column, the number gives the significant gene calls by both methods. The cut-off to determine significance was set at α = 0.05/32000. The acronym LRT denotes the likelihood ratio test based on the Poisson distribution, as described in the Methods. The acronym EBA denotes the empirical Bayes analysis performed on both the Affymetrix and RNA-Seq data.
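The pairwise counts in such a table can be computed by thresholding each method's p-values and intersecting the resulting gene sets. The sketch below is an illustration under stated assumptions: the method names, gene IDs, and p-values are hypothetical, and a missing test is represented as `None`.

```python
from itertools import combinations_with_replacement

ALPHA = 0.05 / 32000  # significance cut-off quoted in the table note

def significant(p_values, alpha=ALPHA):
    """Set of gene IDs whose p-value is below the cut-off (None = missing test)."""
    return {gene for gene, p in p_values.items() if p is not None and p < alpha}

def overlap_table(results):
    """Number of genes called significant by both methods, for every pair."""
    calls = {method: significant(p) for method, p in results.items()}
    return {(a, b): len(calls[a] & calls[b])
            for a, b in combinations_with_replacement(list(calls), 2)}

# Hypothetical p-values for three genes under two methods.
results = {
    "method_A": {"g1": 1e-9, "g2": 1e-7, "g3": 0.2},
    "method_B": {"g1": 1e-8, "g2": 0.01, "g3": None},
}
for pair, count in overlap_table(results).items():
    print(pair, count)
```

A pair of a method with itself gives the diagonal entry, that is, the number of genes that method calls significant on its own.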
A summary of the missing values for each of the tests and the number of significant genes detected by other methods within those missing values.
| Significant Genes from Sets of Missing Genes | ||||||||
|---|---|---|---|---|---|---|---|---|
| EBA (Affymetrix) | 14292 | 561 | 17 | 589 | 450 | 16 | 2550 | 1316 |
| | 75 | 3979 | 0 | 235 | 197 | 0 | 3064 | 2208 |
| EBA (RNA-Seq) | 81 | 0 | 4726 | 235 | 197 | 0 | 3064 | 2208 |
| edgeR | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| DESeq | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | 422 | 973 | 5 | 1166 | 997 | 8947 | 4171 | 2406 |
| | 2 | 0 | 0 | 0 | 0 | 0 | 915 | 0 |
| | 3 | 0 | 0 | 0 | 0 | 0 | 856 | 1832 |
The diagonal gives the number of genes with missing tests. The off-diagonals indicate those genes that are significant for one method amongst the missing calls for another method. The acronym LRT denotes the likelihood ratio test based on the Poisson distribution, as described in the Methods. The acronym EBA denotes the empirical Bayes analysis performed on both the Affymetrix and RNA-Seq data.