Jeffrey T Leek, Margaret A Taub, Jason L Rasgon.
Abstract
BACKGROUND: Genomic technologies are, by their very nature, designed for hypothesis generation. In some cases, the hypotheses that are generated require that genome scientists confirm findings about specific genes or proteins. But one major advantage of high-throughput technology is that global genetic, genomic, transcriptomic, and proteomic behaviors can be observed. Manual confirmation of every statistically significant genomic result is prohibitively expensive. This has led researchers in genomics to adopt the strategy of confirming only a handful of the most statistically significant results, a small subset chosen for biological interest, or a small random subset. But there is no standard approach for selecting and quantitatively evaluating validation targets.
Year: 2012 PMID: 22738145 PMCID: PMC3568710 DOI: 10.1186/1471-2105-13-150
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1 Validation strategy schematic. A set of RNA sequencing data is analyzed using statistical models (blue = high expression, yellow = low expression) and a list of significant genes is identified at a fixed false discovery rate (FDR). From the list of significant genes a few, usually the most statistically significant, are validated with the independent validation technology quantitative PCR (qPCR). Ideally the confirmation with independent technology can be used to validate the entire list of significant genes.
independent technology. The minimum validation sample size for FDR cutoff q can be found by solving the following optimization problem:
Figure 2 Minimum Validation Sample Size Versus FDR Cutoff. A plot of the minimum validation sample size required using sampling to achieve a target validation probability of 0.5, assuming that the experimental technology, statistical method, and validation technology are accurate. This plot is based on the results of a specific study and can be used to plan validation experiments. A ∙ indicates the minimum sample size for a fixed FDR cutoff and a × indicates that the target validation probability cannot be achieved at that FDR threshold.
Figure 3 Validation Probability by Sample Size. A plot of the validation probability versus the sample size, for various FDR cutoffs assuming that (0.7 × FDR level × Validation sample size) false positives are observed in the validation set. For any sample size, the validation probability is higher when the FDR cutoff is larger.
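The relationship in Figure 3 can be illustrated with a minimal sketch. This excerpt does not give the formula for the validation probability, so the sketch assumes a Beta-Binomial model: a uniform Beta(1, 1) prior on the FDR (matching the "Prior = Uniform" rows of the simulation table), updated with the number of false positives seen in the validation sample, and a validation probability defined as the posterior probability that the FDR is at or below the cutoff. The function names `beta_cdf_int` and `validation_probability` are hypothetical, and the false positive count is rounded to an integer so the Beta CDF can be computed exactly with the standard library.

```python
from math import comb


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def validation_probability(n_validated, n_false_pos, fdr_cutoff):
    """Posterior Pr(FDR <= fdr_cutoff) under an ASSUMED uniform Beta(1, 1) prior.

    The posterior is Beta(1 + false positives, 1 + confirmed results).
    """
    a = 1 + n_false_pos
    b = 1 + n_validated - n_false_pos
    return beta_cdf_int(fdr_cutoff, a, b)


if __name__ == "__main__":
    # Mimic the Figure 3 setting: about 0.7 * FDR level * sample size
    # false positives, rounded to a whole number (an assumption of this sketch).
    for q in (0.05, 0.10, 0.20):
        for n in (20, 50, 100):
            k = round(0.7 * q * n)
            vp = validation_probability(n, k, q)
            print(f"FDR cutoff {q:.2f}, n = {n:3d}, false positives = {k:2d}: {vp:.3f}")
```

Under these assumptions, observing fewer false positives than the cutoff would predict pushes the posterior mass below the cutoff, which is why the validation probability grows with the validation sample size.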
Simulation study to assess the properties of the validation probability
| Errorless validation | Median Validation Probability | 0.72 | 0.91 | 1.00 |
| Prior = Uniform | Validation Probability IQR | (0.58, 0.87) | (0.72, 0.98) | (1.00,1.00) |
| | FDR 95% Credible Interval Coverage | 0.98 | 0.68 | 0.00 |
| | Median Posterior Expectation of FDR | 0.04 | 0.07 | 0.35 |
| Validation subject to error | Median Validation Probability | 0.64 | 0.87 | 1.00 |
| Prior = Uniform | Validation Probability IQR | (0.37, 0.83) | (0.71, 0.95) | (1.00,1.00) |
| | FDR 95% Credible Interval Coverage | 0.98 | 0.68 | 0.00 |
| | Median Posterior Expectation of FDR | 0.05 | 0.08 | 0.24 |
| Results should not validate | Median Validation Probability | 0.00 | 0.00 | 0.25 |
| Prior = Uniform | Validation Probability IQR | (0.00,0.00) | (0.00,0.00) | (0.15,0.51) |
| | FDR 95% Credible Interval Coverage | 0.00 | 0.00 | 0.96 |
| | Median Posterior Expectation of FDR | 0.35 | 0.37 | 0.52 |
| Errorless validation | Median Validation Probability | 0.83 | 0.94 | 1.00 |
| Prior = Adaptive | Validation Probability IQR | (0.70,0.94) | (0.79,0.99) | (1.00,1.00) |
| | FDR 95% Credible Interval Coverage | 0.88 | 0.63 | 0.00 |
| | Median Posterior Expectation of FDR | 0.04 | 0.07 | 0.35 |
| Validation subject to error | Median Validation Probability | 0.76 | 0.91 | 1.00 |
| Prior = Adaptive | Validation Probability IQR | (0.49, 0.91) | (0.78, 0.97) | (1.00,1.00) |
| | FDR 95% Credible Interval Coverage | 0.91 | 0.77 | 0.00 |
| | Median Posterior Expectation of FDR | 0.04 | 0.07 | 0.23 |
| Results should not validate | Median Validation Probability | 0.00 | 0.00 | 0.25 |
| Prior = Adaptive | Validation Probability IQR | (0.00,0.00) | (0.00,0.00) | (0.15,0.51) |
| | FDR 95% Credible Interval Coverage | 0.00 | 0.00 | 0.95 |
| | Median Posterior Expectation of FDR | 0.35 | 0.37 | 0.52 |
For each of three scenarios and two choices for the prior distribution, 100 simulated gene expression studies were generated with 1,000 genes each. This table reports the median (25th percentile, 75th percentile) of the validation probability across the 100 studies, the coverage proportion of the 95% posterior credible interval for the estimated FDR in each scenario, and the median posterior expectation of the FDR.
Statistical validation analysis of data from six microarray experiments obtained from GEO
| GSE10245 | 6,742 | 58 | 3.57% | 3.85 years | 0.14 years |
| | | | | $2.5e6 | $8.8e4 |
| GSE11492 | 333 | 8 | 72.37% | 0.03 years | 0.02 years |
| | | | | $8.9e4 | $6.4e4 |
| GSE17913 | 739 | 79 | 32.61% | 0.58 years | 0.18 years |
| | | | | $3.0e5 | $9.8e4 |
| GSE16032 | 343 | 10 | 70.26% | 0.03 years | 0.02 years |
| | | | | $9.3e4 | $6.5e4 |
| GSE16538 | 1,624 | 12 | 14.83% | 0.19 years | 0.03 years |
| | | | | $9.3e4 | $6.6e4 |
| GSE11524 | 2,295 | 30 | 10.50% | 0.68 years | 0.07 years |
| | | | | $7.1e5 | $7.5e4 |
For each data set, differential expression was calculated with respect to the primary biological variable. For each experiment, the number of genes differentially expressed (DE) at 5% is reported. In each case, 241 genes are required for statistical validation; for each study we present the fraction of the DE genes that this represents. The cost in dollars and graduate student years of manually confirming the whole list of DE genes, or only the DE genes needed for statistical validation, is also reported.
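As a quick check of the arithmetic behind the fraction column, the sketch below divides the 241 genes required for statistical validation by the number of DE genes reported for each data set. The variable names are illustrative only; the gene counts are copied from the table above.

```python
# Number of genes required for statistical validation, from the table note.
N_REQUIRED = 241

# Number of DE genes per GEO data set, copied from the table.
de_genes = {
    "GSE10245": 6742,
    "GSE11492": 333,
    "GSE17913": 739,
    "GSE16032": 343,
    "GSE16538": 1624,
    "GSE11524": 2295,
}

# Fraction of each DE list that must be confirmed for statistical validation.
fraction_required = {gse: N_REQUIRED / n for gse, n in de_genes.items()}

for gse, frac in fraction_required.items():
    print(f"{gse}: {frac:.2%} of {de_genes[gse]:,} DE genes")
```

The computed fractions agree with the reported percentages up to rounding in the last digit, and they show why statistical validation saves the most effort for the studies with the longest DE gene lists.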