Esteban Czwan, Benedikt Brors, David Kipling.
Abstract
BACKGROUND: Theme-driven cancer survival studies address whether the expression signature of genes related to a biological process can predict patient survival time. Although this should ideally be achieved by testing two separate null hypotheses, current methods treat both hypotheses as one. The first test should assess whether a geneset, independent of its composition, is associated with prognosis (frequently done with a survival test). The second test then verifies whether the theme of the geneset is relevant (usually done with an empirical test that compares the geneset of interest with random genesets). Current methods do not test this second null hypothesis because it has been assumed that the distribution of p-values for random genesets (when tested against the first null hypothesis) is uniform. Here we demonstrate that this assumption is generally incorrect and that, consequently, such methods may erroneously associate the biology of a particular geneset with cancer prognosis.
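The second test described above can be illustrated with a minimal sketch. It assumes the empirical p-value (p2) is the fraction of random genesets whose survival-test p-value (p1) is at least as small as that of the geneset of interest; the add-one correction, the function name `empirical_p_value`, and the skewed toy null distribution are assumptions made for illustration and are not taken from the paper.

```python
import random


def empirical_p_value(observed_p1, random_p1_values):
    """Fraction of random genesets whose p1 is at least as small as the observed p1."""
    if not random_p1_values:
        raise ValueError("need at least one random geneset p-value")
    as_extreme = sum(1 for p in random_p1_values if p <= observed_p1)
    # Add-one correction (an assumption of this sketch) keeps p2 strictly above zero.
    return (as_extreme + 1) / (len(random_p1_values) + 1)


# Hypothetical null distribution: p1 values of random genesets skewed toward zero,
# standing in for the non-uniform distributions the abstract refers to.
rng = random.Random(0)
null_p1 = [rng.random() ** 3 for _ in range(10_000)]

# A geneset with p1 = 0.01 looks significant under a uniform assumption,
# but is unremarkable relative to this skewed null.
p2 = empirical_p_value(0.01, null_p1)
```

Under a uniform null, p2 would be close to the observed p1; the further the null departs from uniformity, the more the two values diverge.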
Year: 2010 PMID: 20064243 PMCID: PMC2824674 DOI: 10.1186/1471-2105-11-19
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Association between geneset size and significance for breast cancer OS. For each type of geneset (i.e. Biocarta, GO, and KEGG), 10,000 random genesets were generated at each of the sizes 20, 75, 150, 300, and 500 (the latter omitted for Biocarta because the pool of genes was too small). For each geneset, hierarchical clustering was performed to segregate samples into two groups, a log-rank test was then performed to assess the difference in prognosis between the two groups, and the resulting p-value (p1) was recorded. The negative base-10 logarithms of the p1 values are plotted against geneset size. Biocarta-like genesets appear to be more significant around a size of 150 (A); GO-like genesets do not show a clear correlation between significance and geneset size (B); KEGG-like genesets seem to become more significant as size decreases (C).
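The log-rank step in this caption can be sketched as follows. This is a standard two-group log-rank test written with the Python standard library only, not the authors' implementation; the hierarchical clustering step is not reproduced, so the 0/1 group labels are assumed to be given. The function name `logrank_p_value` and the toy data are hypothetical.

```python
import math


def logrank_p_value(times, events, groups):
    """Two-group log-rank test.

    times  : follow-up time per sample
    events : 1 if the event (death) was observed, 0 if censored
    groups : 0 or 1, e.g. the two clusters from a prior clustering step
    """
    observed_minus_expected = 0.0
    variance = 0.0
    # Loop over the distinct times at which at least one event occurred.
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        at_risk = [g for ti, g in zip(times, groups) if ti >= t]
        n = len(at_risk)          # samples still at risk at time t
        n1 = sum(at_risk)         # of those, samples in group 1
        deaths = [g for ti, e, g in zip(times, events, groups) if ti == t and e]
        d = len(deaths)           # events at time t
        d1 = sum(deaths)          # of those, events in group 1
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Toy data: group 1 dies early, group 0 dies late, no censoring.
toy_times = list(range(1, 21))
toy_events = [1] * 20
toy_groups = [1] * 10 + [0] * 10
p1 = logrank_p_value(toy_times, toy_events, toy_groups)
```

Repeating this for each random geneset, with group labels derived from that geneset's expression values, would yield one p1 value per geneset.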
breast OS and lung OS, respectively (Additional File 3B shows these results for breast RFS). In these plots, an x = y line would be expected if the empirical distributions were uniform, which is clearly not the case.
Figure 2. Empirical p1 distributions of random genesets for breast cancer OS and lung cancer OS. Relative frequency density distributions are shown for breast cancer OS (A) and for lung cancer OS (B). For each survival estimate (i.e. breast cancer OS and lung cancer OS), the relative frequency density estimate with a bandwidth of 0.01 is plotted for each empirical distribution (i.e. Biocarta-like, GO-like, KEGG-like, and CSR-like). A uniform empirical distribution would yield p1 values at the same frequency across the entire range 0 < p < 1 (dashed line in (A) and (B)). Ordered plots of the empirical p1 distributions versus the uniform distribution are shown for breast cancer OS (C) and for lung cancer OS (D). The permuted p-values used to model each distribution (i.e. Biocarta-like, GO-like, KEGG-like, and CSR-like) are plotted against random p1 values drawn from a uniform distribution. An x = y line would be expected if the empirical distributions were uniformly distributed.
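The ordered plots in panels (C) and (D) compare sorted empirical p-values with a uniform reference. A minimal sketch of that comparison is given below; note that it pairs the sorted values with theoretical uniform quantiles rather than with random uniform draws as in the caption, and it adds a Kolmogorov-Smirnov distance as a single-number summary. Both function names and the toy inputs are assumptions of this sketch.

```python
def uniform_qq_points(p_values):
    """Pair each sorted p-value with the uniform quantile expected at its rank."""
    ordered = sorted(p_values)
    n = len(ordered)
    return [((i + 0.5) / n, p) for i, p in enumerate(ordered)]


def ks_distance_from_uniform(p_values):
    """Largest gap between the empirical CDF and the Uniform(0, 1) CDF."""
    ordered = sorted(p_values)
    n = len(ordered)
    return max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered))


# Evenly spread p-values lie on the x = y line; skewed ones do not.
even = [(i + 0.5) / 100 for i in range(100)]
skewed = [u ** 3 for u in even]  # hypothetical p-values piled up near zero

d_even = ks_distance_from_uniform(even)
d_skewed = ks_distance_from_uniform(skewed)
```

Points returned by `uniform_qq_points` that fall well below the x = y line indicate an excess of small p-values relative to a uniform distribution.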
Summary of results.
| Survival | Number of genesets (p1 < 0.01) | Number of genesets (FDR1 < 0.30) | Number of genesets (p2 < 0.01) | Number of genesets (FDR2 < 0.30) |
|---|---|---|---|---|
| Breast OS | 66 | 66 | 7 | 1 |
| Breast RFS | 58 | 58 | 13 | 0 |
| Lung OS | 32 | 20 | 18 | 4 |
For each survival estimate (i.e. breast OS, breast RFS, and lung OS), the table shows the number of genesets significant against the first null hypothesis (p1 < 0.01) and against the second null hypothesis (p2 < 0.01), together with the number of those significant genesets whose FDR value falls below the significance threshold (FDR < 0.30).
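The counting described in this table note can be sketched as below. The FDR procedure is not named in this excerpt, so the sketch assumes the Benjamini-Hochberg adjustment; the function names and the example p-values are hypothetical and do not reproduce the reported counts.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def count_significant(p_values, p_cutoff=0.01, fdr_cutoff=0.30):
    """Count genesets below the p cutoff, and those also below the FDR cutoff."""
    q_values = benjamini_hochberg(p_values)
    n_p = sum(p < p_cutoff for p in p_values)
    n_both = sum(p < p_cutoff and q < fdr_cutoff
                 for p, q in zip(p_values, q_values))
    return n_p, n_both


# Hypothetical p-values for four genesets.
counts = count_significant([0.001, 0.005, 0.5, 0.9])
```

The same two-step count would be applied once to the p1 values and once to the p2 values to fill one row of such a table.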
Significant genesets and CSR comparison.
| Geneset Name | Survival | Chang | | | | |
|---|---|---|---|---|---|---|
| GO "fatty acid metabolism" | Breast OS | - | 0.006 | 0.30 | | |
| CSR | Breast OS | 0.0410 | 0.0321000 | 0.280 | 0.14929 | 0.99 |
| CSR | Breast RFS | 0.0130 | 0.0144000 | 0.210 | 0.09614 | 0.99 |
| GO "receptor mediated endocytosis" | Lung OS | - | 0.003 | 0.02 | | |
| GO "brain development" | Lung OS | - | 0.003 | 0.02 | | |
| GO "apical plasma membrane" | Lung OS | - | 0.040 | 0.29 | | |
| KEGG "MAPK signaling pathway" | Lung OS | - | 0.040 | 0.29 | | |
| CSR | Lung OS | 0.0014 | 0.230 | 0.53 | | |
For each survival estimate, a description of the significant biologically-related genesets (p1 < 0.01; FDR1 < 0.30; p2 < 0.01; FDR2 < 0.30) as well as a comparison of the p-values reported by Chang et al. and our log-rank (p1) and empirical (p2) p-values for the CSR gene expression signature are shown.