Roland Nilsson, Johan Björkegren, Jesper Tegnér.
Abstract
BACKGROUND: Molecular signatures are sets of genes, proteins, genetic variants or other variables that can be used as markers for a particular phenotype. Reliable signature discovery methods could yield valuable insight into cell biology and mechanisms of human disease. However, it is currently not clear how to control error rates such as the false discovery rate (FDR) in signature discovery. Moreover, signatures for cancer gene expression have been shown to be unstable, that is, difficult to replicate in independent studies, casting doubt on their reliability.
Year: 2009 PMID: 19178740 PMCID: PMC2646701 DOI: 10.1186/1471-2105-10-38
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Signature discovery. Molecular signatures (1) are markers for a particular cell or tissue phenotype. Signatures are discovered from a given set of molecular profiles (e.g., gene expression profiles) together with phenotype labels (2). Signatures have dual uses, both as predictive models (3) and for discovery of molecular mechanisms (4). While it is well known how to assess predictive accuracy (5), the method proposed herein is the first to control signature FDR (6), enabling reliable discovery.
Figure 2. Good predictive accuracy despite high FDR. Probability of prediction error for the Support Vector Machine (gray level) as a function of signature false discovery rate (FDR) and statistical power (fraction of true positives). Nearly horizontal level curves indicate weak dependence on FDR.
Figure 3. Signatures with low FDR may be unstable. Left, statistical power vs. effect size (arbitrary units) for varying FDR. Middle, stability, defined as the average normalized overlap between two signatures, vs. effect size and FDR. Right, illustration of how power affects stability.
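The stability measure named in the Figure 3 caption can be illustrated with a minimal sketch. The caption does not state how the overlap is normalized, so the code below assumes division by the size of the larger signature; the function names `normalized_overlap` and `stability` are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def normalized_overlap(sig_a, sig_b):
    """Overlap of two signatures, scaled to [0, 1].

    ASSUMPTION: normalization by the larger signature size. The source
    caption only says "normalized overlap" without giving the formula.
    """
    a, b = set(sig_a), set(sig_b)
    if not a or not b:
        # Convention chosen for this sketch: an empty signature shares nothing.
        return 0.0
    return len(a & b) / max(len(a), len(b))


def stability(signatures):
    """Average normalized overlap over all pairs of signatures."""
    pairs = list(combinations(signatures, 2))
    if not pairs:
        raise ValueError("need at least two signatures")
    return sum(normalized_overlap(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Gene names are made up for illustration.
    runs = [{"g1", "g2", "g3"}, {"g2", "g3", "g4", "g5"}, {"g6"}]
    print(stability(runs))
```

Under this assumed normalization, two signatures that share few genes score low even if each one contains only true positives, which is consistent with the caption's point that low FDR alone does not guarantee stability when power is low.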
Figure 4. Controlling error rates for gene signatures. A: Realized level and power for the bootstrap test at 5% nominal level. B: Realized FDR, power and stability for signatures selected by the bootstrap test after Benjamini-Hochberg (BH) correction. Here the nominal FDR was set at 5%. C: Same as (B) for signatures selected by recursive feature elimination (RFE). D: Same as (B) for signatures selected as the top 200 genes. Acc, classifier accuracy.
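For readers unfamiliar with the Benjamini-Hochberg correction mentioned in the Figure 4 caption, the following is a minimal sketch of the standard step-up procedure applied to a list of per-gene p-values. It is not the authors' implementation: how the bootstrap test produces its p-values is not described in this record, so the input here is simply assumed to be given, and the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Standard BH step-up rule: sort the m p-values, find the largest rank k
    with p_(k) <= k * fdr / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            k_max = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Made-up p-values for illustration only.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    selected = [i for i, r in enumerate(benjamini_hochberg(pvals)) if r]
    print(selected)
```

The genes whose hypotheses are rejected form the signature at the chosen nominal FDR (5% in the caption). Because the rule is step-up, a p-value that fails its own threshold is still rejected when a larger p-value passes at a higher rank.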
Results on cancer gene expression data
| Data set (ref.) | n | MCF, % | CVA, % | TA, % (ref.) | BS | BS0 | RFE | RFE0 | DE |
| Golub (2) | 72 | 32 | 97.0 ± 4.2 | 99.3 (28) | 537 | 0 | 35 | 154 | 1007 |
| Singh (4) | 136 | 43 | 92.6 ± 3.0 | 81.1 (27) | 99 | 0 | 48 | 312 | 3807 |
| Alon (1) | 62 | 35 | 81 ± 7.2 | 97.9 (29) | 19 | 0 | 55 | 94 | 303 |
| Wang (6) | 286 | 37 | 65 ± 4.3 | N/A | 0 | 0 | 261 | 1250 | 106 |
| van't Veer (5) | 97 | 47 | 62 ± 8.4 | N/A | 0 | 0 | 42 | 153 | 1 |
Results are ordered by prediction accuracy. n, number of samples; MCF, minority class frequency; CVA, balanced cross-validated prediction accuracy, mean ± std.dev.; TA, balanced prediction accuracy of the bootstrap signature on an independent test set (reference given in parentheses); BS, significant genes using the bootstrap with SVM at 5% FDR; RFE, genes chosen by recursive feature elimination; BS0 and RFE0, genes chosen by the bootstrap and RFE methods, respectively, on randomized data; DE, differentially expressed genes using the t-test at 5% FDR.
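The table reports balanced accuracies alongside the minority class frequency. As a rough illustration of why the two go together, here is a minimal sketch of balanced accuracy computed as the mean of per-class recall. This is the common textbook definition and is assumed here; the record does not spell out the exact formula the authors used, and the function name `balanced_accuracy` is hypothetical.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (ASSUMED definition, see lead-in)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    recalls = []
    for cls in sorted(set(y_true)):
        # Positions of the samples that truly belong to this class.
        idx = [i for i, y in enumerate(y_true) if y == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Made-up labels: 8 majority-class samples, 2 minority-class samples.
    y_true = [0] * 8 + [1] * 2
    y_pred = [0] * 10  # a classifier that always predicts the majority class
    print(balanced_accuracy(y_true, y_pred))
```

In this made-up example a classifier that always predicts the majority class has 80% plain accuracy but only 50% balanced accuracy, so under this definition a score near 50% on a two-class problem indicates chance-level prediction regardless of class imbalance.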