Constantin F Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Jonathan S Schildcrout, Bryan E Shepherd, Frank E Harrell.
Abstract
BACKGROUND: Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. METHODOLOGY/PRINCIPAL
Year: 2009 PMID: 19290050 PMCID: PMC2654113 DOI: 10.1371/journal.pone.0004922
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 2. Application of Protocol II to human microarray data.
Each histogram is the distribution of the repeated 10-fold cross-validation AUC estimates for each dataset under the null hypothesis “there is no signal present in the data” (as computed by 400 random outcome value permutations). The red line in each graph is the observed value of AUC estimated by the repeated 10-fold cross-validation on the original data. AUC and p-values are shown for each dataset in the embedded table. Bold p-values indicate that the null hypothesis is rejected at the 0.05 level in these datasets.
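The permutation test described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `permutation_p_value`, `toy_score`, and the `(1 + exceed) / (1 + n_permutations)` p-value convention are assumptions introduced here, and `toy_score` is a simple stand-in for the repeated 10-fold cross-validation AUC estimate that the caption refers to.

```python
import random

def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: P(event score > non-event score), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(score_fn, X, y, n_permutations=400, seed=0):
    """Return (observed score, null scores, p-value) under outcome-value permutation."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    null_scores = []
    for _ in range(n_permutations):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break any link between predictors and outcome
        null_scores.append(score_fn(X, y_perm))
    # Assumed convention: count the observed data as one of the permutations.
    exceed = sum(s >= observed for s in null_scores)
    p_value = (1 + exceed) / (1 + n_permutations)
    return observed, null_scores, p_value

def toy_score(X, y):
    # Hypothetical stand-in scorer: AUC of the first feature alone.
    # The caption's protocol would plug in a cross-validated AUC estimate here.
    return auc([row[0] for row in X], y)

if __name__ == "__main__":
    rng = random.Random(1)
    y = [0] * 20 + [1] * 20
    X = [[rng.gauss(2.0 * label, 1.0)] for label in y]  # simulated signal
    observed, null_scores, p_value = permutation_p_value(toy_score, X, y)
    print(f"observed AUC = {observed:.3f}, p = {p_value:.4f}")
```

The list `null_scores` corresponds to the histogram in each panel, and `observed` corresponds to the red line; the null hypothesis is rejected at the 0.05 level when `p_value` falls below 0.05.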
Figure 1. Comparison of Protocols I and II in simulated data.
Left: an example in which Protocol I [23], applied to simulated data with a true moderate-strength signal, fails to detect statistical significance at any training set size. Right: a more powerful protocol (Protocol II, based on event-balanced repeated 10-fold cross-validation with SVM classifiers and the AUC metric) detects a statistically significant predictive signal according to an outcome-value permutation test. Specifically, the p-value for the null hypothesis of no signal is 0.0025. The blue bars depict the distribution of repeated 10-fold cross-validation AUC estimates over 400 random datasets produced via outcome-value permutation. The red line depicts the value of the repeated 10-fold cross-validation AUC on the original data (i.e., without perturbing the outcome values).
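The performance estimate behind Protocol II can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: "event-balanced" is read here as stratifying folds so that events are spread evenly across them, out-of-fold scores are pooled before computing AUC, and a nearest-centroid scorer (`centroid_scores`, a hypothetical helper) stands in for the SVM classifier to keep the example dependency-free.

```python
import random

def auc(scores, labels):
    """AUC as the Mann-Whitney statistic, with ties counted as 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def stratified_folds(y, k, rng):
    """Deal events and non-events separately so each fold gets a similar share of events."""
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds

def centroid_scores(X_train, y_train, X_test):
    """Stand-in for the SVM: higher score means closer to the event centroid."""
    def centroid(cls):
        rows = [x for x, label in zip(X_train, y_train) if label == cls]
        return [sum(col) / len(rows) for col in zip(*rows)]
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    c0, c1 = centroid(0), centroid(1)
    return [sqdist(x, c0) - sqdist(x, c1) for x in X_test]

def repeated_cv_auc(X, y, k=10, repeats=5, seed=0):
    """Average, over repeats, of the AUC of pooled out-of-fold scores."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        scores = [None] * len(y)
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train = [i for i in range(len(y)) if i not in held_out]
            fold_scores = centroid_scores(
                [X[i] for i in train], [y[i] for i in train],
                [X[i] for i in test_idx])
            for i, s in zip(test_idx, fold_scores):
                scores[i] = s  # each sample is scored by a model that never saw it
        estimates.append(auc(scores, y))
    return sum(estimates) / len(estimates)

if __name__ == "__main__":
    rng = random.Random(42)
    y = [0] * 30 + [1] * 30
    X = [[rng.gauss(1.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]
    print(f"repeated 10-fold CV AUC = {repeated_cv_auc(X, y):.3f}")
```

Running the same estimator on data with permuted outcome values, many times over, would yield the null distribution shown by the blue bars; the number of repeats (`repeats=5`) is an arbitrary choice for this sketch.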
Characteristics of gene expression microarray datasets analyzed in this study.
| Dataset authors and reference | Sample size and number of events | Number of variables (genes) | Predicted event (outcome) |
|---|---|---|---|
| Beer et al | | | Lung adenocarcinoma survival |
| Bhattacharjee et al | | | Lung adenocarcinoma 4-year survival |
| Iizuka et al | | | Hepatocellular carcinoma 1-year recurrence-free survival |
| Pomeroy et al | | | Medulloblastoma survival |
| Rosenwald et al | | | Non-Hodgkin lymphoma survival |
| Veer et al | | | Breast cancer 5-year metastasis-free survival |
| Yeoh et al | | | Acute lymphocytic leukemia relapse-free survival |