Jing Wang, Kim Anh Do, Sijin Wen, Spyros Tsavachidis, Timothy J McDonnell, Christopher J Logothetis, Kevin R Coombes.
Abstract
MOTIVATION: Individual microarray studies searching for prognostic biomarkers often have few samples and low statistical power; however, publicly accessible data sets make it possible to combine data across studies.
Keywords: combining data; cross-validation; feature selection; microarray expression profiling; predictive model; prostate cancer
Year: 2007 PMID: 19458761 PMCID: PMC2675498
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Sources of Microarray Data Sets for Meta-Analysis.
| Institution | Results Published | Author | Array Type | Number of Probe Sets/Clones | Healthy Samples | Tumor Samples |
|---|---|---|---|---|---|---|
| Harvard | Cancer Cell (2002) | Singh et al. | Affymetrix U95-Av2 | 12625 | 50 | 52 |
| Stanford | PNAS (2004) | Lapointe et al. | Two-color glass cDNA | 42129 | 41 | 71 |
Stanford tumor samples include 9 lymph node metastases.
Summary of the Gleason Grades of Both Microarray Data Sets.
| Institution | Gleason Low (L) | Gleason Medium (M) | Gleason High (H) | NA | Total |
|---|---|---|---|---|---|
| Harvard | 19 | 29 | 4 | 0 | 52 |
| Stanford | 24 | 22 | 15 | 1 | 62 |
| Total Patient Population | 43 | 51 | 19 | 1 | 114 |
Figure 1. Distribution of p-values computed from the Kolmogorov-Smirnov (KS) goodness-of-fit test applied to the expression profiles from healthy prostate samples re-standardized on a gene-by-gene basis. The p-values are distributed nearly uniformly. The superimposed curves represent the division into uniform and β contributions.
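The per-gene step behind Figure 1 can be illustrated with a minimal sketch. It assumes that "re-standardized" means z-scoring each gene's values and that the KS reference distribution is the standard normal; neither detail is stated in this record. The function names are hypothetical, and only the KS statistic D is computed, not its p-value.

```python
import math
import statistics


def standardize(values):
    """Z-score one gene's expression values (assumed meaning of 're-standardized')."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mu) / sd for v in values]


def normal_cdf(z):
    """Standard normal CDF; the reference distribution is an assumption."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def ks_statistic(values, cdf=normal_cdf):
    """One-sample KS statistic D: largest gap between empirical and reference CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Check the gap just after and just before the step at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

Applying `ks_statistic(standardize(gene_values))` to each gene would yield one statistic per gene; converting those statistics to p-values requires the Kolmogorov distribution, which is left out of this sketch.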
Figure 2. Analysis of p-values as a beta-uniform mixture. Left: Histogram of the p-values from a chi-squared test of the quality of the fit when adding a single gene to the logistic model. Right: The plot describing the relationship between FDR and single-test p-values.
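The relationship between FDR and the single-test p-value cutoff in Figure 2 can be sketched under the usual beta-uniform mixture formulation, in which the p-value density is f(p) = λ + (1 − λ) a p^(a−1) with 0 < a < 1. The parameter names `lam` and `a` and the function names are illustrative assumptions; the fitted values from the paper are not given in this record.

```python
def bum_density(p, lam, a):
    """Beta-uniform mixture density: uniform part lam plus a Beta(a, 1) part."""
    return lam + (1.0 - lam) * a * p ** (a - 1.0)


def bum_cdf(p, lam, a):
    """Fraction of p-values expected at or below p under the mixture."""
    return lam * p + (1.0 - lam) * p ** a


def bum_fdr(tau, lam, a):
    """Estimated FDR when calling every test with p-value <= tau significant.

    pi_ub is the upper bound on the proportion of true nulls, i.e. the
    mixture density evaluated at p = 1.
    """
    pi_ub = lam + (1.0 - lam) * a
    return pi_ub * tau / bum_cdf(tau, lam, a)
```

With any fixed `lam` and `a`, `bum_fdr` increases with `tau`, which is why a target FDR translates into a single p-value cutoff.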
Figure 3. Predicting Gleason Score 7 in an independent test set using the 7-gene model. The data set consists of 6204 gene expression measurements and data from 52 prostate tumors. The squares in the plot indicate Gleason scores 3+4 and the triangles represent Gleason scores 4+3. The results are statistically significant with p=0.033 (Fisher's exact test).
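For readers unfamiliar with the test cited in Figure 3, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table (for example, predicted class versus 3+4 or 4+3 score). The contingency counts behind p=0.033 are not given in this record, so the function is shown without the paper's data; the function name and the small tolerance used for comparing probabilities are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # The relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
```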
LOOCV Results of the Prediction Models.
| Model | Features Selected | LOOCV Misclassification Rate (%) |
|---|---|---|
| Logistic Regression - Forward Stepwise | 4 – 9 | 31 |
| Greedy - LDA | 10 – 20 | 31 |
| Genetic algorithm - LDA | 10 | 28 |
| PCA - LDA | 10 – 20 | 24 |
| Robust feature selection - LDA | 5 – 11 | 31 |
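The misclassification rates in the table above come from leave-one-out cross-validation (LOOCV). The sketch below shows how such a rate is computed; it swaps in a simple nearest-centroid classifier in place of the LDA and logistic models named in the table, and it omits feature selection, which in a real analysis would have to be repeated inside each fold. All function names are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Predict the label whose class centroid is closest to x (stand-in for LDA)."""
    sums, counts = {}, {}
    for row, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [s / counts[label] for s in acc]
        dist = sum((u - v) ** 2 for u, v in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_misclassification_rate(xs, ys):
    """Hold out each sample once, train on the rest, and count the errors."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
            errors += 1
    return errors / len(xs)
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the rates reported in the table.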
Results of Fitting the Model with Fewer Features (Misclassification Rates of Cross-validation).
| Number of Features | Greedy-LDA (%) | GA-LDA (%) | PCA-LDA (%) | Robust Selection-LDA (%) |
|---|---|---|---|---|
| 1 | 33 | 32 | 27 | 19 |
| 2 | 24 | 26 | 27 | 15 |
| 3 | 31 | 19 | 23 | 21 |
| 4 | 26 | 24 | 22 | 24 |
| 5 | 29 | 21 | 19 | 29 |
| 6 | 29 | 31 | 19 | 29 |
| 7 | 26 | 24 | 19 | 31 |
Figure 4. Misclassification rates of leave-one-out cross-validation obtained by applying the robust feature selection approach to randomly generated data sets (n=10). For seven selected features, the median values range from 41.85 to 48.40.