Abstract
This paper presents an empirical study that aims to explain the relationship between the number of samples and the stability of different gene selection techniques for microarray datasets. Unlike similar studies, in which the number of genes in a ranked gene list is varied, this study takes an alternative approach: stability is observed at different numbers of samples used for gene selection. Three different metrics of stability, including one that is novel in bioinformatics, were used to estimate the stability of the ranked gene lists. The results demonstrate that the univariate selection methods produce significantly more stable ranked gene lists than the multivariate selection methods used in this study. More specifically, these multivariate selection methods need thousands of samples to achieve the level of stability that any given univariate selection method achieves with only hundreds.
Year: 2010 PMID: 20625502 PMCID: PMC2896709 DOI: 10.1155/2010/616358
Source DB: PubMed Journal: J Biomed Biotechnol ISSN: 1110-7243
Overview of GEMLeR datasets used in this study.
| Dataset | No. of Samples | Class 1 | Class 2 | Probes |
|---|---|---|---|---|
| AP_Breast_Colon | 630 | 344 | 286 | 10937 |
| AP_Breast_Kidney | 604 | 344 | 260 | 10937 |
| AP_Breast_Ovary | 542 | 344 | 198 | 10937 |
| AP_Colon_Kidney | 546 | 286 | 260 | 10937 |
Figure 1. Relation between the number of samples and stability using resampling technique (breast versus colon cancer dataset).
Figure 2. Relation between number of samples and stability using partitioning technique (breast versus colon cancer dataset).
Figure 3. Classification accuracy and AUC using four different classification models (breast versus colon cancer dataset).
Figure 4. Standard deviation levels for classification accuracy and AUC (breast versus colon cancer dataset).
Figure 5. Overlap as a function of dataset size (number of samples) for different numbers of selected genes, ranging from 16 to 256 (breast versus colon cancer dataset).
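The idea of measuring overlap between ranked gene lists at different sample sizes can be illustrated with a minimal Python sketch. It is not the study's implementation: the t-like univariate score, the stratified subsampling, and the names `rank_genes`, `top_k_overlap`, and `overlap_vs_sample_size` are all assumptions made for illustration, and the plain top-k overlap used here may differ from the three stability metrics the authors applied.

```python
import numpy as np


def rank_genes(X, y):
    """Return gene indices ordered from most to least discriminative.

    Assumption: a simple Welch-style t-like score stands in for the
    univariate selection methods; y holds binary labels 0 and 1.
    """
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0) / len(a) + b.var(axis=0) / len(b)) + 1e-12
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / se
    return np.argsort(-score)


def top_k_overlap(rank_a, rank_b, k):
    """Fraction of genes shared by the top-k entries of two rankings."""
    shared = set(rank_a[:k].tolist()) & set(rank_b[:k].tolist())
    return len(shared) / k


def overlap_vs_sample_size(X, y, sizes, k=16, repeats=10, seed=0):
    """Mean pairwise top-k overlap across repeated subsamples of each size."""
    rng = np.random.default_rng(seed)
    result = {}
    for n in sizes:
        rankings = []
        for _ in range(repeats):
            # Stratified draw (an assumption) so both classes are always present.
            idx = np.concatenate([
                rng.choice(
                    np.flatnonzero(y == c),
                    size=max(2, int(round(n * np.mean(y == c)))),
                    replace=False,
                )
                for c in (0, 1)
            ])
            rankings.append(rank_genes(X[idx], y[idx]))
        pairs = [
            top_k_overlap(rankings[i], rankings[j], k)
            for i in range(repeats)
            for j in range(i + 1, repeats)
        ]
        result[n] = float(np.mean(pairs))
    return result
```

Calling `overlap_vs_sample_size(X, y, sizes=[50, 100, 200], k=16)` on an expression matrix `X` (samples by probes) with binary labels `y` returns one mean overlap value per sample size, which is the kind of curve the Figure 5 caption describes.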