Abstract
BACKGROUND: A simple classification rule with few genes and parameters is desirable when applying a classification rule to new data. One popular simple classification rule, diagonal discriminant analysis, yields linear or curved classification boundaries, called Ripples, that are optimal when gene expression levels are normally distributed with the appropriate variance, but may yield poor classification in other situations.
Year: 2010 PMID: 20825641 PMCID: PMC2949887 DOI: 10.1186/1471-2105-11-452
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Illustrative classification boundaries for two genes. The points are the centroids. The vertical and horizontal lines at each centroid are proportional to the variances. The distance measures are D = 1 (pooled variance) and D = 2 (class-specific variance).
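The two distance measures named in Figure 1 can be illustrated with a minimal sketch of diagonal discriminant analysis. This is not the authors' code: the function names `fit_diagonal` and `classify` are hypothetical, equal class priors are assumed, and the log-variance term used for D = 2 follows the standard diagonal quadratic rule, which may differ in detail from the rule in the paper.

```python
from math import log
from statistics import mean, variance


def fit_diagonal(samples, labels):
    """Estimate per-class centroids, per-class variances, and pooled variances.

    Each class needs at least two samples and non-zero variance in every gene.
    """
    classes = sorted(set(labels))
    n_genes = len(samples[0])
    centroids, class_vars, counts = {}, {}, {}
    for k in classes:
        rows = [s for s, lab in zip(samples, labels) if lab == k]
        counts[k] = len(rows)
        centroids[k] = [mean([r[j] for r in rows]) for j in range(n_genes)]
        class_vars[k] = [variance([r[j] for r in rows]) for j in range(n_genes)]
    # Pooled variance per gene: class variances weighted by degrees of freedom.
    dof = sum(counts.values()) - len(classes)
    pooled = [
        sum((counts[k] - 1) * class_vars[k][j] for k in classes) / dof
        for j in range(n_genes)
    ]
    return centroids, class_vars, pooled


def classify(x, centroids, class_vars, pooled, D=1):
    """Assign x to the class with the smallest variance-scaled distance.

    D = 1 scales by the pooled variance (linear boundary).
    D = 2 scales by the class-specific variance and adds log(variance),
    an assumption taken from the usual Gaussian rule (curved boundary).
    """
    scores = {}
    for k, mu in centroids.items():
        if D == 1:
            scores[k] = sum((xj - m) ** 2 / v for xj, m, v in zip(x, mu, pooled))
        elif D == 2:
            scores[k] = sum(
                (xj - m) ** 2 / v + log(v)
                for xj, m, v in zip(x, mu, class_vars[k])
            )
        else:
            raise ValueError("D must be 1 or 2")
    return min(scores, key=scores.get)


# Made-up two-gene data: class "a" is tight, class "b" is widely spread.
samples = [(-0.5, -0.5), (0.5, 0.5), (0.5, -0.5), (-0.5, 0.5),
           (1.0, 1.0), (7.0, 7.0), (7.0, 1.0), (1.0, 7.0)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
centroids, class_vars, pooled = fit_diagonal(samples, labels)
label_pooled = classify((1.8, 1.8), centroids, class_vars, pooled, D=1)
label_specific = classify((1.8, 1.8), centroids, class_vars, pooled, D=2)
```

On this toy data the two measures disagree about the point (1.8, 1.8): the pooled rule assigns it to the nearer centroid, while the class-specific rule assigns it to the widely spread class. This is the kind of difference between linear and curved boundaries that the figure illustrates.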
Figure 2. Swirls and Ripples applied to data generated with .
Classification rules selected from simulated data using a greedy algorithm.
| Gene set = | Centroid set = | |||||||
|---|---|---|---|---|---|---|---|---|
| Sample size | Description | |||||||
| Swirl | 1 | 2002 | informative | -0.8 | 2.0 | 4.4 | 4.4 | |
| Swirl | 2 | 2002 | informative | -0.1 | 2.1 | 5.5 | 0.9 | |
| 2001 | informative | 0.1 | 1.9 | 5.1 | 0.9 | |||
| 162 | non-informative | 0.5 | -0.7 | 5.0 | 5.5 | |||
| 1118 | non-informative | -1.4 | 0.9 | 5.2 | 5.0 | |||
Classification rules selected from simulated data using a wrapper.
| Gene set = | Centroid set = | |||||||
|---|---|---|---|---|---|---|---|---|
| Sample size | Description | |||||||
| Swirl | 2 | 2002 | informative | -2.2 | 1.7 | 6.5 | 1.0 | |
| 1771 | non-informative | -0.9 | 2.5 | 3.9 | 6.7 | |||
| Swirl | 2 | 2002 | informative | 1.1 | 2.3 | 5.7 | 0.8 | |
| 7 | non-informative | 0.4 | 0.4 | 5.0 | 6.5 | |||
| 2001 | informative | -0.3 | 2.0 | 5.3 | 0.9 | |||
| 323 | non-informative | -0.3 | 1.0 | 5.1 | 6.0 | |||
Figure 3. ROC and RU curves for simulation.
Most frequently selected genes in simulated data.
| Gene | ||||
|---|---|---|---|---|
| Sample size | Feature selection | Description | Percentage of splits | |
| Greedy | 2002 | informative | 48% | |
| 2001 | informative | 20% | ||
| 1565 | non-informative | 4% | ||
| Wrapper | 2002 | informative | 48% | |
| 2001 | informative | 14% | ||
| 707 | non-informative | 3% | ||
| Greedy | 2001 | informative | 38% | |
| 2002 | informative | 27% | ||
| 996 | non-informative | 4% | ||
| Wrapper | 2002 | informative | 26% | |
| 2001 | informative | 26% | ||
| 1941 | non-informative | 1% | ||
Classification rules selected in data sets using greedy algorithms.
| Gene set = | Centroid set = | ||||||
|---|---|---|---|---|---|---|---|
| Data set | Description | ||||||
| Colon cancer | Swirl | 1 | 493 | myosin heavy chain | 716 | 278 | 338 |
| Leukemia 1 | Swirl | 1 | 3532 | glutathione S-transferase | 81 | 1456 | 449 |
| Medulloblastoma | Swirl | 1 | 6230 | myosin heavy polypeptide 7 | 0 | -53 | 234 |
| Prostate cancer | Swirl | 1 | 6185 | serine protease hepsin | 48 | 184 | 70 |
| Leukemia 2 | Swirl | 1 | 8828 | HLA class II alpha | 1504 | 40640 | 7151 |
Classification rules selected in data sets using a wrapper.
| Gene set = | Centroid set = | ||||||
|---|---|---|---|---|---|---|---|
| Data set | Description | ||||||
| Colon cancer | Swirl | 1 | 1772 | myosin heavy chain | 44 | 125 | 61 |
| 249 | desmin | 1958 | 467 | 793 | |||
| 1582 | p cadherin | 53 | 174 | 83 | |||
| 1423 | myosin reg light chain 2 | 763 | 196 | 213 | |||
| 745 | ORF, xq terminal portion | 188 | 226 | 179 | |||
| Leukemia 1 | Swirl | 1 | 3532 | glutathione S-transferase | 37 | 1460 | 569 |
| Medulloblastoma | Swirl | 1 | 977 | zinc finger protein HZfq | 38 | -102 | 87 |
| Prostate cancer | Swirl | 1 | 8850 | cDNA DKFZp564A072 | 24 | 215 | 110 |
| Leukemia 2 | Swirl | 1 | 8828 | HLA class II alpha | 785 | 41345 | 7280 |
Figure 4. ROC and RU curves for data sets.
Genes most frequently selected in data sets.
| Gene | ||||
|---|---|---|---|---|
| Data set | Feature selection | Description | Fraction of splits | |
| Colon cancer | Greedy | 249 | 0.17 | |
| 493 | myosin heavy chain | 0.15 | ||
| 1772 | collagen alpha 2 | 0.12 | ||
| Wrapper | 249 | 0.13 | ||
| 1772 | collagen alpha 2 | 0.08 | ||
| 1582 | p cadherin | 0.06 | ||
| Leukemia 1 | Greedy | 4847 | 0.46 | |
| 6855 | TCF3 transcription factor 3 | 0.17 | ||
| 1834 | CD33 antigen | 0.17 | ||
| Wrapper | 4847 | 0.22 | ||
| 3252 | glutathione S-transferase | 0.16 | ||
| 6855 | TCF3 transcription factor 3 | 0.11 | ||
| Medulloblastoma | Greedy | 5585 | drebrin E | 0.05 |
| 4174 | COL6A2 collagen type VI alpha 2 | 0.04 | ||
| 3185 | pancreatic beta cell growth factor | 0.04 | ||
| Wrapper | 2426 | prostaglandin D2 synthase | 0.04 | |
| 4710 | acylphosphatase isozyme | 0.03 | ||
| 3185 | pancreatic beta cell growth factor | 0.03 | ||
| Prostate cancer | Greedy | 6185 | 0.70 | |
| 8965 | mitochondrial matrix protein P1 | 0.09 | ||
| 10494 | mRNA, ne1-related protein P1 | 0.07 | ||
| Wrapper | 6185 | 0.31 | ||
| 8965 | mitochondrial matrix protein P1 | 0.09 | ||
| 4365 | T-cell receptor Ti gamma chain | 0.07 | ||
| Leukemia 2 | Greedy | 8828 | 0.59 | |
| 9101 | MHC class II lymphocyte antigen | 0.23 | ||
| 2610 | mRNA for oct-binding factor | 0.13 | ||
| Wrapper | 8828 | 0.41 | ||
| 9101 | MHC class II lymphocyte antigen | 0.20 | ||
| 2610 | mRNA for oct-binding factor | 0.17 | ||
Genes listed in bold occur most frequently and are discussed in the text.