Xiaosheng Wang, Richard Simon.
Abstract
BACKGROUND: Although numerous methods of using microarray data analysis for cancer classification have been proposed, most utilize many genes to achieve accurate classification. This can hamper interpretability of the models and ease of translation to other assay platforms. We explored the use of single genes to construct classification models. We first identified the genes with the most powerful univariate class discrimination ability and then constructed simple classification rules for class prediction using the single genes.
Year: 2011 PMID: 21982331 PMCID: PMC3228540 DOI: 10.1186/1471-2105-12-391
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
The LOOCV classification accuracy (%)
| Dataset | SGC-t | SGC-W | DLDA | k-NN | SVM | RF |
|---|---|---|---|---|---|---|
| Melanoma | 97* | 96** | 97* | 97* | 97* | 97* |
| Breast Cancer 1 | 63** | 69* | 61 | 53 | 52 | 43 |
| Brain Cancer | 80* | 77** | 65 | 73 | 60 | 70 |
| Breast Cancer 2 | 58 | 50 | 73* | 67** | 73* | 67** |
| Gastric Tumor | 89 | 80 | 81 | 96** | 97* | 95 |
| Lung Cancer 1 | 98* | 95** | 95** | 98* | 98* | 98* |
| Lung Cancer 2 | 93** | 93** | 99* | 99* | 99* | 99* |
| Lymphoma | 74* | 71** | 66 | 52 | 59 | 57 |
| Myeloma | 68 | 67 | 75 | 78** | 74 | 79* |
| Pancreatic Cancer | 69** | 90* | 63 | 61 | 65 | 55 |
| Prostate Cancer | 89** | 89** | 78 | 93* | 93* | 93* |
Note:
1 SGC-t: Single Gene Classifier with the t-test gene selection method.
2 SGC-W: Single Gene Classifier with the WMW gene selection method.
3 For each dataset, the highest classification accuracy is marked with a single asterisk and the second highest with a double asterisk.
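The idea behind SGC-t can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes a pooled-variance two-sample t statistic for gene ranking and a cut point at the midpoint of the two class means, and the function names (`pooled_t`, `fit_single_gene`, `predict`, `loocv_accuracy`) and the toy data are hypothetical. Gene selection is repeated inside every leave-one-out fold so that the held-out sample never influences the choice of gene.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Absolute two-sample t statistic with pooled variance."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    se = sqrt(sp2 * (1 / na + 1 / nb))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def fit_single_gene(X, y):
    """Select the gene with the largest |t| and build a midpoint rule.

    Assumption: the cut point is the midpoint of the class means;
    the source does not state the rule used in the paper.
    """
    best = None
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        t = pooled_t(a, b)
        if best is None or t > best[0]:
            best = (t, g, mean(a), mean(b))
    _, g, m0, m1 = best
    # Return gene index, cut point, and whether high expression means class 1.
    return g, (m0 + m1) / 2, m1 > m0


def predict(model, row):
    g, cut, high_is_1 = model
    return int((row[g] > cut) == high_is_1)


def loocv_accuracy(X, y):
    """Leave-one-out accuracy with gene selection redone in each fold."""
    hits = 0
    for i in range(len(X)):
        model = fit_single_gene(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)


# Toy data (hypothetical): 8 samples x 3 genes; only gene 1 separates the classes.
X = [[1.0, 0.2, 5.0], [1.2, 0.1, 4.8], [0.9, 0.3, 5.1], [1.1, 0.25, 4.9],
     [1.0, 2.1, 5.0], [1.3, 2.3, 5.2], [0.8, 1.9, 4.7], [1.1, 2.0, 5.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(loocv_accuracy(X, y))
```

Because every gene is tested on the same samples, ranking by the pooled |t| gives the same order as ranking by the t-test p-value, which is why the sketch never computes p-values.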
The mean number of genes in classifiers
| Dataset | SGC-t | SGC-W | DLDA | k-NN | SVM | RF |
|---|---|---|---|---|---|---|
| Melanoma | 1 | 1 | 7200 | 7200 | 7200 | 7200 |
| Breast Cancer 1 | 1 | 1 | 17 | 17 | 17 | 15 |
| Brain Cancer | 1 | 1 | 14 | 14 | 14 | 14 |
| Breast Cancer 2 | 1 | 1 | 176 | 176 | 176 | 176 |
| Gastric Tumor | 1 | 1 | 848 | 848 | 848 | 848 |
| Lung Cancer 1 | 1 | 1 | 7472 | 7472 | 7472 | 7472 |
| Lung Cancer 2 | 1 | 1 | 3207 | 3207 | 3207 | 3207 |
| Lymphoma | 1 | 1 | 2 | 2 | 2 | 2 |
| Myeloma | 1 | 1 | 169 | 169 | 169 | 169 |
| Pancreatic Cancer | 1 | 1 | 56 | 56 | 56 | 44 |
| Prostate Cancer | 1 | 1 | 798 | 798 | 798 | 798 |
Comparison of single gene classifiers and standard classifiers
| Dataset | Smallest p-value (a) | t-test statistic (b) | Fold change (c) | Number of significant genes (d) | Accuracy (%) of standard classifiers (e) | Accuracy (%) of single gene classifiers (f) |
|---|---|---|---|---|---|---|
| Melanoma | 1.37e-29 | 22.68 | 277.78 | 7263 | 97 | 96.5 |
| Breast Cancer 1 | 8.10e-06 | 9.06 | 3.65 | 20 | 52.2 | 66 |
| Brain Cancer | 1.51e-04 | 4.06 | 21.73 | 15 | 67 | 78.5 |
| Breast Cancer 2 | 3.10e-06 | 5.16 | 3.48 | 180 | 70 | 54 |
| Gastric Tumor | 7.34e-10 | 9.51 | 10 | 4798 | 92.2 | 84.5 |
| Lung Cancer 1 | 2.51e-21 | 20.34 | 1923.48 | 7561 | 97.2 | 96.5 |
| Lung Cancer 2 | 6.82e-35 | 24.72 | 505.16 | 3219 | 99 | 93 |
| Lymphoma | 1.50e-04 | 4.07 | 1.33 | 2 | 58.5 | 72.5 |
| Myeloma | 5.00e-07 | 5.23 | 4.49 | 172 | 76.5 | 67.5 |
| Pancreatic Cancer | 1.30e-06 | 5.37 | 5.88 | 58 | 61 | 79.5 |
| Prostate Cancer | 1.34e-21 | 12.53 | 12.82 | 812 | 89.3 | 89 |
Note:
(a) The minimum univariate t-test p-value for the genes significantly different between the classes.
(b) The absolute value of the t-test statistic corresponding to the smallest p-value in the preceding column.
(c) The maximum fold change in the geometric mean of gene expression between the classes.
(d) The total number of genes significantly different between the classes at the 0.001 significance level.
(e) The mean classification accuracy of the four standard classifiers.
(f) The mean classification accuracy of the two single gene classifiers.
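The two mean-accuracy columns can be reproduced from the LOOCV accuracy table. The sketch below does so for five of the datasets. The dictionary layout and the names `loocv` and `summary` are illustrative choices; the numbers are copied from the accuracy table.

```python
# LOOCV accuracies (%) copied from the accuracy table:
# (two single gene classifiers, four standard classifiers)
loocv = {
    "Melanoma": ([97, 96], [97, 97, 97, 97]),
    "Breast Cancer 1": ([63, 69], [61, 53, 52, 43]),
    "Brain Cancer": ([80, 77], [65, 73, 60, 70]),
    "Lymphoma": ([74, 71], [66, 52, 59, 57]),
    "Pancreatic Cancer": ([69, 90], [63, 61, 65, 55]),
}


def mean(values):
    return sum(values) / len(values)


# Map each dataset to (mean of standard classifiers, mean of single gene classifiers).
summary = {name: (mean(std), mean(sgc)) for name, (sgc, std) in loocv.items()}

for name, (std_mean, sgc_mean) in summary.items():
    print(f"{name}: standard {std_mean}, single gene {sgc_mean}")
```

For Breast Cancer 1 the unrounded standard-classifier mean is 52.25, which the table reports as 52.2.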
Comparison of classification accuracy (%) with the TSP classifier
| Dataset | TSP | SGC-t | SGC-W |
|---|---|---|---|
| Melanoma | 99 | 97 | 96 |
| Breast Cancer 1 | 75 | 63 | 69 |
| Brain Cancer | 77 | 80 | 77 |
| Breast Cancer 2 | 47 | 58 | 50 |
| Gastric Tumor | 91 | 89 | 80 |
| Lung Cancer 1 | 95 | 98 | 95 |
| Lung Cancer 2 | 97 | 93 | 93 |
| Lymphoma | 57 | 74 | 71 |
| Myeloma | 71 | 68 | 67 |
| Pancreatic Cancer | 90 | 69 | 90 |
| Prostate Cancer | 81 | 89 | 89 |
Note: The number of gene pairs selected is set as one for the TSP classifier.
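For readers unfamiliar with TSP (top scoring pair), the following is a minimal sketch of a one-pair classifier as it is commonly described: choose the gene pair whose expression ordering differs most between the classes, then predict from that ordering alone. It is not the implementation used for the table. The function names `fit_tsp` and `predict_tsp` and the toy data are hypothetical, and ties between equally scored pairs are broken simply by taking the first pair found.

```python
from itertools import combinations


def fit_tsp(X, y):
    """Find the gene pair (i, j) maximising |P(x_i < x_j | class 0) - P(x_i < x_j | class 1)|."""
    best = None
    for i, j in combinations(range(len(X[0])), 2):
        p = []
        for c in (0, 1):
            rows = [r for r, lab in zip(X, y) if lab == c]
            p.append(sum(r[i] < r[j] for r in rows) / len(rows))
        score = abs(p[0] - p[1])
        if best is None or score > best[0]:
            # Record whether "x_i < x_j" is more frequent in class 1.
            best = (score, i, j, p[1] > p[0])
    return best[1:]


def predict_tsp(model, row):
    i, j, less_means_1 = model
    return int((row[i] < row[j]) == less_means_1)


# Toy data (hypothetical): gene 0 < gene 1 in class 0, reversed in class 1.
X = [[1, 2, 5], [2, 3, 1], [1, 4, 2], [5, 3, 4], [6, 2, 1], [4, 1, 3]]
y = [0, 0, 0, 1, 1, 1]
model = fit_tsp(X, y)
print(model, predict_tsp(model, [1, 3, 0]), predict_tsp(model, [7, 2, 0]))
```

The rule depends only on the relative order of two measurements within a sample, so it is unaffected by any monotone rescaling applied to a whole sample.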
Stability of gene selection
| Dataset | Classifier | The genes selected and their occurrence percentages across the CV loop |
|---|---|---|
| Melanoma | SGC-t | 200965_s_at (99%), 213050_at (1%) |
| | SGC-W | 217906_at (92%), 218552_at (4%), 218996_at (1%), 219343_at (1%), 221577_x_at (1%), 221882_s_at (1%) |
| Breast Cancer 1 | SGC-t | 259466 (92%), 291660 (5%), 950574 (3%) |
| | SGC-W | 259466 (98%), 291660 (2%) |
| Brain Cancer | SGC-t | J02611_at (95%), X53331_at (5%) |
| | SGC-W | J02611_at (93%), X67951_at (3%), HG3543-HT3739_at (2%), X12794_at (2%) |
| Breast Cancer 2 | SGC-t | AI868854 (65%), AK026899 (13%), AK026789 (12%), AK025709 (3%), AI240933 (3%), AF119844 (2%), AW006861 (2%) |
| | SGC-W | N30081 (65%), AF119844 (23%), AI868854 (12%) |
| Gastric Tumor | SGC-t | W70254 (100%) |
| | SGC-W | AA171606 (94%), W70254 (6%) |
| Lung Cancer 1 | SGC-t | 37210_at (66%), 198_g_at (15%), 40165_at (12%), 32254_at (5%), 41344_s_at (2%) |
| | SGC-W | 1252_at (100%) |
| Lung Cancer 2 | SGC-t | 33754_at (100%) |
| | SGC-W | 40936_at (98%), 33833_at (0.5%), 34320_at (0.5%), 37157_at (0.5%), 39640_at (0.5%) |
| Lymphoma | SGC-t | X76538_at (100%) |
| | SGC-W | X76538_at (91%), D30655_at (9%) |
| Myeloma | SGC-t | 33146_at (88%), 32546_at (12%) |
| | SGC-W | 32546_at (99%), 1071_at (1%) |
| Pancreatic Cancer | SGC-t | 209596_at (98%), 206451_at (2%) |
| | SGC-W | 206451_at (45%), 209596_at (43%), 218498_s_at (6%), 219625_s_at (4%), 212058_at (2%) |
| Prostate Cancer | SGC-t | 34452_at (100%) |
| | SGC-W | 34452_at (100%) |
Note: In Breast Cancer 1, the genes are denoted by Clone ID; in Breast Cancer 2 and Gastric Tumor, the genes are denoted by GenBank Accession number; in all the others, the genes are denoted by Probe Set.
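Occurrence percentages of this kind can be obtained by recording which gene is selected in each cross-validation iteration and tallying the results. The sketch below shows the tally only. The function name `selection_percentages` and the gene labels are hypothetical, and the input list stands in for the per-fold selections that a classifier would produce.

```python
from collections import Counter


def selection_percentages(selected):
    """Percentage of CV iterations in which each gene was the one selected."""
    counts = Counter(selected)
    total = len(selected)
    # most_common() orders genes from most to least frequently selected.
    return {gene: 100 * n / total for gene, n in counts.most_common()}


# Hypothetical record of the gene chosen in each of 10 CV iterations.
folds = ["gene_A"] * 9 + ["gene_B"]
stability = selection_percentages(folds)
print(stability)
```

A classifier whose top gene approaches 100% is selecting the same gene regardless of which samples are held out, as in the Prostate Cancer rows.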
Summary of the eleven gene expression datasets
| Dataset | # Genes | Class | # Samples |
|---|---|---|---|
| Melanoma | 22283 | malignant/nonmalignant | 70 (45/25) |
| Breast Cancer 1 | 7650 | relapse/no-relapse | 99 (45/54) |
| Brain Cancer | 7129 | classic/desmoplastic | 60 (46/14) |
| Breast Cancer 2 | 22575 | disease-free/cancer recurred | 60 (32/28) |
| Gastric Tumor | 19508 | normal/tumor | 132 (29/103) |
| Lung Cancer 1 | 12600 | squamous cell lung carcinoma/pulmonary carcinoid | 41 (21/20) |
| Lung Cancer 2 | 12533 | mesothelioma/adenocarcinoma | 181 (31/150) |
| Lymphoma | 7129 | cured/fatal | 58 (32/26) |
| Myeloma | 12651 | without bone lytic lesion/with bone lytic lesion | 173 (36/137) |
| Pancreatic Cancer | 22283 | normal/pancreatic ductal carcinoma | 49 (25/24) |
| Prostate Cancer | 12600 | normal/tumor | 102 (50/52) |
Note: The sample size of each class is given in parentheses.