Desheng Huang, Yu Quan, Miao He, Baosen Zhou.
Abstract
BACKGROUND: Many studies based on gene expression data have been reported in great detail; however, one major challenge for methodologists is the choice of classification method. The main purpose of this research was to compare the performance of linear discriminant analysis (LDA) and its modified variants for the classification of cancer based on gene expression data.
Year: 2009 PMID: 20003274 PMCID: PMC2800110 DOI: 10.1186/1756-9966-28-149
Source DB: PubMed Journal: J Exp Clin Cancer Res ISSN: 0392-9078
Figure 1. Framework for the procedure of classification.
Characteristics of the six microarray datasets used
| Dataset | No. of samples | Classes | No. of genes | Website |
|---|---|---|---|---|
| Two-class lung cancer | 181 | MPM (31), adenocarcinoma (150) | 12533 | [ |
| Colon | 62 | normal (22), tumor (40) | 2000 | [ |
| Prostate | 102 | normal (50), tumor (52) | 6033 | [ |
| Multi-class lung cancer | 68 (66) a | adenocarcinoma (37), combined (1), normal (5), small cell (4), squamous cell (10), fetal (1), large cell (4), lymph node (6) | 3171 | [ |
| SRBCT | 88 (83) b | Burkitt lymphoma (29), Ewing sarcoma (11), neuroblastoma (18), rhabdomyosarcoma (25), non-SRBCTs (5) | 2308 | [ |
| Brain | 42 (38) c | medulloblastomas (10), CNS AT/RTs (5), rhabdoid renal and extrarenal rhabdoid tumours (5), supratentorial PNETs (8), non-embryonal brain tumours (malignant glioma) (10), normal human cerebella (4) | 5597 | [ |
Note: Some samples were removed so that each retained class had an adequate number of samples.
a. One combined and one fetal cancer sample were removed; the real sample size is 66.
b. Five non-SRBCT samples were removed; the real sample size is 83.
c. Four normal tissue samples were removed; the real sample size is 38.
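The sample-size arithmetic in the notes can be checked with a short Python sketch. The class counts are copied from the dataset table and the dropped classes from notes a to c; the dictionary layout and the helper name `sample_sizes` are illustrative choices, not something taken from the paper.

```python
# Class counts copied from the dataset table; "dropped" follows table notes a-c.
# The data layout and helper name are illustrative assumptions.
datasets = {
    "Multi-class lung cancer": {
        "counts": {"adenocarcinoma": 37, "combined": 1, "normal": 5,
                   "small cell": 4, "squamous cell": 10, "fetal": 1,
                   "large cell": 4, "lymph node": 6},
        "dropped": {"combined", "fetal"},
    },
    "SRBCT": {
        "counts": {"Burkitt lymphoma": 29, "Ewing sarcoma": 11,
                   "neuroblastoma": 18, "rhabdomyosarcoma": 25,
                   "non-SRBCTs": 5},
        "dropped": {"non-SRBCTs"},
    },
    "Brain": {
        "counts": {"medulloblastomas": 10, "CNS AT/RTs": 5,
                   "rhabdoid renal and extrarenal rhabdoid tumours": 5,
                   "supratentorial PNETs": 8, "malignant glioma": 10,
                   "normal human cerebella": 4},
        "dropped": {"normal human cerebella"},
    },
}


def sample_sizes(counts, dropped):
    """Return (total samples listed, samples left after dropping classes)."""
    total = sum(counts.values())
    kept = sum(n for name, n in counts.items() if name not in dropped)
    return total, kept


sizes = {name: sample_sizes(d["counts"], d["dropped"])
         for name, d in datasets.items()}

for name, (total, kept) in sizes.items():
    print(f"{name}: {total} listed, {kept} retained")
```

The retained totals agree with the real sample sizes stated in the notes (66, 83 and 38).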
Numbers of feature genes selected by 4 methods for each dataset
| Dataset | PAM | SDDA | SLDA | SCRDA |
|---|---|---|---|---|
| 2-class lung cancer | 7.98 | 422.74 | 407.83 | 118.72 |
| Colon | 25.72 | 65.67 | 117.08 | 214.87 |
| Prostate | 83.13 | 120.53 | 187.91 | 217.47 |
| Multi-class lung cancer | 45.26 | 57.98 | 97.27 | 1015.00 |
| SRBCT | 30.87 | 114.32 | 131.24 | 86.22 |
| Brain | 69.11 | 115.04 | 182.01 | 26.83 |
Average test error of LDA and its modification methods (10 cycles of 10-fold cross validation)
| Dataset | Gene selection methods | Performance | ||||
|---|---|---|---|---|---|---|
| LDA | PAM | SDDA | SLDA | SCRDA | ||
| 2-class Lung cancer data(n = 181, p = 12533, K = 2) | PAM | 0.30 | 0.15 | 0.16 | 0.42 | |
| SDDA | 0.17 | 0.11 | 0.11 | 0.1 | ||
| SLDA | 0.47 | 0.3 | 0.3 | 0.32 | ||
| SCRDA | 0.73 | 0.20 | 0.19 | 0.17 | ||
| Colon data(n = 62, p = 2000, K = 2) | PAM | 1.30 | 0.8 | 0.86 | 0.86 | |
| SDDA | 2.25 | 2.09 | 1.29 | 1.25 | ||
| SLDA | 1.12 | 0.74 | 0.75 | 0.80 | ||
| SCRDA | 1.19 | 0.77 | 0.77 | 0.75 | ||
| Prostate data(n = 102, p = 6033, K = 2) | PAM | 2.87 | 0.82 | 0.81 | 1.00 | |
| SDDA | 2.53 | 0.71 | 0.68 | 0.74 | ||
| SLDA | 1.75 | 0.7 | 0.64 | 0.70 | ||
| SCRDA | 2.15 | 0.57 | 0.59 | 0.57 | ||
| Multi-class lung cancer data(n = 66, p = 3171, K = 6) | PAM | 2.13 | 1.21 | 1.28 | 1.19 | |
| SDDA | 1.62 | 1.32 | 1.31 | 1.30 | ||
| SLDA | 1.62 | 1.31 | 1.32 | 1.34 | ||
| SCRDA | 1.63 | 1.43 | 1.45 | 1.58 | ||
| SRBCT data(n = 83, p = 2308, K = 4) | PAM | 0.17 | 0.01 | 0.03 | 0.01 | |
| SDDA | 2.45 | 0.03 | 0 | 0.03 | ||
| SLDA | 2.87 | 0 | 0 | 0 | ||
| SCRDA | 2.32 | 0.03 | 0.03 | 0.02 | ||
| Brain data(n = 38, p = 5597, K = 4) | PAM | 1.14 | 0.57 | 0.58 | 0.61 | |
| SDDA | 1.09 | 0.61 | 0.63 | 0.55 | ||
| SLDA | 0.89 | 0.60 | 0.60 | 0.58 | ||
| SCRDA | 0.84 | 0.56 | 0.54 | 0.54 | ||
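
The evaluation scheme named in the table caption, 10 cycles of 10-fold cross validation, can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the nearest-centroid classifier is a simple stand-in and not one of the LDA variants compared in the paper, the data are synthetic, the folds are stratified by class (the source does not say whether they were), and the error is reported as a misclassification rate per fold (the table does not state its units). All function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with roughly equal class mix."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    pos = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for i in indices:  # deal indices round-robin across the folds
            folds[pos % k].append(i)
            pos += 1
    return folds


def fit_centroids(X, y):
    """Per-class mean vectors (stand-in model, NOT the paper's methods)."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        sums[label] = [a + v for a, v in zip(acc, row)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(row, centroids[c])))


def repeated_cv_error(X, y, k=10, cycles=10, seed=0):
    """Mean test error over `cycles` independent k-fold partitions."""
    rng = random.Random(seed)
    fold_errors = []
    for _ in range(cycles):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train = [i for i in range(len(y)) if i not in held_out]
            model = fit_centroids([X[i] for i in train],
                                  [y[i] for i in train])
            wrong = sum(predict(model, X[i]) != y[i] for i in test_idx)
            fold_errors.append(wrong / len(test_idx))
    return sum(fold_errors) / len(fold_errors), fold_errors


# Synthetic two-class data with the class sizes of the colon dataset.
data_rng = random.Random(1)
y = ["normal"] * 22 + ["tumor"] * 40
X = [[data_rng.gauss(0.0 if label == "normal" else 1.5, 1.0)
      for _ in range(5)] for label in y]

mean_error, fold_errors = repeated_cv_error(X, y)
print(f"{len(fold_errors)} folds evaluated, mean test error {mean_error:.3f}")
```

Each cycle reshuffles the samples before dealing them into folds, so the ten cycles give ten different partitions and 100 fold-level error estimates in total.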
Figure 2. Comparison of classification performance for different datasets. The y-axis shows the average error and the x-axis indicates the gene selection methods: PAM, SDDA, SLDA and SCRDA. Error bars (± 1.96 SE) are provided for the classification methods.
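Error bars of ± 1.96 SE, as described in the caption, can be computed along the following lines. This is a sketch only: the input values are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of values, which the source does not spell out.

```python
import math
import statistics


def mean_with_error_bar(values, z=1.96):
    """Return (mean, half-width), where half-width = z * standard error."""
    mean = statistics.fmean(values)
    # Assumed SE definition: sample standard deviation / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(len(values))
    return mean, z * se


# Hypothetical per-cycle error values, for illustration only.
errors = [0.10, 0.20, 0.15, 0.05, 0.25]
mean, half_width = mean_with_error_bar(errors)
print(f"mean = {mean:.3f}, "
      f"bar = [{mean - half_width:.3f}, {mean + half_width:.3f}]")
```

The multiplier 1.96 is the two-sided 95% quantile of the standard normal distribution, so bars of this width correspond to an approximate 95% confidence interval for the mean.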