Guo-Zheng Li, Hua-Long Bu, Mary Qu Yang, Xue-Qiang Zeng, Jack Y Yang.
Abstract
BACKGROUND: Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers. Dimension reduction methods fall into two types: feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods; in previous work, the top several components of PCA or PLS were selected for modeling according to the descending order of their eigenvalues. In this paper, we show that not all of the top components are useful, and that features should instead be selected from all the components by feature selection methods.
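The contrast drawn in the abstract can be written compactly. The notation below is an assumption introduced for illustration and does not come from the paper: $X$ is the centered data matrix, $w_j$ is the $j$-th extracted direction (from PCA or PLS), $m$ is the number of extracted components, and $S$ is the index set chosen by a feature selection method.

```latex
% Extracted components, ordered by descending eigenvalue
z_j = X w_j, \qquad j = 1, \dots, m, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_m

% Conventional choice: keep the top k components
Z_{\text{top}} = [\, z_1, z_2, \dots, z_k \,]

% Choice argued for here: keep a selected subset of all components
Z_{S} = [\, z_j \,]_{j \in S}, \qquad S \subseteq \{1, \dots, m\}
```

The top-$k$ choice is the special case $S = \{1, \dots, k\}$, so selecting from all components can only widen the set of candidate models.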
Year: 2008 PMID: 18831790 PMCID: PMC2559889 DOI: 10.1186/1471-2164-9-S2-S24
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Statistical classification error rates (and their corresponding standard deviation) by using SVM with different parameters on four microarray data sets (%)
| DATASET | SVM | PCASVM | GAPCASVM | PLSSVM | GAPLSSVM | PPSVM | GAPPSVM |
| CNS | 40.4(8.6) | 36.4(4.4) | 35.2(1.6) | 36.6(2.8) | 35.0(5.6) | 36.2(5.3) | 35.0(7.0) |
| COLON | 34.6(5.4) | 31.3(7.0) | 29.6(7.5) | 29.5(8.2) | 28.8(9.2) | 33.2(6.7) | 30.5(7.2) |
| LEUKEMIA | 27.0(7.1) | 26.2(6.1) | 23.2(6.4) | 20.4(7.0) | 16.2(5.9) | 22.9(7.5) | 21.7(7.2) |
| LUNG | 4.2(7.2) | 4.1(5.9) | 3.9(6.2) | 3.7(6.2) | 3.3(5.8) | 4.4(5.5) | 4.0(7.3) |
| Average | 26.8(7.1) | 24.5(5.9) | 22.9(5.4) | 22.5(6.1) | 20.8(6.7) | 24.1(6.2) | 22.8(7.2) |
| CNS | 40.6(9.2) | 35.7(4.2) | 34.4(7.3) | 39.6(7.5) | 38.4(5.5) | 40.0(5.3) | 38.7(8.3) |
| COLON | 34.6(5.4) | 29.1(8.3) | 29.6(7.3) | 29.7(8.3) | 29.1(9.3) | 33.7(6.9) | 32.2(7.2) |
| LEUKEMIA | 27.4(6.9) | 26.3(6.2) | 22.1(6.3) | 20.4(7.0) | 16.3(6.0) | 22.8(7.3) | 20.9(7.2) |
| LUNG | 3.7(6.9) | 2.9(6.2) | 1.7(6.2) | 1.6(6.8) | 1.6(6.1) | 1.6(8.5) | 1.5(4.4) |
| Average | 26.5(7.1) | 23.4(6.2) | 21.9(6.8) | 22.8(7.4) | 21.3(6.7) | 24.5(6.9) | 23.3(6.8) |
| CNS | 37.6(7.6) | 35.4(1.7) | 34.4(1.0) | 36.1(2.9) | 35.0(5.4) | 36.0(5.0) | 35.6(3.1) |
| COLON | 33.7(5.2) | 31.0(7.8) | 29.9(7.3) | 28.9(8.7) | 27.8(9.3) | 33.4(7.2) | 30.1(8.0) |
| LEUKEMIA | 26.8(7.2) | 25.8(6.2) | 23.3(6.3) | 20.5(6.9) | 16.4(9.5) | 24.0(7.4) | 21.0(7.1) |
| LUNG | 4.2(6.9) | 4.1(6.5) | 3.2(6.2) | 3.4(7.2) | 3.4(8.4) | 4.0(6.3) | 3.3(5.8) |
| Average | 25.6(6.8) | 24.0(5.5) | 22.7(5.2) | 22.2(6.4) | 20.6(8.2) | 24.6(6.5) | 22.9(6.0) |
| CNS | 41.4(8.7) | 35.4(8.6) | 34.0(8.4) | 40.9(7.6) | 38.9(5.8) | 42.5(9.1) | 39.2(8.8) |
| COLON | 33.9(6.0) | 31.0(7.5) | 29.7(7.0) | 29.0(8.7) | 28.5(9.3) | 32.8(7.2) | 29.7(8.0) |
| LEUKEMIA | 27.9(7.2) | 25.0(6.1) | 23.1(6.6) | 20.5(6.9) | 16.4(5.9) | 22.6(7.7) | 21.1(7.3) |
| LUNG | 3.9(5.8) | 2.8(6.5) | 1.3(6.8) | 1.7(6.3) | 1.4(6.9) | 3.6(6.3) | 1.3(6.4) |
| Average | 26.7(6.9) | 23.5(7.2) | 22.0(7.2) | 23.0(7.3) | 21.2(7.0) | 25.4(7.6) | 22.8(7.6) |
Average percentage of features (and their corresponding standard deviation) used by SVM with different parameters on four microarray data sets (%)
| DATASET | PCASVM | GAPCASVM | PLSSVM | GAPLSSVM | PPSVM | GAPPSVM |
| CNS | 74.4(8.5) | 27.3(8.3) | 67.7(10.0) | 29.4(8.4) | 68.8(6.6) | 30.8(8.6) |
| COLON | 81.1(7.5) | 28.5(8.9) | 57.6(4.8) | 30.8(7.4) | 59.4(9.2) | 31.1(7.4) |
| LEUKEMIA | 78.0(9.8) | 26.7(6.3) | 46.8(10.1) | 30.0(6.5) | 52.0(9.3) | 30.3(6.6) |
| LUNG | 82.2(6.2) | 74.7(6.2) | 73.8(7.9) | 72.0(3.6) | 82.3(7.6) | 73.3(5.7) |
| Average | 78.9(8.0) | 39.3(7.4) | 61.4(8.2) | 40.5(6.5) | 65.6(8.2) | 41.3(7.1) |
| CNS | 73.4(9.8) | 27.4(8.0) | 62.7(9.0) | 29.6(8.5) | 65.7(9.3) | 20.7(8.3) |
| COLON | 82.1(6.5) | 28.7(8.9) | 57.6(4.8) | 30.6(7.3) | 59.3(8.3) | 31.1(7.5) |
| LEUKEMIA | 87.0(9.1) | 26.7(6.1) | 46.8(10.0) | 30.0(6.3) | 49.3(8.3) | 30.4(6.7) |
| LUNG | 77.4(7.0) | 74.4(6.7) | 76.3(6.9) | 73.1(3.0) | 83.1(8.1) | 72.0(6.3) |
| Average | 79.4(8.1) | 39.3(7.4) | 60.8(7.7) | 40.8(6.3) | 64.4(8.5) | 38.5(7.2) |
| CNS | 78.1(8.0) | 27.4(8.2) | 64.3(9.0) | 29.3(8.3) | 70.1(9.1) | 30.6(8.3) |
| COLON | 80.9(7.4) | 28.1(8.6) | 57.6(5.2) | 30.7(7.3) | 62.2(8.2) | 31.0(7.5) |
| LEUKEMIA | 87.1(9.7) | 27.0(6.7) | 47.6(10.0) | 30.4(7.3) | 49.2(8.3) | 30.4(6.8) |
| LUNG | 79.0(6.6) | 76.8(6.3) | 77.2(7.1) | 67.4(4.3) | 84.4(7.9) | 81.4(6.6) |
| Average | 81.3(7.9) | 39.8(7.4) | 61.6(7.8) | 39.4(6.8) | 66.4(8.4) | 43.3(7.3) |
| CNS | 76.2(8.9) | 27.4(8.4) | 67.1(8.9) | 29.2(7.7) | 69.4(9.1) | 30.6(8.3) |
| COLON | 82.5(7.2) | 28.0(8.7) | 59.5(4.6) | 30.7(7.3) | 63.3(8.7) | 32.2(7.4) |
| LEUKEMIA | 88.2(9.7) | 27.3(6.7) | 47.6(8.1) | 30.3(7.3) | 49.2(7.9) | 30.4(6.8) |
| LUNG | 81.1(6.8) | 78.2(6.1) | 81.3(5.1) | 77.6(8.2) | 81.1(8.0) | 76.4(5.9) |
| Average | 82.0(8.1) | 40.2(7.0) | 63.8(6.7) | 41.9(7.6) | 65.7(8.4) | 42.4(7.1) |
Figure 1. Comparison of distributions of eigenvectors used by GAPCASVM and GAPLSSVM with . The X-axis corresponds to the eigenvectors in descending order of their eigenvalues, divided into bins of size 5. The Y-axis corresponds to the average number of times that the eigenvectors within a bin are selected by the GA.
Figure 2. Comparison of distributions of eigenvectors used by GAPPSVM with . The X-axis corresponds to the eigenvectors in descending order of their eigenvalues, divided into bins of size 5. The Y-axis corresponds to the average number of times that the eigenvectors within a bin are selected by the GA.
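The binning described in these captions can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `bin_selection_counts` and the example counts are hypothetical, and the only details taken from the captions are the ordering by eigenvalue, the bin size of 5, and the averaging within each bin.

```python
def bin_selection_counts(counts, bin_size=5):
    """Average per-eigenvector selection counts within consecutive bins.

    `counts[i]` is how often the i-th eigenvector (sorted by descending
    eigenvalue) was selected by the GA. The last bin may be shorter
    than `bin_size` and is averaged over its actual length.
    """
    binned = []
    for start in range(0, len(counts), bin_size):
        chunk = counts[start:start + bin_size]
        binned.append(sum(chunk) / len(chunk))
    return binned


# Hypothetical selection counts for 12 eigenvectors (illustrative only).
example_counts = [9, 8, 7, 3, 3, 2, 6, 1, 0, 1, 4, 4]
binned_counts = bin_selection_counts(example_counts)
print(binned_counts)  # one averaged value per bin of 5 eigenvectors
```

A flat or non-monotone profile across bins would indicate that the GA does not simply favor the components with the largest eigenvalues.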
Statistical classification error rates (and their corresponding standard deviation) by using kNN with different parameters on four microarray data sets (%)
| DATASET | KNN | PCAKNN | GAPCAKNN | PLSKNN | GAPLSKNN | PPKNN | GAPPKNN |
| CNS | 47.5(3.5) | 43.8(8.7) | 40.8(10.2) | 44.9(9.3) | 36.4(10.7) | 44.9(10.3) | 34.3(8.3) |
| COLON | 32.5(1.4) | 28.4(8.4) | 27.1(10.3) | 24.8(14.3) | 21.9(7.5) | 30.2(12.2) | 18.2(7.2) |
| LEUKEMIA | 16.1(2.2) | 14.7(10.8) | 11.4(8.7) | 15.7(10.4) | 12.3(8.6) | 15.9(11.9) | 8.4(6.4) |
| LUNG | 17.6(2.3) | 11.8(7.1) | 11.0(4.6) | 11.8(5.3) | 6.1(4.7) | 13.2(5.6) | 7.8(4.1) |
| Average | 28.4(2.3) | 24.6(8.7) | 22.57(8.4) | 24.3(9.8) | 19.2(7.8) | 25.3(10.0) | 17.1(6.5) |
| CNS | 48.6(1.2) | 46.5(11.2) | 41.5(11.0) | 44.9(9.9) | 38.4(10.8) | 47.8(9.9) | 38.5(8.7) |
| COLON | 44.6(2.8) | 42.9(12.9) | 36.2(9.2) | 35.3(14.3) | 28.8(8.8) | 34.9(13.7) | 24.5(8.4) |
| LEUKEMIA | 32.5(1.9) | 31.5(14.1) | 28.1(11.9) | 15.5(9.6) | 14.8(14.9) | 18.6(11.5) | 10.0(7.8) |
| LUNG | 16.2(0.8) | 15.8(4.6) | 13.5(4.8) | 12.6(6.3) | 10.1(4.8) | 13.4(6.5) | 9.4(3.5) |
| Average | 35.4(1.7) | 34.1(10.7) | 28.8(9.2) | 27.0(10.0) | 23.0(8.0) | 28.67(10.4) | 20.6(7.1) |
| CNS | 46.5(1.4) | 39.1(6.6) | 38.6(6.8) | 41.4(10.3) | 39.4(9.4) | 39.0(9.0) | 31.9(7.6) |
| COLON | 30.9(1.0) | 28.4(8.4) | 26.0(8.3) | 28.3(11.6) | 24.3(8.3) | 25.8(11.0) | 19.5(7.2) |
| LEUKEMIA | 24.3(1.0) | 22.7(10.2) | 17.7(9.1) | 13.5(9.0) | 11.5(8.8) | 12.4(8.0) | 7.0(6.2) |
| LUNG | 16.3(0.6) | 13.2(3.6) | 10.1(3.5) | 11.5(4.1) | 6.5(4.7) | 12.1(5.4) | 7.5(3.61) |
| Average | 29.5(1.0) | 25.8(7.2) | 23.1(6.9) | 23.6(8.7) | 20.4(7.8) | 22.42(8.3) | 16.4(6.2) |
Average percentage of features (and their corresponding standard deviation) used by kNN with different parameters on four microarray data sets (%)
| DATASET | PCAKNN | GAPCAKNN | PLSKNN | GAPLSKNN | PPKNN | GAPPKNN |
| CNS | 68.5(6.5) | 32.3(8.0) | 69.2(8.0) | 32.2(6.4) | 62.5(7.3) | 32.3(9.1) |
| COLON | 78.2(4.4) | 29.7(7.8) | 58.3(5.2) | 32.8(6.4) | 61.4(9.0) | 34.2(7.2) |
| LEUKEMIA | 68.0(8.8) | 28.6(6.2) | 47.8(7.1) | 31.2(7.6) | 54.3(8.3) | 33.3(6.8) |
| LUNG | 73.4(6.2) | 72.2(7.6) | 78.4(7.2) | 68.9(5.9) | 79.8(8.1) | 71.9(6.9) |
| Average | 72.0(6.5) | 40.7(7.4) | 63.4(6.9) | 41.2(6.6) | 64.5(8.2) | 42.9(7.5) |
| CNS | 71.2(6.8) | 26.6(7.9) | 68.2(7.8) | 31.6(9.5) | 62.4(9.8) | 23.1(8.2) |
| COLON | 80.3(7.5) | 32.2(6.8) | 59.7(5.2) | 27.3(8.3) | 62.3(8.8) | 32.5(7.5) |
| LEUKEMIA | 81.4(6.9) | 26.7(5.7) | 46.8(8.8) | 35.5(7.1) | 50.7(7.3) | 33.2(6.0) |
| LUNG | 78.2(8.7) | 71.2(6.3) | 74.7(6.0) | 69.3(4.1) | 80.2(8.9) | 70.0(6.2) |
| Average | 77.7(7.5) | 39.1(6.7) | 62.3(6.9) | 40.9(7.2) | 63.9(8.7) | 39.7(7.0) |
| CNS | 72.8(7.1) | 29.4(8.2) | 61.7(8.8) | 32.3(6.1) | 68.1(10.3) | 32.0(8.8) |
| COLON | 81.2(8.3) | 25.7(7.7) | 52.4(6.3) | 36.7(5.7) | 65.4(8.9) | 33.4(7.3) |
| LEUKEMIA | 79.1(6.9) | 28.9(6.5) | 48.9(9.1) | 32.1(6.8) | 51.3(9.1) | 32.7(6.1) |
| LUNG | 80.0(5.2) | 72.5(8.7) | 78.3(7.4) | 62.9(7.6) | 82.7(9.2) | 80.9(5.8) |
| Average | 78.2(6.9) | 39.1(7.8) | 60.3(7.9) | 41.0(6.6) | 66.8(9.4) | 44.7(7.0) |
Figure 3. Comparison of distributions of eigenvectors used by GAPCAKNN and GAPLSKNN with . The X-axis corresponds to the eigenvectors in descending order of their eigenvalues, divided into bins of size 5. The Y-axis corresponds to the average number of times that the eigenvectors within a bin are selected by the GA.
Figure 4. Comparison of distributions of eigenvectors used by GAPPKNN with . The X-axis corresponds to the eigenvectors in descending order of their eigenvalues, divided into bins of size 5. The Y-axis corresponds to the average number of times that the eigenvectors within a bin are selected by the GA.
Figure 5. A framework of dimension reduction for the analysis of gene microarray data.
Figure 6. Genetic algorithm based feature selection.
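As a rough illustration of genetic algorithm based selection over extracted components, the sketch below evolves binary masks in which bit j means "keep component j". Everything specific here is an assumption and not taken from the paper: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter defaults, the function name `ga_select`, and the toy fitness. In the setting described above, the fitness would presumably be a classifier's validation performance on the selected components.

```python
import random


def ga_select(n_components, fitness, pop_size=20, generations=30,
              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Evolve a 0/1 mask over components that maximizes `fitness(mask)`.

    All operator choices and defaults are illustrative assumptions.
    Requires n_components >= 2 (needed for one-point crossover).
    """
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_components)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Pick two distinct individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: the current best always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_components)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to every gene.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# Toy fitness (hypothetical): components 0, 7 and 12 are "informative",
# and every selected component costs a small penalty.
INFORMATIVE = {0, 7, 12}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask = ga_select(15, toy_fitness)
selected = [i for i, bit in enumerate(best_mask) if bit]
print(selected)  # indices of the components kept by the best mask
```

Because the mask ranges over all components and not only the leading ones, a lower-ranked component such as index 12 can be kept while higher-ranked ones are dropped.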
Microarray data sets used for comparison
| Data Sets | Samples | Class Ratio | Features |
| CNS | 60 | 21/39 | 7,129 |
| Colon | 62 | 22/40 | 2,000 |
| Leukemia | 72 | 25/47 | 7,129 |
| Lung | 181 | 31/150 | 12,533 |