Yingqiang Sun, Chengbo Lu, Xiaobo Li.
Abstract
Gene expression profiles are high-dimensional, continuous-valued, and available for only a small number of samples, which makes classifying tumor samples from such data a great challenge. This paper proposes a cross-entropy based multi-filter ensemble (CEMFE) method for microarray data classification. First, multiple filters are applied to the microarray data to obtain several pre-selected feature subsets with different classification abilities, and the top N ranked genes of each subset are merged into a new data set. Second, the cross-entropy algorithm is used to remove redundant genes from this data set. Finally, a wrapper method based on forward feature selection is used to select the best feature subset. The experimental results show that the proposed method is more efficient than other gene selection methods and achieves higher classification accuracy with fewer feature genes.
Keywords: cross-entropy; ensemble method; gene expression profile; gene selection; multi-filter
Year: 2018 PMID: 29772787 PMCID: PMC5977198 DOI: 10.3390/genes9050258
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Experimental datasets.
| Dataset | Genes | Samples (Class 1) | Samples (Class 2) | Classes |
|---|---|---|---|---|
| Colon | 2000 | 40 (T) | 22 (N) | 2 |
| Prostate | 12,600 | 52 (T) | 50 (N) | 2 |
| Leukemia | 7129 | 25 (AML) | 47 (ALL) | 2 |
| Lymphoma | 7129 | 58 (DLBCL) | 19 (FL) | 2 |
| Lung | 12,600 | 31 (MPM) | 150 (ADCA) | 2 |
T: tumor; DLBCL: diffuse large B-cell lymphoma; N: normal; FL: follicular lymphoma; AML: acute myeloid leukemia; ALL: acute lymphoblastic leukemia; MPM: malignant pleural mesothelioma; ADCA: adenocarcinoma.
Figure 1. Flowchart of the cross-entropy multi-filter ensemble (CEMFE) algorithm. TS: t-statistic; SNR: signal-to-noise ratio; PC: Pearson correlation coefficient; LOOCV: leave-one-out cross-validation.
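The multi-filter stage named in the flowchart (TS, SNR, PC, then merging the top N genes of each filter) can be illustrated with a minimal sketch. The formulas below are the common textbook forms (Welch t-statistic, mean difference over summed standard deviations, Pearson correlation with a 0/1 class label) and are assumptions here; the exact definitions used by the authors may differ. The function names and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev

def t_statistic(a, b):
    """Welch t-statistic between two groups of expression values (assumed form)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(va / len(a) + vb / len(b))

def signal_to_noise(a, b):
    """Mean difference divided by the sum of standard deviations (assumed form)."""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))

def pearson_with_label(values, labels):
    """Pearson correlation between one gene and the 0/1 class label."""
    mx, my = mean(values), mean(labels)
    cov = sum((x - mx) * (y - my) for x, y in zip(values, labels))
    sx = math.sqrt(sum((x - mx) ** 2 for x in values))
    sy = math.sqrt(sum((y - my) ** 2 for y in labels))
    return cov / (sx * sy)

def top_n_union(X, y, n):
    """X: list of samples, each a list of gene values; y: 0/1 labels.
    Returns sorted indices of genes ranked in the top n by at least one filter.
    Assumes no gene is constant within a class (otherwise a division by zero occurs)."""
    n_genes = len(X[0])
    scores = {"TS": [], "SNR": [], "PC": []}
    for g in range(n_genes):
        col = [row[g] for row in X]
        a = [v for v, lab in zip(col, y) if lab == 1]
        b = [v for v, lab in zip(col, y) if lab == 0]
        scores["TS"].append(abs(t_statistic(a, b)))
        scores["SNR"].append(abs(signal_to_noise(a, b)))
        scores["PC"].append(abs(pearson_with_label(col, y)))
    selected = set()
    for s in scores.values():
        ranked = sorted(range(n_genes), key=lambda g: s[g], reverse=True)
        selected.update(ranked[:n])
    return sorted(selected)

# Toy data: 6 samples, 3 genes; only gene 0 separates the two classes.
X = [
    [5.0, 1.0, 3.0],
    [5.5, 1.2, 2.0],
    [6.0, 0.9, 3.5],
    [1.0, 1.1, 2.5],
    [1.5, 1.0, 3.2],
    [0.8, 1.3, 2.8],
]
y = [1, 1, 1, 0, 0, 0]
print(top_n_union(X, y, n=1))  # [0]
```

Taking the union rather than the intersection keeps any gene that at least one filter ranks highly, which matches the description of integrating the top N genes of each subset; duplicates across filters collapse automatically.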
Comparison of the algorithms on the different datasets: best classification accuracy (%), with the number of feature genes in parentheses.
| Dataset | Classifier | CEMFE | SNRCE | TSCE | PCCE |
|---|---|---|---|---|---|
| Colon | SVM | 93.55 (23) | 90.32 (7) | 90.32 (18) | 93.55 (33) |
| | KNN | 96.77 (14) | 88.71 (17) | 83.87 (13) | 90.32 (7) |
| | NB | 96.77 (9) | 88.71 (17) | 85.48 (6) | 91.91 (21) |
| Prostate | SVM | 97.10 (17) | 88.24 (11) | 97.10 (22) | 91.18 (7) |
| | KNN | 98.04 (9) | 94.12 (26) | 90.20 (10) | 98.04 (15) |
| | NB | 96.10 (9) | 89.22 (13) | 93.14 (22) | 92.16 (27) |
| Leukemia | SVM | 97.22 (6) | 95.83 (17) | 90.28 (17) | 95.83 (35) |
| | KNN | 98.61 (7) | 89.06 (28) | 88.89 (23) | 93.06 (8) |
| | NB | 100 (12) | 83.33 (26) | 96.88 (19) | 94.44 (33) |
| Lymphoma | SVM | 100 (26) | 88.31 (34) | 96.10 (66) | 84.42 (36) |
| | KNN | 98.70 (16) | 93.51 (7) | 79.22 (14) | 97.40 (41) |
| | NB | 98.70 (18) | 94.81 (9) | 96.10 (14) | 80.52 (22) |
| Lung | SVM | 100 (4) | 98.34 (24) | 100 (36) | 99.45 (33) |
| | KNN | 100 (3) | 100 (21) | 100 (40) | 100 (18) |
| | NB | 98.90 (9) | 96.13 (17) | 98.90 (23) | 98.90 (41) |
SVM: support vector machine; KNN: k-nearest neighbor; NB: naive Bayesian; CEMFE: cross-entropy based multi-filter ensemble; SNRCE: signal-to-noise ratio and cross-entropy method; TSCE: t-statistic and cross-entropy method; PCCE: Pearson correlation coefficient and cross-entropy method.
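Each method in the table pairs a filter with a cross-entropy step that removes redundant genes. The record does not give the formula, so the following is only a sketch under stated assumptions: each gene's non-negative expression profile is scaled into a probability distribution over samples, and two genes are treated as redundant when their symmetrised relative entropy (Kullback-Leibler divergence, a quantity that is sometimes called cross-entropy) falls below a threshold. The function names, the threshold, and the toy profiles are hypothetical.

```python
import math

def to_distribution(values, eps=1e-9):
    """Scale a non-negative expression profile so that it sums to 1.
    eps avoids zero probabilities; non-negative input is an assumption."""
    total = sum(values) + eps * len(values)
    return [(v + eps) / total for v in values]

def symmetric_cross_entropy(p, q):
    """Symmetrised Kullback-Leibler divergence between two distributions."""
    return sum(pi * math.log(pi / qi) + qi * math.log(qi / pi)
               for pi, qi in zip(p, q))

def remove_redundant(genes, threshold):
    """genes: dict of name -> expression profile, ordered from best to worst rank.
    A gene is kept only if it differs from every kept gene by more than threshold."""
    kept = {}
    for name, profile in genes.items():
        p = to_distribution(profile)
        if all(symmetric_cross_entropy(p, q) > threshold for q in kept.values()):
            kept[name] = p
    return list(kept)

genes = {
    "g1": [5.0, 5.5, 6.0, 1.0, 1.5, 0.8],
    "g2": [10.0, 11.0, 12.0, 2.0, 3.0, 1.6],  # exactly 2 * g1, so redundant
    "g3": [1.0, 1.2, 0.9, 4.0, 4.5, 5.0],
}
print(remove_redundant(genes, threshold=0.01))  # ['g1', 'g3']
```

Processing genes in rank order means that, of two near-identical genes, the better-ranked one is retained.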
Figure 2. Classification accuracy vs. number of genes for the prostate dataset, using the (a) support vector machine, (b) k-nearest neighbor, and (c) naive Bayesian classifiers.
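Accuracy-versus-gene-count curves of this kind arise naturally from a forward-selection wrapper scored by LOOCV. The sketch below illustrates that idea with a 1-nearest-neighbour classifier written in plain Python; the choice of 1-NN, the strict-improvement stopping rule, and all names are assumptions made for illustration, not the authors' implementation.

```python
def loocv_accuracy(X, y, features):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier
    restricted to the given feature (gene) indices."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue  # the held-out sample cannot be its own neighbour
            d = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)

def forward_selection(X, y, candidates):
    """Greedily add the gene that most improves LOOCV accuracy and
    stop when no remaining gene gives a strict improvement."""
    selected, best_acc = [], 0.0
    remaining = list(candidates)
    while remaining:
        scored = [(loocv_accuracy(X, y, selected + [g]), g) for g in remaining]
        acc, gene = max(scored, key=lambda t: t[0])  # first best wins ties
        if acc <= best_acc:
            break
        selected.append(gene)
        remaining.remove(gene)
        best_acc = acc
    return selected, best_acc

# Toy data: 6 samples, 3 genes; gene 0 alone separates the classes.
X = [
    [5.0, 1.0, 3.0],
    [5.5, 1.2, 2.0],
    [6.0, 0.9, 3.5],
    [1.0, 1.1, 2.5],
    [1.5, 1.0, 3.2],
    [0.8, 1.3, 2.8],
]
y = [1, 1, 1, 0, 0, 0]
print(forward_selection(X, y, candidates=[0, 1, 2]))  # ([0], 1.0)
```

With n samples, LOOCV accuracy can only take values k/n, which is why the reported percentages on small datasets move in coarse steps.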
The best classification accuracy of the CEMFE, FCBF, and mRMR algorithms in different data sets.
| Dataset | NB: FCBF | NB: mRMR | NB: CEMFE | KNN: FCBF | KNN: mRMR | KNN: CEMFE |
|---|---|---|---|---|---|---|
| Colon | 91.94 | 88.79 | 96.77 | 88.71 | 77.42 | 96.77 |
| Prostate | 97.06 | 98.04 | 96.08 | 97.06 | 97.06 | 98.04 |
| Leukemia | 100 | 100 | 100 | 100 | 100 | 98.61 |
| Lymphoma | 93.51 | 94.81 | 98.70 | 93.51 | 97.40 | 98.70 |
| Lung | 86.67 | 99.13 | 100 | 83.33 | 96.13 | 98.90 |
| Average | 93.84 | 96.15 | 98.31 | 92.52 | 93.60 | 98.20 |
mRMR: minimum redundancy maximum relevance; FCBF: fast correlation based filter.
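The Average row is the column-wise arithmetic mean of the five dataset rows. The short check below recomputes it from the values in the table; the column labels are shorthand introduced here for readability.

```python
# Accuracy values (%) copied from the table, one row per dataset.
# Column order: NB/FCBF, NB/mRMR, NB/CEMFE, KNN/FCBF, KNN/mRMR, KNN/CEMFE.
table = {
    "Colon":    [91.94, 88.79, 96.77, 88.71, 77.42, 96.77],
    "Prostate": [97.06, 98.04, 96.08, 97.06, 97.06, 98.04],
    "Leukemia": [100.0, 100.0, 100.0, 100.0, 100.0, 98.61],
    "Lymphoma": [93.51, 94.81, 98.70, 93.51, 97.40, 98.70],
    "Lung":     [86.67, 99.13, 100.0, 83.33, 96.13, 98.90],
}
columns = ["NB/FCBF", "NB/mRMR", "NB/CEMFE", "KNN/FCBF", "KNN/mRMR", "KNN/CEMFE"]
reported = [93.84, 96.15, 98.31, 92.52, 93.60, 98.20]

# zip(*rows) transposes the rows into columns before averaging.
averages = [sum(col) / len(col) for col in zip(*table.values())]

for name, avg, rep in zip(columns, averages, reported):
    print(f"{name}: computed {avg:.2f}, reported {rep:.2f}")
```

Every recomputed mean agrees with the reported value to two decimal places.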
Variation in classification accuracy with the number of genes (N) selected by each filter.
| N | Measure | Colon | Prostate | Leukemia | Lymphoma | Lung |
|---|---|---|---|---|---|---|
| 50 | Avg. no. of genes | 15.3 | 12.3 | 8.3 | 20 | 5.3 |
| | Avg. performance | 95.70 | 97.08 | 98.61 | 99.13 | 99.63 |
| 100 | Avg. no. of genes | 10.6 | 12.3 | 11.6 | 20 | 5.3 |
| | Avg. performance | 93.55 | 95.93 | 93.06 | 100 | 99.63 |
| 200 | Avg. no. of genes | 10.6 | 12.3 | 13.3 | 22.3 | 3.6 |
| | Avg. performance | 91.94 | 95.93 | 90.28 | 98.70 | 98.90 |
N: number of genes selected; Avg: average; no: number.