Rafael Marcos Luque-Baena, Daniel Urda, Jose Luis Subirats, Leonardo Franco, Jose M Jerez.
Abstract
BACKGROUND: Extracting relevant information from microarray data is difficult because the data sets comprise a large number of features but generally few samples. Feature selection is therefore an important part of the analysis, helping both to identify relevant genes and to maximize predictive information.
Year: 2014 PMID: 25077572 PMCID: PMC4108856 DOI: 10.1186/1742-4682-11-S1-S7
Source DB: PubMed Journal: Theor Biol Med Model ISSN: 1742-4682 Impact factor: 2.432
Cancer datasets
| Dataset | Genes | Samples | Class 1 | Class 2 | Class 1 / Samples |
|---|---|---|---|---|---|
| Leukemia | 7129 | 72 | 25 | 47 | 0.347 |
| Lung | 12533 | 181 | 150 | 31 | 0.829 |
| Colon | 2000 | 62 | 22 | 40 | 0.355 |
| Breast | 24481 | 78 | 33 | 44 | 0.423 |
| Ovarian | 15154 | 253 | 91 | 162 | 0.360 |
| Prostate | 12600 | 102 | 50 | 52 | 0.490 |
Main characteristics of the six cancer datasets analysed.
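The last numeric column of the table can be checked against the others. The sketch below assumes, as an inference from the numbers rather than something the source states, that the five numeric columns are the number of genes, the number of samples, the sizes of the two classes, and the fraction of samples in the first class. Under that reading, dividing the first class size by the sample count reproduces every reported value.

```python
# Rows copied from the table: (genes, samples, class_1, class_2, reported_ratio).
# The column meanings are an assumption inferred from the numbers themselves.
ROWS = [
    (7129, 72, 25, 47, 0.347),
    (12533, 181, 150, 31, 0.829),
    (2000, 62, 22, 40, 0.355),
    (24481, 78, 33, 44, 0.423),
    (15154, 253, 91, 162, 0.360),
    (12600, 102, 50, 52, 0.490),
]


def class_ratio(samples, class_1):
    """Fraction of samples in the first class, rounded to three decimals."""
    return round(class_1 / samples, 3)


def features_per_sample(genes, samples):
    """How many features there are for every available sample."""
    return genes / samples


for genes, samples, class_1, class_2, reported in ROWS:
    print(
        f"genes/sample = {features_per_sample(genes, samples):7.1f}  "
        f"ratio = {class_ratio(samples, class_1):.3f}  (reported {reported:.3f})"
    )
```

The genes-per-sample figure makes the imbalance described in the abstract concrete: each row has far more features than samples.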
Parameter estimation for GA
| Prostate dataset | | | | Lung dataset | | | |
|---|---|---|---|---|---|---|---|
| 0.8 | 0.6 | 0.9838 | 2.67 | 0.8 | 0.6 | 0.9730 | 8.65 |
| 0.8 | 0.4 | 0.9899 | 3.30 | 0.8 | 0.4 | 0.9748 | 7.28 |
| 0.8 | 0.25 | 0.9914 | 3.52 | 0.8 | 0.25 | 0.9801 | 9.85 |
| 0.4 | 0.6 | 0.9827 | 2.56 | 0.4 | 0.6 | 0.9743 | 8.80 |
| 0.4 | 0.4 | 0.9912 | 3.75 | 0.4 | 0.4 | 0.9763 | 9.55 |
| 0.1 | 0.6 | 0.9837 ± 0.0104 | 3.04 ± 1.71 | 0.1 | 0.6 | 0.9770 ± 0.0095 | 7.83 ± 2.06 |
| 0.1 | 0.4 | 0.9895 ± 0.0065 | 2.88 ± 0.70 | 0.1 | 0.4 | 0.9763 ± 0.0118 | 9.63 ± 2.53 |
Parameter estimation for the α and β parameters of the fitness function of the GA for the Lung and Prostate datasets.
Parameter settings
| Algorithm | Test Parameters |
|---|---|
| LDA | No parameters |
| SVM | Kernel type, |
| NaiveBayes | Kernel density, |
| C-MANTEC | Max. Iterations, |
| kNN | Neighbours, |
| MLP | Hidden neurons, |
Parameter settings tested during evaluation of the classification algorithms. The combination of all the parameter values generates a set of configurations for each method.
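Combining all parameter values into configurations is a Cartesian product. A minimal sketch of that step follows; the source names the parameters but not the values tested, so the value grid below, the parameter key names, and the `configurations` helper are all hypothetical.

```python
from itertools import product

# Hypothetical value grid: the source does not list the values that were tested.
KNN_GRID = {
    "neighbours": [1, 3, 5],
    "distance": ["Euclidean", "cosine-similarity"],
}


def configurations(grid):
    """Return every combination of parameter values as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


knn_configs = configurations(KNN_GRID)
print(len(knn_configs), "kNN configurations")

# A method without parameters (LDA in the table) yields a single empty configuration.
print(configurations({}))
```

Each classifier would then be evaluated once per configuration, which is why a method with several multi-valued parameters produces many more runs than one with a single parameter.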
Figure 1. Quantitative measures. False Positive (FP) and False Negative (FN) ratios after applying each method to the test sequences with all the parameter configurations. Each coloured point '*' represents a different configuration of the indicated method. The closer the points are to the origin, the better the segmentation. Additionally, a method is less sensitive to parameter changes if its cloud of points is more compact (see the text for more details). The datasets differ, and so do the scales.
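The two reading rules in the caption (closer to the origin is better, a more compact cloud is less sensitive to parameter changes) can be turned into numbers. A minimal sketch, assuming each configuration is an (FP ratio, FN ratio) pair; the helper names and the sample points are hypothetical, and mean Euclidean distance is only one possible way to measure closeness and compactness.

```python
from math import hypot
from statistics import mean


def mean_distance_to_origin(points):
    """Average Euclidean distance of the (FP, FN) points to (0, 0); lower is better."""
    return mean([hypot(fp, fn) for fp, fn in points])


def spread(points):
    """Average distance of the points to their centroid; lower means a more compact cloud."""
    cx = mean([fp for fp, _ in points])
    cy = mean([fn for _, fn in points])
    return mean([hypot(fp - cx, fn - cy) for fp, fn in points])


# Hypothetical clouds for two methods, one point per parameter configuration.
method_a = [(0.02, 0.03), (0.03, 0.02), (0.02, 0.02)]
method_b = [(0.01, 0.20), (0.15, 0.02), (0.08, 0.10)]

for name, cloud in (("A", method_a), ("B", method_b)):
    print(name, round(mean_distance_to_origin(cloud), 4), round(spread(cloud), 4))
```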
Performance comparison of classification techniques
| Dataset | Classifier | Parameters | GA accuracy (%) | GA genes | SFS accuracy (%) | SFS genes |
|---|---|---|---|---|---|---|
| Leukemia | LDA | - | 99.959 ± 0.07 | 12 | 97.609 ± 2.86 | 2 |
| | SVM | {polynomial,15,1,0.6,0} | 99.982 ± 0.06 | 8 | | 4 |
| | NaiveBayes | {1,0} | 99.974 ± 0.03 | 12 | 98.060 ± 2.19 | 3 |
| | C-MANTEC | {1000,0.01,4.5} | 99.038 ± 0.25 | 7 | 98.837 ± 2.46 | 3 |
| | kNN | {1,Euclidean} | | 10 | 99.844 ± 0.77 | 5 |
| | MLP | {3,0.5,50} | 99.944 ± 0.05 | 5 | 95.784 ± 3.38 | 2 |
| Lung | LDA | - | 99.971 ± 0.03 | 5 | 99.057 ± 1.00 | 3 |
| | SVM | {linear,10,-,-,-} | | 11 | 99.828 ± 0.70 | 3 |
| | NaiveBayes | {1,0} | 99.998 ± 0.01 | 4 | | 3 |
| | C-MANTEC | {100000,0.25,2} | 99.678 ± 0.08 | 6 | 99.673 ± 0.94 | 2 |
| | kNN | {1,Euclidean} | 99.969 ± 0.02 | 4 | 99.969 ± 0.22 | 4 |
| | MLP | {4,0.1,50} | 99.996 ± 0.01 | 4 | 99.778 ± 0.79 | 2 |
| Colon | LDA | - | 98.676 ± 0.35 | 11 | 87.179 ± 6.15 | 2 |
| | SVM | {polynomial,1,1,0.4,2} | 89.917 ± 1.26 | 20 | 91.738 ± 5.21 | 5 |
| | NaiveBayes | {0,1} | 90.583 ± 0.49 | 15 | 89.076 ± 7.79 | 4 |
| | C-MANTEC | {10000,0.01,1} | 94.315 ± 0.48 | 11 | 87.593 ± 6.69 | 2 |
| | kNN | {3,cosine-similarity} | 95.060 ± 0.38 | 19 | | 6 |
| | MLP | {5,0.3,50} | | 12 | 88.733 ± 5.51 | 2 |
| Breast | LDA | - | 99.788 ± 0.12 | 15 | 74.169 ± 6.52 | 1 |
| | SVM | {polynomial,7,2,0.001,2} | 99.744 ± 0.14 | 31 | | 3 |
| | NaiveBayes | {0,0} | 97.759 ± 0.23 | 27 | 73.499 ± 6.34 | 1 |
| | C-MANTEC | {10000,0.01,1.5} | 97.342 ± 0.39 | 23 | 76.645 ± 6.53 | 1 |
| | kNN | {3,Euclidean} | 97.485 ± 0.30 | 34 | 80.975 ± 6.37 | 2 |
| | MLP | {4,0.3,50} | | 18 | 79.191 ± 6.43 | 2 |
| Ovarian | LDA | - | 99.980 ± 0.01 | 4 | | 3 |
| | SVM | {polynomial,9,1,0.2,0} | | 4 | 99.978 ± 0.13 | 4 |
| | NaiveBayes | {1,0} | 99.951 ± 0.03 | 5 | 99.980 ± 0.13 | 4 |
| | C-MANTEC | {1000,0.3,1.5} | 99.844 ± 0.05 | 4 | 99.659 ± 0.75 | 3 |
| | kNN | {1,Euclidean} | 99.984 ± 0.01 | 4 | 99.982 ± 0.11 | 3 |
| | MLP | {5,0.3,50} | 99.999 ± 0 | 3 | | 3 |
| Prostate | LDA | - | 99.720 ± 0.12 | 9 | 95.677 ± 2.81 | 4 |
| | SVM | {polynomial,5,1,3,1} | 99.428 ± 0.31 | 20 | | 5 |
| | NaiveBayes | {0,0} | 98.817 ± 0.16 | 14 | 98.331 ± 2.13 | 7 |
| | C-MANTEC | {1000,0.25,4} | 98.681 ± 0.24 | 8 | 95.351 ± 3.40 | 4 |
| | kNN | {3,cosine-similarity} | 99.633 ± 0.11 | 20 | 97.146 ± 2.28 | 6 |
| | MLP | {3,0.5,50} | | 12 | 96.921 ± 2.37 | 4 |
Performance comparison between the two feature selection frameworks used (GA and SFS) and the six classifiers analyzed (LDA, SVM, NaiveBayes, C-MANTEC, kNN and MLP) for each cancer microarray dataset. The results correspond to the best simulation for each dataset, showing the accuracy of each method as mean ± standard deviation and the number of selected genes.
Performance comparison of feature selection frameworks
| Classifier | GA accuracy (%) | GA genes | SFS accuracy (%) | SFS genes |
|---|---|---|---|---|
| LDA | 99.682 ± 0.12 | 9.33 | 92.282 ± 3.22 | 2.5 |
| SVM | 99.082 ± 0.25 | 15.67 | 95.185 ± 2.36 | 4 |
| NaiveBayes | 97.847 ± 0.16 | 12.83 | 93.156 ± 3.11 | 3.67 |
| C-MANTEC | 98.150 ± 0.25 | 9.83 | 92.960 ± 3.46 | 2.5 |
| kNN | 98.688 ± 0.14 | 15.17 | 95.249 ± 2.36 | 4.33 |
| MLP | 99.798 ± 0.08 | 9 | 93.401 ± 3.08 | 2.5 |
Average performance comparison between the two feature selection frameworks (GA and SFS) and the six classifiers (LDA, SVM, NaiveBayes, C-MANTEC, kNN and MLP) over all datasets.
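The averages in this table can be reproduced from the per-dataset results wherever all six values are present. The sketch below does this for the GA columns of LDA, NaiveBayes and C-MANTEC, using the mean accuracies and gene counts copied from the per-dataset table; it assumes the average is a plain arithmetic mean over the six datasets, and the `average_row` helper is hypothetical.

```python
from statistics import mean

# GA results copied from the per-dataset table, one tuple per dataset:
# (mean accuracy in %, number of selected genes).
GA_RESULTS = {
    "LDA": [(99.959, 12), (99.971, 5), (98.676, 11),
            (99.788, 15), (99.980, 4), (99.720, 9)],
    "NaiveBayes": [(99.974, 12), (99.998, 4), (90.583, 15),
                   (97.759, 27), (99.951, 5), (98.817, 14)],
    "C-MANTEC": [(99.038, 7), (99.678, 6), (94.315, 11),
                 (97.342, 23), (99.844, 4), (98.681, 8)],
}


def average_row(results):
    """Arithmetic mean of accuracy (3 decimals) and gene count (2 decimals)."""
    accuracy = round(mean([acc for acc, _ in results]), 3)
    genes = round(mean([n for _, n in results]), 2)
    return accuracy, genes


for classifier, results in GA_RESULTS.items():
    print(classifier, average_row(results))
```

The three classifiers come out at 99.682 / 9.33, 97.847 / 12.83 and 98.150 / 9.83, matching the GA columns of the table.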
Differences between classifiers
| FS procedure | Dataset | p-value | Control | Statistically different classifiers |
|---|---|---|---|---|
| SFS | Leukemia | | LDA | SVM |
| | Lung | | LDA | kNN, NB |
| | Colon | | LDA | SVM, kNN |
| | Breast | | NB | kNN, SVM |
| | Ovarian | | CM | LDA, NN |
| | Prostate | | CM | NB, SVM |
| GA | Leukemia | | CM | NB, NN, LDA, SVM, kNN |
| | Lung | | CM | SVM, NB, NN |
| | Colon | | SVM | LDA, NN |
| | Breast | | SVM | NN, LDA |
| | Ovarian | | CM | SVM, NN |
| | Prostate | | CM | NN, LDA |
Differences between classifiers for the two feature selection (FS) procedures used (first column). The lowest-performing classifier is taken as the control group, and the last column lists the classifiers that lead to statistically significant results (the corresponding p-value is indicated in the third column).
Differences between feature selection algorithms
| Classifier | Dataset | p-value | Control | Statistically different FS procedures |
|---|---|---|---|---|
| LDA | Leukemia | 1.54 | SFS | GA |
| | Lung | 1.54 | SFS | GA |
| | Colon | 1.54 | SFS | GA |
| | Breast | 1.54 | SFS | GA |
| | Ovarian | 3.28 | GA | SFS |
| | Prostate | 1.54 | SFS | GA |
| SVM | Leukemia | 3.65 | SFS | GA |
| | Lung | 1.54 | SFS | GA |
| | Colon | 2.86 | GA | SFS |
| | Breast | 1.54 | SFS | GA |
| | Ovarian | 9.13 | SFS | GA |
| | Prostate | 1.54 | SFS | GA |
| NB | Leukemia | 4.71 | SFS | GA |
| | Lung | 1.54 | SFS | GA |
| | Colon | 1.54 | SFS | GA |
| | Breast | 1.54 | SFS | GA |
| | Ovarian | 0.157 | - | - |
| | Prostate | 1.54 | SFS | GA |
| CM | Leukemia | 4.71 | SFS | GA |
| | Lung | 1.54 | SFS | GA |
| | Colon | 1.54 | SFS | GA |
| | Breast | 1.54 | SFS | GA |
| | Ovarian | 0.157 | - | - |
| | Prostate | 1.54 | SFS | GA |
| kNN | Leukemia | 1.54 | SFS | GA |
| | Lung | 0.0897 | - | - |
| | Colon | 1.54 | SFS | GA |
| | Breast | 1.54 | SFS | GA |
| | Ovarian | 0.6547 | - | - |
| | Prostate | 1.54 | SFS | GA |
| NN | Leukemia | 4.71 | SFS | GA |
| | Lung | 1.54 | SFS | GA |
| | Colon | 1.54 | SFS | GA |
| | Breast | 1.54 | SFS | GA |
| | Ovarian | 0.157 | - | - |
| | Prostate | 1.54 | SFS | GA |
Differences between the SFS and GA feature selection algorithms for the six classification methods used (first column). The lowest-performing FS procedure is taken as the control group (fourth column), and the last column lists the procedures that lead to statistically significant results (the corresponding p-value is indicated in the third column).
Figure 2. Frequency selection of genes for the Leukemia, Lung, Colon and Breast databases. The ten most selected features for the analysed datasets. Frequency selection is represented by a horizontal bar, divided according to the six classifiers used in the analysis: LDA, SVM, C-MANTEC, kNN, NaiveBayes and MLP. The index, gene symbol and probe set ID of each gene are shown in columns one to three.
Figure 3. Frequency selection of genes for the Ovarian and Prostate databases. The ten most selected features for the analysed datasets. The structure of this figure is the same as that of Figure 2.
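Frequency of selection, as plotted in Figures 2 and 3, is a count of how often each gene appears among the selected features, broken down by classifier. A minimal sketch, assuming each execution is recorded as a (classifier, selected gene indices) pair; the function names and the example runs are hypothetical and are not results from the study.

```python
from collections import Counter


def selection_frequency(runs):
    """Map each gene index to a Counter of how many runs per classifier selected it."""
    per_gene = {}
    for classifier, genes in runs:
        for gene in set(genes):  # count a gene at most once per run
            per_gene.setdefault(gene, Counter())[classifier] += 1
    return per_gene


def top_genes(per_gene, k=10):
    """Return the k most frequently selected genes (ties broken by gene index)."""
    totals = {gene: sum(counts.values()) for gene, counts in per_gene.items()}
    return sorted(totals, key=lambda gene: (-totals[gene], gene))[:k]


# Invented runs for illustration only; a real analysis would have many per classifier.
example_runs = [
    ("LDA", [4951, 3847]),
    ("SVM", [4951, 6169]),
    ("kNN", [4951, 3847, 1882]),
]

frequency = selection_frequency(example_runs)
for gene in top_genes(frequency, k=2):
    print(gene, dict(frequency[gene]))
```

The per-classifier counts for a gene correspond to the segments of one horizontal bar, and their sum to the bar's total length.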
Selected genes for the Leukemia dataset
| ID | Probe Set ID | Gene Description | References |
|---|---|---|---|
| 4951 | | NME/NM23 nucleoside diphosphate kinase 4 | [ |
| 3847 | | Homeo box A9 | [ |
| 6169 | | C1NH Complement component 1 inhibitor | [ |
| 6184 | | PTMA Prothymosin alpha | [ |
| 6225 | | CD19 Molecule | [ |
| 1882 | | CST3 Cystatin C | [ |
| 1834 | | CD33 antigen | [ |
| 4847 | | Zyxin | [ |
| 3320 | | LTC4 synthase | [ |
| 5094 | | TPM1 Tropomyosin alpha chain | [ |
The top-ranked genes selected by the GA approach for the Leukemia dataset that also appear in other studies in the literature.
Figure 4. Comparison of the most frequently selected genes. Comparison of the most frequently selected genes (in 50 independent executions) by the GA and SFS strategies on the Leukemia dataset, regardless of the classifier used.
Figure 5. Biological analysis for the Leukemia dataset. Biological analysis of the results obtained by the GA-CMANTEC and SFS-CMANTEC strategies for the dataset using the IPA tool.