Ali Anaissi1, Madhu Goyal1, Daniel R Catchpoole2, Ali Braytee1, Paul J Kennedy1.
Abstract
BACKGROUND: Retrieving similar cases in a case-based reasoning system is a major challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets, and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process.
Keywords: case base reasoning; data mining; dimensionality reduction; feature weighting; gene expression; machine learning
Year: 2015 PMID: 25861214 PMCID: PMC4368049 DOI: 10.4137/CIN.S22371
Source DB: PubMed Journal: Cancer Inform ISSN: 1176-9351
Figure 1. Case-based retrieval framework.
Figure 2. Preprocessing the training data set in the case-based retrieval framework.
Figure 3. Weight-learning GA and kNN.
Figure 4. Modification of the training data set in the feature-weighting and sampling techniques section.
Figure 5. Preprocessing the test sample in the case-based retrieval framework and retrieving similar cases.
Table 1. The number of patients in the training and test data sets.
| | HIGH RISK | MEDIUM RISK | STANDARD RISK | TOTAL |
|---|---|---|---|---|
| Training dataset | 6 | 53 | 11 | 70 |
| Test dataset | 5 | 25 | 10 | 40 |
The Colon cancer data set is a publicly available microarray data set obtained with an Affymetrix oligonucleotide microarray [30]. It contains 62 samples, each with expression values for 2,000 genes and a label indicating whether or not the sample came from a tumor biopsy. This data set has been used in many studies, e.g., Ben-Dor et al [31], Brazma and Vilo [32], and Getz et al [33]. The Prostate cancer data set was also used in the experiments. It contains 52 prostate tumor samples and 50 nontumor prostate samples, with around 12,600 genes.
Table 2. Comparison of the performance of kNN on the childhood leukemia test data set for different values of k.
| k | AVERAGE ACCURACY |
|---|---|
| 3 | 0.71 |
| 4 | 0.69 |
| 5 | |
| 6 | 0.66 |
| 7 | 0.62 |
| 8 | 0.58 |
| 9 | 0.58 |
| 10 | 0.52 |
The average accuracy is calculated based on the average of the sensitivity and specificity of each class.
Table 3. Results of classification performance tests on the childhood leukemia test data set.
| | PREDICTED HIGH | PREDICTED MEDIUM | PREDICTED STANDARD |
|---|---|---|---|
| Actual High | 2 | 2 | 1 |
| Actual Medium | 1 | 22 | 2 |
| Actual Standard | 1 | 3 | 6 |
Table 4. Statistics by class for the confusion matrix of the test data set presented in Table 3.
| | HIGH | MEDIUM | STANDARD |
|---|---|---|---|
| Sensitivity | 0.40 | 0.88 | 0.60 |
| Specificity | 0.94 | 0.66 | 0.90 |
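The per-class statistics above can be recomputed directly from the confusion matrix in Table 3. The following is a minimal sketch, not the authors' code. It assumes that rows are actual classes and columns are predicted classes, and it reads "average accuracy" as the mean over classes of (sensitivity + specificity) / 2, which is one plausible interpretation of the table note. The names `per_class_stats` and `average_accuracy` are illustrative.

```python
import numpy as np

# Confusion matrix from Table 3.
# Rows: actual High, Medium, Standard; columns: predicted High, Medium, Standard.
cm = np.array([
    [2, 2, 1],
    [1, 22, 2],
    [1, 3, 6],
])

def per_class_stats(cm):
    """Return (sensitivity, specificity) arrays, one entry per class."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fn = cm.sum(axis=1) - tp   # cases of this class predicted as another class
    fp = cm.sum(axis=0) - tp   # cases of other classes predicted as this class
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)

sensitivity, specificity = per_class_stats(cm)

# Assumed reading of "average accuracy": mean over classes of (sens + spec) / 2.
average_accuracy = float(np.mean((sensitivity + specificity) / 2))

for name, se, sp in zip(["High", "Medium", "Standard"], sensitivity, specificity):
    print(f"{name}: sensitivity={se:.2f}, specificity={sp:.2f}")
print(f"average accuracy={average_accuracy:.2f}")
```

The Medium specificity works out to 10/15, about 0.667, which the table reports as 0.66; the other values agree with the table to two decimals.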
Figure 6. Accuracy according to the different dimensionality reduction processes using eight-fold cross-validation on the test childhood leukemia data set.
Table 5. Results of the classification performance on the childhood leukemia test data set after reducing the dimensionality.
| | PREDICTED HIGH | PREDICTED MEDIUM | PREDICTED STANDARD |
|---|---|---|---|
| Actual High | 3 | 1 | 1 |
| Actual Medium | 0 | 23 | 2 |
| Actual Standard | 2 | 1 | 7 |
Table 6. Statistics by class for the confusion matrix of the test data set presented in Table 5.
| | HIGH | MEDIUM | STANDARD |
|---|---|---|---|
| Sensitivity | 0.60 | 0.92 | 0.70 |
| Specificity | 0.94 | 0.86 | 0.90 |
Table 7. The parameters for the genetic algorithm for this task.
| NO. OF VARIABLES | BEQ | LOWER | UPPER | POPULATION SIZE | NO. OF GENERATIONS | OTHER PARAMETERS |
|---|---|---|---|---|---|---|
| 50 | 1 | zeros(1,50) | ones(1,50) | 100 | 50 | Default |
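To make the parameter table concrete, here is a minimal real-coded genetic algorithm in Python using the same settings: 50 variables, bounds of 0 and 1, a population of 100, and 50 generations. It is a sketch under stated assumptions, not the authors' implementation. The table's notation suggests a MATLAB-style solver, and `Beq = 1` is interpreted here as a constraint that the weights sum to 1, which is an assumption. The operators (tournament selection, arithmetic crossover, Gaussian mutation, elitism) and the toy fitness function are illustrative choices. Figure 3 pairs the GA with kNN, so in the framework the fitness would presumably be a kNN-based score.

```python
import numpy as np

def run_ga(fitness, n_vars=50, pop_size=100, n_generations=50, seed=0):
    """Maximise `fitness` over weight vectors in [0, 1]^n_vars that sum to 1."""
    rng = np.random.default_rng(seed)

    def repair(pop):
        # Enforce the bounds, then rescale each row to sum to 1 (assumed constraint).
        pop = np.clip(pop, 0.0, 1.0) + 1e-12   # small offset guards against all-zero rows
        return pop / pop.sum(axis=1, keepdims=True)

    pop = repair(rng.random((pop_size, n_vars)))
    history = []
    for _ in range(n_generations):
        scores = np.array([fitness(w) for w in pop])
        best = pop[np.argmax(scores)].copy()
        history.append(scores.max())

        # Tournament selection: the better of two random individuals becomes a parent.
        a, b = rng.integers(pop_size, size=(2, 2 * pop_size))
        parents = np.where((scores[a] >= scores[b])[:, None], pop[a], pop[b])

        # Arithmetic crossover between consecutive parents, then sparse Gaussian mutation.
        alpha = rng.random((pop_size, 1))
        children = alpha * parents[0::2] + (1 - alpha) * parents[1::2]
        mutate = rng.random(children.shape) < 0.1
        children += rng.normal(0.0, 0.05, children.shape) * mutate

        pop = repair(children)
        pop[0] = best   # elitism: carry the best individual over unchanged

    scores = np.array([fitness(w) for w in pop])
    history.append(scores.max())
    return pop[np.argmax(scores)], history

# Toy stand-in fitness (hypothetical): prefer weight concentrated on the first five features.
target = np.zeros(50)
target[:5] = 0.2
weights, history = run_ga(lambda w: -np.sum((w - target) ** 2))
print(weights[:5].round(3), history[0], history[-1])
```

Because the best individual is carried over unchanged each generation, the recorded best fitness never decreases.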
Figure 7. Fitness value for each generation.
Table 8. Results of the classification performance on the childhood leukemia test data set applied on the weighted 5NN classifier.
| | PREDICTED HIGH | PREDICTED MEDIUM | PREDICTED STANDARD |
|---|---|---|---|
| Actual High | 4 | 0 | 1 |
| Actual Medium | 0 | 25 | 0 |
| Actual Standard | 0 | 1 | 9 |
Table 9. Statistics by class for the confusion matrix of the test data set presented in Table 8.
| | STANDARD | MEDIUM | HIGH |
|---|---|---|---|
| Sensitivity | 0.90 | 1.00 | 0.80 |
| Specificity | 0.96 | 0.93 | 1.00 |
Table 10. Results of classification probability on the childhood leukemia test data set.
| ACTUAL | PREDICTED HIGH | PREDICTED MEDIUM | PREDICTED STANDARD |
|---|---|---|---|
| 1 High | 0.4 | 0.2 | 0.4 |
| 2 High | 0.4 | 0.4 | 0.2 |
| 3 High | 0.4 | 0.4 | 0.2 |
| 4 High | 0.8 | 0.2 | 0 |
| 5 High | 0.2 | 0 | 0.8 |
| 6 Medium | 0 | 1 | 0 |
| 7 Medium | 0 | 1 | 0 |
| 8 Medium | 0 | 0.6 | 0.4 |
| 9 Medium | 0 | 0.8 | 0.2 |
| 10 Medium | 0 | 1 | 0 |
| 11 Medium | 0 | 0.6 | 0.4 |
| 12 Medium | 0 | 1 | 0 |
| 13 Medium | 0 | 0.6 | 0.4 |
| 14 Medium | 0 | 0.6 | 0.4 |
| 15 Medium | 0 | 0.6 | 0.4 |
| 16 Medium | 0 | 0.8 | 0.2 |
| 17 Medium | 0 | 1 | 0 |
| 18 Medium | 0 | 0.8 | 0.2 |
| 19 Medium | 0 | 1 | 0 |
| 20 Medium | 0 | 0.8 | 0.2 |
| 21 Medium | 0 | 1 | 0 |
| 22 Medium | 0 | 0.6 | 0.4 |
| 23 Medium | 0 | 1 | 0 |
| 24 Medium | 0 | 1 | 0 |
| 25 Medium | 0 | 1 | 0 |
| 26 Medium | 0 | 0.6 | 0.4 |
| 27 Medium | 0 | 0.6 | 0.4 |
| 28 Medium | 0 | 0.8 | 0.2 |
| 29 Medium | 0 | 0.8 | 0.2 |
| 30 Medium | 0 | 0.8 | 0.2 |
| 31 Standard | 0 | 0.2 | 0.8 |
| 32 Standard | 0 | 0 | 1 |
| 33 Standard | 0 | 0 | 1 |
| 34 Standard | 0 | 0 | 1 |
| 35 Standard | 0.2 | 0.2 | 0.6 |
| 36 Standard | 0.2 | 0.2 | 0.6 |
| 37 Standard | 0.2 | 0.2 | 0.6 |
| 38 Standard | 0.2 | 0.2 | 0.6 |
| 39 Standard | 0.2 | 0.4 | 0.4 |
| 40 Standard | 0.2 | 0.4 | 0.4 |
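The probabilities in the table above are all multiples of 0.2, which is consistent with each value being the fraction of the five nearest neighbours that belong to a class. The sketch below illustrates that reading with a feature-weighted Euclidean distance. It is an assumption about how such probabilities could be produced, not the authors' code. The function name `knn_class_probabilities` and the toy data are hypothetical.

```python
import numpy as np

def knn_class_probabilities(X_train, y_train, x, weights, classes, k=5):
    """Fraction of the k nearest training cases that belong to each class."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Feature-weighted Euclidean distance from the query to every training case.
    diff = X_train - np.asarray(x, dtype=float)
    dist = np.sqrt((np.asarray(weights, dtype=float) * diff ** 2).sum(axis=1))
    nearest = np.argsort(dist, kind="stable")[:k]
    votes = y_train[nearest]
    return {c: float(np.mean(votes == c)) for c in classes}

# Toy data (hypothetical): two features, seven training cases.
X_train = [[0.0, 0], [0.1, 0], [0.2, 0], [1.0, 0], [1.1, 0], [5.0, 5], [5.1, 5]]
y_train = ["High", "High", "High", "Medium", "Medium", "Standard", "Standard"]
probs = knn_class_probabilities(
    X_train, y_train, x=[0.3, 0.0], weights=[1.0, 1.0],
    classes=["High", "Medium", "Standard"],
)
print(probs)
```

With k = 5, each class probability can only take the values 0, 0.2, 0.4, 0.6, 0.8, or 1, and the three values for one case sum to 1.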
Table 11. Classification performance on the childhood leukemia training data set after 100% oversampling.
| | PREDICTED HIGH | PREDICTED MEDIUM | PREDICTED STANDARD |
|---|---|---|---|
| Actual High | 18 | 0 | 0 |
| Actual Medium | 0 | 53 | 0 |
| Actual Standard | 0 | 0 | 33 |
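A later table caption names SMOTE as the oversampling technique. The following is a minimal sketch of SMOTE-style interpolation, written from the general description of the method rather than from the authors' code. The function name `smote`, its parameters, the 50-feature dimensionality, and the random stand-in data are assumptions for illustration. Generating two synthetic cases per original case turns 6 High-risk training cases into 18, matching the count in the table above.

```python
import numpy as np

def smote(X_minority, n_new_per_sample=2, k=5, seed=0):
    """Generate synthetic minority samples by interpolating towards near neighbours."""
    X = np.asarray(X_minority, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(X)
    k = min(k, n - 1)   # cannot use more neighbours than there are other samples

    # Pairwise Euclidean distances within the minority class.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)   # a sample is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = []
    for i in range(n):
        for _ in range(n_new_per_sample):
            j = rng.choice(neighbours[i])   # pick one of the k nearest neighbours
            gap = rng.random()              # position along the segment, in [0, 1)
            synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)

# Stand-in data (hypothetical): 6 High-risk cases with 50 features each.
X_high = np.random.default_rng(1).normal(size=(6, 50))
synthetic = smote(X_high)
oversampled = np.vstack([X_high, synthetic])
print(oversampled.shape)
```

Each synthetic case lies on the line segment between an original case and one of its nearest same-class neighbours, so no synthetic value falls outside the observed range of its feature.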
Table 12. Results of classification performance on the childhood leukemia test data set after 100% oversampling of the training data set.
| | PREDICTED HIGH | PREDICTED MEDIUM | PREDICTED STANDARD |
|---|---|---|---|
| Actual High | 4 | 0 | 1 |
| Actual Medium | 0 | 25 | 0 |
| Actual Standard | 0 | 0 | 10 |
Table 13. Results of classification probability on the childhood leukemia test data set after application of SMOTE.
| ACTUAL | PREDICTED HIGH | PREDICTED MEDIUM | PREDICTED STANDARD |
|---|---|---|---|
| 1 High | 0.8 | 0.2 | 0 |
| 2 High | 1 | 0 | 0 |
| 3 High | 1 | 0 | 0 |
| 4 High | 0.8 | 0.2 | 0 |
| 5 High | 0.4 | 0 | 0.6 |
| 6 Medium | 0 | 1 | 0 |
| 7 Medium | 0 | 1 | 0 |
| 8 Medium | 0 | 0.8 | 0.2 |
| 9 Medium | 0 | 0.8 | 0.2 |
| 10 Medium | 0 | 1 | 0 |
| 11 Medium | 0 | 0.6 | 0.4 |
| 12 Medium | 0 | 1 | 0 |
| 13 Medium | 0 | 0.8 | 0.2 |
| 14 Medium | 0 | 1 | 0 |
| 15 Medium | 0 | 0.6 | 0.4 |
| 16 Medium | 0 | 0.8 | 0.2 |
| 17 Medium | 0 | 1 | 0 |
| 18 Medium | 0 | 0.8 | 0.2 |
| 19 Medium | 0 | 1 | 0 |
| 20 Medium | 0 | 0.8 | 0.2 |
| 21 Medium | 0 | 1 | 0 |
| 22 Medium | 0 | 0.6 | 0.4 |
| 23 Medium | 0 | 1 | 0 |
| 24 Medium | 0 | 1 | 0 |
| 25 Medium | 0 | 1 | 0 |
| 26 Medium | 0 | 0.6 | 0.4 |
| 27 Medium | 0 | 0.6 | 0.4 |
| 28 Medium | 0 | 1 | 0 |
| 29 Medium | 0 | 1 | 0 |
| 30 Medium | 0 | 0.8 | 0.2 |
| 31 Standard | 0 | 0.2 | 0.8 |
| 32 Standard | 0 | 0 | 1 |
| 33 Standard | 0 | 0 | 1 |
| 34 Standard | 0 | 0 | 1 |
| 35 Standard | 0.2 | 0 | 0.8 |
| 36 Standard | 0 | 0.2 | 0.8 |
| 37 Standard | 0.1 | 0.1 | 0.8 |
| 38 Standard | 0.2 | 0.2 | 0.6 |
| 39 Standard | 0.2 | 0.2 | 0.6 |
| 40 Standard | 0.4 | 0 | 0.6 |
Table 14. Average balanced accuracy results of the three public microarray data sets processed by the case-based retrieval framework.
| DATASETS | FS/NN | FS/DR/NN | FS/DR/FW/NN |
|---|---|---|---|
| NCI | 0.68 | 0.88 | 0.95 |
| Colon | 0.86 | 0.93 | – |
| Prostate | 0.90 | 0.98 | – |