Harsh Saini1, Sunil Pranit Lal2, Vimal Vikash Naidu3, Vincel Wince Pickering3, Gurmeet Singh3, Tatsuhiko Tsunoda4,5,6, Alok Sharma3,7,8,9.
Abstract
BACKGROUND: A high-dimensional feature space generally degrades classification accuracy in many applications. In this paper, we propose a strategy called gene masking, in which non-contributing dimensions are heuristically removed from the data to improve classification accuracy.
Year: 2016 PMID: 28117659 PMCID: PMC5260793 DOI: 10.1186/s12920-016-0233-2
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 Illustration of gene masking on the original dataset to produce a masked dataset
Fig. 2 Flowchart depicting the relation between the Genetic Algorithm and the classifier in gene masking, where the best chromosome represents the best gene mask discovered
Fig. 3 Illustration of fitness evaluation with gene masking. Cross-validation is performed using a classifier, and the average accuracy is used for fitness calculation
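The masking and fitness evaluation described in the figure captions can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the plain nearest-centroid classifier, the fold count `k=3`, and the toy data are all assumptions, and any extra weighting the paper applies to the fitness value is not reproduced here.

```python
import random
from statistics import mean

def apply_mask(samples, mask):
    """Keep only the features (genes) whose mask bit is 1."""
    return [[x for x, bit in zip(row, mask) if bit] for row in samples]

def nearest_centroid_predict(train_X, train_y, test_X):
    """Assign each test row to the class with the closest centroid."""
    centroids = {}
    for label in set(train_y):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def dist(a, b):
        # Squared Euclidean distance
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return [min(centroids, key=lambda c: dist(row, centroids[c]))
            for row in test_X]

def fitness(mask, X, y, k=3, seed=0):
    """Mean k-fold cross-validation accuracy on the masked data.

    Assumption: fitness is the average accuracy only.
    """
    if not any(mask):
        return 0.0  # an empty mask leaves nothing to classify on
    Xm = apply_mask(X, mask)
    idx = list(range(len(Xm)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    accuracies = []
    for fold in folds:
        if not fold:
            continue
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        preds = nearest_centroid_predict(
            [Xm[i] for i in train], [y[i] for i in train],
            [Xm[i] for i in fold])
        correct = sum(p == y[i] for p, i in zip(preds, fold))
        accuracies.append(correct / len(fold))
    return mean(accuracies)

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X_toy = [[0.0, 5.0], [0.1, -5.0], [0.2, 4.0],
         [1.0, -4.0], [1.1, 5.0], [0.9, -5.0]]
y_toy = [0, 0, 0, 1, 1, 1]
print(fitness([1, 0], X_toy, y_toy))  # mask keeping only the informative feature
print(fitness([0, 1], X_toy, y_toy))  # mask keeping only the noisy feature
```

In a genetic algorithm, each chromosome would play the role of `mask`, and `fitness` would be called once per chromosome per generation.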
Genetic algorithm parameters
| Parameter | Value |
|---|---|
| GA type | Binary |
| Population size | 105 |
| Chromosome length | 2308 |
| No. of generations | 50000 |
| Selection function | Roulette wheel |
| Crossover rate | 0.85 |
| Mutation rate | 0.10 |
| Elite conservation | Yes, num_elite=1 |
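As a rough sketch of how the listed parameters fit together, the loop below combines roulette-wheel selection, single-point crossover, bit-flip mutation, and one elite. Several details are assumptions not stated in the table: the crossover is single-point, the mutation rate is treated as a per-chromosome probability of flipping one random bit, and the default population size and generation count are scaled down so the example runs quickly. `fitness_fn` is a hypothetical callable that scores a binary mask.

```python
import random

def roulette_select(population, scores, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    total = sum(scores)
    if total == 0:
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for chrom, score in zip(population, scores):
        running += score
        if running > pick:
            return chrom
    return population[-1]

def evolve(fitness_fn, length, pop_size=20, generations=30,
           crossover_rate=0.85, mutation_rate=0.10, num_elite=1, seed=0):
    """Binary GA with elitism; returns (best_mask, best_score)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness_fn(c) for c in population]
        ranked = sorted(zip(scores, population),
                        key=lambda t: t[0], reverse=True)
        # Elite conservation: copy the best chromosome(s) unchanged.
        next_gen = [list(c) for _, c in ranked[:num_elite]]
        while len(next_gen) < pop_size:
            child = list(roulette_select(population, scores, rng))
            mate = roulette_select(population, scores, rng)
            # Assumption: single-point crossover.
            if length > 1 and rng.random() < crossover_rate:
                cut = rng.randint(1, length - 1)
                child = child[:cut] + list(mate[cut:])
            # Assumption: per-chromosome mutation that flips one random bit.
            if rng.random() < mutation_rate:
                pos = rng.randrange(length)
                child[pos] = 1 - child[pos]
            next_gen.append(child)
        population = next_gen
    scores = [fitness_fn(c) for c in population]
    best = max(range(pop_size), key=lambda i: scores[i])
    return population[best], scores[best]

# Toy fitness (hypothetical): fraction of bits set to 1.
def toy_fitness(chrom):
    return sum(chrom) / len(chrom)

best_mask, best_score = evolve(toy_fitness, length=12)
print(best_mask, best_score)
```

With the values in the table, `length` would be 2308 and the fitness function would be a cross-validated classifier accuracy rather than the toy function used here.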
Parameter tuning and selection method used in this study
| Parameter tuning and selection |
|---|
| Let |
| Let |
| Let |
| Let |
| Define the GA parameters apart from |
| Define |
| Define |
| Define |
| For each combination of { |
| - Perform |
| - Report the results obtained by the best performing |
| - Repeat for 10 iterations |
| Select the best performing combination of { |
Gene masking and NSCC performance on SRBCT test set with different values for Δ with α = 0.9
| Δ | Genes left after shrinkage | Genes left after masking | Test accuracy |
|---|---|---|---|
| 3 | 343 | 36 | 0.9 |
| 3.5 | 280 | 23 | 0.95 |
| 4 | 235 | 21 | 0.95 |
| 4.5 | 208 | 15 | 0.9 |
| 5 | 174 | 14 | 0.95 |
| 5.5 | 158 | 14 | 0.95 |
| 6 | 135 | 12 | 0.95 |
| 6.5 | 124 | 15 | 1 |
| 7 | 112 | 16 | 1 |
| 7.5 | 102 | 13 | 1 |
| 8 | 90 | 17 | 1 |
| 8.5 | 80 | 20 | 1 |
| 9 | 72 | 19 | 1 |
| 9.5 | 65 | 18 | 0.95 |
| 10 | 61 | 14 | 0.8 |
| 10.5 | 54 | 15 | 0.75 |
| 11 | 48 | 12 | 0.75 |
| 11.5 | 42 | 13 | 0.8 |
| 12 | 41 | 10 | 0.8 |
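The falling count of genes left after shrinkage as Δ grows is what soft thresholding, the shrinkage step of the nearest shrunken centroid method, would produce. The sketch below illustrates that step only, under assumptions: `class_offsets` is a hypothetical matrix of already standardised differences between each class centroid and the overall centroid, and the standardisation itself is omitted.

```python
def soft_threshold(d, delta):
    """Shrink d toward zero by delta; values within delta of zero become 0."""
    magnitude = abs(d) - delta
    if magnitude <= 0:
        return 0.0
    return magnitude if d > 0 else -magnitude

def genes_surviving(class_offsets, delta):
    """Indices of genes whose shrunken offset is non-zero for some class.

    class_offsets[k][g] is the (hypothetical, pre-standardised) offset of
    class k's centroid from the overall centroid for gene g.
    """
    n_genes = len(class_offsets[0])
    return [g for g in range(n_genes)
            if any(soft_threshold(row[g], delta) != 0.0
                   for row in class_offsets)]

# Toy offsets (hypothetical): two classes, three genes.
offsets = [[5.0, -1.0, 3.2],
           [-4.0, 0.5, -3.4]]
for delta in (3.0, 4.0, 6.0):
    print(delta, genes_surviving(offsets, delta))
```

A larger Δ zeroes out more offsets, so fewer genes reach the masking stage.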
Comparison of performance of NCC and NSCC with gene masking
| | NCC | NSCC |
|---|---|---|
| Number of genes remaining | 1637 | 13 |
| Training accuracy | 100% | 100% |
| Test accuracy | 100% | 100% |
Comparison of performance of similar techniques
| Method (Classifier) | Number of genes | Accuracy |
|---|---|---|
| PCA, MLP, Neural Network [ | 96 | 100% |
| Nearest Shrunken Centroid [ | 43 | 100% |
| Information gain + SVM [ | 150 | 95% |
| Twoing rule + SVM [ | 150 | 95% |
| Sum minority + SVM [ | 150 | 95% |
| Max minority + SVM [ | 150 | 91% |
| Gini index + SVM [ | 150 | 95% |
| Sum of variances + SVM [ | 150 | 95% |
| t-statistics + SVM [ | 150 | 95% |
| One-dimensional SVM + SVM [ | 150 | 95% |
| Information gain + LDA with NCC [ | 4 | 70% |
| Chi-squared + NNC [ | 4 | 70% |
| Gain Ratio + NNC [ | 4 | 85% |
| Gene masking + ANN [ | 13 | 100% |
| Gene masking + NCC (this paper) | 650 | 100% |
| Gene masking + NSCC (this paper) | 13 | 100% |
The 13 genes selected via gene masking with their relative occurrence in other solutions
| Image ID | Name | Percentage occurrence | In [ | In [ | In [ |
|---|---|---|---|---|---|
| 39093 | methionine aminopeptidase; eIF-2-associated p67 | 42.86% | No | Yes | No |
| 365826 | growth arrest-specific 1 | 100% | No | Yes | No |
| 1416782 | creatine kinase, brain | 100% | No | Yes | No |
| 461425 | myosin MYL4 | 71.43% | Yes | Yes | No |
| 810057 | cold shock domain protein A | 100% | Yes | No | No |
| 866702 | protein tyrosine phosphatase, non-receptor type 13 (APO-1/CD95 (Fas)-associated phosphatase) | 57.14% | Yes | Yes | Yes |
| 854899 | dual specificity phosphatase 6 | 28.57% | No | Yes | No |
| 629896 | microtubule-associated protein 1B | 71.43% | No | Yes | Yes |
| 214572 | ESTs | 100% | No | No | No |
| 208718 | annexin A1 | 100% | No | Yes | No |
| 784224 | fibroblast growth factor receptor | 100% | Yes | Yes | No |
| 204545 | ESTs | 57.14% | Yes | Yes | No |
| 295985 | ESTs | 100% | Yes | Yes | No |
A summary of performance of gene masking with NSCC on MLL Leukemia and Lung Cancer datasets
| Dataset | Genes remaining | Test accuracy |
|---|---|---|
| MLL Leukemia | 94 | 100% |
| Lung Cancer | 90 | 100% |