Hassan Tariq, Elf Eldridge, Ian Welch.
Abstract
Dimensionality reduction of microarray data is a very challenging task because of the high computational time and the large amount of memory required to train and test a model. Genetic programming (GP) is a stochastic approach to problem solving. On high-dimensional datasets, GP does not perform as well as other machine learning algorithms. To exploit the inherent ability of GP to generalize models from low-dimensional data, we need to consider dimensionality reduction approaches. Random projections (RPs) have gained attention for reducing the dimensionality of data at a lower computational cost than other dimensionality reduction approaches. We report that features constructed from RPs perform extremely well when combined with a GP approach. We used eight datasets, seven of which have not previously been reported in any machine learning research. We also compared our results against decision trees, random forest, naive Bayes, support vector machines and k-nearest neighbor methods, using the same full and constructed feature sets.
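To make the idea of constructing features by random projection concrete, here is a minimal sketch in plain Python. It assumes a Gaussian projection matrix with entries drawn from N(0, 1/k); the abstract does not say which RP variant was used, so that choice, the function names `gaussian_projection_matrix` and `project`, and the toy data are all illustrative assumptions rather than the authors' method.

```python
import random

def gaussian_projection_matrix(n_features, n_components, seed=0):
    """Build an n_features x n_components matrix with N(0, 1/n_components) entries.

    The Gaussian variant is an assumption; the source does not name one.
    """
    rng = random.Random(seed)
    scale = 1.0 / n_components ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(n_components)]
            for _ in range(n_features)]

def project(samples, matrix):
    """Map each sample (length n_features) to n_components constructed features."""
    n_components = len(matrix[0])
    projected = []
    for sample in samples:
        row = [0.0] * n_components
        # Accumulate the matrix product one original feature at a time.
        for value, weights in zip(sample, matrix):
            for j, weight in enumerate(weights):
                row[j] += value * weight
        projected.append(row)
    return projected

# Hypothetical stand-in for a microarray matrix: 6 samples x 200 features.
data_rng = random.Random(1)
toy_samples = [[data_rng.random() for _ in range(200)] for _ in range(6)]

matrix = gaussian_projection_matrix(n_features=200, n_components=50)
constructed = project(toy_samples, matrix)
print(len(constructed), len(constructed[0]))
```

Each constructed feature is a random linear combination of all original features, so the number of samples is unchanged while the feature count drops to the chosen number of components (50, 100 or 150 in the tables of this record).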
Year: 2018 PMID: 29702670 PMCID: PMC5922581 DOI: 10.1371/journal.pone.0196385
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. 10-fold cross-validation for GP.
Training-set and test-set performance evaluations are reported in Tables 3 and 4, respectively. Performance was measured in each GP run for each fold and used to calculate mean accuracies and standard deviations over the 10 folds.
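The evaluation loop described in this caption can be sketched as follows. This is only an illustration under stated assumptions: the folds are unstratified, the spread is the sample standard deviation, and `evaluate_fold` is a hypothetical callback standing in for training and scoring a model (a majority-class baseline here, not GP). None of these details are given in the source.

```python
import random
import statistics

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, evaluate_fold, k=10, seed=0):
    """Return (mean, sample standard deviation) of the per-fold accuracies."""
    folds = k_fold_indices(n_samples, k, seed)
    accuracies = []
    for held_out, test_idx in enumerate(folds):
        # Every fold except the held-out one forms the training set.
        train_idx = [i for f, fold in enumerate(folds) if f != held_out
                     for i in fold]
        accuracies.append(evaluate_fold(train_idx, test_idx))
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Hypothetical binary labels; not taken from any dataset in the paper.
toy_labels = [0] * 35 + [1] * 23

def majority_baseline(train_idx, test_idx):
    """Placeholder model: predict the most common training label."""
    train_labels = [toy_labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for i in test_idx if toy_labels[i] == majority)
    return 100.0 * hits / len(test_idx)

mean_acc, sd_acc = cross_validate(len(toy_labels), majority_baseline)
print(f"{mean_acc:.2f} ± {sd_acc:.2f}")
```

Swapping `majority_baseline` for a function that trains a classifier on `train_idx` and scores it on `test_idx` yields figures in the same "mean ± standard deviation" form used in the accuracy tables.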
Test set accuracies of GP and machine learning algorithms.
| Dataset | Features | GP | DT | NB | KNNs | SVMs | RF |
|---|---|---|---|---|---|---|---|
| Adenocarcinomas | 54675 | 83 ± 15 | 83 ± 12.9 | 87.67 ± 11 | 89.67 ± 13.5 | 96.67 ± 6.67 | 89.67 ± 9.2 |
| Oral Mucosa | 54675 | 62 ± 16 | 77.32 ± 12 | 74.82 ± 13.5 | 72.14 ± 9.3 | 82.5 ± 15 | 76.78 ± 16 |
| B-Cells | 22283 | 69 ± 16 | 80 ± 15 | 83.75 ± 13.7 | 72.5 ± 10.8 | 91.25 ± 9.7 | 91.25 ± 11.25 |
| Placenta | 11155 | 74 ± 11 | 71.24 ± 15.6 | 73.92 ± 14.5 | 78.92 ± 10.6 | 63.21 ± 11.5 | 81.6 ± 6.2 |
| Melanoma | 22283 | 86.8 ± 11 | 84.16 ± 11 | 92.91 ± 8 | 87.63 ± 7.9 | 92.78 ± 8 | 95.27 ± 5.8 |
| Breast cancer | 24482 | 57.2 ± 16 | 63.11 ± 18.8 | 54.66 ± 4.1 | 57.44 ± 19 | 68.11 ± 14.9 | 71.11 ± 11 |
| Skeletal Muscle | 54675 | 79.39 ± 13 | 86.36 ± 10 | 92.72 ± 7.9 | 90 ± 9.4 | 96.36 ± 7.2 | 96.36 ± 8.1 |
| Osteoarthritis | 48802 | 73 ± 11 | 86.2 ± 8.5 | 57.58 ± 20 | 76.9 ± 4.2 | 92.08 ± 5.9 | 78.4 ± 5.5 |
Comparison of random projections based feature construction.
| Dataset | Features | GP | DT | NB | KNNs | SVMs | RF |
|---|---|---|---|---|---|---|---|
| Adenocarcinomas (58) | 50 | 99.83 ± 0.009 | 73.33 ± 1.6 | 87.66 ± 3 | 86 ± 1.3 | 91.33 ± 0.7 | 77.67 ± 1.5 |
| | 100 | 98.46 ± 0.08 | 76.33 ± 3.3 | 90 ± 3.3 | 93.3 ± 1.0 | 91.67 ± 1.2 | 89.67 ± 1.5 |
| | 150 | 97.14 ± 0.9 | 86 ± 2.7 | 90 ± 3.3 | 91.67 ± 1.3 | 95 ± 1.6 | 89.67 ± 1.5 |
| Oral Mucosa (79) | 50 | 99.95 ± 0.002 | 79.82 ± 1.8 | 74.82 ± 3.44 | 57.14 ± 6.6 | 81 ± 1.4 | 78.3 ± 2.2 |
| | 100 | 98.57 ± 0.08 | 70.89 ± 0.16 | 69.46 ± 3.8 | 62.14 ± 2.1 | 77.32 ± 2.6 | 81.25 ± 5.9 |
| | 150 | 96.93 ± 0.17 | 68.39 ± 0.95 | 65.71 ± 2.7 | 64.64 ± 1.3 | 81.75 ± 1.4 | 78.75 ± 6.7 |
| B-Cells (79) | 50 | 99.41 ± 0.03 | 77.32 ± 2.6 | 74.82 ± 3.44 | 82.32 ± 2.2 | 85 ± 4.7 | 82 ± 5.5 |
| | 100 | 97.28 ± 0.15 | 72.32 ± 4.2 | 76.07 ± 3.0 | 77.57 ± 2.2 | 88.75 ± 3.5 | 83.75 ± 5.1 |
| | 150 | 96.59 ± 0.19 | 71.25 ± 9 | 78.57 ± 2.2 | 76.25 ± 6.7 | 87.5 ± 3.9 | 83.75 ± 5.1 |
| Placenta (76) | 50 | 99.91 ± 0.005 | 77.49 ± 2.5 | 68.75 ± 9.8 | 70.71 ± 4.7 | 84.28 ± 0.45 | 84.28 ± 0.45 |
| | 100 | 99.30 ± 0.03 | 73.39 ± 5.1 | 67.5 ± 10.7 | 69.46 ± 5.1 | 84.28 ± 0.45 | 84.28 ± 0.45 |
| | 150 | 97.95 ± 0.11 | 70.71 ± 4.2 | 68.75 ± 9.8 | 69.46 ± 5.1 | 84.28 ± 0.45 | 81.6 ± 1.2 |
| Melanoma (83) | 50 | 97.64 ± 0.13 | 88.19 ± 4.1 | 94.16 ± 1.8 | 95.13 ± 2.4 | 91.67 ± 1.3 | 96.52 ± 1.0 |
| | 100 | 97.06 ± 0.16 | 84.3 ± 2.9 | 92.77 ± 1.67 | 95.13 ± 1.53 | 96.3 ± 2.8 | 97.77 ± 0.7 |
| | 150 | 96.37 ± 0.51 | 85.5 ± 3.3 | 95.2 ± 1.4 | 96.3 ± 1.14 | 97.63 ± 0.7 | 97.77 ± 0.7 |
| Breast cancer (97) | 50 | 97.87 ± 0.12 | 52.77 ± 0.8 | 49.44 ± 1.9 | 43.22 ± 0.3 | 54.66 ± 0.2 | 56 ± 0.14 |
| | 100 | 96.75 ± 0.18 | 52.55 ± 2.5 | 49.33 ± 1.9 | 47.22 ± 11 | 51.55 ± 1.2 | 55.88 ± 0.1 |
| | 150 | 96.90 ± 0.17 | 51.67 ± 2.2 | 51.44 ± 1.3 | 52.1 ± 2.4 | 52.55 ± 0.9 | 57 ± 0.45 |
| Skeletal Muscle (110) | 50 | 99.24 ± 0.04 | 65.45 ± 2.2 | 69.09 ± 7.4 | 74.54 ± 0.57 | 89.09 ± 2.2 | 83.63 ± 0.5 |
| | 100 | 98.69 ± 0.07 | 71.81 ± 5.4 | 66.36 ± 2.0 | 83.63 ± 5.1 | 98.18 ± 0.57 | 82.72 ± 0.28 |
| | 150 | 98.27 ± 0.09 | 63.36 ± 0.8 | 68.18 ± 1.43 | 85.45 ± 1.72 | 97.27 ± 2.0 | 81.81 ± 2.8 |
| Osteoarthritis (139) | 50 | 99.90 ± 0.005 | 71.97 ± 1.5 | 72.74 ± 3.7 | 78.4 ± 0.46 | 86.97 ± 3.1 | 78.46 ± 1.9 |
| | 100 | 99.23 ± 0.04 | 70 ± 9.4 | 69.12 ± 2.4 | 84.23 ± 2.5 | 90.65 ± 0.52 | 81.31 ± 1.04 |
| | 150 | 98.73 ± 0.07 | 82.03 ± 0.8 | 67.69 ± 2.9 | 83.51 ± 2.7 | 94.23 ± 0.68 | 77.74 ± 2.17 |
GP settings.
| Parameter | Value |
|---|---|
| Function set | +, -, ×, ÷ |
| Terminal set | Feature values |
| Initialization method | Ramped half-and-half |
| Tree depth | 2–8 |
| Crossover probability | 0.8 |
| Mutation probability | 0.2 |
| Selection method | Tournament |
| Tournament size | 7 |
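As a rough illustration of how the settings above might be carried in code, the sketch below stores them in a small configuration object and implements two of the pieces they name. Several details are assumptions not stated in the table: division is written as "protected" division returning 1.0 on a zero divisor (a common GP convention), "2–8" is read as a minimum and maximum tree depth, and the names `GPSettings`, `protected_div` and `tournament_select` are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class GPSettings:
    """Values copied from the settings table; field names are illustrative."""
    function_set: tuple = ("+", "-", "*", "/")
    min_depth: int = 2          # assumed reading of "Tree depth 2-8"
    max_depth: int = 8
    crossover_prob: float = 0.8
    mutation_prob: float = 0.2
    tournament_size: int = 7

def protected_div(a, b):
    """Divide a by b, returning 1.0 when b is zero (assumed convention)."""
    return a / b if b != 0 else 1.0

def tournament_select(population, fitness, size, rng):
    """Draw `size` distinct individuals at random and return the fittest."""
    contenders = rng.sample(population, size)
    return max(contenders, key=fitness)

settings = GPSettings()
rng = random.Random(0)

# Hypothetical population: integers stand in for evolved program trees,
# and an individual's fitness is simply its own value.
population = list(range(100))
winner = tournament_select(
    population,
    fitness=lambda individual: individual,
    size=settings.tournament_size,
    rng=rng,
)
print(winner)
```

A larger tournament size increases selection pressure, because the best of seven random contenders is more likely to be a highly fit individual than the best of two or three.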
Training set accuracies of GP and machine learning algorithms.
| Dataset | Features | GP | DT | NB | KNNs | SVMs | RF |
|---|---|---|---|---|---|---|---|
| Adenocarcinomas (58) | 54675 | 97.4 ± 2.2 | 99.04 ± 0.9 | 94.82 ± 1.2 | 100 ± 0 | 100 ± 0 | 100 ± 0 |
| Oral Mucosa (79) | 54675 | 84.5 ± 3.8 | 99.57 ± 0.9 | 98.45 ± 0.7 | 87.76 ± 1.3 | 100 ± 0 | 100 ± 0 |
| B-Cells (79) | 22283 | 89.8 ± 3.2 | 100 ± 0 | 99.71 ± 0.5 | 80.45 ± 2.9 | 100 ± 0 | 100 ± 0 |
| Placenta (76) | 11155 | 83.3 ± 4.3 | 91.52 ± 4.7 | 80.55 ± 2.6 | 86.69 ± 1.9 | 96.93 ± 2.1 | 100 ± 0 |
| Melanoma (83) | 22283 | 97.3 ± 1.5 | 99.19 ± 0.65 | 100 ± 0 | 92.5 ± 1.6 | 100 ± 0 | 100 ± 0 |
| Breast cancer (97) | 24482 | 86 ± 3 | 98.85 ± 0.72 | 55.9 ± 3.6 | 75.25 ± 2.5 | 100 ± 0 | 100 ± 0 |
| Skeletal Muscle (110) | 54675 | 91 ± 4.4 | 99.39 ± 0.8 | 99.09 ± 0.5 | 96.36 ± 0.6 | 100 ± 0 | 100 ± 0 |
| Osteoarthritis (139) | 48802 | 86.28 ± 2.6 | 99.43 ± 0.6 | 71 ± 9.2 | 79.13 ± 1.2 | 100 ± 0 | 100 ± 0 |
Fig 2. Comparison of GP performance on the test dataset for 50 (F50), 100 (F100) and 150 (F150) features constructed by random projections and for the full feature set.