Nathaniel M Crabtree, Jason H Moore, John F Bowyer, Nysia I George.
Abstract
BACKGROUND: A computational evolution system (CES) is a knowledge discovery engine that can identify subtle, synergistic relationships in large datasets. Pareto optimization allows CESs to balance accuracy with model complexity when evolving classifiers. Using Pareto optimization, a CES can identify a very small number of features while maintaining high classification accuracy. A CES can be designed for various types of data, and the user can exploit expert knowledge about the classification problem to improve discrimination between classes. These characteristics give CESs an advantage over other classification and feature selection algorithms, particularly when the goal is to identify a small number of highly relevant, non-redundant biomarkers. Previously, CESs have been developed only for binary-class datasets. In this study, we developed a multi-class CES.
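The trade-off between accuracy and model complexity described above can be illustrated with a small Pareto-front filter. This is a minimal sketch, not the CES implementation: the function names `dominates` and `pareto_front` and the candidate values are hypothetical, and each classifier is reduced to an `(accuracy, n_features)` pair in which accuracy is maximized and the feature count is minimized.

```python
def dominates(p, q):
    """Return True if classifier p dominates classifier q.

    p and q are (accuracy, n_features) tuples. p dominates q when it is
    no worse on both objectives and strictly better on at least one.
    """
    no_worse = p[0] >= q[0] and p[1] <= q[1]
    strictly_better = p[0] > q[0] or p[1] < q[1]
    return no_worse and strictly_better


def pareto_front(models):
    """Keep only the classifiers that no other classifier dominates."""
    return [m for m in models if not any(dominates(o, m) for o in models)]


# Hypothetical candidates: (accuracy, number of features used).
candidates = [(0.91, 6), (0.92, 9), (0.85, 6), (0.92, 12), (0.80, 3)]
front = pareto_front(candidates)
```

In this toy example, `(0.85, 6)` is discarded because `(0.91, 6)` is more accurate with the same number of features, and `(0.92, 12)` is discarded because `(0.92, 9)` matches its accuracy with fewer features.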
Keywords: Artificial intelligence; Biomarker discovery; Classification; Data mining; Evolutionary algorithm; Feature selection; Genetic programming; Machine learning; Multi-class
Year: 2017 PMID: 28450890 PMCID: PMC5404302 DOI: 10.1186/s13040-017-0134-8
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig. 1 Differences between the binary and multi-class CES classification algorithms. Description of data: The binary-class CES classification algorithm was generalized to the multi-class problem using flexible data structures and by sorting each class according to the mean of the median test value. Example test values are provided
The characteristics of the three datasets
| Dataset | Subset | #Samples | #Features | #Classes |
|---|---|---|---|---|
| Rat blood | Full dataset | 73 | 12,549 | 4 |
| Rat blood | Immune-related genes | 73 | 227 | 4 |
| Human cancer | | 304 | 19,955 | 6 |
| Human lymphoblastoid | | 465 | 23,722 | 5 |
Rat blood mRNA expression 10 rep, 5-fold CV
| Algorithm | Accuracy | Tanimoto Distance | Number of Selected Genes |
|---|---|---|---|
| CES | 0.9113 | 0.2625 | 6 |
| RF | 0.8493 | 0.2406 | 453 |
| RKNN | 0.7645 | 0.0798 | 30 |
| SVM | 0.7034 | 0.2786 | 2 |
| SVM | 0.7858 | 0.2061 | 4 |
| SVM | 0.8031 | 0.2230 | 8 |
| SVM | 0.8284 | 0.3299 | 16 |
| SVM | 0.8438 | 0.3795 | 32 |
| SVM | 0.8428 | 0.4268 | 64 |
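The Tanimoto column summarizes how consistently an algorithm selects the same genes across cross-validation runs. The exact formula is not given in this record, so the sketch below is an assumption: it uses the Tanimoto (Jaccard) overlap $|A \cap B| / |A \cup B|$ averaged over all pairs of selected gene sets, which matches the way the reported values behave (in the human cancer results, RKNN keeps 19,381 of 19,955 genes and scores 0.9441). The function names and the three example gene sets are hypothetical.

```python
from itertools import combinations


def tanimoto_overlap(a, b):
    """Tanimoto (Jaccard) overlap of two feature sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty selections are identical
    return len(a & b) / len(union)


def mean_pairwise_overlap(selections):
    """Average the overlap over every unordered pair of feature sets."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two feature sets")
    return sum(tanimoto_overlap(a, b) for a, b in pairs) / len(pairs)


# Hypothetical selections from three cross-validation folds.
fold_selections = [
    {"Stip1", "Enkur", "Pea15a"},
    {"Stip1", "Enkur", "Tpi1"},
    {"Stip1", "Bst2", "Pea15a"},
]
stability = mean_pairwise_overlap(fold_selections)
```

Under this definition, a value near 1 means nearly identical gene sets in every run, which an algorithm can also reach trivially by keeping almost every gene, so the score is best read alongside the number of selected genes.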
Human cancer 10 rep, 5-fold CV
| Algorithm | Accuracy | Tanimoto Distance | Number of Selected Genes |
|---|---|---|---|
| CES | 0.8506 | 0.0957 | 8 |
| RF | 0.9933 | 0.2737 | 8188 |
| RKNN | 0.9927 | 0.9441 | 19381 |
| SVM | 0.8853 | 0.2785 | 4 |
| SVM | 0.9803 | 0.4454 | 8 |
| SVM | 0.9980 | 0.4739 | 16 |
| SVM | 0.9987 | 0.5215 | 32 |
| SVM | 1 | 0.5774 | 64 |
Human lymphoblastoid 10 rep, 5-fold CV
| Algorithm | Accuracy | Tanimoto Distance | Number of Selected Genes |
|---|---|---|---|
| CES | 0.5468 | 0.2017 | 7 |
| RF | 0.8678 | 0.1795 | 4417 |
| RKNN | 0.5048 | 0.1539 | 50 |
| SVM | 0.4439 | 0.2382 | 4 |
| SVM | 0.5136 | 0.2921 | 8 |
| SVM | 0.5795 | 0.2678 | 16 |
| SVM | 0.6507 | 0.2846 | 32 |
| SVM | 0.7547 | 0.3441 | 64 |
Fig. 2 A summary of classification accuracy for each dataset. Description of data: Classification accuracy was computed for each of the 50 testing datasets resulting from 10 rep, 5-fold cross validation. We report the average across all evaluation datasets. Metrics are reported for CES, RF, RKNN, and the best performing SVM for all datasets
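The 50 testing datasets come from 10 repetitions times 5 folds. The following sketch shows one way to generate such splits with the Python standard library. It is illustrative only: the function name `repeated_kfold`, the seed, and the plain shuffle are assumptions, and the record does not say whether the authors stratified the folds by class.

```python
import random


def repeated_kfold(n_samples, n_reps=10, n_folds=5, seed=0):
    """Yield (rep, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repetition reshuffles the sample indices and then partitions
    them into n_folds disjoint test sets (no stratification).
    """
    rng = random.Random(seed)
    for rep in range(n_reps):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for fold in range(n_folds):
            test = idx[fold::n_folds]  # every n_folds-th shuffled index
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield rep, fold, train, test


# 73 samples, as in the rat blood dataset: 10 x 5 = 50 train/test splits.
splits = list(repeated_kfold(73))
```

A reported accuracy would then be the mean of the per-split test accuracies over all 50 entries of `splits`.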
Algorithm run-time comparison for the entire 10 rep, 5-fold CV
| Algorithm | Run-time |
|---|---|
| CES | 10 days |
| RF | 25 min |
| RKNN | 1 min |
| SVM | 5 min |
CES single rep, 5-fold CV performance with 1 day run-time
| Pareto Level | Accuracy | Tanimoto Distance | Avg Number of Selected Genes | Avg Number of Classifiers |
|---|---|---|---|---|
| 6 | 0.7763 | 0.0873 | 10 | 5 |
| 5 | 0.8059 | 0.0871 | 8 | 14 |
| 4 | 0.8093 | 0.0938 | 8 | 32 |
| 3 | 0.8092 | 0.0871 | 9 | 64 |
| 2 | 0.7960 | 0.0941 | 7 | 121 |
| 1 | 0.8027 | 0.1336 | 6 | 216 |
CES single rep, 5-fold CV performance with 10 day run-time
| Pareto Level | Accuracy | Tanimoto Distance | Avg Number of Selected Genes | Avg Number of Classifiers |
|---|---|---|---|---|
| 9 | 0.9113 | 0.1080 | 6 | 5 |
| 8 | 0.9145 | 0.0863 | 7 | 9.4 |
| 7 | 0.9244 | 0.0713 | 7 | 23 |
| 6 | 0.9244 | 0.0742 | 7 | 46 |
| 5 | 0.9244 | 0.0605 | 9 | 92 |
| 4 | 0.9244 | 0.0846 | 9 | 179 |
| 3 | 0.9078 | 0.0697 | 10 | 351 |
| 2 | 0.9045 | 0.0911 | 9 | 672 |
| 1 | 0.8850 | 0.0800 | 8 | 1263 |
Rat blood selected genes
| Gene | # of reps |
|---|---|
| Stip1 | 10 |
| Enkur | 9 |
| Pea15a | 8 |
| Tpi1 | 3 |
| Bst2 | 3 |
| Gsg1 | 3 |
| Hspa1b | 3 |
| Arf4 | 3 |
| Dnaja1 | 2 |
| MANF | 2 |
| Hsph1 | 2 |
| Hsp90aa1 | 2 |
| CREM | 2 |
Immune-related rat blood selected genes
| Gene | # of reps |
|---|---|
| Cd96 | 10 |
| Il12rb2 | 10 |
| Ifitm1 | 9 |
| Ifngr2 | 9 |
| Il17ra | 8 |
| Cd44 | 7 |
| Anxa2 | 5 |
| Ccr3 | 5 |
| Il1rap | 4 |
| Ifngr1 | 4 |
| Cd300lb | 4 |
| Il7r | 4 |
| Il9r | 4 |
| Cd27 | 4 |
| Il4ra | 4 |
| Ccl2 | 4 |
| Il2rg | 3 |
| Ccl6 | 3 |
| Il18bp | 3 |
| Cd84 | 3 |
| Cd8a | 3 |
| Cxcl13 | 3 |
| Il21r | 3 |
Human cancer selected genes
| Gene | # of reps |
|---|---|
| DVWA | 10 |
| DUOXA1 | 10 |
| TSHR | 10 |
| SLC23A2 | 8 |
| VCP | 8 |
| SFTA3 | 8 |
| NOX4 | 5 |
| NKX2-1 | 4 |
| HSD17B14 | 4 |
| HPN | 4 |
| NKX2 | 4 |
| POLR3A | 3 |
| DPYS | 3 |
| TRPS1 | 2 |
| PTPRC | 2 |
| FGA | 2 |
| CDH3 | 2 |
| HIST1H1T | 2 |
| UXT | 2 |
| NACAP1 | 2 |
| PEBP1 | 2 |
| ZNF706 | 2 |
| CYP2C18 | 2 |
| ASGR1 | 2 |
| FGG | 2 |
Human lymphoblastoid selected genes
| Gene | # of reps |
|---|---|
| ARHGEF18 | 10 |
| RP11-108M9.3 | 10 |
| nckap5 | 7 |
| IGLV2-5 | 5 |