Enrico Glaab, Jaume Bacardit, Jonathan M Garibaldi, Natalio Krasnogor.
Abstract
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
Year: 2012 PMID: 22808075 PMCID: PMC3394775 DOI: 10.1371/journal.pone.0039932
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Flowchart illustrating the experimental procedure.
The protocol consists of three steps: 1) Pre-processing; 2) Supervised analysis; 3) Post-analysis.
Datasets used in this paper.
| Dataset | Platform | No. of genes | No. of samples (class 1; class 2) | References |
| Lymphoma | Affymetrix | 7,129 | 58 (D); 19 (F) | |
| Prostate | Affymetrix | 12,600 | 52 (T); 50 (N) | |
| Breast | Illumina | 47,293 | 84 (L); 44 (N) | |
Figure 2. A BioHEL classification rule set obtained for the prostate cancer dataset, illustrating different types of rules.
“Exp(x)” is short for “Expression of gene x”, where x is a HUGO gene symbol, “∧” represents the conjunctive AND-operator, “[x,y]” is an interval of expression values in which the value of the attribute must lie to fulfill one premise of the rule, and “-” is a class assignment operator, followed by the output class of the rule. Rule 5 is a default rule that applies if no rule above is matched.
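The rule notation described in this caption can be illustrated with a minimal sketch. It assumes that rules are tried in order and that the first rule whose interval premises all hold assigns the class, with the default rule as a fallback. The gene symbols, intervals and class labels below are hypothetical and are not taken from Figure 2.

```python
from typing import Dict, List, Tuple

# One premise: the expression of `gene` must lie in the closed interval [low, high].
Premise = Tuple[str, float, float]
# One rule: a conjunction (AND) of premises plus the class it assigns.
Rule = Tuple[List[Premise], str]


def classify(sample: Dict[str, float], rules: List[Rule], default: str) -> str:
    """Return the class of the first rule whose premises all hold, else the default."""
    for premises, label in rules:
        if all(low <= sample[gene] <= high for gene, low, high in premises):
            return label
    return default


# Hypothetical rule set: names, intervals and labels are invented for illustration.
RULES: List[Rule] = [
    ([("GENE_A", 100.0, 250.0), ("GENE_B", 0.0, 40.0)], "tumour"),
    ([("GENE_C", 500.0, 900.0)], "normal"),
]

sample = {"GENE_A": 120.0, "GENE_B": 10.0, "GENE_C": 50.0}
print(classify(sample, RULES, default="normal"))
```

Because every premise is a plain interval test on one gene, a matched rule can be read directly as a statement about expression ranges, which is the interpretability benefit the abstract refers to.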
10-fold cross-validation results.
| Dataset | Feature Selection | Classification | AVG (%) | STDDEV |
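| Prostate cancer | CFS | BioHEL | 91 | 8 |
| Prostate cancer | CFS | GAssist | 93 | 10 |
| Prostate cancer | CFS | SVM | 90 | 10 |
| Prostate cancer | CFS | RF | 92 | 11 |
| Prostate cancer | CFS | PAM | 91 | 10 |
| Prostate cancer | PLSS | BioHEL | 92 | 8 |
| Prostate cancer | PLSS | GAssist | 93 | 9 |
| Prostate cancer | PLSS | SVM | 90 | 11 |
| Prostate cancer | PLSS | RF | 92 | 9 |
| Prostate cancer | RFS | BioHEL | 89 | 8 |
| Prostate cancer | RFS | GAssist | 92 | 12 |
| Prostate cancer | RFS | SVM | 88 | 8 |
| Prostate cancer | RFS | RF | 93 | 9 |
| Prostate cancer | RFS | PAM | 90 | 11 |
| Diffuse large B-cell lymphoma | CFS | BioHEL | 81 | 10 |
| Diffuse large B-cell lymphoma | CFS | GAssist | 80 | 15 |
| Diffuse large B-cell lymphoma | CFS | SVM | 87 | 12 |
| Diffuse large B-cell lymphoma | CFS | RF | 87 | 16 |
| Diffuse large B-cell lymphoma | CFS | PAM | 78 | 17 |
| Diffuse large B-cell lymphoma | PLSS | BioHEL | 93 | 12 |
| Diffuse large B-cell lymphoma | PLSS | GAssist | 94 | 6 |
| Diffuse large B-cell lymphoma | PLSS | SVM | 91 | 13 |
| Diffuse large B-cell lymphoma | PLSS | RF | 87 | 8 |
| Diffuse large B-cell lymphoma | PLSS | PAM | 86 | 11 |
| Diffuse large B-cell lymphoma | RFS | BioHEL | 91 | 11 |
| Diffuse large B-cell lymphoma | RFS | GAssist | 89 | 13 |
| Diffuse large B-cell lymphoma | RFS | SVM | 91 | 13 |
| Diffuse large B-cell lymphoma | RFS | RF | 89 | 13 |
| Diffuse large B-cell lymphoma | RFS | PAM | 86 | 14 |
| Breast cancer | CFS | BioHEL | 84 | 11 |
| Breast cancer | CFS | GAssist | 87 | 8 |
| Breast cancer | CFS | SVM | 86 | 9 |
| Breast cancer | CFS | RF | 86 | 7 |
| Breast cancer | PLSS | BioHEL | 84 | 7 |
| Breast cancer | PLSS | GAssist | 85 | 5 |
| Breast cancer | PLSS | SVM | 84 | 7 |
| Breast cancer | PLSS | PAM | 88 | 7 |
| Breast cancer | RFS | BioHEL | 86 | 5 |
| Breast cancer | RFS | GAssist | 88 | 6 |
| Breast cancer | RFS | SVM | 80 | 17 |
| Breast cancer | RFS | PAM | 88 | 7 |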
10-fold cross-validation results obtained with BioHEL, SVM, RF and PAM on the three microarray datasets using three feature selection methods (CFS, PLSS, RFS); AVG = average accuracy, STDDEV = standard deviation; the highest accuracies achieved with BioHEL and the best alternative method are both shown in bold type for each dataset.
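How an AVG and STDDEV pair can arise from 10-fold cross-validation is sketched below, assuming that feature selection is repeated inside every fold on the training samples only (one reading of the "external" cross-validation mentioned in the abstract). The class-mean filter and the nearest-centroid classifier are simple stand-ins, not CFS, PLSS, RFS, BioHEL or any other method from the table, and the toy data are hypothetical.

```python
import random
import statistics
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], int]  # (expression values, class label)


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def select_features(train: Sequence[Sample], m: int) -> List[int]:
    """Stand-in filter: keep the m features with the largest gap between class means."""
    labels = sorted({y for _, y in train})

    def gap(j: int) -> float:
        means = [statistics.fmean(x[j] for x, y in train if y == lab) for lab in labels]
        return max(means) - min(means)

    return sorted(range(len(train[0][0])), key=gap, reverse=True)[:m]


def predict(train: Sequence[Sample], features: List[int], x: List[float]) -> int:
    """Stand-in classifier: nearest class centroid on the selected features."""
    best_label, best_dist = -1, float("inf")
    for lab in sorted({y for _, y in train}):
        centroid = [statistics.fmean(s[j] for s, y in train if y == lab) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label


def cross_validate(data: Sequence[Sample], k: int = 10, m: int = 2, seed: int = 0):
    """Return (mean, standard deviation) of the per-fold accuracies."""
    accuracies = []
    for fold in k_fold_indices(len(data), k, seed):
        held_out = set(fold)
        train = [s for i, s in enumerate(data) if i not in held_out]
        test = [data[i] for i in fold]
        features = select_features(train, m)  # fitted on the training part only
        hits = sum(predict(train, features, x) == y for x, y in test)
        accuracies.append(hits / len(test))
    return statistics.fmean(accuracies), statistics.stdev(accuracies)


# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 do not.
DATA: List[Sample] = [
    ([10.0 * (i % 2) + 0.1 * (i % 5), 0.1 * ((7 * i) % 11), 0.1 * ((3 * i) % 13)], i % 2)
    for i in range(40)
]

mean_acc, std_acc = cross_validate(DATA, k=10)
print(f"AVG = {100 * mean_acc:.0f}%, STDDEV = {100 * std_acc:.0f}")
```

Selecting features inside the loop keeps the held-out samples from influencing which genes are chosen, so the reported accuracy is not inflated by selection bias.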
Leave-one-out cross-validation results.
| Dataset | Feature Selection | Classification | AVG (%) | STDDEV |
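| Prostate cancer | CFS | BioHEL | 92 | 27 |
| Prostate cancer | CFS | GAssist | 93 | 25 |
| Prostate cancer | CFS | SVM | 89 | 31 |
| Prostate cancer | CFS | PAM | 90 | 30 |
| Prostate cancer | PLSS | GAssist | 92 | 27 |
| Prostate cancer | PLSS | SVM | 93 | 25 |
| Prostate cancer | PLSS | RF | 93 | 25 |
| Prostate cancer | PLSS | PAM | 93 | 25 |
| Prostate cancer | RFS | BioHEL | 88 | 32 |
| Prostate cancer | RFS | GAssist | 93 | 25 |
| Prostate cancer | RFS | SVM | 89 | 31 |
| Prostate cancer | RFS | RF | 91 | 29 |
| Prostate cancer | RFS | PAM | 91 | 29 |
| Prostate cancer | none | BioHEL | 92 | 27 |
| Diffuse large B-cell lymphoma | CFS | BioHEL | 84 | 36 |
| Diffuse large B-cell lymphoma | CFS | GAssist | 87 | 34 |
| Diffuse large B-cell lymphoma | CFS | SVM | 88 | 32 |
| Diffuse large B-cell lymphoma | CFS | RF | 87 | 34 |
| Diffuse large B-cell lymphoma | CFS | PAM | 84 | 37 |
| Diffuse large B-cell lymphoma | PLSS | BioHEL | 92 | 26 |
| Diffuse large B-cell lymphoma | PLSS | GAssist | 92 | 27 |
| Diffuse large B-cell lymphoma | PLSS | RF | 90 | 31 |
| Diffuse large B-cell lymphoma | PLSS | PAM | 86 | 35 |
| Diffuse large B-cell lymphoma | RFS | BioHEL | 88 | 32 |
| Diffuse large B-cell lymphoma | RFS | GAssist | 88 | 32 |
| Diffuse large B-cell lymphoma | RFS | SVM | 90 | 31 |
| Diffuse large B-cell lymphoma | RFS | RF | 92 | 27 |
| Diffuse large B-cell lymphoma | RFS | PAM | 83 | 38 |
| Breast cancer | CFS | BioHEL | 82 | 38 |
| Breast cancer | CFS | GAssist | 84 | 36 |
| Breast cancer | CFS | SVM | 84 | 37 |
| Breast cancer | CFS | RF | 84 | 36 |
| Breast cancer | PLSS | BioHEL | 84 | 37 |
| Breast cancer | PLSS | GAssist | 84 | 36 |
| Breast cancer | PLSS | SVM | 81 | 39 |
| Breast cancer | PLSS | RF | 88 | 33 |
| Breast cancer | PLSS | PAM | 86 | 35 |
| Breast cancer | RFS | BioHEL | 82 | 39 |
| Breast cancer | RFS | GAssist | 85 | 36 |
| Breast cancer | RFS | SVM | 86 | 35 |
| Breast cancer | RFS | RF | 87 | 34 |
| Breast cancer | RFS | PAM | 88 | 32 |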
Leave-one-out cross-validation results obtained with BioHEL, SVM, RF and PAM on the three microarray datasets using three feature selection methods (CFS, PLSS, RFS); AVG = average accuracy, STDDEV = standard deviation; the highest accuracies achieved with BioHEL and the best alternative are both shown in bold type for each dataset.
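The STDDEV values in the leave-one-out table are much larger than those for 10-fold cross-validation. One plausible explanation, stated here as an assumption rather than as the authors' documented procedure, is that each leave-one-out fold contains a single sample, so every per-fold accuracy is either 0 or 1 and the spread follows directly from the mean. The sketch below uses the 102 prostate samples from the dataset table; the counts of correct predictions (95 and 94) are hypothetical values chosen because they round to 93% and 92%.

```python
import statistics
from typing import Tuple


def loo_summary(n_correct: int, n_samples: int) -> Tuple[int, int]:
    """Mean and population standard deviation (both in %, rounded) of 0/1 outcomes."""
    outcomes = [1.0] * n_correct + [0.0] * (n_samples - n_correct)
    avg = statistics.fmean(outcomes)
    std = statistics.pstdev(outcomes)  # equals sqrt(avg * (1 - avg)) for 0/1 data
    return round(100 * avg), round(100 * std)


# Hypothetical counts of correctly classified samples out of 102.
print(loo_summary(95, 102))
print(loo_summary(94, 102))
```

Under this reading, the leave-one-out STDDEV column carries no information beyond the accuracy itself, so comparisons between methods in that table rest on the AVG column.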
Comparison of prediction results from the literature for the prostate cancer dataset.
| Author (year) | Method | AVG (%) | Size |
| T.K. Paul | RPMBGA, LOOCV | 96.6 | 48.5 |
| Wessels | RFLD(0), Monte-Carlo CV | 93.4 | 14 |
| Shen | PLR, Monte-Carlo CV (30 iterations) | 94.6 | *** |
| Shen | LSR, Monte-Carlo CV (30 iterations) | 94.3 | *** |
| W Chu | Gaussian processes, LOOCV | 91.2 | 13 |
| Lecocke | SVM, LOOCV | 90.1 | ** |
| Lecocke | DLDA, LOOCV | 89.2 | ** |
| Lecocke | GAGA+3NN, LOOCV | 88.1 | ** |
| our study | BioHEL, 10-fold CV | 94 | *30 |
| our study | PLSS+BioHEL, LOOCV | 94 | *30 |
(*maximum no. of genes per base classifier in ensemble learning model; **evaluation results averaged over feature subsets using different numbers of genes; ***singular value decomposition used instead of classical feature selection).
Comparison of prediction results from the literature for the lymphoma dataset.
| Author (year) | Method | AVG (%) | Size |
| Wessels | RFLD(10), Monte-Carlo CV | 95.7 | 80 |
| Liu | MOEA+WV | 93.5 | 6 |
| Shipp | SNR+WV, LOOCV | 92.2 | 30 |
| Goh | PCC-SNR + ECF, LOOCV | 91 | 10 |
| Lecocke | GA+SVM, LOOCV | 90.2 | ** |
| Lecocke | GAGA+DLDA, LOOCV | 89.8 | ** |
| Lecocke | GAGA+3-NN, LOOCV | 86.3 | ** |
| Hu | WWKNN, LOOCV | 87.01 | 12 |
| Hu | ECF, LOOCV | 85.71 | 12 |
| our study | BioHEL, 10-fold CV | 95 | *30 |
| our study | BioHEL, LOOCV | 94 | *30 |
(*maximum no. of genes per base classifier in ensemble learning model; **evaluation results averaged over feature subsets using different numbers of genes).
Comparison of prediction methods.
| Average ranks by method | SVM | RF | PAM | BioHEL | GAssist |
| 10-fold | 3.8 | | 3.1 | 3.4* | |
| LOO | 3.0 | | 3.1 | 3.7* | 3.0 |
Results of a Friedman test to compare prediction methods across different datasets and feature selection methods (the best average ranks are shown in bold typeface; *here only the results in combination with feature selection are taken into account).
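The average ranks underlying a Friedman test can be computed as in the following sketch: within each combination of dataset and feature selection method, the classifiers are ranked by accuracy (rank 1 is best, ties share the mean rank), and the ranks are then averaged per classifier. The three input rows are copied from the 10-fold table above (prostate/CFS, prostate/RFS and lymphoma/PLSS), so the resulting averages cover only that subset and are not expected to match the ranks reported here. The function names are illustrative.

```python
from typing import List, Sequence

METHODS = ["SVM", "RF", "PAM", "BioHEL", "GAssist"]

# Accuracies (%) in the order of METHODS, taken from three rows of the 10-fold table.
ROWS = [
    [90, 92, 91, 91, 93],  # prostate cancer, CFS
    [88, 93, 90, 89, 92],  # prostate cancer, RFS
    [91, 87, 86, 93, 94],  # diffuse large B-cell lymphoma, PLSS
]


def rank_row(scores: Sequence[float]) -> List[float]:
    """Rank scores within one row (1 = highest); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def average_ranks(rows: Sequence[Sequence[float]]) -> List[float]:
    """Average the within-row ranks column by column."""
    ranked = [rank_row(r) for r in rows]
    return [sum(col) / len(ranked) for col in zip(*ranked)]


for name, rank in zip(METHODS, average_ranks(ROWS)):
    print(f"{name}: {rank:.2f}")
```

With five methods the ranks in every row sum to 15, so the average ranks in a complete row of the table must also sum to 15, which gives a quick consistency check.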
Comparison of feature selection methods.
| Average ranks by method | CFS | PLSS | RFS |
| 10-fold | 2.3 | | 2.0 |
| LOO | 2.3 | | 2.0 |
Results of a Friedman test to compare feature selection methods in terms of classification accuracy across different datasets and prediction methods (the best average ranks for each row are shown in bold typeface).
List of high scoring genes for the prostate cancer dataset.
| Ensemble feature selection | | | BioHEL feature ranking | | |
| Gene identifier | Freq. | Annotation | Gene identifier | Perc. | Annotation |
| | 3 | | | 7.6 | |
| | 3 | | 914_g_at | 4.0 | |
| | 3 | | | 3.4 | |
| 38634_at | 3 | | | 2.0 | |
| 37366_at | 3 | | 41817_g_at | 1.5 | |
| | 2 | | 35278_at | 1.5 | |
| 38087_s_at | 2 | | 41741_at | 1.3 | |
| 41468_at | 2 | | 32250_at | 1.3 | |
| 38827_at | 2 | | 32755_at | 1.1 | |
| 38406_f_at | 2 | | | 1.0 | |
| 34840_at | 2 | | 37331_g_at | 0.9 | |
List of genes that were chosen by at least two different selection methods among the 20 features selected most frequently on the prostate dataset. The 4 genes detected as informative by both the ensemble FS and the BioHEL FR approach (hepsin, nel-like 2, AMACR and adipsin) are highlighted in bold face (see discussion in the literature mining analysis section).
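The "chosen by at least two different selection methods" criterion can be sketched as a simple vote count over the per-method gene lists, which would also yield the Freq. column. This is a minimal illustration under that assumption; the probe identifiers and their assignment to CFS, PLSS and RFS below are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ensemble_selection(selections: Dict[str, List[str]], min_votes: int = 2) -> List[Tuple[str, int]]:
    """Return (gene, number of methods selecting it) for genes with enough votes."""
    votes = Counter(g for genes in selections.values() for g in set(genes))
    chosen = [(g, n) for g, n in votes.items() if n >= min_votes]
    # Most frequently selected genes first; ties broken alphabetically.
    return sorted(chosen, key=lambda item: (-item[1], item[0]))


# Hypothetical top-ranked probes per feature selection method.
SELECTIONS = {
    "CFS": ["probe_1", "probe_2", "probe_3"],
    "PLSS": ["probe_1", "probe_3", "probe_4"],
    "RFS": ["probe_1", "probe_5", "probe_2"],
}

print(ensemble_selection(SELECTIONS))
```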
List of high scoring genes for the lymphoma dataset.
| Ensemble feature selection | | | BioHEL feature ranking | | |
| Gene identifier | Freq. | Annotation | Gene identifier | Perc. | Annotation |
| X02152_at | 3 | | X01060_at | 6.6 | |
| V00594_at | 2 | | M63835_at | 6.0 | |
| HG1980-HT2023_at | 2 | | HG2090-HT2152_s_at | 5.3 | |
| U63743_at | 2 | | X02544_at | 3.0 | |
| X05360_at | 2 | | U21931_at | 1.9 | |
| M63379_at | 2 | | D80008_at | 1.7 | |
| M13792_at | 2 | | X65965_s_at | 1.5 | |
| L19686_rna1_at | 2 | | D13413_rna1_s_at | 1.3 | |
| D14662_at | 2 | | L25876_at | 1.2 | |
| S73591_at | 2 | | D78134_at | 1.1 | |
List of genes that were chosen by at least two different selection methods among the 30 features selected most frequently on the lymphoma dataset. On this dataset, the genes detected as informative by the ensemble FS and the BioHEL FR did not overlap (see discussion in the literature mining analysis section).
List of high scoring genes for the breast cancer dataset.
| Ensemble feature selection | | | BioHEL feature ranking | | |
| Gene identifier | Freq. | Annotation | Gene identifier | Perc. | Annotation |
| | 3 | | GI_37545993-S | 0.7 | |
| | 3 | | | 0.6 | |
| | 3 | | | | |
| | 2 | | GI_23308560-S | 0.6 | |
| GI_42657473-S | 2 | | | 0.6 | |
| GI_7706686-S | 2 | | | 0.5 | |
| GI_40788002-S | 2 | | GI_22748948-S | 0.4 | |
| GI_33620752-S | 2 | | GI_4507266-S | 0.4 | |
| | 2 | | | 0.4 | |
| | 2 | | GI_4502798-S | 0.4 | |
| GI_37551139-S | 2 | | Hs.501130-S | 0.4 | |
| GI_40255152-S | 2 | | | 0.4 | |
| | | | | 0.3 | |
| GI_30410031-S | 2 | | GI_42659577-S | 0.3 | |
| GI_4503928-S | 2 | | | 0.3 | |
| GI_42659459-S | 2 | | GI_21389370-S | 0.3 | |
| | 2 | | Hs.202515-S | 0.3 | |
| GI_38455428-S | 2 | | GI_18152766-S | 0.3 | |
| GI_22035691-A | 2 | | Hs.499414-S | 0.3 | |
List of genes that were chosen by at least two different selection methods among the 30 features selected most frequently on the breast cancer dataset. The 7 genes detected as informative by both the ensemble FS and the BioHEL FR approach are highlighted in bold face (see discussion in the literature mining analysis section).
Figure 3. Comparison of text mining scores.
Histogram of text mining scores for randomly chosen gene identifier subsets compared to scores achieved by BioHEL and the ensemble feature selection (FS) approach (prostate cancer dataset).
Figure 4. Comparison of text mining scores.
Histogram of text mining scores for randomly chosen gene identifier subsets compared to scores achieved by BioHEL and the ensemble feature selection (FS) approach (lymphoma cancer dataset).
Figure 5. Comparison of text mining scores.
Histogram of text mining scores for randomly chosen gene identifier subsets, compared to scores achieved by BioHEL and the ensemble feature selection (FS) approach (breast cancer dataset).
Literature mining significance scores.
| Dataset | Ensemble FS (p-value) | BioHEL FR (p-value) |
| Prostate | 0.00 | 0.05 |
| Lymphoma | 0.51 | 0.22 |
| Breast | 0.02 | 0.03 |
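One way to obtain significance scores of this kind from the histograms in Figures 3 to 5 is an empirical p-value: the fraction of randomly chosen gene subsets whose text mining score is at least as high as the score of the evaluated gene list. The sketch below assumes this definition, which the captions do not spell out; the scores in the example are hypothetical.

```python
from typing import Sequence


def empirical_p_value(observed: float, random_scores: Sequence[float]) -> float:
    """Fraction of random-subset scores that are at least as high as the observed score."""
    if not random_scores:
        raise ValueError("at least one random score is required")
    return sum(s >= observed for s in random_scores) / len(random_scores)


# Hypothetical text mining scores of four random gene subsets and one evaluated list.
print(empirical_p_value(0.8, [0.1, 0.2, 0.9, 0.5]))
```

Under this definition a reported value of 0.00 would mean that none of the sampled random subsets reached the observed score, not that the probability is exactly zero.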