Noah Eyal-Altman, Mark Last, Eitan Rubin.
Abstract
BACKGROUND: Numerous publications attempt to predict cancer survival outcome from gene expression data using machine-learning methods. A direct comparison of these works is challenging for the following reasons: (1) inconsistent measures used to evaluate the performance of different models, and (2) incomplete specification of critical stages in the process of knowledge discovery. There is a need for a platform that would allow researchers to replicate previous works and to test the impact of changes in the knowledge discovery process on the accuracy of the induced models.
Keywords: Breast cancer; Data mining; Reproducible research
Year: 2017 PMID: 28095769 PMCID: PMC5240197 DOI: 10.1186/s12859-016-1435-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Screenshot of PCM-SABRE
Machine learning methods available in PCM-SABRE
| Meta-node | Stage | Method | KNIME node | Default parameters |
|---|---|---|---|---|
| 1.1 | Select patients | Estrogen Receptor status (ER) | R script | |
| 1.2 | Select patients | Lymph Node status (LN) | R script | |
| 2.1 | Feature Selection | Information Gain (InfoGain) | InformationGainCalculator (Community node – Palladian) | Top 100 ranked |
| 2.2 | Feature Selection | ANOVA | One-way ANOVA | include genes with |
| 3.1 | Modeling | Logistic Regression (LR) | Logistic (3.7) (Weka node) | Ridge = 1.0E-8, |
| 3.2 | Modeling | Random Forest (RF) | Random Forest Learner | Split criteria = Information Gain Ratio, Number of models = 350 |
| 3.3 | Modeling | Artificial Neural Network (ANN) | PNN Learner (DDA) | Theta Minus = 0.2, Theta Plus = 0.4 |
| 3.4 | Modeling | K-Nearest Neighbors (KNN) | IBK (3.7) (Weka node) | KNN = 15 |
| 3.5 | Modeling | Support Vector Machine (SVM) | SVM Learner | Kernel = RBF, sigma = 0.2 |
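The defaults in the table above can be approximated outside KNIME as well. A minimal sketch in Python with scikit-learn, under the assumption that `SelectKBest` with mutual information stands in for the Palladian InformationGainCalculator (top 100 ranked genes, meta-node 2.1) and that scikit-learn's `"entropy"` criterion substitutes for the Information Gain Ratio split of the Random Forest Learner (meta-node 3.2); the synthetic data is purely illustrative:

```python
# Hypothetical re-creation of a PCM-SABRE-style pipeline in scikit-learn.
# Parameter values follow the table above; the mapping of KNIME/Weka nodes
# to scikit-learn estimators is an assumption, not part of PCM-SABRE.
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Stand-in for a gene expression matrix: 200 patients x 500 genes
X, y = make_classification(n_samples=200, n_features=500, random_state=0)

pipe = Pipeline([
    # 2.1 Feature selection: information gain, keep top 100 ranked features
    ("select", SelectKBest(mutual_info_classif, k=100)),
    # 3.2 Modeling: random forest, 350 trees
    ("model", RandomForestClassifier(n_estimators=350, criterion="entropy",
                                     random_state=0)),
])

# 5-fold cross-validated accuracy, analogous to the percent-accuracy
# figures reported in the results table below (here on synthetic data)
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(round(scores.mean(), 3))
```

Any of the other classifiers in the table (KNN with k = 15, SVM with an RBF kernel, etc.) could be swapped into the `"model"` step the same way the paper demonstrates drag-and-drop node replacement in KNIME.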
Fig. 2 Demonstration of drag-and-drop model replacement (Naïve Bayes instead of decision tree)
Fig. 3 Modification of the feature-selection Meta-node to replicate Chou et al.'s work
Predictive power (percent accuracy) of several feature-selection methods combined with different classification models; AUC results are shown in parentheses
| Prediction model | PCM-SABRE pipeline | | | Chou et al. |
|---|---|---|---|---|
| Feature selection | InfoGain | ANOVA | MW | MW |
| RF | 76.52 (NA) | 77.70 (NA) | 76.10 (NA) | NA |
| LR | 76.27 (73.0) | 66.55 (62.49) | 75.68 (70.95) | 64.12 (58.96) |
| PNN | 76.52 (74.09) | 76.27 (75.21) | 74.58 (72.32) | 69.54 (63.88) |
| KNN | 75.76 (67.78) | 75.34 (68.48) | 76.10 (70.30) | NA |
| SVM | 72.64 (NA) | 72.64 (NA) | 72.64 (NA) | NA |
| DT | 70.19 (60.59) | 68.07 (61.53) | 64.44 (57.34) | 63.45 (56.90) |
| DL | NA | NA | 75.34 (71.71) | 68.90 (61.66) |
| DA | NA | NA | 75.51 (72.23) | 65.91 (61.65) |
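The two numbers in each cell of the results table, percent accuracy with AUC in parentheses, are standard metrics and can be reproduced with scikit-learn. A small illustration on synthetic labels and predicted probabilities (the data here is invented, not drawn from the study):

```python
# Illustration of the two metrics reported in the results table:
# percent accuracy, with AUC shown in parentheses. Synthetic data only.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.2, 0.4, 0.8, 0.6, 0.9, 0.3, 0.4, 0.1]  # predicted P(class = 1)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]     # threshold at 0.5

acc = 100 * accuracy_score(y_true, y_pred)  # percent accuracy
auc = roc_auc_score(y_true, y_prob)         # area under the ROC curve
print(f"{acc:.2f} ({auc * 100:.2f})")
```

Note that AUC needs class probabilities (or scores), not hard labels, which is why some cells in the table report accuracy with AUC marked NA.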