| Literature DB >> 21569330 |
Leonardo Vanneschi, Antonella Farinaccio, Giancarlo Mauri, Mauro Antoniotti, Paolo Provero, Mario Giacobini.
Abstract
BACKGROUND: The ability to accurately classify cancer patients into risk classes, i.e. to predict the outcome of the pathology on an individual basis, is a key ingredient in making therapeutic decisions. In recent years gene expression data have been successfully used to complement the clinical and histological criteria traditionally used in such prediction. Many "gene expression signatures" have been developed, i.e. sets of genes whose expression values in a tumor can be used to predict the outcome of the pathology. Here we investigate the use of several machine learning techniques to classify breast cancer patients using one of such signatures, the well established 70-gene signature.Entities:
Year: 2011 PMID: 21569330 PMCID: PMC3108919 DOI: 10.1186/1756-0381-4-12
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Experimental comparison between the number of incorrectly classified instances found on the test sets by the different machine learning methods.
| | GP | SVM-K1 | SVM-K2 | SVM-K3 | MP | RF |
|---|---|---|---|---|---|---|
| best | 10 | 13 | 14 | 15 | 10 | 12 |
| average (SEM) | 16.40 (0.30) | 18.32 (0.37) | 16.76 (0.18) | 17.62 (0.17) | 18.08 (0.39) | 17.60 (0.35) |
Each method was independently run 50 times, each time using a different training/test partition of the validation dataset (see text for details). The first row indicates the method: Genetic Programming (GP), Support Vector Machine with polynomial-kernel exponent 1.0 (SVM-K1), 2.0 (SVM-K2), and 3.0 (SVM-K3), Multilayer Perceptrons (MP), and Random Forest (RF). The second row shows the best number of incorrectly classified instances obtained on the test set over the 50 runs, and the third row reports the average performance of each group of 50 runs on its test sets (standard error of the mean in parentheses).
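The repeated-partition protocol behind these figures (50 independent random train/test splits, then best, mean, and SEM of the test-error counts) can be sketched in plain Python. The majority-class stand-in predictor and the toy labels below are illustrative assumptions, not the paper's models:

```python
import random
import statistics

def repeated_split_errors(y_true, fit, n_runs=50, test_frac=0.3, seed=0):
    """Count test-set errors over repeated random train/test partitions.

    `fit` takes a list of training indices and returns a function that
    classifies a single instance index.
    """
    rng = random.Random(seed)
    n = len(y_true)
    n_test = max(1, int(n * test_frac))
    errors = []
    for _ in range(n_runs):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = fit(train_idx)
        errors.append(sum(1 for i in test_idx if model(i) != y_true[i]))
    best = min(errors)                                   # "best" row
    mean = statistics.mean(errors)                       # "average" row
    sem = statistics.stdev(errors) / (len(errors) ** 0.5)  # "(SEM)" value
    return best, mean, sem

# Illustrative stand-in classifier: always predicts the majority training label.
labels = [0] * 60 + [1] * 40

def majority_model(train_idx):
    counts = {0: 0, 1: 0}
    for i in train_idx:
        counts[labels[i]] += 1
    majority = max(counts, key=counts.get)
    return lambda i: majority

best, mean, sem = repeated_split_errors(labels, majority_model)
```

In the paper each of the six learners plays the role of `fit`; only the scoring protocol is shown here.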
Statistical significance of the difference in performance between the methods.
| ANOVA |||||
|---|---|---|---|---|
| GP vs. SVM-K1 | GP vs. SVM-K2 | GP vs. SVM-K3 | GP vs. MP | GP vs. RF |
The first row shows an ANOVA test on the six samples of solutions (one per method), while the second row reports pairwise two-tailed Student t-tests comparing GP with each other method.
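Both significance tests can be reproduced with `scipy.stats`: a one-way ANOVA across all method samples, then pairwise two-tailed Student t-tests against GP. The error counts below are invented stand-ins, not the paper's run data:

```python
from scipy import stats

# Hypothetical per-run test-error counts for three methods (toy numbers).
gp     = [16, 17, 15, 16, 18, 16, 17, 15]
svm_k1 = [18, 19, 18, 17, 19, 18, 18, 19]
mp     = [18, 17, 18, 19, 18, 17, 18, 18]

# One-way ANOVA over the samples of all methods being compared.
f_stat, p_anova = stats.f_oneway(gp, svm_k1, mp)

# Pairwise two-tailed t-test: GP vs. one other method.
t_stat, p_pair = stats.ttest_ind(gp, svm_k1)
```

With 50 runs per method, as in the paper, the same two calls apply unchanged; only the sample lists grow.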
The 10 most recurring features in the solutions found by GP.
| Accession ID | Gene name | Gene description | Solutions |
|---|---|---|---|
| NM_003981 | PRC1 | protein regulator of cytokinesis 1 | 48 |
| NM_002916 | RFC4 | replication factor C (activator 1) 4, 37 kDa | 23 |
| AI992158 | - | - | 16 |
| AI554061 | - | - | 10 |
| NM_006101 | NDC80 | NDC80 homolog, kinetochore complex component (S. cerevisiae) | 9 |
| NM_015984 | UCHL5 | ubiquitin carboxyl-terminal hydrolase L5 | 7 |
| NM_020188 | C16orf61 | chromosome 16 open reading frame 61 | 6 |
| NM_016448 | DTL | denticleless homolog (Drosophila) | 6 |
| NM_014791 | MELK | maternal embryonic leucine zipper kinase | 6 |
| NM_004702 | - | - | 6 |
The four columns show: accession ID, gene name, gene description, and the number of solutions in which that feature occurs.
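Tallying how often each feature appears across the evolved solutions is a one-liner with `collections.Counter`. The solution sets below are invented for illustration; in the paper each solution is a GP individual over the 70-gene probes:

```python
from collections import Counter

# Each solution is modeled as the set of probe IDs it uses (toy data).
solutions = [
    {"NM_003981", "NM_002916"},
    {"NM_003981", "AI992158"},
    {"NM_003981"},
]

# Count, for every feature, the number of solutions containing it.
counts = Counter(feature for sol in solutions for feature in sol)
top = counts.most_common(2)
```

Sorting `counts` by frequency yields exactly the kind of ranking shown in the table above.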
Experimental comparison between the number of incorrectly classified instances found on the test sets by GP and Support Vector Machine with exponent 2 on unbalanced datasets.
| | GP (5 yrs) | SVM-K2 (5 yrs) | GP (7.5 yrs) | SVM-K2 (7.5 yrs) |
|---|---|---|---|---|
| best | 9 | 10 | 12 | 13 |
| average (SEM) | 15.04 (0.41) | 17.84 (0.42) | 21.18 (0.49) | 20.7 (0.46) |
The datasets are defined by survival status at endpoints 5 and 7.5 years.
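Those endpoint-defined datasets come from thresholding each patient's survival time, which shifts the class balance. A minimal sketch of that relabeling, with invented survival times (real survival data would also need censoring handled, which is omitted here):

```python
def label_by_endpoint(survival_years, endpoint):
    """1 = event (poor outcome) before the endpoint, 0 = survived past it."""
    return [1 if t < endpoint else 0 for t in survival_years]

# Toy survival times in years.
times = [2.0, 4.5, 6.0, 8.0, 9.5, 10.0]

labels_5 = label_by_endpoint(times, 5.0)   # 5-year endpoint: fewer positives
labels_75 = label_by_endpoint(times, 7.5)  # 7.5-year endpoint: more positives
```

Moving the endpoint from 5 to 7.5 years converts some negatives into positives, which is what makes the two datasets differently unbalanced.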
Experimental comparison between the number of false negatives found on the test sets by the different machine learning methods.
| | GP | SVM-K1 | SVM-K2 | SVM-K3 | MP | RF |
|---|---|---|---|---|---|---|
| best | 2 | 6 | 6 | 6 | 5 | 6 |
| average (SEM) | 9.82 (0.44) | 13.26 (0.51) | 12.60 (0.35) | 14.08 (0.39) | 12.88 (0.51) | 13.38 (0.49) |
Each method was independently run 50 times, each time using a different training/test partition of the validation dataset (see text for details). The first row indicates the method: Genetic Programming (GP), Support Vector Machine with polynomial-kernel exponent 1.0 (SVM-K1), 2.0 (SVM-K2), and 3.0 (SVM-K3), Multilayer Perceptrons (MP), and Random Forest (RF). The second row shows the best number of false negatives obtained on the test set over the 50 runs, and the third row reports the average performance of each group of 50 runs on its test sets (standard error of the mean in parentheses).
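False negatives (poor-outcome patients predicted as good-outcome, the clinically costly error here) can be counted directly from paired label lists. A minimal sketch with toy predictions:

```python
def false_negatives(y_true, y_pred, positive=1):
    """Count instances that are truly positive but predicted negative."""
    return sum(1 for t, p in zip(y_true, y_pred)
               if t == positive and p != positive)

# Toy labels: 1 = poor outcome (positive class), 0 = good outcome.
y_true = [1, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0]

fn = false_negatives(y_true, y_pred)  # two positives were missed
```

Replacing the whole-error count of the earlier table with this measure gives per-run false-negative counts, which are then summarized as best / average (SEM) exactly as before.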
False negative prediction: statistical significance of the difference in performance between the methods.
| ANOVA |||||
|---|---|---|---|---|
| GP vs. SVM-K1 | GP vs. SVM-K2 | GP vs. SVM-K3 | GP vs. MP | GP vs. RF |
| P = 8.53 × 10⁻⁶ |||||
The first row shows an ANOVA test on the six samples of solutions (one per method), while the second row reports pairwise two-tailed Student t-tests comparing GP with each other method.
Figure 1. The best-fitness model. Tree representation and the traditional Lisp representation of the model with the best fitness found by GP over the studied 50 independent runs.
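GP individuals like the one in Figure 1 are expression trees, conventionally printed in Lisp prefix form. A minimal recursive evaluator for such trees, using a made-up expression and function set rather than the paper's actual model:

```python
import operator

# Assumed function set: binary arithmetic operators keyed by Lisp symbols.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_tree(node, env):
    """Evaluate a nested-tuple GP tree against a variable environment."""
    if isinstance(node, tuple):          # internal node: (op, left, right)
        op, left, right = node
        return OPS[op](eval_tree(left, env), eval_tree(right, env))
    if isinstance(node, str):            # terminal: a feature (expression value)
        return env[node]
    return node                          # terminal: a numeric constant

# The Lisp form (+ (* X1 2) X2) as a nested tuple:
tree = ("+", ("*", "X1", 2), "X2")
value = eval_tree(tree, {"X1": 3.0, "X2": 1.0})  # 3*2 + 1 = 7.0
```

Thresholding the value returned by such a tree then yields the binary risk-class prediction.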
Parameters used in the experiments.
| GP Parameters | |
|---|---|
| population size | 500 individuals |
| population initialization | ramped half and half |
| selection method | tournament (tournament size = 10) |
| crossover rate | 0.9 |
| mutation rate | 0.1 |
| maximum number of generations | 5 |
| algorithm | generational tree-based GP with no elitism |
| **SVM Parameters** | |
| complexity parameter | 0.1 |
| size of the kernel cache | 10⁷ |
| epsilon value for the round-off error | 10⁻¹² |
| exponent for the polynomial kernel | 1.0, 2.0, 3.0 |
| tolerance parameter | 0.001 |
| **MP Parameters** | |
| learning algorithm | back-propagation |
| learning rate | 0.03 |
| activation function for all the neurons in the net | sigmoid |
| momentum | 0.2, progressively decreasing to 0.0001 |
| hidden layers | (number of attributes + number of classes)/2 |
| number of epochs of training | 500 |
| **RF Parameters** | |
| number of trees | 2500 |
| number of attributes per node | 1 |
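The settings in this table map roughly onto scikit-learn estimators. The mapping below is an approximation for orientation only (the paper's SVM, MP, and RF implementations differ from scikit-learn in defaults and internals, and the momentum schedule and hidden-layer sizing are not reproduced exactly):

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

# SVM-K2: polynomial kernel with exponent 2, complexity (C) 0.1, tolerance 0.001.
svm_k2 = SVC(kernel="poly", degree=2, C=0.1, tol=0.001)

# MP: back-propagation with sigmoid (logistic) activations, learning rate 0.03,
# initial momentum 0.2, 500 training epochs.
mp = MLPClassifier(activation="logistic", solver="sgd",
                   learning_rate_init=0.03, momentum=0.2, max_iter=500)

# RF: 2500 trees, 1 attribute considered per split.
rf = RandomForestClassifier(n_estimators=2500, max_features=1)
```

The GP parameters have no scikit-learn counterpart; a dedicated GP library (e.g. a tree-based evolutionary framework) would be needed for that column.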