Davide Cangelosi, Marco Muselli, Stefano Parodi, Fabiola Blengio, Pamela Becherini, Rogier Versteeg, Massimo Conte, Luigi Varesio.
Abstract
BACKGROUND: A cancer patient's outcome is written, in part, in the gene expression profile of the tumor. We previously identified a 62-probe-set signature (NB-hypo) that detects tissue hypoxia in neuroblastoma tumors and showed that NB-hypo stratified neuroblastoma patients into good and poor outcome [1]. It was important to develop a prognostic classifier to cluster patients into risk groups benefiting from defined therapeutic approaches. Novel classification and data discretization approaches can be instrumental for the generation of accurate predictors and robust tools for clinical decision support. We explored the application to gene expression data of Rulex, a novel software suite including the Attribute Driven Incremental Discretization (ADID) technique for transforming continuous variables into simplified discrete ones and the Logic Learning Machine (LLM) model for intelligible rule generation.
Year: 2014 PMID: 25078098 PMCID: PMC4095004 DOI: 10.1186/1471-2105-15-S5-S4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Characteristics of 182 neuroblastoma patients included in the study.
| Risk factors and outcome | Training set (n = 109)a, n | Training set, % | Independent test set (n = 73)a, n | Independent test set, % |
|---|---|---|---|---|
| 1 | 54 | 49 | 33 | 46 |
| | 55 | 51 | 40 | 54 |
| 1 | 29 | 26 | 14 | 19 |
| 2 | 15 | 14 | 9 | 13 |
| 3 | 16 | 15 | 7 | 9 |
| 4 | 33 | 30 | 35 | 48 |
| 4s | 16 | 15 | 8 | 11 |
| Normal | 91 | 83 | 61 | 83 |
| Amplified | 18 | 17 | 12 | 17 |
| Good | 81 | 74 | 50 | 68 |
| Poor | 28 | 26 | 23 | 32 |
a The number of patients is 109 in the training set and 73 in the test set. The data show the total number of patients and the relative percentage in each subdivision.
Figure 1. Rule generation workflow. The initial 182-patient dataset is randomly divided into training and test sets. The training set is used by the supervised learning procedure to iteratively tune the LLM parameter "maximum error allowed for a rule" by performing a complete 10-fold cross validation. The whole training set is randomly subdivided into 10 non-overlapping subsets, nine of which are used to train the classifier by employing ADID and LLM. The classifier is subsequently used to predict the outcome of the patients in the excluded subset. This procedure is repeated 10 times until every subset has been classified once. Each parameter value is then evaluated according to the mean classification accuracy obtained in the cross validation. The parameter value that obtained the highest mean accuracy is selected to generate the final optimal classification rules. The rules are then tested on an independent cohort to assess their ability to predict patients' outcome.
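The parameter selection loop described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Rulex, ADID and LLM are not reproduced here, so `fit_threshold_rule` is a hypothetical single-threshold stand-in whose `max_error` argument plays the role of the "maximum error allowed for a rule" parameter, and the expression values are synthetic. Only the 10-fold procedure and the choice of the value with the highest mean accuracy follow the caption.

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k non-overlapping folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_threshold_rule(values, labels, max_error):
    """Hypothetical stand-in for ADID + LLM (not the Rulex algorithm).

    Learns one rule, "value <= threshold -> good", using the largest threshold
    for which the share of poor-outcome patients satisfying the rule does not
    exceed max_error.
    """
    n_poor = sum(label == "poor" for label in labels)
    threshold = min(values) - 1.0  # start with a rule that nobody satisfies
    for t in sorted(set(values)):
        false_pos = sum(v <= t and lab == "poor" for v, lab in zip(values, labels))
        if n_poor == 0 or false_pos / n_poor <= max_error:
            threshold = t
        else:
            break  # false positives only grow as the threshold increases
    return lambda v: "good" if v <= threshold else "poor"

def cross_validated_accuracy(values, labels, fit, param, k=10, seed=0):
    """Mean accuracy over k folds, each fold being held out exactly once."""
    accuracies = []
    for held_out in k_fold_indices(len(values), k, seed):
        held = set(held_out)
        train = [i for i in range(len(values)) if i not in held]
        model = fit([values[i] for i in train], [labels[i] for i in train], param)
        correct = sum(model(values[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return mean(accuracies)

def select_parameter(values, labels, fit, candidates, k=10, seed=0):
    """Return the candidate with the highest mean cross-validated accuracy."""
    scores = {p: cross_validated_accuracy(values, labels, fit, p, k, seed)
              for p in candidates}
    return max(scores, key=scores.get), scores

# Synthetic training set: 81 "good" and 28 "poor" patients, made-up expression values.
rng = random.Random(1)
values = [rng.gauss(500, 150) for _ in range(81)] + [rng.gauss(900, 150) for _ in range(28)]
labels = ["good"] * 81 + ["poor"] * 28
best, scores = select_parameter(values, labels, fit_threshold_rule, [0.0, 0.05, 0.1, 0.2])
print(best, scores)
```

The same seed is reused for every candidate value so that all candidates are compared on identical folds; the final rules would then be refit on the whole training set with the selected value and assessed once on the independent test set.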
Table 2. Classification rules.
| Rule | Cond 1a | Cond 2a | Predicted | Coveringb | Errorc | Fisherd |
|---|---|---|---|---|---|---|
| 1 | 217356_s_at ≤ 721 | 226452_at < 326 | Good | 80 | 3.5 | <0.001 |
| 2 | 206686_at ≤ 26 | 226452_at ≤ 326 | Good | 70 | 14 | <0.001 |
| 3 | 200738_s_at ≤ 1846 | 230630_at | Good | 62 | 10 | <0.001 |
| 4 | 209446_s_at ≤ 57 | 223172_s_at | Good | 60 | 10 | <0.001 |
| 5 | 202022_at > 131 | 223193_x_at < 324 | Good | 60 | 14 | <0.001 |
| 6 | 224314_s_at ≤ 29 | 236180_at < 13 | Good | 48 | 7.1 | <0.001 |
| 7 | 217356_s_at > 721 | | Poor | 92 | 17 | <0.001 |
| 8 | 223172_s_at > 73 | 226452_at > 326 | Poor | 60 | 8.6 | <0.001 |
| 9 | 206686_at > 26 | 223172_s_at > 73 | Poor | 57 | 7.4 | <0.001 |
a Cond 1 and Cond 2 indicate the conditions in the premises of the rules.
b The covering accounts for the fraction of patients that satisfy the rule and belong to the target outcome.
c The error accounts for the fraction of patients that satisfy the rule and do not belong to the target outcome.
d Fisher p-value quantifies the statistical significance of the rule on the basis of the number of patients correctly and incorrectly classified by a rule and the number of patients of the dataset belonging to each specific outcome.
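As a rough illustration of footnotes b and c, the sketch below evaluates one rule on a handful of patients. It rests on assumptions: covering is taken relative to all patients with the target outcome and error relative to all patients with the other outcome (one reading of the footnotes, not a definition confirmed by the source), the function name `rule_stats` is hypothetical, and the five expression profiles are invented for the example. The two conditions are those of the first rule in the table.

```python
def rule_stats(patients, outcomes, conditions, target):
    """Covering and error of one rule.

    Assumed denominators (an interpretation of the footnotes):
    covering = satisfying patients with the target outcome / all target-outcome patients
    error    = satisfying patients with the other outcome / all other-outcome patients
    """
    satisfied = [all(cond(p) for cond in conditions) for p in patients]
    n_target = sum(o == target for o in outcomes)
    n_other = len(outcomes) - n_target
    hits = sum(s and o == target for s, o in zip(satisfied, outcomes))
    misses = sum(s and o != target for s, o in zip(satisfied, outcomes))
    covering = hits / n_target if n_target else 0.0
    error = misses / n_other if n_other else 0.0
    return covering, error

# Premise of the first rule: 217356_s_at <= 721 AND 226452_at < 326 -> Good
rule_conditions = [
    lambda p: p["217356_s_at"] <= 721,
    lambda p: p["226452_at"] < 326,
]

# Invented expression values, for illustration only.
patients = [
    {"217356_s_at": 500, "226452_at": 200},  # good, satisfies the rule
    {"217356_s_at": 650, "226452_at": 300},  # good, satisfies the rule
    {"217356_s_at": 800, "226452_at": 100},  # good, fails the first condition
    {"217356_s_at": 900, "226452_at": 400},  # poor, fails both conditions
    {"217356_s_at": 700, "226452_at": 310},  # poor, satisfies the rule (an error)
]
outcomes = ["good", "good", "good", "poor", "poor"]

covering, error = rule_stats(patients, outcomes, rule_conditions, target="good")
print(covering, error)
```

Under this reading, a good-outcome rule satisfied by a single one of the 28 poor-outcome training patients would have an error of about 3.6%.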
Figure 2. Patient representation in the rules of Table 2. Plot of the membership of the 109 patients (x axis) to the rules (y axis) of the classifier in Table 2. The rule identifier and predicted outcome are listed on the right side of the plot. Two lines divide the plot. The horizontal line separates the rules classifying good outcome from those classifying poor outcome. The vertical line separates good outcome from poor outcome patients. Each point of the plot indicates the membership of a patient to one rule. The two lines separate the plot into four sections labeled A, B, C and D. Section A includes all the patients incorrectly classified by a poor outcome rule. Section B includes all the patients correctly classified by a poor outcome rule. Section C includes all the patients correctly classified by a good outcome rule, and Section D includes all the patients incorrectly classified by a good outcome rule.
Probe sets characteristics of the new NB-hypo-II signature.
| Probe setsa | Rule IDb | Probe set relevancec | |
|---|---|---|---|
| 200738_s_at | 3 | -0.49 | 0 |
| 202022_at | 5 | 0.5 | 0 |
| 206686_at | 2, 9 | -0.48 | 0.26 |
| 209446_s_at | 4 | -0.25 | 0 |
| 217356_s_at | 1, 7 | -0.74 | 0.92 |
| 223172_s_at | 4, 8, 9 | -0.35 | 0.48 |
| 223193_x_at | 5 | -0.1 | 0 |
| 224314_s_at | 6 | -0.22 | 0 |
| 226452_at | 1, 2, 8 | -0.26 | 0.34 |
| 230630_at | 3 | 0.13 | 0 |
| 236180_at | 6 | -0.25 | 0 |
a Affymetrix probe sets belonging to the NB-hypo-II signature.
b Rule ID indicates the ID of the rules in which the probe set occurs (as illustrated in Table 2).
c Relevance measures the importance of the features included in the rules.
The relevance calculated for each outcome is shown.
Expression cut-offs from the Kaplan-Meier and from the rules.
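| Probe set IDa | NB-hypo-IIb | Overallc: cut-offe (Kaplan-Meier) | Overallc: cut-offe (rules) | Overallc: worsef (Kaplan-Meier) | Overallc: worsef (rules) | Relapse freed: cut-offe (Kaplan-Meier) | Relapse freed: cut-offe (rules) | Relapse freed: worsef (Kaplan-Meier) | Relapse freed: worsef (rules) |
|---|---|---|---|---|---|---|---|---|---|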
| 1 | 223172_s_at | 107 | 73 | high | high | 107 | 73 | high | high |
| 2 | 200738_s_at | 1553 | 1846 | high | high | 1553 | 1846 | high | high |
| 3 | 209446_s_at | 69 | 57 | high | high | 69 | 57 | high | high |
| 4 | 226452_at | 280 | 326 | high | high | 280 | 326 | high | high |
| 5 | 217356_s_at | 706 | 721 | high | high | 706 | 721 | high | high |
| 6 | 236180_at | 18 | 13 | high | high | 13 | 13 | high | high |
| 7 | 202022_at | 101 | 131 | low | low | 138 | 131 | low | low |
| 8 | 224314_s_at | 25 | 29 | high | high | 25 | 29 | high | high |
| 9 | 206686_at | 36 | 26 | high | high | 36 | 26 | high | high |
| 10 | 223193_x_at | 495 | 324 | high | high | 572 | 324 | high | high |
| 11 | 230630_at | 19 | 23 | low | low | 35 | 23 | low | low |
a Probe set ID indicates the numerical identifier of the probe set.
b NB-hypo-II indicates the list of probe sets of the NB-hypo signature belonging to the rules.
c Overall indicates the survival time between the time of diagnosis and the time of an event or last follow-up.
d Relapse free indicates the survival time between the time of diagnosis and the first relapse.
e Expression cut-off indicates the optimal cut-off point of each probe set resulting from the Kaplan-Meier scan and from the rules.
f Worse indicates whether high or low expression of a given probe set is associated with worse survival. Worse survival is determined from the Kaplan-Meier curve or from the conditions included in the rules.
Comparison among discretization algorithms' performance.
| Algorithmf | Accuracya | Recallb | Precisionc | Specificityd | NPVe |
|---|---|---|---|---|---|
| | 80% | 90% | 82% | 57% | 72% |
| | 68% | 60% | 91% | 87% | 50% |
| | 71% | 64% | 91% | 87% | 53% |
| | 77% | 84% | 82% | 61% | 64% |
| | 68% | 60% | 91% | 87% | 50% |
a Accuracy is the ratio of correctly classified patients to all classified patients.
b Recall is the ratio of correctly classified good outcome patients to all good outcome patients.
c Precision is the ratio of correctly classified good outcome patients to all predicted good outcome patients.
d Specificity is the ratio of correctly classified poor outcome patients to all poor outcome patients.
e NPV (negative predictive value) is the ratio of correctly classified poor outcome patients to all predicted poor outcome patients.
f Discretization algorithms utilized for comparison.
Performance of classification algorithms.
| Algorithmf | Accuracya | Recallb | Precisionc | Specificityd | NPVe |
|---|---|---|---|---|---|
| | 80% | 90% | 82% | 57% | 72% |
| | 63% | 76% | 72% | 35% | 40% |
| | 81% | 90% | 83% | 61% | 74% |
| | 84% | 94% | 84% | 61% | 82% |
a Accuracy is the ratio of correctly classified patients to all classified patients.
b Recall is the ratio of correctly classified good outcome patients to all good outcome patients.
c Precision is the ratio of correctly classified good outcome patients to all predicted good outcome patients.
d Specificity is the ratio of correctly classified poor outcome patients to all poor outcome patients.
e NPV (negative predictive value) is the ratio of correctly classified poor outcome patients to all predicted poor outcome patients.
f Machine learning algorithms utilized for comparison.
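The five measures in the footnotes can be computed from a 2 x 2 confusion matrix with good outcome as the positive class. The sketch below is illustrative only. The function name is hypothetical, recall is taken relative to all good outcome patients (the usual definition), and the counts are an assumption: they are not reported in the source, but were chosen to be consistent with a test set of 50 good and 23 poor outcome patients and with the last row of the table above.

```python
def classification_metrics(tp, fn, tn, fp):
    """Performance measures with 'good outcome' as the positive class.

    tp: good outcome patients predicted good
    fn: good outcome patients predicted poor
    tn: poor outcome patients predicted poor
    fp: poor outcome patients predicted good
    """
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "recall": tp / (tp + fn),        # share of good outcome patients found
        "precision": tp / (tp + fp),     # share of "good" predictions that are right
        "specificity": tn / (tn + fp),   # share of poor outcome patients found
        "npv": tn / (tn + fn),           # share of "poor" predictions that are right
    }

# Assumed counts (not from the source): 50 good and 23 poor outcome patients.
metrics = classification_metrics(tp=47, fn=3, tn=14, fp=9)
percentages = {name: round(100 * value) for name, value in metrics.items()}
print(percentages)
```

With these counts the rounded values are 84%, 94%, 84%, 61% and 82%, which shows how a high recall can coexist with a much lower specificity when the poor outcome class is small.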
Performance comparison among the configurations in the weighted classification on the test set.
| Configurationf | Accuracya | Recallb | Precisionc | Specificityd | Negative Predictive Valuee |
|---|---|---|---|---|---|
| | 80% | 90% | 82% | 57% | 72% |
| | 65% | 63% | 82% | 70% | 47% |
| | 66% | 60% | 86% | 78% | 47% |
| | 78% | 98% | 77% | 35% | 89% |
a Accuracy is the ratio of correctly classified patients to all classified patients.
b Recall is the ratio of correctly classified good outcome patients to all good outcome patients.
c Precision is the ratio of correctly classified good outcome patients to all predicted good outcome patients.
d Specificity is the ratio of correctly classified poor outcome patients to all poor outcome patients.
e Negative predictive value is the ratio of correctly classified poor outcome patients to all predicted poor outcome patients.
f Configuration indicates the specific weights assigned to the outcomes in the weighted classification.