Jonathan L Lustgarten, Shyam Visweswaran, Vanathi Gopalakrishnan, Gregory F Cooper.
Abstract
BACKGROUND: Several data mining methods require data that are discrete, and other methods often perform better with discrete data. We introduce an efficient Bayesian discretization (EBD) method for optimal discretization of variables that runs efficiently on high-dimensional biomedical datasets. The EBD method consists of two components, namely, a Bayesian score to evaluate discretizations and a dynamic programming search procedure to efficiently search the space of possible discretizations. We compared the performance of EBD to Fayyad and Irani's (FI) discretization method, which is commonly used for discretization.
Year: 2011 PMID: 21798039 PMCID: PMC3162539 DOI: 10.1186/1471-2105-12-309
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Pseudocode for the efficient Bayesian discretization (EBD) method. The EBD method uses dynamic programming and runs in O(n²) time, as indicated by the two for loops (n is the number of instances in the dataset).
Figure 2. An example of the application of the efficient Bayesian discretization (EBD) method. This example shows the progression of the EBD method when applying the pseudocode given in Figure 1 to the dataset of six instances that is introduced in the main text. An asterisk denotes the discretization with the highest EBD score in a given iteration, as indexed by a. There are 2⁵ = 32 possible discretizations for a dataset of six instances; for this dataset EBD explicitly evaluates only the 6 discretizations shown in bold font.
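The dynamic program of Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' code: `log_interval_score` is a generic Dirichlet-multinomial marginal likelihood standing in for EBD's Bayesian score, the prior over discretizations is omitted, and each interval's score is recomputed from scratch rather than updated incrementally as in the paper's O(n²) formulation.

```python
import math
from collections import Counter

def log_interval_score(labels, num_classes, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood of the class labels
    falling in one interval (a stand-in for EBD's per-interval Bayesian term)."""
    n = len(labels)
    score = math.lgamma(alpha * num_classes) - math.lgamma(alpha * num_classes + n)
    for c in Counter(labels).values():
        score += math.lgamma(alpha + c) - math.lgamma(alpha)
    return score

def ebd_discretize(values, labels, num_classes):
    """Dynamic program over cut positions, mirroring the two nested loops of
    Figure 1: best[i] is the best total log score for discretizing the first
    i instances (sorted by value); back[i] is the start of the last interval."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    ys = [labels[i] for i in order]
    n = len(xs)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):        # end of the last interval (exclusive)
        for j in range(i):           # start of the last interval
            s = best[j] + log_interval_score(ys[j:i], num_classes)
            if s > best[i]:
                best[i], back[i] = s, j
    cuts, i = [], n                  # walk the back pointers to recover cuts
    while i > 0:
        j = back[i]
        if j > 0:
            cuts.append((xs[j - 1] + xs[j]) / 2)  # midpoint between neighbours
        i = j
    return sorted(cuts)
```

On a toy dataset with two pure class regions, `ebd_discretize([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1], 2)` places a single cut at 6.5; a variable that carries no class information is left as one interval (no cuts).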
Description of datasets
| Dataset | Dataset name | Type | P/D | #t | #n | #V | M |
|---|---|---|---|---|---|---|---|
| 1 | Alon et al. | T | D | 2 | 61 | 6,584 | 0.651 |
| 2 | Armstrong et al. | T | D | 3 | 72 | 12,582 | 0.387 |
| 3 | Beer et al. | T | P | 2 | 86 | 5,372 | 0.795 |
| 4 | Bhattacharjee et al. | T | D | 7 | 203 | 12,600 | 0.657 |
| 5 | Bhattacharjee et al. | T | P | 2 | 69 | 5,372 | 0.746 |
| 6 | Golub et al. | T | D | 2 | 72 | 7,129 | 0.653 |
| 7 | Hedenfalk et al. | T | D | 2 | 36 | 7,464 | 0.500 |
| 8 | Iizuka et al. | T | P | 2 | 60 | 7,129 | 0.661 |
| 9 | Khan et al. | T | D | 4 | 83 | 2,308 | 0.345 |
| 10 | Nutt et al. | T | D | 4 | 50 | 12,625 | 0.296 |
| 11 | Pomeroy et al. | T | D | 5 | 90 | 7,129 | 0.642 |
| 12 | Pomeroy et al. | T | P | 2 | 60 | 7,129 | 0.645 |
| 13 | Ramaswamy et al. | T | D | 29 | 280 | 16,063 | 0.100 |
| 14 | Rosenwald et al. | T | P | 2 | 240 | 7,399 | 0.574 |
| 15 | Staunton et al. | T | D | 9 | 60 | 7,129 | 0.145 |
| 16 | Shipp et al. | T | D | 2 | 77 | 7,129 | 0.747 |
| 17 | Su et al. | T | D | 13 | 174 | 12,533 | 0.150 |
| 18 | Singh et al. | T | D | 2 | 102 | 10,510 | 0.510 |
| 19 | Veer et al. | T | P | 2 | 78 | 24,481 | 0.562 |
| 20 | Welsch et al. | T | D | 2 | 39 | 7,039 | 0.878 |
| 21 | Yeoh et al. | T | P | 2 | 249 | 12,625 | 0.805 |
| 22 | Petricoin et al. | P | D | 2 | 322 | 11,003 | 0.784 |
| 23 | Pusztai et al. | P | D | 3 | 159 | 11,170 | 0.364 |
| 24 | Ranganathan et al. | P | D | 2 | 52 | 36,778 | 0.556 |
In the Type column, T denotes transcriptomic and P denotes proteomic. In the P/D column, P denotes prognostic and D denotes diagnostic. #t is the number of values of the target variable and #n is the number of instances in the dataset. #V is the number of predictor variables. M is the proportion of the data that has the majority target value.
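The M column is simply the proportion of instances carrying the most common target value, i.e. the accuracy of a majority-class baseline. A minimal sketch (the class labels below are made up for illustration):

```python
from collections import Counter

def majority_fraction(labels):
    """Proportion of instances whose target equals the most common
    target value (the M column in the table above)."""
    return max(Counter(labels).values()) / len(labels)

# A perfectly balanced two-class dataset of 36 instances, as in
# dataset 7 (Hedenfalk et al., M = 0.500); label names are illustrative:
print(majority_fraction(["classA"] * 18 + ["classB"] * 18))  # 0.5
```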
Accuracies for the EBD and FI discretization methods
| Dataset | C4.5 (FI) | NB (FI) | C4.5 (EBD) | NB (EBD) |
|---|---|---|---|---|
| 1 | ||||
| 2 | 84.62% (0.77) | 92.12% (0.94) | ||
| 3 | 64.23% (1.72) | |||
| 4 | 84.38% (0.67) | 72.76% (0.79) | ||
| 5 | 56.33% (1.93) | 69.78% (1.13) | ||
| 6 | 95.67% (0.82) | 80.28% (1.51) | ||
| 7 | ||||
| 8 | 50.00% (2.03) | 70.82% (1.49) | ||
| 9 | 81.29% (0.97) | 91.94% (0.91) | ||
| 10 | 66.54% (1.21) | 71.76% (1.32) | ||
| 11 | 72.44% (0.91) | 73.81% (1.11) | ||
| 12 | 55.83% (2.14) | 61.67% (1.84) | ||
| 13 | 57.14% (0.96) | 49.32% (0.88) | ||
| 14 | 58.75% (0.91) | 57.65% (1.09) | ||
| 15 | 54.20% (0.74) | 53.86% (1.07) | ||
| 16 | 71.25% (1.45) | 85.45% (1.22) | ||
| 17 | 68.96% (1.17) | 81.78% (1.42) | ||
| 18 | 81.21% (0.58) | 83.76% (0.91) | ||
| 19 | 72.22% (1.21) | 84.19% (1.31) | ||
| 20 | ||||
| 21 | 62.32% (1.54) | 76.23% (0.54) | ||
| 22 | 69.78% (1.21) | 77.23% (0.78) | ||
| 23 | 68.49% (0.98) | 46.22% (0.98) | ||
| 24 | 73.04% (1.72) | 80.12% (1.23) | ||
| Average | 71.48% (2.12) | 76.79% (2.32) | ||
Accuracies for the EBD and FI discretization methods are obtained by applying the C4.5 and NB classifiers to the discretized variables. The mean and the standard error of the mean (SEM) of the accuracy for each dataset are obtained by 10 × 10 cross-validation. For each dataset, the higher accuracy is shown in bold font and equal accuracies are underlined.
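The 10 × 10 cross-validation used throughout these tables is 10-fold cross-validation repeated ten times with different shuffles; the mean and SEM are assumed here to be taken over the 100 per-fold estimates. A minimal sketch, where `evaluate_fold` is a hypothetical caller-supplied function (e.g. train a classifier on the training fold and return its test accuracy):

```python
import math
import random

def ten_by_ten_cv(evaluate_fold, dataset, seed=0):
    """Return (mean, SEM) of the 100 per-fold scores from 10-fold
    cross-validation repeated ten times with different shuffles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(10):                          # ten repetitions
        data = dataset[:]
        rng.shuffle(data)
        folds = [data[i::10] for i in range(10)]
        for k in range(10):                      # ten folds per repetition
            test = folds[k]
            train = [x for j, f in enumerate(folds) if j != k for x in f]
            scores.append(evaluate_fold(train, test))
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / (len(scores) - 1)
    return mean, math.sqrt(var) / math.sqrt(len(scores))
```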
AUCs for the EBD and FI discretization methods
| Dataset | C4.5 (FI) | NB (FI) | C4.5 (EBD) | NB (EBD) |
|---|---|---|---|---|
| 1 | 66.79% (0.92) | |||
| 2 | 69.37% (1.22) | 78.58% (1.87) | ||
| 3 | 55.42% (1.65) | 54.16% (1.92) | ||
| 4 | 68.37% (1.27) | 58.12% (1.08) | ||
| 5 | 54.38% (1.44) | 53.91% (1.09) | ||
| 6 | 60.11% (0.95) | 86.38% (0.86) | ||
| 7 | ||||
| 8 | 54.11% (1.12) | 58.76% (0.85) | ||
| 9 | 86.90% (1.41) | 84.28% (1.12) | ||
| 10 | 74.30% (0.81) | 82.59% (1.04) | ||
| 11 | 66.12% (0.61) | 70.74% (0.98) | ||
| 12 | 55.14% (1.06) | 53.72% (0.86) | ||
| 13 | 70.45% (0.87) | 69.89% (0.71) | ||
| 14 | 55.16% (0.98) | 54.42% (0.98) | ||
| 15 | 73.49% (1.01) | 89.45% (0.89) | ||
| 16 | 80.06% (1.12) | 80.11% (1.09) | ||
| 17 | 78.65% (1.41) | 75.98% (1.24) | ||
| 18 | 92.31% (0.90) | 94.19% (0.72) | ||
| 19 | 74.23% (1.14) | 81.16% (1.24) | ||
| 20 | 94.12% (1.19) | |||
| 21 | 52.13% (0.46) | 54.92% (0.65) | ||
| 22 | 60.65% (0.98) | 64.25% (0.71) | ||
| 23 | 81.56% (0.79) | 76.17% (0.88) | ||
| 24 | 80.21% (0.89) | 81.21% (0.77) | ||
| Average | 72.15% (1.77) | 73.71% (1.24) | ||
AUCs for the EBD and FI discretization methods are obtained by applying the C4.5 and NB classifiers to the discretized variables. The mean and the standard error of the mean (SEM) of the AUC for each dataset are obtained by 10 × 10 cross-validation. For each dataset, the higher AUC is shown in bold font and equal AUCs are underlined.
Robustness for the EBD and FI discretization methods
| Dataset | C4.5 (EBD) | NB (EBD) | C4.5 (FI) | NB (FI) |
|---|---|---|---|---|
| 1 | ||||
| 2 | 87.69% (0.86) | 94.36% (0.98) | ||
| 3 | 53.57% (2.10) | 81.69% (1.14) | ||
| 4 | 84.18% (0.76) | 75.91% (1.09) | ||
| 5 | 49.83% (2.01) | 69.97% (1.20) | ||
| 6 | 80.58% (1.42) | 95.89% (0.92) | ||
| 7 | 96.67% (0.86) | |||
| 8 | 55.11% (2.06) | 70.94% (1.48) | ||
| 9 | 87.16% (0.99) | 96.08% (0.94) | ||
| 10 | 68.65% (1.39) | 74.35% (2.05) | ||
| 11 | 70.36% (0.95) | 78.25% (1.22) | ||
| 12 | 57.82% (2.22) | 63.47% (1.87) | ||
| 13 | 66.12% (0.39) | 50.83% (1.02) | ||
| 14 | 57.47% (0.94) | 67.01% (1.08) | ||
| 15 | 54.20% (0.74) | 54.16% (1.75) | ||
| 16 | 73.17% (1.66) | 84.11% (1.38) | ||
| 17 | 82.71% (1.35) | 85.49% (1.60) | ||
| 18 | 79.38% (0.57) | 88.91% (0.72) | ||
| 19 | 73.00% (1.48) | 85.55% (1.31) | ||
| 20 | ||||
| 21 | 55.18% (1.26) | 77.01% (0.60) | ||
| 22 | 66.84% (1.03) | 81.15% (0.93) | ||
| 23 | 76.07% (0.99) | 52.49% (1.73) | ||
| 24 | 70.00% (1.75) | 80.33% (1.64) | ||
| Average | 72.55% (2.81) | 81.72% (2.92) | ||
The mean and the standard error of the mean (SEM) of robustness for each dataset are obtained by 10 × 10 cross-validation. For each dataset, the higher robustness value is shown in bold font and equal robustness values are underlined.
Stabilities for the EBD and FI discretization methods
| Dataset | FI | EBD |
|---|---|---|
| 1 | 0.80 | |
| 2 | ||
| 3 | 0.67 | |
| 4 | 0.82 | |
| 5 | 0.50 | |
| 6 | 0.79 | |
| 7 | 0.79 | |
| 8 | 0.55 | |
| 9 | 0.82 | |
| 10 | 0.81 | |
| 11 | 0.75 | |
| 12 | ||
| 13 | 0.76 | |
| 14 | ||
| 15 | 0.59 | |
| 16 | 0.79 | |
| 17 | 0.75 | |
| 18 | 0.76 | |
| 19 | 0.69 | |
| 20 | 0.84 | |
| 21 | 0.42 | |
| 22 | 0.85 | |
| 23 | 0.89 | |
| 24 | 0.59 | |
| Average | 0.72 | |
The mean stability for each dataset is obtained by 10 × 10 cross-validation. For each dataset, the higher stability value is shown in bold font and equal stability values are underlined.
Mean number of intervals per predictor variable for the EBD and FI discretization methods
| Dataset | Mean fraction of predictors with 1 interval | Mean # of intervals per predictor with >1 interval | Mean # of intervals per predictor |  |  |  |
|---|---|---|---|---|---|---|
| 1 | 0.81 | 2.01 | 1.15 | |||
| 2 | 0.47 | 2.04 | 1.41 | |||
| 3 | 0.91 | 2.01 | 1.04 | |||
| 4 | 0.18 | 2.16 | 1.84 | |||
| 5 | 0.97 | 2.03 | ||||
| 6 | 0.87 | 1.11 | ||||
| 7 | 0.82 | 2.01 | 1.13 | |||
| 8 | 0.97 | 1.01 | ||||
| 9 | 0.54 | 2.11 | 1.27 | |||
| 10 | 0.38 | 1.37 | ||||
| 11 | 0.51 | 2.06 | 1.25 | |||
| 12 | 0.98 | |||||
| 13 | 0.05 | 2.10 | 1.11 | |||
| 14 | 0.98 | 2.02 | 1.01 | 1.01 | ||
| 15 | 0.70 | 2.08 | 1.02 | |||
| 16 | 0.75 | 1.12 | ||||
| 17 | 0.76 | |||||
| 18 | 0.17 | 2.13 | 1.25 | |||
| 19 | 0.87 | 2.02 | ||||
| 20 | 0.81 | 2.02 | 1.15 | |||
| 21 | 0.97 | 2.01 | ||||
| 22 | 0.82 | 1.16 | ||||
| 23 | 0.93 | 2.01 | 1.03 | |||
| 24 | 0.92 | 2.06 | 1.04 | |||
| Average | 0.76 | 2.06 | 1.16 | |||
The mean fraction of predictor variables discretized to one interval (no cut points), the mean number of intervals for predictor variables discretized to more than one interval (at least one cut point), and the mean number of intervals over all predictor variables for each dataset are obtained by 10-fold cross-validation done ten times. For each dataset, the higher value is shown in bold font and equal values are underlined.
Statistical comparison of EBD and FI discretization methods
| Evaluation Measure | Method | Mean (SEM) | Difference of Means | Z statistic (p-value) |
|---|---|---|---|---|
| C4.5 Accuracy | EBD | 73.49% (2.07) | 2.01 | 2.219 |
| [0%, 100%] | FI | 71.48% (2.12) | | |
| C4.5 AUC | EBD | 73.22% (1.89) | 1.07 | 2.732 |
| [50%, 100%] | FI | 72.15% (1.77) | | |
| C4.5 Robustness | EBD | 72.55% (2.81) | -0.26 | -0.261 |
| [0%, ∞] | FI | 72.81% (2.76) | | (0.794) |
| NB Accuracy | EBD | 77.55% (2.65) | 0.76 | 2.080 |
| [0%, 100%] | FI | 76.79% (2.32) | | |
| NB AUC | EBD | 74.83% (1.43) | 1.11 | 2.711 |
| [0%, 100%] | FI | 73.71% (1.24) | | |
| NB Robustness | EBD | 81.72% (2.92) | -0.68 | -0.016 |
| [50%, ∞] | FI | 82.40% (2.59) | | (0.987) |
| Stability | EBD | 0.74 (0.025) | 0.02 | 1.972 |
| [0, 1] | FI | 0.72 (0.029) | | |
| Mean # of intervals per predictor | EBD | 1.27 (0.074) | 0.11 | 1.686 |
| [1, n] | FI | 1.16 (0.038) | | (0.092) |
In the first column, the range of each measure is given in square brackets, where n is the number of instances in the dataset. In the last column, the number on top is the Z statistic and the number below, in parentheses, is the corresponding two-tailed p-value. On all performance measures except the mean number of intervals per predictor, the Z statistic is positive when EBD performs better than FI. Two-tailed p-values of 0.05 or smaller are shown in bold, indicating that EBD performed statistically significantly better at that level.
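The parenthesized p-values that survive in the table (e.g. 0.794 for C4.5 robustness, where Z = −0.261) are the two-tailed tail probabilities of the Z statistic under a standard normal reference distribution, so they can be checked directly. A minimal sketch using Python's standard library:

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value for a Z statistic under a standard normal
    reference distribution: 2 * (1 - Phi(|z|))."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# The C4.5 accuracy comparison above reports Z = 2.219:
print(round(two_tailed_p(2.219), 3))   # 0.026, below the 0.05 threshold
```

The same formula reproduces the non-significant p-values left in the table, e.g. `two_tailed_p(-0.261)` ≈ 0.794 and `two_tailed_p(1.686)` ≈ 0.092.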
Summary of wins, draws and losses of EBD versus FI
| Evaluation Measure | Wins | Draws | Losses |
|---|---|---|---|
| C4.5 Accuracy | 17 | 3 | 4 |
| C4.5 AUC | 17 | 2 | 5 |
| C4.5 Robustness | 10 | 3 | 11 |
| NB Accuracy | 17 | 4 | 3 |
| NB AUC | 16 | 2 | 6 |
| NB Robustness | 9 | 2 | 13 |
| Stability | 15 | 3 | 6 |
Number of wins, draws and losses on accuracy, AUC, robustness and stability for EBD and FI.