Jerzy W Grzymala-Busse, Teresa Mroczek.
Abstract
Previous research indicates that an entropy-based multiple-scanning methodology for discretization of numerical datasets is very competitive. Discretization converts the numerical values of data records into discrete values associated with numerical intervals defined over the domains of the data records. In multiple-scanning discretization, the last step is a postprocessing step that merges neighboring intervals in the discretized dataset. Our objective is to check how such merging affects the error rate, measured by tenfold cross-validation within the C4.5 system. We conducted experiments on 17 numerical datasets, using the same multiple-scanning setup with three options for merging: no merging at all, merging based on the smallest entropy, and merging based on the biggest entropy. The Friedman rank sum test (5% significance level) showed that the differences between the three approaches are not statistically significant; there is no universally best approach. We then repeated all experiments 30 times, recording averages and standard deviations. The test of the difference between averages shows that, when no merging is compared with merging based on the smallest entropy, the differences are statistically highly significant (1% significance level). In some cases the smaller error rate is associated with no merging, and in others with merging based on the smallest entropy. A comparison of no merging with merging based on the biggest entropy showed similar results. Our final conclusion is that there are highly significant differences between no merging and merging, depending on the dataset, so the best approach should be chosen by trying all three.
Keywords: data mining; discretization; entropy; numerical attributes
Year: 2018 PMID: 33266604 PMCID: PMC7512462 DOI: 10.3390/e20110880
Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524
An example of a dataset with numerical attributes.
| Case | Length | Height | Width | Weight | Quality (decision) |
|---|---|---|---|---|---|
| 1 | 4.7 | 1.8 | 1.7 | 1.7 | high |
| 2 | 4.5 | 1.4 | 1.8 | 0.9 | high |
| 3 | 4.7 | 1.8 | 1.9 | 1.3 | high |
| 4 | 4.5 | 1.8 | 1.7 | 1.3 | medium |
| 5 | 4.3 | 1.6 | 1.9 | 1.7 | medium |
| 6 | 4.3 | 1.4 | 1.7 | 0.9 | low |
| 7 | 4.5 | 1.6 | 1.9 | 0.9 | very-low |
| 8 | 4.5 | 1.4 | 1.8 | 1.3 | very-low |
Partially discretized dataset after the first scan (Quality is the decision).
| Case | Length | Height | Width | Weight | Quality (decision) |
|---|---|---|---|---|---|
| 1 | 4.4..4.7 | 1.5..1.8 | 1.7..1.75 | 1.1..1.7 | high |
| 2 | 4.4..4.7 | 1.4..1.5 | 1.75..1.9 | 0.9..1.1 | high |
| 3 | 4.4..4.7 | 1.5..1.8 | 1.75..1.9 | 1.1..1.7 | high |
| 4 | 4.4..4.7 | 1.5..1.8 | 1.7..1.75 | 1.1..1.7 | medium |
| 5 | 4.3..4.4 | 1.5..1.8 | 1.75..1.9 | 1.1..1.7 | medium |
| 6 | 4.3..4.4 | 1.4..1.5 | 1.7..1.75 | 0.9..1.1 | low |
| 7 | 4.4..4.7 | 1.5..1.8 | 1.75..1.9 | 0.9..1.1 | very-low |
| 8 | 4.4..4.7 | 1.4..1.5 | 1.75..1.9 | 1.1..1.7 | very-low |
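
The first-scan intervals above can be reproduced for the Length attribute with a short sketch. It assumes that candidate cut-points are midpoints between consecutive distinct attribute values and that the chosen cut-point minimizes the conditional entropy of the decision given the split; both are assumptions about the entropy-based selection, and the function names `entropy` and `best_cut_point` are illustrative only.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of decision values."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_cut_point(values, labels):
    """Return the midpoint cut with the smallest conditional entropy of the decision."""
    distinct = sorted(set(values))
    best_cut, best_score = None, float("inf")
    for low, high in zip(distinct, distinct[1:]):
        cut = (low + high) / 2  # assumed candidate: midpoint of neighboring values
        left = [d for v, d in zip(values, labels) if v < cut]
        right = [d for v, d in zip(values, labels) if v >= cut]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        if score < best_score:  # strict '<' keeps the earlier cut on exact ties
            best_cut, best_score = cut, score
    return best_cut


# Length column and decisions of the eight cases in the example dataset.
length = [4.7, 4.5, 4.7, 4.5, 4.3, 4.3, 4.5, 4.5]
quality = ["high", "high", "high", "medium", "medium", "low", "very-low", "very-low"]
print(best_cut_point(length, quality))
```

For Length, the cut at 4.4 gives a conditional entropy of roughly 1.34 bits against roughly 1.44 bits for the cut at 4.6, which agrees with the intervals 4.3..4.4 and 4.4..4.7 in the table.
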
A subset of the dataset presented in Table 1.
| Case | Length | Height | Width | Weight | Quality (decision) |
|---|---|---|---|---|---|
| 1 | 4.7 | 1.8 | 1.7 | 1.7 | high |
| 4 | 4.5 | 1.8 | 1.7 | 1.3 | medium |
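
Cases 1 and 4 are the only pair that falls into the same intervals for every attribute after the first scan while carrying different decisions. The sketch below finds such groups by grouping cases on their interval tuples; the dictionary layout and the name `conflicting_groups` are assumptions made for illustration.

```python
from collections import defaultdict

# Partially discretized rows after the first scan: case -> (intervals, decision).
first_scan = {
    1: (("4.4..4.7", "1.5..1.8", "1.7..1.75", "1.1..1.7"), "high"),
    2: (("4.4..4.7", "1.4..1.5", "1.75..1.9", "0.9..1.1"), "high"),
    3: (("4.4..4.7", "1.5..1.8", "1.75..1.9", "1.1..1.7"), "high"),
    4: (("4.4..4.7", "1.5..1.8", "1.7..1.75", "1.1..1.7"), "medium"),
    5: (("4.3..4.4", "1.5..1.8", "1.75..1.9", "1.1..1.7"), "medium"),
    6: (("4.3..4.4", "1.4..1.5", "1.7..1.75", "0.9..1.1"), "low"),
    7: (("4.4..4.7", "1.5..1.8", "1.75..1.9", "0.9..1.1"), "very-low"),
    8: (("4.4..4.7", "1.4..1.5", "1.75..1.9", "1.1..1.7"), "very-low"),
}


def conflicting_groups(rows):
    """Return groups of cases that share all intervals but differ in decision."""
    groups = defaultdict(list)
    for case, (intervals, decision) in rows.items():
        groups[intervals].append((case, decision))
    return [
        sorted(case for case, _ in members)
        for members in groups.values()
        if len({decision for _, decision in members}) > 1
    ]


print(conflicting_groups(first_scan))  # [[1, 4]]
```
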
Discretized dataset.
| Case | Length | Height | Width | Weight | Quality (decision) |
|---|---|---|---|---|---|
| 1 | 4.6..4.7 | 1.5..1.8 | 1.7..1.75 | 1.1..1.7 | high |
| 2 | 4.4..4.6 | 1.4..1.5 | 1.75..1.9 | 0.9..1.1 | high |
| 3 | 4.6..4.7 | 1.5..1.8 | 1.75..1.9 | 1.1..1.7 | high |
| 4 | 4.4..4.6 | 1.5..1.8 | 1.7..1.75 | 1.1..1.7 | medium |
| 5 | 4.3..4.4 | 1.5..1.8 | 1.75..1.9 | 1.1..1.7 | medium |
| 6 | 4.3..4.4 | 1.4..1.5 | 1.7..1.75 | 0.9..1.1 | low |
| 7 | 4.4..4.6 | 1.5..1.8 | 1.75..1.9 | 0.9..1.1 | very-low |
| 8 | 4.4..4.6 | 1.4..1.5 | 1.75..1.9 | 1.1..1.7 | very-low |
Discretized dataset after interval merging.
| Case | Length | Height | Width | Weight | Quality (decision) |
|---|---|---|---|---|---|
| 1 | 4.6..4.7 | 1.5..1.8 | 1.7..1.75 | 1.1..1.7 | high |
| 2 | 4.3..4.6 | 1.4..1.5 | 1.75..1.9 | 0.9..1.1 | high |
| 3 | 4.6..4.7 | 1.5..1.8 | 1.75..1.9 | 1.1..1.7 | high |
| 4 | 4.3..4.6 | 1.5..1.8 | 1.7..1.75 | 1.1..1.7 | medium |
| 5 | 4.3..4.6 | 1.5..1.8 | 1.75..1.9 | 1.1..1.7 | medium |
| 6 | 4.3..4.6 | 1.4..1.5 | 1.7..1.75 | 0.9..1.1 | low |
| 7 | 4.3..4.6 | 1.5..1.8 | 1.75..1.9 | 0.9..1.1 | very-low |
| 8 | 4.3..4.6 | 1.4..1.5 | 1.75..1.9 | 1.1..1.7 | very-low |
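
Merging two neighboring intervals amounts to removing the cut-point between them. The sketch below assumes that a merge is acceptable only if no two cases with different decisions become indistinguishable afterwards; that acceptance rule and the helper names are assumptions, and the entropy-based ordering of candidate merges is not reproduced here. The cut-points are read from the interval boundaries of the discretized dataset.

```python
from bisect import bisect_right

# Raw attribute values (Length, Height, Width, Weight) and decisions of the eight cases.
ROWS = [
    (4.7, 1.8, 1.7, 1.7), (4.5, 1.4, 1.8, 0.9), (4.7, 1.8, 1.9, 1.3), (4.5, 1.8, 1.7, 1.3),
    (4.3, 1.6, 1.9, 1.7), (4.3, 1.4, 1.7, 0.9), (4.5, 1.6, 1.9, 0.9), (4.5, 1.4, 1.8, 1.3),
]
DECISIONS = ["high", "high", "high", "medium", "medium", "low", "very-low", "very-low"]

# Cut-points per attribute before merging, taken from the interval boundaries.
CUTS = [[4.4, 4.6], [1.5], [1.75], [1.1]]


def is_consistent(rows, decisions, cuts):
    """True if cases that share every interval also share the decision."""
    seen = {}
    for row, decision in zip(rows, decisions):
        key = tuple(bisect_right(c, v) for c, v in zip(cuts, row))
        if seen.setdefault(key, decision) != decision:
            return False
    return True


def stays_consistent_without(rows, decisions, cuts, attr, cut):
    """Check consistency after removing one cut-point of attribute `attr`."""
    trial = [list(c) for c in cuts]
    trial[attr].remove(cut)
    return is_consistent(rows, decisions, trial)


print(stays_consistent_without(ROWS, DECISIONS, CUTS, 0, 4.4))  # True
print(stays_consistent_without(ROWS, DECISIONS, CUTS, 0, 4.6))  # False
```

Removing the Length cut-point 4.4 merges 4.3..4.4 with 4.4..4.6 and keeps all cases distinguishable, whereas removing 4.6 would place cases 1 and 4 in identical intervals with different decisions.
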
Datasets.
| Dataset | Cases | Number of Attributes | Concepts |
|---|---|---|---|
| Abalone | 4177 | 8 | 29 |
| Australian | 690 | 14 | 2 |
| Bankruptcy | 66 | 5 | 2 |
| Bupa | 345 | 6 | 2 |
| Connectionist Bench | 208 | 60 | 2 |
| Echocardiogram | 74 | 7 | 2 |
| Ecoli | 336 | 8 | 8 |
| Glass | 214 | 9 | 6 |
| Image Segmentation | 210 | 19 | 7 |
| Ionosphere | 351 | 34 | 2 |
| Iris | 150 | 4 | 3 |
| Leukemia | 415 | 175 | 2 |
| Pima | 768 | 8 | 2 |
| Spectrometry | 25,931 | 15 | 2 |
| Wave | 512 | 21 | 3 |
| Wine Recognition | 178 | 13 | 3 |
| Yeast | 1484 | 8 | 9 |
Figure 1. Error rate for consecutive scans for the yeast dataset.
Figure 2. Number of discretization intervals for consecutive scans for the yeast dataset.
Figure 3. Domains of all attributes for the yeast dataset.
Figure 4. Intervals for all attributes after the first scan for the yeast dataset.
Figure 5. Intervals for all attributes after the second scan for the yeast dataset.
Figure 6. Intervals for all attributes after the third scan for the yeast dataset.
Figure 7. Intervals for all attributes after merging based on minimal entropy for the second scan for the yeast dataset.
Figure 8. Intervals for all attributes after merging based on maximal entropy for the second scan for the yeast dataset.
Error rates for three approaches to merging.
| Dataset | No Merging | Scan Number | MIN Entropy | Scan Number | MAX Entropy | Scan Number |
|---|---|---|---|---|---|---|
| Abalone | 75.58 | 5 | - | - | - | - |
| Australian | 13.48 | 1 | 12.61 | 3 | 13.04 | 1 |
| Bankruptcy | 3.03 | 1 | - | - | - | - |
| Bupa | 29.28 | 3 | 30.43 | 2 | 30.43 | 2 |
| Connectionist Bench | 16.83 | 1 | 24.04 | 1 | 24.04 | 1 |
| Echocardiogram | 14.86 | 1 | 14.86 | 2 | 14.86 | 1 |
| Ecoli | 22.02 | 0 | 17.86 | 0 | 20.54 | 2 |
| Glass | 24.77 | 3 | 23.36 | 2 | 23.36 | 2 |
| Image Segmentation | 11.90 | 2 | - | - | 13.81 | 0 |
| Ionosphere | 5.98 | 2 | 5.98 | 1 | 5.98 | 4 |
| Iris | 4.67 | 2 | - | - | - | - |
| Leukemia | 21.20 | 2 | 26.27 | 2 | 21.20 | 2 |
| Pima | 24.09 | 2 | 24.48 | 0 | 24.61 | 0 |
| Spectrometry | 1.13 | 2 | 1.15 | 5 | 1.19 | 3 |
| Wave | 23.04 | 1 | 24.02 | 1 | 23.44 | 3 |
| Wine Recognition | 3.93 | 1 | 3.37 | 1 | 3.37 | 1 |
| Yeast | 51.75 | 3 | 49.12 | 5 | 49.93 | 2 |
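
A Friedman rank sum statistic can be computed from the error rates above as a rough illustration. The sketch restricts itself to the 13 datasets that have a value in all three columns and applies no tie correction; both choices are assumptions, since the handling of rows marked "-" is not stated here, so the resulting number need not match the authors' computation.

```python
def average_ranks(row):
    """Rank the values in a row (1 = smallest); tied values share their mean rank."""
    ordered = sorted(row)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in row]


def friedman_statistic(table):
    """Friedman chi-square statistic without tie correction."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


# (no merging, MIN entropy, MAX entropy) for datasets with all three values.
error_rates = [
    (13.48, 12.61, 13.04),  # Australian
    (29.28, 30.43, 30.43),  # Bupa
    (16.83, 24.04, 24.04),  # Connectionist Bench
    (14.86, 14.86, 14.86),  # Echocardiogram
    (22.02, 17.86, 20.54),  # Ecoli
    (24.77, 23.36, 23.36),  # Glass
    (5.98, 5.98, 5.98),     # Ionosphere
    (21.20, 26.27, 21.20),  # Leukemia
    (24.09, 24.48, 24.61),  # Pima
    (1.13, 1.15, 1.19),     # Spectrometry
    (23.04, 24.02, 23.44),  # Wave
    (3.93, 3.37, 3.37),     # Wine Recognition
    (51.75, 49.12, 49.93),  # Yeast
]

CRITICAL_5_PERCENT = 5.991  # chi-square critical value, 2 degrees of freedom

statistic = friedman_statistic(error_rates)
print(round(statistic, 3), statistic > CRITICAL_5_PERCENT)
```

Under these assumptions the statistic stays well below the 5% critical value for two degrees of freedom, in line with the reported absence of a significant difference between the three approaches.
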
Z scores for the test of differences between averages of error rates associated with three approaches to merging.
| Dataset | No Merging − Merging with MIN Entropy | Scan Number | No Merging − Merging with MAX Entropy | Scan Number |
|---|---|---|---|---|
| Abalone | - | - | - | - |
| Australian | 6.90 | 3 | 5.20 | 3 |
| Bankruptcy | - | - | - | - |
| Bupa | 22.90 | 0 | 12.75 | 0 |
| Connectionist Bench | −9.09 | 1 | −8.73 | 1 |
| Echocardiogram | −6.71 | 0 | −7.84 | 0 |
| Ecoli | 31.75 | 0 | −140.05 | 0 |
| Glass | −7.28 | 1 | 11.92 | 0 |
| Image Segmentation | −0.33 | 0 | −14.71 | 2 |
| Ionosphere | −41.36 | 2 | −8.85 | 3 |
| Iris | - | - | - | - |
| Leukemia | −51.18 | 0 | −45.40 | 0 |
| Pima | 16.34 | 0 | 27.16 | 2 |
| Spectrometry | 20.94 | 5 | −14.55 | 3 |
| Wave | 6.92 | 2 | 10.20 | 3 |
| Wine Recognition | −0.73 | 0 | 8.94 | 1 |
| Yeast | 25.68 | 2 | 23.14 | 2 |
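
The Z scores above compare average error rates over 30 repetitions. A minimal sketch of such a test follows, assuming the standard two-sample Z statistic for a difference of means and a two-tailed 1% threshold; the exact formula used by the authors is not shown here, and the input averages and standard deviations are hypothetical, not values from the paper.

```python
from math import sqrt


def z_score(mean_a, sd_a, mean_b, sd_b, runs=30):
    """Z statistic for the difference between two averages over `runs` repetitions."""
    return (mean_a - mean_b) / sqrt(sd_a ** 2 / runs + sd_b ** 2 / runs)


CRITICAL_1_PERCENT = 2.576  # two-tailed critical value of the standard normal

# Hypothetical averages and standard deviations, NOT taken from the paper.
z = z_score(24.5, 0.40, 24.0, 0.45)
print(abs(z) > CRITICAL_1_PERCENT)  # True
```

With this sign convention, a positive score means that the first approach (no merging in the table) has the larger average error rate.
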