Inhaúma Neves Ferraz, Ana Cristina Bicharra Garcia.
Abstract
Data mining emerged to address the problem of transforming data into useful knowledge. Although most data mining techniques, such as association rule mining, can substantially reduce the effort of searching large data sets, the resulting output often exceeds the amount of information a person can manage. Conversely, important association rules may be overlooked because of the support threshold, a highly subjective metric that nonetheless underlies most data mining techniques. This paper studies the effects, in terms of precision and recall, of a data preparation technique called SemPrune, which is built on a domain ontology. SemPrune is intended for the pre- and post-processing phases of data mining. Identifying generalization/specialization relations, as well as composition/decomposition relations, is the key to applying SemPrune successfully.
Keywords: Association rules; Data mining; Ontology; Post-processing; Preprocessing; Pruning
Year: 2013 PMID: 24083103 PMCID: PMC3786067 DOI: 10.1186/2193-1801-2-452
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
Figure 1. Post-processing semantic enrichment model.
Figure 2. SemPrune’s semantic enrichment during the data mining pre-processing phase.
Figure 3. Partial view of the interaction between databases and ontologies.
Effect of SemPrune post-processing on the number of extracted rules
| Database | Mined rules | Rules with dependency | Eliminated rules | Reduction percentage |
|---|---|---|---|---|
| Adult | 2924 | 1988 | 515 | 17.61% |
| Stulong | 34422 | 12601 | 8163 | 23.71% |
| Labor | 181229 | 134962 | 32874 | 18.14% |
Effect of SemPrune pre-processing on the number of extracted rules
| Database | Mined rules | Eliminated rules | Reduction percentage |
|---|---|---|---|
| Northwind Traders | 9941 | 5052 | 50.82% |
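The reduction percentages in the two tables above appear to be the ratio of eliminated rules to mined rules. A minimal Python sketch, assuming that reading (the function name `reduction_percentage` is illustrative and does not come from the paper), recomputes the reported figures from the tabulated counts:

```python
def reduction_percentage(mined: int, eliminated: int) -> float:
    """Share of mined rules that were eliminated, in percent (assumed definition)."""
    return 100.0 * eliminated / mined


# (mined rules, eliminated rules), copied from the tables above.
counts = {
    "Adult": (2924, 515),
    "Stulong": (34422, 8163),
    "Labor": (181229, 32874),
    "Northwind Traders": (9941, 5052),
}

for name, (mined, eliminated) in counts.items():
    print(f"{name}: {reduction_percentage(mined, eliminated):.2f}%")
```

Under this assumption the recomputed values agree with the tabulated percentages to two decimal places.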
Effects of SemPrune on precision
| Database | Mined rules | Eliminated rules | Remaining rules | Precision gain |
|---|---|---|---|---|
| Adult | 2924 | 515 | 2409 | 21.38% |
| Stulong | 34422 | 8163 | 26259 | 31.09% |
| Labor | 181229 | 32874 | 148355 | 22.16% |
Results obtained using the interest measures for the STULONG database
| Interest measure | Quantity | 1st cut-off | 2nd cut-off | 3rd cut-off | 4th cut-off | 5th cut-off |
|---|---|---|---|---|---|---|
| Conviction | Cut-off | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 |
| | % eliminated rules | 48.85% | 53.33% | 58.04% | 61.45% | 63.18% |
| | # remaining rules | 1597 | 1377 | 1238 | 1137 | 1086 |
| | SemPrune % of eliminated rules | 26.35% | 16.26% | 14.47% | 12.60% | 11.54% |
| Specificity | Cut-off | 0.95 | 0.97 | 0.98 | 0.99 | 1.00 |
| | % eliminated rules | 12.44% | 26.33% | 5.22% | 53.43% | 59.38% |
| | # remaining rules | 2583 | 1307 | 2180 | 1374 | 1198 |
| | SemPrune % of eliminated rules | 12.83% | 8.91% | 2.46% | 0.65% | 0.00% |
| Lift | Cut-off | 1.0 | 1.1 | 1.2 | 1.3 | 1.4 |
| | % eliminated rules | 0.00% | 20.86% | 24.89% | 25.30% | 26.06% |
| | # remaining rules | 2950 | 921 | 2216 | 2204 | 2181 |
| | SemPrune % of eliminated rules | 17.61% | 5.84% | 0.00% | 0.00% | 0.00% |
| Novelty | Cut-off | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 |
| | % eliminated rules | 0.00% | 10.00% | 100.00% | 100.00% | 100.00% |
| | # remaining rules | 2950 | 0 | 0 | 0 | 0 |
| | SemPrune % of eliminated rules | 17.61% | 0.00% | 0.00% | 0.00% | 0.00% |
Performance metrics for the STULONG database
| Conviction cut-off | | W | WN | Specificity cut-off | | W | WN | Lift cut-off | | W | WN | Novelty cut-off | | W | WN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.1 | S | 807 | 966 | 0.95 | S | 212 | 1561 | 1.0 | S | 0 | 1773 | 0.0 | S | 0 | 1773 |
| | SN | 546 | 631 | | SN | 155 | 1022 | | SN | 0 | 1177 | | SN | 0 | 1177 |
| 1.2 | S | 923 | 850 | 0.97 | S | 466 | 1307 | 1.1 | S | 359 | 1414 | 0.1 | S | 1773 | 0 |
| | SN | 650 | 527 | | SN | 311 | 866 | | SN | 256 | 921 | | SN | 1177 | 0 |
| 1.3 | S | 1004 | 769 | 0.98 | S | 72 | 1051 | 1.2 | S | 429 | 1344 | 0.2 | S | 1773 | 0 |
| | SN | 708 | 469 | | SN | 48 | 1129 | | SN | 305 | 872 | | SN | 1177 | 0 |
| 1.4 | S | 1053 | 720 | 0.99 | S | 943 | 830 | 1.3 | S | 436 | 1337 | 0.3 | S | 1773 | 0 |
| | SN | 760 | 417 | | SN | 633 | 544 | | SN | 310 | 867 | | SN | 1177 | 0 |
| 1.5 | S | 1079 | 694 | 1.00 | S | 1048 | 725 | 1.4 | S | 449 | 1324 | 0.4 | S | 1773 | 0 |
| | SN | 785 | 392 | | SN | 704 | 473 | | SN | 320 | 857 | | SN | 1177 | 0 |
Interest measures for the Northwind traders database
| Interest measure | Quantity | 1st cut-off | 2nd cut-off | 3rd cut-off | 4th cut-off | 5th cut-off |
|---|---|---|---|---|---|---|
| Conviction | Cut-off | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 |
| | % eliminated rules | 16.50% | 16.50% | 16.50% | 16.79% | 17.05% |
| | # remaining rules | 4515 | 4515 | 4515 | 4485 | 4499 |
| | SemPrune % of eliminated rules | 83.01% | 83.01% | 83.01% | 83.03% | 83.04% |
| Specificity | Cut-off | 0.95 | 0.97 | 0.98 | 0.99 | 1.00 |
| | % eliminated rules | 59.20% | 74.31% | 77.07% | 83.50% | 83.50% |
| | # remaining rules | 2206 | 1389 | 1240 | 892 | 892 |
| | SemPrune % of eliminated rules | 84.27% | 80.49% | 71.37% | 69.06% | 62.89% |
| Lift | Cut-off | 1.0 | 1.1 | 1.2 | 1.3 | 1.4 |
| | % eliminated rules | 0.00% | 0.00% | 0.00% | 0.00% | 0.55% |
| | # remaining rules | 5407 | 5407 | 5407 | 5407 | 5377 |
| | SemPrune % of eliminated rules | 79.67% | 79.67% | 79.67% | 79.67% | 79.69% |
| Novelty | Cut-off | 0.0 | 0.1 | 0.2 | 0.3 | 0.4 |
| | % eliminated rules | 0.00% | 98.17% | 99.93% | 100.00% | 100.00% |
| | # remaining rules | 5407 | 0 | 0 | 0 | 0 |
| | SemPrune % of eliminated rules | 76.67% | 100.00% | 100.00% | 0.00% | 0.00% |
Performance table for the Northwind traders database
| Conviction cut-off | | W | WN | Specificity cut-off | | W | WN | Lift cut-off | | W | WN | Novelty cut-off | | W | WN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.1 | S | 561 | 3747 | 0.95 | S | 2671 | 1637 | 1.0 | S | 0 | 4308 | 0.0 | S | 0 | 4308 |
| | SN | 331 | 768 | | SN | 530 | 569 | | SN | 0 | 1099 | | SN | 0 | 1099 |
| 1.2 | S | 561 | 3747 | 0.97 | S | 3353 | 955 | 1.1 | S | 0 | 4308 | 0.1 | S | 4235 | 73 |
| | SN | 331 | 768 | | SN | 665 | 434 | | SN | 0 | 1099 | | SN | 1073 | 26 |
| 1.3 | S | 561 | 3747 | 0.98 | S | 3454 | 854 | 1.2 | S | 0 | 4308 | 0.2 | S | 4304 | 4 |
| | SN | 331 | 768 | | SN | 713 | 386 | | SN | 0 | 1099 | | SN | 1099 | 0 |
| 1.4 | S | 572 | 3736 | 0.99 | S | 3747 | 561 | 1.3 | S | 0 | 4308 | 0.3 | S | 4308 | 0 |
| | SN | 336 | 763 | | SN | 768 | 331 | | SN | 0 | 1099 | | SN | 1099 | 0 |
| 1.5 | S | 584 | 3724 | 1.00 | S | 3747 | 561 | 1.4 | S | 23 | 4285 | 0.4 | S | 4308 | 0 |
| | SN | 338 | 761 | | SN | 768 | 331 | | SN | 7 | 1092 | | SN | 1099 | 0 |
SemPrune and recall
| Database | Number of mined rules | Number of rules in the enriched rules’ set | Number of eliminated rules | Number of remaining rules | Recall gain |
|---|---|---|---|---|---|
| Northwind Traders | 6070 | 9941 | 5052 | 4889 | 63.77% |
Precision gain of SemPrune model
| Database | Mined rules | Eliminated rules | Remaining rules | Precision gain |
|---|---|---|---|---|
| Adult | 2924 | 515 | 2409 | 21.38% |
| Stulong | 34422 | 8163 | 26259 | 31.09% |
| Labor | 181229 | 32874 | 148355 | 22.16% |
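The precision gains in this table appear consistent with dividing the number of eliminated rules by the number of remaining rules, where the remaining rules are the mined rules minus the eliminated ones. The sketch below checks that reading in Python; the formula is inferred from the tabulated numbers, and the function name `precision_gain` is hypothetical rather than taken from the paper.

```python
def precision_gain(mined: int, eliminated: int) -> float:
    """Eliminated rules relative to remaining rules, in percent (inferred definition)."""
    remaining = mined - eliminated
    return 100.0 * eliminated / remaining


# (mined rules, eliminated rules), copied from the table above.
rows = {
    "Adult": (2924, 515),
    "Stulong": (34422, 8163),
    "Labor": (181229, 32874),
}

for name, (mined, eliminated) in rows.items():
    remaining = mined - eliminated
    print(f"{name}: remaining={remaining}, gain={precision_gain(mined, eliminated):.2f}%")
```

One way to read this ratio: if none of the eliminated rules were relevant, the fraction of relevant rules in the output grows by the factor mined/remaining, and the gain is that factor minus one.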
Recall gain of SemPrune model
| Database | Mined rules | Enriched rules set | Eliminated rules | Remaining rules | Recall gain |
|---|---|---|---|---|---|
| Northwind Traders | 6070 | 9941 | 5052 | 4889 | 63.77% |
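The recall gain reported for Northwind Traders appears to match the relative growth of the rule set after semantic enrichment, that is, the number of rules added by enrichment divided by the number of rules originally mined. The following Python sketch works through that arithmetic; the formula is an inference from the table, and `recall_gain` is an illustrative name.

```python
def recall_gain(mined: int, enriched: int) -> float:
    """Rules added by enrichment relative to originally mined rules, in percent (inferred)."""
    return 100.0 * (enriched - mined) / mined


mined, enriched, eliminated = 6070, 9941, 5052  # counts from the table above

remaining = enriched - eliminated  # rules left after pruning the enriched set
print(f"remaining rules: {remaining}")
print(f"recall gain: {recall_gain(mined, enriched):.2f}%")
```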