András Király, Attila Gyenesei, János Abonyi.
Abstract
During the last decade, various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). Both methodologies share a common limitation: they are of limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered quickly. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available to researchers.
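To make the idea of matrix multiplication on binary data concrete, the following is a minimal sketch in plain Python, not the authors' MATLAB implementation. It assumes the data are stored as a bit-table `B` with one row per transaction and one column per item, and it computes the product BᵀB, whose entry (i, j) counts the rows containing both item i and item j. The function names, the toy matrix, and the threshold are hypothetical and chosen only for illustration.

```python
def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def pair_supports(bit_table):
    """Return S = B^T B.

    S[i][i] is the support of item i; S[i][j] is the number of rows
    that contain both item i and item j.
    """
    return matmul(transpose(bit_table), bit_table)


def frequent_pairs(bit_table, minsupp):
    """Return {(i, j): support} for item pairs with support >= minsupp."""
    s = pair_supports(bit_table)
    n = len(s)
    return {
        (i, j): s[i][j]
        for i in range(n)
        for j in range(i + 1, n)
        if s[i][j] >= minsupp
    }


# Hypothetical toy data: 4 transactions (rows) over 3 items (columns).
B = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 1],
]

S = pair_supports(B)
pairs = frequent_pairs(B, minsupp=2)
```

In this sketch a single matrix product yields the supports of all 1-itemsets (the diagonal) and all 2-itemsets (the off-diagonal entries) at once. Larger itemsets would need further element-wise products of columns, which are not shown here.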
Year: 2014 PMID: 24616651 PMCID: PMC3925583 DOI: 10.1155/2014/870406
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Figure 1: Illustrative representation of biclusters/frequent closed itemsets on binary data.
Figure 2: Bit-table representation of market basket data.
Figure 3: Schematic view of frequent closed itemset discovery.
Pseudocode 1: Pseudocode of the Apriori-like algorithm.
Figure 4: Mining process example using the bit-table representation.
Algorithm 1: MATLAB code 2: mining frequent itemsets.
Algorithm 2: MATLAB code 3: the generation of closed frequent itemsets.
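The MATLAB listings themselves are not reproduced in this record. As a rough illustration of what closedness means on a bit-table, here is a brute-force sketch in plain Python. It is not the paper's algorithm: it enumerates every candidate itemset, so it is practical only for tiny inputs. It uses the standard definition that an itemset is closed when no further item is present in all of its supporting rows. All names and the toy matrix are hypothetical, and `minsupp` is assumed to be at least 1.

```python
from itertools import combinations


def support_rows(bit_table, itemset):
    """Indicator vector: 1 for each row that contains every item in itemset."""
    return [int(all(row[j] for j in itemset)) for row in bit_table]


def closure(bit_table, itemset):
    """Items shared by all rows that support itemset."""
    v = support_rows(bit_table, itemset)
    supp = sum(v)
    n_items = len(bit_table[0])
    # Item j is in the closure when the dot product of v with column j
    # equals the support, i.e. every supporting row also contains item j.
    return tuple(
        j
        for j in range(n_items)
        if sum(vi * row[j] for vi, row in zip(v, bit_table)) == supp
    )


def closed_frequent_itemsets(bit_table, minsupp):
    """Brute force: return {itemset: support} for closed itemsets (minsupp >= 1)."""
    n_items = len(bit_table[0])
    result = {}
    for k in range(1, n_items + 1):
        for itemset in combinations(range(n_items), k):
            supp = sum(support_rows(bit_table, itemset))
            if supp >= minsupp and closure(bit_table, itemset) == itemset:
                result[itemset] = supp
    return result


# Hypothetical toy data: 4 transactions (rows) over 3 items (columns).
B = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 1, 0],
]

closed = closed_frequent_itemsets(B, minsupp=2)
```

Each closed itemset, taken together with its supporting rows, forms an all-ones submatrix that cannot be extended by another column, which is why closed itemsets and biclusters of ones can be counted against each other in the tables that follow.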
Performance test using synthetic data.
| Size | Density | Minsupp | Number of closed itemsets | Closed itemset mining time (s) | Number of BiMAX biclusters | BiMAX time (s) |
|---|---|---|---|---|---|---|
| 50 × 50 | 10% | 2 | 78 | 0.8 | 78 | ~1 |
| 50 × 50 | 20% | 4 | 140 | 1.1 | 140 | ~1 |
| 50 × 50 | 50% | 15 | 238 | 0.9 | 238 | ~1 |
| 100 × 100 | 10% | 3 | 337 | 5 | 337 | ~2 |
| 100 × 100 | 20% | 7 | 488 | 7 | 488 | ~2 |
| 100 × 100 | 50% | 30 | 694 | 9 | 694 | ~3 |
| 300 × 300 | 10% | 8 | 437 | 17 | 437 | ~5 |
| 300 × 300 | 20% | 22 | 156 | 6 | 156 | 52 |
| 300 × 300 | 50% | 90 | 1038 | 40 | 1038 | >600 |
| 700 × 700 | 10% | 15 | 1318 | 120 | 1318 | 195 |
| 700 × 700 | 20% | 45 | 375 | 33 | 375 | >300 |
| 700 × 700 | 50% | 210 | 283 | 25 | 283 | >300 |
| 1000 × 1000 | 10% | 20 | 1496 | 196 | 1496 | >600 |
| 1000 × 1000 | 20% | 60 | 714 | 92 | 714 | >600 |
| 1000 × 1000 | 50% | 290 | 1030 | 135 | 1030 | >600 |
Test runs using biological data.
| Name | Size | Minsupp | Number of closed itemsets | Closed itemset mining time (s) | Number of BiMAX biclusters | BiMAX time (s) |
|---|---|---|---|---|---|---|
| Compendium | 6316 × 300 | 50 | 2594 | 12 | 2594 | ~19 |
| StemCell-27 | 45276 × 27 | 200 | 7972 | 27 | 7972 | ~115 |
| Leukemia | 125336 × 72 | 400 | 3643 | 147 | 3643 | >600 |
| StemCell-9 | 1840 × 9 | 2 | 177 | 0.8 | 177 | ~1 |
| Yeast-80 | 6221 × 80 | 80 | 3285 | 8 | 3285 | ~17 |