Stefan Naulaerts, Pieter Meysman, Wout Bittremieux, Trung Nghia Vu, Wim Vanden Berghe, Bart Goethals, Kris Laukens.
Abstract
Over the past two decades, pattern mining techniques have become an integral part of many bioinformatics solutions. Frequent itemset mining is a popular group of pattern mining techniques designed to identify elements that frequently co-occur. An archetypical example is the identification of products that often end up together in the same shopping basket in supermarket transactions. A number of algorithms have been developed to address variations of this computationally non-trivial problem. Frequent itemset mining techniques can efficiently capture the characteristics of (complex) data and succinctly summarize them. Owing to these and other interesting properties, these techniques have proven their value in biological data analysis. Nevertheless, information about the bioinformatics applications of these techniques remains scattered. In this primer, we introduce frequent itemset mining and its derived association rules for life scientists. We give an overview of various algorithms, and illustrate how they can be used in several real-life bioinformatics application domains. We end with a discussion of the future potential and open challenges for frequent itemset mining in the life sciences.
Keywords: association rule; biclustering; frequent item set; market basket analysis; pattern mining
Year: 2013 PMID: 24162173 PMCID: PMC4364064 DOI: 10.1093/bib/bbt074
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1: Toy example to demonstrate how frequent itemsets and association rules can be derived from a series of transactions. Transactions are indicated by circular boxes, and are labeled as (tid, I), where tid is the transaction identifier and I = {i₁, …, iₙ} is an itemset containing the items i₁ to iₙ. Frequent itemsets are represented as square boxes, and association rules are shown as octagonal boxes.
Measures related to the itemsets and association rules presented in Figure 1
| Rule | Support | Confidence | Lift | Coverage |
|---|---|---|---|---|
| {a} ⇒ {b} | 2 | 100% | 33% | 2 |
| {b} ⇒ {a} | 2 | 66% | 33% | 3 |
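The measures in this table can be recomputed with a few lines of Python. The sketch below is illustrative only: the transactions are hypothetical (they are not the ones drawn in Figure 1) and were chosen so that {a} occurs twice, {b} three times and {a, b} twice, which is all the table requires. The function names are also assumptions. The Lift values of 33% appear to match supp(X ∪ Y) / (supp(X) · supp(Y)) evaluated on absolute counts, that is 2 / (2 · 3); the more common definition uses relative supports and therefore depends on the total number of transactions, so both variants are computed.

```python
# Hypothetical transactions (tid -> itemset); NOT the ones shown in Figure 1.
transactions = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"b", "c"},
    4: {"c", "d"},
}


def support(itemset):
    """Absolute support: number of transactions containing every item."""
    return sum(1 for items in transactions.values() if itemset <= items)


def rule_measures(antecedent, consequent):
    """Measures for the rule antecedent => consequent."""
    both = support(antecedent | consequent)
    coverage = support(antecedent)
    consequent_support = support(consequent)
    n = len(transactions)
    return {
        "support": both,
        "confidence": both / coverage,
        "coverage": coverage,
        # Count-based variant that reproduces the 33% in the table.
        "lift_counts": both / (coverage * consequent_support),
        # Common definition on relative supports; depends on n.
        "lift_relative": both * n / (coverage * consequent_support),
    }


a_to_b = rule_measures({"a"}, {"b"})
b_to_a = rule_measures({"b"}, {"a"})
print(a_to_b)
print(b_to_a)
```

Support and lift are symmetric in the two sides of a rule, whereas confidence and coverage depend on which itemset is the antecedent, which is why the two rows of the table differ only in those two columns.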
Overview of popular frequent itemset mining algorithms and implementations
| Algorithm | Itemsets, subgraphs or rules | Context | License | Publication | Additional information or implementations |
|---|---|---|---|---|---|
| Anets | All (apriori), various threshold measures | Annotation mining | GNU GPL | ||
| AGM | All (apriori) | Subgraph mining | / | / | |
| Apriori (Borgelt) | All | / | GNU GPL | ||
| Apriori (Goethals) | All | / | Research only | ||
| ARIA | All (apriori), various verifications | Annotation mining | / | ||
| Carpenter | Closed | Quantitative omics profiles | GNU GPL | ||
| CBA | All (apriori) | Classifiers | / | / | |
| CMAR | All (FP-growth) | Classifiers | / | / | |
| Cobbler | Closed | Quantitative omics profiles | / | / | |
| CODENSE | Coherent dense subgraphs | Subgraph mining | Research only | ||
| COLL | All (apriori), chi-squared threshold pruning | Annotation mining | Open source | ||
| COPS | All (FP-tree), score threshold | Biclustering | Contact authors | ||
| CPAR | All | Classifiers | / | / | |
| CPMine | All (eclat), machine learning | Structural patterns | / | / | |
| DeBi | Maximal | Biclustering | Creative Commons 2.0 | ||
| Distiller | Closed | Quantitative omics profiles | Academic use only | ||
| Eclat (Borgelt) | All | / | GNU GPL | ||
| FESP | Emerging patterns | Classifiers | / | / | |
| Farmer | Closed | Quantitative omics profiles | Research only | ||
| FSG | / | Subgraph mining | / | / | |
| FPGrowth | All | / | GNU GPL | / | |
| GenMax | Maximal | Quantitative omics profiles | Research only | ||
| GenMiner | Closed | Quantitative omics profiles | Research only | ||
| gSpan | All | Subgraph mining | Internal research only | ||
| KRIMP | Minimum description length | Future work | |||
| MAFIA | Maximal | / | Contact authors | ||
| MAGO | Multilevel association rules | Quantitative omics profiles | / | / | |
| MaxConf | Closed | Quantitative omics profiles | Research only | ||
| MaxMiner | Maximal | Quantitative omics profiles | / | / | |
| Min-Ex | δ-free itemsets | Quantitative omics profiles | / | / | |
| MULE | Maximal frequent connected subgraph | Subgraph mining | Research only | ||
| NetCAR | Maximal frequent connected subgraph | Classifiers | Research only | ||
| PathFinder | Large | Subgraph mining | / | / | |
| REMMAR | Shortest distance thresholding | Quantitative omics profiles | Research only | ||
| TD-Close | Closed | Quantitative omics profiles | / | / | |
| TopKRGs | Top-k | Quantitative omics profiles | / | / | |
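Many entries in this table are marked "apriori", meaning that they enumerate itemsets level by level and prune every candidate that has an infrequent subset. The following is a minimal, unoptimized sketch of that idea, not the code of any implementation listed above; the function name `apriori`, the example `baskets` and the threshold are assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: absolute support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical shopping baskets.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"c", "d"}]
result = apriori(baskets, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this example the candidate {a, b, c} is discarded without being counted because its subset {a, c} is infrequent. Eclat and FP-growth based miners reach the same set of frequent itemsets with different data structures.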
Overview of software frameworks for frequent itemset mining
| Application name | Description | License | Publication | Available from |
|---|---|---|---|---|
| Arules | FIM toolbox in R | GNU GPL-2 | ||
| ARtool | FIM toolbox for binary databases | GNU GPL | ||
| KNIME Desktop | Data analytics platform | GNU GPL | ||
| MIME | Interactive FIM toolbox | Research only | ||
| Orange | Data analytics platform | GNU GPL-3 | ||
| PyFIM | Python library | GNU LPL | ||
| Rapidminer | Data analytics platform | AGPL-3 | ||
| SPMF | FIM toolbox | GNU GPL-3 | / | |
| Weka | Machine learning library | GNU GPL | ||
Figure 2: Mining for frequent co-occurrences in annotations. Annotations can be mapped to biological entities, such as interactions between biological molecules. Each transaction is then composed of the transaction identifier (representing the interaction between the two partners) and the items (the annotations corresponding to each of the biomolecules). Frequent itemset miners can be used to uncover patterns of frequently co-occurring annotations, and several interestingness measures can be computed (e.g. support). This information can then be interpreted by the researcher or used to create weighted protein networks [16].
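The transaction layout described in this caption can be sketched as follows. All protein identifiers, annotation terms, interactions and the support threshold are hypothetical. As a simplification, the annotations of both partners are merged into one set, which discards the information about which partner carries which annotation.

```python
from collections import Counter
from itertools import combinations

# Hypothetical annotations per protein.
annotations = {
    "P1": {"kinase", "nucleus"},
    "P2": {"kinase", "membrane"},
    "P3": {"nucleus", "transcription factor"},
    "P4": {"kinase", "nucleus"},
}
# Hypothetical interactions between those proteins.
interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P3", "P4")]

# One transaction per interaction: the interacting pair is the tid,
# the items are the annotations of both partners.
transactions = {
    (p, q): annotations[p] | annotations[q] for p, q in interactions
}

# Count in how many interactions each pair of annotations co-occurs.
pair_support = Counter()
for items in transactions.values():
    for pair in combinations(sorted(items), 2):
        pair_support[pair] += 1

min_support = 3  # assumed threshold
frequent_pairs = {
    pair: count for pair, count in pair_support.items() if count >= min_support
}
print(frequent_pairs)
```

Only pairs are counted here to keep the example short; a full frequent itemset miner would extend the same counting to larger itemsets.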
Figure 3: Visualization of structural pattern mining. Here the biological sequence of a domain on a biomolecule is processed with frequent itemset mining algorithms to identify conserved motifs. These motifs incorporate the underlying dependencies between the items in the form of the support value or other quality measures.
Figure 4: From expression matrix to bicluster. Gene expression data are converted into a matrix, and each value is discretized into a regulation category. In this figure, there are three categories: up, down or unchanged. This matrix can then be converted into a layout suitable for frequent itemset miners (transactional layout) to generate biclusters or rules.
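The conversion in this caption might look like the sketch below. The expression values, gene and condition names, and the cutoff of 1.0 are all hypothetical, and dropping "unchanged" items from the transactions is a design choice made here to keep the patterns informative, not something prescribed by the figure.

```python
# Hypothetical log2 fold changes: one row per gene, one column per condition.
conditions = ["c1", "c2", "c3"]
expression = {
    "geneA": [1.8, -0.2, 2.1],
    "geneB": [1.5, 0.1, 1.9],
    "geneC": [-1.7, 0.3, -2.2],
}
THRESHOLD = 1.0  # assumed cutoff for calling a gene up or down


def discretize(value, threshold=THRESHOLD):
    """Map a numeric value to one of three regulation categories."""
    if value >= threshold:
        return "up"
    if value <= -threshold:
        return "down"
    return "unchanged"


# Transactional layout: one transaction per gene, with items of the form
# "condition=category". Unchanged entries are omitted.
transactions = {
    gene: {
        f"{cond}={discretize(value)}"
        for cond, value in zip(conditions, values)
        if discretize(value) != "unchanged"
    }
    for gene, values in expression.items()
}

# The genes that share an itemset, together with its conditions, form a bicluster.
pattern = {"c1=up", "c3=up"}
bicluster_genes = sorted(g for g, items in transactions.items() if pattern <= items)
print(transactions)
print(bicluster_genes)
```

In practice the pattern would not be fixed by hand; a frequent (closed) itemset miner would report every itemset whose supporting gene set is large enough.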
Figure 5: Gene interaction networks in mouse, human and rat as derived from STRING [12]. Frequent edges among these interaction networks can be extracted and presented as a frequent subgraph. Conserved subgraphs can have universal functional importance within the studied species.
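A simplified version of the frequent edge extraction in this caption treats each network as a transaction whose items are undirected edges. The sketch assumes that orthologous genes have already been mapped to shared identifiers, so no graph isomorphism testing is needed; general subgraph miners such as gSpan do not rely on that assumption. The gene labels, edges and threshold are hypothetical.

```python
from collections import Counter

# Hypothetical interaction networks with shared gene identifiers.
networks = {
    "mouse": [("g1", "g2"), ("g2", "g3"), ("g3", "g4")],
    "human": [("g2", "g1"), ("g2", "g3"), ("g1", "g4")],
    "rat": [("g1", "g2"), ("g3", "g2"), ("g4", "g5")],
}

# Each network is one transaction; each undirected edge is one item.
edge_support = Counter()
for edges in networks.values():
    for edge in {frozenset(e) for e in edges}:
        edge_support[edge] += 1

min_support = 3  # assumed: the edge must occur in all three species
frequent_edges = sorted(
    tuple(sorted(edge)) for edge, count in edge_support.items()
    if count >= min_support
)
print(frequent_edges)
```

The frequent edges together form the conserved subgraph, here the path g1, g2, g3.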