Rui Henriques, Sara C Madeira.
Abstract
BACKGROUND: Biclustering has been widely used in biological data analysis, enabling the discovery of putative functional modules from omic and network data. Despite the recognized importance of incorporating domain knowledge to guide biclustering and guarantee a focus on relevant and non-trivial biclusters, this possibility has not yet been comprehensively addressed. This is because most existing algorithms can only deliver sub-optimal solutions under restrictive assumptions on the structure, coherency and quality of biclustering solutions, which prevents the up-front satisfaction of knowledge-driven constraints. In recent years, a clearer understanding of the synergies between pattern mining and biclustering has given rise to a new class of algorithms, termed pattern-based biclustering algorithms. These algorithms can efficiently discover flexible biclustering solutions with optimality guarantees and are therefore good candidates for knowledge incorporation. In this context, this work aims to address the current lack of solid views on the use of background knowledge to guide (pattern-based) biclustering tasks.
Year: 2016 PMID: 27651825 PMCID: PMC5024481 DOI: 10.1186/s13015-016-0085-5
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 Proposed contributions to an effective incorporation of constraints with distinct properties into (pattern-based) biclustering tasks
Fig. 2 Pattern-based biclusters with distinct coherency assumptions
Fig. 3 Discovery of biclusters with constant and order-preserving assumptions based on full-patterns (itemsets and sequences) discovered from transactional databases mapped from the original data matrix
Fig. 4 Symbolic dataset and corresponding “price table”
Fig. 5 Behavior of F2G (detailed in [17]). The FP-tree is created from the input database with transactions annotated in leaves; a conditional pattern is created for each node in the FP-tree; conditional FP-trees are projected from each conditional pattern (transactions are moved up along the tree to enable the discovery of full-patterns); conditional FP-trees are recursively mined and patterns are grown if frequent; whenever a conditional FP-tree contains a single path, all frequent patterns are enumerated
Fig. 6 Simplified illustration of BiC2PAM behavior: (1) transactional and sequential databases are derived from a multi-item matrix; (2) constraints are processed; (3) pattern mining searches are applied with a decreasing support; and (4) the discovered pattern-based biclusters that satisfy the input constraints are postprocessed
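The mapping described for Fig. 3 and step (1) of Fig. 6 can be illustrated with a minimal sketch. It assumes that a "full-pattern" is an itemset together with the rows that support it (a reading based on the captions), and it uses a brute-force enumeration rather than F2G or any FP-tree search. The function names `to_transactions` and `constant_biclusters` and the thresholds `min_rows` and `min_cols` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def to_transactions(matrix):
    """Map each row of a symbolic matrix to a set of (column, symbol) items."""
    return [frozenset((j, v) for j, v in enumerate(row)) for row in matrix]


def constant_biclusters(matrix, min_rows=2, min_cols=2):
    """Brute-force sketch (NOT F2G): enumerate itemsets shared by groups of
    rows and keep each one together with all rows that support it.
    Exponential in the number of rows, so only suitable for toy inputs."""
    transactions = to_transactions(matrix)
    found = {}
    for size in range(min_rows, len(transactions) + 1):
        for rows in combinations(range(len(transactions)), size):
            # Items (column, symbol) common to every row in the group.
            pattern = frozenset.intersection(*(transactions[i] for i in rows))
            if len(pattern) < min_cols:
                continue
            # All rows containing the pattern form the bicluster's row set.
            support = frozenset(
                i for i, t in enumerate(transactions) if pattern <= t
            )
            found[support] = pattern
    # Report each bicluster as (rows, columns), both sorted.
    return sorted(
        (tuple(sorted(s)), tuple(sorted(col for col, _ in p)))
        for s, p in found.items()
    )


# Toy symbolic matrix (4 rows, 3 columns), invented for this example.
toy = [
    [1, 2, 0],
    [1, 2, 3],
    [1, 2, 3],
    [0, 1, 3],
]
biclusters = constant_biclusters(toy)
```

On this toy matrix the sketch reports two biclusters with constant values per column: rows 0 to 2 on columns 0 and 1, and rows 1 and 2 on all three columns.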
Properties of the generated dataset settings.
| Non-exhaustive list of matrices | 500 × 50 | 1000 × 100 | 2000 × 200 | 4000 × 400 |
|---|---|---|---|---|
| Number of hidden biclusters | | | | |
| Number of rows per hidden bicluster | | | | |
| Number of columns per hidden bicluster | | | | |
where defines the flexibility of the underlying coherency assumption ( = 1 for constant and = 2 for order-preserving)
Additional properties (default settings in bold):
Coherency strength = {5, 10, 15, 20, 25, 33 %} (or symbols = {20, 10, 7, 5, 4, 3})
Deviations on data values in {0, /2, , 2}, and degree of noisy and missing elements in {0, 2, 5, 10 %}
Overlapping degree = {0, 0.1, 0.2, 0.4} with plaid effects described by f = {sum, product, weighted} (cumulative function) = {1, 0.7, 0.4} (cumulative effect), = {0.1, 0.2} (noise), = {0.5, 0.3, 0.1 K} (average number of interacting biclusters) and = {1, 0.8, 0.5} (distribution of overlapping areas between the bics)— variables according to [20]
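The coherency strengths and symbol counts listed above are numerically consistent with an alphabet size of roughly 100 divided by the strength in percent. This relation is inferred from the listed pairs and is not stated in the text; the function name `symbols_for_strength` is hypothetical. A small sketch of the arithmetic:

```python
def symbols_for_strength(strength_pct):
    """Assumed relation: number of symbols ~ 100 / coherency strength (%),
    rounded to the nearest integer."""
    return round(100 / strength_pct)


strengths = [5, 10, 15, 20, 25, 33]
symbols = [symbols_for_strength(s) for s in strengths]
```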
Fig. 7 Efficiency gains of BiC2PAM from succinct constraints specifying uninformative elements for varying data settings with constant and order-preserving biclusters and coherency strength defined by = 7
Fig. 8 BiC2PAM's ability to bicluster data with varying distributions of annotations (efficiency and Jaccard-based match scores [14] collected for the 2000 × 200 setting)
Fig. 9 BiC2PAM’s efficiency in the presence of succinct constraints (2000 × 200 setting with constant assumption)
Fig. 10 BiC2PAM’s efficiency with (combined) anti-monotone, monotone and convertible constraints (2000 × 200 setting with constant coherency). Impact of enhancing BiC2PAM with CFG [15] and FP-Bonsai [33] principles
Fig. 11 BiC2PAM performance with sequence constraints when learning order-preserving solutions (1000 × 100 setting)
Fig. 12 Impact of full-pattern growth searches on the performance of BiC2PAM for data with varying size (under a fixed coherency strength = 20 %) and for fixed data settings with varying coherency strength
Fig. 13 Efficiency of BiC2PAM with knowledge regarding the uninformative elements for the analysis of expression data (hughes, dlblc, yeast-cycle) when assuming a constant coherency with = 5
Fig. 14 Efficiency of BiC2PAM with knowledge regarding the uninformative elements for the analysis of network data (human, Escherichia coli, yeast from STRING [53]) when assuming constant coherency with = 5
Fig. 15 Performance of BiC2PAM for biclustering biological datasets (yeast-cycle and dlblc) annotated with representative human and yeast GO terms (terms associated with biological processes with more than 50 genes)
Fig. 16 Efficiency gains from using biologically meaningful constraints with succinct/monotone/convertible properties within BiC2PAM for the analysis of the gasch dataset (6152 × 176)
Fig. 17 Biological relevance of BiC2PAM for different constraint-based profiles of expression
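The Jaccard-based match scores mentioned for Fig. 8 compare discovered biclusters against hidden (planted) ones. The exact definition is the one given in [14]; the sketch below shows only one common formulation, which averages the best Jaccard overlap of each discovered bicluster with any hidden bicluster, measured over matrix cells. The names `jaccard` and `match_score`, and the representation of a bicluster as a `(rows, columns)` pair of sets, are assumptions made for illustration.

```python
def jaccard(b1, b2):
    """Jaccard index of two biclusters, each given as (row_set, col_set),
    computed over the matrix cells they cover."""
    (r1, c1), (r2, c2) = b1, b2
    inter = len(r1 & r2) * len(c1 & c2)
    union = len(r1) * len(c1) + len(r2) * len(c2) - inter
    return inter / union if union else 0.0


def match_score(found, hidden):
    """Average, over discovered biclusters, of the best Jaccard overlap
    with any hidden bicluster (one common formulation, assumed here)."""
    if not found:
        return 0.0
    best = [max((jaccard(b, h) for h in hidden), default=0.0) for b in found]
    return sum(best) / len(found)


# Toy example: one discovered bicluster slightly larger than the hidden one.
found = [({0, 1, 2}, {0, 1})]
hidden = [({0, 1}, {0, 1})]
score = match_score(found, hidden)
```

Here the discovered bicluster covers 6 cells, 4 of which belong to the hidden one, so the score is 4/6. Swapping the roles of `found` and `hidden` gives a complementary measure of how well hidden biclusters are recovered.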