Rui Henriques, Sara C Madeira.
Abstract
BACKGROUND: Biclustering, the discovery of sets of objects with a coherent pattern across a subset of conditions, is a critical task for studying a wide set of biomedical problems in which molecular units or patients are meaningfully related to a set of properties. The challenging combinatorial nature of this task led to the development of approaches that restrict the allowed type, number and quality of biclusters. In contrast, recent biclustering approaches relying on pattern mining methods can exhaustively discover flexible structures of robust biclusters. However, these approaches are only prepared to discover constant biclusters, and their underlying contributions remain dispersed.
Keywords: Biclustering; Biomedical data analysis; Pattern mining
Year: 2014 PMID: 25649207 PMCID: PMC4302537 DOI: 10.1186/s13015-014-0027-z
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Figure 1. Illustrative bicluster types and biclustering structures.
Figure 2. Discovering biclusters with a constant assumption across rows (a), columns (b) and overall elements (c) using frequent itemset mining (FIM). Column identifiers (y1, y2, y3) are combined with the observed values {0,1,2,3}, and FIM is applied under a parameterizable support threshold (θ=2 ∧ |P|≥2). Constant values on columns can be mined using the transpose matrix. To find biclusters with constant values overall, each item needs to be separately mined.
Figure 3. BicPAM’s methodology. BicPAM relies on three steps that determine the type, quality and structure of the biclustering solutions. Within each step, we make available principles based on existing contributions. Additionally, we propose key strategies within each step for the handling of noise, the accommodation of more flexible types of biclusters (with additive, multiplicative and symmetric properties) and the composition of alternative structures of biclusters.
Figure 4. Comparison of biclustering solutions using frequent itemsets, maximal frequent itemsets and closed frequent itemsets.
Figure 5. Impact of discretization options available in BicPAM.
Figure 6. Mapping methods to handle missings: relaxed, conservative ( ) and restrictive alternatives to imputation.
Figure 7. Strategies to deal with noise-relaxations.
Figure 8. Pattern-based discovery of biclusters under additive and multiplicative assumptions.
Figure 9. Pattern-based discovery of biclusters with symmetries for a constant coherency (a) and non-constant coherency (b).
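The itemization described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not BicPAM's implementation: the function names (`itemize`, `closed_patterns`, `constant_biclusters`), the toy matrix and the brute-force closure over row intersections are assumptions chosen for readability, and a real frequent itemset miner would replace the enumeration. Each row becomes a transaction of (column, value) items, and every itemset with at least `min_cols` items that is supported by at least `min_rows` rows yields a bicluster.

```python
from itertools import combinations


def itemize(matrix):
    """Turn each row into a transaction of (column index, value) items."""
    return [frozenset((j, v) for j, v in enumerate(row)) for row in matrix]


def closed_patterns(transactions, min_cols):
    """Enumerate itemsets that are intersections of two or more transactions.

    Brute force, suitable only for toy inputs (hypothetical helper).
    """
    patterns = set()
    frontier = {a & b for a, b in combinations(transactions, 2)}
    while frontier:
        pattern = frontier.pop()
        # Smaller patterns can only shrink further, so they are pruned here.
        if len(pattern) < min_cols or pattern in patterns:
            continue
        patterns.add(pattern)
        frontier.update(pattern & t for t in transactions)
    return patterns


def constant_biclusters(matrix, min_rows=2, min_cols=2):
    """Return (rows, columns) pairs whose rows share the same value per column."""
    transactions = itemize(matrix)
    biclusters = []
    for pattern in closed_patterns(transactions, min_cols):
        rows = [i for i, t in enumerate(transactions) if pattern <= t]
        if len(rows) >= min_rows:
            cols = sorted(j for j, _ in pattern)
            biclusters.append((rows, cols))
    return sorted(biclusters)


# Toy matrix (an assumption, not data from the paper): 4 rows, 3 columns.
toy_matrix = [
    [1, 3, 2],
    [1, 0, 2],
    [1, 3, 2],
    [0, 3, 1],
]

if __name__ == "__main__":
    for rows, cols in constant_biclusters(toy_matrix):
        print("rows", rows, "columns", cols)
```

With the thresholds of the caption (θ=2 and |P|≥2), the toy matrix yields two biclusters: rows 0, 1 and 2 over columns 0 and 2, and rows 0 and 2 over all three columns. The transposed case mentioned in the caption can be explored by passing `list(zip(*toy_matrix))` to the same function.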
Properties of the generated set of synthetic datasets
| | | | | | |
|---|---|---|---|---|---|
| Nr. of hidden biclusters | 3 | 5 | 10 | 15 | 20 |
| Nr. columns in biclusters | [5,7] | [6,8] | [6,10] | [6,14] | [6,20] |
| Nr. rows in biclusters | [10,20] | [15,30] | [20,40] | [40,70] | [60,100] |
| Area of biclusters | 9.0% | 2.6% | 2.4% | 2.1% | 1.3% |
Figure 10. FC levels across biclustering approaches using FABIA datasets.
Figure 11. Match scores across biclustering approaches using FABIA datasets.
Figure 12. Match scores of biclustering approaches using datasets with constant models.
Figure 13. Match scores of biclustering approaches using datasets with non-constant models.
Figure 14. Efficiency of biclustering approaches using the generated datasets.
Figure 15. Efficiency bounds of BicPAM for 10000 rows (magnitude of the human genome).
Figure 16. Performance of BicPAM under a constant assumption.
Figure 17. Performance of BicPAM under an additive assumption.
Figure 18. Performance of BicPAM under a multiplicative assumption.
Figure 19. Match score levels of BicPAM under constant, additive and multiplicative assumptions.
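The match scores (MS) reported in these figures are not defined in this record. As a hedged illustration, the sketch below assumes the commonly used definition of Prelić et al.: the average, over the biclusters of one solution, of the best Jaccard overlap between row sets with any bicluster of the other solution. Whether the paper uses exactly this variant is an assumption, and the names `jaccard` and `match_score` and the representation of a bicluster as a `(rows, columns)` pair are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_score(found, hidden):
    """Average best row-set overlap of each bicluster in `found` with `hidden`.

    Both arguments are lists of (rows, columns) pairs; only rows are compared.
    """
    if not found or not hidden:
        return 0.0
    best_overlaps = [
        max(jaccard(rows_f, rows_h) for rows_h, _ in hidden)
        for rows_f, _ in found
    ]
    return sum(best_overlaps) / len(found)


# Toy example (not data from the paper): one planted and one discovered bicluster.
hidden = [([0, 1, 2, 3], [0, 2])]
found = [([0, 1, 2], [0, 2])]

if __name__ == "__main__":
    print(match_score(found, hidden))  # share of the planted rows that were recovered
    print(match_score(hidden, found))  # share of the discovered rows that are correct
```

The score is not symmetric in general: `match_score(found, hidden)` penalizes discovered biclusters that match no planted one, while `match_score(hidden, found)` penalizes planted biclusters that were missed.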
FC and MS levels of BicPAM in different settings (mean and variance from 20 datasets)
| | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|
| FC | Constant | 0.862 ± 0.017 | 0.930 ± 0.014 | 0.884 ± 0.018 | 0.956 ± 0.007 | 0.909 ± 0.017 | 0.949 ± 0.006 | 0.907 ± 0.014 | 0.948 ± 0.011 |
| | Additive | 0.782 ± 0.021 | 0.831 ± 0.008 | 0.834 ± 0.014 | 0.888 ± 0.007 | 0.845 ± 0.018 | 0.897 ± 0.007 | 0.827 ± 0.015 | 0.887 ± 0.006 |
| | Multiplicative | 0.762 ± 0.028 | 0.794 ± 0.013 | 0.790 ± 0.019 | 0.825 ± 0.014 | 0.785 ± 0.020 | 0.840 ± 0.011 | 0.767 ± 0.020 | 0.819 ± 0.015 |
| MS | Constant | 0.923 ± 0.018 | 0.974 ± 0.007 | 0.931 ± 0.012 | 0.968 ± 0.005 | 0.935 ± 0.010 | 0.984 ± 0.005 | 0.944 ± 0.011 | 0.987 ± 0.008 |
| | Additive | 0.895 ± 0.017 | 0.945 ± 0.006 | 0.925 ± 0.012 | 0.963 ± 0.003 | 0.913 ± 0.008 | 0.981 ± 0.007 | 0.917 ± 0.011 | 0.974 ± 0.006 |
| | Multiplicative | 0.902 ± 0.019 | 0.958 ± 0.014 | 0.906 ± 0.015 | 0.953 ± 0.009 | 0.910 ± 0.015 | 0.941 ± 0.008 | 0.886 ± 0.019 | 0.948 ± 0.010 |
| | | | | | | | | | |
| | Constant | 0.956 ± 0.013 | 0.984 ± 0.006 | 0.960 ± 0.007 | 0.981 ± 0.004 | 0.961 ± 0.004 | 0.996 ± 0.002 | 0.957 ± 0.009 | 0.993 ± 0.002 |
| | Additive | 0.955 ± 0.012 | 0.997 ± 0.001 | 0.959 ± 0.006 | 0.997 ± 0.002 | 0.955 ± 0.004 | 0.995 ± 0.002 | 0.957 ± 0.007 | 0.995 ± 0.003 |
| | Multiplicative | 0.937 ± 0.015 | 0.966 ± 0.008 | 0.924 ± 0.012 | 0.968 ± 0.008 | 0.923 ± 0.010 | 0.963 ± 0.009 | 0.927 ± 0.013 | 0.974 ± 0.007 |
Figure 20. Comparison of pattern mining algorithms for the 1000×100 setting.
Figure 21. Impact of choosing alternative pattern representations over the 1000×100 data setting.
Figure 22. Comparing the handling of missings for data with varying levels of noise.
Figure 23. Impact of extending biclusters for data with varying levels of noise.
Figure 24. Impact of merging and filtering (reduction) for the 1000×100 setting. (a) Merging for varying overlapping degrees (5% of planted noise). (b) Filtering for varying homogeneity degrees (2% of planted noise).
Comparing the biological relevance and novelty of different biclustering solutions
| | | | | | |
|---|---|---|---|---|---|
| | BicPAM | 56 | 83×7 | 43 (77%) | Highest number of exclusively enriched terms (partial list in Table |
| | BiModule | 322 | 62×4 | 79 (25%) | Absence of closing options leads to redundant and less significant terms. |
| | DeBi | 31 | 73×6 | 21 (68%) | Loss of relevant terms due to the inability to discover all maximal biclusters. |
| | CC | 10 | 41×33 | 5 (50%) | Exclusive bicluster related to circulatory & cardiovascular system development. |
| | ISA | 72 | 23×8 | 8 (11%) | Exclusive bicluster for extracellular structure organization and heparin binding. |
| | Plaid | 3 | 12×49 | 1 (33%) | Majority of genes modeled in a single background bicluster with general terms. |
| | Fabia | 10 | 79×35 | 6 (60%) | Small bicluster with superior enrichment of antigen binding functions. |
| | Bexpa | 10 | 16×87 | 2 (20%) | Small sets of genes supported by a large number of conditions. |
| | Samba | 100 | 17×6 | 18 (18%) | Dedicated terms for antigen processing, peptide cross-linking and disassembly. |
| | OPSM | 12 | 128×5 | 5 (42%) | High variance of |
| | BicPAM | 47 | 360×7 | 38 (81%) | Exclusive enriched terms due to flexible coherency and post-processing criteria. |
| | BiModule | 219 | 285×4 | 43 (20%) | Terms with lower significance than terms from noise-tolerant BicPAM solutions. |
| | DeBi | 28 | 317×7 | 21 (75%) | Terms observed across very small sets of conditions (≤5) are not enriched. |
| | CC | 10 | 228×58 | 6 (60%) | GO terms covered by BicPAM constant biclusters. |
| | ISA | 8 | 120×4 | 5 (63%) | Small biclusters with exclusive significance GO terms: spindle pole and karyogamy. |
| | Plaid | 8 | 78×39 | 3 (38%) | One bicluster with higher significance for fungal-type cell wall assembly. |
| | Fabia | 10 | 210×49 | 5 (50%) | Higher significance observed for actin cortical patch and oxidoreductase GO-terms. |
| | Bexpa | 72 | 42×49 | 1 (10%) | Low number of enriched terms (probably due to the low |
| | Samba | 120 | 18×9 | 11 (9%) | Enriched terms covered by pattern-based biclustering solutions. |
| | OPSM | 6 | 531×4 | 3 (50%) | Exclusive bicluster for the negative regulation of metabolic processes. |
| | BicPAM | 149 | 411×8 | 123 (83%) | Large diversity of highly significant GO-terms (partial list in Table |
| | BiModule | 653 | 287×4 | 159 (24%) | Large but incomplete set of GO-terms as it excludes non-constant biclusters. |
| | DeBi | 82 | 310×6 | 61 (74%) | Significance of terms differs slightly from BicPAM due to the handling of noise. |
| | CC | 10 | 203×79 | 7 (70%) | Enriched terms appear in BicPAM solutions with higher significance. |
| | ISA | 23 | 292×22 | 18 (78%) | Enriched terms covered by pattern-based biclustering solutions. |
| | Plaid | 6 | 48×12 | 3 (50%) | Biclusters (apart from background layer) with lower enrichments than peers. |
| | Fabia | 10 | 310×41 | 8 (80%) | Bicluster with higher significance for specific proteasome complexes. |
| | Bexpa | 10 | 63×29 | 3 (33%) | The few biclusters with deviation in size (higher |
| | OPSM | 16 | 212×8 | 11 (69%) | One bicluster with higher significance for pre-ribosome functions. |
Summary on the biological relevance of BicPAM’s biclusters
| | | | | | | |
|---|---|---|---|---|---|---|
| | merging | 4803 | 81×7 | 28 | 22 | 5 |
| | relaxed | 980 | 83×9 | 24 | 19 | 3 |
| | tight | 7652 | 79×6 | 27 | 25 | 2 |
| | merge | 6311 | 432×6 | 36 | 19 | 12 |
| | relaxed | 1259 | 492×7 | 22 | 12 | 8 |
| | tight | 9210 | 398×5 | 39 | 22 | 11 |
| | merge | 27031 | 392×8 | 89 | 66 | 12 |
| | relaxed | 2177 | 486×11 | 67 | 49 | 11 |
| | tight | 52123 | 367×7 | 92 | 79 | 9 |
Terms highly enriched in BicPAM’s biclusters
| | | | | |
|---|---|---|---|---|
| | Dl1 | translational elongation; cytosolic part; translational initiation | 4.49E-5 | 81 |
| | Dl2 | Golgi apparatus; MHC protein complex | 5.40E-5 | 83 |
| | Dl3 | defense response; receptor activity; single organism signaling; vacuole; cell communication | 4.91E-5 | 162 |
| | Dl4 | immune response; response to interferon-gamma | 1.06E-4 | 58 |
| | Dl5 | immune system process | 1.27E-4 | 52 |
| | Dl6 | response to interferon-gamma; cellular response to chemical stimulus; response to cytokine stimulus | 0.001 | 60 |
| | Dl7 | membrane-enclosed lumen; cell division; cell cycle process | 2.92E-12 | 81 |
| | Dl8 | small molecule binding; catalytic activity; cell cycle process | 6.14E-8 | 108 |
| | H1 | mitochondrion organization; organellar ribosome; mitochondrial matrix; mitochondrial translation | 2.70E-39 | 416 |
| | H2 | cell periphery; cell wall constituent; oxidoreductase activity; cell wall organization; sexual sporulation | 1.73E-4 | 370 |
| | H3 | ribonucleoprotein complex biogenesis; nucleus | 3.61E-30 | 426 |
| | H4 | cellular amino acid metabolic/biosynthetic process; carboxylic acid metabolic/biosynthetic process | 1.3E-25 | 581 |
| | H5 | organonitrogen compound metabolic process; sulfur compound metabolic process | 1.62E-4 | 504 |
| | H6 | macromolecular complex; intracell. non-membrane-bounded organelle; membrane-enclosed lumen | 4.80E-14 | 512 |
| | G1 | nitrogen compound metabolic proc.; carboxylic/organic amino acid processes; structural cytoskeleton | 1.84E-16 | 434 |
| | G2 | cellular carbohydrate metabolic process; cytoplasm | 2.01E-7 | 265 |
| | G3 | generation of precursor metabolites and energy; tricarboxylic acid cycle | 1.16E-14 | 954 |
| | G4 | endomembrane system; retrotransposon nucleocapsid; pore; viral procapsid maturation | 4.34E-6 | 102 |
| | G5 | nucleolus; ncRNA metabolic process | 1.03E-61 | 611 |
| | G6 | intracell. non-membrane-bounded organelle; structural molecule activity | 5.33E-76 | 293 |
| | G7 | cytosolic part; ribosomal subunit | 1.61E-88 | 460 |
| | G8 | membrane-enclosed lumen; nuclear lumen; intracell. organelle lumen | 1.17E-47 | 263 |
| | G9 | mitochondrion organization; mitochondrial part; cytoplasmic part; protein complex biogenesis | 2.06E-26 | 592 |
| | G10 | cellular response to oxidative stress; generation of precursor metabolites and energy | 2.37E-4 | 296 |
| | G11 | binding; nuclear part; preribosome | 2.87E-11 | 508 |
| | G12 | cellular process involved in reproduction | 0.001 | 435 |
| | G13 | macromolecular complex; cell part; structural molecule activity | 6.05E-29 | 1442 |
| | G14 | vacuolar transport; chromosome | 5.09E-7 | 606 |
| | G15 | regulation of cellular (macromolecule) biosynthetic process; protein modification process | 2.28E-13 | 1019 |
| | G16 | organic substance catabolic process; carbohydrate metabolic process; cytoplasm | 1.02E-15 | 648 |
| | G17 | ribonucleoprotein complex biogenesis (general) | 1.08E-94 | 784 |
Illustrative set of biclusters with different properties and heightened biological relevance (p-values after Bonferroni correction)

| | | | |
|---|---|---|---|
| B1 | FAABFFF | A-F | Merging with tight overlapping |
| B2 | AAABCA | A-C | Extensions allowed (with tight merging) |
| B3 | AAA/../EEE | A-E | Reducing with high homogeneity |
| B4 | EEECEE | A-E | Merging allowed |
| B5 | CCDCBCBCC | A-E | Merging with relaxed overlapping |
| B6 | AAAAA/../G..G | A-G | Merging with tight overlapping |
| B7 | AAAGGGA | A-G | Merging with tight overlapping |
| B8 | AAABACCCAA | A-E | Merging allowed |

| | | | | | | |
|---|---|---|---|---|---|---|
| B1 | constant | 83 | 7 | 41 | 21 | 1.97E-10 |
| B2 | constant | 153 | 8 | 9 | 1 | 2.27E-12 |
| B3 | multiplicative | 119 | 5 | 5 | 18 | 4.12E-8 |
| B4 | constant | 581 | 6 | 12 | 7 | 1.31E-25 |
| B5 | constant | 654 | 10 | 16 | 4 | 1.31E-17 |
| B6 | additive | 476 | 6 | 12 | 10 | 1.92E-6 |
| B7 | multiplicative | 483 | 7 | 57 | 10 | 1.24E-81 |
| B8 | additive | 521 | 10 | 17 | 5 | 4.57E-12 |
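The caption above refers to Bonferroni correction. As a reminder of what that adjustment does, here is a minimal sketch, assuming the standard form in which each raw p-value is multiplied by the number of tests and capped at 1. The function name `bonferroni` and the input values are hypothetical and are not taken from the paper.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping the result at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical raw p-values for three tested terms.
raw_p_values = [0.125, 0.5, 0.0625]

if __name__ == "__main__":
    for raw, adjusted in zip(raw_p_values, bonferroni(raw_p_values)):
        print(raw, "->", adjusted)
```

A term is then reported as enriched only if its adjusted value stays below the chosen significance level, which makes the criterion stricter as more terms are tested.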
Enriched GO terms of three illustrative BicPAM biclusters
| | | |
|---|---|---|
| B1 | | Immune response (2.32E-10); immune system process, defense response (<1E-6); cytokine-mediated signaling pathway (1.33E-7); Golgi apparatus (1.19E-7). |
| B4 | | Carboxylic acid biosynthetic process (1.3E-25) and metabolic process (6.12E-16); organonitrogen compound biosynthetic process (2.23E-18) and metabolic process (2.71E-13). |
| B7 | | Ribonucleoprotein biogenesis and assembly (1.24E-81); cytosolic part (1.22E-57); intracell. non-membrane-bounded organelle (1.31E-65); ncRNA metabolic process (1.82E-52). |
Analysis of TFs of the putative regulatory modules given by BicPAM’s biclusters provided in Table for the human genome (dlblc dataset) and the yeast genome (gasch dataset)
| | | |
|---|---|---|
| dlblc | Dl1 | BCL11A, LZTS1, GTF2I, HCLS1, HDAC1, MBD4, MEF2B, NCOA3, STAT6 |
| | Dl2 | ANP32A, HCLS1, IRF1, MNDA, NCOA1, RUNX3, STAT1, TRIM22, TRIP10 |
| | Dl3 | BCL3, TRIM22, ANP32A, ARID5B, CEBPB, CREG1, IRF1, PFDN5, STAT1 |
| | Dl4 | ANP32A, IRF1, NCOA1, STAT1, TRIM22 |
| | Dl5 | CREG1, IRF1, TRIM22, ANP32A, STAT1 |
| | Dl6 | ANP32A, IRF1, NCOA1, STAT1, TRIM22 |
| | Dl7 | BCL6, BCL6B, HIF1A, ILF2, POU2AF1, SERTAD1, TCF3 |
| | Dl8 | DR1, DRAP1, HIF1A, ILF2, NCOA3, SERTAD1, TMF1, ZNFN1A1 |
| gasch | G1 | Gcn4p, Sfp1p, Ace2p, Tec1p, Ste12p, Ash1p |
| | G2 | Sfp1p, Msn2p, Bas1p, Tec1p, Sok2p, Abf1p, Ash1p, Cst6p |
| | G3 | Sfp1p, Tec1p, Ste12p, Msn2p, Bas1p, Sok2p, Msn4p, Gcn4p |
| | G4 | Snf6p, Tec1p, Ste12p, Rap1p, Sin4p, Abf1p, Snf2p, Ash1p |
| | G5 | Sfp1p, Ace2p, Cst6p, Tup1p, Msn2p, Spt10p, Spt20p |
| | G6 | Hsf1p, Spt23p, Mga2p, Sfp1p, Spt10p, Msn2p, Gcr1p, Gcn4p |
| | G7 | Sfp1p, Swi5p, Tup1p, Spt10p, Spt20p, Gcr1p, Sin3p, Mga2p |
| | G8 | Sfp1p, Swi5p, Cst6p, Tup1p, Spt20p, Ash1p, Spt10p |
| | G9 | Yap1p, Ace2p, Sfp1p, Msn2p, Ash1p, Msn4p, Abf1p |
| | G10 | Sfp1p, Msn2p, Msn4p, Cst6p, Abf1p, Sok2p, Bas1p |
| | G11 | Snf6p, Tup1p, Snf2p, Cst6p, Sin4p, Rap1p, Swi3p, Hap2p |
| | G12 | Yap1p, Tec1p, Msn2p, Msn4p, Ste12p, Sok2p |
| | G13 | Snf6p, Tup1p, Abf1p, Snf2p, Cst6p, Sin4p |
| | G14 | Sfp1p, Tec1p, Ste12p, Bas1p, Sok2p, Yrm1p |
| | G15 | Ace2p, Sfp1p, Tec1p, Ste12p, Ash1p, Bas1p, Gcn4p, Sok2p |
| | G16 | Cin5p, Gcn4p, Msn4p, Sfp1p, Msn2p, Tec1p, Ste12p, Sok2p |
| | G17 | Sfp1p, Ace2p, Cst6p, Snf6p, Rap1p, Tup1p, Spt10p, Swi5p |
Figure 25. Biclusters extracted from the gasch dataset with constant models (a), multiplicative models (b) and additive models in the absence and presence of symmetries (c and d).