Guojun Li, Qin Ma, Haibao Tang, Andrew H Paterson, Ying Xu.
Abstract
When applied to gene expression data, biclustering extends traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions. Still, the real power of this clustering strategy is yet to be fully realized because effective and efficient algorithms for reliably solving the general biclustering problem are lacking. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form than existing algorithms by combining qualitative (or semi-quantitative) measures of gene expression data with a combinatorial optimization technique. One key feature of QUBIC is that it can identify all statistically significant biclusters, including biclusters with the so-called 'scaling patterns', a problem considered to be rather challenging; another is that it solves such general biclustering problems very efficiently, handling tens of thousands of genes under up to thousands of conditions in a few minutes of CPU time on a desktop computer. We have demonstrated considerably improved biclustering performance compared to existing algorithms on various benchmark sets and on data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/~maqin/bicluster. A server version of QUBIC is also available upon request.
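To make the idea of a 'scaling pattern' concrete, the following minimal sketch builds three genes whose expression rows are positive multiples of one base profile and shows that a qualitative (up/down/unchanged) view of each row is identical. The discretization rule used here (comparison with the row median) and all names and values are assumptions chosen for illustration; the abstract does not specify how QUBIC derives its qualitative representation.

```python
from statistics import median

def qualitative_row(row):
    """Map each value to +1, 0 or -1 relative to the row median.

    Illustrative rule only; not taken from the QUBIC paper.
    """
    m = median(row)
    return [(v > m) - (v < m) for v in row]

# Hypothetical base expression profile across five conditions.
base = [1.0, 4.0, 2.0, 8.0, 3.0]

# Each gene is the base profile multiplied by a positive factor,
# which is what a scaling pattern looks like in quantitative data.
factors = [1.0, 2.5, 0.5]
scaled = [[f * v for v in base] for f in factors]

# The qualitative rows coincide even though the raw values differ.
qual = [qualitative_row(r) for r in scaled]
same_pattern = all(q == qual[0] for q in qual)
print(qual[0], same_pattern)
```

Under this assumed rule, genes that differ only by a positive scaling factor become indistinguishable, which suggests why a qualitative representation can make such biclusters easier to detect.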
Year: 2009 PMID: 19509312 PMCID: PMC2731891 DOI: 10.1093/nar/gkp491
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Comparison of the recovery accuracy of QUBIC with that of the other five algorithms. The analysis reveals both the effects of increasing noise levels for ‘scaling’ (A and B) models and the effects of varying degrees of overlap for ‘constant’ (C) models. Note that the recovery score is calculated similarly to BIMAX using $S(M_{opt}, M) = \frac{1}{|M_{opt}|} \sum_{G_1 \in M_{opt}} \max_{G_2 \in M} \frac{|G_1 \cap G_2|}{|G_1 \cup G_2|}$, where $M_{opt}$ is the set of implanted biclusters, $M$ is the set of recovered biclusters, and $G_1$ and $G_2$ denote the gene sets of individual biclusters.
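The recovery score in the Figure 1 caption can be sketched in a few lines. This sketch assumes the score is the gene match score used in the BIMAX evaluation: for each implanted bicluster, take the best Jaccard overlap between its gene set and the gene set of any recovered bicluster, then average over the implanted biclusters. The function name and the gene identifiers are hypothetical.

```python
def gene_match_score(reference, candidates):
    """Average, over reference gene sets, of the best Jaccard overlap
    with any candidate gene set (assumed BIMAX-style match score)."""
    if not reference:
        return 0.0
    total = 0.0
    for g_ref in reference:
        best = max(
            (len(g_ref & g) / len(g_ref | g) for g in candidates if g_ref | g),
            default=0.0,
        )
        total += best
    return total / len(reference)

# Hypothetical gene sets of implanted (M_opt) and recovered (M) biclusters.
implanted = [{"g1", "g2", "g3", "g4"}, {"g5", "g6"}]
recovered = [{"g1", "g2", "g3"}, {"g5", "g6", "g7", "g8"}, {"g9"}]

recovery = gene_match_score(implanted, recovered)
print(recovery)  # (3/4 + 2/4) / 2 = 0.625
```

The score is not symmetric: passing the implanted set first measures how well the true biclusters are recovered, while swapping the arguments would instead measure how much of the output corresponds to true biclusters.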
Figure 2. (A) Proportions of E. coli biclusters that have significant overlap (P < 0.01) with GO biological processes, KEGG pathways and experimentally verified regulons. (B) Proportions of yeast biclusters that are statistically enriched (P < 0.01) in GO biological processes, KEGG pathways and the MIPS functional catalog.
Figure 3. Visualization of three biclusters (BC000, BC002 and BC074), which were selected for their specificity to a particular subtype of leukemia (ALL/AML/MLL). The gene names are given to the right of the heat-map. Some genes appear twice because two different Affymetrix probes are sometimes used for the same gene.
Functional enrichments in the biclusters found by different programs for E. coli with respect to KEGG classes
| ABC transporters – Organism-specific | 1e-04 (30%) | na | ns | ns |
| Aminosugars metabolism | na | na | ns | 1e-03 (7%) |
| Arginine and proline metabolism | na | ns | 4e-03 (4%) | ns |
| Ascorbate and aldarate metabolism | 2e-03 (22%) | na | 2e-03 (3%) | na |
| Flagellar assembly | 4e-37 (71%) | 8e-57 (38%) | 8e-18 (10%) | 3e-45 (17%) |
| Fructose and mannose metabolism | na | 2e-03 (40%) | 9e-05 (7%) | na |
| Galactose metabolism | ns | na | na | 3e-06 (5%) |
| Glycerophospholipid metabolism | 9e-03 (22%) | na | ns | na |
| Nitrogen metabolism | na | 3e-05 (6%) | ns | 2e-06 (7%) |
| Pentose and glucuronate interconversions | 6e-04 (20%) | na | ns | ns |
| Phosphotransferase system (PTS) | na | ns | na | 2e-03 (7%) |
| Pyrimidine metabolism | na | na | ns | 2e-03 (8%) |
| Ribosome | na | na | 4e-38 (37%) | 2e-43 (23%) |
| Sulfur metabolism | 1e-10 (11%) | 2e-03 (2%) | na | 3e-09 (4%) |
Values represent P-values followed by the enrichment ratios (the number of genes in both class and bicluster/the number of genes in the bicluster). Each value in bold represents the most significant P-value for each functional class.
na: functional class not present in the results.
ns: functional class present in the results but not significant at the 0.01 level.
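The two numbers reported in each table cell can be illustrated as follows. The enrichment ratio follows the definition in the table note (genes in both class and bicluster divided by genes in the bicluster). The P-value calculation is an assumption: the record does not say which test was used, so this sketch uses a one-sided hypergeometric test, a common choice for this kind of enrichment analysis. All gene identifiers and set sizes are hypothetical.

```python
from math import comb

def enrichment(bicluster, func_class, n_genome):
    """Return (enrichment ratio, hypergeometric upper-tail P-value).

    ratio = |class ∩ bicluster| / |bicluster|, as in the table note.
    The P-value is the chance of seeing at least this much overlap when
    drawing |bicluster| genes at random from n_genome genes
    (assumed test; not stated in the source).
    """
    k = len(bicluster & func_class)      # genes in both class and bicluster
    n = len(bicluster)                   # bicluster size
    K = len(func_class)                  # functional class size
    ratio = k / n
    p_value = sum(
        comb(K, i) * comb(n_genome - K, n - i)
        for i in range(k, min(n, K) + 1)
    ) / comb(n_genome, n)
    return ratio, p_value

# Hypothetical toy example: 20 genes in total, a class of 5, a bicluster of 4.
func_class = {"a", "b", "c", "d", "e"}
bicluster = {"a", "b", "c", "x"}
ratio, p_value = enrichment(bicluster, func_class, n_genome=20)
print(f"{p_value:.0e} ({ratio:.0%})")
```

The print statement mimics the cell layout of the table, a P-value followed by the enrichment ratio in parentheses.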