Zhenjia Wang, Guojun Li, Robert W Robinson, Xiuzhen Huang.
Abstract
Biclustering algorithms aim to provide an effective and efficient way to analyze gene expression data by finding groups of genes with trend-preserving expression patterns under certain conditions. Many such algorithms have been developed since Morgan et al. pioneered the partitioning of a data matrix into submatrices with approximately constant values. However, identifying general trend-preserving biclusters, which are the most meaningful substructures hidden in gene expression data, remains a highly challenging problem. We found an elementary method by which biologically meaningful trend-preserving biclusters can be readily identified in large, noisy, and complex data. The basic idea is to apply the longest common subsequence (LCS) framework to selected pairs of rows in an index matrix derived from the input data matrix, locating a seed for each bicluster to be identified. We tested the method on synthetic and real datasets and compared its performance with currently competitive biclustering tools. The new algorithm, named UniBic, outperformed all previous biclustering algorithms on commonly used evaluation scenarios, with one exception: BicSPAM was somewhat better at finding narrow biclusters, the task for which it was specifically designed.
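The basic idea stated in the abstract can be illustrated with a minimal sketch. It assumes that each row of the index matrix lists the column indices of the corresponding data row in order of increasing expression value; the abstract does not spell out this construction, so treat it as an assumption. The function names `index_row` and `longest_common_subsequence` and the toy data are hypothetical, and the sketch does not cover how row pairs are selected, how noise and ties are handled, or how a seed is expanded into a full bicluster.

```python
from typing import List, Sequence


def index_row(values: Sequence[float]) -> List[int]:
    """Return column indices ordered by increasing value (assumed index-matrix row)."""
    return sorted(range(len(values)), key=lambda j: values[j])


def longest_common_subsequence(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Textbook dynamic-programming LCS; returns one longest common subsequence."""
    n, m = len(a), len(b)
    # dp[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk forward through the table to recover one LCS.
    result: List[int] = []
    i, j = 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


# Toy data invented for illustration: two genes under five conditions.
gene_a = [0.2, 1.5, 0.9, 3.1, 2.4]
gene_b = [10.0, 40.0, 25.0, 5.0, 60.0]

# Columns on which both genes rise in the same order form a candidate seed.
seed_columns = longest_common_subsequence(index_row(gene_a), index_row(gene_b))
print(seed_columns)  # [0, 2, 1, 4]
```

In this toy case both genes increase along columns 0, 2, 1, 4 even though their absolute values differ, which is what "trend-preserving" means here. The quadratic-time LCS is chosen for readability only and is not a claim about how UniBic implements the step.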
Year: 2016 PMID: 27001340 PMCID: PMC4802312 DOI: 10.1038/srep23466
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Relevance and recovery scores of the seven algorithms on six types of biclusters, with error bars.
Figure 2. Relevance and recovery scores of the seven algorithms on synthetic matrices with overlapping biclusters, with error bars.
Description of GDS datasets.
| Dataset | Genes | Samples | Description |
|---|---|---|---|
| GDS181 | 12626 | 84 | Large-scale analysis of the human transcriptome |
| GDS589 | 8799 | 122 | Multiple normal tissue gene expression across strains |
| GDS1406 | 12488 | 87 | Brain regions of various inbred strains |
| GDS1451 | 8799 | 94 | Toxicants effect on liver: pooled and individual sample comparison |
| GDS1490 | 12488 | 150 | Neural tissue profiling |
| GDS2520 | 12625 | 44 | Head and neck squamous cell carcinoma |
| GDS3715 | 12626 | 110 | Insulin effect on skeletal muscle |
| GDS3716 | 22283 | 42 | Breast cancer: histologically normal breast epithelium |
The results of GO enrichment analysis on eight GDS datasets.
| Algorithm | Found | Enriched |
|---|---|---|
| UniBic | 151 | 62(41.1%) |
| OPSM | 163 | 48(29.5%) |
| QUBIC | 91 | 34(37.4%) |
| ISA | 217 | 71(32.7%) |
| FABIA | 80 | 22(27.5%) |
| CPB | 96 | 34(35.4%) |
Figure 3. Comparison of the distributions of running time for the seven algorithms versus the number of rows, on matrices with 50 columns, with error bars.
The time scale is logarithmic.