Ali Oghabian, Sami Kilpinen, Sampsa Hautaniemi, Elena Czeizler.
Abstract
DNA microarray technologies are used extensively to profile the expression levels of thousands of genes under various conditions, yielding extremely large data matrices. Analyzing this information and extracting biologically relevant knowledge from it is therefore a considerable challenge. A classical approach to this challenge is clustering (also known as one-way clustering), in which genes (or, respectively, samples) are grouped together based on the similarity of their expression profiles across the set of all samples (or, respectively, genes). An alternative approach is biclustering, which identifies local patterns in the data: these methods extract subgroups of genes that are co-expressed across only a subset of samples and that may have important biological or medical implications. In this study we evaluate 13 biclustering and 2 clustering (k-means and hierarchical) methods and compare their performance on two real gene expression data sets. We apply four evaluation measures in our analysis: (1) we examine how well the considered (bi)clustering methods differentiate various sample types; (2) we evaluate how well the groups of genes discovered by the (bi)clustering methods are annotated with similar Gene Ontology categories; (3) we evaluate the capability of the methods to differentiate genes that are known to be specific to the particular sample types we study; and (4) we compare the running time of the algorithms. We conclude that, as long as the samples are well defined and annotated, the contamination of the samples is limited, and the samples are well replicated, biclustering methods such as Plaid and SAMBA are useful for discovering relevant subsets of genes and samples.
Year: 2014 PMID: 24651574 PMCID: PMC3961251 DOI: 10.1371/journal.pone.0090801
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Expression patterns of genes across samples in two types of biclusters.
(A) Bicluster containing genes having expression values correlated across the samples. (B) Bicluster containing genes exhibiting a limited variance in the expression values across the considered samples. The X-axis represents the samples included in the bicluster, the Y-axis represents the expression level, and each line shows the expression values of a gene (included in the bicluster) along the various samples of the bicluster.
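The two bicluster types in Figure 1 can be told apart with simple summary statistics. The following is a minimal sketch, not the criterion used by any of the evaluated methods: it assumes that type (A) can be summarised by the mean pairwise Pearson correlation between gene profiles and type (B) by the largest per-gene variance across the samples. The function names and the toy matrices are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile has no defined correlation; treat it as 0
    return cov / (sx * sy)


def mean_pairwise_correlation(bicluster):
    """Average Pearson correlation over all gene pairs (assumed type A score)."""
    return mean(pearson(g, h) for g, h in combinations(bicluster, 2))


def max_gene_variance(bicluster):
    """Largest per-gene variance across the samples (assumed type B score)."""
    return max(pvariance(gene) for gene in bicluster)


# Hypothetical toy data: rows are genes, columns are the bicluster's samples.
correlated = [
    [1.0, 3.0, 2.0, 5.0],
    [2.0, 6.0, 4.0, 10.0],  # twice the first gene
    [0.5, 2.5, 1.5, 4.5],   # the first gene shifted down by 0.5
]
low_variance = [
    [4.0, 4.1, 3.9, 4.0],
    [7.0, 7.1, 7.0, 6.9],
    [1.0, 1.0, 1.1, 0.9],
]

print(mean_pairwise_correlation(correlated))  # high: profiles rise and fall together
print(max_gene_variance(correlated))          # large: expression levels vary a lot
print(max_gene_variance(low_variance))        # small: each gene is nearly constant
```

Note that the correlated genes sit at very different expression levels and vary strongly across samples, so a variance-based criterion alone would not group them; this is why the two bicluster types call for different scoring functions.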
Table 1. The class and availability of biclustering methods.
| Bicluster Method | Class | Since | Availability | Parameters |
| ACV | CMB | 2007 | - |  |
| Bayesian Plaid | PGM | 2008 | C |  |
|  | VMB | 2006 | Java |  |
|  | CMB | 2009 | Java |  |
|  | CMB | 2000 | R |  |
| CMonkey | PGM | 2006 | R |  |
|  | TWC | 2000 | MATLAB |  |
| DCC | TWC | 2002 | - | - |
|  | PGM | 2010 | R |  |
|  | CMB | 2005 | R |  |
| GEMS | CGS | 2004 | Web, C |  |
| Gibbs biclustering | PGM | 2003 | - | - |
|  | TWC | 2002 | Java |  |
| ITWC | TWC | 2001 | - | - |
| OP-Clustering | CMB | 2003 | - | - |
|  | CMB | 2003 | Java |  |
|  | PGM | 2002 | R |  |
| ProBic | PGM | 2009 | - | - |
|  | VMB | 2009 | C |  |
|  | VMB | 2006 | Java |  |
|  | PGM | 2002 | Java |  |
| Spectral | VMB | 2003 | R |  |
| TreeBic | PGM | 2010 | C |  |
| UBCLUST | CMB | 2006 | Java |  |
| XMOTIF | VMB | 2003 | R |  |
| ZBDD | VMB | 2005 | - |  |
|  | VMB | 1972 | - |  |
|  | VMB | 2002 | - |  |
|  | VMB | 2000 | - |  |
The notation used for the method classes is given in the text. The parameters used by the biclustering methods are described in Table 4. The methods shown in bold text were evaluated in our study.
Table 2. The biclustering method specifications and tested data types.
| Bicluster Method | Method specifications | Tested data |
| ACV | GSOVL | Synthetic, yeast |
| Bayesian Plaid | GSOVL, MCMC, BAYES | Synthetic, yeast |
|  | GSOVL, DISC | Synthetic, yeast |
|  | GSOVL, TREE | Synthetic, yeast |
|  | GSOVL | Synthetic, human, yeast |
| CMonkey | GSOVL, MCMC, MOTIF, TMV | Synthetic, yeast |
|  | GSOVL, SIMA | Human |
| DCC | NOVL, VECOS | Human |
|  | GSOVL, EM, BAYES, SVD | Synthetic, human |
|  | GSOVL, TMV | Synthetic, human |
| GEMS | GSOVL, MCMC | Synthetic, human |
| Gibbs biclustering | GSOVL, DISC, MCMC, BAYES | Synthetic, human |
|  | GSOVL | Synthetic, yeast |
| ITWC | SOVL, VECOS | Human |
| OP-Clustering | GSOVL, TREE | Yeast, human |
|  | GSOVL, DISC | Synthetic, yeast |
|  | GSOVL, FUZZY | Synthetic, human, yeast |
| ProBic | GSOVL, EM, BAYES, TMV | Synthetic, yeast |
|  | GSOVL | Synthetic, yeast, E. coli, human |
|  | GSOVL | Synthetic, yeast |
|  | GSOVL, DISC | Yeast, human |
| Spectral | NOVL, SVD | Human |
| TreeBic | GSOVL, MCMC, BAYES, TREE | Human |
| UBCLUST | GSOVL, DISC, MCMC, SIMA | Synthetic, yeast |
| XMOTIF | GSOVL | Synthetic |
| ZBDD | GSOVL | Synthetic, yeast |
|  | NOVL | Synthetic |
|  | GSOVL | Synthetic, yeast |
|  | GSOVL | Synthetic, human |
The method specifications are described in Table 3. Although the original FLOC algorithm is tolerant to missing values (TMV), the R implementation available in the Bioconductor package BicARE (v1.2.0) does not accept missing values in the input data. Note that all tested data listed without a citation were studied by the developers of the algorithms to which they are assigned. For the citations of the algorithms, see Table 1.
Table 3. Various specifications considered for the biclustering methods.
| Specifications | Description |
| GOVL | The obtained biclusters are allowed to have overlaps over only the gene-sets. |
| SOVL | The obtained biclusters are allowed to have overlaps over only the sample-sets. |
| GSOVL | The obtained biclusters are allowed to have overlaps over both gene and sample-sets. |
| NOVL | No overlaps at all are allowed for the obtained biclusters. |
| DISC | Discretization is mandatory for running the algorithm. |
| TMV | The method is tolerant to missing values. |
| SIMA | Simulated annealing is applied to avoid convergence to local optima. |
| VECOS | Vector cosine scores are applied to measure the similarities of the samples (or genes). |
| SVD | The method applies a form of Singular Value Decomposition. |
| MCMC | The method employs a Markov chain Monte Carlo approach. |
| BAYES | The method employs a fully Bayesian approach. |
| EM | The method uses the Expectation-Maximization algorithm. |
| MOTIF | Sequence-motif co-occurrence is considered in the biclustering approach. |
| TREE | The method applies a tree structure for discovering suitable sets of genes and samples. |
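The four overlap specifications (GOVL, SOVL, GSOVL, NOVL) can be illustrated by checking which kind of overlap a given set of biclusters actually exhibits. The sketch below is one possible reading of the definitions above, assuming each bicluster is represented as a pair of Python sets (genes, samples); the function name `overlap_class` and the example identifiers are hypothetical. It labels an observed result, whereas the table describes what a method allows, so a GSOVL method may well produce an output that this function labels NOVL.

```python
from itertools import combinations


def overlap_class(biclusters):
    """Label the overlap observed in a list of (gene_set, sample_set) pairs.

    Returns "GSOVL" if some pair shares genes and some pair shares samples,
    "GOVL" if only genes are shared, "SOVL" if only samples are shared,
    and "NOVL" if no two biclusters share anything.
    """
    pairs = list(combinations(biclusters, 2))
    genes_shared = any(a[0] & b[0] for a, b in pairs)
    samples_shared = any(a[1] & b[1] for a, b in pairs)
    if genes_shared and samples_shared:
        return "GSOVL"
    if genes_shared:
        return "GOVL"
    if samples_shared:
        return "SOVL"
    return "NOVL"


# Hypothetical biclusters: gene identifiers g1..g4, sample identifiers s1..s4.
disjoint = [({"g1", "g2"}, {"s1", "s2"}), ({"g3", "g4"}, {"s3", "s4"})]
gene_only = [({"g1", "g2"}, {"s1", "s2"}), ({"g2", "g3"}, {"s3", "s4"})]
sample_only = [({"g1", "g2"}, {"s1", "s2"}), ({"g3", "g4"}, {"s2", "s3"})]
both = [({"g1", "g2"}, {"s1", "s2"}), ({"g2", "g3"}, {"s2", "s3"})]

for name, result in [("disjoint", disjoint), ("gene_only", gene_only),
                     ("sample_only", sample_only), ("both", both)]:
    print(name, overlap_class(result))
```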
Table 4. Different types of parameters used by the biclustering methods.
| Parameter | Parameter specification |
|  | the number of generated biclusters either per iteration or globally |
|  | the threshold for biclustering optimization criteria |
|  | the threshold for the number of iterations |
|  | the probability of including/excluding a gene or a sample during the clustering process |
|  | the threshold for the size of the biclusters |
|  | the threshold for the number of gene (or respectively sample) operations in one iteration |
|  | the number of genes and/or samples in the initial bicluster seeds |
|  | the overlap threshold for the obtained biclusters |
|  | model-based parameters, e.g., parameters for prior distributions, or tree depth |
The operations allowed when defining the operations-threshold parameter are comparisons, additions, removals, and splits of genes (or, respectively, samples).
Figure 2. Sample-based (i.e., sample differentiation) and gene-based (i.e., GO-Sig and TiGER-Sig) benchmarks for thirteen biclustering and two clustering methods on the multi-tissue type (A) and breast tumour (B) data.
Figure 3. Running time of the thirteen biclustering and two clustering methods on the breast cancer and multi-tissue type microarray data.