Abstract
BACKGROUND: Grouping genes into clusters on the basis of similarity between their expression profiles has been the main approach to predicting functional modules, from which important inferences or decisions about further investigation can be made. While an unambiguous choice of similarity metric is important, current practice usually relies on Euclidean distance and Pearson correlation, whose assumptions are unlikely to hold for high-throughput microarray data.
Year: 2009 PMID: 19958477 PMCID: PMC2788366 DOI: 10.1186/1471-2164-10-S3-S14
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Distance distributions of the homologous and heterogeneous groups. Comparison of the ability of the three distance metrics to differentiate between homologous and heterogeneous sample pairs over three generating cases. Red lines: densities of homologous distances (the two samples are from the same process); blue lines: densities of heterogeneous distances (the two samples are from two different processes). Case 1: samples are independently generated from a Gaussian distribution with varying noise (favours BayesGen); Case 2: samples are independently generated from a Gaussian distribution with fixed noise (favours Euclidean distance); Case 3: samples are generated as noisy linear transformations of a common mean vector (favours Pearson correlation).
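The comparison behind Figure 1 can be illustrated with a small simulation. The following is a minimal sketch of a Case 3 style setup only, not the authors' procedure: the dimension, noise level, scale and shift ranges, and the helper name `noisy_linear_sample` are all assumptions chosen for illustration, and BayesGen is not implemented because its definition is not given in this record.

```python
import math
import random


def euclidean_distance(x, y):
    """Plain Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation; near 0 for linearly related profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def noisy_linear_sample(mean, rng, noise_sd=0.1):
    """Hypothetical Case 3 generator: scale * mean + shift + Gaussian noise.

    The scale/shift ranges and noise_sd are illustrative assumptions.
    """
    scale = rng.uniform(0.5, 2.0)
    shift = rng.uniform(-1.0, 1.0)
    return [scale * m + shift + rng.gauss(0.0, noise_sd) for m in mean]


rng = random.Random(0)
dim = 50  # assumed profile length
mean_a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # process A
mean_b = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # process B

# Homologous pairs: both samples come from process A.
homologous = [
    pearson_distance(noisy_linear_sample(mean_a, rng),
                     noisy_linear_sample(mean_a, rng))
    for _ in range(200)
]
# Heterogeneous pairs: one sample from A, one from B.
heterogeneous = [
    pearson_distance(noisy_linear_sample(mean_a, rng),
                     noisy_linear_sample(mean_b, rng))
    for _ in range(200)
]

mean_hom = sum(homologous) / len(homologous)
mean_het = sum(heterogeneous) / len(heterogeneous)
print("mean homologous Pearson distance:", mean_hom)
print("mean heterogeneous Pearson distance:", mean_het)
```

Swapping `pearson_distance` for `euclidean_distance` in the two list comprehensions gives the corresponding Euclidean distributions for the same generated data, which is the kind of side-by-side comparison the figure describes.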
Figure 2. Protein functional association discovery. Comparison of the ability of the three distance metrics to predict interacting yeast protein pairs from genome-wide microarray expression data. The standard positive pairs are derived from the annotations of GO terms that received 5/6 votes in an expert survey. (A) Results from Gasch et al. [11] data; (B) Results from Avara et al. [12] data.
Clustering expression profiles into cancer subtypes
| | euclid | euclidNorm | corr | corrNorm | bayesGen |
|---|---|---|---|---|---|
| General leukemia | 0.5447 | 0.1175 | 0.7491 | 0.1817 | 0.8076 |
| Pediatric leukemia | 0.1982 | 0.4789 | 0.2014 | 0.9129 | 0.9413 |
| Multiple tissues | 0.5304 | 0.9082 | 0.6416 | 0.783 | 0.9726 |
| B-cell lymphoma | 0.0016 | 0.0008 | 0.4407 | 0.1745 | 0.9053 |
| Average | 0.3187 | 0.3764 | 0.5082 | 0.5130 | 0.9067 |
Predicting number of clusters using gap statistics
| | true number | euclid | euclidNorm | corr | corrNorm | bayesGen |
|---|---|---|---|---|---|---|
| General leukemia | 3 | 3 | 3 | 3 | 3 | 4 |
| Pediatric leukemia | 6 | 3 | 13 | 2 | 15 | 7 |
| Multiple tissues | 4 | 6 | 7 | 6 | 9 | 4 |
| B-cell lymphoma | 3 | 2 | 2 | 15 | 15 | 6 |
| Average difference | | 1.2 | 2.2 | 3.6 | 5.2 | 1.0 |
Figure 3. Cluster structures resulting from the use of different metrics in hierarchical clustering. Comparison of the cluster structures produced by different distance metrics in hierarchical clustering over 4 cancer datasets. Top row: the true structure derived from known phenotypes; Middle row: the structure produced by BayesGen (which gave the highest Rand indices); Bottom row: the structure produced by the metric that gave the second-best Rand indices.
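Since the cluster structures are ranked by Rand index against the known phenotypes, a short sketch of that measure may help. The code below computes the plain (unadjusted) Rand index between two labelings. This record does not say whether the plain or a chance-adjusted variant was used, so treat this as an illustration of the general idea; the example labels are made up and are not taken from any of the datasets.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two samples
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if len(labels_true) < 2:
        raise ValueError("need at least two samples to form a pair")

    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
        total += 1
    return agree / total


# Hypothetical phenotype labels and a clustering that misplaces one sample.
phenotypes = [0, 0, 1, 1, 2, 2]
clusters = [0, 0, 1, 2, 2, 2]
print(rand_index(phenotypes, clusters))  # 12 of 15 pairs agree -> 0.8
```

The index depends only on which samples are grouped together, so renaming the cluster labels does not change the result.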