Mi Hyeon Kim, Hwa Jeong Seo, Je-Gun Joung, Ju Han Kim.
Abstract
BACKGROUND: Clustering-based methods for gene-expression analysis have been shown to be useful in biomedical applications such as cancer subtype discovery. Among them, matrix factorization (MF) is advantageous for clustering gene expression patterns from DNA microarray experiments, as it efficiently reduces the dimension of gene expression data. Although several MF methods have been proposed for clustering gene expression patterns, a systematic evaluation has not yet been reported.
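To make the idea of clustering by matrix factorization concrete, the following is a minimal sketch of plain NMF with multiplicative updates, followed by assigning each sample to its dominant factor. It is an illustration only, not the implementation evaluated in the paper: the function names, the random initialization, the iteration count, and the toy matrix in the usage note are all assumptions.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative V (genes x samples) into W (genes x k) and
    H (k x samples) using multiplicative updates (assumed settings)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Update H with W fixed.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # Update W with H fixed.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

def assign_clusters(h):
    """Assign each sample (column of H) to the factor with the largest coefficient."""
    return [max(range(len(h)), key=lambda a: h[a][j]) for j in range(len(h[0]))]
```

For a hypothetical 4-gene by 4-sample matrix `v`, `w, h = nmf(v, 2)` followed by `assign_clusters(h)` yields one cluster label per sample at rank k=2.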
Year: 2011 PMID: 22373334 PMCID: PMC3278848 DOI: 10.1186/1471-2105-12-S13-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Illustration of various measures. Here, we evaluated seven methods by six measures. The panels show results for (a) homogeneity, (b) separation, (c) Dunn index, (d) average silhouette width, (e) Pearson correlation of cophenetic distance, (f) Hubert gamma and (g) GAP statistic. For the GAP statistic, a lower value is better; for all other measures, a higher value is better.
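As an illustration of one of these measures, the sketch below computes the average silhouette width from its standard definition. The Euclidean distance, the function name, and the convention of scoring singleton clusters as 0 are assumptions; the source does not state which distance or implementation was used.

```python
from math import dist

def average_silhouette_width(points, labels):
    """Mean of (b - a) / max(a, b) over all points, where a is the mean
    distance to the point's own cluster and b is the smallest mean
    distance to any other cluster. Points are coordinate tuples."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")
    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # assumed convention for singleton clusters
            continue
        # dist(point, point) is 0, so divide by the number of *other* members.
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(point, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)
```

A value near 1 indicates compact, well-separated clusters, which matches the caption's statement that higher is better for this measure.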
Figure 2. Illustration of the adjusted Rand index. (a) Results from the leukemia dataset with two known class labels, ALL and AML; the methods were tested at rank k=2. (b) Results from the leukemia dataset with three groups, ALL-B, ALL-T and AML; the adjusted Rand index was applied at rank k=3. (c) Results from the medulloblastoma dataset, which has two known class labels, classic and desmoplastic. (d) Results from the iris dataset, which has known class labels for three flower species.
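The adjusted Rand index compares a clustering with known class labels while correcting for chance agreement. The sketch below follows the standard pair-counting formula; the function name and the handling of degenerate inputs are assumptions rather than details taken from the paper.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand index between known classes and cluster assignments."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # assumed convention: no pairs to disagree on
    # Pairs of samples that share both a class and a cluster.
    index = sum(comb(c, 2) for c in Counter(zip(true_labels, cluster_labels)).values())
    class_pairs = sum(comb(c, 2) for c in Counter(true_labels).values())
    cluster_pairs = sum(comb(c, 2) for c in Counter(cluster_labels).values())
    expected = class_pairs * cluster_pairs / comb(n, 2)
    maximum = (class_pairs + cluster_pairs) / 2
    if maximum == expected:
        return 1.0  # assumed convention when both partitions are trivial
    return (index - expected) / (maximum - expected)
```

The index is 1 when the clustering reproduces the known classes exactly, regardless of how the clusters are numbered, and is close to 0 for random assignments.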
Figure 3. Illustrations of accuracy. Accuracy measures the predictive power of clustering. Bar plot of accuracy for the three datasets with known sample-class labels: the leukemia, medulloblastoma and iris datasets.
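The source does not spell out how cluster labels were matched to classes when computing accuracy. One common approach, sketched here as an assumption, tries every one-to-one mapping of clusters to classes and keeps the best; the function name is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_classes, cluster_labels):
    """Fraction of samples correctly labelled under the best one-to-one
    mapping of cluster labels to class labels."""
    classes = sorted(set(true_classes))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes; one-to-one mapping is undefined")
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        correct = sum(mapping[c] == t for t, c in zip(true_classes, cluster_labels))
        best = max(best, correct)
    return best / len(true_classes)
```

Under this definition, the number of misclassified samples equals the sample count multiplied by one minus the accuracy. Exhaustive search over permutations is only practical for a small number of classes, such as the two or three used here.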
Class Assignment of Acute Myelogenous Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL)
| Sample | Kmeans (K=2) | Kmeans (K=3) | *SVD (K=2) | *SVD (K=3) | *PCA (K=2) | *PCA (K=3) | *ICA (K=2) | *ICA (K=3) | *NMF (K=2) | *NMF (K=3) | *SNMF (K=2) | *SNMF (K=3) | *BSNMF (K=2) | *BSNMF (K=3) | *Voting (K=2) | *Voting (K=3) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ALL_19769_B.cell | **L | **B | B | L | L | B | L | B | L | B | L | B | L | B | | |
| Error Count | 7 | 7 | 18 | 20 | 16 | 18 | 3 | 5 | 2 | 3 | 1 | 2 | 1 | 1 | 1 | 1 |
Class Assignment of Acute Myelogenous Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL) at K=2 and K=3.
* SVD: singular value decomposition, PCA: principal component analysis, ICA: independent component analysis, NMF: non-negative matrix factorization, SNMF: sparse non-negative matrix factorization,
BSNMF: bi-directional non-negative matrix factorization, Voting: Voting class
** L: ALL, M: AML, B: ALL_B cell, T: ALL_T cell
Bold-faced: misclassified samples
Class assignment for Medulloblastoma dataset
| Sample | Subgroup | Kmeans | *SVD | *PCA | *ICA | *NMF | *SNMF | *BSNMF | *Voting |
|---|---|---|---|---|---|---|---|---|---|
| Brain_MD_7 | classic | | | | | | | | |
| Error Count | | 14 | 16 | 16 | 14 | 13 | 13 | 11 | 12 |
Class assignment for Medulloblastoma dataset at K=2
* SVD: singular value decomposition, PCA: principal component analysis,
ICA: independent component analysis, NMF: non-negative matrix factorization,
SNMF: sparse non-negative matrix factorization,
BSNMF: bi-directional non-negative matrix factorization, Voting: Voting class
** 1: classic type, 2: desmoplastic type
Bold-faced: misclassified samples
Number of significantly enriched GO terms (or pathways)
| (a) Leukemia dataset | |||||||
|---|---|---|---|---|---|---|---|
| ALL | 480 | 389 | 441 | 453 | 532 | 425 | 535 |
| AML | 85 | 262 | 223 | 222 | 167 | 266 | 280 |
| Total | 565 | 651 | 664 | 675 | 699 | 691 | 815 |
| (b) Medulloblastoma dataset | |||||||
| classic | 517 | 373 | 467 | 479 | 388 | 456 | 599 |
| desmoplastic | 58 | 361 | 226 | 213 | 335 | 208 | 206 |
| Total | 575 | 734 | 693 | 692 | 723 | 664 | 805 |
| (c) Fibroblast dataset | |||||||
| cluster1 | 52 | 45 | 71 | 47 | 57 | 41 | 128 |
| cluster2 | 32 | 35 | 68 | 27 | 54 | 48 | 69 |
| cluster3 | 48 | 24 | 63 | 61 | 37 | 75 | 50 |
| cluster4 | 126 | 38 | 37 | 96 | 108 | 60 | 155 |
| cluster5 | 54 | 63 | 60 | 33 | 65 | 68 | 102 |
| Total | 312 | 205 | 299 | 264 | 321 | 292 | 504 |
| (d) Mouse dataset | |||||||
| cluster1 | 593 | 520 | 294 | 258 | 637 | 686 | 690 |
| cluster2 | 27 | 61 | 107 | 114 | 38 | 28 | 56 |
| Total | 620 | 581 | 401 | 372 | 675 | 714 | 746 |
Number of significantly enriched terms at α=0.05
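The source does not state which statistical test produced the enrichment p-values. A common choice for GO term and pathway enrichment is the one-sided hypergeometric test, sketched below purely as an assumption; the function and parameter names are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster_size, overlap):
    """P(X >= overlap) when drawing cluster_size genes without replacement
    from a population in which `annotated` genes carry the term."""
    upper = min(annotated, cluster_size)
    tail = sum(comb(annotated, i) * comb(population - annotated, cluster_size - i)
               for i in range(overlap, upper + 1))
    return tail / comb(population, cluster_size)

def count_significant(p_values, alpha=0.05):
    """Number of terms whose p-value falls below the significance level."""
    return sum(p < alpha for p in p_values)
```

Applying `count_significant` to the p-values of all terms tested for one cluster gives a per-cluster count of the kind tabulated above.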
Figure 4. Weighted p-value of significantly enriched GO terms. (a) and (b) show results for the ALL and AML clusters in the leukemia dataset. (d) and (e) show results for cluster 1 (assigned to the classic type) and cluster 2 (assigned to the desmoplastic type) in the medulloblastoma dataset. Of all significantly enriched factors, the top 50 are shown. (c) and (f) show the top 50 factors for each entire dataset. Results from the other datasets are available on the supplementary site.
Figure 5. Log-scaled p-values for significantly enriched factors. Each plot shows the significantly enriched terms (at α=0.05) for the AML cluster in the leukemia dataset using (a) K-means and (b) BSNMF. The x-axis represents log10 (p-value). The factors were divided into five categories: GO biological process (BP), GO cellular component (CC), GO molecular function (MF), BIOCARTA, and KEGG pathways.