Riccardo De Bin, Davide Risso.
Abstract
BACKGROUND: Cluster analysis is a crucial tool in several biological and medical studies dealing with microarray data. Such studies pose challenging statistical problems due to dimensionality issues, since the number of variables can be much higher than the number of observations.
Year: 2011 PMID: 21303507 PMCID: PMC3042915 DOI: 10.1186/1471-2105-12-49
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example of gene selection in the filtering step of our procedure. Gene1 is crucial in cluster definition, while gene2 is not. The univariate distributions of the genes reflect this, as one can see from the Gaussian kernel density estimation reported along the axes.
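The caption suggests that a gene's univariate density carries information about its relevance for clustering. One rough way to illustrate the idea is sketched below, assuming (this excerpt does not state it) that a gene is a candidate for retention when its Gaussian kernel density estimate shows more than one mode. The rule-of-thumb bandwidth, the grid size, the function names, and the expression values are all hypothetical choices made for illustration, not the authors' procedure.

```python
import math
import statistics


def silverman_bandwidth(sample):
    # Rule-of-thumb bandwidth; an assumption, the excerpt names no bandwidth rule.
    return 1.06 * statistics.stdev(sample) * len(sample) ** (-0.2)


def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


def count_modes(sample, grid_size=400):
    """Count local maxima of the estimated density on an evenly spaced grid."""
    h = silverman_bandwidth(sample)
    density = gaussian_kde(sample, h)
    lo, hi = min(sample) - 3 * h, max(sample) + 3 * h
    step = (hi - lo) / (grid_size - 1)
    values = [density(lo + i * step) for i in range(grid_size)]
    return sum(
        1
        for i in range(1, grid_size - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    )


# Hypothetical expression values for two genes across ten samples.
gene1 = [-0.3, -0.1, 0.0, 0.1, 0.3, 4.7, 4.9, 5.0, 5.1, 5.3]   # two groups
gene2 = [-1.2, -0.8, -0.5, -0.2, 0.0, 0.1, 0.3, 0.6, 0.9, 1.3]  # one group

for name, values in (("gene1", gene1), ("gene2", gene2)):
    modes = count_modes(values)
    print(name, "modes:", modes, "keep" if modes > 1 else "drop")
```

Counting modes on a grid is sensitive to the bandwidth, so a real filter would likely need a more careful criterion than this sketch uses.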
Simulation results for GG model
| Measure | pdfCluster | | Mclust | |
|---|---|---|---|---|
| SE | 0.9877 | 0.0007 | 0.9991 | 0.0001 |
| SP | 0.9866 | 0.0008 | 0.9985 | 0.0004 |
| ER | 0.0128 | 0.0006 | 0.0012 | 0.0002 |
| RG | 0.0837 | 0.0061 | 0.7787 | 0.0093 |
| CC | 0.77 | | 0.84 | |
Simulation results for pdfCluster and Mclust in GG model: rate of correct identification of number of clusters (CC), sensitivity (SE), specificity (SP) and error rate (ER) in the classification of the samples, and error rate in the selection of relevant genes (RG).
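The sample-classification measures named in the caption can be written out from the four cells of a two-group table. The sketch below uses the standard definitions of sensitivity, specificity, and error rate; the function name and the counts are hypothetical and are not taken from the simulations.

```python
def classification_rates(tp, fn, fp, tn):
    """Sensitivity, specificity and error rate from the cells of a 2x2 table."""
    total = tp + fn + fp + tn
    return {
        "SE": tp / (tp + fn),      # share of true positives recovered
        "SP": tn / (tn + fp),      # share of true negatives recovered
        "ER": (fn + fp) / total,   # share of samples placed in the wrong group
    }


# Hypothetical counts for illustration only.
rates = classification_rates(tp=45, fn=5, fp=2, tn=48)
for name, value in rates.items():
    print(name, round(value, 4))
```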
Simulation results for NU model
| Measure | pdfCluster | | Mclust | |
|---|---|---|---|---|
| RG | 0.433 | 0.041 | 0.616 | 0.077 |
| CC2 | 0.47 | | 0.34 | |
| CC3 | 0.19 | | 0.39 | |
| ER | 0.135 | 0.004 | 0.227 | 0.005 |
Simulation results for pdfCluster and Mclust in NU model: rate at which two clusters are identified (CC2), rate at which three clusters are identified (CC3), error rate in the classification of samples (ER) and error rate in the selection of relevant genes (RG).
Sample size
| Sample size | pdfCluster ER | | Mclust ER | |
|---|---|---|---|---|
| 10 | 0.182 | 0.033 | 0.302 | 0.039 |
| 20 | 0.131 | 0.030 | 0.381 | 0.025 |
| 50 | 0.114 | 0.020 | 0.287 | 0.025 |
| 100 | 0.137 | 0.015 | 0.230 | 0.019 |
| 200 | 0.172 | 0.012 | 0.204 | 0.014 |
Misclassification error rate (ER) for pdfCluster and Mclust in NU model, varying the sample size.
Clusters found in Colon data
| Cluster 1 | 1-6,8-19,21-29,31,32,34,35,37-40, |
| Cluster 2 | |
| Cluster 3 | |
Clusters found after pdfCluster procedure in Colon data; tumor samples are labeled 1-40, normal samples 41-62; misallocated samples are shown in bold. The star marks a wrongly labeled sample.
Confusion matrices for Colon data
| Real | pdfCluster 1 | pdfCluster 2-3 | Mclust 1 | Mclust 2 | k-means 1 | k-means 2 |
|---|---|---|---|---|---|---|
| Tumor | 35 | 5 | 29 | 11 | 23 | 17 |
| Normal | 3 | 19 | 12 | 10 | 6 | 16 |
| ER | 0.13 | | 0.37 | | 0.37 | |
Confusion matrices for pdfCluster, Mclust and k-means with error rates (ER) for Colon data.
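The error rates in the table can be checked from the counts. The sketch below assumes that ER is the share of samples falling outside the best one-to-one match between clusters and true classes, and that the three pairs of columns follow the caption's order (pdfCluster, Mclust, k-means). Neither assumption is stated explicitly in this excerpt, and the function name is hypothetical.

```python
from itertools import permutations


def matched_error_rate(confusion):
    """Error rate under the best one-to-one pairing of clusters with classes.

    confusion[i][j] is the number of samples of true class i placed in
    cluster j; the table must be square.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    best = max(
        sum(confusion[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return 1 - best / total


# Rows: Tumor, Normal. Counts copied from the Colon table.
colon = {
    "pdfCluster": [[35, 5], [3, 19]],   # clusters 2 and 3 merged, as in the table
    "Mclust": [[29, 11], [12, 10]],
    "k-means": [[23, 17], [6, 16]],
}

for method, table in colon.items():
    print(method, round(matched_error_rate(table), 2))
```

Under these assumptions the computed values round to 0.13, 0.37 and 0.37, in agreement with the ER row. Trying every permutation is only practical for a handful of clusters.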
Confusion matrices for Leukaemia data
| Real | pdfCluster 1 | pdfCluster 2 | Mclust 1 | Mclust 2 | Mclust 3 | Mclust 4 | k-means 1 | k-means 2 | k-means 3 |
|---|---|---|---|---|---|---|---|---|---|
| ALL B-cell | 37 | 1 | 9 | 20 | 9 | 0 | 15 | 0 | 23 |
| ALL T-cell | 5 | 4 | 0 | 0 | 7 | 2 | 7 | 2 | 0 |
| AML | 4 | 21 | 0 | 2 | 1 | 22 | 1 | 23 | 1 |
Confusion matrices for pdfCluster, Mclust and k-means with error rates (ER) for Leukaemia data.