Alok Sharma, Piotr J Kamola, Tatsuhiko Tsunoda.
Abstract
BACKGROUND: Clustering methods are becoming widely used in biomedical research, where the volume and complexity of data are rapidly increasing. Unsupervised clustering of patient information can reveal distinct phenotype groups with different underlying mechanisms, risk prognoses and treatment responses. However, biological datasets are usually characterized by a combination of low sample number and very high dimensionality, which current algorithms do not adequately address. While the performance of these methods is satisfactory for low-dimensional data, an increasing number of features results in either deterioration of accuracy or an inability to cluster. To tackle these challenges, new methodologies designed specifically for such data are needed.
Keywords: Cancer; EM algorithm; Feature matrix; Methylome; Phenotype clustering; Small sample size; Transcriptome
Year: 2017 PMID: 29297298 PMCID: PMC5751765 DOI: 10.1186/s12859-017-1970-8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An illustration of the 2D–EM clustering algorithm
Table 1 Arrangement of features into an m × m matrix
| Feature Selection |
| 1. Given x ∈ |
| 2. Perform hierarchical clustering on all samples x to find temporary class labels. |
| 3. Using these class labels find |
| 4. Find |
| 5. Retaining the top |
| Matrix arrangement |
| 6. Compute mean |
| 7. Arrange features of |
| 8. Arrange features of |
| 9. Reshape a sample |
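The steps above can be sketched in code, with heavy caveats: most of the step text is truncated in this copy, so the ranking statistic (a one-way ANOVA F score here), the matrix size `m` (passed in by hand) and the arrangement rule (sorting the retained features by their mean value) are all assumptions, and the names `anova_f_scores` and `fold_features` are hypothetical. The temporary class labels of step 2 are taken as input rather than computed by hierarchical clustering. This is a minimal illustration of folding a feature vector into an m × m matrix, not the authors' implementation.

```python
import numpy as np


def anova_f_scores(X, labels):
    """Per-feature one-way ANOVA F statistic (assumed ranking criterion)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_samples, n_classes = X.shape[0], classes.size
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        between += Xc.shape[0] * (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    ms_between = between / (n_classes - 1)
    ms_within = within / (n_samples - n_classes)
    return ms_between / np.maximum(ms_within, 1e-12)  # guard against zero variance


def fold_features(X, temp_labels, m):
    """Keep the m*m best-ranked features and reshape each sample to m x m."""
    X = np.asarray(X, dtype=float)
    scores = anova_f_scores(X, temp_labels)
    top = np.argsort(scores)[::-1][: m * m]   # indices of the top m^2 features
    mean_profile = X[:, top].mean(axis=0)     # mean of each retained feature
    order = top[np.argsort(mean_profile)]     # assumed arrangement: ascending mean
    matrices = X[:, order].reshape(X.shape[0], m, m)
    return matrices, order


# Toy data: 6 samples, 20 features, one clearly informative feature (index 5).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 20))
X[3:, 5] += 10.0
temp_labels = [0, 0, 0, 1, 1, 1]  # stand-in for labels from hierarchical clustering
matrices, order = fold_features(X, temp_labels, m=3)
print(matrices.shape)  # (6, 3, 3)
```

Every sample is folded with the same feature order, so a given matrix cell always refers to the same feature across samples.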
Fig. 2 Visualization of high-dimensional data
Fig. 3 Visualization of feature matrix: acute lymphoblastic leukemia (ALL) vs. acute myeloid leukemia (AML). An ALL sample or feature vector x ∈ ℝ is transformed to a feature matrix X ∈ ℝ using the procedure outlined in Table 1. These feature matrices are shown at the top right of the figure. Similarly, an AML sample is transformed to a feature matrix and shown at the bottom right of the figure
Transcriptome and methylome datasets
| Datasets | Features | Samples | Classes |
|---|---|---|---|
| ALL Leukemia | 7129 | 72 | 2 |
| SRBCT | 2308 | 83 | 4 |
| MLL | 12,582 | 72 | 3 |
| ALL Subtype | 12,558 | 327 | 7 |
| GCM | 16,063 | 198 | 14 |
| Lung Cancer | 12,553 | 181 | 2 |
| Gastric Cancer | 27,579 | 64 | 2 |
| Hepatocellular Carcinoma | 27,579 | 40 | 2 |
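To make the "low sample number, very high dimensionality" setting concrete, the short sketch below computes the number of features per sample for each dataset in the table. The dictionary simply restates the table; the variable names are illustrative.

```python
# (features, samples) copied from the datasets table above
datasets = {
    "ALL Leukemia": (7129, 72),
    "SRBCT": (2308, 83),
    "MLL": (12582, 72),
    "ALL Subtype": (12558, 327),
    "GCM": (16063, 198),
    "Lung Cancer": (12553, 181),
    "Gastric Cancer": (27579, 64),
    "Hepatocellular Carcinoma": (27579, 40),
}

# Features available per sample: the larger the ratio, the harder the problem.
ratios = {name: feats / samples for name, (feats, samples) in datasets.items()}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} {ratio:7.1f} features per sample")
```

Every dataset has far more features than samples, with the two methylome datasets at the extreme end.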
Rand score (highest values in bold)
| Method | SRBCT | ALL | MLL | ALL subtype | GCM | Lung cancer |
|---|---|---|---|---|---|---|
| K-means | 0.58 | 0.53 | 0.78 | 0.64 | 0.84 | 0.72 |
| CLink | 0.30 | 0.49 | 0.54 | 0.52 | 0.71 | 0.70 |
| ALink | 0.30 | 0.56 | 0.35 | 0.51 | 0.38 | 0.71 |
| Ward-Link | 0.44 | 0.56 | 0.78 | 0.53 | 0.84 | 0.80 |
| Weighted-Link | 0.30 | 0.52 | 0.51 | 0.52 | 0.61 | 0.71 |
| Mlink | 0.30 | 0.55 | 0.35 | 0.48 | 0.54 | 0.71 |
| Spectral Clustering | 0.39 | 0.51 | 0.56 | 0.63 | 0.55 | 0.71 |
| NNMF Clustering | | 0.50 | 0.74 | 0.64 | 0.83 | 0.63 |
| 2D–EM | 0.65 | | | | | |
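For reference, the Rand score reported in the table above is the fraction of sample pairs on which two labelings agree: both place the pair in the same cluster, or both place it in different clusters. The sketch below uses the standard pairwise definition; the function name `rand_score` and the toy labels are illustrative and not taken from the paper.

```python
from itertools import combinations


def rand_score(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Cluster names do not matter, only the grouping does.
print(rand_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```

Because pairs that are correctly kept apart also count as agreements, the score tends to be high when there are many classes, even for a poor clustering.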
Adjusted Rand index (highest values in bold)
| Method | SRBCT | ALL | MLL | ALL subtype | GCM | Lung cancer |
|---|---|---|---|---|---|---|
| K-means | 0.13 | 0.03 | 0.47 | 0.15 | 0.19 | 0.22 |
| CLink | 0.00 | −0.03 | 0.13 | 0.00 | 0.09 | −0.02 |
| ALink | 0.00 | 0.05 | 0.00 | −0.01 | 0.01 | −0.01 |
| Ward-Link | 0.00 | 0.09 | 0.51 | 0.00 | 0.17 | 0.41 |
| Weighted-Link | 0.00 | −0.03 | 0.08 | 0.00 | 0.07 | −0.01 |
| Mlink | 0.00 | 0.02 | 0.00 | −0.01 | 0.08 | −0.01 |
| Spectral Clustering | −0.02 | 0.02 | 0.02 | 0.00 | 0.07 | −0.01 |
| NNMF Clustering | 0.18 | 0.00 | 0.42 | 0.11 | 0.17 | 0.26 |
| Mclust | −0.02 | −0.01 | 0.21 | −0.01 | 0.09 | 0.05 |
| 2D–EM | | | | | | |
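The adjusted Rand index corrects the Rand score for chance agreement, which is why values near zero, and even slightly negative ones, appear in the table above. A minimal sketch of the standard contingency-table formula follows; the function name `adjusted_rand_index` and the toy labels are illustrative.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected pair agreement computed from the contingency table."""
    n = len(labels_true)
    cell_counts = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

A perfect clustering scores 1, a random one scores about 0, and a clustering that agrees less than chance scores below 0.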
Percentage improvement of the 2D–EM clustering method over other existing clustering methods
| Parameter | SRBCT | ALL | MLL | ALL subtype | GCM | Lung cancer |
|---|---|---|---|---|---|---|
| Rand Score | −1.5 | 10.7 | 2.6 | 21.9 | 3.6 | 5.0 |
| Adjusted Rand Index | 5.6 | 155.6 | 11.8 | 73.3 | 21.1 | 51.2 |
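The text here does not say how the percentages are computed. One plausible reading, assumed in the sketch below, is the relative gain over the best competing method: (2D–EM − best other) / best other × 100. Under that assumption, the adjusted Rand row can be cross-checked against the competing methods' scores listed in the adjusted Rand index table. The `implied` values are back-calculated from the assumption and are not reported figures.

```python
columns = ["SRBCT", "ALL", "MLL", "ALL subtype", "GCM", "Lung cancer"]

# Adjusted Rand index of the competing methods, copied from the table above.
baselines = [
    [0.13, 0.03, 0.47, 0.15, 0.19, 0.22],      # K-means
    [0.00, -0.03, 0.13, 0.00, 0.09, -0.02],    # CLink
    [0.00, 0.05, 0.00, -0.01, 0.01, -0.01],    # ALink
    [0.00, 0.09, 0.51, 0.00, 0.17, 0.41],      # Ward-Link
    [0.00, -0.03, 0.08, 0.00, 0.07, -0.01],    # Weighted-Link
    [0.00, 0.02, 0.00, -0.01, 0.08, -0.01],    # Mlink
    [-0.02, 0.02, 0.02, 0.00, 0.07, -0.01],    # Spectral Clustering
    [0.18, 0.00, 0.42, 0.11, 0.17, 0.26],      # NNMF Clustering
    [-0.02, -0.01, 0.21, -0.01, 0.09, 0.05],   # Mclust
]
reported_gain = [5.6, 155.6, 11.8, 73.3, 21.1, 51.2]  # percentage-improvement row


def percent_improvement(new, best_other):
    """Assumed definition: relative gain over the best competing score, in %."""
    return (new - best_other) / best_other * 100


# Best competing score per dataset (column-wise maximum).
best = {col: max(row[i] for row in baselines) for i, col in enumerate(columns)}

# Score that the reported gain would imply under the assumed definition.
implied = {
    col: best[col] * (1 + gain / 100) for col, gain in zip(columns, reported_gain)
}

for col in columns:
    print(f"{col:12s} best other = {best[col]:.2f}  implied score = {implied[col]:.2f}")
```

If the assumed definition is right, each implied score fed back through `percent_improvement` reproduces the reported percentage up to rounding.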
Fig. 4 Comparison of average performance (in terms of Rand score and adjusted Rand index)
Fig. 5 Box plot showing the effect of changing the cut-off value for the 2D–EM clustering algorithm
Fig. 6 Rand score of the five best-performing methods over 100 runs
Fig. 7 Adjusted Rand index over 100 runs
Fig. 8 Rand score and adjusted Rand index on gastric cancer methylation data (Beta-values)
Fig. 9 Rand score and adjusted Rand index on gastric cancer methylation data (M-values)
Fig. 10 Rand score and adjusted Rand index on hepatocellular carcinoma (Beta-values)
Fig. 11 Rand score and adjusted Rand index on hepatocellular carcinoma (M-values)