Alok Sharma, Daichi Shigemizu, Keith A Boroevich, Yosvany López, Yoichiro Kamatani, Michiaki Kubo, Tatsuhiko Tsunoda.
Abstract
BACKGROUND: Biological and genetic data come as a complex mix of forms and topologies, which makes them difficult to analyze. The abundance of such data requires sophisticated statistical methods that can analyze it in a reasonable amount of time. In many biological and genetic analyses, such as genome-wide association study (GWAS) analysis or multi-omics data analysis, the data must be clustered into sub-categories to understand the subtypes of populations, cancers or other diseases. Traditionally, k-means has been the dominant clustering method because of its simplicity and reasonable accuracy. Many other clustering methods, including support vector clustering, have been developed, but they do not perform well on biological data, either for computational reasons or because they fail to identify the clusters.
Year: 2016 PMID: 27553625 PMCID: PMC4995791 DOI: 10.1186/s12859-016-1184-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 An illustration of the stepwise iterative maximum likelihood method for a two-cluster case. Two clusters are given with likelihood functions L1 and L2, respectively. The cluster centers are denoted by μ1 and μ2 (shown as '+' inside the two clusters). The initial total likelihood Lold is the sum of the two likelihood functions (L1 + L2). A sample is checked for regrouping: it is advantageous to shift the sample to the other cluster only if the new likelihood (Lnew = L1* + L2*) is higher than the old likelihood, i.e., Lnew > Lold
Stepwise iterative maximum likelihood method procedure
| 1. | |
| 2. | |
| 3. If | |
| 4. | |
| 5. Transfer | |
| 6. Update | |
| 7. If |
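The rule described in the Fig. 1 caption (shift a sample only when the total likelihood rises) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' exact procedure: each cluster is modeled as a Gaussian with its own maximum-likelihood mean and covariance, log-likelihoods are used in place of likelihoods, and the names `cluster_log_likelihood` and `stepwise_reassign` as well as the parameters `reg`, `min_size`, `tol` and `max_sweeps` are hypothetical.

```python
import numpy as np


def cluster_log_likelihood(points, reg=1e-6):
    """Gaussian log-likelihood of `points` under their own ML mean and covariance."""
    n, d = points.shape
    if n == 0:
        return 0.0
    centered = points - points.mean(axis=0)
    # Assumption: a small ridge `reg` keeps the covariance invertible.
    cov = centered.T @ centered / n + reg * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    mahal = np.einsum("ij,jk,ik->", centered, np.linalg.inv(cov), centered)
    return -0.5 * (n * d * np.log(2.0 * np.pi) + n * logdet + mahal)


def stepwise_reassign(X, labels, n_clusters, max_sweeps=20, min_size=None, tol=1e-9):
    """Move one sample at a time, keeping a move only if L_new > L_old."""
    X = np.asarray(X, dtype=float)
    labels = np.array(labels, dtype=int)  # copy, so the caller's labels are untouched
    if min_size is None:
        min_size = X.shape[1] + 2  # assumption: guards against degenerate clusters
    ll = [cluster_log_likelihood(X[labels == k]) for k in range(n_clusters)]
    for _ in range(max_sweeps):
        moved = False
        for i in range(len(X)):
            src = labels[i]
            if np.sum(labels == src) <= min_size:
                continue  # do not shrink a cluster below min_size
            labels[i] = -1  # take sample i out of its cluster for the trial
            ll_src = cluster_log_likelihood(X[labels == src])
            best_gain, best_dst, best_ll_dst = tol, src, ll[src]
            for dst in range(n_clusters):
                if dst == src:
                    continue
                ll_dst = cluster_log_likelihood(np.vstack([X[labels == dst], X[i]]))
                gain = (ll_src + ll_dst) - (ll[src] + ll[dst])  # L_new - L_old
                if gain > best_gain:
                    best_gain, best_dst, best_ll_dst = gain, dst, ll_dst
            labels[i] = best_dst  # stays at src when no move improves the total
            if best_dst != src:
                ll[src], ll[best_dst] = ll_src, best_ll_dst
                moved = True
        if not moved:
            break
    return labels, float(sum(ll))


# Toy data: two well-separated Gaussian blobs with ten deliberately wrong labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
start = np.array([0] * 50 + [1] * 50)
start[:5], start[50:55] = 1, 0
final, total_ll = stepwise_reassign(X, start, n_clusters=2)
print(final, total_ll)
```

Because a move is accepted only when it strictly increases the summed log-likelihood, the total never decreases from one sweep to the next. The `min_size` guard is a design choice of this sketch: without it, a full-covariance Gaussian can reward clusters that collapse onto a handful of points.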
Fig. 2 An illustration using 3 clusters: (a) three-cluster data where n = 1500 and d = 2; (b) k-means clustering, where different colors show different clusters; (c) support vector clustering (CG method); (d) support vector clustering (SEP method); (e) stepwise iterative maximum likelihood (SIML) method
Fig. 3 Likelihood plots: (a) L plot, (b) MaxL plot and (c) DelL plot
Fig. 4 Processing time of the SIML method for n = 3k to 102k and d = 10 to 200
Fig. 5 (a) Average clustering accuracy on Gaussian data. (b) Average Rand score on Gaussian data
Clustering accuracy on SRBCT dataset
| Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
|---|---|---|---|---|---|---|
| 2 | 60.4 | 34.9 | 62.7 | 54.2 | 62.7 | |
| 3 | 67.9 | 39.8 | 69.9 | | 69.9 | 66.3 |
| 4 | 77.1 | 49.4 | 65.1 | 67.5 | 72.3 | |
| 5 | 70.3 | 50.6 | | 50.6 | 65.1 | 67.5 |
| 6 | 64.0 | 39.8 | 53.0 | 53.0 | 57.8 | |
The methods achieving highest results are depicted in bold faces
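As a hedged illustration of how a clustering accuracy figure like those in the table can be obtained, the sketch below scores a clustering against known class labels. It assumes accuracy means the fraction of samples that agree with the class labels under the best one-to-one mapping of cluster IDs to classes; this definition and the function name `clustering_accuracy` are assumptions, not taken from the source.

```python
from itertools import permutations


def clustering_accuracy(labels_true, labels_pred):
    """Best-case agreement between clusters and classes over all ID mappings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    classes = sorted(set(labels_true))
    clusters = sorted(set(labels_pred))
    if len(clusters) > len(classes):
        raise ValueError("this sketch needs at most as many clusters as classes")
    best_hits = 0
    # Try every way of naming the clusters after distinct classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[p] == t for t, p in zip(labels_true, labels_pred))
        best_hits = max(best_hits, hits)
    return best_hits / len(labels_true)


score = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
print(f"{100 * score:.1f} %")
```

Enumerating permutations is only practical for a handful of clusters; for larger problems an assignment solver would be the usual replacement.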
Rand score on SRBCT dataset
| Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
|---|---|---|---|---|---|---|
| 2 | 69.5 | 32.9 | 69.9 | 60.0 | 62.7 | |
| 3 | 77.2 | 32.0 | 76.6 | | 69.9 | 75.3 |
| 4 | 80.5 | 51.3 | 71.4 | 74.8 | 72.3 | |
| 5 | 78.3 | 53.1 | | 53.1 | 65.1 | 75.0 |
| 6 | 72.4 | 35.8 | 56.5 | 56.5 | 57.8 | |
The methods achieving highest results are depicted in bold faces
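For readers unfamiliar with the measure, here is a minimal sketch of a Rand score computation. It assumes the table reports the standard Rand index (the fraction of sample pairs on which two labelings agree about "same cluster" versus "different cluster") as a percentage; the function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs treated the same way by both labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    # A pair agrees if both labelings group it together or both split it.
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


print(100 * rand_index([0, 0, 1, 1], [0, 1, 1, 1]))  # 3 of 6 pairs agree: 50.0
```

Unlike an accuracy score, the Rand index needs no mapping between cluster IDs and class labels, because it only compares pairwise groupings.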
Clustering accuracy on MLL dataset
| Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
|---|---|---|---|---|---|---|
| 2 | 56.3 | 40.3 | 45.8 | 45.8 | | 58.3 |
| 3 | 58.8 | 40.3 | 50.0 | 50.0 | | 61.1 |
| 4 | 59.5 | 43.1 | 54.2 | 43.1 | | 72.2 |
| 5 | 81.9 | 43.1 | 72.2 | 69.4 | 94.4 | |
| 6 | 81.9 | 43.1 | 81.9 | 69.4 | 55.6 | |
| 7 | 80.0 | 41.7 | 81.9 | 72.2 | 91.7 | |
| 8 | 81.7 | 43.1 | 79.2 | 68.1 | | 62.5 |
| 9 | 82.8 | 48.6 | 80.6 | | 65.3 | 63.9 |
| 10 | 80.4 | 43.1 | 58.3 | 63.9 | 61.1 | |
The methods achieving highest results are depicted in bold faces
Rand score on MLL dataset
| Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
|---|---|---|---|---|---|---|
| 2 | 63.6 | 35.0 | 41.1 | 41.1 | | 72.3 |
| 3 | 67.5 | 35.0 | 45.7 | 45.7 | 68.1 | |
| 4 | 64.0 | 36.3 | 47.2 | 36.3 | | 77.5 |
| 5 | 80.4 | 36.3 | 75.2 | 70.2 | 94.4 | |
| 6 | 80.4 | 36.3 | 80.4 | 70.2 | 55.6 | |
| 7 | 79.6 | 35.3 | 80.4 | 75.7 | 91.7 | |
| 8 | 80.6 | 36.3 | 78.4 | 67.7 | | 69.9 |
| 9 | 81.2 | 41.1 | 79.3 | | 65.3 | 71.6 |
| 10 | 80.3 | 36.3 | 66.1 | 73.2 | 61.1 | |
The methods achieving highest results are depicted in bold faces
Clustering accuracy on ALL subtype dataset
| Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
|---|---|---|---|---|---|---|
| 2 | | 32.1 | 42.8 | 36.1 | 34.3 | 44.0 |
| 3 | 53.3 | 25.1 | 45.3 | 46.2 | 34.9 | |
| 4 | 57.4 | 25.1 | 51.7 | 49.9 | 33.3 | |
| 5 | 60.4 | 26.0 | 42.8 | 34.6 | 44.7 | |
| 6 | 58.9 | 25.4 | 38.8 | 41.0 | 45.3 | |
| 7 | | 24.2 | 47.4 | 36.1 | 49.5 | 56.3 |
| 8 | 54.5 | 25.7 | 42.8 | 34.8 | 41.9 | |
The methods achieving highest results are depicted in bold faces
Rand score on ALL subtype dataset
| Dim | K-means | SLINK | CLINK | MLINK | mclust | SIML |
|---|---|---|---|---|---|---|
| 2 | | 37.1 | 68.2 | 49.9 | 34.3 | 71.8 |
| 3 | | 20.5 | 73.5 | 62.7 | 34.9 | 77.6 |
| 4 | | 20.4 | 78.3 | 72.4 | 33.3 | 81.2 |
| 5 | 79.6 | 22.0 | 69.0 | 47.3 | 44.7 | |
| 6 | 79.9 | 21.6 | 69.5 | 67.5 | 45.3 | |
| 7 | 79.9 | 21.0 | 75.2 | 40.6 | 49.5 | |
| 8 | 77.8 | 21.7 | 70.3 | 60.6 | 74.9 | |
The methods achieving highest results are depicted in bold faces
The estimation of the number of clusters by SIML
| Dim | SRBCT | MLL | ALL subtype |
|---|---|---|---|
| 2 | 4 | 3 | 7 |
| 3 | 4 | 2 | 7 |
| 4 | 4 | 2 | 8 |
| 5 | 4 | 3 | 4,7 |
| 6 | 2,4 | 3 | 7,9 |
| 7 | | 3 | 3,8 |
| 8 | | 3 | 7 |
| 9 | | 3 | |
| 10 | | 6 | |
True positives for Hondo, RYU and CHB cluster on BBJ and HapMap data
| Methods | Hondo (6891) | RYU (151) | CHB (45) |
|---|---|---|---|
| K-means | 4922 (71.4 %) | 129 (85.5 %) | 45 (100 %) |
| SLINK | 6886 (99.9 %) | 0 (0 %) | 45 (100 %) |
| CLINK | 6746 (97.9 %) | 140 (92.7 %) | 45 (100 %) |
| MLINK | 6603 (95.8 %) | 139 (92.1 %) | 45 (100 %) |
| SIML | 6707 (97.3 %) | 143 (94.7 %) | 45 (100 %) |
| mclust | 4602 (66.8 %) | 143 (94.7 %) | 0 (0 %) |
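The percentages in this table appear to be each method's true-positive count divided by the cluster size given in the header (6891, 151 and 45). A short sketch of that arithmetic for the SIML counts follows; the helper name `true_positive_rate` is hypothetical, and rounding to one decimal place is an assumption.

```python
def true_positive_rate(correct, cluster_size):
    """Percentage of a reference cluster's samples that were recovered."""
    return round(100.0 * correct / cluster_size, 1)


# Cluster sizes from the table header and the SIML true-positive counts.
sizes = {"Hondo": 6891, "RYU": 151, "CHB": 45}
siml_counts = {"Hondo": 6707, "RYU": 143, "CHB": 45}

siml_rates = {name: true_positive_rate(siml_counts[name], sizes[name]) for name in sizes}
print(siml_rates)
```

For these inputs the computed rates are 97.3, 94.7 and 100.0, matching the percentages reported for SIML.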
Fig. 6 Clustering by SIML on 2-dimensional BBJ data
Fig. 7 MaxL plot for 2-dimensional BBJ and HapMap data