Weizhong Zhao, Wen Zou, James J Chen.
Abstract
BACKGROUND: The big data moniker is nowhere better deserved than in describing the ever-increasing size and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has mainly been used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means of clustering, or of overcoming clustering difficulties, in large biological and medical datasets.
Year: 2014 PMID: 25350106 PMCID: PMC4251039 DOI: 10.1186/1471-2105-15-S11-S11
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. The workflow of the topic model-derived clustering methods.
Cluster analysis of the Salmonella dataset using the method of topic model-derived clustering based on highest probable topic assignment.
| Most dominant serotype | Number of isolates | Topic ID | % of most dominant serotype |
|---|---|---|---|
| Enteritidis | 1046 | T11 | 99.71% |
| Saintpaul | 989 | T12 | 99.60% |
| Paratyphi B | 850 | T26 | 99.41% |
| Enteritidis | 1236 | T2 | 99.35% |
| Saintpaul | 709 | T29 | 99.29% |
| Hadar | 1837 | T18 | 99.18% |
| Poona | 1216 | T22 | 98.68% |
| Oranienburg | 1847 | T27 | 98.65% |
| Poona | 504 | T16 | 98.41% |
| Newport | 1179 | T15 | 98.39% |
| Braenderup | 852 | T14 | 98.12% |
| Heidelberg | 2125 | T23 | 96.80% |
| Typhi | 1845 | T19 | 95.88% |
| Braenderup | 1135 | T9 | 95.51% |
| Javiana | 2002 | T1 | 94.36% |
| Agona | 1846 | T13 | 91.87% |
| Infantis | 2130 | T25 | 89.48% |
| Thompson | 2195 | T7 | 89.25% |
| 4, 5, 12:i- | 1024 | T28 | 86.82% |
| Paratyphi B | 1041 | T10 | 85.40% |
| Typhimurium var. 5- | 288 | T5 | 84.03% |
| Montevideo | 2240 | T17 | 80.31% |
| 4, 5, 12:i- | 854 | T3 | 79.39% |
| Mississippi | 1860 | T4 | 78.60% |
| Typhimurium var. 5- | 1201 | T20 | 66.36% |
| Typhimurium | 1217 | T21 | 54.97% |
| Typhimurium | 738 | T0 | 51.63% |
| Typhimurium var. 5- | 417 | T6 | 48.68% |
| Typhimurium | 815 | T24 | 38.16% |
| Muenchen | 3994 | T8 | 36.60% |
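The "highest probable topic assignment" summarized in the table above can be illustrated with a minimal sketch. It assumes a per-sample topic distribution is already available from a fitted topic model; the function names, the toy probabilities, and the toy serotype labels are hypothetical and are not taken from the Salmonella dataset.

```python
from collections import Counter, defaultdict


def assign_by_highest_topic(doc_topic):
    """Return, for each sample, the index of its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


def dominant_label_share(assignments, labels):
    """Per topic cluster: (dominant label, cluster size, share of dominant label)."""
    by_cluster = defaultdict(list)
    for topic, label in zip(assignments, labels):
        by_cluster[topic].append(label)
    summary = {}
    for topic, members in by_cluster.items():
        label, count = Counter(members).most_common(1)[0]
        summary[topic] = (label, len(members), count / len(members))
    return summary


# Toy input (hypothetical): 5 samples, 3 topics, known serotype per sample.
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.7, 0.1],
    [0.6, 0.3, 0.1],
]
serotypes = ["Enteritidis", "Enteritidis", "Hadar", "Hadar", "Hadar"]

assignments = assign_by_highest_topic(doc_topic)
summary = dominant_label_share(assignments, serotypes)
for topic, (label, size, share) in sorted(summary.items()):
    print(f"T{topic}: {label}, n={size}, {share:.2%}")
```

Each row of the table corresponds to one such cluster: the topic ID, the number of isolates assigned to it, and the share held by its most frequent serotype.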
Figure 2. Hierarchical cluster analysis of the LDA-derived clusters. The dendrogram shows a simplified tree structure of the 30 topic clusters (T0 to T29). For each of the 30 clusters, the average PFGE band presence/absence across all sample isolates was calculated at 60 designated band locations. The hierarchical clustering algorithm was applied to the Euclidean distances between the means of the 30 clusters, yielding two major clusters (A and B).
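The procedure described in this caption (per-cluster mean band profiles, Euclidean distances, hierarchical grouping) might be sketched as follows. The caption does not state the linkage criterion, so average linkage is used here purely as an assumption; the function names and the four-band toy profiles are hypothetical.

```python
import math


def cluster_means(profiles, labels):
    """Mean band presence (0/1) per cluster at each band location."""
    sums, counts = {}, {}
    for profile, label in zip(profiles, labels):
        acc = sums.setdefault(label, [0.0] * len(profile))
        for k, value in enumerate(profile):
            acc[k] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def agglomerate(means, n_groups=2):
    """Merge the closest groups until n_groups remain.

    Average linkage is an assumption; the source does not name the linkage.
    """
    groups = [[lab] for lab in sorted(means)]

    def dist(g1, g2):
        total = sum(euclidean(means[a], means[b]) for a in g1 for b in g2)
        return total / (len(g1) * len(g2))

    while len(groups) > n_groups:
        pairs = ((i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups)))
        i, j = min(pairs, key=lambda ij: dist(groups[ij[0]], groups[ij[1]]))
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


# Toy data (hypothetical): 6 isolates, 4 band locations, 4 topic clusters.
profiles = [
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1],
]
labels = ["T0", "T0", "T1", "T2", "T2", "T3"]

means = cluster_means(profiles, labels)
major_groups = agglomerate(means, n_groups=2)
print(major_groups)
```

In the study itself the profiles span 60 band locations and 30 topic clusters; stopping at two groups mirrors the two major clusters (A and B) mentioned in the caption.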
Comparison of the results on the lung cancer dataset using the three proposed topic model-derived clustering methods.
| Methods | No. of clusters | Cluster ID | Adenocarcinoma | Squamous cell carcinoma | No. of misclassified samples | NMI |
|---|---|---|---|---|---|---|
| Clustering based on feature selection | 2 | 1 | 42 | 11 | | |
| | | 2 | 11 | 47 | | |
| | 3 | 1 | 40 | 8 | | |
| | | 2 | 4 | 15 | | |
| | | 3 | 9 | 35 | | |
| | 4 | 1 | 37 | 8 | | |
| | | 2 | 9 | 35 | | |
| | | 3 | 0 | 14 | | |
| | | 4 | 7 | 1 | | |
| Clustering based on highest topic assignment | 2 | 1 | 13 | 46 | 25 | 0.2296 |
| | | 2 | 40 | 12 | | |
| | 3 | 1 | 11 | 29 | 25 | 0.1847 |
| | | 2 | 37 | 9 | | |
| | | 3 | 5 | 20 | | |
| | 4 | 1 | 5 | 13 | 26 | 0.1744 |
| | | 2 | 13 | 26 | | |
| | | 3 | 1 | 12 | | |
| | | 4 | 34 | 7 | | |
| Clustering based on feature extraction | 2 | 1 | 13 | 47 | 24 | 0.2461 |
| | | 2 | 40 | 11 | | |
| | 3 | 1 | 8 | 34 | 24 | 0.2055 |
| | | 2 | 8 | 16 | | |
| | | 3 | 37 | 8 | | |
| | 4 | 1 | 7 | 6 | 25 | 0.1820 |
| | | 2 | 33 | 6 | | |
| | | 3 | 8 | 31 | | |
| | | 4 | 5 | 15 | | |
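The two evaluation columns in the table above can be recomputed from the per-cluster counts. The sketch below assumes that each cluster is labeled with its majority class when counting misclassified samples, and that NMI is the mutual information divided by the geometric mean of the two entropies; the source does not state which normalization was used, so that choice is an assumption. The function names are hypothetical. The counts are those of the two-cluster "highest topic assignment" rows, for which the table reports 25 misclassified samples and an NMI of 0.2296, giving a reference point for the output.

```python
import math


def misclassified(table):
    """Samples outside the majority class of their cluster.

    `table` has one row per cluster and one column per true class.
    """
    return sum(sum(row) - max(row) for row in table)


def nmi(table):
    """Normalized mutual information between clusters and true classes.

    Normalization by sqrt(H(clusters) * H(classes)) is an assumption.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                mi += (count / n) * math.log(count * n / (row_tot[i] * col_tot[j]))
    h_rows = -sum((t / n) * math.log(t / n) for t in row_tot if t > 0)
    h_cols = -sum((t / n) * math.log(t / n) for t in col_tot if t > 0)
    return mi / math.sqrt(h_rows * h_cols)


# Rows: clusters 1 and 2; columns: adenocarcinoma, squamous cell carcinoma.
highest_topic_k2 = [[13, 46], [40, 12]]

print(misclassified(highest_topic_k2))
print(round(nmi(highest_topic_k2), 4))
```

The same two functions accept the three- and four-cluster rows unchanged, since they only require one row of class counts per cluster.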
Comparison of the results on the lung cancer dataset using the proposed method of topic model-derived clustering based on feature selection and two conventional approaches, k-means and PCA.
| Methods | No. of clusters | Cluster ID | Adenocarcinoma | Squamous cell carcinoma | No. of misclassified samples | NMI |
|---|---|---|---|---|---|---|
| Topic model-derived clustering based on feature selection | 2 | 1 | 42 | 11 | | |
| | | 2 | 11 | 47 | | |
| | 3 | 1 | 40 | 8 | | |
| | | 2 | 4 | 15 | | |
| | | 3 | 9 | 35 | | |
| | 4 | 1 | 37 | 8 | | |
| | | 2 | 9 | 35 | | |
| | | 3 | 0 | 14 | | |
| | | 4 | 7 | 1 | | |
| k-means | 2 | 1 | 41 | 12 | 24 | 0.2461 |
| | | 2 | 12 | 46 | | |
| | 3 | 1 | 8 | 35 | 31 | 0.1365 |
| | | 2 | 27 | 17 | | |
| | | 3 | 18 | 6 | | |
| | 4 | 1 | 6 | 14 | 25 | 0.1602 |
| | | 2 | 22 | 6 | | |
| | | 3 | 18 | 6 | | |
| | | 4 | 7 | 32 | | |
| PCA (10 features) + | 2 | 1 | 12 | 46 | 24 | 0.2461 |
| | | 2 | 41 | 12 | | |
| | 3 | 1 | 8 | 35 | 31 | 0.1456 |
| | | 2 | 22 | 6 | | |
| | | 3 | 23 | 17 | | |
| | 4 | 1 | 16 | 5 | 25 | 0.1605 |
| | | 2 | 6 | 14 | | |
| | | 3 | 7 | 32 | | |
| | | 4 | 24 | 7 | | |
Figure 3. Survival analysis of the breast cancer dataset. The three subgroups are shown in three different colors. The p-value of the log-rank test for differences among the three subgroups was 0.000174; for the difference between G2 and G3, the p-value was 0.645.
Figure 4. Survival analysis of the breast cancer dataset. The two subgroups are shown in two different colors. The p-value of the log-rank test for the difference between the two subgroups was 4.6e-5.
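For readers unfamiliar with the log-rank test cited in these captions, the two-group case can be sketched in plain Python. This is an illustrative implementation of the standard two-sample statistic with a chi-square approximation on one degree of freedom, not the software used in the study; the function name and the toy follow-up times are hypothetical.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-sample log-rank test.

    times_*: follow-up times; events_*: 1 if the event was observed,
    0 if the observation was censored. Returns (chi-square statistic, p-value).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0  # observed minus expected events in group A
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk, A
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk, B
        n = n_a + n_b
        d = sum(1 for ti, e, _ in data if ti == t and e)          # events at t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    stat = obs_minus_exp ** 2 / variance
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Toy data (hypothetical): group A fails early, group B fails late.
stat, p_value = logrank_two_groups(
    [1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
    [6, 7, 8, 9, 10], [1, 1, 1, 1, 1],
)
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}")
```

A comparison among three subgroups, as in Figure 3, needs the multi-group form of the test with two degrees of freedom, which this sketch does not cover.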