Weizhong Zhao, James J Chen, Roger Perkins, Zhichao Liu, Weigong Ge, Yijun Ding, Wen Zou.
Abstract
BACKGROUND: Topic modelling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents, and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide range of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Currently, research has yielded no easy way to choose the proper number of topics in a model beyond a major iterative approach. METHODS AND…
Year: 2015 PMID: 26424364 PMCID: PMC4597325 DOI: 10.1186/1471-2105-16-S13-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. RPC values of LDA models with various testing topic numbers in each of three datasets. (a) Salmonella sequence dataset; (b) SIDER2 dataset; (c) TCBB dataset.
Figure 2. Comparison of frequencies of candidate topic numbers obtained by perplexity-based method and RPC-based method.
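The record does not spell out what RPC stands for. The sketch below assumes it denotes the rate of perplexity change between successive candidate topic numbers, i.e. the absolute change in perplexity divided by the change in topic number. The function name and the perplexity values in the usage note are hypothetical and are not taken from the paper. How a topic number is then chosen from these values is not described in this record, so the sketch stops at the RPC values themselves.

```python
def rate_of_perplexity_change(topic_numbers, perplexities):
    """Return |P_i - P_(i-1)| / (t_i - t_(i-1)) for each successive pair.

    Assumption: this is what "RPC" refers to; the record itself gives
    no definition. topic_numbers must be strictly increasing.
    """
    if len(topic_numbers) != len(perplexities):
        raise ValueError("need one perplexity per candidate topic number")

    rpc = []
    for i in range(1, len(topic_numbers)):
        step = topic_numbers[i] - topic_numbers[i - 1]
        if step <= 0:
            raise ValueError("topic numbers must be strictly increasing")
        # Absolute perplexity change per added topic.
        rpc.append(abs(perplexities[i] - perplexities[i - 1]) / step)
    return rpc
```

For example, with made-up perplexities of 1000.0, 800.0 and 750.0 at 10, 20 and 30 topics, the function returns one value per interval: 20.0 for the step from 10 to 20 topics and 5.0 for the step from 20 to 30.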
Hierarchical clustering accuracy and running time of Salmonella sequence dataset
| T* | 5 | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|
| Misclassified | 3 | 3 | 0 | 15 | 15 | |
| Time (ms) | 33,914 | 34,584 | 35,478 | 35,636 | 35,816 | |
| Misclassified | 15 | 15 | 15 | 15 | 15 | |
| Time (ms) | 36,143 | 36,365 | 36,517 | 36,636 | 36,969 | |
*T: Number of topics.
K-means clustering accuracy and running time of Salmonella sequence dataset
| T | 5 | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|
| Purity** ( | 0.95 | 0.93 | 0.96 | 0.93 | 0.93 | |
| Time (ms) | 33,914 | 34,584 | 35,478 | 35,636 | 35,816 | |
| Purity ( | 0.93 | 0.93 | 0.93 | 0.93 | 0.93 | |
| Time (ms) | 36,143 | 36,365 | 36,517 | 36,636 | 36,969 | |
**Purity of each cluster is calculated as the ratio of correctly classified strains to the strains in that cluster (119 strains in total). The ratios in the table are the average purities of the k clusters obtained for each topic model.
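The purity calculation in the footnote can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that a member counts as "correctly classified" when it carries the most common true label in its cluster, and the function names are hypothetical.

```python
from collections import Counter


def cluster_purity(true_labels):
    """Share of a cluster's members that carry its most common true label."""
    if not true_labels:
        raise ValueError("cluster must not be empty")
    majority_count = Counter(true_labels).most_common(1)[0][1]
    return majority_count / len(true_labels)


def average_purity(assignments, true_labels):
    """Average the per-cluster purities over all k clusters.

    assignments[i] is the cluster id given to sample i;
    true_labels[i] is its known class.
    """
    if len(assignments) != len(true_labels):
        raise ValueError("need one true label per assignment")

    # Group the true labels by assigned cluster.
    clusters = {}
    for cluster_id, label in zip(assignments, true_labels):
        clusters.setdefault(cluster_id, []).append(label)

    purities = [cluster_purity(members) for members in clusters.values()]
    return sum(purities) / len(purities)
```

Under this assumption, two clusters with purities 2/3 and 1.0 give an average purity of 5/6. Note that this is an unweighted mean over clusters, following the footnote's wording, so a small cluster counts as much as a large one.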
Hierarchical clustering accuracy and running time on SIDER2 dataset
| T* | 5 | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|
| Misclassified | 443 | 411 | 362 | 355 | 285 | |
| Time (ms) | 43,378 | 45,233 | 48,252 | 49,278 | 50,493 | |
| Misclassified | 223 | 246 | 251 | 269 | 269 | |
| Time (ms) | 52,526 | 52,577 | 54,298 | 54,468 | 54,608 | |
*T: Number of topics.
K-means clustering accuracy and running time of SIDER2 dataset
| T | 5 | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|---|
| Purity** ( | 0.41 | 0.44 | 0.53 | 0.53 | 0.53 | 0.58 |
| Purity ( | 0.41 | 0.44 | 0.56 | 0.50 | 0.54 | |
| Time (ms) | 43,378 | 45,233 | 48,252 | 49,278 | 50,493 | |
| Purity ( | 0.55 | 0.57 | 0.56 | 0.54 | | |
| Purity ( | 0.59 | 0.57 | 0.57 | 0.56 | 0.56 | |
| Time (ms) | 52,526 | 52,577 | 54,298 | 54,468 | 54,608 | |
**Purity of each cluster is calculated as the ratio of correctly classified drugs to the drugs in that cluster (996 drugs in total). The ratios in the table are the average purities of the k clusters obtained for each topic model.
Figure 3. Eight example topics obtained by LDA modeling with 40 topics on the TCBB dataset.
Abstracts with label T8 (Estimation models)
| PMID* | Title | Probability of T8 |
|---|---|---|
| 21519119 | Inferring the number of contributors to mixed DNA profiles | 0.642 |
| 21844637 | Exploiting the functional and taxonomic structure of genomic data by probabilistic topic modeling | 0.568 |
| 24384712 | Computing the joint distribution of tree shape and tree distance for gene tree inference and recombination detection | 0.511 |
| 24042552 | Computing the Joint Distribution of Tree Shape and Tree Distance for Gene Tree Inference and Recombination Detection | 0.474 |
| 21030742 | The Metropolized Partial Importance Sampling MCMC mixes slowly on minimum reversal rearrangement paths | 0.467 |
| 21116045 | On the distribution of the number of cycles in the breakpoint graph of a random signed permutation | 0.398 |
| 19407352 | Statistical alignment with a sequence evolution model allowing rate heterogeneity along the sequence | 0.365 |
| 17277422 | On the length of the longest exact position match in a random sequence | 0.352 |
| 20733238 | Identifiability of two-tree mixtures for group-based models | 0.308 |
| 22331862 | Faster mass spectrometry-based protein inference: junction trees are more efficient than sampling and marginalization by enumeration | 0.291 |
| 19179700 | The identifiability of covarion models in phylogenetics | 0.286 |
| 17048396 | A short proof that phylogenetic tree reconstruction by maximum likelihood is hard | 0.281 |
| 18670048 | Hadamard conjugation for the Kimura 3ST model: combinatorial proof using path sets | 0.267 |
| 21233528 | Semantics and ambiguity of stochastic RNA family models | 0.204 |
*PMID: PubMed ID number of each paper in Journal of TCBB.
Figure 4. Two example topics from an LDA model with 20 topics derived from the TCBB dataset.
Figure 5. Four example topics derived by LDA modeling with 60 topics on the TCBB dataset.
Figure 6. Two drawbacks of a perplexity-based method in selecting topic numbers.