Abstract
Clustering textual contents is an important step in mining useful information on the web or other text-based resources. The common task in text clustering is to handle text in a multi-dimensional space, and to partition documents into groups, where each group contains documents that are similar to each other. However, this strategy lacks a comprehensive view for humans in general since it cannot explain the main subject of each cluster. Utilizing semantic information can solve this problem, but it needs a well-defined ontology or pre-labeled gold standard set. In this paper, we present a thematic clustering algorithm for text documents. Given text, subject terms are extracted and used for clustering documents in a probabilistic framework. An EM approach is used to ensure documents are assigned to correct subjects, hence it converges to a locally optimal solution. The proposed method is distinctive because its results are sufficiently explanatory for human understanding as well as efficient for clustering performance. The experimental results show that the proposed method provides a competitive performance compared to other state-of-the-art approaches. We also show that the extracted themes from the MEDLINE® dataset represent the subjects of clusters reasonably well.
Year: 2012 PMID: 23046528 PMCID: PMC3465205 DOI: 10.1186/2041-1480-3-S3-S6
Source DB: PubMed Journal: J Biomed Semantics
The thematic clustering algorithm
| Given |
| 1. Create a random partition |
| 2. Compute |
| 3. Compute |
| 4. For each cluster, select the |
| 5. Compute the probabilities { |
| 6. For all |
| 7. Test for convergence. Terminate if converged. |
| 8. For a subset |
| 9. Return to Step 2. |
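
The formulas in the listing above did not survive extraction, so the loop can only be illustrated loosely. The following is a minimal sketch of an EM-style procedure with the same overall shape (random partition, per-cluster subject terms, reassignment, convergence test). The function name `theme_cluster`, the smoothed log-odds term weight, the hard document assignment, and the parameters `n_terms` and `max_iter` are all assumptions made for illustration, not the paper's definitions; Step 8 is omitted because its content is not recoverable.

```python
import math
import random
from collections import Counter


def theme_cluster(docs, k, n_terms=3, max_iter=50, seed=0):
    """Hypothetical hard-assignment EM-style loop; docs is a list of term sets."""
    rng = random.Random(seed)
    # Random initial partition of the documents into k clusters.
    labels = [rng.randrange(k) for _ in docs]
    vocab = set().union(*docs)
    n = len(docs)
    themes = []
    for _ in range(max_iter):
        # Document frequency of every term inside each cluster.
        sizes = Counter(labels)
        df = [Counter() for _ in range(k)]
        for doc, c in zip(docs, labels):
            df[c].update(doc)
        total = Counter()
        for counts in df:
            total.update(counts)

        def weight(term, c):
            # ASSUMPTION: smoothed log-odds of seeing the term inside
            # versus outside cluster c; not the paper's formula.
            p_in = (df[c][term] + 0.5) / (sizes[c] + 1.0)
            p_out = (total[term] - df[c][term] + 0.5) / (n - sizes[c] + 1.0)
            return math.log(p_in / p_out)

        # Select the n_terms highest-weighted terms as each cluster's theme.
        themes = [
            sorted(vocab, key=lambda t, c=c: (-weight(t, c), t))[:n_terms]
            for c in range(k)
        ]
        # Reassign each document to the cluster whose theme it matches best.
        new_labels = []
        for doc in docs:
            scores = [
                sum(weight(t, c) for t in themes[c] if t in doc)
                for c in range(k)
            ]
            new_labels.append(max(range(k), key=lambda c: scores[c]))
        # Convergence test: stop when no document changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, themes
```

Calling `theme_cluster(docs, k=2)` on a list of term sets returns one label per document and one list of subject terms per cluster. As with any EM-style procedure, the result depends on the random initial partition, which is consistent with the source selecting a best run out of many.
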
Datasets used for the experiments
| Datasets | Number of Documents | Number of Clusters |
|---|---|---|
| News-Different-3 | 300 | 3 |
| News-Similar-3 | 300 | 3 |
| News-Moderated-6 | 600 | 6 |
| Parkinson's Disease | 25992 | - |
| Huntington's Disease | 5602 | - |
News-Different-3, News-Similar-3, and News-Moderated-6 are from the 20-Newsgroup collection. Parkinson's Disease and Huntington's Disease are from the MEDLINE dataset.
Figure 1. Comparisons of theme scores and normalized mutual information (NMI) scores on News-Different-3. This graph shows the correlation between theme scores and NMI scores on News-Different-3. The points are the clustering results obtained from 1,000 runs. The correlation coefficient is 0.904070.
Figure 2. Comparisons of theme scores and normalized mutual information (NMI) scores on News-Moderated-6. This graph shows the correlation between theme scores and NMI scores on News-Moderated-6. The points are the clustering results obtained from 1,000 runs. The correlation coefficient is 0.871111.
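
To make the two quantities in these captions concrete, here is a minimal sketch of how an NMI score and a correlation coefficient could be computed from cluster labels and per-run scores. The captions do not state which NMI normalization or which correlation coefficient was used; the geometric-mean normalization and Pearson's r below are assumptions, and the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(truth, pred):
    """Mutual information divided by sqrt(H(truth) * H(pred)) (assumed normalization)."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    ct, cp = Counter(truth), Counter(pred)
    mi = sum(
        (nij / n) * math.log(nij * n / (ct[t] * cp[p]))
        for (t, p), nij in joint.items()
    )
    denom = math.sqrt(entropy(truth) * entropy(pred))
    # A labeling with a single cluster has zero entropy; report 0 in that case.
    return mi / denom if denom > 0 else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Under these assumptions, each run would contribute one (theme score, NMI) point, and `pearson` applied to the two lists of per-run values would give a coefficient of the kind reported in the captions.
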
Performance comparison of THEME, DPMFS, EDCM, and EM-MN on the 20-Newsgroup collection
| Dataset | THEME | DPMFS | EDCM | EM-MN |
|---|---|---|---|---|
| News-Different-3 | 0.847 | 0.688 | 0.734 | 0.867 |
| News-Similar-3 | 0.103 | 0.231 | 0.163 | 0.081 |
| News-Moderated-6 | 0.782 | 0.663 | 0.531 | 0.562 |
THEME, DPMFS, EDCM, and EM-MN are the proposed clustering method, a Dirichlet process mixture model, a Dirichlet compound multinomial model, and an EM-based mixture model, respectively.
Average paired F-scores from three best runs on the 20-Newsgroup collection
| Dataset | F-score |
|---|---|
| News-Different-3 | 0.9387 |
| News-Similar-3 | 0.3023 |
| News-Moderated-6 | 0.8646 |
Three best runs on the 20-Newsgroup collection are compared using paired F-scores. Each best run is the result with the best theme score among 500 runs.
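
The source does not define the paired F-score here. One common reading is the pair-counting F-score, in which every pair of documents is checked for whether both labelings place it in the same group. The sketch below follows that reading; the definition and the function name `paired_f_score` are assumptions, not taken from the paper.

```python
from itertools import combinations


def paired_f_score(truth, pred):
    """Pair-counting F-score between a reference labeling and a clustering."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped by the clustering only
        elif same_true:
            fn += 1  # grouped by the reference only
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because only pair memberships are compared, the score does not depend on how cluster ids are numbered, so no matching between predicted and reference clusters is needed.
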
An example for Parkinson's disease clusters
| Subject terms | Titles |
|---|---|
| synuclein, alpha synuclein, alpha, protein, aggregation | alpha-synuclein |
| deep brain, deep, stimulation, brain stimulation, subthalamic | deep brain stimulation |
| lewy, lewy bodies, bodies, lewy body, dementia | lewy bodies |
| monoamine oxidase, oxidase, monoamine, mao, b | monoamine oxidase |
| mitochondrial, oxidative, complex i, oxidative stress, stress | oxidative stress |
For each cluster, the top 5 subject terms are listed. Titles are derived from these subject terms and the documents included in each cluster.
Average F-scores from three best runs on the MEDLINE data
| Dataset | F-score |
|---|---|
| Parkinson's Disease | 0.6572 |
| Huntington's Disease | 0.6308 |
F-score results of three best runs on each MEDLINE dataset are averaged. A best run is the result with the best theme score among 500 runs.
Analysis of three best runs on the MEDLINE data
| Dataset | Number of clusters | Average p-value |
|---|---|---|
| Parkinson's Disease | 46.0 | 2.56E-10 |
| Huntington's Disease | 21.5 | 4.11E-11 |
For each MEDLINE dataset, clustering was performed 500 times, and the best run was selected. The number of clusters and the average of p-values of the 10 strongest MeSH terms in each cluster were recorded. This was repeated three times, and averages of the resulting values are given in this table.
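
The text does not say which statistical test produced the MeSH term p-values. A one-sided hypergeometric test is a common choice for asking whether a term is over-represented in a cluster, so the sketch below uses it purely as an assumption. The function name and all four parameters are hypothetical.

```python
from math import comb


def hypergeom_pvalue(n_total, n_with_term, cluster_size, hits):
    """P(X >= hits) when cluster_size documents are drawn without replacement
    from n_total documents, of which n_with_term carry the MeSH term."""
    upper = min(n_with_term, cluster_size)
    favorable = sum(
        comb(n_with_term, x) * comb(n_total - n_with_term, cluster_size - x)
        for x in range(hits, upper + 1)
    )
    # Exact integer arithmetic until the final division.
    return favorable / comb(n_total, cluster_size)
```

Under this assumption, a small p-value means the term appears in the cluster far more often than a random draw of the same size would produce. Averaging the 10 smallest p-values per cluster would then summarize how strongly the clusters align with MeSH indexing.
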