Suganya Selvaraj, Eunmi Choi.
Abstract
Text document clustering is the unsupervised classification of textual documents into clusters based on content similarity. It can be applied to tasks such as search optimization and extracting hidden information from data generated by IoT sensors. Swarm intelligence (SI) algorithms rely on stochastic and heuristic principles in which simple, unintelligent individuals follow a few simple rules to accomplish very complex tasks. By mapping the features of a problem to the parameters of an SI algorithm, such algorithms can reach solutions in a flexible, robust, decentralized, and self-organized manner. Compared to traditional clustering algorithms, these solving mechanisms make swarm algorithms suitable for resolving complex document clustering problems. However, each SI algorithm performs differently, depending on its own strengths and weaknesses. In this paper, to find the best-performing SI algorithm for text document clustering, we performed a comparative study of the particle swarm optimization (PSO), bat (BA), grey wolf optimization (GWO), and K-means algorithms using six data sets of various sizes, which were created from BBC Sport news and 20 newsgroups. Based on our experimental results, we relate the features of the document clustering problem to the nature of SI algorithms and conclude that the PSO and GWO SI algorithms are better than K-means and that, among those algorithms, PSO performs best in terms of finding the optimal solution.
Keywords: artificial intelligence; data mining; swarm intelligence algorithms; text document clustering
Year: 2021 PMID: 34064491 PMCID: PMC8125674 DOI: 10.3390/s21093196
Source DB: PubMed Journal: Sensors (Basel) ISSN: 1424-8220 Impact factor: 3.576
Figure 1. Process of text document clustering.
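The figure itself is not reproduced here, so the following is only a minimal sketch of the kind of step such a process usually starts with: turning documents into weighted term vectors and comparing them by content similarity. The TF-IDF weighting, the whitespace tokenizer, the toy documents, and the helper names `tfidf_vectors` and `cosine_similarity` are all assumptions made for illustration, not the preprocessing used in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (simple smoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy documents: two about sport, one about hardware.
docs = [
    "the striker scored a late goal",
    "the keeper saved the goal",
    "the new graphics card renders faster",
]
vecs = tfidf_vectors(docs)
sim_sport = cosine_similarity(vecs[0], vecs[1])
sim_mixed = cosine_similarity(vecs[0], vecs[2])
```

With these toy inputs the two sport documents come out more similar to each other than to the hardware document, which is the signal a clustering algorithm then exploits.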
Parameter set for SI algorithms.
| PSO | Values | BA | Values | GWO | Values |
|---|---|---|---|---|---|
| k | 10 | k | 10 | k | 10 |
| | 0.9 | A | 0.5 | | |
| C1 | 0.5 | r | 0.5 | | |
| C2 | 0.3 | Qmin | 0.0 | | |
| | | Qmax | 2.0 | | |
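To show where PSO coefficients of this kind enter the algorithm, here is a minimal sketch of the standard PSO velocity and position update applied to centroid-based clustering. It reuses C1 = 0.5 and C2 = 0.3 from the table and assumes that the unlabeled value 0.9 is an inertia weight `w`. The meaning of k is not stated in this excerpt, so the sketch sets its own swarm size. The fitness function (sum of squared distances to the nearest centroid), the toy points, and the name `pso_cluster` are likewise assumptions, not the paper's implementation.

```python
import random


def sse(centroids, points):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(
        min(sum((p - c) ** 2 for p, c in zip(pt, cen)) for cen in centroids)
        for pt in points
    )


def pso_cluster(points, n_clusters, n_particles=10, n_iter=50,
                w=0.9, c1=0.5, c2=0.3, seed=0):
    """Standard global-best PSO; w is an ASSUMED inertia weight."""
    rng = random.Random(seed)
    dim = len(points[0])
    size = n_clusters * dim  # a particle is n_clusters centroids, flattened

    def unflatten(x):
        return [x[i * dim:(i + 1) * dim] for i in range(n_clusters)]

    # Start each particle at randomly chosen data points, with zero velocity.
    pos = [[v for pt in rng.sample(points, n_clusters) for v in pt]
           for _ in range(n_particles)]
    vel = [[0.0] * size for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [sse(unflatten(p), points) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    initial_fit = gbest_fit

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(size):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fit = sse(unflatten(pos[i]), points)
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return unflatten(gbest), gbest_fit, initial_fit


# Hypothetical 2-D points forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, best_fit, initial_fit = pso_cluster(points, n_clusters=2)
```

Because the global best is only replaced by a strictly better particle, `best_fit` can never exceed `initial_fit`. Real document vectors would be far higher-dimensional than these toy points.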
Parameters, features, and best performing areas of SI algorithms.
| Algorithms | Parameters | Features | Best Performing Area |
|---|---|---|---|
| PSO | Control: | Simple, fast computing speed, and parallel processing | Clustering and scheduling |
| GWO | Control: | Faster convergence due to continuous reduction of search space and fewer decision variables (i.e., | Robot swarm learning |
| BA | Control: loudness ( | Provides very quick convergence at a very early stage by switching from exploration to exploitation | Clustering and feature selection |
Benchmark data sets.
| Data Set | Source | No. of Documents | No. of Terms | No. of Clusters |
|---|---|---|---|---|
| 1 | 20 newsgroups | 1427 | 23,057 | 2 |
| 2 | BBC Sport | 737 | 4613 | 5 |
| 3 | BBC Sport | 40 | 2596 | 5 |
| 4 | 20 newsgroups | 200 | 8716 | 4 |
| 5 | 20 newsgroups | 100 | 5549 | 3 |
| 6 | BBC Sport | 100 | 3876 | 2 |
Mean and standard deviation results for all iteration numbers for clustering algorithms with six data sets, reported as mean (std.).
| Data Set | Metric | K-Means | PSO | GWO | BA |
|---|---|---|---|---|---|
| 1 | Purity | 0.665 (0.04) | | 0.695 (0.018) | 0.636 (0.02) |
| | Homogeneity | 0.081 (0.041) | | 0.108 (0.023) | 0.052 (0.014) |
| | Completeness | 0.085 (0.39) | | 0.112 (0.023) | 0.057 (0.016) |
| | V-measure | 0.083 (0.04) | | 0.110 (0.023) | 0.054 (0.014) |
| | ARI | 0.112 (0.057) | | 0.156 (0.029) | 0.0732 (0.019) |
| | Rank | 3 | 1 | 2 | 4 |
| 2 | Purity | | 0.775 (0.042) | 0.74 (0.02) | 0.692 (0.034) |
| | Homogeneity | | 0.525 (0.067) | 0.468 (0.041) | 0.408 (0.058) |
| | Completeness | | 0.527 (0.068) | 0.472 (0.05) | 0.413 (0.059) |
| | V-measure | | 0.525 (0.066) | 0.470 (0.045) | 0.410 (0.058) |
| | ARI | | 0.536 (0.089) | 0.462 (0.051) | 0.393 (0.0665) |
| | Rank | 1 | 2 | 3 | 4 |
| 3 | Purity | 0.734 (0.049) | | 0.839 (0.025) | 0.786 (0.027) |
| | Homogeneity | 0.591 (0.082) | | 0.733 (0.039) | 0.637 (0.046) |
| | Completeness | 0.605 (0.082) | | 0.735 (0.041) | 0.65 (0.046) |
| | V-measure | 0.598 (0.081) | | 0.734 (0.04) | 0.643 (0.045) |
| | ARI | 0.478 (0.087) | | 0.659 (0.067) | 0.552 (0.07) |
| | Rank | 3 | 1 | 2 | 4 |
| 4 | Purity | 0.592 (0.046) | | 0.652 (0.027) | 0.607 (0.039) |
| | Homogeneity | 0.271 (0.059) | | 0.322 (0.033) | 0.284 (0.046) |
| | Completeness | 0.284 (0.062) | | 0.336 (0.036) | 0.301 (0.051) |
| | V-measure | 0.278 (0.06) | | 0.329 (0.034) | 0.292 (0.048) |
| | ARI | 0.248 (0.058) | | 0.318 (0.029) | 0.263 (0.059) |
| | Rank | 3 | 1 | 2 | 4 |
| 5 | Purity | 0.709 (0.036) | | 0.793 (0.05) | 0.728 (0.034) |
| | Homogeneity | 0.309 (0.052) | | 0.427 (0.075) | 0.314 (0.049) |
| | Completeness | 0.323 (0.063) | | 0.448 (0.082) | 0.347 (0.076) |
| | V-measure | 0.315 (0.540) | | 0.437 (0.078) | 0.329 (0.059) |
| | ARI | 0.321 (0.069) | | 0.473 (0.117) | 0.344 (0.066) |
| | Rank | 3 | 1 | 2 | 4 |
| 6 | Purity | 0.994 (0.008) | | 0.995 (0.009) | 0.976 (0.023) |
| | Homogeneity | 0.958 (0.049) | | 0.45 (0.491) | 0.864 (0.093) |
| | Completeness | 0.959 (0.048) | | 0.972 (0.051) | 0.867 (0.089) |
| | V-measure | 0.958 (0.049) | | 0.972 (0.051) | 0.866 (0.091) |
| | ARI | 0.975 (0.031) | | 0.982 (0.035) | 0.909 (0.085) |
| | Rank | 3 | 1 | 2 | 4 |
| Total rank | | 3 | 1 | 2 | 4 |
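For readers unfamiliar with the metrics in this table, the sketch below computes purity, homogeneity, completeness, and V-measure from two label lists, using the standard entropy-based definitions. It illustrates the metrics only and is not the authors' evaluation code. ARI is omitted for brevity, and the function names are hypothetical.

```python
import math
from collections import Counter


def purity(labels_true, labels_pred):
    """Fraction of documents that belong to the majority class of their cluster."""
    clusters = {}
    for t, p in zip(labels_true, labels_pred):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(labels_true)


def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """H(a | b) for two parallel label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())


def homogeneity_completeness_v(labels_true, labels_pred):
    """Return (homogeneity, completeness, V-measure) with equal weighting."""
    h_true, h_pred = _entropy(labels_true), _entropy(labels_pred)
    hom = 1.0 if h_true == 0 else 1.0 - _conditional_entropy(labels_true, labels_pred) / h_true
    com = 1.0 if h_pred == 0 else 1.0 - _conditional_entropy(labels_pred, labels_true) / h_pred
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v


# Hypothetical labels: a perfect clustering (up to renaming) and a single-cluster one.
truth = [0, 0, 1, 1]
perfect = [1, 1, 0, 0]
one_cluster = [0, 0, 0, 0]

perfect_scores = (purity(truth, perfect),) + homogeneity_completeness_v(truth, perfect)
collapsed_scores = (purity(truth, one_cluster),) + homogeneity_completeness_v(truth, one_cluster)
```

The single-cluster case shows why several metrics are reported together: completeness stays at 1 even though homogeneity and V-measure drop to 0 and purity falls to 0.5.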
Figure 2. Purity comparison for data set 1.
Figure 3. Purity comparison for data set 2.
Figure 4. Purity comparison for data set 3.
Figure 5. Purity comparison for data set 4.
Figure 6. Purity comparison for data set 5.
Figure 7. Purity comparison for data set 6.
Figure 8. Average running time of SI algorithms.