Hossam M J Mustafa, Masri Ayob, Dheeb Albashish, Sawsan Abu-Taleb.
Abstract
Text clustering is one of the most effective methods of text document analysis, and it is widely applied to group documents as big data and online information continue to expand. A review of the related work shows that existing text clustering algorithms achieve reasonable results on some datasets but fail on a wide variety of benchmark datasets. Furthermore, their performance is not robust because of a poor balance between the exploitation and exploration capabilities of the clustering algorithm. Accordingly, this research proposes a Memetic Differential Evolution algorithm for text clustering (MDETC), which investigates the effect of hybridizing the differential evolution (DE) mutation strategy with the memetic algorithm (MA). This hybridization is intended to enhance the quality of text clustering and improve the exploitation and exploration capabilities of the algorithm. Our experimental results on six standard text clustering benchmark datasets from the Laboratory of Computational Intelligence (LABIC) show that the MDETC algorithm outperformed the other compared clustering algorithms in terms of the AUC metric, the F-measure, and the statistical analysis. Furthermore, MDETC was compared with state-of-the-art text clustering algorithms and obtained almost the best results on the standard benchmark datasets.
Year: 2020 PMID: 32525869 PMCID: PMC7289410 DOI: 10.1371/journal.pone.0232816
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Example of the label-based representation of a candidate solution.
Fig 2. Example of the centroid-based representation of a candidate solution.
Fig 3. The pseudo-code of the proposed MDETC algorithm.
Fig 4. The pseudo-code of the algorithm for creating a trial individual.
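
The label-based and centroid-based representations named in the Fig 1 and Fig 2 captions can be illustrated with a minimal sketch. It assumes the common convention that a label-based solution stores one cluster index per document and a centroid-based solution stores one term-weight vector per cluster. The function names, the Euclidean distance, and the toy vectors are assumptions for illustration and are not taken from the paper.

```python
from math import dist

def centroids_to_labels(docs, centroids):
    """Label-based view of a centroid-based solution: each document
    gets the index of its nearest centroid (Euclidean distance here)."""
    return [min(range(len(centroids)), key=lambda k: dist(doc, centroids[k]))
            for doc in docs]

def labels_to_centroids(docs, labels, n_clusters):
    """Centroid-based view of a label-based solution: each centroid is
    the mean of the documents assigned to that cluster."""
    n_terms = len(docs[0])
    sums = [[0.0] * n_terms for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for doc, k in zip(docs, labels):
        counts[k] += 1
        for j, w in enumerate(doc):
            sums[k][j] += w
    # An empty cluster keeps an all-zero centroid in this sketch.
    return [[s / counts[k] if counts[k] else 0.0 for s in sums[k]]
            for k in range(n_clusters)]

# Toy term-weight vectors (hypothetical: 4 documents x 2 terms).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]                       # label-based solution
centroids = labels_to_centroids(docs, labels, n_clusters=2)
round_trip = centroids_to_labels(docs, centroids)
```

A label-based solution has one entry per document, whereas a centroid-based solution has one vector per cluster, so its length depends on the number of clusters and terms rather than on the number of documents.
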
The characteristics of the used LABIC datasets.
| Dataset | Source | No. of documents | No. of terms | No. of clusters |
|---|---|---|---|---|
| CSTR | Technical Reports | 299 | 1725 | 4 |
| tr41 | TREC | 878 | 7454 | 10 |
| tr12 | TREC | 313 | 5804 | 8 |
| tr23 | TREC | 204 | 5832 | 6 |
| tr11 | TREC | 414 | 6429 | 9 |
| oh15 | MEDLINE | 913 | 3100 | 10 |
Parameter settings used in the experiments.
| Parameter | Value |
|---|---|
| No. of generations | 100 |
| Population size | 20 |
| Tournament selection size | 10 |
| Recombination mating pool size | 10 |
| Max. generations without improvement | 20 |
| Crossover probability | 0.9 |
| DE mutation scaling factor | 0.7 |
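
To show how the scaling factor and crossover probability from the table above might enter the creation of a trial individual, here is a minimal sketch. The record does not state which DE variant is used, so the DE/rand/1 mutation with binomial crossover below is an assumption, as are the function name `make_trial` and the toy population. The crossover probability in the table may also apply to the memetic recombination step rather than to DE crossover.

```python
import random

F = 0.7    # DE mutation scaling factor (from the parameter table)
CR = 0.9   # crossover probability (from the parameter table)

def make_trial(population, target_idx, rng, f=F, cr=CR):
    """Build one trial vector with DE/rand/1 mutation and binomial crossover."""
    target = population[target_idx]
    # Pick three distinct individuals, none of them the target.
    candidates = [i for i in range(len(population)) if i != target_idx]
    r1, r2, r3 = rng.sample(candidates, 3)
    a, b, c = population[r1], population[r2], population[r3]
    # Mutant: base vector plus the scaled difference of two other vectors.
    mutant = [a[j] + f * (b[j] - c[j]) for j in range(len(target))]
    # Binomial crossover; position j_rand always comes from the mutant.
    j_rand = rng.randrange(len(target))
    return [mutant[j] if (rng.random() < cr or j == j_rand) else target[j]
            for j in range(len(target))]

rng = random.Random(0)
# Hypothetical population: 20 individuals (the population size in the table),
# each a flattened 2 x 3 centroid matrix.
population = [[rng.random() for _ in range(6)] for _ in range(20)]
trial = make_trial(population, target_idx=0, rng=rng)
```

With `cr` close to 1, most genes of the trial come from the mutant, which favors exploration. With `cr` close to 0, the trial stays close to the target individual.
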
The comparison of AUC values obtained by the MDETC, K-means, DE and GA algorithms.
| Dataset | K-means | DE | GA | MDETC |
|---|---|---|---|---|
| tr23 | 0.4697 | 0.5 | 0.4457 | 0. |
| tr11 | 0.4745 | 0.4701 | 0.5206 | |
| tr12 | 0.4259 | 0.4438 | 0.4524 | |
| tr41 | 0.5533 | 0.5081 | 0.49 | |
| CSTR | 0.5555 | 0.5706 | 0.5337 | |
| oh15 | 0.5335 | 0.5588 | 0.5635 | |
Friedman test ranking for MDETC, K-means, DE and GA algorithms based on the AUC metric.
| Algorithm | Ranking |
|---|---|
| MDETC | 1.1666 |
| DE | 2.8333 |
| GA | 2.8333 |
| K-means | 3.1666 |
Comparison between MDETC, K-means, DE and GA algorithms using Holm’s post-hoc procedure based on the AUC metric.
| i | Algorithm | α/i | p-value | Null Hypothesis |
|---|---|---|---|---|
| 1 | DE | 0.05/1 = 0.0500 | 0.02534 | Rejected |
| 2 | GA | 0.05/2 = 0.0250 | 0.02434 | Rejected |
| 3 | K-means | 0.05/3 = 0.0166 | 0.00729 | Rejected |
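
The decisions in the table above can be reproduced with a short sketch of Holm's step-down procedure. It reads the fourth column as unadjusted p-values, which is an assumption because that column's header did not survive. It uses α = 0.05 as in the table. The function name `holm_step_down` is hypothetical.

```python
def holm_step_down(p_values, alpha=0.05):
    """Holm's step-down procedure.

    p_values maps each compared algorithm to its unadjusted p-value.
    Returns a dict of algorithm -> True if its null hypothesis is rejected.
    """
    # Order from largest to smallest p-value so that position i (1-based)
    # is compared with alpha / i, matching the layout of the table.
    ordered = sorted(p_values.items(), key=lambda kv: kv[1], reverse=True)
    rejected = {}
    still_rejecting = True
    # Walk from the smallest p-value (largest i) to the largest (i = 1);
    # once one hypothesis is retained, all remaining ones are retained too.
    for i in range(len(ordered), 0, -1):
        name, p = ordered[i - 1]
        still_rejecting = still_rejecting and p < alpha / i
        rejected[name] = still_rejecting
    return rejected

# Values copied from the AUC table above.
auc_p_values = {"DE": 0.02534, "GA": 0.02434, "K-means": 0.00729}
result = holm_step_down(auc_p_values)
```

For these three values every comparison falls below its threshold, so all three null hypotheses are rejected, in agreement with the table.
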
Fig 5. The ROC curves on the (a) tr23, (b) tr11, (c) tr12, (d) tr41, (e) CSTR, and (f) oh15 datasets.
The comparison of F-measure values obtained by the MDETC, K-means, DE and GA algorithms.
| Dataset | K-means | DE | GA | MDETC |
|---|---|---|---|---|
| tr23 | 0.5759 | 0.5791 | 0.5572 | |
| tr11 | 0.5043 | 0.4398 | 0.4595 | |
| tr12 | 0.3402 | 0.4114 | 0.4470 | |
| tr41 | 0.4494 | 0.4030 | 0.3685 | |
| CSTR | 0.5008 | 0.5429 | 0.5133 | |
| oh15 | 0.3709 | 0.2976 | 0.2788 | |
Friedman test ranking for MDETC, K-means, DE and GA algorithms based on the F-measure.
| Algorithm | Ranking |
|---|---|
| MDETC | 1 |
| DE | 2.833 |
| K-means | 2.833 |
| GA | 3.333 |
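
As a consistency check, the baseline rankings above can be recomputed from the F-measure table. The sketch relies on one inference: a mean rank of 1 for MDETC means it placed first on every dataset, so each baseline's rank in the four-way comparison is its rank among the three baselines plus one. The helper name `mean_ranks` is hypothetical, and ties are not handled because none occur in these values.

```python
# F-measure values for the three baselines, copied from the F-measure table.
f_measure = {
    "tr23": {"K-means": 0.5759, "DE": 0.5791, "GA": 0.5572},
    "tr11": {"K-means": 0.5043, "DE": 0.4398, "GA": 0.4595},
    "tr12": {"K-means": 0.3402, "DE": 0.4114, "GA": 0.4470},
    "tr41": {"K-means": 0.4494, "DE": 0.4030, "GA": 0.3685},
    "CSTR": {"K-means": 0.5008, "DE": 0.5429, "GA": 0.5133},
    "oh15": {"K-means": 0.3709, "DE": 0.2976, "GA": 0.2788},
}

def mean_ranks(scores):
    """Average rank per algorithm; rank 1 is the highest score on a dataset."""
    totals = {}
    for row in scores.values():
        ordered = sorted(row, key=row.get, reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    return {name: total / len(scores) for name, total in totals.items()}

baseline_ranks = mean_ranks(f_measure)
# MDETC is first everywhere (mean rank 1), so every baseline moves down one place.
four_way = {name: rank + 1 for name, rank in baseline_ranks.items()}
```

The resulting values are 17/6 for DE and K-means and 20/6 for GA, which agree with the reported 2.833, 2.833, and 3.333 up to the rounding used in the table.
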
Comparison between MDETC, K-means, DE and GA algorithms using Holm’s post-hoc procedure based on the F-measure.
| i | Algorithm | α/i | p-value | Null Hypothesis |
|---|---|---|---|---|
| 1 | DE | 0.05/1 = 0.0500 | 0.013906 | Rejected |
| 2 | K-means | 0.05/2 = 0.0250 | 0.013906 | Rejected |
| 3 | GA | 0.05/3 = 0.0166 | 0.001745 | Rejected |
Fig 6. The convergence curves on the (a) tr23, (b) tr11, (c) tr12, (d) tr41, (e) CSTR, and (f) oh15 datasets.
Running time of MDETC, K-means, DE and GA algorithms.
| Dataset | K-means | DE | GA | MDETC |
|---|---|---|---|---|
| tr23 | 0.301 | 0.517 | 0.362 | |
| tr11 | 1.001 | 1.211 | 0.831 | |
| tr12 | 0.784 | 0.770 | 0.533 | |
| tr41 | 1.211 | 2.907 | 2.002 | |
| CSTR | 0.201 | 0.246 | 0.171 | |
| oh15 | 1.109 | 1.343 | 0.918 | |
F-measure comparison between MDETC and the state-of-the-art algorithms.
| Dataset | HS | KH | PSO | MMKHA | MDETC |
|---|---|---|---|---|---|
| tr23 | 0.4021 | 0.4004 | 0.3565 | 0.4214 | |
| tr11 | 0.4095 | 0.4138 | 0.4380 | 0.5164 | |
| tr12 | 0.4526 | 0.5019 | 0.4708 | 0.4481 | |
| tr41 | 0.4392 | 0.4272 | 0.4471 | 0.5241 | |
| CSTR | 0.5268 | 0.4847 | 0.5090 | 0.6055 | |
| oh15 | 0.4185 | 0.4840 | 0.4471 | 0.5278 | |
Friedman test ranking for MDETC and the state-of-the-art algorithms based on the F-measure.
| Algorithm | Ranking |
|---|---|
| MDETC | 1.6666 |
| MMKHA | 1.8333 |
| PSO | 3.6666 |
| KH | 3.8333 |
| HS | 4.0 |
Comparison between MDETC and the state-of-the-art algorithms using Holm’s procedure based on the F-measure.
| i | Algorithm | α/i | p-value | Null Hypothesis |
|---|---|---|---|---|
| 1 | MMKHA | 0.05/1 = 0.0500 | 0.85513 | Not rejected |
| 2 | PSO | 0.05/2 = 0.0250 | 0.02445 | Rejected |
| 3 | KH | 0.05/3 = 0.0166 | 0.01762 | Rejected |
| 4 | HS | 0.05/4 = 0.0125 | 0.01058 | Rejected |