Solving text clustering problem using a memetic differential evolution algorithm.

Hossam M J Mustafa¹, Masri Ayob¹, Dheeb Albashish², Sawsan Abu-Taleb².

Abstract

Text clustering is one of the most effective methods for analyzing text documents, and it is widely applied to group documents as big data and online information continue to grow. According to our review of related work, existing text clustering algorithms achieved reasonable results on some datasets but failed on a wide variety of benchmark datasets. Furthermore, their performance was not robust because of an inefficient balance between the exploitation and exploration capabilities of the clustering algorithm. Accordingly, this research proposes a Memetic Differential Evolution algorithm for text clustering (MDETC), which investigates the effect of hybridizing the differential evolution (DE) mutation strategy with the memetic algorithm (MA). This hybridization is intended to enhance the quality of text clustering and to improve the exploitation and exploration capabilities of the algorithm. Our experimental results on six standard text clustering benchmark datasets from the Laboratory of Computational Intelligence (LABIC) show that MDETC outperformed the other compared clustering algorithms in terms of the AUC metric, the F-measure, and the statistical analysis. Furthermore, when compared with state-of-the-art text clustering algorithms, MDETC obtained the best results on most of the standard benchmark datasets.

Year: 2020; PMID: 32525869; PMCID: PMC7289410; DOI: 10.1371/journal.pone.0232816

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

Data clustering is a common data mining task that has been applied in several applications to understand the hidden structures in data. It is an essential task in several disciplines, such as information retrieval [1], the Internet of Things [2], image segmentation [3], and wireless sensor networks [4]. One of the most widespread applications of data clustering is text clustering (TC), an unsupervised learning method that operates without prior knowledge of the text document labels [5]. Text clustering is used to organize the vast quantity of unordered text documents that results from the growth of big data and online information [6,7]. It aims to group a collection of text documents into clusters according to their contents and topics, so that each cluster contains related documents while unrelated documents fall into other clusters [8-10]. Recently, many researchers have used metaheuristic algorithms to address the text clustering problem [6,8], such as the krill herd algorithm (KHA) [10] and particle swarm optimization (PSO) [11]. The trade-off between exploration and exploitation in these algorithms plays a vital role in the performance of the clustering algorithm, and it can be tuned to find reasonable clustering solutions for specific datasets [12,13]. However, some algorithms were unable to find robust and effective results across many datasets [10]. This may be due to an inefficient balance between exploitation and exploration, which may lead to stagnation or premature convergence [14]. Some recent studies have suggested hybridizing a local search with a global search to obtain a good balance: the local search manages the exploitation, whereas the global search manages the exploration [10,15-17].
Moreover, the optimization framework of Memetic Algorithms (MAs) can exploit the strengths of different optimization algorithms by hybridizing them within the MA framework, which may offer better performance [18]. The MA includes several evolutionary steps that help in solving many complex optimization problems [18-21]. Consequently, the MA can be hybridized with the Differential Evolution (DE) algorithm, which has shown good performance on several optimization problems. Therefore, in this work, we propose a memetic differential evolution algorithm, MDETC, which hybridizes the Differential Evolution and Memetic algorithms to address the text clustering problem. The proposed text clustering algorithm aims to produce high-quality clusterings as measured by the AUC metric and the F-measure.

Related work

The primary task of text clustering is to group sets of documents into homogeneous clusters [28]. This task can be achieved by employing a suitable similarity function that is maximized or minimized over the document clusters [6]. Several researchers have used metaheuristic optimization algorithms to solve the text clustering problem, such as the Genetic Algorithm [22,23], Particle Swarm Optimizer [24,25], Cuckoo search [26], Ant colony optimization [27], Artificial bee colony [28,29], Firefly algorithm [30], Harmony Search [31], and hybrid metaheuristic approaches [32-37]. Some studies employed the Genetic Algorithm. For example, [22] proposed a text clustering method based on the Genetic Algorithm that employed an ontology and a thesaurus with several similarity measures. The researchers in [23] introduced a text clustering method in which the Genetic Algorithm was applied separately to every cluster to avoid local optima. The Particle Swarm Optimizer has also been used in some studies to reach an optimal solution. For example, [24] hybridized the Particle Swarm Optimizer with the K-means algorithm and evaluated it on benchmark text datasets. The authors of [25] proposed a text clustering algorithm based on the Particle Swarm Optimizer (EPSO) that seeks multiple local optimal solutions. Additionally, Cuckoo search has been used to address the text clustering problem. For example, [26] introduced a data clustering method based on the cuckoo optimization algorithm (COA) and the fuzzy cuckoo optimization algorithm (FCOA). Ant colony optimization has also been applied; for example, [27] proposed a document clustering algorithm based on it.
The Artificial bee colony algorithm has likewise been used to improve text document clustering. For example, [28] employed a chaotic map model in the local search to improve the exploitation capability of the Artificial bee colony. The study in [29] applied the Artificial bee colony algorithm to text document clustering, using gradient search and chaotic local search to enhance its exploitation capability. Moreover, the Firefly algorithm (FA) was used in [30] to address dynamic text document clustering through Gravity Firefly Clustering (GF-CLUST). Other studies used Harmony Search for text document clustering; for example, [31] introduced a factorization approach to enhance the clustering. Hybrid metaheuristic approaches have also been used to address the text clustering problem. For example, [32] combined the Particle Swarm Optimizer with the Genetic Algorithm, employing the Genetic Algorithm to enhance the global search and the Particle Swarm Optimizer to produce the range of the search space. The researchers in [33] proposed a text clustering algorithm that combines the Particle Swarm Optimizer with the Cuckoo search algorithm. Many other studies used metaheuristic optimization algorithms to avoid the local optima problem of the K-means algorithm. For example, [34] applied the Harmony Search algorithm to text clustering, hybridizing Harmony Search with a local search based on the K-means algorithm. The research in [35] hybridized K-means with the Cuckoo search (CS) algorithm for web document clustering, with the aim of improving the performance of web search results. The authors of [36] proposed a hybrid optimization algorithm for the data clustering problem that combines the K-means algorithm with Tabu search (TS) to avoid local optima. The researcher in [37] combined the Firefly algorithm with the K-means algorithm; in their algorithm, the Firefly algorithm seeks optimal cluster centroids that are then used to initialize K-means.
Other studies used the memetic differential evolution approach to solve data clustering problems. For example, [21] introduced a memetic differential evolution algorithm for data clustering that is based on a modified adaptive Differential Evolution mutation and a local search algorithm, with the aim of improving the balance between exploration and exploitation. The experiments were based on several low-dimensional real-life benchmark datasets obtained from the UCI repository of machine learning databases, and the research employed the intra-cluster distance with the Euclidean distance as the similarity/dissimilarity function. Although this method was effective at finding reasonable clustering solutions, it may fail to find good solutions for high-dimensional datasets such as those in text clustering. This may be due to an inappropriate objective function, which can lead to an imbalance between exploration and exploitation on high-dimensional datasets. Although text clustering algorithms based on metaheuristic approaches perform better than earlier algorithms, the problem of weak convergence persists in many metaheuristic algorithms. Specifically, their exploration and exploitation trade-off can be further enhanced.

Contribution of this paper

This paper aims to tackle the issues discussed above by using a memetic differential evolution algorithm to solve the text clustering problem. More precisely, our contribution is two-fold. First, we introduce a memetic Differential Evolution algorithm that combines the MA and DE algorithms to address the text clustering problem. Second, we develop a modified DE mutation phase that enhances the search of the text clustering algorithm. More specifically, the proposed algorithm uses a DE mutation that is coupled with the evolutionary steps of the memetic algorithm. The mutation step is intended to improve the search abilities of DE by employing an adaptive mutation strategy. Moreover, the improvement phase is modified to remove duplicated solutions, which helps to avoid premature convergence. The restart phase is modified to replace a portion of the population with randomly generated solutions, which improves the diversity of the population.

The organization of the paper

This paper is organized as follows. The second section presents the background concepts, namely text clustering, DE, and MA. The third section describes the proposed memetic DE algorithm for the text clustering problem. The fourth section discusses the results of the MDETC experiments. Lastly, the fifth section presents the conclusions and future work.

Background

This section presents the necessary concepts of the text clustering problem, the memetic algorithm, and the differential evolution (DE) algorithm, which are employed in the proposed text clustering algorithm.

Text clustering problem

Text document clustering splits a set of n text documents into K clusters using a particular similarity or dissimilarity measure. The n text documents are denoted by the set D = {d1, d2, …, dn} and the K clusters by C = {C1, C2, …, CK}, such that the documents within each cluster are similar to one another and dissimilar to the documents in other clusters. The number of clusters K is given in advance [10,38]. Pre-processing is applied to the text to reduce the number of attributes (features) that the algorithm has to handle. The pre-processing steps are (a) tokenization, (b) stop-word removal, (c) stemming, (d) feature selection, and (e) term weighting [10]. The text documents are represented by the vector space model (VSM), as presented in Eq (1). The VSM denotes each document i as a vector of length t [10]. The element wi,j is the tf-idf weight of term j in document i, a commonly used term weighting that reflects whether the term is frequent or rare across all documents [24]; it is calculated using Eq (2). Here tf(i,j) denotes the frequency of term j in document i, n denotes the total number of documents in D, and df(j) is the number of documents in D that contain term j [10]. The text document clustering problem is formulated in Eq (3), where f(D, C) is the fitness function that measures the quality of the clusters produced by the text clustering method. The fitness function is minimized or maximized depending on the similarity or dissimilarity measure employed. The quality of a text clustering solution can be measured by the intra-cluster distance, which is commonly used in text clustering [10], as shown in Eq (4), where d(di, Zl) denotes the distance between text document di and the centroid Zl of its cluster.
The cosine distance is one of the most widely used distance functions in text clustering [10,38]. It measures the similarity between document di and the centroid Zl of the cluster that contains it, as in Eq (5), where w(Zl, tj) denotes the weight of term j in centroid l and w(di, tj) denotes the weight of term j in document i. The centroid Zl is computed as the average of all text documents in the cluster, as shown in Eq (6), where nl denotes the number of text documents in cluster l.
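
The quantities defined in this section can be illustrated with a minimal Python sketch. It assumes the common tf-idf form w = tf(i,j) × log(n / df(j)) with a natural logarithm and the standard cosine similarity; the exact forms of Eq (1) to Eq (6) are not reproduced in this text, so the function names and the choice of logarithm base are assumptions made for illustration only.

```python
import math

def tfidf_weights(term_counts):
    """term_counts[i][j] is tf(i, j): how often term j occurs in document i."""
    n = len(term_counts)
    t = len(term_counts[0])
    # df(j): number of documents that contain term j at least once.
    df = [sum(1 for doc in term_counts if doc[j] > 0) for j in range(t)]
    # Assumed weighting: tf * ln(n / df); terms that never occur get weight 0.
    return [
        [doc[j] * math.log(n / df[j]) if df[j] else 0.0 for j in range(t)]
        for doc in term_counts
    ]

def cosine_similarity(u, v):
    """Cosine of the angle between two weight vectors (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def centroid(vectors):
    """Component-wise average of the document vectors in one cluster."""
    n_l = len(vectors)
    return [sum(column) / n_l for column in zip(*vectors)]

# Hypothetical toy corpus: 3 documents, 3 terms.
counts = [[2, 0, 1], [1, 0, 0], [0, 3, 1]]
weights = tfidf_weights(counts)
z = centroid(weights[:2])                 # centroid of a cluster holding documents 0 and 1
similarity = cosine_similarity(weights[0], z)
```

Summing such similarities (or the corresponding distances) over all documents and their cluster centroids gives an intra-cluster objective of the kind the fitness function f(D, C) refers to.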

Differential evolution algorithm

The Differential Evolution (DE) algorithm is an effective evolutionary metaheuristic that was introduced to solve continuous and combinatorial optimization problems [19]. DE begins by initializing a population. At every iteration, parents are chosen from the current solutions for crossover and mutation in order to produce a trial solution [19]. The mutation phase perturbs a solution by a scaled differential vector, built from randomly chosen solutions, to generate the mutant solution. Each parent solution is compared with its offspring using the fitness function, and the better of the two is carried into the subsequent iteration. The algorithm terminates when a stopping condition is met, and the best individual in the population is returned as the solution to the problem.
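
The loop described above can be sketched in a few lines of Python. This is a generic DE/rand/1 scheme with binomial crossover applied to a toy sphere function, not the variant used in MDETC; the function names are hypothetical, and the default scaling factor, crossover probability, population size, and iteration count are simply borrowed from Table 2 for illustration.

```python
import random

def differential_evolution(fitness, dim, bounds, pop_size=20, f=0.7, cr=0.9,
                           max_iterations=100, seed=0):
    """Minimise `fitness` with a textbook DE/rand/1/bin loop (illustrative only)."""
    rng = random.Random(seed)
    low, high = bounds
    pop = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(x) for x in pop]
    for _ in range(max_iterations):
        for i in range(pop_size):
            # Mutation: a random base vector perturbed by a scaled difference vector.
            r1, r2, r3 = rng.sample([k for k in range(pop_size) if k != i], 3)
            mutant = [pop[r1][d] + f * (pop[r2][d] - pop[r3][d]) for d in range(dim)]
            # Binomial crossover: mix mutant and parent, keeping at least one mutant gene.
            j_rand = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < cr or d == j_rand) else pop[i][d]
                     for d in range(dim)]
            # Selection: the trial replaces its parent only if it is at least as good.
            trial_score = fitness(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=lambda k: scores[k])
    return pop[best], scores[best]

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best_x, best_score = differential_evolution(sphere, dim=3, bounds=(-5.0, 5.0))
```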

Memetic algorithms

The Memetic Algorithm (MA) is a metaheuristic that combines problem-specific solvers with an evolutionary algorithm. The solvers can be implemented as approximation, local search, or exact methods. The combination is intended to find better solutions, including solutions that neither the local search methods nor the evolutionary algorithm could reach alone. Besides, MAs provide an optimization framework that integrates various local search strategies, learning strategies [39], perturbation mechanisms, and population management strategies [40]. MAs appear under several names in the literature, such as Lamarckian EA, hybrid Genetic Algorithm, or Baldwinian evolutionary algorithm. The MA can make use of other optimization algorithms by employing them inside its framework [41]. For example, Differential Evolution has shown better mutation performance [42] when given appropriate parameter settings and mutation strategies. Combining DE with the MA can offer three benefits. Firstly, the quality of the offspring produced by evolutionary algorithms such as the MA can be improved by implementing several search methods in the optimization process; one such method is the DE mutation, which can be employed to generate better-quality individuals [43]. Secondly, premature convergence and stagnation can be reduced in DE by balancing exploitation and exploration, which can be achieved by using several mutation strategies [19]. Thirdly, the DE population can stagnate when the offspring are less fit than their parents over a given number of iterations; to address this, the performance of DE can be enhanced by a suitable hybridization with local search algorithms within the MA framework [43,44].
The Memetic Algorithm includes an initialization procedure, which creates the solutions of the initial population; a compete procedure, which reconstructs the current population from the previous one; and a restart procedure, which is invoked whenever the population reaches a degenerate state [45].

Proposed algorithm

This section describes the evolutionary steps and the solution representation of the introduced MDETC algorithm.

Solution representation

The label-based solution representation is employed to represent a candidate solution of the text clustering problem. Each solution is a vector of length n in which element i holds the number of the cluster assigned to document i. Fig 1 shows an example of the label-based representation of a candidate solution that contains two clusters and nine documents.

Fig 1

Example of the label-based representation of a candidate solution.

Moreover, a centroid-based 2-dimensional array is employed in the local search to store the centroid values of the clusters. The array includes D columns and K rows, where the total number of the attributes is denoted by D, and K is the number of the clusters. Fig 2 presents an example of a candidate solution of a dataset that contains two attributes and two clusters.

Fig 2

Example of the centroid-based representation of a candidate solution.
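
The relation between the two representations can be sketched as follows. The label vector and document weights below are hypothetical values (they are not the ones shown in Fig 1 or Fig 2), clusters are numbered from 0 for convenience, and the function name is an assumption.

```python
def centroids_from_labels(labels, documents, k):
    """Build the K x D centroid array from a label-based solution.

    labels[i] is the cluster index (0..k-1) of document i, and
    documents[i] is that document's weight vector of length D.
    """
    d = len(documents[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for label, doc in zip(labels, documents):
        counts[label] += 1
        for j, weight in enumerate(doc):
            sums[label][j] += weight
    # An empty cluster keeps an all-zero centroid in this sketch.
    return [[s / counts[c] if counts[c] else 0.0 for s in sums[c]] for c in range(k)]

# Hypothetical solution: nine documents, two clusters, two attributes.
labels = [0, 1, 0, 0, 1, 1, 0, 1, 0]
documents = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.0, 1.0], [0.9, 0.1],
             [0.7, 0.3], [0.1, 0.7], [1.0, 0.0], [0.1, 0.6]]
centroid_array = centroids_from_labels(labels, documents, k=2)  # 2 rows, 2 columns
```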

The MDETC proposed algorithm

In MDETC, the DE mutation, which uses the adaptive DE/current-to-best/1 strategy, is hybridized with the evolutionary steps of the MA. The hybridization aims to improve the convergence rate. Premature convergence is countered in the restart step, which rebuilds the diversity of the population, and the improvement step plays an important role in seeking better solutions. The pseudo-code of the proposed MDETC algorithm is presented in Fig 3 and consists of the following phases:

Fig 3

The pseudo-code of the proposed MDETC algorithm.

The population initialization phase

The initial solutions of MDETC are randomly generated. The documents are grouped into K random clusters; every cluster’s centroid is computed by Eq (6). These steps are repeated to produce Pop_Size random solutions.
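
A minimal sketch of this phase with the label-based representation is shown below. The helper names are hypothetical, and seeding every cluster with one document so that no cluster starts empty is an assumption of the sketch rather than something the text states.

```python
import random

def random_solution(n_docs, k, rng):
    """Assign each of n_docs documents to one of k clusters at random (needs n_docs >= k)."""
    # Assumption: place one document in every cluster first so none is empty.
    labels = list(range(k)) + [rng.randrange(k) for _ in range(n_docs - k)]
    rng.shuffle(labels)
    return labels

def init_population(n_docs, k, pop_size, seed=0):
    """Repeat the random grouping Pop_Size times."""
    rng = random.Random(seed)
    return [random_solution(n_docs, k, rng) for _ in range(pop_size)]

population = init_population(n_docs=9, k=2, pop_size=20)
```

The centroid of each cluster in a generated solution would then be computed from the document vectors, as in Eq (6).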

The recombination phase

This phase employs the mating pool approach [46] with a pool of size Pool_Size. Tournament selection with a tournament size of Tour_Size [47] is used to fill the mating pool, and two-point crossover is then applied to the individuals in the pool. Finally, the mating pool is merged into the population: the worst individuals in the population are replaced with the new individuals from the mating pool.
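
The two operators named in this phase can be sketched as below. The sketch assumes a fitness that is minimized and label vectors with at least three positions; the function names and these conventions are illustrative assumptions, not details taken from Fig 3.

```python
import random

def tournament_select(population, fitness, tour_size, rng):
    """Pick tour_size distinct individuals at random and return the fittest."""
    contenders = rng.sample(range(len(population)), tour_size)
    winner = min(contenders, key=lambda i: fitness[i])  # assumption: lower is better
    return population[winner]

def two_point_crossover(parent_a, parent_b, rng):
    """Swap the segment between two random cut points of two label vectors."""
    i, j = sorted(rng.sample(range(1, len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

rng = random.Random(1)
parent_a = [0] * 9
parent_b = [1] * 9
child_a, child_b = two_point_crossover(parent_a, parent_b, rng)
```

Offspring produced this way would be evaluated and then used to replace the worst individuals of the population.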

The DE mutation phase

This phase uses the DE/current-to-best/1 strategy [21], as shown in Fig 3. The cluster centroids are adjusted in the mutation step to obtain better solutions, as presented in Fig 4, using Eq (7). In Eq (7), Zbest is the centroid of the best solution, Zi is the centroid of the current solution, Zrand is a randomly selected centroid, Curr_Iteration is the current iteration number of MDETC, and Max_Iterations is the maximum number of iterations.

Fig 4

The pseudo-code of creating a trial individual algorithm.
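
For orientation, the textbook form of DE/current-to-best/1 can be sketched as follows. Eq (7) is not reproduced in this text, so the sketch should not be read as the authors' exact formula: it uses the standard variant with two random centroids, and the linearly decaying scaling factor is a purely hypothetical stand-in for the adaptive rule that depends on Curr_Iteration and Max_Iterations.

```python
def current_to_best_1(z_i, z_best, z_r1, z_r2, f):
    """Textbook DE/current-to-best/1: move z_i toward z_best, plus a random difference."""
    return [zi + f * (zb - zi) + f * (a - b)
            for zi, zb, a, b in zip(z_i, z_best, z_r1, z_r2)]

def decayed_factor(f0, curr_iteration, max_iterations):
    """Hypothetical schedule: shrink the scaling factor linearly over the run."""
    return f0 * (1.0 - curr_iteration / max_iterations)

# Example with two-attribute centroids and made-up values.
f_t = decayed_factor(0.7, curr_iteration=50, max_iterations=100)
trial_centroid = current_to_best_1([0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0], f_t)
```

With a schedule of this kind, early iterations take larger steps (exploration) and later iterations take smaller ones (exploitation), which is the general idea behind an adaptive mutation strategy.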

The improvement phase

The improvement step removes duplicated solutions, which helps to maintain diversity in the population and guards against premature convergence.
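
A minimal sketch of duplicate removal on label-based solutions is given below. It treats two solutions as duplicates only when their label vectors are identical; whether MDETC also merges solutions that differ merely by a renumbering of the clusters is not stated, so that case is left out here.

```python
def remove_duplicates(population):
    """Keep the first occurrence of every distinct label vector, preserving order."""
    seen = set()
    unique = []
    for labels in population:
        key = tuple(labels)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(labels)
    return unique

deduplicated = remove_duplicates([[0, 1, 1], [0, 1, 1], [1, 0, 0], [0, 1, 1]])
```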

The restart phase

Whenever the population falls into a degenerate state, the restart step is invoked [45]. The restart strategy retains a portion of the population and replaces the remaining solutions with newly generated ones. MDETC preserves 75% of the population for the subsequent iteration, while the rest of the population is generated randomly.
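
The restart step can be sketched as below. The text does not say which 75% of the population is preserved, so keeping the fittest individuals (with a minimized fitness) is an assumption of this sketch, as are the function and parameter names.

```python
import random

def restart(population, fitness, k, keep_fraction=0.75, seed=0):
    """Keep a fraction of the population and regenerate the rest at random."""
    rng = random.Random(seed)
    n_keep = int(len(population) * keep_fraction)
    # Assumption: the retained portion is the best-scoring one (lower is better).
    order = sorted(range(len(population)), key=lambda i: fitness[i])
    kept = [population[i] for i in order[:n_keep]]
    n_docs = len(population[0])
    fresh = [[rng.randrange(k) for _ in range(n_docs)]
             for _ in range(len(population) - n_keep)]
    return kept + fresh

# Hypothetical population of 20 solutions over 9 documents and 2 clusters.
old_population = [[i % 2] * 9 for i in range(20)]
old_fitness = [float(i) for i in range(20)]
new_population = restart(old_population, old_fitness, k=2)
```

With a population size of 20, as in Table 2, this keeps 15 solutions and generates 5 new ones.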

Experimental results and setup

Experimental setup

The performance of MDETC is studied using six standard real datasets from the Laboratory of Computational Intelligence (LABIC), which are represented in numerical form after term extraction. These datasets vary in their characteristics, such as the number of terms, clusters, and documents, and in their complexity [48]. The datasets used are CSTR, tr41, tr23, tr12, tr11, and oh15, as shown in Table 1. To assess the efficiency of the proposed algorithm, MDETC is compared with the K-means algorithm, DE [21], and the Genetic Algorithm (GA) [22], all implemented under the same experimental setup.

Table 1

The characteristics of the used LABIC datasets.

Dataset	Source	No. of documents	No. of terms	No. of clusters
CSTR	Technical Reports	299	1725	4
tr41	TREC	878	7454	10
tr12	TREC	313	5804	8
tr23	TREC	204	5832	6
tr11	TREC	414	6429	9
oh15	MEDLINE	913	3100	10

The performance of the algorithms is evaluated using the F-measure, which matches the obtained clustering solution against the ground truth to identify the correspondence between them. In addition, receiver operating characteristic (ROC) curves are plotted and the area under the curve (AUC) metric is calculated. Both the AUC metric and the F-measure range from 0 to 1, and a higher value indicates a better clustering. The ROC curve measures the degree of separability, that is, the capability of the algorithm to distinguish between classes. The ROC curves are plotted using the True Positive Percentage (TPP) against the False Positive Percentage (FPP), which are computed using Eq (8) and Eq (9). The F-measure of cluster Sj is computed from the recall and precision, which are shown in Eq (10) and Eq (11), where Nij denotes the number of objects of class Ci in cluster Sj, |Sj| is the number of objects in cluster Sj, and |Ci| is the number of objects in class Ci. The F-measure is computed using Eq (12). The parameter settings of the proposed MDETC are shown in Table 2; they were chosen experimentally and by drawing on previous work in the scientific literature [21]. With these settings, the algorithm was run 31 times on each dataset, and the average values of the AUC metric and F-measure were calculated. The algorithms were implemented in Oracle Java 1.8 and run on a personal computer with an Intel Core i7 CPU (2.6 GHz) and 8 GB of RAM.
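
The F-measure computation can be illustrated with the sketch below. Precision Nij/|Sj| and recall Nij/|Ci| follow the definitions in the text; the harmonic-mean combination and the size-weighted aggregation over clusters are one common convention and are assumed here, since Eq (10) to Eq (12) are not reproduced in this text.

```python
from collections import Counter

def clustering_f_measure(true_classes, cluster_labels):
    """Size-weighted F-measure of a clustering against ground-truth classes."""
    n = len(true_classes)
    class_sizes = Counter(true_classes)                 # |Ci|
    cluster_sizes = Counter(cluster_labels)             # |Sj|
    joint = Counter(zip(true_classes, cluster_labels))  # Nij
    total = 0.0
    for s_j, size_j in cluster_sizes.items():
        best = 0.0
        for c_i, size_i in class_sizes.items():
            n_ij = joint[(c_i, s_j)]
            if n_ij == 0:
                continue
            precision = n_ij / size_j
            recall = n_ij / size_i
            best = max(best, 2 * precision * recall / (precision + recall))
        # Assumed aggregation: each cluster contributes its best match,
        # weighted by its share of the documents.
        total += (size_j / n) * best
    return total

perfect = clustering_f_measure(["a", "a", "b", "b"], [1, 1, 0, 0])
merged = clustering_f_measure(["a", "a", "b", "b"], [0, 0, 0, 0])
```

A clustering that reproduces the classes exactly scores 1, while putting every document into a single cluster lowers the score because precision drops.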

Table 2

Parameters setting used in experiments.

Parameter	Value
No. of generations	100
Population size	20
Tournament selection size	10
Recombination mating pool size	10
Max. generations without improvement	20
Crossover probability	0.9
DE mutation scaling factor	0.7

Experimental results and discussion

Table 3 shows the average AUC results obtained by MDETC and the competing algorithms. The proposed MDETC achieved the best results on the tr23, tr12, tr41, CSTR, and oh15 datasets, and the second-best result on the tr11 dataset. Based on the AUC results, MDETC obtained excellent performance on the tr41, CSTR, and oh15 datasets and fair performance on the tr23, tr11, and tr12 datasets. The results show that the proposed MDETC algorithm generally has a higher AUC than the competing algorithms. For example, on the tr41 dataset MDETC obtained an AUC value of 0.9511, whereas the AUC values of K-means, DE, and GA are 0.5533, 0.5081, and 0.49, respectively.

Table 3

The comparison of AUC values obtained by the MDETC, K-means, DE and GA algorithms.

Dataset	K-means	DE	GA	MDETC
tr23	0.4697	0.5	0.4457	0.5575
tr11	0.4745	0.4701	0.5212	0.5206
tr12	0.4259	0.4438	0.4524	0.4577
tr41	0.5533	0.5081	0.49	0.9511
CSTR	0.5555	0.5706	0.5337	0.802
oh15	0.5335	0.5588	0.5635	0.9052

Moreover, the results in Table 3 are further analyzed using the rankings generated by Friedman's test based on the AUC metric, as shown in Table 4. Friedman's test indicates a significant difference among the algorithms, with a p-value of 0.03207, which is below the significance level (α = 0.05). The results confirm that MDETC obtained the best ranking based on the AUC metric. DE and GA tied for the second-best rank (2.8333 each), and K-means achieved the worst rank.

Table 4

Friedman test ranking for MDETC, K-means, DE and GA algorithms based on the AUC metric.

Algorithm	Ranking
MDETC	1.1666
DE	2.8333
GA	2.8333
K-means	3.1666
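
The rankings in Table 4 can be recomputed from the AUC values in Table 3 with a short sketch. The function names are hypothetical, the statistic is the standard chi-square form of Friedman's test, and the closed-form tail probability used here is only valid for three degrees of freedom (four algorithms). Under these assumptions the sketch gives average ranks that match Table 4, a statistic of about 8.8, and a p-value close to the reported 0.03207.

```python
import math

# AUC values copied from Table 3 (higher is better).
auc = {
    "tr23": {"K-means": 0.4697, "DE": 0.5, "GA": 0.4457, "MDETC": 0.5575},
    "tr11": {"K-means": 0.4745, "DE": 0.4701, "GA": 0.5212, "MDETC": 0.5206},
    "tr12": {"K-means": 0.4259, "DE": 0.4438, "GA": 0.4524, "MDETC": 0.4577},
    "tr41": {"K-means": 0.5533, "DE": 0.5081, "GA": 0.49, "MDETC": 0.9511},
    "CSTR": {"K-means": 0.5555, "DE": 0.5706, "GA": 0.5337, "MDETC": 0.802},
    "oh15": {"K-means": 0.5335, "DE": 0.5588, "GA": 0.5635, "MDETC": 0.9052},
}

def average_ranks(results):
    """Mean rank per algorithm; rank 1 is the highest score on a dataset.

    Ties within a dataset are not handled (Table 3 contains none).
    """
    totals = {}
    for scores in results.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, name in enumerate(ordered, start=1):
            totals[name] = totals.get(name, 0) + rank
    return {name: total / len(results) for name, total in totals.items()}

def friedman_statistic(ranks, n_datasets):
    """Chi-square form of the Friedman statistic from the average ranks."""
    k = len(ranks)
    sum_sq = sum(r * r for r in ranks.values())
    return 12 * n_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)

def chi2_sf_3dof(x):
    """Upper tail of the chi-square distribution with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

ranks = average_ranks(auc)
stat = friedman_statistic(ranks, len(auc))
p_value = chi2_sf_3dof(stat)
```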

Moreover, the statistical difference between the control algorithm (MDETC) and each of the other algorithms is assessed using Holm's post-hoc procedure. Table 5 lists the p-values obtained by Holm's procedure; the null hypothesis is rejected when the p-value is less than the adjusted value of α (α/i), where i represents the rank of each algorithm. Holm's procedure demonstrates that MDETC is statistically better than K-means, DE, and GA based on the AUC metric.

Table 5

Comparison between MDETC, K-means, DE and GA algorithms using Holm’s post-hoc procedure based on the AUC metric.

i	Algorithm	α/i	p-value of Holms	Null Hypothesis
1	DE	0.05/1 = 0.0500	0.02534	Rejected
2	GA	0.05/2 = 0.0250	0.02434	Rejected
3	K-means	0.05/3 = 0.0166	0.00729	Rejected
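
The decision rule described for Table 5 amounts to one comparison per row, as the sketch below shows. It applies the rule exactly as stated in the text (reject when the p-value is below α/i); a full step-down Holm procedure would additionally stop rejecting after the first hypothesis that is not rejected, which makes no difference for these rows. The variable and function names are hypothetical.

```python
ALPHA = 0.05

# (i, algorithm, p-value) rows copied from Table 5.
rows = [(1, "DE", 0.02534), (2, "GA", 0.02434), (3, "K-means", 0.00729)]

def holm_decisions(rows, alpha=ALPHA):
    """Return True for each algorithm whose null hypothesis is rejected (p < alpha / i)."""
    return {name: p < alpha / i for i, name, p in rows}

decisions = holm_decisions(rows)
```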

Fig 5 shows the corresponding ROC curves obtained by the MDETC, K-means, DE, and GA algorithms on the datasets used. The ROC curves demonstrate that MDETC produces excellent performance on the tr41, CSTR, and oh15 datasets, with a better capability to distinguish between classes. By contrast, MDETC obtained only fair performance relative to the competing algorithms on the tr23, tr11, and tr12 datasets.

Fig 5

The ROC curves on (a) tr23, (b) tr11; (c) tr12; (d) tr41; (e) CSTR; (f) oh15 datasets.

Table 6 shows the average F-measure results obtained by the competing algorithms. The proposed MDETC achieved the best F-measure on all datasets (i.e., tr23, tr11, tr12, tr41, CSTR, and oh15). The results show that the proposed MDETC algorithm has a higher F-measure than the competing algorithms. For example, on the CSTR dataset MDETC obtained an F-measure of 0.6908, whereas the F-measure values of K-means, DE, and the Genetic Algorithm (GA) are 0.5008, 0.5429, and 0.5133, respectively. However, the F-measure achieved by GA is close to that of MDETC on the tr12 dataset.

Table 6

The comparison of F-measure values obtained by the MDETC, K-means, DE and GA algorithms.

Dataset	K-means	DE	GA	MDETC
tr23	0.5759	0.5791	0.5572	0.6240
tr11	0.5043	0.4398	0.4595	0.5414
tr12	0.3402	0.4114	0.4470	0.4481
tr41	0.4494	0.4030	0.3685	0.6269
CSTR	0.5008	0.5429	0.5133	0.6908
oh15	0.3709	0.2976	0.2788	0.5895

Moreover, the results in Table 6 are further analyzed using the rankings generated by Friedman's test based on the F-measure, as shown in Table 7. Friedman's test indicates a significant difference among the algorithms, with a p-value of 0.00974, which is below the significance level (α = 0.05). The results confirm that MDETC obtained the best ranking based on the F-measure. DE and K-means tied for the second-best rank (2.833 each), and GA achieved the worst rank.

Table 7

Friedman test ranking for MDETC, K-means, DE and GA algorithms based on the F-measure.

Algorithm	Ranking
MDETC	1
DE	2.833
K-means	2.833
GA	3.333

Moreover, the statistical difference between the control algorithm (MDETC) and each of the other algorithms is assessed using Holm's post-hoc procedure. Table 8 lists the p-values obtained by Holm's procedure; the null hypothesis is rejected when the p-value is less than the adjusted value of α (α/i). Holm's procedure demonstrates that MDETC is statistically better than K-means, DE, and GA based on the F-measure.

Table 8

Comparison between MDETC, K-means, DE and GA algorithms using Holm’s post-hoc procedure based on the F-measure.

i	Algorithm	α/i	p-value of Holms	Null Hypothesis
1	DE	0.05/1 = 0.0500	0.013906	Rejected
2	K-means	0.05/2 = 0.0250	0.013906	Rejected
3	GA	0.05/3 = 0.0166	0.001745	Rejected

Fig 6 shows the convergence curves on the employed datasets. The curves demonstrate that MDETC produces the best convergence performance on the six datasets: it converges quickly in the initial iterations and more slowly afterwards. The proposed memetic steps improved efficiency by avoiding premature convergence. DE obtained the second-best convergence rate, and GA obtained the worst.

Fig 6

The convergence curves on (a) tr23, (b) tr11; (c) tr12; (d) tr41; (e) CSTR; (f) oh15 datasets.

Table 9 shows the running time of a single iteration of the proposed MDETC, K-means, DE, and GA algorithms on the datasets, in order to investigate the complexity of these algorithms. As presented in Table 9, the GA algorithm obtained the shortest processing time on all datasets. MDETC requires less processing time than both DE and K-means on the tr11, tr12, CSTR, and oh15 datasets; on the tr23 and tr41 datasets it is faster than DE but slower than K-means. K-means was the second-fastest algorithm on tr23 and tr41, the third-fastest on tr11, CSTR, and oh15, and the slowest on tr12. DE was the slowest algorithm on every dataset except tr12, where it was marginally faster than K-means. This exposes a trade-off between time cost and solution quality: hybrid metaheuristic methods such as MDETC can reach higher-quality solutions in an acceptable running time, whereas traditional metaheuristic algorithms do not promise the optimal solution and commonly produce sub-optimal but good-quality solutions in a shorter running time.

Table 9

Running time of MDETC, K-means, DE and GA algorithms.

Dataset	K-means	DE	GA	MDETC
tr23	0.301	0.517	0.052	0.362
tr11	1.001	1.211	0.070	0.831
tr12	0.784	0.770	0.059	0.533
tr41	1.211	2.907	0.192	2.002
CSTR	0.201	0.246	0.021	0.171
oh15	1.109	1.343	0.069	0.918

Comparison between MDETC and state of the art

The performance of MDETC is compared with state-of-the-art algorithms, namely the hybrid krill herd algorithm (MMKHA) [10], the krill herd algorithm (KH) [10], particle swarm optimization (PSO) [49], and Hybrid Harmony Search (HS) [34]. As presented in Table 10, MDETC achieved a better F-measure than the competing algorithms on five of the six datasets. MDETC obtained the best F-measure on the tr23, tr11, tr41, CSTR, and oh15 datasets. The MMKHA algorithm obtained the best F-measure on the tr12 dataset, where MDETC scored lowest, and the second-best result on the remaining datasets. Overall, the results presented in Table 10 show that MDETC performed consistently well across most datasets in terms of the F-measure.

Table 10

F-measure comparison between MDETC and the state-of-the-art algorithms.

Dataset	HS	KH	PSO	MMKHA	MDETC
tr23	0.4021	0.4004	0.3565	0.4214	0.6240
tr11	0.4095	0.4138	0.4380	0.5164	0.5414
tr12	0.4526	0.5019	0.4708	0.5624	0.4481
tr41	0.4392	0.4272	0.4471	0.5241	0.6269
CSTR	0.5268	0.4847	0.5090	0.6055	0.6908
oh15	0.4185	0.4840	0.4471	0.5278	0.5895

Additionally, the results in Table 10 are further analyzed using the rankings generated by Friedman's test based on the F-measure, as shown in Table 11. The test indicates a significant difference among the algorithms, with a p-value of 0.01302, which is below the significance level (α = 0.05). The results confirm that MDETC obtained the best ranking based on the F-measure. The MMKHA algorithm attained the second-best rank, followed by PSO and then KH, while HS achieved the worst rank. The rankings presented in Table 11 show that the F-measure performance of MDETC is competitive with that of the state-of-the-art algorithms.

Table 11

Friedman test ranking for MDETC and the state-of-the-art algorithms based on the F-measure.

Algorithm	Ranking
MDETC	1.6666
MMKHA	1.8333
PSO	3.6666
KH	3.8333
HS	4.0
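
The rankings in Table 11 can be reproduced from the F-measure values in Table 10. The following is a minimal sketch, not the authors' code: it assumes that rank 1 goes to the highest F-measure on each dataset, that the chi-square form of the Friedman statistic is used, and that the p-value comes from the chi-square distribution with k − 1 degrees of freedom. The names `average_ranks` and `friedman_test` are illustrative.

```python
import math

# F-measure values copied from Table 10 (one list per dataset, in column order).
ALGORITHMS = ["HS", "KH", "PSO", "MMKHA", "MDETC"]
F_MEASURE = {
    "tr23": [0.4021, 0.4004, 0.3565, 0.4214, 0.6240],
    "tr11": [0.4095, 0.4138, 0.4380, 0.5164, 0.5414],
    "tr12": [0.4526, 0.5019, 0.4708, 0.5624, 0.4481],
    "tr41": [0.4392, 0.4272, 0.4471, 0.5241, 0.6269],
    "CSTR": [0.5268, 0.4847, 0.5090, 0.6055, 0.6908],
    "oh15": [0.4185, 0.4840, 0.4471, 0.5278, 0.5895],
}


def average_ranks(table):
    """Rank algorithms per dataset (1 = highest F-measure) and average the ranks.

    Assumes no ties within a dataset row, which holds for Table 10.
    """
    k = len(ALGORITHMS)
    totals = [0.0] * k
    for scores in table.values():
        order = sorted(range(k), key=lambda j: scores[j], reverse=True)
        for rank, j in enumerate(order, start=1):
            totals[j] += rank
    n = len(table)
    return {name: totals[j] / n for j, name in enumerate(ALGORITHMS)}


def friedman_test(ranks, n):
    """Return the Friedman chi-square statistic and its p-value."""
    k = len(ranks)
    sum_sq = sum(r * r for r in ranks.values())
    stat = 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    # Chi-square survival function in closed form, valid for even df only:
    # sf(x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    df = k - 1
    if df % 2 != 0:
        raise ValueError("this closed-form p-value only covers even df")
    half = stat / 2.0
    p_value = math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )
    return stat, p_value


ranks = average_ranks(F_MEASURE)
stat, p_value = friedman_test(ranks, n=len(F_MEASURE))
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(f"{name:6s} {rank:.4f}")
print(f"chi-square = {stat:.4f}, p = {p_value:.5f}")
```

Under these assumptions the average ranks agree with Table 11 up to rounding, and the p-value agrees with the reported 0.01302, which suggests (but does not confirm) that the chi-square form of the test was used.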

Moreover, Table 12 shows the p-values of MDETC against the state-of-the-art algorithms using Holm’s post-hoc procedure, in which the null hypothesis is rejected when the obtained p-value is less than the adjusted significance level (α/i). According to Table 12, MDETC is statistically better than PSO, KH, and HS, whereas it is not significantly different from the MMKHA algorithm. Nevertheless, the results presented in Table 10 show that MDETC achieved a higher F-measure than MMKHA on five of the six tested datasets.

Table 12

Comparison between MDETC and the state-of-the-art algorithms using Holm’s procedure based on the F-measure.

i	Algorithm	α/i	p-value (Holm)	Null hypothesis
1	MMKHA	0.05/1 = 0.0500	0.85513	Not rejected
2	PSO	0.05/2 = 0.0250	0.02445	Rejected
3	KH	0.05/3 = 0.0167	0.01762	Rejected
4	HS	0.05/4 = 0.0125	0.01058	Rejected
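
The decision rule behind Table 12 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it takes the p-values in Table 12 to be unadjusted, requires p to be strictly below α/i for rejection, and stops at the first non-rejection, as Holm's step-down procedure is usually stated. The function name `holm_step_down` is hypothetical.

```python
def holm_step_down(p_values, alpha=0.05):
    """Apply Holm's step-down procedure.

    p_values maps a hypothesis label to its unadjusted p-value. Hypotheses are
    visited from the smallest p-value upward; with m hypotheses, the j-th
    smallest (1-indexed) is compared with alpha / (m - j + 1), and every
    hypothesis from the first non-rejection onward is retained.
    """
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    results = {}
    stopped = False
    for j, (label, p) in enumerate(ordered):  # j is 0-indexed here
        threshold = alpha / (m - j)
        if not stopped and p < threshold:
            results[label] = (threshold, "Rejected")
        else:
            stopped = True
            results[label] = (threshold, "Not rejected")
    return results


# p-values as printed in Table 12 (MDETC is the control method).
table_12 = {"MMKHA": 0.85513, "PSO": 0.02445, "KH": 0.01762, "HS": 0.01058}
decisions = holm_step_down(table_12)
for label, (threshold, decision) in decisions.items():
    print(f"{label:6s} alpha/i={threshold:.4f} p={table_12[label]:.5f} {decision}")
```

Applied literally to the printed values, the KH p-value (0.01762) is slightly above 0.05/3 ≈ 0.0167, so this strict reading rejects only the HS hypothesis. That differs from the decisions listed in Table 12, which may reflect a different rounding or comparison convention on the authors' side.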

Conclusions and future work

This work proposed the MDETC algorithm for addressing the text clustering problem. The combination of DE and memetic algorithms is intended to achieve a better balance between exploration and exploitation. The algorithm introduces a DE mutation operator that is hybridized within the memetic algorithm. To evaluate the effectiveness of the introduced algorithm, six standard text clustering benchmark datasets (i.e. the Laboratory of Computational Intelligence (LABIC)) were employed. The experimental results confirmed that the introduced MDETC algorithm obtained consistent performance compared to the state-of-the-art algorithms with respect to the AUC metric and the F-measure. These results indicate that the proposed MDETC achieved a better balance between exploration and exploitation and improved the performance of memetic algorithms in solving the text clustering problem. The MDETC algorithm obtained the best F-measure on the tr23 (62.4%), tr11 (54.14%), tr41 (62.69%), CSTR (69.08%), and oh15 (58.95%) datasets. Future work will concentrate on incorporating different validity measures within multi-objective metaheuristic algorithms.

25 Sep 2019

PONE-D-19-20564

Solving Text Clustering Problem using a Memetic Differential Evolution Algorithm

PLOS ONE

Dear Mr. Mustafa,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. The reviewers think that your article has some merits but needs to address their comments and suggestions in order to improve the quality of the article. We would appreciate receiving your revised manuscript by Nov 09 2019 11:59PM.
When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. We look forward to receiving your revised manuscript. Kind regards, Mohd Nadhir Ab Wahab, Ph.D. Academic Editor PLOS ONE Journal Requirements: 1. When submitting your revision, we need you to address these additional requirements. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. 
The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

Additional Editor Comments: The reviewers think that your article has some merits but needs to address their comments and suggestions in order to improve the quality of the article.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.
Reviewer #1: Yes
Reviewer #2: Partly

2. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #1: No
Reviewer #2: No

3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.
Reviewer #1: Yes
Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.
Reviewer #1: No
Reviewer #2: No

5. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The introduction provides a good, generalized background of the topic that quickly gives the reader an appreciation of the wide range of applications for this technology. I think the motivations for this study need to be made clearer. Figures must be in a good quality with high resolution. The manuscript is poorly written. It is full of incomplete & meaningless sentences. I really find it difficult to understand the manuscript. Authors really need to improve their written communications skills in English. All references must be revised. Issue numbers and volume numbers are missed from most of references. Conclusions should address whether or not they achieved the objective of the study.

Reviewer #2: I congratulate the author on their work, however there seem to be multiple issues that need to be addressed before publication:

(1) First and foremost, there are multiple grammatical and editorial errors in the text. As a non-native English speaker I can understand the text, but a larger audience will have a problem with that. Also it speaks of lack of attention: For example in the introduction section, one to the last paragraph is an incomplete sentence/paragraph "The reseach [25] of we propose a memetic differential evolution algorithm to address the text clustering (TC). The offered"

(2) There seem to be unnecessary amount of citations to the works performed by one other group. Namely, citations 47,10,11 and then 12 and 13 (where most of these work seem to be highly connected to each other). In order to preserve the integrity of the journal and academic community, I suggest being more selective in your citations.

(3) A disturbing amount of abbreviations and unnecessary nomenclature has been used that is not conductive to the cohesion of the report. For example FA (firefly algorithm) has been used two or three times in the entire document, I would suggest using the complete name "firefly algorithm" rather than FA, in order to not to confuse the reader. Or two different abbreviations TC and TD are used for text clustering and text document clustering. I suspected there is a difference between the two but I couldn't find any distinction between these two, neither in the manuscript, nor online. If these two refer to the same concept, you can use only one of them, if not, you will need to do a better job defining the two in the manuscript.

(4) In the section "population initialization phase" the authors refer to equation (4) for computing cluster centroids. I believe they intended to refer to equation (6).

(5) The authors are using F-value or F-measure as their evaluation metric, I think it is a good practice to include the metric definition in the manuscript. Moreover, when referring to this metric they cite a 2017 paper. You can say that the cited paper has used F-value as a measure of clustering efficiency and therefore it is sound to use F-value, but the way it is put in the manuscript, it implies that the 2017 paper was the “invention” of the F-value which clearly is wrong.

(6) In the section "Parameter setting used in the study": how did the authors come to this setting? Did they perform a parameter study? Or maybe cross-validation?

(7) Number of terms in documents (table 1) seem to be very low. For example 1700 unique words in a dataset of 299 documents (CSTR dataset). Or 3000 words in 900 documents. Maybe this numbers are after stemming and stop-word removal, but how are those steps performed? I would recommend, if not within the main manuscript, at least provide a supplementary material section where you describe this detailed methodology. As an example refer to table1 in the following paper: https://doi.org/10.1109/ACCESS.2019.2923462 Their datasets seem to have much higher term/document ratio, is there a reason for this? Maybe your datasets are more technical than of general nature? in this case your algorithm will be for "technical text classification"

(8) The authors are using tf-idf and vector space model for their document representation. Why not using a more elaborate word-representation such as Word2Vec or GloVe. Is there a benefit in using tf-idf and not more modern representations? Needs to be explained.

I applaud the authors for their efforts but I think the points made above are the minimum that need to be met to before being published in a journal.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No
Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site.
Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

23 Oct 2019

Thank you for the valuable feedback. We have added some sentences/paragraphs in order to response the comments from the reviewers. The following summarizes the updates that have been made to the paper as a response to the reviewers’ comments. Moreover, additional statistical analysis is added in the experimental results section to fulfill the journal requirement that have been addressed by the reviewers.

Reviewer #1:

1. The motivations for this study need to be made clearer: The motivation is modified in the introduction section, and cleared in the closing paragraph of the related work.

2. Figures must be in a good quality with high resolution: All figures is modified with better resolution according to the journal figures requirements.

3. The manuscript is poorly written. Authors really need to improve their written communications skills in English: The manuscript grammar is rechecked and proofread.

4. All references must be revised. Issue numbers and volume numbers are missed from most of references: All references mentioned in the comment are revised accordingly.

5. Conclusions should address whether or not they achieved the objective of the study: The conclusion section has been modified to include the achievement of the objective of the study.

Reviewer #2:

1. There are multiple grammatical and editorial errors in the text: The manuscript grammar is rechecked and proofread.

2. There seem to be unnecessary amount of citations to the works performed by one other group. Namely, citations 47,10,11 and then 12 and 13: References 11, 12, and 47 is removed. The remaining references are kept since they represent different works in text clustering such as multiobjective optimization, unsupervised feature selection, and Hybrid krill herd algorithm approaches

3. A disturbing amount of abbreviations and unnecessary nomenclature has been used that is not conductive to the cohesion of the report. You will need to do a better job defining the two different abbreviations TC and TD in the manuscript: All abbreviations have been revised accordingly. The TD and TC are discussion is modified in the introduction section.

4. In the section "population initialization phase" the authors refer to equation (4) for computing cluster centroids. I believe they intended to refer to equation (6); The equation number is corrected to be equation (6) (calculation of cluster centroids)

5. The authors are using F-value or F-measure as their evaluation metric, I think it is a good practice to include the metric definition in the manuscript. Moreover, when referring to this metric they cite a 2017 paper; The unnecessary citation is removed since its only indicates that this measure have been used in similar studies. The equations and discussion of F-measure are added in the experimental setup section.

6. In the section "Parameter setting used in the study": how did the authors come to this setting? Did they perform a parameter study? Or maybe cross-validation?: It is based on an experimental basis and the drawing on previous work from the scientific literature [3]*. This statement is added to the experimental setup section.

7. Number of terms in documents (table 1) seem to be very low. For example 1700 unique words in a dataset of 299 documents (CSTR dataset). Or 3000 words in 900 documents. Maybe this numbers are after stemming and stop-word removal, but how are those steps performed? I would recommend, if not within the main manuscript, at least provide a supplementary material section where you describe this detailed methodology: These datasets are standard benchmark text documents datasets that already pre-processed, the description is added in the experimental setup section. The detailed steps of the pre-processing are elaborated in background section (text clustering problem).

8. The authors are using tf-idf and vector space model for their document representation. Why not using a more elaborate word-representation such as Word2Vec or GloVe. Is there a benefit in using tf-idf and not more modern representations? Needs to be explained: The following are the reasons to adopt the tf-idf in our algorithm:
1. The TF/IDF is commonly used by TC algorithms, where the frequent terms will be a good indicator for a certain topic [1] [2]*
2. The standard text document clustering datasets are represented (pre-processed) by tf-idf term frequency.
More discussion is added to the Text clustering problem section. And also more discussion is added (as in comment No. 7).

*References.

1. Aggarwal CC, Reddy CK. Data Custering Algorithms and Applications. 1st ed. Taylor & Francis Group, LLC; 2013.

2. Cui X, Potok TE, Palathingal P. Document clustering using particle swarm optimization. Proc 2005 IEEE Swarm Intell Symp 2005 SIS 2005. 2005; 185–191. doi:10.1109/SIS.2005.1501621

3. Mustafa HMJ, Ayob M, Nazri MZA, Kendall G. An improved adaptive memetic differential evolution optimization algorithms for data clustering problems. PLoS One. 2019;14(5): e0216906. doi:10.1371/journal.pone.0216906

Submitted filename: Response to Reviewers.docx

18 Feb 2020

PONE-D-19-20564R1

Solving Text Clustering Problem using a Memetic Differential Evolution Algorithm

PLOS ONE

Dear Mr. Mustafa,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please address all the comments given by the reviewers. One of the reviewer raise their concern where the similarity between this paper with other published paper entitle "An improved adaptive memetic differential evolution optimization algorithms for data clustering problems" under PLoS One as well. Please emphasize on the different between these two papers and highlight your contribution explicitly.

We would appreciate receiving your revised manuscript by Apr 03 2020 11:59PM.

We look forward to receiving your revised manuscript.

Kind regards,
Mohd Nadhir Ab Wahab, Ph.D.
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1.
If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.
Reviewer #3: All comments have been addressed
Reviewer #4: (No Response)
Reviewer #5: (No Response)

2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #3: Yes
Reviewer #4: Yes
Reviewer #5: No

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #3: Yes
Reviewer #4: Yes
Reviewer #5: No

4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #3: Yes
Reviewer #4: Yes
Reviewer #5: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #3: Yes
Reviewer #4: Yes
Reviewer #5: No

6. Review Comments to the Author

Reviewer #3: The authors have addressed the comments from previous reviewers and improved the manuscript. I suggest all abbreviations of algorithms such as KH, CS, PSO, etc. be removed, they are not necessary and still excessive, making the manuscript hard to follow. The abstract needs to be improved as well.

Reviewer #4: The authors have met all comments of the previous reviewers. The English could still be improved, though.

Reviewer #5:

1. Line 390: in the comparative analysis section: Why is the work from Ref 23 not included here? What is the difference between this work and the work presented in Ref 23?

2. Line 340: The use of F-measure sounds good, but it is logical to present the full spectrum of evaluation metrics before narrowing down to one. For instance, the reader would be interested to know the AUC, in addition to the F-measure.

3. Line 233 to 235: The study failed to explained the limitation of existing algorithm. The use of overly broad terminology does not account for the limitation. Furthermore, the study failed to state the limitation of MA and DE for which an extended is considered important. I think this are missing steps that can improve the paper.

4. Reference to the Related works: The language is really bad. Another round of proof reading would be required. One more thing, the sequence of the reference can be adjusted in ascending order. It is presented in a scattered manner. this does not aid reading. Overall, the manuscript failed in readability due to poor grammar and sentence structure.

7. Do you want your identity to be public for this peer review?
Reviewer #3: Yes: Jijun Tang
Reviewer #4: Yes: Volker Ahlers
Reviewer #5: No

2 Apr 2020

Thank you for the valuable feedback. We have added some sentences/paragraphs to respond to the comments from the reviewers. The following table summarizes the updates that have been made to the paper as a response to the reviewers’ comments.
Moreover, additional statistical analysis is added in the experimental results section to fulfill the journal requirement that has been addressed by the reviewers.

Reviewer #3 comment # 1: I suggest all abbreviations of algorithms such as KH, CS, PSO, etc. be removed, they are not necessary and still excessive, making the manuscript hard to follow.
Response to comment : The abbreviations are used within the related work and the comparison with the state of art sections. We defined these abbreviations whenever are used, then the abbreviations are used due to difficulties of using the full name of the algorithms within the figures and tables, for example, hybrid krill herd algorithm (MMKHA).

Reviewer #3 comment # 2: The abstract needs to be improved as well.
Response to comment : We revised the abstract and made proofreading.

Reviewer #4 comment # 1: The English could still be improved.
Response to comment : Proofreading is made.

Reviewer #5 comment # 1: Line 390 (460 in the new manuscript): in the comparative analysis section: Why is the work from Ref 23 not included here? What is the difference between this work and the work presented in Ref 23?
Response to comment : Ref 23 (21 in the new manuscript) uses different objective functions and local search heuristic. In Ref 23, the datasets used to evaluate the algorithm is low dimension benchmark datasets that are collected from many domains. More discussion about this issue can be found in the related work section (las paragraphs line .. ). The text clustering datasets (as mentioned in the background section) use different kind of data (tf/idf) and requires different objective function, evolutionary steps, and local search to be proposed.

Reviewer #5 comment # 2: Line 340 (403 in the new manuscript): The use of F-measure sounds good, but it is logical to present the full spectrum of evaluation metrics before narrowing down to one. For instance, the reader would be interested to know the AUC, in addition to the F-measure.
Response to comment : The AUC metric and ROC are included in the study. A discussion about these evaluation metrics is added to the experiment setup section. Besides, the ROC curve (figure 5) and the results of the AUC metric (tables 3-5) are added and compared using the statistical analysis in the experimental results and discussion section. The abstract, introduction, and conclusion are revised accordingly.

Reviewer #5 comment # 3: Line 233 to 235 (235 in the new manuscript): The study failed to explain the limitation of the existing algorithm. The use of overly broad terminology does not account for the limitation. Furthermore, the study failed to state the limitation of MA and DE for which an extended is considered important. I think these are missing steps that can improve the paper
Response to comment : A paragraph is added before the comment to elaborate the need to hybridize between DE and MA with three benefits (line 235 - 251).

Reviewer #5 comment # 4: Reference to the Related works: The language is really bad. Another round of proofreading would be required. One more thing, the sequence of the reference can be adjusted in ascending order. It is presented in a scattered manner. This does not aid reading. Overall, the manuscript failed in readability due to poor grammar and sentence structure
Response to comment : We reorganized and revised the related work section accordingly. The sequence of references is adjusted. The related work is grouped according to the metaheuristic algorithm or approach used.

Submitted filename: Response to Reviewers.docx

23 Apr 2020

Solving Text Clustering Problem using a Memetic Differential Evolution Algorithm

PONE-D-19-20564R2

Dear Dr. Mustafa,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,
Mohd Nadhir Ab Wahab, Ph.D.
Academic Editor
PLOS ONE

Additional Editor Comments (optional): Congratulations. Please proof read the article as well.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here.
Reviewer #3: All comments have been addressed
Reviewer #4: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions?
Reviewer #3: Yes
Reviewer #4: (No Response)

3. Has the statistical analysis been performed appropriately and rigorously?
Reviewer #3: Yes
Reviewer #4: (No Response)

4. Have the authors made all data underlying the findings in their manuscript fully available?
Reviewer #3: Yes
Reviewer #4: (No Response)

5. Is the manuscript presented in an intelligible fashion and written in standard English?
Reviewer #3: Yes
Reviewer #4: (No Response)

6. Review Comments to the Author
Reviewer #3: The authors have adequately addressed all my concerns, I have no further request. for other updates.
Reviewer #4: (No Response)

7. Do you want your identity to be public for this peer review?
Reviewer #3: Yes: Jijun Tang
Reviewer #4: Yes: Volker Ahlers

26 May 2020

PONE-D-19-20564R2

Solving Text Clustering Problem using a Memetic Differential Evolution Algorithm

Dear Dr. Mustafa:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication.
For more information please contact onepress@plos.org. For any other questions or concerns, please email plosone@plos.org. Thank you for submitting your work to PLOS ONE. With kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Mohd Nadhir Ab Wahab Academic Editor PLOS ONE

5 in total

1. An Adaptive Multipopulation Differential Evolution With Dynamic Population Reduction.

Authors: Mostafa Z Ali; Noor H Awad; Ponnuthurai Nagaratnam Suganthan; Robert G Reynolds
Journal: IEEE Trans Cybern Date: 2016-10-25 Impact factor: 11.448

2. A memetic optimization algorithm for multi-constrained multicast routing in ad hoc networks.

Authors: Rahab M Ramadan; Safa M Gasser; Mohamed S El-Mahallawy; Karim Hammad; Ahmed M El Bakly
Journal: PLoS One Date: 2018-03-06 Impact factor: 3.240

3. An improved adaptive memetic differential evolution optimization algorithms for data clustering problems.

Authors: Hossam M J Mustafa; Masri Ayob; Mohd Zakree Ahmad Nazri; Graham Kendall
Journal: PLoS One Date: 2019-05-28 Impact factor: 3.240

4. Health-related hot topic detection in online communities using text clustering.

Authors: Yingjie Lu; Pengzhu Zhang; Jingfang Liu; Jia Li; Shasha Deng
Journal: PLoS One Date: 2013-02-15 Impact factor: 3.240

5. Clustering algorithms: A comparative approach.

Authors: Mayra Z Rodriguez; Cesar H Comin; Dalcimar Casanova; Odemir M Bruno; Diego R Amancio; Luciano da F Costa; Francisco A Rodrigues
Journal: PLoS One Date: 2019-01-15 Impact factor: 3.240

2 in total

1. Birdsongs recognition based on ensemble ELM with multi-strategy differential evolution.

Authors: Shanshan Xie; Yan Zhang; Danjv Lv; Haifeng Xu; Jiang Liu; Yue Yin
Journal: Sci Rep Date: 2022-06-13 Impact factor: 4.996

2. Identification of technology frontiers of artificial intelligence-assisted pathology based on patent citation network. (Review)

Authors: Ting Zhang; Juan Chen; Yan Lu; Xiaoyi Yang; Zhaolian Ouyang
Journal: PLoS One Date: 2022-08-22 Impact factor: 3.752
