Literature DB >> 35559954

Rough set based information theoretic approach for clustering uncertain categorical data.

Jamal Uddin1, Rozaida Ghazali2, Jemal H Abawajy3, Habib Shah4, Noor Aida Husaini5, Asim Zeb6.   

Abstract

MOTIVATION: Many real applications such as businesses and health generate large categorical datasets with uncertainty. A fundamental task is to efficiently discover hidden and non-trivial patterns from such large uncertain categorical datasets. Since the exact value of an attribute is often unknown in uncertain categorical datasets, conventional clustering algorithms do not provide a suitable means for dealing with categorical data, uncertainty, and stability. PROBLEM STATEMENT: The ability to make decisions in the presence of vagueness and uncertainty in data can be handled using Rough Set Theory. Although recent Rough Set Theory based categorical clustering techniques help, they suffer from low accuracy, high computational complexity, and poor generalizability, especially on data sets where they sometimes fail or struggle to select their best clustering attribute.
OBJECTIVES: The main objective of this research is to propose a new information theoretic based Rough Purity Approach (RPA). Another objective of this work is to address the problems of traditional Rough Set Theory based categorical clustering techniques. Hence, the ultimate goal is to cluster uncertain categorical datasets efficiently in terms of performance, generalizability and computational complexity.
METHODS: The RPA takes into consideration information-theoretic attribute purity of the categorical-valued information systems. Several extensive experiments are conducted to evaluate the efficiency of RPA using a real Supplier Base Management (SBM) and six benchmark UCI datasets. The proposed RPA is also compared with several recent categorical data clustering techniques.
RESULTS: The experimental results show that RPA outperforms the baseline algorithms. The significant percentage improvements with respect to time (66.70%), iterations (83.13%), purity (10.53%), entropy (14%), and accuracy (12.15%), as well as the Rough Accuracy of clusters, show that RPA is suitable for practical use.
CONCLUSION: We conclude that, compared to other techniques, the attribute purity of categorical-valued information systems can better cluster the data. Hence, the RPA technique can be recommended for large scale clustering in multiple domains, and its performance can be further enhanced in future research.

Year:  2022        PMID: 35559954      PMCID: PMC9106167          DOI: 10.1371/journal.pone.0265190

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.752


1 Introduction

Advances in computation and in faster, cheaper storage and communication technologies have led to the generation and storage of very large and complex data by businesses, governmental agencies and other organizations. The collected data can be used for important business decisions such as better understanding market dynamics, customer spending trends, operations and internal business processes. However, the size and complexity of the data place it beyond the ability of a human analyst to process for decision making. Similarly, in these processes, the issue of uncertain attribute values appears as a result of instrument faults, approximations in measurement or even subjective assessments by experts [1]. Moreover, as much of the data is uncertain and categorical in nature, it poses a challenge to conventional data analytic approaches. As a result, there has recently been a surge of interest in methods for mining uncertain categorical data [2-5]. Discovering useful knowledge in these data sets efficiently is a pressing requirement and a significant economic need. Clustering a set of objects into homogeneous groups is a fundamental operation in data mining. Clustering methods are often used to support data-driven decision making in numerous domains such as Businesses (e.g., market dynamic analysis) [6], Healthcare (e.g., protein sequence analysis) [7-9], Science (e.g., environmental data analysis) [10], Information Security [11], Computer Networks [12], Image Segmentation [13] and Software Maintenance [14, 15]. In data analytics, clustering lies at the core of successful data analysis tasks such as data summarization, classification, data reduction, filtering, exploratory data analysis and many more [14, 16–19]. A variety of cluster analysis methods for numerical data are commonly deployed by organizations. These cluster analysis methods are not appropriate for processing categorical datasets.
The increasing proliferation of large uncertain categorical data sets poses significant challenges to contemporary clustering techniques. Recently, attention has turned to data with non-numerical or categorical attributes, and there has been progress in categorical data clustering [20-24]. Although these clustering methods show advances in categorical data clustering and analysis, they are not suitable for uncertain categorical datasets and suffer from stability issues [25]. Recently, approaches based on fuzzy sets [20, 26–28] and Rough Set Theory (RST) [25, 29–33] for clustering categorical data have appeared in the literature. However, fuzzy set based methods incur heavy computational cost, as they require several runs, each with a new initial value, to assess the stability of the clustering outcome. Moreover, a parameter that controls the membership fuzziness needs to be adjusted to achieve better clustering results. For dealing with categorical data and handling uncertainty, Rough Set Theory has become a well-established mechanism in a wide variety of applications, including databases. Two types of uncertainty can be modeled inherently by Rough Set Theory [34-36]. The first type of uncertainty arises from the indiscernibility relation, which is imposed on the universe and partitions all values into a finite set of equivalence classes. The second type of uncertainty is modeled through the approximation regions of Rough Sets: elements of the upper approximation region have uncertain membership, whereas elements of the lower approximation region have certain membership. Rough Set Theory is a mathematical approach to the analysis of imperfect knowledge; it is discussed in greater detail in [12, 30]. RST is a viable framework for dealing with uncertainty in the clustering of categorical data. Originally a symbolic data analysis tool, RST is now also being developed for cluster analysis.
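The indiscernibility relation described above can be made concrete with a short sketch. The following is a minimal, illustrative implementation (the data and names are hypothetical, not from the paper): objects with identical values on the chosen attributes fall into the same equivalence class.

```python
from collections import defaultdict

def equivalence_classes(objects, attrs):
    """Partition objects by the indiscernibility relation over attrs.

    objects: dict mapping object id -> dict of attribute -> value
    attrs:   attribute names defining the relation
    """
    groups = defaultdict(set)
    for oid, row in objects.items():
        # identical attribute values => indiscernible objects
        groups[tuple(row[a] for a in attrs)].add(oid)
    return [frozenset(c) for c in groups.values()]

# Hypothetical toy information system with two categorical attributes
data = {
    1: {"Color": "red",  "Shape": "round"},
    2: {"Color": "red",  "Shape": "round"},
    3: {"Color": "blue", "Shape": "round"},
}
by_color = equivalence_classes(data, ["Color"])   # classes {1, 2} and {3}
by_shape = equivalence_classes(data, ["Shape"])   # single class {1, 2, 3}
```

Refining the attribute set can only split classes further, which is why the partition over all attributes yields the finest equivalence classes.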
RST partitions the universe and describes its subsets as equivalence classes. It also helps in decision making on uncertain data [31]. For example, symptoms form information about patients with a certain disease; patients characterized by the same symptoms are similar or indiscernible. This way of generating the indiscernibility relation is the mathematical basis of Rough Set Theory. Maximum Dependency Attribute (MDA), Maximum Significance of Attribute (MSA), Information Theoretic Dependency Roughness (ITDR) and other recent rough set based techniques [31-33] outperformed their predecessors [25, 37] in clustering categorical data. However, these recent techniques suffer from low accuracy, high computational complexity and poor generalizability, especially on data sets where they sometimes fail or struggle to select their best clustering attribute. Some of their limitations are outlined below. The MDA technique cannot perform well on data sets whose attributes have zero or equal dependency values. The MSA technique likewise fails to select a clustering attribute on data sets whose attributes have zero or equal significance values. ITDR faces issues such as random attribute selection and compromised class integrity due to its entropy measure. Hence, an efficient technique is needed to cluster uncertain categorical datasets in terms of accuracy, generalizability and computational complexity. In this paper, we propose a new information theoretic Rough Purity Approach (RPA) for categorical data clustering that addresses the problems inherent in existing RST based clustering techniques. RPA utilizes rough attribute dependencies based on a purity measure [38-41] in categorical-valued information systems. The representation of uncertain information by purity has been applied across databases, including data mining [39], knowledge extraction [40], cluster validation [42] and information retrieval [41].
Hence, this paper relates the concept of information theoretic purity to Rough Sets to establish a new Rough Set metric of uncertainty, which is Rough Purity. A real Supplier Base Management data set and several UCI benchmark data sets are used to validate the effectiveness of the proposed approach [43]. Accuracy, Entropy, Purity, Rough Accuracy, Iterations and Time are the measures used to test the quality of the obtained clusters; validating clustering results is a non-trivial task. Accuracy is the ratio of correctly clustered objects to the total number of objects [44]. Entropy measures the degree of class mixing within each cluster; better clustering has smaller entropy [39, 45]. Purity is the extent to which a cluster contains objects of a single class [39]. A better clustering result must have high overall purity, and a value of 1 indicates perfect clustering. The mean roughness of the selected clustering attribute gives the Rough Accuracy; higher mean roughness implies better accuracy [31]. The computational complexity of the clustering task can be determined by the number of iterations required for finding the indiscernibility relations, including finding the maximum or minimum values of dependency, significance, Rough Entropy, Rough Purity, etc. The computational complexity of any technique can also be expressed in terms of response time: here, the CPU response time in milliseconds is counted to examine the performance of the clustering task, and a better technique will always consume less time. The rest of this paper is organized as follows. Section 2 gives an overview of related work in cluster analysis, Rough Set Theory, and categorical data clustering. To explore the limitations of rough categorical clustering techniques, an analysis of existing techniques on an illustrative example is presented in Section 3.
Section 4 introduces the proposed information theoretic Rough Purity measure. An illustrative example and a proposition demonstrating the methodology and significance of the proposed approach are also presented. The experimental setup and data sets are described in Section 5. The experiments and the discussion of results are presented in Section 6. The summary of results and threats to validity are discussed in Section 7 and Section 8 respectively. Section 9 concludes the article.
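The cluster-quality measures described above (accuracy, purity, entropy) can be computed directly from cluster memberships and class labels. The following is a minimal sketch under the standard definitions cited in the text (the data is a made-up toy example, not from the paper's experiments):

```python
import math
from collections import Counter

def purity(clusters, labels):
    """Overall purity: weighted fraction of the majority class per cluster."""
    total = sum(len(c) for c in clusters)
    hits = sum(Counter(labels[o] for o in c).most_common(1)[0][1]
               for c in clusters)
    return hits / total

def entropy(clusters, labels):
    """Size-weighted average of per-cluster class entropy (lower is better)."""
    total = sum(len(c) for c in clusters)
    h = 0.0
    for c in clusters:
        counts = Counter(labels[o] for o in c)
        hc = -sum((n / len(c)) * math.log2(n / len(c))
                  for n in counts.values())
        h += (len(c) / total) * hc
    return h

# Toy example: four objects with two true classes
labels = {1: "a", 2: "a", 3: "b", 4: "b"}
perfect = [{1, 2}, {3, 4}]   # purity 1.0, entropy 0.0
mixed   = [{1, 3}, {2, 4}]   # purity 0.5, entropy 1.0
```

A purity of 1 corresponds to the "perfect clustering" mentioned above: every cluster is a pure subset of a single input class.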

2 Related work

2.1 Cluster analysis

Clustering builds a summary, concise or generative model of the data without explicit labels. The basic problem of clustering is splitting the data objects into potentially similar sets, and there are significant variations of this problem depending on the clustering model and data type. Clustering methods are utilized to support data-driven decision making in many domains such as software maintenance, information security, science, business and health care [29]. Application areas in which clustering is required include social network analysis, biological data analysis, multimedia data analysis, dynamic trend detection, data summarization, customer segmentation and collaborative filtering [46]. Moreover, it is also used as an intermediate step for other fundamental data mining problems. A wide variety of cluster analysis techniques is employed to address clustering problems [42, 47]. Commonly used clustering techniques include feature selection methods, probabilistic and generative models, distance-based algorithms, density and grid-based methods, dimensionality reduction methods, model-based methods, matrix factorization and co-clustering, and spectral methods [17]. The existing work on cluster analysis techniques is summarized in Table 1.
Table 1

Summary of related work on cluster analysis.

Paper | Proposed Technique | Compared Techniques | Evaluation Metrics | Data Sets/Application Area
[48] | Fuzzy Cluster Analysis | Fuzzy C-Means | Consensus threshold, Time of Iterations, Number of Clusters | Emergency Response Plan Selection
[49] | New strategy for cluster analysis | Network-determined mechanisms | Polarity, Correlation | Focal mechanism
[45] | Clustering Based on Entropy (CBE) | K-means, fuzzy c-means, Bayes classifier, Multilayer perceptron | Effectiveness | Synthetic Gaussian and non-Gaussian datasets, UCI datasets
[18] | Agglomeration methods | K-means | Inclusiveness, contestation | Political Science
[17] | Taxonomy and empirical analysis | Classical clustering algorithms | Stability, runtime, and scalability tests | MHORD, MHIRD, SHORD, SHIRD, SPFDS, DOSDS, SPDOS, WTP, DARPA, ITD big data sets
[50] | Survey | Partition based Clustering Algorithms | Number of clusters | Medical data sets
[51] | Cooperative clustering technique | Agglomerative, LIMBO, Wcombined | MoJoFM measure, arbitrary decisions | Object oriented software systems, Mozilla
[52] | Empirical study | Several clustering methods | Segmentation Variables, Number of clusters | Marketing research
[53] | Combined and Weighted Algorithms | Agglomerative approaches | Arbitrary decisions, Number of clusters | Open source software systems
[54] | Refined rough cluster algorithm | Rough cluster algorithm | Objective function, stability | Synthetic, forest and gene data
[47] | Survey | Several clustering algorithms | Percentage error, Accuracy | Iris, Mushroom, Salesman problem, Bio-informatics
[55] | Self-Splitting and Competitive Learning | OPTOC | Number of clusters | Gene Expression Data
[56] | Segmentation and phantom study | Manual ROI | Average mean squared error, time | PET Images, lung data
[23] | CACTUS | STIRR | Similarity, time | Real and synthetic datasets
[57] | Software Re-modularization | Complete, single, weighted | Precision, Recall, Cohesion, Coupling, Similarity | gcc, Linux, Mosaic and real world legacy system
[20] | Extended k-Means and k-modes | k-Means and k-modes | Accuracy, run time, standard deviation | Soybean disease and credit approval
[58] | Decision support approach | Average linkage, Centroid, Ward’s | Growth rate, Gamma frequency | Large scale R and D planning
[59] | Fine-classification procedure | Cluster classification | Spectra | Land and marine object
[60] | Silhouettes | Fuzzy clustering | Average silhouette width, Number of clusters | Ruspini

2.2 Rough Set Theory

Uncertain categorical data is used in several areas nowadays, and classical clustering methods are unable to handle such data. Accordingly, several uncertain categorical clustering methods have received attention. Pawlak introduced Rough Set Theory (RST) in 1982 as an approach to deal with uncertainty and vagueness. RST has emerged as an essential concept for tasks such as identifying and evaluating data dependency, reasoning about uncertain data, and computing reducts of information. Moreover, it is useful for representing and analyzing uncertain, vague and imprecise knowledge, data patterns and the availability of consistent information [30]. The RST viewpoint is that every object of the universe has some associated information (knowledge, data), and objects characterized by identical information are similar or indiscernible. An indiscernibility relation is generated in this way, which is the fundamental mathematical concept of RST. This relation resembles Leibniz’s Law of Indiscernibility. Rough indiscernibility relations are developed in the context of an arbitrary set of attributes. Other data analysis tools need additional information, such as basic probability assignments in Dempster–Shafer theory, probability distributions in statistics, and grades of membership in fuzzy set theory, whereas RST has no such requirement on the data. Precise concepts, in contrast to vague ones, can be characterized in terms of information about the objects. Accordingly, RST replaces any vague concept with a pair of precise concepts: an upper and a lower approximation. The upper approximation contains all objects that possibly belong to the concept, whereas the lower approximation contains all objects that surely belong to it. The boundary region of a concept is the difference between its upper and lower approximations.
Hence, instead of set membership, RST employs the boundary region to express vagueness [12]. The boundary region of a set is non-empty when the knowledge about the set is not enough to describe it precisely. Therefore, a set with an empty boundary region is crisp; otherwise it is rough. This idea of vagueness exactly resembles the one proposed by Frege [61], whereas the lower and upper approximations of a set coincide with the interior and closure operations of topology [62]. Different effective RST based techniques have been developed for exploring hidden patterns and determining optimal sets in data. Moreover, RST assists in evaluating data significance and developing decision rules from data [31]. Researchers have utilized RST in numerous applications, as summarized in Table 2.
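The lower and upper approximations and the boundary region described above can be sketched in a few lines. This is an illustrative implementation under the standard RST definitions (the patient data is a made-up toy example):

```python
from collections import defaultdict

def partition(objects, attrs):
    """Equivalence classes of the indiscernibility relation over attrs."""
    groups = defaultdict(set)
    for oid, row in objects.items():
        groups[tuple(row[a] for a in attrs)].add(oid)
    return list(groups.values())

def approximations(objects, attrs, target):
    """Lower/upper approximation of the target set w.r.t. attrs."""
    lower, upper = set(), set()
    for cls in partition(objects, attrs):
        if cls <= target:      # class entirely inside target: certain members
            lower |= cls
        if cls & target:       # class overlaps target: possible members
            upper |= cls
    return lower, upper

# Toy data: patients described by one symptom; target = the ill patients
data = {1: {"Fever": "yes"}, 2: {"Fever": "yes"}, 3: {"Fever": "no"}}
ill = {1, 3}
lo, up = approximations(data, ["Fever"], ill)
boundary = up - lo   # non-empty boundary => the set 'ill' is rough
```

Here patients 1 and 2 are indiscernible by Fever, so the set {1, 3} cannot be described precisely: its boundary region {1, 2} is non-empty, making it a rough set.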
Table 2

Summary of related work on rough set theory.

Paper | Proposed Technique | Compared Techniques | Evaluation Metrics | Data Sets/Application Area
[35] | Integrated Fuzzy PIPRECIA–Interval Rough SAW Model | The interval rough and fuzzy evaluations | Environmental image, recycling, pollution control, environmental management system, environmentally friendly products, resource consumption and green competencies | Supplier selection
[63] | Rough set theory based hierarchical linear model | Resource-based and enterprise ecosystem theory | T-test, P-value, error | Grain farms
[64] | Framework based on RST | Environmental and store factors | Frequency, Ranking, Growth rate | Restaurant chain
[65] | Generalized attribute reduction in rough set theory | Mean decision power increased attribute reduction (MDPIAR), positive region preserved attribute reduction (PRPAR), etc. | Micro and macro evaluation | 16 UCI data sets
[66] | Survey of rough set clustering | Variable Precision Model, Total Roughness, Rough K-means | Purity, Entropy | Outlier detection
[67] | Rough generation algorithm (RGI) | Rule and tree based classification algorithms | Mean absolute error | Medical data sets
[68] | Effective Rough Clustering | — | Precision, Accuracy | Supermarket data set
[69] | Rough Set Based Feature Selection | Fuzzy Rough Set Based Feature Selection | A review | Crisp and real-valued data sets
[70] | Rough set based decision theory | Decision making by weight | F score, CEI | Reuters Corpus Volume 1 data set
[71] | Rough CART algorithm | CART algorithm | Accuracy | Nutrition and health
[72] | Rough-Set Feature Selection Model | Decision tree | Error, Accuracy | Survey data
[73] | Rough evolutionary algorithm | Evolutionary algorithm | Courage, Accuracy | Beer preferences, City image data
[74] | Foundations of Rough Clustering | Rough k-Means | Lower and upper bounds | Traffic, Web and Supermarket data
[75] | Rough set theory | Decision tree | Rules, Accuracy | Multimedia data
[76] | Rough Self Organizing Map | Crisp clustering | Error, Accuracy | Artificial, Iris data set
[62] | Rough Set Theory fundamental concepts | Rough Set Theory principles | Rough Set Theory data extraction | Rough Set Theory applications
[77] | Rough classification rules framework | Rough Set Theory | Misclassification rate, Accuracy | Interval-valued information system
[78] | Rough autonomous Knowledge-Oriented (K-O) clustering | Complete, Single and Average Linkage | Accuracy, Number of clusters | Food nutrient data
[29] | Rough set theory | — | Rudiments of rough sets | Research directions and applications

2.3 Categorical data clustering

Classical clustering techniques are limited to numeric data; categorical data, however, is multi-valued, and similarity may be defined in terms of identical objects, identical values, or both. In categorical data, table fields are not naturally described by a metric, for example the symptoms of a patient, the names of automobile producers, or manufactured products. Therefore, clustering categorical data is more challenging, as there is no inherent distance measure. Although several valuable categorical clustering algorithms have been introduced, they are not designed to deal with uncertainty [31]. Accordingly, clustering categorical data with no sharp boundary between clusters arises as an important problem in real world applications. This uncertainty in categorical data clustering has been handled using fuzzy sets, where clusters of categorical data are represented by fuzzy centroids [26]. The fuzzy set based algorithm and conventional algorithms have been tested and compared on some categorical clustering data sets. Although the fuzzy set based algorithm obtains better performance, it requires multiple runs to obtain a satisfactory value for even one parameter. Similarly, to achieve stability, the fuzzy membership needs to be controlled. Substantial contributions have been offered by rough set based techniques, which handle uncertainty and cluster categorical data. The rough set based Total Roughness (TR) and Bi-clustering (BC) techniques select the best clustering attribute and handle the uncertainty issue [37]. The BC technique is limited to bi-valued attributes, whereas TR works on multi-valued attributes. Moreover, limited data, arbitrary selection and imbalanced clustering are key limitations of both techniques. Min–Min-Roughness (MMR) is another rough set based clustering technique for categorical data, with the notable ability to let the user handle uncertainty [25].
The MMR technique outperforms K-modes, fuzzy K-modes and fuzzy centroids on the Zoo and Soybean data. It has also been tested against ROCK, Squeezer, hierarchical and other algorithms on the comparatively larger Mushroom data set. However, the stability of MMR results depends on the number of clusters given as input. The MMR clustering technique was modified into MMeR to deal with uncertainty and with numerical and categorical features at the same time [79]. MMeR can handle heterogeneous data by generalizing the Hamming distance; a new modified Hamming distance was accordingly developed for any two data objects. The experimental results show better performance of MMeR compared to some existing algorithms on several data sets. Certain limitations related to the computational complexity and accuracy of previous techniques were resolved by an improved rough set based categorical clustering technique named Maximum Dependency Attributes (MDA) [31]. The MDA technique chooses the clustering attribute with the maximum attribute dependency in an information system. MDA outperforms its predecessors but itself lacks generalizability and efficiency. A Variable Precision Rough Set (VPRS) approach utilizes the mean accuracy of approximation to cluster categorical data [5]. VPRS considers noisy data and, without a predefined clustering attribute, successfully clusters some UCI data sets. Furthermore, the final clusters obtained using a divide and conquer method were found to be comparatively better and were also visualized. The performance of MMeR in terms of data heterogeneity and uncertainty was further enhanced by the Standard Deviation Roughness (SDR) clustering algorithm [80]. Experimental results on certain data sets in terms of cluster purity show the worth of SDR compared to other techniques. Later, a Standard deviation of Standard Deviation Roughness (SSDR) technique was introduced in this sequence [81].
SSDR can cluster uncertain numerical and categorical data at the same time and hence proved better than its predecessors SDR, MMeR and MMR. Maximum Significance of Attributes (MSA) also computes an appropriate clustering attribute, based on the RST concept of attribute significance [32]. MSA handles uncertainty and stability in the categorical clustering process, and its accuracy and purity improved to some extent compared to the MDA, MMR, TR and BC techniques. A clustering technique known as Information-Theoretic Dependency Roughness (ITDR) was developed for categorical data, utilizing information-theoretic dependencies [33]. It introduced a new measure of uncertainty in categorical data, named information-theoretic entropy. The complexity and purity of ITDR's clustering attribute selection were better than those of SSDR, SDR, MMeR and MMR. The likelihood function and indiscernibility relation of multivariate multinomial distributions were utilized to develop a novel modified Fuzzy k-Partition method [82]. The idea was effective: it supports extensive theoretical analysis yet achieves lower computational complexity than the Fuzzy k-Partition and Fuzzy Centroid approaches, and clustering accuracy and response time also improved on some real and UCI data. The rough intuitionistic fuzzy k-mode algorithm was an extension of rough fuzzy k-mode for clustering categorical data. A parameter for the intuitionistic degree in a given cluster was added, which calculates the element membership value. The efficiency of the suggested scheme was tested on some categorical data from the UCI repository, highlighting better results than the rough fuzzy k-mode algorithm. An algorithm called Min-Mean-Mean-Roughness (MMeMeR) was introduced, based on enhancements of the MMeR and MMR algorithms [83].
The effect of considering the minimum or mean on accuracy was also analyzed using standard UCI data. The authors found that objects at the edge of heterogeneous data can be clustered with certainty. Hence, the MMeMeR technique was deemed effective over the existing SDR, MMeR and MMR techniques. Recently, the Maximum Value Attribute (MVA) technique was suggested, which efficiently clusters uncertain categorical data [84]. A supplier's data set and several UCI data sets were used to validate the performance of the MVA technique against existing approaches. Despite its better performance, it sometimes produces singleton clusters and relies solely on domain knowledge. The existing work on rough categorical data clustering is summarized in Table 3.
Table 3

Summary of existing work on rough categorical data clustering.

Paper | Proposed Technique | Compared Techniques | Evaluation Metrics | Data Sets/Application Area
[84] | Rough Set based Maximum Value Attribute technique | K-means, RST based techniques | Accuracy, Purity, Entropy, Time, Iterations | Supplier and UCI data sets
[83] | MMeMeR | MMR, MMeR, SDR, Fuzzy K-modes, Fuzzy centroids | Accuracy, Purity | Zoo, Soybean and Mushroom data sets
[85] | Rough intuitionistic fuzzy k-mode | Rough fuzzy k-mode | DB index, D index, XB index, PC pair and Minkowski score | UCI data sets
[82] | Modified Fuzzy k-Partition | Fuzzy Centroid and Fuzzy k-Partition | Response time, clustering accuracy | UCI and real data sets
[33] | Information Theoretic Dependency Roughness | K-means, Fuzzy K-means, MMR, MMeR, SDR, SSDR | Purity | Zoo data set
[86] | Review of categorical clustering techniques | Min-Min Roughness, Standard Deviation Roughness, Modified Min-Min Roughness, Fuzzy set theory | Uncertainty | Categorical data sets
[32] | Maximum Significant Attribute (MSA) | Bi-Clustering, Total Roughness, Min-Min Roughness, Maximum Dependent Attribute | Rough accuracy, Purity | Credit card promotion dataset
[16] | Variable Precision Model | Total Roughness, Min-Min Roughness | Purity, Accuracy | Balloon, Tic-Tac-Toe, SPECT, Hayes-Roth
[81] | Standard deviation of Standard Deviation Roughness | Min-Min Roughness, Standard Deviation Roughness, Modified Min-Min Roughness, Fuzzy set theory | Purity | Zoo data set
[80] | Standard Deviation Roughness | k-modes, fuzzy k-modes, Min-Min Roughness | Purity | Soybean, Zoo, Mushroom data sets
[79] | Modified Min-Min Roughness | Min-Min Roughness, K-Modes, Fuzzy set theory | Purity | Soybean, Zoo, Mushroom data sets
[87] | Maximum Dependent Attribute | Bi-Clustering, Total Roughness, Min-Min Roughness | Rough accuracy, Iterations | Credit card, student qualifications and animal data sets
[25] | Min-Min Roughness | Squeezer, K-modes, LCBCDC, ROCK, hierarchical algorithm | Purity | Soybean and Zoo
[37] | Total Roughness | Bi-Clustering | Rough accuracy | Small data sets

3 An empirical analysis of existing categorical clustering techniques based on Rough Set Theory

Some existing Rough Set based techniques for selecting a clustering attribute in categorical data are analyzed here. A well-known technique, Maximum Dependency Attribute (MDA) [31], takes into account the rough dependency of attributes: it chooses the best clustering attribute in an information system on the basis of the maximum dependency degree [87]. Hassanein and Elmelegy [32] propose an alternative rough clustering technique known as Maximum Significance Attribute (MSA). In an information system, the MSA technique utilizes the significance of attributes; the highest degree of significance determines the best clustering attribute. Although the MDA and MSA techniques perform well in clustering categorical data compared to their predecessors, they struggle with, or fail on, the following cases in a categorical data set: (i) independent attributes, (ii) non-significant attributes, (iii) equally dependent attributes, and (iv) equally significant attributes. To illustrate these issues, we consider the following example. Example 1. Table 4 is a modified data set showing patients with possible viral symptoms [62]. There are three conditional attributes, Headache (H), Vomiting (V) and Temperature (T), for six patients. Viral illness is the decision attribute in Table 4.
Table 4

A Viral Illness information system.

Patient | H | V | T | Viral illness
1 | 0 | 1 | High | 1
2 | 1 | 0 | High | 1
3 | 1 | 1 | Very High | 1
4 | 0 | 1 | Normal | 0
5 | 1 | 0 | Normal | 0
6 | 0 | 0 | Very High | 1
The indiscernibility relation of each attribute induces equivalence classes and, following the MDA technique, we calculate the dependency degree of each attribute. The dependency degrees for the viral data set are given in Table 5. Here, selecting the best clustering attribute is not possible, as the dependency degrees are all equal to 0. Accordingly, the MDA technique fails, which creates a problem.
Table 5

Dependency degree of attributes from Table 4.

Attribute | Depends on (degree) | MDA
H | V (0), T (0) | 0
V | H (0), T (0) | 0
T | H (0), V (0) | 0
In the case of the MSA technique, we compute the significance of subsets of attributes. The significance degrees of all attributes are presented in Table 6. In this situation, selecting the best clustering attribute with the MSA technique is not possible either, as all significance values are equal to 0. Therefore, the MSA technique also fails and creates a problem.
Table 6

Significance degree of attributes from Table 4.

Attribute | Significance (w.r.t.) | MSA
H | V (0), T (0) | 0
V | H (0), T (0) | 0
T | H (0), V (0) | 0
The above example illustrates the inability of existing techniques to deal with attributes that have zero or equal dependency and significance values. Another recent categorical clustering technique, ITDR, uses entropy roughness to find the clustering attribute [5, 33]. However, entropy is one type of purity measure [42]; it considers the entire class distribution of a cluster, not just the largest class as the purity measure does [88]. In other words, the homogeneity or heterogeneity of the cluster does not affect the entropy results [89]. The strengths and limitations of existing Rough Set based categorical clustering techniques are highlighted in Table 7. The summary of the literature review leading to the proposed research framework is presented in Fig 1, which shows how various researchers have contributed to the main issue of clustering categorical data.
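The distinction drawn above between purity and entropy can be seen on a small made-up example: two clusters with the same majority class (hence the same purity) can differ in entropy, because entropy accounts for how the minority mass is spread.

```python
import math
from collections import Counter

def cluster_purity(members):
    """Fraction of the cluster belonging to its largest class."""
    return Counter(members).most_common(1)[0][1] / len(members)

def cluster_entropy(members):
    """Shannon entropy of the class distribution inside one cluster."""
    n = len(members)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(members).values())

c1 = ["a", "a", "a", "b", "b"]   # minority mass in one class
c2 = ["a", "a", "a", "b", "c"]   # same majority, minority split in two
# Purity is 0.6 for both clusters, but c2 has strictly higher entropy:
# purity sees only the largest class, entropy sees the full distribution.
```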
Table 7

Strengths and limitations of existing Rough categorical clustering techniques.

Technique  Basic idea                                           Strengths                                  Limitations
BC         Binary valued attributes                             Categorical Data, Uncertainty              Accuracy, Generalization
TR         Maximum total roughness                              Categorical Data, Complexity, Uncertainty  Purity, Generalization
MMR        Maximum mean roughness using lower and upper bounds  Categorical Data, Complexity, Uncertainty  Purity, Stability, Generalization
MDA        Dependency of attributes                             Categorical Data, Complexity, Uncertainty  Accuracy, Stability, Purity
MSA        Significance of attributes                           Categorical Data, Purity, Uncertainty      Stability, Complexity, Entropy
ITDR       Information theoretic attribute dependencies         Categorical Data, Purity, Uncertainty      Generalization, Complexity, Accuracy
Fig 1

Scenario leading to the proposed framework.

The analysis of the existing techniques presented in Table 7 and Fig 1 motivates the development of a more comprehensive measure of uncertainty. Accordingly, a measure based on classical information-theoretic purity is derived.

4 Information-theoretic purity measure with Rough Set Theory

The first and most commonly used purity measure is information gain, which is based on Shannon's entropy from information theory [40, 90]. Several variations of classical purity have been introduced, depending on the type of application and the particular uncertainty measurement [39, 41, 42, 89, 91, 92]. In this work, purity is defined so that it can be applied to Rough databases. The purity of a Rough Set X is defined as follows.

Definition 1 In an approximation space S = (U, Y, V, ξ), let L, M ⊆ Y and L, M ≠ ϕ. The Rough Purity (RP) of attribute M on attribute L, written L ⇒ M, is given for values α ∈ V(L) and β ∈ V(M) by

    P(M=β | L=α) = |X(M=β) ∩ X(L=α)| / |X(L=α)|,

where X(y=v) denotes the equivalence class of objects taking value v on attribute y and P(M|L) is a function on Y.

Definition 2 Suppose y ∈ Y and V(y) has k different values. The max rough purity of a value β of attribute M with respect to attribute L, denoted MP(M=β), is

    MP(M=β) = max over α ∈ V(L) of P(M=β | L=α).

Definition 3 MMP(M|L) denotes the mean max rough purity of M ∈ Y with respect to L ∈ Y and is calculated as

    MMP(M|L) = (1 / |V(M)|) Σ over β ∈ V(M) of MP(M=β),

where V(M) is the set of values of attribute M.

Definition 4 For an information system with a attributes, the mean rough purity of attribute M is the mean of MMP(M|L) over all remaining attributes L ≠ M, and the best clustering attribute is the one whose mean rough purity is maximal.

The Rough Purity Approach (RPA) thus takes the mean degree of Rough Purity into account to find the partitioning attribute. The justification is that a high Rough Purity value implies that a more accurate partitioning attribute is selected; the maximum total roughness of each attribute decides the best crispness [37]. In general, high purity indicates a better clustering combination, and the clusters are pure subsets of the input classes when the purity value is high [93].

Definition 5 To analyse the computational complexity of the RPA technique, let an information system have n objects, m attributes and l values per attribute. RPA needs nm computations to find the elementary sets of all attributes. Computing the Rough Purity of all subsets of U with different values, together with the maximum Rough Purity of all attributes with respect to each other, takes nl computation steps, and finding all mean max rough purity values takes a further n steps. Therefore, the computational complexity of RPA is the polynomial O(nl + mn + n). The steps of the RPA technique are presented in Fig 2. Next, we present an illustrative example of the RPA technique.
Fig 2

The RPA algorithm.

Example 2 A student's enrollment qualification information system is presented in Table 8. Degree (D), English (E), Statistics (S), Programming (P) and Mathematics (M) are five categorical attributes of eight students. The best clustering attribute needs to be selected, given that no decision attribute is pre-defined. To calculate the Rough Purity values, the indiscernibility relation of each attribute must first be obtained, which induces equivalence classes. Table 8 gives the following partitions of objects:
Table 8

Student’s enrollment qualification information system.

U/A  D      E             S    P       M
1    B.Sc.  Low           No   Fluent  Poor
2    B.Sc.  Intermediate  Yes  Poor    Fluent
3    M.Sc.  Advanced      No   Poor    Poor
4    M.Sc.  Intermediate  No   Fluent  Poor
5    Ph.D.  Low           Yes  Poor    Fluent
6    Ph.D.  Advanced      No   Poor    Fluent
7    Ph.D.  Advanced      Yes  Fluent  Poor
8    M.Sc.  Advanced      Yes  Fluent  Fluent
X(D=B.Sc.)={1, 2}, X(D=M.Sc.)={3, 4, 8}, X(D=Ph.D.)={5, 6, 7}, so U/D={{1, 2}, {3, 4, 8}, {5, 6, 7}}
X(E=Low)={1, 5}, X(E=Intermediate)={2, 4}, X(E=Advanced)={3, 6, 7, 8}, so U/E={{1, 5}, {2, 4}, {3, 6, 7, 8}}
X(S=No)={1, 3, 4, 6}, X(S=Yes)={2, 5, 7, 8}, so U/S={{1, 3, 4, 6}, {2, 5, 7, 8}}
X(P=Fluent)={1, 4, 7, 8}, X(P=Poor)={2, 3, 5, 6}, so U/P={{1, 4, 7, 8}, {2, 3, 5, 6}}
X(M=Poor)={1, 3, 4, 7}, X(M=Fluent)={2, 5, 6, 8}, so U/M={{1, 3, 4, 7}, {2, 5, 6, 8}}

Using Definition 1, the Rough Purity of Statistics with respect to Degree is:
P(S=Yes | D=B.Sc.) = |{2, 5, 7, 8} ∩ {1, 2}| / |{1, 2}| = 1/2 = 0.5
P(S=Yes | D=M.Sc.) = |{2, 5, 7, 8} ∩ {3, 4, 8}| / |{3, 4, 8}| = 1/3 = 0.33
P(S=Yes | D=Ph.D.) = |{2, 5, 7, 8} ∩ {5, 6, 7}| / |{5, 6, 7}| = 2/3 = 0.67
P(S=No | D=B.Sc.) = |{1, 3, 4, 6} ∩ {1, 2}| / |{1, 2}| = 1/2 = 0.5
P(S=No | D=M.Sc.) = |{1, 3, 4, 6} ∩ {3, 4, 8}| / |{3, 4, 8}| = 2/3 = 0.67
P(S=No | D=Ph.D.) = |{1, 3, 4, 6} ∩ {5, 6, 7}| / |{5, 6, 7}| = 1/3 = 0.33

The max rough purity of Statistics (S) with respect to Degree (D) is MP(S=Yes) = max(0.5, 0.33, 0.67) = 0.67 and MP(S=No) = max(0.5, 0.67, 0.33) = 0.67. The mean rough purity of Statistics (S) with respect to Degree (D) is therefore MMP(S|D) = (MP(S=No) + MP(S=Yes)) / |V(S)| = (0.67 + 0.67)/2 = 0.67. Proceeding similarly, the mean Rough Purity of every attribute is computed. Table 9 summarizes the calculations, which show that the highest mean purity value belongs to the Mathematics attribute. Following the heuristic that high purity indicates a better clustering combination, Mathematics is selected as the best clustering attribute. Hence, the clusters obtained are {1, 3, 4, 7} and {2, 5, 6, 8}.
Table 9

Mean max rough purity (MMP) values for Table 8.

     Mean Rough Purity                        Mean
     D       E       S       P       M
D    -       0.5     0.4167  0.4167  0.4167   0.4375
E    0.5556  -       0.333   0.333   0.333    0.3889
S    0.667   0.5     -       0.5     0.75     0.604
P    0.667   0.5     0.5     -       0.75     0.604
M    0.667   0.5     0.75    0.75    -        0.667
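The calculations in Table 9 can be cross-checked with a short sketch (an illustrative Python re-implementation, not the authors' C# code): for each candidate attribute M, take the maximum of P(M=β|L=α) over the values of every other attribute L, average over the values of M, then average over all L, and select the attribute with the highest mean.

```python
from statistics import mean

# Table 8: eight students with categorical attributes D, E, S, P, M.
ROWS = {
    1: {"D": "B.Sc.", "E": "Low",          "S": "No",  "P": "Fluent", "M": "Poor"},
    2: {"D": "B.Sc.", "E": "Intermediate", "S": "Yes", "P": "Poor",   "M": "Fluent"},
    3: {"D": "M.Sc.", "E": "Advanced",     "S": "No",  "P": "Poor",   "M": "Poor"},
    4: {"D": "M.Sc.", "E": "Intermediate", "S": "No",  "P": "Fluent", "M": "Poor"},
    5: {"D": "Ph.D.", "E": "Low",          "S": "Yes", "P": "Poor",   "M": "Fluent"},
    6: {"D": "Ph.D.", "E": "Advanced",     "S": "No",  "P": "Poor",   "M": "Fluent"},
    7: {"D": "Ph.D.", "E": "Advanced",     "S": "Yes", "P": "Fluent", "M": "Poor"},
    8: {"D": "M.Sc.", "E": "Advanced",     "S": "Yes", "P": "Fluent", "M": "Fluent"},
}

def classes(attr):
    """Equivalence classes X(attr=value) induced by one attribute."""
    out = {}
    for obj, row in ROWS.items():
        out.setdefault(row[attr], set()).add(obj)
    return list(out.values())

def mmp(m, l):
    """Mean (over values of m) of the max rough purity of m given l (Defs 1-3)."""
    return mean(max(len(xm & xl) / len(xl) for xl in classes(l))
                for xm in classes(m))

def mean_rough_purity(m):
    """Average MMP of attribute m over all other attributes (Def 4)."""
    return mean(mmp(m, l) for l in "DESPM" if l != m)

best = max("DESPM", key=mean_rough_purity)
print(best, round(mean_rough_purity(best), 4))  # M 0.6667
```

The sketch reproduces the Mean column of Table 9 and selects Mathematics (0.6667), matching the worked example.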
The comparison of Rough Purity with other measures of uncertainty is given in Proposition 1.

Proposition 1 Rough Purity is a more comprehensive measure of uncertainty than Rough Dependency and the significance of attributes.

Proof: If the attributes do not depend on each other, the dependency degree [31] is zero. Similarly, it can be shown that independent attributes are also non-significant, so the significance of attributes [32] is zero as well. Regardless of whether the attributes are dependent on, or significant for, each other, the Rough Purity measure always gives a non-zero value; in other words, the purity of Definition 1 always satisfies P(M|L) > 0. Hence, Rough Purity is a more comprehensive measure of uncertainty than Rough Dependency and the significance of attributes.
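Proposition 1 can be checked numerically on the viral data set of Example 1 (again a hypothetical Python sketch for illustration): the dependency degree of H on V is zero, yet the mean Rough Purity of H with respect to V is strictly positive.

```python
from statistics import mean

# Viral-illness data (Table 4), attributes H and V only.
ROWS = {1: ("0", "1"), 2: ("1", "0"), 3: ("1", "1"),
        4: ("0", "1"), 5: ("1", "0"), 6: ("0", "0")}

def classes(i):
    """Equivalence classes induced by attribute number i."""
    out = {}
    for obj, row in ROWS.items():
        out.setdefault(row[i], set()).add(obj)
    return list(out.values())

def dependency(m, l):
    """|POS_l(m)| / |U|: zero whenever no l-class fits inside an m-class."""
    pos = [o for xl in classes(l)
           if any(xl <= xm for xm in classes(m)) for o in xl]
    return len(pos) / len(ROWS)

def mean_rough_purity(m, l):
    """Mean over m-values of the max rough purity, per Definitions 1-3."""
    return mean(max(len(xm & xl) / len(xl) for xl in classes(l))
                for xm in classes(m))

print(dependency(0, 1))         # 0.0: MDA/MSA cannot rank H against V
print(mean_rough_purity(0, 1))  # 0.666...: Rough Purity still discriminates
```

The dependency degree collapses to zero on this data while the Rough Purity stays at 2/3, which is the behaviour the proposition claims.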

5 Experimental setup and data sets description

The RPA technique is implemented and validated in C#, and the results are presented in tabular form. The Supplier Base Management (SBM) domain is used to validate the proposed RPA technique [43]. The SBM data set comprises eleven attributes (shown in Table 10) describing the performance and capability of 23 suppliers (S): Quality Management Practices and Systems (Qm), Documentation and Self-audit (Ds), Process/Manufacturing Capability (Pc), Management of Firm (Mf), Design and Development Capabilities (Dc), Cost (C), Quality (Q), Price (P), Delivery (D), Cost Reduction Performance (Cp) and Others (O). The efficiency (E) of each supplier is determined by applying Data Envelopment Analysis [43]; the last column of Table 10 shows the conclusion for each supplier. All attribute domains are categorical because the continuous values have already been discretized.
Table 10

Discretized supply base management data set.

S   Qm  Ds  Pc  Mf  Dc  C  Q  P  D  Cp  O   E
1   3   2   3   3   4   2  1  1  1  1   1   I
2   2   2   1   1   1   2  2  1  1  1   1   E
3   1   1   1   1   2   1  3  1  4  1   3   E
4   5   2   2   2   3   4  5  2  4  1   4   E
5   5   2   3   3   4   4  3  3  3  1   2   I
6   3   2   2   3   3   3  3  3  4  2   3   E
7   2   1   3   2   4   3  4  2  2  2   3   E
8   5   2   3   3   3   3  2  2  2  1   2   I
9   5   2   3   3   4   4  1  1  1  1   1   I
10  1   2   1   1   1   1  5  1  4  1   3   E
11  2   1   1   2   3   2  1  1  3  1   2   I
12  1   1   3   3   3   3  3  3  2  1   2   E
13  5   2   3   3   4   4  1  2  3  1   3   I
14  4   2   3   3   4   4  2  1  2  1   2   I
15  4   2   2   3   1   1  4  3  4  2   3   E
16  3   2   3   3   2   2  3  1  1  1   1   I
17  5   2   3   3   4   3  3  1  2  1   1   I
18  5   2   3   3   4   4  4  1  1  1   1   I
19  4   2   3   1   4   3  4  1  4  1   3   I
20  4   2   3   3   1   4  2  2  4  1   4   E
21  5   2   3   2   4   4  2  1  3  1   2   I
22  5   2   2   3   3   4  5  3  4  2   4   E
23  4   2   3   3   2   3  4  3  3  2   4   E
The RPA technique is also validated using six data sets taken from the UCI Machine Learning repository: Balloons (16 instances, 4 attributes), Car Evaluation (1728 instances, 6 attributes), Zoo (101 instances, 17 attributes), Chess (3196 instances, 37 attributes), Balance Scale (625 instances, 5 attributes) and Monk's Problems (432 instances, 8 attributes). RPA is tested on all these data sets and compared with the recent Rough categorical techniques MDA, MSA and ITDR on the basis of several evaluation measures: time, iterations, purity, entropy, accuracy and Rough Accuracy.
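The purity and entropy evaluation measures can be stated concretely. Assuming the standard external-evaluation definitions (the paper's exact normalizations may differ slightly), a Python sketch:

```python
from math import log2

def purity(clusters, labels):
    """Fraction of objects lying in the majority class of their cluster."""
    n = sum(len(c) for c in clusters)
    majority = sum(max(sum(1 for o in c if labels[o] == t)
                       for t in set(labels.values()))
                   for c in clusters)
    return majority / n

def entropy(clusters, labels):
    """Size-weighted entropy of the class distribution inside each cluster."""
    n = sum(len(c) for c in clusters)
    total = 0.0
    for c in clusters:
        for t in set(labels.values()):
            p = sum(1 for o in c if labels[o] == t) / len(c)
            if p > 0:
                total -= len(c) / n * p * log2(p)
    return total

# Toy check on the two clusters RPA finds for Table 8, with hypothetical labels.
clusters = [{1, 3, 4, 7}, {2, 5, 6, 8}]
labels = {1: "A", 3: "A", 4: "A", 7: "B", 2: "B", 5: "B", 6: "B", 8: "A"}
print(purity(clusters, labels), round(entropy(clusters, labels), 4))  # 0.75 0.8113
```

High purity rewards clusters dominated by a single class, while low entropy rewards homogeneous label distributions; both conventions match the "higher purity, lower entropy is better" reading used in Section 6.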

6 Results and discussion

Table 11 reports the time taken by the MDA, MSA, ITDR and RPA techniques to complete the clustering task. For the Balloons data set, the number of instances is small, so the response time is the same for all techniques. For the Car Evaluation, Zoo and Chess data sets, RPA takes less time than all the other techniques.
Table 11

Time complexity of all techniques.

Data Set         Response Time (millisec)
                 MDA    MSA     ITDR  RPA
Balloons         0      0       0     0
Car Evaluation   20     595     17    15
Zoo              6      116     8     6
Chess            31598  658068  815   800
Balance Scale    2      27      5     2
Monk's Problems  4      72      3     3
SBM              1      15      3     1
The iterative complexity depends on the number of attributes and attribute values of a data set. It includes steps such as finding the dependency degrees of all attributes for MDA, the maximum significance over all possible attribute combinations for MSA, the minimum Rough Entropy for ITDR and the maximum Rough Purity for RPA. Table 12 shows that RPA requires fewer iterations than the MDA and MSA techniques on every data set. Although RPA and ITDR undergo almost the same iterative complexity to select their best clustering attribute, RPA still achieves a better response time: the Rough Purity formula is computationally simpler than Rough Entropy, and this effect shows up in the response time. The indiscernibility relation induced by the selected best attribute yields the resulting clusters.
Table 12

Iterative complexity of all techniques.

Data Set         Minimum iterations
                 MDA     MSA      ITDR  RPA
Balloons         80      147      49    25
Car Evaluation   3519    12138    367   184
Zoo              4381    2346     1201  600
Chess            781127  3892358  5181  2591
Balance Scale    624     1404     301   150
Monk's Problems  1397    4218     239   120
SBM              660     2779     1373  680
Table 13 shows the performance of the RPA, MDA, MSA and ITDR techniques in terms of purity, entropy, accuracy and Rough Accuracy. The accuracies in Table 13 show that the proposed RPA technique outperforms the other techniques on all data sets except Balance Scale and Monk's Problems, where the accuracy is the same. Table 13 also reports the entropy of the clusters obtained by each technique. Lower entropy indicates a better clustering technique [45], and the proposed technique shows lower entropy on all data sets except Balance Scale and Monk's Problems, where the entropy is the same; hence RPA also performs better on the entropy measure. Moreover, RPA achieves better purity on all data sets except Car Evaluation, Balance Scale and Monk's Problems, where all techniques produce equal purity for their best clustering attribute. Finally, Table 13 presents the Rough Accuracy of the techniques. The low or zero Rough Accuracy values arise because this measure is not a comprehensive measure of uncertainty [34]; even so, the overall performance of RPA in terms of Rough Accuracy is still better than that of the other techniques.
Table 13

Comparative performance of techniques for all data sets.

Measure         Technique  Balloons  Car Evaluation  Zoo   Chess  Balance Scale  Monk's Problems  SBM
Purity          RPA        0.8       0.7             0.61  0.6    0.64           0.5              0.74
                MDA        0.6       0.7             0.59  0.52   0.64           0.5              0.61
                MSA        0.6       0.7             0.4   0.54   0.64           0.5              0.74
                ITDR       0.6       0.7             0.5   0.52   0.64           0.5              0.74
Entropy         RPA        0.16      0.29            0.43  0.28   0.36           0.3              0.22
                MDA        0.29      0.33            0.5   0.3    0.36           0.3              0.29
                MSA        0.29      0.33            0.7   0.3    0.36           0.3              0.23
                ITDR       0.29      0.3             0.48  0.3    0.36           0.3              0.22
Accuracy        RPA        0.66      0.53            0.72  0.52   0.6            0.5              0.6
                MDA        0.47      0.48            0.55  0.5    0.6            0.5              0.5
                MSA        0.47      0.48            0.5   0.5    0.6            0.5              0.55
                ITDR       0.47      0.5             0.69  0.5    0.6            0.5              0.6
Rough Accuracy  RPA        0.2       0.1             0.2   0      0              0                0.11
                MDA        0         0               0.1   0      0              0                0
                MSA        0         0               0.1   0      0              0                0.1
                ITDR       0.1       0               0.1   0      0              0                0.11
If two or more techniques select the same clustering attribute, they also produce the same values for the evaluation measures. For example, on the Monk's Problems data set, MDA and MSA select the same clustering attribute, so their accuracy, purity and entropy values coincide; likewise, ITDR and RPA choose the same best attribute on this data set. Even when the techniques choose the same best clustering attribute, the number of iterations and the time taken remain favourable for RPA as the data set size increases.

7 Summary of results

This section summarizes the average and overall percentage improvement achieved by the RPA technique for clustering categorical data compared with MDA, MSA and ITDR. The summary shows that RPA significantly improves time, iterations, purity, entropy and accuracy. Table 14 shows a slight response-time improvement of RPA over ITDR, but the improvement over the MSA and MDA techniques is large. Table 15 shows that RPA requires roughly half the iterations of ITDR, and nearly 100% fewer iterations than MDA and MSA, to choose the best clustering attribute. Similarly, Tables 16–18 show significant improvements by RPA over the MDA and MSA techniques in clustering evaluation measures such as purity, entropy and accuracy. Although ITDR outperforms MDA and MSA on these measures, the performance of RPA is still reasonably better than that of ITDR for clustering categorical data. Finally, Table 19 highlights the overall improvement achieved by RPA in terms of time, iterations, purity, entropy and accuracy: RPA is not only less complex but also more efficient in selecting the best clustering attribute and clustering categorical data. Overall, the experimental results show that the proposed RPA technique is simple, more generalized and fast, and that it obtains better clusters with lower entropy and higher purity and accuracy.
Table 14

Average percentage improvement of time by RPA technique.

                         MDA       MSA       ITDR      RPA
Average Time (millisec)  4518.714  94127.57  121.5714  118.1429
Improvement by RPA       97.40%    99.87%    2.82%
Table 15

Average percentage improvement of iterations by RPA technique.

                    MDA       MSA       ITDR      RPA
Average Iterations  113112.6  562357.9  1244.429  621.4286
Improvement by RPA  99.45%    99.88%    50.06%
Table 16

Average percentage improvement of Purity by RPA technique.

Technique  Average Purity  Improvement by RPA
MDA        0.59            11.14%
MSA        0.59            11.14%
ITDR       0.60            9.30%
RPA        0.655714
Table 18

Average percentage improvement of accuracy by RPA technique.

Technique  Average Accuracy  Improvement by RPA
MDA        0.5143            14.72%
MSA        0.5143            14.72%
ITDR       0.5514            7%
RPA        0.59
Table 19

Overall percentage improvement by RPA technique.

Measure     Overall Improvement by RPA
Time        66.70%
Iterations  83.13%
Purity      10.53%
Entropy     14%
Accuracy    12.15%

8 Threats to validity

The primary threat to validity of this study is that the tools of the existing approaches MDA, MSA and ITDR are not available, so they were re-implemented in a prototype system developed in C# for the experiments. However, our implementations of the previous approaches strictly follow the descriptions and pseudocode available in their respective research articles. To reduce this source of bias, we used the same data sets and the same evaluation measures as the existing techniques, and we verified that all evaluation measures of the existing techniques reproduce the results computed in their original work. Another threat to validity concerns the number of instances and attributes of the data sets. In this study, a real SBM data set and six benchmark data sets were chosen for the experiments. To generalize our results, experiments were performed on data sets with varying numbers of instances and attributes, drawn from different application domains. However, this study focused only on small and medium sized data sets; experiments on large data sets could be performed to further validate the proposed technique.

9 Conclusion

Traditional clustering techniques cannot deal with uncertainty in a data set because they were not designed to do so, and several categorical data clustering techniques have emerged as a new trend for handling uncertainty in the clustering process. The motivation for a better Rough clustering technique arose from exposing potential issues in recently developed Rough clustering techniques such as MDA, MSA and ITDR. These issues include data whose attributes have zero or equal dependency, attributes with zero or equal significance values, and random attribute selection. The key contribution of this paper is that these limitations of existing Rough Set based clustering techniques for categorical data are handled successfully and effectively. A Rough Set based information theoretic approach for clustering uncertain categorical data, named the Rough Purity Approach (RPA), is presented. Extensive experimental analysis of the proposed RPA and the existing approaches, using a real supplier base management data set and UCI benchmark data sets, is discussed. Significant improvement can be seen in the experimental outcomes in terms of the relevant parameters: time (66.70%), iterations (83.13%), purity (10.53%), entropy (14%), accuracy (12.15%) and the Rough Accuracy of clusters. This improvement shows that RPA can be extended for further research in data mining, artificial intelligence, Rough Set Theory and soft computing. One limitation of this research is that only the relevant Rough Set based categorical techniques MDA, MSA and ITDR are analysed. Although this comparison provides strong evidence of the efficiency of the proposed approach across several evaluation parameters, other approaches, such as those based on fuzzy bipolar soft sets and Pythagorean fuzzy bipolar soft sets, should be compared in future work to analyse the RPA technique further.
PONE-D-21-21593
Rough Set Based Information Theoretic Approach for Clustering Uncertain Categorical Data
PLOS ONE Dear Dr. GHAZALI, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. ============================== ACADEMIC EDITOR: The reviewer has asked for revisions. There is concerns about the discussion and the comparisons, that authors need to address. Based on all this, I am recommending major revisions. Furthermore when submitting the revised paper, please also consider the following points: 1. English language needs proofreading. 2. References should be in proper format. 3. All acronyms must first be defined. ============================== Please submit your revised manuscript by Jan 16 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:
A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols. We look forward to receiving your revised manuscript. Kind regards, Usman Qamar Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf. 2. 
Thank you for stating the following in the Acknowledgments Section of your manuscript: [The authors would like to thank the King Khalid University of Saudi Arabia for supporting this research under grant number R.G.P.1/365/42.] We note that you have provided funding information that is currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: [The authors would like to thank the King Khalid University of Saudi Arabia for supporting this research under grant number R.G.P.1/365/42.] Please include your amended statements within your cover letter; we will change the online submission form on your behalf. 3. Please include a separate caption for each figure in your manuscript. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? 
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: No ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: The paper is well-written in general and the authors have done a good job of communicating their ideas. Your abstract does not highlight the specifics of your research or findings. There needs to be an explicit research objective stated, preferably as a separate section. The related work section should be extended to present a critical review of existing techniques highlighting their deficiencies. At present, the literature review is presented in a "this did that" format. There is no flow here. This section should be re-written based on the techniques rather than listing the papers. 
It is better to provide a tabular format summary of the existing approaches along with strengths and weaknesses as well. For the methodology, it was explained clearly, however it should be supported with an example. The algorithm proposed in the paper has no formal proof that will produce the correct score. Analyses are missing using more state-of-the-art methods. Compare and provide strong evidence about the efficiency of the proposed approach with similar approaches based on fuzzy bipolar soft set and Pythagorean fuzzy bipolar soft set, so far, no such experimentation is provided. Show the robustness checking of the proposed model. How the proposed approach is effective in terms of computational resources like memory and execution time. The conclusion is not precise. The key findings and its implementation potential (in practice) is missing. Clearly, identify its academic contributions also. Limitations are not mentioned. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. 
To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 25 Jan 2022 ACADEMIC EDITOR: When submitting the revised paper, please also consider the following points: Comment: 1. English language needs proofreading. Response: English proof reading is done and as per capability all relevant issues are resolved. Comment: 2. References should be in proper format. Response: References are reviewed to remove any mistake. Comment: 3. All acronyms must first be defined. Response: Acronyms are defined at first place. ============================== JOURNAL REQUIREMENTS: When submitting your revision, we need you to address these additional requirements. Comment: 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf. Response: All above requirements are considered in preparing the revised manuscript. 2. Thank you for stating the following in the Acknowledgments Section of your manuscript: [The authors would like to thank the King Khalid University of Saudi Arabia for supporting this research under grant number R.G.P.1/365/42.] Comment: 2. We note that you have provided funding information that is currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 
Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: [The authors would like to thank the King Khalid University of Saudi Arabia for supporting this research under grant number R.G.P.1/365/42.] Please include your amended statements within your cover letter; we will change the online submission form on your behalf. Response: The funding-related text is now removed from the manuscript. Comment: 3. Please include a separate caption for each figure in your manuscript. Response: The caption of Figures is now included as per instructions. ============================== REVIEWERS' COMMENTS: Reviewer #1: The paper is well-written in general and the authors have done a good job of communicating their ideas. ________________________________________ Comment: Your abstract does not highlight the specifics of your research or findings. There needs to be an explicit research objective stated, preferably as a separate section. Response: The Abstract is rearranged and rephrased accordingly. The Motivation, Problem Statement, Objectives, Methods, Results and Conclusion is now explicitly stated as separate sections. ________________________________________ Comment: The related work section should be extended to present a critical review of existing techniques highlighting their deficiencies. At present, the literature review is presented in a "this did that" format. There is no flow here. This section should be re-written based on the techniques rather than listing the papers. It is better to provide a tabular format summary of the existing approaches along with strengths and weaknesses as well. Response: The Related work section (Section 2, Subsections 2.1, 2.2 & 2.3) is now presented as per the valuable suggestion of respected reviewer. The tabular summary is also provided accordingly (Table 1, 2 & 3). 
The strength and weakness of relevant existing approaches are highlighted in Section 3, Example 1 and Table 7. ________________________________________ For the methodology, it was explained clearly, however it should be supported with an example. The algorithm proposed in the paper has no formal proof that will produce the correct score. Response: Example 2 illustrate the methodology of proposed approach. It presents how successfully the clusters can be obtained using the proposed approach. ________________________________________ Analyses are missing using more state-of-the-art methods. Compare and provide strong evidence about the efficiency of the proposed approach with similar approaches based on fuzzy bipolar soft set and Pythagorean fuzzy bipolar soft set, so far, no such experimentation is provided. Response: With all due respect, the goal of this research is to explore some significant limitations in existing rough set based categorical clustering techniques only. Therefore, particularly the cases where these techniques are unable to produce quality clusters (independent and insignificant data) are considered. Accordingly, we suggest a better, viable and more comprehensive alternative approach RPA in our research. Therefore, the analyses of only relevant rough set based categorical techniques like MDA, MSA and ITDR is presented in Section 6, Table 11-19. This comparison provides strong evidence about the efficiency of the proposed approach in terms of several evaluation parameters. The comparison with other approaches like fuzzy bipolar soft set and Pythagorean fuzzy bipolar soft set is valuable point but due to scope limitations we have mention it as future work in Section 9. ________________________________________ Show the robustness checking of the proposed model. How the proposed approach is effective in terms of computational resources like memory and execution time. 
Response: The computational complexity of the proposed clustering strategy is determined by the number of iterations required and by the response time. This is mentioned in the second-to-last paragraph of Section 1 (Introduction). The computational complexity in terms of Big O is illustrated in Definition 5. Tables 11, 12, 14 & 15 show the relevant experimental results.

________________________________________

Comment: The conclusion is not precise. The key findings and their implementation potential (in practice) are missing. Clearly identify its academic contributions as well. Limitations are not mentioned.

Response: The Conclusion (Section 9) has been rearranged to include the key findings, contributions, and limitations. Other limitations are highlighted in Section 8.

________________________________________

Comment: While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/.

Response: The figures have been enhanced using PACE. Thanks for another valuable suggestion.

Submitted filename: Response to Reviewers.docx

28 Feb 2022

Rough Set Based Information Theoretic Approach for Clustering Uncertain Categorical Data
PONE-D-21-21593R1

Dear Dr. Ghazali,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date.
If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible, and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Usman Qamar
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the "Comments to the Author" section, enter your conflict of interest statement in the "Confidential to Editor" section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: (No Response)

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: (No Response)

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data (e.g. participant privacy or use of data from a third party), those must be specified.

Reviewer #1: (No Response)

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: (No Response)

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Authors addressed the previous comments and the paper has been substantially improved. I vote for its acceptance.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose "no", your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.
Reviewer #1: No

11 Apr 2022

PONE-D-21-21593R1
Rough Set Based Information Theoretic Approach for Clustering Uncertain Categorical Data

Dear Dr. Ghazali:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Usman Qamar
Academic Editor
PLOS ONE
Table 17. Average percentage improvement of entropy by RPA technique.

Technique   Average Entropy   Improvement by RPA
MDA         0.338571          13.92%
MSA         0.358571          18.72%
ITDR        0.321429          9.33%
RPA         0.291429          -
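The "Improvement by RPA" column in Table 17 is consistent with computing the relative reduction in average entropy against each baseline; this formula is an assumption inferred from the reported figures, not stated in the table itself. A minimal sketch:

```python
# Sketch: reproducing the "Improvement by RPA" column of Table 17.
# Assumed formula: (baseline_entropy - rpa_entropy) / baseline_entropy * 100,
# i.e. the percentage reduction in average entropy achieved by RPA.

rpa_entropy = 0.291429
baselines = {"MDA": 0.338571, "MSA": 0.358571, "ITDR": 0.321429}

for name, entropy in baselines.items():
    improvement = (entropy - rpa_entropy) / entropy * 100
    print(f"{name}: {improvement:.2f}%")
# MDA: 13.92%, MSA: 18.72%, ITDR: 9.33% -- matching the table's column.
```

Since each value in the column matches this relative-reduction formula to two decimal places, the table's percentages appear internally consistent with the reported average entropies.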

