Jamal Uddin, Rozaida Ghazali, Jemal H Abawajy, Habib Shah, Noor Aida Husaini, Asim Zeb.
Abstract
MOTIVATION: Many real applications such as business and health generate large categorical datasets with uncertainty. A fundamental task is to efficiently discover hidden and non-trivial patterns from such large uncertain categorical datasets. Since the exact value of an attribute is often unknown in uncertain categorical datasets, conventional clustering analysis algorithms do not provide a suitable means for dealing with categorical data, uncertainty, and stability. PROBLEM STATEMENT: Decision making in the presence of vagueness and uncertainty in data can be handled using Rough Set Theory. Although recent categorical clustering techniques based on Rough Set Theory help, they suffer from low accuracy, high computational complexity, and poor generalizability, especially on data sets where they sometimes fail, or can hardly manage, to select their best clustering attribute.
Year: 2022 PMID: 35559954 PMCID: PMC9106167 DOI: 10.1371/journal.pone.0265190
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Summary of related work on cluster analysis.
| Paper | Proposed Technique | Compared Techniques | Evaluation Metrics | Data Sets/ Application Area |
|---|---|---|---|---|
| [ ] | Fuzzy Cluster Analysis | Fuzzy C-Means | Consensus threshold, Time of Iterations, Number of Clusters | Emergency Response Plan Selection |
| [ ] | New strategy for cluster analysis | Network-determined mechanisms | Polarity, Correlation | Focal mechanism |
| [ ] | Clustering Based on Entropy (CBE) | K-means, fuzzy c-means, Bayes classifier, Multilayer perceptron | Effectiveness | Synthetic Gaussian and non-Gaussian datasets, UCI datasets |
| [ ] | Agglomeration methods | K-means | Inclusiveness, contestation | Political Science |
| [ ] | Taxonomy and empirical analysis | Classical clustering algorithms | Stability, runtime, and scalability tests | MHORD, MHIRD, SHORD, SHIRD, SPFDS, DOSDS, SPDOS, WTP, DARPA, ITDB big data sets |
| [ ] | Survey | Partition-based clustering algorithms | Number of clusters | Medical data sets |
| [ ] | Cooperative clustering technique | Agglomerative, LIMBO, Wcombined | MoJoFM measure, arbitrary decisions | Object oriented software systems, Mozilla |
| [ ] | Empirical study | Several clustering methods | Segmentation variables, Number of clusters | Marketing research |
| [ ] | Combined and Weighted Algorithms | Agglomerative approaches | Arbitrary decisions, Number of clusters | Open source software systems |
| [ ] | Refined rough cluster algorithm | Rough cluster algorithm | Objective function, stability | Synthetic, forest and gene data |
| [ ] | Survey | Several clustering algorithms | Percentage error, Accuracy | Iris, Mushroom, Salesman problem, Bio-informatics |
| [ ] | Self-Splitting and Competitive Learning | OPTOC | Number of clusters | Gene expression data |
| [ ] | Segmentation and phantom study | Manual ROI | Average mean squared error, time | PET images, lung data |
| [ ] | CACTUS | STIRR | Similarity, time | Real and synthetic datasets |
| [ ] | Software re-modularization | Complete, single, weighted | Precision, Recall, Cohesion, Coupling, Similarity | gcc, Linux, Mosaic and a real-world legacy system |
| [ ] | Extended k-Means and k-modes | k-Means and k-modes | Accuracy, run time, standard deviation | Soybean disease and credit approval |
| [ ] | Decision support approach | Average linkage, Centroid, Ward's | Growth rate, Gamma frequency | Large-scale R and D planning |
| [ ] | Fine-classification procedure | Cluster classification | Spectra | Land and marine objects |
| [ ] | Silhouettes | Fuzzy clustering | Average silhouette width, Number of clusters | Ruspini |
Summary of related work on rough set theory.
| Paper | Proposed Technique | Compared Techniques | Evaluation Metrics | Data Sets/ Application Area |
|---|---|---|---|---|
| [ ] | Integrated Fuzzy PIPRECIA–Interval Rough SAW Model | The interval rough and fuzzy evaluations | Environmental image, recycling, pollution control, the environmental management system, environmentally friendly products, resource consumption and green competencies | Supplier selection |
| [ ] | Rough set theory based hierarchical linear model | Resource-based and enterprise ecosystem theory | T-test, P-value, error | Grain farms |
| [ ] | Framework based on RST | Environmental and store factors | Frequency, Ranking, Growth rate | Restaurant chain |
| [ ] | Generalized attribute reduction in rough set theory | Mean decision power increased attribute reduction (MDPIAR), Positive region preserved attribute reduction (PRPAR), etc. | Micro and macro evaluation | 16 UCI data sets |
| [ ] | Survey of rough set clustering | Variable Precision Model, Total Roughness, Rough K-means | Purity, Entropy | Outlier detection |
| [ ] | Rough generation algorithm (RGI) | Rule and tree based classification algorithms | Mean absolute error | Medical data sets |
| [ ] | Effective Rough Clustering | None | Precision, Accuracy | Supermarket data set |
| [ ] | Rough Set Based Feature Selection | Fuzzy Rough Set Based Feature Selection | A review | Crisp and real-valued data sets |
| [ ] | Rough set based decision theory | Decision making by weight | F score, CEI | Reuters Corpus Volume 1 data set |
| [ ] | Rough CART algorithm | CART algorithm | Accuracy | Nutrition and health |
| [ ] | Rough-Set Feature Selection Model | Decision tree | Error, Accuracy | Survey data |
| [ ] | Rough evolutionary algorithm | Evolutionary algorithm | Courage, Accuracy | Beer preferences, city image data |
| [ ] | Foundations of Rough Clustering | Rough k-Means | Lower and upper bounds | Traffic, web and supermarket data |
| [ ] | Rough set theory | Decision tree | Rules, Accuracy | Multimedia data |
| [ ] | Rough Self Organizing Map | Crisp clustering | Error, Accuracy | Artificial, Iris data set |
| [ ] | Rough Set Theory Fundamental Concepts | Rough Set Theory Principles | Rough Set Theory Data Extraction | Rough Set Theory Applications |
| [ ] | Rough classification rules framework | Rough Set Theory | Misclassification rate, Accuracy | Interval-valued information system |
| [ ] | Rough autonomous Knowledge-Oriented (K-O) clustering | Complete, Single and Average Linkage | Accuracy, Number of clusters | Food nutrient data |
| [ ] | Rough set theory | None | Rudiments of rough sets | Research directions and applications |
Summary of existing work on rough categorical data clustering.
| Paper | Proposed Technique | Compared Techniques | Evaluation Metrics | Data Sets/ Application Area |
|---|---|---|---|---|
| [ ] | Rough Set based Maximum Value Attribute Technique | K-Means, RST based techniques | Accuracy, Purity, Entropy, Time, Iterations | Supplier and UCI data sets |
| [ ] | MMeMeR | MMR, MMeR, SDR, Fuzzy k-modes, Fuzzy centroids | Accuracy, Purity | Zoo, Soybean and Mushroom data sets |
| [ ] | Rough intuitionistic fuzzy k-mode | Rough fuzzy k-mode | DB index, D index, XB index, PC pair and Minkowski score | UCI data sets |
| [ ] | Modified Fuzzy k-Partition | Fuzzy Centroid and Fuzzy k-Partition | Response time, clustering accuracy | UCI and real data sets |
| [ ] | Information Theoretic Dependency Roughness | K-means, Fuzzy K-means, MMR, MMeR, SDR, SSDR | Purity | Zoo data set |
| [ ] | Review of categorical clustering techniques | Min-Min Roughness, Standard Deviation Roughness, Modified Min-Min Roughness, Fuzzy set theory | Uncertainty | Categorical data sets |
| [ ] | Maximum Significant Attribute (MSA) | Bi-Clustering, Total Roughness, Min-Min Roughness, Maximum Dependent Attribute | Rough accuracy, Purity | Credit card promotion dataset |
| [ ] | Variable Precision Model | Total Roughness, Min-Min Roughness | Purity, Accuracy | Balloon, Tic-Tac-Toe, SPECT, Hayes-Roth |
| [ ] | Standard deviation of Standard Deviation Roughness | Min-Min Roughness, Standard Deviation Roughness, Modified Min-Min Roughness, Fuzzy set theory | Purity | Zoo data set |
| [ ] | Standard Deviation Roughness | k-modes, fuzzy k-modes, Min-Min Roughness | Purity | Soybean, Zoo, Mushroom data sets |
| [ ] | Modified Min-Min Roughness | Min-Min Roughness, K-Modes, Fuzzy set theory | Purity | Soybean, Zoo, Mushroom data sets |
| [ ] | Maximum Dependent Attribute | Bi-Clustering, Total Roughness, Min-Min Roughness | Rough accuracy, Iterations | Credit card, student's qualifications and animal data sets |
| [ ] | Min-Min Roughness | Squeezer, K-modes, LCBCDC, ROCK, hierarchical algorithm | Purity | Soybean and Zoo |
| [ ] | Total Roughness | Bi-Clustering | Rough accuracy | Small data sets |
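Most of the techniques in the table above (Total Roughness, Min-Min Roughness and their variants) are built on the same rough-set primitive: a target set of objects is approximated from below and above by the blocks of an attribute's indiscernibility partition, and roughness = 1 − |lower| / |upper|. A minimal illustrative sketch (ours, not the authors' code; the toy table and names are assumptions):

```python
# Roughness of a set X with respect to an attribute's indiscernibility
# partition: 1 - |lower approximation| / |upper approximation|.

def partition(rows, attr):
    """Group object ids into indiscernibility blocks by their value of `attr`."""
    blocks = {}
    for oid, row in rows.items():
        blocks.setdefault(row[attr], set()).add(oid)
    return list(blocks.values())

def roughness(rows, target, attr):
    """Roughness of `target` with respect to attribute `attr`."""
    lower, upper = set(), set()
    for block in partition(rows, attr):
        if block <= target:   # block certainly inside the target set
            lower |= block
        if block & target:    # block possibly in the target set
            upper |= block
    return 1 - len(lower) / len(upper)

# Toy categorical table: object id -> {attribute: value}
toy = {
    1: {"colour": "red",  "size": "small"},
    2: {"colour": "red",  "size": "large"},
    3: {"colour": "blue", "size": "small"},
    4: {"colour": "blue", "size": "large"},
}
X = {1, 2}  # the "red" objects
print(roughness(toy, X, "size"))    # size blocks {1,3},{2,4}: lower empty, upper = U -> 1.0
print(roughness(toy, X, "colour"))  # colour block {1,2} is exact: lower = upper -> 0.0
```

A roughness of 0 means the attribute describes the set crisply; a roughness of 1 means the approximation is maximally vague, which is what the purity- and roughness-based selection criteria in the table exploit.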
A Viral Illness information system.
| Patient | H | V | T | Viral illness |
|---|---|---|---|---|
| 1 | 0 | 1 | High | 1 |
| 2 | 1 | 0 | High | 1 |
| 3 | 1 | 1 | Very High | 1 |
| 4 | 0 | 1 | Normal | 0 |
| 5 | 1 | 0 | Normal | 0 |
| 6 | 0 | 0 | Very High | 1 |
Dependency degree of attributes from Table 4.
| Attribute | Depends on | Degree of Dependency |
|---|---|---|
| H | V | 0 |
| H | T | 0 |
| V | H | 0 |
| V | T | 0 |
| T | H | 0 |
| T | V | 0 |
Significance degree of attributes from Table 4.
| Attribute | Relative to | Significance |
|---|---|---|
| H | V | 0 |
| H | T | 0 |
| V | H | 0 |
| V | T | 0 |
| T | H | 0 |
| T | V | 0 |
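The zero dependency degrees above can be reproduced with a short script. Below is a sketch (ours, not the paper's implementation) of the rough-set dependency degree used by MDA-style attribute selection, applied to the viral illness table: every pairwise degree comes out 0, which is exactly the failure case where such a technique cannot pick a best clustering attribute.

```python
# Dependency degree k(a|b) = |POS_b(a)| / |U|, where POS_b(a) is the union
# of the b-indiscernibility blocks that are pure (single-valued) in a.

def dependency(rows, a, b):
    blocks = {}
    for oid, row in rows.items():
        blocks.setdefault(row[b], []).append(oid)
    # a b-block contributes to the positive region only if it is pure in a
    pos = sum(len(ids) for ids in blocks.values()
              if len({rows[i][a] for i in ids}) == 1)
    return pos / len(rows)

# The viral illness information system from the table above
viral = {
    1: {"H": 0, "V": 1, "T": "High"},
    2: {"H": 1, "V": 0, "T": "High"},
    3: {"H": 1, "V": 1, "T": "Very High"},
    4: {"H": 0, "V": 1, "T": "Normal"},
    5: {"H": 1, "V": 0, "T": "Normal"},
    6: {"H": 0, "V": 0, "T": "Very High"},
}
for a in "HVT":
    for b in "HVT":
        if a != b:
            print(f"k({a}|{b}) = {dependency(viral, a, b)}")  # every pair gives 0.0
```

Since no block of any single attribute's partition is pure with respect to another attribute, all six positive regions are empty and every dependency degree is 0.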
Strengths and limitations of existing Rough categorical clustering techniques.
| Technique | Basic idea | Strengths | Limitations |
|---|---|---|---|
| | Binary valued attributes | Categorical Data, Uncertainty | Accuracy, Generalization |
| | Maximum total roughness | Categorical Data, Complexity, Uncertainty | Purity, Generalization |
| | Maximum mean roughness using lower and upper bounds | Categorical Data, Complexity, Uncertainty | Purity, Stability, Generalization |
| | Dependency of attributes | Categorical Data, Complexity, Uncertainty | Accuracy, Stability, Purity |
| | Significance of attributes | Categorical Data, Purity, Uncertainty | Stability, Complexity, Entropy |
| | Information theoretic attribute dependencies | Categorical Data, Purity, Uncertainty | Generalization, Complexity, Accuracy |
Fig 1. Scenario leading to the proposed framework.
Fig 2. The RPA algorithm.
Student’s enrollment qualification information system.
| U/A | D | E | S | P | M |
|---|---|---|---|---|---|
| 1 | B.Sc. | Low | No | Fluent | Poor |
| 2 | B.Sc. | Intermediate | Yes | Poor | Fluent |
| 3 | M.Sc. | Advanced | No | Poor | Poor |
| 4 | M.Sc. | Intermediate | No | Fluent | Poor |
| 5 | Ph.D. | Low | Yes | Poor | Fluent |
| 6 | Ph.D. | Advanced | No | Poor | Fluent |
| 7 | Ph.D. | Advanced | Yes | Fluent | Poor |
| 8 | M.Sc. | Advanced | Yes | Fluent | Fluent |
MMP roughness of Table 8.
| Mean Rough Purity | D | E | S | P | M | Mean |
|---|---|---|---|---|---|---|
| D | - | 0.5 | 0.4167 | 0.4167 | 0.4167 | 0.4375 |
| E | 0.5567 | - | 0.333 | 0.333 | 0.333 | 0.3889 |
| S | 0.667 | 0.5 | - | 0.5 | 0.75 | 0.604 |
| P | 0.667 | 0.5 | 0.5 | - | 0.75 | 0.604 |
| M | 0.667 | 0.5 | 0.75 | 0.75 | - | 0.667 |
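As a quick arithmetic check (ours, not the paper's code), the Mean column of the MMP roughness table is the row-wise average of the four pairwise mean-rough-purity values:

```python
# Row-wise averages of the pairwise mean-rough-purity values from the table.
rows = {
    "D": [0.5, 0.4167, 0.4167, 0.4167],
    "E": [0.5567, 0.333, 0.333, 0.333],
    "S": [0.667, 0.5, 0.5, 0.75],
    "P": [0.667, 0.5, 0.5, 0.75],
    "M": [0.667, 0.5, 0.75, 0.75],
}
for attr, vals in rows.items():
    print(f"{attr}: {sum(vals) / len(vals):.4f}")  # matches the Mean column within rounding
```

The values agree with the table's Mean column to the precision reported there (e.g. D gives 0.4375 and E gives 0.3889).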
Discretized supply base management data set.
| S | a1 | a2 | a3 | a4 | a5 | a6 | a7 | a8 | a9 | a10 | a11 | Decision |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 2 | 3 | 3 | 4 | 2 | 1 | 1 | 1 | 1 | 1 | I |
| 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | E |
| 3 | 1 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 4 | 1 | 3 | E |
| 4 | 5 | 2 | 2 | 2 | 3 | 4 | 5 | 2 | 4 | 1 | 4 | E |
| 5 | 5 | 2 | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 1 | 2 | I |
| 6 | 3 | 2 | 2 | 3 | 3 | 3 | 3 | 3 | 4 | 2 | 3 | E |
| 7 | 2 | 1 | 3 | 2 | 4 | 3 | 4 | 2 | 2 | 2 | 3 | E |
| 8 | 5 | 2 | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 1 | 2 | I |
| 9 | 5 | 2 | 3 | 3 | 4 | 4 | 1 | 1 | 1 | 1 | 1 | I |
| 10 | 1 | 2 | 1 | 1 | 1 | 1 | 5 | 1 | 4 | 1 | 3 | E |
| 11 | 2 | 1 | 1 | 2 | 3 | 2 | 1 | 1 | 3 | 1 | 2 | I |
| 12 | 1 | 1 | 3 | 3 | 3 | 3 | 3 | 3 | 2 | 1 | 2 | E |
| 13 | 5 | 2 | 3 | 3 | 4 | 4 | 1 | 2 | 3 | 1 | 3 | I |
| 14 | 4 | 2 | 3 | 3 | 4 | 4 | 2 | 1 | 2 | 1 | 2 | I |
| 15 | 4 | 2 | 2 | 3 | 1 | 1 | 4 | 3 | 4 | 2 | 3 | E |
| 16 | 3 | 2 | 3 | 3 | 2 | 2 | 3 | 1 | 1 | 1 | 1 | I |
| 17 | 5 | 2 | 3 | 3 | 4 | 3 | 3 | 1 | 2 | 1 | 1 | I |
| 18 | 5 | 2 | 3 | 3 | 4 | 4 | 4 | 1 | 1 | 1 | 1 | I |
| 19 | 4 | 2 | 3 | 1 | 4 | 3 | 4 | 1 | 4 | 1 | 3 | I |
| 20 | 4 | 2 | 3 | 3 | 1 | 4 | 2 | 2 | 4 | 1 | 4 | E |
| 21 | 5 | 2 | 3 | 2 | 4 | 4 | 2 | 1 | 3 | 1 | 2 | I |
| 22 | 5 | 2 | 2 | 3 | 3 | 4 | 5 | 3 | 4 | 2 | 4 | E |
| 23 | 4 | 2 | 3 | 3 | 2 | 3 | 4 | 3 | 3 | 2 | 4 | E |
Time complexity of all techniques: response time in milliseconds.
| Data Set | MDA | MSA | ITDR | RPA |
|---|---|---|---|---|
| | 0 | 0 | 0 | |
| | 20 | 595 | 17 | |
| | 6 | 116 | 8 | |
| | 31598 | 658068 | 815 | |
| | 2 | 27 | 5 | |
| | 4 | 72 | 3 | |
| | 1 | 15 | 3 | |
Iterative complexity of all techniques: minimum iterations.
| Data Set | MDA | MSA | ITDR | RPA |
|---|---|---|---|---|
| | 80 | 147 | 49 | |
| | 3519 | 12138 | 367 | |
| | 4381 | 23461 | 1201 | |
| | 781127 | 3892358 | 5181 | |
| | 624 | 1404 | 301 | |
| | 1397 | 4218 | 239 | |
| | 660 | 2779 | 1373 | |
Comparative performance of techniques for all data sets.
| Measure | Technique | Balloons | Car Evaluation | Zoo | Chess | Balance Scale | Monks Problem | SBM |
|---|---|---|---|---|---|---|---|---|
| Purity | MDA | 0.6 | 0.7 | 0.59 | 0.52 | 0.64 | 0.5 | 0.61 |
| | MSA | 0.6 | 0.7 | 0.4 | 0.54 | 0.64 | 0.5 | 0.74 |
| | ITDR | 0.6 | 0.7 | 0.5 | 0.52 | 0.64 | 0.5 | 0.74 |
| Entropy | MDA | 0.29 | 0.33 | 0.5 | 0.3 | 0.36 | 0.3 | 0.29 |
| | MSA | 0.29 | 0.33 | 0.7 | 0.3 | 0.36 | 0.3 | 0.23 |
| | ITDR | 0.29 | 0.3 | 0.48 | 0.3 | 0.36 | 0.3 | 0.22 |
| Accuracy | MDA | 0.47 | 0.48 | 0.55 | 0.5 | 0.6 | 0.5 | 0.5 |
| | MSA | 0.47 | 0.48 | 0.5 | 0.5 | 0.6 | 0.5 | 0.55 |
| | ITDR | 0.47 | 0.5 | 0.69 | 0.5 | 0.6 | 0.5 | 0.6 |
| | MDA | 0 | 0 | 0.1 | 0 | 0 | 0 | 0 |
| | MSA | 0 | 0 | 0.1 | 0 | 0 | 0 | 0.1 |
| | ITDR | 0.1 | 0 | 0.1 | 0 | 0 | 0 | 0.11 |
Average percentage improvement of time by RPA technique.
| | MDA | MSA | ITDR | RPA |
|---|---|---|---|---|
| Average time (ms) | 4518.714 | 94127.57 | 121.5714 | 118.1429 |
| Improvement by RPA | 97.40% | 99.87% | 2.82% | |
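The improvement row follows from the averages as (technique − RPA) / technique; a quick check (ours, not the paper's code) reproduces the reported figures to within rounding:

```python
# Percentage time improvement of RPA over each baseline technique,
# computed from the average response times in the table above.
avg = {"MDA": 4518.714, "MSA": 94127.57, "ITDR": 121.5714, "RPA": 118.1429}
for t in ("MDA", "MSA", "ITDR"):
    pct = 100 * (avg[t] - avg["RPA"]) / avg[t]
    print(f"{t}: {pct:.2f}%")
# MDA 97.39% (reported 97.40%), MSA 99.87%, ITDR 2.82%
```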
Average percentage improvement of iterations by RPA technique.
| | MDA | MSA | ITDR | RPA |
|---|---|---|---|---|
| Average iterations | 113112.6 | 562357.9 | 1244.429 | 621.4286 |
| Improvement by RPA | 99.45% | 99.88% | 50.06% | |
Average percentage improvement of Purity by RPA technique.
| Technique | Average Purity | Improvement by RPA |
|---|---|---|
| MDA | 0.59 | 11.14% |
| MSA | 0.59 | 11.14% |
| ITDR | 0.60 | 9.30% |
| RPA | 0.655714 | |
Average percentage improvement of accuracy by RPA technique.
| Technique | Average Accuracy | Improvement by RPA |
|---|---|---|
| MDA | 0.5143 | 14.72% |
| MSA | 0.5143 | 14.72% |
| ITDR | 0.5514 | 7% |
| RPA | 0.59 | |
Overall percentage improvement by RPA technique.
| Measure | Overall Improvement by RPA |
|---|---|
| Time | 66.70% |
| Iterations | 83.13% |
| Purity | 10.53% |
| Entropy | 14% |
| Accuracy | 12.15% |
Average percentage improvement of entropy by RPA technique.
| Technique | Average Entropy | Improvement by RPA |
|---|---|---|
| MDA | 0.338571 | 13.92% |
| MSA | 0.358571 | 18.72% |
| ITDR | 0.321429 | 9.33% |
| RPA | 0.291429 | |
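Each figure in the overall-improvement table appears to be the arithmetic mean of the three per-technique improvements (over MDA, MSA, and ITDR) reported in the tables above; a quick check (ours, not the paper's code):

```python
# Overall improvement per measure as the mean of the per-technique
# improvements reported for MDA, MSA, and ITDR respectively.
per_technique = {
    "Time":       [97.40, 99.87, 2.82],
    "Iterations": [99.45, 99.88, 50.06],
    "Purity":     [11.14, 11.14, 9.30],
    "Entropy":    [13.92, 18.72, 9.33],
    "Accuracy":   [14.72, 14.72, 7.00],
}
for measure, vals in per_technique.items():
    print(f"{measure}: {sum(vals) / 3:.2f}%")
# Time 66.70, Iterations 83.13, Purity 10.53, Entropy 13.99 (reported 14%), Accuracy 12.15
```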