Abstract
Effective data mining methods are vital for discovering knowledge in the databases that have resulted from the enormous growth of data. Various data mining techniques are applied to obtain knowledge from these databases. Data clustering is one such descriptive technique; it partitions data objects into disjoint segments. K-means is a versatile algorithm among the approaches used in data clustering, but the algorithm and its various adaptations suffer from certain performance problems. To overcome these issues, this paper proposes an improved algorithm for data clustering. Its distinguishing features are discretizing the dataset, which improves clustering accuracy, and adopting the binary search initialization method to generate cluster centroids. The generated centroids are fed as input to K-means, which iteratively assigns the data objects to their respective clusters. The clustered results are measured for accuracy and validity. Experiments on datasets from the UC Irvine Machine Learning Repository show that the proposed approach achieves higher accuracy and validity than the other two approaches, namely, simple K-means and the Binary Search method. Thus, the results indicate that discretization improves the efficacy of descriptive data mining tasks.
Year: 2015 PMID: 26543895 PMCID: PMC4620246 DOI: 10.1155/2015/180749
Source DB: PubMed Journal: ScientificWorldJournal ISSN: 1537-744X
Algorithm 1: Steps in discretization.
Algorithm 2: Identifying initial centroids.
Algorithm 3: K-means clustering.
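The abstract states that pre-computed centroids are fed to K-means, which then iteratively assigns data objects to clusters. The steps of Algorithm 3 are not reproduced in this record, so the following is only a minimal sketch of the standard Lloyd-style iteration under that description. The function name `kmeans`, its parameters, and the `max_iter` cap are illustrative assumptions, not the authors' implementation, and the centroids passed in stand in for whatever the initialization phase produces.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd-style K-means that starts from caller-supplied centroids.

    Hypothetical helper: names and the max_iter cap are assumptions.
    """
    centroids = [list(c) for c in centroids]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

Called as `kmeans(data, initial_centroids)`, the function returns one cluster label per point together with the final centroids. Because the starting centroids are an argument rather than a random draw, the result is deterministic for a given initialization.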
Accuracy of proposed method.
| Dataset | Proposed approach: Eliminating | Proposed approach: Incorporating |
|---|---|---|
| Iris | 0.75 | |
| Wine | 0.61 | |
| Cancer | 0.68 | |
| Vowel | 0.72 | |
Validity of proposed method.
| Dataset | Proposed approach: Eliminating | Proposed approach: Incorporating |
|---|---|---|
| Iris | 0.40 | |
| Wine | 0.25 | |
| Cancer | 0.29 | |
| Vowel | 0.73 | |
Comparative analysis of the algorithms based on accuracy and DB index.
| Dataset | Method | Accuracy | DB index |
|---|---|---|---|
| Iris | Simple K-means | 0.69 | 0.43 |
| | Binary Search | 0.75 | 0.4 |
| | Proposed method | | |
| Wine | Simple K-means | 0.58 | 0.26 |
| | Binary Search | 0.61 | 0.25 |
| | Proposed method | | |
| Cancer | Simple K-means | 0.6 | 0.33 |
| | Binary Search | 0.68 | 0.29 |
| | Proposed method | | |
| Vowel | Simple K-means | 0.65 | 0.82 |
| | Binary Search | 0.72 | 0.73 |
| | Proposed method | | |
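The DB index column reports the Davies-Bouldin validity measure. This record does not show how the authors computed it, so the following is a minimal sketch of the standard definition, assuming Euclidean distance and cluster means as centroids. The function name `davies_bouldin` is illustrative. Under this definition, lower values indicate more compact and better separated clusters.

```python
import math


def davies_bouldin(points, labels):
    """Standard Davies-Bouldin index (hypothetical helper name).

    Assumes Euclidean distance and that no two cluster centroids coincide.
    """
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid of each cluster: the per-dimension mean of its members.
    centroids = {
        k: [sum(col) / len(m) for col in zip(*m)] for k, m in clusters.items()
    }
    # Scatter of each cluster: mean distance from its members to its centroid.
    scatter = {
        k: sum(math.dist(p, centroids[k]) for p in m) / len(m)
        for k, m in clusters.items()
    }

    # For each cluster, take its worst (largest) similarity ratio to any
    # other cluster, then average those worst cases over all clusters.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)
```

For two clusters with unit scatter whose centroids lie 10 units apart, the ratio is (1 + 1) / 10, so the index is 0.2.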
Objective contentment.
| Objectives | Contentment level | Rationale |
|---|---|---|
| To adapt simple structures in representation | High | Simple yet powerful phases in the framework |
| To develop a methodology which is effortless and easy to implement | High | |
| To provide robust and trustworthy approach | High | |
| To produce accurate clusters | High | |
| To generate clusters quickly | Medium | Execution time is high due to discretization process |
Figure 1: Proposed framework.
Figure 2: Discretization framework.
Figure 3: Process of discretization.
Figure 4: Accuracy of proposed method with and without Phase I.
Figure 5: Validity of proposed method with and without Phase I.
Figure 6: Comparative analysis of the algorithms based on accuracy and DB index.