Yong Gyu Jung, Min Soo Kang, Jun Heo.
Abstract
Clustering is an important data mining technique that groups data into categories according to similar features. Unlike classification, clustering is unsupervised. Two representative clustering algorithms are K-means and expectation maximization (EM). Logistic regression extends linear regression analysis to a categorical dependent variable: it uses a linear combination of independent variables to predict, statistically, the probability that an event occurs. However, classifying all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, logistic regression analysis is applied to clusters produced by EM and by K-means for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.
Keywords: EM; K-means; logistic regression
Year: 2014 PMID: 26019610 PMCID: PMC4433949 DOI: 10.1080/13102818.2014.949045
Source DB: PubMed Journal: Biotechnol Biotechnol Equip ISSN: 1310-2818 Impact factor: 1.632
Figure 1. Part of experimental data.
Attributes of experimental data.
| No. | Attribute | Type | Range |
|---|---|---|---|
| 1 | Colour | Binomial | 0: red, 1: white |
| 2 | Fixed acidity | Numeric | [3.80, 15.90] |
| 3 | Volatile acidity | Numeric | [0.08, 1.58] |
| 4 | Citric acid | Numeric | [0.00, 1.66] |
| 5 | Residual sugar | Numeric | [0.60, 65.80] |
| 6 | Chlorides | Numeric | [0.01, 0.61] |
| 7 | Free sulphur dioxide | Numeric | [1.00, 289.00] |
| 8 | Total sulphur dioxide | Numeric | [6.00, 444.00] |
| 9 | Density | Numeric | [0.99, 1.04] |
| 10 | pH | Numeric | [2.72, 4.01] |
| 11 | Sulphates | Numeric | [0.22, 2.00] |
| 12 | Alcohol | Numeric | [8.00, 14.90] |
| 13 | Quality | Nominal | 0 (very bad) to 10 (excellent) |
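
The attribute table above can be read as a simple schema. The following is a minimal sketch, under stated assumptions, of checking one record against those ranges before clustering. The snake_case attribute names, the `out_of_range` helper and the sample record are hypothetical and introduced only for illustration. The bounds are the rounded values printed in the table, so a real record lying just outside a rounded bound could be flagged spuriously.

```python
# Numeric ranges copied from the attribute table above.
# The snake_case keys are hypothetical names chosen for this sketch.
NUMERIC_RANGES = {
    "fixed_acidity": (3.80, 15.90),
    "volatile_acidity": (0.08, 1.58),
    "citric_acid": (0.00, 1.66),
    "residual_sugar": (0.60, 65.80),
    "chlorides": (0.01, 0.61),
    "free_sulphur_dioxide": (1.00, 289.00),
    "total_sulphur_dioxide": (6.00, 444.00),
    "density": (0.99, 1.04),
    "ph": (2.72, 4.01),
    "sulphates": (0.22, 2.00),
    "alcohol": (8.00, 14.90),
}


def out_of_range(record):
    """Return the names of attributes that are missing or outside the table's ranges."""
    problems = []
    if record.get("colour") not in (0, 1):  # 0: red, 1: white
        problems.append("colour")
    for name, (low, high) in NUMERIC_RANGES.items():
        value = record.get(name)
        if value is None or not (low <= value <= high):
            problems.append(name)
    quality = record.get("quality")
    if not isinstance(quality, int) or not (0 <= quality <= 10):
        problems.append("quality")
    return problems


# Hypothetical record, invented for illustration (not taken from the data set).
sample = {
    "colour": 0, "fixed_acidity": 7.4, "volatile_acidity": 0.70,
    "citric_acid": 0.0, "residual_sugar": 1.9, "chlorides": 0.08,
    "free_sulphur_dioxide": 11.0, "total_sulphur_dioxide": 34.0,
    "density": 1.00, "ph": 3.51, "sulphates": 0.56, "alcohol": 9.4,
    "quality": 5,
}

print(out_of_range(sample))                  # []
print(out_of_range({**sample, "ph": 5.0}))   # ['ph']
```
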
Figure 2. K-means experimental result.
Figure 3. EM cluster experimental results.
Experimental results applying K-means.
| Instance classification | Percentage |
|---|---|
| Correctly classified instances | 94.7467% |
| Incorrectly classified instances | 5.2533% |
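
As a rough illustration of how a "correctly classified instances" figure like the one in the table above can be obtained, the sketch below clusters points with K-means, fits a logistic regression model to predict the cluster label, and reports the share of instances whose prediction matches the cluster. It assumes this is what the percentages measure, which the text here does not spell out. The data are synthetic toy points, not the wine data, and `kmeans`, `fit_logistic` and `predict` are hypothetical helpers written for this example. The deterministic first-k initialisation and plain batch gradient descent stand in for whatever tooling the authors actually used.

```python
import math


def kmeans(points, k, iters=10):
    """Lloyd's K-means. Initial centroids are the first k points, a
    simplification that keeps this sketch deterministic."""
    centroids = list(points[:k])
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def fit_logistic(points, labels, lr=0.1, epochs=1000):
    """Binary logistic regression fitted by batch gradient descent."""
    w = [0.0] * len(points[0])
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(points, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Predict class 1 when the linear score is positive (probability above 0.5)."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


# Synthetic toy data: two well-separated 5x5 grids of 2-D points.
blob_a = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
blob_b = [(3 + 0.1 * i, 3 + 0.1 * j) for i in range(5) for j in range(5)]
points = blob_a + blob_b

labels = kmeans(points, k=2)            # step 1: unsupervised cluster labels
w, b = fit_logistic(points, labels)     # step 2: model that predicts the cluster
correct = sum(predict(w, b, x) == y for x, y in zip(points, labels))
accuracy = correct / len(points)
print(f"Correctly classified instances: {100 * accuracy:.4f}%")
```

On real data the clusters overlap, so the logistic model will generally not reproduce every cluster assignment, and evaluating on held-out instances rather than the training points would give a fairer figure.
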
Experimental results applying EM.
| Instance classification | Percentage |
|---|---|
| Correctly classified instances | 59.7874% |
| Incorrectly classified instances | 40.2126% |
Fitted logistic classification results using EM.
| Instance classification | Percentage |
|---|---|
| Correctly classified instances | 87.4296% |
| Incorrectly classified instances | 12.5704% |