Yiyan Zhang¹, Yi Xin¹, Qin Li², Jianshe Ma¹, Shuai Li¹, Xiaodan Lv¹, Weiqi Lv¹.
Abstract
BACKGROUND: New data mining algorithms continue to be proposed as related disciplines develop. These algorithms differ in their applicable scope and in their performance. Hence, finding a suitable algorithm for a given dataset is becoming an important concern for biomedical researchers who need to solve practical problems promptly.
Keywords: Applicability of algorithm; Characters of datasets; Classification task; Data mining
Year: 2017 PMID: 29096638 PMCID: PMC5668968 DOI: 10.1186/s12938-017-0416-x
Source DB: PubMed Journal: Biomed Eng Online ISSN: 1475-925X Impact factor: 2.819
Fig. 1 A general view of the work and application scenarios
Profile of research data sets
| Name of dataset | Sample size | Number of attributes | Missing values? | Task | Area |
|---|---|---|---|---|---|
| Iris | 150 | 4 | No | Multi-class | Life |
| Adultᵃ | 32,561 | 13 | Yes | Binary-class | Social |
| Wine | 178 | 13 | No | Multi-class | Physical |
| Car evaluation | 1728 | 6 | No | Multi-class | – |
| Breast cancer Wisconsinᵃ | 699 | 9 | Yes | Binary-class | Life |
| Wdbcᵃ | 569 | 30 | No | Binary-class | Life |
| Wpbcᵃ | 198 | 31 | Yes | Binary-class | Life |
| Abalone | 4177 | 8 | No | Multi-class | Life |
| Wine quality_redᵃ | 1599 | 11 | No | Multi-class | Business |
| Wine quality_whiteᵃ | 4898 | 11 | No | Multi-class | Business |
| Heart diseaseᵃ | 303 | 13 | Yes | Multi-class | Life |
| Poker handᵃ | 25,010 | 10 | No | Multi-class | Game |
ᵃ The dataset ‘Adult’ is a subset of the database ‘Adult Data Set’. The datasets ‘Breast cancer Wisconsin’, ‘Wdbc’ and ‘Wpbc’ are three subsets of the same database, ‘Breast Cancer Wisconsin (diagnostic) data set’. The datasets ‘Wine quality_red’ and ‘Wine quality_white’ belong to the same database, ‘Wine Quality Data Set’. Because of data quality limitations, the subsets ‘processed.cleveland’ and ‘poker-hand-training-true’ were selected to represent the databases ‘Heart Disease Data Set’ and ‘Poker hand data set’, respectively
Correlation coefficients between variables in ‘wine quality_red’ data set
| | Fixed.acidity | Volatile.acidity | Citric.acid | Residual.sugar | Chlorides | Free.sulfur.dioxide | Total.sulfur.dioxide | Density | pH | Sulphates | Alcohol | Quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fixed.acidity | 1 | – | – | – | – | – | – | – | – | – | – | – |
| Volatile.acidity | −0.2561 | 1 | – | – | – | – | – | – | – | – | – | – |
| Citric.acid | 0.6717 | −0.5525 | 1 | – | – | – | – | – | – | – | – | – |
| Residual.sugar | 0.1148 | 0.0019 | 0.1436 | 1 | – | – | – | – | – | – | – | – |
| Chlorides | 0.0937 | 0.0613 | 0.2038 | 0.0556 | 1 | – | – | – | – | – | – | – |
| Free.sulfur.dioxide | −0.1538 | −0.0105 | −0.061 | 0.187 | 0.0056 | 1 | – | – | – | – | – | – |
| Total.sulfur.dioxide | −0.1132 | 0.0765 | 0.0355 | 0.203 | 0.0474 | 0.6677 | 1 | – | – | – | – | – |
| Density | 0.668 | 0.022 | 0.3649 | 0.3553 | 0.2006 | −0.0219 | 0.0713 | 1 | – | – | – | – |
| pH | −0.683 | 0.2349 | −0.5419 | −0.0857 | −0.265 | 0.0704 | −0.0665 | −0.3417 | 1 | – | – | – |
| Sulphates | 0.183 | −0.261 | 0.9128 | 0.0055 | 0.3713 | −0.0517 | 0.0429 | 0.1485 | −0.1966 | 1 | – | – |
| Alcohol | −0.0617 | −0.2023 | 0.1099 | 0.0421 | −0.2211 | −0.0694 | −0.2057 | −0.4962 | 0.2056 | 0.0936 | 1 | – |
| Quality | 0.1241 | −0.3906 | 0.2264 | 0.0137 | −0.1289 | −0.0507 | −0.1851 | −0.1749 | −0.0577 | 0.2514 | 0.4762 | 1 |
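The table above is the lower triangle of a pairwise correlation matrix. The record does not name the coefficient used; assuming it is the Pearson coefficient, the following is a minimal sketch of how such a triangle could be computed. The helper names (`pearson`, `lower_triangle`) and the toy columns are hypothetical and are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Note: a constant column (zero variance) would raise ZeroDivisionError.
    return cov / math.sqrt(var_x * var_y)

def lower_triangle(columns):
    """Return {(row_name, col_name): r} for every pair where the row
    variable comes after the column variable, as in the table layout."""
    names = list(columns)
    result = {}
    for i, row in enumerate(names):
        for col in names[:i]:
            result[(row, col)] = pearson(columns[row], columns[col])
    return result

# Hypothetical toy data, NOT values from 'Wine quality_red'.
toy = {
    "acidity": [7.4, 7.8, 11.2, 7.9],
    "pH": [3.51, 3.20, 3.16, 3.30],
    "quality": [5, 5, 6, 7],
}
triangle = lower_triangle(toy)
for (row, col), r in triangle.items():
    print(f"{row} vs {col}: {r:.4f}")
```

Only the cells below the diagonal are computed because the matrix is symmetric and its diagonal is always 1.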
Quantification of the characteristics of ‘Wine quality_red’ dataset
| Quantification index | Values |
|---|---|
| Sample size | 1599 |
| Number of attributes | 11 |
| Number of missing values | 0 |
| Number of classes | 6 |
| Sample size of the largest class | 681 |
| Sample size of the smallest class | 10 |
| Correlation coefficient 1ᵃ | 0.4762 |
| Correlation coefficient 2ᵃ | −0.6830 |
| Class entropy of task variable | 0.5145 |
| Ratio of sample size of the largest class to the smallest class | 68.10 |
ᵃ Correlation coefficient 1 is the maximum correlation coefficient between the task variable and the non-task attribute variables; correlation coefficient 2 is the maximum correlation coefficient between any pair of non-task attribute variables
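The indices in this table can be illustrated with a short sketch. Two points are assumptions rather than statements from the record: the class entropy is taken to use a base-10 logarithm (this choice reproduces the 0.4771 reported for Iris, which has three classes and a balance of 1), and the "maximum" correlation is taken to mean the coefficient of largest magnitude, since negative values are reported. The function names are hypothetical.

```python
import math
from collections import Counter

def class_entropy(labels, base=10):
    """Shannon entropy of the class label distribution (base 10 assumed)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum(
        (c / total) * (math.log(c / total) / math.log(base))
        for c in counts.values()
    )

def balance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

def largest_magnitude(values):
    """Return the value with the largest absolute value, keeping its sign."""
    return max(values, key=abs)

# Iris: 150 samples in 3 equally sized classes (balance 1 in the record).
iris = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
print(round(class_entropy(iris), 4))  # 0.4771
print(balance_ratio(iris))            # 1.0

# 'Quality' row of the correlation table (task variable vs. each attribute).
quality_row = [0.1241, -0.3906, 0.2264, 0.0137, -0.1289, -0.0507,
               -0.1851, -0.1749, -0.0577, 0.2514, 0.4762]
print(largest_magnitude(quality_row))  # 0.4762
```

Applying `largest_magnitude` to the correlations between pairs of non-task attributes would give the second correlation index in the same way.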
Performance evaluation of the algorithms applied to ‘Wine quality_red’ dataset
| Algorithm | Accuracy | Sensitivity | Sensitivity | Specificity | Running time (s) | Memory usage (M) |
|---|---|---|---|---|---|---|
| C4.5 | 0.9099 | 0.8000 | 0.9266 | 0.9956 | 0.15 | 0.02 |
| SVM | 0.6717 | 0 | 0.8062 | 1.0000 | 0.79 | 0.53 |
| AdaBoost | 0.6629 | 0 | 0.7871 | 1.0000 | 34.02 | 11.33 |
| kNN | 0.8705 | 0.7000 | 0.9178 | 1.0000 | 0.11 | 0.39 |
| Naïve Bayes | 0.5604 | 0.3000 | 0.6696 | 0.9975 | 0.00 | 0.01 |
| Random forest | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.42 | 10.33 |
| Logistic regression | 0.6079 | 0.2000 | 0.7518 | 0.9981 | 0.23 | 0.34 |
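For a multi-class task such as this one, sensitivity and specificity are commonly computed per class in a one-vs-rest fashion from a confusion matrix. The record does not state how the authors derived or aggregated the values in the table, so the following is only a sketch of the standard definitions; the function name and the 3-class confusion matrix are hypothetical.

```python
def one_vs_rest_metrics(confusion):
    """Compute overall accuracy and per-class (sensitivity, specificity).

    confusion[i][j] is the number of samples whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n)) / total
    per_class = []
    for k in range(n):
        tp = confusion[k][k]                              # class k, predicted k
        fn = sum(confusion[k]) - tp                       # class k, predicted other
        fp = sum(confusion[i][k] for i in range(n)) - tp  # other, predicted k
        tn = total - tp - fn - fp                         # other, predicted other
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        per_class.append((sensitivity, specificity))
    return accuracy, per_class

# Hypothetical confusion matrix, NOT results from the record.
confusion = [
    [8, 2, 0],
    [1, 15, 4],
    [0, 5, 15],
]
accuracy, per_class = one_vs_rest_metrics(confusion)
print(f"accuracy = {accuracy:.4f}")
for k, (sens, spec) in enumerate(per_class):
    print(f"class {k}: sensitivity = {sens:.4f}, specificity = {spec:.4f}")
```

A classifier that never predicts a rare class gets a sensitivity of 0 and a specificity of 1 for that class, which matches the pattern of the 0 and 1.0000 entries in the SVM and AdaBoost rows.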
Fig. 2 Evaluation and rank of algorithms on the ‘Wine quality_red’ data set
Characteristic quantification values and performance assessment of algorithms applied to the 12 research datasets
| Dataset | Sample size | Number of attributes | Number of classes | Cor1 | Cor2 | Class entropy | Balance | Well-performing algorithms (ranked) |
|---|---|---|---|---|---|---|---|---|
| Iris | 150 | 4 | 3 | 0.9565 | 0.9629 | 0.4771 | 1 | Ensemble, single classifier |
| Adult | 30,162 | 13 | 2 | 0.3353 | −0.5849 | 0.2437 | 3.017 | Ensemble, C4.5 |
| Wine | 178 | 13 | 3 | −0.8475 | 0.8646 | 0.4717 | 1.479 | Ensemble, LR, SVM, other |
| Car evaluation | 1728 | 6 | 4 | 0.4393 | 0 | 0.3630 | 18.62 | Ensemble, C4.5, kNNᵃ |
| Breast cancer Wisconsin | 683 | 9 | 2 | 0.8227 | 0.9072 | 0.2812 | 1.858 | Ensemble, kNN, C4.5, SVM |
| Wdbc | 569 | 30 | 2 | 0.7936 | 0.9979 | 0.2868 | 1.684 | Ensemble, LR, C4.5, kNN, SVM |
| Wpbc | 194 | 31 | 2 | −0.3460 | 0.9959 | 0.2379 | 3.217 | Ensemble, C4.5, kNNᵃ |
| Abalone | 4177 | 8 | 28 | 0.6276 | 0.9868 | 1.084 | 689 | RF, kNN, C4.5 |
| Wine quality_red | 1599 | 11 | 6 | 0.4762 | −0.6830 | 0.5145 | 68.1 | RF, C4.5, kNN |
| Wine quality_white | 4898 | 11 | 7 | 0.4356 | 0.8390 | 0.5604 | 439.6 | RFᵇ, C4.5, kNN |
| Heart disease | 297 | 13 | 5 | 0.5212 | 0.5790 | 0.5577 | 12.31 | RF, kNN, AB, C4.5 |
| Poker hand | 25,010 | 10 | 10 | 0.0102 | −0.0303 | 0.4277 | 2498.6 | kNN, C4.5 |
‘Other’ in the last column means the remaining algorithms not already listed for that dataset
ᵃ kNN has higher sensitivity on a certain class, that is, kNN is more accurate when predicting that class
ᵇ RF used a large amount of memory, so 2000 instances were randomly sampled as the training set; with this sample, RF showed high classification accuracy and acceptable running speed
Summary of recommended algorithms for datasets with different characteristics
| Characteristic of dataset | NB | LR | kNN | C4.5 | SVM | AB | RF | Representative datasets |
|---|---|---|---|---|---|---|---|---|
| Small sample size | √ | √ | √ | √ | Iris, wine | |||
| High correlation | √ | √ | Iris, wine | |||||
| Binary-class task | √ | √ | √ | Breast cancer Wisconsin, Wdbc | ||||
| Balanced data | √ | √ | √ | Wine, breast cancer Wisconsin, Wdbc | ||||
| Multi-class task | √ | √ | √ | Abalone, wine quality_red | ||||
| Imbalanced data | √ | √ | √ | Wine quality_white | ||||
| Large sample size | √ | √ | Adult, poker hand | |||||
| Low correlation | √ | √ | √ | √ | Car evaluation, Wpbc, heart disease |