Randal S Olson, William La Cava, Patryk Orzechowski, Ryan J Urbanowicz, Jason H Moore.
Abstract
BACKGROUND: The selection, development, or comparison of machine learning methods in data mining can be a difficult task based on the target problem and goals of a particular study. Numerous publicly available real-world and simulated benchmark datasets have emerged from different sources, but their organization and adoption as standards have been inconsistent. As such, selecting and curating specific benchmarks remains an unnecessary burden on machine learning practitioners and data scientists.
Keywords: Benchmarking; Data repository; Machine learning; Model evaluation
Year: 2017 PMID: 29238404 PMCID: PMC5725843 DOI: 10.1186/s13040-017-0154-4
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
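The PMLB suite described in the paper is distributed with a Python interface. A minimal sketch of pulling one benchmark dataset, assuming the pmlb package is installed (pip install pmlb); the dataset name 'mushroom' and the train/test split are illustrative choices, not prescribed by the paper:

```python
# Minimal sketch: fetch one PMLB benchmark dataset via the pmlb package.
from pmlb import fetch_data, classification_dataset_names
from sklearn.model_selection import train_test_split

# fetch_data returns a pandas DataFrame by default; return_X_y=True
# yields (features, labels) ready for scikit-learn.
X, y = fetch_data('mushroom', return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(len(classification_dataset_names), 'classification datasets available')
print(X_train.shape, X_test.shape)
```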
Fig. 1 Histograms showing the distribution of meta-feature values from the PMLB datasets. Note the log scale of the y axes
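A sketch of the kind of per-dataset meta-features summarized in Fig. 1. The exact meta-feature set, the label-column name 'target', and the class-imbalance formula below are assumptions for illustration, not the authors' published code:

```python
# Compute a few simple dataset meta-features for one PMLB dataset.
import numpy as np
from pmlb import fetch_data

df = fetch_data('mushroom')   # recent pmlb releases name the label column 'target'
y = df['target']

meta = {
    'n_instances': len(df),
    'n_features': df.shape[1] - 1,
    'n_classes': y.nunique(),
    # One simple imbalance measure (an assumption, not the paper's formula):
    # squared deviation of class proportions from a perfectly balanced split.
    'imbalance': float(np.sum((y.value_counts(normalize=True)
                               - 1.0 / y.nunique()) ** 2)),
}
print(meta)
```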
Machine learning algorithms and parameters tuned in the PMLB benchmark
| Machine learning algorithm | Tuned parameters |
|---|---|
| Gaussian Naïve Bayes (NB) | No parameters. |
| Bernoulli Naïve Bayes | alpha: Additive smoothing parameter. |
| | binarize: Threshold for binarizing the features. |
| | fit_prior: Whether or not to learn class prior probabilities. |
| Multinomial Naïve Bayes | alpha: Additive smoothing parameter. |
| | fit_prior: Whether or not to learn class prior probabilities. |
| Logistic regression | C: Regularization strength. |
| | penalty: Whether to use Lasso or Ridge regularization. |
| | fit_intercept: Whether or not the intercept of the linear classifier should be computed. |
| Linear classifier trained via stochastic gradient descent (SGD) | loss: Loss function to be optimized. |
| | penalty: Whether to use Lasso, Ridge, or ElasticNet regularization. |
| | alpha: Regularization strength. |
| | learning_rate: Shrinks the contribution of each successive training update. |
| | fit_intercept: Whether or not the intercept of the linear classifier should be computed. |
| | l1_ratio: Ratio of Lasso vs. Ridge regularization to use. Only used when the ‘penalty’ is ElasticNet. |
| | eta0: Initial learning rate. |
| | power_t: Exponent for inverse scaling of the learning rate. |
| Linear classifier trained via the passive aggressive algorithm | loss: Loss function to be optimized. |
| | C: Maximum step size for regularization. |
| | fit_intercept: Whether or not the intercept of the linear classifier should be computed. |
| Support vector machine for classification (SVC) | kernel: ‘linear’, ‘poly’, ‘sigmoid’, or ‘rbf’. |
| | C: Penalty parameter for regularization. |
| | gamma: Kernel coefficient for the ‘rbf’, ‘poly’, and ‘sigmoid’ kernels. |
| | degree: Degree for the ‘poly’ kernel. |
| | coef0: Independent term in the ‘poly’ and ‘sigmoid’ kernels. |
| K-Nearest Neighbor (KNN) | n_neighbors: Number of neighbors to use. |
| | weights: Function to weight the neighbors’ votes. |
| Decision tree | min_weight_fraction_leaf: The minimum number of (weighted) samples for a node to be considered a leaf. Controls the depth and complexity of the decision tree. |
| | max_features: Number of features to consider when computing the best node split. |
| | criterion: Function used to measure the quality of a split. |
| Random forest & Extra random forest (a.k.a. Extra Trees Classifier) | n_estimators: Number of decision trees in the ensemble. |
| | min_weight_fraction_leaf: The minimum number of (weighted) samples for a node to be considered a leaf. Controls the depth and complexity of the decision trees. |
| | max_features: Number of features to consider when computing the best node split. |
| | criterion: Function used to measure the quality of a split. |
| AdaBoost | n_estimators: Number of decision trees in the ensemble. |
| | learning_rate: Shrinks the contribution of each successive decision tree in the ensemble. |
| Gradient tree boosting | n_estimators: Number of decision trees in the ensemble. |
| | learning_rate: Shrinks the contribution of each successive decision tree in the ensemble. |
| | loss: Loss function to be optimized via gradient boosting. |
| | max_depth: Maximum depth of the decision trees. Controls the complexity of the decision trees. |
| | max_features: Number of features to consider when computing the best node split. |
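A minimal sketch of the grid-search-style tuning the table describes, for one of the 13 models (logistic regression), scored by balanced accuracy as in the paper's experiments. It assumes scikit-learn >= 0.20 for the ‘balanced_accuracy’ scorer; the grid values are illustrative, not the exact ranges the authors searched:

```python
# Tune one listed model's parameters (C, penalty, fit_intercept) by grid
# search, scored with balanced accuracy on a PMLB dataset.
from pmlb import fetch_data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = fetch_data('mushroom', return_X_y=True)

param_grid = {
    'C': [0.01, 0.1, 1.0, 10.0],     # regularization strength (illustrative)
    'penalty': ['l1', 'l2'],         # Lasso vs. Ridge
    'fit_intercept': [True, False],  # whether to fit the intercept
}
search = GridSearchCV(
    LogisticRegression(solver='liblinear'),  # liblinear supports both penalties
    param_grid,
    scoring='balanced_accuracy',
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```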
Fig. 2 Clustered meta-features of datasets in the PMLB projected onto the first two principal component axes (PCA 1 and PCA 2)
Fig. 3 Mean values of each meta-feature within PMLB dataset clusters identified in Fig. 2
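A sketch of the Fig. 2/3 style of analysis: standardize the dataset-by-meta-feature matrix, cluster the datasets, and project onto the first two principal components. The random stand-in matrix, the use of k-means, and the choice of 5 clusters are all assumptions for illustration:

```python
# Cluster datasets by meta-features and project them onto two PCs.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
M = rng.normal(size=(165, 6))  # stand-in: 165 datasets x 6 meta-features

Z = StandardScaler().fit_transform(M)     # put meta-features on a common scale
coords = PCA(n_components=2).fit_transform(Z)   # Fig. 2: 2-D projection
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(Z)

# Fig. 3: per-cluster mean meta-feature values
for k in range(5):
    print(k, Z[labels == k].mean(axis=0).round(2))
```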
Fig. 4 a Biclustering of the 13 ML models and 165 datasets according to the balanced accuracy of the models using their best parameter settings. b Deviation from the mean balanced accuracy across all 13 ML models, highlighting datasets on which all ML methods performed similarly versus those where certain ML methods performed better or worse than others. c Boundaries of the 40 contiguous biclusters formed by crossing the 4 ML-wise clusters with the 10 data-wise clusters
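A sketch of the Fig. 4 computation: given a 13 x 165 matrix of best balanced accuracies (models x datasets), compute each entry's deviation from the per-dataset mean (panel b) and order both axes by hierarchical clustering, cutting into 4 model-wise and 10 dataset-wise clusters (panel c). The random stand-in matrix and the average-linkage setting are assumptions, not the authors' exact procedure:

```python
# Deviation-from-mean matrix and two-way hierarchical ordering.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, leaves_list

rng = np.random.default_rng(1)
acc = rng.uniform(0.5, 1.0, size=(13, 165))  # stand-in accuracy matrix

deviation = acc - acc.mean(axis=0, keepdims=True)  # panel b: per-dataset deviation

model_link = linkage(acc, method='average')    # cluster the 13 models
data_link = linkage(acc.T, method='average')   # cluster the 165 datasets

model_clusters = fcluster(model_link, t=4, criterion='maxclust')   # 4 ML-wise
data_clusters = fcluster(data_link, t=10, criterion='maxclust')    # 10 data-wise

# Heatmap-ready matrix with both axes in dendrogram-leaf order.
row_order, col_order = leaves_list(model_link), leaves_list(data_link)
reordered = deviation[np.ix_(row_order, col_order)]
print(reordered.shape, model_clusters, data_clusters[:10])
```

Sorting the datasets by acc.max(axis=0) then gives the per-dataset ordering used in Fig. 5.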
Fig. 5 Accuracy of the tuned ML models on each dataset across the PMLB suite of problems, sorted by the maximum balanced accuracy obtained for that dataset