James Large, Jason Lines, Anthony Bagnall.
Abstract
Our hypothesis is that building ensembles of small sets of strong classifiers constructed with different learning algorithms is, on average, the best approach to classification for real-world problems. We propose a simple mechanism for building small heterogeneous ensembles based on exponentially weighting the probability estimates of the base classifiers with an estimate of the accuracy formed through cross-validation on the train data. We demonstrate through extensive experimentation that, given the same small set of base classifiers, this method has measurable benefits over commonly used alternative weighting, selection or meta-classifier approaches to heterogeneous ensembles. We also show how an ensemble of five well-known, fast classifiers can produce an ensemble that is not significantly worse than large homogeneous ensembles and tuned individual classifiers on datasets from the UCI archive. We provide evidence that the performance of the cross-validation accuracy weighted probabilistic ensemble (CAWPE) generalises to a completely separate set of datasets, the UCR time series classification archive, and we also demonstrate that our ensemble technique can significantly improve the state-of-the-art classifier for this problem domain. We investigate the performance in more detail, and find that the improvement is most marked in problems with smaller train sets. We perform a sensitivity analysis and an ablation study to demonstrate the robustness of the ensemble and the significant contribution of each design element of the classifier. We conclude that it is, on average, better to ensemble strong classifiers with a weighting scheme rather than perform extensive tuning and that CAWPE is a sensible starting point for combining classifiers.
Keywords: Classification; Ensemble; Heterogeneous; Weighted
Year: 2019 PMID: 31632184 PMCID: PMC6790343 DOI: 10.1007/s10618-019-00638-y
Source DB: PubMed Journal: Data Min Knowl Discov ISSN: 1384-5810 Impact factor: 3.670
Fig. 1 Illustration of the different effects of combination and weighting schemes on a toy instance classification. Each stage progressively pushes the predicted class probabilities further in the correct direction for this prediction.
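The weighting mechanism described in the abstract (probability estimates of each base classifier weighted by an exponentiated cross-validation accuracy) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cawpe_combine`, the toy probabilities and accuracies, and the default exponent `alpha=4.0` are all assumptions made for the example.

```python
def cawpe_combine(probs, cv_accs, alpha=4.0):
    """Combine class-probability vectors from several base classifiers.

    probs   : list of probability vectors, one per base classifier
    cv_accs : cross-validation accuracy estimate of each base classifier
    alpha   : exponent applied to the accuracies (illustrative default,
              an assumption of this sketch)
    """
    n_classes = len(probs[0])
    # Raising accuracies to a power magnifies differences between classifiers.
    weights = [acc ** alpha for acc in cv_accs]
    combined = [0.0] * n_classes
    for w, p in zip(weights, probs):
        for k in range(n_classes):
            combined[k] += w * p[k]
    total = sum(combined)
    # Normalise so the output is again a probability distribution.
    return [c / total for c in combined]


# Toy instance: two weak classifiers lean towards class 0,
# one stronger classifier is confident in class 1.
toy_probs = [[0.6, 0.4], [0.6, 0.4], [0.1, 0.9]]
toy_accs = [0.5, 0.5, 0.9]

weighted = cawpe_combine(toy_probs, toy_accs, alpha=4.0)
unweighted = cawpe_combine(toy_probs, toy_accs, alpha=0.0)  # equal weights
print(weighted, unweighted)
```

With `alpha=0.0` every weight equals 1, so the function reduces to a plain average of the probability estimates; larger values give more influence to the classifiers with higher estimated accuracy.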
The 85 UCR time series classification problems used in the experiments for Sect. 5.5
| Dataset | Atts | Classes | Train | Test | Dataset | Atts | Classes | Train | Test |
|---|---|---|---|---|---|---|---|---|---|
| Adiac | 176 | 37 | 390 | 391 | MedicalImages | 99 | 10 | 381 | 760 |
| ArrowHead | 251 | 3 | 36 | 175 | MidPhalOutAgeGroup | 80 | 3 | 400 | 154 |
| Beef | 470 | 5 | 30 | 30 | MidPhalOutCorrect | 80 | 2 | 600 | 291 |
| BeetleFly | 512 | 2 | 20 | 20 | MiddlePhalanxTW | 80 | 6 | 399 | 154 |
| BirdChicken | 512 | 2 | 20 | 20 | MoteStrain | 84 | 2 | 20 | 1252 |
| Car | 577 | 4 | 60 | 60 | NonInvasiveThorax1 | 750 | 42 | 1800 | 1965 |
| CBF | 128 | 3 | 30 | 900 | NonInvasiveThorax2 | 750 | 42 | 1800 | 1965 |
| ChlorineConcentration | 166 | 3 | 467 | 3840 | OliveOil | 570 | 4 | 30 | 30 |
| CinCECGtorso | 1639 | 4 | 40 | 1380 | OSULeaf | 427 | 6 | 200 | 242 |
| Coffee | 286 | 2 | 28 | 28 | PhalOutCorrect | 80 | 2 | 1800 | 858 |
| Computers | 720 | 2 | 250 | 250 | Phoneme | 1024 | 39 | 214 | 1896 |
| CricketX | 300 | 12 | 390 | 390 | Plane | 144 | 7 | 105 | 105 |
| CricketY | 300 | 12 | 390 | 390 | ProxPhalOutAgeGroup | 80 | 3 | 400 | 205 |
| CricketZ | 300 | 12 | 390 | 390 | ProxPhalOutCorrect | 80 | 2 | 600 | 291 |
| DiatomSizeReduction | 345 | 4 | 16 | 306 | ProximalPhalanxTW | 80 | 6 | 400 | 205 |
| DisPhalOutAgeGroup | 80 | 3 | 400 | 139 | RefrigerationDevices | 720 | 3 | 375 | 375 |
| DisPhalOutCor | 80 | 2 | 600 | 276 | ScreenType | 720 | 3 | 375 | 375 |
| DislPhalTW | 80 | 6 | 400 | 139 | ShapeletSim | 500 | 2 | 20 | 180 |
| Earthquakes | 512 | 2 | 322 | 139 | ShapesAll | 512 | 60 | 600 | 600 |
| ECG200 | 96 | 2 | 100 | 100 | SmallKitchApps | 720 | 3 | 375 | 375 |
| ECG5000 | 140 | 5 | 500 | 4500 | SonyAIBORSurface1 | 70 | 2 | 20 | 601 |
| ECGFiveDays | 136 | 2 | 23 | 861 | SonyAIBORSurface2 | 65 | 2 | 27 | 953 |
| ElectricDevices | 96 | 7 | 8926 | 7711 | StarlightCurves | 1024 | 3 | 1000 | 8236 |
| FaceAll | 131 | 14 | 560 | 1690 | Strawberry | 235 | 2 | 613 | 370 |
| FaceFour | 350 | 4 | 24 | 88 | SwedishLeaf | 128 | 15 | 500 | 625 |
| FacesUCR | 131 | 14 | 200 | 2050 | Symbols | 398 | 6 | 25 | 995 |
| FiftyWords | 270 | 50 | 450 | 455 | SyntheticControl | 60 | 6 | 300 | 300 |
| Fish | 463 | 7 | 175 | 175 | ToeSegmentation1 | 277 | 2 | 40 | 228 |
| FordA | 500 | 2 | 3601 | 1320 | ToeSegmentation2 | 343 | 2 | 36 | 130 |
| FordB | 500 | 2 | 3636 | 810 | Trace | 275 | 4 | 100 | 100 |
| GunPoint | 150 | 2 | 50 | 150 | TwoLeadECG | 82 | 2 | 23 | 1139 |
| Ham | 431 | 2 | 109 | 105 | TwoPatterns | 128 | 4 | 1000 | 4000 |
| HandOutlines | 2709 | 2 | 1000 | 370 | UWaveAll | 945 | 8 | 896 | 3582 |
| Haptics | 1092 | 5 | 155 | 308 | UWaveX | 315 | 8 | 896 | 3582 |
| Herring | 512 | 2 | 64 | 64 | UWaveY | 315 | 8 | 896 | 3582 |
| InlineSkate | 1882 | 7 | 100 | 550 | UWaveZ | 315 | 8 | 896 | 3582 |
| InsectWingbeatSound | 256 | 11 | 220 | 1980 | Wafer | 152 | 2 | 1000 | 6164 |
| ItalyPowerDemand | 24 | 2 | 67 | 1029 | Wine | 234 | 2 | 57 | 54 |
| LargeKitchApps | 720 | 3 | 375 | 375 | WordSynonyms | 270 | 25 | 267 | 638 |
| Lightning2 | 637 | 2 | 60 | 61 | Worms | 900 | 5 | 181 | 77 |
| Lightning7 | 319 | 7 | 70 | 73 | WormsTwoClass | 900 | 2 | 181 | 77 |
| Mallat | 1024 | 8 | 55 | 2345 | Yoga | 426 | 2 | 300 | 3000 |
| Meat | 448 | 3 | 60 | 60 | | | | | |
Experiments were conducted on 30 stratified resamples of each dataset, and all classifiers were aligned on the same folds. Each UCR dataset has an initial default train and test partition that was used for the first experiment; each subsequent experiment was conducted using resamples of the data that preserve the class distributions and sizes of the original train and test partitions.
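A stratified resample of the kind described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `stratified_resample`, the use of the resample index as the random seed, and the per-class rounding are choices made for this example, not details given in the text.

```python
import random
from collections import defaultdict


def stratified_resample(labels, n_train, seed):
    """Split case indices into train/test, keeping class proportions.

    labels  : class label of every case (train and test pooled together)
    n_train : desired train set size (e.g. the size of the default train split)
    seed    : seed for this resample (assumed here to be the resample index)
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    n = len(labels)
    train, test = [], []
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        # Per-class share of the train set; rounding means the total
        # can differ from n_train by a few cases in general.
        k = round(len(idx) * n_train / n)
        train.extend(idx[:k])
        test.extend(idx[k:])
    return sorted(train), sorted(test)


toy_labels = [0] * 6 + [1] * 4
train_idx, test_idx = stratified_resample(toy_labels, n_train=5, seed=0)
print(train_idx, test_idx)
```

Using the same seed for every classifier yields identical partitions, which is one simple way to keep all classifiers aligned on the same folds.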
Raw average scores for error, balanced error, AUC and NLL of the classifiers referenced throughout Sect. 5
| Group (121 UCI datasets) | Classifier | Error | Balanced error | AUC | NLL |
|---|---|---|---|---|---|
| CAWPE | CAWPE-A | 0.174 | 0.243 | 0.893 | 0.651 |
| | CAWPE-S | 0.184 | 0.258 | 0.884 | 0.706 |
| Simple components | C4.5 | 0.23 | 0.301 | 0.736 | 1.161 |
| | Logistic | 0.238 | 0.309 | 0.841 | 8.134 |
| | MLP1 | 0.213 | 0.287 | 0.86 | 1.297 |
| | NN | 0.216 | 0.303 | 0.798 | 1.116 |
| | SVML | 0.229 | 0.306 | 0.849 | 1.073 |
| Advanced components | MLP2 | 0.204 | 0.276 | 0.858 | 1.26 |
| | RandF | 0.185 | 0.259 | 0.886 | 0.713 |
| | RotF | 0.187 | 0.265 | 0.868 | 0.704 |
| | XGBoost | 0.193 | 0.261 | 0.876 | 0.843 |
| | SVMQ | 0.216 | 0.281 | 0.863 | 1.454 |
| Heterogeneous ensembles, simple components | ES-S | 0.19 | 0.266 | 0.813 | 0.884 |
| | MV-S | 0.195 | 0.273 | 0.808 | 0.877 |
| | NBC-S | 0.193 | 0.26 | 0.82 | 0.999 |
| | PB-S | 0.229 | 0.306 | 0.847 | 0.95 |
| | RC-S | 0.195 | 0.288 | 0.811 | 0.912 |
| | SMLR-S | 0.195 | 0.272 | 0.737 | 1.144 |
| | SMLRE-S | 0.214 | 0.288 | 0.734 | 1.251 |
| | SMM5-S | 0.195 | 0.271 | 0.744 | 1.046 |
| | WMV-S | 0.192 | 0.27 | 0.814 | 0.872 |
| Heterogeneous ensembles, advanced components | ES-A | 0.176 | 0.246 | 0.817 | 0.847 |
| | MV-A | 0.176 | 0.249 | 0.815 | 0.833 |
| | NBC-A | 0.183 | 0.249 | 0.821 | 1.031 |
| | PB-A | 0.193 | 0.261 | 0.876 | 0.843 |
| | RC-A | 0.177 | 0.262 | 0.813 | 0.87 |
| | SMLR-A | 0.19 | 0.263 | 0.752 | 1.141 |
| | SMLRE-A | 0.203 | 0.275 | 0.747 | 1.232 |
| | SMM5-A | 0.188 | 0.261 | 0.757 | 1.019 |
| | WMV-A | 0.175 | 0.248 | 0.817 | 0.837 |
| Homogeneous ensembles (RandF and XGBoost repeated) | AdaBoost | 0.353 | 0.469 | 0.775 | 3.258 |
| | Bagging | 0.206 | 0.303 | 0.868 | 0.775 |
| | LogitBoost | 0.241 | 0.302 | 0.836 | 8.246 |
| | RandF | 0.185 | 0.259 | 0.886 | 0.713 |
| | XGBoost | 0.193 | 0.261 | 0.876 | 0.843 |
| Tuned classifiers (on 117 UCI datasets) | TunedMLP | 0.227 | 0.318 | 0.857 | 1.009 |
| | TunedRandF | 0.188 | 0.271 | 0.879 | 0.719 |
| | TunedSVM | 0.188 | 0.255 | 0.857 | 0.955 |
| | TunedXGBoost | 0.194 | 0.267 | 0.869 | 0.86 |
| | CAWPE-T | 0.175 | 0.244 | 0.891 | 0.653 |
Scores are averaged over all datasets and resamples of the UCI and UCR archives respectively, except for the tuned classifiers on the UCI archive, for which the adult, chess-krvk, miniboone and magic datasets were removed due to computational constraints.
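Three of the four reported measures can be computed as in the sketch below. It assumes the usual definitions: balanced error as one minus the mean per-class recall, and NLL as the mean negative log of the probability assigned to the true class. The natural logarithm and the clipping constant `eps` are assumptions of this example, and AUC is left out because its multi-class form needs more machinery.

```python
import math


def error_rate(y_true, y_pred):
    """Fraction of cases whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


def balanced_error(y_true, y_pred):
    """One minus the mean of the per-class recalls."""
    recalls = []
    for c in sorted(set(y_true)):
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(1 for p in preds_for_c if p == c) / len(preds_for_c))
    return 1.0 - sum(recalls) / len(recalls)


def negative_log_likelihood(y_true, probs, eps=1e-15):
    """Mean negative log probability of the true class.

    probs[i][k] is the estimated probability of class k for case i.
    Probabilities are clipped at eps to avoid log(0) (an assumption).
    """
    total = 0.0
    for t, p in zip(y_true, probs):
        total -= math.log(max(p[t], eps))
    return total / len(y_true)


# Toy imbalanced problem: always predicting class 0 looks good on error
# but poor on balanced error.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
print(error_rate(y_true, y_pred), balanced_error(y_true, y_pred))
```

The toy case shows why both measures are reported: a classifier that ignores the minority class has an error of 0.25 but a balanced error of 0.5.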
Fig. 2 Critical difference diagrams for CAWPE-S with its base classifiers (left), and CAWPE-A with its base classifiers (right). Ranks are formed on test set accuracy averaged over 30 resamples.
Fig. 3 Critical difference diagrams for ten heterogeneous ensemble classifiers on the 121 UCI datasets, built using logistic, C4.5, SVML, NN and MLP1 base classifiers. The weighted ensembles are: majority vote (MV); weighted majority vote (WMV); recall (RC); Naive Bayes (NBC); and our scheme (CAWPE). The selection ensembles are: pick best (PB); and ensemble selection (ES). The stacking schemes are: stacking with multi-response linear regression (SMLR); stacking with multi-response linear regression on extended features (SMLRE); and stacking with multi-response model trees (SMM5).
Fig. 4 Critical difference diagrams for ten heterogeneous ensemble classifiers on the 121 UCI datasets, built using random forest (RandF), rotation forest (RotF), support vector machine with a quadratic kernel (SVMQ), a two layer multilayer perceptron (MLP2) and extreme gradient boosting (XGBoost) base classifiers.
Fig. 5 Critical difference diagrams for CAWPE (built using logistic, C4.5, SVML, NN and MLP1 base classifiers) against 5 homogeneous ensemble classifiers on the 121 UCI datasets.
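The critical difference diagrams above are built from the average rank of each classifier across datasets. A minimal sketch of that ranking step is shown below; the significance testing that forms the cliques in such diagrams is not covered. The function name `average_ranks` and the toy accuracies are made up for illustration.

```python
def average_ranks(acc_by_dataset):
    """Average rank of each classifier over datasets (rank 1 = most accurate).

    acc_by_dataset : list of dicts {classifier_name: accuracy}, one per dataset
    Tied classifiers share the mean of the ranks they span.
    """
    names = sorted(acc_by_dataset[0])
    totals = {n: 0.0 for n in names}
    for accs in acc_by_dataset:
        ordered = sorted(names, key=lambda n: -accs[n])
        i = 0
        while i < len(ordered):
            # Extend j to cover every classifier tied with ordered[i].
            j = i
            while j + 1 < len(ordered) and accs[ordered[j + 1]] == accs[ordered[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
            for n in ordered[i:j + 1]:
                totals[n] += shared
            i = j + 1
    return {n: totals[n] / len(acc_by_dataset) for n in names}


# Toy accuracies for three hypothetical classifiers on two datasets.
toy = [
    {"A": 0.9, "B": 0.8, "C": 0.8},
    {"A": 0.7, "B": 0.9, "C": 0.6},
]
print(average_ranks(toy))
```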
Summaries of train times for CAWPE-S and the homogeneous ensembles
| Classifier | CAWPE-S | LogitBoost | RandF | XGBoost | Bagging | AdaBoost |
|---|---|---|---|---|---|---|
| Mean | 524.9 | 302.2 | 111.9 | 46.8 | 22.7 | 7.8 |
| Median | 13.7 | 8.9 | 6.9 | 2.1 | 0.7 | 0.06 |
All times are in seconds, and are averaged across the 121 UCI datasets.
Tuning parameter ranges for SVMRBF, random forest, MLP and XGBoost
| Classifier | Total | Parameter | Range |
|---|---|---|---|
| SVMRBF | 1089 | Regularisation | |
| | | Variance | |
| Random forest | 1000 | Number of trees (10 values) | |
| | | Feature subset size (10 values) | |
| | | Max tree depth (10 values) | |
| MLP | 1024 | Hidden layers (2 values) | |
| | | Nodes per layer (4 values) | |
| | | Learning rate (8 values) | |
| | | Momentum (8 values) | |
| | | Decay (2 values) | |
| XGBoost | 625 | Number of trees (5 values) | |
| | | Learning rate (5 values) | |
| | | Max tree depth (5 values) | |
| | | Min child weight (5 values) | |
c is the number of classes and m the number of attributes
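The totals in the table are consistent with a full grid search, where the number of parameter combinations is the product of the number of values per parameter. The short check below uses only the counts stated in the table; the per-parameter count for SVMRBF is not given there, so the value 33 in the comment is an inference from 1089 = 33 × 33, not a figure from the text.

```python
from math import prod

# Number of candidate values per parameter, as listed in the table.
grids = {
    "Random forest": [10, 10, 10],
    "MLP": [2, 4, 8, 8, 2],
    "XGBoost": [5, 5, 5, 5],
}
reported_totals = {"Random forest": 1000, "MLP": 1024, "XGBoost": 625}

# A full grid evaluates every combination, so its size is the product.
computed = {name: prod(sizes) for name, sizes in grids.items()}
for name in grids:
    print(name, computed[name], computed[name] == reported_totals[name])

# SVMRBF: the table gives only the total (1089) for two parameters.
# 33 values for each parameter would match (assumption, not stated above).
svm_total_if_33_each = 33 * 33
print("SVMRBF", svm_total_if_33_each)
```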
Fig. 6 Average ranked errors for (a) CAWPE-S and (b) CAWPE-A against four tuned classifiers on 117 datasets in the UCI archive. The datasets adult, chess-krvk, miniboone and magic are omitted due to computational constraints.
Fig. 7 Average ranked errors for DTW against (a) CAWPE-S and its components and (b) CAWPE-A and its components on the 85 datasets in the UCR archive.
Fig. 8 Average ranked errors for 4 variants of HIVE-COTE on the UCR datasets.
Fig. 9 Accuracy of (a) CAWPE-S and (b) CAWPE-A versus picking the best component.
Fig. 10 Clustered histograms of accuracy rankings over the 121 UCI datasets for (a) CAWPE-S and (b) CAWPE-A and their respective components. For each classifier, the number of occurrences of each rank being achieved relative to the other classifiers is shown.
CAWPE-S versus pick best split by train set size
| #Train cases | #Problems | #CAWPE-S WINS | Mean error difference (%) |
|---|---|---|---|
| 1001–5000 | 23 | 11 | 0.16 |
| > 5001 | 9 | 2 | 0.02 |
The three datasets with the same average error have been removed (acute-inflammation, acute-nephritis and breast-cancer-wisc-diag). If there is a significant difference within a group (tested using a Wilcoxon signed-rank test), the row is in bold.
Fig. 11 The difference in average errors in increasing order between CAWPE-S and picking the best classifier on each dataset. Significant differences according to paired t-tests over folds are also reported. CAWPE-S is significantly more accurate on 46, the best individual classifier on 18, and there is no significant difference on 57.
Fig. 12 Critical difference diagrams of the stages of progression from a simple majority vote up to CAWPE, on the 121 datasets of the UCI archive using the CAWPE-S variant.
Fig. 13 Four plots of the difference in error between CAWPE (, probs) and WMC (, probs), against different dataset characteristics. Above zero CAWPE wins, below zero WMC wins. Trend represented by solid black line, reported in top-right corner.
Fig. 14 Mean train (squares) and test (triangles) accuracies over the 121 UCI (dashed line) and 85 UCR (solid line) datasets as the alpha parameter changes, expressed as the difference to equal weighting ().
Fig. 15 Critical difference diagrams over test error of CAWPE on the UCI and UCR archives as it stands (alpha ), and against two tuning schemes for the alpha parameter: resolving ties in error estimates randomly (RandTie); and conservatively picking the lowest alpha amongst the ties (ConTie).
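The two tie-resolution schemes named in the Fig. 15 caption can be sketched as below. This assumes alpha is chosen from a fixed list of candidates by minimising an estimated error; the function name `pick_alpha`, the candidate values and the error estimates are hypothetical and serve only to show how the two schemes differ.

```python
import random


def pick_alpha(candidates, est_errors, scheme, rng=None):
    """Choose alpha with the lowest estimated error, resolving ties.

    candidates : candidate alpha values (hypothetical list)
    est_errors : estimated error of the ensemble for each candidate
    scheme     : "ConTie" picks the lowest tied alpha,
                 "RandTie" picks one of the tied alphas at random
    rng        : optional random.Random instance for reproducibility
    """
    best = min(est_errors)
    tied = [a for a, e in zip(candidates, est_errors) if e == best]
    if scheme == "ConTie":
        return min(tied)
    if scheme == "RandTie":
        return (rng or random).choice(tied)
    raise ValueError("unknown scheme: " + scheme)


# Toy example: three candidates tie for the lowest estimated error.
alphas = [1, 2, 4, 8]
errors = [0.20, 0.10, 0.10, 0.10]
print(pick_alpha(alphas, errors, "ConTie"))
print(pick_alpha(alphas, errors, "RandTie", rng=random.Random(0)))
```

Ties of this kind are most likely on small train sets, where only a few distinct error estimates are possible.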
A full list of the UCI datasets used in the experiments in Sects. 5.1–5.4
| Dataset | Atts | Classes | Cases | Dataset | Atts | Classes | Cases |
|---|---|---|---|---|---|---|---|
| abalone | 8 | 3 | 4177 | monks-1 | 6 | 2 | 556 |
| acute-inflammation | 6 | 2 | 120 | monks-2 | 6 | 2 | 601 |
| acute-nephritis | 6 | 2 | 120 | monks-3 | 6 | 2 | 554 |
| adult | 14 | 2 | 48,842 | mushroom | 21 | 2 | 8124 |
| annealing | 31 | 5 | 898 | musk-1 | 166 | 2 | 476 |
| arrhythmia | 262 | 13 | 452 | musk-2 | 166 | 2 | 6598 |
| audiology-std | 59 | 18 | 196 | nursery | 8 | 5 | 12,960 |
| balance-scale | 4 | 3 | 625 | oocytes_m_nucleus_4d | 41 | 2 | 1022 |
| balloons | 4 | 2 | 16 | oocytes_m_states_2f | 25 | 3 | 1022 |
| bank | 16 | 2 | 4521 | oocytes_t_nucleus_2f | 25 | 2 | 912 |
| blood | 4 | 2 | 748 | oocytes_t_states_5b | 32 | 3 | 912 |
| breast-cancer | 9 | 2 | 286 | optical | 62 | 10 | 5620 |
| breast-cancer-w | 9 | 2 | 699 | ozone | 72 | 2 | 2536 |
| breast-cancer-w-diag | 30 | 2 | 569 | page-blocks | 10 | 5 | 5473 |
| breast-cancer-w-prog | 33 | 2 | 198 | parkinsons | 22 | 2 | 195 |
| breast-tissue | 9 | 6 | 106 | pendigits | 16 | 10 | 10,992 |
| car | 6 | 4 | 1728 | pima | 8 | 2 | 768 |
| cardio-10clases | 21 | 10 | 2126 | pit-bri-MATERIAL | 7 | 3 | 106 |
| cardio-3clases | 21 | 3 | 2126 | pit-bri-REL-L | 7 | 3 | 103 |
| chess-krvk | 6 | 18 | 28,056 | pit-bri-SPAN | 7 | 3 | 92 |
| chess-krvkp | 36 | 2 | 3196 | pit-bri-T-OR-D | 7 | 2 | 102 |
| congressional-voting | 16 | 2 | 435 | pit-bridges-TYPE | 7 | 6 | 105 |
| conn-bench-sonar | 60 | 2 | 208 | planning | 12 | 2 | 182 |
| conn-bench-vowel | 11 | 11 | 990 | plant-margin | 64 | 100 | 1600 |
| connect-4 | 42 | 2 | 67,557 | plant-shape | 64 | 100 | 1600 |
| contrac | 9 | 3 | 1473 | plant-texture | 64 | 100 | 1599 |
| credit-approval | 15 | 2 | 690 | post-operative | 8 | 3 | 90 |
| cylinder-bands | 35 | 2 | 512 | primary-tumor | 17 | 15 | 330 |
| dermatology | 34 | 6 | 366 | ringnorm | 20 | 2 | 7400 |
| echocardiogram | 10 | 2 | 131 | seeds | 7 | 3 | 210 |
| ecoli | 7 | 8 | 336 | semeion | 256 | 10 | 1593 |
| energy-y1 | 8 | 3 | 768 | soybean | 35 | 18 | 683 |
| energy-y2 | 8 | 3 | 768 | spambase | 57 | 2 | 4601 |
| fertility | 9 | 2 | 100 | spect | 22 | 2 | 265 |
| flags | 28 | 8 | 194 | spectf | 44 | 2 | 267 |
| glass | 9 | 6 | 214 | statlog-aus-credit | 14 | 2 | 690 |
| haberman-survival | 3 | 2 | 306 | statlog-ger-credit | 24 | 2 | 1000 |
| hayes-roth | 3 | 3 | 160 | statlog-heart | 13 | 2 | 270 |
| heart-cleveland | 13 | 5 | 303 | statlog-image | 18 | 7 | 2310 |
| heart-hungarian | 12 | 2 | 294 | statlog-landsat | 36 | 6 | 6435 |
| heart-switzerland | 12 | 5 | 123 | statlog-shuttle | 9 | 7 | 58,000 |
| heart-va | 12 | 5 | 200 | statlog-vehicle | 18 | 4 | 846 |
| hepatitis | 19 | 2 | 155 | steel-plates | 27 | 7 | 1941 |
| hill-valley | 100 | 2 | 1212 | synthetic-control | 60 | 6 | 600 |
| horse-colic | 25 | 2 | 368 | teaching | 5 | 3 | 151 |
| ilpd-indian-liver | 9 | 2 | 583 | thyroid | 21 | 3 | 7200 |
| image-segmentation | 18 | 7 | 2310 | tic-tac-toe | 9 | 2 | 958 |
| ionosphere | 33 | 2 | 351 | titanic | 3 | 2 | 2201 |
| iris | 4 | 3 | 150 | trains | 29 | 2 | 10 |
| led-display | 7 | 10 | 1000 | twonorm | 20 | 2 | 7400 |
| lenses | 4 | 3 | 24 | vert-col-2clases | 6 | 2 | 310 |
| letter | 16 | 26 | 20,000 | vert-col-3clases | 6 | 3 | 310 |
| libras | 90 | 15 | 360 | wall-following | 24 | 4 | 5456 |
| low-res-spect | 100 | 9 | 531 | waveform | 21 | 3 | 5000 |
| lung-cancer | 56 | 3 | 32 | waveform-noise | 40 | 3 | 5000 |
| lymphography | 18 | 4 | 148 | wine | 13 | 3 | 178 |
| magic | 10 | 2 | 19,020 | wine-quality-red | 11 | 6 | 1599 |
| mammographic | 5 | 2 | 961 | wine-quality-white | 11 | 7 | 4898 |
| miniboone | 50 | 2 | 130,064 | yeast | 8 | 10 | 1484 |
| molec-biol-promoter | 57 | 2 | 106 | zoo | 16 | 7 | 101 |
| molec-biol-splice | 60 | 3 | 3190 | | | | |
Experiments were conducted on 30 stratified resamples of each dataset, with 50% of the data taken for training and 50% for testing. All classifiers were aligned on the same folds.