| CatBoost for big data: an interdisciplinary review |
John T. Hancock, Taghi M. Khoshgoftaar
Abstract
Gradient Boosted Decision Trees (GBDTs) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDTs in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost's effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in the literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers with an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.
Keywords: Big data; CatBoost; Categorical variable encoding; Decision tree; Ensemble methods; Machine learning
Year: 2020 PMID: 33169094 PMCID: PMC7610170 DOI: 10.1186/s40537-020-00369-8
Source DB: PubMed Journal: J Big Data ISSN: 2196-1115
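The abstract emphasizes CatBoost's native handling of categorical, heterogeneous data. As a minimal sketch of that workflow (ours, not from the paper), assuming the open-source `catboost` Python package, raw categorical columns can be passed to the learner directly:

```python
# Minimal sketch (ours, not from the surveyed paper): training CatBoost
# directly on a categorical column, assuming the `catboost` Python package.
from catboost import CatBoostClassifier, Pool

# Toy heterogeneous data: one categorical and one numeric feature.
X = [["red", 1.0], ["blue", 2.5], ["red", 0.5],
     ["green", 3.0], ["blue", 1.5], ["green", 0.8]]
y = [1, 0, 1, 0, 0, 1]

# Declaring column 0 as categorical lets CatBoost apply its own
# target-statistics encoding; no manual one-hot encoding is required.
train_pool = Pool(X, y, cat_features=[0])

model = CatBoostClassifier(iterations=20, depth=2, learning_rate=0.1,
                           loss_function="Logloss", verbose=False)
model.fit(train_pool)
print(model.predict([["red", 1.2]]))
```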
Fig. 1 Image from [8] showing sensitivity of CatBoost to hyper-parameter settings; a records performance on the Higgs benchmark, b performance on the Epsilon benchmark, c performance on the Microsoft benchmark, and d performance on the Yahoo benchmark
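Given the hyper-parameter sensitivity Fig. 1 documents, the sketch below illustrates the kind of validation sweep a practitioner might run; the grid of `depth` and `learning_rate` values is hypothetical, not the benchmark's actual search space:

```python
# Illustrative hyper-parameter sweep (hypothetical grid, not the settings
# used in the benchmark of Fig. 1), assuming the `catboost` package and
# scikit-learn for data splitting and metrics.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)

best = (None, float("inf"))
for depth in (4, 6, 8):
    for lr in (0.03, 0.1, 0.3):
        model = CatBoostClassifier(iterations=200, depth=depth,
                                   learning_rate=lr, random_seed=0,
                                   verbose=False)
        model.fit(X_tr, y_tr)
        # Validation logloss, the metric the CatBoost paper also reports.
        loss = log_loss(y_val, model.predict_proba(X_val)[:, 1])
        if loss < best[1]:
            best = ((depth, lr), loss)

print("best (depth, learning_rate):", best[0], "validation logloss:", best[1])
```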
Oblivious Decision Tree example from Lou and Obukhov demonstrating a Decision Tree and Decision Table that provide equivalent logic [46]
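To make this equivalence concrete, the following sketch (our illustration with made-up splits and leaf values, not Lou and Obukhov's code [46]) shows how a depth-2 oblivious tree, where every node at a given level applies the same split, reduces to a table lookup keyed by the split outcomes:

```python
# Our own illustration of the oblivious-tree / decision-table equivalence;
# the splits and leaf values are made up. In an oblivious tree all nodes at
# the same depth share one split, so a depth-d tree is a lookup table with
# 2**d entries indexed by the vector of split results.

splits = [("age", 30.0), ("income", 50_000.0)]  # one (feature, threshold) per level
leaf_values = {
    (0, 0): 0.10,  # age <= 30, income <= 50k
    (0, 1): 0.35,  # age <= 30, income  > 50k
    (1, 0): 0.40,  # age  > 30, income <= 50k
    (1, 1): 0.80,  # age  > 30, income  > 50k
}

def predict(row: dict) -> float:
    # The key is the tuple of binary split results: a decision table lookup.
    key = tuple(int(row[feature] > threshold) for feature, threshold in splits)
    return leaf_values[key]

print(predict({"age": 42, "income": 38_000}))  # -> 0.40
```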
Machine learning
| Title | CatBoost: unbiased boosting with categorical features |
| Description | Paper introducing CatBoost algorithm |
| Performance metric | logloss, zero-one loss |
| Winner | CatBoost |
| Reference | [2] |
| Title | Benchmarking and optimization of gradient boosting decision tree algorithms |
| Description | Compare CatBoost, LightGBM, and XGBoost run on GPUs, using four benchmark tasks |
| Performance metric | AUC ROC and normalized discounted cumulative gain (NDCG; see the worked example after this table) |
| Winner | CatBoost wins AUC for the Epsilon dataset, LightGBM wins AUC for the Higgs dataset, and XGBoost wins NDCG for the Microsoft and Yahoo datasets |
| Reference | [8] |
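Since NDCG decides the Microsoft and Yahoo rankings above, here is a short worked example of the metric; the relevance scores are toy values, not data from the benchmark study:

```python
# Worked example of normalized discounted cumulative gain (NDCG);
# toy relevance labels, not data from the benchmark study.
import math

def dcg(rels):
    # Graded relevance discounted by log2 of (1-based rank + 1).
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg(rels_in_predicted_order):
    ideal = sorted(rels_in_predicted_order, reverse=True)
    return dcg(rels_in_predicted_order) / dcg(ideal)

# Relevance labels of documents in the order the model ranked them:
print(round(ndcg([3, 2, 3, 0, 1, 2]), 4))  # 1.0 only for a perfect ranking
```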
Traffic engineering
| Title | A Semi-Supervised Tri-CatBoost method for driving style recognition |
| Description | Combine labeled and unlabeled data, use CatBoost as a base classifier to identify driving style |
| Performance metric | N/A; CatBoost is used as the base classifier for semi-supervised learning and is not compared to other classifiers |
| Winner | N/A |
| Reference | [ |
| Title | Reconstructing commuters network using machine learning and urban indicators |
| Description | Construct a graph of human movement between cities, extract features, and apply CatBoost among other algorithms to reconstruct the graph |
| Performance metric | Accuracy |
| Winner | CatBoost wins, but its training time is long compared to XGBoost's, so the authors use XGBoost for the remainder of the study |
| Reference | [ |
Finance
| Title | Comparison between XGBoost, LightGBM and CatBoost using a home credit dataset |
| Description | Evaluate XGBoost, LightGBM, and CatBoost performance in predicting loan default |
| Performance metric | AUC, running time |
| Winner | LightGBM |
| Reference | [ |
| Title | Short term electricity spot price forecasting using CatBoost and bidirectional long short term memory neural network |
| Description | CatBoost used for feature selection on time-series data |
| Performance metric | Mean absolute percentage error |
| Winner | CatBoost is not a competitor; it is used for feature selection |
| Reference | [ |
| Title | Research on personal credit scoring model based on multi-source data |
| Description | Use “Stacking & Blending” with CatBoost, Logistic Regression, and Random Forest to calculate credit scores as a regression task |
| Performance metric | Model is an ensemble, so there is no direct comparison between algorithms; performance measured in AUC |
| Winner | N/A |
| Reference | [ |
| Title | Predicting loan default in peer-to-peer lending using narrative data |
| Description | Evaluate CatBoost against other classifiers on the task of predicting loan default using Lending Club data |
| Performance metric | Accuracy, AUC, H measure, type I error rate, type II error rate |
| Winner | CatBoost |
| Reference | [20] |
Astronomy
| Title | KiDS-SQuaD II. Machine learning selection of bright extragalactic objects to search for new gravitationally lensed quasars |
| Description | Use CatBoost to classify astronomical data |
| Performance metric | AUC |
| Winner | CatBoost |
| Reference | [18] |
Cyber-security
| Title | Attack detection in enterprise networks by machine learning methods |
| Description | Compare CatBoost, LightGBM, SVM, and logistic regression in multi-class and binary classification tasks of identifying computer network attacks |
| Performance metric | AUC, CV balanced accuracy, balanced accuracy, F1, precision, recall |
| Winner | CatBoost |
| Reference | [ |
Meteorology
| Title | Short-term weather forecast based on wavelet denoising and catboost |
| Description | Use CatBoost to predict weather-related observations and compare it with other machine learning algorithms on the same task |
| Performance metric | Custom metric based on root mean squared error |
| Winner | CatBoost |
| Reference | [ |
| Title | Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions |
| Description | Compare CatBoost with other machine learning models for predicting reference evapotranspiration |
| Performance metric | MAPE, RMSE, R² |
| Winner | Results do not indicate a clear overall winner |
| Reference | [33] |
Medicine
| Title | The use of data mining methods for the prediction of dementia: evidence from the English Longitudinal Study of Aging |
| Description | Classify dementia on imbalanced data where the maximum feature cardinality is 50; compare CatBoost to other classifiers |
| Performance metric | Normalized Gini coefficient |
| Winner | Convolutional neural network |
| Reference | [26] |
| Title | A novel fracture prediction model using machine learning in a community-based cohort |
| Description | Use CatBoost to predict fragility fracture |
| Performance metric | AUC |
| Winner | CatBoost |
| Reference | [ |
| Title | An efficient novel approach for iris recognition based on stylometric features and machine learning techniques |
| Description | Use CatBoost after extracting features from image data converted to base-64 encoding |
| Performance metric | AUC |
| Winner | MultiboostAB |
| Reference | [23] |
Biology
| Title | CT-based machine learning model to predict the Fuhrman nuclear grade of clear cell renal cell carcinoma |
| Description | Classify kidney cancer images as instances of high-grade or low-grade cancer; presents opportunities for research at Big Data scale |
| Performance metric | Used only CatBoost |
| Winner | N/A |
| Reference | [ |
| Title | Diseases spread prediction in tropical areas by machine learning methods ensembling and spatial analysis techniques |
| Description | Use CatBoost to predict spread of dengue fever |
| Performance metric | Mean absolute error |
| Winner | LSTM and XGBoost ensemble |
| Reference | [ |
| Title | Performance analysis of boosting classifiers in recognizing activities of daily living |
| Description | Compare CatBoost with XGBoost in ability to identify human physical activity types from sensor data |
| Performance metric | F-measure |
| Winner | Friedman's stochastic gradient boosting and AdaBoost with decision trees |
| Reference | [25] |
Marketing
| Title | Predicting online shopping behavior from clickstream data using deep learning |
| Description | CatBoost is part of the ensemble that is the best clickstream predictor |
| Performance metric | AUC |
| Winner | GRU-CatBoost ensemble |
| Reference | [39] |
Bio-chemistry
| Title | Construction and analysis of molecular association network by combining behavior representation and node attributes |
| Description | Leverage a graph representation of an association network of biological entities to produce classifier inputs; compare CatBoost with other popular classifiers as the association predictor |
| Performance metric | Accuracy, sensitivity, specificity, precision, Matthews correlation coefficient, AUC |
| Winner | CatBoost (all metrics except sensitivity) |
| Reference | [ |
| Title | Prediction model of aryl hydrocarbon receptor activation by a novel QSAR approach, deepSnap–deep learning |
| Description | Compare CatBoost to other learners in an image-processing task studying the relationship between genes and liver function |
| Performance metric | AUC, accuracy |
| Winner | DeepSnap-DL (deep learning algorithm) |
| Reference | [ |
Electrical utilities fraud
| Title | Bridging the gap between energy consumption and distribution through non-technical loss detection |
| Description | Use CatBoost to predict non-technical loss in power distribution networks; the authors report little in terms of quantitative results |
| Performance metric | Not explicitly stated |
| Winner | Not clear; the authors do not give exact numbers |
| Reference | [ |
| Title | Performance Analysis of Different Types of Machine Learning Classifiers for Non-Technical Loss Detection |
| Description | Compare CatBoost with 14 other classifiers |
| Performance metric | Precision, recall, F-Measure |
| Winner | CatBoost has the highest precision and F-measure |
| Reference | [ |
| Title | Energy theft detection using gradient boosting theft detector with feature engineering-based preprocessing |
| Description | Technique for using CatBoost with highly imbalanced data |
| Performance metric | True positive rate, false positive rate |
| Winner | CatBoost has the lowest false positive rate; LightGBM wins on true positive rate. CatBoost has the longest total train and test time, LightGBM the shortest |
| Reference | [ |
| Title | Impact of feature selection on non-technical loss detection |
| Description | Use incremental feature selection, compare performance of CatBoost, Decision Tree and K-Nearest Neighbors classifiers |
| Performance metric | Precision, recall, F-Measure |
| Winner | CatBoost, except for the recall of models trained with 9 features, where KNN wins |
| Reference | [ |
Fig. 2 Confusion matrices from Khramtsov et al. showing the relative performance of Random Forest, CatBoost, and XGBoost on the hold-out dataset [18]
From [20]; bracketed numbers are confidence intervals. Note that we do not find where Xia et al. document the significance level for these confidence intervals (a sketch of one common construction follows the table); here, “softer” means models are trained with all available features
| Softer dataset | |||
|---|---|---|---|
| Model | Accuracy | AUC | H-measure |
| LR-softer | 0.7516 [0.7508, 0.7523] | 0.6151 [0.6139, 0.6163] | 0.0843 [0.0827, 0.0860] |
| RT-softer | 0.6952 [0.6911, 0.6996] | 0.5444 [0.5391, 0.5493] | 0.0124 [0.0095, 0.0153] |
| BNN-softer | 0.7496 [0.7480, 0.7516] | 0.6120 [0.6095, 0.6151] | 0.0801 [0.0766, 0.0843] |
| RF-softer | 0.7436 [0.7415, 0.7456] | 0.6043 [0.6013, 0.6073] | 0.0695 [0.0659, 0.0733] |
| GBDT-softer | 0.7504 [0.7488, 0.7520] | 0.6132 [0.6107, 0.6158] | 0.0818 [0.0784, 0.0853] |
| XGBoost-softer | 0.7511 [0.7496, 0.7526] | 0.6143 [0.6120, 0.6167] | 0.0833 [0.0801, 0.0866] |
| CatBoost-softer | 0.7523 [0.7511, 0.7535] | 0.6162 [0.6144, 0.6180] | 0.0859 [0.0834, 0.0885] |
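Since the construction of these intervals is undocumented, the sketch below shows one common way such intervals are produced: bootstrap resampling at an assumed 95% level. This is our assumption, not necessarily the procedure Xia et al. used:

```python
# Hedged sketch: a 95% bootstrap confidence interval for AUC. This is one
# common construction; we do not know the procedure Xia et al. actually used
# for the bracketed intervals above. Labels and scores are toy stand-ins.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)        # toy binary labels
y_score = y_true * 0.3 + rng.random(1000)     # toy scores correlated with labels

aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # resample with replacement
    if len(np.unique(y_true[idx])) < 2:                   # AUC needs both classes
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC 95% bootstrap CI: [{lo:.4f}, {hi:.4f}]")
```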
“Best Gini scores of individual ML algorithms on the test data” [26]
| XGB | LGB | CatBoost | K-CNN | RF | RGF | LR |
|---|---|---|---|---|---|---|
| 0.9234 | 0.9153 | 0.9218 | 0.9307 | 0.9295 | 0.9276 | 0.9069 |
XGB stands for XGBoost; LGB for LightGBM; K-CNN the Keras [66] implementation of Convolutional Neural Networks; RF for Random Forest; RGF for Regularized Greedy Forest; performance in terms of Normalized Gini Coefficient
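For binary targets, the normalized Gini coefficient reported in these tables is commonly computed from AUC via the identity Gini = 2·AUC − 1; a quick check on toy data (not the study's):

```python
# The normalized Gini coefficient is, for binary targets, commonly taken as
# Gini = 2 * AUC - 1; toy labels and scores, not data from [26].
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.10, 0.40, 0.35, 0.80, 0.65, 0.20]

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1
print(f"AUC = {auc:.4f}, normalized Gini = {gini:.4f}")
```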
From Yang and Bath [26], “System performance on the test data using different ensemble strategies”
| E1 | E2 | E3 | E4 | E5 | E6 |
|---|---|---|---|---|---|
| 0.9332 | 0.9331 | 0.9325 | 0.9322 | 0.9332 | 0.9333 |
E1 is ensemble of K-CNN, RF and RGF; E2 is ensemble of K-CNN, RF and XGB; E3 is ensemble of K-CNN, RGF, XGB; E4 is ensemble of K-CNN, RGF and CatBoost; E5 is ensemble of K-CNN, RF, RGF and CatBoost; E6 is ensemble of K-CNN, RF, RGF and XGB; performance in terms of Normalized Gini Coefficient
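The E1 through E6 strategies suggest simple score blending; as a hedged sketch (whether Yang and Bath averaged probabilities, ranks, or used another blend is not stated here), a uniform average of member probabilities looks like:

```python
# Hedged sketch of prediction averaging, the kind of blending the E1-E6
# strategies suggest; member scores below are toy values for three
# hypothetical models, not results from [26].
import numpy as np

p_kcnn = np.array([0.90, 0.20, 0.60])  # K-CNN holdout probabilities (toy)
p_rf   = np.array([0.80, 0.30, 0.55])  # Random Forest (toy)
p_rgf  = np.array([0.85, 0.25, 0.50])  # Regularized Greedy Forest (toy)

# E1-style ensemble: uniform average of member probabilities.
ensemble = np.mean([p_kcnn, p_rf, p_rgf], axis=0)
print(ensemble)  # blended scores, to be evaluated with the normalized Gini metric
```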
From [23] original caption, “Iris recognition performances on the CASIA dataset, with the cross-validation performed after the over-sampling (SMOTE).”
| Method | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|
| All features | |||||
| OneR | 0.9982 ± 0.003 | 1.00 ± 0.01 | 0.99 ± 0.01 | 0.99 ± 0.01 | 1.00 ± 0.01 |
| J48 | 0.9926 ± 0.006 | 0.99 ± 0.02 | 0.96 ± 0.04 | 0.98 ± 0.02 | 0.98 ± 0.02 |
| SMO | 0.9927 ± 0.005 | 0.99 ± 0.02 | 0.96 ± 0.03 | 0.98 ± 0.02 | 0.98 ± 0.01 |
| SVC | 0.9955 ± 0.004 | 0.97 ± 0.03 | 1.00 ± 0.01 | 0.98 ± 0.02 | 0.99 ± 0.00 |
| RandomForest | 0.9980 ± 0.003 | 1.00 ± 0.01 | 0.99 ± 0.02 | 0.99 ± 0.01 | 1.00 ± 0.00 |
| MultiboostAB | |||||
| CatBoost | 0.9993 ± 0.001 | 1.00 ± 0.01 | 1.00 ± 0.00 | 1.00 ± 0.00 | 0.99 ± 0.00 |
| RFE-16 | |||||
| OneR | 0.9978 ± 0.003 | 1.00 ± 0.01 | 0.99 ± 0.02 | 0.99 ± 0.01 | 0.99 ± 0.01 |
| J48 | 0.9947 ± 0.005 | 0.99 ± 0.01 | 0.97 ± 0.03 | 0.98 ± 0.02 | 0.99 ± 0.01 |
| SMO | 0.9966 ± 0.004 | 0.99 ± 0.01 | 0.98 ± 0.02 | 0.99 ± 0.01 | 0.99 ± 0.01 |
| SVC | 0.9951 ± 0.002 | 0.97 ± 0.02 | 0.99 ± 0.01 | 0.98 ± 0.01 | 0.99 ± 0.00 |
| RandomForest | 0.9983 ± 0.002 | 1.00 ± 0.01 | 0.99 ± 0.01 | 0.99 ± 0.01 | 1.00 ± 0.00 |
| MultiboostAB | |||||
| CatBoost | 0.9979 ± 0.002 | 0.99 ± 0.01 | 1.00 ± 0.01 | 0.99 ± 0.01 | 0.99 ± 0.00 |
| RRF-8 | |||||
| OneR | 0.9971 ± 0.003 | 1.00 ± 0.01 | 0.98 ± 0.02 | 0.99 ± 0.01 | 0.99 ± 0.01 |
| J48 | 0.9960 ± 0.004 | 1.00 ± 0.01 | 0.98 ± 0.02 | 0.99 ± 0.01 | 0.99 ± 0.01 |
| SMO | |||||
| SVC | |||||
| RandomForest | 0.9982 ± 0.003 | 1.00 ± 0.01 | 0.99 ± 0.01 | 0.99 ± 0.01 | 1.00 ± 0.00 |
| MultiboostAB | 0.9977 ± 0.003 | 1.00 ± 0.01 | 0.99 ± 0.02 | 0.99 ± 0.01 | 1.00 ± 0.00 |
| CatBoost | 0.9986 ± 0.002 | 0.99 ± 0.01 | 1.00 ± 0.01 | 1.00 ± 0.01 | 0.99 ± 0.00 |
Fig. 3 Image from [25] showing relatively weak performance of CatBoost (CB) as compared to XGBoost (XGB), LightGBM (LGBM), Gradient Boosting (GB), AdaBoost using Decision Trees (ADA_DT), and AdaBoost using Random Forest (ADA_RF)
From [31], “Performance comparison without or with new feature(s) (average of 100 random customers), where revised theft cases are used”
| Model | Metric | w/o synth | w/ mean | w/ std | w/ min | w/ max | w/ all 4 |
|---|---|---|---|---|---|---|---|
| XGBoost | DR (%) | 94 | 95 | 95 | 95 | 95 | 96 |
| XGBoost | FPR (%) | 6 | 5 | 4 | 4 | 4 | 4 |
| CatBoost | DR (%) | 97 | 97 | 97 | 97 | 97 | 97 |
| CatBoost | FPR (%) | 5 | 6 | 5 | 5 | 5 | 3 |
| LightGBM | DR (%) | 97 | 97 | 97 | 97 | 97 | 97 |
| LightGBM | FPR (%) | 7 | 7 | 6 | 5 | 6 | 5 |
Here, “synth” refers to features derived from summary statistics of daily usage; DR refers to detection rate, or true positive rate, and FPR refers to false positive rate
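For clarity on how DR and FPR in the table above are computed, a minimal example from confusion-matrix counts (toy counts, not data from [31]):

```python
# Minimal sketch of the metrics in the table above: detection rate (true
# positive rate) and false positive rate from confusion-matrix counts.
# Toy counts, not data from [31].
tp, fn = 97, 3   # theft cases caught vs. missed
fp, tn = 5, 95   # honest customers flagged vs. cleared

dr = tp / (tp + fn)    # DR = TPR = TP / (TP + FN)
fpr = fp / (fp + tn)   # FPR = FP / (FP + TN)
print(f"DR = {dr:.0%}, FPR = {fpr:.0%}")
```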
Fig. 4 From [31]; false positive rates for XGBoost, CatBoost, and LightGBM as the number of features used increases; shows the improvement in CatBoost's false positive rate when all summary statistics are employed
Fig. 5 From [31]; evaluation (total train and test) time for XGBoost, CatBoost, and LightGBM as the number of features increases, averaged over 100 random customers
From [30], “Precision, recall and F-measure of CatBoost, Decision Tree classifier and KNN for 9 and 71 features”
| Metric | Features | CatBoost (%) | DT (%) | KNN (%) |
|---|---|---|---|---|
| Precision | 71 | 98.11 | 97.23 | 94.18 |
| Precision | 9 | 97.40 | 96.8 | 96.58 |
| Recall | 71 | 99.27 | 97.80 | 45.10 |
| Recall | 9 | 98.68 | 98.24 | 99.12 |
| F-Measure | 71 | 98.69 | 97.51 | 61.00 |
| F-Measure | 9 | 98.04 | 97.53 | 97.83 |
Fig. 6 Image from [33] illustrating the relative efficiency of CatBoost, according to Huang et al.; Level 1 is 10 years' data from a single station, Level 2 is 10 years' data from 12 stations, Level 3 is 20 years' data from 12 stations, Level 4 is 30 years' data from 12 stations, and Level 5 is 40 years' data from 12 stations
Fig. 7 Image from [90] depicting the ensemble architecture of a system for automatic vehicle detection; CNN and OS outputs are fed to CatBoost
From [2], showing the mean tree construction time in seconds
| Method | Time per tree |
|---|---|
| CatBoost Plain | |
| CatBoost Ordered | 1.9 s |
| XGBoost | 3.9 s |
| LightGBM | |
Italic text in the original work indicates the shortest tree construction time
From Yi et al.; the proposed method is CatBoost
| Method | Accuracy (%) | Sensitivity (%) | Specificity (%) |
|---|---|---|---|
| MAN-HOPE-LR | 83.75 ± 0.11 | 83.21 ± 0.47 | 84.30 ± 0.32 |
| MAN-HOPE-Ada | 84.73 ± 0.18 | 85.53 ± 0.29 | 83.93 ± 0.22 |
| MAN-HOPE-RF | 92.66 ± 0.12 | 93.29 ± 0.22 | |
| MAN-HOPE-XGB | 89.56 ± 0.41 | 90.60 ± 0.28 | 88.51 ± 0.95 |
| Proposed method | 91.50 ± 0.14 | | |
Best metrics are highlighted in italic; we split the table in two for legibility
Fig. 8 Image from [39] illustrating best results for the ensemble of CatBoost and a gated recurrent unit (GRU) network; here, the authors refer to CatBoost as a Gradient Boosted Machine (GBM)
Fig. 9 Image from [102] illustrating neural network based algorithms outperforming Gradient Boosted tree algorithms for classifying homogeneous text data in the SoHu dataset of news articles labeled as having or lacking marketing intent