Yang Lu¹,², Lian Yang¹,², Baofeng Shi¹,², Jiaxiang Li¹,², Mohammad Zoynul Abedin³.
Abstract
With the development of Industry 4.0, the credit data of SMEs are characterized by large volume, high velocity, diversity, and low value density. Selecting the key features that affect credit risk from such high-dimensional data has become critical to accurately measuring the credit risk of SMEs and alleviating their financing constraints. To this end, this paper proposes a credit risk feature selection approach that integrates the binary opposite whale optimization algorithm (BOWOA) and the Kolmogorov-Smirnov (KS) statistic. We verify the robustness of the proposed model with seven machine learning classifiers and three discriminant methods on three real-world SME credit datasets from banks. The empirical results show that although no single artificial-intelligence credit evaluation method is universal across different SMEs' credit data, the BOWOA-KS model proposed in this paper outperforms the other methods when both the number of indicators in the optimal subset and the predictive performance of the classifier are considered. By providing a feature selection method for high-dimensional data and improving the predictive performance of credit risk models, this work can help SMEs focus on the factors that improve their creditworthiness and ease their access to loans from financial institutions. It can also help government agencies and policymakers develop policies that reduce SMEs' credit risk.
Keywords: Binary opposite whale optimization algorithm; Credit rating; Credit risk; Feature selection; Kolmogorov–Smirnov statistic; SMEs
Year: 2022 PMID: 35910041 PMCID: PMC9309243 DOI: 10.1007/s10479-022-04849-3
Source DB: PubMed Journal: Ann Oper Res ISSN: 0254-5330 Impact factor: 4.820
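The paper's exact BOWOA update equations are not included in this record. As a rough illustration only, the two ingredients the algorithm's name implies — binarizing continuous whale positions through a transfer function, and opposition-based learning on the resulting feature mask — can be sketched as follows (all function names are hypothetical, and the toy fitness stands in for the paper's KS-based objective):

```python
import math
import random

def sigmoid_transfer(x):
    """Map a continuous whale-position component to a selection
    probability, a common way to binarize WOA positions."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(position, rng):
    """Turn a continuous position vector into a 0/1 feature mask."""
    return [1 if rng.random() < sigmoid_transfer(x) else 0 for x in position]

def opposite(mask):
    """Opposition-based learning on a binary mask: flip every bit."""
    return [1 - b for b in mask]

def better_of(mask, fitness):
    """Keep whichever of a mask and its opposite scores higher."""
    opp = opposite(mask)
    return mask if fitness(mask) >= fitness(opp) else opp

rng = random.Random(42)
position = [rng.uniform(-4, 4) for _ in range(8)]
mask = binarize(position, rng)
# Toy fitness: prefer fewer selected features (a stand-in for the
# paper's objective, which balances subset size against KS score).
chosen = better_of(mask, lambda m: -sum(m))
```

Evaluating both a candidate mask and its opposite at each step is what lets opposition-based variants explore the search space faster than plain binary WOA.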
Fig. 1 Example of the KS statistic
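The two-sample KS statistic used as the feature-ranking criterion is the maximum vertical gap between the empirical CDFs of the non-default and default groups. A minimal pure-Python sketch (the paper applies this per indicator; variable names here are illustrative):

```python
def ks_statistic(scores_good, scores_bad):
    """Two-sample KS statistic: the maximum absolute gap between the
    empirical CDFs of the two samples, evaluated at every observed value."""
    thresholds = sorted(set(scores_good) | set(scores_bad))
    n_g, n_b = len(scores_good), len(scores_bad)
    ks = 0.0
    for t in thresholds:
        cdf_g = sum(s <= t for s in scores_good) / n_g
        cdf_b = sum(s <= t for s in scores_bad) / n_b
        ks = max(ks, abs(cdf_g - cdf_b))
    return ks
```

A larger KS value means the indicator separates defaulting from non-defaulting SMEs more sharply, which is why it can serve as a fitness signal for feature selection.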
Confusion matrix for SME credit prediction
| Items | Predicted non-default | Predicted default | Total |
|---|---|---|---|
| Non-default SMEs | | | |
| Default SMEs | | | |
| Total | | | |
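The accuracy and F1 scores reported in the tables below follow directly from the counts in this confusion matrix. A self-contained sketch of those standard formulas (labeling default = 1, non-default = 0):

```python
def confusion_counts(y_true, y_pred):
    """Counts for the 2x2 confusion matrix; 1 = default, 0 = non-default."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall on the default class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Because SME default data are imbalanced, F1 (and AUC) are reported alongside accuracy, which alone can be inflated by simply predicting non-default.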
Original data on 2,044 SMEs
| (a) No | (b) Criterion level | (c) Indicator | (d) Indicator type | (e) Original data for 2,044 loans | |||||
|---|---|---|---|---|---|---|---|---|---|
| Non-default SMEs | Default SMEs | ||||||||
| (1) | … | (1816) | (1817) | … | (2044) | ||||
| 1 | Interval | 57 | … | 36 | 45 | … | 24 | ||
| … | … | … | … | … | … | … | … | … | |
| 12 | Qualitative | 3 | … | 3 | 3 | … | 4 | ||
| … | … | … | … | … | … | … | … | … | |
| 25 | Qualitative | 1 | … | 1 | 1 | … | 1 | ||
| … | … | … | … | … | … | … | … | … | |
| 35 | Qualitative | 2 | … | 1 | 0 | … | 1 | ||
| … | … | … | … | … | … | … | … | … | |
| … | … | … | … | … | … | … | … | … | |
| 44 | Negative | 0.373 | … | 0.364 | 0.373 | … | 0.379 | ||
| 45 | – | Default | – | 1 | … | 1 | 0 | … | 0 |
Descriptive statistics of data for 2,044 SMEs
| Indicators | Mean | Var | S.D. |
|---|---|---|---|
| Age | 40.932 | 70.207 | 8.379 |
| Education | 3.976 | 0.976 | 0.988 |
| Marital status | 1.086 | 0.240 | 0.490 |
| Gender | 1.082 | 0.075 | 0.274 |
| Number of members | 1.634 | 0.409 | 0.640 |
| Size of labor force | 2.004 | 0.938 | 0.969 |
| Number of family members/Size of labor force | 3.145 | 0.599 | 0.774 |
| Support population | 2.524 | 0.613 | 0.783 |
| … | … | … | … |
| Regional GDP growth rate | 18.253 | 7.195 | 2.682 |
| Consumer price index | 105.574 | 0.599 | 0.774 |
| Resident savings deposit balance | 5414.524 | 5,157,935.048 | 2271.109 |
| Engel’s coefficient | 0.376 | 0.001 | 0.028 |
Standardized data of the 2,044 SMEs
| (a) No | (b) Criterion level | (c) Indicator | (d) Indicator type | (e) Standardized data for 2,044 loans | |||||
|---|---|---|---|---|---|---|---|---|---|
| Non-default SMEs | Default SMEs | ||||||||
| (1) | … | (1816) | (1817) | … | (2044) | ||||
| 1 | Interval | 0.368 | … | 1.000 | 1.000 | … | 0.526 | ||
| … | … | … | … | … | … | … | … | … | |
| 11 | Positive | 0.130 | … | 0.285 | 0.052 | … | 0.000 | ||
| … | … | … | … | … | … | … | … | … |
| 39 | Positive | 0.089 | … | 0.750 | 0.089 | … | 0.205 | ||
| … | … | … | … | … | … | … | … | … | |
| 44 | Negative | 0.628 | … | 0.701 | 0.628 | … | 0.413 | ||
| 45 | – | Default | – | 1 | … | 1 | 0 | … | 0 |
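The standardized values in [0, 1] above are consistent with min-max scaling, with the direction reversed for negative (smaller-is-better) indicators. A sketch of those two transformations (the paper's interval-type indicators additionally need an ideal interval, which this record does not specify, so they are omitted here):

```python
def standardize_positive(values):
    """Min-max scale a positive (larger-is-better) indicator to [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize_negative(values):
    """Min-max scale a negative (smaller-is-better) indicator to [0, 1],
    so that smaller raw values map to larger standardized scores."""
    lo, hi = min(values), max(values)
    return [(hi - v) / (hi - lo) for v in values]
```

Scaling every indicator to a common [0, 1] range keeps distance-based classifiers such as K-NN and SVM from being dominated by large-magnitude indicators like the savings deposit balance.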
Hyperparameters of artificial intelligence classification methods
| Methods | Parameters |
|---|---|
| K-NN | K = 5, Euclidean distance |
| LR | L2 penalty, cutoff = 0.5 |
| SVM | Radial basis kernel function, complexity parameter C = 1.0, gamma = 0.1 |
| DT | Splitting criterion: Gini |
| RF | Number of trees = 200, max depth = 9, splitting criterion: Gini |
| GBDT | Default booster, learning rate = 0.1, max depth = 3 |
| XGBoost | Learning rate = 0.1, max depth = 5, number of iterations = 200 |
Accuracy of different models for different subsets of indicators
| Total indicators | Subset | KNN | LR | SVM | DT | RF | GBDT | XGBoost |
|---|---|---|---|---|---|---|---|---|
| 44 | Original | 0.73 | 0.63 | 0.89 | 0.78 | 0.89 | 0.89 | 0.88 |
| 13 | BOWOA-AUC | 0.77 | 0.55 | 0.89 | 0.8 | 0.87 | 0.88 | 0.86 |
| 9 | BOWOA-GINI | 0.79 | 0.62 | 0.89 | 0.78 | 0.87 | 0.86 | 0.84 |
| 8 | BOWOA-KS | 0.78 | 0.63 | 0.89 | 0.78 | 0.87 | 0.88 | 0.87 |
| 15 | PSO-AUC | 0.74 | 0.56 | 0.89 | 0.77 | 0.88 | 0.88 | 0.86 |
| 21 | PSO-GINI | 0.75 | 0.56 | 0.89 | 0.79 | 0.88 | 0.87 | 0.87 |
| 21 | PSO-KS | 0.71 | 0.62 | 0.89 | 0.8 | 0.88 | 0.88 | 0.87 |
F1 scores of different models for different subsets of indicators
| Total indicators | Subset | KNN | LR | SVM | DT | RF | GBDT | XGBoost |
|---|---|---|---|---|---|---|---|---|
| 44 | Original | 0.50 | 0.50 | 0.47 | 0.53 | 0.50 | 0.54 | 0.54 |
| 13 | BOWOA-AUC | 0.53 | 0.45 | 0.47 | 0.56 | 0.51 | 0.52 | 0.51 |
| 9 | BOWOA-GINI | 0.51 | 0.49 | 0.47 | 0.53 | 0.53 | 0.53 | 0.53 |
| 8 | BOWOA-KS | 0.5 | 0.49 | 0.47 | 0.52 | 0.55 | 0.55 | 0.57 |
| 15 | PSO-AUC | 0.49 | 0.46 | 0.47 | 0.52 | 0.53 | 0.55 | 0.54 |
| 21 | PSO-GINI | 0.53 | 0.44 | 0.47 | 0.57 | 0.5 | 0.53 | 0.55 |
| 21 | PSO-KS | 0.47 | 0.48 | 0.47 | 0.54 | 0.48 | 0.53 | 0.54 |
AUC of different models for different subsets of indicators
| Total indicators | Subset | KNN | LR | SVM | DT | RF | GBDT | XGBoost |
|---|---|---|---|---|---|---|---|---|
| 44 | Original | 0.52 | 0.59 | 0.54 | 0.55 | 0.64 | 0.62 | 0.63 |
| 13 | BOWOA-AUC | 0.58 | 0.54 | 0.56 | 0.57 | 0.63 | 0.56 | 0.6 |
| 9 | BOWOA-GINI | 0.56 | 0.53 | 0.49 | 0.54 | 0.56 | 0.57 | 0.54 |
| 8 | BOWOA-KS | 0.57 | 0.57 | 0.63 | 0.52 | 0.59 | 0.65 | 0.64 |
| 15 | PSO-AUC | 0.52 | 0.56 | 0.54 | 0.53 | 0.65 | 0.59 | 0.6 |
| 21 | PSO-GINI | 0.6 | 0.54 | 0.51 | 0.59 | 0.65 | 0.58 | 0.61 |
| 21 | PSO-KS | 0.54 | 0.59 | 0.59 | 0.55 | 0.64 | 0.63 | 0.63 |
Fig. 2 Accuracy of different machine learning methods with different numbers of indicators
Fig. 3 ROC curves of the different methods used in the BOWOA-KS model
Fig. 4 The 15 most important features in the data for the 2,044 SMEs, using RF
Accuracy of different models for different subsets of indicators
| Total indicators | Subset | KNN | LR | SVM | DT | RF | GBDT | XGBoost |
|---|---|---|---|---|---|---|---|---|
| 60 | Original | 0.69 | 0.62 | 0.88 | 0.78 | 0.89 | 0.88 | 0.88 |
| 13 | BOWOA-AUC | 0.78 | 0.62 | 0.89 | 0.77 | 0.86 | 0.85 | 0.85 |
| 14 | BOWOA-GINI | 0.78 | 0.64 | 0.89 | 0.78 | 0.86 | 0.87 | 0.85 |
| 18 | BOWOA-KS | 0.77 | 0.63 | 0.88 | 0.77 | 0.88 | 0.88 | 0.88 |
| 25 | PSO-AUC | 0.75 | 0.63 | 0.89 | 0.78 | 0.88 | 0.88 | 0.87 |
| 32 | PSO-GINI | 0.73 | 0.62 | 0.88 | 0.77 | 0.88 | 0.87 | 0.87 |
| 19 | PSO-KS | 0.74 | 0.62 | 0.89 | 0.76 | 0.89 | 0.87 | 0.87 |
| 80 | Original | 0.97 | 0.93 | 0.97 | 0.99 | 0.99 | 0.98 | 0.99 |
| 16 | BOWOA-AUC | 0.95 | 0.71 | 0.97 | 0.97 | 0.98 | 0.97 | 0.98 |
| 24 | BOWOA-GINI | 0.97 | 0.88 | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 |
| 48 | BOWOA-KS | 0.96 | 0.92 | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 |
| 23 | PSO-AUC | 0.97 | 0.86 | 0.96 | 0.98 | 0.99 | 0.98 | 0.98 |
| 27 | PSO-GINI | 0.96 | 0.9 | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 |
| 47 | PSO-KS | 0.97 | 0.92 | 0.97 | 0.99 | 0.98 | 0.98 | 0.99 |
F1 scores of different models for different subsets of indicators
| Total indicators | Subset | KNN | LR | SVM | DT | RF | GBDT | XGBoost |
|---|---|---|---|---|---|---|---|---|
| 60 | Original | 0.50 | 0.49 | 0.49 | 0.52 | 0.47 | 0.49 | 0.53 |
| 13 | BOWOA-AUC | 0.52 | 0.51 | 0.48 | 0.52 | 0.48 | 0.49 | 0.5 |
| 14 | BOWOA-GINI | 0.51 | 0.52 | 0.47 | 0.51 | 0.51 | 0.48 | 0.49 |
| 18 | BOWOA-KS | 0.53 | 0.51 | 0.47 | 0.54 | 0.53 | 0.52 | 0.55 |
| 25 | PSO-AUC | 0.53 | 0.5 | 0.52 | 0.54 | 0.51 | 0.49 | 0.52 |
| 32 | PSO-GINI | 0.5 | 0.49 | 0.54 | 0.5 | 0.47 | 0.49 | 0.5 |
| 19 | PSO-KS | 0.47 | 0.5 | 0.47 | 0.53 | 0.51 | 0.51 | 0.53 |
| 80 | Original | 0.78 | 0.64 | 0.71 | 0.85 | 0.80 | 0.75 | 0.83 |
| 16 | BOWOA-AUC | 0.63 | 0.44 | 0.58 | 0.72 | 0.73 | 0.67 | 0.75 |
| 24 | BOWOA-GINI | 0.72 | 0.58 | 0.68 | 0.71 | 0.77 | 0.77 | 0.75 |
| 48 | BOWOA-KS | 0.74 | 0.62 | 0.68 | 0.85 | 0.79 | 0.79 | 0.83 |
| 23 | PSO-AUC | 0.76 | 0.56 | 0.7 | 0.8 | 0.81 | 0.77 | 0.81 |
| 27 | PSO-GINI | 0.7 | 0.6 | 0.69 | 0.75 | 0.8 | 0.78 | 0.79 |
| 47 | PSO-KS | 0.78 | 0.6 | 0.73 | 0.88 | 0.77 | 0.77 | 0.83 |
AUC of different models for different subsets of indicators
| Total indicators | Subset | KNN | LR | SVM | DT | RF | GBDT | XGBoost |
|---|---|---|---|---|---|---|---|---|
| 60 | Original | 0.56 | 0.62 | 0.59 | 0.53 | 0.69 | 0.63 | 0.62 |
| 13 | BOWOA-AUC | 0.55 | 0.65 | 0.54 | 0.53 | 0.58 | 0.53 | 0.52 |
| 14 | BOWOA-GINI | 0.56 | 0.66 | 0.61 | 0.51 | 0.63 | 0.64 | 0.57 |
| 18 | BOWOA-KS | 0.61 | 0.66 | 0.63 | 0.55 | 0.65 | 0.63 | 0.63 |
| 25 | PSO-AUC | 0.6 | 0.61 | 0.58 | 0.55 | 0.63 | 0.61 | 0.6 |
| 32 | PSO-GINI | 0.56 | 0.61 | 0.57 | 0.5 | 0.67 | 0.63 | 0.6 |
| 19 | PSO-KS | 0.52 | 0.64 | 0.58 | 0.55 | 0.61 | 0.59 | 0.59 |
| 80 | Original | 0.90 | 0.89 | 0.90 | 0.86 | 0.96 | 0.94 | 0.95 |
| 16 | BOWOA-AUC | 0.73 | 0.57 | 0.55 | 0.76 | 0.89 | 0.78 | 0.82 |
| 24 | BOWOA-GINI | 0.85 | 0.85 | 0.83 | 0.74 | 0.96 | 0.95 | 0.95 |
| 48 | BOWOA-KS | 0.9 | 0.88 | 0.9 | 0.88 | 0.97 | 0.94 | 0.96 |
| 23 | PSO-AUC | 0.85 | 0.85 | 0.86 | 0.83 | 0.96 | 0.94 | 0.95 |
| 27 | PSO-GINI | 0.84 | 0.89 | 0.89 | 0.77 | 0.96 | 0.92 | 0.9 |
| 47 | PSO-KS | 0.89 | 0.88 | 0.9 | 0.91 | 0.96 | 0.92 | 0.95 |
Fig. 5 The 15 most important features for the dataset of 2,157 SMEs, using RF
Fig. 6 The 15 most important features for the dataset of 3,111 SMEs, using RF