Renato Alexandre de Lima Lemos, Thiago Christiano Silva, Benjamin Miranda Tabak.
Abstract
This paper examines churn prediction of customers in the banking sector using a unique customer-level dataset from a large Brazilian bank. Our main contribution is in exploring this rich dataset, which contains prior client behavior traits that enable us to document new insights into the main determinants predicting future client churn. We conduct a horserace of many supervised machine learning algorithms under the same cross-validation and evaluation setup, enabling a fair comparison across algorithms. We find that the random forests technique outperforms decision trees, k-nearest neighbors, elastic net, logistic regression, and support vector machine models in several metrics. Our investigation reveals that customers with a stronger relationship with the institution, who have more products and services and who borrow more from the bank, are less likely to close their checking accounts. Using a back-of-the-envelope estimation, we find that our model has the potential to forecast losses of up to 10% of the operating result reported by the largest Brazilian banks in 2019, suggesting the model has a significant economic impact. Our results corroborate the importance of investing in cross-selling and upselling strategies focused on current customers. These strategies can have positive side effects on customer retention.
Keywords: Churn; Churn prediction; Financial services; Machine learning; Random forests
Year: 2022 PMID: 35281625 PMCID: PMC8898559 DOI: 10.1007/s00521-022-07067-x
Source DB: PubMed Journal: Neural Comput Appl ISSN: 0941-0643 Impact factor: 5.102
Fig. 1 Number of publications in journals and conferences from 2003 to 2019 with the co-occurrence of the terms machine learning and churn either in the title, abstract or keyword list in the Scopus dataset
Fig. 2 Schematic of the research steps used in this empirical investigation on customer churn prediction
Target variable (first row) and selected attributes (remaining rows) used in our customer churn prediction supervised task
| Class/Attribute | Data type | Description |
|---|---|---|
| Churned | Binary (Yes or No) | Customer closed their current accounts or stopped moving them for six months (churned) |
| Segment | Nominal (4 segments) | Customer segment (Basic income, middle class, high income, and very high income) |
| Automatic_Debt | Binary (Yes or No) | Use of the direct debit service - at least once in the last 60 days |
| Salary_Credit | Binary (Yes or No) | Receipt of salary - at least once in the last 60 days |
| Accreditation | Binary (Yes or No) | Membership to the accreditation service/card make-up |
| Insurance | Binary (Yes or No) | Ownership of an insurance product, consortium, capitalization or pension plan |
| Portability_Request | Binary (Yes or No) | Request for salary credit portability to another financial institution |
| Complaint_Request | Binary (Yes or No) | A registered complaint in channels managed by OUVID (Ombudsman, SAC, Procon, BACEN) |
| Automatic_Debt_DIFF | Real value in | Evolution of the use of the automatic debit service - at least once in the last 60 days |
| Salary_Credit_DIFF | Real value in | Evolution of salary receipt - at least once in the last 60 days |
| Insurance_DIFF | Real value in | Evolution of the insurance company’s product ownership - insurance, consortium, capitalization or pension |
| Qualified_Products | Integer | Number of products that the customer owns and that are indicated for their segment |
| Qualified_Products_Previous | Integer | Number of products that the customer owns and that are indicated for their segment - Position: 6 months before |
| Qualified_Products_DIFF | Integer | Number of products that the customer owns and that are indicated for their segment - Absolute variation between 6 months |
| Qualified_Products_PERC | Percentage | Number of products that the customer owns and that are indicated for their segment - Percentage change between 6 months |
| Products | Integer | Number of products the customer owns |
| Products_Previous | Integer | Number of products the customer owns - Position: 6 months before |
| Products_DIFF | Integer | Number of products that the customer has - Absolute change between 6 months |
| Products_PERC | Percentage | Number of products that the customer has - Percentage change between 6 months |
| Transactions | Value in R$ | Number of spontaneous movements carried out in the current account |
| Transactions_Previous | Value in R$ | Number of spontaneous movements carried out in current account - Position: 6 months before |
| Transactions_DIFF | Value in R$ | Number of spontaneous movements performed in the current account - Absolute variation between 6 months |
| Transactions_PERC | Percentage | Number of spontaneous movements performed in the current account - Percentage change between 6 months |
| Investment | Value in R$ | Volume invested in investments, savings or deposit account |
| Investment_Previous | Value in R$ | Volume invested in investments, savings or deposit account - Position: 6 months before |
| Investment_DIFF | Value in R$ | Volume invested in investments, savings or deposit account - Absolute change between 6 months |
| Investment_PERC | Percentage | Volume invested in investments, savings or deposit account - Percentage variation between 6 months |
| Credit | Value in R$ | The volume of commercial and housing loans active |
| Credit_Previous | Value in R$ | The volume of commercial and active housing credit - Position: 6 months before |
| Credit_DIFF | Value in R$ | The volume of commercial and active housing credit - Absolute change between 6 months |
| Credit_PERC | Percentage | The volume of commercial and active housing credit - Percentage change between 6 months |
| Profitability | Value in R$ | Profitability (financial return indicator) of the client, accumulated 12 months |
| Profitability_Previous | Value in R$ | Profitability (financial return indicator) of the client, accumulated 12 months - Position: 6 months before |
| Profitability_DIFF | Value in R$ | Profitability (financial return indicator) of the client, accumulated 12 months - Absolute change between 6 months |
| Profitability_PERC | Percentage | Profitability (financial return indicator) of the client, accumulated 12 months - Percentage change between 6 months |
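The `_DIFF` and `_PERC` attributes in the table above are derived from a current value and its value six months before. The following is a minimal sketch of that derivation, not the authors' code: the function names are hypothetical, and returning `None` when the previous value is zero is an assumption, since the source does not say how that case is handled.

```python
from typing import Optional


def absolute_change(current: float, previous: float) -> float:
    """Absolute change between the current value and the value 6 months before."""
    return current - previous


def percentage_change(current: float, previous: float) -> Optional[float]:
    """Percentage change relative to the previous value.

    Assumption: the change is undefined (None) when the previous value is zero.
    """
    if previous == 0:
        return None
    return 100.0 * (current - previous) / abs(previous)


def add_change_features(record: dict, base_names: list) -> dict:
    """Return a copy of `record` with <name>_DIFF and <name>_PERC added."""
    out = dict(record)
    for name in base_names:
        current = record[name]
        previous = record[f"{name}_Previous"]
        out[f"{name}_DIFF"] = absolute_change(current, previous)
        out[f"{name}_PERC"] = percentage_change(current, previous)
    return out


# Hypothetical customer record, for illustration only.
customer = {
    "Products": 6, "Products_Previous": 5,
    "Credit": 0.0, "Credit_Previous": 0.0,
}
features = add_change_features(customer, ["Products", "Credit"])
```

Here `features["Products_DIFF"]` is the one-product increase, while `features["Credit_PERC"]` stays undefined because the customer had no credit six months before.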
Comparison of the means of the data sample and the entire population
| Attribute | Population (churned=0) | Population (churned=1) | Sample (churned=0) | Sample (churned=1) |
|---|---|---|---|---|
| Number of customers | 8,879,145 | 834,716 | 250,000 | 250,000 |
| Automatic_Debt | 0.17 | 0.03 | 0.17 | 0.03 |
| Salary_Credit | 0.57 | 0.29 | 0.57 | 0.29 |
| Qualified_Products | 5.74 | 4.32 | 5.74 | 4.32 |
| Products | 6.75 | 5.10 | 6.75 | 5.10 |
| Transactions | 8.84 | 2.26 | 8.85 | 2.26 |
| Investment | 24,481.11 | 2,.450.42 | 23,953.49 | 21,785.40 |
| Credit | 38,051.83 | 16,690.32 | 38,165.85 | 16,591.60 |
Comparison of the medians of the data sample and the entire population
| Attribute | Population (churned=0) | Population (churned=1) | Sample (churned=0) | Sample (churned=1) |
|---|---|---|---|---|
| Number of customers | 8,879,145 | 834,716 | 250,000 | 250,000 |
| Automatic_Debt | 0 | 0 | 0 | 0 |
| Salary_Credit | 1 | 0 | 1 | 0 |
| Qualified_Products | 6 | 4 | 6 | 4 |
| Products | 6 | 5 | 6 | 5 |
| Transactions | 4.67 | 1 | 4.67 | 1 |
| Investment | 755.14 | 57.00 | 755.01 | 56.78 |
| Credit | 5,066.19 | 50.43 | 5,047.06 | 46.65 |
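Both comparison tables report an equal number of churned and non-churned customers in the sample, although the population is strongly imbalanced. One way such a balanced sample could be drawn is sketched below; the source does not describe the sampling procedure, so simple random sampling without replacement within each class is an assumption, and `balanced_sample` is a hypothetical helper.

```python
import random
from collections import defaultdict


def balanced_sample(rows, label_key, n_per_class, seed=0):
    """Draw the same number of rows from every class, without replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sample = []
    for label, group in sorted(by_class.items()):
        if len(group) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(group)} rows")
        sample.extend(rng.sample(group, n_per_class))

    rng.shuffle(sample)  # avoid having the classes stored in blocks
    return sample


# Toy population with roughly 10% churners (illustrative data only).
population = [{"id": i, "Churned": 1 if i % 10 == 0 else 0} for i in range(1000)]
sample = balanced_sample(population, "Churned", n_per_class=50)
churned_in_sample = sum(row["Churned"] for row in sample)
```

Comparing the per-class means and medians of such a sample with those of the population, as the tables do, is a quick check that the sample remains representative within each class.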
Comparison of generations: sample vs. population
| Generations | % sample | % population |
|---|---|---|
| Baby boomers (until 1960) | 21.9 | 23.6 |
| Generation X (1961–1980) | 41.3 | 42.3 |
| Generation Y (1981–1997) | 35.5 | 33.1 |
| Generation Z (1998–2009) | 0.9 | 0.6 |
| Generation alpha (since 2010) | 0 | 0 |
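The birth-year boundaries in the table above can be turned into a small lookup, sketched here with a hypothetical `generation` function. The cut-off years come from the table; everything else, including the toy birth years, is illustrative.

```python
from collections import Counter


def generation(birth_year: int) -> str:
    """Map a birth year to the generation labels used in the table."""
    if birth_year <= 1960:
        return "Baby boomers"
    if birth_year <= 1980:
        return "Generation X"
    if birth_year <= 1997:
        return "Generation Y"
    if birth_year <= 2009:
        return "Generation Z"
    return "Generation alpha"


def generation_shares(birth_years):
    """Percentage of customers per generation."""
    counts = Counter(generation(year) for year in birth_years)
    total = len(birth_years)
    return {label: 100.0 * n / total for label, n in counts.items()}


# Illustrative birth years, not taken from the dataset.
shares = generation_shares([1955, 1970, 1975, 1990])
```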
Comparison of state of residence: sample vs. population
| Region | State | % sample | % population |
|---|---|---|---|
| North | Acre | 0.4 | 0.4 |
| North | Amazonas | 0.8 | 0.7 |
| North | Amapá | 0.3 | 0.3 |
| North | Pará | 1.9 | 1.7 |
| North | Rondônia | 0.6 | 0.6 |
| North | Roraima | 0.2 | 0.2 |
| North | Tocantins | 0.5 | 0.5 |
| North | Total | 4.7 | 4.4 |
| Northeast | Alagoas | 1.5 | 1.9 |
| Northeast | Bahia | 4.5 | 4.4 |
| Northeast | Ceará | 2.9 | 2.8 |
| Northeast | Maranhão | 1.3 | 1.2 |
| Northeast | Paraíba | 1.2 | 1.2 |
| Northeast | Pernambuco | 2.9 | 2.8 |
| Northeast | Piauí | 1.1 | 1.1 |
| Northeast | Rio Grande do Norte | 1.2 | 1.3 |
| Northeast | Sergipe | 0.9 | 0.9 |
| Northeast | Total | 17.5 | 17.6 |
| Midwest | Distrito Federal | 1.7 | 1.8 |
| Midwest | Goiás | 4.9 | 5.5 |
| Midwest | Mato Grosso do Sul | 1.3 | 1.3 |
| Midwest | Mato Grosso | 1.3 | 1.2 |
| Midwest | Total | 9.2 | 9.8 |
| Southeast | Espírito Santo | 1.8 | 1.9 |
| Southeast | Minas Gerais | 11.5 | 11.6 |
| Southeast | Rio de Janeiro | 7.6 | 7.0 |
| Southeast | São Paulo | 24.2 | 22.9 |
| Southeast | Total | 45.1 | 43.4 |
| South | Paraná | 7.8 | 8.6 |
| South | Rio Grande do Sul | 8.4 | 8.6 |
| South | Santa Catarina | 5.8 | 6.1 |
| South | Total | 22.0 | 23.3 |
Fig. 3 Modeling strategy to construct our churn prediction model. Due to the temporal nature of our data, we must use historical data to forecast future behavior. Therefore, our attributes are composed of customer’s financial traits extracted during August 2018 to January 2019 (red color). Our target is whether the client churned in the following six months, i.e., February to July 2019 (blue color)
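The temporal setup in this caption, with attributes from August 2018 to January 2019 and the target from February to July 2019, can be sketched as two date windows. This is an illustrative sketch rather than the authors' pipeline: the function names are hypothetical, and reading "stopped moving them for six months" as "no transaction anywhere in the target window" is an assumption.

```python
from datetime import date

# Observation windows taken from the caption.
FEATURE_START, FEATURE_END = date(2018, 8, 1), date(2019, 1, 31)
TARGET_START, TARGET_END = date(2019, 2, 1), date(2019, 7, 31)


def in_window(day, start, end):
    return start <= day <= end


def feature_transactions(transaction_dates):
    """Transactions that may be used to build attributes (past data only)."""
    return [d for d in transaction_dates if in_window(d, FEATURE_START, FEATURE_END)]


def churned(transaction_dates, closing_date=None):
    """Target label: account closed, or no movement, during the target window."""
    closed = closing_date is not None and in_window(
        closing_date, TARGET_START, TARGET_END
    )
    # Assumption: inactivity means zero transactions in the whole target window.
    inactive = not any(
        in_window(d, TARGET_START, TARGET_END) for d in transaction_dates
    )
    return closed or inactive


# Hypothetical customers.
active = [date(2018, 9, 5), date(2019, 3, 10)]
dormant = [date(2018, 9, 5), date(2018, 12, 20)]
labels = (
    churned(active),
    churned(dormant),
    churned(active, closing_date=date(2019, 5, 2)),
)
```

Keeping the attribute window strictly before the target window is what prevents information from the prediction period from leaking into the features.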
Fig. 4 Boxplot of the attribute check account transactions/operations (vertical axis) versus the target binary variable indicating whether the client churned in the next six months (horizontal axis). We categorize the plots according to the customer segment (very high, high, middle, low income)
Fig. 5 Boxplot of the attribute number of qualified products (vertical axis) versus the target binary variable indicating whether the client churned in the next six months (horizontal axis). We categorize the plots according to the customer segment (very high, high, middle, low income)
Fig. 6 Boxplot of the attribute volume of credit (vertical axis) versus the target binary variable indicating whether the client churned in the next six months (horizontal axis). We categorize the plots according to the customer segment (very high, high, middle, low income)
Models used for classification (horserace)
| Classifier | Method’s alias | Description |
|---|---|---|
| Decision trees | rpart | Recursive partitioning and regression trees |
| k-nearest neighbors | knn | k-nearest neighbors |
| Logistic regression | glm (family = binomial) | Generalized linear model |
| Elastic net | elasticnet | Logistic regression regularized with lasso and ridge |
| Support vector machines | svm | SVM with radial kernel function |
| Random forests | rf | Random forests |
Performance of the optimized models on the test set. Test data was held out during the entire process (10% of the dataset)
| Classifier | True positive | True negative | False positive | False negative | Accuracy | Precision | F-measure |
|---|---|---|---|---|---|---|---|
| Decision tree | 38.8 | 39.4 | 10.6 | 11.2 | 78.2 | 78.5 | 78.05 |
| Knn | 37.8 | 40.1 | 9.9 | 12.2 | 77.9 | 79.2 | 77.36 |
| Elastic net | 40.5 | 35.7 | 14.3 | 9.5 | 76.2 | 73.9 | 77.29 |
| Logistic regression | 40.4 | 35.8 | 14.2 | 9.6 | 76.2 | 74.0 | 77.26 |
| Svm | 39.6 | 40.7 | 9.3 | 10.4 | 80.3 | 81.0 | 80.09 |
| Random forests | 40.1 | 42.6 | 7.4 | 9.9 | 82.8 | 84.4 | 82.25 |
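The accuracy, precision and F-measure columns follow from the four confusion-matrix shares in the same row. The sketch below recomputes them with the standard definitions, using the random forests and decision tree rows as inputs; the function name is hypothetical, and small differences from the table are expected because the published shares are rounded to one decimal.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts or shares."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Shares (in percent of the test set) copied from the table rows.
random_forests = classification_metrics(tp=40.1, tn=42.6, fp=7.4, fn=9.9)
decision_tree = classification_metrics(tp=38.8, tn=39.4, fp=10.6, fn=11.2)
```

Both recomputed rows agree with the reported accuracy, precision and F-measure to within rounding, which suggests the table uses these standard definitions.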
We used the training data (90% of the dataset) for model selection: we apply repeated k-fold cross-validation (ten independent repetitions) to optimize the hyperparameters of each model, using ROC as the optimizing metric. After selecting the best hyperparameters, we retrain each model with these optimized values on the entire training set, because each cross-validation fit holds out one fold.
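The model-selection procedure described above can be sketched with plain index bookkeeping. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, `k=5` is only a placeholder for the number of folds, and `score_fn` stands in for fitting a model on the training indices and returning its ROC score on the validation indices.

```python
import random
from statistics import mean


def repeated_kfold(n_samples, k, repeats, seed=0):
    """Yield (training, validation) index lists for repeated k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh partition for every repetition
        folds = [indices[i::k] for i in range(k)]
        for held_out in range(k):
            validation = folds[held_out]
            training = [
                i for j, fold in enumerate(folds) if j != held_out for i in fold
            ]
            yield training, validation


def select_hyperparameter(candidates, score_fn, n_samples, k, repeats):
    """Pick the candidate with the best mean validation score."""
    splits = list(repeated_kfold(n_samples, k, repeats))
    mean_scores = {
        c: mean(score_fn(c, training, validation) for training, validation in splits)
        for c in candidates
    }
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Toy scoring function (assumption): pretends that the value 3 is optimal.
def toy_score(candidate, training, validation):
    return -abs(candidate - 3)


splits = list(repeated_kfold(n_samples=20, k=5, repeats=10))
best, scores = select_hyperparameter([1, 3, 5], toy_score, n_samples=20, k=5, repeats=10)
```

After selection, the chosen value would be used to refit the model once on all training indices, which matches the retraining step in the paragraph above.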
Fig. 7 Performance metrics on the test set of the six employed classifiers using a box-and-whisker graph. The black dot represents the median performance (color figure online)
Fig. 8 Performance of the optimized models for the metric ROC in the test set. The dot represents the average performance and the vertical bars, the standard error. The horizontal red-dashed line is the ensemble’s performance
Result (ROC) of each model and the associated standard deviation on the test set, including the ensemble
| Method | ROC | Std. Dev. ROC |
|---|---|---|
| Ensemble of classifiers below | 0.9018 | |
| Decision tree | 0.8628 | 0.0017 |
| Knn | 0.8485 | 0.0020 |
| Elastic net | 0.8461 | 0.0018 |
| Logistic regression | 0.8462 | 0.0018 |
| Svm | 0.8746 | 0.0014 |
| Random forests | 0.9015 | 0.0013 |
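To make the ROC column concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. The source does not state how the ensemble combines its members, so averaging the predicted scores is an assumption; the function names and toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def average_ensemble(score_lists):
    """Assumed ensemble rule: mean of the members' scores per customer."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


# Toy data: 1 = churned, 0 = did not churn.
labels = [1, 1, 0, 0]
model_a = [0.9, 0.4, 0.5, 0.1]
model_b = [0.6, 0.8, 0.3, 0.7]
ensemble = average_ensemble([model_a, model_b])
auc_a, auc_b, auc_ensemble = (
    roc_auc(labels, model_a),
    roc_auc(labels, model_b),
    roc_auc(labels, ensemble),
)
```

In this toy case each model misranks one different pair and averaging corrects both. In the table, by contrast, the ensemble is only marginally ahead of random forests. The pairwise loop is quadratic, so a rank-based formulation would be preferable at the scale of the paper's test set.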
Fig. 9 Trained decision tree to predict customer churn in the next six months. Within each node, the first row shows the predicted class if one stops traversing the tree at that node. The second row shows the proportion of clients that do not churn and churn for the subset of data that falls in that tree node. The third row shows the support: the fraction of data that falls within that tree as a share of the total number of observations (in percent)
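The three rows displayed in each tree node can be derived from the labels of the observations that reach that node. A minimal sketch follows, with a hypothetical `node_summary` helper; assigning the majority class and breaking ties toward "no churn" are assumptions.

```python
def node_summary(node_labels, total_observations):
    """Summarize a tree node from the labels (1 = churn) that fall into it."""
    n = len(node_labels)
    churn_share = sum(node_labels) / n
    return {
        # Row 1: class predicted if traversal stops here (majority class).
        "predicted_class": 1 if churn_share > 0.5 else 0,
        # Row 2: proportions of non-churners and churners in the node.
        "proportions": (1.0 - churn_share, churn_share),
        # Row 3: support, as a percentage of all observations.
        "support_percent": 100.0 * n / total_observations,
    }


# Toy node holding 4 of 10 observations, one of which churned.
summary = node_summary([0, 0, 0, 1], total_observations=10)
```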
Fig. 10 Ranking of attributes. We take the average rank of each attribute across the methods used in the horserace (when applicable)
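The aggregation described in this caption, averaging each attribute's rank over the methods that provide one, can be sketched as follows. The per-method ranks below are invented for illustration and are not results from the paper; `average_ranks` is a hypothetical helper, and treating a missing attribute as "not applicable" for that method is an assumption about the phrase "when applicable".

```python
from statistics import mean


def average_ranks(rankings):
    """Average each attribute's rank over the methods that rank it.

    `rankings` maps a method name to {attribute: rank}, where rank 1 is the
    most important attribute.
    """
    collected = {}
    for ranks in rankings.values():
        for attribute, rank in ranks.items():
            collected.setdefault(attribute, []).append(rank)
    averaged = {attribute: mean(ranks) for attribute, ranks in collected.items()}
    # Best (lowest) average rank first; attribute name breaks ties.
    return sorted(averaged.items(), key=lambda item: (item[1], item[0]))


# Hypothetical ranks, for illustration only.
rankings = {
    "rf": {"Transactions": 1, "Products": 2, "Credit": 3},
    "rpart": {"Transactions": 2, "Products": 1},
    "glm": {"Transactions": 1, "Credit": 2, "Products": 3},
}
ordered = average_ranks(rankings)
```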