Yoontae Hwang, Yongjae Lee, Frank J Fabozzi.
Abstract
Households are becoming increasingly heterogeneous. While previous studies have revealed many important insights (e.g., wealth effect, income effect), they could only incorporate two or three variables at a time. A more detailed understanding of complex household heterogeneity, however, requires considering more variables simultaneously. In this study, we argue that advanced clustering techniques can be useful for investigating high-dimensional household heterogeneity. A deep learning-based clustering method is used to effectively handle the high-dimensional balance sheet data of approximately 50,000 households. Employing appropriate dimension-reduction techniques is the key to incorporating the full joint distribution of high-dimensional data in the clustering step. Our study suggests that various variables should be used together to explain household heterogeneity. Asset variables are found to be crucial for understanding heterogeneity within wealthy households, while debt variables are more important for households that are not wealthy. In addition, relationships with sociodemographic variables (e.g., age, education, and family size) were further analyzed. Although the clusters are found based only on financial variables, they are shown to be closely related to most sociodemographic variables.
Keywords: Clustering; Deep learning; Heterogeneous household; High-dimensional data; Household finance; Machine learning
Year: 2022 PMID: 36164486 PMCID: PMC9491674 DOI: 10.1007/s10479-022-04900-3
Source DB: PubMed Journal: Ann Oper Res ISSN: 0254-5330 Impact factor: 4.820
Fig. 1 Average portfolio weights of Korean households in 2017–2020
Fig. 2 N2D framework for deep clustering by McConville et al. (2021) (Created by the authors)
Fig. 3 Optimal household clusters with different numbers of clusters
Variable deviations and total count of cluster label changes
| Experiments | Average absolute difference: Asset | Average absolute difference: Debt | Average absolute difference: Expenditure | Total count |
|---|---|---|---|---|
| 4 | 0.261 | 0.391 | 0.071 | 7,473 |
| 5 | 0.235 | 0.308 | 0.069 | 8,587 |
| 6 | 0.233 | 0.335 | 0.069 | 9,554 |
| 7 | 0.239 | 0.333 | 0.070 | 12,202 |
| 9 | 0.226 | 0.347 | 0.067 | 14,458 |
| 10 | 0.230 | 0.313 | 0.070 | 15,767 |
| 11 | 0.222 | 0.316 | 0.067 | 16,212 |
| 12 | 0.231 | 0.308 | 0.070 | 16,548 |
Clustering performance comparison
| Metric | k-means | DBSCAN | Hierarchical clustering | Hierarchical | Deep clustering |
|---|---|---|---|---|---|
| Silhouette (↑) | 0.317 | 0.065 | 0.292 | 0.154 | |
| Davies-Bouldin index (↓) | 1.418 | 1.278 | 1.515 | 1.553 | |
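
For readers unfamiliar with the two scores in the table above, the following is a minimal pure-Python sketch of how a silhouette coefficient (higher is better) and a Davies-Bouldin index (lower is better) can be computed from points and cluster labels. The toy points, the two label lists, and the function names are illustrative assumptions, not the household data or the authors' code; Euclidean distance is also an assumption.

```python
import math


def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    return clusters


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = group_by_label(points, labels)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Davies-Bouldin index; assumes at least two clusters."""
    clusters = group_by_label(points, labels)
    centers = {lab: centroid(pts) for lab, pts in clusters.items()}
    # scatter: mean distance of cluster members to their own centroid
    scatter = {
        lab: sum(math.dist(p, centers[lab]) for p in pts) / len(pts)
        for lab, pts in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)


# Hypothetical toy data: two tight, well-separated groups of four points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]   # mixes the two groups

for name, labels in (("good", good_labels), ("bad", bad_labels)):
    print(name, silhouette(points, labels), davies_bouldin(points, labels))
```

On this toy data the well-separated labelling receives the higher silhouette and the lower Davies-Bouldin index, which is the reading implied by the arrows in the table.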
Fig. 4 Average portfolio weights of different household clusters
Average values (proportions) of asset, debt, expenditure variables of different household clusters
| No | Deposit savings | Other savings | Long-term rental deposit | Residential housing | Nonresidential real estate | Other real assets |
|---|---|---|---|---|---|---|
| 1 | 16,089.8 (12.1%) | 662.4 (0.5%) | 1343.1 (1.0%) | 4673.1 (3.5%) | ||
| 2 | 10,386.6 (18.4%) | 801.0 (1.4%) | 2470.8 (4.4%) | 4167.7 (7.4%) | ||
| 3 | 6507.1 (16.6%) | 353.3 (0.9%) | 1.1 (0.0%) | 227.3 (0.6%) | 1589.6 (4.1%) | |
| 4 | 5189.1 (24.4%) | 219.9 (1.0%) | 82.1 (0.4%) | 1500.5 (7.1%) | 1612.5 (7.6%) | |
| 5 | 327.1 (1.6%) | 1.1 (0.0%) | 674.7 (3.3%) | 1076.2 (5.3%) | ||
| 6 | 18.5 (0.3%) | 95.4 (1.5%) | ||||
| 7 | 31.3 (2.3%) | 18.5 (1.4%) | ||||
| 8 | 1.3 (0.6%) | 0.0 (0.0%) | 0.2 (0.1%) | 5.2 (3.6%) | ||
(Unit: KRW 10,000)
Fig. 5 Major asset classes of households with different levels of wealth
Fig. 6 Major loan types of households with different levels of wealth
Fig. 7 Decision tree for household clusters
Fig. 8 Decomposition of Gini coefficients into between-group, within-group, and overlapping inequalities
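
The three-part split named in the Fig. 8 caption can be illustrated with a short sketch. The exact decomposition used by the authors is not shown in this record, so the code below assumes one common textbook form: the within-group term is the sum of each group's Gini weighted by its population share times its wealth share, the between-group term is the Gini obtained after replacing every value with its group mean, and the overlapping term is the remainder. The function names and the toy groups are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum formula with ranks 1..n on the ascending values.
    ranked = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * ranked / (n * total) - (n + 1) / n


def decompose_gini(groups):
    """Split the overall Gini into between, within, and overlap terms.

    Assumed form (not necessarily the one used in the paper):
    overall = between + within + overlap.
    """
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    total = sum(pooled)
    overall = gini(pooled)
    # Within: each group's Gini weighted by population share * value share.
    within = sum(
        (len(g) / n) * (sum(g) / total) * gini(g) for g in groups
    )
    # Between: every member is replaced by the mean of its group.
    smoothed = [sum(g) / len(g) for g in groups for _ in g]
    between = gini(smoothed)
    # Overlap: what remains when group value ranges interleave.
    overlap = overall - between - within
    return {"overall": overall, "between": between,
            "within": within, "overlap": overlap}


# Hypothetical toy groups (not household data).
separated = decompose_gini([[1, 2], [10, 20]])    # ranges do not overlap
interleaved = decompose_gini([[1, 10], [2, 20]])  # ranges overlap

print(separated)
print(interleaved)
```

With groups whose value ranges do not overlap, the overlapping term vanishes and the between- and within-group terms account for the whole coefficient; when the ranges interleave, a positive overlapping term appears.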
List of independent variables for logistic regression
| Independent variable | Description |
|---|---|
| Area of residence | Living in Seoul metropolitan area or not |
| Gender of householder | Male or not |
| Number of family members | (Numbers are directly used for regression) |
| Education level of householder | Under middle school, high school, or higher education |
| Home ownership | None (includes monthly rental or free company housing), long-term rental, or homeowner |
| Age of householder | Under 39, 40 ~ 49, 50 ~ 59, or upper 60 |
| Income level | Low-income (1st and 2nd income quintiles), mid-income (3rd income quintile), or high-income (4th and 5th income quintiles) |
| Employment status | Employed or not (includes freelancers or helping family business) |
Logistic regression results of clusters with respect to socio-demographic variables
| Variables | Cluster 1 Coeff | Cluster 1 Odds | Cluster 2 Coeff | Cluster 2 Odds | Cluster 3 Coeff | Cluster 3 Odds | Cluster 4 Coeff | Cluster 4 Odds |
|---|---|---|---|---|---|---|---|---|
| Constant | − 5.224*** | 0.005 | − 3.972*** | 0.019 | − 2.833*** | 0.059 | − 3.025*** | 0.049 |
| Metropolitan area | − | − | ||||||
| Gender (male) | − 0.109*** | 0.897 | − 0.014 | 0.986 | ||||
| Number of members | − | 0.062*** | 1.064 | − 0.047*** | 0.954 | |||
| Education (under middle school) | ||||||||
| High school | 0.246*** | 1.279 | − 0.064* | 0.938 | − 0.053 | 0.949 | − | |
| Higher education | 0.129*** | 1.138 | 0.134*** | 1.143 | − | |||
| Home ownership (none) | ||||||||
| Long-term rental | 0.489*** | 1.630 | 0.597*** | 1.816 | 0.063 | 1.065 | − 0.463*** | 0.629 |
| Homeowner | 1.486*** | 4.421 | 0.656*** | 1.927 | ||||
| Age (under 39) | ||||||||
| 40 ~ 49 | 0.665*** | 1.945 | 0.596*** | 1.815 | − | |||
| 50 ~ 59 | 1.330*** | 3.780 | 1.000*** | 2.718 | − | 0.164*** | 1.178 | |
| Upper 60 | − | − 0.485*** | 0.616 | |||||
| Income level (low-income) | ||||||||
| Mid-income | 0.338*** | 1.403 | 0.356*** | 1.428 | − 0.119*** | 0.888 | − 0.031*** | 0.734 |
| High-income | − 0.358*** | 0.699 | − 0.257 | 0.773 | ||||
| Employment | 0.012 | 1.012 | − | 0.193*** | 1.213 | |||
| Number of households | 5937 | 10,644 | 10,699 | 10,001 | ||||
* p < .05, ** p < .01, *** p < .001
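
The Coeff and Odds columns of a logistic regression are linked by odds = exp(coefficient). The sketch below recomputes that relation for a few coefficient and odds pairs copied from rows of the table that survive intact (Constant, Long-term rental, Mid-income); the helper name `is_consistent` and the rounding tolerance are assumptions made for illustration.

```python
import math

# (coefficient, reported odds) pairs copied from the table, Clusters 1-4.
rows = {
    "Constant": [(-5.224, 0.005), (-3.972, 0.019), (-2.833, 0.059), (-3.025, 0.049)],
    "Long-term rental": [(0.489, 1.630), (0.597, 1.816), (0.063, 1.065), (-0.463, 0.629)],
    "Mid-income": [(0.338, 1.403), (0.356, 1.428), (-0.119, 0.888), (-0.031, 0.734)],
}


def is_consistent(coeff, odds, tol=0.005):
    """True when exp(coeff) matches the reported odds within `tol`.

    The tolerance is an assumed allowance for three-decimal rounding.
    """
    return abs(math.exp(coeff) - odds) <= tol


for variable, pairs in rows.items():
    for cluster, (coeff, odds) in enumerate(pairs, start=1):
        flag = "ok" if is_consistent(coeff, odds) else "check"
        print(f"{variable}, Cluster {cluster}: "
              f"exp({coeff}) = {math.exp(coeff):.3f}, reported {odds} [{flag}]")
```

Read this way, the Long-term rental odds of 1.630 for Cluster 1 mean that, relative to the "none" baseline and holding the other variables fixed, the odds of belonging to Cluster 1 are about 1.63 times higher. Every pair above satisfies the relation within rounding except the Mid-income entry for Cluster 4, where exp(−0.031) is about 0.969 rather than 0.734; this may be a typographical or extraction issue in one of the two values.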
Fig. 9 Transition matrix between household clusters
Fig. 10 Variable importance weights of between-group inequalities of Gini coefficients in different years
Fig. 11 Transition matrix between household clusters before (left) and after (right) COVID-19
| Age | Clusters with older householders tend to be wealthier, which is natural in the sense that households accumulate wealth during their working years. However, there are two strong exceptions (Clusters 3 and 8) |
| Education | The three wealthiest clusters are highly educated, while the three poorest clusters are poorly educated. Of the two middle-class groups, the one in the metropolitan area (Cluster 5) is highly educated and the one outside the metropolitan area (Cluster 4) is poorly educated. In addition, Cluster 3 is highly educated but has low income |
| Income | The two wealthiest clusters have high income, and the three poorest clusters have low income. However, the three clusters in the middle exhibit mixed results (especially Cluster 3) |
| Number of family members | While there is no clear linear relationship between family size and wealth, it is interesting to note that the wealthiest and the poorest clusters are highly likely to consist of small families |
| Area of residence | No overall trend is found, but typical rural–urban differences can be seen between the two middle class groups (Clusters 4 and 5) |
Hyperparameter search range
| Hyperparameter | Range |
|---|---|
| Batch size | [16, 24, 32, 64] |
| Learning rate | [0.0001, 0.001, 0.001, 0.01] |
| Epochs | [20, 50, 75, 100] |
| # of nodes in each layer | [10–1000, 10–1000, 10–1000] |
| | [0.9 ~ 1, 0.5–1.0] |
Summary statistics of household balance sheet of clusters
Panel A. Assets
| No | Deposit savings: Mean | Deposit savings: Median | Other savings: Mean | Other savings: Median | Long-term rental deposit: Mean | Long-term rental deposit: Median | Residential housing: Mean | Residential housing: Median | Nonresidential real estate: Mean | Nonresidential real estate: Median | Other real assets: Mean | Other real assets: Median |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.105 (0.128) | 0.061 | 0.003 (0.019) | 0.000 | 0.007 (0.039) | 0.000 | 0.390 (0.303) | 0.330 | 0.461 (0.294) | 0.491 | 0.028 (0.068) | 0.007 |
| 2 | 0.178 (0.159) | 0.132 | 0.012 (0.047) | 0.000 | 0.032 (0.091) | 0.000 | 0.304 (0.226) | 0.313 | 0.382 (0.242) | 0.373 | 0.080 (0.137) | 0.030 |
| 3 | 0.143 (0.139) | 0.106 | 0.006 (0.029) | 0.000 | 0.000 (0.010) | 0.000 | 0.813 (0.157) | 0.845 | 0.004 (0.020) | 0.000 | 0.036 (0.056) | 0.014 |
| 4 | 0.229 (0.157) | 0.213 | 0.008 (0.032) | 0.000 | 0.004 (0.029) | 0.000 | 0.613 (0.246) | 0.650 | 0.074 (0.175) | 0.000 | 0.072 (0.087) | 0.0447 |
| 5 | 0.254 (0.249) | 0.173 | 0.010 (0.050) | 0.000 | 0.654 (0.271) | 0.712 | 0.000 (0.002) | 0.000 | 0.015 (0.069) | 0.000 | 0.043 (0.063) | 0.017 |
| 6 | 0.467 (0.312) | 0.441 | 0.002 (0.103) | 0.000 | 0.359 (0.316) | 0.297 | 0.001 (0.019) | 0.000 | 0.005 (0.039) | 0.000 | 0.130 (0.182) | 0.474 |
| 7 | 0.418 (0.348) | 0.327 | 0.011 (0.067) | 0.000 | 0.389 (0.355) | 0.322 | 0.004 (0.054) | 0.000 | 0.003 (0.045) | 0.000 | 0.174 (0.262) | 0.000 |
| 8 | 0.683 (0.408) | 1.000 | 0.003 (0.042) | 0.000 | 0.273 (0.401) | 0.000 | 0.000 (0.021) | 0.000 | 0.001 (0.029) | 0.000 | 0.037 (0.147) | 0.000 |