Chao Wu, Guolong Wang, Simon Hu, Yue Liu, Hong Mi, Ye Zhou, Yi-Ke Guo, Tongtong Song.
Abstract
For decades, traditional correlation analysis and regression models have been used in social science research. However, advances in machine learning algorithms make it possible to apply machine learning techniques to social science research and social issues, where they may outperform standard regression methods in some cases. Against this background, this article proposes a methodological workflow for data analysis with machine learning techniques that has the potential to be widely applied to social issues. Specifically, the workflow tries to uncover the natural mechanisms behind social issues from a data-driven perspective, covering the steps from feature selection to model building. The advantage of data-driven techniques in feature selection is that the workflow can be built without relying heavily on prior knowledge and theory from social science. The advantage of using machine learning techniques in modelling is that they can uncover non-linear and complex relationships behind social issues. The main purpose of our methodological workflow is to find important fields relevant to the target and to provide appropriate predictions. However, explaining the results still requires theory and knowledge from social science. In this paper, we applied the methodological workflow to left-behind children as the social issue case, and all steps and full results are included.
Year: 2020 PMID: 33216786 PMCID: PMC7678991 DOI: 10.1371/journal.pone.0242483
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Workflow design.
Fig 2. Pre-processing.
Fig 3. The relationship for feature selection and modelling.
Fig 4. Feature selection and modelling.
High scoring features.
| Measure | Rank 1 | Rank 2 | Rank 3 | Rank 4 | Rank 5 | Rank 6 | Rank 7 |
|---|---|---|---|---|---|---|---|
| | marriage | hukou | from | industry | job | edu | company |
| | marriage | edu | range | from | hukou | industry | job |
Top 10 high scoring features for CC measure.
| Filtered Feature | CC |
|---|---|
| museum visitors | 0.7302 |
| highway passenger traffic | 0.7155 |
| secondary vocational school graduates eligible for qualifications | 0.7072 |
| special education enrollment | 0.7032 |
| national special entitled groups with regular subsidy | 0.7021 |
| passenger volume | 0.6977 |
| national special entitled groups | 0.6908 |
| construction employments in urban units | 0.6819 |
| student-teacher ratio of primary school | 0.6745 |
| rural population | 0.6694 |
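A filter ranking of the kind shown in the CC table can be sketched as follows. This is a minimal illustration, assuming CC denotes the Pearson correlation coefficient between each candidate feature and the target and that features are ranked by its absolute value; the source does not confirm either detail. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math


def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_features_by_cc(features, target, top_k=10):
    """Return the top_k (name, |CC|) pairs, highest score first.

    Ranking by absolute value is an assumption of this sketch.
    """
    scores = {name: abs(pearson_cc(col, target)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical toy data, for illustration only.
toy_features = {
    "feature_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature_b": [2.0, 1.0, 4.0, 3.0, 5.0],
    "feature_c": [5.0, 5.0, 5.0, 5.0, 5.0],
}
toy_target = [1.1, 1.9, 3.2, 3.9, 5.1]
ranking = rank_features_by_cc(toy_features, toy_target, top_k=2)
```

A filter measure like this scores each feature independently of any model, which is why the same ranked list can feed both the SVR and the NNR models.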
Top 10 high scoring features for r2 measure.
| Filtered Feature | r2 |
|---|---|
| highway passenger traffic | 0.3323 |
| passenger volume | 0.3269 |
| student-teacher ratio of primary school | 0.3085 |
| number of inpatients | 0.2941 |
| number of discharged patients | 0.2928 |
| health expenditure from local finance | 0.2869 |
| number of high schools | 0.2787 |
| secondary vocational school graduates eligible for qualifications | 0.2262 |
| construction area of residential house | 0.2162 |
| number of automatic weather stations | 0.2104 |
Parameters for SVR model with filtered features.
| Trade-off Constant C | Kernel Function | |
|---|---|---|
| 1000 | RBF | 0.1 |
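The two labelled parameters in the table above map directly onto a support vector regressor. The sketch below assumes scikit-learn as the implementation, which the source does not state, and uses synthetic data in place of the paper's filtered features. The third, unlabelled value in the table (0.1) is not passed to any argument here, because the source does not say which parameter it belongs to; all other hyperparameters are left at the library defaults.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic stand-in data: 80 samples with 5 hypothetical features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(80, 5))
y = 0.3 * np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.02, 80)

# Simple hold-out split (an assumption; the source does not describe the split here).
X_train, X_test = X[:60], X[60:]
y_train, y_test = y[:60], y[60:]

# C and the RBF kernel come from the table; everything else is a default.
model = SVR(kernel="rbf", C=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
test_mse = mean_squared_error(y_test, predictions)
```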
Parameters for NNR model with filtered features.
| Number of Hidden Neurons (5 Inputs) | Number of Hidden Neurons (10 Inputs) | Loss Function | Learning Algorithm | Learning Rate | Coefficient of L2 Regularization | Activation Function |
|---|---|---|---|---|---|---|
| 5 | 8 | MSE | SGD | 0.01 | 0.01 | ReLU, Tanh |
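For orientation, the neural network settings in the table above could be expressed roughly as below. This is a sketch under several assumptions that the source does not confirm: scikit-learn's `MLPRegressor` as the implementation, a single hidden layer, and ReLU as the hidden activation (the table lists both ReLU and Tanh without saying where each is used). `MLPRegressor` minimises squared error, which corresponds to the MSE loss in the table. The data are synthetic, and `max_iter` and `random_state` are illustrative choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data with 5 hypothetical input features.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(80, 5))
y = 0.4 * X[:, 0] + 0.3 * X[:, 1] ** 2 + rng.normal(0, 0.02, 80)

model = MLPRegressor(
    hidden_layer_sizes=(5,),   # 5 hidden neurons for the 5-input case (one layer assumed)
    activation="relu",         # assumption: the table lists both ReLU and Tanh
    solver="sgd",              # learning algorithm from the table
    learning_rate_init=0.01,   # learning rate from the table
    alpha=0.01,                # L2 regularization coefficient from the table
    max_iter=2000,             # illustrative, not from the source
    random_state=0,            # illustrative, for reproducibility
)
model.fit(X, y)
predictions = model.predict(X)
```

For the 10-input case, the table gives 8 hidden neurons, which would correspond to `hidden_layer_sizes=(8,)` under the same single-layer assumption.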
Performances for SVR models built by filter approach.
| Number of features | Filter measure | | MSE |
|---|---|---|---|
| | CC | 0.4717 | 0.0492 |
| | MIC | 0.5668 | 0.0404 |
| | | 0.5788 | 0.0392 |
| | CC | 0.7054 | 0.0274 |
| | MIC | 0.4552 | 0.0508 |
| | | 0.6715 | 0.0306 |
Average performance for NNR models built by filter approach.
| Number of features | Filter measure | | MSE |
|---|---|---|---|
| | CC | 0.4334 | 0.0528 |
| | MIC | 0.3796 | 0.0578 |
| | | 0.3912 | 0.0567 |
| | CC | 0.5616 | 0.0408 |
| | MIC | 0.4360 | 0.0524 |
| | | 0.5274 | 0.0440 |
Fig 5. Prediction results for SVR models by filter method.
Fig 6. Features, chromosomes and population.
Fig 7. Clustering-based search.
Parameters for SVR model by Wrapper Approach.
| Trade-off Constant C | Kernel Function |
|---|---|
| 100 | RBF |
Parameters for NNR model by Wrapper Approach.
| Number of Hidden Neurons (5 Inputs) | Number of Hidden Neurons (10 Inputs) | Loss Function | Learning Algorithm | Learning Rate | Coefficient of L2 Regularization | Activation Function |
|---|---|---|---|---|---|---|
| 5 | 8 | MSE | SGD | 0.01 | 0.01 | ReLU, Tanh |
Average performances for SVR models built by Wrapper Approach.
| Number of features | Search method | | MSE |
|---|---|---|---|
| | GA | 0.3733 | 0.0584 |
| | K-means clustering | 0.5961 | 0.0371 |
| | GA | 0.4261 | 0.0532 |
| | K-means clustering | 0.5352 | 0.0432 |
Average performances for NNR models built by Wrapper Approach.
| Number of features | Search method | | MSE |
|---|---|---|---|
| | GA | 0.3840 | 0.0574 |
| | K-means clustering | 0.4546 | 0.0510 |
| | GA | 0.3928 | 0.0563 |
| | K-means clustering | 0.5122 | 0.0451 |
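The GA search method named in the wrapper tables can be illustrated with a generic genetic algorithm over fixed-size feature subsets. The sketch below is not the authors' implementation: the chromosome encoding (a tuple of feature indices), truncation selection, union-based crossover, swap mutation, and all parameter values are assumptions chosen for brevity. In a real wrapper approach the fitness function would train and score a model such as SVR on each candidate subset; here a hypothetical toy fitness keeps the example self-contained.

```python
import random


def ga_select(n_features, k, fitness, pop_size=20, generations=30,
              mutation_rate=0.2, seed=0):
    """Search for a k-feature subset that maximises `fitness` with a simple GA."""
    rng = random.Random(seed)
    all_idx = list(range(n_features))
    # Each chromosome is a sorted tuple of k distinct feature indices.
    population = [tuple(sorted(rng.sample(all_idx, k))) for _ in range(pop_size)]

    def crossover(a, b):
        # The child draws k genes from the union of both parents' genes.
        pool = sorted(set(a) | set(b))
        return tuple(sorted(rng.sample(pool, k)))

    def mutate(chrom):
        # With some probability, swap one selected feature for an unselected one.
        if rng.random() < mutation_rate and k < n_features:
            genes = list(chrom)
            unused = [i for i in all_idx if i not in chrom]
            genes[rng.randrange(k)] = rng.choice(unused)
            return tuple(sorted(genes))
        return chrom

    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]  # truncation selection
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children  # parents survive unchanged (elitism)
    return max(population, key=fitness)


# Hypothetical per-feature scores standing in for a model-based fitness.
usefulness = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6, 0.0, 0.95, 0.15, 0.4]


def toy_fitness(chrom):
    return sum(usefulness[i] for i in chrom)


best_subset = ga_select(len(usefulness), k=5, fitness=toy_fitness)
```

Because each fitness evaluation in a wrapper approach means fitting a model, the population size and number of generations dominate the running time, which is the usual trade-off against the cheaper filter approach.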
5 features by Wrapper Approach.
| Feature |
|---|
| residential real estate development investment |
| number of motorized thresher |
| construction employments in urban units |
| national special entitled groups with regular subsidy |
| student-teacher ratio of primary school |
10 features by Wrapper Approach.
| Feature |
|---|
| construction area of residential house |
| number of domestic design patent application examined |
| mainland residents registration of marriage |
| hospital bed using-days |
| number of special education students |
| number of performing arts groups |
| fresh vegetables consumer price index |
| national special entitled groups with regular subsidy |
| average wage of employees in urban units |
| highway passenger transportation |
Performance of two chosen SVR models.
| Number of features | | MSE |
|---|---|---|
| | 0.7164 | 0.0264 |
| | 0.7033 | 0.0276 |
Fig 8. Prediction results for SVR models by Wrapper Method.
Fig 9. Prediction results and residuals of Model 1.
Fig 11. Prediction results and residuals of Model 3.
Fig 12. SVR with 5 features and clustering method.
Fig 19. NNR with 10 features and genetic algorithm.
Fig 20. Simulation results of Model 1.
Fig 21. Simulation results of Model 2.
Fig 22. Simulation results of Model 3.
Top 10 high scoring features for MIC measure.
| Filtered Feature | MIC |
|---|---|
| special education enrollment | 0.8215 |
| vegetable acreage | 0.8215 |
| number of special education students | 0.8215 |
| national special entitled groups with regular subsidy | 0.7412 |
| national special entitled groups | 0.7412 |
| office sales | 0.7320 |
| number of health checked at outpatient | 0.7070 |
| secondary vocational school enrollment | 0.6969 |
| number of secondary vocational school students | 0.6969 |
| secondary vocational school graduates eligible for qualifications | 0.6969 |