Chao Song, Xiu Yang, Xun Shi, Yanchen Bo, Jinfeng Wang.
Abstract
Because of the large number of missing values, both spatially and temporally, China has not published a complete official socioeconomic statistics dataset at the county level, which is the country's basic scale of official statistics data collection. We developed a procedure to impute the missing values under the Bayesian hierarchical modeling framework. The procedure incorporates two novelties. First, it takes into account spatial autocorrelations and temporal trends for the easier-to-impute variables with small missing percentages. Second, it uses the variables completed in the first step as covariate information to improve the modeling of the more-difficult-to-impute variables with large missing percentages. We applied this progressive spatiotemporal (PST) method to China's official socioeconomic statistics during 2002-2011 and compared it with four other widely used imputation methods: k-nearest neighbors (kNN), expectation maximization (EM), singular value decomposition (SVD) and random forest (RF). The results show that the PST method outperforms these methods, demonstrating the benefit of incorporating the additional spatial and temporal information and of progressively utilizing the covariate information. This study allows China to construct a complete socioeconomic dataset and establishes a methodology that can be generally useful for estimating missing values in large spatiotemporal datasets.
Year: 2018 PMID: 29968777 PMCID: PMC6030081 DOI: 10.1038/s41598-018-28322-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
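The two-step idea described in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' Bayesian hierarchical model: step 1 is replaced here by per-county linear interpolation over years, and step 2 by an ordinary least-squares regression on the completed variable. The function names, the synthetic data, and both stand-in techniques are assumptions made only for illustration.

```python
import numpy as np

def impute_temporal(panel):
    """Step 1 stand-in (assumption): fill each county's gaps by linear
    interpolation over years; a county with no data gets the yearly mean."""
    filled = panel.copy()
    years = np.arange(panel.shape[1])
    yearly_mean = np.nanmean(panel, axis=0)
    for i, series in enumerate(panel):
        observed = ~np.isnan(series)
        if observed.any():
            filled[i] = np.interp(years, years[observed], series[observed])
        else:
            filled[i] = yearly_mean
    return filled

def impute_with_covariates(target, covariates):
    """Step 2 stand-in (assumption): regress the hard variable on completed
    covariates with least squares and predict only the missing cells."""
    X = np.column_stack([np.ones(target.size)] + [c.ravel() for c in covariates])
    y = target.ravel()
    observed = ~np.isnan(y)
    coef, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)
    filled = y.copy()
    filled[~observed] = X[~observed] @ coef
    return filled.reshape(target.shape)

# Synthetic county-by-year panels (hypothetical data, rows = counties).
rng = np.random.default_rng(0)
n_county, n_year = 50, 10
easy = rng.normal(100, 10, (n_county, 1)) + 2.0 * np.arange(n_year)
hard = 3.0 * easy + 5.0

easy_obs = easy.copy()
easy_obs[rng.random(easy.shape) < 0.05] = np.nan   # small missing percentage
hard_obs = hard.copy()
hard_obs[rng.random(hard.shape) < 0.30] = np.nan   # large missing percentage

easy_filled = impute_temporal(easy_obs)                        # first step
hard_filled = impute_with_covariates(hard_obs, [easy_filled])  # second step
```

The point of the sketch is only the ordering: variables with few gaps are completed first, and the completed variables then act as covariates for the variables with many gaps.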
Missing-data summary for the 20 socioeconomic variables.
| Abbreviation | Socioeconomic variable | Unit | Overall missing percentage | Max missing percentage | Number of big missing data years |
|---|---|---|---|---|---|
| X1 | Land area | km² | 2.25% | 5.54% | 0 |
| X2 | Total population | person | 2.19% | 5.50% | 0 |
| X3 | Employees at the end of the year | number | 2.40% | 5.58% | 0 |
| X4 | Local telephone users at the end of the year | person | 2.91% | 6.02% | 0 |
| X5 | Local general budget revenue | million | 2.42% | 5.58% | 0 |
| X6 | Local government budgetary expenditures | million | 2.37% | 5.63% | 0 |
| X7 | Savings deposits of urban and rural residents | million | 2.86% | 6.02% | 0 |
| X8 | Loan balance of financial institutions | million | 2.65% | 5.84% | 0 |
| X9 | Total retail sales of social consumer goods | yuan | 4.58% | 7.23% | 0 |
| X10 | Above-scale total industrial output value | million | 6.47% | 14.11% | 0 |
| X11 | Social fixed asset investments | million | 3.12% | 6.06% | 0 |
| X12 | Middle and high school students | person | 2.34% | 5.58% | 0 |
| X13 | Primary school students | person | 2.25% | 5.45% | 0 |
| X14 | Number of hospital beds | number | 2.37% | 5.50% | 0 |
| X15 | Regional GDP | million | 12.39% | 87.66% | 1 |
| X16 | First industry output | million | 20.90% | 87.62% | 2 |
| X17 | Second industry output | million | 20.88% | 87.62% | 2 |
| X18 | Tertiary industry output | million | 29.51% | 87.66% | 3 |
| X19 | GDP per capita | yuan/person | 38.57% | 88.01% | 4 |
| X20 | Staff and workers in Urban Units | person | 15.34% | 87.66% | 1 |
(X1 to X20 denote the 20 variables. The missing percentage of a variable is the number of county-years with missing data for that variable divided by the total number of county-years in the 10-year period.)
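As a small illustration of the definition above, the following sketch computes the table's three summary quantities for one variable stored as a county-by-year array. Two points are assumptions rather than statements from the source: that the "max missing percentage" is the largest single-year missing percentage, and that a "big missing data year" is a year with more than 50% of counties missing (the `big_year_threshold` parameter is hypothetical).

```python
import numpy as np

def missing_summary(panel, big_year_threshold=0.5):
    """Summarise missingness for one variable (rows = counties, cols = years)."""
    missing = np.isnan(panel)
    per_year = missing.mean(axis=0)          # share of counties missing, per year
    return {
        "overall_pct": 100 * missing.mean(),  # missing county-years / all county-years
        "max_pct": 100 * per_year.max(),      # assumed: worst single year
        "big_years": int((per_year > big_year_threshold).sum()),  # assumed threshold
    }

# Toy panel: 4 counties, 4 years (hypothetical values).
panel = np.array([
    [1.0, np.nan, 3.0, 4.0],
    [np.nan, np.nan, 2.0, 5.0],
    [7.0, np.nan, np.nan, 6.0],
    [8.0, 9.0, 1.0, 2.0],
])
summary = missing_summary(panel)
# 5 of 16 county-years are missing, so overall_pct is 31.25.
```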
Figure 1. Study area and the missing data maps of GDP (variable X15) in the years 2002 (a) and 2011 (b).
Figure 2. Overall design flow chart of the experiment.
Evaluation results of the alternative Bayesian spatiotemporal models for the 20 variables (M1: parametric spatiotemporal model; M2: nonparametric spatiotemporal model; M3: spatiotemporal multivariable regression model).
| Variable | Model | | DIC | LS |
|---|---|---|---|---|
| X1 | M1 | 126.69 | 125663.17 | 2.72 |
| | M2 | 6176.37 | −14111.29 | −0.45 |
| X2 | M1 | 1835.48 | 96930.00 | 2.15 |
| | M2 | 5543.50 | −12130.76 | −0.41 |
| X3 | M1 | 3929.79 | 10166.79 | 0.18 |
| | M2 | 8343.87 | 6623.99 | 0.10 |
| X4 | M1 | 3820.75 | 19380.00 | 0.42 |
| | M2 | 9028.92 | 10103.58 | 0.22 |
| X5 | M1 | 4034.30 | 19129.21 | 0.41 |
| | M2 | 14040.63 | 2708.83 | 0.15 |
| X6 | M1 | 3644.52 | 12854.78 | 0.26 |
| | M2 | 8941.07 | 5934.85 | 0.12 |
| X7 | M1 | 2103.73 | 92887.11 | 2.07 |
| | M2 | 8027.79 | 6469.70 | 0.11 |
| X8 | M1 | 2110.40 | 93380.78 | 2.08 |
| | M2 | 11798.02 | 5849.11 | 0.17 |
| X9 | M1 | 1909.06 | 98419.80 | 2.24 |
| | M2 | 14893.63 | −1767.84 | 0.12 |
| X10 | M1 | 4234.27 | 25353.06 | 0.60 |
| | M2 | 14215.22 | 9917.30 | 0.42 |
| X11 | M1 | 4079.19 | 39651.11 | 0.88 |
| | M2 | 11832.53 | 27789.02 | 0.69 |
| X12 | M1 | 4247.52 | 3280.72 | 0.01 |
| | M2 | 11932.56 | −4713.48 | −0.12 |
| X13 | M1 | 4072.09 | −3645.90 | −0.16 |
| | M2 | 9404.30 | −8151.90 | −0.27 |
| X14 | M1 | 3946.53 | 3705.55 | 0.05 |
| | M2 | 8774.90 | −1100.59 | −0.06 |
| X15* | M2 | 4690.08 | 29812.04 | 0.75 |
| | M3 | 5254.68 | 28573.29 | 0.73 |
| X16* | M2 | 7528.36 | 32827.42 | 1.01 |
| | M3 | 7616.29 | 31564.11 | 0.97 |
| X17* | M2 | 7244.71 | 38198.39 | 1.14 |
| | M3 | 7120.46 | 37171.88 | 1.10 |
| X18* | M2 | 5616.22 | 21887.47 | 0.72 |
| | M3 | 5488.50 | 20870.88 | 0.69 |
| X19* | M2 | 5432.17 | 2872.51 | 0.12 |
| | M3 | 5388.29 | 2837.99 | 0.11 |
| X20* | M2 | 3598.56 | 33249.81 | 0.87 |
| | M3 | 3723.70 | 32603.66 | 0.85 |
*Variables imputed in the second step of the PST method.
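For readers unfamiliar with the two criteria named in the table, the sketch below shows how they are commonly computed, assuming the standard definitions: the deviance information criterion (DIC) as the posterior mean deviance plus the effective number of parameters, and the logarithmic score (LS) as the negative mean log of the leave-one-out predictive densities (CPO values). Lower values of both indicate a better model under these definitions. Whether the paper used exactly these forms is an assumption, and the helper names and numbers are hypothetical.

```python
import numpy as np

def dic(mean_deviance, p_d):
    """DIC = posterior mean deviance + effective number of parameters."""
    return mean_deviance + p_d

def log_score(cpo):
    """LS = -mean(log CPO_i); smaller means better predictive fit."""
    cpo = np.asarray(cpo, dtype=float)
    return -np.mean(np.log(cpo))

def pick_model(scores):
    """scores maps model name -> (DIC, LS); lowest DIC wins, LS breaks ties."""
    return min(scores, key=lambda name: scores[name])

# Hypothetical scores for two candidate models of one variable.
candidates = {
    "M1": (dic(110.0, 10.0), log_score([0.30, 0.45, 0.40])),
    "M2": (dic(92.0, 8.0), log_score([0.60, 0.70, 0.65])),
}
best = pick_model(candidates)
```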
Figure 3. Prediction scatter diagrams of 20 variables in the 10% simulation experiment.
Figure 4. Evaluation of 20 socioeconomic variables in the 10%, 20% and 30% cross-validation simulation experiments with the PST method.
Figure 5. Spatial SAE maps of variable X14 in the years (a) 2002, (b) 2005, (c) 2008 and (d) 2011.
Figure 6. Evaluation of different imputation methods (EM, SVD, kNN, RF, and PST) for the 10% simulation dataset.
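The simulation experiments referred to in the captions above hide a share of the known values, impute them, and score the result against the hidden truth. A minimal sketch of that workflow follows. The two imputers are simple mean-fill baselines chosen only to keep the example self-contained; they are not the EM, SVD, kNN, RF, or PST methods compared in the paper. The error metrics (RMSE and MAE), function names, and synthetic data are likewise assumptions.

```python
import numpy as np

def mask_observed(data, fraction, rng):
    """Hide a random fraction of the observed cells; return the masked copy
    and a boolean array marking which cells were hidden."""
    observed_idx = np.flatnonzero(~np.isnan(data))
    n_hide = int(round(fraction * observed_idx.size))
    hidden_idx = rng.choice(observed_idx, size=n_hide, replace=False)
    masked = data.copy()
    masked.flat[hidden_idx] = np.nan
    hidden = np.zeros(data.shape, dtype=bool)
    hidden.flat[hidden_idx] = True
    return masked, hidden

def impute_year_mean(data):
    """Baseline: fill gaps with the mean of the same year (column)."""
    return np.where(np.isnan(data), np.nanmean(data, axis=0), data)

def impute_county_mean(data):
    """Baseline: fill gaps with the mean of the same county (row)."""
    return np.where(np.isnan(data), np.nanmean(data, axis=1, keepdims=True), data)

def evaluate(truth, imputed, hidden):
    """Score the imputed values on the hidden cells only."""
    err = imputed[hidden] - truth[hidden]
    return {"rmse": float(np.sqrt(np.mean(err ** 2))),
            "mae": float(np.mean(np.abs(err)))}

# Synthetic panel: 200 counties x 10 years with strong county-level differences.
rng = np.random.default_rng(42)
truth = rng.normal(50, 20, (200, 1)) + rng.normal(0, 1, (200, 10))

masked, hidden = mask_observed(truth, 0.10, rng)   # the "10%" experiment
results = {
    "year_mean": evaluate(truth, impute_year_mean(masked), hidden),
    "county_mean": evaluate(truth, impute_county_mean(masked), hidden),
}
```

Changing `0.10` to `0.20` or `0.30` gives the other two masking levels; any imputation function with the same array-in, array-out shape can be dropped into `results` for comparison.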