Daniel Arribas-Bel, Jorge E Patino, Juan C Duque.
Abstract
This paper provides evidence on the usefulness of very high spatial resolution (VHR) imagery in gathering socioeconomic information in urban settlements. We use land cover, spectral, structure and texture features extracted from a Google Earth image of Liverpool (UK) to evaluate their potential to predict Living Environment Deprivation at a small statistical area level. We also contribute to the methodological literature on the estimation of socioeconomic indices with remote-sensing data by introducing elements from modern machine learning. In addition to classical approaches such as Ordinary Least Squares (OLS) regression and a spatial lag model, we explore the potential of the Gradient Boost Regressor and Random Forests to improve predictive performance and accuracy. Beyond these novel prediction methods, we also introduce tools for model interpretation and evaluation, such as feature importance, partial dependence plots and cross-validation. Our results show that Random Forest is the best model, with an R² of around 0.54, followed by the Gradient Boost Regressor with 0.5. Both the spatial lag model and OLS fall behind, with significantly lower R² values of 0.43 and 0.3, respectively.
Year: 2017 PMID: 28464010 PMCID: PMC5413026 DOI: 10.1371/journal.pone.0176684
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Location of Liverpool city and LED index values at LSOA level, shown over a Stamen Terrain base map (base map tiles by Stamen Design, under CC BY 3.0; data by OpenStreetMap, under CC BY SA).
Accuracy assessment of GE image classification results.
| Classified as (rows) / ground truth (columns) | Gray imp. surf. | Soil | Orange imp. surf. | Vegetation | Shadow | Water | Total |
|---|---|---|---|---|---|---|---|
| Gray imp. surf. | 5,421 | 73 | 278 | 80 | 32 | 0 | 5,884 |
| Soil | 422 | 740 | 200 | 15 | 0 | 0 | 1,377 |
| Orange imp. surf. | 65 | 6 | 1,696 | 17 | 0 | 0 | 1,784 |
| Vegetation | 156 | 10 | 49 | 3,684 | 168 | 0 | 4,067 |
| Shadow | 1,474 | 6 | 68 | 496 | 1,427 | 0 | 3,471 |
| Water | 12 | 0 | 0 | 208 | 34 | 885 | 1,139 |
| Total | 7,550 | 835 | 2,291 | 4,500 | 1,661 | 885 | 17,722 |
| Producer’s accuracy (%) | 71.80 | 88.62 | 74.03 | 81.87 | 85.91 | 100 | |
| User’s accuracy (%) | 92.13 | 53.74 | 95.07 | 90.58 | 41.11 | 77.70 | |
| Overall classification accuracy (%) | 78.17 | | | | | | |
| Kappa value | 0.71 | | | | | | |
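
The summary rows of the accuracy table follow directly from the confusion matrix. The sketch below is a minimal illustration in plain Python, not the authors' code; the function name `accuracy_metrics` is hypothetical. The gray/gray cell is entered as 5,421, the value consistent with the reported row total (5,884) and column total (7,550).

```python
# Confusion matrix from the accuracy assessment table.
# Rows: class assigned by the classifier; columns: ground-truth class.
# The first diagonal cell is taken as 5,421, the value implied by the
# reported row and column totals.
CLASSES = ["Gray imp. surf.", "Soil", "Orange imp. surf.",
           "Vegetation", "Shadow", "Water"]
MATRIX = [
    [5421, 73, 278, 80, 32, 0],
    [422, 740, 200, 15, 0, 0],
    [65, 6, 1696, 17, 0, 0],
    [156, 10, 49, 3684, 168, 0],
    [1474, 6, 68, 496, 1427, 0],
    [12, 0, 0, 208, 34, 885],
]


def accuracy_metrics(matrix):
    """Return producer's/user's accuracy (%), overall accuracy (%) and kappa."""
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    total = sum(row_totals)
    diagonal = [matrix[i][i] for i in range(n)]

    # Producer's accuracy: correct pixels over the ground-truth (column) total.
    producers = [100.0 * diagonal[j] / col_totals[j] for j in range(n)]
    # User's accuracy: correct pixels over the classified (row) total.
    users = [100.0 * diagonal[i] / row_totals[i] for i in range(n)]

    observed = sum(diagonal) / total
    # Agreement expected by chance, from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1.0 - expected)

    return {
        "producers": producers,
        "users": users,
        "overall": 100.0 * observed,
        "kappa": kappa,
    }


metrics = accuracy_metrics(MATRIX)
for name, prod, user in zip(CLASSES, metrics["producers"], metrics["users"]):
    print(f"{name}: producer's {prod:.2f}%, user's {user:.2f}%")
print(f"Overall accuracy: {metrics['overall']:.2f}%, kappa: {metrics['kappa']:.2f}")
```

Rounded to two decimals, these values reproduce the overall accuracy of 78.17% and the kappa of 0.71 reported in the table.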
Remote sensing derived variables and descriptions.
| Group | Variable name | Description |
|---|---|---|
| Land cover | p_imp_surf | Percentage of impervious surface cover |
| | p_ora_surf | Percentage of orange impervious surface cover |
| | p_i_s_wora | Percentage of impervious surface without orange surface cover |
| | f_or_imp_s | Fraction of orange impervious surface over the impervious surface |
| | p_veg | Percentage of vegetation cover |
| | p_soil | Percentage of bare soil cover |
| | p_shadow | Percentage of shadow cover |
| | p_b_water | Percentage of water cover |
| Spectral | MEAN1 | Mean of band 1 intensity values (red color) |
| | DEVST1 | Standard deviation of band 1 intensity values |
| | MEAN2 | Mean of band 2 intensity values (green color) |
| | DEVST2 | Standard deviation of band 2 intensity values |
| | MEAN3 | Mean of band 3 intensity values (blue color) |
| | DEVST3 | Standard deviation of band 3 intensity values |
| Texture | MEAN_EDG | Mean of the edgeness factor |
| | STDEV_EDG | Standard deviation of the edgeness factor |
| | UNIFOR | GLCM uniformity |
| | ENTROP | GLCM entropy |
| | CONTRAS | GLCM contrast |
| | IDM | GLCM inverse difference moment |
| | COVAR | GLCM covariance |
| | VARIAN | GLCM variance |
| | CORRELAC | GLCM correlation |
| | SKEWNESS | Skewness value of the histogram |
| | KURTOSIS | Kurtosis value of the histogram |
| Structure | RVF | Ratio variance at first lag |
| | RSF | Ratio between semivariance values at second and first lag |
| | FDO | First derivative near the origin |
| | SDT | Second derivative at third lag |
| | MFM | Mean of the semivariogram values up to the first maximum |
| | VMF | Variance of the semivariogram values up to the first maximum |
| | DMF | Difference between the mean of the semivariogram values up to the first maximum and the semivariance at first lag |
| | RMM | Ratio between the semivariance at first local maximum and the mean semivariogram values up to this maximum |
| | SDF | Second order difference between first lag and first maximum |
| | AFM | Area between the semivariogram value in the first lag and the semivariogram function until the first maximum |
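
As an illustration of how the land cover group could be derived, the sketch below computes the eight land cover variables from the classified pixels of one statistical area. It is a minimal sketch under stated assumptions, not the authors' procedure: the class labels and the function name `land_cover_variables` are hypothetical, and total impervious surface is assumed to be the sum of the gray and orange impervious classes.

```python
from collections import Counter

# Hypothetical class labels mirroring the six classes of the image classification.
GRAY, SOIL, ORANGE, VEG, SHADOW, WATER = (
    "gray", "soil", "orange", "veg", "shadow", "water")


def land_cover_variables(pixel_labels):
    """Compute land cover variables from the class labels of an area's pixels."""
    counts = Counter(pixel_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the area contains no classified pixels")

    def percentage(label):
        return 100.0 * counts[label] / total

    # Assumption: impervious surface = gray + orange impervious classes.
    impervious = counts[GRAY] + counts[ORANGE]
    return {
        "p_imp_surf": 100.0 * impervious / total,
        "p_ora_surf": percentage(ORANGE),
        "p_i_s_wora": percentage(GRAY),
        # Fraction (0 to 1) of the impervious surface that is orange.
        "f_or_imp_s": counts[ORANGE] / impervious if impervious else 0.0,
        "p_veg": percentage(VEG),
        "p_soil": percentage(SOIL),
        "p_shadow": percentage(SHADOW),
        "p_b_water": percentage(WATER),
    }


# Toy area of ten pixels (made-up data).
example_pixels = [GRAY] * 4 + [ORANGE] + [VEG] * 3 + [SOIL] + [SHADOW]
variables = land_cover_variables(example_pixels)
for name, value in variables.items():
    print(f"{name}: {value:.2f}")
```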
Rotated factor loadings (orthogonal Varimax rotation) of image spectral, texture and structure variables.
| Variable | Factor 1 | Factor 2 | Factor 3 | Factor 4 |
|---|---|---|---|---|
| MEAN1 | 0.346 | 0.002 | -0.0633 | |
| DEVST1 | 0.3036 | 0.3805 | 0.1993 | |
| MEAN2 | 0.0755 | -0.0178 | 0.0055 | |
| DEVST2 | 0.3437 | 0.3938 | 0.2035 | |
| MEAN3 | 0.3533 | 0.0642 | -0.0152 | |
| DEVST3 | 0.2793 | 0.3338 | 0.2287 | |
| MEAN_EDG | 0.1691 | -0.0661 | 0.0242 | |
| DEVST_EDG | 0.5929 | 0.0101 | 0.2671 | |
| UNIFOR | -0.2777 | 0.0135 | -0.0167 | |
| ENTROP | 0.4904 | 0.0832 | 0.0861 | |
| CONTRAS | 0.3685 | -0.0159 | 0.1263 | |
| IDM | -0.0082 | 0.1062 | 0.0133 | |
| COVAR | 0.1433 | 0.4224 | 0.1904 | |
| VARIAN | 0.25 | 0.3923 | 0.1941 | |
| CORRELAC | -0.5293 | 0.5393 | 0.2014 | |
| SKEWNESS | 0.1689 | 0.1646 | 0.5480 | |
| KURTOSIS | -0.1572 | -0.0055 | 0.4423 | |
| RVF | -0.4935 | 0.467 | 0.0325 | |
| RSF | -0.1193 | -0.308 | 0.3285 | |
| FDO | 0.3681 | 0.1731 | 0.2296 | |
| SDT | 0.2795 | 0.2227 | 0.1445 | |
| MFM | 0.5481 | 0.2911 | 0.3145 | |
| VFM | 0.3318 | -0.078 | 0.0116 | |
| DMF | 0.5817 | 0.384 | 0.3641 | |
| RMM | 0.3789 | 0.0888 | -0.3623 | |
| SDF | -0.4438 | -0.5395 | -0.0032 | |
| AFM | 0.5871 | 0.2946 | 0.043 |
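
For readers unfamiliar with the rotation named in the caption, the following is a minimal NumPy sketch of the widely used SVD-based varimax iteration. It is an illustration only: the function name `varimax`, its parameters and the toy loading matrix are assumptions, and the values are not taken from the table above.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-8):
    """Rotate a (variables x factors) loading matrix with the varimax criterion.

    Returns the rotated loadings and the orthogonal rotation matrix.
    """
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    rotation = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = L @ rotation
        # Gradient of the varimax criterion with respect to the rotation.
        column_ss = (rotated ** 2).sum(axis=0)
        gradient = L.T @ (rotated ** 3 - rotated @ np.diag(column_ss) / p)
        # The nearest orthogonal matrix to the gradient gives the next rotation.
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt
        new_criterion = s.sum()
        if criterion != 0.0 and new_criterion < criterion * (1.0 + tol):
            break
        criterion = new_criterion
    return L @ rotation, rotation


# Toy unrotated loadings for six variables and two factors (made-up values).
toy_loadings = np.array([
    [0.7, 0.3],
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, -0.7],
    [0.2, -0.8],
    [0.4, -0.6],
])
rotated_loadings, rotation_matrix = varimax(toy_loadings)
print(np.round(rotated_loadings, 3))
```

Because the rotation is orthogonal, each variable's communality (the row sum of squared loadings) is unchanged; only the distribution of loadings across factors changes.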
Regression coefficients.
| Variable | OLS coefficient | OLS p-value | GMM coefficient | GMM p-value | RF | GBR |
|---|---|---|---|---|---|---|
| CONSTANT | 65.0581 | 0.0003 | 44.6907 | 0.0011 | | |
| f1 | -1.7557 | 0.3315 | -3.0328 | 0.0264 | | |
| f2 | -2.6794 | 0.1764 | -2.1430 | 0.1526 | | |
| f3 | 3.2969 | 0.0003 | 2.0437 | 0.0036 | | |
| f4 | -2.0457 | 0.0209 | -1.3078 | 0.0497 | | |
| p_b_water | -3.246 | 0.0000 | -1.8888 | 0.0005 | | |
| p_shadow | -0.0983 | 0.7794 | -0.3728 | 0.1599 | | |
| p_veg | -0.644 | 0.0075 | -0.5402 | 0.0029 | | |
| Spatial lag | | | 0.6504 | 0.0000 | | |
| R² | 0.3440 | | 0.4327 | | 0.9354 | 0.8320 |
Fig 2. Feature importance plot (Random Forest).
Fig 3. Partial dependence plots for the four most relevant variables (Gradient Boost Regressor): RSF, percentage of vegetation and water, and RMM.
Fig 4. Cross-validated R² (median values in parentheses).
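
The model comparison summarized in the abstract and in Fig 4 can be sketched with scikit-learn as follows. This is a hedged illustration, not the authors' pipeline: the data are synthetic, the helper names `make_synthetic_data` and `cross_validated_r2` are hypothetical, the hyperparameters are arbitrary defaults, and the spatial lag model is omitted because it requires a spatial weights matrix.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def make_synthetic_data(n_areas=300, n_features=8, seed=0):
    """Generate made-up features and a target with some non-linear structure."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_areas, n_features))
    y = (2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + X[:, 2] * X[:, 3]
         + rng.normal(scale=0.5, size=n_areas))
    return X, y


def cross_validated_r2(X, y, n_splits=5, seed=0):
    """Return the per-fold R2 of each model under shuffled k-fold cross-validation."""
    models = {
        "OLS": LinearRegression(),
        "RF": RandomForestRegressor(n_estimators=100, random_state=seed),
        "GBR": GradientBoostingRegressor(random_state=seed),
    }
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="r2")
        for name, model in models.items()
    }


X, y = make_synthetic_data()
scores = cross_validated_r2(X, y)
for name, values in scores.items():
    print(f"{name}: mean R2 = {values.mean():.2f} (median {np.median(values):.2f})")

# Impurity-based feature importances from a Random Forest fitted on all data.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
for index in np.argsort(importances)[::-1][:4]:
    print(f"feature {index}: importance {importances[index]:.3f}")
```

Scoring each model on held-out folds, rather than on the data used for fitting, is what makes the R² values comparable across models of very different flexibility.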