Noée Szarka, Filip Biljecki.
Abstract
Mapping population distribution at a fine spatial scale is essential for urban studies and planning. Numerous studies, mainly supported by geospatial and statistical methods, have focused primarily on predicting population counts. However, estimating residents' socio-economic characteristics beyond population counts, such as average age, income, and gender ratio, has received little attention. We enhance traditional population estimation by predicting not only the number of residents in an area, but also their demographic characteristics: average age and the proportion of seniors. By implementing and comparing different machine learning techniques (Random Forest, Support Vector Machines, and Linear Regression) in administrative areas in Singapore, we investigate the use of point of interest (POI) and real estate data for this purpose. The developed regression model predicts the average age of residents in a neighbourhood with a mean error of about 1.5 years (the average resident age across Singaporean districts spans a range of approx. 14 years). The results reveal that age patterns of residents can be predicted using real estate information rather than amenities, in contrast to population counts. Another contribution of our work to population estimation is the use of previously unexploited POI and real estate datasets, such as property transactions, year of construction, and flat types (number of rooms). Advancing the domain of population estimation, this study reveals the prospects of a small set of detailed and strong predictors that may also be able to estimate other demographic characteristics such as income.
Year: 2022 PMID: 35381028 PMCID: PMC8982831 DOI: 10.1371/journal.pone.0266484
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Detailed flowchart of the method and the employed datasets.
Fig 2. Planning areas (thick lines) including their subzones (thin lines) in Singapore.
The yellow areas are part of the training group, while the turquoise zones are the test areas for estimations. The grey parts of the country are out of scope of our work because they are not residential or not dominated by HDB. Source of the administrative dataset: Urban Redevelopment Authority / data.gov.sg (2014).
An overview of the predictors.
For each subzone, the density of each amenity has been computed.
| Predictor | Source |
|---|---|
| *Points of interest (amenities)* | |
| Food establishments | National Environment Agency |
| Student care services | Ministry of Social and Family Development |
| Bus stops | Land Transport Authority |
| Supermarkets | National Environment Agency |
| Residents committees | People’s Association |
| E-waste recycling locations | National Environment Agency |
| Eldercare services | Ministry of Social and Family Development |
| Clinics | Ministry of Health |
| Schools | Ministry of Education |
| Childcare facilities | Early Childhood Development Agency |
| *Real estate* | |
| Number of buildings | Housing and Development Board |
| No. of property transactions in the last 3 years | Housing and Development Board |
| Age of buildings (mean, median, mode) | Housing and Development Board |
| Proportion of 1-room flats | Housing and Development Board |
| Proportion of 2-room flats | Housing and Development Board |
| Proportion of 3-room flats | Housing and Development Board |
| Proportion of 4-room flats | Housing and Development Board |
| Proportion of executive flats | Housing and Development Board |
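
The amenity predictors above are expressed as a density per subzone. A minimal sketch of that step follows, assuming density means the number of amenity points divided by the subzone area in km² (the record does not state the exact definition). The function name, subzone labels, and toy values are hypothetical.

```python
from collections import Counter


def amenity_density(amenity_subzones, subzone_area_km2):
    """Return amenities per km² for every subzone.

    amenity_subzones: iterable with one subzone name per amenity point
    subzone_area_km2: dict mapping subzone name -> area in km²
    Assumption: density = count / area; the source does not define it.
    """
    counts = Counter(amenity_subzones)
    # Subzones without any amenity get a density of 0.0.
    return {
        name: counts.get(name, 0) / area
        for name, area in subzone_area_km2.items()
    }


# Hypothetical toy data: the subzone each bus stop falls into.
bus_stops = ["A", "A", "B", "A"]
areas = {"A": 1.5, "B": 0.5, "C": 2.0}
density = amenity_density(bus_stops, areas)
```

In practice the subzone of each point would come from a spatial join against the subzone polygons; the sketch starts after that join.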
Fig 3. Visualisation of some of the datasets that we have used in our work.
Proportion of age groups by administrative area (from which we calculate the proportion of seniors and the average age—plotted as well) together with the average age of buildings. The plot hints at disparate demographics of neighbourhoods and at an association between the age of buildings and age of residents, which we attempt to take advantage of in our estimations. Source of the datasets: Singapore Department of Statistics and Housing and Development Board (data.gov.sg).
Fig 4. Extracts from the datasets that we have used in our work.
(a) aggregated age group distribution for subzones in one of the planning areas in our focus (in our work, we estimate the proportion of the senior group depicted in blue); (b) population counts of subzones are disparate, presenting a suitably diverse dataset for estimations. Source of the datasets: Singapore Department of Statistics and Housing and Development Board (data.gov.sg).
Overview of the performance of the different combinations of the developed regression models to estimate population counts and age.
| Target | Setting | RF R² | RF MAE | RF MAPE | SVM R² | SVM MAE | SVM MAPE | LM R² | LM MAE | LM MAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| Population counts | No model tuning | 0.910 | 3415 | 0.390 | 0.973 | 2441 | 0.167 | 0.980 | 2297 | 0.173 |
| Population counts | With FC | 0.907 | 3087 | 0.332 | 0.974 | 2443 | 0.164 | 0.974 | 2469 | 0.179 |
| Average age | No model tuning | 0.745 | 1.681 | 0.041 | 0.767 | 1.637 | 0.036 | 0.768 | 1.513 | 0.038 |
| Average age | With FC | 0.745 | 1.682 | 0.044 | 0.767 | 1.570 | 0.040 | 0.497 | 1.811 | 0.042 |
| Senior proportion | No model tuning | 0.724 | 0.029 | 0.211 | 0.684 | 0.063 | 0.837 | 0.713 | 0.026 | 0.167 |
| Senior proportion | With FC | 0.728 | 0.028 | 0.206 | 0.695 | 0.063 | 0.837 | 0.446 | 0.032 | 0.160 |
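
For readers unfamiliar with the three scores in the table, the sketch below computes them on hypothetical toy values. It assumes the common definitions: MAE as the mean absolute error, MAPE as the mean absolute error relative to the observed value (a fraction, not a percentage), and R² as one minus the ratio of residual to total sum of squares. The record does not state which R² variant was used, so treat that as an assumption.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (observed values must be non-zero)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical average ages (years) for four subzones, not taken from the study.
observed = [38.0, 40.0, 42.0, 44.0]
predicted = [39.0, 40.0, 41.0, 45.0]

mae_value = mae(observed, predicted)    # 0.75 years
mape_value = mape(observed, predicted)
r2_value = r2(observed, predicted)      # 0.85
```

MAE is the easiest of the three to interpret here, since it is in the target's own units (residents, years, or share of seniors).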
Fig 5. Observed vs predicted and predicted vs predicted (models) scatterplots for population count, average age, and elderly proportion.
LM and SVM tend to produce very similar predictions (population counts and average age), while RF and LM reveal differences in particular for lower and higher values (elderly proportion).
An overview of the predictors and their variable importance from none (o) to high (***).
Counts estimation

| Predictor | VarImp (SVM) | VarImp (LM) |
|---|---|---|
| No. of buildings | *** | *** |
| No. of transactions | *** | ** |
| Food establishments | * | * |
| Supermarkets | * | * |
| E-waste recycling locations | o | * |
| Residents committees | ** | * |
| Student care services | * | * |
| Childcare facilities | ** | * |
| Schools | * | * |
| Clinics | * | o |
| Bus stops | *** | * |
| Buildings x Transactions | *** | – |
| Childcare x Bus stops | *** | – |

Age estimation

| Predictor | VarImp, mean age (SVM) | VarImp, mean age (LM) | VarImp, senior share (RF) | VarImp, senior share (LM) |
|---|---|---|---|---|
| Bldg. age (mean) | *** | *** | *** | ** |
| Bldg. age (median) | *** | * | *** | ** |
| Bldg. age (mode) | *** | o | *** | o |
| 1-Room proportion | o | * | o | * |
| 2-Room proportion | o | ** | * | ** |
| 3-Room proportion | *** | *** | ** | *** |
| 4-Room proportion | *** | ** | ** | *** |
| Executive proportion | ** | o | * | o |
| Mean x Median | *** | – | *** | – |
| Mode x 3-Room prop. | *** | – | ** | – |
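
Rows such as "Buildings x Transactions" combine two predictors into one. The sketch below shows one plausible reading, assuming "x" denotes the element-wise product of the two predictors; the record does not confirm how these terms were built. The function name, column names, and values are hypothetical.

```python
def add_interaction(rows, first, second, name=None):
    """Return copies of the rows with an extra product feature.

    Assumption: the interaction is the plain product first * second.
    """
    name = name or f"{first} x {second}"
    return [{**row, name: row[first] * row[second]} for row in rows]


# Hypothetical subzone records, not taken from the study.
subzones = [
    {"buildings": 120, "transactions": 300},
    {"buildings": 80, "transactions": 150},
]
with_interaction = add_interaction(subzones, "buildings", "transactions")
```

The input rows are left unchanged, so the original predictors and the combined one can be passed to a model side by side.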