Nirav N Patel, Forrest R Stevens, Zhuojie Huang, Andrea E Gaughan, Iqbal Elyazar, Andrew J Tatem.
Abstract
Many different methods are used to disaggregate census data and predict population densities to construct finer scale, gridded population data sets. These methods often involve a range of high resolution geospatial covariate datasets on aspects such as urban areas, infrastructure, land cover and topography; such covariates, however, are not directly indicative of the presence of people. Here we tested the potential of geo-located tweets from the social media application, Twitter, as a covariate in the production of population maps. The density of geo-located tweets in 1x1 km grid cells over a 2-month period across Indonesia, a country with one of the highest Twitter usage rates in the world, was input as a covariate into a previously published random forests-based census disaggregation method. Comparison of internal measures of accuracy and external assessments between models built with and without the geotweets showed that increases in population mapping accuracy could be obtained using the geotweet densities as a covariate layer. The work highlights the potential for such social media-derived data in improving our understanding of population distributions and offers promise for more dynamic mapping with such data being continually produced and freely available.
Year: 2016 PMID: 28515661 PMCID: PMC5412862 DOI: 10.1111/tgis.12214
Source DB: PubMed Journal: Trans GIS ISSN: 1361-1682
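The geotweet covariate described in the abstract is a count of geo-located tweets per 1x1 km grid cell. The following is a minimal sketch of that kind of gridding, not the authors' processing chain: it assumes tweet locations are already in a projected coordinate system measured in metres, and the function name `tweet_density`, the grid origin and the sample coordinates are all hypothetical.

```python
import math
from collections import Counter


def tweet_density(points, origin_x, origin_y, cell_size):
    """Count points per square grid cell.

    points: iterable of (x, y) in projected coordinates (metres assumed).
    Returns a Counter keyed by (column, row) cell index.
    """
    counts = Counter()
    for x, y in points:
        col = math.floor((x - origin_x) / cell_size)
        row = math.floor((y - origin_y) / cell_size)
        counts[(col, row)] += 1
    return counts


# Hypothetical tweet locations in metres (illustrative values only).
tweets = [(120.0, 80.0), (950.0, 999.0), (1000.0, 10.0), (2500.0, 1500.0)]
density = tweet_density(tweets, origin_x=0.0, origin_y=0.0, cell_size=1000.0)
```

Cells with no tweets are simply absent from the counter; a raster covariate would need them filled with zero before being passed to a model.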
Figure 1. Map of Indonesia administrative boundaries levels 3 and 4, focused around Jakarta, with administrative units shaded to show population counts per administrative unit
Figure 2. Results of a two‐month aggregation of geo‐located tweets over the full extent of Java (top) and a view focused on Jakarta (bottom)
Test‐specific data sources and variable names used for population density estimation with dasymetric weights
| Type | Variable Name(s)* | Description | Indonesia Data |
|---|---|---|---|
| Census | | Country‐specific census data that is used for disaggregation | 2010, Admin‐level 3 and Admin‐level 4 (census datasets received from the Government of Indonesia) |
| Land Cover | lan_cls011, lan_dst011 | Cultivated terrestrial lands | Landcover utilizing 3‐year Google Earth Engine data & MDA GlobCover with methods from Patel et al. ( |
| | lan_cls040, lan_dst040 | Woody/Trees | |
| | lan_cls130, lan_dst130 | Shrubs | |
| | lan_cls140, lan_dst140 | Herbaceous | |
| | lan_cls150, lan_dst150 | Other terrestrial vegetation | |
| | lan_cls160, lan_dst160 | Aquatic vegetation | |
| | lan_cls190, lan_dst190 | Urban area | |
| | lan_cls200, lan_dst200 | Bare areas | |
| | lan_cls210, lan_dst210 | Water bodies | |
| | lan_cls230, lan_dst230 | No data, cloud/shadow | |
| | lan_cls240, lan_dst240 | Rural settlement | |
| | lan_cls250, lan_dst250 | Industrial area | |
| | lan_clsBLT, lan_dstBLT | Built, merged urban/rural class | |
| Continuous Raster‐Format | Lig | Lights at night data | Suomi VIIRS‐Derived (NOAA |
| | Npp | MODIS 17A3 2010 estimated net primary productivity, 1 km | Extraction from MODIS package in R (Running et al. |
| | Tem | Mean temperature, 1950–2000 | WorldClim/BioClim (Hijmans et al. |
| | Pre | Mean precipitation, 1950–2000 | WorldClim/BioClim (Hijmans et al. |
| | Ele | Elevation | HydroSHEDS (Lehner et al. |
| | ele_slope | Slope | HydroSHEDS‐Derived (Lehner et al. |
| | Twe | Tweets | Tweets data obtained from method detailed in Section |
| Converted Vector‐Format | roa_cls, roa_dst | Roads | OSM (2014) |
| | riv_dst | Distance to rivers/streams | VMAP0 merged† |
| | pop_cls, pop_dst | Populated Places | OSM (2014) |
| | wat_cls, wat_dst | Water bodies | VMAP0 merged† |
| | pro_cls, pro_dst | Protected areas | IUCN and UNEP ( |
| | poi_cls, poi_dst | Populated Points of Interest | OSM (2014) |
| | bui_cls, bui_dst | Buildings | OSM (2014) |
| | use_cls, use_dst | Delineated land uses | OSM (2014) |
| | cit_cls, cit_dst | Cities | OSM (2014) |
| | dwe_cls, dwe_dst | Dwellings | OSM (2014) |
| | ham_cls, ham_dst | Hamlets | OSM (2014) |
| | hos_cls, hos_dst | Hospital | OSM (2014) |
| | loc_cls, loc_dst | Localities | OSM (2014) |
| | pol_cls, pol_dst | Police | OSM (2014) |
| | sch_cls, sch_dst | Schools | OSM (2014) |
| | sub_cls, sub_dst | Suburbs | OSM (2014) |
| | tow_cls, tow_dst | Towns | OSM (2014) |
| | vil_cls, vil_dst | Villages | OSM (2014) |
| | ind_cls, ind_dst | Industrial land use | OSM (2014) |
| | res_cls, res_dst | Residential land use | OSM (2014) |
| | pri_cls, pri_dst | Primary roads | OSM (2014) |
| | sec_cls, sec_dst | Secondary roads | OSM (2014) |
| | ter_cls, ter_dst | Tertiary roads | OSM (2014) |
| | rro_cls, rro_dst | Residential roads | OSM (2014) |
| | ser_cls, ser_dst | Service roads | OSM (2014) |
| | nei_cls, nei_dst | Neighborhoods | OSM (2014) |
*The variable names are used in the Random Forest model output and throughout the text to refer to the specific data they were derived from. The first three letters are derived from the data type (e.g. “lan” indicates land cover) and the last three letters, if present, indicate what type of data each variable represents (e.g. “_cls” is a binary classification and “_dst” is a calculated Euclidean distance‐to variable).
†The default data for populated places is merged from several VMAP0 data sources. There are three VMAP0 data sets used: The point data pop/builtupp and pop/mispopp are buffered to 100 m and merged with the pop/builtupa polygons creating a vector‐based built layer. This layer is then converted to binary class and distance‐to rasters for use in modeling (NGA 2005).
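The footnotes describe each feature layer as a pair of rasters: a binary class raster (`_cls`) and a Euclidean distance‐to raster (`_dst`). As a rough illustration of how the second can be derived from the first, the sketch below computes distances by brute force on a tiny grid. It is an assumption-laden toy rather than the processing actually used: distances are in cell units, no 100 m buffering is applied, and the helper `distance_to_raster` and the 3x3 example grid are hypothetical.

```python
import math


def distance_to_raster(cls):
    """Euclidean distance (in cell units) from every cell to the nearest
    cell whose binary class value is 1. Brute force, for small grids only."""
    targets = [
        (r, c)
        for r, row in enumerate(cls)
        for c, value in enumerate(row)
        if value == 1
    ]
    if not targets:
        raise ValueError("binary class raster contains no feature cells")
    return [
        [min(math.hypot(r - tr, c - tc) for tr, tc in targets)
         for c in range(len(row))]
        for r, row in enumerate(cls)
    ]


# Hypothetical binary road raster: a single road cell on the left edge.
roa_cls = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
]
roa_dst = distance_to_raster(roa_cls)
```

Multiplying the result by the cell size would convert cell units to ground distance; at national extents a dedicated distance-transform routine would be needed instead of this quadratic loop.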
Figure 3. General structure of the data processing and map production procedure used to compare the methodology outlined in Stevens et al. (2015). The orange boxes represent items that are specific to the research presented here and not part of end‐user map data product generation. The green boxes represent data pre‐processing stages. Items in blue represent Random Forest model estimation, per‐pixel prediction and dasymetric redistribution of census counts
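The final stage named in this caption, dasymetric redistribution, can be sketched as follows: each census unit's count is shared among its pixels in proportion to a per-pixel weight. This is a minimal illustration under stated assumptions, not the published implementation. In the method the weights would come from the Random Forest per-pixel predictions; here they are hard-coded, and the function name, pixel identifiers and counts are hypothetical.

```python
def dasymetric_redistribute(weights, zones, census_counts):
    """Share each unit's census count among its pixels by weight.

    weights: dict pixel_id -> non-negative weight (e.g. predicted density)
    zones: dict pixel_id -> administrative unit id
    census_counts: dict unit id -> census population count
    Returns dict pixel_id -> persons per pixel.
    """
    # Sum of weights inside each administrative unit.
    zone_totals = {}
    for pixel, w in weights.items():
        unit = zones[pixel]
        zone_totals[unit] = zone_totals.get(unit, 0.0) + w

    # Each pixel receives its share of the unit's census count.
    ppp = {}
    for pixel, w in weights.items():
        unit = zones[pixel]
        total = zone_totals[unit]
        ppp[pixel] = census_counts[unit] * w / total if total > 0 else 0.0
    return ppp


# Hypothetical inputs: three pixels in two administrative units.
weights = {"p1": 1.0, "p2": 3.0, "p3": 2.0}
zones = {"p1": "A", "p2": "A", "p3": "B"}
census_counts = {"A": 400, "B": 50}
ppp = dasymetric_redistribute(weights, zones, census_counts)
```

By construction the pixel values within a unit sum back to that unit's census count, so the weights only affect how people are distributed inside each unit.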
Figure 4. Covariate importance plots for tests: (a) without geotweets; and (b) with geotweets
Figure 5. Map of Persons Per Pixel (PPP) produced using high resolution population mapping method (Stevens et al. 2015), showing the final population maps for a region on the island of Java with: (a) no geotweet data; (b) geotweet data included; and finally (c) the output dataset for the entirety of Indonesia with geotweet data included
Accuracy assessment results for tests
| Test | RMSE (persons) | %RMSE | MAE (persons) |
|---|---|---|---|
| Admin 3 without tweets | 2284.14 | 74.58 | 1123.44 |
| Admin 3 with tweets | 2213.99 | 72.29 | 1120.16 |
| Difference (Without ‐ With) | 70.15 | 2.29 | 3.28 |
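For readers who want to reproduce this kind of summary, the sketch below computes RMSE, %RMSE and MAE from paired observed and predicted unit counts. The definition of %RMSE is not given in this excerpt; the code assumes it is RMSE expressed as a percentage of the mean observed count. That reading is at least consistent with the table, since both rows imply the same denominator of roughly 3,060 persons. The function name and the sample values are hypothetical.

```python
import math


def accuracy_metrics(observed, predicted):
    """Return (RMSE, %RMSE, MAE) for paired observed/predicted counts.

    %RMSE is assumed here to be 100 * RMSE / mean(observed).
    """
    errors = [p - o for o, p in zip(observed, predicted)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    pct_rmse = 100.0 * rmse / (sum(observed) / n)
    return rmse, pct_rmse, mae


# Hypothetical census counts and zonal sums of a predicted map.
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 320.0]
rmse, pct_rmse, mae = accuracy_metrics(observed, predicted)
```

Because RMSE squares the errors, it is more sensitive than MAE to a few badly predicted units, which may be why the table's RMSE improves by 70.15 persons while MAE improves by only 3.28.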
Figure 6. Difference maps produced by comparing population maps built from administrative level 3 census population counts, (a) without geotweet data and (b) with geotweet data, against administrative level 4 census population counts by differencing the zonal sums
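The comparison behind this figure can be illustrated with a short sketch: pixel values of a predicted map are summed within each finer (level 4) unit and the census count is subtracted. This is only an outline under assumptions. The sign convention (predicted minus census), the function names and the sample values are hypothetical and are not taken from the paper.

```python
def zonal_sums(ppp, zones):
    """Sum persons-per-pixel values within each zone."""
    sums = {}
    for pixel, persons in ppp.items():
        unit = zones[pixel]
        sums[unit] = sums.get(unit, 0.0) + persons
    return sums


def zonal_differences(ppp, zones, census_counts):
    """Predicted zonal sum minus census count for every census unit."""
    sums = zonal_sums(ppp, zones)
    return {
        unit: sums.get(unit, 0.0) - count
        for unit, count in census_counts.items()
    }


# Hypothetical predicted map and level 4 validation counts.
ppp = {"p1": 120.0, "p2": 80.0, "p3": 40.0}
admin4_zones = {"p1": "X", "p2": "X", "p3": "Y"}
admin4_census = {"X": 210, "Y": 30}
diff = zonal_differences(ppp, admin4_zones, admin4_census)
```

Running the same differencing on maps built with and without the geotweet covariate gives two sets of per-unit errors that can then be mapped side by side.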
Figure 7. Difference map of persons per pixel (PPP), obtained by subtracting the population map generated without geotweet data from the map generated with geotweet data: (a) Jakarta and surrounding areas; and (b) all of Indonesia