Weston Anderson, Seth Guikema, Ben Zaitchik, William Pan.
Abstract
Obtaining accurate small area estimates of population is essential for policy and health planning but is often difficult in countries with limited data. In lieu of available population data, small area estimate models draw information from previous time periods or from similar areas. This study focuses on model-based methods for estimating population when no direct samples are available in the area of interest. To explore the efficacy of tree-based models for estimating population density, we compare six different model structures including Random Forest and Bayesian Additive Regression Trees. Results demonstrate that without information from prior time periods, non-parametric tree-based models produced more accurate predictions than did conventional regression methods. Improving estimates of population density in non-sampled areas is important for regions with incomplete census data and has implications for economic, health and development policies.
Year: 2014 PMID: 24992657 PMCID: PMC4081515 DOI: 10.1371/journal.pone.0100037
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of model structures, strengths and weaknesses.
| Model description | Advantages | Disadvantages |
| Linear model | Simple to implement, transparent model structure | Unable to capture nonlinear relationships |
| Linear model incorporating spatial correlation | Explicitly accounts for spatial correlation, transparent model structure | Unable to capture nonlinear relationships |
| Non-linear extension of an LM using a smoothing function | Able to represent nonlinear relationships | Vulnerable to model overfit, which degrades predictive accuracy |
| Penalized spline, extension of an LM using multiple basis functions | Able to represent nonlinear relationships | Vulnerable to model overfit, which degrades predictive accuracy |
| Bagged classification and regression tree (CART) method | Nonparametric, designed to reduce variance and improve predictive accuracy of CART methods | Complex model structure, more difficult to succinctly measure variable importance |
| Sum-of-trees method | Nonparametric, provides a flexible inference of the relationship between response variables and covariates | Complex model structure, difficult to interpret variable importance, computationally intensive |
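The "bagged CART" row can be illustrated with a small sketch. It is a deliberately simplified stand-in and not the authors' implementation: it bags one-split regression trees (stumps) on a single covariate, whereas Random Forest grows deeper trees and also subsamples covariates at each split. The function names and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (stump) on a single covariate."""
    values = sorted(set(xs))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: no split possible, predict the overall mean.
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Fit n_trees stumps, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Bagged prediction: average the individual tree predictions."""
    return mean(tree(x) for tree in trees)


# Made-up toy data: a step-shaped relationship a linear model would miss.
toy_x = [0, 1, 2, 3, 10, 11, 12, 13]
toy_y = [1, 1, 1, 1, 5, 5, 5, 5]
toy_trees = fit_bagged_stumps(toy_x, toy_y)
print(predict(toy_trees, 0.0), predict(toy_trees, 13.0))
```

Averaging over bootstrap resamples is what reduces the variance of a single tree. The cost, as the table notes, is that the fitted model is no longer a single readable structure.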
Population counts for the five regions included in the analysis.
| Region | 1993 Population | 2007 Population |
| Apurimac | 381,997 | 404,190 |
| Arequipa | 916,806 | 1,112,858 |
| Ayacucho | 483,341 | 584,959 |
| Cusco | 1,001,898 | 1,154,969 |
| Madre de Dios | 67,008 | 102,577 |
| Total: | 2,851,050 | 3,359,553 |
Districts excluded due to irresolvable redistricting issues (see Data Consistency) are not included in the table.
Figure 1. Population count by district for all regions included in the analysis for A) 1993 and B) 2007.
Figure 2. Political boundaries and topographical features of Peru.
A) Classified land cover. B) Regions included in the analysis (Apurimac, Arequipa, Ayacucho, Cusco and Madre de Dios). C) Population density at the district level derived from the 2007 census.
Model errors and p-values assessed using density, population data included in the covariates.
| Model | Average MAE | MAE Standard Error | LM | GAM | RF | MARS | BART | LMM |
| LM | 0.051 | 0.033 | | | | | | |
| GAM | 0.089 | 0.095 | | | | | | |
| RF | 0.108 | 0.091 | | | | | | |
| MARS | 0.051 | 0.036 | | | | | | |
| BART | 0.085 | 0.060 | | | | | | |
| LMM | 0.053 | 0.032 | | | | | | |
| | 0.296 | 0.256 | | | | | | |
The LM through LMM columns display p-values corresponding to the t-test between the MAE distributions of each row-column pair. Stars indicate statistical significance at a level of 0.01 (**) or 0.05 (*), while dashes indicate not significant.
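As a rough illustration of the pairwise comparison described in this note, the sketch below computes a two-sample t statistic for two sets of MAE values. The source does not say which t-test variant was used, so the Welch (unequal-variance) form here is an assumption, and the sample values are made up.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples.

    Uses Welch's unequal-variance form; this choice is an assumption.
    """
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error of each sample mean.
    va = variance(sample_a) / na
    vb = variance(sample_b) / nb
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t_stat, df


# Made-up MAE values from repeated holdout runs of two hypothetical models.
mae_model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
mae_model_b = [2.0, 3.0, 4.0, 5.0, 6.0]
print(welch_t(mae_model_a, mae_model_b))
```

Turning the statistic into a p-value requires the Student t distribution, which the Python standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns both the statistic and the p-value.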
Model errors and p-values assessed using population count error as a proportion of actual district population: |Predicted – Actual|/Actual.
| Model | Average MAE | MAE Standard Error | LM | GAM | RF | MARS | BART | LMM |
| LM | 0.515 | 0.118 | - | | | | | |
| GAM | 0.496 | 0.116 | - | - | | | | |
| RF | 0.408 | 0.106 | ** | ** | - | | | |
| MARS | 0.479 | 0.129 | - | - | ** | - | | |
| BART | 0.436 | 0.110 | ** | ** | - | ** | - | |
| LMM | 0.453 | 0.125 | ** | ** | ** | - | - | - |
| | 0.491 | 0.207 | - | - | ** | - | ** | - |
Population data not included in the covariates. Stars indicate statistical significance at a level of 0.01 (**) or 0.05 (*), while dashes indicate not significant.
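The proportional error in the caption, |Predicted – Actual|/Actual, can be averaged over districts as in the minimal sketch below. The function name and the example numbers are hypothetical, and the source does not describe how the authors aggregated errors across holdout runs.

```python
def mean_proportional_error(predicted, actual):
    """Average of |predicted - actual| / actual over all districts."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual populations must be positive")
    errors = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return sum(errors) / len(errors)


# Made-up district populations: each prediction is off by 10% of the truth.
print(mean_proportional_error([110, 180], [100, 200]))
```

Dividing by the actual population puts small and large districts on the same scale, so a miss of 100 people counts more in a district of 1,000 than in one of 100,000.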
Model errors and p-values assessed using density, population data not included in the covariates.
| Model | Average MAE | MAE Standard Error | LM | GAM | RF | MARS | BART | LMM |
| LM | 0.371 | 0.123 | | | | | | |
| GAM | 0.412 | 0.121 | | | | | | |
| RF | 0.207 | 0.121 | | | | | | |
| MARS | 0.371 | 0.174 | | | | | | |
| BART | 0.298 | 0.157 | | | | | | |
| LMM | 0.302 | 0.124 | | | | | | |
| | 0.296 | 0.256 | | | | | | |
The LM through LMM columns display p-values corresponding to the t-test between the MAE distributions of each row-column pair. Stars indicate statistical significance at a level of 0.01 (**) or 0.05 (*), while dashes indicate not significant.
Model errors and p-values assessed using population count error as a proportion of actual district population: |Predicted–Actual|/Actual.
| Model | Average MAE | MAE Standard Error | LM | GAM | RF | MARS | BART | LMM |
| LM | 0.105 | 0.036 | - | | | | | |
| GAM | 0.108 | 0.042 | - | - | | | | |
| RF | 0.150 | 0.054 | ** | ** | - | | | |
| MARS | 0.109 | 0.036 | - | - | ** | - | | |
| BART | 0.133 | 0.049 | ** | ** | ** | ** | - | |
| LMM | 0.104 | 0.036 | - | - | ** | - | ** | - |
| | 0.491 | 0.207 | ** | ** | ** | ** | ** | ** |
Population data included in the covariates. Stars indicate statistical significance at a level of 0.01 (**) or 0.05 (*), while dashes indicate not significant.
Measures of variable importance, population data included in the covariates.
| Previous PopDensity | Roads | River Water | X coordinate | Y coordinate | NDVI | LST Day | GDP | Perm Water |
| 0.979 | - | –0.0199 | - | - | - | - | - | - |
| 0.979 | 0.0198 | - | NA | NA | - | - | - | - |
| 505.22 | –0.92 | –0.87 | –0.07 | 1.6 | 1.78 | - | - | - |
| 100 | 5 | - | - | - | - | - | - | - |
| 22.19 | 1 | 2.26 | 1.35 | 3.9 | 3.33 | 3.74 | 2.15 | 1.21 |
| 204.94 | 81.85 | 22.21 | 16.25 | 27.27 | 19.65 | 18.34 | 3.11 | 0 |
| 68.28 | 19.89 | 13.84 | 11.95 | 10.76 | 11.71 | 31.21 | 17.88 | 27.29 |
Stars indicate the models producing the most accurate estimates. Dashes indicate variables that were discarded by the model during variable selection.
Measures of variable importance, population data not included in the covariates.
| Previous PopDensity | Roads | River Water | X coordinate | Y coordinate | NDVI | LST Day | GDP | Perm Water |
| NA | 0.154 | –0.184 | 0.117 | –0.102 | - | - | - | - |
| NA | 0.166 | - | NA | NA | - | - | - | - |
| NA | –4.37 | –1.3 | - | –2.33 | –2.59 | - | - | - |
| NA | 100 | 45.9 | 15.8 | 26 | 22.2 | 16.8 | - | - |
| NA | 3.19 | –0.91 | 2.45 | 4.7 | 4.88 | 5.61 | 4.33 | –0.63 |
| NA | 116.79 | 45.36 | 35.23 | 47.67 | 32.61 | 41.14 | 7.78 | 0.18 |
| NA | 55.03 | 28.43 | 18.09 | 33.01 | 30.33 | 39.62 | 37.19 | 26.61 |
Stars indicate the models producing the most accurate estimates. Dashes indicate variables that were discarded by the model during variable selection.
Figure 3. Random Forest model error by district.
Figure 4. Census population density vs. Random Forest estimated density.
1:1 line plotted for reference.
Figure 5. Random Forest uncertainty analysis.
Observations of district population density (black points) are ordered from lowest to highest density. The Random Forest mean (red point), median (blue point) and interval between the 5th/95th percentiles (blue lines) illustrate the uncertainty in each corresponding model estimate.
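The summary statistics named in this caption can be computed for one district as in the sketch below. It assumes the spread comes from the predictions of the individual trees in the forest, which the caption does not state explicitly. The function name and the input values are hypothetical.

```python
from statistics import mean, median, quantiles


def summarize_tree_predictions(tree_predictions):
    """Summarize one district's per-tree density predictions.

    Returns the mean, median, and 5th/95th percentiles. Treating the
    per-tree predictions as the uncertainty distribution is an assumption.
    """
    # n=20 yields cut points at 5%, 10%, ..., 95%; keep the first and last.
    cuts = quantiles(tree_predictions, n=20, method="inclusive")
    return {
        "mean": mean(tree_predictions),
        "median": median(tree_predictions),
        "p05": cuts[0],
        "p95": cuts[-1],
    }


# Made-up per-tree predictions for a single district.
print(summarize_tree_predictions(list(range(101))))
```

A wide gap between `p05` and `p95` marks a district where the trees disagree, which is the kind of spread the blue lines in the figure depict.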