Jay M Ver Hoef, Hailemariam Temesgen.
Abstract
Forest surveys provide critical information for many diverse interests. Data are often collected from samples, and from these samples, maps of resources and estimates of areal totals or averages are required. In this paper, two approaches for mapping and estimating totals, the spatial linear model (SLM) and k-NN (k-Nearest Neighbor), are compared theoretically, through simulations, and as applied to real forestry data. While both methods have desirable properties, a review shows that the SLM has prediction optimality properties and can be quite robust. Simulations of artificial populations and resamplings of real forestry data show that the SLM has smaller empirical root-mean-squared prediction errors (RMSPE) for a wide variety of data types, with generally less bias and better interval coverage than k-NN. These patterns held for both point predictions and for population totals or averages, with the SLM reducing RMSPE by 9% to 67% relative to some popular k-NN methods, and with the SLM also more robust to spatially imbalanced sampling. Estimating prediction standard errors remains a problem for k-NN predictors, despite recent attempts using model-based methods. Our conclusions are that the SLM should generally be used rather than k-NN if the goal is accurate mapping or estimation of population totals or averages.
Year: 2013 PMID: 23527110 PMCID: PMC3602606 DOI: 10.1371/journal.pone.0059129
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Histograms of A. PMAI, B. DRYBIOT.
The gray-shaded histograms are based on the original centered data, and the cross-hatched histogram is based on the residuals after fitting a multiple regression model with main effects for all covariates.
Figure 2. Spatial locations of PMAI variable.
The redder shades indicate higher values, and the bluer shades indicate lower values. One draw from the unbalanced spatial sample is shown with black circles around the sampled locations.
Performance summaries from 2000 simulated spatial data sets.
| Measure | Data | P/T | MAH1 | MAH5 | MSN1 | MSN5 | bstNN | LM | SLM |
| RMSPE | S1 | P | 9.329 | 7.451 | 5.379 | 4.423 | 4.456 | 3.892 | 2.443 |
| SRB | S1 | P | −0.006 | −0.009 | 0 | −0.004 | −0.004 | −0.002 | 0.001 |
| PIC90 | S1 | P | 0.897 | 0.9 | 0.887 | 0.889 | 0.88 | 0.896 | 0.892 |
| RMSPE | S1 | T | 262.6 | 289.8 | 174.3 | 153.3 | 154.5 | 139.3 | 87.8 |
| SRB | S1 | T | −0.058 | −0.067 | −0.003 | −0.034 | −0.034 | −0.02 | 0.009 |
| PIC90 | S1 | T | 0.952 | 0.87 | 0.914 | 0.886 | 0.874 | 0.88 | 0.887 |
| RMSPE | S2 | P | 6.445 | 5.17 | 6.428 | 5.146 | 5.129 | 5.185 | 4.414 |
| SRB | S2 | P | −0.024 | −0.03 | −0.009 | −0.019 | −0.030 | 0.000 | 0.003 |
| PIC90 | S2 | P | 0.907 | 0.922 | 0.899 | 0.906 | 0.906 | 0.932 | 0.917 |
| RMSPE | S2 | T | 320 | 295.9 | 296.3 | 262.3 | 272.2 | 283.1 | 226.1 |
| SRB | S2 | T | −0.137 | −0.188 | −0.047 | −0.135 | −0.182 | −0.033 | −0.005 |
| PIC90 | S2 | T | 0.912 | 0.842 | 0.9 | 0.867 | 0.83 | 0.86 | 0.858 |
| PCC | S3 | P | 0.731 | 0.767 | 0.749 | 0.785 | 0.795 | 0.799 | 0.846 |
| SRB | S3 | P | 0.009 | 0.013 | 0.001 | 0.002 | 0.010 | 0.002 | 0.002 |
| PIC90 | S3 | P | 0.767 | 0.884 | 0.764 | 0.85 | 0.844 | 0.905 | 0.889 |
| RMSPE | S3 | T | 0.0395 | 0.0394 | 0.0387 | 0.0334 | 0.0343 | 0.0329 | 0.0298 |
| SRB | S3 | T | 0.072 | 0.09 | 0.003 | 0.019 | 0.079 | 0.018 | 0.014 |
| PIC90 | S3 | T | 0.919 | 0.841 | 0.913 | 0.882 | 0.84 | 0.886 | 0.884 |
In the Data column, S1 indicates data from the first simulation method, S2 indicates data from the second simulation method (count data), and S3 indicates data from the third simulation method (binary data), as described in Section “Simulation of Artificial Data.” Each data set used 100 samples per simulation. For point predictions, indicated by P in the P/T column, summaries were based on 300 predictions per simulation, which were then averaged over the 2000 simulations. There was one total estimate per simulation; these were summarized over the 2000 simulations and are indicated by T in the P/T column. Different prediction methods form the rest of the columns and are described in Section “Prediction Methods.” Performance measures form the rows and are described in Section “Performance Measures”; note, however, that percent correctly classified (PCC) replaces RMSPE for point predictions of the binary (S3) simulated data.
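As a rough illustration of two of the row measures, the sketch below computes an RMSPE and a 90% prediction interval coverage from paired predictions, standard errors, and true values. The exact definitions are given in the paper's Section “Performance Measures,” which is not reproduced here, so the formulas are assumptions: RMSPE in its usual form, and PIC90 taken as the proportion of normal-based 90% intervals that contain the true value. SRB is omitted because its definition does not appear in this excerpt. The function names and toy values are hypothetical.

```python
import math

Z90 = 1.645  # approximate standard normal quantile for a two-sided 90% interval


def rmspe(predictions, truths):
    """Root-mean-squared prediction error over paired predictions and true values."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truths)]
    return math.sqrt(sum(squared) / len(squared))


def interval_coverage(predictions, std_errors, truths, z=Z90):
    """Fraction of true values inside prediction +/- z * standard error (assumed PIC90 form)."""
    hits = sum(
        1
        for p, s, t in zip(predictions, std_errors, truths)
        if abs(p - t) <= z * s
    )
    return hits / len(truths)


# Hypothetical toy values, not taken from the paper
preds = [1.0, 2.0, 3.0, 4.0]
truth = [1.5, 2.0, 2.0, 4.5]
ses = [0.5, 0.5, 0.5, 0.5]

print(rmspe(preds, truth))
print(interval_coverage(preds, ses, truth))
```

In a simulation study like the one summarized above, these quantities would be computed within each simulated data set and then summarized across simulations.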
Performance summaries for 500 resamplings of forest data.
| Measure | Data | P/T | MAH1 | MAH5 | MSN1 | MSN5 | bstNN | LM | SLM |
| RMSPE | PM | P | 2.998 | 2.371 | 3.243 | 2.53 | 2.362 | 2.399 | 2.127 |
| SRB | PM | P | 0.02 | 0.038 | 0.004 | −0.001 | 0.026 | 0.004 | 0.003 |
| PIC90 | PM | P | 0.895 | 0.902 | 0.888 | 0.894 | 0.898 | 0.897 | 0.899 |
| RMSPE | PM | T | 219.1 | 230.7 | 243.3 | 200.9 | 223.2 | 197 | 180.4 |
| SRB | PM | T | 0.437 | 0.712 | 0.064 | −0.019 | 0.446 | 0.082 | 0.058 |
| PIC90 | PM | T | 0.944 | 0.838 | 0.948 | 0.922 | 0.834 | 0.904 | 0.904 |
| RMSPE | DB | P | 90.8 | 71.3 | 95 | 73.8 | 68.4 | 69.2 | 67.3 |
| SRB | DB | P | −0.002 | 0.000 | 0.000 | 0.002 | −0.018 | 0.004 | 0.005 |
| PIC90 | DB | P | 0.899 | 0.903 | 0.892 | 0.904 | 0.912 | 0.919 | 0.914 |
| RMSPE | DB | T | 6795 | 6369 | 7683 | 6393 | 6498 | 6193 | 6091 |
| SRB | DB | T | −0.027 | −0.001 | 0.027 | 0.036 | −0.302 | 0.052 | 0.066 |
| PIC90 | DB | T | 0.942 | 0.878 | 0.914 | 0.878 | 0.848 | 0.876 | 0.866 |
| RMSPE | UN | P | 2.983 | 2.497 | 3.115 | 2.495 | 2.389 | 2.436 | 2.146 |
| SRB | UN | P | 0.135 | 0.227 | 0.08 | 0.104 | 0.139 | 0.159 | 0.028 |
| PIC90 | UN | P | 0.912 | 0.907 | 0.905 | 0.903 | 0.9 | 0.903 | 0.918 |
| RMSPE | UN | T | 637.9 | 853.6 | 457.1 | 442.2 | 576.1 | 608.1 | 269 |
| SRB | UN | T | 2.635 | 4.055 | 1.418 | 1.77 | 1.651 | 2.86 | 0.369 |
| PIC90 | UN | T | 0.248 | 0.01 | 0.626 | 0.438 | 0.308 | 0.128 | 0.92 |
In the Data column, PM indicates the PMAI data set, DB indicates the DRYBIOT data set, and UN indicates the PMAI data set with unbalanced sampling, as described in Section “Forest Data.” Each data set used 386 samples per resampling. For point predictions, indicated by P in the P/T column, summaries were based on 1500 predictions per resampling, which were then averaged over the 500 resamples. There was one total estimate per resample; these were summarized over the 500 resamples and are indicated by T in the P/T column. Different prediction methods form the rest of the columns and are described in Section “Prediction Methods.” Performance measures form the rows and are described in Section “Performance Measures.”
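To read the tables in relative terms, the following sketch computes the percent reduction in point-prediction RMSPE of the SLM relative to MSN5, using values copied from the two tables above. The choice of MSN5 as the baseline is only an illustrative assumption; the excerpt does not say which comparisons underlie the range quoted in the abstract. The helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(baseline, improved):
    """Percent by which `improved` is smaller than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Point-prediction (P) RMSPE values copied from the tables:
# (data set, MSN5 RMSPE, SLM RMSPE)
rows = [
    ("S1", 4.423, 2.443),
    ("S2", 5.146, 4.414),
    ("PM", 2.530, 2.127),
    ("DB", 73.8, 67.3),
    ("UN", 2.495, 2.146),
]

reductions = {name: percent_reduction(msn5, slm) for name, msn5, slm in rows}

for name, value in reductions.items():
    print(f"{name}: SLM RMSPE is {value:.1f}% lower than MSN5")
```

The same helper applies to the totals (T) rows or to any other method column by swapping in the corresponding table values.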
Figure 3. Scatter plots of absolute errors and the estimated standard errors for a single simulated data set.
IterVar is the iterated variogram method of McRoberts et al. (2007); kNNGeo uses the covariance matrix estimated with REML from a spatial linear model with all main effects, combined with the k-NN weights; and EBLUP denotes the estimated standard errors from the SLM.
Figure 4. Violin plots of Kendall’s rank correlation coefficients between absolute error and estimated standard errors over 2000 simulations.
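The rank correlation summarized in Figure 4 can be sketched as follows for one data set. This is a minimal illustration, assuming the simple tau-a form of Kendall's coefficient (no tie correction); the excerpt does not state which variant the authors used. The function name and the toy values are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical absolute errors and estimated standard errors, for illustration only
abs_errors = [0.2, 1.1, 0.5, 2.0, 0.9]
std_errors = [0.4, 0.9, 0.7, 1.5, 0.6]

tau = kendall_tau_a(abs_errors, std_errors)
print(tau)
```

A coefficient near 1 means that larger estimated standard errors tend to accompany larger absolute errors, which is the behavior one would want from a standard error estimator. For data with ties, `scipy.stats.kendalltau` provides a tie-corrected version.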