Senait D Senay, Susan P Worner.
Abstract
Correlative species distribution models (SDMs) are increasingly being used to predict suitable insect habitats. They are also widely criticized for prediction discrepancies among different SDMs for the same species and for the lack of effective communication about SDM prediction uncertainty. In this paper, we undertook a factorial study to investigate the effects of various modeling components (species training datasets, predictor variables, dimension-reduction methods, and model types) on the accuracy of SDM predictions, with the aim of identifying sources of discrepancy and uncertainty. We found that, among the modeling components tested, model type was the major factor causing variation in species-distribution predictions. We also found that different combinations of modeling components could significantly increase or decrease the performance of a model. This result indicates the importance of keeping modeling components constant when comparing SDM results. With all modeling components held constant, machine-learning models seem to outperform other model types. On average, the Hierarchical Non-Linear Principal Components Analysis dimension-reduction method improved model performance more than the other methods tested. Widely used confusion-matrix-based model-performance indices such as the area under the receiver operating characteristic curve (AUC), sensitivity, and Kappa do not necessarily help select the best model from a set of models if variation in performance is not large. To conclude, model result discrepancies do not necessarily suggest a lack of robustness in correlative modeling, as they can also arise from inappropriate selection of modeling components. In addition, more research on model performance evaluation is required to develop robust and sensitive model evaluation methods.
Undertaking multi-scenario species-distribution modeling, where possible, is likely to mitigate errors arising from inappropriate selection of modeling components, and to provide end users with better information on the resulting model prediction uncertainty.
Keywords: invasive insect species, model uncertainty, multi-model framework, non-linear principal component analysis, principal component analysis, random forest, species distribution models
Year: 2019 PMID: 30832259 PMCID: PMC6468778 DOI: 10.3390/insects10030065
Source DB: PubMed Journal: Insects ISSN: 2075-4450 Impact factor: 2.769
Variables included in the three predictor datasets used in this study.
| Variable | Variable Name | Dataset |
|---|---|---|
| 01 | Annual mean temperature (°C) | P1, P2, P3 |
| 02 | Mean diurnal temperature range (mean(period max-min)) (°C) | P1, P2, P3 |
| 03 | Isothermality (Bio02 ÷ Bio07) | P1, P2, P3 |
| 04 | Temperature seasonality (C of V) | P1, P2, P3 |
| 05 | Max temperature of warmest week (°C) | P1, P2, P3 |
| 06 | Min temperature of coldest week (°C) | P1, P2, P3 |
| 07 | Temperature annual range (Bio05-Bio06) (°C) | P1, P2, P3 |
| 08 | Mean temperature of wettest quarter (°C) | P1, P2, P3 |
| 09 | Mean temperature of driest quarter (°C) | P1, P2, P3 |
| 10 | Mean temperature of warmest quarter (°C) | P1, P2, P3 |
| 11 | Mean temperature of coldest quarter (°C) | P1, P2, P3 |
| 12 | Annual precipitation (mm) | P1, P2, P3 |
| 13 | Precipitation of wettest week (mm) | P1, P2, P3 |
| 14 | Precipitation of driest week (mm) | P1, P2, P3 |
| 15 | Precipitation seasonality (C of V) | P1, P2, P3 |
| 16 | Precipitation of wettest quarter (mm) | P1, P2, P3 |
| 17 | Precipitation of driest quarter (mm) | P1, P2, P3 |
| 18 | Precipitation of warmest quarter (mm) | P1, P2, P3 |
| 19 | Precipitation of coldest quarter (mm) | P1, P2, P3 |
| 20 | Annual mean radiation (W m⁻²) | P2, P3 |
| 21 | Highest weekly radiation (W m⁻²) | P2, P3 |
| 22 | Lowest weekly radiation (W m⁻²) | P2, P3 |
| 23 | Radiation seasonality (C of V) | P2, P3 |
| 24 | Radiation of wettest quarter (W m⁻²) | P2, P3 |
| 25 | Radiation of driest quarter (W m⁻²) | P2, P3 |
| 26 | Radiation of warmest quarter (W m⁻²) | P2, P3 |
| 27 | Radiation of coldest quarter (W m⁻²) | P2, P3 |
| 28 | Annual mean moisture index | P2, P3 |
| 29 | Highest weekly moisture index | P2, P3 |
| 30 | Lowest weekly moisture index | P2, P3 |
| 31 | Moisture index seasonality (C of V) | P2, P3 |
| 32 | Mean moisture index of wettest quarter | P2, P3 |
| 33 | Mean moisture index of driest quarter | P2, P3 |
| 34 | Mean moisture index of warmest quarter | P2, P3 |
| 35 | Mean moisture index of coldest quarter | P2, P3 |
| 36 | Elevation (m) | P3 |
| 37 | Slope (deg) | P3 |
| 38 | Aspect (deg) | P3 |
| 39 | Hillshade | P3 |
Figure 1. Maps showing the global occurrence of the five species used in this study. Inset maps show the global distribution of (A) Aedes albopictus; (B) Anoplolepis gracilipes; (C) D. v. virgifera; (E) Thaumetopoea pityocampa; and (F) Vespula vulgaris. (G) The geographic extent of each species; and main map (D) shows the extent of occurrence of all five species with presence points overlaid on a global elevation model.
Number of presence points * and distances used to limit background extent before pseudo-absence selection, for the three types of predictor datasets used to model the global distribution of the five species in this study.
| No. | Species | Predictor | Distance (km) |
|---|---|---|---|
| 1 | | BIOCLIM19 | 350 |
| 2 | | BIOCLIM35 | 300 |
| 3 | | BIOCLIM35+T4 | 600 |
| 4 | | BIOCLIM19 | 550 |
| 5 | | BIOCLIM35 | 500 |
| 6 | | BIOCLIM35+T4 | 400 |
| 7 | | BIOCLIM19 | 2000 |
| 8 | | BIOCLIM35 | 800 |
| 9 | | BIOCLIM35+T4 | 800 |
| 10 | | BIOCLIM19 | 300 |
| 11 | | BIOCLIM35 | 1300 |
| 12 | | BIOCLIM35+T4 | 800 |
| 13 | | BIOCLIM19 | 550 |
| 14 | | BIOCLIM35 | 300 |
| 15 | | BIOCLIM35+T4 | 700 |
* Numbers next to each species name show available presence points followed by spatially unique points with respect to the environmental predictor dataset resolution. Another two sets of the datasets listed above were generated, using the listed background binding distances, for predictor data transformed using PCA and NLPCA, making 45 training/test datasets in total. BIOCLIM19 contains 19 precipitation- and temperature-based variables; BIOCLIM35 contains the variables in BIOCLIM19 plus 16 radiation- and soil moisture-derived variables; BIOCLIM35+T4 contains the variables in BIOCLIM35 plus 4 variables derived from topographic data.
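One way to picture the background-limiting step is as a distance filter applied before pseudo-absences are drawn. The sketch below is a minimal illustration and not the authors' procedure: it assumes great-circle (haversine) distance, a mean Earth radius of 6371 km, and hypothetical names (`haversine_km`, `limit_background`) and toy coordinates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the paper


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def limit_background(candidates, presences, max_km):
    """Keep candidate (lat, lon) cells within max_km of at least one presence point."""
    return [
        c for c in candidates
        if any(haversine_km(c[0], c[1], p[0], p[1]) <= max_km for p in presences)
    ]


# Toy data: one presence point and three candidate cells along the equator.
presences = [(0.0, 0.0)]
candidates = [(0.0, 1.0), (0.0, 3.0), (0.0, 10.0)]

# 350 km is one of the distances listed in the table above.
background = limit_background(candidates, presences, max_km=350)
print(background)
```

Pseudo-absences would then be sampled only from `background`; how that sampling is done is outside the scope of this sketch.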
Figure 2. Subsets of the global study area with different sets of pseudo-absence points shown by the species, predictor data, and dimension reduction method. The nine sets of pseudo-absences generated based on the different combinations of the three predictor datasets and three dimension reduction methods are shown for A. albopictus (A), A. gracilipes (B), D. v. virgifera (C), T. pityocampa (D), and V. vulgaris (E). The extents of the subset maps (A–E) are shown on the global map (F).
Figure 3. Conceptual model showing factorial research design. The study was carried out using a 3 × 5 × 3 × 4 factorial design. The design incorporated three types of predictor datasets, occurrence data for five species, three types of collinearity reduction methods, and four types of models that utilize different modeling techniques.
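The size of the factorial design can be checked by enumerating it. The sketch below is illustrative only: the labels P1–P3 and the model abbreviations SVM, QDA, LOGR, and CART are taken from the combination codes in the best/worst table, while `DR1` is an assumed label for the third dimension-reduction level, and the `P…DR…MODEL` label format is inferred from those codes.

```python
from itertools import product

predictors = ["P1", "P2", "P3"]
species = ["A. albopictus", "A. gracilipes", "D. v. virgifera",
           "T. pityocampa", "V. vulgaris"]
dim_reductions = ["DR1", "DR2", "DR3"]        # DR1 is an assumed label
model_types = ["SVM", "QDA", "LOGR", "CART"]  # abbreviations as used in the combination codes

# Every training/test dataset: species x predictor set x dimension reduction.
training_sets = list(product(species, predictors, dim_reductions))

# Every modeling scenario: each training set fitted with each model type.
scenarios = [
    (sp, f"{p}{dr}{m}")
    for sp, p, dr in training_sets
    for m in model_types
]

print(len(training_sets))  # 5 * 3 * 3
print(len(scenarios))      # 3 * 5 * 3 * 4
```

Enumerating the scenarios up front makes it straightforward to keep all other components fixed while comparing the levels of a single component.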
The multiple-factor multivariate analysis of variance (MANOVA) results table showing the effects of the various modeling components on model performance.
| Modeling Components | Pillai’s Trace | η² (%) | F | Df | P |
|---|---|---|---|---|---|
| Model type | 0.79 | 26.22 | 9.24 | 3 | <0.001 *** |
| Dimension reduction | 0.42 | 21.01 | 6.86 | 2 | <0.001 *** |
| Species data | 0.81 | 20.32 | 6.68 | 4 | <0.001 *** |
| Predictor | 0.11 | 5.50 | 1.50 | 2 | 0.138 ns |
| Species data × Predictor | 0.68 | 13.51 | 2.58 | 8 | <0.001 *** |
| Species data × Dimension reduction | 0.58 | 11.65 | 2.18 | 8 | <0.001 *** |
| Predictor × Dimension reduction | 0.49 | 12.37 | 3.70 | 4 | <0.001 *** |
| Species data × Predictor × Dim. Red. | 0.95 | 18.98 | 1.93 | 16 | <0.001 *** |
| Residuals | | 26.22 | | 132 | |
# Signif. P codes: 0 < *** ≤ 0.001 < ** ≤ 0.01 < * ≤ 0.05 < . ≤ 0.1 < ns ≤ 1.
Figure 4. Structure correlations (canonical factor loadings) for the first canonical dimension. Arrows show the vector direction of variables that correspond to the canonical component on the y-axis. The corresponding variables for the x-axis (combinations of modeling components) were not labeled to avoid overcrowding the graph. The red line indicates the linear regression line. The blue ellipse (data ellipse) shows 68% of the data points (approx. one standard deviation) and their centroid (filled black dot) in relation to the linear regression line. The green line shows the locally weighted scatterplot smoothing (LOWESS) fit.
Figure 5. Model Kappa scores plotted against cross-validation error scores. Models to the right of the vertical black dotted line have a Kappa score ≥0.8; models below the horizontal black dotted line have a cross-validation error ≤0.1, and models below the horizontal red dotted line have a cross-validation error ≤0.05. The graph shows the advantage of using a second performance score to discriminate between models with similar scores on the first performance measure.
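The two-score screening that the figure describes can be sketched as a simple filter followed by a tie-breaking sort. This is a hedged illustration, not the authors' code: the `Score` type and `shortlist` function are hypothetical, the thresholds are the ones named in the caption, and the example scores are a few Kappa/CVerror pairs copied from the best/worst table.

```python
from typing import NamedTuple


class Score(NamedTuple):
    label: str       # component combination code, e.g. "P1DR3SVM"
    kappa: float     # higher is better
    cv_error: float  # cross-validation error, lower is better


def shortlist(scores, min_kappa=0.8, max_cv_error=0.1):
    """Keep models passing both thresholds, ranked by Kappa, then by CV error."""
    keep = [s for s in scores if s.kappa >= min_kappa and s.cv_error <= max_cv_error]
    return sorted(keep, key=lambda s: (-s.kappa, s.cv_error))


scores = [
    Score("P1DR3CART", 0.99, 0.005),
    Score("P1DR3SVM", 0.99, 0.004),
    Score("P1DR2QDA", 0.96, 0.050),
    Score("P1DR3LOGR", 0.56, 0.248),
]

for s in shortlist(scores):
    print(s.label, s.kappa, s.cv_error)
```

Sorting on the pair (Kappa, CV error) lets the second score separate models that tie on the first, which is the point the caption makes.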
Best and worst component combinations for the five species modeled in this study. For abbreviations, please refer to Figure 3.
| Species | Best | Kappa | CVerror | Worst # | Kappa | CVerror |
|---|---|---|---|---|---|---|
| | P1DR3SVM * | 0.99 | 0.006 | P1DR2LOGR | 0.14 | 0.433 |
| | P1DR2QDA * | 0.96 | 0.050 | P2DR2LOGR | 0.43 | 0.292 |
| | P1DR3SVM * | 0.98 | 0.006 | P2DR2LOGR | 0.21 | 0.344 |
| | P2DR2SVM * | 0.88 | 0.009 | P3DR3LOGR | −0.12 | 0.498 |
| | P1DR3SVM * | 0.99 | 0.004 | P1DR3LOGR | 0.56 | 0.248 |
| | P1DR3CART * | 0.99 | 0.005 | | | |
* Combinations are the best based on their high Kappa and low CVerror, but are not significantly different from the second-best combination. # All model combinations identified as “worst” for a species had a significantly lower score than the second-worst models. CVerror = cross-validation error. For V. vulgaris, additional presence data were obtained from the Landcare Research Centre (Figure S3). External validation of the selected model for V. vulgaris using the New Zealand V. vulgaris presence data showed that 91% of the occurrence sites were correctly predicted by the selected model. Two combinations were selected for V. vulgaris because they had the same Kappa score; CV error was then used to choose between the two. P and DR indicate the predictor data and dimension reduction method used along with the models selected as best or worst models.
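For readers unfamiliar with the Kappa score reported above, the sketch below computes Cohen's Kappa and sensitivity from a 2 × 2 presence/absence confusion matrix using the standard definitions. The function names and the example counts are hypothetical and are not results from the study.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Chance agreement from the row and column totals of the confusion matrix.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (observed - expected) / (1 - expected)


def sensitivity(tp, fn):
    """Proportion of true presences that the model predicts as presences."""
    return tp / (tp + fn)


# Hypothetical counts: 50 presences and 50 pseudo-absences in a test set.
tp, fn, fp, tn = 40, 10, 5, 45
print(round(cohen_kappa(tp, fn, fp, tn), 3))
print(round(sensitivity(tp, fn), 3))
```

A Kappa of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values (as in one of the “worst” combinations in the table) indicate agreement worse than chance.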
Figure 6. Predicted probability of presence for A. gracilipes. (A) Occurrence data, (B) the best model combination for A. gracilipes, (C) the second-best model combination for A. gracilipes, and (D) the worst model combination for A. gracilipes.
Figure 7. (A) Mean predicted presence across all scenarios for A. albopictus; (B) the associated uncertainty around the mean prediction within the multi-scenario modeling framework. Grey shades show higher uncertainty whereas purplish-bluish shades show lower uncertainty in the form of low SD among replicates.
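The mean and uncertainty maps described in this caption can be thought of as cell-wise summaries over a stack of scenario predictions. The sketch below is a minimal illustration under stated assumptions: predictions are small nested lists rather than rasters, the sample standard deviation is used (the source does not say which SD estimator was applied), and `summarize` and the toy grids are hypothetical.

```python
from statistics import mean, stdev


def summarize(grids):
    """Cell-wise mean and sample SD across equally sized 2-D prediction grids."""
    n_rows, n_cols = len(grids[0]), len(grids[0][0])
    mean_map = [[mean([g[r][c] for g in grids]) for c in range(n_cols)]
                for r in range(n_rows)]
    sd_map = [[stdev([g[r][c] for g in grids]) for c in range(n_cols)]
              for r in range(n_rows)]
    return mean_map, sd_map


# Three toy scenario predictions over a 1 x 2 grid of cells.
grids = [
    [[0.9, 0.2]],
    [[0.8, 0.6]],
    [[1.0, 0.1]],
]

mean_map, sd_map = summarize(grids)
print(mean_map)
print(sd_map)
```

Cells where scenarios agree get a low SD and cells where they disagree get a high SD, which is the information the uncertainty panel conveys.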
Figure 8. Spatial pattern of variability according to (A) predictor data (P), (B) dimension reduction (DR), (C) model type (MT), and (D) the probability density of predicted presences according to the P, DR, and MT for A. albopictus.