Maxime Dupin, Philippe Reynaud, Vojtěch Jarošík, Richard Baker, Sarah Brunel, Dominic Eyre, Jan Pergl, David Makowski.
Abstract
Many distribution models developed to predict the presence/absence of invasive alien species need to be fitted to a training dataset before practical use. The training dataset is characterized by the number of recorded presences/absences and by their geographical locations. The aim of this paper is to study the effect of the training dataset characteristics on model performance and to compare the relative importance of three factors influencing model predictive capability: size of the training dataset, stage of the biological invasion, and choice of input variables. Nine models were assessed for their ability to predict the distribution of the western corn rootworm, Diabrotica virgifera virgifera, a major pest of corn in North America that has recently invaded Europe. Twenty-six training datasets of various sizes (from 10 to 428 presence records) corresponding to two different stages of invasion (1955 and 1980) and three sets of input bioclimatic variables (19 variables, six variables selected using information on insect biology, and three linear combinations of 19 variables derived from Principal Component Analysis) were considered. The models were fitted to each training dataset in turn and their performance was assessed using independent data from North America and Europe. The models were ranked according to the area under the Receiver Operating Characteristic curve and the likelihood ratio. Model performance was highly sensitive to the geographical area used for calibration; most of the models performed poorly when fitted to a restricted area corresponding to an early stage of the invasion. Our results also showed that Principal Component Analysis was useful in reducing the number of model input variables for the models that performed poorly with 19 input variables. DOMAIN, Environmental Distance, MAXENT, and Envelope Score were the most accurate models, but all the models tested in this study led to a substantial rate of misclassification.
Year: 2011 PMID: 21701579 PMCID: PMC3118793 DOI: 10.1371/journal.pone.0020957
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Western corn rootworm distribution in North America.
The hatched area represents non-irrigated maize area.
Figure 2. Western corn rootworm distribution in Europe.
Bioclimatic variables computed from monthly mean temperatures (T), from monthly precipitation sums (P), or from both (T + P).
| Code | Bioclimatic variables | Initial climatic variable |
| **Bio01** | **Annual Mean Temperature** | **T** |
| Bio02 | Mean Diurnal Range [Mean of monthly (max temp - min temp)] | T |
| Bio03 | Isothermality [(Bio02/Bio07)*100] | T |
| Bio04 | Temperature Seasonality [standard deviation *100] | T |
| Bio05 | Max Temperature of Warmest Month | T |
| **Bio06** | **Min Temperature of Coldest Month** | **T** |
| **Bio07** | **Temperature Annual Range [Bio05-Bio06]** | **T** |
| Bio08 | Mean Temperature of Wettest Quarter | T + P |
| Bio09 | Mean Temperature of Driest Quarter | T + P |
| Bio10 | Mean Temperature of Warmest Quarter | T |
| **Bio11** | **Mean Temperature of Coldest Quarter** | **T** |
| Bio12 | Annual Precipitation | P |
| **Bio13** | **Precipitation of Wettest Month** | **P** |
| Bio14 | Precipitation of Driest Month | P |
| Bio15 | Precipitation Seasonality [Coefficient of Variation] | P |
| Bio16 | Precipitation of Wettest Quarter | P |
| **Bio17** | **Precipitation of Driest Quarter** | **P** |
| Bio18 | Precipitation of Warmest Quarter | T + P |
| Bio19 | Precipitation of Coldest Quarter | T + P |
The subset of 6 variables selected based on the literature is formatted in bold.
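The bracketed definitions in the table can be illustrated with a short sketch that derives a few temperature-based variables from monthly values. This is only an illustration under stated assumptions: the function name `temperature_bioclim` and its inputs (12 monthly minimum and maximum temperatures) are hypothetical, the monthly mean is approximated by the min/max midpoint, and the population standard deviation is used for Bio04; the source does not specify these details.

```python
from statistics import mean, pstdev


def temperature_bioclim(tmin, tmax):
    """Return a few temperature-based bioclimatic variables.

    tmin, tmax: 12 monthly minimum and maximum temperatures
    (hypothetical inputs, same unit, e.g. degrees Celsius).
    """
    if len(tmin) != 12 or len(tmax) != 12:
        raise ValueError("expected 12 monthly values for tmin and tmax")

    # Assumption: monthly mean temperature is the min/max midpoint.
    tmean = [(lo + hi) / 2 for lo, hi in zip(tmin, tmax)]

    bio02 = mean(hi - lo for lo, hi in zip(tmin, tmax))  # mean diurnal range
    bio05 = max(tmax)                                    # max temp of warmest month
    bio06 = min(tmin)                                    # min temp of coldest month
    bio07 = bio05 - bio06                                # temperature annual range
    if bio07 == 0:
        raise ValueError("annual range is zero; isothermality is undefined")

    return {
        "Bio01": mean(tmean),           # annual mean temperature
        "Bio02": bio02,
        "Bio03": bio02 / bio07 * 100,   # isothermality
        "Bio04": pstdev(tmean) * 100,   # seasonality (population SD assumed)
        "Bio05": bio05,
        "Bio06": bio06,
        "Bio07": bio07,
    }


# Made-up monthly values, for illustration only.
example = temperature_bioclim(
    tmin=[-5, -3, 1, 5, 10, 14, 16, 15, 11, 6, 1, -3],
    tmax=[3, 5, 11, 17, 22, 26, 28, 27, 23, 16, 9, 5],
)
```

The quarter-based variables (for example Bio08 or Bio18) additionally require monthly precipitation to identify the wettest, driest, warmest, or coldest quarter, which is why the table marks them T + P.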
Nine models for predicting distribution of the western corn rootworm.
| Name | Class of method | Data | Software |
| BIOCLIM | Envelope model | p | DIVA-GIS v5.2 |
| Envelope Score (ES) | Envelope model | p | openModeller v1.0.9 |
| DOMAIN | Multivariate distance | p | DIVA-GIS v5.2 |
| Environmental Distance (ED) | Multivariate distance | p | openModeller v1.0.9 |
| Climate Space Model (CSM) | Principal components analysis | p | openModeller v1.0.9 |
| DKGARP | Genetic Algorithm for Rule Set Production, desktop version, with the best subset procedure | ppa | openModeller v1.0.9 |
| OMGARP | Genetic Algorithm for Rule Set Production, openModeller version, with the best subset procedure | ppa | openModeller v1.0.9 |
| MAXENT | Maximum Entropy | ppa | Maxent v3.3.1 |
| Support Vector Machine (SVM) | Support Vector Machine | ppa | openModeller v1.0.9 |
Data needed for model calibration are presence data (p) or both presence and pseudo-absence data (ppa).
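To make the first two method classes in the table concrete, the sketch below shows a minimal presence-only envelope rule and a Gower-type similarity score. It is not the DIVA-GIS or openModeller implementation: the function names are hypothetical, the envelope uses the plain minimum and maximum of each variable, and the similarity normalizes by the range of each variable across the presence records, which is an assumption about how a DOMAIN-like score is usually computed.

```python
def fit_envelope(presences):
    """Return (min, max) per climate variable over the presence records.

    presences: list of equal-length tuples of climate values (hypothetical data).
    """
    return [(min(col), max(col)) for col in zip(*presences)]


def in_envelope(envelope, site):
    """Envelope rule: suitable only if every variable lies inside its range."""
    return all(lo <= value <= hi for (lo, hi), value in zip(envelope, site))


def gower_similarity(presences, site):
    """Similarity of a site to its closest presence record (1 = identical).

    Distances are averaged over variables after dividing by each variable's
    range across the presence records (assumed normalization).
    """
    ranges = [(hi - lo) or 1.0 for lo, hi in fit_envelope(presences)]

    def distance(record):
        return sum(abs(s - r) / rng
                   for s, r, rng in zip(site, record, ranges)) / len(site)

    return 1.0 - min(distance(record) for record in presences)


# Made-up presence records: (mean temperature, annual precipitation).
presences = [(10.0, 500.0), (12.0, 700.0), (14.0, 600.0)]
envelope = fit_envelope(presences)
inside = in_envelope(envelope, (11.0, 650.0))        # True
outside = in_envelope(envelope, (15.0, 650.0))       # False
score = gower_similarity(presences, (11.0, 650.0))   # 0.75
```

Both rules need only presence data (p). The models marked ppa in the table additionally require pseudo-absence points for calibration.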
Figure 3. Geographical area of the training datasets.
The hatched area represents the WCR distribution before 1955 while the grey area represents the WCR distribution before 1980.
Rotated component matrix of Principal Component Analysis.
| Bioclimatic variables | Component 1 | Component 2 | Component 3 |
| Bio01 Annual Mean Temperature | 0.967 | 0.212 | −0.042 |
| Bio02 Mean Diurnal Range (Mean of monthly (max temp - min temp)) | 0.403 | −0.214 | −0.379 |
| Bio03 Isothermality (BIO2/BIO7) (* 100) | 0.805 | 0.335 | 0.183 |
| Bio04 Temperature Seasonality (standard deviation *100) | −0.843 | −0.252 | −0.267 |
| Bio05 Max Temperature of Warmest Month | 0.856 | 0.074 | −0.287 |
| Bio06 Min Temperature of Coldest Month | 0.952 | 0.242 | 0.112 |
| Bio07 Temperature Annual Range (BIO5-BIO6) | −0.773 | −0.298 | −0.365 |
| Bio08 Mean Temperature of Wettest Quarter | 0.764 | 0.328 | −0.287 |
| Bio09 Mean Temperature of Driest Quarter | 0.958 | 0.11 | 0.095 |
| Bio10 Mean Temperature of Warmest Quarter | 0.892 | 0.148 | −0.21 |
| Bio11 Mean Temperature of Coldest Quarter | 0.967 | 0.231 | 0.054 |
| Bio12 Annual Precipitation | 0.3 | 0.798 | 0.502 |
| Bio13 Precipitation of Wettest Month | 0.342 | 0.896 | 0.166 |
| Bio14 Precipitation of Driest Month | 0.095 | 0.382 | 0.82 |
| Bio15 Precipitation Seasonality (Coefficient of Variation) | 0.285 | 0.068 | −0.745 |
| Bio16 Precipitation of Wettest Quarter | 0.332 | 0.894 | 0.205 |
| Bio17 Precipitation of Driest Quarter | 0.106 | 0.403 | 0.825 |
| Bio18 Precipitation of Warmest Quarter | 0.119 | 0.86 | 0.234 |
| Bio19 Precipitation of Coldest Quarter | 0.237 | 0.438 | 0.671 |
Varimax rotation method with Kaiser normalization. The components are scaled between 0–1; the closer the values to one, the more variance they explain. Values between 0.7–0.79, 0.8–0.89.
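One way to read the loadings is through their squares. Assuming the analysis was run on standardized variables and the rotated components remain orthogonal (as they do under varimax), a squared loading is the share of a variable's variance associated with that component, and the sum over the three components is the share retained by the reduced set. The sketch below applies this to four rows copied from the table; the helper names are hypothetical.

```python
# Rotated loadings on components 1, 2, 3, copied from four rows of the table.
LOADINGS = {
    "Bio01": (0.967, 0.212, -0.042),
    "Bio02": (0.403, -0.214, -0.379),
    "Bio12": (0.3, 0.798, 0.502),
    "Bio14": (0.095, 0.382, 0.82),
}


def squared_loadings(row):
    """Share of the variable's variance associated with each component."""
    return tuple(value * value for value in row)


def communality(row):
    """Share of the variable's variance retained by the three components."""
    return sum(squared_loadings(row))


def dominant_component(row):
    """1-based index of the component with the largest squared loading."""
    squares = squared_loadings(row)
    return squares.index(max(squares)) + 1


for name, row in LOADINGS.items():
    print(name, dominant_component(row), round(communality(row), 3))
```

In these rows, Bio01 loads mainly on component 1, Bio12 on component 2, and Bio14 on component 3, while Bio02 is only weakly represented by the three components (communality of about 0.35).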
Significance of effect of training area, size of training dataset, and set of bioclimatic variables on AUC values.
| Model | Area 1980/1955 | 6/19var (1955) | 6/19var (1980) | PCA/19var (1955) | PCA/19var (1980) | PCA/6var (1955) | PCA/6var (1980) | 20/10 (1955) | 20/10 (1980) | Big/Small |
| BIOCLIM | sig. | sig. | sig. | sig. | sig. | sig. | sig. | NS | NS | sig. |
| CSM | sig. | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| DKGARP | . | sig. | sig. | sig. | sig. | sig. | sig. | NS | NS | sig. |
| DOMAIN | sig. | NS | NS | NS | sig. | NS | sig. | NS | NS | NS |
| ED | sig. | NS | NS | NS | sig. | NS | sig. | NS | NS | NS |
| ES | NS | NS | NS | NS | NS | NS | NS | NS | NS | sig. |
| MAXENT | sig. | NS | NS | sig. | sig. | sig. | sig. | NS | NS | NS |
| OMGARP | . | sig. | sig. | sig. | sig. | NS | . | NS | NS | NS |
| SVM | sig. | NS | NS | NS | sig. | . | sig. | NS | NS | NS |
Comparisons: area 1980 vs. 1955; 6 variables vs. 19 variables (input variables); first three principal components (PCA) vs. 19 variables; PCA vs. 6 variables; training dataset size = 20 vs. 10; big training dataset (more than 50 presence points) vs. small (less than 50).
sig. significant at p<0.05 or lower (*p<0.05, **p<0.01, or ***p<0.001; the level is not distinguished) | . p<0.1 | NS not significant.
Figure 4. Box plots of AUC values and likelihood ratios (sensitivity = 0.95) computed for the nine models with 19 variables, 6 variables, or three principal components (PCA) for training datasets generated from two areas (1955 and 1980).
Continuous and dashed lines correspond to AUC = 0.5 or ratio = 1 and AUC = 0.7 or ratio = 1.5, respectively.
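The two criteria can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the likelihood ratio is the positive likelihood ratio, sensitivity / (1 - specificity), evaluated at the highest score threshold that still classifies at least 95% of the test presences as present. That reading is consistent with a reference value of 1 for an uninformative model, but the exact definition is not given here. The function names and the example scores are hypothetical.

```python
import math


def auc(presence_scores, absence_scores):
    """Rank-based AUC: probability that a presence outscores an absence.

    Ties count for one half.
    """
    wins = 0.0
    for p in presence_scores:
        for a in absence_scores:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(presence_scores) * len(absence_scores))


def likelihood_ratio(presence_scores, absence_scores, sensitivity=0.95):
    """Positive likelihood ratio at a threshold reaching the target sensitivity."""
    ranked = sorted(presence_scores, reverse=True)
    # Number of presences that must be predicted present.
    needed = math.ceil(sensitivity * len(ranked))
    threshold = ranked[needed - 1]

    achieved = sum(s >= threshold for s in presence_scores) / len(presence_scores)
    false_positive_rate = sum(s >= threshold for s in absence_scores) / len(absence_scores)
    if false_positive_rate == 0:
        return math.inf  # no absence is predicted present
    return achieved / false_positive_rate


# Made-up model scores at test presence and absence sites.
presences = [0.9, 0.8, 0.8, 0.7, 0.7, 0.6, 0.6, 0.5, 0.5, 0.4]
absences = [0.1, 0.2, 0.3, 0.4, 0.5, 0.2, 0.1, 0.3, 0.6, 0.2]
ratio = likelihood_ratio(presences, absences)
area = auc(presences, absences)
```

Under these assumptions, a model that scores presences and absences alike gives an AUC near 0.5 and a ratio near 1, which matches the continuous reference lines in the figure.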
Figure 5. Outputs of model ES obtained with 20 presence data located within the 1955 presence area (A) and with 50 presence data located within the 1980 presence area (B).
The model was fitted using 19 input variables in both cases.
Significance of the model performance.
| Model | Input variables | AUC = 0.5 (1955) | AUC = 0.5 (1980) | AUC = 0.7 (1955) | AUC = 0.7 (1980) | LikR = 1 (1955) | LikR = 1 (1980) | LikR = 1.5 (1955) | LikR = 1.5 (1980) |
| BIOCLIM | 6 | . | sig. | NS | NS | NS | NS | NS | NS |
| BIOCLIM | 19 | NS | sig. | NS | NS | NS | NS | NS | NS |
| BIOCLIM | 3 PCA | sig. | sig. | NS | NS | NS | NS | NS | NS |
| CSM | 6 | NS | sig. | NS | NS | NS | sig. | NS | sig. |
| CSM | 19 | NS | sig. | NS | sig. | NS | sig. | NS | NS |
| CSM | 3 PCA | NS | sig. | NS | NS | NS | sig. | NS | NS |
| DKGARP | 6 | sig. | sig. | NS | . | NS | . | NS | NS |
| DKGARP | 19 | NS | sig. | NS | NS | NS | NS | NS | NS |
| DKGARP | 3 PCA | sig. | sig. | NS | sig. | NS | sig. | NS | NS |
| DOMAIN | 6 | NS | sig. | NS | sig. | NS | sig. | NS | sig. |
| DOMAIN | 19 | NS | sig. | NS | sig. | NS | sig. | NS | NS |
| DOMAIN | 3 PCA | NS | sig. | NS | sig. | NS | sig. | NS | NS |
| ED | 6 | sig. | sig. | NS | sig. | sig. | sig. | NS | NS |
| ED | 19 | sig. | sig. | NS | sig. | NS | sig. | NS | NS |
| ED | 3 PCA | sig. | sig. | NS | sig. | . | sig. | NS | sig. |
| ES | 6 | sig. | sig. | NS | NS | NS | sig. | NS | NS |
| ES | 19 | sig. | sig. | sig. | sig. | sig. | sig. | NS | NS |
| ES | 3 PCA | sig. | sig. | NS | NS | NS | sig. | NS | NS |
| MAXENT | 6 | NS | sig. | NS | sig. | NS | sig. | NS | sig. |
| MAXENT | 19 | NS | sig. | NS | sig. | NS | sig. | NS | . |
| MAXENT | 3 PCA | sig. | sig. | NS | sig. | sig. | sig. | NS | NS |
| OMGARP | 6 | sig. | sig. | NS | NS | NS | NS | NS | NS |
| OMGARP | 19 | NS | sig. | NS | NS | NS | NS | NS | NS |
| OMGARP | 3 PCA | sig. | sig. | NS | . | NS | NS | NS | NS |
| SVM | 6 | NS | sig. | NS | NS | NS | sig. | NS | NS |
| SVM | 19 | NS | sig. | NS | NS | NS | sig. | NS | NS |
| SVM | 3 PCA | NS | sig. | NS | . | . | sig. | NS | NS |
Tests “AUC<0.5 vs. AUC>0.5”, “AUC<0.7 vs. AUC>0.7”, “Likelihood ratio<1 vs. Likelihood ratio>1”, and “Likelihood ratio<1.5 vs. Likelihood ratio>1.5”.
sig. significant at p<0.05 or lower (*p<0.05, **p<0.01, or ***p<0.001; the level is not distinguished) | . p<0.1 | NS not significant.
Significance of effect of training area, size of training dataset, and set of bioclimatic variables on likelihood ratio values (sensitivity = 0.95).
| Model | Area 1980/1955 | 6/19var (1955) | 6/19var (1980) | PCA/19var (1955) | PCA/19var (1980) | PCA/6var (1955) | PCA/6var (1980) | 20/10 (1955) | 20/10 (1980) | Big/Small |
| BIOCLIM | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| CSM | sig. | NS | . | NS | NS | NS | NS | NS | NS | NS |
| DKGARP | . | NS | sig. | NS | sig. | NS | sig. | NS | NS | sig. |
| DOMAIN | sig. | NS | sig. | NS | NS | NS | NS | NS | NS | NS |
| ED | sig. | sig. | NS | sig. | . | NS | sig. | NS | NS | NS |
| ES | NS | NS | NS | NS | NS | NS | NS | NS | . | sig. |
| MAXENT | sig. | NS | NS | sig. | NS | sig. | NS | NS | NS | NS |
| OMGARP | NS | NS | NS | NS | NS | NS | NS | NS | NS | NS |
| SVM | sig. | NS | NS | sig. | NS | sig. | NS | NS | NS | NS |
Comparisons: area 1980 vs. 1955; 6 variables vs. 19 variables (input variables); first three principal components (PCA) vs. 19 variables; PCA vs. 6 variables; training dataset size = 20 vs. 10; big training dataset (more than 50 presence points) vs. small (less than 50).
sig. significant at p<0.05 or lower (*p<0.05, **p<0.01, or ***p<0.001; the level is not distinguished) | . p<0.1 | NS not significant.