Evlyn Pless1,2, Norah P Saarman3,4, Jeffrey R Powell3, Adalgisa Caccone3, Giuseppe Amatulli5,6.
Abstract
Mapping landscape connectivity is important for controlling invasive species and disease vectors. Current landscape genetics methods are often constrained by the subjectivity of creating resistance surfaces and the difficulty of working with interacting and correlated environmental variables. To overcome these constraints, we combine the advantages of a machine-learning framework and an iterative optimization process to develop a method for integrating genetic and environmental (e.g., climate, land cover, human infrastructure) data. We validate and demonstrate this method for the Aedes aegypti mosquito, an invasive species and the primary vector of dengue, yellow fever, chikungunya, and Zika. We test two contrasting metrics to approximate genetic distance and find that Cavalli-Sforza-Edwards distance (CSE) performs better than linearized FST. The correlation (R) between the model's predicted genetic distance and actual distance is 0.83. We produce a map of genetic connectivity for Ae. aegypti's range in North America and discuss which environmental and anthropogenic variables are most important for predicting gene flow, especially in the context of vector control.
Keywords: gene flow; invasive species; landscape genetics; random forest; vector control
Year: 2021 PMID: 33619083 PMCID: PMC7936321 DOI: 10.1073/pnas.2003201118
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 11.205
Important terminology and acronyms
| Abbreviation | Explanation |
| IBD | Isolation by distance: the expectation of increased genetic distance with increased geographic distance |
| CSE | Cavalli-Sforza–Edwards distance |
| Linearized FST | Measure of genetic distance: FST/(1 − FST) |
| Full | Complete dataset (38 sites, 703 pairwise genetic distances) |
| Train | Complete dataset excluding one point and its affiliated pairs (37 sites, 666 pairwise genetic distances) |
| Test | One point and its affiliated pairs (1 site, 37 pairwise genetic distances) |
| Rtrain | Pearson correlation between predicted and observed genetic distance for training dataset |
| Rtest | Pearson correlation between predicted and observed genetic distance for testing dataset |
| Rfull | Pearson correlation between predicted and observed genetic distance for full dataset |
| RMSEtrain | Root-mean-square error of model using the training dataset |
| RMSEtest | Root-mean-square error of model using the testing dataset |
| RMSEfull | Root-mean-square error of model using the full dataset |
| RF | Random forest: a nonlinear classification and regression tree analysis |
| RSQtrain | Pseudo-R-squared (% variance explained by the model) built with the training dataset |
| RSQfull | Pseudo-R-squared (% variance explained by the model) built with the full dataset |
See for more details.
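The distance transform and fit metrics defined in the table can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' code; the function names and the sample values are assumptions made for the example.

```python
import math


def linearized_fst(fst):
    """Linearized FST as defined in the table: FST / (1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return fst / (1.0 - fst)


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted genetic distance."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


# Made-up values for illustration only; they are not data from the study.
observed = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.37]

print(linearized_fst(0.2))
print(pearson_r(observed, predicted))
print(rmse(observed, predicted))
```

The same two functions would yield Rtrain, Rtest, and Rfull (or the corresponding RMSE values) depending on which subset of pairwise distances is passed in.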
Fig. 2. Optimized connectivity map using CSE full dataset. The black points show collection sites for Ae. aegypti (the genetic data).
Fig. 1. Pipeline of model workflow. Left side (leave-one-out cross-validation) ensures internal model accuracy by highlighting potential overfitting; right side (full dataset run) produces the full dataset model output and overall variable importance.
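The leave-one-out split on the left side of the pipeline can be illustrated with a short sketch. It holds out one site at a time, together with every pair that involves it, and reproduces the dataset sizes listed in the terminology table (703, 666, and 37 pairwise distances). The site names and the helper `leave_one_site_out` are hypothetical; the sketch covers only the data split, not model fitting.

```python
from itertools import combinations


def leave_one_site_out(sites):
    """Yield (held_out_site, train_pairs, test_pairs) for each site.

    Train pairs exclude the held-out site entirely; test pairs are the
    pairs that include it.
    """
    all_pairs = list(combinations(sites, 2))
    for held_out in sites:
        train_pairs = [p for p in all_pairs if held_out not in p]
        test_pairs = [p for p in all_pairs if held_out in p]
        yield held_out, train_pairs, test_pairs


# Placeholder site labels; the study's actual site names are not used here.
sites = [f"site_{i:02d}" for i in range(38)]
full_pairs = list(combinations(sites, 2))

held_out, train_pairs, test_pairs = next(leave_one_site_out(sites))
print(len(full_pairs), len(train_pairs), len(test_pairs))  # 703 666 37
```

Splitting by site rather than by individual pair keeps every distance involving the held-out site out of training, which is what makes the test correlation a check on overfitting.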
Results from CSE full dataset model
| Iteration | Rfull | RMSEfull | RSQfull |
| Straight | 0.786 | 0.0388 | 0.606 |
| 1 | 0.825 | 0.0353 | 0.674 |
| 2 | 0.824 | 0.0356 | 0.669 |
| 4 | 0.820 | 0.0357 | 0.667 |
| 5 | 0.830 | 0.0347 | 0.685 |
| 6 | 0.817 | 0.0360 | 0.661 |
| 7 | 0.817 | 0.0359 | 0.662 |
| 8 | 0.821 | 0.0356 | 0.669 |
| 9 | 0.818 | 0.0361 | 0.658 |
| 10 | 0.828 | 0.0349 | 0.680 |
Rfull, Pearson correlation between observed and expected CSE; RMSEfull, root-mean-square error; RSQfull, percentage of variance explained. Iteration 3 (in bold) has the lowest RMSEfull and is therefore chosen as the optimal iteration.
Fig. 5. Observed versus predicted genetic distance for CSE full dataset model. The red line is the best-fit linear regression, and the black line is y = x.
Fig. 3. Variable importance list for the CSE full dataset model. The x axis shows the mean decrease in accuracy of the model when excluding each variable computed from permuting out-of-bag data.
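The idea behind permutation-based importance can be sketched as follows: shuffle one predictor at a time and measure how much the prediction error rises. This is a simplified illustration, not the random forest implementation used in the study. It scores a fixed toy predictor on a small made-up dataset rather than using each tree's out-of-bag samples, and all names (`permutation_importance`, `toy_predict`) are assumptions.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each column is shuffled, one at a time."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only column j replaced by a shuffled value.
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(y, [predict(r) for r in permuted]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_predict(row):
    """Toy stand-in for a fitted model: uses column 0 and ignores column 1."""
    return 2.0 * row[0]


# Made-up data: the response depends only on the first column.
rows = [[float(i), float((i * 7) % 10)] for i in range(10)]
y = [2.0 * r[0] for r in rows]

scores = permutation_importance(toy_predict, rows, y)
print(scores)
```

A variable the model relies on shows a large error increase when shuffled, while an ignored variable scores zero, which is the ordering a variable importance list summarizes.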
Fig. 4. Important variables for the CSE full dataset model. (A) Maximum temperature (degrees Celsius × 100). (B) Slope (degree incline). (C) Barren land cover (%). (D) Human density (density of buildings and structures, scaled to a maximum of 1).