Literature DB >> 34117771

Geonomics: Forward-Time, Spatially Explicit, and Arbitrarily Complex Landscape Genomic Simulations.

Drew E Terasaki Hart¹, Anusha P Bishop¹, Ian J Wang¹.

Abstract

Understanding the drivers of spatial patterns of genomic diversity has emerged as a major goal of evolutionary genetics. The flexibility of forward-time simulation makes it especially valuable for these efforts, allowing for the simulation of arbitrarily complex scenarios in a way that mimics how real populations evolve. Here, we present Geonomics, a Python package for performing complex, spatially explicit, landscape genomic simulations with full spatial pedigrees that dramatically reduces user workload yet remains customizable and extensible because it is embedded within a popular, general-purpose language. We show that Geonomics results are consistent with expectations for a variety of validation tests based on classic models in population genetics and then demonstrate its utility and flexibility with a trio of more complex simulation scenarios that feature polygenic selection, selection on multiple traits, simulation on complex landscapes, and nonstationary environmental change. We then discuss runtime, which is primarily sensitive to landscape raster size, memory usage, which is primarily sensitive to maximum population size and recombination rate, and other caveats related to the model's methods for approximating recombination and movement. Taken together, our tests and demonstrations show that Geonomics provides an efficient and robust platform for population genomic simulations that capture complex spatial and evolutionary dynamics.

Entities: Chemical

Keywords: Python; environmental change; evolutionary genetics; landscape ecology; population dynamics; spatial modeling

Mesh：

Year: 2021 PMID： 34117771 PMCID： PMC8476160 DOI： 10.1093/molbev/msab175

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 16.240

Introduction

Spatial patterns of genomic diversity result from the complex interplay of many underlying ecological and evolutionary processes and are shaped by a wide variety of geographic and environmental factors. Understanding how these patterns develop in natural systems has emerged as a primary goal of modern evolutionary genetics. These systems often occupy complex and potentially changing landscapes and might include populations that are not at demographic equilibrium. They may undergo neutral evolution as well as natural selection, sometimes on multiple traits of variable genetic architecture. The study of complex natural systems is crucial for developing evolutionary and ecological theory (Epperson et al. 2010; Barrett et al. 2019; Pelletier 2019), understanding the forces governing the evolution and maintenance of genetic diversity (Manel et al. 2003; Schoville et al. 2012), anticipating ecological futures in the Anthropocene (Bay et al. 2018; Capblancq et al. 2020), and informing conservation and management (Crossley et al. 2017; Lind et al. 2017). The complex genomics of many such systems are beyond the reach of analytical population genetics, and their spatial complexity and evolutionary dynamics make them intractable for coalescent simulation (Hoban et al. 2012). This hinders not only our understanding of many empirical systems but also our ability to predict their dynamics and, thus, to manage them. Hence, in population and landscape genomics, as in many other fields, forward-time simulation is a crucial tool for dissecting complex study systems. However, the current suite of forward-time genomic simulators, although numerous, is still of limited utility for such work. Most available software is restricted, either genomically or geospatially, in the complexity, it can model. Many programs can model systems of considerable genomic complexity (e.g., simuPOP, Peng and Kimmel 2005; NEMO, Guillaume and Rougemont 2006; QuantiNemo, Neuenschwander et al. 2008) yet incorporate only rudimentary spatial components or none at all. Other programs are designed specifically for landscape genetic simulations (e.g., CDPOP, Landguth and Cushman 2010; CDMetaPOP, Landguth et al. 2017; SimAdapt, Rebaudo et al. 2013) but are limited in their genomic complexity. For instance, many programs are unable to model simultaneous selection on multiple, polygenic traits. To our knowledge, SLiM (Messer 2013; Haller and Messer 2017; Haller and Messer 2019) is the only package currently capable of simulating scenarios that are sufficiently complex, both genomically and geospatially, to model population genomic patterns emerging under dynamic evolutionary processes (according to a search of the National Cancer Institute’s Genetic Simulation Resources website; Peng et al. 2013), and its extreme generalizability and complexity allow it to be used for landscape genomics simulation. Furthermore, many species are distributed continuously in space, and examining continuous fields of genetic variation can require distinct methods and assumptions (Bradburd and Ralph 2019), yet most population genomic simulation packages, aside from SLiM, are population-based. Such software requires individuals to be assigned to discrete subpopulations, which can at best be arranged on a high-resolution, regular grid in order to approximate continuously distributed populations. Here, we present Geonomics, a Python package for forward-time, individual-based, continuous-space, population genomic simulations on complex landscapes. Geonomics models are parameterized by way of an informatively annotated parameters file that provides the user a straightforward means of building models of arbitrary complexity while offering reasonable default settings and “off switches” for parameters and components unrelated to the user’s interests. Models consist of 1) a landscape with one or more environmental layers, each of which can undergo arbitrarily complex environmental change events and 2) one or more species having genomes with realistic architecture and any number of associated phenotypes. Species undergo non-Wright-Fisher evolution in continuous space, with localized mating and mortality, such that species-level phenomena and simulation dynamics are emergent properties of a model’s parameterization. Evolution is comprehensively tracked by way of recently developed data structures that record the complete spatial pedigree (Kelleher et al. 2018), providing for the customizable output of rich, 3D data sets in a variety of common formats, including VCF and FASTA for genomic data, GeoTiff for landscape data, and CSV, Shapefile, and GeoJSON for individuals’ nongenomic data (location, environmental values, phenotypes, age, and sex). All of this allows Geonomics to produce realistic landscape genomic results useful for a wide variety of theoretical and empirical purposes.

New Approaches

Model Design: Overview

A Geonomics model consists of two core components: the species and the landscape. The species is composed of a set of individuals and a wide variety of demographic and life-history parameters, including an intrinsic growth rate, mate-search radius, the mean number of offspring per mating event, reproductive age, and maximum age, among others. A species can undergo any number of change events, including changes to demographic and life history parameters and various types of population size changes. Each individual in the species has an x, y location, a sex, an age (or life-history stage), a set of phenotypes and a diploid genome consisting of any number of diallelic loci, which can represent either a contiguous haplotype block or a set of distinct loci. Loci can exhibit different types of dominance, and recombination rates can be heterozygous across the genome. Phenotypic traits are continuous and quantitative and can be monogenic or multigenic. Each trait is defined by the loci that comprise its genetic basis, the effect sizes of those loci, and a phenotypic selection coefficient, which can be made heterogeneous in both space and time, allowing for spatially complex selection scenarios. While the strength of selection is determined by that coefficient, the force of selection is represented by the environmental raster layer to which the trait is adapting. Loci can have separate mutation rates for three types of mutations: neutral, deleterious, and trait-affecting mutations. Neutral mutations do not affect fitness, and deleterious mutations decrease fitness without affecting simulated phenotypic traits. Trait-affecting mutations, on the other hand, introduce mutations at previously unmutated loci mapped to a trait. This adds to the genetic variation affecting a trait, thus generating phenotypic variance upon which natural selection can operate. Mutation rates can be defined separately for each trait. The other core component of a Geonomics model, the landscape, is a stack of raster layers. Each layer can be set to serve as one or more of 1) a resistance raster, which controls individual movement or offspring dispersal, 2) a carrying-capacity raster, which controls population density, and 3) a fitness raster for a trait, which governs natural selection. A key feature of Geonomics is that each layer can undergo any number of arbitrarily complex environmental change events which, as they unfold, influence the dynamics of any species whose carrying capacity, movement and dispersal, or fitness depend on the corresponding layer.

Model Operation: Overview

A Geonomics run begins with a burn-in stage during which individuals move and reproduce, without genomes or selection, until a series of statistical tests is passed. These tests include a time-lagged t-test and an augmented Dickey–Fuller test, which are run as a pair for 1) the total population size, serving as a test of temporal demographic stability, and 2) both the mean and the standard deviation of timestep-differenced cell-wise counts of individuals, serving as a test of spatial demographic stability. This burn-in period results in a stationary spatial distribution of the species on the landscape. Following burn-in, each individual has its genome randomly assigned according to the genomic architecture parameters, such that the main phase of each run begins with no pedigree and, thus, without population structure. Each time step in the main phase is a series of four operations, some optional (fig. 1):

Fig. 1.

Operations during the main phase of a Geonomics model run. In the center is a species on a multilayer landscape that includes a selection layer (above) and a layer for movement and carrying capacity (below). Surrounding the landscape is a flow-diagram of the major operations during a time step. Operations in dashed boxes are optional. During the movement stages (top-left), individuals move along movement vectors drawn from various distribution options (shown is an example of a cell-specific von Mises mixture distribution). During the mating stage (top-right), each mating individual (black circle) randomly chooses a mate (white circle) from all potential mates within its mating radius (dashed circle). The resulting offspring (half-black, half-white circle) disperses from its parents' midpoint along a randomly drawn dispersal vector. During the mortality stage (bottom-right), deaths are modeled as a Bernoulli process, with the probability of mortality a product of density-dependence and selection on all traits. During the changes stage (bottom-left), environmental and demographic change events, which can be represented by a series of change rasters corresponding to scheduled time steps, take place.

Model Operation: Movement

Movement takes place in continuous space—individuals have x, y coordinates, on either real or simulated landscapes, rather than being arbitrarily restricted to grid cells or bounded populations. Each individual moves along a vector, composed of a distance drawn from a Wald distribution and a direction drawn either from a uniform distribution on the unit circle or from a movement surface—an array of unimodal or multimodal von Mises distributions derived from a landscape layer that serves as a resistance surface (sensu McRae 2006; Spear et al. 2010). On a unimodal movement surface, each cell is assigned a single von Mises distribution, with mode parameter μ set to the direction of the highest-valued cell in the 8-cell neighborhood. On a multimodal surface, each cell’s mixture distribution is a weighted sum of eight such unimodal distributions, one pointing toward the center of each cell in the 8-cell neighborhood and with normalized weights equal to the values of the neighboring cells. This approach to simulating movement generates realistic, anisotropic movement across a heterogeneous landscape (fig. 2) while avoiding time-consuming computational steps, such as repeated searches for minimum-resistance neighboring cells.

Fig. 2.

A raster layer representing a movement surface with example movement histograms for each cell (left) and a movement track for a sample individual (right). The circularized histograms represent the movement directions that could be drawn from the von Mises mixture distribution approximations within each cell. Longer bars in a histogram indicate higher probability of movement in their direction. The movement track, plotted with the gnx.help.param_help.plot_movement function in Geonomics, is 5,000 steps long. Both preferential movement toward higher-suitability regions of the landscape (i.e., cells closer to 1 in value) and occasional long-distance movements between relatively isolated portions of the landscape are evident.

Model Operation: Mating

Potential mating pairs are randomly drawn from among all eligible pairs of individuals within the mate-search radius (unless strict nearest-neighbor mating is chosen), with pairing probabilities either uniform or inverse-distance weighted within the mating radius, and with eligibility based on both sex and age. From among those pairs, actual mating-event decisions are Bernoulli distributed, with probability equal to the intrinsic birth rate. Each mating pair produces a number of offspring according to a fixed value or drawn from a Poisson distribution with λ equal to the mean number of offspring. Each parent produces a gamete for each of its offspring, using realistic recombination and Mendelian segregation. Gametes are united to create offspring individuals, which then disperse to new locations. As with movement, dispersal vectors can be drawn isotropically or anisotropically based on a resistance surface.

Model Operation: Mortality

Mortality is modeled as a Bernoulli process with the probability of an individual death a combination of the probabilities of death due to density-dependence (using a logistic-growth model) and due to natural selection (based on the cumulative fitness for all traits), calculated as: where is the probability of death due to density-dependence for individual i, m is the number of traits, and is the fitness of individual i for trait p. The probability of density-dependent death at location x,y is calculated as: where, for location x,y, is the expected number of deaths, is the population density, is the expected number of births, and is the population logistic growth rate. The fitness of individual i for trait p is calculated as: where is the phenotypic selection coefficient on trait at location x,y, is the value of the selection layer for trait p at location x,y, is the phenotype of individual i for trait p, and γ defines the curvature of the fitness function for trait p. The phenotype is a result of the additive effects of that individual’s genotypes at all underlying loci, and is calculated as: where is the number of loci, is the effect size of locus l on trait p, is the genotype at locus l for individual i, and , the baseline genotype, equals 0 for monogenic traits or 0.5 for polygenic traits.

Model Operation: Change Events

Each demographic or environmental change event unfolds as a series of incremental changes that occur at the ends of scheduled time steps. A demographic change event can be exponential, random, or cyclical, or it can follow an arbitrarily complex, custom trajectory. Each event is parameterized by defining the time steps at which its changes take place and the factor by which the carrying capacity raster is multiplied. Simple environmental change events are defined by a terminal raster for the final environmental state and a list of time steps at which incremental changes occur (based on cell-wise linear interpolation between the beginning and terminal states). More complex, custom events can be simulated by providing a series of environmental rasters labeled with the time step at which each will be applied. This option makes it easy to simulate evolution on real-world landscapes undergoing nonlinear, spatially heterogeneous environmental change.

Results

Validation

To validate the performance of Geonomics, we ran a series of simulations based on classical population genetic models, covering both neutral and nonneutral evolutionary scenarios. Because the classical models are simpler than the individual-based, spatially explicit, continuous-movement models built by Geonomics, we parameterized the simulations so as to accurately emulate these models while minimizing artifacts (see Validation Testing, Supplementary Material online). Our goal was to statistically and heuristically validate Geonomics’ full range of functionality and to ensure that it accurately models neutral and nonneutral evolutionary processes. To verify that Geonomics effectively models neutral evolution, we first examined the average time to fixation for a neutral allele in a finite population using simulations approximating a Wright-Fisher model (Fisher 1923; Wright 1930). We simulated allele-frequency trajectories for 250 independent loci (25 of which are plotted in supplementary fig. S1, Supplementary Material online), and we found that that fixation time did, indeed, increase with population size and was proportional to 4Ne, as expected (Kimura and Ohta 1969), in our simulations (supplementary fig. S2, Supplementary Material online). We then tested for changes in the rate of drift surrounding a population bottleneck event by forcing a population to undergo a 70% reduction in size for 50 of 300 timesteps. We found that rates of allele frequency change increased during the bottleneck, then returned to prior levels shortly thereafter (supplementary fig. S3, Supplementary Material online). Finally, we quantified the accumulation of genetic structure under a stepping-stone model (Kimura 1953) to certify that genetic covariance decreases with distance (Kimura and Weiss 1964). As expected, we saw that migration rates decreased as a function of inter-island distance (supplementary fig. S4, Supplementary Material online), whereas FST, calculated from both heterozygosity data and genetic variance data, increased as a function of inter-island distance (and, therefore, decreased with pairwise migration rate) and as a function of time (supplementary figs. S4 and S5, Supplementary Material online). We also performed a discriminant analysis of principal components (DAPC) using the R package adegenet (Jombart et al. 2010) to confirm the expected population structure of six island clusters (supplementary fig. S6, Supplementary Material online). To validate the performance of Geonomics for modeling nonneutral evolution, we first performed simulations under a simple scenario of divergent selection between two discrete habitats. As expected, simulations on a landscape evenly divided by two habitat blocks led to local adaptation, producing a significant pattern of phenotype-habitat matching (supplementary fig. S7, Supplementary Material online), with mismatches concentrated along the border between habitats. Additionally, over time, the species reached migration–selection equilibrium—the frequencies of the beneficial alleles in each habitat increased up to a stationary level, with that level being positively correlated with the strength of selection (supplementary fig. S8, Supplementary Material online). A plot of the mean difference between individuals’ phenotypic and environmental values shows a strong decline over model time, with the rate and level of decline increasing as a function of increasing strength of selection (supplementary fig. S9, Supplementary Material online). Finally, logistic regressions show no significant relationships between phenotypic and environmental values at the outset (pseudo-R2s ≈ 0.0, P values > 0.1), but show highly significant relationships at the ends of the simulations (P < 0.0001 for all values of Φ), with the amounts of variation explained increasing as a function of selection strength (pseudo-R2 = 0.327 for Φ = 0.01, 0.376 for Φ = 0.05, and 0.406 for Φ = 0.1). We next tested the ability of Geonomics to recreate the genetic structure expected under local adaptation along an environmental cline: monotonic change in the allele frequency of a nonneutral locus across the cline. On a landscape with a symmetric environmental selection gradient, Geonomics again produced the expected spatial pattern of local adaptation (supplementary fig. S10, Supplementary Material online), and when we fitted sigmoid tanh clines (Szymura and Barton 1986; Porter 2013) for all loci, the locus underlying the monogenic trait was the only one to exhibit clinal variation (supplementary fig. S11, Supplementary Material online). In a family of genotype-environment analyses using Bonferroni-corrected, locus-wise logistic regressions, this locus was also the most significantly correlated with the environmental variable based on locus-wise logistic regressions (P < 0.0001). A plot of the mean difference between individuals’ phenotypic and environmental values shows a strong decline over model time (supplementary fig. S12, Supplementary Material online), and logistic regressions show no significant relationship between phenotypic and environmental values at the outset (pseudo-R2 = 0, P value = 0.812) but a significant relationship at the end of the simulation (pseudo-R2 = 0.169, P value < 0.0001). Finally, we verified that Geonomics can effectively model genomic data with physical linkage by simulating a selective sweep, introducing a beneficial mutation at the center of a 101-locus block of otherwise neutral loci. The results exhibited the classic genomic signal of a selective sweep (Przeworski 2002; Kim and Nielsen 2004), with a region of reduced nucleotide diversity surrounding the locus under selection and that region gradually eroding over time (supplementary fig. S12, Supplementary Material online). During these simulations, as the beneficial mutant spread through the population the population’s mean fitness increased from 1 − Φ (where Φ is the strength of selection) to a saturating value of 1 (supplementary fig. S14, Supplementary Material online), confirming that the population dynamics of the selective sweep played out as expected. To further support these results, we also validated Geonomics’ method of recombination by examining effective recombination rates observed in a Geonomics model to those produced by an msprime simulation using the same randomly drawn, heterogeneous recombination map (see Recombination Test, Supplementary Material online); the resulting genome-wide pattern of recombination breakpoint densities recapitulates the one produced by msprime and the true recombination map (supplementary fig. S15, Supplementary Material online).

Example Applications: Overview

To demonstrate the broad utility of Geonomics for modeling complex evolutionary scenarios, we performed a series of simulations covering a range of potential applications. These demonstrations highlight scenarios for which Geonomics is particularly well suited, including spatially explicit simulations on highly heterogeneous landscapes, selection on multiple traits with complex genomic architecture, and microevolutionary responses to nonstationary environmental change.

Example 1: Isolation by Distance and by Environment

Genetic covariances between individuals or populations are often inversely correlated with linear or resistance-based geographic distance—a pattern known as isolation by distance (IBD; Wright 1943) or isolation by resistance (IBR; McRae 2006; McRae et al. 2008)—or with environmental distance—a pattern known as isolation by environment (IBE; Wang and Bradburd 2014). Understanding the landscape factors and population processes generating these patterns has emerged as a major focus of landscape genetics (Sexton et al. 2014; Wang and Bradburd 2014). To demonstrate how Geonomics can simultaneously generate patterns of IBD and IBE, we built a simulation that uses a heterogeneous resistance layer as a movement surface and models selection for a 10-gene trait on a heterogeneous environmental layer (Φ = 0.05). The model features a species with a stationary population size (roughly 2,450 individuals), experiencing both selection and neutral evolution. The resistance layer consists of a central barrier separating equal-area sides—the barrier has a high resistance to movement, but the movement is unconstrained on either side. This layer was also used as the carrying-capacity layer, yielding homogeneous population density on the two sides and zero density within the barrier region. The selection layer consists of two environmental gradients running in opposite directions on either side of the barrier, such that the landscape contains pairs of locations representing a range of combinations of geographic and environmental distances. To observe the development of population structure, we collected data sets consisting of the genomes for all individuals at timesteps 0 and 1,000. We then used principal component analysis (PCA) to calculate pairwise genetic distances between all individuals for each data set. To visualize population structure, we extracted the first three principal components (PCs) and used them as the red, green, and blue (RGB) color values for mapping individuals on the landscape. To visualize the outcomes of selection, we produced paired maps of the same individuals colored by their phenotypes for the trait under selection (using Geonomics’ “model.plot_phenotype(…)” method), and also created a set of the population–structure plots using DAPC. To visualize the time course of the simulation, we plotted the mean phenotype–environment mismatch (i.e., the mean of |e-z|, the driving force of selection) and mean fitness. We visualized signals of IBD and IBE in the final data set using a 3D scatterplot of Euclidean pairwise genetic distance against Euclidean pairwise geographic and environmental distances, colored by pairwise phenotypic distances. We tested the significance of the relationship between genetic distance and environmental distance, controlling for geographic distance, using paired partial Mantel tests with the vegan package (version 2.5-6; Oksanen et al. 2019) and using multiple matrix regression (MMRR; Wang 2013) in R version 4.0.2 (R Core Team 2020). Finally, because Geonomics models do not use defined landscape-resistance values, we quantified the barrier’s increased landscape resistance by tracking all barrier-crossing events, using them to calculate the per time step crossing rate, then comparing that to the equivalent crossing rate of the same landscape zone in an otherwise identical model that omitted the barrier. The RGB and phenotype plots of the initial population, with randomly assigned genomes, showed a clear lack of both spatial structure and local adaptation (fig. 3, top left). However, as expected, spatial structure developed over time, and the species showed signs of local adaptation over the course of the simulation (fig. 3, top right), as well as a corresponding, hierarchical population structure (supplementary fig. S15, Supplementary Material online). Average phenotype-environment mismatch decreased and average fitness increased over time (fig. 3, top middle). At the end of the simulation, the species demonstrated significant signals of both IBD (partial Mantel test: r = 0.560, P ≤ 0.001; MMRR: P ≤ 0.001) and IBE (partial Mantel test: r = 0.121, P ≤ 0.001; MMRR: P ≤ 0.001; MMRR full model R2 = 0.354), as evidenced by the positive slopes on both horizontal axes of the 3D scatterplot (fig. 3, bottom). The colors of the points in the 3D scatter plot also indicate a clear pattern of increasing phenotypic differences between individuals with increasing environmental distance (fig. 3, bottom right) but not between individuals separated by increasing geographic distances (fig. 3, bottom left). Finally, the barrier zone had an observed crossing rate of 0.004 individuals per time step in this model, 13 times lower than the rate of 0.052 individuals per time step observed in the barrier-less but otherwise identical model. These results show that Geonomics effectively models IBD and IBE, driven by divergent natural selection, using two simple raster layers. More complex layers could be used to simulate IBD and IBE under a wide range of scenarios, and empirical layers could be used to simulate patterns of spatial genetic variation on real-world landscapes.

Fig. 3.

Results of simulations for the isolation by distance (IBD) and isolation by environment (IBE) example application, in which a species evolved on a landscape with a barrier layer that served as the movement surface (displayed as a vertical gray band down the landscape) and an environmental layer that served as the selective surface for a 10-locus trait (displayed as the red to blue gradient on the landscape). (A) The population before the simulation (left column) and after it (right column), colored by genetic distance (top row), with colors derived from scores on the first three PCs of a genetic PCA used to assign RGB values, and by phenotype (bottom row). The most-fit individuals are those whose phenotypic colors perfectly match the cells on which they are located. (B) The time courses of the mean difference between individuals’ phenotypes and their environmental values (blue) and of mean fitness values (red). (C) Two views of a 3D scatter plot of pairwise genetic distance as a function of Euclidean geographic distance (left) and Euclidean environmental distance (right), with points colored by phenotypic distance.

Example 2: Simultaneous Election

One of the most powerful features of Geonomics is that it can simulate selection on numerous traits simultaneously, each responding to a separate selection layer. Thus, a simulated species can experience multiple spatial selection regimes. Many natural systems are locally adapted to multiple environmental variables (Fournier-Level et al. 2011; Lasky et al. 2012; Manel et al. 2012), so simulating these scenarios could be broadly valuable for investigating the nature of local adaptation in real environments. To demonstrate how Geonomics can model simultaneous selection, we simulated a scenario in which a species undergoes natural selection along two orthogonal environmental gradients, each driving selection for a separate trait (Φ = 0.05). Each trait had values ranging from 0 to 1, determined by 10 loci, all with equal effect sizes. Individuals had a mean movement distance of 0.5 cell widths on a 50 × 50-cell landscape, chosen to limit gene flow and allow for the development of strong spatial structure and, thus, the potential for local adaptation. We let the system evolve for 1,000 time steps and then mapped the species on each of the environmental layers, with individuals colored by phenotype in order to visually evaluate whether individual phenotypes matched their environmental backgrounds. The results showed clear patterns of phenotype–environment matching along both independent gradients (fig. 4) that evolved steadily through time (supplementary fig. S16, Supplementary Material online; compare to supplementary figs. S17 and S18, Supplementary Material online, with Φ = 0), indicating strong evidence for simultaneous selection across the simulated landscape.

Fig. 4.

Results of simultaneous selection on two traits with spatially distinct selective regimes. Each trait is controlled by 10 unlinked loci and has a selection coefficient of Φ = 0.05. Individuals are colored by phenotype for the trait under selection on each layer.

Example 3: Polygenic Adaptation to Climate Change in the Yosemite Region

Better understanding evolutionary responses to changing environments is essential for predicting species outcomes and preserving biodiversity under ongoing climate change (Hoffmann and Sgrò 2011; Franks and Hoffmann 2012; Bay et al. 2018; Capblancq et al. 2020). In many regions, climate shifts are projected to be spatially heterogeneous, including in montane regions where cooler, higher-altitude areas are warming more quickly than warmer, low-altitude regions (Rangwala et al. 2013; Mountain Research Initiative EDW Working Group 2015; but see Oyler et al. 2015). Of particular interest under these scenarios is the ability of species to adapt to changing local conditions (Franks and Hoffmann 2012). To demonstrate the utility of Geonomics for studying microevolutionary responses to climate change, we simulated the response to projected climate change of a continuously distributed, locally adapted species, using the sagebrush lizard (Sceloporus graciosus) in the topographically complex Yosemite National Park region of California (USA) as an empirical model. To model climate change, we assembled time series raster stacks of projected mean annual temperature, annual precipitation, and habitat suitability for 19 even time steps from the present through the year 2100. For present temperature and precipitation, we used PRISM data (Daly et al. 2008), calculated as 30-year normals for 1981–2010 at 800 m resolution. For future years, we used means at a set of 5-year intervals (2015–2100), downscaled to 6 km resolution using the localized constructed analogs downscaling technique (LOCA; Pierce et al. 2014), from the Cal-Adapt database (https://cal-adapt.org/). We calculated means of both variables from their minima and maxima observed across 32 global climate models, using a conservative representative concentration pathway (RCP 4.5). We developed time series of future temperature and precipitation layers at 800-m resolution by 1) calculating the raster difference between the first projected year and the current data, aggregated to the projected data’s resolution; 2) adding that difference to the current data, such that each cell in the current data received the difference of the coarser, projected cell within which it lay; and 3) repeating that process for all remaining years. All data preparation was done using custom scripts (Supplementary Material online) in R (R Core Team 2020). For the habitat suitability rasters, we constructed a species distribution model (SDM) using the present-day temperature and precipitation variables. We downloaded all georeferenced S. graciosus occurrence data from the Global Biodiversity Information Facility database (www.gbif.org), using the gbif function in the dismo R package (Hijmans et al. 2017). We clipped the points to California and Nevada, then subsampled the full data set to remove multiple points within the same raster cells. We generated pseudoabsence data by drawing random points from all cells in the California-Nevada region where the species was not observed (following the recommendations of Barbet‐Massin et al. 2012). We extracted the current temperature and precipitation data at these points and used them as predictor variables in a binomial generalized linear model (GLM) with a logit link. We then projected that GLM onto the current and future temperature and precipitation rasters for our study region, producing a time series of predicted habitat suitability. We generated the simulation’s parameters file using the code provided in Code Sample S1, then edited the parameter values therein as needed. To simulate the nonneutral evolution of a polygenic, quantitative trait, we set the trait to be underlain by 100 loci randomly distributed across a genome of 1,000 loci and set a strength of selection of Φ = 0.5. We set other life-history and demographic parameters (carrying capacity, age at reproductive maturity, number of offspring per individual, and maximum age) to reasonable values based on S. graciosus natural history (Stebbins 1948; Tinkle 1973; Rose 1976; Ruth 1978; Tinkle et al. 1993; Supplementary Material online). We ran the main phase for 500 time steps without climate change (to develop a pattern of local adaptation), then ran an additional 100 time steps (years) with changing climate (see Code Sample S2). At time steps 500 (before the initiation of climate change events) and 600 (after completion of climate change events), we plotted the current temperature and habitat-suitability landscape layers along with a kriged surface of the current population’s phenotypes and a kernel density map of the current population’s density, two key emergent properties of the model that should be driven by temperature and habitat suitability, respectively. We then ran the model for an additional 50 timesteps to be able to more clearly visualize the effect of climate change on population size. The model generated a clear and realistic pattern of adaptation to the spatial temperature gradient in the Yosemite region after the 500 iterations following burn-in, and that pattern demonstrated a spatial shift in phenotypes that aligns clearly with the spatial shift in temperature under the simulated climate change scenario (fig. 5, rows 1 and 2; Video 1, Supplementary Material online). The model also generated a spatial pattern of population density that clearly aligns with spatial variation in habitat suitability prior to the onset of climate change that likewise shifted as expected in response to the climate change-induced shift in habitat suitability (fig. 5, rows 3 and 4). We observed demographic changes in response to climate change over the course of the simulation as well. After climate change, mean population size was reduced by roughly 16.8% (from about 255,500 to 212,500 individuals), in line with the 17.9% reduction in the carrying capacity layer derived from the habitat suitability rasters (from 337,089.0 to 276,742.8 individuals, according to sums of the pre- and post-change carrying capacity layers). The population also exhibited sizable fluctuations during the climate change period, with oscillations exceeding 60,000 individuals (roughly 23.5% of the pre-change mean population size; fig. 5). We interpret this as a result of the stepwise environmental changes comprising the climate change event. Each change causes a shift in the optimum phenotypes of local populations, leading to increased maladaptation and thus increased mortality rates. Subsequent reductions in density-dependent mortality rates because of these reduced population densities, paired with adaptation by natural selection, then reduce overall mortality rates, leading to rebounds in population size, with stochastic movement into and out of local populations, along with other sources of model stochasticity, imposing noise on this oscillatory behavior. Overall, these results show how Geonomics can effectively simulate organismal responses to highly complex environmental scenarios and reveal that these simulations can uncover system behavior that could provide avenues for future investigation.

Fig. 5.

Polygenic adaptation to climate change in the Yosemite region. (A) Hillshade plot comparisons of key variables (mean temperature, phenotype, habitat suitability, and population density) before and after the simulated climate change event. The mean of each variable through time is used to draw midvalue contours on each map (white lines) to help visualize spatial change. As expected, the spatiotemporal shift in temperature (first row) drives a spatially corresponding shift in phenotypes, visualized as a surface kriged from all phenotypic values (second row), and the shift in habitat suitability (third row) likewise drives a corresponding shift in population density, visualized using a 2D kernel density estimator (fourth row). (B) The time course of population size. The early drop in population size results from the onset of natural selection after completion of the unplotted burn-in portion of the model. The oscillations and ultimate reduction at the end of the simulation are a result of the climate change event, which occurs during the period bracketed by vertical red lines.

Discussion

Our validations tests demonstrate that Geonomics simulates molecular evolution in concordance with predictions from theoretical population genetics (Fisher 1923; Wright 1943; Kimura; Szymura and Barton 1986), including dynamics of genetic drift, migration, and selection along clines, and our example applications show that Geonomics is capable of generating accurate and realistic population and landscape genomic data sets under scenarios of varying complexity. Geonomics is embedded in Python (van Rossum 1995; Python Software Foundation 2019), one of the most popular programming languages and one already familiar to many researchers who use bioinformatics. It makes the creation of arbitrarily complex models quick and easy, without even requiring prior Python experience, yet provides advanced users with access to core data structures, enabling broad customization and extension. Many theoretical questions in population genomics necessitate explicitly spatial study methods, often with full tracking of a population’s spatial pedigree (Bradburd and Ralph 2019). Geonomics makes this work more tractable than ever before. Landscape genomics studies draw conclusions about complex, real-world systems, sometimes with direct implications for conservation and management (Epperson et al. 2010; Landguth et al. 2012). Geonomics not only enables the generation of simulated data sets specific to such study systems but will also aid the development and testing of analytical methods in landscape genomics, strengthening our ability to draw accurate and reliable inferences from real-world data.

Runtime and Memory

Geonomics models run more slowly and have steeper memory limitations than models written and optimized in compiled languages, such as SLiM (Messer 2013; Haller and Messer 2017; Haller and Messer 2019). However, for users whose scenarios are well served by the design and affordances of Geonomics, what is sacrificed in runtime will be made up for in flexibility, customizability, and ease of use. With a reasonably powerful computer and for moderately sized models, most users should not find runtime or memory a major limitation. Indeed, our first two example applications were run on a laptop computer with 8 GB of RAM and an Intel® Core™ i5-8250U 3.4 GHz quad-core processor. Each run took an average of 271 s (about 0.27 s per time step) for the IBD-IBE model and 144 s (roughly 0.14 s per time step) for the simultaneous selection model. Because the polygenic adaptation example has much higher complexity, approximating the high population density of a small vertebrate, we ran it on a regular-memory node of a computing cluster (the savio3 partition of UC Berkeley’s Savio system) with 96 GB RAM and 2.1 GHz Skylake processors in order to handle the larger memory requirement. This model took considerably longer to run (approximately 7.35 h, running at about 32 s per time step after time for upfront computation of a series of changing movement surfaces) and had a peak memory usage of 25.433 GB, but even highly complex scenarios like this remain tractable on reasonable research timelines. Given the complexity of Geonomics and the number of parameters a user can modify, numerous parameters and parameter combinations can influence a model’s average runtime. We provide a basic runtime analysis (fig. 6), run on the same 8 Gb, quad-core laptop as the examples above. This analysis highlights some basic parameters that are likely to influence a model’s average runtime per time step, including the mean population size (as determined by an array of local carrying capacities), the number of offspring per mating event, the size of the landscape, and the number of nonneutral loci in the genome. The effect of landscape size predominates, as runtime scales superlinearly with this parameter. The number of nonneutral loci actually has only slight effects on total runtime, and because neutral genetic data are stored in a set of tables rather than redundantly for each individual (Kelleher et al. 2018) runtime is even less sensitive to the number of neutral loci in the genome (although recombination rate does impact runtime and memory usage), meaning that Geonomics can efficiently simulate genome-scale data sets if provided adequate memory to store the set of recombination pathways that is calculated at the outset. Finally, after the upfront cost of computing recombination pathways and movement surfaces, runtime scales roughly linearly with the number of time steps, barring large demographic changes. That means that the moderately complex scenario in our simultaneous selection example could complete 1,000,000 time steps in approximately 40 h, and even the highly complex scenario in our polygenic adaptation example with more than 200,000 individuals could run through 10,000 generations in about 3.7 days. Hence, Geonomics could even prove useful for research at deeper timescales, for example in phylogeography or geogenomics (Baker et al. 2014).

Fig. 6.

Average runtime as a function of four major parameters: the number of nonneutral loci (n_loci), the carrying-capacity constant (K) that determines mean population size, landscape size (dim), and the λ parameter of the Poisson distribution from which the number of offspring in each mating event is drawn (lambda). Runtime increases with landscape dimension both because of functions whose runtimes scale with the landscape size directly and because of functions whose runtime scales with total population size. The difference between the lines for K and dim can be taken as an indication of the runtime cost of landscape dimension above and beyond population size effects, which predominates because it is superlinear.

Caveats

Geonomics uses two unconventional approximations to make complex models tractable within reasonable compute time in an interpreted language. The first is the approximation used to model heterogeneous recombination. Enacting recombination between all neighboring loci each time a gamete is produced would require an extremely large and time-consuming number of random draws. To avoid this, when a model is first created, Geonomics generates and saves, as binary arrays, a large collection of recombination “paths.” The number of paths used is set by the user, and directly determines the minimum recombination rate difference that can be modeled. Each path is just a genome-length array that switches between 0 and 1 at each interlocus position where a recombination event should occur. The path can then quickly be used to subset an individual’s genome, producing a gamete. As a model runs and gametes are continually produced, these paths are repeatedly shuffled and drawn through, like a deck of cards during multiple rounds of a game. This approach, which we have validated using msprime (see Recombination Test, Supplementary Material online), can lead to memory limitations for models with a large number of paths and a long genome, because the data structure containing the paths is essentially a 2D binary array whose size is the product of these values and because more recombination events require that more trees are recorded in the spatial pedigree. To avoid these problems, for genomic architectures with homogeneous recombination rates, Geonomics provides the option to use an alternative recombination mechanism that simulates recombination on the fly for each new gamete but does so at a cost of increased average runtime per time step. The second is the approximation used to model the circular distributions from which movement directions are drawn. Conceptually, a movement or dispersal surface is an array of Von Mises distributions. In practice, each distribution on that surface is represented by a column of angular directions (an “approximation column”) drawn, at the time the model is built, from the true, continuous distribution. During a model run, to draw a movement direction from a cell, a random value is sampled from that cell’s approximation column. This increases computational efficiency by avoiding large numbers of calls to random number generators during runtime. The accuracy of these approximation columns is a function of their length, which is set by the user. This length will usually not be so constrained that it significantly impacts the accuracy of the approximation, but such a constraint could arise if the movement or dispersal surface undergoes environmental change. In this case, the movement surfaces corresponding to each step of the change event will be generated and stored when the model is first created, and the series of arrays produced could exhaust memory if the landscape is very large and has many environmental change steps. A solution to this problem would feature some combination of decreasing the temporal resolution of the environmental change event, decreasing the landscape size, or decreasing the approximation column length. In all cases, users may check the accuracy of modeled movement by using built-in functions that visualize the composition and behavior of movement and dispersal surfaces.

Materials and Methods

We performed all simulations using the Geonomics Python package (see New Approaches) as described in the Results section and Supplementary Material. The simplest way to get started with Geonomics is to install it via pip. Geonomics uses common, well-established Python packages as required dependencies—Numpy (Harris et al. 2020), Matplotlib (Hunter 2007), Pandas (McKinney et al. 2010), Shapely (Gillies et al. 2007), Scipy (Virtanen et al. 2020), Scikit-learn (Pedregosa et al. 2011), Statsmodels (Seabold and Perktold 2010), Rasterio (Gillies et al. 2019), Bitarray (Schnell et al. 2021), msprime (Kelleher et al. 2016), and tskit (Kelleher et al. 2018)—and offers optional integration of neutral landscape models through the NLMpy package (Etherington et al. 2015). The source code is publicly available on GitHub (https://github.com/drewhart/geonomics), where it is actively maintained and developed.

Conclusions

Geonomics is a Python package designed to make building and running complex landscape genomic models quick and simple. At the same time, it provides a flexible scripting framework that allows advanced users to customize and extend its functionality. We believe Geonomics will prove highly useful for theoretical, empirical, methodological, and applied research in population and landscape genomics, molecular ecology, global change biology, and conservation.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online. Click here for additional data file.

37 in total

1. Using circuit theory to model connectivity in ecology, evolution, and conservation.

Authors: Brad H McRae; Brett G Dickson; Timothy H Keitt; Viral B Shah
Journal: Ecology Date: 2008-10 Impact factor: 5.499

2. SLiM: simulating evolution with selection and linkage.

Authors: Philipp W Messer
Journal: Genetics Date: 2013-05-24 Impact factor: 4.562

3. Examining the full effects of landscape heterogeneity on spatial genetic variation: a multiple matrix regression approach for quantifying geographic and ecological isolation.

Authors: Ian J Wang
Journal: Evolution Date: 2013-05-11 Impact factor: 3.694

4. Broad-scale adaptive genetic variation in alpine plants is driven by temperature and precipitation.

Authors: Stéphanie Manel; Felix Gugerli; Wilfried Thuiller; Nadir Alvarez; Pierre Legendre; Rolf Holderegger; Ludovic Gielly; Pierre Taberlet
Journal: Mol Ecol Date: 2012-06-10 Impact factor: 6.185

5. Characterizing genomic variation of Arabidopsis thaliana: the roles of geography and climate.

Authors: Jesse R Lasky; David L Des Marais; John K McKay; James H Richards; Thomas E Juenger; Timothy H Keitt
Journal: Mol Ecol Date: 2012-08-01 Impact factor: 6.185

6. Linking a mutation to survival in wild mice.

Authors: Rowan D H Barrett; Stefan Laurent; Ricardo Mallarino; Susanne P Pfeifer; Charles C Y Xu; Matthieu Foll; Kazumasa Wakamatsu; Jonathan S Duke-Cohan; Jeffrey D Jensen; Hopi E Hoekstra
Journal: Science Date: 2019-01-31 Impact factor: 47.728

7. Genetic Simulation Resources: a website for the registration and discovery of genetic data simulators.

Authors: Bo Peng; Huann-Sheng Chen; Leah E Mechanic; Ben Racine; John Clarke; Lauren Clarke; Elizabeth Gillanders; Eric J Feuer
Journal: Bioinformatics Date: 2013-02-23 Impact factor: 6.937

8. SLiM 3: Forward Genetic Simulations Beyond the Wright-Fisher Model.

Authors: Benjamin C Haller; Philipp W Messer
Journal: Mol Biol Evol Date: 2019-03-01 Impact factor: 16.240

Review 9. Array programming with NumPy.

Authors: Charles R Harris; K Jarrod Millman; Stéfan J van der Walt; Ralf Gommers; Pauli Virtanen; David Cournapeau; Eric Wieser; Julian Taylor; Sebastian Berg; Nathaniel J Smith; Robert Kern; Matti Picus; Stephan Hoyer; Marten H van Kerkwijk; Matthew Brett; Allan Haldane; Jaime Fernández Del Río; Mark Wiebe; Pearu Peterson; Pierre Gérard-Marchant; Kevin Sheppard; Tyler Reddy; Warren Weckesser; Hameer Abbasi; Christoph Gohlke; Travis E Oliphant
Journal: Nature Date: 2020-09-16 Impact factor: 49.962

10. Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes.

Authors: Jerome Kelleher; Alison M Etheridge; Gilean McVean
Journal: PLoS Comput Biol Date: 2016-05-04 Impact factor: 4.475

1 in total

1. Efficient ancestry and mutation simulation with msprime 1.0.

Authors: Franz Baumdicker; Gertjan Bisschop; Daniel Goldstein; Graham Gower; Aaron P Ragsdale; Georgia Tsambos; Sha Zhu; Bjarki Eldon; E Castedo Ellerman; Jared G Galloway; Ariella L Gladstein; Gregor Gorjanc; Bing Guo; Ben Jeffery; Warren W Kretzschumar; Konrad Lohse; Michael Matschiner; Dominic Nelson; Nathaniel S Pope; Consuelo D Quinto-Cortés; Murillo F Rodrigues; Kumar Saunack; Thibaut Sellinger; Kevin Thornton; Hugo van Kemenade; Anthony W Wohns; Yan Wong; Simon Gravel; Andrew D Kern; Jere Koskela; Peter L Ralph; Jerome Kelleher
Journal: Genetics Date: 2022-03-03 Impact factor: 4.402

1 in total