Literature DB >> 32680854

rdmc: An Open Source R Package Implementing Convergent Adaptation Models of Lee and Coop (2017).

Abstract

The availability of whole genome sequencing data from multiple related populations creates opportunities to test sophisticated population genetic models of convergent adaptation. Recent work by Lee and Coop (2017) developed models to infer modes of convergent adaption at local genomic scales, providing a rich framework for assessing how selection has acted across multiple populations at the tested locus. Here I present, rdmc, an R package that builds on the existing software implementation of Lee and Coop (2017) that prioritizes ease of use, portability, and scalability. I demonstrate installation and comprehensive overview of the package's current utilities.

Entities: Chemical Disease Gene Species

Keywords: Composite Likelihood; Convergent Adaptation; Population Genetics; R; Software

Mesh：

Year: 2020 PMID： 32680854 PMCID： PMC7467004 DOI： 10.1534/g3.120.401527

Source DB: PubMed Journal: G3 (Bethesda) ISSN： 2160-1836 Impact factor: 3.154

Convergent adaptation occurs when natural selection independently orchestrates the evolution of the same set of trait or traits in multiple populations (Takuno ; Tishkoff ; Yeaman ; Losos 2011). Efforts by Lee and Coop (2017) used coalescent theory to develop composite likelihood models to infer which among several competing modes of convergent adaptation best explains allele frequencies at a putatively selected region. These models provide rich statistical information about the inferred adaptive mutation, including its location along the region, the strength of selection, migration rate, age, and its initial allele frequency prior to selection. To facilitate use of the convergent adaptation models of Lee and Coop (2017), I developed rdmc, an R package implementing their models that was designed to be easy to use, portable, and scalable. In this short manuscript, I provide an overview of the usage and installation of the package, concluding with opportunities for future improvements and expansion to the software.

Materials And Methods

Lee and Coop (2017) described three distinct modes of convergent adaption: independent mutations, where two or more populations independently gain the selected mutation; migration, where the mutation occurs once and subsequently migrates to multiple populations prior to fixation; and standing variation, where the mutation was present at low frequency in the ancestral population prior to divergence. The models are composite likelihood-based, where likelihood calculations are made over a grid of user-chosen input parameters.

Data requirements

Using rdmc requires two kinds of allele frequency data. The first is allele frequencies from unlinked neutral sites across all populations. The second is allele frequencies from at least three populations that have putatively undergone convergent adaptation at a specific locus, and three or more populations that did not. Sample allele frequencies can be estimated with a number of existing software resources including VCFtools (Danecek ) and ANGSD (Korneliussen ). Typically, the allele frequencies at sites that have putatively undergone convergent adaptation will have been identified prior to using rdmc. Numerous methods exists for identifying such regions, such as finding overlapping selective sweeps in multiple populations (Stetter ), or by identifying regions with elevated values between populations that putatively did and did not experience convergent adaptation (Hohenlohe ). Additionally, rdmc requires an estimate of the per base recombination rate for the region or regions of interest, and an estimate of the effective population size. Assuming a mutation rate, effective population size can be estimated from genetic diversity (Gillespie 2004), or inferred via multiple methods (Gutenkunst ; Schiffels and Durbin 2014; Excoffier and Foll 2011). Likewise, local recombination rates can be derived from genetic maps (Swarts ), or inferred (Chan ; McVean and Auton 2007; Adrion ). At the time of writing, all available models in rdmc assume a single effective population size for all populations. Depending on which modes of convergent adaptation are being investigated, users must also provide vectors of selection coefficients, migration rates, allele frequency ages, and initial allele frequencies prior to selection. The exhaustive list of required inputs and their definitions is given in Table 1.

Table 1

Description of the arguments used with the function parameter_barge()

Function argument	Description
neutral_freqs	Matrix of allele frequencies at putatively neutral sites with dimensions, number of populations x number of sites.
selected_freqs	Matrix of allele frequencies at putatively selected sites with dimensions, number of populations x number of sites.
selected_pops	Vector of indices for populations that experienced selection.
Positions	Vector of genomic positions for the selected region.
n_sites	Integer for the number of sites to propose as the selected site. Sites are uniformly placed along positions using seq(min(positions), max(positions), length.out = n_sites). Must be less than or equal to length(positions). Cannot be used with sel_sites.
sel_sites	Optional vector of sites to propose as selected site. Useful if particular sites are suspected to be under selection. Cannot be used with n_sites.
sample_sizes	Vector of sample sizes of length number of populations. (i.e., twice the number of diploid individuals sampled in each population).
num_bins	The number of bins in which to bin alleles a given distance from the proposed selected sites.
Sets	A list of population indices, where each element in the list contains a vector of populations with a given mode of convergence. For example, if populations 2 and 6 share a mode and population 3 has another, sets = list(c(2,6), 3). Only required for fitting models with mixed modes. Must be used in conjunction with the ”modes” argument.
Modes	Character vector of length sets defining mode for each set of selected populations (”independent”, ”standing”, and/or ”migration”). Only required for fitting models with mixed modes. More details about the modes is available on help page for mode_cle
Sels	Vector of proposed selection coefficients.
Migs	Vector of proposed migration rates (proportion of individuals of migrant origin each generation). Cannot be 0.
Times	Vector of proposed times in generations the variant is standing in populations before selection occurs and prior to migration from source population.
Gs	Vector of proposed frequencies of the standing variant.
Sources	Vector of population indices to propose as the source population of the beneficial allele. Used for both the migration and standing variant with source models. Note: the source must be one of the populations contained in selected_pops.
Ne	Effective population size (assumed equal for all populations).
Rec	Per base recombination rate for the putatively selected region.
locus_name	String to name the locus. Helpful if multiple loci will be combined in subsequent analyses. Defaults to ”locus”.
Cholesky	Logical to use cholesky factorization of covariance matrix. Used for both inverse and determinant. Faster, but not guaranteed to work for all data sets. TRUE by default. if FALSE, ginv from MASS is used.

Installation and dependencies

Installation of rdmc requires the r package devtools (Wickham ). With devtools available, the package can be installed and made locally available with the following R commands: devtools::install_github(’silastittes/rdmc’) library(rdmc) In addition to devtools, rdmc depends on several other packages. Namely, MASS (Venables and Ripley 2002), dplyr (Wickham ), tidyr (Wickham and Henry 2020), purrr (Henry and Wickham 2019), magrittr (Bache and Wickham 2014), and rlang (Henry and Wickham 2020). All dependencies are automatically installed or updated when the installation command above is issued. I encourage users to update to the most recent version of R prior to issuing any of the commands featured here.

Specifying parameters and input data

For convenience, the original simulated example data generated by Lee and Coop (2017) are provided with the installation and can be loaded with: #load example data data(neutral_freqs) data(selected_freqs) data(positions) The example data consists of 10,000 simulated base pairs from six populations, three of which (with indices 1,3,5) independently mutated to the selected allele at position 0, along with three other populations that evolved neutrally. Allele frequencies must be be passed to rdmc as a matrix, where each row is a population and each column is a locus. Users should note that the simulated positions here take on values between zero and one, but that base pair positions along the chromosomes of empirical data should not be altered prior to fitting the models. When fitting possible convergent adaptation models, several quantities are reused regardless of which modes of convergent adaptation are of interest. In efforts to minimize computation, all parameters and quantities that are common across models are stored in a single named list generated with the function parameter_barge() that can be used when fitting any of the possible models. The list of quantities is generated using: #specify parameters and input data. param_list parameter_barge( Ne = 10000, rec = 0.005, neutral_freqs = neutral_freqs, selected_freqs = selected_freqs, selected_pops = c(1, 3, 5), positions = positions, n_sites = 10, sample_sizes = rep(10, 6), num_bins = 1000, sels = c( 1e-4, 1e-3, 0.01, seq(0.02, 0.14, by = 0.01), seq(0.15, 0.3, by = 0.05), seq(0.4, 0.6, by = 0.1) ), times = c(0, 5, 25, 50, 100, 500, 1000, 1e4, 1e6), gs = c(1 / (2 * 10000), 10 -^(4:1)), migs = c(10 -^(seq(5, 1, by = -2)), 0.5, 1), sources = selected_pops, locus_name = ”test_locus”, cholesky = TRUE ) where all the arguments are fully described in Table 1. This command also determines the grid of parameter values (namely the arguments, sels, times, gs, migs, sources, and n_sites or positions) that will be used in the likelihood calculations. Depending on which modes of convergent adaptation are being studied, some of these grid parameters may not be used for inferences. Users must still input values for all of the grid parameters. Naturally, features of the input data (the density and amount of variation in the allele frequencies, the effective population size, and the mutation and recombination rates), will impact the model results, and will determine the resolution we have to infer the model parameters. The number and density of points along the grid of parameters also affect the resolution one has to make inferences. However, computation time and memory usage may become infeasible if these grids are made too large.

Fitting the models

Once the parameter barge is constructed, all models can be fit using this list of quantities as the only data input. All of the mode types (neutral, independent mutations, standing variation with and without a source population, migration, and mixed-modes) are implemented using the same function, mode_cle(), passing the desired mode as an argument to the function. The neutral, independent mutations, migration, and standing variation with a source population modes can be fit, respectively with: #fit composite likelihood models neut_cle ind_cle mig_cle sv_cle Models of mixed modes, where specified populations are modeled under different modes, can be also implemented by modifying the parameter list object in-place. Specifically, mixed modes are constructed by adding the sets and modes arguments, which groups the population indices according the vector of modes, and specifies which modes are to be used. For example, to fit a model where populations with indices 1 and 3 adapted via standing variation, and population 5 gained the same mutation independently, and another mixed-mode model where populations 1 and 3 adapted via migration, and population 5 mutated independently: #update barge to fit a mixed-mode model param_list <- update_mode( barge = param_list, sets = list(c(1, 3), 5), modes = c(”standing_source”, ”independent”)) #fit mixed-mode model multi_svind <- mode_cle(param_list, ”multi”) #update to another mixed-mode param_list <- update_mode( barge = param_list, sets = list(c(1, 3), 5), modes = c(“migration”, “independent”)) #fit mixed-mode model multi_migind <- mode_cle(param_list, ”multi”) Regardless of which mode is used when calling mode_cle(), the data frame returned will always contain the same 10 features: The 6 grid parameters generated by parameter_barge() (Table 1), the composite likelihood score calculated over all possible combinations of the grid parameters, the indices of the selected populations, and the names of the locus and mode that were used. To always maintain the same number of columns, missing (NA) values are added when variables are not used for a given mode type. As will be shown below, this design facilitates combining results from multiple models for downstream analyses.

Results And Discussion

Benchmarking

The computation time and memory usage of rdmc increases with the complexity of the model and size of the input data used. Compared to the original code implemented by Lee and Coop (2017), rdmc is slightly faster computationally, and requires substantially less memory. However, the reduced time and memory allocation for rdmc only occurs when Cholesky factorization is used to obtain the inverse of the neutral and selected covariance matrices (Tables 1 and 2). Alternatively, the matrix inverses are obtained using ginv() from the MASS package (Venables and Ripley 2002), which requires a larger memory allocation, but will still approximate the inverse even if the covariance matrix is not positive definite. Users are therefore encouraged to use the default parameter_barge() argument cholesky = TRUE unless Cholesky factorization fails.

Table 2

Benchmarking of three rdmc model types. Computation time, memory allocation, and the number of garbage collections are reported for the original (dmc) code written by Lee and Coop (2017), and the two matrix inversion methods available in rdmc (ginv and chol.). Median time was estimated using 5 iterations of each model. Execution time is reported in seconds. Benchmarking was conducted using the R package, bench(Hester 2020). Code was executed in an interactive job on the UC Davis Farm HPC (2.00GHz Intel Xeon CPU, 124GB RAM)

Model	version	median time	memory	garbage collections
ind.	Dmc	15.1	230.6MB	1
ind.	chol.	12.9	109.2MB	3
ind.	Ginv	18.4	195.6MB	1
migration	dmc	264.6	2.9GB	19
migration	chol.	182.3	1.6GB	55
migration	ginv	321.5	2.8GB	18
std.var	dmc	780.2	8.6GB	52
std.var	chol.	537.4	4.8GB	136
std.var	ginv	898.5	8.6GB	49

The composite likelihood calculations are made over a grid of input parameters chosen when constructing the parameter barge (code shown above), hence, a denser grid will also have a considerable impact on time and memory usage. The size of the example data provided gives a realistic sense of memory and time usage for potential empirical data. While most modern laptops are capable of handling the required memory, many users will be interested in genome-wide analysis, where the mode of convergence for many separate regions are of interest. In these instances, cloud or high performance computing environments will be more appropriate. Making rdmc a portable and easy to install R package simplifies running separate genomic regions as independent jobs in parallel using workflows such as Snakemake (Köster and Rahmann 2012) or Nextflow (Di Tommaso ).

Extracting useful quantities and visualization

Once the models of interest have finished, the common format of the returned data frames allows all of the inferences to be combined into a single data frame, which simplifies creation of statistical and graphical summaries, and storage: #rdmc loads dplyr::bind_rows() all_mods <- bind_rows( ind_cle, mig_cle, sv_cle, multi_svind, multi_migind ) #save results to file readr::write_csv(neut_cle, ”rdmc_neutral.csv”) readr::write_csv(all_mods, ”rdmc_modes.csv”) With a single data frame containing output from all tested models, there are many visualization and summary methods available in the R ecosystem (R Core Team 2020). For example, the maximum composite likelihood estimate of the selection coefficient for each model can be accessed with: #rdmc loads dplyr::group_by() and magrittr::%>% all_mods %>% group_by(model) %>% filter(cle == max(cle)) %>% select(selected_sites, sels, model) #returns (model names edited here for space) # A tibble: 4 × 3 # Groups: model [4] selected_sites sels model 1 0.0017 0.03 independent 2 0.0017 0.03 migration 3 0.0017 0.03 stdvar-stdvar-ind. 4 0.0017 0.03 mig.-mig.-ind. Visualizing the composite likelihood values by genomic position (relative to the neutral composite likelihood) (Figure 1) can be done with:

Figure 1

Visualizing rdmc results for several modes. (A) The composite likelihood score at each of the 10 proposed sites of selection for each model. The true selected site was modeled at position 0. The data were simulated as independent mutations in the three selected populations. (B) The composite likelihood scores over grid of selection coefficients. Dotted line indicates true selection coefficient () the data were modeled under. Visualizations were made using the R packages, ggplot2(Wickham 2016) and cowplot(Wilke 2019). library(ggplot2) library(cowplot) theme_set(theme_cowplot(font_size = 18)) neut <- unique(neut_cle$cle) all_mods %>% group_by(selected_sites, model) %>% summarize(mcle = max(cle) - neut) %>% ggplot(aes(selected_sites, mcle, color = model)) + geom_line() + geom_point() + xlab(”Position”) + ylab(“Composite likelihood”) + theme(legend.position = ”n”) + scale_color_brewer(palette = ”Set1”) Lastly, one can visualize the likelihood surface with respect to specific parameter, such as selection (Figure 1): #visualize likelihood surface wrt selection all_mods %>% group_by(sels, model) %>% summarize(mcle = max(cle) - neut) %>% ggplot(aes(sels, mcle, color = model)) + geom_line() + geom_point() + ylab(”Composite likelihood”) + xlab(”Selection coefficient”) + scale_color_brewer(palette = ”Set1”)

Concluding remarks and future developments

rdmc was made to facilitate the use of convergent adaptation models of Lee and Coop (2017). The package is easy to install, and requires only a few lines of code to generate and analyze the output. By making rdmc an R package, the code is highly portable and has relatively few, highly maintained dependencies, making it simpler to adopt to different computing systems. Because of its portability and ease of use, rmdc also simplifies downstream tasks which facilitates usage at large scales, such as modeling thousand of genomic regions simultaneously on high performance computing resources. Several elaborations to the currently available utilities in the rdmc package could be addded. Since the methods developed in Lee and Coop (2017), additional models have been developed, including ones that can use putatively selected deletion variation, strong selection, concurrent sweeps, and variation in population size among populations (Oziolor ). Lee and Coop (2017) also introduced parametric bootstrapping to evaluate support for alternative modes. While not currently incorporated into rdmc, future development of the package would include functions to perform bootstrapping. However, for the same reasons mentioned above, rdmc should facilitate creation and computation of bootstrap replicates in parallel.

Web Resources

The source of the package and workflow outlined above are available at https://github.com/silastittes/rdmc. The package is released under GNU General Public License (v3.0). All of the presented analyses were computed on a personal laptop (x86_64, Apple) using R version 4.0.0 2020-04-24). The original code associated with Lee and Coop (2017) is available https://github.com/kristinmlee/dmc.

18 in total

1. Recombination rate estimation in the presence of hotspots.

Authors: Adam Auton; Gil McVean
Journal: Genome Res Date: 2007-07-10 Impact factor: 9.043

Review 2. Convergence, adaptation, and constraint.

Authors: Jonathan B Losos
Journal: Evolution Date: 2011-04-07 Impact factor: 3.694

3. Nextflow enables reproducible computational workflows.

Authors: Paolo Di Tommaso; Maria Chatzou; Evan W Floden; Pablo Prieto Barja; Emilio Palumbo; Cedric Notredame
Journal: Nat Biotechnol Date: 2017-04-11 Impact factor: 54.908

4. Independent Molecular Basis of Convergent Highland Adaptation in Maize.

Authors: Shohei Takuno; Peter Ralph; Kelly Swarts; Rob J Elshire; Jeffrey C Glaubitz; Edward S Buckler; Matthew B Hufford; Jeffrey Ross-Ibarra
Journal: Genetics Date: 2015-06-15 Impact factor: 4.562

5. Distinguishing Among Modes of Convergent Adaptation Using Population Genomic Data.

Authors: Kristin M Lee; Graham Coop
Journal: Genetics Date: 2017-10-18 Impact factor: 4.562

6. Convergent adaptation of human lactase persistence in Africa and Europe.

Authors: Sarah A Tishkoff; Floyd A Reed; Alessia Ranciaro; Benjamin F Voight; Courtney C Babbitt; Jesse S Silverman; Kweli Powell; Holly M Mortensen; Jibril B Hirbo; Maha Osman; Muntaser Ibrahim; Sabah A Omar; Godfrey Lema; Thomas B Nyambo; Jilur Ghori; Suzannah Bumpstead; Jonathan K Pritchard; Gregory A Wray; Panos Deloukas
Journal: Nat Genet Date: 2006-12-10 Impact factor: 38.330