Arya Iranmehr1, Ali Akbari2, Christian Schlötterer3, Vineet Bafna4. 1. Department of Electrical and Computer Engineering, University of California, San Diego, California 92093 airanmehr@gmail.com. 2. Department of Electrical and Computer Engineering, University of California, San Diego, California 92093. 3. Institut für Populationsgenetik, Veterinärmedizinische Universität Wien, 1210 Vienna, Austria. 4. Department of Computer Science and Engineering, University of California, San Diego, California 92093.
Abstract
The advent of next generation sequencing technologies has made whole-genome and whole-population sampling possible, even for eukaryotes with large genomes. With this development, experimental evolution studies can be designed to observe molecular evolution "in action" via evolve-and-resequence (E&R) experiments. Among other applications, E&R studies can be used to locate the genes and variants responsible for genetic adaptation. Most existing literature on time-series data analysis often assumes large population size, accurate allele frequency estimates, or wide time spans. These assumptions do not hold in many E&R studies. In this article, we propose a method-composition of likelihoods for evolve-and-resequence experiments (Clear)-to identify signatures of selection in small population E&R experiments. Clear takes whole-genome sequences of pools of individuals as input, and properly addresses heterogeneous ascertainment bias resulting from uneven coverage. Clear also provides unbiased estimates of model parameters, including population size, selection strength, and dominance, while being computationally efficient. Extensive simulations show that Clear achieves higher power in detecting and localizing selection over a wide range of parameters, and is robust to variation of coverage. We applied the Clear statistic to multiple E&R experiments, including data from a study of adaptation of Drosophila melanogaster to alternating temperatures and a study of outcrossing yeast populations, and identified multiple regions under selection with genome-wide significance.
The advent of next generation sequencing technologies has made whole-genome and whole-population sampling possible, even for eukaryotes with large genomes. With this development, experimental evolution studies can be designed to observe molecular evolution "in action" via evolve-and-resequence (E&R) experiments. Among other applications, E&R studies can be used to locate the genes and variants responsible for genetic adaptation. Most existing literature on time-series data analysis often assumes large population size, accurate allele frequency estimates, or wide time spans. These assumptions do not hold in many E&R studies. In this article, we propose a method-composition of likelihoods for evolve-and-resequence experiments (Clear)-to identify signatures of selection in small population E&R experiments. Clear takes whole-genome sequences of pools of individuals as input, and properly addresses heterogeneous ascertainment bias resulting from uneven coverage. Clear also provides unbiased estimates of model parameters, including population size, selection strength, and dominance, while being computationally efficient. Extensive simulations show that Clear achieves higher power in detecting and localizing selection over a wide range of parameters, and is robust to variation of coverage. We applied the Clear statistic to multiple E&R experiments, including data from a study of adaptation of Drosophila melanogaster to alternating temperatures and a study of outcrossing yeast populations, and identified multiple regions under selection with genome-wide significance.
NATURAL selection is a key force in evolution, and a mechanism by which populations can adapt to external “selection” pressure. Examples of adaptation abound in the natural world (Fan ), including classic examples like lactose tolerance in Northern Europeans (Bersaglieri ), human adaptation to high altitudes (Simonson ; Yi ), but also drug resistance in pests (Daborn ), HIV (Feder ), cancer (Gottesman 2002; Zahreddine and Borden 2013), malarial parasite (Nair ; Ariey ), and others (Spellberg ). In these examples, understanding the genetic basis of adaptation can provide valuable information, underscoring the importance of the problem.Experimental evolution refers to the study of the evolutionary processes of a model organism in a controlled (Hegreness ; Bollback and Huelsenbeck 2007; Barrick ; Lang ; Orozco-ter Wengel ; Lang ; Oz ) or natural (Barrett ; Reid ; Denef and Banfield 2012; Winters ; Daniels ; Maldarelli ; Bergland ) environment. Recent advances in whole-genome sequencing have enabled us to sequence populations at a reasonable cost, even for large genomes. Perhaps more important for experimental evolution studies, we can now evolve and resequence (E&R) multiple replicates of a population to obtain longitudinal time-series data, to investigate the dynamics of evolution at the molecular level. Although constraints such as small sizes, limited timescales, and oversimplified laboratory environments may limit the interpretation of E&R results, these studies are increasingly being used to test a wide range of hypotheses (Kawecki ) and have been shown to be more predictive than static data analysis (Sawyer and Hartl 1992; Boyko ; Desai and Plotkin 2008). In particular, longitudinal E&R data are being used to estimate model parameters including population size (Pollak 1983; Waples 1989; Williamson and Slatkin 1999; Wang 2001; Terhorst ; Jónás ), strength of selection (Bollback ; Illingworth and Mustonen 2011; Illingworth ; Malaspinas ; Mathieson and McVean 2013; Steinrücken ; Terhorst ), allele age (Malaspinas ), recombination rate (Terhorst ), mutation rate (Barrick and Lenski 2013; Terhorst ), quantitative trait loci (Baldwin-Brown ), and for tests of neutrality hypotheses (Burke ; Bergland ; Feder ; Terhorst ).While many E&R study designs are being used (Barrick and Lenski 2013; Schlötterer ), we restrict our attention to the adaptive evolution due to standing variation in fixed-size populations. This regime has been considered earlier, typically with Drosophila melanogaster as the model organism of choice, to identify adaptive genes in longevity and aging (600 generations) (Burke ; Remolina ), courtship song (100 generations) (Turner ), hypoxia tolerance (200 generations) (Zhou ), adaptation to new laboratory environments (59 generations) (Orozco-ter Wengel ; Franssen ), egg size (40 generations) (Jha ), C-virus resistance (20 generations) (Martins ), and dark-fly (49 generations) (Izutsu ).The task of identifying selection signatures can be addressed at different levels of specificity. At the coarsest level, identification could simply refer to deciding whether some genomic region (or a gene) is under selection or not. In the following, we refer to this task as detection. In contrast, the task of site identification corresponds to the process of finding the favored mutation/allele at the nucleotide level. Finally, estimation of model parameters, such as strength of selection and dominance at the site, can provide a comprehensive description of the selection process.In the effort to analyze E&R selection experiments, many authors chose to adapt existing tests that were originally used for static data, pairwise comparisons (two time points), and single replicates to perform a null scan. For instance, Zhou used the ratio of the estimated population size of case and control populations to compute a test statistic for each genomic region. Burke applied the Fisher exact test to the last observation of data on case and control populations. Orozco-ter Wengel used the Cochran–Mantel–Haenszel (CMH) test (Agresti and Kateri 2011) to detect SNPs whose read counts change consistently across all replicates of two time-point data. Turner proposed the diffstat statistic to test whether the change in allele frequencies of two populations deviate from the distribution of change in allele frequencies of two drifting populations. Bergland calculated to populations throughout time to signify their differentiation from ancestral (two time-point data) as well as geographically different populations. Jha computed a test statistic of generalized linear-mixed model directly from read counts.Alternatively, direct methods have been developed to analyze time-series data by taking a likelihood approach, and estimating population genetics parameters. Bollback proposed a hidden Markov model (HMM) to estimate the selection coefficient s and population size by using a diffusion approximation to the Wright–Fisher process. Steinrücken proposed a general diploid selection model which takes into account dominance of the favored allele and approximates likelihood analytically. Recently, Schraiber proposed a Bayesian framework to estimate parameters using Markov chain Monte Carlo sampling. Mathieson and McVean (2013) adopted HMMs to structured populations and estimated parameters using an expectation maximization procedure on a discretized allele frequency. Feder modeled increments in allele frequency with a Brownian motion process, proposed as the frequency increment test (FIT). More recently, Topa proposed a Gaussian process (GP) for modeling single-locus, time-series, whole-genome sequencing of pools of individuals (pool-seq) data. Terhorst extended GP to compute joint likelihood of multiple loci under null and alternative hypotheses. Finally, Levy proposed a Bayesian model to handle sequencing, amplification, and growth noise in a large population of barcoded lineages.Among the methods specifically designed for time-series data, many make assumptions which may not hold in E&R studies. One common assumption is that the underlying population size is large, so it is reasonable to model dynamics of allele frequencies using continuous-state models (Bollback ; Feder ; Terhorst ). Second, many existing methods were originally designed to process the wider time spans seen in ancient DNA studies, an assumption that does not hold for E&R experiments (Steinrücken ; Schraiber ). Finally, many E&R analysis tools assume that allele frequencies in the input data are unbiased (e.g., Bollback ), which may not be valid for shotgun sequencing experiments.Here, we consider an HMM, similar to Williamson and Slatkin (1999) and Bollback but under a “small-population-size” regime. Specifically, we use a discrete state (frequency) model. We show that for small population sizes, discrete models can compute likelihood exactly, which improves statistical performance, especially for short time-span experiments. Additionally, we add another level of sampling noise to the traditional HMM model, allowing for heterogeneous ascertainment bias due to uneven coverage among variants. We show that for a wide range of parameters, Clear provides higher power for detecting selection, estimates model parameters consistently, and localizes the favored allele more accurately compared to the state-of-the-art methods, while being computationally efficient.
Materials and Methods
Consider a panmictic diploid population with fixed size of N individuals. Let be the frequencies of the derived allele at generations for a given variant, where at generations samples of n individuals are chosen for pooled sequencing. The experiment is replicated R times. We denote allele frequencies of the R replicates by the set To identify the genes and variants that are responding to selection pressure, we use the following procedure:Estimating population size: The procedure starts by estimating the effective population size, under the assumption that much of the genome is evolving neutrally.Estimating selection parameters: For each polymorphic site, selection and dominance parameters are estimated so as to maximize the likelihood of the time-series data, givenComputing likelihood statistics: For each variant, a log-odds ratio of the likelihood of selection model to the likelihood of neutral evolution/drift model is computed. Likelihood ratios in a genomic region are combined to compute the Clear statistic for the region.Hypothesis testing: An empirical null distribution of the Clear statistic is calculated using genome-wide drift simulations, and used to compute P-values and thresholds for a specified false discovery rate (FDR). We perform single-locus hypothesis testing within selected regions to identify significant variants and report genes that intersect with the selected variants.These steps are described in detail below.
Estimating population size
Methods for estimating population sizes from temporal neutral evolution data have been developed (Williamson and Slatkin 1999; Anderson ; Bollback ; Terhorst ; Jónás ). Here, we aim to extend these models to explicitly model the sampling noise that arise in pool-seq data. Specifically, we model the variation in sequence coverage over different locations, and the noise due to sequencing only a subset of the individuals in the population. In addition, many existing methods (Bollback ; Feder ; Terhorst ; Topa ) are designed for large populations, and model frequency as a continuous quantity. We observed that using Brownian motion to model frequency drift may be inadequate for small populations, low starting frequencies, and sparse sampling (in time); factors that are common in experimental evolution (see Results, Figure 3, A–C, and Figure 2). To this end, we model the Wright–Fisher Markov process for generating pool-seq data (Supplemental Material, Figure S1 in File S3) via a discrete HMM (Figure 1B). We start by computing a likelihood function for the population size given neutral pool-seq data.
Figure 3
Power calculations for detection of selection. Detection power for Clear
FIT, GP, and CMH under (A–C) hard- and (D–F) soft-sweep scenarios. λ and s denote the mean coverage and selection coefficient, respectively. Orange hexagons represent the performance of Clear when the maximum of the single-locus statistic is used to make a decision for the genomic region, while the red ● corresponds to the performance of Clear when single-locus statistics are averaged over the region. The y-axis measures power—sensitivity with false-positive rate for simulations with and The horizontal line reflects the power of a random classifier. In all simulations, three replicates are evolved and sampled at generations
Figure 2
Comparison of empirical distributions of allele frequencies (red) vs. predictions from Brownian motion (green), and Markov chain (blue). Comparison of empirical and theoretical distributions under neutral evolution (A–F) and selection (G–M) with different starting frequencies and sampling times of where and For each panel, the empirical distribution was computed over 100,000 simulations. Brownian motion (Gaussian approximation) provides poor approximations when initial frequency is far from 0.5 (A) or sampling is sparse (B, C, E , and F). In addition, Brownian motion can only provide approximations under neutral evolution. In contrast, Markov chain consistently provides a good approximation in all cases.
Figure 1
E&R selection experiments on D. melanogaster. (A) Typical configuration in which time-series data are collected for D. melanogaster. A small set of founder lines is selected from a large population and used to create a subpopulation of isofemale lines. Multiple replicates of the population are evolved and resequenced to collect time-series genomic data. For sequencing, n individuals are randomly sampled and sequenced with coverage λ. (B) Graphical model showing dependence of the random variables in the single-locus model used to compute Clear statistics. Observed variables c (derived allele read count) and d (total read count) are shaded. The variables denote allele frequency, sampled allele frequency, and mean sequencing coverage, respectively. (C) Mean and 95% confidence intervals of the (i and iii) theoretical and (ii and iv) empirical trajectories of the favored allele for (i and ii) hard- and (iii and iv) soft-sweep scenarios and The first 50 generations are shaded in gray to represent the sampling span of sampling in short-term experiments, illustrating the difficulty in predicting selection at early stages of selective sweep.
E&R selection experiments on D. melanogaster. (A) Typical configuration in which time-series data are collected for D. melanogaster. A small set of founder lines is selected from a large population and used to create a subpopulation of isofemale lines. Multiple replicates of the population are evolved and resequenced to collect time-series genomic data. For sequencing, n individuals are randomly sampled and sequenced with coverage λ. (B) Graphical model showing dependence of the random variables in the single-locus model used to compute Clear statistics. Observed variables c (derived allele read count) and d (total read count) are shaded. The variables denote allele frequency, sampled allele frequency, and mean sequencing coverage, respectively. (C) Mean and 95% confidence intervals of the (i and iii) theoretical and (ii and iv) empirical trajectories of the favored allele for (i and ii) hard- and (iii and iv) soft-sweep scenarios and The first 50 generations are shaded in gray to represent the sampling span of sampling in short-term experiments, illustrating the difficulty in predicting selection at early stages of selective sweep.
Likelihood for the neutral model:
We model the allele frequency counts as being sampled from a binomial distribution. Specifically,where π is the global distribution of allele frequencies in the base population. Note that π depends on the demographic history of the founder lines and can be estimated from the site-frequency spectrum (see Figure S2 in File S3) of the initial population. For notational convenience, henceforth we omit the dependence of likelihoods to the parameter π.To estimate frequency after τ transitions, it is enough to specify the transition matrix where denotes probability of change in allele frequency from to in τ generations:Furthermore, in an E&R experiment, individuals are randomly selected for sequencing. The sampled allele frequencies, are also binomially distributed:We introduce the sampling matrix Y, where stores the probability that the sample allele frequency is given that the true allele frequency isWe denote the pool-seq data for that variant as where represent the coverage and the read count of the derived allele, respectively. Let be the sequencing coverage at different generations. Then, the observed data are sampled according toThe emission probability for an observed tuple isFor let denote the probability of emitting and reaching state j at Then, can be computed using the forward procedure (Durbin ):where The joint likelihood of the observed data from R independent observations is given bywhere The graphical model and the generative process for which data are being generated is depicted in Figure 1B and Figure S1 in File S3, respectively.Finally, the last step is to compute an estimate that maximizes the likelihood of all M variants in the whole genome. Let denote the time-series data of the ith variant in replicate r. Then,
Estimating selection parameters
Likelihood for the selection model:
Assume that the site is evolving under selection constraints and where s and h denote selection strength and dominance parameters, respectively. By definition, the relative fitness values of genotypes 0|0, 0|1, and 1|1 are given by
and Then, the frequency at time (one generation ahead), can be estimated using:The machinery for computing likelihood of the selection parameters is identical to that of population size, except for transition matrices. Hence, here we only describe the definition transition matrix of the selection model. Let denote the probability of transition from to in τ generations, then (see Ewens 2004, p. 24, equations 1.58–1.59):The maximum likelihood estimates are given byUsing grid search, we first estimate N (Equation 8), and subsequently, we estimate parameters s and h (Equation 12, Figure S3 in File S3). By broadcasting and vectorizing the grid search operations across all variants, the genome scan on millions of polymorphisms can be done in a significantly smaller time than iterating a numerical optimization routine for each variant (see Results and Figure 6).
Figure 6
Running time. Box plots of running time per variant (CPU seconds) of Clear
CMH, FIT, and GP with 1, 3, 5, 7, and 10 loci over 1000 simulations conducted on a workstation with Intel Core i7 processor. The average running time for each method is shown on the x-axis. In all simulations, three replicates are evolved and sampled at generations
Empirical likelihood-ratio statistics
The likelihood-ratio statistic for testing directional selection, to be computed for each variant, is given bywhere Similarly, we can define a test statistic for testing if selection is dominant byWhile extending the single-locus Wright–Fisher model to multiple linked loci can improve the power of the model (Terhorst ), it is computationally and statistically expensive to compute exact likelihood. In addition, computing linked-loci joint likelihood requires haplotype resolved data, which pool-seq does not provide. Here, similar to Nielsen , we calculate the composite-likelihood-ratio score for a genomic region,where L is a collection of segregating sites and is the likelihood-ratio score based for each variant in L. The optimal value of the hyper-parameter L depends upon a number of factors, including initial frequency of the favored allele, recombination rates, linkage of the favored allele to neighboring variants, population size, coverage, and time since the onset of selection (duration of the experiment). In File S3, we provide a heuristic to compute a reasonable value of L, based on experimental data.We work with a normalized value of given bywhere and are the mean and standard deviation of values in a large region We found different chromosomes have different distributions of values, and therefore decided to use single chromosomes as
Hypothesis testing
Single-locus tests:
Under neutrality, log-likelihood ratios can be approximated by distribution (Williams and Williams 2001), and P-values can be computed directly. However, Feder showed that when the number of independent samples (replicates) is small, is a crude approximation to the true null distribution and results in more false positives. Following their suggestion, we first compute the empirical null distribution using simulations with the estimated population size (see Figure S1 in File S3). The empirical null distribution of statistic H is used to compute P-values as the fraction of null values that exceed the test score. Finally, we use the method of Storey and Tibshirani (2003) to control for FDR in multiple testing.
Composite likelihood tests:
Similar to single-locus tests, we compute the null distribution of the statistic using whole-genome simulations with the estimated population size, and subsequently compute the FDR. The simulations for generating the null distribution of are described next.
Simulations
We use the same simulation procedure for two purposes. First, we use the simulations to test the power of Clear against other methods in small genomic windows. Second, we use them to generate the distribution of null values for the statistic to compute empirical P-values. We mainly chose parameters that are relevant to D. melanogaster experimental evolution (Kofler and Schlötterer 2013). See also Figure 1A for illustration.Creating initial founder line haplotypes: Using msms (Ewing and Hermisson 2010), we created neutral populations for F founding haplotypes with command $./msms 〈F〉 1 −t 〈2μWN〉 −r 〈2rWN〉 〈W〉, where is the number of founder lines, is the effective founder population size, is the recombination rate, and is the mutation rate. The window size W is used to compute and We chose W = 50 kbp for simulating individual windows for performance evaluations, and W = 20 Mbp for simulating D. melanogaster chromosomes for P-value computations.Creating initial diploid population: An initial set of haplotypes was created from step 1, and duplicated to create F homozygous diploid individuals to simulate generation of inbred lines. N diploid individuals were generated by sampling with replacement from the F individuals.Forward simulation: We used forward simulations for evolving populations under selection. We also consider selection regimes in which the favored allele is chosen from standing variation (not de novo mutations). Given initial diploid population, position of the site under selection, selection strength s, number of replicates recombination rate and sampling times simuPop (Peng and Kimmel 2005) was used to perform the forward simulation and compute allele frequencies for all of the R replicates. For hard sweep (respectively, soft sweep) simulations we randomly chose a site with an initial frequency of (respectively, to be the favored allele. For generating the null distribution with drift for P-value computations, we used this procedure withSequencing simulation: Given allele frequency trajectories we sampled depth of each site in each replicate identically and independently from Poisson (λ), where is the coverage for the experiment. Once depth d is drawn for the site with frequency ν, the number of reads c carrying the derived allele are sampled according to binomial For experiments with finite depth the tuple is the input data for each site.
Data availability
The source code and running scripts for Clear are publicly available at https://github.com/airanmehr/clear.D. melanogaster data was originally published (Orozco-ter Wengel ; Franssen ). The data set of the D. melanogaster study, until generation 37, is obtained from the Dryad digital repository (http://datadryad.org) under accession DOI: 10.5061/dryad.60k68. Generation 59 of the D. melanogaster study is accessed from the European Sequence Read Archive (http://www.ebi.ac.uk/ena/) under the project accession number PRJEB6340. The data set containing the experimental evolution of yeast populations (Burke ) is downloaded from http://wfitch.bio.uci.edu/∼tdlong/PapersRawData/BurkeYeast.gz (last accessed January 24, 2017). University of California, Santa Cruz browser tracks for D. melanogaster and yeast data analysis are found in File 1 and File 2, respectively.
Results
Modeling allele frequency trajectories in small populations
We first tested the goodness of fit of the discrete vs. Brownian motion (a continuous-state model) in modeling allele frequency trajectories, under general E&R parameters. For this purpose, we conducted 100 K simulations with two time samples where is the parameter controlling the density of sampling in time. In addition, we repeated simulations for different values of starting frequency (i.e., hard and soft sweep) and selection strength (i.e., neutral and selection). Then, given initial frequency we computed the expected distribution of the frequency of the next sample under two models to make a comparison. Figure 2, A–F, shows that Brownian motion (continuous model) is inadequate when is far from 0.5, or when sampling times are sparse If the favored allele arises from standing variation in a neutral population, it is unlikely to have a frequency close to 0.5, and the starting frequencies are usually much smaller (see Figure S2 in File S3). Moreover, in typical D. melanogaster experiments, for example, sampling is sparse. Often, the experiment is designed so that (Zhou ; Orozco-ter Wengel ; Kofler and Schlötterer 2013; Franssen ).Comparison of empirical distributions of allele frequencies (red) vs. predictions from Brownian motion (green), and Markov chain (blue). Comparison of empirical and theoretical distributions under neutral evolution (A–F) and selection (G–M) with different starting frequencies and sampling times of where and For each panel, the empirical distribution was computed over 100,000 simulations. Brownian motion (Gaussian approximation) provides poor approximations when initial frequency is far from 0.5 (A) or sampling is sparse (B, C, E , and F). In addition, Brownian motion can only provide approximations under neutral evolution. In contrast, Markov chain consistently provides a good approximation in all cases.In contrast to the Brownian motion approximation, discrete Markov chain predictions (Equation 11) are highly consistent with empirical data for a wide range of simulation parameters (Figure 2, A–M). Moreover, the discrete Markov chain can be modified to model the case when the the allele is under selection.
Detection power
We compared the performance of Clear against other methods for detecting selection. For each method we calculated detection power as the percentage of true positives identified with a false-positive rate For each configuration (specified with values for selection coefficient s, starting allele frequency and coverage λ), the power of each method is evaluated over 2000 distinct simulations, half of which modeled neutral evolution and the rest modeled positive selection.We compared the power of Clear with GP (Terhorst ), FIT (Feder ), and CMH (Agresti and Kateri 2011) statistics. FIT and GP convert read counts to allele frequencies prior to computing the test statistic. Clear shows the highest power in all cases and the power stays relatively high even for low coverage (Figure 3 and Table S1 in File S3). In particular, the difference in performance of Clear with other methods is pronounced when starting frequency is low. The advantage of Clear stems from the fact that the favored allele with a low starting frequency might be missed by low coverage sequencing. In this case, incorporating the signal from linked sites becomes increasingly important. We note that methods using only two time points, such as CMH, do relatively well for high selection values and high coverage. However, the use of time-series data can increase detection power in low coverage experiments or when the starting frequency is low. Moreover, time-series data provide means for estimating selection parameters s and h (see below). Finally, as Clear is robust to change of coverage, our results (Figure 3, B and C) suggest that taking many samples with lower coverage is preferable to sparse sampling with higher coverage. For comparison purposes, we also tested Clear using the single-locus statistic For the most part, Clear showed an improvement over other methods even with or showed similar performance. The performance improved with higher L.Power calculations for detection of selection. Detection power for Clear
FIT, GP, and CMH under (A–C) hard- and (D–F) soft-sweep scenarios. λ and s denote the mean coverage and selection coefficient, respectively. Orange hexagons represent the performance of Clear when the maximum of the single-locus statistic is used to make a decision for the genomic region, while the red ● corresponds to the performance of Clear when single-locus statistics are averaged over the region. The y-axis measures power—sensitivity with false-positive rate for simulations with and The horizontal line reflects the power of a random classifier. In all simulations, three replicates are evolved and sampled at generations
Site identification
In general, localizing the favored variant using pool-seq data is a nontrivial task due to extensive linkage disequilibrium (LD) (Tobler ). To measure performance, we sorted variants by their H scores and computed rank of the favored allele for each method. For each setting of and s, we conducted 1000 simulations and computed the rank of the favored mutation in each simulation. The cumulative distribution of the rank of the favored allele in 1000 simulations for each setting (Figure 4) shows that Clear outperforms other statistics.
Figure 4
Ranking performance for 100× coverage. Cumulative distribution function (CDF) of the distribution of the rank of the favored allele in 1000 simulations for Clear (H), GP, CMH, and FIT, for different values of selection coefficient s and initial carrier frequency. Note that the individual variant Clear score (H) is used to rank variants. The area under the curve is computed as an overall quantitative measure to compare the performance of methods for each configuration. In all simulations, three replicates with are evolved and sampled at generations
Ranking performance for 100× coverage. Cumulative distribution function (CDF) of the distribution of the rank of the favored allele in 1000 simulations for Clear (H), GP, CMH, and FIT, for different values of selection coefficient s and initial carrier frequency. Note that the individual variant Clear score (H) is used to rank variants. The area under the curve is computed as an overall quantitative measure to compare the performance of methods for each configuration. In all simulations, three replicates with are evolved and sampled at generationsAn interesting observation is revisiting the contrast between site identification and detection (Long ; Tobler ). When selection strength is high, detection is easier (Figure 3, A–F), but site identification is harder, due to the high LD between flanking variants and the favored allele (Figure 4, A–F). Moreover, site identification becomes more difficult whenever the initial frequency of the favored allele is low; i.e., at the onset of selection, LD between a favored allele and its nearby variants is high. For example, when coverage and selection coefficient the detection power is 75% for hard sweep, but 100% for soft sweep (Figure 3, B–E). In contrast, the favored site was ranked as the top in 14% of hard sweep cases, compared to 95% of soft sweep simulations.
Estimating parameters
Clear estimates effective population size and selection parameters, and , as a byproduct of the hypothesis testing. We computed bias of selection fitness and dominance for Clear and GP for 1000 simulations in each setting. The distribution of the error (bias) for 100× coverage is presented in Figure 5 for different configurations. Figures S4 and S5 in File S3 provide the distribution of estimation errors for 30× and 300× coverage, respectively. For hard sweep, Clear provides estimates of s with lower variance of bias (Figure 5A and Figure S6 in File S3). In soft sweep, GP and Clear both provide unbiased estimates of s with low variance (Figure 5B). Figure 5, C and D, shows that Clear provides unbiased estimates of h as well when and We also tested if Clear provides unbiased estimates of N by estimating population size on 1000 simulations when As shown in Figure 7A and Figure S9, A–C, in File S3 maximum likelihood is attained at the true value of the parameter.
Figure 5
Distribution of bias for 100× coverage. The distribution of bias in estimating selection coefficient over 1000 simulations using GP and Clear (H) is shown for a range of choices for the selection coefficient s and starting carrier frequency when coverage (A and B). GP and Clear have similar variance in estimates of s for soft sweep, while Clear provides lower variance in hard sweep. Also see Table S2 in File S3 (C and D). The variance in the estimation of h is shown. In all simulations, three replicates are evolved and sampled at generations
Figure 7
Estimating population size. (A) Distribution of bias in estimating N, computed on 1000 neutral simulations for each when and (B) Estimates of population size for data from a study of adaptation of D. melanogaster to alternating temperatures. For each case, the distribution of estimator is computed by 100 bootstrap computations using 1000 variants each. The multiple modes are an artifact of grid search used to speed up computation. (C) Distribution of the population size estimates on the yeast data set. Despite large census population size
Burke ), this data set exhibits a much smaller effective population size
Distribution of bias for 100× coverage. The distribution of bias in estimating selection coefficient over 1000 simulations using GP and Clear (H) is shown for a range of choices for the selection coefficient s and starting carrier frequency when coverage (A and B). GP and Clear have similar variance in estimates of s for soft sweep, while Clear provides lower variance in hard sweep. Also see Table S2 in File S3 (C and D). The variance in the estimation of h is shown. In all simulations, three replicates are evolved and sampled at generations
Running time
As Clear does not compute exact likelihood of a region (i.e., does not explicitly model linkage between sites), the complexity of scanning a genome is linear in the number of polymorphisms. Calculating the score of each variant requires an computation for However, most of the operations are can be vectorized for all replicates to make the effective running time for each variant faster. We conducted 1000 simulations and measured running times for computing site statistics H, FIT, CMH, and GP with different numbers of linked loci. Our analysis reveals (Figure 6) that Clear is orders of magnitude faster than GP, and comparable to FIT. While slower than CMH on the time per variant, the actual running times are comparable after vectorization and broadcasting over variants (see below).Running time. Box plots of running time per variant (CPU seconds) of Clear
CMH, FIT, and GP with 1, 3, 5, 7, and 10 loci over 1000 simulations conducted on a workstation with Intel Core i7 processor. The average running time for each method is shown on the x-axis. In all simulations, three replicates are evolved and sampled at generationsThese times can have a practical consequence. For instance, to run GP in the single-locus mode on the entire pool-seq data of the D. melanogaster genome from a small sample (≈1.6-M variant sites), it would take 1444 CPU hours (≈1 CPU month). In contrast, after vectorizing and broadcasting operations for all variant operations using the numba package, Clear took 75 min to perform a scan, including precomputation; while the fastest method, CMH, took 17 min.
Analysis of Real Data
Analysis of a D. melanogaster adaptation to alternating temperatures:
We applied Clear to the data from a study of the adaptation of D. melanogaster to alternating temperatures (Orozco-ter Wengel ; Franssen ), where three replicate samples were chosen from a population of D. melanogaster for 59 generations under alternating 12-hr cycles of hot-stressful (28°) and nonstressful (18°) temperatures, and sequenced. In this data set, sequencing coverage is different across replicates and generations (see figure S2 of Terhorst ) which makes variant depths highly heterogeneous (Figure S10 in File S3).We first filtered out heterochromatic, centromeric, and telomeric regions (Fiston-Lavier ), and those variants that have a collective coverage of >1500 in all 13 populations: three replicates at the base population, two replicates at generation 15, one replicate at generation 23, one replicate at generation 27, three replicates at generation 37, and three replicates at generation 59. After filtering, we ended up with 1,605,714 variants.Next, we estimated genome-wide population size (Figure 7B and Figure S9E in File S3), which is consistent with previous studies (Orozco-ter Wengel ; Jónás ). The likelihood curves of Clear are sharper around the optimum compared to that of the method of Bollback (see figure S1 in Orozco-ter Wengel ). Also, chromosomes 3L and 3R appear to have a smaller population size, respectively. Others have made similar observations on this data. In particular, Jónás showed that the chromosome-wise population size varies even more when it is computed for each replicate separately (see table 1 in Jónás ). For instance, is 131 for chromosome 3R replicate 1, while it is 328 for chromosome X replicate 2.Estimating population size. (A) Distribution of bias in estimating N, computed on 1000 neutral simulations for each when and (B) Estimates of population size for data from a study of adaptation of D. melanogaster to alternating temperatures. For each case, the distribution of estimator is computed by 100 bootstrap computations using 1000 variants each. The multiple modes are an artifact of grid search used to speed up computation. (C) Distribution of the population size estimates on the yeast data set. Despite large census population size
Burke ), this data set exhibits a much smaller effective population sizeWhile it would be ideal to compute the Clear statistic for each replicate and chromosome separately, computing empirical P-values and significant regions become computationally intensive as the empirical null distribution of each replicate and each chromosome needs to be computed. Hence, we use a single genome-wide estimate in all analyses, but we normalize statistic separately for each chromosome.We used a heuristic calculation (see File S3) to choose the sliding window size L as the distance where the LD between the favored mutation and a site away remains strong. For D. melanogaster parameters, we obtained We computed the normalized test statistic on sliding windows of size of 30 kbp and step size of 5 kbp over the genome (see Figure 8A).
Figure 8
Scan of Clear statistic on data from a study of adaptation of D. melanogaster to alternating temperatures. (A) Manhattan plot of scan for statistic using sliding window of size over the genome. The dashed line represents cutoff for genome-wide and identifies five contiguous intervals, I1–I5, which are shaded in blue. (B) Trajectories of the selected variants within intervals I1–I5.
Scan of Clear statistic on data from a study of adaptation of D. melanogaster to alternating temperatures. (A) Manhattan plot of scan for statistic using sliding window of size over the genome. The dashed line represents cutoff for genome-wide and identifies five contiguous intervals, I1–I5, which are shaded in blue. (B) Trajectories of the selected variants within intervals I1–I5.The empirical null distribution of was estimated by creating 100 whole-genome simulations (400 K statistic values), as described in Material and Methods. Then, the P-value of the test statistic in each region in the experimental data was calculated as the fraction of the null statistic values that are greater than or equal to the test statistic(see Figure S11 in File S3). After correcting for multiple testing, we identified five contiguous intervals (Figure 8) satisfying and covering polymorphic sites. We further performed single-locus hypothesis testing on the sites to identify 174 individual variants with an (Figure 8B).The final set of 174 variants fall within 32 genes (Table S3 in File S3), including many serine inhibitory proteases (serpins) and other genes involved in endocytosis. Recycling of synaptic vesicles is seen to be blocked at high temperature in temperature-sensitive Drosophila mutants (Kosaka and Ikeda 1983). This is also supported by gene ontology (GO)-enrichment analysis, where a single GO term “inhibition of proteolysis” is found to be enriched (corrected P-value = 0.0041). To test for dominant selection, we computed the D statistic on simulated neutral and experimental data, and computed P-values accordingly. After correcting for multiple testing, 96 variants were discovered with an (Figure S12 in File S3).
Analysis of outcrossing yeast populations
We also applied Clear to 12 replicate samples of outcrossing yeast populations (Burke ), where samples were taken at generations We observed a significant variation in the genome-wide, site-frequency spectrum of certain populations over different time points for some replicates (Figure S13 in File S3). The variation does not have an easily identifiable cause. Therefore, we focused analysis on seven replicates with a genome-wide, site-frequency spectrum over the time range (Figure S14 in File S3).We estimated population size to be haplotypes (Figure 7C and Figure S9F in File S3), and computed the
and H statistics accordingly. To compute P-values, we created 1-M single-locus neutral simulations according to the experimental data’s initial frequency and coverage. By setting the FDR cutoff to 0.05, only 18 and 16 variants show significant signal for directional and dominant selection, respectively (Figure 9 S12 in File S3). Selected variants for directional selection are clustered in two regions, which match two of the five regions (regions C and E in figure 2A in Burke ) identified by Burke in their preliminary analysis. UCSC browser tracks for analysis of D. melanogaster and Yeast datasets are available as File S1 and File S2, respectively.
Figure 9
Single-locus analysis of the yeast outcrossing populations. Manhattan plot of scan single-locus Clear statistic for testing (A) directional selection and (C) dominant selection. The dashed line represents cutoff for genome-wide Trajectories of the selected variants are depicted in panels (B) and (D).
Single-locus analysis of the yeast outcrossing populations. Manhattan plot of scan single-locus Clear statistic for testing (A) directional selection and (C) dominant selection. The dashed line represents cutoff for genome-wide Trajectories of the selected variants are depicted in panels (B) and (D).
Discussion
We developed a computational tool, Clear, that can detect regions and variants under selection E&R experiments. Using extensive simulations, we show that Clear outperforms existing methods in detecting selection, locating the favored allele, and estimating model parameters. Also, while being computationally efficient, Clear provides a means for estimating populations size and hypothesis testing.Many factors; such as small population size, finite coverage, LD, finite sampling for sequencing, duration of the experiment, and the small number of replicates; can limit the power of tools for analyzing E&R. Here, by discrete modeling, Clear estimates population size and provides unbiased estimates of s and h. It adjusts for heterogeneous coverage of pool-seq data, and exploits the presence of linkage within a region to compute a composite likelihood ratio statistic.It should be noted that, even though we described Clear for small fixed-size populations, the statistic can be adjusted for other scenarios, including changing population sizes when the demography is known. For large populations, transitions can be computed on sparse data structures, as for large N the transition matrices become increasingly sparse. Alternatively, frequencies can be binned to reduce dimensionality.The comparison of hard- and soft-sweep scenarios showed that the initial frequency of the favored allele can have a nontrivial effect on the statistical power for identifying selection. Interestingly, while it is easier to detect a region undergoing strong selection, it is harder to locate the favored allele in that region.There are many directions to improve the analyses presented here. In particular, we plan to focus our attention on other organisms with more complex life cycles, experiments with variable population size, and longer sampling-time spans. As E&R experiments continue to grow, deeper insights into adaptation will go hand in hand with improved computational analysis.
Supplementary Material
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.197566/-/DC1.Click here for additional data file.Click here for additional data file.Click here for additional data file.
Authors: Jeffrey E Barrick; Dong Su Yu; Sung Ho Yoon; Haeyoung Jeong; Tae Kwang Oh; Dominique Schneider; Richard E Lenski; Jihyun F Kim Journal: Nature Date: 2009-10-18 Impact factor: 49.962
Authors: Dan Zhou; Nitin Udpa; Merril Gersten; DeeAnn W Visk; Ali Bashir; Jin Xue; Kelly A Frazer; James W Posakony; Shankar Subramaniam; Vineet Bafna; Gabriel G Haddad Journal: Proc Natl Acad Sci U S A Date: 2011-01-24 Impact factor: 11.205
Authors: Molly K Burke; Joseph P Dunham; Parvin Shahrestani; Kevin R Thornton; Michael R Rose; Anthony D Long Journal: Nature Date: 2010-09-15 Impact factor: 49.962