Literature DB >> 30428070

Methods for Estimating Demography and Detecting Between-Locus Differences in the Effective Population Size and Mutation Rate.

Kai Zeng¹, Benjamin C Jackson², Henry J Barton¹.

Abstract

It is known that the effective population size (Ne) and the mutation rate (u) vary across the genome. Here, we show that ignoring this heterogeneity may lead to biased estimates of past demography. To solve the problem, we develop new methods for jointly inferring past changes in population size and detecting variation in Ne and u between loci. These methods rely on either polymorphism data alone or both polymorphism and divergence data. In addition to inferring demography, we can use the methods to study a variety of questions: 1) comparing sex chromosomes with autosomes (for finding evidence for male-driven evolution, an unequal sex ratio, or sex-biased demographic changes) and 2) analyzing multilocus data from within autosomes or sex chromosomes (for studying determinants of variability in Ne and u). Simulations suggest that the methods can provide accurate parameter estimates and have substantial statistical power for detecting difference in Ne and u. As an example, we use the methods to analyze a polymorphism data set from Drosophila simulans. We find clear evidence for rapid population expansion. The results also indicate that the autosomes have a higher mutation rate than the X chromosome and that the sex ratio is probably female-biased. The new methods have been implemented in a user-friendly package.

Entities: Chemical Disease Species

Mesh：

Year: 2019 PMID： 30428070 PMCID： PMC6409433 DOI： 10.1093/molbev/msy212

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 16.240

Introduction

Information on past demographic changes is essential for understanding how major events shape the evolution of a species (e.g., the out-of-Africa migration of humans; Veeramah and Hammer 2014), for reliably detecting genes underlying adaptation/speciation (Bank et al. 2014; Payseur and Rieseberg 2016), and for formulating effective conservation strategies (Allendorf et al. 2010). As a result, many methods have been developed for making demographic inferences by examining various aspects of sequence polymorphism (Schraiber and Akey 2015; Payseur and Rieseberg 2016). Due to the randomness of the process of evolution and the rarity of polymorphic sites, the amount of information provided by data from a small genomic region is rather limited, which in turn leads to large statistical noise in the inference. This problem is typically dealt with by combining data from multiple loci. However, this approach is complicated by regional heterogeneity in important parameters. For instance, the mutation rate, u, varies across the genome (Hodgkinson and Eyre-Walker 2011). In addition, the effective population size, Ne, is also heterogeneous (Ne is inversely related to the rate of coalescence; Charlesworth 2009). Variation in Ne may be caused by differences in the mode of inheritance (e.g., autosomes vs. sex chromosomes; Charlesworth 2009) and/or differences in the strength of selection at linked sites (e.g., selective sweeps and background selection; Cutter and Payseur 2013). To illustrate problems with combining data from multiple loci, imagine that there was a 10-fold increase in the population size 1,000 generations ago and that we have data from two loci with effective population sizes 5,000 and 100, respectively. It is well known that the level of polymorphism is determined by . Thus, if u is the same at the two loci, locus 1 is expected to contribute more SNPs to the combined data set due to its higher Ne. On the other hand, when expressed in units of two times the locus-specific Ne, the scaled time to the expansion event is 0.1 for locus 1 and 5 for locus 2. Thus, locus 1 is much closer to the event than locus 2, and its local genealogy is expected to deviate more from that expected under an equilibrium model. As locus 1 makes a larger contribution to the combined data set, making inferences on the combined data without regard to these between-locus differences will lead to results that are biased toward the situation at locus 1. Being able to detect differences in Ne and u between loci is required for studying important questions in evolution. For instance, comparing sex chromosomes and autosomes with regard to their polymorphism patterns is a powerful way of detecting evidence for an unequal sex ratio and/or sex-biased demographic processes (Webster and Wilson Sayres 2016). Although existing methods developed for this purpose do take into account variation in Ne and u between sex chromosomes and autosomes (Pool and Nielsen 2007, 2008; Garrigan 2009; Keinan et al. 2009; Haddrill et al. 2011; Evans et al. 2014; Clemente et al. 2018), they are limited in several important aspects: 1) Some rely on summary statistics such as the X-autosome diversity ratio, and do not make full use of the data (Pool and Nielsen 2008); 2) some cannot detect changes in the sex ratio between different evolutionary epochs (Garrigan 2009; Haddrill et al. 2011; Evans et al. 2014); and 3) some do not model the mutation process, and therefore cannot detect difference in the mutation rate between sex chromosomes and autosomes caused by, for example, male-driven evolution (Clemente et al. 2018). Demographic inference methods concerned with data collected from within autosomes (or sex chromosomes) seem to pay less attention to regional variation in Ne and u (Gutenkunst et al. 2009; Excoffier et al. 2013). The method of Bhaskar et al. (2015) allows u to vary across loci but assumes a single Ne for all loci. The method of Gossmann et al. (2011) considers between-locus differences in both Ne and u but assumes that the population size is constant over time. Beaumont and colleagues (Beaumont 1999; Storz and Beaumont 2002) developed a hierarchical Bayesian model that accommodates changes in population size as well as variation in both Ne and u. However, this method is applicable to microsatellite data only. Finally, the method of Hey and colleagues (Hey and Nielsen 2004; Sousa et al. 2013) considers both demography and between-locus differences but is computationally intensive and not suitable for analyzing data sets with many loci. To solve the issues discussed above, we describe a general framework for simultaneously inferring past changes in population size and detecting variation in Ne and u. Several methods are constructed, either for making comparisons between the X (or Z) chromosome and autosomes or for analyzing multilocus data from within autosomes (or sex chromosomes). The methods typically make inferences on polymorphism data, although some of them are able to use both polymorphism and divergence data. Using computer simulations, we ask the following questions: 1) To what extent do regional differences in Ne and u bias results obtained by demographic inference methods that ignore this heterogeneity? 2) Can the new methods overcome these biases? 3) Do the new methods have sufficient statistical power for detecting between-locus differences in Ne and u? As an example, we use the methods to analyze a polymorphism data set from Drosophila simulans (Jackson et al. 2017), focusing on X-autosome comparisons. We examine whether the population size has changed recently, whether u differs between the X chromosome and the autosomes, and whether there is evidence for sex-biased processes (e.g., an unequal sex ratio, sex differences in reproductive success).

New Approach

The General Model without Divergence Data

Consider a randomly mating diploid population. Going backward from the present, the population size changes in a stepwise manner with H epochs (see supplementary table S1, Supplementary Material online, for a list of mathematical symbols). The most recent epoch is referred to as epoch 1, the next epoch as epoch 2, and so forth. It is assumed that epoch H (i.e., the most distant epoch) extends indefinitely into the past, whereas the duration of epoch h is T generations (). Let us focus on a locus in the genome, referred to as locus 1. It is assumed that the Ne at this locus in epoch H is N1. The population size during epoch h is (). Mutation is modeled by the infinite-sites model. Let u1 be the mutation rate per site per generation. We define the scaled mutation rate as , and the scaled time as (). Consider a second locus, referred to as locus 2. Because the underlying demographic process is shared by all loci in the genome, the timing of population size changes (i.e., the Ts) is the same across loci (see supplementary fig. S1, Supplementary Material online, for a graphical representation of the model and its parameters). To model the difference in Ne between locus 1 and 2, we treat locus 1 as “the reference locus” and assume that the local Ne at locus 2 in the most distant epoch (i.e., epoch H) is . To model variation in the mutation rate, we assume that the mutation rate at locus 2 is u2 per site per generation and define the scaled mutation rate as (note that all scaled parameters are defined with respect to the reference locus). Finally, to accommodate the possibility that these two loci may respond differently to the demographic changes (e.g., sex-biased demographic processes can affect X-linked and autosomal loci differently; Webster and Wilson Sayres 2016), the population size during epoch h is assumed to be (). More generally, with data from K loci, the model has the following parameters, denoted collectively by : 1) the time parameters = (τ1, τ2,…, ), which are shared across loci; and 2) the locus-specific parameters θ, f, and = (,…, ) (). Note that f1 is fixed at 1 for identifiability of the parameters. Under this parameterization, information on variation in local Ne is provided by the fs. Because the θs are defined with respect to N1, they are directly comparable between loci and provide information about variability in the mutation rate. By checking whether differs significantly from 1, we can examine whether locus i and j respond differently to demographic changes. It should be noted that the parameters are identifiable if and only if there have been recent changes in population size. In contrast, if the population size is constant (i.e., H = 1), then polymorphism patterns at the K loci are fully characterized by the composite parameter , defined as (). Without loss of generality, we assume that samples of n alleles have been obtained from all K loci. The data from locus k are summarized using the unfolded site-frequency spectrum (uSFS), denoted by = (,…, ), where is the observed number of segregating sites of derived allele frequency i. Let = (,…, ) denote all the data. As detailed in Materials and Methods, we calculate the (composite) likelihood of the data using the Poisson random field model (Sawyer and Hartl 1992; Bustamante et al. 2001), assuming neutrality, free recombination between sites, and the infinite-sites model of mutation. This allows us to obtain maximum likelihood estimates (MLEs) of the parameters (see Materials and Methods).

The General Model with Divergence Data

We seek to increase the statistical power of the above model by appealing to the fact that the level of divergence to an outgroup species carries information about the mutation rate. For simplicity, we consider divergence between a sequence from the ingroup species and a sequence from the outgroup. Here it is important to consider the effects of ancestral polymorphism, which may account for a substantial fraction of the divergence level between closely related species (e.g., ancestral polymorphism may account for more than 24% of divergence between humans and chimpanzees; Chen and Li 2001). Consider locus k. It is assumed that Ne at this locus in the population of the ancestral species is , where the parameter c is used to model the possibility that the ancestral population is of a different size (recall that locus 1 is used as the reference locus and ). The expected divergence level is , where m is the length of locus k in basepairs, and is the divergence time in generations. The first term in the parentheses describes differences accumulated within the ancestral population, and the second term considers changes accrued after speciation. We can rewrite λ as , where . Thus, the inclusion of divergence data introduces two new parameters, c and t, which are shared across loci. Let = (x1, x2,…, x) where x is the observed number of substitutions at locus k. The data set now includes both and , and the new model has parameters , c, and t. The likelihood of can be calculated by assuming that the number of substitutions follows a Poisson distribution with mean λ, as in previous studies (Gossmann et al. 2011; Galtier 2016; Tataru et al. 2017). This is then combined with the likelihood of to obtain the overall likelihood (see eq. 9). It should be noted that the information on c and t comes from variation in divergence level across loci. Thus, this model should not be used to analyze data sets containing a small number of loci (see Results for more detail).

A Simplified Model

The models described above are general in that they allow each locus to have its private parameters (i.e., θ, f, and ). They are parameter-rich and require each locus to be sufficiently large so that enough information is available for estimating the locus-specific parameters. Thus, the general model is more suitable for analyzing large genomic regions (e.g., the X chromosome vs. autosomes). Regarding data collected from multiple autosomal (or sex-linked) loci, it is reasonable to define a simplified model with ( and ). That is, = (g1, g2,…, ) is now shared across loci. This model assumes that the loci, despite their difference in local Ne, respond to the underlying demographic process in the same manner. The rationale comes from the observation that the effects of selection at linked sites (e.g., recurrent selective sweeps, background selection, or the joint effects of the two) can be roughly approximated by a function of the form = (Kim and Stephan 2000; Charlesworth 2012a; Coop and Ralph 2012; Nicolaisen and Desai 2013; Zeng 2013; Corbett-Detig et al. 2015; Zeng and Corcoran 2015). Here, N(t) is the population size at time t in the absence of selection at linked sites. represents parameters of the model under consideration and typically includes the strength of selection, the rate at which selected variants arise, and the recombination rate. The function has relatively weak dependence on the population size. For instance, under background selection, is approximately independent of the population size and is a function of the deleterious mutation rate, the distribution of fitness effects of new deleterious variants, and the recombination rate (Charlesworth 2012a; Nicolaisen and Desai 2013; Zeng 2013; Zeng and Corcoran 2015). Although modeling the effects of selection at linked sites as a reduction in local Ne is known to be an oversimplification, this approach has been employed by several widely used inference methods (Beaumont 1999; Storz and Beaumont 2002; Hey and Nielsen 2004; Sousa et al. 2013) and should represent a step toward solving the problems caused by ignoring selection at linked sites (Ewing and Jensen 2016; Schrider et al. 2016).

Dealing with Polarization Errors

So far we have assumed that the uSFS is known. In reality, obtaining the uSFS requires the inference of the ancestral state at polymorphic sites, which can be error-prone (e.g., when sequence divergence to outgroup species is high). It is also known that polarization errors can bias inferences based on the uSFS (Hernandez et al. 2007; Barton and Zeng 2018; Keightley and Jackson 2018). We provide two solutions to this problem. The first is to use the folded SFS (fSFS). Let be the observed number of segregating sites at which the less frequent allele (minor allele) is represented i times (, where is the largest integer that is not greater than x). The fSFS for locus k is = (,…, ), and the overall polymorphism data are = (,…, ). As in the unfolded case, the likelihood of the data can be calculated using the Poisson random field model (see Materials and Methods). An alternative approach is to explicitly consider polarization error in the model (Williamson et al. 2005; Glémin et al. 2015; Barton and Zeng 2018). When the ancestral state of a segregating site of derived allele frequency i is misinferred, it will be incorrectly assigned as a segregating site of derived allele frequency n – i (). Let be the probability that the ancestral state of a polymorphic site at locus k is misinferred. After polarization, the expected number of segregating sites of derived allele frequency i is where is (true) expected number segregating sites of derived allele frequency i and is a function of θ, f, , and (see eq. 2). As an example, when the above is used with the general model (no divergence), the free parameters include and (), which can be estimated by maximum likelihood.

Results and Discussion

Properties of the General Model

We evaluated the performance of the general model using X-autosome comparisons as an example. To this end, we employed a two-locus setup and treated locus 1 as the X chromosome (the reference locus) and locus 2 as the autosomes. We generated data from two different models, referred to as Model 1 and Model 2 (table 1). First, let us consider Model 1. It includes several factors that are known to be important for human evolution: changes in the X-autosome ratio of Ne, recent population expansion, and difference in the mutation rate between the X chromosome and autosomes. Here, the simulations were carried out using a demographic model with H = 2 epochs. The Ne for the X chromosome and the autosomes in epoch 2 (i.e., the most distant epoch) are denoted by NX and NA, respectively. Let be the X-autosome ratio of Ne in epoch 2. At time before the present, measured in units of generations, the population sizes of the X chromosome and the autosomes changed instantly to and , respectively (see supplementary fig. S1, Supplementary Material online, for a graphical representation). As a result, the X-autosome ratio of Ne in epoch 1 (i.e., the current epoch) is given by . We assumed that and , close to the values reported by Keinan et al. (2009). The shift in the X-autosome ratio of Ne is accompanied by population expansion characterized by and . Let u and uA be the mutation rate per site per generation on the X chromosome and the autosomes, respectively. The corresponding scaled mutation rates are defined as and (recall that scaled parameters are defined with respect to the reference locus). We used and . These values give an average autosomal diversity level of 0.001 per site and also reflect the fact that the X chromosome probably have a 30% lower mutation rate than the autosomes (Hodgkinson and Eyre-Walker 2011).

Table 1.

Mean (standard deviation; SD) of the MLEs for the Parameters of Two Different Two-Locus Models.

	θ_X	θ_A	r₁	r₂	gX,1	τ1
Model 1 (true)	5.25×10−4	7.5×10−4	0.65	0.75	10	0.1
Mean (SD)	5.259×10−4 (5.8×10−6)	7.53×10−4 (2.6×10−5)	0.653 (0.05)	0.752 (0.03)	10.0 (0.3)	0.10 (0.003)
Model 2 (true)	5.25×10−4	7.5×10−4	0.9	0.75	0.2	0.05
Mean (SD)	5.261×10−4 (1.0×10−5)	7.41×10−4 (1.0×10−4)	0.89 (0.08)	0.742 (0.10)	0.20 (0.01)	0.051 (0.005)

Note.—Definition of the symbols can be found in supplementary table S1, Supplementary Material online. The population size increases in Model 1, but reduces in Model 2. In both models, the X-autosome ratio of Ne are different before and after the population size change (as measured by r1 and r2). The results are based on 100 simulation replicates. The sample size is 100. Both loci contain 5-Mb sites. The mean number of X-linked and autosomal polymorphic sites are 23,187 and 40,734 under Model 1, and 8,040 and 15,296 under Model 2.

Mean (standard deviation; SD) of the MLEs for the Parameters of Two Different Two-Locus Models. Note.—Definition of the symbols can be found in supplementary table S1, Supplementary Material online. The population size increases in Model 1, but reduces in Model 2. In both models, the X-autosome ratio of Ne are different before and after the population size change (as measured by r1 and r2). The results are based on 100 simulation replicates. The sample size is 100. Both loci contain 5-Mb sites. The mean number of X-linked and autosomal polymorphic sites are 23,187 and 40,734 under Model 1, and 8,040 and 15,296 under Model 2. Model 2 is similar to Model 1, except for the following: 1) and ; 2) the shift in the X-autosome ratio of Ne coincides with a population size reduction characterized by and . We used Model 2 to assess how the general model fared when the X-autosome ratio of Ne increased whereas the population size reduced. We were able to accurately recover all parameters by analyzing only polymorphism data (table 1). Parameter estimation is more difficult under Model 2, as indicated by the higher standard deviation values. This is expected because the population size contraction means that samples generated under Model 2 contain fewer polymorphic sites (see table 1 legend). Likelihood ratio tests can be readily constructed to ask specific questions of interest. Here, we focus on the following: Test 1—is the mutation rate different between the X chromosome and the autosomes (a model with vs. the full model; degree of freedom [df] = 1)? Test 2—is there evidence for the X-autosome ratio of Ne being significantly different from 0.75 (a model with vs. the full model; df = 2)? Test 3—has the X-autosome ratio of Ne changed between epochs (a model with vs. the full model; df = 1)? These tests were applied to the simulated data used in table 1, and the results are shown in table 2. Test 1 is less powerful under Model 2 than under Model 1, which is an expected consequence of a drop in the number of polymorphic sites. In contrast, Test 2 has higher power under Model 2 than under Model 1, and the power of Test 3 is comparable between the models. These observations indicate that the number of polymorphic sites is not the only factor that affects statistical power.

Table 2.

Power (%) of the Three Likelihood Ratio Tests.

Model	Test 1	Test 2	Test 3
Model 1	100	84	83
Model 2	67	98	78

Note.—Model 1 and Model 2 are the same as those used in table 1; so are the number of replicates, sample size, and locus length. Each sample was analyzed using the likelihood ratio tests described in the main text. The values above are the frequency at which the null model is rejected at a 5% significance level.

Power (%) of the Three Likelihood Ratio Tests. Note.—Model 1 and Model 2 are the same as those used in table 1; so are the number of replicates, sample size, and locus length. Each sample was analyzed using the likelihood ratio tests described in the main text. The values above are the frequency at which the null model is rejected at a 5% significance level. Overall, the simulations suggest that polymorphism data can be used to obtain information about X-autosome differences in Ne and/or u. The power of these analyses depends in a complex way on both the sample size and the demographic history. It should also be pointed out that the divergence-based version of the general model is not suitable for analyzing data sets containing only two loci. This is because the data contain very little information about the parameters c and t. In fact, simulations suggest that, when this is the case, including divergence data can lead to biases in parameter estimation (supplementary table S2, Supplementary Material online).

Properties of the Simplified Model

This model is suitable for analyzing data collected from multiple autosomal or sex-linked loci. We will start by analyzing data sets consisting of a small number of loci, in order to demonstrate several important properties of the model. We will then consider data sets with many loci, which represents a much more challenging problem.

Data Sets with a Small Number of Loci

We analyzed 100 simulated data sets. Each data set contains the uSFS from 20 loci, and the sample size is 100. All loci are 5 kb long. The scaled parameters are defined with respect to N1, the Ne at locus 1 in the most distant epoch (i.e., locus 1 is the reference locus). The scaled mutation rate θ () vary linearly across loci, with (blue line in fig. 1). The f () also vary linearly with (blue line in fig. 1). The demographic model has H = 2 epochs. At time before the present, the population size increased 10-fold (i.e., ). To model divergence, we assumed that the population of the ancestral species was larger with c = 2. The scaled divergence time is t = 8. With these parameter values, the expected divergence level at locus 1 is 0.1 per site.

. 1.

MLEs obtained by fitting the simplified model to simulated data from 20 loci. Each locus is 5 kb long. The solid blue line in each plot shows the true parameter values across loci. The population size expanded recently with parameters and . The results are based on 100 replicates. The sample size is 100. The simulated data were first analyzed by combining the uSFS from the 20 loci into a single uSFS (i.e., disregarding variation in Ne and u). Estimates of g1 and were obtained by fitting the combined data to a demographic model with a one-step change in population size. The mean and the interval between the 2.5 and 97.5 percentiles are 9.20 and [8.62, 9.63] for g1, and 0.58 and [0.54, 0.61] for . Both estimates are biased, and neither of the intervals overlaps the true value. Thus, ignoring heterogeneity in Ne and u can lead to high statistical support for biased estimates. The simulated data were then analyzed by the simplified model, both with and without using the divergence data. From table 3, we can see that the model can provide unbiased estimates for both g1 and , regardless of whether divergence data were used. The standard deviation (SD) values in table 3 suggest that estimates of are somewhat less variable with divergence data. The model is also able to provide accurate estimates of c and t, in contrast to the two-locus case (supplementary table S2, Supplementary Material online).

Table 3.

Parameter Estimates Obtained by Applying the Simplified Model to Simulated Data Sets Containing Either 20 or 500 Loci.

Data	Mean (SD)
	g₁	τ1	c	t
True (20 loci)	10	0.5	2	8
With div	10.1 (0.26)	0.51 (0.06)	2.0 (0.2)	8.1 (0.9)
No div	10.1 (0.26)	0.51 (0.08)	—	—
True (500 loci)	10	0.5	2	6
With div	10.0 (0.04)	0.50 (0.04)	2.0 (0.04)	6.0 (0.5)
No div	10.0 (0.04)	0.50 (0.05)	—	—

Note.—The cases with 20 loci are the same as those presented in figure 1. The locus length is 5 kb, and the results are based on 100 replicates and a sample size of 100. For the cases with 500 loci, θ and f were sampled from the gamma distributions described in the main text. The locus length is 10 kb, and the results are based on 50 replicates and a sample size of 50. The demographic model is the same in all cases, and is characterized by g1 and .

Parameter Estimates Obtained by Applying the Simplified Model to Simulated Data Sets Containing Either 20 or 500 Loci. Note.—The cases with 20 loci are the same as those presented in figure 1. The locus length is 5 kb, and the results are based on 100 replicates and a sample size of 100. For the cases with 500 loci, θ and f were sampled from the gamma distributions described in the main text. The locus length is 10 kb, and the results are based on 50 replicates and a sample size of 50. The demographic model is the same in all cases, and is characterized by g1 and . Regarding θ and f, the estimates are also unbiased (fig. 1). The addition of divergence data appears to slightly lower the variance of the estimates. In figure 1, we can see that the variance of the f estimates tends to be larger for loci with a higher index, whereas the variance of the estimates of the composite parameter is more uniform across loci (fig. 1). To see why, we first note that is the total scaled mutation rate at locus k in the most distant epoch (i.e., scaled by the Ne at locus k instead of the Ne at the reference locus). The information the model uses to separate θ and f comes in part from the distortion of the local genealogy caused by the recent expansion. For locus k, the rate of coalescence (in units of generations) between the present and the time of expansion is . Thus, coalescence occurs at a slower rate at loci with a larger local Ne (i.e., a higher true value of f in fig. 1). In the most extreme scenario when f is so large that approaches zero, the local genealogy will be indistinguishable from that expected under the equilibrium model. In this case, the likelihood surface will contain a ridge on which both θ and f vary with the product held constant, making it impossible to separate θ and f. As f is large when k is large (fig. 1), the increase in variance reflects the increase in difficulty in separating θ and f. This suggests that the ability to estimate θ and f separately at locus k depends on both the demographic history and properties specific to the locus itself. Finally, we repeated the above analyses, but used locus 20 as the reference locus instead. This has little effect on the results. For instance, the mean (SD) of the MLEs of g1 is 10.1 (0.29) without divergence data, and 10.1 (0.29) with divergence data (cf. table 3). As in figure 1, estimates of f are more variable for loci with a higher local Ne (supplementary fig. S2, Supplementary Material online). Thus, the choice of reference locus may be relatively unimportant.

Data Sets with a Large Number of Loci

We analyzed 50 simulated data sets. Each data set contains uSFS from 500 loci, and the sample size is 50. The loci are 10 kb in length. As above, the scaled parameters are defined with respect to N1, the Ne at locus 1 in the most distant epoch. In each replicate, we sampled θ from a gamma distribution with shape = 3 and scale , and f from a gamma distribution with shape and scale . For divergence, we used c = 2 and t = 6 in all replicates. The average diversity and divergence levels under these parameters are 1.5% and 12%, respectively, which are close to those observed at putatively neutral sites in short introns on the autosomes of D. melanogaster (using D. simulans as an outgroup; Jackson et al. 2015). The demographic model is the same as that used in figure 1. The use of the gamma distribution was inspired by a previous study (Gossmann et al. 2011), but the values of the shape and scale parameters are somewhat arbitrary. Our treatment also does not consider distortions in the shape of the SFS caused by selection at linked sites. These simplifications were made on consideration of the complexity of the inference problem, so that we could assess the model’s performance in a relatively simple setting. The data shown in table 3 suggest that g1, , c, and t can all be estimated accurately. As a different set of θ and f were sampled from the gamma distributions in each replicate, we assessed the accuracy of the model by calculating the slope and intercept of the linear regression of the MLEs of θ and f over their true values. For θ, the mean (SD) for the slopes and intercepts are 1.00 (0.09) and () with divergence data, and 0.99 (0.10) and () without divergence data. For f, these are 0.95 (0.08) and 0.05 (0.01) with divergence data, and 0.93 (0.10) and 0.07 (0.01) without divergence data. Thus, as above, the inclusion of divergence data seems to increase accuracy and lower variance. Compared with f, the regression lines for θ have slopes closer to 1 and intercepts closer to 0, suggesting that θ tends to be more accurately estimated using this method. As discussed in the previous section, when the data do not contain enough information, θ and f tend to form a ridge in the likelihood surface. This can create an artificial negative correlation between these two parameters, which may be problematic if the MLEs of θ and f are to be used for detecting association with other genomic variables (e.g., GC content, recombination rate). As the true values of θ and f were sampled from two independent probability distributions in the simulations, their MLEs should be uncorrelated. However, when making inferences on polymorphism data alone, the MLEs of θ and f are significantly negatively correlated in 16% of the simulation replicates (based on Kendall’s τ and a significance level of 5%). In contrast, for estimates based on both polymorphism and divergence, only 2% of the replicates show a significant negative correlation, suggesting that the addition of divergence data has increased the model’s ability to separate variation in Ne from that in u. It should be pointed out that this requires the divergence level to be sufficiently large. For instance, if we keep all parameters the same as above, but reduce t, the scaled divergence time, such that the expected divergence level drops from 12% to 6%, the MLEs of θ and f are significantly correlated in 8% of the replicates. In practice, the “required” level of divergence is a function of the demographic history of the ingroup species, lengths of the loci, and the number of alleles in the sample.

Implications of the Results Based on the Simplified Model

The results presented above suggest that disregarding variability in Ne and u can lead to biased demographic inferences. The new methods can solve this problem and help to quantify this heterogeneity across loci. It is, however, important to note that the ability to separate θ and f depends on several factors—the demographic history, the local effective population size, and the sample size (in terms of both the number of alleles and locus lengths). When there is insufficient information, the ridge along tends to create a negative correlation between the MLEs of θ and f. There is some evidence that the inclusion of divergence data can help to counter this tendency, and (moderately) lower variance in parameter estimation (fig. 1 and table 3). It should, however, be noted that we have used a highly simplifying model of divergence. It is of interest to incorporate complications such as nonequilibrium substitution patterns in the future by using, for instance, the framework of Matsumoto et al. (2015). The above discussion is relevant to other methods that allow Ne and u to vary across loci, especially when considering that these methods do not use divergence data to help the inference (Beaumont 1999; Storz and Beaumont 2002; Hey and Nielsen 2004; Sousa et al. 2013). Thus, the simulations highlight a major challenge in population genetic data analysis—although many important questions in evolution can be studied by detecting differences in Ne and u, the fact that diversity patterns are determined by the composite parameter means that separating these two parameters is inher—ently difficult. The same applies to the analysis of data collected from subdivided populations. Here the composite parameter , where m is the migration rate, is inversely correlated to the level of differentiation between populations. As a result, distinguishing between the following two causes of locally elevated levels of differentiation may not be straightforward (Cruickshank and Hahn 2014): 1) Loci have smaller m due to their involvement in selection against gene flow (Wolf and Ellegren 2017) and 2) loci have reduced Ne, but not m, as a result of background selection (Zeng and Corcoran 2015). Therefore, how to further increase the statistical power and robustness of the methods cited above warrants further investigation.

Application to the D. simulans Data

X-autosome Comparisons Based on the General Model

Our data set contains 21 alleles collected from the putative ancestral range in Madagascar (Jackson et al. 2017; see table 1 therein for values of summary statistics such as the nucleotide diversity [π] and Tajima’s D). To avoid complication caused by selection on synonymous codon usage, we considered sequence variability on putatively neutrally evolving sites in short introns (i.e., positions 8–30 bp of introns <66 bp; see also Parsch et al. 2010). Comparing the X chromosome and the autosomes (A), the diversity ratio is . This is lower than the “null” value of 0.75 expected when there is a 1:1 sex ratio and no difference in reproduction success between sexes (Charlesworth 2009). However, the population is not at equilibrium, as suggested by the negative Tajima’s D value of −1.46 on the X chromosome and −1.19 on the autosomes. It is known that changes in population size can perturb away from 0.75 (Pool and Nielsen 2007). Thus, the observed ratio can potentially be explained by a combination of the following factors: 1) recent demography; 2) difference in Ne between X and A as a result of an unequal sex ratio, difference in the mode of inheritance, and/or variation in reproductive success between sexes; and 3) difference in the mutation rate between X and A. To determine which of the three factors may have had an effect on , we fitted the general model to the uSFS, with the ancestral state at polymorphic sites inferred using D. melanogaster as an outgroup and maximum parsimony. A model with H = 2 epochs fits the data well, except for the uptick toward the high-frequency end of the uSFS (table 4 and supplementary fig. S3, Supplementary Material online). Increasing the number of epochs to H = 3 does not significantly improve the fit, suggesting that the uptick is most probably a result of polarization error (supplementary fig. S3, Supplementary Material online). As the sample size is relatively small, using the fSFS is likely to lead to a significant loss of power. Thus, we adopted the alternative approach and introduced two new parameters, X and A, for modeling polarization errors in the X-linked and autosomal data set, respectively. This model explains the observed uSFS significantly better than the no-error model (supplementary fig. S4, Supplementary Material online). This is further confirmed by the fact that the 95% CIs for the two polarization error parameters have lower bounds >0 (table 4). Adding another epoch to the model does not significantly increase the goodness of fit (, where the subscript b signifies that the P-value was obtained by bootstrapping). Thus, we refer to the model with H = 2 and polarization error as the best-fit model, and use it in the subsequent analyses.

Table 4.

Parameter Estimates Obtained by Fitting Two Models to the uSFS from Drosophila simulans.

Model	MLE and 95% CI of Parameters
H = 2	θ_X	θ_A	r₁	r₂	gX,1	τ1
No pol err	0.015	0.024	1.99	1.00	11.88	0.40
	[0.013, 0.016]	[0.019, 0.028]	[1.26, 2.80]	[0.78, 1.18]	[9.35, 14.93]	[0.33, 0.49]
H = 2	θ_X	θ_A	r₁	r₂	gX,1	τ1	ɛ_X	ɛ_A
With pol err	0.011	0.019	1.91	1.03	12.60	0.67	0.06	0.05
	[0.010, 0.013]	[0.014, 0.025]	[1.24, 2.63]	[0.77, 1.33]	[9.99, 15.43]	[0.51, 0.86]	[0.05, 0.07]	[0.05, 0.06]

Note.—Both models have H = 2 epochs. The second model contains two extra parameters, X and A, for modeling polarization errors in the X-linked and autosomal data set, respectively. The 95% CIs were obtained by analyzing 100 bootstrap samples. The bootstrap samples were generated by sampling the short introns with replacement, while keeping the numbers of X-linked and autosomal introns the same as in the real data set.

Parameter Estimates Obtained by Fitting Two Models to the uSFS from Drosophila simulans. Note.—Both models have H = 2 epochs. The second model contains two extra parameters, X and A, for modeling polarization errors in the X-linked and autosomal data set, respectively. The 95% CIs were obtained by analyzing 100 bootstrap samples. The bootstrap samples were generated by sampling the short introns with replacement, while keeping the numbers of X-linked and autosomal introns the same as in the real data set. The MLEs of the parameters in the best-fit model are presented in table 4. Consistent with the negative Tajima’s D values, is significantly >1, providing support for a recent population expansion (). The X chromosome mutates at a lower rate than the autosomes, and the MLE of is 0.59 (95% CI = [0.49, 0.68]), which is significantly smaller than 1 (). The MLE of r1, the X-autosome ratio of Ne in the current epoch (i.e., epoch 1), is 1.91, and that of r2, the Ne ratio in epoch 2 (i.e., before the expansion), is 1.03. Bootstrapping suggests that both r1 and r2 are significantly higher than 0.75 (), and that the Ne ratio is probably different between the two epochs (). Thus, all the three factors listed above may have affected the observed .

Implications of the Results Obtained from the D. simulans Data

The fact that the MLE of the X-autosome mutation rate ratio is 0.59 is interesting and lends support to the existence of male-driven evolution in Drosophila (Bachtrog 2008). However, our estimate is significantly smaller than the X-autosome divergence rate ratio of 0.91 estimated on the same set of short introns by Charlesworth et al. (2018). The reason for this difference is unclear. It is possible that the mutation rate has evolved. The fact that substitution patterns are significantly different between the D. simulans and D. melanogaster lineages is potentially consistent with this, although other explanations have been put forward (Jackson et al. 2017). Alternatively, the difference may be caused by the fact that the general model does not consider the potential existence of a GC-favoring force acting on short introns, possibly due to GC-biased gene conversion (Jackson et al. 2017). However, the MLE of the X-autosome mutation rate ratio is still 0.59 when applying the model to variants that are unaffected by GC-biased gene conversion (i.e., a reduced data set containing polymorphic sites between A and T, and those between G and C). Thus, what causes the difference requires further investigation. Nonetheless, both our analysis and the analysis of Charlesworth et al. (2018) suggest that the X chromosome has a lower mutation rate than the autosomes, which may have direct bearing on the study of the faster-X hypothesis in Drosophila (Charlesworth et al. 2018). The MLE of r2 (the long-term X-autosome ratio of Ne before the expansion) is 1.03. It is close to the upper limit of 9/8, expected when there is an extremely female-biased sex ratio or substantially higher variance in reproductive success in males (Charlesworth 2009; Webster and Wilson Sayres 2016). The proximity to the upper limit could be a result of statistical noise, as suggested by the wide 95% confidence interval (table 4). Nevertheless, the fact that r2 is significantly higher than 0.75 lends support to the possibility of a female-biased sex ratio or high variance in male reproductive success. Further studies should investigate whether r2 may have also been influenced by other factors such as mate pairing practices, selection at linked sites, and sex-biased demographic changes (Charlesworth 2001, 2012b; Keinan et al. 2009; Evans and Charlesworth 2013; Webster and Wilson Sayres 2016). The MLE of r1 (the X-autosome ratio of Ne in the most recent epoch) is 1.91, significantly higher than the upper limit of 9/8 (Charlesworth 2009; Webster and Wilson Sayres 2016). However, the simulation results presented in supplementary table S3, Supplementary Material online, suggest that the estimation of r1 may be liable to upward biases when there are very recent events that are difficult for a sample of 21 alleles to detect. The main reason is that the smaller number of polymorphic sites in the X-linked data set (due to its lower mutation rate and shorter length) restricts its ability to detect recent events. Thus, further research using a much larger sample is needed to rule out methodological artifacts as the reason for the high estimate of r1. Fortunately, this potential power issue does not affect the estimation of the ratio and r2. Thus, the conclusions of a lower mutation rate on the X chromosome and a potentially female-biased sex ratio should be robust.

Conclusion

In this study, we show that it is possible to use polymorphism data to jointly infer past changes in population size and variation in Ne and u, provided that the population is not at equilibrium. These methods are capable of handling a large number of loci and many alleles (thousands). Including divergence data can increase the statistical power in some cases. However, because the mutation pattern itself may evolve (Smith et al. 2018), care should be exercised when choosing the outgroup. We have assumed that the population size changes in a stepwise manner, but this assumption can be readily relaxed (Polanski and Kimmel 2003; Bhaskar et al. 2015; Gao and Keinan 2016). It is important to note that Ne and u are confounded (similarly, Ne and m, the migration rate, are confounded; Sousa et al. 2013). This makes separating these parameters inherently difficult. This difficulty can in part be dealt with by increasing the sample size (both the locus length and the number of alleles), which has become feasible, thanks to advances in sequencing technologies. Our analyses have shown that the modeling framework developed herein provides an effective way of analyzing the data and can be used to study a variety of questions in different organisms.

Materials and Methods

Further Details of the Models

Assuming neutrality and an infinite-sites model of mutation, the expected number of segregating sites of derived allele frequency i in a sample of n alleles taken from locus k is given by where , m is the length (in basepairs) of locus k, is the expected total length of branches in the coalescent genealogy that have i descendants in the sample (Wakeley 2009, Section 4.1.3). Marth et al. (2004) derived an explicit expression of , which we have rearranged in the following form: where and . We use the Poisson random field model, which assumes that the sites are unlinked, to calculate the (composite) likelihood of the uSFS (Sawyer and Hartl 1992; Bustamante et al. 2001). Specifically, the probability that we observe segregating sites of derived allele frequency i at locus k is given by . The log likelihood of the data is where C is a constant that depends only on the data, and is therefore omitted from the calculation. An alternative way of calculating the likelihood of the uSFS is to condition on the segregating sites (Williamson et al. 2005). To this end, we note that the probability that a particular SNP is of derived allele frequency i is given by , where and . Importantly, is independent of the mutation rate. Therefore, assuming that the sites are unlinked, the log likelihood is where represents all the parameters in less θ (). This equation is equivalent to the profile likelihood function described by Bhaskar et al. (2015) and is computationally more efficient than equation (7) by reducing the dimensionality of the problem. Once MLEs of f, , and have been found, we can use them to calculate and estimate θ as , where is the total number of segregating sites from locus k (Bustamante et al. 2001; Bhaskar et al. 2015). To include divergence data, we assume that the number of substitutions follows a Poisson distribution with mean λ. The augmented version of equation (7) can be written as where constants dependent on the data are omitted, and = (x1, x2,…, x) are the observed number of substitutions. It should be noted that the information about c and t comes from the variation in divergence level between loci. Thus, although the composite parameter should be estimated accurately, the model may have difficulty teasing c and t apart when there is only a small number of loci and/or when (for ). To calculate likelihood of the fSFS, we define , where if the condition y is true and 0 otherwise. Likelihood functions corresponding to equations (7–9) can be obtained by changing the upper limit of the second summation from to and replacing by , and by . Finally, to explicitly consider polarization errors, we introduce parameters into the model (). The likelihood functions are analogous to equations (7–9), but with (see eq. 1) in place of . Note that the uSFS must be used in this case, as the fSFS contains no information about polarization error rates.

Computational Details

Calculation of (see eq. 2) is complicated by the presence of the alternating terms C(b, j) (see eq. 6), which can result in catastrophic cancellation during standard double-precision floating-point computations. Marth et al. (2004) dealt with this problem by using numerical libraries that performed arbitrary precision arithmetic. However, these libraries tend to be slow, especially when the sample size is large. For instance, a sample of 1,000 requires a numerical precision of ∼500 decimal places, which is orders of magnitude slower than the standard double-precision arithmetic (16-digit precision). Here, we observe that, in our new representation of the result of Marth et al. (2004) (see eq. 3–6), , where is given by equation (10) in Polanski and Kimmel (2003). Thus, we can obtain (and then ) using the recursion equations derived by Polanski and Kimmel (2003, see eqs. 13–15 therein). These equations can be evaluated using the standard double-precision arithmetic and are known to be numerically stable and very fast. Due to the introduction of locus-specific parameters, evaluating the likelihood function requires the calculation of K locus-specific expected SFSs. As the order in which the expected SFSs are obtained is unimportant, the computation can be sped up by distributing the work across multiple CPU cores. This is achieved here via OpenMP (http://www.openmp.org/).

Analysis of the Simulated Data

We performed parameter estimation using our program, varne, on random samples simulated using Mathematica (http://www.wolfram.com/). The computations in Mathematica were carried out using a very high precision level with 315 digits. Because the generation of simulated data was separate from the numerical routines we used in varne, this setup can verify the numerical robustness of varne. Unless stated otherwise, 100 data sets were generated for each parameter combination and only uSFSs were used. To obtain MLEs of the parameters, we used gradient-based optimization algorithms implemented in the NLopt library (http://ab-initio.mit.edu/nlopt). Partial derivatives were obtained by analytically differentiating equation (2) with respect to the parameters of the model. This is numerically much more stable than the finite difference method. Wherever possible, the profile likelihood (eq. 8) was used in favor of its higher computational efficiency. To ensure that the global maximum was found, the optimization algorithm was run multiple times, each starting from a randomly chosen point in the parameter space. The most complex case has H = 2 epochs and contains both polymorphism and divergence data from 500 loci. The corresponding model has 1,003 parameters. The optimization algorithms seem to cope well with the high dimensionality of the problem—the MLE was typically found by running the algorithm for <50 times.

Analysis of the D. simulans Data

We downloaded raw read data in fastq format for 21 isofemale lines of D. simulans collected from Madagascar from the European Nucleotide Archive (study accession numbers: PRJEB7673; PRJNA215932). These samples were previously described by Jackson et al. (2017). We mapped the reads to version 2.02 of the D. simulans genome (FlyBase release 2017_04) using BWA MEM (Li 2013), then sorted, merged and marked duplicates on the resulting BAM files using Picard Tools version 2.8.3 (https://broadinstitute.github.io/picard/). We called variants separately for each individual line using the HaplotypeCaller tool from GATK version 3.7 (McKenna et al. 2010), with the options –emitRefConfidence, BP_RESOLUTION and –max-alternate-alleles 2, then made per-chromosome VCF files for the whole population using the GATK v3.7 tools combineGVCFs and genotypeGVCFs. The sequencing depth per sample ranged from 54× to 100×. All the scripts necessary for downloading the fastq files and calling variants are available at https://github.com/benjamincjackson/dsim_variant_pipeline_ref_v2.02.git. Multispecies alignment was performed between the reference genomes of D. simulans (v2.02), D. melanogaster (v5.57), and D. yakuba (v1.3) using the same MULTI-Z pipeline as described by Barton and Zeng (2018). We used the information in the header lines of the FlyBase fasta file of introns for version 2.02 of the D. simulans reference (available from ftp://ftp.flybase.net/genomes/Drosophila_simulans/dsim_r2.02_FB2017_04/fasta/dsim-all-intron-r2.02.fasta.gz) to extract coordinates of the 8–30 bp region of introns that were ≤65 bp in length, after checking that this region did not overlap with an exon, an intron of length more than 65 bp, or the non-8–30 bp portion of an intron of length ≤65 bp, using information from the gff format annotation of the D. simulans genome v2.02 (available from ftp://ftp.flybase.net/genomes/Drosophila_simulans/dsim_r2.02_FB2017_04/gff/dsim-all-r2.02.gff.gz). Using these coordinates we made fasta files containing sequences from the 21 D. simulans lines as well as from the D. melanogaster and D. yakuba reference sequences, keeping only sites that met the following criteria: no more than two alleles in the polymorphism data set; phred-scaled quality score (QUAL) > 30; no missing data in any of the polymorphism or outgroup samples; not soft-masked as being repetitive in the multiple alignment step; no overlap with indels in the variant callset. For positions that still contained residual heterozygosity after the inbreeding process we chose one allele with probability proportional to the read depth for each allele at that site, following Jackson et al. (2017). We extracted all autosomal (excluding the fourth chromosome) and X-linked polymorphic sites. To unfold the SFS, we used the D. melanogaster reference genome as an outgroup and the maximum parsimony principle. When analyzing these data using the general model, we did not require that the X-autosome ratio of Ne varied between 9/16 and 9/8, with the lower bound corresponding to the case where there is only one breeding female (or much higher variance in reproductive success in females than males), and the upper bound corresponding to the case where there is only one breeding male (or much higher variance in reproductive success in males than females). This choice is different from some previous studies (Clemente et al. 2018) and means that our models do not regard deviation from a 1:1 sex ratio as the only reason why the X-autosome ratio of Ne departs from the “null” value of 0.75. For example, it is possible for selection at linked sites to reduce diversity more substantially on the autosomes than on the X chromosome (Charlesworth 2012b). If this is combined with a female-biased sex ratio, the X-autosome ratio of Ne may go above 9/8. We used bootstrapping to access uncertainties in the parameter estimation. We assumed that all sites within a short intron were completely linked, and different short introns were unlinked. These assumptions should be reasonable because each short intron region is only 23 bp long, and we expect linkage disequilibrium to decay very rapidly in D. simulans (True et al. 1996). The results were obtained by analyzing 100 bootstrap samples. The bootstrap samples were generated by sampling the short introns with replacement, while keeping the numbers of X-linked and autosomal introns the same as in the real data set.

Software Availability

The models presented here have been implemented in a computer package varne, which can be downloaded from http://zeng-lab.group.shef.ac.uk/.

Supplementary Material

Supplementary data are available at Molecular Biology and Evolution online. Click here for additional data file.

60 in total

1. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations.

Authors: Gabor T Marth; Eva Czabarka; Janos Murvai; Stephen T Sherry
Journal: Genetics Date: 2004-01 Impact factor: 4.562

Review 2. Fundamental concepts in genetics: effective population size and patterns of molecular evolution and variation.

Authors: Brian Charlesworth
Journal: Nat Rev Genet Date: 2009-03 Impact factor: 53.242

Review 3. Making sense of genomic islands of differentiation in light of speciation.

Authors: Jochen B W Wolf; Hans Ellegren
Journal: Nat Rev Genet Date: 2016-11-14 Impact factor: 53.242

Review 4. Reanalysis suggests that genomic islands of speciation are due to reduced diversity, not reduced gene flow.

Authors: Tami E Cruickshank; Matthew W Hahn
Journal: Mol Ecol Date: 2014-06-17 Impact factor: 6.185

5. Effects of Linked Selective Sweeps on Demographic Inference and Model Selection.

Authors: Daniel R Schrider; Alexander G Shanku; Andrew D Kern
Journal: Genetics Date: 2016-09-07 Impact factor: 4.562

6. Determinants of synonymous and nonsynonymous variability in three species of Drosophila.

Authors: Penelope R Haddrill; Kai Zeng; Brian Charlesworth
Journal: Mol Biol Evol Date: 2010-12-29 Impact factor: 16.240

7. Differences in crossover frequency and distribution among three sibling species of Drosophila.

Authors: J R True; J M Mercer; C C Laurie
Journal: Genetics Date: 1996-02 Impact factor: 4.562

8. Large scale variation in the rate of germ-line de novo mutation, base composition, divergence and diversity in humans.

Authors: Thomas C A Smith; Peter F Arndt; Adam Eyre-Walker
Journal: PLoS Genet Date: 2018-03-28 Impact factor: 5.917

9. Accelerated genetic drift on chromosome X during the human dispersal out of Africa.

Authors: Alon Keinan; James C Mullikin; Nick Patterson; David Reich
Journal: Nat Genet Date: 2008-12-21 Impact factor: 38.330

10. Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data.

Authors: Ryan N Gutenkunst; Ryan D Hernandez; Scott H Williamson; Carlos D Bustamante
Journal: PLoS Genet Date: 2009-10-23 Impact factor: 5.917

4 in total

1. The germline mutational process in rhesus macaque and its implications for phylogenetic dating.

Authors: Lucie A Bergeron; Søren Besenbacher; Jaco Bakker; Jiao Zheng; Panyi Li; George Pacheco; Mikkel-Holger S Sinding; Maria Kamilari; M Thomas P Gilbert; Mikkel H Schierup; Guojie Zhang
Journal: Gigascience Date: 2021-05-05 Impact factor: 6.524

2. The estimates of effective population size based on linkage disequilibrium are virtually unaffected by natural selection.

Authors: Irene Novo; Enrique Santiago; Armando Caballero
Journal: PLoS Genet Date: 2022-01-25 Impact factor: 5.917

3. Screening of Genes Related to Sex Determination and Differentiation in Mandarin Fish (Siniperca chuatsi).

Authors: Cheng Yang; Liangming Chen; Rong Huang; Bin Gui; Yangyu Li; Yangyang Li; Yongming Li; Lanjie Liao; Zuoyan Zhu; Yaping Wang
Journal: Int J Mol Sci Date: 2022-07-12 Impact factor: 6.208

4. A Study of Faster-Z Evolution in the Great Tit (Parus major).

Authors: Kai Hayes; Henry J Barton; Kai Zeng
Journal: Genome Biol Evol Date: 2020-03-01 Impact factor: 3.416

4 in total