Literature DB >> 32365209

mstree: A Multispecies Coalescent Approach for Estimating Ancestral Population Size and Divergence Time during Speciation with Gene Flow.

Junfeng Liu1, Qiao Liu2, Qingzhu Yang1,2.   

Abstract

Gene flow between species may cause variations in branch length and topology of gene tree, which are beyond the expected variations from ancestral processes. These additional variations make it difficult to estimate parameters during speciation with gene flow, as the pattern of these additional variations differs with the relationship between isolation and migration. As far as we know, most methods rely on the assumption about the relationship between isolation and migration by a given model, such as the isolation-with-migration model, when estimating parameters during speciation with gene flow. In this article, we develop a multispecies coalescent approach which does not rely on any assumption about the relationship between isolation and migration when estimating parameters and is called mstree. mstree is available at https://github.com/liujunfengtop/MStree/ and uses some mathematical inequalities among several factors, which include the species divergence time, the ancestral population size, and the number of gene trees, to estimate parameters during speciation with gene flow. Using simulations, we show that the estimated values of ancestral population sizes and species divergence times are close to the true values when analyzing the simulation data sets, which are generated based on the isolation-with-initial-migration model, secondary contact model, and isolation-with-migration model. Therefore, our method is able to estimate ancestral population sizes and speciation times in the presence of different modes of gene flow and may be helpful to test different theories of speciation.
© The Author(s) 2020. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Entities:  

Keywords:  coalescent; gene tree; mathematical inequalities

Mesh:

Year:  2020        PMID: 32365209      PMCID: PMC7259675          DOI: 10.1093/gbe/evaa087

Source DB:  PubMed          Journal:  Genome Biol Evol        ISSN: 1759-6653            Impact factor:   3.416


Introduction

The role of gene flow in speciation is a fundamental issue in evolutionary biology. Allopatric speciation considers complete lack of gene flow as prerequisite to the formation of new species. However, parapatric and sympatric speciation allow gene flow during speciation. Although allopatric speciation has been historically taken as the paramount mode of speciation (Futuyma and Mayer 1980), theoretical modeling and empirical evidence increasingly support that speciation can occur with gene flow (Gourbiere and Mallet 2010; Smadja and Butlin 2011; Feder et al. 2012). There are usually two kinds of models to make inferences about gene flow during speciation. Some methods are based on an isolation-with-migration (IM) model (Wang and Hey 2010; Tian and Kubatko 2016; Dalquen et al. 2017) and others on an isolation-with-initial-migration (IIM) model (Mailund et al. 2012; Costa and Wilkinson-Herbots 2017). However, the above two models include an assumption about the relationship between isolation and migration. Here, we use the properties of coalescent-based model in gene tree data for estimating the important parameters such as ancestral population sizes and divergence times without any assumption of the relationship between isolation and migration. Furthermore, we conduct simulations to examine the accuracy of the estimates of parameters. The simulation results show that our method can accurately estimate the parameters. At last, we compared mstree with the program 3s (Dalquen et al. 2017) and IMa3 (Hey et al. 2018) with simulation data; the simulation results show that mstree is faster than 3s and IMa3.

Materials and Methods

The Theoretical Model

Consider two closely related species (1 and 2) with an outgroup species 3. We assume that there is only gene flow between two closely related species (fig. 1). We use and to denote the two species divergence times, scaled by mutation rate. Let and measure the two ancestral species population sizes. Here, is the mutation rate per site and generation, and the and denote the effective population sizes. There are five possible gene trees for a locus with three sequences (k, l, and m), which are from species 1, species 2, and species 3, respectively (fig. 2). For any locus, is the coalescent time among three sequences and is the coalescent time between two sequences. In the presence of gene flow between species 1 and species 2, the gene tree G1 may be possible at a locus. Otherwise, only gene trees G2–G5 are possible. In this study, the term “loci” refers to independent or loosely linked short segments of the genome, and we assume that there is no recombination within a locus while different loci are free recombining. For tens of thousands of loci, there are some mathematical inequalities among the species divergence time, the ancestral species population size, and the number of gene trees, of which is larger than . Based on the coalescent theory with no gene flow under given species, the probability of gene tree, of which belongs to , is ; and the probability of gene tree, of which belongs to and is less than , is . We use g([a, b],[c, d]) as the number of gene trees with category G (fig. 2) and with is in [a, b] and in [c, d] for i = 1, 2, …, 5. Moreover, to simplify notation, let g denote the number of gene trees with category G (fig. 2) for i = 1, …, 5. Then, the formulas of cases A and C are as follows:
. 1.

—Species tree ((1, 2), 3) for three species. The species divergence times are denoted as and . The ancestral species population sizes are denoted as and .

. 2.

—For three species (1–3) with gene flow between species 1 and 2, there are five categories of gene trees for any locus with three sequences (k, l, and m), which are from species 1, species 2, and species 3, respectively.

Case A: If , then for the category G2–G5 (fig. 2). Case B: If , then for the category G1–G2 (fig. 2). Case C: for the category G2–G5 (fig. 2). —Species tree ((1, 2), 3) for three species. The species divergence times are denoted as and . The ancestral species population sizes are denoted as and . —For three species (1–3) with gene flow between species 1 and 2, there are five categories of gene trees for any locus with three sequences (k, l, and m), which are from species 1, species 2, and species 3, respectively. The above gene tree distributions allow us to compute the parameters , , , and . The approach of estimating the parameters is called mstree and the strategies are as follows. First, we can estimate the value of based on the fact that the shape of gene tree may be ((k, l),m), ((k, m),l) or ((l, m),k) with equal probability when is larger than . If there exists that satisfies , where is the number of gene trees, of which is larger than and the shape is ((k, l),m) (fig. 2: G3); is the number of gene trees, of which the shape is ((k, m),l) (fig. 2: G4); and is the number of gene trees, of which the shape is ((l, m),k) (fig. 2: G5), can be considered as the estimated value of . Second, we can estimate the value of based on the estimated value of by using the formula in case B. When we choose and that satisfy , the estimated value of approximates , where is the number of gene trees, of which is larger than and is less than ; is the number of gene trees, of which belong to and is less than . Similarly, we can also estimate the value of based on the estimated value of by using the formula in case A. If we choose that is less than and assume that is larger than when is closed to , the estimated value of approximates , where is the number of gene trees, of which is larger than and is the number of gene trees, of which belongs to . Lastly, we estimate the value of based on the values of and by using the formula in case C. If there exists that is less than and satisfies , where is the number of gene trees, of which is larger than and the shape is ((k, l),m) (fig. 2: G2–G3); is the number of gene trees, of which the shape is ((k, m),l) (fig. 2: G4); and is the number of gene trees, of which the shape is ((l, m),k) (fig. 2: G5), can be considered as the estimated value of .

The Simulation

We simulated gene trees by using the program ms (Hudson 2002) and converted gene trees to sequence data under JC69 model by using seq-gen (Rambaut and Grassly 1997). The example of command is as follows: ./ms id="465" 3 50000 -T -I 3 1 1 1 -m 1 2 0 -m 2 1 0 -em 0.667 1 2 4 -em 0.667 2 1 4 -em 1 1 2 0 -em 1 2 1 0 -ej 1 2 1 -ej 2 3 1 | tail -n + 4 | grep -v//> tree ./seq-gen -m HKY -l 500 -s 0.01 -t 2.0 < tree > infile

Compared with IMa3 and Analyzed Real Data

The model estimated by IMa3 is IM model, and we used a fixed true species topology for IMa3. The real data are the genomic sequences of the human (H), chimpanzee (C), and gorilla (G) from Burgess and Yang (2008). The data set comprises 14,663 autosomal loci, and the mean locus length is 508 bp.

Results

The Accuracy of mstree

We used the program ms (Hudson 2002) to simulate gene trees at multi loci under the IIM, secondary contact (SC), and IM model. For the IIM model, the gene flow stopped at in the past; For the SC model, the time of SC began at in the past. Two sets of parameter values were used, roughly based on estimates from the hominoids (Burgess and Yang 2008) and the mangroves (Zhou et al. 2007). They are as follows: , , and (hominoids); , , and (mangroves). For the three models, gene flow is symmetrical and the migration rate (the expected number of migrants per generation) is 1. The number of loci is 10,000 and the number of replicates is 1,000. Analyzing the simulation data by using mstree, the results show that the parameter estimates are very close to the true values and are not sensitive to the model’s assumption about the relationship between isolation and migration (table 1). in table 1 is the threshold value in mstree and describes the degree of approximation between two sides of the formulas in cases A, B, and C. For example, in case A, means the value of should be <0.03 when . In mstree, the value of must be <0.05. The results in table 1 show that the smaller increased the standard deviations of the parameter estimates and the estimate of . Therefore, we suggest that the value of should be 0.03 when using mstree. Furthermore, we applied mstree to additional two parameter sets (Dalquen et al. 2017), which have larger parameter values and different values for two (table 2). The results show that mstree still performs well.
Table 1

The Estimated Species Divergence Time and Population Size with Different Threshold Value

ThresholdHominoid
Mangrove
θ0 θ1 τ0 τ1 θ0 θ1 τ0 τ1
IIM model ε = 0.0070.50 ± 0.020.49 ± 0.040.60 ± 0.000.42 ± 0.041.00 ± 0.060.97 ± 0.112.00 ± 0.011.25 ± 0.29
ε = 0.010.50 ± 0.020.49 ± 0.030.60 ± 0.000.41 ± 0.031.00 ± 0.050.98 ± 0.072.00 ± 0.011.14 ± 0.22
ε = 0.03 0.50 ± 0.02 0.50 ± 0.01 0.60 ± 0.00 0.39 ± 0.01 1.00 ± 0.02 0.99 ± 0.02 2.00 ± 0.01 0.99 ± 0.05
SC model ε = 0.0070.50 ± 0.020.48 ± 0.050.60 ± 0.000.43 ± 0.051.00 ± 0.060.95 ± 0.142.00 ± 0.011.31 ± 0.33
ε = 0.010.50 ± 0.010.49 ± 0.040.60 ± 0.000.42 ± 0.031.00 ± 0.040.97 ± 0.092.00 ± 0.011.21 ± 0.27
ε = 0.03 0.50 ± 0.01 0.50 ± 0.02 0.60 ± 0.00 0.39 ± 0.01 1.00 ± 0.02 0.99 ± 0.03 2.00 ± 0.01 1.01 ± 0.07
IM model ε = 0.0070.50 ± 0.020.48 ± 0.060.60 ± 0.000.43 ± 0.051.00 ± 0.060.95 ± 0.152.00 ± 0.011.34 ± 0.34
ε = 0.010.50 ± 0.010.49 ± 0.040.60 ± 0.000.41 ± 0.041.00 ± 0.040.96 ± 0.112.00 ± 0.011.22 ± 0.28
ε = 0.03 0.50 ± 0.01 0.50 ± 0.02 0.60 ± 0.00 0.39 ± 0.01 1.00 ± 0.02 0.99 ± 0.03 2.00 ± 0.01 1.00 ± 0.07

Note.—The hominoid set is θ0 = θ1 = 0.005, τ0 = 0.006, and τ1 = 0.004. The mangrove set is θ0 = θ1 = 0.01, τ0 = 0.02, and τ1 = 0.01. θ and τ estimates are scaled by 102. Gene flow is symmetrical and the migration rate is 1. ε is the threshold value in mstree. The number of loci is 10,000. The number of replicates is 1,000. IIM, isolation-with-initial-migration; SC, secondary contact; IM, isolation-with-migration. The best estimates are marked in bold.

Table 2

The Estimated Species Divergence Time and Population Size with Larger and Different Parameter Values

Threshold θ 0 = 0.02, θ1 = 0.03, τ0 = 0.06, τ1 = 0.04
θ 0 = 0.02, θ1 = 0.01, τ0 = 0.02, τ1 = 0.01
θ0 θ1 τ0 τ1 θ0 θ1 τ0 τ1
IIM model ε = 0.032.00 ± 0.082.97 ± 0.105.99 ± 0.033.90 ± 0.172.00 ± 0.061.00 ± 0.022.00 ± 0.011.00 ± 0.03
SC model ε = 0.032.01 ± 0.062.95 ± 0.175.99 ± 0.034.03 ± 0.192.00 ± 0.061.00 ± 0.022.00 ± 0.021.00 ± 0.04
IM model ε = 0.032.00 ± 0.052.94 ± 0.295.98 ± 0.053.89 ± 0.462.00 ± 0.051.00 ± 0.022.00 ± 0.021.00 ± 0.04

Note.—θ and τ estimates are scaled by 102. Gene flow is symmetrical and the migration rate is 1. ε is the threshold value in mstree. The number of loci is 10,000. The number of replicates is 1,000. IIM, isolation-with-initial-migration; SC, secondary contact; IM, isolation-with-migration.

The Estimated Species Divergence Time and Population Size with Different Threshold Value Note.—The hominoid set is θ0 = θ1 = 0.005, τ0 = 0.006, and τ1 = 0.004. The mangrove set is θ0 = θ1 = 0.01, τ0 = 0.02, and τ1 = 0.01. θ and τ estimates are scaled by 102. Gene flow is symmetrical and the migration rate is 1. ε is the threshold value in mstree. The number of loci is 10,000. The number of replicates is 1,000. IIM, isolation-with-initial-migration; SC, secondary contact; IM, isolation-with-migration. The best estimates are marked in bold. The Estimated Species Divergence Time and Population Size with Larger and Different Parameter Values Note.—θ and τ estimates are scaled by 102. Gene flow is symmetrical and the migration rate is 1. ε is the threshold value in mstree. The number of loci is 10,000. The number of replicates is 1,000. IIM, isolation-with-initial-migration; SC, secondary contact; IM, isolation-with-migration.

The Factors That Influence Parameter Estimates

In addition, we performed more simulations to test how different factors influence parameter estimates, such as the number of loci, migration rate, and the direction of migration. The numbers of loci are 5,000, 10,000, and 50,000; the migration rates are 0.1, 1, and 10; and the directions of migration are symmetrical and asymmetrical. The results are shown in supplementary tables S1–S6, Supplementary Material online. For the parameters , , and , the results show that larger number of loci makes the estimates more accurate and the estimates are not sensitive to the model’s assumption, migration rate, and the direction of migration. For the parameter , we have the same conclusion except for the case that migration rate is 10. When migration rate is 10, asymmetrical gene flow decreases the accuracy of estimate. This indicates that the estimate of is sensitive to the direction of migration with large migration rate (supplementary fig. S1, Supplementary Material online). The examples of above simulation commands are in the supplementary file S2, Supplementary Material online.

Compared with 3s and IMa3

The input file of mstree is gene tree, which is in Newick format, and the gene tree can be estimated from the observed sequence alignments where there must be three sequences, with one sequence from each species, at each locus. Therefore, we need a program, such as PHYLIP, to infer gene trees when applying mstree to experimental data. Because inference of gene trees is associated with error and uncertainty, we did some simulations to investigate the effect of the gene tree uncertainty (supplementary table S7, Supplementary Material online). We used program ms and seq-gen (Rambaut and Grassly 1997) to generate sequence data under JC69 model and used program dnamlk in PHYLIP package to infer gene trees. Although there has been some decline in the accuracy of parameter estimates because of the inferred error of gene trees, the estimates of mstree are still near to the true values and not sensitive to the model’s assumption. Comparing mstree with the program 3s (Dalquen et al. 2017) and IMa3 (Hey et al. 2018), mstree is faster than 3s and IMa3 (supplementary table S7, Supplementary Material online). Although 3s and IMa3 performed very well on some parameter estimates, the τ1 estimates of 3s and τ0 estimates of IMa3 were very poor.

Robustness of mstree and Analyzing Real Data

Though our method is not affected by gene flow between the sister species, our method assumes that there is no gene flow between the ingroup and the outgroup. Therefore, we examined the robustness of our method in the presence of gene flow between the ingroup and the outgroup. The results are shown in supplementary table S8, Supplementary Material online. Our method is robust to the simulations based on IIM model between the ingroup and the outgroup. For the simulations based on SC and IM model between the ingroup and the outgroup, the accuracy of parameter estimates is on the decline. At last, we apply mstree to the genomic sequences of the human (H), chimpanzee (C), and gorilla (G) (Burgess and Yang 2008). The estimates of parameters are similar to those of Burgess and Yang (2008), but the estimate of τHC is slightly higher (supplementary table S9, Supplementary Material online). In order to quantify uncertainty in the estimates obtained, we resort to bootstrapping with 100 replicates. The averages and the standard errors of estimates are as follows: , , , and .

Discussion

Supplementary table S8, Supplementary Material online, shows the performance of mstree in the presence of gene flow with species 3. Under the influence of gene flow between the ingroup and the outgroup, τ0 was underestimated and was closed to the time that gene flow stopped except for IIM model. When τ0 was underestimated, the estimates of other parameters were far away from the true value. Burgess and Yang (2008) estimated divergence times under the assumption of no gene flow. However, Zhu and Yang (2012) applied the test based on SIM3s model to a humanchimpanzeegorilla genomic data and the test results suggested gene flow around the time of speciation of human and chimpanzee. Compared with the estimated divergence times from Burgess and Yang (2008), the analysis from mstree suggested migrations between sister species. In addition, there are two significant differences between mstree and COALGF (Tian and Kubatko 2016), which describes the distribution of coalescent histories under the coalescent model with gene flow: 1) mstree uses the coalescent history distribution under coalescent model without gene flow to infer model parameters based on summary statistics; however, COALGF computes probabilities of gene tree histories given species trees under the coalescent process with gene flow and the results obtained from COALGF may be used to infer model parameters based on a maximum likelihood framework. 2) mstree does not make any assumption about the mode of gene flow between sister taxa; however, COALGF assumes that the mode of gene flow between sister taxa is IM. To summarize, we propose a multispecies coalescent approach, mstree, for estimating the parameters during speciation with gene flow. Theoretically, our method does not rely on any assumption about the relationship between isolation and migration. Furthermore, the simulation results demonstrate that mstree can accurately estimate species divergence time and ancestral population size regardless of IIM model, SC model, or IM model. Click here for additional data file.
  14 in total

1.  Generating samples under a Wright-Fisher neutral model of genetic variation.

Authors:  Richard R Hudson
Journal:  Bioinformatics       Date:  2002-02       Impact factor: 6.937

Review 2.  The genomics of speciation-with-gene-flow.

Authors:  Jeffrey L Feder; Scott P Egan; Patrik Nosil
Journal:  Trends Genet       Date:  2012-04-18       Impact factor: 11.639

3.  Maximum likelihood implementation of an isolation-with-migration model with three species for testing speciation with gene flow.

Authors:  Tianqi Zhu; Ziheng Yang
Journal:  Mol Biol Evol       Date:  2012-04-13       Impact factor: 16.240

Review 4.  A framework for comparing processes of speciation in the presence of gene flow.

Authors:  Carole M Smadja; Roger K Butlin
Journal:  Mol Ecol       Date:  2011-11-09       Impact factor: 6.185

5.  Population genetics of speciation in nonmodel organisms: I. Ancestral polymorphism in mangroves.

Authors:  Renchao Zhou; Kai Zeng; Wei Wu; Xiaoshu Chen; Ziheng Yang; Suhua Shi; Chung-I Wu
Journal:  Mol Biol Evol       Date:  2007-09-28       Impact factor: 16.240

6.  Are species real? The shape of the species boundary with exponential failure, reinforcement, and the "missing snowball".

Authors:  Sébastien Gourbière; James Mallet
Journal:  Evolution       Date:  2009-09-23       Impact factor: 3.694

7.  Distribution of coalescent histories under the coalescent model with gene flow.

Authors:  Yuan Tian; Laura S Kubatko
Journal:  Mol Phylogenet Evol       Date:  2016-09-07       Impact factor: 4.286

8.  Maximum Likelihood Implementation of an Isolation-with-Migration Model for Three Species.

Authors:  Daniel A Dalquen; Tianqi Zhu; Ziheng Yang
Journal:  Syst Biol       Date:  2017-05-01       Impact factor: 15.683

9.  Estimation of hominoid ancestral population sizes under bayesian coalescent models incorporating mutation rate variation and sequencing errors.

Authors:  Ralph Burgess; Ziheng Yang
Journal:  Mol Biol Evol       Date:  2008-07-04       Impact factor: 16.240

10.  Estimating divergence parameters with small samples from a large number of loci.

Authors:  Yong Wang; Jody Hey
Journal:  Genetics       Date:  2009-11-16       Impact factor: 4.562

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.