| Literature DB >> 31452861 |
Sven Gundlach1, Olaf Junge1, Lars Wienbrandt2, Michael Krawczak1, Amke Caliebe1.
Abstract
The evolutionary analysis of genetic data is an important subject of modern bioscience, with practical applications in diverse fields. Parameters of interest in this context include effective population sizes, mutation rates, population growth rates and the times to most recent common ancestors. Studying Y-chromosomal microsatellite data, in particular, has proven useful to unravel the recent patrilineal history of Homo sapiens populations. We compared the individual analysis options and technical details of four software tools that are widely used for this purpose, namely BATWING, BEAST, IMa2 and LAMARC, all of which use Bayesian coalescent-based Markov chain Monte Carlo (MCMC) methods for parameter estimation. More specifically, we simulated datasets for either eight or 20 hypothetical Y-chromosomal microsatellites, assuming a mutation rate of 0.0030 per generation and a constant or exponentially increasing population size, and used these data to evaluate the parameter estimation capacity of each tool. The datasets comprised between 100 and 1000 samples. In addition to runtime, the practical utility of the tools of interest can also be expected to depend critically upon the convergence behavior of the actual MCMC implementation. In fact, we found that runtime increased, and convergence rate decreased, with increasing sample size as expected. BATWING performed best with respect to runtime and convergence behavior, but only supports simple evolutionary models. As regards the spectrum of evolutionary models covered, and also in terms of cross-platform usability, BEAST provided the greatest flexibility. Finally, IMa2 and LAMARC turned out best to incorporate elaborate migration models in the analysis process.Entities:
Keywords: Bayesian inference; Coalescent; Convergence; Population genetics; Runtime
Year: 2019 PMID: 31452861 PMCID: PMC6700485 DOI: 10.1016/j.csbj.2019.07.014
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1Exemplary coalescent tree.
The coalescent tree featured connects six Y-chromosomal genetic profiles (g1, …, g6) and is constructed backward in time. Coalescent events are depicted as red stars, numbered in the order of their occurrence during the tree construction process. Small red horizontal lines mark the corresponding coalescent times, i.e. the times until two profiles, or groups of profiles, shared a most recent common ancestor. The earliest coalescent time is the time to the most recent common ancestor (TMRCA) of all profiles in the tree.
Functionalities of investigated tools.
| Functionality | Tool | |||
|---|---|---|---|---|
| BATWING | BEAST | IMa2 | LAMARC | |
| Data type | STR, SNP, K-Allele | DNA, STR, SNP | DNA, STR, SNP | DNA, STR, SNP, K-Allele |
| Linkage | Completely linked (haplotypes) | Mix of linked and unlinked markers | Mix of linked and unlinked markers | Mix of linked and unlinked markers |
| Population size model | Constant, exponential | Coalescent: constant, expansion, exponential, logistic; other: Yule, birth and death | Constant | Constant, exponential |
| Mutation model | Stepwise, K-allele | Various standard substitution models, custom XML specification | Infinite sites, HKY, stepwise, compound locus | Stepwise, K-allele, F84, GTR, brownian motion |
| Migration model | No migration after splitting events | No migration after splitting events | Gene flow: unconstrained | Gen flow: constant, symmetric, unconstrained |
| Inference framework | MCMC | MCMC | MCMC | MCMC, MLE |
| Prior distributions | Constant, uniform, normal, lognormal, gamma | Constant, CTMC, (infinite) uniform, normal, lognormal, (inverse) gamma, exponential, 1/x | Uniform | Uniform, original or logarithmic values |
| Output parameters | μ, Ne, α, β, TMRCA, likelihoods | Ne, α, μ, TMRCA, branch rates, likelihoods | θ, μ, migration rates | θ, α, migration rates, recombination rates, likelihoods |
| Special features | Molecular clock models, MCMC operators | Heating | Recombination models, multiple chains, heating | |
STR, short tandem repeat or microsatellite; SNP, single-nucleotide polymorphism; XML, extensible markup language; HKY, Hasegawa-Kishino-Yano; F84, Felsenstein 84 nucleotide model; GTR, general time-reversible model; MCMC, Markov chain Monte Carlo; MLE, maximum likelihood estimation; CTMC, continuous time Markov chain; TMRCA, time to the most recent common ancestor.
Technical details of investigated tools.
| Specific | Tool | |||
|---|---|---|---|---|
| BATWING | BEAST | IMa2 | LAMARC | |
| Evaluated version | V1.0 | V1.8.4 | V8.27.12 | V2.1.10 |
| End of development | Yes | No | No | No |
| Fork | R package rforensicbatwing | BEAST 2 | IMa2p | MIGRATE-N |
| Open version control system | No | Git | No | No |
| GUI | No | Yes | No | (Yes) |
| TUI | No | No | No | Yes |
| Batch mode | No | No | No | Yes |
| Parallel mode | No | Yes | No | No |
| GPU | No | (Yes) | No | No |
| Programming language | C | Java | C++ | C++ |
| Compiled binary for OS | Win, Unix, Mac | Java 1.6 | Win, Unix, Mac | Win, Unix, Mac |
| Source code available | Yes | Yes | Yes | Yes |
| Documentation | Manual | Manual, website, tutorial, forum, (book), workshops | Manual, how-to | Website, tutorials |
| License | GPLv2 | LGPL | GPLv2 | Apache license, version 2.0 |
| Special features | Simulation program SAMPLE | Supports Tracer | Data summary and visualization, live analysis updates, AC | Supports Tracer, run report in XML, input data converter |
GUI, graphical user interface; TUI, text-based user interface; GPU, graphics processing unit; OS, operating system; Win, Microsoft Windows; Mac, Macintosh; GPLv2, GNU General Public License Version 2; LGPL, Lesser GNU Public License.2; AC, autocorrelation; XML, extensible markup language.
Parameter estimation from simulated data (8 loci, constant population size).
| Sample size | Prior | Parameter | BATWING | BEAST | IMa2 | LAMARC |
|---|---|---|---|---|---|---|
| 100 | More inf | θ | 60 (45–75) | 62 (45–77) | 61 (45–77) | 60 (47–75) |
| μ | 0.0031 (0.0025–0.0035) | 0.0031 (0.0025–0.0036) | 0.0031 (0.0031–0.0032) | 0.0030 (0.0024–0.0038) | ||
| Less inf | θ | 61 (44–77) | 62 (44–80) | 61 (45–78) | 61 (45–76) | |
| μ | 0.0037 (0.0033–0.0041) | 0.0036 (0.0032–0.0041) | 0.0031 (0.0031–0.0032) | 0.0031 (0.0023–0.0038) | ||
| 500 | More inf | θ | 60 (53–69) | n.c. | 60 (53–69) | n.c. |
| μ | 0.0031 (0.0028–0.0033) | n.c. | 0.0030 (0.0030–0.0031) | n.c. | ||
| Less inf | θ | 60 (53–69) | n.c. | 60 (54–70) | n.c. | |
| μ | 0.0037 (0.0031–0.0043) | n.c. | 0.0030 (0.0030–0.0031) | n.c. | ||
| 1000 | More inf | θ | 60 (55–65) | n.c. | 62 (57–68) | n.c. |
| μ | 0.0030 (0.0029–0.0032) | n.c. | 0.0030 (0.0030–0.0030) | n.c. | ||
| Less inf | θ | 60 (54–65) | n.c. | 62 (57–68) | n.c. | |
| μ | 0.0037 (0.0031–0.0043) | n.c. | 0.0030 (0.0030–0.0030) | n.c. |
θ, scaled mutation rate; μ, mutation rate per generation and microsatellite locus; more inf, set of more informative priors; less inf, set of less informative priors, n.c., no convergence achieved when stopping the MCMC. For each of the simulated 100 datasets per combination of sample size and prior distribution set, the means of the posterior distributions of θ and μ were obtained. Given are the mean and 0.025 and 0.975 quantiles (in brackets) of these 100 means. The values used for simulation were θ = 60 and μ = 0.0030. Note that the MCMC chain length was reduced by a factor of 100 for LAMARC due to poor runtime behavior. For BATWING and BEAST, the original output was Y-effective population size Ne and μ; θ was calculated as θ = 2Neμ. IMa2 outputs θ and μ directly whilst for LAMARC only output θ is provided, so that μ was calculated from the correct value Ne = 10,000.
Parameter estimation from simulated data (8 loci, exponential population growth).
| Sample size | Prior | Parameter | BATWING | BEAST | LAMARC |
|---|---|---|---|---|---|
| 100 | More inf | θ | 60 (45–78) | 57 (43–74) | 67 (50–81) |
| μ | 0.0030 (0.0026–0.0034) | 0.0029 (0.0026–0.0033) | 0.0034 (0.0025–0.0040) | ||
| α | 0.0050 (0.0037–0.0065) | 0.0050 (0.0038–0.0067) | 0.0055 (0.0037–0.0076) | ||
| Less inf | θ | 63 (41–97) | 59 (38–88) | 69 (49–88) | |
| μ | 0.0034 (0.0029–0.0038) | 0.0035 (0.0031–0.0040) | 0.0034 (0.0025–0.0044) | ||
| α | 0.0056 (0.0039–0.0080) | 0.0062 (0.0040–0.0090) | 0.0056 (0.0039–0.0082) | ||
| 500 | More inf | θ | 60 (52–70) | n.c. | n.c. |
| μ | 0.0030 (0.0028–0.0032) | n.c. | n.c. | ||
| α | 0.0050 (0.0038–0.0066) | n.c. | n.c. | ||
| Less inf | θ | 60 (51–72) | n.c. | n.c. | |
| μ | 0.0033 (0.0028–0.0038) | n.c. | n.c. | ||
| α | 0.0056 (0.0041–0.0072) | n.c. | n.c. | ||
| 1000 | More inf | θ | 60 (53–66) | n.c. | n.c. |
| μ | 0.0030 (0.0028–0.0032) | n.c. | n.c. | ||
| α | 0.0050 (0.0039–0.0059) | n.c. | n.c. | ||
| Less inf | θ | 60 (52–68) | n.c. | n.c. | |
| μ | 0.0033 (0.0029–0.0038) | n.c. | n.c. | ||
| α | 0.0056 (0.0042–0.0069) | n.c. | n.c. |
α, growth rate per generation; for further details, see legend to Table 3.
Parameter estimation from simulated data (20 loci, constant population size).
| Sample size | Prior | Parameter | BATWING | BEAST | LAMARC |
|---|---|---|---|---|---|
| 100 | Less inf | θ | 61 (50–74) | 63 (51–75) | 61 (50–73) |
| μ | 0.0037 (0.0032–0.0043) | 0.0036 (0.0032–0.0040) | 0.0031 (0.0025–0.0037) |
For details, see legend to Table 3.
Fig. 2Convergence behavior for .
, estimated scaled mutation rate; lMCMC, length of the evaluated Markov chain.
Ten randomly chosen datasets with a constant population size of 1000 were analyzed for a chain length of 220 million and the set of less informative priors. For each analysis, parameter estimates of θ were derived after a different number of iterations, lMCMC, of the Markov chain, e.g. after lMCMC = 110,000, 220,000, 2.2 million or 220 million iterations of the Markov chain. Since no intermediate parameter values were provided by IMa2 during chain runs, multiple runs of the Markov chain had to be carried out for this tool. For LAMARC the maximal chain length was limited by 4.4 million due to poor runtime behavior. For each of the 10 datasets, the means of the posterior distribution of θ were obtained after different numbers lMCMC of iterations. Shown are the means of these 10 means for each tool. The correct value, i.e. the value used for simulation, was θ = 60 (dashed horizontal line). For BATWING and BEAST, was calculated from θ = 2Neμ whereas, for IMa2 and LAMARC, was obtained directly. Note that both axes have logarithmic scale. For exact values of the means and maximal and minimal values of and for estimates for Ne and μ, see Supplementary Table S4.
Parameter estimation from simulated data, using incorrect priors (8 loci, constant population size).
| Parameter (correct value) | MF | BATWING | BEAST | IMa2 | LAMARC | |
|---|---|---|---|---|---|---|
| θ (60) | 1/2 | Prior | – | – | U(0,111) | U(0,111) |
| Estimate | 60 (44–76) | 62 (44–79) | 61 (45–79) | 60 (45–75) | ||
| 1/10 | Prior | – | – | U(0,22) | U(0,22) | |
| Estimate | 59 (43–74) | 60 (43–76) | 22 (22−22) | 22 (22–22) | ||
| μ (0.003) | 1/2 | Prior | Γ(1,0.003) | Γ(1,0.003) | – | – |
| Estimate | 0.0049 (0.0043–0.0054) | 0.0049 (0.0042–0.0055) | – | – | ||
| 1/10 | Prior | Γ(1,0.003) | Γ(1,0.003) | – | – | |
| Estimate | 0.0101 (0.0087–0.0112) | 0.0101 (0.0086–0.0113) | – | – | ||
| Ne (10,000) | 1/2 | Prior | Γ(1,5000) | Γ(1,5000) | – | – |
| Estimate | 8213 (7283–9261) | 8414 (7281–9312) | – | – | ||
| 1/10 | Prior | Γ(1,1000) | Γ(1,1000) | – | – | |
| Estimate | 3356 (2923–3733) | 3409 (2955–3801) | – | – |
θ, scaled mutation rate; μ, mutation rate per generation per microsatellite locus; Ne, Y-effective population size; correct value: parameter value used in the underlying simulations; Γ(a,b), gamma distribution with shape a and scale b (the mean is for a = 1 equal to the scale); U(a,b), uniform distribution with boundaries a and b; MF, misspecification factor. Simulations were performed for a sample size of 100. For each of the simulated 100 datasets per prior distribution, the means of the posterior distributions of θ, μ and Ne were obtained. Given are the mean and 0.025 and 0.975 quantiles (in brackets) of these 100 means. MF values of 1/2 and 1/10 correspond to means of 5000 and 1000, respectively, of the priors for Ne (instead of the correct value of 10,000) and to means of 30 and 6, respectively, of the priors for θ (instead of the correct value of 60). Note that the MCMC chain length was reduced by a factor of 100 for LAMARC due to poor runtime behavior. For BATWING and BEAST, the original output comprised Ne and μ; θ was calculated as θ = 2Neμ. IMa2 and LAMARC output θ directly.
Fig. 3Runtime under a constant population size model.
BEAST_s, sequential version of BEAST; BEAST_p, parallel version of BEAST.
The analyses were performed on a 16-core computer (2x Xeon(R) E5-2620 v4/2.1 GHz) with 64 GB DRAM running Linux OS. For LAMARC, the chain length was reduced by a factor of 100 due to poor runtime behavior. Runtimes obtained for the less and the more informative sets of prior distributions were merged in the figure because they were barely distinguishable. Box plots are based upon the results obtained from 200 data sets, except for BEAST_p where only 10 datasets underlie each box plot. Note that the Y-axis has a logarithmic scale.