
Mlcoalsim: multilocus coalescent simulations.

Sebastian E Ramos-Onsins, Thomas Mitchell-Olds.

Abstract

Coalescent theory is a powerful tool for population geneticists as well as molecular biologists interested in understanding the patterns and levels of DNA variation. Using coalescent Monte Carlo simulations it is possible to obtain the empirical distributions for a number of statistics across a wide range of evolutionary models; these distributions can be used to test evolutionary hypotheses using experimental data. The mlcoalsim application presented here (based on a version of the ms program, Hudson, 2002) adds important new features to improve methodology (uncertainty and conditional methods for mutation and recombination), models (including strong positive selection, finite sites and heterogeneity in mutation and recombination rates) and analyses (calculating a number of statistics used in population genetics and P-values for observed data). One of the most important features of mlcoalsim is the analysis of multilocus data in linked and independent regions. In summary, mlcoalsim is an integrated software application aimed at researchers interested in molecular evolution. mlcoalsim is written in ANSI C and is available at: http://www.ub.es/softevol/mlcoalsim.


Keywords: Coalescent simulations; Multilocus analyses; Neutrality tests; Population Genetics; Rejection algorithm

Year: 2007 PMID: 19430603 PMCID: PMC2674636

Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625

Introduction

Statistical inference of molecular population data under different evolutionary models typically employs a coalescent framework (Kingman, 1982a,b; Hudson, 1990; Donnelly and Tavaré, 1995; Nordborg, 2001). Hudson’s ms application (Hudson, 2002) enabled a large number of population geneticists and molecular biologists to examine data under different evolutionary models. In recent years, a number of coalescent programs focused on the generation of genetic data have been published (e.g. SimCoal, Excoffier et al. 2000; Laval and Excoffier, 2004; SelSim, Spencer and Coop, 2004; CoaSim, Mailund et al. 2005; FastCoal, Marjoram and Wall, 2005). Nevertheless, multilocus data obtained by high-throughput techniques (e.g. the Drosophila Polymorphisms Sequencing Project, as well as smaller projects such as those described by Akey et al. 2004; Schmid et al. 2005) are not easily analyzed using available software. Here we describe the mlcoalsim software application which, unlike other available tools, generates simulated genetic data, calculates descriptive statistics for a large number of loci under different evolutionary models, and obtains P-values for observed data.

Program Overview

mlcoalsim enables researchers to compare single and multilocus data with several common evolutionary models. It is an integrated application that not only constructs coalescent trees and sequences but also calculates a number of summary statistics that are useful for the examination of evolutionary hypotheses. The program is designed to generate within-species genetic data; that is, the level of nucleotide variation should not be too high (a maximum of approximately 5%) in order to avoid important errors; for higher levels, a more sophisticated substitution model should be used. For the same reason, the level of divergence from an outgroup species should be no greater than 10–15%.
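Comparing observed data with a model comes down to locating an observed statistic within the distribution of the same statistic across simulated replicates. The following Python sketch illustrates that idea only; the function name `empirical_p_value`, the choice of a lower-tail probability and the example values are assumptions made for illustration and are not taken from mlcoalsim.

```python
from bisect import bisect_right


def empirical_p_value(observed, simulated):
    """Lower-tail empirical probability P(simulated <= observed)."""
    if not simulated:
        raise ValueError("need at least one simulated value")
    ordered = sorted(simulated)
    # bisect_right counts how many simulated values are <= observed.
    return bisect_right(ordered, observed) / len(ordered)


# Hypothetical values of one statistic from 10 simulated replicates.
simulated_values = [-1.8, -1.1, -0.7, -0.3, 0.0, 0.2, 0.5, 0.9, 1.3, 2.1]
p_low = empirical_p_value(-1.5, simulated_values)
```

In practice many more replicates would be used, and whether a lower-tail, upper-tail or two-tailed probability is appropriate depends on the hypothesis being tested.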

Multilocus analyses

One of the main features of mlcoalsim is the generation of DNA samples and the calculation of a number of statistical tests for a set of multiple loci with variable levels of intragenic recombination. There are two options for multilocus analysis: using independent (unlinked) loci, or using a single long region separated into several fragments. The first option (independent loci) allows the independent analysis of each locus and the calculation of summary statistics (the average and variance of each statistic across all loci). This option is useful for contrasting data with demographic models that would affect the entire genome. For such an analysis, a correction factor for population size that depends on the chromosomal location of each locus (e.g. autosomal or sex-linked) is needed. The second option (linked loci) generates samples for an entire linked region and calculates statistics for specified fragments within this region or for a sliding window analysis. The “linked” option is useful for evolutionary processes that affect only specific regions, such as a selective sweep in a recombining region.
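The independent-loci option can be pictured with a short Python sketch. Everything named here is hypothetical: the correction factors assume a diploid species with an equal sex ratio (X-linked loci at 3/4 and Y-linked loci at 1/4 of the autosomal effective size), the locus names and statistic values are made up, and the sample variance is used without knowing which variance estimator mlcoalsim reports.

```python
from statistics import mean, variance

# Assumed relative effective population sizes (diploid, equal sex ratio).
NE_FACTOR = {"autosomal": 1.0, "x_linked": 0.75, "y_linked": 0.25}


def scaled_theta(theta_autosomal, location):
    """Scale an autosomal population mutation parameter by chromosomal location."""
    return theta_autosomal * NE_FACTOR[location]


def multilocus_summary(values):
    """Average and sample variance of one statistic across independent loci."""
    return mean(values), variance(values)


# Hypothetical loci: (name, chromosomal location, value of one statistic).
loci = [
    ("locusA", "autosomal", -0.4),
    ("locusB", "x_linked", 0.6),
    ("locusC", "autosomal", 0.1),
]
thetas = {name: scaled_theta(4.0, location) for name, location, _ in loci}
avg, var = multilocus_summary([value for _, _, value in loci])
```

Here each locus would be simulated with its own scaled parameter, and the per-locus statistics are then reduced to an average and a variance.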

Uncertainty in mutation and recombination rates

Mutation and recombination rates are critical parameters that are usually unknown. To account for the uncertainty in these two parameters, mlcoalsim can sample the rates from a distribution (uniform or gamma) instead of using a fixed value. In addition, mlcoalsim can generate samples conditional on observed values: the number of segregating sites, the minimum number of recombination events and (optionally) the number of haplotypes. This last option uses rejection method 2 of Tavaré et al. (1997). Posterior distributions of the population mutation parameter and of the population recombination parameter are recorded.
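The general idea of conditioning on an observed value by rejection can be sketched in Python as follows. This is a generic rejection sampler written for illustration, not the specific method of Tavaré et al. (1997) and not mlcoalsim's code: it assumes a uniform prior on the population mutation parameter, a neutral infinite-sites coalescent without recombination, and conditioning on the number of segregating sites only. All function and parameter names are hypothetical.

```python
import random


def simulate_segregating_sites(n, theta, rng):
    """Draw the number of segregating sites S for n sequences.

    While k lineages remain, the next event is a mutation with probability
    theta / (theta + k - 1) and a coalescence otherwise.
    """
    s = 0
    for k in range(n, 1, -1):
        p_mutation = theta / (theta + k - 1)
        while rng.random() < p_mutation:
            s += 1
    return s


def rejection_posterior(n, s_obs, theta_low, theta_high, draws, seed=1):
    """Keep the theta values whose simulated S equals the observed S."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(draws):
        theta = rng.uniform(theta_low, theta_high)  # draw from the prior
        if simulate_segregating_sites(n, theta, rng) == s_obs:
            accepted.append(theta)
    return accepted


posterior = rejection_posterior(n=10, s_obs=5, theta_low=0.1,
                                theta_high=10.0, draws=20000)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted values form a sample from the posterior distribution of the mutation parameter given the observed number of segregating sites; the same accept/reject step extends to other observed quantities at the cost of a lower acceptance rate.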

Heterogeneity in mutation and recombination rate across the sequence

mlcoalsim can also take into account differences in the mutation and recombination rates across the studied region. Heterogeneity is modelled with a gamma distribution, which covers the range from extreme hotspot regions (e.g. when heterogeneity is applied to the recombination rate, only a few positions are able to recombine while the others cannot) to uniform values for all positions. Furthermore, it is possible to fix the average number of invariant positions (positions that cannot mutate) for the studied region.
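A minimal Python sketch of gamma-distributed rate heterogeneity is given below. It assumes a gamma distribution with mean 1 and shape `alpha`, where a small `alpha` concentrates the rate in a few positions and a large `alpha` approaches uniform rates, and it marks invariant positions independently with probability `invariant_fraction`. These names and the exact parameterization are illustrative assumptions, not mlcoalsim's implementation.

```python
import random


def site_rate_multipliers(n_sites, alpha, invariant_fraction=0.0, seed=1):
    """Relative per-site rates, rescaled so that their average is 1."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sites):
        if rng.random() < invariant_fraction:
            rates.append(0.0)  # invariant position: cannot mutate
        else:
            # gammavariate(shape, scale); mean = shape * scale = 1
            rates.append(rng.gammavariate(alpha, 1.0 / alpha))
    total = sum(rates)
    if total == 0.0:
        raise ValueError("all positions are invariant")
    return [r * n_sites / total for r in rates]


hotspots = site_rate_multipliers(1000, alpha=0.05)     # few very hot positions
near_uniform = site_rate_multipliers(1000, alpha=100)  # almost equal rates
with_invariants = site_rate_multipliers(1000, alpha=1.0, invariant_fraction=0.3)
```

Multiplying a region-wide rate by these values gives a per-position rate; the same construction applies to either mutation or recombination.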

Evolutionary models

mlcoalsim includes the following evolutionary models: the neutral stationary panmictic model, the finite island model, models with changing population sizes over time, refugia models and deterministic positive selection (not all of these models can be used simultaneously). mlcoalsim allows the use of neutral and positive selection models for different independent loci, and changing population size can also be combined with a finite island model.

Statistics

A number of statistics and related tests used in population genetics are displayed in the output (Table 1). The statistics incorporated in this program describe the level and patterns of diversity for a given sample.

Table 1.

List of the main statistics included in mlcoalsim.

Name	Statistic	Citation
TD	Tajima’s D test	Tajima, 1989
Fs	Fu’s Fs test	Fu, 1997
FD*	Fu and Li’s D* test	Fu and Li, 1993
FF*	Fu and Li’s F* test	Fu and Li, 1993
FD	Fu and Li’s D test	Fu and Li, 1993
FF	Fu and Li’s F test	Fu and Li, 1993
H	Fay and Wu’s H test	Fay and Wu, 2000
B	Wall’s B test	Wall, 1999
Q	Wall’s Q test	Wall, 1999
ZA	ZA	Rozas et al. 2001
Fst	F_st	Hudson et al. 1992
Kw	No. haplotypes/n	Strobeck, 1987
Hw	Haplotype diversity/n	Depaulis and Veuille, 1998
R2	R2 test	Ramos-Onsins and Rozas, 2002
S	No. of biallelic mutations
thetaWatt	θ	Watterson, 1975
thetaTaj	π	Tajima, 1983
thetaFW	θ_H	Fay and Wu, 2000
pi_w	π within populations	e.g. Hudson et al. 1992
pi_b	π among populations	e.g. Hudson et al. 1992
D/Dmin	D/Dmin	Schaeffer, 2002; Schmid et al. 2005
H/Hmin	H/Hmin	Schmid et al. 2005
maxhap	No. lines in most common haplotype/n	Depaulis et al. 2003
maxhap1	maxhap excepting one biallelic mutation	Hudson et al. 1994; Rozas et al. 2001
Rm	Rm	Hudson and Kaplan, 1985

n is the number of sequence lines.

See text and mlcoalsim documentation for a brief description of statistics.

Several estimators of the level of variation are included for the entire sample: θ_W (Watterson, 1975), π (Tajima, 1983) and θ_H (Fay and Wu, 2000). Although these estimators are calculated using different approaches, their expected values are equal under a neutral stationary panmictic model. The average levels of variation within and among populations are also estimated (π_w and π_b; Hudson et al. 1992), as well as the average differentiation among populations with the Fst statistic (e.g. Hudson et al. 1992).

The patterns of diversity are described using two main classes of statistics (Ramos-Onsins and Rozas, 2002): Class I statistics, which use the mutation frequency information, and Class II statistics, which use information from the haplotype distribution. Class I includes Tajima’s D test (TD; Tajima, 1989), Fu and Li’s tests (FD*, FF*, FD, FF; Fu and Li, 1993), Fay and Wu’s H test (Fay and Wu, 2000), R2 (Ramos-Onsins and Rozas, 2002) and weighted statistics for a multilocus approach such as D/Dmin (Schaeffer, 2002) and H/Hmin (Schmid et al. 2005). Class II includes the number of haplotypes Kw (Strobeck, 1987) and the haplotype diversity Hw (Depaulis and Veuille, 1998), both weighted by the number of samples for a better multilocus comparison; Fs (Fu, 1997); the statistics B and Q (Wall, 1999), which count differences in haplotype structure at adjacent positions; the ZA statistic (Rozas et al. 2001), a measure of linkage disequilibrium at adjacent positions; maxhap (Depaulis et al. 2003), based on the number of lines carrying the most common haplotype; and maxhap1 (simplified from Hudson et al. 1994; Rozas et al. 2001), which is counted like maxhap but allows a single segregating site within the largest “haplotype” group.

Finally, the minimum number of recombination events, Rm (Hudson and Kaplan, 1985), is also calculated. Multilocus analyses produce a comprehensive output listing each calculated statistic together with its average and variance across loci. Only biallelic positions are considered for the analyses, given that tri- or tetra-allelic positions are rare in within-species samples.
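As an illustration of the Class I statistics, the Python sketch below computes S, θ_W, π, θ_H and Tajima's D from a small matrix of biallelic sites using the standard published formulas. The function name, the 0/1 coding (1 taken as the derived state, which θ_H requires) and the toy data are assumptions made for this example; it is not mlcoalsim's code and it ignores missing data and multiple populations.

```python
from math import sqrt


def diversity_statistics(haplotypes):
    """Frequency-based statistics for rows of 0/1 alleles (1 = derived)."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("this sketch assumes at least 4 sequences")
    # Derived-allele count per column, keeping only segregating sites.
    counts = [sum(column) for column in zip(*haplotypes)]
    counts = [k for k in counts if 0 < k < n]
    s = len(counts)
    pairs = n * (n - 1)
    pi = sum(2 * k * (n - k) for k in counts) / pairs   # Tajima, 1983
    theta_h = sum(2 * k * k for k in counts) / pairs    # Fay and Wu, 2000
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = s / a1                                    # Watterson, 1975
    # Variance terms of Tajima's D (Tajima, 1989).
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * s + e2 * s * (s - 1)
    tajima_d = (pi - theta_w) / sqrt(var) if var > 0 else float("nan")
    return {"S": s, "theta_w": theta_w, "pi": pi,
            "theta_h": theta_h, "tajima_d": tajima_d}


# Toy sample: 4 sequences, 3 biallelic sites.
sample = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
stats = diversity_statistics(sample)
```

Tajima's D is the normalized difference between π and θ_W, so it is expected to be close to zero when the neutral stationary panmictic model holds.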

Other technical features

The generation of random deviates from uniform, binomial, Poisson and gamma distributions and the root finding for complex functions are based on Lanczos (1964); Atkinson (1979); Cheng and Feast (1979); Fishman (1979); Ridders (1979); Press et al. (1992) and Press and Teukolsky (1992). The Rm function was obtained and modified from Wall’s code (Wall, 2000). The gamma function was partially obtained from the code of Grassly, Adachi and Rambaut (Grassly et al. 1997).

27 in total

1. On the number of segregating sites in genetical models without recombination.

Authors: G A Watterson
Journal: Theor Popul Biol Date: 1975-04 Impact factor: 1.570

2. Estimation of levels of gene flow from DNA sequence data.

Authors: R R Hudson; M Slatkin; W P Maddison
Journal: Genetics Date: 1992-10 Impact factor: 4.562

Review 3. Coalescents and genealogical structure under neutrality.

Authors: P Donnelly; S Tavaré
Journal: Annu Rev Genet Date: 1995 Impact factor: 16.830

4. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection.

Authors: Y X Fu
Journal: Genetics Date: 1997-10 Impact factor: 4.562

5. Inferring coalescence times from DNA sequence data.

Authors: S Tavaré; D J Balding; R C Griffiths; P Donnelly
Journal: Genetics Date: 1997-02 Impact factor: 4.562

6. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors: F Tajima
Journal: Genetics Date: 1989-11 Impact factor: 4.562

7. Statistical tests of neutrality of mutations.

Authors: Y X Fu; W H Li
Journal: Genetics Date: 1993-03 Impact factor: 4.562

8. Evolutionary relationship of DNA sequences in finite populations.

Authors: F Tajima
Journal: Genetics Date: 1983-10 Impact factor: 4.562

9. Statistical properties of the number of recombination events in the history of a sample of DNA sequences.

Authors: R R Hudson; N L Kaplan
Journal: Genetics Date: 1985-09 Impact factor: 4.562

10. Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster.

Authors: R R Hudson; K Bailey; D Skarecky; J Kwiatowski; F J Ayala
Journal: Genetics Date: 1994-04 Impact factor: 4.562
