Mlcoalsim: multilocus coalescent simulations.

Sebastian E Ramos-Onsins, Thomas Mitchell-Olds.

Abstract

Coalescent theory is a powerful tool for population geneticists as well as molecular biologists interested in understanding the patterns and levels of DNA variation. Using coalescent Monte Carlo simulations it is possible to obtain the empirical distributions for a number of statistics across a wide range of evolutionary models; these distributions can be used to test evolutionary hypotheses using experimental data. The mlcoalsim application presented here (based on a version of the ms program, Hudson, 2002) adds important new features to improve methodology (uncertainty and conditional methods for mutation and recombination), models (including strong positive selection, finite sites and heterogeneity in mutation and recombination rates) and analyses (calculating a number of statistics used in population genetics and P-values for observed data). One of the most important features of mlcoalsim is the analysis of multilocus data in linked and independent regions. In summary, mlcoalsim is an integrated software application aimed at researchers interested in molecular evolution. mlcoalsim is written in ANSI C and is available at: http://www.ub.es/softevol/mlcoalsim.

Keywords:  Coalescent simulations; Multilocus analyses; Neutrality tests; Population Genetics; Rejection algorithm

Year: 2007; PMID: 19430603; PMCID: PMC2674636

Source DB: PubMed; Journal: Evol Bioinform Online; ISSN: 1176-9343; Impact factor: 1.625


Introduction

Statistical inference from molecular population data under different evolutionary models typically employs a coalescent framework (Kingman, 1982a,b; Hudson, 1990; Donnelly and Tavaré, 1995; Nordborg, 2001). Hudson’s ms application (Hudson, 2002) enabled a large number of population geneticists and molecular biologists to examine data under different evolutionary models. In recent years, a number of coalescent programs focused on the generation of genetic data have been published (e.g. SimCoal, Excoffier et al. 2000; Laval and Excoffier, 2004; SelSim, Spencer and Coop, 2004; CoaSim, Mailund et al. 2005; FastCoal, Marjoram and Wall, 2005). Nevertheless, multilocus data obtained by high-throughput techniques (e.g. the Drosophila Polymorphisms Sequencing Project, as well as smaller projects such as those described by Akey et al. 2004; Schmid et al. 2005) are not easily analyzed using available software. Here we describe the mlcoalsim software application which, unlike other available tools, allows the generation of simulated genetic data and the calculation of descriptive statistics for a large number of loci under different evolutionary models, as well as the calculation of P-values for observed data.

Program Overview

mlcoalsim enables researchers to compare single-locus and multilocus data with several common evolutionary models. It is an integrated application that not only constructs coalescent trees and sequences but also calculates a number of summary statistics that are useful for examining evolutionary hypotheses. The program is designed to generate within-species genetic data, so the level of nucleotide variation should not be too high (a maximum of approximately 5%); above that level important errors can arise and a more sophisticated substitution model should be used. For the same reason, the level of divergence from an outgroup species should be no greater than 10–15%.

Multilocus analyses

One of the main features of mlcoalsim is the generation of DNA samples and the calculation of a number of statistical tests for a set of multiple loci with variable levels of intragenic recombination. There are two options for multilocus analysis: using independent (unlinked) loci, or using a single long region separated into several fragments. The first option (independent loci) allows the independent analysis of each locus and the calculation of summary statistics (the average and variance of each statistic across all loci). This option is useful for contrasting data with demographic models that would affect the entire genome. Such an analysis requires a correction factor for population size that depends on the chromosomal location of each locus (e.g. autosomal or sex-linked). The second option (linked loci) generates samples for an entire linked region and calculates statistics for specified fragments within this region or for a sliding window analysis. The “linked” option is useful for studying evolutionary processes that affect only specific regions, such as a selective sweep in a recombining region.
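The independent-loci option can be illustrated with a minimal Python sketch. It is not mlcoalsim code: the function names are hypothetical, the correction factors are the textbook values for effective population size relative to an autosomal locus under an equal sex ratio, and the use of the sample variance is an assumption.

```python
import statistics

# Hypothetical correction factors: effective population size relative to an
# autosomal locus, assuming an equal sex ratio. Not taken from mlcoalsim.
NE_FACTOR = {
    "autosomal": 1.0,
    "x_linked": 0.75,
    "y_linked": 0.25,
    "mitochondrial": 0.25,
}


def scaled_theta(theta_autosomal, location):
    """Scale an autosomal population mutation parameter for another location."""
    return theta_autosomal * NE_FACTOR[location]


def multilocus_summary(values):
    """Return the mean and sample variance of one statistic across loci."""
    return statistics.mean(values), statistics.variance(values)


if __name__ == "__main__":
    loci = [("locus1", "autosomal"), ("locus2", "x_linked"), ("locus3", "autosomal")]
    for name, location in loci:
        print(name, scaled_theta(8.0, location))
    # Made-up per-locus values of a statistic, for illustration only.
    print(multilocus_summary([-0.4, 0.1, -1.2]))
```

Each locus is simulated with its own scaled parameter, and the per-locus values of a statistic are then reduced to one average and one variance for the whole data set.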

Uncertainty in mutation and recombination rates

Mutation and recombination rates are critical parameters that are usually unknown. To account for the uncertainty in these two parameters, mlcoalsim can sample the rates from a distribution (uniform and gamma distributions are available) instead of using a fixed value. In addition, mlcoalsim can generate samples conditional on the observed values of the number of segregating sites, the minimum number of recombination events and (optionally) the number of haplotypes. This last option uses rejection method 2 of Tavaré et al. (1997). The posterior distributions of the population mutation parameter and of the population recombination parameter are recorded.
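The idea of conditioning on an observed value by rejection can be sketched in Python as follows. This is a simplified illustration and not necessarily identical to the method implemented in mlcoalsim: it assumes a neutral model without recombination, infinite sites, a uniform prior on the population mutation parameter, and conditioning on the number of segregating sites only. All function names are hypothetical.

```python
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate by counting unit-rate arrivals before lam."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def total_tree_length(n, rng):
    """Total branch length of a neutral coalescent tree for n sequences.

    Time is in units of 2N generations, so k lineages coalesce at rate
    k(k-1)/2 and each interval contributes k * T_k to the total length.
    """
    return sum(k * rng.expovariate(k * (k - 1) / 2.0) for k in range(2, n + 1))


def simulate_segregating_sites(n, theta, rng):
    """Number of mutations on a random tree (infinite sites, no recombination)."""
    return poisson(theta / 2.0 * total_tree_length(n, rng), rng)


def rejection_sample_theta(n, s_obs, theta_min, theta_max, n_accept, rng):
    """Keep prior draws of theta whose simulated S equals the observed S."""
    accepted = []
    while len(accepted) < n_accept:
        theta = rng.uniform(theta_min, theta_max)
        if simulate_segregating_sites(n, theta, rng) == s_obs:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    rng = random.Random(1)
    posterior = rejection_sample_theta(10, 8, 0.0, 20.0, 200, rng)
    print(sum(posterior) / len(posterior))
```

The accepted draws approximate the posterior distribution of the parameter given the observed number of segregating sites; the same scheme extends to further conditions at the cost of a lower acceptance rate.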

Heterogeneity in mutation and recombination rate across the sequence

mlcoalsim can also account for variation in the mutation and recombination rates across the studied region. Heterogeneity is modelled with a gamma distribution, which covers the range from extreme hotspot regions (for example, for the recombination rate, only a few positions are able to recombine while the others cannot) to uniform values for all positions. Furthermore, it is possible to fix the average number of invariant positions (positions that cannot mutate) for the studied region.
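A minimal Python sketch of gamma-distributed rate heterogeneity is given below. It is an illustration under stated assumptions, not the mlcoalsim implementation: the function names and the shape parameter name alpha are hypothetical, the gamma distribution is scaled to mean 1, and the sketch fixes an exact number of invariant positions (fewer than the number of sites) rather than an average number.

```python
import random


def site_weights(n_sites, alpha, n_invariant, rng):
    """Relative per-site rates drawn from a gamma distribution with mean 1.

    Small alpha concentrates the rate in a few hotspot sites; large alpha
    gives nearly uniform rates. n_invariant sites are fixed to rate 0.
    """
    rates = [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]
    for pos in rng.sample(range(n_sites), n_invariant):
        rates[pos] = 0.0
    total = sum(rates)
    return [r / total for r in rates]


def top_share(weights, n_top):
    """Fraction of the total rate carried by the n_top fastest sites."""
    return sum(sorted(weights, reverse=True)[:n_top])


if __name__ == "__main__":
    rng = random.Random(7)
    hotspots = site_weights(1000, 0.1, 100, rng)
    uniform_like = site_weights(1000, 100.0, 100, rng)
    print(top_share(hotspots, 10), top_share(uniform_like, 10))
```

The resulting weights sum to 1 and can be used as the probabilities with which a mutation or recombination event is assigned to each position.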

Evolutionary models

mlcoalsim includes the following evolutionary models: the neutral stationary panmictic model, the finite island model, models with changing population sizes over time, refugia models and deterministic positive selection (not all of these models can be used simultaneously). mlcoalsim allows neutral and positive selection models to be used for different independent loci, and changing population size can also be combined with a finite island model.

Statistics

A number of statistics and related tests used in population genetics are displayed in the output (Table 1). The statistics incorporated in this program describe the level and patterns of diversity for a given sample.
Table 1. List of the main statistics included in mlcoalsim.

Name       Statistic                                 Citation
---------  ----------------------------------------  ------------------------------------
TD         Tajima’s D test                           Tajima, 1989
Fs         Fu’s Fs test                              Fu, 1997
FD*        Fu and Li’s D* test                       Fu and Li, 1993
FF*        Fu and Li’s F* test                       Fu and Li, 1993
FD         Fu and Li’s D test                        Fu and Li, 1993
FF         Fu and Li’s F test                        Fu and Li, 1993
H          Fay and Wu’s H test                       Fay and Wu, 2000
B          Wall’s B test                             Wall, 1999
Q          Wall’s Q test                             Wall, 1999
ZA         ZA                                        Rozas et al. 2001
Fst        Fst                                       Hudson et al. 1992
Kw         No. haplotypes/n                          Strobeck, 1987
Hw         Haplotype diversity/n                     Depaulis and Veuille, 1998
R2         R2 test                                   Ramos-Onsins and Rozas, 2002
S          No. of biallelic mutations
thetaWatt  θ                                         Watterson, 1975
thetaTaj   π                                         Tajima, 1983
thetaFW    θH                                        Fay and Wu, 2000
pi_w       π within populations                      e.g. Hudson et al. 1992
pi_b       π among populations                       e.g. Hudson et al. 1992
D/Dmin     D/Dmin                                    Schaeffer, 2002; Schmid et al. 2005
H/Hmin     H/Hmin                                    Schmid et al. 2005
maxhap     No. lines in most common haplotype/n      Depaulis et al. 2003
maxhap1    maxhap excepting one biallelic mutation   Hudson et al. 1994; Rozas et al. 2001
Rm         Rm                                        Hudson and Kaplan, 1985

n is the number of sequence lines.

See text and mlcoalsim documentation for a brief description of statistics.

Different statistics that estimate the level of variation for the entire sample are included: θ (Watterson, 1975), π (Tajima, 1983) and θH (Fay and Wu, 2000). Although these estimates are calculated using different approaches, their expected values are equal under a neutral stationary panmictic model. The average levels of variation within and among populations are also estimated (pi_w and pi_b in Table 1; Hudson et al. 1992), as well as the average differentiation among populations with the Fst statistic (e.g. Hudson et al. 1992). The patterns of diversity are described using two main classes of statistics (Ramos-Onsins and Rozas, 2002): Class I statistics, which use the mutation frequency information, and Class II statistics, which use information from the haplotype distribution. Class I includes Tajima’s D test (TD, Tajima, 1989), Fu and Li’s tests (FD*, FF*, FD, FF, Fu and Li, 1993), Fay and Wu’s H test (Fay and Wu, 2000), R2 (Ramos-Onsins and Rozas, 2002) and weighted statistics for a multilocus approach such as D/Dmin (Schaeffer, 2002) and H/Hmin (Schmid et al. 2005). Class II includes the number of haplotypes Kw (Strobeck, 1987) and the haplotype diversity Hw (Depaulis and Veuille, 1998), both weighted by the number of samples for a better multilocus comparison; Fs (Fu, 1997); the statistics B and Q (Wall, 1999), which count differences in haplotype structure at adjacent positions; the ZA statistic (Rozas et al. 2001), a measure of linkage disequilibrium at adjacent positions; maxhap (Depaulis et al. 2003), the number of lines with the most common haplotype; and maxhap1 (simplified from Hudson et al. 1994), which is the same count but allows a single segregating site within the largest “haplotype” group (Rozas et al. 2001). Finally, the minimum number of recombination events, Rm (Hudson and Kaplan, 1985), is also calculated. Multilocus analyses generate a comprehensive output containing the calculated statistics together with their average and variance.
Only biallelic positions are considered for the analyses given that tri- or tetra-allelic positions are rare in within-species samples.
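As an illustration of the Class I statistics, the following Python sketch computes Watterson's θ, π and Tajima's D from a small sample using the standard published formulas (Watterson, 1975; Tajima, 1983, 1989). It is not mlcoalsim code: the function names are hypothetical, haplotypes are assumed to be coded as strings of '0' and '1' so that every variable site is biallelic, and at least four sequences are assumed so that the variance of D is positive.

```python
import math


def biallelic_counts(haplotypes):
    """Per-site count of the '1' allele at segregating sites.

    haplotypes is a list of equal-length strings over '0' and '1'.
    """
    n = len(haplotypes)
    counts = []
    for column in zip(*haplotypes):
        j = column.count("1")
        if 0 < j < n:
            counts.append(j)
    return counts


def diversity_stats(haplotypes):
    """Return (Watterson's theta, pi, Tajima's D) for n >= 4 sequences."""
    n = len(haplotypes)
    counts = biallelic_counts(haplotypes)
    s = len(counts)  # number of segregating (biallelic) sites
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    theta_w = s / a1
    # Average number of pairwise differences, summed over sites.
    pi = sum(2.0 * j * (n - j) for j in counts) / (n * (n - 1))
    if s == 0:
        return theta_w, pi, float("nan")  # D is undefined without variation
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    tajima_d = (pi - theta_w) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return theta_w, pi, tajima_d


if __name__ == "__main__":
    # One intermediate-frequency site gives D > 0; singletons give D < 0.
    print(diversity_stats(["10", "10", "00", "00"]))
    print(diversity_stats(["100", "010", "001", "000", "000"]))
```

Both θ and π are reported here per sequence rather than per site; dividing by the sequence length gives per-site values.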

Other technical features

The generation of random deviates from uniform, binomial, Poisson and gamma distributions and the root-finding routines for complex functions are based on Lanczos (1964); Atkinson (1979); Cheng and Feast (1979); Fishman (1979); Ridders (1979); Press et al. (1992) and Press and Teukolsky (1992). The Rm function was obtained and modified from Wall’s code (Wall, 2000). The gamma function was partially obtained from the code of Grassly, Adachi and Rambaut (Grassly et al. 1997).

References (27 in total; the first 10 are listed)

1.  On the number of segregating sites in genetical models without recombination.

Authors:  G A Watterson
Journal:  Theor Popul Biol       Date:  1975-04       Impact factor: 1.570

2.  Estimation of levels of gene flow from DNA sequence data.

Authors:  R R Hudson; M Slatkin; W P Maddison
Journal:  Genetics       Date:  1992-10       Impact factor: 4.562

Review 3.  Coalescents and genealogical structure under neutrality.

Authors:  P Donnelly; S Tavaré
Journal:  Annu Rev Genet       Date:  1995       Impact factor: 16.830

4.  Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection.

Authors:  Y X Fu
Journal:  Genetics       Date:  1997-10       Impact factor: 4.562

5.  Inferring coalescence times from DNA sequence data.

Authors:  S Tavaré; D J Balding; R C Griffiths; P Donnelly
Journal:  Genetics       Date:  1997-02       Impact factor: 4.562

6.  Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.

Authors:  F Tajima
Journal:  Genetics       Date:  1989-11       Impact factor: 4.562

7.  Statistical tests of neutrality of mutations.

Authors:  Y X Fu; W H Li
Journal:  Genetics       Date:  1993-03       Impact factor: 4.562

8.  Evolutionary relationship of DNA sequences in finite populations.

Authors:  F Tajima
Journal:  Genetics       Date:  1983-10       Impact factor: 4.562

9.  Statistical properties of the number of recombination events in the history of a sample of DNA sequences.

Authors:  R R Hudson; N L Kaplan
Journal:  Genetics       Date:  1985-09       Impact factor: 4.562

10.  Evidence for positive selection in the superoxide dismutase (Sod) region of Drosophila melanogaster.

Authors:  R R Hudson; K Bailey; D Skarecky; J Kwiatowski; F J Ayala
Journal:  Genetics       Date:  1994-04       Impact factor: 4.562
