
scrm: efficiently simulating long sequences using the approximated coalescent with recombination.

Paul R Staab, Sha Zhu, Dirk Metzler, Gerton Lunter.

Abstract

MOTIVATION: Coalescent-based simulation software for genomic sequences allows the efficient in silico generation of short- and medium-sized genetic sequences. However, the simulation of genome-size datasets as produced by next-generation sequencing is currently only possible using fairly crude approximations.
RESULTS: We present the sequential coalescent with recombination model (SCRM), a new method that efficiently and accurately approximates the coalescent with recombination, closing the gap between current approximations and the exact model. We present an efficient implementation and show that it can simulate genomic-scale datasets with an essentially correct linkage structure.
© The Author 2015. Published by Oxford University Press.

Year:  2015        PMID: 25596205      PMCID: PMC4426833          DOI: 10.1093/bioinformatics/btu861

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Coalescent simulation is a valuable tool for investigating population genetic data and the demographic processes that shaped them. Simulation programs based on the coalescent with recombination (CWR), such as ms (Hudson, 2002), support a wide range of evolutionary scenarios and are extremely efficient for short- and medium-sized sequences. However, as the number of recombination events grows exponentially with increasing sequence length, it is infeasible to simulate whole chromosomes using these methods. This prevents many simulation-based methods from being applied to next-generation sequencing datasets.

To resolve this problem, McVean and Cardin (2005) introduced the sequentially Markov coalescent (SMC) model, a method that approximates the CWR by partially ignoring genetic linkage between simulated sites. Subsequently, Marjoram and Wall (2006) proposed a modification of this model, termed SMC’, which improved accuracy. The simulation programs MaCS (Chen et al., 2009) and fastsimcoal (Excoffier and Foll, 2011) implement SMC’ and allow rapid simulation of chromosome-sized datasets. However, Eriksson et al. (2009) found that the decrease in genetic linkage depends on the simulated evolutionary model. They concluded that SMC’ might not be suitable under certain conditions and suggested that the effect of the approximation be investigated carefully in each application.

We have developed a novel approximation of the CWR, called the sequential coalescent with recombination model (SCRM). Besides algorithmic optimization, it allows for user-controlled arbitrary precision ranging continuously from SMC’ to the full CWR. We here present an efficient implementation of this model, termed scrm, and show that, with an intermediate approximation level, it allows the simulation of sequences of arbitrary length with an essentially correct linkage structure.

2 Materials and Methods

SCRM is based on the sequential model for building the ancestral recombination graph (ARG) by Wiuf and Hein (1999). After sampling an initial genealogy at one end of a chromosome, the algorithm moves along the sequence and updates the genealogy as recombination events are encountered. The genealogy includes so-called non-local branches, which belong to a previous genealogy but do not carry ancestral material for the current position. These may become important at upstream positions. Every recombination introduces additional non-local branches, which causes the ARG to grow exponentially along the sequence and makes it impractical to simulate long sequences under the CWR. To resolve this problem, SCRM adds three modifications to the Wiuf–Hein model:

(i) SCRM uses a memory-efficient tree-based data structure, which encodes recombinations as non-local leaves rather than splits in the graph.

(ii) Recombination events on non-local branches are postponed until a local tree is affected, and ignored if the local trees remain unchanged. This requires that we account only for local recombinations while moving along the sequence. Non-local branches accumulate the recombination rate until they are targeted by a coalescence. The time until the next recombination event on this branch is then exponentially distributed with the accumulated rate. This modification is similar to a model proposed by Wang et al. (2014).

(iii) The most crucial modification is that weak linkage over large genomic regions can be disregarded. We do this by removing non-local branches with an accumulated recombination rate above a threshold. As this threshold corresponds to a certain genomic distance to the local tree, this approximation is equivalent to an ‘exact window’ sliding along the sequence. Positions within the window have the same linkage as in the CWR, while positions further apart have reduced linkage. Setting the window size to 0 reduces the algorithm to the SMC’, while a chromosome-sized window recovers the CWR.
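
The rate accumulation and the ‘exact window’ pruning described above can be illustrated with a minimal Python sketch. It is not taken from the scrm source code: the class `NonLocalBranch`, the helper names and the choice to accumulate the rate as a per-base rate multiplied by the genomic distance travelled are all assumptions made for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class NonLocalBranch:
    """Hypothetical record for a branch without ancestral material here."""
    last_update: float             # position at which the rate was last updated
    accumulated_rate: float = 0.0  # recombination rate collected so far


def accumulate(branch, position, rate_per_base):
    """Add the rate collected since the branch was last visited (assumed form)."""
    branch.accumulated_rate += rate_per_base * (position - branch.last_update)
    branch.last_update = position


def time_to_recombination(branch, rng):
    """Draw the waiting time from an exponential with the accumulated rate."""
    if branch.accumulated_rate == 0.0:
        return math.inf  # nothing accumulated yet, so no event can occur
    return rng.expovariate(branch.accumulated_rate)


def window_threshold(window_size, rate_per_base):
    """Translate an exact window (in bases) into a rate threshold."""
    return rate_per_base * window_size


def prune(branches, threshold):
    """Drop non-local branches whose accumulated rate exceeds the threshold."""
    return [b for b in branches if b.accumulated_rate <= threshold]


if __name__ == "__main__":
    rate = 1e-8  # assumed per-base recombination rate
    old = NonLocalBranch(last_update=0)
    recent = NonLocalBranch(last_update=900_000)
    for b in (old, recent):
        accumulate(b, 1_000_000, rate)
    kept = prune([old, recent], window_threshold(300_000, rate))
    print(len(kept), time_to_recombination(recent, random.Random(1)) > 0)
```

In this toy setting a window of 0 removes every non-local branch that has accumulated any rate, which mirrors the SMC’ limit, while a window at least as long as the sequence removes none, which mirrors the CWR limit.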

2.1 Implementation and validation

We have developed scrm, an efficient open-source implementation of SCRM written in C++11. Command-line input and output are designed to be compatible with ms, so that scrm can be used as a drop-in replacement. The supported feature set is similar to that of ms. Additionally, scrm supports samples taken at different times and variable recombination rates along the sequence. It is optimized for sample sizes of thousands of individuals. We validated the implementation by comparing exact simulations with those of ms. No significant deviations were found using χ² and Kolmogorov–Smirnov tests (Supplementary Table S1).
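
As a worked example of the ms-style command line, the sketch below rebuilds the arguments `20 1 -r 4000 10000001 -T` quoted in the caption of Fig. 1. It relies on the ms convention that the value after `-r` is ρ = 4·N0·r, with r the recombination probability per generation across the whole locus. The effective population size of 10 000 and the per-base rate of 1e-8 are assumptions chosen so that ρ comes out at 4000; neither value is stated in the text. The function names are hypothetical.

```python
def ms_recombination_parameter(n0, r_per_base, n_sites):
    """rho = 4 * N0 * r, where r spans the n_sites - 1 gaps between sites."""
    return 4 * n0 * r_per_base * (n_sites - 1)


def build_command(program, n_samples, n_loci, rho, n_sites):
    """Assemble an ms-style argument list that also requests trees (-T)."""
    return [program, str(n_samples), str(n_loci),
            "-r", f"{rho:g}", str(n_sites), "-T"]


if __name__ == "__main__":
    # Assumed values: N0 = 10 000 and 1e-8 recombinations per base per generation.
    rho = ms_recombination_parameter(10_000, 1e-8, 10_000_001)
    print(" ".join(build_command("ms", 20, 1, rho, 10_000_001)))
    # Because the interface is ms-compatible, only the program name changes.
    print(" ".join(build_command("scrm", 20, 1, rho, 10_000_001)))
```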

2.2 Approximation of linkage

We compared the genetic linkage produced for different levels of approximation by using the correlation of the total local branch length of the genealogy at two sites as a function of their distance (Fig. 1). The ‘exact window’ of scrm is similar to MaCS’s history parameter. However, as MaCS ignores all non-local recombinations, it simulates too much linkage for sites within its history. Consequently, it does not converge to the CWR when reducing the approximation, while scrm does (Fig. 2). In the settings of Figure 2 and using an exact window size of 300 kb, scrm simulates essentially correct linkage across 20 samples with a linear run-time cost of 0.1 s per megabase.
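
The linkage measure used here, the correlation of the total local branch length at two sites as a function of their distance, could be estimated from simulated trees roughly as follows. This is a sketch under stated assumptions rather than the authors' analysis code: it assumes that each replicate has already been reduced to a list of total branch lengths at equally spaced sites, and the names `pearson` and `linkage_curve` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def linkage_curve(replicates, lags):
    """Correlation of total branch length at sites `lag` steps apart.

    replicates[k][i] is the total branch length of the local tree at the
    i-th sampled site of replicate k (sites assumed equally spaced).
    """
    curve = {}
    for lag in lags:
        first, second = [], []
        for lengths in replicates:
            for i in range(len(lengths) - lag):
                first.append(lengths[i])
                second.append(lengths[i + lag])
        curve[lag] = pearson(first, second)
    return curve


if __name__ == "__main__":
    toy = [[1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 3.0]]  # made-up values
    print(linkage_curve(toy, [0, 1]))
```

Multiplying each lag by the spacing between sampled sites gives the distance δ in base pairs.
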

Fig. 1. Approximation of genetic linkage. Shown is the correlation ρ (y-axis) of the total local branch length at two sites δ base pairs apart (x-axis). The linkage in the CWR (ms, options 20 1 -r 4000 10000001 -T) is indicated in black. Results for scrm using different exact window sizes (see legend) are indicated in colour.

Fig. 2. Efficiency for different approximations. Shown is the deviation (y-axis) against run-time (x-axis) for simulating 10 Mb with a recombination rate of per base per generation. The deviation of the approximation from the correct values is measured as the square root of the area between the correlation curves for the approximate simulated data and the ms-generated data (see Fig. 1). For scrm and MaCS, multiple approximation levels are drawn using different exact window sizes or history parameters. The recently published Cosi2 (Shlyakhter et al., 2014) does not output trees and could not be included in this figure; for a comparison of Cosi2 and scrm using different summary statistics, see Supplementary Figure S5.
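
The deviation measure in the caption of Fig. 2, the square root of the area between two correlation curves, can be approximated numerically. The sketch below uses the trapezoidal rule on the absolute difference of the curves. The integration scheme and the function name `deviation` are assumptions, since the text does not say how the area was computed, and the result is only approximate where the curves cross between sampled distances.

```python
import math


def deviation(distances, approx_curve, exact_curve):
    """Square root of the area between two curves sampled at `distances`."""
    area = 0.0
    for i in range(1, len(distances)):
        width = distances[i] - distances[i - 1]
        left = abs(approx_curve[i - 1] - exact_curve[i - 1])
        right = abs(approx_curve[i] - exact_curve[i])
        area += 0.5 * width * (left + right)  # trapezoid on |difference|
    return math.sqrt(area)


if __name__ == "__main__":
    # Made-up curves: a constant gap of 1 over a distance of 2 gives area 2.
    print(deviation([0, 1, 2], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))
```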


Funding

This work was supported by the Deutsche Forschungsgemeinschaft [DFG ME 3134/3-2] and the Wellcome Trust [090532/Z/09/Z].

Conflict of Interest: none declared.

References

1.  Recombination as a point process along sequences.

Authors:  C Wiuf; J Hein
Journal:  Theor Popul Biol       Date:  1999-06       Impact factor: 1.570

2.  Generating samples under a Wright-Fisher neutral model of genetic variation.

Authors:  Richard R Hudson
Journal:  Bioinformatics       Date:  2002-02       Impact factor: 6.937

3.  Cosi2: an efficient simulator of exact and approximate coalescent with selection.

Authors:  Ilya Shlyakhter; Pardis C Sabeti; Stephen F Schaffner
Journal:  Bioinformatics       Date:  2014-08-22       Impact factor: 6.937

4.  Approximating the coalescent with recombination.

Authors:  Gilean A T McVean; Niall J Cardin
Journal:  Philos Trans R Soc Lond B Biol Sci       Date:  2005-07-29       Impact factor: 6.237

5.  Sequential Markov coalescent algorithms for population models with demographic structure.

Authors:  A Eriksson; B Mahjani; B Mehlig
Journal:  Theor Popul Biol       Date:  2009-05-09       Impact factor: 1.570

6.  Fast and flexible simulation of DNA sequence data.

Authors:  Gary K Chen; Paul Marjoram; Jeffrey D Wall
Journal:  Genome Res       Date:  2008-11-24       Impact factor: 9.043

7.  fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios.

Authors:  Laurent Excoffier; Matthieu Foll
Journal:  Bioinformatics       Date:  2011-03-12       Impact factor: 6.937

8.  A new method for modeling coalescent processes with recombination.

Authors:  Ying Wang; Ying Zhou; Linfeng Li; Xian Chen; Yuting Liu; Zhi-Ming Ma; Shuhua Xu
Journal:  BMC Bioinformatics       Date:  2014-08-11       Impact factor: 3.169

9.  Fast "coalescent" simulation.

Authors:  Paul Marjoram; Jeff D Wall
Journal:  BMC Genet       Date:  2006-03-15       Impact factor: 2.797
