SimBac: simulation of whole bacterial genomes with homologous recombination.

Thomas Brown, Xavier Didelot, Daniel J Wilson, Nicola De Maio

Abstract

Bacteria can exchange genetic material, or acquire genes found in the environment. This process, generally known as bacterial recombination, can have a strong impact on the evolution and phenotype of bacteria, for example causing the spread of antibiotic resistance across clades and species, but can also disrupt phylogenetic and transmission inferences. With the increasing affordability of whole genome sequencing, the need has emerged for an efficient simulator of bacterial evolution to test and compare methods for phylogenetic and population genetic inference, and for simulation-based estimation. We present SimBac, a whole-genome bacterial evolution simulator that is roughly two orders of magnitude faster than previous software and includes a more general model of bacterial evolution, allowing both within- and between-species homologous recombination. Since methods modelling bacterial recombination generally focus on only one of these two modes of recombination, the possibility to simulate both allows for a general and fair benchmarking. SimBac is available from https://github.com/tbrown91/SimBac and is distributed as open source under the terms of the GNU General Public Licence.

Keywords:  bacterial evolution; bacterial genomics; bacterial recombination; coalescent simulation

Year:  2016        PMID: 27713837      PMCID: PMC5049688          DOI: 10.1099/mgen.0.000044

Source DB:  PubMed          Journal:  Microb Genom        ISSN: 2057-5858


Data Summary

SimBac, the software we developed to simulate genome-wide bacterial evolution, is distributed as open source under the terms of the GNU General Public Licence, and is available from GitHub (https://github.com/tbrown91/SimBac). A manual and examples of usage of SimBac are provided in the Supplementary Material.

Impact Statement

Sequencing technologies are revolutionizing microbiology, allowing researchers to investigate in great detail the genetic information in bacteria. This increasingly overwhelming amount of information requires adequate, efficient computer methods to be processed in reasonable time. One of the most important tasks performed by computer methods is simulating data, as this provides a means for testing hypotheses and checking the performance of other methods in extracting valuable information from data. Previous software specifically developed for simulating bacterial evolution is limited in applicability, having been conceived for limited data and biological phenomena. We present SimBac, a new simulator of bacterial evolution that can generate data for thousands of bacterial genomes about 100 times faster than previous methods. SimBac also includes a very general model of bacterial evolution that accounts for the fact that bacteria can exchange genetic material with each other, not only within the same population, but also across species boundaries. Thanks to these advances, SimBac will make it possible to efficiently test hypotheses and estimate parameters by comparing real and simulated bacterial data, to test the accuracy of bacterial genomic methods, and to fairly compare methods that make different assumptions regarding bacterial evolution.

Introduction

Whole-genome bacterial sequencing is rapidly gaining in popularity and replacing multilocus sequence typing (MLST) thanks to its fast and cost-effective provision of higher resolution genetic information (Didelot et al.; Wilson, 2012). Computational algorithms that use genomic data to infer epidemiological, phylogeographic, phylodynamic and evolutionary patterns are generally hampered by recombination (e.g. Schierup & Hein, 2000; Posada & Crandall, 2002; Hedge & Wilson, 2014), and recent years have seen a surge of methods that measure, identify and account for bacterial homologous recombination (e.g. Didelot & Falush, 2007; Marttinen et al., 2012; Didelot et al.; Croucher et al.; Didelot & Wilson, 2015). Assessing and comparing the performance of different methods is complicated by the use of different models of recombination, in particular within-species recombination leading to phylogenetically discordant sites (e.g. Didelot et al.) or between-species recombination leading to accumulation of substitutions on specific branches and genomic intervals (e.g. Didelot & Falush, 2007). Simulators of bacterial evolution are routinely used for parameter inference and hypothesis testing (Fearnhead et al.; Fraser et al.) and for method testing and comparison (Falush et al.; Didelot & Falush, 2007; Turner et al., Buckee et al.; Wilson et al.; Hedge & Wilson, 2014), but the simulation software and models used are generally targeted to the specific model of evolution implemented in the methods considered. One of the reasons for this is the lack of general and efficient simulators of bacterial evolution.

Coalescent simulators of eukaryotic evolution usually focus on crossover recombination (see e.g. Arenas & Posada, 2007, 2010, 2014), while bacterial recombination is generally modelled as gene conversion, meaning that in a recombination event only a small fragment of DNA is imported from a donor, whereas most of the genetic material is inherited from the recipient. Many fast and approximate simulation methods (e.g. Marjoram & Wall, 2006; Excoffier & Foll, 2011) cannot be applied to bacterial recombination because the approximations used do not generate the expected long genomic distance correlations in bacterial local trees. Other similar approximate methods are only adequate for low bacterial recombination rates (e.g. Chen et al.; Wang et al.). Many forward-in-time simulation methods (e.g. Chadeau-Hyam et al.; Dalquen et al.) or discrete generation coalescent methods (Excoffier et al.; Laval & Excoffier, 2004) can allow gene conversion, but are generally too slow for simulating whole-genome evolution of large samples or populations. An exact and fast method to simulate gene conversion is the coalescent model of Wiuf & Hein (2000) included in ms (Hudson, 2002) and its extensions (Mailund et al.; Hellenthal & Stephens, 2007; Ramos-Onsins & Mitchell-Olds 2007). Recently, this model has been implemented in simulation software specific for bacterial evolution, SimMLST (Didelot et al.). SimMLST is optimized for MLST data, which requires simulating several short, distant loci, and, similarly to ms, only simulates within-species bacterial recombination. For these reasons, these methods are not generally suited for large, genome-wide bacterial simulation studies or for testing different models and assumptions of recombination.

Here we present SimBac, a new method for simulating bacterial evolution. SimBac implements an efficient coalescent-based algorithm for simulating genome-wide bacterial evolution, and includes a new and more general model of bacterial recombination that extends the classical within-species recombination (Didelot et al.) by allowing the user to specify any degree of recombination between species.

Theory and Implementation

We simulate evolution backward in time under the standard coalescent model with gene conversion, and generate an ancestral recombination graph (ARG; see Wiuf & Hein, 2000). Within-species recombination events are modelled as a copy-pasting of a small fragment of DNA from the donor lineage sequence into the recipient. The computational efficiency of SimBac derives from algorithmic improvements over previous software. First, instead of rejection sampling of recombination events as described by Didelot et al., we developed an analytical solution that samples only recombination events that effectively alter the ancestral material of lineages (details of the methods are available in the online Supplementary Material). Second, we represent ancestral material with a more efficient data structure. These new features allow about 100-fold faster simulation of bacterial genome-wide evolution compared with SimMLST (see Fig. 1). Also, our method generally outperforms ms (Hudson, 2002) when many recombination (or equivalently gene conversion) events are expected.
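
To make the contrast with rejection sampling concrete, the following Python sketch shows a backward-in-time event race and a rejection step of the kind that the analytical solution is said to avoid. It is an illustration only, not SimBac's code: the function names, the interval-list representation of ancestral material, and the rate scaling (recombination rate per lineage taken as R × genome length / 2) are all assumptions made for this example.

```python
import math
import random


def geometric_length(mean_len, rng):
    """Draw a fragment length >= 1 from a geometric distribution with the given mean."""
    if mean_len <= 1:
        return 1
    p = 1.0 / mean_len
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p)) + 1


def overlaps(ancestral, start, end):
    """True if the half-open interval [start, end) hits any ancestral interval."""
    return any(a < end and start < b for a, b in ancestral)


def next_event(n_lineages, R, genome_len, rng):
    """Draw the waiting time and type of the next event (assumed rate scaling)."""
    coal_rate = n_lineages * (n_lineages - 1) / 2.0
    rec_rate = n_lineages * R * genome_len / 2.0  # assumption, not from the paper
    total = coal_rate + rec_rate
    if total <= 0:
        raise ValueError("no event can occur with these parameters")
    wait = rng.expovariate(total)
    kind = "coalescence" if rng.random() < coal_rate / total else "recombination"
    return wait, kind


def propose_until_effective(ancestral, genome_len, mean_len, rng, max_tries=100000):
    """Rejection sampling: redraw intervals until one overlaps ancestral material."""
    for tries in range(1, max_tries + 1):
        start = rng.randrange(genome_len)
        end = min(start + geometric_length(mean_len, rng), genome_len)
        if overlaps(ancestral, start, end):
            return (start, end), tries
    raise RuntimeError("no effective recombination event found")


if __name__ == "__main__":
    rng = random.Random(42)
    print(next_event(10, 0.01, 1000, rng))
    # A lineage carrying little ancestral material rejects most proposals.
    print(propose_until_effective([(900, 1000)], 1000, 50, rng))
```

When a lineage carries only a small amount of ancestral material, most proposals are discarded, which is why sampling only the effective events directly can save time.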
Fig. 1

Comparison of run-time of SimMLST, ms and SimBac. Only gene conversion (no crossover) is simulated in ms, to model bacterial evolution. (a) Mean time to simulate the ARG for a fixed recombination rate R = 0.01 and genome length from 100 bp to 1 Mbp. (b) Mean time to simulate the ARG for a fixed genome length of 1 Mbp and recombination rate increasing from R = 0 to R = 0.05. One hundred simulations were performed for each dot, except for SimMLST at R = 0.02 and R = 0.05, and ms at R = 0.02, where 10 simulations were performed due to the elevated computational demand. ms was not run at R = 0.05 because a single run required >4 days. Error bars show ± 1 sd.

Our software also provides the possibility to simulate a circular or linear genome, and an entire or fragmented bacterial genome, and offers a recombination model that allows a mixture of between- and within-species recombination. Within-species recombination is modelled as the coalescent with gene conversion (Wiuf & Hein, 2000; Didelot et al.) with fragment lengths distributed geometrically with mean δ, and with all sites having the same per-site recombination initiation rate R (scaled by the effective population size). As the coalescent process is simulated backward in time, any extant lineage can be the recipient of a recombining interval from a donor lineage, which is then added to the other extant lineages. In such a case, the recombining interval becomes part of the genome of the new donor lineage (see Fig. 2b). Every site of the genome of every extant lineage becomes the start of a recombining interval at the same rate R.
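
As a rough illustration of the interval model just described, the Python sketch below draws a recombining interval with a uniformly chosen start site and a geometrically distributed length with mean δ, on either a circular or a linear genome. The function names are hypothetical, and truncating the fragment at the end of a linear genome is an assumption of this sketch; the paper does not state how SimBac treats that case.

```python
import math
import random


def geometric_length(mean_len, rng):
    """Draw a fragment length >= 1 from a geometric distribution with the given mean."""
    if mean_len <= 1:
        return 1
    p = 1.0 / mean_len
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return int(math.log(u) / math.log(1.0 - p)) + 1


def sample_interval(genome_len, mean_len, circular, rng):
    """Return the recombining interval as a list of half-open (start, end) pieces."""
    start = rng.randrange(genome_len)  # every site is equally likely to start a fragment
    length = min(geometric_length(mean_len, rng), genome_len)
    end = start + length
    if end <= genome_len:
        return [(start, end)]
    if circular:
        # The fragment wraps around the origin of a circular genome.
        return [(start, genome_len), (0, end - genome_len)]
    # Assumption for linear genomes: the fragment is cut at the last site.
    return [(start, genome_len)]


if __name__ == "__main__":
    rng = random.Random(7)
    for _ in range(3):
        print(sample_interval(1000, 500, True, rng))
```

A wrapped interval is returned as two pieces so that downstream code can treat ancestral material as a plain list of non-wrapping intervals.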
Fig. 2

Examples of ancestral recombination graphs (ARGs) generated and plotted by SimBac. Branches represent ARG lineages, and time is considered to go backward from the bottom to the top of the tree. Branch merges (from bottom to top) represent coalescent events, while branch splits represent recombination events. (a) Example ARG with the clonal frame lineages marked in black, the non-clonal lineages in grey, and a recombination event involving an external species marked in red. (b) Same ARG as before, but with ancestral material of each lineage represented as a rectangle in the corresponding node. Each coloured vertical bar inside each rectangle represents a genomic segment. Genomic segments that are present in the ancestral material are coloured in grey, those absent are in white, and those imported from an external species are in red.

Between-species recombination is modelled as a separate process backward in time with a specific scaled per-site recombination initiation rate Re and a specific distribution of imported fragment lengths (geometric with mean δe). When a between-species recombination event occurs at a recipient lineage and interval, the donor lineage is not tracked back in time as for within-species recombination, but instead substitutions are introduced into the recombining interval, similar to the model in ClonalFrame (Didelot & Falush, 2007). Therefore, we do not simulate species evolution as described by Arenas & Posada (2014), but rather assume that each recombining segment is donated by a different lineage within a given divergence range. However, unlike in ClonalFrame, the donor sequence is obtained by adding a random amount of divergence [uniformly sampled within the interval (D1, D2), specified by the user] to the corresponding homologous sequence from the root of the ARG.
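
The between-species import step can be sketched in Python as follows. This is a simplified illustration, not SimBac's implementation: the function name is hypothetical, sequences are plain strings, and the sampled divergence is interpreted as a Jukes & Cantor distance in expected substitutions per site, which is an assumption of this example.

```python
import math
import random

BASES = "ACGT"


def external_import(recipient, root, start, end, d1, d2, rng):
    """Overwrite recipient[start:end] with a diverged copy of root[start:end].

    The divergence d is drawn uniformly from (d1, d2). Treating d as a
    Jukes-Cantor distance is an assumption made for this sketch.
    """
    d = rng.uniform(d1, d2)
    # Jukes-Cantor probability that a site differs after divergence d.
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))
    donor = []
    for base in root[start:end]:
        if rng.random() < p_diff:
            base = rng.choice([b for b in BASES if b != base])
        donor.append(base)
    return recipient[:start] + "".join(donor) + recipient[end:]


if __name__ == "__main__":
    rng = random.Random(11)
    root_seq = "ACGT" * 10
    recipient_seq = "AAAA" * 10
    print(external_import(recipient_seq, root_seq, 8, 24, 0.05, 0.2, rng))
```

Because the donor fragment is derived from the root sequence and not from the recipient, the imported interval can differ from the recipient at sites where the recipient itself has diverged from the root.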
This model accounts for the excess of substitutions caused by between-species recombination as in ClonalFrame, but at the same time also generates the homoplasies that are expected if the recipient lineage does not lead to the root of the local tree. More details on the methods of simulation and a summary of the algorithm are provided in the online Supplementary Material.

To showcase the possible applications of our software, we extend the investigation of phylogenetic inference accuracy by Hedge & Wilson (2014). The authors investigated the effect of low bacterial recombination rates (up to a scaled per-site rate of R = 0.01) on the inference of the clonal frame. Using SimBac, we are able to simulate higher recombination rates (up to R = 0.1) in reasonable time, and we show that for highly recombining bacteria, and in particular for older phylogenetic branches, the probability of reconstructing the phylogenetic topology is reduced further to around 91 % (Fig. 3).
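
One way to picture the accuracy measure used in this comparison, the proportion of clonal frame branches recovered by an inferred tree, is the small Python sketch below. It is a simplified stand-in for the analysis, not the authors' pipeline: trees are hypothetical nested tuples of leaf names, and branches are compared as rooted clades, whereas a Robinson–Foulds comparison of unrooted trees would use bipartitions.

```python
def clades(tree):
    """Return the set of non-trivial clades (frozensets of leaf names) of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the clade containing every leaf carries no information
    return found


def branch_accuracy(true_tree, inferred_tree):
    """Proportion of internal branches of the true tree present in the inferred tree."""
    true_clades = clades(true_tree)
    if not true_clades:
        return 1.0
    return len(true_clades & clades(inferred_tree)) / len(true_clades)


if __name__ == "__main__":
    clonal_frame = ((("A", "B"), "C"), "D")
    inferred = ((("A", "B"), "D"), "C")
    print(branch_accuracy(clonal_frame, inferred))
```

In the example, the inferred tree recovers the clade {A, B} but not {A, B, C}, so half of the internal branches of the clonal frame are counted as correct.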
Fig. 3

Accuracy of clonal frame estimation from recombining bacterial genomes. The x-axis shows the recombination rate R under which simulations are performed. The y-axis shows the accuracy of inference, as the proportion of branches correctly estimated using the Robinson–Foulds metric (Robinson & Foulds, 1981). Ten independent replicates are used for R = 0.1 and 100 in all other cases. Genomes are 1 Mbp long and the scaled mutation rate is fixed at 0.01. (a) Accuracy of three phylogenetic methods: neighbour-joining (NJ), unweighted pair group method with arithmetic mean (UPGMA) and maximum-likelihood (ML). Error bars represent ± 1 sd. (b) Clonal frame branches were separated into three age categories: young, middle-aged and old (respectively with a distance between the branch mid-point and the root of more than 2.09, between 1.32 and 2.09, and less than 1.32 N generations). The ML accuracy for each age category is plotted separately in different colours.


Conclusion

Simulation of genome evolution is important as it allows inference of parameters from data and testing of evolutionary hypotheses, and because it is routinely used to benchmark and compare different microbial genomic analysis methods. We present SimBac, a new method for simulating genome-wide bacterial evolution implemented and distributed as open source software (https://github.com/tbrown91/SimBac). Our model of bacterial recombination is more general than those used by most methods in the field, in that it can describe any mixture of within-species and between-species recombination, and as such, it can fit the assumptions of most methods, or it can provide a more realistic background for comparing methods with different hypotheses. Also, our efficient implementation achieves an approximately 100-fold increase in computational efficiency over previous similar efforts, allowing inference and benchmarking over considerably larger datasets. For example, 1000 genomes of 1 Mbp each with R = 0.01 can be generated in about 6 min. SimBac can generate a wide range of possible outputs: sequence alignments, ARG graphics (see Fig. 2), clonal frames, local genealogies and lists of recombination events. Although only a Jukes & Cantor substitution model (Jukes & Cantor 1969) is presently included in SimBac, in practice this is not a restriction because the local genealogies can be used to generate alignments under a vast choice of nucleotide and codon substitution models using, for example, SeqGen (Rambaut & Grassly, 1997) or INDELible (Fletcher & Yang, 2009) (see Arenas, 2013).

Although SimBac generalizes the applicability of SimMLST, it currently lacks the wide set of options of some simulators of evolution, in particular of forward simulators that allow very general demographic, speciation, selection, migration and rate variation patterns (e.g. Chadeau-Hyam et al.; Dalquen et al.). In fact, many of these features present considerable methodological hurdles to incorporation in computationally efficient coalescent simulators. Nevertheless, future extensions of our method could include distributive conjugal transfer (Gray et al.), non-homogeneous genomic rates of recombination (see e.g. Everitt et al.; Arenas & Posada, 2014), or demographic events and population structure (Arenas & Posada, 2007, 2014).

References (39 in total; first 10 listed)

1. C Wiuf; J Hein. The coalescent with gene conversion. Genetics, 2000-05.

2. Guillaume Laval; Laurent Excoffier. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics, 2004-04-29.

3. Xavier Didelot; Daniel Lawson; Aaron Darling; Daniel Falush. Inference of homologous recombination in bacteria using whole-genome sequences. Genetics, 2010-10-05.

4. Xavier Didelot; Daniel Lawson; Daniel Falush. SimMLST: simulation of multi-locus sequence typing data under a neutral model. Bioinformatics, 2009-03-13.

5. Daniel A Dalquen; Maria Anisimova; Gaston H Gonnet; Christophe Dessimoz. ALF--a simulation framework for genome evolution. Mol Biol Evol, 2011-12-08.

6. Thomas Mailund; Mikkel H Schierup; Christian N S Pedersen; Peter J M Mechlenborg; Jesper N Madsen; Leif Schauser. CoaSim: a flexible environment for simulating genetic data under coalescent models. BMC Bioinformatics, 2005-10-14.

7. Daniel Falush; Mia Torpdahl; Xavier Didelot; Donald F Conrad; Daniel J Wilson; Mark Achtman. Mismatch induced speciation in Salmonella: model and data. Philos Trans R Soc Lond B Biol Sci, 2006-11-29.

8. Ying Wang; Ying Zhou; Linfeng Li; Xian Chen; Yuting Liu; Zhi-Ming Ma; Shuhua Xu. A new method for modeling coalescent processes with recombination. BMC Bioinformatics, 2014-08-11.

9. Jessica Hedge; Daniel J Wilson. Bacterial phylogenetic reconstruction from whole genomes is robust to recombination but demographic inference is not. mBio, 2014-11-25.

10. Miguel Arenas; David Posada. Simulation of genome-wide evolution under heterogeneous substitution models and complex multispecies coalescent histories. Mol Biol Evol, 2014-02-19.