Bacterial phylogenetic reconstruction from whole genomes is robust to recombination but demographic inference is not.

Abstract

Phylogenetic inference in bacterial genomics is fundamental to understanding problems such as population history, antimicrobial resistance, and transmission dynamics. The field has been plagued by an apparent state of contradiction since the distorting effects of recombination on phylogeny were discovered more than a decade ago. Researchers persist with detailed phylogenetic analyses while simultaneously acknowledging that recombination seriously misleads inference of population dynamics and selection. Here we resolve this paradox by showing that phylogenetic tree topologies based on whole genomes robustly reconstruct the clonal frame topology but that branch lengths are badly skewed. Surprisingly, removing recombining sites can exacerbate branch length distortion caused by recombination.

IMPORTANCE: Phylogenetic tree reconstruction is a popular approach for understanding the relatedness of bacteria in a population from differences in their genome sequences. However, bacteria frequently exchange regions of their genomes by a process called homologous recombination, which violates a fundamental assumption of phylogenetic methods. Since many researchers continue to use phylogenetics for recombining bacteria, it is important to understand how recombination affects the conclusions drawn from these analyses. We find that whole-genome sequences afford great accuracy in reconstructing evolutionary relationships despite concerns surrounding the presence of recombination, but the branch lengths of the phylogenetic tree are indeed badly distorted. Surprisingly, methods to reduce the impact of recombination on branch lengths can exacerbate the problem.

Year: 2014 PMID: 25425237 PMCID: PMC4251999 DOI: 10.1128/mBio.02158-14

Source DB: PubMed Journal: mBio Impact factor: 7.867

Observation

Phylogenetic methods are powerful and widely used tools for reconstructing the ancestral history of pathogen populations. These methods have been used extensively in evolutionary contexts and are increasingly applied to bacterial populations in clinical settings for strain classification and outbreak detection (1). Such applications require accurate estimation of the phylogenetic tree, but this can be problematic for bacteria due to recombination, in which DNA is exchanged via transformation, transduction, or conjugation (2). In the early 2000s, several authors demonstrated that recombination distorts phylogenetic inference, leading to biased estimates of branch lengths, artifactual signals of population expansion (3), false inference of positive selection (4, 5), and unreliable reconstruction of the tree topology (6, 7). Recombination causes tree topology and branch lengths to change along the genome, preventing a single tree from adequately explaining the reticulated ancestry of recombining sequences. With the advent of accessible whole-genome sequencing, phylogenetic approaches are increasingly being used to reconstruct the evolutionary history of bacterial populations from their genome sequences (1, 8). The prevalence of phylogenetic analyses despite their demonstrable problems raises difficult questions concerning the credibility of conclusions drawn from phylogenetic inference. The esthetic appeal of phylogenetic trees partly explains their continued popularity, but the lack of viable alternatives is also an important factor. Several sophisticated methods attempt to model reticulated ancestries, but their practical application has been limited by computational demands (9–14). However, we contend that phylogenetic approaches have endured because biologists have found they convey meaningful information about the structure and relatedness of bacterial populations that fits with other evidence. 
Milkman and Bridges (15) introduced the concept of the clonal frame to describe the phylogeny of sites in the bacterial genome that have not experienced recombination. Since a bacterial recombination event typically affects only a fraction of the genome, continual assault by recombination throughout the genome would be required to obliterate the signal of the clonal frame. Despite the attention given to the effect of recombination on phylogenetic inference, investigation into the accuracy of topological reconstruction has been limited to analyses of single or concatenated gene sequences and small sample sizes (6, 16). Therefore, we reasoned that phylogenetic inference might be reliably recovering the signal of the clonal frame from bacterial genomes, which could explain the continued faith placed in phylogenetic inference despite the problem of recombination. We set out to test this idea through simulation.

We simulated 1,000 populations of 100 bacterial genomes, each 1 Mb long with moderate mutation (substitution rate [θ] = 1%) under three scenarios: high, low, and no recombination (recombination rate [ρ] = 1%, 0.1%, and 0%, respectively). For each simulation, we recorded the clonal frame and estimated the phylogeny using neighbor joining (NJ) (17), unweighted-pair group method with arithmetic means (UPGMA) (18), maximum likelihood (ML) (19), and BEAST (20) (full details in Text S1 in the supplemental material). We quantified accuracy as the percentage of branches in the clonal frame correctly reconstructed.

We found that the clonal frame topology was reconstructed remarkably accurately even when recombination was present (>97% [Fig. 1b]). Increasing ρ only modestly reduced accuracy, which appeared to be driven by the shorter branches (see Fig. S1 in the supplemental material).
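One way to make the accuracy measure described above concrete is to compare the sets of branches (bipartitions of the tips) in the true and estimated trees. The following is a minimal Python sketch under stated assumptions, not the procedure given in Text S1: trees are assumed to be nested tuples of tip names, both trees are treated as unrooted, and trivial branches leading to a single tip are ignored. The function names `splits` and `branch_accuracy` are hypothetical.

```python
def splits(tree):
    """Return the non-trivial bipartitions of an unrooted tree.

    `tree` is a nested tuple of tip names (an assumed representation).
    Each bipartition is stored as the side that does NOT contain a fixed
    reference tip, so the same branch always gets the same key.
    """
    def tips(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(tips(child) for child in node))

    all_tips = tips(tree)
    ref = min(all_tips)          # arbitrary but deterministic reference tip
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_tips - below if ref in below else below
        # Keep only branches with at least two tips on each side.
        if 1 < len(side) < len(all_tips) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def branch_accuracy(true_tree, estimated_tree):
    """Percentage of branches of `true_tree` also present in `estimated_tree`."""
    true_splits = splits(true_tree)
    shared = true_splits & splits(estimated_tree)
    return 100.0 * len(shared) / len(true_splits)


# Toy example: the estimate misplaces tip B relative to the true tree.
clonal_frame = ((("A", "B"), "C"), ("D", ("E", "F")))
estimate = ((("A", "C"), "B"), ("D", ("E", "F")))
accuracy = branch_accuracy(clonal_frame, estimate)   # 2 of 3 branches recovered
```

In this toy case, two of the three internal branches of the true tree are recovered, so the accuracy is two-thirds. Whether the study scored rooted clades or unrooted bipartitions is not stated in this section, so the unrooted choice here is an assumption.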
In a model of stable population size, branches nearer the tips tend to be shorter, whereas in an exponentially growing population, the tendency for tips to be shorter than deep branches is reduced, and at high growth rates, it is reversed (21). As such, branches closer to the root are less accurate at high recombination rates for exponentially growing populations (Fig. S2). In contrast, we found that bootstrap values (NJ, UPGMA, and ML) and posterior probabilities (BEAST) were upwardly biased by recombination (Fig. S3). Our results indicate that the accuracy of the tree topology decays progressively with increasing recombination rate. It follows that at very high recombination rates, it would no longer be sensible to pursue tree-based inference, although even at ρ = 8%, we found that topological accuracy remained high (93% based on 100 simulations with constant population size).
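The contrast drawn above between stable and exponentially growing populations can be illustrated with a small coalescent simulation. The sketch below is an illustration only and is not the simulation used in the study: it assumes the standard time rescaling for a population of size N(t) = N0·exp(-g·t) going backward in time, measures time in units where two lineages coalesce at rate 1 at the present, and reports the fraction of total tree length that lies on tip (external) branches. The function names and the scaling of `g` are assumptions and may not match the parameterization behind g = 0, 1, and 10 in the text.

```python
import math
import random


def external_fraction(n, g, rng):
    """Simulate one coalescent tree of n tips; return tip length / total length.

    g is the exponential growth rate (g = 0 means constant size), in units
    where a pair of lineages coalesces at rate 1 at the present (assumed).
    """
    is_tip = [True] * n      # one flag per surviving lineage
    t = 0.0                  # current time, measured backward from the present
    total = 0.0
    external = 0.0
    while len(is_tip) > 1:
        k = len(is_tip)
        s = rng.expovariate(k * (k - 1) / 2.0)   # waiting time if size were constant
        # Rescale the waiting time for a population that shrinks backward in time.
        w = s if g == 0 else math.log1p(g * s * math.exp(-g * t)) / g
        t += w
        total += k * w
        i, j = rng.sample(range(k), 2)           # lineages that coalesce
        for idx in (i, j):
            if is_tip[idx]:
                external += t                    # a tip branch runs from 0 to t
        for idx in sorted((i, j), reverse=True):
            del is_tip[idx]
        is_tip.append(False)                     # the new ancestral lineage
    return external / total


def mean_external_fraction(n, g, reps=200, seed=1):
    """Average the tip-branch fraction over `reps` simulated trees."""
    rng = random.Random(seed)
    return sum(external_fraction(n, g, rng) for _ in range(reps)) / reps
```

In this sketch, `mean_external_fraction(30, 10.0)` is expected to exceed `mean_external_fraction(30, 0.0)`: under growth, more of the tree length sits on tip branches, which is the star-like shape that a growth model fits well.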

FIG 1

Effects of recombination in bacteria on phylogenetic tree topology and growth rate estimates. (a) The true clonal frame (left) and ML phylogenies constructed from all sites (center) and only nonhomoplastic sites (right) representing the evolutionary history of a population of 100 bacterial genomes of 1 million base pairs. The recombination rate (ρ) and substitution rate (θ) were fixed at 1%. The number of homoplasies per branch is shown for the center tree. (b) Estimates of branch accuracy for trees reconstructed using ML, BEAST, NJ, and UPGMA at three different values of ρ. The means and standard errors are based on 1,000 simulations of a demographic model of constant population size. (c) Mean posterior estimates of the exponential growth rate parameter (g) from BEAST, averaged over analyses of 1,000 simulated data sets. Data were simulated under a demographic model of constant population size (gray), low exponential growth (blue), and high exponential growth (red) and at three different values of ρ. Error bars represent the mean 95% confidence intervals. Estimates from analyses using either all sites in the sequence alignment (filled triangles) or only those sites without homoplasies (open circles) are plotted. Black dashed horizontal lines represent the true value of the exponential growth rate parameter used in the simulations.

In contrast to the robustness of the phylogenetic topology, recombination gave rise to a spurious or inflated signal of demographic growth when we fitted a model of exponential growth using BEAST (Fig. 1c). In simulations under high, low, and no growth (exponential growth rate parameter [g] = 10, 1, and 0, respectively), growth rates were systematically overestimated, even though tree topology remained accurate (>98% for ρ = 0.1% and 1%; see Fig. S4 in the supplemental material).
Some authors have recommended the removal of recombining sites to ameliorate their detrimental effect on phylogenetic analysis, in particular the tendency for recombination to produce a spurious signal of exponential growth (22–24). Recombination generates various signatures including homoplasy, in which the same substitution is observed in different parts of the tree. Homoplasy can be generated by repeat and back mutation, but it also results from reshuffling diversity among ancestral lineages by recombination, so that excess homoplasy is indicative of levels of recombination sufficient to cause problems for phylogenetic inference (25).

We investigated whether removing homoplastic sites improved the estimation of exponential growth rates by BEAST. We found that removing homoplasies actually exacerbated the spurious signal of demographic growth generated by recombination (Fig. 1c), because older recombination events were more likely to be detected as homoplasies. This led to preferential removal of substitutions from the deep branches of the tree, producing trees that appeared even more star-like (Fig. 1a). The magnitude of the effect increased with higher recombination rates, producing 95% confidence intervals that excluded the true growth rate. The number of homoplastic sites removed due to repeat and back mutation amounted to 0.2% of the genome and had a negligible effect on the estimation of growth rates (observed in the absence of recombination in Fig. 1c). We found that removal of homoplasies followed by reestimation of the phylogeny had limited effect on the accuracy of the topology itself (see Fig. S5 in the supplemental material).

In summary, our results show that the clonal frame topology is robustly reconstructed from bacterial whole genomes by phylogenetic methods even in the presence of recombination, but the branch lengths of the clonal frame are not.
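The notion of a homoplastic site used above can be sketched with a parsimony count: a site is flagged when the minimum number of substitutions needed to explain it on the tree exceeds the number of distinct states minus one. The Python sketch below uses Fitch parsimony on a strictly binary tree written as nested tuples; this criterion, the tree representation, and the function names are assumptions for illustration, and the method actually used to identify homoplasies is the one described in Text S1.

```python
def fitch(node, site):
    """Fitch parsimony on a binary tree.

    `node` is a tip name or a pair of subtrees; `site` maps tip name to base.
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(node, str):
        return {site[node]}, 0
    left, right = node                      # binary trees only (assumed)
    left_states, left_changes = fitch(left, site)
    right_states, right_changes = fitch(right, site)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1


def homoplastic_sites(tree, alignment):
    """Return positions needing more changes than (distinct states - 1)."""
    length = len(next(iter(alignment.values())))
    flagged = []
    for pos in range(length):
        site = {name: seq[pos] for name, seq in alignment.items()}
        _, changes = fitch(tree, site)
        if changes > len(set(site.values())) - 1:
            flagged.append(pos)
    return flagged


# Toy example: position 1 requires two changes for two states.
tree = (("A", "B"), ("C", "D"))
alignment = {"A": "AAG", "B": "ATG", "C": "TAG", "D": "TTG"}
flagged = homoplastic_sites(tree, alignment)   # [1]
```

In the toy alignment, position 0 is explained by a single change on the branch separating (A, B) from (C, D), position 2 is invariant, and position 1 needs two changes for two states, so only position 1 is flagged.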
Removal of recombining sites exacerbates branch length distortion, because older events are easier to detect than young ones, meaning that phylogenetic-based demographic inference should still be viewed with caution in recombining species.

Supplemental material

Text S1: Supplemental methods. Additional details of the simulations, phylogenetic tree construction, calculation of tree accuracy, and identification of recombining sites are given. (PDF file, 0.1 MB)

Figure S1: Branch accuracy for trees reconstructed using ML, BEAST, NJ, and UPGMA at three different values of the recombination rate (ρ) and growth rate (g). Branches are partitioned into three intervals according to their length, selected in an attempt to keep the number of branches within intervals the same (mean of 65.3 branches). The mean number of branches per interval for each method is displayed above each bar. Means and standard errors are based on analyses of 1,000 simulations under a demographic model of constant population size (g = 0) (gray), low exponential growth (g = 1) (blue), and high exponential growth (g = 10) (red). (PDF file, 0.05 MB)

Figure S2: Branch accuracy for trees reconstructed using ML, BEAST, NJ, and UPGMA at three different values of the recombination rate (ρ) and growth rate (g). Branches are partitioned into three intervals according to the distance between the end of the branch and the root node. These intervals differed between growth rates in an attempt to keep the number of branches within intervals the same (mean of 65.3 branches). The mean number of branches per interval for each method is displayed above each bar. Means and standard errors are based on analyses of 1,000 simulations under a demographic model of constant population size (g = 0) (gray), low exponential growth (g = 1) (blue), and high exponential growth (g = 10) (red). (PDF file, 0.04 MB)

Figure S3: Accuracy of branches in estimated trees partitioned by the accuracy of either the bootstrap value or posterior probability support for each branch. Trees were reconstructed using ML, BEAST, NJ, and UPGMA at three different values of the recombination rate (ρ). The expected linear relationship between support and accuracy is plotted in blue. Means and standard errors are based on analyses of 1,000 simulations under a demographic model of constant population size. (PDF file, 0.1 MB)

Figure S4: Branch accuracy for trees reconstructed using ML, BEAST, NJ, and UPGMA at three different values of the recombination rate (ρ) and growth rate. Means and standard errors are based on analyses of 1,000 simulations under a demographic model of constant population size (g = 0) (gray), low exponential growth (g = 1) (blue), and high exponential growth (g = 10) (red). (PDF file, 0.04 MB)

Figure S5: Branch accuracy for trees reconstructed using ML, BEAST, NJ, and UPGMA from genome sequence alignments after the removal of homoplasies. Data were simulated under three different values of the recombination rate (ρ) and growth rate (g). Means and standard errors are based on analyses of 1,000 simulations under a demographic model of constant population size (g = 0) (gray), low exponential growth (g = 1) (blue), and high exponential growth (g = 10) (red). The accuracy of NJ trees is marginally greater than for those constructed from all sites (see Fig. S3 in the supplemental material), due to the alignment being enriched for sites supporting the clonal frame. However, the accuracy of UPGMA trees is lower after the removal of homoplasies when ρ = 1%. (PDF file, 0.03 MB)

24 in total

1. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites.

Authors: Maria Anisimova; Rasmus Nielsen; Ziheng Yang
Journal: Genetics Date: 2003-07 Impact factor: 4.562

2. Inference of homologous recombination in bacteria using whole-genome sequences.

Authors: Xavier Didelot; Daniel Lawson; Aaron Darling; Daniel Falush
Journal: Genetics Date: 2010-10-05 Impact factor: 4.562

Review 3. Phylogenetic inference using whole genomes.

Authors: Bruce Rannala; Ziheng Yang
Journal: Annu Rev Genomics Hum Genet Date: 2008 Impact factor: 8.929

4. Molecular evolution of the Escherichia coli chromosome. III. Clonal frames.

Authors: R Milkman; M M Bridges
Journal: Genetics Date: 1990-11 Impact factor: 4.562

5. Detecting recombination from gene trees.

Authors: J Maynard Smith; N H Smith
Journal: Mol Biol Evol Date: 1998-05 Impact factor: 16.240

6. Unifying vertical and nonvertical evolution: a stochastic ARG-based framework.

Authors: Erik W Bloomquist; Marc A Suchard
Journal: Syst Biol Date: 2009-11-09 Impact factor: 15.683

7. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Authors: N Saitou; M Nei
Journal: Mol Biol Evol Date: 1987-07 Impact factor: 16.240

8. How clonal are bacteria?

Authors: J M Smith; N H Smith; M O'Rourke; B G Spratt
Journal: Proc Natl Acad Sci U S A Date: 1993-05-15 Impact factor: 11.205

9. Yersinia pestis genome sequencing identifies patterns of global phylogenetic diversity.

Authors: Giovanna Morelli; Yajun Song; Camila J Mazzoni; Mark Eppinger; Philippe Roumagnac; David M Wagner; Mirjam Feldkamp; Barica Kusecek; Amy J Vogler; Yanjun Li; Yujun Cui; Nicholas R Thomson; Thibaut Jombart; Raphael Leblois; Peter Lichtner; Lila Rahalison; Jeannine M Petersen; Francois Balloux; Paul Keim; Thierry Wirth; Jacques Ravel; Ruifu Yang; Elisabeth Carniel; Mark Achtman
Journal: Nat Genet Date: 2010-10-31 Impact factor: 38.330

10. Rapid typing of Coxiella burnetii.

Authors: Heidie M Hornstra; Rachael A Priestley; Shalamar M Georgia; Sergey Kachur; Dawn N Birdsell; Remy Hilsabeck; Lauren T Gates; James E Samuel; Robert A Heinzen; Gilbert J Kersh; Paul Keim; Robert F Massung; Talima Pearson
Journal: PLoS One Date: 2011-11-02 Impact factor: 3.240
