Boas C L van der Putten, Niek A H Huijsmans, Daniel R Mende, Constance Schultsz.
Abstract
Phylogenetic analyses are widely used in microbiological research, for example to trace the progression of bacterial outbreaks based on whole-genome sequencing data. In practice, multiple analysis steps such as de novo assembly, alignment and phylogenetic inference are combined to form phylogenetic workflows. Comprehensive benchmarking of the accuracy of complete phylogenetic workflows is lacking. To benchmark different phylogenetic workflows, we simulated bacterial evolution under a wide range of evolutionary models, varying the relative rates of substitution, insertion, deletion, gene duplication, gene loss and lateral gene transfer events. The generated datasets corresponded to the genetic diversity usually observed within bacterial species (≥95 % average nucleotide identity). We replicated each simulation three times to assess replicability. In total, we benchmarked 19 distinct phylogenetic workflows using eight different simulated datasets. We found that recently developed k-mer alignment methods such as kSNP and ska achieve accuracy similar to that of reference mapping. The high accuracy of k-mer alignment methods can be explained by the large fractions of genomes these methods can align, relative to other approaches. We also found that the choice of de novo assembly algorithm influences the accuracy of phylogenetic reconstruction, with workflows employing SPAdes or skesa outperforming those employing Velvet. Finally, we found that the results of phylogenetic benchmarking are highly variable between replicates. We conclude that for phylogenomic reconstruction, k-mer alignment methods are relevant alternatives to reference mapping at the species level, especially in the absence of suitable reference genomes. We show de novo genome assembly accuracy to be an underappreciated parameter required for accurate phylogenomic reconstruction.
Keywords: benchmarking study; in silico evolution; phylogenetics; simulation
Year: 2022 PMID: 35290758 PMCID: PMC9176278 DOI: 10.1099/mgen.0.000799
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Overview of this study. (a) Simulation of the in silico evolution. The K-12 MG1655 genome is evolved in silico according to a phylogeny (which provides the genetic distances) and a set of parameters controlling the rates of genetic events (which determine the genetic events that produce those distances). The resulting genomes are depicted by coloured complete genome graphs visualized in Bandage [47]. The complete genomes are subsequently shredded into sequencing reads. (b) Phylogenetic workflows. Generated sequencing reads are assembled into draft genomes (coloured draft genome graphs) or directly mapped onto the ancestral genome. From the resulting alignments, phylogenetic trees are inferred using iq-tree.
Fig. 2. Kendall–Colijn metrics and Robinson–Foulds distances per phylogenetic workflow across eight simulations. Displayed distances are calculated between the ground truth phylogeny and the phylogeny produced by the relevant workflow. Generated using SuperPlotsOfData, and ordered by median. Large circles indicate the median of replicates. Small circles indicate the individual measurements for each replicate.
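To make the Robinson–Foulds distance in this caption concrete, the following is a minimal sketch, not the implementation used in the study. It assumes trees are given as nested tuples of taxon names (a hypothetical representation chosen to avoid a Newick parser) and counts the non-trivial bipartitions found in one tree but not the other. The Kendall–Colijn metric is not covered here.

```python
def collect_clades(tree, clades):
    """Return the leaf set under `tree`, appending each internal node's leaf set to `clades`."""
    if isinstance(tree, str):  # a leaf is a plain taxon name
        return frozenset([tree])
    leaves = frozenset()
    for child in tree:
        leaves |= collect_clades(child, clades)
    clades.append(leaves)
    return leaves


def nontrivial_splits(tree):
    """Return (all taxa, set of non-trivial bipartitions) for a nested-tuple tree."""
    clades = []
    all_leaves = collect_clades(tree, clades)
    reference = min(all_leaves)  # fixed taxon used to pick a canonical side of each split
    splits = set()
    for clade in clades:
        rest = all_leaves - clade
        if len(clade) < 2 or len(rest) < 2:
            continue  # splits separating fewer than two taxa carry no topology information
        # Store the side that does NOT contain the reference taxon.
        splits.add(rest if reference in clade else clade)
    return all_leaves, splits


def robinson_foulds(tree_a, tree_b):
    """Unnormalised Robinson-Foulds distance between two trees on the same taxa."""
    leaves_a, splits_a = nontrivial_splits(tree_a)
    leaves_b, splits_b = nontrivial_splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same taxa")
    return len(splits_a ^ splits_b)


# Toy example (taxon names are illustrative, not from the study).
truth = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(truth, inferred))  # 2
```

A distance of 0 means both trees contain exactly the same bipartitions; each misplaced grouping adds to the count, which is why lower values indicate a workflow that recovers the ground truth topology more closely.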
Fig. 3. Kendall–Colijn metrics and Robinson–Foulds distances per de novo assembly algorithm used in workflows, across eight simulations. Displayed distances are calculated between the ground truth phylogeny and the phylogeny produced by the relevant workflow. Generated using SuperPlotsOfData, and ordered alphabetically. Large circles indicate the median of replicates. Small circles indicate the individual measurements for each replicate.
Fig. 4. Count of informative sites in the alignment plotted against Kendall–Colijn metric, with a linear model fitted (shading indicates 95 % confidence interval). Pearson’s rho and associated P value are shown.
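The two quantities in this caption can be sketched as follows. The caption does not define "informative sites"; this sketch assumes the common parsimony-informative definition (a column in which at least two different bases each occur in at least two sequences), and both function names are hypothetical. The correlation is the standard Pearson coefficient, computed by hand to keep the block dependency-free.

```python
from collections import Counter
from math import sqrt

VALID_BASES = set("ACGT")


def count_informative_sites(alignment):
    """Count parsimony-informative columns in a list of equal-length aligned sequences.

    Assumption: gaps and ambiguous characters are ignored when counting bases.
    """
    informative = 0
    for column in zip(*alignment):
        counts = Counter(base.upper() for base in column if base.upper() in VALID_BASES)
        # Informative: at least two distinct bases, each seen in at least two sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return informative


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Toy alignment (illustrative only): only the second column is informative.
toy_alignment = ["ACGTA", "ACGTA", "ATGCA", "ATG-C"]
print(count_informative_sites(toy_alignment))  # 1
```

In a benchmark of this kind, `count_informative_sites` would be evaluated once per workflow alignment and `pearson_r` applied to the resulting counts paired with the corresponding tree distances.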
Fig. 5. Differences between technical replicates for identical workflows across identical simulations, differing only in the starting seed for the simulation. Workflows including MLST were excluded. Generated using SuperPlotsOfData.