Aaron Petkau1, Philip Mabon1, Cameron Sieffert1, Natalie C Knox1, Jennifer Cabral1, Mariam Iskander2, Mark Iskander2, Kelly Weedmark3, Rahat Zaheer4, Lee S Katz5, Celine Nadon1, Aleisha Reimer1, Eduardo Taboada1, Robert G Beiko6, William Hsiao7, Fiona Brinkman8, Morag Graham1, Gary Van Domselaar1.
Abstract
The recent widespread application of whole-genome sequencing (WGS) for microbial disease investigations has spurred the development of new bioinformatics tools, including a notable proliferation of phylogenomics pipelines designed for infectious disease surveillance and outbreak investigation. Transitioning the use of WGS data out of the research laboratory and into the front lines of surveillance and outbreak response requires user-friendly, reproducible and scalable pipelines that have been well validated. Single Nucleotide Variant Phylogenomics (SNVPhyl) is a bioinformatics pipeline for identifying high-quality single-nucleotide variants (SNVs) and constructing a whole-genome phylogeny from a collection of WGS reads and a reference genome. Individual pipeline components are integrated into the Galaxy bioinformatics framework, enabling data analysis in a user-friendly, reproducible and scalable environment. We show that SNVPhyl can detect SNVs with high sensitivity and specificity, and identify and remove regions of high SNV density (indicative of recombination). SNVPhyl is able to correctly distinguish outbreak from non-outbreak isolates across a range of variant-calling settings, sequencing-coverage thresholds or in the presence of contamination. SNVPhyl is available as a Galaxy workflow, Docker and virtual machine images, and a Unix-based command-line application. SNVPhyl is released under the Apache 2.0 license and available at http://snvphyl.readthedocs.io/ or at https://github.com/phac-nml/snvphyl-galaxy.
Keywords: bacterial genomics; bioinformatics; genomic epidemiology; infectious disease surveillance; phylogenomics; single nucleotide variation detection
Year: 2017 PMID: 29026651 PMCID: PMC5628696 DOI: 10.1099/mgen.0.000116
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
A comparison of whole-genome phylogenetic software
| Name | Input* | Parallel computing† | Distribution‡ | Interface§ | Reference |
|---|---|---|---|---|---|
| CFSAN SNP pipeline | sr | mn, mt | Local | cl | [ |
| CSI phylogeny | sr, ag | na | Web | gui | [ |
| kSNP | sr, ag | mt | Local | cl | [ |
| Lyve-SET | sr, agr | mn, mt | Local | cl | [ |
| NASP | sr, ag | mn, mt | Local | cl | [ |
| Parsnp | ag | mt | Local | cl | [ |
| PhaME | sr, ag | mt | Local | cl | [ |
| REALPHY | sr, ag | mt | Web, local | gui, cl | [ |
| Snippy | sr, agr | mt | Local | cl | |
| SNVPhyl | sr | mn, mt | Local | gui, cl | |
*ag, assembled genome; agr, assembled genome supported by generating simulated reads; sr, sequence reads.
†mn, multi-node – provides capability to execute across multiple compute nodes; mt, multi-thread – provides multi-threading capability; na, not applicable (not locally installable).
‡Local, locally distributed and installable software; web, software provided as a web service.
§cl, command-line interface; gui, graphical user interface.
Fig. 1. (a) Overview of the SNVPhyl pipeline. Input to the pipeline is provided as a reference genome, a set of sequence reads for each isolate and an optional list of positions to mask from the final results. Repeat regions are identified on the reference genome, and reference mapping followed by variant calling is performed on the sequence reads. The resulting files are compiled to construct a SNV alignment and a list of identified SNVs, which are further processed to construct a SNV distance matrix, a maximum-likelihood phylogeny and a summary of the identified SNVs. The software or scripts used are given in parentheses below each stage. (b) An overview of the Mapping/Variant Calling stage of SNVPhyl. Variants are called using two separate software packages and compiled in the Variant Consolidation stage. As output, a list of the validated variant calls, regions with high-density SNVs, and quality information on the mean mapping coverage are produced and sent to further stages.
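The step from a SNV alignment to a SNV distance matrix can be illustrated with a minimal Python sketch. This is not SNVPhyl's own code: the function name `snv_distance_matrix`, the toy isolate names and the rule of skipping columns containing `-` or `N` are assumptions made for illustration only.

```python
from itertools import combinations


def snv_distance_matrix(alignment):
    """Count differing columns between each pair of aligned sequences.

    `alignment` maps an isolate name to its row of the SNV alignment.
    All rows must have the same length. Columns where either sequence
    has a gap or ambiguous base ('-' or 'N') are skipped (an assumption
    of this sketch, not a documented SNVPhyl rule).
    """
    names = sorted(alignment)
    if len({len(alignment[n]) for n in names}) > 1:
        raise ValueError("all sequences must have the same length")
    dist = {a: {b: 0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        d = sum(
            1
            for x, y in zip(alignment[a].upper(), alignment[b].upper())
            if x != y and x not in "-N" and y not in "-N"
        )
        dist[a][b] = dist[b][a] = d
    return dist


# Hypothetical toy alignment: three isolates, five SNV columns.
toy = {"isolateA": "ACGTA", "isolateB": "ACGTT", "isolateC": "TCGAT"}
matrix = snv_distance_matrix(toy)
```

In this toy case, `matrix["isolateA"]["isolateC"]` counts the three columns at which the two rows differ. The matrix is symmetric with a zero diagonal.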
SNV simulation results
| Comparison | No. of variant columns simulated | No. of non-variant columns | No. of true positives | No. of false positives | No. of true negatives | No. of false negatives | Specificity | Sensitivity |
|---|---|---|---|---|---|---|---|---|
| Valid SNVs* | 10 000 | 5 584 477 | 9116 | 0 | 5 575 361 | 884 | 1.0 | 0.91 |
| All SNVs† | 10 000 | 5 584 477 | 9573 | 51 | 5 574 853 | 427 | 1.0 | 0.96 |
*Valid SNVs – the number of SNV-containing sites detected that passed all thresholds to be considered high quality for every isolate.
†All SNVs – all the SNV-containing sites identified by SNVPhyl, including those where at least one isolate did not have a high-quality base call or sites that were masked by the pipeline.
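The sensitivity and specificity columns follow from the counts in the table using the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The short Python sketch below recomputes them. The helper names and the `rows` dictionary are illustrative, and rounding to two decimal places is an assumption chosen to match the precision shown in the table.

```python
def sensitivity(tp, fn):
    """Fraction of simulated variant columns that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-variant columns correctly left uncalled."""
    return tn / (tn + fp)


# Counts copied from the SNV simulation results table.
rows = {
    "Valid SNVs": dict(tp=9116, fp=0, tn=5575361, fn=884),
    "All SNVs": dict(tp=9573, fp=51, tn=5574853, fn=427),
}

results = {
    name: (
        round(sensitivity(r["tp"], r["fn"]), 2),
        round(specificity(r["tn"], r["fp"]), 2),
    )
    for name, r in rows.items()
}

for name, (sens, spec) in results.items():
    print(f"{name}: sensitivity={sens}, specificity={spec}")
```

The 51 false positives in the "All SNVs" row are too few, relative to more than five million true negatives, to move the specificity away from 1.0 at this precision.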
A comparison of the SNVPhyl variant density filtering algorithm to the Gubbins system for recombination detection
| Case | No. of true positives | No. of false positives | No. of true negatives | No. of false negatives | Sensitivity | Specificity | K tree score |
|---|---|---|---|---|---|---|---|
| No DF* | 142 | 2159 | 2 218 849 | 23 | 0.861 | 0.999 | 0.419 |
| 2 in 20† | 142 | 565 | 2 220 443 | 23 | 0.861 | 1.000 | 0.425 |
| 2 in 100† | 142 | 155 | 2 220 853 | 23 | 0.861 | 1.000 | 0.377 |
| 2 in 500† | 133 | 12 | 2 221 005 | 32 | 0.806 | 1.000 | 0.045 |
| 2 in 1000† | 125 | 6 | 2 221 019 | 40 | 0.758 | 1.000 | 0.044 |
| 2 in 2000† | 111 | 3 | 2 221 036 | 54 | 0.673 | 1.000 | 0.063 |
| Gubbins/SNVPhyl‡ | 138 | 10 | 2 221 002 | 27 | 0.836 | 1.000 | 0.037 |
*No DF – a case of no SNV density filtering by SNVPhyl.
†X in Y – masking regions with a density of X variants in Y bases.
‡Gubbins/SNVPhyl – a whole-genome alignment generated from SNVs identified by SNVPhyl and run through Gubbins.
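The "X in Y" density thresholds can be read as a sliding check over sorted variant positions. The Python sketch below is one plausible interpretation and not SNVPhyl's actual implementation: the function name `high_density_regions`, its parameters, and the choice to report the span between the first and last variant of each dense run are all assumptions.

```python
def high_density_regions(positions, min_snvs=2, window=20):
    """Return merged, inclusive (start, end) regions in which at least
    `min_snvs` variant positions fall within `window` bases.

    Illustrative sketch only: the exact windowing used by SNVPhyl is
    not described in the text above.
    """
    pos = sorted(set(positions))
    regions = []
    for i in range(len(pos) - min_snvs + 1):
        start, end = pos[i], pos[i + min_snvs - 1]
        # These min_snvs consecutive variants span end - start + 1 bases.
        if end - start + 1 <= window:
            if regions and start <= regions[-1][1]:
                # Overlaps the previous dense region, so extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
    return regions


# Hypothetical variant positions on a reference genome.
snv_positions = [100, 110, 115, 500, 1000, 1015]
narrow = high_density_regions(snv_positions, min_snvs=2, window=20)
wide = high_density_regions(snv_positions, min_snvs=2, window=500)
```

Widening the window from 20 to 500 bases masks more of the genome in this toy example. That matches the trend in the table, where larger windows reduce false positives at the cost of some true positives.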
A comparison of the performance of SNVPhyl across a range of parameters and analysis scenarios
| No. | Scenario | Parameter/condition | hqSNV | % core* | Differentiated outbreaks |
|---|---|---|---|---|---|
| 1 | Minimum coverage | 5× | 317 | 95 | Yes |
| | | 10× | 301 | 92 | Yes |
| | | 15× | 262 | 81 | Yes |
| | | 20× | 165 | 54 | No |
| 2 | Subsample coverage level | 10׆ | 155 | 47 | No |
| | | 15׆ | 242 | 76 | Yes |
| | | 20׆ | 276 | 88 | Yes |
| | | 30׆ | 299 | 92 | Yes |
| 3 | Relative SNV abundance | 0.25 | 351 | 92 | No |
| | | 0.5 | 307 | 92 | No |
| | | 0.75 | 301 | 92 | Yes |
| | | 0.9 | 291 | 92 | Yes |
| 4 | Contamination | 5 %‡ | 298 | 92 | Yes |
| | | 10 %‡ | 292 | 92 | Yes |
| | | 20 %‡ | 260 | 92 | No |
| | | 30 %‡ | 231 | 92 | No |
*100 % core = 4 888 768 bp (percentage of reference genome identified as the core genome).
†These represent the mean coverage of one sample after subsampling reads and not the minimum base coverage parameter of SNVPhyl (which is fixed at 10×).
‡100 % contamination represents complete replacement of reads from SH13-001 (at 71× coverage) with SH12-001.