Emily Jane McTavish, James Pettengill, Steven Davis, Hugh Rand, Errol Strain, Marc Allard, Ruth E Timme.
Abstract
BACKGROUND: Using phylogenomic analysis tools for tracking pathogens has become standard practice in academia, public health agencies, and large industries. Starting from the same raw-read genomic data, several different approaches are used to infer phylogenetic trees. These include many different SNP pipelines, wgMLST approaches, k-mer algorithms, whole-genome alignment, and others. Each has advantages and disadvantages: some have been extensively validated, some are faster, and some have higher resolution. A few of these analysis approaches are well integrated into the regulatory process of US Federal agencies (e.g. the FDA's SNP pipeline for tracking foodborne pathogens). However, despite extensive validation on benchmark datasets and comparison with other pipelines, we lack methods for fully exploring how the multiple parameter values in each pipeline affect whether the correct phylogenetic tree is recovered.
Keywords: Genomics; Phylogenetics; Simulation
Year: 2017 PMID: 28320310 PMCID: PMC5359950 DOI: 10.1186/s12859-017-1592-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schematic of the TreeToReads procedure. a Input Newick tree file and background/anchor genome. b Simulate mutations across taxa according to defined set of parameters. c Simulate raw reads (fastq files)
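As a rough illustration of steps a and b in the schematic, the following toy sketch places mutations at a fixed set of sites in an anchor sequence and evolves them down a tree. It is not the TreeToReads implementation. The nested-list tree (a hypothetical stand-in for a parsed Newick file), the function names `evolve` and `simulate_tips`, and the simple per-branch change probability are all assumptions made for illustration; no GTR model is used, and read simulation (step c) is not covered.

```python
import math
import random

BASES = "ACGT"

def evolve(node, seq, sites, rng, out):
    """Copy `seq` down the tree, mutating the chosen sites on each branch."""
    if isinstance(node, str):              # tip: record its final sequence
        out[node] = "".join(seq)
        return
    for child, length in node:             # internal node: list of (child, branch_length)
        child_seq = list(seq)
        p_change = 1.0 - math.exp(-length)  # toy per-site change probability (assumption)
        for i in sites:
            if rng.random() < p_change:
                # draw a replacement base different from the current one
                child_seq[i] = rng.choice([b for b in BASES if b != child_seq[i]])
        evolve(child, child_seq, sites, rng, out)

def simulate_tips(tree, anchor, n_sites, seed=0):
    """Pick `n_sites` candidate sites in the anchor and return (sites, {tip: sequence})."""
    rng = random.Random(seed)
    sites = sorted(rng.sample(range(len(anchor)), n_sites))
    out = {}
    evolve(tree, list(anchor), sites, rng, out)
    return sites, out
```

In this sketch a chosen site may end up unchanged in every tip, so the number of observed variable sites can be smaller than `n_sites`. Every position outside the chosen sites is identical to the anchor in all tips.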
Simulation parameter values and mean of inferred parameter values across 5 simulation and inference replicates
| Case Study 1 – | | | | |
| | | Input | Inferred (close ref) | Inferred (far ref) |
| Number of SNPs | | 100 | 98.6 (1.4) | 38,384 (mutations); 154 (3) variable sites |
| RF distance | | - | 33 (1) | 33 (1) |
| Outbreak clade monophyletic | | - | 5/5 | 5/5 |
| Base frequencies | A | 0.238787 | 0.24 (0.03) | 0.25 (0.0) |
| | C | 0.261361 | 0.26 (0.03) | 0.25 (0.0) |
| | G | 0.261132 | 0.24 (0.03) | 0.25 (0.0) |
| | T | 0.238720 | 0.27 (0.03) | 0.24 (0.0005) |
| GTR rate matrix | ac | 0.4 | 0.1 (0.7) | 1.5 (0.6) |
| | ag | 3.0 | 91 (160) | 5.4 (2.0) |
| | at | 0.5 | 21 (39) | 1.0 (0.6) |
| | cg | 0.1 | 0.1 (0.2) | 1.5 (0.7) |
| | ct | 4.4 | 66 (112) | 5.3 (1.9) |
| | gt | 1 | 1.0 (0.0) | 1.0 (0.0) |
| Coverage | | 20 | 17.04 (0) | 15.45 (0) |
| Case Study 2 – | | | |
| | | Input | Inferred |
| Number of SNPs | | 500 | 494.8 (1.6) |
| RF distance | | - | 0.0 (0) |
| Base frequencies | A | 0.311521 | 0.30 (0.01) |
| | C | 0.190709 | 0.20 (0.01) |
| | G | 0.189125 | 0.20 (0.007) |
| | T | 0.308645 | 0.29 (0.007) |
| GTR rate matrix | ac | 1.2070 | 1.1 (0.1) |
| | ag | 5.9306 | 5.3 (0.8) |
| | at | 1.7425 | 1.8 (0.2) |
| | cg | 0.4610 | 0.3 (0.2) |
| | ct | 5.1238 | 5.1 (0.7) |
| | gt | 1 | 1.0 (0.0) |
| Coverage | | 40 | 37.49 (0) |
| Difference from INDELible gap distribution | | 0 | 0.0 (0) |
Standard deviations of parameter values shown in parentheses
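The RF distance rows in the table compare the topology of the input tree with that of the inferred tree. As a minimal sketch of how an unweighted Robinson-Foulds distance can be computed, the code below counts the bipartitions found in one tree but not the other. The nested-tuple tree representation and the function names `splits` and `rf_distance` are assumptions for illustration, both trees are assumed to share the same tip labels, and this is not necessarily how the values in the table were obtained.

```python
def splits(tree):
    """Return the non-trivial bipartitions of a nested-tuple tree as frozensets."""
    clades = set()

    def tips_below(node):
        if isinstance(node, str):                      # tip label
            return frozenset([node])
        clade = frozenset().union(*(tips_below(c) for c in node))
        clades.add(clade)
        return clade

    all_tips = tips_below(tree)
    ref = min(all_tips)                                # fixed tip used to orient each split
    out = set()
    for clade in clades:
        # represent each bipartition by the side that does NOT contain `ref`
        side = all_tips - clade if ref in clade else clade
        if 1 < len(side) < len(all_tips) - 1:          # drop trivial splits
            out.add(side)
    return out

def rf_distance(tree_a, tree_b):
    """Unweighted RF distance: number of bipartitions present in exactly one tree."""
    return len(splits(tree_a) ^ splits(tree_b))
```

Because each split is oriented relative to a fixed tip, the position of the root does not affect the result: `(("A", "B"), ("C", "D"))` and `("A", ("B", ("C", "D")))` have distance 0, while `(("A", "B"), ("C", "D"))` and `(("A", "C"), ("B", "D"))` have distance 2.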
Fig. 2 Input and inferred trees for two case studies, Salmonella enterica Bareilly and Listeria monocytogenes. Variable sites called for Salmonella enterica Bareilly by mapping reads to the anchor genome as a reference (close reference), and to a reference genome outside of the sampled tree (distant reference). Listeria monocytogenes reads were mapped to the anchor genome