Michael A Martin, Robyn S Lee, Lauren A Cowley, Jennifer L Gardy, William P Hanage.
Abstract
Whole genome sequencing in conjunction with traditional epidemiology has been used to reconstruct transmission networks of Mycobacterium tuberculosis during outbreaks. Given its low mutation rate, genetic diversity within M. tuberculosis outbreaks can be extremely limited, making it difficult to determine precisely who transmitted to whom. In addition to consensus SNPs (cSNPs), examining heterogeneous alleles (hSNPs) has been proposed to improve resolution. However, few studies have examined the potential biases in detecting these hSNPs. Here, we analysed genome sequence data from 25 specimens from British Columbia, Canada. Specimens were sequenced to a depth of 112-296×. We observed biases in read depth, base quality, strand distribution and read placement where possible hSNPs were initially identified, so we applied conservative filters to reduce false positives. Overall, there was phylogenetic concordance between the observed 2542 cSNP and 63 hSNP loci. Furthermore, we identified hSNPs shared exclusively by epidemiologically linked patients, supporting their use in transmission inferences. We conclude that hSNPs may add resolution to transmission networks, particularly where the overall genetic diversity is low.
Keywords: Mycobacterium tuberculosis; genomic epidemiology; transmission; whole genome sequencing; within-host diversity
Year: 2018 PMID: 30303479 PMCID: PMC6249434 DOI: 10.1099/mgen.0.000217
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Schematic description of our final analysis pipeline (Supplementary Methods). ADF, high-quality allelic depth on the forward strand; ADR, high-quality allelic depth on the reverse strand; BAM, binary alignment map; bp, base pair; DP, high-quality read depth; Indels, small insertions and deletions; MQ, Phred-scaled average mapping quality; QUAL, Phred-scaled base quality score; SAM, sequence alignment map; SP, Phred-scaled strand bias P-value; PE, proline-glutamic acid; PPE, proline-proline-glutamic acid; PE_PGRS, proline-glutamic acid_polymorphic guanine-cytosine-rich sequence; RF, read frequency (high-quality variant reads/total high-quality reads).
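The read frequency (RF) defined in the caption above can be computed directly from the per-strand allelic depths. The following is a minimal sketch, assuming ADF and ADR are supplied as lists with the reference allele first and the variant allele second. The function names and the 5 %/95 % cut-offs are hypothetical and chosen only for illustration; they are not the thresholds applied in this study.

```python
def read_frequency(adf, adr, allele_index=1):
    """Return RF for one allele: its reads on both strands / all reads at the site.

    adf, adr: high-quality allelic depths on the forward and reverse strands,
    assumed to be ordered reference first, then variant allele(s).
    """
    total = sum(adf) + sum(adr)
    if total == 0:
        raise ValueError("no high-quality reads at this site")
    return (adf[allele_index] + adr[allele_index]) / total


def classify_site(rf, low=0.05, high=0.95):
    """Label a site from its RF using hypothetical cut-offs (not the study's values)."""
    if rf >= high:
        return "cSNP"
    if rf >= low:
        return "hSNP"
    return "reference"


# Example site: 60 reference and 20 variant reads forward, 55 and 25 reverse.
rf = read_frequency([60, 20], [55, 25])
label = classify_site(rf)
```

For the example site, 45 of 160 high-quality reads support the variant, so the site would be treated as a candidate hSNP under these illustrative cut-offs.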
Quality metrics comparing consensus and heterogeneous SNPs after the final filtering protocol
DP, high-quality read depth; QUAL, recalibrated base quality score; MQ, average mapping quality; SP, Phred-scaled strand bias P-value; BQB, Mann–Whitney U test of base quality bias; MQB, Mann–Whitney U test of mapping quality bias; RPB, Mann–Whitney U test of read position bias.
| Metric | cSNP mean | cSNP SD | hSNP mean | hSNP SD |
|---|---|---|---|---|
| DP | 161.84 | 41.34 | 232.84 | 88.17 |
| QUAL | 228.00 | 0.36 | 209.13 | 34.66 |
| MQ | 59.71 | 1.95 | 59.42 | 1.57 |
| SP | 0.16 | 1.02 | 10.44 | 11.10 |
| BQB* | 0.98 | 0.09 | 0.68 | 0.30 |
| MQB* | 1.00 | 0.03 | 0.91 | 0.22 |
| RPB* | 0.98 | 0.10 | 0.28 | 0.10 |
*As BQB, MQB and RPB are only defined at positions with reference and variant reads, we assume an RPB value of 1.0 for SNPs with 100 % variant reads
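The footnote's substitution rule can be illustrated with a short sketch: sites where a bias statistic is undefined (no reference reads) are assigned 1.0 before the summary is computed. The function name is hypothetical, undefined values are represented here as `None`, and the use of the sample standard deviation is an assumption, since the source does not state which estimator was used.

```python
from statistics import mean, stdev


def summarise_bias(values, default=1.0):
    """Mean and sample SD of a bias statistic across sites.

    Undefined entries (None), i.e. sites with 100 % variant reads,
    are replaced by `default` before summarising.
    """
    filled = [default if v is None else v for v in values]
    return mean(filled), stdev(filled)


# Three fixed sites with no reference reads and one site with a defined value.
rpb_mean, rpb_sd = summarise_bias([None, None, 0.9, None])
```

Because most cSNP sites carry only variant reads, this substitution pulls the cSNP means for these statistics towards 1.0, which is worth keeping in mind when comparing them with the hSNP columns.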
Fig. 2. Maximum-likelihood (ML) phylogeny generated using an alignment of 2542 cSNPs, rooted on the H37Rv reference (REF). TVMe+ASC was identified as the best-fit model based on the Bayesian information criterion. Ultrafast bootstrap support values are annotated in blue (support values >95 % indicate high confidence) and epidemiological data provided by the BCCDC (British Columbia Centre for Disease Control) are indicated in the top left. Branches without annotation are not epidemiologically linked to other cases in the dataset. Each column at the right represents an informative hSNP as compared to H37Rv, ordered by position in the genome and coloured by supporting variant read frequency percentage. The scale bar corresponds to the number of substitutions per site.
Fig. 3. Tanglegram comparing the topology of a cSNP-defined ML phylogeny (left) and an hSNP-defined ML phylogeny (right). Phylogenies were generated with IQ-TREE using automatic model selection based on the Bayesian information criterion (cSNP phylogeny: TVMe+ASC; hSNP phylogeny: K2P; hSNP-constrained cSNP phylogeny: TVM+F+ASC+G4). Lineage 4 specimens are outlined in blue. Kendall–Colijn Euclidean distance is indicated at the bottom.