David H Wyllie, Nicholas Sanderson, Richard Myers, Tim Peto, Esther Robinson, Derrick W Crook, E Grace Smith, A Sarah Walker.
Abstract
Contact tracing requires reliable identification of closely related bacterial isolates. When we noticed the reporting of artifactual variation between Mycobacterium tuberculosis isolates during routine next-generation sequencing of Mycobacterium spp., we investigated its basis in 2,018 consecutive M. tuberculosis isolates. In the routine process used, clinical samples were decontaminated and inoculated into broth cultures; from positive broth cultures DNA was extracted and sequenced, reads were mapped, and consensus sequences were determined. We investigated the process of consensus sequence determination, which selects the most common nucleotide at each position. Having determined the high-quality read depth and depth of minor variants across 8,006 M. tuberculosis genomic regions, we quantified the relationship between the minor variant depth and the amount of nonmycobacterial bacterial DNA, which originates from commensal microbes killed during sample decontamination. In the presence of nonmycobacterial bacterial DNA, we found significant increases in minor variant frequencies, of more than 1.5-fold, in 242 regions covering 5.1% of the M. tuberculosis genome. Included within these were four high-variation regions strongly influenced by the amount of nonmycobacterial bacterial DNA. Excluding these four regions from pairwise distance comparisons reduced biologically implausible variation from 5.2% to 0% in an independent validation set derived from 226 individuals. Thus, we demonstrated an approach identifying critical genomic regions contributing to clinically relevant artifactual variation in bacterial similarity searches. The approach described monitors the outputs of the complex multistep laboratory and bioinformatics process, allows periodic process adjustments, and will have application to quality control of routine bacterial genomics.
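The consensus step and the minor variant depth described in the abstract can be illustrated with a minimal sketch. It assumes that per-position base counts from high-quality mapped reads are already available as dictionaries; the function names and the input layout are hypothetical and are not taken from the authors' pipeline.

```python
def consensus_and_minor(base_counts):
    """Return (consensus_base, depth, minor_depth) for one genome position.

    base_counts maps a nucleotide to the number of high-quality reads
    supporting it, e.g. {"A": 98, "G": 2}. This layout is an assumption.
    """
    depth = sum(base_counts.values())
    if depth == 0:
        # No usable reads: report an uncalled base.
        return "N", 0, 0
    # The consensus is the most common nucleotide; ties resolve to the
    # first base encountered, which is an arbitrary choice in this sketch.
    consensus_base, major_depth = max(base_counts.items(), key=lambda kv: kv[1])
    return consensus_base, depth, depth - major_depth


def region_minor_variant_frequency(positions):
    """Minor variant frequency of a region: minor reads / total reads."""
    total_depth = 0
    total_minor = 0
    for counts in positions:
        _, depth, minor = consensus_and_minor(counts)
        total_depth += depth
        total_minor += minor
    return total_minor / total_depth if total_depth else 0.0


# Illustrative, made-up counts for a three-position region.
region = [{"A": 98, "G": 2}, {"C": 100}, {"T": 95, "C": 5}]
frequency = region_minor_variant_frequency(region)
```

A region that attracts mismapped reads from nonmycobacterial DNA would show a raised frequency here even when the consensus base is unchanged, which is why the minor variant depth is a useful quantity to monitor.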
Keywords: Mycobacterium tuberculosis; artifact; reference mapping; relatedness; single nucleotide variation
Year: 2018 PMID: 29875188 PMCID: PMC6062814 DOI: 10.1128/JCM.00104-18
Source DB: PubMed Journal: J Clin Microbiol ISSN: 0095-1137 Impact factor: 5.948
FIG 1. Samples used and derivation and validation sets. Shown is a flowchart describing the samples used and the selection of derivation and validation sets.
FIG 2. Bioinformatics processes. Shown is a flow diagram illustrating the standard bioinformatics pipeline used, as well as the adaptive-masking process used to generate masks. Gray circles indicate links to a description of the process at https://github.com/davidhwyllie/adaptivemasking. findNeighbour2 is an open-source server-based system for monitoring single nucleotide variation (27).
FIG 3. Minor variant frequency and nonmycobacterial bacterial DNA quantities. The observed minor variant frequency for three regions of the M. tuberculosis genome (genes B55, eswX, and rrs) versus the proportion of reads of nonmycobacterial bacterial origin (as determined by Kraken) is shown for samples in the derivation set (n = 2,018). Panel A shows a dot plot, whereas in panel B, the proportion of reads of nonmycobacterial bacterial origin is stratified with 1%, 5%, and 20% boundaries. The number at each stratum refers to the number of samples with nonzero read depth in that region.
FIG 4. Modeling minor variant frequencies. For 8,006 genomic regions of the H37Rv reference genome, Poisson models were used to estimate the mean minor variant frequency. The estimated minor variant frequency when less than 1% nonmycobacterial bacterial DNA is present (n = 208 [2.6% of the regions]) is shown in panel A. The red line is a lognormal distribution with μ = log(minor variant frequency with <1% nonmycobacterial DNA) and σ = median absolute deviation [log(minor variant frequency with <1% nonmycobacterial DNA)]. In panel B the rate ratio estimates (i.e., the fold change associated with increases in nonmycobacterial bacterial DNA quantifications) for each gene are shown. Panel C shows the significance of a test comparing the log(rate ratio estimates) with zero, in the form of a volcano plot. The dashed lines in panels B and C correspond to a 50% increase in rate ratio.
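The rate ratio in the FIG 4 caption can be illustrated with a simplified two-group calculation. This is a sketch under stated assumptions, not the authors' model: it compares only a low-contamination group with a high-contamination group, for which the Poisson maximum-likelihood rate is simply minor variant reads divided by total read depth, and it uses a Wald test on the log rate ratio. The function name and the example counts are hypothetical.

```python
import math


def rate_ratio(minor_low, depth_low, minor_high, depth_high):
    """Two-group Poisson rate ratio with a Wald z statistic.

    minor_*: total minor variant read counts in a region (must be > 0).
    depth_*: total high-quality read depth in the same region.
    Returns (rate_ratio, z), where z tests log(rate_ratio) against zero.
    """
    rate_low = minor_low / depth_low
    rate_high = minor_high / depth_high
    ratio = rate_high / rate_low
    # Standard error of the log rate ratio for two Poisson counts.
    se_log = math.sqrt(1 / minor_low + 1 / minor_high)
    z = math.log(ratio) / se_log
    return ratio, z


# Made-up counts: the same read depth, three times as many minor reads
# in the high-contamination group.
ratio, z = rate_ratio(100, 100_000, 300, 100_000)
exceeds_threshold = ratio > 1.5  # the 50% increase marked by the dashed lines
```

Plotting `ratio` against a significance measure derived from `z` for every region would give a display in the spirit of the volcano plot in panel C.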
FIG 5. A distinct subset of genes is impacted by the quantity of nonmycobacterial DNA. (A) Fold change in minor variant frequency with >20% nonmycobacterial bacterial DNA present versus <1% nonmycobacterial bacterial DNA. Quadrant boundary markers correspond to (horizontal line) a 50% increase over <1% nonmycobacterial bacterial DNA and (vertical line) a minor variant frequency of 2.1 × 10⁻³. (B) Genes with elevated minor variant frequencies when nonmycobacterial bacterial DNA is low (<1%) or high (>20%) fall into mutually exclusive sets. (C) The number of bases represented by the deployed masking versus the deployed masking plus the genes in zones A, A plus D, A plus D plus C, and A plus D plus C plus B.
FIG 6. Impact of masking strategies on reported distances between closely related samples. SNV distances between pairs of M. tuberculosis genomes isolated from samples taken from the same individual within 7 days of each other were compared using different masking strategies. The top portion describes the published, deployed method of masking. In the panels below that, genes in the zones shown in Fig. 5B are additionally masked (i.e., excluded from pairwise comparisons).
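The effect of masking on pairwise distances, as compared in FIG 6, can be sketched as follows. The example assumes two equal-length consensus sequences on the same reference coordinates, half-open zero-based region coordinates, and that uncalled bases ("N") are skipped; all of these, along with the function names, are assumptions made for illustration and do not describe findNeighbour2 or the adaptivemasking code.

```python
def mask_from_regions(regions):
    """Expand (start, end) half-open, zero-based regions into masked positions."""
    masked = set()
    for start, end in regions:
        masked.update(range(start, end))
    return frozenset(masked)


def snv_distance(seq_a, seq_b, masked=frozenset()):
    """Count differing positions between two consensus sequences.

    Positions in `masked` and positions where either base is "N"
    are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("consensus sequences must have equal length")
    distance = 0
    for pos, (a, b) in enumerate(zip(seq_a, seq_b)):
        if pos in masked or a == "N" or b == "N":
            continue
        if a != b:
            distance += 1
    return distance


# Toy sequences: they differ at positions 3 and 6.
seq_a = "ACGTACGT"
seq_b = "ACGAACTT"
unmasked = snv_distance(seq_a, seq_b)
masked = snv_distance(seq_a, seq_b, mask_from_regions([(6, 8)]))
```

If position 6 lay in a region prone to contamination-driven artifacts, masking it would reduce the reported distance between two samples from the same individual, which is the kind of change the figure evaluates across masking strategies.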