Jason W Sahl, Adam J Vazquez, Carina M Hall, Joseph D Busch, Apichai Tuanyok, Mark Mayo, James M Schupp, Madeline Lummis, Talima Pearson, Kenzie Shippy, Rebecca E Colman, Christopher J Allender, Vanessa Theobald, Derek S Sarovich, Erin P Price, Alex Hutcheson, Jonas Korlach, John J LiPuma, Jason Ladner, Sean Lovett, Galina Koroleva, Gustavo Palacios, Direk Limmathurotsakul, Vanaporn Wuthiekanun, Gumphol Wongsuwan, Bart J Currie, Paul Keim, David M Wagner.
Abstract
Whole-genome sequence (WGS) data are commonly used to design diagnostic targets for the identification of bacterial pathogens. To do this effectively, genomics databases must be comprehensive enough to identify the strict core genome that is specific to the target pathogen. As additional genomes are analyzed, the core genome shrinks and the target-specific regions erode because of commonality with related species, potentially resulting in false positives and/or false negatives. IMPORTANCE: A comparative analysis of 1,130 Burkholderia genomes identified unique markers for many named species, including the human pathogens B. pseudomallei and B. mallei. Due to core genome reduction and signature erosion, only 38 targets specific to B. pseudomallei/mallei were identified. When only public genomes were used, a larger number of markers was identified because of undersampling, and this larger number represents the potential for false positives. This analysis has implications for the design of diagnostics for other species where the genomic space of the target and/or closely related species is not well defined.
Year: 2016 PMID: 27651357 PMCID: PMC5030356 DOI: 10.1128/mBio.00846-16
Source DB: PubMed Journal: mBio Impact factor: 7.867
Summary of new genomes sequenced as part of this study
| Clade | No. of genomes |
|---|---|
|  | 8 |
|  | 1 |
|  | 4 |
|  | 78 |
|  | 12 |
|  | 1 |
|  | 5 |
|  | 2 |
|  | 2 |
|  | 1 |
|  | 14 |
|  | 2 |
| Putative species 1 | 3 |
| Putative species 2 | 4 |
| Putative species 3 | 10 |
| Putative species 4 | 7 |
| Putative species 5 | 8 |
|  | 256 |
|  | 9 |
|  | 1 |
|  | 2 |
|  | 67 |
|  | 8 |
|  | 33 |
|  | 254 |
|  | 37 |
| Total | 829 |
FIG 1 A core genome single-nucleotide-polymorphism (SNP) phylogeny of Burkholderia genomes. All SNPs were identified by aligning genome assemblies against the finished genome of B. pseudomallei K96243 (19) with NUCmer (20) and processed with the Northern Arizona SNP Pipeline (NASP; http://tgennorth.github.io/NASP) (30). A maximum-likelihood phylogeny was inferred on the concatenated SNP alignment with RAxML version 8 (31) with 100 bootstrap replicates. Clades were collapsed with ARB (41). Putative novel species are named with PS (putative species) and the clade number.
Core genome statistics
| Species/clade | Core genome size (CDSs) | No. of genomes | No. of species-/clade-specific markers |
|---|---|---|---|
|  | 5,408 | 2 | 71 |
|  | 5,507 | 8 | 13 |
|  | 3,823 | 8 | 8 |
|  | 5,076 | 16 | 22 |
|  | 4,415 | 83 | 7 |
|  | 4,566 | 12 | 7 |
|  | 5,451 | 3 | 436 |
|  | 4,898 | 6 | 833 |
|  | 3,253 | 3 | 264 |
|  | 5,115 | 7 | 157 |
|  | 4,214 | 7 | 0 |
|  | 5,348 | 3 | 105 |
|  | 4,001 | 21 | 53 |
|  | 5,681 | 4 | 141 |
| PS-1 | 3,693 | 3 | 504 |
| PS-2 | 4,231 | 4 | 23 |
| PS-3 | 5,047 | 11 | 195 |
| PS-4 | 4,366 | 7 | 0 |
| PS-5 | 4,978 | 8 | 0 |
|  | 2,339 | 392 | 22 |
|  | 1,690 | 416 | 38 |
|  | 4,549 | 10 | 62 |
|  | 6,397 | 4 | 153 |
|  | 6,533 | 2 | 90 |
|  | 4,835 | 67 | 54 |
|  | 4,447 | 20 | 116 |
|  | 4,399 | 33 | 0 |
|  | 3,128 | 255 | 40 |
|  | 3,803 | 40 | 71 |
FIG 2 A core genome single-nucleotide-polymorphism (SNP) phylogeny associated with a heat map of markers unique to specific clades. The core genome phylogeny was inferred with RAxML (31) on a concatenated SNP alignment produced by aligning 1,130 genomes against the finished genome of B. pseudomallei K96243 (19), using NUCmer (20) in conjunction with NASP (http://tgennorth.github.io/NASP). Coding regions unique to specific clades were aligned against all genomes with LS-BSR (22), and the heat map was visualized with the Interactive Tree of Life (42). The heat map demonstrates the distribution of identified markers against all genomes screened in this study.
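For readers unfamiliar with the BLAST score ratio (BSR) reported by LS-BSR, the sketch below shows how the ratio is conventionally defined: the bit score of a CDS aligned against a given genome, divided by the bit score of that CDS aligned against itself. This is an illustration only, not LS-BSR code; the function name and the example bit scores are hypothetical.

```python
def blast_score_ratio(query_bits, self_bits):
    """Return the BSR: alignment bit score normalized by the self-hit bit score.

    Values near 1.0 indicate a highly conserved CDS; values near 0.0 indicate
    that the CDS is absent or highly divergent in the screened genome.
    """
    if self_bits <= 0:
        raise ValueError("self-alignment bit score must be positive")
    return query_bits / self_bits


# Hypothetical bit scores for one CDS screened against three genomes.
self_bits = 1200.0
hits = {"genome_A": 1188.0, "genome_B": 600.0, "genome_C": 0.0}
bsr = {name: blast_score_ratio(bits, self_bits) for name, bits in hits.items()}
print(bsr)
```

A clade-specific marker in this framing is a CDS with a high BSR in every genome of the clade and a low BSR in every genome outside it, which is the pattern the heat map displays.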
Average nucleotide identity and DNA-DNA hybridization values between representatives of putative novel species and representatives of established clades
| Genome | Clade | Nearest genome | ANIm (%) | ANIb (%) | DDH range (%) |
|---|---|---|---|---|---|
| MSMB175 | Putative species 1 |  | 85.5 | 79.8 | 18.7–23.7 |
| BDU8 | Putative species 2 |  | 94.9 | 94.8 | 59.3–75.8 |
| MSMB0852 | Putative species 3 |  | 92.4 | 91.1 | 44.5–52.7 |
| MSMB0856 | Putative species 4 |  | 91.2 | 89.8 | 44.9–60.8 |
| NRF60-BP8 | Putative species 5 |  | 94.1 | 93.5 | 54.5–56.9 |
ANI, average nucleotide identity; ANIm, ANI based on NUCmer alignments; ANIb, ANI based on BLASTN alignments; DDH, DNA-DNA hybridization.
FIG 3 (A) Core genome reduction in Burkholderia pseudomallei/mallei. The core genome was calculated with the LS-BSR pipeline (22) on 416 genomes. For subsampling, genomes were randomly selected at different depths and the number of coding regions (CDSs) with a BLAST score ratio (BSR) (39) of >0.8 in all genomes was calculated and plotted. For each subsampling level, 100 iterations were performed. The mean value at each level is shown in red, and each replicate is shown in black. (B) The effect of signature erosion on the design of B. pseudomallei/mallei diagnostic markers. Genomes outside the B. pseudomallei/mallei clade (n = 714) were randomly selected at different depths. The core genome of 416 B. pseudomallei/mallei genomes was screened against non-pseudomallei/mallei genomes with LS-BSR (22), and the number of markers with a BSR of <0.4 in non-pseudomallei/mallei genomes was calculated and plotted. One hundred independent replicates were processed at each sampling depth. The mean value at each level is shown in red, and each replicate is shown in black.
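The subsampling procedure described in the FIG 3 legend can be sketched as follows. This is a minimal illustration, not the LS-BSR pipeline itself: it assumes that a BSR matrix has already been computed (rows are CDSs, columns are genomes), and the function names, the NumPy representation, and the toy data are all hypothetical. The thresholds (BSR > 0.8 for core membership, BSR < 0.4 for absence from non-target genomes) and the default of 100 replicates follow the legend.

```python
import numpy as np


def core_size(bsr, genome_idx, threshold=0.8):
    """Count CDSs whose BSR exceeds `threshold` in every selected genome."""
    return int(np.all(bsr[:, genome_idx] > threshold, axis=1).sum())


def core_reduction_curve(bsr, depths, replicates=100, seed=0):
    """Panel A: core genome size at each subsampling depth, per replicate.

    `bsr` is a (n_cds, n_genomes) array of BLAST score ratios for the
    target clade. Returns {depth: [core size for each replicate]}.
    """
    rng = np.random.default_rng(seed)
    n_genomes = bsr.shape[1]
    curve = {}
    for depth in depths:
        sizes = []
        for _ in range(replicates):
            picked = rng.choice(n_genomes, size=depth, replace=False)
            sizes.append(core_size(bsr, picked))
        curve[depth] = sizes
    return curve


def signature_erosion_curve(core_bsr_outgroup, depths, replicates=100,
                            seed=0, absent_below=0.4):
    """Panel B: number of core CDSs still specific at each outgroup depth.

    `core_bsr_outgroup` is a (n_core_cds, n_outgroup_genomes) array holding
    the BSR of each target-clade core CDS in each non-target genome. A CDS
    counts as a marker if its BSR is below `absent_below` in all sampled
    outgroup genomes.
    """
    rng = np.random.default_rng(seed)
    n_out = core_bsr_outgroup.shape[1]
    curve = {}
    for depth in depths:
        counts = []
        for _ in range(replicates):
            picked = rng.choice(n_out, size=depth, replace=False)
            specific = np.all(core_bsr_outgroup[:, picked] < absent_below, axis=1)
            counts.append(int(specific.sum()))
        curve[depth] = counts
    return curve


if __name__ == "__main__":
    # Toy data (hypothetical): 4 CDSs x 5 target-clade genomes.
    target = np.array([
        [1.0, 0.95, 0.9, 1.0, 0.99],   # core in all genomes
        [1.0, 0.85, 0.9, 0.2, 0.95],   # missing from one genome
        [0.9, 0.90, 0.9, 0.9, 0.90],   # core in all genomes
        [0.1, 0.10, 1.0, 1.0, 0.10],   # accessory
    ])
    curve = core_reduction_curve(target, depths=[1, 3, 5], replicates=20)
    for depth, sizes in curve.items():
        print(depth, sum(sizes) / len(sizes))
```

Plotting the per-replicate values alongside their mean at each depth would give curves of the kind shown in the two panels: both counts can only stay level or fall as more genomes are sampled, which is the behavior the legend attributes to core genome reduction and signature erosion.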