Aravind Sankar, Brandon Malone, Sion C Bayliss, Ben Pascoe, Guillaume Méric, Matthew D Hitchings, Samuel K Sheppard, Edward J Feil, Jukka Corander, Antti Honkela.
Abstract
Rapidly assaying the diversity of a bacterial species present in a sample obtained from a hospital patient or an environmental source has become possible after recent technological advances in DNA sequencing. For several applications it is important to accurately identify the presence and estimate relative abundances of the target organisms from short sequence reads obtained from a sample. This task is particularly challenging when the set of interest includes very closely related organisms, such as different strains of pathogenic bacteria, which can vary considerably in terms of virulence, resistance and spread. Using advanced Bayesian statistical modelling and computation techniques we introduce a novel pipeline for bacterial identification that is shown to outperform the currently leading pipeline for this purpose. Our approach enables fast and accurate sequence-based identification of bacterial strains while using only modest computational resources. Hence it provides a useful tool for a wide spectrum of applications, including rapid clinical diagnostics to distinguish among closely related strains causing nosocomial infections. The software implementation is available at https://github.com/PROBIC/BIB.
Keywords: pathogenic bacteria; probabilistic modelling; Staphylococcus aureus; strain identification
Year: 2016 PMID: 28348870 PMCID: PMC5320594 DOI: 10.1099/mgen.0.000075
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Phylogenetic tree of the investigated Staphylococcus strains. Inset: Enlarged view of the S. aureus branch illustrating the clustering of the strains within clonal complexes. The scale measures base-level sequence dissimilarity, showing that the S. aureus clusters differ by approximately two to ten substitutions every 1 kb while strains within each cluster differ by less than one substitution every 5 kb.
Fig. 2. Magnitudes of errors in proportion estimates of BIB, Pathoscope and naive estimation among uniquely mapping reads (Unique) for strains actually present in the experiment (true positives; left) and those not present in the experiment (true negatives; right). The “Unique” method is implemented by simply computing the frequencies of different strain clusters among unique alignments. Lower values indicate better results.
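The “Unique” baseline described in the caption of Fig. 2 can be illustrated with a minimal sketch. It assumes that a read counts as uniquely aligned when it maps to exactly one reference sequence; the function name, the input layout (a dictionary from read ID to the list of reference strains hit) and the example strain and cluster labels are hypothetical and are not taken from the BIB software.

```python
from collections import Counter


def unique_read_proportions(read_hits, strain_to_cluster):
    """Estimate cluster proportions from uniquely aligned reads only.

    read_hits: dict mapping read ID -> list of reference strains it aligned to
               (hypothetical input layout, assumed for this sketch).
    strain_to_cluster: dict mapping reference strain -> cluster label.
    """
    counts = Counter()
    for strains in read_hits.values():
        # Assumption: "unique" means the read aligned to exactly one reference.
        if len(strains) == 1:
            counts[strain_to_cluster[strains[0]]] += 1

    total = sum(counts.values())
    if total == 0:
        return {}  # no uniquely aligned reads, so nothing can be estimated
    # Frequency of each cluster among the unique alignments.
    return {cluster: n / total for cluster, n in counts.items()}


# Toy example with made-up labels: r2 is ambiguous and therefore ignored.
read_hits = {
    "r1": ["strainA1"],
    "r2": ["strainA1", "strainB1"],
    "r3": ["strainB1"],
    "r4": ["strainA2"],
}
strain_to_cluster = {"strainA1": "CC1", "strainA2": "CC1", "strainB1": "CC2"}
proportions = unique_read_proportions(read_hits, strain_to_cluster)
```

In the toy example, three reads are unique, two of them in cluster CC1, so the estimate is 2/3 for CC1 and 1/3 for CC2. Discarding ambiguous reads in this way is what makes the baseline naive: closely related strains share most of their sequence, so most reads carry no weight in the estimate.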
Fig. 3. Scatter plot comparing the estimation errors of BIB and Pathoscope on true positives. Points below the diagonal are cases where BIB is more accurate while points above the diagonal are cases where Pathoscope is more accurate.
Fig. 4. Comparison of errors in estimation of proportions of Staphylococcus strains with and without Bacillus contamination. Errors on contaminated samples are slightly higher, but overall still very low.
Fig. 5. Two examples of error spectra when some strain clusters present in a sample are not included in the index. The plots show the profile of true and estimated proportions (top) as well as the errors in the estimation (bottom). The error profiles always show a bump at the index of the dropped cluster, because its proportion cannot be estimated, while the rest of each profile shows how its reads are reassigned to other clusters.
Fig. 6. An example of an error profile in strain abundance estimation without clustering. The vertical dotted lines indicate the borders between different clusters.
Fig. 7. Estimated cluster abundance profiles from diverse clinical samples. The two top rows represent clean samples where one cluster clearly dominates. Rows 3 and 4 represent contaminated samples where the true cluster can still be fairly reliably identified. The bottom row shows a completely failed sample, possibly due to problems with sequence barcoding.
Fig. 8. Estimated relative abundances of the correct and other clusters in S. epidermidis data with two different clusterings. (Higher values are better for the correct cluster, lower values for other clusters.) The results show more leakage of estimates to incorrect clusters, but the true cluster is still identified as dominant in all cases with three clusters and in 85% of cases with 11 clusters.