Owen E Francis, Matthew Bendall, Solaiappan Manimaran, Changjin Hong, Nathan L Clement, Eduardo Castro-Nallar, Quinn Snell, G Bruce Schaalje, Mark J Clement, Keith A Crandall, W Evan Johnson.
Abstract
Emerging next-generation sequencing technologies have revolutionized the collection of genomic data for applications in bioforensics and biosurveillance and for use in clinical settings. However, to make the most of these new data, new methodology needs to be developed that can accommodate large volumes of genetic data in a computationally efficient manner. We present a statistical framework to analyze raw next-generation sequence reads from purified or mixed environmental or targeted infected tissue samples for rapid species identification and strain attribution against a robust database of known biological agents. Our method, Pathoscope, capitalizes on a Bayesian statistical framework that accommodates information on sequence quality and mapping quality and provides posterior probabilities of matches to a known database of target genomes. Importantly, our approach also incorporates the possibility that multiple species can be present in the sample and considers cases when the sample species/strain is not in the reference database. Furthermore, our approach can accurately discriminate between very closely related strains of the same species with very little coverage of the genome and without the need for multiple alignment steps, extensive homology searches, or genome assembly, all of which are time-consuming and labor-intensive steps. We demonstrate the utility of our approach on genomic data from purified and in silico "environmental" samples from known bacterial agents impacting human health for accuracy assessment and comparison with other approaches.
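The abstract does not spell out the statistical model. As a rough illustration of how reads that align equally well to several closely related genomes can still be attributed, the sketch below runs a plain expectation-maximization (EM) loop over a mixture of genomes. It is not Pathoscope's actual implementation: the function names, the toy alignment weights, and the "naive" even-split baseline are all assumptions made for this example, and the sketch ignores the sequence-quality and mapping-quality information the abstract mentions.

```python
# Illustrative sketch only: a generic EM mixture model for read reassignment.
# All names and numbers here are hypothetical, not taken from Pathoscope.

def naive_proportions(reads):
    """Split each read evenly among every genome it aligns to (assumed baseline)."""
    totals = {}
    for read in reads:
        share = 1.0 / len(read)
        for genome in read:
            totals[genome] = totals.get(genome, 0.0) + share
    return {g: t / len(reads) for g, t in totals.items()}


def em_proportions(reads, n_iter=50):
    """Estimate genome proportions and per-read posterior assignments.

    reads: list of dicts mapping genome name -> positive alignment weight.
    """
    genomes = sorted({g for read in reads for g in read})
    pi = {g: 1.0 / len(genomes) for g in genomes}  # uniform starting proportions
    posteriors = []
    for _ in range(n_iter):
        # E-step: posterior probability that each read came from each genome.
        posteriors = []
        for read in reads:
            weighted = {g: pi[g] * w for g, w in read.items()}
            total = sum(weighted.values())
            posteriors.append({g: v / total for g, v in weighted.items()})
        # M-step: proportions become the average posterior mass per genome.
        pi = {g: sum(p.get(g, 0.0) for p in posteriors) / len(reads) for g in genomes}
    return pi, posteriors


# Toy data: 8 reads align equally well to two near-identical strains,
# and 2 reads align only to strain_A.
toy_reads = [{"strain_A": 1.0, "strain_B": 1.0}] * 8 + [{"strain_A": 1.0}] * 2

naive = naive_proportions(toy_reads)
proportions, posteriors = em_proportions(toy_reads)
print("naive:", naive)
print("EM:   ", proportions)
```

In this toy case the even split leaves a large share of reads credited to strain_B, whereas the EM estimate shifts nearly all of the ambiguous reads to strain_A, because the two uniquely aligned reads are evidence that strain_A is the one present.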
Year: 2013 PMID: 23843222 PMCID: PMC3787268 DOI: 10.1101/gr.150151.112
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Figure 1. Impact of closely related strains on the read alignment proportions. The genomes in the database were aligned to each other using an all-against-all BLASTN approach (Agren et al. 2012), and strains of the same species that were >98% similar using this metric were considered “closely related” strains. As the number of closely related strains increased, the naïve algorithm was unable to definitively identify the origin species. Pathoscope, however, performed consistently well regardless of the number of closely related strains.
Results from the application of several species identification approaches to subsets of the 92,370 sequencing reads from the first O104:H4 Ion Torrent sequencing run
Results from the application of our species identification method on 12 data sets from the NCBI Sequence Read Archive (SRA, reference website)
Results from 1000 random mixtures of ∼5770 reads from the Y. pestis KIM D27 (SRR033501), E. coli K-12 MG1655 (SRR031601), and F. tularensis subsp. holarctica OSU18 (SRR032505) data sets