Andy Kilianski, Patrick Carcel, Shijie Yao, Pierce Roth, Josh Schulte, Greg B Donarum, Ed T Fochler, Jessica M Hill, Alvin T Liem, Michael R Wiley, Jason T Ladner, Bradley P Pfeffer, Oliver Elliot, Alexandra Petrosov, Dereje D Jima, Tyghe G Vallard, Melanie C Melendrez, Evan Skowronski, Phenix-Lan Quan, W Ian Lipkin, Henry S Gibbons, David L Hirschberg, Gustavo F Palacios, C Nicole Rosenzweig.
Abstract
BACKGROUND: The detection of pathogens in complex sample backgrounds has been revolutionized by wide access to next-generation sequencing (NGS) platforms. However, analytical methods to support NGS platforms are not as uniformly available. Pathosphere (found at Pathosphere.org) is a cloud-based, open-source community tool that allows for communication, collaboration, and sharing of NGS analytical tools and data among scientists working in academia, industry, and government. The architecture allows users to upload data and run available bioinformatics pipelines without the need for onsite processing hardware or technical support.
Year: 2015 PMID: 26714571 PMCID: PMC4696252 DOI: 10.1186/s12859-015-0840-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Pathosphere user interface. The web-based portion of Pathosphere contains message boards, forums, user communities to share data and results, a live-chat messenger, user and developer guides and FAQs, as well as custom interfaces for the pathogen detection pipelines utilized by current Pathosphere users. This screenshot displays the user-defined parameters that are customizable for each pathogen detection run
Fig. 2 Summary of the analytical capability of the bioinformatics pipeline. Data can currently be preprocessed by one of two tools: Columbia University’s Preprocessing Procedure (CUPP) or a taxonomy analysis based on NCBI taxonomy results. Reads retained after preprocessing are then assembled de novo. Nearest-neighbor identification and SNP profiling then occur by comparing the identified contigs to NCBI databases. A reference map is created, and the SNP profile from those mapping results provides a comprehensive comparison of the taxonomic near neighbors. Finally, all the unmapped reads are extracted and used as input to the next iteration
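The iterative loop summarized in Fig. 2 can be sketched in a few lines of Python. This is only an illustrative outline, not the pipeline's actual code: the function `iterative_analysis`, the `IterationResult` record, and the three callables (`assemble`, `find_nearest_neighbor`, `map_reads`) are hypothetical names introduced here, and the toy stand-ins below replace the real assembler, database search, and read mapper. Preprocessing and SNP profiling are left out.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Set


@dataclass
class IterationResult:
    iteration: int
    input_reads: int
    contigs: int
    nearest_neighbor: Optional[str]
    mapped_reads: int


def iterative_analysis(
    reads: Sequence[str],
    assemble: Callable[[Sequence[str]], List[str]],            # hypothetical de novo assembler
    find_nearest_neighbor: Callable[[List[str]], Optional[str]],  # hypothetical database search
    map_reads: Callable[[Sequence[str], str], Set[str]],       # hypothetical read mapper
    max_iterations: int = 10,                                   # safety cap, an assumption
) -> List[IterationResult]:
    """Assemble, find a nearest neighbor, map reads, then repeat on unmapped reads."""
    remaining = list(reads)
    results: List[IterationResult] = []
    for iteration in range(1, max_iterations + 1):
        if not remaining:
            break  # no reads left for contig building
        contigs = assemble(remaining)
        neighbor = find_nearest_neighbor(contigs) if contigs else None
        if neighbor is None:
            # No contigs or no database match: record the attempt and stop.
            results.append(IterationResult(iteration, len(remaining), len(contigs), None, 0))
            break
        mapped = map_reads(remaining, neighbor)
        results.append(
            IterationResult(iteration, len(remaining), len(contigs), neighbor, len(mapped))
        )
        if not mapped:
            break  # nothing was removed, so another pass would not make progress
        remaining = [r for r in remaining if r not in mapped]
    return results


# Toy stand-ins: each read is "<label>:<id>" and each distinct label is one "contig".
KNOWN_REFERENCES = {"virusA", "virusB"}


def toy_assemble(reads: Sequence[str]) -> List[str]:
    return sorted({r.split(":")[0] for r in reads})


def toy_nearest_neighbor(contigs: List[str]) -> Optional[str]:
    hits = [c for c in contigs if c in KNOWN_REFERENCES]
    return hits[0] if hits else None


def toy_map_reads(reads: Sequence[str], neighbor: str) -> Set[str]:
    return {r for r in reads if r.startswith(neighbor + ":")}


toy_reads = ["virusA:1", "virusA:2", "virusB:1", "unknown:1"]
results = iterative_analysis(toy_reads, toy_assemble, toy_nearest_neighbor, toy_map_reads)
for row in results:
    print(row)
```

With the toy inputs, the loop reports `virusA` in the first iteration, `virusB` in the second, and stops in the third when the remaining read has no database match.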
ECBC Pipeline Analysis on Non-host Reads of Samples Containing B. anthracis, F. tularensis, Y. pestis, B. pseudomallei, and B. mallei. Unknown samples were created, sequenced on 454, Ion Torrent, and Illumina platforms and processed (methods). Datasets were then analyzed using the ECBC pathogen detection pipeline. Table shading represents the positive and correct identification of the organism listed. Unshaded cells represent the lack of single-read identification matching to the pathogens spiked
| | | 454 | | Ion Torrent | | Illumina | |
|---|---|---|---|---|---|---|---|
| | | File size: 2.5 GB | | File size: 1.62 GB | | File size: 5 GB paired-read files | |
| | | Pipeline runtime: 35m12s | | Pipeline runtime: 29m18s | | Pipeline runtime: 4hr11m | |
| Agent | Spiked Amounts | Taxonomy Assignment ID | Iterative Assembly ID | Taxonomy Assignment ID | Iterative Assembly ID | Taxonomy Assignment ID | Iterative Assembly ID |
| | 1 × 10⁶ CFU | | | | | | |
| | 1 × 10⁵ CFU | | | | | | |
| | 1 × 10⁴ CFU | | | | | | |
| | 1 × 10³ CFU | | | | | | |
| | 1 × 10² CFU | | | | | | |
Viral Samples And Non-host Reads. Samples collected from various sources were sequenced for pathogen detection. CUPP removed host reads, leaving non-host reads for further iterative and taxonomic analysis. Samples obtained had already been confirmed to contain or not contain the indicated virus (methods)
| Sample | Host | Tissue | Viral Agent | Total Reads | Non-host Reads | Non-host Reads (%) |
|---|---|---|---|---|---|---|
| 28 | Bat | Serum | Hepatitis G virus | 69558 | 1704 | 2 % |
| 712 | | Liver biopsy | Lujo virus | 55227 | 759 | 1 % |
| 806 | Bat | Gastro-intestinal tract | - | 9308 | 4752 | 51 % |
| 808 | Human | Liver biopsy | Lujo virus | 43090 | 4479 | 10 % |
| 819 | Bat | Gastro-intestinal tract | Zaria-CoV | 29182 | 12450 | 43 % |
| 820 | Bat | Gastro-intestinal tract | Zaria-CoV | 67808 | 14790 | 22 % |
| 1164 | Bat | Spleen | parvovirus | 72877 | 34935 | 48 % |
| 1500 | Dromedary Camel | Nasal Swab | MERS-CoV | 924389 | 888708 | 96 % |
| 1501 | Dromedary Camel | Nasal Swab | MERS-CoV | 866598 | 831820 | 96 % |
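The percentage column follows directly from the two read counts. As a quick check, the short Python sketch below recomputes it from the totals in the table; the helper name `non_host_percent` is an illustrative choice, not something taken from the paper.

```python
# (sample, total reads, non-host reads) copied from the table above
samples = [
    ("28", 69558, 1704),
    ("712", 55227, 759),
    ("806", 9308, 4752),
    ("808", 43090, 4479),
    ("819", 29182, 12450),
    ("820", 67808, 14790),
    ("1164", 72877, 34935),
    ("1500", 924389, 888708),
    ("1501", 866598, 831820),
]


def non_host_percent(total_reads: int, non_host_reads: int) -> float:
    """Share of reads that remain after host-read removal, in percent."""
    return 100.0 * non_host_reads / total_reads


for sample, total, non_host in samples:
    print(f"{sample}: {non_host_percent(total, non_host):.0f} %")
```

Rounded to whole percentages, the computed values agree with the last column of the table.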
Iterative Analysis on Non-host Reads of Sample 712, 806, 808, 819, 820, 1500, and 1501. Collected known samples (Table 2) were analyzed using the ECBC iterative analysis pipeline for pathogen detection. De novo assembled contigs are used to generate nearest neighbors, then the nearest neighbors are used to map reads and generate consensus contigs from the mapped reads (Fig. 2). Upon completion, a new iteration begins using reads not mapped to the nearest neighbor. The cycle completes after no further reads exist for contig building or there are no matches reported
| Sample | Iteration | Reads | | Nearest Neighbor Reported | Contigs Generated from Nearest Neighbor Read Mapping |
|---|---|---|---|---|---|
| 712 | 1 | 759 | 3 | Lujo virus segment S glycoprotein precursor and nucleocapsid protein genes, complete cds | 2 |
| | 2 | 90 | 1 | Lujo virus segment L multifunctional matrix-like protein and large RNA-dependent RNA polymerase genes, complete cds | 1 |
| 806 | 1 | 4752 | 0 | - | - |
| 808 | 1 | 4479 | 4 | Lujo virus segment L multifunctional matrix-like protein and large RNA-dependent RNA polymerase genes, complete cds | 3 |
| | 2 | 1527 | 1 | Lujo virus segment S glycoprotein precursor and nucleocapsid protein genes, complete cds | 1 |
| | 3 | 139 | 0 | - | - |
| 819 | 1 | 12450 | 12 | Zaria bat coronavirus strain ZBCoV, partial genome | 1 |
| | 2 | 8247 | 9 | - | no db hits |
| 820 | 1 | 14790 | 31 | Rhinolophus ferrumequinum clone VMRC7-267P18, complete sequence | 1 |
| | 2 | 10222 | 37 | Zaria bat coronavirus strain ZBCoV, partial genome | 1 |
| | 3 | 9704 | 33 | - | no db hits |
| 1500 | 1 | 884863 | 3993 | Middle East respiratory syndrome coronavirus complete genome | 1 |
| | 2 | 26795 | 37 | Actinobacillus suis H91-0380 complete genome | 1 |
| | 3 | 26643 | 36 | Middle East respiratory syndrome coronavirus Isolate Qatar4 complete genome | 12 |
| 1501 | 1 | 828116 | 3967 | Middle East respiratory syndrome coronavirus complete genome | 1 |
| | 2 | 26937 | 33 | Actinobacillus suis H91-0380 complete genome | 17 |
| | 3 | 26761 | 34 | PREDICTED: Equus caballus uncharacterized LOC102148405 (LOC102148405) | 1 |
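Because each new iteration starts from the reads that did not map to the previous nearest neighbor, the drop in the "Reads" column between consecutive iterations gives the number of reads removed in each pass. The Python sketch below works this out, assuming the "Reads" column is the input read count for each iteration; the helper `reads_removed` is a name introduced here for illustration.

```python
# Reads entering each iteration, copied from the "Reads" column of the table above.
reads_per_iteration = {
    "712": [759, 90],
    "806": [4752],
    "808": [4479, 1527, 139],
    "819": [12450, 8247],
    "820": [14790, 10222, 9704],
    "1500": [884863, 26795, 26643],
    "1501": [828116, 26937, 26761],
}


def reads_removed(counts):
    """Reads dropped between consecutive iterations (input of n minus input of n + 1)."""
    return [before - after for before, after in zip(counts, counts[1:])]


for sample, counts in reads_per_iteration.items():
    print(sample, reads_removed(counts))
```

For sample 712, for instance, 669 of the 759 input reads were removed after the first iteration, leaving 90 for the second.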
USAMRIID-WRAIR pipeline reanalysis for pathogen reads from Samples 28 and 1164. Pathosphere’s ability to host multiple pipelines was tested using a pipeline designed by USAMRIID and WRAIR to analyze datasets from samples 28 and 1164. The reanalysis resulted in viral hits against the correct agent (sample 28) and against multiple viruses (sample 1164), with one correctly identified as a nearest neighbor
| Reanalyzed Sample | Read length | NN hit |
|---|---|---|
| Sample 28 (692 total hits, 3 viral hits) | 152 | GB virus D strain 93 polyprotein precursor, gene, partial cds |
| | 231 | GB virus D strain 93 polyprotein precursor, gene, partial cds |
| | 246 | GB virus D strain 93 polyprotein precursor, gene, partial cds |
| Sample 1164 (4621 total hits, 24 viral hits) | 171 | Gray fox amdovirus NS1, NS2, NS3, VP1, and VP2 genes, complete cds |