Andreas Andrusch, Piotr W Dabrowski, Jeanette Klenner, Simon H Tausch, Claudia Kohl, Abdalla A Osman, Bernhard Y Renard, Andreas Nitsche.
Abstract
Motivation: Next-generation sequencing (NGS) has given researchers a powerful tool for characterizing metagenomic and clinical samples in research and diagnostic settings. NGS offers an open view into samples, which is useful for detecting pathogens in an unbiased fashion and without a prior hypothesis about possible causative agents. However, NGS datasets for pathogen detection come with several obstacles, such as a very unfavorable ratio of pathogen to host reads. In addition to frequent false positives and irrelevant organisms such as contaminants, tools are often challenged by samples with low pathogen loads and might not report organisms present below a certain threshold. Furthermore, some metagenomic profiling tools focus on only one particular set of pathogens, for example bacteria.
Year: 2018 PMID: 30423069 PMCID: PMC6129269 DOI: 10.1093/bioinformatics/bty595
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. PAIPline standard workflow: The PAIPline for Automatic Identification of Pathogens. Items colored in green indicate user-adjustable parameters or input. First, raw reads are preprocessed, including filters for read length, base quality and read composition complexity. The processed reads are then mapped against user-designated fore- and background databases. The mappings are matched to remove reads originating from background organisms. All remaining read hits are validated by BLAST using the NCBI nt database. Afterwards, ambiguities are resolved and the final read assignment is set. Organisms of low interest (OLIs) are then masked before the final result is presented.
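The first steps of the workflow in Fig. 1 can be illustrated with a minimal sketch. This is not PAIPline's actual code: the function names, the thresholds, and the use of Shannon entropy as the complexity measure are all assumptions made for illustration, and the background step is simplified to dropping every read that also maps to the background database.

```python
from collections import Counter
from math import log2


def mean_quality(quals):
    """Mean Phred quality of a read (0.0 for an empty read)."""
    return sum(quals) / len(quals) if quals else 0.0


def shannon_entropy(seq):
    """Shannon entropy in bits of the base composition (assumed complexity measure)."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def passes_preprocessing(seq, quals, min_len=50, min_qual=20.0, min_entropy=1.5):
    """Apply length, quality and complexity filters; thresholds are hypothetical."""
    if len(seq) < min_len:
        return False
    if mean_quality(quals) < min_qual:
        return False
    return shannon_entropy(seq) >= min_entropy


def subtract_background(foreground_hits, background_hits):
    """Keep read IDs that mapped to the foreground but not to the background."""
    return set(foreground_hits) - set(background_hits)
```

In the pipeline described by the caption, the reads that survive this subtraction would then go on to BLAST validation against the NCBI nt database, which is not sketched here.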
References, their accession numbers and the number of reads simulated from them to form the artificial sample used in this study
| References | Accession number(s) | Reads simulated |
|---|---|---|
| Homo sapiens | CM000663.2 - CM000686.2, J01415.2 | 1000000 |
| Human Herpesvirus 1 (HSV-1) | X14112.1 | 180 |
| Cowpox virus (CPXV) | AF482758.2 | 104 |
| Human Immunodeficiency Virus 1 (HIV-1) | AF033819.3 | 87 |
| Yellow fever virus (YFV) | X03700.1 | 24 |
| Human Adenovirus B2 Type 11 (HAV-11) | AY598970.1 | 12 |
| Sum | | 1000407 |
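The read counts in this table make the unfavorable pathogen-to-host ratio concrete. The short sketch below recomputes the total and the per-organism share of reads; the dictionary keys are abbreviations taken from the table, and the variable names are chosen only for this example.

```python
# Simulated read counts copied from the table above.
simulated_reads = {
    "Homo sapiens": 1_000_000,
    "HSV-1": 180,
    "CPXV": 104,
    "HIV-1": 87,
    "YFV": 24,
    "HAV-11": 12,
}

total_reads = sum(simulated_reads.values())
pathogen_reads = total_reads - simulated_reads["Homo sapiens"]

# Share of all simulated reads contributed by each virus.
viral_fraction = {
    name: count / total_reads
    for name, count in simulated_reads.items()
    if name != "Homo sapiens"
}

print(f"total reads: {total_reads}, viral reads: {pathogen_reads}")
for name, fraction in viral_fraction.items():
    print(f"{name}: {fraction:.2e} of all reads")
```

All five viruses together contribute 407 of the 1000407 reads, roughly 0.04% of the sample, and the least abundant one (HAV-11) is represented by only 12 reads.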
Taxa recalled at species level by the benchmarked programs in the respective datasets
| Host | Library | Expected virus | PAIPline | Pathoscope | Kraken | Sigma (Mapping) | Sigma (Abundance) |
|---|---|---|---|---|---|---|---|
| Chicken | DNA | Vaccinia virus | Yes | Yes | No (Cowpox) | No (Canarypox) | No |
| | | Sendai virus | No | No | No | Yes | No |
| | | Influenza A virus | No | No | No | Yes | No |
| Chicken | RNA | Vaccinia virus | Yes | Yes | No (Cowpox) | Yes | No |
| | | Sendai virus | Yes | Yes | No | Yes | No |
| | | Influenza A virus | Yes | Yes | No | Yes | No |
| Marmoset | DNA | Sendai virus | No | No | No | Yes | No |
| Marmoset | RNA | Sendai virus | Yes | No | No | Yes | No |
| Artificial sample | | Cowpox virus | Yes | Yes | Yes | Yes | Yes |
| | | HIV-1 | No | Yes | No | Yes | No |
| | | HAV-11 | Yes | Yes | No | No | No |
| | | HSV-1 | Yes | Yes | No | Yes | Yes |
| | | Yellow fever virus | No | Yes | No | No | No |
Note: Entries in parentheses indicate a detection of a species in the same family. PAIPline is the only tool to detect Sendai virus successfully in the Marmoset RNA, whereas Pathoscope is the only tool to detect Yellow fever virus in the artificial sample.
Fig. 2. The F-scores on family level for all combinations of samples and benchmarked tools are shown. All tools were run with their default parameters. The transparent bars indicate the mean over all samples processed with that program and mode of operation, whereas light gray sample names indicate no recall. A higher bar generally indicates a better compromise between recall and precision, approximating better real-life performance.
Fig. 3. The F-scores on species level for all combinations of samples and benchmarked tools are shown. All tools were run with their default parameters. The transparent bars indicate the mean over all samples processed with that program and mode of operation, whereas light gray sample names indicate no recall. A higher bar generally indicates a better compromise between recall and precision, approximating better real-life performance.
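The F-scores in Figs. 2 and 3 combine recall and precision into a single value. The captions do not state which variant was used, so the sketch below assumes the balanced F1 score (the harmonic mean of precision and recall); the function name and the example counts are hypothetical and are not taken from the study.

```python
def f_score(true_positives, false_positives, false_negatives):
    """Balanced F-score (F1) from taxon-level counts.

    Returns 0.0 when nothing expected was recalled, matching the
    "no recall" case mentioned in the captions.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical tool output: 3 expected taxa found, 1 spurious taxon, 2 missed.
example = f_score(true_positives=3, false_positives=1, false_negatives=2)
print(f"F-score: {example:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a tool that reports many spurious taxa scores low even when its recall is perfect, which is why a higher bar reflects a better compromise between the two.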
Fig. 4. The wall clock times needed to complete each analysis of the given datasets by the benchmarked programs are shown. A higher bar indicates a computationally more expensive or less well parallelized process.