Nadim J Ajami1,2, Matthew C Wong3,4, Matthew C Ross3,4, Richard E Lloyd4, Joseph F Petrosino3,4.
Abstract
Accurate classification of the human virome is critical to a full understanding of the role viruses play in health and disease. This implies the need for sensitive, specific, and practical pipelines that return precise outputs while still enabling case-specific post hoc analysis. Viral taxonomic characterization from metagenomic data suffers from high background noise and signal crosstalk that confound current methods. Here we develop VirMAP, which overcomes these limitations using techniques that merge nucleotide and protein information to taxonomically classify viral reconstructions independent of genome coverage or read overlap. We validate VirMAP using published data sets and viral mock communities containing RNA and DNA viruses and bacteriophages. VirMAP offers opportunities to enhance metagenomic studies seeking to define virome-host interactions, improve biosurveillance capabilities, and strengthen molecular epidemiology reporting.
Year: 2018 PMID: 30097567 PMCID: PMC6086868 DOI: 10.1038/s41467-018-05658-8
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1. A schematic overview of VirMAP. Data processing with VirMAP is achieved through four main stages (shaded colors) divided into nine major steps (top left corner). A putative list of viral genomes and protein pseudo-scaffolds are constructed from clustered nucleotide and translated alignments to the GenBank viral and phage divisions (gbvrl and gbphage). Nucleotide and amino acid pseudo-scaffolds are “built” and merged into a single super-scaffold per genome. A merged de novo assembly is constructed and merged in, resulting in contigs that are then refined using an iterative rebuild process. The improved dual assembly is filtered against a comprehensive GenBank database and taxonomically classified using a novel per-base contig scoring system.
Fig. 2. Viral Mock Community (VMC) calculated genome coverage depth and span from remapping source reads to VirMAP reconstructed genomes. The VMC consists of purified preparations of seven different viruses, (a) human poliovirus type 1 [strain Mahoney], (b) echovirus E13 [strain Del Carmen], (c) coxsackievirus B4 [strain Tuscany], (d) human adenovirus B, (e) human adenovirus C, (f) murine gammaherpesvirus 4, and (g) rotavirus, combined at different concentrations in phosphate-buffered saline. Coverage depth and span are represented for each of the viruses in VMC per nucleotide position. For coverage span, a value of 1 represents a nucleotide position covered with respect to the source genome. VMC is available at BioProject ID PRJNA431646.
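The per-position coverage depth and span described in the Fig. 2 caption can be illustrated with a minimal Python sketch. It assumes read alignments are already available as 0-based, half-open `(start, end)` intervals on a reconstructed genome; the function names and the toy intervals are hypothetical and are not taken from VirMAP.

```python
def coverage_depth(genome_length, alignments):
    """Return per-position read depth for a genome of the given length.

    `alignments` is an iterable of (start, end) intervals, 0-based and
    half-open. This input format is an assumption made for this sketch.
    """
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, genome_length)
        if start >= end:
            continue  # read lies entirely outside the genome
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the depth at each position.
    depth = []
    running = 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth


def coverage_span(depth):
    """Return 1 for each position covered by at least one read, else 0."""
    return [1 if d > 0 else 0 for d in depth]


if __name__ == "__main__":
    # Toy example: three reads on a 10-nt genome (hypothetical values).
    depth = coverage_depth(10, [(0, 4), (2, 6), (8, 12)])
    span = coverage_span(depth)
    print("depth:", depth)
    print("span: ", span)
    print("fraction covered:", sum(span) / len(span))
```

The difference-array approach keeps the cost linear in genome length plus read count, which matters when millions of reads are remapped.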
Comparative analysis of ten viral sequence classifiers
| Pipeline | Mapped Reads (%) | Unique Calls | Viral Taxonomies | CCR (% of mapped) | Precision | Recall | F-score |
|---|---|---|---|---|---|---|---|
| VirMAP | 3,099,015 (50.1%) | 8 | 8 | 3,099,007 (99.999%) | 0.88 | 1.00 | 0.94 |
| Read classification | | | | | | | |
| FastViromeExplorer | 2,710,170 (43.85%) | 7 | 4 | 2,710,170 (100%) | 1.00 | 0.57 | 0.73 |
| VirusSeeker^a | 10,750 (0.174%) | 16 | 16 | 1,467 (13.65%) | 0.31 | 0.57 | 0.40 |
| Kaiju | 2,287,962 (37.02%) | 227 | 227 | 433,243 (18.94%) | 0.09 | 1.00 | 0.17 |
| ViromeScan | 663,185 (10.73%) | 427 | 354 | 614,016 (92.586%) | 0.01 | 0.57 | 0.02 |
| Contig classification | | | | | | | |
| drVM^b | 22,404,813 (362.54%) | 673 | 158 | 18,235,876 (81.39%) | 0.35 | 1.00 | 0.52 |
| VirusTAP | NA | 5 | 5 | NA | 0.60 | 0.43 | 0.50 |
| VIPIE^c | ~109,633 (~1.77%) | 13 | 11 | ~23,731 (~21.65%) | 0.30 | 0.71 | 0.42 |
| Standard method^d | 2,319,573 (37.53%) | 8 | 8 | 2,273,193 (98.03%) | 0.75 | 0.86 | 0.80 |
| Marker gene classification | | | | | | | |
| MetaPhlAn2 | NA | 5 | 5 | NA | 0.40 | 0.29 | 0.34 |
The Viral Mock Community (VMC) dataset (6,180,026 trimmed reads) was processed through ten different pipelines for viral taxonomic classification. VMC was generated by combining purified preparations of seven different viruses (human adenovirus B, human adenovirus C, murine gammaherpesvirus 4, coxsackievirus B4 [strain Tuscany], echovirus E13 [strain Del Carmen], human poliovirus type 1 [strain Mahoney], and rotavirus A) in phosphate-buffered saline. Unique calls refer to the distinct database entries reported, while viral taxonomies represent a reduction of unique calls to NCBI taxonomic ID. CCR: Correctly Classified Reads. Precision: true positives / (true positives + false positives). Recall: true positives / (true positives + false negatives). F-score: harmonic mean of precision and recall, 2 × ((P × R) / (P + R)).
^a VirusSeeker applies filtering and clustering techniques to the reads, and final counts are derived from this reduced set.
^b drVM internally counts identical reads across multiple reported entries, so the total counts can exceed 100%.
^c VIPIE reports reads as counts per 100,000 reads; the approximation is a rescaled amount against the original read counts.
^d The standard approach employs a metagenomic assembly using MEGAHIT and a sequential top-hit mapping classification using BLASTn and BLASTx.
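As a worked illustration of the table legend, the following Python sketch implements the precision, recall, and F-score formulas and recomputes each F-score from the precision and recall values reported in the table. The helper names are hypothetical. `rescale_per_100k` is only an assumed reading of how a per-100,000 count could be rescaled to the original read total, not VIPIE's documented behavior.

```python
# Total trimmed reads in the VMC dataset, as stated in the table legend.
TOTAL_TRIMMED_READS = 6_180_026


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p, r):
    """Harmonic mean of precision and recall: 2 * (P * R) / (P + R)."""
    return 2 * (p * r) / (p + r) if (p + r) else 0.0


def percent_of_total(reads, total=TOTAL_TRIMMED_READS):
    """Express a read count as a percentage of the input reads."""
    return 100.0 * reads / total


def rescale_per_100k(count_per_100k, total=TOTAL_TRIMMED_READS):
    """Assumed rescaling of a per-100,000-reads count to absolute reads."""
    return count_per_100k * total / 100_000


# (precision, recall, reported F-score), copied from the table.
REPORTED = {
    "VirMAP": (0.88, 1.00, 0.94),
    "FastViromeExplorer": (1.00, 0.57, 0.73),
    "VirusSeeker": (0.31, 0.57, 0.40),
    "Kaiju": (0.09, 1.00, 0.17),
    "ViromeScan": (0.01, 0.57, 0.02),
    "drVM": (0.35, 1.00, 0.52),
    "VirusTAP": (0.60, 0.43, 0.50),
    "VIPIE": (0.30, 0.71, 0.42),
    "Standard method": (0.75, 0.86, 0.80),
    "MetaPhlAn2": (0.40, 0.29, 0.34),
}

if __name__ == "__main__":
    for name, (p, r, f_reported) in REPORTED.items():
        print(f"{name}: F = {f_score(p, r):.2f} (reported {f_reported:.2f})")
    # drVM counts identical reads under several entries, so its mapped
    # percentage exceeds 100% of the input reads.
    print(f"drVM mapped: {percent_of_total(22_404_813):.2f}%")
```

Recomputing F from the rounded precision and recall reproduces the reported values to two decimals. Recomputing from raw true- and false-positive counts may differ slightly in the last digit.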
Fig. 3. VirMAP analysis of an external mock community. A mock virome control sample (SRR3458562) recently reported[15] was processed with VirMAP. A total of 5,969,272 reads (32.96%) were classified as being of viral origin across 10 distinct viral lineages, which included the nine viral constituents of the mock community. Additionally, one putative contaminant virus was identified: southern tomato virus.
Fig. 4. A comparison of VirMAP and the Standard Approach using the influenza virus dataset (BioProject ID PRJEB7888). The total length of reconstructed influenza virus segments was calculated at different levels of subsampling by adding the total number of base pairs found across segments. An average N50 was calculated at each subsampling level by averaging the N50 values for all trials (20 at 100%, 200 at 10, 1, and 0.1%). The percentage of positive trials corresponds to the ratio of trials with >1 identifiable influenza contig over the total number of trials (20 at 100%, 200 at 10, 1, and 0.1%). Tukey plots (bar: statistical median; edges: low, 25%; high, 75% quartiles; whiskers: 1.5 × interquartile range; dots: outliers).
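The per-trial summary statistics named in the Fig. 4 caption (total reconstructed length, average N50, and percentage of positive trials) can be sketched as follows. N50 uses its standard definition: the length of the contig at which the cumulative sum of contigs, sorted longest first, reaches at least half of the total assembly length. The function names, the treatment of empty trials as N50 = 0, and the toy contig lengths are assumptions for illustration only.

```python
from statistics import mean


def n50(contig_lengths):
    """Standard N50 of a list of contig lengths (0 if the list is empty)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize_trials(trials, more_than=1):
    """Summarize subsampling trials at one subsampling level.

    `trials` is a list of trials, each a list of influenza contig lengths.
    A trial counts as positive when it has more than `more_than` contigs,
    following the ">1" wording of the caption (an assumption of this sketch).
    """
    positive = sum(1 for contigs in trials if len(contigs) > more_than)
    return {
        "mean_total_length": mean(sum(contigs) for contigs in trials),
        "mean_n50": mean(n50(contigs) for contigs in trials),
        "percent_positive": 100.0 * positive / len(trials),
    }


if __name__ == "__main__":
    # Three hypothetical trials; the last one recovered no contigs.
    toy_trials = [[2300, 1700, 900], [1000], []]
    print(summarize_trials(toy_trials))
```

Averaging over all trials, including those with no contigs, is one possible convention. Averaging only over positive trials would give higher N50 values at low subsampling levels.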