Simon Roux, Francois Enault, Bonnie L Hurwitz, Matthew B Sullivan.
Abstract
Viruses of microbes impact all ecosystems where microbes drive key energy and substrate transformations, including the oceans, humans and industrial fermenters. However, despite this recognized importance, our understanding of viral diversity and impacts remains limited by too few model systems and reference genomes. One way to fill these gaps in our knowledge of viral diversity is through the detection of viral signal in microbial genomic data. While multiple approaches have been developed and applied for the detection of prophages (viral genomes integrated in a microbial genome), new types of microbial genomic data are emerging that are more fragmented and larger scale, such as Single-cell Amplified Genomes (SAGs) of uncultivated organisms or genomic fragments assembled from metagenomic sequencing. Here, we present VirSorter, a tool designed to detect viral signal in these different types of microbial sequence data in both a reference-dependent and reference-independent manner, leveraging probabilistic models and extensive virome data to maximize detection of novel viruses. Performance testing shows that VirSorter's prophage prediction capability is comparable to that of available prophage predictors for complete genomes, but is superior in predicting viral sequences outside of a host genome (i.e., from extrachromosomal prophages, lytic infections, or partially assembled prophages). Furthermore, VirSorter outperforms existing tools for fragmented genomic and metagenomic datasets, and can identify viral signal in assembled sequence (contigs) as short as 3 kb, while providing near-perfect identification (>95% Recall and 100% Precision) on contigs of at least 10 kb. Because VirSorter scales to large datasets, it can also be used in "reverse" to more confidently identify viral sequence in viral metagenomes by sorting away cellular DNA, whether derived from gene transfer agents, generalized transduction or contamination.
Finally, VirSorter is made available through the iPlant Cyberinfrastructure that provides a web-based user interface interconnected with the required computing resources. VirSorter thus complements existing prophage prediction software to better leverage fragmented, SAG and metagenomic datasets in a way that will scale to modern sequencing. Given these features, VirSorter should enable the discovery of new viruses in microbial datasets, and further our understanding of uncultivated viral communities across diverse ecosystems.
Keywords: Bacteriophage; Metagenomics; Prophage; Single-cell amplified genome; Viral metagenomics; Virus
Year: 2015 PMID: 26038737 PMCID: PMC4451026 DOI: 10.7717/peerj.985
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. VirSorter process: overview (A) and examples of viral sequence detection (B).
(A) Overview of VirSorter process. The top part describes the different parts of the sequence analysis pipeline, and the bottom frame summarizes the classification in three categories of decreasing confidence based on the different metrics being significant (green dot) or not (black cross). Viral “hallmark” genes or protein clusters (PCs) were identified by looking for genes typically of viral origin that are annotated as “major capsid protein,” “portal,” “terminase large subunit,” “spike,” “tail,” “virion formation” or “coat” and manually removing all protein domains with a potential overlap with microbial functions. (B) Examples of viral sequence detection by VirSorter. On top is the clearest case, in which a sequence harbors several viral hallmark genes as well as enrichment in viral-like genes (or virome-like when the genes are most similar to a viral metagenome sequence, when using the Viromes database). This type of detection is considered the most confident. The three examples below are different cases in which only one of the primary metrics is significant. Notably, these examples display how VirSorter can detect new viruses based on a significant depletion in characterized genes associated with a viral hallmark gene (case 3), and how the same number of genes can be a non-significant enrichment when considering all viruses, yet significant when looking at only the non-Caudovirales (case 4). These detections are still considered confident, although less sure than case 1. Finally, a last example (case 5) displays a more ambiguous situation, in which a sequence displays only secondary viral metrics but neither viral gene enrichment nor a viral hallmark gene. For these detections, at least one of the metrics must have an E-value lower than 10⁻⁴ (note that significance scores used in VirSorter output files are computed as negative log10 transformations of E-values, and would here correspond to a score of 4 or more).
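The score transformation and the three-tier logic described in this caption can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not VirSorter's actual code: the function names, the boolean inputs, and the reduction of the "primary metrics" to two flags (hallmark gene present, viral gene enrichment significant) are hypothetical, and the mapping of cases 1 to 5 onto categories 1 to 3 is inferred from the caption.

```python
import math

# Assumption taken from the caption: a secondary metric counts as
# significant when its E-value is lower than 1e-4.
SECONDARY_E_VALUE_CUTOFF = 1e-4


def significance_score(e_value: float) -> float:
    """Convert an E-value to a score as the negative log10 of the E-value."""
    if e_value <= 0:
        raise ValueError("E-value must be positive")
    return -math.log10(e_value)


def confidence_category(has_hallmark_gene: bool,
                        viral_enrichment_significant: bool,
                        secondary_e_values: list[float]):
    """Hypothetical three-tier classification inferred from the caption.

    1: both primary metrics are significant (case 1)
    2: exactly one primary metric is significant (cases 2 to 4)
    3: no primary metric, but at least one secondary metric passes
       the E-value cutoff (case 5)
    None: no viral signal under these simplified rules
    """
    primary_hits = int(has_hallmark_gene) + int(viral_enrichment_significant)
    if primary_hits == 2:
        return 1
    if primary_hits == 1:
        return 2
    if any(e < SECONDARY_E_VALUE_CUTOFF for e in secondary_e_values):
        return 3
    return None


if __name__ == "__main__":
    print(significance_score(1e-5))
    print(confidence_category(True, True, []))
    print(confidence_category(False, False, [1e-6]))
```

The real pipeline evaluates several enrichment and depletion statistics per sliding window; the two booleans here stand in for those tests only to show how the categories nest.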
Figure 2. Accuracy of viral sequence predictions of VirSorter, PHAST, Phage_finder and PhiSpy on (A) complete microbial genomes, and (B) draft genomes from simulated SAGs including a microbial and viral genome.
For each set of predictions (i.e., each tool and set of options when applicable), the two metrics used to evaluate the tool performance are Recall (x-axis, proportion of known viral sequences or regions detected) and Precision (y-axis, proportion of predictions that corresponded to known viral sequences or regions). Prophages identified in the complete microbial genomes are compared to the list of manually curated prophages from Casjens (2003).
Comparison of VirSorter predictions with prophage predictors on Pseudomonas aeruginosa LES B58 genome (NC_011770).
The coordinates of each known prophage in the Pseudomonas aeruginosa LES B58 genome and its detection by the different tools are indicated, with absence of detection highlighted in red. For VirSorter and PHAST, the category of detection (1, 2 or 3 for VirSorter; intact, incomplete or questionable for PHAST) is also indicated. False-positive detections of genomic islands as putative prophages are highlighted in orange.
| Feature | Coordinates | VirSorter | PHAST | PhiSpy | Phage_Finder |
|---|---|---|---|---|---|
| Prophage 1 | 665,272–680,608 | Prophage – 2 | Prophage – questionable | Prophage | – |
| Prophage 2 | 863,875–906,018 | Prophage – 2 | Prophage – questionable | Prophage | Prophage |
| Prophage 3 | 1,433,756–1,476,547 | Prophage – 2 | Prophage – questionable | Prophage | Prophage |
| Prophage 4 | 1,684,045–1,720,850 | Prophage – 2 | Prophage – questionable | Prophage | Prophage |
| Genomic Island 1 | 2,504,700–2,551,100 | Prophage – 3 | Prophage – questionable | – | – |
| Prophage 5 | 2,690,450–2,740,350 | Prophage – 1 | Prophage – intact | Prophage | Prophage |
| Genomic Island 2 | 2,751,800–2,783,500 | – | – | – | – |
| Genomic Island 3 | 2,796,836–2,907,406 | – | – | Prophage | – |
| Genomic Island 4 | 3,392,800–3,432,228 | – | – | – | – |
| Prophage 6 | 4,545,190–4,552,788 | Prophage – 2 | Prophage – intact | – | – |
| Genomic Island 5 | 4,931,528–4,960,941 | Prophage – 3 | – | – | – |
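As a worked illustration of the Recall and Precision definitions given in the Figure 2 caption, the sketch below counts region-level detections in the table above. The detections are transcribed from the table; treating each row as one region, and each genomic island reported as a prophage as one false positive, is a simplifying assumption, so the resulting values describe this single genome only and are not the figures reported in Figure 2. The names `REGIONS` and `recall_precision` are hypothetical.

```python
# Each entry: (feature, is a known prophage, tools that reported it as a prophage).
# Transcribed from the Pseudomonas aeruginosa LES B58 table above.
REGIONS = [
    ("Prophage 1", True, {"VirSorter", "PHAST", "PhiSpy"}),
    ("Prophage 2", True, {"VirSorter", "PHAST", "PhiSpy", "Phage_Finder"}),
    ("Prophage 3", True, {"VirSorter", "PHAST", "PhiSpy", "Phage_Finder"}),
    ("Prophage 4", True, {"VirSorter", "PHAST", "PhiSpy", "Phage_Finder"}),
    ("Genomic Island 1", False, {"VirSorter", "PHAST"}),
    ("Prophage 5", True, {"VirSorter", "PHAST", "PhiSpy", "Phage_Finder"}),
    ("Genomic Island 2", False, set()),
    ("Genomic Island 3", False, {"PhiSpy"}),
    ("Genomic Island 4", False, set()),
    ("Prophage 6", True, {"VirSorter", "PHAST"}),
    ("Genomic Island 5", False, {"VirSorter"}),
]


def recall_precision(tool, regions=REGIONS):
    """Return (recall, precision) for one tool, counting one hit per table row.

    Recall    = detected known prophages / all known prophages
    Precision = detected known prophages / all regions the tool reported
    Precision is None when the tool reported nothing.
    """
    known_total = sum(1 for _, known, _ in regions if known)
    true_pos = sum(1 for _, known, tools in regions if known and tool in tools)
    false_pos = sum(1 for _, known, tools in regions if not known and tool in tools)
    predicted = true_pos + false_pos
    recall = true_pos / known_total
    precision = true_pos / predicted if predicted else None
    return recall, precision


if __name__ == "__main__":
    for tool in ("VirSorter", "PHAST", "PhiSpy", "Phage_Finder"):
        recall, precision = recall_precision(tool)
        print(f"{tool}: recall={recall:.2f}, precision={precision:.2f}")
```

Counting by row ignores how much of each region a tool actually covered, which a coordinate-based comparison would need to take into account.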
Figure 3. Detection of viral sequences in microbial metagenomes by VirSorter.
(A) Average Recall (x-axis) and Precision (y-axis) of viral sequence detection by VirSorter in 10 simulated microbial metagenomes for different contig size thresholds. (B) Detection of viral sequences by VirSorter in simulated microbial metagenomes by contig size fraction.
Results of VirSorter viral sequence detection on simulated viral metagenomes with a limited contamination by cellular genomes (1 to 25% of raw reads).
Metrics presented are Recall (proportion of viral sequences detected) and Precision (proportion of predictions corresponding to viral sequences).
| | VirSorter, categories 1 & 2: Recall | VirSorter, categories 1 & 2: Precision | VirSorter, all categories: Recall | VirSorter, all categories: Precision |
|---|---|---|---|---|
| | 31.71% | 99.89% | 32.96% | 99.79% |
| | 85.64% | 99.80% | 90.29% | 99.62% |
| | 97.14% | 99.48% | 99.82% | 98.99% |