Saskia L Smits, Rogier Bodewes, Aritz Ruiz-Gonzalez, Wolfgang Baumgärtner, Marion P Koopmans, Albert D M E Osterhaus, Anita C Schürch.
Abstract
Viral infections remain a serious global health issue. Metagenomic approaches are increasingly used not only to detect novel viral pathogens but also to generate complete genomes of uncultivated viruses. In silico identification of complete viral genomes from sequence data would allow rapid phylogenetic characterization of these new viruses. Often, however, complete viral genomes are not recovered; instead, several distinct contigs derived from a single entity are obtained, some of which have no sequence homology to any known proteins. De novo assembly of single viruses from a metagenome is challenging, not only because of the lack of a reference genome, but also because of intrapopulation variation and uneven or insufficient coverage. Here we explored different assembly algorithms, remote homology searches, genome-specific sequence motifs, k-mer frequency ranking, and coverage profile binning to detect and obtain viral target genomes from metagenomes. All methods were tested on 454-generated sequencing datasets containing three recently described RNA viruses with relatively large genomes that were divergent from previously known viruses of the viral families Rhabdoviridae and Coronaviridae. Depending on specific characteristics of the target virus and the metagenomic community, different assembly and in silico gap closure strategies were successful in obtaining near-complete viral genomes.
Keywords: assembly; metagenome; pathogen; viral metagenomics; virome; virus; virus discovery
Year: 2014 PMID: 25566226 PMCID: PMC4270193 DOI: 10.3389/fmicb.2014.00714
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Description of deep sequencing datasets.
| Dataset | CCS (DRV) | RFF (RFFRV) | TPLT (PNV) |
| --- | --- | --- | --- |
| Total number of reads | 69358 | 56174 | 135812 |
| Assembled metagenome (%) | 40.67 | 57.78 | 36.83 |
| Reads identified by homology search as obtained from target virus (%) | 27.67 | 5.82 | 0.11 |
| Reads retrospectively obtained from target virus (%) | 69.52 | 13.58 | 26.14 |
Cell culture supernatant (CCS) containing Dolphin rhabdovirus (DRV), red fox feces (RFF) metagenome containing red fox fecal rhabdovirus (RFFRV) and python lung tissue (TPLT) metagenome containing python nidovirus (PNV).
Taxonomic composition of deep sequencing datasets.
| Taxon (% of reads) | CCS (DRV) | RFF (RFFRV) | TPLT (PNV) |
| --- | --- | --- | --- |
| Unassigned | 0.04 | 0.19 | 0.97 |
| Virus | 68.05 | 30.21 | 0.72 |
| Unknown | 3.90 | 10.35 | 49.39 |
| Eukaryota | 27.40 | 37.90 | 35.22 |
| Bacteria | 0.61 | 21.34 | 13.64 |
| Archaea | 0 | 0 | 0.06 |
Taxonomic composition, as a percentage of reads, of the cell culture supernatant (CCS) containing Dolphin rhabdovirus (DRV), the red fox feces (RFF) metagenome containing red fox fecal rhabdovirus (RFFRV), and the python lung tissue (TPLT) metagenome containing python nidovirus (PNV). Unassigned: best BLAST hit without taxonomic assignment. Unknown: no homology to any database entry.
Figure 1. Viral target genomes. Panels (A–C) contain information on read coverage and contigs matching the viral genomes of DRV (A), RFFRV (B), and PNV (C), produced by different assembly algorithms. Only contigs larger than 1 kb are shown. Green: Contigs assembled through Genovo as described in the methods. Black outlined: Contigs assembled through iterative assembly. Black solid: Seed contig. Red: Contigs assembled through CLC Genomics workbench assembler. Blue: Contigs assembled through Newbler assembler. Small black boxes at the bottom of the read coverage line mark stretches of low sequence complexity. “ORF” indicates the genome organization as described below. “Motif” shows the location of sequence motifs. Motifs are shown in detail in Figure S1. “BLAST” shows regions with sequence homology as determined by BLASTX. Colored boxes show sequence identity to the best BLAST hit as indicated on top. “HMM” indicates regions with remote homology identified by PFAM profiles, if any. Ruler at the bottom indicates sequence lengths in kilobases. (A) DRV, Dolphin rhabdovirus; N, nucleoprotein; P, phosphoprotein; M, matrix protein; G, glycoprotein; L, large protein. (B) RFFRV, Red fox fecal rhabdovirus; N, nucleoprotein; P, phosphoprotein; M, matrix protein; G, glycoprotein; L, large protein; no abbreviation, alpha 1,2,3 protein. (C) PNV, Python nidovirus; PP1a, polyprotein 1a; PP1b, polyprotein 1b; S, spike glycoprotein; no abbreviations, minor membrane protein, membrane protein, nucleocapsid protein, minor membrane protein 2, putative hemagglutinin-neuraminidase protein. Striped line at 5′ end indicates putative unresolved 5′ end.
Figure 2. Coverage profile binning. Histograms of coverage (reads per base) of each contig of (A) cell culture supernatant containing Dolphin rhabdovirus, (B) red fox feces containing red fox fecal rhabdovirus, and (C) python lung tissue containing python nidovirus. Gray: contigs mapping to the finished viral genome. Black: seed contig. The first bar in the last panel is truncated for visibility (47%). Only contigs larger than 1 kb are shown.
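The idea behind the coverage histograms in Figure 2 can be illustrated with a minimal sketch. It assumes that per-contig coverage values are already available and groups contigs into fixed-width coverage bins, then lists the contigs that share a bin with the seed contig. The function names, the bin width of 10, and the example values are hypothetical and are not taken from the study; the authors' actual binning procedure may differ.

```python
from collections import defaultdict


def contig_coverage(contig_length, mapped_read_lengths):
    """Mean coverage: total mapped read bases divided by contig length."""
    return sum(mapped_read_lengths) / contig_length


def bin_by_coverage(coverages, bin_width=10.0):
    """Group contig names into fixed-width coverage bins.

    coverages: dict mapping contig name -> mean coverage.
    bin_width: hypothetical bin size, chosen only for illustration.
    """
    bins = defaultdict(list)
    for name, cov in coverages.items():
        bins[int(cov // bin_width)].append(name)
    return dict(bins)


def same_bin_as_seed(coverages, seed, bin_width=10.0):
    """Return the contigs (other than the seed) in the seed contig's bin."""
    bins = bin_by_coverage(coverages, bin_width)
    seed_bin = int(coverages[seed] // bin_width)
    return sorted(name for name in bins[seed_bin] if name != seed)


# Illustrative values only, not data from the paper.
example = {"seed": 42.0, "c1": 45.5, "c2": 3.2, "c3": 48.9}
candidates = same_bin_as_seed(example, "seed")
```

Fixed-width bins are the simplest choice but split contigs whose coverage falls just either side of a bin boundary; a tolerance window around the seed coverage would be one alternative.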
Figure 3. K-mer profiling. Dot plots showing ranked k-mer distance of each contig when compared to the k-mer profile of the seed contig of (A) Dolphin rhabdovirus (DRV), (B) red fox fecal rhabdovirus (RFFRV), and (C) python nidovirus (PNV) in relation to contig lengths. Open boxes indicate contigs that were retrospectively identified as originating from the target genomes. Only contigs larger than 1 kb are shown.
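The ranking plotted in Figure 3 can be sketched roughly as follows. Each contig is reduced to a normalized k-mer frequency profile, and contigs are sorted by their distance to the seed contig's profile. The choice of k = 4, the Euclidean distance, and all function names are assumptions made for illustration; the record above does not state which k or distance measure the authors used.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Normalized frequencies of all 4**k DNA k-mers in seq (k=4 is an assumption)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # k-mers containing N etc. are ignored
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def euclidean(p, q):
    """Euclidean distance between two equal-length profiles (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rank_by_kmer_distance(seed, contigs, k=4):
    """Return (name, distance, length) tuples sorted by distance to the seed.

    contigs: dict mapping contig name -> nucleotide sequence.
    """
    seed_profile = kmer_profile(seed, k)
    scored = [
        (name, euclidean(seed_profile, kmer_profile(seq, k)), len(seq))
        for name, seq in contigs.items()
    ]
    return sorted(scored, key=lambda item: item[1])


# Toy sequences for illustration only.
ranked = rank_by_kmer_distance(
    "ACGT" * 50,
    {"similar": "ACGT" * 40, "different": "AAAATTTT" * 20},
)
```

Plotting the distance of each tuple against its length gives a dot plot of the same kind as Figure 3. This sketch does not merge reverse-complement k-mers, which a real analysis of contigs of unknown orientation would likely need to do.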