Hayssam Soueidan, Louise-Amélie Schmitt, Thierry Candresse, Macha Nikolski.
Abstract
Collectively, viruses have the greatest genetic diversity on Earth, occupy extremely varied niches, and are likely able to infect all living organisms. Viral infections are an important issue for human health and cause considerable economic losses when agriculturally important crops or husbandry animals are infected. The advent of metagenomics has provided a valuable tool for studying viruses by sampling them in natural environments and identifying the genomic composition of a sample. However, clear recognition and taxonomic assignment of the identified viruses have been hampered by the computational difficulty of these problems. In this perspective paper, we examine the trends in current research on the identification of viral sequences in a metagenomic sample, pinpoint the intrinsic computational difficulties of identifying novel viral sequences within metagenomic samples, and suggest possible avenues to overcome them.
Keywords: NGS; host-pathogen interactions; microbial metagenomics; taxonomic assignment; virome
Year: 2015 PMID: 25610431 PMCID: PMC4285800 DOI: 10.3389/fmicb.2014.00739
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Figure 1. Distribution of kDN by class for each of three classification tasks. (A) corresponds to Task 3, the assignment of 500 nt contigs to first-level domains; (B1, B2) to Task 2, the assignment of 500 nt viral contigs to a group or to a family, respectively; (C1, C2) to Task 1, the assignment of 500 nt bacterial contigs to a phylum or to a class, respectively. Each of the 300,000 randomly selected contigs sampled from different first-level domains was represented as a vector of 3-mer frequencies. Histograms indicate how many contigs (y-axis) per class (colors) have a given number of neighbors (x-axis) that do not share their class label, among their 73 closest neighbors. Neighbors are determined with respect to Euclidean distance in the space of 3-mer frequencies (cf. Section "Why is the First-level Assignment Problem Hard?" of the main text). For example, more than 6000 archaeal contigs have no non-archaeal contig among their 73 closest neighbors (red bar at kDN = 0). The dashed line represents the boundary between contigs that are easy to classify correctly with a majority vote (to the left of the line) and those that are hard to classify (to the right). Only the 4 most abundant classes are shown for (B1, C1), and the 6 most abundant for (B2, C2).
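To make the quantity plotted in Figure 1 concrete, here is a minimal Python sketch of how a kDN value could be computed from 3-mer frequency vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names `kmer_frequencies` and `kdn` are hypothetical, kDN is read here as the number of "disagreeing" neighbors among the k nearest, windows containing symbols other than A, C, G, T are simply skipped, and the brute-force neighbor search would need to be replaced by a spatial index for anything close to 300,000 contigs.

```python
from itertools import product
from math import dist

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]  # the 64 possible 3-mers
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_frequencies(seq):
    """Return the 64-dimensional vector of relative 3-mer frequencies of seq."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - K + 1):
        idx = INDEX.get(seq[i:i + K])  # assumption: ambiguous windows are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def kdn(vectors, labels, k=73):
    """For each point, count how many of its k nearest neighbors
    (Euclidean distance) carry a different class label."""
    result = []
    for i, v in enumerate(vectors):
        # Brute force: distance from point i to every other point.
        others = sorted(
            (dist(v, w), labels[j]) for j, w in enumerate(vectors) if j != i
        )
        result.append(sum(1 for _, lab in others[:k] if lab != labels[i]))
    return result


# Toy usage with made-up sequences and labels (not data from the paper).
toy_seqs = ["AAAAAAAA", "AAAAAAAT", "CCCCCCCC", "CCCCCCCG"]
toy_labels = ["x", "x", "y", "y"]
toy_vectors = [kmer_frequencies(s) for s in toy_seqs]
print(kdn(toy_vectors, toy_labels, k=1))
```

Under this reading, a contig with a low kDN sits among neighbors of its own class, so a majority vote over its neighbors would label it correctly, while a high kDN marks a contig that such a vote would likely misclassify.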
Figure 2. 2D projection of 3-mer frequencies for cellular and viral contigs. (A) Top two dimensions from the PCA reduction of 28,134 contigs (points) of average length 500 nt, represented as frequency vectors of 3-mers and sampled equally from genomes originating from the 3 top-level cellular domains (top row) and from the 3 viral types known to infect them (bottom row). Dimension 1 (x-axis) accounts for 30% of the variance and dimension 2 (y-axis) for 8%. For each sub-panel, a 2D kernel density estimate is drawn with red contour lines, and local density maxima are numbered within large white shapes. (B) Close-up of (A) with all local density maxima. The principal components were computed once for the whole set of contigs from all genomes, so positions, coordinates, and axes are comparable across all sub-panels.
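The projection in Figure 2 can be illustrated with a small, dependency-free PCA sketch. This is only one way such a projection could be computed and is not taken from the paper: the function name `top_components` is hypothetical, and power iteration with deflation is used here purely to keep the example self-contained (a library routine would normally be preferred). The key point it mirrors from the caption is that the axes are fitted once on the whole set of vectors, so the resulting coordinates are comparable across subsets, and that each axis comes with a share of the total variance.

```python
import random


def top_components(rows, n_components=2, iterations=500, seed=0):
    """Project rows onto their leading principal axes.

    Returns (coords, ratios): per-row coordinates on each axis, and the
    fraction of total variance explained by each axis.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d), computed once on the whole data set.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    total_var = sum(cov[a][a] for a in range(d))
    rng = random.Random(seed)
    axes, ratios = [], []
    for _ in range(n_components):
        v = [rng.uniform(-1, 1) for _ in range(d)]
        for _ in range(iterations):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            if norm == 0:
                raise ValueError("fewer directions with variance than n_components")
            v = [x / norm for x in w]
        # Rayleigh quotient gives the variance along the axis just found.
        eig = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        axes.append(v)
        ratios.append(eig / total_var)
        # Deflate: remove this axis before searching for the next one.
        cov = [[cov[a][b] - eig * v[a] * v[b] for b in range(d)] for a in range(d)]
    coords = [[sum(c[j] * ax[j] for j in range(d)) for ax in axes] for c in centered]
    return coords, ratios


# Toy usage with made-up 2D points (not data from the paper).
toy_coords, toy_ratios = top_components([[2, 0], [-2, 0], [0, 1], [0, -1]])
print([round(r, 3) for r in toy_ratios])
```

In the figure's setting, each row would be a 64-dimensional 3-mer frequency vector, and the two returned ratios play the role of the 30% and 8% of variance reported for dimensions 1 and 2.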