K Eric Wommack, Jaysheel Bhavsar, Shawn W Polson, Jing Chen, Michael Dumas, Sharath Srinivasiah, Megan Furman, Sanchita Jamindar, Daniel J Nasko.
Abstract
One consistent finding among studies using shotgun metagenomics to analyze whole viral communities is that most viral sequences show no significant homology to known sequences. Thus, bioinformatic analyses based on sequence collections such as GenBank nr, which consist largely of sequences from known organisms, tend to ignore a majority of sequences within most shotgun viral metagenome libraries. Here we describe a bioinformatic pipeline, the Viral Informatics Resource for Metagenome Exploration (VIROME), that emphasizes the classification of viral metagenome sequences (predicted open-reading frames) based on homology search results against both known and environmental sequences. Functional and taxonomic information is derived from five annotated sequence databases which are linked to the UniRef 100 database. Environmental classifications are obtained from hits against a custom database, MetaGenomes On-Line, which contains 49 million predicted environmental peptides. Each predicted viral metagenomic ORF run through the VIROME pipeline is placed into one of seven ORF classes, so that every sequence receives a meaningful annotation. Additionally, the pipeline includes quality control measures to remove contaminating and poor-quality sequences, and it assesses the potential amount of cellular DNA contamination in a viral metagenome library by screening for rRNA genes. Access to the VIROME pipeline and analysis results is provided through a web-application interface that is dynamically linked to a relational back-end database. The VIROME web-application interface is designed to allow users flexibility in retrieving sequences (reads, ORFs, predicted peptides) and search results for focused secondary analyses.
Keywords: ORFan; environmental sequencing; shotgun metagenomics; viral ecology
Year: 2012 PMID: 23407591 PMCID: PMC3558967 DOI: 10.4056/sigs.2945050
Source DB: PubMed Journal: Stand Genomic Sci ISSN: 1944-3277
Figure 1. Overview flow-chart of the VIROME bioinformatic pipeline. A) Initial screening steps to remove poor-quality sequences, false duplicate sequences created during 454 em-PCR library preparation, and rRNA-containing sequences. Contaminating-sequence screens include searches against the UniVec database for vector, linker, and adapter sequences. B) Analysis steps, including the identification of tRNA-containing sequences and BLASTP of metagenome peptides against the UniRef 100 and MGOL sequence databases. Significant BLASTP hits have an expectation score of E < 0.001. C) Viral metagenome peptide sequences with a significant hit to a UniRef 100 sequence are characterized by the taxonomic origin of the homolog and functional information contained within UniRef or the annotated databases. Those metagenome peptides with hits to the MGOL database are characterized according to the environmental origin of their MGOL homologs (Figure 3). Sequences within blue objects are accessible through the VIROME web-application interface for viewing or download. Parameters for sequence analyses (rectangles) are given in Table 1.
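The significance cutoff described in this caption can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that BLAST results are available as 12-column tab-separated text with the E-value in the eleventh column; the source does not state which output format the pipeline uses, and the function name `significant_hits` is hypothetical.

```python
import csv
import io

# Cutoff taken from the caption: significant hits have E < 0.001.
SIGNIFICANCE_CUTOFF = 1e-3


def significant_hits(tabular_text, cutoff=SIGNIFICANCE_CUTOFF):
    """Return (query, subject, evalue) tuples for hits with E-value below cutoff.

    Assumption: rows are tab-separated with 12 columns and the E-value
    sits in column 11 (index 10). Comment and blank lines are skipped.
    """
    hits = []
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        evalue = float(row[10])
        if evalue < cutoff:
            hits.append((row[0], row[1], evalue))
    return hits
```

Because Table 1 lists a looser search threshold (`-e 1e-1`) for the BLASTP steps than the significance cutoff quoted here, a post-search filter of this kind is one plausible way the two values could coexist.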
Figure 2. Overview flow-chart of the VIROME classification scheme for environmental peptides. BLAST homology data from the sequence analysis pipeline (Figure 1) serve as input to the classification decision tree. Peptides having a significant hit (E ≤ 0.001) to a sequence in UniRef 100 are placed in the ‘Known protein’ bin. If one of the homologs has a meaningful annotation, the viral metagenome predicted peptide is considered a ‘Functional protein’. If not, the peptide is considered an ‘Unassigned protein’. Peptides having a significant hit only to an environmental peptide in the MGOL database are placed in the ‘Environment protein’ bin. Within this bin, peptides that hit only environmental proteins within either microbial or viral metagenome libraries are classified as ‘Only microbial hit’ or ‘Only viral hit’, respectively. Peptides having hits to proteins within both viral and microbial metagenome libraries are classified as either ‘Top-hit microbial’ or ‘Top-hit viral’, depending on whether the top BLAST hit came from a microbial or viral metagenome library, respectively. A predicted viral metagenome peptide having no significant hit to a protein within the UniRef 100 or MGOL sequence databases is classified as an ‘ORFan’.
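The decision tree in this caption can be sketched in Python as follows. This is an illustrative reading of the caption, not the pipeline's actual code: the `Hit` record, its field names, and the choice of the lowest E-value as the "top" hit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit (field names are assumptions)."""
    evalue: float
    source: str                          # "uniref" or "mgol"
    library_type: Optional[str] = None   # "viral" or "microbial" (MGOL hits only)
    has_annotation: bool = False         # meaningful annotation (UniRef hits only)


def classify_orf(hits: List[Hit], cutoff: float = 1e-3) -> str:
    """Place one predicted peptide into one of the seven ORF classes."""
    significant = [h for h in hits if h.evalue <= cutoff]

    # 'Known protein' bin: any significant UniRef 100 hit takes precedence.
    uniref = [h for h in significant if h.source == "uniref"]
    if uniref:
        if any(h.has_annotation for h in uniref):
            return "Functional protein"
        return "Unassigned protein"

    # 'Environment protein' bin: significant hits to MGOL only.
    mgol = [h for h in significant if h.source == "mgol"]
    if not mgol:
        return "ORFan"

    library_types = {h.library_type for h in mgol}
    if library_types == {"viral"}:
        return "Only viral hit"
    if library_types == {"microbial"}:
        return "Only microbial hit"

    # Hits in both library types: decide by the top hit.
    # Assumption: "top" means the lowest E-value.
    top = min(mgol, key=lambda h: h.evalue)
    return "Top-hit viral" if top.library_type == "viral" else "Top-hit microbial"
```

For example, a peptide with one significant viral MGOL hit and no UniRef hit would be returned as ‘Only viral hit’, while a peptide whose hits all exceed the cutoff falls through to ‘ORFan’.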
Figure 3. Environmental terms and metadata appended to each library within the MetaGenomes On-Line (MGOL) database. Using the annotation scheme presented in Figure 2, the distribution of significant BLAST hits (E < 0.001) to MGOL sequences can be described according to environmental feature terms or ENVO terms.
Table 1. Algorithms, parameters, and databases used in the VIROME bioinformatics pipeline
| Process | Tool | Parameters | Subject database |
|---|---|---|---|
| Screening of rRNAs | BLASTALL | -p blastn -e 1e-3 -f T -b 1 -v 1 -M BLOSUM62 | |
| Identification of tRNAs | tRNAscan-SE | -b G | |
| ORF calling | MetaGene Annotator | -m | |
| Known protein identification | BLASTALL | -p blastp -e 1e-1 -F T -b 50 -v 50 -M BLOSUM62 | UniRef 100 |
| Environmental protein identification | BLASTALL | -p blastp -e 1e-1 -F T -b 50 -v 50 -M BLOSUM62 | MGOL |
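As an illustration of how a parameter string from the table could be turned into a command line, the following Python sketch assembles an argument list for the legacy `blastall` program. The `-d` (database) and `-i` (query file) flags are standard `blastall` options but are not listed in the table, and the database name and file name below are hypothetical placeholders. The sketch only builds the argument list; it does not run BLAST.

```python
import shlex

# Parameter string for known protein identification, as listed in the table.
BLASTP_PARAMETERS = "-p blastp -e 1e-1 -F T -b 50 -v 50 -M BLOSUM62"


def build_blastall_command(parameters, database, query_fasta):
    """Build an argv list for legacy blastall.

    Assumptions: `database` is the name of a formatted BLAST database and
    `query_fasta` is a FASTA file of predicted peptides; both are supplied
    by the caller and are not specified in the source.
    """
    return ["blastall", *shlex.split(parameters), "-d", database, "-i", query_fasta]


# Hypothetical database and file names, for illustration only.
command = build_blastall_command(BLASTP_PARAMETERS, "uniref100", "orfs.faa")
print(" ".join(command))
```

Passing the command as a list (for example to `subprocess.run`) avoids shell quoting issues with file names.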
Figure 4. Flow-chart of VIROME environmental annotation. For each predicted viral metagenome ORF, E-scores (E<0.01) of top-hits against each unique library in the MGOL database are summed. Ratios of E-score distribution for each unique MGOL library are calculated. These ratios can be used to examine the prevalence of sequence homologs according to the environmental features of MGOL libraries (e.g., ecosystem, biome, physico-chemical parameters).