Rebecca Rose, Bede Constantinides, Avraam Tapinos, David L Robertson, Mattia Prosperi.
Abstract
Genome sequencing technologies continue to develop at a remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling that serves researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress: fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes, demanding the use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the small number of documented viral reference genomes relative to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.
Keywords: assembly; classification; epidemic; metagenomics; next-generation sequencing; surveillance
Year: 2016 PMID: 29492275 PMCID: PMC5822887 DOI: 10.1093/ve/vew022
Source DB: PubMed Journal: Virus Evol ISSN: 2057-1577
Figure 1. Two widely used methodologies in de novo assembly of short reads: de Bruijn graph assembly and overlap-layout-consensus (OLC). Reads are not represented explicitly within a de Bruijn graph; they are instead decomposed into distinct subsequence ‘words’ of length k, or k-mers, which can be linked together via overlapping k-mers to create an assembly graph. In OLC, a pairwise comparison of all reads is performed, identifying reads with overlapping regions. These overlaps are used to construct a read graph. Next, overlapping reads are bundled into aligned contigs in what is referred to as the layout step, before finally the most likely nucleotide at each position is determined through consensus. This figure is simplified to demonstrate the theory for the assembly of single genomes; note that the process has additional complexities for the reconstruction of metagenomes.
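The de Bruijn decomposition described in the caption can be illustrated with a minimal sketch. The reads, the value of k, and the function names below are hypothetical and chosen only for illustration; a real assembler additionally handles sequencing errors, reverse complements, coverage, and branching paths.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every overlapping k-mer in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def de_bruijn_graph(reads, k):
    """Link each (k-1)-mer prefix to the (k-1)-mer suffix of the same k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


# Three toy reads (hypothetical) that overlap end to end.
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

The reads themselves are never stored in the graph; only their k-mers are, which is why the walk can recover a sequence longer than any single read.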
Discussed software for the analysis of viral (meta)genomes.
| Name | Application | Distribution | Interface | Platform | License | Description | URL |
|---|---|---|---|---|---|---|---|
| Kraken | Taxonomic assignment | Source | Command line | Linux, Mac OS | GNU GPL | Fast in-memory k-mer search and LCA assignment of short reads using a comprehensive sequence database | |
| Kaiju | Taxonomic assignment | Source | Web, command line | Website, Linux | GNU GPL | Fast in-memory, k-mer seeded protein search and LCA taxonomic assignment | |
| CLARK | Taxonomic assignment | Source | Command line | Linux, Mac OS | GNU GPL | Fast in-memory k-mer search and LCA assignment of short reads using a comprehensive sequence database | |
| Lambda | Protein homology search | Binary, source | Command line | Linux, Mac OS | GNU GPL | Fast BLAST compatible nucleotide and reduced alphabet protein homology search | |
| Diamond | Protein homology search | Binary, source | Command line | Linux, Mac OS, FreeBSD | – | Fast BLAST compatible nucleotide and reduced alphabet protein homology search | |
| NCBI BLAST+ | Nucleotide and protein homology search | Binary, source | Web, command line | Linux, Windows, Mac OS | Public domain | Nucleotide and protein homology search | |
| VirusHunter | Virus discovery | Source | Command line | Linux | GNU GPL | Automated viral discovery pipeline for use with a computing cluster | |
| MetaVir | Taxonomic assignment | Web application | Web | – | – | Annotation and visualization of viral reads and assemblies | |
| VirSorter | Virus and prophage discovery | Source | Command line | Linux, Mac OS, Docker | GNU GPL | Reference-based and reference-independent annotation of assembled virus genomes | |
| One Codex (Minot, Krumm, and Greenfield) | Taxonomic assignment | Web application | Web, web API | – | – | Web portal for assignment, visualization and comparison of metagenomes using a bespoke comprehensive sequence database | |
| PHYMMBL | Taxonomic assignment | Source | Command line | Linux | – | Hybrid taxonomic assignment using Phymm interpolated Markov models and BLAST results | |
| IVA | Viral genome assembly | Python package, source | Command line | Linux, Mac OS | GNU GPL | Consensus genomic assembly of diverse viral populations using paired-end short reads | |
| Vicuna | Viral genome assembly | Source | Command line | Linux | Broad academic license | Consensus genomic assembly of diverse viral populations using short reads | |
| PRICE | Viral genome assembly | Source | Command line | Linux | GNU GPL | Assembly of low abundance viral sequences in metagenomes with paired-end short reads | |
| SPAdes | Genome assembly | Binary, source | Command line | Linux, Mac OS | GNU GPL | Microbial genome assembly with long and short paired-end reads | |
| MetaSPAdes | Metagenome assembly | Binary, source | Command line | Linux, Mac OS | GNU GPL | Metagenome assembly with long and short paired-end reads | |
| MEGAHIT | Metagenome assembly | Source | Command line | Linux, Mac OS | GNU GPL | Fast metagenome assembly for complex metagenomes with short reads | |
| IDBA-UD | Metagenome assembly | Source | Command line | Linux, Mac OS | GNU GPL | Metagenome assembly for short paired-end reads | |
| MetaVelvet | Metagenome assembly | Source | Command line | Linux, Mac OS | GNU GPL | Metagenome assembly for short paired-end reads | |
| ShoRAH | Viral haplotype reconstruction | Source | Command line | Linux | GNU GPL | Probabilistic viral haplotype reconstruction from short reads | |
| QuRE | Viral haplotype reconstruction | Java package, source | Command line, graphical interface | Most OSs | GNU GPL | Probabilistic viral haplotype reconstruction from short reads | |
| PredictHaplo | Viral haplotype reconstruction | Source | Command line | Linux | GNU GPL | Probabilistic viral haplotype reconstruction from short paired-end reads | |
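Several of the taxonomic assignment tools listed above are described as combining a k-mer search with lowest common ancestor (LCA) assignment. The general idea can be sketched as follows. This is a simplified illustration, not the implementation of any listed tool: the taxonomy, taxon names, reference sequences, and function names are all hypothetical, and real classifiers use compact in-memory indexes and more elaborate scoring of k-mer hits.

```python
def lineage(taxon, parent):
    """Return the path from a taxon up to the root of the taxonomy."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path


def lca(taxa, parent):
    """Lowest common ancestor of one or more taxa sharing a root."""
    taxa = list(taxa)
    common = lineage(taxa[0], parent)
    for taxon in taxa[1:]:
        ancestors = set(lineage(taxon, parent))
        common = [node for node in common if node in ancestors]
    return common[0]


def build_index(references, parent, k):
    """Map each reference k-mer to the LCA of the taxa that contain it."""
    hits = {}
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            hits.setdefault(seq[i:i + k], set()).add(taxon)
    return {kmer: lca(taxa, parent) for kmer, taxa in hits.items()}


def classify(read, index, parent, k):
    """Assign a read to the LCA of all taxa hit by its k-mers, or None."""
    taxa = {index[read[i:i + k]]
            for i in range(len(read) - k + 1)
            if read[i:i + k] in index}
    return lca(taxa, parent) if taxa else None


# Toy taxonomy and toy reference sequences (hypothetical, not real genomes).
parent = {"virus_A": "genus_X", "virus_B": "genus_X",
          "genus_X": "family_Y", "family_Y": "root"}
references = {"virus_A": "ATGGCGTGCA", "virus_B": "ATGGCGAACC"}
index = build_index(references, parent, k=5)

print(classify("GCGTGCA", index, parent, 5))  # virus_A
print(classify("ATGGCG", index, parent, 5))   # genus_X
print(classify("TTTTTTT", index, parent, 5))  # None
```

A read whose k-mers occur only in one reference is assigned to that taxon, whereas a read drawn from a region shared by both references can only be resolved to their common genus.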
Figure 2. Proposed discrete wavelet transform (DWT) signal processing approach for nucleotide sequence analysis. Sequences 1 and 2 are subsequences of the HIV-1 HXB2 genome (the reference genome for HIV), and sequence 3 is a subsequence of the Mycoplasma genitalium genome (all three sequences appear at the bottom of the figure). (A) illustrates the integer representations of the three sequences: sequence 1 is depicted as a black line, sequence 2 as a red line, and sequence 3 as a blue line. The sequences are mapped into numerical space with the integer representation method, enabling the application of transformation approaches. (B) illustrates the DWTs of the three sequences’ numerical representations at varying resolutions. The three sequences are each shown consecutively transformed into six reduced-resolution representations. The minor sequence mismatches between sequences 1 and 2 (indicated with green circles) can be easily detected at different transformation resolutions despite the reduction in information content from the transformation process. Similar nucleotide sequences give rise to similar DWTs and thus can be intuitively identified even at low resolution (level 6), where sequences are represented by a single numerical value. Depicted in (C) are the coefficient matrices obtained from each sequence’s DWT. Coefficient matrices can be used to approximately identify the positions of mismatches between two sequences. Sequences 1 and 2 differ only at sites 16–17 and 48–49. The exact location of minor differences can be detected at transformation level 4, where each sequence is compressed to four wavelets. Darker colored positions between the matrices of sequences 1 and 2 indicate matching coefficients, and lighter colored positions indicate dissimilar coefficients.
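The approach in the caption can be sketched in a few lines. The caption does not specify the integer mapping or the wavelet family, so the code below assumes a hypothetical mapping (A=1, C=2, G=3, T=4) and an averaging variant of the Haar wavelet; the sequences and function names are likewise illustrative only.

```python
CODES = {"A": 1, "C": 2, "G": 3, "T": 4}  # hypothetical integer mapping


def encode(seq):
    """Map a nucleotide string to a list of integers."""
    return [CODES[base] for base in seq.upper()]


def haar_step(signal):
    """One Haar level: pairwise means (approximation) and half-differences (detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def haar_levels(signal):
    """Approximations at each successive level, halving in length down to one value."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    levels, current = [], list(signal)
    while len(current) > 1:
        current, _ = haar_step(current)
        levels.append(current)
    return levels


def differing_blocks(seq_a, seq_b, level):
    """Indices of approximation coefficients that differ at the given level."""
    a = haar_levels(encode(seq_a))[level - 1]
    b = haar_levels(encode(seq_b))[level - 1]
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Two toy 16-base sequences (hypothetical) differing at position 10 (G -> T).
seq1 = "ACGTACGTACGTACGT"
seq2 = "ACGTACGTACTTACGT"
print(differing_blocks(seq1, seq2, level=1))  # [5]  (bases 10-11)
print(differing_blocks(seq1, seq2, level=2))  # [2]  (bases 8-11)
print(differing_blocks(seq1, seq2, level=4))  # [0]  (whole sequence)
```

Each level halves the resolution, so a differing coefficient localizes a mismatch only to a block of 2^level bases. Because the approximation is an average, distinct blocks can occasionally share the same coefficient (for example, AT and CG under this mapping), which is why the comparison is approximate.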
Figure 3. Distinct viral species in the NCBI RefSeq releases from June 2003 to May 2015 (data from ftp://ncbi.nlm.nih.gov/refseq/release/release-statistics/viral.acc_taxid_growth.txt).