Literature DB >> 29492275

Challenges in the analysis of viral metagenomes.

Rebecca Rose1,2,3, Bede Constantinides1,2,3, Avraam Tapinos1,2,3, David L Robertson1,2,3, Mattia Prosperi1,2,3.   

Abstract

Genome sequencing technologies continue to develop with remarkable pace, yet analytical approaches for reconstructing and classifying viral genomes from mixed samples remain limited in their performance and usability. Existing solutions generally target expert users and often have unclear scope, making it challenging to critically evaluate their performance. There is a growing need for intuitive analytical tooling for researchers lacking specialist computing expertise and that is applicable in diverse experimental circumstances. Notable technical challenges have impeded progress; for example, fragments of viral genomes are typically orders of magnitude less abundant than those of host, bacteria, and/or other organisms in clinical and environmental metagenomes; observed viral genomes often deviate considerably from reference genomes demanding use of exhaustive alignment approaches; high intrapopulation viral diversity can lead to ambiguous sequence reconstruction; and finally, the relatively few documented viral reference genomes compared to the estimated number of distinct viral taxa renders classification problematic. Various software tools have been developed to accommodate the unique challenges and use cases associated with characterizing viral sequences; however, the quality of these tools varies, and their use often necessitates computing expertise or access to powerful computers, thus limiting their usefulness to many researchers. In this review, we consider the general and application-specific challenges posed by viral sequencing and analysis, outline the landscape of available tools and methodologies, and propose ways of overcoming the current barriers to effective analysis.


Keywords:  assembly; classification; epidemic; metagenomics; next-generation sequencing; surveillance

Year:  2016        PMID: 29492275      PMCID: PMC5822887          DOI: 10.1093/ve/vew022

Source DB:  PubMed          Journal:  Virus Evol        ISSN: 2057-1577


1. Introduction

In the last decade, at least seven separate viral outbreaks have caused tens of thousands of human deaths (Woolhouse, Rambaut, and Kellam, 2015), and the ever-increasing density of livestock, rate of habitat destruction, and extent of human global travel provide a fertile environment for new pandemics to emerge from host switching events (Delwart 2007; Fancello, Raoult, and Desnues 2012), as was the case for SARS, Ebola, Middle East Respiratory Syndrome (MERS), and influenza A (H1N1) (Castillo-Chavez et al. 2015). At present we have a limited grasp of the extent of viral diversity present in the environment: the 2014 database release from the International Committee on Taxonomy of Viruses classified just 7 orders, 104 families, 505 genera, and 3,286 species (http://www.ictvonline.org/virustaxonomy.asp); yet one study estimated that there are at least 320,000 virus species infecting mammals alone (Anthony et al. 2013). High throughput (or so-called ‘next generation’) sequencing of viruses during the most recent outbreaks of MERS in Saudi Arabia and Ebola in West Africa (Gire et al. 2014; Carroll et al. 2015; Park et al. 2015; Quick et al. 2016) has facilitated rapid identification of transmission chains, rates of viral evolution, and evidence of the zoonotic origin of these outbreaks. Access to such information during the initial stages of an outbreak would offer invaluable insight into when, where, and how an epidemic might emerge, informing intervention and mitigation measures or even halting an epidemic altogether. A major step towards this goal is therefore to identify existing zoonotic and environmental pathogens with pandemic potential.
This is a significant undertaking, demanding considerable investment and close collaboration between governments, NGOs, and academia (for example, the USAID program PREDICT, http://www.vetmed.ucdavis.edu/ohi/predict/index.cfm), as well as on-the-ground surveillance by local authorities and scientists in the areas of the world most at risk. The characterization of unknown viral entities in the environment is now possible with modern sequencing; however, current tooling for exploiting these data represents a practical and methodological bottleneck for effective data analysis. Practically, most available software tools are inaccessible to the majority of potential users, demanding expertise and computing resources often lacked by the researchers from diverse backgrounds involved in sample collection, sequencing, and analysis. There is a need for robust and intuitive analytical tools without requirements for fast internet connectivity, which may be unavailable in remote or developing regions. More fundamentally, the intended scope of published analytical tools and workflows is often unclear, and given the diverse applications of viral sequencing, it can be difficult to gauge the relevance of newly published tools without first testing them. For example, a fast sequence classifier might fail entirely to detect a novel strain of a well-characterized virus, or might perform well with Illumina sequences yet deliver poor results for data generated with the Ion Torrent platform. Furthermore, results arising from these analyses should be replicable, intelligible, and useful to the end user, with provision for quality control and error management. Software tools that target expert users should be tested, documented, and robustly distributed as packages or containers so as to streamline installation and the generation of results. Methodologically, most genomic sequence analysis software is not well suited to viral genomes.
Generic tools that are able to address the challenges posed by viral sequences are often applicable only in limited circumstances. Choosing between approaches is made difficult due to an abundance of disparate yet functionally equivalent methodologies and in general a lack of rigorous benchmarks for viral datasets. While there is much ongoing research in this area, both the sensitive detection of previously characterized viruses and viral discovery remain key challenges open for innovation. Here we survey the landscape of available approaches for analyzing both known and unknown viruses within genomic and metagenomic samples, with focus on their practical and methodological suitability for use by a broad spectrum of researchers seeking to characterize viral metagenomes.

2. Viral sequence enrichment: physical and in silico approaches

Within metagenomes, the proportion of viral nucleic acids is typically far lower than that of host or other microbes, limiting the amount of signal available for analysis after sequencing. To mitigate this issue, enrichment and amplification approaches are widely used prior to sequencing viral samples. Size filtration and density-based enrichment by centrifugation are two effective methods for increasing virus yield, although such methods may bias the observed composition of viral populations (Ruby, Bellare, and Derisi 2013). Alternatively, PCR amplification may be used to generate an abundance of specific viral sequences present in a sample; this widely used strategy was employed in the identification and analysis of the MERS coronavirus (Zaki et al. 2012; Cotten et al. 2013, 2014), although effective primer design can be challenging in the presence of high genomic diversity in the target viral species. Conversely, an excess of sequencing coverage can lead to the construction of overly complex and unwieldy de novo assembly graphs in the presence of high genomic diversity, reducing assembly quality. Using in silico normalisation (Crusoe et al. 2015), excess coverage may be reduced by discarding sequences containing redundant information. This approach increases analytical efficiency when dealing with high coverage sequence data, and we have shown that it can benefit de novo assembly of viral consensus sequences. Another in silico strategy for increasing analytical efficiency by discarding unneeded data is to filter out sequences from known abundant organisms through alignment with one or more reference genomes using an aligner or specialist tool (approaches reviewed in Daly et al. 2015).
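The principle behind in silico normalisation can be illustrated with a short sketch. This is not the khmer implementation of Crusoe et al., which uses probabilistic counting structures to bound memory; it is a minimal illustration (with hypothetical function names) of the underlying idea: a read is discarded once the median abundance of its k-mers already meets a coverage target, since it then carries mostly redundant information.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-length subsequence of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def normalize(reads, k=17, target_coverage=20):
    """Keep a read only while the median abundance of its k-mers is below
    the coverage target; otherwise discard it as redundant."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        abundances = sorted(counts[km] for km in kmers(read, k))
        median = abundances[len(abundances) // 2]
        if median < target_coverage:
            kept.append(read)
            for km in kmers(read, k):
                counts[km] += 1
    return kept
```

Because reads are judged against coverage accumulated so far, the procedure is single-pass and its memory use grows with the number of distinct k-mers rather than the number of reads.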

3. Choosing a sequencing platform

There are several sequencing technologies in widespread use that are capable of reading hundreds of thousands to billions of DNA sequences per run (Reuter, Spacek, and Snyder 2015). The current market leader, Illumina, manufactures instruments capable of generating billions of 150 base pair (bp) paired-end reads (see ‘Glossary’) per run, with read lengths of up to 300 bp. The Illumina short read platform is widely used for analyses of viral genomes and metagenomes, and, given sufficient sequencing coverage, enables sensitive characterization of low-frequency variation within viral populations (e.g. HIV resistance mutations as low as 0.1% (Li et al. 2014)). Ion Torrent (ThermoFisher) is capable of generating longer reads than Illumina at the expense of reduced throughput and a higher rate of insertion and deletion (indel) error (Eid et al. 2009). Single molecule real-time sequencing commercialized by Pacific Biosciences (PacBio) produces much longer (>10 kbp) reads from a single molecule without clonal amplification, which eliminates the errors introduced in this step. However, this platform has a high (∼10%) intrinsic error rate, and remains much more expensive than Illumina sequencing for equivalent throughput. The Nanopore platform from Oxford Nanopore Technologies, which includes the pocket-sized MinION sequencer, also implements long read single molecule sequencing, and permits truly real-time analysis of individual sequences as they are generated. Although more affordable than PacBio single molecule sequencing, the Nanopore platform also suffers from high error rates in comparison with Illumina (Reuter, Spacek, and Snyder 2015). However, the technology is maturing rapidly and has already demonstrated potential to revolutionize pathogen surveillance and discovery in the field, as well as enabling contiguous assembly of entire bacterial genomes at relatively low cost (Feng et al. 2015; Quick et al. 2015; Hoenen et al. 2016).
Hybrid sequencing strategies using both long and short reads leverage the ability of long reads to resolve repetitive DNA regions while benefitting from the high accuracy of short reads, at the expense of additional sequencing, library preparation and data analysis (Madoui et al. 2015).

4. Assembling genomes: de novo and reference-based assembly

The reconstruction of sequencing reads into full length genes and genomes can be performed by means of either reference-based alignment or de novo assembly, a decision dependent on experimental objectives, read length, quality, and data complexity. In reference-based approaches, reads are mapped to similar regions of a supplied template genome, a well-studied and computationally efficient process typically implemented with a suffix array index of the reference genome. In contrast, de novo assembly is computationally intensive but important in cases where either a target genome is poorly characterized or the reconstruction of genomes of a priori unknown entities in metagenomes is sought, such as in surveillance studies. For short read data, the increased sequence length afforded by assembly can be necessary to distinguish members of highly conserved gene families from one another. Assembly is also widely used for generating whole genome consensus sequences to facilitate analyses of viral variation, and is a typical starting point for analyses of diverse populations of well-characterized viruses. Even where long reads are available, assembly plays an important role in mitigating the high error rates associated with single molecule sequencing technologies, yielding accurate consensus sequences from inaccurate individual reads.
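The efficiency of reference-based mapping can be illustrated with a minimal suffix array sketch. Production aligners use compressed FM-indexes and tolerate mismatches and indels, so this naive exact-match version (our own illustrative code, not any published tool) only demonstrates why an indexed reference makes read placement fast: each lookup is a binary search rather than a scan of the genome.

```python
def suffix_array(text):
    """All suffix start positions of `text`, sorted lexicographically.
    Naive O(n^2 log n) construction; fine for illustration."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def map_read(read, reference, sa):
    """Return reference positions where `read` occurs exactly, found by
    binary search over the suffix array."""
    lo, hi = 0, len(sa)
    while lo < hi:  # find the first suffix whose prefix is >= read
        mid = (lo + hi) // 2
        if reference[sa[mid]:sa[mid] + len(read)] < read:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(sa) and reference[sa[lo]:sa[lo] + len(read)] == read:
        hits.append(sa[lo])  # all suffixes sharing the read as a prefix
        lo += 1
    return sorted(hits)
```

The index is built once per reference and reused for every read, which is the property that makes reference-based approaches scale to millions of reads.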

4.1 De novo assembly methodologies

Modern de novo assemblers generally leverage either de Bruijn graphs or read overlap graphs as part of the approach known as overlap layout consensus (OLC). Figure 1 illustrates the differences between the two methods. OLC assemblers use the similarity of whole reads in order to construct a graph wherein each read is represented by a node, and subsequently merge overlapping reads into consensus contigs (Deng et al. 2015). OLC is relatively time and memory intensive, scaling poorly to millions of reads and beyond. However, the fewer, longer reads generated by emerging single molecule sequencing technologies tend to be well suited to OLC assembly, which can be easily implemented to tolerate long and noisy sequences (Compeau, Pevzner, and Tesler 2011). Notable older de novo assemblers implementing OLC include CAP3 (Huang and Madan 1999) and Celera (http://www.jcvi.org/cms/research/projects/cabog/overview/), while MHAP (Berlin et al. 2015), Canu (Berlin et al. 2015), and Miniasm (Li 2016) represent the current state of the art. There also exist a number of OLC assemblers intended for use with viral sequences: VICUNA was designed for short, non-repetitive and highly variable reads from a single population (Yang et al. 2012), and PRICE (Ruby, Bellare, and Derisi 2013) iteratively assembles low to moderate complexity metagenomes (e.g. Runckel et al. 2011; Grard et al. 2012) using a similar algorithm to the actively developed consensus assembler IVA (Hunt et al. 2015), which like VICUNA is designed for single virus populations rather than metagenomes (see Table 1 for additional details on programs).
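A toy greedy OLC assembler makes the quadratic cost of the overlap step concrete. This is an illustrative sketch of our own, not the algorithm of any tool named above: every read pair is compared (the overlap step), the best-overlapping pair is merged (a crude layout), and the merge itself stands in for consensus since there are no sequencing errors here.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` matching a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_olc(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest overlap
    until no pair overlaps by at least `min_len` bases."""
    contigs = list(reads)
    while len(contigs) > 1:
        best = (0, None, None)
        for i, a in enumerate(contigs):       # all-vs-all comparison:
            for j, b in enumerate(contigs):   # this is the O(n^2) step
                if i != j:
                    olen = overlap(a, b, min_len)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            break
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)
    return contigs
```

The all-vs-all comparison is exactly what makes OLC expensive for millions of short reads, yet unproblematic for the smaller numbers of long reads produced by single molecule platforms.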
Figure 1.

Two widely used methodologies in de novo assembly of short reads. Reads are not represented explicitly within a de Bruijn graph; they are instead decomposed into distinct subsequence ‘words’ of length k, or k-mers, which can be linked together via overlapping k-mers to create an assembly graph. In OLC, a pairwise comparison of all reads is performed, identifying reads with overlapping regions. These overlaps are used to construct a read graph. Next, overlapping reads are bundled into aligned contigs in what is referred to as the layout step, before finally the most likely nucleotide at each position is determined through consensus. This figure is simplified to demonstrate the theory for the assembly of single genomes; note that the process has additional complexities for the reconstruction of metagenomes.

Table 1.

Discussed software for the analysis of viral (meta)genomes.

Name | Application | Distribution | Interface | Platform | License | Description | URL
Kraken (Wood and Salzberg 2014) | Taxonomic assignment | Source | Command line | Linux, Mac OS | GNU GPL | Fast in-memory k-mer search and LCA assignment of short reads using a comprehensive sequence database | https://ccb.jhu.edu/software/kraken/
Kaiju (Menzel, Ng, and Krogh 2016) | Taxonomic assignment | Source | Web, command line | Website, Linux | GNU GPL | Fast in-memory, k-mer seeded protein search and LCA taxonomic assignment | http://kaiju.binf.ku.dk/
CLARK (Ounit et al. 2015) | Taxonomic assignment | Source | Command line | Linux, Mac OS | GNU GPL | Fast in-memory k-mer search and LCA assignment of short reads using a comprehensive sequence database | http://clark.cs.ucr.edu/
Lambda (Hauswedell, Singer, and Reinert 2014) | Protein homology search | Binary, source | Command line | Linux, Mac OS | GNU GPL | Fast BLAST compatible nucleotide and reduced alphabet protein homology search | https://seqan.github.io/lambda/
Diamond (Buchfink, Xie, and Huson 2015) | Protein homology search | Binary, source | Command line | Linux, Mac OS, FreeBSD | — | Fast BLAST compatible nucleotide and reduced alphabet protein homology search | http://ab.inf.uni-tuebingen.de/software/diamond/
NCBI BLAST+ (Altschul et al. 1990) | Nucleotide and protein homology search | Binary, source | Web, command line | Linux, Windows, Mac OS | Public domain | Nucleotide and protein homology search | https://blast.ncbi.nlm.nih.gov/Blast.cgi
VirusHunter (Zhao et al. 2013) | Virus discovery | Source | Command line | Linux | GNU GPL | Automated viral discovery pipeline for use with a computing cluster | http://www.ibridgenetwork.org/wustl/virushunter
MetaVir (Roux et al. 2011) | Taxonomic assignment | Web application | Web | Web | — | Annotation and visualization of viral reads and assemblies | http://metavir-meb.univ-bpclermont.fr/
VirSorter (Roux et al. 2015) | Virus and prophage discovery | Source | Command line | Linux, Mac OS, Docker | GNU GPL | Reference-based and reference-independent annotation of assembled virus genomes | https://github.com/simroux/VirSorter
One Codex (Minot, Krumm, and Greenfield) | Taxonomic assignment | Web application | Web, web API | Web | — | Web portal for assignment, visualization and comparison of metagenomes using a bespoke comprehensive sequence database | https://www.onecodex.com/
PhymmBL (Brady and Salzberg 2011) | Taxonomic assignment | Source | Command line | Linux | — | Hybrid taxonomic assignment using Phymm interpolated Markov models and BLAST results | https://ccb.jhu.edu/software/phymmbl/index.shtml
IVA (Hunt et al. 2015) | Viral genome assembly | Python package, source | Command line | Linux, Mac OS | GNU GPL | Consensus genomic assembly of diverse viral populations using paired-end short reads | http://sanger-pathogens.github.io/iva/
VICUNA (Yang et al. 2012) | Viral genome assembly | Source | Command line | Linux | Broad academic license | Consensus genomic assembly of diverse viral populations using short reads | http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/vicuna
PRICE (Ruby, Bellare, and Derisi 2013) | Viral genome assembly | Source | Command line | Linux | GNU GPL | Assembly of low abundance viral sequences in metagenomes with paired-end short reads | http://derisilab.ucsf.edu/software/price/
SPAdes (Bankevich et al. 2012) | Genome assembly | Binary, source | Command line | Linux, Mac OS | GNU GPL | Microbial genome assembly with long and short paired-end reads | http://bioinf.spbau.ru/spades
MetaSPAdes (Nurk et al. 2016) | Metagenome assembly | Binary, source | Command line | Linux, Mac OS | GNU GPL | Metagenome assembly with long and short paired-end reads | http://bioinf.spbau.ru/spades
MEGAHIT (Li et al. 2015) | Metagenome assembly | Source | Command line | Linux, Mac OS | GNU GPL | Fast metagenome assembly for complex metagenomes with short reads | https://github.com/voutcn/megahit
IDBA-UD (Peng et al. 2012) | Metagenome assembly | Source | Command line | Linux, Mac OS | GNU GPL | Metagenome assembly for short paired-end reads | https://github.com/loneknightpy/idba
MetaVelvet (Afiahayati, Sato, and Sakakibara 2015) | Metagenome assembly | Source | Command line | Linux, Mac OS | GNU GPL | Metagenome assembly for short paired-end reads | http://metavelvet.dna.bio.keio.ac.jp/
ShoRAH (Zagordi et al. 2011) | Viral haplotype reconstruction | Source | Command line | Linux | GNU GPL | Probabilistic viral haplotype reconstruction from short reads | https://github.com/ozagordi/shorah
QuRe (Prosperi and Salemi 2012) | Viral haplotype reconstruction | Java package, source | Command line, graphical interface | Most OSs | GNU GPL | Probabilistic viral haplotype reconstruction from short reads | https://sourceforge.net/projects/qure/
PredictHaplo (Prabhakaran et al. 2014) | Viral haplotype reconstruction | Source | Command line | Linux | GNU GPL | Probabilistic viral haplotype reconstruction from short paired-end reads | http://bmda.cs.unibas.ch/HivHaploTyper/
A de Bruijn or k-mer graph represents a set of reads in terms of its k-mer composition, where k-mers are subsequences of a length k specified by the user. Each k-mer is assigned to an edge in a graph, where the nodes are the k-1 prefixes and suffixes of the k-mer. The assembler identifies the path through the graph in which each edge is visited only once (reviewed in Compeau, Pevzner, and Tesler 2011). De Bruijn graphs are much more efficient to construct than overlap graphs and are suited to large numbers of short reads, especially where coverage is high, since redundant k-mers occupy negligible random access memory (RAM). However, with this efficiency comes a lack of error tolerance in identifying overlaps, less tolerance of repeated sequences in comparison to overlap graphs, and a loss of read coherence, meaning that k-mers originating from different reads may be co-assembled. Examples of assemblers using de Bruijn graphs include SOAPdenovo (Luo et al. 2012), ALLPATHS (Butler et al. 2008), SPAdes (Bankevich et al. 2012), and ABySS (Simpson et al. 2009).
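The construction just described can be sketched in a few lines. This is a minimal illustration with our own function names, not the implementation of any assembler listed above: real assemblers additionally track k-mer counts, handle reverse complements and sequencing errors, and resolve branching nodes.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Each k-mer becomes an edge from its (k-1)-mer prefix node to its
    (k-1)-mer suffix node; identical k-mers from different reads collapse
    onto the same edge, which is what makes the graph memory-efficient."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Follow unambiguous (out-degree 1) edges from `start`, appending one
    base per step, to recover a simple contig."""
    contig, node, seen = start, start, set()
    while len(graph[node]) == 1 and node not in seen:
        seen.add(node)
        node = next(iter(graph[node]))
        contig += node[-1]
    return contig
```

Note that once two reads share a k-mer their paths merge in the graph, illustrating the loss of read coherence mentioned above.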

4.2 De novo assembly for metagenomes

Typical de novo assemblers are designed to reconstruct genomes with uniform sequencing coverage across their length. This is problematic for metagenomes (including viromes), where coverage typically varies considerably both among different genomes and within individual genomes. To address this problem, dedicated metagenome assemblers have been developed. Omega (Haider et al. 2014) is an OLC-based method that uses a minimum cost flow analysis of the OLC graph to generate initial contigs, merging these to create longer contigs and scaffolds using mate-pair information. Genovo (Laserson, Jojic, and Koller 2011) is another OLC-based method that generates a probabilistic model for the dataset and subsequently uses an iterative approach to reconstruct the most likely genome contigs. MEGAHIT (Li et al. 2015) prioritizes speed, leveraging a succinct de Bruijn graph to rapidly reconstruct high complexity metagenomes, such as those of soil or seawater, on a single computer. Noteworthy is the iterative de Bruijn graph assembler SPAdes, which, although not initially intended for metagenome assembly, has been widely adopted for its effectiveness in assembling variable coverage metagenomes of limited complexity. MetaSPAdes (Nurk et al. 2016) is a metagenome-specific release of the SPAdes pipeline with refinements to its graph simplification and repeat resolution algorithms, counter-intuitively capable of leveraging rare strain information so as to improve its consensus reconstruction capabilities. Other de Bruijn graph metagenome assemblers based on their genomic counterparts include Ray Meta (Boisvert et al. 2012), MetAMOS (Treangen et al. 2013), MetaVelvet (Namiki et al. 2012; Afiahayati, Sato, and Sakakibara 2015), and IDBA-UD (Peng et al. 2012). For example, unlike the genome assembler Velvet, MetaVelvet’s de Bruijn graph is decomposed into many subgraphs (using coverage difference and graph connectivity), and scaffolds are built independently for each subgraph.
MetaVelvet-SL addresses limitations with MetaVelvet, using supervised learning to detect and classify chimeric nodes within the de Bruijn graph. IDBA-UD partitions a de Bruijn graph into isolated components, constructs a multiple alignment, and subsequently identifies variation within these partitions using multiple depth relative thresholds to remove erroneous k-mers. Ray Meta (Boisvert et al. 2012) extends the massively distributed assembly model of Ray to variable coverage metagenomes, while MetAMOS (Treangen et al. 2013) is both a metagenomic extension and successor to the AMOS genome assembler. We recently proposed a method based on numerical sequence representations and digital signal processing data transformation (SPDT) approaches to reduce the size of working datasets, permitting fast and sensitive read alignment and de novo assembly of diverse viral populations (Tapinos et al. 2015). SPDT methods, such as the discrete Fourier transform (DFT) (Agrawal, Faloutsos, and Swami 1993), and discrete wavelet transform (DWT) (Percival and Walden 2006) (Fig. 2), are used to reduce sequences into lower dimensional space, preserving only prominent data characteristics. Analysis is subsequently performed with these lower dimensionality transformations, enabling faster data comparison. Since SPDT methodologies such as the Fourier and wavelet transforms are applicable only to numerical sequences, nucleotide sequences must first be numerically transformed with one of several techniques including real number representations (Chakravarthy et al. 2004), complex number representations (Anastassiou 2001), the DNA walk (Lobry 1996), and the Voss method (Voss 1992).
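The transform pipeline described above can be sketched with a Haar wavelet, the simplest DWT: an integer encoding maps bases to numbers, then each transform level halves the resolution by replacing adjacent pairs with their averages (the approximation) and differences (the detail coefficients). This is an illustrative sketch assuming one particular integer mapping, not the authors' exact encoding or wavelet choice.

```python
def encode(seq, mapping={"A": 0, "C": 1, "G": 2, "T": 3}):
    """One possible integer representation of a nucleotide sequence."""
    return [mapping[base] for base in seq]

def haar_level(signal):
    """One Haar DWT level: pairwise averages form the reduced-resolution
    approximation; pairwise differences are the detail coefficients."""
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return approx, detail

def haar(signal, levels):
    """Apply the transform repeatedly, halving resolution each time."""
    for _ in range(levels):
        signal, _ = haar_level(signal)
    return signal
```

Because similar sequences yield similar low-resolution approximations, comparisons can be performed on the short transformed vectors instead of the full-length sequences, which is the source of the speed-up.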
Figure 2.

Proposed DWT signal processing approach for nucleotide sequence analysis. Sequences 1 and 2 are subsequences of the HIV-1 HXB2 genome (the reference genome for HIV), and sequence 3 is a subsequence of the Mycoplasma genitalium genome (all three sequences appear at the bottom of the figure). (A) illustrates the integer number representations of the three sequences—sequence 1 is depicted as a black line, sequence 2 is depicted as a red line and sequence 3 is depicted as a blue line. The sequences are mapped into numerical space with the integer representation method enabling the application of transformation approaches. (B) illustrates the DWT transformations of the three sequences’ numerical representations at varying resolutions. The three sequences are each shown consecutively transformed into six reduced resolution representations. The minor sequence mismatches between sequences 1 and 2 (indicated with green circles) can be easily detected at different transformation resolutions despite reduction in information content from the transformation process. Similar nucleotide sequences give rise to similar DWT transformations and thus can be intuitively identified even at low resolution (level 6), where sequences are represented by a single numerical value. Depicted in (C) are the coefficient matrices obtained from each sequence’s DWT transformation. Coefficient matrices can be used to approximately identify the sites of the mismatch positions between the two sequences. Sequences 1 and 2 differ only at sites 16–17 and 48–49. The exact location of minor differences can be detected at transformation level 4 where each sequence is compressed to four wavelets. Darker colored positions in between the matrices of sequence 1 and 2 indicate matching coefficients, and lighter colored positions indicate dissimilar coefficients.

Although metagenome assemblers generally outperform single genome assemblers in reconstructing different genomes simultaneously, the complexity of this task means they tend to collapse variation at or beneath the strain level into consensus sequences. Even so, their effectiveness may be limited as a consequence of extreme variation within specific RNA virus populations due to mutation and recombination, and low and/or uneven sequencing coverage across a particular genome. Furthermore, it should be noted that de novo assembly is particularly sensitive to the quality of input sequences, meaning that problems during sample extraction, enrichment, and library preparation can be highly detrimental to downstream analyses. Of key importance therefore are quality control methods for detecting, and where appropriate correcting, problems associated with contamination (Darling et al. 2014; Orton et al. 2015), primer read-through, and low quality reads (reviewed in Leggett et al. 2013).

5. Haplotype reconstruction in specific viral populations

Viral genomes and metagenomes comprising high intraspecific variation can be challenging targets for assembly, giving rise to complex assembly graphs and fragmented assemblies. This is often the case for clinical samples from HIV and hepatitis C patients, in which high rates of mutation and long durations of infection can contribute to extreme population divergence, but can also be observed in environmental samples. Where such diversity exists, alignment-based probabilistic population reconstruction approaches can be effective, permitting the reconstruction of individual viral variants into ‘haplotypes’ exceeding read length. This problem has been well studied, and tools such as ShoRAH, QuRe, and PredictHaplo (Giallonardo et al. 2014) are designed for haplotyping viral populations. ShoRAH (Zagordi et al. 2011) extracts local alignments of a specified window length, reconstructs haplotypes for each ‘cluster’ in that window, and removes mutations from sequences in the cluster not matching the reconstructed haplotype using a model-based probabilistic clustering algorithm. QuRe (Prosperi and Salemi 2012; Prosperi et al. 2013) removes nucleotide substitutions and indels with a Poisson model and reconstructs haplotypes using a heuristic algorithm based on a multinomial distribution. Both approaches have the advantage of reporting probabilities for the reconstructed haplotypes. PredictHaplo is notable for taking into account the read pairing information in Illumina data. A limitation of all of these approaches, however, is their reliance upon a single reference sequence with which to perform the initial alignment, a process which assumes a degree of sequence similarity that may not always be observed in diverse regions of RNA virus genomes, such as those encoding envelope proteins. This can be mitigated through construction of a data-specific template through iterative reference mapping and consensus refinement strategies (Archer et al. 2010; Břinda, Boeva, and Kucherov 2016). Other possibilities for broader utility of these approaches include the use of multiple viral reference sequences, either through consideration of multiple linear sequences or by direct alignment of sequences to a variation graph (https://github.com/vgteam/vg), an emerging approach for modeling genomic variation.
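The window-based clustering idea common to these tools can be illustrated with a deliberately simplified sketch. This is not the probabilistic algorithm of ShoRAH, QuRe, or PredictHaplo; it is our own toy version in which equal-length reads from one alignment window are greedily grouped by mismatch count, and each group's majority consensus stands in for a local haplotype.

```python
from collections import Counter

def consensus(cluster):
    """Majority base at each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def mismatches(a, b):
    """Hamming distance between two equal-length aligned reads."""
    return sum(x != y for x, y in zip(a, b))

def cluster_window(reads, max_mismatches=1):
    """Greedy clustering of reads from one window: each read joins the
    first cluster whose running consensus it is close to, otherwise it
    seeds a new cluster. Returns one consensus haplotype per cluster."""
    clusters = []
    for read in reads:
        for cluster in clusters:
            if mismatches(read, consensus(cluster)) <= max_mismatches:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return [consensus(c) for c in clusters]
```

Published tools replace this greedy threshold with probabilistic models precisely so that sequencing errors and genuine low-frequency variants can be distinguished, and so that each haplotype can be reported with a probability.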

6. Sequence classification

Sequence classification is one of the most studied problems in computational biology, and taxonomic assignment is a key objective of metagenome analysis. All classification methods, to some extent, depend upon detecting similarity between a query sequence and a collection of annotated sequences. Classification may be undertaken using either unassembled reads or the reconstructed contigs arising from the assembly process. The computational requirements of available approaches vary dramatically according to their ability to detect homology in divergent sequences; for example, exact k-mer matching approaches permit rapid sequence classification, yet typically struggle to identify divergent sequences of viral origin, while high-sensitivity protein alignment searches may be prohibitively slow, especially when applied to entire sequencing datasets. Some of the more contemporary, speed-optimized taxonomic assignment approaches also have high RAM requirements, limiting scope for their use with readily available computer hardware. The output of sequence homology search tools is not itself easily interpreted, requiring post-processing in order to yield meaningful classifications. Retroactive taxonomic assignment using these results is non-trivial, requiring additional database lookups, for example, to determine a conservative ‘lowest common ancestor’ (LCA) taxon shared by all matches for each query sequence. This complexity necessitates the integration of different tools within application-specific ‘pipelines’.
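The LCA computation described above is straightforward to illustrate. The minimal taxonomy below is invented for the example; real tools resolve lineages against the full NCBI Taxonomy.

```python
# Toy lowest-common-ancestor (LCA) assignment for one read's database matches.
# Each taxon maps to its parent; the root is its own parent.
PARENT = {
    "root": "root",
    "Viruses": "root",
    "Picornaviridae": "Viruses",
    "Enterovirus": "Picornaviridae",
    "Rhinovirus": "Picornaviridae",
}

def lineage(taxon):
    """Return the path from a taxon up to the root."""
    path = [taxon]
    while PARENT[taxon] != taxon:
        taxon = PARENT[taxon]
        path.append(taxon)
    return path

def lowest_common_ancestor(taxa):
    """Deepest taxon shared by the lineages of all hits for one read."""
    shared = set(lineage(taxa[0]))
    for t in taxa[1:]:
        shared &= set(lineage(t))
    # Walking up from any hit, the first shared taxon is the deepest one.
    return next(t for t in lineage(taxa[0]) if t in shared)

# A read matching two sibling genera is conservatively assigned to the family.
print(lowest_common_ancestor(["Enterovirus", "Rhinovirus"]))  # Picornaviridae
```

Hits to conflicting taxa thus yield a conservative higher-rank assignment rather than an arbitrary choice among matches.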

6.1. Sequence similarity searches

Viral identification approaches typically depend on similarity searches against a database using an aligner such as BLAST (Altschul et al. 1990). Comprehensive databases (e.g. GenBank) or smaller custom databases containing, for example, only viral sequences of interest may be used, although the latter can generate misleading results. ProViDE (Ghosh et al. 2011) uses virus-specific alignment parameters and thresholds to assign viruses at different taxonomic levels from BLAST matches to a protein database. VIROME (Wommack et al. 2012) is a multifaceted tool integrating results from searches of several sequence and function databases. MEGAN (Huson et al. 2011) is a generally applicable metagenomic classifier, which uses BLAST results to infer the LCA for a given sequence and provides functional analyses through a graphical interface. Automated pipelines which combine various homology search strategies to identify a final set of viral reads include VirusHunter (Zhao et al. 2013), a Perl script that automates viral identification using BLAST prior to assembly; MetaVir (Roux et al. 2011), a web application that compares users’ datasets to published viral sequences; and VirSorter (Roux et al. 2015), which identifies prophages and viruses by comparison with custom datasets. With the exception of web applications, however, these are not intuitive tools for the majority of users, requiring manual configuration and installation of software dependencies. Furthermore, similarity search approaches are in general extremely resource-intensive, and performing sensitive BLAST-like database searches with millions of reads is intractable without specialist computational resources. To address this problem, tools have emerged leveraging optimized search algorithms and prebuilt databases so as to increase the tractability of classifying millions of reads. For example, Kraken (Wood and Salzberg 2014) and Clark (Ounit et al. 2015) are fast exact k-mer matching approaches that provide prebuilt databases of viruses, bacteria, human, and fungi, although custom databases may also be built. One Codex is a proprietary web-based metagenome analysis platform with an integrated fast k-mer matching engine (similar to that of Kraken) which is fast, easy to use, and free for academic use (Minot, Krumm, and Greenfield). Lambda (Hauswedell, Singer, and Reinert 2014) and Diamond (Buchfink, Xie, and Huson 2015) are sensitive and heavily optimized BLAST-like aligners which leverage alphabet reduction to permit protein searches three to five orders of magnitude faster than BLAST, offering prebuilt database indexes for common applications.
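The exact k-mer matching idea behind tools like Kraken and Clark can be sketched minimally. The two-taxon reference set below is invented, k is far smaller than in practice, and real tools resolve k-mers shared between taxa to their LCA rather than overwriting them.

```python
# Minimal sketch of exact k-mer classification: index reference k-mers by
# taxon, then classify each read by majority vote over its k-mer matches.
from collections import Counter

K = 5

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_index(references):
    """Map every k-mer to the taxon of the reference it came from."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq):
            index[kmer] = taxon  # real tools store the LCA of colliding taxa
    return index

def classify(read, index):
    """Assign a read to the taxon receiving the most k-mer votes."""
    votes = Counter(index[k] for k in kmers(read) if k in index)
    return votes.most_common(1)[0][0] if votes else "unclassified"

refs = {
    "virus_A": "ATGCGTACGTTAGCATCGGA",
    "virus_B": "TTACCGGATAACGTGCCTAA",
}
idx = build_index(refs)
print(classify("GCGTACGTTAGC", idx))   # virus_A
print(classify("GGGGGGGGGGGG", idx))   # unclassified
```

Because lookups are exact hash queries rather than alignments, classification is extremely fast, but reads diverging from the references at most k-mer positions fall through as unclassified, consistent with the sensitivity limitation noted above.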

6.2. Alternatives to similarity searches

Although exhaustive BLAST-like methods can detect homology in divergent sequences, these methods are in general limited by the relatively few validated viral sequences deposited in public databases, the high diversity within viral families which can obscure relatedness, and the lack of a defined set of core genes common to all viruses that can be used to distinguish species (e.g. the 16S rRNA gene for bacteria) (Fancello, Raoult, and Desnues 2012). These features make it difficult to assign similarity thresholds for classification that are applicable to all potential viruses in a sample (Simmonds 2015). Composition-based methods that do not rely on sequence similarity include PhyloPythia (McHardy et al. 2007), which uses nucleotide frequencies to classify reads, and Phymm (Brady and Salzberg 2009), which uses interpolated Markov models to find variable-length oligonucleotides that characterize species in the NCBI RefSeq database. Although these approaches are less accurate than BLAST searches, PhymmBL (Brady and Salzberg 2011) combines Phymm and BLAST and outperforms either one on its own. Alignment-free comparison approaches, for example, based on dinucleotide frequencies, codon usage patterns, or small but conserved regions of family-wide ubiquitous genes, may be more robust to database limitations than sequence similarity searches. These features may also reduce the computation required and highlight evolutionary relationships otherwise obscured by high sequence variability. A fundamental challenge in the classification of viral sequences with any of these methods remains their limited representation within curated sequence databases. While the rate at which new viruses are being added to NCBI’s RefSeq collection has increased considerably, from a yearly average of 0.34 species/day in 2010 to 2.5 species/day in 2015 (Fig. 3), our documented understanding of the extent of viral diversity remains superficial (Anthony et al. 2013).
Reads of true viral origin are therefore liable to be missed in many cases. The rate of database growth also highlights the need to maintain frequently updated search indexes for sequence classification, construction of which often demands specialist servers equipped with hundreds of gigabytes of RAM. Even if up-to-date indexes are maintained in a public repository, their file sizes are substantial, demanding that users have access to a fast internet connection. Consequently, complete outsourcing of sequence classification to remote web services is a compelling prospect for those with adequate internet connections but without powerful computing hardware, increasing scope for conducting analyses with portable computers.
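The alignment-free, dinucleotide-frequency comparison mentioned above can be sketched as follows. The sequences are invented for illustration; real analyses would compare profiles of full genomes or contigs, often with higher-order k-mers.

```python
# Sketch of alignment-free sequence comparison: represent each sequence as a
# normalized dinucleotide frequency vector and compare vectors directly,
# with no alignment step.
import math
from collections import Counter

def dinucleotide_profile(seq):
    """Normalized frequencies of all 16 dinucleotides in a sequence."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    total = len(pairs)
    return {a + b: counts[a + b] / total for a in "ACGT" for b in "ACGT"}

def cosine_similarity(p, q):
    """Cosine similarity between two frequency profiles (1.0 = identical)."""
    dot = sum(p[k] * q[k] for k in p)
    norm = lambda x: math.sqrt(sum(v * v for v in x.values()))
    return dot / (norm(p) * norm(q))

a = dinucleotide_profile("ATATATATATGCGCATAT")  # AT-rich
b = dinucleotide_profile("TATATATGCGCGCATATA")  # also AT-rich
c = dinucleotide_profile("GGGGCCCCGGGGCCCCGG")  # GC-only
print(cosine_similarity(a, b) > cosine_similarity(a, c))  # True
```

The comparison costs a single pass over each sequence plus a 16-dimensional dot product, which is why such composition signatures can remain tractable where exhaustive alignment is not.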
Figure 3.

Distinct viral species in the NCBI RefSeq releases from June 2003 – May 2015 (data from ftp://ncbi.nlm.nih.gov/refseq/release/release-statistics/viral.acc_taxid_growth.txt).

7. Conclusion

We see several barriers to realizing the goal of active, on-the-ground surveillance and early detection of viruses with epidemic potential. The emergence of virus-specific assembly and metagenomic tools is a relatively recent phenomenon, with many of the methodologies in use today repurposing one or more existing algorithms. These tools mostly target a small audience of expert users and, as with most research software, decay after initial release due to a lack of ongoing funding, poor software development practices, and/or authors’ changes of circumstances (Duck et al. 2016). There is a need for a better balance between research software presenting novel methodologies and sustainably developed, documented, and tested software distributed through robust, user-friendly channels such as package managers, so as to increase the useful life of viral informatics software. Researchers and granting agencies should consider the importance of this step and allocate resources accordingly. Democratisation of routine analyses through development of user-friendly, locally installable software and remote web services is critical. Preconfigured cloud virtual machines offer a convenient, low-cost way to run analyses, yet must permit straightforward sequence database and software version updates so as to remain relevant after their initial release. Maintaining up-to-date indexes of large sequence databases is a problem all classification tools must address, requiring either access to powerful computers for index construction or the ability to download prebuilt indexes over a fast connection. Furthermore, classification of viral sequences is critically dependent upon the quality of curated viral databases such as RefSeq, to which submitting newly discovered sequences can be prohibitively time-consuming.
A solution might involve the creation of a central database containing, for any given sequencing project, both raw reads and filtered, assembled, and/or annotated reads, analysed using a single central pipeline. On a regular basis, the database could report sequences and corresponding metadata for unclassified ‘dark matter’, which is often discarded and yet is likely to contain sequences belonging to novel pathogens. By combining the dark matter from multiple studies, trends within these unclassified reads may be identified, offering greater power to detect new biological entities. Benchmarking of software also remains an open problem within the field, which lacks standardized test datasets used across multiple studies. Benchmarking datasets are often chosen to highlight the advantages of the method under study, and may therefore be quite specific to a given application. Thus, the field needs to agree upon a set of standard, well-characterized reference datasets for virus-focused studies. The future of the field is promising, with emerging technologies showing potential to eliminate certain challenges. Single-molecule sequencing, for example, permits the sequencing of whole viral genomes as single reads, with forthcoming portable and smartphone-operated sequencers promising potentially revolutionary analyses in the field. Innovative analytical approaches are constantly being published, and it is evident that the motivation, creativity, and expertise needed to meet these challenges exist within the community. Broader communication among developers and end users is essential, and in conjunction with well-funded international initiatives directed at this goal, intelligent viral surveillance could soon be realized.
References (71 in total; first 10 shown)

1. Huson DH, Mitra S, Ruscheweyh HJ, Weber N, Schuster SC. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 2011.
2. Reuter JA, Spacek DV, Snyder MP. High-throughput sequencing technologies. Mol Cell. 2015.
3. Brady A, Salzberg S. PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat Methods. 2011.
4. Treangen TJ, Koren S, Sommer DD, et al. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol. 2013.
5. Butler J, MacCallum I, Kleber M, et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008.
6. Zhao G, Krishnamurthy S, Cai Z, et al. Identification of novel viruses using VirusHunter, an automated data analysis pipeline. PLoS One. 2013.
7. Darling AE, Jospin G, Lowe E, Matsen FA, Bik HM, Eisen JA. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ. 2014.
8. Luo R, Liu B, Xie Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012.
9. Crusoe MR, Alameldin HF, Awad S, et al. The khmer software package: enabling efficient nucleotide sequence analysis. F1000Res. 2015.
10. Hoenen T, Groseth A, Rosenke K, et al. Nanopore Sequencing as a Rapidly Deployable Ebola Outbreak Tool. Emerg Infect Dis. 2016.
