Minimum Information about an Uncultivated Virus Genome (MIUViG).

Simon Roux¹, Evelien M Adriaenssens², Bas E Dutilh^3,4, Eugene V Koonin⁵, Andrew M Kropinski⁶, Mart Krupovic⁷, Jens H Kuhn⁸, Rob Lavigne⁹, J Rodney Brister⁵, Arvind Varsani^10,11, Clara Amid¹², Ramy K Aziz¹³, Seth R Bordenstein¹⁴, Peer Bork¹⁵, Mya Breitbart¹⁶, Guy R Cochrane¹², Rebecca A Daly¹⁷, Christelle Desnues¹⁸, Melissa B Duhaime¹⁹, Joanne B Emerson²⁰, François Enault²¹, Jed A Fuhrman²², Pascal Hingamp²³, Philip Hugenholtz²⁴, Bonnie L Hurwitz^25,26, Natalia N Ivanova¹, Jessica M Labonté²⁷, Kyung-Bum Lee²⁸, Rex R Malmstrom¹, Manuel Martinez-Garcia²⁹, Ilene Karsch Mizrachi⁵, Hiroyuki Ogata³⁰, David Páez-Espino¹, Marie-Agnès Petit³¹, Catherine Putonti^32,33,34, Thomas Rattei³⁵, Alejandro Reyes³⁶, Francisco Rodriguez-Valera³⁷, Karyna Rosario¹⁶, Lynn Schriml³⁸, Frederik Schulz¹, Grieg F Steward³⁹, Matthew B Sullivan^40,41, Shinichi Sunagawa⁴², Curtis A Suttle^43,44,45,46, Ben Temperton⁴⁷, Susannah G Tringe¹, Rebecca Vega Thurber⁴⁸, Nicole S Webster^24,49, Katrine L Whiteson⁵⁰, Steven W Wilhelm⁵¹, K Eric Wommack⁵², Tanja Woyke¹, Kelly C Wrighton¹⁷, Pelin Yilmaz⁵³, Takashi Yoshida⁵⁴, Mark J Young⁵⁵, Natalya Yutin⁵, Lisa Zeigler Allen^56,57, Nikos C Kyrpides¹, Emiley A Eloe-Fadrosh¹.

Abstract

We present an extension of the Minimum Information about any (x) Sequence (MIxS) standard for reporting sequences of uncultivated virus genomes. Minimum Information about an Uncultivated Virus Genome (MIUViG) standards were developed within the Genomic Standards Consortium framework and include virus origin, genome quality, genome annotation, taxonomic classification, biogeographic distribution and in silico host prediction. Community-wide adoption of MIUViG standards, which complement the Minimum Information about a Single Amplified Genome (MISAG) and Metagenome-Assembled Genome (MIMAG) standards for uncultivated bacteria and archaea, will improve the reporting of uncultivated virus genomes in public databases. In turn, this should enable more robust comparative studies and a systematic exploration of the global virosphere.

Year: 2018 PMID: 30556814 PMCID: PMC6871006 DOI: 10.1038/nbt.4306

Source DB: PubMed Journal: Nat Biotechnol ISSN: 1087-0156 Impact factor: 54.908

Main

Current estimates are that virus particles massively outnumber live cells in most habitats[1,2], but only a tiny fraction of viruses have been cultivated in the laboratory. An unprecedented diversity of viruses is being discovered through culture-independent sequencing[3]. Progress has been made in reconstructing genomes of uncultivated viruses de novo, from biotic and abiotic environments, without laboratory isolation of the virus–host system. For example, in the past 2 years, more than 750,000 uncultivated virus genomes (UViGs) have been identified in metagenome and metatranscriptome datasets[4,5,6,7,8,9], five times the total number of genomes sequenced from virus isolates (Fig. 1), and UViGs already represent ≥95% of the taxonomic diversity in publicly available virus sequences[10,11]. Although double-stranded DNA (dsDNA) genomes are over-represented in UViGs because most metagenomic protocols exclusively target dsDNA, UViGs nonetheless enable an assessment of global virus diversity and an evaluation of structure and drivers of viral communities. UViGs also contribute to improving our understanding of the evolutionary history of viruses and virus–host interactions.

Figure 1

Size of virus genome databases over time[4,7,22,45,83,84,85,86,87,88,89].

Genome sequences from isolates (blue and green) or from UViGs (yellow) are shown. For genomes from isolates, the total number of genomes (blue) and the number of 'reference' genomes (green) are shown. Data were downloaded using the queries “Viruses[Organism] AND srcdb_refseq[PROP] NOT wgs[PROP] NOT cellular organisms[ORGN] NOT AC_000001:AC_999999[PACC]” for reference genomes and “Viruses[Organism] NOT cellular organisms[ORGN] NOT wgs[PROP] NOT AC_000001:AC_999999[pacc] NOT gbdiv syn[prop] AND nuccore genome samespecies[Filter]” for total number of virus genomes, on the NCBI nucleotide database portal (https://www.ncbi.nlm.nih.gov/nuccore) in January 2018. Genomes from the influenza virus database (https://www.ncbi.nlm.nih.gov/genomes/FLU/Database/nph-select.cgi?go=genomeset) were also added to the total number of virus genomes. UViGs can be assembled from metagenomes, from proviruses identified in microbial genomes, or from single-virus genomes, and estimated total UViG numbers were obtained by compiling data from the literature and from the total number of sequences in the IMG/VR database in January 2017, January 2018 and July 2018 (https://img.jgi.doe.gov/vr/)[11]. UpViG, uncultivated provirus.

Analysis and interpretation of standalone genomes present substantial challenges, whether the genomes are eukaryotic, bacterial, archaeal or viral. To address these challenges, MISAG and MIMAG standards were drafted to improve the quality of reporting of microbial genomes derived from single cell or metagenome sequences, which are often incomplete[12]. Although some aspects of MISAG and MIMAG can be applied to UViGs, the extraordinary diversity of viral genome composition and content, replication strategies, and hosts means that the completeness, quality, taxonomy and ecology of UViGs need to be evaluated via virus-specific metrics.
The Genomic Standards Consortium (http://gensc.org) maintains metadata checklists for MIxS, encompassing genome and metagenome sequences[13], marker gene sequences[14] and single amplified and metagenome-assembled bacterial and archaeal genomes[12]. Here we present a set of standards that extend the MIxS checklists to include identification, quality assessment, analysis and reporting of UViGs (Table 1 and Supplementary Tables 1 and 2), together with recommendations on how to perform these analyses. We provide a metadata checklist for database submission and publication of UViGs designed to be flexible enough to accommodate technological and methodological changes over time (Table 1 and Supplementary Table 1). The information gathered through the MIUViG checklist can be directly submitted with new UViG sequences to International Nucleotide Sequence Database Collaboration (INSDC) member databases—the DNA Database of Japan (DDBJ), the European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL-EBI) and US National Center for Biotechnology Information (NCBI)—which will host and display checklist metadata alongside the UViG sequence. These MIUViG standards should also be used along with existing guidelines for virus genome analysis, including those issued by the International Committee on Taxonomy of Viruses (ICTV), which recently endorsed the incorporation of UViGs into the official virus classification scheme[15] (https://talk.ictvonline.org). Although MIUViG standards and best practices were designed for genomes of viruses infecting microorganisms, they can also be applied to viruses infecting animals, fungi and plants, and are compatible with standards that are already in place for epidemiological analysis of these viruses[16] (Supplementary Table 3).

Table 1

List of mandatory metadata for UViGs

Mandatory metadata	Description
Source of UViGs	Type of dataset from which the UViG was obtained
Assembly software	Tool(s) used for assembly and/or binning, including version number and parameters
Virus identification software	Tool(s) used for the identification of UViG as a viral genome, software or protocol name including version number, parameters, and cutoffs used (see Supplementary Table 2)
Predicted genome type	Type of genome predicted for the UViG
Predicted genome structure	Expected structure of the viral genome
Detection type	Type of UViG detection
Assembly quality	The assembly quality categories, specific for virus genomes, are based on sets of criteria as follows. Finished: single, validated, contiguous sequence per replicon without gaps or ambiguities, with extensive manual review and editing to annotate putative gene functions and transcriptional units; High-quality draft genome: one or multiple fragments, totaling ≥90% of the expected genome or replicon sequence or predicted complete; Genome fragment(s): one or multiple fragments, totaling <90% of the expected genome or replicon sequence, or for which no genome size could be estimated
Number of contigs	Total number of contigs composing the UViG

For a complete list and description of mandatory and optional metadata, see Supplementary Table 1.
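
As an illustration of how the mandatory fields in Table 1 might be checked before submission, the following is a minimal sketch in Python. The dictionary keys simply reuse the human-readable labels from Table 1, and the function name, the allowed assembly-quality strings and the positive-integer rule for contig counts are assumptions made for this example; they are not the official MIxS field names or validation rules.

```python
# Hypothetical pre-submission check for the mandatory metadata in Table 1.
# Keys reuse the Table 1 labels; they are NOT official MIxS identifiers.
MANDATORY_FIELDS = (
    "Source of UViGs",
    "Assembly software",
    "Virus identification software",
    "Predicted genome type",
    "Predicted genome structure",
    "Detection type",
    "Assembly quality",
    "Number of contigs",
)

# Category labels as written in Table 2 (assumed spelling for this sketch).
ASSEMBLY_QUALITY_VALUES = (
    "Finished genome",
    "High-quality draft genome",
    "Genome fragment(s)",
)


def check_uvig_metadata(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in MANDATORY_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing mandatory field: {field}")
    quality = record.get("Assembly quality")
    if quality and quality not in ASSEMBLY_QUALITY_VALUES:
        problems.append(f"unrecognized assembly quality: {quality}")
    contigs = record.get("Number of contigs")
    if contigs is not None and (not isinstance(contigs, int) or contigs < 1):
        problems.append("number of contigs must be a positive integer")
    return problems
```

Returning a list of problems rather than raising on the first one lets a submitter fix every gap in a single pass.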

Recovery of UViGs after virus enrichment

UViGs can be retrieved from datasets enriched for virus genomes, namely viral metagenomes and single-virus genomes (Fig. 2). Viral metagenomes are usually obtained through a combination of filtration steps, DNase or RNase treatments, and RNA or DNA extraction depending on the targeted viruses, then reverse transcription (to find RNA viruses) and shotgun sequencing[3,17,18,19]. Targeted sequence capture methods can be applied to recover specific virus groups (Fig. 2), and these methods have proven especially useful when viruses are present in small amounts (for example, clinical samples)[20]. Single-virus methods use flow cytometry to sort individual viral particles before genome amplification and sequencing, to produce viral single amplified genomes (SAGs)[9,21,22,23] (Fig. 2). Viral metagenomes and single-virus genomes are usually sequenced with short-read, high-throughput technologies, such as Illumina sequencing, and assembled by algorithms similar to those used for microbial genomes and metagenomes. However, owing to their relatively small genome size (92% of virus genomes in the NCBI Viral RefSeq database are <100 kb)[10], short read-based genome assemblies could soon be superseded by long-read sequencing technologies[24] (for example, PacBio zero-mode waveguide technology or Oxford Nanopore Technology nanopore sequencing; Fig. 2). Sequencing virus genomes from a single template would notably enable the identification of individual genotypes in mixed populations.

Figure 2

Identification of UViGs.

Schematic of methods used to obtain UViGs. Steps that have been adapted from those used to assemble MAGs and SAGs[12] or added for UViG are shown for sample preparation (orange) and bioinformatics analysis (blue). Steps specifically required for virus targeting and identification are highlighted in bold. *For viruses with short genomes, long-read technologies can provide complete genomes from shotgun sequencing in a single read, bypassing the assembly step[24]. **Targeted sequence capture can be used to recover viral genomes from a known virus group. These genomes can be recovered from samples in which they represent a small fraction of the templates (for example, clinical samples[20]).

The main advantages of datasets produced after enrichment for viruses are good de novo assembly of both abundant and rare viruses, increased confidence that the sequence is of viral origin, and the ability to sequence both active and 'inactive' or 'cryptic' viruses (i.e., viruses that are present in the sample but cannot infect). However, virus-enriched datasets can have over-representation of virulent viruses with high burst size (high number of virus particles released from each infected cell) and under-representation of larger viruses with capsids ≥0.2 μm, such as giant viruses, as a result of the selective filtration steps used[25]. Furthermore, in silico approaches are often the only option available to determine the host range of UViGs obtained from virus-enriched samples.

Recovery of UViGs without enrichment

Virus sequences are also present in non-virus-enriched datasets, including sorted cells, tissues, or environmental samples collected on 0.2 μm filters[4,26,27,28]. These sequences could originate from viruses that are replicating in cells, from temperate viruses (proviruses or prophages) that are either integrated into host genomes or present as episomal elements in the host cell, or from free virus particles present in samples. Analyzing datasets without virus enrichment has several advantages. It can detect lytic, temperate and persistent infection, it overcomes some of the biases arising from the size-based selection of virus particles, and it can be applied to any metagenome. However, UViGs from non-virus-enriched datasets may be biased toward viruses that infect the dominant host cell in the sample, and rare viruses or those infecting rare hosts could be under-represented or absent. Finally, comparisons between virus-enriched and non-virus-enriched datasets suggest that analyzing UViGs across different size fractions and sample types is valuable for exploring the virus genome sequence space[29] (Supplementary Fig. 1 and Supplementary Note 1).

Computational identification of viral sequences

Regardless of the type of dataset, the viral origin of UViGs must be validated because even samples enriched for virus particles still contain a substantial amount of cellular DNA[30]. Contamination can arise either from difficulty in separating virus particles from cellular fractions (for example, ultra-small bacteria[31]) or from the capture of extracellular DNA in the virus fraction. Cellular sequences can also derive from cell genome fragments that are encased in virus capsids or comparable particles (for example, via transduction), DNA-containing membrane vesicles, or gene transfer agents[32,33,34]. Several bioinformatic tools and protocols have been developed to identify sequences from bacteriophages and archaeal viruses[35,36,37,38]; eukaryotic viruses[39]; or combinations of bacteriophages, archaeal viruses and large eukaryotic viruses[40] (Supplementary Table 4). These approaches rely on a few characteristics, such that a sequence is considered viral if it is significantly similar to known viruses (in terms of gene content or nucleotide usage pattern) or if it is unrelated to any known virus and cellular genome but contains one or more hallmark virus genes. UViGs must therefore be accompanied by a list of virus detection tool(s) and protocol(s) used, together with any thresholds applied (Table 1 and Supplementary Table 1). Identification of integrated proviruses and their precise boundaries in the host genome is problematic (Box 1). Notably, no high-throughput approach can accurately distinguish active proviruses (still able to replicate and produce virions) from inactive proviral remnants of a past infection[28]. Thus, although prediction methods are improving, UViGs identified as proviruses should be clearly marked as such, so that these caveats are clear (Table 1 and Supplementary Table 1).

Box 1

Several factors may confound assembly of an uncultivated virus genome. The major issues are listed below:
• Misidentification of a cellular sequence as viral. Viral metagenomes can be contaminated with cellular nucleic acids[30]. Any analysis should start with the identification of virus and cellular sequences, even in virus-targeted datasets. We advise process improvement by analyzing replicates, blanks or other controls. Determining the boundaries of an integrated provirus can be challenging, even for dedicated software (for example, PHAST, VirSorter), which can result in inclusion of host gene(s) in a virus genome. Manual annotation of genes on the edge of a provirus prediction is recommended.
• Partial genomes assembled as circular contigs. Partial genomes are sometimes misassembled as circular contigs owing to repeats[47]. These circularized fragments could be incorrectly identified as complete genomes. The size and gene content of circular contigs should be manually validated as consistent or at least plausible in comparison with known reference genomes.
• Errors in gene prediction. For novel viruses with little or no similarity to known references, gene prediction can be challenging in the absence of accompanying transcriptomics or proteomics data. Outputs of automatic gene predictors applied to novel viruses should be checked for gene density (most viruses do not include large noncoding regions), as well as typical gene prediction errors, such as internal stop codons causing artificially shortened genes.
• Inaccurate functional annotation. The annotation of open reading frames predicted from novel viruses often requires sensitive profile similarity approaches. Although such sensitive searches are necessary to detect homology in the face of high rates of virus sequence evolution, the inferred function should be cautiously interpreted and remain general (for example, “DNA polymerase,” “membrane transporter” or “PhoH-like protein”).
• Clustering of partial genomes. Incomplete genomes can be difficult to classify using genome-based taxonomic classification methods. For example, the estimation of whole-genome average nucleotide identity from partial genomes could vary by up to 50% from the complete genome value (Supplementary Fig. 5). Thus, the classification of genome fragments and their clustering into vOTUs should be interpreted only as an approximation of the true clustering values, and it will likely change as more complete genomes become available.
• Taxonomic classification of UViG. Although virus classification primarily relies on genome sequences, no universal approach is currently available to classify viruses at different ranks. Classification of UViGs should be based on the best method available for the type of virus (see Box 2).
• Read mapping from nonquantitative datasets. Amplified datasets, produced using multiple displacement amplification or sequence-independent single-primer amplification, are biased toward specific virus genome types and can selectively overamplify specific genome regions. The coverage derived from read mapping based on these amplified datasets should not be interpreted as reflecting the relative abundance of the UViG in the initial sample.
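
The gene-density check mentioned under “Errors in gene prediction” could be sketched as follows. This is an illustrative Python helper and not part of the MIUViG standard: the function name and the 1-based inclusive coordinate convention are assumptions, and no density cutoff is implied by the text, so any threshold applied to the result is the analyst's choice.

```python
def coding_density(genome_length, genes):
    """Fraction of genome positions covered by at least one predicted gene.

    genes: iterable of (start, end) pairs, 1-based inclusive, start <= end.
    Overlapping genes are counted once.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    covered = 0
    current_end = 0  # last position already counted as coding
    for start, end in sorted(genes):
        start = max(start, current_end + 1)  # skip positions already counted
        end = min(end, genome_length)        # ignore coordinates past the end
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / genome_length
```

A contig with an unusually low coding density, or with one very long stretch lacking any predicted gene, would be a candidate for manual inspection of the gene calls.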

Estimating quality of UViGs

We propose three categories of UViG sequences: genome fragment(s), high-quality draft genomes and finished genomes (Fig. 3 and Table 2). These categories mirror those in MISAG and MIMAG[12], and they are matched to categories already proposed for complete-genome sequencing of small viruses in epidemiology and surveillance[16] (Supplementary Table 3). UViG quality is more challenging to evaluate than metagenome-assembled genomes (MAGs) or SAGs because most viruses lack conserved sets of single-copy marker genes that can be used to estimate draft genome completeness. However, exceptions exist, such as large eukaryotic dsDNA viruses. To date, researchers have estimated UViG sequence completeness by identifying circular contigs or contigs with inverted terminal repeats as putative complete genomes. For linear contigs, completeness is estimated by comparison to reference genome sequences and typically requires a taxonomic assignment to a (candidate) (sub)family or genus because genome length is relatively homogeneous at these ranks (±10%; Supplementary Fig. 2 and Supplementary Table 5). This assignment can be based on the detection of specific marker genes, such as clade-specific viral orthologous groups (Supplementary Table 6), or based on genome-based classification tools (see “Taxonomy of UViGs”). Estimating completeness is more difficult for segmented genomes, which require either a closely related reference genome or additional in vitro experiments[16]. A detailed example of how this quality tier classification can be performed on the Global Ocean Virome dataset[7] is presented in Supplementary Note 2 and Supplementary Table 7.
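
The completeness estimate for linear contigs described above can be illustrated with a short Python sketch. Using the median length of reference genomes from the assigned genus or (sub)family as the expected genome size is an assumption of this example, as is the simple terminal-overlap test used here as a heuristic for circular contigs; neither is prescribed by the text, and detection of inverted terminal repeats is not covered.

```python
from statistics import median


def estimate_completeness(contig_length, reference_lengths):
    """Contig length divided by the median reference genome length.

    Returns None when no reference genome is available, in which case
    no expected genome size can be determined.
    """
    if not reference_lengths:
        return None
    return contig_length / median(reference_lengths)


def has_terminal_overlap(sequence, min_overlap=20, max_overlap=1000):
    """True if the contig starts and ends with the same sequence.

    Identical ends are a common signature of a circular genome assembled
    as a linear contig. min_overlap and max_overlap are arbitrary defaults.
    """
    seq = sequence.upper()
    longest = min(max_overlap, len(seq) // 2)
    for k in range(longest, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:
            return True
    return False
```

For instance, a 45-kb contig assigned to a genus whose references are 48, 50 and 52 kb long would be estimated at 90% complete. A terminal overlap alone does not prove completeness, because repeats can produce the same signal in partial genomes (Box 1).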

Figure 3

UViG classification and associated sequence analyses.

“Functional potential” is functional annotation used in gene content analysis. “Host prediction” is the application of different in silico host prediction tools. “Taxonomic classification” is classification of the contig to established groups using marker genes or gene content comparison. “Diversity and distribution” includes vOTU clustering and relative abundance estimation through metagenome read mapping, at the geographical scale or across anatomical sites for host-associated datasets. “New taxonomic groups” concerns the delineation of new proposed groups (for example, families or genera) based exclusively on UViG sequences. “New reference species” refers to the proposal of a new entry in ICTV (https://talk.ictvonline.org/files/taxonomy-proposal-templates/). *Some of these approaches require a minimum contig size (for example, contigs ≥10 kb for taxonomic classification based on gene content[59] or diversity estimation[47]) and will not be applicable to every genome fragment.

Table 2

Summary of required characteristics for each category

Category	Genome fragment(s)	High-quality draft genome	Finished genome
Assembly	Single or multiple fragments	Single or multiple fragments where gaps span (mostly) repetitive regions	Single contiguous sequence (per segment) without gaps or ambiguities
Completeness	<90% expected genome size or no expected genome size	Complete or ≥90% of expected genome size	Complete
Required features	Minimal annotation	Minimal annotation	Comprehensive manual review and editing

Complete genomes include sequences detected as circular, those with terminal inverted repeats, or those for which an integration site is identified.

Contigs or genome bins representing <90% of the expected genome length, or for which no expected genome length can be determined, would be considered genome fragments. This category might include UViG fragments large enough to be assigned to known virus groups on the basis of gene content and average nucleotide identity. However, high-quality draft or finished genomes are required to establish new taxa (Fig. 3). Sequences from UViG fragments can be used in phylogenetic and diversity studies, either as references for virus operational taxonomic units (see Supplementary Note 4), or through the analysis of virus marker genes encoded in these genome fragments; for example, capsid proteins, terminases, ribonucleotide reductases and DNA- or RNA-dependent RNA polymerases[41,42,43,44,45,46].
Similarly, UViG fragments can be analyzed to assess the functional gene complement of unknown viruses or link them to potential hosts. Importantly, current methods for automatic virus sequence identification[35,36,37,38,39,40] cannot reliably identify short (<10 kb) viral sequences, which should be interpreted with utmost caution.

Contigs or genome bins either predicted as complete or representing ≥90% of the expected genome sequence are high-quality drafts, consistent with standards for microbial genomes[12]. Repeat regions may lead to erroneous assembly of partial genomes as circular contigs[47]. Thus, the length of the assembled circular contig should be considered when assessing UViG completeness (Box 1). For UViGs not derived from a consensus assembly, such as single long reads, base calling quality >99% on average (phred score >20) is needed to assign a “high-quality draft” label.

Genome sequences assembled into a single contig, or one per segment, with extensive manual review and annotation, can be labeled “finished genomes.” Annotation must include identification of putative gene functions; structural, replication or lysogeny modules; and transcriptional units. The “finished genomes” category is reserved for only the highest quality, manually curated UViGs and is required for the establishment of new virus species (Fig. 3 and Table 2).

Unlike that of SAGs and MAGs[12], quality estimation of UViGs does not include a genome contamination threshold. Contamination issues are most prominent in the case of genome bins, whereas most UViGs are represented by a single contig for which in silico simulations have shown that chimeric sequences are rare and present at <2% (ref. 47). In addition, no tools exist to automatically estimate UViG contamination, and thus this information is not included in the current MIUViG checklist. A future updated version of the MIUViG checklist may, however, include contamination thresholds if such a tool were to be developed. For example, such a tool might exploit single-copy marker genes (once these have been defined for a broader range of viruses) or it might use coverage by metagenome reads, which should in principle be evenly distributed along the genome with no major deviance, except for highly conserved genes.
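
The three quality categories can be expressed as a small decision function. The Python sketch below follows the thresholds in Table 2 and the phred requirement for single-read UViGs, but the function and argument names are hypothetical. The choice to report a single-read UViG that misses the phred requirement as a genome fragment is also an assumption, since the text states only that such a sequence cannot be labeled a high-quality draft.

```python
def phred_to_accuracy(q):
    """Convert a phred quality score to base-calling accuracy (Q20 -> 0.99)."""
    return 1 - 10 ** (-q / 10)


def assign_quality_tier(completeness, predicted_complete=False,
                        single_contig_per_segment=False,
                        manually_curated=False, mean_phred=None):
    """Return the quality category for a UViG.

    completeness: fraction of the expected genome size, or None if no
        expected size could be estimated.
    predicted_complete: circular, terminal inverted repeats, or an
        identified integration site.
    mean_phred: only set for UViGs not derived from a consensus assembly
        (for example, a single long read).
    """
    meets_draft = predicted_complete or (
        completeness is not None and completeness >= 0.90
    )
    if mean_phred is not None and mean_phred <= 20:
        meets_draft = False  # assumption: treated as a fragment in this sketch
    if not meets_draft:
        return "Genome fragment(s)"
    if predicted_complete and single_contig_per_segment and manually_curated:
        return "Finished genome"
    return "High-quality draft genome"
```

The manual-curation flag cannot be computed; it records that a person has reviewed the assembly and annotation, which is what separates a finished genome from a high-quality draft.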

Annotation of UViGs

Functional annotation of UViGs comprises the following tasks: predicting features in the genome sequence, such as protein-coding genes, tRNAs and integration sites; assigning functions to as many predicted features as possible; and assigning the remaining hypothetical proteins to uncharacterized protein families. Annotation pipelines have been established for different types of viruses[48,49], and large differences between viral genome types likely preclude the development of a single tool able to annotate every virus[50]. Therefore, we recommend that software used to annotate UViGs be reported (Supplementary Table 1). The choice of methods and reference databases used to annotate predicted proteins should be clearly stated. Homologs of novel virus genes may not be detected with standard methods for pairwise sequence similarity detection, such as BLAST, but instead require the use of more sensitive profile similarity approaches, such as HMMER[51], PSI-BLAST[52] or HHPred[53] (Supplementary Table 8; reviewed in ref. 54). Although sequence profiles for many protein families have been collected, they frequently remain unassociated with any specific function. Therefore, UViG analyses should always report (i) feature prediction method(s), (ii) sequence similarity search method(s), and (iii) database(s) searched (Box 1 and Supplementary Table 1).

Taxonomy of UViGs

Taxonomic classification can provide information on the relationship of a UViG with known viruses. Although the information and criteria used for virus classification have changed over time, virus classification has now converged to genome-based analyses[15] (Box 2). The ICTV established specific demarcation criteria for each virus group (Supplementary Table 9) owing to the vast range of viral genomes, mutation rates and evolution. Recently, a consensus has emerged on using whole-genome average nucleotide identity for classification at the species rank, which is used in downstream ecological, evolutionary and functional studies. This consensus was reached through analysis of published population genetics studies[55,56] and gene content comparison of NCBI RefSeq[10] virus genomes[57,58,59] (Supplementary Note 3 and Supplementary Fig. 3). We propose to formalize the use of species-rank virus groups and to name these “virus operational taxonomic units” (vOTUs) to avoid confusion because species groups have been variously named “viral population,” “viral cluster” or “contig cluster” in the literature[4,7,60]. We suggest standard thresholds of 95% average nucleotide identity over 85% alignment fraction (relative to the shorter sequence) on the basis of a comparison of sequences currently available in NCBI RefSeq[10] and IMG/VR[11] (Supplementary Note 3 and Supplementary Figs. 3 and 4). Although partial genomes remain challenging to classify, these common thresholds will enable comparative analyses (Supplementary Fig. 5). In addition, vOTU reports should include the clustering method and cutoff, the reference database used (if any), and the genome alignment approach because small differences have been observed between different methods[61] (Supplementary Table 1). For higher taxonomic ranks than species, no consensus has been reached on which approach should be used, although several have been proposed[58,59,62,63,64,65,66]. 
Keeping this in mind, UViG reports including taxonomy must clearly indicate the methods and cutoffs applied, and any new taxon must be highlighted as preliminary (for example, “genus-rank cluster,” “putative genus” or “candidate genus,” but not simply “genus,” as this category is reserved for ICTV-recognized groups; Supplementary Table 1). Authors should submit formal taxonomic proposals to the ICTV for consideration (https://talk.ictvonline.org/files/taxonomy-proposal-templates/). Finally, information about the nature of the genome and mode of expression (i.e., Baltimore classification[67]) should be included in the UViG description. Similarly, the predicted segmentation state of the genome (segmented or nonsegmented) should be reported, typically derived from taxonomic classification and comparison with the closest references (Supplementary Table 1).

Box 2

Compared with the classification of cellular organisms, virus classification is associated with unique challenges. First, viruses are most likely polyphyletic; that is, they arose multiple times independently. Unlike ribosomal genes of cellular organisms, for example, there are no genes that are present in all virus genomes that could be used as universal taxonomic markers. Virus genomes are variable, and they can be single-stranded RNA (or single-stranded DNA) encoding only a couple of proteins, double-stranded RNA viruses with up to 12 segments, or large and complex dsDNA viruses with genome sizes that are as large as those of some bacteria. Viruses are very diverse and tend to evolve faster than cellular organisms, in terms of both their genetic sequence and genome content. For all these reasons, viruses are not incorporated into the universal tree of life and a 'one size fits all' virus taxonomy has not been reported. Instead, there are different classification rules for different groups of viruses.
A set of criteria to classify viruses was first formally proposed by the Virus Subcommittee of the International Nomenclature Committee at the Fifth International Congress of Microbiology, held at Rio de Janeiro in August 1950 (ref. 90). The virus classification criteria were purposefully based on stable properties of the virus itself, first among them being the virion morphology, virus genome type, and mode of replication, rather than more variable properties such as symptomatology after infection. A hierarchical categorization of viruses based on genome type and virion morphology was then proposed[91], and another operational classification scheme relying on nucleic acid type and method of genome expression was proposed by David Baltimore in 1971 (ref. 67). The need for a specific set of rules to name and classify viruses led to the establishment of the International Committee on Nomenclature of Viruses (ICNV)[92], renamed as the International Committee on Taxonomy of Viruses (ICTV) in 1975 (ref. 82). The ICTV is a committee of the Virology Division of the International Union of Microbiological Societies and is charged with the task of developing, refining and maintaining the official virus taxonomy, presented to the research community in The ICTV Report (https://talk.ictvonline.org/ictv-reports/ictv_online_report/) and interim update articles (“Virology Division news”) in Archives of Virology. Using some of the stable properties of viruses that were previously highlighted, experts in the ICTV developed a universal virus taxonomy similar to the classical Linnaean hierarchical system, in which virus groups were assigned to familiar taxonomic ranks including order, family, genus and species. 
In the postgenomic era, virus classification is increasingly based on the comparison of genome and protein sequences, which provides a unique opportunity to evaluate phylogenetic and evolutionary relationships between viruses and reconcile the taxonomy of viruses with their reconstructed evolutionary trajectory. The ICTV has undertaken the immense task of re-evaluating virus classification in light of sequence-based information[15,82,93]. Importantly, with large sections of the virosphere still to be explored, virus taxonomy represents only the current best attempt at recapitulating virus evolutionary history on the basis of available data. Virus classification will need to remain dynamic, expanding as we discover new viruses and being refined as our understanding of virus evolution improves.

In silico host prediction

Once a new virus genome has been assembled, an important step toward understanding the ecological role of the associated virus is to predict its host(s). In silico approaches are often the only option for UViGs (reviewed in ref. 68; Supplementary Table 10). These can be separated into four main types. First, hosts can be predicted with relatively high precision on the basis of sequence similarity between the UViG and a reference virus genome when a closely related virus is available[69,70]. Second, hosts can be predicted on the basis of sequence similarities between a UViG and a host genome. These sequence similarities can range from short exact matches (∼20–100 bp), which include CRISPR spacers[4,7,68,71], to longer (>100 bp) nucleotide sequence matches, including proviruses integrated into a larger host contig[26,68,72,73] (Supplementary Table 10). Host-range predictions based on sequence similarity are the most reliable but require that a closely related host genome has been sequenced[68]. Third, host taxonomy from domain down to genus rank can be predicted from nucleotide usage signatures reflecting coevolution between virus and host genomes in terms of G+C content, k-mer frequency and codon usage[26,74,75]. These approaches are usually less specific than sequence similarity–based ones and cannot reliably predict host range below the genus rank, but can provide a predicted host for a larger number of UViGs[7] (Supplementary Table 10). Finally, host predictions can be computed from a comparison of abundance profiles of host and virus sequences across spatial or temporal scales, either through abundance correlation[25,76,77,78] or through more sophisticated model-based interaction predictors[79]. Although few datasets are available for robust evaluation of host prediction based on comparison of abundance profiles, we expect this approach to become more powerful and relevant as high-resolution time-series metagenomics becomes more common. 
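The third approach, host prediction from nucleotide usage signatures, can be illustrated with a minimal sketch. It assumes tetranucleotide frequencies and a Manhattan distance, both chosen here for illustration only; the function names are hypothetical and do not correspond to any of the tools cited above, which use more elaborate statistics and background models.

```python
from collections import Counter
from itertools import product

K = 4  # tetranucleotides; the choice of k is an assumption of this sketch
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def kmer_profile(seq):
    """Return relative k-mer frequencies of seq, ordered like KMERS."""
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    # Only unambiguous k-mers are counted, so windows containing N are skipped.
    total = sum(counts[kmer] for kmer in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


def profile_distance(p, q):
    """Manhattan distance between two profiles (0 means identical usage)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rank_hosts(virus_seq, host_seqs):
    """Rank candidate hosts (dict name -> sequence) by increasing distance."""
    virus_profile = kmer_profile(virus_seq)
    scored = [
        (profile_distance(virus_profile, kmer_profile(seq)), name)
        for name, seq in host_seqs.items()
    ]
    return [(name, dist) for dist, name in sorted(scored)]


if __name__ == "__main__":
    # Toy sequences only; real use would read genomes from FASTA files.
    candidates = {"hostA": "ACGT" * 40, "hostB": "GGGGCCCC" * 30}
    for name, dist in rank_hosts("ACGT" * 50, candidates):
        print(name, round(dist, 3))
```

The smallest distance marks the most similar composition. It is not evidence of infection on its own, which is why such predictions are usually limited to genus rank or above and need an accompanying accuracy estimate.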
As all these bioinformatic approaches remain predictive, it is crucial that robust false-discovery rate estimations are reported (Supplementary Table 1). Moreover, computational tools do not predict quantitative infection characteristics (for example, infection rate or burst size), which are important for understanding the impacts of viruses on host biology, and thus far only apply to viruses infecting bacteria or archaea. Nevertheless, these predictions are important guides for subsequent in silico, in vitro and in vivo studies, including experimental validation to unequivocally demonstrate a viral infection of a given microbial host. Host predictions should be reported along with details regarding the specific tool(s) used and, importantly, their estimated accuracy as derived either from published benchmarks or from tests conducted in the study (Supplementary Table 1). This information will allow virus–host databases[69,80] to progressively incorporate UViGs while still controlling for the sensitivity and accuracy of the predictions provided to users.
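One way to derive the accuracy estimates requested above is to run the prediction tool on viruses with known hosts and count wrong calls at the chosen score cutoff. The sketch below is a simplified illustration: the input layout, the function name `benchmark_rates` and the exact-string comparison of host labels (standing in for a comparison at a fixed taxonomic rank) are all assumptions, not part of the MIUViG checklist.

```python
def benchmark_rates(predictions, known_hosts, cutoff):
    """Estimate false-discovery rate and recall at a score cutoff.

    predictions: list of (virus_id, predicted_host, score) tuples.
    known_hosts: dict mapping virus_id to the true host label.
    Host labels are compared as exact strings (an assumption of this sketch).
    """
    called = [
        (virus, host)
        for virus, host, score in predictions
        if score >= cutoff and virus in known_hosts
    ]
    true_pos = sum(1 for virus, host in called if known_hosts[virus] == host)
    false_pos = len(called) - true_pos
    # FDR = wrong calls / all calls made at this cutoff.
    fdr = false_pos / len(called) if called else None
    # Recall = benchmark viruses with at least one correct call / all benchmark viruses.
    correct = {virus for virus, host in called if known_hosts[virus] == host}
    recall = len(correct) / len(known_hosts) if known_hosts else None
    return {"cutoff": cutoff, "n_called": len(called), "fdr": fdr, "recall": recall}


if __name__ == "__main__":
    # Toy benchmark; identifiers and scores are made up for illustration.
    preds = [("v1", "A", 0.9), ("v2", "B", 0.8), ("v3", "A", 0.4), ("v4", "C", 0.95)]
    truth = {"v1": "A", "v2": "A", "v3": "A", "v4": "C"}
    print(benchmark_rates(preds, truth, cutoff=0.5))
```

Reporting the cutoff together with the resulting rates lets database curators judge how far to trust each predicted virus-host pair.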

Reporting UViGs

We recommend the following best practice for sharing and archiving UViGs and UViG-related data: data publication should center on the data resources of INSDC (http://www.insdc.org/) through one of the member databases, at DDBJ (https://www.ddbj.nig.ac.jp/index-e.html), EMBL-EBI's European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena) or NCBI (GenBank and the Sequence Read Archive; https://www.ncbi.nlm.nih.gov/nucleotide). If needed, INSDC database curators can be contacted directly for large-scale batch dataset submissions. Where new datasets are generated as part of a UViG study, sequenced samples should be described according to the environment-relevant MIxS checklists and raw read data should be submitted. High-quality and finished UViGs should be submitted as assemblies, the former reported as “draft” accompanied by the required metadata (Table 1). Incomplete assemblies may be submitted, but they must be accompanied by the required metadata (Table 1 and Supplementary Table 1). Where available, annotation and taxonomic classification should be submitted to INSDC, and occurrence and abundance data reported as 'Analysis' records in the ENA. Reports of abundance data estimated by short-read metagenome mapping should include information about the nucleotide identity and coverage thresholds used, with corresponding estimates of false-positive and false-negative rates either computed de novo or extracted from the literature (for example, from refs. 47, 81; Supplementary Note 4). All INSDC accession codes must be cited in publications. For ICTV classification, only coding-complete genomes (complete high-quality draft and finished UViGs) are currently considered[82].
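The identity and coverage thresholds mentioned for read-mapping abundance estimates can be made concrete with a small sketch. It assumes alignments are already available as (start, end, percent identity) tuples on the UViG, with half-open coordinates. The default thresholds (95% identity, 75% of the genome covered) are placeholders for illustration rather than values prescribed here, and both function names are hypothetical.

```python
def covered_fraction(intervals, genome_length):
    """Fraction of genome positions covered by at least one interval [start, end)."""
    covered = 0
    current_end = 0  # rightmost position already counted
    for start, end in sorted(intervals):
        start = max(start, current_end)  # do not count overlaps twice
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


def is_detected(alignments, genome_length, min_identity=95.0, min_breadth=0.75):
    """Return (detected, breadth) for one UViG in one sample.

    alignments: list of (start, end, percent_identity) read alignments.
    Reads below min_identity are ignored before breadth is computed.
    """
    kept = [(s, e) for s, e, identity in alignments if identity >= min_identity]
    breadth = covered_fraction(kept, genome_length)
    return breadth >= min_breadth, breadth


if __name__ == "__main__":
    # Toy alignments on a 1,000-bp contig; the last read fails the identity filter.
    reads = [(0, 400, 99.0), (300, 800, 96.0), (800, 1000, 90.0)]
    print(is_detected(reads, genome_length=1000))
```

Whatever thresholds are used, stating them alongside the abundance table is what makes the estimates comparable across studies.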

Conclusions

MIUViG standards and best practices for UViG analysis are the virus-specific counterparts to MISAG and MIMAG[12]. Virus genomics and metagenomics are rapidly expanding and improving as sequencing technologies emerge and mature. At the same time, the development of genome-based virus taxonomy methods as well as unified, comprehensive, and annotated reference databases of virus genomes and/or proteins continues apace. Community adoption of these standards, including through ongoing collaborations with other virus committees (ICTV) and data centers (DDBJ, EMBL-EBI and NCBI), will provide a framework for a systematic exploration of viral genome sequence space and enable the research community to better utilize and report UViGs.

Additional information

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
