
Standards for sequencing viral genomes in the era of high-throughput sequencing.

Jason T Ladner¹, Brett Beitzel², Patrick S G Chain³, Matthew G Davenport⁴, Eric F Donaldson⁵, Matthew Frieman⁶, Jeffrey R Kugelman², Jens H Kuhn⁷, Jules O'Rear⁵, Pardis C Sabeti, David E Wentworth⁸, Michael R Wiley², Guo-Yun Yu², Shanmuga Sozhamannan, Christopher Bradburne⁴, Gustavo Palacios⁹.

Abstract

Thanks to high-throughput sequencing technologies, genome sequencing has become a common component in nearly all aspects of viral research; thus, we are experiencing an explosion in both the number of available genome sequences and the number of institutions producing such data. However, there are currently no common standards used to convey the quality, and therefore utility, of these various genome sequences. Here, we propose five "standard" categories that encompass all stages of viral genome finishing, and we define them using simple criteria that are agnostic to the technology used for sequencing. We also provide genome finishing recommendations for various downstream applications, keeping in mind the cost-benefit trade-offs associated with different levels of finishing. Our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques.

Year: 2014 PMID: 24939889 PMCID: PMC4068259 DOI: 10.1128/mBio.01360-14

Source DB: PubMed Journal: mBio Impact factor: 7.867

EDITORIAL

Viruses represent the greatest source of biological diversity on Earth, and with the help of high-throughput (HT) sequencing technologies, great strides are being made toward the genomic characterization of this diversity (1–3). Genome sequences play a critical role in our understanding of viral evolution, disease epidemiology, surveillance, diagnosis, and countermeasure development and thus represent valuable resources which must be properly documented and curated to ensure future utility. Here, we outline a set of viral genome quality standards, similar in concept to those proposed for large DNA genomes (4) but focused on the particular challenges of and needs for research on small RNA/DNA viruses, including characterization of the genomic diversity inherent in all viral samples/populations. Our goal is to define a common vocabulary that will allow comparison of genome quality across different research groups, sequencing platforms, and assembly techniques. Despite the small sizes of viral genomes, complications related to limited RNA quantities, host “contamination,” and secondary structure mean that it is often not time- or cost-effective to finish every genome, and given the intended use, finishing may be unnecessary (5). Therefore, we have used technology-agnostic criteria to define five standard categories designed to encompass the levels of completeness most often encountered in viral sequencing projects. Each viral family/species comes with its own challenges (e.g., secondary structure and GC content); therefore, we provide only loose guidance on the depth of sequence coverage likely required to obtain different levels of finishing. In reality, a similar amount of data will generate genomes with different levels of finishing for different viruses. To alleviate any reliance on particular aspects of the different sequencing technologies, we have made two assumptions that should be valid in most viral sequencing projects. 
The first assumption is a basic understanding of the genomic structure of the virus being sequenced, including the expected size of the genome, the number of segments, and the number and distribution of major open reading frames (ORFs). Fortunately, genome structure is highly conserved within viral groups (6), and although new viruses are constantly being uncovered, the discovery of a novel family or even genus remains relatively uncommon (7). In the absence of such information, the defined standards can still be applied following further analysis to determine genome structure. The second assumption is that the genetic material of the virus being described can be accurately separated from the genomes of the host and/or other microbes, either physically or bioinformatically. Depending on the technology used, it is critical that the potential for cross-contamination of samples during the sample indexing/bar coding process and sequencing procedure be addressed with appropriate internal controls and procedural methods (8).

PROPOSED CATEGORIES FOR WHOLE-GENOME SEQUENCING OF VIRUSES

For a summary of the proposed categories for whole-genome sequencing of viruses, see Fig. 1 and Table 1.

FIG 1

TABLE 1

Overview of viral genome standards

Feature	Standard draft[a]	High quality[a]	Coding complete[a]	Complete	Finished
No. of contigs	>1 for some segments	1 per segment	1 per segment	1 per segment	1 per segment
Open reading frames	Incomplete	Incomplete	Complete	Complete	Complete
Estimated % of genome covered[b]	≥50%	~80-90%	~90-99%	100%	100%
Population-level characterization	Optional	Optional	Optional	Optional	Required
Contaminant analysis	Optional	Optional	Optional	Optional	Optional

[a] It is suggested that all bases included in any incomplete genome meet a minimum quality standard, with ≥5 reads supporting the consensus base call with individual base qualities of ≥20 on the Phred scale.

[b] Percentages of genome covered are not meant to serve as criteria for categorizing a genome; they are simply estimates of expected levels of coverage.

FIG 1. Graphical representation of viral genome standards. Bullets on the left represent primary distinctions between categories. Bullets on the right indicate potential downstream applications of genomes in each category.
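The minimum quality suggestion attached to Table 1 (≥5 supporting reads, each with a base quality of ≥20 on the Phred scale) can be illustrated with a minimal Python sketch. The input format (a list of `(base, phred_quality)` tuples per position) and the function name `call_consensus_base` are assumptions made for illustration, as is the choice to mask unsupported positions with `N`; the editorial does not prescribe an implementation.

```python
from collections import Counter

MIN_READS = 5   # supporting reads suggested in the Table 1 footnote
MIN_PHRED = 20  # per-base Phred quality suggested in the Table 1 footnote


def call_consensus_base(pileup, min_reads=MIN_READS, min_phred=MIN_PHRED):
    """Return the consensus base for one position, or 'N' if support is too weak.

    `pileup` is a list of (base, phred_quality) tuples, one per read covering
    the position (a hypothetical input format chosen for this sketch).
    """
    # Count only base calls that individually meet the quality threshold.
    good = Counter(base for base, qual in pileup if qual >= min_phred)
    if not good:
        return "N"
    base, support = good.most_common(1)[0]
    # The consensus is accepted only if enough high-quality reads agree with it.
    return base if support >= min_reads else "N"
```

Under this reading of the footnote, low-quality base calls do not count toward the five supporting reads, so a position with many low-quality reads is still masked.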

Standard draft (SD).

The “standard draft” category is for whole-genome shotgun assemblies with coverage that is low and/or uneven enough to prevent the assembly of a single contig for one or more genome segments. Genomes in this category are likely to result from samples with low viral titers, such as clinical and environmental samples, or from viruses containing regions that are difficult to sequence across (e.g., intergenic hairpin regions) (9). To distinguish standard drafts from targeted amplification of partial viral sequences, standard drafts should contain at least 1 contig for each genomic segment and should be prepared in a manner that allows the possibility of sequencing the vast majority of a virus’s genome. To avoid the inclusion of small pieces of genomes as “drafts,” a minimum cutoff for breadth of coverage is needed. Therefore, we suggest that at least half (≥50%) of the genome be present for a set of sequences to be considered a draft genome.
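As a rough illustration of the breadth-of-coverage cutoff, the following Python sketch merges contig intervals and reports the covered fraction of one segment. It assumes contigs have already been placed on reference coordinates as half-open `(start, end)` intervals and that the expected segment length is known; both the function names and the input format are hypothetical. For segmented genomes, the per-segment requirement of at least one contig would need to be checked separately.

```python
def breadth_of_coverage(contigs, genome_length):
    """Fraction of the expected genome covered by at least one contig.

    `contigs` is a list of (start, end) half-open intervals in reference
    coordinates; overlapping or touching intervals are merged before counting.
    """
    covered = 0
    current_start, current_end = None, None
    for start, end in sorted(contigs):
        if current_end is None or start > current_end:
            # A gap: close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap or adjacency: extend the current merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


def meets_draft_breadth(contigs, genome_length, minimum=0.5):
    """True if the covered fraction reaches the suggested 50% cutoff."""
    return breadth_of_coverage(contigs, genome_length) >= minimum
```

For example, contigs spanning 0 to 3,000, 2,500 to 4,000, and 7,000 to 9,000 of a 10,000-nucleotide genome cover 6,000 positions after merging and would pass the cutoff.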

High quality (HQ).

Genomes should be considered high quality if no gaps remain (i.e., a single contig per genome/segment), even if one or more ORFs remain incomplete due to missing sequence at the ends of segments. An HQ genome can often be achieved with modest levels of HT sequencing coverage (~15 to 30×) or through Sanger-mediated gap resolution of an SD.

Coding complete (CC).

The “coding complete” category indicates that in addition to the lack of gaps, all ORFs are complete. This level of completion is typically possible with high levels of HT sequencing coverage (>100×) or may require the use of conserved PCR primers targeting the ends of the segments.

Complete.

A genome is complete when the genome sequence has been fully resolved, including all non-protein-coding sequences at the ends of the segment(s). This is typically achieved through rapid amplification of cDNA ends (RACE) or similar procedures.

Finished.

This final category represents a special instance in which, in addition to having a completed consensus genome sequence, there has been a population-level characterization of genomic diversity. Typically this requires ~400 to 1,000× coverage (see below). This provides the most complete picture of a viral population; however, this designation will apply only for a single stock. Additional characterizations will be necessary for future passages.
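The five categories form an ordered series of checks, which the following Python sketch makes explicit. The `Assembly` record and its field names are hypothetical; the sketch assumes each property (contig counts, ORF completeness, resolved termini, population-level characterization) has already been determined by upstream analysis. The "Below standard draft" label is not one of the proposed categories and is used here only for assemblies that fail the draft criteria.

```python
from dataclasses import dataclass


@dataclass
class Assembly:
    contigs_per_segment: list   # number of contigs assembled for each segment
    orfs_complete: bool         # all major ORFs fully resolved
    ends_resolved: bool         # terminal noncoding sequences fully resolved
    population_characterized: bool = False  # population-level diversity analyzed
    breadth: float = 1.0        # fraction of the expected genome covered


def classify(assembly):
    """Assign an assembly to one of the five proposed categories."""
    counts = assembly.contigs_per_segment
    # A draft needs at least one contig per segment and at least 50% breadth.
    if not counts or min(counts) < 1 or assembly.breadth < 0.5:
        return "Below standard draft"
    if max(counts) > 1:
        return "Standard draft"       # gaps remain in at least one segment
    if not assembly.orfs_complete:
        return "High quality"         # single contig per segment, ORFs incomplete
    if not assembly.ends_resolved:
        return "Coding complete"      # all ORFs complete, termini unresolved
    if not assembly.population_characterized:
        return "Complete"             # full consensus sequence
    return "Finished"                 # complete plus population-level analysis
```

A three-segment assembly with contig counts `[1, 2, 1]` would be reported as a standard draft regardless of its other properties, because a gap remains in the second segment.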

ADDITIONAL HIGH-THROUGHPUT SEQUENCE-BASED GENOME CHARACTERIZATIONS

Population-level characterization.

HT sequencing technologies provide powerful platforms for investigating the genetic diversity within viral populations, which is integral to our understanding of viral evolution and pathogenesis (10, 11). Population-level characterization requires very high levels of HT sequencing coverage (12, 13); however, the exact level will depend on the background error profiles of the sequencing technology and the desired level of sensitivity. As an example, Wang et al. (12) determined that for pyrosequencing data, ~400× coverage is necessary to identify minor variants present at 1% frequency with 99.999% confidence, and ~1,000× coverage is needed for variants with a frequency of 0.5%. Targeted amplification of the viral genome is often necessary to achieve these coverage requirements. Due to the modest read lengths of most HT technologies, the state of the art for population-level analysis has been the characterization of unphased polymorphisms. However, single-molecule technologies, with maximum read lengths of >20 kb, are opening the door for complete genome haplotype phasing (14).
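The dependence of sensitivity on coverage can be explored with a simple binomial model, sketched below in Python. This is an illustrative simplification, not the method of Wang et al. (12): it treats each read as an independent draw and ignores sequencing error entirely, so it will not reproduce their coverage thresholds. The function name and the `min_variant_reads` parameter are assumptions introduced here.

```python
from math import comb


def detection_probability(depth, frequency, min_variant_reads):
    """Probability that at least `min_variant_reads` of `depth` reads carry a
    variant present at `frequency`, treating reads as independent draws."""
    # Sum the binomial probabilities of seeing fewer than the required reads.
    p_fewer = sum(
        comb(depth, k) * frequency**k * (1 - frequency) ** (depth - k)
        for k in range(min_variant_reads)
    )
    return 1 - p_fewer
```

In practice, `min_variant_reads` would be chosen high enough that sequencing errors alone are unlikely to reach it, which is why the background error profile of the platform drives the coverage actually required.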

Identification of contaminants or adventitious agents.

After isolation, viruses are often maintained as stocks, which are propagated within host cells in tissue culture and thus amplified and preserved for future use. Despite careful laboratory practices, it is possible for these stocks to become contaminated with additional microbes. Contaminating microbes are often detrimental to subsequent applications such as vaccine development or the testing of therapeutics, making it imperative to monitor the purity of viral stocks. HT sequencing provides a powerful method not only for detecting the presence of contaminants within a sample but also for identifying and characterizing them. The level of sequencing required for contamination analysis is dependent on the desired sensitivity, with more sequencing required to ensure detection of contaminants present at very low levels. For most approaches, HQ-level sequencing should be sufficient. Depending on the intended applications, analysis may need to be repeated after further passaging to ensure that no additional contaminants have been introduced.
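The relationship between sequencing effort and contaminant sensitivity can be approximated with a short calculation, sketched here in Python. It assumes that reads are sampled independently and that a single contaminant read counts as a detection; real analyses would usually demand more than one read and must account for index cross-talk. The function name and the default confidence level are illustrative choices, not values from the text.

```python
from math import ceil, log


def reads_for_detection(contaminant_fraction, confidence=0.95):
    """Smallest read count N with P(at least one contaminant read) >= confidence,
    i.e. 1 - (1 - contaminant_fraction)**N >= confidence."""
    if not 0 < contaminant_fraction < 1:
        raise ValueError("contaminant_fraction must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Solve (1 - fraction)**N <= 1 - confidence for N and round up.
    return ceil(log(1 - confidence) / log(1 - contaminant_fraction))
```

Under these assumptions, the required number of reads grows roughly in inverse proportion to the contaminant fraction, which is the sense in which rarer contaminants demand deeper sequencing.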

RECOMMENDED STANDARDS FOR DOWNSTREAM APPLICATIONS

Description of novel viruses.

Despite the rapidly growing collection of viral sequences, the description of novel viruses is likely to remain an important aspect of viral genome sequencing (7, 15, 16). This is true in part because viruses evolve rapidly and are capable of recombining to form novel genotypes (17, 18). It is also true that most of the viruses that are currently circulating remain uncharacterized (15). Particularly lacking are representatives from groups that are not currently known to infect humans or organisms of economic importance. It would be imprudent, however, to continue to ignore these uncharacterized reservoirs of diversity, because it is difficult to predict the source of future emerging diseases (19–21). Additionally, with the current suite of primarily sequence similarity-based pathogen identification tools, the ability to detect novel pathogens is wholly dependent on high-quality reference databases (22). There is a trend toward requiring a complete genome sequence when a description of a novel virus is being published, and we agree that this is a good goal; however, the amount of time and resources required to complete the last 1 to 2% of a viral genome is often cost and time prohibitive for projects sequencing a large number of samples, and in most cases the very ends of the segments are not essential for proper identification and characterization. Therefore, for the majority of viral characterization projects, we recommend, at a minimum, a CC genome. This will ensure a complete description of the viral proteome and will allow accurate phylogenetic placement.

Molecular epidemiology.

One of the most common and important applications for viral genomes is in the study of viral epidemiology, which encompasses our understanding of the patterns, causes, and effects of disease. Early studies of molecular epidemiology targeted small pieces of viral genomes; however, this type of analysis is likely to miss important changes elsewhere in the genome. Therefore, there has been a strong focus in recent years toward the sequencing of “full” viral genomes. Institutes such as the Broad Institute and the J. Craig Venter Institute (JCVI) have been instrumental in breaking ground in the collection of large numbers of good-quality viral sequences. Their newly identified genomes typically fall within our CC category. This is likely to remain the gold standard for studies involving a large number of genome sequences, especially when some sequences come from low-titer clinical samples, which often necessitate amplicon-based sequencing methods. CC genomes allow for interrogation of changes throughout the coding portion of the viral genome and often include partial noncoding regions. In the absence of high-throughput RACE alternatives, the time and resources required to complete hundreds or thousands of genomes are likely to continue to outweigh the potential information gained from completing the terminal sequences.

Countermeasure development.

Advancements in our capabilities to sequence viral genomes are changing the way we counteract global pandemics and acts of bioterrorism. There are two important aspects of countermeasure development that can benefit strongly from the availability of genome sequences and HT sequencing data: the detection of the infectious agent and the treatment of the disease caused by the agent. Taxonomic classification and detection through DNA/RNA-based inclusivity assays (i.e., using techniques such as PCR to detect the presence of a pathogen) can be designed using fragmented and incomplete genomes (e.g., SD and HQ sequences). Fully resolved ORFs (CC) further enable the development of immunological assays, such as enzyme-linked immunosorbent assays (ELISA) and immunofluorescence assays (IFA), for protein-based detection, and obtaining a complete genome opens the door to a plethora of additional downstream applications, including the design of exclusivity tests, the establishment of reverse genetics systems, and the design of robust forensics protocols. However, for effective development and testing of animal models, therapeutics, vaccines, and prophylactics, it is necessary to obtain a complete picture of the variability present within both the challenge stock and postinfection populations, thereby necessitating finished genomes. In these medical applications, it is also important to demonstrate the absence of adventitious agents.

REPOSITORIES OF GENOMIC INFORMATION AND DATA CURATION

In addition to standardizing the vocabulary of viral genome assemblies, it is also critical for researchers to routinely provide raw sequencing reads. Without these, it is impossible for others to independently verify the quality of an assembly. Data repositories such as GenBank already provide a platform for depositing HT sequencing reads, but this is not a requirement for the submission of a genome, nor is this option typically utilized. Wider analysis of data will ultimately result in higher-quality assemblies. It is worth considering broader implementation of a wiki-like, crowd-sourcing approach to genome assembly, similar to the annotation strategies that have been adopted for specific genomes of high interest (23, 24). This approach would allow multiple parties to work on genome assembly and annotation at the same time and would provide instant updates for the entire community to evaluate and utilize in their own research.

Our primary goal here is to initiate a conversation. The rate at which viral genomes are being sequenced is only going to increase in the coming years, and without some standardization, it will be impossible for these valuable resources to be utilized to their full potential. We present these categories as a starting point, with the goal of adjusting and refining them over time as our capabilities and needs continue to change.

23 in total

1. Genomics. Genome project standards in a new era of sequencing.

Authors: P S G Chain; D V Grafham; R S Fulton; M G Fitzgerald; J Hostetler; D Muzny; J Ali; B Birren; D C Bruce; C Buhay; J R Cole; Y Ding; S Dugan; D Field; G M Garrity; R Gibbs; T Graves; C S Han; S H Harrison; S Highlander; P Hugenholtz; H M Khouri; C D Kodira; E Kolker; N C Kyrpides; D Lang; A Lapidus; S A Malfatti; V Markowitz; T Metha; K E Nelson; J Parkhill; S Pitluck; X Qin; T D Read; J Schmutz; S Sozhamannan; P Sterk; R L Strausberg; G Sutton; N R Thomson; J M Tiedje; G Weinstock; A Wollam; J C Detter
Journal: Science Date: 2009-10-09 Impact factor: 47.728

2. Characterization of the Candiru antigenic complex (Bunyaviridae: Phlebovirus), a highly diverse and reassorting group of viruses affecting humans in tropical America.

Authors: Gustavo Palacios; Robert Tesh; Amelia Travassos da Rosa; Nazir Savji; Wilson Sze; Komal Jain; Robert Serge; Hilda Guzman; Carolina Guevara; Marcio R T Nunes; Joaquim P Nunes-Neto; Tadeusz Kochel; Stephen Hutchison; Pedro F C Vasconcelos; W Ian Lipkin
Journal: J Virol Date: 2011-02-02 Impact factor: 5.103

Review 3. The evolution of epidemic influenza.

Authors: Martha I Nelson; Edward C Holmes
Journal: Nat Rev Genet Date: 2007-01-30 Impact factor: 53.242

4. Characterization of mutation spectra with ultra-deep pyrosequencing: application to HIV-1 drug resistance.

Authors: Chunlin Wang; Yumi Mitsuya; Baback Gharizadeh; Mostafa Ronaghi; Robert W Shafer
Journal: Genome Res Date: 2007-06-28 Impact factor: 9.043

5. Highly sensitive and specific detection of rare variants in mixed viral populations from massively parallel sequence data.

Authors: Alexander R Macalalad; Michael C Zody; Patrick Charlebois; Niall J Lennon; Ruchi M Newman; Christine M Malboeuf; Elizabeth M Ryan; Christian L Boutwell; Karen A Power; Doug E Brackney; Kendra N Pesko; Joshua Z Levin; Gregory D Ebel; Todd M Allen; Bruce W Birren; Matthew R Henn
Journal: PLoS Comput Biol Date: 2012-03-15 Impact factor: 4.475

6. Pseudomonas Genome Database: improved comparative analysis and population genomics capability for Pseudomonas genomes.

Authors: Geoffrey L Winsor; David K W Lam; Leanne Fleming; Raymond Lo; Matthew D Whiteside; Nancy Y Yu; Robert E W Hancock; Fiona S L Brinkman
Journal: Nucleic Acids Res Date: 2010-10-06 Impact factor: 16.971

7. Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform.

Authors: Martin Kircher; Susanna Sawyer; Matthias Meyer
Journal: Nucleic Acids Res Date: 2011-10-21 Impact factor: 16.971

Review 8. The search for meaning in virus discovery.

Authors: Peter Daszak; W Ian Lipkin
Journal: Curr Opin Virol Date: 2011-11-04 Impact factor: 7.090

9. Human viruses: discovery and emergence.

Authors: Mark Woolhouse; Fiona Scott; Zoe Hudson; Richard Howey; Margo Chase-Topping
Journal: Philos Trans R Soc Lond B Biol Sci Date: 2012-10-19 Impact factor: 6.237

Review 10. Computational tools for viral metagenomics and their application in clinical research.

Authors: L Fancello; D Raoult; C Desnues
Journal: Virology Date: 2012-10-11 Impact factor: 3.616
