Virus Variation Resource--recent updates and future directions.

J Rodney Brister, Yiming Bao, Sergey A Zhdanov, Yuri Ostapchuck, Vyacheslav Chetvernin, Boris Kiryutin, Leonid Zaslavsky, Michael Kimelman, Tatiana A Tatusova.

Abstract

Virus Variation (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/) is a comprehensive, web-based resource designed to support the retrieval and display of large virus sequence datasets. The resource includes a value added database, a specialized search interface and a suite of sequence data displays. Virus-specific sequence annotation and database loading pipelines produce consistent protein and gene annotation and capture sequence descriptors from sequence records then map these metadata to a controlled vocabulary. The database supports a metadata driven, web-based search interface where sequences can be selected using a variety of biological and clinical criteria. Retrieved sequences can then be downloaded in a variety of formats or analyzed using a suite of tools and displays. Over the past 2 years, the pre-existing influenza and Dengue virus resources have been combined into a single construct and West Nile virus added to the resultant resource. A number of improvements were incorporated into the sequence annotation and database loading pipelines, and the virus-specific search interfaces were updated to support more advanced functions. Several new features have also been added to the sequence download options, and a new multiple sequence alignment viewer has been incorporated into the resource tool set. Together these enhancements should support enhanced usability and the inclusion of new viruses in the future.

Year: 2013 PMID: 24304891 PMCID: PMC3965055 DOI: 10.1093/nar/gkt1268

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

‘So many sequences and yet, so little metadata’ might as well be the official slogan for the dawn of the sequencing age. Often sequence source descriptors such as host, isolation place and time, and other metadata are missing from International Nucleotide Sequence Database Collaboration (INSDC) (1) sequence records. Though metadata can sometimes be inferred from information found within the sequence record or in the text of a research article, associating these derived metadata with the original sequence record is difficult in practice. Even when metadata are readily available, without universally accepted standards, varied but synonymous terms can hinder retrieval of relevant sequences from public database searches. Lack of data standardization extends beyond metadata: sequence annotations are often inconsistent, with the same protein annotated in different ways among different sequence records, a major impediment to sequence analysis. Of course metadata and sequence annotation standards are but the tip of the iceberg. With so many sequences now available in public databases, even under the best of circumstances, database queries often produce very large datasets, forcing the user to weed through pages of traditional text-based displays. Indeed, one could argue that the explosion of sequence data now threatens to blow up traditional models of data storage, retrieval and display. This realization and the argument that such broad issues require equally broad solutions led to the development of the NCBI Virus Variation Resource (http://www.ncbi.nlm.nih.gov/genomes/VirusVariation/) (2). This comprehensive, value-added web resource includes three elements (a specialized database, a unique search interface and a suite of tools and displays), all designed to support large sequence datasets.

VIRUS VARIATION 2.0

The current Virus Variation Resource is an outgrowth of the NCBI Influenza Virus Resource created in 2004 (3) in response to the National Institute of Allergy and Infectious Diseases (NIAID) Influenza Genome Sequencing Project (4). The resource was initially designed to enhance the usability of very large influenza sequence datasets, and a number of features were introduced to facilitate sequence retrieval. Among these was the development of a metadata driven search interface (Figure 1). Sequence descriptors such as country of isolation, host and protein name are parsed from GenBank (5) records during database loading using advanced strategies. These machine processes are augmented with human curation allowing data found in publications and other sources to be associated with sequences in the database. The resultant metadata are mapped to controlled vocabulary lists, and consistent terms are stored in the database, providing a single term for synonymous and misspelled ones. These metadata terms are then displayed among several menus providing users with a straightforward but comprehensive search interface through which users can retrieve nucleotide and protein sequences based on a number of biological and clinical criteria.
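The mapping of parsed descriptors to a controlled vocabulary can be illustrated with a minimal sketch. The synonym table, field names and function names below are hypothetical and chosen only for illustration; the actual vocabulary lists and parsing rules used by the resource are not described here.

```python
# Hypothetical synonym table; the real Virus Variation vocabulary is not shown in the text.
CONTROLLED_VOCABULARY = {
    "usa": "USA",
    "united states": "USA",
    "u.s.a.": "USA",
    "viet nam": "Viet Nam",
    "vietnam": "Viet Nam",
    "homo sapiens": "Human",
    "human": "Human",
    "humna": "Human",  # example of a misspelled submitter term
}


def normalize_term(raw_term):
    """Return the controlled term for a raw descriptor, or None if it is unmapped."""
    # Lowercase and collapse whitespace so trivially different spellings match.
    key = " ".join(raw_term.strip().lower().split())
    return CONTROLLED_VOCABULARY.get(key)


def normalize_record(record):
    """Map the host and country fields of one parsed record to controlled terms."""
    normalized = dict(record)
    for field in ("host", "country"):
        normalized[field] = normalize_term(record.get(field, ""))
    return normalized
```

Terms that return None in this sketch are the ones that would need a new vocabulary entry or human curation.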

Figure 1.

Influenza virus database search interface. The search interface is shown with the ‘Additional filters’ selection open. A number of search criteria have been selected, and two separate searches added to the ‘Query builder’. Selected search results are indicated by the check box to the left of the ‘Query builder’ display and can be downloaded directly in a variety of formats or loaded into the Virus Variation search results interface by clicking the ‘Show results’ button.

The number of sequences in the database has grown substantially as influenza continues to be a major human pathogen and as surveillance networks and virus sequencing efforts are maintained around the world (6,7). There are now more than 292 000 individual influenza nucleotide sequences in the database, including more than 17 100 complete genome sets. The value-added influenza data model was first extended to a separate Dengue virus (DENV) resource in 2009, again in response to NIAID-funded genome sequencing efforts (2). DENV is a mosquito-borne pathogen that is thought to infect as many as 100 million people each year worldwide (8,9), and as attempts to better understand the biology of this Flavivirus have continued (10), the number of DENV sequences in the database has grown to more than 13 000 individual nucleotide sequences. Over the past 2 years, a second mosquito-borne Flavivirus, West Nile virus (WNV), has been added to the Virus Variation Resource. WNV is found throughout Africa, the Middle East, southern Europe, Russia, Asia and Australia and has caused 16 196 cases of human neuroinvasive disease and 1549 deaths in the USA since 1999 (11). Moreover, WNV appears endemic to the Americas, Europe and Australia (12), and evidence supports continued WNV evolution in North America over the past decade, underscoring human health concerns (13,14). There are currently 2400 WNV nucleotide sequences in the database.
The design goal of the new Virus Variation construct is to create a resource with a single, value added data model but enough flexibility to accommodate a broad range of viruses. This approach attempts to maintain historic functionalities while leveraging shared backend support to facilitate more efficient data flow. The Virus Variation database loading pipeline is central to the new approach and is responsible for the standardized annotation of incoming nucleotide sequences, automated parsing of metadata terms from GenBank records and mapping parsed terms to a controlled vocabulary. All nucleotide sequences included in Virus Variation are processed in a similar manner. New sequences are retrieved from GenBank, and processed by a standardized set of database loading pipelines. The influenza database loading pipeline simply extracts the existent annotation from INSDC records and loads it into the database. Influenza coding regions and other sequence features can be systematically annotated prior to INSDC database submission using the Flu Annotation Pipeline (FLAN) (15). This pipeline is publicly available from the Virus Variation web pages and first types (or genotypes) sequences by BLAST alignment to a set of virus-specific nucleotide references and then annotates protein coding regions using reference protein sequence sets specific to each virus subtype (15). Specifically, FLAN maintains a set of reference nucleotide sequences that are used to classify input influenza sequences by type (A, B or C), identify specific segments (1 through 8) and—when applicable—subtype influenza A hemagglutinin and neuraminidase segments (reference sequences available at ftp://ftp.ncbi.nih.gov/genomes/INFLUENZA/ANNOTATION/blastDB.fasta). Corresponding reference protein sets are then aligned to translated input sequences and protein coding regions predicted using the ‘Protein to nucleotide alignment tool’ (ProSplign) (15). 
FLAN is continually updated to support community needs. For example, the influenza virus annotation tool now supports influenza C sequences in addition to A and B, and can predict the recently discovered PA-X protein coding sequences. The annotation pipelines for DENV and WNV are integrated into the database loading pipeline and are very similar to FLAN. The reference nucleotide records and corresponding protein sets used to annotate the two viruses are listed in Table 1. Currently, protein coding regions are extracted from INSDC records, and mature peptides are then annotated by the pipeline and stored in the database. However, this dependency on submitted protein annotations has several shortcomings, not least of which is the inability to update protein annotations in response to evolving biological knowledge, and we are in the process of moving to a fully de novo annotation model. In the new model, all features will be annotated directly by internal pipelines using an improved version of the NCBI ‘Protein to nucleotide alignment tool’ (ProSplign 2). Annotations will also be updated on a regular basis, and consistent annotation will be maintained irrespective of sequence submission dates or changing annotation standards.
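The typing step described for FLAN, in which an input sequence is classified by BLAST alignment to a set of reference nucleotide sequences, can be sketched as follows. This is not FLAN's code: it assumes that BLAST has already been run with the standard 12-column tabular output (`-outfmt 6`), and the reference identifiers and the `REFERENCE_LABELS` table are hypothetical stand-ins for the real reference set.

```python
# Hypothetical labels for reference sequences: (type, segment, subtype).
# The identifiers are invented for illustration and are not FLAN's reference names.
REFERENCE_LABELS = {
    "refA_seg4_H1": ("A", 4, "H1"),
    "refA_seg4_H3": ("A", 4, "H3"),
    "refB_seg4": ("B", 4, None),
}


def best_hits(blast_tabular_text):
    """Return the highest-bitscore subject id for each query.

    Expects BLAST tabular output with the default 12 columns, where
    column 1 is the query id, column 2 the subject id and column 12 the bit score.
    """
    best = {}
    for line in blast_tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        query_id, subject_id, bitscore = columns[0], columns[1], float(columns[11])
        if query_id not in best or bitscore > best[query_id][1]:
            best[query_id] = (subject_id, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def classify(blast_tabular_text):
    """Assign each query the label of its best-scoring reference (None if unlabeled)."""
    return {
        query: REFERENCE_LABELS.get(subject)
        for query, subject in best_hits(blast_tabular_text).items()
    }
```

Choosing the single best-scoring reference is the simplest possible rule; a production pipeline would presumably also apply identity and coverage thresholds before accepting a type assignment.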

Table 1.

DENV and WNV reference sequences

	DENV				WNV
Nucleotide	NC_001477	NC_001474	NC_001475	NC_002640	NC_009942	NC_001563
Polyprotein	NP_059433	NP_056776	YP_001621843	NP_073286	YP_001527877	NP_041724
Anchored capsid (C) protein	NP_722457	NP_739581	YP_001531165	NP_740314	YP_005097850	NP_776011
Membrane (prM) glycoprotein precursor	NP_733807	NP_739582	YP_001531166	NP_740315	YP_001527879	NP_776012
Envelope (E) protein	NP_722460	NP_739583	YP_001531168	NP_740317	YP_001527880	NP_776014
Non-structural (NS1) protein 1	NP_722461	NP_739584	YP_001531169	NP_740318	YP_001527881	NP_776015
Non-structural (NS2A) protein 2A	NP_733808	NP_739585	YP_001531170	NP_740319	YP_001527882	NP_776016
Non-structural (NS2B) protein 2B	NP_733809	NP_739586	YP_001531171	NP_740320	YP_001527883	NP_776017
Non-structural (NS3) protein 3	NP_722463	NP_739587	YP_001531172	NP_740321	YP_001527884	NP_776018
Non-structural (NS4A) protein 4A	NP_733810	NP_739588	YP_001531173	NP_740322	YP_001527885	NP_776019
2K protein	NP_722467	NP_739593	YP_001531174	NP_740323	YP_001527885	NP_776020
Non-structural (NS4B) protein 4B	NP_733811	NP_739589	YP_001531175	NP_740324	YP_001527886	NP_776021
Non-structural (NS5) protein 5	NP_722465	NP_739590	YP_001531176	NP_740325	YP_001527887	NP_776022

Accessions for nucleotide and protein sequences used in the Virus Variation annotation pipeline are shown. The protein names used on the Virus Variation search pages are shown within parentheses.

NEW FEATURES

A number of new features have been added to Virus Variation since the last published description of the resource. The resource web pages have been updated, including the database search interface (Figure 1). This interface now supports searches using multiple GenBank accessions as well as keyword searches for sequence patterns, strain names/definition lines and influenza drug resistance mutations. Search menus have been updated and support multiple selections, so several proteins, hosts or geographic locations can be added to a single set of search criteria. In the influenza query page there is now the option to select sequences from northern temperate, southern temperate and tropical regions in addition to the country and continent selections used throughout the resource. Searches can also be limited by both collection date and GenBank release date including year, month and day. Several sets of virus-specific filters have been added to the search interfaces to enhance usability. The ‘Full-length genomes only’ filter used in the DENV and WNV search interfaces limits retrieved mature protein sequences to those that are part of a complete polyprotein coding sequence (all mature proteins). On the influenza page the ‘Full-length only’ filter limits searches to protein or nucleotide sequences that include a complete coding region, from start codon to stop. A second, ‘Full-length plus’ filter restricts the search to both full-length protein or nucleotide sequences and nearly complete sequences missing only the start and/or stop codons. Complete, nearly complete and partial sequences are marked in search results. A set of ‘Additional filters’ has been added to the influenza query page, and users can now limit searches to those sequences that have a specified day and/or month in the collection date field.
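The distinction between complete, nearly complete and partial sequences behind the ‘Full-length only’ and ‘Full-length plus’ filters can be sketched in a few lines. The sketch assumes that each sequence has already been aligned to a reference coding region and is described by 1-based start and end coordinates on that reference; the record keys (`start`, `end`, `ref_len`) and the exact 3-nucleotide tolerance are assumptions made for illustration, not the resource's documented rules.

```python
def classify_completeness(aligned_start, aligned_end, reference_cds_length):
    """Classify coverage of a coding region in 1-based reference CDS coordinates.

    Assumption for illustration: a sequence missing at most the 3-nt start codon
    and/or the 3-nt stop codon counts as 'nearly complete'.
    """
    if aligned_start == 1 and aligned_end == reference_cds_length:
        return "complete"
    if aligned_start <= 4 and aligned_end >= reference_cds_length - 3:
        return "nearly complete"
    return "partial"


def apply_filter(records, filter_name):
    """Keep only the records allowed by the named completeness filter."""
    allowed = {
        "Full-length only": {"complete"},
        "Full-length plus": {"complete", "nearly complete"},
    }[filter_name]
    return [
        record
        for record in records
        if classify_completeness(record["start"], record["end"], record["ref_len"]) in allowed
    ]
```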
Users can also ‘Include’, ‘Exclude’ or ‘Only’ retrieve sequences from WHO recommended ‘Vaccine strains’, pandemic (H1N1) 2009 viruses, sequence sets with ‘Mixed subtypes’ and ‘Lineage defining strains’ of well-defined lineages/clades. Currently, virus prototypes include those for the Victoria and Yamagata lineages of influenza B viruses, and the H5N1 and H9N2 subtypes of influenza A viruses. The ‘Required segments’ filter limits retrieved sequences to those where all the selected segments of the same virus isolate exist in the database. The Virus Variation search interface allows the user to build complicated datasets containing sequences retrieved using different criteria. To do this, the results from each individual database search are added to the ‘Query builder’ section at the bottom of the search interface (Figure 1), and one or more search sets are then selected for display on the Virus Variation search result page or for direct download. The search result page displays sequences retrieved from search sets along with several sortable metadata columns and supports selection of individual sequences for download or further analysis (Figure 2). Identical sequences can be collapsed in the search results and represented by the oldest sequence in the group. Results can be downloaded as a table in XML, CSV or tab-delimited formats, or users can download a GenBank accession list or FASTA file of selected sequences. The definition line of FASTA sequences can now be customized in the downloaded files, and users can replace original GenBank definition lines with a number of fields including host, country, date, serotype, patient age or gender, viral mutations and CDS location.
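Two of the download features described above, collapsing identical sequences to the oldest member of each group and building a custom FASTA definition line from metadata fields, can be sketched as follows. The record layout (`accession`, `sequence`, `date`, `host`, `country`), the `|` separator and the use of a single `date` field to decide which sequence is oldest are all assumptions made for this example.

```python
from collections import defaultdict


def collapse_identical(records):
    """Group records with identical sequences and keep the oldest one per group."""
    groups = defaultdict(list)
    for record in records:
        # Compare sequences case-insensitively.
        groups[record["sequence"].upper()].append(record)
    representatives = []
    for members in groups.values():
        # ISO 8601 date strings (YYYY-MM-DD) sort chronologically as plain text.
        representatives.append(min(members, key=lambda r: r["date"]))
    return representatives


def format_fasta(record, fields):
    """Build a FASTA entry whose definition line is made of the chosen fields."""
    defline = "|".join(str(record.get(field, "")) for field in fields)
    return f">{defline}\n{record['sequence']}\n"
```

For example, `format_fasta(record, ["accession", "host", "country"])` produces a definition line made only of those three fields, in that order.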

Figure 2.

Virus Variation search results interface. Results from a DENV database query are shown. The display includes a number of retrieved sequence descriptors including Accession, Length, Type, Disease, Genome region, Host, Country, Collection date and Virus name. Sequences can be selected and downloaded in a number of formats or used to construct an alignment or phylogenetic tree.

The resource sequence analysis tool set has been improved to enhance visualization of large datasets and facilitate discovery activities. A new multiple sequence alignment viewer (Figure 3) has been integrated into the DENV and WNV resources and will soon be available for influenza virus. This tool is based on the NCBI Genome Workbench multiple sequence alignment viewer and includes a variation histogram above the alignment as well as a feature table that highlights mature protein boundaries and other important sequence features. There are a number of usability features integrated into the viewer, such as selectable alignment scoring methods for individual nucleotides/amino acid residues, link outs to associated GenBank records and a selectable alignment anchor sequence, which can be either the consensus or any sequence in the alignment. Alignments displayed in the viewer can also be downloaded in FASTA, Clustal, Phylip and Nexus formats for use locally or with other tools. The Virus Variation tree builder tool (16) has also been updated for all viruses, and GenBank accession numbers can be downloaded through the tree builder tool by selecting the branch of interest on the tree.

Figure 3.

Virus Variation multi-sequence alignment viewer. The results from a DENV database query were aligned and displayed in the Virus Variation multi-sequence alignment viewer. The top section of the alignment viewer includes a histogram that displays sequence and coverage variation across the alignment, a second histogram that plots the frequency of sequence differences with a shading scheme and highlights insertions and deletions with gaps and a feature table where protein names and other sequence feature identifiers are displayed. The alignment position is indicated above the histogram, and the region displayed in the lower section is highlighted by a gray box. The lower section displays the highlighted region in greater detail by default, but the magnification can be decreased or increased as desired by the user. Alignments are anchored to the consensus sequence by default, but any sequence can be selected as an anchor. Sequences identical to the consensus can be displayed as individual nucleotides or amino acids or replaced with dots—highlighting variations from the consensus.
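The dot display described in the caption, in which residues identical to the anchor sequence are replaced with dots so that only differences stand out, can be sketched as follows. This is an illustration rather than the viewer's implementation; it assumes that all rows of the alignment have the same length and uses a simple majority rule per column as the consensus.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Return the most common character in each alignment column."""
    columns = zip(*aligned_sequences)
    return "".join(Counter(column).most_common(1)[0][0] for column in columns)


def mask_identical(sequence, anchor):
    """Replace positions that match the anchor with dots, keeping only differences."""
    return "".join(
        "." if residue == anchor_residue else residue
        for residue, anchor_residue in zip(sequence, anchor)
    )
```

Any row of the alignment can be passed as `anchor` instead of the consensus, which mirrors the selectable anchor sequence in the viewer.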

FUTURE DIRECTIONS

The long term plan is to increase the coverage of virus sequences in the Virus Variation Resource. The flexibility of the resource should support a number of diverse viral pathogens and provide consistently annotated sequence datasets with standardized isolate descriptors. This will require continued tweaking of metadata parsing strategies and development of new virus-specific sequence annotation modules. As these annotation modules are added to our core pipeline for use by the resource, they will be made publicly available. We will also explore approaches to increase user outreach and leverage community knowledge to improve data curation, reference sequence assignment and resource usability.

FUNDING

This research was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. Funding for open access charge: Intramural Research Program of the National Institutes of Health, National Library of Medicine.

Conflict of interest statement. None declared.

16 in total

1. The influenza virus resource at the National Center for Biotechnology Information.

Authors: Yiming Bao; Pavel Bolotov; Dmitry Dernovoy; Boris Kiryutin; Leonid Zaslavsky; Tatiana Tatusova; Jim Ostell; David Lipman
Journal: J Virol Date: 2007-10-17 Impact factor: 5.103

Review 2. West Nile virus population genetics and evolution.

Authors: Kendra N Pesko; Gregory D Ebel
Journal: Infect Genet Evol Date: 2011-12-27 Impact factor: 3.342

Review 3. West Nile virus: review of the literature.

Authors: Lyle R Petersen; Aaron C Brault; Roger S Nasci
Journal: JAMA Date: 2013-07-17 Impact factor: 56.272

4. Visualization of large influenza virus sequence datasets using adaptively aggregated trees with sampling-based subscale representation.

Authors: Leonid Zaslavsky; Yiming Bao; Tatiana A Tatusova
Journal: BMC Bioinformatics Date: 2008-05-16 Impact factor: 3.169

Review 5. Dengue viruses - an overview.

Authors: Anne Tuiskunen Bäck; Ake Lundkvist
Journal: Infect Ecol Epidemiol Date: 2013-08-30

6. Challenges of global surveillance during an influenza pandemic.

Authors: S Briand; A Mounts; M Chamberland
Journal: Public Health Date: 2011-04-27 Impact factor: 2.427

7. The International Nucleotide Sequence Database Collaboration.

Authors: Yasukazu Nakamura; Guy Cochrane; Ilene Karsch-Mizrachi
Journal: Nucleic Acids Res Date: 2012-11-24 Impact factor: 16.971

8. The global distribution and burden of dengue.

Authors: Samir Bhatt; Peter W Gething; Oliver J Brady; Jane P Messina; Andrew W Farlow; Catherine L Moyes; John M Drake; John S Brownstein; Anne G Hoen; Osman Sankoh; Monica F Myers; Dylan B George; Thomas Jaenisch; G R William Wint; Cameron P Simmons; Thomas W Scott; Jeremy J Farrar; Simon I Hay
Journal: Nature Date: 2013-04-07 Impact factor: 49.962

9. GenBank.

Authors: Dennis A Benson; Mark Cavanaugh; Karen Clark; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971

Review 10. Molecular epidemiology and evolution of West Nile virus in North America.

Authors: Brian R Mann; Allison R McMullen; Daniele M Swetnam; Alan D T Barrett
Journal: Int J Environ Res Public Health Date: 2013-10-16 Impact factor: 3.390
