Eneida L Hatcher1, Sergey A Zhdanov1, Yiming Bao1, Olga Blinkova1, Eric P Nawrocki1, Yuri Ostapchuck1, Alejandro A Schäffer1, J Rodney Brister2.
Abstract
The Virus Variation Resource is a value-added viral sequence data resource hosted by the National Center for Biotechnology Information. The resource is located at http://www.ncbi.nlm.nih.gov/genome/viruses/variation/ and includes modules for seven viral groups: influenza virus, Dengue virus, West Nile virus, Ebolavirus, MERS coronavirus, Rotavirus A and Zika virus. Each module is supported by pipelines that scan newly released GenBank records, annotate genes and proteins, parse sample descriptors and map them to controlled vocabulary. These processes in turn support a purpose-built search interface where users can select sequences based on standardized gene, protein and metadata terms. Once sequences are selected, a suite of tools for downloading data, multi-sequence alignment and tree building supports a variety of user-directed activities. This manuscript describes a series of features and functionalities recently added to the Virus Variation Resource. Published by Oxford University Press on behalf of Nucleic Acids Research 2016. This work is written by (a) US Government employee(s) and is in the public domain in the US.
Year: 2016 PMID: 27899678 PMCID: PMC5210549 DOI: 10.1093/nar/gkw1065
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Summary of data enhancements in the Virus Variation Resource
| INSDC/GenBank | Virus Variation Resource |
|---|---|
| Inconsistent and/or out of date gene/protein names present in INSDC sequence records | Gene and protein sequences are validated and given consistent, up to date names |
| Annotation is often incomplete in INSDC sequence records, especially for mature peptides | All proteins and mature peptides annotated and possible sequence errors reported |
| Non-standardized source descriptor (metadata) vocabulary and formatting within INSDC sequence records | Source descriptors are parsed from several fields within INSDC sequence records and mapped to standardized terms with correct spelling |
| Source metadata potentially missing from INSDC sequence records | Source metadata can be added manually from literature |
| Drug resistance and/or high virulence sequence polymorphisms may not be annotated in INSDC Influenza virus sequence records | Documented drug resistance and high virulence sequence variations are detected and can be retrieved |
| Sequence searches based on metadata terms and gene/protein names can be difficult | Complex searches can be performed through a convenient user interface |
| Once sequences are retrieved, users must perform some data analysis locally or on a third party site | Selected sequences can be aligned or visualized as a tree within the resource |
| Download formats for sequences and metadata limited for some uses | Sequences can be downloaded in a variety of formats with customized metadata fields |
Publicly available sequence content of the Virus Variation Resource (as of September 1, 2016)
| Virus module | Species/Types included | Nucleotide seq. | Complete genomes | Protein seq. |
|---|---|---|---|---|
| Dengue virus | | 18 495 | 4140 | 17 635 |
| Ebolavirus | | 1849 | 1318 | 14 407 |
| Influenza virus | | 471 603 | 33 717 | 624 541 |
| MERS coronavirus | | 730 | 320 | 3269 |
| Rotavirus | | 49 186 | 1169 | 49 607 |
| West Nile virus | | 4184 | 1675 | 3678 |
| Zika virus | | 386 | 111 | 345 |
Reference sequences employed by Virus Variation
| Virus module | Reference sequences |
|---|---|
| Dengue virus | NC_001477, NC_001474, NC_001475, NC_002640 |
| Ebolavirus | NC_014372, NC_014373, NC_004161, NC_006432, NC_002549 |
| Influenza virus | References are created by Virus Variation staff as needed, and a comprehensive list is maintained here: |
| MERS coronavirus | NC_019843 |
| Rotavirus | References are selected and maintained by the Rotavirus Classification Working Group ( |
| West Nile virus | NC_009942, NC_001563 |
| Zika virus | NC_012532 |
Number of GenBank sequences where non-standard metadata terms were mapped to standardized vocabulary
| Virus module | Total sequences processed | Isolation country | Isolation host | Isolation source |
|---|---|---|---|---|
| Dengue virus | 18 909 | 1321 | 6361 | 7402 |
| Ebolavirus | 1849 | 598 | 56 | 588 |
| Influenza virus | 472 050 | 267 955 | 380 384 | n.a. |
| MERS coronavirus | 730 | 5 | 95 | 327 |
| Rotavirus | 49 186 | 15 823 | 17 166 | 19 009 |
| West Nile virus | 4184 | 2143 | 1253 | 1329 |
| Zika virus | 386 | 86 | 127 | 148 |
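The mapping counted in this table can be pictured as a lookup from free-text descriptors to standardized terms. The following is only a minimal sketch of that idea, not the resource's pipeline: the synonym table, the function names `normalize_country` and `count_mapped`, and the rule of keeping the text before a colon are all assumptions made for illustration.

```python
import re

# Hypothetical synonym table; the actual controlled vocabulary is not given in the text.
COUNTRY_SYNONYMS = {
    "usa": "USA",
    "united states": "USA",
    "united states of america": "USA",
    "viet nam": "Vietnam",
    "vietnam": "Vietnam",
}


def normalize_country(raw):
    """Map a free-text country descriptor to a standardized term, or None if unknown."""
    # Assumption: descriptors may look like "USA: New York"; keep the part before the colon.
    head = raw.split(":", 1)[0]
    # Collapse runs of whitespace and ignore case before the lookup.
    key = re.sub(r"\s+", " ", head).strip().lower()
    return COUNTRY_SYNONYMS.get(key)


def count_mapped(raw_values):
    """Count descriptors whose standardized term differs from the raw text."""
    count = 0
    for raw in raw_values:
        standardized = normalize_country(raw)
        if standardized is not None and standardized != raw:
            count += 1
    return count
```

Under these assumptions, a descriptor that already matches its standardized term is not counted as mapped, and an unrecognized descriptor is left for manual review rather than guessed.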
Figure 1. Virus Variation Resource search interface page. (A) The Ebolavirus module search interface before any filters are selected or hidden elements opened. (B) The Ebolavirus module search interface with all elements opened and several example searches displayed in the query builder. The search page is divided into three elements. The first element supports selection of protein or nucleotide sequences based on standardized metadata terms generated by the processing pipelines described in the text. Menus support filtering of sequences by gene or protein name, host, isolation country and isolation source, and collection and release date ranges can be set with text boxes. Additional filters are accessible through a drop-down arrow that reveals options for environmental or laboratory isolates, vaccine strains, keyword or sequence string searches, and optional menus tailored to specific viruses. The second element supports searches based on GenBank accessions, either entered in the text box or uploaded as a text file of accessions. The third element includes the query builder, where the number of sequences retrieved from individual searches can be viewed by clicking one of the ‘Add query’ buttons. When multiple searches are added to the query builder, the total number of unique sequence records across them is also displayed. A checkbox allows identical sequences to be collapsed and represented by the oldest sequence in the results table. Clicking the ‘Show results’ button opens a separate browser tab and displays, in the results interface, all of the sequences meeting the criteria of each checked query.
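Two behaviours in this caption, collapsing identical sequences onto the oldest record and counting unique records across several queries, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, "oldest" is assumed to mean earliest release date, and identity is assumed to mean an exact string match.

```python
def collapse_identical(records):
    """Group records by exact sequence; represent each group by its oldest record.

    records: iterable of (accession, release_date, sequence) tuples.
    Returns {representative_accession: [accessions of the other identical records]}.
    """
    groups = {}
    for accession, released, sequence in records:
        groups.setdefault(sequence, []).append((released, accession))

    collapsed = {}
    for members in groups.values():
        members.sort()  # earliest release date first; accession breaks ties
        representative = members[0][1]
        collapsed[representative] = [acc for _, acc in members[1:]]
    return collapsed


def total_unique(queries):
    """Count distinct accessions across the result sets of several queries."""
    return len(set().union(*queries))
```

Keeping the collapsed accessions alongside each representative mirrors the expandable ‘Identical sequences’ column: nothing is discarded, only hidden behind one row.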
Figure 2. Virus Variation Resource results interface page. The results interface shows the search criteria at the top of the page and a table of retrieved sequences below. A row of functions directly above the table supports a number of actions. For example, users can choose the visible columns in the results table using the ‘Select columns’ link, or quickly display a multiple sequence alignment of selected sequences using the ‘Build sequence alignment’ button. There is also an option to customize sequence labels before downloading sequences or building trees. Individual GenBank or BioSample records listed in the table can be reviewed by clicking the hyperlinked accessions. If identical sequences were collapsed, they can be expanded to view individual accessions by clicking the blue arrow in the ‘Identical sequences’ column.
Figure 3. Virus Variation Resource tree and multi-sequence alignment displays. (A) A sample tree is shown depicting the use of standardized metadata terms as sequence labels. The tree was built from 31 West Nile virus complete polyprotein sequences collected since 2013. Sequence labels are based on GenBank accession, host, country of isolation and isolation date. Left clicking a node highlights the lineage, and hovering over a node with the cursor displays a menu of descriptors for that sample, including the GenBank accession and available standardized metadata terms for host, country, isolation source, etc. The menu also includes a function to reroot the tree around that sequence. (B) A multi-sequence alignment is shown for the same 31 West Nile virus polyprotein sequences. Individual GenBank accessions are listed to the left of the sequences. Left clicking an accession displays a menu that includes the standardized metadata label chosen in the results interface, a link to the sequence in GenBank and a function to use that sequence as an anchor for the alignment. Differences between residues in a given sequence and the consensus are highlighted in red. A histogram above the alignment shows coverage in blue and the frequency of changes in red.
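The consensus comparison described for panel B can be illustrated with a small sketch. It assumes the sequences are already aligned to equal length, takes the consensus as the most common residue in each column, and treats gap characters like any other symbol; the function names are hypothetical and this is not the viewer's actual implementation.

```python
from collections import Counter


def consensus(aligned):
    """Return the most common residue in each column of an alignment."""
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))


def differences(seq, cons):
    """Return 0-based positions where a sequence differs from the consensus."""
    return [i for i, (a, b) in enumerate(zip(seq, cons)) if a != b]


def change_frequency(aligned, cons):
    """Return, per column, the fraction of sequences that differ from the consensus."""
    n = len(aligned)
    return [sum(seq[i] != c for seq in aligned) / n for i, c in enumerate(cons)]
```

The positions from `differences` correspond to the residues that would be highlighted for one sequence, and `change_frequency` gives the kind of per-column values a change histogram could plot.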