Ryan C Shean, Negar Makhsous, Graham D Stoddard, Michelle J Lin, Alexander L Greninger.
Abstract
BACKGROUND: With sequencing technologies becoming cheaper and easier to use, more groups are able to obtain whole genome sequences of viruses of public health and scientific importance. Submission of genomic data to NCBI GenBank is a requirement prior to publication and plays a critical role in making scientific data publicly available. GenBank currently has automatic prokaryotic and eukaryotic genome annotation pipelines but has no viral annotation pipeline beyond influenza virus. Annotation and submission of viral genome sequence is a non-trivial task, especially for groups that do not routinely interact with GenBank for data submissions.
Keywords: Data submission; GenBank; NCBI; VAPiD; Viral annotation; Viral genomics; Virus sequence
Year: 2019 PMID: 30674273 PMCID: PMC6343335 DOI: 10.1186/s12859-019-2606-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Example usage of VAPiD. The two required files are shown as genome.fasta and author_info.sbt. The genome.fasta file contains all of the viral genomes you wish to submit, named as you want them to appear on GenBank. In the example code provided in the GitHub repository, this file is called example.fasta. The author_info.sbt file is an NCBI-specific file for attaching sequence author names to Sequin files and is a required part of properly submitting sequences to NCBI. This file can be generated at (https://submit.ncbi.nlm.nih.gov/genbank/template/submission/). The first optional argument is a comma-separated file in which you can include all relevant metadata. You can create additional columns here so long as they correspond to NCBI-approved sequence metadata. A list of modifiers and their formatting requirements can be found at (https://www.ncbi.nlm.nih.gov/Sequin/modifiers.html). Note that FASTA sequence names must be identical to the names in the optional metadata sheet. Alternatively, the metadata sheet can be omitted, in which case VAPiD will prompt for strain name, collection-date, country, and coverage data at runtime. The second optional argument is the location of a local BLASTn database, which forces VAPiD to use the specified database instead of the included database. The last optional argument forces VAPiD to send an online search query to NCBI's NT database
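The requirement that FASTA sequence names match the names in the metadata sheet can be checked before running the pipeline. The following is a minimal sketch of such a check and is not part of VAPiD. The column name `strain` and the example values are assumptions made for illustration; the helper names are hypothetical.

```python
import csv
import io


def fasta_names(fasta_text):
    """Return sequence names taken from FASTA header lines (text after '>')."""
    return [line[1:].strip() for line in fasta_text.splitlines()
            if line.startswith(">")]


def metadata_names(csv_text, name_column="strain"):
    """Return the values of the name column of a comma-separated sheet.

    The column name 'strain' is an assumption, not a documented requirement.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[name_column].strip() for row in reader]


def name_mismatches(fasta_text, csv_text, name_column="strain"):
    """Return (names only in the FASTA, names only in the metadata sheet)."""
    fasta = set(fasta_names(fasta_text))
    meta = set(metadata_names(csv_text, name_column))
    return sorted(fasta - meta), sorted(meta - fasta)


# Illustrative inputs only; the names and values are made up.
example_fasta = ">sample_A\nACGT\n>sample_B\nGGCC\n"
example_csv = (
    "strain,collection-date,country,coverage\n"
    "sample_A,2018,USA,100\n"
    "sample_C,2018,USA,50\n"
)

only_in_fasta, only_in_sheet = name_mismatches(example_fasta, example_csv)
print("In FASTA but not in sheet:", only_in_fasta)
print("In sheet but not in FASTA:", only_in_sheet)
```

Both lists being empty would indicate that every sequence has a metadata row and every row has a sequence.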
Fig. 2 General design and information flow of VAPiD. First, the provided sequences are used as queries for a local BLAST search (default) or an online BLASTn search. After results have been returned, a reference annotation is downloaded; if a specific reference accession number is given, that reference is downloaded instead. Next, the original FASTA file is aligned with the reference FASTA, and the resulting alignment is used to map the reference annotations onto the new FASTA. Custom code then runs through the file and handles RNA editing, ribosomal slippage, and splicing. The finalized annotations are passed to NCBI's tbl2asn along with the author information, generating Sequin files as well as .gbk files, which can be used to manually verify the accuracy of the new annotations. Quality-checked .sqn files can be emailed directly to GenBank
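The annotation-transfer step described in the caption above relies on translating reference coordinates into coordinates on the new sequence through a pairwise alignment. The sketch below illustrates that general idea under simple assumptions (two pre-aligned strings of equal length, `-` as the gap character, 1-based positions). It is not VAPiD's actual code, and the function names are hypothetical.

```python
def build_ref_to_query_map(aligned_ref, aligned_query, gap="-"):
    """Map 1-based reference positions to 1-based query positions.

    Reference positions that align to a gap in the query map to None.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    mapping = {}
    ref_pos = 0
    query_pos = 0
    for ref_char, query_char in zip(aligned_ref, aligned_query):
        if query_char != gap:
            query_pos += 1
        if ref_char != gap:
            ref_pos += 1
            mapping[ref_pos] = query_pos if query_char != gap else None
    return mapping


def transfer_feature(start, end, mapping):
    """Transfer a (start, end) reference interval onto the query.

    Returns None when either boundary has no counterpart in the query.
    """
    new_start = mapping.get(start)
    new_end = mapping.get(end)
    if new_start is None or new_end is None:
        return None
    return new_start, new_end


# Toy alignment: the query has one inserted base and one deleted base.
aligned_ref = "ATG-CCTAA"
aligned_query = "ATGGCC-AA"
mapping = build_ref_to_query_map(aligned_ref, aligned_query)

print(transfer_feature(1, 8, mapping))  # whole reference interval
print(transfer_feature(4, 5, mapping))  # interval downstream of the insertion
print(transfer_feature(1, 6, mapping))  # end falls on a base deleted in the query
```

A real pipeline would also need to decide how to handle boundaries that fall in gaps (for example by shifting to the nearest aligned base); this sketch simply reports such features as untransferable.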
Attribute comparison of VAPiD, VIGOR, and viral-ngs annotation pipelines
| Annotator | Input | Output | Database options | Ribosomal slippage | RNA editing | Batch submission | NCBI submission | Features transferred | Operating systems | URL |
|---|---|---|---|---|---|---|---|---|---|---|
| VAPiD | FASTA, author information, sequence metadata | .tbl annotation, .gbf/.sqn complete records | NCBI nt, RefSeq, custom database, specified reference | Yes | Yes | Yes | Yes | CDS | Windows, Mac, Linux | |
| VIGOR | FASTA | .tbl annotation | viral databases | Yes | Yes | Yes | No | All | Mac, Linux | |
| viral-ngs | FASTA | .tbl annotation | specified reference | Yes | No | Only same viruses | No | All | Mac, Linux, Windows | |