Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki.
Abstract
BACKGROUND: GenBank contains over 3 million viral sequences. The National Center for Biotechnology Information (NCBI) previously made available a tool for validating and annotating influenza virus sequences that is used to check submissions to GenBank. Before this project, there was no analogous tool in use for non-influenza viral sequence submissions.
Keywords: Alignment; Annotation; Virus; ncRNA
Year: 2020 PMID: 32448124 PMCID: PMC7245624 DOI: 10.1186/s12859-020-3537-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Viruses with the highest number of sequences in GenBank as of October 10, 2019
| species | #seqs | family |
|---|---|---|
| HIV-1 | 850,115 | |
| Influenza A virus | 684,026 | |
| Hepacivirus C | 244,533 | |
| Hepatitis B virus | 114,306 | |
| Influenza B virus | 100,373 | |
| Rotavirus A | 73,375 | |
| SIV | 44,374 | |
| Norovirus (Norwalk virus) | 40,925 | |
| Enterovirus A | 31,478 | |
| PRRSV | 29,081 | |
| Dengue virus | 28,564 | |
| Human orthopneumovirus | 24,384 | |
| Enterovirus B | 23,865 | |
| Rabies lyssavirus | 23,771 | |
| West Nile virus | 21,563 | |
| Measles morbillivirus | 17,233 |
The numbers of sequences for the segmented viruses Influenza and Rotavirus are sums over all of their segments
Software packages and libraries used within VADR
| software and website | purpose in VADR |
|---|---|
| Bio-Easel v0.09 github.com/nawrockie/Bio-Easel | sequence alignment handling and sequence utility functions |
| Sequip v0.03 github.com/nawrockie/sequip | option handling, output file handling and other utilities |
| Infernal v1.1.3 github.com/EddyRivasLab/infernal | build and use profile HMMs and CMs to classify, validate, align and annotate sequences |
| BLAST+ v2.9.0 ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.9.0 | build BLAST databases and validate CDS/protein predictions |
Fig. 1 VADR workflow schematic illustrating uses of the two main VADR scripts. v-build.pl can be used once to build a single model or repeatedly to build a library of models. v-annotate.pl can be used with a model or model library to validate and annotate input sequences
Attributes of the 43 types of VADR alerts organized by the stage in which they are detected
| code | S/F | error message | description |
|---|---|---|---|
| noannotn ∗ | S | NO_ANNOTATION | no significant similarity detected |
| revcompl ∗ | S | REVCOMPLEM | sequence appears to be reverse complemented |
| incsbgrp | S | INCORRECT_SPECIFIED_SUBGROUP | score difference too large between best overall model and best specified subgroup model |
| incgroup | S | INCORRECT_SPECIFIED_GROUP | score difference too large between best overall model and best specified group model |
| qstsbgrp | S | QUESTIONABLE_SPECIFIED_SUBGROUP | best overall model is not from specified subgroup |
| qstgroup | S | QUESTIONABLE_SPECIFIED_GROUP | best overall model is not from specified group |
| indfclas | S | INDEFINITE_CLASSIFICATION | low score difference between best overall model and second best model (not in best model’s subgroup) |
| lowscore | S | LOW_SCORE | score to homology model below low threshold |
| lowcovrg | S | LOW_COVERAGE | low sequence fraction with significant similarity to homology model |
| dupregin | S | DUPLICATE_REGIONS | similarity to a model region occurs more than once |
| discontn | S | DISCONTINUOUS_SIMILARITY | not all hits are in the same order in the sequence and the homology model |
| indfstrn | S | INDEFINITE_STRAND | significant similarity detected on both strands |
| lowsim5s | S | LOW_SIMILARITY_START | significant similarity not detected at 5’ end of the sequence |
| lowsim3s | S | LOW_SIMILARITY_END | significant similarity not detected at 3’ end of the sequence |
| lowsimis | S | LOW_SIMILARITY | internal region without significant similarity |
| biasdseq | S | BIASED_SEQUENCE | high fraction of score attributed to biased sequence composition |
| unexdivg ∗ | S | UNEXPECTED_DIVERGENCE | sequence is too divergent to confidently assign nucleotide-based annotation |
| noftrann ∗ | S | NO_FEATURES_ANNOTATED | sequence similarity to homology model does not overlap with any features |
| mutstart | F | MUTATION_AT_START | expected start codon could not be identified |
| mutendcd | F | MUTATION_AT_END | expected stop codon could not be identified, predicted CDS stop by homology is invalid |
| mutendns | F | MUTATION_AT_END | expected stop codon could not be identified, no in-frame stop codon exists 3’ of predicted valid start codon |
| mutendex | F | MUTATION_AT_END | expected stop codon could not be identified, first in-frame stop codon exists 3’ of predicted stop position |
| unexleng | F | UNEXPECTED_LENGTH | length of complete coding (CDS or mat_peptide) feature is not a multiple of 3 |
| cdsstopn | F | CDS_HAS_STOP_CODON | in-frame stop codon exists 5’ of stop position predicted by homology to reference |
| peptrans | F | PEPTIDE_TRANSLATION_PROBLEM | mat_peptide may not be translated because its parent CDS has a problem |
| pepadjcy | F | PEPTIDE_ADJACENCY_PROBLEM | predictions of two mat_peptides expected to be adjacent are not adjacent |
| indfantn | F | INDEFINITE_ANNOTATION | nucleotide-based search identifies CDS not identified in protein-based search |
| indf5gap | F | INDEFINITE_ANNOTATION_START | alignment to homology model is a gap at 5’ boundary |
| indf5loc | F | INDEFINITE_ANNOTATION_START | alignment to homology model has low confidence at 5’ boundary |
| indf3gap | F | INDEFINITE_ANNOTATION_END | alignment to homology model is a gap at 3’ boundary |
| indf3loc | F | INDEFINITE_ANNOTATION_END | alignment to homology model has low confidence at 3’ boundary |
| lowsim5f | F | LOW_FEATURE_SIMILARITY_START | region within annotated feature at 5’ end of sequence lacks significant similarity |
| lowsim3f | F | LOW_FEATURE_SIMILARITY_END | region within annotated feature at 3’ end of sequence lacks significant similarity |
| lowsimif | F | LOW_FEATURE_SIMILARITY | region within annotated feature lacks significant similarity |
| cdsstopp | F | CDS_HAS_STOP_CODON | stop codon in protein-based alignment |
| indfantp | F | INDEFINITE_ANNOTATION | protein-based search identifies CDS not identified in nucleotide-based search |
| indf5plg | F | INDEFINITE_ANNOTATION_START | protein-based alignment extends past nucleotide-based alignment at 5’ end |
| indf5pst | F | INDEFINITE_ANNOTATION_START | protein-based alignment does not extend close enough to nucleotide-based alignment 5’ endpoint |
| indf3plg | F | INDEFINITE_ANNOTATION_END | protein-based alignment extends past nucleotide-based alignment at 3’ end |
| indf3pst | F | INDEFINITE_ANNOTATION_END | protein-based alignment does not extend close enough to nucleotide-based alignment 3’ endpoint |
| indfstrp | F | INDEFINITE_STRAND | strand mismatch between protein-based and nucleotide-based predictions |
| insertnp | F | INSERTION_OF_NT | too large of an insertion in protein-based alignment |
| deletinp | F | DELETION_OF_NT | too large of a deletion in protein-based alignment |
The S/F column indicates whether the alert applies to an entire sequence (S) or to one feature (F) in a sequence. The five non-fatal alerts (four detected in the classification stage, and one in the coverage stage) do not cause a sequence to fail and are not reported in the output feature table. Codes marked with ∗ are always fatal; all other codes can be set to fatal or non-fatal with command-line options to v-annotate.pl
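The pass/fail rule described in this note can be sketched in a few lines of Python. This is only an illustration, not VADR's implementation: the function names and the `non_fatal` parameter are hypothetical, and the only facts taken from the table are the four codes marked with ∗ as always fatal.

```python
# Codes marked with * in the alert table: these are always fatal.
ALWAYS_FATAL = frozenset({"noannotn", "revcompl", "unexdivg", "noftrann"})


def fatal_alerts(reported, non_fatal=frozenset()):
    """Return the sorted, de-duplicated alert codes that count as fatal.

    `reported` is an iterable of alert codes raised for one sequence.
    `non_fatal` is a hypothetical user configuration: the set of codes the
    user has chosen to treat as non-fatal. Always-fatal codes are removed
    from that set, because they cannot be made non-fatal.
    """
    effective_non_fatal = set(non_fatal) - ALWAYS_FATAL
    return sorted(code for code in set(reported) if code not in effective_non_fatal)


def sequence_passes(reported, non_fatal=frozenset()):
    """A sequence passes only if it has no fatal alerts."""
    return not fatal_alerts(reported, non_fatal)
```

For example, under an assumed configuration in which `lowscore` is non-fatal, `sequence_passes(["lowscore"], non_fatal={"lowscore"})` is true, while `sequence_passes(["revcompl"], non_fatal={"revcompl"})` is false because `revcompl` is always fatal.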
Summary of pass/fail outcomes for VADR on the full datasets
| dataset | VADR #pass/#fail [pass fraction] |
|---|---|
| NC | 1157/227 [0.836] |
| DC | 4171/409 [0.911] |
| NP | 29488/2702 [0.916] |
| DP | 17276/3697 [0.824] |
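The bracketed pass fractions follow directly from the pass and fail counts. A short Python check, with illustrative names that do not come from VADR, recomputes them from the figures in the table:

```python
def pass_fraction(n_pass, n_fail):
    """Fraction of sequences that pass, given pass and fail counts."""
    return n_pass / (n_pass + n_fail)


# (#pass, #fail) per dataset, copied from the table above.
RESULTS = {
    "NC": (1157, 227),
    "DC": (4171, 409),
    "NP": (29488, 2702),
    "DP": (17276, 3697),
}

for name, (n_pass, n_fail) in RESULTS.items():
    total = n_pass + n_fail
    print(f"{name}: {total} seqs, pass fraction {pass_fraction(n_pass, n_fail):.3f}")
```

The dataset sizes obtained this way (1384, 4580, 32190 and 20973 sequences) sum to 59127.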
Counts of fatal VADR alerts reported for the test datasets
| alert code | error message | NC (1384 seqs) ct(seqs) | NP (32190 seqs) ct(seqs) | DC (4580 seqs) ct(seqs) | DP (20973 seqs) ct(seqs) | total (59127 seqs) ct(seqs) |
|---|---|---|---|---|---|---|
| peptrans | PEPTIDE_TRANSLATION_PROBLEM | 516(86) | 716(535) | 1330(95) | 4051(1065) | 6613(1781) |
| noannotn | NO_ANNOTATION | - | 512(512) | 5(5) | 2236(2236) | 2753(2753) |
| indf3pst | INDEFINITE_ANNOTATION_END | 82(70) | 1059(1029) | 56(56) | 600(593) | 1797(1748) |
| indf5pst | INDEFINITE_ANNOTATION_START | 59(57) | 940(876) | 16(16) | 660(574) | 1675(1523) |
| indf3loc | INDEFINITE_ANNOTATION_END | 85(48) | 185(90) | 206(98) | 293(136) | 769(372) |
| incgroup | INCORRECT_SPECIFIED_GROUP | 19(19) | 302(302) | 30(30) | 286(286) | 637(637) |
| indf5loc | INDEFINITE_ANNOTATION_START | 19(15) | 66(35) | 222(135) | 286(144) | 593(329) |
| lowcovrg | LOW_COVERAGE | 3(3) | 217(217) | 60(60) | 279(279) | 559(559) |
| unexleng | UNEXPECTED_LENGTH | 42(34) | 66(55) | 105(49) | 318(182) | 531(320) |
| indf5gap | INDEFINITE_ANNOTATION_START | 6(3) | 23(12) | 117(100) | 220(127) | 366(242) |
| indf3gap | INDEFINITE_ANNOTATION_END | 4(2) | 83(71) | 15(14) | 237(133) | 339(220) |
| lowsim3f | LOW_FEATURE_SIMILARITY_END | - | - | 272(88) | 20(9) | 292(97) |
| cdsstopp | CDS_HAS_STOP_CODON | 7(5) | 112(111) | 15(15) | 153(153) | 287(284) |
| revcompl | REVCOMPLEM | 3(3) | 85(85) | 35(35) | 120(120) | 243(243) |
| cdsstopn | CDS_HAS_STOP_CODON | 96(93) | 72(71) | 58(58) | 5(4) | 231(226) |
| insertnp | INSERTION_OF_NT | 50(43) | 151(138) | - | 2(2) | 203(183) |
| lowsim5f | LOW_FEATURE_SIMILARITY_START | - | - | 101(101) | 79(39) | 180(140) |
| lowsim3s | LOW_SIMILARITY_END | 61(61) | 80(80) | 2(2) | 5(5) | 148(148) |
| mutstart | MUTATION_AT_START | 13(11) | 58(58) | 8(8) | 35(27) | 114(104) |
| mutendcd | MUTATION_AT_END | 52(50) | 47(46) | 6(6) | 5(4) | 110(106) |
| discontn | DISCONTINUOUS_SIMILARITY | - | 8(8) | 25(25) | 35(35) | 68(68) |
| dupregin | DUPLICATE_REGIONS | - | 6(6) | 33(33) | 25(25) | 64(64) |
| indfstrn | INDEFINITE_STRAND | 1(1) | 4(4) | 56(56) | 2(2) | 63(63) |
| deletinp | DELETION_OF_NT | 22(20) | 26(25) | - | 12(6) | 60(51) |
| lowsimif | LOW_FEATURE_SIMILARITY | - | - | 29(14) | 18(9) | 47(23) |
| indf3plg | INDEFINITE_ANNOTATION_END | 1(1) | 40(40) | - | 2(2) | 43(43) |
| indfantn | INDEFINITE_ANNOTATION | 1(1) | 23(23) | - | 18(17) | 42(41) |
| lowsim5s | LOW_SIMILARITY_START | 12(12) | - | 6(6) | 20(20) | 38(38) |
| noftrann | NO_FEATURES_ANNOTATED | - | 1(1) | - | 26(26) | 27(27) |
| indf5plg | INDEFINITE_ANNOTATION_START | - | 10(10) | - | - | 10(10) |
| indfantp | INDEFINITE_ANNOTATION | - | 3(3) | - | 6(6) | 9(9) |
| pepadjcy | PEPTIDE_ADJACENCY_PROBLEM | - | 3(3) | - | 6(6) | 9(9) |
| mutendex | MUTATION_AT_END | 2(2) | 5(5) | 1(1) | - | 8(8) |
| mutendns | MUTATION_AT_END | 1(1) | 5(5) | - | - | 6(6) |
The 34 fatal alert codes reported at least once for any test dataset are listed in descending order of total number of reports. Four fatal alert types (unexdivg, lowsimis, incsbgrp and indfstrp) were not reported for any of the four test sets and are not shown. See Table 3 for more information on alerts
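Each cell in the table has the form `ct(seqs)`, read here as the number of alert reports followed by the number of sequences with that alert in parentheses, and `-` for no reports. The total column should then equal the sum across the four datasets. A minimal Python sketch of that consistency check follows; the helper names are hypothetical.

```python
import re

# Matches cells such as "516(86)": an alert count, then a sequence count.
CELL = re.compile(r"^(\d+)\((\d+)\)$")


def parse_cell(cell):
    """Parse a 'ct(seqs)' cell into (count, seqs); '-' means no reports."""
    cell = cell.strip()
    if cell == "-":
        return (0, 0)
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def row_total(cells):
    """Sum alert counts and sequence counts over the per-dataset cells."""
    pairs = [parse_cell(c) for c in cells]
    return sum(ct for ct, _ in pairs), sum(seqs for _, seqs in pairs)


def row_is_consistent(cells, total_cell):
    """True if the per-dataset cells add up to the reported total cell."""
    return row_total(cells) == parse_cell(total_cell)
```

Applied to the peptrans row, `row_total(["516(86)", "716(535)", "1330(95)", "4051(1065)"])` gives `(6613, 1781)`, matching the total column.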
Summary of pass/fail outcomes for VADR, VAPiD and VIGOR on the 200 sequence test datasets
| dataset | VADR pass/fail | VAPiD pass/fail | VIGOR pass/fail |
|---|---|---|---|
| NC | 167/33 | 161/39 | 198/2 |
| DC | 189/11 | 196/4 | - |
| NP | 191/9 | - | 195/5 |
| DP | 163/37 | - | - |
Comparison of pass/fail outcomes for VADR and VAPiD on the 200 sequence Norovirus-Complete (NC) and Dengue-Complete (DC) test datasets
| dataset | Both pass | Both fail | VADR-pass, VAPiD-fail | VADR-fail, VAPiD-pass |
|---|---|---|---|---|
| NC | 137 | 9 | 30 | 24 |
| DC | 188 | 3 | 1 | 8 |
Comparison of pass/fail outcomes for VADR and VIGOR on the 200 sequence Norovirus-Complete (NC) and Norovirus-Partial (NP) test datasets
| dataset | Both pass | Both fail | VADR-pass, VIGOR-fail | VADR-fail, VIGOR-pass |
|---|---|---|---|---|
| NC | 167 | 2 | 0 | 31 |
| NP | 191 | 5 | 0 | 4 |
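The two pairwise comparison tables are cross-tabulations, so each tool's overall pass/fail counts on the 200-sequence datasets can be recovered by summing the appropriate cells. A small Python sketch, with hypothetical function and parameter names, makes the arithmetic explicit:

```python
def marginals(both_pass, both_fail, a_pass_b_fail, a_fail_b_pass):
    """Recover per-tool (pass, fail) counts from a 2x2 agreement table.

    Tool A is VADR in both tables; tool B is VAPiD or VIGOR.
    Returns ((A pass, A fail), (B pass, B fail)).
    """
    a_counts = (both_pass + a_pass_b_fail, both_fail + a_fail_b_pass)
    b_counts = (both_pass + a_fail_b_pass, both_fail + a_pass_b_fail)
    return a_counts, b_counts


# Norovirus-Complete (NC), VADR vs VAPiD: both pass 137, both fail 9,
# VADR-pass/VAPiD-fail 30, VADR-fail/VAPiD-pass 24.
vadr_nc, vapid_nc = marginals(137, 9, 30, 24)

# Norovirus-Complete (NC), VADR vs VIGOR: 167, 2, 0, 31.
vadr_nc_again, vigor_nc = marginals(167, 2, 0, 31)
```

Both calls give 167/33 for VADR on NC, and the second-tool counts come out as 161/39 for VAPiD and 198/2 for VIGOR, in agreement with the three-tool summary table.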