Abstract
Background: In 2020 and 2021, more than 1.5 million SARS-CoV-2 sequences were submitted to GenBank. The initial version (v1.0) of the VADR (Viral Annotation DefineR) software package, which GenBank uses to automatically validate and annotate incoming viral sequences, is too slow and memory intensive to process many thousands of SARS-CoV-2 sequences in a reasonable amount of time. Additionally, long stretches of ambiguous N nucleotides, which are common in many SARS-CoV-2 sequences, prevent VADR from accurately validating and annotating them.
Year: 2022 PMID: 35547842 PMCID: PMC9094095 DOI: 10.1101/2022.04.25.489427
Source DB: PubMed Journal: bioRxiv
Number of SARS-CoV-2 sequences released in GenBank in 2020 and 2021. Sequence counts were obtained using the NCBI Virus SARS-CoV-2 Data Hub on January 19, 2022, filtering by release date.
| month | year | #new seqs | #cumulative seqs |
|---|---|---|---|
| Jan | 2020 | 32 | 32 |
| Feb | 2020 | 58 | 90 |
| Mar | 2020 | 332 | 422 |
| Apr | 2020 | 1541 | 1963 |
| May | 2020 | 2974 | 4937 |
| Jun | 2020 | 3394 | 8331 |
| Jul | 2020 | 3604 | 11,935 |
| Aug | 2020 | 3818 | 15,753 |
| Sep | 2020 | 6731 | 22,484 |
| Oct | 2020 | 11,939 | 34,423 |
| Nov | 2020 | 4274 | 38,697 |
| Dec | 2020 | 4530 | 43,227 |
| Jan | 2021 | 8775 | 52,002 |
| Feb | 2021 | 26,078 | 78,080 |
| Mar | 2021 | 42,607 | 120,687 |
| Apr | 2021 | 97,095 | 217,782 |
| May | 2021 | 104,729 | 322,511 |
| Jun | 2021 | 46,187 | 368,698 |
| Jul | 2021 | 43,336 | 412,034 |
| Aug | 2021 | 141,958 | 553,992 |
| Sep | 2021 | 267,562 | 821,554 |
| Oct | 2021 | 239,296 | 1,060,850 |
| Nov | 2021 | 267,270 | 1,328,120 |
| Dec | 2021 | 288,771 | 1,616,891 |
Attributes of norovirus, dengue virus, and SARS-CoV-2 sequences in GenBank and VADR 1.0 performance metrics. Only sequences deposited in GenBank were considered (ENA and DDBJ sequences were not included). ‘length’: average length of all RefSeq sequences for each virus; ‘# seqs’: total number of GenBank sequences for each virus, with SARS-CoV-2 sequences limited to those with a publication date in 2020 or 2021 (norovirus and dengue virus sequences were not limited by date); ‘% seqs full length’: percentage of sequences that are full length, defined as >= 95% of the length of the shortest RefSeq sequence (minimum length 6952 nt for norovirus, 10,117 nt for dengue virus, and 28,408 nt for SARS-CoV-2); ‘% Ns’: percentage of total nucleotides in all sequences that are Ns; ‘% seqs with stretch of >= 50 Ns’: percentage of all sequences that have at least one stretch of 50 or more consecutive Ns. The remaining four rows pertain to single-threaded VADR 1.0 processing of all full length sequences: ‘average % identity’: the average of the average pairwise sequence identity in the multiple sequence alignments, one per RefSeq-based model, created by VADR; ‘seconds per sequence (VADR 1.0)’: the average running time per sequence (seconds); ‘required RAM (VADR 1.0)’: amount of RAM required; ‘total running time, CPU days (VADR 1.0)’: the total number of CPU days required. CPU times were measured as single threads on 2.2 GHz Intel Xeon processors. The lists of all norovirus and dengue virus sequences were obtained with the following Entrez nucleotide queries on January 25, 2022, and then restricted to GenBank sequences only: “Norovirus NOT chimeric AND 50:10000[slen]” and “Dengue NOT chimeric AND 50:11200[slen]”. The list of all SARS-CoV-2 sequences was obtained using the NCBI Virus SARS-CoV-2 dashboard tabular view, restricting “release date” to 2020 and 2021.
SARS-CoV-2 VADR 1.0 running time and average percent identity statistics are based on only 300 randomly selected SARS-CoV-2 sequences to limit total running time. Additional details are available in the supplementary material (https://github.com/nawrockie/vadr-sarscov2-paper-supplementary-material).
| attribute | norovirus | dengue virus | SARS-CoV-2 |
|---|---|---|---|
| length | 7575.6 | 10703.5 | 29903.0 |
| # seqs | 44,936 | 113,211 | 1,616,891 |
| % seqs full length | 5.1% | 8.4% | 99.7% |
| % Ns | 0.5% | 0.2% | 1.4% |
| % seqs with stretch of >= 50 Ns | 1.0% | 0.4% | 38.7% |
| average % identity | 81.6% | 94.4% | 99.4% |
| seconds per sequence (VADR 1.0) | 42.4 | 92.6 | 331.8 |
| required RAM (VADR 1.0) | 8Gb | 8Gb | 64Gb |
| total running time, CPU days (VADR 1.0) | 1.1 | 10.2 | 6187.6 |
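
The ‘total running time’ row appears consistent with multiplying the number of full length sequences by the average seconds per sequence. The sketch below checks that arithmetic using only the rounded values printed in the table; the function name `cpu_days` is introduced here for illustration, and because the inputs are rounded the results only approximately match the reported 1.1, 10.2, and 6187.6 CPU days.

```python
SECONDS_PER_DAY = 86_400


def cpu_days(n_seqs: int, frac_full_length: float, secs_per_seq: float) -> float:
    """Estimate single-threaded CPU days to process all full length sequences."""
    n_full_length = n_seqs * frac_full_length
    return n_full_length * secs_per_seq / SECONDS_PER_DAY


# (# seqs, fraction full length, seconds per sequence), copied from the table.
TABLE_ROWS = {
    "norovirus": (44_936, 0.051, 42.4),
    "dengue virus": (113_211, 0.084, 92.6),
    "SARS-CoV-2": (1_616_891, 0.997, 331.8),
}

for virus, row in TABLE_ROWS.items():
    print(f"{virus}: {cpu_days(*row):.1f} CPU days")
```

Note that the SARS-CoV-2 figure is an extrapolation, since the per-sequence time was measured on only 300 randomly selected sequences.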
Figure 1. Seeded alignment strategy. The input sequence is used as a blastn query against the NC_045512 RefSeq sequence and the top-scoring alignment is kept as the seed, after potentially shortening it as described in the text. The 5’ and/or 3’ regions not covered by the seed (if any), plus 100 nucleotides of flanking sequence, are aligned to the full NC_045512 sequence using glsearch, and the resulting alignments are joined with the seed to produce the final alignment.
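
The coordinate bookkeeping in the seeded alignment strategy can be sketched as follows. This is a minimal illustration, not VADR's implementation: it does not run blastn or glsearch, the function name `regions_to_realign` is hypothetical, and it assumes that the 100 flanking nucleotides are taken from the part of the input sequence already covered by the seed.

```python
from typing import List, Tuple

FLANK = 100  # nucleotides of flanking sequence, as stated in the caption


def regions_to_realign(seq_len: int, seed_start: int, seed_end: int,
                       flank: int = FLANK) -> List[Tuple[str, int, int]]:
    """Return (label, start, end) for input-sequence regions outside the seed.

    Coordinates are 1-based and inclusive. Each uncovered region is extended
    by `flank` nucleotides into the seed (an assumption of this sketch).
    """
    if not (1 <= seed_start <= seed_end <= seq_len):
        raise ValueError("seed must lie within the input sequence")
    regions = []
    if seed_start > 1:  # uncovered 5' end
        regions.append(("5'", 1, min(seq_len, seed_start - 1 + flank)))
    if seed_end < seq_len:  # uncovered 3' end
        regions.append(("3'", max(1, seed_end + 1 - flank), seq_len))
    return regions


# Illustrative values only: a 29,903 nt input whose seed spans 51..29800.
for label, start, end in regions_to_realign(29_903, 51, 29_800):
    print(f"{label} region to realign: {start}..{end}")
```

When the seed spans the whole input sequence the function returns an empty list, which corresponds to the case where no additional alignment of end regions is needed.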
Summary statistics for N-replacement and seeded alignment on ENA SARS-CoV-2 sequences. The four rightmost columns report statistics for VADR 1.4.1 processing of the subset of sequences used for testing (14,912 total, described in text). ‘seed coverage’ is defined as the percentage of the input sequence that is contained within the blastn-derived seed; a seed coverage of 100% means that no subsequence alignment with glsearch was necessary. v-annotate.pl command line options used: --cpu 8 --split --mdir
| month | year | # seqs total | # seqs tested | average # Ns per seq | average % Ns replaced | average seed % coverage | % seqs w/100% seed coverage |
|---|---|---|---|---|---|---|---|
| Jul | 2020 | 1067 | 1000 | 813.9 | 99.3% | 100.0% | 96.7% |
| Aug | 2020 | 667 | 667 | 290.1 | 96.9% | 99.0% | 96.9% |
| Sep | 2020 | 790 | 790 | 189.3 | 99.5% | 100.0% | 98.6% |
| Oct | 2020 | 517 | 517 | 26.5 | 98.5% | 100.0% | 97.5% |
| Nov | 2020 | 545 | 545 | 1108.4 | 98.6% | 99.9% | 96.1% |
| Dec | 2020 | 140 | 140 | 52.5 | 99.7% | 100.0% | 95.7% |
| Jan | 2021 | 1688 | 1000 | 4220.9 | 99.8% | 97.6% | 87.9% |
| Feb | 2021 | 253 | 253 | 3318.9 | 99.6% | 100.0% | 98.4% |
| Mar | 2021 | 2198 | 1000 | 1250.3 | 99.4% | 99.9% | 98.5% |
| Apr | 2021 | 116,569 | 1000 | 138.4 | 99.5% | 99.8% | 98.1% |
| May | 2021 | 112,618 | 1000 | 196.8 | 98.5% | 99.6% | 94.3% |
| Jun | 2021 | 200,490 | 1000 | 193.2 | 96.9% | 99.5% | 94.6% |
| Jul | 2021 | 171,287 | 1000 | 307.2 | 98.8% | 99.9% | 97.7% |
| Aug | 2021 | 116,116 | 1000 | 307.7 | 98.7% | 99.7% | 97.2% |
| Sep | 2021 | 165,371 | 1000 | 279.8 | 99.1% | 99.8% | 97.0% |
| Oct | 2021 | 157,664 | 1000 | 480.7 | 99.0% | 99.8% | 97.3% |
| Nov | 2021 | 170,434 | 1000 | 347.5 | 98.1% | 99.8% | 97.2% |
| Dec | 2021 | 188,246 | 1000 | 259.9 | 97.9% | 99.8% | 96.9% |
| all combined | | 1,406,660 | 14,912 | 711.1 | 99.3% | 99.6% | 96.4% |
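
As a small illustration of the two seed coverage columns, the sketch below computes per-sequence seed coverage from its definition in the caption and then summarizes a batch. The function names and the example lengths are hypothetical and are not taken from the VADR code or from the tested sequences.

```python
from typing import Iterable, Tuple


def seed_coverage_pct(seq_len: int, seed_len: int) -> float:
    """Percentage of the input sequence contained within the seed."""
    if seq_len <= 0 or not 0 <= seed_len <= seq_len:
        raise ValueError("need 0 <= seed_len <= seq_len and seq_len > 0")
    return 100.0 * seed_len / seq_len


def summarize(pairs: Iterable[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (average seed % coverage, % of seqs with 100% seed coverage).

    `pairs` holds (sequence length, seed length) for each sequence.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no sequences given")
    average = sum(seed_coverage_pct(n, s) for n, s in pairs) / len(pairs)
    pct_full = 100.0 * sum(1 for n, s in pairs if s == n) / len(pairs)
    return average, pct_full


# Made-up lengths for three sequences, for illustration only.
average, pct_full = summarize([(29_903, 29_903), (29_903, 29_803), (1_000, 500)])
print(f"average seed coverage: {average:.1f}%, fully covered: {pct_full:.1f}%")
```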
VADR running time on SARS-CoV-2 sequences. Running times computed for the set of 14,912 ENA sequences tested in Table 3. Columns 2, 3, and 4 indicate whether ‘seeded alignment’ (-s option in versions 1.1 and later), ‘N replacement’ (-r option in versions 1.1 and later), and ‘glsearch’ (--glsearch option in versions 1.2 and later), were enabled (‘+’) or not used (‘−’). ‘# cpus’ indicates number of CPUs that were run in parallel (--cpu
| VADR version | seeded alignment? | N replacement? | glsearch? | # cpus | required RAM | secs per seq | hours per 100K seqs | speedup vs v1.0 |
|---|---|---|---|---|---|---|---|---|
| v1.0 | − | − | − | 1 | 64 Gb | 329.91 | 9164.3 | - |
| v1.1 | + | + | − | 1 | 64 Gb | 49.35 | 1370.7 | 6.7 |
| v1.4.1 | + | + | + | 1 | 2 Gb | 2.51 | 69.8 | 131.4 |
| v1.4.1 | + | + | + | 2 | 4 Gb | 1.49 | 41.5 | 220.8 |
| v1.4.1 | + | + | + | 4 | 8 Gb | 0.65 | 18.0 | 509.9 |
| | + | + | + | | | | | |
| v1.4.1 | + | + | + | 16 | 32 Gb | 0.23 | 6.5 | 1417.9 |
| v1.4.1 | + | + | + | 32 | 64 Gb | 0.13 | 3.7 | 2462.2 |
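
The two derived columns of this table follow from ‘secs per seq’ by simple arithmetic, sketched below. The function names are introduced here for illustration. Because the table prints seconds per sequence rounded to two decimals, recomputed values match the reported ones only approximately, and the mismatch grows for the fastest configurations.

```python
SECONDS_PER_HOUR = 3_600
V1_0_SECS_PER_SEQ = 329.91  # single-CPU v1.0 baseline from the table


def hours_per_100k(secs_per_seq: float) -> float:
    """Hours needed to process 100,000 sequences at the given rate."""
    return secs_per_seq * 100_000 / SECONDS_PER_HOUR


def speedup_vs_v1_0(secs_per_seq: float) -> float:
    """Fold speedup relative to the v1.0 baseline."""
    return V1_0_SECS_PER_SEQ / secs_per_seq


# (version, # cpus, secs per seq) for a few rows copied from the table.
for version, cpus, secs in [("v1.0", 1, 329.91), ("v1.1", 1, 49.35), ("v1.4.1", 1, 2.51)]:
    print(f"{version} ({cpus} cpu): {hours_per_100k(secs):.1f} h per 100K seqs, "
          f"{speedup_vs_v1_0(secs):.1f}x vs v1.0")
```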
Figure 2. N-replacement strategy. Stretches of contiguous Ns in input sequences (orange) are identified as regions not covered by blastn alignments (gaps in second grey line) after the input sequence is used as a query against the NC_045512 RefSeq sequence. The Ns in these regions are replaced by the corresponding ‘expected’ nucleotides from NC_045512 and the resulting sequence is validated and annotated by v-annotate.pl. Importantly, the sequence deposited into GenBank is the original input sequence, including the Ns, not the sequence with the Ns replaced that was processed by v-annotate.pl.
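
The core of the N-replacement idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not VADR's actual procedure: the blastn alignment blocks are supplied by the caller rather than computed, only gaps between consecutive blocks are handled (not the sequence ends), and a gap is only filled when the missing region has the same length in the input and in the reference. The function name `replace_ns` and the toy sequences are hypothetical.

```python
from typing import List, Tuple

# (query_start, query_end, ref_start, ref_end), 1-based inclusive.
Block = Tuple[int, int, int, int]


def replace_ns(query: str, ref: str, blocks: List[Block]) -> str:
    """Replace Ns lying between aligned blocks with reference nucleotides.

    `blocks` must be sorted and non-overlapping. Non-N characters in a gap
    are left unchanged. Gaps whose lengths differ between query and
    reference are skipped (an assumption of this sketch).
    """
    out = list(query)
    for (_, q_end, _, r_end), (q_next, _, r_next, _) in zip(blocks, blocks[1:]):
        q_gap_len = q_next - q_end - 1
        r_gap_len = r_next - r_end - 1
        if q_gap_len <= 0 or q_gap_len != r_gap_len:
            continue
        for offset in range(q_gap_len):
            q_idx = q_end + offset  # 0-based index of position q_end + 1 + offset
            if out[q_idx] in "Nn":
                out[q_idx] = ref[r_end + offset]
    return "".join(out)


# Toy example: a 4 nt stretch of Ns between two aligned blocks.
reference = "ACGTACGTACGT"
original = "ACGTNNNNACGT"
processed = replace_ns(original, reference, [(1, 4, 1, 4), (9, 12, 9, 12)])
print(processed)
```

The function returns a new string and leaves `original` untouched, mirroring the point in the caption that the sequence with Ns replaced is used only for validation and annotation while the original input sequence is the one deposited.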