Darlene D Wagner1, Rachel L Marine2, Edward Ramos3, Terry Fei Fan Ng2, Christina J Castro4, Margaret Okomo-Adhiambo5, Krysten Harvey4, Gregory Doho3, Reagan Kelly3, Yatish Jain3, Roman L Tatusov1,3, Hideky Silva3, Paul A Rota2, Agha N Khan5, M Steven Oberste2.
Abstract
Next-generation sequencing (NGS) is a powerful tool for detecting and investigating viral pathogens; however, analysis and management of the enormous amounts of data generated by these technologies remain a challenge. Here, we present VPipe (the Viral NGS Analysis Pipeline and Data Management System), an automated bioinformatics pipeline optimized for whole-genome assembly of viral sequences and identification of diverse species. VPipe automates the data quality control, assembly, and contig identification steps typically performed when analyzing NGS data. Users access the pipeline through a secure web-based portal, which provides an easy-to-use interface with advanced search capabilities for reviewing results. In addition, VPipe provides a centralized system for storing and analyzing NGS data, eliminating common bottlenecks in bioinformatics analyses for public health laboratories with limited on-site computational infrastructure. The performance of VPipe was validated through the analysis of publicly available NGS data sets for viral pathogens, generating high-quality assemblies for 12 data sets. VPipe also generated assemblies with greater contiguity than similar pipelines for 41 human respiratory syncytial virus isolates and 23 SARS-CoV-2 specimens.

IMPORTANCE Computational infrastructure and bioinformatics analysis are bottlenecks in the application of NGS to viral pathogens. As of September 2021, VPipe has been used by the U.S. Centers for Disease Control and Prevention (CDC) and 12 state public health laboratories to characterize >17,500 and 1,500 clinical specimens and isolates, respectively. VPipe automates genome assembly for a wide range of viruses, including high-consequence pathogens such as SARS-CoV-2. Such automated functionality expedites public health responses to viral outbreaks and pathogen surveillance.
Keywords: automated bioinformatics pipeline; infectious disease surveillance; next-generation sequencing (NGS); viral molecular detection
Year: 2022 PMID: 35234489 PMCID: PMC8941893 DOI: 10.1128/spectrum.02564-21
Source DB: PubMed Journal: Microbiol Spectr ISSN: 2165-0497
FIG 1. VPipe standard analysis pipeline. VPipe takes raw FASTQ data generated by Illumina or Ion Torrent sequencing instruments. Raw reads are processed using the Read Quality and Trimming module prior to de novo assembly using SPAdes and detection of viral contigs via BLASTN. Analysis results are available on the VPipe user interface, accessible through the CDC OAMD portal. For SARS-CoV-2 data sets, reference-based assembly is also run in parallel with the de novo Assembly Module.
VPipe compared with previous tools for data sets of predominantly single virus species
| Row | SRA data set, accession no. | Target virus (length, kb) | Assembly features | VPipe | drVM | EDGE | VirMAP | Genome Detective |
|---|---|---|---|---|---|---|---|---|
| 1 | | Ebola virus (18.8–19.0) | Contig count | 1 | n/r | 1 | 1 | 7 |
| | | | Max. contig (bp) | 18,756 | n/r | 18,600 | 18,728 | 7,509 |
| 2 | | Bovine viral diarrhea virus 2 (12.3–12.5) | Contig count | 12 | 1 | 4 | 1 | 2 |
| | | | Max. contig (bp) | 8,726 | 12,224 | 1,602 | 11,699 | 8,481 |
| 3 | | Enterovirus D94 (7.3–7.4) | Contig count | 1 | n/r | 1 | 1 | 1 |
| | | | Max. contig (bp) | 7,457 | n/r | 670 | 7,573 | 7,320 |
| | | | PiType (VPipe) | EV-D94 | | | - | Enterovirus_D |
| 4 | | Enterovirus D70 (∼7.4) | Contig count | 1 | n/r | 1 | 1 | 1 |
| | | | Max. contig (bp) | 7,353 | n/r | 7,248 | 7,395 | 7,390 |
| | | | PiType (VPipe) | EV-D70 | | | - | Enterovirus_D |
| 5 | | Human parechovirus 3 (7.2–7.3) | Contig count | 1 | n/r | 1 | 1 | 1 |
| | | | Max. contig (bp) | 7,153 | n/r | 332 | 7,252 | 7,157 |
| | | | PiType (VPipe) | PEV-A3 | | | - | - |
| 6 | | Enterovirus A71 (7.4–7.5) | Contig count | 4 | n/r | 123 | 1 | 2 |
| | | | Max. contig (bp) | 6,952 | n/r | 832 | 7,423 | 4,919 |
| | | | PiType (VPipe) | EV-A71 | | | - | Enterovirus_A |
| 7 | | Human parechovirus 3 (7.2–7.3) | Contig count | 5 | n/r | 6 | 1 | 7 |
| | | | Max. contig (bp) | 6,273 | n/r | 404 | 7,269 | 2,220 |
| | | | PiType (VPipe) | PeV-A3 | | | - | - |
| 8 | | Coxsackievirus B5 (7.3–7.4) | Contig count | 2 | n/r | 3 | 1 | 1 |
| | | | Max. contig (bp) | 5,137 | n/r | 4,294 | 7,361 | 7,165 |
| | | | PiType (VPipe) | CV-B5 | | | - | Human echovirus 6 |
| 9 | | Human rotavirus A (longest segment ∼3.3) | Contig count | 11 | 13 | 7 | 11 | 11 |
| | | | Max. contig (bp) | 3,462 | 3,375 | 3,338 | 3,349 | 3,299 |
| 10 | | Influenza A virus (longest segment ∼2.3) | Contig count | 8 | 8 | 8 | 8 | 9 |
| | | | Max. contig (bp) | 2,340 | 2,340 | 2,370 | 2,297 | 2,285 |
| 11 | | HIV-1 (∼9.7) | Contig count | 8 | 12 | 5 | 8 | 4 |
| | | | Max. contig (bp) | 4,956 | 2,819 | 5,195 | 5,163 | 4,319 |
| | | | Contig count | 1 | Not detected | Not detected | 1 | Not detected |
| | | | Max. contig (bp) | 5,211 | | | 5,129 | |
| | | GB virus C (9.3–9.4) | Contig count | 84 | Not detected | 53 | 8 | 2 |
| | | | Max. contig (bp) | 1,612 | | 1,639 | 2,004 | 9,102 |
| 12 | | Hepatitis C virus (9.6–9.8) | Contig count | 2 | 15 | 2 | 1 | 1 |
| | | | Max. contig (bp) | 5,125 | 2,227 | 7,850 | 9,316 | 9,418 |
| | | GB virus C (9.3–9.4) | Contig count | 2 | 159 | 10 | 1 | 2 |
| | | | Max. contig (bp) | 9,013 | 825 | 5,711 | 9,276 | 9,295 |
Strain-typing by PiType shown for VPipe in rows 3 through 8. Comparable presumptive annotation in VirMAP and Genome Detective (rows 3 through 8) shown as ‘-’ unless otherwise indicated.
Approximate length of complete viral genomes (kb), or the longest segment for segmented genomes.
Matches and trimming information are shown in Tables S1–S3.
n/r, no results; only previously published results used for comparison.
Trimmed 137 bp.
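The two assembly features reported in the table, contig count and maximum contig length, can be derived directly from an assembler's FASTA output. The following is a minimal sketch, not VPipe's own code: the function names `contig_lengths` and `assembly_summary` and the toy sequences are hypothetical and chosen only for illustration.

```python
from io import StringIO


def contig_lengths(fasta_handle):
    """Return the length (bp) of every sequence in a FASTA stream."""
    lengths = []
    current = None  # running length of the record being read
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_summary(fasta_handle):
    """Summarize an assembly as contig count and longest contig (bp)."""
    lengths = contig_lengths(fasta_handle)
    return {
        "contig_count": len(lengths),
        "max_contig_bp": max(lengths, default=0),
    }


# Toy input (hypothetical contigs, not data from the study).
toy_fasta = StringIO(">contig_1\nACGTACGT\nACGT\n>contig_2\nACG\n")
summary = assembly_summary(toy_fasta)
print(summary)
```

For a real assembly, the same function could be given an open file handle, for example `open("contigs.fasta")`, where the file name is again an assumption about how the assembler names its output.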
FIG 2. Distribution of longest assembled contigs for HRSV and SARS-CoV-2 clinical data sets. (A) Bar plots indicating the average longest contigs assembled for 41 human respiratory syncytial virus (HRSV) samples. From left to right, bars represent average maximum contig lengths (bp) from Agoti et al. (2015), representing manually curated assemblies (28); EDGE with host sequence removal; EDGE with reads preprocessed through VPipe; Genome Detective; and VPipe. (B) Bar plots indicating the average longest contigs assembled for 23 SARS-CoV-2 specimens using EDGE with host sequence removal, EDGE with reads preprocessed through VPipe, Genome Detective, and VPipe. Whiskers represent standard error of the mean. Bar plots with the same letter are statistically equivalent (pairwise Wilcoxon’s test).
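The bar heights and whiskers described in the caption correspond to a mean and a standard error of the mean (SEM) over per-sample longest-contig lengths. A minimal sketch of that calculation is given here, assuming the usual definition SEM = sample standard deviation / sqrt(n); the function name `mean_and_sem` and the input values are hypothetical and are not taken from the study.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two observations")
    mean = statistics.fmean(values)
    # statistics.stdev is the sample standard deviation (n - 1 denominator).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical longest-contig lengths (bp) for four samples.
longest_contigs = [15000, 15100, 14900, 15200]
mean_bp, sem_bp = mean_and_sem(longest_contigs)
print(f"mean = {mean_bp:.1f} bp, SEM = {sem_bp:.1f} bp")
```

The pairwise Wilcoxon comparison mentioned in the caption is not reproduced here; it would typically require a statistics package beyond the standard library.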