Alex Váradi1,2, Eszter Kaszab1,3, Gábor Kardos1, Eszter Prépost1, Krisztina Szarka1, Levente Laczkó1,4.
Abstract
The most important information about microorganisms might be their accurate genome sequence. Using current Next Generation Sequencing methods, sequencing data can be generated at an unprecedented pace. However, we still lack tools for the automated and accurate reference-based genotyping of viral sequencing reads. This paper presents our pipeline designed to reconstruct the dominant consensus genome of viral samples and analyze their within-host variability. We benchmarked our approach on numerous datasets and showed that the consensus genome of samples could be obtained reliably without further manual data curation. Our pipeline can be a valuable tool for the rapid identification of viral samples. The pipeline is publicly available on the project's GitHub page (https://github.com/laczkol/QVG).
Year: 2022 PMID: 36112576 PMCID: PMC9481040 DOI: 10.1371/journal.pone.0274414
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. Schematic representation of the QVG pipeline.
White boxes represent input data needed to run the pipeline, and differently shaded gray boxes show the main consecutive steps of the workflow proposed in this study. The main outputs are shown in dark gray boxes.
Summary of datasets used in this study.
| Dataset | Sequencing method | NCBI SRA accessions | Reference genome size (bp) | Genome sequencing approach | Reference |
|---|---|---|---|---|---|
| SARS-CoV2 (this study) | MiSeq PE 150 bp | SRR19666963, SRR19666962, SRR19666951, SRR19666950, SRR19666949, SRR19666948, SRR19666947, SRR19666946, SRR19666945, SRR19666944, SRR19666961, SRR19666960, SRR19666959, SRR19666958, SRR19666957, SRR19666956, SRR19666955, SRR19666954, SRR19666953, SRR19666952 | 29,903 | Amplicon-based | This study |
| SARS-CoV2 (public) | NovaSeq PE 150 bp | SRR14824570, SRR17309642, SRR16741159, SRR14155371, SRR16912480, SRR14824567, SRR14824569, SRR14824574, SRR14824563, SRR14155385, SRR14824566, SRR14824573, SRR14824560, SRR14824572, SRR14824562, SRR14824561, SRR14824565, SRR16912539, SRR14824564, SRR14824568 | 29,903 | Amplicon-based and genomic | INSDC SARS-CoV-2 Viral Sequencing Data |
| Hepatitis B (HBV) | MiSeq PE 150 bp | SRR12535936, SRR12535937, SRR12535938, SRR12535946, SRR12535947 | 3,182 | Amplicon-based | Hebeler-Barbosa et al., 2020 |
| Rabies (RABV) | HiSeq PE 125 bp | SRR12012243, SRR12012256, SRR12012246, SRR12012251, SRR12012238, SRR12012242, SRR12012241, SRR12012234, SRR12012239, SRR12012247, SRR12012255, SRR12012245, SRR12012236, SRR12012253, SRR12012240, SRR12012237, SRR12012244, SRR12012250, SRR12012252, SRR12012254, SRR12012248, SRR12012249, SRR12012235 | 11,923 | Genomic | Sabeta et al., 2020 |
| Avian adenovirus | NextSeq SE 150 bp | N.A.* | 45,473 | Genomic | Homonnay et al., 2021 |
| Feline coronavirus (FCoV) | MiniSeq PE 150 bp | SRR8352624 | 29,174 | Genomic | de Barros et al., 2021 |
| Herpes Simplex Virus 1 (HSV-1) | MiSeq PE 250 bp | ERR3316622, ERR3316623, ERR3316627, ERR3316619 | 152,222 | Genomic | Lassalle et al., 2020 |
*Raw Illumina reads were kindly made available to us by Homonnay et al. (2021) [46] upon request.
Fig 2. Statistical assessment of the presented pipeline’s accuracy.
The plots show the values of sensitivity (true positive rate, TPR), specificity (true negative rate, TNR), balanced accuracy (BA), and precision (positive predictive value, PPV).
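The four statistics named in this caption follow the standard confusion-matrix definitions. The sketch below is illustrative only: the function name and the example counts are assumptions, and it does not reproduce how true and false calls were counted in the study.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Return TPR, TNR, BA and PPV from confusion-matrix counts.

    A ratio with a zero denominator is returned as NaN instead of raising.
    """
    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else math.nan

    tpr = ratio(tp, tp + fn)   # sensitivity: true positives among all real positives
    tnr = ratio(tn, tn + fp)   # specificity: true negatives among all real negatives
    ppv = ratio(tp, tp + fp)   # precision: true positives among all positive calls
    ba = (tpr + tnr) / 2       # balanced accuracy: mean of sensitivity and specificity
    return {"TPR": tpr, "TNR": tnr, "BA": ba, "PPV": ppv}


# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=90, fp=5, tn=895, fn=10)
```

Balanced accuracy is useful here because variant sites are typically far rarer than invariant sites, so plain accuracy would be dominated by true negatives.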
Fig 3. (A,B) Time and (C,D) memory required to run the whole pipeline.
Time is reported in minutes (min), and peak memory usage is reported in gigabytes (GB). Running time corresponds to the wall clock time, and peak memory usage refers to the maximum resident set size as reported by the ’time’ utility. This analysis was run on a commercial laptop with an Intel i7-4910MQ processor. Using more threads decreased the running time proportionally. On the left plots (A,C), the size of the symbols is proportional to the reference genome size. Different symbols indicate the approach used for genome sequencing. The "genomic" approach includes whole genome, metagenomic and transcriptome sequencing. The symbol’s color represents mean read depth. The x-axis shows the number of reads supplied to the pipeline, including those that could not be aligned to the reference genome. On the right panels (B,D), the symbol’s color shows the type of the sequencing run, and different symbols indicate the sample’s corresponding dataset (Table 1). Runs shown on these plots used the annotation transfer feature of our pipeline, with resampling of alignments turned off.
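A comparable measurement of wall clock time and peak resident set size can be sketched with the Python standard library on Unix-like systems. This is an assumption-laden illustration rather than the benchmarking script used in the study: the function name is hypothetical, and the command being timed is a placeholder.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(cmd: list) -> tuple:
    """Run a command; return (wall time in minutes, peak RSS in GiB).

    Unix only. ru_maxrss for RUSAGE_CHILDREN is the largest resident set
    size among all child processes waited for so far, so call this from a
    fresh interpreter when benchmarking a single command.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall_seconds = time.perf_counter() - start

    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # Linux reports kibibytes, macOS reports bytes; normalise to KiB.
    peak_kib = peak / 1024 if sys.platform == "darwin" else peak
    return wall_seconds / 60, peak_kib / 1024 ** 2


# Placeholder command; substitute the actual pipeline invocation.
wall_min, peak_gib = run_and_measure([sys.executable, "-c", "x = list(range(100000))"])
```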
Fig 4. Example of the SNP density of sample S11 across the reference genome.
The x-axis shows the genomic position, whereas the y-axis represents the number of SNPs within sliding windows.
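Counting SNPs within sliding windows, as plotted in this figure, can be sketched as follows. The window size, step, and SNP positions are hypothetical; the caption does not state the values used by the pipeline.

```python
from bisect import bisect_left, bisect_right


def snp_density(positions, genome_length, window, step):
    """Count SNPs in sliding windows along a genome.

    positions are 1-based SNP coordinates. Returns a list of
    (window_start, snp_count) tuples; each window covers
    [window_start, window_start + window - 1], clipped at genome_length.
    """
    ordered = sorted(positions)
    densities = []
    for start in range(1, genome_length + 1, step):
        end = min(start + window - 1, genome_length)
        # Binary search keeps each window query logarithmic in the SNP count.
        count = bisect_right(ordered, end) - bisect_left(ordered, start)
        densities.append((start, count))
    return densities


# Hypothetical SNP positions on a 100 bp genome, non-overlapping 10 bp windows.
density = snp_density([5, 12, 13, 95], genome_length=100, window=10, step=10)
```

Setting `step` smaller than `window` yields overlapping windows and a smoother density curve.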
Fig 5. Example of AB distribution (sample S11) visualized as a histogram.
An AB value different from 0 suggests multiple probable alleles at a given site.
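A histogram of this kind can be built from per-site AB values with a few lines of code. The sketch assumes that AB values have already been extracted from the variant calls and that they lie between 0 and 1; both the value range and the function names are assumptions, not details given in the caption.

```python
def ab_histogram(ab_values, n_bins=10):
    """Bin AB values (assumed to lie in [0, 1]) into n_bins equal-width bins."""
    counts = [0] * n_bins
    for ab in ab_values:
        if not 0.0 <= ab <= 1.0:
            raise ValueError(f"AB value outside [0, 1]: {ab}")
        # A value of exactly 1.0 is placed in the last bin.
        index = min(int(ab * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


def count_nonzero_ab(ab_values):
    """Number of sites whose AB differs from 0, i.e. candidate mixed sites."""
    return sum(1 for ab in ab_values if ab != 0)


# Hypothetical AB values for five variant sites.
values = [0.0, 0.0, 0.5, 0.48, 1.0]
histogram = ab_histogram(values)
mixed_sites = count_nonzero_ab(values)
```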
Comparison of pipelines used in this study by the lineage assignment and support values as output by Pangolin.
The only sample assigned differently after genotyping by the two compared pipelines is given in bold.
| Sequence name | QVG: Lineage | QVG: Conflict | QVG: Ambiguity score | QVG: Scorpio call | QVG: Scorpio support | QVG: Scorpio conflict | Geneious: Lineage | Geneious: Conflict | Geneious: Ambiguity score | Geneious: Scorpio call | Geneious: Scorpio support | Geneious: Scorpio conflict |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S2 | AY.4 | 0 | 1 | Delta (AY.4-like) | 0.91 | 0.06 | AY.4 | 0 | 1.00 | Delta (AY.4-like) | 0.91 | 0.03 |
| S3 | AY.46.6 | 0 | 0.96 | Delta (B.1.617.2-like) | 0.85 | 0.15 | AY.46.6 | 0 | 0.97 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S4 | AY.46 | 0 | 0.99 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.39 | 0 | 0.99 | Delta (B.1.617.2-like) | 0.85 | 0.08 |
| S5 | AY.43 | 0 | 0.99 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.43 | 0 | 0.99 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S8 | AY.4 | 0 | 1 | Delta (AY.4-like) | 0.91 | 0.06 | AY.4 | 0 | 1 | Delta (AY.4-like) | 0.94 | 0.03 |
| S9 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S10 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S11 | AY.9.2 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.9.2 | 0 | 1 | Delta (B.1.617.2-like) | 1 | 0 |
| S12 | AY.9.1 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.9.1 | 0 | 1 | Delta (B.1.617.2-like) | 1 | 0 |
| S13 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S14 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.85 | 0.15 |
| S16 | AY.3 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.3 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S17 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S18 | AY.122 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.122 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S19 | AY.46.6 | 0 | 1 | Delta (B.1.617.2-like) | 1 | 0 | AY.46.6 | 0 | 1 | Delta (B.1.617.2-like) | 1 | 0 |
| S20 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S22 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 1 | 0 | AY.43 | 0 | 1 | Delta (B.1.617.2-like) | 1 | 0 |
| S23 | AY.122 | 0 | 0.99 | Delta (B.1.617.2-like) | 0.92 | 0.08 | AY.122 | 0 | 0.99 | Delta (B.1.617.2-like) | 0.92 | 0.08 |
| S24 | AY.122 | 0 | 1 | Delta (B.1.617.2-like) | 1 | 0 | AY.122 | 0 | 1 | Delta (B.1.617.2-like) | 1 | 0 |
Fig 6. (A) Example of read depth counting all alignments and (B) read depth evened out by the resampling feature of our pipeline.
The x-axis shows the genomic position, whereas the y-axis represents the read depth of each position, shown as a gray line. The middle red dashed line shows the mean read depth across the genome, and the thinner dashed lines show the first and third quartiles of the read depth distribution. Green bars on the x-axis show the positions of polymorphisms discovered using all alignments (A) and after resampling the alignments along genomic windows (B).
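One way to even out read depth along genomic windows is to cap the number of alignments retained per window by random subsampling. The sketch below illustrates that general idea only; it is not the pipeline's implementation, and the function name, window size, and cap are all hypothetical.

```python
import random
from collections import defaultdict


def resample_by_window(read_starts, window, max_per_window, seed=0):
    """Return sorted indices of reads kept after per-window subsampling.

    read_starts are 0-based alignment start positions. Reads are grouped by
    the window containing their start; any window holding more than
    max_per_window reads is randomly downsampled to that cap.
    """
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    by_window = defaultdict(list)
    for index, start in enumerate(read_starts):
        by_window[start // window].append(index)

    kept = []
    for window_id in sorted(by_window):
        members = by_window[window_id]
        if len(members) > max_per_window:
            members = rng.sample(members, max_per_window)
        kept.extend(members)
    return sorted(kept)


# Hypothetical input: 50 reads pile up in the first window, 5 in the second.
starts = [0] * 50 + [150] * 5
kept = resample_by_window(starts, window=100, max_per_window=10)
```

Windows below the cap are left untouched, so low-coverage regions keep all of their alignments while amplicon pile-ups are thinned.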
Comparison of originally reported lineages and lineages identified by Pangolin after genotyping publicly available sequencing reads of SARS-CoV2 with our pipeline.
| Sequence name | Lineage | Conflict | Ambiguity score | Scorpio call | Scorpio support | Scorpio conflict | Originally reported lineage |
|---|---|---|---|---|---|---|---|
| SRR14155371 | B.1.1.7 | 0 | 0.98 | Alpha (B.1.1.7-like) | 0.96 | 0.04 | B.1.1.7 |
| SRR14155385 | B.1.1.7 | 0 | 1.0 | Alpha (B.1.1.7-like) | 0.96 | 0.04 | B.1.1.7 |
| SRR14824560 | B.1.1.7 | 0 | 0.98 | Alpha (B.1.1.7-like) | 0.96 | 0.04 | B.1.1.7 |
| SRR14824561 | B.1.1.7 | 0 | 0.98 | Alpha (B.1.1.7-like) | 0.96 | 0.04 | B.1.1.7 |
| SRR14824562 | B.1.429 | 0 | 1.0 | Epsilon (B.1.429-like) | 1.0 | 0 | B.1.429 |
| SRR14824563 | P.1 | 0 | 1.0 | Gamma (P.1-like) | 0.87 | 0 | P.1 |
| SRR14824564 | B.1.1.7 | 0 | 1.0 | Alpha (B.1.1.7-like) | 0.91 | 0.04 | B.1.1.7 |
| SRR14824565 | B.1.1.7 | 0 | 1.0 | Alpha (B.1.1.7-like) | 0.96 | 0.04 | B.1.1.7 |
| SRR14824566 | P.1 | 0 | 1.0 | Gamma (P.1-like) | 0.87 | 0 | P.1 |
| SRR14824567 | B.1.637 | 0 | 1.0 | | | | B.1.526.1 |
| SRR14824568 | B.1.1.7 | 0 | 1.0 | Alpha (B.1.1.7-like) | 0.95 | 0.04 | B.1.1.7 |
| SRR14824569 | B.1.1.7 | 0 | 1.0 | Alpha (B.1.1.7-like) | 0.95 | 0.04 | B.1.1.7 |
| SRR14824570 | B.1.1.7 | 0 | 1.0 | Alpha (B.1.1.7-like) | 0.95 | 0.04 | B.1.1.7 |
| SRR14824572 | B.1.525 | 0 | 0.98 | Eta (B.1.525-like) | 1.00 | 0 | B.1.525 |
| SRR14824573 | B.1.1.7 | 0 | 1.0 | Alpha (B.1.1.7-like) | 0.96 | 0.04 | B.1.1.7 |
| SRR14824574 | B.1.1.7 | 0 | 1.0 | Alpha (B.1.1.7-like) | 0.96 | 0.04 | B.1.1.7 |
| SRR16741159 | B.1.351 | 0 | 0.98 | Beta (B.1.351-like) | 0.78 | 0.14 | B.1.351 |
| SRR16912480 | P.1 | 0 | 1.0 | Gamma (P.1-like) | 0.87 | 0 | P.1 |
| SRR16912539 | P.1 | 0 | 1.0 | Gamma (P.1-like) | 0.87 | 0 | P.1 |
| SRR17309642 | BA.1 | 0 | 1.0 | Omicron (BA.1-like) | 0.91 | 0 | B.1.1.529/Omicron |
Summary of the phylogenetic type classification of HBV samples by Genome Detective.
| Name | Length | Begin | End | Species | Type | Type support | Original subtype |
|---|---|---|---|---|---|---|---|
| SRR12535936 | 3182 | 1 | 3182 | Hepatitis B virus | subtype D | 100.0 | subtype D |
| SRR12535937 | 3179 | 4 | 3182 | Hepatitis B virus | subtype D | 100.0 | subtype D |
| SRR12535938 | 3182 | 1 | 3182 | Hepatitis B virus | subtype D | 100.0 | subtype D |
| SRR12535946 | 3182 | 1 | 3182 | Hepatitis B virus | subtype D | 100.0 | subtype D |
| SRR12535947 | 3179 | 1 | 3182 | Hepatitis B virus | Could not assign | | subtype A |
Fig 7. (A) Example of an unequivocally identified HBV sample (SRR12535946) and (B) a recombinant sample (SRR12535947) as output by Genome Detective using Bootscan.
Values on the y-axis show positions of x belonging to a given cluster.
Classification results for the samples of the RABV dataset returned by RABV-GLUE.
| Sequence | Identified as RABV? | Major clade | Minor clade | Closest full genome reference sequence | Coding region coverage: N (%) | Coding region coverage: P (%) | Coding region coverage: M (%) | Coding region coverage: G (%) | Coding region coverage: L (%) | Originally reported lineage |
|---|---|---|---|---|---|---|---|---|---|---|
| SRR12012234 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012235 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012236 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012237 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012238 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012239 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012240 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012241 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012242 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012243 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012244 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012245 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012246 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012247 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012248 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012249 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012250 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012251 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012252 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012253 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012254 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012255 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148204 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
| SRR12012256 | Yes | Cosmopolitan | Cosmopolitan AF1b | KX148103 | 100 | 100 | 100 | 100 | 100 | Africa 1-b lineage |
Fig 8. (A) Phylogenetic tree reconstructed for the best 10 BLAST hits using the consensus FCoV genome obtained using our pipeline and (B) pairwise sequence similarity shown on a heatmap of these sequences using raw distances.
The sample name SRR8352624_KX72252910 represents the sequencing reads genotyped with our pipeline relying on alignments to the reference genome KX722529.1, and MH817484 shows the position of the publicly available reference genome of feline coronavirus strain FCoV-SB22 [45].
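Raw distances between aligned sequences are commonly computed as the proportion of differing sites (p-distance). The sketch below assumes that interpretation and assumes the sequences are already aligned to equal length; the function names, the handling of gaps and ambiguous bases, and the example sequences are illustrative choices rather than details from the study.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Only columns where both sequences carry A, C, G or T are compared;
    gaps and ambiguity codes are skipped. Returns NaN if no column qualifies.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                differences += 1
    return differences / compared if compared else math.nan


def distance_matrix(sequences: dict) -> dict:
    """Pairwise p-distances keyed by (name_a, name_b)."""
    names = sorted(sequences)
    return {
        (a, b): p_distance(sequences[a], sequences[b])
        for a in names
        for b in names
    }


# Hypothetical toy alignment, not real viral sequences.
toy = {"ref": "ACGTACGT", "sample": "ACGTACGA", "masked": "ACGNACGT"}
matrix = distance_matrix(toy)
```

A matrix like this can be passed directly to a heatmap plotting routine, with similarity given by one minus the distance.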
Fig 9. (A) Phylogenetic tree reconstructed for the best 10 BLAST hits using the consensus avian adenovirus genome obtained using our pipeline and (B) pairwise sequence similarity shown on a heatmap of these sequences using raw distances.
The sample name MT500572_MG95320110 represents the sequencing reads genotyped with our pipeline relying on alignments to the reference genome MG953201.1, and MT500572.1 shows the position of the publicly available reference genome of the avian adenovirus isolate D2453/1/10-12/13/UA [46].
Fig 10. (A) Phylogenetic tree reconstructed for the HSV-1 dataset and (B) heatmap showing the pairwise distances of genome consensus sequences.
Sample names starting with "HSV-1" represent the sequences reconstructed by Lassalle et al. (2020) [49], and accession numbers show the placement of the newly reconstructed genome sequences of the same samples. The accession of the reference genome used for the analysis is given next to the accession number of raw read data.