Jörg T Wennmann, Jiangbin Fan, Johannes A Jehle.
Abstract
Natural isolates of baculoviruses (as well as other dsDNA viruses) generally consist of homogeneous or heterogeneous populations of genotypes. The number and positions of single nucleotide polymorphisms (SNPs) in sequencing data are often used as markers to study their genotypic composition. Identifying and assigning the specificities and frequencies of SNPs from high-throughput genome sequencing data can be very challenging, especially when comparing several sequenced isolates or samples. In this study, the new tool "bacsnp", written in the R programming language, was developed as a downstream process that enables the detection of SNP specificities across several virus isolates. The analysis is based on a common, closely related reference to which the sequencing reads of each isolate are mapped. Thereby, the specificities of SNPs are linked and their frequencies can be used to analyze the genetic composition across the sequenced isolates. Here, the downstream process and the analysis of detected SNP positions are demonstrated on three baculovirus isolates, showing the fast and reliable detection of a mixed sequenced sample.
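The idea of assigning SNP specificities across isolates mapped to a common reference can be illustrated with a minimal Python sketch. It is not the bacsnp implementation (which is written in R); the function name `classify_snp`, the 0.5 threshold, the isolate labels and the example frequencies are all hypothetical and chosen only for illustration.

```python
from typing import Dict


def classify_snp(alt_freq: Dict[str, float], threshold: float = 0.5) -> str:
    """Label a SNP position by the isolates in which the alternative
    nucleotide reaches at least `threshold` (an assumed cut-off)."""
    carriers = sorted(name for name, f in alt_freq.items() if f >= threshold)
    if not carriers:
        return "none"
    return "+".join(carriers)


# Hypothetical alternative-nucleotide frequencies at three reference positions,
# one value per isolate; every isolate was mapped to the same reference.
positions = {
    1201: {"isolate_A": 0.98, "isolate_B": 0.01},
    5310: {"isolate_A": 0.02, "isolate_B": 0.95},
    8744: {"isolate_A": 0.97, "isolate_B": 0.99},
}

specificity = {pos: classify_snp(freqs) for pos, freqs in positions.items()}
for pos, label in specificity.items():
    print(pos, label)
```

Because all isolates share one reference coordinate system, a position can be compared directly across samples, which is what makes a per-position label of this kind possible.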
Keywords: Baculoviridae; Cydia pomonella granulovirus; dsDNA viruses; genetic variability; genome sequencing; sequence heterogeneity
Year: 2020 PMID: 32526997 PMCID: PMC7354547 DOI: 10.3390/v12060625
Source DB: PubMed Journal: Viruses ISSN: 1999-4915 Impact factor: 5.048
Figure 1. Workflow of processing Illumina sequencing data for the detection of variable single nucleotide polymorphism (SNP) positions. Steps 2.2 to 2.6 refer to the corresponding paragraphs in the text. Steps 2.2 to 2.4 were applied separately to each sequenced isolate. The binary alignment map (BAM) files of all isolates were processed jointly using MPileup (step 2.5) to detect variant sites and to analyze their specificities and frequencies (step 2.6).
Result of the genome sequencing of isolates CpGV-M, CpGV-S and CpGV-0006 using short-read Illumina sequencing. Paired-end reads were 151 bp long.
| Isolate | No. Reads (Total) | No. Reads (Quality Filtered a) | Paired/Unpaired (%) | Mapped to Reference CpGV-M b (%) | Mapped to Reference CpGV-E2 b (%) |
|---|---|---|---|---|---|
| CpGV-M | 3,886,630 | 3,644,161 | 95.7 / 4.3 | 99.6 | 99.8 |
| CpGV-S | 3,595,502 | 3,346,909 | 95.2 / 4.8 | 89.1 | 89.1 |
| CpGV-0006 | 1,508,218 | 1,424,605 | 96.2 / 3.8 | 99.2 | 99.3 |
a Adapter trimming and quality filtering with a Phred quality score ≥30 (base-call accuracy 99.9%). b Percentages refer to the number of quality-filtered reads.
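The correspondence between a Phred score of 30 and a base-call accuracy of 99.9% follows from the standard definition Q = -10 log10(P), where P is the error probability. A short Python sketch of that conversion (the function name is hypothetical):

```python
def phred_to_accuracy(q: float) -> float:
    """Convert a Phred quality score to base-call accuracy.

    Uses the standard definition Q = -10 * log10(P_error),
    so accuracy = 1 - 10 ** (-Q / 10).
    """
    return 1.0 - 10.0 ** (-q / 10.0)


accuracy_q30 = phred_to_accuracy(30)
print(f"Q30 accuracy: {accuracy_q30:.3%}")
```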
Mean frequencies of the reference (ƒref) and three alternative nucleotides (ƒrel1, ƒrel2 and ƒrel3). Frequencies were calculated from the 284 SNP positions in the mappings of the short-read sequencing data of CpGV-M, CpGV-S and CpGV-0006 against the CpGV-M reference sequence. Values are arithmetic means ± standard deviations.
| Isolate | ƒref (mean ± SD) | ƒrel1 (mean ± SD) | ƒrel2 (mean ± SD) | ƒrel3 (mean ± SD) | ƒref + ƒrel1 | ƒrel2 + ƒrel3 |
|---|---|---|---|---|---|---|
| CpGV-M | 0.970 ± 0.092 | 0.028 ± 0.092 | 0.001 ± 0.002 | 0.001 ± 0.001 | 0.998 | 0.002 |
| CpGV-S | 0.087 ± 0.252 | 0.910 ± 0.254 | 0.002 ± 0.015 | 0.001 ± 0.001 | 0.997 | 0.003 |
| CpGV-0006 | 0.394 ± 0.199 | 0.605 ± 0.199 | 0.001 ± 0.002 | 0.000 ± 0.001 | 0.998 | 0.002 |
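As a rough illustration of how relative frequencies of the reference and alternative nucleotides at a single position can be obtained, the following Python sketch divides each nucleotide's read count by the total depth. This is an assumed, simplified calculation rather than the tool's documented procedure, and the counts are hypothetical.

```python
from typing import List, Sequence


def relative_frequencies(ref_count: int, alt_counts: Sequence[int]) -> List[float]:
    """Return [f_ref, f_alt1, f_alt2, ...] as fractions of the read depth."""
    counts = [ref_count] + list(alt_counts)
    depth = sum(counts)
    if depth == 0:
        raise ValueError("no reads cover this position")
    return [c / depth for c in counts]


# Hypothetical read counts at one position: reference, then three alternatives.
freqs = relative_frequencies(394, [605, 1, 0])
print([round(f, 3) for f in freqs])
```

By construction the four frequencies at a position sum to 1, so the reference plus the first alternative together account for nearly all reads whenever the second and third alternatives are close to zero.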
Figure 2. Single nucleotide polymorphism (SNP) frequency plots of sequenced isolates CpGV-M (A and D), CpGV-S (B and E) and CpGV-0006 (C and F) mapped against reference sequences CpGV-M (KM217575) (A, B and C) and CpGV-E2 (KM217577) (D, E and F). For the CpGV-M and CpGV-E2 reference-based analyses, 277 and 300 variable SNP positions were found, respectively, and the frequency of the alternative nucleotide was plotted (dots). SNP positions are colored by specificity: red for CpGV-M (n = 15 and n = 82 for the CpGV-M and CpGV-E2 reference-based analyses, respectively), blue for CpGV-S (n = 223 and n = 113, respectively) and green for both isolates (n = 39 and n = 105, respectively). Median frequencies for CpGV-M and CpGV-S are indicated by red and blue dashed lines, respectively; numbers indicate the median frequency of CpGV-M- and CpGV-S-specific SNPs in CpGV-0006.
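The median frequencies shown as dashed lines can be thought of as a per-group summary of the alternative-nucleotide frequencies in one sample. A minimal Python sketch of such a summary follows; the group labels and frequency values are hypothetical and are not taken from the figure.

```python
from statistics import median
from typing import Dict, Iterable, List, Tuple


def median_by_specificity(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group alternative-nucleotide frequencies by SNP specificity label
    and return the median frequency of each group."""
    groups: Dict[str, List[float]] = {}
    for label, freq in records:
        groups.setdefault(label, []).append(freq)
    return {label: median(values) for label, values in groups.items()}


# Hypothetical (specificity, frequency) pairs observed in one mixed sample.
records = [
    ("M", 0.38), ("M", 0.41), ("M", 0.40),
    ("S", 0.61), ("S", 0.59), ("S", 0.60), ("S", 0.62),
]

medians = median_by_specificity(records)
for label, value in sorted(medians.items()):
    print(label, round(value, 3))
```

The median is less sensitive than the mean to a few outlying positions, which may be why it is a convenient summary when a sample contains a mixture of genotypes.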