Adam Nunn, Christian Otto, Mario Fasold, Peter F Stadler, David Langenberger.
Abstract
BACKGROUND: Calling germline SNP variants from bisulfite-converted sequencing data poses a challenge for conventional software, which has no inherent capability to dissociate true polymorphisms from artificial mutations induced by the chemical treatment. Nevertheless, SNP data is desirable both for genotyping and for understanding the DNA methylome in the context of the genetic background. The confounding effect of bisulfite conversion can, however, be conceptually resolved by observing differences in allele counts on a per-strand basis, whereby artificial mutations are reflected by non-complementary base pairs.
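The per-strand reasoning can be illustrated with a minimal sketch. It assumes that allele counts have already been tallied separately for Watson(+)-derived and Crick(-)-derived alignments, with all bases given in reference orientation. The function names and the simple majority-allele rule are hypothetical and are not taken from the authors' software.

```python
def major_allele(counts):
    """Return the most frequently observed base in a {base: count} dict."""
    return max(counts, key=counts.get)


def classify_site(ref, watson_counts, crick_counts):
    """Classify one site from per-strand allele counts (reference orientation).

    Bisulfite conversion appears as C>T only on Watson-derived reads and as
    G>A only on Crick-derived reads, so an apparent substitution that the
    opposite strand does not support is treated as an artefact.
    """
    watson = major_allele(watson_counts)
    crick = major_allele(crick_counts)
    if watson == ref and crick == ref:
        return "reference"
    if watson == crick:
        return "snp"  # both strands agree on the same alternative allele
    if ref == "C" and watson == "T" and crick == "C":
        return "bisulfite"  # C>T seen on the Watson strand only
    if ref == "G" and crick == "A" and watson == "G":
        return "bisulfite"  # G>A seen on the Crick strand only
    return "ambiguous"


print(classify_site("C", {"T": 9, "C": 1}, {"C": 10}))
print(classify_site("C", {"T": 10}, {"T": 10}))
```

In this simplified view, a C>T substitution supported by both strands is kept as a candidate SNP, whereas one seen only on Watson-derived reads is attributed to bisulfite conversion.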
Keywords: Benchmarking; Bisulfite sequencing; DNA methylation; Epigenetics; Epigenomics; Genetic variant; Genotype; SNP
Year: 2022 PMID: 35764934 PMCID: PMC9237988 DOI: 10.1186/s12864-022-08691-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547
Fig. 1 An overview of the double-masking procedure. The central sequence represents the reference genome, with example alignments (+FW and -FW) adjacent to each originating strand. Black, emboldened nucleotides potentially arise from bisulfite treatment. Blue colouring indicates 5mC/5hmC. Red colouring represents in silico nucleotide manipulation, and corresponding base quality manipulations are indicated with an exclamation mark. In example (1) the variant caller is informed only by the -FW alignment, and in (2) only by the +FW alignment. As there is no equivalent Watson(+) alignment in (3), it is impossible to determine whether the apparent G>A polymorphism arises from bisulfite or a natural mutation.
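The masking idea in this caption can be sketched as follows. This is a simplified reading of the figure and not the authors' implementation. It assumes ungapped alignments in reference orientation and Phred+33 base qualities, where `!` encodes quality 0. It further assumes that only read T at reference C (Watson-derived reads) and read A at reference G (Crick-derived reads) are masked. The function name `double_mask` and its signature are hypothetical.

```python
def double_mask(ref, seq, qual, strand):
    """Mask potentially bisulfite-induced mismatches in one alignment.

    ref, seq : equal-length strings in reference orientation (no indels)
    qual     : Phred+33 quality string of the same length
    strand   : '+' for Watson-derived reads, '-' for Crick-derived reads

    A masked position receives the reference nucleotide and base quality 0
    ('!'), so a downstream variant caller gains no evidence from it.
    """
    if not (len(ref) == len(seq) == len(qual)):
        raise ValueError("ref, seq and qual must have equal length")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # Assumed rule: C>T is ambiguous on Watson reads, G>A on Crick reads.
    ref_base, converted = ("C", "T") if strand == "+" else ("G", "A")
    out_seq, out_qual = [], []
    for r, s, q in zip(ref, seq, qual):
        if r == ref_base and s == converted:
            out_seq.append(r)
            out_qual.append("!")
        else:
            out_seq.append(s)
            out_qual.append(q)
    return "".join(out_seq), "".join(out_qual)


print(double_mask("ACGT", "ATGT", "IIII", "+"))
```

Under these assumptions an apparent G>A on a Watson-derived read is left untouched and remains informative, which corresponds to example (2) in the caption.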
Fig. 2 Precision-sensitivity of variants called in real data. In response to an increasing variant quality (QUAL) threshold, SNPs derived from published WGBS data are compared to those derived from established benchmark datasets for (A) A. thaliana (Cvi-0) and (B) human (NA12878). Software with the epi- prefix is intended for conventional DNA sequencing libraries but is in this case run after preprocessing with the double-masking procedure. True and false positives are evaluated based on both the substitution context and the estimated genotype.
Fig. 3 ROC-like comparisons in real and simulated data. In response to an increasing variant quality (QUAL) threshold, SNPs derived from real WGS data are compared to those derived from equivalent WGBS data after in silico bisulfite conversion of either reads or alignments, followed by preprocessing with the double-masking procedure, in A. thaliana (Cvi-0). The real WGBS dataset from Figure 2A is also displayed alongside in each panel for comparison. Panels show results from conventional software: (A) Freebayes, (B) GATK3.8 and (C) Platypus (default mode). True and false positives are evaluated based on both the substitution context and the estimated genotype.
Optimised F1 scores in A. thaliana (Cvi-0). In comparison to the reference SNPs obtained from 1001 genomes consortium data, scores are derived when using real WGS and WGBS data, alongside in silico WGBS data derived from the WGS reads and alignments, respectively
| Software | Real data: WGS | Real data: WGBS | In silico WGBS: reads | In silico WGBS: alignments |
|---|---|---|---|---|
| GATK3.8 | 0.9189 | 0.8177 | 0.8508 | 0.9069 |
| Freebayes | 0.8247 | 0.7670 | 0.8039 | 0.8247 |
| Platypus (default) | 0.7423 | 0.7026 | 0.7709 | 0.7935 |
| Platypus (assembly) | 0.6378 | 0.5980 | 0.6449 | 0.6509 |
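One plausible reading of an "optimised" F1 score is the maximum F1 obtained while sweeping the QUAL threshold. The sketch below assumes exactly that, and assumes each variant is keyed by position, substitution and genotype so that a call counts as a true positive only if all three match the reference set. The function name `best_f1` and the example variants are hypothetical.

```python
def best_f1(calls, truth, thresholds):
    """Return (best F1, QUAL threshold) over a sweep of thresholds.

    calls      : dict mapping a variant key to its QUAL score
    truth      : set of variant keys in the reference SNP set
    thresholds : iterable of QUAL cut-offs to evaluate
    """
    best_score, best_threshold = 0.0, None
    for t in thresholds:
        kept = {key for key, q in calls.items() if q >= t}
        tp = len(kept & truth)   # called and present in the truth set
        fp = len(kept - truth)   # called but absent from the truth set
        fn = len(truth - kept)   # in the truth set but not called
        precision = tp / (tp + fp) if tp + fp else 0.0
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        if f1 > best_score:
            best_score, best_threshold = f1, t
    return best_score, best_threshold


# Hypothetical keys: (chromosome, position, ref, alt, genotype)
calls = {
    ("1", 100, "C", "T", "1/1"): 50,
    ("1", 200, "G", "A", "0/1"): 30,
    ("1", 300, "A", "C", "0/1"): 10,
    ("1", 400, "T", "G", "1/1"): 5,
}
truth = {
    ("1", 100, "C", "T", "1/1"),
    ("1", 200, "G", "A", "0/1"),
    ("1", 500, "C", "G", "0/1"),
}
score, threshold = best_f1(calls, truth, [0, 20, 40])
print(round(score, 4), threshold)
```

Raising the threshold removes low-quality false positives at the cost of sensitivity, so the maximum typically lies at an intermediate cut-off.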
Fig. 4 The shared fraction of true and false positive variants in real and simulated data for A. thaliana (Cvi-0), following analysis with GATK UnifiedGenotyper. Distinct WGBS datasets were simulated from both the real WGS alignments and the real WGS reads, separately. The panels denote true positives (A) before and (B) after filtering according to recommended hard-filter thresholds in GATK best practices, and false positives (C) before and (D) after filtering. The thresholds chosen for filtering are further described in Supplementary Table S3.