Russ J Jasper, Tegan Krista McDonald, Pooja Singh, Mengmeng Lu, Clément Rougeux, Brandon M Lind, Sam Yeaman.
Abstract
The use of next-generation sequencing (NGS) data sets has increased dramatically over the last decade, but there have been few systematic analyses quantifying the accuracy of the commonly used variant caller programs. Here we used a familial design consisting of diploid tissue from a single lodgepole pine (Pinus contorta) parent and the maternally derived haploid tissue from 106 full-sibling offspring, where mismatches could only arise due to mutation or bioinformatic error. Given the rarity of mutation, we used the rate of mismatches between parent and offspring genotype calls to infer the single nucleotide polymorphism (SNP) genotyping error rates of FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan. With baseline filtering, HaplotypeCaller and UnifiedGenotyper yielded more SNPs and higher error rates by one to two orders of magnitude, whereas FreeBayes, SAMtools, and VarScan yielded lower numbers of SNPs and more modest error rates. To facilitate comparison between variant callers, we standardized each SNP set to the same number of SNPs using additional filtering, where UnifiedGenotyper consistently produced the smallest proportion of genotype errors, followed by HaplotypeCaller, VarScan, SAMtools, and FreeBayes. Additionally, we found that error rates were minimized for SNPs called by more than one variant caller. Finally, we evaluated the performance of various commonly used filtering metrics on SNP calling. Our analysis provides a quantitative assessment of the accuracy of five widely used variant calling programs and offers valuable insights into both the choice of variant caller program and the choice of filtering metrics, especially for researchers using non-model study systems.
Keywords: bioinformatics; genomics; genotyping; next-generation sequencing; non-model; single nucleotide polymorphism
Year: 2022 PMID: 35510784 PMCID: PMC9544674 DOI: 10.1111/1755-0998.13628
Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 8.678
Baseline filtering criteria. Set of filtering criteria unique to each variant caller program and set of common filtering criteria used across all programs. Criteria describe the sites removed.

| Variant caller | Criteria (sites or genotype calls removed) |
|---|---|
| FreeBayes | Sites with less than 30 quality (QUAL); genotype calls with less than 5 depth (DP); genotype calls with less than 20 genotype quality (GQ) |
| HaplotypeCaller | Sites with greater than 60 Fisher strand (FS); sites with less than 40 mapping quality (MQ); sites with less than −12.5 mapping quality rank sum test (MQRankSum); sites with less than 30 quality (QUAL); sites with less than 2.0 quality by depth (QD); sites with less than −8.0 read position rank sum test (ReadPosRankSum); sites with greater than 3.0 strand odds ratio (SOR); genotype calls with less than 20 genotype quality (GQ) |
| SAMtools | Sites with less than 20 quality (QUAL); genotype calls with less than 5 depth (DP) |
| UnifiedGenotyper | Sites with greater than 60 Fisher strand (FS); sites with less than 40 mapping quality (MQ); sites with less than −12.5 mapping quality rank sum test (MQRankSum); sites with less than 30 quality (QUAL); sites with less than 2.0 quality by depth (QD); sites with less than −8.0 read position rank sum test (ReadPosRankSum); sites with greater than 3.0 strand odds ratio (SOR) |
| VarScan | Genotype calls with less than 10 depth (DP); genotype calls with less than 20 genotype quality (GQ); heterozygote genotype calls |
| Common filters | Sites not called in both parent and offspring; sites with greater than 50% missingness; multiallelic sites |
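
As an illustration of how such criteria act on a single site, the HaplotypeCaller site-level thresholds listed above can be written as a small lookup table. This is a minimal sketch, not the authors' pipeline: the function name, the dictionary layout, and the choice to let a missing annotation pass are all assumptions made for the example.

```python
# Thresholds copied from the HaplotypeCaller site-level criteria listed above.
# Each entry maps an annotation to (comparison, threshold); a site is removed
# when the annotation is less than ("lt") or greater than ("gt") the threshold.
HAPLOTYPECALLER_SITE_FILTERS = {
    "FS": ("gt", 60.0),
    "MQ": ("lt", 40.0),
    "MQRankSum": ("lt", -12.5),
    "QUAL": ("lt", 30.0),
    "QD": ("lt", 2.0),
    "ReadPosRankSum": ("lt", -8.0),
    "SOR": ("gt", 3.0),
}


def failed_site_filters(annotations, filters=HAPLOTYPECALLER_SITE_FILTERS):
    """Return the names of the filters that would remove this site.

    `annotations` is a hypothetical dict of annotation name -> numeric value.
    Assumption: an absent or None annotation does not trigger its filter.
    """
    failed = []
    for name, (comparison, threshold) in filters.items():
        value = annotations.get(name)
        if value is None:
            continue  # assumed behaviour for missing annotations
        if comparison == "lt" and value < threshold:
            failed.append(name)
        elif comparison == "gt" and value > threshold:
            failed.append(name)
    return failed


# Made-up annotation values for two sites, for illustration only.
clean_site = {"FS": 1.2, "MQ": 59.0, "MQRankSum": 0.1, "QUAL": 812.0,
              "QD": 24.5, "ReadPosRankSum": 0.3, "SOR": 0.7}
biased_site = dict(clean_site, FS=75.0, QD=1.1)

print(failed_site_filters(clean_site))   # no filter triggered, site is kept
print(failed_site_filters(biased_site))  # FS and QD triggered, site is removed
```

A site is kept only when the returned list is empty. The genotype-level criteria (DP, GQ) would be applied per sample in the same way.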
The number of sites and genotypes called, and the by‐site and by‐genotype mismatch rates for each variant caller program after baseline filtering was applied.
| Variant caller | Sites | Genotypes | Site mismatch rate | Genotype mismatch rate |
|---|---|---|---|---|
| FreeBayes | 1.19 × 10⁵ | 1.05 × 10⁷ | 1.03 × 10⁻² | 2.39 × 10⁻³ |
| HaplotypeCaller | 2.24 × 10⁶ | 1.87 × 10⁸ | 2.41 × 10⁻¹ | 1.18 × 10⁻² |
| SAMtools | 4.59 × 10⁵ | 4.16 × 10⁷ | 5.36 × 10⁻² | 3.08 × 10⁻³ |
| UnifiedGenotyper | 3.49 × 10⁶ | 2.86 × 10⁸ | 1.51 × 10⁻¹ | 2.03 × 10⁻² |
| VarScan | 1.16 × 10⁵ | 9.01 × 10⁶ | 2.49 × 10⁻³ | 5.48 × 10⁻⁴ |
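
To make the spread in these rates concrete, the short sketch below expresses each caller's baseline mismatch rates as a fold difference relative to a chosen reference caller. The rates are copied from the table above; the function name and the choice of VarScan (the lowest rates in the table) as reference are illustrative assumptions.

```python
# Baseline-filtered mismatch rates copied from the table above:
# caller -> (site mismatch rate, genotype mismatch rate)
BASELINE_RATES = {
    "FreeBayes": (1.03e-2, 2.39e-3),
    "HaplotypeCaller": (2.41e-1, 1.18e-2),
    "SAMtools": (5.36e-2, 3.08e-3),
    "UnifiedGenotyper": (1.51e-1, 2.03e-2),
    "VarScan": (2.49e-3, 5.48e-4),
}


def fold_difference(reference, rates=BASELINE_RATES):
    """Return caller -> (site fold, genotype fold) relative to `reference`."""
    ref_site, ref_genotype = rates[reference]
    return {
        caller: (site / ref_site, genotype / ref_genotype)
        for caller, (site, genotype) in rates.items()
    }


for caller, (site_fold, genotype_fold) in fold_difference("VarScan").items():
    print(f"{caller:17s} site x{site_fold:6.1f}  genotype x{genotype_fold:5.1f}")
```

Relative to VarScan, HaplotypeCaller's site mismatch rate comes out at roughly 97-fold and UnifiedGenotyper's genotype mismatch rate at roughly 37-fold, consistent with the one to two orders of magnitude described in the abstract.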
FIGURE 1. Number of unique variant sites shared between FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan after additional filtering to approximately 1 × 10⁵ sites called. The number of unique sites with zero genotype mismatches (a) and the number of unique sites with at least one genotype mismatch (b) are shown. The vertical bars show the number of unique sites shared by a combination of variant callers (or single variant caller), and the coloured dots and connecting line below define which combination of variant callers. The horizontal bars at the lower left show the total number of sites without genotype mismatches (a) or with genotype mismatches (b) from each caller. SNP sets were filtered by depth, genotype quality, and quality score where applicable. FreeBayes, HaplotypeCaller, SAMtools, UnifiedGenotyper, and VarScan resulted in 100,041, 99,087, 100,596, 99,871, and 99,983 sites, respectively.
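
The per-combination counts behind a plot of this kind can be sketched as follows. This is a minimal example on made-up data: it assumes each bar counts the sites reported by exactly that combination of callers (the caption does not state this explicitly), and the function name and the (contig, position) site keys are hypothetical.

```python
from collections import Counter


def count_exclusive_combinations(sites_by_caller):
    """Count sites by the exact set of callers that reported them.

    `sites_by_caller` maps a caller name to a set of hashable site keys,
    here hypothetical (contig, position) tuples.
    """
    callers_by_site = {}
    for caller, sites in sites_by_caller.items():
        for site in sites:
            callers_by_site.setdefault(site, set()).add(caller)
    # One frozenset of caller names per site; count identical combinations.
    return Counter(frozenset(callers) for callers in callers_by_site.values())


# Toy input, not data from the study.
toy_calls = {
    "FreeBayes": {("contig1", 10), ("contig1", 20)},
    "SAMtools": {("contig1", 10)},
    "VarScan": {("contig1", 10), ("contig1", 30)},
}

for combination, n_sites in count_exclusive_combinations(toy_calls).items():
    print(sorted(combination), n_sites)
```

Running the same count separately on sites with and without genotype mismatches would give the two panels.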
FIGURE 2. Comparison of mismatch rates by genotype (a) and by site (b) between variant callers at variable numbers of sites called. Mismatch rates were calculated as the proportion of mismatched parent‐offspring genotype calls out of the total number of genotypes called (a) and as the proportion of sites with at least one mismatched parent‐offspring genotype call out of the total number of sites (b). Inset plots show a magnification of the mismatch rates over 50,000 to 100,000 sites called. Variation in the number of sites called for a particular variant caller was generated by additional incremental filtering to different degrees with depth, genotype quality, and quality score where applicable.
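
The two mismatch rates defined in this caption can be sketched in a few lines. The example rests on assumptions that are not taken from the source: a haploid offspring call counts as a mismatch when its allele is absent from the diploid parent's genotype, only offspring calls are counted as genotypes, and the data structures and function name are hypothetical.

```python
def mismatch_rates(parent, offspring):
    """Return (genotype mismatch rate, site mismatch rate).

    parent:    dict site -> set of alleles in the diploid parent genotype
    offspring: dict sample -> dict site -> haploid allele, or None if missing

    Assumption: a call is a mismatch when the offspring allele is not
    one of the parent's alleles at that site.
    """
    n_genotypes = 0
    n_mismatched_genotypes = 0
    sites_called = set()
    sites_mismatched = set()
    for calls in offspring.values():
        for site, allele in calls.items():
            if allele is None or site not in parent:
                continue  # skip missing calls and sites without a parent call
            n_genotypes += 1
            sites_called.add(site)
            if allele not in parent[site]:
                n_mismatched_genotypes += 1
                sites_mismatched.add(site)
    nan = float("nan")
    genotype_rate = n_mismatched_genotypes / n_genotypes if n_genotypes else nan
    site_rate = len(sites_mismatched) / len(sites_called) if sites_called else nan
    return genotype_rate, site_rate


# Toy data, not from the study: one heterozygous and two homozygous parent sites.
toy_parent = {"s1": {"A", "G"}, "s2": {"C"}, "s3": {"T"}}
toy_offspring = {
    "o1": {"s1": "A", "s2": "C", "s3": "T"},
    "o2": {"s1": "G", "s2": "T", "s3": None},  # "T" at s2 is a mismatch
    "o3": {"s1": "A", "s2": "C", "s3": "T"},
}

genotype_rate, site_rate = mismatch_rates(toy_parent, toy_offspring)
print(genotype_rate, site_rate)
```

In the toy data one of eight called genotypes is a mismatch, and one of three sites carries at least one mismatch. A single bad call marks the whole site, so the by-site rate is always at least as high as the by-genotype rate when each site has the same number of calls.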
FIGURE 3. Effect of filtering by depth (DP) on the number of genotypes called (blue) and the genotype mismatch rates (red) after baseline filtering. Scales differ on all three axes for each panel.