| Literature DB >> 31695094 |
Robert P Adelson1, Alan E Renton2, Wentian Li3, Nir Barzilai3, Gil Atzmon4,5, Alison M Goate6, Peter Davies1, Yun Freudenberg-Hua7,8.
Abstract
The success of next-generation sequencing depends on the accuracy of variant calls. Few objective protocols exist for QC following variant calling from whole genome sequencing (WGS) data. After applying QC filtering based on Genome Analysis Tool Kit (GATK) best practices, we used genotype discordance of eight samples that were sequenced twice each to evaluate the proportion of potentially inaccurate variant calls. We designed a QC pipeline involving hard filters to improve replicate genotype concordance, which indicates improved accuracy of genotype calls. Our pipeline analyzes the efficacy of each filtering step. We initially applied this strategy to well-characterized variants from the ClinVar database, and subsequently to the full WGS dataset. The genome-wide biallelic pipeline removed 82.11% of discordant and 14.89% of concordant genotypes, and improved the concordance rate from 98.53% to 99.69%. The variant-level read depth filter most improved the genome-wide biallelic concordance rate. We also adapted this pipeline for triallelic sites, given the increasing proportion of multiallelic sites as sample sizes increase. For triallelic sites containing only SNVs, the concordance rate improved from 97.68% to 99.80%. Our QC pipeline removes many potentially false positive calls that pass in GATK, and may inform future WGS studies prior to variant effect analysis.Entities:
Mesh:
Year: 2019 PMID: 31695094 PMCID: PMC6834861 DOI: 10.1038/s41598-019-52614-7
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1Density plots used to empirically determine thresholds for (A) DP, (B) MQ, and (C) VQSLOD (for SNVs only). These plots compare the densities for discordant and concordant sites, and the thresholds are set in order to maximize the ratio of discordant to concordant sites filtered out. Sites were removed if their total DP was less than 25,000, MQ was less than 58.75 or greater than 61.25, or VQSLOD was less than 7.81 (for SNVs only). The minimum VQSLOD value to be designated “PASS” in GATK was –3.769 for SNVs and –0.961 for indels.
Outcome from the hard filters utilized in the QC pipeline, at the variant, genotype, and sample levels, for ClinVar-indexed biallelic sites only.
| Variant Level | Site Removal Criterion | Sequential Filtering | Independent Filtering |
|---|---|---|---|
| # Pass (% Pass), Variants | |||
| — | Monomorphic | 38,402 (100) | 38,402 (100) |
| 1 | Missingness ≥ 5% | 38,359 (99.89) | 38,776 (99.79) |
| 2 | Blacklisted region or LCR | 38,359 (100) | 38,402 (100) |
| 3 | DP < 25,000 | 37,771 (98.47) | 38,098 (98.05) |
| 4 | MQ < 58.75 or MQ > 61.25 | 37,025 (98.02) | 37,696 (97.01) |
| 5 | VQSLOD < 7.81 | 36,415 (98.35) | 37,080 (95.43) |
| 6 | InbreedingCoeff < –0.8 | 35,751 (98.18) | 38,102 (98.06) |
|
|
| ||
| 7 | DP < 10 | 9,253,660 (99.94) | 10,037,482 (99.74) |
| 8 | GQ < 20 | 8,722,641 (94.26) | 9,435,150 (93.75) |
|
|
| ||
| 9 | Missingness ≥ 10% | 259 (100) | 259 (100) |
The third column represents the number and percentage of variants, genotypes, and samples remaining following the serial application of all nine filters. The fourth column presents the outcome of applying each individual filter to the full ClinVar-indexed dataset (38,402 biallelic variants), indicating each filter’s absolute removal rate. Of 17,585,919 biallelic sites genome-wide, 38,402 matched to ClinVar (which contains 416,908 variants in the 2019-01-02 version used here). Matching was performed using ClinVar version 2019-01-02.
Non-reference concordance rates after running each variant-level filter in the QC pipeline in succession, for ClinVar-indexed biallelic sites only.
| Variant Filter | Site Removal Criterion | Concordance Rate of Passing Sites (%) | Change in Rate (%) |
|---|---|---|---|
| — | Monomorphic | 99.375 | — |
| 1 | Missingness ≥ 5% | 99.473 | +0.098 |
| 2 | Within blacklisted region or LCR | 99.473 | 0 |
| 3 | DP < 25,000 | 99.563 | +0.090 |
| 4 | MQ < 58.75 or MQ > 61.25 | 99.695 | +0.132 |
| 5 | VQSLOD < 7.81 | 99.725 | +0.030 |
| 6 | InbreedingCoeff < –0.8 | 99.729 | +0.004 |
These values were calculated following removal of non-‘PASS’ sites according to GATK HaplotypeCaller. A pair of genotypes is concordant when the genotypes of a duplicate pair are identical. The concordance change was always positive or zero. Prior to QC, 99.375% of the 9,946,118 replicate genotypes at ClinVar-indexed biallelic sites were concordant. Following QC, 99.729% of the 8,722,641 remaining genotypes were concordant. Matching was performed using ClinVar version 2019-01-02.
Ranking of variant-level filters for ClinVar-indexed biallelic sites, and genome-wide biallelic and triallelic sites.
| Rank | Filter | Negative Predictive Value | Specificity | ||||
|---|---|---|---|---|---|---|---|
| Discordances among Discarded Genotypes (%) | % of Discordant Genotypes Removed | ||||||
| ClinVar Biallelic | All Biallelic | All Triallelic | ClinVar Biallelic | All Biallelic | All Triallelic | ||
| 1 | Missingness | 87.65 | 1.98 | 42.55 | 18.39 | 0.03 | 34.92 |
| 2 | MQ | 16.19 | 8.85 | 42.91 | 48.70 | 55.38 | 79.98 |
| 3 | DP | 13.97 | 20.72 | 45.97 | 27.46 | 19.21 | 53.34 |
| 4 | VQSLOD* | 12.16 | 6.77 | 41.15 | 55.96 | 68.65 | 99.03 |
| 5 | InbreedingCoeff | 2.25 | 2.31 | 29.62 | 4.40 | 3.65 | 37.76 |
The filters are ranked in order from greatest to lowest preference for filtering out discordant genotypes. Negative predictive value refers to a filter’s ability to remove discordant genotypes (true negatives) and minimize the number of concordant genotypes removed (false negatives). Specificity refers to a filter’s ability to identify and remove discordant genotypes (true negatives) and minimize the number of discordant genotypes retained (false positives). Matching was performed using ClinVar version 2019-01-02.
*Filter applied to biallelic and triallelic sites involving only SNVs.
Outcome from the hard filters utilized in the QC pipeline, at the variant, genotype, and sample levels, for genome-wide biallelic and triallelic sites.
| Variant Level | Site Removal Criterion | Biallelic, Sequential Filtering | Triallelic, Sequential Filtering |
|---|---|---|---|
| # Pass (% Pass), Variants | |||
| – | Monomorphic | 17,585,919 (100) | 1,536,657 (100) |
| 1 | Missingness ≥ 5% | 17,584,990 (99.99) | 1,536,085 (99.96) |
| 2 | Blacklisted region or LCR | 17,584,990 (100) | 1,536,085 (100) |
| 3 | DP < 25,000 | 17,346,931 (98.65) | 1,345,292 (87.58) |
| 4 | MQ < 58.75 or MQ > 61.25 | 15,971,098 (92.17) | 968,987 (72.03) |
| 5 | InbreedingCoeff < –0.8 | 15,661,311 (98.06) | 949,810 (98.02) |
| 6 | VQSLOD < 7.81 | 14,760,982 (94.25) | 888,194 (93.51) |
|
|
| ||
| 7 | DP < 10 | 3,819,276,086 (99.96) | 202,424,447 (98.89) |
| 8 | GQ < 20 | 3,800,347,137 (99.50) | 187,956,031 (92.85) |
|
|
| ||
| 9 | Missingness ≥ 10% | 259 (100) | 193 (74.52) |
These values were calculated following removal of non-‘PASS’ sites according to GATK HaplotypeCaller. The third and fourth columns include results when only variants passing the preceding filter move on to the subsequent filter. If only SNV-SNV triallelic sites are considered for the triallelic pipeline, zero samples are removed in the triallelic pipeline (the missingness for all samples remained below 8.5%).
Non-reference concordance rate after running each hard filter in the QC pipeline in succession at the variant level, for biallelic and triallelic variants.
| Variant Filter | Site Removal Criterion | Concordance Rate of Passing Sites (%) | |||
|---|---|---|---|---|---|
| All Biallelic | Biallelic SNVs | Biallelic Indels | All Triallelic | ||
| — | Monomorphic | 98.532 | 98.690 | 96.887 | 84.155 |
| 1 | Missingness ≥ 5% | 98.533 | 98.690 | 96.887 | 84.155 |
| 2 | Within blacklisted region or LCR | 98.533 | 98.690 | 96.887 | 84.155 |
| 3 | DP < 25,000 | 98.798 | 98.904 | 97.673 | 87.570 |
| 4 | MQ < 58.75 or MQ > 61.25 | 99.401 | 99.482 | 98.536 | 92.704 |
| 5 | InbreedingCoeff < –0.8 | 99.404 | 99.486 | 98.529 | 92.671 |
| 6 | VQSLOD < 7.81 | 99.694 | 99.810 | 98.529 | 94.358 |
These values were calculated following removal of non-‘PASS’ sites according to GATK HaplotypeCaller. A pair of genotypes is concordant when the genotypes of a duplicate pair are identical. The change in concordance rate was always positive. Prior to QC, 98.532% of the 30,137,375 replicate non-reference genotypes at genome-wide biallelic sites were concordant; following QC, 99.694% of the 25,180,411 remaining non-reference genotypes were concordant. Prior to QC, 84.155% of the 2,604,018 replicate genotypes at genome-wide triallelic sites were concordant; following QC, 94.358% of the 1,522,106 remaining genotypes were concordant.
Figure 2The distribution of biallelic and triallelic sites. This distribution is shown for the original dataset, following removal of non-‘PASS’ variants (according to GATK HaplotypeCaller), and following application of all variant-level filters.
Ti/Tv ratio at each variant-level step in the genome-wide biallelic and triallelic pipelines.
| Filter/Step | Biallelic Sites | Triallelic Sites | ||
|---|---|---|---|---|
| Ti/Tv | Change (%) | Ti/Tv | Change (%) | |
| (1) Original | 1.88322 | — | 1.88322 | — |
| (2) Biallelic (or Triallelic) Only | 2.04350 | +8.511 | 0.94341 | −0.940 |
| (3) ‘PASS’ | 2.14108 | +4.775 | 0.96112 | +0.018 |
| (4) Missingness | 2.14111 | +0.001 | 0.96122 | +0.0001 |
| (5) DP | 2.14874 | +0.356 | 0.94256 | −0.019 |
| (6) MQ | 2.14418 | −0.212 | 0.85855 | −0.084 |
| (7) InbreedingCoeff | 2.14707 | +0.135 | 0.87857 | +0.020 |
| (8) SNV VQSLOD | 2.16381 | +0.780 | 0.85444 | −0.024 |
Ti/Tv increases by 0.12 (5.9%) among biallelic SNVs, from before GATK is run (step 2) through the end of QC. Ti/Tv decreases by 0.089 (9.4%) among triallelic SNV-containing sites (SNV-SNV and SNV-indel).
Figure 3Schematic for the genome-wide biallelic, triallelic, and ClinVar-indexing pipelines. The pipelines include: indexing sites in the full VCF files to the ClinVar database (in the ClinVar-indexing pipeline only), several applications of pre-QC filters and annotations, variant-level filtration, sample-level filtration, genotype-level filtration, a recommended manual review of the final output, and study-specific statistical and association analyses.
Hard filters utilized in the QC pipeline, at the variant, genotype, and sample levels.
| Variant Level | Site Removal Criterion |
|---|---|
| 1 | Missingness ≥ 5% |
| 2 | Within blacklisted region or LCR |
| 3 | DP < 25,000 |
| 4 | MQ < 58.75 or MQ > 61.25 |
| 5 | VQSLOD < 7.81 |
| 6 | InbreedingCoeff < –0.8 |
| Genotype Level | Genotype Removal Criterion |
| 7 | DP < 10 |
| 8 | GQ < 20 |
| Sample Level | Sample Removal Criterion |
| 9 | Missingness ≥ 10% |
The thresholds for steps 4 through 6 (DP, MQ, and VQSLOD) were determined empirically, by comparing density plots for those parameters in concordant and discordant variants.