| Literature DB >> 24884706 |
Andrew R Carson, Erin N Smith, Hiroko Matsui, Sigrid K Brækkan, Kristen Jepsen, John-Bjarne Hansen, Kelly A Frazer1.
Abstract
BACKGROUND: Genotypes generated in next generation sequencing studies contain errors which can significantly impact the power to detect signals in common and rare variant association tests. These genotyping errors are not explicitly filtered by the standard GATK Variant Quality Score Recalibration (VQSR) tool and thus remain a source of errors in whole exome sequencing (WES) projects that follow GATK's recommended best practices. Therefore, additional data filtering methods are required to effectively remove these errors before performing association analyses with complex phenotypes. Here we empirically derive thresholds for genotype and variant filters that, when used in conjunction with the VQSR tool, achieve higher data quality than when using VQSR alone.Entities:
Mesh:
Year: 2014 PMID: 24884706 PMCID: PMC4098776 DOI: 10.1186/1471-2105-15-125
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Genotype concordance between WES genotypes and exome array genotypes in 10 samples
| None | None | 73,537 (621,855) | 551 (661) | 99.26% (99.89%) | NA | NA |
| None | VQSR | 73,069 (608,065) | 493 (108) | 99.33% (99.98%) | 0.64% (2.22%) | 10.53% (83.66%) |
| DP | None | 72,104 (611,606) | 236 (627) | 99.67% (99.90%) | 1.95% (1.65%) | 57.17% (5.14%) |
| GQ | None | 72,689 (610,682) | 225 (460) | 99.69% (99.92%) | 1.15% (1.80%) | 59.17% (30.41%) |
| DP & GQ | None | 71,986 (610,108) | 220 (446) | 99.70% (99.93%) | 2.11% (1.89%) | 60.07% (32.53%) |
| DP & GQ | VQSR | 71,552 (597,070) | 168 (36) | 99.77% (99.99%) | 2.70% (3.99%) | 69.51% (94.55%) |
Non-reference genotypes are shown above reference genotypes in brackets.
Non-reference genotypes include genotypes that are heterozygous or homozygous alternate in the exome array. Reference genotypes include genotypes that are homozygous reference in the exome array.
† Includes heterozygote to homozygous alternate mismatches.
(See Additional file 1).
Figure 1Summary of methods and improved data quality from genotype and variant filters. A) Left panel illustrates the standard filtering method (left side) compared to the proposed genotype and variant filtering method (right side) for sequencing data. Right panel illustrates the method used for genotype and variant filtering of imputed data. The quality metrics resulting from standard filtering (blue box), proposed genotype and variant filters (orange boxes), and a combination of these methods (green box) are compared to the quality of the unfiltered data (grey boxes). B) Quantitative comparisons of quality improvement are depicted for both sequencing and imputation filters at both genotype (% of discordant genotypes removed and % concordance) and variant (Ti/Tv and R2) levels. Box colors match the boxes in A).
Sensitivity and specificity of WES genotypes for exome array genotypes in 10 samples
| None | None | 73,537 | 621,855 | 661 | 551 | 99.26% | 99.89% |
| None | VQSR | 73,069 | 608,065 | 108 | 493 | 99.33% | 99.98% |
| DP | None | 72,104 | 611,606 | 627 | 236 | 99.67% | 99.90% |
| GQ | None | 72,689 | 610,682 | 460 | 225 | 99.69% | 99.92% |
| DP & GQ | None | 71,986 | 610,108 | 446 | 220 | 99.70% | 99.93% |
| DP & GQ | VQSR | 71,552 | 597,070 | 36 | 168 | 99.77% | 99.99% |
1TP = exact match of non-reference genotype; Ref/Alt with Ref/Alt or Alt/Alt with Alt/Alt.
2TN = exact match of reference genotype; Ref/Ref with Ref/Ref.
3FP = additional alternate allele in WES genotype; Ref/Ref with Ref/Alt or Ref/Ref with Alt/Alt or Ref/Alt with Alt/Alt.
4FN = missing alternate allele in WES genotype; Ref/Alt with Ref/Ref or Alt/Alt with Ref/Ref or Alt/Alt with Ref/Alt.
(See Additional file 3).
Figure 2Improved concordance, sensitivity and specificity of WES data using genotype filters. Plots illustrate the non-reference concordance and sensitivity versus specificity between array and sequencing genotypes for 10 samples. A) The percent of non-reference discordant calls removed is plotted versus the percent of non-reference concordant calls retained at increasing quality thresholds. B) Sensitivity versus specificity is plotted at increasing quality thresholds. For A) and B), blue line represents changing DP thresholds and the red line represents change GQ thresholds. Chosen filter thresholds (DP ≥ 8 and GQ ≥ 20) are indicated by points on these lines. C) Summarizes the effect that the chosen genotype filters (both DP and GQ) have on non-reference concordant and discordant genotype calls with and without the VQSR filter.
Variant filtering of WES data improves Ti/Tv ratios
| None | 0 | 448,862 (100%) | 1.93 | 2.71 | 3.05 | 2.39 | N/A |
| VQSR | 56,036 | 392,826 (87.5%) | 2.21 | 2.88 | 3.07 | 2.63 | <10-320 |
| HWE | 11,855 | 437,007 (97.4%) | 1.93 | 2.73 | 3.06 | 2.40 | 1.42 × 10-21 |
| Ave. GQ | 33,083 | 415,779 (92.6%) | 2.00 | 2.73 | 3.06 | 2.47 | 1.13 × 10-265 |
| Call Rate | 51,117 | 397,745 (88.6%) | 2.09 | 2.78 | 3.08 | 2.51 | <10-320 |
| Combined* | 59,952 | 388,910 (86.6%) | 2.09 | 2.80 | 3.09 | 2.52 | <10-320 |
| Combined* + VQSR | 97,840 | 351,022 (78.2%) | 2.38 | 2.96 | 3.10 | 2.75 | <10-320 (3.72 × 10-106)b |
| VQSR + Combined* | 92,091 | 356,771 (79.5%) | 2.34 | 2.94 | 3.10 | 2.72 | <10-320 |
†Variants found in NCBI dbSNP Build 135.
‡Variants found in HapMap phase 3 release 3.
*Combination of HWE, Ave. GQ and Call Rate filters.
ap-value based on a hypergeometric test of whether the removed variants were enriched for Tv over Ti vs. the unfiltered variant sets.
bp-value based on a hypergeometric test of whether the variants that differed between Combined + VQSR variant sets and VQSR + Combined variant sets were enriched for Tv over Ti.
Figure 3Improved Ti/Tv ratios in WES data using variant filters. Plots illustrate Ti/Tv improvement at different thresholds of A) average GQ and B) call rate. In these plots, Ti/Tv ratios (blue) for novel (dotted line), known (dashed line) and true (solid line) variants are juxtaposed against the drop in sensitivity (red) as the variant filtering thresholds increase. Chosen thresholds are show by the red dashed lines. C) The Ti/Tv improvement after each of the variant filtering steps is summarized. In addition, the result from an alternative filtering order, where VQSR is applied prior to the combined variant filters, is also displayed (green circle). *Combined filters refers to HWE, average GQ and call rate filters applied together.
Splitting samples by batch (“batched”) retains more high quality variants
| Number of variants filtered from “unbatched” dataset | 14,209 | 197,540 | 0 | 26,967 | 238,716 |
| Number of filtered variants found in “batched” | 1,983 | 135,031 | N/A | 2,050 | 139,064 |
| Ti/Tv of filtered variants found in “batched” | 2.64 | 2.20 | N/A | 2.16 | 2.20 |
*Used – hwe in vcftools to remove variants with Bonferroni-corrected p-value < 0.05.
†Used – geno in vcftools to remove variants with call rates < 88%.
‡Used an awk command to remove variants with average GQ < 35.
¥Filtered VQSR processed variants at the 99% sensitivity tranche.
Figure 4Applying a GQ filter improves the quality of imputation results from WES data. Plots illustrate data quality improvement seen after applying GQ threshold. A) Plots the average concordance (blue line) improvement between array and sequencing genotypes for 10 samples as the GQ threshold increases. Coupled with this concordance improvement is the average percent of genotypes that remain (red line) with GQ values above that threshold. B) At the GQ > 20 threshold, this plot shows that variants removed (blue) due to loss of all genotypes have generally lower quality (as measured by R2) compared to variants containing at least one genotype (red). Mean values for each distribution are shown by the dotted lines.