Christopher E Gillies, Edgar A Otto, Virginia Vega-Warner, Catherine C Robertson, Simone Sanna-Cherchi, Ali Gharavi, Brendan Crawford, Rajendra Bhimma, Cheryl Winkler, Hyun Min Kang, Matthew G Sampson.
Abstract
BACKGROUND: Targeted sequencing of discrete gene sets is a cost-effective strategy to screen subjects for monogenic forms of disease. One method pairs microfluidic PCR with next-generation sequencing. The PCR step of this pipeline complicates accurate variant calling, in part because most reads targeting a specific exon are duplicates amplified during PCR. To reduce false positive variant calls from these experiments, previous studies have used threshold-based filtering of the alternative allele depth ratio and manual inspection of the alignments. However, even after manual inspection and filtering, many variants fail to validate by Sanger sequencing. Improving the accuracy of variant calling from these experiments requires a variant filtering strategy that adequately models microfluidic PCR-specific issues.
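The threshold-based filtering mentioned above can be illustrated with a minimal sketch. The function names and the cutoff of 0.2 are hypothetical choices for illustration only; the abstract does not state the thresholds used in previous studies.

```python
def alt_allele_depth_ratio(ref_depth: int, alt_depth: int) -> float:
    """Fraction of reads at a site that support the alternative allele."""
    total = ref_depth + alt_depth
    if total == 0:
        # No coverage: treat the ratio as zero so the site cannot pass.
        return 0.0
    return alt_depth / total


def passes_ratio_filter(ref_depth: int, alt_depth: int, min_ratio: float = 0.2) -> bool:
    """Keep a call only if the alternative allele ratio reaches the cutoff.

    min_ratio=0.2 is an assumed, illustrative value, not one from the study.
    """
    return alt_allele_depth_ratio(ref_depth, alt_depth) >= min_ratio


# A balanced heterozygous-looking site passes; a site with few alt reads does not.
print(passes_ratio_filter(ref_depth=50, alt_depth=50))
print(passes_ratio_filter(ref_depth=95, alt_depth=5))
```

A single fixed cutoff like this ignores PCR duplication and other amplicon-specific effects, which is the limitation the abstract describes.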
Keywords: Accuracy; Microfluidic; Nephrotic; Next-generation sequencing; PCR; Support vector machine; Variant calling
Year: 2016 PMID: 27287006 PMCID: PMC4902911 DOI: 10.1186/s12859-016-1108-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. Sensitivity of detecting variant sites in exome chip dataset of 373 subjects. Panel (a) shows the sensitivity of detecting sites (N = 202) with an allele count greater than one in the Exome Chip dataset of 373 subjects, using microfluidic PCR for the same subjects. Panel (b) shows a similar sensitivity analysis, limited to sites with an allele count of exactly one (N = 61). Note that the “No Filter” bar is an upper bound for all methods except GotCloud SVM. The overall conclusion from these two plots is that tarSVM’s sensitivity is very close to that of the most sensitive methods for both common and rare variants.
Fig. 2. Sensitivity and specificity of filters using Sanger sequenced variants as gold standard for NS Cohort. Panel (a) shows the sensitivity of six different filters to detect 142 variants sequenced using Sanger. The default genotype filter has the highest sensitivity, which is not surprising because it was the principal criterion for determining whether or not a variant from the CLC Genomics Workbench™ should undergo Sanger sequencing. tarSVM is nearly as sensitive as the default genotype filter. Panel (b) displays the specificity of six filters. tarSVM is significantly more specific than the default genotype filter.
Fig. 3. Accuracy and FDR of filters using Sanger sequenced variants as gold standard for NS Cohort. Panel (a) shows the accuracy of six different filters for 142 variants sequenced using Sanger. The SVM filter is more accurate than other filter methods. Panel (b) illustrates the decreased false discovery rate of tarSVM as compared to other filters.
Fig. 4. Sensitivity and specificity of filters using Sanger sequenced variants as gold standard for CAKUT Cohort. Panel (a) shows the sensitivity of six different filters for 371 variants sequenced using Sanger. The default genotype filter has the highest sensitivity, and tarSVM has comparable sensitivity with other methods. Panel (b) displays the specificity of six filters. tarSVM is substantially more specific than other filters.
Fig. 5. Accuracy and FDR of filters using Sanger sequenced variants as gold standard for CAKUT Cohort. Panel (a) shows the accuracy of six different filters for 371 variants sequenced using Sanger. tarSVM is more accurate than other filter methods. Panel (b) illustrates the decreased false discovery rate of tarSVM as compared to other filters.
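The four metrics reported in the figures follow the standard confusion-matrix definitions. The sketch below assumes that a "positive" is a variant confirmed by Sanger sequencing and that a filter "predicts positive" when the variant passes it. The function name and the example counts are hypothetical and are not taken from the study.

```python
def filter_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity, accuracy and FDR for one filter.

    tp: Sanger-confirmed variants that pass the filter
    fp: variants not confirmed by Sanger that pass the filter
    tn: variants not confirmed by Sanger that the filter rejects
    fn: Sanger-confirmed variants that the filter rejects
    """
    def ratio(num: int, den: int) -> float:
        # Return NaN rather than raising when a denominator is zero.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "fdr": ratio(fp, tp + fp),  # false discovery rate
    }


# Made-up counts, for illustration only.
print(filter_metrics(tp=8, fp=2, tn=6, fn=4))
```

Under these definitions a filter can keep sensitivity nearly unchanged while raising specificity and lowering FDR, which is the pattern the captions describe for tarSVM.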
Reduction in number of variants to validate with Sanger sequencing for the NS and CAKUT cohorts
| Cohort | Variant quality filter | Total variants | Total variants passing filter | Eligible variants | Pathogenicity filter |
|---|---|---|---|---|---|
| NS Cohort | Default | 2250 | 1263 | 481 | 156 |
| NS Cohort | tarSVM | 2250 | 1093 | 408 | 121 |
| CAKUT Cohort | Default | 8812 | 3300 | 1564 | 639 |
| CAKUT Cohort | tarSVM | 8812 | 2347 | 1135 | 432 |
The first column gives the cohort to which the row corresponds. The second column identifies the variant quality filter applied to the dataset. The total variants column gives the number of variants called by GATK. The next column shows the number of variants passing the given variant quality filter for that cohort. Eligible variants refers to all missense and loss of function variants considered for the analysis, excluding frame shift mutations that are considered in the pathogenicity filter. The final column, pathogenicity filter, gives the number of variants that have an allele frequency of less than 1 % across all populations in the Exome Variant Server and that are either loss of function or predicted to be deleterious by two of MutationTaster, PolyPhen2, and SIFT.
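The pathogenicity rule in the table note, and the reduction in Sanger workload implied by the last column, can be sketched as follows. The function names and the input layout are hypothetical. "Two of" the three predictors is read here as "at least two", which is an assumption about the source's meaning.

```python
PREDICTORS = ("MutationTaster", "PolyPhen2", "SIFT")


def passes_pathogenicity_filter(allele_freq: float,
                                is_loss_of_function: bool,
                                deleterious_calls: dict) -> bool:
    """Apply the rule from the table note to a single variant.

    allele_freq: frequency across all populations (0.01 means 1 %)
    deleterious_calls: maps predictor name to True if it calls the variant
        deleterious; missing predictors count as False (an assumption).
    """
    if allele_freq >= 0.01:
        return False
    if is_loss_of_function:
        return True
    votes = sum(bool(deleterious_calls.get(name, False)) for name in PREDICTORS)
    return votes >= 2  # assumed reading of "two of" as "at least two"


def percent_reduction(default_count: int, tarsvm_count: int) -> float:
    """Percent fewer variants to validate with tarSVM than with the default filter."""
    return 100.0 * (default_count - tarsvm_count) / default_count


# Counts from the pathogenicity filter column of the table above.
print(round(percent_reduction(156, 121), 1))  # NS cohort: 22.4
print(round(percent_reduction(639, 432), 1))  # CAKUT cohort: 32.4
```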