Jacob Durtschi, Rebecca L Margraf, Emily M Coonrod, Kalyan C Mallempati, Karl V Voelkerding.
Abstract
BACKGROUND: Variant discovery for rare genetic diseases using Illumina genome or exome sequencing involves screening up to millions of variants to find the one or few causative variant(s). Sequencing or alignment errors create "false positive" variants, which are often retained in the variant screening process. Existing methods for removing false positive variants often still retain many of them. This report presents VarBin, a method to prioritize variants based on a false positive variant likelihood prediction.
Year: 2013 PMID: 24266885 PMCID: PMC3849648 DOI: 10.1186/1471-2105-14-S13-S2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Read coverage depth effect on likelihood scores. Each data point represents one variant change per one sample from the study family's genomes and 38 background data files. Blue data points passed the GATK best practice variant filters and are called variants in the vcf file. Red data points are "non-variants" which did not pass these same filters. Homozygous variants were given an X-axis value of zero to separate them from the heterozygous variant distributions. Values beyond the plot axis limits are shown at the plot edges. A and B) Scatter plots of the Phred-scaled likelihood ratio (PLR) versus coverage depth for two variant sets. A) Variant data set enriched for true variants. For this plot, variants in the study family proband within chromosome 1 were limited to those between 10 and 20% allele frequency in 1000 Genomes data (20,000 variants total). B) Variant data set enriched for false positive variants. For this plot, variants in the study family proband on chromosomes 1-22 were limited to those not found in the 1000 Genomes data or in either parent (14,500 variants total). C and D) Scatter plots of the Phred-scaled likelihood ratio divided by coverage depth (PLRD) versus coverage depth for two variant sets. C) Data from panel A divided by coverage depth. D) Data from panel B divided by coverage depth.
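The normalization behind panels C and D can be sketched in a few lines. This is only an illustration of the division named in the caption (PLR divided by coverage depth); the function name `plrd`, the site labels, and the numeric values are hypothetical and do not come from the study data.

```python
def plrd(plr: float, depth: int) -> float:
    """Return the Phred-scaled likelihood ratio divided by coverage depth."""
    if depth <= 0:
        raise ValueError("coverage depth must be a positive integer")
    return plr / depth


# Hypothetical (site, PLR, coverage depth) records, for illustration only.
records = [
    ("site_a", 300.0, 30),
    ("site_b", 900.0, 90),
    ("site_c", 45.0, 90),
]

scores = {name: plrd(plr, depth) for name, plr, depth in records}

for name, score in scores.items():
    print(f"{name}: PLRD = {score:.2f}")
```

In this made-up example, `site_a` and `site_b` have very different PLR values but the same PLRD, while `site_c` keeps a low score even at high depth, which is the kind of depth dependence the normalized panels appear intended to reduce.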
Figure 2. PLRD histograms by variant position and nucleotide change. Six example proband variants are shown with data from the proband (gold bar with star), as well as the proband's family and the background samples for the same variant change and position. The PLRD score is plotted versus the sample count. Blue bars indicate samples in which the variant passed the GATK best practices variant filters. Red bars indicate samples in which the variant did not pass these same filters (wild type/non-variant) and was not called in the vcf file. The vertical black lines mark 3 (dashed line) and 6 (dotted line) standard deviations from the average PLRD score, calculated using only the assumed wild type/non-variant PLRD values (red bars, variant was not called). Bin numbers (as described in Methods) are given for each of the proband's variants shown. A) These two variant examples were called as Bin 1 and Bin 2 by the PLRD method and were Sanger verified as true variants. B) Examples of variants from all Bins that were detected only as wild type sequence by Sanger sequencing (false positive variants).
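The reference lines in Figure 2 can be approximated as follows. This is a minimal sketch that assumes a sample standard deviation and measures distance from the mean without regard to direction; neither choice is stated in the caption. The function name `sd_offsets` and all numbers are hypothetical, and the actual Bin assignment rules are in the Methods, which are not reproduced here, so the sketch deliberately stops short of assigning Bins.

```python
from statistics import mean, stdev


def sd_offsets(nonvariant_plrd, proband_plrd):
    """Summarise where a proband PLRD score sits relative to the
    non-variant (not called) PLRD scores at the same position."""
    if len(nonvariant_plrd) < 2:
        raise ValueError("need at least two non-variant samples")
    mu = mean(nonvariant_plrd)
    sigma = stdev(nonvariant_plrd)  # sample SD: an assumption, not from the source
    if sigma == 0:
        raise ValueError("non-variant scores have zero spread")
    return {
        "mean": mu,
        "sd": sigma,
        "offset_3sd": 3 * sigma,  # distance of the dashed line from the mean
        "offset_6sd": 6 * sigma,  # distance of the dotted line from the mean
        "n_sd": abs(proband_plrd - mu) / sigma,  # proband distance in SD units
    }


# Hypothetical background (non-variant) PLRD scores and a proband score.
background = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = sd_offsets(background, proband_plrd=20.0)
print(f"proband lies {summary['n_sd']:.1f} SD from the non-variant mean")
```

Only samples in which the variant was not called contribute to the mean and standard deviation, matching the caption's note that the lines are derived from the red bars alone.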
Figure 3. Comparison of VarBin to VQSLOD. Variants were separated into Bins using the VarBin method for true variant likelihood (Bin 1 most likely true variants, Bin 4 most likely false positive variants). The Bin groups are displayed in four separate histograms, and the total number and percentage of variants in each VarBin Bin group are shown. The corresponding GATK variant quality score recalibration scores (VQSLOD) for each of these Binned variants are plotted on the X-axis versus variant count. Starred (*) axis numbers indicate that the scale differs from the other graphs in the figure. A) A variant set enriched for true variants (approximately 20,000 variants called in the study family proband's chromosome 1 that were also found at between 10 and 20% in the 1000 Genomes data set). B) A variant set enriched for false positive variants (approximately 14,500 variants called in the proband's chromosomes 1-22 that were not found in 1000 Genomes data or in either parent).
Sanger sequencing results compared with the VarBin variant classification Bin
| Bin | Sanger result | Total* | SNV | Indel |
|---|---|---|---|---|
| Bin 1 | true | 33 | 23 | 10 |
| Bin 1 | false | 1 | 1 | - |
| Bin 2 | true | 10 | 8 | 2 |
| Bin 2 | false | 23 | 21 | 2 |
| Bin 3 | true | - | - | - |
| Bin 3 | false | 16 | 16 | - |
| Bin 4 | true | - | - | - |
| Bin 4 | false | 11 | 11 | - |
*98 variants were sequenced in the proband, the proband's family, or other families in the background data sets as stated in the methods. Four of the 98 could not be Sanger sequenced and were excluded from this table (94 total variants shown). SNV, single nucleotide variant; Indel, insertion or deletion.
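As a quick arithmetic check on the table, the fraction of Sanger-confirmed variants per Bin can be computed from the Total column. The sketch below assumes that a dash in the table means a count of zero; the names `sanger_totals` and `confirmation_rate` are illustrative only.

```python
# (true, false) Sanger counts per Bin from the Total column of the table.
# Assumption: "-" in the table is read as zero.
sanger_totals = {
    "Bin 1": (33, 1),
    "Bin 2": (10, 23),
    "Bin 3": (0, 16),
    "Bin 4": (0, 11),
}


def confirmation_rate(true_count: int, false_count: int) -> float:
    """Fraction of Sanger-sequenced variants confirmed as true."""
    total = true_count + false_count
    if total == 0:
        raise ValueError("no sequenced variants in this Bin")
    return true_count / total


rates = {name: confirmation_rate(t, f) for name, (t, f) in sanger_totals.items()}
n_sequenced = sum(t + f for t, f in sanger_totals.values())

for name, rate in rates.items():
    print(f"{name}: {rate:.1%} confirmed")
print(f"variants in table: {n_sequenced}")
```

Under that reading, the counts sum to the 94 variants given in the footnote, with 33 of 34 Bin 1 variants and 10 of 33 Bin 2 variants confirmed, and none confirmed in Bins 3 and 4.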