Yancy Lo, Hyun M Kang, Matthew R Nelson, Mohammad I Othman, Stephanie L Chissoe, Margaret G Ehm, Gonçalo R Abecasis, Sebastian Zöllner.
Abstract
BACKGROUND: Sequencing studies of exonic regions aim to identify rare variants contributing to complex traits. With high coverage and large sample size, these studies tend to apply simple variant calling algorithms. However, coverage is often heterogeneous; sites with insufficient coverage may benefit from sophisticated calling algorithms used in low-coverage sequencing studies. We evaluate the potential benefits of different calling strategies by performing a comparative analysis of variant calling methods on exonic data from 202 genes sequenced at 24x in 7,842 individuals. We call variants using individual-based, population-based and linkage disequilibrium (LD)-aware methods with stringent quality control. We measure genotype accuracy by the concordance with on-target GWAS genotypes and between 80 pairs of sequencing replicates. We validate selected singleton variants using capillary sequencing.
Year: 2015 PMID: 25884587 PMCID: PMC4359451 DOI: 10.1186/s12859-015-0489-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Summary statistics of 27,500 top-ranked SNPs per call set and quality assessed by transition-to-transversion ratio (Ts/Tv) and missing genotypes
| Call set | SNPs | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| IBC | 27500 | 25.72% | 3.02 | 2.54 | 2.71 | 16325 (59.36%) | 2.57 | 4.71 |
| PBC | 27500 | 26.87% | 3.02 | 2.45 | 2.59 | 15877 (57.73%) | 2.44 | 0.47 |
| LDC | 27500 | 26.85% | 3.01 | 2.45 | 2.59 | 15857 (57.66%) | 2.44 | 0.17 |
| LDC + F | 27500 | 26.81% | 3.00 | 2.45 | 2.58 | 15869 (57.71%) | 2.44 | 0.17 |
Abbreviations: IBC = individual-based single marker caller, PBC = population-based single marker caller, LDC = LD-aware caller without flanking haplotypes, LDC + F = LD-aware caller with flanking haplotypes. Expanded table showing quality of call sets broken down by variant class is included in Additional file 1: Table S1.
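The transition-to-transversion ratio used as a quality measure in the table above can be illustrated with a short sketch. This is not the authors' pipeline; the function names and the input format (pairs of reference and alternative bases) are assumptions made for illustration.

```python
# Illustrative sketch only; names and input format are assumptions.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for a purine<->purine or pyrimidine<->pyrimidine substitution."""
    pair = {ref.upper(), alt.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def ts_tv_ratio(snps):
    """Compute Ts/Tv over an iterable of (ref, alt) single-base pairs."""
    ts = tv = 0
    for ref, alt in snps:
        if ref.upper() == alt.upper():
            continue  # not a substitution; skip
        if is_transition(ref, alt):
            ts += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; Ts/Tv is undefined")
    return ts / tv


# Three transitions (A>G, C>T, G>A) and one transversion (A>C).
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
print(ts_tv_ratio(example))  # 3.0
```

Computing the ratio separately for subsets of a call set (for example, singletons only) just means passing the corresponding subset of SNPs to `ts_tv_ratio`.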
Figure 1. Distribution of coverage at the individual carrying the singleton alternative allele. We compare the distribution of coverage at called singleton variants between the individual-based caller (black) and the population-based caller (light gray). The overlap of the two distributions is shown in dark gray. Here we show all singleton variants after SNP filtering and genotype filtering on quality < 20. We keep individual-based single marker calls at low genotype coverage for this comparison, with the vertical dashed line indicating the genotype coverage filter at 7x.
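The two genotype-level filters mentioned in the caption (quality 20 and coverage 7x) could be applied along the following lines. This is a minimal sketch under stated assumptions: the record layout, the field names, and the choice to mark failing calls as missing are hypothetical, and it assumes that calls with quality below 20 or depth below 7 are the ones removed.

```python
from typing import NamedTuple


class GenotypeCall(NamedTuple):
    """Hypothetical per-sample genotype record (not the authors' format)."""
    sample: str
    genotype: str   # e.g. "0/1"; "./." denotes a missing call
    quality: float  # genotype quality score
    depth: int      # read coverage at this genotype


MIN_QUALITY = 20  # threshold taken from the caption
MIN_DEPTH = 7     # threshold taken from the caption


def apply_genotype_filters(call, min_quality=MIN_QUALITY, min_depth=MIN_DEPTH):
    """Return the call unchanged if it passes, else a copy set to missing.

    Assumption: values exactly at a threshold pass the filter.
    """
    if call.quality < min_quality or call.depth < min_depth:
        return call._replace(genotype="./.")
    return call


low_depth = GenotypeCall("s1", "0/1", 35.0, 6)
passing = GenotypeCall("s2", "0/1", 35.0, 7)
print(apply_genotype_filters(low_depth).genotype)  # ./.
print(apply_genotype_filters(passing).genotype)    # 0/1
```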
Heterozygous mismatch (a) between sequence calls and GWAS genotypes at 378 on-target GWAS markers, (b) between 80 sequence replicate pairs and (c) between pairs of algorithms
| | | IBC | PBC | LDC | LDC + F |
|---|---|---|---|---|---|
| (a) | All samples at 378 GWAS markers | 0.82% | 0.38% | 0.39% | 0.32% |
| (b) | 80 sequence replicate pairs at all called variants | 1.01% | 0.34% | 0.36% | 0.20% |
| (c) | Pairwise comparison of callers | | | | |
| | vs PBC | 0.42% | -- | -- | -- |
| | vs LDC | 0.93% | 0.35% | -- | -- |
| | vs LDC + F | 1.01% | 0.41% | 0.30% | -- |
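A heterozygous mismatch rate of the kind reported above can be computed between any two sets of genotype calls (sequence vs GWAS, replicate vs replicate, or caller vs caller). The sketch below assumes one plausible definition: among genotype pairs that are non-missing in both sets and heterozygous in at least one, the fraction where the two calls differ. The exact definition used in the study may differ, and the data layout is hypothetical.

```python
def heterozygous_mismatch(calls_a, calls_b):
    """Fraction of discordant pairs among pairs with at least one het call.

    Each argument maps (sample, site) to an alternative-allele count
    (0, 1 or 2), or None for a missing genotype. This definition is an
    assumption for illustration, not necessarily the study's own.
    """
    compared = mismatched = 0
    for key, geno_a in calls_a.items():
        geno_b = calls_b.get(key)
        if geno_a is None or geno_b is None:
            continue  # skip genotypes missing in either set
        if geno_a == 1 or geno_b == 1:
            compared += 1
            if geno_a != geno_b:
                mismatched += 1
    if compared == 0:
        raise ValueError("no heterozygous genotype pairs to compare")
    return mismatched / compared


# Toy data with hypothetical sample and marker names.
sequence = {("s1", "m1"): 1, ("s1", "m2"): 0, ("s2", "m1"): 1,
            ("s2", "m2"): 2, ("s3", "m1"): None}
gwas = {("s1", "m1"): 1, ("s1", "m2"): 1, ("s2", "m1"): 1,
        ("s2", "m2"): 2, ("s3", "m1"): 1}

# Three pairs involve a het call; one of them is discordant.
rate = heterozygous_mismatch(sequence, gwas)
print(f"{rate:.2%}")
```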
Heterozygous mismatch (a) between each call set and GWAS genotypes at 378 on-target markers, and (b) between additional heterozygous genotypes in more complex algorithms and the GWAS markers
| | | | IBC | PBC | LDC | LDC + F |
|---|---|---|---|---|---|---|
| (a) | Number of heterozygous genotypes (hets) | | 276,761 | 293,730 | 298,220 | 298,531 |
| | Heterozygous mismatch | | 0.82% | 0.38% | 0.39% | 0.32% |
| (b) | Number of additional hets and heterozygous mismatch | not in IBC | – | 15,727 (0.85%) | 17,937 (1.23%) | 18,308 (0.47%) |
| | | not in PBC | – | – | 3,113 (2.41%) | 3,664 (0.71%) |
| | | not in LDC | – | – | – | 1,145 (0.87%) |
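Part (b) of this table looks only at heterozygous genotypes that a more complex caller reports and a simpler caller does not, and asks how often those extra calls disagree with the GWAS genotype. A minimal sketch of that bookkeeping follows; the function names and data layout are hypothetical, and treating a missing or non-heterozygous call in the simpler set as "not in" that set is an assumption.

```python
def additional_hets(complex_calls, simple_calls):
    """Keys called heterozygous in complex_calls but not in simple_calls.

    Calls map (sample, site) to an alternative-allele count (0, 1, 2)
    or None for missing. A key absent, missing, or non-het in
    simple_calls counts as "not in" the simpler call set (assumption).
    """
    return {key for key, geno in complex_calls.items()
            if geno == 1 and simple_calls.get(key) != 1}


def mismatch_among(keys, calls, truth):
    """Fraction of the given keys where calls disagree with truth.

    Keys without a truth genotype are ignored.
    """
    compared = [key for key in keys if truth.get(key) is not None]
    if not compared:
        raise ValueError("no genotypes with a truth value to compare")
    return sum(calls[key] != truth[key] for key in compared) / len(compared)


# Toy data with hypothetical genotype keys.
pbc = {"a": 1, "b": 1, "c": 1, "d": 0}
ibc = {"a": 1, "b": 0, "c": None, "d": 0}
gwas = {"a": 1, "b": 1, "c": 0, "d": 0}

extra = additional_hets(pbc, ibc)        # hets in PBC that are not in IBC
print(sorted(extra))                     # ['b', 'c']
print(mismatch_among(extra, pbc, gwas))  # 0.5
```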