| Literature DB >> 28738841 |
Jennifer A Tom1, Jens Reeder2, William F Forrest2, Robert R Graham3, Julie Hunkapiller3, Timothy W Behrens3, Tushar R Bhangale2,3.
Abstract
BACKGROUND: Large sample sets of whole genome sequencing with deep coverage are being generated, however assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or bioinformatics tools used to process the data. No systematic algorithms or heuristics exist to detect and filter batch effects or remove associations impacted by batch effects in whole genome sequencing data.Entities:
Keywords: Batch effects; Genome-wide association studies; Genotyping; Whole genome sequencing
Mesh:
Year: 2017 PMID: 28738841 PMCID: PMC5525370 DOI: 10.1186/s12859-017-1756-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1A detectable batch effect was apparent in PCA of relevant quality metrics calculated using the gVCF (a). The standard GWAS PCA performed using 250,000 common SNPs did not reveal this batch effect (b). Quality metrics included in the PCA in (a) include percent of variants confirmed in 1000 genomes (phase 1, high confidence SNPs) [26], mean genotype quality, median read depth, transition transversion ratio in non-coding regions, transition transversion ratio in coding regions, and percent heterozygotes. Group 1 here refers to samples sequenced in 2010–2012 and Group 2 to samples sequenced in 2013 and 2014
Descriptive metrics of 1231 whole genome sequences by batch
| Variable Mean (SD) | Group 1 | Group 2 |
|
|---|---|---|---|
| N | 918 | 313 | |
| GATK Genotype Quality | 91.47 (2.72) | 90.77 (3.57) | NS |
| Median Read Depth | 33.65 (4.69) | 35.39 (6.81) | NS |
| Ti/Tv in Non Coding Regions | 2.01 (0.012) | 1.95 (0.019) | < 0.0001 |
| Ti/Tv in Coding Regions | 2.99 (0.053) | 2.90 (0.032) | < 0.0001 |
| % Confirmed in 1000 Genomes | 81 (0.87) | 77 (0.76) | < 0.0001 |
| Percent Heterozygote | 7.5 (0.48) | 8.2 (0.45) | < 0.0001 |
Group 1 and Group 2 refer to two different groups detected via a visualization of eigenvectors from a PCA of metrics derived from the gVCF files
GATK Genome Analysis Toolkit, Ti/Tv transition transversion ratio, NS not significant
The means of each variable are reported along with the standard deviation in parenthesis
aDifferences between the two groups were assessed using the Wilcoxon Rank Sum Test, two-sided alternative, with a Bonferroni adjustment for multiple tests
Fig. 2Filtering unconfirmed genome-wide significant associations (UGAs) from the Batch GWAS. Percent (and number, n) of the 7370 UGAs (1901 SNPs and 5469 indels) removed by each filter for (a) SNPs and (b) indels. In yellow are the filters we recommend and in blue are other filters we tested
Fig. 3Quantile-quantile plots revealed differences in genotype quality (GQ) distributions. Hom Ref, homozogyous reference (a,b); Het, heterozygotes (c,d); Hom Alt, homozygous alternative (e,f); UGAs, sites with p-value <5E-8 in the Batch GWAS; Random, comparable set of sites with p-value >5E-8 in the Batch GWAS. Note that in (d) most points overlap the single darkest point on the plot
Fig. 4Performance of filters on an Age-Related Macular Degeneration (AMD) GWAS with no batch effect. Percent (and number, n) of variants removed genome-wide in an AMD GWAS with no batch effect where 8,636,121 unique sites and 8,791,425 variants (SNPs and indels) were analyzed
Retaining confirmed AMD associations in a candidate SNP analysis when batch is completely confounded with AMD status
| CHR | Positiona |
| Percent missing | GQ20M05 | GQ20M30 | Diff GQ | LD corrected |
|---|---|---|---|---|---|---|---|
| 1 | 196,710,325 | 0.002199 | 4.928 | NF | NF | NF | 0.002122 |
| 3 | 64,719,689 | NS | 5.027 | F | NF | NF | NS |
| 3 | 99,762,695 | NS | 5.027 | F | NF | NF | NS |
| 6 | 43,858,890 | NS | 6.57 | F | NF | F | NS |
| 6 | 116,122,572 | NS | 6.471 | F | NF | NF | NS |
| 8 | 23,225,458 | NS | 6.72 | F | NF | NF | NS |
| 9 | 99,146,083 | NS | 5.475 | F | NF | NF | NS |
| 10 | 122,454,932 | 1.60E-05 | 0.6969 | NF | NF | NF | 1.79E-05 |
| 13 | 31,245,188 | NS | 5.625 | F | NF | NF | NS |
| 14 | 68,318,360 | NS | 5.226 | F | NF | NF | NS |
| 15 | 58,396,268 | NS | 5.625 | F | NF | NF | NS |
| 16 | 56,963,321 | NS | 3.833 | NF | NF | F | NS |
| 19 | 6,718,376 | NS | 3.534 | NF | NF | NF | NS |
| 19 | 44,919,689 | 2.30E-21b | 6.67 | F | NF | NF | 3.16E-21 |
| 22 | 32,663,679 | NS | 6.521 | F | NF | NF | NS |
| 22 | 38,080,269 | NS | 6.272 | F | NF | NF | NS |
NF is not filtered, F is filtered, GQ20M05 filter, filter sites with more than 5% missingness after setting genotypes with GQ < 20 to missing; GQ20M30 filter, filter sites with more than 30% missingness after setting genotypes with GQ < 20 to missing
Diff GQ, differential genotype quality filter, LD linkage disequilibrium, NS not significant in candidate SNP analysis at Bonferroni adjusted significance level: 0.05/16 = 0.00312
aSites are reported in GRCh38 coordinates
bWe detect APOE because our controls are enriched for Alzheimer’s cases