E K F Chan, R Hawken, A Reverter.
Abstract
The last decade has seen rapid improvements in high-throughput single nucleotide polymorphism (SNP) genotyping technologies that have consequently made genome-wide association studies (GWAS) possible. With tens to hundreds of thousands of SNP markers being tested simultaneously in GWAS, it is imperative to appropriately pre-process, or filter out, those SNPs that may lead to false associations. This paper explores the relationships between various SNP genotype and phenotype attributes and their effects on false associations. We show that (i) uniformly distributed ordinal data as well as binary data are more easily influenced, though not necessarily negatively, by differences in various SNP attributes compared with normally distributed data; (ii) filtering SNPs on minor allele frequency (MAF) and extent of Hardy-Weinberg equilibrium (HWE) deviation has little effect on the overall false positive rate; (iii) in some cases, filtering on MAF only serves to exclude SNPs from the analysis without reduction of the overall proportion of false associations; and (iv) HWE, MAF and heterozygosity are all dependent on minor genotype frequency, a newly proposed measure for genotype integrity.
Year: 2008 PMID: 19076733 PMCID: PMC2680326 DOI: 10.1111/j.1365-2052.2008.01816.x
Source DB: PubMed Journal: Anim Genet ISSN: 0268-9146 Impact factor: 3.169
Figure 1. Relationship between minor genotype frequency (MGF) and minor allele frequency (MAF) for 9075 SNPs from 565 individuals. SNPs deviating from HWE at P < 0.05 (circle), P < 0.0001 (triangle) and P < 10 × 10⁻¹⁰ (cross) are indicated.
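The attributes plotted in Figure 1 can all be derived from a SNP's three genotype counts. The following is a minimal sketch, not the authors' code. It assumes that MGF is the frequency of the rarest of the three genotypes (this record does not spell out the definition) and that HWE deviation is assessed with a one-degree-of-freedom chi-squared test; the function name `snp_attributes` and the example counts are hypothetical.

```python
from math import erfc, sqrt


def snp_attributes(n_aa, n_ab, n_bb):
    """Summarise one biallelic SNP from its genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")

    # Allele frequencies computed from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = (2 * n_bb + n_ab) / (2 * n)
    maf = min(p, q)

    # Assumption: MGF is the frequency of the rarest genotype.
    mgf = min(n_aa, n_ab, n_bb) / n
    heterozygosity = n_ab / n

    if maf == 0:
        # Monomorphic SNP: HWE cannot be tested, so report no deviation.
        chi2, p_value = 0.0, 1.0
    else:
        observed = (n_aa, n_ab, n_bb)
        expected = (p * p * n, 2 * p * q * n, q * q * n)
        chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        # Survival function of a chi-squared variable with 1 df.
        p_value = erfc(sqrt(chi2 / 2))

    return {
        "maf": maf,
        "mgf": mgf,
        "heterozygosity": heterozygosity,
        "hwe_chi2": chi2,
        "hwe_p": p_value,
    }


# Hypothetical counts: no heterozygotes at all, so MGF is 0.
example = snp_attributes(90, 0, 10)
```

With these made-up counts the SNP has a MAF of 0.1 but an MGF of 0, and the chi-squared statistic is approximately 100, which illustrates how a SNP can pass a MAF filter while still having an empty genotype class.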
Figure 2. Proportions of SNPs with the corresponding number of false associations for the five trait-types. Shown are the proportions of all SNPs (top), ‘good’ SNPs (middle) and ‘bad’ SNPs (bottom). The five types of quantitative traits are: normally distributed continuous data (cont-norm), normally distributed ordered-categorical data (cat-norm), discretely distributed ordered-categorical data (cat-disc), uniformly distributed ordered-categorical data (cat-unif) and binomially distributed binary data (bin-bin).
Spearman’s ρ-correlation between number of false associations and various SNP attributes.
| SNP attributes | Continuous normal | Categorical normal | Categorical discrete | Categorical uniform | Binary |
|---|---|---|---|---|---|
| Call-rate | – | – | – | – | – |
| Missing values | – | – | – | – | – |
| LOH | – | – | \|ρ\| < 0.1 ( | – | – |
|  | – | \|ρ\| < 0.1 ( | \|ρ\| < 0.1 ( |  |  |
| MAF | – | \|ρ\| < 0.1 ( | \|ρ\| < 0.1 ( |  |  |
| MGF | – | \|ρ\| < 0.1 ( | \|ρ\| < 0.1 ( |  |  |
| HWE: χ²-statistic | – | \|ρ\| < 0.1 ( | \|ρ\| < 0.1 ( |  |  |
| HWE: Fisher’s odds ratio | – | – | – | – | \|ρ\| < 0.1 ( |
Only correlations where either |ρ| ≥ 0.1 or the corresponding P < 0.05 are shown; otherwise ‘–’ is indicated. Significance is asserted (bold) only when both criteria are satisfied. For the test of HWE, the chi-squared test was used for all SNPs, and Fisher’s Exact test was used only on SNPs with n ≥ 5.
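The correlation and the display rule described in this footnote can be sketched as follows. This is an illustrative stdlib-only version, not the authors' code: `average_ranks`, `spearman_rho` and `table_cell_status` are hypothetical names, ties are handled with average ranks (an assumption, chosen because counts of false associations are heavily tied), and the P-value is taken as an input rather than computed.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the average ranks."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / sqrt(vx * vy)


def table_cell_status(rho, p_value):
    """Apply the footnote's rule for how a correlation is reported."""
    strong = abs(rho) >= 0.1
    significant = p_value < 0.05
    if strong and significant:
        return "bold"   # both criteria met: significance asserted
    if strong or significant:
        return "shown"  # one criterion met: value shown, not bold
    return "–"          # neither criterion met


# Hypothetical MAF values and false-association counts for four SNPs.
rho = spearman_rho([0.1, 0.2, 0.3, 0.4], [0, 1, 1, 3])
```

In practice a library routine such as `scipy.stats.spearmanr` would return both ρ and its P-value in one call; the hand-written version is only meant to make the ranking step explicit.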
Significance (P-values) of tests of the null hypothesis of no difference between FP-free (FP = 0) and FP-prone (FP ≥ 4) SNPs.
| FP = 0 vs. FP ≥ 4 | Continuous normal | Categorical normal | Categorical discrete | Categorical uniform | Binary |
|---|---|---|---|---|---|
| Call-rate | – | – | – | – | – |
| No. missing | – | – | – | – | – |
| LOH | – | – | – | – | – |
|  | – | – | <10⁻³ (higher) | <10⁻⁹ (higher) | <10⁻¹⁶ (higher) |
| MAF | – | 0.023 (higher) | <10⁻³ (higher) | <10⁻¹⁰ (higher) | <10⁻¹⁶ (higher) |
| MGF | – | – | 0.001 (higher) | <10⁻⁸ (higher) | <10⁻¹³ (higher) |
| MGF = 0 | – | 0.001 (lower) | <10⁻³ (lower) | <10⁻⁴ (lower) | <10⁻⁴ (lower) |
| HWE: χ²-test | – | 0.007 (higher) | 0.008 (higher) | 0.017 (higher) | 0.009 (higher) |
| HWE: Fisher’s Exact test | – | – | – | – | – |
Only significant differences (P < 0.05) are shown, otherwise, ‘–’ is indicated. ‘Higher’ and ‘lower’ in parentheses indicate if the distributions are right or left shifted respectively in FP-prone compared with FP-free SNPs. For the test of HWE, the chi-squared test was used for all SNPs, and Fisher’s Exact test was used only on SNPs with n ≥ 5.
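The grouping used in this table (FP = 0 versus FP ≥ 4) and the ‘higher’/‘lower’ labels can be illustrated with a short sketch. It is not the authors' procedure: the statistical test behind the P-values is not named in this record, so the code only splits the SNPs and compares group medians as a rough stand-in for the direction of the shift. The record layout (dictionaries with an `fp` key) and the function names are hypothetical.

```python
from statistics import median


def split_by_false_positives(snps):
    """Split SNP records into FP-free (FP = 0) and FP-prone (FP >= 4)."""
    fp_free = [s for s in snps if s["fp"] == 0]
    fp_prone = [s for s in snps if s["fp"] >= 4]
    return fp_free, fp_prone


def shift_direction(snps, attribute):
    """Label the attribute as higher or lower in FP-prone SNPs.

    Assumption: the group medians are compared; the original analysis
    may have used a different summary of the two distributions.
    """
    fp_free, fp_prone = split_by_false_positives(snps)
    if not fp_free or not fp_prone:
        raise ValueError("both groups need at least one SNP")
    free_median = median(s[attribute] for s in fp_free)
    prone_median = median(s[attribute] for s in fp_prone)
    if prone_median > free_median:
        return "higher"
    if prone_median < free_median:
        return "lower"
    return "–"


# Hypothetical SNP records; the SNP with FP = 2 belongs to neither group.
snps = [
    {"maf": 0.05, "fp": 0},
    {"maf": 0.10, "fp": 0},
    {"maf": 0.02, "fp": 0},
    {"maf": 0.30, "fp": 4},
    {"maf": 0.45, "fp": 6},
    {"maf": 0.20, "fp": 2},
]
direction = shift_direction(snps, "maf")
```

A rank-based two-sample test would be one plausible way to attach a P-value to such a comparison, but that choice is an assumption and is deliberately left out of the sketch.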
Figure 3. Rates of reduction in the proportion of false associations to the proportion of excluded SNPs at various combinations of MAF, MGF and HWE deviation thresholds. Filtration on MGF is indicated by different plotting symbols (circle: no filtration on MGF, triangle: MGF > 0, plus: MGF > 0.005, cross: MGF > 0.01, diamond: MGF > 0.05, inverse triangle: MGF > 0.01), filtration on MAF is indicated by different colours [red: polymorphic (MAF > 0), green: MAF > 0.005, blue: MAF > 0.01, cyan: MAF > 0.05, magenta: MAF > 0.01] and filtration on HWE deviation is indicated by different plotting sizes (smallest: no filtration on HWE deviation, to the largest: P > 0.05). The red line indicates the line of unity.
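The trade-off plotted in Figure 3 can be sketched for a single combination of thresholds. This is a hedged illustration rather than the authors' computation: it assumes the ‘reduction in the proportion of false associations’ is the share of all false associations carried by the excluded SNPs, and the function name `filter_tradeoff`, the record keys and the example data are hypothetical. A threshold of `None` means no filtration on that attribute.

```python
def filter_tradeoff(snps, min_maf=None, min_mgf=None, min_hwe_p=None):
    """Return (fraction of SNPs excluded, fraction of false associations removed)."""
    if not snps:
        raise ValueError("no SNPs supplied")

    def keep(s):
        # A SNP is kept only if it strictly exceeds every active threshold.
        return (
            (min_maf is None or s["maf"] > min_maf)
            and (min_mgf is None or s["mgf"] > min_mgf)
            and (min_hwe_p is None or s["hwe_p"] > min_hwe_p)
        )

    kept = [s for s in snps if keep(s)]
    total_fp = sum(s["fp"] for s in snps)
    kept_fp = sum(s["fp"] for s in kept)

    excluded_fraction = 1 - len(kept) / len(snps)
    # Assumption: reduction is measured against all false associations.
    removed_fp_fraction = 1 - kept_fp / total_fp if total_fp else 0.0
    return excluded_fraction, removed_fp_fraction


# Hypothetical SNP records with precomputed attributes and FP counts.
snps = [
    {"maf": 0.004, "mgf": 0.0, "hwe_p": 0.5, "fp": 4},
    {"maf": 0.02, "mgf": 0.0, "hwe_p": 0.001, "fp": 2},
    {"maf": 0.2, "mgf": 0.03, "hwe_p": 0.6, "fp": 1},
    {"maf": 0.4, "mgf": 0.15, "hwe_p": 0.9, "fp": 1},
]
maf_only = filter_tradeoff(snps, min_maf=0.01)
maf_and_hwe = filter_tradeoff(snps, min_maf=0.01, min_hwe_p=0.05)
```

Under this reading, a point on the line of unity means that the filter removes false associations only in proportion to the SNPs it discards, which is one way to interpret the abstract's remark that filtering can exclude SNPs without reducing the overall proportion of false associations.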