Amy Victoria Spencer, Angela Cox, Kevin Walters.
Abstract
Genome-wide association studies have successfully identified associations between common diseases and a large number of single nucleotide polymorphisms (SNPs) across the genome. We investigate the effectiveness of several statistics, including p-values, likelihoods, genetic map distance and linkage disequilibrium between SNPs, in filtering SNPs in several disease-associated regions. We use simulated data to compare the efficacy of filters with different sample sizes and for causal SNPs with different minor allele frequencies (MAFs) and effect sizes, focusing on the small effect sizes and MAFs likely to represent the majority of unidentified causal SNPs. In our analyses, of all the methods investigated, filtering on the ranked likelihoods consistently retains the true causal SNP with the highest probability for a given false positive rate. This was the case for all the local linkage disequilibrium patterns investigated. Our results indicate that when using this method to retain only the top 5% of SNPs, even a causal SNP with an odds ratio of 1.1 and MAF of 0.08 can be retained with a probability exceeding 0.9 using an overall sample size of 50,000.
Keywords: Fine-mapping; LD; causal variants; complex disease; likelihood; p-value; single nucleotide polymorphism
Year: 2013 PMID: 24205929 PMCID: PMC4282378 DOI: 10.1111/ahg.12043
Source DB: PubMed Journal: Ann Hum Genet ISSN: 0003-4800 Impact factor: 1.670
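The abstract's rule of retaining only the top 5% of SNPs by ranked likelihood can be illustrated with a minimal sketch. It assumes a per-SNP score (for example a log-likelihood) has already been computed; how the paper computes that likelihood is not given in this text, and the function name `retain_top_percent` and the SNP identifiers and score values are hypothetical.

```python
def retain_top_percent(scores, percent=5):
    """Return the SNP ids whose score ranks in the top `percent` of all SNPs.

    `scores` maps SNP id -> score, where a larger score means stronger support.
    At least one SNP is always retained.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer ceiling of len(scores) * percent / 100, avoiding float rounding.
    n_keep = max(1, (len(scores) * percent + 99) // 100)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# Hypothetical log-likelihoods for 40 SNPs; snp0 has the largest value.
log_likelihoods = {f"snp{i}": -1000.0 - i for i in range(40)}
kept = retain_top_percent(log_likelihoods, percent=5)
print(kept)
```

In a simulation where the causal SNP is known, the retention probability quoted in the abstract would correspond to the fraction of simulated datasets in which the causal SNP appears in `kept`.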
Figure 1. Comparing the effectiveness of filters for fine-mapped data in three regions of the genome. Using the LD structure of each region, 1000 datasets were simulated and then analysed using each method (only 100 were analysed using the Zhu method). Panels (A), (C) and (E) show the efficacy of filtering using thresholds based on p-values from Cochran–Armitage tests, ranked likelihoods (RLs) and likelihood percentile (LP) points. Panels (B), (D) and (F) show the results of filtering on genetic map distance (GMD) from the top hit, on pairwise r² or D′ values with the top hit, and with the Zhu method using preferential r². In all cases the causal SNP has an OR of 1.1 and an MAF of 0.08, and the sample size is 20,000.
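The pairwise LD measures r² and D′ used as filters in Figure 1 can behave very differently for a low-frequency SNP. The sketch below uses the standard textbook definitions from haplotype and allele frequencies; it is not the paper's estimation procedure, which is not described in this text, and the function name `pairwise_ld` and the example frequencies are assumptions chosen for illustration.

```python
def pairwise_ld(p_ab, p_a, p_b):
    """Return (r2, |D'|) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    r2 = d * d / denom
    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return r2, d_prime


# Assumed example: allele A (frequency 0.08) occurs only on haplotypes
# that also carry allele B (frequency 0.20).
r2, d_prime = pairwise_ld(p_ab=0.08, p_a=0.08, p_b=0.20)
print(round(r2, 3), round(d_prime, 3))
```

In this assumed example D′ reaches its maximum while r² stays well below 1, because r² also depends on how closely the two allele frequencies match.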
Area under curve (AUC, given as a percentage) for ROC curves of different filters using mean false positive rates (FPRs). Three different 1 Mb regions of the genome were used, but in each the causal SNP has an OR of 1.1 and an MAF of 0.08, and the sample size is 20,000.
| Filtering method | High LD region (%) | Mixed LD region (%) | Low LD region (%) |
|---|---|---|---|
| Likelihood (LP threshold) | 93 | 90 | 96 |
| | 91 | 89 | 96 |
| Likelihood (RL threshold) | 87 | 79 | 90 |
| Preferential LD (Zhu) | 74 | 60 | 69 |
| | 67 | 60 | 63 |
| Genetic map distance (GMD) | 62 | 58 | 66 |
| | 42 | 42 | 48 |
Area under curve (AUC, given as a percentage) for the portions of ROC curves of different filters for which FPR ≤ 0.1. Three different 1 Mb regions of the genome were used, but in each the causal SNP has an OR of 1.1 and an MAF of 0.08, and the sample size is 20,000. The maximum possible AUC for such a portion is 10%.
| Filtering method | High LD region (%) | Mixed LD region (%) | Low LD region (%) |
|---|---|---|---|
| Likelihood (LP threshold) | 4.8 | 4.7 | 7.2 |
| | 4.3 | 4.5 | 7.2 |
| Likelihood (RL threshold) | 4.1 | 3.6 | 6.2 |
| Preferential LD (Zhu) | 2.5 | 1.0 | 2.1 |
| | 2.9 | 2.1 | 2.8 |
| Genetic map distance (GMD) | 0.2 | 1.0 | 2.2 |
| | 0.02 | 0 | 0 |
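The full and partial AUC percentages reported in the two tables above can be reproduced in principle with a trapezoidal rule over ROC points. The sketch below is an illustration under assumptions: the function name `auc_percent` is hypothetical, the ROC points are toy values rather than the paper's results, and the averaging of FPRs across simulated datasets mentioned in the caption is not modelled.

```python
def auc_percent(fpr, tpr, max_fpr=1.0):
    """Trapezoidal area under an ROC curve for FPR in [0, max_fpr].

    The area is returned as a percentage of the unit square, so the
    largest attainable value is 100 * max_fpr.
    """
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Linearly interpolate the curve at the FPR cut-off.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return 100 * area


# Toy ROC curves: a perfect filter and a filter no better than chance.
perfect = ([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])
chance = ([0.0, 1.0], [0.0, 1.0])

print(auc_percent(*perfect), auc_percent(*perfect, max_fpr=0.1))
print(auc_percent(*chance), auc_percent(*chance, max_fpr=0.1))
```

For the perfect toy filter the area restricted to FPR ≤ 0.1 comes out at 10%, which matches the maximum stated in the caption of the second table.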
Figure 2. Receiver operating characteristic (ROC) curves showing how the effectiveness of likelihood percentile (LP) as a fine-mapping filter depends on the sample size and on the per-allele OR and MAF of the causal SNP. One thousand datasets were simulated for each scenario using the LD structure of the CASP8 region, and the results of filtering at specific thresholds are highlighted.
Figure 3. The effectiveness of LP and p-value filtering for fine-mapping data that have been partially imputed, compared with their effectiveness for fully genotyped data. The causal SNP has an OR of 1.14 and an MAF of 0.08, and the sample size is 10,000. A set of 100 datasets was simulated using the LD structure of the CASP8 region, which contains 2871 fully genotyped SNPs. Each dataset was then reduced to 469 informative genotyped SNPs, and the remaining 2402 SNPs were imputed.