Audrey E Hendricks, Stephen C Billups, Hamish N C Pike, I Sadaf Farooqi, Eleftheria Zeggini, Stephanie A Santorico, Inês Barroso, Josée Dupuis.
Abstract
A primary goal of the recent investment in sequencing is to detect novel genetic associations in health and disease, improving the development of treatments and playing a critical role in precision medicine. While this investment has resulted in an enormous total number of sequenced genomes, individual studies of complex traits and diseases are often smaller and underpowered to detect rare variant genetic associations. Existing genetic resources such as the Exome Aggregation Consortium (ExAC; >60,000 exomes) and the Genome Aggregation Database (~140,000 sequenced samples) have the potential to be used as controls in these studies. Fully utilizing these and other existing sequencing resources may increase power and could be especially useful in studies where resources to sequence additional samples are limited. However, to date, these large, publicly available genetic resources remain underutilized, or even misused, in large part due to the lack of statistical methods that can appropriately use these summary-level data. Here, we present a new method to incorporate external controls in case-control analysis called ProxECAT (Proxy External Controls Association Test). ProxECAT estimates enrichment of rare variants within a gene region using internally sequenced cases and external controls. We evaluated ProxECAT in simulations and in empirical analyses of obesity cases, using both low depth-of-coverage (7x) whole-genome sequenced controls and ExAC as controls. We find that ProxECAT maintains the expected type I error rate, with increased power as the number of external controls increases. With an accompanying R package, ProxECAT enables the use of publicly available allele frequencies as external controls in case-control analysis.
Year: 2018 PMID: 30325923 PMCID: PMC6191077 DOI: 10.1371/journal.pgen.1007591
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Data notation for internal case and external control samples for ProxECAT.
| | Predicted Functional Impact: Functional | Predicted Functional Impact: Not Functional (Proxy) | Total |
|---|---|---|---|
| Y = 1 | | | |
| Y = 0 | | | |
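The 2x2 layout above (functional vs. proxy counts, cases vs. controls) lends itself to a likelihood-ratio test of whether the functional-to-proxy ratio is the same in both groups. The following is a minimal sketch of such a test, assuming the four cells are independent Poisson counts of variant minor alleles in one gene; the function name `proxy_ratio_lrt` and the example counts are hypothetical, and this is not the code of the accompanying R package.

```python
import math

def proxy_ratio_lrt(case_fun, case_proxy, ctrl_fun, ctrl_proxy):
    """Likelihood-ratio test that the functional:proxy ratio is equal
    in cases and controls (assumption: independent Poisson counts).

    Returns (statistic, p_value); the statistic is referred to a
    chi-square distribution with 1 degree of freedom.
    """
    observed = [[case_fun, case_proxy], [ctrl_fun, ctrl_proxy]]
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    total = sum(row)
    if total == 0:
        return 0.0, 1.0  # no alleles observed, nothing to test

    stat = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                # Fitted count under the null of equal ratios.
                e = row[i] * col[j] / total
                stat += 2.0 * o * math.log(o / e)
    stat = max(stat, 0.0)  # guard against tiny negative rounding error

    # Survival function of chi-square(1): P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

if __name__ == "__main__":
    # Hypothetical counts: 15 functional / 0 proxy alleles in cases,
    # 13 functional / 18 proxy alleles in controls.
    print(proxy_ratio_lrt(15, 0, 13, 18))
```

Because the test only compares ratios within each group, a uniform difference in variant-calling sensitivity between internal and external data scales both cells of a row and leaves the statistic unchanged, which is the motivation for using proxy variants.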
Simulation parameters.
| Parameter | Values |
|---|---|
| Baseline variant minor allele rate | 0.001 per subject per 1 Kb |
| Association variant minor allele rate | 0.001 * (1.2, 1.4, 1.6, 1.8, 2, 3) |
| Gene length | 20, 40 Kb |
| Case set sample size | 500, 1000 |
| Control set sample size | 500, 1000, 10000, 40000, 100000 |
| Gene confounding | In cases and controls: 0.001 * (1, 1.2, 1.5, 2) |
| | Only in cases: 0.001 * (1, 1.2, 1.5, 2) |
| Case-control confounding | In cases: 0.001 * (1, 1.1, 1.3, 1.5) |
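To make the scale of these parameters concrete, the sketch below draws allele counts for a single gene. It rests on assumptions that are not stated in the table: counts are Poisson with mean rate x gene length x sample size, functional and proxy variants share the baseline rate, and the effect multiplier applies only to functional alleles in cases. The names `poisson_draw` and `simulate_gene` are hypothetical, and confounding terms are omitted.

```python
import math
import random

def poisson_draw(mean, rng, chunk=20.0):
    """Poisson draw using Knuth's method, applied in chunks of at most
    `chunk` so that exp(-mean) never underflows for large means."""
    count = 0
    remaining = float(mean)
    while remaining > 0:
        m = min(chunk, remaining)
        remaining -= m
        limit = math.exp(-m)
        product = rng.random()
        while product > limit:
            count += 1
            product *= rng.random()
    return count

def simulate_gene(n_cases, n_controls, gene_kb=20, base_rate=0.001,
                  effect=1.0, seed=0):
    """Draw variant minor allele counts for one gene.

    Assumptions (not from the source): Poisson counts, the same baseline
    rate for functional and proxy variants, and `effect` multiplying
    only the functional allele rate in cases.
    """
    rng = random.Random(seed)
    per_subject = base_rate * gene_kb  # expected alleles per subject
    return {
        "case_fun": poisson_draw(n_cases * per_subject * effect, rng),
        "case_proxy": poisson_draw(n_cases * per_subject, rng),
        "ctrl_fun": poisson_draw(n_controls * per_subject, rng),
        "ctrl_proxy": poisson_draw(n_controls * per_subject, rng),
    }

if __name__ == "__main__":
    # 1000 cases, 10000 external controls, 20 Kb gene, effect size 2.
    print(simulate_gene(1000, 10000, effect=2.0))
```

With the baseline rate of 0.001 per subject per 1 Kb, a 20 Kb gene gives an expected 20 alleles per 1000 subjects, so 100000 external controls contribute about 2000 expected alleles per variant class.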
Fig 1. Type I error and power estimates for case-only LRT, case-control LRT, and ProxECAT.
Estimates provided over various confounding simulation scenarios. General simulation parameters: gene-length = 20Kb, baseline mutation rate = 0.001 per person per 1Kb. Left Plot: type I error rate for Ncases = Ncontrols = 1000 and combinations of case-control confounding (mid level) and gene confounding (low level); dashed line represents expected type I error rate of 0.05 and dotted lines represent 95% confidence interval around the expected type I error rate. (A) Null simulation with no case-control or gene confounding bias; (B) gene-confounding; (C) gene-confounding only in cases; (D) case-control confounding; (E) case-control confounding and gene confounding; (F) case-control confounding and gene confounding only in cases. Right Plot: power for an effect size of 2 for case-control LRT (Ncases = 500; Ncontrols = 500) and ProxECAT (Ncases = 1000) and various external controls sample size. Dashed line is the case-control LRT power and dotted lines represent 95% confidence interval around the estimated power for case-control LRT.
Fig 2. Quantile-Quantile plots for SCOOP cases vs. UK10K Cohort controls.
Internal MAF < 0.01 in both cases and controls and number of variant minor alleles per gene ≥ 5. N genes = 11,051. 95% confidence interval of expected results in gray. ProxECAT (blue, lambda = 3.151), ProxECAT-weighted (orange, lambda = 1.026), case-control (black, lambda = 1.971). A) all tests, B) ProxECAT-weighted only.
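The lambda values quoted in the caption can be read as genomic inflation factors, where values near 1 indicate well-calibrated p-values. The sketch below assumes the common definition (median observed 1-df chi-square statistic divided by the median of the chi-square(1) distribution); the source does not spell out its definition, and the function name `genomic_inflation` is hypothetical.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of the chi-square distribution with 1 df (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.25) ** 2

def genomic_inflation(p_values):
    """Genomic inflation factor from p-values in (0, 1].

    Each p-value is converted back to a 1-df chi-square statistic via
    chi2 = (Phi^{-1}(p / 2))^2, then the median is compared with the
    median expected under the null.
    """
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN

if __name__ == "__main__":
    # Evenly spaced p-values mimic a perfectly calibrated test.
    n = 1001
    uniform_p = [(i - 0.5) / n for i in range(1, n + 1)]
    print(genomic_inflation(uniform_p))
```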
Genome-wide descriptive statistics for the ratio of the number of functional and synonymous variant minor alleles per gene in cases and controls.
| | | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|
| | | 0.01 | 2.00 | 3.00 | 6.00 | 124 |
| | | 0.02 | 1.02 | 1.89 | 3.33 | 120 |
| | | 0.07 | 1.00 | 1.40 | 3.00 | 29 |
| | | 0.02 | 1.00 | 1.65 | 2.55 | 109 |
Fig 3. Quantile-Quantile plots for SCOOP cases vs. ExAC controls.
Internal MAF < 0.001 in both cases and controls and number of variant minor alleles per gene ≥ 5. N genes = 15,863. 95% confidence interval of expected results in gray. ProxECAT (blue, lambda = 1.163), ProxECAT-weighted (orange, lambda = 1.069), case-control (black, lambda = 1.713). A) all tests, B) ProxECAT and ProxECAT-weighted only.
Gene-based results for genes with p-value < 0.01 in SCOOP vs. Cohort and SCOOP vs. ExAC.
| Gene | SCOOP vs Cohort: SCOOP | SCOOP vs Cohort: Cohort | SCOOP vs Cohort: ProxECAT p-value | SCOOP vs Cohort: ProxECAT-weighted p-value | SCOOP vs Cohort: case-control p-value | SCOOP vs ExAC: SCOOP | SCOOP vs ExAC: ExAC | SCOOP vs ExAC: ProxECAT p-value | SCOOP vs ExAC: ProxECAT-weighted p-value | SCOOP vs ExAC: case-control p-value |
|---|---|---|---|---|---|---|---|---|---|---|
| | 15/0 | 13/18 | 1.1E-05 | 2.1E-03 | 1.1E-04 | 16/1 | 380/247 | 1.5E-03 | 1.4E-03 | 1.3E-01 |
| | 0/8 | 62/16 | 1.9E-06 | 1.2E-04 | 1.1E-07 | 0/4 | 600/361 | 5.2E-03 | 1.8E-02 | 9.8E-09 |
| | 13/0 | 18/25 | 1.7E-05 | 2.0E-03 | 6.6E-03 | 11/1 | 357/268 | 8.1E-03 | 5.7E-03 | 7.4E-01 |
| | 9/0 | 13/33 | 1.1E-05 | 8.0E-03 | 2.9E-02 | 7/0 | 173/116 | 7.8E-03 | 6.8E-03 | 3.6E-01 |