ProxECAT: Proxy External Controls Association Test. A new case-control gene region association test using allele frequencies from public controls.

Audrey E Hendricks, Stephen C Billups, Hamish N C Pike, I Sadaf Farooqi, Eleftheria Zeggini, Stephanie A Santorico, Inês Barroso, Josée Dupuis.

Abstract

A primary goal of the recent investment in sequencing is to detect novel genetic associations in health and disease improving the development of treatments and playing a critical role in precision medicine. While this investment has resulted in an enormous total number of sequenced genomes, individual studies of complex traits and diseases are often smaller and underpowered to detect rare variant genetic associations. Existing genetic resources such as the Exome Aggregation Consortium (>60,000 exomes) and the Genome Aggregation Database (~140,000 sequenced samples) have the potential to be used as controls in these studies. Fully utilizing these and other existing sequencing resources may increase power and could be especially useful in studies where resources to sequence additional samples are limited. However, to date, these large, publicly available genetic resources remain underutilized, or even misused, in large part due to the lack of statistical methods that can appropriately use this summary level data. Here, we present a new method to incorporate external controls in case-control analysis called ProxECAT (Proxy External Controls Association Test). ProxECAT estimates enrichment of rare variants within a gene region using internally sequenced cases and external controls. We evaluated ProxECAT in simulations and empirical analyses of obesity cases using both low-depth of coverage (7x) whole-genome sequenced controls and ExAC as controls. We find that ProxECAT maintains the expected type I error rate with increased power as the number of external controls increases. With an accompanying R package, ProxECAT enables the use of publicly available allele frequencies as external controls in case-control analysis.

Year:  2018        PMID: 30325923      PMCID: PMC6191077          DOI: 10.1371/journal.pgen.1007591

Source DB:  PubMed          Journal:  PLoS Genet        ISSN: 1553-7390            Impact factor:   5.917


Introduction

Recent investments have produced sequence data on millions of people with the number of sequenced individuals continuing to grow. Although large sequencing studies, such as the Trans-Omics for Precision Medicine (TopMed) through the National Heart, Lung, and Blood Institute, exist, most sequencing data is gathered and processed in much smaller units of hundreds to thousands of samples. This is especially true in the study of diseases that are not very common but still likely to have a complex or oligogenic genetic architecture. These silos of data mean that most rare-variant association studies of uncommon, complex diseases are underpowered. Zuk et al. suggest that sample sizes in the tens, and perhaps hundreds of thousands are required for adequate power[1]. In addition to increasing the sample size of future studies, fully leveraging existing sequencing resources could increase power considerably and could be vital in scenarios where resources to sequence more samples are limited. Existing genetic resources such as the Exome Aggregation Consortium (ExAC; >60,000 exomes)[2] and more recently, the Genome Aggregation Database (gnomAD; ~140,000 sequenced samples) have the potential to be used as controls in studies of complex diseases. However, to date, these large, publicly available genetic resources remain underutilized, or even misused[3], in large part due to the lack of statistical methods that can appropriately use this summary level data in complex disease studies. In particular, there is a large potential for bias caused by differences in sequencing technology, processing, and read depth[3]. Recently, Lee et al[4] developed iECAT, a method to incorporate publicly available allele frequencies from controls into an existing, unbiased, but underpowered case-control analysis. 
They found that iECAT controls for bias while increasing power to detect association to a genetic region and can be applied to both single variant analysis and gene region analysis using a SKAT-O framework[5]. iECAT cannot be applied to very rare variants such as singletons or doubletons and requires a set of controls that were sequenced and variant-called in parallel to the cases (i.e. internal controls). Additionally, the type I error rate for iECAT increases as the size of the internal control sample set decreases relative to the internal cases. Thus, there is still the need for methods that can incorporate very rare variants and external controls without the explicit need for large internal control samples. Here we present Proxy External Controls Association Test (ProxECAT), a method to estimate enrichment of rare variants within a gene region using internal cases and external controls. Our method addresses existing gaps such as using singleton and doubleton variants and requiring only external controls. Rare-variant tests in a gene are often limited to variants predicted to have a functional effect on the protein, hence discarding non-functional variants. This can result in greater power[6, 7]. The development of ProxECAT was motivated by the observation that these discarded variants can be used as a proxy for how well variants within a genetic region are sequenced and called within a sample. ProxECAT is both simple and fast, requiring only allele frequency information, and is thus well suited to use publicly available resources such as ExAC and gnomAD. We evaluate ProxECAT in simulations, and empirical analysis of high depth of coverage (80x) whole-exome sequenced childhood obesity cases (N = 927) using both low-depth of coverage (7x) whole-genome sequenced controls (N = 3,621), and ExAC (N = 33,370). Our method controls the type I error rate in simulations and yields the expected distribution of test statistics in real data settings. 
Given an accompanying R package, ProxECAT provides a robust and previously unavailable method to use publicly available allele frequencies as external controls in case-control analysis. This increases the utility of existing sequenced datasets to generate hypotheses and further research into the genetic basis of disease.

Results

Proxy external controls association test

For a gene region-based test, we consider the following. Let Y denote the disease status, with Y = 1 and Y = 0 for internal case and external control status, respectively. We split the variants into those that are predicted to have a functional genetic impact and those that are not predicted to have a functional impact. We use the latter as the proxy variants. Let $x_{1f}$ and $x_{1p}$ denote the counts of the functional and proxy rare variant alleles, respectively, for internal cases, and let $x_{0f}$ and $x_{0p}$ denote the counts of functional and proxy rare variant alleles, respectively, for external controls (Table 1).
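
Table 1

Data notation for internal case and external control samples for ProxECAT.

| Sample | Disease status | Predicted functional impact: Functional | Predicted functional impact: Not Functional (Proxy) | Total |
|---|---|---|---|---|
| Cases (Internal) | Y = 1 | $x_{1f}$ | $x_{1p}$ | $x_1$ |
| Controls (External) | Y = 0 | $x_{0f}$ | $x_{0p}$ | $x_0$ |
| Total | | $x_f$ | $x_p$ | |
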
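We model the observed variant minor allele counts in Table 1 as a random sample from four independent Poisson distributions, i.e., $x_{1f} \sim \text{Poisson}(\lambda_{1f})$, $x_{1p} \sim \text{Poisson}(\lambda_{1p})$, $x_{0f} \sim \text{Poisson}(\lambda_{0f})$, and $x_{0p} \sim \text{Poisson}(\lambda_{0p})$. The derivation of the ProxECAT test statistic follows from the null hypothesis in Eq (1):

$$H_0: \frac{\lambda_{1f}}{\lambda_{1p}} = \frac{\lambda_{0f}}{\lambda_{0p}} \tag{1}$$

Using the method of Lagrange Multipliers and the constraint as defined by the null hypothesis, we find the maximum likelihood estimates (MLEs) of our parameters: $\lambda_{1f}$, $\lambda_{1p}$, $\lambda_{0f}$, and $\lambda_{0p}$. Details are in S1 Appendix. Our MLEs under the null hypothesis are:

$$\hat{\lambda}_{1f} = \frac{x_1 x_f}{x_1 + x_0}, \quad \hat{\lambda}_{1p} = \frac{x_1 x_p}{x_1 + x_0}, \quad \hat{\lambda}_{0f} = \frac{x_0 x_f}{x_1 + x_0}, \quad \hat{\lambda}_{0p} = \frac{x_0 x_p}{x_1 + x_0} \tag{2}$$

We use the parameter estimates in the likelihood for the constrained null hypothesis. The MLEs for the unconstrained alternative hypothesis parameters are the variant allele counts for each group (i.e. $\hat{\lambda}_{1f} = x_{1f}$, $\hat{\lambda}_{1p} = x_{1p}$, $\hat{\lambda}_{0f} = x_{0f}$, $\hat{\lambda}_{0p} = x_{0p}$). We then complete a likelihood ratio test (LRT) using the ratio of the constrained (null hypothesis) and unconstrained (alternative hypothesis) likelihoods; by Wilks' theorem[8], minus twice the natural log of this ratio has an asymptotic chi-squared distribution with 1-df.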
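
The test described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of the ProxECAT R package: the function name `proxecat_lrt` and its helper are hypothetical, and the null estimates are taken to be row total times column total divided by the grand total, which is what maximising four independent Poisson likelihoods under the equal-ratio constraint yields. The 1-df chi-squared tail probability is computed with `math.erfc` so that no external library is needed.

```python
import math

def _xlogx_ratio(observed: float, expected: float) -> float:
    """Return observed * ln(observed / expected), using the 0 * ln(0) = 0 convention."""
    return 0.0 if observed == 0 else observed * math.log(observed / expected)

def proxecat_lrt(x1f: float, x1p: float, x0f: float, x0p: float):
    """Hypothetical sketch: Poisson LRT of H0 lambda_1f/lambda_1p = lambda_0f/lambda_0p."""
    x1, x0 = x1f + x1p, x0f + x0p   # row totals (cases, controls)
    xf, xp = x1f + x0f, x1p + x0p   # column totals (functional, proxy)
    total = x1 + x0
    if min(x1, x0, xf, xp) <= 0:
        raise ValueError("every row and column total must be positive")
    # Constrained (null) estimates: row total * column total / grand total.
    null = {"1f": x1 * xf / total, "1p": x1 * xp / total,
            "0f": x0 * xf / total, "0p": x0 * xp / total}
    obs = {"1f": x1f, "1p": x1p, "0f": x0f, "0p": x0p}
    # -2 ln(likelihood ratio); the linear Poisson terms cancel because
    # the fitted counts and the observed counts have the same grand total.
    stat = max(2.0 * sum(_xlogx_ratio(obs[k], null[k]) for k in obs), 0.0)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail of chi-squared, 1 df
    return stat, p_value

# Counts for CD22 in the SCOOP vs ExAC analysis, as listed in Table 4.
stat, p = proxecat_lrt(16, 1, 380, 247)
print(f"LRT statistic = {stat:.2f}, p-value = {p:.1e}")
```

With the CD22 counts from Table 4, this sketch gives a p-value close to the ProxECAT value reported there for the SCOOP vs ExAC analysis, which suggests the reconstruction is consistent with the published results.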

Extension to incorporate different depths of coverage

It has been shown that functional variants have a lower minor allele frequency (MAF) distribution compared to synonymous variants[9]. Further, high-depth of coverage sequencing will detect a higher amount of variation at lower MAFs compared to low-depth of coverage sequencing[9, 10]. This results in high-depth of coverage sequencing detecting more functional variation relative to synonymous variation compared to low-depth of coverage sequencing. To allow for scenarios where sequencing coverage varies considerably between cases and controls, we weight the observed functional variant minor allele counts. Specifically, we divide the number of minor alleles for functional variants by the median ratio of the number of minor alleles for functional to synonymous variants within cases ($M_1$) and within controls ($M_0$) separately:

$$x_{1f}^{w} = \frac{x_{1f}}{M_1}, \qquad x_{0f}^{w} = \frac{x_{0f}}{M_0}$$

The weighted functional variant minor allele counts, $x_{1f}^{w}$ and $x_{0f}^{w}$, are used in place of the observed functional variant minor allele counts, $x_{1f}$ and $x_{0f}$, respectively, to estimate the parameters in (2). This new test statistic is called ProxECAT-weighted.
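
As a rough illustration of the weighting step, the sketch below computes a median functional-to-synonymous ratio across genes and divides the functional counts by it. The function names are hypothetical, and skipping genes with zero synonymous alleles when forming the ratio is an assumption of this sketch rather than something stated in the text. The example call uses the genome-wide medians from Table 3 (3.00 for SCOOP cases, 1.89 for UK10K Cohort controls) and the CD22 functional counts from Table 4; the package may apply the weights differently in detail.

```python
import statistics

def median_ratio(functional_counts, synonymous_counts):
    """Median over genes of functional / synonymous minor allele counts.

    Assumption of this sketch: genes with no synonymous alleles are skipped.
    """
    ratios = [f / s for f, s in zip(functional_counts, synonymous_counts) if s > 0]
    if not ratios:
        raise ValueError("no gene has a positive synonymous count")
    return statistics.median(ratios)

def weight_functional_counts(x1f, x0f, m1, m0):
    """Divide case and control functional counts by their group-specific medians."""
    if m1 <= 0 or m0 <= 0:
        raise ValueError("median ratios must be positive")
    return x1f / m1, x0f / m0

# Toy per-gene counts, only to show how a median ratio would be formed.
toy_m = median_ratio([6, 9, 4, 10], [2, 3, 0, 2])

# CD22 functional counts (Table 4) with the genome-wide medians from Table 3.
w1f, w0f = weight_functional_counts(15, 13, m1=3.00, m0=1.89)
print(f"toy median ratio = {toy_m}, weighted counts = {w1f:.2f}, {w0f:.2f}")
```

The weighted counts are generally not integers; they simply replace the observed functional counts when the null estimates are computed.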

Extension to negative binomial

By assuming a Negative Binomial distribution for the number of minor alleles in a region instead of a Poisson distribution, we extend ProxECAT to incorporate possible over-dispersion. We model the Negative Binomial distribution with the mean, λ, and over-dispersion, η, parameters where the distribution approaches Poisson as η becomes large (S1 Fig).
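
The limiting behaviour mentioned above can be checked numerically. The sketch below assumes the mean and size parameterisation used by R's `rnbinom` (variance λ + λ²/η) and compares the Negative Binomial probability mass function with the Poisson one for a mean of 20. The function names are hypothetical.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of k events with mean lam (computed on the log scale)."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def neg_binomial_pmf(k: int, lam: float, eta: float) -> float:
    """Negative Binomial pmf with mean lam and over-dispersion (size) eta.

    Under this parameterisation the variance is lam + lam**2 / eta.
    """
    log_p = (math.lgamma(k + eta) - math.lgamma(eta) - math.lgamma(k + 1)
             + eta * math.log(eta / (eta + lam))
             + k * math.log(lam / (eta + lam)))
    return math.exp(log_p)

def max_pmf_gap(lam: float, eta: float, k_max: int = 200) -> float:
    """Largest absolute difference between the two pmfs over k = 0..k_max."""
    return max(abs(neg_binomial_pmf(k, lam, eta) - poisson_pmf(k, lam))
               for k in range(k_max + 1))

for eta in (5, 50, 5000):
    variance = 20 + 20 ** 2 / eta
    print(f"eta = {eta}: variance = {variance:.2f}, max pmf gap = {max_pmf_gap(20, eta):.5f}")
```

The gap between the two distributions shrinks as η grows, while small η gives a variance well above the mean.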

Type I error and power simulation results

We simulated a variety of confounding scenarios. Case-control confounding represents systematic, genome-wide differences in the number of rare minor alleles observed in cases and controls due to differences in sequencing technologies and pipelines. Gene confounding refers to a gene having a higher or lower number of rare minor alleles than expected based on gene length. Gene confounding can occur in both cases and controls for a variety of reasons including differences in mutation rates, ability to detect variants, and annotation quality. Confounding can also occur when a particular gene region has a different number of rare minor alleles in cases and in controls due to sequencing differences between cases and controls. This confounding is distinct from case-control confounding in that it is isolated to a particular gene region rather than genome-wide. Here, we refer to this confounding as gene confounding only in cases. The simulation scenarios and parameters are presented in Table 2 and Supplemental Table 1.

Table 2

Simulation parameters.

| Parameter | Values |
|---|---|
| Baseline variant minor allele rate | 0.001 per subject per 1Kb |
| Association variant minor allele rate | 0.001 * (1.2, 1.4, 1.6, 1.8, 2, 3) |
| Gene length | 20, 40 Kb |
| Case set sample size | 500, 1000 |
| Control set sample size | 500, 1000, 10000, 40000, 100000 |
| Gene confounding | In cases and controls: 0.001 * (1, 1.2, 1.5, 2). Only in cases: 0.001 * (1, 1.2, 1.5, 2) |
| Case control confounding | In cases: 0.001 * (1, 1.1, 1.3, 1.5) |

The case-control LRT (see Software and Statistical Analysis under Subjects and Methods) was robust to gene confounding scenarios maintaining the appropriate type I error rate but had an increased type I error rate in the presence of case-control confounding. The case-only LRT maintained appropriate type I error rate in the presence of case-control confounding but was inflated in the presence of gene-confounding. The inflation in the type I error for the case-control LRT and the case-only LRT increased further when both gene and case-control confounding were present. This was especially true for the case-control LRT (Fig 1).
Fig 1

Type I error and power estimates for case-only LRT, case-control LRT, and ProxECAT.

Estimates provided over various confounding simulation scenarios. General simulation parameters: gene-length = 20Kb, baseline mutation rate = 0.001 per person per 1Kb. Left Plot: type I error rate for Ncases = Ncontrols = 1000 and combinations of case-control confounding (mid level) and gene confounding (low level); dashed line represents expected type I error rate of 0.05 and dotted lines represent 95% confidence interval around the expected type I error rate. (A) Null simulation with no case-control or gene confounding bias; (B) gene-confounding; (C) gene-confounding only in cases; (D) case-control confounding; (E) case-control confounding and gene confounding; (F) case-control confounding and gene confounding only in cases. Right Plot: power for an effect size of 2 for case-control LRT (Ncases = 500; Ncontrols = 500) and ProxECAT (Ncases = 1000) and various external controls sample size. Dashed line is the case-control LRT power and dotted lines represent 95% confidence interval around the estimated power for case-control LRT.

Despite usually being within the 95% confidence interval for type I error, ProxECAT appeared to have a slight, but consistent inflation (Supplemental Table 2). This minor, but consistent inflation in the type I error rate can be addressed by using a more conservative significance threshold. We found that multiplying the significance level by 0.9 works well such that a 0.045 significance threshold maintains a 0.05 type I error rate, a 0.009 significance threshold maintains a 0.01 type I error rate, etc. Both the case-control LRT used here and ProxECAT assume a Poisson distribution and had inflated Type I Error rate in the presence of overdispersion (S3 Table). ProxECAT-over, which assumes a Negative Binomial distribution instead of a Poisson distribution, corrects for overdispersion in simulations when the overdispersion parameter is known and overdispersion is not too extreme (i.e. over-dispersion, η ≥ 5) (S3 Table).
Case-control LRT had higher power than ProxECAT under scenarios of no case-control confounding and given the same sample size (S4 Table). However, the power of ProxECAT increased as the sample size of the external control set increased eventually reaching higher power than the case-control LRT for the same number of internal sequences (Fig 1). This increase in power for ProxECAT is due, in part, to being able to sequence more cases with ProxECAT (N = 1000) than with a case-control LRT where sequencing resources need to be split between cases and controls (here Ncases = 500 and Ncontrols = 500). ProxECAT’s power increased while the type I error stayed the same under confounding scenarios where the number of functional variants in the cases increases (S4 Table).

Assessing fit of the Poisson distribution

To assess the fit of the Poisson distribution and specifically look for over dispersion, we simulated rare minor alleles assuming a Binomial distribution for each variant and compared these results to the theoretical Poisson distribution for the number of rare minor alleles in a genetic region. No over dispersion was apparent as the sampling mean and variance of the simulated scenarios were similar across different sample sizes, MAFs, and number of minor alleles per gene (S2 and S3 Figs). When the expected number of minor alleles per gene was greater than 20, the Poisson approximation for the number of minor alleles started to look more continuous. In other words, as the expected number of variants per gene decreased, the Poisson approximation became more discrete and multimodal (S2 and S3 Figs). The theoretical distribution for the number of minor alleles per gene created from simulating genotypes for individual, independent variants from a Binomial distribution was more robust to discretization maintaining a mostly continuous distribution until the expected number of minor alleles per gene was equal to or less than four.
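
A short calculation indicates why no over-dispersion would be expected in this setting. As a simplifying assumption for illustration, suppose a gene region contains $m$ independent variants that share the same MAF $p$, genotyped in $N$ diploid subjects under Hardy-Weinberg Equilibrium. The total number of minor alleles $X$ is then Binomial with $2Nm$ trials, so that

$$E[X] = 2Nmp, \qquad \operatorname{Var}[X] = 2Nmp(1 - p), \qquad \frac{\operatorname{Var}[X]}{E[X]} = 1 - p.$$

For the MAFs considered here ($p \le 0.005$), the variance-to-mean ratio is at least 0.995, which is very close to the value of 1 implied by a Poisson distribution.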

SCOOP data analysis

We evaluated ProxECAT using 927 cases from the Severe Childhood Onset Obesity Project (SCOOP) sample as cases and either 3,621 UK10K Cohort or 33,370 ExAC non-Finnish Europeans as controls. High-depth of coverage WES SCOOP cases vs. low-depth of coverage WGS UK10K Cohort controls had an inflated distribution of test statistics for the case-control LRT both at the center (lambda = 1.971) and in the tail of the distribution. While we did not observe inflation in the tail of the distribution for ProxECAT (Fig 2), there was a large inflation in the overall distribution of test statistics (lambda = 3.151). We observed a much higher ratio of the number of minor alleles in functional to synonymous variants per gene for the high-depth of coverage cases, median = 3.00, versus the low-depth of coverage controls, median = 1.89 (Table 3). ProxECAT-weighted, which adjusts for this systematic difference in sequencing coverage, resulted in a distribution of observed test statistics that more closely matches the expected distribution (lambda = 1.026, Fig 2).
Fig 2

Quantile-Quantile plots for SCOOP cases vs. UK10K Cohort controls.

Internal MAF < 0.01 in both cases and controls and number of variant minor alleles per gene ≥ 5. N genes = 11,051. 95% confidence interval of expected results in gray. ProxECAT (blue, lambda = 3.151), ProxECAT-weighted (orange, lambda = 1.026), case-control (black, lambda = 1.971). A) all tests, B) ProxECAT-weighted only.

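Table 3

Genome-wide descriptive statistics for the ratio of the number of functional and synonymous variant minor alleles per gene in cases and controls.

| Analysis | Sample | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|
| SCOOP vs UK10K Cohort | SCOOP cases | 0.01 | 2.00 | 3.00 | 6.00 | 124 |
| SCOOP vs UK10K Cohort | UK10K Cohort controls | 0.02 | 1.02 | 1.89 | 3.33 | 120 |
| SCOOP vs ExAC | SCOOP cases | 0.07 | 1.00 | 1.40 | 3.00 | 29 |
| SCOOP vs ExAC | ExAC | 0.02 | 1.00 | 1.65 | 2.55 | 109 |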

A large strength of this method is the ability to use allele frequency data directly, rather than individual level allele calls. To assess the ability of this method to use publicly available allele frequency data, we used ExAC allele frequencies as controls for the SCOOP cases. The standard case-control LRT was inflated at both the median, lambda = 1.713, and tail (Fig 3) while our method maintained the expected distribution of test statistics. Because the depth of sequencing coverage is comparable and high for both SCOOP cases and ExAC controls, ProxECAT-weighted produced similar results to the standard, un-weighted test.
Fig 3

Quantile-Quantile plots for SCOOP cases vs. ExAC controls.

Internal MAF < 0.001 in both cases and controls and number of variant minor alleles per gene ≥ 5. N genes = 15,863. 95% confidence interval of expected results in gray. ProxECAT (blue, lambda = 1.163), ProxECAT-weighted (orange, lambda = 1.069), case-control (black, lambda = 1.713) A) all tests, B) ProxECAT and ProxECAT-weighted only.

For both analyses, filtering to very rare variants was essential to avoid inflation in the distribution of observed test-statistics. This can be accomplished using moderate internal frequency filters and an external dataset such as 1000Genomes (MAF < 1%) as in the SCOOP vs UK10K Cohort analysis or using more stringent internal frequency filters (MAF < 0.1%) and no external dataset as in the SCOOP vs ExAC analysis. Four genes, passing a 0.01 level of significance in both the SCOOP vs UK10K Cohort analysis and in the SCOOP vs ExAC analysis, are shown in Table 4. These results are putative novel obesity candidates meriting further replication. MIB2 may be of particular interest as it is associated with decreased body weight in mice in the International Mouse Phenotyping Consortium (p-value = $7.49 \times 10^{-10}$, http://www.mousephenotype.org/data/genes/MGI:2679684). Additional genes with the smallest p-values are found in S5–S7 Tables.
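
Table 4

Gene-based results for genes with p-value < 0.01 in SCOOP vs. Cohort and SCOOP vs ExAC.

Columns 2 to 6 refer to the SCOOP vs Cohort analysis and columns 7 to 11 to the SCOOP vs ExAC analysis. Counts are given as functional/proxy minor alleles.

| Gene | SCOOP $x_{1f}/x_{1p}$ | Cohort $x_{0f}/x_{0p}$ | p-value: ProxECAT | p-value: ProxECAT-weighted | p-value: case-control | SCOOP $x_{1f}/x_{1p}$ | ExAC $x_{0f}/x_{0p}$ | p-value: ProxECAT | p-value: ProxECAT-weighted | p-value: case-control |
|---|---|---|---|---|---|---|---|---|---|---|
| CD22 | 15/0 | 13/18 | 1.1E-05 | 2.1E-03 | 1.1E-04 | 16/1 | 380/247 | 1.5E-03 | 1.4E-03 | 1.3E-01 |
| MIB2 | 0/8 | 62/16 | 1.9E-06 | 1.2E-04 | 1.1E-07 | 0/4 | 600/361 | 5.2E-03 | 1.8E-02 | 9.8E-09 |
| NDEL1 | 13/0 | 18/25 | 1.7E-05 | 2.0E-03 | 6.6E-03 | 11/1 | 357/268 | 8.1E-03 | 5.7E-03 | 7.4E-01 |
| PRDM13 | 9/0 | 13/33 | 1.1E-05 | 8.0E-03 | 2.9E-02 | 7/0 | 173/116 | 7.8E-03 | 6.8E-03 | 3.6E-01 |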

Sensitivity of proxy selection

Within the SCOOP vs. ExAC analysis, we completed a sensitivity analysis using three increasingly broad proxy selection strategies of Sequence Ontology terms: (1) synonymous (SYN); (2) predicted low impact rating from Ensembl [11] (LOW); and (3) not in our functional category (NOT FUNC). These strategies are nested with LOW Sequence Ontology terms included in NOT FUNC, and SYN Sequence Ontology terms included in both LOW and NOT FUNC. We assessed consistency across the number of alternate alleles and in the distribution of test statistics across the three proxy selection strategies. As expected given the nested nature of the proxy selection strategies, SYN had a smaller number of alternate alleles than either LOW or NOT FUNC and LOW had a smaller number of alternate alleles than NOT FUNC. SYN and LOW proxy selection strategies produced similar numbers of alternate alleles per gene while the correlation was lower for NOT FUNC with either SYN or LOW (S4 Fig). We found similar consistency in the distributions of test statistics between the proxy selection strategies (S5 Fig).

Discussion

We propose a new method, ProxECAT, to test for enrichment of an accumulation of very rare variant alleles in a gene-region using publicly available external allele frequencies. ProxECAT only requires allele frequencies and uses exclusively external controls enabling the use of large, publicly available datasets such as ExAC and gnomAD. Analyses in simulations and using UK10K Cohort and ExAC as control sets for childhood obesity cases show that ProxECAT keeps the type I error rate and expected distribution of test statistics under control despite differences in sequencing technology and processing. Because ProxECAT uses external controls, additional resources can be devoted to sequencing cases. This results in greater power for ProxECAT compared to the case-control LRT test for the same number of internally sequenced individuals. There are several limitations to the method proposed here. First, ProxECAT has a minor, but consistent inflation in the type I error rate. This limitation is easily addressed by using a more conservative significance threshold. Second, ProxECAT cannot currently include covariates such as sex, and ancestry. Thus, internal cases and external controls should be closely matched by ancestry and, as with any association study, findings will need independent replication preferably using a study where cases and controls are sequenced and processed in parallel. Third, the current approach does not enable internal controls to be analyzed along with external controls. While two analyses can be done in parallel and compared, it would be ideal to incorporate internal and external controls into the same statistical test. We are actively working on extensions to address these limitations. It is important to highlight that research utilizing solely external controls is more susceptible to confounding due to known or unknown factors. 
Thus, any genes identified using ProxECAT or any method that uses only external controls should be carefully followed up in further validation, replication, and functional studies. ProxECAT provides a robust approach to using allele frequencies from existing, publicly available sequencing data enabling case-control analysis when no or limited internal controls exist. ProxECAT uses the insight that readily available genomic information often discarded from analyses (here synonymous variation) can adjust for sizeable confounding due to differences in data generation. In the era of big data, we hope that both this insight and the ProxECAT method will enable additional genetic discoveries and will also motivate future methodological advancements in analyzing data across technologies and platforms.

Materials and methods

Software and statistical analysis

All tests were implemented using functions from our accompanying R package ProxECAT (https://github.com/hendriau/ProxECAT). Our primary test, which can model both ProxECAT and ProxECAT-weighted, was implemented with the proxecat function and our secondary test modeling over-dispersion was implemented using the proxecat.over function. We also implemented a case-control LRT to test for enrichment of rare, functional variant alleles in cases vs. controls and a case-only LRT similar to that performed by Zhi and Chen in 2012 [12]. The case-only LRT tests for enrichment of rare alleles for functional variants in each gene of interest compared to the genome-wide average number of minor alleles per gene in cases only adjusting for the length of each gene. Unless otherwise specified, we assumed the data follow a Poisson distribution for all LRTs.
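
For comparison with ProxECAT, a Poisson case-control LRT on functional alleles alone can be sketched as follows. This is an assumption-laden illustration rather than the package's implementation: the function name is hypothetical, and the null is taken to be an equal per-subject rate of functional minor alleles in cases and controls, with the pooled rate as its estimate.

```python
import math

def case_control_lrt(x1f: float, n_cases: int, x0f: float, n_controls: int):
    """Hypothetical sketch: Poisson LRT of equal per-subject functional allele rates."""
    pooled_rate = (x1f + x0f) / (n_cases + n_controls)
    if pooled_rate == 0:
        return 0.0, 1.0  # no alleles observed in either group
    observed = (x1f, x0f)
    expected = (n_cases * pooled_rate, n_controls * pooled_rate)
    # -2 ln(likelihood ratio), with the 0 * ln(0) = 0 convention.
    stat = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    stat = max(stat, 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))  # chi-squared tail, 1 df

# Functional allele counts from Table 4 (SCOOP vs ExAC), with 927 cases
# and 33,370 ExAC non-Finnish European controls.
for gene, x1f, x0f in (("CD22", 16, 380), ("MIB2", 0, 600)):
    stat, p = case_control_lrt(x1f, 927, x0f, 33370)
    print(f"{gene}: statistic = {stat:.2f}, p-value = {p:.1e}")
```

Because this test ignores the proxy counts, any systematic difference in how well a region is sequenced in cases and controls feeds directly into the statistic.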

Type I error and power simulations

Within each case-control confounding simulation, we simulated 20,000 independent genes under four gene-disease association and gene confounding states. The four distinct gene states are: (1) association with case status and no gene confounding, (2) association with case status and gene confounding, (3) no association with case status and gene confounding, (4) no association with case status and no gene confounding. The number of rare minor alleles per gene was simulated under a Poisson distribution or an over-dispersed Poisson modeled using a Negative Binomial parameterization using the R functions rpois and rnbinom, respectively. The mu and size parameters in rnbinom represent the mean and over-dispersion, respectively. To assess the fit of the Poisson distribution, we simulated the number of each genotype group for each variant assuming Hardy-Weinberg Equilibrium and a Binomial distribution where p was the MAF. We varied the MAF (0.0001, 0.0005, 0.001, 0.005), the sample size (1000; 10,000), and the maximum number of variable variants within the gene region (5, 10, 20). We then assessed how closely the simulated distributions of the number of minor alleles observed per gene region matched a theoretical Poisson distribution where λ was the mean from each simulation scenario.
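
A reduced version of the null simulation can be sketched in Python using only the standard library. Several choices here are assumptions made for illustration and are not taken from the text: functional and proxy alleles share the same baseline rate, case-control confounding multiplies both rates in cases by a common factor, only 2,000 genes are simulated, and a simple Knuth-style sampler stands in for R's `rpois`. All function names are hypothetical.

```python
import math
import random

def rpois(rng: random.Random, lam: float) -> int:
    """Knuth's multiplication sampler; adequate for the small means used here."""
    threshold, k, prod = math.exp(-lam), 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def lrt_p(observed, expected) -> float:
    """1-df chi-squared p-value for a Poisson LRT (0 * ln(0) treated as 0)."""
    stat = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

def proxecat_p(x1f, x1p, x0f, x0p) -> float:
    x1, x0, xf, xp = x1f + x1p, x0f + x0p, x1f + x0f, x1p + x0p
    total = x1 + x0
    if total == 0:
        return 1.0
    expected = (x1 * xf / total, x1 * xp / total, x0 * xf / total, x0 * xp / total)
    return lrt_p((x1f, x1p, x0f, x0p), expected)

def case_control_p(x1f, n1, x0f, n0) -> float:
    rate = (x1f + x0f) / (n1 + n0)
    return 1.0 if rate == 0 else lrt_p((x1f, x0f), (n1 * rate, n0 * rate))

def type1_error(n_genes=2000, case_factor=1.0, n_cases=1000, n_controls=1000,
                rate=0.001, gene_kb=20, alpha=0.05, seed=1):
    """Fraction of null genes rejected by (ProxECAT sketch, case-control LRT sketch)."""
    rng = random.Random(seed)
    lam_cases = rate * gene_kb * n_cases * case_factor  # confounding inflates cases only
    lam_controls = rate * gene_kb * n_controls
    hits_prox = hits_cc = 0
    for _ in range(n_genes):
        x1f, x1p = rpois(rng, lam_cases), rpois(rng, lam_cases)
        x0f, x0p = rpois(rng, lam_controls), rpois(rng, lam_controls)
        hits_prox += proxecat_p(x1f, x1p, x0f, x0p) < alpha
        hits_cc += case_control_p(x1f, n_cases, x0f, n_controls) < alpha
    return hits_prox / n_genes, hits_cc / n_genes

null_rates = type1_error(case_factor=1.0)
confounded_rates = type1_error(case_factor=1.3)
print("no confounding (ProxECAT, case-control):", null_rates)
print("case-control confounding x1.3          :", confounded_rates)
```

Under these assumptions the ratio of functional to proxy rates is unchanged by the confounding factor, so the ProxECAT sketch should stay near the nominal level while the case-control LRT sketch rejects more often.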

UK10K SCOOP

Whole-exome sequenced (WES) cases are from the Severe Childhood Onset Obesity Project (SCOOP) cohort[6, 13], which is a self-reported UK European subset of the Genetics of Obesity Study (GOOS). GOOS includes individuals with severe early-onset obesity (body mass index (BMI) standard deviation score (SDS) > 3 and age at onset of obesity < 10 years). Leptin deficient individuals (identified by biochemical measurement) and those with mutations in the MC4R gene were excluded. We used VerifyBamID (v1.0)[14] and a threshold of ≥3% to identify contaminated samples. We computed principal components with the 1000Genomes Phase I integrated call set[9] using EIGENSTRAT v4.2[15] to identify non-Europeans, and pairwise identity by descent estimates from PLINK v1.07[16] with a threshold of ≥0.125 to identify related individuals. Contaminated, non-European, and related samples were removed, resulting in 927 SCOOP cases for analysis. Details about sequencing and variant calling for the SCOOP cases, as part of the UK10K exomes, can be found elsewhere[17]. All participants gave written informed consent and all methods were performed in accordance with the relevant laboratory/clinical guidelines and regulations.

UK10K cohort

The whole-genome sequenced (WGS) controls consist of the UK10K Cohort sample, comprised of two population cohorts: the Avon Longitudinal Study of Parents and Children (ALSPAC) and the TwinsUK study from the Department of Twin Research and Genetic Epidemiology at King’s College London (TwinsUK). We used allele frequency data for 3,621 individuals that passed sample QC as described elsewhere[17].

Exome Aggregation Consortium

We used allele frequency values for the N = 33,370 non-Finnish European (NFE) group from the ExAC variant site dataset version 1.0 (http://exac.broadinstitute.org/downloads)[2].

Variant and gene filtering

To focus on rare or very rare variants, we limited the analysis to variants below a pre-specified MAF threshold in both cases and controls. We used MAF ≤ 1% in the SCOOP cases vs. UK10K cohort controls analysis and MAF ≤ 0.1% in the SCOOP vs. ExAC analysis. For the SCOOP cases vs. UK10K controls analysis, we also applied external filtering excluding variants with a MAF > 1% in at least one of the 1000Genomes five primary ancestry groups. Exclusion by 1000Genomes MAF was not possible when using ExAC as 1000Genomes samples are included in the ExAC genotype frequencies. We explored the distribution of test statistics over several thresholds for the minimum number of functional ($x_f$) and proxy ($x_p$) variants within each gene (5, 10, and 20). Analysis regions were limited to the intersection of respective target regions for SCOOP vs. UK10K Cohort and for SCOOP vs. ExAC. All variant annotation was applied using the GRCh37 human reference. The Ensembl Variant Effect Predictor (VEP; http://www.ensembl.org/info/docs/tools/vep/index.html)[11], v79 and v90.1, was used to add variant consequence annotations for SCOOP vs. UK10K Cohort and SCOOP vs. ExAC, respectively. We defined functional variation using the following Sequence Ontology[18] variant consequence terms: splice_donor_variant, splice_acceptor_variant, stop_gained, frameshift_variant, stop_lost, initiator_codon_variant, inframe_insertion, inframe_deletion, missense_variant, and protein_altering_variant. Variants were considered synonymous if they had the “synonymous_variant” flag. We defined the LOW proxy group as having a predicted low impact rating from Ensembl, SO terms: splice_region_variant, incomplete_terminal_codon_variant, stop_retained_variant, synonymous_variant.
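
The consequence-based grouping can be expressed as a small lookup. The sketch below uses the Sequence Ontology terms listed above; the function name and the handling of a single consequence term per variant are assumptions of this illustration (annotation tools may report several consequences per variant, and the text does not say how such cases were resolved).

```python
# Sequence Ontology terms defining the functional category, as listed in the text.
FUNCTIONAL_TERMS = frozenset({
    "splice_donor_variant", "splice_acceptor_variant", "stop_gained",
    "frameshift_variant", "stop_lost", "initiator_codon_variant",
    "inframe_insertion", "inframe_deletion", "missense_variant",
    "protein_altering_variant",
})

# Terms with a predicted low impact rating (LOW proxy group).
LOW_TERMS = frozenset({
    "splice_region_variant", "incomplete_terminal_codon_variant",
    "stop_retained_variant", "synonymous_variant",
})

def proxy_strategies(consequence: str) -> set:
    """Return the nested proxy groups (SYN, LOW, NOT FUNC) a term belongs to.

    An empty set means the term is in the functional category.
    """
    if consequence in FUNCTIONAL_TERMS:
        return set()
    groups = {"NOT FUNC"}  # everything outside the functional category
    if consequence in LOW_TERMS:
        groups.add("LOW")
    if consequence == "synonymous_variant":
        groups.add("SYN")
    return groups

for term in ("missense_variant", "synonymous_variant", "splice_region_variant", "intron_variant"):
    print(term, "->", sorted(proxy_strategies(term)) or "functional")
```

Returning a set of group labels makes the nesting explicit: every SYN term is also in LOW, and every LOW term is also in NOT FUNC.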

Assessing results from real data analysis

We used quantile-quantile plots (QQ-plots) to assess the resulting distribution of test statistics from the real data applications. Specifically, we looked at the middle of the distribution of test statistics as assessed by the lambda value (i.e. the median of the observed test statistic divided by the median of the expected test statistic) and the tail of the distribution of test statistics, which we assessed visually.
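
The lambda value described above can be computed directly from a list of test statistics. In this sketch the expected median is that of a chi-squared distribution with 1 df, obtained as the square of the 75th percentile of a standard normal; the function and constant names are hypothetical.

```python
import statistics

# Median of a chi-squared distribution with 1 df (about 0.455): the square
# of the 0.75 quantile of the standard normal distribution.
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2

def inflation_lambda(test_statistics) -> float:
    """Median observed test statistic divided by the expected median under the null."""
    return statistics.median(test_statistics) / CHI2_1DF_MEDIAN

# Toy statistics, only to show the call; real input would be one LRT statistic per gene.
toy_stats = [0.05, 0.21, 0.46, 0.90, 1.70, 2.40, 5.10]
print(f"lambda = {inflation_lambda(toy_stats):.3f}")
```

A value near 1 indicates that the centre of the observed distribution matches the null expectation; the tail still needs to be inspected on the QQ-plot.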

R Package

The ProxECAT R package and functions are available on GitHub: https://github.com/hendriau/ProxECAT.

Comparison of Poisson and Negative Binomial distributions for μ = 20.

(PNG)

Comparison of the number of rare alleles in a gene region from a simulated variant level binomial distribution (black) and a theoretical Poisson distribution (red) for a sample size of 10,000.

MAF = 0.0001, 0.0005, 0.001, 0.005; number of minor variant alleles within the gene region = 5, 10, 20. (PDF)

Comparison of the number of rare alleles in a gene region from a simulated variant level binomial distribution (black) and a theoretical Poisson distribution (red) for a sample size of 1,000.

MAF = 0.0001, 0.0005, 0.001, 0.005; number of variants within the gene region = 5, 10, 20. (PDF)

Sensitivity analysis of proxy selection strategies in SCOOP vs. ExAC; scatter plots.

Comparison of the natural log of the number of alternate alleles observed in each gene region for functional variants (FUNC) and three proxy selection strategies: synonymous (SYN), low impact (LOW), not functional (NOT FUNC). Top right panels: scatter plots with y = x line. Bottom left panels: correlation coefficient. (PDF)

QQ-plots for the test results of proxy selection strategies for SCOOP vs. ExAC: synonymous (SYN), low impact (LOW), not functional (NOT FUNC).

Internal MAF < 0.001 and number of alleles per gene ≥ 5 for functional and proxy. ProxECAT (blue), ProxECAT-weighted (orange), 95% confidence interval of expected results in gray. Left: SYN, Ngenes = 15,779 (ProxECAT lambda = 1.233, ProxECAT-weighted lambda = 1.081). Middle: LOW, Ngenes = 15,874 (ProxECAT lambda = 1.215, ProxECAT-weighted lambda = 1.119). Right: NOT FUNC, Ngenes = 16,011 (ProxECAT lambda = 1.18, ProxECAT-weighted lambda = 1.18). For the NOT FUNC proxy group, the weights for ProxECAT-weighted are one for both cases and controls, resulting in identical distributions of test statistics for ProxECAT and ProxECAT-weighted. (PNG)

Gene confounding and case-control confounding simulation design.

Darker shading indicates a higher level of gene confounding. Solid shading indicates gene confounding in both cases and controls. Striped shading indicates gene confounding in only cases. (XLSX)

Type I Error over all simulation scenarios.

(XLSX)

Type I Error for over-dispersed simulations.

Gene length = 20Kb, Ncases = 1000, Ncontrols = 1000, no confounding. (XLSX)

Power over all simulation scenarios.

(XLSX)

Top 100 results for SCOOP vs. Cohort ordered by ProxECAT-weighted p-value.

(XLSX)

Top 100 results for SCOOP vs. ExAC ordered by ProxECAT p-value.

(XLSX)

Results with p-value < 0.05 for both SCOOP vs. Cohort and SCOOP vs. ExAC.

(XLSX)

Derivation of ProxECAT.

(PDF)

Full results for SCOOP vs. Cohort analysis.

(ZIP)

Full results for SCOOP vs. ExAC analysis.

(ZIP)

Read me file for S1 and S2 Results.

(TXT)
  16 in total

1.  Detecting and estimating contamination of human DNA samples in sequencing and array-based genotype data.

Authors:  Goo Jun; Matthew Flickinger; Kurt N Hetrick; Jane M Romm; Kimberly F Doheny; Gonçalo R Abecasis; Michael Boehnke; Hyun Min Kang
Journal:  Am J Hum Genet       Date:  2012-10-25       Impact factor: 11.025

2.  Searching for missing heritability: designing rare variant association studies.

Authors:  Or Zuk; Stephen F Schaffner; Kaitlin Samocha; Ron Do; Eliana Hechter; Sekar Kathiresan; Mark J Daly; Benjamin M Neale; Shamil R Sunyaev; Eric S Lander
Journal:  Proc Natl Acad Sci U S A       Date:  2014-01-17       Impact factor: 11.205

3.  Improving power for rare-variant tests by integrating external controls.

Authors:  Seunggeun Lee; Sehee Kim; Christian Fuchsberger
Journal:  Genet Epidemiol       Date:  2017-06-28       Impact factor: 2.135

4.  Statistical guidance for experimental design and data analysis of mutation detection in rare monogenic mendelian diseases by exome sequencing.

Authors:  Degui Zhi; Rui Chen
Journal:  PLoS One       Date:  2012-02-10       Impact factor: 3.240

5.  The Sequence Ontology: a tool for the unification of genome annotations.

Authors:  Karen Eilbeck; Suzanna E Lewis; Christopher J Mungall; Mark Yandell; Lincoln Stein; Richard Durbin; Michael Ashburner
Journal:  Genome Biol       Date:  2005-04-29       Impact factor: 13.583

6.  A global reference for human genetic variation.

Authors:  Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal:  Nature       Date:  2015-10-01       Impact factor: 49.962

7.  Analysis of protein-coding genetic variation in 60,706 humans.

Authors:  Monkol Lek; Konrad J Karczewski; Eric V Minikel; Kaitlin E Samocha; Eric Banks; Timothy Fennell; Anne H O'Donnell-Luria; James S Ware; Andrew J Hill; Beryl B Cummings; Taru Tukiainen; Daniel P Birnbaum; Jack A Kosmicki; Laramie E Duncan; Karol Estrada; Fengmei Zhao; James Zou; Emma Pierce-Hoffman; Joanne Berghout; David N Cooper; Nicole Deflaux; Mark DePristo; Ron Do; Jason Flannick; Menachem Fromer; Laura Gauthier; Jackie Goldstein; Namrata Gupta; Daniel Howrigan; Adam Kiezun; Mitja I Kurki; Ami Levy Moonshine; Pradeep Natarajan; Lorena Orozco; Gina M Peloso; Ryan Poplin; Manuel A Rivas; Valentin Ruano-Rubio; Samuel A Rose; Douglas M Ruderfer; Khalid Shakir; Peter D Stenson; Christine Stevens; Brett P Thomas; Grace Tiao; Maria T Tusie-Luna; Ben Weisburd; Hong-Hee Won; Dongmei Yu; David M Altshuler; Diego Ardissino; Michael Boehnke; John Danesh; Stacey Donnelly; Roberto Elosua; Jose C Florez; Stacey B Gabriel; Gad Getz; Stephen J Glatt; Christina M Hultman; Sekar Kathiresan; Markku Laakso; Steven McCarroll; Mark I McCarthy; Dermot McGovern; Ruth McPherson; Benjamin M Neale; Aarno Palotie; Shaun M Purcell; Danish Saleheen; Jeremiah M Scharf; Pamela Sklar; Patrick F Sullivan; Jaakko Tuomilehto; Ming T Tsuang; Hugh C Watkins; James G Wilson; Mark J Daly; Daniel G MacArthur
Journal:  Nature       Date:  2016-08-18       Impact factor: 49.962

8.  Genome-wide SNP and CNV analysis identifies common and low-frequency variants associated with severe early-onset obesity.

Authors:  Eleanor Wheeler; Ni Huang; Elena G Bochukova; Julia M Keogh; Sarah Lindsay; Sumedha Garg; Elana Henning; Hannah Blackburn; Ruth J F Loos; Nick J Wareham; Stephen O'Rahilly; Matthew E Hurles; Inês Barroso; I Sadaf Farooqi
Journal:  Nat Genet       Date:  2013-04-07       Impact factor: 38.330

9.  The Ensembl Variant Effect Predictor.

Authors:  William McLaren; Laurent Gil; Sarah E Hunt; Harpreet Singh Riat; Graham R S Ritchie; Anja Thormann; Paul Flicek; Fiona Cunningham
Journal:  Genome Biol       Date:  2016-06-06       Impact factor: 13.583

10.  The UK10K project identifies rare variants in health and disease.

Authors:  Klaudia Walter; Josine L Min; Jie Huang; Lucy Crooks; Yasin Memari; Shane McCarthy; John R B Perry; ChangJiang Xu; Marta Futema; Daniel Lawson; Valentina Iotchkova; Stephan Schiffels; Audrey E Hendricks; Petr Danecek; Rui Li; James Floyd; Louise V Wain; Inês Barroso; Steve E Humphries; Matthew E Hurles; Eleftheria Zeggini; Jeffrey C Barrett; Vincent Plagnol; J Brent Richards; Celia M T Greenwood; Nicholas J Timpson; Richard Durbin; Nicole Soranzo
Journal:  Nature       Date:  2015-09-14       Impact factor: 49.962
