Literature DB >> 34337551

BinomiRare: A robust test for association of a rare genetic variant with a binary outcome for mixed models and any case-control proportion.

Tamar Sofer^1,2,3, Jiwon Lee³, Nuzulul Kurniansyah³, Deepti Jain⁴, Cecelia A Laurie⁴, Stephanie M Gogarten⁴, Matthew P Conomos⁴, Ben Heavner⁴, Yao Hu⁵, Charles Kooperberg⁵, Jeffrey Haessler⁵, Ramachandran S Vasan^6,7, L Adrienne Cupples^7,8, Brandon J Coombes⁹, Amanda Seyerle¹⁰, Sina A Gharib¹¹, Han Chen^12,13, Jeffrey R O'Connell¹⁴, Man Zhang¹⁵, Daniel J Gottlieb³, Bruce M Psaty^16,17, W T Longstreth¹⁷, Jerome I Rotter¹⁸, Kent D Taylor¹⁸, Stephen S Rich¹⁹, Xiuqing Guo¹⁸, Eric Boerwinkle^12,20, Alanna C Morrison¹², James S Pankow²¹, Andrew D Johnson^7,22, Nathan Pankratz²³, Alex P Reiner⁵, Susan Redline^1,3, Nicholas L Smith^24,25,26, Kenneth M Rice⁴, Elizabeth D Schifano²⁷.

Abstract

Whole-genome sequencing (WGS) and whole-exome sequencing studies have become increasingly available and are being used to identify rare genetic variants associated with health and disease outcomes. Investigators routinely use mixed models to account for genetic relatedness or other clustering variables (e.g., family or household) when testing genetic associations. However, no existing tests of the association of a rare variant with a binary outcome in the presence of correlated data control the type 1 error where there are (1) few individuals harboring the rare allele, (2) a small proportion of cases relative to controls, and (3) covariates to adjust for. Here, we address all three issues in developing a framework for testing rare variant association with a binary trait in individuals harboring at least one risk allele. In this framework, we estimate outcome probabilities under the null hypothesis and then use them, within the individuals with at least one risk allele, to test variant associations. We extend the BinomiRare test, which was previously proposed for independent observations, and develop the Conway-Maxwell-Poisson (CMP) test and study their properties in simulations. We show that the BinomiRare test always controls the type 1 error, while the CMP test sometimes does not. We then use the BinomiRare test to test the association of rare genetic variants in target genes with small-vessel disease (SVD) stroke, short sleep, and venous thromboembolism (VTE), in whole-genome sequence data from the Trans-Omics for Precision Medicine (TOPMed) program.

Entities: Chemical

Year: 2021 PMID： 34337551 PMCID： PMC8321319 DOI： 10.1016/j.xhgg.2021.100040

Source DB: PubMed Journal: HGG Adv ISSN： 2666-2477

Introduction

Whole-genome sequencing (WGS) and whole-exome sequencing studies are becoming increasingly available to public health researchers, for example, from the National Heart, Lung, and Blood Institute (NHLBI) Trans-Omics for Precision Medicine (TOPMed) program,[1] National Human Genome Research Institute (NHGRI) Centers for Common Disease Genetics (CCDG), and the UK Biobank.[2] As most variants in sequencing datasets are rare, researchers may be interested in using such datasets for detecting rare variant associations, genome-wide or in a genomic region of interest. They may also seek to confirm suggested associations from other studies or populations or to assess pathogenicity in large population-based studies of rare variant alleles reported from small family-based studies. For example, Amininejad et al.[3] studied the association of genetic variants within genes associated with monogenic immunodeficiency disorders with Crohn disease. Wright et al.[4] assessed the pathogenicity and penetrance of rare variants identified in clinical studies in the population-based UK Biobank. Tuijnenburg et al.[5] studied rare genetic variants within NGKB1 for association with primary immunodeficiency disease. Do et al.[6] studied risk of myocardial infarction in individuals with rare LDLR and APOA5 alleles. Kendall et al.[7] studied cognitive outcomes in individuals with rare copy-number variants. These studies demonstrate that there is an interest in testing single rare genetic variant associations with a wide range of health outcomes, including binary outcomes such as disease or affection status. Testing rare variant associations with binary traits is challenging. It was previously shown that likelihood-based tests such as the Wald, Score, and likelihood ratio tests poorly control the type 1 error when testing for rare variant associations with a binary trait.[8,9] The Score test performance depends on the case-control ratio, and for rare variants, even a small imbalance causes “inflation” (i.e., too many false-positive results). A few approaches have been used previously to study rare variant associations in a set of unrelated individuals. Amininejad et al.[3] used a permutation approach to test for association of rare genetic variants with Crohn disease. Wright et al.[4] used Fisher’s exact test. While it is possible to adjust for covariates in the permutation approach and when using Fisher’s exact test to some extent through stratification,[10] they do not have the full flexibility of covariate adjustment of a generalized linear model (i.e., they still require the identification of distinct groups in which no additional adjustment is required). Further, permutation tests may also be computationally intensive if low p values are desired, because the number of required permutations may be large, although there are ways to reduce this computational burden.[10] Alternatively, Tuijnenburg et al.[5] used a method called Bev-iMed,[11] implementing a Bayesian model to estimate posterior disease probabilities. The BinomiRare test has also been proposed as a powerful method to test for rare variant associations that can account for covariates.[9] The BinomiRare test uses standard methods to compute the disease probabilities in the entire dataset, under the null hypothesis of no association between a specific genetic variant and the binary outcome. Then, for each specific genetic variant, it uses the estimated probabilities in individuals harboring at least one copy of the rare variant to test the hypothesis that the disease probabilities under the null are the true outcome probabilities in these individuals. The null hypothesis is rejected if the number of individuals with both the rare variant and the outcome is inconsistent with their outcome probabilities. However, the previously published version of this method assumed the sample contains only unrelated individuals. Currently, there is no single-variant test that is generally appropriate for testing rare variants when individuals are correlated (e.g., due to known or cryptic genetic relatedness). Notably, the saddle point approximation to compute p values (henceforth SPA[12]) was first developed to improve the calibration of the Score test when there is case-control imbalance and was then extended in the SAIGE framework for the settings where related individuals are used.[13] However, it does not reliably control the type I error rate when the number of individuals harboring the rare variant is very small (i.e., tens of individuals[14]). Therefore, there is a need for a statistical test that is well-calibrated when the number of individuals with the rare allele is low, individuals are potentially related, and there is case-control imbalance. The previously published version of the BinomiRare test[14] is useful in the presence of case-control imbalance, allows for covariate adjustment, controls the type I error rate for any number of individuals with the rare allele, and can also be used when combining heterogeneous studies, and here we expand its framework for testing rare variant associations when study individuals are correlated. We developed two tests: first, we extended the BinomiRare test to the mixed models setting by applying it on conditional probabilities computed with a mixed model, rather than on marginal probabilities. Second, we developed the Conway-Maxwell-Poisson (CMP) test, which follows the same framework by using estimated (conditional) disease probabilities like the BinomiRare. For a given rare variant it uses the estimated disease probabilities in individuals with the rare allele to fit the parameters of the CMP distribution, under the null. It then tests whether the observed number of individuals with both one or more copies of the rare allele and the outcome is consistent with this distribution. We study these tests using synthetic simulations with varying outcome probabilities, variant allele frequencies, and strengths of correlation between individuals due to genetic relatedness. We apply the BinomiRare test to test rare variant associations in known disease-causing genes for specific disorders: the NOTCH3 gene and small vessel disease (SVD) ischemic stroke, the DEC2 (also known as BHLHE41) gene and short sleep, and the F5 gene and venous thromboembolism (VTE).

Material and methods

Statistical approach

Let D be an indicator of the disease, or another binary outcome, of participant i, with value 1 if the person is affected and 0 otherwise, where i = 1, …, n and the n individuals may be correlated. Let be a p×1 vector of covariate values for the ith participant, and g be their count of minor alleles for a specified genetic variant. Under the logistic disease model for correlated data: with p = Pr(D = 1|, g, b) being the conditional outcome probability in the sample (regardless of the population probability), and b is the ith entry of the vector of correlated random effects with possibly K variance components and V modeling the correlation structure corresponding to a particular source of correlation. While the methods proposed here can be applied for an arbitrary K ≥ 1, we simplify presentation by focusing on the scenario of a single correlation matrix modeling genetic relatedness, possibly cryptic, so that , with being any genetic relationship matrix (GRM), or possibly kinship matrix, and is the corresponding variance component. We assume that the genetic variant is rare, so that the minor allele frequency (MAF) is low and that individuals harboring the minor allele are overwhelmingly heterozygotes. While having homozygotes does not invalidate our approach, it also does not increase statistical power. Our approach first estimates a disease probability for each individual in the sample under the null hypothesis of no association between the genetic variant and disease status, (i.e., under the assumption that β = 0) by not including any variant of interest in the regression (step 1, demonstrated in Figure 1), and then considers individuals with at least one copy of the rare variant (in short, with the rare allele), testing whether the number of diseased individuals with the rare allele is consistent with their estimated disease probabilities (step 2, demonstrated in Figure 2).

Figure 1.

Step 1 of testing genetic association using the proposed framework

A null model of association between the binary outcomes and covariates of interest is fitted, accounting for genetic relationship. Then, estimated conditional outcome probabilities are extracted to be used in the testing step.

Figure 2.

Step 2 of testing genetic associations using the proposed framework

Based on estimated outcome probabilities, variants are inspected one at a time. For a given variant, individuals harboring the rare allele are identified, and a test of the null hypothesis is performed testing whether n is consistent with the outcome probabilities within individuals with the rare allele, based on the null model.

Step 1: Estimating disease probabilities under the null hypothesis

At step 1, we fit a null model under the assumption β = 0, using the existing penalized quasi-likelihood algorithm for logistic mixed models.[15] This approach is implemented in multiple software, including the GENESIS R package,[16] GMMAT,[17] and SAIGE.[13] In both GENESIS and GMMAT, the vector of fixed effects and the variance component are estimated using an implementation of an AI-REML (average information restricted maximum likelihood) algorithm on top of the penalized quasi-likelihood (PQL) approach,[17] but the proposed tests do not depend on the specific algorithm used for estimating the outcome probabilities. From the fitted null model, we obtain estimates , and an estimated disease probability vector by plugging them in to obtain , i = 1, …, n, where expit is the inverse of the logit function. If the variance component is estimated as 0, so is , and the analysis reverts to the independent individual settings.

Step 2: Testing the association between a genetic variant and disease status

Suppose that we obtained disease probability estimates , i = 1, …, n, under the null as described above. Denote n as the number of individuals harboring at least one copy of the rare variant allele (i.e., those with g > 0), so that . Without loss of generality, assume that participants i = 1, …, n have the rare allele. Let n be the number of diseased individuals with rare allele: Let denote the vector of estimated disease probabilities for individuals with the rare variant. Despite being estimated, we treat it as fixed. For testing, we assess the goodness-of-fit of the estimated model to the observed disease status in the individuals with the rare allele by testing the null hypothesis: where is the true, unknown, vector of outcome probabilities among those with the rare allele. The p value for testing the null hypothesis of no variant-disease association is given by: This is a two-tailed p value, because n can appear to be lower or higher than expected. When only a single person carries the rare variant (i.e., n = 1), the calculation is trivial; Equation 1 reduces to the single individual fitted probability if they are an affected individual, and if they are a control individual. When n > 1, there are two special cases that are already developed. If for all individuals with the rare allele are equal, and outcomes for all those individuals are independent, then n ~ Binomial(n, ), and the p value is the tail area (possibly two tails) of the standard binomial distribution (i.e., a binomial exact test). If the for the individuals with the rare allele differ but independence still holds, the distribution is the Poisson-binomial distribution, and the test is the previously proposed BinomiRare test for independent data.[9] In the general case, an arbitrary sum of binomial variables, possibly correlated, has the CMP-binomial distribution, which can be approximated by the CMP distribution[18,19] when the number of individuals with the rare allele is “large enough” (see Appendix A). In addition to the p value above, we also study the mid-p value, which was previously shown to improve properties of discrete tests[20] and to be less conservative. The mid-p value is always smaller than the p value, because when summing the tail area probabilities, it accounts for only half of the probability of the observed event n, whereas the p value uses it as it is, without dividing in half.

BinomiRare and CMP tests using conditional probabilities

In Appendix A, we show that the distribution of n in the general case can be approximated by the CMP distribution and develop the CMP test. However, because approximations may not work well in practice for low n, we also attempt a different approach. Note that for two individuals i and j, we have that D were independent if the true conditional disease probabilities were known. In other words, given conditional disease probabilities, knowing the disease status of individual i does not inform of the disease status of individual j. Therefore, we consider using the BinomiRare test, which was developed for independent data—with the conditional probabilities. We note that this independence may not hold when probabilities are estimated, and therefore it is not trivially true that the BinomiRare is appropriate in this setting. Both the CMP and the BinomiRare tests for correlated data are available in the GENESIS R package for genetic association analysis.[21]

Simulation study: Testing rare variant associations using BinomiRare and CMP in a sample of trios

We carried out a simulation study to evaluate the performance of BinomiRare and CMP tests in samples of correlated individuals. In each simulation, we generated 3,000 individuals as 1,000 trios (two parents and one offspring), as follows. For 1,000 pairs of parents, and each of two chromosomal copies, we generated 20 independent “non-causal” genetic variants by first sampling MAFs from a uniform U[0.05, 0.5] distribution and setting MAF ∈ {0.05, 0.02, 0.01, 0.001} for one “causal”, variant, followed by sampling of genetic variants using a binary distribution based on these MAFs. For each parent, allele count was the sum of the two sampled alleles. For each variant independently, an offspring inherited one allele from each of the parents. The parental allele was sampled at random with equal probabilities from the two alleles. We used the 21 (1 causal and 20 non-causal) simulated genotypes to generate a variable mimicking a principal component (PC), as a weighted sum of all allele counts, with weights sampled from a standard normal distribution N(0, 1). Next, we simulated probability of disease using a mixed logistic model: Here, exp(β0)∈{0.01, 0.05, 0.5} is the probability of disease in individuals without the rare allele (g = 0) with genetic PC and b equal to zero. β models the association of the PC with disease probability, β is the effect of the (causal) variant of interest, and = (b1, …, b), representing the correlation across individuals, is sampled from a multivariate normal distribution , with the correlation matrix being a block diagonal kinship matrix, having twice the kinship coefficient between a child and each of their parents (i.e., 0.5). We set . In all simulations we had β = 0.1. The variant effect was varied from zero when evaluating type 1 error rate to β = log(Odds Ratio)∈{log(2), log(3), log(4)} when evaluating power. We then sampled disease status for each individual from a binary distribution with the computed disease probability. Finally, we applied the BinomiRare and CMP tests and computed p values and mid-p values. We performed 1x107 replicates to estimate type 1 error rate and 1x105 replicates to estimate power. We estimated type 1 error rate and power for p value threshold for declaring significance {1x10−2, 1x10−3, 1x10−4}. For tests that did not, empirically, control the type I error rate for a given p value threshold (i.e., the proportion of simulations passing the threshold was higher than the threshold), we computed a calibrated threshold, defined as a value for which the proportion of simulations with p value less than this value was the desired threshold. We then used this calibrated threshold to estimate power, specifically power at an “honest alpha” (Supplemental material and methods). Our main results are those focused on simulations in which the variance component had a non-zero estimate, but we analyzed all simulations.

The TOPMed whole-genome sequencing study

Whole-genome sequencing was performed via TOPMed and the NHGRI’s CCDG programs, using DNA from blood at multiple sequencing centers using Illumina X10 technology at an average sequencing depth of >30×. Studies and samples were sequenced in multiple phases. Periodically, the TOPMed Informatics Research Center (IRC) performed variant calling on the combined TOPMed and CCDG samples, resulting in multiple releases of data “freezes.” Details regarding sequencing methods and quality control are provided elsewhere[22] and in the TOPMed website. We used three TOPMed multi-ethnic datasets: a dataset of SVD stroke in the Women’s Health Initiative (WHI), a study of short sleep, and a study of VTE, with the latter two comprised of individuals from multiple TOPMed cohorts. We performed data analysis to demonstrate the BinomiRare test. The approaches for data analysis were similar. GRMs were constructed based on the analytic datasets of each of the analyses, using all genetic variants with minor allele frequency ≥ 0.001. Logistic mixed models under the null were fit and adjusted for age, sex, and self-reported race/ethnic group, and, for short sleep, also for parent study/cohort. SVD stroke and short sleep analyses used TOPMed freeze 5b release, while the VTE analysis used TOPMed freeze 8 genotype release. All participants provided written informed consent at their recruitment centers.

The TOPMed WHI stroke dataset

The WHI is a long-term health study following postmenopausal women aged 50–79 years who were recruited from 1993 through 1998 from 40 clinical centers throughout the United States.[23] In the present analysis, we focus on a subset of 5,358 WHI participants who were sequenced through TOPMed with data available via freeze 5b and had SVD stroke case-control classification, according to the following methodology: stroke diagnosis requiring and/or occurring during hospitalization was based on the rapid onset of a neurological deficit attributable to an obstruction or rupture of an arterial vessel system. Hospitalized incident stroke events were identified by semiannual questionnaires and adjudicated following medical record review, which occurred both locally (at individual study sites) and centrally. Ischemic strokes were further classified by the central neurologist adjudicators into cardio-embolic stroke, larger artery stroke, and SVD stroke according to the Trial of Org 10172 Acute Stroke Trial (TOAST) criteria.[24] The TOAST classification focuses on the presumed underlying stroke mechanism and requires detailed investigations (such as brain computed tomography, magnetic resonance imaging, angiography, carotid ultrasound, and echocardiography). Baseline stroke cases were excluded from the analysis, and VTE cases were excluded from the control samples. Further, participants who had non-SVD stroke were excluded.

The TOPMed short sleep dataset

We used sleep duration data from multiple TOPMed cohorts, as described in the Supplemental information detailing phenotype harmonization for short sleep analysis. Short sleep was defined as self-reported sleep duration during weekday, or usual sleep (if sleep duration during the weekdays was not available), being 5 h or less. Otherwise, if self-reported sleep duration was 6 h or longer and less than 9 h, sleep was “normal.” Individuals with self-reported sleep duration longer than 5 h and shorter than 6 h were excluded to minimize risk of misclassification. Because of a well-known U-shaped relationship between sleep duration and cardiovascular disease,[25] suggesting that potential non-linearity in genetic associations may exist as well, we also excluded “long sleepers” reporting usual sleep of 9 h or longer.

The TOPMed VTE dataset

The TOPMed VTE dataset includes TOPMed participants from six studies, combining prospective cohort and case-only studies. Individuals were matched across groups defined to be homogeneous with respect to race/ethnicity and sex, and strata defined by age at event (determined according to cases). The matching strategy resulted in a sample set mimicking a case-control study, with 11,627 individuals, of whom 3,793 are cases and 7,834 are controls.

Association testing of rare coding variants within known disease-causing genes

For each of the SVD stroke, short sleep, and VTE datasets, we considered a known gene associated with the disorder. For stroke, we focused on the NOTCH3 gene, in which mutations may cause cerebral autosomal dominant arteriopathy with subcortical infarcts and leukoencephalopathy (CADASIL), which causes ischemic stroke.[26] For short sleep, we focused on the gene DEC2 (also known as BHLHE41), a transcription inhibitor of orexin, a neuropeptide that regulates wakefulness.[27,28] For VTE, we focused on the coagulation factor V gene, F5,[29,30] which has a known common variant highly associated with VTE, factor V Leiden (rs6025). We performed single-variant analysis within the candidate genes, as follows. We selected a subset of rare variants within the genes based on functional annotations, with the goal of increasing power by focusing on variants that are more likely to be functional compared to others. In detail, the filter based on functional annotation included the selection of variants that were: (1) high-confidence loss-of-function variants according to the Ensembl Variant Effect Predictor;[31] (2) missense variants if they are predicted deleterious by either SIFT 4G,[32] PolyPhen2-HDIV,[33] PolyPhen2-HVAR,[33] or LRT-pred;[34] (3) inframe insertions or deletions (indels) with FATHMM-XF coding score > 0.5;[35] or (4) variants that are synonymous according to the Ensemble Variant Effect Predictor and have FATHMM-XF coding score > 0.5. The annotation-based variant filtering was performed using the annotation explorer application on the NHLBI’s BioData Catalyst.[36] We further filtered variants to those that passed the TOPMed quality control (QC) filter[22] and had at least 3 and no more than 300 individuals with the rare allele. This upper threshold was defined because we were interested specifically in rare variants and because it was previously shown that properties of statistical tests of rare variant associations depend on the number of individuals with the rare allele, rather than on allele frequency.[14] Finally, we further restricted the set of variants to those that had reasonable statistical power according to a power analysis performed as follows. We arbitrarily assumed an odds ratio (OR) of 2 for a causal variant, and for each variant we computed power based on a function developed for the BinomiRare test. The function uses the estimated outcome probabilities in the sample, an OR, the number of individuals harboring the rare variant, and p value threshold to compute power. To increase accuracy, for each variant we specifically used the estimated disease probabilities among the individuals with the rare allele. Although the proposed testing approach is developed primarily for studies of candidate gene or association regions, some investigators may be interested in applying it genome-wide. Therefore, in the Supplemental information we provide Manhattan and QQ-plots and report computation times from applying BinomiRare, CMP, and SPA tests genome-wide for the three outcomes of interest. When applied genome-wide, we used the Score test and recomputed p values (mid-p value for BinomiRare and p value for CMP) whenever the Score test p value was <0.05. Results are reported for all variants with at least 3 individuals with the rare allele and MAF ≤0.01.

Results

Simulation studies

We studied the performance of the tests in simulations of 1,000 trios. In the setting where , about half of the simulations estimated the variance component to be zero. When , this happened in about a third of the simulations. The number of individuals with the simulated rare variant allele was in the range [0, 27] when MAF = 0.001, [17, 119] when MAF = 0.01, [58, 203] when MAF = 0.02, and [195, 401] when MAF = 0.05. Table 1 provides estimated type 1 error rates in the simulations, restricted to those simulations in which the estimated variance component was . For BinomiRare, we only provide results for the mid-p value, because in our simulations it always controlled the type 1 error rate, while the usual p value controlled it as well while being more conservative. For CMP, we only provide results for the usual p value, because it sometimes did not control the type 1 error, and the lack of control was worse with the mid-p value. The CMP test usually did not control the type 1 error when the variant was very rare (MAF = 0.001) and when the case proportion was low (exp(β0) = 0.05). Its performance improved as the MAF increased. In Tables S3–S5, we provide complete simulation results, including both mid-p value and the usual p value for both the CMP and BinomiRare tests and results computed over all simulations and computed over the simulations in which .

Table 1.

Estimated type 1 error rates of BinomiRare and CMP tests in simulations with related individuals

		Estimated type 1 error by p value threshold
		σg2=0.06			σ_g²=0.6
MAF	exp(β₀)	10⁻²	10⁻³	10⁻⁴	10⁻²	10⁻³	10⁻⁴
BinomiRare (mid-p value)

0.001	0.01	3.38E–03	2.08E–04	1.25E–05	3.78E–03	2.55E–04	1.88E–05

0.001	0.05	5.61E–03	4.45E–04	2.76E–05	5.63E–03	4.4E–04	3.24E–05

0.001	0.5	5.78E–03	3.22E–04	1.57E–05	5.44E–03	2.94E–04	1.26E–05

0.01	0.01	6.22E–03	4.62E–04	3.25E–05	6.36E–03	4.64E–04	3.35E–05

0.01	0.05	7.70E–03	6.58E–04	4.42E–05	7.83E–03	6.58E–04	5.1E–05

0.01	0.5	8.68E–03	8.28E–04	6.67E–05	8.21E–03	7.3E–04	6.56E–05

0.02	0.01	6.17E–03	4.27E–04	2.73E–05	6.41E–03	4.71E–04	3.38E–05

0.02	0.05	7.53E–03	6.27E–04	5.34E–05	7.52E–03	6.15E–04	5.27E–05

0.02	0.5	7.90E–03	6.87E–04	6.51E–05	7.56E–03	6.34E–04	5.55E–05

0.05	0.01	4.99E–03	2.88E–04	1.84E–05	5.21E–03	2.97E–04	1.48E–05

0.05	0.05	5.94E–03	4.49E–04	3.10E–05	5.93E–03	4.26E–04	2.73E–05

0.05	0.5	6.02E–03	4.69E–04	3.91E–05	5.67E–03	4.16E–04	3.34E–05

CMP (usual p value)

0.001	0.01	5.75E–02	6.59E–03	4.18E–04	6.01E–02	6.54E–03	4.41E–04

0.001	0.05	4.50E–02*	4.44E–03*	2.83E–04*	4.22E–02*	3.89E–03*	2.26E–04*

0.001	0.5	3.44E–02*	9.29E–04	6.28E–06	3.34E–02*	7.78E–04	8.21E–06

0.01	0.01	2.34E–02	2.19E–03	1.70E–04	2.25E–02	2.09E–03	1.68E–04

0.01	0.05	1.62E–02*	1.55E–03*	1.37E–04*	1.53E–02*	1.44E–03*	1.20E–04*

0.01	0.5	9.30E–03	7.53E–04	4.41E–05	8.71E–03	6.74E–04	4.65E–05

0.02	0.01	1.76E–02	1.50E–03	1.11E–04	1.69E–02	1.44E–03	1.16E–04

0.02	0.05	1.11E–02*	1.10E–03*	1.02E–04	1.02E–02	9.77E–04	8.58E–05

0.02	0.5	7.86E–03	6.30E–04	5.49E–05	7.45E–03	5.76E–04	4.36E–05

0.05	0.01	1.06E–02	7.29E–04	4.14E–05	1.01E–02	7.19E–04	4.04E–05

0.05	0.05	6.53E–03	4.94E–04	3.44E–05	6.38E–03	4.52E–04	3.38E–05

0.05	0.5	5.80E–03	4.37E–04	3.30E–05	5.46E–03	3.83E–04	2.99E–05

Settings in which the type 1 error was not controlled, defined according to type 1 error rate being larger than the highest value in a 95% confidence interval around the expected type 1 error rate, based on binomial distribution with parameters being the p value threshold and number of simulations used.

In power analysis, after appropriately calibrating the p value threshold for the CMP test, CMP was either equally powerful as BinomiRare or more powerful (Figures S1–S3). The patterns were similar across p value thresholds used and across the two variance component parameters used in the simulations. Notably, when the disease was common (exp(β0) = 0.5), the power was lower when the variance component was high () compared to when it was low (). When the disease was rare (exp(β0) = 0.05), the power was essentially the same with both values of variance components.

Data analysis: TOPMed datasets

For each of the three TOPMed datasets that we considered, Table 2 provides the sample sizes, gene of interest, and number of variants according to sequential filtering: the number of available (non-monomorphic) variants in the sample that passed the functional filters described in the Material and methods section, number of variants after applying quality filters, number of variants after further restricting to variants with at least 3 and no more than 300 individuals with the rare allele, and the number of variants with at least 50% power to reject the null hypothesis at the 0.05 level under the assumption of OR = 2. There were 3 such variants in the NOTCH3-SVD stroke analysis, 1 variant in the DEC2-short sleep analysis, and 4 variants in the F5-VTE analysis.

Table 2.

Characteristics of the TOPMed datasets and variants considered for association testing

	SVD stroke	Short sleep	VTE
No. of individuals in the analysis	5,358	20,021	11,627
No. of cases	692 (12.9%)	2,408 (12%)	3,793 (32.6%)
No. of controls	4,666 (87.1%)	17,613 (88%)	7,834 (67.4%)
Gene of interest	NOTCH3	DEC2/BHLHE41	F5
No. of potentially functional non-monomorphic variants identified	122	58	142
No. of variants further passing TOPMed quality filters	117	49	132
No. of variants further having 2 < individuals with the rare allele < 300	20	9	25
No. of variants with estimated power > 0.5 at the 0.05 α level	3	1	4

Table 3 provides the results from testing each of the variants passing this estimated power filter. Of the three tested NOTCH3 variants, rs115582213 had p value = 0.03. For short sleep, only a single DEC2 variant was tested; it had BinomiRare mid-p value = 0.03, suggesting association with short sleep. For the F5 gene and VTE, none of the four tested variants showed evidence of association.

Table 3.

Results from association analysis of rare genetic variants within monogenic disease genes of interest

rsID	Variant	BinomiRare p value	BinomiRare mid-p value	n_c	n_d	Estimated power (OR = 2)	ClinVar interpretation	CADD PHRED	FATHMM-XF coding
SVD stroke: NOTCH 3 gene
rs115582213	chr-19-15162524-C-T	0.04	0.03	87	17	0.7	benign/likely benign	25.4	0.66
rs112197217	chr-19-15179425-G-T	0.53	0.49	166	23	0.91	benign/likely benign	21	0.42
rs11670799	chr-19-15188240-G-A	0.81	0.77	180	23	0.94	benign/likely benign	28.8	0.68
Short sleep: DEC2 gene
rs121912617	chr-12-26122364-G-T	0.04	0.03	127	38	0.98	not available	27.5	0.66
VTE: F5 gene
rs6026	chr-1-169528054-C-T	0.37	0.34	115	31	0.94	benign/likely benign	25.7	0.75
rs6034	chr-1-169529782-G-C	1.00	0.94	46	16	0.57	conflicting interpretations	21.3	0.56
rs78958618	chr-1-169542985-G-A	0.67	0.63	130	32	0.94	benign	15.18	0.11
rs9332485	chr-1-169586344-C-T	0.37	0.34	222	55	1	benign/likely benign	22.5	0.23

Genetic variants presented are those that passed functional annotation and statistical power filters. For each variantwe provide its BinomiRare p value and mid-p value, the number of individuals with the rare allele n, the number of individuals with both the rare allele and the outcome n, the estimated power computed while assuming effect size OR = 2 and p value threshold = 0.05, pathogenicity interpretation from ClinVar, CADD score, and FATHMM-XF coding score.

Figures S4–S6 provide Manhattan and QQ-plots from genome-wide analyses of the three phenotypes, and Table S6 provides computation times when testing associations in a single segment of 1 × 107 base pairs.

Discussion

We extended the BinomiRare test and studied the CMP test for testing the association of a rare genetic variant with a binary outcome in the mixed-model framework. These tests were specifically developed to handle variants with very low minor allele counts (tens of individuals harboring the rare allele), because it was previously shown that other tests that allow for covariate adjustment, such as the naive Score test and the SPA test, do not always control the type 1 error in the very low count settings.[14] Both BinomiRare and CMP tests first estimate the outcome probabilities for each person in a dataset, while accounting for covariates and for genetic relatedness (and possibly other covariance matrices) via a mixed model, and then use the estimated conditional disease probabilities. For a single variant, individuals with the rare alleles are identified, and based on their disease probabilities and the observed number of cases, a p value is computed, as the probability of observing the given number of cases or more extreme given the estimated outcome probabilities. The BinomiRare test assumes a Poisson binomial distribution on the number of cases, and the CMP test assumes a CMP distribution. The BinomiRare test using estimated conditional outcome probabilities assuming that the individuals with the rare allele are independent performed well, while, surprisingly, the CMP test, which was constructed specifically for correlated data, did not control the type 1 error rate for settings with a low number of individuals harboring the rare allele. This was likely because the approximations on which it relies are asymptotic in its non-centrality parameter λ, which is related to the number of individuals with the rare allele. We demonstrated the application of the BinomiRare test using three TOPMed studies: SVD stroke, short sleep, and VTE. Due to the low power for testing low-count variants, we filtered variants according to functional annotation and according to computed statistical power. The limitations of this approach are that (1) the deleteriousness predicting annotations used and the filters applied to them may not have captured the true functional variant set, and (2) the power analysis was based on an arbitrarily selected OR parameters. In this study, we chose OR = 2 and only considered the handful of variants that had estimated power > 0.5 for testing, while requiring p value (α level) < 0.05. We recognize that many rare variants have larger effect sizes. However, if we specified a larger OR parameter, and thus included more variants in our analysis, a more stringent a level would be needed. Thus, the resulting list of variants to test may have been similar. More work is needed developing strategies for identifying single rare variant associations. For each of the phenotypes, SVD stroke, VTE, and short sleep, we searched for rare variants within genes with known trait associations. For SVD stroke, we considered NOTCH3, because some NOTCH3 variants have been reported in individuals with CADASIL, which poses a risk for stroke. Most NOTCH3 mutations reported as associated with CADASIL are those involving loss or gain of a cysteine residue, leading to unpaired cysteine.[37] Single-nucleotide variants in NOTCH3 have not yet consistently been identified as associated with SVD stroke in population-based studies. Here, we identified the rare variant rs115582213 (BinomiRare mid-p value = 0.03). This variant was rare, with 87 out of 5,358 individuals in the dataset harboring the rare allele. Of these, 17 individuals had SVD stroke. More work is needed to study the association of rs115582213 with SVD stroke, as, after accounting for multiple testing, its association is not statistically significant. For VTE, we considered the F5 gene. The F5 gene harbors the strongest known, relatively common, genetic risk factor for VTE, the rs6025 variant.[38,39] This motivated the search for rare variants in this gene. We did not identify any variant associated with VTE at the p value < 0.05 level. We did not consider rs6025 as part of our testing strategy because it was common with MAF = 0.04, and 839 individuals had the rare allele, a setting in which other tests such as the SPA should be able to control the type 1 error well and also be more powerful. Still, as a positive control we tested its association with VTE using BinomiRare, and the p value was 1.5 × 10−14. Short sleep has been consistently associated with cardiovascular and cardiometabolic disease.[40,41] Genetic determinant of short sleep may help elucidate this connection.[42] We considered the DEC2/BHLHE4 gene, which has a mutation with a known familial aggregation associated with short sleep. Our filtering strategy resulted in a single variant considered for testing: rs121912617, the known short sleep mutation.[27] In our data, it was associated with short sleep with BinomiRare mid-p value = 0.03. rs121912617 is substantially more common (yet is still rare) in African Americans compared to European Americans (0.01 MAF in African Americans from the TOPMed short sleep datasets, compared to MAF < 0.001 in European Americans from the same dataset), allowing for observing this association in a population-based, rather than a family-based, study. Here, we demonstrated the BinomiRare test for testing single-variant associations in data with known or cryptic relatedness. It can also be used to test sets of rare variants, by focusing on individuals with at least one rare allele in the variant set. It is a topic of future research to extend this framework to use the counts of the rare variant allele and increase power.

37 in total

1. SIFT missense predictions for genomes.

Authors: Robert Vaser; Swarnaseetha Adusumalli; Sim Ngak Leng; Mile Sikic; Pauline C Ng
Journal: Nat Protoc Date: 2015-12-03 Impact factor: 13.491

2. Identification of deleterious mutations within three human genomes.

Authors: Sung Chun; Justin C Fay
Journal: Genome Res Date: 2009-07-14 Impact factor: 9.043

3. Control for Population Structure and Relatedness for Binary Traits in Genetic Association Studies via Logistic Mixed Models.

Authors: Han Chen; Chaolong Wang; Matthew P Conomos; Adrienne M Stilp; Zilin Li; Tamar Sofer; Adam A Szpiro; Wei Chen; John M Brehm; Juan C Celedón; Susan Redline; George J Papanicolaou; Timothy A Thornton; Cathy C Laurie; Kenneth Rice; Xihong Lin
Journal: Am J Hum Genet Date: 2016-03-24 Impact factor: 11.025

4. Fast permutation tests and related methods, for association between rare variants and binary outcomes.

Authors: Arjun Sondhi; Kenneth Martin Rice
Journal: Ann Hum Genet Date: 2017-12-18 Impact factor: 1.670

5. A genome-wide association study of venous thromboembolism identifies risk variants in chromosomes 1q24.2 and 9q.

Authors: J A Heit; S M Armasu; Y W Asmann; J M Cunningham; M E Matsumoto; T M Petterson; M De Andrade
Journal: J Thromb Haemost Date: 2012-08 Impact factor: 5.824

6. The transcriptional repressor DEC2 regulates sleep length in mammals.

Authors: Ying He; Christopher R Jones; Nobuhiro Fujiki; Ying Xu; Bin Guo; Jimmy L Holder; Moritz J Rossner; Seiji Nishino; Ying-Hui Fu
Journal: Science Date: 2009-08-14 Impact factor: 47.728

7. Genome-wide association study identifies genetic loci for self-reported habitual sleep duration supported by accelerometer-derived estimates.

Authors: Hassan S Dashti; Samuel E Jones; Andrew R Wood; Jacqueline M Lane; Vincent T van Hees; Heming Wang; Jessica A Rhodes; Yanwei Song; Krunal Patel; Simon G Anderson; Robin N Beaumont; David A Bechtold; Jack Bowden; Brian E Cade; Marta Garaulet; Simon D Kyle; Max A Little; Andrew S Loudon; Annemarie I Luik; Frank A J L Scheer; Kai Spiegelhalder; Jessica Tyrrell; Daniel J Gottlieb; Henning Tiemeier; David W Ray; Shaun M Purcell; Timothy M Frayling; Susan Redline; Deborah A Lawlor; Martin K Rutter; Michael N Weedon; Richa Saxena
Journal: Nat Commun Date: 2019-03-07 Impact factor: 14.919

8. The Ensembl Variant Effect Predictor.

Authors: William McLaren; Laurent Gil; Sarah E Hunt; Harpreet Singh Riat; Graham R S Ritchie; Anja Thormann; Paul Flicek; Fiona Cunningham
Journal: Genome Biol Date: 2016-06-06 Impact factor: 13.583

9. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies.

Authors: Wei Zhou; Jonas B Nielsen; Lars G Fritsche; Rounak Dey; Maiken E Gabrielsen; Brooke N Wolford; Jonathon LeFaive; Peter VandeHaar; Sarah A Gagliano; Aliya Gifford; Lisa A Bastarache; Wei-Qi Wei; Joshua C Denny; Maoxuan Lin; Kristian Hveem; Hyun Min Kang; Goncalo R Abecasis; Cristen J Willer; Seunggeun Lee
Journal: Nat Genet Date: 2018-08-13 Impact factor: 38.330

10. Assessing the Pathogenicity, Penetrance, and Expressivity of Putative Disease-Causing Variants in a Population Setting.

Authors: Caroline F Wright; Ben West; Marcus Tuke; Samuel E Jones; Kashyap Patel; Thomas W Laver; Robin N Beaumont; Jessica Tyrrell; Andrew R Wood; Timothy M Frayling; Andrew T Hattersley; Michael N Weedon
Journal: Am J Hum Genet Date: 2019-01-18 Impact factor: 11.025

1 in total

1. Rare Missense Functional Variants at COL4A1 and COL4A2 in Sporadic Intracerebral Hemorrhage.

Authors: Jaeyoon Chung; Graham Hamilton; Minsup Kim; Sandro Marini; Bailey Montgomery; Jonathan Henry; Art E Cho; Devin L Brown; Bradford B Worrall; James F Meschia; Scott L Silliman; Magdy Selim; David L Tirschwell; Chelsea S Kidwell; Brett Kissela; Steven M Greenberg; Anand Viswanathan; Joshua N Goldstein; Carl D Langefeld; Kristiina Rannikmae; Catherine Lm Sudlow; Neshika Samarasekera; Mark Rodrigues; Rustam Al-Shahi Salman; James G D Prendergast; Sarah E Harris; Ian Deary; Daniel Woo; Jonathan Rosand; Tom Van Agtmael; Christopher D Anderson
Journal: Neurology Date: 2021-05-24 Impact factor: 9.910

1 in total