Literature DB >> 32839606

Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale.

Xihao Li¹, Zilin Li¹, Hufeng Zhou¹, Sheila M Gaynor¹, Yaowu Liu², Han Chen^3,4, Ryan Sun⁵, Rounak Dey¹, Donna K Arnett⁶, Stella Aslibekyan⁷, Christie M Ballantyne⁸, Lawrence F Bielak⁹, John Blangero¹⁰, Eric Boerwinkle^3,11, Donald W Bowden¹², Jai G Broome¹³, Matthew P Conomos¹⁴, Adolfo Correa¹⁵, L Adrienne Cupples^16,17, Joanne E Curran¹⁰, Barry I Freedman¹⁸, Xiuqing Guo¹⁹, George Hindy²⁰, Marguerite R Irvin⁷, Sharon L R Kardia⁹, Sekar Kathiresan^21,22,23, Alyna T Khan¹⁴, Charles L Kooperberg²⁴, Cathy C Laurie¹⁴, X Shirley Liu^25,26, Michael C Mahaney¹⁰, Ani W Manichaikul²⁷, Lisa W Martin²⁸, Rasika A Mathias²⁹, Stephen T McGarvey³⁰, Braxton D Mitchell^31,32, May E Montasser³³, Jill E Moore³⁴, Alanna C Morrison³, Jeffrey R O'Connell³¹, Nicholette D Palmer¹², Akhil Pampana^35,36, Juan M Peralta¹⁰, Patricia A Peyser⁹, Bruce M Psaty^37,38, Susan Redline^39,40,41, Kenneth M Rice¹⁴, Stephen S Rich²⁷, Jennifer A Smith^9,42, Hemant K Tiwari⁴³, Michael Y Tsai⁴⁴, Ramachandran S Vasan^17,45, Fei Fei Wang¹⁴, Daniel E Weeks⁴⁶, Zhiping Weng³⁴, James G Wilson^47,48, Lisa R Yanek²⁹, Benjamin M Neale^35,49,50, Shamil R Sunyaev^35,51,52, Gonçalo R Abecasis^53,54, Jerome I Rotter¹⁹, Cristen J Willer^55,56,57, Gina M Peloso¹⁶, Pradeep Natarajan^23,35,36, Xihong Lin^58,59,60.

Abstract

Large-scale whole-genome sequencing studies have enabled the analysis of rare variants (RVs) associated with complex phenotypes. Commonly used RV association tests have limited scope to leverage variant functions. We propose STAAR (variant-set test for association using annotation information), a scalable and powerful RV association test method that effectively incorporates both variant categories and multiple complementary annotations using a dynamic weighting scheme. For the latter, we introduce 'annotation principal components', multidimensional summaries of in silico variant annotations. STAAR accounts for population structure and relatedness and is scalable for analyzing very large cohort and biobank whole-genome sequencing studies of continuous and dichotomous traits. We applied STAAR to identify RVs associated with four lipid traits in 12,316 discovery and 17,822 replication samples from the Trans-Omics for Precision Medicine Program. We discovered and replicated new RV associations, including disruptive missense RVs of NPC1L1 and an intergenic region near APOC1P1 associated with low-density lipoprotein cholesterol.

Entities: Chemical

Mesh：

Substances：
Cholesterol, LDL

Year: 2020 PMID： 32839606 PMCID： PMC7483769 DOI： 10.1038/s41588-020-0676-4

Source DB: PubMed Journal: Nat Genet ISSN： 1061-4036 Impact factor: 38.330

An increasing number of whole genome/exome sequencing (WGS/WES) studies are being conducted to investigate the genetic bases of human diseases and traits, including the Trans-Omics for Precision Medicine Program (TOPMed) of the National Heart, Lung and Blood Institute (NHLBI) and the Genome Sequencing Program (GSP) of the National Human Genome Research Institute (NHGRI). Such studies enable assessment of associations between complex traits and both coding and non-coding rare variants (RVs; minor allele frequency (MAF) < 1%) across the genome. However, single-variant analyses typically have low power to identify associations with rare variants[1-3]. To improve power, variant-set tests have been proposed to jointly test the effects of given sets of multiple rare variants. These methods include the burden test[4-7], Sequence Kernel Association Test (SKAT)[8], and their various combinations[9-12]. In parallel, external biological information provided by functional annotations, such as conservation scores and predicted enhancer status, has been successfully used for prioritizing plausibly causal common variants in fine-mapping studies, partitioning heritability in GWAS, and predicting genetic risk[13-17]. It is of substantial interest to incorporate variant functional annotations effectively, to boost the power of RV analysis of WGS association studies[18,19]. Variant functional annotations take two forms: (i) qualitative functional groupings into genomic elements, such as Variant Effect Predictor (VEP) categories[20,21], and (ii) quantitative functional scores available for variants across the genome, including protein functional scores[22,23], evolutionary conservation scores[24,25], epigenetic measures[26], and integrative functional scores[27]. Different annotation scores capture diverse aspects of variant function[28,29]. Given the diversity of available annotations, efforts have been made to aggregate the evidence they provide on genomic function[30]. Simultaneous use of multiple, varied functional annotation scores in variant-set tests could improve rare variant association study (RVAS) power, for example, by optimally selecting and weighting plausibly-causal rare variants[31]. To boost power for variant-set tests in WGS RVAS, we propose the variant-Set Test for Association using Annotation infoRmation (STAAR), a general framework that dynamically incorporates both qualitative functional categories and quantitative complementary annotation scores using a unified omnibus multi-dimensional weighting scheme. For the latter, to effectively capture the multi-faceted biological impact of a variant, we introduce annotation Principal Components (aPCs), multi-dimensional summaries of annotation scores that can be leveraged in the STAAR framework. Recent methods[32-34] have incorporated functional annotations in genetic association studies. However, these methods are not scalable to analyze large-scale WGS studies while accounting for relatedness and population structure. Large scale WGS and WES studies, such as TOPMed and GSP, include a considerable fraction of related and ancestrally diverse samples. STAAR accounts for both relatedness and population structure, as well as longitudinal follow-up designs, for both quantitative and dichotomous traits, using a Generalized Linear Mixed Models (GLMM) framework[35] that includes linear and logistic mixed models[36,37]. Using sparse Genetic Relatedness Matrices (GRMs)[38], STAAR is computationally scalable for very large WGS studies and biobanks of hundreds of thousands of samples. We perform herein extensive simulation studies to demonstrate that STAAR can achieve substantially greater power compared to conventional variant-set tests, while maintaining accurate type I error rates for both quantitative and dichotomous phenotypes. We then apply STAAR to perform WGS gene-centric and sliding window-based genetic region analysis of 12,316 discovery samples and 17,822 replication samples with four quantitative lipid traits: low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), triglycerides (TG), and total cholesterol (TC) from the NHLBI TOPMed program. We show that STAAR outperforms existing methods and identifies novel and replicated associations, including with LDL-C in disruptive missense RVs of NPC1L1, and in an intergenic region near APOC1P1.

Results

Overview of methods.

STAAR is a general framework for analyzing WGS RVAS at scale by using both qualitative functional categories as well as multiple in silico variant annotation scores within a variant-set, while accounting for population structure and relatedness by fitting linear and logistic mixed models for quantitative and dichotomous traits using fast and scalable algorithms. For each variant-set, there are two main components of the STAAR framework: (i) using annotation PCs to capture and prioritize multi-dimensional variant biological functions, and (ii) testing the association between each variant-set and phenotypes by incorporating these annotation PCs as well as other integrative functional scores and MAFs in the STAAR test statistics using an omnibus weighting scheme (Fig. 1).

Figure 1 |

STAAR workflow.

a, Prepare the input data of STAAR, including genotypes, phenotypes, covariates, and (sparse) genetic relatedness matrix. b, Annotate all variants in the genome and calculate the annotation principal components for different classes of variant function. c, Define two types of variant-sets: gene-centric analysis by grouping variants into functional genomic elements for each protein-coding gene; genetic region analysis using agnostic sliding windows. d, Estimate STAAR statistics for each variant-set. e, Obtain STAAR-O P-values for all variants sets that are defined in c and report significant findings.

Variants often influence genes and gene products through multiple mechanisms. We extract a broad set of variant functional annotations (Supplementary Table 1), including both individual and ensemble functional scores, from various databases, such as ENCODE[26], Roadmap Epigenomics[39], and other evolutionary and protein annotation databases[27,40,41]. A correlation heatmap across variants in the genome (Fig. 2) shows that the correlation structure among all individual annotations is approximately block-diagonal, with highly correlated blocks representing different classes of variant function, e.g., epigenetic function, evolutionary conservation, protein function, local nucleotide diversity. We introduce annotation Principal Components defined as the first PCs calculated from the set of individual functional annotation scores in each functional block (Supplementary Table 1 and Online Methods). Annotation PCs effectively reduce the dimensionality of the large number of individual annotations and summarize multiple aspects of variant function.

Figure 2 |

Correlation heatmap of functional annotation scores.

The figure shows pairwise correlations between 76 individual and integrative functional annotations using variants from the pooled samples of lipid traits in the TOPMed data. The cells in the visualization are colored by Pearson’s correlation coefficient values with deeper colors indicating higher positive (red) or negative (blue) correlations. Each annotation principal component (aPC) is the first PC calculated from the set of individual functional annotations that measure similar biological function. These aPCs are then transformed into the PHRED-scaled scores for each variant across the genome (Online Methods).

The STAAR framework first calculates a set of multiple candidate test statistics using different annotation weights under a particular testing approach (Fig. 1d). For each type of RV test, STAAR then uses ACAT (aggregated Cauchy association test) method to combine the resulting P-values calculated using different weights in order to effectively and powerfully aggregate the association strength from all annotations in a data-adaptive manner (Fig. 1d and Online Methods). The ACAT method for combining P-values is accurate and computationally efficient, while accounting for arbitrary correlation structure between tests[9,42]. To leverage the advantages of different types of tests, we propose an omnibus test in the STAAR framework (STAAR-O) by combining P-values across different types of multiple-annotation-weighted variant-set tests using the ACAT method (Fig. 1d and Online Methods).

Simulation studies.

To evaluate the type I error and power of STAAR compared to conventional variant-set tests, we performed simulation studies under a variety of configurations. We followed the steps described in the Data simulation section of the Online Methods to generate both continuous and dichotomous phenotypes. We generated genotypes by simulating 20,000 sequences for 100 different regions with each spanning 1 Mb. The data were generated to mimic the linkage disequilibrium (LD) structure of an African American population by using the calibration coalescent model (COSI)[43]. We randomly selected 5-kb regions from these 1-Mb regions and considered sample sizes of 2,500, 5,000, and 10,000 for each replicate. The simulation studies focused on aggregating uncommon variants with MAF < 5%.

Type I error simulations.

The empirical type I error rates for STAAR-O were evaluated based on simulations at for continuous and dichotomous traits (Supplementary Table 2). The results show that the type I error rate for STAAR-O appeared to be well controlled for both continuous and dichotomous traits at all levels. For continuous traits, STAAR-O delivered accurate empirical type I error rates. For dichotomous traits and the smallest level considered of STAAR-O was slightly conservative for moderate sample sizes (2,500 individuals); however, its type I error rate came close to the nominal level with larger sample sizes.

Empirical power simulations.

Next, we evaluated the power of STAAR empirically by incorporating MAF and 10 annotations into its analysis and comparing results with conventional variant-set tests in a variety of configurations. Power was estimated as the proportion of P-values less than based on replicates. Causality of variants was allowed to be dependent on different sets of annotations through a logistic model (Online Methods). We considered different proportions of causal variants (5%, 15%, 35% on average) in the signal region. For both continuous and dichotomous traits, STAAR-O incorporating all 10 annotations had higher power than the conventional variant-set tests in terms of signal region detection (Supplementary Figs. 1–4). Power simulation results of STAAR-O for different magnitudes of effect sizes and different proportions of effect size directions yielded the same conclusion (Supplementary Figs. 1, 5, and 6). Overall, our simulation studies showed that STAAR-O could provide considerably higher power than conventional variant-set tests.

Association analysis of lipid traits in the TOPMed WGS data.

We applied STAAR to identify RV-sets associated with four quantitative lipid traits (LDL-C, HDL-C, TG and TC) using TOPMed WGS data[44,45]. LDL-C and TC were adjusted for the presence of medications as before[44]. DNA samples were sequenced at >30X target coverage. The discovery phase consists of four study cohorts of TOPMed Freeze 3. The replication phase consists of ten different study cohorts in TOPMed Freeze 5 that were not in Freeze 3 (Supplementary Note and Supplementary Table 3). Sample-level and variant-level quality control (QC) were performed[44,45]. There were 12,316 discovery samples, which had 155 million single nucleotide variants (SNVs), and 17,822 replication samples, which had 188 million SNVs. The TOPMed data consist of ancestrally diverse and multi-ethnic related samples. Race/ethnicity was defined using a combination of self-reported race/ethnicity and study recruitment information. The discovery cohorts consist of 4,580 (37.2%) Black or African American, 6,266 (50.9%) White, 543 (4.4%) Asian American, and 927 (7.5%) Hispanic/Latino American. Among all samples in discovery phase, 3,577 (29.0%) had first-degree relatedness, 491 (4.0%) had second-degree relatedness, and 273 (2.2%) had third-degree relatedness (Supplementary Fig. 7). Among all SNVs observed in the discovery samples, there were 6.5 million (4.2%) common variants (MAF > 5%), 5.3 million (3.4%) low frequency variants (1% MAF 5%), and 143.2 million (92.4%) rare variants (MAF < 1%). The race/ethnicity distribution, related sample distribution, and variant number distribution for replication phase and pooled samples (samples from both discovery phase and replication phase) are given in Supplementary Table 4. Our study used the proposed STAAR-O method to perform (i) gene-centric analysis using RV-sets based on functional categories, and (ii) genetic region analysis using variant-sets defined by 2-kb sliding windows with 1-kb skip length across the genome. We adjusted for age, age[2], sex, race/ethnicity, study, and the first 10 ancestral PCs, while controlling for relatedness using linear mixed models, with inverse-rank normal transformation applied to phenotypes (Online Methods). Race/ethnicity was included as a covariate to adjust for sociocultural and environmental factors, while genetic ancestry differences were captured by the inclusion of the ancestral PCs. In addition to the two MAF weights[3], we incorporated 13 aggregated functional annotation scores in STAAR-O: 3 integrative scores (CADD[27], LINSIGHT[46], and FATHMM-XF[47]) and 10 aPCs. Figure 2 summarizes the correlation among all functional annotations, including 60 individual scores, 3 integrative scores, and 10 aPCs.

Gene-centric association analysis of coding and non-coding rare variants.

We performed gene-centric analysis to identify whether rare variants in coding, promoter, and enhancer regions of genes are associated with lipid traits using STAAR-O. For each of the four lipid traits, we analyzed five functional categories (masks) of coding and non-coding variants: (i) pLoF (stop gain, stop loss and splice) RVs, (ii) missense RVs, (iii) synonymous RVs, (iv) promoter RVs, and (v) enhancer RVs. The pLoF, missense, and synonymous RVs were defined by GENCODE VEP categories[20,21]. The promoter RVs were defined as RVs in the +/− 3-kb window of transcription starting site (TSS) with overlap of Cap Analysis of Gene Expression (CAGE) sites. The enhancer RVs were defined as RVs in GeneHancer predicted regions with overlap of CAGE sites[48-50]. Within each gene functional category, we tested for an association between rare variants (MAF < 1%) in the functional category and lipid traits using STAAR-O with the 13 aggregated functional annotations described above. For missense RVs, we incorporated an additional annotation functional category predicting functionally “disruptive” variants determined by MetaSVM[51], which measures the deleteriousness of missense mutations. The overall distributions of STAAR-O P-values were well calibrated for all four lipid phenotypes (Supplementary Fig. 8). We considered in unconditional analysis a Bonferroni-corrected genome-wide significance threshold of accounting for five different masks across protein-coding genes. STAAR-O identified 21 genome-wide significant associations with four lipid phenotypes using unconditional analysis of the discovery samples (Supplementary Table 5 and Supplementary Fig. 9). After conditioning on known lipids-associated variants[44,52-67], 11 out of the 21 associations remained significant at the Bonferroni correction level using the discovery samples. These included associations with LDL-C (pLoF RVs in PCSK9 and APOB, missense RVs in PCSK9, NPC1L1, and APOE), association with HDL-C (pLoF RVs in APOC3), association with TG (pLoF RVs in APOC3), and associations with TC (pLoF RVs in PCSK9 and APOB, missense RVs in PCSK9 and LIPG) (Table 1). Of these 11 associations, 10 were replicated at the Bonferroni-corrected level after adjusting for known lipid-associated variants. The association between APOC3 pLoF RVs and HDL-C was unreported in a previous study using the same TOPMed Freeze 3 data[44].

Table 1 |

Gene-centric analysis results of both unconditional analysis and analysis conditional on known common and low-frequency variants.

12,316 discovery samples, 17,822 replication samples and 30,138 pooled samples from TOPMed program were considered in the analysis. Results for the conditionally significant genes (unconditional STAAR-O ; conditional STAAR-O ) using discovery samples are presented in the table. Chr (chromosome); Category (functional category); #SNV (number of rare variants (MAF < 1%) of the particular functional category in the gene); STAAR-O (STAAR-O P-value); LDL-C (low-density lipoprotein cholesterol); HDL-C (high-density lipoprotein cholesterol); TG (triglycerides); TC (total cholesterol); Variants Adjusted (adjusted variants in conditional analysis).

Trait	Gene	Chr	Category	Discovery			Replication			Pooled			Variants Adjusted

				#SNV	STAAR-O (Unconditional)	STAAR-O (Conditional)	#SNV	STAAR-O (Unconditional)	STAAR-O (Conditional)	#SNV	STAAR-O (Unconditional)	STAAR-O (Conditional)
LDL-C	PCSK9	1	pLoF	5	3.09E-38	1.94E-07	8	6.97E-27	5.29E-10	9	4.59E-65	7.52E-17	rs28362286, rs28362263, rs11591147, rs12117661
	APOB	2	pLoF	11	1.91E-14	2.38E-14	5	1.97E-09	1.76E-09	16	3.91E-21	4.08E-21	rs934197
	PCSK9	1	missense	92	1.09E-16	2.65E-08	129	1.90E-06	1.15E-06	167	2.11E-15	1.14E-14	rs28362286, rs28362263, rs11591147, rs12117661
	NPC1L1	7	missense	174	1.29E-07	3.83E-07	219	2.19E-03	3.28E-03	293	3.25E-10	1.58E-09	rs10234070, rs73107473, rs2072183, rs41279633, rs17725246, rs2073547, rs10260606, rs217386, rs7791240, rs2300414
	NPC1L1	7	disruptive missense	94	3.15E-09*	9.27E-09*	129	1.46E-04*	2.59E-04*	173	8.05E-12*	4.02E-11*	rs10234070, rs73107473, rs2072183, rs41279633, rs17725246, rs2073547, rs10260606, rs217386, rs7791240, rs2300414
	APOE	19	missense	54	3.11E-10	9.88E-11	58	6.61E-05	3.47E-04	88	1.07E-13	2.02E-12	rs7412, rs429358

HDL-C	APOC3	11	pLoF	5	2.20E-07	6.82E-07	6	5.73E-18	2.89E-17	7	3.18E-23	4.51E-22	rs66505542

TG	APOC3	11	pLoF	5	1.10E-14	5.53E-14	6	2.67E-49	2.73E-46	7	3.98E-56	1.04E-52	rs66505542, rs964184, rs7350481

TC	PCSK9	1	pLoF	5	4.60E-33	2.04E-10	8	1.83E-25	9.74E-11	9	9.83E-58	4.23E-20	rs28362286, rs11591147, rs191448952
	APOB	2	pLoF	11	7.29E-13	8.78E-13	5	2.62E-09	2.30E-09	16	9.76E-20	1.01E-19	rs934197
	PCSK9	1	missense	92	6.00E-15	1.11E-06	131	2.14E-05	1.13E-05	169	5.18E-12	3.16E-12	rs28362286, rs11591147, rs191448952
	LIPG	18	missense	62	9.61E-08	4.34E-06	68	3.45E-04	1.47E-01	101	2.04E-09	5.62E-04	rs4939883, rs7241918, rs149615216

Burden test P-value.

The association between missense RVs in NPC1L1 and LDL-C was not detected by the conventional variant-set tests and has not been observed in previous studies[44,55,68,69]. In the discovery phase, its unconditional STAAR-O P-value was while the most significant conventional variant-set test was the burden test with This association was not driven by any single RV (minimum single RV P-value ) but was due to the aggregated effects of multiple missense RVs. The P-value of the burden test additionally weighted by MetaSVM was the smallest of all annotations highlighting the significant association between disruptive missense RVs in NPC1L1 and LDL-C (Supplementary Fig. 10). Among all 174 missense RVs in NPC1L1 from the discovery samples, the disruptive missense RVs as predicted by MetaSVM were enriched among variants with higher aPC-Conservation scores (Supplementary Table 6). This contributed to the test weighted by aPC-Conservation being the most significant across all quantitative annotation-weighted tests included in STAAR-O (burden ). As aPC-Conservation summarizes variants’ evolutionary conservation scores, it is informative in predicting whether or not variants are deleterious and thus functional[70,71]. Conditioning on the ten known common variants in NPC1L1 associated with LDL-C (Supplementary Table 7)[57-61,65-67], the association between disruptive missense RVs in NPC1L1 and LDL-C remained significant after Bonferroni correction with the conditional analysis in discovery phase. This association was validated in replication phase with and with in pooled samples in conditional analysis. This significant association was also validated using whole exome sequencing data from the UK Biobank[72] with in conditional analysis.

Genetic region analysis of rare variants.

We performed genetic region analysis to determine whether RVs within sliding windows are associated with lipid traits. The sliding windows were defined to be 2 kb in length, start at position 0 bp for each chromosome, and have a skip length of 1 kb. Windows with a total minor allele count less than 10 were excluded from the analysis, resulting in a total of 2.66 million 2-kb overlapping windows, with a median of 104 RVs in each sliding window among discovery samples. For each 2-kb window, we tested for an association between the RVs in the window and each lipid trait using STAAR-O by incorporating 13 aggregated quantitative annotations. The overall distributions of STAAR-O P-values were well calibrated for all four lipid phenotypes (Fig. 3b and Supplementary Figs. 11b, 12b, and 13b). Using the Bonferroni correction, we set the genome-wide significance threshold at across sliding windows (Fig. 3a and Supplementary Figs. 11a, 12a, and 13a). Supplementary Table 8 summarizes the significant 2-kb sliding windows identified using STAAR-O. Overall, by dynamically incorporating multiple functional annotations capturing different aspects of variant function, STAAR-O was able to detect more significant sliding windows, and showed consistently smaller P-values for top sliding windows compared with conventional variant-set tests weighted using MAFs (Fig. 3c,d and Supplementary Figs. 11c–f, 12c, and 14). Burden tests were not able to detect any window that reached significance.

Figure 3 |

Genetic region (2-kb sliding window) unconditional analysis results of LDL-C in discovery phase using the TOPMed cohort.

a, Manhattan plot showing the associations of 2.66 million 2-kb sliding windows for LDL-C versus of STAAR-O. The horizontal line indicates a genome-wide P-value threshold of (n = 12,316). b, Quantile-quantile plot of 2-kb sliding window STAAR-O P-values for LDL-C (n = 12,316). c, Genetic landscape of the windows significantly associated with LDL-C that are located in the 150-kb region on chromosome 19. Four statistical tests were compared: Burden, SKAT, ACAT-V and STAAR-O. A dot indicates that the sliding window at this location is significant using the statistical test that the color of the dot represents (n = 12,316). d, Scatterplot of P-values for the 2-kb sliding windows comparing STAAR-O with Burden, SKAT and ACAT-V tests. Each dot represents a sliding window with x-axis label being the of the conventional test and y-axis label being the of STAAR-O (n = 12,316).

Among the 59 genome-wide significant sliding windows detected by STAAR-O in unconditional analysis, 17 remained significant at the Bonferroni correction level after conditioning on known lipids-associated variants using the discovery samples (Table 2). For LDL-C, the significant sliding windows were located in gene PCSK9 or in a 50-kb region on chromosome 19 including the APOE cluster. For TC, all of the significant sliding windows were located in the same areas as for LDL-C. For TG, STAAR-O detected two consecutive significant sliding windows within APOC3, whereas no significant sliding windows were detected for HDL-C. Of these 17 associations, six were replicated at level after Bonferroni correction and another four were replicated at level after Bonferroni correction for nine non-overlapping sliding windows in conditional analysis of replication samples[17], including a sliding window located downstream of APOC1P1 (Chr 19: 44,931,528 bp - 44,933,527 bp), which was significantly associated with LDL-C but undetected by the burden test, SKAT, and ACAT-V (Table 2 and Fig. 3c).

Table 2 |

Genetic region (2-kb sliding window) analysis results of both unconditional analysis and analysis conditional on known common and low-frequency variants.

12,316 discovery samples, 17,822 replication samples and 30,138 pooled samples from the TOPMed program were considered in the analysis. Results for the conditionally significant sliding windows (unconditional STAAR-O ; conditional STAAR-O ) using discovery samples are presented in the table. Chr (chromosome); Start Location (start location of the 2-kb sliding window); End Location (end location of the 2-kb sliding window); #SNV (number of rare variants (MAF < 1%) in the 2-kb sliding window); STAAR-O (STAAR-O P-value); LDL-C (low-density lipoprotein cholesterol); TG (triglycerides); TC (total cholesterol); Variants Adjusted (adjusted variants in conditional analysis). Physical positions of each window are on build hg38.

Trait	Chr	Start Location	End Location	Gene	Discovery			Replication			Pooled			Variants Adjusted

					#SNV	STAAR-O (Unconditional)	STAAR-O (Conditional)	#SNV	STAAR-O (Unconditional)	STAAR-O (Conditional)	#SNV	STAAR-O (Unconditional)	STAAR-O (Conditional)
LDL-C	1	55045498	55047497	PCSK9	114	7.83E-09	1.06E-04	124	3.33E-06	4.10E-04	186	1.89E-15	2.90E-09	rs28362286, rs28362263, rs11591147, rs12117661
	1	55046498	55048497	PCSK9	124	5.32E-09	2.13E-05	130	1.79E-06	8.79E-05	191	1.33E-15	1.15E-09	rs28362286, rs28362263, rs11591147, rs12117661
	19	44881528	44883527	NECTIN2	118	7.31E-10	1.81E-08	155	5.16E-04	2.42E-01	202	8.15E-08	5.26E-06	rs7412, rs429358
	19	44882528	44884527	NECTIN2	104	2.08E-10	3.90E-09	133	1.23E-01	3.59E-01	176	1.38E-08	7.47E-07	rs7412, rs429358
	19	44893528	44895527	TOMM40	110	2.64E-19	2.33E-11	136	4.54E-09	2.60E-02	187	7.29E-29	7.62E-13	rs7412, rs429358
	19	44894528	44896527	TOMM40	120	2.44E-15	4.31E-11	153	7.62E-05	1.74E-02	205	6.73E-20	5.28E-13	rs7412, rs429358
	19	44905528	44907527	APOE	91	1.73E-10	1.64E-10	115	1.22E-02	4.91E-03	169	7.68E-12	9.00E-12	rs7412, rs429358
	19	44906528	44908527	APOE	84	1.67E-09	1.90E-10	115	8.65E-03	3.24E-03	165	8.34E-11	6.25E-12	rs7412, rs429358
	19	44907528	44909527	APOE	113	1.01E-09	1.97E-10	143	5.92E-03	3.58E-03	205	4.88E-11	8.71E-12	rs7412, rs429358
	19	44908528	44910527	APOE	140	6.30E-10	1.32E-10	152	4.14E-03	6.10E-03	228	2.40E-11	5.21E-12	rs7412, rs429358
	19	44931528	44933527	APOC1P1	114	6.63E-09	7.60E-04	123	5.78E-11	5.40E-03	181	1.34E-19	4.15E-06	rs7412, rs429358

TG	11	116828930	116830929	APOC3	125	4.63E-10	2.80E-09	155	1.35E-36	3.94E-34	207	7.32E-45	2.73E-41	rs66505542, rs964184, rs7350481
TG	11	116829930	116831929	APOC3	109	3.61E-10	5.99E-10	140	2.85E-36	4.25E-34	187	5.75E-45	2.17E-41	rs66505542, rs964184, rs7350481

TC	1	55045498	55047497	PCSK9	114	3.05E-09	2.86E-07	130	3.12E-06	1.92E-06	189	2.22E-15	9.21E-14	rs28362286, rs11591147, rs191448952
	1	55046498	55048497	PCSK9	124	2.24E-09	2.06E-07	138	2.19E-06	1.34E-06	195	1.78E-15	7.04E-14	rs28362286, rs11591147, rs191448952
	19	44893528	44895527	TOMM40	111	9.35E-13	4.37E-07	146	1.12E-07	4.02E-01	196	7.57E-21	7.91E-08	rs7412, rs429358
	19	44894528	44896527	TOMM40	120	1.80E-09	1.99E-06	164	1.08E-04	8.31E-01	213	8.40E-14	2.19E-07	rs7412, rs429358

The top variant of the significant sliding window located downstream of APOC1P1 was rs370625306 which was not significant at a Bonferroni-corrected threshold in individual variant analysis. This rare variant and the second top variant in these windows were upweighted by aPC-Epigenetic in STAAR-O (Supplementary Fig. 15). Specifically, the aPC-Epigenetic scores of rs370625306 and rs9749443 ranked in the top 10% and top 30% among all RVs, respectively, in each sliding window. Conditioning on the two known common variants rs7412 and rs429358 in APOE associated with LDL-C[55], the strength of association of both sliding windows was reduced but remained significant (Table 2). Similar results were found after further conditioning on APOE haplotypes using these two SNPs (Supplementary Table 8). This suggests that the effects of RVs in this sliding window are not fully captured by the two known common LDL-associated variants. STAAR-O also identified and replicated two highly significant windows in APOC3 associated with TG in conditional analysis that were undetected by SKAT and burden test[73].

STAAR identifies more associations using relevant tissue functional annotations.

To evaluate the effect of tissue specificity, we compared the performance of STAAR-O in both gene-centric and genetic region analysis by incorporating liver (a central hub for lipid metabolism), heart, and brain annotations. For each tissue, we calculated a tissue-specific aPC from tissue-specific DNase, H3K4me3, H3K27ac and H3K27me3 from ENCODE (Supplementary Table 9)[26,74]. We used tissue-specific CAGE sites with overlap of RVs in the +/− 3-kb window of TSS and GeneHancers to define promoter and enhancer RV masks in gene-centric analysis. To make a fair comparison between tissues, we calculated STAAR-O P-values based solely on the tissue-specific aPC and without incorporating the MAF and other annotations. Overall, the use of liver annotation resulted in more significant levels of association than heart and brain annotations, as would be expected for lipid traits, although no additional replicated conditionally significant association was detected by using tissue-specific annotations. STAAR-O identified 9 and 8 replicated conditionally significant associations by using liver annotation in gene-centric and genetic region analysis, respectively (Supplementary Tables 10 and 11). Among these 17 significant associations, two were not seen when heart annotation was used and two were not seen when brain annotation was used, and no additional associations were detected by using heart and brain annotations (Supplementary Tables 10 and 11). Furthermore, more suggestive significant associations were detected when using liver annotation than the other two tissues at various levels of unconditional P-value thresholds in the discovery phase (Supplementary Figs. 16 and 17).

Computation cost.

We developed an R package, STAAR, to perform scalable variant-set association tests incorporating multiple variant annotations for WGS RVAS. Using sparse GRMs[38], STAAR scales well both in terms of computation time and memory for very large-scale WGS association studies, such as sample sizes in TOPMed, GSP, and UK Biobank. The computation time for STAAR-O to perform WGS gene-centric and genetic region analysis on 30,000 related samples using the TOPMed data requires 15 hours for 100 2.10 GHz computing cores with 6 GB memory for each lipid trait. Analyzing 500,000 simulated related samples mimicking the UK Biobank sample size requires 26 hours for WGS analysis using the same approach and computational resources (Online Methods).

Discussion

We propose STAAR as a general, computationally scalable framework that effectively incorporates multiple qualitative and quantitative variant functional annotations to boost power for variant-set tests for continuous and binary traits in WGS RVAS, while accounting for both population structure and relatedness using GLMMs. We highlighted STAAR-O, the omnibus test that aggregates multiple annotation-weighted tests in the STAAR framework. We focused on two types of WGS RV association analyses using STAAR-O: gene-centric analyses by grouping coding and non-coding variants into functional categories for each protein-coding gene, and agnostic genetic region analyses using sliding windows. In extensive simulation studies, we demonstrated that STAAR-O achieves substantial power gain compared with conventional variant-set tests weighted by MAF, while maintaining accurate type I error rates for both quantitative and dichotomous phenotypes. In a WGS RV analysis of lipid traits using the TOPMed data, STAAR-O identified several conditionally significant functional categories associated with lipid traits in gene-centric analysis (including NPC1L1 missense RVs and LDL-C; APOC3 pLoF RVs and HDL-C; and LIPG missense RVs and TC) that were missed by the previous study using the same TOPMed data[44]. Earlier studies reported marginal association between inactivating mutations (pLoF RVs and frameshift indels) in NPC1L1 and LDL-C with [69], which was replicated using the pooled TOPMed samples no significant association between pLoF RVs and LDL-C was found STAAR-O identified much more significant novel association, which replicated, between missense RVs in NPC1L1 and LDL-C, which was driven by disruptive missense RVs (conditional in pooled samples). None of these disruptive missense RVs was reported in ClinVar[75], suggesting that the findings from emerging WGS studies can help guide the expansion of the ClinVar database. NPC1L1 is the direct molecular target of the lipid-lowering drug ezetimibe, which reduces the absorption of cholesterol by binding to NPC1L1[76]. STAAR-O also suggested several conditional associations in the discovery phase that were validated in our replication phase and achieved significance in pooled samples (Supplementary Table 12). In agnostic sliding-window based genetic region analysis, STAAR-O detected and replicated 10 sliding windows after conditioning on known variants, including association between an intergenic region located downstream of APOC1P1 and LDL-C, that were not detected using conventional tests. This detected APOC1P1 region is located in the hepatic control region 2 (HCR-2) that regulates hepatic expression of apolipoproteins. By further conditioning on the APOE haplotypes and rs35136575, a common variant previously found in the downstream HCR-2 associated with LDL-C[77], the strength of association was reduced but remained significant (Supplementary Table 8). This discovery is due to upweighting several plausibly causal rare variants that have regulatory functions using aPC-Epigenetic scores in STAAR-O (Supplementary Fig. 15 and Supplementary Table 13). These results highlight that incorporating multiple functional annotations using STAAR can effectively boost power for WGS RVAS. To capture multiple aspects of variant functionality, we introduced annotation PCs by performing dimension reduction of a large number of diverse individual annotations from various external databases. See Online Methods for an example demonstrating that aPCs explain diverse and complementary functionality of known LDL-associated functional rare variants, and STAAR provides greater power for RV association tests by upweighting these variants using aPCs. In practice, STAAR is very flexible and users can determine the set of individual annotations to calculate aPCs and the number of aPCs and integrative functional scores and other qualitative scores to be used, as well as tissue, cell-type and phenotype-specific variant annotations[78-80]. In this paper, we group the individual annotations based on biological knowledge; users can also apply data-driven approaches, such as clustering, to group annotations for aPC calculation. We also demonstrate that STAAR detects more associations using relevant tissue functional annotations. It will be of interest, in future research, to incorporate improved rare variant effect size models in the weights to further improve power for RVAS[81,82]. The STAAR procedure is fast and scalable for very large WGS studies and biobanks of hundreds of thousands to millions of samples for both quantitative and dichotomous phenotypes as it uses estimated sparse GRMs[38] to fit the null GLMM and to scan the genome. Besides using sliding windows of a pre-specified fixed window length, STAAR could be extended to flexibly detect the sizes and locations of coding and non-coding rare variant association regions using the dynamic window analysis method SCANG[83]. In addition, STAAR could be extended to settings with survival, unbalanced case-control, and multiple phenotypes, and hence could provide a comprehensive framework for WGS RVAS. Thus, STAAR provides a powerful and flexible tool for variant association discovery in many settings to explore the molecular basis of common diseases.

Online Methods

Notations and model.

Suppose there are subjects with M total variants sequenced across the whole genome. Given a genetic set of variants, for subject i, let denote a continuous or dichotomous trait with mean denote q covariates, such as age, gender, ancestral principal components; and denote the genotype information of the p genetic variants in a variant-set. When the data consist of unrelated samples, we consider the following Generalized Linear Model (GLM) Where for a continuous normally distributed trait, for a dichotomous trait, is an intercept, is a vector of regression coefficients for , and is a vector of regression coefficients for . When the data consist of related samples, we consider the following Generalized Linear Mixed Model (GLMM)35−37 where the random effects account for remaining population structure unaccounted by ancestral PCs, relatedness, and other between-observation correlation. We assume that with variance components and known covariance matrices . The random effects can be decomposed into a sum of multiple random effects to account for different sources of relatedness and correlation as with For example, accounts for population structure and family relatedness by using the Genetic Relatedness Matrices (GRMs) as its covariance matrix [84,85]. A sparse GRM can be used to scale up computation[38]. Additional random effects can be used to account for complex sampling designs, such as correlation between repeated measures from longitudinal studies using subject-specific random intercepts and slopes and hierarchical designs. The remaining variables are defined in the same way as those in the GLM (1). Under both the GLM and the GLMM, we are interested in testing the null hypothesis of whether the variant-set is associated with the phenotype, adjusting for covariates and relatedness, which corresponds to , that is,

Conventional variant-set tests.

Conventional score-based aggregation methods allow for jointly testing the association between variants in the genetic set and phenotype. In particular, burden tests[4-7] assume that where is a constant for all variants, such that the corresponding burden test statistic to test is given by where is the score statistic of the marginal model for variant and is the estimated mean of under the null GLM or the null GLMM asymptotically follows a chi-square distribution with 1 degree of freedom under the null hypothesis, and its P-value can be obtained analytically while accounting for linkage disequilibrium (LD) between variants[3,37]. For SKAT[8], the ’s are assumed to be independent and identically distributed (i.i.d.) following an arbitrary distribution, with and The null hypothesis of no variant-set effect is equivalent to , and the corresponding SKAT test statistic is given by asymptotically follows a mixture of chi-square distributions under the null hypothesis, and its P-value can be obtained analytically while accounting for LD between variants[3,37]. Further, the recently proposed ACAT-V test uses a combination of transformed variant P-values rather than operating on the test statistics directly[9]. The ACAT-V test statistic is given by where is the number of variants with minor allele count (MAC) greater than 10 and is the association P-value of individual variant corresponding the individual variant score statistics for those variants with MAC 10. is the burden test P-value of extremely rare variants with is the average of the weights among the extremely rare variants with MAC 10. can be well approximated by a Cauchy distribution under the null hypothesis, and its P-value can be obtained analytically while accounting for LD between variants[9]. For binary traits in highly unbalanced designs, one can improve individual P-value calculations using Saddlepoint approximation[86,87]. These conventional approaches consider a weight defined as a threshold indicator or a function of minor allele frequency (MAF) for variant , i.e. Common choices of the parameters are which upweights rarer variants, or which corresponds to equal weights for all variants. In WGS studies, the vast majority of rare variants across the genome are not causal. Thus, choosing their weights according to MAF will incorrectly upweight many such “noise” variants in a variant-set and result in a loss of statistical power. Weighting using multiple variant functional annotations will help overcome this deficiency.

Calculation of annotation principal components using individual functional annotations.

To effectively capture the multi-faceted biological impact of a variant while reducing dimensionality, we propose variant annotation Principal Components (aPCs) as the PC summary of the functional annotation data by incorporating individual scores extracted from various functional databases[26,27,39-41,88]. We first group the individual scores into 10 major functional categories based on a priori knowledge, each capturing a specific aspect of variant biological function, including epigenetics, conservation, protein function, local nucleotide diversity, distance to coding, mutation density, transcription factors, mappability, distance to TSS/TES, and micro RNA (Fig. 2). For each category, we then center and standardize all individual scores within the category, such that higher value of each individual score indicates increased functionality of that annotation, and calculate aPC as the first PC from the standardized individual scores (Supplementary Table 1). To facilitate better interpretation, these aPCs are then transformed into the PHRED-scaled scores for each variant across the genome, defined as where is total number of variants sequenced across the whole genome. Unlike ancestral PCs that are subject-specific and are calculated using genotypes across the genome to control for population structure, annotation PCs are variant-specific and are calculated using functional annotations for individual variants and are used to summarize multi-facet functions of individual variants. Complementary to other existing single-dimension integrative functional scores, annotation PCs summarize multiple aspects of variant function, with different blocks captured by different annotation PCs in the heatmap (Fig. 2).

STAAR incorporating multiple functional annotations.

STAAR constructs the weights by modeling the probability of a variant being causal using its functional annotation information via qualitative annotations (e.g. functional categories) and quantitative annotations (e.g. annotation PCs and integrative annotations), as well as modeling the effect sizes of causal variants. Specifically, we consider the effect of variant on a phenotype can be written as where is the latent binary indicator of whether variant is causal, and is the effect size of variant if it is causal. The burden test, SKAT, and ACAT-V make direct assumptions on the variance of using MAF information. This newly proposed variant effect model is expected to increase association power since a variant’s causal status can be prioritized using its functional annotations[13,14]. Let denote the probability of variant being causal, then the effect of variant given above is equivalent to where is the Dirac delta function indicating that with probability , variant has no association with the phenotype. Define as the estimated probability of th variant being causal using the th annotation , e.g., measures the estimated probability that the th variant is causal using epigenetic annotation, aPC-Epigenetic. We estimate using the empirical CDF of the th annotation for variant using its rank among all variants as where is the th annotation for the th variant. For , we set as the intercept, which gives For a quantitative annotation, represents its numeric value, e.g., the th annotation PC. The quantitative we consider in this paper include 10 aPCs (Supplementary Table 1) and existing integrative scores, including CADD[27], LINSIGHT[46], and FATHMM-XF[47]. For a qualitative annotation, we define for variants in the functional group (yes) and for variants otherwise (no). For example, denotes whether a variant is a disruptive missense variant using MetaSVM[51]. Hence, for variants in the functional group and otherwise, e.g., disruptive missense variants (yes/no). This corresponds to the RV tests using variants of this functional group. In the STAAR framework, we model the effect sizes of causal variants in the same way as that used in conventional variant-set tests. Specifically, we assume where is assumed as a function of MAFs. For simplicity, we model using or . Then, the burden test statistic using th variant functional annotation as the weight, e.g., aPC-Epigenetic, is given by whose P-value is denoted by . Under the assumption of SKAT, by estimating the probability of th variant being causal using the th annotation , we have and . Hence, the SKAT test statistic using th variant functional annotation as the weight is given by whose P-value is denoted by . In the ACAT-V test, the test statistic using th variant functional annotation as the weight is given by where is the average of the weights among the extremely rare variants with MAC 10. The P-value of is denoted by We denote by the P-values of burden, SKAT, and ACAT-V tests, respectively calculated using the th annotation as the weight. For each type of RV tests, to robustly aggregate information from multiple annotations to boost power RV association tests in a data-adaptive manner, we propose to use the STAAR framework to combine individual annotation weighted tests using the ACAT P-value combination method[9,42]. Specifically, we define STAAR-Burden (STAAR-B), STAAR-SKAT (STAAR-S), and STAAR-ACAT-V (STAAR-A) as The P-value of can be approximated by To further aggregate information from different types tests and different weights, we propose an omnibus test in the STAAR framework (STAAR-O) by combining STAAR-B, STAAR-S and STAAR-A using the ACAT method[9,42]. We define the STAAR-O test statistic as where denote the P-values of STAAR-B, STAAR-S, and STAAR-A using is the set of specified values of is the size of set . In practice, we set . The P-value of could then be accurately approximated by By combining different types of tests into an omnibus test, STAAR-O has a robust power with respect to the sparsity of causal variants and the directionality of effects of causal variants in a variant-set, as well as variant multi-facet functions and MAFs. Specifically, by including the burden test, STAAR-O is powerful when majority of variants in a variant-set are causal and have effects in the same direction; by including SKAT, STAAR-O is powerful when not a small number of variants in a variant-set are causal with effects in different directions, or when variants in a variant-set are in high LD; by including ACAT-V, STAAR-O is powerful when a small number of variants in a variant-set are causal or a good number of extremely rare variants are causal; by weighting each type of tests using multiple annotation PCs and other integrative functional scores and qualitative annotations, STAAR-O is powerful when any of these variant functional annotations can pinpoint causal variants and help boost power.

Data simulation.

Type I error simulations.

We performed extensive simulation studies to evaluate whether the proposed STAAR framework preserves the desired type I error rate. We generated continuous traits from a linear model defined as where Dichotomous traits were generated from a logistic model defined as where and were defined the same as continuous traits and was determined to set the prevalence to 1%. In this setting, we used a balanced case-control design. We generated genotypes by simulating 20,000 sequences for 100 different regions each spanning 1 Mb. The data were generated to mimic the LD structure of an African American population by using the calibration coalescent model (COSI)[43]. In each simulation replicate, 10 annotations were generated as i.i.d. for each variant, and we randomly selected 5-kb regions from these 1-Mb regions for type I error simulations. We applied STAAR-B, STAAR-S, STAAR-A, and STAAR-O by incorporating MAFs and the 10 annotations and repeated the procedure with replicates to examine the type I error rate at levels. Total sample sizes considered were 2,500, 5,000, and 10,000.

Empirical power simulations.

Next, we carried out simulation study under a variety of configurations to assess the power gain by incorporating multiple functional annotations using STAAR compared to conventional variant-set tests that use MAFs as weights. In each simulation replicate, we randomly selected 5-kb regions from these 1-Mb regions for power simulations. For each selected 5-kb region, we generated causal variants according to a logistic model defined as where were randomly sampled for each region. For different regions, causality of variants was allowed to be dependent on different sets of annotations. We set for all annotations and varied the proportions of causal variants in the signal region by setting for averaging 5%, 15% and 35% causal variants in the signal region, respectively. We generated continuous traits from a linear model given by where were defined the same as the type I error simulations, were the genotypes of the s causal variants in the signal region, and were the corresponding effect sizes of causal variants. Dichotomous traits were generated from a logistic model given by where were defined the same as the type I error simulations, were the genotypes of the causal variants in the signal region, and were the corresponding log ORs of the causal variants. Under both settings, we model the effect sizes of causal variants using The effect size of causal variant was therefore a decreasing function of MAF. For continuous traits, was set to be 0.13. For dichotomous traits, was set to be 0.255, which gives an odds ratio of 3 for a variant with MAF of . For each setting, we additionally varied the proportions of causal variant effect size directions by setting 100%, 80%, and 50% variants to have positive effects. Finally, we performed simulations using different magnitudes of effect sizes by varying the values of across a wide range. We applied STAAR-B, STAAR-S, STAAR-A, and STAAR-O using MAFs and all 10 annotations in the weighting scheme, and repeated the procedure with replicates to examine the powers at level. Total sample sizes considered were 10,000 across all settings.

Computation cost.

To test the computation time of 500,000 related samples, we simulated 1,000 genomic regions, each with 100 variants, for 1 million haplotypes of 125,000 families with 2 parents and 2 children per family. The computation time for WGS RVAS was estimated by analyzing 2.5 million variant-sets with on average 100 variants in each set using STAAR.

Statistical analysis of lipid traits in the TOPMed data.

The TOPMed WGS data consist of ancestrally diverse and multi-ethnic related samples[45]. Race/ethnicity was defined using a combination of self-reported race/ethnicity and study recruitment information. The discovery cohorts consist of 4,580 (37.2%) Black or African American, 6,266 (50.9%) White, 543 (4.4%) Asian American, and 927 (7.5%) Hispanic/Latino American. The replication cohorts consist of 3,534 (19.8%) Black or African American, 11,662 (65.4%) White, 132 (0.7%) Asian American, and 2,494 (14.0%) others. The “others” category in the replication cohort includes many Hispanic/Latino American as well as a cohort of Samoans. We applied STAAR-O to identify RV-sets associated with four quantitative lipid traits (LDL, HDL, TG and TC) using the TOPMed WGS data. LDL-C and TC were adjusted for the presence of medications as before[44]. Linear regression model adjusting for age, age[2], sex was first fit for each study-race/ethnicity-specific group. In addition, for Old Order Amish (OOA), we also adjusted for APOB p.R3527Q in LDL-C and TC analyses and adjusted for APOC3 p.R19Ter in TG and HDL-C analyses[44]. The residuals were rank-based inverse normal transformed and rescaled by the standard deviation of the original phenotype within each group. We then fit a heteroscedastic linear mixed model (HLMM) for the rank normalized residuals, adjusting for 10 ancestral PCs, study-ethnicity group indicators, and a variance component for empirically derived kinship matrix plus separate group-specific residual variance components to account for population structure and relatedness. The output of HLMM was then used to perform following variant set analyses for rare variants (MAF < 1%) by scanning the genome, including gene-centric analysis using five variant categories (pLoF RVs, missense RVs, synonymous RVs, promoter RVs, and enhancer RVs) for each protein coded gene, and agnostic genetic region analysis using 2-kb sliding windows across the genome with a 1-kb skip length. The WGS RVAS analysis was performed using the R package STAAR (version 0.9.5). The aPCs provide diverse and complementary information on variant functionality, and are incorporated in rare variant association tests using an omnibus weighting scheme via the proposed STAAR method. We demonstrate using the following example that STAAR boosts the rare variant association test power by properly upweighting known LDL-associated functional rare variants. For example, the association between a 2-kb sliding window located at 55,038,498 bp - 55,040,497 bp on chromosome 1 and LDL-C using STAAR-O is more significant than conventional tests in unconditional analysis (Supplementary Table 14). This power gain of STAAR-O is due to upweighting functional variants, e.g., the known tolerated missense variant rs11591147 within the sliding window through incorporating multiple aPCs[59]. Specifically, the aPC-Epigenetic, aPC-Protein, and aPC-Mappability PHRED scores are greater than 20 (top 1% across the genome), and the aPC-MutationDensity, aPC-TF, and CADD PHRED scores are greater than 10 (top 10% across the genome) for this variant, highlighting the multi-dimensional functionality of this variant. The aPC-Protein and aPC-Mappability weighted SKAT P-values are and , which are more significant than SKAT () and burden test ().

Statistical analysis of LDL-C in the UK Biobank data.

We used UK Biobank whole exome sequences (WES) from the functionally equivalent (FE) pipeline. Sample and variant quality control measures were previously described[72,89]. In brief, samples with mismatch between genetically inferred and reported sex, high rates of heterozygosity or contamination (D-stat > 0.4), low sequence coverage (less than 85% of targeted bases achieving 20X coverage), duplicates, and WES variants discordant with genotyping chip were removed. A total of 43,243 individuals with genetically inferred European ancestry were included; 40,519 of those had data on LDL cholesterol. Total cholesterol was adjusted by dividing the value by 0.8 among individuals reporting lipid lowering medication use after 1994 or statin use at any time point. LDL cholesterol was calculated from adjusted total cholesterol levels by the Friedewald equation for individuals with triglyceride levels < 400 mg/dl. If LDL cholesterol levels were directly measured, then their values were divided by 0.7 among reporting lipid lowering medication use after 1994 or statin use at any time point. Residuals were created after adjustment for age, age[2], sex, and the first 10 ancestral principal components. Residuals were then rank-based inverse-normal transformed and multiplied by the standard deviation. Analyses were restricted to missense variants in the NPC1L1 gene predicted to be damaging according to the MetaSVM prediction algorithm and conditioned on ten known common variants in NPC1L1 associated with LDL-C (rs10234070, rs73107473, rs2072183, rs41279633, rs17725246, rs2073547, rs10260606, rs217386, rs7791240, rs2300414) obtained from the UK Biobank imputed genotype data. We performed a burden test for the association between disruptive missense RVs in NPC1L1 and LDL-C.

Reporting summary.

Further information on research design is available in the Nature Research Reporting Summary linked to this technical report.

Genome build.

All genome coordinates are given in NCBI GRCh38/UCSC hg38.

Code availability.

STAAR is implemented as an open source R package available at https://github.com/xihaoli/STAAR and https://content.sph.harvard.edu/xlin/software.html.

Data availability.

This paper used the TOPMed Freeze 5 Whole Genome Sequencing data and lipids phenotype data. The genotype and phenotype data are both available in dbGAP. The discovery phase used the data from the following four study cohorts, where the accession numbers are provided in parenthesis: Framingham Heart Study (phs000974.v1.p1), Old Order Amish (phs000956.v1.p1), Jackson Heart Study (phs000964.v1.p1), and Multi-Ethnic Study of Atherosclerosis (phs001416.v1.p1). The replication phase used the data from the following ten study cohorts: Atherosclerosis Risk in Communities Study (phs001211), Cleveland Family Study (phs000954), Cardiovascular Health Study (phs001368), Diabetes Heart Study (phs001412), Genetic Study of Atherosclerosis Risk (phs001218), Genetic Epidemiology Network of Arteriopathy (phs001345), Genetics of Lipid Lowering Drugs and Diet Network (phs001359), San Antonio Family Heart Study (phs001215), Genome-wide Association Study of Adiposity in Samoans (phs000972) and Women’s Health Initiative (phs001237). The sample sizes, ethnicity and phenotype summary statistics of these cohorts are given in Supplementary Table 3. The functional annotation data are publicly available and were downloaded from the following links: GRCh38 CADD v1.4 (https://cadd.gs.washington.edu/download), ANNOVAR dbNSFP v3.3a (https://annovar.openbioinformatics.org/en/latest/user-guide/download), LINSIGHT (https://github.com/CshlSiepelLab/LINSIGHT), FATHMM-XF (http://fathmm.biocompute.org.uk/fathmm-xf), CAGE (https://fantom.gsc.riken.jp/5/data), GeneHancer (https://www.genecards.org), and Umap/Bismap (https://bismap.hoffmanlab.org). In addition, recombination rate and nucleotide diversity were obtained from Gazal et al[90]. The tissue-specific functional annotations were downloaded from ENCODE (https://www.encodeproject.org/report/?type=Experiment).

83 in total

1. Exome sequencing and the genetic basis of complex traits.

Authors: Adam Kiezun; Kiran Garimella; Ron Do; Nathan O Stitziel; Benjamin M Neale; Paul J McLaren; Namrata Gupta; Pamela Sklar; Patrick F Sullivan; Jennifer L Moran; Christina M Hultman; Paul Lichtenstein; Patrik Magnusson; Thomas Lehner; Yin Yao Shugart; Alkes L Price; Paul I W de Bakker; Shaun M Purcell; Shamil R Sunyaev
Journal: Nat Genet Date: 2012-05-29 Impact factor: 38.330

2. Optimal tests for rare variant effects in sequencing association studies.

Authors: Seunggeun Lee; Michael C Wu; Xihong Lin
Journal: Biostatistics Date: 2012-06-14 Impact factor: 5.899

3. Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data.

Authors: Bingshan Li; Suzanne M Leal
Journal: Am J Hum Genet Date: 2008-08-07 Impact factor: 11.025

4. Rare-variant association testing for sequencing data with the sequence kernel association test.

Authors: Michael C Wu; Seunggeun Lee; Tianxi Cai; Yun Li; Michael Boehnke; Xihong Lin
Journal: Am J Hum Genet Date: 2011-07-07 Impact factor: 11.025

5. ACAT: A Fast and Powerful p Value Combination Method for Rare-Variant Analysis in Sequencing Studies.

Authors: Yaowu Liu; Sixing Chen; Zilin Li; Alanna C Morrison; Eric Boerwinkle; Xihong Lin
Journal: Am J Hum Genet Date: 2019-03-07 Impact factor: 11.025

Review 6. Rare-variant association analysis: study designs and statistical tests.

Authors: Seunggeung Lee; Gonçalo R Abecasis; Michael Boehnke; Xihong Lin
Journal: Am J Hum Genet Date: 2014-07-03 Impact factor: 11.025

7. A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST).

Authors: Stephan Morgenthaler; William G Thilly
Journal: Mutat Res Date: 2006-11-13 Impact factor: 2.433

Review 8. Statistical analysis strategies for association studies involving rare variants.

Authors: Vikas Bansal; Ondrej Libiger; Ali Torkamani; Nicholas J Schork
Journal: Nat Rev Genet Date: 2010-10-13 Impact factor: 53.242

9. A groupwise association test for rare mutations using a weighted sum statistic.

Authors: Bo Eskerod Madsen; Sharon R Browning
Journal: PLoS Genet Date: 2009-02-13 Impact factor: 5.917

10. An evaluation of statistical approaches to rare variant analysis in genetic association studies.

Authors: Andrew P Morris; Eleftheria Zeggini
Journal: Genet Epidemiol Date: 2010-02 Impact factor: 2.135

32 in total

1. eSCAN: scan regulatory regions for aggregate association testing using whole-genome sequencing data.

Authors: Yingxi Yang; Quan Sun; Le Huang; Jai G Broome; Adolfo Correa; Alexander Reiner; Laura M Raffield; Yuchen Yang; Yun Li
Journal: Brief Bioinform Date: 2022-01-17 Impact factor: 11.622

2. Cross-ancestry genome-wide meta-analysis of 61,047 cases and 947,237 controls identifies new susceptibility loci contributing to lung cancer.

Authors: Jinyoung Byun; Younghun Han; Yafang Li; Jun Xia; Erping Long; Jiyeon Choi; Xiangjun Xiao; Meng Zhu; Wen Zhou; Ryan Sun; Yohan Bossé; Zhuoyi Song; Ann Schwartz; Christine Lusk; Thorunn Rafnar; Kari Stefansson; Tongwu Zhang; Wei Zhao; Rowland W Pettit; Yanhong Liu; Xihao Li; Hufeng Zhou; Kyle M Walsh; Ivan Gorlov; Olga Gorlova; Dakai Zhu; Susan M Rosenberg; Susan Pinney; Joan E Bailey-Wilson; Diptasri Mandal; Mariza de Andrade; Colette Gaba; James C Willey; Ming You; Marshall Anderson; John K Wiencke; Demetrius Albanes; Stephan Lam; Adonina Tardon; Chu Chen; Gary Goodman; Stig Bojeson; Hermann Brenner; Maria Teresa Landi; Stephen J Chanock; Mattias Johansson; Thomas Muley; Angela Risch; H-Erich Wichmann; Heike Bickeböller; David C Christiani; Gad Rennert; Susanne Arnold; John K Field; Sanjay Shete; Loic Le Marchand; Olle Melander; Hans Brunnstrom; Geoffrey Liu; Angeline S Andrew; Lambertus A Kiemeney; Hongbing Shen; Shanbeh Zienolddiny; Kjell Grankvist; Mikael Johansson; Neil Caporaso; Angela Cox; Yun-Chul Hong; Jian-Min Yuan; Philip Lazarus; Matthew B Schabath; Melinda C Aldrich; Alpa Patel; Qing Lan; Nathaniel Rothman; Fiona Taylor; Linda Kachuri; John S Witte; Lori C Sakoda; Margaret Spitz; Paul Brennan; Xihong Lin; James McKay; Rayjean J Hung; Christopher I Amos
Journal: Nat Genet Date: 2022-08-01 Impact factor: 41.307

Review 3. Social and scientific motivations to move beyond groups in allele frequencies: The TOPMed experience.

Authors: Sarah C Nelson; Stephanie M Gogarten; Stephanie M Fullerton; Carmen R Isasi; Braxton D Mitchell; Kari E North; Stephen S Rich; Matthew R G Taylor; Sebastian Zöllner; Tamar Sofer
Journal: Am J Hum Genet Date: 2022-09-01 Impact factor: 11.043

Review 4. Opportunities and challenges for the use of common controls in sequencing studies.

Authors: Genevieve L Wojcik; Jessica Murphy; Jacob L Edelson; Christopher R Gignoux; Alexander G Ioannidis; Alisa Manning; Manuel A Rivas; Steven Buyske; Audrey E Hendricks
Journal: Nat Rev Genet Date: 2022-05-17 Impact factor: 59.581

5. Whole exome sequencing identifies novel germline variants of SLC15A4 gene as potentially cancer predisposing in familial colorectal cancer.

Authors: Diamanto Skopelitou; Aayushi Srivastava; Beiping Miao; Abhishek Kumar; Dagmara Dymerska; Nagarajan Paramasivam; Matthias Schlesner; Jan Lubinski; Kari Hemminki; Asta Försti; Obul Reddy Bandapalli
Journal: Mol Genet Genomics Date: 2022-05-13 Impact factor: 2.980

6. A gene-level methylome-wide association analysis identifies novel Alzheimer's disease genes.

Authors: Chong Wu; Jonathan Bradley; Yanming Li; Lang Wu; Hong-Wen Deng
Journal: Bioinformatics Date: 2021-02-01 Impact factor: 6.937

7. Variant-set association test for generalized linear mixed model.

Authors: Xiang Zhan; Kalins Banerjee; Jun Chen
Journal: Genet Epidemiol Date: 2021-02-19 Impact factor: 2.344

8. InTACT: An adaptive and powerful framework for joint-tissue transcriptome-wide association studies.

Authors: Ye Eun Bae; Lang Wu; Chong Wu
Journal: Genet Epidemiol Date: 2021-07-13 Impact factor: 2.135

9. A unifying framework for rare variant association testing in family-based designs, including higher criticism approaches, SKATs, and burden tests.

Authors: Julian Hecker; F William Townes; Priyadarshini Kachroo; Cecelia Laurie; Jessica Lasky-Su; John Ziniti; Michael H Cho; Scott T Weiss; Nan M Laird; Christoph Lange
Journal: Bioinformatics Date: 2020-12-26 Impact factor: 6.937

10. GIGYF1 loss of function is associated with clonal mosaicism and adverse metabolic health.

Authors: Yajie Zhao; Stasa Stankovic; Mine Koprulu; Eleanor Wheeler; Felix R Day; Hana Lango Allen; Nicola D Kerrison; Maik Pietzner; Po-Ru Loh; Nicholas J Wareham; Claudia Langenberg; Ken K Ong; John R B Perry
Journal: Nat Commun Date: 2021-07-07 Impact factor: 14.919