Whole-exome imputation within UK Biobank powers rare coding variant association and fine-mapping analyses.

Alison R Barton^1,2,3, Maxwell A Sherman^4,5,6, Ronen E Mukamel^4,5, Po-Ru Loh^7,8.

Abstract

Exome association studies to date have generally been underpowered to systematically evaluate the phenotypic impact of very rare coding variants. We leveraged extensive haplotype sharing between 49,960 exome-sequenced UK Biobank participants and the remainder of the cohort (total n ≈ 500,000) to impute exome-wide variants with accuracy R2 > 0.5 down to minor allele frequency (MAF) ~0.00005. Association and fine-mapping analyses of 54 quantitative traits identified 1,189 significant associations (P < 5 × 10−8) involving 675 distinct rare protein-altering variants (MAF < 0.01) that passed stringent filters for likely causality. Across all traits, 49% of associations (578/1,189) occurred in genes with two or more hits; follow-up analyses of these genes identified allelic series containing up to 45 distinct 'likely-causal' variants. Our results demonstrate the utility of within-cohort imputation in population-scale genome-wide association studies, provide a catalog of likely-causal, large-effect coding variant associations and foreshadow the insights that will be revealed as genetic biobank studies continue to grow.

Year: 2021 PMID: 34226706 PMCID: PMC8349845 DOI: 10.1038/s41588-021-00892-1

Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 41.307

INTRODUCTION

Exome association studies have shown that rare coding variants tend to have larger phenotypic effects than common variants and collectively contribute an important component of complex trait heritability[1-4]. However, the phenotypic effects of very rare coding variants have been difficult to comprehensively assess, as exome sequencing studies have not yet reached the sample sizes needed to power such analyses (N>100,000)[5-10], and imputation of rare variants into cohorts of this scale has been insufficiently accurate[11]. The largest exome-wide association studies conducted to date have analyzed cohorts of N~50,000 exome-sequenced individuals, and while these studies have identified modest numbers of variants and genes associated with phenotypes, they have largely been underpowered to evaluate the effects of individual very rare coding variants[7-10]. The UK Biobank (UKB) is a powerful resource for genetic association analyses because of its large sample size (N~500,000) and deep phenotyping[12]. Previous studies of UKB have examined disease associations of protein-truncating variants genotyped on the UK Biobank array, which was designed to include the majority of predicted loss-of-function (LoF) variants with MAF>0.02% and missense variants with MAF>0.2%[13,14]. However, most LoF variants are ultra-rare (MAF<0.01%), such that only ~14% of rare LoF variants detected in whole-exome sequencing (WES) of 49,960 UKB participants had been genotyped on the UK Biobank array[8]. We reasoned that although exome sequencing of ~10% of the UKB cohort provided insufficient power to directly assess the effects of ultra-rare variants (which have <10 carriers in N~50,000 sequenced participants), we could leverage the extensive haplotype sharing within the UKB cohort[15,16] to accurately impute these variants into up to ~100 carriers in the full cohort, thereby powering association analysis. 
(This strategy is distinct from a recent analysis of “putative LoF-segments” determined based on identity-by-descent sharing, which did not consider LoF phase[17].) By combining this exome-wide imputation strategy with careful fine-mapping of significant associations to identify causal effects of rare coding variants on 54 quantitative traits, we identified hundreds of novel likely-causal variant-trait associations and obtained insights into widespread allelic heterogeneity and pleiotropy.
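
The carrier counts quoted above follow from simple arithmetic. The sketch below is a rough check rather than anything taken from the paper's Methods: it assumes Hardy-Weinberg proportions, counts heterozygotes only (homozygotes are negligible at these frequencies), and uses an illustrative function name.

```python
def expected_carriers(n_individuals: int, maf: float) -> float:
    """Approximate expected number of heterozygous carriers of a rare allele.

    Assumes Hardy-Weinberg proportions, so the heterozygote frequency is
    2 * maf * (1 - maf); homozygotes are ignored because they are negligible
    for very rare alleles.
    """
    return 2 * n_individuals * maf * (1 - maf)


# MAF = 0.01% is the ultra-rare threshold used in the text.
in_wes_cohort = expected_carriers(50_000, 1e-4)    # sequenced subset
in_full_cohort = expected_carriers(500_000, 1e-4)  # full UKB cohort
print(round(in_wes_cohort), round(in_full_cohort))
```

A variant at the 0.01% threshold therefore has about 10 carriers among the sequenced participants, and rarer variants have fewer, which is why imputation into the full cohort is needed to reach on the order of 100 carriers.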

RESULTS

Exome-wide imputation, association, and fine-mapping

We leveraged whole-exome sequencing of 49,960 UKB participants together with SNP-array genotyping in the full cohort to impute exome-wide variants into all UKB participants as follows (full details in Methods). First, we created an imputation reference panel by phasing WES genotype calls together with SNP-array genotypes in the WES cohort using Eagle2[16], restricting to 4.9 million variants with minor allele count (MAC) at least 2. Second, we used Minimac4[11] to impute these variants into phased SNP-array haplotypes we had previously generated for 487,409 UKB participants[18]. This strategy achieved imputation accuracy (R2) > 0.5 for rare variants down to MAF~0.00005 (Fig. 1a,b, Supplementary Table 1, and Supplementary Note), consistent with previous simulations[19] and roughly one order of magnitude deeper into the rare allele frequency spectrum than the current UKB imputation release (v3)[12], which used the Haplotype Reference Consortium (HRC) and UK10K / 1000 Genomes reference panels[20,21]. Compared to imputation using N=97,256 genomes in the TOPMed reference panel[22], within-cohort imputation using the N=49,960 UKB WES panel achieved substantially greater coverage of very rare variants while maintaining similar accuracy per imputed variant (Fig. 1a,b).
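
As an illustration of the accuracy metric, the sketch below computes imputation R2 as the squared Pearson correlation between sequenced genotypes and imputed dosages. This is a common definition and is assumed here; the exact benchmark procedure is in the paper's Methods, and the function name and toy data are hypothetical.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes (0/1/2 allele
    counts) and imputed dosages (expected allele counts in [0, 2])."""
    n = len(true_genotypes)
    mean_g = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum((g - mean_g) * (d - mean_d)
              for g, d in zip(true_genotypes, imputed_dosages))
    var_g = sum((g - mean_g) ** 2 for g in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_g == 0 or var_d == 0:
        return float("nan")  # undefined when either vector is constant
    return cov * cov / (var_g * var_d)


# Toy data (hypothetical): two true carriers among ten individuals.
truth = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
dosage = [0.0, 0.1, 0.0, 0.8, 0.0, 0.0, 0.2, 0.0, 0.6, 0.0]
r2 = imputation_r2(truth, dosage)
print(f"imputation R2 = {r2:.3f}")
```

For very rare variants the metric is driven by a handful of carriers, so in practice it is averaged over many variants within a MAF bin, as in Fig. 1b.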

Figure 1.

Whole-exome imputation, association, and fine-mapping identify rare coding variants likely to causally associate with 54 quantitative traits.

Imputation panel coverage (a) and imputation accuracy (b) assessed using SNP calls from the second release of UK Biobank whole exome-sequencing data (N=200,643; accuracy benchmarks excluded individuals in the initial release). Data are presented as mean values. Error bars, 95% CIs. (c) Schematic of our analytical pipeline, which combined UK Biobank whole-exome sequences with SNP-array genotypes to impute exome-wide genotypes into the full cohort. We analyzed imputed exome variants together with the genome-wide UK Biobank imputation release to find significant variant-trait associations independent of neighboring variants, and we restricted to rare (MAF<0.01) protein-altering variants with CADD ≥ 20 or SpliceAI support to form a final list of likely-causal variants. (d) Distribution of first UK Biobank genetic data set in which each association could have been detected. Roughly one-third of all likely-causal variants – and nearly all very rare likely-causal variants – were only discoverable using WES imputation. (e) WES imputation enabled identification of new rare coding variants for all but one trait (immature reticulocyte fraction) among 54 quantitative traits analyzed.

We tested the imputed variants for association with 54 heritable quantitative traits (measuring anthropometric traits, blood pressure, lung function, bone mineral density, blood cell indices, and serum biomarkers; Supplementary Table 2) by running linear mixed model association analysis on N=459,259 participants of European ancestry using BOLT-LMM[23,24], which we verified was robust to potential population stratification in rare variant association analysis (Methods and Supplementary Note). This procedure identified tens of thousands of associations between coding variants and traits that reached nominal genome-wide significance (P<5 x 10−8); however, we expected that most of these associations were not causal but rather reflected linkage disequilibrium with nearby causal variants. To filter detected associations to a high-confidence subset primarily containing causal variants, we developed a stringent filtering pipeline that combined variant annotation filters (to increase the prior on causality) with statistical fine-mapping (Fig. 1c and Methods). First, we restricted to rare (MAF<1%) variants predicted to have high protein-altering impact based on either of the following criteria: (i) Combined Annotation Dependent Depletion (CADD)[25] score ≥ 20 (for coding variants annotated by VEP[26], including canonical splice variants); or (ii) SpliceAI[27] score ≥ 0.5 (for noncanonical splice variants). In our primary analyses, we further restricted to variants with high estimated imputation accuracy (INFO>0.5) and with imputed MAF>10−5. These filters left 529,602 rare coding variants under consideration, of which 440,253 (83%) either were not present or were poorly imputed (INFO<0.5) in the HRC-based UKB imputation release. Among the 529,602 variants, 1,647 distinct variants associated (P<5 x 10−8) with at least one phenotype, accounting for a total of 2,706 variant-trait associations (Fig. 1c) (with 1.4 false discoveries expected across all 54 traits). 
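
The annotation filters and the expected number of false discoveries can be sketched as follows. This is a minimal illustration under stated assumptions: the `Variant` record and its field names are hypothetical, the thresholds are those quoted in the text, and the false-discovery count assumes independent tests under the global null.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    maf: float                 # imputed minor allele frequency
    info: float                # estimated imputation accuracy (INFO)
    cadd: Optional[float]      # CADD score for VEP-annotated coding variants
    spliceai: Optional[float]  # SpliceAI score for noncanonical splice variants


def passes_annotation_filters(v: Variant) -> bool:
    """Rare, well-imputed, and predicted high-impact (thresholds from the text)."""
    rare = 1e-5 < v.maf < 0.01
    well_imputed = v.info > 0.5
    high_impact = (v.cadd is not None and v.cadd >= 20) or (
        v.spliceai is not None and v.spliceai >= 0.5
    )
    return rare and well_imputed and high_impact


# Expected false discoveries if every test were null and independent.
n_variants, n_traits, alpha = 529_602, 54, 5e-8
expected_false_discoveries = n_variants * n_traits * alpha

# Per-trait Bonferroni threshold for comparison with the 5e-8 cutoff.
bonferroni_per_trait = 0.05 / n_variants
print(round(expected_false_discoveries, 1), f"{bonferroni_per_trait:.1e}")
```

The product reproduces the roughly 1.4 false discoveries quoted above, and the per-trait Bonferroni threshold (about 9.4 x 10−8) is looser than the conventional 5 x 10−8 cutoff.
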
We combined our variant annotation filters with a statistical fine-mapping filter to exclude associations that could be explained by linkage disequilibrium with other variants. Our primary filter required that each association remain significant (P < 5 x 10−8, slightly conservative for 529,602 variants tested) after conditioning on any other more-strongly-associated variant within 3 megabases (considering in turn each variant from either our WES imputation or the UKB imputation v3 release; Methods). This filter was more robust for our rare variant analyses than standard fine-mapping software packages, which aim to find small sets of variants that explain maximal phenotypic variance, making configurations which include rare variants less likely to be considered the most probable[28,29]. Fine-mapping algorithms do have the advantage of accounting for the possibility of variants tagging combinations of multiple nearby causal variants (which our pairwise conditional filter did not consider); to account for this possibility, we applied a second filtering pipeline based on iterative runs of the FINEMAP software[28] (Methods). Together, these filters reduced the set of associations to a final “likely-causal” set of 1,189 associations involving 675 unique variants (Fig. 1c and Supplementary Table 3). Both the variant annotation filters and the fine-mapping filters were designed to be very stringent, with the goal of producing a conservative set of associations with high confidence of causality for downstream analysis. Association data for all variants (including those that failed filters) are also available (see Data availability). Among the 1,189 likely-causal associations, 30% could only be discovered using imputation from UKB exome-sequencing data, demonstrating the power of this approach for causal variant discovery (Fig. 1d,e). 
The remaining associations could previously have been discovered using either the UKB SNP-array (51% of likely-causal associations, reflecting the inclusion of rare coding variants on the array), the HRC-based UKB imputation v3 release (an additional 16%), or association analysis within the WES cohort (an additional 3%). Furthermore, among likely-causal associations involving ultra-rare variants (MAF<0.01%), the large majority (197 of 253 associations; 78%) were discoverable only using imputation from the UKB WES cohort (Fig. 1d). Roughly half (576 of 1,189; ~48%) of all likely-causal associations were still not discoverable in the subsequent release of 200,643 UK Biobank exomes[30,31] (Fig. 2 and Supplementary Table 4). Most likely-causal variants (600 of 675; 89%) were not reported in the NHGRI-EBI GWAS catalog for association with any trait, underscoring the power of exome imputation within UKB to detect novel rare coding associations (Supplementary Fig. 1). Effect sizes generally increased with decreasing minor allele frequency among likely-causal rare coding variants (Supplementary Fig. 2), which collectively explained an average of 0.6% of variance per trait (Supplementary Table 2).
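
The idea behind the pairwise conditional filter described in this section can be sketched with ordinary least squares. This is an illustrative simplification, not the authors' pipeline: the paper used BOLT-LMM and conditioned in turn on every more-strongly-associated variant within 3 megabases, whereas the hypothetical `association_z` below conditions on a single variant in simulated data. The cutoff |z| > 5.45 corresponds approximately to a two-sided P < 5 x 10−8 under a normal approximation.

```python
import random
from math import sqrt


def _center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def _residualize(y, x):
    """Residuals of y after least-squares regression on x (with intercept)."""
    yc, xc = _center(y), _center(x)
    beta = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
    return [b - beta * a for a, b in zip(xc, yc)]


def association_z(pheno, geno, conditioning=None):
    """t-statistic for geno in a linear regression of pheno on geno,
    optionally including one conditioning variant as a covariate."""
    if conditioning is None:
        y, g, n_params = _center(pheno), _center(geno), 2
    else:
        y = _residualize(pheno, conditioning)
        g = _residualize(geno, conditioning)
        n_params = 3
    sgg = sum(a * a for a in g)
    beta = sum(a * b for a, b in zip(g, y)) / sgg
    rss = sum((b - beta * a) ** 2 for a, b in zip(g, y))
    return beta / sqrt(rss / (len(y) - n_params) / sgg)


# Simulated example: `causal` drives the phenotype; `tag` is merely in LD with it.
rng = random.Random(0)
n = 2000
causal = [1 if rng.random() < 0.3 else 0 for _ in range(n)]
tag = [c if rng.random() < 0.9 else 1 - c for c in causal]
pheno = [1.0 * c + rng.gauss(0, 1) for c in causal]

Z_THRESHOLD = 5.45  # approx. two-sided P < 5e-8
marginal_z = association_z(pheno, tag)
conditional_z = association_z(pheno, tag, conditioning=causal)
tag_passes_filter = abs(conditional_z) > Z_THRESHOLD
print(f"marginal z = {marginal_z:.1f}, conditional z = {conditional_z:.1f}")
```

In this simulation the tagging variant is strongly associated on its own, but its signal disappears once the causal variant is included as a covariate, so it would be excluded by the filter.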

Figure 2.

Association analyses of the subsequent N=200,643 UK Biobank exome release demonstrate robustness of likely-causal variant-trait associations ascertained using genotypes imputed from N=49,960 exomes.

For each likely-causal association, we repeated the association analysis (i) restricting to the N=200,643 cohort, but still using imputed genotypes (x-axis); or (ii) restricting to the N=200,643 cohort and using genotypes directly derived from exome sequencing (y-axis). Only 613 of 1,189 likely-causal associations from the imputed N=487,409 data set reached significance (BOLT-LMM P<5 x 10−8; red line in panel b) using the N=200,643 exomes alone. Association test statistics were highly correlated (Pearson R=0.96) between these two approaches. Only 6 associations involving 5 distinct variants (1:120463017:C:T, 2:174130918:G:A, 11:48285468:G:A, 16:2287866:G:A, 20:30610469:G:T) decreased in strength by >2-fold in the direct analysis, potentially due to inaccurate imputation or inaccurate genotyping.

We further attempted to assess the extent to which the likely-causal variants we identified implicated new genes influencing traits. This determination is challenging and generally requires substantial literature review, so we focused our assessment on two types of traits – blood cell traits and height – for which recent, largest-to-date (N>500,000) association studies could serve as proxies for prior knowledge (Supplementary Note). For blood cell traits, we found that ~26% (86 out of 337) of the unique gene-trait pairs implicated by our likely-causal associations did not appear among conditionally independent associations reported by Vuckovic et al. (2020)[32] (Supplementary Table 5). For height, ~45% (23 out of 51) of the unique genes implicated by our associations were novel compared to genes reported by Marouli et al. (2017)[2] (Supplementary Table 6). We expected that the linear mixed models we used for association tests had adequately controlled any potential confounding from population stratification or relatedness[24]. To verify robustness of our results, we performed multiple confirmatory analyses. First, we attempted to replicate associations with traits for which large-scale exome array studies (not including UKB participants) had previously been published. For height, 28 variants we identified as likely-causal had been analyzed in a previous ExomeChip study of height[2]; for all 28 variants, the direction of effect replicated, and 21 of the 28 variants reached nominal significance (P < 0.05) in the replication data set (Table 1). Similarly, effect directions replicated for 75 out of 75 lipid associations for which association statistics were available from the Global Lipids Genomics Consortium (GLGC)[3] and for 9 out of 10 blood pressure associations for which data was available from the CHARGE-BP Consortium[4] (Supplementary Table 7). 
Second, we verified that associations were robust to restricting analysis to a genetically homogeneous subset of unrelated British UKB participants (N=337,539): effect sizes (Pearson R2=0.985), association strengths (Pearson R2=0.998), and allele frequencies (Pearson R2=0.999) were all very consistent within this subset (Methods and Supplementary Fig. 3). Third, we verified that likely-causal rare alleles had geographical distributions nearly identical to MAF-matched background variants (Supplementary Fig. 4 and Supplementary Note). These results indicate that while subtle stratification in large genetic analyses may affect some types of epidemiological studies[33], the strong, highly localized stratification required to confound rare variant association analyses[34] is unlikely to be present in UK Biobank.
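
To gauge how unlikely the observed sign concordance would be by chance, one can apply a one-sided binomial sign test. This calculation is not reported in the paper; it is an illustrative sketch that assumes the variants are independent and that each effect direction would agree with probability 0.5 under the null.

```python
from math import comb


def sign_test_p(n_concordant: int, n_total: int) -> float:
    """One-sided probability of observing at least n_concordant matching
    effect directions out of n_total if each matched with probability 0.5."""
    tail = sum(comb(n_total, k) for k in range(n_concordant, n_total + 1))
    return tail / 2 ** n_total


p_height = sign_test_p(28, 28)  # ExomeChip height replication
p_lipids = sign_test_p(75, 75)  # GLGC lipid replication
p_bp = sign_test_p(9, 10)       # CHARGE-BP blood pressure replication
print(f"{p_height:.1e} {p_lipids:.1e} {p_bp:.3f}")
```

Under these assumptions, perfect concordance across 28 or 75 variants is extremely unlikely by chance, while 9 of 10 is only modestly significant, reflecting the smaller number of blood pressure variants available.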

Table 1.

Replication of likely-causal associations between rare coding variants and height.

P-values and effect sizes are compared for the 28 height-associated variants that were included in the ExomeChip analysis previously performed by Marouli et al.[2] Effect directions replicated for all 28 variants, most of which had not previously reached exome-wide significance. The last column indicates whether any variants in the affected gene had previously reached significance; several implicated genes were novel relative to Marouli et al.

Height variants reaching exome-wide significance in Marouli et al. (2017)
SNP	Cytoband	Gene	Variant effect	MAF (UKB)	P-value (UKB)	P-value (Marouli et al.)	Effect size (s.e.) (UKB)	Effect size (s.e.) (Marouli et al.)	Effect sign agreement	Gene reported
rs143365597	1p34.2	SCMH1	missense	4.4E-03	6.10E-38	1.60E-25	0.149 (0.012)	0.190 (0.018)	✓	✓
rs144673025	1q41	DISP1	missense	7.3E-03	1.60E-13	1.10E-09	−0.063 (0.009)	−0.078 (0.013)	✓	✓
rs142036701	2q35	IHH	missense	3.8E-04	9.30E-10	1.10E-15	−0.234 (0.044)	−0.320 (0.040)	✓	✓
rs149385790	4q26	PDE5A	missense	1.1E-03	1.80E-14	7.50E-17	0.194 (0.026)	0.260 (0.031)	✓	✓
rs778920303	5p13.3	NPR3	missense	2.5E-03	3.60E-29	1.10E-08	0.177 (0.016)	0.130 (0.022)	✓	✓
rs61736454	5q12.3	ADAMTS6	missense	2.6E-03	5.30E-24	7.80E-09	−0.151 (0.016)	−0.150 (0.026)	✓	✓
rs78727187	5q23.3	FBN2	missense	6.6E-03	1.20E-49	2.50E-33	0.139 (0.010)	0.180 (0.015)	✓	✓
rs148833559	5q35.2	STC2	missense	1.3E-03	9.10E-40	5.70E-15	0.285 (0.022)	0.290 (0.037)	✓	✓
rs75596750	8q24.22	ZFAT	cryptic splice	8.0E-04	6.50E-23	1.50E-12	0.264 (0.028)	0.250 (0.036)	✓	✓
rs138273386	11p14.2	FIBIN	missense	4.3E-03	2.40E-10	5.80E-12	−0.078 (0.013)	−0.120 (0.017)	✓	✓
rs141308595	15q26.1	HAPLN3	missense	1.4E-03	7.40E-46	2.80E-13	−0.307 (0.022)	−0.27 (0.037)	✓	✓
Height variants not reaching exome-wide significance in Marouli et al. (2017)
SNP	Cytoband	Gene	Variant effect	MAF (UKB)	P-value (UKB)	P-value (Marouli et al.)	Effect size (s.e.) (UKB)	Effect size (s.e.) (Marouli et al.)	Effect sign agreement	Gene reported
rs121908188	1p36.11	SEPN1	missense	8.8E-04	4.80E-11	1.10E-01	−0.178 (0.028)	−0.088 (0.055)	✓
rs200496074	1p35.2	COL16A1	missense	6.5E-03	1.10E-10	2.90E-01	0.065 (0.010)	0.063 (0.015)	✓
rs201166538	3q13.33	LRRC58	missense	1.7E-04	6.70E-20	1.00E-01	0.545 (0.061)	0.097 (0.092)	✓
rs143137713	3q24	GYG1	missense	2.4E-03	1.00E-10	1.70E-01	−0.101 (0.017)	−0.049 (0.036)	✓
rs73181210	3q26.2	PHC3	missense	7.2E-03	9.00E-11	1.10E-05	0.066 (0.009)	0.056 (0.013)	✓	✓
rs149437411	3q27.3-q28	LPP	missense	3.2E-03	2.00E-15	1.70E-05	0.122 (0.015)	0.088 (0.020)	✓
rs147927477	6p21.32	COL11A2	missense	6.8E-04	9.30E-12	9.90E-01	−0.215 (0.031)	−0.001 (0.049)	✓
rs146458902	7p14.1	GLI3	missense	6.1E-03	1.10E-15	3.50E-04	0.080 (0.010)	0.060 (0.017)	✓	✓
rs121912974	7q11.23	POR	missense	3.8E-04	5.70E-10	1.20E-03	0.271 (0.042)	0.180 (0.057)	✓
rs140870470	9p13.3	NPR2	missense	4.4E-04	1.80E-11	1.40E-01	0.251 (0.039)	0.160 (0.110)	✓	✓
rs143836544	9q34.11	LRRC8A	missense	6.1E-03	3.20E-09	2.00E-05	−0.060 (0.011)	−0.065 (0.015)	✓	✓
rs200733908	11q13.1	LTBP3	missense	4.7E-04	8.10E-12	4.00E-01	−0.281 (0.042)	−0.059 (0.070)	✓	✓
rs202116412	12p13.1	APOLD1	splice donor	1.3E-03	3.80E-09	3.40E-06	0.131 (0.024)	0.110 (0.024)	✓	✓
rs142153001	14q11.2	LRP10	missense	9.0E-03	1.20E-11	5.00E-03	0.057 (0.009)	0.035 (0.013)	✓	✓
rs201029932	14q22.2	SAMD4A	missense	4.8E-03	8.60E-13	6.80E-04	−0.080 (0.012)	−0.057 (0.017)	✓	✓
rs35816944	16p13.3	SPSB3	missense	6.7E-03	1.10E-18	2.90E-05	−0.083 (0.010)	−0.067 (0.016)	✓	✓
rs141510764	16q23.1	CLEC3A	missense	3.7E-04	3.20E-09	9.10E-04	0.230 (0.042)	0.260 (0.077)	✓	✓

Likely-causal variants are enriched for deleteriousness

The 675 rare coding variants that we identified as likely-causal were roughly evenly distributed across the full range of allele frequencies we considered (MAF = 10−5 to 10−2; Fig. 3a). In contrast, the 972 rare coding variants that were annotated as high-impact and associated significantly with at least one trait but were filtered after considering linkage disequilibrium with other associated variants were enriched for more-common variants (MAF = 10−3 to 10−2), suggesting that many of these filtered variants – which constituted the majority of trait-associated rare coding variants – merely tagged causal common variants (Fig. 3a).

Figure 3.

Likely-causal coding variants are rare and enriched for deleteriousness.

(a) Likely-causal variants (pink, n=675) had minor allele frequencies distributed relatively evenly across the range under consideration (MAF = 10−5 to 10−2), whereas variants that failed linkage disequilibrium (LD)-based filters (blue, n=972) tended to be less rare. (b) Likely-causal variants had elevated CADD scores compared to those that failed LD-based filters and compared to a randomly-sampled background distribution of rare coding variants (green, n=47,002). (c) Likely-causal variants were enriched for predicted loss-of-function mutations. Bar height represents identified fraction. Error bars estimate sampling uncertainty based on a binomial model, 95% CIs. (d) Likely-causal missense variants were enriched for higher-impact amino acid substitutions (as measured by more negative BLOSUM62 scores).

To assess enrichment of measures of deleteriousness among the 675 likely-causal variants while controlling for MAF (which is modestly negatively correlated with deleteriousness; Supplementary Fig. 5), we compared features of these variants to a MAF-matched background distribution that we generated by subsampling the 529,602 predicted-high-impact variants we tested (Methods). The average CADD score of likely-causal variants was 2.0 higher than in the background distribution (mean CADD = 27.3 vs. 25.3; P = 1.6 x 10−23, two-sample t-test) (Fig. 3b). Furthermore, predicted loss-of-function mutations (including frameshifts, stop gains, and canonical splice variants) were 2.1-fold enriched (P = 3.7 x 10−16, Fisher’s exact test) among likely-causal variants (comprising 19.1% of likely-causal variants vs. 8.9% of variants from the background distribution; Fig. 3c). In contrast, variants that failed our fine-mapping filters had CADD and variant type distributions similar to background, providing further evidence against causality of most of these variants (Fig. 3b,c). Missense variants, which comprised the majority of both likely-causal and background variants, produced broadly more severe amino acid substitutions (as measured by BLOSUM62 scores) across likely-causal variants compared to background (mean BLOSUM62 score = −0.78 vs. −0.57; P = 0.003, two-sample t-test) (Fig. 3d). Cryptic splice variants (computationally predicted by SpliceAI) accounted for 11 of the 675 likely-causal variants and were slightly depleted relative to background, suggesting that these variants were on average slightly less likely to affect function than missense variants with CADD ≥ 20 (Fig. 3c); however, our statistical power here was limited.
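
One way to build a MAF-matched background by subsampling is sketched below. The binning scheme (half-decade log10 MAF bins), the number of background variants drawn per likely-causal variant, and all function names are assumptions made for illustration; the paper's Methods define the actual procedure.

```python
import math
import random
from collections import defaultdict


def maf_bin(maf: float, bins_per_decade: int = 2) -> int:
    """Index of the log10(MAF) bin containing this frequency."""
    return math.floor(math.log10(maf) * bins_per_decade)


def maf_matched_background(target_mafs, candidate_mafs, per_target=10, seed=0):
    """Sample candidate indices so that each MAF bin holds per_target times as
    many background variants as target variants (fewer if the bin runs out)."""
    rng = random.Random(seed)
    pool = defaultdict(list)
    for idx, maf in enumerate(candidate_mafs):
        pool[maf_bin(maf)].append(idx)
    needed = defaultdict(int)
    for maf in target_mafs:
        needed[maf_bin(maf)] += per_target
    sampled = []
    for bin_id, k in needed.items():
        candidates = pool.get(bin_id, [])
        sampled.extend(rng.sample(candidates, min(k, len(candidates))))
    return sampled


# Toy example: three target variants and a pool spanning four MAF bins.
target_mafs = [2e-5, 3e-4, 5e-3]
candidate_mafs = [2e-5] * 20 + [3e-4] * 20 + [1.5e-3] * 20 + [5e-3] * 20
background = maf_matched_background(target_mafs, candidate_mafs, per_target=5)
print(len(background))
```

Matching on MAF in this way prevents the comparison of CADD scores or variant types from being confounded by the correlation between allele frequency and deleteriousness.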

Rare coding variants form long allelic series

Among the 1,189 likely-causal variant-trait associations we identified, roughly half (578 out of 1,189; 49%) occurred in genes containing multiple likely-causal rare coding variants for the same trait. The observation of two or more rare coding hits in the same gene strengthened our evidence for these associations and suggested the possibility of longer allelic series within these genes containing very rare causal coding variants that either had not reached genome-wide significance or had been excluded by our stringent filters. To increase our power to detect additional independently-associated rare coding variants within these genes, we performed follow-up analyses in which we relaxed the significance threshold (to a 5% false discovery rate within each gene-trait pair) and relaxed our fine-mapping filter (conditioning only on a set of associated variants selected by FINEMAP) and annotation-based filter (considering all protein-altering variants regardless of CADD score; Methods). These analyses revealed very long allelic series of rare coding variants likely to alter phenotypes: for 56 gene-trait pairs, the allelic series contained 10 or more variants on distinct haplotypes, and eight distinct genes contained allelic series of 30 or more variants (Fig. 4 and Supplementary Table 8). In the longest allelic series, 45 rare coding variants in ALPL – out of 76 such variants tested – independently associated with serum alkaline phosphatase levels, all with negative effect directions for the rare minor allele. This consistency in effect directions was broadly displayed across the allelic series we identified (93% mean concordance with the majority effect direction; Supplementary Fig. 6). Somewhat surprisingly, the amino acid residues modified by missense variants within these allelic series tended not to cluster in specific protein domains (Fig. 4a-d and Supplementary Fig. 7); instead, they appeared to be distributed throughout protein structures, suggesting that protein structures may often contain many domains that are sensitive to mutation.
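
The relaxed threshold of a 5% false discovery rate within each gene-trait pair can be illustrated with the Benjamini-Hochberg step-up procedure. Whether the authors used exactly this procedure is not stated in this section, so treat it as an assumption; the P values below are made up for the example.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return booleans marking which hypotheses are rejected at the given
    false discovery rate using the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= fdr * k / m.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= fdr * rank / m:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[idx] = True
    return rejected


# Hypothetical P values for rare coding variants tested in one gene-trait pair.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
in_series = benjamini_hochberg(p_vals, fdr=0.05)
print(sum(in_series), "of", len(p_vals), "variants pass FDR < 0.05")
```

Because the threshold scales with the number of variants tested in the gene rather than with the exome-wide test count, many more variants can enter an allelic series than under the genome-wide cutoff.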

Figure 4.

Many genes contain long allelic series of rare coding variants with consistent effect directions.

(a-d) Allelic series of rare coding variants with statistically independent phenotype associations (reaching FDR<0.05 significance) for: (a) PCSK9 and LDL cholesterol, (b) IQGAP2 and mean platelet volume, (c) IFRD2 and high light scatter reticulocyte count, and (d) NPR2 and height. Top, protein structures with altered amino acids (modified by missense variants) color-coded by effect direction (red for trait-increasing variants and blue for trait-decreasing variants). Bottom, per-variant effect sizes (data point represents mean value; error bars, 95% CIs) and allele frequencies. Protein structures were previously determined experimentally (for PCSK9 and IQGAP2) or computationally predicted (for IFRD2 and NPR2). Functional domains of PCSK9 are shaded in different colors. IQGAP2 is represented as a homodimer in its crystal structure. (e) Distributions of effect directions for all gene-trait pairs with 10 or more variants in an allelic series.

Most of the allelic series we identified extended previously-described allelic series (such as in PCSK9 and IQGAP2; Fig. 4a,b); however, several genes contained long allelic series in which most or all variants represented novel associations. At IFRD2 (interferon-related developmental regulator 2, which has an unknown function), 24 rare coding variants independently associated with high light scatter reticulocyte count (Fig. 4c and Supplementary Table 8), suggesting an important role of IFRD2 in red blood cell development; interestingly, these associations were specific to reticulocyte indices and did not extend to red blood cell count. A common IFRD2 eQTL variant (rs1076872, which is synonymous in one IFRD2 transcript and in the 5’ UTR of another transcript) exhibited the strongest association with reticulocyte indices (P = 1.8 x 10−545), and variants in linkage disequilibrium with rs1076872 have been reported by many association studies of blood cell indices. However, IFRD2 has no common protein-altering variants, such that its apparent sensitivity to coding mutations had not previously been observable: among the 24 variants we identified, only two had MAF>0.1%. Of the remaining 22 very rare IFRD2 variants, 19 variants had positive, large effects on high light scatter reticulocyte count (median +0.61 s.d.); intriguingly, homozygotes and compound heterozygotes for these variants exhibited extreme phenotypes (mean +2.52 s.d.; s.e.m., 0.25 s.d.). At NPR2, which encodes a natriuretic peptide receptor involved in bone growth regulation[35], 11 rare coding variants independently associated with height (Fig. 4d and Supplementary Table 8). Loss-of-function and gain-of-function variants in NPR2 have previously been implicated in Mendelian skeletal disorders with very strong, mirror effects on stature; however, well-powered exome array studies have not linked NPR2 polymorphisms to height in the general population[2]. 
Our exome-imputation approach uncovered many more NPR2 alleles that appear to exert milder (but still strong) effects on height in the UK population, with estimated effect sizes ranging from −1.09 (0.18) s.d. to +0.25 (0.04) s.d. At PLA2G12A and PLIN1, allelic series containing up to seven rare coding variants in PLA2G12A and eight in PLIN1 associated with serum lipid levels (Supplementary Fig. 7 and Supplementary Table 8), and the lead association in each series replicated in GLGC data (PLA2G12A missense SNP rs41278045: P = 3.3 x 10−4 for HDL and P = 2.3 x 10−6 for triglycerides; PLIN1 missense SNP rs139271800: P = 1.2 x 10−4 for HDL). PLA2G12A encodes a secretory phospholipase that liberates arachidonic acid for eicosanoids with many downstream effects; PLIN1 encodes a protein that coats lipid droplets. While frameshift variants in PLIN1 have been implicated in Mendelian lipodystrophies[36], the contribution of rare variants in each gene to population variation in blood lipid levels has been largely unexplored.

Rare coding variants often exhibit pleiotropic effects

Of the 371 genes involved in at least one variant-trait association, 151 genes contained likely-causal variants for two or more traits. These associations often involved related traits or traits connected by pathways known to involve the gene in question. For example, the cell cycle regulators CHEK2 and JAK2 both contained likely-causal variants associated with white blood cell, red blood cell and platelet traits; a JAK2 missense variant also associated with IGF-1 measurements (Supplementary Table 9). Additionally, three genes that regulate Rho GTPases (DENND2C, DOCK8, and KALRN) contained likely-causal variants associated with multiple platelet traits, consistent with the key role of Rho GTPases in platelet function[37]. Other genes associated with more-distinct sets of traits (Supplementary Table 9). APOC3 exhibited the widest variety of likely-causal associations, with the splice donor variant rs138326449 associating with 13 distinct traits including lipid levels, white blood cell and red blood cell traits, and kidney biomarkers. In PDE3B, the stop gain variant rs150090666 associated likely-causally with 10 distinct traits, including expected associations with waist-hip-ratio and lipid measurements[38], but also associations with red blood cell traits, SHBG levels, and height. Further work will be required to determine which of these associations represent direct biological effects versus downstream effects of perturbed regulatory networks (as posited by the omnigenic model)[39].

Exome imputation uncovers novel large-effect variants

Our ability to probe the effects of ultra-rare variants revealed 10 variants in 10 different genes with very large estimated effects on height (≥ 0.5 s.d.; Supplementary Table 10); in contrast, the largest effect sizes detected in a recent exome array study of height were ~0.3 s.d.[2]. Four of these genes (NPR2, COL2A1, HERC1, and PCNA) have been implicated in Mendelian diseases manifesting short stature or skeletal disorder phenotypes; however, the specific variants we identified were not previously reported in ClinVar[40], consistent with their rarity as well as their effects being less-extreme and contributing to complex genetic variation in height. We also detected one very-large-effect variant for BMI in MC4R (+0.62 (0.12) s.d.; Supplementary Table 10); this variant had previously been associated with obesity in a Mendelian fashion[41]. Rare coding variants with more-moderate effects on height also yielded new insights into the genetic basis of height. Among the 28 height-associated likely-causal variants for which we could replicate effect directions in the ExomeChip study of Marouli et al.[2] (Table 1), seven altered genes that did not contain any variants that had previously reached significance, representing potentially novel height loci. Many of these genes had functions suggestive of their association with height, including two collagen genes, COL16A1 and COL11A2. Gene Ontology (GO) analysis of all genes containing likely-causal height variants implicated numerous biological processes relating to skeletal system development and extracellular matrix organization (Supplementary Table 11)[42,43].

Biomarker-associated variants confer downstream disease risk

Many phenotypes we analyzed measured blood cell indices or biomarkers for liver, kidney, cardiovascular, or endocrine function, suggesting the possibility that rare coding variants affecting these molecular or cellular phenotypes might have downstream impacts on diseases of the corresponding systems. To test this hypothesis, we analyzed likely-causal variants from our blood and biomarker association analyses for association with disease status for related disorders (Methods). Seventeen associations involving 12 distinct variants reached FDR<0.05 significance (P < 1.5 x 10−4; Supplementary Table 12), all of which either replicated previous results[44] or added to allelic series at known disease genes (e.g., a MAF=0.1% splice donor in SLC34A3 that conferred threefold-increased risk of kidney stones; P = 2.0 x 10−5, OR = 3.1 (2.0-4.8)). In contrast to our analyses of quantitative traits, in which nearly one-third of the associations we identified were discoverable only through exome imputation, 11 of the 12 disease-associated variants had either been genotyped on the UKB SNP-array or accurately imputed from the HRC panel (the only exception being a MAF=0.04% LDLR missense variant previously implicated in familial hypercholesterolemia; Supplementary Table 12). This behavior was consistent with the greater difficulty of identifying robust statistical associations with disease traits (for which causal variants tend to have low penetrance) as compared to molecular or cellular traits (for which causal variants can have much more direct effects). The rarest of the 12 disease-associated variants we identified had MAF=0.04%; to identify ultra-rare variants that influence disease in population cohorts, even larger sample sizes will be needed.

Single-variant tests implicate genes missed by burden tests

Most exome association analyses conducted to date have used gene-based association tests to aggregate signal from very rare variants within the same gene[8,9,45], motivating a comparison between results from our single-variant analyses and a gene-based test using imputed coding variants. In light of our observation that most likely-causal variants from our single-variant analyses had consistent effect directions (Fig. 4e), we aggregated our whole-exome imputed variants within a burden test framework (rather than using a kernel test that trades off power in this scenario to account for bidirectional effects[46]). A key consideration in performing burden tests is deciding which variants to include as potentially deleterious; as such, we considered two possible functional criteria (protein-altering with CADD≥20 vs. predicted LoF) and three possible MAF cutoffs (MAF<1%, <0.1%, or <0.01%) for variants to include (Methods). Of these six parameter combinations, the least stringent option (CADD≥20 and MAF<1%) appeared to be the most powerful (Supplementary Table 13) and was used for subsequent analyses. Among gene-trait pairs implicated by our single-variant association tests, 32% were not detected by burden analysis, indicating that single-variant analysis can often be more powerful than gene-based tests for discovering novel loci associated with complex traits (Supplementary Table 14). Conversely, most gene-trait associations identified by burden analysis (1130 of 1572; 71% of associations) involved at least one variant that reached significance in single-variant analysis. 
Notably, a sizable minority of these variant associations (414 of 1130; 37% of top-associated variants) had failed our linkage disequilibrium (LD)-based filters that detected potential tagging of other causal variants, suggesting that many statistically significant results from the burden analysis could potentially represent false-positive associations due to the presence of a very strong causal signal present in a nearby, linked gene or regulatory region. The confounding effects of linkage disequilibrium were apparent in several large clusters of gene-trait associations near large-effect loci (e.g., 8 genes within 1 Mb of APOE associated with apoB levels; Supplementary Table 14). While burden analyses are somewhat less susceptible to confounding from LD because they aggregate signal across several variants, approximately half of the burden-test associations that reached significance (51%) were dominated by one variant that accounted for the majority of alleles collapsed in the burden analysis, such that the collapsed “carrier genotype” shared strong, potentially confounding LD with all variants linked to the dominating variant. These results highlight the need to account for linkage disequilibrium even in the context of burden analysis.

DISCUSSION

These results demonstrate the power of using a large, well-matched reference panel to impute very rare variants into biobank data. Whereas exome sequencing on ~50,000 UK Biobank samples offered limited power to detect associations between coding variants and phenotypes[8,9], imputation into the remainder of the UK Biobank cohort enabled a comprehensive survey of the effects of rare coding variation on 54 quantitative phenotypes (with adequate power even for ultra-rare, MAF<0.01% variants). In combination with fine-mapping analyses, this strategy uncovered many new large-effect coding variants, revealed long allelic series within core genes for many traits, and produced a resource of likely-causal rare coding variant associations for future study. More broadly, our results suggest that sequencing 10% of a cohort and imputing into the remaining 90% can be a cost-efficient strategy for designing genetic association studies. Accurate imputation tends to be possible for variants with at least 5-10 carriers in a reference panel[11,19,20] (assuming most mutations are not highly recurrent (Supplementary Fig. 9), which we verified empirically; Supplementary Fig. 10, Supplementary Table 15, and Supplementary Note); when the panel represents 10% of a cohort, this frequency corresponds to 50-100 carriers in the full cohort, which matches well with the minimum number of carriers typically needed to detect a moderate-effect association (Supplementary Fig. 11). Our results also have several implications for the analysis of exome association studies. First, single-variant analysis is a viable strategy for extremely large exome association studies. Second, linear mixed model association analysis is robust to population stratification for rare variants as well as for common variants. 
Third, careful fine-mapping is critical for identifying causal associations even when analyzing rare coding variants predicted to have high impact (CADD ≥ 20): even for such variants, most associations appear not to be causal but rather to tag associations of other variants in linkage disequilibrium (Fig. 3). Our study does have important limitations. First, while we observed broad agreement between association statistics computed using genotypes derived from imputation vs. direct sequencing (Fig. 2), this agreement was imperfect: some associations (~3%) increased in strength by >2-fold and a few associations (<1%) decreased in strength by >2-fold in the direct analysis, demonstrating the limitations of rare variant imputation. Second, we restricted our primary analyses to quantitative traits; a comprehensive study of rare coding variant effects on UKB disease traits will require a separate analytical pipeline designed to handle unbalanced binary traits[47]. Third, while we could filter associations potentially explained by linkage disequilibrium with other variants imputed from exome sequencing or the HRC reference panel, we could not account for potential tagging of variants unavailable to us (e.g., very rare noncoding variants or structural variants). This limitation is shared by all fine-mapping studies conducted to date; here, we expect that our annotation-based filters (requiring that likely-causal coding variants be rare and have high predicted impact) ameliorate this concern. This intuition appears to be borne out by our replication analysis of height variants (in a pan-European meta-analysis that presumably contained different linkage disequilibrium patterns) and qualitatively by the large proportion of likely-causal associations that involved genes with clear biological relevance (Supplementary Table 3). Our study of UK Biobank exome data also gives an indication of the analyses that will become feasible as exome association studies grow even larger. 
Very large exome-sequenced cohorts provide a natural genetic perturbation experiment: the 49,960 UK Biobank exomes we studied here contained ~7 million missense variants that modified ~3.7 million different amino acids, a sizable fraction of the ~9 million amino acids encoded by all genes in the human genome[26]. Most of these variants were singletons or doubletons and were therefore difficult or impossible to impute; however, when exome sequencing of the full UK Biobank cohort is complete, whole-exome imputation into even larger cohorts will enable characterization of the effects of much of the viable coding variation in the genome.

METHODS

UK Biobank genetic data.

Data from the UK Biobank Resource were accessed under application number 10438. All data were collected and made available by the UK Biobank under North West – Haydock Research Ethics Committee reference 16/NW/0274. The UK Biobank cohort was previously genotyped using genome-wide SNP-arrays which produced genotype data for 488,377 UK Biobank participants at 784,256 autosomal SNPs passing quality control[12]. We analyzed these data together with whole-exome sequencing (WES) data available for 49,960 participants[8]. We analyzed WES genotype calls at 10.2 million autosomal variants from the SPB pipeline[8], filtering to a subset of 9.8 million variants that unambiguously lifted to hg19 using UCSC liftOver, among which 4.9 million had minor allele count at least 2. We also analyzed imputed genotypes available for 487,409 participants from the UK Biobank imp_v3 data release, which consisted of 93 million variants imputed using the Haplotype Reference Consortium and UK10K / 1000 Genomes reference panels[12]. We restricted our primary analyses to individuals who reported European ancestry (459,327 participants comprising 94% of the cohort). In supplementary analyses to ensure that our association analyses were not affected by confounding sample structure, we further restricted to a genetically homogeneous, unrelated (at third-degree or closer) subset of 337,539 white British participants[12] (Supplementary Note). We excluded a small number of participants who withdrew from UK Biobank (up to a maximum of 149 withdrawals by the time we completed our study).

UK Biobank phenotype data.

We analyzed 54 heritable quantitative traits measured by UK Biobank for most participants. These traits included body measurements (3 anthropometric traits and 1 bone mineral density trait), blood pressure (2 traits), lung function (2 traits), blood cell indices (19 traits), and serum biomarker levels (7 lipid traits and 20 other biomarkers for liver, kidney, or endocrine function; Supplementary Table 2). We analyzed all available blood cell traits except for nucleated red blood cell count and percentage (which were mostly zero) and blood cell percentage traits (which were highly correlated with the corresponding blood cell counts). We analyzed all available serum biomarker traits except for estradiol, testosterone, and rheumatoid factor (which had measurable levels in only half or less of the cohort). We performed basic quality control on serum biomarker traits by masking extreme outliers (>1000 times the interquartile range), stratifying by sex and menopause status, applying inverse-normal transformation, regressing out covariates (ethnic group, alcohol use, smoking status, age, height, and BMI), and re-applying inverse-normal transformation. Quality control and normalization of the other quantitative traits were previously described[24]. We also analyzed disease traits affecting organ systems corresponding to the molecular and cellular traits above. We analyzed health outcomes in the “first occurrence” data fields that UK Biobank generated by aggregating information from self-report, inpatient hospital data, primary care, or death record data.
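The inverse-normal transformation step mentioned above can be illustrated with a minimal rank-based sketch. The source does not specify the exact rank offset or tie handling, so the `offset=0.5` convention and the average-rank treatment of ties below are assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Map values to standard-normal quantiles by rank.

    Tied values receive the average of their ranks. The offset (0.5 here)
    is an assumed convention; the source does not state which one was used.
    """
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    ranks = [0.0] * n
    pos = 0
    while pos < n:
        end = pos
        # Extend the tie group while the next sorted value is identical.
        while end + 1 < n and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # 1-based average rank of the group
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]


transformed = inverse_normal_transform([3.0, 1.0, 2.0])
```

In the pipeline described above this transformation is applied twice (before and after regressing out covariates) within strata; the sketch covers only the transformation itself.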

Phasing and imputation of WES variants.

To generate an imputation reference panel from the WES cohort, we phased the 4.9 million non-singleton autosomal variants from WES together with variants genotyped on the UK Biobank array (using Eagle2[16] with --Kpbwt=20000). We phased the data in chunks of 50,000 variants with an overlap of at least 5,000 variants between consecutive chunks, resulting in a total of 126 chunks across all autosomes. We then imputed the WES-derived variants into phased haplotypes we had previously generated[18] for 487,409 participants in the full cohort (using Minimac4[11] with noncoding variants from the UK Biobank array used as the imputation scaffold, i.e., matching target and reference haplotypes based on SNP alleles at non-coding variants on the array). We benchmarked the accuracy of this imputation approach by computing correlations between imputed genotype dosages and direct genotype calls from exome sequencing of N=141,255 additional individuals subsequently released by UK Biobank (Supplementary Note).
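The chunking scheme used for phasing can be sketched as below. This is only an illustration of overlapping fixed-size windows over variant indices; the source says the overlap was "at least" 5,000 variants and does not describe how chunk boundaries were actually chosen (for example, at chromosome ends), so the function name and the exact boundary rule are assumptions.

```python
def chunk_bounds(n_variants, chunk_size=50_000, overlap=5_000):
    """Return half-open (start, end) index ranges that cover n_variants.

    Consecutive chunks share `overlap` variants. This is a hypothetical
    layout, not the authors' exact boundary rule.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_size, n_variants)
        bounds.append((start, end))
        if end == n_variants:
            break
        start = end - overlap  # step back so neighboring chunks overlap
    return bounds


example = chunk_bounds(120_000)
```

Overlapping chunks allow phased haplotypes from neighboring chunks to be stitched together consistently in the shared region.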

Association tests.

We tested variants for association with each of the 54 quantitative traits using the non-infinitesimal linear mixed model association test implemented in BOLT-LMM[23] (--lmmforceNonInf) with assessment center, genotyping array, sex, age, age squared, and 20 genetic principal components included as covariates. We fit the mixed model on directly-genotyped autosomal variants with MAF>10^−4 and missingness<0.1 and computed association test statistics for WES-imputed variants and variants from the UK Biobank imp_v3 release. In our primary analyses, we included all participants with non-missing phenotypes who reported European ancestry (and had not withdrawn from the study). We also performed association analyses that further restricted the sample set to the WES cohort to determine which associations were detectable in the WES cohort alone.

Filtering associations using coding variant annotations.

To focus our analyses on variants likely to have protein-altering effects, we filtered significant associations to those involving variants predicted (by genome annotation algorithms) to impact function. For variants modifying protein-coding sequence or canonical splice sites, we required a CADD v1.3 score ≥ 20 and a VEP annotation of missense, inframe deletion, inframe insertion, start lost, stop lost, splice acceptor, splice donor, frameshift, or stop gained[25,26]. For variants that affected multiple transcripts (for one or more genes), we assigned the most severe VEP annotation (in the order listed above) across all affected transcripts. We also included potential cryptic splice variants predicted by SpliceAI v1.2 (specifically, variants with a delta score ≥ 0.5 for at least one of the four splice modifier categories: gain or loss of a splice acceptor or a splice donor)[27].
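The annotation filter above can be summarized in a short sketch. It assumes that the consequence list in the text runs from least to most severe (the text says only "in the order listed above"), and the function names and input layout are hypothetical; real VEP output uses different field names and consequence terms.

```python
# Consequence classes in the order listed in the text; treating later
# entries as more severe is an assumption.
SEVERITY_ORDER = [
    "missense", "inframe deletion", "inframe insertion", "start lost",
    "stop lost", "splice acceptor", "splice donor", "frameshift",
    "stop gained",
]
RANK = {name: k for k, name in enumerate(SEVERITY_ORDER)}


def most_severe(annotations):
    """Pick the most severe qualifying annotation across transcripts."""
    hits = [a for a in annotations if a in RANK]
    return max(hits, key=RANK.get) if hits else None


def passes_annotation_filter(cadd, annotations, spliceai_deltas=()):
    """Keep a variant if it is a predicted cryptic splice variant
    (any SpliceAI delta score >= 0.5) or a protein-altering variant
    with CADD >= 20."""
    if any(delta >= 0.5 for delta in spliceai_deltas):
        return True
    return cadd >= 20 and most_severe(annotations) is not None


keep = passes_annotation_filter(25.0, ["missense", "stop gained"])
```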

Filtering associations in LD with nearby variants.

To further filter significant associations to a high-confidence set of likely-causal associations, we analyzed linkage disequilibrium (LD) between pairs of associated variants to identify and remove any associations potentially attributable to tagging of another variant in LD. We took this approach because while many algorithms have been developed for fine-mapping common variant associations, these methods are not optimized for rare variants: intuitively, they maximize the heritable variance that can be explained by a configuration of causal variants, making configurations which include rare variants – which typically account for very little heritability even though they can have large effect sizes – less likely to be considered probable[28,29]. Our filter, which was equivalent to requiring that each association remain significant (P < 5 × 10^−8) after conditioning on any other more-strongly-associated variant nearby, proceeded as follows. For each rare coding variant i significantly associated with a phenotype, we calculated its correlation r (i.e., in-sample LD) with each other more-strongly-associated variant j (including both WES-imputed variants and variants from the HRC-based imputation release) using plink “--r”[48]. We then computed the approximate chi-square statistic that would be obtained for variant i in a model including variant j as a covariate: $\chi^2_{i|j} = \left(\sqrt{\chi^2_i} - r\sqrt{\chi^2_j}\right)^2 / (1 - r^2)$, where $\chi^2_i$ and $\chi^2_j$ denote the chi-square test statistics computed by BOLT-LMM for variants i and j (and the sign of the square root reflects whether the effect directions are the same or opposite)[49]. In order to retain variant i’s association as likely-causal, we required the conditional chi-square statistic to exceed 29.7168 (corresponding to P < 5 × 10^−8) for every variant j with $\chi^2_j > \chi^2_i$.
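The pairwise filter described above can be sketched as follows, assuming the standard summary-statistic approximation for a conditional test, (z_i − r·z_j)² / (1 − r²), with signed z-scores obtained from the chi-square statistics and effect directions. The function names and the dictionary layout are hypothetical, and r is assumed to be computed with the same allele coding as the effect estimates.

```python
import math

GWS_CHISQ = 29.7168  # 1-d.f. chi-square value quoted in the text for P = 5e-8


def chisq_pvalue(chisq):
    """Upper-tail P-value of a 1-d.f. chi-square statistic."""
    return math.erfc(math.sqrt(chisq / 2.0))


def conditional_chisq(chisq_i, beta_i, chisq_j, beta_j, r):
    """Approximate chi-square for variant i after conditioning on variant j."""
    if abs(r) >= 1.0:
        return 0.0  # perfectly correlated variants cannot be distinguished
    # Signed z-scores: the sign of each square root follows the effect direction.
    z_i = math.copysign(math.sqrt(chisq_i), beta_i)
    z_j = math.copysign(math.sqrt(chisq_j), beta_j)
    return (z_i - r * z_j) ** 2 / (1.0 - r * r)


def is_retained(variant, neighbors, ld):
    """variant/neighbors: dicts with 'id', 'chisq', 'beta'; ld: {(id_i, id_j): r}."""
    for other in neighbors:
        if other["chisq"] <= variant["chisq"]:
            continue  # only more-strongly-associated variants are tested
        r = ld[(variant["id"], other["id"])]
        cond = conditional_chisq(variant["chisq"], variant["beta"],
                                 other["chisq"], other["beta"], r)
        if cond <= GWS_CHISQ:
            return False  # association may be explained by tagging variant j
    return True


rare = {"id": "i", "chisq": 36.0, "beta": 1.0}
lead = {"id": "j", "chisq": 100.0, "beta": 1.0}
tagged = not is_retained(rare, [lead], {("i", "j"): 0.6})
```

In this toy example the rare variant has z = 6 and the stronger neighbor has z = 10 with r = 0.6, so the neighbor fully accounts for the rare variant's signal and the association is dropped; with r = 0 the same variant would be retained.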

Filtering associations in LD with multiple variants.

The filter described above was designed to eliminate associations involving variants that primarily tagged one other variant in LD; however, in theory, non-causal variants could escape this filter by tagging a combination of multiple other variants. To account for this possibility, we used the FINEMAP software[28] to determine, for each gene harboring a rare coding variant of interest, whether the local genetic architecture appeared to involve multiple causal variants, and if so, to assess whether the rare coding variant(s) under consideration remained significantly associated after conditioning on the variants selected by FINEMAP. We performed this analysis using a two-step procedure. First, we ran FINEMAP’s shotgun stochastic search algorithm (“--sss”) to identify up to 5 putatively causal variants among all significantly associated variants within 500kb of the gene under consideration. This run produced a most probable configuration of 1-5 variants, most of which were typically common. We then ran FINEMAP a second time, adjusting the number of allowed causal variants to be one greater than the number selected for the top configuration in the first run, and limiting the set of potential causal variants to those variants in the top configuration from the first run along with all significantly-associated rare coding variants in the gene under consideration. The purpose of this second run was to ascertain whether each rare coding variant remained significant in a model conditioning on multiple common variants. Specifically, we extracted the conditional z-scores output by FINEMAP in its “.snp” files and dropped variants with z-score ≤ 4. This filter only removed 20 variants involved in 36 associations, suggesting that most rare variants that tagged other causal variants were primarily tagging just one neighboring variant. We set the z-score threshold to ≤ 4 after exploring other cut-offs such as z ≤ 5.45, the equivalent of a genome-wide significance threshold. 
The z ≤ 5.45 threshold filtered an additional 54 variants; however, several associations with z-scores around 5 that failed this filter appeared to be real (e.g., high-CADD or stop gain mutations in genes known to alter lipid levels). In light of this observation and the stringent filtering we had already performed using pairwise tests, we decided to set a threshold of z-score ≤ 4, which appeared to filter primarily spurious associations. Applying this filter together with the previous two filters left us with the final list of 1,189 significant rare coding variant associations involving 675 unique variants for 54 quantitative traits.

Variant lookup in the NHGRI-EBI GWAS Catalog.

We compared the variants we identified to those reported in the NHGRI-EBI GWAS Catalog (accessed January 15, 2020)[50]. We checked whether each variant had been reported in the catalog for any phenotype (not only the trait we associated it with), so that variants previously reported for a related phenotype would also be identified.

Replication analyses.

Several traits we analyzed had previously been studied in large-scale meta-analyses using exome arrays, providing the opportunity for replication of likely-causal associations that involved variants assayed on the exome arrays. We compared the associations of likely-causal variants we identified for height, blood pressure, and lipid measurements (LDL cholesterol, HDL cholesterol, triglycerides, and total cholesterol) to association statistics previously published by the GIANT Consortium[2] (N=381,625), the CHARGE-BP Consortium[4] (N=120,473), and the Global Lipids Genetics Consortium (N≈300,000), respectively[3]; all of these meta-analyses predominantly studied participants of European ancestry, and none included UK Biobank. While most variants were too rare to attain statistical significance in these replication data sets (probably due to allele frequency differences between the UK and other European populations), 112 out of 113 associations exhibited the same effect direction in UK Biobank and the replication data set (Table 1 and Supplementary Table 7). We also compared our height associations to association statistics reported from exome-sequencing of the FinMetSeq cohort[51] (N=19,241), which provided replication support for a few additional variants that happened to have higher allele frequencies in Finns (Supplementary Table 7).

Background distribution for assessing functional enrichment.

To identify trends in the deleteriousness of likely-causal rare coding variants as compared to all rare coding variants, we generated a background distribution of rare coding variants with a MAF distribution matching that of the likely-causal variants (to account for the tendency of rarer variants to have higher deleteriousness scores). We first stratified likely-causal variants into three MAF bins: 10^−5 to 10^−4, 10^−4 to 10^−3, and 10^−3 to 10^−2. We then subsampled the set of all rare coding variants considered in our analyses (regardless of whether or not they had a significant association) using the R “sample” function to generate a set of variants with the same fraction of variants in each MAF bin as in the likely-causal set. We included all variants in the highest MAF bin (as this bin contained the fewest variants), which set the total number of variants in the background distribution at 47,142 variants.
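The MAF-matched subsampling can be sketched as below. The original analysis used the R “sample” function; this Python version is an illustrative stand-in, with hypothetical function names, half-open bin boundaries chosen by assumption, and a fixed seed added for reproducibility.

```python
import random

BIN_EDGES = [1e-5, 1e-4, 1e-3, 1e-2]  # three MAF bins, as in the text


def maf_bin(maf):
    """Return the index of the MAF bin containing maf, or None if outside."""
    for b in range(len(BIN_EDGES) - 1):
        if BIN_EDGES[b] <= maf < BIN_EDGES[b + 1]:
            return b
    return None


def matched_background(causal_mafs, all_variants, seed=0):
    """all_variants: list of (variant_id, maf). Returns sampled variant ids
    whose bin fractions match those of the likely-causal set."""
    rng = random.Random(seed)
    n_bins = len(BIN_EDGES) - 1
    causal_counts = [0] * n_bins
    for maf in causal_mafs:
        b = maf_bin(maf)
        if b is not None:
            causal_counts[b] += 1
    pools = [[] for _ in range(n_bins)]
    for variant_id, maf in all_variants:
        b = maf_bin(maf)
        if b is not None:
            pools[b].append(variant_id)
    top = n_bins - 1
    if causal_counts[top] == 0:
        raise ValueError("no likely-causal variants in the highest MAF bin")
    # Use every variant in the highest MAF bin; this fixes the total size.
    total = len(pools[top]) * sum(causal_counts) / causal_counts[top]
    background = []
    for b in range(n_bins):
        target = round(total * causal_counts[b] / sum(causal_counts))
        background.extend(rng.sample(pools[b], min(target, len(pools[b]))))
    return background


variants = ([(f"a{k}", 5e-5) for k in range(100)]
            + [(f"b{k}", 5e-4) for k in range(100)]
            + [(f"c{k}", 5e-3) for k in range(10)])
background = matched_background([5e-5, 5e-4, 5e-4, 5e-3], variants)
```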

Allelic series analyses.

As our primary analysis pipeline for identifying likely-causal rare coding variant associations implemented strict filters on statistical significance (in both single-variant analysis and conditional analyses), we applied a secondary analysis pipeline that relaxed these filters to identify additional rare coding variant associations with good statistical support within genes with two or more likely-causal variants for a trait (indicating strong evidence for the gene-trait association). This pipeline applied a two-step approach (detailed in the Supplementary Note) using FINEMAP in a manner somewhat similar to the approach we used to filter associations that could be explained by combinations of other variants. Here, we again performed a first run of FINEMAP to allow it to select a multiple-causal-variant model (this time containing up to 15 causal variants chosen from common and low-frequency variants as well as rare coding variants), and we then ran FINEMAP a second time to perform an iterative conditional analysis using the selected variants together with rare coding variants. We used conditional P-values from the second FINEMAP run to assess the extent to which each rare coding variant exhibited a trait association independent of previous variants. Finally, we converted P-values to q-values to determine the set of rare coding variants that reached significance at a false discovery rate of 5%. The expanded allelic series we identified at FDR<0.05 significance often contained many variants. (For genes with multiple transcripts, we counted the lengths of the allelic series for the transcript that contained the most FDR<0.05-significant variants, treating cryptic splice variants as belonging to all transcripts.) To visualize the effects of missense variants, we plotted the affected amino acids on previously-generated protein structures. 
Experimentally-derived protein structures for PCSK9 (2P4E)[52], ANGPTL3 (6EUA)[53], IQGAP2 (5CJP)[54], and GOT1 (3II0)[55] were retrieved from PDB[56]. Computationally predicted structures for NPR2 (P20594 monomer) and IFRD2 (Q12894 monomer) were retrieved from SWISS-MODEL[57].

Associations with health outcomes.

We tested likely-causal variants we identified for cellular and molecular phenotypes (blood cell traits, liver biomarkers, diabetes biomarkers, renal biomarkers, and cardiovascular biomarkers) for associations with corresponding disease outcomes coded by UK Biobank using ICD-10 codes (blood disorders, D50-D77; liver diseases, K70-K77; type 2 diabetes, E11; gout and kidney diseases, M10 and N00-N29; cardiovascular diseases, I20-I25 and I63). To reduce the multiple-testing burden, we further restricted to diseases with at least 500 reported cases. These criteria left 40 phenotypes under consideration (i.e., an average of 8 phenotypes tested for each likely-causal variant for each of the 5 classes of cellular/molecular phenotypes) and resulted in 5,508 separate tests. Setting a false discovery rate threshold of 5% across the 5,508 tests resulted in a significance threshold of P < 1.5 × 10^−4.
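How an FDR threshold translates into a P-value cutoff can be illustrated with the Benjamini-Hochberg step-up procedure. The source does not state which FDR procedure was used, so Benjamini-Hochberg is an assumption here, and the function name is hypothetical.

```python
def bh_threshold(pvalues, fdr=0.05):
    """Return the largest P-value declared significant by the
    Benjamini-Hochberg procedure (0.0 if none are significant)."""
    m = len(pvalues)
    threshold = 0.0
    for rank, p in enumerate(sorted(pvalues), start=1):
        # The k-th smallest P-value is compared against fdr * k / m.
        if p <= fdr * rank / m:
            threshold = p
    return threshold


cutoff = bh_threshold([0.001, 0.008, 0.039, 0.041, 0.9])

# Under this assumed procedure, the critical value for the 17th-ranked
# test among 5,508 would be 0.05 * 17 / 5508.
critical_17 = 0.05 * 17 / 5508
```

Under this assumption, the critical value for the 17th-ranked of 5,508 tests is approximately 1.5 × 10^−4, which is consistent with the 17 associations and the P-value threshold reported in the text.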

Gene-based burden tests.

We assessed the performance of gene-based association analyses using burden tests that collapsed the genotypes of imputed rare coding variants within each gene. We considered six different criteria for inclusion of rare coding variants in the burden. These six criteria were defined by three different allele frequency thresholds (MAF ≤ 1%, 0.1%, and 0.01%) and two different variant annotation criteria (protein-altering with CADD ≥ 20 or predicted loss-of-function as annotated by VEP). Collapsed genotypes were coded as 0 (if an individual had no variants meeting these requirements) or 1 (if the individual carried at least one of these variants). We performed association tests against the 54 quantitative traits using BOLT-LMM with the same settings as in our single-variant analyses, and we applied a Bonferroni-corrected P-value threshold of P < 2.7 x 10−6 to account for 18,530 genes tested. We compared the results of these analyses to those previously reported in burden analyses of N=49,960 exome-sequenced UK Biobank participants[8,9]. Among phenotypes in common between our analyses and the previous analyses, we replicated 13/15 associations from Van Hout et al.[8] and 48/58 associations from Cirulli et al.[9]. Non-replicated results might arise from different selection criteria for variants and to a lesser extent from singletons that were included in the previous analyses but excluded from our imputation.
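The collapsing step and the Bonferroni threshold can be sketched as follows. The sketch assumes hard-called allele counts (0/1/2) per individual; the source does not say how imputed dosages were converted to carrier status, and the function names are hypothetical.

```python
def collapse_carriers(genotypes):
    """genotypes: list of per-variant lists of allele counts, one entry per
    individual. Returns 1 for individuals carrying at least one qualifying
    variant in the gene and 0 otherwise."""
    if not genotypes:
        return []
    n_individuals = len(genotypes[0])
    return [1 if any(variant[k] > 0 for variant in genotypes) else 0
            for k in range(n_individuals)]


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold controlling family-wise error at alpha."""
    return alpha / n_tests


# Two qualifying variants in one gene, four individuals.
burden = collapse_carriers([[0, 1, 0, 0], [0, 0, 2, 0]])
gene_threshold = bonferroni_threshold(18_530)
```

With 18,530 genes tested, 0.05 / 18,530 is approximately 2.7 × 10^−6, matching the threshold stated above.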

Data availability.

Access to the UK Biobank Resource is available by application (http://www.ukbiobank.ac.uk/). Exome-wide summary association statistics for the 54 quantitative traits we analyzed are available at https://data.broadinstitute.org/lohlab/UKB_exomeWAS/, and data files containing allelic series for all gene-trait associations with multiple likely-causal variants are also available at this website.

Code availability.

The following publicly available software packages were used to perform analyses: Eagle2 (v2.3.5), https://data.broadinstitute.org/alkesgroup/Eagle/; Minimac4 (v1.0.1), https://genome.sph.umich.edu/wiki/Minimac4; BOLT-LMM (v2.3.4), https://data.broadinstitute.org/alkesgroup/BOLT-LMM/; FINEMAP (v1.3.1), http://www.christianbenner.com/; plink (v1.9 and v2.0), https://www.cog-genomics.org/plink2/; tsinfer (v0.1.4), https://tsinfer.readthedocs.io/en/latest/. Information from the following databases was also used: Variant Effect Predictor (v95 on GRCh37 with GENCODE 19), https://useast.ensembl.org/info/docs/tools/vep/index.html; CADD (v1.5), https://cadd.gs.washington.edu/download; SpliceAI (v1.2.1), https://github.com/Illumina/SpliceAI; NHGRI-EBI GWAS Catalog (v1.0), https://www.ebi.ac.uk/gwas/home; TOPMed (v r2, 97,256 TOPMed samples), https://imputation.biodatacatalyst.nhlbi.nih.gov/#!pages/about; Protein Data Bank, https://www.rcsb.org/; SWISS-MODEL, https://swissmodel.expasy.org/; PANTHER, http://www.pantherdb.org/. Scripts used to perform the downstream analyses described above are available at https://data.broadinstitute.org/lohlab/UKB_exomeWAS/ (DOI: 10.5281/zenodo.4771214).

47 in total

1. Distribution and clinical impact of functional variants in 50,726 whole-exome sequences from the DiscovEHR study.

Authors: Frederick E Dewey; Michael F Murray; John D Overton; Lukas Habegger; Joseph B Leader; Samantha N Fetterolf; Colm O'Dushlaine; Cristopher V Van Hout; Jeffrey Staples; Claudia Gonzaga-Jauregui; Raghu Metpally; Sarah A Pendergrass; Monica A Giovanni; H Lester Kirchner; Suganthi Balasubramanian; Noura S Abul-Husn; Dustin N Hartzel; Daniel R Lavage; Korey A Kost; Jonathan S Packer; Alexander E Lopez; John Penn; Semanti Mukherjee; Nehal Gosalia; Manoj Kanagaraj; Alexander H Li; Lyndon J Mitnaul; Lance J Adams; Thomas N Person; Kavita Praveen; Anthony Marcketta; Matthew S Lebo; Christina A Austin-Tse; Heather M Mason-Suares; Shannon Bruse; Scott Mellis; Robert Phillips; Neil Stahl; Andrew Murphy; Aris Economides; Kimberly A Skelding; Christopher D Still; James R Elmore; Ingrid B Borecki; George D Yancopoulos; F Daniel Davis; William A Faucett; Omri Gottesman; Marylyn D Ritchie; Alan R Shuldiner; Jeffrey G Reid; David H Ledbetter; Aris Baras; David J Carey
Journal: Science Date: 2016-12-23 Impact factor: 47.728

2. Evolution and functional impact of rare coding variation from deep sequencing of human exomes.

Authors: Jacob A Tennessen; Abigail W Bigham; Timothy D O'Connor; Wenqing Fu; Eimear E Kenny; Simon Gravel; Sean McGee; Ron Do; Xiaoming Liu; Goo Jun; Hyun Min Kang; Daniel Jordan; Suzanne M Leal; Stacey Gabriel; Mark J Rieder; Goncalo Abecasis; David Altshuler; Deborah A Nickerson; Eric Boerwinkle; Shamil Sunyaev; Carlos D Bustamante; Michael J Bamshad; Joshua M Akey
Journal: Science Date: 2012-05-17 Impact factor: 47.728

3. Next-generation genotype imputation service and methods.

Authors: Sayantan Das; Lukas Forer; Sebastian Schönherr; Carlo Sidore; Adam E Locke; Alan Kwong; Scott I Vrieze; Emily Y Chew; Shawn Levy; Matt McGue; David Schlessinger; Dwight Stambolian; Po-Ru Loh; William G Iacono; Anand Swaroop; Laura J Scott; Francesco Cucca; Florian Kronenberg; Michael Boehnke; Gonçalo R Abecasis; Christian Fuchsberger
Journal: Nat Genet Date: 2016-08-29 Impact factor: 38.330

4. Meta-analysis identifies common and rare variants influencing blood pressure and overlapping with metabolic trait loci.

Authors: Chunyu Liu; Aldi T Kraja; Jennifer A Smith; Jennifer A Brody; Nora Franceschini; Joshua C Bis; Kenneth Rice; Alanna C Morrison; Yingchang Lu; Stefan Weiss; Xiuqing Guo; Walter Palmas; Lisa W Martin; Yii-Der Ida Chen; Praveen Surendran; Fotios Drenos; James P Cook; Paul L Auer; Audrey Y Chu; Ayush Giri; Wei Zhao; Johanna Jakobsdottir; Li-An Lin; Jeanette M Stafford; Najaf Amin; Hao Mei; Jie Yao; Arend Voorman; Martin G Larson; Megan L Grove; Albert V Smith; Shih-Jen Hwang; Han Chen; Tianxiao Huan; Gulum Kosova; Nathan O Stitziel; Sekar Kathiresan; Nilesh Samani; Heribert Schunkert; Panos Deloukas; Man Li; Christian Fuchsberger; Cristian Pattaro; Mathias Gorski; Charles Kooperberg; George J Papanicolaou; Jacques E Rossouw; Jessica D Faul; Sharon L R Kardia; Claude Bouchard; Leslie J Raffel; André G Uitterlinden; Oscar H Franco; Ramachandran S Vasan; Christopher J O'Donnell; Kent D Taylor; Kiang Liu; Erwin P Bottinger; Omri Gottesman; E Warwick Daw; Franco Giulianini; Santhi Ganesh; Elias Salfati; Tamara B Harris; Lenore J Launer; Marcus Dörr; Stephan B Felix; Rainer Rettig; Henry Völzke; Eric Kim; Wen-Jane Lee; I-Te Lee; Wayne H-H Sheu; Krystal S Tsosie; Digna R Velez Edwards; Yongmei Liu; Adolfo Correa; David R Weir; Uwe Völker; Paul M Ridker; Eric Boerwinkle; Vilmundur Gudnason; Alexander P Reiner; Cornelia M van Duijn; Ingrid B Borecki; Todd L Edwards; Aravinda Chakravarti; Jerome I Rotter; Bruce M Psaty; Ruth J F Loos; Myriam Fornage; Georg B Ehret; Christopher Newton-Cheh; Daniel Levy; Daniel I Chasman
Journal: Nat Genet Date: 2016-09-12 Impact factor: 41.307

5. Exome-wide association study of plasma lipids in >300,000 individuals.

Authors: Dajiang J Liu; Gina M Peloso; Haojie Yu; Adam S Butterworth; Xiao Wang; Anubha Mahajan; Danish Saleheen; Connor Emdin; Dewan Alam; Alexessander Couto Alves; Philippe Amouyel; Emanuele Di Angelantonio; Dominique Arveiler; Themistocles L Assimes; Paul L Auer; Usman Baber; Christie M Ballantyne; Lia E Bang; Marianne Benn; Joshua C Bis; Michael Boehnke; Eric Boerwinkle; Jette Bork-Jensen; Erwin P Bottinger; Ivan Brandslund; Morris Brown; Fabio Busonero; Mark J Caulfield; John C Chambers; Daniel I Chasman; Y Eugene Chen; Yii-Der Ida Chen; Rajiv Chowdhury; Cramer Christensen; Audrey Y Chu; John M Connell; Francesco Cucca; L Adrienne Cupples; Scott M Damrauer; Gail Davies; Ian J Deary; George Dedoussis; Joshua C Denny; Anna Dominiczak; Marie-Pierre Dubé; Tapani Ebeling; Gudny Eiriksdottir; Tõnu Esko; Aliki-Eleni Farmaki; Mary F Feitosa; Marco Ferrario; Jean Ferrieres; Ian Ford; Myriam Fornage; Paul W Franks; Timothy M Frayling; Ruth Frikke-Schmidt; Lars G Fritsche; Philippe Frossard; Valentin Fuster; Santhi K Ganesh; Wei Gao; Melissa E Garcia; Christian Gieger; Franco Giulianini; Mark O Goodarzi; Harald Grallert; Niels Grarup; Leif Groop; Megan L Grove; Vilmundur Gudnason; Torben Hansen; Tamara B Harris; Caroline Hayward; Joel N Hirschhorn; Oddgeir L Holmen; Jennifer Huffman; Yong Huo; Kristian Hveem; Sehrish Jabeen; Anne U Jackson; Johanna Jakobsdottir; Marjo-Riitta Jarvelin; Gorm B Jensen; Marit E Jørgensen; J Wouter Jukema; Johanne M Justesen; Pia R Kamstrup; Stavroula Kanoni; Fredrik Karpe; Frank Kee; Amit V Khera; Derek Klarin; Heikki A Koistinen; Jaspal S Kooner; Charles Kooperberg; Kari Kuulasmaa; Johanna Kuusisto; Markku Laakso; Timo Lakka; Claudia Langenberg; Anne Langsted; Lenore J Launer; Torsten Lauritzen; David C M Liewald; Li An Lin; Allan Linneberg; Ruth J F Loos; Yingchang Lu; Xiangfeng Lu; Reedik Mägi; Anders Malarstig; Ani Manichaikul; Alisa K Manning; Pekka Mäntyselkä; Eirini Marouli; Nicholas G D Masca; Andrea Maschio; James B Meigs; Olle Melander; Andres Metspalu; Andrew P Morris; Alanna C Morrison; Antonella Mulas; Martina Müller-Nurasyid; Patricia B Munroe; Matt J Neville; Jonas B Nielsen; Sune F Nielsen; Børge G Nordestgaard; Jose M Ordovas; Roxana Mehran; Christoper J O'Donnell; Marju Orho-Melander; Cliona M Molony; Pieter Muntendam; Sandosh Padmanabhan; Colin N A Palmer; Dorota Pasko; Aniruddh P Patel; Oluf Pedersen; Markus Perola; Annette Peters; Charlotta Pisinger; Giorgio Pistis; Ozren Polasek; Neil Poulter; Bruce M Psaty; Daniel J Rader; Asif Rasheed; Rainer Rauramaa; Dermot F Reilly; Alex P Reiner; Frida Renström; Stephen S Rich; Paul M Ridker; John D Rioux; Neil R Robertson; Dan M Roden; Jerome I Rotter; Igor Rudan; Veikko Salomaa; Nilesh J Samani; Serena Sanna; Naveed Sattar; Ellen M Schmidt; Robert A Scott; Peter Sever; Raquel S Sevilla; Christian M Shaffer; Xueling Sim; Suthesh Sivapalaratnam; Kerrin S Small; Albert V Smith; Blair H Smith; Sangeetha Somayajula; Lorraine Southam; Timothy D Spector; Elizabeth K Speliotes; John M Starr; Kathleen E Stirrups; Nathan Stitziel; Konstantin Strauch; Heather M Stringham; Praveen Surendran; Hayato Tada; Alan R Tall; Hua Tang; Jean-Claude Tardif; Kent D Taylor; Stella Trompet; Philip S Tsao; Jaakko Tuomilehto; Anne Tybjaerg-Hansen; Natalie R van Zuydam; Anette Varbo; Tibor V Varga; Jarmo Virtamo; Melanie Waldenberger; Nan Wang; Nick J Wareham; Helen R Warren; Peter E Weeke; Joshua Weinstock; Jennifer Wessel; James G Wilson; Peter W F Wilson; Ming Xu; Hanieh Yaghootkar; Robin Young; Eleftheria Zeggini; He Zhang; Neil S Zheng; Weihua Zhang; Yan Zhang; Wei Zhou; Yanhua Zhou; Magdalena Zoledziewska; Joanna M M Howson; John Danesh; Mark I McCarthy; Chad A Cowan; Goncalo Abecasis; Panos Deloukas; Kiran Musunuru; Cristen J Willer; Sekar Kathiresan
Journal: Nat Genet Date: 2017-10-30 Impact factor: 38.330

6. Genome-wide rare variant analysis for thousands of phenotypes in over 70,000 exomes from two cohorts.

Authors: Elizabeth T Cirulli; Simon White; Robert W Read; Gai Elhanan; William J Metcalf; Francisco Tanudjaja; Donna M Fath; Efren Sandoval; Magnus Isaksson; Karen A Schlauch; Joseph J Grzymski; James T Lu; Nicole L Washington
Journal: Nat Commun Date: 2020-01-28 Impact factor: 14.919

7. Rare and low-frequency coding variants alter human adult height.

Authors: Eirini Marouli; Mariaelisa Graff; Carolina Medina-Gomez; Ken Sin Lo; Andrew R Wood; Troels R Kjaer; Rebecca S Fine; Yingchang Lu; Claudia Schurmann; Heather M Highland; Sina Rüeger; Gudmar Thorleifsson; Anne E Justice; David Lamparter; Kathleen E Stirrups; Valérie Turcot; Kristin L Young; Thomas W Winkler; Tõnu Esko; Tugce Karaderi; Adam E Locke; Nicholas G D Masca; Maggie C Y Ng; Poorva Mudgal; Manuel A Rivas; Sailaja Vedantam; Anubha Mahajan; Xiuqing Guo; Goncalo Abecasis; Katja K Aben; Linda S Adair; Dewan S Alam; Eva Albrecht; Kristine H Allin; Matthew Allison; Philippe Amouyel; Emil V Appel; Dominique Arveiler; Folkert W Asselbergs; Paul L Auer; Beverley Balkau; Bernhard Banas; Lia E Bang; Marianne Benn; Sven Bergmann; Lawrence F Bielak; Matthias Blüher; Heiner Boeing; Eric Boerwinkle; Carsten A Böger; Lori L Bonnycastle; Jette Bork-Jensen; Michiel L Bots; Erwin P Bottinger; Donald W Bowden; Ivan Brandslund; Gerome Breen; Murray H Brilliant; Linda Broer; Amber A Burt; Adam S Butterworth; David J Carey; Mark J Caulfield; John C Chambers; Daniel I Chasman; Yii-Der Ida Chen; Rajiv Chowdhury; Cramer Christensen; Audrey Y Chu; Massimiliano Cocca; Francis S Collins; James P Cook; Janie Corley; Jordi Corominas Galbany; Amanda J Cox; Gabriel Cuellar-Partida; John Danesh; Gail Davies; Paul I W de Bakker; Gert J de Borst; Simon de Denus; Mark C H de Groot; Renée de Mutsert; Ian J Deary; George Dedoussis; Ellen W Demerath; Anneke I den Hollander; Joe G Dennis; Emanuele Di Angelantonio; Fotios Drenos; Mengmeng Du; Alison M Dunning; Douglas F Easton; Tapani Ebeling; Todd L Edwards; Patrick T Ellinor; Paul Elliott; Evangelos Evangelou; Aliki-Eleni Farmaki; Jessica D Faul; Mary F Feitosa; Shuang Feng; Ele Ferrannini; Marco M Ferrario; Jean Ferrieres; Jose C Florez; Ian Ford; Myriam Fornage; Paul W Franks; Ruth Frikke-Schmidt; Tessel E Galesloot; Wei Gan; Ilaria Gandin; Paolo Gasparini; Vilmantas Giedraitis; Ayush Giri; Giorgia Girotto; Scott D Gordon; Penny Gordon-Larsen; Mathias Gorski; Niels Grarup; Megan L Grove; Vilmundur Gudnason; Stefan Gustafsson; Torben Hansen; Kathleen Mullan Harris; Tamara B Harris; Andrew T Hattersley; Caroline Hayward; Liang He; Iris M Heid; Kauko Heikkilä; Øyvind Helgeland; Jussi Hernesniemi; Alex W Hewitt; Lynne J Hocking; Mette Hollensted; Oddgeir L Holmen; G Kees Hovingh; Joanna M M Howson; Carel B Hoyng; Paul L Huang; Kristian Hveem; M Arfan Ikram; Erik Ingelsson; Anne U Jackson; Jan-Håkan Jansson; Gail P Jarvik; Gorm B Jensen; Min A Jhun; Yucheng Jia; Xuejuan Jiang; Stefan Johansson; Marit E Jørgensen; Torben Jørgensen; Pekka Jousilahti; J Wouter Jukema; Bratati Kahali; René S Kahn; Mika Kähönen; Pia R Kamstrup; Stavroula Kanoni; Jaakko Kaprio; Maria Karaleftheri; Sharon L R Kardia; Fredrik Karpe; Frank Kee; Renske Keeman; Lambertus A Kiemeney; Hidetoshi Kitajima; Kirsten B Kluivers; Thomas Kocher; Pirjo Komulainen; Jukka Kontto; Jaspal S Kooner; Charles Kooperberg; Peter Kovacs; Jennifer Kriebel; Helena Kuivaniemi; Sébastien Küry; Johanna Kuusisto; Martina La Bianca; Markku Laakso; Timo A Lakka; Ethan M Lange; Leslie A Lange; Carl D Langefeld; Claudia Langenberg; Eric B Larson; I-Te Lee; Terho Lehtimäki; Cora E Lewis; Huaixing Li; Jin Li; Ruifang Li-Gao; Honghuang Lin; Li-An Lin; Xu Lin; Lars Lind; Jaana Lindström; Allan Linneberg; Yeheng Liu; Yongmei Liu; Artitaya Lophatananon; Jian'an Luan; Steven A Lubitz; Leo-Pekka Lyytikäinen; David A Mackey; Pamela A F Madden; Alisa K Manning; Satu Männistö; Gaëlle Marenne; Jonathan Marten; Nicholas G Martin; Angela L Mazul; Karina Meidtner; Andres Metspalu; Paul Mitchell; Karen L Mohlke; Dennis O Mook-Kanamori; Anna Morgan; Andrew D Morris; Andrew P Morris; Martina Müller-Nurasyid; Patricia B Munroe; Mike A Nalls; Matthias Nauck; Christopher P Nelson; Matt Neville; Sune F Nielsen; Kjell Nikus; Pål R Njølstad; Børge G Nordestgaard; Ioanna Ntalla; Jeffrey R O'Connel; Heikki Oksa; Loes M Olde Loohuis; Roel A Ophoff; Katharine R Owen; Chris J Packard; Sandosh Padmanabhan; Colin N A Palmer; Gerard Pasterkamp; Aniruddh P Patel; Alison Pattie; Oluf Pedersen; Peggy L Peissig; Gina M Peloso; Craig E Pennell; Markus Perola; James A Perry; John R B Perry; Thomas N Person; Ailith Pirie; Ozren Polasek; Danielle Posthuma; Olli T Raitakari; Asif Rasheed; Rainer Rauramaa; Dermot F Reilly; Alex P Reiner; Frida Renström; Paul M Ridker; John D Rioux; Neil Robertson; Antonietta Robino; Olov Rolandsson; Igor Rudan; Katherine S Ruth; Danish Saleheen; Veikko Salomaa; Nilesh J Samani; Kevin Sandow; Yadav Sapkota; Naveed Sattar; Marjanka K Schmidt; Pamela J Schreiner; Matthias B Schulze; Robert A Scott; Marcelo P Segura-Lepe; Svati Shah; Xueling Sim; Suthesh Sivapalaratnam; Kerrin S Small; Albert Vernon Smith; Jennifer A Smith; Lorraine Southam; Timothy D Spector; Elizabeth K Speliotes; John M Starr; Valgerdur Steinthorsdottir; Heather M Stringham; Michael Stumvoll; Praveen Surendran; Leen M 't Hart; Katherine E Tansey; Jean-Claude Tardif; Kent D Taylor; Alexander Teumer; Deborah J Thompson; Unnur Thorsteinsdottir; Betina H Thuesen; Anke Tönjes; Gerard Tromp; Stella Trompet; Emmanouil Tsafantakis; Jaakko Tuomilehto; Anne Tybjaerg-Hansen; Jonathan P Tyrer; Rudolf Uher; André G Uitterlinden; Sheila Ulivi; Sander W van der Laan; Andries R Van Der Leij; Cornelia M van Duijn; Natasja M van Schoor; Jessica van Setten; Anette Varbo; Tibor V Varga; Rohit Varma; Digna R Velez Edwards; Sita H Vermeulen; Henrik Vestergaard; Veronique Vitart; Thomas F Vogt; Diego Vozzi; Mark Walker; Feijie Wang; Carol A Wang; Shuai Wang; Yiqin Wang; Nicholas J Wareham; Helen R Warren; Jennifer Wessel; Sara M Willems; James G Wilson; Daniel R Witte; Michael O Woods; Ying Wu; Hanieh Yaghootkar; Jie Yao; Pang Yao; Laura M Yerges-Armstrong; Robin Young; Eleftheria Zeggini; Xiaowei Zhan; Weihua Zhang; Jing Hua Zhao; Wei Zhao; Wei Zhao; He Zheng; Wei Zhou; Jerome I Rotter; Michael Boehnke; Sekar Kathiresan; Mark I McCarthy; Cristen J Willer; Kari Stefansson; Ingrid B Borecki; Dajiang J Liu; Kari E North; Nancy L Heard-Costa; Tune H Pers; Cecilia M Lindgren; Claus Oxvig; Zoltán Kutalik; Fernando Rivadeneira; Ruth J F Loos; Timothy M Frayling; Joel N Hirschhorn; Panos Deloukas; Guillaume Lettre
Journal: Nature Date: 2017-02-01 Impact factor: 49.962

8. Exome sequencing of 20,791 cases of type 2 diabetes and 24,440 controls.

Authors: Josep M Mercader; Christian Fuchsberger; Miriam S Udler; Anubha Mahajan; Jason Flannick; Jennifer Wessel; Tanya M Teslovich; Lizz Caulkins; Ryan Koesterer; Francisco Barajas-Olmos; Thomas W Blackwell; Eric Boerwinkle; Jennifer A Brody; Federico Centeno-Cruz; Ling Chen; Siying Chen; Cecilia Contreras-Cubas; Emilio Córdova; Adolfo Correa; Maria Cortes; Ralph A DeFronzo; Lawrence Dolan; Kimberly L Drews; Amanda Elliott; James S Floyd; Stacey Gabriel; Maria Eugenia Garay-Sevilla; Humberto García-Ortiz; Myron Gross; Sohee Han; Nancy L Heard-Costa; Anne U Jackson; Marit E Jørgensen; Hyun Min Kang; Megan Kelsey; Bong-Jo Kim; Heikki A Koistinen; Johanna Kuusisto; Joseph B Leader; Allan Linneberg; Ching-Ti Liu; Jianjun Liu; Valeriya Lyssenko; Alisa K Manning; Anthony Marcketta; Juan Manuel Malacara-Hernandez; Angélica Martínez-Hernández; Karen Matsuo; Elizabeth Mayer-Davis; Elvia Mendoza-Caamal; Karen L Mohlke; Alanna C Morrison; Anne Ndungu; Maggie C Y Ng; Colm O'Dushlaine; Anthony J Payne; Catherine Pihoker; Wendy S Post; Michael Preuss; Bruce M Psaty; Ramachandran S Vasan; N William Rayner; Alexander P Reiner; Cristina Revilla-Monsalve; Neil R Robertson; Nicola Santoro; Claudia Schurmann; Wing Yee So; Xavier Soberón; Heather M Stringham; Tim M Strom; Claudia H T Tam; Farook Thameem; Brian Tomlinson; Jason M Torres; Russell P Tracy; Rob M van Dam; Marijana Vujkovic; Shuai Wang; Ryan P Welch; Daniel R Witte; Tien-Yin Wong; Gil Atzmon; Nir Barzilai; John Blangero; Lori L Bonnycastle; Donald W Bowden; John C Chambers; Edmund Chan; Ching-Yu Cheng; Yoon Shin Cho; Francis S Collins; Paul S de Vries; Ravindranath Duggirala; Benjamin Glaser; Clicerio Gonzalez; Ma Elena Gonzalez; Leif Groop; Jaspal Singh Kooner; Soo Heon Kwak; Markku Laakso; Donna M Lehman; Peter Nilsson; Timothy D Spector; E Shyong Tai; Tiinamaija Tuomi; Jaakko Tuomilehto; James G Wilson; Carlos A Aguilar-Salinas; Erwin Bottinger; Brian Burke; David J Carey; Juliana C N Chan; Josée Dupuis; Philippe Frossard; Susan R Heckbert; Mi Yeong Hwang; Young Jin Kim; H Lester Kirchner; Jong-Young Lee; Juyoung Lee; Ruth J F Loos; Ronald C W Ma; Andrew D Morris; Christopher J O'Donnell; Colin N A Palmer; James Pankow; Kyong Soo Park; Asif Rasheed; Danish Saleheen; Xueling Sim; Kerrin S Small; Yik Ying Teo; Christopher Haiman; Craig L Hanis; Brian E Henderson; Lorena Orozco; Teresa Tusié-Luna; Frederick E Dewey; Aris Baras; Christian Gieger; Thomas Meitinger; Konstantin Strauch; Leslie Lange; Niels Grarup; Torben Hansen; Oluf Pedersen; Philip Zeitler; Dana Dabelea; Goncalo Abecasis; Graeme I Bell; Nancy J Cox; Mark Seielstad; Rob Sladek; James B Meigs; Steve S Rich; Jerome I Rotter; David Altshuler; Noël P Burtt; Laura J Scott; Andrew P Morris; Jose C Florez; Mark I McCarthy; Michael Boehnke
Journal: Nature Date: 2019-05-22 Impact factor: 49.962

9. The UK Biobank resource with deep phenotyping and genomic data.

Authors: Clare Bycroft; Colin Freeman; Desislava Petkova; Gavin Band; Lloyd T Elliott; Kevin Sharp; Allan Motyer; Damjan Vukcevic; Olivier Delaneau; Jared O'Connell; Adrian Cortes; Samantha Welsh; Alan Young; Mark Effingham; Gil McVean; Stephen Leslie; Naomi Allen; Peter Donnelly; Jonathan Marchini
Journal: Nature Date: 2018-10-10 Impact factor: 49.962

10. Exome sequencing and characterization of 49,960 individuals in the UK Biobank.

Authors: Cristopher V Van Hout; Ioanna Tachmazidou; Joshua D Backman; Joshua D Hoffman; Daren Liu; Ashutosh K Pandey; Claudia Gonzaga-Jauregui; Shareef Khalid; Bin Ye; Nilanjana Banerjee; Alexander H Li; Colm O'Dushlaine; Anthony Marcketta; Jeffrey Staples; Claudia Schurmann; Alicia Hawes; Evan Maxwell; Leland Barnard; Alexander Lopez; John Penn; Lukas Habegger; Andrew L Blumenfeld; Xiaodong Bai; Sean O'Keeffe; Ashish Yadav; Kavita Praveen; Marcus Jones; William J Salerno; Wendy K Chung; Ida Surakka; Cristen J Willer; Kristian Hveem; Joseph B Leader; David J Carey; David H Ledbetter; Lon Cardon; George D Yancopoulos; Aris Economides; Giovanni Coppola; Alan R Shuldiner; Suganthi Balasubramanian; Michael Cantor; Matthew R Nelson; John Whittaker; Jeffrey G Reid; Jonathan Marchini; John D Overton; Robert A Scott; Gonçalo R Abecasis; Laura Yerges-Armstrong; Aris Baras
Journal: Nature Date: 2020-10-21 Impact factor: 69.504

9 in total

1. Protein-coding repeat polymorphisms strongly shape diverse human phenotypes.

Authors: Ronen E Mukamel; Robert E Handsaker; Maxwell A Sherman; Alison R Barton; Yiming Zheng; Steven A McCarroll; Po-Ru Loh
Journal: Science Date: 2021-09-23 Impact factor: 47.728

2. Long-sought mediator of vitamin K recycling discovered.

Authors: Nathan P Ward; Gina M DeNicola
Journal: Nature Date: 2022-08 Impact factor: 69.504

3. Epigenomic and transcriptomic analyses define core cell types, genes and targetable mechanisms for kidney disease.

Authors: Hongbo Liu; Tomohito Doke; Dong Guo; Xin Sheng; Ziyuan Ma; Joseph Park; Ha My T Vy; Girish N Nadkarni; Amin Abedini; Zhen Miao; Matthew Palmer; Benjamin F Voight; Hongzhe Li; Christopher D Brown; Marylyn D Ritchie; Yan Shu; Katalin Susztak
Journal: Nat Genet Date: 2022-06-16 Impact factor: 41.307

4. A spectrum of recessiveness among Mendelian disease variants in UK Biobank.

Authors: Alison R Barton; Margaux L A Hujoel; Ronen E Mukamel; Maxwell A Sherman; Po-Ru Loh
Journal: Am J Hum Genet Date: 2022-05-31 Impact factor: 11.043

5. PLIN1 Haploinsufficiency Causes a Favorable Metabolic Profile.

Authors: Kashyap A Patel; Shivang Burman; Thomas W Laver; Andrew T Hattersley; Timothy M Frayling; Michael N Weedon
Journal: J Clin Endocrinol Metab Date: 2022-05-17 Impact factor: 6.134

6. The sequences of 150,119 genomes in the UK Biobank.

Authors: Bjarni V Halldorsson; Hannes P Eggertsson; Kristjan H S Moore; Hannes Hauswedell; Ogmundur Eiriksson; Magnus O Ulfarsson; Gunnar Palsson; Marteinn T Hardarson; Asmundur Oddsson; Brynjar O Jensson; Snaedis Kristmundsdottir; Brynja D Sigurpalsdottir; Olafur A Stefansson; Doruk Beyter; Guillaume Holley; Vinicius Tragante; Arnaldur Gylfason; Pall I Olason; Florian Zink; Margret Asgeirsdottir; Sverrir T Sverrisson; Brynjar Sigurdsson; Sigurjon A Gudjonsson; Gunnar T Sigurdsson; Gisli H Halldorsson; Gardar Sveinbjornsson; Kristjan Norland; Unnur Styrkarsdottir; Droplaug N Magnusdottir; Steinunn Snorradottir; Kari Kristinsson; Emilia Sobech; Helgi Jonsson; Arni J Geirsson; Isleifur Olafsson; Palmi Jonsson; Ole Birger Pedersen; Christian Erikstrup; Søren Brunak; Sisse Rye Ostrowski; Gudmar Thorleifsson; Frosti Jonsson; Pall Melsted; Ingileif Jonsdottir; Thorunn Rafnar; Hilma Holm; Hreinn Stefansson; Jona Saemundsdottir; Daniel F Gudbjartsson; Olafur T Magnusson; Gisli Masson; Unnur Thorsteinsdottir; Agnar Helgason; Hakon Jonsson; Patrick Sulem; Kari Stefansson
Journal: Nature Date: 2022-07-20 Impact factor: 69.504

Review 7. eQTLs as causal instruments for the reconstruction of hormone linked gene networks.

Authors: Sean Bankier; Tom Michoel
Journal: Front Endocrinol (Lausanne) Date: 2022-08-17 Impact factor: 6.055

8. Whole Exome Sequencing Enhanced Imputation Identifies 85 Metabolite Associations in the Alpine CHRIS Cohort.

Authors: Eva König; Johannes Rainer; Vinicius Verri Hernandes; Giuseppe Paglia; Fabiola Del Greco M; Daniele Bottigliengo; Xianyong Yin; Lap Sum Chan; Alexander Teumer; Peter P Pramstaller; Adam E Locke; Christian Fuchsberger
Journal: Metabolites Date: 2022-06-29

Review 9. Linking genome variants to disease: scalable approaches to test the functional impact of human mutations.

Authors: Gregory M Findlay
Journal: Hum Mol Genet Date: 2021-10-01 Impact factor: 6.150

9 in total