| Literature DB >> 31118516 |
Josep M Mercader1,2,3,4,5, Christian Fuchsberger6,7,8, Miriam S Udler1,2,3,4,5, Anubha Mahajan9,10, Jason Flannick11,12,13,14, Jennifer Wessel15,16,17, Tanya M Teslovich18, Lizz Caulkins1,2, Ryan Koesterer1,2, Francisco Barajas-Olmos19, Thomas W Blackwell6,8, Eric Boerwinkle20,21, Jennifer A Brody22, Federico Centeno-Cruz19, Ling Chen4,5, Siying Chen6,8, Cecilia Contreras-Cubas19, Emilio Córdova19, Adolfo Correa23, Maria Cortes24, Ralph A DeFronzo25, Lawrence Dolan26, Kimberly L Drews27, Amanda Elliott1,2,4,5, James S Floyd28, Stacey Gabriel24, Maria Eugenia Garay-Sevilla29,30, Humberto García-Ortiz19, Myron Gross31, Sohee Han32, Nancy L Heard-Costa33,34, Anne U Jackson6,8, Marit E Jørgensen35,36,37, Hyun Min Kang6,8, Megan Kelsey27, Bong-Jo Kim32, Heikki A Koistinen38,39,40, Johanna Kuusisto41,42, Joseph B Leader43, Allan Linneberg44,45,46, Ching-Ti Liu47, Jianjun Liu48,49,50, Valeriya Lyssenko51,52, Alisa K Manning5,53, Anthony Marcketta18, Juan Manuel Malacara-Hernandez29,30, Angélica Martínez-Hernández19, Karen Matsuo6,8, Elizabeth Mayer-Davis54, Elvia Mendoza-Caamal19, Karen L Mohlke55, Alanna C Morrison20, Anne Ndungu9, Maggie C Y Ng56,57,58, Colm O'Dushlaine18, Anthony J Payne9, Catherine Pihoker59, Wendy S Post60, Michael Preuss61, Bruce M Psaty62,63,64,65,66, Ramachandran S Vasan34,67, N William Rayner9,10,68, Alexander P Reiner65, Cristina Revilla-Monsalve69, Neil R Robertson9,10, Nicola Santoro70, Claudia Schurmann61, Wing Yee So71,72,73, Xavier Soberón19, Heather M Stringham6,8, Tim M Strom74,75, Claudia H T Tam71,72,73, Farook Thameem76, Brian Tomlinson71, Jason M Torres9, Russell P Tracy77,78, Rob M van Dam49,50,79, Marijana Vujkovic80, Shuai Wang47, Ryan P Welch6,8, Daniel R Witte81,82, Tien-Yin Wong83,84,85, Gil Atzmon86,87,88, Nir Barzilai86,88, John Blangero89,90, Lori L Bonnycastle91, Donald W Bowden56,57,58, John C Chambers92,93,94, Edmund Chan49, Ching-Yu Cheng95, Yoon Shin Cho96, Francis S Collins91, Paul S de Vries20, Ravindranath Duggirala89,90, Benjamin Glaser97, Clicerio Gonzalez98, Ma Elena Gonzalez99, Leif Groop51,100, Jaspal Singh Kooner101, Soo Heon Kwak102, Markku Laakso41,42, Donna M Lehman25, Peter Nilsson103, Timothy D Spector104, E Shyong Tai49,50,84, Tiinamaija Tuomi100,105,106,107, Jaakko Tuomilehto108,109,110,111, James G Wilson112, Carlos A Aguilar-Salinas113, Erwin Bottinger61, Brian Burke27, David J Carey43, Juliana C N Chan71,72,73, Josée Dupuis34,47, Philippe Frossard114, Susan R Heckbert62,65, Mi Yeong Hwang32, Young Jin Kim32, H Lester Kirchner43, Jong-Young Lee115, Juyoung Lee32, Ruth J F Loos61,116, Ronald C W Ma71,72,73, Andrew D Morris117, Christopher J O'Donnell118,119,120,121, Colin N A Palmer122, James Pankow123, Kyong Soo Park101,124,125, Asif Rasheed114, Danish Saleheen80,114, Xueling Sim50, Kerrin S Small104, Yik Ying Teo50,126,127, Christopher Haiman128, Craig L Hanis129, Brian E Henderson128, Lorena Orozco19, Teresa Tusié-Luna113,130, Frederick E Dewey18, Aris Baras18, Christian Gieger131,132, Thomas Meitinger74,75,133, Konstantin Strauch131,134, Leslie Lange135, Niels Grarup136, Torben Hansen136,137, Oluf Pedersen136, Philip Zeitler138, Dana Dabelea139, Goncalo Abecasis6,8, Graeme I Bell29,30, Nancy J Cox140, Mark Seielstad141,142, Rob Sladek143,144,145, James B Meigs24,5,146, Steve S Rich147, Jerome I Rotter148,149,150, David Altshuler1,2,4,5,151,152,153, Noël P Burtt1,2, Laura J Scott6,8, Andrew P Morris9,154, Jose C Florez1,2,3,4,5, Mark I McCarthy9,10,155, Michael Boehnke6,8.
Abstract
Protein-coding genetic variants that strongly affect disease risk can yield relevant clues to disease pathogenesis. Here we report exome-sequencing analyses of 20,791 individuals with type 2 diabetes (T2D) and 24,440 non-diabetic control participants from 5 ancestries. We identify gene-level associations of rare variants (with minor allele frequencies of less than 0.5%) in 4 genes at exome-wide significance, including a series of more than 30 SLC30A8 alleles that conveys protection against T2D, and in 12 gene sets, including those corresponding to T2D drug targets (P = 6.1 × 10-3) and candidate genes from knockout mice (P = 5.2 × 10-3). Within our study, the strongest T2D gene-level signals for rare variants explain at most 25% of the heritability of the strongest common single-variant signals, and the gene-level effect sizes of the rare variants that we observed in established T2D drug targets will require 75,000-185,000 sequenced cases to achieve exome-wide significance. We propose a method to interpret these modest rare-variant associations and to incorporate these associations into future target or gene prioritization efforts.Entities:
Mesh:
Year: 2019 PMID: 31118516 PMCID: PMC6699738 DOI: 10.1038/s41586-019-1231-2
Source DB: PubMed Journal: Nature ISSN: 0028-0836 Impact factor: 49.962
Extended Data Fig. 1Power analysis.
The power to detect associations (using a two-sided test) at P < 5 × 10−8 for variants (or collections of variants) with a given minor allele frequency (x axis) and odds ratio (y axis) measured as the average across all ancestries. a, Cells are shaded according to the power of the current study of 20,791 T2D cases and 24,440 controls, with white indicating high power and red indicating low power. The 20%, 50%, 80% and 99% contour lines are labelled. b, Cells are shaded according to the difference in power between the current study and a previously published study of 12,940 individuals[10], with yellow–white indicating a large increase in power and red indicating a small increase in power. The 20%, 40%, 60% and 80% contour lines are labelled.
Extended Data Fig. 2Data quality control workflow.
A schematic of the steps involved in sample and variant quality control is shown. Quality control was conducted as described in the Methods to construct a set of samples and variants included in the association analysis. Each step is depicted as an arrow, with the number of samples or variants excluded by the step shown at the end of the arrow. The final set of samples and variants analysed are represented by the ‘Analysis’ dataset; we further excluded samples of high relatedness to other samples in the dataset from some but not all analyses. After each step that removed samples, we also removed newly monomorphic variants (hence the decrease in variants between the ‘Clean’ and ‘Analysis’ datasets).
Extended Data Fig. 3Single-variant association analysis workflow.
A schematic of the steps involved in single-variant exome-sequencing association analysis is shown, as described in the Methods. We began analysis with a division of samples in the ‘Analysis’ dataset (leftmost column) into 25 different subgroups (second column from the left) based on cohort, ancestry and sequencing technology. We then filtered variants according to metrics computed separately for each subgroup; we applied the filters listed in the ‘Basic filters’ box to all subgroups and for some subgroups we applied additional (more stringent) filters as indicated by boxes in the third column from the left. The resulting number of variants and samples advanced for analysis in each subgroup are indicated in the fourth column from the left. We analysed each subgroup with both the EMMAX test (to measure association strength) and the Firth test (to measure allelic odds ratios), each of which are two-sided; the number of principal components included as covariates in the Firth test is shown in the fifth column from the left. Finally, we combined each of the EMMAX and Firth subgroup-level results using a 25-group meta-analysis to produce the final P values and odds ratios reported for each variant. Multi, variant is multiallelic; CR, call rate; P, variant subgroup-level P value; P(Fisher), variant subgroup-level P value from Fisher’s exact test; P(miss), P value for subgroup-level variant differential missingness between T2D cases and controls; P(HWE), P value for deviation from subgroup-level Hardy–Weinberg equilibrium; Alt GQ, mean genotype quality of non-reference genotypes (across all samples); X Chrom, variant is on X chromosome.
Extended Data Fig. 4Calibration of single-variant analysis.
To assess whether our single-variant association statistics (two-sided, calculated by the EMMAX test) were well-calibrated, we computed quantile–quantile plots of associations across all samples (Overall) and within each ancestry (total n = 45,231 individuals). To avoid deflation of the quantile–quantile plot from rare variants (for which the expected P values are discrete rather than uniformly distributed), only variants with minor allele counts of 20 or greater (either overall or within the relevant ancestry) are shown. Variants were also LD-pruned before plotting, to avoid induced variance from correlated P values of these variants, using the ‘clump’ method implemented in PLINK. The λ values indicate genomic control, as measured by the ratio in observed median χ2 statistic to that expected under the null hypothesis. Red line, expectation of P values under the null distribution. Blue lines (and grey region), 95% confidence interval of expectations under the null distribution.
Fig. 1Exome-wide association analysis.
a, Single-variant associations were calculated using the two-sided EMMAX test. Red line, P = 4.3 × 10−7. b, Gene-level P values, corrected for four analyses performed using the two-sided minimum P-value test. The most-significant genes are labelled. Red line, P = 6.5 × 10−7. c, SLC30A8 gene-level P values (left y axis, black line), calculated using a two-sided burden test after removing variants (in order of increasing single-variant P value) from the 1/5 1% mask (the strongest signal for SLC30A8). Dashed line, P = 0.05. Right y axis (blue line), estimated effect size (log10(odds ratio)). Blue shading, 95% confidence interval. Dotted line, effect size = 0. Single-variant n = 45,231 individuals. Gene-level n = 43,074 unrelated individuals.
Most significant single-variant associations from exome-sequencing analysis
Most significant single-variant associations from exome-sequencing analysis
The most significant results from exome-sequencing single-variant association analyses are shown (n = 45,231 individuals). Gene, the closest gene to the variant. Variant, a unique identifier for the variant within our exome-sequencing analysis. Consequence, the predicted consequence of the variant, defined by sequence ontology annotation and produced using VEP. Impact, the effect of the variant, as predicted using VEP (Med, medium; Mod, modifier). Change, the predicted protein change, defined according to the ‘best guess’ transcript as described in the Methods. MAFs are calculated as the maximum across all ancestries. Case, the number of samples with T2D carrying the variant. Ctrl, the number of samples without T2D carrying the variant. OR, the odds ratio, calculated from the Firth analysis. P, the P value, calculated from the (two-sided) EMMAX analysis. Ref P: the P value of the variant in one of three previous GWAS or exome array analyses, referenced in the Ref column[6,13,79].
Extended Data Fig. 5Gene-level association analysis workflow.
A schematic of the steps involved in gene-level exome-sequencing association analysis, as described in Methods, is shown. We began analysis with subgroup-level genotype filtering (second column from the left) of unrelated samples in the ‘Analysis’ dataset (leftmost column); we then applied genotype filters for each subgroup (filtering genotypes for either all or no samples in each subgroup), similar to those used in subgroup-level single-variant analyses. We then annotated each non-reference variant allele with 16 different bioinformatics algorithms to assess allele deleteriousness, and we grouped alleles into one of seven nested masks (third column from the left; the number of variants and weights shown correspond to alleles absent from ‘higher’, or more stringent, nested masks). We computed burden and SKAT analyses (both of which are two-sided) using one of two approaches to combine alleles across masks (Methods): first, by analysing all alleles at once with weights assigned according to the most stringent mask containing the allele (weighted test); and second, by analysing each mask independently and then calculating the lowest P value corrected for the effective number of tests (minimum P-value test). Multi, variant is multiallelic; CR, call rate; P(miss), P value for subgroup-level variant differential missingness between T2D cases and controls; P(HWE), P value for deviation from subgroup-level Hardy–Weinberg equilibrium; Alt GQ, mean genotype quality of non-reference genotypes (across all samples).
Extended Data Fig. 6Calibration of gene-level association analyses.
For both the burden and SKAT tests, we tested for gene-level association within seven different allelic masks. As this produced seven P values for each test, we developed two means to consolidate these results (Methods). a, b, The quantile–quantile plots of associations are shown for the minimum P-value burden test (a) and the weighted burden test (b). Each test is two-sided and consists of n = 43,071 unrelated individuals. Only genes with combined minor allele count of 20 or greater are shown in the quantile–quantile plots, to avoid deflation from genes with too few variants to produce P values asymptotically uniform under the null distribution. The λ values indicate genomic control, as measured by the ratio in observed median χ2 statistic to that expected under the null hypothesis. The three genes with exome-wide significant associations are labelled. Red line, expectation of P values under the null distribution. Blue lines (and grey region), 95% confidence interval of expectations under the null distribution.
Most significant gene-level associations from exome-sequencing analysis
Most significant gene-level associations from exome-sequencing analysis
The most significant results from gene-level analyses are shown (n = 43,071 unrelated individuals). Genes are ranked according the (two-sided) minimum P value achieved across the four gene-level analyses. Gene, a unique identifier for the gene within our exome-sequencing analysis. Var (CAF), the number of alleles (combined allele frequency (CAF) of variants) in the mask achieving the strongest association across the four tests (that is, the 0/5 1% mask for the weighted test, or the mask with the minimum P value for the minimum P-value test). Burden, results from the (two-sided) burden analysis. SKAT, results from the (two-sided) SKAT analysis. Min P, results from the (two-sided) minimum P-value analysis. Weighted, results from the (two-sided) weighted analysis. OR, the odds ratio as estimated from the burden analysis. P, the (two-sided) P value for the indicated analyses.
Fig. 2Gene set analysis.
a–e, Rank percentiles (1 = highest) for gene-level associations (compared to matched genes) within 11 monogenic diabetes genes (548 matched genes) (a), 8 T2D drug targets (400 matched genes) (b), 31 genes linked to NIDD in mice (1,499 matched genes) (c), 323 genes linked to impaired glucose tolerance in mice (10,043 matched genes) (d) and 11 genes with common likely causal coding variants (537 matched genes) (e). P values are from a one-sided Wilcoxon rank-sum test between each gene set and comparison set. Labels indicate minimum, 25th percentile, median, 75th percentile and maximum. f, Estimated odds ratios, from the weighted burden test of the 5/5 mask, for 8 T2D drug targets (red, agonists; blue, inhibitors). Data are mean ± s.e. of log(odds ratio) from the burden test. n = 43,071 unrelated individuals.
Fig. 3Comparison of exome-sequencing to array-based GWAS.
a, Single-variant associations, calculated by two-sided Firth logistic regression, from an array-based imputed GWAS of the subset (n = 34,529) of samples in the exome-sequencing analysis for which array data were available. Labels and axes as described in Fig. 1a. b, The observed liability variance explained (LVE) by the top 19 exome-sequencing gene-level associations (red) and the top 19 imputed GWAS single-variant associations (maximum of 1 per 250 kb; blue) and their ratio (black). Signals ranked by liability variance explained. c, Gene rank percentiles from exome-sequencing gene-level analysis (x axis) and a previous multi-ancestry T2D GWAS[13] (y axis). Genes are from the mouse NIDD gene set (Fig. 2c).
Fig. 4Decision support from exome-sequencing data.
a, Estimated power, as a function of future sample size (x axis), to detect T2D gene-level associations (two-sided test at P = 6.25 × 10−7) with aggregate frequency and odds ratios equal to those estimated from our analysis of 8 T2D drug targets (Fig. 2f). b, A proposed workflow for using exome-sequencing data in gene characterization. Depending on the prior belief in the disease relevance of a gene, the cost of experimental characterization and the benefit of gene validation, a decision to conduct the experiment could be informed by the posterior probability of the disease relevance of the gene, as estimated from exome-sequencing association statistics (available through http://www.type2diabetesgenetics.org/).
Extended Data Fig. 7PPA calculation workflow.
a, We estimated the PPAs for nonsynonymous variants in our sequence analysis based on concordance with independent exome array data and previously published[6,78,80] estimates of the fraction of causal coding associations (Methods). b, PPA estimates for nonsynonymous variants within T2D GWAS loci are shown as a function of P value (right y axis, black line; 95% confidence interval, grey shading) together with the total number of such variants (left y axis, red line). c, For variants outside of T2D GWAS loci, we developed a method to further compute Bayes factors, which measure the odds of true and causal association, as a function of P value, using a model of the prior odds of true and causal association for variants in GWAS loci (Methods). d, These Bayes factors can be combined with a subjective prior belief in the T2D relevance of a gene (y axis) to produce the estimated posterior probability of true and causal association for any nonsynonymous variant in the exome-sequence dataset based on its observed log10(P) (x axis). Posterior estimates are shaded proportional to value (red, low; white, high). Values are shown for the default modelling assumptions of 33% of missense variants that caused gene inactivation and 30% of true missense associations that represented the causal variant.
Extended Data Fig. 8Estimated posterior probability of associations for different prior hypotheses.
We estimated the posterior probability of association for nonsynonymous variants that met various single-variant P-value thresholds (two-sided EMMAX test, n = 45,231 individuals) in our analysis, as described in the Methods and shown in Extended Data Fig. 7. To perform the needed calculations, we assumed that, on average, 1.1 genes that are found within each T2D GWAS locus are relevant to T2D and 33% of missense mutations within these genes cause gene loss-of-function. a–f, To assess the sensitivity of our analysis to these assumptions, we repeated the calculations with different assumptions of 0.5 (a), 2.0 (b), 0.25 (c) and 0.1 (d) T2D-relevant genes within each GWAS locus, as well as 25% (e) and 40% (f) of missense variants leading to loss-of-function. All analyses assume the default modelling parameters that 30% of true nonsynonymous associations are causal associations; different values for this parameter would scale posterior probability estimates linearly.