Insights into the genetic epidemiology of Crohn's and rare diseases in the Ashkenazi Jewish population.

Manuel A Rivas1,2, Brandon E Avila1,3, Jukka Koskela1,3,4, Hailiang Huang1,3, Christine Stevens1, Matti Pirinen4,5, Talin Haritunians6, Benjamin M Neale1,3, Mitja Kurki1,3, Andrea Ganna1,3, Daniel Graham1, Benjamin Glaser7, Inga Peter8, Gil Atzmon9,10, Nir Barzilai9, Adam P Levine11, Elena Schiff11, Nikolas Pontikos11,12, Ben Weisburd1,3, Monkol Lek1,3, Konrad J Karczewski1,3, Jonathan Bloom1,3, Eric V Minikel1,3, Britt-Sabina Petersen13, Laurent Beaugerie14, Philippe Seksik14, Jacques Cosnes14, Stefan Schreiber15, Bernd Bokemeyer16, Johannes Bethge15, Graham Heap17, Tariq Ahmad18, Vincent Plagnol12, Anthony W Segal11, Stephan Targan6, Dan Turner19, Paivi Saavalainen20, Martti Farkkila21, Kimmo Kontula22, Aarno Palotie1,4,23, Steven R Brant24,25, Richard H Duerr26,27, Mark S Silverberg28, John D Rioux29,30, Rinse K Weersma31, Andre Franke13, Luke Jostins32, Carl A Anderson33, Jeffrey C Barrett33, Daniel G MacArthur1,3, Chaim Jalas34, Harry Sokol14, Ramnik J Xavier1,35, Ann Pulver36, Judy H Cho37, Dermot P B McGovern6, Mark J Daly1,3,4.   

Abstract

As part of a broader collaborative network of exome sequencing studies, we developed a jointly called data set of 5,685 Ashkenazi Jewish exomes. We make publicly available a resource of site and allele frequencies, which should serve as a reference for medical genetics in the Ashkenazim (hosted in part at https://ibd.broadinstitute.org, also available in gnomAD at http://gnomad.broadinstitute.org). We estimate that 34% of protein-coding alleles present in the Ashkenazi Jewish population at frequencies greater than 0.2% are significantly more frequent (mean 15-fold) than their maximum frequency observed in other reference populations. Arising via a well-described founder effect approximately 30 generations ago, this catalog of enriched alleles can contribute to differences in genetic risk and overall prevalence of diseases between populations. As validation we document 148 AJ enriched protein-altering alleles that overlap with "pathogenic" ClinVar alleles (table available at https://github.com/macarthur-lab/clinvar/blob/master/output/clinvar.tsv), including those that account for 10-100 fold differences in prevalence between AJ and non-AJ populations of some rare diseases, especially recessive conditions, including Gaucher disease (GBA, p.Asn409Ser, 8-fold enrichment); Canavan disease (ASPA, p.Glu285Ala, 12-fold enrichment); and Tay-Sachs disease (HEXA, c.1421+1G>C, 27-fold enrichment; p.Tyr427IlefsTer5, 12-fold enrichment). We next sought to use this catalog, of well-established relevance to Mendelian disease, to explore Crohn's disease, a common disease with an estimated two to four-fold excess prevalence in AJ. 
We specifically evaluate whether strong-acting rare alleles, particularly protein-truncating or otherwise large effect-size alleles, enriched by the same founder effect, contribute excess genetic risk to Crohn's disease in AJ. We find that ten rare genetic risk factors in NOD2 and LRRK2 that are enriched in AJ (p < 0.005), including several novel contributing alleles, show evidence of association with CD. Independently, we find that genome-wide common variant risk defined by GWAS shows a strong difference between AJ and non-AJ European control population samples (0.97 s.d. higher, p < 10^-16). Taken together, the results suggest coordinated selection in the AJ population for higher CD risk alleles in general. The results and approach illustrate the value of exome sequencing data in case-control studies, along with reference data sets like ExAC (sites VCF available via FTP at ftp.broadinstitute.org/pub/ExAC_release/release0.3/), to pinpoint genetic variation that contributes to variable disease predisposition across populations.

Year:  2018        PMID: 29795570      PMCID: PMC5967709          DOI: 10.1371/journal.pgen.1007329

Source DB:  PubMed          Journal:  PLoS Genet        ISSN: 1553-7390            Impact factor:   5.917


Introduction

Genetic population isolates like the Ashkenazim, Jews who trace their ancestry to eleventh-century central European Jewish groups[1], have previously facilitated the mapping of alleles contributing to human disease predisposition[2-5]. The documented 2–4 fold enrichment of Crohn’s Disease (CD) prevalence in the Ashkenazi Jewish (AJ) population[6,7] motivated the use of exome sequencing and genome-wide array data to evaluate the degree to which bottleneck-enriched protein-altering alleles and unequivocally implicated common variants contribute an excess CD genetic risk to AJ[6]. Despite the progress in mapping genes and alleles for rare diseases with increased prevalence in the AJ population, precise estimates of the risk-allele frequency and the carrier rate in the AJ population have not yet been determined[8]. Through this study, we provide a frequency resource of protein-coding alleles from over 2,000 non-CD AJ individuals with low admixture that will serve to improve interpretation of rare disease risk alleles in the AJ population and which we employ to discover new Crohn’s risk alleles by comparison to 1,855 AJ Crohn’s cases.

Results

We generated a jointly called whole-exome sequence dataset consisting of 18,745 individuals from international Inflammatory Bowel Disease (IBD) and non-IBD cohorts[9,10] (S1 Fig). Given the increased prevalence of Crohn’s disease in the AJ population, our global sequencing efforts had specifically included 5,652 individuals self-reporting as Jewish. As we aimed to focus on variation observed in the AJ population in comparison to reference populations in ExAC[9,11] (including non-Finnish European (NFE), Latino (AMR), and African/African-American (AFR) populations), we chose a model-based approach to estimate the ancestry of the study population using ADMIXTURE[12]. To identify AJ individuals and estimate admixture fractions we used a set (n = 21,066) of LD-pruned common variants (MAF > 1%, see Supplementary Note for additional details) filtered for genotype quality (GQ > 20). The 18,745 individuals were assigned to four groups (K = 4) using ADMIXTURE (further described in Supplementary Note, also see S3 Fig). One group of 5,685 individuals consisted mostly (84%) of self-reported AJ individuals, and 3,522 of these individuals had a high ancestry fraction (> 0.9) mapping to this group (S2 Fig, S1 Table). Thus, many self-reported AJ individuals were not included, as their ascertained AJ ancestry fraction was not high enough. As we were interested in computing an enrichment statistic that would not be affected by possible admixture, we obtained alternate (non-reference) allele frequency estimates by restricting the enrichment analysis to the 2,178 non-IBD Ashkenazi Jewish individuals that passed QC and relatedness filtering and had an AJ ancestry fraction (genotype ancestry grouping closely with other AJ individuals) of > 0.9.
Our study includes exomes from throughout Europe and Israel, but the vast majority (86%) of these high ancestry fraction AJ individuals were collected in major US cities including Los Angeles, Boston, Baltimore, and New York (S2 Table). To explore AJ exome population genetics, including the proportion of enriched alleles and the degree of enrichment, we used the observed alternate allele counts and total number of alleles available from the ExAC release 0.3 dataset [n_total = 60,706; NFE (n = 31,902; after excluding AJ individuals from ExAC), AFR (n = 5,203), and AMR (n = 5,789)]. We focused on protein-coding alleles with estimated allele frequency of at least 0.002 and less than 0.1 in AJ (n_alleles = 73,228; practical cutoff of what could be statistically defined as convincing enrichment, see S4 Fig), and applied a one-sided Fisher’s exact test on allele counts (see Supplementary Note) to classify the observed alleles into two groups: “enriched” or “not enriched”. This analysis identified 34% of protein-coding alleles as significantly enriched, with mean 15-fold increased odds of the alternate allele compared to other populations. Different proportions of alleles belong to the enriched group depending on variant annotation: 36% for predicted protein-truncating variants (PTV); 38% for predicted protein-altering variants (PRA); and 31% for synonymous variants. The substantially higher PTV+PRA:synonymous ratio observed in the enriched category is consistent with those alleles being drawn randomly from a large pool of much rarer alleles (where the functional:synonymous ratio is higher[3]) and abruptly boosted in frequency (Fig 1, p < 10^-16 across comparisons of PTV and PRA to synonymous variants, two-proportion test, Supplementary Note).
Since much rarer alleles have a higher probability of being damaging (e.g., they have a higher missense/synonymous ratio), the advantage to gene mapping arises from the fact that enriched alleles of a certain frequency are more damaging/deleterious on average than non-enriched alleles of the same frequency.
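The enrichment test described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the allele counts are hypothetical values back-calculated from the frequencies reported for HEXA p.Tyr427IlefsTer5 (assuming 2 x 2,178 AJ chromosomes and 2 x 31,902 NFE chromosomes with complete genotyping), and the "bias corrected" odds ratio is assumed here to be the common add-0.5 (Haldane-Anscombe) correction, which the text does not specify.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_one_sided(alt_aj, ref_aj, alt_other, ref_other):
    """One-sided Fisher's exact p-value: P(X >= alt_aj) under the
    hypergeometric null that both populations share one allele frequency."""
    n_aj = alt_aj + ref_aj              # chromosomes sampled in AJ
    n_alt = alt_aj + alt_other          # alternate alleles overall
    n_total = n_aj + alt_other + ref_other
    log_denom = log_comb(n_total, n_aj)
    log_terms = [
        log_comb(n_alt, k) + log_comb(n_total - n_alt, n_aj - k) - log_denom
        for k in range(alt_aj, min(n_alt, n_aj) + 1)
    ]
    peak = max(log_terms)               # log-sum-exp for numerical stability
    return min(1.0, math.exp(peak) * sum(math.exp(t - peak) for t in log_terms))


def corrected_odds_ratio(alt_aj, ref_aj, alt_other, ref_other):
    """Odds ratio with 0.5 added to each cell (assumed bias correction)."""
    return ((alt_aj + 0.5) * (ref_other + 0.5)) / ((ref_aj + 0.5) * (alt_other + 0.5))


# Hypothetical counts: about 53 of 4,356 AJ alleles vs about 41 of 63,804 NFE alleles.
counts = (53, 4356 - 53, 41, 63804 - 41)
print("one-sided p:", fisher_one_sided(*counts))
print("corrected OR:", round(corrected_odds_ratio(*counts), 2))
```

Working in log space keeps the hypergeometric terms finite even with tens of thousands of chromosomes; in practice a library routine for Fisher's exact test would serve the same purpose.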
Fig 1

Enrichment of alleles discovered in AJ exome sequencing project.

A) Histogram of estimated log enrichment statistic, defined as the log of the bias corrected odds ratio comparing the allele frequency in AJ population to the maximum allele frequency estimated from NFE, AFR, and AMR populations in ExAC. For each histogram bin we show a bar plot of the expected number of alleles belonging to the two groups we analyzed: 1) enriched (green) and 2) not enriched (white). B) Bar plots of estimated percentage of alleles belonging to the two groups we analyzed for all protein-coding (ALL), synonymous (SYN), protein-altering (PRA), and protein-truncating variants (PTV). An estimate of 34% of protein-coding alleles observed in AJ have a mean shift of 15-fold increased odds of the alternate allele compared to other reference populations. This observation is supported by the property that compared to intergenic variants, coding variants tend to be younger for a given frequency and the more pathogenic a variant, the younger it is, therefore tending to be population specific[13].

Additionally, we may expect that a “depleted” set of alleles arises from the founder effect, but in reality, many of these already rare variants are simply eliminated during the bottleneck. Of course, it is more difficult and less interesting to search for depleted alleles, as their absence provides no opportunity to obtain significant statistics on population enrichment or disease association. We intersected the list of protein-coding alleles identified in the AJ exome sequencing study with alleles reported to be pathogenic with no conflicting evidence (n = 42,226) in ClinVar[14] resulting in 148 alleles found both in ClinVar and with p-value less than .005 of belonging to the AJ enriched set (S1 Data File). In OMIM, 48 of the 148 alleles included documentation of a disease subject with AJ ancestry (Table 1). 
This set of enriched alleles includes all of the major AJ mutations for 8 diseases described in the American College of Medical Genetics and Genomics 2008 screening guideline study[15]. In the setting of autosomal recessive disorders these differences in population allele frequencies may contribute a factor proportional to the squared enrichment difference to genetic risk and prevalence between populations (see Supplementary Note). For instance, a 19-fold enriched frameshift indel, p.Tyr427IlefsTer5, in HEXA, contributes a 361-fold enrichment in genetic risk in AJ to non-AJ population to Tay-Sachs disease. Enrichment in this large adult Ashkenazi exome database reinforces recent publications of founder mutations for rare pediatric disorders including FKTN (Walker Warburg syndrome)[16], CCDC65 (Primary ciliary dyskinesia)[17], TMEM216 (Joubert syndrome)[18], C11orf73 (Leukoencephalopathy)[19]; PEX2 (Zellweger syndrome)[20], VPS11 (Hypomyelination and developmental delay)[21] and BBS2 (Bardet-Biedl syndrome)[22]. While many alleles on this pathogenic list may demonstrate incomplete penetrance (as in the case of p.V726A in MEFV[23] for Familial Mediterranean fever) and some may not show recessive inheritance, this resource should provide considerable assistance in gene discovery and clinical genetic screening in AJ (S2 Data File).
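The squared relationship for recessive disorders follows from Hardy-Weinberg proportions: if an allele has frequency q, affected homozygotes occur at about q^2, so an r-fold difference in q gives an r^2-fold difference in expected homozygote frequency. The sketch below works through this for the HEXA frameshift allele using the frequencies listed in Table 1; it assumes random mating and considers a single allele in isolation (compound heterozygotes are ignored), so it is an illustration of the arithmetic rather than a prevalence estimate.

```python
def homozygote_frequency(q):
    """Expected frequency of homozygotes for an allele of frequency q
    under Hardy-Weinberg proportions."""
    return q * q


def recessive_risk_ratio(q_aj, q_other):
    """Ratio of expected homozygote frequencies between two populations."""
    return homozygote_frequency(q_aj) / homozygote_frequency(q_other)


# HEXA p.Tyr427IlefsTer5 frequencies as listed in Table 1.
q_aj, q_max_exac = 0.0122, 0.00064
print("allele-frequency ratio:", round(q_aj / q_max_exac, 2))
print("homozygote-frequency ratio:", round(recessive_risk_ratio(q_aj, q_max_exac), 1))

# The 19-fold enrichment quoted in the text corresponds to 19 ** 2 = 361.
print("19-fold squared:", 19 ** 2)
```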
Table 1

Forty-eight ClinVar “pathogenic” alleles enriched in AJ.

HGVS and Gene is the allele nomenclature in ClinVar and gene symbol, respectively. Enrichment odds ratio corresponds to the bias corrected comparison of allele frequency in AJ (AJ AF) to maximum frequency among three population groups (max EXAC AF): 1) NFE; 2) AMR; and 3) AFR. Curated trait is based on the trait description in the Online Mendelian Inheritance in Man (OMIM) and is independent of effect size as a Crohn’s risk allele. Inheritance corresponds to the inheritance description in OMIM (AR: autosomal recessive, AD: autosomal dominant, risk factor: not specified genetic risk factor). Alleles are sorted in decreasing order by AJ AF.

| Variant | HGVS | Gene | Enrichment Odds Ratio | AJ AF | Max ExAC AF | Curated Traits | Inheritance |
|---|---|---|---|---|---|---|---|
| 16:3293310:A:G | p.Val726Ala | MEFV | 26.08 | 0.0416 | 0.0017 | Familial Mediterranean fever | AR |
| 5:150723155:C:A | p.Gly87Val | SLC36A2 | 3.51 | 0.0414 | 0.0122 | Hyperglycinuria | AD |
| 1:155205634:T:C | p.Asn409Ser | GBA | 11.16 | 0.0296 | 0.0027 | Susceptibility to Lewy body dementia, Gaucher’s disease, Susceptibility to late onset Parkinson’s disease | AR |
| 4:187201412:T:C | p.Phe301Leu | F11 | 47.17 | 0.0273 | 0.0006 | Hereditary factor XI deficiency | AR |
| 13:20763553:CA:C | p.Leu56Argfs | GJB2 | 39.19 | 0.0199 | 0.0005 | Autosomal recessive deafness | AR |
| 4:187195347:G:T | p.Glu135Ter | F11 | 28.20 | 0.0195 | 0.0007 | Factor XI deficiency | AR |
| 12:14421038:G:A | p.Arg49Cys | PRB3 | 16.12 | 0.0189 | 0.0012 | Salivary peroxidase | AR |
| 9:111662096:A:G | c.2204+6T>C | IKBKAP | 45.22 | 0.0168 | 0.0004 | Familial dysautonomia | AR |
| 15:72638920:G:GGATA | p.Tyr427IlefsTer5 | HEXA | 19.14 | 0.0122 | 0.00064 | Tay-Sachs disease | AR |
| 1:125848678:C:T | p.Arg4192His | USH2A | 13.63 | 0.0106 | 0.0008 | Retinitis pigmentosa | AR |
| 22:29091207:G:A | p.Ser428Phe | CHEK2 | 50.06 | 0.0103 | 0.0002 | Hereditary cancer, multiple types | Risk factor |
| 10:99371368:TGAG:T | p.Glu315del | HOGA1 | 29.28 | 0.0101 | 0.0003 | Primary hyperoxaluria | AR |
| 7:117282620:G:A | p.Trp1282Ter | CFTR | 23.64 | 0.0085 | 0.0004 | Cystic fibrosis | AR |
| 11:17418602:C:T | c.3992-9G>A | ABCC8 | 40.62 | 0.0076 | 0.0002 | Hyperinsulinemic hypoglycemia | AR, AD |
| 17:3402294:A:C | p.Glu285Ala | ASPA | 40.36 | 0.0076 | 0.0002 | Canavan disease | AR |
| 2:98986540:G:A | c.101+1G>A | CNGA3 | 26.11 | 0.0074 | 0.0003 | Achromatopsia | AR |
| 13:32914437:GT:G | p.Ser1982Argfs | BRCA2 | 27.57 | 0.0069 | 0.0003 | Hereditary cancer, multiple types | Risk factor |
| 9:97934315:T:A | c.456+4A>T | FANCC | 42.75 | 0.0069 | 0.0002 | Fanconi anemia | AR |
| 9:108382330:G:GA | p.Phe390Ilefs | FKTN | 32.62 | 0.0067 | 0.0002 | Limb-girdle muscular dystrophy-dystroglycanopathy | AR |
| 12:40734202:G:A | p.Gly2019Ser | LRRK2 | 20.64 | 0.0064 | 0.0003 | Parkinson’s disease | Risk factor |
| 17:41055964:C:T | p.Arg83Cys | G6PC | 11.04 | 0.0062 | 0.0006 | Glycogen storage disease | AR |
| 1:26764719:A:G | p.Lys42Glu | DHDDS | 64.83 | 0.0051 | 0.0001 | Retinitis pigmentosa | AR |
| 3:150690352:A:C | p.Asn48Lys | CLRN1 | 46.26 | 0.0051 | 0.0001 | Usher syndrome | AR |
| 12:49312533:GTA:G | p.Ile293Profs | CCDC65 | 25.75 | 0.0048 | 0.0002 | Ciliary dyskinesia without situs inversus | AR |
| 6:80878662:G:C | p.Arg183Pro | BCKDHB | 29.42 | 0.0046 | 0.0002 | Maple syrup disease | AR |
| 10:56077147:G:A | p.Arg245Ter | PCDH15 | 26.58 | 0.0046 | 0.0002 | Usher syndrome | AR |
| 7:107555951:G:T | p.Gly229Cys | DLD | 26.55 | 0.0046 | 0.0002 | Maple syrup disease | AR |
| 15:72638575:C:G | c.1421+1G>C | HEXA | 52.65 | 0.0044 | 0.0001 | Tay-Sachs disease | AR |
| 15:72105913:G:A | p.Arg311Gln | NR2E3 | 9.86 | 0.0042 | 0.0004 | Enhanced s-cone syndrome | AR |
| 5:178699927:G:A | p.Gln225Ter | ADAMTS2 | 129.41 | 0.0041 | 0.0000 | Ehlers-Danlos syndrome, dermatosparaxis type | AR |
| 16:50745656:G:A | p.Ala612Thr | NOD2 | 12.48 | 0.0039 | 0.0003 | Early-onset sarcoidosis | Risk factor |
| 11:6415434:G:T | p.Arg498Leu | SMPD1 | 41.53 | 0.0039 | 0.0001 | Niemann-Pick disease | AR |
| 11:61161437:G:T | p.Arg73Leu | TMEM216 | 27.77 | 0.0039 | 0.0001 | Joubert syndrome | AR |
| 1:53676583:CAG:C | p.Lys414ThrfsTer7 | CPT2 | 78.34 | 0.0037 | 0.0000 | Carnitine palmitoyltransferase II deficiency | AR |
| 1:53676688:T:C | p.Phe448Leu | CPT2 | 78.35 | 0.0037 | 0.0000 | Carnitine palmitoyltransferase II deficiency | AR |
| 3:172737276:C:T | p.Arg283Gln | SPATA16 | 9.79 | 0.0037 | 0.0004 | Spermatogenic failure | AR |
| 11:86017416:G:C | p.Val54Leu | C11orf73 | 47.03 | 0.0037 | 0.0001 | Hypomyelinating leukodystrophy | AR |
| 8:77896070:G:A | p.Arg119Ter | PEX2 | 20.03 | 0.0034 | 0.0002 | Peroxisome biogenesis disorder | AR |
| 11:118951899:T:G | p.Cys845Gly | VPS11 | 190.98 | 0.0030 | 0.0000 | Hypomyelinating leukodystrophy | AR |
| 6:80203353:G:A | p.Gln279Ter | LCA5 | 29.25 | 0.0028 | 0.0001 | Leber congenital amaurosis | AR |
| 19:7591645:A:G | c.406-2A>G | MCOLN1 | 21.93 | 0.0028 | 0.0001 | Mucolipidosis | AR |
| 16:56530894:C:G | p.Arg632Pro | BBS2 | 29.37 | 0.0028 | 0.0001 | Retinitis pigmentosa | AR |
| 17:41276044:ACT:A | p.Glu23Valfs | BRCA1 | 10.04 | 0.0025 | 0.0003 | Hereditary cancer, multiple types | Risk factor |
| 4:100543913:G:T | p.Gly865Ter | MTTP | 40.38 | 0.0025 | 0.0001 | Abetalipoproteinaemia | AR |
| 2:99013302:G:A | p.Gly557Arg | CNGA3 | 29.36 | 0.0023 | 0.0001 | Achromatopsia | AR |
| 7:107557794:G:A | p.Glu375Lys | DLD | 26.44 | 0.0021 | 0.0001 | Maple syrup disease | AR |
| 17:41209079:T:TG | p.Gln1756Profs | BRCA1 | 8.80 | 0.0021 | 0.0002 | Hereditary cancer, multiple types | Risk factor |
| 10:99371292:G:T | p.Gly287Val | HOGA1 | 22.01 | 0.0021 | 0.0001 | Primary hyperoxaluria | AR |

To assess whether AJ-enriched protein-coding alleles also contribute to the established difference in CD genetic risk we performed case-control association analyses. Since individuals with only partial AJ ancestry will still carry bottleneck-enriched alleles, here we included samples with estimated AJ ancestry fraction > 0.4 (Supplementary Note, S2 Fig), resulting in a dataset of 4,899 AJ samples (1,855 Crohn’s disease and 3,044 non-IBD). To improve ability to detect a true association, we performed a meta-analysis with CD and non-IBD case-control exome sequencing data from two additional ancestry groups: 1) non-Finnish European (NFE) (2,296 CD and 2,770 non-IBD); and 2) Finnish (FIN) (210 CD and 9,930 non-IBD samples) from a separate callset described in a previous publication[24] for a total of 4,361 CD samples and 15,744 non-IBD samples. By calling additional non-AJ samples, we hoped to discern which of the AJ-enriched alleles contributed a significant risk factor across all populations. The meta-analysis performed across several populations described should mitigate biases by confirming consistency in effect size across these population groups. 
Study-specific association analysis was performed with Firth bias-corrected logistic regression[25,26] and four principal components as covariates using the software package EPACTS[27] (S5 Fig). We combined association statistics in a meta-analysis framework using the Bayesian models in Band et al.[28]. We used the correlated effects model, obtained a Bayes factor (BF) by comparing it with the null model where all the prior weight is on an effect size of zero, reported a p-value approximation using the BF as a test statistic, and assessed whether heterogeneity of effects exists across studies for downstream QC (see Supplementary Note). We separately assessed CD associations of enriched protein-altering (PRA) and synonymous (SYN) alleles in protein-coding genes in CD implicated GWAS loci (n_gwas,pra = 351; n_gwas,syn = 167), and outside implicated GWAS loci (n_non-gwas,pra = 12,529; n_non-gwas,syn = 6,202, Fig 2). See Materials and methods for a description of these loci.
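The Bayes factor comparison in the meta-analysis above can be illustrated with per-study summary statistics. The sketch below assumes the usual normal approximation: the vector of study-specific log odds ratios is distributed N(0, V) under the null and N(0, V + S) under a correlated-effects prior, where V is the diagonal matrix of squared standard errors and S has variance sigma^2 on the diagonal and rho * sigma^2 off the diagonal. The values of sigma and rho and the effect estimates are hypothetical placeholders, not values taken from the study, and the p-value approximation step is not shown.

```python
import math


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def log_mvn_zero_mean(x, cov):
    """Log density of a zero-mean multivariate normal N(0, cov) at x."""
    lower = cholesky(cov)
    y = []  # forward substitution: solve lower * y = x
    for i in range(len(x)):
        y.append((x[i] - sum(lower[i][k] * y[k] for k in range(i))) / lower[i][i])
    log_det = 2.0 * sum(math.log(lower[i][i]) for i in range(len(x)))
    quad = sum(v * v for v in y)
    return -0.5 * (len(x) * math.log(2.0 * math.pi) + log_det + quad)


def log_bayes_factor(betas, ses, sigma=0.2, rho=0.96):
    """log BF of the correlated-effects model against the null of zero effect.
    sigma and rho are illustrative prior settings (assumptions)."""
    n = len(betas)
    null_cov = [[ses[i] ** 2 if i == j else 0.0 for j in range(n)] for i in range(n)]
    alt_cov = [
        [null_cov[i][j] + sigma ** 2 * (1.0 if i == j else rho) for j in range(n)]
        for i in range(n)
    ]
    return log_mvn_zero_mean(betas, alt_cov) - log_mvn_zero_mean(betas, null_cov)


# Hypothetical log odds ratios and standard errors for AJ, NFE and FIN.
print(log_bayes_factor([0.60, 0.55, 0.70], [0.10, 0.12, 0.30]))
```

With a single study this reduces to the familiar approximate Bayes factor for one effect estimate; the correlation parameter is what lets consistent effects across populations reinforce one another.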
Fig 2

Q-Q plots of enriched alleles.

Q-Q plots of Crohn’s disease association for AJ enriched A) protein-altering (protein-truncating and missense) and B) synonymous alleles in GWAS regions; and AJ enriched C) protein-altering and D) synonymous alleles outside of GWAS regions. For each Q-Q plot variants with a corresponding p-value less than or equal to a threshold where expected number of false discoveries is equal to one are annotated. The black dashed line is y = x, and the grey shapes show 95% confidence interval under the null.

We identified ten AJ enriched CD risk alleles (p < 0.005): the previously published risk haplotypes in LRRK2 and NOD2 (LRRK2: p.N2081D; NOD2: p.N852S, p.G908R, p.M863V+p.fs1007insC)[29,30], in addition to newly implicated alleles (NOD2: p.A612T, p = 2.8 × 10^-9; c.74-7T>A, p = 1.4 × 10^-4; p.L248R, p = 6.4 × 10^-4; p.D357A, p = 0.0011; LRRK2: p.G2019S, p = 0.0014, a Parkinson’s disease risk allele[31]). To assess whether the new NOD2 enriched alleles are conditionally independent of the previously established associated NOD2 alleles we performed conditional haplotype association analysis in PLINK and Bayesian model averaging[32] for variable selection, both of which suggested independent effects for all alleles (S6 Fig, S3 Table). Deviation from additivity can contribute additionally to individual risk but has been difficult to document in complex disease associations with modest ORs. Despite the functional relationship between LRRK2 and NOD2[33], we do not observe deviation from additivity between LRRK2 and NOD2 (p = 0.273); that is, the effect of mutations in both LRRK2 and NOD2 is no greater than the sum of their individual effects. We assessed whether composite risk carriers (carrier of more than one variant allele) had evidence of deviation from additivity. Deviation from additivity has been reported for p.fs1007insC, p.G908R, and p.R702W in NOD2[34,35]. 
In our AJ exome sequencing data set we estimate a 1-hit effect equal to 1.82 (95% confidence interval [1.59, 2.07]) and a 2-hit effect equal to 8.24 (95% confidence interval [6.06, 11.21]; we found similar evidence for departure from additivity when restricting the analysis to the newly reported alleles only: p = 0.00357, odds ratio = 7.53). We confirmed this finding using the larger non-AJ Crohn’s disease ImmunoChip dataset to provide a more precise estimate of the 1-hit effect (OR = 2.17; 95% confidence interval [2.07, 2.27], S4 Table) and the non-additive 2-hit effects in NOD2 (OR = 9.93; 95% confidence interval [8.88, 11.13], S5 Table). We found no evidence of deviation from additivity for the associated protein-altering alleles in LRRK2 (p = 0.418).
Given that enriched genetic variants in NOD2 and LRRK2 contribute to differences in CD risk in AJ population, we next asked whether unequivocally established common variant associations contribute to differences in CD genetic risk. We performed polygenic risk score (PRS) analysis using reported effect size estimates from 124 CD alleles including those reported in a previously published study[36] and four variants in IL23R from a recent fine-mapping study[37], and excluding variants in NOD2 and LRRK2. We observed an elevated PRS for AJ compared to non-Jewish controls (0.97 s.d. higher, p < 10^-16; Fig 3A; number of non-AJ controls = 35,007; number of AJ controls = 454), and as expected when performing the PRS analysis using OR calculated from non-Jewish subset of iCHIP data the signal still remains (p < 10^-16, S7 Fig). We observed a similar trend for the CD samples (0.54 s.d. higher; p < 10^-16; Fig 3B; number of non-AJ CD cases = 20,652; number of AJ CD cases = 1,938). We demonstrate this is not a systematic property of common risk alleles in AJ by running the same comparison using instead the comparable set of established schizophrenia associated alleles from the Psychiatric Genomics Consortium[38].
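The polygenic risk score comparison above can be sketched as follows: each individual's score is the sum of risk-allele dosages weighted by log odds ratios, and all scores are then expressed in standard-deviation units of the non-Jewish control distribution, as in Fig 3. The genotypes and odds ratios below are made-up toy values, and the use of the population (rather than sample) standard deviation is an assumption.

```python
import math
import statistics


def polygenic_risk_score(dosages, odds_ratios):
    """Sum of risk-allele dosages (0, 1 or 2) weighted by log odds ratios."""
    return sum(d * math.log(o) for d, o in zip(dosages, odds_ratios))


def standardize(scores, reference_scores):
    """Express scores in s.d. units of a reference group (mean 0, variance 1)."""
    mu = statistics.mean(reference_scores)
    sd = statistics.pstdev(reference_scores)
    return [(s - mu) / sd for s in scores]


# Toy data: three variants with hypothetical odds ratios.
odds_ratios = [1.20, 1.50, 0.80]
nj_controls = [[0, 1, 2], [1, 0, 1], [2, 1, 0], [0, 0, 1]]
aj_controls = [[1, 2, 0], [2, 1, 1], [1, 1, 0]]

nj_scores = [polygenic_risk_score(g, odds_ratios) for g in nj_controls]
aj_scores = [polygenic_risk_score(g, odds_ratios) for g in aj_controls]

nj_z = standardize(nj_scores, nj_scores)
aj_z = standardize(aj_scores, nj_scores)
print("mean AJ score in NJ-control s.d. units:", statistics.mean(aj_z))
```

Scaling both groups against the same reference is what makes a statement such as "0.97 s.d. higher" directly readable from the standardized means.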
Fig 3

AJ individuals have higher CD polygenic risk score than NJ controls.

NJ: non-Jewish; AJ: Ashkenazi Jewish; CD: Crohn’s disease; PRS: polygenic risk score. A) Density plot of CD polygenic risk scores in 454 AJ (green) and 35,007 NJ (purple) controls. AJ controls have higher CD polygenic risk score than NJ controls (0.97 s.d. higher, p < 10^-16). B) Density plot of CD polygenic risk scores in 1,938 AJ (green) and 20,652 NJ (purple) CD cases (0.54 s.d. higher, p < 10^-16). For both density plots the scores have been scaled to NJ controls, thus resulting in an NJ control PRS density of mean equal to 0 and variance equal to 1 (see Online Methods). C) Ranked (decreasing order) CD associated variants by estimated contribution to the differences in genetic risk between AJ and NJ. Associated variants with estimated contribution greater than or equal to 0.01, computed as 2 × log(odds ratio) × (AJ frequency − NJ frequency), assuming additive effects on the log scale, are highlighted in green. Associated variants with estimated contribution less than or equal to -0.01 are highlighted in purple. Forward slashes represent a break in variants highlighted.

To quantify the relative contribution of CD-implicated alleles to the difference in genetic risk between AJ and non-AJ populations we estimated the expected PRS value of an individual and expected difference in PRS between two populations by simply using summary statistics including the frequency of the minor allele in the two populations and the corresponding odds ratio (Supplementary note, S6 Fig). We applied the approach to all CD implicated alleles and observed that variants in GWAS loci annotated as IRGM, LACC1, NOD2, MST1, ATG16L1, GCKR, NKX2-3, and LRRK2[36] contribute substantially (>0.01) to the increased genetic risk observed in AJ. 
It is possibly relevant that variants contributing to increased risk in AJ include many autophagy/intracellular defense genes (IRGM, ATG16L1, LRRK2), while those contributing to increased risk in non-AJ include many anti-fungal/Th17/ILC3 genes[39] (IL23R, IL12B, CARD9, TRAF3IP2, IL6ST, CEBPB; Fig 3C). Both documented variability in the occurrence of CD over time[40,41] and substantial uncertainty in reported CD prevalence estimates[42,43] impact our ability to precisely estimate the overall contribution of genetics to the established difference in prevalence between populations. To interpret the impact of shifts in genetic risk score on differences in prevalence, we used the logit risk model[35] and evaluated a new estimate of disease probability, pnew, assuming an initial disease probability, p0, and multiple values for the differences in genetic risk. Assuming log-additive effects, and a logit-risk model, we estimate that the observed differences in genetic risk between the AJ and non-AJ populations contribute an expected 1.5-fold increase in disease prevalence in a population with environmental risk factors corresponding to AJ and baseline genetic risk corresponding to non-AJ populations (S7–S9 Figs). To address the extent to which non-additive effects in NOD2 may impact the observed prevalence we assumed 1-hit and 2-hit odds ratios of 2.17 and 9.93, respectively. We attribute a 6.8% difference in the ratio of estimated disease prevalence in the AJ population to the deviation from additivity, suggesting a small effect on differences in population prevalence (Supplementary Note).
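Two pieces of the reasoning above lend themselves to a short worked example: the per-variant contribution to the expected PRS difference, 2 × log(odds ratio) × (AJ frequency − NJ frequency), and the logit risk model that converts a shift in genetic risk into a new disease probability from a baseline p0. The sketch below is a simplification: it applies one uniform shift on the logit scale rather than integrating over a risk-score distribution, and the variant frequencies, odds ratio and baseline prevalence are hypothetical.

```python
import math


def expected_prs_difference(variants):
    """Sum over variants of 2 * log(OR) * (freq_aj - freq_nj), assuming
    additive effects on the log-odds scale.
    `variants` is an iterable of (odds_ratio, freq_aj, freq_nj) tuples."""
    return sum(2.0 * math.log(o) * (f_aj - f_nj) for o, f_aj, f_nj in variants)


def shifted_prevalence(p0, delta):
    """Disease probability after adding `delta` to the log-odds of baseline p0."""
    log_odds = math.log(p0 / (1.0 - p0)) + delta
    return 1.0 / (1.0 + math.exp(-log_odds))


# One hypothetical variant: OR = 2.0, risk-allele frequency 0.10 in AJ vs 0.05 in NJ.
print("contribution:", expected_prs_difference([(2.0, 0.10, 0.05)]))

# Hypothetical baseline prevalence of 0.3% and a log-odds shift of log(1.5).
p0 = 0.003
p_new = shifted_prevalence(p0, math.log(1.5))
print("fold change in prevalence:", p_new / p0)
```

For a rare disease the fold change in prevalence is close to the odds multiplier itself, which is why modest shifts in mean log-odds map onto prevalence ratios of similar size.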

Discussion

Analyzing data from 5,685 Ashkenazi Jewish exomes, we provide a systematic analysis of AJ enriched protein-coding alleles, which may contribute to differences in genetic risk to CD as well as numerous rare diseases, many of which are transmitted via autosomal recessive inheritance. We identified protein-altering alleles in NOD2 and LRRK2 that are conditionally independent and contribute to the excess burden of CD in AJ. We found evidence that common variant risk defined by GWAS is strongly elevated in AJ relative to non-AJ European population samples (0.97 s.d. higher in controls, 0.54 s.d. higher in cases, p < 10^-16 in both), independent of NOD2 and LRRK2[44]. Highly polygenic diseases are unlikely to have substantially altered incidence as a result of a bottleneck alone: for every enriched variant there are those depleted or lost entirely, and population genetics simulations[45] suggest no systematic alteration of overall genetic burden as a function of a bottleneck. Thus, the strong (approximately 1.5-fold, see Supplementary Note) difference in Crohn’s incidence in concert with a systematic enrichment of risk-increasing alleles, unlikely to have arisen by chance, suggests non-random selection in the AJ population for higher CD risk alleles. It seems plausible that, rather than ‘selection for Crohn’s’ per se, a subset of Crohn’s risk alleles contributes to a common biological process (e.g., a specific immune response) or phenotype that was positively selected for in AJ[46-48]. Such weak, widespread ‘polygenic selection’ has previously been observed with respect to height-associated SNPs in Europe[49], where drift alone could not explain the systematic enrichment of height-increasing alleles in populations of Northern Europe vs. Southern Europe.
We found that CD risk alleles that are systematically elevated in AJ are not unusually elevated in another well-established founder population for which we have extensive genotype data (Finland). In Finns, Crohn’s risk alleles were not systematically enriched; they were, if anything, slightly depleted, with 69 risk alleles at higher frequency in Finns than NFE and 79 risk alleles at lower frequency in Finns than NFE. We also demonstrate this is not a systematic property of common risk alleles in AJ by running the same comparison using instead the comparable set of established schizophrenia associated alleles from the Psychiatric Genomics Consortium[50]. We mapped 102 schizophrenia-associated index SNPs to AJ frequency data and again observed no uneven distribution where risk alleles are systematically more or less common. In total, 52 risk alleles were at higher frequency in AJ than NFE and 50 risk alleles were at higher frequency in NFE than AJ. This study of CD in the AJ population confirms population-genetic expectations. First, recently bottlenecked populations are uniquely powered to discover alleles with marked increases in frequency and, as a consequence, contributors to differences in genetic risk across population groups. Second, while NOD2 and published common variant associations contribute substantially to the genetic risk of CD, other genes with causal alleles that failed to pass through the bottleneck are missed, consistent with predictions from Zuk et al.[4]. We provide an exome frequency resource of protein-coding alleles in AJ along with estimates of population-specific enrichment. The sets of enriched alleles should be carefully considered when performing case-control analysis. Population structure can easily lead to false positive associations, especially for low frequency and rare variants, if the AJ:non-AJ ratio is slightly different in cases and controls.
Our approach and this resource will likely catalyze our understanding of the medical relevance of enriched alleles in population isolates. Most importantly, the frequency reference provides critical guidance in pinpointing or excluding specific risk factors in individuals in clinical and research settings.

Materials and methods

Initial variant call set

We generated a jointly called dataset consisting of 18,745 individuals from international IBD and non-IBD cohorts. Sequencing of these samples was done at the Broad Institute.

Ethics statement

All patients and control subjects provided informed consent. Recruitment protocols and consent forms were approved by Institutional Review Boards at each participating institution (Protocol Title: The Broad Institute Study of Inflammatory Bowel Disease Genetics; Protocol Number: 2013P002634). All DNA samples and data in this study were denominalized.

Cohort descriptions

For all cohorts, CD was diagnosed according to accepted clinical, endoscopic, radiological and histological findings.

Target selection

G4L WES is a project-specific product. It combines the Human WES (Standard Coverage) product with an Infinium Genome-Wide Association Study (GWAS) array. In addition to adding to the genomics data, the array also acts as a concordance QC, linking 14 SNPs to the exome data. The processing of the exome includes sample prep (Illumina Nextera), hybrid capture (Illumina Rapid Capture Enrichment, 37 Mb target), sequencing (Illumina HiSeq machines, 150 bp paired reads), identification QC check, and data storage (5 years). Our hybrid selection libraries typically meet or exceed 85% of targets at 20x, comparable to ~60x mean coverage. The array consists of a 24-sample Infinium array with ~245,000 fixed genome-wide markers, designed by the Broad. Our genotyping call rates typically exceed 98%.

Pre-processing

The sequence reads are first mapped using BWA MEM[51] to the GRCh37 reference to produce a file in SAM/BAM format sorted by coordinate. Duplicate reads are marked; these reads are not informative and are not used as additional evidence for or against a putative variant. Next, local realignment is performed around indels. This identifies the most consistent placement of the reads relative to potential indels in order to clean up artifacts introduced in the original mapping step. Finally, base quality scores are recalibrated in order to produce more accurate per-base estimates of error emitted by the sequencing machines.

Variant discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. The variant discovery process is decomposed into separate steps: variant calling (performed per-sample), joint genotyping (performed per-cohort) and variant filtering (also performed per-cohort). The first two steps are designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project. Variant calling is done by running the Genome Analysis Toolkit (GATK) HaplotypeCaller in gVCF mode on each sample's BAM file(s) to create single-sample gVCFs. If there are more than a few hundred samples, batches of ~200 gVCFs are merged hierarchically into a single gVCF to make the next step more tractable. Joint genotyping is then performed on the gVCFs of all available samples together in order to create a set of raw SNP and indel calls. Finally, variant recalibration is performed in order to assign a well-calibrated probability to each variant call in a raw call set, and to apply filters that produce a subset of calls with the desired balance of specificity and sensitivity as described in Rivas et al. (2016)[24]. Samples with ≥ 10% contamination are excluded from call sets. Exome samples with less than 40% of targets at 20X coverage are excluded.

Variant annotation

Variant annotation was performed using the Variant Effect Predictor (VEP) (PMID: 20562413) version 83 with Gencode v19 on GRCh37. Loss-of-function (LoF) variants were annotated using LOFTEE (Loss-Of-Function Transcript Effect Estimator, available at https://github.com/konradjk/loftee), a plugin to VEP. LOFTEE considers all stop-gained, splice-disrupting, and frameshift variants, and filters out many known false-positive modes, such as variants near the end of transcripts and in non-canonical splice sites, as described in the code documentation.

Identification of Finnish samples

Finnish CD patients were recruited from Helsinki University Hospital and have been described in more detail previously[52,53]. We used the same exome sequencing dataset described in Rivas et al.[24]. We applied additional PC correction to the individuals identified as Finnish to remove members of a Finnish sub-isolate (Northern Finland), excluding samples based on a PC2 threshold of 0.015 (853 excluded: 826 controls, 27 IBD). We recalculated PCs and included the first four PCs in the association analysis.

Identifying previously implicated GWAS loci

CD-implicated GWAS loci were those loci defined as reaching genome-wide significance in International IBD Genetics Consortium studies (Jostins, Ripke et al., Nature 2012; Liu et al., Nature Genetics 2015). Credible sets of SNPs around index associations were defined as in Huang et al. (Nature 2017) for fine-mapped loci, and for other loci credible sets were defined as all SNPs with $r^2 > 0.6$ to the index variant. Genes within 50 kb of the span of credible set SNPs were considered “implicated” by GWAS.

Ancestry estimation and quality control

As the present study aimed to focus on variation observed in the Ashkenazi Jewish (AJ) population in comparison to reference populations in ExAC, including non-Finnish Europeans (NFE), Latino (AMR), and African/African-American (AFR), we chose a model-based approach to estimate the ancestry of the study population using ADMIXTURE[12]. To identify AJ individuals and estimate admixture proportions we included a set (n = 21,066) of LD-pruned common variants (MAF > 1%) after filtering for genotype quality (GQ > 20), using the PLINK LD-pruning algorithm, whose description is available at http://pngu.mgh.harvard.edu/~purcell/plink/summary.shtml#prune. For the parameters, we selected a window size of 50 SNPs, a window shift of 5 SNPs at each step, and a variance inflation factor (VIF) threshold of 2. The 18,745 samples were assigned to four groups (K = 4), with ancestry defined as having a single estimated ancestry fraction ≥ 0.4 and the remaining three fractions < 0.4 (S2 Fig). Individuals mostly representing African/African-American and East-Asian ancestry (1,267 and 569 individuals respectively) were discarded from downstream analysis, as were the 983 admixed individuals with none of the ancestry fractions ≥ 0.4. Thus, a total of 6,093 individuals were considered to be of Ashkenazi Jewish (AJ) ancestry, while 9,833 were considered to represent Non-Finnish Europeans (NFE). After sample QC and relatedness check, 5,685 individuals of Ashkenazi Jewish and 7,240 of non-Finnish European ancestry were found with valid IBD case/control status (S1 Table). Individuals with Ulcerative Colitis and unspecified and Indeterminate Colitis were further excluded, resulting in 4,899 AJ and 5,066 NFE individuals. Prior to enrichment and association analysis, 81 samples (of the total 18,745) were also filtered due to possible contamination (heterozygous/homozygous ratio < 1), excess of singletons (n > 2000), deletion/insertion ratio (> 1.5) and mean genotype quality (< 40).
275 samples were excluded for relatedness (>0.35 cut-off). Genotypes with low genotype quality (<20) were filtered, in addition to variants with low call rate (<80%) and allele balance deviating from 70:30 ratio for greater than 40% of heterozygous samples if at least 7 heterozygous samples were identified. As we were interested in computing an enrichment statistic that would not be affected by possible admixture, we obtained alternate allele frequency estimates by restricting the enrichment analysis to the 2,178 non-IBD Ashkenazi Jewish samples that passed QC and relatedness filtering and had AJ focused ancestry fraction > 0.9 (S1 Fig). Principal Component Analysis (PCA) was done in each ancestry group using the 21,066 variants. Sample QC was done using the Hail software while PCA, differential missingness and sample relatedness analysis was done using PLINK[54]. Hail is an open-source software framework for scalably and flexibly analyzing large-scale genetic data sets (https://github.com/broadinstitute/hail). Allele balance was calculated using PLINK/SEQ (https://atgu.mgh.harvard.edu/plinkseq/).
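The ancestry assignment rule described above (exactly one ADMIXTURE fraction ≥ 0.4, and a stricter > 0.9 AJ fraction for the enrichment reference set) can be sketched as follows. This is only an illustration of the rule as stated: the function names and the cluster labels `"AJ"`, `"NFE"`, `"AFR"`, `"EAS"` are hypothetical, and real ADMIXTURE output is an unlabeled Q matrix whose columns must be mapped to populations separately.

```python
def assign_ancestry(fractions, threshold=0.4):
    """Return the ancestry label if exactly one fraction is >= threshold, else None.

    `fractions` maps a cluster label to its estimated ancestry fraction (K = 4).
    Samples with no fraction, or more than one fraction, at or above the
    threshold are treated as admixed and left unassigned.
    """
    above = [label for label, frac in fractions.items() if frac >= threshold]
    return above[0] if len(above) == 1 else None


def in_enrichment_reference(fractions, is_ibd, min_aj_fraction=0.9):
    """Non-IBD samples with AJ ancestry fraction > 0.9 (QC and relatedness assumed done)."""
    return (not is_ibd) and fractions.get("AJ", 0.0) > min_aj_fraction


# Hypothetical samples for illustration only.
sample_a = {"AJ": 0.95, "NFE": 0.03, "AFR": 0.01, "EAS": 0.01}
sample_b = {"AJ": 0.45, "NFE": 0.45, "AFR": 0.05, "EAS": 0.05}
sample_c = {"AJ": 0.30, "NFE": 0.30, "AFR": 0.30, "EAS": 0.10}

for name, s in [("a", sample_a), ("b", sample_b), ("c", sample_c)]:
    print(name, assign_ancestry(s), in_enrichment_reference(s, is_ibd=False))
```

Requiring exactly one fraction at or above the threshold is what separates the assigned groups from the admixed samples that were discarded.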

Estimating fold-enrichment in AJ population compared to reference populations in ExAC

Statistical methods: Fisher’s exact test

To estimate which alleles are enriched in AJ compared to alleles in reference population groups in ExAC we applied Fisher’s exact test with a one-sided alternative (“greater”). Using the number of alternate and reference alleles observed in AJ non-IBD samples and in the population (NFE, AFR or AMR) with the highest frequency from ExAC, we computed a bias-corrected log odds ratio estimate, $\hat{L}$, and its standard error, $\mathrm{SE}(\hat{L})$, for the odds of the alternate allele as described in the software Dataplot developed by the National Institute of Standards and Technology (http://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/logoddra.htm, and http://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/logodrse.htm). Precisely, $\hat{L}$ is the estimate of the log of the odds ratio of finding the alternate allele in AJ vs. in the ExAC population with the highest allele frequency. We classified a variant as ‘enriched’ if the p-value was less than 0.05/73,228, where 73,228 is the number of variants analyzed with minor allele frequency between 0.002 and 0.1. To estimate allele enrichment in AJ compared to reference populations we used 2,178 non-IBD Ashkenazi Jewish samples, after sample and relatedness QC. We calculated alternate allele frequencies for the Ashkenazi Jewish population and used allele frequency information for NFE (n = 31,902; after excluding AJ individuals from ExAC), AFR (n = 5,203), and AMR (n = 5,789) available from the ExAC release 0.3 dataset ($n_{\mathrm{total}}$ = 60,706), and focused on alleles where allele frequency information was available for AJ and the reference populations. For the enrichment plot we focused on alleles with an estimated frequency of at least 0.002 in AJ ($n_{\mathrm{alleles}}$ = 106,377) and an estimated frequency of at least 0.0001 in the reference populations, with depth of coverage of at least 20X in at least 80% of the samples in ExAC.
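A minimal sketch of the enrichment test for one allele follows. It assumes the bias-corrected log odds ratio adds 0.5 to every cell of the 2×2 allele-count table, i.e. log[(a + 0.5)(d + 0.5) / ((b + 0.5)(c + 0.5))] with standard error sqrt(Σ 1/(cell + 0.5)); this is our reading of the Dataplot definition and should be checked against the linked reference. The allele counts are hypothetical, and the function names are illustrative rather than taken from the authors' code.

```python
import math


def _log_comb(n, k):
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value, P(X >= a), for the table [[a, b], [c, d]].

    Rows: AJ, reference population. Columns: alternate, reference allele counts.
    Computed in log space so that large allele counts do not overflow.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    log_denom = _log_comb(total, col1)
    logs = [
        _log_comb(row1, x) + _log_comb(total - row1, col1 - x) - log_denom
        for x in range(a, min(row1, col1) + 1)
    ]
    peak = max(logs)
    return min(1.0, math.exp(peak) * sum(math.exp(v - peak) for v in logs))


def bias_corrected_log_odds(a, b, c, d):
    """Log odds ratio and standard error with 0.5 added to each cell (assumed form)."""
    cells = [x + 0.5 for x in (a, b, c, d)]
    log_or = math.log(cells[0] * cells[3] / (cells[1] * cells[2]))
    se = math.sqrt(sum(1.0 / x for x in cells))
    return log_or, se


# Hypothetical counts: 2,178 AJ samples (4,356 alleles), 31,902 NFE samples (63,804 alleles).
aj_alt, aj_ref = 40, 4316
nfe_alt, nfe_ref = 60, 63744

p_value = fisher_exact_greater(aj_alt, aj_ref, nfe_alt, nfe_ref)
log_or, se = bias_corrected_log_odds(aj_alt, aj_ref, nfe_alt, nfe_ref)
threshold = 0.05 / 73228  # Bonferroni threshold quoted in the text
print(f"p = {p_value:.3g}, enriched = {p_value < threshold}")
print(f"log OR = {log_or:.3f} (SE {se:.3f}), fold = {math.exp(log_or):.1f}")
```

The 0.5 correction keeps the estimate finite when an allele is unobserved in the reference population, which matters for strongly enriched founder alleles.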

Overlap of enriched alleles with ClinVar

We harmonized the XML and TXT releases of the ClinVar database (April 11, 2016 data release)[14] into a single tab-delimited text file using scripts that we have released publicly (https://github.com/macarthur-lab/clinvar). Briefly, we normalized variants using a Python implementation of vt normalize[55] and de-duplicated to yield a dataset unique on chromosome, position, reference, and alternate allele. A variant was considered 'pathogenic' if it had at least one assertion of either Pathogenic or Likely Pathogenic for any phenotype. A variant was considered 'conflicted' if it had at least one assertion of Pathogenic or Likely Pathogenic, and at least one assertion of Benign or Likely Benign, each for any phenotype. By these criteria, ClinVar contained 42,226 variants identified as pathogenic and non-conflicted. Intersecting with our dataset revealed that 148 of these belonged to the AJ-enriched group with p-value less than 0.005.
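The 'pathogenic' and 'conflicted' definitions above reduce to two set-membership checks over a variant's assertions. The sketch below is an illustration under that reading, not the released macarthur-lab/clinvar code; the function name and the plain list-of-strings input format are assumptions.

```python
PATHOGENIC_TERMS = {"pathogenic", "likely pathogenic"}
BENIGN_TERMS = {"benign", "likely benign"}


def classify_variant(assertions):
    """Return (pathogenic, conflicted) flags for a list of clinical significance strings.

    pathogenic: at least one Pathogenic or Likely Pathogenic assertion.
    conflicted: pathogenic, and also at least one Benign or Likely Benign assertion.
    """
    normalized = {a.strip().lower() for a in assertions}
    pathogenic = bool(normalized & PATHOGENIC_TERMS)
    conflicted = pathogenic and bool(normalized & BENIGN_TERMS)
    return pathogenic, conflicted


def is_pathogenic_non_conflicted(assertions):
    pathogenic, conflicted = classify_variant(assertions)
    return pathogenic and not conflicted


# Hypothetical assertion lists for illustration.
print(is_pathogenic_non_conflicted(["Pathogenic", "Uncertain significance"]))
print(is_pathogenic_non_conflicted(["Likely Pathogenic", "Benign"]))
print(is_pathogenic_non_conflicted(["Likely Benign"]))
```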

Assessing Crohn’s disease association of protein-coding variation that may contribute to difference in disease prevalence in AJ

We focused the Crohn’s disease association analysis of protein-coding variants on alleles that may account for the difference in disease prevalence between the AJ population and reference populations. To do so we focused on alleles with a high probability of belonging to the enriched group. We included all samples with an ADMIXTURE-estimated AJ ancestry fraction of at least 0.4 (we excluded any samples that had an ancestry fraction of at least 0.4 in any other group). Samples with Ulcerative Colitis (n = 700) or unspecified and Indeterminate Colitis (n = 86) were excluded from subsequent analysis. This resulted in a dataset of 4,899 AJ samples (1,855 Crohn’s disease and 3,044 non-IBD). Study-specific association analysis was performed with the Firth bias-corrected logistic regression test[25,26] and four principal components as covariates using the software package EPACTS version 3.2.6[27]. Minimum minor allele count (≥1) and variant call rate (≥0.8) filters were used. For meta-analysis we combined association statistics using the Bayesian models and frequentist properties proposed in Band et al.[28], which is a normal approximation to the logistic regression likelihood suggested by Wakefield[56]. As the authors of Band et al. indicate, one way of thinking about the approach is that it uses the study-wise estimated log-odds ratio (beta) and its standard error as summary statistics of the data. For each model of association, we assume a prior on the log odds ratio which is normally distributed around zero with a standard deviation of 0.2. By changing the prior on the covariance (or correlation) in effect sizes between studies we can formally compare models where: 1) the effects are independent across studies, and 2) the effects are correlated equally between studies. The final report of results is based on the correlated effects model.
To address potential differences in effect sizes for the reported associated variants, we assessed heterogeneity of effects and did not find evidence (log10BF > 2). For each model we can obtain a Bayes factor (BF) for association by comparing it with the null model where all the prior weight is on an effect size of zero. We report p-value approximation using the Bayes factor as a statistic for model 2 where the effects are correlated between studies. Association statistics were combined based on association analysis across three study groups: 1) AJ (1,855 CD and 3,044 non-IBD samples); 2) NFE (2,296 CD and 2,770 non-IBD); and 3) Finnish (FINN) (210 CD and 9,930 non-IBD samples) for a total of 4,361 CD samples and 15,744 non-IBD samples.
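The two meta-analysis models can be illustrated with approximate Bayes factors computed from per-study log odds ratios and standard errors. The sketch below uses the N(0, 0.2²) prior stated above. For the correlated model it assumes a between-study correlation of exactly 1 (a shared effect), which is our simplification; the text does not give the correlation value used. All effect estimates are hypothetical.

```python
import math

PRIOR_SD = 0.2  # prior standard deviation on the log odds ratio, from the text


def log_bf_independent(betas, ses, prior_sd=PRIOR_SD):
    """Log Bayes factor vs. the null when effects are independent across studies.

    This is the sum of per-study Wakefield approximate Bayes factors.
    """
    w = prior_sd ** 2
    total = 0.0
    for beta, se in zip(betas, ses):
        v = se ** 2
        total += 0.5 * math.log(v / (v + w)) + (beta ** 2 / v) * w / (2.0 * (v + w))
    return total


def log_bf_shared_effect(betas, ses, prior_sd=PRIOR_SD):
    """Log Bayes factor vs. the null when all studies share one effect (correlation 1).

    Marginal covariance is diag(se^2) + prior_sd^2 * ones; the closed form follows
    from the matrix determinant lemma and the Sherman-Morrison identity.
    """
    w = prior_sd ** 2
    s = sum(1.0 / se ** 2 for se in ses)
    t = sum(beta / se ** 2 for beta, se in zip(betas, ses))
    return -0.5 * math.log(1.0 + w * s) + w * t * t / (2.0 * (1.0 + w * s))


# Hypothetical per-study estimates (e.g. AJ, NFE, FINN) for one variant.
betas = [0.30, 0.30, 0.30]
ses = [0.10, 0.10, 0.10]
log10 = math.log(10.0)
print(f"log10 BF, independent effects: {log_bf_independent(betas, ses) / log10:.2f}")
print(f"log10 BF, shared effect:       {log_bf_shared_effect(betas, ses) / log10:.2f}")
```

When the studies agree in direction and size, the shared-effect model yields the larger Bayes factor; when they disagree, the independent model is favored, which is the basis for the model comparison described above.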

Conditional haplotype based testing and variable selection for NOD2 alleles

In the conditional haplotype-based testing (--chap) analysis we used PLINK v1.08p[54] and set a minimum haplotype frequency of 0.001 (--mhf). We used PLINKSEQ (https://atgu.mgh.harvard.edu/plinkseq/), an open-source C/C++ library for working with human genetic variation data, and the Python bindings implemented in pyPLINKSEQ to perform Bayesian Model Averaging (BMA). We applied BMA[32] using the R package ‘BMA’ (https://cran.r-project.org/web/packages/BMA/BMA.pdf).

Polygenic risk scores

The polygenic risk scores were calculated for the international inflammatory bowel diseases consortium European samples. Details of these samples including the QC procedures were described in previous publications[37]. We used reported effect size estimates from 124 CD alleles including those reported in a previously published study[36] and four variants in IL23R from a recent fine-mapping study[37], and excluding variants in NOD2 and LRRK2. We used 454 AJ controls; 1,938 AJ CD; 35,007 non-Jewish controls and 20,652 non-Jewish CD samples. Polygenic risk scores were calculated using array genotype data as the sum of the log odds ratio of the variants associated with CD. Scores for missing genotypes were replaced by the imputed expected value using PLINK[54]. Variants in NOD2 and LRRK2 were excluded from the analysis to assess whether polygenic signal was independent of those genes.
Let $\mathrm{PRS}_i$ be the polygenic risk score of individual i, assuming additive effects on the log-odds scale, i.e.
$$\mathrm{PRS}_i = \sum_{m} \hat{\beta}_m G_{i,m},$$
where $\hat{\beta}_m$ denotes the estimated log odds ratio for variant m and $G_{i,m}$ denotes the genotype dosage of individual i for variant m. More specifically, $\hat{\beta}_m$ is the effect size estimate of variant m on a logit scale in conferring risk of CD in an individual. In the setting where effects are non-additive, i.e. a genotype-specific effect model,
$$\mathrm{PRS}^{*}_i = \sum_{m} \sum_{g \in \{1,2\}} \hat{\beta}_{m,g}\, \mathbb{1}[G_{i,m} = g],$$
with $\hat{\beta}_{m,g}$ the estimated log odds ratio for carrying g copies of the alternate allele of variant m. For now, we consider the additive scenario, and later we return to the setting where non-additive effects exist, which is relevant for quantifying the differences in contribution of NOD2 alleles to genetic risk in two populations. The estimated expected PRS value for an individual in population j is
$$\hat{E}[\mathrm{PRS}_j] = \frac{1}{N_j} \sum_{i=1}^{N_j} \mathrm{PRS}_i,$$
where $N_j$ is the number of individuals sampled in population j. Substituting the equation for $\mathrm{PRS}_i$ and rearranging terms simplifies the equation as a function of variant frequency:
$$\hat{E}[\mathrm{PRS}_j] = \sum_{m} \hat{\beta}_m \frac{1}{N_j} \sum_{i=1}^{N_j} G_{i,m} = 2 \sum_{m} \hat{\beta}_m \hat{f}_{m,j},$$
where $\hat{f}_{m,j} = \frac{1}{2N_j} \sum_{i=1}^{N_j} G_{i,m}$ denotes the frequency of variant m in population j. Thus, the estimated expected PRS value of an individual in population j is $2 \sum_{m} \hat{\beta}_m \hat{f}_{m,j}$.
Assume that we are interested in the expected difference in contribution of the studied variants to the PRS between two individuals, say from population 1 being AJ and population 2 being NFE. Also, assume that the effect size of variant m is shared across both populations. Then, using the estimated expected PRS value we define the estimated expected difference in contribution of the studied variants to the PRS as the difference in estimated expected PRS value in two populations:
$$\hat{\Delta} = \hat{E}[\mathrm{PRS}_1] - \hat{E}[\mathrm{PRS}_2] = 2 \sum_{m} \hat{\beta}_m \left(\hat{f}_{m,1} - \hat{f}_{m,2}\right),$$
which can be used to get an estimated difference in contribution of a variant m to the polygenic risk score in two populations,
$$\hat{\Delta}_m = 2 \hat{\beta}_m \left(\hat{f}_{m,1} - \hat{f}_{m,2}\right).$$
To rank variants according to their relative differences in contribution to genetic risk we included the NOD2 and LRRK2 alleles, used the list of estimated effect sizes from the published studies[36,37], and estimates from this study. If we substitute $\mathrm{PRS}^{*}$ for PRS,
$$\hat{E}[\mathrm{PRS}^{*}_j] = \sum_{m} \sum_{g \in \{1,2\}} \hat{\beta}_{m,g}\, \hat{f}_{m,g,j},$$
where $\hat{f}_{m,g,j}$ is the frequency of genotype g at variant m in population j. Then, the estimated expected difference in $\mathrm{PRS}^{*}$ when non-additive effects exist is
$$\hat{\Delta}^{*} = \sum_{m} \sum_{g \in \{1,2\}} \hat{\beta}_{m,g} \left(\hat{f}_{m,g,1} - \hat{f}_{m,g,2}\right).$$
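Under the additive model, the per-individual score is the dosage-weighted sum of log odds ratios, and the population mean of that score equals twice the frequency-weighted sum of effects. The sketch below checks this identity numerically and computes per-variant contributions to the difference between two populations. The effect sizes, genotypes and function names are all hypothetical.

```python
def polygenic_risk_score(dosages, log_odds):
    """Additive PRS for one individual: sum over variants of log OR times dosage (0, 1 or 2)."""
    return sum(b * g for b, g in zip(log_odds, dosages))


def allele_frequencies(genotypes):
    """Alternate allele frequency per variant from a list of per-individual dosage lists."""
    n = len(genotypes)
    return [sum(ind[m] for ind in genotypes) / (2.0 * n) for m in range(len(genotypes[0]))]


def expected_prs(freqs, log_odds):
    """Expected PRS in a population: 2 * sum(log OR * allele frequency)."""
    return 2.0 * sum(b * f for b, f in zip(log_odds, freqs))


def per_variant_difference(log_odds, freqs_1, freqs_2):
    """Contribution of each variant to the expected PRS difference (population 1 minus 2)."""
    return [2.0 * b * (f1 - f2) for b, f1, f2 in zip(log_odds, freqs_1, freqs_2)]


# Hypothetical effect sizes and genotype dosages for two small populations.
log_odds = [0.20, -0.10, 0.35]
pop1 = [[0, 1, 2], [1, 0, 1], [2, 1, 0], [1, 1, 1]]
pop2 = [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

f1, f2 = allele_frequencies(pop1), allele_frequencies(pop2)
mean_prs_1 = sum(polygenic_risk_score(ind, log_odds) for ind in pop1) / len(pop1)
deltas = per_variant_difference(log_odds, f1, f2)
ranked = sorted(range(len(deltas)), key=lambda m: abs(deltas[m]), reverse=True)
print("expected PRS, population 1:", expected_prs(f1, log_odds))
print("variants ranked by |difference|:", ranked)
```

Ranking by the absolute per-variant difference mirrors how the text orders variants by their contribution to the difference in genetic risk.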

Estimating fold difference in prevalence for a population with shift in expected genetic risk

Assuming log-additive effects in the logit risk model, the disease probability for an individual is given as $p = (1 + \exp(-\eta))^{-1}$, where η tends towards a normal distribution with parameters $\mu$ and $\sigma^2$[35]. Here $p_0$ refers to a baseline disease probability. We can see that μ may be expressed in terms of the expected polygenic risk score, i.e. $\mu = \log(p_0/(1 - p_0)) + E[\mathrm{PRS}]$. In the setting where $E[\mathrm{PRS}] = 0$, then $\mu = \log(p_0/(1 - p_0))$. To evaluate the impact of a shift in the expected value of polygenic risk score to the expected value of μ we can express the shift as $\mu' = \mu + \Delta$, where Δ is the shift in expected polygenic risk score. We can compute new values of p for new values of μ to obtain a fold-increase in prevalence for a population that has undergone such a shift. We see that this requires a value to be chosen for $p_0$ and that $\log(p_0/(1 - p_0))$ can be represented as a baseline risk score value β. To get an estimate of the absolute prevalence of CD in the AJ population, we must choose a baseline β where $p_0$ represents the expected prevalence with zero non-baseline alleles in the population[35], to which we add a contribution from multiple non-baseline alleles to calculate: 1) an individual’s probability of disease, or 2) the expected prevalence of the disease in the population. Once we have chosen a value for β, we can calculate the ratio of expected prevalence as follows. First, use the means ($\mu_{\mathrm{AJ}}$ and $\mu_{\mathrm{NAJ}}$) and variances ($\sigma^2_{\mathrm{AJ}}$ and $\sigma^2_{\mathrm{NAJ}}$) of risk scores as calculated above to calculate the probability density function of the disease prevalence. In the case of the AJ population, we have
$$f_{\mathrm{AJ}}(p) = \frac{1}{\sigma_{\mathrm{AJ}}\, p (1 - p)}\, \phi\!\left(\frac{\eta - \mu_{\mathrm{AJ}}}{\sigma_{\mathrm{AJ}}}\right),$$
where η is the risk score associated with prevalence p, g is the link function, so $p = g(\eta) = (1 + e^{-\eta})^{-1}$, and ϕ is the standard normal density function. Next, we integrate to get $E[p_{\mathrm{AJ}}] = \int_0^1 p\, f_{\mathrm{AJ}}(p)\, dp$. Finally, we can calculate $E[p_{\mathrm{NAJ}}]$ in a similar way, and divide the expected prevalence in the AJ population by that in the non-AJ population to get the prevalence ratio, $E[p_{\mathrm{AJ}}]/E[p_{\mathrm{NAJ}}]$. The value of β = -20.5 was chosen in order to obtain a prevalence in the non-AJ population of ~0.5%.
At this value of β, the ratio of prevalence in the AJ population to that in the non-AJ population was estimated to be 1.5 ($E[p_{\mathrm{AJ}}]$ = 0.82%, $E[p_{\mathrm{NAJ}}]$ = 0.55%). For different choices of β, however, this ratio may vary, as the relationship between probability of disease and risk score is non-linear. S10 Fig shows how the values of the disease prevalence and their ratio vary as β is changed. We see that the ratio values range from 1.46 to 1.52 for different values of β with a range of baseline prevalence of 0.001 to 0.01, which is the range of prevalence estimates for Crohn’s disease[41,43,57]. To further understand the effect that choosing a logit-based model had on the results, a comparison of the standard logit and probit models was done using the values inferred from the logit model. No full-scale probit modelling was done in this analysis, so the values found with the probit model represent only a close approximation of the expected results. In the logit model for population analysis, we may assume that individual risk scores are chosen from a normal distribution $\eta_{\mathrm{logit}} \sim N(\mu_{\mathrm{logit}}, \sigma^2_{\mathrm{logit}})$, where $\mu_{\mathrm{logit}}$ and $\sigma_{\mathrm{logit}}$ represent the mean and standard deviation of the risk scores as defined above. From here, we may calculate the probability density function of probit model risk scores $\eta_{\mathrm{probit}}$ based on that of logit model risk scores $\eta_{\mathrm{logit}}$ as
$$f(\eta_{\mathrm{probit}}) = f(\eta_{\mathrm{logit}}) \left| \frac{d\eta_{\mathrm{logit}}}{d\eta_{\mathrm{probit}}} \right|, \qquad \eta_{\mathrm{probit}} = \Phi^{-1}\!\left(g(\eta_{\mathrm{logit}})\right),$$
where Φ is the standard normal cumulative distribution function, and use this to calculate $\mu_{\mathrm{probit}}$ and $\sigma_{\mathrm{probit}}$, the estimated mean and standard deviation of the risk scores in the probit model. Using these values, we obtain a probability distribution for the frequency of disease in the populations using the probit model. While the logit model yielded a prevalence ratio of 1.506, the probit estimation yielded a prevalence ratio of 1.5136, with similar expected prevalence values ($E[p_{\mathrm{AJ}}]$ = 0.823%, $E[p_{\mathrm{NAJ}}]$ = 0.544%). These values demonstrate that individual logit and probit analyses would likely give similar results for values of interest. The complete probability densities under the logit and probit models can be seen in S8 Fig.
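The expected-prevalence calculation can be sketched numerically by integrating the logistic function against a normal density for the risk score. This integrates over η rather than over p, which is the same quantity after a change of variables. The means and standard deviation used below are hypothetical placeholders, since the fitted values are not given in the text; the function names are illustrative.

```python
import math


def expit(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def expected_prevalence(mu, sigma, n_intervals=2000, width=8.0):
    """E[expit(eta)] for eta ~ N(mu, sigma^2), by Simpson's rule over mu +/- width*sigma."""
    lo = mu - width * sigma
    h = 2.0 * width * sigma / n_intervals
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n_intervals + 1):
        eta = lo + k * h
        z = (eta - mu) / sigma
        weight = 1 if k in (0, n_intervals) else (4 if k % 2 else 2)
        total += weight * expit(eta) * math.exp(-0.5 * z * z) / norm
    return total * h / 3.0


# Hypothetical risk-score distributions: the AJ mean is shifted upward by 0.4.
mu_naj, mu_aj, sigma = -5.5, -5.1, 0.8
prev_naj = expected_prevalence(mu_naj, sigma)
prev_aj = expected_prevalence(mu_aj, sigma)
print(f"non-AJ prevalence: {prev_naj:.4%}")
print(f"AJ prevalence:     {prev_aj:.4%}")
print(f"prevalence ratio:  {prev_aj / prev_naj:.3f}")
```

Because the logistic function is non-linear, the ratio depends on where the distributions sit on the risk scale, which is why the text examines a range of β values.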
Further, it is interesting to compare the relationship between values of risk scores in the two models. For values of risk scores between -1 and 1 in the logit model, the relationship to those in the probit model is highly linear, with a formula of $\eta_{\mathrm{probit}} = 0.6223 \cdot \eta_{\mathrm{logit}}$, with $r^2 = 1.0000$. This formula may be used to impute single values in one model or the other assuming that the estimated total risk score is otherwise close to zero, and the imputed value is low. It is worth noting, however, that this does not work for all values of $\eta_{\mathrm{logit}}$, as the relationship between risk score in the logit and probit models deviates from this simple linear model when the risk score values are large.
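The near-linear relationship between logit and probit risk scores can be checked by mapping each logit score to a probability and then to the probit scale, and fitting a line through the origin. The grid spacing and least-squares fit below are our own choices, so the fitted slope should land close to, but not necessarily exactly on, the 0.6223 quoted above.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def logit_to_probit(eta_logit):
    """Map a logit-scale risk score to the probit-scale score with the same probability."""
    p = 1.0 / (1.0 + math.exp(-eta_logit))
    return _STD_NORMAL.inv_cdf(p)


# Evenly spaced logit scores between -1 and 1 (grid choice is an assumption).
logit_scores = [i / 100.0 for i in range(-100, 101)]
probit_scores = [logit_to_probit(x) for x in logit_scores]

# Least-squares slope through the origin, and the squared correlation.
sxy = sum(x * y for x, y in zip(logit_scores, probit_scores))
sxx = sum(x * x for x in logit_scores)
syy = sum(y * y for y in probit_scores)
slope = sxy / sxx
r_squared = sxy * sxy / (sxx * syy)

print(f"slope = {slope:.4f}, r^2 = {r_squared:.6f}")
# Far from zero the ratio drifts away from the fitted slope.
print(f"ratio at eta_logit = 5: {logit_to_probit(5.0) / 5.0:.4f}")
```

The last line illustrates the caveat in the text: for large risk scores the probit-to-logit ratio falls below the fitted slope, so the linear conversion should only be used near zero.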

Difference in prevalence between AJ and NFE attributed to implicated variants

The difference in prevalence due to multiple alleles can be computed as $p_j - p_{0,j}$, where $p_j$ denotes the disease prevalence in population j and $p_{0,j}$ denotes the disease prevalence without the risk factors in population j, which is computed following Moonesinghe et al.[58] from the genotype relative risks, where $\mathrm{GRR}_m$ denotes the genotype relative risk for variant m. We model the CD prevalence accounted for by CD-associated enriched protein-altering alleles separately in both AJ and non-AJ European populations and determine the amount by which CD prevalence would be reduced if each variant were absent from each population. To estimate the difference in prevalence between two populations attributed to genetic risk factors when non-additive effects exist, genotype-specific relative risks are used in place of a single per-allele relative risk.

Enrichment testing sensitivity

When modeling enrichment, we chose a standard significance cutoff of p < 0.05/N for classifying variants as enriched. We noted that the number of variants classified as enriched does not change significantly when the p-value threshold changes. See S11 Fig for more information.

Analysis workflow diagram. (PNG)
Admixture plots. (PNG)
Cross-validation errors for number of clusters in ADMIXTURE. (PNG)
MAF thresholds chosen for enrichment testing. (PNG)
Principal components plot for 5,685 AJ individuals. (PNG)
Variable selection using Bayesian model averaging (BMA). (PNG) (PNG)
Dependence of population prevalence on β. (PNG)
Probit and logit model analysis. (PNG)
The relationship between expected differences in genetic risk score and expected fold differences in disease prevalence. (PNG)
Sensitivity test for p-value significance cutoff. (PNG)
ClinVar pathogenic alleles enriched in AJ. (XLSX)
AJ enrichment data for all analyzed alleles. (TXT)
Origins of moderate ancestry fraction Ashkenazi samples. (PNG)
Origins of high ancestry fraction Ashkenazi samples. (PNG)
Conditional haplotype-based testing in NOD2. (PNG)
Assessing association of a one-hit and two-hit model of NOD2 in the AJ exome sequencing data. (PNG)
Assessing association of a one-hit and two-hit model of NOD2 in the non-AJ immunoChip data. (JPG)
Journal:  Brain       Date:  2020-01-01       Impact factor: 13.501
