Lillias H Maguire1, Samuel K Handelman2, Xiaomeng Du2, Yanhua Chen2, Tune H Pers3,4, Elizabeth K Speliotes2,5. 1. Department of Surgery, Division of Colorectal Surgery, University of Michigan, Ann Arbor, MI, USA. maguirel@med.umich.edu. 2. Department of Internal Medicine, Division of Gastroenterology, Ann Arbor, MI, USA. 3. The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark. 4. Department of Epidemiology Research, Statens Serum Institut, Copenhagen, Denmark. 5. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
Abstract
Diverticular disease is common and has a high morbidity. Treatments are limited owing to the poor understanding of its pathophysiology. Here, to elucidate its etiology, we performed a genome-wide association study of diverticular disease (27,444 cases; 382,284 controls) from the UK Biobank and tested for replication in the Michigan Genomics Initiative (2,572 cases; 28,649 controls). We identified 42 loci associated with diverticular disease; 39 of these loci are novel. Using data-driven expression-prioritized integration for complex traits (DEPICT), we show that genes in these associated regions are significantly enriched for expression in mesenchymal stem cells and multiple connective tissue cell types and are co-expressed with genes that have a role in vascular and mesenchymal biology. Genes in these associated loci have roles in immunity, extracellular matrix biology, cell adhesion, membrane transport and intestinal motility. Phenome-wide association analysis of the 42 variants shows a common etiology of diverticular disease with obesity and hernia. These analyses shed light on the genomic landscape of diverticular disease.
Diverticulosis is an outpouching of the gastrointestinal tract present in the majority of older adults in Western countries[1]. Most patients are asymptomatic, but hundreds of thousands develop diverticular disease. Diverticulosis, the precursor lesion, is highly prevalent in the United States (US), Europe, and Canada; >50% of adults over age 60 have diverticulosis, and 10–25% will become symptomatic[2]. It is less common in other populations and demonstrates anatomic variation; diverticulosis is predominantly (90%) in the sigmoid colon in Western populations, but for unknown reasons is right-sided (70%) in Asia[3]. The low fiber Western diet has traditionally been implicated in diverticulosis, but this correlation has been questioned[4-6]. Diverticulitis (inflammation and infection of diverticula) causes >200,000 hospital admissions in the US annually[7]. The incidence is increasing; US inpatient admissions increased 26% from 1998 to 2005, most rapidly among patients <45 years old[8]. Complications including fistula, stricture, abscess, and intestinal perforation necessitate tens of thousands of surgical interventions annually[9]. Inpatient mortality is 1.5–3.0%[10]. Progression from diverticulosis to diverticular disease is poorly understood. Observational studies have correlated age, obesity, decreased physical activity, ultraviolet radiation, and diet with diverticular disease[4,11-15]. Incidence is higher in males <50, but females predominate in older ages[16]. Diverticular disease is associated with connective tissue disorders: Ehlers-Danlos Syndrome (collagen mutations)[17], Williams Syndrome (elastin mutations)[18], and polycystic kidney disease[19]. In the general population, twin studies estimate heritability at 40–53%[20,21], indicating a strong genetic component. To date, only one genome-wide association study (GWAS) has been performed, identifying three associated loci[22]. Here we report the largest GWAS of diverticular disease to date.
We examined associations of ~30 million single nucleotide polymorphisms (SNPs) with diverticular disease (ICD-10 code K57; N=27,244) in the European component of the United Kingdom Biobank (UKBB) population vs 382,284 control individuals[23]. K57 is a root code including diverticulitis and diverticular hemorrhage (Supplementary Table 1). It has been validated as a diagnostic code for diverticular disease, with a positive predictive value of 0.98[24]. Analyses were adjusted for age, sex, and principal components and relatedness using mixed linear modeling[25]. We tested the top 154 independently-associated SNPs (p < 1 × 10−5) in 31,221 unrelated European-ancestry individuals enrolled in the Michigan Genomics Initiative (MGI)[26] adjusted for gender, age, and principal components[25]. Cases of diverticular disease in MGI were identified using ICD-9 billing codes for diverticulosis (code 562.10, 562.12, N=1,854) or diverticulitis (code 562.11, 562.13, N=718).
We identified 40 independent loci with genome-wide significant (p < 5 × 10−8) associations for diverticular disease and 112 more loci with suggestive associations (p < 1 × 10−5, Supplementary Table 2, Supplementary Figures 1 and 2) in UKBB. In MGI, 8/154 variants replicated with a consistent direction of effect at an MGI false discovery rate (FDR) < 10%. All loci associated with UKBB-genome-wide significant SNPs (N=40) and two MGI-replicated/UKBB-suggestive-SNPs were carried forward for analysis (Figure 1). Of these 42 loci of interest, 39 represent novel associations (Table 1). Supplementary Table 2 is a full list of associated variants. The 42 loci mapped to 99 genes within a distance of 500kb and R2 > 0.5 (Supplementary Table 3). Regional association plots better defined the associated signal (Supplementary Figure 3).
Figure 1:
Study Design. Graphic representation of study design. GWAS: Genome-wide association study. SNPs: single nucleotide polymorphisms. PheWAS: Phenome-wide association study. GTEx: Genotype-Tissue Expression project. DEPICT: Data-driven Expression-Prioritized Integration for Complex Traits. eQTL: Expression Quantitative Trait Loci
Table One:
Loci of interest including genome-wide significant variants (p<5×10−8) from UKBB and highly significant SNPs (p<5×10−5) with replication in MGI. Bold gene symbols indicate replication in MGI at FDR <0.1 following Benjamini-Hochberg correction. X–chromosome, DD – diverticular disease, FDR – false discovery rate, GWAS – genome wide association study, EAF – estimated allele frequency. At each locus, a superscript 1 or 2 indicates an eQTL and eGene for GTEx v7 sigmoid colon or transverse colon, respectively.
Tissue expression and pathway enrichment analyses were performed using Data-Driven Expression Prioritized Integration for Complex Traits (DEPICT) [27]. Mesenchymal stem cells and four related connective tissue cell types were enriched (FDR < 0.20). Digestive, connective, and urogenital tissues (Figure 2AB, Supplementary Table 4) were enriched (FDR < 0.20). 95 of 14,462 independent reconstituted DEPICT gene sets (FDR <0.20 and kappa of 0.5) were enriched for the 99 genes, including pathways involved in vascular biology, mesenchymal development/derivatives, and embryogenesis (Supplementary Table 5).
Figure 2A/B:
Tissue and cell type enrichment analysis. Plots showing the enrichment of loci associated with diverticular disease (p < 1 × 10−5 in the UKBB; N=27,444 cases/382,284 controls) in specific cell types (A) and tissues (B). Enrichments are grouped according to system or cell type and significance; annotations above the dashed line have FDR ≤ 0.20. Data corrected for multiple comparisons using Benjamini-Hochberg method. Top tissue in each category labelled.
Of the 42 SNPs carried forward for analysis, 7 were expression quantitative trait loci (eQTLs) in sigmoid colon, and 6 were eQTLs in transverse colon, according to GTEx[28] (Table 1). The most significant eQTL-SNP was rs7086249 (NM_020752.2:c.1405–28470T>C) regulating GPR158, which encodes an orphan G-protein coupled receptor, in the sigmoid colon. Mechanistic studies in fresh tissues are needed. Power to detect eQTLs is limited; post-mortem interval strongly influences colonic RNA quality[29]. 31/42 SNPs, including the 8 confirmed variants, were intronic; the remainder were intergenic.
We performed phenome-wide association study (PheWAS) analysis for the 42 loci of interest. PheWAS can be used to agnostically assess whether phenotypes are associated with a genetic variant. Here 42 SNPs were tested for association with 780 UKBB traits[30] (Supplementary Table 6). Traits were hierarchically clustered before filtering those without significant association. Twenty-three loci correlated with morphometric traits (ABO, BDNF, CALCA/CALCB, COL6A1, CRISPLD2, CWC27, DISP2, EFEMP1, ENSG00000224849, ENSG00000251283, FADD/ANO1, FAM185A, GTPBP1, HLX, LYPLAL1, NOV, NT5C1B, RBKS, PCSK5, S100A10, TRPS1, UBTF, and ZBTB4). Fourteen loci associated with hematologic variables (ABO, ARHGAP15, BDNF, CRISPLD2, DISP2, ENSG00000224849, GTPBP1, HLX, PPP1R14A/SPINT2, RBKS, SLC25A28, TRPS1, UBTF, and ZBTB4). LYPLAL1, GTPBP1, ELN, EFEMP1, and CRISPLD2 associated with hernias. EFEMP1 and CRISPLD2 also associated with female genital prolapse. EFEMP1 has been previously associated with hernia[31]. SHFM1, UBTF, HLX, ABO, and UNC50 associated with connective tissue traits, such as osteoarthritis and soft tissue inflammation. Eight loci (ABO, CACNB2, ENSG00000224849, FADD/ANO1, NOV, NT5C1B, RBKS, and ZBTB4) associated with vascular traits including venous thrombosis, pulmonary embolism, hypertension, and heart failure.
Nineteen loci (ABO, ARHGAP15, COLQ/METTL6, CRISPLD2, EFEMP1, ELN, ENSG00000224849, ENSG00000251283, FADD/ANO1, FAM155A, FAM185A, GPR158, GTPBP1, P2RY12/P2RY14, PPP1R14A/SPINT2, RBKS, SLC25A28, SLC35F3, and UNC50) associated with gastrointestinal disease, but not with the common bowel conditions inflammatory bowel disease, polyps, and cancer. An edited heatmap (Figure 3) highlights these effects.
Figure 3:
Phenome-wide association matrix. Filtered association matrix highlighting vascular, gastrointestinal, connective tissue, hematologic, morphometric, and dietary traits associated with loci of interest in the UKBB (27,444 cases/382,284 controls). Data controlled for multiple comparisons using Benjamini-Hochberg method, filtered at FDR<5%, and clustered at h=0.2.
One prior GWAS identified risk-loci near ARHGAP15, FAM155A and COLQ[22]. These associations were confirmed, supporting the validity of our approach. ARHGAP15 encodes a GTPase-activating protein acting on Rac[32] and negatively regulates neutrophils[33]. The function of the gene product of FAM155A is unknown. COLQ encodes a critical protein for acetylcholine-mediated signaling[34]. CALCB, identified in our study, was identified but not validated in the prior GWAS. TNFSF15 has been associated with diverticular disease[35], but this was not found in our study. Despite clinical association[17], Ehlers-Danlos genes were not identified.
The 8 replicated loci were associated with 21 genes (Supplementary Table 3). Some contribute to cytoskeletal and extracellular matrix (ECM) dynamics (ELN, SHFM1, BMPR1B, LIMK1, and CLSTN2). BMPR1B and SHFM1 are implicated in bone and cartilage synthesis[36,37]. LIMK1 stabilizes the cytoskeleton by inhibiting actin de-polymerization[38]. CLSTN2 encodes an atypical cadherin involved in cell adhesion[39]. ELN encodes elastin, which is altered in diverticular colons[40]. Diverticular disease is common in Williams Syndrome, a congenital disorder caused by ELN hemizygosity[18]. ANO1 encodes a chloride channel in the intestinal pacemaker cells of Cajal[41]. These cells are reduced in diverticular disease[42]. ANO1 is critical for intestinal contractility[43]. Altered intestinal motility is implicated in diverticular disease[44]. Diverticular colons demonstrate abnormal smooth muscle morphology[45] and altered contractile force[46]. ARHGAP15, GPR158, and GTPBP1 are involved in GTP-signaling. Many identified genes have unknown function or unclear functional link to diverticular disease. Functional characterization should be prioritized to confirm these gene-variant associations. In the absence of strong molecular evidence to the contrary, systematic studies indicate that the closest gene is the best candidate for SNP effect[47].
All replicated SNPs were located in introns, supporting a molecular mechanism at the RNA-expression level in the surrounding gene. Therefore, expression levels of these genes are the most plausible avenue for further molecular phenotyping[48].
Among our other 99 identified genes, many have roles in the ECM, motility, and membrane transport (Figure 4). COL6A1, CRISPLD2, EFEMP1, HAS2, NOV, and TCHH have known roles in the ECM[49-53]. Enrichment in mesenchymal stem cells, connective tissues, and mesenchymal development pathways suggests a role for connective tissue biology. PPP1R14A and CHRNB1 affect smooth muscle motility[54,55]. Others are involved in transport of copper (CUTC), sodium (SPINT2) and calcium (CALCA, CALCB, CACNB2). SPINT2 mutations result in congenital sodium diarrhea[57]. Altered absorption or motility could produce constipation, which is clinically associated with diverticular disease. Vascular biology identified by pathway analysis/PheWAS may be relevant as diverticula tend to occur adjacent to penetrating arteries.
Figure 4:
Plausible biological pathways underlying risk loci associated with diverticular disease. Bold gene symbols indicate replication in MGI. * indicates prior identification in GWAS
This study is limited in that it detects diverticular disease via inpatient coding, and does not identify asymptomatic diverticulosis. Given the epidemiology of diverticulosis, the majority of participants likely harbor the precursor lesion and the variants identified only associate with diverticular disease. However, this is the clinically relevant outcome. Given the high reliability of diverticular disease codes[24] and the derivation of cases from inpatient hospital admissions, it is likely that most patients suffered severe diverticular disease. However, patients might be erroneously identified if diverticulosis was noted incidentally. Conversely, patients with mild diverticular disease treated as outpatients may be falsely identified as controls. The de-identified nature of the data precludes coding confirmation. Another limitation is ICD-9 versus ICD-10 coding between populations. We chose grouped, rather than individual, codes to mitigate this difference. Additionally, the UKBB entry age of 40–69 prohibits comparison of older/younger patients. Finally, some conditions in our PheWAS could be a consequence of diverticular disease rather than sharing a common etiology.
In summary, the biologic basis for both the development of colonic diverticula in the majority of older adults and the triggers that produce diverticulitis in some patients is unknown. We report the largest GWAS thus far for diverticular disease and identify 39 novel loci as contributing to the pathophysiology of these diseases. This work defines the landscape for future functional studies and identifies possible targets for therapeutic development.
Online Methods
UK Biobank
The UK Biobank (UKBB) contains genotypes, clinical and demographic data on over 400,000 individuals aged 40–69 at time of study recruitment. The UKBB protocols were approved by the National Research Ethics Service Committee. Participants signed written informed consent, specifically applicable to health-related research. All ethical regulations were followed. The analyses used in this paper were carried out by Canela-Xandri et al. under UK Biobank Resource project 788[30]. Diverticular disease was recorded under the International Classification of Disease (ICD) 10 code K57 (N=27,244). Participant genotyping, data collection, and quality control have been described in detail[23]. In brief, participants were genotyped on one of two purpose-designed arrays (UK BiLEVE Axiom Array (N=50,520) and UK Biobank Axiom Array (N=438,692)) with 95% marker overlap. The Haplotype Reference Consortium was used as a reference panel to phase and impute the data. Following quality control, over 30 million variants in 408,455 white British individuals (http://geneatlas.roslin.ed.ac.uk/) were tested for association with K57 controlling for age, gender, principal components and relatedness using mixed linear modeling.
Michigan Genomics Initiative
The Michigan Genomics Initiative (MGI) is an institutional repository of DNA linked to participants’ medical profile via the electronic medical record at the University of Michigan. The MGI is approved for this research use by the Institutional Review Board of the University of Michigan. All relevant ethical regulations were followed. Diverticular disease was derived from ICD-9 codes (562.11 and 562.13 for diverticulitis and 562.10 and 562.12 for diverticulosis). Following informed consent, individuals (N=35,888) were genotyped using the Illumina HumanCoreExome Array.
Genotype analysis was performed with Illumina GenomeStudio (module 1.9.4, algorithm GenTrain 2.0). After initial clustering, we re-defined variant cluster boundaries using individuals with call rate >99% and genotyped the remaining samples. Samples were excluded if they met any of the following criteria: (1) call rate <99%, (2) estimated contamination >2.5% (BAF Regress)[57], (3) large chromosomal copy number variants, (4) the lower call rate within a technical duplicate pair or twin pair, or (5) an inferred sex that contradicted the reported sex.
Variant-quality control was performed in the following manner: we excluded variants if (1) their probes could not be perfectly mapped or mapped perfectly to multiple positions, (2) they showed deviations from Hardy-Weinberg equilibrium (p-value < 0.0001), (3) they had a call rate < 99%, or (4) another variant with higher call rate assayed the same variant (PLINK (v1.90)[58]).
Phasing was carried out using SHAPEIT2 (v2.r837)[59] on autosomal chromosomes. Genotypes of the Haplotype Reference Consortium (chromosome 1–22: HRC release 1; chromosome X: HRC release 1.1) were imputed into the phased MGI data using Minimac3 (v1.0.13)[60]. Excluding variants with low imputation quality (R2 <0.3) resulted in dense mapping at 39,127,678 quality-imputed genetic markers.
We estimated pairwise kinship using the software KING (v1.4.2)[61]. We excluded any 1st- or 2nd-degree relative pairs within the cohort. In addition, we used principal component analysis to identify ethnically homogeneous groups using individuals from the Human Genome Diversity Project[64]. We included only European samples.
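The call-rate and Hardy-Weinberg thresholds in the variant-quality control step can be illustrated with a short sketch. It is not the PLINK implementation used in the study: it assumes a chi-square approximation to the Hardy-Weinberg test (PLINK can use an exact test), and the function names and genotype-count inputs are hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg equilibrium.

    Assumption: a chi-square approximation, not the exact test PLINK offers.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no called genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    if any(e == 0 for e in expected):
        return 1.0  # monomorphic variant: no deviation can be tested
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_variant_qc(n_aa, n_ab, n_bb, n_missing,
                      min_call_rate=0.99, hwe_threshold=1e-4):
    """Keep a variant only if call rate >= 99% and HWE p-value >= 0.0001."""
    called = n_aa + n_ab + n_bb
    call_rate = called / (called + n_missing)
    if call_rate < min_call_rate:
        return False
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= hwe_threshold
```

For example, `passes_variant_qc(250, 500, 250, 0)` keeps a variant whose genotype counts sit exactly at equilibrium, whereas the same counts with 20 missing calls fail the 99% call-rate threshold.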
Locus Identification
197 independent loci were identified for all imputed UKBB variants associated with diverticular disease at p <1×10–5 using criteria of R2 < 0.1 within a distance of 500kb using PLINK version 1.90b4.6[62] within the DEPICT program[27]. DEPICT then assigned each SNP to genes in the specified region or genes containing variants in linkage disequilibrium with the SNP. SNPs were then queried for replication in MGI using a nominal one sided FDR of 10% by Benjamini-Hochberg[63].
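The replication criterion (Benjamini-Hochberg FDR of 10% with a consistent direction of effect) can be sketched as follows. This is a minimal illustration, not the study's code: the record keys (`beta_discovery`, `beta_replication`, `p_replication`) are hypothetical, and how the one-sided p-values were derived is not specified in the text, so the sketch takes replication p-values as given.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def replicated_snps(snps, fdr=0.10):
    """Return IDs of SNPs passing the FDR threshold with concordant effect sign.

    `snps` is a list of dicts with hypothetical keys 'id', 'beta_discovery',
    'beta_replication', and 'p_replication'.
    """
    qvalues = benjamini_hochberg([s["p_replication"] for s in snps])
    return [
        s["id"]
        for s, q in zip(snps, qvalues)
        if q < fdr and s["beta_discovery"] * s["beta_replication"] > 0
    ]
```

A SNP with a small replication p-value but an opposite effect sign in the replication cohort is not counted as replicated under this sketch.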
SNP Annotation
Effect allele and allele frequencies were annotated using ANNOVAR chromosome 1–22 imputation data, build 37[64].
Tissue and Pathway Analysis
Tissue and pathway enrichment was carried out against 14,462 reconstituted gene sets in DEPICT[27] (version 1, release 194) for the 192 loci associated with diverticular disease at a nominal p-value below 1×10–5. Pathways were culled using a kappa statistic of 0.5[65]. Tissue and cell type enrichment was similarly determined in DEPICT by analyzing gene expression enrichment of genes at our 192 loci of interest in 209 MeSH-defined tissue and cell types. FDR of <0.20 was set as a threshold for significance for both pathway and tissue analysis.
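The kappa-based culling of overlapping pathways can be sketched as below. The text gives only the threshold (0.5), so the rest is assumed: Cohen's kappa is computed on binary gene membership over a fixed gene universe, and sets are culled greedily in order of significance. The function names are hypothetical and this is not DEPICT's own code.

```python
def kappa(set_a, set_b, universe_size):
    """Cohen's kappa for agreement in gene membership between two gene sets."""
    both = len(set_a & set_b)
    only_a = len(set_a - set_b)
    only_b = len(set_b - set_a)
    neither = universe_size - both - only_a - only_b
    observed = (both + neither) / universe_size
    expected = (
        (both + only_a) * (both + only_b)
        + (only_b + neither) * (only_a + neither)
    ) / universe_size ** 2
    if expected == 1.0:
        return 1.0  # both sets empty or both equal to the universe
    return (observed - expected) / (1.0 - expected)


def cull_gene_sets(ranked_sets, universe_size, threshold=0.5):
    """Greedy culling (an assumed procedure).

    `ranked_sets` is a list of (name, set_of_genes) ordered from most to
    least significant. A set is kept only if its kappa with every set
    already kept is below `threshold`.
    """
    kept = []
    for name, genes in ranked_sets:
        if all(kappa(genes, other, universe_size) < threshold for _, other in kept):
            kept.append((name, genes))
    return [name for name, _ in kept]
```

With a universe of 10 genes, the sets {1, 2, 3, 4} and {1, 2, 3, 5} give a kappa of about 0.58, so the less significant of the two would be dropped.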
Colon eQTLs
Lists of eQTLs in sigmoid or transverse colon were obtained from GTEx version 7. The GTEx project has been described in detail elsewhere[28]. Briefly it is a gene expression resource created from RNA Sequencing (RNA-Seq) results obtained from post-mortem donors. Gene expression levels and individual variants were correlated to enable discovery of 697,430 gene-variant associations in sigmoid colon, and 832,983 gene-variant associations in transverse colon. In both cases, a false discovery rate below 5% was used.
PheWAS
Phenome-wide association study was carried out for all lead SNPs in our loci of interest. SNPs were queried against 778 traits ascertained for UKBB participants and reported in the Roslin Gene Atlas, including morphometric data, hematologic lab values, ICD-10 clinical diagnoses, and self-reported conditions. First, traits were hierarchically clustered using inverse-absolute Pearson correlation among the Z-scores as a distance metric. The resultant hierarchical clustering/tree was pruned at a height corresponding to h=0.2, leaving a total of 97 largely independent traits. Then, the pruned matrix of trait-genotype associations was filtered at an FDR of 0.05 by Benjamini-Hochberg[63]. This filtered association matrix was used in further analysis and reporting.
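The trait-clustering step can be sketched in a few lines. Two assumptions are made that the text does not confirm: the distance is taken to be 1 − |r|, and single linkage is used, so that cutting the tree at height h is equivalent to merging every pair of traits with distance ≤ h. The trait names are illustrative only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cluster_traits(zscores, height=0.2):
    """Group traits whose distance 1 - |r| is at most `height`.

    `zscores` maps trait name -> list of Z-scores across the lead SNPs.
    Assumes single linkage, implemented with a union-find structure.
    """
    traits = list(zscores)
    parent = {t: t for t in traits}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path compression
            t = parent[t]
        return t

    for i, a in enumerate(traits):
        for b in traits[i + 1:]:
            distance = 1.0 - abs(pearson(zscores[a], zscores[b]))
            if distance <= height:
                parent[find(a)] = find(b)

    clusters = {}
    for t in traits:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(members) for members in clusters.values())
```

Because the distance uses the absolute correlation, traits that are strongly anti-correlated across the SNPs fall into the same cluster as strongly correlated ones. One representative per cluster would then be carried into the FDR filtering step.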
Statistics
All p-values described in the manuscript are two-sided. Multiple comparison corrections were made using the method of Benjamini-Hochberg[63] at multiple points during the study, as detailed above.
Data Availability Statement
The UK Biobank genomic and phenotypic data supporting this publication are publicly available from the Roslin Institute, University of Edinburgh (http://geneatlas.roslin.ed.ac.uk/). The Michigan Genomics Initiative (MGI) genomic and phenotypic data are not publicly available due to restrictions on participant privacy. MGI data can be made available on reasonable request to the corresponding author with permission of the University of Michigan Institutional Review Board. Detailed information on software, study design, and data availability can be found in the Life Sciences Reporting Summary associated with this manuscript.