An exploratory phenome wide association study linking asthma and liver disease genetic variants to electronic health records from the Estonian Biobank.

Glen James1, Sulev Reisberg2,3,4, Kaido Lepik2, Nicholas Galwey5, Paul Avillach6,7, Liis Kolberg2, Reedik Mägi8, Tõnu Esko8, Myriam Alexander5, Dawn Waterworth9, A Katrina Loomis10, Jaak Vilo2.   

Abstract

The Estonian Biobank, governed by the Institute of Genomics at the University of Tartu (Biobank), has stored genetic material/DNA and continuously collected data since 2002 on a total of 52,274 individuals, representing ~5% of the Estonian adult population, and this number is increasing. To explore the utility of data available in the Biobank, we conducted a phenome-wide association study (PheWAS) in two areas of interest to healthcare researchers: asthma and liver disease. We used 11 asthma and 13 liver disease-associated single nucleotide polymorphisms (SNPs), identified from published genome-wide association studies, to test our ability to detect established associations. We confirmed 2 asthma and 5 liver disease-associated variants at nominal significance and directionally consistent with published results. We found 2 associations in the opposite direction to published results (rs4374383:AA increases risk of NASH/NAFLD, rs11597086 increases ALT level). Three SNP-diagnosis pairs passed the phenome-wide significance threshold: rs9273349 and E06 (thyroiditis, p = 5.50×10⁻⁸); rs9273349 and E10 (type-1 diabetes, p = 2.60×10⁻⁷); and rs2281135 and K76 (non-alcoholic liver diseases, including NAFLD, p = 4.10×10⁻⁷). We have validated our approach and confirmed the quality of the data for these conditions. Importantly, we demonstrate that the extensive amount of genetic and medical information from the Estonian Biobank can be successfully utilized for scientific research.

Year:  2019        PMID: 30978214      PMCID: PMC6461350          DOI: 10.1371/journal.pone.0215026

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


2. Introduction

Genetic data are an important resource for scientific research and potential drug target identification [1], and genome-wide association studies (GWAS) have identified many disease-associated genetic variants [2]. Complementary to GWAS are phenome-wide association studies (PheWAS), which use a genotype-to-phenotype approach, testing for associations between specific genetic variants and a wide spectrum of phenotypes [3]. Combining genetic data with phenotypes defined and validated in electronic health records (EHRs) permits testing of associations between genetic variants and disease outcomes, including diagnoses and procedures not commonly found in GWAS. To date, despite a relative abundance of EHR and genetic data becoming available, few large scale PheWAS have linked these data [4]. Understanding the full range of associations, along with the functional mechanisms of causal genetic variants, will have important implications for the design of novel therapies across indications to reduce morbidity and mortality.
The Estonian Biobank, governed by the Institute of Genomics at the University of Tartu (Biobank) and launched in 2007, aimed to create a biobank with biological samples and genetic material with linkage to EHRs to investigate the genetic, environmental and behavioral background of common diseases in the Estonian population [5]. To explore the utility of data in the Biobank, we conducted a PheWAS in two areas of interest to healthcare researchers, asthma and liver disease, to determine whether established and novel disease associations together with early disease indicators could be detected.
Liver disease is a heterogeneous and complex disease, being the tenth highest cause of mortality worldwide [6]. Non-alcoholic fatty liver disease (NAFLD) is a type of liver disease defined by the presence of liver fat accumulation exceeding 5% of hepatocytes, in the absence of significant alcohol intake, viral infection, or any specific aetiology of liver disease. Patients with NAFLD may develop more serious conditions, such as non-alcoholic steatohepatitis (NASH), hepatic fibrosis, cirrhosis, and hepatocellular carcinoma, and there are currently no therapies approved to treat NAFLD or NASH. Due to the surge in prevalence of obesity, NAFLD is now considered more common than alcoholic liver disease. The prevalence of NAFLD in the general population is estimated to range between 12–18% in Europe and 27–38% in the US [7]. Single nucleotide polymorphisms (SNPs) associated with obesity, in particular in the patatin-like phospholipase domain-containing 3 (PNPLA3) gene, have been associated with liver disease and cirrhosis in GWAS [8-11]. Other SNPs reported to be associated with NAFLD, NASH, liver fat or fibrosis are in the APOC3, GCKR, MBOAT7, MERTK, PPP1R3B, SOD2, TM6SF2 and TRIB1 genes [7,8,11-29].
Asthma is a chronic inflammatory disease affecting the airways and lungs, characterized by recurring symptoms including breathlessness and wheezing caused by inhalation of environmental particulates such as allergens. Asthma requires both acute and long-term management for alleviation of symptoms, which vary in severity between individuals. It is estimated that 235 million people worldwide suffer from asthma [30]. A number of asthma-associated SNPs have been identified by GWAS and include variants near the TSLP, BIRC3, IL18R1, HLA-DQB1, IL33, GSDMB, GSDM1, IL2RB, SLC22A5, IL13, and RORA gene regions [31-33].

3. Objectives

To assess the utility of the genetic and phenotypic data available in the Estonian Biobank, our first two objectives were to demonstrate that we could detect previously reported GWAS associations of (1) asthma-associated SNPs with biomarkers (lab measures) and clinical diagnoses of asthma, and (2) liver disease-associated SNPs with clinical diagnoses of NAFLD/NASH within the Estonian Biobank. These two objectives served, first, to test the suitability of the Biobank for validation/replication studies and, second, to confirm the absence of systematic errors in the Biobank data before moving on to objective three: a PheWAS testing the association of the 24 selected SNPs with all other ICD-10 clinical diagnoses. Where significant associations between SNPs and ICD-10 diagnostic codes were found, our fourth objective was to determine associations of the corresponding SNPs with lab/biomarker measures (quantitative traits) recorded prior to the date of diagnosis.

4. Materials and methods

Database

The Estonian Biobank has had continuous data collection since 2002 on a total of 52,274 voluntary adult individuals, of whom 50,000 are currently active (alive and with data continuously updated) in the database, representing approximately 5% of the Estonian adult population. The Biobank can access Estonian primary care and secondary/inpatient care EHRs via linkage to a central e-health database to which EHR information is uploaded by healthcare professionals. Additionally, participant information is updated through linkage to other databases/registries, including the Estonian population registry, cancer registry, and death registry. This linkage ensures prospective and retrospective capture of patient and disease characteristics in the form of ICD-10 codes [5], demographics, lifestyle information, laboratory measurements, biomarker measurements, diagnoses and drug utilization (prescriptions and fill data). Additional information is captured by questionnaires, including lifestyle data, e.g. smoking and alcohol consumption. Questionnaire information is usually captured during patient enrolment; however, where new questionnaires are introduced, existing Biobank participants are invited to complete them to increase completeness of data. The number of participants in the Biobank with genetic data available is expected to exceed 150K by the end of 2019.

Case ascertainment

To be eligible for the study, patients were required to have genetic and EHR data linkage for at least one year (365 days) after recruitment. Patients were excluded if age, gender or genetic information was missing or if patients had no EHR linkage. ICD-10 diagnostic codes were used to determine clinical diagnoses.

SNP selection and array information

We conducted a thorough search of the literature to identify and select genetic variants associated with asthma or fatty liver disease (or hepatic fat) using GWAS methodology [7-21,23-29,31-38]. For one variant (rs4240624), the published association was with computed tomography-measured hepatic steatosis [25], and we wanted to test whether we could also identify an association with NAFLD. We then determined the availability of these SNPs on the Illumina Infinium Global Screening Array used by the Biobank to obtain genetic information. Where primary SNPs of interest were not available on this array, the Broad Institute SNP Annotation and Proxy Search (SNAP) tool was used to identify proxy SNPs (SNPs which can represent the primary SNP of interest) with a minimum linkage disequilibrium of r² ≥ 0.6 [39]. After these steps, 11 asthma-associated (1 proxy) and 13 liver disease-associated (8 proxies) SNPs of interest were included in our analyses (Table 1). For 4 of these SNPs, associations with one or more lab measurements (6 in total) were reported.
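The proxy criterion can be illustrated with the standard definition of r² for two biallelic loci. The following is a minimal sketch, not the SNAP tool's implementation; the function names and the haplotype-frequency inputs are assumptions made for illustration.

```python
def ld_r_squared(p_ab: float, p_a: float, p_b: float) -> float:
    """Squared correlation (r^2) between alleles at two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and allele B (locus 2)
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


def is_acceptable_proxy(r2: float, threshold: float = 0.6) -> bool:
    """Apply the minimum r^2 cut-off described in the text (0.6)."""
    return r2 >= threshold
```

For example, the proxy r² values listed in Table 1 (such as 0.677 for rs2846848) satisfy `is_acceptable_proxy`, whereas a hypothetical value of 0.59 would not.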

PheWAS codes

ICD-10 diagnostic codes were mapped to “PheWAS codes” using methodology from Neuraz et al. [40]. This involved converting individual ICD-10 codes into a higher order of grouped codes for a single disease, e.g. grouping codes A00, A00.0, A00.1 and A00.9 for cholera into the three-character code A00 (“PheWAS code”), and using the broader A00-A09 range (intestinal infectious diseases) as an exclusion criterion for the control group (S1 Table). If a significant association was observed, we repeated the analysis using the individual ICD-10 code to define the case group while keeping the control group criteria unchanged, e.g. A00.0 as the case group and A00-A09 as the exclusion criterion.
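The grouping and control-exclusion logic described above can be sketched as follows. This is an illustrative simplification that assumes the PheWAS code is simply the first three characters of the ICD-10 code; the actual mapping used in the study is given in S1 Table and may differ. All function names are hypothetical.

```python
def phewas_code(icd10: str) -> str:
    """Collapse an ICD-10 code such as 'A00.9' to its three-character category 'A00'."""
    return icd10.strip().upper()[:3]


def in_range(code3: str, start: str, end: str) -> bool:
    """True if a three-character code lies in [start, end], e.g. A00-A09.

    Lexicographic comparison is sufficient for codes of the form letter + two digits.
    """
    return start <= code3 <= end


def classify(patient_codes, case_code: str, excl_start: str, excl_end: str) -> str:
    """Label a patient as 'case', 'excluded' or 'control' for one PheWAS code."""
    codes3 = {phewas_code(c) for c in patient_codes}
    if case_code in codes3:
        return "case"
    # Patients with a related diagnosis are removed from the control group.
    if any(in_range(c, excl_start, excl_end) for c in codes3):
        return "excluded"
    return "control"
```

With the cholera example, a patient with A00.9 is a case for A00, a patient with only A04.7 is excluded from the controls, and a patient with only J45.0 is a control.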

Other covariates, laboratory measures and biomarkers

To describe the patient population, we extracted data on a range of covariates including birth year, gender, ethnicity, smoking status and body mass index (BMI). Laboratory and biomarker measures included sodium, potassium, blood urea, blood glucose, creatinine, C-reactive protein, haemoglobin, platelet count, B-type natriuretic peptide / brain natriuretic peptide, troponin I, troponin T, white blood cell count, neutrophil count, eosinophil count, serum uric acid, fibrinogen, alpha-fetoprotein, gamma-glutamyl transferase, alanine transaminase (ALT), alkaline phosphatase, albumin, aspartate transaminase, bilirubin, total cholesterol, high-density lipoprotein cholesterol and low-density lipoprotein cholesterol.

Data extraction and analysis

Biobank participant information, e.g. lifestyle information, was extracted directly from the Estonian Biobank questionnaires completed during recruitment. Incident diagnoses were extracted from a local copy of the central e-health database, and laboratory measures were extracted from both the e-health database and local copies of the databases of the two main Estonian hospitals. In total, more than 1.6 million diagnoses were recorded across 12,845 different ICD-10 codes, which were grouped into 1,888 higher-level ICD-10 (PheWAS) codes used in this analysis. There were over 640,000 lab/biomarker measurements available; however, for 42% of the participants no measurements were recorded. Summary statistics for each laboratory measurement can be found in S1 Text. SNP data were called from the Illumina Infinium Global Screening Array. After variant calling, the data were filtered using PLINK software [41]. Sample-wise filters were: call rate >95%, no sex mismatches between phenotype and genotype data, and heterozygosity within 3 standard deviations of the average heterozygosity over all samples, to eliminate possible inbreeding and DNA contamination. Marker-wise filters were: Hardy-Weinberg equilibrium (HWE) p-value >1×10⁻⁶, call rate >95%, Illumina GenomeStudio GenTrain score >0.6 and Cluster Separation Score >0.4. SNPs of interest were then extracted from the data. Analyses for this study were conducted in R version 3.4.3 using the PheWAS R package [42], customized to allow use of ICD-10 codes. The high-performance computing center at the University of Tartu was used for analysis. Calculations for statistical power can be found in S2 Text. Logistic regression models were used to evaluate the association of SNP/proxy SNP variation with ICD-10 diagnoses. Odds ratios (ORs) and 95% confidence intervals (CI) were estimated. Linear regression models were used to evaluate associations between SNPs and laboratory measurements.
Most of the laboratory measurements (15 out of 26) were log-transformed due to positive skewness, assessed graphically (see Fig A in S1 Text). Prior to these log-transformations, zero values were replaced with the minimum non-zero value of the same variable. Data for laboratory measures were taken from 2008 to 2013 (Fig A in S3 Text), because the central collection of lab measurements had only just started and not all labs recorded lab test results until 2008, which could cause bias if earlier measurements were included in the analysis.
For objective 1, cases for asthma were defined as ever having been diagnosed with ICD-10 codes J45-J46, and controls were defined from the remaining individuals as never (prospectively and retrospectively in the patients’ medical records) having been diagnosed with ICD-10 codes J40-J47. Similarly, for objective 2, cases for NASH/NAFLD were defined as having ICD-10 codes K75.8 or K76.0 and controls as individuals never having been diagnosed with ICD-10 codes K70-K77. The full list of J40-J47 and K70-K77 diagnoses with their counts is given in S4 Text. Altogether, 21 previously reported SNP-disease and 6 SNP-lab measurement (covering 4 SNPs out of 24) association validation tests were performed for these objectives. For objective 3 (PheWAS), each PheWAS code was tested individually, and only PheWAS codes with at least 20 cases and 20 controls were included in the analyses. To explore what drives the detected significant associations, the exact ICD-10 code was used instead of the PheWAS code in the subsequent analysis.
A p-value significance threshold of 0.05 was used for objectives 1 and 2 (confirmation of liver and asthma SNP associations). For objective 3, a p-value threshold of 2.0×10⁻⁶ (≈0.05/1,000/24, where 1,000 is the effective number of ICD-10 diagnoses tested and 24 is the number of SNPs tested) was used to reduce the likelihood of chance associations and identify the most prominent differences.
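The arithmetic behind the phenome-wide threshold is a Bonferroni-style correction, which can be reproduced in a few lines. This is a minimal sketch; the function and constant names are illustrative, and the three p-values are those reported in the abstract.

```python
def bonferroni_threshold(alpha: float, n_phenotypes: int, n_snps: int) -> float:
    """Divide the nominal alpha by the total number of tests (phenotypes x SNPs)."""
    return alpha / (n_phenotypes * n_snps)


# 1,000 effective ICD-10 diagnoses and 24 SNPs, as stated in the text.
PHEWAS_THRESHOLD = bonferroni_threshold(0.05, 1000, 24)  # about 2.08e-6

# Phenome-wide significant p-values reported in the abstract.
REPORTED_P = {
    "rs9273349-E06": 5.50e-8,
    "rs9273349-E10": 2.60e-7,
    "rs2281135-K76": 4.10e-7,
}

significant = {pair: p for pair, p in REPORTED_P.items() if p < PHEWAS_THRESHOLD}
```

All three reported pairs fall below the computed threshold, consistent with the results described later in the paper.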
To search for evidence of systematic bias, a QQ-plot of PheWAS p-values was used to evaluate whether the observed distribution was different from what would be expected under the null hypothesis.
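The points of such a QQ-plot can be computed by comparing sorted observed p-values with the quantiles expected under a uniform null distribution. The sketch below is an illustration only: the plotting position (rank − 0.5)/n is one common convention and an assumption here, not necessarily the one used in the study.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p), smallest p-value first.

    All p-values must be strictly greater than 0.
    """
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = (rank - 0.5) / n  # uniform quantile under the null hypothesis
        points.append((-math.log10(expected), -math.log10(p)))
    return points
```

Points lying on the diagonal indicate agreement with the null hypothesis; points above the diagonal indicate an excess of small p-values, which may reflect true associations or systematic inflation.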

Protection of human subjects

Research at the Estonian Biobank is regulated by the Human Gene Research Act, and all participants have signed a broad informed consent. IRB approval for the current study was granted by the Research Ethics Committee of the University of Tartu, approval no. 236/T-23.

5. Results

After applying the exclusion criteria, 26,766 (51.2%) Caucasian individuals remained from the original 52,274. The main reason for the drop in sample number was missing genotype data (individuals not genotyped with the given genotyping array). In this study, most participants were female (71.8%), 41.6% of individuals had a body mass index (BMI) between 18.5 and 25 (normal weight), 59.2% were never smokers and 75.9% were of Estonian nationality.

Preselected SNPs and risk of asthma and fatty liver disease

Two asthma-associated SNPs (rs11071559 (RORA) and rs1837253 (TSLP)) passed the significance threshold of p < 0.05 (Table 4) and were directionally consistent with previous studies. While Moffatt et al. showed that rs1837253 significantly reduces the risk of severe asthma in one of their datasets, this did not replicate in the other, and the direction of association with asthma was the opposite (increase) in the full dataset [31]. We also observed an increased risk associated with allele C of rs1837253 in our data. From the lab measurement tests, we confirmed that each additional effect allele T of rs2846848 (BIRC3) significantly reduced neutrophil levels (Table 4). All other associations were non-significant.
Five fatty liver disease-associated SNPs (rs780094 (GCKR), rs2281135 (PNPLA3), rs8418 (PNPLA3), rs58542926 (TM6SF2), rs2980875 (TRIB1)) passed the significance threshold of p < 0.05 (Table 5) and were directionally consistent with previous studies. Additionally, we found that the AA genotype of rs4374383 (MERTK) significantly increases the chance of developing NASH/NAFLD, which is the opposite direction of effect to that shown by Patin et al. [29]. For rs9992651 (HSD17B13) and rs11597086 (ERLIN1) we also assessed the association with ALT levels. In our analysis, both SNPs were significantly associated with increased ALT levels, but for rs11597086 this was opposite in direction to the finding of Yuan et al. [21].
We could replicate only some of the previously reported associations for asthma or fatty liver disease-associated SNPs. This is likely due to the small number of cases/measurements for each of these conditions (Tables 4 and 5) and to limitations of EHRs and ICD-10 codes, which might not always give clear signals for all diseases (see Discussion). However, while many of the associations we attempted to confirm were not strong enough in our analysis to pass the significance threshold, almost all of the effect directions were concordant between our study and the original studies, and the QQ-plot of p-values from our confirmatory analysis showed clear enrichment of small p-values. We can consequently expect the Estonian Biobank data to be suitable for the PheWAS of these SNPs, though we might lack sufficient power to detect modest association signals for specific diseases.
Table 4

Association between genetic variants and asthma (J45-J46) diagnosis/laboratory measurements in the Estonian Biobank.

Gene | SNP / Proxy SNP and effect allele | Author | Study Size | Study Effect: OR (95% CI), p-value | Biobank Case/Control Size (total size for continuous variables) | Biobank Effect Size: OR (95% CI), p-value
RORA | rs11071559:T | Moffatt [31] | 10,365/16,110 | OR = 0.85 (0.79–0.90), p = 7.9×10⁻⁷ | 3,424/21,020 | OR = 0.88 (0.83–0.95), p = 0.00036
TSLP | rs1837253:C | Moffatt [31] | 290/974 | Severe asthma OR = 0.56 in one dataset (p = 3×10⁻⁶); Asthma OR = 1.15 (1.08–1.22), p = 7.5×10⁻⁸ | 3,424/21,020 | Asthma OR = 1.08 (1.02–1.15), p = 0.0059
TSLP | rs1837253:C | Astle [33] | Total study size 173,480 (case/control size not reported) | Increased Eosinophil Count (p = 2.1×10⁻¹⁷); Decreased Neutrophil Count (p = 3.5×10⁻¹²) | 8,040 Eosinophil measurements; 8,174 Neutrophil measurements | Eosinophil/neutrophil counts increased/decreased, but non-significant
IL33 | rs1342326:C | Moffatt [31] | 10,365/16,110 | OR = 1.22 (1.14–1.30), p = 1.4×10⁻⁸ | 3,424/21,020 | OR = 1.08 (0.99–1.17), p = 0.07
BIRC3 | rs7127583 / rs2846848:T | Roscioli [32] | 401 | Reduced Eosinophil Count p = 0.002; Reduced Neutrophil Count p = 0.005 | 8,040 Eosinophil measurements; 8,174 Neutrophil measurements | Reduced Neutrophil Count (p = 0.018); Reduced Eosinophil Count globally non-significant
IL18R1 | rs3771166:A | Moffatt [31] | 10,365/16,110 | OR = 0.87 (0.83–0.91), p = 1.7×10⁻⁸ | 3,424/21,020 | OR = 0.97 (0.92–1.03), p = 0.34
HLA-DQB1 | rs9273349:C | Moffatt [31] | 10,365/16,110 | OR = 1.19 (1.13–1.25), p = 2.0×10⁻¹¹ | 3,424/21,020 | OR = 0.99 (0.94–1.04), p = 0.68
GSDMB | rs2305480:A | Moffatt [31] | 10,365/16,110 | OR = 0.82 (0.79–0.86), p = 3.3×10⁻¹⁶ | 3,424/21,020 | OR = 0.98 (0.93–1.03), p = 0.47
GSDM1 | rs3894194:A | Moffatt [31] | 10,365/16,110 | OR = 1.18 (1.13–1.23), p = 2.0×10⁻¹³ | 3,424/21,020 | OR = 1.03 (0.98–1.09), p = 0.25
IL2RB | rs2284033:A | Moffatt [31] | 10,365/16,110 | OR = 0.90 (0.86–0.94), p = 4.8×10⁻⁶ | 3,424/21,020 | OR = 0.98 (0.93–1.03), p = 0.37
SLC22A5 | rs2073643:C | Moffatt [31] | 10,365/16,110 | OR = 0.90 (0.86–0.94), p = 6.2×10⁻⁶ | 3,424/21,020 | OR = 0.98 (0.93–1.03), p = 0.35
IL13 | rs1295686:C | Moffatt [31] | 10,365/16,110 | OR = 0.87 (0.83–0.92), p = 7.9×10⁻⁷ | 3,424/21,020 | OR = 0.97 (0.92–1.03), p = 0.32
Table 5

Association between genetic variants and NASH/NAFLD diagnosis (K75.8 or K76.0)/laboratory measurements in the Estonian Biobank.

Gene | SNP / Proxy SNP | Author | Study Size | Study Effect | Biobank Case/Control Size (total size for continuous variables) | Biobank Effect Size: OR (95% CI), p-value
APOC3 | rs2854117 / rs2849176:T | Petersen [23] | 258, prevalence of NAFLD not reported | Increased prevalence of NAFLD (p<0.001) | 625/25,097 | OR = 1.00 (0.89–1.12), p = 0.99
APOC3 | rs2854116:C | Petersen [23] | 258, prevalence of NAFLD not reported | Increased prevalence of NAFLD (p<0.001) | 625/25,097 | OR = 1.02 (0.91–1.14), p = 0.79
GCKR | rs780094:T | Speliotes [25] | 592/1,405 | Effect allele T: NAFLD OR = 1.45 (1.17–1.57), p = 2.6×10⁻⁸ | 625/25,097 | OR = 1.14 (1.02–1.27), p = 0.026
GCKR | rs780094:T | Yang [26] | 436/467 | Effect allele T: NAFLD OR = 1.61 (1.14–2.27), p = 0.0072 | |
GCKR | rs1260326:T / rs780094:T | Petit [27] | 201/107 | Steatosis, OR = 1.99 (1.14–3.47), p = 0.01 | |
MBOAT7 | rs641738:T | Mancina [28] | 2,736, case group size not reported | NAFLD OR = 1.20 (1.05–1.37), p = 0.006 | 625/25,097 | OR = 1.03 (0.92–1.16), p = 0.55
MERTK | rs4374383:A | Patin [29] | 57/239 | Advanced fibrosis OR = 0.18 (0.09–0.36), p = 1.1×10⁻⁹ (recessive model, AA required) | 625/25,097 | OR = 1.12 (1.01–1.26), p = 0.03
PNPLA3 | rs738409:G / rs2281135:A | Kitamoto [24] | 540/1,012 | NAFLD OR = 2.20 (1.78–2.72), p = 4.1×10⁻¹³ | 625/25,097 | OR = 1.40 (1.23–1.59), p = 4.9×10⁻⁷
PNPLA3 | rs738409:G / rs2281135:A | Speliotes [25] | 592/1,405 | NAFLD OR = 3.26 (2.11–7.21), p = 3.6×10⁻⁴³ | |
PNPLA3 | rs2294918:G / rs8418:G | Donati [12] | 142/100 | NAFLD, p = 0.0009, OR not given | 625/25,097 | OR = 1.14 (1.01–1.28), p = 0.030
PPP1R3B | rs4240624:A / rs4841132:G | Speliotes [25] | 592/1,405 | Significant effect for computed tomography measured hepatic steatosis (p = 3.6×10⁻¹⁸); NAFLD OR = 0.93 (0.68–1.18), p = 0.285 | 625/25,097 | OR = 0.86 (0.71–1.03), p = 0.10
SOD2 | rs4880:G | Al-Serri [16] | 179/323 | Advanced fibrosis OR = 1.56 (1.09–2.25), p = 0.014 | 625/25,097 | OR = 1.00 (0.90–1.12), p = 0.96
TM6SF2 | rs58542926:T | Bale [17] | 256/247 | NAFLD OR = 2.7 (1.37–5.3), p = 0.0004 | 625/25,097 | OR = 1.29 (1.06–1.58), p = 0.013
TM6SF2 | rs58542926:T | Liu [15] | 437/637 | Advanced fibrosis OR = 1.88 (1.41–2.5), p = 1.6×10⁻⁵ | |
TRIB1 | rs2954021:A / rs2980875:A | Kitamoto [24] | 540/1,012 | NAFLD OR = 1.52 (1.23–1.88), p = 9.7×10⁻⁵ | 625/25,097 | OR = 1.20 (1.07–1.34), p = 0.001
HSD17B13 | rs6834314 / rs9992651:G | Chambers [43] | 61,089 | Increase of ALT concentration in plasma per copy of effect allele rs6834314 A: OR = 2.6 (1.9–3.4), p = 3.1×10⁻⁹ | 9,107 ALT measurements | Increased ALT level (p = 0.012)
ERLIN1 | rs2862954 / rs11597086:C | Yuan [21] | 7,715 | rs11597086 C: Decreased ALT level (p = 1.8×10⁻⁸) | 9,107 ALT measurements | Increased ALT level (p = 0.0056)

PheWAS association between pre-selected SNPs and ICD-10 codes

We conducted a PheWAS to test the association of the 24 investigated asthma and liver disease SNPs with all ICD-10 clinical diagnoses. Three SNP-diagnosis pairs passed the PheWAS significance threshold: rs9273349 (HLA-DQB1) and E06 (thyroiditis); rs9273349 (HLA-DQB1) and E10 (type-1 diabetes); and rs2281135 (PNPLA3) and K76 (non-alcoholic liver diseases, including NAFLD). The QQ-plot of all the association p-values (Fig B in S2 Text) does not indicate any systematic bias in our study (e.g. due to population stratification); some inflation is expected because the SNPs analyzed are known to be disease-associated.

PheWAS Plot for rs9273349.

The pink line corresponds to the significance threshold. Groups were defined by ICD-10.

To explore which exact diagnoses drive the detected associations, we repeated the analysis with the individual ICD-10 diagnosis codes instead of PheWAS codes. We found that rs9273349 is associated with autoimmune thyroiditis (E06.3) and with insulin-dependent diabetes mellitus without complications (E10.9), with ophthalmic complications (E10.3) and with multiple complications (E10.7). rs2281135 is associated with fatty (change of) liver, not elsewhere classified, including NAFLD (K76.0).

Associations between significant PheWAS results and laboratory measurements

Using the rs9273349-E06, rs9273349-E10 and rs2281135-K76 pairs, we identified the lab/biomarker measures (quantitative traits) closest in time prior to diagnosis. No significant effects were observed at a significance threshold of 9.6×10⁻⁴ (≈0.05/2/26, where 2 is the number of distinct SNPs and 26 is the number of lab/biomarker measures), owing to the very small sample sizes that remained after discarding all measurements from patients without the underlying diagnosis. Using all measurements, regardless of whether a patient had been diagnosed with the disease, only rs9273349 (HLA-DQB1) passed the PheWAS significance threshold, with the average cholesterol level (over all of a patient’s measurements) decreasing per additional effect allele (p = 1.1×10⁻⁶).
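Selecting the measurement closest in time prior to diagnosis, and computing the corrected threshold, can be sketched as below. This is an illustrative sketch only: the function name is hypothetical, and treating "prior" as strictly before the diagnosis date is an assumption, since the text does not say how same-day measurements were handled.

```python
from datetime import date


def closest_prior_measurement(measurements, diagnosis_date):
    """Return the latest (date, value) pair recorded strictly before diagnosis_date.

    measurements: iterable of (date, value) tuples for one patient and one lab test.
    Returns None if the patient has no measurement before the diagnosis.
    """
    prior = [m for m in measurements if m[0] < diagnosis_date]
    if not prior:
        return None
    return max(prior, key=lambda m: m[0])


# Corrected threshold for 2 distinct SNPs and 26 lab/biomarker measures.
LAB_THRESHOLD = 0.05 / (2 * 26)  # about 9.6e-4
```

Patients for whom the function returns `None` contribute no data point, which is one reason the sample sizes in this analysis were very small.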

6. Discussion

This is the first PheWAS using Estonian Biobank data linked to EHRs. Large scale PheWAS are scarce [44], with other PheWAS utilising mostly small cohorts [42,45-52]. Recently, large scale PheWAS have become possible in the UK Biobank, but other biobanks will still be required for further validation or replication. To assess the Estonian Biobank’s suitability for such a study or for association validation/replication purposes, we focused on asthma and liver disease-associated SNPs only and, as a first task, investigated whether the effects previously reported in the literature (GWAS) could also be detected in our data. This also served to confirm that the data have no systematic errors and are suitable for PheWAS analysis. We replicated previous GWAS results, reporting a significant association of TSLP and RORA gene variants with asthma [31,33] (Table 4) and of GCKR, PNPLA3, TRIB1 and TM6SF2 gene variants with the risk of developing liver diseases, notably NAFLD/NASH [8,11-15,20,24-26,29,53] (Table 5). We observed decreasing neutrophil levels per additional effect allele of the BIRC3 gene variant, consistent with previously reported results. In addition, variants in HSD17B13 and ERLIN1 influence alanine transaminase (ALT) levels. The HSD17B13 association is consistent with a recent study reporting that a loss-of-function variant was associated with decreased levels of ALT and aspartate aminotransferase (AST) and with reduced risk of liver disease and of progression from NAFLD to NASH [22]. Our observed association between rs11597086 (ERLIN1) and ALT level is in the opposite direction to what has been previously reported [21]. The limited replication of previously published GWAS results in asthma and liver disease is likely due to smaller case numbers in our data than in the original studies, leading to low power; with continuous enrolment and data collection, statistical power will improve over time, or it could be enhanced by meta-analysis with other studies.
From the PheWAS, we identified associations of the asthma-associated genetic variant rs9273349 with type 1 diabetes and autoimmune thyroiditis. rs9273349 is a variant of the major histocompatibility complex, class II, DQ beta 1 (HLA-DQB1) gene, which has an important role in the immune system. The HLA-DQB1 protein is anchored to the cell surface membrane and presents peptides derived from extracellular proteins to the immune system [54]. A number of studies have identified associations of HLA-DQB1 with thyroiditis (notably autoimmune Hashimoto’s thyroiditis) and type 1 diabetes [55-59]. Recently, a study by Verma et al. detected an association between acquired hypothyroidism (usually caused by thyroiditis) and rs17843604, a SNP in the same HLA-DQA1/B1 region [44]. Thyroiditis is a complex immune disorder of unknown aetiology in which an infiltration of T and B lymphocytes occurs as a reaction to thyroid antigens. These B and T lymphocytes then produce thyroid autoantibodies, resulting in clinical hypo- or hyperthyroidism [55]. Type 1 diabetes is also a T lymphocyte-driven disease, which results in the destruction of insulin-producing pancreatic islet cells. As suggested in [31,60,61], the amino acid variation of HLA-DQB1 variants could cause differential and incorrect binding of peptides, altering cellular sensitization and resulting in cellular hypo- or hyperactivity that disrupts homeostatic immune function. Additionally, Mosaad et al. observed microalbuminuria in type 1 diabetic patients, suggesting that alterations in HLA-DQB1 expression affect homeostatic regulation [59].
This study has a number of strengths, which include a large sample of ICD-10 diagnoses and quantitative measures linked to 26,766 genotyped individuals, relatively long follow-up time, integration of questionnaire data, and linkage of EHR data with laboratory and other databases. However, there are a number of limitations to consider.
The large proportion of females may introduce bias and reduce generalizability to the general population. However, this is a general problem of all voluntary biobank cohorts as women tend to enroll more actively than men [5]. Furthermore, EHR-linked data are not collected for research purposes, and without strict standards for data collection and format, the quality of the data may vary broadly (missing data, non-coherent format, errors on data insertion etc.). Even when an association between a SNP and an ICD-10 diagnostic code or biomarker/lab value is clearly identified, this does not imply a causal variant. ICD-10 coding may be a limitation itself as in some cases it can be difficult to ascertain the exact condition. The limitation of using a biomarker/lab measure closest to diagnosis is that it may not be the most sensitive or specific predictor of disease. Additionally, these measures do not take into account potential confounders e.g. response to medication, age and time-period effects. Furthermore, small sample sizes for specific conditions limit statistical power. For example, many diseases are rare and due to the fine granularity of ICD-10 codes, using the exact codes results in very small sample sizes for case groups. With the small number of cases, the power to detect an association is limited. Therefore, it is helpful to conduct a two-step PheWAS by first using higher level diagnosis codes to screen for any association signals, and subsequently evaluating the more detailed codes to understand where specifically the association is coming from. In our study, after detecting the association between rs9273349 and E06 (thyroiditis), an exact diagnosis analysis revealed that the association was driven by E06.3 (autoimmune thyroiditis). Similarly, the association between rs2281135 and K76 “Other diseases of liver, non-alcoholic fatty liver disease (NAFLD)” is actually driven by more specific code K76.0 “Fatty (change of) liver, not elsewhere classified”. 
In conclusion, this is the first PheWAS conducted using the Estonian Biobank, and it demonstrates that the extensive amount of genetic and medical information available can be successfully utilized for scientific research. The Estonian Biobank has 50K participants, all of whom have signed informed consent allowing regular data updates from all health databases throughout their lives. Notably, its value is increasing over time, not only because of the continuous addition of EHR data and the vast amount of genetic data available, but also because the number of Estonian Biobank participants with genetic data available is expected to rise to 150K by the end of 2019. We showed that Biobank data can be effectively used as a validation/replication database. We replicated 9 GWAS associations for asthma and liver disease-associated SNPs and found the opposite effect directions for 2 associations (rs4374383 increases the risk of NAFLD, rs11597086 increases ALT level). Furthermore, this PheWAS, exploring the association of 11 asthma-associated and 13 liver disease-associated SNPs with other diseases and biomarkers, is one of few studies to use ICD-10 diagnostic codes to link genetic data with EHRs. Although we did not detect any novel associations in our study, we were able to confirm 3 phenome-wide significant associations based on our data: rs9273349 and thyroiditis, rs9273349 and type-1 diabetes, and rs2281135 and non-alcoholic liver diseases, including NAFLD. Considering also the continuous addition of new data, this highlights the usability of the Estonian Biobank as a validation database and for conducting extended PheWAS with much larger sets of SNPs in the future.

ICD-10 codes to PheWAS code map. (XLSX)

Summary statistics and distributions for each laboratory/biomarker measurement. (DOCX)

Calculations for statistical power. (DOCX)

Supplementary figures. (DOCX)

Liver disease and asthma ICD-10 diagnostic codes with occurrence and patient count. (DOCX)

Table 1

SNPs / Proxy SNPs of the genes of interest.

| Gene | Phenotype | SNP | Proxy SNP* | R2* | Effect Allele (%) | Hardy-Weinberg Equilibrium p-value |
|---|---|---|---|---|---|---|
| Asthma Phenotype | | | | | | |
| TSLP | Asthma, increased eosinophil count, decreased neutrophil count | rs1837253 | NA | NA | C (71) | 0.190 |
| BIRC3 | Asthma, reduced eosinophil and neutrophil count | rs7127583 | rs2846848 | 0.677 | T (29) | 0.011 |
| IL18R1 | Asthma | rs3771166 | NA | NA | A (26) | 0.374 |
| HLA-DQB1 | Asthma | rs9273349 | NA | NA | C (57) | 0.316 |
| IL33 | Asthma | rs1342326 | NA | NA | C (10) | 0.194 |
| GSDMB | Asthma | rs2305480 | NA | NA | A (43) | 0.261 |
| GSDM1 | Asthma | rs3894194 | NA | NA | A (46) | 0.210 |
| IL2RB | Asthma | rs2284033 | NA | NA | A (46) | 0.994 |
| SLC22A5 | Asthma | rs2073643 | NA | NA | C (48) | 0.567 |
| IL13 | Asthma | rs1295686 | NA | NA | C (69) | 0.751 |
| RORA | Asthma | rs11071559 | NA | NA | T (19) | 0.684 |
| Liver Phenotype | | | | | | |
| APOC3 | Hepatic Fat | rs2854117 | rs2849176 | 0.714 | T (42) | 0.948 |
| APOC3 | Hepatic Fat | rs2854116 | NA | NA | C (44) | 0.939 |
| GCKR | NAFLD | rs780094 | NA | NA | T (39) | 0.306 |
| GCKR | NAFLD | rs1260326 | rs780094 | 0.933 | | |
| MBOAT7 | NAFLD | rs641738 | NA | NA | T (42) | 0.239 |
| MERTK | HCV fibrosis progression, NAFLD fibrosisφ | rs4374383 | NA | NA | A (36) | 0.347 |
| PNPLA3 | NAFLD / NASH | rs738409 | rs2281135 | 0.688 | A (19) | 0.362 |
| PNPLA3 | Liver density, NAFLD | rs2294918 | rs8418 | 0.890 | G (60) | 0.388 |
| PPP1R3B | Computed tomography measured hepatic steatosis | rs4240624 | rs4841132 | 1.000 | G (91) | 0.835 |
| SOD2 | Fibrosis in NAFLD | rs4880 | NA | NA | G (55) | 0.021 |
| TM6SF2 | NAFLD/NASH | rs58542926 | NA | NA | T (7) | 0.721 |
| TRIB1 | NAFLD, Lipids | rs2954021 | rs2980875 | 0.780 | A (47) | 0.959 |
| HSD17B13 | NAFLD, increased ALT level | rs6834314 | rs9992651 | 0.823 | G (77) | 0.338 |
| ERLIN1 | NAFLD, decreased ALT level | rs2862954 | rs11597086 | 0.669 | C (41) | 0.878 |

*Where primary SNPs of interest were not available on this array, the Broad Institute SNP Annotation and Proxy Search (SNAP) tool was used to identify appropriate proxy SNPs (SNPs that can represent the SNP of interest) with a minimum r2 value of 0.6. r2 is a measure of linkage disequilibrium, the non-random association between alleles at different loci, and provides an approximate measure of how reliably a proxy SNP represents a primary SNP.
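As an illustration of the r2 statistic mentioned in the footnote, the sketch below computes it from haplotype and allele frequencies using the standard definition r2 = D^2 / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA pB. The frequencies and the function name `ld_r2` are hypothetical. The study took its r2 values from the SNAP tool, not from a calculation like this one.

```python
def ld_r2(p_ab: float, p_a: float, p_b: float) -> float:
    """r^2 between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and allele B at locus 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Hypothetical frequencies, for illustration only.
weak = ld_r2(p_ab=0.40, p_a=0.50, p_b=0.50)    # about 0.36
strong = ld_r2(p_ab=0.45, p_a=0.50, p_b=0.50)  # about 0.64

# Apply the minimum r2 of 0.6 stated in the footnote.
for name, value in [("weak", weak), ("strong", strong)]:
    print(name, round(value, 2), "usable as proxy" if value >= 0.6 else "rejected")
```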

Table 2

Biobank study attrition.

| Exclusion Criteria Applied | Number of Patients Remaining (%) | Number of Patients Removed |
|---|---|---|
| Total Database Population | 52,274 (100) | NA |
| Has Genotype Data | 32,831 (62.8) | 19,443 |
| Has EHR Linked Data | 26,808 (51.3) | 6,023 |
| Inside Study Period | 26,789 (51.2) | 19 |
| Age >18 | 26,766 (51.2) | 23 |
| Not Missing Gender | 26,766 (51.2) | 0 |
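The attrition arithmetic in Table 2 can be reproduced with a short sketch. It assumes that each percentage is taken relative to the total database population and that each "removed" count is the difference from the preceding row. The counts are copied from the table, and the variable names are illustrative.

```python
# Counts remaining after each step, copied from Table 2.
steps = [
    ("Total Database Population", 52274),
    ("Has Genotype Data", 32831),
    ("Has EHR Linked Data", 26808),
    ("Inside Study Period", 26789),
    ("Age >18", 26766),
    ("Not Missing Gender", 26766),
]

total = steps[0][1]
rows = []
for (_, previous), (label, remaining) in zip(steps, steps[1:]):
    removed = previous - remaining              # patients dropped by this criterion
    percent = round(100 * remaining / total, 1)  # share of the full database
    rows.append((label, remaining, percent, removed))

for row in rows:
    print(row)
```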
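Table 3

PheWAS study participant characteristics.

| Characteristic | N | % |
|---|---|---|
| Gender | | |
| Female | 19,224 | 71.8 |
| Male | 7,542 | 28.2 |
| Smoking Status | | |
| Current | 7,151 | 26.7 |
| Former | 3,721 | 13.9 |
| Never | 15,853 | 59.2 |
| Unknown | 41 | 0.2 |
| Body Mass Index | | |
| <18.5 | 467 | 1.7 |
| 18.5–25 | 11,127 | 41.6 |
| 25–30 | 8,721 | 32.6 |
| 30+ | 6,423 | 24.0 |
| Unknown | 28 | 0.1 |
| Nationality | | |
| Estonian | 20,320 | 75.9 |
| Russian | 5,307 | 19.8 |
| Other | 1,139 | 4.3 |
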
Table 6

Results of the PheWAS using the PheWAS codes and passing the PheWAS significance threshold.

| Phenotype | SNP | Number of cases/controls (single diagnosis required) | OR (95% CI), p-value (single diagnosis required) |
|---|---|---|---|
| E06 (Thyroiditis) | rs9273349 | 2,458/20,382 | OR = 1.182 (1.113–1.255), p = 5.52 × 10^-8 |
| E10 (Type 1 Diabetes) | rs9273349 | 719/23,268 | OR = 1.331 (1.194–1.484), p = 2.55 × 10^-7 |
| K76 (non-alcoholic liver diseases, including NAFLD) | rs2281135 | 1,041/25,097 | OR = 1.309 (1.179–1.452), p = 4.05 × 10^-7 |
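As a rough consistency check on Table 6, the sketch below back-calculates a z statistic and a two-sided p-value from each odds ratio and its 95% confidence interval. It assumes the intervals are Wald intervals, symmetric on the log-odds scale, which the text here does not state. The function name `wald_from_ci` is illustrative. Because the inputs are rounded to three decimals, the results should only be expected to fall in the same order of magnitude as the printed p-values.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def wald_from_ci(odds_ratio: float, lower: float, upper: float):
    """Recover an approximate Wald z and two-sided p-value from an OR and its 95% CI."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)  # SE of log(OR)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

# OR, lower and upper bounds copied from Table 6.
table6 = {
    "E06 / rs9273349": (1.182, 1.113, 1.255),
    "E10 / rs9273349": (1.331, 1.194, 1.484),
    "K76 / rs2281135": (1.309, 1.179, 1.452),
}

results = {name: wald_from_ci(*row) for name, row in table6.items()}
for name, (z, p) in results.items():
    print(f"{name}: z = {z:.2f}, p = {p:.1e}")
```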
Table 7

Results of the PheWAS using exact ICD-10 diagnosis codes and passing the PheWAS significance threshold.

| Phenotype | SNP | Number of cases/controls (single diagnosis required) | OR (95% CI), p-value (single diagnosis required) |
|---|---|---|---|
| E10.7 (Insulin-dependent diabetes mellitus with multiple complications) | rs9273349 | 215/23,268 | OR = 2.021 (1.634–2.500), p = 8.6 × 10^-11 |
| E06.3 (Autoimmune thyroiditis) | rs9273349 | 1,986/20,382 | OR = 1.203 (1.126–1.286), p = 4.7 × 10^-8 |
| E10.9 (Insulin-dependent diabetes mellitus without complications) | rs9273349 | 288/23,268 | OR = 1.630 (1.367–1.943), p = 5.3 × 10^-8 |
| E10.3 (Insulin-dependent diabetes mellitus with ophthalmic complications) | rs9273349 | 106/23,268 | OR = 2.352 (1.720–3.216), p = 8.6 × 10^-8 |
| K76.0 (Fatty (change of) liver, not elsewhere classified, including NAFLD) | rs2281135 | 605/25,097 | OR = 1.251 (1.251–1.630), p = 1.2 × 10^-7 |
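A simple sanity check on rows like those in Table 7 is to test whether each odds ratio sits at the log-scale midpoint of its confidence interval, as it would for a Wald interval. The sketch below assumes Wald intervals and uses an arbitrary tolerance of 0.01 on the log scale. The function name `log_asymmetry` is illustrative. Applied to the values as printed, it flags only the K76.0 row, where the odds ratio equals the lower bound. Which of the printed numbers is affected cannot be determined from this text and would need to be checked against the original publication.

```python
import math

# OR, lower and upper bounds copied from Table 7 as printed.
table7 = {
    "E10.7": (2.021, 1.634, 2.500),
    "E06.3": (1.203, 1.126, 1.286),
    "E10.9": (1.630, 1.367, 1.943),
    "E10.3": (2.352, 1.720, 3.216),
    "K76.0": (1.251, 1.251, 1.630),
}

def log_asymmetry(odds_ratio: float, lower: float, upper: float) -> float:
    """Distance on the log scale between the OR and the midpoint of its CI."""
    midpoint = (math.log(lower) + math.log(upper)) / 2
    return abs(math.log(odds_ratio) - midpoint)

TOLERANCE = 0.01  # arbitrary allowance for rounding to three decimals
flagged = [code for code, row in table7.items() if log_asymmetry(*row) > TOLERANCE]
print(flagged)
```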
  58 in total

1.  The SOD2 C47T polymorphism influences NAFLD fibrosis severity: evidence from case-control and intra-familial allele association studies.

Authors:  Ahmad Al-Serri; Quentin M Anstee; Luca Valenti; Valerio Nobili; Julian B S Leathart; Paola Dongiovanni; Julia Patch; Anna Fracanzani; Silvia Fargion; Christopher P Day; Ann K Daly
Journal:  J Hepatol       Date:  2011-07-12       Impact factor: 25.083

2.  Analyses of shared genetic factors between asthma and obesity in children.

Authors:  Erik Melén; Blanca E Himes; John M Brehm; Nadia Boutaoui; Barbara J Klanderman; Jody S Sylvia; Jessica Lasky-Su
Journal:  J Allergy Clin Immunol       Date:  2010-09       Impact factor: 10.793

3.  Genome-wide association study identifies loci influencing concentrations of liver enzymes in plasma.

Authors:  John C Chambers; Weihua Zhang; Joban Sehmi; Xinzhong Li; Mark N Wass; Pim Van der Harst; Hilma Holm; Serena Sanna; Maryam Kavousi; Sebastian E Baumeister; Lachlan J Coin; Guohong Deng; Christian Gieger; Nancy L Heard-Costa; Jouke-Jan Hottenga; Brigitte Kühnel; Vinod Kumar; Vasiliki Lagou; Liming Liang; Jian'an Luan; Pedro Marques Vidal; Irene Mateo Leach; Paul F O'Reilly; John F Peden; Nilufer Rahmioglu; Pasi Soininen; Elizabeth K Speliotes; Xin Yuan; Gudmar Thorleifsson; Behrooz Z Alizadeh; Larry D Atwood; Ingrid B Borecki; Morris J Brown; Pimphen Charoen; Francesco Cucca; Debashish Das; Eco J C de Geus; Anna L Dixon; Angela Döring; Georg Ehret; Gudmundur I Eyjolfsson; Martin Farrall; Nita G Forouhi; Nele Friedrich; Wolfram Goessling; Daniel F Gudbjartsson; Tamara B Harris; Anna-Liisa Hartikainen; Simon Heath; Gideon M Hirschfield; Albert Hofman; Georg Homuth; Elina Hyppönen; Harry L A Janssen; Toby Johnson; Antti J Kangas; Ido P Kema; Jens P Kühn; Sandra Lai; Mark Lathrop; Markus M Lerch; Yun Li; T Jake Liang; Jing-Ping Lin; Ruth J F Loos; Nicholas G Martin; Miriam F Moffatt; Grant W Montgomery; Patricia B Munroe; Kiran Musunuru; Yusuke Nakamura; Christopher J O'Donnell; Isleifur Olafsson; Brenda W Penninx; Anneli Pouta; Bram P Prins; Inga Prokopenko; Ralf Puls; Aimo Ruokonen; Markku J Savolainen; David Schlessinger; Jeoffrey N L Schouten; Udo Seedorf; Srijita Sen-Chowdhry; Katherine A Siminovitch; Johannes H Smit; Timothy D Spector; Wenting Tan; Tanya M Teslovich; Taru Tukiainen; Andre G Uitterlinden; Melanie M Van der Klauw; Ramachandran S Vasan; Chris Wallace; Henri Wallaschofski; H-Erich Wichmann; Gonneke Willemsen; Peter Würtz; Chun Xu; Laura M Yerges-Armstrong; Goncalo R Abecasis; Kourosh R Ahmadi; Dorret I Boomsma; Mark Caulfield; William O Cookson; Cornelia M van Duijn; Philippe Froguel; Koichi Matsuda; Mark I McCarthy; Christa Meisinger; Vincent Mooser; Kirsi H Pietiläinen; Gunter Schumann; Harold Snieder; Michael J E Sternberg; Ronald P Stolk; 
Howard C Thomas; Unnur Thorsteinsdottir; Manuela Uda; Gérard Waeber; Nicholas J Wareham; Dawn M Waterworth; Hugh Watkins; John B Whitfield; Jacqueline C M Witteman; Bruce H R Wolffenbuttel; Caroline S Fox; Mika Ala-Korpela; Kari Stefansson; Peter Vollenweider; Henry Völzke; Eric E Schadt; James Scott; Marjo-Riitta Järvelin; Paul Elliott; Jaspal S Kooner
Journal:  Nat Genet       Date:  2011-10-16       Impact factor: 38.330

4.  Apolipoprotein C3 gene variants in nonalcoholic fatty liver disease.

Authors:  Kitt Falk Petersen; Sylvie Dufour; Ali Hariri; Carol Nelson-Williams; Jia Nee Foo; Xian-Man Zhang; James Dziura; Richard P Lifton; Gerald I Shulman
Journal:  N Engl J Med       Date:  2010-03-25       Impact factor: 91.245

5.  Systematic review of genetic association studies involving histologically confirmed non-alcoholic fatty liver disease.

Authors:  Kayleigh L Wood; Michael H Miller; John F Dillon
Journal:  BMJ Open Gastroenterol       Date:  2015-02-17

6.  Phenome-wide association studies on a quantitative trait: application to TPMT enzyme activity and thiopurine therapy in pharmacogenomics.

Authors:  Antoine Neuraz; Laurent Chouchana; Georgia Malamut; Christine Le Beller; Denis Roche; Philippe Beaune; Patrice Degoulet; Anita Burgun; Marie-Anne Loriot; Paul Avillach
Journal:  PLoS Comput Biol       Date:  2013-12-26       Impact factor: 4.475

7.  Phenome-wide association study (PheWAS) in EMR-linked pediatric cohorts, genetically links PLCL1 to speech language development and IL5-IL13 to Eosinophilic Esophagitis.

Authors:  Bahram Namjou; Keith Marsolo; Robert J Caroll; Joshua C Denny; Marylyn D Ritchie; Shefali S Verma; Todd Lingren; Aleksey Porollo; Beth L Cobb; Cassandra Perry; Leah C Kottyan; Marc E Rothenberg; Susan D Thompson; Ingrid A Holm; Isaac S Kohane; John B Harley
Journal:  Front Genet       Date:  2014-11-18       Impact factor: 4.599

8.  Phenome-Wide Association Study to Explore Relationships between Immune System Related Genetic Loci and Complex Traits and Diseases.

Authors:  Anurag Verma; Anna O Basile; Yuki Bradford; Helena Kuivaniemi; Gerard Tromp; David Carey; Glenn S Gerhard; James E Crowe; Marylyn D Ritchie; Sarah A Pendergrass
Journal:  PLoS One       Date:  2016-08-10       Impact factor: 3.240

9.  eMERGE Phenome-Wide Association Study (PheWAS) identifies clinical associations and pleiotropy for stop-gain variants.

Authors:  Anurag Verma; Shefali S Verma; Sarah A Pendergrass; Dana C Crawford; David R Crosslin; Helena Kuivaniemi; William S Bush; Yuki Bradford; Iftikhar Kullo; Suzette J Bielinski; Rongling Li; Joshua C Denny; Peggy Peissig; Scott Hebbring; Mariza De Andrade; Marylyn D Ritchie; Gerard Tromp
Journal:  BMC Med Genomics       Date:  2016-08-12       Impact factor: 3.063

Review 10.  The challenges, advantages and future of phenome-wide association studies.

Authors:  Scott J Hebbring
Journal:  Immunology       Date:  2014-02       Impact factor: 7.397

