Phenome-wide association studies across large population cohorts support drug target validation.

Dorothée Diogo¹, Chao Tian², Christopher S Franklin³, Mervi Alanne-Kinnunen⁴, Michael March⁵, Chris C A Spencer³, Ciara Vangjeli³, Michael E Weale³, Hannele Mattsson^{4,6}, Elina Kilpeläinen⁴, Patrick M A Sleiman⁵, Dermot F Reilly⁷, Joshua McElwee^{7,8}, Joseph C Maranville^{7,9}, Arnaub K Chatterjee^{7,10}, Aman Bhandari^{7,11}, Khanh-Dung H Nguyen¹², Karol Estrada¹², Mary-Pat Reeve¹³, Janna Hutz¹³, Nan Bing¹⁴, Sally John¹², Daniel G MacArthur^{15,16}, Veikko Salomaa⁶, Samuli Ripatti^{4,15,17}, Hakon Hakonarson⁵, Mark J Daly^{15,16}, Aarno Palotie^{4,15,16,18,19}, David A Hinds², Peter Donnelly³, Caroline S Fox⁷, Aaron G Day-Williams^{7,12}, Robert M Plenge^{7,9}, Heiko Runz^{20,21}.

Abstract

Phenome-wide association studies (PheWAS) have been proposed as a possible aid in drug development through elucidating mechanisms of action, identifying alternative indications, or predicting adverse drug events (ADEs). Here, we select 25 single nucleotide polymorphisms (SNPs) linked through genome-wide association studies (GWAS) to 19 candidate drug targets for common disease indications. We interrogate these SNPs by PheWAS in four large cohorts with extensive health information (23andMe, UK Biobank, FINRISK, CHOP) for association with 1683 binary endpoints in up to 697,815 individuals and conduct meta-analyses for 145 mapped disease endpoints. Our analyses replicate 75% of known GWAS associations (P < 0.05) and identify nine study-wide significant novel associations (of 71 with FDR < 0.1). We describe associations that may predict ADEs, e.g., acne, high cholesterol, gout, and gallstones with rs738409 (p.I148M) in PNPLA3 and asthma with rs1990760 (p.T946A) in IFIH1. Our results demonstrate PheWAS as a powerful addition to the toolkit for drug discovery.

Year: 2018 PMID: 30327483 PMCID: PMC6191429 DOI: 10.1038/s41467-018-06540-3

Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919

Introduction

The discovery and development of novel therapeutics is difficult. It may take 15 years to advance a new molecular entity from therapeutic hypothesis to approval, with development costs in the billion-dollar range and only a 10% chance that a new drug tested in humans will eventually gain approval[1]. Two reasons stand out to explain the high failure rate of clinical trials and the receding return on R&D investment across the pharmaceutical industry: a lower efficacy of the compound in the targeted disease population than anticipated from preclinical studies; and the occurrence of unintended drug effects, particularly mechanism-based adverse drug events (ADEs) uncovered only in late-stage clinical trials[2]. A greater understanding of human data relevant to the drug target at early stages of drug development is generally considered to increase the probability of success[1,3,4]. Resources that systematically capture biomedical information on vast numbers of individuals are revolutionizing our ability to understand the complexities of human biology and morbidity. Electronic health records (EHRs) and other resources that systematically capture extensive health information have rapidly become well-established tools for epidemiological and post-marketing research[5,6]. Recently, a surge of initiatives has sought to link such phenotype resources with genome-scale genetic data in order to gain further insights into the genetics of common diseases[7-14]. An attractive approach to help accelerate drug development using these genotype–phenotype resources is to apply phenome-wide association studies (PheWAS). PheWAS are an unbiased approach to test for associations between a specific genetic variant or, more recently, a combination of variants, and a wide range of phenotypes in large numbers of individuals[7,15,16]. 
By exploring the associations of a genetic variant that impacts the function of a drug target gene, PheWAS in disease-agnostic cohorts with extensive health information may enrich the drug discovery process for five reasons: (1) association studies in disease-agnostic cohorts may validate target-disease links in cohorts that more closely resemble the real world, i.e., the patients that will ultimately receive a drug;[17] (2) by unraveling pleiotropy, PheWAS may improve our understanding of the biological functions of a target, or hint at concealed pathophysiological connections between disease entities previously considered as distinct;[18,19] (3) PheWAS may reveal opportunities for drug repurposing, an attractive alternative to de novo drug development;[20,21] (4) PheWAS may point to phenotypes that associate with an inverse directionality of target function, thus unraveling potential ADEs at very early stages of a development program, minimizing risks to trial participants, and helping to define the most appropriate patient populations to benefit from a drug;[21] and (5) through quantitative estimates from genetic safety and efficacy profiles, PheWAS may help prioritize among multiple possible targets by identifying the target with the most promising therapeutic window. Despite these benefits, the ability of PheWAS to substantially add to decision making in drug development is thwarted by the difficulty of obtaining and systematizing comprehensive genotypes and phenotypes across very large numbers of individuals. Here, we test the hypothesis that PheWAS can inform target validation at early stages of drug discovery. We select candidate drug targets across a range of therapeutic indications based on their support from genome-wide association studies (GWAS). 
To maximize power, we map a large spectrum of clinical endpoints from four of the world’s largest disease-agnostic cohorts with extensive health information (23andMe, UK Biobank interim release, FINRISK and CHOP) and conduct association testing in up to 697,815 individuals. We validate the top associations in the extended UK Biobank cohort (337,199 participants), and apply conditional analyses and co-localization methods to identify true pleiotropy predicting drug efficacy or safety signals. Our results show that PheWAS, despite limitations, enrich drug discovery with valuable information.

Results

Assessing pleiotropy of SNPs near 19 candidate drug targets

In this study, we queried the literature for genes nominated through GWAS as putatively causally linked to the risk for common complex human diseases and supported by various degrees of additional genetic or biological evidence. We selected 19 genes that, based on previously described genetic associations with either immune-mediated (9 genes: ATG16L1, CARD9, CD226, CDHR3, GPR35, GPR65, IFIH1, IRF5, and TYK2), cardiometabolic (8 genes: F11, F12, GDF15, GUCY1A3, KNG1, LGALS3, PNPLA3, and SLC30A8), or neurodegenerative diseases (2 genes: LRRK2, TMEM175), were evaluated as potential novel drug targets (Table 1). Gene-disease associations had been established through 25 common lead single nucleotide polymorphisms (SNPs) that all reached a conservative level of statistical significance (P < 5 × 10^−8) for association in GWAS with at least one phenotype of relevance to drug discovery and development (Supplementary Table 1). All of these SNPs have either been demonstrated to impact the target gene in functional studies (genetic evidence), or are located proximal to a gene implicated in a biological mechanism related to the GWAS phenotype (biological evidence). Our selection ranged from targets with little biological knowledge beyond GWAS nomination (e.g., TMEM175 for Parkinson’s disease (PD)) to targets with drug candidates in early clinical trials (e.g., F11 for thromboembolism). Details on the genetic and biological support for all selected genes and SNPs are provided in Supplementary Methods.

Table 1

Candidate drug targets investigated in the study

	Human genetics		Drug development^b
Gene	Prior GWAS associations^a	Mendelian disorders (direction of effect)	Indications/status/proposed mechanism of action
ATG16L1	CD; IBD	–	– / – / –
CARD9	CD; IBD; UC	Familial candidiasis (LOF)	– / – / –
CD226	IBD; MPV; T1D	–	– / – / –
CDHR3	Asthma	–	– / – / –
F11	aPTT; VTE; FXI levels	FXI deficiency (LOF)	Hemophilia C/launched/factor XI stimulant
			Thrombosis/phase II/factor XI inhibitor
F12	aPTT; FXII levels	Hereditary angioedema (GOF); FXII deficiency (LOF)	Hereditary angioedema; thrombosis/phase I/factor XII inhibitor
			Antiphospholipid syndrome/preclinical/factor XII inhibitor
GDF15	BMI	–	Cachexia/preclinical/GDF-15 antagonist
GPR35	CD; IBD; UC	–	Cough; mastocytosis; pruritus/phase II/GPR35 agonist
GPR65	CD; IBD; UC	–	– / – / –
GUCY1A3	BP; CAD; MI	Moyamoya 6 with achalasia (LOF)	– / – / –
IFIH1	IgAD; IBD; psoriasis; UC; SLE; T1D; vitiligo	Aicardi–Goutieres syndrome (GOF); Singleton–Merten syndrome (GOF)	Solid cancer/phase I/IFIH1 stimulant (additional targets: RIG-I; TLR3)
IRF5	PBC; RA; SJO; SLE; SSc; UC	–	– / – / –
KNG1	aPTT; FXI levels	–	– / – / –
LGALS3	Galectin-3 levels	–	Liver fibrosis; non-alcoholic steatohepatitis; psoriasis/phase II/galectin-1 and 3 antagonist
			Pulmonary idiopathic fibrosis/phase II/galectin-3 antagonist
			Atopic eczema; head and neck cancer; melanoma/phase I/galectin-1 and 3 antagonist
			Arrhythmia; fibrosis: myocardial, renal; pulmonary hypertension/preclinical/galectin-1 and 3 antagonist
			Cardiac and renal conditions/preclinical/galectin-3 antagonist
LRRK2	CD; IBD; PD; UC	Familial Parkinson’s disease (GOF)	Parkinson’s disease/phase I/LRRK2 inhibitor
			Alzheimer’s disease; glaucoma/preclinical/LRRK2 inhibitor
PNPLA3	Alcohol-related cirrhosis; ALT; CT; hepatic steatosis; NAFLD	–	– / – / –
SLC30A8	Fasting glucose; T2D	–	– / – / –
TMEM175	PD	–	– / – / –
TYK2	CD; IBD; MS; PBC; psoriasis; RA; SLE; T1D; UC	Immunodeficiency (LOF)	Atopic eczema/phase II/JAK1 and TYK2 inhibitor
			psoriasis/phase II/TYK2 inhibitor; JAK1 and TYK2 inhibitor
			SLE/phase II/TYK2 inhibitor
			Alopecia areata; UC/phase II/JAK1 and TYK2 inhibitor
			IBD/phase I/TYK2 inhibitor; JAK1 and TYK2 inhibitor
			psoriatic arthritis/phase I/TYK2 inhibitor
			CD/preclinical/JAK1-3 and TYK2 inhibitor
			cancer: acute leukemia, colorectal, anaplastic large cell lymphoma; MS; RA/preclinical/JAK1 and TYK2 inhibitor
			Uveitis/preclinical/TYK2 inhibitor

ALT: alanine aminotransferase, aPTT: activated partial thromboplastin time, BMI: body mass index, CAD: coronary artery disease, IgAD: immunoglobulin A deficiency, MI: myocardial infarction, MPV: mean platelet volume, NAFLD: non-alcoholic fatty liver disease, RA: rheumatoid arthritis, SLE: systemic lupus erythematosus, T1D: type 1 diabetes, T2D: type 2 diabetes, VTE: venous thromboembolism, CD: Crohn’s disease, IBD: inflammatory bowel disease, MS: multiple sclerosis, PBC: primary biliary cirrhosis, PD: Parkinson’s disease, SJO: Sjogren’s syndrome, SSc: systemic sclerosis, UC: ulcerative colitis, GOF: gain-of-function, LOF: loss-of-function

^a Published associations at the genetic locus as defined in Methods. Causal gene not always unambiguously established. For details, see Supplementary Information

^b As listed in Citeline’s Pharmaprojects database. Active development with most advanced status (preclinical or clinical) as of Dec 16, 2017 is indicated

To broadly investigate pleiotropic effects of the 25 chosen SNPs in a maximal number of individuals, we interrogated four large disease-agnostic cohorts that link genome-wide genotype data from individuals of European ancestry with extensive phenotypic data: the 23andMe Inc. cohort with self-reported phenotypes on 671,151 research participants[22], the interim UK Biobank cohort analyzed by Genomics plc with questionnaire-based health information on 112,337 participants (from the first genetic data release in May 2015)[10], and two EHR-based cohorts from an adult Finnish cohort (FINRISK; 21,371 participants)[23] and from a pediatric healthcare population from the Children’s Hospital of Philadelphia (CHOP; 12,044 patients)[24] (Table 2 and Methods). All four cohorts contributed phenotypic data in different formats (medical interviews, self-reports, WHO ICD codes, or ICD9-CM codes) in both shared and distinct phenotype categories (Fig. 1a). 
Manual phenotype mapping identified 145 distinct clinical endpoints that were tested in two or more cohorts in up to 697,815 individuals (Fig. 1b, Supplementary Table 2, and Supplementary Table 3). As illustrated in Fig. 1c, these 145 mapped phenotypes represent a broad spectrum of disease categories and, as typically observed in disease-agnostic cohorts, show significant variability in the case:control ratios, both within and between cohorts. In addition, PheWAS in the four cohorts provided association results for 1538 cohort-specific unmapped endpoints, leading to a total of 1683 endpoints included in our analysis. Association testing in the cohorts was performed using logistic regression models; meta-analyses were performed using fixed effect models (see Methods for details).
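The fixed-effect meta-analysis step can be illustrated with a minimal inverse-variance-weighted sketch in Python. This is not the authors' pipeline (the Methods describe that); the function name `fixed_effect_meta` and the per-cohort estimates are hypothetical, and the inputs are assumed to be per-cohort log odds ratios with their standard errors, as would come out of the logistic regression models.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis.

    betas: per-cohort log odds ratios; ses: their standard errors.
    Returns the pooled log odds ratio, its standard error, the Z score,
    and the two-sided P value.
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # erfc keeps precision in the far tail, where 1 - cdf would round to 0
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Hypothetical per-cohort estimates (illustrative only, not from the paper)
betas = [0.05, 0.03, 0.08]
ses = [0.01, 0.02, 0.05]
beta, se, z, p = fixed_effect_meta(betas, ses)
low, high = math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se)
print(f"OR = {math.exp(beta):.3f} (CI95 {low:.3f}-{high:.3f}), P = {p:.2e}")
```

Because each cohort is weighted by the inverse of its variance, the largest cohort dominates the pooled estimate, which is why heterogeneity between cohorts needs to be checked separately.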

Table 2

Cohorts included in this study

Cohort	Participants geographic distribution	Phenotypes source	N binary endpoints tested^a	Max sample size
23andMe	89% USA (adult)	Questionnaire-based self-reports	654	671,151
Genomics plc UK Biobank	100% UK (adult)	Questionnaire-based self-reports, medical interviews and follow-up	90	112,337
FINRISK	100% Finns (adult)	National health registries (ICD8,9,10)	278	21,371
CHOP	100% USA (pediatric)	Electronic health records (ICD9-CM)	870	12,044
Genomics plc GWAS	Mixed	Mixed—multiple independent disease-specific cohorts	34	-

^a Number of binary endpoints with N cases ≥ 20

Fig. 1

Phenotypes tested and study design. a Categories of phenotypes assessed in the 23andMe, Genomics plc UK Biobank, FINRISK, and CHOP cohorts. b Manual phenotype mapping was performed to identify phenotypes shared between cohorts. One hundred and forty-five phenotypes were captured with at least 20 cases in at least 2 cohorts. After PheWAS in each cohort separately, the 145 phenotypes were meta-analyzed to increase statistical power and enable systematic comparisons of results between cohorts. c The 145 mapped phenotypes (see Supplementary Table 2) represent a broad spectrum of phenotypic categories and are captured with variable case:control ratios in the cohorts tested

Meta-PheWAS replicate known GWAS signals

We first evaluated whether association testing in the four disease-agnostic cohorts replicated established results from published GWAS. GWAS had associated the 25 tested SNPs with genome-wide significance to 58 binary disease endpoints. Of these, 47 endpoints were ascertained with adequate power (1 − β ≥ 0.8) to reach P < 0.05 in the PheWAS meta-analysis. After excluding the three Parkinson’s disease associations that were derived from 23andMe data in the published GWAS, we observed that 33 of the 44 (75%) powered GWAS associations replicated at P < 0.05 in our PheWAS meta-analysis with consistent directions of effects (18/27 (67%) powered GWAS associations replicated at FDR < 0.1 (P < 3.8 × 10^−4)) (Supplementary Figs 1, 2, and Supplementary Table 4). The overlap between the published GWAS effect sizes and the confidence intervals observed in the meta-PheWAS and in the four cohorts is provided in Supplementary Figs 2 and 3. As expected from data obtained in real-world settings, the replication rate of known associations was highly disease-dependent (Supplementary Fig. 1B). For instance, out of the 11 associations that failed to replicate despite sufficient case numbers in the cohorts, eight were associations with inflammatory bowel disease (IBD), Crohn’s disease (CD), or ulcerative colitis (UC), likely reflecting suboptimal ascertainment of these endpoints in real-world settings. Nonetheless, the high replication rate of previously reported associations demonstrates the power of combining disease-agnostic cohorts from various sources to detect and validate true SNP-disease associations, and to substantiate therapeutic hypotheses.

Meta-PheWAS identify novel SNP-phenotype associations

We next investigated whether meta-PheWAS across the four cohorts could identify novel associations to support the proposed clinical indication(s) (derived from established genetic associations, see Table 1), suggest alternative indications for drug repositioning, or uncover potential target-related ADEs. To improve statistical power in this analysis, the PheWAS results in the four cohorts were meta-analyzed together with summary statistics from published GWAS of 34 diseases available from a larger database assembled and harmonized by Genomics plc (referred to as Genomics plc GWAS, Supplementary Note 1). Overall, 27,763 association tests (across 145 harmonized and 1538 cohort-specific endpoints) resulted in nine putative novel associations reaching study-wide significance after Bonferroni correction (P < 1.8 × 10^−6) (Table 3). Using a less stringent significance threshold of FDR < 0.1 (P < 7 × 10^−4) previously applied in PheWAS[25], we identified 71 distinct putative novel associations (Fig. 2, Supplementary Table 5 and Supplementary Data 1). Of these, 30 were with mapped phenotypes and were obtained from meta-analyzing results from at least two cohorts, and 41 were supported by a single cohort (and thus require independent replication) (Supplementary Table 5). Forty-three of these putative novel associations showed the same directions of effect as disease endpoints related to the proposed clinical indication for a drug and may hint at potential repositioning opportunities (Supplementary Fig. 4). Conversely, 27 showed directions of effect opposite to disease endpoints related to the proposed clinical indication and may suggest safety signals that could endanger therapeutic success and warrant monitoring in preclinical models and clinical trials (Supplementary Fig. 4).
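The two significance thresholds used here can be sketched in a few lines of Python. The Bonferroni arithmetic follows directly from the 27,763 tests reported above. The FDR procedure is not specified in this section, so the Benjamini-Hochberg step-up rule below is an assumption, and the function names and the toy P values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold controlling the family-wise error rate."""
    return alpha / n_tests


def benjamini_hochberg_cutoff(pvalues, q):
    """Largest P value passing the Benjamini-Hochberg step-up rule at FDR q.

    Returns None when no test passes. Assumed here for illustration; the
    paper's exact FDR method may differ.
    """
    m = len(pvalues)
    cutoff = None
    for rank, p in enumerate(sorted(pvalues), start=1):
        if p <= rank / m * q:
            cutoff = p
    return cutoff


# Study-wide threshold for the 27,763 association tests reported in the text
threshold = bonferroni_threshold(0.05, 27763)
print(f"{threshold:.1e}")  # 1.8e-06

# Toy P values (hypothetical) to show the step-up rule
toy_pvalues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
toy_cutoff = benjamini_hochberg_cutoff(toy_pvalues, 0.05)
```

Bonferroni controls the chance of any false positive across all tests, whereas an FDR threshold tolerates a stated fraction of false discoveries, which is why it admits many more candidate associations.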

Table 3

Significant novel associations in the PheWAS meta-analysis

				Novel association in meta-PheWAS^a
Gene	SNP	EA (EAF)^b	Known associated phenotype^c	Phenotype	OR (CI95)	P value	Direction^d	N cases	N controls
CD226	rs763361	T (0.47)	IBD	Hypothyroidism	1.05 (1.04–1.07)	8.11e−11	++?+?	35,428	412,577
GDF15	rs17724992	A (0.73)	BMI	Heart metabolic disease^e	1.03 (1.02–1.04)	3.08e−09	+????	275,944	209,302
				High blood pressure^e	1.03 (1.02–1.04)	7.64e−09	++???	151,511	465,686
				Blood pressure medication^e	1.03 (1.02–1.04)	1.76e−07	+????	125,406	394,753
				GERD	1.03 (1.02–1.04)	6.11e−07	+????	130,654	384,572
				Any CVD^e	1.03 (1.01–1.04)	1.40e−06	+????	148,577	388,405
IFIH1	rs1990760	T (0.61)	T1D	Asthma ^f	0.96 (0.95–0.98)	1.11e−07	− − − − −	57,101	269,659
IRF5	rs10488631	C (0.11)	SLE	Hypothyroidism	1.08 (1.05–1.12)	5.78e−07	++?+?	23,182	236,240
PNPLA3	rs738409	G (0.33)	ALT	Severe acne	0.91 (0.88–0.93)	1.47e−11	−????	14,812	187,018
				High cholesterol	0.96 (0.94–0.97)	1.59e−07	− −???	101,646	180,947
TYK2	rs34536443	G (0.89)	Psoriasis	Any immune disorder	1.10 (1.07–1.13)	4.27e−12	+????	112,148	173,986
				Hypothyroidism	1.14 (1.08–1.20)	1.19e−06	++?−?	23,145	233,757

ALT: alanine aminotransferase, BMI: body mass index, CVD: cardiovascular disease, EA: effect allele, EAF: effect allele frequency, GERD: gastroesophageal reflux disease, IBD: inflammatory bowel disease, SLE: systemic lupus erythematosus, T1D: type 1 diabetes, T2D: type 2 diabetes

^a Associations reaching P < 1.8e−6 (Bonferroni-corrected significance threshold) in the meta-analysis of PheWAS results with GWAS results. The full list of potential novel SNP-phenotype pairs reaching FDR < 0.1 is provided in Supplementary Table 5. Novel associations with direction of effect opposite to the known associated disease(s) effect, predicting potential adverse drug events, are highlighted in bold

^b The effect allele is the risk allele for known associated disease(s) related to the therapeutic hypothesis

^c Known associated disease related to the therapeutic hypothesis (surrogate for efficacy). The strongest association reported in the literature is indicated. The full list of known associations is provided in Supplementary Table 1

^d Direction of effect in 23andMe, Genomics plc UK Biobank, FINRISK, CHOP, and GWAS

^e Correlated phenotypes

^f Meta-analysis results including the 23andMe, Gplc/UK Biobank, FINRISK, CHOP, and GWAS Gabriel cohorts. When further including the independent GWAS EVE study, the association reaches P = 6.7 × 10^−8

Fig. 2

Meta-PheWAS results for 25 SNPs in candidate drug targets. Phenotypes associated at FDR < 0.1 (P < 7e−4) with at least one SNP in the meta-PheWAS are represented. Direction of effect of the known disease-risk increasing allele related to the therapeutic hypothesis is indicated. A positive Z score (in red) indicates increased risk, a negative Z score (in blue) indicates reduced risk. Known and novel associations reaching FDR < 0.1 are outlined in white and black respectively. Detailed association results are provided in the Supplementary Data 1

The 30 novel associations with mapped phenotypes showed limited evidence of heterogeneity between the PheWAS cohorts (Supplementary Fig. 5). Twenty-three (77%) of these 30 associations showed an I² < 40%. Manual review of the results showed that only one of the seven associations with I² > 40%, the GDF15 rs17724992 association with high blood pressure, was less significant in the meta-analysis than in the individual cohorts (P_23andMe = 6.4 × 10^−10, OR_23andMe = 0.96; P_{Gplc/UK Biobank} = 0.58, OR_{Gplc/UK Biobank} = 0.99; P_meta = 7.6 × 10^−9, OR_meta = 0.97) (Supplementary Fig. 5B).
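For readers unfamiliar with the heterogeneity statistic, the following Python sketch computes Cochran's Q and I² from per-cohort effect estimates. It uses the standard definition I² = max(0, (Q − df)/Q) × 100%; whether the authors computed it in exactly this way is an assumption, and the function name and example inputs are hypothetical.

```python
def cochran_q_i2(betas, ses):
    """Cochran's Q and the I² statistic (in percent) for per-cohort estimates.

    betas: per-cohort log odds ratios; ses: their standard errors.
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    # Q is the weighted sum of squared deviations from the pooled estimate
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2


# Hypothetical example: two cohorts whose estimates clearly disagree
q_stat, i2_pct = cochran_q_i2([0.0, 0.2], [0.05, 0.05])
print(f"Q = {q_stat:.2f}, I2 = {i2_pct:.1f}%")
```

I² expresses the share of between-cohort variation beyond what chance alone would produce, so values below the 40% mark used above indicate that cohorts largely agree.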

Replication of novel associations in UK Biobank v2

Forty-one of the 71 potential novel associations reaching FDR < 0.1, including eight of the nine novel associations reaching study-wide significance, were with phenotypes tested by Neale et al. through GWAS in the expanded UK Biobank (v2) cohort of up to 337,199 participants of European ancestry. In an attempt to replicate putative novel associations discovered in our meta-PheWAS, we performed weighted Z score-based meta-analyses between the 23andMe, FINRISK and CHOP PheWAS results, the published GWAS results and the UK Biobank v2 results (excluding the Gplc UK Biobank results). Out of the 41 putative novel associations, 16 showed P < 0.05 in UK Biobank v2 with consistent direction of effect, thus validating and further strengthening the significance of our previous results (Supplementary Table 6). An additional seven potential novel associations showed increased significance in meta-analysis despite P > 0.05 in UK Biobank v2, largely due to the small number of cases and lack of statistical power in UK Biobank v2 alone. Overall, meta-analysis with UK Biobank v2 strengthened all eight novel associations with study-wide significance after Bonferroni correction and 23/41 (56%) of the potential novel associations with FDR < 0.1, including eight associations that were based on results from a single PheWAS cohort. Strengthened associations in the meta-analysis with UK Biobank v2 include the rs17724992-high blood pressure association that showed significant heterogeneity between the 23andMe and the interim UK Biobank cohorts (P_23andMe = 6.4 × 10^−10, OR_23andMe = 0.96; P_{UK Biobank v2} = 4.4 × 10^−5; P_{meta_v2} = 3.9 × 10^−13).
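A weighted Z score-based meta-analysis of the kind described above can be sketched as follows in Python. The weighting scheme is not given in this section, so the choice below (square root of an effective sample size derived from case and control counts) is an assumption, and the function names and inputs are hypothetical.

```python
import math


def effective_n(n_cases, n_controls):
    """Effective sample size for an unbalanced case-control study (assumed weighting)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def weighted_z_meta(z_scores, sample_sizes):
    """Combine signed per-cohort Z scores, weighting by sqrt(sample size).

    Z scores must share the same effect allele so that signs are comparable.
    Returns the combined Z score and its two-sided P value.
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    z = sum(w * zi for w, zi in zip(weights, z_scores)) / math.sqrt(
        sum(w * w for w in weights)
    )
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Hypothetical cohorts: two concordant signals of equal weight
z_comb, p_comb = weighted_z_meta([2.0, 2.0], [effective_n(100, 100)] * 2)
print(f"Z = {z_comb:.3f}, P = {p_comb:.3g}")
```

This approach needs only a direction, a P value, and a sample size per cohort, which makes it convenient when effect sizes are reported on scales that are not directly comparable across studies.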

Interpretation of apparent pleiotropy in PheWAS results

A challenge to the PheWAS approach is to reliably distinguish true pleiotropic associations of a SNP (or SNPs in strong LD with the lead SNP), suggesting a shared causal mechanism, from unrelated associations driven by independent SNPs at a locus[18]. For instance, in our study, the putative association of rs2274273 near LGALS3 (encoding the galectin-3 protein) with PD (OR_23andMe = 0.94, P_23andMe = 1 × 10^−4) likely reflects a distinct causal mechanism previously attributed to GCH1[26]. rs2274273 is a protein quantitative trait locus (pQTL) that controls plasma levels of galectin-3[27]. Through a Bayesian test for co-localization using summary statistics from published GWAS[26,28,29], we excluded rs2274273 as a causal SNP for PD (posterior probability for a shared variant leading the PD and galectin-3 levels associations = 0.0008%) (Supplementary Fig. 6). A second challenge to PheWAS is the existence of common co-morbidities among endpoints, or alternatively an insufficient distinction between phenotypes[19]. In our meta-PheWAS, rs17724992 near GDF15 showed association with multiple cardiovascular-related phenotypes, which is likely mediated by the known association of this SNP with body mass index (BMI)[30], an established risk factor for cardiovascular disease[31]. This is supported by the lack of association of rs17724992 with blood pressure (P_SBP = 0.064, P_DBP = 0.134) and coronary artery disease (CAD, P = 0.17) in the large GWAS published by the International Consortium for Blood Pressure and the CARDIoGRAMplusC4D consortium[32,33]. Phenotype correlation scores can indicate apparent pleiotropic effects that may be explained by comorbidities or confounding (Supplementary Fig. 7), yet follow-up customized association analyses adjusting for specific phenotypic covariates are required to distinguish true pleiotropic effects and inform target validation. 
In summary, these two examples demonstrate that thorough investigation of association results can reduce biases introduced through PheWAS.

Meta-PheWAS reveal pleiotropic effects of PNPLA3 rs738409

Among the nine study-wide significant associations, our meta-PheWAS revealed multiple novel associations for the PNPLA3 missense SNP rs738409 (p.I148M). The rs738409-G allele has previously been reported as associated with an increased risk for non-alcoholic fatty liver disease (NAFLD), alcohol-related cirrhosis and hepatic steatosis, as well as elevated alanine aminotransferase (ALT) levels, most likely through a gain-of-function (GOF) mechanism (Supplementary Methods). Consistent with these findings, our meta-PheWAS found rs738409-G to be associated with elevated liver tests (OR_23andMe = 1.25, P_23andMe = 4 × 10^−45) (Supplementary Fig. 8). Beyond that, our analysis also indicated that carriers of the rs738409-G allele that increases ALT are more prone to develop liver toxicities when treated with nonsteroidal anti-inflammatory drugs (NSAIDs) such as ibuprofen (OR_23andMe = 1.43, P_23andMe = 4.6 × 10^−5) or aspirin (OR_23andMe = 1.57, P_23andMe = 5.3 × 10^−5). It also confirmed the association of rs738409-G with increased risk of T2D (OR_meta = 1.08, P_meta = 8 × 10^−11) recently reported in a T2D fine-mapping study that confirmed rs738409 as the most likely causal SNP[34]. Our meta-PheWAS further revealed significant associations between rs738409-G and decreased risk for high cholesterol (OR_meta = 0.96, P_meta = 1.6 × 10^−7; P_{meta_v2} = 1.1 × 10^−8) and the intake of cholesterol-lowering medications (OR_23andMe = 0.97, P_23andMe = 2 × 10^−4; P_{meta_v2} = 2.8 × 10^−5), consistent with recent results from the lipids exome chip study describing a significant association of rs738409-G with decreased LDL levels[35]. In addition, the meta-PheWAS revealed novel significant associations between the rs738409-G GOF allele and decreased risk for acne (OR_23andMe = 0.90, P_23andMe = 1.5 × 10^−11; P_{meta_v2} = 7.3 × 10^−12), gout (OR_meta = 0.92, P_meta = 4.1 × 10^−5; P_{meta_v2} = 3.9 × 10^−9), and gallstones (OR_meta = 0.95, P_meta = 2.7 × 10^−4; P_{meta_v2} = 1.5 × 10^−5). 
All these associations remained prominent after adjusting for elevated liver tests (Supplementary Table 7), and were further strengthened in the meta-analysis with the expanded UK Biobank cohort (Supplementary Table 6). Taken together, our PheWAS results support the hypothesis that therapeutic inhibition of PNPLA3 could treat liver diseases. They also support T2D as a potential alternative indication for PNPLA3 inhibition. However, concomitant inverse associations with multiple other endpoints, including acne and high plasma cholesterol levels, indicate potential clinically relevant on-target ADEs that should be considered in decisions to progress PNPLA3 inhibitors toward clinical development.

IFIH1 partial loss-of-function increases asthma risk

The meta-PheWAS further revealed novel, important pleiotropic effects for drugs directed toward IFIH1. Carriers of the IFIH1 (encoding MDA5) rs1990760-C allele (MAF = 40%) have an established lower risk for several autoimmune diseases (type 1 diabetes, T1D; vitiligo; systemic lupus erythematosus, SLE; psoriasis) and an increased risk for UC (Supplementary Methods). Functional studies suggest that rs1990760-C (p.T946A) causes IFIH1 loss-of-function (LOF), and additional IFIH1 LOF alleles have been shown to protect against T1D, vitiligo, psoriasis, and psoriatic arthritis (PsA) (Supplementary Methods). Our meta-PheWAS support these associations (Fig. 2 and Supplementary Table 4). Beyond this, we found a significant novel association between rs1990760-C and increased risk for asthma (OR_meta = 1.04, P_meta = 6.7 × 10^−8) that reached P_{meta_v2} = 2 × 10^−8 in the meta-analysis with the expanded UK Biobank cohort (Fig. 3a and Supplementary Table 6). The association between rs1990760 and asthma was supported by data from all four disease-agnostic cohorts as well as the GABRIEL and EVE asthma GWAS cohorts[36,37], despite lack of power to detect an association with rs1990760 in the published GWAS cohorts alone (Fig. 3b). This association remained significant after adjustment for autoimmune diseases in the 23andMe cohort, demonstrating that the asthma association is independent of the previously established associations of rs1990760 with autoimmunity (Supplementary Table 8). Co-localization analysis confirmed that the same SNP was responsible for the SLE, UC, and asthma associations at the locus, supporting true pleiotropic effects driven by the same causal variant(s) (Fig. 3c). 
The observed IFIH1 pleiotropic effects were further strengthened by the observation in the Genomics plc UK Biobank data that the independent low-frequency IFIH1 missense allele p.I923V (rs35667974-C, MAF = 1.8%), previously reported to result in IFIH1 LOF and to protect against T1D, vitiligo, psoriasis, and PsA, and to increase risk of UC, was also associated with increased risk of asthma (OR_{Gplc/UK Biobank} = 1.18, P_{Gplc/UK Biobank} = 1.1 × 10^−4) (Fig. 3d). Together, these and previous findings establish IFIH1 as a gene with an allelic series[38] and further support the therapeutic hypothesis that inhibition of MDA5 may protect against several autoimmune diseases. However, our results also reveal the potential of MDA5 inhibitors to cause pulmonary ADEs and strengthen previous findings for an increased risk for colitis-related symptoms, endpoints that may limit the therapeutic window of MDA5 modulators and should be considered for monitoring in clinical trials.

Fig. 3

Pleiotropic effects of IFIH1 LOF variants. a A significant association of IFIH1 rs1990760-C (p.T946A) with increased risk of asthma was observed in the meta-analysis of PheWAS and GWAS results, with consistent effect estimates across the six cohorts tested. Odds ratios (OR) and 95% confidence intervals are represented. b Power estimation demonstrates the lack of power to detect an association at rs1990760-C in currently available asthma GWAS studies. Power to surpass various significance cutoffs (P < 0.05; FDR < 0.1, P < 7e−4; study-wide significance after Bonferroni correction, P < 1.8e−6; and genome-wide significance, P < 5e−8) in the six cohorts was estimated using the frequency of the asthma risk allele (RAF = 0.39), the OR in the PheWAS/GWAS meta-analysis (OR = 1.037), a disease prevalence of 8%, and the number of cases and controls in each of the cohorts. c Co-localization analysis demonstrates that the asthma, systemic lupus erythematosus (SLE), and ulcerative colitis (UC) associations at the IFIH1 locus are driven by a shared causal signal. Regional association results with asthma (red), SLE (blue) and UC (orange) are shown. PP, posterior probability of co-localization. d Results from this study (indicated by an asterisk) combined with previously published findings suggest an allelic series of LOF IFIH1 alleles decreasing the risk of various autoimmune diseases while increasing the risk of asthma and UC. OR and 95% confidence intervals of association for the IFIH1 loss-of-function alleles rs1990760-C (p.T946A) and rs35667974-C (p.I923V) are shown.

PheWAS assist target prioritization for thromboembolism

Beyond informing on individual genes, we hypothesized that PheWAS might help prioritize targets among several candidates within a biological pathway. Factors XI, XII, and plasma kininogen (encoded by KNG1) are members of the contact activation coagulation pathway[39]. Anti-coagulation therapies directed against these factors are hypothesized to have improved therapeutic windows over current standard-of-care, which is accompanied by significant bleeding liabilities[40]. With the aim to estimate genetic risk–benefit profiles for the three candidate targets, we chose to interrogate three uncorrelated SNPs at the F11, KNG1, and F12 loci. These three SNPs had similar allele frequencies in Europeans, had previously been shown to impact FXI, FXII, and/or KNG1 mRNA and/or protein levels, and are associated with activated partial thromboplastin time (aPTT), a biomarker of blood clotting, or venous thromboembolism (VTE) risk (Supplementary Methods and Supplementary Table 1). Carriers of the rs4253399-T allele, which reduces circulating FXI levels and increases aPTT, showed an expected lower risk for blood clots (ORmeta = 0.84, Pmeta = 3.5 × 10−25)[41], but no evidence for association with bleeding tendency (OR23andMe = 1.04, P23andMe = 0.35) (Fig. 4). In contrast, carriers of the KNG1 allele rs5030062-A, which reduces plasma kininogen as well as circulating FXI, and increases aPTT, showed both reduced blood clotting (ORmeta = 0.93, Pmeta = 1.6 × 10−4) as well as increased bleeding liability (OR23andMe = 1.14, P23andMe = 4.1 × 10−4). A nominal association with both phenotypes was found in carriers of the FXII levels-reducing and aPTT-increasing allele rs2731672-T (blood clots: OR23andMe = 0.96, P23andMe = 0.034; bleeding tendency: OR23andMe = 1.09, P23andMe = 0.039).

Fig. 4

PheWAS for contact activation coagulation pathway targets. Three SNPs known to affect plasma protein levels of FXI (rs4253399), FXII (rs2731672), and KNG1 (rs5030062), and previously reported to be associated with activated partial thromboplastin time (aPTT), were interrogated in meta-PheWAS. Five phenotypes were significantly associated (FDR < 0.1) with at least one of the three SNPs: blood clots (23andMe, FINRISK, and CHOP: 7487 cases, 273,305 controls; the asterisk (*) marks the known association with the F11 SNP), blood thinners medication (23andMe: 22,985 cases, 236,431 controls), warfarin medication (23andMe: 7142 cases, 94,701 controls), pulmonary embolism (Gplc/UK Biobank: 949 cases, 111,077 controls), and bleeding tendency (23andMe: 1574 cases, 85,223 controls). Odds ratios (OR) and 95% confidence intervals of association of the aPTT-increasing alleles are shown. Detailed association results are provided in Supplementary Data 1.

Discussion

Our study investigates the utility of PheWAS to help predict therapeutic success of candidate drug targets nominated through human genetics. We focused on a selection of loci that GWAS have firmly established as associated with common immune-mediated, cardiometabolic, or neurodegenerative human diseases, and where additional biological or genetic evidence supports candidate drug target genes within these loci as likely causing the disease associations. We analyzed SNPs impacting these targets for association with 1683 disease endpoints captured in four large, disease-agnostic population cohorts that link genome-wide genotypes with various types of structured health information. Our PheWAS meta-analysis replicates 75% of the published GWAS associations at P < 0.05, substantially surpassing performance of previous PheWAS in smaller cohorts[25]. Through meta-analyzing PheWAS results with published GWAS data, we identified nine novel SNP-phenotype associations that exceeded stringent significance thresholds for multiple test correction, as well as additional putative associations with therapeutically relevant clinical endpoints. For a subset of early drug targets, our results support previous genetic evidence for efficacy in distinct common disease indications. Our analysis further proposes alternative indications as opportunities for drug repositioning and predicts on-target adverse drug events that may warrant preclinical or clinical monitoring. Among others, we discovered novel associations for p.I148M in PNPLA3. This is a common gain-of-function missense allele increasing the risk for a range of liver phenotypes, which suggested that pharmaceutical inhibition of PNPLA3 could be a viable strategy to treat or prevent liver diseases. 
While our PheWAS support this hypothesis and further back expanding the indication spectrum of a putative PNPLA3 inhibitor to T2D, we also uncovered opposite associations with severe acne and high cholesterol, phenotypes that, if observed during a clinical trial, might put a therapeutic program at risk. We also identified a novel association of the IFIH1 loss-of-function allele rs1990760-C (p.T946A) with risk of asthma. The rs1990760-C allele, which protects against several autoimmune diseases and increases risk of UC, has been shown to decrease interferon (IFN) signaling and lower resistance to viral challenge[43], while complete loss of IFIH1 function makes children susceptible to severe viral respiratory infections[44,45]. The association of rs1990760-C with increased risk of asthma discovered in our meta-PheWAS is consistent with the observation that bronchial epithelial cells from asthmatics produce lower amounts of IFN-β during viral infections[46], a finding that led to inhaled IFN-β being tested in phase 2 clinical trials for the treatment of virus-induced asthma exacerbation[47]. Future studies will need to investigate the risk:benefit ratio of modulating MDA5 (encoded by IFIH1) for asthma relative to autoimmune diseases.

While our study illustrates the power of systematically interrogating disease-agnostic cohorts with extensive health information to enrich target validation, it also emphasizes several opportunities to improve existing resources in order for PheWAS to become a routine tool in drug discovery and development. First, truly large, thoroughly phenotyped cohorts will be needed to adequately power PheWAS. Despite our meta-PheWAS being conducted in close to 700,000 individuals, 20% of GWAS associations could not be replicated (P < 0.05) in the disease-agnostic cohorts due to an insufficient number of cases. In addition, PheWAS should gain considerably from improved phenotypic endpoints[48]. 
In our study, this is best reflected by an only modest replication rate, despite adequate power, for CD, UC, and IBD endpoints that are closely related and difficult to discern from other disorders in routine clinical settings[49]. To better take these considerations and other characteristics of disease-agnostic cohorts (typical case:control ratio imbalance between phenotypes and phenotype correlation) into account, novel statistical methods will be needed to better define significance thresholds and control type I error rates in PheWAS[50].

Second, our study highlights the challenge of systematically combining phenotypes from independent disease-agnostic cohorts with various phenotype data sources. While we introduce the concept of meta-PheWAS and demonstrate that mapping phenotypes to interrogate independent PheWAS cohorts may considerably strengthen association signals, there is still a need for standardized terminology, automated phenotype extraction, and coordinated data management across healthcare institutions that will help with better harmonization across cohorts in the future[9,51].

A third challenge to the PheWAS approach is inherent to the current limitations of human genetics. Even when starting from a highly annotated set of loci as in our study, PheWAS may lead to spurious interpretation of association results that can only be ruled out through thorough follow-up[18]. We demonstrate this using the example of LGALS3 and PD. Access to genome-wide association results for systematic fine-mapping and co-localization analyses, functionalization of GWAS loci, and the emergence of association data for intermediate phenotypes, e.g., at the protein level, will be needed to help narrow the gap between SNPs and candidate target genes in the future. 
Finally, a fourth challenge to broadly use PheWAS for drug development is to relate findings from germline variants that impact a target across an individual’s entire lifetime to success of an interventional trial with much shorter observation periods. In the end, many decisions to pursue or discontinue a therapeutic program may remain dependent on the specific risk:benefit ratio that quantitative genetics as applied here may help to predict, and the level of unmet clinical need. Taken together, our study highlights PheWAS as a highly promising, yet largely untapped opportunity to use disease-agnostic cohorts with extensive health information for drug target validation. We provide several examples that illustrate PheWAS as a powerful strategy to help predict efficacy and unintended drug effects, which should ultimately help to develop better drugs. Whether PheWAS may truly impact decision making during drug development will only become evident with either the emergence of ADEs in trials that genetics could have predicted, or reduced safety-related attrition rates for portfolios enriched in targets nominated through human genetics. The growing number of large-scale population cohorts that link genetic data with extensive health data, together with an increased willingness across the borders of academia, biotech and the pharmaceutical industry to collaborate and share data, will provide opportunities to demonstrate that.

Methods

SNP selection

In this study, we selected 25 SNPs that were significantly associated (P < 5 × 10−8) in published GWAS with binary or quantitative phenotypes related to three main therapeutic areas: (auto)immune, cardiometabolic, or neurodegenerative diseases (Supplementary Methods). These 25 SNPs had either been functionally validated in published studies, establishing the candidate target gene as causal for the risk of disease, or they were located within or near genes (as defined by the regions encompassing all SNPs in r2 > 0.5 to the GWAS index SNPs extended to the nearest recombination hot spots) for which previous studies had generated convincing biological evidence to be of relevance for the respective clinical endpoint. The 25 SNPs were linked to 19 genes that were evaluated as candidate drug targets. Detailed information on the SNPs, candidate causal genes and their link to common human disease is provided in Supplementary Methods. The list of SNPs and their known associated phenotypes is provided in Supplementary Table 1.

Study cohorts

We interrogated four large observational disease-agnostic cohorts of subjects of European ancestry with genome-wide genotype data linked to extensive phenotypic information (Table 2). All participants included in each of the four cohorts were unrelated individuals of European ancestry. Individual-level data from each cohort were analyzed independently, and the relevant summary statistics for the 25 SNPs were shared for further analysis. We restricted all cohorts to binary disease phenotypes with at least 20 cases per cohort. All endpoints were derived from questionnaires or ICD codes (including endpoints like high cholesterol or high blood pressure). No quantitative laboratory measurements were included in the study.

The 23andMe cohort comprised up to 671,151 participants and 654 binary disease endpoints derived from questionnaire-based self-reports[22]. Participants were restricted to a set of individuals with > 97% European ancestry, as determined through an analysis of local ancestry using a support vector machine (SVM) and a hidden Markov model (HMM) to assign individuals to one of 31 reference populations. For each phenotype, we chose a maximal set of unrelated individuals using a segmental identity-by-descent (IBD) estimation algorithm. We defined individuals as related if they shared > 700 cM IBD on either one or both of their chromosomes. SNPs with Hardy–Weinberg equilibrium P < 10−20, call rate < 95%, or strong allele frequency deviation from European 1000 Genomes reference data were excluded. Participant genotype data were then imputed against the September 2013 release of 1000 Genomes Phase 1 reference haplotypes[52], using an internally developed phasing tool, Finch, which implements the Beagle haplotype graph-based phasing algorithm[53], and Minimac2[54]. 
The Genomics plc analysis of the UK Biobank cohort (referred to as ‘Genomics plc UK Biobank’) comprised 112,337 participants and 90 binary disease endpoints derived from questionnaire-based self-reports and medical interviews[10]. GWAS analyses were performed by Genomics plc using the interim data release (May 2015). QC followed the recommendations provided by UK Biobank. European ethnicity was defined as self-reported white British ethnic background, and confirmed by principal component analysis clustering. Samples with relatives (3rd degree or closer) were excluded. Imputation was carried out by the UK Biobank data providers using SHAPEIT3[55], IMPUTE3[56], and a reference panel combining the 1000 Genomes Phase 3[57] and UK10K datasets[58].

FINRISK is a collection of cross-sectional population surveys carried out since 1972 to assess the risk factors of chronic diseases and health behavior in the working age population of Finland[23]. The FINRISK cohort comprised 21,371 Finnish participants and 269 binary disease endpoints derived from ICD code groupings in the Finnish national hospital registries and cause-of-death registry, and from drug reimbursement and purchase registries. The FINRISK samples were genotyped using Illumina CoreExome, OmniExpress, and 610K chips. After a gender check, samples with genotype missing rate > 5% or excess heterozygosity (> 4 SD) were excluded. SNP QC, including exclusion of SNPs with genotype missing rate > 2%, minor allele frequency < 1%, or Hardy–Weinberg equilibrium P value < 1 × 10−6, was performed for each genotyping chip separately. Multidimensional scaling (MDS) components were estimated with PLINK v1.9[59] from the LD-pruned genotype data from which relatives with pi-hat > 0.2 had been removed. Samples with non-Finnish ancestry observed as MDS outliers were removed. 
Imputation was performed with SHAPEIT[55] and IMPUTE2[56] using a reference panel combining information from the 1000 Genomes phase 3[57] and 1941 Finnish SiSu whole genome sequences[60]. Imputation was stratified based on genotyping chip.

The cohort from the Children’s Hospital of Philadelphia (CHOP) comprised 12,044 pediatric patients and 870 binary disease endpoints derived from ICD9-CM codes using the ICD9-to-PheWAS codes mapping described by Denny et al.[24,61]. Subjects included in the CHOP PheWAS were genotyped on one of the following genotyping chips following the Illumina standard protocols: Illumina Human610-Quad version 1, Illumina 550K SNP array, or Illumina OmniExpress array. Samples with genotype call rate > 95% were included in the study. SNPs with genotype missing rate > 5%, minor allele frequency < 1%, or Hardy–Weinberg equilibrium P value < 0.00001 were excluded. Principal component analysis (PCA) was performed using EIGENSTRAT[62] on ∼130,000 SNPs that had been pruned for linkage disequilibrium using PLINK v1.07[59] and reference genotypes from the HapMap consortium[63]. Imputation was performed with SHAPEIT v2[55] and IMPUTE2[56] using the 1000 Genomes project phase 1 reference panel[52]. SNPs with INFO scores < 0.9 were excluded.

All the participants in the 23andMe, Genomics plc UK Biobank, FINRISK, and CHOP cohorts provided written informed consent for participating in research studies. Blood or saliva samples were collected according to protocols approved by local institutional review boards. Details are provided in the original publications describing the cohorts[10,22–24]. This research has been conducted using the UK Biobank resource under the Genomics plc project application number 9659. 
In addition, with the aim of replicating novel associations identified in the four disease-agnostic cohorts, we interrogated genome-wide summary statistics from 57 published GWAS, including 34 binary disease phenotypes, derived from a larger database that has been assembled and harmonized by Genomics plc (referred to as ‘Genomics plc GWAS’). The full list of studies in the Genomics plc GWAS database that were tested in this study is available in Supplementary Note 1. Harmonization included checks to ensure consistency of the data and alignment of alleles to the forward strand of the human reference sequence, with effects ascribed to the alternative allele. Effect size estimates for quantitative traits were rescaled relative to the residual variance. Summary-statistic imputation was applied to infer association evidence at common variants (minor allele frequency > 2%) in the 1000 Genomes EUR reference panel. Results for SNPs associated with the relevant phenotype with P < 0.05 were included in the meta-analysis. Correlation between all GWAS was estimated to ensure that no GWAS included in the meta-analysis for a given phenotype presented overlapping samples. In addition, to further prevent GWAS results from overlapping samples from being meta-analyzed, only the most recent/largest study for a given disease was included in our analysis when several GWAS in the Genomics plc database investigated the same disease. Although we could not directly estimate potential overlapping samples between the different disease-agnostic cohorts, significant overlap is very unlikely based on the participants’ characteristics (Table 2).
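The per-SNP QC filters described for the cohorts can be illustrated with a minimal sketch. It uses the FINRISK thresholds quoted above (genotype missing rate > 2%, minor allele frequency < 1%, Hardy–Weinberg P < 1 × 10−6) as defaults. The `SnpCounts` container and the function names are hypothetical, and the Hardy–Weinberg test here is a simple 1-df chi-square test chosen for brevity; it is an assumption and not necessarily the test used by the cohorts (PLINK, for instance, provides an exact test).

```python
import math
from dataclasses import dataclass


@dataclass
class SnpCounts:
    """Genotype counts for one biallelic SNP (hypothetical container)."""
    n_aa: int       # homozygous for allele A
    n_ab: int       # heterozygous
    n_bb: int       # homozygous for allele B
    n_missing: int  # samples without a genotype call


def hwe_chi2_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_qc(snp: SnpCounts, max_missing: float = 0.02,
              min_maf: float = 0.01, min_hwe_p: float = 1e-6) -> bool:
    """Return True if the SNP survives all three filters."""
    called = snp.n_aa + snp.n_ab + snp.n_bb
    if called == 0:
        return False
    missing_rate = snp.n_missing / (called + snp.n_missing)
    freq_a = (2 * snp.n_aa + snp.n_ab) / (2 * called)
    maf = min(freq_a, 1.0 - freq_a)
    if missing_rate > max_missing or maf < min_maf:
        return False
    return hwe_chi2_pvalue(snp.n_aa, snp.n_ab, snp.n_bb) >= min_hwe_p
```

Other cohorts would pass their own thresholds, for example `passes_qc(snp, max_missing=0.05, min_maf=0.01, min_hwe_p=1e-5)` for the CHOP settings.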

Identification of shared phenotypes

The phenotypic endpoints tested in the 23andMe, UK Biobank, FINRISK, and CHOP cohorts were derived from different sources (self-reports, self-reports and medical interviews, WHO ICD codes, and ICD9-CM codes, respectively) and named using different nomenclatures (e.g., clinical terms versus popular terms, abbreviations versus full names). In order to compare and combine results from the four cohorts with published GWAS results from the Genomics plc database, we manually mapped the phenotypes. Examples of mapped and unmapped phenotypic endpoints are provided in Supplementary Table 2. This step allowed us to identify 145 distinct phenotypes shared by at least 2 cohorts and with at least 20 cases in the independent cohorts (Fig. 1). The full list of mapped phenotypes is provided in Supplementary Table 3 and Supplementary Data 1. We note that, in each cohort, some phenotypes were captured multiple times by different endpoints with slightly different definitions. In such cases, only one endpoint per cohort was selected for meta-analysis.
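Once endpoints have been manually mapped to harmonized phenotype names, the selection of shared phenotypes can be sketched as below. The record layout and function name are hypothetical. The text does not state how the single endpoint per cohort was chosen when several endpoints captured the same phenotype; picking the endpoint with the most cases is an assumption made purely for illustration.

```python
from collections import defaultdict


def shared_phenotypes(endpoints, min_cases=20, min_cohorts=2):
    """Group mapped endpoints by harmonized phenotype.

    endpoints: iterable of (cohort, endpoint_name, mapped_phenotype, n_cases);
    mapped_phenotype is None for endpoints that could not be mapped.
    Returns {phenotype: {cohort: (endpoint_name, n_cases)}} restricted to
    phenotypes available in at least `min_cohorts` cohorts.
    """
    best = defaultdict(dict)
    for cohort, endpoint, phenotype, n_cases in endpoints:
        if phenotype is None or n_cases < min_cases:
            continue  # unmapped endpoint or too few cases
        current = best[phenotype].get(cohort)
        # Assumption: keep the endpoint with the most cases per cohort.
        if current is None or n_cases > current[1]:
            best[phenotype][cohort] = (endpoint, n_cases)
    return {ph: by_cohort for ph, by_cohort in best.items()
            if len(by_cohort) >= min_cohorts}
```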

PheWAS and meta-analysis

Phenome-wide association analyses for each of the 25 SNPs were conducted separately in the 23andMe, Genomics plc UK Biobank, FINRISK (PheWAS results release November 2016), and CHOP cohorts. Each SNP-phenotype association was tested independently, assuming an additive genetic model, using logistic regression adjusted for age, gender, and principal components to account for population stratification. Genotyping batch and survey cohort were also included as covariates in the FINRISK PheWAS. We then performed two distinct analyses to (1) replicate known GWAS associations and (2) detect novel associations. First, we meta-analyzed PheWAS results from the four cohorts to investigate the ability of these cohorts to replicate known GWAS associations. After harmonizing the effect alleles across the cohorts, fixed-effect meta-analyses were performed using PLINK[59]. The I² statistic and manual review of the meta-analyzed results were used to evaluate heterogeneity between cohorts. We then compared the meta-analysis association results with known significant SNP-phenotype associations from published GWAS, taking into account the statistical power to detect an association in the meta-analysis of the PheWAS results in the disease-agnostic cohorts. Second, we meta-analyzed results from the four disease-agnostic cohorts together with available GWAS results in order to detect novel associations. Meta-analysis was performed using PLINK as described above. Meta-analysis results at the 145 shared phenotypes were then combined with cohort-specific phenotype results from the 25 SNPs, resulting in 27,762 tests in total. Given the structure of this PheWAS and meta-PheWAS, the 27,762 tests are clearly not independent, so the method used to correct for multiple testing requires careful consideration. We chose two methods: one that provides an extremely conservative correction by treating all tests as independent (Bonferroni correction), and one less conservative method that has been shown to be robust to test dependency (Benjamini & Hochberg’s False Discovery Rate (FDR))[64]. Benjamini and Yekutieli (2001) showed that the FDR procedure is robust to positive correlation among tests[65]; we therefore used the standard Benjamini & Hochberg FDR procedure implemented in the p.adjust function in R. For defining significance in this study, we set an FDR threshold of 0.1, which corresponded to P < 7 × 10−4. The over-conservative significance threshold based on Bonferroni correction was P = 0.05/27,762 = 1.8 × 10−6. We note that Bonferroni correction ignores both the correlation structure between the tested phenotypes and the fact that all the SNPs tested in this study are known to be associated with one or several phenotypes in published GWAS.
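The two statistical steps in this section can be sketched in a few lines. The example below is an illustration only, not the PLINK or R code used in the study: `fixed_effect_meta` is a standard inverse-variance fixed-effect meta-analysis of per-cohort log odds ratios with the I² heterogeneity statistic, and `benjamini_hochberg` computes Benjamini & Hochberg adjusted P values in the same way as R's `p.adjust(method = "BH")`. The function names and input layout are hypothetical.

```python
import math
from statistics import NormalDist


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effect meta-analysis.

    betas: per-cohort log odds ratios; ses: their standard errors.
    Returns (pooled_beta, pooled_se, two_sided_p, i_squared_percent).
    """
    weights = [1.0 / se ** 2 for se in ses]
    wsum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / wsum
    se = math.sqrt(1.0 / wsum)
    p = 2.0 * NormalDist().cdf(-abs(beta / se))
    # Cochran's Q and the derived I^2 heterogeneity statistic.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return beta, se, p, i2


def benjamini_hochberg(pvalues):
    """Benjamini & Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Bonferroni threshold for the 27,762 tests reported in the text.
BONFERRONI_THRESHOLD = 0.05 / 27762
```

An association would be called significant at FDR < 0.1 when its adjusted P value from `benjamini_hochberg` is below 0.1, and study-wide significant when its raw P value is below `BONFERRONI_THRESHOLD`.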

Meta-analysis with UK Biobank v2 association results

To further test the robustness of the putative novel associations identified in our study, we performed a meta-analysis of the 23andMe, FINRISK, CHOP, and published GWAS results for 41 SNP-phenotype pairs with association results released by Neale et al. from an analysis of the expanded UK Biobank cohort, consisting of up to 337,199 unrelated participants of European ancestry (referred to as UK Biobank v2). In order to meta-analyze these UK Biobank v2 results, which had been obtained using linear regression models, with the PheWAS cohorts and GWAS results of the current study, which were obtained using logistic regression models, we performed a weighted Z score meta-analysis. For each SNP-phenotype pair in each study i, we defined weights using the following equation:

$$w_i = \sqrt{\frac{4}{\frac{1}{N_a} + \frac{1}{N_u}}}$$

where N_a and N_u are the numbers of cases and controls in study i, respectively. For each SNP-phenotype pair, we then calculated the meta-analysis Z score as follows:

$$Z_{\mathrm{meta}} = \frac{\sum_i w_i Z_i}{\sqrt{\sum_i w_i^2}}$$

where Z_i is the Z score in study i, derived from the logistic or linear regression model. The UK Biobank GWAS results used in this analysis have been released by the Neale lab under the following URL: https://sites.google.com/broadinstitute.org/ukbbgwasresults/home?authuser=0.
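A weighted Z score meta-analysis of this kind can be sketched as follows. The weight used here, the square root of the effective sample size 4/(1/N_a + 1/N_u), is the conventional choice for case-control studies and is an assumption of this sketch rather than a confirmed detail of the study; the function names and the input layout are hypothetical.

```python
import math


def effective_weight(n_cases: int, n_controls: int) -> float:
    """Square root of the effective sample size of a case-control study."""
    return math.sqrt(4.0 / (1.0 / n_cases + 1.0 / n_controls))


def weighted_z_meta(studies):
    """Combine per-study Z scores into a single meta-analysis Z score.

    studies: list of (z_score, n_cases, n_controls), with all Z scores
    aligned to the same effect allele.
    Returns (z_meta, two_sided_p).
    """
    weights = [effective_weight(na, nu) for _, na, nu in studies]
    numerator = sum(w * z for w, (z, _, _) in zip(weights, studies))
    denominator = math.sqrt(sum(w * w for w in weights))
    z_meta = numerator / denominator
    # Two-sided P value under the standard normal distribution.
    p = math.erfc(abs(z_meta) / math.sqrt(2.0))
    return z_meta, p
```

Because only Z scores and sample sizes enter the calculation, results from linear and logistic regression models can be combined without putting their effect sizes on a common scale.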

Statistical power estimations

We estimated statistical power to detect an association with known associated phenotypes using a formula adapted from Yang et al.[66], based on the published effect size in the most recently published GWAS, the frequency of the associated SNP risk allele in the 1000 Genomes EUR population, the number of cases and controls in the disease-agnostic cohorts, and the following phenotype prevalences reported by the Centers for Disease Control and Prevention (https://www.cdc.gov): coronary artery disease, 5.8%; Crohn’s disease, 0.2%; inflammatory bowel disease, 0.44%; myocardial infarction, 3%; multiple sclerosis, 0.09%; primary biliary cirrhosis, 0.04%; Parkinson’s disease, 0.07%; psoriasis, 3%; rheumatoid arthritis, 0.6%; systemic lupus erythematosus, 0.2%; systemic scleroderma, 0.02%; type 1 diabetes, 0.5%; type 2 diabetes, 9%; ulcerative colitis, 0.24%; venous thromboembolism, 0.4%; vitiligo, 1%.
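To show how the listed inputs (odds ratio, risk allele frequency, prevalence, and case and control counts) combine into a power estimate, the sketch below uses a generic normal approximation for an allelic case-control test. It is not the formula adapted from Yang et al. that the study used; the multiplicative allelic model, the Woolf variance of the log odds ratio, and the function name are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def case_control_power(odds_ratio, raf, prevalence, n_cases, n_controls, alpha):
    """Approximate power of an allelic case-control association test.

    raf is the risk allele frequency in the general population.
    """
    def case_freq(p0):
        # Allele frequency in cases implied by the allelic odds ratio.
        return odds_ratio * p0 / (1.0 + p0 * (odds_ratio - 1.0))

    # Solve raf = K * p_cases + (1 - K) * p_controls for p_controls by bisection.
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        mixed = prevalence * case_freq(mid) + (1.0 - prevalence) * mid
        if mixed < raf:
            lo = mid
        else:
            hi = mid
    p0 = (lo + hi) / 2.0
    p1 = case_freq(p0)

    # Woolf variance of the log odds ratio from the 2 x 2 allele count table.
    var = (1.0 / (2.0 * n_cases * p1 * (1.0 - p1))
           + 1.0 / (2.0 * n_controls * p0 * (1.0 - p0)))
    shift = abs(math.log(odds_ratio)) / math.sqrt(var)

    norm = NormalDist()
    z_crit = norm.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the two-sided test statistic exceeds the critical value.
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)
```

With the asthma parameters quoted in the Fig. 3 legend (RAF = 0.39, OR = 1.037, prevalence 8%), a call such as `case_control_power(1.037, 0.39, 0.08, n_cases, n_controls, 5e-8)` would give the approximate power at genome-wide significance for a cohort of a given size under these assumptions.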

Co-localization analyses

To distinguish true pleiotropic effects from multiple associations at the loci that are explained by different causal SNPs (and potentially incriminating different causal genes), we used association summary statistics available from published GWAS and applied a Bayesian test implemented in the R package ‘coloc’ to assess co-localization, i.e., the probability of sharing causal genetic variants between pairs of apparent pleiotropic phenotypes using association summary statistics at the loci of interest[28]. Co-localization analysis at the LGALS3 locus was performed using meta-analyzed PD GWAS summary statistics from 23andMe published elsewhere (N cases = 4127, N controls = 62,037)[26], and galectin-3 plasma pQTL results in 3562 blood donors[29]. Co-localization analysis at the IFIH1 locus was performed using meta-analyzed SLE GWAS results from two independent published studies[67,68], meta-analyzed asthma GWAS summary statistics from 23andMe[69] and the Genomics plc UK Biobank (unpublished), and published UC GWAS summary statistics[70].

68 in total

1. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

2. A genome-wide association study for venous thromboembolism: the extended cohorts for heart and aging research in genomic epidemiology (CHARGE) consortium.

Authors: Weihong Tang; Martina Teichert; Daniel I Chasman; John A Heit; Pierre-Emmanuel Morange; Guo Li; Bruno H Ch Stricker; Paul M Ridker; Aaron R Folsom; Nicholas L Smith; Nathan Pankratz; Frank W Leebeek; Guillaume Paré; Mariza de Andrade; Christophe Tzourio; Bruce M Psaty; Saonli Basu; Rikje Ruiter; Lynda Rose; Sebastian M Armasu; Thomas Lumley; Susan R Heckbert; André G Uitterlinden; Mark Lathrop; Kenneth M Rice; Mary Cushman; Albert Hofman; Jean-Charles Lambert; Nicole L Glazer; James S Pankow; Jacqueline C Witteman; Philippe Amouyel; Joshua C Bis; Edwin G Bovill; Xiaoxiao Kong; Russell P Tracy; Eric Boerwinkle; Jerome I Rotter; David-Alexandre Trégouët; Daan W Loth
Journal: Genet Epidemiol Date: 2013-05-05 Impact factor: 2.135

3. Association of systemic lupus erythematosus with C8orf13-BLK and ITGAM-ITGAX.

Authors: Geoffrey Hom; Robert R Graham; Barmak Modrek; Kimberly E Taylor; Ward Ortmann; Sophie Garnier; Annette T Lee; Sharon A Chung; Ricardo C Ferreira; P V Krishna Pant; Dennis G Ballinger; Roman Kosoy; F Yesim Demirci; M Ilyas Kamboh; Amy H Kao; Chao Tian; Iva Gunnarsson; Anders A Bengtsson; Solbritt Rantapää-Dahlqvist; Michelle Petri; Susan Manzi; Michael F Seldin; Lars Rönnblom; Ann-Christine Syvänen; Lindsey A Criswell; Peter K Gregersen; Timothy W Behrens
Journal: N Engl J Med Date: 2008-01-20 Impact factor: 91.245

4. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene-disease associations.

Authors: Joshua C Denny; Marylyn D Ritchie; Melissa A Basford; Jill M Pulley; Lisa Bastarache; Kristin Brown-Gentry; Deede Wang; Dan R Masys; Dan M Roden; Dana C Crawford
Journal: Bioinformatics Date: 2010-03-24 Impact factor: 6.937

5. A genome-wide association study of circulating galectin-3.

Authors: Rudolf A de Boer; Niek Verweij; Dirk J van Veldhuisen; Harm-Jan Westra; Stephan J L Bakker; Ron T Gansevoort; Anneke C Muller Kobold; Wiek H van Gilst; Lude Franke; Irene Mateo Leach; Pim van der Harst
Journal: PLoS One Date: 2012-10-09 Impact factor: 3.240

6. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age.

Authors: Cathie Sudlow; John Gallacher; Naomi Allen; Valerie Beral; Paul Burton; John Danesh; Paul Downey; Paul Elliott; Jane Green; Martin Landray; Bette Liu; Paul Matthews; Giok Ong; Jill Pell; Alan Silman; Alan Young; Tim Sprosen; Tim Peakman; Rory Collins
Journal: PLoS Med Date: 2015-03-31 Impact factor: 11.069

7. Extracting research-quality phenotypes from electronic health records to support precision medicine.

Authors: Wei-Qi Wei; Joshua C Denny
Journal: Genome Med Date: 2015-04-30 Impact factor: 11.117

8. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

9. The challenges, advantages and future of phenome-wide association studies. [Review]

Authors: Scott J Hebbring
Journal: Immunology Date: 2014-02 Impact factor: 7.397

10. Refining the accuracy of validated target identification through coding variant fine-mapping in type 2 diabetes.

Authors: Anubha Mahajan; Jennifer Wessel; Sara M Willems; Wei Zhao; Neil R Robertson; Audrey Y Chu; Wei Gan; Hidetoshi Kitajima; Daniel Taliun; N William Rayner; Xiuqing Guo; Yingchang Lu; Man Li; Richard A Jensen; Yao Hu; Shaofeng Huo; Kurt K Lohman; Weihua Zhang; James P Cook; Bram Peter Prins; Jason Flannick; Niels Grarup; Vassily Vladimirovich Trubetskoy; Jasmina Kravic; Young Jin Kim; Denis V Rybin; Hanieh Yaghootkar; Martina Müller-Nurasyid; Karina Meidtner; Ruifang Li-Gao; Tibor V Varga; Jonathan Marten; Jin Li; Albert Vernon Smith; Ping An; Symen Ligthart; Stefan Gustafsson; Giovanni Malerba; Ayse Demirkan; Juan Fernandez Tajes; Valgerdur Steinthorsdottir; Matthias Wuttke; Cécile Lecoeur; Michael Preuss; Lawrence F Bielak; Marielisa Graff; Heather M Highland; Anne E Justice; Dajiang J Liu; Eirini Marouli; Gina Marie Peloso; Helen R Warren; Saima Afaq; Shoaib Afzal; Emma Ahlqvist; Peter Almgren; Najaf Amin; Lia B Bang; Alain G Bertoni; Cristina Bombieri; Jette Bork-Jensen; Ivan Brandslund; Jennifer A Brody; Noël P Burtt; Mickaël Canouil; Yii-Der Ida Chen; Yoon Shin Cho; Cramer Christensen; Sophie V Eastwood; Kai-Uwe Eckardt; Krista Fischer; Giovanni Gambaro; Vilmantas Giedraitis; Megan L Grove; Hugoline G de Haan; Sophie Hackinger; Yang Hai; Sohee Han; Anne Tybjærg-Hansen; Marie-France Hivert; Bo Isomaa; Susanne Jäger; Marit E Jørgensen; Torben Jørgensen; Annemari Käräjämäki; Bong-Jo Kim; Sung Soo Kim; Heikki A Koistinen; Peter Kovacs; Jennifer Kriebel; Florian Kronenberg; Kristi Läll; Leslie A Lange; Jung-Jin Lee; Benjamin Lehne; Huaixing Li; Keng-Hung Lin; Allan Linneberg; Ching-Ti Liu; Jun Liu; Marie Loh; Reedik Mägi; Vasiliki Mamakou; Roberta McKean-Cowdin; Girish Nadkarni; Matt Neville; Sune F Nielsen; Ioanna Ntalla; Patricia A Peyser; Wolfgang Rathmann; Kenneth Rice; Stephen S Rich; Line Rode; Olov Rolandsson; Sebastian Schönherr; Elizabeth Selvin; Kerrin S Small; Alena Stančáková; Praveen Surendran; Kent D Taylor; Tanya M Teslovich; Barbara Thorand; Gudmar Thorleifsson; Adrienne Tin; Anke Tönjes; Anette Varbo; Daniel R Witte; Andrew R Wood; Pranav Yajnik; Jie Yao; Loïc Yengo; Robin Young; Philippe Amouyel; Heiner Boeing; Eric Boerwinkle; Erwin P Bottinger; Rajiv Chowdhury; Francis S Collins; George Dedoussis; Abbas Dehghan; Panos Deloukas; Marco M Ferrario; Jean Ferrières; Jose C Florez; Philippe Frossard; Vilmundur Gudnason; Tamara B Harris; Susan R Heckbert; Joanna M M Howson; Martin Ingelsson; Sekar Kathiresan; Frank Kee; Johanna Kuusisto; Claudia Langenberg; Lenore J Launer; Cecilia M Lindgren; Satu Männistö; Thomas Meitinger; Olle Melander; Karen L Mohlke; Marie Moitry; Andrew D Morris; Alison D Murray; Renée de Mutsert; Marju Orho-Melander; Katharine R Owen; Markus Perola; Annette Peters; Michael A Province; Asif Rasheed; Paul M Ridker; Fernando Rivadeneira; Frits R Rosendaal; Anders H Rosengren; Veikko Salomaa; Wayne H-H Sheu; Rob Sladek; Blair H Smith; Konstantin Strauch; André G Uitterlinden; Rohit Varma; Cristen J Willer; Matthias Blüher; Adam S Butterworth; John Campbell Chambers; Daniel I Chasman; John Danesh; Cornelia van Duijn; Josée Dupuis; Oscar H Franco; Paul W Franks; Philippe Froguel; Harald Grallert; Leif Groop; Bok-Ghee Han; Torben Hansen; Andrew T Hattersley; Caroline Hayward; Erik Ingelsson; Sharon L R Kardia; Fredrik Karpe; Jaspal Singh Kooner; Anna Köttgen; Kari Kuulasmaa; Markku Laakso; Xu Lin; Lars Lind; Yongmei Liu; Ruth J F Loos; Jonathan Marchini; Andres Metspalu; Dennis Mook-Kanamori; Børge G Nordestgaard; Colin N A Palmer; James S Pankow; Oluf Pedersen; Bruce M Psaty; Rainer Rauramaa; Naveed Sattar; Matthias B Schulze; Nicole Soranzo; Timothy D Spector; Kari Stefansson; Michael Stumvoll; Unnur Thorsteinsdottir; Tiinamaija Tuomi; Jaakko Tuomilehto; Nicholas J Wareham; James G Wilson; Eleftheria Zeggini; Robert A Scott; Inês Barroso; Timothy M Frayling; Mark O Goodarzi; James B Meigs; Michael Boehnke; Danish Saleheen; Andrew P Morris; Jerome I Rotter; Mark I McCarthy
Journal: Nat Genet Date: 2018-04-09 Impact factor: 38.330

43 in total

1. Opportunities, challenges and expectations management for translating biobank research to precision medicine.

Authors: Christopher J O'Donnell
Journal: Eur J Epidemiol Date: 2020-02-28 Impact factor: 8.082

Review 2. Maturation and application of phenome-wide association studies.

Authors: Shiying Liu; Dana C Crawford
Journal: Trends Genet Date: 2022-01-03 Impact factor: 11.639

3. Phenome-Wide Association Studies.

Authors: Lisa Bastarache; Joshua C Denny; Dan M Roden
Journal: JAMA Date: 2022-01-04 Impact factor: 56.272

Review 4. Using Phecodes for Research with the Electronic Health Record: From PheWAS to PheRS.

Authors: Lisa Bastarache
Journal: Annu Rev Biomed Data Sci Date: 2021-07-20

Review 5. The current state of omics technologies in the clinical management of asthma and allergic diseases.

Authors: Brittney M Donovan; Lisa Bastarache; Kedir N Turi; Mary M Zutter; Tina V Hartert
Journal: Ann Allergy Asthma Immunol Date: 2019-09-05 Impact factor: 6.347

Review 6. The Musculoskeletal Knowledge Portal: Making Omics Data Useful to the Broader Scientific Community.

Authors: Douglas P Kiel; John P Kemp; Fernando Rivadeneira; Jennifer J Westendorf; David Karasik; Emma L Duncan; Yuuki Imai; Ralph Müller; Jason Flannick; Lynda Bonewald; Noël Burtt
Journal: J Bone Miner Res Date: 2020-09 Impact factor: 6.741

Review 7. Genetic contributions to NAFLD: leveraging shared genetics to uncover systems biology.

Authors: Mohammed Eslam; Jacob George
Journal: Nat Rev Gastroenterol Hepatol Date: 2019-10-22 Impact factor: 46.802

8. Evaluating the cardiovascular safety of sclerostin inhibition using evidence from meta-analysis of clinical trials and human genetics.

Authors: Jonas Bovijn; Kristi Krebs; Chia-Yen Chen; Ruth Boxall; Jenny C Censin; Teresa Ferreira; Sara L Pulit; Craig A Glastonbury; Samantha Laber; Iona Y Millwood; Kuang Lin; Liming Li; Zhengming Chen; Lili Milani; George Davey Smith; Robin G Walters; Reedik Mägi; Benjamin M Neale; Cecilia M Lindgren; Michael V Holmes
Journal: Sci Transl Med Date: 2020-06-24 Impact factor: 17.956

Review 9. Pleiotropy and Cross-Disorder Genetics Among Psychiatric Disorders.

Authors: Phil H Lee; Yen-Chen A Feng; Jordan W Smoller
Journal: Biol Psychiatry Date: 2020-10-10 Impact factor: 13.382

10. Phenome-wide and expression quantitative trait locus associations of coronavirus disease 2019 genetic risk loci.

Authors: Chang Yoon Moon; Brian M Schilder; Towfique Raj; Kuan-Lin Huang
Journal: iScience Date: 2021-05-18