Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance.

Katherine Hartmann¹, Michał Seweryn², Wolfgang Sadee¹.

Abstract

Genome-wide association studies (GWAS) have implicated 58 loci in coronary artery disease (CAD). However, the biological basis for these associations, the relevant genes, and causative variants often remain uncertain. Since the vast majority of GWAS loci reside outside coding regions, most exert regulatory functions. Here we explore the complexity of each of these loci, using tissue-specific RNA sequencing data from GTEx to identify genes that exhibit altered expression patterns in the context of GWAS-significant loci, expanding the list of candidate genes from the 75 currently annotated by GWAS to 245, with almost half of these transcripts being non-coding. Tissue-specific allelic expression imbalance data, also from GTEx, allow us to uncover GWAS variants that mark functional variation in a locus, e.g., rs7528419 residing in the SORT1 locus, in liver specifically, and rs72689147 in the GUCY1A1 locus, across a variety of tissues. We consider the GWAS variant rs1412444 in the LIPA locus in more detail as an example, probing tissue- and transcript-specific effects of genetic variation in the region. By evaluating linkage disequilibrium (LD) between tissue-specific eQTLs, we reveal evidence for multiple functional variants within loci. We identify 3 variants (rs1412444, rs1051338, rs2250781) that, when considered together, each improve the ability to account for LIPA gene expression, suggesting multiple interacting factors. These results refine the assignment of 58 GWAS loci to likely causative variants in a handful of cases and, for the remainder, help to re-prioritize associated genes and RNA isoforms, suggesting that ncRNAs may be a relevant transcript in almost half of CAD GWAS results. Our findings support a multi-factorial system where a single variant can influence multiple genes and each gene is regulated by multiple variants.

Year: 2022 PMID: 35192625 PMCID: PMC8863290 DOI: 10.1371/journal.pone.0244904

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Genome-wide association studies (GWAS) have identified dozens of genetic variants (SNPs) associated with cardiovascular disease risk and related clinical phenotypes (e.g., blood pressure, lipid levels) [1-3]. However, these findings do not necessarily translate into an understanding of heritability, likely because we do not fully understand the link between significant loci, causative genetic variants and complex phenotypes [4]. Moreover, the functional variant and even the relevant gene close to a significant locus in many cases remain uncertain. The majority of statistically significant SNPs reside in non-coding regions with poorly defined biological functions and a complex architecture of multiple genes and transcripts [5]. Gene assignment is largely based on proximity, usually with little consideration for non-coding transcripts in the locus or the possibility of chromatin looping that places distant regions in close proximity [6], with regulatory domains often interacting with multiple genomic target regions [9]. Additionally, localization to non-coding regions means that the mechanisms remain unknown, as the function is not immediately obvious, while implicating epigenetics and other regulatory processes [5,7]. This uncertainty limits the utility of GWAS findings. To interpret and refine GWAS results for coronary artery disease (CAD), we use RNA expression, in addition to physical position, to prioritize the variants and gene(s) most likely to be relevant. Although largely thought of in a single SNP–single protein-coding gene paradigm, GWAS variants mark regions with various degrees of complexity, often including several protein-coding and non-coding RNAs (ncRNAs). SNPs located within RNA exons may not only alter the protein sequence but also influence RNA structure and function in a transcript-specific manner [8].
Some of these GWAS loci consist of gene clusters that are coordinately regulated [9], and almost all include multiple RNA isoforms expressed from a given gene, including splice isoforms. Within such multi-gene regions, a single variant may affect more than one gene, both protein-coding and non-coding, via chromatin looping between multiple sites or by regulating DNA accessibility for the entire region [9,10]. Therefore, a critical question for interpreting GWAS associations is which gene(s), and what specific transcript(s), are affected within each significant locus. The potential for multiple variants to affect a single gene is also critical to the interpretation of GWAS. Such interactions between variants, either linear or dynamic (epistasis) and dictated by linkage disequilibrium (LD), may remain hidden in GWAS because of the restrictive nature of multiple-hypothesis corrections; however, targeted analysis of loci reveals multiple interacting variants modulating gene expression [9,11,12]. Failure to identify all main functional variants in a gene locus and their interactions results in false estimates of the genetic influence of a locus, and further impedes discovery of dynamic interactions that are sensitive to partial or confounded estimates [13-18]. Detailed analysis of RNA expression is increasingly employed to evaluate co-localization of GWAS and eQTL signals [19-22]. However, most methods rely on a priori assumptions: some assume that variants are independent of each other (e.g., eCAVIAR), while COLOC assumes that there is only one functional variant per GWAS locus. These assumptions do not allow for a multifactorial system, where a single variant can influence multiple genes and each gene can be regulated by multiple variants. Accordingly, we search for overlap between variants marking GWAS associations and those marking eQTLs rather than using existing methods to co-localize signals.
Although this approach limits our power to detect overlap, as it requires that a single variant appear as a marker in both GWAS and eQTL analysis, we posit that it facilitates functional exploration of a multi-factorial system. A recent CAD GWAS used 1000 Genomes to impute insertions/deletions, rare variants and common variants that were not directly genotyped as part of a large-scale meta-analysis of 185 thousand cases and controls [1]. While confirming 47 of the 48 previously identified loci, this study identified an additional 10 at genome-wide significance, bringing the total count of CAD-associated loci to 58. Each of these loci is based on robust statistical associations for one or more SNPs in the locus. Furthermore, each locus has been assigned one or more genes, based largely on proximity, as part of the GWAS annotation. We consider each of these 58 loci in detail, using QTL and position to re-prioritize candidate genes and focusing on a subset of loci, to begin resolving inherent complexities of genomic architecture.

Materials and methods

Data

Our approach systematically utilizes and combines publicly available information based on the following datasets. Please note each dataset was considered separately; a meta-analysis was not undertaken.

CARDIoGRAMplusC4D Consortium GWAS results (Nikpay et al)

GWAS variants, annotated genes, and effect alleles were taken from Supplementary Tables 2 (CAD meta-analysis additive association results for 48 loci previously identified at genome-wide significance) and 4 (Association results of the 10 novel CAD loci including the dominant model) [23]. For these analyses CAD had been defined broadly as those participants with a diagnosis of myocardial infarction, acute coronary syndrome, chronic stable angina or coronary stenosis > 50%.

1000 genomes

Genotypes for calculating LD between SNPs of interest were downloaded from: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/. Individuals of the ‘EUR’ superpopulation were selected for LD calculations.

CATHeterization GENetics (CATHGEN)

Expression, genotypes, and clinical phenotypes were acquired via dbGaP Project #5358 (dbGaP accession phs000703). Expression levels had been determined using Illumina HumanHT-12-v3 in RNA from whole blood. We considered variables recorded in pht003672: age (phv00197199), gender (phv00197207), hypercholesterolemia (phv00197204), smoking (phv00197208), number of diseased vessels (phv00197295), CAD Index (phv00197202) and history of myocardial infarction (phv00197212). Approximating the definition of CAD used in the CARDIoGRAMplusC4D Consortium GWAS by Nikpay et al., we defined CAD as history of myocardial infarction and/or vessel occlusion >50% (CAD Index). We restricted analysis to Caucasians (race (phv00197206)) for sample size considerations (862 Caucasians; 259 African Americans). The approach developed here can be extended to other ethnic groups as these datasets become available. Data access was approved by the Ohio State University IRB (Protocol #2013H0096).

Genotype-Tissue Expression Project (GTEx)

Tissue-specific RNAseq data was acquired via dbGaP Project #5358 (dbGaP accession phs000424). For details see Lonsdale et al. and http://www.gtexportal.org/home/documentationPage [24]. P-values, effect sizes, and directionality for eQTLs were downloaded directly from the GTEx Portal from the already completed and published analysis of tissue specific eQTLs. Briefly, p-values reflect the alternative hypothesis that the slope of linear regression models accounting for tissue specific normalized gene expression with individual genetic variants is non-zero. This analysis included filters based on overall gene expression, normalized gene expression values, and incorporated covariates including top 5 principal components, covariates identified using Probabilistic Estimation of Expression Residuals (PEER) factors, sequencing platform (Illumina Hiseq 2000 or Hiseq X), sequencing protocol (PCR-based, PCR-free), and gender. CAD was defined as recorded history of heart disease (MHHRTDIS) or heart attack (MHHRTATT) to best approximate the definition used in the CARDIoGRAMplusC4D Consortium GWAS by Nikpay et al. Data access was approved by the Ohio State University IRB (Protocol #2013H0096).

Gene information

Transcripts, coding status, GO Ids, number of publications indexed in PubMed, gene/transcript expression, GWAS variants, GTEx eQTLs (expression quantitative trait loci) and sQTLs (splicing quantitative trait loci) including tissue specific expression, and allelic ratios in DNAse hypersensitivity sites were obtained for each gene using the package ‘mglR’ implemented in R (https://cran.r-project.org/web/packages/mglR/index.html). Protein-coding transcripts were defined as those annotated by BiomaRt as "IGC gene", "IGD gene", "IG gene", "IGJ gene", "IGLV gene", "IGM gene", "IGV gene", "IGZ gene", "nonsense_mediated_decay", "nontranslating CDS", "non stop decay", "polymorphic pseudogene", "TRC gene", "TRD gene", "TRJ gene", "protein_coding", "TEC". The remaining designations were considered non-coding and include "disrupted domain", "IGC pseudogene", "IGJ pseudogene", "IG pseudogene", "IGV pseudogene", "processed_pseudogene", "transcribed_processed_pseudogene", "transcribed unitary pseudogene", "transcribed_unprocessed_pseudogene", "translated processed pseudogene", "TRJ pseudogene", "unprocessed_pseudogene", "unitary_pseudogene", "3prime overlapping ncrna", "ambiguous orf", "antisense", "antisense RNA", "lincRNA", "ncrna host", "processed_transcript", "sense intronic", "sense overlapping", "lncRNA", "retained_intron", "miRNA", "miRNA_pseudogene", "miscRNA", "miscRNA_pseudogene", "Mt rRNA", "Mt tRNA", "rRNA", "scRNA", "snlRNA", "snoRNA", "snRNA", "tRNA", "tRNA_pseudogene", and "rRNA_pseudogene". A gene was considered non-coding only if all transcripts were non-coding.

Linkage Disequilibrium (LD)

R2 was calculated for 1000 genomes ‘EUR’ super population using the ‘ld’ function from the package ‘snpStats’ implemented in R and using LDlink.
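As a rough illustration of what an LD R2 value measures, the sketch below computes the squared Pearson correlation between the allele dosages (0, 1 or 2 copies) of two SNPs. It is written in Python for illustration only and is an assumption-laden simplification: it is not the algorithm used by the `ld` function in snpStats or by LDlink, and the function name `r_squared` is hypothetical.

```python
def r_squared(dosages_a, dosages_b):
    """Squared Pearson correlation between two SNPs' allele dosages (0, 1, 2).

    Illustrative approximation of LD r^2 from unphased genotypes;
    not the snpStats or LDlink implementation.
    """
    if len(dosages_a) != len(dosages_b):
        raise ValueError("both SNPs need one dosage per individual")
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return cov * cov / (var_a * var_b)
```

Two SNPs whose dosages always agree give a value of 1, while SNPs whose dosages vary independently across individuals give a value near 0.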

Association testing

Generalized linear models to account for LIPA expression (linear) and CAD (additive logistic) using different combinations of variants were compared using ANOVA with a likelihood ratio test (LRT) implemented in R. Gender and age were included as covariates in models explaining LIPA gene expression, while sex, age, hypercholesterolemia, smoking, and number of diseased vessels were included as covariates in models explaining CAD. Bonferroni multiple hypothesis corrected p-values from the LRT comparing models as well as AICs reflecting ‘goodness of fit’ for individual models are reported. Differences in LIPA expression between those with or without a history of CAD were calculated using the wilcox.test function in R.
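The model comparison described above can be sketched for the linear (expression) case as a likelihood ratio test between two nested least-squares fits that differ by one marker variant. The Python code below is a textbook Gaussian version under stated assumptions and is not expected to reproduce R's `anova` output exactly; the function names are hypothetical, and the AIC is given only up to an additive constant, which is sufficient for comparing models fitted to the same samples.

```python
import math

def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit.

    X is a list of rows; each row must already include a leading 1.0
    for the intercept. Solved via the normal equations.
    """
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("design matrix is collinear")
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def lrt_one_df(rss_reduced, rss_full, n):
    """LRT statistic and p-value for Gaussian models differing by one term."""
    stat = max(0.0, n * math.log(rss_reduced / rss_full))
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square tail, 1 df
    return stat, p_value

def aic(rss, n, n_coefficients):
    """Gaussian AIC up to an additive constant (variance counted as a parameter)."""
    return n * math.log(rss / n) + 2 * (n_coefficients + 1)

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)
```

In use, the reduced model would hold the covariates plus one variant and the full model would add a second variant; a small adjusted p-value together with a lower AIC for the full model indicates that the added marker improves the fit.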

Allelic Expression Imbalance (AEI)

Allelic RNA expression imbalance (AEI) was assessed using data from GTEx (phe000039.v1.GTEx_v8_ASE.expression-matrixfmt-ase.c1). Candidate variants were subsetted from each individual file, and the deviation of the “REF_RATIO” from the “NULL_RATIO” was plotted for each variant in a given tissue type. Tissue types with 5 or more samples were considered.
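A minimal Python sketch of this filtering step is given below. The field names `REF_RATIO` and `NULL_RATIO` come from the text; the record layout (one dictionary per heterozygous sample, with a `tissue` key) and the function name are assumptions made for illustration and do not describe the GTEx file format.

```python
from collections import defaultdict

MIN_SAMPLES = 5  # tissues with 5 or more samples were considered

def aei_deviation_by_tissue(records, min_samples=MIN_SAMPLES):
    """Group REF_RATIO - NULL_RATIO by tissue for one variant.

    records: iterable of dicts with hypothetical keys
    'tissue', 'REF_RATIO' and 'NULL_RATIO'.
    Tissues with fewer than min_samples samples are dropped.
    """
    by_tissue = defaultdict(list)
    for rec in records:
        by_tissue[rec["tissue"]].append(rec["REF_RATIO"] - rec["NULL_RATIO"])
    return {tissue: devs for tissue, devs in by_tissue.items()
            if len(devs) >= min_samples}
```

Deviations that fall consistently on one side of zero within a tissue suggest allelic imbalance, whereas deviations scattered around zero do not.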

Tissue specific eQTLs

eQTLs reported by GTEx for LIPA were clustered by their LD (R2 calculated from 1000 Genomes) using heatmap.2 from the gplots package in R. The p-value reported by GTEx for each eQTL was used to shade the coloring of a tissue-specific bar alongside the heatmap using the ColSideColors argument within the heatmap.2 function. In this way tissue-specific LD blocks could be visually assessed. Power calculations based on tissue-specific gene expression (median transcripts per million) and sample size were performed for a mock genetic variant assumed to have a MAF of 0.05 and an effect size of 40% (i.e., expression with no minor alleles is 20% less than the median tissue-specific gene expression and expression with two minor alleles is 20% greater than the median). It was assumed that 5 million genetic variants were tested. Calculations were executed using the powerEQTL.ANOVA function from the powerEQTL package in R. Results were plotted using the barplot function in R.

Results and discussion

Expanding candidate gene lists using QTL and position

As many functional variants marked by GWAS likely have regulatory functions affecting RNA expression or processing, the same SNPs appearing in GWAS may also mark expression Quantitative Trait Loci (eQTLs) or splicing Quantitative Trait Loci (sQTLs) for their target gene. To assign GWAS variants to target genes, we determine for each of the GWAS SNPs whether it appears as an eQTL or sQTL reported by GTEx, searching all available tissues. Recognizing that often multiple SNPs exist over a genomic region as significant GWAS variants, we consider each one individually in assigning candidate genes and separately assess concordance. We opt not to use COLOC and other existing tools that search for overlapping signal between GWAS variants and QTLs because they make assumptions about the genetic model that are not in line with the multi-factorial system we test here [25]; namely, these methods assume a single causative variant or that each variant acts independently. Instead, although we recognize it limits the overlap we are able to detect and biases our sample to variants that are ideal markers (i.e. frequent), we search for exact matches between GWAS and QTL marker variants. In addition to evaluating associations with gene expression and splicing, we consider the physical position of each GWAS variant as SNPs within the RNA sequence are expected to impact RNA folding, stability, function, etc. Specifically, we consider the corresponding gene for any transcript that physically overlaps the GWAS variant regardless of strand, thus incorporating coding, non-coding, and antisense genes. Using these three approaches (cis-eQTLs, cis-sQTLs, position), we expand the list of potential candidate genes for the 58 GWAS loci from 75 to 245 (Fig 1A, S1 File, comprehensive table is included in S3 File, S1 Fig).
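The exact-match strategy can be sketched as a lookup of each GWAS SNP in three tables (eQTL, sQTL, physical overlap). The Python code below is an illustrative sketch: the input layout (dictionaries keyed by rsID) and the function name are assumptions, not the authors' implementation.

```python
def candidate_genes(gwas_snps, eqtls, sqtls, overlapping):
    """Map each GWAS SNP to the genes it implicates and the kind of evidence.

    eqtls, sqtls, overlapping: dicts from rsID to an iterable of gene names
    (hypothetical layout). Only exact rsID matches count; no LD proxies.
    """
    result = {}
    for snp in gwas_snps:
        evidence = {}
        for label, table in (("eQTL", eqtls), ("sQTL", sqtls),
                             ("position", overlapping)):
            for gene in table.get(snp, ()):
                evidence.setdefault(gene, set()).add(label)
        result[snp] = evidence
    return result

# Example using relationships stated in the text for two GWAS SNPs.
hits = candidate_genes(
    ["rs17514846", "rs663129"],
    eqtls={"rs17514846": ["FES", "FURIN"]},
    sqtls={"rs17514846": ["FES"]},
    overlapping={"rs17514846": ["FURIN"]},
)
```

Because only exact matches are counted, a SNP such as rs663129, which has no QTL or overlap entry, maps to an empty set of candidates.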

Fig 1

Summary of CAD GWAS loci.

(A) For each of the 58 loci identified by GWAS, number of candidate genes annotated by GWAS and additional genes added by eQTL, then sQTL, and finally position based reprioritization, if implicating genes other than those annotated previously by GWAS (See S1 Fig for further details about the approach and S3 File for a comprehensive table). Tier 1 (n = 7) denotes those loci where a GWAS annotated gene is supported by QTL-based re-prioritization or position and no other candidate genes are introduced; Tier 2 (n = 50) where QTL-based reprioritization or position introduces new associated genes while supporting all candidates at this locus (Tier2A), only some including the GWAS gene (Tier2B) or new genes except the GWAS genes (Tier2C); and Tier 3 (n = 1) where no eQTLs or sQTLs are identified and no gene physically overlaps the SNP, accordingly annotation by GWAS is not supported and no other genes are implicated. (B) Corresponding figure for recent large scale GWAS for insulin resistance. (C) For each of the 245 candidate genes displayed along the x-axis (names available in S1 File), the number of transcripts assigned to the gene, the number of antisense transcripts (note: antisense genes are not included among the 245 candidate genes unless their expression is associated with or they physically overlap a GWAS variant), GO terms, papers indexed in PubMed, cis-eQTLs and sQTLs published in v8 of GTEx. Blue bar highlights those genes with only non-coding transcripts.

This phenomenon of expanding a GWAS based candidate gene list by incorporating genes for which the GWAS variant serves as an eQTL/sQTL or on the basis of physical proximity is not unique to the CAD phenotype nor the particular GWAS published by Nikpay et al and their means of annotating genes.
We considered two additional phenotypes of insulin resistance and blood pressure with recent large-scale GWAS studies and found these approaches also significantly expanded the range of candidate genes (Fig 1 and S4 File) [26,27]. In an effort to identify those loci where a target gene(s) is clearly supported by functional markers, we consider the agreement between the gene assignment given by GWAS studies and that derived by eQTL and sQTL analysis as well as by physical position. We group each of the 58 GWAS loci as follows: GWAS annotation is supported by QTL-based re-prioritization or position and no other candidate genes are introduced (Tier 1); QTL-based reprioritization or position introduces new genes, while supporting all (Tier2A), some (Tier2B), or none (Tier2C) of the genes annotated by GWAS so that multiple genes are implicated; no eQTLs or sQTLs are identified and no gene or annotated RNA transcript physically overlaps the SNP, accordingly annotation by GWAS is not supported (but also not negated) and no other genes are implicated (Tier 3), see Fig 1, S2 Fig, S3 File.
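The tier definitions above amount to a small decision rule over two gene sets per locus. The Python sketch below encodes that rule as read from the text; the function name is hypothetical, and the handling of a locus with no GWAS-annotated gene is not covered by the source.

```python
def classify_locus(gwas_genes, implicated_genes):
    """Return the tier label for one GWAS locus.

    gwas_genes: genes annotated by the GWAS for this locus.
    implicated_genes: genes supported by eQTL, sQTL or physical overlap.
    """
    gwas = set(gwas_genes)
    implicated = set(implicated_genes)
    supported = gwas & implicated  # GWAS genes with functional support
    new = implicated - gwas        # candidates not annotated by GWAS

    if not implicated:
        return "Tier 3"   # nothing implicated; GWAS annotation unsupported
    if not new:
        return "Tier 1"   # GWAS annotation supported, no new candidates
    if supported == gwas:
        return "Tier 2A"  # new genes, all GWAS genes supported
    if supported:
        return "Tier 2B"  # new genes, some GWAS genes supported
    return "Tier 2C"      # new genes, no GWAS gene supported
```

For instance, a locus annotated with ABCG5 and ABCG8 where only ABCG8 is supported and no new gene appears is Tier 1, while a locus annotated with LDLR where only other genes such as SMARCA4 are implicated is Tier 2C.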

Tier 1: No new candidate genes introduced–GWAS annotation supported

For 7 loci, QTL-based reprioritization and/or position supports the GWAS annotation without introducing new candidate genes, supporting all or some of the gene(s) annotated by GWAS (Table 1).

Table 1

Tier 1 CAD GWAS loci.

Locus	SNP	OR	Risk Allele (Freq)	Gene	eQTL Tissue(s)	sQTL Tissue(s)	Position
16	rs6903956	1.65^a (1.44–1.90)	A (0.08^a)	ADTRP		Testis	ADTRP (intron)
32	rs11203042	1.04 (1.02–1.06)	T (0.45)	LIPA	Adipose (subq); Adipose (visceral); Colon (transverse); Heart (atrium); Lung; Pancreas; Skin (sun exp); Spleen; Thyroid; Blood	Adipose (subq); Fibroblasts; Lung	LIPA (intron)
32	rs1412444	1.07 (1.05–1.09)	T (0.37)	LIPA	Adipose (subq); Adipose (visceral); Adrenal Gland; Artery (aorta); Brain (cerebellum); Colon (sigmoid); Colon (transverse); Heart (atrium); Heart (LV); Lung; Skeletal Muscle; Nerve; Pancreas; Skin (not sun exp); Skin (sun exposed); Spleen; Stomach; Thyroid; Blood	Adipose (subq); Adipose (visceral); Adrenal Gland; Artery (aorta); Artery (tibial); Brain (spinal cord); Breast; Fibroblasts; Lymphocytes; Lung; Tibial Nerve; Pancreas; Skin (sun exposed); Small Intestine; Spleen; Stomach; Blood	LIPA (intron)
38	rs9319428	1.04 (1.02–1.06)	A (0.31)	FLT1	Nerve (tibial)		FLT1 (intron)
42	rs17514846	1.05 (1.03–1.07)	A (0.44)	FES	Adipose (subq); Adipose (visceral); Adrenal Gland; Artery (aorta); Artery (tibial); Fibroblast; Colon (transverse); Esophagus (musc.); Heart (atrium); Lung; Nerve (tibial); Pancreas; Pituitary; Prostate; Skin (not sun exp); Skin (sun exposed); Stomach; Thyroid; Blood	Adipose (subq); Adipose (visceral); Artery (aorta); Artery (tibial); Breast; Fibroblasts; Colon (sigmoid); Esophagus (GEJ); Esophagus (musc.); Heart (atrium); Heart (LV); Lung; Salivary Gland; Nerve (tibial); Prostate; Skin (not sun exp); Skin (sun exposed); Small Intestine; Spleen; Thyroid; Blood	FURIN (intron)
42	rs17514846	1.05 (1.03–1.07)	A (0.44)	FURIN	Artery (aorta); Artery (tibial); Esophagus		FURIN (intron)
54	rs7212798			BCAS3			BCAS3 (intron)
57	rs11830157			KSR2			KSR2 (intron)
8^b	rs6544713	1.05 (1.03–1.07)	T (0.32)	ABCG8	Colon (transverse)		ABCG8 (intron)

a values reported from original publication [28] in Han Chinese population. rs6903956 was not significant in Nikpay et al. [1].

b ABCG8 and ABCG5 were both annotated by GWAS. ABCG5 was not supported by QTL or position.

Tissue names in grey font indicate that the GWAS SNP is associated with a decrease in gene expression (eQTL) or normalized intron-excision ratio (sQTL), while those in black font are associated with increased expression or normalized intron-excision ratio, as reported by GTEx.

For five of the loci (16-ADTRP, 32-LIPA, 38-FLT1, 42-FURIN, 8-ABCG8), the GWAS candidate gene assignment is supported by both QTL and position. In one instance, locus 42, rs17514846 (FURIN, FES), more than one gene is annotated by GWAS and supported by our reprioritization. rs17514846, which falls in an intron of FURIN, serves as an eQTL and an sQTL for FES in 23 tissues and an eQTL for FURIN in 3 tissues, two of which (aorta and tibial artery) overlap with FES. In aorta and tibial artery, rs17514846 is associated with decreased expression of FES as opposed to increased expression of FURIN, a possible example of competing interactions between regulatory and promoter regions. Evidence for multiple candidate genes in a locus may represent a paradigm in which a single SNP exerts an impact through more than one gene. In some instances the same variant in the same tissue is associated with both expression and splicing. For example, rs1412444 in blood is associated with increased expression of LIPA and decreased splicing, a scenario that is consistent with greater stability of the un-spliced transcript. Thus, in considering potential mechanisms of action for the variant, it is important to evaluate not only the implications of increased levels of LIPA mRNA, but also increased levels of the un-spliced transcript.

Tier 2: New candidate genes implicated

Variants in 50 loci are associated with expression of one or more genes or physically overlap with another gene in addition to all (39 loci), some (7 loci), or none (4 loci) of the genes annotated by GWAS. Loci where additional candidate genes are introduced are classified as Tier 2 (S3 File). Candidate genes for these 50 GWAS loci are expanded by an average of 4.3 genes per locus for a total of 170 genes: 116 from eQTL based reprioritization, 17 from sQTL based reprioritization, 5 from physical position, and 32 from some combination of these features (S3 Fig). While about a third of the loci (21) have two or fewer candidate genes, others have substantially more: e.g., at locus 33, rs12413409 and rs11191416 (CYP17A1-CNNM2-NT5C2) are associated with expression of twelve genes. Importantly, these multi-gene eQTLs cannot be explained solely by co-expression between genes. These eQTLs are often associated with expression of different genes in different tissues. Notably, ncRNAs are candidate genes for 33 of the 58 loci, expanded from 6 loci prior to re-prioritization. For no loci are all candidate genes non-coding. For Tier 2C loci, there is no evidence to support the GWAS annotation. For example, at locus 46 (rs1122608 and rs56289821), GWAS annotates LDLR, a gene well recognized for its role in lipid metabolism; yet rs1122608 falls within an intron of SMARCA4 and is both an eQTL and sQTL for SMARCA4 as well as an eQTL for CARM1 and YIPF2, but not LDLR. The alternative SNP identified by GWAS, rs56289821, also does not point to LDLR but rather implicates RGL3, SLC44A2, and again SMARCA4. These 4 Tier 2C loci critically require future work, both mechanistic and computational, to explore relevant gene targets.

Tier 3: No genes implicated

The remaining GWAS locus, locus 55, rs663129 (MC4R, PMAIP1), classified as Tier 3, did not show any association with expression of nearby genes and does not physically overlap any transcripts (S3 File). This locus and 3 others (locus 27, rs2954029 (TRIB1); locus 54, rs7212798 (BCAS3); and locus 57, rs11830157 (KSR2)) that are without any eQTL associations may have more subtle or context-dependent effects on gene expression that remain undetectable in GTEx. In particular, non-polyadenylated transcripts are not in GTEx as poly-dT priming was used, leaving countless ncRNAs as additional candidates. Furthermore, these SNPs may affect gene expression in trans (although we do not find such evidence in the GTEx trans-eQTL dataset) or exert their effect without altering RNA levels measured by RNAseq (e.g., by controlling the chromatin structure or by co-translationally altering RNA modifications). Additionally, variants affecting RNA functions and processing (structural RNA SNPs) [8,29] may not be visible as eQTLs, or they may selectively affect translation by changing polysomal loading [30]. Given that GWAS variants are expected to mark functional variants rather than themselves being functional, we test SNPs within a 1 Mb window in LD (R2 > 0.8) with each of the 4 GWAS variants lacking annotations, expanding the number of SNPs to 200. Using this approach, we find significant eQTLs, but no significant sQTLs, for three of the four loci. For locus 57, we were unable to find additional candidate SNPs with an R2 > 0.8 to mark the haplotype.
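The proxy expansion described above can be sketched as a filter on distance and LD. In the Python sketch below, the thresholds (1 Mb window, R2 > 0.8) come from the text, while the interpretation of the window as centered on the GWAS SNP, the record layout and the function name are assumptions.

```python
def ld_proxies(index_pos, variants, r2_min=0.8, window_bp=1_000_000):
    """Return IDs of variants that can stand in for a GWAS index SNP.

    index_pos: base-pair position of the GWAS SNP.
    variants: dicts with hypothetical keys 'id', 'pos' and 'r2',
    where 'r2' is the LD with the index SNP.
    Assumes the window is centered on the index SNP.
    """
    half_window = window_bp // 2
    return [v["id"] for v in variants
            if abs(v["pos"] - index_pos) <= half_window and v["r2"] > r2_min]
```

Each proxy returned this way can then be looked up in the eQTL and sQTL tables exactly as the GWAS SNP itself was.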

Survey of CAD GWAS candidate genes

The genomic loci for each of these 245 candidate genes often harbor multiple protein-coding and non-coding transcripts arranged on both the sense and antisense strands (S2 File). They express an average of 9 transcripts per gene and a maximum of 189 (TEX41; locus 10, rs2252641 and rs17678683), with 47% of all transcripts being non-coding (Fig 1B). More than half of the gene loci (161) also contain one or more antisense genes (i.e., located on the opposite strand and overlapping). With a median of 26 publications and a maximum of 27,497 (APOE), only a handful of these 245 candidate genes have been well studied to date (Fig 1B). Twenty percent (51) of genes do not have a single paper indexed in PubMed. There are on average 20 gene ontology (GO) terms, which are manually curated based on the literature, assigned to each gene; however, 62 (25%) of the candidate genes have no associated GO terms. We find that those genes without GO terms and with limited publications do not have fewer markers of functionality (eQTLs, splicing QTLs, etc.), but are almost exclusively non-coding, indicative of a recognized bias in the literature toward protein-coding genes (Fig 1). Each implicated locus displays an astoundingly complex architecture with multiple candidate genes implicated by RNA expression and physical location, each with a number of overlapping coding and non-coding transcripts including those in antisense orientation. The complexity of these loci emphasizes the need for targeted molecular studies and computational approaches to determine the relevant gene and transcript(s). The distribution of PubMed articles and GO ids across candidate genes suggests that this targeted work has touched on only a handful of genes thus far, with more recent studies beginning to focus on ‘neglected’ CAD candidate genes [31].

Allelic RNA expression imbalance reveals functional variation

To evaluate potential functionality for each of the 58 GWAS loci, we ask whether each candidate SNP is associated with allelic expression imbalance (AEI), a specific indicator of cis-acting regulatory variation. By comparing expression of the two alleles at a heterozygous variant, various external/trans-acting influences on gene expression are shared and the cis-acting effect of the heterozygous variant can be isolated. In the absence of a functional variant altering RNA expression, the anticipated distribution between the alleles is 0.5 (ratio = 1) [8,32]. Using data released by GTEx, we evaluate AEI at each of 104 candidate variants across 54 tissue types. Only 55 of the SNPs are represented in the data. The remainder are likely in intergenic regions and poorly captured by RNA sequencing, while obtaining accurate AEI ratios requires rather robust expression (>30 RPM) [33]. Of the 55, many are present in only a few samples, making it difficult to infer differential expression. However, several SNPs show surprisingly robust data: thousands of samples and dozens to hundreds of counts for each allele. A majority of these SNPs fail to reveal allelic expression imbalance, with a near normal distribution of deviation from the expected ratio, suggesting no correlation between the GWAS variant and allelic expression imbalance. This implies that the GWAS candidate SNPs represented in the data are actually relatively poor markers for functional cis-acting variants in the locus; however, splicing events generating RNA isoforms with similar turnover are one example where allelic expression imbalance analysis would fail to detect a functional variant. A number of SNPs do display consistent allelic expression imbalance (Fig 2). Locus 3, rs7528419 (SORT1), which falls in the 3’UTR of CELSR2, exhibits AEI in 53/57 liver samples. Overall low expression of CELSR2 in liver tissue, however, means that these ratios are for the most part based on low coverage (median total count 13).
Despite the relative consistency from sample to sample, large allelic ratios derived from relatively low counts, as observed here, raise suspicion for systemic sources of bias, e.g., preferential amplification of one allele. To evaluate this further, we considered allelic ratios at nearby SNPs in strong LD (R2 > 0.9) and weak LD (R2 < 0.1). As these SNPs are co-located, systemic sources of bias should affect all SNPs in the locus, while ‘true’ biological AEI would be expected only for those variants in strong LD with a functional SNP. We observe AEI for those SNPs in strong LD with the GWAS marker, but not for those in the same region in weak LD, a pattern that is suggestive of ‘true’ biological AEI and a functional cis-acting variant.

Fig 2

Allelic expression imbalance at GWAS variants mark functional SNPs.

Deviation of the observed from the expected ratio for individuals heterozygous for the given GWAS variant. (A) Locus 3, rs7528419 (SORT1), exhibits AEI in 53/57 liver samples. Subcutaneous adipose, also shown, demonstrates a near normal distribution of deviation from the expected allelic ratio and is representative of the 46 other tissues with at least 5 samples. (B) Locus 14, rs72689147 (GUCY1A3), exhibits AEI in 114/121 samples across 10 different tissues.

Importantly, even one sample without AEI suggests that the variant might itself not be functional but rather be in high LD with a functional variant and serve as a marker. With only a few samples not exhibiting AEI, rs7528419 can be considered an excellent marker in tight LD with the functional SNP. Furthermore, that this pattern is only found in liver suggests that the regulatory variant is tissue specific. In contrast, the bidirectional ratios observed in adipose tissue suggest that the SNP is not in tight LD with a variant that is functional in adipose tissue. The Locus 14 SNP rs72689147, which falls within an intron of GUCY1A3, exhibits AEI in 114/121 samples across 10 different tissues. Again, this SNP does not appear to be functional, as not all samples display AEI, but it is a robust marker. While the SNP is located in an intron, expression is sufficient to extract allelic ratios; as these are consistently below unity, this result suggests a gain of function.

Resolving number of signals in a locus using LD

Focusing on the 7 loci where eQTL-based reprioritization pointed to a single gene as well as the two examples of AEI discussed above, we find dozens of other significant eQTLs for each gene. To determine whether these eQTLs represent one or more functional variants, we plot the effect size (beta) of the variant on RNA expression for each eQTL against its LD (R2) with the top scoring (most significant p-value) eQTL in each tissue where eQTLs are detectable. Assuming one functional variant in the locus, the beta for each eQTL should correlate with its R2 relative to the highest scoring SNP [34]. This approach reveals that the observed eQTLs for a gene often represent more than one regulatory variant, with the exception of FLT1 in Tibial Nerve–represented by only one cluster of variants marked by the GWAS SNP (Fig 3). This result is critical to the correct interpretation of GWAS that would otherwise rely on a single variant rather than considering the combined effect of more than one causative variant.
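The reasoning behind this plot can be sketched as follows. The data are hypothetical, and the rule used to flag outliers (an effect size exceeding |beta of the top eQTL| × R2 by more than a tolerance) is an illustrative assumption rather than the authors' criterion. With a single functional variant, |beta| tracks R2 closely; SNPs with large effects despite modest R2 hint at an additional signal.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_excess_effect(eqtls, top_beta, tolerance=0.1):
    """SNPs whose |beta| exceeds |top_beta| * R2 by more than `tolerance`.

    The linear expectation and the tolerance are assumptions for illustration.
    """
    return [snp for snp, beta, r2 in eqtls if abs(beta) - abs(top_beta) * r2 > tolerance]


# Hypothetical eQTLs: (SNP id, beta, R2 with the top eQTL).
single_signal = [("s1", 0.50, 1.00), ("s2", 0.44, 0.90), ("s3", 0.31, 0.60), ("s4", 0.14, 0.30)]
two_signals = single_signal + [("s5", 0.48, 0.45), ("s6", -0.52, 0.50)]

for label, eqtls in (("single signal", single_signal), ("two signals", two_signals)):
    r = pearson([r2 for _, _, r2 in eqtls], [abs(b) for _, b, _ in eqtls])
    flagged = flag_excess_effect(eqtls, top_beta=0.50)
    print(f"{label}: correlation = {r:.2f}, flagged = {flagged}")
```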

Fig 3

Number of eQTL signals.


Correlation plots show absolute value of beta for variant effects on RNA expression versus R2 with the top eQTL (most significant p-value), including all significant eQTLs in the given gene-tissue combination. Blue dots represent the top eQTL (most significant p-value), red dots represent GWAS variant(s). (A) FLT1 in Tibial Nerve: eQTLs are accounted for by a single eQTL marked by the GWAS variant (all eQTLs display a linear correlation with R2). For CELSR2 (liver), GUCY1A3 (tibial artery), and LIPA (blood), the relationship between beta and R2 suggests multiple functional variants. (B) At least three distinct LD blocks are represented by LIPA eQTLs in whole blood. Correlations are shown left to right between the absolute value of beta and R2 with rs1412444 (GWAS SNP), rs1051338, or rs2250781. Tightly linked SNPs (D’ > 0.9; R2 > 0.9) are shown in the same color.

As an example, we consider the number of distinct eQTLs needed to maximally account for LIPA expression in blood. The most significant eQTL consists of a group of SNPs in high LD marked by the GWAS variant (red dot in Fig 3), while two additional clusters of SNPs (marked by rs1051338 and rs2250781) have equally or even more robust beta and p-values but show relatively poor linkage with the GWAS cluster (R2 ~ 0.5) (Fig 3B). These SNPs are more significant eQTLs than predicted by their LD with the trait-associated variant and may mark additional functional variants in the locus. To test the significance of any additional regulatory variants, we used a separate dataset (CATHGEN) to evaluate whether including an additional marker variant in a regression model improves the ability to account for LIPA expression in blood. Including additional markers improved the eQTL model, while adding a marker in strong LD with the original variant did not (Table 2), indicating there are likely multiple functional variants, incompletely represented by the GWAS variant alone, that contribute to LIPA expression in blood.

Table 2

Assessing multiple regulatory variants for LIPA.

Variable of interest	ANOVA p-value	Model 1	AIC (Model 1)	Model 2	AIC (Model 2)
rs1412444	8.8e-16	XP ~ sex + age	3310	XP ~ rs1412444 + sex + age	1778
rs13332328	8.8e-16	XP ~ sex + age	3310	XP ~ rs13332328 + sex + age	1782
rs1051338	8.8e-16	XP ~ sex + age	3310	XP ~ rs1051338 + sex + age	1779
rs2250781	8.8e-16	XP ~ sex + age	3310	XP ~ rs2250781 + sex + age	1800
rs1412444 in context of rs13332328	1.0	XP ~ rs1412444 + sex + age	1778	XP ~ rs1412444 + rs13332328 + sex + age	1781
rs1412444 in context of rs1051338	0.23	XP ~ rs1412444 + sex + age	1778	XP ~ rs1412444 + rs1051338 + sex + age	1777
rs1412444 in context of rs2250781	0.04	XP ~ rs1412444 + sex + age	1778	XP ~ rs1412444 + rs2250781 + sex + age	1773
rs1412444 & rs2250781 in context of rs1051338	0.19	XP ~ rs1412444 + rs2250781 + sex + age	1773	XP ~ rs1412444 + rs2250781 + rs1051338 + sex + age	1773
rs1412444 & rs1051338 in context of rs2250781	0.04	XP ~ rs1412444 + rs1051338 + sex + age	1777	XP ~ rs1412444 + rs1051338 + rs2250781 + sex + age	1773
rs1412444	1e-3	CAD ~ covariates	387.7	CAD ~ rs1412444 + covariates	385.1
rs13332328	1e-3	CAD ~ covariates	387.7	CAD ~ rs13332328 + covariates	385.3
rs1051338	6e-4	CAD ~ covariates	387.7	CAD ~ rs1051338 + covariates	384.3
rs2250781	4e-4	CAD ~ covariates	387.7	CAD ~ rs2250781 + covariates	383.5
rs1412444 in context of rs13332328	0.79	CAD ~ rs1412444 + covariates	385.1	CAD ~ rs1412444 + rs13332328 + covariates	386
rs1412444 in context of rs1051338	0.36	CAD ~ rs1412444 + covariates	385.1	CAD ~ rs1412444 + rs1051338 + covariates	384.5
rs1412444 in context of rs2250781	0.56	CAD ~ rs1412444 + covariates	385.1	CAD ~ rs1412444 + rs2250781 + covariates	385.9
rs1412444 & rs2250781 in context of rs1051338	0.11	CAD ~ rs1412444 + rs2250781 + covariates	385.9	CAD ~ rs1412444 + rs2250781 + rs1051338 + covariates	385.1
rs1412444 & rs1051338 in context of rs2250781	0.17	CAD ~ rs1412444 + rs1051338 + covariates	384.5	CAD ~ rs1412444 + rs1051338 + rs2250781 + covariates	385.1

ANOVA comparing generalized linear models with different SNP combinations accounting for LIPA expression (XP) and CAD (defined as history of myocardial infarction and/or >50% stenosis of vessel). Covariates in CATHGEN include sex, age, hypercholesterolemia, smoking, and number of diseased vessels.

Testing these additional variants with CAD instead of LIPA expression did not yield significant associations (Table 2). However, LIPA expression itself is not associated with CAD except when rs1412444 is homozygous minor, which may explain the discrepancy. In looking separately at the associations between the GWAS variant and LIPA expression and between the GWAS variant and CAD, we find that rs1412444 is associated with increased risk of CAD and increased expression of LIPA, but counterintuitively those with two minor alleles and CAD exhibit lower rather than higher expression, a pattern that also holds in GTEx although it is only statistically significant in CATHGEN (Figs 4 and S4).
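The nested-model comparison summarized in Table 2 can be illustrated with a minimal sketch on simulated data. This is not the authors' code or the CATHGEN data: genotypes and expression are synthetic, the sex and age covariates are omitted, a Gaussian linear model fitted by least squares stands in for the GLMs, and AIC is computed only up to an additive constant. Under these assumptions, adding an independent second marker improves the model, whereas adding a near-duplicate proxy for the first marker is not expected to.

```python
import random
from math import erfc, log, sqrt


def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit (rows of X include the intercept)."""
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    M = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    beta = [M[i][p] / M[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(X, y))


def gaussian_aic(rss, n, k):
    """AIC of a Gaussian linear model up to an additive constant (k coefficients)."""
    return n * log(rss / n) + 2 * k


def lrt_p_one_df(rss_small, rss_big, n):
    """Likelihood ratio test p-value for one added term (chi-square, 1 df)."""
    stat = max(n * log(rss_small / rss_big), 0.0)
    return erfc(sqrt(stat / 2))


rng = random.Random(0)
n = 300
# Simulated genotypes coded as minor-allele counts (0, 1, 2).
g1 = [int(rng.random() < 0.3) + int(rng.random() < 0.3) for _ in range(n)]
g2 = [int(rng.random() < 0.4) + int(rng.random() < 0.4) for _ in range(n)]
# Proxy in strong LD with g1: identical except for every 20th individual.
proxy = [(g + 1) % 3 if i % 20 == 0 else g for i, g in enumerate(g1)]
# Simulated expression driven by two independent variants plus noise.
expr = [1.0 * a + 0.8 * b + rng.gauss(0, 1) for a, b in zip(g1, g2)]

rss_base = ols_rss([[1, a] for a in g1], expr)
rss_g2 = ols_rss([[1, a, b] for a, b in zip(g1, g2)], expr)
rss_proxy = ols_rss([[1, a, b] for a, b in zip(g1, proxy)], expr)

aic_base = gaussian_aic(rss_base, n, 2)
aic_g2 = gaussian_aic(rss_g2, n, 3)
aic_proxy = gaussian_aic(rss_proxy, n, 3)
p_g2 = lrt_p_one_df(rss_base, rss_g2, n)
p_proxy = lrt_p_one_df(rss_base, rss_proxy, n)

print(f"base AIC = {aic_base:.1f}")
print(f"+ independent marker: AIC = {aic_g2:.1f}, p = {p_g2:.2g}")
print(f"+ proxy in strong LD: AIC = {aic_proxy:.1f}, p = {p_proxy:.2g}")
```

The p-value and the AIC difference carry related information here: for one added term, the likelihood ratio statistic equals the drop in AIC plus 2.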

Fig 4

LIPA expression, CAD, and genotype.


Comparison of LIPA expression in CATHGEN for those with and without CAD based on rs1412444 genotype. LIPA exhibits higher expression only in those without CAD in the homozygous minor group (p-value = 0.02).

Absence of LIPA results in Wolman disease, characterized by lipid deposits and early onset CAD due to inability to break down lipids in lysosomes and subsequent upregulation of cholesterol production by the liver [35]. Here, congruent with this rare genetic disease, we find decreased LIPA expression associated with CAD. Unexpectedly, this is observed when the GWAS-based variant (rs1412444), associated with increased LIPA expression, is homozygous minor, implying the existence of an additional factor associated with the GWAS variant that interrupts the linear relationship between the number of rs1412444 minor alleles and LIPA expression.

Context–tissue specific eQTLs

Genetic variation exists and functions within a context: the surrounding sequence, the tissue type and its preferred transcription factors, etc. In an effort to resolve the functional variation behind statistical associations observed in GWAS, it is essential to consider these contexts. As highlighted by the tissue specific AEI patterns above, if these relationships are not considered in a context specific manner (e.g., on a tissue by tissue basis), many robust effects will remain hidden. In an effort to evaluate some of these contextual features, we consider tissue specific eQTLs. eQTL analysis may focus the search on a relevant tissue. However, eQTLs are detectable only where expression and sample size are sufficiently high; accordingly, tissue-specific differences in eQTLs reflect overall patterns of tissue selective expression and sample size, in addition to the influence of genetic variation (S5 Fig). To consider how eQTLs for a given gene compare across different tissues, we cluster genome-wide significant eQTLs reported by GTEx for LIPA in a heatmap organized by their pairwise LD (R2), using a colored bar at the top of the heatmap to denote tissue type (Fig 5A). eQTL SNPs generally cluster by tissue, suggesting distinct regulatory variants in different tissues. However, there are two LD blocks that contain eQTLs in more than half of tissues, indicative of genetic variation that acts across different tissue types. Variants detected by GWAS for LIPA appear as significant eQTLs in a subset of tissues (Table 1); some of these fit with our understanding of CAD pathology (heart, aorta, adipose), while others suggest as yet unexplained biological consequences (spleen, pancreas) or pleiotropic effects.
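Grouping eQTLs by pairwise LD can be sketched as below. Several simplifications are assumed: genotypes are hypothetical minor-allele counts, R2 is estimated as the squared Pearson correlation of those counts (a genotype-based approximation), and a simple threshold rule (connected components at R2 ≥ 0.8) replaces the hierarchical clustering that a heatmap would normally use.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    return cov * cov / (v1 * v2)


def ld_blocks(genotypes, threshold=0.8):
    """Group SNPs into blocks: connected components of the graph R2 >= threshold."""
    snps = list(genotypes)
    seen, blocks = set(), []
    for start in snps:
        if start in seen:
            continue
        seen.add(start)
        stack, block = [start], []
        while stack:
            cur = stack.pop()
            block.append(cur)
            for other in snps:
                if other not in seen and r_squared(genotypes[cur], genotypes[other]) >= threshold:
                    seen.add(other)
                    stack.append(other)
        blocks.append(sorted(block))
    return blocks


# Hypothetical genotypes for four eQTL SNPs in eight individuals.
genotypes = {
    "snp1": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp2": [0, 1, 2, 1, 0, 2, 1, 0],  # perfect LD with snp1
    "snp3": [2, 0, 1, 0, 2, 1, 1, 0],
    "snp4": [2, 0, 1, 0, 2, 1, 1, 1],  # strong LD with snp3
}

blocks = ld_blocks(genotypes)
print(blocks)
```

Annotating each block with the tissues in which its SNPs are significant eQTLs would then distinguish tissue-restricted blocks from those shared across tissues.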

Fig 5

Tissue specific eQTLs for LIPA.

Heatmap of LD for those SNPs reported by GTEx as genome-wide significant eQTLs for LIPA. Lighter-colored squares in the heatmap represent LD blocks, with SNPs clustered by R2 and not by genomic position. Colored bars at top mark eQTLs in each tissue, with more significant p-values denoted by darker color.


Conclusions

We consider each of 58 loci implicated in CAD by GWAS to understand the biological meaning of the underlying statistical associations. In evaluating each of these loci, we find numerous candidate genes that were not included in the original annotation by GWAS. Many of these are non-coding. Non-coding RNAs, now well-recognized for their role as regulators, have historically been dismissed and continue to be difficult to study, a trend that is apparent in their poor representation in the literature, among GO annotations, and as annotated by GWAS [36]. We find no evidence to suggest these non-coding RNAs are less likely to account for the observed associations in GWAS and would advocate for their inclusion in further mechanistic and computational work examining these loci.

In addition to broadening candidate gene lists to include non-coding transcripts, we would urge reconsideration of current assignments, especially for those loci categorized as Tier2C where expression, splicing, and physical position do not support the gene annotated by GWAS. LDLR is a particularly prominent example. Given our understanding of the critical role lipid metabolism plays in CAD, it is counterintuitive not to assign a CAD GWAS variant to LDLR when it lies within 15 kb of the LDLR locus [37]. However, RNA expression and splicing data do not support this annotation, instead supporting the notion that such genetic variation affects the function of other nearby genes including SMARCA4, CARM1, YIPF2, RGL3, and SLC44A2 [30].

Using allelic ratios built from tissue-specific RNA sequencing data available through GTEx, we were able to identify two loci where the GWAS variant served as a robust marker for a functional cis-acting regulatory variant. Locus 3, rs7528419 (SORT1), falls in the 3’UTR of CELSR2, exhibits AEI exclusively in liver, and is in nearly perfect LD with rs12740374, which was shown by Musunuru et al. through a series of molecular experiments to create a C/EBP binding site increasing expression of SORT1, a multiligand sorting receptor which they concomitantly showed to be associated with LDL-C and VLDL levels [38]. This work revealed a single functional variant for a single target gene with a substantial effect size; the authors estimated a ~40% difference in MI risk. Our work suggests additional eQTLs not explained by their LD with the LD block marked by GWAS variant rs7528419. As we begin to identify functional variation behind GWAS associations, an important next step will be resolving additional functional variants within the loci that may modify these associations and better account for disease risk [39].

This work emphasizes that the linear presentation of GWAS results as a single variant tied to a single gene fails to capture the complexity of these loci. Many loci contain several SNPs identified by GWAS, and for each of these, multiple candidate genes are implicated by RNA expression and splicing associations as well as physical proximity. LD alone rarely accounts for the observed eQTLs, suggesting multiple functional variants within these loci. Although some GWAS associations may ultimately implicate single variants that alter expression of individual genes, this work indicates that the true genetic effect size of a gene locus is accounted for by a multi-factorial system that allows for multiple functional variants regulating one or more genes.

Important next steps in accounting for the genetic basis of disease will be in establishing causality for genetic variation, which, even with computational efforts such as these to direct our understanding, will require molecular biology experimentation to definitively address. It will also require looking beyond single nucleotide polymorphisms to copy number variation, methylation, and other forms of genetic variation, which have been shown to have considerable impacts on disease risk. Ultimately, considering how functional variation of all kinds can be used in combination to predict disease risk will likely require machine learning approaches that can more effectively incorporate multi-factorial data [40,41]. The approach presented here must be expanded to include functional variants that are undetectable by RNAseq of whole tissues, including cell type specific expression, effect on RNA-protein interactions, distribution in sub-cellular domains, alteration of translational processes, and of course variants that change protein functions.

Expanding candidate genes process.

Flowchart portraying process of expanding candidate gene list from 75 to 245 using eQTL, sQTL, and physical position. (TIF)

Tier assignment process.

Flowchart portraying process of assigning tiers to CAD GWAS loci. (TIF)

eQTL, sQTL, position Venn diagram.

Venn diagram showing overlap in candidate genes derived from eQTL, sQTL, and position-based re-prioritization. (TIF)

LIPA expression, CAD, and genotype in GTEx.

Comparison of LIPA expression in GTEx for those with and without heart disease based on rs1412444 genotype. LIPA exhibits higher expression in those without heart disease only in the homozygous minor group (p value = 0.22). (TIF)

Power calculations for tissue specific eQTLs.

Barplot displays power to detect a hypothetical LIPA eQTL with minor allele frequency 0.05 and effect size 40% (i.e. no minor alleles is 20% less than the median tissue specific gene expression and two minor alleles is 20% greater than the median expression) across different tissue types. About half of the tissues have greater than 80% power to detect such a variant. (TIF)

Fig 1 Gene names.

Gene names corresponding to bar plot presented in Fig 1B. (DOCX)

Example locus.

Example of a locus (LIPA) implicated by GWAS taken from ensembl.org. There are numerous annotated protein-coding and non-coding transcripts in close proximity and overlapping one another. (DOCX)

58 CAD GWAS loci.

Table of 58 GWAS loci including tier designation, SNPs considered, GWAS annotation, and genes introduced by eQTL, sQTL, and position. Also includes additional text describing Tier 3 loci and the expanded search for candidate genes. (DOCX)

Additional phenotype–blood pressure.

Bar charts showing the distribution of tier assignments for each of the GWA studies considered. Tier assignments for each of the 903 loci identified in a recent blood pressure GWAS [27]. (PDF)

6 Apr 2021

PONE-D-20-39555

Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance

PLOS ONE

Dear Dr. Hartmann,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by May 21 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Mingqing Xu

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Acknowledgments Section of your manuscript:

This study was supported by National Institutes of Health National Institute of General Medical Science Pharmacogenetics Research Network [Grant U01 GM092655] and the National Center for Advancing Translational Sciences [TL1 TR001069]. The GTEx Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from: the GTEx Portal and dbGaP accession number phs000424.
For CATHGEN, clinical data originated from the 604 Duke Databank for Cardiovascular Disease (DDCD) and biological samples originated from the Duke Cardiac CATHeterization (CATHGEN) study. Funding support for the Genetic Mediators of Metabolic CVD Risk was provided by NHLBI grant RC2 HL101621 (William E. Kraus). The data used for the analyses described in this manuscript were obtained from the dbGaP accession number phs0000703.v1.p1. Computing time provided by the Ohio Supercomputer Center, GRANT #: PAS0885-2 and the Prometheus Cyfronet AGH.

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement.

Currently, your Funding Statement reads as follows:

This study was supported by National Institutes of Health National Institute of General Medical Science Pharmacogenetics Research Network [Grant U01 GM092655 awarded to WS] https://www.nigms.nih.gov/ and the National Center for Advancing Translational Sciences [TL1 TR001069 awarded to KH] https://ncats.nih.gov/.

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

3. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Please revise according to the reviewer's comment for re-submission.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: No

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: No

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous.
Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript “Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance” has an overall goal of addressing an important question, namely, when GWAS identifies multiple variants in a given locus is it due to the underlying LD or because of multiple variants with independent/only partially related functional effects? It also seeks to determine if incorporating tissue-wide eQTLs and splicing QTLs (sQTLs) is a “better” way to determine the underlying gene for a given non-coding locus. All of these efforts as a concept are important as we think about the non-coding genome and primarily non-coding variants identified from GWAS.

The strengths of the study are the overall conceptual framework posed (with caveat above that the analyses presented only take us so far); the careful incorporation of GTEx data coupled with individual level data obtained from the CATHGEN study through dbGaP; and the careful thought put into the genomic interpretations.
But while the overall conceptual framework and potential hypotheses are important, the manuscript’s impact is weakened due to the overall descriptive data presented; the expansive results section for primarily relatively straightforward bioinformatics work that would be done when evaluating any genetic variant; and the lack of functional validation of the proposed model outside of eQTLs to show that this approach is “better” than the comparator approach (which in this paper is just comparison to annotation in the Nikolay et al paper). In fact, for many of their SNPs, a simple search in dbSNP/UCSC confirms the gene that the SNP is in (not sure why the Nikolay paper annotated it differently). As such, the comparator being the Nikolay paper doesn’t help us determine if this method is “better”.

More minor issues include:

1) To address the above issue of the comparator being “GWAS annotated” but from a single paper, the authors should consider other phenotypes to perform these analyses;

2) The results section is very long and could be summarized more succinctly;

3) Throughout, there needs to be more formal statistics done (and more statistics methods)…for example, in Table 2, ANOVA does not make sense for the model for association between SNPs and MI; what are p-values/effect sizes/directionality for eQTLs, for “heat map” for clustering and showing that some variants have eQTLs in multiple tissues, etc.

4) For the LIPA analyses, why use MI when the original variants are CAD variants?

5) The LIPA expression models are interesting and statistically a good way to further validate that multiple SNPs in a locus have independent effects and should be done across all the loci.

6) For the LIPA expression models, need more formal statistics to show that adding SNPs improves models for expression (i.e. AIC, BIC, etc.)
7) Overall, the manuscript could use an eye towards editorial improvement (uses colloquial language in several places like calling them “GWAS hits”, there are grammatical errors, results section is densely presented; catheterization in methods is spelled incorrectly; Wilcoxon is spelled “wilcoxin”; the test in R is wilcox.test not “wilcoxin.test”, etc.)

8) How do the authors interpret the paradoxical results for rs1412444 on LIPA expression and MI?

9) How were eQTLs in GTEx defined? (again gets back to inclusion of more formal statistics). It’s hard to determine if some of the differences between variants/across tissues could just be variations around statistical significance without these more granular results.

Reviewer #2: In the manuscript entitled “Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance”, the authors explored the complexity of each of the CAD loci by use of tissue specific RNA sequencing data from GTEx to identify genes that exhibit altered expression patterns in the context of GWAS significant loci, and expanded the list of candidate genes from the 75 currently annotated by GWAS to 245.

The following papers can be cited and followed for the meta-analytic procedures (if the data is not enough available, at least DISCUSSION should be added as the LIMITATION of this study with enough citation to support the viewpoints):

Ref 1: Wu Y, et al. Multi-trait analysis for genome-wide association study of five psychiatric disorders. Transl Psychiatry. 2020 Jun 30;10(1):209.

Ref 2: Jiang L, et al. Sex-Specific Association of Circulating Ferritin Level and Risk of Type 2 Diabetes: A Dose-Response Meta-Analysis of Prospective Studies. J Clin Endocrinol Metab. 2019 Oct 1;104(10):4539-4551.

Ref 3: Xu M, et al. Quantitative assessment of the effect of angiotensinogen gene polymorphisms on the risk of coronary heart disease. Circulation. 2007 Sep 18;116(12):1356-66.

Trans-ethnic and trans-trait meta-analysis of cardiometabolic traits can be referred to Ref 1. Subgroup analyses based on sex, age, race, gene dosage can be referred to Ref 2 and 3.

Integrating GWAS signals with eQTL from GTEx or pQTLs is a good strategy for exploring the causality of the genetic variants in the development of cardiometabolic traits. But I strongly suggest to do causal inference analysis to see if the GWAS signals are causally triggering the development of CAD through mediating the expression of given genes in specific tissues.

In addition, the significantly associated SNPs may be used to predict disease susceptibility in the context of its influence of gene expression, therefore, the authors may explore the possibility to conduct a machine-learning model to predict CAD risk or cardiometabolic traits based on these significant SNPs. For this reason, the authors may cite the following papers to follow these references’ procedure to construct a standard prediction model based on the significant SNPs (probably include the gene expression information). Deep learning is a hot topic in dissecting the genome variants’ roles in the phenome. Especially deep learning method is a very promising way to predict disease risk based on clinical information and genetic biomarkers (if deep learning can not be used, please discuss as the LIMITATION of this study with enough citation to support the viewpoints).

Ref 4: Yu H, et al. LEPR hypomethylation is significantly associated with gastric cancer in males. Exp Mol Pathol. 2020 Oct;116:104493.

Ref 5: Liu M, et al. A multi-model deep convolutional neural network for automatic hippocampus segmentation and classification in Alzheimer's disease. Neuroimage. 2020 Mar;208:116459.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.
If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

2 Oct 2021

Reviewer #1: The manuscript “Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance” has an overall goal of addressing an important question, namely, when GWAS identifies multiple variants in a given locus is it due to the underlying LD or because of multiple variants with independent/only partially related functional effects? It also seeks to determine if incorporating tissue-wide eQTLs and splicing QTLs (sQTLs) is a “better” way to determine the underlying gene for a given non-coding locus. All of these efforts as a concept are important as we think about the non-coding genome and primarily non-coding variants identified from GWAS.
The strengths of the study are the overall conceptual framework posed (with caveat above that the analyses presented only take us so far); the careful incorporation of GTEx data coupled with individual level data obtained from the CATHGEN study through dbGaP; and the careful thought put into the genomic interpretations. But while the overall conceptual framework and potential hypotheses are important, the manuscript’s impact is weakened due to the overall descriptive data presented; the expansive results section for primarily relatively straightforward bioinformatics work that would be done when evaluating any genetic variant; and the lack of functional validation of the proposed model outside of eQTLs to show that this approach is “better” than the comparator approach (which in this paper is just comparison to annotation in the Nikpay et al. paper). In fact, for many of their SNPs, a simple search in dbSNP/UCSC confirms the gene that the SNP is in (not sure why the Nikpay paper annotated it differently). As such, the comparator being the Nikpay paper doesn’t help us determine if this method is “better”.

Thank you to the reviewer for so nicely highlighting the strengths and weaknesses of this work. We would agree that it is odd that, despite the simplicity of much of what we have done and the thought that goes into many of these GWAS studies, tables of GWAS variants still report the nearest (often protein coding) gene. We have found that this is not unique to CAD or to the Nikpay paper and have presented an additional GWAS study as outlined below and in the revised manuscript to demonstrate this. We deleted the word ‘better’ as this value judgement is not needed.

More minor issues include:

1) To address the above issue of the comparator being “GWAS annotated” but from a single paper, the authors should consider other phenotypes to perform these analyses;

Thank you for the suggestion!
We have done a similar analysis to that outlined in Figure 1A for an additional GWAS study with variants reported by Lotta et al. in Nature Genetics in 2017. This manuscript explored associations between genetic variants and insulin resistance phenotypes including higher fasting insulin levels adjusted for BMI, lower HDL cholesterol levels, and higher triglyceride levels, identifying 53 loci of interest. We find a similar pattern to that observed in the Nikpay paper in that many more candidate genes are introduced by considering expression and splicing analysis as well as location.

2) The results section is very long and could be summarized more succinctly;

Agreed. We took out a significant portion of text and hope it now reads more easily.

3) Throughout, there needs to be more formal statistics done (and more statistics methods)…for example,

Thank you for bringing this to our attention and highlighting some specific examples where we can help to bring more formal statistics and clarity. I think these changes substantially improve the manuscript and appreciate the comments. Please see the individual responses below.

in Table 2, ANOVA does not make sense for the model for association between SNPs and MI;

We used an ANOVA to test nested generalized linear models. These GLMs incorporate different combinations of SNPs and were designed with the response variable of LIPA gene expression and then separately CAD. The p-value reported reflects the likelihood ratio test comparing the GLMs. In addition to the p-value we have added AICs as a measure of goodness of fit for each individual GLM.

what are p-values/effect sizes/directionality for eQTLs,

These are p-values, effect sizes, and directionality that are reported by GTEx to reflect the association between the variant and gene expression. Details of their analysis can be found here https://www.gtexportal.org/home/documentationPage.
Briefly, p-values test the null hypothesis that the slope in a linear regression model for normalized gene expression explained by a given genetic variant is zero, against the alternative that it is non-zero. Effect size is this ‘non-zero’ slope and directionality corresponds to whether the minor allele is associated with higher or lower gene expression. We have added many of these details to the methods section to try and clarify further.

for “heat map” for clustering and showing that some variants have eQTLs in multiple tissues, etc.

Thank you for bringing this to our attention. The details of how this plot was generated had been inadvertently left out of the methods section. It is now included. We used the heatmap.2 function within the gplots R package to plot R² for GTEx reported eQTLs. The colored bars above the heatmap were generated using the ColSideColors argument within the heatmap.2 function and are shaded to reflect the p-values with each color representing a different tissue type.

4) For the LIPA analyses, why use MI when the original variants are CAD variants?

Thank you to the reviewer for highlighting this! We have tried to match definitions of CAD as best as possible across the datasets (see specific definitions below and track changes in the methods section of the text). Although these definitions had been used in the analysis, we were not diligent with the terminology in the initial submission and incorrectly used MI and CAD interchangeably.

CAD was defined in Nikpay as “Case status was defined by an inclusive CAD diagnosis (e.g. myocardial infarction (MI), acute coronary syndrome, chronic stable angina, or coronary stenosis >50%)”

CAD in CATHGEN: Non-zero CAD Index i.e. no CAD >50% stenosis (CADINDEX) or history of myocardial infarction (HHXMI) (https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000703/phs000703.v1.p1/pheno_variable_summaries/phs000703.v1.pht003672.v1.CATHGEN_Metabolic_CVD_Risk_Subject_Phenotypes.data_dict.xml).
CAD in GTEx: Recorded history of heart disease (MHHRTDIS) or heart attack (MHHRTATT) (https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000424/phs000424.v8.p2/pheno_variable_summaries/phs000424.v8.pht002742.v8.GTEx_Subject_Phenotypes.data_dict.xml).

5) The LIPA expression models are interesting and statistically a good way to further validate that multiple SNPs in a locus have independent effects and should be done across all the loci.

We are happy to see that our approach for validating multiple SNPs within a locus resonated with the reviewer and agree that it would be ideal to perform such an analysis across all loci. However, for this to be done, candidate SNPs would need to be selected by hand for each locus, as was done with the LIPA locus where LD patterns of eQTLs (Figure 3) pointed to specific SNPs of interest. Currently this approach is too time consuming to be scaled up to the entire gene list. An alternative would be to scan large numbers of variants automatically, but then the burden of multiple hypothesis testing would likely obscure any true results. We look forward to continuing to address the possibility of multiple functional variants within a locus in future work.

6) For the LIPA expression models, need more formal statistics to show that adding SNPs improves models for expression (i.e. AIC, BIC, etc.)

Certainly. Thanks very much for the comment. We have included AIC values for each individual model in Table 2.

7) Overall, the manuscript could use an eye towards editorial improvement (uses colloquial language in several places like calling them “GWAS hits”, there are grammatical errors, results section is densely presented; catheterization in methods is spelled incorrectly; Wilcoxon is spelled “wilcoxin”; the test in R is wilcox.test not “wilcoxin.test”, etc.)

Thank you for bringing these to our attention. We have corrected the spelling errors you noted and more carefully reviewed the manuscript for additional grammatical and spelling errors.
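
To make the model comparison discussed under comments 3 and 6 concrete, the following is a minimal sketch of a likelihood ratio test between two nested models together with their AIC values. It is not the authors' code: the function names and the log-likelihood values are hypothetical, the test is restricted to models that differ by exactly one parameter (for example, one added SNP), and only the Python standard library is used.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion, 2k - 2*logL; lower values indicate better fit."""
    return 2 * n_params - 2 * log_likelihood


def likelihood_ratio_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by ONE parameter.

    The statistic 2 * (logL_full - logL_reduced) is compared with a
    chi-square distribution with 1 degree of freedom. For 1 df the
    upper tail probability is erfc(sqrt(x / 2)).
    """
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only (not from the paper).
    loglik_one_snp = -512.4    # intercept + 1 SNP  -> 2 parameters
    loglik_two_snps = -507.9   # intercept + 2 SNPs -> 3 parameters

    stat, p = likelihood_ratio_test_1df(loglik_one_snp, loglik_two_snps)
    print("LRT statistic:", round(stat, 3), "p-value:", p)
    print("AIC, 1 SNP :", aic(loglik_one_snp, 2))
    print("AIC, 2 SNPs:", aic(loglik_two_snps, 3))
```

The two measures answer slightly different questions: the likelihood ratio test asks whether the added SNP improves fit more than chance would, while AIC penalizes each model for its number of parameters so that individual models can be ranked.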
8) How do the authors interpret the paradoxical results for rs1412444 on LIPA expression and MI?

Thanks for your question; it is one we have spent some time discussing. Complete absence of LIPA results in a rare genetic disorder known as Wolman disease in which lipids are not broken down in lysosomes, the liver upregulates cholesterol production, and early onset CAD as well as fatty deposition within various organs ensues. With this in mind, the association between decreased LIPA levels and MI is congruent. The question then remains why homozygous minor is associated with decreased rather than increased expression in this subset of individuals. It seems to imply the existence of a third, unknown factor regulating this relationship. For illustrative purposes, one potential explanation could be an additional variant that reduces RNA transcript stability, so that the transcript, despite being produced in larger quantities, is decreased overall.

9) How were eQTLs in GTEx defined? (again gets back to inclusion of more formal statistics). It’s hard to determine if some of the differences between variants/across tissues could just be variations around statistical significance without these more granular results.

Without a doubt variations around statistical significance are playing a role here! In initially preparing this manuscript, a previous version of GTEx with smaller sample size was used and there were significantly fewer GWAS variants that were also deemed eQTLs. As sample sizes increased, more eQTLs were detected and thus more overlap identified. In addition, tissue expression of any given gene can vary drastically, reducing the power to detect eQTLs at low expression levels measured with RNAseq. Therefore, interpreting eQTL differences between tissues must reflect RNA expression levels.
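
As a rough illustration of the eQTL definition referred to in this response (a non-zero slope when normalized expression is regressed on genotype), the sketch below fits a single-variant linear regression and reports the slope, its standard error, and a test statistic. It is a simplified stand-in, not the GTEx pipeline: the function name and the toy data are hypothetical, no covariates are included, and the p-value uses a normal approximation that is only reasonable for large samples.

```python
import math


def eqtl_slope_test(dosage, expression):
    """Regress normalized expression on allele dosage (0, 1 or 2 minor alleles).

    Returns (slope, standard_error, t_statistic, approx_p_value).
    The sign of the slope gives the direction of effect for the counted allele.
    The p-value is a large-sample normal approximation, not an exact t-test.
    """
    n = len(dosage)
    if n != len(expression) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(dosage) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    if sxx == 0:
        raise ValueError("no genotype variation in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, expression))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(dosage, expression))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    # A perfect fit gives zero standard error; treat the statistic as infinite.
    t_stat = slope / std_err if std_err > 0 else math.copysign(math.inf, slope)
    approx_p = math.erfc(abs(t_stat) / math.sqrt(2.0))
    return slope, std_err, t_stat, approx_p


if __name__ == "__main__":
    # Toy data, for illustration only: six individuals, two per genotype class.
    dosage = [0, 0, 1, 1, 2, 2]
    expression = [-1.0, -0.8, 0.1, -0.1, 0.9, 1.1]
    print(eqtl_slope_test(dosage, expression))
```

Because the standard error shrinks as the number of samples and the genotype variation grow, the same underlying slope can fall on either side of a significance threshold in different tissues, which is the sample-size dependence the response describes.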
To give readers more of a sense of how sample size may affect tissue specific differences in eQTLs, we have included the sample size for each tissue alongside the tissue name in Figure 5. The concept that eQTLs depend on sample size (and tissue specific expression) remains within the text of the results section along with a reference to a supplementary figure that shows how power calculations for a hypothetical SNP with minor allele frequency 0.05 and an effect size of 40% (i.e. no minor alleles is 20% less than the median tissue specific gene expression and two minor alleles is 20% greater than the median tissue specific gene expression) compare across tissues.

Reviewer #2: In the manuscript entitled “Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance”, the authors explored the complexity of each of the CAD loci by use of tissue specific RNA sequencing data from GTEx to identify genes that exhibit altered expression patterns in the context of GWAS significant loci, and expanded the list of candidate genes from the 75 currently annotated by GWAS to 245. The following papers can be cited and followed for the meta-analytic procedures (if not enough data are available, at least a DISCUSSION should be added as a LIMITATION of this study with enough citations to support the viewpoint):

Ref 1: Wu Y, et al. Multi-trait analysis for genome-wide association study of five psychiatric disorders. Transl Psychiatry. 2020 Jun 30;10(1):209.

Ref 2: Jiang L, et al. Sex-Specific Association of Circulating Ferritin Level and Risk of Type 2 Diabetes: A Dose-Response Meta-Analysis of Prospective Studies. J Clin Endocrinol Metab. 2019 Oct 1;104(10):4539-4551.

Ref 3: Xu M, et al. Quantitative assessment of the effect of angiotensinogen gene polymorphisms on the risk of coronary heart disease. Circulation. 2007 Sep 18;116(12):1356-66

Trans-ethnic and trans-trait meta-analysis of cardiometabolic traits can be referred to Ref 1.
Subgroup analyses based on sex, age, race, and gene dosage can be referred to Refs 2 and 3.

We thank the reviewer for bringing these references to our attention. However, as we have not undertaken a meta-analysis in this manuscript, instead working within several large scale databases (GTEx and CATHGEN) without combining them, we did not find these references relevant. We have added text to the methods section to help clarify this point and reserve these references for any future meta-analytic work to which they may be more applicable.

Integrating GWAS signals with eQTLs from GTEx or pQTLs is a good strategy for exploring the causality of the genetic variants in the development of cardiometabolic traits. But I strongly suggest doing a causal inference analysis to see if the GWAS signals are causally triggering the development of CAD through mediating the expression of given genes in specific tissues.

We thank the reviewer for their comment. Indeed, causal analysis is a critical piece as GWAS findings are essentially associations and nothing more. We have added additional text to the discussion emphasizing the need for molecular biology experimentation to establish causal roles for genetic variation.

In addition, the significantly associated SNPs may be used to predict disease susceptibility in the context of their influence on gene expression; therefore, the authors may explore the possibility of building a machine-learning model to predict CAD risk or cardiometabolic traits based on these significant SNPs. For this reason, the authors may cite the following papers and follow these references’ procedures to construct a standard prediction model based on the significant SNPs (probably including the gene expression information). Deep learning is a hot topic in dissecting the roles of genome variants in the phenome.
In particular, deep learning is a very promising way to predict disease risk based on clinical information and genetic biomarkers (if deep learning cannot be used, please discuss this as a LIMITATION of this study with enough citations to support the viewpoint).

Ref 4: Yu H, et al. LEPR hypomethylation is significantly associated with gastric cancer in males. Exp Mol Pathol. 2020 Oct;116:104493.

Ref 5: Liu M, et al. A multi-model deep convolutional neural network for automatic hippocampus segmentation and classification in Alzheimer's disease. Neuroimage. 2020 Mar;208:116459.

Thanks to the reviewer for the provided references and for the suggestion. We would agree that machine learning would be an interesting and potentially very powerful methodology to account for disease risk with various data points including gene expression and genetic variation. To a large extent, we are limited by a combination of sample size and data availability, so these analyses were not feasible. We have added text to the discussion emphasizing the need for future directions to explore machine learning methodology and to incorporate other markers of genetic variation beyond SNPs such as methylation, and have referenced the recommended publications.

Submitted filename: ResponseToComments.pdf

3 Jan 2022

Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance

PONE-D-20-39555R1

Dear Dr. Katherine Hartmann,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Mingqing Xu
Academic Editor
PLOS ONE

Additional Editor Comments (optional): It can be accepted for publication now.

Reviewers' comments:

7 Feb 2022

PONE-D-20-39555R1

Interpreting coronary artery disease GWAS results: A functional genomics approach assessing biological significance

Dear Dr. Hartmann:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff
on behalf of Dr. Mingqing Xu
Academic Editor
PLOS ONE
