
Discovery and replication of SNP-SNP interactions for quantitative lipid traits in over 60,000 individuals.

Brendan J Keating^1,2, Marylyn D Ritchie³, Emily R Holzinger⁴, Shefali S Verma⁵, Carrie B Moore⁶, Molly Hall⁵, Rishika De⁷, Diane Gilbert-Diamond⁸, Matthew B Lanktree⁹, Nathan Pankratz¹⁰, Antoinette Amuzu¹¹, Amber Burt¹², Caroline Dale¹¹, Scott Dudek⁵, Clement E Furlong¹², Tom R Gaunt¹³, Daniel Seung Kim¹², Helene Riess¹⁴, Suthesh Sivapalaratnam¹⁵, Vinicius Tragante^16,17, Erik P A van Iperen^18,19, Ariel Brautbar²⁰, David S Carrell²¹, David R Crosslin¹², Gail P Jarvik¹², Helena Kuivaniemi²², Iftikhar J Kullo²³, Eric B Larson²¹, Laura J Rasmussen-Torvik²⁴, Gerard Tromp²², Jens Baumert¹⁴, Karen J Cruickshanks²⁵, Martin Farrall²⁶, Aroon D Hingorani²⁷, G K Hovingh¹⁵, Marcus E Kleber²⁸, Barbara E Klein²⁵, Ronald Klein²⁵, Wolfgang Koenig²⁹, Leslie A Lange³⁰, Winfried Mӓrz^28,31, Kari E North³², N Charlotte Onland-Moret³³, Alex P Reiner³⁴, Philippa J Talmud³⁵, Yvonne T van der Schouw³³, James G Wilson³⁶, Mika Kivimaki²⁷, Meena Kumari^27,37, Jason H Moore³⁸, Fotios Drenos^35,39, Folkert W Asselbergs^16,18,39.

Abstract

BACKGROUND: The genetic etiology of human lipid quantitative traits is not fully elucidated, and interactions between variants may play a role. We performed a gene-centric interaction study for four different lipid traits: low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), total cholesterol (TC), and triglycerides (TG).
RESULTS: Our analysis consisted of a discovery phase using a merged dataset of five different cohorts (n = 12,853 to n = 16,849 depending on lipid phenotype) and a replication phase with ten independent cohorts totaling up to 36,938 additional samples. Filters are often applied before interaction testing to correct for the burden of testing all pairwise interactions. We used two different filters: 1. A filter that tested only single nucleotide polymorphisms (SNPs) with a main effect of p < 0.001 in a previous association study. 2. A filter that only tested interactions identified by Biofilter 2.0. Pairwise models that reached an interaction significance level of p < 0.001 in the discovery dataset were tested for replication. We identified thirteen SNP-SNP models that were significant in more than one replication cohort after accounting for multiple testing.
CONCLUSIONS: These results may reveal novel insights into the genetic etiology of lipid levels. Furthermore, we developed a pipeline to perform a computationally efficient interaction analysis with multi-cohort replication.

Entities: Chemical

Keywords: Computational genetics; Genetic epidemiology; Genetics; Interactions; Lipids

Year: 2017 PMID: 28770004 PMCID: PMC5525436 DOI: 10.1186/s13040-017-0145-5

Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 4.079

Background

For this study, we performed several analyses to identify and validate genetic interactions associated with circulating lipid levels. Our motivation for studying the contribution of interactions to lipid levels is three-fold. First, dyslipidemias have a large impact on human health. Circulating lipid levels, such as high-density lipoprotein cholesterol (HDL-C), low-density lipoprotein cholesterol (LDL-C), total cholesterol (TC), and triglycerides (TG), are associated with risk for various common disease traits including cardiovascular disease (CVD) [1]. Cardiovascular disease is the leading cause of death for individuals in developed countries [2]. Second, the estimated genetic component for lipid levels is relatively large and highly variable. While age, sex, body mass index (BMI), diet, exercise and smoking status have been shown to have an effect on lipid levels, it is estimated that genetic factors contribute between 40 and 60% overall to variation in lipid levels [3, 4]. A more thorough understanding of the genetics underlying individual variation in lipid levels will result in greater insight into the biological processes underpinning dyslipidemia, and may inform more effective therapies to ultimately lower risk for cardiovascular disease. Lastly, a large portion of the estimated genetic component has not been identified by main effects alone. For the past decade or so, large efforts have been undertaken to tease apart the genetic etiology of common, complex traits, such as circulating lipid levels and CVD; however, a large proportion of the estimated heritability of these traits remains unexplained [5, 6]. Likely sources of this missing heritability include rare variants, epigenetics, structural variation, gene-gene interactions, gene-environment interactions, and/or the accuracy of the heritability models [7, 8].
Notably, calculating the total heritability and measuring the exact contribution of these specific findings to heritability remains a controversial and complex issue [9-12]. However, the consistently small proportion explained by common variants identified by GWAS across all complex traits suggests that we still have a lot to learn about the genetic architecture of these traits. This study addresses the contribution of interactions to the genetic architecture of lipid traits by examining SNP-SNP interactions in four quantitative lipid traits: HDL-C, LDL-C, TC, and TG. Here we are trying to identify genetic interactions by searching for statistical interactions. For interpretation purposes, it is important to understand how we define an interaction. Biologically, we are trying to identify genetic variants that alter the phenotype in a manner that is dependent on genotypes at two different loci. For example, an individual may have variants in two different regions of a metabolic enzyme protein that cause triglyceride levels to increase even more than the combined additive effects of the variants. Statistically, we use a likelihood ratio test to assess the significance of including a multiplicative interaction term along with the two main effect terms in a linear regression model. While there has been some debate about the relationship between statistical and biological interactions, there is substantial evidence that this method is robust to the non-linear or interaction effects we are interested in [13].

One of the main considerations for a genome-wide interaction study (GWIS) is the computational and statistical burden of exhaustive interaction testing, which inherently results in a massive increase in the number of tests (e.g. 1000 SNPs yield 499,500 two-way interactions and 166,167,000 three-way interactions).
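The counts quoted above are binomial coefficients: the number of unordered SNP combinations of a given order. A minimal sketch of the arithmetic follows; the function name `n_interaction_models` is ours and is used only for illustration.

```python
from math import comb


def n_interaction_models(n_snps: int, order: int) -> int:
    """Number of distinct unordered SNP combinations of the given order."""
    return comb(n_snps, order)


# For 1000 SNPs: 499,500 pairs and 166,167,000 triples.
print(n_interaction_models(1000, 2))
print(n_interaction_models(1000, 3))
```

Because the count grows roughly as n²/2 for pairs, even modest reductions in the SNP set shrink the testing burden considerably.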
While our analysis is not a GWIS per se, as most individuals were genotyped using a cardiovascular gene-centric array and we filtered before interaction testing, the considerations about interaction testing still apply [14]. One approach to address this issue is to filter on main effect significance (i.e. the p-value from the main effect term in a regression model) using bona fide index lipid signals derived from existing GWAS. A limitation of the main effect filter approach is that SNPs involved in true interactions with little or no main effects will likely be filtered out. Another approach is to select SNP-SNP models based on knowledge-driven, biologically plausible genes/loci, such as selecting SNP pairs in genes shown to physically interact in previous biological experiments (Biofilter) [15]. For this analysis, we used both of the aforementioned filter methods to test for interactions. After applying these filters, we identified potential SNP-SNP interactions for each of the lipid traits in a discovery analysis, which consisted of five cohorts from the NHLBI Candidate gene Association Resource (CARe) merged into one dataset (n = 12,853 to n = 16,849 depending on lipid phenotype): Atherosclerosis Risk In Communities (ARIC); Coronary Artery Risk Development in Young Adults (CARDIA); Cardiovascular Health Study (CHS); Framingham Heart Study (FHS); and Multi-Ethnic Study of Atherosclerosis (MESA). Models were selected for replication testing based on statistical significance in the discovery set. There were ten replication sets in total, with sample sizes between n = 1568 and n = 7504, totaling 36,938 for the replication dataset. We identified models with the most evidence for significant associations with the lipid traits according to multiple-testing corrected likelihood ratio test p-value thresholds from linear regression models. We also assessed the number of cohorts in which the models replicated.
This study highlights an analysis strategy to explore genetic interactions for complex traits and suggests several replicating interactions for lipid traits.
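The main effect filter described above can be sketched as a two-step procedure: keep SNPs whose previously reported main effect p-value is below a threshold, then enumerate every pair of the retained SNPs. The sketch below is illustrative only; the function name, the input dictionary, and the SNP identifiers in the usage example are hypothetical, and the authors' own pipeline was implemented with PLINK and R.

```python
from itertools import combinations


def main_effect_filter_pairs(main_effect_p, threshold=1e-3):
    """Return all unordered pairs of SNPs whose main effect p-value is below threshold.

    main_effect_p: dict mapping SNP identifier -> main effect p-value
    (assumed to come from a previous association study).
    """
    kept = sorted(snp for snp, p in main_effect_p.items() if p < threshold)
    return list(combinations(kept, 2))


# Hypothetical input: three of the four SNPs pass the p < 0.001 filter.
example = {"rsA": 1e-5, "rsB": 0.2, "rsC": 5e-4, "rsD": 9e-4}
for pair in main_effect_filter_pairs(example):
    print(pair)
```

As noted in the text, a SNP such as the hypothetical `rsB` is dropped even if it takes part in a true interaction, which is the main limitation of this filter.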

Methods

Discovery: Cohort descriptions

The discovery dataset for each of the traits had n ~ 14,000, with each of the cohorts having the following contributions: ARIC (n = 11,906), CARDIA (n = 2319), CHS (n = 4490), FHS (n = 1467), and MESA (n = 5598). Individuals were genotyped using the gene-centric ITMAT-Broad-CARe (IBC) array [16], which was previously used in a meta-analysis of 32 studies (66,240 individuals) that identified and replicated many known and novel lipid signals [17]. All of the individuals in our analysis were ≥21 years of age and of self-reported European ancestry, which was subsequently verified using principal component analysis by selecting individuals that clustered with the CEU panel from HapMap. HDL-C, LDL-C, TC, and TG levels were measured from baseline or first-measurement blood samples. All lipid measurements were converted to mmol/L. LDL-C was calculated according to Friedewald’s formula, L ≈ C − H − kT, where C is total cholesterol, H is HDL-C, L is LDL-C, T is triglycerides, and k is 0.45 for mmol/L (or 0.20 if measured in mg/dL). If TG values were >4.51 mmol/L (>400 mg/dL), then LDL-C was treated as a missing value. More details for the five merged discovery cohorts are shown in Table 1.
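The LDL-C calculation above can be written as a short function. This is a minimal sketch using the constants stated in the text (k = 0.45 for mmol/L, k = 0.20 for mg/dL, and the TG cutoffs of 4.51 mmol/L and 400 mg/dL); the function name and the choice of `None` to represent a missing value are our assumptions.

```python
from typing import Optional

# Constants taken from the text; the dictionary layout is illustrative.
K_FACTOR = {"mmol/L": 0.45, "mg/dL": 0.20}
TG_CUTOFF = {"mmol/L": 4.51, "mg/dL": 400.0}


def friedewald_ldl(tc: float, hdl: float, tg: float,
                   units: str = "mmol/L") -> Optional[float]:
    """Estimate LDL-C as TC - HDL-C - k * TG; return None when TG exceeds the cutoff."""
    if units not in K_FACTOR:
        raise ValueError(f"unsupported units: {units}")
    if tg > TG_CUTOFF[units]:
        return None  # LDL-C treated as missing for high triglycerides
    return tc - hdl - K_FACTOR[units] * tg


print(friedewald_ldl(5.0, 1.2, 1.5))           # value in mmol/L
print(friedewald_ldl(200.0, 50.0, 150.0, "mg/dL"))
print(friedewald_ldl(5.0, 1.2, 5.0))           # TG above cutoff
```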

Table 1

Details for the five cohorts that were merged to create the discovery dataset and the 10 cohorts used for replication

	Study (One letter label)	Recruitment design	Year of collection	N total^a	Data Level	Study Ref (PMID)
Discovery (IBC)	ARIC	Community-based	1985–2006	9588	Individual	20400780
	CARDIA	Community-based	1985–2003	1443	Individual	20400780
	CHS	Community-based	1988–2005	3952	Individual	20400780
	FHS	Community-based	1948-present	7556	Individual	20400780
	MESA	Community-based	1999–2009	2298	Individual	20400780
Replication (IBC)	BOSS/EHLS/BDES	Population-based cohort	1988-present	1568	Summary	21339392, 9801018, 1923372
	BWHHS (B)	Population-based cohort	1999–2001	3411	Summary	16045529
	CLEAR	Case-control	2005	1591	Summary	16474172
	EPIC-NL	Nested case-control	1993–1997	5194	Summary	19483199
	GIRaFH	Cohort	1999	1694	Summary	15554949
	KORA	Population-based cohort	1984–2005	1849	Summary	16032513, 1603251
	LURIC (L)	Case-control	1997–2002	2813	Summary	11258203
	PROCARDIS (P)	Case-control	1998-present	6432	Summary	20032323
	Whitehall II (W)	Population-based cohort	1985–1989	4882	Summary	15576467
Rep. (GWAS)	eMERGE	Consortium		7504	Summary	23743551

Discovery cohorts: Atherosclerosis Risk In Communities (ARIC); Coronary Artery Risk Development in Young Adults (CARDIA); Cardiovascular Health Study (CHS); Framingham Heart Study (FHS); Multi-Ethnic Study of Atherosclerosis (MESA)

Replication cohorts: BOSS beaver dam offspring study, EHLS epidemiology of hearing loss study, BDES beaver dam eye study, AIBIII Allied Irish Bank Workers Study III, AMC-PAS Academic Medical Center Amsterdam Premature Atherosclerosis Study, ASCOT anglo-scandinavian cardiac outcomes trial, BHS bogalusa heart study, BRIGHT British genetics of hypertension, BWHHS British women’s heart and health study, CLEAR carotid lesion epidemiology and risk, EPIC-NL European Prospective Investigation into Cancer and Nutrition in the Netherlands, GIRaFH genetic identification of risk factors in familial hypercholesterolemia, KORA Kooperative Gesundheitsforschung in der Region Augsburg, LURIC Ludwigshafen Risk and Cardiovascular Health Study, PROCARDIS precocious coronary artery disease study, WHII Whitehall II study, GWAS eMERGE

^a Numbers varied for each lipid trait. The number shown is the maximum number of non-missing individuals for all traits

Discovery: Quality control and statistical analyses

Individuals were genotyped on the ITMAT-Broad-CARe (IBC) array, which consists of ~50,000 SNPs across ~2100 loci. Selection criteria for SNPs to be included on the IBC array have been described in detail previously [16]. Quality control filters were applied after the cohorts were merged into the full discovery dataset. A summary of the full quality control and analysis pipeline is shown in Fig. 1. All quality control procedures were implemented with the PLINK software package [18] unless otherwise specified. SNPs with a genotype missing rate > 95% or that were not in Hardy-Weinberg equilibrium (p < 1.0 × 10⁻⁷) were removed from the analysis. After SNP genotyping quality control, 44,750 markers remained. Individuals with SNP genotype missing rates >90% were excluded from the analysis. For cohorts that contained known trios, non-founders (i.e. offspring) were removed. To address unknown or cryptic relatedness, identity-by-descent (IBD) estimates were calculated, and one individual from each pair with pi-hat >0.3 was removed. The TG values were log transformed to improve normality. Four new datasets were created, one for each of the quantitative lipid traits: HDL-C (n = 13,030), LDL-C (n = 12,853), TC (n = 16,849), and TG (n = 13,031).
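The relatedness step above (removing one individual from each pair with pi-hat > 0.3) can be sketched as a simple greedy pass over the IBD estimates. The text does not say which member of a pair was dropped, so the rule used here (drop the second-listed individual, processing the most related pairs first) is an assumption, as are the function name and the sample identifiers.

```python
def prune_related(pairs, threshold=0.3):
    """Return the set of individuals to remove so no retained pair exceeds threshold.

    pairs: iterable of (id_a, id_b, pi_hat) tuples, e.g. parsed from IBD output.
    Assumption: the second-listed individual of each offending pair is dropped.
    """
    removed = set()
    # Handle the most closely related pairs first.
    for id_a, id_b, pi_hat in sorted(pairs, key=lambda p: p[2], reverse=True):
        if pi_hat <= threshold:
            continue
        if id_a in removed or id_b in removed:
            continue  # this pair is already broken by an earlier removal
        removed.add(id_b)
    return removed


# Hypothetical IBD estimates for four samples.
ibd = [("A", "B", 0.50), ("B", "C", 0.45), ("C", "D", 0.10)]
print(prune_related(ibd))
```

In this hypothetical input, removing one sample resolves both related pairs, so only a single individual is dropped.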

Fig. 1

Flowchart of the quality control and analysis steps for the discovery and replication phases

Additional quality control metrics were applied to the individual lipid datasets for each of the statistical analyses. For both the main effect filter and Biofilter analyses, individuals with missing phenotypes were removed, along with variants with minor allele frequency (MAF) < 0.05 or missing genotype rate > 5%.

For the main effect filter analysis, SNPs were pruned to remove high levels of SNP correlation, or LD, from the data. This was performed by removing one SNP from all pairs of SNPs with an r² > 0.6 using PLINK. SNPs with a main effect p < 0.001 based on a previous GWAS regression analysis were selected for interaction testing [17]. We had two specific motivations for selecting this threshold for our study: (1) to allow for interactions that may be present in the absence of large, genome-wide significant main effects, and (2) to reduce the SNP set to a size that allowed for a manageable exhaustive SNP-SNP interaction analysis. SNP-SNP models were generated by creating an exhaustive list of all SNP pairs.

No LD pruning was done for the Biofilter interaction analyses, as these models are specifically generated using SNPs that are in different genes. Biofilter 2.0 is a software package that identifies SNP-SNP models based on probable gene-gene interactions identified in various online sources, including Gene Ontology (GO) and KEGG. The Biofilter method has previously been described in greater detail [15, 19]. Briefly, SNPs are mapped to genes using a 50 kb upstream or downstream inclusion criterion. Gene pairs that may be more likely to interact are then identified in various curated biological knowledge databases. A score is given based on the number of sources that indicate a possible interaction.
For this analysis, models were included if at least five knowledge sources identified the gene-gene interaction model. The SNPs are then mapped back to the genes to create the SNP-SNP models for statistical testing.

To test for SNP-SNP interactions, we used an R script that automatically tests the models according to user input parameters [20]. We tested for significant interactions using a linear regression framework. We adjusted for age, sex, smoking status, type 2 diabetes status, BMI, medication use (use or no use of lipid-lowering drugs), and potential population substructure (top 10 principal components) by including these as covariate terms in the linear regression models for each of the four lipid traits. We included these covariates to control for any factors outside of genetics that may have an effect on lipid levels and to remain consistent with the previous GWAS from which the SNPs for the main effect filter analysis were chosen. In the previous study that used the same lipid measurements for a gene-centric meta-analysis of main effects [17], an additional adjustment for medication was done by multiplying by a constant percentage to account for lipid-lowering medication. The two adjustment methods (covariate and multiplication) gave similar results; therefore, we only included the covariate adjustment results in this manuscript. We chose to include the top 10 principal components to remain consistent with the previous GWAS and to control for any residual variation, as we were performing these analyses in a combined cohort that included individuals from various parts of the country.

Models with likelihood ratio test p-values < 0.001, comparing the full and reduced linear regression models (Eqs. 1 and 2), were selected for replication testing. We adjusted the threshold using a Bonferroni correction based on the total number of models that were tested for each filtering method.
We estimated these models to be independent due to the LD pruning in the main effect filter analysis and the SNPs being in different genes for the Biofilter analysis (Fig. 1). The full model consisted of the same SNP and covariate terms as the reduced model, but with an additional multiplicative interaction term for the SNP-SNP model.

We generated “proxy” models by identifying SNPs in high linkage disequilibrium (LD) (r² > 0.8) with model SNPs based on the HapMap European CEPH (CEU) population in 1000 Genomes Project Pilot 1 data (2010 release) using SNAP [21]. These SNPs were used to build a list of proxy SNP-SNP models that represent the original models from the discovery set. The purpose of these models was to capture signals in the replication data that may have been missed due to allele frequency differences between the discovery and replication cohorts. The original and proxy models from the discovery analysis were tested in each of the replication cohorts.
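The comparison of the reduced model (two SNP main effects plus covariates) with the full model (the same terms plus the SNP1 × SNP2 product) can be illustrated with a small, dependency-free sketch. This is not the authors' R script: it assumes additive 0/1/2 genotype coding, ordinary least squares with Gaussian errors, and a one-degree-of-freedom chi-square reference distribution for the statistic n·ln(RSS_reduced / RSS_full). All function names are ours.

```python
import math


def _solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def _rss(columns, y):
    """Residual sum of squares from an OLS fit of y on an intercept plus the columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = _solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def interaction_lrt(y, snp1, snp2, covariates=()):
    """Likelihood ratio test for adding a SNP1 x SNP2 term to the main-effects model.

    Returns (statistic, p_value); the p-value uses a chi-square with 1 df,
    whose upper tail equals erfc(sqrt(statistic / 2)).
    """
    product = [a * b for a, b in zip(snp1, snp2)]
    rss_reduced = _rss([snp1, snp2, *covariates], y)
    rss_full = _rss([snp1, snp2, product, *covariates], y)
    stat = len(y) * math.log(rss_reduced / rss_full)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value
```

Covariates such as age, sex, or principal components would be passed as additional columns in `covariates`; in practice a statistics package would be used rather than hand-rolled least squares.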

Replication: Cohort descriptions

The top original and proxy models from the main effect filter and Biofilter analyses were tested in ten independent replication cohorts: BOSS/EHLS/BDES, CLEAR, eMERGE, EPIC-NL, GIRaFH, KORA, LURIC, PROCARDIS, Whitehall II, and BWHHS. All of the replication cohorts, except the eMERGE datasets, were genotyped using the IBC array; therefore, many of the proxy models were not tested because many of the proxy SNPs are not on the IBC array. The eMERGE network is a consortium of institutions with DNA from biorepositories linked to data from patient electronic medical records (EMR) [22]. The eMERGE set was genotyped with the Illumina660W GWAS platform and further imputed using 1000 Genomes Project data, as described previously [23]. The replication set consisted of data from the Marshfield Clinic, Northwestern University, Group Health Cooperative, Mayo Clinic, and Vanderbilt University. After quality control, the final eMERGE sample size was n = 7502 for all lipid traits. Details on quality control and phenotype extraction from the EMR can be found in [24]. The minimum variant and sample call rate thresholds for all replication cohorts were 0.95 and 0.90, respectively. A Hardy-Weinberg equilibrium test p-value threshold of at least p < 1 × 10⁻⁶ was applied by each group. In each of the replication cohorts, population stratification and relatedness were assessed and adjusted for accordingly. All of the individuals in the replication cohorts were of European-American descent. The full details for the QC procedures can be found in the references provided for each replication cohort in Table 1.

Replication: Quality control and statistical analyses

Replication analyses were performed in nine independent cohorts previously genotyped on the IBC array for a range of phenotypes including lipid levels [17], and in the eMERGE cohort, which contained GWAS genotype data (Fig. 1). For each of the ten cohorts, all of the models from the discovery analysis with LRT p < 0.001 and all of the corresponding proxy models were tested using the same statistical approach as for the discovery analysis (Eqs. 1 and 2). We compiled the results to assess which SNP-SNP model signals replicated across the respective cohorts. Significance of replication was assessed by correcting the likelihood ratio test p-value for the number of original (i.e. non-proxy) models tested and for the 10 replication cohorts. We also assessed how many of the 10 cohorts had significant replication for each of the models. These results were visualized using the program SynthesisView [25].
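The per-model tally of replicating cohorts can be sketched as follows. The sketch assumes each result row carries a signal identifier shared by an original model and its proxies, so that a signal is counted once per cohort regardless of which version replicated. The row layout, function name, and example values are hypothetical.

```python
from collections import defaultdict


def replicating_cohorts(results, threshold):
    """Map each model signal to the sorted list of cohorts in which it replicated.

    results: iterable of (signal_id, cohort, lrt_p) rows; original and proxy
    models for the same signal are assumed to share one signal_id.
    """
    hits = defaultdict(set)
    for signal_id, cohort, lrt_p in results:
        if lrt_p < threshold:
            hits[signal_id].add(cohort)
    return {signal: sorted(cohorts) for signal, cohorts in hits.items()}


# Hypothetical rows: signal "m1" replicates in two cohorts, "m2" in none.
rows = [
    ("m1", "P", 1e-7),
    ("m1", "P", 2e-6),   # proxy of m1 in the same cohort, counted once
    ("m1", "W", 5e-6),
    ("m2", "L", 0.01),
]
print(replicating_cohorts(rows, threshold=3e-5))
```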

Results

Discovery and replication

Full results from the discovery analysis for all original models selected for replication testing can be found in Additional file 1: Tables S1 and S2. The counts of significant models that were identified and then tested in the replication cohorts can be found in Fig. 1. Significance in the replication cohorts was assessed with a Bonferroni-like correction equivalent to p = 0.05: the threshold was divided by the number of original models tested in each study design (i.e. not counting the proxy models) and then further divided by 10, the number of replication cohorts. For the main effect filter (MEF) analyses, the numbers of original (non-proxy) models selected for testing are shown as lipid trait (model count, corrected p-value): HDL-C (156, p = 0.00003); LDL-C (160, p = 0.00003); TC (180, p = 0.00003); and TG (187, p = 0.00003). The respective counts for the Biofilter analysis were: HDL-C (22, p = 0.0002); LDL-C (22, p = 0.0002); TC (21, p = 0.0002); and TG (49, p = 0.0001). We then calculated the number of model signals that passed the respective thresholds in each cohort (i.e. if both the original and proxy SNP-SNP models replicated for one LD signal, then only one model signal was counted) (Fig. 1). The models that passed the main effect filter and Biofilter replication significance thresholds are shown in Tables 2 and 3, respectively. Results are shown for models with the same direction of effect as the discovery datasets and/or the lowest p-value, where replication was observed in more than one cohort. More models passed the selected replication threshold in the main effect filter analyses than in the Biofilter analyses, and a number of models showed similar results in more than one cohort. For HDL-C, 17 models replicated in total, seven of them in at least two cohorts. For LDL-C, two models replicated, both in at least two cohorts. For TC, replication occurred for one model in one cohort. For TG, 11 models replicated in total, four of them in at least two cohorts (Table 2).
For the Biofilter analyses, results were replicated for the TG trait with two models passing the significance threshold in a single cohort (Table 3).
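The corrected thresholds quoted above follow from dividing 0.05 by the number of original models and by the 10 replication cohorts. A minimal sketch of that arithmetic, using the model counts reported in the text (the function name is ours):

```python
def replication_threshold(n_original_models, n_cohorts=10, alpha=0.05):
    """Bonferroni-like threshold: alpha divided by (models tested x cohorts)."""
    return alpha / (n_original_models * n_cohorts)


# Model counts as reported in the text for each filter and trait.
mef_counts = {"HDL-C": 156, "LDL-C": 160, "TC": 180, "TG": 187}
biofilter_counts = {"HDL-C": 22, "LDL-C": 22, "TC": 21, "TG": 49}

for label, counts in (("MEF", mef_counts), ("Biofilter", biofilter_counts)):
    for trait, n_models in counts.items():
        print(label, trait, f"{replication_threshold(n_models):.0e}")
```

Rounded to one significant figure, these reproduce the reported values of 0.00003 for all four MEF traits and 0.0002, 0.0002, 0.0002, and 0.0001 for the Biofilter traits.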

Table 2

Discovery and replication results for models passing replication thresholds for each lipid trait for main effect filter analysis

	Disc. Rank	SNP 1	SNP 2	Locus 1	Locus 2	Beta	LRT p	Rep. Beta	Rep. LRT p	Rep. Cohort^a
HDL	1	rs12720918	rs4783961	CETP	CETP	−0.06	9.5 × 10⁻²⁰	−0.07	3.0 × 10⁻¹²	P,W,L
	2	rs12720918	rs158477	CETP	CETP	−0.06	6.3 × 10⁻¹⁶	−0.07	2.9 × 10⁻¹⁰	P,W,L
	3	rs1864163	rs4783961	CETP	CETP	−0.06	4.5 × 10⁻¹⁵	−0.05	7.1 × 10⁻⁷	P,W,B
	5	rs1864163	rs158477	CETP	CETP	−0.06	1.3 × 10⁻¹²	−0.05	2.3 × 10⁻⁸	P
	6	rs12708967	rs820299	CETP	CETP	0.06	1.0 × 10⁻¹¹	0.06	1.6 × 10⁻⁶	P,W,L
	7	rs1864163	rs4784744	CETP	CETP	0.05	2.6 × 10⁻¹¹	0.06	5.2 × 10⁻¹¹	P,W
	8	rs1800775	rs4783961	CETP	CETP	0.04	6.3 × 10⁻¹¹	−0.08	2.4 × 10⁻⁷	B
	9	rs12708967	rs158477	CETP	CETP	−0.05	2.5 × 10⁻¹⁰	−0.06	1.1 × 10⁻⁶	P,W,L
	10	rs9939224	rs4783961	CETP	CETP	−0.05	2.5 × 10⁻¹⁰	−0.04	2.4 × 10⁻⁷	W,B
	12	rs1800775	rs158477	CETP	CETP	0.04	1.8 × 10⁻⁸	0.07	2.5 × 10⁻⁶	B
	13	rs9939224	rs478474	CETP	CETP	0.04	5.4 × 10⁻⁷	0.07	2.4 × 10⁻¹⁰	P
	17	rs1800775	rs4784744	CETP	CETP	−0.03	1.8 × 10⁻⁶	−0.05	9.7 × 10⁻⁷	W
	18	rs9939224	rs12447924	CETP	CETP	0.05	1.8 × 10⁻⁶	0.06	5.3 × 10⁻⁷	P
	38	rs7013777	rs9644636	LPL	LPL	−0.03	8.0 × 10⁻⁵	−0.04	7.5 × 10⁻⁶	W
	50	rs820299	rs8056954	CETP	SLC12A3	0.03	1.8 × 10⁻⁴	0.06	1.5 × 10⁻⁵	W
	66	rs12708967	rs4784744	CETP	CETP	0.03	3.0 × 10⁻⁴	0.05	2.4 × 10⁻⁵	P
	133	rs6586891	rs285	LPL	LPL	−0.02	7.9 × 10⁻⁴	−0.04	2.9 × 10⁻⁵	P
LDL	7	rs1531517	rs519113	BCL3	PVRL2	−0.16	7.9 × 10⁻⁶	−0.2	5.2 × 10⁻⁶	P,B
LDL	70	rs4803766	rs157580	PVRL2	TOMM40	−0.06	3.4 × 10⁻⁴	−0.11	4.8 × 10⁻⁷	P,B
TC	33	rs11216129	rs10750097	BUD13	APOA5	−0.12	1.3 × 10⁻⁴	−0.22	1.4 × 10⁻⁵	W
TG	1	rs4938303	rs180327	BUD13	BUD13	0.09	1.2 × 10⁻²¹	0.08	9.5 × 10⁻⁷	P,W
	2	rs2075295	rs6589568	BUD13	APOA4	−0.10	4.4 × 10⁻¹⁹	−0.15	3.5 × 10⁻¹⁵	P,W
	3	rs180327	rs10750097	BUD13	APOA5	0.08	3.1 × 10⁻¹⁴	0.31	6.8 × 10⁻⁹	W,B
	4	rs180327	rs2075295	BUD13	BUD13	0.07	8.9 × 10⁻¹³	0.07	1.5 × 10⁻⁵	P
	5	rs180327	rs6589568	BUD13	APOA4	0.07	2.7 × 10⁻¹⁰	0.08	5.6 × 10⁻⁶	W
	6	rs11216129	rs10750097	BUD13	APOA5	−0.12	2.1 × 10⁻⁹	−0.12	2.6 × 10⁻⁷	W,P
	13	rs180327	rs618923	BUD13	ZPR1	−0.08	3.7 × 10⁻⁷	−0.08	1.0 × 10⁻⁵	W
	15	rs2075295	rs1263173	BUD13	APOA4	−0.08	2.1 × 10⁻⁷	−0.08	3.5 × 10⁻⁶	W
	19	rs486394	rs4938303	BUD13	BUD13	0.07	2.1 × 10⁻⁶	0.07	2.9 × 10⁻⁵	P
	49	rs2075295	rs10047459	BUD13	APOA1	−0.11	2.1 × 10⁻⁵	−0.11	6.9 × 10⁻⁸	W
	153	rs2075295	rs10750097	BUD13	APOA5	−0.09	2.1 × 10⁻⁴	−0.09	2.9 × 10⁻⁶	W

LRT likelihood ratio test. ^a See Table 1 for details on cohorts

Table 3

Models that passed the replication threshold for the TG trait for Biofilter analysis

	Disc. Rank	SNP 1	SNP 2	Locus 1	Locus 2	Beta	LRT p	Rep. Beta	Rep. LRT p	Rep. Cohort^a
TG	9	rs11216162	rs1263173	SIK3	APOA4	−0.05	5.5 × 10⁻⁵	−0.08	5.5 × 10⁻⁵	P
TG	44	rs625145	rs1263173	SIK3	APOA4	−0.04	6.8 × 10⁻⁵	−0.07	6.8 × 10⁻⁵	P

No models passed replication for HDL-C, LDL-C, or TC. LRT likelihood ratio test. ^a See Table 1 for details on cohorts

Although we performed LD pruning prior to the interaction analyses, moderate LD (r² < 0.6) remained. This resulted in residual correlation in the top replicating models, so separate models may actually represent a single interaction signal. Additionally, all of the replicating models contained two SNPs in the same gene/region. To assess this, we looked at the pairwise r² values amongst all SNPs in the top replication models. The goal was to identify independent replication signals and to ensure that the interaction signals were not being inflated by LD between model SNPs. For the main effect filter analysis of HDL-C, we identified three sets of moderately correlated SNPs and two interaction signals (Set 1 x Set 2 and Set 2 x Set 3), as shown in Fig. 2. No correlation (r² > 0.1) was observed in our data between SNPs in the same replication model.
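As an illustration of the pairwise r² check described above, the sketch below computes r² as the squared Pearson correlation between two vectors of additive genotype codes (0, 1, or 2 copies of an allele). This genotype-based r² is a common approximation and is an assumption here; the text does not specify how the r² values in the figures were derived, and haplotype-based estimates can differ.

```python
def genotype_r2(g1, g2):
    """Squared Pearson correlation between two additive genotype vectors."""
    if len(g1) != len(g2) or not g1:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    n = len(g1)
    mean1 = sum(g1) / n
    mean2 = sum(g2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(g1, g2))
    var1 = sum((a - mean1) ** 2 for a in g1)
    var2 = sum((b - mean2) ** 2 for b in g2)
    if var1 == 0 or var2 == 0:
        raise ValueError("monomorphic SNP: r2 is undefined")
    return cov * cov / (var1 * var2)


# Toy genotypes: perfectly correlated, then uncorrelated.
print(genotype_r2([0, 1, 2, 0, 1, 2], [2, 1, 0, 2, 1, 0]))
print(genotype_r2([0, 0, 2, 2], [0, 2, 0, 2]))
```

Applying such a function to every pair of SNPs in the replicating models would give a matrix comparable in form to the ones shown in the LD figures.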

Fig. 2

Pairwise r² values for SNPs in top models for main effect filtering (MEF) analysis of HDL-C levels. The numbers in the boxes are r² values and darker shading indicates higher LD. The numbers below the SNPs are an indicator of location in this region. Correlation patterns indicate three sets of SNPs and two interaction signals based on replication results (Set 1 x Set 2 and Set 2 x Set 3)

For the main effect filter analyses of TG, the signals that replicated were in a similar region on chromosome 11. This region includes several genes with strong main effects on TG levels, including APOA4, APOA5, APOC3, SIK3, and BUD13. There are complex patterns of moderate to strong LD in this region, and thus bona fide “independent” signals are challenging to delineate. However, for the main effect filter analyses of TG, one SNP (rs180327) appeared in two of the four models that replicated in more than one cohort. Moderate correlation exists between most of the other SNPs except for rs180327 (Fig. 3). This suggests a single signal representing an interaction between rs180327 (or a correlated functional variant) and the other variants for the four models that include this SNP. For the main effect filter analysis of LDL-C, two models replicated in more than one cohort. While the SNPs from the two models are in a similar region on chromosome 19 encompassing genes such as APOE, BCL3, PVRL2, and TOMM40, these appear to consist of two separate interaction signals. No models replicated in HDL-C, LDL-C, or TC for the Biofilter analyses.

Fig. 3

Pairwise r² values for SNPs in top models for the main effect filtering (MEF) replication analysis of plasma triglyceride (TG) levels. The numbers in the boxes are r² values and darker shading indicates higher LD. The numbers below the SNPs are an indicator of location in this region. Correlation patterns indicate a single signal representing an interaction between rs180327 (or a correlated functional variant) and the other variants for the four models that include this SNP

To further summarize the replication analyses, we plotted the compiled results to view all of the cohorts’ results simultaneously for each of the analyses with significant replication (Figs. 4, 5, 6, 7 and 8). In these figures, we show the models that replicated at the aforementioned thresholds. As some of the replications are in proxy models (not the original discovery model), we show the lead significant result for each replicating model.

Fig. 4

Results for the main effect filter (MEF) analysis of HDL-C. Results are shown for the models that passed the replication threshold of p < 3.0 × 10⁻⁵. Orig# and prox# designate models that were identified in the discovery cohort and those identified via proxy (i.e. both SNPs in high LD with SNPs from orig. models), respectively. V1 and V2 are the two SNPs in the model; arrow in likelihood ratio test (LRT) row denotes direction of effect.

Fig. 5

Results for the main effect filter (MEF) analysis of LDL-C. Results are shown for the models that passed the replication threshold of p < 3.0 × 10⁻⁵. Orig# and prox# designate models that were identified in the discovery cohort and those identified via proxy (i.e. both SNPs in high LD with SNPs from orig. models), respectively. V1 and V2 are the two SNPs in the model; arrow in likelihood ratio test (LRT) row denotes direction of effect.

Fig. 6

Results for the main effect filter (MEF) analysis of TC. Results are shown for the models that passed the replication threshold of p < 3.0 × 10⁻⁵. Orig# and prox# designate models that were identified in the discovery cohort and those identified via proxy (i.e. both SNPs in high LD with SNPs from orig. models), respectively. V1 and V2 are the two SNPs in the model; arrow in likelihood ratio test (LRT) row denotes direction of effect.

Fig. 7

Results for the main effect filter (MEF) analysis of TG. Results are shown for the models that passed the replication threshold of p < 3.0 × 10⁻⁵. Orig# and prox# designate models that were identified in the discovery cohort and those identified via proxy (i.e. both SNPs in high LD with SNPs from orig. models), respectively. V1 and V2 are the two SNPs in the model; arrow in likelihood ratio test (LRT) row denotes direction of effect.

Fig. 8

Results for the Biofilter analysis of TG. Results are shown for the models that passed the replication threshold of p < 3.0 × 10⁻⁵. Orig# and prox# designate models that were identified in the discovery cohort and those identified via proxy (i.e. both SNPs in high LD with SNPs from orig. models), respectively. V1 and V2 are the two SNPs in the model; arrow in likelihood ratio test (LRT) row denotes direction of effect.

Discussion

For this study, we used two different filtering pipelines to test for SNP-SNP interactions that are associated with four plasma lipid traits: LDL-C, HDL-C, TC, and TG levels. We tested these models in a large discovery cohort and then tested the top models in ten replication sets. Model signals passed the replication threshold for each of the lipid traits in the main effect filter analysis and for TG in the Biofilter analysis. As expected, replication of the observed associations depended on the size of the replication cohorts. Also, more models replicated in the main effect filter analysis, which may indicate a statistical bias due to strong main effects. However, the interaction signals appear robust, considering the number of models tested, indicating that such bias is unlikely to be the sole driver of these significant interactions.

Genetic interactions are often described as gene-gene interactions, and are usually studied by specifically looking for variants in different genes that could indicate novel pathways (e.g. protein-protein interactions that have not been previously identified using genetic data). However, intragenic interactions, such as those that we observed in this study, should not be ignored, as they may contribute to a substantial proportion of the genetic architecture.

Our top replicating models for HDL-C consisted of two SNPs in CETP. Many of these models replicated across cohorts, with the top replication p-value for the likelihood ratio test being 3.0 × 10⁻¹² (Table 2 and Additional file 1: Table S1). LD patterns suggest that there are three independent sets of SNPs that represent many of the top models for the CETP-HDL associations. Further, many of these SNPs are in the promoter region of CETP. Most notably, a previous study identified a functional interaction between two of the SNPs in one of our top models (model 9: rs4783961 and rs1800775) that resulted in changes in CETP promoter activity [26]. As discussed in that study, this could be explained by shared transcription factors that may result in non-linear changes in CETP and HDL-C levels when the variants occur together. These results provide further support for studying intragenic non-linear effects, which could be important both for accurate phenotype prediction and for understanding why specific variants in this gene have certain effects on HDL-C levels.

Due to the complex nature of estimating heritability, we focus on how our results contribute to understanding the genetic architecture and biological underpinnings of lipid traits. First, the estimated heritability of lipid traits has a relatively wide range (40–60%). There is also high variability in the results of methods that calculate overall heritability. A recent study found that for certain models, the estimate is extremely inflated and potentially not reliable [12]. Furthermore, because we are studying genetic interactions, reliably calculating the overall contribution to trait variation becomes even more complicated, and many methods are not designed to accurately generate these estimates. In our study, the difference in R² between the full and reduced models (Diff|Rsq column in Additional file 1: Tables S1 and S2) is usually about 0.001. Even though this is much smaller than the R² for the reduced model, which does not include the interaction term, it would be inaccurate to conclude that the interaction term is not contributing to the underlying genetic variation, for a number of reasons. Firstly, the reduced model includes the contribution of highly relevant clinical and environmental variables (e.g. smoking status, medication, BMI, sex). Secondly, we are calculating this estimate from a very specific interaction model that assumes the interaction is multiplicative and that the effect of minor alleles is additive.
While this model is robust to some interactions that do not meet these assumptions [13], the estimates themselves could be underestimated (or overestimated). As we are not the first group to look for genetic interactions among lipid traits, it will be important for a future study to take into account all of the identified main and interaction effects in order to assess and compare the contribution of each to trait variability. However, this is outside the scope of our current manuscript.

Many questions remain to be answered with regard to a gold-standard genome-wide or candidate-loci interaction analysis protocol. For example, the overwhelming majority of our replicating interaction models were in the same gene. This is most likely because our variants were genotyped using a gene-centric chip covering genes that are known to affect cardiovascular-related phenotypes, such as the lipid levels we analyzed in this study. A chip with more extensive coverage outside of these genes may have identified more interactions between functionally different parts of the genome. However, our focused analysis did allow us to efficiently test two unique filtering pipelines for a more hypothesis-driven approach.

These filtering approaches each have their own strengths and weaknesses. The Biofilter 2.0 analysis, which created gene-gene models based on current biological knowledge, allows for clearer interpretations as the models make biological sense. However, it inhibits the discovery of interactions in regions where biological knowledge is limited. In this particular dataset, the main effect filter analysis is better suited than the Biofilter pipeline to discovering interactions outside regions of current biological focus. However, if the main effects of the true interaction model are non-existent (i.e. purely epistatic models in which neither SNP in the interaction model is significant alone), the main effect filter pipeline would not detect such effects. Also, as our results possibly indicate, strong main effects may create inflated interaction signals. A more appropriate filtering pipeline may use a hybrid of the main effect filter and Biofilter approaches. Another possible filtering mechanism may be one that ranks variables based on potential main and interaction effects simultaneously. Some machine learning methods, such as Random Forests (RF) and artificial neural networks (ANN), are currently being tested for this approach [27].
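
The full-versus-reduced model comparison discussed above (a multiplicative interaction term, additive allele coding, a likelihood ratio test, and the difference in R²) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the covariates mentioned in the text (e.g. smoking status, medication, BMI, sex) are omitted, the LRT statistic is taken to be the standard Gaussian linear-model form n·ln(RSS_reduced/RSS_full) with one degree of freedom, and all function names and data are hypothetical.

```python
from math import erfc, log, sqrt


def ols_rss(X, y):
    """Least-squares fit of y on X via the normal equations; returns the RSS."""
    p = len(X[0])
    # Augmented normal-equation matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(p)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        if abs(A[piv][c]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum(
        (yi - sum(b * x for b, x in zip(beta, row))) ** 2
        for row, yi in zip(X, y)
    )


def interaction_lrt(snp1, snp2, trait):
    """Compare trait ~ snp1 + snp2 with trait ~ snp1 + snp2 + snp1*snp2.

    Returns (LRT statistic, p-value on 1 df, difference in R-squared).
    """
    n = len(trait)
    reduced = [[1.0, a, b] for a, b in zip(snp1, snp2)]
    full = [[1.0, a, b, a * b] for a, b in zip(snp1, snp2)]
    rss_r = ols_rss(reduced, trait)
    rss_f = ols_rss(full, trait)
    mean = sum(trait) / n
    tss = sum((t - mean) ** 2 for t in trait)
    # Clamp at zero to absorb rounding error when the interaction adds nothing.
    stat = max(0.0, n * log(rss_r / rss_f))
    p_value = erfc(sqrt(stat / 2.0))  # chi-square survival function, 1 df
    return stat, p_value, (rss_r - rss_f) / tss


# Hypothetical balanced toy data: all nine genotype combinations, ten times each.
grid = [(a, b) for a in (0, 1, 2) for b in (0, 1, 2)] * 10
snp1 = [a for a, _ in grid]
snp2 = [b for _, b in grid]
noise = [0.1 * ((i * 7) % 5 - 2) for i in range(len(grid))]  # deterministic
trait_epi = [1 + 0.5 * a + 0.3 * b + 0.4 * a * b + e for (a, b), e in zip(grid, noise)]
trait_add = [1 + 0.5 * a + 0.3 * b + e for (a, b), e in zip(grid, noise)]

stat_epi, p_epi, diff_epi = interaction_lrt(snp1, snp2, trait_epi)
stat_add, p_add, diff_add = interaction_lrt(snp1, snp2, trait_add)
```

In this toy setting the trait with a simulated interaction gives a small p-value and a positive R² difference, while the purely additive trait gives a statistic near zero. A real analysis would add the covariates to both models, so the R² of the reduced model would reflect them as well, which is the point made in the text about interpreting a small difference in R².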
