
Elastic-net regularization approaches for genome-wide association studies of rheumatoid arthritis.

Seoae Cho, Haseong Kim, Sohee Oh, Kyunga Kim, Taesung Park.

Abstract

The current trend in genome-wide association studies is to identify regions where the true disease-causing genes may lie by evaluating thousands of single-nucleotide polymorphisms (SNPs) across the whole genome. However, many challenges exist in detecting disease-causing genes among the thousands of SNPs. Examples include multicollinearity and multiple testing issues, especially when a large number of correlated SNPs are simultaneously tested. Multicollinearity can often occur when predictor variables in a multiple regression model are highly correlated, and can cause imprecise estimation of association. In this study, we propose a simple stepwise procedure that identifies disease-causing SNPs simultaneously by employing elastic-net regularization, a variable selection method that allows one to address multicollinearity. At Step 1, the single-marker association analysis was conducted to screen SNPs. At Step 2, the multiple-marker association was scanned based on the elastic-net regularization. The proposed approach was applied to the rheumatoid arthritis (RA) case-control data set of Genetic Analysis Workshop 16. While the selected SNPs at the screening step are located mostly on chromosome 6, the elastic-net approach identified putative RA-related SNPs on other chromosomes in an increased proportion. For some of those putative RA-related SNPs, we identified the interactions with sex, a well known factor affecting RA susceptibility.

Year:  2009        PMID: 20018015      PMCID: PMC2795922          DOI: 10.1186/1753-6561-3-s7-s25

Source DB:  PubMed          Journal:  BMC Proc        ISSN: 1753-6561


Background

Recently, genome-wide association studies (GWAS) have become a promising tool for deciphering the genetics of complex diseases, which are usually polygenic and affected by gene-by-environment interactions. Because scanning multiple markers jointly can be more powerful for detecting disease-related genes, various multiple-marker approaches have been or can be used in GWAS [1-4]. Examples include logic regression [2] and classification and regression trees [3]. Because of their sequential selection processes, these methods may miss the overall correlation structure of the genes. Another example is the random forest [4], in which true disease-causing genes can be masked by other genes, so the identification result may not be robust. In this study, we propose a simple stepwise procedure that employs the elastic-net regularization-based approach [5] to take the overall correlation structure of single-nucleotide polymorphisms (SNPs) into account when selecting disease-causing genes automatically in GWAS. Because the elastic net imposes a combination of lasso and ridge penalties [6,7], it provides more reproducible prediction than ordinary multiple regression, especially when there are highly correlated predictors (e.g., SNPs in high linkage disequilibrium). Our approach consists of two main steps, called the screen step and the elastic-net step. At the screen step, we eliminate most of the noise SNPs via single-marker association tests and select the largest number of candidate SNPs that can be analyzed by the elastic-net approach at the next step. At the elastic-net step, putative disease-causing SNPs are jointly identified based on multiple logistic regressions with the screened SNPs via the elastic net. Interactions between SNPs and non-genotypic factors (e.g., sex) can also be examined. The proposed approach was applied to the rheumatoid arthritis (RA) case-control dataset of Genetic Analysis Workshop 16 (GAW16).
RA is a complex disease with a moderately strong genetic component. It is generally known that females are at higher risk than males and that the mean age of onset is in the fifth decade. Many studies have implicated the HLA region on chromosome 6p21, with consistent evidence for several DR alleles contributing to risk [8]. Among the non-HLA loci, PTPN22 on chromosome 1p13, a gene coding for protein tyrosine phosphatase non-receptor type 22, is considered a strong candidate RA-susceptibility gene [9]. Recently, a functional SNP in PTPN22 was reported to be associated with RA [10]. There remains much to learn about the genetic susceptibility to RA, including possible gene-by-environment interactions.

Methods

Genotype data and sample

The RA data from GAW16 included 545,080 SNPs genotyped on the Illumina 550 k chip, along with covariates, for 908 cases and 1260 controls. We adjusted for population stratification using the computer program Eigenstrat [11], excluding 20 outlying samples. Samples with sex-mismatch errors were also removed [12]. We excluded SNPs with >10% missing genotypes, with minor allele frequencies <5%, and/or with p < 0.001 in Hardy-Weinberg equilibrium tests. As a result, 474,499 SNPs passed our quality control filters and were used in the proposed stepwise analyses.
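The three SNP-level filters described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names are hypothetical, genotypes are assumed to be coded as allele counts 0/1/2 with None for a missing call, and Hardy-Weinberg equilibrium is checked with a 1-df chi-square test over whatever samples are supplied.

```python
import math

def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for Hardy-Weinberg equilibrium from genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)               # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))

def passes_qc(genotypes, max_missing=0.10, min_maf=0.05, min_hwe_p=0.001):
    """Apply the missingness, MAF and HWE filters to one SNP (hypothetical helper)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    if 1.0 - len(called) / len(genotypes) > max_missing:   # >10% missing genotypes
        return False
    freq = sum(called) / (2 * len(called))
    if min(freq, 1.0 - freq) < min_maf:                    # minor allele frequency <5%
        return False
    counts = [called.count(k) for k in (0, 1, 2)]
    return hwe_p_value(counts[2], counts[1], counts[0]) >= min_hwe_p
```

A SNP is kept only if `passes_qc` returns True; the thresholds default to the values quoted in the text.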

Step 1: Screening SNPs via single-marker association tests

For each single SNP, the disease association is tested using the following logistic regression model, adjusted for sex, under the additive mode of inheritance:

$$\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\mathrm{SEX} + \beta_2\,\mathrm{SNP},$$

where π represents the probability of getting the disease and SNP is the allele count (0, 1, or 2) under additive coding. Among the SNPs showing the strongest associations, we select the largest number of SNPs that can be analyzed in the penalized logistic regression via the elastic net at the next step. This screening step is needed because of the computational limitation of applying the penalized logistic regression via the elastic net to a very large number of SNPs.
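The screening step can be sketched as below: a sex-adjusted logistic regression is fitted for one SNP by Newton-Raphson, a Wald test gives the p-value for the SNP coefficient, and the k SNPs with the smallest p-values are kept. The function names, the 0/1 coding of SEX, the 0/1/2 coding of the SNP, and the choice of a Wald test are assumptions made for illustration; the source does not state which test statistic was used.

```python
import math

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def snp_wald_test(y, sex, snp, n_iter=25):
    """Fit logit(pi) = b0 + b1*SEX + b2*SNP; return (b2, two-sided Wald p-value)."""
    X = [[1.0, float(s), float(g)] for s, g in zip(sex, snp)]
    beta = [0.0, 0.0, 0.0]
    for _ in range(n_iter):
        grad = [0.0] * 3
        info = [[0.0] * 3 for _ in range(3)]
        for xi, yi in zip(X, y):
            eta = sum(b * x for b, x in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for j in range(3):
                grad[j] += (yi - p) * xi[j]
                for k in range(3):
                    info[j][k] += w * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    # Var(b2) is the last diagonal element of the inverse information matrix
    var_b2 = solve(info, [0.0, 0.0, 1.0])[2]
    z = beta[2] / math.sqrt(var_b2)
    return beta[2], math.erfc(abs(z) / math.sqrt(2.0))

def screen_top_k(p_values, k):
    """Indices of the k SNPs with the smallest single-marker p-values."""
    return sorted(range(len(p_values)), key=lambda j: p_values[j])[:k]
```

In this sketch `screen_top_k` would be called with k = 1000, 2000, or 3000 on the p-values collected from `snp_wald_test` over all SNPs that passed quality control.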

Step 2: Penalized logistic regression models via the elastic net

In this step, putative disease-causing SNPs are identified via elastic-net-based variable selection. The elastic-net method is particularly useful when the number of predictor variables (p), many of them highly correlated, is much larger than the sample size (N). For the logistic model with binary outcome y_i and predictor vector x_i, the elastic-net regularization approach solves the following problem:

$$\min_{\beta_0,\beta}\; -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\left(\beta_0 + x_i^{T}\beta\right) - \log\left(1 + e^{\beta_0 + x_i^{T}\beta}\right)\right] + \lambda P_\alpha(\beta),$$

where the elastic-net penalty is defined as

$$P_\alpha(\beta) = \sum_{j=1}^{p}\left[\tfrac{1}{2}(1-\alpha)\beta_j^{2} + \alpha|\beta_j|\right].$$

The elastic-net penalty creates a useful compromise between the ridge-regression penalty (α = 0) and the lasso penalty (α = 1) [6,7]. The elastic net with α = 1 - ε for some small ε > 0 performs much like the lasso, but is robust to extreme correlations among predictor variables. Moreover, the elastic net does both shrinkage and automatic variable selection simultaneously. The choice of the regularization parameter (λ) is critical to selecting important variables with accurate estimation. The tuning parameters α and λ are usually selected to minimize the mean-squared prediction error based on cross-validation (e.g., 5-fold).

Because the effect of genotype variations (i.e., SNPs) on disease status can be modified by other factors (in our study, sex), we consider the following multiple logistic regression models to examine the SNP main effects (M1) and also the interaction effects of SNPs with sex (M2):

$$\text{(M1)}\quad \log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\mathrm{SEX} + \sum_{j}\gamma_j\,\mathrm{SNP}_j$$

$$\text{(M2)}\quad \log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\mathrm{SEX} + \sum_{j}\gamma_j\,\mathrm{SNP}_j + \sum_{j}\delta_j\,(\mathrm{SNP}_j \times \mathrm{SEX})$$

where π represents the probability of getting the disease. When M1 is used with the elastic-net penalties, the SEX variable is not penalized, so that the sex effect is adjusted for when selecting SNP main effects. Note that the main-effect terms of both SEX and the SNPs are not penalized when examining the SNP-by-sex interactions in M2. In this study, we use the 'glmnet' package in the R statistical environment (http://www.r-project.org) to conduct the penalized logistic regressions via the elastic net.
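To make the role of the penalty and of unpenalized covariates concrete, the following Python sketch fits an elastic-net penalized logistic regression by proximal gradient descent. It is not the coordinate-descent algorithm implemented in glmnet, and the function names, step size, and iteration count are assumptions chosen for a small example. A per-column penalty factor of 0 leaves a covariate such as SEX unpenalized, which mirrors how M1 is described in the text.

```python
import math

def elastic_net_penalty(beta, alpha):
    """P_alpha(beta) = sum_j [ (1 - alpha)/2 * beta_j^2 + alpha * |beta_j| ]."""
    return sum(0.5 * (1.0 - alpha) * b * b + alpha * abs(b) for b in beta)

def soft_threshold(z, t):
    """Shrink z toward zero by t; exactly zero when |z| <= t."""
    return math.copysign(max(abs(z) - t, 0.0), z)

def fit_enet_logistic(X, y, lam, alpha, penalty_factor, step=0.2, n_iter=3000):
    """Proximal-gradient fit; penalty_factor[j] = 0 leaves column j unpenalized.
    The intercept is never penalized. Returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            eta = b0 + sum(b * x for b, x in zip(beta, xi))
            r = 1.0 / (1.0 + math.exp(-eta)) - yi      # fitted probability minus outcome
            g0 += r / n
            for j in range(p):
                grad[j] += r * xi[j] / n
        b0 -= step * g0
        for j in range(p):
            lj = lam * penalty_factor[j]
            z = beta[j] - step * grad[j]
            # Proximal map of step * lj * [(1 - alpha)/2 * b^2 + alpha * |b|]
            beta[j] = soft_threshold(z, step * lj * alpha) / (1.0 + step * lj * (1.0 - alpha))
    return b0, beta
```

With columns ordered as [SEX, SNP_1, ..., SNP_p], passing `penalty_factor = [0] + [1] * p` penalizes only the SNP coefficients. In practice λ (and possibly α) would be chosen by cross-validation, as the text notes.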

Results

Single-marker associations

The single-marker association test was conducted for each SNP, and 48,336 SNPs showed p-values below 0.05 (Figure 1). Some SNPs are in HLA-DRB1 and PTPN22, which were already known to be RA-susceptibility genes [8-10]. Among the 48,336 SNPs, we chose the top 1000, 2000, and 3000 significant SNPs for Step 2.
Figure 1

Genome-wide scan for RA-SNP association. The p-values < 0.05 from single SNP association tests were plotted in -log10 scale against chromosomal positions of the corresponding 48,366 SNPs. Blue and light blue were used to distinguish chromosomes. Red indicates potential RA-related SNPs that were identified by fitting the penalized logistic regression model (M1) via elastic-net using top 3000 of those 48,366 SNPs.


Main effect analysis via elastic-net (M1)

We applied the model M1 via the elastic net to the top 1000, 2000, and 3000 SNPs selected at the first step. Among the top 1000 SNPs, 250 were identified with main effects as putative RA-related SNPs, while 360 SNPs were detected among the top 2000 and 398 among the top 3000. Those with the ten largest main effects are listed in Table 1. The resulting putative RA-related SNPs are displayed across the whole genome in Figure 1. Across the screening choices, 81 SNPs were commonly selected. Among those, 23 SNPs were also identified from single-marker association analyses after Bonferroni correction at the 5% level, and all but three of them are located on chromosome 6. For example, rs2395175 and rs660895 in HLA-DRB1 and HLA-DRA on chromosome 6 had p-values of 1.08 × 10^-87 and 7.16 × 10^-90, respectively, from the single-marker association test. However, the remaining 58 overlapping SNPs, which were not identified from single-marker association analyses, were found on various chromosomes. Some are located in known genes, such as AMFR, ANKRD35, ECT2, TARBP1, ZFP92, and ZFPM2. For instance, rs2440468 is located in the AMFR (autocrine motility factor receptor) gene on chromosome 16. AMF secretion and receptor levels are closely related to RA as well as tumor malignancy [13]. For this SNP, the RA-susceptibility odds ratios (ORs) of AG and GG against AA were 0.78 and 0.57, respectively, yet its p-value from the single-marker association test was only 5.74 × 10^-5. While single-marker association of chromosome 6 with RA has been reported by numerous studies [1], our results indicate that putative RA-related SNPs were also distributed across several regions outside chromosome 6 (Figure 2).
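As a small worked check of the Bonferroni comparison above, the sketch below computes the per-test threshold and applies it to the p-values quoted in the text. The number of tests is assumed here to be the 474,499 SNPs that passed quality control; the source does not state the denominator explicitly, and the function name is hypothetical.

```python
def bonferroni_significant(p_value, n_tests, family_alpha=0.05):
    """True if p_value passes a Bonferroni-corrected threshold of family_alpha / n_tests."""
    return p_value < family_alpha / n_tests

N_TESTS = 474_499                     # assumed: SNPs passing quality control
threshold = 0.05 / N_TESTS            # roughly 1.05e-07

hla_hit = bonferroni_significant(1.08e-87, N_TESTS)    # rs2395175, single-marker test
amfr_snp = bonferroni_significant(5.74e-5, N_TESTS)    # rs2440468, single-marker test
```

Under this assumption the HLA SNP passes the corrected threshold while rs2440468 does not, which is consistent with rs2440468 being picked up only at the elastic-net step.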
Table 1

RA-related SNPs identified with ten largest main effects via the elastic-net method (M1)

SNP           Chromosome (a)   Coefficient (b)
Top 1000
 rs6903608    6                -0.3413
 rs2395185    6                 0.3285
 rs11686264   2                -0.3284
 rs6981223    8                -0.31
 rs10948693   6                -0.2813
 rs9727917    1                -0.2806
 rs2440468    16               -0.2736
 rs4499874    5                 0.2714
 rs9275595    6                 0.2641
 rs7970893    12               -0.2492
Top 2000
 rs2395175    6                 0.2522
 rs6903608    6                -0.2299
 rs10094729   8                -0.166
 rs2101613    10               -0.1613
 rs6910071    6                 0.1529
 rs660895     6                 0.1522
 rs9277554    6                -0.1468
 rs12203592   6                -0.1401
 rs2578240    9                 0.1356
 rs9275572    6                -0.1353
Top 3000
 rs2395175    6                 0.3532
 rs660895     6                 0.2302
 rs9275572    6                -0.219
 rs10094729   8                -0.1972
 rs6903608    6                -0.1889
 rs3873444    6                -0.1403
 rs7970893    12               -0.1321
 rs2345921    4                -0.1316
 rs10789176   1                -0.125
 rs9275601    6                -0.1221

(a) Chromosome where SNP is located.

(b) Coefficient representing size and direction of SNP main effect.

Figure 2

Distributions of top 3000 screened SNPs vs. 398 potential RA-related SNPs across chromosomes. For each chromosome, blue bars represent the number of SNPs that were selected as top 3000 SNPs via single SNP association tests at Step 1; and red bars represent the number of potential RA-related SNPs that were identified at Step 2 by fitting penalized logistic regression model (M1) via elastic-net using the top 3000 screened SNPs.


Interaction analysis with sex via elastic-net (M2)

To investigate SNPs whose effects on RA susceptibility vary between the sexes, we performed interaction analysis (M2) with the putative RA-related SNPs from M1 for each screening choice (i.e., top 1000, 2000, and 3000). We identified 71 and 132 SNPs with SNP-by-sex interactions for the top 1000 and top 2000 choices, respectively, while 105 SNPs showed interactions for the top 3000 choice. Those with the five largest interaction effects are summarized in Table 2. For each sex, we investigated the RA-susceptibility OR of each genotype against the major-allele homozygote. For example, rs2044750 showed heterozygote ORs of 1.12 and 1.71 for females and males, respectively, and the ORs for AA were 1.37 for females and 2.37 for males. This SNP is located in nuclear factor of activated T cells 1 (NFATc1), a transcription factor gene on chromosome 18, which has recently been shown to be related to osteoporosis, bone metastasis, and rheumatoid arthritis [14]. Note that rs2044750 had a p-value of only 0.00041 in the single-marker association test. Ten SNPs overlapped across the screening choices; six of them lie in known genes, such as C19orf2, CUGBP2, ECT2, TBC1D8, and WNT3.
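The sex-specific, genotype-specific odds ratios mentioned above can be computed from case and control counts as in the following sketch. The counts used in the usage lines are made up purely for illustration and are not taken from the GAW16 data; the function name and the dictionary layout are likewise assumptions.

```python
def genotype_odds_ratios(counts, reference):
    """OR of each genotype against the reference genotype.

    counts maps genotype -> (n_cases, n_controls)."""
    ref_cases, ref_controls = counts[reference]
    ref_odds = ref_cases / ref_controls
    return {g: (cases / controls) / ref_odds
            for g, (cases, controls) in counts.items() if g != reference}

# Hypothetical counts for one SNP, stratified by sex (not real data).
counts_by_sex = {
    "female": {"GG": (50, 100), "AG": (30, 40), "AA": (10, 10)},
    "male":   {"GG": (20, 80),  "AG": (15, 30), "AA": (6, 8)},
}
or_by_sex = {sex: genotype_odds_ratios(c, reference="GG")
             for sex, c in counts_by_sex.items()}
```

Comparing `or_by_sex["female"]` with `or_by_sex["male"]` for the same genotype is one simple way to see the kind of sex-dependent effect that the interaction terms in M2 are meant to capture.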
Table 2

RA-related SNPs identified with sex-by-SNP interaction via the elastic-net method (M2)

SNP           Chromosome (a)   Coefficient (b)
Top 1000
 rs2858870    6                -0.329
 rs9727917    1                -0.2572
 rs10184573   2                -0.2347
 rs10514911   17                0.2314
 rs11703151   22                0.2077
Top 2000
 rs6903608    6                -0.538
 rs1217675    8                 0.5188
 rs560271     17               -0.5169
 rs201119     10               -0.4943
 rs2044750    18               -0.4812
Top 3000
 rs3873444    6                -0.3573
 rs948195     11               -0.3233
 rs2579088    12                0.3063
 rs13277113   8                 0.303
 rs12407970   1                 0.2787

(a) Chromosome where SNP is located.

(b) Coefficient representing size and direction of the SNP-by-sex interaction effect.


Discussion

We have proposed a simple stepwise approach that employs the multiple logistic regression model with elastic-net penalties to detect disease-causing genes across the whole genome in GWAS. The elastic-net method, which uses both lasso and ridge penalties, has several advantages in identifying disease-causing SNPs jointly in GWAS. First, automatic variable selection and continuous shrinkage are performed simultaneously. Second, it can select groups of highly correlated SNPs, which would cause a multicollinearity problem in classical multiple regression. Third, the shrinkage feature of the elastic net enables us to include all the interaction terms between SNPs and non-genotypic factors, as well as SNP main effects, in a single model. Also, rather than searching for potential SNPs along the entire genome directly, our approach provides an efficient search by using a multi-step procedure to handle the extremely large number of potential SNP patterns in GWAS. Although most putative RA-related SNPs were found on chromosome 6, we also identified additional susceptibility genes on other chromosomes. Our findings need to be replicated in an independent dataset or functionally validated before their biological significance can be declared.

The results disagree across the screening choices, and there are two possible causes of this discrepancy. First, missing data caused large differences in the results. We removed some samples and SNPs to make the datasets complete, because the elastic-net regression method we employed does not allow for missing data. As a result, the three datasets corresponding to the screening choices ended up with different sample sizes. The difference in sample size was large in the previous analysis, and we tried to make the sample sizes similar in the updated analysis. Even where the datasets in the previous analysis had similar sample sizes, only about 70% of their samples overlapped. This explains why we had more common SNPs in the updated results. This missing-data problem could be avoided by using a proper imputation method. Second, depending on the correlation structure among the SNPs, the elastic-net regression method may give different results, because it takes the correlation structure into account when selecting variables.

List of abbreviations used

GAW16: Genetic Analysis Workshop 16; GWAS: Genome-wide association study; OR: Odds ratio; RA: Rheumatoid arthritis; SNPs: Single-nucleotide polymorphisms

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

HK and SO participated in statistical analysis. SC participated in the design of the study, performed the statistical analysis, and drafted the manuscript. KK and TP conceived of the study, and participated in its design and coordination and helped to draft the manuscript. All authors read and approved the final manuscript.

References

1.  Principal components analysis corrects for stratification in genome-wide association studies.

Authors:  Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal:  Nat Genet       Date:  2006-07-23       Impact factor: 38.330

2.  PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors:  Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal:  Am J Hum Genet       Date:  2007-07-25       Impact factor: 11.025

3.  PTPN22 genetic variation: evidence for multiple variants associated with rheumatoid arthritis.

Authors:  Victoria E H Carlton; Xiaolan Hu; Anand P Chokkalingam; Steven J Schrodi; Rhonda Brandon; Heather C Alexander; Monica Chang; Joseph J Catanese; Diane U Leong; Kristin G Ardlie; Daniel L Kastner; Michael F Seldin; Lindsey A Criswell; Peter K Gregersen; Ellen Beasley; Glenys Thomson; Christopher I Amos; Ann B Begovich
Journal:  Am J Hum Genet       Date:  2005-08-10       Impact factor: 11.025

4.  Identification of SNP interactions using logic regression.

Authors:  Holger Schwender; Katja Ickstadt
Journal:  Biostatistics       Date:  2007-06-19       Impact factor: 5.899

5.  Aldehydic components of cinnamon bark extract suppresses RANKL-induced osteoclastogenesis through NFATc1 downregulation.

Authors:  Kentaro Tsuji-Naito
Journal:  Bioorg Med Chem       Date:  2008-09-14       Impact factor: 3.641

6.  Autocrine motility factor signaling induces tumor apoptotic resistance by regulations Apaf-1 and Caspase-9 apoptosome expression.

Authors:  Arayo Haga; Tatsuyoshi Funasaka; Yasufumi Niinaka; Avraham Raz; Hisamitsu Nagase
Journal:  Int J Cancer       Date:  2003-12-10       Impact factor: 7.396

7.  A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis.

Authors:  Ann B Begovich; Victoria E H Carlton; Lee A Honigberg; Steven J Schrodi; Anand P Chokkalingam; Heather C Alexander; Kristin G Ardlie; Qiqing Huang; Ashley M Smith; Jill M Spoerke; Marion T Conn; Monica Chang; Sheng-Yung P Chang; Randall K Saiki; Joseph J Catanese; Diane U Leong; Veronica E Garcia; Linda B McAllister; Douglas A Jeffery; Annette T Lee; Franak Batliwalla; Elaine Remmers; Lindsey A Criswell; Michael F Seldin; Daniel L Kastner; Christopher I Amos; John J Sninsky; Peter K Gregersen
Journal:  Am J Hum Genet       Date:  2004-06-18       Impact factor: 11.025

8.  Dissecting the genetic complexity of the association between human leukocyte antigens and rheumatoid arthritis.

Authors:  Damini Jawaheer; Wentian Li; Robert R Graham; Wei Chen; Aarti Damle; Xiangli Xiao; Joanita Monteiro; Houman Khalili; Annette Lee; Robert Lundsten; Ann Begovich; Teodorica Bugawan; Henry Erlich; James T Elder; Lindsey A Criswell; Michael F Seldin; Christopher I Amos; Timothy W Behrens; Peter K Gregersen
Journal:  Am J Hum Genet       Date:  2002-08-09       Impact factor: 11.025

9.  A genome-wide tree- and forest-based association analysis of comorbidity of alcoholism and smoking.

Authors:  Yuanqing Ye; Xiaoyun Zhong; Heping Zhang
Journal:  BMC Genet       Date:  2005-12-30       Impact factor: 2.797
