
Detecting single-nucleotide polymorphism by single-nucleotide polymorphism interactions in rheumatoid arthritis using a two-step approach with machine learning and a Bayesian threshold least absolute shrinkage and selection operator (LASSO) model.

Oscar González-Recio, Evangelina López de Maturana, Andrés T Vega, Corinne D Engelman, Karl W Broman.

Abstract

The objective of this study was to detect interactions between relevant single-nucleotide polymorphisms (SNPs) associated with rheumatoid arthritis (RA). Data from Problem 1 of the Genetic Analysis Workshop 16 were used. These data consisted of 868 cases and 1,194 controls genotyped with the 500 k Illumina chip. First, machine learning methods were applied for preselecting SNPs. One hundred SNPs outside the HLA region and 1,500 SNPs in the HLA region were preselected using information-gain theory. The software weka was used to reduce colinearity and redundancy in the HLA region, resulting in a subset of 6 SNPs out of 1,500. In a second step, a parametric approach to account for interactions between SNPs in the HLA region, as well as HLA-nonHLA interactions was conducted using a Bayesian threshold least absolute shrinkage and selection operator (LASSO) model incorporating 2,560 covariates. This approach detected some main and interaction effects for SNPs in genes that have previously been associated with RA (e.g., rs2395175, rs660895, rs10484560, and rs2476601). Further, some other SNPs detected in this study may be considered in candidate gene studies.


Year: 2009 PMID: 20018057 PMCID: PMC2795964 DOI: 10.1186/1753-6561-3-s7-s63

Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561

Background

Rheumatoid arthritis (RA) is a chronic disease with known autoimmune pathophysiology. RA is a heritable condition, and association studies have already identified a genomic region on chromosome 6 (the HLA region). While this represents progress in elucidating genetic contributions to RA, much is still unknown about the underlying genetic causes, and there is ample evidence that other genes affect disease risk, both as major effects and in epistasis [1].

Genome-wide association studies (GWAS) of diseases and complex traits have become a major focus of research in human genetics. GWAS may provide robust results for understanding the genetic basis of RA. One of the most difficult challenges of GWAS is the large p, small n problem, which arises when the number of variables considered (p) is much larger than the number of subjects (n). The problem becomes particularly difficult when one seeks to estimate single-nucleotide polymorphism (SNP) by SNP interactions.

One approach for efficiently handling high-dimensional GWA data consists of two steps: 1) reducing dimensionality by filtering non-informative markers, and 2) applying a more sophisticated model to quantify the effect of the selected SNPs and their interactions. Information gain and the wrapper procedure are examples of machine learning methods that have shown benefits over linear regression for Step 1. These methods are easy to implement and can handle crude, noisy, and inconsistent information. They alleviate redundancy and colinearity and do not rely on the assumption of multivariate normality, which makes them appealing in genomic studies. The function that relates covariates to observations is left unspecified, providing more flexibility in the model. Further, they can handle non-additive effects, which are of interest in genetic epidemiology.

These methods have some drawbacks: information gain may not completely remove colinearity in the system, and the wrapper procedure is too computationally demanding to use on a large number of records and SNPs. Hence, a second step is necessary. A large variety of methods have been previously proposed for Step 2. Here, we propose a novel method that reduces colinearity and shrinks less relevant SNP and SNP × SNP effects toward zero more strongly than other methods.

The Genetic Analysis Workshop 16 (GAW16) provides an opportunity for testing novel methods, such as those proposed above, on a well characterized dataset, to compare results and interpretations, and to discuss current problems in genetic analysis. The aim of this study was to identify additional disease susceptibility loci in the GAW16 RA data using a two-step approach: first, reducing the number of SNPs to be tested through machine learning algorithms; second, identifying interactions or epistatic effects between HLA and non-HLA SNPs using a Bayesian threshold LASSO model.

Methods

Data

Data from the North American Rheumatoid Arthritis Consortium (NARAC) provided by the GAW16 were used in the analyses. The initial batch consisted of 868 cases and 1,194 controls genotyped with the 500 k Illumina chip (545,080 SNPs). Further description of the initial data can be found in Plenge et al. [2]. Local institutional review board (IRB) approval was obtained.

Quality control

SNPs showing >2% missing genotypes were excluded (64,041). Monomorphic SNPs were omitted (1,920). SNPs with minor allele frequency <0.05 that did not show association with the disease through a Fisher's exact test (p < 0.001) were also omitted (47,190). Finally, SNPs that deviated from Hardy-Weinberg equilibrium (p < 0.0001) in the controls were discarded (11,835). The number of SNPs remaining in the analysis was 420,094.
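The bookkeeping behind these filters can be checked with a few lines of Python. This is only a sketch: the helper functions and the genotype encoding (copies of allele A as 0, 1, or 2, with `None` for a missing call) are assumptions made for illustration and are not taken from the study's own pipeline.

```python
# Counts reported in the quality-control paragraph.
INITIAL_SNPS = 545_080
REMOVED = {
    "missing_gt_2pct": 64_041,
    "monomorphic": 1_920,
    "low_maf_no_association": 47_190,
    "hwe_in_controls": 11_835,
}


def remaining_after_qc(initial, removed):
    """Subtract the exclusions of each filter from the initial SNP count."""
    remaining = initial
    for n_excluded in removed.values():
        remaining -= n_excluded
    return remaining


def missing_rate(genotypes):
    """Fraction of missing calls; None marks a missing genotype (assumed encoding)."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF from genotypes coded as copies of allele A (0, 1, 2); None = missing."""
    called = [g for g in genotypes if g is not None]
    freq_a = sum(called) / (2 * len(called))
    return min(freq_a, 1 - freq_a)


print(remaining_after_qc(INITIAL_SNPS, REMOVED))  # 420094
```

The four exclusion counts sum to 124,986, which leaves the 420,094 SNPs stated in the text.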

Stage 1: Preselection of significant SNPs using machine learning

Information gain or entropy theory

Informative SNPs were preselected for Stage 2, substantially reducing the feature space, using the information gain (entropy reduction) criterion [3]. The entropy of the probability distribution of a discrete random variable y is defined as:

$$H(y) = -\sum_{a \in A} \Pr(y = a) \log_2 \Pr(y = a)$$

where A is the set of all states that y can take, the logarithm is in base 2 to mimic bits of information, and we take 0 log 0 = 0. Here, y refers to case and control phenotypes. For each SNP, the data set was divided into three subsets corresponding to the three possible genotypes (aa, Aa, or AA). For each genotype k, there are $n_k^{+}$ individuals showing the disease and $n_k^{-}$ individuals with absence of disease. The information gain for each SNP s (s = 1, 2,...,p) is the difference in entropy of the probability distribution (the reduction of uncertainty) before and after observing genotypes at each SNP, calculated as:

$$IG(s) = H(y) - \sum_{k} \frac{n_k}{n} H(y \mid k)$$

where $n_k = n_k^{+} + n_k^{-}$, $n = \sum_k n_k$, and $H(y \mid k) = -\frac{n_k^{+}}{n_k} \log_2 \frac{n_k^{+}}{n_k} - \frac{n_k^{-}}{n_k} \log_2 \frac{n_k^{-}}{n_k}$ is the entropy of y among individuals with genotype k.

At this point, SNPs were divided into two groups based on whether they were in the HLA region or elsewhere in the genome (non-HLA SNPs). The HLA region was defined as starting at HLA-F (29,799,096 to 29,803,052) and extending to HLA-DPB1 (33,151,738 to 33,162,954). The 100 non-HLA SNPs with the highest information gain were selected to test for interactions with HLA SNPs in Stage 2. The HLA SNPs that passed to Stage 2 were selected as follows.
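As an illustration of the criterion, the following Python sketch computes the entropy of a case/control split and the information gain of a single SNP from a genotype-by-phenotype count table. The function names and the toy tables used to exercise them are hypothetical; only the 868 cases and 1,194 controls come from the text.

```python
from math import log2


def entropy(counts):
    """Shannon entropy in bits of a discrete distribution given as counts.

    Zero counts are skipped, which implements the convention 0 log 0 = 0.
    """
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * log2(p)
    return h


def information_gain(table):
    """Information gain of one SNP.

    `table` maps each genotype to a pair (n_cases, n_controls).
    The gain is the phenotype entropy before observing the genotype minus
    the genotype-frequency-weighted entropy within each genotype class.
    """
    n_cases = sum(cases for cases, _ in table.values())
    n_controls = sum(controls for _, controls in table.values())
    n = n_cases + n_controls
    h_before = entropy([n_cases, n_controls])
    h_after = sum(
        (cases + controls) / n * entropy([cases, controls])
        for cases, controls in table.values()
    )
    return h_before - h_after


# Phenotype entropy for the 868 cases and 1,194 controls in the data.
print(round(entropy([868, 1194]), 2))  # 0.98
```

A SNP whose genotypes are distributed identically in cases and controls has a gain of 0, and a SNP that separates the two groups perfectly has a gain equal to the phenotype entropy.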

Selection of independent SNPs in the HLA region on chromosome 6

The most relevant SNPs in this region were selected using the wrapper procedure [4]. This procedure aims to reduce redundancy and colinearity in a feature subset. The HLA SNPs with an information gain above the 99.65 percentile (approximately the top 1,500 SNPs) were considered as candidates and were included in this wrapper. This method searches the space of SNP combinations in the data to find an 'optimal' subset of SNPs that best classifies the phenotype outcome (binary: case or control), using an attribute evaluator and a search method. The attribute evaluator used was the naïve Bayesian classifier, with a bidirectional hill-climbing search method.

The naïve Bayesian evaluator can be described as follows. Given an individual with genotypes $k_1, \ldots, k_p$ at the p SNPs in the subset, the best prediction (classification decision) is the class value y (case or control) for which $\Pr(Y = y \mid K_1 = k_1, \ldots, K_p = k_p)$ is maximum. Applying Bayes' theorem gives

$$\Pr(Y = y \mid K_1 = k_1, \ldots, K_p = k_p) = \frac{\Pr(K_1 = k_1, \ldots, K_p = k_p \mid Y = y)\,\Pr(Y = y)}{\Pr(K_1 = k_1, \ldots, K_p = k_p)}$$

The prior probability of the phenotype, Pr(Y = y), can be estimated from the training data, and $\Pr(K_1 = k_1, \ldots, K_p = k_p)$ cancels out when the odds of class membership are calculated. Then, assuming that the genotypes are conditionally independent, the probability of the genotypes conditional on the observed phenotype can be estimated as:

$$\Pr(K_1 = k_1, \ldots, K_p = k_p \mid Y = y) = \prod_{j=1}^{p} \Pr(K_j = k_j \mid Y = y)$$

We chose a five-fold cross-validation scenario in which a different list of SNPs was generated in each fold. SNPs that appeared in three or more folds were therefore extracted and included in Stage 2. The wrapper approach was implemented using the weka software [5].
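The two ingredients described above, a naïve Bayesian classifier on genotypes and the rule that keeps SNPs selected in at least three of five folds, can be sketched in plain Python. This is not the weka implementation used in the study: the function names, the add-one smoothing (`alpha`), and the placeholder SNP labels are assumptions introduced for illustration, and the hill-climbing search itself is not shown.

```python
from collections import Counter
from math import log

GENOTYPES = ("aa", "Aa", "AA")


def fit_naive_bayes(X, y, alpha=1.0):
    """Estimate class priors and per-SNP genotype probabilities.

    X is a list of genotype tuples (one entry per SNP), y a list of 0/1 labels.
    alpha is add-one (Laplace) smoothing, an assumption not stated in the text.
    """
    classes = sorted(set(y))
    n_snps = len(X[0])
    prior = {c: y.count(c) / len(y) for c in classes}
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        denom = len(rows) + alpha * len(GENOTYPES)
        for j in range(n_snps):
            counts = Counter(x[j] for x in rows)
            for g in GENOTYPES:
                cond[(c, j, g)] = (counts[g] + alpha) / denom
    return prior, cond


def predict(model, x):
    """Return the class with the highest posterior (computed on the log scale)."""
    prior, cond = model
    scores = {
        c: log(p) + sum(log(cond[(c, j, g)]) for j, g in enumerate(x))
        for c, p in prior.items()
    }
    return max(scores, key=scores.get)


def consensus_snps(selected_per_fold, min_folds=3):
    """Keep SNPs that were selected in at least `min_folds` folds."""
    votes = Counter(s for fold in selected_per_fold for s in set(fold))
    return sorted(s for s, v in votes.items() if v >= min_folds)
```

With five lists of selected SNPs, `consensus_snps(folds)` returns those appearing in three or more of them, which mirrors the rule used to pass HLA SNPs to Stage 2.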

Stage 2: Selection of significant SNPs and interactions

To identify interactions among SNPs in the HLA region and between an HLA SNP and a SNP elsewhere in the genome, we applied a Bayesian version of the LASSO (least absolute shrinkage and selection operator) [6]. The LASSO constrains the sum of absolute values of the regression coefficients, leading some coefficient estimates to be exactly zero. This can be viewed as a form of feature selection, and it is also suitable for quantifying the effects. The binary nature of the outcome (control vs. case) was taken into account by applying a Bayesian threshold LASSO (BTL) model, a modification of the Bayesian LASSO [7], the performance of which is tested for the first time in this study.

The traditional threshold model [8] postulates that there is an underlying random variable called liability (λ) that follows a continuous distribution, and that the observed dichotomy results from the position of the liability with respect to a fixed threshold t:

$$y_i = \begin{cases} 1 \ (\text{case}) & \text{if } \lambda_i > t \\ 0 \ (\text{control}) & \text{otherwise} \end{cases}$$

Here, the liability was taken as the response variable. The BTL can be described as:

$$\boldsymbol{\lambda} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e} \qquad (1)$$

where $\boldsymbol{\lambda}$ is the vector of liabilities for all individuals, $\boldsymbol{\beta}$ are the LASSO estimates with their respective incidence matrix $\mathbf{X}$, and, as a modeling choice, $\mathbf{e}$ was considered the vector of residuals, independent and identically distributed as $N(0, \sigma_e^2)$. In accordance with tradition, we fixed the threshold to be 0 and the residual variance to be 1; alternate choices result in the same model.

Let $\mathbf{X}\boldsymbol{\beta} = \mathbf{X}_{M}\boldsymbol{\beta}_{M} + \mathbf{X}_{HH}\boldsymbol{\beta}_{HH} + \mathbf{X}_{HN}\boldsymbol{\beta}_{HN}$, where $\boldsymbol{\beta}_{M}$ is the vector of major effects, $\boldsymbol{\beta}_{HH}$ is the vector of interaction effects between HLA SNPs, and $\boldsymbol{\beta}_{HN}$ is the vector of interaction effects between HLA and non-HLA SNPs, with $\mathbf{X}_{M}$, $\mathbf{X}_{HH}$, and $\mathbf{X}_{HN}$ being the corresponding incidence matrices. These incidence matrices were constructed such that each major effect for SNP j was codified as $x_j$ = (-1, 0, or 1) for aa, Aa, and AA, respectively.

Interactions were codified as follows: for each SNP j we defined two covariates, $z_{j1}$ and $z_{j2}$, with $z_{j1} = 1$ if the genotype was coded as 0 (0, otherwise), and $z_{j2} = 1$ if the genotype was coded as 1 (0, otherwise). Then, for each SNP j × SNP k interaction, we set four covariates, $z_{jm} z_{kn}$, for m, n equal to 1 or 2. Eq. (1) for individual i can be written as:

$$\lambda_i = \sum_{j} x_{ij}\,\beta_{M,j} + \sum_{j<k \,\in\, \text{HLA}} \sum_{m=1}^{2}\sum_{n=1}^{2} z_{ijm} z_{ikn}\,\beta_{HH,jkmn} + \sum_{j \,\in\, \text{HLA}} \sum_{k \,\notin\, \text{HLA}} \sum_{m=1}^{2}\sum_{n=1}^{2} z_{ijm} z_{ikn}\,\beta_{HN,jkmn} + e_i$$

In a fully Bayesian context, the LASSO estimates (β) can be interpreted as posterior mode estimates when the regression parameters have independent and identical double-exponential priors [6]. Park and Casella [7] proposed using a conditional Laplace prior specification for the LASSO estimates of the form:

$$p(\boldsymbol{\beta} \mid \sigma_e^2, \phi) = \prod_{j} \frac{\phi}{2\sqrt{\sigma_e^2}} \exp\left(-\frac{\phi\,|\beta_j|}{\sqrt{\sigma_e^2}}\right)$$

where $\phi$ is the LASSO regularization parameter (written λ in [7]; $\phi$ is used here because λ denotes the liability). Samples from the posterior distributions of those estimates were drawn using the Gibbs sampling algorithm described in Park and Casella [7], with a chain length of 100,000 samples, discarding the first 50,000 as burn-in after checking convergence.
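The coding scheme described above can be made concrete with a short Python sketch that builds one individual's row of the incidence matrix: the -1/0/1 main-effect code, the two indicator covariates per SNP, and the four product covariates per SNP pair. The function names and the column ordering are assumptions for illustration, and the sketch emits a main-effect column for every SNP passed in, which is a simplification rather than a statement about which main effects the study included.

```python
from itertools import combinations, product

# Main-effect code for genotypes aa, Aa, AA, as described in the text.
MAIN_CODE = {"aa": -1, "Aa": 0, "AA": 1}


def main_effect(genotype):
    """Main-effect covariate: -1, 0, or 1 for aa, Aa, AA."""
    return MAIN_CODE[genotype]


def indicators(genotype):
    """Two indicator covariates: first is 1 if the code is 0, second is 1 if the code is 1."""
    code = MAIN_CODE[genotype]
    return (int(code == 0), int(code == 1))


def interaction_covariates(g_j, g_k):
    """Four products of indicators, ordered (1,1), (1,2), (2,1), (2,2)."""
    z_j, z_k = indicators(g_j), indicators(g_k)
    return [a * b for a, b in product(z_j, z_k)]


def design_row(hla, non_hla):
    """One individual's row: main effects, HLA x HLA, then HLA x non-HLA columns."""
    row = [main_effect(g) for g in list(hla) + list(non_hla)]
    for j, k in combinations(range(len(hla)), 2):
        row += interaction_covariates(hla[j], hla[k])
    for g_h, g_n in product(hla, non_hla):
        row += interaction_covariates(g_h, g_n)
    return row


# Toy individual with two HLA SNPs and one non-HLA SNP (hypothetical genotypes).
print(design_row(["Aa", "AA"], ["aa"]))
```

Under this coding an individual with genotype aa at either SNP of a pair contributes zeros to all four interaction columns for that pair, so aa acts as the reference class for the interaction terms.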

Results

Information gain was calculated for each SNP across all chromosomes. The total entropy of the data was 0.98, which is the maximum information gain that a feature (i.e., SNP) could provide. The highest information gain was found in the HLA region on chromosome 6, as expected, with a maximum value of 0.19 for SNP rs2395175. Thirteen other SNPs in this region had an information gain higher than 0.10. The highest information gain outside of the HLA region was 0.017 (rs2476601 on chromosome 1). Within the HLA region, 472 out of the 1,323 SNPs had an information gain above the 99.65 percentile. The wrapper procedure selected 6 out of the 472 SNPs. Therefore, 2,560 covariates (100 main effects, $4 \times \binom{6}{2} = 60$ HLA-HLA interactions, and 4 × 6 × 100 = 2,400 HLA-non-HLA interactions) were introduced in the BTL model.

As expected, the posterior means of a large number of effects were shrunk to zero. The main effects or epistatic basis functions with at least 80% of the posterior distribution either higher or lower than zero are shown in Figure 1. Among those, the LASSO included the main effects of the two HLA SNPs with the highest information gain (rs2395175 and rs660895). The interaction with the largest effect was that between SNPs rs10484560 (HLA region) and rs2476601 (chromosome 1). These SNPs belong to genes that were previously reported as part of one of the most important interactions for RA [9]. Further, 4 out of the 21 non-HLA SNPs shown in Figure 1 were in genes that had been previously related to RA, such as rs3181096 or rs10514911 [10,11].
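Two pieces of arithmetic in this paragraph are easy to reproduce: the covariate count and the rule used to decide which effects are reported. The Python sketch below assumes that posterior draws for one coefficient are available as a plain list of numbers; the function names are hypothetical and the draws used to exercise them would be toy values, not output from the study.

```python
from math import comb


def n_covariates(n_main, n_hla, n_non_hla, per_pair=4):
    """Main effects plus four covariates per HLA-HLA and per HLA-non-HLA pair."""
    hla_hla = per_pair * comb(n_hla, 2)
    hla_non_hla = per_pair * n_hla * n_non_hla
    return n_main + hla_hla + hla_non_hla


def sign_probability(samples):
    """Largest share of posterior draws lying on one side of zero."""
    n = len(samples)
    above = sum(s > 0 for s in samples) / n
    below = sum(s < 0 for s in samples) / n
    return max(above, below)


def is_reported(samples, cutoff=0.80):
    """True when at least `cutoff` of the posterior mass is above or below zero."""
    return sign_probability(samples) >= cutoff


print(n_covariates(n_main=100, n_hla=6, n_non_hla=100))  # 2560
```

The count works out as 100 + 60 + 2,400 = 2,560, matching the number of covariates stated for the BTL model.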

Figure 1

Major effects and interaction basis functions detected by the Bayesian threshold LASSO model. Allele or interaction alleles are specified. The allele for the HLA SNPs is specified first in the interactions.

Conclusion

A pre-screening stage seems necessary in genome-wide association studies to mitigate the large p, small n problem. The machine learning approach (information gain + wrapper) used in this study detected the most important known region associated with RA and reduced the number of SNPs in both the HLA region and across the genome. In a second stage, the BTL model selected covariates of major and epistatic effects strongly associated with RA, some of them already known. The major effects of the two HLA SNPs with the highest information gain appeared in the top 27 covariates, showing their importance for the liability to RA. Shi et al. [12] used a LASSO model on the simulated data from the GAW15 Problem 3. However, they used a different pre-screening method and a non-Bayesian version of the LASSO. The Bayesian counterpart provides a measure of the reliability of the estimates. Because the data used in the GAW16 RA problem are real, the accuracy of the proposed approach cannot be tested immediately. However, the results of this study can be compared with results generated by other methods applied to the same data, and with past and future analyses. Further, SNPs found in this study with no previously known function might act as markers of candidate genes in future research. Demonstrating the benefits of these methods over others widely used in the field is a challenge for the future.

List of abbreviations used

BTL: Bayesian threshold LASSO; GAW16: Genetic Analysis Workshop 16; GWAS: Genome-wide association studies; IRB: Institutional review board; LASSO: Least absolute shrinkage and selection operator; NARAC: North American Rheumatoid Arthritis Consortium; RA: Rheumatoid arthritis; SNP: Single-nucleotide polymorphism.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

OG-R participated in the design of the study and the statistical analyses and drafted the manuscript. EL participated in the design of the study and the statistical analyses, and helped revise the manuscript. ATV participated in the design of the study and helped revise the manuscript. CDE obtained IRB approval for the study, gained access to the dataset, participated in the design of the study, and helped revise the manuscript. KWB participated in the design of the study and helped revise the manuscript. All authors read and approved the final manuscript.

7 in total

1. Could inflammatory arthritis be triggered by progenitor cells in the joints? (Review)

Authors: C Jorgensen; D Noel; G Gross
Journal: Ann Rheum Dis Date: 2002-01

2. An Analysis of Variability in Number of Digits in an Inbred Strain of Guinea Pigs.

Authors: S Wright
Journal: Genetics Date: 1934-11

3. CD4+ CD7- CD28- T cells are expanded in rheumatoid arthritis and are characterized by autoreactivity.

Authors: D Schmidt; J J Goronzy; C M Weyand
Journal: J Clin Invest Date: 1996-05-01

4. Gene-gene and gene-environment interactions involving HLA-DRB1, PTPN22, and smoking in two subsets of rheumatoid arthritis.

Authors: Henrik Kallberg; Leonid Padyukov; Robert M Plenge; Johan Ronnelid; Peter K Gregersen; Annette H M van der Helm-van Mil; Rene E M Toes; Tom W Huizinga; Lars Klareskog; Lars Alfredsson
Journal: Am J Hum Genet Date: 2007-04-02

5. TRAF1-C5 as a risk locus for rheumatoid arthritis--a genomewide study.

Authors: Robert M Plenge; Mark Seielstad; Leonid Padyukov; Annette T Lee; Elaine F Remmers; Bo Ding; Anthony Liew; Houman Khalili; Alamelu Chandrasekaran; Leela R L Davies; Wentian Li; Adrian K S Tan; Carine Bonnard; Rick T H Ong; Anbupalam Thalamuthu; Sven Pettersson; Chunyu Liu; Chao Tian; Wei V Chen; John P Carulli; Evan M Beckman; David Altshuler; Lars Alfredsson; Lindsey A Criswell; Christopher I Amos; Michael F Seldin; Daniel L Kastner; Lars Klareskog; Peter K Gregersen
Journal: N Engl J Med Date: 2007-09-05

6. Detecting disease-causing genes by LASSO-Patternsearch algorithm.

Authors: Weiliang Shi; Kristine E Lee; Grace Wahba
Journal: BMC Proc Date: 2007-12-18

7. Evaluating gene x gene and gene x smoking interaction in rheumatoid arthritis using candidate genes in GAW15.

Authors: Ling Mei; Xiaohui Li; Kai Yang; Jinrui Cui; Belle Fang; Xiuqing Guo; Jerome I Rotter
Journal: BMC Proc Date: 2007-12-18

