
Pathway-based analysis of a genome-wide case-control association study of rheumatoid arthritis.

Joseph Beyene, Pingzhao Hu, Jemila S Hamid, Elena Parkhomenko, Andrew D Paterson, David Tritchler.

Abstract

Evaluation of the association between single-nucleotide polymorphisms (SNPs) and disease outcomes is widely used to identify genetic risk factors for complex diseases. Although this analysis paradigm has made significant progress in many genetic studies, many challenges remain, such as the requirement of a large sample size to achieve adequate power. Here we use rheumatoid arthritis (RA) as an example and explore a new analysis strategy: pathway-based analysis to search for related genes and SNPs contributing to the disease. We first propose the application of measures of explained variation to quantify the predictive ability of a given SNP. We then use gene set enrichment analysis to evaluate enrichment of specific pathways, where pathways are considered enriched if they consist of genes that are associated with the phenotype of interest above and beyond what is expected by chance. The results are also compared with score tests for association that adjust for population stratification. Our study identified some significantly enriched pathways, such as "cell adhesion molecules," which are known to play a key role in RA. Our results showed that pathway-based analysis may identify other biologically interesting loci (e.g., rs1018361) related to RA: the gene (CTLA4) closest to this marker has previously been shown to be associated with RA and the gene is in the significant pathways we identified, even though the marker has not reached genome-wide significance in univariate single-marker analysis.


Year: 2009 PMID: 20017994 PMCID: PMC2795901 DOI: 10.1186/1753-6561-3-s7-s128

Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561

Background

Rheumatoid arthritis (RA) is the most common systemic autoimmune disease characterized by chronic, destructive, and debilitating inflammation of joints and extra-articular tissues [1]. The disease affects 1% of the adult population worldwide and is significantly more prevalent in women than in men (3 to 1 ratio). It is believed that the contribution of individual genes to complex diseases such as RA is generally modest and difficult to detect because sample sizes are inadequate relative to the large number of variables being tested for association. Currently, the literature on genome-wide association studies is focused on testing single-nucleotide polymorphisms (SNPs) for association using a standard test statistic (for example, the chi-square test statistic). The difficulty in detecting genes of modest effect may be one potential explanation for the inconsistent results often seen in genetic association studies [2]. In a recent genome-wide study of RA with a relatively large sample size, Plenge et al. [3] identified a genome-wide significant association signal on chromosome 9 close to TRAF1 and C5, in addition to confirming known genes related to RA (e.g., HLA-DRB1 in the major histocompatibility complex (MHC) region and PTPN22). In this paper, we propose a pathway-based approach to study sets of genes. The motivation for this is the belief that the mechanism of a complex disease such as RA may not be described fully by looking at gene-by-gene comparisons alone. Gene set analysis has been widely used in studies involving gene expression data; however, there are only a few such studies in a genome-wide association setting. We also propose two measures of explained variation that are appropriate for binary outcomes and summarize these measurements over all SNPs in and around a particular gene.

Materials and methods

Data description

The provided data is a subset of the Stage 1 genome-wide association study of RA previously analyzed by Plenge et al. [3]. After removing duplicated and contaminated samples, there were 868 cases from the North American Rheumatoid Arthritis Consortium (NARAC) and 1194 controls on which 545,080 SNPs had been genotyped.

Quality control

The SNPs and samples fulfilling the following quality control requirements were included in our analysis: 1) call rate: SNP and sample call rates > 95%; 2) Hardy-Weinberg equilibrium: false-discovery rate (FDR) level in testing for Hardy-Weinberg equilibrium among controls > 0.2; 3) minor allele count > 10 copies, which is equivalent to a minor allele frequency of 0.24% (i.e., 10/(2 × 2062) = 0.24%). A total of 490,000 SNPs and 2062 samples met our quality control criteria and were used for further analysis.
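The call-rate and minor-allele-count filters can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, genotypes are assumed to be coded 0/1/2 with `None` for a missing call, and the Hardy-Weinberg filter is omitted.

```python
def minor_allele_count(genotypes):
    """Count copies of the less frequent allele; genotypes are 0/1/2 or None (missing)."""
    called = [g for g in genotypes if g is not None]
    b_copies = sum(called)                 # copies of allele B
    a_copies = 2 * len(called) - b_copies  # copies of allele A
    return min(a_copies, b_copies)


def call_rate(genotypes):
    """Fraction of non-missing genotype calls."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def passes_qc(genotypes, min_call_rate=0.95, min_mac=10):
    """Apply the call-rate and minor-allele-count filters (HWE filter omitted)."""
    return (call_rate(genotypes) > min_call_rate
            and minor_allele_count(genotypes) > min_mac)


# Minor allele frequency implied by 10 copies among 2062 diploid samples.
maf_threshold = 10 / (2 * 2062)
print(f"{maf_threshold:.2%}")  # prints 0.24%

# Toy SNP: 100 samples, 12 heterozygotes, 2 missing calls.
toy_snp = [1] * 12 + [0] * 86 + [None] * 2
print(passes_qc(toy_snp))  # prints True
```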

Adjustment for population stratification

It has been indicated that this data set is affected by population stratification, and this may lead to misleading association results if not taken into account. We followed an approach proposed by Price et al. [4] to adjust for population stratification. The steps involved can be summarized as follows: i) Using the SNP set remaining after applying our quality control criteria, we first removed SNPs in the MHC region (29-34 Mb on chromosome 6) and on the sex chromosomes. ii) We pruned the remaining SNPs based on linkage disequilibrium using PLINK [5]. iii) This left around 220,000 SNPs, for which we calculated a genome-wide pairwise identity-by-state similarity matrix, followed by multidimensional scaling analysis on the identity-by-state-based distance matrix. iv) We selected 15 significant principal coordinates (PCs) with p-value < 0.001. v) These 15 significant PCs were used in subsequent association analyses. It should be noted that these steps were applied only to select the 15 significant PCs. In the subsequent association analyses, we used all SNPs and samples that passed the quality control criteria.
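Step iii can be illustrated with a small sketch. It assumes genotypes coded 0/1/2 with no missing calls, defines the distance as one minus the mean identity-by-state sharing, and extracts only the leading coordinate of classical multidimensional scaling by power iteration. The function names and toy data are hypothetical, and this is not PLINK's implementation.

```python
import math


def ibs_distance(g1, g2):
    """1 - mean identity-by-state sharing across SNPs (genotypes coded 0/1/2)."""
    shared = sum(2 - abs(a - b) for a, b in zip(g1, g2))
    return 1.0 - shared / (2.0 * len(g1))


def first_principal_coordinate(dist, n_iter=500):
    """Leading coordinate of classical MDS via power iteration (illustrative only)."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    # Double-centred Gram matrix B = -1/2 * J D^2 J.
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(n_iter):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
    return [x * math.sqrt(eigenvalue) for x in v]


# Toy data: two individuals from each of two hypothetical subpopulations.
samples = [
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [2, 2, 2, 2, 2],
    [2, 2, 2, 2, 1],
]
dist = [[ibs_distance(a, b) for b in samples] for a in samples]
coords = first_principal_coordinate(dist)
print([round(c, 3) for c in coords])
```

In this toy example the sign of the first coordinate separates the two groups; in practice several coordinates would be extracted with a full eigendecomposition and then tested for significance.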

Univariate SNP association test

We performed score tests for association between RA status and genotype, adjusting for possible population stratification using the significant principal components. We used the 'egscore' function in the GenABEL R package [6], in which the population stratification method proposed by Price et al. [4] is implemented. We obtained the corresponding p-value and chi-square value for each SNP.
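The idea behind a stratification-adjusted test in the style of Price et al. [4] can be sketched as follows: both genotype and phenotype are residualized on the ancestry covariates, and the statistic is (n − K − 1) times their squared correlation, where K is the number of covariates. This is a hedged illustration with hypothetical function names and toy data; it is not a reimplementation of GenABEL's 'egscore' and its internals may differ.

```python
def center(v):
    """Subtract the mean from every element."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residualize(v, basis):
    """Remove from v its projection on each vector of an orthogonal basis."""
    out = list(v)
    for b in basis:
        coef = dot(out, b) / dot(b, b)
        out = [x - coef * y for x, y in zip(out, b)]
    return out


def orthogonalize(pcs):
    """Centre the covariates and make them mutually orthogonal (Gram-Schmidt)."""
    basis = []
    for pc in pcs:
        basis.append(residualize(center(pc), basis))
    return basis


def adjusted_trend_statistic(genotype, phenotype, pcs):
    """(n - K - 1) * squared correlation of ancestry-adjusted genotype and phenotype."""
    n, k = len(genotype), len(pcs)
    basis = orthogonalize(pcs)
    g = residualize(center(genotype), basis)
    y = residualize(center(phenotype), basis)
    r2 = dot(g, y) ** 2 / (dot(g, g) * dot(y, y))
    return (n - k - 1) * r2


# Toy data with one hypothetical principal coordinate.
genotype = [0, 0, 1, 1, 2, 2, 1, 0]
phenotype = [0, 0, 0, 1, 1, 1, 1, 0]
pc1 = [-1.0, -0.5, -0.5, 0.0, 0.5, 1.0, 0.0, 0.5]
stat = adjusted_trend_statistic(genotype, phenotype, [pc1])
print(round(stat, 3))
```

Under the null hypothesis such a statistic is compared with a chi-square distribution with one degree of freedom.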

Measures of explained variation

For a given SNP, let $(y_i, x_i)$, $i = 1, \ldots, n$, denote its $n$ samples/observations, where $y_i$ denotes the outcome of observation $i$ (in this study, $y_i = 1$ if the subject is a case and 0 otherwise) and $x_i$ denotes the genotype (coded as 0, 1, 2, corresponding to AA, AB, and BB) of the $i$th observation. We consider two measures of explained variation for binary outcomes [7]. These measures are based on direct and indirect estimates of predictive accuracy. The direct measure is based on the residuals from the fit and the indirect index is related to a standard measure of information. Let the estimate from a logistic model without a covariate be $\hat{p}$ and the estimates from a model with covariate $x_i$ be $\hat{p}_i$. Define $D = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{p}|$ and $D_x = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{p}_i|$, and the explained variation based on the direct estimates becomes $(D - D_x)/D$. Similarly, let $\tilde{D} = 2\hat{p}(1 - \hat{p})$ and $\tilde{D}_x = \frac{1}{n}\sum_{i=1}^{n} 2\hat{p}_i(1 - \hat{p}_i)$, then the explained variation based on indirect estimates is calculated as $(\tilde{D} - \tilde{D}_x)/\tilde{D}$. We denote these two approaches as DirEV and IndirEV, respectively. In the models that include principal components to adjust for population stratification, the explained variation attributable to each SNP was obtained.
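A minimal sketch of the two measures follows. It assumes the absolute-distance form described by Schemper [7]: the direct estimate averages |y − p̂| over observations and the indirect estimate averages 2p̂(1 − p̂), each computed with and without the genotype in a logistic model. The function names and toy data are hypothetical, the logistic fit is a plain Newton-Raphson routine, and no principal components are included.

```python
import math


def fit_logistic(x, y, n_iter=50):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; return fitted probabilities."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Score vector.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of the 2x2 information matrix.
        w = [pi * (1 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]


def explained_variation(x, y):
    """Return (DirEV, IndirEV) for genotype x and binary outcome y."""
    n = len(y)
    p_null = sum(y) / n          # fitted probability of the covariate-free model
    p_fit = fit_logistic(x, y)   # fitted probabilities with the genotype
    d = sum(abs(yi - p_null) for yi in y) / n
    d_x = sum(abs(yi - pi) for yi, pi in zip(y, p_fit)) / n
    d_tilde = 2 * p_null * (1 - p_null)
    d_tilde_x = sum(2 * pi * (1 - pi) for pi in p_fit) / n
    return (d - d_x) / d, (d_tilde - d_tilde_x) / d_tilde


# Toy SNP: 10 subjects per genotype with 3, 5, and 7 cases respectively.
x = [0] * 10 + [1] * 10 + [2] * 10
y = [1] * 3 + [0] * 7 + [1] * 5 + [0] * 5 + [1] * 7 + [0] * 3
dir_ev, indir_ev = explained_variation(x, y)
print(round(dir_ev, 3), round(indir_ev, 3))  # prints 0.107 0.107
```

The two values coincide here only because the toy data are fitted exactly by the logistic model; in general they differ.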

Pathway-based analysis

We summarize the steps for our method as follows (adapted from Wang et al. [8]):

1. Obtain test/summary statistic

For each SNP, we computed one of the following: a measure of explained variation (DirEV or IndirEV) or the univariate SNP association test statistic (χ²), in each case adjusting for population stratification by including the 15 PCs.

2. Map SNPs to genes

We obtained the nearest gene name for each SNP from the Illumina SNP annotation file (HumanHap650Yv3_Gene_Annotation.txt, available from https://icom.illumina.com/) based on physical distance. The 490,000 SNPs were mapped to 16,500 genes.

3. Aggregate test/summary statistic

For each gene, we obtained an aggregate summary measure or test statistic based on individual values (from Step 1 above) for SNPs assigned to this gene (Step 2). Here we used the maximum summary measure or maximum test statistic over all SNPs mapped to the gene.

4. Evaluate gene set enrichment

The aggregated summary measures or test statistics were used to evaluate the significance of predefined gene sets/pathways [9] based on the gene set enrichment analysis (GSEA) method [10]. Here we used the c2 curated gene sets, which are obtained from online pathway databases, citations in PubMed [9], and knowledge of domain experts, and which included 1900 gene sets collected from canonical pathways, chemical and genetic perturbations, BioCarta pathways, GenMAPP [11], and KEGG [12]. A Kolmogorov-Smirnov non-parametric rank statistic was computed using the GseaPreranked tool included in the GSEA software. Gene sets were ranked by their FDR q-value, where the q-value of a test measures the proportion of false positives incurred (FDR) when that particular test is called significant. The empirical null distribution was obtained using 1000 random permutations. We defined a given gene set as significantly enriched in the data if it had an FDR q-value of less than 0.05.
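The steps above can be sketched in a few lines. This is an illustrative approximation under stated assumptions, not the GSEA software: the enrichment score is a weighted Kolmogorov-Smirnov-like running sum over genes ranked by their aggregated statistic, the null distribution comes from random gene sets of the same size, and FDR q-values are not computed. All names (`aggregate_max`, `enrichment_score`, `permutation_p_value`) and the toy SNP and gene labels are hypothetical.

```python
import random


def aggregate_max(snp_stats, snp_to_gene):
    """Step 3: keep the maximum SNP statistic among SNPs mapped to each gene."""
    gene_stat = {}
    for snp, stat in snp_stats.items():
        gene = snp_to_gene.get(snp)
        if gene is not None:
            gene_stat[gene] = max(stat, gene_stat.get(gene, float("-inf")))
    return gene_stat


def enrichment_score(gene_stat, gene_set, weight=1.0):
    """Step 4: maximum deviation from zero of a weighted running sum."""
    ranked = sorted(gene_stat, key=gene_stat.get, reverse=True)
    hits = [g for g in ranked if g in gene_set]
    n_miss = len(ranked) - len(hits)
    hit_total = sum(abs(gene_stat[g]) ** weight for g in hits)
    running, best = 0.0, 0.0
    for g in ranked:
        if g in gene_set:
            running += abs(gene_stat[g]) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(gene_stat, gene_set, n_perm=1000, seed=0):
    """Compare the observed score with scores of random gene sets of equal size."""
    rng = random.Random(seed)
    genes = sorted(gene_stat)
    size = sum(g in gene_set for g in genes)
    observed = enrichment_score(gene_stat, gene_set)
    null = [enrichment_score(gene_stat, set(rng.sample(genes, size)))
            for _ in range(n_perm)]
    return sum(es >= observed for es in null) / n_perm


# Toy data: six SNPs mapped to four genes, one two-gene pathway.
snp_stats = {"snp1": 12.0, "snp2": 3.0, "snp3": 9.0,
             "snp4": 1.0, "snp5": 0.5, "snp6": 2.0}
snp_to_gene = {"snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B",
               "snp4": "GENE_C", "snp5": "GENE_D", "snp6": "GENE_D"}
gene_stat = aggregate_max(snp_stats, snp_to_gene)
pathway = {"GENE_A", "GENE_B"}
es = enrichment_score(gene_stat, pathway)
print(round(es, 6))  # prints 1.0
print(permutation_p_value(gene_stat, pathway, n_perm=200))
```

Taking the maximum favors genes with many SNPs, which is one reason an empirical null distribution is used rather than a parametric one.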

Results

For the 1900 gene sets that we evaluated, all three methods (Chi-Sq, DirEV, and IndirEV) identified 10 gene sets that have an FDR q-value of less than 0.05. Table 1 shows these ten significantly enriched gene sets/pathways identified by GSEA for each of the three methods. As seen from Table 1, the ten significantly enriched pathways for the three summary methods are the same. The pathways identified using DirEV and IndirEV have similar FDR q-values and ranks. The FDR q-value results based on the χ² tests are slightly different from those based on DirEV and IndirEV.

Table 1

Significant pathways identified by three test statistic methods

		χ²		DirEV		IndirEV

Gene sets/pathways	No. genes^a	Rank^b	FDR q-value	Rank	FDR q-value	Rank	FDR q-value
Hsa04612 antigen processing and presentation	57	1	<1*10^-8	1	<1*10^-8	1	<1*10^-8
Hsa04940 Type I diabetes mellitus	40	2	<1*10^-8	2	<1*10^-8	2	<1*10^-8
Wieland hepatitis B-induced	87	3	<1*10^-8	3	0.004	3	0.001
Ctla4 pathway	18	4	6*10^-5	6	0.006	4	0.003
Ami pathway	22	5	0.01	7	0.006	10	0.015
Csk pathway	22	6	0.01	10	0.011	7	0.012
Sana Ifng endothelial up	67	7	0.011	8	0.006	9	0.013
Th1th2 pathway	17	8	0.011	9	0.009	8	0.013
Inflam pathway	28	9	0.012	4	0.005	6	0.006
Hsa04514 cell adhesion molecule	115	10	0.02	5	0.005	5	0.004

^a Number of genes found in our data. (The actual number of genes defined in the set is larger than these numbers.)

^b The rank is based on the FDR q-value for all 1,900 tested gene sets in each method.

As discussed before, a few genes are known to be associated with RA: PTPN22 on chromosome 1, HLA-DRB1 in the MHC region on chromosome 6, and TRAF1/C5 on chromosome 9. However, PTPN22 and TRAF1/C5 were not in any of the ten significantly enriched gene sets. Therefore, we further evaluated the distributions of genes and SNPs in the MHC region compared with the rest of the genome (those not in the MHC region) in these ten gene sets/pathways (Table 2). To do this, we defined the MHC region as 29-34 Mb on chromosome 6. The region is defined based on the results shown in Supplementary Table 1B of Plenge et al. [3], that is, SNPs that have p-value < 0.0001. We found 561 SNPs in the defined MHC region with p-value < 0.0001, and 227 genes in the region using BioMart [13].

Table 2

Distribution of selected genes and SNPs in each of the two regions and ten gene sets/pathways

	MHC region (Chr 6, 29-34 Mb)^a			Rest of genome

Gene sets/pathways	No. genes	%^b	No. SNPs	No. genes	%^b	No. SNPs
Hsa04612 antigen processing and presentation	24	42.1	5	33	57.9	0
Hsa04940 Type I diabetes mellitus	22	55.0	6	18	45.0	0
Wieland hepatitis B-induced	15	17.2	4	72	82.8	0
Ctla4 pathway	2	11.1	0	16	88.9	1
Ami pathway	2	9.1	0	20	90.9	0
Csk pathway	2	9.1	0	20	90.9	0
Sana Ifng endothelial up	11	16.4	2	56	83.6	0
Th1th2 pathway	2	11.8	0	15	88.2	0
Inflam pathway	4	14.3	2	24	85.7	0
Hsa04514 cell adhesion molecule	20	17.4	5	95	82.6	1

^a This region covers approximately 0.2% of the whole genome.

^b The percentage of genes in a given gene set that were found in the region.
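The percentages in Table 2 follow from classifying each gene by position. The sketch below assumes a gene is represented by a chromosome label and a single base-pair coordinate; the helper names and the placeholder gene labels are hypothetical, and the toy input simply reproduces the counts of the first table row.

```python
MHC_CHROM, MHC_START, MHC_END = "6", 29_000_000, 34_000_000


def in_mhc(chrom, position):
    """True if a position lies in the MHC region as defined in the text."""
    return chrom == MHC_CHROM and MHC_START <= position <= MHC_END


def mhc_share(gene_positions):
    """Return (genes in MHC, genes elsewhere, percentage in MHC) for a gene set."""
    inside = sum(in_mhc(c, p) for c, p in gene_positions.values())
    total = len(gene_positions)
    return inside, total - inside, round(100.0 * inside / total, 1)


# Placeholder gene set: 24 genes inside the region and 33 outside it.
gene_positions = {f"mhc_gene_{i}": ("6", 30_000_000 + i) for i in range(24)}
gene_positions.update({f"other_gene_{i}": ("1", 1_000_000 + i) for i in range(33)})
print(mhc_share(gene_positions))  # prints (24, 33, 42.1)
```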

Table 2 shows the distribution of these genes, and of SNPs with p-value < 0.0001, in the defined MHC region and in the rest of the genome for the 10 gene sets/pathways. It can be seen that the MHC region is enriched for genes from the following three gene sets: "Hsa04612 Antigen Processing and Presentation," "Hsa04940 Type I Diabetes Mellitus," and "Hsa04514 Cell Adhesion Molecule." There were 22 HLA-related genes in common among these three gene sets. Of these, 20 were found in the defined MHC region. Five SNPs with p-value < 0.0001 were present in these three gene sets in the MHC region. Interestingly, one SNP (rs1018361) on chromosome 2, with a p-value of 2.6 × 10^-5, was present in one of the three gene sets ("Hsa04514 Cell Adhesion Molecule") in the rest of the genome (the region excluding the MHC region), while the other two gene sets had no significant SNPs in the rest of the genome.

The significant SNP on chromosome 2 (rs1018361) may be of interest for further investigation because the gene closest to this SNP (CTLA4) shares a pathway with genes in the MHC region. Moreover, a previous association study showed that RA is associated with CTLA4 [2]. Similarly, the cell adhesion molecules (proteins) may have a more important role in regulating RA development than the other tested pathways. As shown in Table 2, both genes and SNPs with p-value < 0.0001 in the pathway (cell adhesion molecule) are found in both the MHC region and the rest of the genome, while other significant pathways have SNPs with p-value < 0.0001 only in the MHC region or are not found in either the MHC region or the rest of the genome.

Conclusion

Overall, using a pathway-based analysis, we found that some of the significant gene sets/pathways were enriched for genes in the well-known MHC region associated with RA. However, some of the loci that have been reported to be related to RA, such as PTPN22 and TRAF1/C5, were not found in any of the gene sets/pathways we identified as significant. Our results also showed that pathway-based analysis may identify other RA-related loci (e.g., rs1018361), because the gene (CTLA4) closest to this SNP has previously been shown to be associated with RA and the significant pathways identified here contained this gene.

List of abbreviations used

DirEV: Explained variation based on direct estimates; FDR: False-discovery rate; GSEA: Gene set enrichment analysis; IndirEV: Explained variation based on indirect estimates; MHC: Major histocompatibility complex; PC: Principal coordinate; RA: Rheumatoid arthritis; SNP: Single-nucleotide polymorphism.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JB initiated the study and proposed the pathway-based analysis ideas and methods for genetic association analysis. PH performed all of the data analysis and drafted the paper with JB, JSH, and ADP. JSH, ADP, EP, and DT contributed by providing critical comments which helped with refining the methods, analyses, and interpretation of the data. All authors read and approved the final manuscript.

References

1. Predictive accuracy and explained variation.

Authors: Michael Schemper
Journal: Stat Med Date: 2003-07-30 Impact factor: 2.373

2. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

3. GenABEL: an R library for genome-wide association analysis.

Authors: Yurii S Aulchenko; Stephan Ripke; Aaron Isaacs; Cornelia M van Duijn
Journal: Bioinformatics Date: 2007-03-23 Impact factor: 6.937

4. Pathway-based approaches for analysis of genomewide association studies.

Authors: Kai Wang; Mingyao Li; Maja Bucan
Journal: Am J Hum Genet Date: 2007-12 Impact factor: 11.025

5. PLINK: a tool set for whole-genome association and population-based linkage analyses.

Authors: Shaun Purcell; Benjamin Neale; Kathe Todd-Brown; Lori Thomas; Manuel A R Ferreira; David Bender; Julian Maller; Pamela Sklar; Paul I W de Bakker; Mark J Daly; Pak C Sham
Journal: Am J Hum Genet Date: 2007-07-25 Impact factor: 11.025

6. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

7. Replication of putative candidate-gene associations with rheumatoid arthritis in >4,000 samples from North America and Sweden: association of susceptibility with PTPN22, CTLA4, and PADI4.

Authors: Robert M Plenge; Leonid Padyukov; Elaine F Remmers; Shaun Purcell; Annette T Lee; Elizabeth W Karlson; Frederick Wolfe; Daniel L Kastner; Lars Alfredsson; David Altshuler; Peter K Gregersen; Lars Klareskog; John D Rioux
Journal: Am J Hum Genet Date: 2005-11-01 Impact factor: 11.025

8. TRAF1-C5 as a risk locus for rheumatoid arthritis--a genomewide study.

Authors: Robert M Plenge; Mark Seielstad; Leonid Padyukov; Annette T Lee; Elaine F Remmers; Bo Ding; Anthony Liew; Houman Khalili; Alamelu Chandrasekaran; Leela R L Davies; Wentian Li; Adrian K S Tan; Carine Bonnard; Rick T H Ong; Anbupalam Thalamuthu; Sven Pettersson; Chunyu Liu; Chao Tian; Wei V Chen; John P Carulli; Evan M Beckman; David Altshuler; Lars Alfredsson; Lindsey A Criswell; Christopher I Amos; Michael F Seldin; Daniel L Kastner; Lars Klareskog; Peter K Gregersen
Journal: N Engl J Med Date: 2007-09-05 Impact factor: 91.245

9. A large-scale rheumatoid arthritis genetic study identifies association at chromosome 9q33.2.

Authors: Monica Chang; Charles M Rowland; Veronica E Garcia; Steven J Schrodi; Joseph J Catanese; Annette H M van der Helm-van Mil; Kristin G Ardlie; Christopher I Amos; Lindsey A Criswell; Daniel L Kastner; Peter K Gregersen; Fina A S Kurreeman; Rene E M Toes; Tom W J Huizinga; Michael F Seldin; Ann B Begovich
Journal: PLoS Genet Date: 2008-06-27 Impact factor: 5.917
