
Detecting susceptibility genes for rheumatoid arthritis based on a novel sliding-window approach.

Qiuying Sha1, Rui Tang, Shuanglin Zhang.   

Abstract

With the recent rapid improvements in high-throughput genotyping techniques, researchers face the challenging task of large-scale genetic association analysis, especially at the whole-genome level, for which no optimal solution exists. In this study, we propose a new approach for genetic association analysis based on a variable-sized sliding-window framework. This approach employs principal component analysis to find the optimal window size. By using the bisection algorithm to search for the window size, the proposed method avoids the cost of an exhaustive search. It is more efficient and effective than currently available approaches. We conduct a genome-wide association study of the Genetic Analysis Workshop 16 (GAW16) Problem 1 data using the proposed method. Our method successfully identified several susceptibility genes that have been reported by other researchers, as well as additional candidate genes for follow-up studies.


Year:  2009        PMID: 20018003      PMCID: PMC2795910          DOI: 10.1186/1753-6561-3-s7-s14

Source DB:  PubMed          Journal:  BMC Proc        ISSN: 1753-6561


Background

With the availability of large-scale genotyping technologies, the cost of genome-wide analyses has been greatly reduced and a boom of large-scale genetic association studies is underway. A sliding-window approach, in which several neighboring single-nucleotide polymorphisms (SNPs) are included together in a "window frame", is a popular strategy for multiple allelic association analysis. During the test, the window slides across the genomic region under study in a stepwise fashion [1-3]. Variable-sized sliding-window approaches, in which the window size is determined by the underlying linkage disequilibrium (LD) pattern, perform more efficiently in large-scale data analysis. The difficulty with variable-sized sliding-window approaches is how to search for the optimal window size in a way that is both computationally practical and statistically powerful enough to detect both common and rare risk factors. In this report, based on the variable-sized sliding-window frame, we adapt the optimal window size to the local LD pattern by employing a principal components (PC) approach. The PC approach is a linear projection method that defines a lower-dimensional space and captures the maximum information of the initial data [4]. Each optimal window size is defined by the first few PCs (e.g., 3 or 5) that explain a major fraction (e.g., 90% or 95%) of the total information in the data.

Data

In our study, we used the Genetic Analysis Workshop (GAW) 16 Problem 1 data, which is the initial batch of the whole-genome association data for the North American Rheumatoid Arthritis Consortium (NARAC). Data were available for 868 cases and 1194 controls. The data cover 22 chromosomes with 545,080 SNP-genotype fields from the Illumina 550k chip. To handle missing values, any subject with a missing genotype in a given window was excluded from the analysis of that window only. Thus, a subject excluded from one window is still included in the study in other windows. In this way, we retained as much information as we could.
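The per-window exclusion rule can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: missing genotypes are assumed to be coded as NaN, SNP indices are 0-based, and the function name `complete_cases` is hypothetical rather than taken from the study.

```python
import numpy as np

def complete_cases(genotypes, start, size):
    """Return the genotypes of one window restricted to complete subjects.

    genotypes: (subjects x SNPs) array, missing values coded as NaN (assumption).
    start, size: 0-based first SNP of the window and number of SNPs in it.
    """
    window = genotypes[:, start:start + size]
    # Keep a subject only if none of its genotypes in this window is missing.
    keep = ~np.isnan(window).any(axis=1)
    return window[keep], keep

# Toy data: 3 subjects, 4 SNPs, two missing genotypes.
G = np.array([[0.0, 1.0, np.nan, 2.0],
              [1.0, 1.0, 0.0, 0.0],
              [2.0, np.nan, 1.0, 1.0]])

w01, keep01 = complete_cases(G, 0, 2)  # window over SNPs 0-1
w23, keep23 = complete_cases(G, 2, 2)  # window over SNPs 2-3
```

In this toy example the third subject is dropped from the first window and the first subject from the second window, but each of them still contributes to the other window.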

Methods

Optimal window size defined by PC analysis

We consider a study with a total of $M$ individuals, with genotype information at $N$ SNP loci denoted by the vector $G_i = (g_{i1}, g_{i2}, \cdots, g_{iN})$ for the $i$th individual ($i = 1, 2, \cdots, M$). We code the genotype $g_{ij}$ as 0, 1, or 2 for the number of minor (less frequent) alleles at SNP $j$ ($j = 1, 2, \cdots, N$) of individual $i$. Let $y_i$ denote the trait value of individual $i$. In the sliding-window frame, a window denoted as $W_b^l$ is a set of $l$ neighboring SNPs $\{b, b + 1, b + 2, \cdots, b + l - 1\}$. A variable-sized sliding window that begins with SNP $b$, denoted as $\Omega_b$, is a collection of windows $W_b^l$ with $l$ ranging from $s$ to $\Gamma$, where $s$ and $\Gamma$ are the smallest and largest window sizes. In this study, we apply the PC method to define the optimal window size. The basic idea is to find the largest window size for which a proportion $c_0$ of the total information can be explained by the first $k$ PCs, where $c_0$ and $k$ are predefined criteria. We define this largest window size as the optimal window size. We start with a window of size $l = s = k + 1$, so that the window length exceeds the number of important PCs. Let $\Sigma_b^l$ denote the sample variance-covariance matrix of the genotypic numerical codes in window $W_b^l$, and let $\lambda_j$ denote the $j$th largest eigenvalue of $\Sigma_b^l$. Thus, in window $W_b^l$, the proportion of the total variance in the original data explained by the $j$th PC is $\lambda_j / \sum_{m=1}^{l} \lambda_m$. Let $C_b^l = \sum_{j=1}^{k} \lambda_j / \sum_{m=1}^{l} \lambda_m$ denote the proportion of the total variability explained by the first $k$ PCs. For each sliding window, the optimal window size is the largest window size among the windows in $\Omega_b$ for which the first $k$ PCs explain a proportion $c_0$ of the total variability.
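The proportion of variability explained by the first k PCs can be computed directly from the eigenvalues of the sample covariance matrix of a window. The following NumPy sketch shows one way to do this; the function name `explained_proportion` and the convention of returning 1.0 for a window with zero total variance are assumptions made for illustration, not part of the original method description.

```python
import numpy as np

def explained_proportion(window, k):
    """Share of total variance captured by the k largest eigenvalues.

    window: (subjects x SNPs) array of genotype codes 0/1/2 with no missing values.
    """
    cov = np.atleast_2d(np.cov(window, rowvar=False))  # sample covariance of the SNPs
    eig = np.linalg.eigvalsh(cov)[::-1]                # eigenvalues, largest first
    eig = np.clip(eig, 0.0, None)                      # remove tiny negative round-off
    total = eig.sum()
    if total == 0.0:
        return 1.0  # assumption: a window with no variation is trivially "explained"
    return float(eig[:k].sum() / total)

# Two uncorrelated SNPs with equal variance: the first PC explains half.
independent = np.array([[0, 0], [2, 0], [0, 2], [2, 2]], dtype=float)
# Two identical SNPs (complete LD): the first PC explains everything.
duplicated = np.array([[0, 0], [1, 1], [2, 2], [1, 1]], dtype=float)

p_independent = explained_proportion(independent, 1)
p_duplicated = explained_proportion(duplicated, 1)
```

The two toy windows mark the extremes: with no LD each PC carries an equal share of the variance, whereas under complete LD a single PC suffices, which is why strong local LD tends to allow larger windows.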

Bisection method for searching the optimal window size and computational consideration

An exhaustive search for the optimal window size can be computationally demanding, so we propose using the bisection method. Let $s$ and $\Gamma$ denote the predefined smallest and largest window sizes among a set of windows $\Omega_b$, where $b$ is the starting SNP of the set of windows. The bisection search for the optimal window size in $\Omega_b$ consists of the following steps:

Step 1: Let $l$ be the midpoint of $s$ and $\Gamma$, that is, $l = \lfloor (s + \Gamma)/2 \rfloor$, where $\lfloor a \rfloor$ is the largest integer that is less than or equal to $a$.

Step 2: Conduct PC analysis within the window $W_b^l$, the window that begins at SNP $b$ and has size $l$.

Step 3: Calculate $C_b^l$ (the proportion of the total variability explained by the first $k$ PCs) for the window $W_b^l$. If $C_b^l > c_0$, we let $s = l$, that is, we update the smallest window size $s$. Otherwise, we let $\Gamma = l$, that is, we update the largest window size $\Gamma$.

Step 4: Repeat Steps 1 to 3 until $\Gamma - s \le 1$. In the window $W_b^{\Gamma}$, if the proportion of the total variability explained by the first $k$ PCs is greater than $c_0$, the optimal window size is $\Gamma$; otherwise, the optimal window size is $s$.

So far we have not discussed how to choose the starting SNP $b$. For the first window, $b = 1$. To choose $b$ for other windows, the following three methods are typically used. For the $i$th ($i > 1$) window, choose 1) $b = i$; 2) $b = n_{i-1}$, where $n_{i-1}$ is the middle SNP of the $(i-1)$th window; or 3) $b = m_{i-1} + 1$, where $m_{i-1}$ is the last SNP of the $(i-1)$th window. In this article, we use the first method to choose the starting SNP $b$.

With the bisection method, our proposed variable-length sliding-window method is computationally efficient. Consider a set of windows $\Omega_b$ with smallest window size $s$, largest window size $\Gamma$, and starting SNP $b$. The computational complexity of finding the optimal window size in $\Omega_b$ using the bisection algorithm is $\Gamma^3 \log_2(\Gamma - s)$. If we have $N$ SNPs in total, the computational complexity of finding all the optimal window sizes is $N\Gamma^3 \log_2(\Gamma - s)$.
In this article, we use $\Gamma = 35$ and $s = 4$. Suppose $N = 500,000$ in a genome-wide association study. Then, $N\Gamma^3 \log_2(\Gamma - s)$
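The four bisection steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and `proportion` stands for any callable that returns the share of variability explained by the first k PCs for a window of size l starting at the current SNP. As with any bisection, the search implicitly assumes that this share tends to decrease as the window grows. The second helper simply evaluates the complexity expression quoted above.

```python
import math

def optimal_window_size(proportion, s, gamma, c0):
    """Bisection search for the window size, following Steps 1 to 4.

    proportion: callable l -> share of variability explained by the first k PCs
                for the window of size l (hypothetical helper supplied by the caller).
    """
    lo, hi = s, gamma
    while hi - lo > 1:                # Step 4: stop once the bounds are adjacent
        mid = (lo + hi) // 2          # Step 1: integer midpoint
        if proportion(mid) > c0:      # Steps 2-3: window of size mid still passes
            lo = mid                  # raise the smallest size
        else:
            hi = mid                  # lower the largest size
    # Final check on the larger of the two remaining candidates.
    return hi if proportion(hi) > c0 else lo

def operation_count(n_snps, s, gamma):
    """Evaluate N * Gamma^3 * log2(Gamma - s)."""
    return n_snps * gamma ** 3 * math.log2(gamma - s)

# Toy criterion: windows of up to 12 SNPs pass the threshold, larger ones do not.
toy = lambda l: 1.0 if l <= 12 else 0.5
best = optimal_window_size(toy, 4, 35, 0.9)

cost = operation_count(500_000, 4, 35)
```

With s = 4 and Γ = 35 the loop runs at most five times per starting SNP, compared with up to 32 PC analyses for an exhaustive scan over all sizes. For N = 500,000 the expression evaluates to roughly 1.06 × 10^11.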

Score test

After finding the optimal window size for each sliding window, we use the score test statistic based on a logistic model [5] to test for association within each sliding window. Consider $W_b^l$, a window beginning at SNP $b$ with optimal window size $l$; take $b = 1$ as an example, for windows that start at the first SNP. Let $x_i = (x_{i1}, \cdots, x_{ik})'$ denote the first $k$ PCs of the $i$th individual in this window, where $i = 1, 2, \cdots, M$. Suppose that the trait and the $k$ PCs are related through a logistic model; then the score test statistic is given by $T^2 = U'V^{-1}U$, where $U = \sum_{i=1}^{M} (y_i - \bar{y})(x_i - \bar{x})$, $V = \frac{1}{M} \sum_{i=1}^{M} (y_i - \bar{y})^2 \sum_{i=1}^{M} (x_i - \bar{x})(x_i - \bar{x})'$, $\bar{y}$ and $\bar{x}$ are the sample means of $y_i$ and $x_i$, and $M$ is the sample size. The statistic $T^2$ asymptotically follows a $\chi^2$ distribution with $k$ degrees of freedom. We select significant windows after adjusting for multiple testing using a Bonferroni correction.
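A statistic of the form U'V^{-1}U can be computed in a few lines of NumPy. The sketch below assumes the standard score test for a logistic model with an intercept, in which U is the sum of products of centered trait values and centered PC scores and V is the mean squared trait deviation times the scatter matrix of the PC scores; the function name `score_statistic` is hypothetical.

```python
import numpy as np

def score_statistic(pcs, y):
    """Score statistic U' V^{-1} U for a 0/1 trait y and PC scores pcs.

    pcs: (subjects x k) array of the first k PC scores of one window.
    y:   length-M array of case/control status coded 1/0.
    """
    pcs = np.asarray(pcs, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(y)
    xc = pcs - pcs.mean(axis=0)        # centered PC scores
    yc = y - y.mean()                  # centered trait values
    u = xc.T @ yc                      # score vector U (length k)
    v = (yc @ yc / m) * (xc.T @ xc)    # variance estimate V (k x k)
    return float(u @ np.linalg.solve(v, u))

y = np.array([0, 0, 1, 1])
t_linked = score_statistic([[0], [0], [1], [1]], y)    # score tracks the trait exactly
t_unlinked = score_statistic([[0], [1], [0], [1]], y)  # score unrelated to the trait
```

For a single PC the statistic reduces to M times the squared sample correlation between the trait and the PC score, which gives a quick sanity check: the first toy call returns 4 (M = 4, correlation 1) and the second returns 0. A p-value would then be obtained from the upper tail of a chi-square distribution with k degrees of freedom.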

Result

We applied the proposed approach to the GAW16 Problem 1 data. In our application, we set $s = 4$, $\Gamma = 35$, $c_0 = 90\%$, and $k = 3$. The original dataset contained genotypes at 545,080 SNPs on chromosomes 1 to 22. Our analysis yielded 531,501 windows. The size of the windows varied from 4 to 29 SNPs, with a median window size of 7 (see Table 1 for the distribution of the window sizes). After Bonferroni correction, we found 1,155 significant windows. Because of the strong LD among SNPs, many of the significant windows overlapped with nearby windows. To report the results thoroughly, we combined each significant window with all windows overlapping it into one larger window. This left 76 significant larger windows. Because of space limitations, Table 2 reports only the top 30 windows after the combination. The windows are ordered by their most significant sub-windows (the original windows before combination). Our results match most of the genes reported in recent studies [6-11] and also identify additional rheumatoid arthritis (RA) susceptibility genes for follow-up studies.
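The two post-processing steps, Bonferroni adjustment and combining overlapping significant windows, can be sketched as follows. The sketch rests on assumptions not stated in the text: a family-wise level of 0.05 is used purely as an example, windows are represented as inclusive (first SNP, last SNP) index pairs, overlap is taken to mean sharing at least one SNP, and both function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests

def merge_windows(windows):
    """Merge windows that share at least one SNP into larger windows.

    windows: iterable of (first_snp, last_snp) index pairs, inclusive.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous merged window: extend it if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Example level 0.05 (an assumption) with the 531,501 windows reported above.
threshold = bonferroni_threshold(0.05, 531_501)

# Toy significant windows: three overlap in a chain, one stands alone.
regions = merge_windows([(1, 4), (3, 9), (20, 25), (9, 12)])
```

With these example settings the per-window threshold is about 9.4 × 10^-8, and the four toy windows collapse into two regions, mirroring how 1,155 significant windows can reduce to a much smaller number of combined regions.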
Table 1

The distribution of the window sizes based on 531,501 windows

Window size    Percentage of total windows
4              18.32
5              11.92
6              11.69
7              11.57
8              8.94
9              7.34
10             6.34
11             4.67
12             3.73
13-29          15.48
Table 2

Genetic and physical map locations of window regions identified using PC-sliding-window analysis based on the Bonferroni correction

Window ID  Chr^a  Physical location      Genes^b                                CRASG^c
1          6      30014670, 33187144     TNF, HLA-A HLA-B, HLA-C                TNF, HLA-A HLA-B, HLA-C
2          1      792429, 1101089        AGRN, C1orf159, ISG15, SAMD11
3          2      172768404, 172807000   DLX1, DLX2                             STAT4, ITGAV
4          12     46666298, 46718200     COL2A1, LOC728181 LOC728114
5          13     113656958, 113861908   FAM70B, RASA3
6          7      154133201, 154241160   PAXIP1, LOC202781
7          17     68283979, 68361160     SLC39A11
8          2      98261543, 98370780     VWA3B                                  IL1B
9          13     49336428, 49340230     KPNA3
10         1      2243956, 3359357       ARHGEF16, PRDM16
11         17     66647226, 66750860     Intergenic 17q24
12         16     67482002, 67660490     TMC07, FLJ12331
13         18     75300466, 75314140     NFATC1
14         13     74883232, 74941310     TBC1D4
15         9      123211883, 123248000   CRB2, MIRN601, DENND1A                 TRAF1/C5
16         20     57796484, 57832810
17         1      15181683, 151846600    ADAM15, EFNA4
18         20     35438689, 35501280     SRC, RPL7AL4
19         22     28164734, 28237820     RFPL1S, RFPL1, NEFH
20         11     64593946, 64661480     SAC3D1, NAALADL1 CDCA5, ZFPL1, ZHIT2
21         11     45207308, 45314100     SYT13, FLJ41423
22         8      20327035, 20435200
23         5      137614229, 137825000   GFRA3, CDC25C, FAM53C, JMJD1B
24         7      129525353, 129580000   CPA2, CPA4
25         3      134954925, 134966000   TF, SRPRB
26         9      104811123, 104815000   ABCA1
27         12     6924169, 6932652       ATN1, CL2orf57, PTPN6
28         10     10531181, 10542841     SH3PXD2A
29         11     3335218, 3524620       ZNF195, OR7Z12p
30         19     19106771, 19154190     TMEM16A1, MEF2B

^a Chr, chromosome.

^b We found the significant genes using the NCBI dbSNP database http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp.

^c The confirmed RA susceptibility genes (CRASG) are shown in the last column if they are within or near our significant region.


Discussion

As one of the most exhaustive search strategies in genome-wide association studies, sliding-window approaches have recently received increasing attention. Based on the variable-sized sliding-window frame, we adapt the optimal window size to the local LD pattern by employing the PC approach. We applied this novel sliding-window approach to the GAW16 RA data, successfully validated nine genes that have been reported by recent studies, and identified new candidate genes for follow-up studies. Our approach has several advantages. It provides a stable way to choose the window size with maximum information extraction, and it automatically balances degrees of freedom against the number of tests, which results in higher power to detect association. It is flexible enough to accommodate different association tests within the windows. The method is computationally efficient when applied to large-scale data compared with other variable-sized sliding-window methods. It requires only genotype data, so there is no need to run a computationally intensive phasing program to account for uncertain haplotype phases. Further work is needed to improve the proposed method, such as determining the optimal $c_0$ (the proportion of the total variability explained by the top $k$ PCs) and the initial window lengths in the bisection method.

Conclusion

In this study, we applied our novel genome-wide PC sliding-window approach to detect association between SNP windows and disease status using the GAW16 Problem 1 RA dataset. We validated nine genes that have been reported in the literature as associated with RA and discovered additional genes and non-gene regions for follow-up studies.

List of abbreviations used

GAW: Genetic Analysis Workshop; LD: Linkage disequilibrium; NARAC: North American Rheumatoid Arthritis Consortium; PC: Principal components; RA: Rheumatoid arthritis; SNP: Single-nucleotide polymorphism

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

QS participated in the design of the study and contributed to the manuscript preparation. RT performed the statistical analysis and wrote the draft of the manuscript. SZ contributed to the design of the study and to the manuscript preparation. All authors read and approved the final manuscript.
References (11 in total; 10 listed)

1.  Detecting disease associations due to linkage disequilibrium using haplotype tags: a class of tests and the determinants of statistical power.

Authors:  Juliet M Chapman; Jason D Cooper; John A Todd; David G Clayton
Journal:  Hum Hered       Date:  2003       Impact factor: 0.444

2.  Tests of association between quantitative traits and haplotypes in a reduced-dimensional space.

Authors:  Qiuying Sha; Jianping Dong; Renfang Jiang; Shuanglin Zhang
Journal:  Ann Hum Genet       Date:  2005-11       Impact factor: 1.670

3.  A sliding-window weighted linkage disequilibrium test.

Authors:  Hsin-Chou Yang; Chin-Yu Lin; Cathy S J Fann
Journal:  Genet Epidemiol       Date:  2006-09       Impact factor: 2.135

4.  Association mapping via regularized regression analysis of single-nucleotide-polymorphism haplotypes in variable-sized sliding windows.

Authors:  Yi Li; Wing-Kin Sung; Jian Jun Liu
Journal:  Am J Hum Genet       Date:  2007-02-19       Impact factor: 11.025

5.  Detecting haplotype effects in genomewide association studies.

Authors:  B E Huang; C I Amos; D Y Lin
Journal:  Genet Epidemiol       Date:  2007-12       Impact factor: 2.135

6.  TRAF1-C5 as a risk locus for rheumatoid arthritis--a genomewide study.

Authors:  Robert M Plenge; Mark Seielstad; Leonid Padyukov; Annette T Lee; Elaine F Remmers; Bo Ding; Anthony Liew; Houman Khalili; Alamelu Chandrasekaran; Leela R L Davies; Wentian Li; Adrian K S Tan; Carine Bonnard; Rick T H Ong; Anbupalam Thalamuthu; Sven Pettersson; Chunyu Liu; Chao Tian; Wei V Chen; John P Carulli; Evan M Beckman; David Altshuler; Lars Alfredsson; Lindsey A Criswell; Christopher I Amos; Michael F Seldin; Daniel L Kastner; Lars Klareskog; Peter K Gregersen
Journal:  N Engl J Med       Date:  2007-09-05       Impact factor: 91.245

7.  A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis.

Authors:  Ann B Begovich; Victoria E H Carlton; Lee A Honigberg; Steven J Schrodi; Anand P Chokkalingam; Heather C Alexander; Kristin G Ardlie; Qiqing Huang; Ashley M Smith; Jill M Spoerke; Marion T Conn; Monica Chang; Sheng-Yung P Chang; Randall K Saiki; Joseph J Catanese; Diane U Leong; Veronica E Garcia; Linda B McAllister; Douglas A Jeffery; Annette T Lee; Franak Batliwalla; Elaine Remmers; Lindsey A Criswell; Michael F Seldin; Daniel L Kastner; Christopher I Amos; John J Sninsky; Peter K Gregersen
Journal:  Am J Hum Genet       Date:  2004-06-18       Impact factor: 11.025

8.  Rheumatoid arthritis association at 6q23.

Authors:  Wendy Thomson; Anne Barton; Xiayi Ke; Steve Eyre; Anne Hinks; John Bowes; Rachelle Donn; Deborah Symmons; Samantha Hider; Ian N Bruce; Anthony G Wilson; Ioanna Marinou; Ann Morgan; Paul Emery; Angela Carter; Sophia Steer; Lynne Hocking; David M Reid; Paul Wordsworth; Pille Harrison; David Strachan; Jane Worthington
Journal:  Nat Genet       Date:  2007-11-04       Impact factor: 38.330

9.  The ITGAV rs3738919-C allele is associated with rheumatoid arthritis in the European Caucasian population: a family-based study.

Authors:  Laurent Jacq; Sophie Garnier; Philippe Dieudé; Laëtitia Michou; Céline Pierlot; Paola Migliorini; Alejandro Balsa; René Westhovens; Pilar Barrera; Helena Alves; Carlos Vaz; Manuela Fernandes; Dora Pascual-Salcedo; Stefano Bombardieri; Jan Dequeker; Timothy R Radstake; Piet Van Riel; Leo van de Putte; Antonio Lopes-Vaz; Elodie Glikmans; Sandra Barbet; Sandra Lasbleiz; Isabelle Lemaire; Patrick Quillet; Pascal Hilliquin; Vitor Hugo Teixeira; Elisabeth Petit-Teixeira; Hamdi Mbarek; Bernard Prum; Thomas Bardin; François Cornélis
Journal:  Arthritis Res Ther       Date:  2007       Impact factor: 5.156

10.  A candidate gene approach identifies the TRAF1/C5 region as a risk factor for rheumatoid arthritis.

Authors:  Fina A S Kurreeman; Leonid Padyukov; Rute B Marques; Steven J Schrodi; Maria Seddighzadeh; Gerrie Stoeken-Rijsbergen; Annette H M van der Helm-van Mil; Cornelia F Allaart; Willem Verduyn; Jeanine Houwing-Duistermaat; Lars Alfredsson; Ann B Begovich; Lars Klareskog; Tom W J Huizinga; Rene E M Toes
Journal:  PLoS Med       Date:  2007-09       Impact factor: 11.069
