Qianqian Peng1, Jinghua Zhao, Fuzhong Xue. 1. Department of Epidemiology and Health Statistics, School of Public Health, Shandong University, Jinan 250012, PR China.
Abstract
BACKGROUND: Genetic association studies are currently the primary vehicle for identifying and characterizing disease-predisposing variant(s), and usually involve multiple single-nucleotide polymorphisms (SNPs). However, SNP-wise association tests raise concerns over multiple testing. Haplotype-based methods have the advantage of being able to account for correlations between neighbouring SNPs, yet the assumption of Hardy-Weinberg equilibrium (HWE) and a potentially large number of degrees of freedom can harm their statistical power and robustness. Approaches based on principal component analysis (PCA) are preferable in this regard, but their performance varies with the method of extracting principal components (PCs). RESULTS: A PCA-based bootstrap confidence interval test (PCA-BCIT), which directly uses the PC scores to assess gene-disease association, was developed and evaluated for three ways of extracting PCs, i.e., from cases only (CAES), controls only (COES), and cases and controls combined (CES). Extraction of PCs with COES is preferred to that with CAES and CES. Performance of the test was examined via simulations as well as analyses of data on rheumatoid arthritis and heroin addiction; the test maintains the nominal level under the null hypothesis and shows performance comparable to that of the permutation test. CONCLUSIONS: PCA-BCIT is a valid and powerful method for assessing gene-disease association involving multiple SNPs.
Genetic association studies now customarily involve multiple SNPs in candidate genes or genomic regions and have a significant role in identifying and characterizing disease-predisposing variant(s). A critical challenge in their statistical analysis is how to make optimal use of all available information. Population-based case-control studies have been very popular[1] and typically involve contingency table tests of SNP-disease association[2]. Notably, the genotype-wise Armitage trend test does not require HWE and has power equivalent to that of its allele-wise counterpart under HWE[3,4]. A thorny issue with testing SNPs in linkage disequilibrium (LD) individually in such a setting is multiple testing; methods for multiple-testing adjustment that assume independence, such as Bonferroni's[5,6], are known to be conservative[7]. It is therefore necessary to seek alternative approaches which can utilize multiple SNPs simultaneously. The genotype-wise Armitage trend test is appealing since it is equivalent to the score test from logistic regression[8] of case-control status on the dosage of disease-predisposing alleles of a SNP. However, testing for the effects of multiple SNPs simultaneously via logistic regression does not cure the difficulties of multicollinearity and the curse of dimensionality[9]. Haplotype-based methods have many desirable properties[10] and could possibly alleviate the problem[11-14], but the assumption of HWE is usually required and a potentially large number of degrees of freedom is involved[7,11,15-18]. It has recently been proposed that PCA can be combined with the logistic regression test (LRT)[7,16,17] in a unified framework, so that PCA is conducted first to account for between-SNP correlations in a candidate region, then LRT is applied as a formal test for the association between PC scores (linear combinations of the original SNPs) and disease.
Since PCs are orthogonal, this approach avoids multicollinearity and at the same time is less computationally intensive than haplotype-based methods. Studies have shown that PCA-LRT is at least as powerful as genotype- and haplotype-based methods[7,16,17]. Nevertheless, the power of PCA-based approaches varies with the way in which PCs are extracted, e.g., from genotype correlation, LD, or other kinds of metrics[17], and in principle PCA can be employed in frameworks other than logistic regression[7,16,17]. Here we investigate ways of extracting PCs using the genotype correlation matrix from different types of samples in a case-control study, and present a new approach that tests for gene-disease association by direct use of PC scores in a PCA-based bootstrap confidence interval test (PCA-BCIT). We evaluated its performance via simulations and compared it with PCA-LRT and the permutation test using real data.
Methods
PCA
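Assume that p SNPs in a candidate region of interest have coded values $(X_1, X_2, \cdots, X_p)$ according to a given genetic model (e.g., additive model) whose correlation matrix is $C$. PCA solves the following equation,

$$C\, l_i = \lambda_i\, l_i,$$

where $l_i' l_i = 1$, $i = 1, 2, \cdots, p$, $\lambda_i$ is the $i$th eigenvalue of $C$, and $l_i = (l_{i1}, l_{i2}, \cdots, l_{ip})'$ are the loadings of the PCs. The score for an individual subject is

$$F_i = l_{i1} X_1 + l_{i2} X_2 + \cdots + l_{ip} X_p,$$

where $\mathrm{cov}(F_i, F_j) = 0$, $i \neq j$, and $\mathrm{var}(F_1) \geq \mathrm{var}(F_2) \geq \cdots \geq \mathrm{var}(F_p)$.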
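The eigen-decomposition above can be illustrated with a minimal sketch in Python with NumPy. It is not the authors' code: the function name `pca_from_correlation`, the toy genotype data, and the choice to compute scores on standardized genotypes (so that the score variances equal the eigenvalues of $C$) are all assumptions made for illustration.

```python
import numpy as np

def pca_from_correlation(genotypes):
    """PCA on the correlation matrix of an (n_subjects, p) genotype array.

    Returns eigenvalues (largest first), loadings (one column per PC) and
    PC scores computed on standardized genotypes (an assumption of this sketch).
    """
    X = np.asarray(genotypes, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each SNP
    C = np.corrcoef(X, rowvar=False)                   # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(C)               # solves C l = lambda l
    order = np.argsort(eigvals)[::-1]                  # largest variance first
    eigvals, loadings = eigvals[order], eigvecs[:, order]
    scores = Z @ loadings                              # F_i = l_i1 Z_1 + ... + l_ip Z_p
    return eigvals, loadings, scores

# Made-up additive codes (0/1/2) for 200 subjects and 5 correlated SNPs.
rng = np.random.default_rng(0)
shared = rng.binomial(2, 0.3, size=(200, 1))
own = rng.binomial(2, 0.3, size=(200, 5))
genotypes = np.where(rng.random((200, 5)) < 0.7, shared, own)

eigvals, loadings, scores = pca_from_correlation(genotypes)
print(np.round(eigvals / eigvals.sum(), 3))  # proportion of variance per PC
```

With this construction the loadings are orthonormal, the scores are mutually uncorrelated, and their variances are the ordered eigenvalues, which matches the properties stated for $F_1, \cdots, F_p$.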
Methods of extracting PCs
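Potentially, PCA can be conducted via four distinct extracting strategies (ES) using case-control data:

0. Calculate PC scores of individuals in cases and controls separately (SES).
1. Use cases only (CAES) to obtain loadings for calculation of PC scores for subjects in both cases and controls.
2. Use controls only (COES) to obtain the loadings for both groups.
3. Use combined cases and controls (CES) to obtain the loadings for both groups.

It is likely that in a case-control association study, loadings calculated separately from cases and controls have different connotations, and hence we only consider strategies 1-3 hereafter. More formally, let $(X_1, X_2, \cdots, X_p)$ and $(Y_1, Y_2, \cdots, Y_p)$ be p-dimensional vectors of SNPs at a given candidate region for cases and controls respectively; then we have:

Strategy 1 (CAES):

$$C^{X} l_i^{X} = \lambda_i^{X} l_i^{X},$$

where $C^{X}$ is the correlation matrix of $(X_1, X_2, \cdots, X_p)$, and $l_i^{X\prime} l_i^{X} = 1$, $i = 1, 2, \cdots, p$. The $i$th PC for cases is calculated by

$$F_i^{X} = l_{i1}^{X} X_1 + l_{i2}^{X} X_2 + \cdots + l_{ip}^{X} X_p,$$

and for controls by

$$F_i^{Y} = l_{i1}^{X} Y_1 + l_{i2}^{X} Y_2 + \cdots + l_{ip}^{X} Y_p.$$

Strategy 2 (COES):

$$C^{Y} l_i^{Y} = \lambda_i^{Y} l_i^{Y},$$

where $C^{Y}$ is the correlation matrix of $(Y_1, Y_2, \cdots, Y_p)$, and $l_i^{Y\prime} l_i^{Y} = 1$. The $i$th PC for controls is calculated by

$$F_i^{Y} = l_{i1}^{Y} Y_1 + l_{i2}^{Y} Y_2 + \cdots + l_{ip}^{Y} Y_p,$$

and for cases, the $i$th PC, $i = 1, 2, \cdots, p$, is calculated by

$$F_i^{X} = l_{i1}^{Y} X_1 + l_{i2}^{Y} X_2 + \cdots + l_{ip}^{Y} X_p.$$

Strategy 3 (CES):

$$C\, l_i = \lambda_i\, l_i,$$

where $C$ is the correlation matrix obtained from the pooled data of cases and controls, and $l_i' l_i = 1$. The $i$th PC of cases is calculated by

$$F_i^{X} = l_{i1} X_1 + l_{i2} X_2 + \cdots + l_{ip} X_p,$$

and the $i$th PC of controls by

$$F_i^{Y} = l_{i1} Y_1 + l_{i2} Y_2 + \cdots + l_{ip} Y_p.$$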
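As a hedged illustration of strategies 1-3, the following Python/NumPy sketch obtains loadings from a chosen reference sample and applies them to both groups. The names `loadings_from` and `scores_by_strategy` are hypothetical, the genotype data are made up, and standardizing both groups with the pooled mean and standard deviation is an assumption of this sketch rather than something specified in the text.

```python
import numpy as np

def loadings_from(sample):
    """Eigenvalues and loadings of the sample's correlation matrix, largest first."""
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(sample, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def scores_by_strategy(cases, controls, strategy):
    """PC scores for cases and controls under strategy 1 (CAES), 2 (COES) or 3 (CES)."""
    cases = np.asarray(cases, dtype=float)
    controls = np.asarray(controls, dtype=float)
    pooled = np.vstack([cases, controls])
    if strategy == 1:
        ref = cases          # loadings from cases only
    elif strategy == 2:
        ref = controls       # loadings from controls only
    elif strategy == 3:
        ref = pooled         # loadings from cases and controls combined
    else:
        raise ValueError("strategy must be 1, 2 or 3")
    # Assumption: both groups are standardized with the pooled mean and SD,
    # so only the loadings differ between strategies.
    mean, sd = pooled.mean(axis=0), pooled.std(axis=0, ddof=1)
    eigvals, L = loadings_from(ref)
    return ((cases - mean) / sd) @ L, ((controls - mean) / sd) @ L, eigvals

# Made-up genotypes: 120 cases and 150 controls at 4 SNPs.
rng = np.random.default_rng(0)
cases = rng.binomial(2, 0.4, size=(120, 4))
controls = rng.binomial(2, 0.3, size=(150, 4))

results = {s: scores_by_strategy(cases, controls, s) for s in (1, 2, 3)}
for s, (Fx, Fy, _) in results.items():
    print(s, round(Fx[:, 0].mean(), 3), round(Fy[:, 0].mean(), 3))  # mean PC1 per group
```

The same loadings are always applied to cases and controls within a strategy, so the group means of a PC remain directly comparable.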
PCA-BCIT
Given a sample of N cases and M controls with p-SNP genotypes $(X_1, X_2, \cdots, X_N)$ and $(Y_1, Y_2, \cdots, Y_M)$, where $X_i = (x_{i1}, x_{i2}, \cdots, x_{ip})$ for the $i$th case and $Y_i = (y_{i1}, y_{i2}, \cdots, y_{ip})$ for the $i$th control, a PCA-BCIT is carried out in three steps:
Step 1: Sampling
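Replicate samples of cases and controls, $(X_1^{(b)}, X_2^{(b)}, \cdots, X_N^{(b)})$ and $(Y_1^{(b)}, Y_2^{(b)}, \cdots, Y_M^{(b)})$, $b = 1, 2, \cdots, B$ ($B = 1000$), are obtained by sampling with replacement separately from the cases and the controls.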
Step 2: PCA
For each replicate sample obtained at Step 1, PCA is conducted and a given number of PCs is retained, with a threshold of 80% explained variance, for all three strategies[16]; the retained PC scores are denoted $F_k^{X(b)}$ for cases and $F_k^{Y(b)}$ for controls.
Step 3: PCA-BCIT
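3a) For each replicate $b$, the mean of the $k$th PC in cases is calculated by

$$\bar{F}_k^{X(b)} = \frac{1}{N} \sum_{i=1}^{N} F_{ik}^{X(b)},$$

and that of the $k$th PC in controls is calculated by

$$\bar{F}_k^{Y(b)} = \frac{1}{M} \sum_{i=1}^{M} F_{ik}^{Y(b)},$$

where $F_{ik}^{X(b)}$ and $F_{ik}^{Y(b)}$ are the scores of the $k$th PC for the $i$th case and the $i$th control in replicate $b$.

3b) Given confidence level $(1 - \alpha)$, the confidence interval of $\bar{F}_k^{X}$ is estimated by the percentile method, with form

$$\left( \bar{F}_{k,(\alpha/2)}^{X},\ \bar{F}_{k,(1-\alpha/2)}^{X} \right),$$

where $\bar{F}_{k,(\alpha/2)}^{X}$ is the $100(\alpha/2)$th percentile of $\bar{F}_k^{X(b)}$, $b = 1, 2, \cdots, B$, and $\bar{F}_{k,(1-\alpha/2)}^{X}$ is the $100(1-\alpha/2)$th percentile. The confidence interval of $\bar{F}_k^{Y}$ is estimated by

$$\left( \bar{F}_{k,(\alpha/2)}^{Y},\ \bar{F}_{k,(1-\alpha/2)}^{Y} \right),$$

where $\bar{F}_{k,(\alpha/2)}^{Y}$ is the $100(\alpha/2)$th percentile of $\bar{F}_k^{Y(b)}$, and $\bar{F}_{k,(1-\alpha/2)}^{Y}$ is the $100(1-\alpha/2)$th percentile.

3c) The confidence intervals of cases and controls are compared. The null hypothesis is rejected if the two intervals do not overlap, that is, if $\bar{F}_k^{X}$ and $\bar{F}_k^{Y}$ are statistically different[19], indicating that the candidate region is significantly associated with disease at level α. Otherwise, the candidate region is not significantly associated with disease at level α.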
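The three steps above can be sketched as follows in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`sorted_pca`, `reference`, `pca_bcit`, `toy_genotypes`) are hypothetical, genotypes are standardized once with the pooled mean and standard deviation, the number of retained PCs is fixed from the original sample using the 80% threshold, and the sign of each replicate's loadings is aligned with the original loadings because eigenvector signs are arbitrary. The sketch reports a decision per retained PC; how decisions on several PCs are combined is left open.

```python
import numpy as np

def sorted_pca(sample):
    """Eigenvalues and loadings of the correlation matrix, largest first."""
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(sample, rowvar=False))
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def reference(cases, controls, strategy):
    """Sample that supplies the loadings: 1 = CAES, 2 = COES, 3 = CES."""
    if strategy == 1:
        return cases
    if strategy == 2:
        return controls
    if strategy == 3:
        return np.vstack([cases, controls])
    raise ValueError("strategy must be 1, 2 or 3")

def pca_bcit(cases, controls, strategy=2, B=1000, alpha=0.05,
             explained=0.80, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.vstack([cases, controls]).astype(float)
    mean, sd = pooled.mean(axis=0), pooled.std(axis=0, ddof=1)
    X, Y = (cases - mean) / sd, (controls - mean) / sd

    # Assumption: the number of retained PCs (K) and the reference signs of
    # the loadings are fixed from the original, non-resampled data.
    eigvals0, L0 = sorted_pca(reference(X, Y, strategy))
    cum = np.cumsum(eigvals0) / eigvals0.sum()
    K = min(int(np.searchsorted(cum, explained)) + 1, len(eigvals0))
    L0 = L0[:, :K]

    mean_x, mean_y = np.empty((B, K)), np.empty((B, K))
    for b in range(B):
        # Step 1: resample cases and controls separately, with replacement.
        Xb = X[rng.integers(0, len(X), len(X))]
        Yb = Y[rng.integers(0, len(Y), len(Y))]
        # Step 2: PCA on the replicate; flip signs to match the reference.
        Lb = sorted_pca(reference(Xb, Yb, strategy))[1][:, :K]
        Lb = Lb * np.where(np.sum(Lb * L0, axis=0) < 0, -1.0, 1.0)
        # Step 3a: mean PC score per group.
        mean_x[b] = (Xb @ Lb).mean(axis=0)
        mean_y[b] = (Yb @ Lb).mean(axis=0)

    # Step 3b: percentile confidence intervals, one column per PC.
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    ci_x = np.percentile(mean_x, q, axis=0)
    ci_y = np.percentile(mean_y, q, axis=0)
    # Step 3c: reject for a PC when the two intervals do not overlap.
    reject = (ci_x[0] > ci_y[1]) | (ci_y[0] > ci_x[1])
    return ci_x, ci_y, reject

def toy_genotypes(maf, n, p, rng):
    """Made-up correlated 0/1/2 genotypes (illustration only)."""
    shared = rng.binomial(2, maf, size=(n, 1))
    own = rng.binomial(2, maf, size=(n, p))
    return np.where(rng.random((n, p)) < 0.8, shared, own)

rng = np.random.default_rng(1)
cases = toy_genotypes(0.5, 300, 5, rng)
controls = toy_genotypes(0.2, 300, 5, rng)
ci_cases, ci_controls, reject = pca_bcit(cases, controls, strategy=2, B=200)
print(reject)  # one decision per retained PC
```

The toy cases and controls differ strongly in allele frequency, so the intervals for the first PC are expected to be well separated. A bootstrap replicate in which a SNP is monomorphic would give an undefined correlation; the sketch does not guard against that case.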
Simulation studies
We examine the performance of PCA-BCIT through simulations with data from the North American Rheumatoid Arthritis (RA) Consortium (NARAC) (868 cases and 1194 controls)[20], taking advantage of the fact that the association between protein tyrosine phosphatase non-receptor type 22 (PTPN22) and the development of RA has been established[21-24]. Nine SNPs have been selected from the PTPN22 region (114157960-114215857), and most of the SNPs are within the same LD block (Figure 1). Females are more predisposed (73.85% of cases) and are used in our simulation to ensure homogeneity. The corresponding steps for the simulation are as follows.
Figure 1
LD among the nine PTPN22 SNPs: rs971173, rs1217390, rs878129, rs11811771, rs11102703, rs7545038, rs1503832, rs12127377, rs11485101. The triangle marks a single LD block within this region: (rs878129, rs11811771, rs11102703, rs7545038, rs1503832, rs12127377, rs11485101).
Step 1: Sampling
The observed genotype frequencies in the study sample are taken to be their true frequencies in populations of infinite size. Replicate samples of cases and controls of a given size (N, N = 100, 200, ⋯, 1000) are generated, whose estimated genotype frequencies are expected to be close to the true population frequencies while both the allele frequencies and the LD structure are maintained. Under the null hypothesis, replicate cases and controls are both sampled with replacement from the controls. Under the alternative hypothesis, replicate cases and controls are sampled with replacement from the cases and controls respectively.
Step 2: PCA-BCITing
For each replicate sample, PCA-BCITs are conducted through the three strategies of extracting PCs as outlined above on association between PC scores and disease (RA).
Step 3: Evaluating performance of PCA-BCITs
Repeat steps 1 and 2 K (K = 1000) times under both the null and alternative hypotheses, and obtain the frequencies (P) of rejecting the null hypothesis at level α (α = 0.05).
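The simulation loop can be sketched in Python with NumPy as below. It is a simplified illustration rather than the procedure actually run on the NARAC data: the source genotypes are made up, only the first PC with control-based loadings (strategy 2) is tested, both groups are standardized with their pooled mean and standard deviation, and the numbers of replicates (K = 20, B = 100) are reduced to keep the example fast. All function names are hypothetical.

```python
import numpy as np

def pc1_loading(sample):
    """Loading vector of the first PC of the sample's correlation matrix."""
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(sample, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

def bcit_rejects(cases, controls, rng, B=100, alpha=0.05):
    """PC1-only PCA-BCIT with control-based loadings (a simplification)."""
    pooled = np.vstack([cases, controls]).astype(float)
    mean, sd = pooled.mean(axis=0), pooled.std(axis=0, ddof=1)
    X, Y = (cases - mean) / sd, (controls - mean) / sd
    ref = pc1_loading(Y)
    mx, my = np.empty(B), np.empty(B)
    for b in range(B):
        Xb = X[rng.integers(0, len(X), len(X))]
        Yb = Y[rng.integers(0, len(Y), len(Y))]
        l = pc1_loading(Yb)
        if l @ ref < 0:            # eigenvector signs are arbitrary; align them
            l = -l
        mx[b], my[b] = (Xb @ l).mean(), (Yb @ l).mean()
    q = [100 * alpha / 2, 100 * (1 - alpha / 2)]
    lo_x, hi_x = np.percentile(mx, q)
    lo_y, hi_y = np.percentile(my, q)
    return bool(lo_x > hi_y or lo_y > hi_x)   # intervals do not overlap

def rejection_rate(src_cases, src_controls, n, null, K=20, seed=0):
    """Frequency of rejection over K replicate case-control samples of size n."""
    rng = np.random.default_rng(seed)
    case_source = src_controls if null else src_cases   # null: cases drawn from controls
    hits = 0
    for _ in range(K):
        cases = case_source[rng.integers(0, len(case_source), n)]
        controls = src_controls[rng.integers(0, len(src_controls), n)]
        hits += bcit_rejects(cases, controls, rng)
    return hits / K

def toy_genotypes(maf, n, p, rng):
    """Made-up correlated 0/1/2 genotypes standing in for the study sample."""
    shared = rng.binomial(2, maf, size=(n, 1))
    own = rng.binomial(2, maf, size=(n, p))
    return np.where(rng.random((n, p)) < 0.8, shared, own)

rng = np.random.default_rng(2)
src_cases = toy_genotypes(0.5, 1000, 5, rng)
src_controls = toy_genotypes(0.2, 1000, 5, rng)
type1 = rejection_rate(src_cases, src_controls, n=200, null=True)
power = rejection_rate(src_cases, src_controls, n=200, null=False)
print(type1, power)
```

The first value estimates the type I error rate and the second the power for the made-up effect size; with K = 20 both estimates are coarse, and the values in Table 1 would require K = 1000 and B = 1000 on the real genotypes.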
Applications
PCA-BCITs are applied to both the NARAC data on PTPN22 in 1493 females (641 cases and 852 controls) described above and a data set containing nine SNPs near the μ-opioid receptor gene (OPRM1) in Han Chinese from Shanghai (91 cases and 245 controls) with the endophenotype of heroin-induced positive responses on first use[25]. There are two LD blocks in the region of the OPRM1 gene (Figure 2).
Figure 2
LD among the nine OPRM1 SNPs: rs1799971, rs510769, rs696522, rs1381376, rs3778151, rs2075572, rs533586, rs550014, rs658156. The triangles mark LD block 1 (rs696522, rs1381376, rs3778151) and LD block 2 (rs550014, rs658156).
Results
Simulation study
The performance of PCA-BCIT is shown in Table 1 for the three strategies over a range of sample sizes. Strategies 2 and 3 both have type I error rates approaching the nominal level (α = 0.05), but those from strategy 1 deviate heavily. When the sample size is larger than 800, the power of PCA-BCIT is above 0.8, and strategies 2 and 3 slightly outperform strategy 1.
Table 1
Performance of PCA-BCIT at level 0.05 with strategies 1-3†
† 1: case-only extracting strategy (CAES); 2: control-only extracting strategy (COES); 3: case-control extracting strategy (CES).
For the NARAC data, the Armitage trend test reveals none of the SNPs to be in significant association with RA after Bonferroni correction (Table 2), but the results of PCA-BCIT with strategies 2 and 3 show that the first PC extracted in the region of PTPN22 is significantly associated with RA. The results are similar to those from the permutation test (Table 3).
Table 2
Armitage trend test on nine PTPN22 SNPs and RA susceptibility
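| SNP | Genotype | Female cases | Female controls | Female P-value | Male cases | Male controls | Male P-value |
|---|---|---|---|---|---|---|---|
| rs971173 | CC | 334 | 381 | 0.025 | 116 | 169 | 0.779 |
| | AC | 236 | 363 | | 85 | 134 | |
| | AA | 71 | 106 | | 26 | 39 | |
| rs1217390 | AA | 268 | 319 | 0.333 | 99 | 112 | 0.108 |
| | AG | 272 | 392 | | 89 | 175 | |
| | GG | 98 | 138 | | 38 | 55 | |
| rs878129 | GG | 338 | 507 | 0.009 | 131 | 187 | 0.384 |
| | AG | 251 | 291 | | 83 | 130 | |
| | AA | 52 | 54 | | 13 | 25 | |
| rs11811771 | AA | 224 | 272 | 0.090 | 78 | 111 | 0.717 |
| | AG | 303 | 411 | | 104 | 168 | |
| | GG | 112 | 169 | | 45 | 62 | |
| rs11102703 | CC | 312 | 469 | 0.024 | 121 | 174 | 0.418 |
| | AC | 269 | 314 | | 90 | 137 | |
| | AA | 60 | 69 | | 16 | 31 | |
| rs7545038 | GG | 321 | 428 | 0.696 | 109 | 186 | 0.417 |
| | AG | 265 | 342 | | 98 | 114 | |
| | AA | 52 | 80 | | 20 | 40 | |
| rs1503832 | AA | 324 | 487 | 0.013 | 129 | 185 | 0.249 |
| | AG | 262 | 306 | | 86 | 127 | |
| | GG | 55 | 59 | | 12 | 30 | |
| rs12127377 | AA | 349 | 521 | 0.017 | 139 | 197 | 0.230 |
| | AG | 243 | 282 | | 78 | 121 | |
| | GG | 49 | 48 | | 10 | 24 | |
| rs11485101 | AA | 564 | 738 | 0.656 | 206 | 305 | 0.430 |
| | AG | 72 | 112 | | 21 | 35 | |
| | GG | 5 | 2 | | 0 | 2 | |

None of the P-values is significant after Bonferroni correction.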
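The Bonferroni statement can be checked with a few lines of Python. The sketch below takes the female P-values from Table 2 and compares them with the threshold α divided by the number of SNPs tested; the variable names are illustrative only.

```python
# Female P-values of the Armitage trend test, copied from Table 2.
p_values = {
    "rs971173": 0.025, "rs1217390": 0.333, "rs878129": 0.009,
    "rs11811771": 0.090, "rs11102703": 0.024, "rs7545038": 0.696,
    "rs1503832": 0.013, "rs12127377": 0.017, "rs11485101": 0.656,
}

alpha = 0.05
threshold = alpha / len(p_values)   # Bonferroni: alpha divided by the number of tests
significant = [snp for snp, p in p_values.items() if p < threshold]

print(round(threshold, 5), significant)  # → 0.00556 []
```

The smallest P-value (0.009 for rs878129) is above 0.05/9 ≈ 0.0056, so no single SNP passes the corrected threshold even though five are nominally significant at 0.05.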
Table 3
PCA-BCIT, PCA-LRT and permutation test on real data†
† 2: control-only extracting strategy (COES); 3: case-control extracting strategy (CES).
‡ Significant at levels α = 0.05 (*) and α = 0.01 (**).
For the OPRM1 data, the sample characteristics are comparable between cases and controls (Table 4), and three SNPs (rs696522, rs1381376 and rs3778151) show significant association with the endophenotype (Table 5). The results of PCA-BCIT with strategies 2 and 3 and of the permutation test are all significant at level α = 0.01. In contrast, the result from PCA-LRT is not significant at level α = 0.05 with strategy 2 (Table 3). The apparent separation of cases and controls is shown in Figure 3 for PCA-BCIT with strategy 3, suggesting an intuitive interpretation.
Table 4
Sample characteristics of heroin-induced positive responses on first use
| Characteristic | Cases (N = 91) | Controls (N = 245) | P-value |
|---|---|---|---|
| Age (yrs) | 30.42 ± 7.65 | 30.93 ± 8.18 | 0.6057 |
| Women (%) | 26.4 | 29.8 | 0.5384 |
| Age at onset (yrs) | 26.29 ± 7.41 | 26.97 ± 7.89 | 0.4760 |
| Reason for first use of heroin (%) | | | 0.7173 |
| Curiosity | 79.1 | 75.1 | |
| Peer pressure | 6.6 | 4.9 | |
| Physical disease | 7.7 | 10.2 | |
| Trouble | 5.5 | 6.1 | |
| Other reasons | 1.1 | 3.8 | |
Table 5
Armitage trend tests on nine OPRM1 SNPs and heroin-induced positive responses on first use
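| SNP | Genotype | Cases: count | Cases: frequency | Controls: count | Controls: frequency | Chi-square | P-value |
|---|---|---|---|---|---|---|---|
| rs1799971 | AA | 55 | 0.604 | 150 | 0.622 | 0.003 | 0.9537 |
| | AG | 27 | 0.297 | 64 | 0.266 | | |
| | GG | 9 | 0.099 | 24 | 0.112 | | |
| rs510769 | TT | 56 | 0.667 | 167 | 0.749 | 2.744 | 0.0976 |
| | TC | 24 | 0.286 | 53 | 0.237 | | |
| | CC | 4 | 0.048 | 4 | 0.018 | | |
| rs696522 | AA | 64 | 0.762 | 215 | 0.907 | 11.097 | 0.0009* |
| | AG | 19 | 0.226 | 21 | 0.089 | | |
| | GG | 1 | 0.012 | 1 | 0.004 | | |
| rs1381376 | CC | 70 | 0.769 | 221 | 0.913 | 13.409 | 0.0003* |
| | CT | 20 | 0.220 | 21 | 0.087 | | |
| | TT | 1 | 0.011 | 0 | 0.000 | | |
| rs3778151 | GG | 66 | 0.733 | 215 | 0.896 | 14.655 | 0.0001* |
| | GA | 23 | 0.256 | 25 | 0.104 | | |
| | AA | 1 | 0.011 | 0 | 0.000 | | |
| rs2075572 | GG | 50 | 0.556 | 149 | 0.642 | 1.574 | 0.2096 |
| | GC | 33 | 0.367 | 82 | 0.353 | | |
| | CC | 7 | 0.078 | 11 | 0.047 | | |
| rs533586 | TT | 68 | 0.840 | 203 | 0.868 | 0.761 | 0.3830 |
| | TC | 12 | 0.148 | 31 | 0.132 | | |
| | CC | 1 | 0.012 | 0 | 0.000 | | |
| rs550014 | TT | 78 | 0.857 | 203 | 0.832 | 0.093 | 0.7602 |
| | TC | 12 | 0.132 | 41 | 0.168 | | |
| | CC | 1 | 0.011 | 0 | 0.000 | | |
| rs658156 | GG | 65 | 0.714 | 192 | 0.787 | 2.041 | 0.1531 |
| | GA | 24 | 0.264 | 52 | 0.213 | | |
| | AA | 1 | 0.011 | 0 | 0.000 | | |

* Significant after Bonferroni correction.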
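For readers who wish to check a row of Table 5, the Armitage trend test can be written in a few lines of plain Python. The sketch below uses the standard Cochran-Armitage formula with additive weights (0, 1, 2); the function name is hypothetical, and whether the authors used exactly this variance formula is an assumption, although it reproduces the chi-square reported for rs1381376.

```python
import math

def armitage_trend_test(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x k genotype table.

    cases, controls: genotype counts ordered by dosage of the counted allele.
    Returns the 1-df chi-square statistic and its P-value.
    """
    R, S = sum(cases), sum(controls)
    N = R + S
    totals = [r + s for r, s in zip(cases, controls)]
    k = len(weights)
    # Trend statistic: weighted difference between case and control counts.
    T = sum(w * (r * S - s * R) for w, r, s in zip(weights, cases, controls))
    # Variance of T under the null hypothesis of no association.
    own = sum(w * w * n * (N - n) for w, n in zip(weights, totals))
    cross = sum(weights[i] * weights[j] * totals[i] * totals[j]
                for i in range(k) for j in range(i + 1, k))
    variance = R * S / N * (own - 2 * cross)
    chi2 = T * T / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square, 1 df
    return chi2, p_value

# rs1381376 from Table 5: genotypes CC, CT, TT (T is the counted allele).
chi2, p = armitage_trend_test(cases=[70, 20, 1], controls=[221, 21, 0])
print(round(chi2, 3))  # → 13.409
```

The P-value returned for this SNP is below 0.05/9, in line with the asterisk in the table.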
Figure 3
Real data analyses by PCA-BCIT. The horizontal axis denotes studies and the vertical axis mean(PC1), the statistic used to calculate confidence intervals for cases and controls. PCA-BCITs with strategy 3 were significant at confidence level 0.95.
Discussion
In this study, a PCA-based bootstrap confidence interval test[19,26-28] (PCA-BCIT) is developed to study gene-disease association using all SNPs genotyped in a given region. There are several attractive features of PCA-based approaches. First, they are at least as powerful as genotype- and haplotype-based methods[7,16,17]. Secondly, they are able to capture LD information between correlated SNPs and are easy to compute, without the need to consider multicollinearity and multiple testing. Thirdly, BCIT integrates point estimation and hypothesis testing as a single inferential statement of great intuitive appeal[29] and does not rely on the distributional assumption of the statistic used to calculate the confidence interval[19,26-29].

While there have been several different but closely related forms of bootstrap confidence interval calculations[28], we focus on percentiles of the asymptotic distribution of PCs for given confidence levels to estimate the confidence interval. PCA-BCIT is a data-learning method[29], and is shown to be valid and powerful for a sufficiently large number of replicates in our study. Our investigation involving three strategies of extracting PCs reveals that strategy 1 is invalid, while strategies 2 and 3 are acceptable. From analyses of real data we find that PCA-BCIT compares favourably with PCA-LRT and the permutation test. A practical advantage of PCA-BCIT is that it offers an intuitive measure of the difference between cases and controls by using the set of SNPs (PC scores) in a candidate region (Figure 3). As extraction of PCs through COES is more in line with the principle of a case-control study, it will be our method of choice given that its performance is comparable with that of CES. Nevertheless, PCA-BCIT has the limitation that it does not directly handle covariates as is usually done in a regression model.
Conclusions
PCA-BCIT is both a valid and a powerful PCA-based method which captures multi-SNP information in study of gene-disease association. While extracting PCs based on CAES, COES and CES all have good performances, it appears that COES is more appropriate to use.
Abbreviations
SNP: single nucleotide polymorphism; HWE: Hardy-Weinberg equilibrium; LD: linkage disequilibrium; LRT: logistic regression test; PCA: principal component analysis; PC: principal component; ES: extracting strategy; SES: separate case and control extracting strategy (strategy 0); CAES: case-based extracting strategy (strategy 1); COES: control-based extracting strategy (strategy 2); CES: combined case and control extracting strategy (strategy 3); BCIT: bootstrap confidence interval test.
Authors' contributions
QQP, JHZ, and FZX conceptualized the study, acquired and analyzed the data, and prepared the manuscript. All authors approved the final manuscript.