Gina M Peloso, Nadia Timofeev, Kathryn L Lunetta.
Abstract
Population structure occurs when a sample is composed of individuals with different ancestries and can result in excess type I error in genome-wide association studies. Genome-wide principal-component analysis (PCA) has become a popular method for identifying and adjusting for subtle population structure in association studies. Using the Genetic Analysis Workshop 16 (GAW16) NARAC data, we explore two unresolved issues concerning the use of genome-wide PCA to account for population structure in genetic association studies: the choice of single-nucleotide polymorphism (SNP) subset and the choice of adjustment model. We computed PCs for subsets of genome-wide SNPs with varying levels of LD. The first two PCs were similar for all subsets and the first three PCs were associated with case status for all subsets. When the PCs associated with case status were included as covariates in an association model, the reduction in genomic inflation factor was similar for all SNP sets. Several models have been proposed to account for structure using PCs, but it is not yet clear whether the different methods will yield substantively different results for association studies with individuals of European descent. We compared genome-wide association p-values and results for two positive-control SNPs previously associated with rheumatoid arthritis using four PC adjustment methods as well as no adjustment and genomic control. We found that in this sample, adjusting for the continuous PCs or adjusting for discrete clusters identified using the PCs adequately accounts for the case-control population structure, but that a recently proposed randomization test performs poorly.
Year: 2009 PMID: 20017972 PMCID: PMC2795879 DOI: 10.1186/1753-6561-3-s7-s108
Source DB: PubMed Journal: BMC Proc ISSN: 1753-6561
Summary of PCA with six SNP subsets
| SNP subset for PCA | Number of SNPs | PCs significantly associated with RA (α = 0.05) | λ^a |
|---|---|---|---|
| S1. Filtered SNPs | 459,422 | 1, 2, 3, 5, 6, 8, 9 | 1.023 |
| S2. Removing the MHC region | 457,776 | 1, 2, 3, 5 | 1.027 |
| S3. Removing MHC and inversion on chr 8 | 456,846 | 1, 2, 3, 4 | 1.028 |
| S4. Removing chr 6 and inversion on chr 8 | 427,806 | 1, 2, 3, 4 | 1.030 |
| S5. Removing LD between SNPs with | 164,418 | 1, 2, 3, 4 | 1.022 |
| S6. Removing LD between SNPs with | 81,240 | 1, 2, 3, 4, 5 | 1.023 |
^a Inflation factor for genome-wide SNP association test statistics, adjusting for the PCs significantly associated with RA (α = 0.05) as linear covariates.
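The inflation factor λ in the table above is conventionally computed as the median of the observed 1-df chi-square association statistics divided by the median of the chi-square distribution with 1 degree of freedom (about 0.455). The following is a minimal sketch of that convention, not the authors' code. The function names `chi2_from_pvalue` and `genomic_inflation` are hypothetical, and the sketch assumes two-sided p-values from 1-df tests.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
# If Z is standard normal, P(Z^2 <= m) = 0.5 when Phi(sqrt(m)) = 0.75.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2  # about 0.4549


def chi2_from_pvalue(p):
    """Convert a two-sided p-value (0 < p < 1) to a 1-df chi-square statistic."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return z * z


def genomic_inflation(pvalues):
    """Median observed chi-square statistic divided by its expected median."""
    stats = [chi2_from_pvalue(p) for p in pvalues]
    return median(stats) / CHI2_1DF_MEDIAN


# Illustrative input only: evenly spaced p-values mimic a null distribution,
# so the estimated inflation factor should be very close to 1.
null_pvalues = [(i + 0.5) / 999 for i in range(999)]
lam = genomic_inflation(null_pvalues)
```

Values of λ close to 1, such as those in the table, indicate little residual inflation of the test statistics after adjustment.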
Correlation between S3 PCs and other SNP subset PCs.
| SNP subset | PCA removing MHC and inversion on chromosome 8 (S3)^b | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| S1 | PC3, PC4 | PC5, PC6 | PC5, PC6 | PC7 | PC8, PC9, PC10 | PC9 | ---- | ---- |
| S2 | PC3, PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | ---- |
| S4 | PC3 | PC4 | ----^c | PC5 | PC6 | PC7, PC8 | PC7, PC8 | PC9 |
| S5 | PC3 | PC4 | ---- | ---- | ---- | ---- | ---- | ---- |
| S6 | PC3 | PC4 | ---- | ---- | ---- | ---- | ---- | ---- |
^a For each SNP subset, the PCs correlated with the corresponding S3 PC with r² > 0.15 are displayed, along with the r² value.
^b PC1 and PC2 for each subset are correlated with S3 PC1 and PC2 (r² > 0.98 and r² > 0.94, respectively) and are not shown.
^c ----, no PC has correlation r² > 0.15 with the S3 PC.
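The matching rule in the footnotes (a PC from another subset is listed when its squared correlation with the S3 PC exceeds 0.15) can be sketched as follows. This is an illustration under assumptions, not the authors' procedure: `r_squared` and `correlated_pcs` are hypothetical names, each PC is assumed to be a list of per-individual scores in the same sample order, and the toy scores are invented. Using r² rather than r makes the comparison insensitive to the arbitrary sign of a principal component.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def correlated_pcs(reference_pc, candidate_pcs, threshold=0.15):
    """Names of candidate PCs whose r^2 with the reference PC exceeds threshold."""
    return [
        name
        for name, scores in candidate_pcs.items()
        if r_squared(reference_pc, scores) > threshold
    ]


# Toy scores for five individuals (invented for illustration).
s3_pc3 = [1.0, 2.0, 3.0, 4.0, 5.0]
other_subset = {
    "PC3": [-1.0, -2.0, -3.0, -4.0, -5.0],  # same axis, flipped sign
    "PC4": [1.0, -1.0, 0.0, -1.0, 1.0],     # uncorrelated with s3_pc3
}
matches = correlated_pcs(s3_pc3, other_subset)
```

In this toy case only the sign-flipped `PC3` passes the threshold, which is the behavior the r² criterion is meant to give.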
Figure 1. PCA plot of SNP weights. Plot of SNP weights from PCAs S1 (black), S2 (red), S3 (green), S5 (blue), and S6 (turquoise).
Figure 2. Q-Q plot of -log10(p-value). Quantile-quantile plot of -log10(p-value) for Methods A-F as described in Methods. For each analysis, we plot -log10(p-value) for all genome-wide SNPs, excluding the SNPs in the MHC. Black, Method (A) - logistic regression adjusting for the DRB locus (inflation factor λ = 1.24); Red, Method (B) - logistic regression adjusting for the DRB locus and PC1-4 as continuous covariates (λ = 1.03); Green, Method (C) - logistic regression adjusting for the DRB locus and PC cluster (λ = 1.04); Blue, Method (D) - logistic regression adjusting for the DRB locus, PC1-4 as continuous covariates, and PC cluster (λ = 1.03); Purple, Method (E) - PSAT with disease probability assigned based on clustering on PC1-4 (λ = 1.15); and Orange, Method (F) - genomic control using the logistic model adjusting for the DRB locus (λ = 1.00).
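For readers unfamiliar with the construction of a Q-Q plot like Figure 2, the coordinates can be sketched as below. This is a generic recipe, not the authors' plotting code: the function name `qq_points` is hypothetical, and the expected quantiles use the common (i - 0.5)/n plotting positions, which is an assumption about convention.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10(p), sorted ascending.

    Expected values are the -log10 of uniform plotting positions (i - 0.5)/n;
    observed values are the -log10 of the supplied p-values (each in (0, 1]).
    """
    n = len(pvalues)
    observed = sorted(-math.log10(p) for p in pvalues)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


# Illustrative input only: p-values that are exactly uniform fall on the diagonal.
uniform_p = [(i - 0.5) / 200 for i in range(1, 201)]
points = qq_points(uniform_p)
```

Points that rise above the diagonal across most of the range, rather than only in the upper tail, are the pattern usually associated with inflated test statistics.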
Results for positive control SNPs for six methods of adjusting for structure
| Adjustment method | β^a ± SE | p-value | β^a ± SE | p-value |
|---|---|---|---|---|
| A. Logistic regression adjusting for DRB | 0.61 ± 0.11 | 3.70 × 10^-8 | 0.42 ± 0.07 | 8.04 × 10^-9 |
| B. Logistic regression adjusting for DRB and PC1-4 | 0.47 ± 0.12 | 1.32 × 10^-4 | 0.39 ± 0.08 | 2.08 × 10^-6 |
| C. Logistic regression adjusting for DRB and cluster | 0.46 ± 0.12 | 1.57 × 10^-5 | 0.45 ± 0.08 | 4.21 × 10^-8 |
| D. Logistic regression adjusting for DRB, PC1-4, and cluster | 0.46 ± 0.13 | 2.44 × 10^-4 | 0.42 ± 0.08 | 4.55 × 10^-7 |
| E. PSAT based on PC1-4 cluster | -----^b | 1.17 × 10^-6 | ----- | 2.80 × 10^-8 |
| F. Genomic control | 0.16 ± 0.12 | 7.44 × 10^-7 | 0.42 ± 0.08 | 2.22 × 10^-7 |
| Previously reported | 0.50 ± 0.15 | 6.60 × 10^-4 | 0.35 ± 0.04 | 4.00 × 10^-14 |
^a β, log-odds ratio for disease for the minor allele, estimated by logistic regression.
^b -----, estimates of effect cannot be obtained.
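To relate the β ± SE entries to their p-values, the sketch below applies a standard two-sided Wald test and a standard genomic-control adjustment (dividing the 1-df chi-square statistic by λ). Both are textbook formulas used here as assumptions; the source does not state which test produced the tabulated p-values, so exact agreement should not be expected, particularly since the tabulated β and SE are rounded. The function names are hypothetical.

```python
import math


def wald_pvalue(beta, se):
    """Two-sided Wald p-value for a log-odds ratio and its standard error."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2.0))  # equals 2 * (1 - Phi(z))


def genomic_control_pvalue(beta, se, lam):
    """Wald p-value after dividing the chi-square statistic by lambda.

    Dividing z^2 by lambda is equivalent to dividing |z| by sqrt(lambda).
    Values of lambda below 1 are treated as 1 (no correction), a common convention.
    """
    z = abs(beta / se) / math.sqrt(max(lam, 1.0))
    return math.erfc(z / math.sqrt(2.0))


def odds_ratio(beta):
    """Convert a log-odds ratio to an odds ratio."""
    return math.exp(beta)


# Row A, first SNP: beta = 0.61, SE = 0.11; lambda = 1.24 is Method A's
# inflation factor from the Figure 2 caption.
p_unadjusted = wald_pvalue(0.61, 0.11)
p_gc = genomic_control_pvalue(0.61, 0.11, 1.24)
or_a = odds_ratio(0.61)
```

With these rounded inputs the unadjusted p-value is on the order of 10^-8 and the genomic-control p-value is on the order of 10^-7, the same orders of magnitude as the tabulated values for Methods A and F for the first SNP.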