Li Liu, Donghui Zhang, Hong Liu, Christopher Arendt.
Abstract
BACKGROUND: Genome-wide association studies can provide novel insights into diseases of interest, as well as into the responsiveness of individuals to specific treatments. In such studies, it is essential to correct for population stratification, which refers to allele frequency differences between cases and controls due to systematic ancestry differences. Population stratification can cause spurious associations if it is not adjusted for properly. Principal component analysis (PCA) has been widely relied upon to adjust for population stratification in these large-scale studies. Recently, the linear mixed model (LMM) has also been proposed to account for family structure or cryptic relatedness. However, neither approach may be optimal for correcting sample structure in the presence of subject outliers.
Year: 2013 PMID: 23601181 PMCID: PMC3637636 DOI: 10.1186/1471-2105-14-132
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
17] and applied in chemometrics. The method is easy to understand, and we have implemented it in R. To start RHM, we randomly select half of the total observations. The sampled data form an n/2 by p matrix Xs(i), from which the mean vector m(i) and the standard deviation vector s(i) are computed. The original data matrix X is then scaled using m(i) and s(i) to give an n by p scaled matrix X(i).
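The scaling step described above can be sketched in Python as follows. This is a minimal illustration, not the authors' R implementation: the function name `rhm_scale_once`, the use of the sample standard deviation, and rounding n/2 down for odd n are all assumptions that the text does not state.

```python
import random
import statistics


def rhm_scale_once(X, rng):
    """Scale every row of X using the column means and standard deviations
    of a randomly chosen half of the rows (one resampling step)."""
    n, p = len(X), len(X[0])
    # Randomly select half of the observations (assumption: floor(n / 2)).
    half = rng.sample(range(n), n // 2)
    Xs = [X[i] for i in half]
    # Column-wise mean m and standard deviation s of the sampled half.
    # Assumption: sample standard deviation (n - 1 denominator).
    m = [statistics.mean([row[j] for row in Xs]) for j in range(p)]
    s = [statistics.stdev([row[j] for row in Xs]) for j in range(p)]
    # Scale the full n-by-p matrix with the half-sample statistics.
    # A column that is constant within the half-sample gives s[j] == 0
    # and would raise ZeroDivisionError here.
    scaled = [[(row[j] - m[j]) / s[j] for j in range(p)] for row in X]
    return scaled, m, s, half


# Toy data: 6 observations, 2 variables (values are illustrative only).
X = [[1.0, 2.0], [2.0, 4.5], [3.0, 1.0], [4.0, 8.0], [5.0, 6.0], [6.0, 3.0]]
scaled, m, s, half = rhm_scale_once(X, random.Random(0))
```

Because the statistics come from only half of the rows, the sampled rows end up with mean 0 and standard deviation 1 in each column, while the remaining rows are expressed relative to that half-sample.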
Population stratification configurations in simulations I and II
| Scenario | Stratification | Cases | Controls |
|---|---|---|---|
| S1 | (moderate) | (0.6,0.4)a | (0.4,0.6)b |
| S2 | (more extreme) | (0.5,0.5) | (0,1) |
| S3 | (moderate) | (0.45,0.35,0.20) | (0.35,0.20,0.45) |
| S4 | (more extreme) | (0.33,0.67,0) | (0,0.33,0.67) |
a The proportion of cases sampled from each subpopulation.
b The proportion of controls sampled from each subpopulation.
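To make these configurations concrete, the sketch below converts a vector of sampling proportions into per-subpopulation counts. The function name `allocate` and the sample size of 1000 are hypothetical; this excerpt does not state how many cases or controls were simulated.

```python
def allocate(n, proportions):
    """Split n subjects across subpopulations according to proportions,
    using largest-remainder rounding so the counts always sum to n."""
    raw = [n * p for p in proportions]
    counts = [int(r) for r in raw]  # floor, since all values are non-negative
    leftover = n - sum(counts)
    # Hand out any leftover subjects to the largest fractional parts first.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Scenario S1 with a hypothetical 1000 cases and 1000 controls.
cases_s1 = allocate(1000, (0.6, 0.4))
controls_s1 = allocate(1000, (0.4, 0.6))
```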
Empirical false positive rate and true positive rate results for simulation I (Discrete Populations without Outliers)
| Scenario | SNP type | | | | | | |
|---|---|---|---|---|---|---|---|
| S1 | Random SNPs | 2.67 | 0.91 | 0.97 | 0.97 | 0.99 | 0.97 |
| (2 populations, moderate) | Differentiated SNPs | 99.85 | 98.86 | 1.30 | 0.90 | 0.88 | 0.89 |
| | Causal SNPs | 48.99 | 34.13 | 47.37 | 47.29 | 46.92 | 47.33 |
| S2 | Random SNPs | 16.56 | 0.89 | 1.11 | 0.92 | 0.93 | 0.92 |
| (2 populations, more extreme) | Differentiated SNPs | 100.00 | 100.00 | 13.60 | 1.00 | 1.01 | 0.99 |
| | Causal SNPs | 49.91 | 10.91 | 33.89 | 31.76 | 31.63 | 31.77 |
| S3 | Random SNPs | 3.14 | 0.97 | 0.94 | 0.93 | 0.95 | 0.92 |
| (3 populations, moderate) | Differentiated SNPs | 99.99 | 99.98 | 2.24 | 1.00 | 1.01 | 1.00 |
| | Causal SNPs | 48.18 | 31.76 | 45.16 | 45.08 | 44.60 | 45.09 |
| S4 | Random SNPs | 21.76 | 0.94 | 1.45 | 1.05 | 1.05 | 1.06 |
| (3 populations, more extreme) | Differentiated SNPs | 100.00 | 100.00 | 41.78 | 0.96 | 0.95 | 0.96 |
| | Causal SNPs | 50.79 | 8.42 | 23.51 | 19.34 | 19.13 | 19.34 |
a For random SNPs and differentiated SNPs, the values in the table are empirical false positive rates; for causal SNPs, they are empirical true positive rates. The nominal false positive rate is 0.01 (1%). All values in the table are percentages.
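As a rough illustration of how such an empirical rate can be computed, the sketch below takes a list of p-values and reports the percentage that fall below the nominal level. The function name `empirical_rate_percent`, the strict inequality, and the toy p-values are assumptions made for illustration; they are not taken from the paper.

```python
def empirical_rate_percent(p_values, alpha=0.01):
    """Percentage of tests whose p-value falls below the nominal level alpha.

    Applied to null SNPs this is an empirical false positive rate;
    applied to causal SNPs it is an empirical true positive rate.
    """
    if not p_values:
        raise ValueError("p_values must be non-empty")
    # Assumption: a test is counted as significant when p < alpha (strict).
    hits = sum(1 for p in p_values if p < alpha)
    return 100.0 * hits / len(p_values)


# Toy p-values (illustrative only): two of the four fall below 0.01.
rate = empirical_rate_percent([0.001, 0.5, 0.2, 0.009], alpha=0.01)
```

A well-calibrated method should return a value close to 1 for null SNPs at this nominal level.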
Empirical false positive rate and true positive rate results for simulation II (Discrete Populations with Outliers)
| Scenario | SNP type | | | | | | |
|---|---|---|---|---|---|---|---|
| S1 | Random SNPs | 2.75 | 1.41 | 1.94 | 0.97 | 1.01 | 0.99 |
| (2 populations, moderate) | Differentiated SNPs | 99.85 | 98.75 | 93.03 | 1.33 | 0.99 | 1.00 |
| | Causal SNPs | 48.97 | 37.55 | 48.33 | 46.95 | 44.69 | 45.06 |
| S2 | Random SNPs | 16.74 | 1.71 | 8.38 | 1.09 | 0.99 | 1.00 |
| (2 populations, more extreme) | Differentiated SNPs | 100.00 | 100.00 | 100.00 | 6.91 | 1.14 | 1.29 |
| | Causal SNPs | 49.94 | 14.09 | 44.77 | 32.81 | 30.07 | 30.21 |
| S3 | Random SNPs | 3.40 | 1.12 | 1.65 | 1.08 | 1.06 | 1.06 |
| (3 populations, moderate) | Differentiated SNPs | 100.00 | 99.99 | 63.28 | 1.36 | 1.02 | 1.02 |
| | Causal SNPs | 48.85 | 31.61 | 46.72 | 45.81 | 43.29 | 43.89 |
| S4 | Random SNPs | 21.35 | 1.15 | 9.82 | 1.10 | 0.92 | 0.97 |
| (3 populations, more extreme) | Differentiated SNPs | 100.00 | 100.00 | 100.00 | 18.13 | 1.29 | 1.51 |
| | Causal SNPs | 50.09 | 9.41 | 37.56 | 21.76 | 18.66 | 18.81 |
a For random SNPs and differentiated SNPs, the values in the table are empirical false positive rates; for causal SNPs, they are empirical true positive rates. The nominal false positive rate is 0.01 (1%). All values in the table are percentages.
Figure 1. The orthogonal distance versus the score distance for one simulated dataset. The plot is based on projection pursuit robust PCA using the GRID algorithm for one simulated dataset under scenario S4 in simulation II. The vertical line is the outlier cutoff for the score distance and the horizontal line is the outlier cutoff for the orthogonal distance; points to the right of the vertical line or above the horizontal line were identified as outliers.
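The outlier rule shown in the figure can be sketched in Python as follows. The distance definitions used here are the ones commonly used for robust PCA diagnostic plots (score distance measured within the PCA subspace and scaled by the eigenvalues, orthogonal distance measured from the observation to that subspace); that the paper uses exactly these definitions is an assumption. The function names and the toy inputs are hypothetical, and the cutoffs are passed in as parameters because the caption does not say how they were derived.

```python
import math


def diagnostic_distances(x, center, loadings, eigenvalues):
    """Return (score_distance, orthogonal_distance) for one observation.

    loadings: list of k orthonormal direction vectors, each of length p.
    eigenvalues: the k variances associated with those directions.
    """
    centered = [xi - ci for xi, ci in zip(x, center)]
    # Scores: projections of the centered observation onto each direction.
    scores = [sum(c * v for c, v in zip(centered, vec)) for vec in loadings]
    # Score distance: distance within the subspace, scaled by the eigenvalues.
    sd = math.sqrt(sum(t * t / lam for t, lam in zip(scores, eigenvalues)))
    # Orthogonal distance: length of the residual after projecting back.
    recon = [sum(t * vec[j] for t, vec in zip(scores, loadings))
             for j in range(len(x))]
    od = math.sqrt(sum((c - r) ** 2 for c, r in zip(centered, recon)))
    return sd, od


def is_outlier(sd, od, sd_cutoff, od_cutoff):
    """Flag a point lying right of the vertical line or above the horizontal line."""
    return sd > sd_cutoff or od > od_cutoff


# Toy example: one component along the first axis in three dimensions.
sd, od = diagnostic_distances(
    x=[2.0, 3.0, 4.0],
    center=[0.0, 0.0, 0.0],
    loadings=[[1.0, 0.0, 0.0]],
    eigenvalues=[4.0],
)
flagged = is_outlier(sd, od, sd_cutoff=2.0, od_cutoff=3.0)
```

In this toy case the point lies within the score-distance cutoff but far from the subspace, so it is flagged on the orthogonal distance alone.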
Empirical false positive rate and true positive rate results for simulations III and IV (Admixed populations)
| Simulation | SNP type | | | | | | |
|---|---|---|---|---|---|---|---|
| Simulation III | Random SNPs | 2.09 | 0.91 | 0.90 | 0.89 | 0.91 | 1.10 |
| (no outliers) | Differentiated SNPs | 97.16 | 94.29 | 1.12 | 1.09 | 1.09 | 1.10 |
| | Causal SNPs | 49.22 | 36.88 | 45.09 | 45.06 | 44.64 | 44.10 |
| Simulation IV | Random SNPs | 2.27 | 1.12 | 1.89 | 1.04 | 0.91 | 0.80 |
| (with outliers) | Differentiated SNPs | 97.59 | 94.17 | 88.11 | 10.09 | 1.01 | 1.40 |
| | Causal SNPs | 49.15 | 37.63 | 48.23 | 45.30 | 42.37 | 45.50 |
a For random SNPs and differentiated SNPs, the values in the table are empirical false positive rates; for causal SNPs, they are empirical true positive rates. The nominal false positive rate is 0.01 (1%). All values in the table are percentages.
Figure 2. The orthogonal distance versus the score distance for NARAC data. The vertical line is the outlier cutoff for the score distance and the horizontal line is the outlier cutoff for the orthogonal distance; points to the right of the vertical line or above the horizontal line were identified as outliers.
Figure 3. Results of GWA analyses based on five different methods. The y-axis uses a square-root scale to improve readability.
Comparison of the analysis results for three SNPs on chromosome 9 known to be associated with RA
| Method | P-value | Rank | P-value | Rank | P-value | Rank |
|---|---|---|---|---|---|---|
| rPCA | 1.15E-07 | 1 | 2.78E-07 | 2 | 3.20E-07 | 3 |
| PCA | 1.91E-07 | 1 | 4.71E-07 | 2 | 5.55E-07 | 3 |
| MDS | 1.69E-07 | 1 | 4.55E-07 | 2 | 4.91E-07 | 3 |
| Trend | 8.05E-09 | 4 | 3.52E-08 | 7 | 2.82E-08 | 6 |
| GC | 1.46E-06 | 4 | 4.13E-06 | 7 | 3.54E-06 | 6 |