Andrei-Alin Popescu, Andrea L Harper, Martin Trick, Ian Bancroft, Katharina T Huber.
Abstract
Population structure is a confounding factor in genome-wide association studies, increasing the rate of false positive associations. To correct for it, several model-based algorithms such as ADMIXTURE and STRUCTURE have been proposed. These carry a considerable computational burden, which limits their applicability to large datasets such as those produced by next-generation sequencing techniques. To address this, nonmodel-based approaches such as sparse nonnegative matrix factorization (sNMF) and EIGENSTRAT have been proposed, which scale better with larger data. Here we present a novel nonmodel-based approach, population structure inference using kernel-PCA and optimization (PSIKO), which is based on a unique combination of linear kernel-PCA and least-squares optimization and allows for the inference of admixture coefficients, principal components, and the number of founder populations of a dataset. PSIKO has been compared against existing leading methods on a variety of simulation scenarios, as well as on real biological data. We found that in addition to producing results of the same quality as other tested methods, PSIKO scales extremely well with dataset size, being considerably (up to 30 times) faster for longer sequences than even state-of-the-art methods such as sNMF. PSIKO and the accompanying manual are freely available at https://www.uea.ac.uk/computing/psiko.
Keywords: Q-matrix; admixture inference; genome-wide association studies; kernel-PCA; population structure
Year: 2014 PMID: 25326237 PMCID: PMC4256762 DOI: 10.1534/genetics.114.171314
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Figure 2. PCA-reduced datasets under different simulation scenarios, each of which is represented by a separate panel. In each panel, the coordinate axes are the first two significant principal components; see text for details.
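As a rough illustration of the kind of projection plotted in Figure 2, the following sketch computes a PCA through the linear kernel (Gram) matrix of a genotype matrix. It is textbook linear kernel-PCA, not PSIKO's own code: the function name `linear_kernel_pca`, the toy matrix `G`, and its 0/1/2 genotype coding are assumptions made for this example, and the selection of "significant" components is not shown.

```python
import numpy as np

def linear_kernel_pca(X, n_components=2):
    """Project the rows of X onto the leading components of a linear-kernel PCA."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    K = X @ X.T                                # linear kernel (Gram) matrix, n x n
    J = np.eye(n) - np.full((n, n), 1.0 / n)   # centering matrix
    Kc = J @ K @ J                             # Gram matrix of the mean-centered data
    eigvals, eigvecs = np.linalg.eigh(Kc)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(lam)    # sample coordinates (scores)

# Hypothetical toy data: 4 samples x 5 SNPs, coded as allele counts 0/1/2.
G = np.array([[0, 0, 1, 2, 2],
              [0, 1, 1, 2, 2],
              [2, 2, 1, 0, 0],
              [2, 1, 1, 0, 0]])
scores = linear_kernel_pca(G, n_components=2)
```

Working with the n x n Gram matrix rather than the SNP covariance matrix keeps the eigenproblem at the size of the sample count, which is usually far smaller than the number of markers.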
Figure 1. A summary of how the datasets underpinning our simulation experiments were generated. Each of the encircled values 1, 2, …, K indicates a founder population generated with the msms software. For all 1 ≤ k ≤ K, the vector F_k represents empirical allele frequencies computed for each of the K founder populations [i.e., F_k = (f1, f2, …, f)] and the values q_xk represent the proportion that population k contributes to accession x of the dataset X.
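The generation scheme in this caption can be sketched roughly as follows. In the paper the founder populations come from msms; here the founder allele frequencies `F_demo` are hand-written placeholders, and drawing two independent alleles per SNP from the mixed frequency is an assumption of this sketch rather than something the caption states. The helper names `dirichlet` and `simulate_admixed` are likewise hypothetical.

```python
import random

def dirichlet(alphas, rng):
    """Draw one sample from Dir(alphas) by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_admixed(F, alphas, n_accessions, seed=0):
    """Return (Q, X): admixture proportions and diploid 0/1/2 genotypes.

    F[k][j] is the allele frequency of SNP j in founder population k.
    """
    rng = random.Random(seed)
    n_pops, n_snps = len(F), len(F[0])
    Q, X = [], []
    for _ in range(n_accessions):
        q = dirichlet(alphas, rng)
        # Accession-specific allele frequency: q-weighted mix of founder frequencies.
        p = [sum(q[k] * F[k][j] for k in range(n_pops)) for j in range(n_snps)]
        # Assumption: two independent allele draws per SNP give the genotype.
        g = [int(rng.random() < pj) + int(rng.random() < pj) for pj in p]
        Q.append(q)
        X.append(g)
    return Q, X

# Placeholder founder frequencies for K = 3 populations and 4 SNPs.
F_demo = [[0.9, 0.1, 0.5, 0.2],
          [0.1, 0.8, 0.5, 0.7],
          [0.3, 0.3, 0.9, 0.1]]
Q_demo, X_demo = simulate_admixed(F_demo, alphas=(1, 1, 1), n_accessions=5, seed=1)
```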
For K = 3, average RMSEs between true and estimated Q-matrices for simulated datasets
| Asymmetric | Dir(0.2, 0.2, 0.5) | Dir(0.2, 0.2, 0.05) | Dir(0.05, 0.05, 0.01) |
|---|---|---|---|
| PSIKO | 0.008 | 0.007 | 0.005 |
| ADMIXTURE | 0.008 | 0.005 | 0.002 |
| sNMF | 0.008 | 0.005 | 0.002 |
| STRUCTURE | 0.053 | 0.022 | 0.021 |

| Symmetric | Dir(1, 1, 1) | Dir(0.5, 0.5, 0.5) | Dir(0.1, 0.1, 0.1) |
|---|---|---|---|
| PSIKO | 0.011 | 0.009 | 0.004 |
| ADMIXTURE | 0.018 | 0.01 | 0.004 |
| sNMF | 0.02 | 0.013 | 0.005 |
| STRUCTURE | 0.015 | 0.016 | 0.03 |
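For readers who want to reproduce this kind of comparison, a minimal sketch of an RMSE between a true and an estimated Q-matrix is given below. Two details are assumptions of the sketch, not statements about the paper: the error is averaged over all n x K entries, and the columns of the estimate are matched to the truth by trying every column permutation, since cluster labels returned by such tools are arbitrary.

```python
import math
from itertools import permutations

def rmse(Q_true, Q_est):
    """Root-mean-square error over all entries of two equally sized matrices."""
    n, k = len(Q_true), len(Q_true[0])
    sq = sum((Q_true[i][j] - Q_est[i][j]) ** 2
             for i in range(n) for j in range(k))
    return math.sqrt(sq / (n * k))

def rmse_best_permutation(Q_true, Q_est):
    """Minimum RMSE over all column orderings of Q_est."""
    k = len(Q_true[0])
    best = float("inf")
    for perm in permutations(range(k)):
        permuted = [[row[j] for j in perm] for row in Q_est]
        best = min(best, rmse(Q_true, permuted))
    return best

# Made-up example: the estimate equals the truth with its columns relabeled.
Q_true = [[1.0, 0.0, 0.0],
          [0.5, 0.5, 0.0],
          [0.0, 0.2, 0.8]]
Q_est = [[row[2], row[0], row[1]] for row in Q_true]
error = rmse_best_permutation(Q_true, Q_est)
```

The brute-force search over permutations is only practical for small K such as the K = 3 used here.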
Summarized relative run times of sNMF and PSIKO as averages over all 30 datasets
| Sequence length | 100,000 | 250,000 | 2,500,000 |
|---|---|---|---|
| PSIKO | 8 sec | 11 sec | 1 min 25 sec |
| sNMF | 55.5 sec | 1 min 40 sec | 22 min 28 sec |
The 30 datasets comprise 10 datasets for each symmetric Dirichlet distribution parameter setting given in Table 1.
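The ratios implied by the tabulated averages can be worked out directly. The snippet below only converts the run times reported in the table to seconds and divides them; the helper `to_seconds` is a name introduced for this example.

```python
def to_seconds(minutes=0, seconds=0.0):
    """Convert a run time given as minutes and seconds to seconds."""
    return 60 * minutes + seconds

# Average run times from the table, keyed by sequence length.
run_times = {
    100_000: {"PSIKO": to_seconds(seconds=8), "sNMF": to_seconds(seconds=55.5)},
    250_000: {"PSIKO": to_seconds(seconds=11), "sNMF": to_seconds(1, 40)},
    2_500_000: {"PSIKO": to_seconds(1, 25), "sNMF": to_seconds(22, 28)},
}

# Ratio of sNMF to PSIKO run time for each sequence length.
speedups = {length: t["sNMF"] / t["PSIKO"] for length, t in run_times.items()}

for length, ratio in speedups.items():
    print(f"{length:>9,} sites: sNMF / PSIKO = {ratio:.1f}")
# prints ratios of 6.9, 9.1 and 15.9 for the three sequence lengths
```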
Average RMSE between true and estimated Q-matrix for Dir(1, 1, 1) for each approach under each noise level p
| Approach | p = 0.01 | p = 0.05 | p = 0.1 | p = 0.15 |
|---|---|---|---|---|
| PSIKO | 0.011 | 0.012 | 0.013 | 0.015 |
| sNMF | 0.016 | 0.012 | 0.012 | 0.02 |
| ADMIXTURE | 0.018 | 0.013 | 0.013 | 0.019 |
Average RMSE between true and estimated Q-matrix for Dir(1, 1, 1) for each approach under each missing-value probability p
| Approach | p = 0.1 | p = 0.2 |
|---|---|---|
| PSIKO | 0.012 | 0.012 |
| sNMF | 0.013 | 0.012 |
| ADMIXTURE | 0.019 | 0.021 |
Average RMSE between true and estimated Q-matrix for Dir(1, 1, 1) for each approach under each value of the inbreeding coefficient F_IS
| Approach | F_IS = 0.25 | F_IS = 1 |
|---|---|---|
| PSIKO | 0.016 | 0.017 |
| sNMF | 0.026 | 0.027 |
| ADMIXTURE | 0.022 | 0.026 |
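To make the role of the inbreeding coefficient concrete, the sketch below uses the standard population-genetics genotype frequencies under inbreeding: heterozygosity is reduced by a factor (1 − F_IS) relative to Hardy-Weinberg proportions. How the authors actually simulated inbreeding is not stated in this record, so the sampling routine and the names `genotype_probs` and `sample_genotype` are assumptions for illustration.

```python
import random

def genotype_probs(p, f_is):
    """Return (P(0 copies), P(1 copy), P(2 copies)) of an allele with
    frequency p under inbreeding coefficient f_is."""
    q = 1.0 - p
    hom_ref = q * q + f_is * p * q
    het = 2.0 * p * q * (1.0 - f_is)
    hom_alt = p * p + f_is * p * q
    return (hom_ref, het, hom_alt)

def sample_genotype(p, f_is, rng):
    """Draw one allele count (0, 1 or 2) from the inbreeding-adjusted frequencies."""
    return rng.choices([0, 1, 2], weights=genotype_probs(p, f_is))[0]

rng = random.Random(0)
# With F_IS = 1 every individual is homozygous, so no 1s can be drawn.
fully_inbred = [sample_genotype(0.3, 1.0, rng) for _ in range(200)]
# With F_IS = 0 the frequencies reduce to Hardy-Weinberg proportions.
hardy_weinberg = genotype_probs(0.3, 0.0)
```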
Figure 3. Q-matrix plots for the 84-line Brassica napus dataset comparing the performance of PSIKO to other leading methods. The proportion of alleles belonging to each of the clusters is shown by white bars (cluster 1) or black bars (cluster 2).
Figure 4. QQ plots illustrating population structure corrections using the four methods in genome-wide association studies (GWAS) analysis of two traits in the 84-line Brassica napus panel: erucic acid (A) and seed glucosinolate content (B). The expected −log10 P-values (x-axis) are plotted against the observed values (y-axis) from either a general linear model (solid lines) using population structure correction only or a mixed linear model (dashed lines) with population structure and relatedness corrections. The diagonal line is a guide for the perfect fit to the expected −log10 P-values.
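The coordinates of such a QQ plot can be computed as in the sketch below. It pairs the i-th smallest observed P-value with the expected value i / (n + 1) under a uniform null; this plotting position is one common convention and is an assumption here, as the software used for Figure 4 may use another. The function name `qq_points` and the example P-values are made up.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 P, most significant first."""
    n = len(p_values)
    ordered = sorted(p_values)  # smallest P-value gets rank 1
    return [(-math.log10(i / (n + 1)), -math.log10(p))
            for i, p in enumerate(ordered, start=1)]

# Made-up P-values that lie exactly on the uniform grid, in shuffled order.
p_uniform = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
points = qq_points(p_uniform)
```

Points falling on the diagonal, as in this toy example, indicate that the test statistics follow the null distribution; systematic departure above it suggests uncorrected population structure.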