Wei Hao1, Minsun Song1, John D Storey2.
Abstract
MOTIVATION: Modern population genetics studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. An important problem is how to formulate and estimate probabilistic models of observed genotypes that account for complex population structure. The most prominent work on this problem has focused on estimating a model of admixture proportions of ancestral populations for each individual. Here, we instead focus on modeling variation of the genotypes without requiring a higher-level admixture interpretation.
Year: 2015 PMID: 26545820 PMCID: PMC4795615 DOI: 10.1093/bioinformatics/btv641
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. A hierarchical clustering of individuals from the HapMap, HGDP and TGP datasets. A dendrogram was drawn from a hierarchical clustering using Ward distance based on SNP genotypes (MAF ). Whereas the HapMap project shows a definitive discrete population structure (by sampling design), the HGDP and TGP data show the complex structure of human populations.
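The clustering described in the Fig. 1 caption can be illustrated with a minimal sketch. It assumes genotypes are coded as 0/1/2 allele counts in an individuals-by-SNPs matrix and uses SciPy's Ward linkage on Euclidean distances. The function name `ward_cluster`, the toy allele frequencies and the two simulated populations are illustrative assumptions, not the authors' pipeline or data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def ward_cluster(genotypes, n_clusters):
    """Ward hierarchical clustering of individuals.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts
               (hypothetical encoding, assumed for this sketch).
    Returns the linkage matrix and a flat cluster label per individual.
    """
    X = np.asarray(genotypes, dtype=float)
    # method="ward" on raw observations uses Euclidean distance.
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return Z, labels


# Toy data: two strongly differentiated populations, 10 individuals each.
rng = np.random.default_rng(0)
freqs_a = rng.uniform(0.05, 0.30, size=200)
freqs_b = rng.uniform(0.70, 0.95, size=200)
pop_a = rng.binomial(2, freqs_a, size=(10, 200))
pop_b = rng.binomial(2, freqs_b, size=(10, 200))
X = np.vstack([pop_a, pop_b])

Z, labels = ward_cluster(X, n_clusters=2)
```

The linkage matrix `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw a tree. With discretely sampled populations, as in this toy example, the tree splits cleanly, whereas continuously structured samples would not.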
Accuracy in estimating linear bases for
| Scenario | Mean R² (PCA) | Mean R² (LFA) |
|---|---|---|
| TGP fit by PCA | 0.9998 | 0.9722 |
| TGP fit by LFA* | 0.9912 | 0.9990 |
| HGDP fit by PCA | 0.9996 | 0.9614 |
| HGDP fit by LFA* | 0.9835 | 0.9983 |
| BN | 0.9999 | 0.9999 |
| PSD | 0.9998 | 0.9974 |
| PSD | 0.9998 | 0.9879 |
| PSD | 0.9996 | 0.9827 |
| PSD | 0.9993 | 0.9844 |
| Spatial | 0.9999 | 0.9964 |
| Spatial | 0.9999 | 0.9962 |
| Spatial | 0.9999 | 0.9964 |
| Spatial | 0.9998 | 0.9970 |
Column 1 shows the scenario from which the data were simulated. Columns 2 and 3 display the estimation accuracy of the PCA-based method (Column 2) and LFA (Column 3). Column 2 shows the mean R² value when regressing the true on from PCA, averaging across all SNPs. Column 3 shows the mean R² value when regressing the true on from LFA, averaging across all SNPs. All estimated standard errors fell between 10⁻⁶ and 10⁻⁸, so these are not shown. Note that for each scenario, R² values are higher for the method from which the true matrix was generated. All but the two scenarios marked with an asterisk (*) are from Model 1, while the two marked scenarios are from Model 2, where we took
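The accuracy measure in this table note, a per-SNP R² averaged across SNPs, can be sketched as follows. The sketch assumes the true per-SNP quantity is stored as a SNPs-by-individuals matrix and the estimated linear basis as an individuals-by-dimensions matrix, with an intercept added to each regression. The name `mean_r2` and the toy data are hypothetical, and the exact quantities regressed in the paper are not recoverable from the text above.

```python
import numpy as np


def mean_r2(true_values, basis):
    """Mean R^2 over SNPs from regressing each SNP's true values on a basis.

    true_values: (n_snps, n_individuals) array (assumed layout).
    basis:       (n_individuals, d) array of estimated basis vectors.
    """
    Y = np.asarray(true_values, dtype=float).T           # (n_individuals, n_snps)
    B = np.column_stack([np.ones(Y.shape[0]), basis])    # add an intercept column
    coef, *_ = np.linalg.lstsq(B, Y, rcond=None)         # all SNPs in one solve
    resid = Y - B @ coef
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
    return float(np.mean(1.0 - ss_res / ss_tot))


# Toy check: values that lie exactly in the span of the basis give R^2 near 1,
# while an unrelated random basis gives a much lower value.
rng = np.random.default_rng(1)
basis = rng.normal(size=(50, 3))
loadings = rng.normal(size=(100, 3))
true_values = loadings @ basis.T                         # (100 SNPs, 50 individuals)

r2_exact = mean_r2(true_values, basis)
r2_random = mean_r2(true_values, rng.normal(size=(50, 3)))
```

Solving all SNPs in a single least-squares call keeps the computation vectorized, which matters when the number of SNPs is large.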
Fig. 2. Principal component and logistic factor biplots for the HGDP and TGP datasets. The top three principal components from each dataset are plotted in a pairwise fashion in the top panel. The top three logistic factors are plotted analogously in the bottom panel. Both approaches yield similar visualizations of structure.
A comparison of accuracy in estimating π parameters where data were simulated from the PSD model for varying α
[Table body not recoverable; columns: PCA, LFA, ADX, FS]
Methods used are the proposed PCA-based method (Algorithm 1) and LFA method (Algorithms 2 and 3), and two competing methods, ADMIXTURE (ADX) and fastSTRUCTURE (FS), that directly fit the PSD model. The values reported are root mean squared error in the π parameter. See Supplementary Table S1 for more extensive comparisons
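The root mean squared error reported in this table can be computed as in the following minimal sketch. It assumes the true and estimated π parameters are stored as arrays of the same shape. The function name `rmse` and the toy values are illustrative only and do not come from the paper.

```python
import numpy as np


def rmse(pi_true, pi_hat):
    """Root mean squared error between true and estimated pi parameters."""
    pi_true = np.asarray(pi_true, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    if pi_true.shape != pi_hat.shape:
        raise ValueError("pi_true and pi_hat must have the same shape")
    return float(np.sqrt(np.mean((pi_true - pi_hat) ** 2)))


# Toy example: two SNPs by two individuals (made-up values).
pi_true = [[0.1, 0.5], [0.9, 0.3]]
pi_hat = [[0.2, 0.5], [0.8, 0.3]]
error = rmse(pi_true, pi_hat)
```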
Fig. 3. SNPs with highly differentiated allele frequencies with respect to structure. Two of the most highly differentiated SNPs according to LFA are shown for the HGDP and TGP datasets. For each SNP, the values are ordered and colored according to reported ancestry. The horizontal bars on the sides of the plots denote the usual allele frequency estimates formed within each ancestral group.