Liliana Porras-Hurtado, Yarimar Ruiz, Carla Santos, Christopher Phillips, Angel Carracedo, Maria V Lareu.
Abstract
OBJECTIVES: We present an up-to-date review of STRUCTURE software: one of the most widely used population analysis tools that allows researchers to assess patterns of genetic structure in a set of samples. STRUCTURE can identify subsets of the whole sample by detecting allele frequency differences within the data and can assign individuals to those sub-populations based on analysis of likelihoods. The review covers STRUCTURE's most commonly used ancestry and frequency models, plus an overview of the main applications of the software in human genetics including case-control association studies (CCAS), population genetics, and forensic analysis. The review is accompanied by supplementary material providing a step-by-step guide to running STRUCTURE.
Keywords: CLUMPP; STRAT; STRUCTURE; case-control association studies; distruct; population structure; stratification
Year: 2013; PMID: 23755071; PMCID: PMC3665925; DOI: 10.3389/fgene.2013.00098
Source DB: PubMed; Journal: Front Genet; ISSN: 1664-8021; Impact factor: 4.599
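The abstract describes assigning individuals to sub-populations from allele frequency differences and an analysis of likelihoods. The idea can be illustrated with a minimal Python sketch. It is not STRUCTURE's algorithm: STRUCTURE estimates allele frequencies and memberships jointly by MCMC, whereas this sketch assumes the frequencies are already known, assumes Hardy-Weinberg equilibrium and independent loci, and uses a flat prior over populations. The function names and the toy frequencies are hypothetical.

```python
import math


def log_likelihood(genotype, allele_freqs):
    """Log-likelihood of a diploid genotype under one population.

    genotype: list of (allele_a, allele_b) tuples, one per locus.
    allele_freqs: list of dicts mapping allele -> frequency, one per locus.
    Assumes Hardy-Weinberg equilibrium and independence between loci.
    """
    total = 0.0
    for (a, b), freqs in zip(genotype, allele_freqs):
        p_a, p_b = freqs[a], freqs[b]
        # A heterozygote can arise in two allele orders, hence the factor 2.
        prob = p_a * p_b if a == b else 2.0 * p_a * p_b
        total += math.log(prob)
    return total


def assign(genotype, populations):
    """Posterior membership probabilities under a flat prior."""
    logs = {name: log_likelihood(genotype, f) for name, f in populations.items()}
    peak = max(logs.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(v - peak) for name, v in logs.items()}
    norm = sum(weights.values())
    return {name: w / norm for name, w in weights.items()}


# Toy, made-up allele frequencies at two biallelic SNPs (illustration only).
populations = {
    "pop1": [{"A": 0.9, "G": 0.1}, {"C": 0.8, "T": 0.2}],
    "pop2": [{"A": 0.2, "G": 0.8}, {"C": 0.3, "T": 0.7}],
}
individual = [("A", "A"), ("C", "T")]
posterior = assign(individual, populations)
```

With these toy values the individual is assigned mostly to `pop1`, because its A/A genotype at the first SNP is far more probable under the `pop1` frequencies.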
Figure 1. These graphics were obtained with distruct, using CLUMPP to align the three replicates for K = 4 (all runs used a burn-in period of 100,000 iterations followed by 100,000 MCMC repetitions). The exception was the POPINFO parameter sets, for which direct STRUCTURE bar plot outputs were used. Human genetic data comprised the genotypes listed in Table S1 for samples from the HGDP-CEPH human diversity panel: 100 Africans (CEPH AFR), 158 Europeans (CEPH EUR), 165 East Asians (CEPH EAS), and 64 Native Americans (CEPH NAM). An artificial case-control group was created using HapMap Mexican and Puerto Rican samples, giving a total of 67 samples divided into Cases 1 (C1), Cases 2 (C2), and Controls (Ct). Markers were 9 AIM-SNPs (two triallelic), 3 phenotype-associated SNPs, and 5 AIM-SNPs on the X chromosome. The phenotype-associated SNPs and the X-SNPs are linked, forming two distinct linkage disequilibrium groups defined by their genetic distance. Each parameter setting and the results obtained are described in detail in Supplementary Material 1.
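Replicate runs need aligning because cluster labels are arbitrary: the same cluster can appear as column 1 in one run and column 3 in another. The sketch below illustrates the idea with a brute-force search over column permutations. It is not CLUMPP's implementation; the function name and the toy membership matrices are hypothetical, and the exhaustive search is only practical for small K.

```python
from itertools import permutations


def align_replicate(reference, replicate):
    """Permute the cluster columns of `replicate` to best match `reference`.

    Both arguments are membership (Q) matrices: one row of K coefficients
    per individual. Maximizing the sum of products of matched coefficients
    is equivalent to minimizing the squared distance between the matrices,
    because permuting columns does not change a matrix's norm.
    """
    k = len(reference[0])

    def similarity(perm):
        return sum(
            ref_row[c] * rep_row[perm[c]]
            for ref_row, rep_row in zip(reference, replicate)
            for c in range(k)
        )

    best = max(permutations(range(k)), key=similarity)
    return [[row[best[c]] for c in range(k)] for row in replicate]


# Toy Q matrices for three individuals and K = 3 (made-up values).
run1 = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
run2 = [[0.1, 0.0, 0.9], [0.7, 0.2, 0.1], [0.2, 0.8, 0.0]]
aligned = align_replicate(run1, run2)
```

After alignment, the columns of `aligned` follow the cluster order of `run1`, so the replicates can be averaged or plotted with consistent colors.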
Figure 2. Example case-control sample analyses comparing scenarios with and without stratification. STRUCTURE bar plots and STRAT table results are shown. (A) Cases 1 (C1) compared with the Control (Ct) samples. (B) Cases 2 (C2) compared with the Control (Ct) samples. Details of these analyses are described in Supplementary Material 1.
Alternative population analysis programs and their comparison with STRUCTURE.

| Program | Description | Advantages | Limitations | Reference |
|---|---|---|---|---|
| ADMIXTURE | A program for maximum likelihood estimation of individual ancestries from multilocus SNP genotypes. It allows automated choice of the population number and use of known ancestral populations in a supervised learning mode. | ADMIXTURE uses the same statistical model as STRUCTURE. | Models implemented in ADMIXTURE do not explicitly account for linkage disequilibrium (LD) between markers. | Alexander et al., |
| ADMIXMAP | A program for modeling admixture, using marker genotypes and trait data on a sample of individuals from an admixed population, where the markers have been chosen to have strongly differentiated allele frequencies between two or more of the ancestral populations contributing to the admixture. | ADMIXMAP models the dependence of the outcome variable on individual admixture and thus it can adjust for the effect of individual admixture on the outcome variable. | Some unnecessary computational components for inference of individual ancestry are included so computation times are longer. | McKeigue et al., |
| | | It allows the user to supply prior distributions for the allele frequencies. It allows for allelic association (other than that generated by admixture), and is therefore suitable for analysis of datasets in which two or more tightly linked loci (for instance SNPs in the same gene) have been genotyped. | It does not assume a possible correlation between the allele frequencies in each subpopulation. | |
| FRAPPE | A maximum likelihood (ML) approach that estimates individual admixture fractions, using SNP or microsatellite genotype data, that allows for uncertainty in ancestral allele frequencies. | The efficiency of FRAPPE is similar to that of MCMC Bayesian approaches but the computation time is much reduced. | The parameter estimates can be slightly inaccurate due to the relaxed convergence criterion that permits fast termination of the algorithm. | Tang et al., |
| | | When the ancestral groups are small and the markers are not highly informative it can produce less biased estimates. | FRAPPE does not allow the incorporation of known map information and does not model the LD. | |
| | | | It does not provide measures to choose an optimal K value. | |
| EIGENSOFT | A program suite that has two main components, | In the case of non-admixed populations it has a better performance regarding the inference of population stratification. | The degree of admixture from ancestry fractions is not included in the results. | Patterson et al., |
| | | Output results are generated much more rapidly in comparison to STRUCTURE. | | |
| PLINK | A program suite comprising a whole genome association analysis toolset designed to perform a range of basic, large-scale analyses in a computationally efficient manner. It assesses population stratification of whole-genome SNP data through a complete-linkage hierarchical clustering method and it produces a | PLINK facilitates the manipulation and analysis of whole-genome data in a computationally efficient way. | Designed for SNP data only (GWAS output). | Purcell et al., |
| | | It detects and corrects population stratification through identity-by-state and identity-by-descent information. | | |
| Multidimensional scaling | Methods such as unrooted neighbor-joining trees, | These methods run much more rapidly on large datasets compared to model-based methods. | These methods usually include simple LD correction systems that are not as powerful as those of STRUCTURE. | Patterson et al., |
| | | It provides a formal test for the number of significant axes of variation and the presence of population structure in genetic data. | The distance measure and the clustering algorithm can be somewhat arbitrary; that is, the clustering results may change if another definition of distance or clustering method is applied. | |
| | | It outputs each individual's coordinates along axes of variation instead of trying to classify all individuals into discrete populations, which may not always be the correct model for a particular population history. | | |
| LAMP | LAMP (Local Ancestry in adMixed Populations) estimates locus-specific ancestry (local ancestry) in recently admixed populations through a clustering algorithm that operates on sliding windows of contiguous SNPs. It can also be used to estimate each individual's ancestry (global ancestry). | LAMP is more efficient (about 104 times faster than | Designed for SNP data only. | Sankararaman et al., |
| | | It does not require the input of genotypes from unadmixed ancestral populations. | It assumes uncorrelated SNPs. | |
| | | | It requires the input of several parameters such as global ancestry proportions (that can be calculated with programs such as | |
| HAPMIX | A program that uses a haplotype-based method to solve the problem of local ancestry inference in two-way admixed populations. It uses haplotype information to accurately infer segments of chromosomal ancestry in admixed samples, so it is particularly useful in association tests for mapping disease genes in recently admixed populations and for making inferences about human history. | HAPMIX makes more complete use of dense genome-wide data, so it produces more accurate results. | Designed for SNP data only. | Price et al., |
| | | | It only allows analysis of populations that are the result of mixture between two ancestral populations. | |
| ADMIXPROGRAM | This method directly evaluates the likelihood function and maximizes it from the hidden Markov Model for admixture mapping using an EM algorithm. | Allows for uncertainty in model parameters, such as the allele frequencies in the parental populations, the number of generations since admixture occurred and the contribution of ancestry at each generation. | Designed for SNP data only. | Zhu et al., |
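
Several entries in the table rely on pairwise genetic similarity rather than an explicit population model, for example the identity-by-state information mentioned for PLINK. The measure itself can be sketched as follows, assuming genotypes coded as minor-allele counts (0, 1, or 2) with `None` for missing calls. This illustrates the general identity-by-state idea only and is not PLINK's implementation; the function name and sample data are hypothetical.

```python
def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state.

    g1, g2: lists of genotypes coded as minor-allele counts (0, 1, 2),
    with None for missing calls. Loci missing in either individual
    are skipped.
    """
    shared, compared = 0, 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this locus
        compared += 2
    if compared == 0:
        raise ValueError("no loci genotyped in both individuals")
    return shared / compared


# Made-up genotypes at five SNPs for three individuals.
samples = {
    "ind1": [0, 1, 2, 0, None],
    "ind2": [0, 1, 2, 1, 2],
    "ind3": [2, 1, 0, 2, 0],
}
sim_12 = ibs_similarity(samples["ind1"], samples["ind2"])
sim_13 = ibs_similarity(samples["ind1"], samples["ind3"])
```

A matrix of `1 - similarity` values computed for every pair of individuals could then be passed to a hierarchical clustering or multidimensional scaling routine.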