Steven M Mussmann, Marlis R Douglas, Tyler K Chafin, Michael E Douglas.
Abstract
BACKGROUND: Research on the molecular ecology of non-model organisms, while previously constrained, has now been greatly facilitated by the advent of reduced-representation sequencing protocols. However, tools that allow these large datasets to be efficiently parsed are often lacking or, where available, are limited by the need for a comparable reference genome as an adjunct. This, of course, can be difficult when working with non-model organisms. Fortunately, pipelines are currently available that avoid this prerequisite, thus allowing data to be parsed a priori. An oft-used molecular ecology program (i.e., STRUCTURE), for example, is facilitated by such pipelines, yet they are surprisingly absent for a second program that is similarly popular and computationally more efficient (i.e., ADMIXTURE). The two programs differ in that ADMIXTURE employs a maximum-likelihood framework whereas STRUCTURE uses a Bayesian approach, yet both produce similar results. Given these issues, there is an overriding (and recognized) need among researchers in molecular ecology for bioinformatic software that will not only condense output from replicated ADMIXTURE runs, but also infer from these data the optimal number of population clusters (K).
Keywords: ADMIXTURE analysis; Population genomics; Population structure; RADseq; SNP analysis
Year: 2020 PMID: 32727359 PMCID: PMC7391514 DOI: 10.1186/s12859-020-03701-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The workflow for AdmixPipe involves two files as input: 1) a VCF-formatted file of genotypes, and 2) a tab-delimited population map. These proceed through admixturePipeline.py, which handles filtering, file conversion, and execution of Admixture according to user-specified parameters. After completion, the user can submit their output to Clumpak for analysis. The resulting files can then be visualized using distructRerun.py, and variability in cross-validation (CV) values is assessed using cvSum.py.
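The CV-assessment step at the end of this workflow can be illustrated with a minimal Python sketch. It is not the code of cvSum.py; it only assumes that each replicate log contains the line ADMIXTURE prints when run with cross-validation enabled, of the form `CV error (K=3): 0.52341`. The function names `collect_cv`, `summarize_cv`, and `best_k` are hypothetical, and picking the K with the lowest mean CV error is one common heuristic rather than the pipeline's documented rule.

```python
import re
from collections import defaultdict
from statistics import mean, stdev

# Assumed log format: ADMIXTURE reports e.g. "CV error (K=3): 0.52341" with --cv.
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.eE+-]+)")


def collect_cv(log_texts):
    """Group CV errors by K across the text of several replicate logs."""
    by_k = defaultdict(list)
    for text in log_texts:
        for k, err in CV_LINE.findall(text):
            by_k[int(k)].append(float(err))
    return dict(by_k)


def summarize_cv(by_k):
    """Return {K: (mean, sd)}; sd is 0.0 when only one replicate exists."""
    return {
        k: (mean(values), stdev(values) if len(values) > 1 else 0.0)
        for k, values in sorted(by_k.items())
    }


def best_k(summary):
    """Pick the K with the lowest mean CV error (a heuristic, see lead-in)."""
    return min(summary, key=lambda k: summary[k][0])
```

In practice the strings passed to `collect_cv` would be the contents of the replicate log files; a large standard deviation at a given K suggests that replicates disagree and that K should be interpreted with caution.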
Fig. 2 Benchmarking results for AdmixPipe. a) The percent increase in runtime for AdmixPipe exhibits a nearly 1:1 ratio with respect to percent increase in the number of SNPs. Data are based upon pairwise comparisons (% increase) of runtime and input size for four datasets of varying size (61,910 SNPs, 25,851 SNPs, 19,140 SNPs, and 12,527 SNPs; R² = 0.975, degrees of freedom = 58). b) Benchmarking results for a range of K values (K = 1–8; 16 replicates at each K). c) Equivalent results for K = 9–16 (16 replicates at each K). Time for b) and c) is presented in hours on the Y-axis. The number of processor cores (CPU = 1, 2, 4, 8, and 16) was varied across runs. Four data thinning intervals (1, 25, 50, and 100) produced variable numbers of SNPs (61,910, 25,851, 19,140, and 12,527, respectively).
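The pairwise percent-increase comparison described for panel a) can be sketched as follows. This is an illustration under stated assumptions, not the authors' benchmarking script: the SNP counts come from the caption, but the runtimes are synthetic values constructed to scale exactly linearly with SNP count, and the helper names `pct_increase` and `pairwise_increases` are hypothetical. Under exact linear scaling, each pair's percent increase in runtime equals its percent increase in SNPs, which is the 1:1 relationship the caption reports as approximately holding.

```python
from itertools import combinations


def pct_increase(small, large):
    """Percent increase going from `small` to `large`."""
    return 100.0 * (large - small) / small


def pairwise_increases(snps, runtimes):
    """For each pair of datasets, return (% increase in SNPs, % increase in runtime)."""
    data = sorted(zip(snps, runtimes))  # order by SNP count so increases are positive
    return [
        (pct_increase(s1, s2), pct_increase(t1, t2))
        for (s1, t1), (s2, t2) in combinations(data, 2)
    ]


snp_counts = [12527, 19140, 25851, 61910]  # SNP counts from the figure caption
# Synthetic runtimes (NOT measured values): exactly proportional to SNP count.
synthetic_runtimes = [n * 1e-4 for n in snp_counts]

pairs = pairwise_increases(snp_counts, synthetic_runtimes)
```

With measured runtimes in place of the synthetic ones, regressing the second element of each pair on the first would give a slope and R² comparable to those reported in the caption.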