Joshua M Miller, Catherine I Cullingham, Rhiannon M Peery.
Abstract
Inference of genetic clusters is a key aim of population genetics and has sparked the development of numerous analytical methods. Within these, there is a conceptual divide between finding de novo structure and assessing a priori groups. The recently developed Discriminant Analysis of Principal Components (DAPC) combines discriminant analysis (DA) with principal component (PC) analysis. When applying DAPC, the groups used in the DA (specified a priori or described de novo) need to be carefully assessed. While DAPC has rapidly become a core technique, the sensitivity of the method to misspecification of groups, and how it is being applied empirically, are unknown. To address this, we conducted a simulation study examining the influence of a priori versus de novo group designations, and a literature review of how DAPC is being applied. We found that with a priori groupings, distance between genetic clusters reflected underlying FST. However, when migration rates were high and groups were described de novo, there was considerable inaccuracy, both in the number of genetic clusters suggested and in the placement of individuals into those clusters. Nearly all (90.1%) of 224 studies surveyed used DAPC to find de novo clusters, and for the majority (62.5%) the stated goal matched the results. However, most studies (52.3%) omitted key run parameters, preventing repeatability and transparency. Therefore, we present recommendations for standard reporting of parameters used in DAPC analyses. The influence of groupings in genetic clustering is not unique to DAPC, and researchers need to consider their goal and which methods will be most appropriate.
Year: 2020 PMID: 32753664 PMCID: PMC7553915 DOI: 10.1038/s41437-020-0348-2
Source DB: PubMed Journal: Heredity (Edinb) ISSN: 0018-067X Impact factor: 3.821
Conceptual breakdown of how commonly used clustering methods address finding de novo genetic clusters versus visualizing a priori groupings.
| Method | de novo | a priori |
|---|---|---|
| Admixture analysis | Novel genetic clusters discovered through analysis of allele frequencies among “K” groups (Pritchard et al. | Prior groupings can be specified to visualize or assist with clustering (e.g., usepopinfo flag in STRUCTURE (Hubisz et al. |
| Analysis of molecular variance (AMOVA) | Novel genetic clusters are discovered through | Prior groupings are used to assess the proportion of molecular variance assigned among them (Excoffier et al. |
| Assignment tests | N/A | Prior groupings specify known individuals from which population allele frequencies are calculated, novel individuals are then assigned to these populations based on the likelihood of their genotype in the various populations (Paetkau et al. |
| DAPC | Novel genetic clusters are discovered through | Prior groupings are taken and visualized via discriminant analyses (Jombart and Collins |
| F-statistics | N/A | Prior groupings used to assess the genetic distance among them (Weir and Cockerham |
| Phylogenetic approaches | Novel genetic clusters discovered through grouping based on sequence similarity or genetic distance among individuals | Prior genetic clusters can be specified (e.g., forced monophyly) in a series of trees and then tested against one another to see which is more statistically likely (Goldman et al. |
| Principal components analysis (PCA) | Novel genetic clusters discovered through eigenvector decomposition of allele frequencies among individuals (Patterson et al. | N/A |
Fig. 1 Scatterplots of Euclidean distance between DAPC clusters versus FST from our simulated datasets.
Plots distinguish whether DAPC clusters were specified a priori (a) or determined de novo through k-means clustering (b), as well as the marker sets within each.
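The two quantities compared in Fig. 1 can be illustrated with a minimal Python sketch. It assumes that cluster distance is measured between cluster centroids in discriminant-function space, and it uses a simple heterozygosity-based FST for a single biallelic locus in two equally sized populations. Both choices, and the function names `centroid_distance` and `fst_biallelic`, are illustrative assumptions and not necessarily the calculations used in the study.

```python
import math


def centroid(points):
    """Mean position of a cluster; each point is a list of coordinates."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[i] for p in points) / n for i in range(dims)]


def centroid_distance(cluster_a, cluster_b):
    """Euclidean distance between the centroids of two clusters."""
    return math.dist(centroid(cluster_a), centroid(cluster_b))


def fst_biallelic(p1, p2):
    """F_ST = (H_T - H_S) / H_T for one biallelic locus.

    p1 and p2 are the frequencies of the same allele in two populations,
    assumed here to be of equal size (an illustrative simplification).
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)                      # expected total heterozygosity
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population value
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical coordinates on two discriminant functions
cluster_a = [[0.0, 0.0], [2.0, 0.0]]
cluster_b = [[4.0, 4.0], [4.0, 4.0]]
print(centroid_distance(cluster_a, cluster_b))
print(fst_biallelic(0.9, 0.1))
```

Under this sketch, populations with identical allele frequencies give an FST of 0 and populations fixed for alternative alleles give an FST of 1, so a plot of centroid distance against FST shows how well cluster separation tracks differentiation.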
Results of generalized linear models examining factors associated with clustering success. Effect estimates are shown along with their standard errors.
| Intercept | Marker | df | AICc | |||
|---|---|---|---|---|---|---|
| −4.78 (0.42)* | 25.97 (2.02)* | 1.44 (0.33)* | 3 | 356.0 | ||
| −3.72 (0.26)* | 23.88 (1.87)* | 0.54 (0.29) | 3 | 373.7 | ||
| −3.51 (0.26)* | 23.57 (1.85)* | 2 | 375.1 |
*Term significant with p < 2 × 10⁻¹⁶.
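To show how the AICc column can be read, the following Python sketch converts the three AICc values in the table into differences from the best model and Akaike weights. The labels `model_1` to `model_3` are hypothetical placeholders for the table rows in order, and the use of Akaike weights is an illustration rather than a step reported in the source.

```python
import math

# AICc values from the three rows of the table, in order.
# The model labels are placeholders, not names from the source.
aicc = {"model_1": 356.0, "model_2": 373.7, "model_3": 375.1}


def akaike_weights(scores):
    """Return (delta AICc, Akaike weight) dictionaries for a set of models."""
    best = min(scores.values())
    deltas = {name: value - best for name, value in scores.items()}
    relative = {name: math.exp(-0.5 * d) for name, d in deltas.items()}
    total = sum(relative.values())
    weights = {name: r / total for name, r in relative.items()}
    return deltas, weights


deltas, weights = akaike_weights(aicc)
for name in aicc:
    print(name, round(deltas[name], 1), round(weights[name], 4))
```

With differences of roughly 17.7 and 19.1 AICc units, nearly all of the weight falls on the first model in this sketch.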
Fig. 2 Scatter plot of the relationship between FST from our simulated datasets and whether a cluster was successfully formed by find.clusters(), for N values of 100 (black crosses) or 500 (gray circles).
Curves show predictions from binomial generalized linear models for N values of 100 (black curve) or 500 (gray curve).
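As a rough illustration of how such curves are produced, a binomial generalized linear model with a logit link predicts the probability of success as the inverse logit of a linear predictor. The Python sketch below uses placeholder coefficients (`intercept=-4.0`, `slope=25.0`) and a hypothetical `offset` argument to shift the curve for a second N group. These values are assumptions chosen for illustration and are not the fitted estimates from the study.

```python
import math


def predict_success(fst, intercept, slope, offset=0.0):
    """Predicted probability of successful clustering at a given F_ST.

    Inverse-logit of (intercept + slope * fst + offset). The offset is a
    hypothetical group effect, e.g. for a second sample-size class.
    """
    eta = intercept + slope * fst + offset
    return 1.0 / (1.0 + math.exp(-eta))


# Placeholder coefficients, NOT the fitted values from the study
intercept, slope = -4.0, 25.0

for fst in (0.0, 0.05, 0.10, 0.16, 0.25):
    base = predict_success(fst, intercept, slope)
    shifted = predict_success(fst, intercept, slope, offset=1.0)
    print(f"FST={fst:.2f}  base={base:.3f}  shifted={shifted:.3f}")
```

In this parameterization the predicted probability crosses 0.5 where the linear predictor equals zero, which for the placeholder values is at an FST of 0.16. A positive offset moves that crossing point to a lower FST.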
Fig. 3 Temporal trends in reported parameters from our literature review of studies using the DAPC method.
a Trends in the yearly proportion of studies that stated the method used to determine the optimal number of clusters (solid line with squares), stated the method used to determine the optimal number of PCs to retain (dotted line with triangles), and reported the final number of PCs retained (dashed line with circles). b Trends in the yearly proportion of studies reporting use of the a-score (solid line with squares), xval (dotted line with triangles), or cumulative variance (dashed line with circles) approach to determine the optimal number of PCs to retain.
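Of the three approaches named in panel b, the cumulative variance approach is the simplest to sketch: retain the smallest number of PCs whose eigenvalues together reach a chosen share of the total variance. The Python example below is a minimal illustration under that assumption. The function name, the example eigenvalues, and the 0.8 default threshold are hypothetical and not taken from the source.

```python
def pcs_for_cumulative_variance(eigenvalues, threshold=0.8):
    """Smallest number of PCs whose cumulative variance reaches the threshold.

    eigenvalues: PCA eigenvalues (variance explained by each PC).
    threshold: target proportion of total variance, in (0, 1].
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for n_pcs, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return n_pcs
    return len(ordered)


# Hypothetical eigenvalues from a PCA of allele frequencies
eigenvalues = [5.0, 3.0, 1.0, 1.0]
print(pcs_for_cumulative_variance(eigenvalues, threshold=0.75))
```

Whatever approach is used, reporting both the method and the resulting number of PCs retained is what allows the analysis to be repeated.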