Abstract
Inference of population structure from genetic markers is helpful in diverse situations, such as association and evolutionary studies. In this paper, we describe a two-stage strategy for inferring population structure from multilocus genotype data. In the first stage, we use dimension reduction methods such as singular value decomposition to reduce the dimension of the data; in the second stage, we apply clustering methods to the reduced data to identify population structure. The strategy can identify population structure and assign each individual to its corresponding subpopulation. It does not depend on any population genetics assumptions (such as Hardy-Weinberg equilibrium and linkage equilibrium between loci within populations) and can be used with any genotype data. When applied to real and simulated data, the strategy is found to perform similarly to or better than STRUCTURE, the most popular method in current use. The proposed strategy therefore provides a useful alternative for analysing population data.
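The two-stage strategy described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: genotypes are assumed to be coded as allele counts (0, 1, 2), each marker is centred before the SVD, and the second stage uses a plain k-means with farthest-point seeding. The function names `reduce_genotypes` and `kmeans`, the toy allele frequencies, and the sample sizes are all hypothetical.

```python
import numpy as np

def reduce_genotypes(genotypes, n_components):
    """Stage 1: project individuals onto the leading singular vectors."""
    X = np.asarray(genotypes, dtype=float)
    X = X - X.mean(axis=0)  # centre each marker (column); an assumption of this sketch
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def kmeans(points, k, n_iter=100):
    """Stage 2: Lloyd's k-means with deterministic farthest-point seeding."""
    centers = [points[0]]
    for _ in range(1, k):
        # squared distance from every point to its nearest existing centre
        d = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy data (hypothetical): two populations with independent allele frequencies.
rng = np.random.default_rng(0)
n_per_pop, n_markers = 50, 1000
freq_a = rng.uniform(0.05, 0.95, n_markers)
freq_b = rng.uniform(0.05, 0.95, n_markers)
pop_a = rng.binomial(2, freq_a, size=(n_per_pop, n_markers))
pop_b = rng.binomial(2, freq_b, size=(n_per_pop, n_markers))
genotypes = np.vstack([pop_a, pop_b])

scores = reduce_genotypes(genotypes, n_components=2)  # stage 1
labels = kmeans(scores, k=2)                          # stage 2
```

Neither stage relies on Hardy-Weinberg or linkage equilibrium; the sketch only uses the geometry of the genotype matrix. Other clustering methods, such as a mixture model, could replace k-means in the second stage.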
Year: 2006 PMID: 16848973 PMCID: PMC3525165 DOI: 10.1186/1479-7364-2-6-353
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
Correlation coefficients between the estimates of STRUCTURE and the singular value decomposition (SVD)-based method from the full data of Rosenberg et al. [5]
| Kᵃ | Cluster/population 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 2 | 0.9602 | 0.9602 | | | | |
| 3 | 0.9756 | 0.9695 | 0.9826 | | | |
| 4 | 0.9824 | 0.9853 | 0.9667 | 0.9708 | | |
| 5 | 0.9602 | 0.9836 | 0.9470 | 0.9719 | 0.5715 | |
| 6 | 0.5688 | 0.9473 | 0.9473 | 0.9719 | 0.9642 | 0.9596 |

ᵃ K is the number of clusters/populations.
Note: For the SVD-based method, the first five principal components were used. The mixture model was used for clustering.
Results from STRUCTURE 2.0 on the Pima-Surui data with 100 randomly selected markers without missing genotypes.
| Kᵃ | Run 1ᵇ | Run 2ᵇ | Run 3ᵇ | Run 4ᵇ | Run 5ᵇ |
|---|---|---|---|---|---|
| 1 | -10849.6 | -10847.8 | -10850.1 | -10848.9 | -10849.5 |
| 2 | -9616.9 | -9619.3 | -9619.4 | -9629.2 | -9614.2 |
| 3 | -9417 | -9418.3 | -9412.9 | -9529.3 | -9498.3 |
| 4 | -9648.8 | -9369.8 | -9557 | -9397.9 | -9445.7 |
| 5 | -9303.7 | -9472.3 | -9346.6 | -9405.1 | -9317 |
| 6 | -9541.6 | -10457.2 | -9315.6 | -9484.1 | -11348.8 |
| 7 | -9408.4 | -10576.9 | -10151.4 | -9443.4 | -10938.6 |
| 8 | -9451.9 | -9369.1 | -10715.4 | -9393.1 | -10304.5 |
| 9 | -10403.1 | -9450.7 | -9489.1 | -10100.8 | -9205.1 |

ᵃ K is the number of clusters/subpopulations; ᵇ different runs of STRUCTURE 2.0.
Figure 1. Pairwise cosine similarities between individuals in the reduced two-dimensional space of the Pima-Surui data with 100 randomly selected markers without missing genotypes.
Figure 2. Pairwise cosine similarities between individuals using the original Pima-Surui data with the same 100 markers as in Figure 1.
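The pairwise cosine similarity shown in Figures 1 and 2 is the dot product of two individuals' coordinate vectors divided by the product of their lengths. A minimal sketch of the computation follows; the coordinates are made-up two-dimensional values, not the Pima-Surui data, and returning 0 for a zero vector is a convention chosen here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two coordinate vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention of this sketch: similarity with a zero vector is 0
    return dot / (norm_u * norm_v)

def pairwise_cosine(rows):
    """Full similarity matrix between all pairs of individuals."""
    return [[cosine_similarity(u, v) for v in rows] for u in rows]

# Hypothetical reduced coordinates: two individuals on each side of the first axis.
coords = [[2.0, 0.1], [1.5, -0.2], [-1.8, 0.3], [-2.2, 0.0]]
sim = pairwise_cosine(coords)
```

Individuals from the same group give similarities near 1, and individuals from opposite groups give values near -1. The same function applies unchanged to the original marker vectors.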
The performance of STRUCTURE 2.0 and the singular value decomposition (SVD)-based method on the simulated data.
| Number of single nucleotide polymorphisms | Misclassified individuals (STRUCTURE) | Misclassified individuals (SVD-based) |
|---|---|---|
| 401 | 36 (12%) | 3 (1%) |
| 453 | 34 (11.3%) | 1 (0.3%) |
| 494 | 3 (1%) | 0 (0%) |
Note: The numbers in the table are the numbers of misclassified individuals and the numbers in parentheses are the misclassification rates. For the SVD-based method, the K-means method was employed for clustering using 30 principal components. The three datasets in the table correspond to the combinations of the first eight, nine and ten chromosomal segments from the original simulated data, respectively.
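Because cluster labels are arbitrary, counting misclassified individuals requires first matching clusters to the true populations. The sketch below tries every assignment of cluster labels to population labels and keeps the one with the fewest errors. It is an illustrative assumption about how such counts can be computed, not a description of the procedure used for the table; the function name `count_misclassified` and the toy labels are hypothetical. The percentages in the table are consistent with 300 individuals per dataset (for example, 36/300 = 12%), although the sample size is not stated here.

```python
from itertools import permutations

def count_misclassified(true_labels, cluster_labels):
    """Smallest number of errors over all cluster-to-population matchings."""
    true_ids = sorted(set(true_labels))
    cluster_ids = sorted(set(cluster_labels))
    if len(cluster_ids) > len(true_ids):
        raise ValueError("more clusters than populations")
    best = len(true_labels)
    # Try each one-to-one assignment of cluster ids to population ids.
    for perm in permutations(true_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        errors = sum(mapping[c] != t for t, c in zip(true_labels, cluster_labels))
        best = min(best, errors)
    return best

# Toy example: ten individuals, one of population "A" placed in the wrong cluster.
true_labels = ["A"] * 5 + ["B"] * 5
cluster_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
errors = count_misclassified(true_labels, cluster_labels)
rate = errors / len(true_labels)
```

The exhaustive search over permutations is only practical for a small number of populations, which is the case in the table above.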
Figure 3. The results of STRUCTURE and the singular value decomposition (SVD)-based method on the simulated data set of Tang [12]. For the SVD-based method, ten principal components were used and the K-means method was used for clustering. The individual admixture (IA) values were calculated using the method of Nascimento et al. [41].
Figure 4. Illustration of the formation of the initial partitioning by density-based mean clustering on the first two principal components of the Pima-Surui data with 100 randomly selected markers without missing genotypes. Blue circles represent the original points in the two-dimensional space after singular value decomposition. Downward triangles indicate the starting points for each cluster, and the numbers over the triangles (1 and 2 here) indicate the order in which the starting points were identified. Points with black plus signs are those identified as the first cluster by Gap statistics; points with red crosses are those identified as the second cluster.
Figure 5. The first three principal components from logistic principal component analysis of the Pima-Surui data, using the criterion that the change in log likelihood is less than 0.1.