Markus Neuditschko, Mehar S Khatkar, Herman W Raadsma.
Abstract
High-throughput sequencing and single nucleotide polymorphism (SNP) genotyping can be used to infer complex population structures. Fine-scale population structure analysis tracing individual ancestry remains one of the major challenges. Based on network theory and recent advances in SNP chip technology, we investigated an unsupervised network clustering method called Super Paramagnetic Clustering (Spc). When applied to whole-genome marker data, it identifies the natural divisions of groups of individuals into population clusters without the use of prior ancestry information. Furthermore, we optimised an analysis pipeline called NetView, a high-definition network visualization, starting with computation of genetic distance, followed by clustering using Spc and finally visualization of clusters with Cytoscape. We compared NetView against commonly used methodologies, including Principal Component Analysis (PCA) and a model-based algorithm, Admixture, on genome-wide SNP data derived from three previously described data sets: simulated (2.5 million SNPs, 5 populations), human (1.4 million SNPs, 11 populations) and cattle (32,653 SNPs, 19 populations). We demonstrate that individuals can be effectively allocated to their correct population whilst simultaneously revealing fine-scale structure within the populations. Analyzing the human HapMap populations, we identified unexpected genetic relatedness among individuals, and population stratification within the Indian, African and Mexican samples. In the cattle data set, we correctly assigned all individuals to their respective breeds and detected fine-scale population sub-structures reflecting different sample origins and phenotypes. The NetView pipeline is computationally extremely efficient and can be easily applied to large-scale genome-wide data sets to assign individuals to particular populations and to reproduce fine-scale population structures without prior knowledge of individual ancestry.
NetView can be used on any data from which a genetic relationship/distance between individuals can be calculated.
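The first pipeline step named in the abstract is the computation of a genetic distance between individuals. The abstract does not say which distance is used, so the following is only a minimal sketch of one common choice, an allele sharing distance (one minus the mean identity-by-state), and is not necessarily the authors' method. The function name `allele_sharing_distance`, the 0/1/2 genotype coding and the toy matrix are assumptions made for illustration.

```python
import numpy as np


def allele_sharing_distance(genotypes):
    """Pairwise distance = mean per-SNP allele difference, scaled to [0, 1].

    genotypes: (n_individuals, n_snps) array of allele counts coded 0, 1 or 2.
    Missing genotypes are not handled in this sketch.
    """
    g = np.asarray(genotypes, dtype=float)
    n_ind, n_snps = g.shape
    dist = np.zeros((n_ind, n_ind))
    for i in range(n_ind):
        # Allele-count difference between individual i and everyone else (0, 1 or 2 per SNP).
        diff = np.abs(g - g[i])
        # Divide by the maximum possible difference (2 per SNP).
        dist[i] = diff.sum(axis=1) / (2.0 * n_snps)
    return dist


# Hypothetical toy data: individuals 0 and 1 are identical, individual 2 differs.
toy = np.array([
    [0, 1, 2, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 2],
])
d = allele_sharing_distance(toy)
print(d)
```

Any symmetric distance or relationship matrix of this shape could be passed on to the clustering step in the same way.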
Year: 2012 PMID: 23152744 PMCID: PMC3485224 DOI: 10.1371/journal.pone.0048375
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Spc and NetView analysis of the simulated data set.
(A) Population structure used for creating the simulated data (adapted from Lawson et al. [19]). (B) Spc tree of clusters representing the grouping of individuals with k-NN = 10. The individuals have been separated into 5 clusters, representing the three main populations and the additional existence of two sub-populations (PopA1 and PopA2, PopB1 and PopB2). Each cluster is represented by a box; the Y-axis position indicates the stability of each cluster, whilst the X-axis position indicates the proximity between clusters. (C) High-definition network visualization (NetView) of the simulated population structure. Each individual is represented by a node, and the different shades denote the sample origin. The thickness of an edge varies in proportion to the genetic distance and is used to visualize individual relationships within and between populations. The node size varies in proportion to the number of edges per node and illustrates how well each individual is connected within the population.
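The network described in panel C can be illustrated with a small k-nearest-neighbour graph built from a distance matrix, where each edge carries the genetic distance and the number of edges per node gives the node size. This is a sketch under assumptions: the helper names `knn_edges` and `node_degrees`, the toy matrix and the rule that an edge exists if either individual lists the other among its k nearest neighbours are all illustrative and are not taken from the NetView or Spc implementation.

```python
import numpy as np


def knn_edges(dist, k):
    """Return {(i, j): distance} with i < j for an undirected k-NN graph."""
    n = dist.shape[0]
    edges = {}
    for i in range(n):
        order = np.argsort(dist[i], kind="stable")
        # Skip the individual itself and keep its k closest neighbours.
        neighbours = [int(j) for j in order if j != i][:k]
        for j in neighbours:
            edges[(min(i, j), max(i, j))] = float(dist[i, j])
    return edges


def node_degrees(edges, n):
    """Count edges per node (used for node size in the figure)."""
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    return deg


# Hypothetical toy distances: two tight pairs (0, 1) and (2, 3) that are far apart.
dist = np.array([
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
])
edges = knn_edges(dist, k=1)
degrees = node_degrees(edges, n=4)
print(edges, degrees)
```

An edge list of this kind (source, target, distance) is the sort of table a network viewer such as Cytoscape can import, with the distance mapped to edge thickness.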
Figure 2. PCA scatter plots of the simulated data set.
Projection of individuals from 5 populations onto a two-dimensional (X,Y) subspace of four PCs. Panels A to D show pairwise comparisons of PC combinations. Each individual is represented by a data point, and each sub-population is denoted by a separate colour. The variation captured by each PC is indicated in parentheses next to the axis label.
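The projection and the per-PC variation mentioned in this caption can be sketched with a plain singular value decomposition. This is a generic PCA on a centred genotype matrix, not the specific software or scaling used for the figure. The function name `pca`, the simulated allele frequencies and the random seed are assumptions for illustration.

```python
import numpy as np


def pca(genotypes, n_components=4):
    """Return PC scores and the fraction of variance captured by each PC."""
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)  # centre each SNP column
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u * s  # coordinates of individuals on the PCs
    explained = s ** 2 / np.sum(s ** 2)
    return scores[:, :n_components], explained[:n_components]


# Hypothetical toy data: two populations with very different allele frequencies.
rng = np.random.default_rng(0)
pop1 = rng.binomial(2, 0.1, size=(10, 200))
pop2 = rng.binomial(2, 0.9, size=(10, 200))
scores, explained = pca(np.vstack([pop1, pop2]))
print(explained)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, with `explained` in the axis labels, gives a scatter plot of the kind the caption describes.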
Figure 3. Cluster assignment of the simulated population data following analysis by Admixture using 2–5 clusters (K).
Each individual is represented by a single vertical column divided into K colours. Each colour represents one cluster, and the length of the coloured segment corresponds to the individual's estimated proportion of membership in that cluster. For each K, 10 iterations were performed. Panels A to D represent the cluster patterns at K = 2 to 5.
Figure 4. Spc and NetView analysis of the human HapMap reference population after removal of closely related individuals.
(A) Spc tree of clusters representing the grouping of 1,159 unrelated individuals. All individuals have been separated into 11 clusters, representing 9 distinct populations and the existence of sub-structures within GIH and MKK samples. (B) NetView of the 1,159 assumed unrelated individuals. The topology of the network highlights the sub-structures within GIH, MXL and MKK and reveals a close relationship between CEU and TSI as well as between ASW and MKK. The identified outliers and key individuals of the population are indicated by their HapMap ID.
Figure 5. Alternative (organic) NetView of populations with evidence of internal sub-structures.
Organic visualization style of (A) ASW/MKK and (B) TSI/CEU as implemented in the Cytoscape software [59]. The network structure of this visualization highlights the existence of sub-structures and clearly identifies cross-linking individuals.
Figure 6. Spc and NetView analysis of the Bovine HapMap data.
(A) Spc tree of clusters representing the grouping of 477 animals represented in the Bovine HapMap data set [31]. The animals have been allocated into 19 clusters, representing 18 out of 19 breeds and the existence of sub-structures within JER (JER_1 and JER_2), and a merged Angus cluster (ANG and RGU). (B) NetView of 477 bovine HapMap samples from Bos taurus, Bos indicus and admixed origins. The topology of the network reflects the genetic relatedness between cattle breeds and reveals sub-structures within the LMS, SHK and ANG clusters.
Overall comparison of the three different approaches currently applied to study genome-wide population structures.
| Name | Applicability | Performance |
| --- | --- | --- |
| NetView | Genotype or haplotype based distance matrix | Determines population clusters based on […]. Creates a hierarchical structure of population clusters. Provides a high-definition network visualization, which allows the identification of fine-scale population structure and closely related individuals (duos and trios). Visualizes the relatedness of individuals within populations. Can also be used to determine genome-wide membership proportions (unpublished result). |
| It can be recommended to start with […] There are also other tuning parameters (Delta T and number of Potts spins), but they are not as important as the […] determine the optimal number of […] | | |
| PCA | Any similarity matrix or genotype matrix (Individuals/SNPs) | Creates a visualization of fine-scale population structure in 2- or 3-dimensional space. Calculates the variation which is explained by each principal component (PC). Determines the optimal number of K. |
| […] optimal number of clusters (K), PCA has to be applied in combination with a clustering tool, e.g. […] of the cluster result, the number of significant PCs has to be determined with a test statistic, e.g. Horn's parallel analysis (PA). The visualization is limited to 3D space. | | |
| Admixture | Genotype matrix | Computes individual membership proportions between individuals for any given number of clusters (K). Identifies admixed individuals. Determines the optimal number of K with an implemented cross-validation procedure. Provides a hierarchical population structure and FST estimates between the populations. |
| Takes different model parameters as input. Determines the number of clusters with a cross-validation procedure, which is not appropriate for all data sets, e.g. the simulated population structure results. Provides no convergence for cluster runs when applied to data sets with K > 10. Compared to the other two methods, this procedure becomes computationally demanding when applied to fully sequenced data with millions of polymorphic sites. | | |
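The table mentions Horn's parallel analysis (PA) as one test for the number of significant PCs. A minimal sketch of the general technique follows: eigenvalues of the observed correlation matrix are compared against eigenvalues obtained from random data of the same shape, and leading components are retained while the observed eigenvalue exceeds the random threshold. The function name `parallel_analysis`, the 95th percentile threshold, the number of random replicates and the toy data are assumptions, and this is not the procedure or software used in the paper.

```python
import numpy as np


def parallel_analysis(x, n_iter=100, percentile=95, seed=0):
    """Return (number of retained PCs, observed eigenvalues, random thresholds).

    x: (n_samples, n_variables) array; columns must not be constant,
    otherwise the correlation matrix is undefined.
    """
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    rng = np.random.default_rng(seed)

    def sorted_eigenvalues(data):
        corr = np.corrcoef(data, rowvar=False)
        return np.sort(np.linalg.eigvalsh(corr))[::-1]

    observed = sorted_eigenvalues(x)
    # Eigenvalues expected when the variables are uncorrelated noise.
    random_eigs = np.array([
        sorted_eigenvalues(rng.standard_normal((n, p))) for _ in range(n_iter)
    ])
    threshold = np.percentile(random_eigs, percentile, axis=0)

    n_significant = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:
            break
        n_significant += 1
    return n_significant, observed, threshold


# Hypothetical toy data: six variables driven by a single shared factor plus noise.
rng = np.random.default_rng(1)
factor = rng.standard_normal(200)
toy = factor[:, None] * np.ones(6) + 0.5 * rng.standard_normal((200, 6))
n_sig, observed, threshold = parallel_analysis(toy)
print(n_sig)
```

For genotype data, monomorphic SNPs would have to be removed before computing correlations, since constant columns make the correlation matrix undefined.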