Literature DB >> 21712187

Inter-chromosomal variation in the pattern of human population genetic structure.

Abstract

Emerging technologies now make it possible to genotype hundreds of thousands of genetic variations in individuals, across the genome. The study of loci at finer scales will facilitate the understanding of genetic variation at genomic and geographic levels. We examined global and chromosomal variations across HapMap populations using 3.7 million single nucleotide polymorphisms to search for the most stratified genomic regions of human populations and linked these regions to ontological annotation and functional network analysis. To achieve this, we used five complementary statistical and genetic network procedures: principal component (PC), cluster, discriminant, fixation index (FST) and network/pathway analyses. At the global level, the first two PC scores were sufficient to account for major population structure; however, chromosomal level analysis detected subtle forms of population structure within continental populations, and as many as 31 PCs were required to classify individuals into homogeneous groups. Using recommended population ancestry differentiation measures, a total of 126 regions of the genome were catalogued. Gene ontology and networks analyses revealed that these regions included the genes encoding oculocutaneous albinism II (OCA2), hect domain and RLD 2 (HERC2), ectodysplasin A receptor (EDAR) and solute carrier family 45, member 2 (SLC45A2). These genes are associated with melanin production, which is involved in the development of skin and hair colour, skin cancer and eye pigmentation. We also identified the genes encoding interferon-γ (IFNG) and death-associated protein kinase 1 (DAPK1), which are associated with cell death, inflammatory and immunological diseases. An in-depth understanding of these genomic regions may help to explain variations in adaptation to different environments. Our approach offers a comprehensive strategy for analysing chromosome-based population structure and differentiation, and demonstrates the application of complementary statistical and functional network analysis in human genetic variation studies.

Entities: CellLine Chemical Disease Gene Mutation Species

Mesh：

Substances：

Year: 2011 PMID： 21712187 PMCID： PMC3326352 DOI： 10.1186/1479-7364-5-4-220

Source DB: PubMed Journal: Hum Genomics ISSN： 1473-9542 Impact factor: 4.639

Introduction

The comprehensive identification and control of population genetic structure and dissection of polymorphism are important steps in genomic studies aimed at gene mapping through (either directly or indirectly) linkage disequilibrium (LD) [1-4]. Previous estimates of population structure have provided tremendous insight into population genetics and human evolution, and have increased our knowledge of the distribution of genetic variation and relationships among human populations [5-8]. Until recently, however, these studies have been based on limited numbers of loci/genes or small fractions of the genome and thus have provided only average estimates of quantities such as fixation index (FST) across whole genomes and populations [9]. The study of a few genes with significant population differentiation can be extremely efficient -- for example, in screening for potential tumour markers or drug targets. Such analyses do not reach the full potential of genome-wide experiments to increase our understanding of whole biological processes, however. What is needed instead is a holistic approach to analysing the entire genome which shows major population differentiation and allows biologists to develop an integrated understanding of the functional networks/pathways driving population diversity [10]. Pääbo [11] suggested that, in variation studies, rather than 'populations', 'ethnicities' or 'races', a more efficient approach for studying within- and between-individual human chromosomal variation is to consider the genome of any particular individual as a mosaic of haplotype blocks. To date, analyses of the relationship between genetic variation and ancestral geographic origin have been limited to a few regions or genes because large-scale, genome-wide single nucleotide polymorphism (SNP) data from geographically diverse individuals have not been available. Given that levels of diversity/polymorphism are directly related to recombination (meiosis) and mutation rates that differ within and among chromosomes,[12] and that genes are not randomly distributed along chromosomes, the precise genes contributing to disease development and genealogy are not the same on each chromosome or part thereof [13-19]. Recently, Grimwood et al. [20] showed that the gene density on chromosome 19 is more than double the genome-wide average. Hence, the precision of equal segregation estimates of recombination fraction for all chromosomes and population-average values are not reliable, in terms of chromosome biological and evolutionary significance. The Santa Cruz Biotechnology group also announced a human chromosome database that features a chromosome-based index, which includes the chromosomal location of known human genes and links to the National Center for Biotechnology Information (NCBI) mRNA, protein and Online Mendelian Inheritance in Man (OMIM) databases (http://www.scbt.com). With the growing emphasis on dense SNPs and genome-wide association studies, and the recent accumulation of large, publicly available data-sets [21,22] -- such as the completion of HapMap,[23] with over 3.7 million SNPs across the genome -- there is an increasing need not only for fine-scale resolution of clines of population structure, but also to identify functional pathways in genomic regions of major population differentiation with influences on disease risk [24]. The thorough evaluation of the extent of fine-scale genetic structure among closely neighbouring populations, as well as the study of the ability to infer individual membership down to a particular population within a continent, have only begun in the past five years. The objectives of this study were as follows: (i) to examine the extent and patterns of within-and between-chromosomal variations; (ii) to determine population genetics structure and population membership and (iii) to identify SNPs with major population differentiation and link this information with ontological annotation and functional networks/pathways.

Materials and methods

Data mining, processing and description

We downloaded the HapMap SNP data (http://www.hapmap.org, release # 24, on NCBI B36 assembly, dbSNP b126). The HapMap project contains genotypes from 60 unrelated Caucasians from the USA with northern and western European ancestry (CEU), 60 unrelated Yoruba individuals from Ibadan, Nigeria (YRI), 45 Japanese individuals from Tokyo (JPT) and 45 Han Chinese individuals from Beijing (CHB) (http://www.hapmap.org). Two criteria were used to filter the SNPs included in the analysis: (i) locus call rate ≥95 per cent (ie we excluded all SNPs with more than 5 per cent missing data) and (ii) the SNP should be shared among populations so that the same sets of SNPs were used throughout in the population comparisons. A computer program using Python (http://www.python.org) was written to export and pre-process SNP genotype information from the databases. Genotypes were summarised for each population. For each dataset, the number of alleles per locus (SNP) was coded to a string of numbers to obtain a full design matrix of alleles (the cells give the number of copies of each major allele for each individual: zero, one or two). Figure 1 depicts our approach to SNP mining, multivariate chromosomal and population diversity and network analysis strategies. Of the total 3.7 million SNPs in the HapMap data release,[23,25] 809,000 SNPs fulfilled the criteria and were used in this analysis.

Figure 1

Schematic presentation of single nucleotide polymorphism (SNP) mining, multivariate chromosomal and population diversity and network analysis strategies. There are ~3.7 million SNPs in the HapMap data release. Genotypes were summarised for each population. For each dataset, the number of alleles per locus (SNP) was coded to a string of numbers to obtain a full design matrix of alleles (the cells give the number of copies of each major allele for each individual: 0, 1 or 2). Two criteria were used to filter the SNPs included in the analysis: (i) locus call rate ≥ 95 per cent (ie we excluded all SNPs with more than 5 per cent missing data); and (ii) the SNP should be shared among populations, so that the same sets of SNPs were used throughout in the population comparisons. From the total of ~3.7 million SNPs in the HapMap data release, only 809,624 SNPs were eligible for analysis.

Statistical analysis

Multivariate statistical techniques (namely, principal component [PC], cluster, discriminant, network analyses and FST statistics) were used to examine chromosomal structure within and between populations and associated functional networks by estimating chromosomal overall differentiation values. The analysis was carried out either using all SNPs together or separately for each chromosome. Because PC analysis (PCA) does not take into account group differences in reducing the dataset to a few representative variables, and it can be difficult to make appropriate inference about population relationships from the PC scatter plot, we further analysed the data using cluster analysis (CA) to classify individuals into mutually exclusive groups with high homogeneity within clusters and with low homogeneity between clusters. In other words, CA provides a visual assessment and identifies individuals who are similar (or dissimilar) to one another. To further confirm the grouping obtained in CA, discriminate analysis (DA) was performed. DA consists of the separation of a priori given classes for each individual. The variance-covariance between classes is maximised and the variance-covariance within classes is minimised under simultaneous consideration of all analysed data. PCA was done using the EIGENSOFT software package (http://genepath.med.harvard.edu/~reich/Software.htm) either on all SNPs simultaneously (all loci together) or separately per each chromosome. The analysis follows singular value decomposition, a procedure that produces eigenvectors, corresponding eigenvalues and proportions of eigenvalues, as well as the scores of the PCs [26]. Using PCA, we estimated axes of variation corresponding to ancestry. The first eigenvector separates the samples in a way that explains the largest amount of variability, while the second and subsequent ones explain lesser amounts of variability. The spatial relationships of populations in each chromosome and all chromosomes were presented by plotting the scores of the first and second PCs. The numbers of significant PCs (at the level of p < 0.05) were tested using Tracy-Widom statistics. Pairwise population genetic diversity was determined by calculation of Wright's FST using EIGENSOFT. FST values indicate how much of the genetic variability between individuals from different populations is due to population affiliation. Hierarchical clustering of molecular variance was followed using the similarity for qualitative data (SIMQUAL) module with the first 10 PCs that account for most of the variation. Average taxonomic distance matrices (DIST) were computed as a measure of genetic distance. This matrix was subjected to unweighted pair-group method analysis (UPGMA) to generate a dendrogram using the Sequential, Agglomerative, Hierarchical and Nested (SAHN) module. Both numerical taxonomic analyses were performed using the Numerical Taxonomy and Multivariate Analysis System Program, version 2.11f (NTSYS-pc) [27]. The cophenetic correlation coefficient was calculated, and Mantel's test [28] was performed, to check the goodness of fit of a CA. In DA, a linear combination of features that best separates two or more groups of objects is sought. The discriminant functions are determined based on the maximisation of the ratio of the external (between populations) to the internal (between individuals within the same population) variability [29]. The values of Wilks' lambda (λ) and their X2 statistics are used to evaluate the number of significant discriminant functions. In turn, to determine the most important features of the objects, partial Wilks' λ and its Fisher statistics were utilised [29]. Discriminant function analysis [30] was done following the SAS system [31] DISCRIM, CANDISC and STEPDISC procedures, and significance was tested using Wilks' λ [32]. In order to avoid the limitation of a large number of alleles compared with the number of observations and the correlation occur in allele frequencies, we ran discriminant analysis using the uncorrelated SNPs in the top significant PCs. This ensures that variables submitted to DA are perfectly uncorrelated and that their number is lower than that of analysed individuals. Linear discriminant analysis is similar to logistic regression and is useful for building a predictive model of group membership based on observed characteristics. The procedure yields a set of discriminant functions based on the linear combinations of variables that provide the best discrimination between groups. In the final set of analyses, a dataset containing a total of 126 genomic regions linked to SNPs that differed between populations (FST ≥ 0.5) was uploaded into the Ingenuity Pathways Analysis (IPA) 8.7 network analysis (Ingenuity Systems, Redwood City CA, USA). The network generated from the 126 input genes (called focus genes) uses both direct and indirect relationships/connectivity. These networks were ranked by scores that measured the probability that the genes were included in the network by chance alone. Networks with scores of three or more were classified as not being generated by random chance [33]. The significance threshold for Fisher's exact test to determine the probability that each biological function and/or disease assigned to that network is due to chance alone was 0.05 or less. Canonical pathways associated with input genes were elucidated with a statistical significance value. The gene ontology (GO) analysis was used to identify functional commonalities between the genes based on the number of shared ancestors in gene products (http://gostat.wehi.edu.au).

Results

Estimates of FST differ between chromosomes and populations

The empirical genome-wide distribution of FST showed heterogeneity in chromosomal ancestry across the genome (Figure 2). The average FST values for autosomes and sex chromosomes were significantly different (0.120 and 0.210, respectively; t-test, t = 16.1, p < 10-15). The higher average FST for chromosome X compared with autosomes might indicate differences in inheritance mechanisms that potentially affect sex chromosomes and autosomes differently [34-36]. Similarly, statistically significant differences in FST estimates (FST analysis per chromosome- FST_SNP_CHROM) among autosomal chromosomes were detected. The variability that drives FST distribution among autosomes could be due to variations in natural selection and/or recombination rates during meiosis [37]. Based on Wright's qualitative guidelines, FST statistics range from 0 (no differentiation) to 1 (fixed difference between populations for different alleles). Values of FST less than 0.05 represent low or little population genetics differentiation, values between 0.05 and 0.15 represent moderate population divergence, values between 0.15 and 0.25 indicate large population differentiation and FST values greater than 0.25 represent very large population divergence [38]. Usually, an FST > 0.5 is considered sufficient for ancestry differentiation.

Figure 2

Pairwise F. A simple measure of population differentiation is Wright's FST, which measures the fraction of total genetic variation due to between-population differences. It could also represent a matrix of pairwise net distance (divergence) among the population. The global pairwise FST value estimated for the 210 worldwide samples using all loci together (FST_SNP_ALL) was 0.130 (p < 10-6). As FST increases, populations become more distant and/or unrelated to each other [39]. We observed an average genetic differentiation between CEU and YRI (FST = 0.153), CEU and CHB (FST = 0.110), CEU and JPT (FST = 0.111), YRI and CHB (FST = 0.190) and YRI and JPT (FST = 0.192) (Figure 2). It is evident that more divergence has occurred between YRI and each of the three other populations than between the other pairs of populations. Genetic distances between CHB and JPT populations were low (mean FST = 0.007), indicating that substantial gene flow compensates for the effects of genetic drift. This low FST value as a result of high similarities in allele frequencies between CHB and JPT samples motivates researchers to analyse CHB and JPT populations jointly, as a single panel [25]. Regardless of the populations compared, most of the variation was observed within populations (average, 87 per cent versus 13 per cent variation observed between populations). Within-population diversity reflects the number of different types in the population, taking into account their frequencies. By contrast, between-population differentiation measures variation based on the relative frequencies of types within these subpopulations and, ideally, measures the average distance of subpopulations from their respective lumped remainders. The fact that only 13 per cent of the total genetic variation results from differences between populations indicates that alleles present in one population are also present in other populations [40-43]. The remaining 87 per cent represents the average difference between members of the same population. One way to interpret this number is to say that the expected genetic difference between unrelated individuals from distant continents exceeds by 13 per cent the expected difference between members of the same community [44]. An interesting common feature in population genetics studies of humans, animals, plants and other types of species is that within-population diversity is greater than between-population diversity [45-48]. This estimate is highly consistent for protein polymorphisms, blood groups, microsatellites, SNPs and morphological/phenotypic markers [49,50]. Therefore, it is necessary to quantify and control population structure, not only for major population differences, but also for subtle variation/structure arising within populations.

Significant numbers of PCs vary by chromosome

As shown in Figure 3, the first component -- which accounted for 50.2 per cent of variation -- separated YRI populations from CEU and CHB/JPT, while the second (accounting for 24.2 per cent of the total variance) could be associated with an Africa/Europe gradient. The overlap between the CHB and JPT populations suggests a low level of genetic differentiation, as shown by the pairwise FST divergence (Figure 2). Overall, these results demonstrate a clear partition of the West African populations considered. On a genome-wide average scale, about 74 per cent of the diversity in human samples was explained by the first two PCs. The eigenvalues for PC3-PC10 showed a plateau, suggesting that the first two PCs account for most of the populations' average substructure in this analysis. Such a genome-wide level of population structure may lead to an erroneous conclusion that the samples are genetically homogeneous. Thus, correction for population structure is only as good as the level of structure (at a finer or coarser level) that one wishes to correct.

Figure 3

Plot for the first two principal components (PCs) for HapMap individual for the genome-wide average shows the relationships between human populations in terms of their geographical origin. On a genome-wide average scale, about 74 per cent of the diversity in human population was explained on the basis of the first two PCs. In order to obtain fine-scale resolution of population membership, PCA was performed on each chromosome. Our analysis showed that the contribution of the first two PCs in classifying geographical regions varied among chromosomes, ranging from 65 per cent (Chr X) to 76 per cent (Chr 15). The contribution of PC1 ranged from 47 per cent (Chr X) to 51 per cent (Chr 3, Chr 8). The contribution of PC2 to the total variation ranged from 18 per cent (Chr X) to 27 per cent for Chr 15 (Supplementary Figure S1 (Figure 6)). As shown in Figure 4, on a finer scale, the number of significant PCs accounting for population differentiation varies from 2 (Chr 2) to 31 (Chr X) among chromosomes. The higher number of significant PCs on Chr X explains why it has the lowest chromosome-wise contribution to the first two PCs (Supplementary Figure S1 (Figure 6)).

Figure S1

Chromosome-wise principal component analysis (PCA) analysis of the entire HapMap dataset. The first PC accounted for more than double the variance of the second PC. The level of contribution of the first two PCs across chromosomes in classifying geographical regions are presented here. The chromosome-wise contribution of the first two PCs ranges from 65 per cent (Chr X) to 76 per cent (Chr 15). The contribution of PC1 ranges from 47 per cent (Chr X) to 51 per cent (Chr 3, Chr 8). The contribution of PC2 to the total variation ranges from 18 per cent (Chr X) to 27 per cent for Chr 15.

Figure 4

Significant numbers of PCs among chromosomes in the HapMap dataset. On a finer scale, the number of significant PCs that account for population differentiations vary from 2 to 31 among chromosomes.

Significant numbers of PCs among chromosomes in the HapMap dataset. On a finer scale, the number of significant PCs that account for population differentiations vary from 2 to 31 among chromosomes. Chromosome-wise principal component analysis (PCA) analysis of the entire HapMap dataset. The first PC accounted for more than double the variance of the second PC. The level of contribution of the first two PCs across chromosomes in classifying geographical regions are presented here. The chromosome-wise contribution of the first two PCs ranges from 65 per cent (Chr X) to 76 per cent (Chr 15). The contribution of PC1 ranges from 47 per cent (Chr X) to 51 per cent (Chr 3, Chr 8). The contribution of PC2 to the total variation ranges from 18 per cent (Chr X) to 27 per cent for Chr 15. We next characterised the genetic relationships existing among the four different populations. The diagrammatic output of CA (constructed from PCs) for the mean of 210 individuals indicated that these individuals could be clustered into groups that basically coincided with their geographical distribution (data not shown). The analysis confirmed the distinctiveness of the CEU and YRI populations and the close average genetic distance between the CHB and JPT populations (Supplementary Figure S2 (Figure 7)). The results of the chromosomal-based CA (data not shown) were comparable to those of the PCA, and both methods classified racial populations into separate groups.

Figure S2

Unweighted pair-group method analysis dendrogram (a branching diagram used to show the relationships between members of a group) based on average taxonomic distance matrices among population means of HapMap SNP datasets. The cluster analysis (CA; constructed from principal components) for the mean of 210 individuals indicates the distance at which the various groups are formed and join together. CA, which is based on the means for all individuals from each geographical origin, was used to obtain similarities among individuals according to their correlation measures across all SNP datasets. Branch height represents dissimilarity. Note that, compared with YRI and CEU branch height, the CHB and JPT branch height is much shorter, representing that the genetic distance between these two populations is relatively close.

DA predicts population membership for 70 per cent of individuals

Although the overall population differentiation between the CHB and JPT populations appeared low using PCA and FST analysis, DA [51] indicated that ~30 per cent of the two populations were misclassified (Table 1). Thirty of the CHB individuals (n = 45) were correctly classified (67 per cent), while among the JPT individuals (n = 45), 38 were correctly classified (Table 1). The European and African populations were 100 per cent correctly classified to their respective groups. The classification matrix presented in Table 1 summarises the predictive ability of DA when each sample is assigned to a particular geographical region. Given the growing interest in high-density whole-genome association and admixture studies, DA is informative because misclassified individuals can be identified and assigned to their appropriate ancestral populations. Levels of correct and incorrect classification of human populations to their geographical regions of origin differed for each chromosome (Supplementary Table 1). For example, correct classifications of the 45 CHB individuals to their geographical regions of origin ranged from 23 (for Chr 6) to 35 (for Chr X). Chr 9 had the highest number of misclassified individuals and Chr X had the lowest. On the other hand, correctly classified individuals in the JPT population ranged from 25 (for Chr 9) to 36 (for Chr 19). Chr 18 had the lowest and Chr 9 the highest number of misclassified individuals in the JPT population. Chr 9 had the highest number of misclassifications in both the CHB and JPT populations. The variation in misclassification rate might indicate the existence of mosaic chromosomal blocks derived from other populations as a result of common ancestry or admixture. The use of more PCs might improve classification accuracy.

Table 1

Classification matrix for HapMap individuals based on SNP markers using DA

Population	n	Predicted population group
		1	2	3	4	% correct
CEU - European ancestry ¹	60	60	0	0	0	100
CHB - Chinese from Beijing ²	45	0	30	15	0	67
JPT - Japanese from Tokyo ³	45	0	7	38	0	84
YRI - Nigerian from Yorubans ⁴	60	0	0	0	60	100
Total	210	60	37	53	60	90

Average accuracy, 89%, n = number of individuals in each HapMap population. Numbers from 1 to 4 represent the four populations which are described on the left-hand side of the table.

Classification matrix for HapMap individuals based on SNP markers using DA Average accuracy, 89%, n = number of individuals in each HapMap population. Numbers from 1 to 4 represent the four populations which are described on the left-hand side of the table. The summarised results of a stepwise DA to select variables with the most classification power are shown in Table 2. Wilks' λ and associated F-values are used as indices of discriminatory power and are presented for each successive step. To test the statistical significance of the discriminant function, the value of Wilks' λ (~0.00) was assessed [52]. The Wilks' λ test showed that the ratio of the within-groups sum of squares to the total sum of squares was significant (Wilks' λ 9.53E-5, p < 0.001); thus, the null hypothesis of CEU = CHB = JPT = YRI was rejected.

Table 2

Stepwise order of inclusion of variables in the DA that distinguishes between human populations

			Variance (%)
Step	Entered	Eigenvalue	Proportion	Cumulative	Wilks'lambda	Canonicalcorrelation	F value	df	Pr >F
1	PC1	10587.53	80.7	80.7	9.53 × 10^-5	0.99	720717.4	3, 206	<0.0001
2	PC2	2532.22	19.3	100.0	4.10 × 10^-8	0.96	353836.8	6, 410	<0.0001

PC1, PC2, principal components 1 and 2; df, degrees of freedom; Pr >F, (probability level) associated with the F statistic.

Stepwise order of inclusion of variables in the DA that distinguishes between human populations PC1, PC2, principal components 1 and 2; df, degrees of freedom; Pr >F, (probability level) associated with the F statistic. The eigenvalue is the ratio of the between-groups sum of squares to the within-groups sum of squares [52]. This value measures the spread of the group centroids in the dimension of multivariate space (eigenvalue 10587.53; p < 0.0001). The canonical correlation measures the association between discriminant scores and groups. This association appeared to be statistically significant (stepwise elimination, R2 = 0.99, 0.96; p < 0.0001), and the data were subjected to the stepwise procedure. The first canonical discriminant function had a high eigenvalue, accounting for more than 81 per cent of the total variance. The first and second functions together accounted for 100 per cent of the variance.

Functional networks and pathways in highly stratified genomic regions

To characterise the main functional networks/pathways underlying genes with substantial population differentiation, we carried out network analysis (see Materials and methods) for between-population comparisons (between CEU-YRI, CEU-CHB/JPT and YRI-CHB/JPT). A summary of networks, molecular and cellular functions, diseases and disorders and canonical pathways associated with the genomic regions are presented in Table 3. A total of 126 genes were significantly differentiated among populations and eligible for network analysis, which led to the identification of five significant networks (Figure 5). Network 1 was centred on the nuclear factor (NF)-κB complex and had 19 focus genes; network 2 was centred on tumour necrosis factor (TNF) and had 14 focus genes; network 3 was centred on v-myc myelocytomatosis viral oncogene homologue (MYC) and had 13 focus genes; network 4 was centred on interferon-γ (IFNG) and had 12 focus genes; and network 5 was centred on hepatocyte nuclear factor-4α and had 12 focus genes. Interestingly, although no genes were shared among all these five different networks, two networks (networks 1 and 2) contained chloride intracellular channel 4 (CLIC4), which plays a role in apoptosis, differentiation and diabetes. The overlap of this gene suggests that similar biological pathways were targeted by selection in these populations. In addition, the gene involved in skin, hair and eye pigmentation -- including oculocutaneous albinism II (OCA2), hect domain and RLD 2 (HERC2), ectodysplasin A receptor (EDAR) and solute carrier family 45, member 2 (SLC45A2) were over-represented in our GO analysis (http://gostat.wehi.edu.au) (Table 4). Enriched GO biological function terms include cytoskeletal protein binding (p = 1.1 × 10-6), actin binding (p = 1.0 × 10-6) and fibroblast growth factor receptor antagonist activity (p = 6.2 × 10-5).

Table 3

IPA summary of associated networks, molecular and cellular functions, diseases and disorders and canonical pathways for the 126 genes mapped to significantly differentiated genomic regions

	IPA categories	Statistical measures	Associated gene(s)
No	Top networks	Network score*	No. of candidategenes
1	Cancer, cell death, dermatological diseases and conditions	33	19
2	Carbohydrate metabolism, dermatological diseases and conditions, lipid metabolism	24	14
3	Post-translational modification, embryonic development, tissue development	24	13
4	Inflammatory response, immunological disease, carbohydrate metabolism	20	12
5	Cell cycle, hair and skin development and function, nervous system development and function	19	12
	Molecular and cellular functions	p value⁺	No. of candidategenes
1	Cell death	2.55E-04 - 3.60E-02	19
2	Cell-to-cell signalling and interaction	6.33E-04 - 3.25E-02	6
3	Cellular assembly and organisation	6.33E-04 - 3.36E-02	18
4	Cellular compromise	6.33E-04 - 3.25E-02	8
5	Gene expression	1.50E-03 - 3.25E-02	7
	Associated disease and disorders	p value	No. of candidategenes
1	Inflammatory disease	1.43E-07 - 3.25E-02	47
2	Gastrointestinal disease	1.22E-06 - 3.39E-02	29
3	Genetic disorder	1.22E-06 - 3.25E-02	69
4	Endocrine system disorders	1.72E-05 - 3.25E-02	35
5	Metabolic disease	1.72E-05 - 1.31E-02	39
	Top canonical pathways	p value	No. of candidategenes
1	Androgen and oestrogen metabolism	2.15E-02	3
2	Neuroprotective role of THOP1 in Alzheimer's disease	2.85E-02	2
3	Alanine and aspartate metabolism	3.39E-02	2
4	Retinol metabolism	3.39E-02	2
5	Pentose and glucuronate interconversions	4.76E-02	2
	Physiological system development and function	p value	No. of candidategenes
1	Hair and skin development and function	6.26E-05 - 3.25E-02	9
2	Nervous system development and function	7.22E-04 - 3.36E-02	15
3	Connective tissue development and function	1.17E-03 - 3.25E-02	6
4	Skeletal and muscular system development and function	1.17E-03 - 3.25E-02	11
5	Tissue development	1.17E-03 - 3.25E-02	14

*Networks with scores ≥3 have a 99.9 per cent confidence of not being generated randomly.

+The IPA computes p values of statistically significant findings by comparing the number of molecules of interest relative to the total number of occurrences of these molecules in all functional/pathway annotations stored in the Ingenuity Pathways Knowledge Base (Fisher's exact test with p value adjusted using the Benjamini-Hochberg multiple testing correction).

Figure 5

IPA network analysis for 126 genes mapped to significantly differentiated genomic regions. Genes with red nodes are focus genes in our analysis, the others are generated through the network analysis from the Ingenuity Pathways Knowledge Base (http://www.ingenuity.com). Edges are displayed with labels that describe the nature of the relationship between the nodes. The lines between genes represent known interactions, with solid lines representing direct interactions and dashed lines representing indirect interactions. Nodes are displayed using various shapes that represent the functional class of the gene product.

Table 4

Gene Ontology analysis for the 126 genes mapped to significantly differentiated genomic regions

GO category	GO ID	p value	GO term name	Overrepresented genes
Molecularfunction	GO:0005516	0.00138	Calmodulin binding	MAP6, MYH9, MYLK, DAPK1, MYO5C
Molecularfunction	GO:0016462	0.001431	Pyrophosphatase activity	TGM3, ABCB7, MYH9, DYNC1LI1, DUT, MYO5C,MCM6, ATAD3B, ZRANB3, DNAH5, ATAD3C, IQCA1
Molecularfunction	GO:0016818	0.001478	Hydrolase activity,acid anhydrides, phosphorus-containing anhydrides	TGM3, ABCB7, MYH9, DYNC1LI1, DUT, MYO5C,MCM6, ATAD3B, ZRANB3, DNAH5, ATAD3C, IQCA1
Molecularfunction	GO:0016817	0.001509	Hydrolase activity,acting on acid anhydrides	TGM3, ABCB7, MYH9, DYNC1LI1, DUT, MYO5C,MCM6, ATAD3B, ZRANB3, DNAH5, ATAD3C, IQCA1
Molecularfunction	GO:0003779	0.002797	Actin binding	MYH9, MYLK, MKL1, MKL2, MYO5C, SYNE2, HIP1
Biologicalprocess	GO:0006582	0.000032	Melanin metabolic process	SLC45A2, OCA2, CITED1
Biologicalprocess	GO:0043473	0.000574	Pigmentation	SLC45A2, OCA2, EDAR, CITED1
Biologicalprocess	GO:0031641	0.00078	Regulation of myelination	IFNG, CDH2
Biologicalprocess	GO:0048066	0.001018	Developmental pigmentation	SLC45A2, OCA2, CITED1
Biologicalprocess	GO:0015701	0.001245	Bicarbonate transport	PTGER3, SLC4A5
Cellularcomponent	GO:0000299	0.000209	Integral to membraneof membrane fraction	ARSA, MLC1, SEMA4F
Cellularcomponent	GO:0045009	0.000983	Chitosome	SLC45A2, OCA2
Cellularcomponent	GO:0033162	0.000983	Melanosome membrane	SLC45A2, OCA2
Cellularcomponent	GO:0048770	0.001653	Pigment granule	SLC45A2, OCA2, YWHAE, SLC24A5
Cellularcomponent	GO:0042470	0.001653	Melanosome	SLC45A2, OCA2, YWHAE, SLC24A5

Abbreviations

MAP6, microtubule-associated protein 6 gene; MYH9, myosin, heavy chain 9 non-muscle gene; MYLK, Myosin light chain kinase gene, MYO5C, myosin VC gene; TGM3, transglutaminase 3 gene; ABCB7, gene for ATP-binding cassette sub-family B, member 7; DYNC1L11, cytoplasmic dynein 1 light intermediate chain 11 gene; DUT , deoxyuridine 5'-triphosphate nucleotidohydrodase gene; MCM6, minichromosome maintenance complex component 6 gene; ATAD3C, AAA domain-containing 3C gene, ZRANNB3, zinc finger RAN-binding domain-containing 3 gene; DNAH5, dynein axonemal heavy chain 5 gene; ATAD3B, AAA domain-containing 3B gene; IQCAI, IQ motif containing with AAA domain gene; MKL1, megakavyoblastic leukaemia (translocation) 1 gene; MKL2, megakavyoblastic leukaemia (translocation) 2 gene; SYNE2, spectrin repeat containing nuclear envelope 2 gene; HIP1, Huntingtin interacting protein 1 gene; SLC45A2, gene for solute carrier family 45, member 2; CDH2, cadherin 2 gene; PTGER3, prostaglandin E receptor 3 (subtype EP3) gene; SLC4AS, gene for solute carrier family 4 sodium bicarbonate cotransporter member 5; ARSA, arylsulfatase 4 gene; MLC1, megalencephalic leukoencephalopathy with subcortical cysts 1 gene; SEMA4F , semaphorin 4F gene; YWHAE, 14-3-3 protein epsilon gene, CITED1 cbp/p300-interacting transactivator 1 gene; OCA2, p protein gene; EDAR, ectodysphasin A receptor gene.

IPA summary of associated networks, molecular and cellular functions, diseases and disorders and canonical pathways for the 126 genes mapped to significantly differentiated genomic regions *Networks with scores ≥3 have a 99.9 per cent confidence of not being generated randomly. +The IPA computes p values of statistically significant findings by comparing the number of molecules of interest relative to the total number of occurrences of these molecules in all functional/pathway annotations stored in the Ingenuity Pathways Knowledge Base (Fisher's exact test with p value adjusted using the Benjamini-Hochberg multiple testing correction). IPA network analysis for 126 genes mapped to significantly differentiated genomic regions. Genes with red nodes are focus genes in our analysis, the others are generated through the network analysis from the Ingenuity Pathways Knowledge Base (http://www.ingenuity.com). Edges are displayed with labels that describe the nature of the relationship between the nodes. The lines between genes represent known interactions, with solid lines representing direct interactions and dashed lines representing indirect interactions. Nodes are displayed using various shapes that represent the functional class of the gene product. Gene Ontology analysis for the 126 genes mapped to significantly differentiated genomic regions Abbreviations MAP6, microtubule-associated protein 6 gene; MYH9, myosin, heavy chain 9 non-muscle gene; MYLK, Myosin light chain kinase gene, MYO5C, myosin VC gene; TGM3, transglutaminase 3 gene; ABCB7, gene for ATP-binding cassette sub-family B, member 7; DYNC1L11, cytoplasmic dynein 1 light intermediate chain 11 gene; DUT , deoxyuridine 5'-triphosphate nucleotidohydrodase gene; MCM6, minichromosome maintenance complex component 6 gene; ATAD3C, AAA domain-containing 3C gene, ZRANNB3, zinc finger RAN-binding domain-containing 3 gene; DNAH5, dynein axonemal heavy chain 5 gene; ATAD3B, AAA domain-containing 3B gene; IQCAI, IQ motif containing with AAA domain gene; MKL1, megakavyoblastic leukaemia (translocation) 1 gene; MKL2, megakavyoblastic leukaemia (translocation) 2 gene; SYNE2, spectrin repeat containing nuclear envelope 2 gene; HIP1, Huntingtin interacting protein 1 gene; SLC45A2, gene for solute carrier family 45, member 2; CDH2, cadherin 2 gene; PTGER3, prostaglandin E receptor 3 (subtype EP3) gene; SLC4AS, gene for solute carrier family 4 sodium bicarbonate cotransporter member 5; ARSA, arylsulfatase 4 gene; MLC1, megalencephalic leukoencephalopathy with subcortical cysts 1 gene; SEMA4F , semaphorin 4F gene; YWHAE, 14-3-3 protein epsilon gene, CITED1 cbp/p300-interacting transactivator 1 gene; OCA2, p protein gene; EDAR, ectodysphasin A receptor gene. In conclusion, our approach offers a complementary statistical strategy for summarising overall variability and global versus chromosomal structure, assessing population structure and identifying genomic regions driving genetic divergence among populations. We first used PCA (to reduce data dimensionality); however, because PCA does not take into account group differences in reducing the dataset to a few representative variables, we further analysed the data using CA to classify individuals into mutually exclusive groups with high homogeneity within clusters and low homogeneity between clusters. To further confirm and predict group membership, DA was performed using the top significant PCs. PCs were used to ensure that variables submitted to DA were perfectly uncorrelated, and that their number was lower than that of analysed individuals. Finally, using FST (to study population differentiation) analysis, we described the importance of chromosome-based population genetic structure to identify differing genomic regions driven by natural selection. We followed the target genomic regions using network/pathway analysis to elucidate their roles and functional implications in human genetic variations and diseases.

Discussion

Although most researchers traditionally focus on the top few axes of variation in a dataset, substantial information about population structure exists in lower-ranked chromosomal level PCs. Adjustment of global ancestry between study subjects may lead to false positives when chromosomal (local) population structure is an important confounding factor [53]. Using chromosome-based analysis, fine-scale substructure was detectable beyond the broad population level classifications that previously have been explored using genome-wide average estimates in this dataset. The study of population structure in terms of chromosomes has broader practical relevance to researchers who use genetics and genomics approaches in gene mapping because genetic diversity is directly related to recombination rate (meiosis), which differs among chromosomes, and genes are not randomly distributed along chromosomes. By restricting our analysis to each chromosome independently, instead of using global average estimates, we have reported for the first time that the number of fine-scale subpopulations is chromosome dependent. For example, chromosome 2 has two significant PCs which account for population differentiation, whereas chromosome X has 31. This result suggests that one has to examine a large enough number of PCs in order to find all the significant population differences. Thus, the variation in the number of chromosome-specific significant PCs might indicate the detection of a population structure that could have been missed if the average of all chromosomes was used. Even though chromosome 1 is the largest chromosome, followed by chromosome 2, the number of significant PCs that account for structure is lower in both of these chromosomes than in the rest of the chromosomes, indicating that genome size does not correlate with the biological complexity of organisms [54]. Interestingly, similar results were reported by Becquet et al [55] in their study of chimpanzee population genetics structure. In plants, a recent study showed that the optimal number of subpopulations required to correct population structure is trait dependent [56]. This study reminded us that the number of subpopulations for one trait may not be optimal for other traits. The current analytical approach using genome-wide average PCs as a covariate will control for confounding due to global ancestry but will not control for confounding due to the local (chromosome-based) ancestry effect. It is increasingly important to recognise intra-chromosomal variation, especially when populations have been recently admixed. Similar to the results of chromosome-based PCA analysis, DA shows that the classification of populations to their correct geographical regions of origin is chromosome dependent. For example, in our analysis, the number of CHB individuals correctly classified to their geographical regions of origin ranged from 23 (for Chr 6) to 35 (for Chr X), while correctly classified individuals in the JPT population ranged from 25 (for Chr 9) to 36 (for Chr 19). Given the growing interest in tracing ancestral origins or contributions in genetically mixed populations, DA is informative and appealing because misclassified individuals can be identified and grouped into appropriate populations prior to large-scale genotyping. To avoid single-marker FST-based inferences for selection, which can be misleading,[57] we ran an in-depth investigation of the patterns of genetic variation in and around the highly differentiated loci and their effects on the phenotype using network/ontology analyses. We overlaid 126 genes (selected based on FST > 0.5) onto the Ingenuity Pathways Knowledge Database (http://www.ingenuity.com). Using this analytical approach, we confirmed the over-representation of genes implicated in hair and skin development (OCA2, HERC2, EDAR and SLC45A2) in two of the top networks (Table 3). EDA-A1 and EDA-A2 are two isoforms of ectodysplasin that are encoded by the anhidrotic ectodermal dysplasia (EDA) gene. Genetic variability in the EDA ligand has been associated with loss of hair, sweat glands and teeth [58]. The non-synonymous SNP rs1385699 identified within the EDA2 receptor gene (EDA2R) is fixed in both Asian populations, where as an R57K substitution in EDA2R has derived-allele (T) frequencies of 100 per cent. The EDA2R gene product is involved in the positive regulation of NF-κB transcription factor activity specifically within the hair follicle, TNF receptor activity, embryonic development and apoptosis [60]. These genes were previously reported as candidates involved in human pigmentation phenotypes and in the development of skin cancer [61]. The most striking difference provided by our more direct approach was the over-representation of canonical pathways related to androgen and oestrogen metabolism (Supplementary Figure S3 (Figure 8)) and gene groups implicated in the functional category of inflammation, as well as hair and skin development (Figure S4 (Figure 9)).

Figure S3

Figure S4

The 16 most significant functional categories from IPA linked to the 126 genes of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. Bars that are above the line indicate significant enrichment of a function.

Global canonical pathways of the 126 genes linked to genomic regions of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. The first four sets of functions shown represent a p-value of less than 0.01. Bars that are above the line indicate significant enrichment of a pathway. The 16 most significant functional categories from IPA linked to the 126 genes of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. Bars that are above the line indicate significant enrichment of a function. In critically evaluating our results, it is important to note that our analyses, and hence interpretations, are subject to several limitations. First, an important caveat in the use of population-level genetic databases such as HapMap is the ascertainment criterion that was imposed during the initial selection of polymorphic SNPs to be assayed,[62] and the subsequent release of the HapMap database primarily focused on SNPs that were common. The fundamental theorem underpinning HapMap is the common disease common variance (CD/CV) hypothesis [63]. Secondly, the HapMap study (Phase III) is currently being extended to include additional samples and diverse populations (http://www.hapmap.org). The number of SNPs genotyped in Phase III is substantially fewer (~1.5 million SNPs) than in the present study, however, thereby providing less density and coverage. Such low coverage may miss important loci in regions of elevated molecular divergence in related populations, such as between CHB and JPT [64]. When whole-genome sequences (such as (http://www.1000genomes.org)) become widely available, the ability to use many rare variants to identify short shared genomic segments will perhaps allow routine identification of geographical regional or village-level ancestries, given a suitably large and carefully collected reference sample [65,66]. The 1000 Genomes Project, which aims to provide a whole-genome sequence resource for at least 1,200 individuals sampled from multiple population groups globally, will be invaluable for understanding the practical consequences of SNP ascertainment biases. Thirdly, a SNP with a large difference in allele frequency between populations is a strong candidate to explain large differences in disease prevalence between populations [67,68]. This is because disease is tightly linked to survival and reproductive success, and genes responsible for variation in disease should have the most differentiated SNP frequencies between human populations. Indeed, studies have suggested that genes associated with complex diseases such as cardiovascular disease and type 2 diabetes have been targets for positive natural selection [69]. If disease genes have often been targeted by selection, then identifying loci that have experienced selection may aid in disease-related research [68]. Further studies are required to determine the extent to which differences in allele frequencies between populations predict disease prevalence differences between populations, however. The study of population genetic structure between chromosomes is a fundamental issue in population biology because it helps us to obtain a deeper understanding of the ancestral population and associated evolutionary processes. For example, understanding heterogeneity in chromosomal ancestry in an admixed population is important because it can be a confounding factor when variation in admixture levels among individuals across chromosomes causes false-positive associations in genetic association studies. In addition, this analysis can be a source of statistical power for ancestry -- phenotype correlation studies that use observed racial/ethnic differences to find mosaic regions of the genome and map loci influencing complex phenotypes [70]. The distribution of SNP density along chromosomes will inform us about chromosomal segments that are more susceptible to selective pressures or differential patterns. Understanding how chromosomal variations in ancestry relate to disease risk is a major challenge to the biomedical research community [71]. Particularly, in the USA, there has been a significant intermixing among racial/ethnic groups, thereby creating a complex pattern of ancestral populations which are a mosaic of multiple continental populations. The development of population structure adjustment based on chromosome will provide higher-resolution genographic maps and offer investigators designing genetic association studies more powerful tools for detecting stratification. The final question we need to answer is, what causes population differentiation? Humans have wide altitudinal and latitudinal distribution ranges, and hence, different individuals may face very different environmental constraints and selection pressures. Population differentiation could arise as a result of geographical separation and subsequent drift and/or bottlenecks; natural selection (ie the local adaptation process by which organisms become adapted to their environments); differential admixture with other populations; and (possibly) different mutation rates (eg differential exposure to ionising radiation, environmental toxins, etc.). A central theme in evolutionary biology is that natural selection acting on heritable phenotypic variation will result in adaptation and differentiation among local populations inhabiting environments differing in their selective regimes [72]. Natural selection may confer an adaptive advantage to individuals in a specific environment if an allele provides a competitive advantage. Alleles under selection are likely to occur only in those geographical regions where they confer an advantage. Alleles associated with harmful traits decrease in frequency, while those associated with beneficial traits become more common. Local adaptation acting in concert with other processes (eg recombination) is sufficiently pervasive to confound measurements of population differentiation, making a single such genome-wide measurement somewhat unreliable, especially when applied to any specific chromosome or region. In summary: population differentiation, at a genetic level, is the result of numerous processes; differentiation is measurable and quantifiable by a variety of approaches; and most of the processes leading to differentiation affect all autosomes equally, except for natural selection, which leads to extreme values that reflect local adaptation due to natural selection. We also note that rather than some 'normal' distribution of FST values, with exceptional values occasionally reflecting natural selection, there is substantial inter-chromosomal variation in the inferred patterns and characteristics of population structure. These inter- and intra-chromosomal variations, either across the genome as a whole or along single chromosomes, may directly affect population divergence. This study underlines the potential of chromosome-based analysis of genome-wide data to quantify substructure in populations that might otherwise appear relatively homogeneous. Before embarking on a large-scale genomic study, proper control of chromosome-wise stratification/confounding, predicting population memberships is crucial.

Appendix 1

Principal component analysis (PCA)

PCA was performed using a correlation matrix, which was then subjected to eigenvector analysis to extract the principal components (PCs) that can summarise the variation in a data matrix X, consisting of N rows (samples) and K columns (variables: single nuclear polymorphisms [SNPs]), in terms of a few underlying and informative scores or latent variables [73]. The X-matrix is decomposed as the product of two matrices, the (N × A) score matrix, T, and the (A × K) loading matrix, P', where A is the number of PCs, plus a (N × K) 'noise' matrix of residuals E. where T is the score matrix summarising the X-variables and P' is the loading matrix showing the influence of the variables on the projection model. E is the residual matrix expressing the deviations between the original values and the projections. In general, PCA transforms a number of correlated allele frequencies into a smaller number of uncorrelated synthetic variables, or PCs.

Cluster analysis (CA)

For the HapMap SNP dataset that was found to be polymorphic among CEU, CHB, JPT and YRI samples, Pearson correlation coefficients were computed for the 210 individuals. The individuals were then grouped by a hierarchical clustering algorithm using the average linkage method, which was implemented using NTSYS v2.1 software [27]. The genetic distance between each pair of individuals, m and m', was summarised by the allele-sharing method, D (m, m'), as follows: where l is the number of loci for which both individuals have been tested.

Discriminant function analysis

Discriminant functions based on population grouping were obtained by the stepwise inclusion of SNPs to minimise Wilks' lambda (λ) between groups, as described by Rechner [29] and as follows: L = b1x1 + b2x2 + b3x3 ... + bzxz; where x1 through xz represent the various predictor variables (SNPs); b1 through bz represent the weight associated with each of the predictor variables; and L is the object's resultant qualitative discrimination score, with a cut-off score to assign objects to one group or another. Objects with L > × are assigned to one group, and those with L < × are assigned to another group,[51] based on allele frequency differences. L represents classifying variables.

Fixation index (FST) estimates between Populations

Global FST values for pairwise population comparisons were calculated using genome-wide SNP allele frequency variances estimated from the unrelated individuals in each HapMap population (CEU, CHB, JPT and YRI), following Wright [9]. The formula used was as follows: where is the average allele frequency (over all populations) of the i-th allele, m is the number of alleles and Fis the value of FST for each allele. SNP-specific FST measures of population genetic differentiation based on allele frequencies in two populations, a metric of variation within a population versus between populations, are outlined below, following McKeigue [74]. In this formula, p1 and p2 denote the frequencies of a particular allele in population 1 and population 2, respectively.

Network and gene ontology analysis of genes showing differentiation between populations

Ingenuity Pathways Analysis (IPA) was used to organise genes showing evidence of selection into networks of interacting genes and to identify pathways containing functionally related genes [33]. More precisely, network analysis consists of searching for direct and indirect interactions between candidate genes and all other molecules (genes, gene products or small molecules) contained in the Ingenuity Pathways Knowledge Base (IPKB). The complete list of gene identifiers was uploaded into IPA, and each was mapped to its corresponding IPKB gene object. Candidate genes are eligible for network generation if there is at least one wild-type IPKB interacting molecule. Based on the information available for eligible candidate genes (focus genes), IPA further constructs networks by maximising the number of focus genes and their inter-connectivity within the limit of 35 molecules per network. Note that additional highly connected non-focus molecules are also included. Finally, for each network, a right-tailed Fisher exact test is implemented to evaluate how likely it is that the focus genes it contains might be found together by chance. Only those networks with a score (-log[p value]) greater than three were considered as significant [75]. In addition, networks might be inter-connected (ie sharing at least one molecule), which strengthens the importance for the underlying biological functions. Networks are graphically represented by nodes with various shapes (according to the molecule type) and edges (according to their biological relationships). The likelihood for a gene pair to be regulated in the same manner increases with the similarity of their gene ontology (GO) description. The GO similarity score between two gene products is based on the number of shared ancestors. As a gene product might be assigned with multiple GO terms, we seek the maximum similarity score between all possible combinations. As we seek to discover gene-gene interactions, we reformulate the GO approach as follows. Let gene i and gene j be assigned hi and hj GO terms, respectively. Then, the GO similarity for the gene (i, j) pair is taken to be the maximum number of shared ancestors for all combinations of the hi and hj [76]. IPA essentially evaluates the enrichment of particular biological processes and molecular functions of gene sets by examining information collected by databases such as GO, Kyoto Encyclopedia of Genes and Genomes or the IPKB. Discriminant analysis classification accuracy and associated percentage across the genome and population Correct and misclassification of CHB and JPT individuals to their correct geographical region of origin differs for each chromosome. For example, correct classification to their regions of origin for CHB individuals range from 23 per cent (for Chr 6) to 33 per cent (for Chr X and Chr 22). N*= number of individuals in each population group, % = % classification accuracy; CEU = Caucasian, CHB = Chinese, JPT = Japanese, YRI = Yoruba

Table S1

Discriminant analysis classification accuracy and associated percentage across the genome and population

	CEU (60)		CHB (45)		JPT (45)		YRI (60)
Chr	Correctclassification		Correctclassification		Correctclassification		Correctclassification
	N *	%	N	%	N	%	N	%
1	60	100	28	62.22	27	60.00	60	100
2	60	100	27	60.00	30	66.67	60	100
3	60	100	26	57.78	32	71.11	60	100
4	60	100	30	66.67	27	60.00	60	100
5	60	100	28	62.22	29	64.44	60	100
6	60	100	23	51.11	28	62.22	60	100
7	60	100	25	55.56	26	57.78	60	100
8	60	100	25	55.56	26	57.78	60	100
9	60	100	25	55.56	25	55.56	60	100
10	60	100	25	55.56	30	66.67	60	100
11	60	100	27	60.00	29	64.44	60	100
12	60	100	29	64.44	26	57.78	60	100
13	60	100	24	53.33	26	57.78	60	100
14	60	100	26	57.78	32	71.11	60	100
15	60	100	30	66.67	31	68.89	60	100
16	60	100	29	64.44	28	62.22	60	100
17	60	100	29	64.44	29	64.44	60	100
18	60	100	29	64.44	36	80.00	60	100
19	60	100	31	68.89	33	73.33	60	100
20	60	100	29	64.44	25	55.56	60	100
21	60	100	27	60.00	32	71.11	60	100
22	60	100	33	73.33	31	68.89	60	100
X	60	100	33	73.33	32	71.11	60	100
All	60	100	32	71.11	38	84.44	60	100
Mean	60	100	28	62.00	29	64.44	60	100

Correct and misclassification of CHB and JPT individuals to their correct geographical region of origin differs for each chromosome. For example, correct classification to their regions of origin for CHB individuals range from 23 per cent (for Chr 6) to 33 per cent (for Chr X and Chr 22).

N*= number of individuals in each population group, % = % classification accuracy; CEU = Caucasian, CHB = Chinese, JPT = Japanese, YRI = Yoruba

59 in total

Review 1. Naturally occurring variation in Arabidopsis: an underexploited resource for plant genetics.

Authors: C Alonso-Blanco; M Koornneef
Journal: Trends Plant Sci Date: 2000-01 Impact factor: 18.313

2. Expression imbalance map: a new visualization method for detection of mRNA expression imbalance regions.

Authors: Makoto Kano; Kunihiro Nishimura; Shumpei Ishikawa; Shuichi Tsutsumi; Koichi Hirota; Michitaka Hirose; Hiroyuki Aburatani
Journal: Physiol Genomics Date: 2003-03-18 Impact factor: 3.107

Review 3. Estimating F-statistics.

Authors: B S Weir; W G Hill
Journal: Annu Rev Genet Date: 2002-06-11 Impact factor: 16.830

4. High-density single-nucleotide polymorphism maps of the human genome.

Authors: Raymond D Miller; Michael S Phillips; Inho Jo; Miriam A Donaldson; Joel F Studebaker; Nicholas Addleman; Steven V Alfisi; Wendy M Ankener; Hamid A Bhatti; Chad E Callahan; Benjamin J Carey; Cheryl L Conley; Justin M Cyr; Vram Derohannessian; Rachel A Donaldson; Carolina Elosua; Stacey E Ford; Angela M Forman; Craig A Gelfand; Nicole M Grecco; Susan M Gutendorf; Cricket R Hock; Mark J Hozza; Soyoung Hur; Sun Mi In; Diana L Jackson; Sangmee Ahn Jo; Sung-Chul Jung; Sook Kim; Kuchan Kimm; Ellen F Kloss; Daniel C Koboldt; Jennifer M Kuebler; Feng-Shen Kuo; Jessica A Lathrop; Jong-Keuk Lee; Kathy L Leis; Stephanie A Livingston; Elizabeth G Lovins; Maria L Lundy; Sima Maggan; Matthew Minton; Michael A Mockler; David W Morris; Eric P Nachtman; Bermseok Oh; Chan Park; Chang-Wook Park; Nicholas Pavelka; Adrienne B Perkins; Stephanie L Restine; Ravi Sachidanandam; Andrew J Reinhart; Kathryn E Scott; Gira J Shah; Jatana M Tate; Shobha A Varde; Amy Walters; J Rebecca White; Yeon-Kyeong Yoo; Jong-Eun Lee; Michael T Boyce-Jacino; Pui-Yan Kwok
Journal: Genomics Date: 2005-08 Impact factor: 5.736

5. Genetic similarities within and between human populations.

Authors: D J Witherspoon; S Wooding; A R Rogers; E E Marchani; W S Watkins; M A Batzer; L B Jorde
Journal: Genetics Date: 2007-03-04 Impact factor: 4.562

6. The hitch-hiking effect of a favourable gene.

Authors: J M Smith; J Haigh
Journal: Genet Res Date: 1974-02 Impact factor: 1.588

7. The detection of disease clustering and a generalized regression approach.

Authors: N Mantel
Journal: Cancer Res Date: 1967-02 Impact factor: 12.701

Review 8. The application of molecular genetic approaches to the study of human evolution.

Authors: L Luca Cavalli-Sforza; Marcus W Feldman
Journal: Nat Genet Date: 2003-03 Impact factor: 38.330

9. Cluster analysis and display of genome-wide expression patterns.

Authors: M B Eisen; P T Spellman; P O Brown; D Botstein
Journal: Proc Natl Acad Sci U S A Date: 1998-12-08 Impact factor: 11.205

10. Population structure and eigenanalysis.

Authors: Nick Patterson; Alkes L Price; David Reich
Journal: PLoS Genet Date: 2006-12 Impact factor: 5.917

13 in total

1. Ancestry informative markers for distinguishing between Thai populations based on genome-wide association datasets.

Authors: Kornkiat Vongpaisarnsin; Jennifer Beth Listman; Robert T Malison; Joel Gelernter
Journal: Leg Med (Tokyo) Date: 2015-02-25 Impact factor: 1.376

2. PAMAM: Power analysis in multiancestry admixture mapping.

Authors: Yadu Gautam; Sudhir Ghandikota; Siqi Chen; Tesfaye B Mersha
Journal: Genet Epidemiol Date: 2019-06-26 Impact factor: 2.135

3. AncestrySNPminer: a bioinformatics tool to retrieve and develop ancestry informative SNP panels.

Authors: Sushil Amirisetty; Gurjit K Khurana Hershey; Tesfaye M Baye
Journal: Genomics Date: 2012-05-11 Impact factor: 5.736

4. Comparison of measures of marker informativeness for ancestry and admixture mapping.

Authors: Lili Ding; Howard Wiener; Tilahun Abebe; Mekbib Altaye; Rodney C P Go; Carolyn Kercsmar; Greg Grabowski; Lisa J Martin; Gurjit K Khurana Hershey; Ranajit Chakorborty; Tesfaye M Baye
Journal: BMC Genomics Date: 2011-12-20 Impact factor: 3.969

5. Population structure analysis using rare and common functional variants.

Authors: Tesfaye M Baye; Hua He; Lili Ding; Brad G Kurowski; Xue Zhang; Lisa J Martin
Journal: BMC Proc Date: 2011-11-29

Review 6. Self-reported race/ethnicity in the age of genomic research: its potential impact on understanding health disparities.

Authors: Tesfaye B Mersha; Tilahun Abebe
Journal: Hum Genomics Date: 2015-01-07 Impact factor: 4.639

Review 7. Mapping asthma-associated variants in admixed populations.

Authors: Tesfaye B Mersha
Journal: Front Genet Date: 2015-09-29 Impact factor: 4.599

Review 8. Population Genomics and the Statistical Values of Race: An Interdisciplinary Perspective on the Biological Classification of Human Populations and Implications for Clinical Genetic Epidemiological Research.

Authors: Koffi N Maglo; Tesfaye B Mersha; Lisa J Martin
Journal: Front Genet Date: 2016-02-17 Impact factor: 4.599

9. A Prediction Algorithm for Drug Response in Patients with Mesial Temporal Lobe Epilepsy Based on Clinical and Genetic Information.

Authors: Mariana S Silva-Alves; Rodrigo Secolin; Benilton S Carvalho; Clarissa L Yasuda; Elizabeth Bilevicius; Marina K M Alvim; Renato O Santos; Claudia V Maurer-Morelli; Fernando Cendes; Iscia Lopes-Cendes
Journal: PLoS One Date: 2017-01-04 Impact factor: 3.240

10. High mutation rates explain low population genetic divergence at copy-number-variable loci in Homo sapiens.

Authors: Xin-Sheng Hu; Francis C Yeh; Yang Hu; Li-Ting Deng; Richard A Ennos; Xiaoyang Chen
Journal: Sci Rep Date: 2017-02-22 Impact factor: 4.379