Abstract
Emerging technologies now make it possible to genotype hundreds of thousands of genetic variations in individuals, across the genome. The study of loci at finer scales will facilitate the understanding of genetic variation at genomic and geographic levels. We examined global and chromosomal variations across HapMap populations using 3.7 million single nucleotide polymorphisms to search for the most stratified genomic regions of human populations and linked these regions to ontological annotation and functional network analysis. To achieve this, we used five complementary statistical and genetic network procedures: principal component (PC), cluster, discriminant, fixation index (FST) and network/pathway analyses. At the global level, the first two PC scores were sufficient to account for major population structure; however, chromosomal level analysis detected subtle forms of population structure within continental populations, and as many as 31 PCs were required to classify individuals into homogeneous groups. Using recommended population ancestry differentiation measures, a total of 126 regions of the genome were catalogued. Gene ontology and networks analyses revealed that these regions included the genes encoding oculocutaneous albinism II (OCA2), hect domain and RLD 2 (HERC2), ectodysplasin A receptor (EDAR) and solute carrier family 45, member 2 (SLC45A2). These genes are associated with melanin production, which is involved in the development of skin and hair colour, skin cancer and eye pigmentation. We also identified the genes encoding interferon-γ (IFNG) and death-associated protein kinase 1 (DAPK1), which are associated with cell death, inflammatory and immunological diseases. An in-depth understanding of these genomic regions may help to explain variations in adaptation to different environments. 
Our approach offers a comprehensive strategy for analysing chromosome-based population structure and differentiation, and demonstrates the application of complementary statistical and functional network analysis in human genetic variation studies.
Year: 2011 PMID: 21712187 PMCID: PMC3326352 DOI: 10.1186/1479-7364-5-4-220
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
Figure 1. Schematic presentation of single nucleotide polymorphism (SNP) mining, multivariate chromosomal and population diversity and network analysis strategies. There are ~3.7 million SNPs in the HapMap data release. Genotypes were summarised for each population. For each dataset, the genotype at each locus (SNP) was coded numerically to obtain a full design matrix of alleles (each cell gives the number of copies of the major allele carried by an individual: 0, 1 or 2). Two criteria were used to filter the SNPs included in the analysis: (i) locus call rate ≥ 95 per cent (i.e. we excluded all SNPs with more than 5 per cent missing data); and (ii) the SNP had to be shared among populations, so that the same set of SNPs was used throughout the population comparisons. Of the ~3.7 million SNPs in the HapMap data release, only 809,624 were eligible for analysis.
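The coding and filtering steps in this caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the two-letter genotype strings (for example `"AG"`), the use of `None` for a missing call, and the choice to apply the call-rate threshold within each population are all assumptions made for illustration.

```python
from collections import Counter

def major_allele(genotypes):
    """Return the most common allele among the non-missing genotype calls."""
    counts = Counter(a for g in genotypes if g is not None for a in g)
    return counts.most_common(1)[0][0]

def encode_locus(genotypes, major):
    """Code each call as the number of major-allele copies (0, 1 or 2).

    Missing calls (None) are kept as None.
    """
    return [None if g is None else g.count(major) for g in genotypes]

def call_rate(genotypes):
    """Fraction of individuals with a non-missing call at this locus."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def filter_snps(populations, min_call_rate=0.95):
    """Keep SNP IDs that are shared by every population and that reach
    the call-rate threshold in each of them (assumed interpretation).

    populations: dict mapping population name -> {snp_id: [genotype, ...]}
    """
    shared = set.intersection(*(set(p) for p in populations.values()))
    return sorted(
        snp for snp in shared
        if all(call_rate(p[snp]) >= min_call_rate for p in populations.values())
    )
```

In practice the major allele would need to be defined once across all populations, so that the same allele is counted in every group.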
Figure 2. Pairwise FST. A simple measure of population differentiation is Wright's FST, which measures the fraction of total genetic variation that is due to between-population differences. It can also be presented as a matrix of pairwise net distances (divergence) among the populations.
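As a rough illustration of this definition, the sketch below computes FST for a pair of populations as (H_T - H_S) / H_T from per-SNP allele frequencies, and assembles the pairwise matrix. The estimator actually used for the figure is not stated here, so this textbook form (equal population weights, no sample-size correction, sums taken over loci) and the function names are assumptions.

```python
def fst_pair(freqs_a, freqs_b):
    """Wright's FST for two populations from per-SNP allele frequencies.

    H_S is the mean within-population expected heterozygosity and H_T the
    expected heterozygosity of the pooled population; both are summed over
    loci before taking the ratio.
    """
    h_s = h_t = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2
        h_s += (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2
        h_t += 2 * p_bar * (1 - p_bar)
    return (h_t - h_s) / h_t if h_t > 0 else 0.0

def pairwise_fst(freqs):
    """Symmetric matrix (dict of dicts) of pairwise FST values.

    freqs: dict mapping population name -> list of allele frequencies,
    with SNPs in the same order for every population.
    """
    names = list(freqs)
    return {
        a: {b: 0.0 if a == b else fst_pair(freqs[a], freqs[b]) for b in names}
        for a in names
    }
```

Two populations fixed for different alleles give FST = 1, and two populations with identical frequencies give FST = 0.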
Figure 3. Plot of the first two principal components (PCs) for HapMap individuals, based on the genome-wide average, showing the relationships between human populations in terms of their geographical origin. On a genome-wide average scale, about 74 per cent of the diversity in human populations was explained by the first two PCs.
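The share of variation attributed to each PC can be obtained from a singular value decomposition of the centred genotype matrix. The following is a minimal sketch using NumPy; the function name is hypothetical, and any scaling or normalisation the authors applied before PCA is not reproduced.

```python
import numpy as np

def pca_explained_variance(genotypes):
    """Return PC scores and the fraction of variance explained by each PC.

    genotypes: (individuals x SNPs) array of 0/1/2 major-allele counts.
    """
    x = np.asarray(genotypes, dtype=float)
    x = x - x.mean(axis=0)                    # centre each SNP column
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u * s                            # PC scores per individual
    explained = s**2 / np.sum(s**2)           # variance share per PC
    return scores, explained
```

With this function, `explained[:2].sum()` corresponds to the kind of quantity reported above for the first two PCs.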
Figure S1. Chromosome-wise principal component analysis (PCA) of the entire HapMap dataset. The first PC accounted for more than double the variance of the second PC. The contribution of the first two PCs to classifying geographical regions is presented here for each chromosome. The chromosome-wise contribution of the first two PCs ranges from 65 per cent (Chr X) to 76 per cent (Chr 15). The contribution of PC1 ranges from 47 per cent (Chr X) to 51 per cent (Chr 3, Chr 8). The contribution of PC2 to the total variation ranges from 18 per cent (Chr X) to 27 per cent (Chr 15).
Figure 4. Number of significant PCs per chromosome in the HapMap dataset. On a finer scale, the number of significant PCs that account for population differentiation varies from 2 to 31 among chromosomes.
Figure S2. Unweighted pair-group method analysis dendrogram (a branching diagram used to show the relationships between members of a group) based on average taxonomic distance matrices among population means of HapMap SNP datasets. The cluster analysis (CA; constructed from principal components) for the mean of 210 individuals indicates the distance at which the various groups are formed and join together. CA, which is based on the means for all individuals from each geographical origin, was used to obtain similarities among individuals according to their correlation measures across all SNP datasets. Branch height represents dissimilarity. Note that, compared with the YRI and CEU branch heights, the CHB and JPT branch height is much shorter, indicating that the genetic distance between these two populations is relatively small.
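The unweighted pair-group (average linkage) procedure behind such a dendrogram can be sketched in a few lines of plain Python. This is an illustrative implementation operating on a generic distance matrix, not the software used for the figure; the function name and the returned merge list are assumptions.

```python
def upgma(dist, labels):
    """Average-linkage (UPGMA) clustering on a symmetric distance matrix.

    Returns the merges in order as (members_a, members_b, distance), where
    distance is the average pairwise distance at which the two clusters join.
    """
    clusters = {i: (labels[i],) for i in range(len(labels))}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        (a, b), height = min(d.items(), key=lambda kv: kv[1])
        size_a, size_b = len(clusters[a]), len(clusters[b])
        # Size-weighted average distance from the merged cluster to the rest
        new = {}
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            new[c] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        merges.append((clusters[a], clusters[b], height))
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        for c, v in new.items():
            d[(c, next_id)] = v               # new IDs are always the largest
        next_id += 1
    return merges
```

The first merge returned is the closest pair of populations, which is the pattern the caption describes for CHB and JPT.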
Classification matrix for HapMap individuals based on SNP markers using discriminant analysis (DA)
| Population | n | Predicted group 1 | Predicted group 2 | Predicted group 3 | Predicted group 4 | % correct |
|---|---|---|---|---|---|---|
| CEU - European ancestry (1) | 60 | 60 | 0 | 0 | 0 | 100 |
| CHB - Chinese from Beijing (2) | 45 | 0 | 30 | 15 | 0 | 67 |
| JPT - Japanese from Tokyo (3) | 45 | 0 | 7 | 38 | 0 | 84 |
| YRI - Yoruba from Ibadan, Nigeria (4) | 60 | 0 | 0 | 0 | 60 | 100 |
| Total | 210 | 60 | 37 | 53 | 60 | 90 |
Average accuracy, 89 per cent; n = number of individuals in each HapMap population. Numbers 1 to 4 denote the four populations listed in the first column.
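The per-population and overall accuracies follow directly from the counts in the classification matrix. A short sketch of that arithmetic, using the counts from the table (the function name is hypothetical):

```python
def classification_summary(matrix, labels):
    """Per-class and overall percentage correct from a confusion matrix.

    matrix[i][j] = number of individuals from population i assigned to group j.
    """
    per_class = {}
    correct = total = 0
    for i, (label, row) in enumerate(zip(labels, matrix)):
        n = sum(row)
        per_class[label] = 100 * row[i] / n   # diagonal entry = correct calls
        correct += row[i]
        total += n
    return per_class, 100 * correct / total

labels = ["CEU", "CHB", "JPT", "YRI"]
matrix = [
    [60, 0, 0, 0],
    [0, 30, 15, 0],
    [0, 7, 38, 0],
    [0, 0, 0, 60],
]
per_class, overall = classification_summary(matrix, labels)
```

Here 188 of the 210 individuals lie on the diagonal, so the overall accuracy is about 89.5 per cent, and all misclassifications occur between CHB and JPT.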
Stepwise order of inclusion of variables in the DA that distinguishes between human populations
| Step | Entered | Eigenvalue | Variance proportion (%) | Cumulative variance (%) | Wilks' lambda | Canonical correlation | F value | df | Pr > F |
|---|---|---|---|---|---|---|---|---|---|
| 1 | PC1 | 10587.53 | 80.7 | 80.7 | 9.53 × 10⁻⁵ | 0.99 | 720717.4 | 3, 206 | <0.0001 |
| 2 | PC2 | 2532.22 | 19.3 | 100.0 | 4.10 × 10⁻⁸ | 0.96 | 353836.8 | 6, 410 | <0.0001 |
PC1, PC2, principal components 1 and 2; df, degrees of freedom; Pr >F, (probability level) associated with the F statistic.
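The proportion and cumulative columns can be recovered from the eigenvalues alone, since each proportion is the eigenvalue divided by the sum of the eigenvalues. A minimal check in Python (the function name is an assumption):

```python
def variance_proportions(eigenvalues):
    """Percentage of variance per discriminant function, plus running total."""
    total = sum(eigenvalues)
    proportions = [100 * ev / total for ev in eigenvalues]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative

# Eigenvalues taken from the table above
proportions, cumulative = variance_proportions([10587.53, 2532.22])
```

Rounded to one decimal place, the two proportions are 80.7 and 19.3 per cent, in agreement with the table.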
IPA summary of associated networks, molecular and cellular functions, diseases and disorders and canonical pathways for the 126 genes mapped to significantly differentiated genomic regions
| No. | IPA category | Statistical measure | Associated gene(s) |
|---|---|---|---|
| | Top networks | Network score* | No. of candidate genes |
| 1 | Cancer, cell death, dermatological diseases and conditions | 33 | 19 |
| 2 | Carbohydrate metabolism, dermatological diseases and conditions, lipid metabolism | 24 | 14 |
| 3 | Post-translational modification, embryonic development, tissue development | 24 | 13 |
| 4 | Inflammatory response, immunological disease, carbohydrate metabolism | 20 | 12 |
| 5 | Cell cycle, hair and skin development and function, nervous system development and function | 19 | 12 |
| | Molecular and cellular functions | p value+ | No. of genes |
| 1 | Cell death | 2.55E-04 - 3.60E-02 | 19 |
| 2 | Cell-to-cell signalling and interaction | 6.33E-04 - 3.25E-02 | 6 |
| 3 | Cellular assembly and organisation | 6.33E-04 - 3.36E-02 | 18 |
| 4 | Cellular compromise | 6.33E-04 - 3.25E-02 | 8 |
| 5 | Gene expression | 1.50E-03 - 3.25E-02 | 7 |
| | Diseases and disorders | p value+ | No. of genes |
| 1 | Inflammatory disease | 1.43E-07 - 3.25E-02 | 47 |
| 2 | Gastrointestinal disease | 1.22E-06 - 3.39E-02 | 29 |
| 3 | Genetic disorder | 1.22E-06 - 3.25E-02 | 69 |
| 4 | Endocrine system disorders | 1.72E-05 - 3.25E-02 | 35 |
| 5 | Metabolic disease | 1.72E-05 - 1.31E-02 | 39 |
| | Canonical pathways | p value+ | No. of genes |
| 1 | Androgen and oestrogen metabolism | 2.15E-02 | 3 |
| 2 | Neuroprotective role of THOP1 in Alzheimer's disease | 2.85E-02 | 2 |
| 3 | Alanine and aspartate metabolism | 3.39E-02 | 2 |
| 4 | Retinol metabolism | 3.39E-02 | 2 |
| 5 | Pentose and glucuronate interconversions | 4.76E-02 | 2 |
| | Physiological system development and function | p value+ | No. of genes |
| 1 | Hair and skin development and function | 6.26E-05 - 3.25E-02 | 9 |
| 2 | Nervous system development and function | 7.22E-04 - 3.36E-02 | 15 |
| 3 | Connective tissue development and function | 1.17E-03 - 3.25E-02 | 6 |
| 4 | Skeletal and muscular system development and function | 1.17E-03 - 3.25E-02 | 11 |
| 5 | Tissue development | 1.17E-03 - 3.25E-02 | 14 |
*Networks with scores ≥3 have a 99.9 per cent confidence of not being generated randomly.
+The IPA computes p values of statistically significant findings by comparing the number of molecules of interest relative to the total number of occurrences of these molecules in all functional/pathway annotations stored in the Ingenuity Pathways Knowledge Base (Fisher's exact test with p value adjusted using the Benjamini-Hochberg multiple testing correction).
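The two statistical steps named in this footnote, Fisher's exact test followed by Benjamini-Hochberg adjustment, can be sketched with the Python standard library. IPA's internal implementation is not described here, so the right-tailed (over-representation) form of the test and the function names below are assumptions.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """Right-tailed Fisher's exact (hypergeometric) p value.

    k: genes of interest carrying the annotation
    n: genes of interest in total
    K: genes in the reference set carrying the annotation
    N: genes in the reference set in total
    Returns P(X >= k) under random sampling without replacement.
    """
    denom = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / denom

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):              # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each annotation category would receive one raw p value from `enrichment_p`, and the whole set of categories would then be passed to `benjamini_hochberg`.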
Figure 5. IPA network analysis for 126 genes mapped to significantly differentiated genomic regions. Genes shown as red nodes are focus genes in our analysis; the others were generated through the network analysis from the Ingenuity Pathways Knowledge Base (http://www.ingenuity.com). Edges are displayed with labels that describe the nature of the relationship between the nodes. The lines between genes represent known interactions, with solid lines representing direct interactions and dashed lines representing indirect interactions. Nodes are displayed using various shapes that represent the functional class of the gene product.
Gene Ontology analysis for the 126 genes mapped to significantly differentiated genomic regions
| GO category | GO ID | p value | GO term name | Overrepresented genes |
|---|---|---|---|---|
| Molecular | GO:0005516 | 0.00138 | Calmodulin binding | |
| Molecular | GO:0016462 | 0.001431 | Pyrophosphatase activity | |
| Molecular | GO:0016818 | 0.001478 | Hydrolase activity, | |
| Molecular | GO:0016817 | 0.001509 | Hydrolase activity, | |
| Molecular | GO:0003779 | 0.002797 | Actin binding | |
| Biological | GO:0006582 | 0.000032 | Melanin metabolic process | |
| Biological | GO:0043473 | 0.000574 | Pigmentation | |
| Biological | GO:0031641 | 0.00078 | Regulation of myelination | |
| Biological | GO:0048066 | 0.001018 | Developmental pigmentation | |
| Biological | GO:0015701 | 0.001245 | Bicarbonate transport | |
| Cellular | GO:0000299 | 0.000209 | Integral to membrane | |
| Cellular | GO:0045009 | 0.000983 | Chitosome | |
| Cellular | GO:0033162 | 0.000983 | Melanosome membrane | |
| Cellular | GO:0048770 | 0.001653 | Pigment granule | |
| Cellular | GO:0042470 | 0.001653 | Melanosome |
Abbreviations
MAP6, microtubule-associated protein 6 gene; MYH9, myosin, heavy chain 9 non-muscle gene; MYLK, myosin light chain kinase gene; MYO5C, myosin VC gene; TGM3, transglutaminase 3 gene; ABCB7, gene for ATP-binding cassette sub-family B, member 7; DYNC1LI1, cytoplasmic dynein 1 light intermediate chain 1 gene; DUT, deoxyuridine 5'-triphosphate nucleotidohydrolase gene; MCM6, minichromosome maintenance complex component 6 gene; ATAD3C, AAA domain-containing 3C gene; ZRANB3, zinc finger RAN-binding domain-containing 3 gene; DNAH5, dynein axonemal heavy chain 5 gene; ATAD3B, AAA domain-containing 3B gene; IQCA1, IQ motif containing with AAA domain gene; MKL1, megakaryoblastic leukaemia (translocation) 1 gene; MKL2, megakaryoblastic leukaemia (translocation) 2 gene; SYNE2, spectrin repeat containing nuclear envelope 2 gene; HIP1, Huntingtin interacting protein 1 gene; SLC45A2, gene for solute carrier family 45, member 2; CDH2, cadherin 2 gene; PTGER3, prostaglandin E receptor 3 (subtype EP3) gene; SLC4A5, gene for solute carrier family 4 sodium bicarbonate cotransporter member 5; ARSA, arylsulfatase A gene; MLC1, megalencephalic leukoencephalopathy with subcortical cysts 1 gene; SEMA4F, semaphorin 4F gene; YWHAE, 14-3-3 protein epsilon gene; CITED1, Cbp/p300-interacting transactivator 1 gene; OCA2, P protein gene; EDAR, ectodysplasin A receptor gene.
Figure S3. Global canonical pathways of the 126 genes linked to genomic regions of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. The first four sets of functions shown represent a p value of less than 0.01. Bars that are above the line indicate significant enrichment of a pathway.
Figure S4. The 16 most significant functional categories from IPA linked to the 126 genes of major population differentiation. The significance threshold, shown in yellow, represents a p value of greater than 0.05. Bars that are above the line indicate significant enrichment of a function.
Discriminant analysis classification accuracy and associated percentage across the genome and population
| Chr | CEU (60): N* correct | CEU: % correct | CHB (45): N correct | CHB: % correct | JPT (45): N correct | JPT: % correct | YRI (60): N correct | YRI: % correct |
|---|---|---|---|---|---|---|---|---|
| 1 | 60 | 100 | 28 | 62.22 | 27 | 60.00 | 60 | 100 |
| 2 | 60 | 100 | 27 | 60.00 | 30 | 66.67 | 60 | 100 |
| 3 | 60 | 100 | 26 | 57.78 | 32 | 71.11 | 60 | 100 |
| 4 | 60 | 100 | 30 | 66.67 | 27 | 60.00 | 60 | 100 |
| 5 | 60 | 100 | 28 | 62.22 | 29 | 64.44 | 60 | 100 |
| 6 | 60 | 100 | 23 | 51.11 | 28 | 62.22 | 60 | 100 |
| 7 | 60 | 100 | 25 | 55.56 | 26 | 57.78 | 60 | 100 |
| 8 | 60 | 100 | 25 | 55.56 | 26 | 57.78 | 60 | 100 |
| 9 | 60 | 100 | 25 | 55.56 | 25 | 55.56 | 60 | 100 |
| 10 | 60 | 100 | 25 | 55.56 | 30 | 66.67 | 60 | 100 |
| 11 | 60 | 100 | 27 | 60.00 | 29 | 64.44 | 60 | 100 |
| 12 | 60 | 100 | 29 | 64.44 | 26 | 57.78 | 60 | 100 |
| 13 | 60 | 100 | 24 | 53.33 | 26 | 57.78 | 60 | 100 |
| 14 | 60 | 100 | 26 | 57.78 | 32 | 71.11 | 60 | 100 |
| 15 | 60 | 100 | 30 | 66.67 | 31 | 68.89 | 60 | 100 |
| 16 | 60 | 100 | 29 | 64.44 | 28 | 62.22 | 60 | 100 |
| 17 | 60 | 100 | 29 | 64.44 | 29 | 64.44 | 60 | 100 |
| 18 | 60 | 100 | 29 | 64.44 | 36 | 80.00 | 60 | 100 |
| 19 | 60 | 100 | 31 | 68.89 | 33 | 73.33 | 60 | 100 |
| 20 | 60 | 100 | 29 | 64.44 | 25 | 55.56 | 60 | 100 |
| 21 | 60 | 100 | 27 | 60.00 | 32 | 71.11 | 60 | 100 |
| 22 | 60 | 100 | 33 | 73.33 | 31 | 68.89 | 60 | 100 |
| X | 60 | 100 | 33 | 73.33 | 32 | 71.11 | 60 | 100 |
| All | 60 | 100 | 32 | 71.11 | 38 | 84.44 | 60 | 100 |
| Mean | 60 | 100 | 28 | 62.00 | 29 | 64.44 | 60 | 100 |
The numbers of CHB and JPT individuals correctly classified or misclassified with respect to their geographical region of origin differ for each chromosome. For example, the number of CHB individuals correctly classified ranges from 23 (51.11 per cent; Chr 6) to 33 (73.33 per cent; Chr 22 and Chr X).
N* = number of individuals in each population group; % = classification accuracy (per cent); CEU = Caucasian; CHB = Chinese; JPT = Japanese; YRI = Yoruba.