Tao Huang, Yang Shu, Yu-Dong Cai.
Abstract
BACKGROUND: Many differences between different ethnic groups have been observed, such as skin color, eye color, height, susceptibility to some diseases, and response to certain drugs. However, the genetic bases of such differences have been under-investigated. Since the HapMap project, large-scale genotype data from Caucasian, African and Asian population samples have been available. The project found that these populations were located in different areas of the PCA (Principal Component Analysis) plot. However, as an unsupervised method, PCA does not measure the differences in each single nucleotide polymorphism (SNP) among populations.
Year: 2015 PMID: 26690364 PMCID: PMC4687076 DOI: 10.1186/s12864-015-2328-0
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
The 1397 samples from nine ethnic groups
| Index | Abbreviation | Full Name | Training Sample Size | Independent Test Sample Size |
|---|---|---|---|---|
| 1 | ASW | African ancestry in Southwest USA | 74 | 13 |
| 2 | CEU | Utah residents with Northern and Western European ancestry from the CEPH collection | 140 | 25 |
| 3 | CHB/CHD/JPT | Han Chinese in Beijing, China/ Chinese in Metropolitan Denver, Colorado/Japanese in Tokyo, Japan | 305 | 54 |
| 4 | GIH | Gujarati Indians in Houston, Texas | 86 | 15 |
| 5 | LWK | Luhya in Webuye, Kenya | 94 | 16 |
| 6 | MEX | Mexican ancestry in Los Angeles, California | 73 | 13 |
| 7 | MKK | Maasai in Kinyawa, Kenya | 156 | 28 |
| 8 | TSI | Tuscan in Italy | 87 | 15 |
| 9 | YRI | Yoruban in Ibadan, Nigeria (West Africa) | 173 | 30 |
| Total | | | 1188 | 209 |
Fig. 1 Flowchart for the predictive model construction and performance evaluation. First, we randomly divided the HapMap dataset into the training set (85 % of samples from each population) and independent test set (15 % of samples from each population). Then, the training samples were further partitioned into 10 equally sized partitions for 10-fold cross validation. Based on the training dataset, the features were selected, and the predictive model was constructed. Finally, the constructed model was tested on the independent test dataset
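The splitting scheme in Fig. 1 can be sketched roughly as below. This is a minimal illustration, not the authors' code: the group totals are the training plus test sizes from the sample table, while the round-to-nearest rule, the random seed, and the function names `stratified_split` and `k_fold_partitions` are assumptions. With that rounding rule the per-group training sizes happen to match the table. The folds here are plain random partitions (whether the original folds were stratified is not stated), and since 1188 is not divisible by 10 they can only be near-equal in size.

```python
import random

# Training + independent test sizes per group, summed from the sample table.
POPULATION_SIZES = {
    "ASW": 87, "CEU": 165, "CHB/CHD/JPT": 359, "GIH": 101, "LWK": 110,
    "MEX": 86, "MKK": 184, "TSI": 102, "YRI": 203,
}

def stratified_split(labels, train_percent=85, seed=0):
    """Split sample indices per group; the rounding rule is an assumption."""
    rng = random.Random(seed)
    by_group = {}
    for idx, group in enumerate(labels):
        by_group.setdefault(group, []).append(idx)
    train, test = [], []
    for indices in by_group.values():
        rng.shuffle(indices)
        # Integer round-to-nearest of train_percent % of the group size.
        n_train = (len(indices) * train_percent + 50) // 100
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test

def k_fold_partitions(indices, k=10, seed=0):
    """Shuffle the indices and deal them into k near-equal folds."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# One label per sample, in group order (placeholder for real sample IDs).
labels = [g for g, n in POPULATION_SIZES.items() for _ in range(n)]
train_idx, test_idx = stratified_split(labels)
folds = k_fold_partitions(train_idx)

print(len(train_idx), len(test_idx))   # 1188 209
print(sorted(len(f) for f in folds))   # eight folds of 119, two of 118
```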
Fig. 2 The IFS curves of four different methods. The IFS curves show how the 10-fold cross validation accuracies in each ethnic group (y-axis) change with the number of SNPs (x-axis) using SMO (a), IB1 (b), Dagging (c) and RandomForest (d) methods
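An IFS curve of the kind shown in Fig. 2 can be produced by evaluating a classifier on the top k ranked features for increasing k and recording the cross-validation accuracy at each step. The sketch below assumes the feature ranking is already given (how the SNPs were ranked is not described in this excerpt) and uses a hand-written 1-nearest-neighbour classifier as a stand-in for IB1. The toy genotypes, fold layout, and function names are all hypothetical.

```python
def one_nn_predict(train_X, train_y, x):
    """Label of the training sample closest to x (sum of absolute differences)."""
    nearest = min(
        range(len(train_X)),
        key=lambda i: sum(abs(a - b) for a, b in zip(train_X[i], x)),
    )
    return train_y[nearest]

def cv_accuracy(X, y, folds):
    """Fraction of samples predicted correctly when each fold is held out once."""
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in test_idx:
            if one_nn_predict(train_X, train_y, X[i]) == y[i]:
                correct += 1
    return correct / len(X)

def incremental_feature_selection(X, y, ranked_features, folds):
    """Return the IFS curve [(k, accuracy), ...] and the smallest best k."""
    curve = []
    for k in range(1, len(ranked_features) + 1):
        columns = ranked_features[:k]
        X_k = [[row[c] for c in columns] for row in X]
        curve.append((k, cv_accuracy(X_k, y, folds)))
    # max() keeps the first maximum, i.e. the fewest features among ties.
    best_k = max(curve, key=lambda point: point[1])[0]
    return curve, best_k

# Toy data: 20 samples, genotypes coded 0/1/2; only feature 0 separates the groups.
y = ["A"] * 10 + ["B"] * 10
X = [[0 if label == "A" else 2, i % 3, (i * 7) % 3] for i, label in enumerate(y)]
folds = [[i for i in range(20) if i % 5 == f] for f in range(5)]

curve, best_k = incremental_feature_selection(X, y, [0, 1, 2], folds)
print(curve[0], best_k)  # (1, 1.0) 1
```

Choosing the smallest k among tied accuracies is a design choice of this sketch, not something the figure caption specifies.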
The best predictive performance of the different methods
| Method | #SNP | ASW | CEU | CHB/CHD/JPT | GIH | LWK | MEX | MKK | TSI | YRI | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SMO | 2192 | 0.932 | 0.921 | 1.000 | 1.000 | 0.926 | 0.945 | 0.987 | 0.724 | 0.994 | 0.955 |
| IB1 | 2413 | 0.757 | 0.943 | 1.000 | 0.930 | 0.213 | 0.863 | 0.795 | 0.483 | 1.000 | 0.838 |
| Dagging | 186 | 0.338 | 0.964 | 1.000 | 0.988 | 0.383 | 0.808 | 0.968 | 0.345 | 0.994 | 0.840 |
| RandomForest | 75 | 0.459 | 0.900 | 0.993 | 0.884 | 0.543 | 0.740 | 0.853 | 0.345 | 0.931 | 0.815 |
The predictive performance of the SMO method using the top 299 SNPs on the training and independent test datasets
| Dataset | ASW | CEU | CHB/CHD/JPT | GIH | LWK | MEX | MKK | TSI | YRI | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Training (10-fold cross validation) | 0.865 | 0.836 | 1.000 | 0.977 | 0.723 | 0.904 | 0.968 | 0.644 | 0.919 | 0.901 |
| Independent test | 0.846 | 0.760 | 1.000 | 1.000 | 0.688 | 1.000 | 0.786 | 0.800 | 1.000 | 0.895 |
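
The Total column appears to be the pooled fraction of correctly classified samples, not the unweighted mean of the per-group accuracies. The short check below, which assumes that interpretation, recovers the per-group correct counts from the rounded accuracies and the sample sizes in the sample table. The function name `pooled_accuracy` is illustrative.

```python
# Per-group sample sizes from the sample table, in table order (ASW ... YRI).
TRAIN_SIZES = [74, 140, 305, 86, 94, 73, 156, 87, 173]
TEST_SIZES = [13, 25, 54, 15, 16, 13, 28, 15, 30]

# Per-group accuracies of SMO with the top 299 SNPs, copied from the table above.
TRAIN_ACC = [0.865, 0.836, 1.000, 0.977, 0.723, 0.904, 0.968, 0.644, 0.919]
TEST_ACC = [0.846, 0.760, 1.000, 1.000, 0.688, 1.000, 0.786, 0.800, 1.000]

def pooled_accuracy(per_group_acc, sizes):
    """Total correct predictions divided by total samples."""
    # Rounding recovers the integer number of correct calls in each group.
    correct = [round(acc * n) for acc, n in zip(per_group_acc, sizes)]
    return sum(correct) / sum(sizes)

train_total = pooled_accuracy(TRAIN_ACC, TRAIN_SIZES)
test_total = pooled_accuracy(TEST_ACC, TEST_SIZES)
macro_train = sum(TRAIN_ACC) / len(TRAIN_ACC)  # unweighted mean, for contrast

print(round(train_total, 3))  # 0.901
print(round(test_total, 3))   # 0.895
print(round(macro_train, 3))  # 0.871
```

Because the pooled figure is dominated by the large CHB/CHD/JPT group, it sits above the unweighted mean, which gives equal weight to harder groups such as TSI and LWK.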
Fig. 3 The minor allele frequency of the top nine SNPs in each ethnic group. The minor allele frequencies of the top nine SNPs, rs6023406 (a), rs1426654 (b), rs1325421 (c), rs8049040 (d), rs13432350 (e), rs1834640 (f), rs1325055 (g), rs3764719 (h), rs2973133 (i), in the nine ethnic groups were plotted. Each ethnic group has its own specific alleles. For example, the allele frequencies of rs6023406_G, rs1426654_A, rs1325421_T, rs8049040_G, rs13432350_T, rs1834640_A and rs3764719_C were very low, but those of rs1325055_G and rs2973133_A were very high in the Asian population (CHB/CHD/JPT)
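Per-group allele frequencies like those plotted in Fig. 3 can be computed from genotype calls as in the sketch below. It assumes diploid genotypes coded as the number of copies (0, 1 or 2) of the allele of interest, with `None` for a missing call. The group labels and genotype values are made-up toy data, not HapMap values, and the function names are hypothetical.

```python
def allele_frequency(genotypes):
    """Frequency of the counted allele among called diploid genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    # Each called sample contributes two chromosomes.
    return sum(called) / (2 * len(called))

def frequency_by_group(genotypes, groups):
    """Map each group label to the allele frequency within that group."""
    per_group = {}
    for genotype, group in zip(genotypes, groups):
        per_group.setdefault(group, []).append(genotype)
    return {group: allele_frequency(calls) for group, calls in per_group.items()}

# Toy example: allele copies for one SNP in two hypothetical groups.
genotypes = [0, 0, 1, None, 2, 2, 1, 2]
groups = ["group_1"] * 4 + ["group_2"] * 4

freqs = frequency_by_group(genotypes, groups)
print(freqs)  # group_1: 1/6 (about 0.167), group_2: 7/8 (0.875)
```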
Gene ontology enrichments of genes close to the 427 SNPs
| Term | P value | Fold Enrichment | Benjamini adjusted P value |
|---|---|---|---|
| GO:0031424 ~ keratinization (BP) | 3.37E-06 | 4.77 | 0.00998 |
| GO:0030216 ~ keratinocyte differentiation (BP) | 6.41E-06 | 3.78 | 0.00948 |
| GO:0030855 ~ epithelial cell differentiation (BP) | 1.64E-05 | 2.67 | 0.01613 |
| GO:0009913 ~ epidermal cell differentiation (BP) | 2.09E-05 | 3.46 | 0.01545 |
| GO:0001533 ~ cornified envelope (CC) | 4.76E-06 | 6.95 | 0.00228 |
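
The "Benjamini adjusted" column presumably refers to a Benjamini-Hochberg style multiple-testing correction; that reading is an assumption, since the excerpt does not name the procedure. The sketch below shows the standard Benjamini-Hochberg adjustment on hypothetical p-values. The adjusted values in the table cannot be reproduced from the five rows shown, because the adjustment depends on the full list of terms that were tested.

```python
def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all higher ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for four tested terms.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
print([round(p, 4) for p in adjusted])
```

A term is then called significant at a chosen false discovery rate when its adjusted value falls below that threshold.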