Xiao-Ye Jin, Yuan-Yuan Wei, Qiong Lan, Wei Cui, Chong Chen, Yu-Xin Guo, Ya-Ting Fang, Bo-Feng Zhu.
Abstract
In recent years, forensic geneticists have begun to develop ancestry informative marker (AIM) panels for ancestry analysis of regional populations. In this study, we chose 48 single nucleotide polymorphisms (SNPs) from the SPSmart database to infer the ancestral origins of continental populations and Chinese subpopulations. Based on the genetic data of four continental populations (African, American, East Asian and European) from the CEPH-HGDP database, we assessed the power of these SNPs to differentiate continental populations. Population genetic structure analysis revealed that these SNPs could discern distinct ancestry components among the continental populations. A separate population set from 1000 Genomes Phase 3 served as testing populations to further validate the efficiency of the selected SNPs. Twenty-two populations from the CEPH-HGDP database were grouped into three known populations (African, East Asian, and European) according to their biogeographical regions. Principal component analysis and Bayes analysis of the testing populations and the three known populations indicated that the testing populations could be correctly assigned to their corresponding biogeographical origins. For three Chinese populations (Han, Mongolian, and Uygur), multinomial logistic regression analyses indicated that these 48 SNPs could be used to estimate the ancestral origins of these populations. These SNPs therefore show promise for ancestry analysis among continental populations and some Chinese populations, and they could be used in population genetics and forensic research.
Keywords: Ancestry informative markers; Biogeographical origins; Chinese populations; Continental populations; SNP
Year: 2019 PMID: 30956897 PMCID: PMC6445247 DOI: 10.7717/peerj.6508
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Table 1. Detailed information on the populations used in this study and their corresponding sample sizes.
| Datasets | Populations | Abbreviations | Continents | Sources | Sample sizes |
|---|---|---|---|---|---|
| Training set | Biaka Pygmy | – | Africa | CEPH-HGDP | 22 |
| Training set | Mbuti Pygmy | – | Africa | CEPH-HGDP | 13 |
| Training set | Bantu | – | Africa | CEPH-HGDP | 19 |
| Training set | Yoruba | – | Africa | CEPH-HGDP | 21 |
| Training set | Mandenka | – | Africa | CEPH-HGDP | 22 |
| Training set | Brazilian | – | America | CEPH-HGDP | 22 |
| Training set | Maya | – | America | CEPH-HGDP | 21 |
| Training set | Pima | – | America | CEPH-HGDP | 14 |
| Training set | Basque | – | Europe | CEPH-HGDP | 24 |
| Training set | French | – | Europe | CEPH-HGDP | 28 |
| Training set | Italian | – | Europe | CEPH-HGDP | 49 |
| Training set | Orcadian | – | Europe | CEPH-HGDP | 15 |
| Training set | Adygei | – | Europe | CEPH-HGDP | 17 |
| Training set | Russian | – | Europe | CEPH-HGDP | 25 |
| Training set | Cambodian | – | East Asia | CEPH-HGDP | 10 |
| Training set | Dai | – | East Asia | CEPH-HGDP | 10 |
| Training set | Han | – | East Asia | CEPH-HGDP | 44 |
| Training set | Miao | – | East Asia | CEPH-HGDP | 10 |
| Training set | Mongolian | – | East Asia | CEPH-HGDP | 10 |
| Training set | She | – | East Asia | CEPH-HGDP | 10 |
| Training set | Tu | – | East Asia | CEPH-HGDP | 10 |
| Training set | Tujia | – | East Asia | CEPH-HGDP | 10 |
| Training set | Yi | – | East Asia | CEPH-HGDP | 10 |
| Training set | Japanese | – | East Asia | CEPH-HGDP | 28 |
| Training set | Yakut | – | East Asia | CEPH-HGDP | 25 |
| Testing set | Esan in Nigeria | ESN | Africa | 1000 Genomes Phase 3 | 99 |
| Testing set | Yoruba in Ibadan, Nigeria | YRI | Africa | 1000 Genomes Phase 3 | 108 |
| Testing set | Finnish in Finland | FIN | Europe | 1000 Genomes Phase 3 | 99 |
| Testing set | British in England and Scotland | GBR | Europe | 1000 Genomes Phase 3 | 91 |
| Testing set | Han Chinese in Beijing, China | CHB | East Asia | 1000 Genomes Phase 3 | 103 |
| Testing set | Japanese in Tokyo, Japan | JPT | East Asia | 1000 Genomes Phase 3 | 104 |
| Three subpopulations in China | Uygur | – | Central Asia | CEPH-HGDP | 10 |
| Three subpopulations in China | Han | – | East Asia | CEPH-HGDP | 44 |
| Three subpopulations in China | Mongolian | – | East Asia | CEPH-HGDP | 10 |
Notes.
The Bantu population includes the Kenya Bantu and South African Bantu populations.
The Brazilian population includes the Karitiana and Surui populations.
The Italian population includes the Sardinian, Tuscan and Bergamo populations.
Table 2. General information and ancestral allele frequencies of the 48 SNP loci in different continental populations.
| Rs numbers | Alleles | Chromosomes | Positions (bp) | African | American | European | East Asian |
|---|---|---|---|---|---|---|---|
|  | C/T | 1 | 165478920 | 0.8351 | 0.3947 | 0.6108 | 0.2429 |
|  | G/T | 1 | 14881716 | 0.8608 | 0.2368 | 0.7753 | 0.4774 |
|  | A/G | 1 | 183824189 | 0.7320 | 0.4737 | 0.8101 | 0.6215 |
|  | A/C | 2 | 74528651 | 0.3763 | 0.5000 | 0.8481 | 0.1921 |
|  | C/T | 2 | 239209361 | 0.7320 | 0.6053 | 0.3829 | 0.5424 |
|  | A/C | 2 | 3325638 | 0.5567 | 0.8947 | 0.9051 | 0.4915 |
|  | C/T | 3 | 18394010 | 0.9742 | 0.9474 | 0.8101 | 0.5621 |
|  | A/G | 3 | 97627774 | 0.7423 | 0.8947 | 0.0918 | 0.6638 |
|  | A/G | 3 | 42248320 | 0.2835 | 0.3246 | 0.1456 | 0.7345 |
|  | G/T | 3 | 171401588 | 0.6649 | 0.0789 | 0.6551 | 0.0847 |
|  | C/T | 3 | 59723035 | 0.8866 | 0.8070 | 0.7532 | 0.5480 |
|  | C/T | 4 | 23799564 | 0.4897 | 0.7456 | 0.5032 | 0.5593 |
|  | C/T | 4 | 99144933 | 0.7268 | 1.0000 | 0.8291 | 0.3757 |
|  | C/T | 5 | 146472334 | 0.1546 | 0.2193 | 0.2057 | 0.7994 |
|  | A/C | 5 | 54158354 | 0.8351 | 0.8158 | 0.8133 | 0.3814 |
|  | C/T | 5 | 33969523 | 1.0000 | 0.8509 | 0.3956 | 0.8644 |
|  | C/T | 5 | 76526649 | 0.9639 | 0.5175 | 0.7880 | 0.1299 |
|  | A/G | 5 | 17437385 | 0.8918 | 0.2368 | 0.1361 | 0.0876 |
|  | A/G | 6 | 100446711 | 0.8557 | 0.4123 | 0.7025 | 0.5678 |
|  | C/T | 6 | 44027931 | 0.7990 | 0.7018 | 0.3291 | 0.9379 |
|  | A/G | 6 | 73028938 | 0.8608 | 0.7018 | 0.2500 | 0.7655 |
|  | C/T | 7 | 147717307 | 0.6031 | 0.6053 | 0.7152 | 0.7386 |
|  | G/A | 7 | 99767460 | 0.8505 | 0.2456 | 0.1108 | 0.1243 |
|  | A/G | 7 | 120683673 | 0.5773 | 0.7632 | 0.2057 | 0.4774 |
|  | A/G | 8 | 92646120 | 0.4124 | 0.3947 | 0.7962 | 0.2542 |
|  | A/C | 8 | 16149886 | 0.7680 | 0.5439 | 0.1329 | 0.4972 |
|  | A/C | 8 | 71193716 | 0.9588 | 0.3509 | 0.7342 | 0.3136 |
|  | A/G | 9 | 229826 | 0.4845 | 0.5877 | 0.8766 | 0.5791 |
|  | C/T | 9 | 93488779 | 0.0361 | 0.3333 | 0.4019 | 0.4096 |
|  | A/G | 10 | 18370276 | 0.2938 | 0.2456 | 0.3513 | 0.6441 |
|  | C/T | 10 | 89903145 | 0.8557 | 0.5439 | 0.8956 | 0.2147 |
|  | C/T | 11 | 99080380 | 0.6856 | 0.8070 | 0.8734 | 0.5256 |
|  | A/G | 11 | 20099911 | 0.7732 | 0.2018 | 0.7437 | 0.1384 |
|  | T/C | 12 | 72568351 | 0.5825 | 0.2632 | 0.1361 | 0.5311 |
|  | A/G | 13 | 29010756 | 0.5052 | 0.5877 | 0.5696 | 0.3220 |
|  | C/T | 13 | 94450792 | 0.9330 | 0.2544 | 0.7468 | 0.3220 |
|  | C/T | 13 | 111238209 | 0.7938 | 0.3246 | 0.8070 | 0.3079 |
|  | A/C | 14 | 66628895 | 0.2917 | 0.5357 | 0.9209 | 0.5028 |
|  | C/T | 15 | 93163645 | 0.5158 | 0.2281 | 0.3228 | 0.5141 |
|  | A/G | 15 | 78892045 | 0.6495 | 0.5439 | 0.2437 | 0.6780 |
|  | A/G | 16 | 57361752 | 0.6649 | 0.3333 | 0.0316 | 0.1412 |
|  | C/T | 17 | 15471179 | 0.7474 | 0.2105 | 0.1076 | 0.3475 |
|  | A/G | 18 | 62566413 | 0.7062 | 0.6316 | 0.7057 | 0.3023 |
|  | T/C | 18 | 28093911 | 0.2165 | 0.7632 | 0.3038 | 0.8079 |
|  | C/T | 20 | 34289960 | 0.8711 | 0.0088 | 0.1076 | 0.2147 |
|  | A/G | 20 | 57233824 | 0.2268 | 0.6667 | 0.7025 | 0.6864 |
|  | G/A | 21 | 35449600 | 0.2010 | 0.4737 | 0.5443 | 0.2288 |
|  | A/G | 22 | 18128456 | 0.9330 | 0.7456 | 0.1835 | 0.5650 |
Notes.
Information of each SNP locus is shown according to the report of dbSNP build 152.
Ancestral allele frequencies of 48 SNPs in four continental populations are obtained based on the genetic data of 25 training populations in Table 1.
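One quick way to read the frequency table is to look at the absolute difference in ancestral allele frequency between pairs of continental groups. The Python sketch below does this for three rows copied from the table. Loci are keyed by chromosome and position because the rs numbers are not available here, and the helper names `pairwise_delta` and `most_divergent_pair` are illustrative. This is a simple descriptive measure and is not necessarily the statistic the authors used.

```python
from itertools import combinations

# Ancestral allele frequencies copied from three rows of the table above.
# Keys are chromosome:position (an illustrative labelling; rs numbers are
# not available in this record).
FREQS = {
    "chr20:34289960": {"African": 0.8711, "American": 0.0088,
                       "European": 0.1076, "East Asian": 0.2147},
    "chr3:97627774": {"African": 0.7423, "American": 0.8947,
                      "European": 0.0918, "East Asian": 0.6638},
    "chr5:146472334": {"African": 0.1546, "American": 0.2193,
                       "European": 0.2057, "East Asian": 0.7994},
}


def pairwise_delta(freqs):
    """Return {(pop_a, pop_b): |p_a - p_b|} for every pair of populations."""
    return {(a, b): abs(freqs[a] - freqs[b])
            for a, b in combinations(freqs, 2)}


def most_divergent_pair(freqs):
    """Return the population pair with the largest frequency difference."""
    deltas = pairwise_delta(freqs)
    pair = max(deltas, key=deltas.get)
    return pair, deltas[pair]


for locus, freqs in FREQS.items():
    pair, delta = most_divergent_pair(freqs)
    print(f"{locus}: {pair[0]} vs {pair[1]}, delta = {delta:.4f}")
```

A locus with a large difference for one pair and small differences elsewhere mainly helps separate that pair, which is why panels of this kind combine many loci.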
Figure 1. Ancestral allele frequency heatmap of 48 SNPs in 25 training populations from different continents.
Different colors represent different frequency levels: blue for low values, red for high values.
Figure 2. Population-specific In values of 48 SNPs in African, American, European and East Asian populations.
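Assuming that "In" in this caption denotes the informativeness-for-assignment statistic of Rosenberg and colleagues, a minimal Python sketch of that statistic for a biallelic locus might look as follows. It weights the populations equally and uses natural logarithms. The function names are illustrative, and the population-specific values plotted in the figure may have been computed differently (for example, one group against the rest).

```python
import math


def xlogx(p):
    """Return p * ln(p), with the convention 0 * ln(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def informativeness(anc_freqs):
    """Informativeness for assignment of one biallelic locus.

    anc_freqs: ancestral allele frequency in each of K populations,
    all populations weighted equally (an assumption of this sketch).
    """
    k = len(anc_freqs)
    total = 0.0
    # Sum over both alleles: ancestral (p) and derived (1 - p).
    for allele_freqs in (anc_freqs, [1.0 - p for p in anc_freqs]):
        mean = sum(allele_freqs) / k
        total += -xlogx(mean) + sum(xlogx(p) for p in allele_freqs) / k
    return total


# Frequencies in the order African, American, European, East Asian,
# copied from two rows of the frequency table (chr20:34289960 and
# chr7:147717307).
divergent_locus = [0.8711, 0.0088, 0.1076, 0.2147]
similar_locus = [0.6031, 0.6053, 0.7152, 0.7386]

print(informativeness(divergent_locus))
print(informativeness(similar_locus))
```

The statistic is zero when all populations share the same frequency and reaches ln 2 for a biallelic locus that is fixed for different alleles in two populations, so the first locus scores well above the second.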
Figure 3. Principal component analysis of four continental populations comprising 25 training populations.
Figure 4. Genetic structure analyses of 25 training populations at K = 2–5 (A) and cross-validation error for each K value (B) based on 48 SNPs.
Figure 5. Ancestral origin analyses of six testing populations based on 48 SNPs. (A) Genetic components of six testing populations estimated with ADMIXTURE v1.3. (B) Principal component analysis of six testing populations and three continental populations.
Population abbreviations (CHB, ESN, FIN, GBR, JPT and YRI) are explained in Table 1.
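The abstract reports that a Bayes analysis assigned the testing populations to their biogeographical origins. As a rough illustration of the general idea, and not a reconstruction of the authors' procedure, the Python sketch below computes posterior probabilities for the three known populations from genotype likelihoods. It assumes Hardy-Weinberg proportions, independent loci and a flat prior, uses only three loci from the frequency table, and the genotype passed in is hypothetical.

```python
import math

POPS = ("African", "European", "East Asian")

# Ancestral allele frequencies (African, European, East Asian) copied from
# three rows of the frequency table, keyed by chromosome:position.
LOCI = {
    "chr20:34289960": (0.8711, 0.1076, 0.2147),
    "chr3:97627774": (0.7423, 0.0918, 0.6638),
    "chr5:146472334": (0.1546, 0.2057, 0.7994),
}


def genotype_prob(p, n_anc):
    """Hardy-Weinberg probability of carrying n_anc (0, 1 or 2) ancestral alleles."""
    q = 1.0 - p
    return (q * q, 2.0 * p * q, p * p)[n_anc]


def posterior(genotypes, eps=1e-4):
    """Posterior probability of each population under a flat prior.

    genotypes: {locus: number of ancestral alleles carried}.
    eps clamps frequencies away from 0 and 1 so the logarithm stays finite.
    """
    log_liks = []
    for i in range(len(POPS)):
        ll = 0.0
        for locus, n_anc in genotypes.items():
            p = min(max(LOCI[locus][i], eps), 1.0 - eps)
            ll += math.log(genotype_prob(p, n_anc))
        log_liks.append(ll)
    # Normalise in a numerically stable way.
    top = max(log_liks)
    weights = [math.exp(ll - top) for ll in log_liks]
    total = sum(weights)
    return {pop: w / total for pop, w in zip(POPS, weights)}


# Hypothetical genotype, for illustration only (not data from the study).
example = {"chr20:34289960": 2, "chr3:97627774": 2, "chr5:146472334": 0}
result = posterior(example)
print(max(result, key=result.get), result)
```

With all 48 loci the likelihoods would be multiplied in the same way, so a single weakly informative locus contributes little while consistent evidence across loci accumulates.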
Figure 6. Population-specific In values of 48 SNPs in Han, Uygur and Mongolian populations.
Figure 7. Genetic differentiation analyses among Han, Mongolian and Uygur populations in China.
(A) Genetic structure analyses among these populations. (B) Principal component analysis of these populations.