Tanjin T Toma, Jeremy M Dawson, Donald A Adjeroh.
Abstract
BACKGROUND: While continental-level ancestry prediction is relatively simple using genomic information, distinguishing between individuals from closely associated sub-populations (e.g., from the same continent) remains challenging.
Keywords: Ancestry prediction; DNA; SNP; Single chromosome
Year: 2018 PMID: 30453954 PMCID: PMC6245491 DOI: 10.1186/s12920-018-0412-4
Source DB: PubMed Journal: BMC Med Genomics ISSN: 1755-8794 Impact factor: 3.063
Fig. 1 Graphical depiction of the proposed process of SNP selection for predicting human biogeographical ancestry
26 populations in the dataset
| Population code | Population name | Continent | Sample size |
|---|---|---|---|
| PUR | Puerto Rican | America | 104 |
| CLM | Colombian | America | 94 |
| PEL | Peruvian | America | 85 |
| MXL | Mexican-American | America | 64 |
| GBR | British | Europe | 91 |
| FIN | Finnish | Europe | 99 |
| IBS | Spanish | Europe | 107 |
| CEU | CEPH | Europe | 99 |
| TSI | Tuscan | Europe | 107 |
| CHS | Southern Han Chinese | East Asia | 105 |
| CDX | Dai Chinese | East Asia | 93 |
| KHV | Kinh Vietnamese | East Asia | 99 |
| CHB | Han Chinese | East Asia | 103 |
| JPT | Japanese | East Asia | 104 |
| PJL | Punjabi | South Asia | 96 |
| BEB | Bengali | South Asia | 86 |
| STU | Sri Lankan | South Asia | 102 |
| ITU | Indian | South Asia | 102 |
| GIH | Gujarati | South Asia | 103 |
| ACB | African-Caribbean | Africa | 96 |
| GWD | Gambian | Africa | 113 |
| ESN | Esan | Africa | 99 |
| MSL | Mende | Africa | 85 |
| YRI | Yoruba | Africa | 108 |
| LWK | Luhya | Africa | 99 |
| ASW | African-American SW | Africa | 61 |
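As a quick consistency check on the table above, the per-continent sample sizes can be tallied with a short script. This is a minimal sketch: the `POPULATIONS` dictionary only transcribes the table, and the function and variable names are illustrative, not taken from the paper.

```python
from collections import defaultdict

# population code -> (continent, sample size), transcribed from the table above
POPULATIONS = {
    "PUR": ("America", 104), "CLM": ("America", 94),
    "PEL": ("America", 85), "MXL": ("America", 64),
    "GBR": ("Europe", 91), "FIN": ("Europe", 99), "IBS": ("Europe", 107),
    "CEU": ("Europe", 99), "TSI": ("Europe", 107),
    "CHS": ("East Asia", 105), "CDX": ("East Asia", 93), "KHV": ("East Asia", 99),
    "CHB": ("East Asia", 103), "JPT": ("East Asia", 104),
    "PJL": ("South Asia", 96), "BEB": ("South Asia", 86), "STU": ("South Asia", 102),
    "ITU": ("South Asia", 102), "GIH": ("South Asia", 103),
    "ACB": ("Africa", 96), "GWD": ("Africa", 113), "ESN": ("Africa", 99),
    "MSL": ("Africa", 85), "YRI": ("Africa", 108), "LWK": ("Africa", 99),
    "ASW": ("Africa", 61),
}


def continent_totals(populations):
    """Sum the sample sizes of all populations belonging to each continent."""
    totals = defaultdict(int)
    for continent, size in populations.values():
        totals[continent] += size
    return dict(totals)


totals = continent_totals(POPULATIONS)
for continent, n in totals.items():
    print(f"{continent}: {n}")
print("Total:", sum(totals.values()))
```

The totals come to 347 (America), 503 (Europe), 504 (East Asia), 489 (South Asia), and 661 (Africa), for 2504 samples overall. This matches the data size listed for the proposed method in the continental-level comparison table.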
Fig. 2 Results for continental-level ancestry classification using varying thresholds. Results include both accuracy (left) and the number of SNPs (right) required to achieve a given accuracy
Confusion matrix for continental-level ancestry classification (overall classification rate of 96.75%, 206 SNPs)
| Continents | Europe | America | Africa | East Asia | South Asia |
|---|---|---|---|---|---|
| Europe | 94.06% | 3.96% | 0.00% | 0.00% | 1.98% |
| America | 10.94% | 89.06% | 0.00% | 0.00% | 0.00% |
| Africa | 0.00% | 0.00% | 100.00% | 0.00% | 0.00% |
| East Asia | 0.00% | 0.00% | 0.00% | 100.00% | 0.00% |
| South Asia | 1.02% | 2.04% | 0.00% | 0.00% | 96.94% |
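The confusion matrix above can be read programmatically. The sketch below assumes that each row is the true continent and each column the predicted continent, which is consistent with every row summing to 100% but is not stated explicitly in the table. The helper names are hypothetical.

```python
LABELS = ["Europe", "America", "Africa", "East Asia", "South Asia"]

# Assumed layout: row = true continent, column = predicted continent,
# each cell given as a percentage of that row (values from the table above).
CONFUSION_PCT = [
    [94.06, 3.96, 0.00, 0.00, 1.98],
    [10.94, 89.06, 0.00, 0.00, 0.00],
    [0.00, 0.00, 100.00, 0.00, 0.00],
    [0.00, 0.00, 0.00, 100.00, 0.00],
    [1.02, 2.04, 0.00, 0.00, 96.94],
]


def diagonal_rates(matrix, labels):
    """Per-continent correct-classification rate (the diagonal cells)."""
    return {label: matrix[i][i] for i, label in enumerate(labels)}


def largest_confusion(matrix, labels):
    """Return (true, predicted, percent) for the largest off-diagonal cell."""
    best = None
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if i != j and (best is None or value > best[2]):
                best = (labels[i], labels[j], value)
    return best


rates = diagonal_rates(CONFUSION_PCT, LABELS)
macro_average = sum(rates.values()) / len(rates)  # unweighted mean over continents
worst = largest_confusion(CONFUSION_PCT, LABELS)

print(f"Macro-averaged rate: {macro_average:.2f}%")
print(f"Largest confusion: {worst[0]} predicted as {worst[1]} ({worst[2]:.2f}%)")
```

Under this reading, the most common error is American samples predicted as European (10.94%). The unweighted mean of the diagonal is about 96.01%, slightly below the reported overall rate of 96.75%. That gap is what one would expect if the overall rate weights each continent by its number of test samples, which the table does not list.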
Results for pairwise/binary classification between sub-populations in each continent
| Continent | Sub-populations | Number of SNPs | Correlation Threshold | Accuracy (80–20) |
|---|---|---|---|---|
| America | PUR-PEL | 56 | 0.76 | 100.00% |
| America | PUR-MXL | 44 | 0.72 | 93.33% |
| America | PUR-CLM | 89 | 0.83 | 66.67% |
| America | CLM-PEL | 96 | 0.84 | 97.06% |
| America | CLM-MXL | 37 | 0.69 | 74.07% |
| America | PEL-MXL | 96 | 0.84 | 84.00% |
| Europe | GBR-FIN | 15 | 0.47 | 78.38% |
| Europe | GBR-IBS | 63 | 0.80 | 66.67% |
| Europe | GBR-CEU | 30 | 0.64 | 67.57% |
| Europe | GBR-TSI | 24 | 0.61 | 76.92% |
| Europe | FIN-IBS | 82 | 0.83 | 83.33% |
| Europe | FIN-CEU | 130 | 0.88 | 80.00% |
| Europe | FIN-TSI | 75 | 0.82 | 90.48% |
| Europe | IBS-CEU | 47 | 0.75 | 71.43% |
| Europe | IBS-TSI | 82 | 0.83 | 77.27% |
| Europe | CEU-TSI | 31 | 0.67 | 73.81% |
| East Asia | CHS-CDX | 44 | 0.73 | 64.10% |
| East Asia | CHS-KHV | 12 | 0.41 | 68.29% |
| East Asia | CHS-CHB | 30 | 0.66 | 64.29% |
| East Asia | CHS-JPT | 83 | 0.84 | 73.81% |
| East Asia | CDX-KHV | 30 | 0.66 | 68.42% |
| East Asia | CDX-CHB | 120 | 0.87 | 76.92% |
| East Asia | CDX-JPT | 120 | 0.87 | 87.18% |
| East Asia | KHV-CHB | 62 | 0.79 | 75.61% |
| East Asia | KHV-JPT | 92 | 0.85 | 82.93% |
| East Asia | CHB-JPT | 83 | 0.84 | 71.43% |
| South Asia | PJL-BEB | 29 | 0.65 | 74.29% |
| South Asia | PJL-STU | 57 | 0.78 | 62.50% |
| South Asia | PJL-ITU | 29 | 0.65 | 70.00% |
| South Asia | PJL-GIH | 153 | 0.89 | 100.00% |
| South Asia | BEB-STU | 42 | 0.72 | 72.97% |
| South Asia | BEB-ITU | 139 | 0.88 | 70.27% |
| South Asia | BEB-GIH | 113 | 0.86 | 100.00% |
| South Asia | STU-ITU | 29 | 0.65 | 64.29% |
| South Asia | STU-GIH | 79 | 0.82 | 100.00% |
| South Asia | ITU-GIH | 79 | 0.82 | 100.00% |
| Africa | ACB-GWD | 47 | 0.76 | 76.74% |
| Africa | ACB-ESN | 20 | 0.56 | 79.49% |
| Africa | ACB-MSL | 46 | 0.75 | 71.43% |
| Africa | ACB-YRI | 43 | 0.72 | 80.49% |
| Africa | ACB-LWK | 60 | 0.79 | 79.49% |
| Africa | ACB-ASW | 15 | 0.49 | 81.48% |
| Africa | GWD-ESN | 46 | 0.75 | 77.27% |
| Africa | GWD-MSL | 73 | 0.82 | 72.50% |
| Africa | GWD-YRI | 132 | 0.88 | 100.00% |
| Africa | GWD-LWK | 132 | 0.88 | 100.00% |
| Africa | GWD-ASW | 132 | 0.88 | 96.88% |
| Africa | ESN-MSL | 102 | 0.86 | 69.44% |
| Africa | ESN-YRI | 132 | 0.88 | 100.00% |
| Africa | ESN-LWK | 132 | 0.88 | 100.00% |
| Africa | ESN-ASW | 132 | 0.88 | 96.43% |
| Africa | MSL-YRI | 38 | 0.71 | 100.00% |
| Africa | MSL-LWK | 132 | 0.88 | 100.00% |
| Africa | MSL-ASW | 73 | 0.82 | 91.67% |
| Africa | YRI-LWK | 28 | 0.65 | 78.57% |
| Africa | YRI-ASW | 146 | 0.89 | 90.00% |
| Africa | LWK-ASW | 162 | 0.90 | 85.71% |
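The rows of the table above are every unordered pair of sub-populations within a continent. As a minimal sketch, they can be enumerated with `itertools.combinations`; the `CONTINENT_POPULATIONS` mapping transcribes the population codes from the dataset table, and `pairwise_tasks` is a hypothetical helper name.

```python
from itertools import combinations

# Population codes grouped by continent, as listed in the dataset table.
CONTINENT_POPULATIONS = {
    "America": ["PUR", "CLM", "PEL", "MXL"],
    "Europe": ["GBR", "FIN", "IBS", "CEU", "TSI"],
    "East Asia": ["CHS", "CDX", "KHV", "CHB", "JPT"],
    "South Asia": ["PJL", "BEB", "STU", "ITU", "GIH"],
    "Africa": ["ACB", "GWD", "ESN", "MSL", "YRI", "LWK", "ASW"],
}


def pairwise_tasks(groups):
    """List every unordered within-continent pair as an 'A-B' label."""
    return {
        continent: [f"{a}-{b}" for a, b in combinations(codes, 2)]
        for continent, codes in groups.items()
    }


tasks = pairwise_tasks(CONTINENT_POPULATIONS)
for continent, pairs in tasks.items():
    print(f"{continent}: {len(pairs)} binary classification tasks")
print("Total:", sum(len(pairs) for pairs in tasks.values()))
```

A continent with n sub-populations yields n(n-1)/2 pairs: 6 for America, 10 each for Europe, East Asia, and South Asia, and 21 for Africa. That gives 57 binary tasks, one per table row. The enumeration order may differ from the row order in the table.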
Fig. 3 Pairwise classification results with varying correlation thresholds, for subgroups within the continent of America: a PUR vs. PEL; b PUR vs. MXL; c PUR vs. CLM; d CLM vs. PEL; e CLM vs. MXL; and f PEL vs. MXL
Comparative performance on continental-level ancestry classification
| Basic Method | Data Size | Datasets Used | Classification Rate (%) |
|---|---|---|---|
| STRUCTURE | 664 | Multiple datasets | 96.1 |
| SNPforID | 2689 | 1000 Genome, HGDP, NIST | 98.8 |
| STRUCTURE | 6410 | Multiple datasets | 81.4 |
| Random match probability | 451 | Own collection | 77.0 (+ 21.6 thresholded out) |
| Proposed | 2504 | 1000 Genome Phase 3 | 99.19 (614 SNPs) |
| Proposed | 2504 | 1000 Genome Phase 3 | 96.75 (206 SNPs) |
Comparative performance in sub-population-level ancestry classification
| Pairwise sub-populations | Continent | Method | Data size | Datasets | Classification rate (%) | Number of attributes used |
|---|---|---|---|---|---|---|
| CEU-TSI | Europe | | 267 | | 86.6 ± 2.4 | 180 SNP |
| CHB-JPT | East Asia | | 250 | | 95.6 ± 3.9 | 877 SNP |
| LWK-MKK | Africa | | 294 | | 95.9 ± 1.5 | 341 SNP |
| JPT-CHB | East Asia | | 9104 | | 74.9 (77.2***) | 15 STR |
| JPT-KOR | East Asia | | 731 | | 67.9 (63.7) | 15 STR |
| CHB-KOR | East Asia | | 731 | | 69.6 (62.4) | 15 STR |
| – | Europe | Proposed | 503 | 1000 | 76.6* | 58** |
| – | Africa | Proposed | 661 | 1000 | 87.02* | 87** |
| – | East Asia | Proposed | 504 | 1000 | 73.3* | 68** |
*Average accuracy of all pairwise sub-population classifications within the given continent
**Average number of SNPs required in all pairwise sub-population classifications within the given continent
***Results obtained without normalization
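The averages described in the first two footnotes can be recomputed from the pairwise results table. The sketch below transcribes the (number of SNPs, accuracy) values for Europe, East Asia, and Africa from that table. `RESULTS` and `continent_averages` are illustrative names, and rounding to one decimal place is a choice made here, not something stated in the source.

```python
# (number of SNPs, accuracy in %) for each pairwise task, per continent,
# transcribed from the pairwise/binary classification table.
RESULTS = {
    "Europe": [
        (15, 78.38), (63, 66.67), (30, 67.57), (24, 76.92), (82, 83.33),
        (130, 80.00), (75, 90.48), (47, 71.43), (82, 77.27), (31, 73.81),
    ],
    "East Asia": [
        (44, 64.10), (12, 68.29), (30, 64.29), (83, 73.81), (30, 68.42),
        (120, 76.92), (120, 87.18), (62, 75.61), (92, 82.93), (83, 71.43),
    ],
    "Africa": [
        (47, 76.74), (20, 79.49), (46, 71.43), (43, 80.49), (60, 79.49),
        (15, 81.48), (46, 77.27), (73, 72.50), (132, 100.00), (132, 100.00),
        (132, 96.88), (102, 69.44), (132, 100.00), (132, 100.00), (132, 96.43),
        (38, 100.00), (132, 100.00), (73, 91.67), (28, 78.57), (146, 90.00),
        (162, 85.71),
    ],
}


def continent_averages(results):
    """Return {continent: (mean accuracy to 1 decimal, mean SNP count to nearest integer)}."""
    summary = {}
    for continent, rows in results.items():
        mean_snps = sum(snps for snps, _ in rows) / len(rows)
        mean_acc = sum(acc for _, acc in rows) / len(rows)
        summary[continent] = (round(mean_acc, 1), round(mean_snps))
    return summary


summary = continent_averages(RESULTS)
for continent, (acc, snps) in summary.items():
    print(f"{continent}: average accuracy {acc}%, average SNPs {snps}")
```

This gives 76.6% with 58 SNPs for Europe, 73.3% with 68 SNPs for East Asia, and 87.0% with 87 SNPs for Africa, in line with the starred values in the comparison table above.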