Abstract
BACKGROUND: Single nucleotide polymorphisms (SNPs) are genetic variations that account for the differences between any two unrelated individuals, so they can be used to identify an individual's source population. For efficient population identification with the HapMap genotype data, as few informative SNPs as possible should be selected from the original 4 million SNPs. Recently, Park et al. (2006) adopted the nearest shrunken centroid method to classify three populations: Utah residents with ancestry from Northern and Western Europe (CEU); Yoruba in Ibadan, Nigeria, in West Africa (YRI); and Han Chinese in Beijing together with Japanese in Tokyo (CHB+JPT). They obtained 100,736 SNPs, of which the top 82 could completely classify the three populations.
Year: 2007 PMID: 18093342 PMCID: PMC2245981 DOI: 10.1186/1471-2105-8-484
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Example of haplotypes and genotypes. (a) The haplotype and genotype formats of one individual; (b) different nominal values (genotype format) of one SNP for different individuals; (c) numerical values of one SNP for the individuals in (b), in which the first transformation is for the F-statistics algorithm and the second transformation, in vector format, is for the modified t-test algorithm.
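The caption does not spell out the two transformations in panel (c). As a minimal sketch of one plausible choice, the code below assumes (this is not stated in the source) that the scalar form counts copies of a chosen reference allele and that the vector form is a one-hot indicator over the three genotype classes. The function names and the example alleles are hypothetical.

```python
from typing import List


def encode_scalar(genotype: str, ref_allele: str) -> int:
    """Count copies of ref_allele in a two-letter genotype such as 'AG' (0, 1, or 2).

    Assumed encoding for illustration; the source does not define it.
    """
    if len(genotype) != 2:
        raise ValueError(f"expected a two-allele genotype, got {genotype!r}")
    return genotype.count(ref_allele)


def encode_vector(genotype: str, ref_allele: str) -> List[int]:
    """One-hot vector indexed by the number of reference-allele copies (0, 1, 2)."""
    vec = [0, 0, 0]
    vec[encode_scalar(genotype, ref_allele)] = 1
    return vec


# Hypothetical genotypes of one SNP for four individuals.
genotypes = ["AA", "AG", "GA", "GG"]
scalars = [encode_scalar(g, "A") for g in genotypes]  # [2, 1, 1, 0]
vectors = [encode_vector(g, "A") for g in genotypes]
```

Under this assumed encoding, the unordered genotypes "AG" and "GA" map to the same value, which is what a genotype (as opposed to a haplotype) representation requires.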
Classification accuracy results obtained by the F-statistics measure for different feature subsets with different numbers of top ranked features (SNPs) in 30 simulations, on 3 and 4 populations, respectively
| Feature Numbers | Mean accuracy ± std (minimal/maximal accuracy) (%) for 3 populations | Mean accuracy ± std (minimal/maximal accuracy) (%) for 4 populations |
| 1 | 69.21 ± 1.60 (64.29/70) | 54.98 ± 1.60 (51.43/57.14) |
| 10 | 72.96 ± 7.82 (64.29/92.86) | 56.16 ± 2.36 (45.71/58.57) |
| 20 | 74.48 ± 7.82 (65.71/95.71) | 57.88 ± 4.26 (54.29/74.29) |
| 30 | 74.92 ± 8.79 (65.71/95.71) | 58.47 ± 5.18 (48.57/74.29) |
| 40 | 77.29 ± 10.55 (65.71/97.14) | 59.51 ± 5.08 (54.29/77.14) |
| 50 | 79.75 ± 11.96 (64.29/98.57) | 61.18 ± 7.24 (54.29/82.86) |
| 60 | 82.46 ± 11.41 (67.14/98.57) | 64.09 ± 7.10 (57.14/82.86) |
| 70 | 84.68 ± 11.15 (67.14/98.57) | 64.48 ± 7.73 (57.14/82.86) |
| 80 | 94.48 ± 7.03 (64.29/98.57) | 67.98 ± 8.86 (55.71/84.29) |
| 90 | 93.74 ± 5.98 (68.57/98.57) | 70.84 ± 9.13 (57.14/87.14) |
| 100 | 93.79 ± 3.44 (80/98.57) | 73.99 ± 7.09 (58.57/87.14) |
Classification accuracy results obtained by the modified t-test measure for different feature subsets with different numbers of top ranked features (SNPs) in 30 simulations, on 3 and 4 populations, respectively
| Feature Numbers | Mean accuracy ± std (minimal/maximal accuracy) (%) for 3 populations | Mean accuracy ± std (minimal/maximal accuracy) (%) for 4 populations |
| 1 | 69.37 ± 1.43 (65.71/71.43) | 54.86 ± 1.54 (51.43/57.14) |
| 10 | 72.97 ± 7.14 (60.00/92.86) | 56.29 ± 6.03 (45.71/74.29) |
| 20 | 75.20 ± 7.82 (65.71/95.71) | 58.45 ± 7.15 (48.57/74.29) |
| 30 | 76.69 ± 9.23 (67.14/95.71) | 60.17 ± 9.34 (50.00/81.43) |
| 40 | 77.03 ± 8.65 (68.57/94.29) | 61.60 ± 7.77 (51.43/78.57) |
| 50 | 79.94 ± 9.36 (55.71/97.14) | 65.20 ± 8.21 (54.29/81.43) |
| 60 | 81.89 ± 11.03 (61.43/100) | 69.26 ± 9.21 (51.43/84.29) |
| 70 | 85.23 ± 10.92 (70.00/100) | 70.34 ± 9.38 (52.85/84.29) |
| 80 | 94.57 ± 9.75 (81.43/100) | 73.94 ± 7.73 (58.57/84.29) |
| 90 | 94.29 ± 3.73 (84.29/98.57) | 79.60 ± 4.53 (67.14/87.14) |
| 100 | 94.57 ± 3.06 (84.29/98.57) | 80.46 ± 4.57 (68.57/90.00) |
The maximum classification accuracy in each of 30 simulations together with the mean accuracy (standard deviation), and the relevant feature numbers leading to the maximal accuracy together with the mean number (standard deviation), for 3 populations and 4 populations, respectively
| Ranking measure and populations | Maximum accuracy in each of 30 simulations (%) | Mean accuracy ± std (%) | Relevant feature numbers in each of 30 simulations | Average number of features ± std |
| F-statistics on 3 populations | 94.29, 98.57, 95.71, 97.14, 94.29, 94.29, 95.71, 95.71, 95.71, 97.14, 97.14, 95.71, 97.14, 95.71, 98.57, 92.86, 97.14, 97.14, 97.14, 92.86, 97.14, 95.71, 95.71, 95.71, 97.14, 95.71, 98.57, 95.71, 97.14, 92.86 | 96.05 ± 1.58 | 12, 85, 15, 42, 29, 86, 37, 56, 78, 90, 71, 46, 79, 79, 49, 8, 75, 74, 74, 100, 83, 50, 81, 67, 38, 82, 81, 93, 84, 53 | 63.6 ± 25.8 |
| F-statistics on 4 populations | 78.57, 78.57, 85.71, 70.00, 88.57, 78.57, 80.00, 82.86, 70.00, 80.00, 65.71, 78.57, 82.86, 68.57, 81.43, 80.00, 70.00, 78.57, 81.43, 74.29, 72.86, 84.26, 74.29, 84.29, 88.57, 68.57, 75.71, 64.29, 75.71, 70.00 | 77.34 ± 6.57 | 53, 98, 82, 96, 91, 100, 93, 81, 99, 90, 59, 99, 88, 55, 73, 81, 56, 99, 100, 100, 74, 74, 98, 94, 88, 72, 81, 99, 98, 90 | 85.2 ± 15.1 |
| Modified t-test on 3 populations | 95.71, 100.00, 95.71, 98.57, 94.29, 98.57, 95.71, 95.71, 95.71, 98.57, 95.71, 95.71, 98.57, 100.00, 100.00, 97.14, 98.57, 95.71, 97.14, 98.57, 98.57, 95.71, 94.29, 94.29, 97.14, 97.14, 95.71, 98.57, 97.14, 98.57 | 97.09 ± 1.74 | 27, 84, 19, 90, 29, 95, 64, 80, 84, 80, 78, 54, 83, 92, 57, 11, 80, 79, 78, 95, 62, 75, 94, 84, 53, 10, 32, 28, 68, 55 | 64.0 ± 26.5 |
| Modified t-test on 4 populations | 82.86, 77.14, 87.14, 82.86, 82.86, 84.29, 88.57, 80.00, 81.43, 82.86, 82.86, 84.29, 87.14, 84.29, 85.71, 81.43, 82.86, 84.29, 84.29, 75.71, 81.43, 84.29, 81.43, 84.29, 90.00, 84.29, 85.71, 85.71, 85.71, 90.00 | 83.86 ± 3.16 | 31, 84, 92, 94, 86, 100, 84, 99, 83, 87, 96, 61, 83, 93, 99, 95, 80, 81, 88, 99, 99, 75, 95, 82, 95, 51, 83, 73, 99, 55 | 84.1 ± 16.3 |
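The mean and standard deviation reported for each row can be checked from the per-simulation values. The short sketch below uses the 30 maximum accuracies listed for the F-statistics measure on 3 populations. Using the sample standard deviation (n − 1 denominator) is an assumption; the source does not say which variant was used, but this one reproduces the reported figure.

```python
import statistics

# Maximum accuracy (%) in each of the 30 simulations,
# F-statistics measure on 3 populations, as listed in the table.
max_accuracy = [
    94.29, 98.57, 95.71, 97.14, 94.29,
    94.29, 95.71, 95.71, 95.71, 97.14,
    97.14, 95.71, 97.14, 95.71, 98.57,
    92.86, 97.14, 97.14, 97.14, 92.86,
    97.14, 95.71, 95.71, 95.71, 97.14,
    95.71, 98.57, 95.71, 97.14, 92.86,
]

mean_acc = statistics.mean(max_accuracy)
std_acc = statistics.stdev(max_accuracy)  # sample std (assumed variant)

print(f"{mean_acc:.2f} ± {std_acc:.2f}")  # 96.05 ± 1.58
```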
Top ranked features whose appearance frequencies are greater than 83.33% (25/30) in 30 simulations, and their mean ranking values by the F-statistics ranking measure for 3 populations
| Ranking No. on Mean Ranking Values | Name of SNPs | Chromosome | Mean ranking values in 30 simulations | Ranking No. on Appearance Frequency |
| 1 | rs232045 | chr11 | 0.9573 | 7 |
| 2 | rs12786973 | chr11 | 0.9547 | 6 |
| 3 | rs7946015 | chr11 | 0.9544 | 2 |
| 4 | rs4756778 | chr11 | 0.9524 | 3 |
| 5 | rs7931276 | chr11 | 0.9521 | 9 |
| 6 | rs4823557 | chr11 | 0.9518 | 5 |
| 7 | rs10832001 | chr11 | 0.9506 | 4 |
| 8 | rs35397 | chr5 | 0.9491 | 8 |
| 9 | rs11604470 | chr11 | 0.9480 | 12 |
| 10 | rs10831841 | chr11 | 0.9478 | 11 |
| 11 | rs2296224 | chr1 | 0.9456 | 10 |
| 12 | rs12286898 | chr11 | 0.9387 | 13 |
| 13 | rs1869084 | chr11 | 0.9341 | 20 |
| 14 | rs4491181 | chr11 | 0.9307 | 26 |
| 15 | rs1604797 | chr11 | 0.9258 | 1 |
| 16 | rs7931276 | chr11 | 0.9161 | 14 |
| 17 | rs11826168 | chr11 | 0.9103 | 19 |
| 18 | rs477036 | chr11 | 0.9072 | 16 |
| 19 | rs7940199 | chr11 | 0.9032 | 22 |
| 20 | rs4429025 | chr11 | 0.8711 | 25 |
| 21 | rs6483747 | chr11 | 0.8435 | 17 |
| 22 | rs199138 | chr15 | 0.8417 | 18 |
Top ranked features whose appearance frequencies are greater than 83.33% (25/30) in 30 simulations, and their mean ranking values by the modified t-test ranking measure for 3 populations
| Ranking No. on Mean Ranking Values | Name of SNPs | Chromosome | Mean ranking values in 30 simulations | Ranking No. on Appearance Frequency |
| 1 | rs232045 | chr11 | 8.0956 | 7 |
| 2 | rs1869084 | chr11 | 8.0886 | 9 |
| 3 | rs4756778 | chr11 | 8.0079 | 6 |
| 4 | rs11218714 | chr11 | 8.0047 | 11 |
| 5 | rs10832001 | chr11 | 7.9810 | 3 |
| 6 | rs7946015 | chr11 | 7.8988 | 2 |
| 7 | rs11826168 | chr11 | 7.8517 | 4 |
| 8 | rs704737 | chr11 | 7.7786 | 18 |
| 9 | rs1083184 | chr11 | 7.7778 | 24 |
| 10 | rs16913196 | chr11 | 7.7774 | 13 |
| 11 | rs12786973 | chr11 | 7.7499 | 5 |
| 12 | rs12286898 | chr11 | 7.7421 | 12 |
| 13 | rs11604470 | chr11 | 7.7401 | 16 |
| 14 | rs35397 | chr5 | 7.7257 | 8 |
| 15 | rs7931276 | chr11 | 7.7060 | 14 |
| 16 | rs477036 | chr11 | 7.6644 | 17 |
| 17 | rs6483747 | chr11 | 7.6625 | 19 |
| 18 | rs7931276 | chr11 | 7.5996 | 15 |
| 19 | rs1604797 | chr11 | 7.3847 | 1 |
| 20 | rs10836565 | chr11 | 7.3053 | 10 |
| 21 | rs2296224 | chr1 | 7.1358 | 21 |
| 22 | rs4275650 | chr11 | 7.0043 | 23 |
| 23 | rs7924569 | chr11 | 6.9431 | 20 |
| 24 | rs2582905 | chr11 | 6.9264 | 22 |
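The selection rule behind these two tables keeps SNPs that appear among the top ranked features in more than 25 of 30 simulations and then orders them by mean ranking value. A minimal sketch of that rule follows. The SNP names, scores, and three-run example are hypothetical toy data, and two details are assumptions: the threshold is treated as a strict lower bound, and the mean is taken only over the simulations in which a SNP appears.

```python
from collections import defaultdict


def stable_features(runs, min_fraction):
    """Select features that recur across simulations.

    runs: list of dicts, one per simulation, mapping SNP name -> ranking value
          for the top ranked SNPs of that simulation.
    Returns (name, mean ranking value, appearance count) tuples for SNPs whose
    appearance frequency is strictly greater than min_fraction, sorted by mean
    ranking value in descending order.
    """
    scores = defaultdict(list)
    for run in runs:
        for name, value in run.items():
            scores[name].append(value)

    n_runs = len(runs)
    kept = [
        (name, sum(vals) / len(vals), len(vals))
        for name, vals in scores.items()
        if len(vals) / n_runs > min_fraction
    ]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical toy input: three simulations instead of 30.
runs = [
    {"snp_a": 0.96, "snp_b": 0.90, "snp_c": 0.80},
    {"snp_a": 0.94, "snp_b": 0.92},
    {"snp_a": 0.95, "snp_c": 0.85, "snp_d": 0.99},
]
selected = stable_features(runs, min_fraction=0.5)
for name, mean_value, count in selected:
    print(name, round(mean_value, 4), count)
```

With the caption's threshold the call would be `stable_features(runs, min_fraction=25 / 30)` on 30 runs. Sorting the same tuples by appearance count instead of mean value would give the second ranking column shown in the tables.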