Abstract
Single nucleotide polymorphisms (SNPs) are genetic variations that determine the differences between any two unrelated individuals. Various population groups can be distinguished from each other using SNPs. For instance, the HapMap dataset has four population groups with about ten million SNPs. To gain more insight into human evolution, ethnic variation, and population assignment, we propose to identify which SNPs are significant in determining the population groups and then to classify different populations using these relevant SNPs as input features. In this study, we developed a modified t-test ranking measure and applied it to the HapMap genotype data. First, we rank all SNPs and compare the ranking with other feature importance measures, including F-statistics and the informativeness for assignment. Second, we select different numbers of the most highly ranked SNPs as the input to a classifier, such as the support vector machine, to find the feature subset that gives the best classification accuracy. Experimental results showed that the proposed method is very effective in finding SNPs that are significant in determining the population groups, with reduced computational burden and better classification accuracy.
Year: 2007 PMID: 18267305 PMCID: PMC5054219 DOI: 10.1016/S1672-0229(08)60011-X
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
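The rank-then-classify workflow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not give the formula of the modified t-test, so `t_like_scores` uses a generic class-versus-overall standardized mean difference instead; `nearest_centroid_accuracy` is a simple stand-in for the classifier rather than a support vector machine; and the genotype matrix (coded 0/1/2) is invented toy data, not HapMap data. The function names are hypothetical.

```python
import math
from statistics import mean


def t_like_scores(X, y):
    """Score each feature (SNP) by its largest standardized difference
    between a class mean and the overall mean.
    Generic t-like score, NOT the paper's modified t-test."""
    classes = sorted(set(y))
    n, p = len(X), len(X[0])
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        overall = mean(col)
        class_means, counts, ss = {}, {}, 0.0
        for c in classes:
            vals = [col[i] for i in range(n) if y[i] == c]
            class_means[c] = mean(vals)
            counts[c] = len(vals)
            ss += sum((v - class_means[c]) ** 2 for v in vals)
        # Pooled within-class standard deviation.
        s = math.sqrt(ss / max(n - len(classes), 1))
        best = 0.0
        for c in classes:
            m_c = math.sqrt(1.0 / counts[c] + 1.0 / n)
            denom = m_c * s + 1e-9  # small constant avoids division by zero
            best = max(best, abs(class_means[c] - overall) / denom)
        scores.append(best)
    return scores


def top_k(scores, k):
    """Return the indices of the k highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


def nearest_centroid_accuracy(X_train, y_train, X_test, y_test, features):
    """Stand-in classifier (not an SVM): assign each test row to the class
    whose centroid is closest on the selected features."""
    classes = sorted(set(y_train))
    centroids = {
        c: [mean(r[j] for r, lab in zip(X_train, y_train) if lab == c)
            for j in features]
        for c in classes
    }
    correct = 0
    for row, label in zip(X_test, y_test):
        point = [row[j] for j in features]
        pred = min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
        )
        correct += pred == label
    return correct / len(y_test)


# Invented toy genotypes: column 0 differs between groups, column 1 does not.
X = [[0, 0], [0, 1], [1, 2], [2, 0], [2, 1], [1, 2]]
y = ["A", "A", "A", "B", "B", "B"]

scores = t_like_scores(X, y)
selected = top_k(scores, 1)
print("selected SNP indices:", selected)
print("accuracy:", nearest_centroid_accuracy(X, y, X, y, selected))
```

In practice the accuracy would be estimated on held-out individuals and repeated for several subset sizes (for example 5, 10, 50, and so on, as in the tables below), keeping the size that gives the best accuracy.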
Fig. 1. Average (mean) classification accuracy for the four populations using the modified t-test and the F-statistics.
Fig. 2. Average (mean) classification accuracy for the three populations using the modified t-test, the F-statistics, and the I measure.
Classification accuracy (standard deviation) for the original four populations using the modified t-test ranking measure
| Number of SNPs | CEU | CHB | JPT | YRI |
|---|---|---|---|---|
| 5 | 66.68% (20.20%) | 39.51% (36.77%) | 21.06% (28.70%) | 88.20% (7.78%) |
| 10 | 72.14% (17.42%) | 33.86% (22.64%) | 39.34% (27.85%) | 93.17% (5.28%) |
| 50 | 82.67% (12.64%) | 43.74% (23.14%) | 55.95% (19.71%) | 99.85% (0.82%) |
| 100 | 88.57% (8.50%) | 54.59% (21.72%) | 49.10% (20.00%) | 100% (0) |
| 200 | 95.98% (5.54%) | 56.88% (20.25%) | 54.95% (19.78%) | 100% (0) |
| 300 | 98.76% (3.21%) | 59.76% (24.17%) | 55.18% (24.59%) | 100% (0) |
| 400 | 99.67% (1.22%) | 58.42% (23.61%) | 55.84% (18.78%) | 100% (0) |
| 500 | 99.86% (0.75%) | 59.09% (21.52%) | 50.06% (21.59%) | 100% (0) |
| 1,000 | 99.40% (1.87%) | 58.34% (23.37%) | 51.40% (26.76%) | 100% (0) |
Classification accuracy (standard deviation) for the original four populations using the F-statistics ranking measure
| Number of SNPs | CEU | CHB | JPT | YRI |
|---|---|---|---|---|
| 5 | 68.55% (23.70%) | 37.63% (32.49%) | 32.12% (27.27%) | 90.11% (9.85%) |
| 10 | 73.27% (12.19%) | 42.12% (30.44%) | 31.02% (22.85%) | 90.65% (9.52%) |
| 50 | 93.42% (7.08%) | 57.17% (33.00%) | 36.39% (32.70%) | 99.61% (1.48%) |
| 100 | 98.79% (2.01%) | 65.00% (33.58%) | 30.81% (33.61%) | 99.85% (0.82%) |
| 200 | 99.62% (1.45%) | 55.80% (39.66%) | 41.50% (40.45%) | 100% (0) |
| 300 | 99.78% (1.20%) | 61.97% (39.59%) | 32.75% (39.26%) | 100% (0) |
| 400 | 99.29% (1.83%) | 64.99% (39.94%) | 32.49% (40.31%) | 100% (0) |
| 500 | 99.78% (1.20%) | 64.33% (39.88%) | 31.95% (39.30%) | 100% (0) |
| 1,000 | 99.47% (1.63%) | 61.18% (41.72%) | 36.55% (39.80%) | 100% (0) |
Classification accuracy (standard deviation) for the three populations using the modified t-test ranking measure
| Number of SNPs | CEU | Asian (CHB and JPT) | YRI |
|---|---|---|---|
| 5 | 48.06% (34.28%) | 91.98% (8.69%) | 87.34% (9.34%) |
| 10 | 62.38% (29.83%) | 90.44% (10.59%) | 92.61% (9.13%) |
| 50 | 82.07% (13.34%) | 94.27% (5.22%) | 99.08% (2.68%) |
| 100 | 90.45% (8.94%) | 97.02% (3.37%) | 100% (0) |
| 200 | 96.96% (3.64%) | 98.61% (2.12%) | 100% (0) |
| 300 | 98.42% (3.21%) | 99.13% (1.45%) | 100% (0) |
| 400 | 98.94% (1.93%) | 99.13% (1.45%) | 100% (0) |
| 500 | 99.67% (1.25%) | 99.13% (1.45%) | 100% (0) |
| 1,000 | 99.22% (2.05%) | 99.03% (1.49%) | 100% (0) |
Classification accuracy (standard deviation) for the three populations using the F-statistics ranking measure
| Number of SNPs | CEU | Asian (CHB and JPT) | YRI |
|---|---|---|---|
| 5 | 60.25% (25.57%) | 84.65% (11.23%) | 88.29% (12.96%) |
| 10 | 67.06% (17.97%) | 84.81% (11.01%) | 89.84% (8.43%) |
| 50 | 87.84% (8.09%) | 95.30% (4.17%) | 98.65% (3.34%) |
| 100 | 99.48% (4.62%) | 98.46% (1.66%) | 99.86% (0.78%) |
| 200 | 99.36% (1.66%) | 99.23% (1.39%) | 100% (0) |
| 300 | 99.34% (1.69%) | 99.13% (1.45%) | 100% (0) |
| 400 | 99.84% (0.85%) | 99.13% (1.45%) | 100% (0) |
| 500 | 99.69% (1.16%) | 99.13% (1.45%) | 100% (0) |
| 1,000 | 99.84% (0.85%) | 99.13% (1.45%) | 100% (0) |