Yen-Jen Lin, Chun-Tien Chang, Chuan Yi Tang, Wen-Ping Hsieh.
Abstract
Single nucleotide polymorphism (SNP) data derived from array-based technology or massively parallel sequencing often contain missing values. Missing SNPs can bias the results of association analyses. To maximize information usage, imputation is often adopted to compensate for the missing data by filling in the most probable values. To better understand the available tools for this purpose, we compare the imputation performance of BEAGLE, IMPUTE, BIMBAM, SNPMStat, MACH, and PLINK on data generated by randomly masking genotype data from the International HapMap Phase III project. In addition, we propose a new algorithm called simple imputation (Simpute) that benefits from the high resolution of the SNPs in the array platform. Simpute does not require any reference data. The best feature of Simpute is its computational efficiency, with complexity O(mw + n), where n is the number of missing SNPs, w is the number of positions of the missing SNPs, and m is the number of people considered. Simpute is suitable for regular screening of large-scale SNP genotyping, particularly when the sample size is large and efficiency is a major concern in the analysis.
Year: 2013 PMID: 23509783 PMCID: PMC3581137 DOI: 10.1155/2013/813912
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
A 3 × 3 contingency table for the genotypes at two consecutive loci. A and a are the two alleles in locus 1 while B and b are the two alleles in locus 2.
| Locus 1 / Locus 2 | 0 (bb) | 1 (bB) | 2 (BB) | Total |
|---|---|---|---|---|
| 0 (aa) | | | | |
| 1 (aA) | | | | |
| 2 (AA) | | | | |
| Total | | | | |
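As an illustration of how such a table could be filled in, the following is a minimal sketch that counts joint genotypes at two loci. It assumes genotypes are coded 0, 1, 2 as in the row and column labels above and that a missing call is represented by `None`; the function name `contingency_table` is hypothetical and not taken from the paper.

```python
from collections import Counter


def contingency_table(locus1, locus2):
    """Count joint genotypes at two loci, each coded 0, 1, or 2.

    Individuals with a missing genotype (None) at either locus are skipped.
    Returns the 3 x 3 table (rows: locus 1, columns: locus 2) together with
    the row totals and column totals.
    """
    counts = Counter(
        (g1, g2)
        for g1, g2 in zip(locus1, locus2)
        if g1 is not None and g2 is not None
    )
    table = [[counts[(i, j)] for j in range(3)] for i in range(3)]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(3)) for j in range(3)]
    return table, row_totals, col_totals


# Six individuals; the last two each have one missing call and are ignored.
table, row_totals, col_totals = contingency_table(
    [0, 1, 2, 2, None, 1],
    [0, 1, 2, 1, 2, None],
)
```

Only individuals typed at both loci contribute, so the grand total can be smaller than the sample size when data are missing.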
Notation for the haplotype probabilities at the two loci.
| Locus | b | B | Total |
|---|---|---|---|
| a | | | |
| A | | | |
| Total | | | 1 |
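Haplotype probabilities of this kind are what the standard linkage disequilibrium measure r² is computed from. The sketch below uses the textbook definition; the symbol names `p_ab`, `p_aB`, `p_Ab`, `p_AB` are assumptions chosen to match the row and column labels, since the paper's own cell notation is not shown here.

```python
def ld_r_squared(p_ab, p_aB, p_Ab, p_AB):
    """Standard r^2 between two biallelic loci from four haplotype probabilities.

    The argument names are illustrative (assumed), one per cell of the
    2 x 2 haplotype table; the four values must sum to 1.
    """
    total = p_ab + p_aB + p_Ab + p_AB
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype probabilities must sum to 1")
    p_A = p_Ab + p_AB  # marginal frequency of allele A at locus 1
    p_B = p_aB + p_AB  # marginal frequency of allele B at locus 2
    denom = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denom == 0.0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    d = p_AB * p_ab - p_Ab * p_aB  # disequilibrium coefficient D
    return d * d / denom


# Perfectly coupled loci: only haplotypes ab and AB occur.
r2_coupled = ld_r_squared(0.5, 0.0, 0.0, 0.5)
# Independent loci: every haplotype is equally likely.
r2_independent = ld_r_squared(0.25, 0.25, 0.25, 0.25)
```

With this definition the coupled example gives r² = 1 and the independent example gives r² = 0.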
The proportion of non-biallelic loci in the HapMap phase II release 22.
| Population | Individuals | SNPs | Non-biallelic |
|---|---|---|---|
| CEU | 90 | 48217 | 1.69% |
| JPT + CHB | 90 | 50053 | 1.81% |
| YRI | 90 | 48541 | 1.60% |
The proportion of non-biallelic loci in the HapMap phase III.
| Population | Individuals | SNPs | Non-biallelic |
|---|---|---|---|
| CEU | 80 | 19250 | 0.39% |
| JPT + CHB | 77 | 17286 | 0.21% |
| YRI | 80 | 20198 | 0.21% |
Figure 1. Imputation accuracy compared across BEAGLE, IMPUTE, BIMBAM, SNPMStat, MACH, and PLINK using the complete set. The curve labeled IMPUTE-call_thresh-0 shows the best setting we found for IMPUTE (call thresh = 0) rather than the default setting. Accuracy = (number of correctly imputed entries / number of missing entries) × 100%.
Figure 2. CPU time.
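The evaluation scheme described here (randomly mask known genotypes, impute them, then score only the masked entries) can be sketched as follows. This is a minimal illustration under stated assumptions: genotypes are stored as a list of per-individual rows coded 0, 1, 2, missing calls are `None`, and `impute_mode` is a deliberately naive placeholder imputer (most common genotype per SNP). It is not Simpute or any of the compared tools, and all function names are hypothetical.

```python
import random
from collections import Counter


def mask_genotypes(genotypes, missing_rate, seed=0):
    """Return a copy with a random fraction of entries set to None,
    plus the list of (individual, snp) positions that were masked."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(genotypes) for j in range(len(row))]
    masked = rng.sample(cells, round(missing_rate * len(cells)))
    copy = [list(row) for row in genotypes]
    for i, j in masked:
        copy[i][j] = None
    return copy, masked


def impute_mode(genotypes):
    """Placeholder imputer (assumption, not Simpute): fill each missing
    entry with the most common observed genotype at that SNP."""
    filled = [list(row) for row in genotypes]
    for j in range(len(genotypes[0])):
        observed = [row[j] for row in genotypes if row[j] is not None]
        mode = Counter(observed).most_common(1)[0][0] if observed else 0
        for row in filled:
            if row[j] is None:
                row[j] = mode
    return filled


def accuracy(truth, imputed, masked):
    """Accuracy in percent: correctly imputed entries / missing entries * 100."""
    correct = sum(truth[i][j] == imputed[i][j] for i, j in masked)
    return 100.0 * correct / len(masked)


# Toy data: 4 individuals x 3 SNPs, every SNP monomorphic across individuals.
truth = [[0, 1, 2] for _ in range(4)]
observed, masked = mask_genotypes(truth, missing_rate=0.25, seed=1)
acc = accuracy(truth, impute_mode(observed), masked)
error_rate = 100.0 - acc
```

Scoring only the masked positions keeps the denominator equal to the number of missing entries, matching the accuracy definition in the caption; the error rate reported in the tables is the complement of this quantity.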
Error rates* for Simpute, BEAGLE, and Jung's method in the random-missing study on the six SNPs of chromosome 22.
| Missing rate/method | Simpute | BEAGLE | Jung's method |
|---|---|---|---|
| 5% | 1.358% | 1.7531% | 16.59% |
| 10% | 1.8944% | 2.1429% | 17.82% |
| 15% | 3.0207% | 3.4132% | 20.25% |
| 20% | 4.4472% | 4.4907% | 20.07% |
*Error rate = (number of incorrectly imputed entries / number of missing entries) × 100%.
Error rates* for randomly missing SNPs with short input at r² ≥ 0.9 from the HapMap phase III on chromosome 21 for the CEU samples.
| Missing rate/method | Simpute | BEAGLE |
|---|---|---|
| 0.1% | 37.136/483 (7.69%) | 38.09/483 (7.89%) |
| 0.5% | 188/2412 (7.79%) | 183.6364/2412 (7.61%) |
| 1% | 378.333/4823 (7.84%) | 376.762/4823 (7.81%) |
| 5% | 1913.632/24111 (7.94%) | 1892.053/24111 (7.84%) |
*Error rate = (number of incorrectly imputed entries / number of missing entries) × 100%.
Error rates* and computation time for randomly missing SNPs with high LD for the CEU samples.
| Missing rate | Simpute error rate | Simpute running time (sec) | BEAGLE error rate | BEAGLE running time (sec) |
|---|---|---|---|---|
| 0.1% | 5.52/483 (1.14%) | 12.88 | 4.49/483 (0.93%) | 164.17 |
| 0.5% | 27.94/2412 (1.16%) | 13.09 | 21.01/2412 (0.87%) | 164.82 |
| 1% | 57.22/4823 (1.19%) | 14.07 | 44.07/4823 (0.91%) | 168.47 |
| 5% | 321.9/24111 (1.33%) | 18.24 | 224.65/24111 (0.974%) | 168.69 |
*Error rate = (number of incorrectly imputed entries / number of missing entries) × 100%.
Error rates* and computation time for randomly missing SNPs with high LD for the CHB + JPT samples.
| Missing rate | Simpute error rate | Simpute running time (sec) | BEAGLE error rate | BEAGLE running time (sec) |
|---|---|---|---|---|
| 0.1% | 5.15/493 (1.04%) | 10.90 | 4.64/493 (0.94%) | 138.40 |
| 0.5% | 27.29/2463 (1.10%) | 11.08 | 24/2463 (0.97%) | 139.79 |
| 1% | 55.07/4925 (1.11%) | 11.77 | 47.69/4925 (0.97%) | 138.96 |
| 5% | 322.38/24622 (1.31%) | 16.113 | 242.26/24622 (0.98%) | 140.96 |
*Error rate = (number of incorrectly imputed entries / number of missing entries) × 100%.
Error rates* and computation time for randomly missing SNPs with high LD for the YRI samples.
| Missing rate | Simpute error rate | Simpute running time (sec) | BEAGLE error rate | BEAGLE running time (sec) |
|---|---|---|---|---|
| 0.1% | 2.57/271 (0.95%) | 12.42 | 2.23/271 (0.82%) | 187.80 |
| 0.5% | 13.54/1353 (1.00%) | 12.925 | 11.2/1353 (0.83%) | 188.41 |
| 1% | 27.19/2705 (1.00%) | 13.08 | 22.89/2705 (0.85%) | 187.45 |
| 5% | 161.02/13525 (1.19%) | 15.921 | 119.94/13525 (0.887%) | 191.29 |
*Error rate = (number of incorrectly imputed entries / number of missing entries) × 100%.