Daniel Money, Kyle Gardner, Zoë Migicovsky, Heidi Schwaninger, Gan-Yuan Zhong, Sean Myles.
Abstract
Obtaining genome-wide genotype data from a set of individuals is the first step in many genomic studies, including genome-wide association and genomic selection. All genotyping methods suffer from some level of missing data, and genotype imputation can be used to fill in the missing data and improve the power of downstream analyses. Model organisms such as human and cattle benefit from high-quality reference genomes and panels of reference genotypes that improve imputation accuracy. In nonmodel organisms, however, genetic and physical maps are often either of poor quality or completely absent, and no panels of reference genotypes are available. There is therefore a need for imputation methods designed specifically for nonmodel organisms in which genomic resources are poorly developed and marker order is unreliable or unknown. Here we introduce LinkImpute, a software package based on a k-nearest neighbor genotype imputation method, LD-kNNi, which is designed for unordered markers. No physical or genetic maps are required, and it is designed to work on unphased genotype data from heterozygous species. It exploits the fact that markers useful for imputation often are not physically close to the missing genotype but rather distributed throughout the genome. Using genotyping-by-sequencing data from diverse and heterozygous accessions of apples, grapes, and maize, we compare LD-kNNi with several genotype imputation methods and show that LD-kNNi is fast, comparable in accuracy to the best existing methods, and exhibits the least bias in allele frequency estimates.
Keywords: SNP; apple; genotyping by sequencing; imputation
Year: 2015 PMID: 26377960 PMCID: PMC4632058 DOI: 10.1534/g3.115.021667
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
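The abstract describes LD-kNNi only at a high level: neighbors are found using markers chosen by linkage disequilibrium (LD) with the SNP being imputed, without any marker order. The following is a minimal sketch of that idea, not the LinkImpute implementation. Everything specific in it is an assumption made for illustration: genotypes are coded as 0, 1, or 2 copies of one allele, LD is measured as squared correlation, the distance between samples is the mean absolute genotype difference over the `l` highest-LD markers, and the `k` nearest samples vote with inverse-distance weights. The names `r_squared`, `ld_knn_impute`, and `example` are hypothetical.

```python
from typing import Dict, List, Optional

Genotype = Optional[int]  # 0, 1 or 2 allele copies; None marks a missing call


def r_squared(x: List[Genotype], y: List[Genotype]) -> float:
    """Squared correlation between two SNPs over samples called at both."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (sxy * sxy) / (sxx * syy)


def ld_knn_impute(geno: List[List[Genotype]], sample: int, snp: int,
                  k: int = 5, l: int = 20) -> Genotype:
    """Impute geno[sample][snp]; geno[s][m] is sample s at marker m."""
    n_samples, n_snps = len(geno), len(geno[0])
    target = [geno[s][snp] for s in range(n_samples)]

    # Rank every other marker by LD with the target SNP (marker order unused).
    ranked = []
    for m in range(n_snps):
        if m != snp:
            column = [geno[s][m] for s in range(n_samples)]
            ranked.append((r_squared(target, column), m))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    markers = [m for _, m in ranked[:l]]

    # Distance from the focal sample to each sample that has a call at snp.
    neighbors = []
    for s in range(n_samples):
        if s == sample or geno[s][snp] is None:
            continue
        diffs = [abs(geno[sample][m] - geno[s][m]) for m in markers
                 if geno[sample][m] is not None and geno[s][m] is not None]
        if diffs:
            neighbors.append((sum(diffs) / len(diffs), s))
    neighbors.sort()

    # Inverse-distance weighted vote among the k nearest neighbors.
    votes: Dict[int, float] = {}
    for dist, s in neighbors[:k]:
        call = geno[s][snp]
        votes[call] = votes.get(call, 0.0) + 1.0 / (dist + 1e-6)
    return max(votes, key=votes.get) if votes else None


# Toy data: two groups of identical samples; sample 4 is missing marker 0.
example = [
    [0, 0, 0, 2],
    [0, 0, 0, 2],
    [2, 2, 2, 0],
    [2, 2, 2, 0],
    [None, 0, 0, 2],
]
imputed = ld_knn_impute(example, sample=4, snp=0, k=2, l=2)
```

In this toy case the focal sample matches the first two samples at every high-LD marker, so their genotype at the missing marker wins the vote.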
Performance of the different imputation methods on the apple dataset
| Method | Genotype Error | Allele Error | Run Time, sec |
|---|---|---|---|
| Mode | 23.0% | 12.4% | [a] |
| kNNi [b] | 20.6% | 10.8% | 18 |
| MF | 9.9% | 5.1% | 40,107 |
| fastPHASE | 7.7% | 3.9% | 52,399 |
| Beagle | 7.6% | 3.9% | 424 |
| LD-kNNi [c] | 7.4% | 3.9% | 104 |

kNNi, k-nearest neighbors imputation; LD-kNNi, linkage disequilibrium k-nearest neighbors imputation.
[a] Run time was under a second.
[b] Using a fixed value of k = 8.
[c] Using fixed values of k = 5 and l = 20.
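The record does not define the two error columns. As a hedged sketch, the code below assumes that genotype error is the fraction of masked genotypes imputed incorrectly and that allele error is the fraction of mismatched alleles, with each diploid genotype coded as 0, 1, or 2 copies of one allele. Both definitions and the function names are assumptions for illustration, not taken from the paper.

```python
from typing import List


def genotype_error(actual: List[int], imputed: List[int]) -> float:
    """Fraction of masked genotypes whose imputed call differs from the truth."""
    wrong = sum(1 for a, i in zip(actual, imputed) if a != i)
    return wrong / len(actual)


def allele_error(actual: List[int], imputed: List[int]) -> float:
    """Fraction of mismatched alleles, assuming two alleles per genotype."""
    # With 0/1/2 coding, |a - i| counts how many of the two alleles differ.
    mismatched = sum(abs(a - i) for a, i in zip(actual, imputed))
    return mismatched / (2 * len(actual))


# Four masked genotypes: one off by a single allele, one off by both.
actual_calls = [0, 1, 2, 2]
imputed_calls = [0, 2, 2, 0]
g_err = genotype_error(actual_calls, imputed_calls)
a_err = allele_error(actual_calls, imputed_calls)
```

Under these assumed definitions, allele error is at most the genotype error and is exactly half of it when every wrong call is off by one allele, which matches the pattern of the two columns in the table.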
Figure 1. The number of shared neighbors between the k-nearest neighbors imputation (kNNi) and linkage disequilibrium k-nearest neighbors imputation (LD-kNNi) methods. The value of l was set to 5 for both methods.
Figure 2. The probability of a single-nucleotide polymorphism (SNP) being on the same chromosome as the imputed SNP as a function of linkage disequilibrium (LD) with the imputed SNP. SNPs are ranked according to LD, with the SNP most in LD with the imputed SNP ranked one.
Figure 3. Imputation accuracy as a function of the minor allele frequency (MAF) of the imputed SNP for each of the six imputation methods. MAF is binned in 5% bins and the number of SNPs in each bin is shown in parentheses. kNNi, k-nearest neighbors imputation; LD-kNNi, linkage disequilibrium k-nearest neighbors imputation.
Performance of LinkImpute and Beagle on different datasets
| Dataset | Number of SNPs | Number of Samples | Genotype Error, LinkImpute [a] | Genotype Error, Beagle | Run Time (sec), LinkImpute [a] | Run Time (sec), Beagle |
|---|---|---|---|---|---|---|
| Apple | 8,404 | 711 | 7.4% | 7.6% | 104 | 424 |
| Maize | 43,696 | 4,300 | 18.1% | 18.7% | 7,608 | 16,585 |
| Grape | 8,506 | 77 | 9.5% | 11.0% | 28 | 16 |

SNP, single-nucleotide polymorphism.
[a] Using the LD-kNNi option and optimized values of k and l.
Figure 4. Bubble plots of the actual and imputed genotypes for each of the 10,000 masked genotypes using each of the six imputation methods. Bubbles are not shown for the correctly imputed cases. The size of the bubbles is proportional to the frequency of observations in that category. kNNi, k-nearest neighbors imputation; LD-kNNi, linkage disequilibrium k-nearest neighbors imputation.
Figure 5. Minor allele frequency (MAF) computed by the use of actual and imputed genotypes for each of the six imputation methods. kNNi, k-nearest neighbors imputation; LD-kNNi, linkage disequilibrium k-nearest neighbors imputation.
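The comparison in Figure 5 rests on computing MAF twice for the same SNP, once from actual genotypes and once from imputed ones. A minimal sketch of that calculation follows, assuming diploid genotypes coded as 0, 1, or 2 copies of one allele and `None` for a missing call. The function name and the toy data are hypothetical.

```python
from typing import List, Optional


def minor_allele_frequency(genotypes: List[Optional[int]]) -> float:
    """MAF of one SNP from 0/1/2 genotype calls, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this SNP")
    # Each diploid sample contributes two alleles.
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


# The same SNP with its true calls and with one call imputed incorrectly.
actual_snp = [0, 0, 1, 1]
imputed_snp = [0, 0, 1, 0]
maf_actual = minor_allele_frequency(actual_snp)
maf_imputed = minor_allele_frequency(imputed_snp)
maf_bias = maf_imputed - maf_actual
```

A negative `maf_bias` means imputation pulled the estimate toward the major allele, the kind of systematic shift that an imputation method biased toward common genotypes would produce.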