Yining Wang, Tim Wylie, Paul Stothard, Guohui Lin.
Abstract
BACKGROUND: Despite ongoing reductions in the cost of sequencing technologies, whole genome SNP genotype imputation is often used as an alternative for obtaining abundant SNP genotypes for genome-wide association studies. Several existing genotype imputation methods can be efficient for this purpose, while achieving various levels of imputation accuracy. Recent empirical results have shown that two-step imputation may improve accuracy by imputing the low density genotyped study animals to a medium density array first and then to the target density. We are interested in building a series of staircase arrays that lead the low density array to the high density array or even the whole genome, such that genotype imputation along these staircases can achieve the highest accuracy.
Year: 2015 PMID: 26498158 PMCID: PMC4619096 DOI: 10.1186/s12859-015-0770-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 3 A flow chart of the two-step piecemeal imputation framework, including both the training phase through a 5-fold cross validation and the independent testing. In the chart, T is the set of markers in the lower density chip and T∪U is the set of markers in the higher density chip; m is a marker of U; S is the set of study samples genotyped on T and R is the set of references genotyped on T∪U. The goal is to impute the genotype for markers of U for the study samples.
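The training phase in this caption relies on a 5-fold cross validation. A minimal sketch of such a split is given below, assuming that the animals genotyped on T∪U are partitioned into five folds and that each held-out fold is treated like the study set S by keeping only its T markers. That masking step, the helper names `five_fold_splits` and `mask_to_T`, and the toy marker names are illustrative assumptions, not details taken from the paper; the count of 82 animals is borrowed from the Simmental cross validation.

```python
import random

def five_fold_splits(sample_ids, n_folds=5, seed=0):
    """Yield (reference_ids, held_out_ids) pairs, one per fold."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for i, held_out in enumerate(folds):
        reference = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield reference, held_out

def mask_to_T(genotypes, T):
    """Keep only markers in T, mimicking a sample genotyped on the lower density chip."""
    return {marker: g for marker, g in genotypes.items() if marker in T}

# Hypothetical sample identifiers.
animals = [f"animal{i}" for i in range(82)]
splits = list(five_fold_splits(animals))

# Hypothetical genotype record: marker name -> genotype code.
T = {"snp1", "snp3"}
full = {"snp1": 0, "snp2": 1, "snp3": 2}
masked = mask_to_T(full, T)
print([len(held_out) for _, held_out in splits], masked)
```

Each animal is held out exactly once, so accuracy on the masked U markers can be averaged over the five folds.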
Description of the different SNP chips and the SNP subsets
| SNP Chip | Chip Name | #SNPs |
|---|---|---|
| Illumina 6 K | Illumina BovineLD BeadChip | 6,909 |
| Illumina 50 K | Illumina BovineSNP50 BeadChip | 54,001 |
| Illumina 777 K | 777 K BovineHD BeadChip | 786,799 |
| Affymetrix 660 K | Axiom Genome-Wide BOS 1 Array | 648,875 |
Description of the different SNP chips and the filtered SNP subsets used in the study
| Chr | #Animals | #SNPs | HD | 50 K | 6 K |
|---|---|---|---|---|---|
| BTA 27 | 114 | 529,674 | 10,219 | 664 | 120 |
| BTA 14 | 82 | 933,833 | 14,367 | 1,618 | 219 |
Accuracy comparisons between the two-step piecemeal and the classic one-step imputation on the Simmental datasets
| BaseProgram | Imputation | One-step (CV) | Piecemeal (CV) | + (CV) | #Clusters | #TClusters | One-step (test) | Piecemeal (test) | + (test) |
|---|---|---|---|---|---|---|---|---|---|
| Beagle | 6 K → 50 K | 69.35 | 70.81 | | 100 | 100 | 60.68 | 61.39 | |
| Beagle | 6 K → 660 K | 72.37 | 74.92 | | 800 | 800 | 66.00 | 67.76 | |
| Beagle | 50 K → 660 K | 86.61 | 88.89 | | 1000 | 1000 | 72.83 | 74.11 | |
| FImpute | 6 K → 50 K | 75.95 | 76.70 | | 55 | 55 | 61.87 | 62.16 | |
| FImpute | 6 K → 660 K | 79.11 | 80.11 | | 1000 | 1000 | 68.43 | 68.95 | |
| FImpute | 50 K → 660 K | 90.31 | 90.74 | | 1000 | 999¹ | 77.11 | 77.33 | |

Results are on the Simmental datasets for markers on chromosome 14. Columns 3–7 contain the 5-fold cross validation (CV) results on the 82 animals, with the selected markers and their associated target marker clusters. Independent testing results on the 367 animals are in columns 8–10, using the selected markers and their associated target marker clusters from the cross validation. ¹In the independent testing from 50 K to 660 K, 8 markers of the Affymetrix 660 K chip were filtered out due to their genotype disagreeing with the alternating alleles specified by sequencing, and consequently only 999 target marker clusters were used. The columns labelled with + show the improvements, in bold, of the piecemeal imputation over the one-step imputation.
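A minimal sketch of how the + columns can be derived from the accuracies in this table, assuming the improvement is the plain difference (piecemeal minus one-step, in percentage points). The table does not state that definition explicitly, so treat it as an assumption; `ROWS` and `improvement` are illustrative names.

```python
# Accuracies (%) copied from the Simmental table:
# (program, imputation, CV one-step, CV piecemeal, test one-step, test piecemeal)
ROWS = [
    ("Beagle", "6K->50K", 69.35, 70.81, 60.68, 61.39),
    ("Beagle", "6K->660K", 72.37, 74.92, 66.00, 67.76),
    ("Beagle", "50K->660K", 86.61, 88.89, 72.83, 74.11),
    ("FImpute", "6K->50K", 75.95, 76.70, 61.87, 62.16),
    ("FImpute", "6K->660K", 79.11, 80.11, 68.43, 68.95),
    ("FImpute", "50K->660K", 90.31, 90.74, 77.11, 77.33),
]

def improvement(one_step, piecemeal):
    """Assumed definition: difference in percentage points."""
    return round(piecemeal - one_step, 2)

# Map (program, imputation) -> (CV gain, independent-testing gain).
gains = {
    (prog, path): (improvement(cv1, cvp), improvement(t1, tp))
    for prog, path, cv1, cvp, t1, tp in ROWS
}

for (prog, path), (cv_gain, test_gain) in gains.items():
    print(f"{prog:8s} {path:10s} CV +{cv_gain:.2f}  test +{test_gain:.2f}")
```

Under this assumption every gain in the table comes out positive, for both base programs and in both the cross validation and the independent testing.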
Fig. 1 The Beagle/FImpute-based two-step piecemeal imputation accuracies against the number of SNP clusters
Accuracy comparisons between the two-step piecemeal and the classic one-step imputation on the Holstein datasets
| BaseProgram | Imputation | One-step (CV) | Piecemeal (CV) | + (CV) | #Clusters | #TClusters | One-step (test) | Piecemeal (test) | + (test) |
|---|---|---|---|---|---|---|---|---|---|
| Beagle | 6 K → 50 K | 86.98 | 89.81 | | 95 | 89¹ | 74.97 | 76.90 | |
| Beagle | 6 K → 777 K | 82.35 | 85.27 | | 1,000 | 963¹ | 71.29 | 73.25 | |
| Beagle | 50 K → 777 K | 93.09 | 95.16 | | 1,000 | 956¹ | 82.27 | 84.25 | |
| FImpute | 6 K → 50 K | 91.11 | 91.64 | | 95 | 88² | 81.15 | 81.40 | |
| FImpute | 6 K → 777 K | 89.22 | 90.14 | | 1,000 | 942² | 82.80 | 82.81 | |
| FImpute | 50 K → 777 K | 95.25 | 95.61 | | 800 | 765² | 87.72 | 87.83 | |

Results are on the Holstein datasets for markers on chromosome 27. Columns 3–7 contain the 5-fold cross validation (CV) results on 114 animals, with the selected markers and their associated target marker clusters. Independent testing results on the 8 animals are in columns 8–10, using the selected markers and their associated target marker clusters from the cross validation. In the independent testing, ¹for Beagle, 6, 37, and 44 target marker clusters are empty; ²for FImpute, 7, 58, and 35 target marker clusters are empty. The columns labelled with + show the improvements, in bold, of the piecemeal imputation over the one-step imputation.
Accuracy comparisons between the multi-step piecemeal and the usual two/three-step imputation
| BaseProgram | Dataset | Imputation | One-step | Two/three-step | Piecemeal | + |
|---|---|---|---|---|---|---|
| Beagle | 8 Holstein BTA 27 | 6 K → 50 K → 777 K | 71.29 | 74.25 | 74.43 | |
| FImpute | 8 Holstein BTA 27 | 6 K → 50 K → 777 K | 82.80 | 82.74 | 82.92 | |
| Beagle | 367 Simmental BTA 14 | 6 K → 50 K → 660 K | 66.00 | 65.51 | 66.59 | |
| FImpute | 367 Simmental BTA 14 | 6 K → 50 K → 660 K | 68.43 | 68.54 | 68.56 | |
| Beagle | 23 Simmental BTA 14 | 50 K → 660 K → Sequence | 84.91 | 89.88 | 90.17 | |
| FImpute | 23 Simmental BTA 14 | 50 K → 660 K → Sequence | 87.95 | 90.47 | 90.50 | |
| Beagle | 23 Simmental BTA 14 | 6 K → 50 K → 660 K → Sequence | 81.19 | 83.94 | 86.26 | |
| FImpute | 23 Simmental BTA 14 | 6 K → 50 K → 660 K → Sequence | 82.23 | 84.58 | 84.67 | |

Results are on the Holstein datasets for markers on chromosome 27 and on the Simmental datasets for markers on chromosome 14, respectively. The 8 Holstein and 367 Simmental genotyped animals are used in the two-step independent testing (6 K → 50 K → HD), with results in the two/three-step, piecemeal and + columns. The piecemeal imputation uses the selected markers and their associated target marker clusters from the training step. An additional 23 Simmental sequenced and genotyped animals are used in the two/three-step imputation to Sequence (50 K → 660 K → Sequence, 6 K → 50 K → 660 K → Sequence). All one-step imputation accuracies are included in the one-step column. The last column, labelled with +, shows the improvements, in bold, of the piecemeal imputation over the two- or three-step imputation.
Fig. 2 Untyped SNP genotype piecemeal imputation. Both the SNP set T of a lower density 6 K chip and the SNP set T∪U of a higher density 50 K chip are shown, using their physical loci on BTA 14. The second to the seventh lines plot the SNPs in the first five clusters, by the k-means algorithm (k=15) on the marker feature vectors generated by the add-one two-step imputation using Beagle. The starred markers are the selected markers, one per cluster, and the associated target marker clusters are shown in the last five lines in the figure.
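The clustering step described in this caption can be sketched as follows. This is a plain k-means (Lloyd's algorithm) in pure Python, not the authors' implementation. The toy `features` stand in for the marker feature vectors, and choosing the marker nearest its centroid as the selected marker is an assumption made for illustration, since the caption does not state the selection rule. The caption uses k=15; the toy example uses k=2.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iter=100, seed=0):
    """Lloyd's k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sqdist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def pick_representatives(vectors, labels, centroids):
    """Assumed rule: per cluster, select the marker nearest the centroid."""
    reps = {}
    for c, centre in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps[c] = min(members, key=lambda i: sqdist(vectors[i], centre))
    return reps

# Hypothetical 2-dimensional feature vectors for six markers.
features = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centroids = kmeans(features, k=2)
reps = pick_representatives(features, labels, centroids)
print(labels, reps)
```

Each cluster yields one representative marker index, mirroring the one starred marker per cluster in the figure.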