Yumei Yang1, Qishan Wang1, Qiang Chen1, Rongrong Liao1, Xiangzhe Zhang1, Hongjie Yang2, Youmin Zheng2, Zhiwu Zhang3, Yuchun Pan4.
Abstract
We report a novel algorithm, iBLUP, that imputes missing genotypes by simultaneously and comprehensively using identity by descent and linkage disequilibrium information. Simulation studies showed that the algorithm was far more tolerant of high missing rates than other common imputation methods, e.g. BEAGLE and fastPHASE, especially for rare variants. At a missing rate of 70%, the accuracy of BEAGLE and fastPHASE dropped to 0.82 and 0.74, respectively, while iBLUP retained an accuracy of 0.95. For the minor allele, the accuracy of BEAGLE and fastPHASE decreased to -0.1 and 0.03, while iBLUP still had an accuracy of 0.61. We implemented the algorithm in a publicly available software package, also named iBLUP. The application of iBLUP to real sequencing data from an outbred pig population was demonstrated.
Year: 2014 PMID: 24972110 PMCID: PMC4074155 DOI: 10.1371/journal.pone.0101025
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. The mechanism and performance of iBLUP.
The figure illustrates how an observed genotype with missing values is imputed by Best Linear Unbiased Prediction (BLUP). The imputation uses both the relationship among markers, represented by Linkage Disequilibrium (LD), and the relationship among individuals, represented by Identity By Descent (IBD). G and K are the genetic variance–covariance matrix and the marker-based kinship matrix, respectively, and the symbol ⊗ represents the Kronecker product.
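To make the role of the Kronecker product concrete, here is a minimal sketch of BLUP-style imputation. It is not the iBLUP implementation: it assumes that the stacked genotype matrix has covariance proportional to G ⊗ K, that G (markers × markers) and K (individuals × individuals) are already available, and that each marker has at least one observed genotype. The function name `blup_impute` and the `ridge` parameter are hypothetical.

```python
import numpy as np

def blup_impute(Y, G, K, ridge=1e-6):
    """Fill NaN entries of Y (individuals x markers) with a BLUP-style predictor.

    Assumption (not from the source): Cov(vec(Y)) is proportional to kron(G, K),
    where vec stacks the columns (markers) of Y.
    """
    n, m = Y.shape
    y = Y.flatten(order="F")              # column-major stacking matches kron(G, K)
    miss = np.isnan(y)
    obs = ~miss
    V = np.kron(G, K)                     # (n*m) x (n*m) covariance structure
    # Per-marker mean of the observed genotypes, repeated for every individual.
    mu = np.repeat(np.nanmean(Y, axis=0), n)
    V_oo = V[np.ix_(obs, obs)] + ridge * np.eye(int(obs.sum()))  # small ridge for stability
    V_mo = V[np.ix_(miss, obs)]
    y_hat = y.copy()
    # Conditional expectation of the missing entries given the observed ones.
    y_hat[miss] = mu[miss] + V_mo @ np.linalg.solve(V_oo, y[obs] - mu[obs])
    return y_hat.reshape((n, m), order="F")
```

Forming the full Kronecker product is only practical for small examples, because the matrix has (individuals × markers)² entries; a working implementation would need to avoid building it explicitly.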
The comparison of four genotype imputation methods: iBLUP, BEAGLE, M-MM and fastPHASE.
| Method | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% |
| iBLUP | 0.99 | 0.99 | 0.99 | 0.99 | 0.98 | 0.97 | 0.95 | 0.92 |
| BEAGLE | 0.99 | 0.99 | 0.99 | 0.99 | 0.99 | 0.95 | 0.82 | 0.76 |
| M-MM | 0.90 | 0.89 | 0.89 | 0.87 | 0.87 | 0.85 | 0.82 | 0.71 |
| fastPHASE | 0.99 | 0.99 | 0.99 | 0.99 | 0.79 | 0.76 | 0.74 | 0.69 |
Response of imputation accuracy to marker density and individual relationship*.
| Factor | Level | iBLUP (60%) | BEAGLE (60%) | fastPHASE (60%) | iBLUP (70%) | BEAGLE (70%) | fastPHASE (70%) | iBLUP (80%) | BEAGLE (80%) | fastPHASE (80%) |
| Sibs | Half | 0.97±0.0006 | 0.95±0.0002 | 0.76±0.0163 | 0.95±0.0005 | 0.82±0.0004 | 0.74±0.0081 | 0.92±0.0002 | 0.76±0.0002 | 0.69±0.0041 |
| Sibs | Full | 0.97±0.0003 | 0.96±0.0002 | 0.79±0.0113 | 0.96±0.0006 | 0.83±0.0004 | 0.75±0.0076 | 0.94±0.0007 | 0.77±0.0002 | 0.72±0.0046 |
| Density | High | 0.97±0.0006 | 0.95±0.0002 | 0.76±0.0163 | 0.95±0.0005 | 0.82±0.0004 | 0.74±0.0081 | 0.92±0.0002 | 0.76±0.0002 | 0.69±0.0041 |
| Density | Low | 0.90±0.0006 | 0.85±0.0005 | 0.76±0.0122 | 0.87±0.0004 | 0.78±0.0003 | 0.73±0.0093 | 0.83±0.0007 | 0.75±0.0006 | 0.71±0.0062 |
*The full dataset from the 15th QTL-MAS workshop was sampled on individual relationship and marker density. The full dataset contains 3220 individuals genotyped with 9990 markers. The 3220 individuals include 20 sires, 200 dams (10 dams per sire), and 3000 progeny (15 progeny per dam) as displayed in . The full population was randomly sampled to form two subpopulations, one with individuals more closely related to each other (full sibs, see ) and the other with individuals less closely related to each other (half sibs, see ). The known genotypes were randomly masked as missing at three different rates: 60%, 70%, and 80%. Three imputation methods (BEAGLE, fastPHASE and iBLUP) were used to impute the masked genotypes. Accuracy was calculated as the Pearson correlation coefficient between the known and imputed genotypes. The sampling of missing genotypes was repeated ten times. The average and standard error of imputation accuracy are reported in the table.
All the genetic markers were used to evaluate the response of imputation accuracy to individual relationship, i.e. the half-sib vs. the full-sib population.
The half-sib population was used to evaluate the response of imputation accuracy to marker density. Two levels of marker density were examined. The high-density level contained all the available markers (9990). The low-density level contained one fifth of the available markers, sampled evenly (choosing one out of every five adjacent markers).
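The evaluation protocol in these footnotes (random masking, Pearson correlation on the masked genotypes, ten repeats, and even thinning of markers) can be sketched as follows. This is an illustration under assumptions, not the authors' scripts: genotypes are assumed to be coded numerically in an individuals × markers array, each entry is masked independently so the realized missing rate is only approximately the target, and `mean_impute` is a hypothetical stand-in for any imputation method.

```python
import numpy as np

def mask_genotypes(Y, rate, rng):
    """Mask each genotype independently with probability `rate` (an assumption)."""
    mask = rng.random(Y.shape) < rate
    Y_masked = Y.astype(float).copy()
    Y_masked[mask] = np.nan
    return Y_masked, mask

def imputation_accuracy(Y_true, Y_imputed, mask):
    """Pearson correlation between known and imputed genotypes at masked positions."""
    return np.corrcoef(Y_true[mask], Y_imputed[mask])[0, 1]

def mean_impute(Y_masked):
    """Hypothetical placeholder imputer: fill with the per-marker observed mean."""
    col_means = np.nanmean(Y_masked, axis=0)
    return np.where(np.isnan(Y_masked), col_means, Y_masked)

def evaluate(Y, rate, impute, reps=10, seed=0):
    """Repeat masking `reps` times; return mean accuracy and its standard error."""
    rng = np.random.default_rng(seed)
    acc = []
    for _ in range(reps):
        Y_masked, mask = mask_genotypes(Y, rate, rng)
        acc.append(imputation_accuracy(Y, impute(Y_masked), mask))
    acc = np.array(acc)
    return acc.mean(), acc.std(ddof=1) / np.sqrt(reps)

def thin_markers(Y, step=5):
    """Keep one out of every `step` adjacent markers (columns)."""
    return Y[:, ::step]
```

With 9990 markers, `thin_markers` keeps 1998 columns, which corresponds to the one-fifth low-density panel described above.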
Figure 2. Diagram of the iBLUP pipeline.
(1) Blended raw data were generated from the same flow cell lane. (2) Raw data were assigned to individuals according to the barcode. (3) Assigned reads were filtered for high quality reads according to several rules, including trimming the barcode and the last low quality base etc. (4) Filtered reads were aligned with the reference sequence. (5) SNP calling and genotyping were done according to the mapping results. (6) Missing genotypes were imputed by the iBLUP algorithm.
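Steps (2) and (3) can be illustrated with a small sketch. It is not the pipeline's actual code and makes several assumptions that the caption does not state: each read is a (sequence, quality) pair, the barcode sits at the start of the read, barcodes are matched exactly, and unmatched reads are discarded. The function name `demultiplex_and_trim` is hypothetical.

```python
from collections import defaultdict

def demultiplex_and_trim(reads, barcode_to_sample):
    """Assign reads to individuals by barcode, then trim barcode and last base.

    reads: iterable of (sequence, quality) string pairs of equal length.
    barcode_to_sample: dict mapping barcode string -> individual ID.
    """
    by_sample = defaultdict(list)
    for seq, qual in reads:
        for barcode, sample in barcode_to_sample.items():
            if seq.startswith(barcode):           # exact match at the read start (assumption)
                start = len(barcode)
                # Drop the barcode and the final base from sequence and quality alike.
                by_sample[sample].append((seq[start:-1], qual[start:-1]))
                break
    return dict(by_sample)
```

Further quality filters, alignment, SNP calling and imputation (steps 3 to 6) would follow on the per-individual reads returned here.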
Imputation accuracy of real pig sequencing data.
| Method | 36% | 50% | 60% | 70% | 80% |
| iBLUP | 0.97 | 0.97 | 0.96 | 0.96 | 0.95 |
| BEAGLE | 0.92 | 0.92 | 0.91 | 0.91 | 0.91 |
Imputation accuracy characterized by minor allele*.
| Genotype | MAF | iBLUP (60%) | BEAGLE (60%) | fastPHASE (60%) | iBLUP (70%) | BEAGLE (70%) | fastPHASE (70%) | iBLUP (80%) | BEAGLE (80%) | fastPHASE (80%) |
| Major | <5% | 1.00±0.0002 | 1.00±0.0000 | 0.99±0.009 | 1.00±0.0001 | 0.99±0.0000 | 0.99±0.0005 | 1.00±0.0002 | 0.99±0.0000 | 0.99±0.0003 |
| Major | 5–25% | 0.96±0.0005 | 0.94±0.0002 | 0.76±0.0157 | 0.94±0.0002 | 0.88±0.0001 | 0.73±0.0112 | 0.91±0.0003 | 0.87±0.0000 | 0.70±0.0081 |
| Major | >25% | 0.89±0.0011 | 0.88±0.0004 | 0.42±0.0311 | 0.85±0.0011 | 0.68±0.0008 | 0.37±0.0221 | 0.76±0.0009 | 0.64±0.001 | 0.30±0.014 |
| Major | All | 0.97±0.0003 | 0.96±0.0001 | 0.79±0.0134 | 0.96±0.0002 | 0.89±0.0002 | 0.77±0.0096 | 0.93±0.0002 | 0.87±0.0002 | 0.74±0.0063 |
| Minor | <5% | 0.10±0.0013 | −0.07±0.0027 | −0.04±0.0095 | 0.05±0.0015 | −0.10±0.0008 | −0.05±0.0074 | 0.00±0.0018 | −0.11±0.0008 | −0.06±0.0062 |
| Minor | 5–25% | 0.52±0.0029 | 0.27±0.0018 | −0.04±0.0273 | 0.38±0.0017 | −0.24±0.0011 | −0.08±0.0194 | 0.20±0.0014 | −0.30±0.0001 | −0.13±0.0113 |
| Minor | >25% | 0.77±0.0019 | 0.7±0.0013 | 0.11±0.0389 | 0.67±0.0018 | −0.07±0.003 | 0.06±0.0288 | 0.51±0.0016 | −0.41±0.0017 | 0.01±0.0195 |
| Minor | All | 0.72±0.0021 | 0.59±0.0011 | −0.01±0.0171 | 0.61±0.0017 | −0.1±0.0024 | 0.03±0.026 | 0.44±0.0015 | −0.38±0.0013 | −0.01±0.0171 |
| All | All | 0.97±0.0003 | 0.95±0.0002 | 0.76±0.0164 | 0.95±0.0003 | 0.82±0.0004 | 0.74±0.0119 | 0.92±0.0002 | 0.76±0.0000 | 0.69±0.0079 |
*Genetic markers were classified into three categories based on Minor Allele Frequency (MAF). The MAF cutoffs were 5% and 25%. Known genotypes were masked as missing at three different rates: 60%, 70% and 80%. Three imputation methods (BEAGLE, fastPHASE and iBLUP) were used to impute the masked genotypes. Accuracy was calculated as the Pearson correlation coefficient between the known and imputed genotypes. Three subsets of genotypes were examined: 1) genotypes with the major allele (Major), including homozygotes for the major allele and heterozygotes; 2) genotypes with the minor allele (Minor), including homozygotes for the minor allele and heterozygotes; 3) all genotypes, i.e. both homozygotes and heterozygotes (All).
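The stratification described in this footnote can be sketched as below. The sketch assumes genotypes are coded 0/1/2 as allele counts in an individuals × markers array and that MAF is estimated from the known genotypes; how markers falling exactly on the 5% or 25% cutoffs are assigned is also an assumption. All function names are hypothetical.

```python
import numpy as np

def minor_allele_count(Y):
    """Recode 0/1/2 allele counts so that each marker counts its minor allele."""
    freq = Y.mean(axis=0) / 2.0
    return np.where(freq > 0.5, 2 - Y, Y)

def maf_category(Y):
    """Return 0 for MAF < 5%, 1 for 5-25%, 2 for > 25% (boundary handling assumed)."""
    freq = Y.mean(axis=0) / 2.0
    maf = np.minimum(freq, 1.0 - freq)
    return np.digitize(maf, [0.05, 0.25])

# Known genotype classes kept in each subset, as minor-allele counts.
SUBSETS = {"major": (0, 1), "minor": (1, 2), "all": (0, 1, 2)}

def subset_accuracy(Y_true, Y_imputed, mask, subset, category=None):
    """Pearson correlation on masked genotypes within a genotype subset.

    Y_true must hold minor-allele counts; `category` optionally restricts
    the markers to one MAF class from maf_category().
    """
    sel = mask & np.isin(Y_true, SUBSETS[subset])
    if category is not None:
        sel = sel & (maf_category(Y_true) == category)[np.newaxis, :]
    return np.corrcoef(Y_true[sel], Y_imputed[sel])[0, 1]
```

Restricting the known genotypes to two classes narrows their variance, most severely for rare minor alleles, which may help explain why correlations in the Minor rows are much lower than in the All row.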