Po-Ru Loh1,2, Pier Francesco Palamara1,2, Alkes L Price1,2,3.
Abstract
Recent work has leveraged the extensive genotyping of the Icelandic population to perform long-range phasing (LRP), enabling accurate imputation and association analysis of rare variants in target samples typed on genotyping arrays. Here we develop a fast and accurate LRP method, Eagle, that extends this paradigm to populations with much smaller proportions of genotyped samples by harnessing long (>4-cM) identical-by-descent (IBD) tracts shared among distantly related individuals. We applied Eagle to N ≈ 150,000 samples (0.2% of the British population) from the UK Biobank, and we determined that it is 1–2 orders of magnitude faster than existing methods while achieving similar or better phasing accuracy (switch error rate ≈ 0.3%, corresponding to perfect phase in a majority of 10-Mb segments). We also observed that, when used within an imputation pipeline, Eagle prephasing improved downstream imputation accuracy in comparison to prephasing in batches using existing methods, as necessary to achieve comparable computational cost.
Year: 2016 PMID: 27270109 PMCID: PMC4925291 DOI: 10.1038/ng.3571
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
Figure 1. Eagle algorithm and example phase calls after each step
We show phase calls for ten trio children after each successive step of the Eagle algorithm (applied to phase the first 40 cM of chromosome 10 in all N≈150,000 UK Biobank samples except trio parents). At all trio-phased sites, red and blue indicate whether the first Eagle-phased haplotype for each child matches the maternal or paternal haplotype. (a) After the first step, a sizable proportion of each genome is covered by long segments of near-perfect phase; these segments are the regions in which long IBD is available from several relatives. (b) The second step, which uses both long and short IBD, fixes most of the phase switch errors in the first step. (c,d) The subsequent approximate HMM iterations further reduce the error rate.
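The red/blue coloring described in this caption can be illustrated with a minimal sketch. It is not the authors' plotting code; the function names, the 0/1 allele encoding, and the toy haplotypes are all assumptions made for illustration. At each site where the trio-phased maternal and paternal haplotypes differ, the first phased haplotype is labeled "M" or "P" according to which parent it matches, and consecutive identical labels are collapsed into segments.

```python
from itertools import groupby

def parent_of_origin_labels(phased_hap1, maternal, paternal):
    """Label each informative site by the parental haplotype that hap1 matches.

    All inputs are equal-length sequences of 0/1 alleles (assumed biallelic).
    Sites where the maternal and paternal alleles agree are skipped because
    they carry no phase information.
    """
    labels = []
    for site, (h, m, p) in enumerate(zip(phased_hap1, maternal, paternal)):
        if m == p:
            continue  # homozygous in the child: uninformative for phase
        labels.append((site, "M" if h == m else "P"))
    return labels

def label_segments(labels):
    """Collapse consecutive identical labels into (label, run_length) pairs."""
    return [(key, len(list(run))) for key, run in groupby(lab for _, lab in labels)]

# Toy data (hypothetical): six sites, four of them heterozygous.
hap1     = [0, 1, 0, 0, 1, 0]
maternal = [0, 1, 0, 0, 1, 1]
paternal = [0, 0, 1, 0, 0, 0]

labels = parent_of_origin_labels(hap1, maternal, paternal)
segments = label_segments(labels)
```

In this toy example a single long "M" run followed by a "P" run would appear in the figure as one color change, that is, one phase switch.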
Figure 2. Computational cost and accuracy of phasing methods
Benchmarks of Eagle and existing phasing methods (all run with default options) on N≈15,000, 50,000, and 150,000 UK Biobank samples and M=5,824 SNPs on chromosome 10. Log-log plots of (a) run times and (b) memory consumption using up to 10 cores of a 2.27 GHz Intel Xeon L5640 processor and up to two weeks of computation. (c) Mean switch error rate over 70 European-ancestry trios; error bars, s.e.m. All methods except HAPI-UR supported multithreading. As the HAPI-UR documentation suggested merging results from three independent runs with different random seeds, we parallelized these runs across three cores. (For the N≈150,000 experiment, HAPI-UR encountered a failed assertion bug for some random seeds, so we needed to try six random seeds to find three working seeds. We did not count this extra work against HAPI-UR.) Numeric data are provided in Supplementary Table 1.
Computational cost and accuracy of Eagle and SHAPEIT2 on N≈150,000 samples using various parameters.
| Method | Run time | Switch error rate | Switch error rate without blips | 0-discrepancy 10Mb segments | ≤2-discrepancy 10Mb segments |
|---|---|---|---|---|---|
| Eagle --fast | 2.8 days | 0.321% (0.011%) | 0.153% (0.012%) | 60.5% (2.0%) | 79.3% (1.7%) |
| Eagle | 5.0 days | 0.276% (0.008%) | 0.118% (0.007%) | 62.6% (2.0%) | 81.6% (1.6%) |
| SHAPEIT2 | 106.8 days | 0.306% (0.013%) | 0.159% (0.010%) | 56.3% (1.5%) | 71.2% (1.3%) |
| SHAPEIT2 | 118.8 days | 0.265% (0.014%) | 0.124% (0.009%) | 62.8% (1.8%) | 77.8% (1.1%) |
| SHAPEIT2 | 152.8 days | 0.243% (0.011%) | 0.101% (0.005%) | 64.2% (1.6%) | 80.4% (1.1%) |
We benchmarked various parameter settings of Eagle and SHAPEIT2 in analyses of ten 10,000-SNP regions comprising 16% of the genome (listed in Supplementary Table 4), phasing all N≈150,000 UK Biobank samples in each analysis. We split SHAPEIT2 analyses into 3, 4, or 5 blocks (with an overlap of 500 SNPs) as necessitated by computational constraints; we ligated SHAPEIT2 output using hapfuse v1.6.2. Run times are totals across all ten regions (using 16 cores of a 2.60 GHz Intel Xeon E5-2650 v2 processor). Switch error rates are means (s.e.m.) over the ten regions, assessed on 70 European-ancestry trios. Switch error rates without blips ignore switches arising when 1–2 SNPs are oppositely phased relative to ≥10 consistently phased SNPs on both sides. The number of discrepancies within a 10Mb segment is defined as the minimum number of SNPs with incorrect phase when comparing a phased haplotype to either trio-phased haplotype[13]; percentages of 10Mb segments with 0 or ≤2 discrepancies are means (s.e.m.) over the ten 10,000-SNP regions. Detailed discrepancy distributions are provided in Supplementary Table 5.
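The three accuracy metrics defined in this note can be sketched as follows. This is one reading of the definitions under stated assumptions, not the evaluation code used in the study: the input is assumed to be a list of 0/1 labels at a child's heterozygous trio-phased sites (1 if the phased haplotype matches the maternal haplotype, 0 if paternal), the switch error rate is taken as switches divided by the number of adjacent site pairs, and a "blip" is taken to be a run of 1–2 sites flanked on both sides by runs of at least 10 sites. Function and parameter names are hypothetical.

```python
from itertools import groupby

def switch_error_rate(labels):
    """Fraction of adjacent informative-site pairs where the label changes."""
    if len(labels) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(labels, labels[1:]))
    return switches / (len(labels) - 1)

def remove_blips(labels, max_blip=2, min_flank=10):
    """Flip short runs (<= max_blip sites) flanked by long runs (>= min_flank).

    Run lengths are taken from the original labels, so each candidate blip is
    judged independently of the others (an assumption of this sketch).
    """
    runs = [(value, len(list(run))) for value, run in groupby(labels)]
    cleaned = []
    for i, (value, length) in enumerate(runs):
        is_blip = (
            0 < i < len(runs) - 1
            and length <= max_blip
            and runs[i - 1][1] >= min_flank
            and runs[i + 1][1] >= min_flank
        )
        cleaned.extend([1 - value if is_blip else value] * length)
    return cleaned

def discrepancies(labels):
    """Minimum number of sites mismatching either parental haplotype."""
    ones = sum(labels)
    return min(ones, len(labels) - ones)

# Toy segment (hypothetical): one single-site blip, then a genuine switch.
labels = [1] * 12 + [0] + [1] * 12 + [0] * 5

rate_all = switch_error_rate(labels)
rate_no_blips = switch_error_rate(remove_blips(labels))
n_discrepancies = discrepancies(labels)
```

In the toy segment the blip contributes two of the three switches, so ignoring blips leaves a single switch, while the discrepancy count stays above zero because the final five sites sit on the other parental haplotype.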