Literature DB >> 27694958

Reference-based phasing using the Haplotype Reference Consortium panel.

Po-Ru Loh^1,2, Petr Danecek³, Pier Francesco Palamara^1,2, Christian Fuchsberger^4,5, Yakir A Reshef⁶, Hilary K Finucane^1,7, Sebastian Schoenherr⁸, Lukas Forer⁸, Shane McCarthy³, Goncalo R Abecasis⁵, Richard Durbin³, Alkes L Price^1,2,9.

Abstract

Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing in a genotyped cohort, an approach that can yield high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ∼20× speedup and ∼10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2× the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.

Entities: Chemical

Mesh：

Year: 2016 PMID： 27694958 PMCID： PMC5096458 DOI： 10.1038/ng.3679

Source DB: PubMed Journal: Nat Genet ISSN： 1061-4036 Impact factor: 38.330

Haplotype phasing is a central problem in human genetics[1]. Over the past decade, phasing has most commonly been performed via statistical methods applied within a genotyped cohort[2-14]. Wet-lab technologies for direct phasing have also generated considerable recent interest, but these methods are currently much less scalable[15]. In general, the accuracy of statistical phasing methods increases steadily with sample size due to improved modeling of linkage disequilbrium and increasing prevalence of identity-by-descent. We and others have recently developed methods that achieve very high statistical phasing accuracy in cohorts comprising a large fraction of a population[8] or containing >100,000 samples[13,14]. However, for smaller cohorts, accuracy of cohort-based statistical phasing is fundamentally limited by the quantity of data available. Here, we explore an alternative paradigm, reference-based phasing, which can achieve high accuracy even in smaller cohorts by leveraging information from an external reference panel. This paradigm targets a user group complementary to recent methods for phasing very large cohorts[13,14]. In particular, methods for mapping molecular QTLs using allele-specific reads require accurate phasing information, but recent papers introducing these methods have reported that inaccurate phasing currently limits their potential[16,17]. We present a new reference-based phasing algorithm, Eagle2, which we have incorporated into the Sanger Imputation Service and the Michigan Imputation Server[18] to perform free reference-phasing using the 32,470-sample Haplotype Reference Consortium (HRC)[19]. This approach achieves >2x improved phasing accuracy over 1000 Genomes-based phasing on small European-ancestry cohorts, with smaller improvements for larger cohort sizes. The Eagle2 algorithm represents a substantial computational advance over existing reference-based phasing algorithms: Eagle2 achieves a 20x speedup over SHAPEIT2[12]—i.e., genome-wide phasing in 1.5 minutes per sample—with a 10% improvement in accuracy across a range of ancestries. Eagle2 achieves this performance via two key ideas that distinguish it from previous phasing algorithms[3-14]: a new data structure based on the positional Burrows-Wheeler transform[20] and a rapid search algorithm that explores only the most relevant phase paths through a hidden Markov model (HMM). We have released Eagle2 as open-source software (see URLs).

Results

Overview of methods

The Eagle2 phasing algorithm takes as input a diploid target sample and a library of reference haplotypes. The statistical model underlying Eagle2 is a haplotype copying model similar to the Li-Stephens model[21] used by previous HMM-based methods. However, Eagle2 has two key differences compared to previous HMM-based methods. First, whereas previous approaches approximate the haplotype structure (e.g., by merging haplotypes into local clusters) to produce a more tractable HMM, Eagle2 efficiently represents the full haplotype structure in a way that losslessly condenses locally matching haplotypes. Second, using this representation, Eagle2 selectively explores the space of diplotypes—i.e., complementary pairs of phased haplotypes—in a way that only expends computation on the most likely phase paths (i.e., diplotypes with highest posterior probabilities). This approach is distinct from the dynamic programming or sampling methods employed by previous phasing software and enables much greater computational efficiency. In more detail, Eagle2 efficiently represents haplotype structure by introducing a new data structure, the HapHedge, which can be generated in linear time using the positional Burrows-Wheeler transform (PBWT)[20]. Eagle2 then explores diplotypes using a branching-and-pruning beam search. We provide a schematic of the method in Figure 1 and present full details in Online Methods and the Supplementary Note.

Figure 1

Schematic of the Eagle2 core phasing algorithm

Given diploid genotypes from a target sample along with a haploid reference set of conditioning haplotypes, our algorithm proceeds in two steps. (a) We use the positional Burrows-Wheeler transform[20] to generate a “hedge” of haplotype prefix trees rooted at markers spaced across the chromosome. These trees encode haplotype prefix frequencies, represented here with branch thicknesses. (b) We explore a small set of high-probability diplotypes (i.e., complementary pairs of phased haplotypes), estimating diplotype probabilities under a haplotype copying model by summing over possible recombination points. For each possible choice of recombination points, the HapHedge data structure allows rapid lookup of haplotype segment frequencies. (This illustration is meant to provide intuition for the overall approach; our optimized software implementation first “condenses” reference haplotypes based on the target genotypes. Details are provided in Supplementary Fig. 1 and the Supplementary Note.)

We note that the Eagle2 algorithm is very different from the long-range phasing algorithm we recently developed for phasing extremely large cohorts[13]. (We refer to the previous method as Eagle1.) The basic idea of Eagle1 was to harness identity-by-descent among distant relatives—which is pervasive at very large sample sizes but rare among smaller numbers of samples—to rapidly call phase using a fast scoring approach. In contrast, Eagle2 analyzes a full probabilistic model similar to the diploid Li-Stephens model used by previous HMM-based methods. Consequently, whereas Eagle1 suffered decreased accuracy compared to HMM-based methods when used to phase <50,000 samples, Eagle2 achieves improved accuracy over previous methods for both small and large haplotype reference panel sizes, as we demonstrate below. We note that when a reference panel contains fewer than twice as many samples as the target cohort, Eagle2 iteratively augments the reference panel with inferred target haplotypes (Online Methods); under this paradigm, reference-based phasing should always improve accuracy over cohort-based phasing. We also note that the Eagle1 algorithm was originally only implemented for cohort-based phasing; in this work, we have extended the implementation to reference-based phasing for the sake of comparison. Likewise, we have implemented a cohort-based version of Eagle2 that we also benchmark below.

Phasing performance using genotyped reference panels

We first benchmarked Eagle2 against previous reference-based phasing methods using reference panels generated by phasing subsets of genotyped cohorts. These benchmarks allowed us to explore a greater range of reference panel sizes and genetic ancestries than currently available in sequenced reference panels (as the N=32,470 samples currently in the HRC are predominantly European), understanding that genotyped reference panels containing a limited set of markers are not broadly useful for reference-based phasing. We performed benchmarks using a total of five data sets: the UK Biobank cohort[22] and the four GERA sub-cohorts, which were genotyped on four distinct European, African, East Asian, and Latino genotyping arrays[23,24]. All five data sets were typed on arrays containing 650K–850K autosomal markers with typical heterozygosity and missingness rates, and each data set contained a small subset of mother-father-child trios (Online Methods and Supplementary Table 1). For the UK Biobank reference-based phasing benchmarks, we generated simulated reference panels by randomly selecting Nref = 15,000, 30,000, 50,000, or 100,000 samples (not containing trio members) and phasing them using Eagle1[13]. We phased each subset independently (rather than phasing all samples together and then extracting subsets) to better reflect the phase inaccuracy that would be present in a real reference panel of a given size. We then benchmarked the computational cost and accuracy of reference-based phasing methods by using each panel of 2Nref haplotypes to phase sets of other UK Biobank target samples including the 70 European-ancestry trio children, which we used for benchmarking accuracy (Online Methods). To cover a wide range of linkage disequilibrium structure, we performed these benchmarks on chromosomes 1, 5, 10, 15, and 20 (a total of 174,595 markers comprising ≈25% of the genome) using Eagle2, SHAPEIT2[12], SHAPEIT2 with its –no–mcmc option (which increases speed at the expense of accuracy), and a reference-based version of Eagle1 that we implemented for comparison. We also attempted to benchmark Beagle v4.1[25] but found it was too slow for this benchmark to be practical: for the smallest analysis (chromosome 20 with Nref = 15,000 and Ntarget = 72), Beagle v4.1 required 3.6 days (in contrast to 1.1 minutes for Eagle2). (We note that the focus of Beagle v4.1[25] is its haploid imputation algorithm, which is much faster than its phasing algorithm.) We did not benchmark HAPI-UR[11] as HAPI-UR does not implement reference-based phasing. We observed that Eagle2 achieved 12–38x speedups over SHAPEIT2 for performing reference-based phasing using panels of size Nref = 15,000–100,000 (Fig. 2a and Supplementary Tables 2 and 3). Moreover, unlike the other methods we benchmarked, the computation time Eagle2 required to phase each target sample was nearly independent of the reference size. (For very large reference panels with Nref >> 100,000, the computational cost of Eagle2 will eventually increase with Nref; see Online Methods and the Supplementary Note.) Eagle2 achieved running times similar to Eagle1 and ≈2x faster than SHAPEIT2 –no–mcmc (both of which are much less accurate methods than Eagle2 and SHAPEIT2 when used with reference panels of these sizes; see below). All methods had low memory costs (<7GB for M=57,753 SNPs on chromosome 1 with Nref = 100,000; Supplementary Table 3).

Figure 2

Running time and accuracy of reference-based phasing in UK Biobank benchmarks

We benchmarked Eagle2 and other available methods by phasing UK Biobank trio children using a reference panel generated from Nref = 15,000, 30,000, 50,000, or 100,000 other UK Biobank samples. (a) CPU time per target genome on a 2.27 GHz Intel Xeon L5640 processor. (We analyzed a total of 174,595 markers on chromosomes 1, 5, 10, 15, and 20, representing ≈25% of the genome, and scaled up running times by a factor of 4; see Supplementary Table 3 for details.) (b) Mean switch error rate over 70 European-ancestry trios; error bars, s.e.m. (c, d) CPU time and mean switch error rate as a function of the number of conditioning haplotypes used by SHAPEIT2 and Eagle2 (relative to the default values of K=100 and 10,000, respectively). Eagle1 does not have such a parameter, so we display its performance as a horizontal line. Numeric data and additional benchmarks varying the number of conditioning haplotypes used with Nref = 15,000, 50,000, and 100,000 are provided in Supplementary Table 2.

In our accuracy benchmarks, which we computed using gold standard trio phase calls, we observed that Eagle2 achieved 5–16% lower switch error rates[2] compared to SHAPEIT2, with larger gains for lower values of Nref in the 15,000–100,000 range (Fig. 2b and Supplementary Table 2). Eagle2 achieved 4–36% lower switch error rates than Eagle1 and 63–81% lower switch error rates than SHAPEIT2 –no–mcmc in the same Nref range. All differences were statistically significant (binomial p=0.003 or less for each comparison of Eagle2 with SHAPEIT2, p=0.02 or less for each comparison with Eagle1, and p=10−21 for each comparison with SHAPEIT2 –no–mcmc). Both Eagle2 and SHAPEIT2 have an important parameter, K, that specifies the number of conditioning haplotypes used to phase each target sample and thus adjusts the speed-accuracy trade-off. We therefore also investigated the effects of varying K. (We note that the default values and precise meaning of this parameter are different for Eagle2 vs. SHAPEIT2; by default, SHAPEIT2 locally selects K=100 best reference haplotypes in each 2Mb window, while Eagle2 selects a fixed set of K=10,000 best reference haplotypes to use for the entire chromosome. This difference may be responsible for the slightly lower rate of improvement in accuracy of Eagle2 relative to that of SHAPEIT2 as Nref increases at fixed K in Fig. 2b.) We considered a range of values of K from 0.5–4 times the default K, similar to previous benchmarks of SHAPEIT2[12]. The effects of varying K were broadly consistent for Eagle2, SHAPEIT2, and SHAPEIT2 –no–mcmc: all methods required similarly increased computation time and achieved improved accuracy with larger values of K (Fig. 2c,d and Supplementary Tables 2 and 3). In particular, increasing the number of conditioning haplotypes by a factor of 4x required 2–3x more computation time for both Eagle2 and SHAPEIT2 while achieving similar decreases in switch error rates (12–20% and 17–19%, respectively, for Nref = 15,000–100,000; Supplementary Table 2). All improvements were statistically significant (binomial p=10−8 or less). To assess the robustness of these accuracy benchmarks across genetic ancestries, we performed a similar set of benchmarks using the European, African, East Asian, and Latino GERA sub-cohorts (Online Methods). Because the latter three sub-cohorts were relatively small (Supplementary Table 1), we generated a single simulated reference panel for each sub-cohort containing all samples not belonging to trio pedigrees (Nref = 3,817, 5,164, 7,144, and 61,684 for the African, East Asian, Latino, and European sub-cohorts). We phased the three smaller panels using SHAPEIT2 and phased the European panel using Eagle1. We then benchmarked reference-based phasing accuracy by phasing the trio parents within each sub-cohort using the panel generated from that sub-cohort, running each method with default parameter settings. (We phased trio parents rather than trio children for these benchmarks because the three smaller data sets contained only 3–7 independent trios each; Supplementary Table 1.) These benchmarks confirmed our findings from the UK Biobank data: Eagle2 achieved 5–23% lower switch error rates than SHAPEIT2, and we observed the same relative ordering of accuracies as before across all sub-cohorts (Figure 3 and Supplementary Table 4). All differences were statistically significant (binomial p=10−7 or less). We note that every method had a higher switch error rate in the GERA European sub-cohort compared to the UK Biobank, presumably due primarily to a more diverse set of ancestries represented. In general, absolute switch error rates are not directly comparable among data sets due to differences in demography and genotyping properties (e.g., chip density, allele frequency distribution, and genotype error rate).

Figure 3

Accuracy of reference-based phasing in GERA benchmarks

We phased trio parents in each GERA sub-cohort using a reference panel generated from all other non-familial samples in the same sub-cohort. We ran each method with default parameter settings on all 22 autosomes and computed aggregate mean switch error rates; error bars, s.e.m. Standard errors for the European-ancestry sub-cohort are over 400 parent samples. Standard errors for the other three sub-cohorts are over 25 SNP blocks. Numeric data and additional benchmarks varying the number of conditioning haplotypes used by each method are provided in Supplementary Table 4.

Phasing accuracy using the 1000 Genomes and HRC panels

We next benchmarked reference-based phasing using either the 1000 Genomes Project Phase 3 reference panel (containing N=2,504 samples from 26 populations)[26] or the HRC reference panel r1.1 (containing N=32,470 samples mostly of European ancestry)[19]. For these benchmarks, we used 1000 Genomes trio children as target samples, removing all 1000 Genomes trios from each reference panel before running the analyses. We phased chromosome 1, and to emulate typical genotyping density, we restricted the SNP set to 31,853 sites typed on 23 and Me (customized Illumina) chips. Given the predominantly European composition of the HRC panel, the benchmarks on 32 CEU trio children were of primary interest, and we observed in these benchmarks that all methods achieved substantially improved accuracy using the HRC panel versus the 1000 Genomes panel (Figure 4 and Supplementary Table 5). For each choice of reference panel, Eagle2 achieved the lowest switch error rate, consistent with our previous results. For phasing small European cohorts, Eagle2 with the HRC panel provides a >2x improvement in accuracy over 1000 Genomes-based phasing: 1.36% (s.e. 0.04%) switch error rate versus 3.52% (0.06%) using SHAPEIT2 or 3.27% (0.06%) using Eagle2 with the 1000 Genomes reference (Figure 4 and Supplementary Table 5).

Figure 4

Accuracy of reference-based phasing using the 1000 Genomes and HRC panels

We phased 32 trio children from the 1000 Genomes CEU population using either the 1000 Genomes Phase 3 reference panel or the Haplotype Reference Consortium panel (excluding trios in either case). We analyzed chromosome 1, and to emulate a typical use case, we restricted the data to 31,853 markers (genotyped on 23 and Me chips). We plot mean switch error rates; error bars, s.e.m. over samples. Numeric data and additional benchmarks on other 1000 Genomes populations are provided in Supplementary Table 5.

We also benchmarked accuracy in all other 1000 Genomes populations containing >1 trio. We phased trio children in 31 Han Chinese (CHS) trios, 30 Peruvian (PEL) trios, 15 Punjabi (PJL) trios, and 19 Yoruba (YRI) trios using either the 1000 Genomes panel or the HRC panel, and we observed that in all cases Eagle2's accuracy was either slightly better or statistically indistiguishable from SHAPEIT2's (Supplementary Table 5). Specifically, the differences between Eagle2 and SHAPEIT2 were not significant for PEL with either reference panel and for YRI with HRC (p=0.05 or larger); all other differences were significant (binomial p=0.006 or less). Interestingly, all methods achieved lower accuracy using the HRC panel versus the 1000 Genomes panel (Supplementary Table 5). Given that the HRC panel contains the 1000 Genomes panel, this observation suggests that the inclusion of ≈30,000 additional predominantly European samples reduced the ability of each method to model the haplotype structure of non-European populations. However, we did not observe this phenomenon when phasing the two non-European UK Biobank trios using increasing numbers of European reference haplotypes (Supplementary Table 6), so this observation may be specific to the current HRC release (r1.1); development of the HRC is ongoing.

Phasing performance without a reference panel

Lastly, we assessed the performance of Eagle2 when applied to cohort-based phasing, which we also implemented in our software. The Eagle2 cohort-based phasing algorithm starts by running the first two steps of Eagle1[13] to rapidly produce rough haplotype estimates and then refines these estimates using the Eagle2 core phasing algorithm (Online Methods). We benchmarked Eagle2, Eagle1, and SHAPEIT2 on subsets of the UK Biobank data set containing N = 5,000, 15,000, 50,000, or 150,000 samples (including trio children and excluding trio parents). We phased chromosomes 1, 5, 10, 15, and 20 as in our UK Biobank reference-based phasing benchmarks, and we allowed each computational job up to 5 days to complete. We observed that Eagle2 exhibited computational efficiency similar to Eagle1, achieving 5–6x speedups over SHAPEIT2 in the analyses SHAPEIT2 was able to complete (N=5,000 and N=15,000) (Fig. 5a and Supplementary Tables 7 and 8). Eagle2 also exhibited close-to-linear run time scaling across this sample size range, breaking even with Eagle1 at N≈30,000 and achieving faster running times for larger N.

Figure 5

Running time and accuracy of cohort-based phasing in the UK Biobank cohort

We benchmarked Eagle2 and other available phasing methods on N=5,000, 15,000 50,000, and 150,000 UK Biobank samples (including trio children and excluding trio parents). (a) Total wall clock time for genome-wide phasing on a 16-core 2.60 GHz Intel Xeon E5-2650 v2 processor. (We analyzed a total of 174,595 markers on chromosomes 1, 5, 10, 15, and 20, representing ≈25% of the genome, and scaled up running times by a factor of 4; see Supplementary Table 8 for per-chromosome data.) SHAPEIT2 was unable to complete the N=50,000 chr1 and chr5 analyses and was uanble to complete any of the N=150,000 analyses in 5 days, the run time limit for single compute jobs. (b) Mean switch error rate over 70 European-ancestry trios; error bars, s.e.m. Numeric data and additional benchmarks varying the number of conditioning haplotypes used by Eagle2 are provided in Supplementary Table 7.

In our accuracy benchmarks using the 70 European-ancestry UK Biobank trios, we observed that Eagle2 achieved better accuracy than SHAPEIT2 and Eagle1 for N≤50,000, as expected (Fig. 5b and Supplementary Tables 7 and 8). At N=150,000, Eagle1 achieved a slightly lower switch error rate (0.31%, s.e. 0.02%) than Eagle2 (0.35%, s.e. 0.02%). However, we observed that running Eagle2 with 4x the default number of conditioning haplotypes (i.e., K=40,000) achieved the lowest error rates across all sample sizes tested (0.27%, s.e. 0.02% at N=150,000). Both differences were statistically significant (binomial p=0.0006 or less). Finally, we confirmed that Eagle2 achieved better phasing accuracy than SHAPEIT2 or Eagle1 when used to phase the GERA samples within each GERA sub-cohort (Supplementary Table 9), with switch error rates consistent with our earlier reference-based benchmarks (Figure 3 and Supplementary Table 4). All differences were statistically significant (binomial p=0.002 or less).

Discussion

We have described a new phasing algorithm, Eagle2, which we have incorporated into the Sanger Imputation Service and the Michigan Imputation Server to offer free reference-based phasing using the N=32,470-sample Haplotype Reference Consortium panel. This service enables high-accuracy phasing even in smaller cohorts, which was not previously possible. Eagle2 achieves substantial gains in speed and accuracy over previous methods via a novel search-based algorithm employing the positional Burrows-Wheeler transform. We believe this method is timely, as large sequenced reference panels (e.g., the HRC) are now becoming available for use—but must be utilized via analyses run on central servers due to consent restrictions. We anticipate that Eagle2's phasing speed—1.5 minutes per genotyped sample—will help keep computation tractable as demand for this service increases. Additionally, we anticipate that our release of Eagle2 as open-source software will aid in future method development and integration into analysis pipelines. We note that Eagle2 targets a distinct user group compared to very recent work on phasing very large cohorts[13,14]. In particular, our Eagle1 method[13] is targeted at phasing very large (N>100,000) cohorts and achieves much lower accuracy than both Eagle2 and previous methods when used to phase smaller cohorts. The SHAPEIT3 method[14] is likewise targeted at phasing “biobank scale datasets.” The information provided in the paper describing SHAPEIT3 (ref.[14]) indicates that its primary advance is removing a quadratic complexity component of the SHAPEIT2 algorithm that becomes significant as N increases beyond 10,000 samples; this computational speedup comes at the expense of reduced accuracy. The benchmarks in ref.[14] suggest that if used to perform HRC-based phasing at Nref=32,470, SHAPEIT3 would be ≈3x faster but roughly 20% less accurate than SHAPEIT2; in contrast, Eagle2 is ≈20x faster and ≈10% more accurate than SHAPEIT2 at this sample size. (In practice, the SHAPEIT license precludes its use for reference-based phasing on the Sanger and Michigan HRC servers.) While we believe that reference-based phasing using large reference panels such as the HRC is a valuable phasing paradigm, we note a few limitations. First, reference-based phasing accuracy is limited not only by reference panel size but also by genotyping and phasing accuracy in the reference panel. In particular, the HRC reference haplotypes are largely derived from low-coverage sequencing data (which is generally prone to higher errors in genotype calling), and efforts to improve the accuracy of the reference panel are ongoing. Second, for reference-based phasing to be effective, the reference panel needs to contain a sizable subset of samples with genetic ancestry well-matched to the target samples. Consequently, phasing using the HRC is currently only advantageous for European-ancestry target samples, although plans are underway to grow the HRC to better represent worldwide populations. Third, for very large cohorts (substantially larger than the reference size), we expect that reference-based phasing will achieve only marginal gains in accuracy over cohort-based phasing. For such cohorts, we expect that cohort-based phasing may remain the preferred option due to ease of execution—although reference sizes are growing, and if the end goal is HRC-based imputation, then we expect that HRC-based pre-phasing[27] will be more convenient. Finally, we have not fully optimized Eagle2 for phasing sequenced samples[28,29]. While a preliminary benchmark against SHAPEIT2 suggests that Eagle2 achieves gains in speed and accuracy on sequence data comparable to its improvements on genotype data (Supplementary Table 10), more investigation will be needed to tune the method and benchmark it against other approaches (e.g., the recent SHAPEITR method[30], which is available via the Oxford Statistics Phasing Server; see URLs). Despite these limitations, we expect that reference-based phasing using Eagle2 and the HRC panel will be a valuable resource providing free, fast, and accurate phasing to the scientific community.

Online Methods

Eagle2 core algorithm for phasing a single target sample using a set of reference haplotypes

Here we outline the key ideas underlying the Eagle2 core algorithm for phasing a single target sample using a set of reference haplotypes, which has three main steps.

Step 1: Selection of conditioning haplotypes

Eagle2 first identifies a subset of K=10,000 conditioning haplotypes by ranking reference haplotypes according to the number of discrepancies between each reference haplotype and the homozygous genotypes of the target sample. As in our previous work[13], we perform computation on blocks of up to 64 SNPs at once using bitwise arithmetic; thus, the total computational cost of subset selection is linear in Nref with a very small constant factor (ignoring time to rank the results, which is negligible in practice). The constant factor is small enough that this step constitutes only a small fraction of the total run time for Nref<100,000. We note that our discrepancy metric does not make use of inferred phase of the target genotypes (which is possible within an iterative phase refinement scheme) and produces a single set of conditioning haplotypes to use for the entire region being phased, in contrast to the sophisticated approach used by SHAPEIT2[12]. However, Eagle2 is able to condition on 100x more haplotypes than SHAPEIT2, which we suspect makes selection of conditioning haplotypes much less important. The overall complexity of this step is O(MN) in both time and memory.

Step 2: Generation of HapHedge data structure

Eagle2 next generates a HapHedge data structure on the selected conditioning haplotypes. The HapHedge encodes a sequence of haplotype prefix trees (i.e., binary trees on haplotype prefixes) rooted at a sequence of starting positions along the chromosome, thus enabling fast lookup of haplotype frequencies (Figure 1). (In practice, we start a new tree roughly once per heterozygous site in the target sample; Supplementary Fig. 1.) The key features of the HapHedge are linear-time construction, linear-memory representation, and constant-time prefix extension, all with small constant factors. The compact in-memory representation of the HapHedge is achieved via radix trees (Supplementary Fig. 2), while linear-time construction is achieved via the positional Burrows-Wheeler transform[20]. In its simplest form, the PBWT iteratively creates sorted lists of haplotype prefixes, moving the prefix start point from right to left. Our algorithm extends this procedure to convert the lists of sorted prefixes into prefix trees; see the Supplementary Note for details. The overall complexity of this step is O(MK) in both time and memory.

Step 3: Exploration of the diplotype space

Having prepared a HapHedge of conditioning haplotypes, Eagle2 performs phasing using a statistical model similar to the Li-Stephens haplotype copying model[5,21] used by previous HMM-based methods. However, in contrast to previous methods, Eagle2 applies two new ideas to perform fast and accurate phase inference under this model. The first idea is a new way to efficiently compute haplotype probabilities under a copying model. Naïvely, such computations require exponential time because of the combinatorial explosion of possible recombination points. The standard approach to overcoming this barrier is to observe that within a K-state HMM, recursion allows computation of all marginal probabilities (for all K states at each of M positions) in O(MK[2]) time. With Eagle2, we take a completely different recursive approach that computes the probability of a single haplotype in O(M) time—independent of the number of reference haplotypes K—after creation of the HapHedge in O(MK) time. The HapHedge essentially consolidates all reference haplotypes sharing a common prefix (starting at any given position) into a single atom of data, thus eliminating future computation that scales with K. Of course, being able to very rapidly compute the probability of a single haplotype is only useful if we can identify a small subset of haplotype probabilities that are worth computing; to this end, Eagle2 employs a second key idea. We perform a beam search from left to right across the chromosome, propagating a small set of likely diplotypes that represent most of the posterior probability mass in the local diplotype space. This approach essentially focuses computational effort on a small subset of the diplotype space (vs. expending computation evenly across the space as in HMM recursion), which is advantageous when most of the space is probabilistically unfavorable but difficult to discard a priori. Full mathematical and engineering details are provided in the Supplementary Note. The overall complexity of this step is O(MHP) time and O(MP+HP) memory, where H and P are “history length” and “beam width” parameters of the beam search described in the Supplementary Note. We explore the sensitivity of Eagle2 to various parameter choices in Supplementary Tables 11, 12, and 13.

Eagle2 algorithm for reference-based phasing with multiple target samples

In practice, reference-based phasing is typically performed on a target set containing many samples, allowing the potential to improve phasing accuracy by using inferred target haplotypes to phase each other. By default, Eagle2 performs a variable number of phasing iterations chosen based on the relative size of the target (Ntarget) and the reference (Nref). This behavior is intended to allow Eagle2 to automatically benefit from increased statistical power available from larger target sample sizes. Specifically, if Ntarget < Nref/2, Eagle2 performs only one phasing iteration (phasing each target sample using only the reference haplotypes). If Nref/2 ≤ Ntarget < 2Nref, Eagle2 performs two iterations, augmenting the reference panel with the inferred target haplotypes during the second iteration. If Ntarget ≥ 2Nref, Eagle2 performs three iterations in an analogous manner. Whenever Eagle2 performs more than one iteration, all iterations prior to the final iteration use K/2 conditioning haplotypes to save time, given that the last iteration has the most impact on accuracy. The number of iterations can also be set directly via the –pbwtIters parameter.

Eagle2 algorithm for cohort-based phasing

To perform cohort-based phasing (i.e., without a reference), Eagle2 employs an iteration similar to the above approach, but prior to running two iterations of the Eagle2 core phasing algorithm as described above, it first runs the first two steps of the Eagle1 algorithm, which rapidly detect identical-by-descent segments and use them to call phase[13]. Within small data sets, identity-by-descent is less common, but our results indicate that the two subsequent iterations of the Eagle2 core phasing algorithm are able to rapidly refine phase calls given even an inaccurate set of initial phase calls.

UK Biobank data set

We analyzed data from the UK Biobank consisting of 152,729 samples typed at ≈800,000 SNPs. Using PLINK[31] (see URLs), we removed 480 individuals marked for exclusion from genomic analyses based on missingness and heterozygosity filters and 1 individual who had withdrawn consent, leaving 152,248 samples (see URLs, UK Biobank Genotyping and QC). We restricted the SNP set to autosomal, biallelic SNPs with missingness ≤10% and we further excluded 65 autosomal SNPs found to have significantly different allele frequencies between the UK BiLEVE array and the UK Biobank array, leaving 707,524 SNPs (57,753 on chr1, 41,538 on chr5, 34,588 on chr10, 22,367 on chr15, and 18,349 on chr20). We identified 72 trios based on IBS0<0.001, sex of parents, and age of trio members (see URLs, Genotyping and QC). Of the 72 trio children, 69 self-reported British ethnicity, one self-reported Indian ethnicity, and one self-reported Caribbean ethnicity. The remaining trio child did not self-report any ethnicity, but her parents self-reported Irish and “Any other white background” as their ethnicities, so we included this trio child in the 70 European-ancestry trio children we used to benchmark phasing accuracy.

GERA data set

We analyzed GERA samples (see URLs; dbGaP study accession phs000674.v1.p1) typed on each of the four GERA ancestry-specific chips (European, African, East Asian, and Latino); QC is described in ref.[23]. We directly analyzed all samples and all autosomal SNPs in each of the four sub-cohorts (Supplementary Table 1). We identified independent trios in each sub-cohort according to pedigree information provided with the data release.

Phasing software versions and parameter settings

We benchmarked Eagle1[13], Eagle2 (v2.1), SHAPEIT v2 (r790)[12], and Beagle v4.1 (22Feb16.8ef)[25] using the Oxford genetic map (supplied with SHAPEIT and Eagle). When running Eagle2 in reference-based phasing mode, we turned off imputation of missing genotypes using the –noImpMissing flag; otherwise, we ran all methods using their default parameter settings unless explicitly testing non-default settings. Specifically, the non-default parameter settings we tested were the –no–mcmc and –states (K) option of SHAPEIT2 and the –Kpbwt option of Eagle2.

Evaluation of reference-based phasing using genotyped reference panels

For our reference-based phasing benchmarks using genotyped UK Biobank data, we constructed reference panels by randomly selecting Nref = 15,000, 30,000, 50,000, or 100,000 samples (disjoint from the 72 UK Biobank trios) and phasing these samples using Eagle1 (as phasing using SHAPEIT2 would have required several weeks[13]). We then applied each reference-based phasing method to phase chromosomes 1, 5, 10, 15, and 20 of the Ntarget = 72 trio children using each simulated reference panel, and we compared the phased output against trio phase calls to compute switch error rates[2,13]. In our results, we report mean switch error rates and s.e.m. over the 70 European-ancestry trio children (according to self-reported ethnicity; see above). To benchmark per-sample computational cost of reference-based phasing, we performed an additional set of analyses in which we phased 1,000 randomly selected samples (not contained in the simulated reference panels) in addition to the 72 trio children. We subtracted the Ntarget = 72 running times from the Ntarget = 1,072 running times to obtain the incremental cost of phasing 1,000 samples, thus adjusting for initialization costs (e.g., reading the reference data and synchronizing it with the target data), which account for a non-neglibible fraction of total computational cost when Ntarget is small. Finally, we divided by 1,000 to obtain per-sample costs and multiplied by 4 to scale up from the five chromosomes analyzed (≈25% of the genome) to a genome-wide analysis. For our reference-based phasing benchmarks using GERA data, we applied an analogous procedure with the following minor differences. For each of the four sub-cohorts, we created a single simulated reference panel using all samples not in the same extended pedigree as any trio. We phased the European chip panel using Eagle1 and phased the other three panels using SHAPEIT2 (which is computationally tractable and more accurate than Eagle1 for small cohorts[13]). In each sub-cohort, we then applied each reference-based phasing method to phase all 22 autosomes of the trio parents in that sub-cohort, and we computed mean switch error rates over all trio parents. We chose to benchmark accuracy using trio parents rather than trio children due to the small numbers of trios in the non-European sub-cohorts. (For the European sub-cohort, we also computed benchmarks using trio children for comparison.) For the European sub-cohort, we computed s.e.m. over samples as before; for the other three sub-cohorts, we computed s.e.m. over 25 SNP blocks due to the small numbers of trios. When used to compare methods, these standard errors are conservative due to true variation among samples and across the genome (which causes errors to be correlated). We therefore assessed statistical significance of differences in performance between pairs of methods by performing one-sided binomial tests across samples or SNP blocks as appropriate.

Evaluation of cohort-based phasing performance

For our benchmarks of phasing without a reference, we created subsets of UK Biobank samples containing N = 5,000, 15,000, 50,000, or 150,000 samples, each of which contained all 72 trio children and none of the 144 trio parents. We then applied each phasing method to phase chromosomes 1, 5, 10, 15, and 20 of each subset of samples, and we computed mean switch error rates and s.e.m. over the 70 European-ancestry trio children as above. We applied an analogous procedure to the GERA sub-cohorts with the same modifications as above.

30 in total

1. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data.

Authors: Na Li; Matthew Stephens
Journal: Genetics Date: 2003-12 Impact factor: 4.562

2. A linear complexity phasing method for thousands of genomes.

Authors: Olivier Delaneau; Jonathan Marchini; Jean-François Zagury
Journal: Nat Methods Date: 2011-12-04 Impact factor: 28.547

3. Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort.

Authors: Mark N Kvale; Stephanie Hesselson; Thomas J Hoffmann; Yang Cao; David Chan; Sheryl Connell; Lisa A Croen; Brad P Dispensa; Jasmin Eshragh; Andrea Finn; Jeremy Gollub; Carlos Iribarren; Eric Jorgenson; Lawrence H Kushi; Richard Lao; Yontao Lu; Dana Ludwig; Gurpreet K Mathauda; William B McGuire; Gangwu Mei; Sunita Miles; Michael Mittman; Mohini Patil; Charles P Quesenberry; Dilrini Ranatunga; Sarah Rowell; Marianne Sadler; Lori C Sakoda; Michael Shapero; Ling Shen; Tanu Shenoy; David Smethurst; Carol P Somkin; Stephen K Van Den Eeden; Lawrence Walter; Eunice Wan; Teresa Webster; Rachel A Whitmer; Simon Wong; Chia Zau; Yiping Zhan; Catherine Schaefer; Pui-Yan Kwok; Neil Risch
Journal: Genetics Date: 2015-06-19 Impact factor: 4.562

4. Phasing of many thousands of genotyped samples.

Authors: Amy L Williams; Nick Patterson; Joseph Glessner; Hakon Hakonarson; David Reich
Journal: Am J Hum Genet Date: 2012-08-10 Impact factor: 11.025

5. Next-generation genotype imputation service and methods.

Authors: Sayantan Das; Lukas Forer; Sebastian Schönherr; Carlo Sidore; Adam E Locke; Alan Kwong; Scott I Vrieze; Emily Y Chew; Shawn Levy; Matt McGue; David Schlessinger; Dwight Stambolian; Po-Ru Loh; William G Iacono; Anand Swaroop; Laura J Scott; Francesco Cucca; Florian Kronenberg; Michael Boehnke; Gonçalo R Abecasis; Christian Fuchsberger
Journal: Nat Genet Date: 2016-08-29 Impact factor: 38.330

6. Detection of sharing by descent, long-range phasing and haplotype imputation.

Authors: Augustine Kong; Gisli Masson; Michael L Frigge; Arnaldur Gylfason; Pasha Zusmanovich; Gudmar Thorleifsson; Pall I Olason; Andres Ingason; Stacy Steinberg; Thorunn Rafnar; Patrick Sulem; Magali Mouy; Frosti Jonsson; Unnur Thorsteinsdottir; Daniel F Gudbjartsson; Hreinn Stefansson; Kari Stefansson
Journal: Nat Genet Date: 2008-09 Impact factor: 38.330

7. Haplotype reconstruction from genotype data using Imperfect Phylogeny.

Authors: Eran Halperin; Eleazar Eskin
Journal: Bioinformatics Date: 2004-02-26 Impact factor: 6.937

8. Fast and accurate long-range phasing in a UK Biobank cohort.

Authors: Po-Ru Loh; Pier Francesco Palamara; Alkes L Price
Journal: Nat Genet Date: 2016-06-06 Impact factor: 38.330

9. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

10. Haplotype estimation for biobank-scale data sets.

Authors: Jared O'Connell; Kevin Sharp; Nick Shrine; Louise Wain; Ian Hall; Martin Tobin; Jean-Francois Zagury; Olivier Delaneau; Jonathan Marchini
Journal: Nat Genet Date: 2016-06-06 Impact factor: 38.330

498 in total

1. Leveraging epigenomics and contactomics data to investigate SNP pairs in GWAS.

Authors: Elisabetta Manduchi; Scott M Williams; Alessandra Chesi; Matthew E Johnson; Andrew D Wells; Struan F A Grant; Jason H Moore
Journal: Hum Genet Date: 2018-05-24 Impact factor: 4.132

2. Integrated analysis of genomics, longitudinal metabolomics, and Alzheimer's risk factors among 1,111 cohort participants.

Authors: Burcu F Darst; Qiongshi Lu; Sterling C Johnson; Corinne D Engelman
Journal: Genet Epidemiol Date: 2019-05-18 Impact factor: 2.135

3. A Fast and Simple Method for Detecting Identity-by-Descent Segments in Large-Scale Data.

Authors: Ying Zhou; Sharon R Browning; Brian L Browning
Journal: Am J Hum Genet Date: 2020-03-12 Impact factor: 11.025

4. Genetic Susceptibility for Atrial Fibrillation in Patients Undergoing Atrial Fibrillation Ablation.

Authors: M Benjamin Shoemaker; Daniela Husser; Carolina Roselli; Meelad Al Jazairi; Jonathan Chrispin; Michael Kühne; Benjamin Neumann; Stacey Knight; Han Sun; Sanghamitra Mohanty; Christian Shaffer; Sébastien Thériault; Lauren Lee Rinke; Joylene E Siland; Diane M Crawford; Laura Ueberham; Omeed Zardkoohi; Petra Büttner; Bastiaan Geelhoed; Steffen Blum; Stefanie Aeschbacher; Jonathan D Smith; David R Van Wagoner; Rebecca Freudling; Martina Müller-Nurasyid; Jay Montgomery; Zachary Yoneda; Quinn Wells; Tariq Issa; Peter Weeke; Victoria Jacobs; Isabelle C Van Gelder; Gerhard Hindricks; John Barnard; Hugh Calkins; Dawood Darbar; Greg Michaud; Stefan Kääb; Patrick Ellinor; Andrea Natale; Mina Chung; Saman Nazarian; Michael J Cutler; Moritz F Sinner; David Conen; Michiel Rienstra; Andreas Bollmann; Dan M Roden; Steven Lubitz
Journal: Circ Arrhythm Electrophysiol Date: 2020-02-14

5. Ancestry-specific polygenic scores and SNP heritability of 25(OH)D in African- and European-ancestry populations.

Authors: Kathryn E Hatchell; Qiongshi Lu; Scott J Hebbring; Erin D Michos; Alexis C Wood; Corinne D Engelman
Journal: Hum Genet Date: 2019-07-24 Impact factor: 4.132

6. Genetic origin of a large family with a novel PSEN1 mutation (Ile416Thr).

Authors: Laura Ramirez Aguilar; Juliana Acosta-Uribe; Margarita M Giraldo; Sonia Moreno; Ana Baena; Diana Alzate; Rosario Cuastumal; David Aguillón; Lucía Madrigal; Amanda Saldarriaga; Alexander Navarro; Gloria P Garcia; Daniel C Aguirre-Acevedo; Ethan G Geier; J Nicholas Cochran; Yakeel T Quiroz; Richard M Myers; Jennifer S Yokoyama; Kenneth S Kosik; Francisco Lopera
Journal: Alzheimers Dement Date: 2019-02-10 Impact factor: 21.566

7. Discrimination and anxiety: Using multiple polygenic scores to control for genetic liability.

Authors: Adolfo G Cuevas; Frank D Mann; David R Williams; Robert F Krueger
Journal: Proc Natl Acad Sci U S A Date: 2021-01-05 Impact factor: 11.205

8. Examining the independent and joint effects of molecular genetic liability and environmental exposures in schizophrenia: results from the EUGEI study.

Authors: Sinan Guloksuz; Lotta-Katrin Pries; Philippe Delespaul; Gunter Kenis; Jurjen J Luykx; Bochao D Lin; Alexander L Richards; Berna Akdede; Tolga Binbay; Vesile Altınyazar; Berna Yalınçetin; Güvem Gümüş-Akay; Burçin Cihan; Haldun Soygür; Halis Ulaş; EylemŞahin Cankurtaran; Semra Ulusoy Kaymak; Marina M Mihaljevic; Sanja Andric Petrovic; Tijana Mirjanic; Miguel Bernardo; Bibiana Cabrera; Julio Bobes; Pilar A Saiz; María Paz García-Portilla; Julio Sanjuan; Eduardo J Aguilar; José Luis Santos; Estela Jiménez-López; Manuel Arrojo; Angel Carracedo; Gonzalo López; Javier González-Peñas; Mara Parellada; Nadja P Maric; Cem Atbaşog Lu; Alp Ucok; Köksal Alptekin; Meram Can Saka; Celso Arango; Michael O'Donovan; Bart P F Rutten; Jim van Os
Journal: World Psychiatry Date: 2019-06 Impact factor: 49.548

9. Profiling and Leveraging Relatedness in a Precision Medicine Cohort of 92,455 Exomes.

Authors: Jeffrey Staples; Evan K Maxwell; Nehal Gosalia; Claudia Gonzaga-Jauregui; Christopher Snyder; Alicia Hawes; John Penn; Ricardo Ulloa; Xiaodong Bai; Alexander E Lopez; Cristopher V Van Hout; Colm O'Dushlaine; Tanya M Teslovich; Shane E McCarthy; Suganthi Balasubramanian; H Lester Kirchner; Joseph B Leader; Michael F Murray; David H Ledbetter; Alan R Shuldiner; George D Yancoupolos; Frederick E Dewey; David J Carey; John D Overton; Aris Baras; Lukas Habegger; Jeffrey G Reid
Journal: Am J Hum Genet Date: 2018-05-03 Impact factor: 11.025

10. Use of whole-genome sequence data and novel genomic selection strategies to improve selection for age at puberty in tropically-adapted beef heifers.

Authors: Christie L Warburton; Bailey N Engle; Elizabeth M Ross; Roy Costilla; Stephen S Moore; Nicholas J Corbet; Jack M Allen; Alan R Laing; Geoffry Fordyce; Russell E Lyons; Michael R McGowan; Brian M Burns; Ben J Hayes
Journal: Genet Sel Evol Date: 2020-05-27 Impact factor: 4.297