Jared O'Connell, Kevin Sharp, Nick Shrine, Louise Wain, Ian Hall, Martin Tobin, Jean-Francois Zagury, Olivier Delaneau, Jonathan Marchini.
Abstract
The UK Biobank (UKB) has recently released genotypes on 152,328 individuals together with extensive phenotypic and lifestyle information. We present a new phasing method, SHAPEIT3, that can handle such biobank-scale data sets and results in switch error rates as low as ∼0.3%. The method exhibits O(N log N) scaling with sample size N, enabling fast and accurate phasing of even larger cohorts.
Year: 2016 PMID: 27270105 PMCID: PMC4926957 DOI: 10.1038/ng.3583
Source DB: PubMed Journal: Nat Genet ISSN: 1061-4036 Impact factor: 38.330
Comparison of methods on the UK Biobank dataset.
| Sample size | Method | Clustering | New MCMC | Switch Error (%) | Run time (hrs) | Run time scaling | Sample size scaling |
|---|---|---|---|---|---|---|---|
| 1,072 | SHAPEIT3 | No | Yes | 2.6 | 0.25 | 1 | 1 |
| 10,072 | SHAPEIT2 | No | No | 1.1 | 4.2 | 16.8 | 9.4 |
| 10,072 | SHAPEIT3 | No | Yes | 1.1 | 3.3 | 13.2 | 9.4 |
| 10,072 | SHAPEIT3 | Yes | Yes | 1.3 | 2.5 | 10.0 | 9.4 |
| 152,112 | SHAPEIT3 | Yes | Yes | 0.4 | 38.5 | 154 | 142 |
Each row shows the performance on a subset of the full dataset. The Clustering column indicates whether the new method for choosing copying states was used. The New MCMC column indicates whether the new MCMC routine, which uses completely parallel updates and local algorithm termination, was used. Performance is measured as the median switch error on the trio children. Run time is given in hours. The Run time scaling column shows the run time relative to the SHAPEIT3 run on a sample size of 1,072; the Sample size scaling column shows the sample size relative to that same run. All runs used 10 threads.
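The two scaling columns can be reproduced from the sample sizes and run times in the table. The sketch below does that arithmetic in Python. The N log N reference ratio is not in the table; it is added here only as a yardstick, and the function name `scaling` is hypothetical. Note that the baseline run did not use clustering while the largest run did, so these ratios illustrate the trend and are not a formal complexity measurement.

```python
import math

# Rows transcribed from the table above:
# (sample size, method, clustering used, run time in hours)
runs = [
    (1_072, "SHAPEIT3", False, 0.25),
    (10_072, "SHAPEIT2", False, 4.2),
    (10_072, "SHAPEIT3", False, 3.3),
    (10_072, "SHAPEIT3", True, 2.5),
    (152_112, "SHAPEIT3", True, 38.5),
]

# Baseline: the SHAPEIT3 run on 1,072 samples (first row).
base_n, _, _, base_hours = runs[0]


def scaling(n, hours):
    """Return (run time ratio, sample size ratio, N log N ratio) vs. the baseline.

    The N log N ratio is an illustrative reference added here; it is not
    a column of the original table.
    """
    time_ratio = hours / base_hours
    size_ratio = n / base_n
    nlogn_ratio = (n * math.log(n)) / (base_n * math.log(base_n))
    return time_ratio, size_ratio, nlogn_ratio


if __name__ == "__main__":
    print(f"{'N':>8} {'method':<9} {'clust':<5} {'time x':>7} {'size x':>7} {'NlogN x':>8}")
    for n, method, clustering, hours in runs:
        t, s, nl = scaling(n, hours)
        print(f"{n:>8,} {method:<9} {str(clustering):<5} {t:7.1f} {s:7.1f} {nl:8.1f}")
```

For the largest run, the run time ratio sits between the linear sample size ratio and the N log N reference ratio.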
Figure 1. Performance on UK-BiLEVE chromosome 20 dataset.
Computation time (left) and switch error rate (right) of different phasing routines. Estimated haplotypes are compared to those derived from the IBD1 segments of the 384 likely sibling pairs. HAPI-UR 1X is the average switch error rate/time across three runs of HAPI-UR. HAPI-UR 3X performs majority voting of haplotypes across three runs (computation time is the sum of three runs of HAPI-UR). SHAPEIT3 was run with cluster size M = 4,000, which substantially improves computational complexity, and hence running time, compared to SHAPEIT2. Both SHAPEIT runs use K = 100 conditioning states.
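For readers unfamiliar with the metric, the following is a minimal sketch of one common way to compute a switch error rate for a single individual. It assumes a reference haplotype is available (for example from trio or IBD1 information) and counts how often the estimated phase flips relative to the reference between consecutive heterozygous sites. The function name, argument layout, and toy data are hypothetical, and this is not necessarily the exact procedure used to produce the figure.

```python
def switch_error_rate(true_hap, est_hap, het_sites):
    """Fraction of consecutive heterozygous site pairs with a phase switch.

    true_hap, est_hap: sequences of 0/1 alleles for one haplotype of the
        same individual (reference and estimate).
    het_sites: indices of the individual's heterozygous sites, in order.
    """
    # At each heterozygous site, record whether the estimate agrees with
    # the reference haplotype (True) or carries the other allele (False).
    agrees = [true_hap[i] == est_hap[i] for i in het_sites]
    if len(agrees) < 2:
        return 0.0  # no consecutive pairs, so no switch is possible
    # A switch occurs wherever the agreement status changes between
    # neighbouring heterozygous sites.
    switches = sum(1 for a, b in zip(agrees, agrees[1:]) if a != b)
    return switches / (len(agrees) - 1)


# Toy data (made up for illustration): six heterozygous sites.
true_hap = [0, 1, 0, 1, 1, 0]
est_hap = [0, 1, 1, 0, 0, 0]
rate = switch_error_rate(true_hap, est_hap, range(6))
```

An estimate that is the exact complement of the reference at every heterozygous site has a rate of zero under this definition, because the two haplotypes of an individual are unordered.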