William Yue (1), Ardalan Naseri (1), Victor Wang (1), Pramesh Shakya (2), Shaojie Zhang (2), Degui Zhi (1).
(1) School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA. (2) Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA.
Abstract
Motivation: As large haplotype panels become increasingly available, efficient string matching algorithms such as the positional Burrows-Wheeler transform (PBWT) are promising for identifying shared haplotypes. However, recent mutations and genotyping errors create occasional mismatches, which pose challenges for exact haplotype matching. Previous solutions are based on probabilistic models or seed-and-extension algorithms that passively tolerate mismatches.
Results: Here, we propose a PBWT-based smoothing algorithm, P-smoother, to actively 'correct' these mismatches and thus 'smooth' the panel. P-smoother runs a bidirectional PBWT-based panel scan that flips mismatching alleles based on the overall haplotype matching context, which we call the IBD (identical-by-descent) prior. In a simulated panel with 4000 haplotypes and a 0.2% error rate, we show that it can reliably correct 85% of errors. As a result, PBWT algorithms running over the smoothed panel can identify more pairwise IBD segments than the same algorithms running over the unsmoothed panel. Most strikingly, a PBWT-cluster algorithm running over the smoothed panel, which we call PS-cluster, achieves state-of-the-art performance for identifying multiway IBD segments, a problem that has challenged the computational community for years. We also show that PS-cluster is adequately efficient for UK Biobank data. Therefore, P-smoother opens up new possibilities for efficient error-tolerating algorithms for biobank-scale haplotype panels.
Availability and implementation: Source code is available at github.com/ZhiGroup/P-smoother.
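To make the idea of PBWT-based smoothing concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes a biallelic panel stored as a list of 0/1 lists. The names `pbwt_step` and `smooth_forward` and the parameters `min_len` and `max_minority` are hypothetical, introduced only for illustration. The sketch is also forward-only: it looks at matches ending just before each site, whereas P-smoother as described above scans bidirectionally, so this toy version would also "correct" genuine divergence points.

```python
def pbwt_step(column, a, d, k):
    """One PBWT update at site k (standard prefix/divergence array recurrence).

    a: haplotype indices sorted by reversed prefix over sites [0, k).
    d: d[i] is the start site of the match between a[i] and a[i-1] ending at k-1.
    Returns the arrays for sites [0, k].
    """
    a0, a1, d0, d1 = [], [], [], []
    p = q = k + 1
    for idx, start in zip(a, d):
        p, q = max(p, start), max(q, start)
        if column[idx] == 0:
            a0.append(idx); d0.append(p); p = 0
        else:
            a1.append(idx); d1.append(q); q = 0
    return a0 + a1, d0 + d1


def smooth_forward(panel, min_len=3, max_minority=0.25):
    """Toy forward-only smoothing (assumption: not the published algorithm).

    At each site, haplotypes that have matched each other for at least
    min_len preceding sites form a block; a rare minority allele inside
    a block is flipped to the block's majority allele.
    """
    n_hap, n_site = len(panel), len(panel[0])
    out = [row[:] for row in panel]          # leave the input untouched
    a, d = list(range(n_hap)), [0] * n_hap
    for k in range(n_site):
        if k >= min_len:
            start = 0
            for i in range(1, n_hap + 1):
                # a block ends where the match with the predecessor is too short
                if i == n_hap or d[i] > k - min_len:
                    block = a[start:i]
                    ones = sum(out[h][k] for h in block)
                    minority = min(ones, len(block) - ones)
                    if 0 < minority <= max_minority * len(block):
                        major = 1 if ones * 2 > len(block) else 0
                        for h in block:
                            out[h][k] = major
                    start = i
        column = [out[h][k] for h in range(n_hap)]
        a, d = pbwt_step(column, a, d, k)    # sort on the smoothed column
    return out


truth = [0, 1, 1, 0, 1, 0, 0, 1]
panel = [truth[:] for _ in range(6)]
panel[3][5] = 1                              # inject one genotyping error
smoothed = smooth_forward(panel)
```

In this example all six haplotypes share a long match before site 5, so the single discordant allele is treated as an error and flipped back. Sorting on the already-smoothed column keeps the corrected haplotype inside its block at later sites.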