Eunjung Han, Janet S Sinsheimer, John Novembre. Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Department of Human Genetics and Biomathematics, University of California, Los Angeles, Los Angeles, CA 90095, USA; and Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA.
Abstract
MOTIVATION: The distribution of allele frequencies across polymorphic sites, also known as the site frequency spectrum (SFS), is of primary interest in population genetics. It is a complete summary of sequence variation at unlinked sites and, more generally, its shape reflects underlying population genetic processes. One practical challenge is that inferring the SFS from low coverage sequencing data in a straightforward manner, by using genotype calls, can lead to significant bias. To reduce this bias, previous studies have used a statistical method that estimates the SFS directly from sequencing data by first computing the site allele frequency (SAF) likelihood for each site (i.e. the likelihood that a site has each possible allele frequency, conditional on the observed sequence reads) using a dynamic programming (DP) algorithm. Although this method produces an accurate SFS, the cost of computing the SAF likelihood is quadratic in the number of samples sequenced.
RESULTS: To overcome this computational challenge, we propose the 'score-limited DP' algorithm, which computes the SAF likelihood in time linear in the number of genomes. The algorithm works because, in the lower triangular matrix that arises in the DP algorithm, all non-negligible values of the SAF likelihood are concentrated in a few cells around the best-guess allele counts. We show that our score-limited DP algorithm has accuracy comparable to that of the original DP algorithm but is faster. This speed improvement makes SFS estimation practical when using low coverage next-generation sequencing (NGS) data from a large number of individuals.
AVAILABILITY AND IMPLEMENTATION: The program will be available via a link from the Novembre lab website (http://jnpopgen.org/).
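The contrast between the quadratic DP and a banded variant can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not spell out the recursion or the score-based rule that limits the band, so the code below assumes the commonly described recursion over diploid individuals with three genotype likelihoods each, a binomial normalization, and a fixed-width band centred on a running best-guess allele count. The names `saf_full`, `saf_banded` and `width` are hypothetical.

```python
from math import comb


def saf_full(gl):
    """Full DP over all allele counts (quadratic in the number of individuals).

    gl: list of (g0, g1, g2) genotype likelihoods, one tuple per diploid
    individual, for carrying 0, 1 or 2 copies of the derived allele.
    Assumed recursion, not taken from the abstract.
    """
    n = len(gl)
    h = [1.0]  # with zero individuals, only allele count 0 is possible
    for g0, g1, g2 in gl:
        new = [0.0] * (len(h) + 2)
        for k, v in enumerate(h):
            new[k] += g0 * v
            new[k + 1] += 2.0 * g1 * v  # two ways to be heterozygous
            new[k + 2] += g2 * v
        h = new
    return [h[k] / comb(2 * n, k) for k in range(2 * n + 1)]


def saf_banded(gl, width=4):
    """Banded DP: keep only cells within `width` of a running best guess.

    The fixed-width band is a simplification used for illustration; cells
    outside the band are treated as zero.
    """
    n = len(gl)
    h = {0: 1.0}
    center = 0  # best-guess allele count so far
    for g in gl:
        # advance the best guess by the most likely genotype's allele count
        center += max(range(3), key=lambda a: g[a])
        lo, hi = center - width, center + width
        new = {}
        for k, v in h.items():
            for a, w in ((0, g[0]), (1, 2.0 * g[1]), (2, g[2])):
                if lo <= k + a <= hi:
                    new[k + a] = new.get(k + a, 0.0) + w * v
        h = new
    return [h.get(k, 0.0) / comb(2 * n, k) for k in range(2 * n + 1)]
```

Each individual touches at most `2 * width + 1` cells in the banded version, so the DP work grows linearly with the number of individuals for a fixed band, whereas the full version touches every cell of the triangular matrix. Contributions from paths that leave the band are dropped, which is only harmless when the likelihood mass is concentrated near the best-guess counts, as the abstract describes.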