Predicting discovery rates of genomic features.

Simon Gravel.

Abstract

Successful sequencing experiments require judicious sample selection. However, this selection must often be performed on the basis of limited preliminary data. Predicting the statistical properties of the final sample based on preliminary data can be challenging, because numerous uncertain model assumptions may be involved. Here, we ask whether we can predict "omics" variation across many samples by sequencing only a fraction of them. In the infinite-genome limit, we find that a pilot study sequencing 5% of a population is sufficient to predict the number of genetic variants in the entire population within 6% of the correct value, using an estimator agnostic to demography, selection, or population structure. To reach similar accuracy in a finite genome with millions of polymorphisms, the pilot study would require ∼15% of the population. We present computationally efficient jackknife and linear programming methods that exhibit substantially less bias than the state of the art when applied to simulated data and subsampled 1000 Genomes Project data. Extrapolating based on the National Heart, Lung, and Blood Institute Exome Sequencing Project data, we predict that 7.2% of sites in the capture region would be variable in a sample of 50,000 African Americans and 8.8% in a European sample of equal size. Finally, we show how the linear programming method can also predict discovery rates of various genomic features, such as the number of transcription factor binding sites across different cell types.
Copyright © 2014 by the Genetics Society of America.
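
As a rough illustration of what a discovery rate means in the abstract above, the sketch below computes the expected number of variants seen in a random subsample of an already sequenced sample, given that sample's site-frequency spectrum. This is standard hypergeometric rarefaction, which only interpolates to smaller samples; it is not the jackknife or linear programming extrapolation the abstract describes. The function name `expected_variants`, the layout of `sfs`, and the toy counts are assumptions made for illustration.

```python
from math import comb


def expected_variants(sfs, n_total, n_sub):
    """Expected number of sites where the tracked allele is seen in a subsample.

    sfs[i - 1] is the number of sites at which the tracked (e.g. non-reference)
    allele is carried by exactly i of the n_total sequenced chromosomes.
    n_sub chromosomes are drawn without replacement from those n_total.
    """
    if not 0 <= n_sub <= n_total:
        raise ValueError("n_sub must lie between 0 and n_total")
    if len(sfs) > n_total:
        raise ValueError("sfs has more frequency classes than chromosomes")

    n_ways = comb(n_total, n_sub)  # number of possible subsamples
    total = 0.0
    for i, n_sites in enumerate(sfs, start=1):
        # Probability that none of the i carriers lands in the subsample.
        # comb returns 0 when n_sub exceeds n_total - i, so the allele is
        # then certain to be observed.
        p_missed = comb(n_total - i, n_sub) / n_ways
        total += n_sites * (1.0 - p_missed)
    return total


# Toy spectrum (hypothetical counts): 10 singletons, 5 doubletons and
# 2 tripletons observed among 4 chromosomes.
toy_sfs = [10, 5, 2]
discovery_curve = [expected_variants(toy_sfs, 4, n) for n in range(5)]
```

Evaluating the function over a range of `n_sub` values gives a discovery curve up to the sequenced sample size. Predicting the curve beyond that size is the harder problem the paper addresses, and it needs an extrapolation method rather than this closed-form interpolation.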

Keywords:  capture–recapture; linear programming; population genetics; rare variants; sequencing

Year: 2014; PMID: 24637199; PMCID: PMC4063918; DOI: 10.1534/genetics.114.162149

Source DB: PubMed; Journal: Genetics; ISSN: 0016-6731; Impact factor: 4.562


