Normal and compound Poisson approximations for pattern occurrences in NGS reads.

Zhiyuan Zhai, Gesine Reinert, Kai Song, Michael S Waterman, Yihui Luan, Fengzhu Sun.

Abstract

Next generation sequencing (NGS) technologies are now widely used in many biological studies. In NGS, sequence reads are randomly sampled from the genome sequence of interest. Most computational approaches for NGS data first map the reads to the genome and then analyze the data based on the mapped reads. Since many organisms have unknown genome sequences and many reads cannot be uniquely mapped to the genomes even if the genome sequences are known, alternative analytical methods are needed for the study of NGS data. Here we suggest using word patterns to analyze NGS data. Word pattern counting (the study of the probabilistic distribution of the number of occurrences of word patterns in one or multiple long sequences) has played an important role in molecular sequence analysis. However, no studies are available on the distribution of the number of occurrences of word patterns in NGS reads. In this article, we build probabilistic models for the background sequence and the sampling process of the sequence reads from the genome. Based on the models, we provide normal and compound Poisson approximations for the number of occurrences of word patterns from the sequence reads, with bounds on the approximation error. The main challenge is to consider the randomness in generating the long background sequence, as well as in the sampling of the reads using NGS. We show the accuracy of these approximations under a variety of conditions for different patterns with various characteristics. Under realistic assumptions, the compound Poisson approximation seems to outperform the normal approximation in most situations. These approximate distributions can be used to evaluate the statistical significance of the occurrence of patterns from NGS data. The theory and the computational algorithm for calculating the approximate distributions are then used to analyze ChIP-Seq data using transcription factor GABP. 
Software is available online (www-rcf.usc.edu/~fsun/Programs/NGS_motif_power/NGS_motif_power.html). In addition, Supplementary Material can be found online (www.liebertonline.com/cmb).
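
The setting described in the abstract (a random background sequence, reads sampled at random from it, and a count of word occurrences across the reads) can be illustrated with a small simulation. This is a minimal sketch, not the authors' software: the i.i.d. letter model, uniformly chosen read start positions, fixed read length, and all function names are assumptions made for illustration. It does not compute the normal or compound Poisson approximations themselves; it only compares the simulated mean count with the expectation M(β − k + 1)P(w), which holds under these assumptions for M reads of length β and a word w of length k.

```python
import random
from statistics import mean


def random_genome(length, probs, rng):
    """Draw an i.i.d. background sequence; probs maps letter -> probability."""
    letters = list(probs)
    weights = [probs[c] for c in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))


def sample_reads(genome, n_reads, read_len, rng):
    """Sample reads whose start positions are uniform over valid positions."""
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def count_word(reads, word):
    """Count all occurrences of word across reads, overlapping ones included."""
    k = len(word)
    return sum(
        1
        for read in reads
        for i in range(len(read) - k + 1)
        if read[i:i + k] == word
    )


def expected_count(word, probs, n_reads, read_len):
    """Expected count M * (beta - k + 1) * P(word) under the i.i.d. model."""
    p_word = 1.0
    for c in word:
        p_word *= probs[c]
    return n_reads * (read_len - len(word) + 1) * p_word


def simulate_counts(word, probs, genome_len, n_reads, read_len, n_trials, seed=0):
    """Redraw both the genome and the reads in every trial, then count the word."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_trials):
        genome = random_genome(genome_len, probs, rng)
        reads = sample_reads(genome, n_reads, read_len, rng)
        counts.append(count_word(reads, word))
    return counts


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    counts = simulate_counts("ACGT", uniform, genome_len=2000,
                             n_reads=100, read_len=30, n_trials=200)
    print("simulated mean:", mean(counts))
    print("expected count:", expected_count("ACGT", uniform, 100, 30))
```

Redrawing the genome in every trial reflects the point made in the abstract that both the background sequence and the read sampling are random. The spread of the simulated counts could then be compared with a candidate approximate distribution.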

Year:  2012        PMID: 22697250      PMCID: PMC3375642          DOI: 10.1089/cmb.2012.0029

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.479


  27 in total

1.  Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions.

Authors:  Gregory E Sims; Se-Ran Jun; Guohong A Wu; Sung-Hou Kim
Journal:  Proc Natl Acad Sci U S A       Date:  2009-02-02       Impact factor: 11.205

2.  Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method.

Authors:  Guohong Albert Wu; Se-Ran Jun; Gregory E Sims; Sung-Hou Kim
Journal:  Proc Natl Acad Sci U S A       Date:  2009-06-24       Impact factor: 11.205

Review 3.  Next-generation DNA sequencing methods.

Authors:  Elaine R Mardis
Journal:  Annu Rev Genomics Hum Genet       Date:  2008       Impact factor: 8.929

4.  Compound Poisson and Poisson process approximations for occurrences of multiple words in Markov chains.

Authors:  G Reinert; S Schbath
Journal:  J Comput Biol       Date:  1998       Impact factor: 1.479

5.  Exact computation of pattern probabilities in random sequences generated by Markov chains.

Authors:  J Kleffe; U Langbecker
Journal:  Comput Appl Biosci       Date:  1990-10

6.  Assessment of compositional heterogeneity within and between eukaryotic genomes.

Authors:  A Nekrutenko; W H Li
Journal:  Genome Res       Date:  2000-12       Impact factor: 9.043

7.  Biases in Illumina transcriptome sequencing caused by random hexamer priming.

Authors:  Kasper D Hansen; Steven E Brenner; Sandrine Dudoit
Journal:  Nucleic Acids Res       Date:  2010-04-14       Impact factor: 16.971

8.  Compound Poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands.

Authors:  Utz J Pape; Sven Rahmann; Fengzhu Sun; Martin Vingron
Journal:  J Comput Biol       Date:  2008 Jul-Aug       Impact factor: 1.479

9.  Detection and characterization of horizontal transfers in prokaryotes using genomic signature.

Authors:  Christine Dufraigne; Bernard Fertil; Sylvain Lespinats; Alain Giron; Patrick Deschavanne
Journal:  Nucleic Acids Res       Date:  2005-01-13       Impact factor: 16.971

10.  Modeling ChIP sequencing in silico with applications.

Authors:  Zhengdong D Zhang; Joel Rozowsky; Michael Snyder; Joseph Chang; Mark Gerstein
Journal:  PLoS Comput Biol       Date:  2008-08-22       Impact factor: 4.475

  3 in total

Review 1.  New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing.

Authors:  Kai Song; Jie Ren; Gesine Reinert; Minghua Deng; Michael S Waterman; Fengzhu Sun
Journal:  Brief Bioinform       Date:  2013-09-23       Impact factor: 11.622

2.  Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

Authors:  Jie Ren; Kai Song; Minghua Deng; Gesine Reinert; Charles H Cannon; Fengzhu Sun
Journal:  Bioinformatics       Date:  2015-06-30       Impact factor: 6.937

3.  Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data.

Authors:  Lin Wan; Xin Kang; Jie Ren; Fengzhu Sun
Journal:  Quant Biol       Date:  2020-05-25