The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability.

J F Gentleman, R C Mullin.

Abstract

DNA's genetic code can be represented as an alphabetic sequence composed of the four letters A, C, G, and T, which represent the four types of nucleotides--adenylic, cytidylic, guanylic, and thymidylic acid--of which DNA is composed. Now that these sequences have been identified for many genes and are available in computer-readable form, scientists can analyze these data and search for patterns in an attempt to learn more about the regulatory functions of the gene. One area of study is that of the frequency of occurrence of specific nucleotide subsequences (e.g., ACAC) within part or all of a nucleotide sequence. This paper derives the probability distribution of the frequency of occurrence of a subsequence within a nucleotide sequence, under the hypothesis that the four nucleotides occur at random and with equal probability. This distribution is nontrivial because different subsequences have different "overlap capability." For example, the subsequence AAAA can occur up to 17 times in a sequence of length 20 (which would happen if the sequence were composed solely of A's), but the subsequence ACGT cannot occur more than 5 times in a sequence of length 20. Thus, the frequency distributions are different for each type of overlap capability. It is of interest to assess and compare the degree of nonrandomness for different subsequences or among different portions of a sequence; the existence and degree of nonrandomness may be related to the type and degree of functionality of a nucleotide (sub)sequence. The frequency distributions provided here can be used to perform exact significance tests of the hypothesis of randomness. An approximate test is also described for use with long sequences; this can be used to test a more general null hypothesis of nucleotides occurring with unequal probabilities.
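The "overlap capability" idea above can be illustrated with a short sketch. This is not the derivation used in the paper: it is a generic dynamic program over a pattern-matching automaton, written under the abstract's null hypothesis of independent, equiprobable nucleotides. All function names (`self_overlap_shifts`, `max_occurrences`, `next_state`, `occurrence_distribution`) are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from fractions import Fraction


def self_overlap_shifts(word):
    """Shifts s (0 < s < len(word)) at which a copy of word can overlap itself."""
    return [s for s in range(1, len(word)) if word[s:] == word[:-s]]


def max_occurrences(word, n):
    """Largest possible overlapping count of word in any sequence of length n."""
    if n < len(word):
        return 0
    # Consecutive occurrences must start at least `gap` positions apart.
    gap = min(self_overlap_shifts(word), default=len(word))
    return (n - len(word)) // gap + 1


def next_state(word, state, letter):
    """Length of the longest prefix of word that ends the text after adding letter."""
    text = word[:state] + letter
    for k in range(min(len(text), len(word)), 0, -1):
        if text.endswith(word[:k]):
            return k
    return 0


def occurrence_distribution(word, n, alphabet="ACGT"):
    """Exact P(count = k) for i.i.d. equiprobable letters, as {k: Fraction}."""
    p = Fraction(1, len(alphabet))
    # Key: (matched prefix length, occurrences so far) -> probability mass.
    probs = {(0, 0): Fraction(1)}
    for _ in range(n):
        updated = defaultdict(Fraction)
        for (state, count), mass in probs.items():
            for letter in alphabet:
                new_state = next_state(word, state, letter)
                hit = 1 if new_state == len(word) else 0
                updated[(new_state, count + hit)] += mass * p
        probs = updated
    # Marginalize over the automaton state to get the count distribution.
    dist = defaultdict(Fraction)
    for (_, count), mass in probs.items():
        dist[count] += mass
    return dict(sorted(dist.items()))


if __name__ == "__main__":
    for w in ("AAAA", "ACAC", "ACGT"):
        dist = occurrence_distribution(w, 20)
        print(w, "max count:", max_occurrences(w, 20),
              "P(count = 0):", float(dist[0]))
```

By linearity of expectation, every subsequence of length 4 has the same expected count in a sequence of length 20, namely 17/256. The distributions still differ because a self-overlapping word such as AAAA tends to occur in clumps, which is why a separate distribution is needed for each type of overlap capability.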

Year:  1989        PMID: 2720059

Source DB:  PubMed          Journal:  Biometrics        ISSN: 0006-341X            Impact factor:   2.571


  7 in total

1.  JC virus quasispecies analysis reveals a complex viral population underlying progressive multifocal leukoencephalopathy and supports viral dissemination via the hematogenous route.

Authors:  Tom Van Loy; Kim Thys; Caroline Ryschkewitsch; Ole Lagatie; Maria C Monaco; Eugene O Major; Luc Tritsmans; Lieven J Stuyver
Journal:  J Virol       Date:  2014-11-12       Impact factor: 5.103

2.  Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. (Review)

Authors:  Oliver Bonham-Carter; Joe Steele; Dhundy Bastola
Journal:  Brief Bioinform       Date:  2013-07-31       Impact factor: 11.622

3.  A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses.

Authors:  Shaokun An; Jie Ren; Fengzhu Sun; Lin Wan
Journal:  J Comput Biol       Date:  2022-04-22       Impact factor: 1.549

4.  Compound poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands.

Authors:  Utz J Pape; Sven Rahmann; Fengzhu Sun; Martin Vingron
Journal:  J Comput Biol       Date:  2008 Jul-Aug       Impact factor: 1.479

5.  Tandemly repeated pentanucleotides in DNA sequences of eucaryotes.

Authors:  B Borstnik; D Pumpernik; D Lukman; D Ugarković; M Plohl
Journal:  Nucleic Acids Res       Date:  1994-08-25       Impact factor: 16.971

6.  Pattern-based phylogenetic distance estimation and tree reconstruction.

Authors:  Michael Höhl; Isidore Rigoutsos; Mark A Ragan
Journal:  Evol Bioinform Online       Date:  2007-02-25       Impact factor: 1.625

7.  An improved string composition method for sequence comparison.

Authors:  Guoqing Lu; Shunpu Zhang; Xiang Fang
Journal:  BMC Bioinformatics       Date:  2008-05-28       Impact factor: 3.169
