The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability.

Abstract

DNA's genetic code can be represented as an alphabetic sequence composed of the four letters A, C, G, and T, which represent the four types of nucleotides--adenylic, cytidylic, guanylic, and thymidylic acid--of which DNA is composed. Now that these sequences have been identified for many genes and are available in computer-readable form, scientists can analyze these data and search for patterns in an attempt to learn more about the regulatory functions of the gene. One area of study is that of the frequency of occurrence of specific nucleotide subsequences (e.g., ACAC) within part or all of a nucleotide sequence. This paper derives the probability distribution of the frequency of occurrence of a subsequence within a nucleotide sequence, under the hypothesis that the four nucleotides occur at random and with equal probability. This distribution is nontrivial because different subsequences have different "overlap capability." For example, the subsequence AAAA can occur up to 17 times in a sequence of length 20 (which would happen if the sequence were composed solely of A's), but the subsequence ACGT cannot occur more than 5 times in a sequence of length 20. Thus, the frequency distributions are different for each type of overlap capability. It is of interest to assess and compare the degree of nonrandomness for different subsequences or among different portions of a sequence; the existence and degree of nonrandomness may be related to the type and degree of functionality of a nucleotide (sub)sequence. The frequency distributions provided here can be used to perform exact significance tests of the hypothesis of randomness. An approximate test is also described for use with long sequences; this can be used to test a more general null hypothesis of nucleotides occurring with unequal probabilities.
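The overlap argument in the abstract can be checked numerically. The following is a minimal sketch, not the paper's own derivation: it computes the exact distribution by dynamic programming over a pattern-matching automaton, assuming independent nucleotides with equal probability 1/4. All function names (`count_overlapping`, `self_overlap_shifts`, `max_occurrences`, `occurrence_distribution`) are hypothetical and chosen for illustration.

```python
from fractions import Fraction


def count_overlapping(seq, word):
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    m = len(word)
    return sum(1 for i in range(len(seq) - m + 1) if seq[i:i + m] == word)


def self_overlap_shifts(word):
    """Shifts p (0 < p < len(word)) at which `word` overlaps a copy of itself."""
    m = len(word)
    return [p for p in range(1, m) if word[p:] == word[:m - p]]


def max_occurrences(word, n):
    """Largest possible overlapping count of `word` in a sequence of length n."""
    m = len(word)
    if n < m:
        return 0
    shifts = self_overlap_shifts(word)
    # Consecutive occurrences start at least `step` positions apart.
    step = shifts[0] if shifts else m
    return (n - m) // step + 1


def occurrence_distribution(word, n, alphabet="ACGT"):
    """Exact P(count = k) for a random sequence of length n.

    Assumption: letters are independent and equally likely.
    State = length of the longest prefix of `word` ending at the current position.
    """
    m = len(word)

    def next_state(state, ch):
        s = word[:state] + ch
        for k in range(min(m, len(s)), -1, -1):
            if s.endswith(word[:k]):
                return k

    trans = {(q, ch): next_state(q, ch) for q in range(m + 1) for ch in alphabet}
    p = Fraction(1, len(alphabet))
    current = {(0, 0): Fraction(1)}  # (automaton state, occurrences so far)
    for _ in range(n):
        nxt = {}
        for (state, count), prob in current.items():
            for ch in alphabet:
                q = trans[(state, ch)]
                key = (q, count + (q == m))  # reaching state m completes a match
                nxt[key] = nxt.get(key, 0) + prob * p
        current = nxt
    dist = {}
    for (_, count), prob in current.items():
        dist[count] = dist.get(count, 0) + prob
    return dict(sorted(dist.items()))


if __name__ == "__main__":
    print(max_occurrences("AAAA", 20))  # 17
    print(max_occurrences("ACGT", 20))  # 5
    for word in ("AAAA", "ACGT"):
        dist = occurrence_distribution(word, 20)
        print(word, {k: float(v) for k, v in dist.items()})
```

Under these assumptions both words have the same expected count in a sequence of length 20, namely 17/256, because each of the 17 start positions matches with probability 1/256. The distributions still differ: AAAA spreads probability over counts 0 to 17, while ACGT is confined to 0 to 5.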

Substances: Oligonucleotides

Year: 1989 PMID: 2720059

Source DB: PubMed Journal: Biometrics ISSN: 0006-341X Impact factor: 2.571

Cited by (7 in total)

1. JC virus quasispecies analysis reveals a complex viral population underlying progressive multifocal leukoencephalopathy and supports viral dissemination via the hematogenous route.

Authors: Tom Van Loy; Kim Thys; Caroline Ryschkewitsch; Ole Lagatie; Maria C Monaco; Eugene O Major; Luc Tritsmans; Lieven J Stuyver
Journal: J Virol Date: 2014-11-12 Impact factor: 5.103

2. Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. (Review)

Authors: Oliver Bonham-Carter; Joe Steele; Dhundy Bastola
Journal: Brief Bioinform Date: 2013-07-31 Impact factor: 11.622

3. A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses.

Authors: Shaokun An; Jie Ren; Fengzhu Sun; Lin Wan
Journal: J Comput Biol Date: 2022-04-22 Impact factor: 1.549

4. Compound poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands.

5. Tandemly repeated pentanucleotides in DNA sequences of eucaryotes.

6. Pattern-based phylogenetic distance estimation and tree reconstruction.

7. An improved string composition method for sequence comparison.