Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words.

Abstract

Mathematical models of the generation of genetic texts appeared simultaneously with the first sequenced DNA. They are used to establish functional and evolutionary relations between genetic texts, to predict the number and distribution of specific sites in a sequence, and to identify "meaningful" words. The present paper deals with two problems. 1) The significance of deviations from the mean statistical characteristics of a genetic text. Anyone who has undertaken a statistical analysis of sequenced DNA is familiar with the question: which deviations from the expected frequencies of occurrence of particular words testify to the "biological" significance of those words? We propose a formula for the variance of the number of occurrences of a word in the text, with allowance for word overlaps, which makes it possible to assess the significance of deviations from the expected statistical characteristics. 2) A new method for predicting the frequencies of occurrence of particular words in a genetic text using the statistical characteristics of "spaced" L-grams. The method can be used for predicting the number of restriction sites in human DNA and in planning experiments on the physical mapping and sequencing of the human genome.
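The abstract does not reproduce the variance formula itself. The sketch below illustrates the general idea under the simplest possible assumption: letters drawn independently with fixed frequencies. It is the standard textbook result for that model and is not necessarily the formula proposed in the paper. All function names (`word_probability`, `occurrence_moments`, `brute_force_moments`, `z_score`) are hypothetical, chosen here for illustration. The overlap correction enters through pairs of word positions closer than the word length: such a pair can both match only if the word overlaps itself at that shift.

```python
from itertools import product
from math import prod, sqrt


def word_probability(word, freqs):
    """Probability of `word` at a fixed position, assuming independent letters."""
    return prod(freqs[letter] for letter in word)


def occurrence_moments(word, n, freqs):
    """Mean and variance of the number of (possibly overlapping) occurrences
    of `word` in a random text of length n with independent letters."""
    m = len(word)
    positions = n - m + 1
    if positions <= 0:
        return 0.0, 0.0
    p = word_probability(word, freqs)
    mean = positions * p
    # Variance of a sum of indicators: individual variances first.
    var = positions * p * (1.0 - p)
    # Then covariances of start positions at distance d < m (they share letters).
    for d in range(1, m):
        pairs = positions - d
        if pairs <= 0:
            break
        if word[d:] == word[:m - d]:
            # The word overlaps itself at shift d: both copies fit into
            # the word followed by its last d letters.
            joint = p * word_probability(word[m - d:], freqs)
        else:
            joint = 0.0
        var += 2.0 * pairs * (joint - p * p)
    return mean, var


def brute_force_moments(word, n, freqs):
    """Same moments by enumerating every text of length n (small n only)."""
    m = len(word)
    mean = second = 0.0
    for letters in product(sorted(freqs), repeat=n):
        text = "".join(letters)
        weight = word_probability(text, freqs)
        count = sum(text[i:i + m] == word for i in range(n - m + 1))
        mean += weight * count
        second += weight * count * count
    return mean, second - mean * mean


def z_score(observed, word, n, freqs):
    """Standardized deviation of an observed count (requires variance > 0)."""
    mean, var = occurrence_moments(word, n, freqs)
    return (observed - mean) / sqrt(var)
```

For example, with uniform frequencies of 0.25 per nucleotide, `occurrence_moments("AA", 3, freqs)` gives a mean of 0.125 and a variance of 0.140625. Ignoring the self-overlap of "AA" would give a smaller variance and so overstate the significance of an observed excess. Real genomic sequences are usually better described by Markov models, for which the corresponding formulas are more involved.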

Substances: Nucleotides

Year: 1989
PMID: 2531596
DOI: 10.1080/07391102.1989.10506528
Source DB: PubMed
Journal: J Biomol Struct Dyn
ISSN: 0739-1102

Cited

20 in total

1. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.

Authors: J Besemer; A Lomsadze; M Borodovsky
Journal: Nucleic Acids Res; Date: 2001-06-15; Impact factor: 16.971

10. Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data.

Authors: Leslie Regad; Juliette Martin; Gregory Nuel; Anne-Claude Camproux
Journal: Algorithms Mol Biol; Date: 2010-01-26; Impact factor: 1.405

1. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions.

2. SWORDS: a statistical tool for analysing large DNA sequences. (Review)

3. Statistical analysis of nucleotide sequences.

4. WORDUP: an efficient algorithm for discovering statistically significant patterns in DNA sequences.

5. Over- and underrepresentation of short DNA words in herpesvirus genomes.

6. Characteristic enrichment of DNA repeats in different genomes.

7. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

8. Statistical analysis of yeast genomic downstream sequences reveals putative polyadenylation signals.

9. Multi-alphabet consensus algorithm for identification of low specificity protein-DNA interactions.

10. Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data.