
Linguistics of nucleotide sequences. I: The significance of deviations from mean statistical characteristics and prediction of the frequencies of occurrence of words.

P A Pevzner, A A Mironov.

Abstract

Mathematical models of the generation of genetic texts appeared at the same time as the first DNA sequencing. They are used to establish functional and evolutionary relations between genetic texts, to predict the number and distribution of specific sites in a sequence, and to identify "meaningful" words. The present paper deals with two problems: 1) The significance of deviations from the mean statistical characteristics in a genetic text. Anyone who has undertaken the statistical analysis of sequenced DNA is familiar with the question: which deviations from the expected frequencies of occurrence of particular words indicate the "biological" significance of those words? We propose a formula for the variance of the number of occurrences of a word in the text, with allowance for word overlaps, making it possible to assess the significance of deviations from the expected statistical characteristics. 2) A new method for predicting the frequencies of occurrence of particular words in a genetic text using the statistical characteristics of "spaced" L-grams. The method can be used for predicting the number of restriction sites in human DNA and in planning experiments on the physical mapping and sequencing of the human genome.
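
The abstract does not reproduce the variance formula itself. As a rough illustration of why word overlaps matter, the sketch below uses the textbook result for a text of independent, identically distributed letters. That model is an assumption and may be simpler than the one used in the paper. The function names (`count_overlapping`, `expected_count`, `variance_count`, `z_score`) and the uniform base frequencies are hypothetical, chosen only for this example.

```python
from math import prod, sqrt

def count_overlapping(text, word):
    """Count occurrences of word in text, allowing overlaps."""
    m = len(word)
    return sum(1 for i in range(len(text) - m + 1) if text[i:i + m] == word)

def word_prob(word, freqs):
    """Probability of word under an independent-letter model (assumption)."""
    return prod(freqs[c] for c in word)

def expected_count(word, n, freqs):
    """Expected number of occurrences in a random text of length n."""
    return max(n - len(word) + 1, 0) * word_prob(word, freqs)

def variance_count(word, n, freqs):
    """Variance of the occurrence count, including self-overlap terms."""
    m = len(word)
    positions = n - m + 1
    if positions <= 0:
        return 0.0
    p = word_prob(word, freqs)
    var = positions * p * (1 - p)
    for d in range(1, m):
        pairs = positions - d  # number of start-position pairs at shift d
        if pairs <= 0:
            break
        # The word can overlap itself at shift d only if its suffix
        # starting at d equals its prefix of the same length.
        overlaps = word[d:] == word[:m - d]
        joint = p * word_prob(word[m - d:], freqs) if overlaps else 0.0
        var += 2 * pairs * (joint - p * p)
    return var

def z_score(text, word, freqs):
    """Standardized deviation of the observed count from its expectation."""
    n = len(text)
    observed = count_overlapping(text, word)
    return (observed - expected_count(word, n, freqs)) / sqrt(variance_count(word, n, freqs))

# Hypothetical uniform base composition, for illustration only.
uniform = {base: 0.25 for base in "ACGT"}
```

Under this model a self-overlapping word such as `AA` has a larger variance than a non-overlapping word of the same length and probability such as `AC`, so the same raw deviation from the expected count is less significant for `AA`.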

Year:  1989        PMID: 2531596     DOI: 10.1080/07391102.1989.10506528

Source DB:  PubMed          Journal:  J Biomol Struct Dyn        ISSN: 0739-1102

