
Word organization in coding DNA: a mathematical model.

Indranil Mukhopadhyay, Anup Som, Satyabrata Sahoo.

Abstract

This article deals with the relationship between vocabulary (the total number of distinct oligomers, or "words") and text-length (the total number of oligomers, or "words") for a coding DNA sequence (CDS). For natural human languages, Heaps established a mathematical formula, known as Heaps' law, which relates vocabulary to text-length. Our analysis shows that Heaps' law fails to model this relationship for CDSs. Here we develop a mathematical model of the relationship between the number of types of words (vocabulary) and the number of words sampled (text-length) for CDSs, when non-overlapping nucleotide strings of the same length are treated as words. We use the hyperbolic tangent function, which captures the saturation property of the vocabulary. Based on the parameters of the model, we formulate a mathematical equation, known as the "equation of word organization", whose parameters indicate that the nucleotide organization differs from one coding sequence to another. We also compare the word organization of CDSs with a random word distribution and conclude that a CDS resembles neither a natural human language nor a random sequence. Moreover, each of these sequences has its own unique nucleotide organization, which is completely structured for a specific biological function.
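The quantities named in the abstract can be illustrated with a minimal Python sketch. It counts vocabulary against text-length for non-overlapping words of fixed length, and sets Heaps' law (V = K * n^beta, which grows without bound) beside a saturating hyperbolic tangent curve. The abstract does not state the model equation, so the form `v_max * tanh(n / n0)` and the names `vocabulary_growth`, `heaps`, `tanh_saturation`, `v_max` and `n0` are assumptions made for illustration only, not the authors' "equation of word organization".

```python
import math


def vocabulary_growth(seq, k=3):
    """Return (text_length, vocabulary) pairs for non-overlapping k-mers.

    text_length is the number of words read so far; vocabulary is the
    number of distinct words among them. Trailing bases that do not
    fill a whole word are ignored.
    """
    seen = set()
    curve = []
    for i in range(len(seq) // k):
        seen.add(seq[i * k:(i + 1) * k])
        curve.append((i + 1, len(seen)))
    return curve


def heaps(n, K, beta):
    """Heaps' law: vocabulary grows as a power of text-length."""
    return K * n ** beta


def tanh_saturation(n, v_max, n0):
    """Assumed saturating form (illustrative, not the paper's equation).

    The value approaches v_max as n grows, mirroring the fact that at
    most 4**k distinct words of length k exist.
    """
    return v_max * math.tanh(n / n0)


# Words read: ATG, GCC, ATG, AAA, TAG
print(vocabulary_growth("ATGGCCATGAAATAG", k=3))
# [(1, 1), (2, 2), (3, 2), (4, 3), (5, 4)]
```

For words of length 3 the vocabulary cannot exceed 4**3 = 64, so any power law with a positive exponent must eventually overshoot it, whereas the hyperbolic tangent curve levels off at `v_max`.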

Year:  2006        PMID: 17046370     DOI: 10.1016/j.thbio.2006.03.002

Source DB:  PubMed          Journal:  Theory Biosci        ISSN: 1431-7613            Impact factor:   1.919


  2 in total

1.  Distinguishing Functional DNA Words; A Method for Measuring Clustering Levels.

Authors:  Hanieh Moghaddasi; Khosrow Khalifeh; Amir Hossein Darooneh
Journal:  Sci Rep       Date:  2017-01-27       Impact factor: 4.379

2.  Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences.

Authors:  Derek Gatherer
Journal:  Bioinform Biol Insights       Date:  2009-11-24
