Systematic analysis of coding and noncoding DNA sequences using methods of statistical linguistics.

R N Mantegna, S V Buldyrev, A L Goldberger, S Havlin, C K Peng, M Simons, H E Stanley.

Abstract

We compare the statistical properties of coding and noncoding regions in eukaryotic and viral DNA sequences by adapting two tests developed for the analysis of natural languages and symbolic sequences. The data set comprises all 30 sequences of length above 50 000 base pairs in GenBank Release No. 81.0, as well as the recently published sequences of C. elegans chromosome III (2.2 Mbp) and yeast chromosome XI (661 Kbp). We find that for the three chromosomes we studied the statistical properties of noncoding regions appear to be closer to those observed in natural languages than those of coding regions. In particular, (i) an n-tuple Zipf analysis of noncoding regions reveals a regime close to power-law behavior, whereas the coding regions show logarithmic behavior over a wide interval, and (ii) an n-gram entropy measurement shows that the noncoding regions have a lower n-gram entropy (and hence a larger "n-gram redundancy") than the coding regions. In contrast to the three chromosomes, we find that for vertebrates such as primates and rodents and for viral DNA, the difference between the statistical properties of coding and noncoding regions is not pronounced, and therefore the results of the analyses of the investigated sequences are less conclusive. After noting the intrinsic limitations of the n-gram redundancy analysis, we also briefly discuss the failure of the zeroth- and first-order Markovian models or simple nucleotide repeats to account fully for these "linguistic" features of DNA. Finally, we emphasize that our results by no means prove the existence of a "language" in noncoding DNA.
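
The two tests named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their definitions, so the overlapping sliding window, the base-2 logarithm, and the redundancy formula R(n) = 1 - H(n) / (n log2 4) are assumptions, and the function names `ngram_counts`, `zipf_table`, `ngram_entropy`, and `ngram_redundancy` are hypothetical.

```python
from collections import Counter
from math import log2


def ngram_counts(seq, n):
    """Count overlapping n-tuples with a sliding window of width n (assumed)."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def zipf_table(seq, n):
    """Return (rank, n-tuple, count) rows, most frequent n-tuple first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    ranked = sorted(ngram_counts(seq, n).items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


def ngram_entropy(seq, n):
    """Shannon entropy, in bits, of the observed n-tuple frequencies."""
    counts = ngram_counts(seq, n)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def ngram_redundancy(seq, n, alphabet_size=4):
    """Assumed definition: 1 - H(n) / (n * log2(alphabet_size)).

    n * log2(alphabet_size) is the entropy of a uniformly random sequence,
    so 0 means no redundancy and 1 means a fully predictable sequence.
    """
    return 1.0 - ngram_entropy(seq, n) / (n * log2(alphabet_size))
```

Plotting count against rank from `zipf_table` on log-log axes would show a power-law regime as a straight line. On real data the entropy estimate is biased downward once 4^n approaches the sequence length, which is one form of the finite-size limitation the abstract alludes to.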

Keywords:  NASA Discipline Cardiopulmonary; NASA Discipline Number 14-10; Non-NASA Center

Year:  1995        PMID: 9963739     DOI: 10.1103/physreve.52.2939

Source DB:  PubMed          Journal:  Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics        ISSN: 1063-651X


  16 in total

1.  Discriminating self from nonself with short peptides from large proteomes.

Authors:  Nigel J Burroughs; Rob J de Boer; Can Keşmir
Journal:  Immunogenetics       Date:  2004-07-30       Impact factor: 2.846

2.  Formation and positioning of nucleosomes: effect of sequence-dependent long-range correlated structural disorder.

Authors:  C Vaillant; B Audit; C Thermes; A Arnéodo
Journal:  Eur Phys J E Soft Matter       Date:  2006-02-14       Impact factor: 1.890

3.  Informatics challenges in structured RNA. (Review)

Authors:  Alain Laederach
Journal:  Brief Bioinform       Date:  2007-07-04       Impact factor: 11.622

4.  The Shannon information entropy of protein sequences.

Authors:  B J Strait; T G Dewey
Journal:  Biophys J       Date:  1996-07       Impact factor: 4.033

5.  Quantification of DNA patchiness using long-range correlation measures.

Authors:  G M Viswanathan; S V Buldyrev; S Havlin; H E Stanley
Journal:  Biophys J       Date:  1997-02       Impact factor: 4.033

6.  Wavelet Analysis of DNA Bending Profiles reveals Structural Constraints on the Evolution of Genomic Sequences.

Authors:  Benjamin Audit; Cédric Vaillant; Alain Arnéodo; Yves d'Aubenton-Carafa; Claude Thermes
Journal:  J Biol Phys       Date:  2004-03       Impact factor: 1.365

7.  An ensemble approach to the evolution of complex systems. (Review)

Authors:  Göker Arpağ; Ayşe Erzan
Journal:  J Biosci       Date:  2014-04       Impact factor: 1.826

8.  Analyzing similarities in genome sequences.

Authors:  I C Fonseca; E Nogueira; P H Figueirêdo; S Coutinho
Journal:  Eur Phys J E Soft Matter       Date:  2018-01-19       Impact factor: 1.890

9.  Languages cool as they expand: allometric scaling and the decreasing need for new words.

Authors:  Alexander M Petersen; Joel N Tenenbaum; Shlomo Havlin; H Eugene Stanley; Matjaž Perc
Journal:  Sci Rep       Date:  2012-12-10       Impact factor: 4.379

10.  Google matrix analysis of DNA sequences.

Authors:  Vivek Kandiah; Dima L Shepelyansky
Journal:  PLoS One       Date:  2013-05-09       Impact factor: 3.240
