Significantly lower entropy estimates for natural DNA sequences.

Abstract

If DNA were a random string over its alphabet {A, C, G, T}, an optimal code would assign two bits to each nucleotide. DNA may be imagined to be a highly ordered, purposeful molecule, and one might therefore reasonably expect statistical models of its string representation to produce much lower entropy estimates. Surprisingly, this has not been the case for many natural DNA sequences, including portions of the human genome. We introduce a new statistical model (compression algorithm), the strongest reported to date, for naturally occurring DNA sequences. Conventional techniques code a nucleotide using only slightly fewer bits (1.90) than one obtains by relying only on the frequency statistics of individual nucleotides (1.95). In some cases our method widens this gap more than fivefold (to 1.66 bits) and may lead to better performance in microbiological pattern recognition applications. One of our main contributions, and the principal source of these improvements, is the formal inclusion of inexact match information in the model. The existence of matches at various distances forms a panel of experts, which are then combined into a single prediction. The structure of this combination is novel, and its parameters are learned using Expectation Maximization (EM). Experiments are reported using a wide variety of DNA sequences and compared whenever possible with earlier work. Four reasonable notions for the string distance function used to identify near matches are implemented and experimentally compared. We also report lower entropy estimates for coding regions extracted from a large collection of nonredundant human genes. The conventional estimate is 1.92 bits. Our model produces only slightly better results (1.91 bits) when considering nucleotides, but achieves 1.84-1.87 bits when the prediction problem is divided into two stages: (i) predict the next amino acid based on inexact polypeptide matches, and (ii) predict the particular codon. Our results suggest that matches at the amino acid level play some role, but a small one, in determining the statistical structure of nonredundant coding sequences.
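
The two quantities the abstract compares can be illustrated with a minimal Python sketch. It is not the authors' model: the function names are hypothetical, the experts are assumed to be given as precomputed probabilities, and the combination is an ordinary weighted mixture rather than the novel structure the paper describes. The first function computes the baseline that relies only on single-nucleotide frequencies; the other two show how EM can fit mixture weights for a panel of experts and how the resulting code length in bits per nucleotide is measured.

```python
import math
from collections import Counter


def zeroth_order_entropy(seq):
    """Bits per symbol using only the frequencies of individual symbols."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def bits_per_symbol(probs, weights):
    """Average code length of a weighted mixture of experts.

    probs[t][k] is the probability that (hypothetical) expert k assigned
    to the symbol actually observed at position t; all entries must be > 0.
    """
    total = 0.0
    for row in probs:
        total -= math.log2(sum(w * p for w, p in zip(weights, row)))
    return total / len(probs)


def em_mixture_weights(probs, iterations=50):
    """Fit mixture weights by EM, starting from uniform weights."""
    k = len(probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        sums = [0.0] * k
        for row in probs:
            # E-step: responsibility of each expert for this position.
            mix = sum(w * p for w, p in zip(weights, row))
            for j in range(k):
                sums[j] += weights[j] * row[j] / mix
        # M-step: new weight is the average responsibility.
        weights = [s / len(probs) for s in sums]
    return weights
```

For a sequence in which the four nucleotides are equally frequent, `zeroth_order_entropy` returns the two bits per nucleotide mentioned at the start of the abstract. Any estimate below that value reflects structure the model has captured.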

Entities: Species

Substances: DNA

Year: 1999 PMID: 10223669 DOI: 10.1089/cmb.1999.6.125

Source DB: PubMed Journal: J Comput Biol ISSN: 1066-5277 Impact factor: 1.479

Cited

8 in total

1. Comparison of Real Frequencies of Strings vs. the Expected Ones Reveals the Information Capacity of Macromoleculae.

Authors: Michael G Sadovsky
Journal: J Biol Phys Date: 2003-03 Impact factor: 1.365

2. Toward a Better Compression for DNA Sequences Using Huffman Encoding.

Authors: Anas Al-Okaily; Badar Almarri; Sultan Al Yami; Chun-Hsi Huang
Journal: J Comput Biol Date: 2016-12-13 Impact factor: 1.479

3. Efficient DNA sequence compression with neural networks.

Authors: Milton Silva; Diogo Pratas; Armando J Pinho
Journal: Gigascience Date: 2020-11-11 Impact factor: 6.524

4. On the representability of complete genomes by multiple competing finite-context (Markov) models.

Authors: Armando J Pinho; Paulo J S G Ferreira; António J R Neves; Carlos A C Bastos
Journal: PLoS One Date: 2011-06-30 Impact factor: 3.240

5. Prediction of multi-target networks of neuroprotective compounds with entropy indices and synthesis, assay, and theoretical study of new asymmetric 1,2-rasagiline carbamates.

Authors: Francisco J Romero Durán; Nerea Alonso; Olga Caamaño; Xerardo García-Mera; Matilde Yañez; Francisco J Prado-Prado; Humberto González-Díaz
Journal: Int J Mol Sci Date: 2014-09-24 Impact factor: 5.923

6. An Optimal Seed Based Compression Algorithm for DNA Sequences.

7. Information theory applications for biological sequence analysis. (Review)

8. Comparative analysis of long DNA sequences by per element information content using different contexts.