
Clustering of DNA words and biological function: a proof of principle.

Michael Hackenberg, Antonio Rueda, Pedro Carpena, Pedro Bernaola-Galván, Guillermo Barturen, José L Oliver.

Abstract

Relevant words in literary texts (key words) are known to be clustered, while common words are randomly distributed. Given the clustered distribution of many functional genome elements, we hypothesize that the biological text par excellence, the DNA sequence, might behave in the same way: k-length words (k-mers) with a clear function may be spatially clustered along the one-dimensional chromosome sequence, while less important, non-functional words may be randomly distributed. To explore this linguistic analogy, we calculate a clustering coefficient for each k-mer (k = 2-9 bp) in human and mouse chromosome sequences and then check whether clustered words are enriched in the functional part of the genome. First, we found a positive general trend relating clustering level to word enrichment within exons and Transcription Factor Binding Sites (TFBSs), while a much weaker relation exists for repeats and no relation at all exists for introns. Second, we found that 38.45% of the 200 top-clustered 8-mers, but only 7.70% of the non-clustered words, are represented in known motif databases. Third, enrichment/depletion experiments show that highly clustered words are significantly enriched in exons and TFBSs, while they are depleted in introns and repetitive DNA. Considering exons and TFBSs together, 1417 (72.26%) of the top-clustered 8-mers in human and 1385 (72.97%) in mouse showed a statistically significant association with either exons or TFBSs, strongly supporting the link between word clustering and biological function. Lastly, we identified a subset of clustered, diagnostic words that are enriched in exons but depleted in introns and might therefore help to discriminate between these two gene regions. The clustering of DNA words thus appears to be a novel principle for detecting functionality in genome sequences. As evolutionary conservation is not a prerequisite, the proof of principle described here may open new ways to detect species-specific functional DNA sequences and to improve gene and promoter predictions, thus contributing to the quest for function in the genome.
Copyright © 2011 Elsevier Ltd. All rights reserved.
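
The abstract does not say how the clustering coefficient is computed. As a rough illustration only, the sketch below assumes one simple possibility: take the distances between consecutive occurrences of a k-mer, compute their coefficient of variation, and divide by sqrt(1 - p), the value expected when occurrences are placed at random with per-position probability p (a geometric distribution of distances). The function names `kmer_positions` and `clustering_level` are hypothetical and are not taken from the paper.

```python
import math
from statistics import mean, pstdev


def kmer_positions(seq, kmer):
    """Return the start index of every (possibly overlapping) occurrence of kmer."""
    positions = []
    start = seq.find(kmer)
    while start != -1:
        positions.append(start)
        start = seq.find(kmer, start + 1)
    return positions


def clustering_level(seq, kmer):
    """Assumed clustering measure (illustrative, not the paper's definition).

    Values near 1 are what random placement would give, values above 1
    suggest clustering, and values below 1 suggest regular spacing.
    Returns None when there are too few occurrences to judge.
    """
    pos = kmer_positions(seq, kmer)
    if len(pos) < 3:
        return None
    # Distances between consecutive occurrences.
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    # Coefficient of variation of the distances.
    sigma = pstdev(gaps) / mean(gaps)
    # Occurrence probability per position; random placement gives sqrt(1 - p).
    p = len(pos) / len(seq)
    if p >= 1.0:
        return None
    return sigma / math.sqrt(1.0 - p)


if __name__ == "__main__":
    # Toy sequences: the same word placed in two clumps versus evenly spaced.
    clumped = "ACGT" * 3 + "T" * 50 + "ACGT" * 3
    even = ("ACGT" + "T" * 6) * 5
    print(clustering_level(clumped, "ACGT"))
    print(clustering_level(even, "ACGT"))
```

On real chromosomes, a measure like this would be computed for every k-mer and the resulting ranking compared against annotated exons, TFBSs, introns and repeats, as the abstract outlines.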

Year:  2011        PMID: 22226985     DOI: 10.1016/j.jtbi.2011.12.024

Source DB:  PubMed          Journal:  J Theor Biol        ISSN: 0022-5193            Impact factor:   2.691


  7 in total

1.  Evolutionary mechanism and biological functions of 8-mers containing CG dinucleotide in yeast.

Authors:  Yan Zheng; Hong Li; Yue Wang; Hu Meng; Qiang Zhang; Xiaoqing Zhao
Journal:  Chromosome Res       Date:  2017-02-09       Impact factor: 5.239

2.  Analyzing similarities in genome sequences.

Authors:  I C Fonseca; E Nogueira; P H Figueirêdo; S Coutinho
Journal:  Eur Phys J E Soft Matter       Date:  2018-01-19       Impact factor: 1.890

3.  An improved alignment-free model for DNA sequence similarity metric.

Authors:  Junpeng Bao; Ruiyu Yuan; Zhe Bao
Journal:  BMC Bioinformatics       Date:  2014-09-28       Impact factor: 3.169

4.  Model of the Dynamic Construction Process of Texts and Scaling Laws of Words Organization in Language Systems.

Authors:  Shan Li; Ruokuang Lin; Chunhua Bian; Qianli D Y Ma; Plamen Ch Ivanov
Journal:  PLoS One       Date:  2016-12-22       Impact factor: 3.240

5.  PISMA: A Visual Representation of Motif Distribution in DNA Sequences.

Authors:  Rogelio Alcántara-Silva; Moisés Alvarado-Hermida; Gibrán Díaz-Contreras; Martha Sánchez-Barrios; Samantha Carrera; Silvia Carolina Galván
Journal:  Bioinform Biol Insights       Date:  2017-03-30

6.  Distinguishing Functional DNA Words; A Method for Measuring Clustering Levels.

Authors:  Hanieh Moghaddasi; Khosrow Khalifeh; Amir Hossein Darooneh
Journal:  Sci Rep       Date:  2017-01-27       Impact factor: 4.379

7.  Extracting DNA words based on the sequence features: non-uniform distribution and integrity.

Authors:  Zhi Li; Hongyan Cao; Yuehua Cui; Yanbo Zhang
Journal:  Theor Biol Med Model       Date:  2016-01-25       Impact factor: 2.432
