Distinctive sequence features in protein coding, genic non-coding, and intergenic human DNA.

R Guigó, J W Fickett.

Abstract

We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in the behavior of a number of such statistics are also observed, however, when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C+G content, we have observed that a number of them are strongly dependent on C+G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C+G content. A+T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while the more C+G-rich genic non-coding DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C+G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.
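
The dependence of a sequence statistic on C+G content in simulated DNA can be illustrated with a minimal sketch. It is not the authors' procedure, and the abstract does not say which statistics or simulation model were used. The sketch assumes an i.i.d. base model with equal A and T frequencies and equal C and G frequencies, and it uses the in-frame stop codon fraction as a simple stand-in statistic. The function names (`simulate_dna`, `stop_codon_fraction`, `expected_stop_fraction`) are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def simulate_dna(length, gc_content, rng):
    """Return an i.i.d. random sequence with the given C+G fraction.

    Assumption: P(A) = P(T) and P(C) = P(G); no neighbor dependence.
    """
    at = (1 - gc_content) / 2
    gc = gc_content / 2
    # Weights are listed in the same order as the alphabet "ACGT".
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def stop_codon_fraction(seq):
    """Fraction of frame-0 codons that are stop codons."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return sum(codon in STOP_CODONS for codon in codons) / len(codons)


def expected_stop_fraction(gc_content):
    """Stop codon probability under the i.i.d. model above.

    P(TAA) + P(TAG) + P(TGA) = a^3 + 2 * a^2 * g,
    where a = P(A) = P(T) and g = P(G) = P(C).
    """
    a = (1 - gc_content) / 2
    g = gc_content / 2
    return a ** 3 + 2 * a ** 2 * g


rng = random.Random(0)  # fixed seed so the run is repeatable
for gc_content in (0.3, 0.4, 0.5, 0.6, 0.7):
    seq = simulate_dna(300_000, gc_content, rng)
    observed = stop_codon_fraction(seq)
    expected = expected_stop_fraction(gc_content)
    print(f"C+G={gc_content:.1f}  observed={observed:.4f}  expected={expected:.4f}")
```

Because the three stop codons are A+T-rich, their frequency falls as C+G content rises, so open reading frames lengthen in C+G-rich sequence even when no protein coding function is present. This is one way a coding-related statistic can shift with base composition alone.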

Year: 1995    PMID: 7473716    DOI: 10.1006/jmbi.1995.0535

Source DB: PubMed    Journal: J Mol Biol    ISSN: 0022-2836    Impact factor: 5.469


  7 in total

1.  An assessment of gene prediction accuracy in large DNA sequences.

Authors:  R Guigó; P Agarwal; J F Abril; M Burset; J W Fickett
Journal:  Genome Res       Date:  2000-10       Impact factor: 9.043

2.  Evaluation of gene-finding programs on mammalian sequences.

Authors:  S Rogic; A K Mackworth; F B Ouellette
Journal:  Genome Res       Date:  2001-05       Impact factor: 9.043

3.  A relationship between GC content and coding-sequence length.

Authors:  J L Oliver; A Marín
Journal:  J Mol Evol       Date:  1996-09       Impact factor: 2.395

4.  Coding sequence density estimation via topological pressure.

Authors:  David Koslicki; Daniel J Thompson
Journal:  J Math Biol       Date:  2014-01-22       Impact factor: 2.259

5.  Genomic organization of the S locus: Identification and characterization of genes in SLG/SRK region of S(9) haplotype of Brassica campestris (syn. rapa).

Authors:  G Suzuki; N Kai; T Hirose; K Fukui; T Nishio; S Takayama; A Isogai; M Watanabe; K Hinata
Journal:  Genetics       Date:  1999-09       Impact factor: 4.562

6.  Predicting statistical properties of open reading frames in bacterial genomes.

Authors:  Katharina Mir; Klaus Neuhaus; Siegfried Scherer; Martin Bossert; Steffen Schober
Journal:  PLoS One       Date:  2012-09-24       Impact factor: 3.240

7.  A novel role of the Sp/KLF transcription factor KLF11 in arresting progression of endometriosis.

Authors:  Gaurang S Daftary; Ye Zheng; Zaid M Tabbaa; John K Schoolmeester; Ravi P Gada; Adrienne L Grzenda; Angela J Mathison; Gary L Keeney; Gwen A Lomberk; Raul Urrutia
Journal:  PLoS One       Date:  2013-03-28       Impact factor: 3.240
