
On relationship of Z-curve and Fourier approaches for DNA coding sequence classification.

Ngai-Fong Law, Kin-On Cheng, Wan-Chi Siu.

Abstract

Z-curve features are among the most popular features used in exon/intron classification. We showed that although both the Z-curve and Fourier approaches are based on detecting 3-periodicity in coding regions, there are significant differences in their spectral formulation. From the spectral formulation of the Z-curve, we obtained three modified sequences that characterize different biological properties. Spectral analysis of the modified sequences showed a much more prominent 3-periodicity peak in coding regions than the Fourier approach did. For long sequences, prominent peaks at 2π/3 are observed in coding regions, and even for short sequences, clearly discernible peaks are still visible. Better classification can be obtained using spectral features derived from the modified sequences.


Year:  2006        PMID: 17597898      PMCID: PMC1891701          DOI: 10.6026/97320630001242

Source DB:  PubMed          Journal:  Bioinformation        ISSN: 0973-2063


Background

A DNA sequence is a long sequence consisting of four types of nucleotides: Adenine (A), Guanine (G), Thymine (T) and Cytosine (C). An important problem in sequence analysis is to distinguish coding regions (exons) from non-coding regions (introns and intergenic spaces) in a sequence. Sequence features exploiting properties such as codon usage bias, base compositional bias between codon positions, and periodicity in base occurrence in coding regions [1, 2] have been proposed for characterizing coding/non-coding regions. The 3-periodicity property of coding regions is particularly interesting and has been studied intensely. A natural choice for detecting such periodicity is the Fourier Transform (FT). The Z-curve features [3] and the FT approach [4–6] are both concerned with detecting the 3-periodicity property of coding sequences and are implicitly related. However, there has been no theoretical study of the relationship between the two approaches. In this paper, we give a theoretical analysis that reveals the relationship between the two and show that there are significant differences between them. In particular: (1) we provide a theoretical study of the relationship between the two approaches; (2) we provide a justification for the empirical observation that the Z-curve approach generally performs better than the FT approach, especially for shorter sequences; and finally, (3) we propose a modification of the basic FT approach based on a new numerical sequence representation, derived from the Z-curve, that preserves biological significance.
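As an illustration of the basic FT idea described above, the following minimal sketch measures 3-periodicity as the spectral power at angular frequency 2π/3, summed over the four binary indicator sequences of A, C, G and T. This is the commonly used indicator-sequence formulation and is an assumption here; the record does not spell out the exact formulation used in the paper. The function name `indicator_power_at_one_third` is hypothetical.

```python
import cmath


def indicator_power_at_one_third(seq):
    """Total DFT power at angular frequency 2*pi/3 (bin k = N/3),
    summed over the 0/1 indicator sequences of A, C, G and T.

    Assumption: standard indicator-sequence formulation, used here only
    as an illustration of the FT approach.
    """
    bases = [ch for ch in seq.upper() if ch in "ACGT"]
    n = len(bases) - len(bases) % 3  # trim so that N/3 is an integer bin
    total = 0.0
    for base in "ACGT":
        # DFT coefficient of the indicator sequence of `base` at k = N/3:
        # exp(-2*pi*i * (N/3) * pos / N) simplifies to exp(-2*pi*i * pos / 3).
        coeff = sum(
            cmath.exp(-2j * cmath.pi * pos / 3)
            for pos in range(n)
            if bases[pos] == base
        )
        total += abs(coeff) ** 2
    return total


if __name__ == "__main__":
    periodic = "GCA" * 60   # toy sequence with perfect period 3
    uniform = "A" * 180     # toy sequence with no period-3 structure
    print(indicator_power_at_one_third(periodic))
    print(indicator_power_at_one_third(uniform))
```

On the toy inputs, the perfectly periodic sequence yields a large value while the homopolymer yields a value close to zero, which is the contrast a coding/non-coding classifier would try to exploit.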

Methodology

PDF file

Results and Discussion

PDF file

Conclusion

Z-curve features are among the popular features used for DNA sequence classification, and they are closely related to an FT spectral analysis of the sequence for 3-periodicity. In this paper we gave a theoretical study of the relationship between the Z-curve and the FT approach. Our analysis showed that there are significant differences in the spectral interpretation between the two. We discussed the implications of these differences for shorter sequences. Moreover, we showed that the three modified sequences obtained from the spectral reformulation of the Z-curve approach characterize different biological properties and are useful for coding region prediction. In particular, the 3-periodicity is much more prominent in the modified sequences. As a result of our analysis, we proposed applying spectral analysis to the three modified sequences to better capture the 3-periodicity property embedded in the coding region of a DNA sequence, and we verified this experimentally.
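For readers unfamiliar with the Z-curve, the sketch below builds the standard Z-curve coordinates from three ±1 step sequences (purine vs. pyrimidine, amino vs. keto, weak vs. strong hydrogen bonding) and evaluates the power of a step sequence at 2π/3. The exact definition of the paper's three modified sequences is given only in the PDF, so using the ±1 step sequences as a stand-in is an assumption, and all function names are hypothetical.

```python
import cmath

PURINE, AMINO, WEAK = "AG", "AC", "AT"


def z_steps(seq):
    """Three +1/-1 step sequences for the A/C/G/T bases of `seq`.

    dx: purine (A, G) vs pyrimidine (C, T)
    dy: amino (A, C) vs keto (G, T)
    dz: weak (A, T) vs strong (C, G) hydrogen bonding
    """
    bases = [ch for ch in seq.upper() if ch in "ACGT"]
    dx = [1 if b in PURINE else -1 for b in bases]
    dy = [1 if b in AMINO else -1 for b in bases]
    dz = [1 if b in WEAK else -1 for b in bases]
    return dx, dy, dz


def z_curve(seq):
    """Standard Z-curve points (x_n, y_n, z_n): cumulative sums of the steps."""
    points, x, y, z = [], 0, 0, 0
    for sx, sy, sz in zip(*z_steps(seq)):
        x, y, z = x + sx, y + sy, z + sz
        points.append((x, y, z))
    return points


def step_power_at_one_third(values):
    """Squared DFT magnitude of a numeric sequence at angular frequency 2*pi/3."""
    coeff = sum(
        v * cmath.exp(-2j * cmath.pi * i / 3) for i, v in enumerate(values)
    )
    return abs(coeff) ** 2


if __name__ == "__main__":
    print(z_curve("ACGT"))
    dx, dy, dz = z_steps("GCA" * 60)
    print([step_power_at_one_third(s) for s in (dx, dy, dz)])
```

Each step sequence isolates one chemical grouping of the bases, which is one way to read the statement that the modified sequences characterize different biological properties.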
Table 1

Classification results of coding and non-coding sequences

                       Yeast     Human
FFT approach
  Sensitivity          0.8580    0.8627
  Specificity          0.8922    0.2873
  Average              0.8751    0.5750

Proposed approach
  Sensitivity          0.8607    0.7607
  Specificity          0.9558    0.8413
  Average              0.9083    0.8010
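The "Average" rows in Table 1 appear to be the arithmetic mean of the sensitivity and specificity rows. The short check below confirms this from the table values alone, to within four-decimal rounding. The dictionary layout and the function name `average_is_consistent` are hypothetical; how sensitivity and specificity themselves were computed is described only in the PDF and is not assumed here.

```python
# (sensitivity, specificity, reported average) copied from Table 1.
TABLE_1 = {
    ("FFT", "Yeast"): (0.8580, 0.8922, 0.8751),
    ("FFT", "Human"): (0.8627, 0.2873, 0.5750),
    ("Proposed", "Yeast"): (0.8607, 0.9558, 0.9083),
    ("Proposed", "Human"): (0.7607, 0.8413, 0.8010),
}


def average_is_consistent(sn, sp, reported, tol=5.1e-5):
    """True if `reported` equals (sn + sp) / 2 up to four-decimal rounding."""
    return abs((sn + sp) / 2 - reported) <= tol


if __name__ == "__main__":
    for (method, organism), row in TABLE_1.items():
        print(method, organism, average_is_consistent(*row))
```

Read this way, the low FFT average on human sequences is driven almost entirely by its specificity of 0.2873, while its sensitivity is comparable to the yeast case.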
  8 in total

1.  Frequency-domain analysis of biomolecular sequences.

Authors:  D Anastassiou
Journal:  Bioinformatics       Date:  2000-12       Impact factor: 6.937

2.  Recognition of protein coding genes in the yeast genome at better than 95% accuracy based on the Z curve.

Authors:  C T Zhang; J Wang
Journal:  Nucleic Acids Res       Date:  2000-07-15       Impact factor: 16.971

3.  Locating probable genes using Fourier transform approach.

Authors:  Biju Issac; Harpreet Singh; Harpreet Kaur; G P S Raghava
Journal:  Bioinformatics       Date:  2002-01       Impact factor: 6.937

4.  Classification of short human exons and introns based on statistical features.

Authors:  Yonghui Wu; Alan Wee-Chung Liew; Hong Yan; Mengsu Yang
Journal:  Phys Rev E Stat Nonlin Soft Matter Phys       Date:  2003-06-27

5.  Effective statistical features for coding and non-coding DNA sequence classification for yeast, C. elegans and human.

Authors:  Alan Wee-Chung Liew; Yonghui Wu; Hong Yan; Mengsu Yang
Journal:  Int J Bioinform Res Appl       Date:  2005

6.  Prediction of probable genes by Fourier analysis of genomic sequences.

Authors:  S Tiwari; S Ramachandran; A Bhattacharya; S Bhattacharya; R Ramaswamy
Journal:  Comput Appl Biosci       Date:  1997-06

7.  Codon preference and its use in identifying protein coding regions in long DNA sequences.

Authors:  R Staden; A D McLachlan
Journal:  Nucleic Acids Res       Date:  1982-01-11       Impact factor: 16.971

8.  Recognition of protein coding regions in DNA sequences.

Authors:  J W Fickett
Journal:  Nucleic Acids Res       Date:  1982-09-11       Impact factor: 16.971

