A coding measure scheme employing electron-ion interaction pseudopotential (EIIP).

Achuthsankar S Nair, Sivarama Pillai Sreenadhan.

Abstract

In this paper, a revision of the existing method of locating exons by a genomic signal processing technique employing four binary indicator sequences is presented. The existing method relies on the pronounced period-three peaks that are observed in the Fourier power spectrum of exon regions and are absent in non-coding regions. The authors have abandoned the four sequences altogether and adopted a single 'EIIP indicator sequence', formed by substituting the electron-ion interaction pseudopotentials (EIIP) of the nucleotides A, G, C and T into the DNA sequence, which reduces the computational overhead by 75%. The power spectrum of this sequence reveals period-three peaks for exon regions. A number of exons have also been identified that exhibit period-three peaks when mapped to the 'EIIP indicator sequence' but not when the binary indicator sequences are employed. Better discrimination between exon areas and non-coding areas was obtained for a number of genomes when the sequences are mapped to EIIP indicator sequences and their power spectra are taken in a sliding Kaiser window, compared with the existing method, which uses binary indicator sequences and a rectangular window.

Year:  2006        PMID: 17597888      PMCID: PMC1891688     

Source DB:  PubMed          Journal:  Bioinformation        ISSN: 0973-2063


Background

The pivotal problem of gene identification in eukaryotes is distinguishing exons from introns and intergenic regions. A number of coding measures, such as single-nucleotide and polynucleotide bias differences and spectral differences between these regions, have been utilized for this purpose in various gene-finding algorithms. However, simultaneously improving the sensitivity and selectivity of these algorithms remains a challenge, so the search for new coding measures continues.

The existing method of locating exons by genomic signal processing employs four binary indicator sequences, one for each nucleotide, and depends on the period-three peaks that are observed in the power spectrum of exon regions but not in non-coding regions. The method may be summarized as follows. For a DNA string x[n] of N characters (over the alphabet A, G, C and T), define four binary indicator sequences uA[n], uG[n], uC[n] and uT[n]. [1] Each indicator sequence has a 1 at position n if the corresponding base occurs there, and a 0 otherwise. (The worked example and the accompanying equations are given in the PDF file of the article.)

This coding measure has been utilized in the program ‘Genescan’ [4], which evaluates the N/3 component of the Fourier power spectrum of the binary indicator sequences through a sliding window and looks for peaks whose strength (relative to the average strength of the power spectrum in the region) surpasses a threshold, indicating the presence of exons. An optimization technique [1] has also been devised for locating exons with binary indicator sequences. In another notable work, an anti-notch filter [5] is used to locate exons with the four binary indicator sequences. A recent work [6] employs the Cumulative Categorical Periodogram (CCP) for the same end, but it gives troughs at N/3, whereas the binary indicator sequences exhibit peaks.
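The binary indicator scheme described above can be illustrated with a minimal Python sketch. It is not the Genescan implementation: the function names are hypothetical, the total power is assumed to be the sum of the squared DFT magnitudes of the four indicator sequences, and the perfectly periodic toy string `"ATG" * 30` is an artificial input chosen only to make the N/3 component visible.

```python
import cmath


def binary_indicators(dna):
    """Return the four binary indicator sequences uA, uG, uC, uT as 0/1 lists."""
    return {base: [1 if ch == base else 0 for ch in dna] for base in "AGCT"}


def dft_power(seq, k):
    """Squared magnitude of DFT bin k of a real sequence (direct summation)."""
    n_total = len(seq)
    x_k = sum(v * cmath.exp(-2j * cmath.pi * k * n / n_total)
              for n, v in enumerate(seq))
    return abs(x_k) ** 2


def binary_power(dna, k):
    """Assumed total power: sum of the bin-k powers of the four indicators."""
    return sum(dft_power(u, k) for u in binary_indicators(dna).values())


# Indicator sequence of A for the string used as an example in the Methodology.
print(binary_indicators("AATGCATCA")["A"])  # [1, 1, 0, 0, 0, 1, 0, 0, 1]

# Artificial period-three string: power concentrates at k = N/3.
toy = "ATG" * 30
print(binary_power(toy, len(toy) // 3), binary_power(toy, 7))
```

Four separate transforms are needed here, one per nucleotide, which is the cost that a single numerical mapping avoids.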

Methodology

The authors propose a novel coding measure scheme that replaces the four binary indicator sequences with a single sequence, which we call the ‘EIIP indicator sequence’. The energy of delocalized electrons in amino acids and nucleotides has been calculated as the electron-ion interaction pseudopotential (EIIP). [7] The EIIP values of amino acids have already been used in the Resonant Recognition Model (RRM), where they are substituted for the corresponding amino acids in protein sequences and the Discrete Fourier Transform of the resulting numerical sequence is taken to extract its information content. [7] The Fourier cross-spectrum of a group of related proteins reveals a sharp peak at a frequency termed the ‘characteristic frequency’ of that group, as it is found to represent a particular biological function; such proteins selectively interact with targets of the same ‘characteristic frequency’ (resonant recognition). [7] This has been used to identify ‘hot spots’ in proteins and for peptide design, both of which are very useful in drug discovery. In the present work, the authors make use of the EIIP values of nucleotides, rather than those of amino acids, for locating exons. The EIIP values for the nucleotides are given in Table 1.
Table 1

Electron-ion interaction pseudopotentials of nucleotides

| Nucleotide | EIIP |
|---|---|
| A | 0.1260 |
| G | 0.0806 |
| C | 0.1340 |
| T | 0.1335 |

If we substitute the EIIP values for A, G, C and T in a DNA string x[n], we get a numerical sequence that represents the distribution of the free electrons' energies along the DNA sequence. This sequence is named the ‘EIIP indicator sequence’, xe[n]. For example, if x[n] = A A T G C A T C A, then using the values from Table 1, xe[n] = [0.1260 0.1260 0.1335 0.0806 0.1340 0.1260 0.1335 0.1340 0.1260]. Let Xe[k] denote the discrete Fourier transform of xe[n] and Se[k] = |Xe[k]|^2 the corresponding power spectrum (the full equations are given in the PDF file of the article). When Se[k] is plotted against k, it reveals a peak at N/3 for a coding region, whereas no such peak is observable for a non-coding region. The method is thus simplified, and the computational overhead is reduced by 75%, because the Fast Fourier Transform (FFT) of only one sequence has to be computed instead of the FFTs of the four binary sequences used in the original method. This may be used as a coding measure to detect probable coding regions in DNA sequences by examining the local signal-to-noise ratio of the peak within a sliding window and selecting an appropriate threshold.

Genescan [4] takes an optimal window size of 351, and the same is adopted in the present investigation. The authors have also experimented with both reduced and increased window sizes. With a reduced window size, the peaking areas become ‘sharper’, which is advantageous for detection when exons are close together (separated by comparatively short introns), but the accompanying increase in noise makes the discrimination poorer. Increasing the window size, on the other hand, makes the peaking areas wider, so that closely spaced exons are missed. Instead of the rectangular window adopted by Genescan, the authors use a Kaiser window, which suppresses the noise more effectively because it has much smaller side lobes than a rectangular window, and the binary indicator sequences are replaced by a single EIIP indicator sequence.
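The mapping and the sliding-window evaluation described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the function names are hypothetical, the Kaiser shape parameter `beta=4.0` is an arbitrary choice (the text does not state one), only the N/3 component is evaluated in each window, and the normalization against the average spectrum and the thresholding step are omitted.

```python
import cmath
import math

# EIIP values of the nucleotides, taken from Table 1.
EIIP = {"A": 0.1260, "G": 0.0806, "C": 0.1340, "T": 0.1335}


def eiip_sequence(dna):
    """Map a DNA string to its EIIP indicator sequence xe[n]."""
    return [EIIP[base] for base in dna.upper()]


def bessel_i0(x, terms=30):
    """Zeroth-order modified Bessel function via its power series."""
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (x / (2.0 * k)) ** 2
        total += term
    return total


def kaiser_window(length, beta):
    """Kaiser window: w[n] = I0(beta * sqrt(1 - (2n/(M-1) - 1)^2)) / I0(beta)."""
    denom = bessel_i0(beta)
    window = []
    for n in range(length):
        ratio = 2.0 * n / (length - 1) - 1.0
        window.append(bessel_i0(beta * math.sqrt(max(0.0, 1.0 - ratio ** 2))) / denom)
    return window


def period3_power(values, window):
    """Power of the windowed data at frequency 1/3, i.e. the N/3 DFT bin
    when the window length is a multiple of 3 (351 = 3 * 117)."""
    acc = sum(w * v * cmath.exp(-2j * cmath.pi * n / 3.0)
              for n, (w, v) in enumerate(zip(window, values)))
    return abs(acc) ** 2


def sliding_period3(dna, size=351, beta=4.0, step=1):
    """Period-three power of the EIIP sequence in a sliding Kaiser window."""
    xe = eiip_sequence(dna)
    window = kaiser_window(size, beta)
    return [period3_power(xe[start:start + size], window)
            for start in range(0, len(xe) - size + 1, step)]


print(eiip_sequence("AATGCATCA"))
```

Only one numerical sequence is transformed per window instead of four, which is where the stated 75% saving comes from. Applying `sliding_period3` to a real sequence would still require choosing a threshold, which this sketch leaves open.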

Results and Discussion

The authors have checked the power spectrum of several exon segments of eukaryotic genes in a number of organisms using the binary indicator sequences and the proposed EIIP indicator sequence. Two data sets are mainly used as benchmarks for this purpose: the dataset prepared by Burset and Guigó [8,9] and HMR195 [10], prepared by Sanja Rogic. In a good number of cases both methods perform well, but there are instances where the EIIP indicator sequence shows the peak at the right location (near N/3) while the binary sequences fail, and a few instances where the opposite is true. There are also a number of genes where both fail, which shows that many exons exist without an appreciable N/3 peak. Table 2 lists some of the exons that give period-three peaks when the EIIP indicator sequence is employed and for which the existing binary indicator sequence method fails.
Table 2

Exons from selected genes where EIIP indicator sequence gives better N/3 peaks compared to binary indicator sequences

| Serial No. | Accession number | Description of gene | Length of sequence | Exon area & length (N) | Comments: (a) using binary; (b) using EIIP |
|---|---|---|---|---|---|
| 1 | AF019074 | EKLF, Mus musculus erythroid Kruppel-like factor gene | 6350 | 3761-4574 (814) | Peak in (a) at 131 (not near N/3), peak in (b) at 272 (near N/3). |
| 2 | AB009589 | Human gene for osteomodulin | 12414 | 10624-10949 (326) | Peak in (a) at 8 (not near N/3), peak in (b) at 110 (near N/3). |
| 3 | AF065986 | Human keratocan gene | 7659 | 6638-6810 (173) | Peak in (a) at 40 (not near N/3), peak in (b) at 53 (near N/3). |
| 4 | AF015224 | Human mammoglobin gene | 4206 | 1713-1900 (188) | Peak in (a) at 18 (not near N/3), peak in (b) at 63 (near N/3). |
| 5 | AB016625 | Human OCTN2 gene | 25871 | 15591-15792 (172) | Peak in (a) at 76 (not near N/3), peak in (b) at 56 (near N/3). |

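The kind of check reported in Table 2, namely where the strongest spectral component of an exon falls relative to N/3, could be sketched as below. The helper names are hypothetical, the DFT is computed by direct summation for clarity rather than speed, and the periodic toy string stands in for a real exon sequence, which is not reproduced here.

```python
import cmath

# EIIP values of the nucleotides, taken from Table 1.
EIIP = {"A": 0.1260, "G": 0.0806, "C": 0.1340, "T": 0.1335}


def power_spectrum(values):
    """Return S[k] = |X[k]|^2 for k = 0 .. N//2 (direct DFT, O(N^2))."""
    n_total = len(values)
    spectrum = []
    for k in range(n_total // 2 + 1):
        x_k = sum(v * cmath.exp(-2j * cmath.pi * k * n / n_total)
                  for n, v in enumerate(values))
        spectrum.append(abs(x_k) ** 2)
    return spectrum


def peak_position(dna):
    """Index k >= 1 of the strongest component of the EIIP power spectrum.
    Bin 0 (the mean of the sequence) is skipped."""
    spectrum = power_spectrum([EIIP[base] for base in dna])
    return max(range(1, len(spectrum)), key=spectrum.__getitem__)


# Artificial period-three string used in place of a real exon.
toy_exon = "ATG" * 30
print(peak_position(toy_exon), len(toy_exon) / 3)
```

For a real exon the reported peak position would be compared with N/3 in the same way; how close counts as "near" is a judgment the table leaves to the reader.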
The experiments using sliding windows show that in a number of cases the EIIP indicator sequence gives better discrimination between coding and non-coding regions. Figure 1 and Figure 2 show the power spectrum of the gene HUMELAFIN (Acc. No. D13156, Homo sapiens gene for elafin) obtained using the binary indicator sequences and the EIIP indicator sequence, respectively. HUMELAFIN has two exons, one from nucleotide position 245 to 325 and the other from 1185 to 1459. As is evident from Figure 1, a ‘false exon’ (an intron region with a greater peak than the exon regions) appears when binary indicator sequences are used, and the peak of the first exon (at 205) is also shifted away from the actual region. In contrast, the use of the EIIP indicator sequence has ‘removed’ the ‘false’ exon, and the peak of the first exon now lies inside the right region, as seen in Figure 2. The second exon peak, however, is exhibited correctly by both methods.
Figure 1

Power spectrum of HUMELAFIN (D13156) obtained using binary indicators

Figure 2

Power spectrum of HUMELAFIN (D13156) obtained using EIIP indicator

Table 3 summarizes the observations for nine genes in which the EIIP indicator sequence is found to be a better discriminator than the binary indicator sequences. The last two columns of Table 3 compare the performance of the two methods in terms of an exon-intron discrimination measure D, given by D = (lowest of the exon peaks) / (highest peak in non-coding regions).
Table 3

Examples of genes whose power spectra show better discrimination between coding and non-coding regions with EIIP indicator sequence mapping than with binary indicator sequence mapping

| No | Gene name, Acc. No, description | Regions (nucleotide positions) | Highest peak (binary slide) | Highest peak (EIIP slide) | Discrimination measure D for binary slide | Discrimination measure D for EIIP slide |
|---|---|---|---|---|---|---|
| 1. | F56F11.4a, NC001135, a gene from C. elegans chromosome III | E1 (929-1135) | 2.1 | 1.02 | 1.19 | 2.0 |
| | | E2 (2528-2857) | 7.01 | 2.75 | | |
| | | E3 (4114-4377) | 6.0 | 2.4 | | |
| | | E4 (5465-5644) | 5.3 | 1.1 | | |
| | | E5 (7255-7605) | 3.4 | 1.25 | | |
| | | Intron regions | 1.77 | 0.51 | | |
| 2. | HUMBETGLOA, L26462, human beta-globin A chain | E1 (866-957) | 1.86 | 0.84 | 1.05 | 2.4 |
| | | E2 (1088-1310) | 4.34 | 0.9 | | |
| | | E3 (2161-2289) | 3.0 | 1.13 | | |
| | | Intron regions | 1.77 | 0.35 | | |
| 3. | HUMCBRG, M62420, Homo sapiens carbonyl reductase gene | E1 (276-566) | 9.74 | 1.74 | 0.55 | 1.16 |
| | | E2 (1112-1219) | 1.3 | 0.43 | | |
| | | E3 (2608-3044) | 6.67 | 1.0 | | |
| | | Intron regions | 2.36 | 0.37 | | |
| 4. | HUMELAFIN, D13156, Homo sapiens gene for elafin | E1 (247-325) | 2.05 | 0.65 | 0.95 | 1.55 |
| | | E2 (1185-1459) | 2.72 | 1.575 | | |
| | | Intron regions | 2.15 | 0.42 | | |
| 5. | GalR2, AF042784, Mus musculus galanin receptor type 2 gene | E1 (24-388) | 9.6 | 1.27 | 3.19 | 14.11 |
| | | E2 (1449-2199) | 3.19 | 0.917 | | |
| | | Intron regions | 1.0 | 0.09 | | |
| 6. | PP32R1, AF00A216, Homo sapiens candidate tumor suppressor gene | E (4453-5157) | 30.6 | 11.92 | 19.13 | 21.67 |
| | | Intergenic regions | 1.6 | 0.55 | | |
| 7. | HMX1, AF009614, Mus musculus homeobox containing nuclear transcriptional factor gene | E1 (1267-1639) | 4.76 | 1.71 | 2.14 | 2.80 |
| | | E2 (3888-4513) | 7.09 | 3.94 | | |
| | | Intron regions | 2.22 | 0.61 | | |
| 8. | PSMB5, AB003306, Mus musculus DNA for PSMB5 | E1 (1020-1217) | 3.82 | 0.95 | 1.43 | 3.17 |
| | | E2 (2207-2513) | 2.10 | 1.05 | | |
| | | E3 (4543-4832) | 7.65 | 2.12 | | |
| | | Intron regions | 1.47 | 0.38 | | |
| 9. | HSODF2, X74614, Homo sapiens ODF2 gene | E1 (280-599) | 3.82 | 0.865 | 2.43 | 4.33 |
| | | E2 (843-1275) | 6.725 | 1.75 | | |
| | | Intron regions | 1.57 | 0.2 | | |

The higher the value of D, the better the discrimination. If D is greater than one, all exons are identified without ambiguity; a D below one indicates that at least one exon does not have enough strength to be distinguished from non-coding areas. In all the examples cited, the method using the EIIP indicator sequence shows better discrimination than the method using binary indicator sequences. In two cases (HUMCBRG and HUMELAFIN), the binary indicator sequence method even fails to identify all exons.
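As a worked example of the discrimination measure, the short sketch below recomputes D for the HUMELAFIN entry of Table 3 from its listed peak values. The function name `discrimination` is hypothetical and introduced only for illustration.

```python
def discrimination(exon_peaks, noncoding_peak):
    """D = lowest of the exon peaks / highest peak in non-coding regions."""
    return min(exon_peaks) / noncoding_peak


# HUMELAFIN (D13156), peak values as listed in Table 3.
d_binary = discrimination([2.05, 2.72], 2.15)   # binary indicator sequences
d_eiip = discrimination([0.65, 1.575], 0.42)    # EIIP indicator sequence

print(round(d_binary, 2), round(d_eiip, 2))  # 0.95 1.55
```

With the binary indicators D falls below one because the intron peak (2.15) exceeds the first exon peak (2.05), whereas with the EIIP mapping both exon peaks stand above the highest intron peak.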

Conclusion

Ab initio gene finding remains a challenging and exciting field, as homology searches fail to identify around 30 to 50% of the genes in newly sequenced genomes, and none of the existing ab initio methods (among which those using Hidden Markov Models are considered superior) has enough sensitivity and selectivity for fail-proof prediction. The method presented in this paper, which uses the electron-ion interaction pseudopotentials of nucleotides in the genomic signal processing approach to gene finding, improves the discrimination capability of the existing method and reduces the computational overhead, since one indicator sequence is transformed instead of four. The coding measure scheme using the EIIP indicator sequence can thus be utilized in gene-finding procedures based on genomic signal processing, assisted by the grammar of genes and position weight matrices (PWMs) for splice sites. The possibility of applying the EIIP sequence to other exon prediction techniques, such as autoregressive (AR) modeling, the Average Magnitude Difference Function (AMDF) and the Time Domain Periodogram (TDP), can also be explored. We hope that the fact that a physico-chemical property like EIIP has a role in the formation of protein-coding regions of genomes will trigger further research in related areas.
References

1. D Anastassiou. Frequency-domain analysis of biomolecular sequences. Bioinformatics, 2000-12.

2. Changchuan Yin; Stephen S-T Yau. A Fourier characteristic of coding sequences: origins and a non-Fourier approximation. J Comput Biol, 2005-11.

3. Achuthsankar S Nair; T Mahalakshmi. Are categorical periodograms and indicator sequences of genomes spectrally equivalent? In Silico Biol, 2006.

4. S Tiwari; S Ramachandran; A Bhattacharya; S Bhattacharya; R Ramaswamy. Prediction of probable genes by Fourier analysis of genomic sequences. Comput Appl Biosci, 1997-06.

5. M Burset; R Guigó. Evaluation of gene structure prediction programs. Genomics, 1996-06-15.

6. B D Silverman; R Linsker. A measure of DNA periodicity. J Theor Biol, 1986-02-07.

7. I Cosic. Macromolecular bioactivity: is it resonant interaction between macromolecules?--Theory and applications. IEEE Trans Biomed Eng, 1994-12.
