Determination of eukaryotic protein coding regions using neural networks and information theory.

R Farber, A Lapedes, K Sirotkin.

Abstract

Our previous work applied neural network techniques to the problem of discriminating open reading frame (ORF) sequences taken from introns versus exons. The method counted the codon frequencies in an ORF of a specified length, and then used this codon frequency representation of DNA fragments to train a neural net (essentially a Perceptron with a sigmoidal, or "soft step function", output) to perform this discrimination. After training, the network was applied to a disjoint "predict" set of data to assess accuracy. The resulting accuracy in our previous work was 98.4%, exceeding accuracies reported in the literature at that time for other algorithms. Here, we report even higher accuracies stemming from calculations of mutual information (a correlation measure) of spatially separated codons in exons and in introns. Significant mutual information exists between adjacent codons in exons, but not in introns. This suggests that dicodon frequencies of adjacent codons are important for intron/exon discrimination. We report that accuracies obtained using a neural net trained on the frequency of dicodons are significantly higher at smaller fragment lengths than even our original results using codon frequencies, which were already higher than those of simple statistical methods that also used codon frequencies. We also report accuracies obtained from including codon and dicodon statistics in all six reading frames, i.e. the three frames on the original strand and the three on the complement strand. Inclusion of six-frame statistics increases the accuracy still further. We also compare these neural net results to a Bayesian statistical prediction method that assumes independent codon frequencies in each position. The performance of the Bayesian scheme is poorer than that of any of the neural-net-based schemes; however, many methods reported in the literature use this method, either explicitly or implicitly.
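As an illustration of the mutual information calculation mentioned above, the following is a minimal sketch, not the authors' code. It assumes uppercase DNA strings already in the correct reading frame, estimates probabilities by simple counting, and reports the result in bits. The function names `codons` and `codon_mutual_information` and the `separation` parameter are hypothetical.

```python
from collections import Counter
from math import log2


def codons(seq):
    """Split a DNA string into non-overlapping frame-0 codons,
    dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_mutual_information(seqs, separation=1):
    """Estimate the mutual information (in bits) between codons that are
    `separation` codons apart, pooled over all sequences.
    separation=1 means adjacent codons (dicodons)."""
    pair_counts = Counter()
    for s in seqs:
        c = codons(s)
        # Pair each codon with the one `separation` positions downstream.
        for a, b in zip(c, c[separation:]):
            pair_counts[(a, b)] += 1

    total = sum(pair_counts.values())
    if total == 0:
        return 0.0

    # Marginal counts for the first and second codon of each pair.
    first = Counter()
    second = Counter()
    for (a, b), n in pair_counts.items():
        first[a] += n
        second[b] += n

    mi = 0.0
    for (a, b), n in pair_counts.items():
        p_ab = n / total
        p_a = first[a] / total
        p_b = second[b] / total
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi
```

With small samples this plug-in estimate is biased upward, so a real comparison of exons against introns would need a correction or a shuffled-sequence baseline; that step is outside this sketch.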
Specifically, Bayesian prediction schemes based on codon frequencies achieve 90.9% accuracy on 90 codon ORFs, while our best neural net scheme reaches 99.4% accuracy on 60 codon ORFs. "Accuracy" is defined as the average of the exon and intron sensitivities. Achievement of sufficiently high accuracies on short fragment lengths can be useful in providing a computational means of finding coding regions in unannotated DNA sequences such as those arising from the mega-base sequencing efforts of the Human Genome Project. We caution that the high accuracies reported here do not represent a complete solution to the problem of identifying exons in "raw" base sequences. The accuracies are considerably lower for exons of small length, although still higher than accuracies reported in the literature for other methods. Short exon lengths are not uncommon. (ABSTRACT TRUNCATED AT 400 WORDS)
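The Bayesian baseline and the accuracy measure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: codons are treated as independent, class priors are taken as equal, add-one smoothing is used for unseen codons, and all function names (`codon_log_probs`, `classify`, `average_sensitivity`) are hypothetical.

```python
from collections import Counter
from math import log

# All 64 codons over the unambiguous DNA alphabet.
ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codons(seq):
    """Split a DNA string into non-overlapping frame-0 codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_log_probs(seqs, pseudocount=1.0):
    """Estimate log codon probabilities from training sequences.
    The pseudocount (smoothing) is an assumption of this sketch."""
    counts = Counter(c for s in seqs for c in codons(s))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: log((counts[c] + pseudocount) / total) for c in ALL_CODONS}


def classify(seq, exon_lp, intron_lp):
    """Label a fragment by the sign of its log-likelihood ratio, assuming
    independent codons and equal priors. Codons containing ambiguous
    bases (not in the tables) are skipped."""
    score = sum(exon_lp[c] - intron_lp[c] for c in codons(seq) if c in exon_lp)
    return "exon" if score > 0 else "intron"


def average_sensitivity(true_labels, predicted_labels):
    """Accuracy as defined in the abstract: the mean of the exon and
    intron sensitivities. Assumes both classes occur in true_labels."""
    sensitivities = []
    for cls in ("exon", "intron"):
        idx = [i for i, t in enumerate(true_labels) if t == cls]
        correct = sum(predicted_labels[i] == cls for i in idx)
        sensitivities.append(correct / len(idx))
    return sum(sensitivities) / 2
```

Averaging the two sensitivities, rather than counting all correct predictions together, keeps the score from being dominated by whichever class has more examples in the test set.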

Year:  1992        PMID: 1640461     DOI: 10.1016/0022-2836(92)90961-i

Source DB:  PubMed          Journal:  J Mol Biol        ISSN: 0022-2836            Impact factor:   5.469

