A generalized hidden Markov model for the recognition of human genes in DNA.

D Kulp, D Haussler, M G Reese, F H Eeckman.

Abstract

We present a statistical model of genes in DNA. A Generalized Hidden Markov Model (GHMM) provides the framework for describing the grammar of a legal parse of a DNA sequence (Stormo & Haussler 1994). Probabilities are assigned to transitions between states in the GHMM and to the generation of each nucleotide base given a particular state. Machine learning techniques are applied to optimize these probabilities using a standardized training set. Given a new candidate sequence, the best parse is deduced from the model using a dynamic programming algorithm to identify the path through the model with maximum probability. The GHMM is flexible and modular, so new sensors and additional states can be inserted easily. In addition, it provides simple solutions for integrating cardinality constraints, reading frame constraints, "indels", and homology searching. The description and results of an implementation of such a gene-finding model, called Genie, are presented. The exon sensor is a codon frequency model conditioned on windowed nucleotide frequency and the preceding codon. Two neural networks are used, as in (Brunak, Engelbrecht, & Knudsen 1991), for splice site prediction. We show that this simple model performs quite well. For a cross-validated standard test set of 304 genes [ftp:@www-hgc.lbl.gov/pub/genesets] in human DNA, our gene-finding system identified up to 85% of protein-coding bases correctly with a specificity of 80%. 58% of exons were exactly identified with a specificity of 51%. Genie is shown to perform favorably compared with several other gene-finding systems.
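
The maximum-probability parse described in the abstract can be illustrated with a small sketch of Viterbi-style dynamic programming over a generalized HMM, in which each state emits a whole segment rather than a single base. This is not Genie's implementation: the function name `ghmm_viterbi`, the two toy states, the per-base emission table, and all probabilities are assumptions chosen only for illustration, and the toy segment scorer stands in for the content and signal sensors the abstract mentions.

```python
import math

NEG_INF = float("-inf")


def ghmm_viterbi(seq, states, log_start, log_trans, max_len, segment_logp):
    """Return (best_log_prob, parse); parse is a list of (state, begin, end).

    Hypothetical helper for illustration. best[j][s] holds the best
    log-probability of parsing seq[:j] so that the last segment is in state s.
    """
    n = len(seq)
    best = [{s: NEG_INF for s in states} for _ in range(n + 1)]
    back = [{s: None for s in states} for _ in range(n + 1)]

    for j in range(1, n + 1):
        for s in states:
            # Try every allowed length for a final segment seq[i:j] in state s.
            for i in range(max(0, j - max_len), j):
                emit = segment_logp(s, seq[i:j])
                if i == 0:
                    candidates = [(log_start.get(s, NEG_INF) + emit, None)]
                else:
                    # Missing transitions count as forbidden (log-prob -inf).
                    candidates = [
                        (best[i][p] + log_trans.get((p, s), NEG_INF) + emit, p)
                        for p in states
                    ]
                for score, prev in candidates:
                    if score > best[j][s]:
                        best[j][s] = score
                        back[j][s] = (i, prev)

    state = max(states, key=lambda s: best[n][s])
    score = best[n][state]
    if score == NEG_INF:
        return score, []  # no legal parse under the given model

    # Follow back-pointers from the end of the sequence to recover segments.
    parse, j = [], n
    while j > 0:
        i, prev = back[j][state]
        parse.append((state, i, j))
        j, state = i, prev
    parse.reverse()
    return score, parse


# Toy model (assumed numbers): "exon" prefers G/C, "intron" prefers A/T.
BASE_P = {
    "exon": {"G": 0.4, "C": 0.4, "A": 0.1, "T": 0.1},
    "intron": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
}


def toy_segment_logp(state, segment):
    """Toy segment scorer: independent per-base log-probabilities."""
    return sum(math.log(BASE_P[state][base]) for base in segment)


STATES = ["exon", "intron"]
LOG_START = {"exon": math.log(0.5), "intron": math.log(0.5)}
# Only alternating transitions are legal, each with probability 1.
LOG_TRANS = {("exon", "intron"): 0.0, ("intron", "exon"): 0.0}

score, parse = ghmm_viterbi(
    "GCGCATATGC", STATES, LOG_START, LOG_TRANS, 6, toy_segment_logp
)
print(parse)
```

Because segment scoring is passed in as a function, a different sensor (for example a codon-frequency model) could in principle be swapped in without touching the recurrence, which is one way to read the modularity the abstract describes.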

Year:  1996        PMID: 8877513

Source DB:  PubMed          Journal:  Proc Int Conf Intell Syst Mol Biol        ISSN: 1553-0833


  56 in total

Review 1.  Annotating sequence data using Genotator.

Authors:  N L Harris
Journal:  Mol Biotechnol       Date:  2000-11       Impact factor: 2.695

2.  A question of size: the eukaryotic proteome and the problems in defining it.

Authors:  Paul M Harrison; Anuj Kumar; Ning Lang; Michael Snyder; Mark Gerstein
Journal:  Nucleic Acids Res       Date:  2002-03-01       Impact factor: 16.971

3.  Computational inference of homologous gene structures in the human genome.

Authors:  R F Yeh; L P Lim; C B Burge
Journal:  Genome Res       Date:  2001-05       Impact factor: 9.043

4.  Gene structure prediction and alternative splicing analysis using genomically aligned ESTs.

Authors:  Z Kan; E C Rouchka; W R Gish; D J States
Journal:  Genome Res       Date:  2001-05       Impact factor: 9.043

5.  The human genome browser at UCSC.

Authors:  W James Kent; Charles W Sugnet; Terrence S Furey; Krishna M Roskin; Tom H Pringle; Alan M Zahler; David Haussler
Journal:  Genome Res       Date:  2002-06       Impact factor: 9.043

6.  Epsilon-tubulin is an essential component of the centriole.

Authors:  Susan K Dutcher; Naomi S Morrissette; Andrea M Preble; Craig Rackley; John Stanga
Journal:  Mol Biol Cell       Date:  2002-11       Impact factor: 4.138

7.  Identification and characterization of multi-species conserved sequences.

Authors:  Elliott H Margulies; Mathieu Blanchette; David Haussler; Eric D Green
Journal:  Genome Res       Date:  2003-12       Impact factor: 9.043

8.  The UCSC Genome Browser Database.

Authors:  D Karolchik; R Baertsch; M Diekhans; T S Furey; A Hinrichs; Y T Lu; K M Roskin; M Schwartz; C W Sugnet; D J Thomas; R J Weber; D Haussler; W J Kent
Journal:  Nucleic Acids Res       Date:  2003-01-01       Impact factor: 16.971

9.  GAZE: a generic framework for the integration of gene-prediction data by dynamic programming.

Authors:  Kevin L Howe; Tom Chothia; Richard Durbin
Journal:  Genome Res       Date:  2002-09       Impact factor: 9.043

10.  Regulation of sex-specific differentiation and mating behavior in C. elegans by a new member of the DM domain transcription factor family.

Authors:  Robyn Lints; Scott W Emmons
Journal:  Genes Dev       Date:  2002-09-15       Impact factor: 11.361
