Prediction of genetic structure in eukaryotic DNA using reference point logistic regression and sequence alignment.

P M Hooper, H Zhang, D S Wishart.

Abstract

MOTIVATION: Current software tools are moderately effective in predicting genetic structure (exons, introns, intergenic regions, and complete genes) from raw DNA sequence data. Improvements in accuracy and speed are needed to deal with the increasing volume of data from large scale sequencing projects.
RESULTS: We present a two-stage computer program to predict genetic structure in eukaryotic DNA. The first stage makes use of a novel statistical technique, called reference point logistic (RPL) regression, to calculate scores for potential functional sites. These site scores are combined with interval content, length, and state scores, via a Generalized Hidden Markov Model, to determine a combined score for each possible parse of a given DNA sequence into exons, introns, and intergenic regions. An optimal parse is found using a dynamic programming algorithm. In the second stage, protein sequence alignment methods are applied to improve the accuracy of the initial parse. Computation in the first stage of the program is very fast (1 s on a 360 MHz CPU for a 16 kb sequence) and its predictive accuracy typically matches or exceeds the best results reported for other methods (Sensitivity = 0.93 and Specificity = 0.93 for the Burset/Guigó test set). Computation in the second stage is slower, but the final predictions are more accurate (Sn = 0.97, Sp = 0.97). The program (called GRPL) can handle partial, single, and multi-gene sequences. The program is also capable of predicting the genetic structure of vertebrate, invertebrate, and plant DNA with nearly equal accuracy. Statistical techniques have also been introduced to model the effects of varying C+G content in a continuous manner and to control overfitting of parameters for smaller training sets.
AVAILABILITY: An academic implementation of GRPL, compiled for SUN workstations, is available by anonymous ftp from snipe.pharmacy.ualberta.ca/pub. The training and test sets used in this work, together with supplementary material, can be found at the same location. A commercial implementation is available as a component of GeneTool (BioTools Inc., http://biotools.com).
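
The RESULTS paragraph describes scoring every possible parse of a sequence into exons, introns, and intergenic regions and finding the optimal one by dynamic programming. The sketch below illustrates that general idea (a segment-based, or semi-Markov, Viterbi recursion) on a toy input. It is not the GRPL implementation: the function names, the allowed-transition table, and the toy scoring function are all assumptions made for illustration, and the RPL site scores and the content, length, and state scores mentioned in the abstract are collapsed into a single hypothetical `segment_score` callback.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple

NEG_INF = float("-inf")
Segment = Tuple[str, int, int]  # (state, start, end), half-open interval


def best_parse(
    n: int,
    states: Sequence[str],
    allowed: Set[Tuple[str, str]],
    segment_score: Callable[[str, int, int], float],
    max_len: Optional[int] = None,
) -> Tuple[float, List[Segment]]:
    """Return the highest-scoring parse of positions [0, n) into segments.

    best[j][s] holds the best score of any parse of [0, j) whose last
    segment is in state s; back[j][s] remembers where that segment
    started and which state preceded it.
    """
    max_len = max_len or n
    best = [dict.fromkeys(states, NEG_INF) for _ in range(n + 1)]
    back = [dict.fromkeys(states) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for s in states:
            for i in range(max(0, j - max_len), j):
                seg = segment_score(s, i, j)
                if i == 0:
                    # Any state may open the parse (partial genes allowed).
                    candidates = [(seg, None)]
                else:
                    candidates = [
                        (best[i][p] + seg, p)
                        for p in states
                        if (p, s) in allowed
                    ]
                for score, prev in candidates:
                    if score > best[j][s]:
                        best[j][s] = score
                        back[j][s] = (i, prev)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[n][s])
    total = best[n][state]
    parse: List[Segment] = []
    j = n
    while j > 0:
        i, prev = back[j][state]
        parse.append((state, i, j))
        j, state = i, prev
    parse.reverse()
    return total, parse


def toy_score(labels: str, penalty: float = 0.5) -> Callable[[str, int, int], float]:
    """Hypothetical stand-in for the combined scores: +1 per position whose
    toy label matches the state, -1 per mismatch, minus a per-segment penalty."""
    code = {"intergenic": "N", "exon": "E", "intron": "I"}

    def score(state: str, i: int, j: int) -> float:
        hits = sum(ch == code[state] for ch in labels[i:j])
        return hits - ((j - i) - hits) - penalty

    return score


STATES = ["intergenic", "exon", "intron"]
# Assumed grammar: segments alternate, and introns sit between exons.
ALLOWED = {
    ("intergenic", "exon"),
    ("exon", "intron"),
    ("intron", "exon"),
    ("exon", "intergenic"),
}
LABELS = "NNNNEEEEEIIIIEEEENNN"  # toy per-position evidence, not real DNA

total, parse = best_parse(len(LABELS), STATES, ALLOWED, toy_score(LABELS))
for state, start, end in parse:
    print(f"{state:<10} [{start}, {end})")
```

The recursion considers every segment start for every end position, so this sketch runs in time quadratic in the sequence length (times the number of state pairs). The optional `max_len` argument caps segment length to reduce that cost, but setting it below the sequence length can leave some states unreachable, which this sketch does not guard against.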

Year:  2000        PMID: 10871265     DOI: 10.1093/bioinformatics/16.5.425

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  4 in total

Review 1.  The BioTools Suite. A comprehensive suite of platform-independent bioinformatics tools.

Authors:  D S Wishart; S Fortin
Journal:  Mol Biotechnol       Date:  2001-09       Impact factor: 2.695

Review 2.  Current methods of gene prediction, their strengths and weaknesses.

Authors:  Catherine Mathé; Marie-France Sagot; Thomas Schiex; Pierre Rouzé
Journal:  Nucleic Acids Res       Date:  2002-10-01       Impact factor: 16.971

3.  Gene organization features in A/T-rich organisms.

Authors:  Karol Szafranski; Rüdiger Lehmann; Genis Parra; Roderic Guigo; Gernot Glöckner
Journal:  J Mol Evol       Date:  2005-01       Impact factor: 2.395

4.  Comparative genomics in cyprinids: common carp ESTs help the annotation of the zebrafish genome.

Authors:  Alan Christoffels; Richard Bartfai; Hamsa Srinivasan; Hans Komen; Laszlo Orban
Journal:  BMC Bioinformatics       Date:  2006-12-18       Impact factor: 3.169
