An assessment of gene prediction accuracy in large DNA sequences.

R Guigó, P Agarwal, J F Abril, M Burset, J W Fickett.

Abstract

One of the first useful products from the human genome will be a set of predicted genes. Besides their intrinsic scientific interest, the accuracy and completeness of this data set are of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets on which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200-kb-long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX, was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
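The two ideas the abstract relies on, a semiartificial test sequence and sensitivity/specificity as accuracy measures, can be illustrated with a minimal Python sketch. It is not the authors' procedure. The function names, the uniform A/C/G/T composition of the random spacers, the fixed spacer length, and the half-open exon coordinates are all assumptions made for illustration. Following the usual convention in the gene prediction literature, sensitivity is taken here as the fraction of true coding nucleotides that are predicted, and specificity as the fraction of predicted coding nucleotides that are truly coding.

```python
import random


def build_semiartificial(genes, spacer_len, seed=0):
    """Join single-gene sequences with random intergenic spacers.

    genes: list of (sequence, exons); exons are (start, end) half-open
    coordinates relative to that gene's own sequence.
    Returns the combined sequence and the exons shifted into its coordinates.
    Assumption: spacers are drawn uniformly from A/C/G/T (hypothetical choice).
    """
    rng = random.Random(seed)
    parts, shifted, offset = [], [], 0
    for seq, exons in genes:
        # A random spacer precedes every gene.
        parts.append("".join(rng.choice("ACGT") for _ in range(spacer_len)))
        offset += spacer_len
        parts.append(seq)
        shifted.extend((start + offset, end + offset) for start, end in exons)
        offset += len(seq)
    return "".join(parts), shifted


def coding_positions(exons):
    """Expand half-open exon intervals into a set of nucleotide positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def nucleotide_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level."""
    truth = coding_positions(true_exons)
    predicted = coding_positions(predicted_exons)
    true_positives = len(truth & predicted)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Two toy single-exon "genes", each preceded by 5 random intergenic bases.
genes = [("ATGAAATAG", [(0, 9)]), ("ATGCCCTGA", [(0, 9)])]
sequence, true_exons = build_semiartificial(genes, spacer_len=5)

# A hypothetical prediction: the first exon is found, the second is missed,
# and 3 bases of random spacer are wrongly called coding.
predicted_exons = [(5, 14), (0, 3)]
sensitivity, specificity = nucleotide_accuracy(true_exons, predicted_exons)
```

In this toy setup, a false prediction inside a spacer lowers specificity without changing sensitivity, which is the kind of effect the abstract reports for ab initio prediction on longer sequences.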

Year: 2000; PMID: 11042160; PMCID: PMC310940; DOI: 10.1101/gr.122800

Source DB: PubMed; Journal: Genome Res; ISSN: 1088-9051; Impact factor: 9.043


