Deepank R Korandla [1,2,3], Jacob M Wozniak [4,5], Anaamika Campeau [4,5], David J Gonzalez [4,5], Erik S Wright [3].
1. Department of Biological Sciences, USA.
2. Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
3. Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA 15219, USA.
4. Department of Pharmacology, University of California San Diego, La Jolla, CA 92093, USA.
5. Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California San Diego, La Jolla, CA 92093, USA.
Abstract
MOTIVATION: A core task of genomics is to identify the boundaries of protein-coding genes, which may cover over 90% of a prokaryote's genome. Several programs are available for gene finding, yet it is currently unclear how well these programs perform and whether any offers superior accuracy. This is in part because there is no universal benchmark for gene finding, so most developers select their own benchmarking strategy.
RESULTS: Here, we introduce AssessORF, a new approach for benchmarking prokaryotic gene predictions based on evidence from proteomics data and the evolutionary conservation of start and stop codons. We applied AssessORF to compare gene predictions offered by GenBank, GeneMarkS-2, Glimmer and Prodigal on genomes spanning the prokaryotic tree of life. Gene predictions were 88-95% in agreement with the available evidence. Glimmer performed the worst, but there was no clear winner. All programs were biased towards selecting start codons that were upstream of the actual start. Given these findings, there remains considerable room for improvement, especially in the detection of correct start sites.
AVAILABILITY AND IMPLEMENTATION: AssessORF is available as an R package via the Bioconductor package repository.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.