Literature DB >> 14565849

Eval: a software package for analysis of genome annotations.

Evan Keibler1, Michael R Brent.   

Abstract

SUMMARY: Eval is a flexible tool for analyzing the performance of gene annotation systems. It provides summaries and graphical distributions for many descriptive statistics about any set of annotations, regardless of their source. It also compares sets of predictions to standard annotations and to one another. Input is in the standard Gene Transfer Format (GTF). Eval can be run interactively or via the command line, in which case output options include easily parsable tab-delimited files. AVAILABILITY: To obtain the module package with documentation, go to http://genes.cse.wustl.edu/ and follow links for Resources, then Software. Please contact brent@cse.wustl.edu

Entities:  

Mesh:

Year:  2003        PMID: 14565849      PMCID: PMC270064          DOI: 10.1186/1471-2105-4-50

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Introduction

Automated gene annotation systems are typically based on large, complex probability models with thousands of parameters. Changing these parameters can change a system's performance as measured by the accuracy with which it reproduces the exons and gene structures in a standard annotation. While traditional sensitivity and specificity measures convey the accuracy of gene predictions [1,2], more information is often required for gaining insight into why a system is performing well or poorly. A deep analysis requires considering many features of a prediction set and its relation to the standard set, such as the distribution of number of exons per gene, the distribution of predicted exon lengths, and accuracy as a function of GC percentage. Such statistics can reveal which parameter sets are working well and which need tuning. We are not aware of any publicly available software systems that have this functionality. We therefore developed the Eval system to support detailed analysis and comparison of the large data sets generated by automated gene annotation systems [e.g., [3]].

Features

Statistics

Eval can generate a wide range of statistics showing the similarities and differences between a standard annotation set and a prediction set. It reports traditional performance measures, such as gene sensitivity and specificity, as well as measures focusing on specific features, including initial, internal, and terminal exons, and splice donor and acceptor sites (see Table 1 for a sampling of these statistics; for a complete list of all calculated statistics see online documentation). These specific measures can show why an annotation system is performing well or poorly on the traditional measures. They can also reveal specific weaknesses or strengths of the system – for example, that it is good at predicting the boundaries of genes but has problems with exon/intron structure because it does poorly on splice donor sites. Eval can also compute statistics on a single set of gene annotations (either predictions or standard annotations). These statistics reveal the average characteristics of the genes, such as their coding and genomic lengths, exon and intron lengths, number of exons, and so on. This is useful when tuning the parameters of annotation systems for optimal performance.
Table 1

A sampling of the less common statistics calculated by Eval when comparing the output of TWINSCAN and GENSCAN on the "semi-artificial" gene set used in [1] to the gold standard annotation. Standard statistics such as gene and exon sensitivity and specificity are also calculated but are not shown.

FeatureStatisticTWINSCANGENSCAN
TranscriptsExons Per Transcript6.465.93
CDS Overlap Specificity96.55%70.59%
CDS Overlap Sensitivity87.64%97.19%
All Introns Matched Specificity26.90%8.60%
All Introns Matched Sensitivity21.91%10.67%
Start and Stop Codon Specificity44.14%17.65%
Start and Stop Codon Sensitivity35.96%21.91%
Initial ExonsOverlap Specificity70.16%35.47%
Overlap Sensitivity77.54%73.91%
Terminal Exons5' Splice Specificity74.36%36.22%
5' Splice Sensitivity74.64%71.01%
Introns80% Overlap Specificity73.11%48.07%
80% Overlap Sensitivity80.19%72.58%
NucleotidesCorrect Specificity84.61%64.76%
Correct Sensitivity84.26%88.87%
Splice AcceptorsCorrect Specificity77.23%52.69%
Correct Sensitivity84.90%81.30%
Splice DonorsCorrect Specificity76.18%53.02%
Correct Sensitivity84.63%80.19%
Start CodonsCorrect Specificity61.97%34.90%
Correct Sensitivity49.44%37.64%
Stop CodonsCorrect Specificity82.22%47.95%
Correct Sensitivity62.36%58.99%
A sampling of the less common statistics calculated by Eval when comparing the output of TWINSCAN and GENSCAN on the "semi-artificial" gene set used in [1] to the gold standard annotation. Standard statistics such as gene and exon sensitivity and specificity are also calculated but are not shown.

Plots

Eval can also produce two types of plots. One type is a histogram showing the distribution of a statistic. Histograms are useful for determining whether the annotation system is producing specific types of genes and exons in the expected proportions. For example, suppose that the average number of exons per gene in an automated annotation is slightly below that of a standard annotation. Comparing the two distributions can reveal whether that difference is due to an insufficient fraction of predictions with extremely large exon counts or an insufficient fraction with slightly above-average exon counts (Fig. 1a). The other type of plot categorizes exons or genes by their length or GC content and shows the statistic for each category. For example, plotting transcript sensitivity as a function of transcript length might reveal that an annotation system is performing poorly on long genes but well on short ones (Fig. 1b). Further analysis would be needed to determine whether this effect is due to intron length or exon count.
Figure 1

Panel A. Distributions of exons-per-gene for TWINSCAN [4] and GENSCAN [5] gene predictions and RefSeq mRNA sequences aligned to the genome. The plot reveals that, although TWINSCAN predicts too few genes in the 5–20 exon range, it predicts the right proportion of genes with more than 25 exons. Panel B. Fraction of RefSeq genes that TWINSCAN and GENSCAN predict exactly right, as a function of the genomic length of the RefSeq, excluding UTRs. Both figures were made in Excel by importing Eval output as tab-separated files. Data in both panes was generated using the NCBI34 version of the human genome and TWINSCAN 1.2.

Panel A. Distributions of exons-per-gene for TWINSCAN [4] and GENSCAN [5] gene predictions and RefSeq mRNA sequences aligned to the genome. The plot reveals that, although TWINSCAN predicts too few genes in the 5–20 exon range, it predicts the right proportion of genes with more than 25 exons. Panel B. Fraction of RefSeq genes that TWINSCAN and GENSCAN predict exactly right, as a function of the genomic length of the RefSeq, excluding UTRs. Both figures were made in Excel by importing Eval output as tab-separated files. Data in both panes was generated using the NCBI34 version of the human genome and TWINSCAN 1.2.

Multi-way comparisons (Venn diagrams)

Eval can also determine the similarities and differences among multiple annotation sets. For example, it can build clusters of genes or exons which share some property, such as being identical to each other or overlapping each other. Building clusters of identical genes from two gene predictors and a standard annotation can show how similar the predictors are in their correctly and incorrectly predicted genes. For example, it could reveal that the two programs predict the same or completely separate sets of correct and incorrect genes. If they predict correct gene sets with a small intersection and incorrect gene sets with a large intersection then they could be combined to create a system which has both a higher sensitivity and specificity than either one alone. Table 2 shows a different example - clustering of identical exons from the aligned human RefSeq mRNAs, TWINSCAN [3,4] predictions, and GENSCAN [5] predictions.
Table 2

The results of building a Venn diagram based on exact exon matches among the aligned RefSeqs, TWINSCAN 1.2 predictions, and GENSCAN predictions, on the NCBI34 build of the human genome. All exons are first combined into clusters that have the same begin and end points. These clusters are then partitioned into the subset of exons annotated only by RefSeq (R), the subset annotated only by TWINSCAN (T), the subset annotated only by GENSCAN (G), the subset annotated by RefSeq and TWINSCAN but not GENSCAN (RT), etc. For each of these subsets, the table shows the number of clusters in the subset. It also shows the percentage all exons from each of the input sets that is included in that subset. The last column shows the fraction of all clusters included in that subset.

Subset in partitionCluster Count% of RefSeq exons% of Twinscan exons% of Genscan exons% of all clusters
R29,68020.29%0.00%0.00%7.21%
T44,6720.00%22.04%0.00%10.84%
G166,7650.00%0.00%51.72%40.48%
RT15,14110.55%7.47%0.00%3.68%
RG12,8129.29%0.00%3.97%3.11%
TG57,7950.00%28.52%17.92%14.03%
RTG85,06959.88%41.97%26.38%20.65%
The results of building a Venn diagram based on exact exon matches among the aligned RefSeqs, TWINSCAN 1.2 predictions, and GENSCAN predictions, on the NCBI34 build of the human genome. All exons are first combined into clusters that have the same begin and end points. These clusters are then partitioned into the subset of exons annotated only by RefSeq (R), the subset annotated only by TWINSCAN (T), the subset annotated only by GENSCAN (G), the subset annotated by RefSeq and TWINSCAN but not GENSCAN (RT), etc. For each of these subsets, the table shows the number of clusters in the subset. It also shows the percentage all exons from each of the input sets that is included in that subset. The last column shows the fraction of all clusters included in that subset.

Extraction of subsets

Eval can also extract subsets of genes that meet specific criteria for further analysis. Sets of genes that match another gene set by any of the following criteria can be selected: exact match, genomic overlap, CDS overlap, all introns match, one or more introns match, one or more exons match, start codon match, stop codon match, start and stop codon match. Boolean combinations of these criteria can also be specified. For example, the set of RefSeq genes that are predicted correctly by System1 but not by System2 can be extracted from annotations of the entire human genome with just a few commands. Once extracted, gene sets can be inspected individually using standard visualization tools.

Implementation

Eval is written in Perl and uses the Tk Perl module for displaying its graphical user interface. It is intended to run on Linux based systems, although it also runs under Windows. It requires the gnuplot utility to display the graphs it produces, but it can create the graphs as text files without this utility. The package comes with both command line and graphical interface. The command line interface provides quick access to the functions, while the graphical interface provides easier, more efficient access when running multiple analyses on the same data sets. Annotations are submitted to Eval in GTF file format , a community standard developed in the course of several collaborative genome annotations projects [6,7]. As such it can be run on the output of any annotation system. The Eval package contains a GTF validator which verifies correct GTF file format and identifies common syntactic and semantic errors in annotation files. It also contains Perl libraries for parsing, storing, accessing, and modifying GTF files and comparing sets of GTF files. Although it is written in Perl, the Eval system runs relatively quickly. A standard Eval report comparing all TWINSCAN [3,4] genes predicted on the human genome to the aligned human RefSeqs processes ~40,000 transcripts and ~300,000 exons and completes in under five minutes on a machine with a 1.5 GHz Athlon processor and 2 GB of RAM.
  7 in total

1.  An assessment of gene prediction accuracy in large DNA sequences.

Authors:  R Guigó; P Agarwal; J F Abril; M Burset; J W Fickett
Journal:  Genome Res       Date:  2000-10       Impact factor: 9.043

2.  Integrating genomic homology into gene structure prediction.

Authors:  I Korf; P Flicek; D Duan; M R Brent
Journal:  Bioinformatics       Date:  2001       Impact factor: 6.937

3.  Initial sequencing and comparative analysis of the mouse genome.

Authors:  Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal:  Nature       Date:  2002-12-05       Impact factor: 49.962

4.  Evaluation of gene structure prediction programs.

Authors:  M Burset; R Guigó
Journal:  Genomics       Date:  1996-06-15       Impact factor: 5.736

5.  Genome annotation assessment in Drosophila melanogaster.

Authors:  M G Reese; G Hartzell; N L Harris; U Ohler; J F Abril; S E Lewis
Journal:  Genome Res       Date:  2000-04       Impact factor: 9.043

6.  Prediction of complete gene structures in human genomic DNA.

Authors:  C Burge; S Karlin
Journal:  J Mol Biol       Date:  1997-04-25       Impact factor: 5.469

7.  Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map.

Authors:  Paul Flicek; Evan Keibler; Ping Hu; Ian Korf; Michael R Brent
Journal:  Genome Res       Date:  2003-01       Impact factor: 9.043

  7 in total
  34 in total

1.  BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS.

Authors:  Katharina J Hoff; Simone Lange; Alexandre Lomsadze; Mark Borodovsky; Mario Stanke
Journal:  Bioinformatics       Date:  2015-11-11       Impact factor: 6.937

2.  MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes.

Authors:  Brandi L Cantarel; Ian Korf; Sofia M C Robb; Genis Parra; Eric Ross; Barry Moore; Carson Holt; Alejandro Sánchez Alvarado; Mark Yandell
Journal:  Genome Res       Date:  2007-11-19       Impact factor: 9.043

3.  Long-Read Annotation: Automated Eukaryotic Genome Annotation Based on Long-Read cDNA Sequencing.

Authors:  David E Cook; Jose Espejo Valle-Inclan; Alice Pajoro; Hanna Rovenich; Bart P H J Thomma; Luigi Faino
Journal:  Plant Physiol       Date:  2018-11-06       Impact factor: 8.340

4.  Closing in on the C. elegans ORFeome by cloning TWINSCAN predictions.

Authors:  Chaochun Wei; Philippe Lamesch; Manimozhiyan Arumugam; Jennifer Rosenberg; Ping Hu; Marc Vidal; Michael R Brent
Journal:  Genome Res       Date:  2005-04       Impact factor: 9.043

5.  Reference based annotation with GeneMapper.

Authors:  Sourav Chatterji; Lior Pachter
Journal:  Genome Biol       Date:  2006-04-05       Impact factor: 13.583

6.  Simultaneous gene finding in multiple genomes.

Authors:  Stefanie König; Lars W Romoth; Lizzy Gerischer; Mario Stanke
Journal:  Bioinformatics       Date:  2016-07-27       Impact factor: 6.937

7.  The Pinus taeda genome is characterized by diverse and highly diverged repetitive sequences.

Authors:  Allen Kovach; Jill L Wegrzyn; Genis Parra; Carson Holt; George E Bruening; Carol A Loopstra; James Hartigan; Mark Yandell; Charles H Langley; Ian Korf; David B Neale
Journal:  BMC Genomics       Date:  2010-07-07       Impact factor: 3.969

8.  MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects.

Authors:  Carson Holt; Mark Yandell
Journal:  BMC Bioinformatics       Date:  2011-12-22       Impact factor: 3.307

9.  ToPS: a framework to manipulate probabilistic models of sequence data.

Authors:  André Yoshiaki Kashiwabara; Igor Bonadio; Vitor Onuchic; Felipe Amado; Rafael Mathias; Alan Mitchell Durham
Journal:  PLoS Comput Biol       Date:  2013-10-03       Impact factor: 4.475

10.  ParsEval: parallel comparison and analysis of gene structure annotations.

Authors:  Daniel S Standage; Volker P Brendel
Journal:  BMC Bioinformatics       Date:  2012-08-01       Impact factor: 3.169

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.