Daniel S Standage, Volker P Brendel.
Abstract
BACKGROUND: Accurate gene structure annotation is a fundamental but somewhat elusive goal of genome projects, as witnessed by the fact that (model) genomes typically undergo several cycles of re-annotation. In many cases, it is not only different versions of annotations that need to be compared but also different sources of annotation of the same genome, derived from distinct gene prediction workflows. Such comparisons are of interest to annotation providers, prediction software developers, and end-users, who all need to assess what is common and what is different among distinct annotation sources. We developed ParsEval, a software application for pairwise comparison of sets of gene structure annotations. ParsEval calculates several statistics that highlight the similarities and differences between the two sets of annotations provided. These statistics are presented in an aggregate summary report, with additional details provided as individual reports specific to non-overlapping, gene-model-centric genomic loci. Genome browser styled graphics embedded in these reports help visualize the genomic context of the annotations. Output from ParsEval is both easily read and parsed, enabling systematic identification of problematic gene models for subsequent focused analysis.
Year: 2012 PMID: 22852583 PMCID: PMC3439248 DOI: 10.1186/1471-2105-13-187
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Annotation comparison methods
| Method | Pros | Cons |
|---|---|---|
| Manual comparison | minimal overhead | extremely tedious; error prone; unscalable |
| Genome browser | intuitive interface; visual assessment of individual loci | visual assessments imprecise; extensive overhead; little or no automation |
| Eval | detailed statistics; visual assessment of statistic distributions; scales fairly well for large data sets; can compare multiple predictions to a single reference | older software; relatively slow; only summary statistics are reported, while stats for individual loci are discarded |
| ParsEval | detailed statistics provided, not only as a summary but for individual loci as well; scales well for large data sets; fast, efficient, and portable | only capable of comparing a single pair of annotations |
Various approaches to comparing alternative sources of gene structure annotations, with a brief description of the associated pros and cons.
Figure 1. Associating Annotations with Gene Loci. The black bar provides a scale corresponding to a genomic region for which two sets of annotations are available. Reference annotations for gene structure are represented with red glyphs, while prediction annotations are shown with blue glyphs. Arrows indicate the strand of the gene annotation, and different levels of shading correspond to different gene structure features: dark shading for coding sequence, medium shading for UTRs, and light shading for introns. Green brackets denote gene loci as determined by the common practice of using only the genomic coordinates from reference gene annotations. Orange brackets denote gene loci as determined by ParsEval, which takes into account both reference and prediction annotations when selecting distinct loci for comparison.
Figure 2. Locus Identification Using a Gene Interval Graph. Red and blue nodes in this interval graph correspond to reference and prediction gene annotations (respectively) as shown in Figure 1. Two nodes are connected by an edge if the corresponding gene annotations overlap. Each connected component in the graph represents a distinct gene locus, defined as the smallest genomic region containing every gene annotation associated with the corresponding subgraph. In this example, nodes representing five reference annotations and four prediction annotations are shown. The four connected components in the graph correspond to four gene loci, for which precise genomic coordinates can be determined from the associated genes (shown in orange brackets in Figure 1).
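The locus definition in the Figure 2 caption can be illustrated with a minimal sketch. It is not ParsEval's own code: the `Gene` type, the `identify_loci` function, and all coordinates are hypothetical, and it assumes a single sequence, 1-based inclusive coordinates, and no strand filtering. Because the graph is an interval graph, sweeping over genes sorted by start coordinate produces the same groups as its connected components.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    source: str  # "reference" or "prediction" (hypothetical labels)
    start: int   # 1-based, inclusive (assumed convention)
    end: int


def identify_loci(genes):
    """Group overlapping gene annotations into loci.

    Each locus is the smallest region containing every gene that is
    connected, directly or transitively, by overlaps.
    """
    loci = []
    for gene in sorted(genes, key=lambda g: (g.start, g.end)):
        if loci and gene.start <= loci[-1]["end"]:
            # Overlaps the current locus: extend it and record the gene.
            loci[-1]["end"] = max(loci[-1]["end"], gene.end)
            loci[-1]["genes"].append(gene)
        else:
            # No overlap: this gene starts a new locus.
            loci.append({"start": gene.start, "end": gene.end, "genes": [gene]})
    return loci


# Invented coordinates: five reference and four prediction annotations,
# chosen only so that the counts mirror the caption.
genes = [
    Gene("reference", 100, 500), Gene("reference", 900, 1400),
    Gene("reference", 1500, 1900), Gene("reference", 2500, 3000),
    Gene("reference", 4000, 4500),
    Gene("prediction", 150, 480), Gene("prediction", 1300, 1600),
    Gene("prediction", 2600, 2900), Gene("prediction", 4200, 4700),
]
loci = identify_loci(genes)
for locus in loci:
    print(locus["start"], locus["end"], len(locus["genes"]))
```

In this toy input the second locus contains two reference genes that do not overlap each other; they are joined only through the prediction that overlaps both, which is the situation where reference-only locus boundaries would split the comparison.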
Figure 3. Integrating ParsEval Reports with a Genome Browser. Screenshot of the Arabidopsis thaliana genome browser at Phytozome (http://phytozome.net/), with a custom anonymous user track populated by ParsEval output. Boxes in this custom track represent loci identified by ParsEval and are color-coded according to the level of agreement between the two sets of annotations compared (dark red and pastel blue glyphs, respectively). This custom track can easily be configured so that features are hyperlinked to ParsEval reports containing detailed comparison statistics.
| Statistic | AUGUSTUS manuscript | ParsEval (AUGUSTUS 2.5.5) |
|---|---|---|
| Coding nucleotide sensitivity | 0.93 | 0.94 |
| Coding nucleotide specificity | 0.90 | 0.99 |
| Exon sensitivity | 0.80 | 0.81 |
| Exon specificity | 0.81 | 0.86 |
| Gene sensitivity | 0.48 | 0.43 |
| Gene specificity | 0.47 | 0.46 |
Sensitivity and specificity scores for AUGUSTUS gene predictions in comparison to corresponding gene annotations from EMBL database release 50. The first column shows scores as reported in the original AUGUSTUS manuscript. The second column shows scores as computed by ParsEval using predictions from the latest version of AUGUSTUS (2.5.5). Summary reports from ParsEval provide immediate access to a wide variety of similarity statistics, including the ones reported in this table. Differences between the scores reported by the AUGUSTUS authors and the ParsEval authors are likely due to subsequent updates of the AUGUSTUS program since its publication.
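For readers unfamiliar with these measures, the following sketch shows how nucleotide-level scores of this kind are commonly computed. The formulas are an assumption: the text above does not state them, and the sketch uses the convention usual in gene prediction, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP). The function names and coordinates are hypothetical.

```python
def coding_positions(intervals):
    """Expand (start, end) coding intervals (1-based, inclusive) into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def nucleotide_accuracy(reference_cds, predicted_cds):
    """Return (sensitivity, specificity) at the coding-nucleotide level.

    Assumed definitions: sensitivity = TP / (TP + FN),
    specificity = TP / (TP + FP), as is conventional in gene prediction.
    """
    ref = coding_positions(reference_cds)
    pred = coding_positions(predicted_cds)
    true_positives = len(ref & pred)
    sensitivity = true_positives / len(ref) if ref else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


# Invented example: the prediction misses the first 20 coding nucleotides
# of the reference but predicts nothing outside it.
sn, sp = nucleotide_accuracy([(1, 100)], [(21, 100)])
print(sn, sp)
```

Expanding intervals into position sets is simple but memory-hungry; it is adequate for a single locus, while a genome-scale tool would more plausibly work on interval arithmetic directly.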
| Classification | Comparisons | Percentage |
|---|---|---|
| Perfect matches | 22,333 | 94.7% |
| CDS structure matches | 0 | 0.0% |
| Exon structure matches | 0 | 0.0% |
| UTR structure matches | 83 | 0.4% |
| Non-matches | 1,174 | 5.0% |
| Total | 23,590 | 100.0% |
Results from a ParsEval comparison of gene annotations for Mus musculus from two recent releases of the Ensembl database (releases 64 and 65). Release 64 contains 22,507 gene annotations, while release 65 contains 14,486 gene annotations. ParsEval identified 20,362 gene loci using these two data sets, 6,725 of which contained only annotations from release 64. For the 13,637 gene loci for which both release 64 and 65 have annotations, 23,590 comparisons were performed. Each of these comparisons was classified according to how well the annotations from the two releases agreed. This table shows a breakdown of these results.
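The figures in this table and caption can be cross-checked with a few lines of arithmetic. The snippet below is only a consistency check on the published counts, not part of ParsEval; the variable names are hypothetical.

```python
# Comparison counts as listed in the table above.
counts = {
    "Perfect matches": 22333,
    "CDS structure matches": 0,
    "Exon structure matches": 0,
    "UTR structure matches": 83,
    "Non-matches": 1174,
}

total = sum(counts.values())
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Loci annotated in both releases = all loci minus those seen only in release 64.
shared_loci = 20362 - 6725

print(total, shared_loci)
for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```

The class counts sum to the stated 23,590 comparisons and the locus counts agree with the caption. The rounded percentages add up to 100.1 rather than 100.0, which is a rounding effect only.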
| | TAIR9 | | FlyBase 5.39 | | NCBI Entrez | | UCSC knownGene (hg19) | |
|---|---|---|---|---|---|---|---|---|
| | TAIR10 | | Ensembl r65 | | JGI / Phytozome | | Ensembl r65 | |
| | 36.3 | 859.4 | 91.1 | 1,350.5 | 85.3 | 1,461.1 | 294.3 | 6,422.0 |
| | 32.8 | 449.2 | 56.6 | 859.5 | 79.4 | 768.4 | 181.3 | 4,089.5 |
| | 30.7 | 246.5 | 39.2 | 633.7 | 76.5 | 439.9 | 130.1 | 2,751.2 |
| | 29.8 | 168.7 | 32.4 | 546.6 | 76.3 | 330.5 | 108.0 | 2,323.3 |
| | 25,618 | | 10,976 | | 47,877 | | 17,865 | |
| | 25,590 | | 10,944 | | 37,942 | | 7,779 | |
| | 6 | | 32 | | 3,363 | | 9,569 | |
| | 22 | | 0 | | 6,572 | | 517 | |
| | 33,002 | | 22,474 | | 38,734 | | 16,168 | |
| | 31,750 | 96.2% | 22,446 | 99.9% | 2,489 | 6.4% | 2,517 | 15.6% |
| | 420 | 1.3% | 0 | 0.0% | 17,450 | 45.1% | 8,269 | 51.1% |
| | 8 | 0.0% | 21 | 0.1% | 26 | 0.1% | 27 | 0.2% |
| | 159 | 0.5% | 1 | 0.0% | 647 | 1.7% | 58 | 0.4% |
| | 665 | 2.0% | 6 | 0.0% | 18,122 | 46.8% | 5,297 | 32.8% |