Mario Stanke, Ana Tzvetkova, Burkhard Morgenstern.
Abstract
BACKGROUND: A large number of gene prediction programs for the human genome exist. These annotation tools use a variety of methods and data sources. In the recent ENCODE genome annotation assessment project (EGASP), some of the most commonly used and recently developed gene-prediction programs were systematically evaluated and compared on test data from the human genome. AUGUSTUS was among the tools that were tested in this project.
Year: 2006 PMID: 16925833 PMCID: PMC1810548 DOI: 10.1186/gb-2006-7-s1-s11
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Annotation of the protein coding regions of a part of the human ENCODE region ENm007. The line labeled 'VEGA_Known' shows one known gene on the forward strand. The ab initio program AUGUSTUS (labeled 'AUGUSTUS') predicts this gene almost correctly but completely misses the 9th exon annotated around position 318,600. Furthermore, as an ab initio program, AUGUSTUS predicts a false positive gene on the reverse strand around position 310,000. The lines labeled 'hints' show the hints derived from a comparison to the mouse genome. The height of the rectangles depends on their estimated reliability. The hints indicate the presence of an exon where AUGUSTUS missed the annotated exon. Also, there are no hints about coding regions where AUGUSTUS predicted a gene on the reverse strand. When the given hints are used by AUGUSTUS (labeled 'AUGUSTUS+mouse'), the missed exon is correctly predicted and the false positive gene is not predicted anymore. The former is a consequence of the bonus effect and the latter a consequence of the malus effect. Note that the hint about the exon around position 318,600 was helpful, although that exon is more likely to be on the reverse strand according to the hints alone. This plot has been obtained using gff2ps [28].
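The bonus and malus effects described in the Figure 1 caption can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the `Exon` type, the `rescore` and `predict` functions, the multiplicative scoring rule, the bonus and malus values, the threshold, and the exact exon boundaries are all hypothetical and are not taken from AUGUSTUS or from the paper. Only the approximate positions (around 318,600 and around 310,000) come from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exon:
    start: int
    end: int
    score: float  # hypothetical ab initio score; higher means more likely


def overlaps(exon, hint):
    """True if the exon and the hint interval share at least one position."""
    hint_start, hint_end = hint
    return exon.start <= hint_end and hint_start <= exon.end


def rescore(exon, hints, bonus=10.0, malus=0.5):
    """Assumed rule: multiply by `bonus` if a hint supports the exon, else by `malus`."""
    supported = any(overlaps(exon, h) for h in hints)
    return exon.score * (bonus if supported else malus)


def predict(candidates, hints=None, threshold=1.0):
    """Keep candidates whose score reaches the threshold.

    With hints=None the raw ab initio score is used (no bonus, no malus).
    """
    if hints is None:
        return [e for e in candidates if e.score >= threshold]
    return [e for e in candidates if rescore(e, hints) >= threshold]


# Made-up candidates loosely mirroring the two cases in the caption.
candidates = [
    Exon(318_550, 318_650, 0.4),  # weak ab initio signal, but a hint covers it
    Exon(309_950, 310_050, 1.5),  # strong ab initio signal, no hint nearby
]
hints = [(318_560, 318_640)]  # hypothetical hint interval from the mouse comparison

ab_initio = predict(candidates)        # keeps only the unsupported candidate
with_hints = predict(candidates, hints)  # keeps only the hinted candidate
```

In this toy setting, the hinted candidate is rescued by the bonus and the unsupported candidate falls below the threshold because of the malus, which is the qualitative behaviour the caption attributes to 'AUGUSTUS+mouse'.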
Figure 2. A syntenic human-mouse sequence pair and its DIALIGN alignment. Each sequence contains one gene with five exons (only CDS shown). The fragments are segment pairs with high similarity at the protein level.