Mario Stanke, Oliver Keller, Irfan Gunduz, Alec Hayes, Stephan Waack, Burkhard Morgenstern.
Abstract
AUGUSTUS is a software tool for gene prediction in eukaryotes based on a Generalized Hidden Markov Model, a probabilistic model of a sequence and its gene structure. Like most existing gene finders, the first version of AUGUSTUS returned one transcript per predicted gene and ignored the phenomenon of alternative splicing. Herein, we present a WWW server for an extended version of AUGUSTUS that is able to predict multiple splice variants. To our knowledge, this is the first ab initio gene finder that can predict multiple transcripts. In addition, we offer a motif searching facility, where user-defined regular expressions can be searched against putative proteins encoded by the predicted genes. The AUGUSTUS web interface and the downloadable open-source stand-alone program are freely available from http://augustus.gobics.de.
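The motif searching facility can be pictured with a minimal sketch. It is not the server's implementation: the function name `find_motifs`, the protein sequences and the choice of Python's `re` module are assumptions made for illustration, and only the transcript naming (g1.t1, g1.t2) follows the AUGUSTUS output shown in the figures. The pattern is an N-glycosylation-like motif written as a regular expression.

```python
import re

def find_motifs(proteins, pattern):
    """Return (protein_id, start, end, matched_text) for every regex hit.

    Coordinates are 0-based and end-exclusive. Overlapping hits are not
    reported, because re.finditer resumes after the end of each match.
    """
    regex = re.compile(pattern)
    hits = []
    for protein_id, sequence in proteins.items():
        for match in regex.finditer(sequence):
            hits.append((protein_id, match.start(), match.end(), match.group()))
    return hits

# Hypothetical putative proteins keyed by AUGUSTUS-style transcript IDs.
proteins = {
    "g1.t1": "MKTNGSAAPLNFTQW",
    "g1.t2": "MKTAAPLQW",
}

# N, then any residue except P, then S or T, then any residue except P.
hits = find_motifs(proteins, r"N[^P][ST][^P]")
for hit in hits:
    print(hit)
```

In this toy input only g1.t1 contains the motif, at two positions, so a user could see at a glance which splice variants encode it.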
Year: 2006 PMID: 16845043 PMCID: PMC1538822 DOI: 10.1093/nar/gkl200
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The human gene ATP5G1 and the AUGUSTUS ab initio prediction for this region. The first transcript (g1.t1) is also the one predicted by standard AUGUSTUS using the Viterbi algorithm only. It misses the second exon of the gene. The second transcript (g1.t2) contains that exon and is correct. The height of a box (black: exon, light gray: intron) reflects the posterior probability of that exon or intron: the higher the posterior probability, the higher the box.
Figure 2. Region of a human gene on the forward strand for which AUGUSTUS predicted six transcripts (gene SON, chromosome 21, 33 837 000–33 872 000, NCBI build 35). The long intron of transcript g1.t3 containing position 20 000 has low posterior probability. Thus, the model is unsure whether this region actually contains one, two or three genes. In fact, for this gene there exists EST evidence both for the short transcript g1.t1 and for longer transcripts with exons mostly agreeing with those predicted above.
Percentage of correctly predicted human exons and introns in coding regions grouped by their posterior probability
| Posterior probability p | Exon specificity |
|---|---|
| 0 ≤ p ≤ 50% | 46/242 ≈ 19.0% |
| 50% < p ≤ 70% | 132/356 ≈ 37.1% |
| 70% < p ≤ 80% | 84/175 ≈ 48.0% |
| 80% < p ≤ 90% | 171/275 ≈ 62.2% |
| 90% < p ≤ 95% | 140/195 ≈ 71.8% |
| 95% < p ≤ 99% | 338/422 ≈ 80.1% |
| 99% < p | 545/612 ≈ 89.1% |
| Total | 1456/2277 ≈ 63.9% |
For example, out of 2277 exons predicted by AUGUSTUS, 422 had a posterior probability between 0.95 and 0.99, of which 338 (80.1%) exactly matched an annotated exon. Here, AUGUSTUS was set to predict only one transcript per gene (no alternatives). The reference annotation was the ENCODE test set with 296 genes and 649 transcripts, which is a challenging test set: the exon-level specificities of AUGUSTUS, GENEID, GENEZILLA and GENSCAN are 63.9, 61.1, 50.3 and 46.4%, respectively.
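The grouping behind this table can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code used for the paper: the function name `specificity_by_bin`, the input format and the treatment of each bin as open at its lower edge and closed at its upper edge are assumptions, as are the example exons.

```python
from bisect import bisect_left

# Assumed upper edges of the bins: (0, 0.5], (0.5, 0.7], ..., (0.99, 1.0].
BIN_EDGES = [0.5, 0.7, 0.8, 0.9, 0.95, 0.99, 1.0]

def specificity_by_bin(exons):
    """Group predicted exons by posterior probability.

    exons: iterable of (posterior_probability, matches_annotation) pairs,
    with probabilities in [0, 1]. Returns one (correct, total, fraction)
    tuple per bin; fraction is None for an empty bin.
    """
    correct = [0] * len(BIN_EDGES)
    total = [0] * len(BIN_EDGES)
    for posterior, is_correct in exons:
        # Index of the first upper edge that is >= posterior.
        i = bisect_left(BIN_EDGES, posterior)
        total[i] += 1
        correct[i] += int(is_correct)
    return [(c, t, c / t if t else None) for c, t in zip(correct, total)]

# Hypothetical predictions: (posterior probability, exact match to annotation).
example = [(0.97, True), (0.96, False), (0.999, True), (0.3, False)]
result = specificity_by_bin(example)
for upper, row in zip(BIN_EDGES, result):
    print(f"p <= {upper}: {row}")
```

Applied to the counts in the table, the same ratio gives 338/422 ≈ 0.801 for the bin between 0.95 and 0.99.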
Accuracy values of variants of AUGUSTUS on the ENCODE test set with 296 genes and an average of 2.2 transcripts per gene
| | Single transcript | Few transcripts | Medium number of transcripts | Many transcripts |
|---|---|---|---|---|
| Gene sensitivity | 0.233 | 0.273 | 0.294 | 0.345 |
| Gene specificity | 0.170 | 0.200 | 0.211 | 0.239 |
| Transcript sensitivity | 0.106 | 0.125 | 0.137 | 0.165 |
| Transcript specificity | 0.170 | 0.144 | 0.112 | 0.053 |
| Exon sensitivity | 0.527 | 0.540 | 0.557 | 0.600 |
| Exon specificity | 0.639 | 0.615 | 0.571 | 0.490 |
| Base sensitivity | 0.775 | 0.779 | 0.790 | 0.814 |
| Base specificity | 0.764 | 0.757 | 0.738 | 0.705 |
| Average transcripts/gene | 1.0 | 1.4 | 2.0 | 5.1 |
We used AUGUSTUS with the original single-transcript option and with the new options for ‘few transcripts’, ‘medium number of transcripts’ and ‘many transcripts’.
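The exon-level measures in this table can be illustrated with a short sketch. The record does not spell out the definitions, so the following is an assumption based on the usual convention in gene-finding evaluation: sensitivity is the fraction of annotated exons that are predicted exactly, and specificity is the fraction of predicted exons that exactly match an annotated exon (consistent with 1456/2277 ≈ 63.9% reported above). The function name `exon_accuracy`, the tuple layout and the coordinates are hypothetical.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the exon level.

    Each exon is a hashable tuple such as (sequence, strand, start, end);
    a predicted exon counts as correct only if it matches exactly.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)
    sensitivity = true_positives / len(annotated) if annotated else 0.0
    specificity = true_positives / len(predicted) if predicted else 0.0
    return sensitivity, specificity

# Hypothetical coordinates, for illustration only.
annotated = [
    ("chr21", "+", 100, 200),
    ("chr21", "+", 300, 400),
    ("chr21", "+", 500, 600),
    ("chr21", "+", 700, 800),
]
predicted = [
    ("chr21", "+", 100, 200),  # exact match
    ("chr21", "+", 300, 400),  # exact match
    ("chr21", "+", 510, 600),  # wrong start, counts as incorrect
]

sensitivity, specificity = exon_accuracy(predicted, annotated)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```

Under these definitions, predicting more transcripts per gene adds predicted exons, which can raise sensitivity while lowering specificity, the trend visible across the columns of the table.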