Qian Liu, Koby Crammer, Fernando C N Pereira, David S Roos.
Abstract
BACKGROUND: Most gene finders score candidate gene models with state-based methods, typically HMMs, by combining local properties (coding potential, splice donor and acceptor patterns, etc.). Competing models with similar state-based scores may be distinguishable with additional information. In particular, functional and comparative genomics datasets may help to select among competing models of comparable probability by exploiting features likely to be associated with the correct gene models, such as conserved exon/intron structure or protein sequence features.
Year: 2008 PMID: 18854050 PMCID: PMC2587481 DOI: 10.1186/1471-2105-9-433
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Number of candidate gene models per gene locus and number of exons per gene in Drosophila melanogaster. Blue bars provide a histogram showing the number of candidate gene models per locus, as identified by Evigan-5g. The red scatter plot shows the number of candidate gene models per locus versus the number of exons per gene (the average number of exons per candidate where multiple candidates are predicted). Note that only a few candidate models are suggested for most genes; genes with many predicted candidate models typically contain many exons.
Identification of D. melanogaster genes suitable for model reranking
| 13,669 | |
| Genes with multiple Evigan-5g candidate models | 11,701 |
| Genes with putative orthologs in | 9,125 |
| Intersection (genes with multiple candidate models and putative orthologs) | 7,975 |
| Training set (2.5% of intersection, randomly selected) | 198 |
| Test set (used for Table | 7,777 |
| Genes where ReRanker-5g selected the highest probability Evigan-5g model | 6,031 |
| Genes where ReRanker-5g selected a lower probability Evigan-5g model (used for Table | 1,746 |
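The counts in the table above can be cross-checked with simple arithmetic. The sketch below assumes that the training and test sets together partition the intersection, and that the two ReRanker-5g rows partition the test set; the variable names are illustrative and not taken from the source.

```python
# Counts copied from the table above.
intersection = 7975   # genes with multiple candidate models and putative orthologs
test_set = 7777       # test set size
top_ranked = 6031     # ReRanker-5g kept the highest-probability Evigan-5g model
lower_ranked = 1746   # ReRanker-5g chose a lower-probability Evigan-5g model

# Assumption: training set = intersection minus test set.
training_set = intersection - test_set
training_fraction = training_set / intersection

# The two ReRanker-5g outcome rows should add up to the test set.
assert top_ranked + lower_ranked == test_set

print(training_set, round(100 * training_fraction, 1))
```

Under these assumptions the training set holds 198 genes, about 2.5% of the intersection, which agrees with the percentage stated in the table.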
Gene-finding performance for various algorithms.
| sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | |
| Augustus | 47.0 | 50.9 | 37.6 | 50.9 | 70.8 | 78.8 | 53.5 | 66.4 | 77.6 | 81.8 | 70.9 | 83.2 | 61.9 | 72. |
| CONTRAST | 48.8 | 51.9 | 39.2 | 51.9 | 69.7 | 80.8 | 57.4 | 70.6 | 74.2 | 84.6 | 69.7 | 80.8 | 68.9 | 78.0 |
| Geneid | 35.9 | 41.4 | 29.3 | 41.4 | 65.7 | 71.4 | 47.0 | 60.9 | 75.6 | 73.9 | 59.2 | 72.8 | 54.6 | 73.7 |
| Genie | 40.7 | 50.0 | 31.9 | 50.0 | 58.2 | 77.9 | 44.1 | 63.7 | 63.1 | 82.7 | 58.8 | 80.2 | 58.7 | 68.8 |
| Genscan | 31.4 | 35.7 | 24.9 | 35.7 | 61.3 | 61.6 | 42.4 | 54.6 | 70.8 | 61.6 | 54.1 | 65.9 | 58.7 | 76.9 |
| Evigan-5g | 54.6 | 58.9 | 43.8 | 58.9 | 73.7 | 84.4 | 61.0 | 74.6 | 78.7 | 87.5 | 72.9 | 84.6 | 70.7 | 85.6 |
| GeneWise | 29.4 | 31.0 | 25.0 | 31.0 | 58.3 | 73.9 | 41.8 | 56.7 | 69.5 | 48.5 | 59.4 | 32.3 | 30.6 | |
| Augustus+ | 53.3 | 57.0 | 43.5 | 57.0 | 73.0 | 81.1 | 58.3 | 72.2 | 84.0 | 71.6 | 83.3 | 65.2 | 73.0 | |
| Evigan-6g | 56.3 | 60.7 | 45.1 | 60.7 | 85.2 | 61.4 | 75.4 | 88.3 | 73.5 | 85.7 | 70.5 | 84.7 | ||
Performance on the entire D. melanogaster test set of 7777 loci (see Table 1). Augustus, CONTRAST, Geneid, Genie and Genscan are ab initio predictors used as evidence sources for Evigan-5g. ReRanker-5g selects among K-best gene models produced by Evigan-5g with cross-species information. GeneWise, Augustus+ and Evigan-6g are other comparative gene predictors or approaches. Bold indicates where ReRanker-5g outperforms Evigan-5g; italics indicates where other comparative approaches outperform ReRanker-5g (see text).
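The reranking step described above (choosing one of the K-best Evigan-5g candidates using cross-species information) can be illustrated with a minimal sketch. The linear scoring rule, the feature names (`log_prob`, `shared_splice_fraction`), and the weights are all assumptions made for illustration; this section does not specify the actual ReRanker-5g model or features.

```python
from typing import Dict, List


def rerank(candidates: List[dict], weights: Dict[str, float]) -> dict:
    """Return the candidate with the highest weighted feature sum.

    Hypothetical scoring: a linear combination of per-candidate features.
    """
    def score(candidate: dict) -> float:
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate["features"].items())

    return max(candidates, key=score)


# Two made-up candidate gene models for one locus (illustrative values only).
candidates = [
    {"id": "model_1", "features": {"log_prob": -1.0, "shared_splice_fraction": 0.4}},
    {"id": "model_2", "features": {"log_prob": -1.2, "shared_splice_fraction": 0.9}},
]
weights = {"log_prob": 1.0, "shared_splice_fraction": 2.0}

best = rerank(candidates, weights)
print(best["id"])
```

In this toy case the candidate with the slightly lower model probability is selected because its cross-species feature outweighs the difference, which is the kind of situation the second performance table isolates.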
Gene-finding performance for genes where ReRanker-5g differs from Evigan-5g.
| sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | |
| Evigan-5g | 11.9 | 14.4 | 8.7 | 14.4 | 64.5 | 76.8 | 42.3 | 58.4 | 73.3 | 81.5 | 55.8 | 73.1 | 3.6 | 17.4 |
| GeneWise | 18.8 | 21.5 | 14.4 | 21.5 | 56.4 | 75.7 | 32.5 | 49.2 | 66.9 | 41.2 | 55.9 | 9.9 | 14.7 | |
| Augustus+ | 76.2 | 62.2 | 80.1 | 74.3 | 23.0 | |||||||||
| Evigan-6g | 19.0 | 23.1 | 13.7 | 23.1 | 79.1 | 43.8 | 60.7 | 83.3 | 58.7 | 77.3 | 2.8 | 17.4 | ||
Performance on the 1746 loci where ReRanker-5g selected a lower probability Evigan-5g model based on cross-species comparison. Note that ReRanker-5g improves on Evigan-5g across the board; italics indicates where other comparative approaches outperform ReRanker-5g (see text).
Figure 2. Performance by rank on Drosophila melanogaster. The table on the bottom right shows the number of loci where ReRanker-5g selects Evigan-5g candidate gene models of a given rank. For example, there are 6031 loci where ReRanker-5g selects the most probable candidate model as defined by Evigan-5g, 820 loci where it selects the second to fifth most probable candidate models, and so on. The other panels show the F-score (harmonic mean of sensitivity and specificity) of Evigan-5g and ReRanker-5g at the exon, transcript and gene levels for various rank ranges. ReRanker-5g improves the identification of correct gene models even when the selected candidates are far from the top of the list provided by Evigan-5g.
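The F-score used in the figure is the harmonic mean of sensitivity and specificity. A minimal sketch of that formula follows; the function name is illustrative, and the example values are the first sn%/sp% pair from the Evigan-5g row of the full test-set performance table (the column grouping is not recoverable from the table as shown, so no level is asserted for them).

```python
def f_score(sensitivity: float, specificity: float) -> float:
    """Harmonic mean of sensitivity and specificity (same units as the inputs)."""
    if sensitivity + specificity == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * sensitivity * specificity / (sensitivity + specificity)


# First sn%/sp% pair of the Evigan-5g row: 54.6 and 58.9.
evigan_f = f_score(54.6, 58.9)
print(round(evigan_f, 1))
```

Because the harmonic mean is dominated by the smaller of the two inputs, a predictor cannot reach a high F-score by trading one measure entirely for the other.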
Gene-finding performance for D. melanogaster genes with D. pseudoobscura EST evidence.
| sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | sn% | sp% | |
| Evigan-5g | 71.0 | 74.6 | 59.2 | 74.6 | 78.1 | 88.9 | 72.0 | 86.3 | 81.8 | 90.0 | 77.9 | 90.0 | 81.4 | 89.2 |
Performance on the 1191 D. melanogaster loci whose putative orthologs on D. pseudoobscura are supported by EST sequences (see text for details). Note that ReRanker-5g improves on Evigan-5g across the board (improvement indicated by bold).
Figure 3. Inferring shared splice sites from alignment. Blue boxes represent segments (local alignments) produced by DiAlign [44] between the coding sequences of two gene models, and the wavy lines represent unaligned regions. Arrows represent mapped splice sites. The first and third pairs of overlapping splice sites are identified as shared splice sites.
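The idea in the figure can be sketched as follows: a splice site in one coding sequence is mapped through an aligned segment onto the other sequence, and counted as shared if it lands on a splice site there. The `Segment` type, the ungapped-segment simplification, the `tolerance` parameter, and all coordinates are assumptions made for illustration; they do not reproduce DiAlign output or the authors' exact procedure.

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    """Hypothetical ungapped local alignment: a[start_a + i] pairs with b[start_b + i]."""
    start_a: int
    start_b: int
    length: int


def map_position(pos_a: int, segments: List[Segment]) -> Optional[int]:
    """Map a position in sequence A onto sequence B, or None if it is unaligned."""
    for seg in segments:
        offset = pos_a - seg.start_a
        if 0 <= offset < seg.length:
            return seg.start_b + offset
    return None


def shared_splice_sites(sites_a: List[int], sites_b: List[int],
                        segments: List[Segment],
                        tolerance: int = 0) -> List[Tuple[int, int]]:
    """Pair splice sites of A and B that coincide after mapping through a segment."""
    shared = []
    for site_a in sites_a:
        mapped = map_position(site_a, segments)
        if mapped is None:
            continue  # site falls in an unaligned region
        for site_b in sites_b:
            if abs(site_b - mapped) <= tolerance:
                shared.append((site_a, site_b))
                break
    return shared


# Toy data: two aligned segments with an unaligned gap between them.
segments = [Segment(0, 0, 100), Segment(150, 130, 80)]
sites_a = [60, 120, 200]
sites_b = [60, 110, 180]
result = shared_splice_sites(sites_a, sites_b, segments)
print(result)
```

In this toy example the first and third splice-site pairs are reported as shared, while the middle site of sequence A lies in an unaligned region and is skipped, mirroring the situation drawn in the figure.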