Abstract
BACKGROUND: The number of sequenced eukaryotic genomes is rapidly increasing. This means that over time it will be hard to keep supplying customised gene finders for each genome. This calls for procedures to automatically generate species-specific gene finders and to re-train them as the quantity and quality of reliable gene annotation grows.
Year: 2006 PMID: 16712739 PMCID: PMC1522026 DOI: 10.1186/1471-2105-7-263
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Overview of gene model. Each box represents a separately modelled gene structure element. Red boxes are weight array matrices. Black boxes are length-modelled elements. For clarity, the intron models that ensure that splicing does not introduce stop codons are not shown.
Figure 2. Example ADPH distribution and probability graph. ADPH distribution with four phases and associated probability graph. The distribution describes the probability of passing from state one to the absorbing state, A, in a given number of steps. The example constitutes the special case of a mixture of an exponential and three negative binomial distributions. This arises when the loop probabilities are equal.
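The step-count probabilities described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the standard discrete phase-type formulation: a start distribution `alpha` over the phases, a transition matrix `T` between phases, and whatever probability a row of `T` leaves over going to the absorbing state A. The function name `adph_pmf`, the chain layout, and the numeric loop and forward probabilities are hypothetical choices for illustration and are not taken from the paper.

```python
def adph_pmf(alpha, T, max_len):
    """Return [P(N=1), ..., P(N=max_len)], where N is the number of steps
    needed to reach the absorbing state A."""
    p = len(alpha)
    # Probability of jumping from each phase straight to A in one step.
    exit_probs = [1.0 - sum(row) for row in T]
    state = list(alpha)  # probability of being in each phase before the step
    pmf = []
    for _ in range(max_len):
        # Absorb now: be in phase i, then take the exit from i.
        pmf.append(sum(state[i] * exit_probs[i] for i in range(p)))
        # Otherwise move on (or loop) among the phases.
        state = [sum(state[i] * T[i][j] for i in range(p)) for j in range(p)]
    return pmf


# Hypothetical four-phase acyclic chain with equal loop probabilities:
# each phase loops with 0.9, advances with 0.05, and the rest goes to A.
loop, forward = 0.9, 0.05
T = [[loop if i == j else forward if j == i + 1 else 0.0 for j in range(4)]
     for i in range(4)]
alpha = [1.0, 0.0, 0.0, 0.0]  # always start in state one

pmf = adph_pmf(alpha, T, 2000)
```

Because `T` is upper triangular, a phase can never be revisited once it has been left, which is what makes the distribution acyclic.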
Figure 3. Example fittings. Example ADPH fittings to length distribution of D. melanogaster gene structure elements. The plots each show actual length distribution in red and ADPH fit in blue.
Performance evaluation. NSN: nucleotide sensitivity, NSP: nucleotide specificity, ESN: exon sensitivity, ESP: exon specificity, ME: missed exons, WE: wrong exons. NSN is defined as the percentage of annotated coding bases predicted as coding and NSP as the percentage of predicted coding bases annotated as coding. ESN and ESP reflect analogously how well the methods predict exons exactly right. Superscripts on the species names indicate the type of generated gene model. 1: Full model. 2: 3' UTR exons and introns not modelled. 3: No UTR exons and introns modelled. 4: UTR exons and introns as well as UTR part of first and last exons not modelled. Subscripts indicate whether shared length distributions are used. 1: No shared distributions. 2: Internal and single coding exons share distribution. 3: All coding exons share distribution.
| Species | Predictor | NSN | NSP | ESN | ESP | ME | WE |
| | Agene | | | | | | |
| | GeneID | 95 | 86 | 75 | 68 | 6 | 16 |
| | Agene | | | | | | |
| | Augustus | 85 | 91 | 67 | 72 | 15 | 11 |
| | Agene | | | | | | |
| | Genscan | 82 | 83 | 49 | 51 | 17 | 18 |
| | Agene | | | | | | |
| | GeneID | 24 | 95 | 14 | 83 | 76 | 8 |
| | Agene | | | | | | |
| | GeneID | 88 | 86 | 49 | 50 | 18 | 19 |
| | Agene | | | | | | |
| | Genscan | 87 | 60 | 63 | 47 | 14 | 39 |
| | Agene | | | | | | |
| | Genscan | 88 | 82 | 69 | 68 | 14 | 17 |
| | Agene | | | | | | |
| | Genscan | 91 | 87 | 67 | 69 | 12 | 10 |
| | Agene | | | | | | |
| | Agene | | | | | | |
| | Agene | | | | | | |
| | Agene | | | | | | |
| | Agene | | | | | | |
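The measures reported in the performance table can be sketched as follows. NSN, NSP, ESN and ESP follow the definitions given in the caption. The caption does not define ME and WE, so this sketch assumes the common convention that a missed exon is an annotated exon overlapping no predicted exon, and a wrong exon is a predicted exon overlapping no annotated exon. The function names and the toy coordinates are hypothetical; exons are inclusive `(start, end)` pairs on a single strand.

```python
def coding_bases(exons):
    """Set of all positions covered by a list of inclusive (start, end) exons."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def overlaps(exon, others):
    """True if the exon shares at least one base with any exon in others."""
    start, end = exon
    return any(start <= o_end and o_start <= end for o_start, o_end in others)


def pct(numerator, denominator):
    return 100.0 * numerator / denominator if denominator else 0.0


def evaluate(annotated, predicted):
    ann_bases = coding_bases(annotated)
    pred_bases = coding_bases(predicted)
    shared = len(ann_bases & pred_bases)
    exact = len(set(annotated) & set(predicted))  # both boundaries correct
    missed = sum(1 for e in annotated if not overlaps(e, predicted))
    wrong = sum(1 for e in predicted if not overlaps(e, annotated))
    return {
        "NSN": pct(shared, len(ann_bases)),
        "NSP": pct(shared, len(pred_bases)),
        "ESN": pct(exact, len(annotated)),
        "ESP": pct(exact, len(predicted)),
        "ME": pct(missed, len(annotated)),   # assumed definition, see above
        "WE": pct(wrong, len(predicted)),    # assumed definition, see above
    }


# Toy example: one exact exon, one exon with a shifted end, one miss, one wrong.
annotated = [(1, 100), (201, 300), (401, 500)]
predicted = [(1, 100), (201, 290), (601, 650)]
scores = evaluate(annotated, predicted)
```

In the toy example the second exon counts towards nucleotide-level sensitivity and specificity but not towards ESN or ESP, since its end coordinate is off.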
Figure 4. Logos of donor and acceptor splice sites. A graphic representation of aligned donor and acceptor splice sites. The relative heights of letters correspond to frequencies of bases at each position. The degree of sequence conservation is reflected in the total height of a stack of letters, measured in bits of information.
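The letter and stack heights described in the Figure 4 caption can be computed along these lines. This is a minimal sketch assuming the usual sequence-logo convention for DNA, where the stack height at a position is 2 bits minus the Shannon entropy of the base frequencies, and each letter takes a share of the stack equal to its frequency. It omits the small-sample correction that logo tools often apply. The function name `logo_heights` and the example donor sites are made up for illustration.

```python
import math
from collections import Counter


def logo_heights(sites):
    """For equal-length aligned DNA sites, return one dict per position
    mapping each base to its letter height in bits."""
    columns = []
    for column in zip(*sites):
        counts = Counter(column)
        n = len(column)
        freqs = {base: counts[base] / n for base in "ACGT"}
        # Shannon entropy of the column; absent bases contribute nothing.
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        information = 2.0 - entropy  # total stack height in bits
        columns.append({base: f * information for base, f in freqs.items()})
    return columns


# Hypothetical aligned donor sites (not data from the paper).
donors = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
heights = logo_heights(donors)
```

A fully conserved position, such as the leading G in the example, reaches the maximum stack height of 2 bits, while a position with uniform base frequencies would have height 0.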
Cross-species performance. Performance of Agene for C. elegans on a selection of other test species. The percentages shown are the differences in performance relative to the versions of Agene that are generated for the species in question. NSN: nucleotide sensitivity, NSP: nucleotide specificity, ESN: exon sensitivity, ESP: exon specificity, ME: missed exons, WE: wrong exons.
| Species | NSN | NSP | ESN | ESP | ME | WE |
| | -21 | 0 | -29 | -10 | 22 | 1 |
| | -8 | -5 | -21 | -17 | 8 | 8 |
| | -7 | -19 | -34 | -45 | 5 | 25 |
| | -26 | -3 | -52 | -27 | 16 | 6 |
| | -33 | -2 | -33 | -8 | 31 | 1 |
| | -21 | -7 | -58 | -38 | 13 | 13 |
Figure 5. Performance as a function of training set size. The plot shows the nucleotide and exon sensitivity and specificity as well as missed and wrong exons as a function of the number of genes in the training set.