David Carter, Richard Durbin.
Abstract
BACKGROUND: One way in which the accuracy of gene structure prediction in vertebrate DNA sequences can be improved is by analyzing alignments with multiple related species, since functional regions of genes tend to be more conserved.
Year: 2006 PMID: 16925840 PMCID: PMC1810555 DOI: 10.1186/gb-2006-7-s1-s6
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Alignment for a coding splice acceptor site. The figure shows the central part of a typical alignment window used by the classifier component of DOGFISH. Codon boundaries on the exon side of the splice site are indicated with dots. This site has an alignment with all species except frog. Species abbreviations: hs, Homo sapiens; mm, Mus musculus; rn, Rattus norvegicus; cf, Canis familiaris; gg, Gallus gallus; dr, Danio rerio; fr, Fugu rubripes. The AG dinucleotide for the acceptor site itself is shown in bold.
Prediction accuracies for vertical and horizontal components
| | Acceptors | Donors | Starts | Stops |
| Train set size | 204,021 | 221,421 | 7,571 | 25,071 |
| Eval set size | 52,605 | 57,179 | 1,805 | 6,162 |
| %True sites | 14.05 | 13.01 | 16.68 | 8.08 |
| F score | | | | |
| Presence | 52.72 | 48.77 | 39.70 | 34.64 |
| Vertical | 82.01 | 81.00 | 55.70 | 49.25 |
| Horizontal | 84.36 | 84.43 | 57.01 | 48.22 |
| Both | 84.86 | 84.60 | 58.22 | 49.60 |
| ENCODE Cl | 63.18 | 65.86 | 27.44 | 14.67 |
| ENCODE GF | 80.23 | 81.38 | 42.47 | 50.49 |
| 100%-ROC error | | | | |
| Presence | 12.41 | 12.66 | 20.62 | 23.98 |
| Vertical | 2.46 | 2.52 | 14.49 | 12.76 |
| Horizontal | 1.81 | 1.58 | 12.48 | 11.77 |
| Both | 1.74 | 1.54 | 10.41 | 10.90 |
| ENCODE Cl | 0.99 | 0.61 | 9.14 | 10.49 |
The table shows the F score (geometric mean of sensitivity and specificity, which are close to each other) for various classifier components. The test set for the presence, vertical, horizontal and 'both' conditions is 'challenging' data; we show results for a mixture of the classifiers trained on challenging and randomly selected data. The 'ENCODE Cl' and 'ENCODE GF' lines are for the 31 ENCODE test regions, using classifier scores and gene-finder scores, respectively. The table also shows the 100%-ROC (receiver operating characteristic) error value for each condition. This error value is the probability that if a true instance and a decoy are selected at random, the classifier will give the decoy a higher score than the true instance.
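The two measures described in this caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DOGFISH code: the function names and the toy scores are hypothetical, and counting tied scores as half an error is an assumption the caption does not address.

```python
import math
from itertools import product


def f_score(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity, as the caption defines F."""
    return math.sqrt(sensitivity * specificity)


def roc_error(true_scores, decoy_scores):
    """Probability that a randomly chosen decoy outscores a randomly chosen true site.

    Computed by exhaustive pairing; ties count as half an error (an assumption).
    """
    errors = 0.0
    for t, d in product(true_scores, decoy_scores):
        if d > t:
            errors += 1.0
        elif d == t:
            errors += 0.5
    return errors / (len(true_scores) * len(decoy_scores))


# Hypothetical classifier scores, for illustration only.
true_scores = [0.9, 0.8, 0.4]
decoy_scores = [0.7, 0.3, 0.1, 0.05]
print(roc_error(true_scores, decoy_scores))  # 1 of the 12 pairs is misordered
print(f_score(0.64, 1.0))
```

The exhaustive pairing is quadratic in the number of sites, so for evaluation sets of the size listed in the table a rank-based computation would be the practical choice; the pairwise form is shown because it mirrors the wording of the caption directly.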
Error rates broken down by site type
| | Acceptors | Donors | Starts | Stops |
| Overall | 4.45 | 4.08 | 17.61 | 15.86 |
| Coding | 4.34 | 3.88 | 17.61 | 15.86 |
| Non-coding | 7.48 | 10.00 | NA | NA |
| Overall | 4.45 | 4.08 | 17.35 | 15.84 |
| Coding | 2.34 | 1.08 | 23.19 | 2.92 |
| Non-coding intra | 2.54 | 2.21 | 12.17 | 16.58 |
| Non-coding inter | 8.00 | 8.31 | 18.05 | 16.07 |
The table shows the proportion (in the challenging test set) of various site types that received an incorrect classification. The classification threshold is adjusted to achieve roughly equal proportions of false positives and false negatives. NA: not applicable.
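One way to picture the threshold adjustment mentioned in this caption is a simple scan over candidate thresholds. The sketch below is an assumption about how such balancing could be done, not the procedure the authors used: `balanced_threshold` is a hypothetical helper, the scores are made up, and it balances raw counts of false positives and false negatives rather than any particular normalised proportion.

```python
def balanced_threshold(true_scores, decoy_scores):
    """Pick the threshold at which false negatives and false positives are closest in number.

    A site is called positive when its score is >= the threshold.
    Returns (threshold, false_positives, false_negatives).
    """
    candidates = sorted(set(true_scores) | set(decoy_scores))
    best = None
    for threshold in candidates:
        false_negatives = sum(s < threshold for s in true_scores)
        false_positives = sum(s >= threshold for s in decoy_scores)
        gap = abs(false_negatives - false_positives)
        # Keep the first (lowest) threshold that achieves the smallest gap.
        if best is None or gap < best[0]:
            best = (gap, threshold, false_positives, false_negatives)
    return best[1], best[2], best[3]


# Hypothetical scores, for illustration only.
true_scores = [0.9, 0.8, 0.6, 0.3]
decoy_scores = [0.7, 0.5, 0.2, 0.1, 0.05]
threshold, fp, fn = balanced_threshold(true_scores, decoy_scores)
error_rate = 100.0 * (fp + fn) / (len(true_scores) + len(decoy_scores))
print(threshold, fp, fn, error_rate)
```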
Phase prediction error rates on coding splice sites
| | Acceptors | Donors |
| Vertical | 3.84 | 3.02 |
| Horizontal | 5.17 | 4.79 |
| Both | 1.99 | 1.60 |
The table shows the percentage (in the challenging test set) of coding splice sites for which the coding phase that was assigned the highest probability was not the annotated phase.
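The error measure in this caption amounts to an argmax comparison. A minimal sketch, assuming each coding splice site comes with a probability for each of the three coding phases (the data layout and the function name are hypothetical):

```python
def phase_error_rate(phase_probs, annotated_phases):
    """Percentage of sites whose highest-probability phase differs from the annotation.

    phase_probs: one (p0, p1, p2) tuple per site.
    annotated_phases: the annotated phase (0, 1 or 2) for each site.
    """
    wrong = 0
    for probs, annotated in zip(phase_probs, annotated_phases):
        predicted = max(range(len(probs)), key=lambda phase: probs[phase])
        if predicted != annotated:
            wrong += 1
    return 100.0 * wrong / len(annotated_phases)


# Hypothetical phase probabilities, for illustration only.
phase_probs = [(0.7, 0.2, 0.1), (0.1, 0.3, 0.6), (0.4, 0.5, 0.1), (0.2, 0.2, 0.6)]
annotated = [0, 2, 0, 2]
print(phase_error_rate(phase_probs, annotated))  # third site is mispredicted
```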
Prediction accuracies for different numbers of species
| | Acceptors | Donors | Starts | Stops |
| Train set size | 204,021 | 221,421 | 7,571 | 25,071 |
| Eval set size | 52,605 | 57,179 | 1,805 | 6,162 |
| F score | | | | |
| Human only | 66.78 | 67.25 | 35.34 | 22.20 |
| Human+mouse | 80.67 | 82.74 | 43.38 | 30.57 |
| All 4 mammals | 82.53 | 83.99 | 44.02 | 31.88 |
| All 8 species | 84.31 | 84.82 | 51.45 | 34.93 |
| ROC error rate | | | | |
| Human only | 5.22 | 4.31 | 18.30 | 20.03 |
| Human+mouse | 2.45 | 1.93 | 13.18 | 15.54 |
| All 4 mammals | 2.21 | 1.81 | 11.77 | 14.75 |
| All 8 species | 1.76 | 1.54 | 10.53 | 11.68 |
The table shows the F score (geometric mean of sensitivity and specificity) and ROC error rate (area not under the ROC curve) for the horizontal component of Classifier Two trained on different numbers of informant species and running on the challenging evaluation (Eval) set. All scores are percentages.
Exon and transcript accuracies
| | DOGFISH-1 | DOGFISH-2 |
| Exon sensitivity | 53.11 | 63.68 |
| Exon specificity | 77.34 | 84.90 |
| Transcript sensitivity | 5.08 | 8.94 |
| Transcript specificity | 14.61 | 33.12 |
The table shows percentage sensitivity and specificity at the exon and transcript levels for the workshop version, DOGFISH-1, and the current version, DOGFISH-2.
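For readers unfamiliar with exon-level scoring, the sketch below shows one common convention. It rests on assumptions not stated in this caption: that an exon counts as correct only when both boundaries match an annotated exon exactly, and that "specificity" is used in the gene-prediction sense of the fraction of predicted exons that are correct. The coordinates and the function name are hypothetical.

```python
def exon_accuracy(predicted, annotated):
    """Exon-level sensitivity and specificity as percentages.

    Exons are (start, end) tuples; only exact boundary matches count as correct.
    Specificity here is correct predictions / all predictions (an assumed convention).
    """
    predicted, annotated = set(predicted), set(annotated)
    correct = len(predicted & annotated)
    sensitivity = 100.0 * correct / len(annotated)
    specificity = 100.0 * correct / len(predicted)
    return sensitivity, specificity


# Hypothetical exon coordinates, for illustration only.
annotated = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (510, 600)]
print(exon_accuracy(predicted, annotated))
```

Under this convention a transcript would count as correct only if every one of its exons matched, which is consistent with transcript-level figures being much lower than exon-level ones.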
Figure 2. DOGFISH-2E results. (a) Sensitivity and specificity for DOGFISH-2E output. The figure shows plots of sensitivity against specificity on the ENCODE test regions as the acceptance probability threshold is varied for internal exons, external (initial and terminal) exons, and all exons together. 'X' is used to mark the DOGFISH-2 sensitivity and specificity values, and the specificity value of 95% for almost 50% sensitivity is highlighted. (b) Probability of annotation as a function of DOGFISH-2E estimate. The figure shows DOGFISH-2E probability estimates on the x axis and, on the y axis, the probability that a site with a DOGFISH-2E estimate of the given magnitude is annotated in ENCODE and Ensembl, respectively. The Y = X line is shown for comparison.
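The comparison in panel (b) is a calibration check: estimates are grouped by magnitude and the observed fraction of annotated sites in each group is compared with the Y = X line. A minimal sketch of that idea, assuming equal-width bins (the bin count, the data, and the function name are hypothetical, and the figure may have been produced differently):

```python
def calibration_table(estimates, annotated, n_bins=5):
    """Group probability estimates into equal-width bins and report the annotated fraction.

    Returns a list of (bin_low, bin_high, annotated_fraction) for non-empty bins.
    """
    bins = [[0, 0] for _ in range(n_bins)]  # [site count, annotated count] per bin
    for p, is_annotated in zip(estimates, annotated):
        index = min(int(p * n_bins), n_bins - 1)  # an estimate of 1.0 goes in the last bin
        bins[index][0] += 1
        bins[index][1] += int(is_annotated)
    rows = []
    for i, (count, hits) in enumerate(bins):
        if count:
            rows.append((i / n_bins, (i + 1) / n_bins, hits / count))
    return rows


# Hypothetical estimates and annotation flags, for illustration only.
estimates = [0.1, 0.15, 0.5, 0.55, 0.9, 0.95, 1.0]
annotated = [False, False, True, False, True, True, True]
for low, high, fraction in calibration_table(estimates, annotated):
    print(f"{low:.1f}-{high:.1f}: {fraction:.2f}")
```

A well-calibrated estimator gives an annotated fraction close to the midpoint of each bin, which corresponds to points lying near the Y = X line.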
Figure 3. Mean RVM weights for horizontal and vertical component inputs. The figure shows the means, with p = 0.05 two-tail error bars, for weights assigned to inputs by acceptor site-type-pair RVMs in Classifier Two, averaging over all 20 pairings of decoy with true site types. The presence component has a single score. Two-letter abbreviations are used for the species-specific scores output by the horizontal component, while the vertical-component quantities are for eight 25 base-pair subregions (only six of which ever get non-zero scores) with one gap score. Species abbreviations are as in Figure 1.