Paul Flicek, Michael R Brent.
Abstract
BACKGROUND: As part of the ENCODE Genome Annotation Assessment Project (EGASP), we developed the MARS extension to the Twinscan algorithm. MARS is designed to find human alternatively spliced transcripts that are conserved in only one or a limited number of extant species. MARS is able to use an arbitrary number of informant sequences and predicts a number of alternative transcripts at each gene locus.
Year: 2006 PMID: 16925842 PMCID: PMC1810557 DOI: 10.1186/gb-2006-7-s1-s8
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Submitted versus updated prediction characteristics
| | ESn | ESp | TSn | TSp | GSn | GSp |
| Predictions submitted to EGASP | 69.3% | 65.8% | 18.2% | 17.8% | 38.0% | 28.3% |
| Updated MARS algorithm | 74.4% | 45.1% | 19.4% | 10.4% | 40.6% | 33.0% |
A comparison of the predictive accuracy for the MARS genes submitted to the EGASP workshop and those produced by the updated MARS algorithm. The columns give sensitivity and specificity at the coding exon (ESn/ESp), coding transcript (TSn/TSp), and gene (GSn/GSp) levels.
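The page does not spell out how these measures are computed. The following is a minimal sketch, assuming the convention common in gene prediction evaluation, in which sensitivity is the fraction of annotated features that were predicted correctly and specificity is the fraction of predicted features that are correct. The function name `exon_accuracy`, the tuple layout, and the exact-boundary matching rule are illustrative assumptions, not taken from the paper.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) for two collections of exons.

    Assumption: each exon is a (chrom, start, end, strand) tuple, and a
    predicted exon is correct only if it matches an annotated exon exactly.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)
    # Sensitivity: share of annotated exons that were recovered.
    sensitivity = true_pos / len(annotated) if annotated else 0.0
    # Specificity (as used here): share of predicted exons that are correct.
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy data, invented for illustration only.
annotated = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 650, "+"),
    ("chr1", 700, 800, "+"),
]
predicted = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 510, 650, "+"),  # wrong start, so not counted as correct
]
sn, sp = exon_accuracy(predicted, annotated)
print(f"ESn = {sn:.1%}, ESp = {sp:.1%}")
```

The same function would apply at the transcript or gene level if each feature were represented by a hashable description of its full structure.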
Pair-wise prediction characteristics
| | Mouse | Rat | Dog | Chicken | Frog | Opossum |
| Predicted transcripts | 486 | 476 | 530 | 431 | 422 | 467 |
| Exons per transcript | 7.55 | 7.62 | 6.82 | 11.02 | 11.28 | 8.54 |
The total number of predicted transcripts in the 44 ENCODE regions and the number of coding exons per transcript for each of the six informant sources in the MARS informant set.
Aligned fraction of the ENCODE regions
| | Mouse | Rat | Dog | Chicken | Frog | Opossum |
| Whole regions | 15.2% | 14.7% | 28.8% | 2.8% | 2.0% | 5.6% |
| Coding sequence | 87.5% | 85.0% | 87.4% | 53.0% | 50.1% | 76.1% |
A comparison of the total fraction of bases aligned in the 44 ENCODE regions and the fraction of bases aligned in the coding portion of the GENCODE annotation for each of the informant sources in the MARS informant set. See Materials and methods for the alignment protocol.
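The two rows of this table can be read as two coverage ratios: aligned bases over all bases in a region, and aligned coding bases over all annotated coding bases. The sketch below illustrates that arithmetic on half-open intervals. It is not the alignment protocol from Materials and methods, and the function names and the interval representation are assumptions made for illustration.

```python
def merge(intervals):
    """Merge overlapping half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def total_length(intervals):
    """Total number of bases covered by merged intervals."""
    return sum(end - start for start, end in intervals)


def overlap_length(a, b):
    """Bases shared by two lists of merged, sorted intervals."""
    i = j = shared = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            shared += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


def aligned_fractions(region_length, aligned, coding):
    """Return (whole-region fraction, coding-sequence fraction)."""
    aligned, coding = merge(aligned), merge(coding)
    whole = total_length(aligned) / region_length
    cds = overlap_length(aligned, coding) / total_length(coding)
    return whole, cds


# Toy coordinates, invented for illustration only.
whole, cds = aligned_fractions(
    region_length=1000,
    aligned=[(100, 200), (150, 250), (600, 650)],
    coding=[(120, 220), (580, 680)],
)
print(f"whole region: {whole:.1%}, coding sequence: {cds:.1%}")
```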
Figure 1. Pair-wise predictive accuracy for each of the six sequences in the informant set. Sensitivity and specificity are shown relative to the GENCODE annotations for Twinscan predictions based on the mouse (blue), rat (red), dog (brown), chicken (green), frog (purple), and opossum (orange) informant sequences. Gene level accuracy (triangles), transcript level accuracy (squares), and coding exon level accuracy (circles) are presented.
Figure 2. The information gain for the informant alignments with respect to the training set annotations for the six informant sequences. The information gain in the coding portion of the model is displayed in blue with the scale on the left-hand side of the graph. The information gain for the translation initiation and termination signals is displayed in red with the scale on the right-hand side of the graph.
Figure 3. Gene accuracy versus informant subset size. The effect of informant subset size on gene level sensitivity and specificity compared to the GENCODE annotations.
Effect of increasing informant subset size
| Pair-wise prediction characteristics | | | | | | |
| Informant set size | Two | Three | Four | Five | Six | Annotation |
| Average transcripts per gene | 1.76 | 2.44 | 3.05 | 3.62 | 4.15 | 2.25 |
| Average exons per transcript | 9.31 | 9.67 | 9.94 | 10.16 | 10.35 | 8.64 |
The number of coding transcripts per gene and the number of coding exons per transcript both increase with the cardinality of the informant set. The gene level accuracy also increases with informant set size (see Figure 3).
Transcripts common to several informant sources
| Prediction accuracy for transcripts common to several informants | | | | | | |
| | ESn | ESp | TSn | TSp | GSn | GSp |
| All transcripts | 74.4% | 45.1% | 19.4% | 10.4% | 40.6% | 33.0% |
| Common transcripts | 43.0% | 71.6% | 15.1% | 29.6% | 32.0% | 35.0% |
| Mammalian transcripts | 40.0% | 77.1% | 14.8% | 33.7% | 31.3% | 37.7% |
A comparison of the predictive accuracy for all MARS genes, with those having at least two transcripts predicted with identical structure from more than one informant source across the entire informant set, and with those having two transcripts with identical structures from at least two mammalian informant sources. Columns are defined as in Table 1.
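The filtering described in this caption can be sketched as grouping predicted transcripts by their exon structure and keeping the structures supported by more than one informant. The code below is a hedged illustration only: the function `common_transcripts`, the data layout, and the choice of which informants count as mammalian (here mouse, rat, dog, and opossum) are assumptions, not the MARS implementation.

```python
from collections import defaultdict

# Assumption: the mammals among the six informants listed on this page.
MAMMALS = {"mouse", "rat", "dog", "opossum"}


def common_transcripts(predictions, min_informants=2, allowed=None):
    """Return {structure: sorted informants} for shared structures.

    predictions: iterable of (informant, structure) pairs, where structure
    is a sequence of (start, end) coding exons. A structure is kept when at
    least `min_informants` distinct informants predicted it identically.
    If `allowed` is given, only those informants are considered.
    """
    support = defaultdict(set)
    for informant, structure in predictions:
        if allowed is None or informant in allowed:
            support[tuple(structure)].add(informant)
    return {
        structure: sorted(informants)
        for structure, informants in support.items()
        if len(informants) >= min_informants
    }


# Toy predictions, invented for illustration only.
predictions = [
    ("mouse", ((100, 200), (300, 400))),
    ("rat", ((100, 200), (300, 400))),
    ("chicken", ((100, 200), (300, 450))),
    ("frog", ((100, 200), (300, 450))),
    ("dog", ((100, 200),)),
]
common = common_transcripts(predictions)
mammalian = common_transcripts(predictions, allowed=MAMMALS)
print(len(common), "common structures;", len(mammalian), "mammalian")
```

In this toy input the chicken and frog predictions agree with each other, so their structure counts as common but not as mammalian, while the single dog prediction is dropped from both sets.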