Jing Wu.
Abstract
BACKGROUND: Computational gene prediction tools routinely generate large volumes of predicted coding exons (putative exons). One common limitation of these tools is their relatively low specificity, due to the large number of non-coding regions.
Year: 2008 PMID: 18831778 PMCID: PMC2559877 DOI: 10.1186/1471-2164-9-S2-S13
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary of the data sets.
| | clearly orthologous exons (TP) | potential non-exons (FP) | potential non-genes (FP) | RefSeq exons (TP) | RefSeq genes (TP) |
| size | 76,229 (1.2 × 10^7 bp) | 1,518,082 (8.3 × 10^8 bp) | -- | 172,042 (2.9 × 10^7 bp) | 20,193 |
| GENSCAN | -- | -- | -- | 117,860 | 3,497 |
| TWINSCAN | -- | -- | -- | 118,650 | 5,131 |
| GENSCAN (w/mouse) | 53,217 | 54,360 | 4,856 | 115,551 | 3,284 |
| TWINSCAN (w/mouse) | 54,879 | 12,276 | 1,172 | 117,100 | 4,944 |
| GENSCAN (w/dog) | 52,712 | 49,899 | -- | -- | -- |
| TWINSCAN (w/dog) | 54,257 | 11,095 | -- | -- | -- |
The first row lists the type of sequences in each data set. The second row (size) lists the number of sequences of each type and the corresponding number of base pairs. The GENSCAN row lists the number of exons predicted by GENSCAN with both ends matching RefSeq exons, and the number of genes predicted by GENSCAN that exactly match RefSeq genes. The GENSCAN (w/mouse) row is restricted to GENSCAN predictions that have full alignments with mouse. It lists the number of predicted exons with both ends matching clearly orthologous exons, the number of predicted exons with both ends within or matching potential non-exons, and the number of predicted genes whose exons all lie in potential non-exons. The GENSCAN (w/dog) row lists the same two exon counts for GENSCAN predictions that have full alignments with dog. The TWINSCAN, TWINSCAN (w/mouse), and TWINSCAN (w/dog) rows list the numbers of exons and genes collected in the same way from TWINSCAN's predictions.
Comparing the enhancement on putative exons with existing models. Results based on human-mouse sequence conservation.
| | clearly orthologous exons (TP) | potential non-exons (FP) |
| size | 76,229 | 1,518,082 |
| GENSCAN (w/mouse) | 53,217 (69.8%) | 54,360 (3.58%) |
| GENSCAN (w/mouse) | 52,682 (69.1%) | 14,604 (0.95%) |
| TWINSCAN (w/mouse) | 54,879 (72.0%) | 12,276 (0.8%) |
| TWINSCAN (w/mouse) | 54,331 (71.3%) | 7,876 (0.5%) |
| shortHMM | 74.5% | 0.77% |
The numbers of clearly orthologous exons and potential non-exons in the test set are listed in the size row. The GENSCAN and TWINSCAN rows list the numbers of putative exons provided by GENSCAN and TWINSCAN, respectively. The thresholds for GENSCAN and TWINSCAN are set so that 99% of the correct predictions of GENSCAN and TWINSCAN that have alignments are kept. The percentages in parentheses are the true positive and false positive rates relative to the sizes of the test sets. The shortHMM row is cited from [15].
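The percentages in parentheses can be reproduced from the counts with a few lines of arithmetic. The following Python sketch recomputes them for the first GENSCAN and TWINSCAN rows of the table; the function name `rates` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Test-set sizes and counts copied from the table; the layout is illustrative.
TEST_SET = {"tp": 76_229, "fp": 1_518_082}

FIRST_ROWS = {
    "GENSCAN (w/mouse)": (53_217, 54_360),
    "TWINSCAN (w/mouse)": (54_879, 12_276),
}


def rates(tp_count, fp_count, sizes=TEST_SET):
    """Return (TP rate, FP rate) in percent, relative to the test-set sizes."""
    tp_rate = 100.0 * tp_count / sizes["tp"]
    fp_rate = 100.0 * fp_count / sizes["fp"]
    return tp_rate, fp_rate


for label, (tp_count, fp_count) in FIRST_ROWS.items():
    tp_rate, fp_rate = rates(tp_count, fp_count)
    print(f"{label}: TP rate {tp_rate:.1f}%, FP rate {fp_rate:.2f}%")
```

Dividing by the fixed test-set sizes, rather than by the number of predictions each tool made, is what makes the rows directly comparable across tools.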
Improvement of putative exons from TWINSCAN.
| | RefSeq exons (TP) | potential non-exons (FP) |
| size | 172,042 | 1,518,082 |
| TWINSCAN | 118,650 (69.0%) | 12,276 (0.8%) |
| TWINSCAN (w/mouse) | 115,909 (67.1%) | 7,876 (0.5%) |
Results based on human-mouse conservation. The numbers of RefSeq exons and potential non-exons in the test set are listed in the size row. The TWINSCAN row lists the number of putative exons provided by TWINSCAN. The threshold for TWINSCAN is set so that 99% of the correct predictions of TWINSCAN that have alignments are kept. The percentages in parentheses are the true positive and false positive rates relative to the size of the test set.
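One way to read the threshold rule in the caption above is to rank the correct predictions by score and place the cutoff at the score that marks the 99% point. The sketch below assumes each putative exon carries a single numeric score where higher means more exon-like. The function names and toy scores are hypothetical, and the source does not state how ties or rounding were handled.

```python
def threshold_keeping(tp_scores, keep_percent=99):
    """Largest cutoff that keeps at least keep_percent % of the correct predictions."""
    if not tp_scores:
        raise ValueError("need at least one score")
    ranked = sorted(tp_scores, reverse=True)
    # Integer ceiling division avoids floating-point rounding surprises.
    n_keep = max(1, (keep_percent * len(ranked) + 99) // 100)
    return ranked[n_keep - 1]


def keep_above(scores, cutoff):
    """Keep the predictions whose score is at least the cutoff."""
    return [s for s in scores if s >= cutoff]


# Toy data (hypothetical): 100 correct predictions and 4 false ones.
tp_scores = [float(s) for s in range(1, 101)]
fp_scores = [0.5, 1.5, 2.5, 40.0]

cutoff = threshold_keeping(tp_scores)
kept_tp = keep_above(tp_scores, cutoff)
kept_fp = keep_above(fp_scores, cutoff)
print(f"cutoff={cutoff}, kept TP={len(kept_tp)}, kept FP={len(kept_fp)}")
```

The cutoff is derived from the correct predictions only and then applied to every prediction, so any reduction in false positives is a consequence of the score distribution rather than of tuning on the false positives themselves.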
Improvement of putative genes from TWINSCAN.
| | RefSeq genes (TP) | potential non-genes (FP) |
| size | 20,193 | -- |
| TWINSCAN | 5,131 | 1,172 |
| TWINSCAN (w/mouse) | 4,826 | 870 |
Results based on human-mouse conservation. The number of RefSeq genes is listed in the size row. The TWINSCAN row lists the number of putative genes provided by TWINSCAN. The threshold for TWINSCAN is set so that 98% of the correctly predicted genes of TWINSCAN are kept.
Figure 1. Improving TWINSCAN's prediction on exons. ROC curves obtained by applying the log odds ratio to TWINSCAN's exons. The x-axis is the false prediction rate (FP) of exons under the log odds score, and the y-axis is the true prediction rate (TP) of exons under the log odds score. The upper graph shows the result from human-mouse alignments of TWINSCAN's exons; the lower graph shows the result from human-dog alignments. The plot shows that using the log odds score to refine TWINSCAN can substantially reduce the number of false predictions, e.g., by 32%, while keeping over 99% of true positives. The plot also shows that the improvement on TWINSCAN does not depend on the type of alignment used, since the two curves are almost identical.
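A ROC curve of this kind can be traced by sweeping a cutoff over the scores and recording, at each cutoff, the fraction of true and false predictions that survive. The sketch below is a minimal pure-Python version, assuming that a higher log odds score indicates a more likely exon. The name `roc_points` and the toy scores are hypothetical, and the quadratic loop is chosen for readability rather than speed.

```python
def roc_points(tp_scores, fp_scores):
    """Return (FP rate, TP rate) pairs, one per distinct cutoff, highest cutoff first."""
    if not tp_scores or not fp_scores:
        raise ValueError("both score lists must be non-empty")
    cutoffs = sorted(set(tp_scores) | set(fp_scores), reverse=True)
    points = [(0.0, 0.0)]  # a cutoff above every score keeps nothing
    for cutoff in cutoffs:
        tp_rate = sum(s >= cutoff for s in tp_scores) / len(tp_scores)
        fp_rate = sum(s >= cutoff for s in fp_scores) / len(fp_scores)
        points.append((fp_rate, tp_rate))
    return points


# Toy scores (hypothetical), not taken from the paper.
toy_tp = [3, 2]
toy_fp = [1, 2]
for fp_rate, tp_rate in roc_points(toy_tp, toy_fp):
    print(f"FP rate {fp_rate:.2f}  TP rate {tp_rate:.2f}")
```

Here both rates are fractions of the scored predictions, so the curve ends at (1, 1) when the cutoff drops below every score.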