Kazuharu Misawa, Reiko F Kikuno.
Abstract
BACKGROUND: Identifying protein-coding regions in genomic sequences is an essential step in genome analysis. It is well known that the proportion of false positives among genes predicted by current methods is high, especially when the exons are short. These false positives are problematic because they waste time and resources of experimental studies.
Year: 2010 PMID: 20875138 PMCID: PMC2955682 DOI: 10.1186/1756-0381-3-6
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. Region Scores and Candidate Coding Regions (CDSs). The region score is the sum of the individual codon pair scores in an alignment. Human and mouse DNA alignment and codon scores are also shown. Three adjacent nucleotide pairs were treated as 1 codon pair. A high region score indicates that the region might be a CDS because that region contains many codon pairs with high scores. Note that the region scores should be calculated for all frames in both strands.
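The region score described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the GeneWaltz implementation: the function names are hypothetical, and `toy_codon_pair_score` is a placeholder assumption (+1 for identical codons, -1 otherwise) standing in for the actual codon pair scores, which are not given here. The sketch also assumes an ungapped alignment of equal-length sequences.

```python
from typing import Callable, Dict, Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def toy_codon_pair_score(codon_a: str, codon_b: str) -> int:
    # ASSUMPTION: placeholder scoring, not the scores used by GeneWaltz.
    return 1 if codon_a == codon_b else -1


def region_score(
    seq_a: str,
    seq_b: str,
    frame: int,
    score: Callable[[str, str], int] = toy_codon_pair_score,
) -> int:
    """Sum codon pair scores over an ungapped alignment in one reading frame."""
    length = min(len(seq_a), len(seq_b))
    total = 0
    # Three adjacent aligned nucleotide pairs are treated as one codon pair.
    for i in range(frame, length - 2, 3):
        total += score(seq_a[i:i + 3], seq_b[i:i + 3])
    return total


def all_frame_scores(seq_a: str, seq_b: str) -> Dict[Tuple[str, int], int]:
    """Compute region scores for all three frames on both strands."""
    strands = {
        "+": (seq_a, seq_b),
        "-": (reverse_complement(seq_a), reverse_complement(seq_b)),
    }
    return {
        (strand, frame): region_score(a, b, frame)
        for strand, (a, b) in strands.items()
        for frame in range(3)
    }


if __name__ == "__main__":
    human = "ATGGCCAAA"  # made-up example sequences
    mouse = "ATGGCTAAA"
    for key, value in sorted(all_frame_scores(human, mouse).items()):
        print(key, value)
```

With a real codon pair score table in place of the placeholder, the frame and strand with the highest score would be the candidate reading frame for that region.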
Figure 2. Log-linear plot between the maximal segment pair (MSP) scores and their proportion of occurrences in the computer simulation. The straight line is the regression line.
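A straight regression line on a log-linear plot corresponds to fitting log(proportion) as a linear function of the score. The sketch below shows one way to do such a fit with ordinary least squares. The function names are hypothetical, and the input values are synthetic numbers generated for illustration only; they are not the simulation results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(
    scores: Sequence[float], proportions: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of log(proportion) = intercept + slope * score."""
    logs = [math.log(p) for p in proportions]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predicted_proportion(score: float, slope: float, intercept: float) -> float:
    """Proportion of occurrences implied by the fitted line at a given score."""
    return math.exp(intercept + slope * score)


if __name__ == "__main__":
    # SYNTHETIC data (assumption): proportions decaying exponentially with score.
    example_scores = [10.0, 20.0, 30.0, 40.0]
    example_props = [math.exp(-0.1 * s) for s in example_scores]
    slope, intercept = fit_log_linear(example_scores, example_props)
    print(slope, intercept)
    print(predicted_proportion(50.0, slope, intercept))
```

A fit of this kind lets the proportion be extrapolated to high scores that occur too rarely in a simulation to be counted directly.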
Numbers of True and False Positives in Gene Finding
| | GENSCAN True Positives | GENSCAN False Positives | Twinscan True Positives | Twinscan False Positives |
|---|---|---|---|---|
| Before GeneWaltz | 1818 | 1243 | 2209 | 480 |
| After GeneWaltz | 1345 | 262* | 1619 | 203* |
*Significantly different.
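The counts in the table can be turned into the ratio of true positives to all positives, the quantity plotted in Figure 3. The short sketch below does that arithmetic. The variable and function names are illustrative, and only the four pairs of counts come from the table. By this calculation the ratio rises from roughly 59% to 84% for GENSCAN and from roughly 82% to 89% for Twinscan, at the cost of losing about a quarter of the true positives in each case.

```python
# Counts copied from the table: (true positives, false positives).
COUNTS = {
    ("GENSCAN", "before"): (1818, 1243),
    ("GENSCAN", "after"): (1345, 262),
    ("Twinscan", "before"): (2209, 480),
    ("Twinscan", "after"): (1619, 203),
}


def true_positive_ratio(tp: int, fp: int) -> float:
    """Ratio of true positives to all positives (also called precision)."""
    return tp / (tp + fp)


def retained_fraction(before: int, after: int) -> float:
    """Fraction of the original count still present after filtering."""
    return after / before


if __name__ == "__main__":
    for tool in ("GENSCAN", "Twinscan"):
        tp_before, fp_before = COUNTS[(tool, "before")]
        tp_after, fp_after = COUNTS[(tool, "after")]
        print(
            tool,
            f"ratio before={true_positive_ratio(tp_before, fp_before):.3f}",
            f"ratio after={true_positive_ratio(tp_after, fp_after):.3f}",
            f"TP retained={retained_fraction(tp_before, tp_after):.3f}",
            f"FP retained={retained_fraction(fp_before, fp_after):.3f}",
        )
```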
Figure 3. Scatter plot of the ratios of true positives to all positives predicted by GENSCAN and Twinscan, before and after filtering with GeneWaltz, versus exon length. The unit of exon length is 3 nucleotides.
Figure 4. The partial receiver operating characteristic (partial ROC) curves using Twinscan and GENSCAN across several GeneWaltz p-value thresholds. A partial ROC curve plots the true positive rate (y-axis) against the false positive rate (x-axis) over a range of small values of false positive rates.