Jing Wu.
Abstract
A procedure is proposed to test whether a genomic sequence contains coding DNA; such a sequence is called a coding potential region. The procedure tests the coding potential of short conserved genomic sequences while relaxing the assumptions made by probability models of gene structure. It is therefore expected to add candidate regions containing coding DNA to current genomic databases. The procedure was applied to the set of highly conserved human-mouse sequences in the genome database at the University of California at Santa Cruz. For sequences containing RefSeq coding exons, the procedure detected 91.3% of the regions in this set as having coding potential, covering 83% of the human RefSeq coding exons, at a 2.6% false positive rate. The procedure also detected 12,688 novel short regions with coding potential at a false discovery rate <0.05; 65.7% of the novel regions lie between annotated genes.
Year: 2010 PMID: 20224812 PMCID: PMC2834954 DOI: 10.1155/2010/287070
Source DB: PubMed Journal: Adv Bioinformatics ISSN: 1687-8027
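The abstract reports novel regions at a false discovery rate below 0.05, but this record does not say how that rate was controlled. As an illustration only, the following sketch applies the Benjamini-Hochberg step-up rule to a list of p-values. The choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the toy p-values are all assumptions and are not taken from the paper.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Mark which hypotheses are rejected at FDR level q.

    Assumption: Benjamini-Hochberg is used here for illustration;
    the paper's own FDR procedure is not stated in this record.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Toy p-values (hypothetical, not data from the paper).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
toy_rejected = benjamini_hochberg(toy_p, q=0.05)
print(sum(toy_rejected), "of", len(toy_p), "toy regions pass the FDR cut")
```

Under this rule a region is kept only if its p-value is small relative to its rank among all tested regions, so the cut adapts to the number of regions tested.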
Figure 1. Summary of the proposed statistical procedure.
The following tables (tracks) in UCSC's genome database were used to analyze the coding potential regions detected from the human sequences in the axtTight folder in UCSC's genome database.
| Tracks | URL |
|---|---|
| RefSeq | |
| Known genes | |
| TWINSCAN | |
| GENSCAN | |
| SGP | |
| ENSEMBL | |
| GENEID | |
| AUGUSTUS | |
| ECgene | |
| MGC | |
| AceView | |
| CCDS | |
| Nonhuman RefSeq | |
| Retropose | |
| Yale Pseudo | |
| Vega | |
| Vega pseudogenes | |
| UniGene | |
Figure 2. A normal qq-plot of the averaged log-odds scores from the simulated sequences.
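A normal qq-plot pairs the sorted, standardized sample values with the quantiles of a standard normal distribution; points near the diagonal indicate approximate normality. The sketch below computes those pairs using only the standard library. The function name `normal_qq_points`, the plotting positions (i − 0.5)/n, and the toy scores are assumptions for illustration and are not the paper's data or code.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(scores):
    """Return (theoretical quantile, standardized sample quantile) pairs.

    Assumption: plotting positions (i - 0.5) / n are used; other
    conventions exist and the paper's choice is not stated here.
    """
    n = len(scores)
    mu, sigma = mean(scores), stdev(scores)
    # Standardize and sort the sample so it is comparable to N(0, 1).
    sample = sorted((s - mu) / sigma for s in scores)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))


# Toy averaged log-odds scores (hypothetical values).
toy_scores = [0.2, -0.1, 0.4, 0.0, -0.3]
toy_points = normal_qq_points(toy_scores)
for theo, obs in toy_points:
    print(f"{theo:+.3f}  {obs:+.3f}")
```

The resulting pairs can be passed to any plotting library; the first element of each pair goes on the horizontal axis.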
Figure 3. Identifying a coding potential region in chromosome 1: 1058121-1058365 from assembly hg17. The position is in units of triplets. The codons are at positions 25–56.
The detection of coding potential regions in the human-mouse conserved regions. The table lists the number of alignments and the corresponding base pairs of the human sequences in each test set. The true positive and false positive rates are based on the number of alignments with a p-value less than α = 0.0387 under the present method, where the method, with parameters estimated from human-mouse training sets, was applied to both the human-mouse alignments and the human-dog alignments. The Ka/Ks row is cited from [12]. The threshold is set so that the false positive rate of the proposed method is the same as that of [12].
| | Conserved coding regions (TP) | Simulated random sequence pair (FP) |
|---|---|---|
| Size | 146,254 (3.9 × 10⁷ bps) | 10,305 (6.8 × 10⁶ bps) |
| Peak | 91.3% | 2.6% |
| Peak | 90.7% | 2.6% |
| Ka/Ks | 90.5% | 2.6% |
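The rates in this table are described as coming from the alignments whose p-value falls below α = 0.0387. A minimal sketch of that computation follows: applied to conserved coding alignments the fraction is a true positive rate, and applied to simulated random pairs it is a false positive rate. The function name `detection_rate` and the toy p-values are hypothetical; only the threshold value is taken from the caption.

```python
ALPHA = 0.0387  # threshold quoted in the table caption


def detection_rate(p_values, alpha=ALPHA):
    """Fraction of alignments whose p-value is strictly below alpha."""
    if not p_values:
        raise ValueError("need at least one p-value")
    detected = sum(1 for p in p_values if p < alpha)
    return detected / len(p_values)


# Toy p-values (hypothetical, not the paper's data).
coding_p = [0.001, 0.02, 0.5, 0.03]      # alignments known to be coding
random_p = [0.5, 0.9, 0.01, 0.2, 0.7]    # simulated random sequence pairs

tp_rate = detection_rate(coding_p)  # true positive rate on the coding set
fp_rate = detection_rate(random_p)  # false positive rate on the random set
print(f"TP rate: {tp_rate:.1%}, FP rate: {fp_rate:.1%}")
```

Because both rates depend on the same threshold, lowering α reduces false positives at the cost of fewer detected coding regions.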
The distribution of RefSeq coding exons contained in the regions detected by the proposed method compared with those predicted by GENSCAN and TWINSCAN according to the types of exons: initial, internal, final, and single, where single refers to exons of single exon genes.
| Exon type | Initial | Internal | Final | Single |
|---|---|---|---|---|
| Peak | 90.1% | 94.3% | 81.7% | 80.1% |
| GENSCAN | 81.8% | 86.8% | 78.3% | 91.5% |
| TWINSCAN | 30.4% | 29.9% | 42.4% | 73.9% |