Shu Ju Hsieh, Chun Yuan Lin, Ning Han Liu, Wei Yuan Chow, Chuan Yi Tang.
Abstract
GeneAlign is a coding exon prediction tool that predicts protein coding genes by measuring the homologies between a genomic sequence and related, already annotated sequences from other genomes. Identifying protein coding genes is one of the most important tasks in newly sequenced genomes. With increasing numbers of gene annotations verified by experiments, it is feasible to identify genes in newly sequenced genomes by comparing them with annotated genes of phylogenetically close organisms. GeneAlign applies CORAL, a heuristic linear time alignment tool, to determine whether regions flanked by the candidate signals (initiation codon-GT, AG-GT and AG-STOP codon) are similar to annotated coding exons. Employing the conservation of gene structures and sequence homologies between protein coding regions increases the prediction accuracy. GeneAlign was tested on the Projector dataset of 491 human-mouse homologous sequence pairs. At the gene level, both the average sensitivity and the average specificity of GeneAlign are 81%, and both exceed 96% at the exon level. The rates of missing exons and wrong exons are smaller than 1%. GeneAlign is a free tool available at http://genealign.hccvs.hc.edu.tw.
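The three signal pairs named in the abstract can be illustrated with a short sketch. This is not GeneAlign's implementation: the function names, the 0-based half-open coordinates, and the `min_len`/`max_len` bounds are assumptions made for illustration, and the sketch ignores reading frame and the CORAL similarity step entirely. It only enumerates regions flanked by initiation codon-GT, AG-GT and AG-STOP codon.

```python
import re

STOP_CODONS = ("TAA", "TAG", "TGA")


def positions(seq, motif):
    """Return every start index of motif in seq, including overlapping hits."""
    return [m.start() for m in re.finditer(f"(?={motif})", seq)]


def candidate_regions(seq, min_len=3, max_len=300):
    """Enumerate (kind, start, end) regions flanked by candidate signals.

    Coordinates are 0-based and half-open. The flanking AG and GT
    dinucleotides are excluded from the region; the initiation codon and
    the stop codon are included. Length bounds are illustrative only.
    """
    seq = seq.upper()
    starts = positions(seq, "ATG")
    acceptors = positions(seq, "AG")
    donors = positions(seq, "GT")
    stops = sorted(p for codon in STOP_CODONS for p in positions(seq, codon))

    regions = []
    # Initial exon candidates: initiation codon ... donor GT.
    for s in starts:
        for d in donors:
            if min_len <= d - s <= max_len:
                regions.append(("initial", s, d))
    for a in acceptors:
        begin = a + 2  # first base after the acceptor AG
        # Internal exon candidates: acceptor AG ... donor GT.
        for d in donors:
            if min_len <= d - begin <= max_len:
                regions.append(("internal", begin, d))
        # Terminal exon candidates: acceptor AG ... stop codon.
        for p in stops:
            end = p + 3  # include the stop codon itself
            if min_len <= end - begin <= max_len:
                regions.append(("terminal", begin, end))
    return regions
```

In a homology-based predictor, each enumerated region would then be compared against annotated coding exons and kept only if sufficiently similar; that comparison step is outside this sketch.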
Year: 2006 PMID: 16845010 PMCID: PMC1538901 DOI: 10.1093/nar/gkl307
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Prediction accuracy on the Projector dataset
| Program | Gene level* Sn (%) | Gene level* Sp (%) | Exon level* Sn (%) | Exon level* Sp (%) | Exon level* ME (%) | Exon level* WE (%) |
|---|---|---|---|---|---|---|
| Human gene prediction | | | | | | |
| GeneWise | 61.91 | 61.91 | 92.56 | 93.60 | 1.50 | 0.32 |
| Projector | 51.32 | 51.32 | 93.78 | 86.99 | 0.88 | 8.59 |
| GeneAlign | 82.28 | 82.28 | 96.65 | 97.12 | 0.74 | 0.32 |
| Mouse gene prediction | | | | | | |
| GeneWise | 60.49 | 60.49 | 93.13 | 93.39 | 1.18 | 0.28 |
| Projector | 58.45 | 58.45 | 94.55 | 90.35 | 0.47 | 4.55 |
| GeneAlign | 79.23 | 79.23 | 96.63 | 96.39 | 0.49 | 0.58 |
*Sensitivity (Sn) and specificity (Sp) are defined as Sn = TP/(TP + FN) and Sp = TP/(TP + FP), respectively. ME (missing exons) is the proportion of annotated exons not overlapped by any predicted exon, whereas WE (wrong exons) is the proportion of predicted exons not overlapped by any annotated exon.
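The four exon-level measures in the footnote can be computed from two lists of exon intervals, as in the minimal sketch below. The function names and the half-open `(start, end)` coordinate convention are assumptions; counting a true positive only when both boundaries match exactly follows the definition the record gives for micro-exons and is assumed here to apply at the exon level in general.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def exon_level_metrics(annotated, predicted):
    """Return Sn, Sp, ME and WE as fractions in [0, 1].

    annotated, predicted: iterables of (start, end) tuples, half-open.
    A predicted exon is a true positive only if both of its boundaries
    match an annotated exon exactly (assumption, see lead-in).
    """
    ann, pred = set(annotated), set(predicted)
    tp = len(ann & pred)   # exact boundary matches
    fn = len(ann) - tp     # annotated exons not predicted exactly
    fp = len(pred) - tp    # predicted exons that are not exact matches

    # ME: annotated exons with no overlapping prediction at all.
    missing = sum(1 for a in ann if not any(overlaps(a, p) for p in pred))
    # WE: predicted exons with no overlapping annotation at all.
    wrong = sum(1 for p in pred if not any(overlaps(p, a) for a in ann))

    return {
        "Sn": tp / (tp + fn) if ann else 0.0,
        "Sp": tp / (tp + fp) if pred else 0.0,
        "ME": missing / len(ann) if ann else 0.0,
        "WE": wrong / len(pred) if pred else 0.0,
    }
```

For example, with four annotated exons and three predictions of which one matches exactly, one overlaps an annotated exon with a wrong boundary, and one overlaps nothing, the sketch gives Sn = 1/4, Sp = 1/3, ME = 2/4 and WE = 1/3. An exon with a wrong boundary lowers Sn and Sp but counts as neither missing nor wrong.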
Prediction accuracy on micro-exons of the Projector dataset
| Program | No. of accurate micro-exons* | No. of missing micro-exons* | No. of wrong micro-exons* |
|---|---|---|---|
| Human micro-exon prediction | | | |
| GeneWise | 22 | 25 | 2 |
| Projector | 45 | 1 | 339 |
| GeneAlign | 45 | 2 | 5 |
| Mouse micro-exon prediction | | | |
| GeneWise | 23 | 22 | 3 |
| Projector | 47 | 0 | 170 |
| GeneAlign | 44 | 3 | 9 |
*The accuracy of identifying micro-exons was evaluated by the number of accurately predicted exons, missing exons and wrong exons. An exon is accurately predicted only when both boundaries are correct. Missing exons are annotated exons not overlapped with predicted exons. Wrong exons are predicted exons not overlapped by any annotated exons. In the Projector dataset, there are 48 and 47 micro-exons in human and mouse genes, respectively.
Figure 1. Comparison of the correlation between sequence homology and prediction performance for GeneWise, Projector and GeneAlign. The gene pairs of the Projector dataset were sorted into five classes by their amino acid identities (<60, 60–70, 70–80, 80–90 and 90–100%), and the performance was calculated for each class. The amino acid identities were obtained by using a standard dynamic programming algorithm to calculate the identities between the two protein sequences encoded in each homologous gene pair. The measures of sensitivity (Sn) and specificity (Sp) are respectively Sn = TP/(TP + FN) and Sp = TP/(TP + FP).
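The caption does not say which dynamic programming algorithm or scoring scheme was used to obtain the amino acid identities. As one plausible reading, the sketch below uses a Needleman-Wunsch global alignment with simple match/mismatch/gap scores and reports identity as matches divided by alignment columns. The scoring values, the identity denominator, the function names, and the treatment of class boundaries (lower bound inclusive) are all assumptions.

```python
def percent_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Percent identity of two protein sequences under a global alignment.

    Uses Needleman-Wunsch with a linear gap penalty (illustrative scores).
    Identity = 100 * identical columns / total alignment columns.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal alignment, counting columns and identities.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            matches += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return 100.0 * matches / columns if columns else 0.0


def identity_class(pid):
    """Map a percent identity to one of the five classes in the caption."""
    if pid < 60:
        return "<60"
    for low in (60, 70, 80):
        if pid < low + 10:
            return f"{low}-{low + 10}"
    return "90-100"
```

Real protein comparisons would normally use a substitution matrix and affine gap penalties; the uniform scores here only keep the sketch short.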