Heng Li, Liang Guan, Tao Liu, Yiran Guo, Wei-Mou Zheng, Gane Ka-Shu Wong, Jun Wang.
Abstract
BACKGROUND: Automatic gene annotation frameworks fall into two main types, ab initio and alignment-based, and the alignment-based methods split into two sub-groups. The first sub-group is used for intra-species alignments and includes successful tools with high specificity and speed. The second contains more sensitive methods, which are usually applied to inter-species alignments.
Year: 2007 PMID: 17880681 PMCID: PMC2082505 DOI: 10.1186/1471-2105-8-349
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Nucleotide-level sensitivity (nSn) and specificity (nSp). We restrict the evaluation to coding regions and display performance as a function of protein-level identity in the aligned regions. Every data point represents 658 of the 10395 mRNAs from the mouse-human alignments. The sensitivity curves for CAT, est_genome and GeneWise are hard to distinguish from one another. In plotting the figure, we discard the worst 5% of pairs, in which the aligned regions cover too small a fraction of the full CDS. These orthologous pairs tend to be wrongly predicted in the HomoloGene database because of their short aligned regions, and discarding them yields more consistent curves.
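The filtering and binning described in this caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' plotting code: the record keys (`identity`, `aligned_fraction`, `nSn`, `nSp`) and the function name `bin_by_identity` are hypothetical, and averaging within each bin is an assumption, since the caption only says that each point represents 658 mRNAs.

```python
from statistics import mean


def bin_by_identity(pairs, bin_size=658, discard_fraction=0.05):
    """Group alignment records into fixed-size bins ordered by identity.

    Each record is a dict with hypothetical keys (assumed for this sketch):
      'identity'          protein-level identity in the aligned region (0..1)
      'aligned_fraction'  aligned length divided by full CDS length (0..1)
      'nSn', 'nSp'        nucleotide-level sensitivity and specificity
    """
    # Drop the pairs with the smallest aligned fraction (worst 5% by default).
    ranked = sorted(pairs, key=lambda p: p["aligned_fraction"])
    n_drop = int(len(ranked) * discard_fraction)
    kept = ranked[n_drop:]

    # Order the remaining pairs by identity and cut them into equal-sized bins.
    kept.sort(key=lambda p: p["identity"])
    points = []
    for start in range(0, len(kept), bin_size):
        chunk = kept[start:start + bin_size]
        # Assumption: one plotted point is the per-bin average.
        points.append({
            "identity": mean(p["identity"] for p in chunk),
            "nSn": mean(p["nSn"] for p in chunk),
            "nSp": mean(p["nSp"] for p in chunk),
            "count": len(chunk),
        })
    return points
```

The last bin may hold fewer than `bin_size` records when the number of retained pairs is not an exact multiple of the bin size.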
Evaluation of localized alignments. 10395 mouse mRNAs and 2007 zebrafish mRNAs are aligned to the orthologous regions in the human genome.
| Algorithm | CDS+UTR Sn (nucl. level) | CDS+UTR Sp (nucl. level) | CDS alone Sn (nucl. level) | CDS alone Sp (nucl. level) | CDS alone Sn (exon level) | CDS alone Sp (exon level) | Speed (mRNA/hr) |
|---|---|---|---|---|---|---|---|
| | 0.765 | 0.961 | 0.924 | 0.968 | 0.855 | 0.893 | 3579 |
| | 0.772 | 0.963 | 0.926 | 0.970 | 0.856 | 0.895 | 17 |
| | n/a | n/a | 0.927 | 0.972 | 0.869 | 0.917 | 8 |
| | 0.385 | 0.983 | 0.589 | 0.977 | 0.495 | 0.791 | 1254 |
| | n/a | n/a | 0.856 | 0.977 | 0.787 | 0.890 | 10027 |
| | 0.487 | 0.976 | 0.678 | 0.973 | 0.161 | 0.172 | 5138 |
| | 0.615 | 0.979 | 0.872 | 0.975 | 0.513 | 0.518 | 1172 |
| | 0.535 | 0.977 | 0.743 | 0.976 | 0.524 | 0.569 | 36815 |

| Algorithm | CDS+UTR Sn (nucl. level) | CDS+UTR Sp (nucl. level) | CDS alone Sn (nucl. level) | CDS alone Sp (nucl. level) | CDS alone Sn (exon level) | CDS alone Sp (exon level) | Speed (mRNA/hr) |
|---|---|---|---|---|---|---|---|
| | 0.489 | 0.963 | 0.803 | 0.957 | 0.645 | 0.754 | 2806 |
| | 0.463 | 0.968 | 0.764 | 0.961 | 0.590 | 0.750 | 41 |
| | n/a | n/a | 0.862 | 0.975 | 0.781 | 0.879 | 12 |
| | n/a | n/a | 0.652 | 0.975 | 0.543 | 0.772 | 6757 |

The "correct" answers, against which we judge these algorithms, are based on an alignment of human mRNAs from RefSeq to the sequence of the human genome, as annotated in the UCSC browser.
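As a rough illustration of how the Sn and Sp columns could be computed against such a reference annotation, the sketch below scores one predicted gene structure at the nucleotide and exon levels. The formulas are not given in this record, so the code assumes the convention common in gene-prediction evaluation: Sn = TP / (TP + FN) and Sp = TP / (TP + FP), with an exon counted as correct only when both of its boundaries match the reference. The function names and the half-open `(start, end)` interval representation are hypothetical.

```python
def to_positions(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def nucleotide_sn_sp(predicted, reference):
    """Nucleotide-level sensitivity and specificity (assumed definitions)."""
    pred, ref = to_positions(predicted), to_positions(reference)
    tp = len(pred & ref)                   # bases in both prediction and reference
    sn = tp / len(ref) if ref else 0.0     # TP / (TP + FN)
    sp = tp / len(pred) if pred else 0.0   # TP / (TP + FP)
    return sn, sp


def exon_sn_sp(predicted, reference):
    """Exon-level scores: an exon is correct only if both boundaries match."""
    pred, ref = set(predicted), set(reference)
    exact = len(pred & ref)
    sn = exact / len(ref) if ref else 0.0
    sp = exact / len(pred) if pred else 0.0
    return sn, sp


# Toy example: the second predicted exon starts 10 bases too late.
reference = [(100, 200), (300, 400)]
predicted = [(100, 200), (310, 400)]
print(nucleotide_sn_sp(predicted, reference))  # (0.95, 1.0)
print(exon_sn_sp(predicted, reference))        # (0.5, 0.5)
```

The toy example shows why exon-level scores can fall well below nucleotide-level ones: a single shifted boundary costs only a few bases at the nucleotide level but invalidates the whole exon at the exon level.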
Figure 2. Speed comparisons for localized and chromosome-wide alignments. 1000 randomly selected mouse mRNAs are aligned against the human genome. In the localized plot, every data point represents the average of 50 alignments. In the chromosome-wide plot, every data point is a single chromosome. The chromosome-wide plot is limited to CAT, BLAT, and sim4 because they are the only programs that run in a reasonable amount of time and/or memory.
Figure 3. Flowchart of the CAT algorithm (description in text of manuscript).
Figure 4. Statistical filtering of terminal exons. Here, 1000 randomly selected mouse mRNAs are aligned to the human genome. We show the ratio of aligned to true length before (red) and after (blue) statistical filtering. Length refers to the extent of the mRNA alignment from the start codon to the stop codon; in other words, UTRs are excluded.