Leming Zhou, Mihaela Pertea, Arthur L Delcher, Liliana Florea.
Abstract
Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal spaced seeds, to increase sensitivity and allow comparisons between species at various evolutionary distances, and powerful splice signal models and evolutionarily-aware alignment techniques, to improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity compared to existing alignment programs, more than 10% higher than the closest competitor for some comparisons, while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64,000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.
Year: 2009 PMID: 19429899 PMCID: PMC2699533 DOI: 10.1093/nar/gkp319
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Performance of spliced alignment programs on the four vertebrate reference data sets
| Method | Nucleotide Sn | Nucleotide Sp | Exon Sn | Exon Sp | Splice junction Sn | Splice junction Sp | Time |
|---|---|---|---|---|---|---|---|
| Mouse: 818 genes, 8264 exons, 7408 introns | |||||||
| sim4 | 0.690 | 0.996 | 0.899 | 0.993 | 0.710 (0.741) | 0.660 (0.689) | 28.6 s |
| BLAT | 0.656 | 0.987 | 0.831 | 0.945 | 0.070 (0.517) | 0.047 (0.352) | 2 min 33.3 s |
| tBLAT | 0.774 | 0.985 | 0.942 | 0.985 | 0.183 (0.814) | 0.142 (0.634) | 8 min 55.7 s |
| GMAP | 0.719 | 0.996 | 0.785 | 0.995 | 0.758 (0.763) | 0.952 (0.958) | 1 min 22.3 s |
| Exonerate | 0.849 | 0.984 | 0.870 | 0.996 | 0.811 (0.828) | 0.920 (0.939) | 20 min 19.4 s |
| GeneSeqer | 0.603 | 0.988 | 0.647 | 0.921 | 0.574 (0.582) | 0.829 (0.840) | 4 h 24 min 54.8 s |
| EXALIN | 0.846 | 0.997 | 0.948 | 0.996 | 0.926 (0.941) | 0.942 (0.957) | 6 h 33 min 29.3 s |
| sim4cc | 0.934 | 0.995 | 0.973 | 0.997 | 0.932 (0.944) | 0.939 (0.951) | 57.3 s |
| Dog: 46 genes, 419 exons, 370 introns | |||||||
| sim4 | 0.818 | 0.995 | 0.936 | 0.980 | 0.795 (0.816) | 0.770 (0.791) | 1.5 s |
| BLAT | 0.778 | 0.988 | 0.866 | 0.939 | 0.059 (0.608) | 0.048 (0.488) | 7.6 s |
| tBLAT | 0.869 | 0.981 | 0.932 | 0.950 | 0.162 (0.824) | 0.141 (0.716) | 30.4 s |
| GMAP | 0.875 | 0.996 | 0.861 | 0.989 | 0.849 (0.854) | 0.978 (0.984) | 4.0 s |
| Exonerate | 0.959 | 0.983 | 0.943 | 0.997 | 0.878 (0.900) | 0.931 (0.954) | 38.3 s |
| GeneSeqer | 0.677 | 0.995 | 0.671 | 0.941 | 0.600 (0.603) | 0.914 (0.918) | 11 min 55.0 s |
| EXALIN | 0.940 | 0.996 | 0.972 | 0.984 | 0.965 (0.973) | 0.960 (0.968) | 14 min 7.4 s |
| sim4cc | 0.972 | 0.988 | 0.965 | 0.976 | 0.941 (0.951) | 0.961 (0.972) | 2.1 s |
| Chicken: 156 genes, 1624 exons, 1462 introns | |||||||
| sim4 | 0.414 | 0.992 | 0.589 | 0.987 | 0.287 (0.304) | 0.428 (0.452) | 6.3 s |
| BLAT | 0.347 | 0.978 | 0.433 | 0.881 | 0.017 (0.132) | 0.023 (0.178) | 29.1 s |
| tBLAT | 0.739 | 0.986 | 0.834 | 0.975 | 0.142 (0.653) | 0.143 (0.658) | 1 min 50.7 s |
| GMAP | 0.315 | 0.989 | 0.257 | 0.991 | 0.214 (0.216) | 0.932 (0.940) | 18.4 s |
| Exonerate | 0.424 | 0.945 | 0.530 | 0.988 | 0.425 (0.438) | 0.851 (0.873) | 2 min 50.2 s |
| GeneSeqer | 0.451 | 0.987 | 0.431 | 0.915 | 0.372 (0.384) | 0.810 (0.835) | 30 min 15.6 s |
| EXALIN | 0.762 | 0.998 | 0.825 | 0.996 | 0.788 (0.806) | 0.954 (0.975) | 1 h 14 min 23.5 s |
| sim4cc | 0.872 | 0.982 | 0.879 | 0.993 | 0.799 (0.816) | 0.872 (0.890) | 9.5 s |
| Zebrafish: 232 genes, 2549 exons, 2315 introns | |||||||
| sim4 | 0.101 | 0.984 | 0.196 | 0.991 | 0.029 (0.031) | 0.161 (0.171) | 7.6 s |
| BLAT | 0.064 | 0.966 | 0.083 | 0.798 | 0.001 (0.008) | 0.007 (0.067) | 39.4 s |
| tBLAT | 0.573 | 0.984 | 0.628 | 0.960 | 0.086 (0.376) | 0.129 (0.568) | 2 min 31.6 s |
| GMAP | 0.057 | 0.993 | 0.023 | 1.000 | 0.010 (0.010) | 0.958 (0.958) | 14.8 s |
| Exonerate | 0.298 | 0.890 | 0.244 | 0.990 | 0.145 (0.148) | 0.812 (0.829) | 3 min 34.5 s |
| GeneSeqer | 0.143 | 0.989 | 0.128 | 0.940 | 0.116 (0.117) | 0.871 (0.877) | 9 min 19.4 s |
| EXALIN | 0.509 | 0.997 | 0.539 | 0.994 | 0.480 (0.495) | 0.954 (0.984) | 1 h 20 min 31.5 s |
| sim4cc | 0.701 | 0.970 | 0.732 | 0.985 | 0.546 (0.567) | 0.757 (0.785) | 18.8 s |
All programs were run with their default parameters. Columns give sensitivity (Sn) and specificity (Sp) at the nucleotide, exon and splice junction (intron) level. Splice junction values are reported for two margins of error V around the splice site: V = 0 outside the parentheses and V = 10 inside them. Sensitivity was calculated as Sn = TP/(TP + FN) and specificity as Sp = TP/(TP + FP). Run times were averaged over 10 executions of the program on a Dell workstation with 3.2 GHz Intel CPUs and 2 GB RAM.
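The two formulas in the note above can be sketched in a few lines of Python. This is an illustration only, not the evaluation code used for the table. The function names and the matching rule (a predicted intron counts as a true positive when both of its ends fall within `margin` bases of a not-yet-matched reference intron) are assumptions made for the example.

```python
from typing import List, Tuple

Intron = Tuple[int, int]  # (start, end) genomic coordinates


def sensitivity(tp: int, fn: int) -> float:
    """Sn = TP / (TP + FN); returns 0.0 when there is nothing to find."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp: int, fp: int) -> float:
    """Sp = TP / (TP + FP), as defined in the table note."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def junction_accuracy(predicted: List[Intron],
                      reference: List[Intron],
                      margin: int = 0) -> Tuple[float, float]:
    """Return (Sn, Sp) for predicted introns against reference introns.

    Assumed matching rule (hypothetical): both intron ends must lie
    within `margin` bases of the same reference intron, and each
    reference intron can be matched at most once.
    """
    matched = set()
    tp = 0
    for p_start, p_end in predicted:
        for i, (r_start, r_end) in enumerate(reference):
            if i in matched:
                continue
            if abs(p_start - r_start) <= margin and abs(p_end - r_end) <= margin:
                matched.add(i)
                tp += 1
                break
    fp = len(predicted) - tp   # predictions with no reference match
    fn = len(reference) - tp   # reference introns that were missed
    return sensitivity(tp, fn), specificity(tp, fp)


if __name__ == "__main__":
    ref = [(100, 200), (300, 400), (500, 600)]
    pred = [(100, 200), (305, 400), (700, 800)]
    for v in (0, 10):
        sn, sp = junction_accuracy(pred, ref, margin=v)
        print(f"V={v}: Sn={sn:.3f} Sp={sp:.3f}")
```

Raising the margin can only turn misses into matches, which is consistent with every parenthesized value in the table being at least as large as the value before it.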
Figure 1. Performance of spliced alignment programs (nucleotide sensitivity, vertical axis) with varying sequence identity levels (horizontal axis). The numbers of gene pairs for each sequence identity level in decreasing order from 90–95% to 65–70% are: 40, 135, 266, 281, 211 and 156.
Characteristics and mapping statistics of Fagus grandifolia cDNA sequences on the poplar genome
| Data | All | Mapped (EM) | Mapped (GMAPX) |
|---|---|---|---|
| 454 ESTs | |||
| Sequences | 64 237 | 24 810 | 19 034 |
| Length (avg) | 229 bp | 242 bp | 225 bp |
| Regions | n/a | 85 806 | 64 742 |
| sim4cc-alignment statistics | |||
| Sequence id. (avg) | n/a | 94.19 | 90.81 |
| Coverage (avg) | n/a | 97.26 | 93.00 |
| 454 unigenes | |||
| Sequences | 8163 | 2887 | 1625 |
| Length (avg) | 359 bp | 397 bp | 449 bp |
| Regions | n/a | 3243 | 2643 |
| sim4cc-alignment statistics | |||
| Sequence id. (avg) | n/a | 83.29 | 83.10 |
| Coverage (avg) | n/a | 89.31 | 87.11 |
454 ESTs were mapped with the tool ESTmapper (EM) at ≥50% coverage and ≥70% sequence identity and unigenes at ≥70% sequence identity, retaining only alignments longer than 100 bases. Only the ‘best’ alignment for each query was selected to determine matching regions (note: if indistinguishable from each other, several best alignments may be retained). GMAP was used in cross-species mode ‘−X’, and all other parameters as set by default. Alignment statistics were averaged over all regions. n/a = Not applicable.
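The filtering described in the note above can be sketched as follows. This is a minimal illustration, not ESTmapper's implementation: the `Alignment` record, the function names, and in particular the ranking used to pick the "best" alignment (coverage first, then identity) are assumptions, since the note does not define that criterion.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Alignment:
    """Hypothetical alignment record used only for this example."""
    query: str       # EST or unigene identifier
    region: str      # matching genomic region
    length: int      # alignment length in bases
    coverage: float  # percent of the query covered
    identity: float  # percent sequence identity


def passes(aln: Alignment, min_cov: float,
           min_id: float = 70.0, min_len: int = 100) -> bool:
    """Apply the cutoffs from the note: coverage and identity are
    'at least', alignment length is strictly 'longer than'."""
    return (aln.coverage >= min_cov
            and aln.identity >= min_id
            and aln.length > min_len)


def best_per_query(alns: Iterable[Alignment],
                   min_cov: float = 50.0) -> Dict[str, List[Alignment]]:
    """Keep the top-ranked alignment(s) per query; ties are all retained.

    Ranking by (coverage, identity) is an assumption for illustration.
    Use min_cov=0.0 for unigenes, for which only an identity cutoff is stated.
    """
    by_query: Dict[str, List[Alignment]] = defaultdict(list)
    for aln in alns:
        if passes(aln, min_cov):
            by_query[aln.query].append(aln)

    best: Dict[str, List[Alignment]] = {}
    for query, hits in by_query.items():
        top = max((h.coverage, h.identity) for h in hits)
        best[query] = [h for h in hits if (h.coverage, h.identity) == top]
    return best
```

Because indistinguishable best alignments are all kept, one query can contribute several regions, which is one way the region counts in the table can exceed the number of mapped sequences.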
Figure 2. Number of Fagus grandifolia (A) 454 EST sequences (out of 64 237) and (B) unigenes of these sequences (out of 8163) that can be aligned to the poplar genome at varying coverage cutoffs (horizontal axis), both before and after the application of sim4cc. Only those ESTmapper (EM) alignments covering more than 50% of the input sequence were retained.
Figure 3. Number of Fagus grandifolia (A) 454 EST sequences (out of 64 237) and (B) unigenes of these sequences (out of 8163) that overlap the gene annotation of the poplar genome at varying coverage cutoffs, both before and after the application of sim4cc. Only those ESTmapper (EM) alignments covering more than 50% of the input sequence were retained.