Yuri Kapustin, Alexander Souvorov, Tatiana Tatusova, David Lipman.
Abstract
BACKGROUND: The computation of accurate alignments of cDNA sequences against a genome is at the foundation of modern genome annotation pipelines. Several factors, such as the presence of paralogs, small exons, non-consensus splice signals, sequencing errors and polymorphic sites, pose recognized difficulties to existing spliced alignment algorithms.
Year: 2008 PMID: 18495041 PMCID: PMC2440734 DOI: 10.1186/1745-6150-3-20
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Figure 1. The computation of spliced alignments with Splign.
Programs used in the comparison
| | Sim4 | Spidey | BLAT | GMAP | SPA | Splign |
| version | - | 1.40 | 34 | 2007-06-04 | - | 1.29 |
| source time stamp | 09/2003 | 06/2006 | 04/2007 | 06/2007 | 07/2007 | 08/2007 |
The number of best alignments with A-content of 75% or higher in the 3' exon. The numbers are based on the alignments with the highest overall identity.
| | Sim4 | Spidey | BLAT | GMAP | SPA | Splign |
| % of all mRNAs | 15.5 | 0.57 | 0.98 | 0.04 | 2.12 | 0.01 |
| % of RefSeq mRNAs | 22.1 | 0.31 | 1.67 | 0.02 | 2.72 | 0.00 |
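The A-content criterion in the caption above can be illustrated with a minimal Python sketch. It assumes the exon sequences of an aligned mRNA are available as strings in transcript order, so that the last element is the 3' exon. The function names and the input layout are hypothetical and are not taken from any of the compared programs.

```python
def a_content(seq):
    """Fraction of bases in seq that are adenine (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("A") / len(seq)


def has_a_rich_3prime_exon(exons, threshold=0.75):
    """True if the last exon (assumed to be the 3' exon) has A-content >= threshold."""
    return bool(exons) and a_content(exons[-1]) >= threshold


# Toy example: the terminal exon is 3/4 adenine, so it meets the 75% cut-off.
flagged = has_a_rich_3prime_exon(["ACGTTGCA", "AAAC"])
```

An alignment flagged this way most likely has its last exon placed on a poly(A) tail rather than on genuine transcript sequence, which is what the table counts.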
The number of the full set alignments at various levels of the overall identity. Fractions give the differences in the number of alignments by each method and Splign expressed as percentage of the total of sequences in the set.
| | 80% | 85% | 90% | 95% | 99% | 99.5% | 99.9% | 100% |
| Splign | 212875 | 209895 | 203460 | 195939 | 182123 | 169058 | 107267 | 76542 |
| Sim4 | | | | 195513 | 177784 | 162761 | 97812 | 69359 |
| diff. vs Splign, % | -0.16 | -0.61 | -0.20 | -0.19 | -1.98 | -2.88 | -4.32 | -3.29 |
| Spidey | 199870 | 195159 | 190753 | 186356 | 172385 | 159021 | 100338 | 72225 |
| diff. vs Splign, % | -5.95 | -6.74 | -5.81 | -4.38 | -4.45 | -4.59 | -3.17 | -1.97 |
| BLAT | 209899 | 206260 | 201207 | 194640 | 177345 | 163157 | 104338 | 74815 |
| diff. vs Splign, % | -1.36 | -1.66 | -1.03 | -0.59 | -2.19 | -2.70 | -1.34 | -0.79 |
| GMAP | 208849 | 205793 | 202732 | | 180048 | 166488 | 105452 | |
| diff. vs Splign, % | -1.84 | -1.88 | -0.33 | +0.36 | -0.95 | -1.18 | -0.83 | -0.72 |
| SPA | 203942 | 199548 | 195926 | 192595 | | | | 74911 |
| diff. vs Splign, % | -4.09 | -4.73 | -3.45 | -1.53 | -0.85 | -0.57 | -0.65 | -0.75 |
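The fraction rows in the identity tables follow the rule stated in the caption: the difference between a method's count and Splign's count, expressed as a percentage of the number of sequences in the set. A minimal sketch of that arithmetic is given below. The set size is not stated in this excerpt, so the numbers in the example are made-up placeholders, and the function name is hypothetical.

```python
def diff_vs_reference(method_count, reference_count, set_size):
    """Difference between two alignment counts as a percentage of the set size."""
    if set_size <= 0:
        raise ValueError("set_size must be positive")
    return round(100.0 * (method_count - reference_count) / set_size, 2)


# Placeholder numbers, not taken from the tables:
# 950 alignments for a method, 1000 for the reference, 2000 sequences in the set.
example = diff_vs_reference(950, 1000, 2000)
```

A negative value means the method produced fewer alignments than the reference at that identity level; a positive value means it produced more.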
The number of RefSeq alignments at various levels of the overall identity. Fractions give the differences in the number of alignments by each method and Splign expressed as percentage of the total of sequences in the set.
| | 80% | 85% | 90% | 95% | 99% | 99.5% | 99.9% | 100% |
| Splign | 24255 | 24254 | 24243 | 24218 | 23947 | 23420 | 19337 | 14898 |
| Sim4 | 24242 | 24233 | 24201 | 24109 | 23566 | 22815 | 17774 | 12894 |
| diff. vs Splign, % | -0.05 | -0.09 | -0.17 | -0.45 | -1.57 | -2.49 | -6.44 | -8.26 |
| Spidey | 23648 | 23491 | 23270 | 22972 | 22408 | 21789 | 17782 | 13488 |
| diff. vs Splign, % | -2.50 | -3.14 | -4.01 | -5.13 | -6.34 | -6.72 | -6.41 | -5.81 |
| BLAT | 24240 | 24230 | 24201 | 24145 | 23701 | 23046 | 19043 | 14676 |
| diff. vs Splign, % | -0.06 | -0.10 | -0.17 | -0.30 | -1.01 | -1.54 | -1.21 | -0.91 |
| GMAP | | | | | | 23294 | | 14513 |
| diff. vs Splign, % | -0.02 | -0.05 | -0.07 | -0.08 | -0.29 | -0.52 | -0.49 | -1.59 |
| SPA | 24213 | 24202 | 24180 | 24148 | 23830 | | 19204 | |
| diff. vs Splign, % | -0.17 | -0.21 | -0.26 | -0.29 | -0.48 | -0.48 | -0.55 | -0.82 |
Time to compute alignments for the full set of human mRNA sequences. The timing is based on a single instance running on an Intel Xeon 2.33 GHz/8 GB Linux box.
| | Sim4 | Spidey | BLAT | GMAP | SPA | Splign |
| CPU hours | 856 | 698 | 8 | 12 | 2448 | 49 |
The number of the Subset 1 alignments at various levels of the overall identity. Fractions give the differences in the number of alignments by each method and Splign expressed as percentage of the total of sequences in the set.
| | 85% | 90% | 95% | 99% | 99.5% | 99.9% | 100% |
| Splign | 72047 | 71922 | 71568 | 68890 | 65666 | 43046 | 25931 |
| Sim4 | | 71830 | 71316 | 68055 | 63919 | 39272 | 23249 |
| diff. vs Splign, % | -0.05 | -0.13 | -0.35 | -1.16 | -2.42 | -5.23 | -3.72 |
| Spidey | 71562 | 70833 | 69744 | 66128 | 62461 | 40456 | 24578 |
| diff. vs Splign, % | -0.67 | -1.51 | -2.53 | -3.83 | -4.44 | -3.59 | -1.88 |
| BLAT | 72010 | 71842 | 71393 | 67508 | 63319 | 41693 | 25357 |
| diff. vs Splign, % | -0.05 | -0.11 | -0.24 | -1.92 | -3.25 | -1.88 | -0.80 |
| GMAP | 72001 | | | 68446 | 64787 | 42263 | 25335 |
| diff. vs Splign, % | -0.06 | -0.11 | -0.15 | -0.62 | -1.22 | -1.09 | -0.83 |
| SPA | 71993 | 71823 | 71406 | | | | |
| diff. vs Splign, % | -0.07 | -0.14 | -0.22 | -0.43 | -0.41 | 0.00 | -0.27 |
The number of the Subset 1 RefSeq alignments at various levels of the overall identity. Fractions give the differences in the number of alignments by each method and Splign expressed as percentage of the total of sequences in the set.
| | 85% | 90% | 95% | 99% | 99.5% | 99.9% | 100% |
| Splign | 13883 | 13880 | 13870 | 13781 | 13540 | 11345 | 8276 |
| Sim4 | 13880 | 13873 | 13832 | 13615 | 13287 | 10521 | 7151 |
| diff. vs Splign, % | -0.02 | -0.05 | -0.27 | -1.20 | -1.82 | -5.94 | -8.10 |
| Spidey | 13765 | 13602 | 13379 | 13085 | 12802 | 10601 | 7587 |
| diff. vs Splign, % | -0.85 | -2.00 | -3.54 | -5.01 | -5.32 | -5.36 | -4.96 |
| BLAT | 13878 | 13866 | 13843 | 13680 | 13355 | 11180 | 8167 |
| diff. vs Splign, % | -0.04 | -0.10 | -0.19 | -0.73 | -1.33 | -1.19 | -0.79 |
| GMAP | | | | | 13481 | | 8066 |
| diff. vs Splign, % | -0.03 | -0.04 | -0.04 | -0.20 | -0.42 | -0.39 | -1.51 |
| SPA | 13878 | 13868 | 13855 | 13738 | | 11290 | |
| diff. vs Splign, % | -0.04 | -0.09 | -0.11 | -0.31 | -0.27 | -0.40 | -0.73 |
The number of the Subset 1 alignments at various levels of the in-frame identity. Fractions give the differences in the number of alignments by each method and Splign expressed as percentage of the total of sequences in the set.
| | 80% | 85% | 90% | 95% | 99% | 99.5% | 99.9% | 100% |
| Splign | 67342 | 67241 | 67105 | 66766 | 65767 | 64938 | 50968 | 35839 |
| Sim4 | 63968 | 63723 | 63465 | 62986 | 61668 | 60622 | 47820 | 34033 |
| diff. vs Splign, % | -4.68 | -4.88 | -5.05 | -5.24 | -5.68 | -5.99 | -4.37 | -2.50 |
| Spidey | 65276 | 65182 | 65066 | 64766 | 63844 | 62867 | 48714 | 34451 |
| diff. vs Splign, % | -2.86 | -2.86 | -2.83 | -2.77 | -2.67 | -2.87 | -3.13 | -1.92 |
| BLAT | 64017 | 63932 | 63817 | 63545 | 62747 | 62115 | 49946 | 35632 |
| diff. vs Splign, % | -4.61 | -4.59 | -4.56 | -4.47 | -4.19 | -3.91 | -1.42 | -0.29 |
| GMAP | 66568 | 66463 | 66317 | 65956 | 64856 | 64058 | 50589 | |
| diff. vs Splign, % | -1.07 | -1.08 | -1.09 | -1.12 | -1.26 | -1.22 | -0.53 | -0.22 |
| SPA | | | | | | | | 35679 |
| diff. vs Splign, % | -0.78 | -0.79 | -0.82 | -0.85 | -1.01 | -1.04 | -0.41 | -0.22 |
The number of the Subset 1 RefSeq alignments at various levels of the in-frame identity. Fractions give the differences in the number of alignments by each method and Splign expressed as percentage of the total of sequences in the set.
| | 80% | 85% | 90% | 95% | 99% | 99.5% | 99.9% | 100% |
| Splign | 13757 | 13747 | 13740 | 13733 | 13723 | 13688 | 12426 | 10323 |
| Sim4 | 13214 | 13179 | 13145 | 13110 | 13044 | 12958 | 11662 | 9680 |
| diff. vs Splign, % | -3.91 | -4.09 | -4.29 | -4.49 | -4.89 | -5.26 | -5.50 | -4.63 |
| Spidey | 13185 | 13176 | 13169 | 13162 | 13146 | 13089 | 11720 | 9706 |
| diff. vs Splign, % | -4.12 | -4.11 | -4.11 | -4.11 | -4.16 | -4.31 | -5.09 | -4.44 |
| BLAT | 13507 | 13498 | 13491 | 13484 | 13473 | 13441 | 12271 | 10284 |
| diff. vs Splign, % | -1.80 | -1.79 | -1.79 | -1.79 | -1.80 | -1.78 | -1.12 | -0.28 |
| GMAP | | | | | | | | |
| diff. vs Splign, % | -0.09 | -0.10 | -0.10 | -0.11 | -0.12 | -0.12 | -0.13 | -0.12 |
| SPA | 13715 | 13706 | 13698 | 13691 | 13676 | 13635 | 12373 | 10275 |
| diff. vs Splign, % | -0.30 | -0.30 | -0.30 | -0.30 | -0.34 | -0.38 | -0.38 | -0.35 |
Frequencies of splice sites in Subset 1 alignments
| | GT/AG | GC/AG | AT/AC | non-consensus |
| Sim4 | 96.21 | 0.78 | 0.06 | 2.96 |
| Spidey | 95.72 | 0.67 | 0.09 | 3.52 |
| BLAT | 97.87 | 0.74 | 0.10 | 1.29 |
| GMAP | 98.74 | 0.75 | 0.12 | 0.38 |
| SPA | 98.52 | 0.74 | 0.11 | 0.62 |
| Splign | 98.66 | 0.75 | 0.11 | 0.48 |
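A tally of this kind can be sketched as follows. This is an illustrative sketch, not the procedure used to build the table: it assumes each intron is supplied as a sequence on the transcript strand, so that its first two bases are the donor site and its last two the acceptor site. All names are hypothetical.

```python
from collections import Counter

# Donor/acceptor dinucleotide pairs matching the table's column labels.
CONSENSUS_PAIRS = {
    ("GT", "AG"): "GT/AG",
    ("GC", "AG"): "GC/AG",
    ("AT", "AC"): "AT/AC",
}
CLASSES = ["GT/AG", "GC/AG", "AT/AC", "non-consensus"]


def classify_intron(intron):
    """Label an intron by its terminal dinucleotides."""
    seq = intron.upper()
    if len(seq) < 4:
        return "non-consensus"
    return CONSENSUS_PAIRS.get((seq[:2], seq[-2:]), "non-consensus")


def splice_site_frequencies(introns):
    """Percentage of introns in each splice-site class."""
    counts = Counter(classify_intron(i) for i in introns)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: 100.0 * counts.get(c, 0) / total for c in CLASSES}


# Toy input: one intron per class.
freqs = splice_site_frequencies(["GTAAAG", "gcaaag", "ATAAAC", "CTAAAG"])
```

Introns aligned to the reverse genomic strand would need to be reverse-complemented before classification under this assumption.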
Span ratios of Subset 2 alignments
| | Sim4 | Spidey | BLAT | GMAP | SPA | Splign |
| median | 4.179 | 4.201 | 4.234 | 4.378 | 4.218 | 4.190 |
| mean | 11.242 | 8.534 | 9.134 | 10.102 | 8.671 | 8.420 |
Co-aligning EST test. 'Implied' is the number of introns estimated from EST-to-mRNA and mRNA-to-genomic alignments. 'Identified' is the number of introns found in EST-to-genomic alignments. 'Matching' is the number of introns in EST-to-genomic alignment matching those found in the mRNA-to-genomic alignments.
| | Sim4 | Spidey | BLAT | GMAP | SPA | Splign |
| Implied | 3472173 | 3419307 | 3417203 | 3419975 | 3425734 | 3419138 |
| Identified | 3174725 | 3249426 | 3150040 | 3219665 | 3407890 | 3352628 |
| Identified, as % of Implied | 91.4 | 95.0 | 92.2 | 94.1 | 99.5 | 98.1 |
| Matching | 3128325 | 2813461 | 2810059 | 3202121 | 3294548 | 3314074 |
| Matching, as % of Identified | 98.5 | 86.6 | 89.2 | 99.5 | 96.7 | 98.9 |
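The two percentage rows follow directly from the counts: identified introns are reported relative to implied introns, and matching introns relative to identified ones. The short sketch below reproduces the Sim4 column from the counts in the table above. The function name is hypothetical.

```python
def est_test_percentages(implied, identified, matching):
    """Return (identified as % of implied, matching as % of identified)."""
    return (100.0 * identified / implied, 100.0 * matching / identified)


# Sim4 counts from the co-aligning EST test table.
sim4_identified_pct, sim4_matching_pct = est_test_percentages(
    implied=3472173, identified=3174725, matching=3128325
)
print(f"{sim4_identified_pct:.1f} {sim4_matching_pct:.1f}")
```

Because the second ratio is taken over identified introns rather than implied ones, a method can score high on matching while still missing a sizeable share of the implied introns.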
Co-aligning EST test, based on mRNA-to-genomic alignments matching across the methods
| | Sim4 | Spidey | BLAT | GMAP | SPA | Splign |
| Implied | 1784663 | 1784663 | 1784663 | 1784663 | 1784663 | 1784663 |
| Identified | 1652062 | 1691136 | 1642759 | 1679932 | 1771713 | 1745535 |
| Identified, as % of Implied | 92.6 | 94.8 | 92.0 | 94.1 | 99.3 | 97.8 |
| Matching | 1635634 | 1477759 | 1475183 | 1672383 | 1719549 | 1729834 |
| Matching, as % of Identified | 99.0 | 87.4 | 89.8 | 99.6 | 97.1 | 99.1 |
Figure 2. Compart matching algorithm.
Filtering of genomic repeats by different methods. Q1 is the fraction of masked words. Q2 is the fraction of words masked exclusively by the method. In the case of RepeatMasker (RM) and WindowMasker (WM), a word was considered masked if all its bases were masked. The timing in each test was obtained using a single instance running on an Intel Xeon 2.33 GHz/8 GB Linux box.
| | | RepeatMasker | WindowMasker | Compart |
| Human | Q1 | 46.79 | 28.81 | 17.68 |
| | Q2 | 31.03 | 12.23 | 5.41 |
| | time (min) | 6077 | 131 | 7 |
| Mouse | Q1 | 40.55 | 28.61 | 17.71 |
| | Q2 | 25.24 | 12.41 | 5.46 |
| | time (min) | 5955 | 117 | 6 |
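The two quantities in the caption can be illustrated with a small sketch. It rests on several assumptions that this excerpt does not confirm: masking is represented as soft-masking (lowercase bases), words are fixed-length k-mers identified by their start position, and both fractions are taken over the total number of word positions. The function names are hypothetical.

```python
def masked_word_starts(seq, k):
    """Start positions of k-letter words whose bases are all soft-masked (lowercase)."""
    return {i for i in range(len(seq) - k + 1) if seq[i:i + k].islower()}


def q1_q2(masks, n_words):
    """masks maps a method name to its set of masked word positions.

    Returns {method: (Q1, Q2)} as percentages of n_words, where Q2 counts
    only words masked by that method and by no other method.
    """
    result = {}
    for name, own in masks.items():
        others = set()
        for other_name, other in masks.items():
            if other_name != name:
                others |= other
        result[name] = (100.0 * len(own) / n_words,
                        100.0 * len(own - others) / n_words)
    return result


# Toy example: the same 8-base sequence soft-masked by two hypothetical methods.
k = 4
masks = {
    "method_a": masked_word_starts("acgtACGT", k),
    "method_b": masked_word_starts("acgtaCGT", k),
}
stats = q1_q2(masks, n_words=len("acgtACGT") - k + 1)
```

In this toy case `method_b` masks one word that `method_a` does not, so only `method_b` has a non-zero Q2.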
Impact of the PV on the size of the index. N is the number of full-length mRNA sequences. R is the number of genomic index keys as a percentage of the number of distinct words on the unmasked genome.
| | | Human 36.3 | Mouse 37.1 |
| mRNA | N | 214 749 | 240 299 |
| | R | 8.4 | 10.2 |
| EST | N | 7 732 838 | 4 836 245 |
| | R | 38.6 | 27.2 |
Compartmentization based on local alignments computed using different methods. N1 is the number of mRNA sequences with 75% or higher coverage by alignments to any single chromosome or unplaced scaffold. N2 is the number of mRNA sequences with 75% or higher coverage by high-identity alignments to any single chromosome or unplaced scaffold. N3 is the number of sequences for which at least one compartment was identified with the minimum compartment identity of 75%. The computing time was collected on an Intel Xeon 2.33 GHz/8 GB Linux box.
| | | MB/RM | MB/WM | Compart |
| Human 36.3 | N1 | 232342 (94.95%) | 234565 (95.86%) | 238851 (97.61%) |
| | N2 | 224054 (91.56%) | 226293 (92.48%) | 229101 (93.63%) |
| | N3 | 231939 (94.79%) | 234207 (95.71%) | 236928 (96.82%) |
| | time (min) | 179 | 70 | 32 |
| Mouse 37.1 | N1 | 227831 (98.44%) | 228681 (95.17%) | 232245 (96.65%) |
| | N2 | 222746 (92.70%) | 223704 (93.09%) | 226107 (94.09%) |
| | N3 | 226964 (94.45%) | 227895 (94.84%) | 230478 (95.91%) |
| | time (min) | 372 | 59 | 29 |
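The coverage criterion behind N1 and N2 (75% or more of an mRNA covered by alignments to a single chromosome or scaffold) can be sketched as an interval-union computation. The sketch assumes local alignments are given as `(subject, query_start, query_end)` tuples with half-open, zero-based query coordinates. That layout and the function name are hypothetical, and this is not the code used to produce the table.

```python
from collections import defaultdict


def best_single_subject_coverage(hits, query_len):
    """Largest fraction of the query covered by hits to any one subject sequence."""
    by_subject = defaultdict(list)
    for subject, start, end in hits:
        by_subject[subject].append((start, end))

    best = 0.0
    for intervals in by_subject.values():
        covered = 0
        cur_end = 0  # right edge of the query region already counted
        for start, end in sorted(intervals):
            start = max(start, cur_end)  # skip the part that overlaps earlier hits
            if end > start:
                covered += end - start
                cur_end = end
        best = max(best, covered / query_len)
    return best


# Toy example: two overlapping hits on chr1 cover 80 of 100 query bases.
hits = [("chr1", 0, 50), ("chr1", 40, 80), ("chr2", 0, 30)]
coverage = best_single_subject_coverage(hits, query_len=100)
passes = coverage >= 0.75
```

For N2, the same computation would be applied after discarding hits below the chosen identity cut-off.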
Score selection. W(W) is the score for matching (mismatching) bases. Other notations are given in the text.
| Alignment A | Alignment B | Conditions |
| a consensus intron and no indels | a non-consensus intron and no indels | |
| a consensus intron and an indel | a non-consensus intron and no indels | |
| two consensus introns and no indels | a non-consensus intron and no indels | |
| a consensus intron and no indels | two consensus introns and no indels | |
| two consensus introns and no indels | a non-consensus intron and an indel | |
| a consensus intron and no indels | two consensus introns and an indel |