Abstract
The mapping and alignment of transcripts (cDNA, expressed sequence tag or amino acid sequences) onto a genomic sequence is a fundamental step for genome annotation, including gene finding and analyses of transcriptional activity, alternative splicing and nucleotide polymorphisms. As DNA sequence data of genomes and transcripts are accumulating at an unprecedented rate, steady improvement in the accuracy, speed and space requirements of the computational tools for mapping/alignment is desired. We devised a multi-phase heuristic algorithm and implemented it in the stand-alone computer program Spaln (space-efficient spliced alignment). Spaln is reasonably fast and space efficient; it requires <1 Gb of memory to map and align >120 000 Unigene sequences onto the unmasked whole human genome with a conventional computer, finishing the job in <6 h. With artificially introduced noise of various levels, Spaln significantly outperforms other leading alignment programs currently available with respect to the accuracy of mapped exon-intron structures. This performance is achieved without extensive learning procedures to adjust parameter values to a particular organism. Given its ease of use and accuracy, Spaln may be useful across a wide range of genome analyses.
Year: 2008 PMID: 18344523 PMCID: PMC2377433 DOI: 10.1093/nar/gkn105
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Schematic diagram showing blocks, words and the greedy algorithm for finding significant blocks. Each arrow corresponds to a k-mer word, and its direction indicates the direct or reverse strand. A circle indicates the presence of the word shown on the left within the block shown on top. Big circles distinguish true hits from background noise, which is indicated by small circles. Three blocks (numbers 10–12) containing three or more hits are considered significant and are highlighted in yellow.
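The idea of counting word hits per block can be illustrated with a minimal sketch. This is not Spaln's greedy, score-based procedure: it only counts distinct query k-mers that occur in each fixed-size genomic block and reports blocks with at least three hits, as in the caption. The function names, the word size `k`, the block size and the toy sequences are all assumptions made for illustration, and only the direct strand is considered.

```python
from collections import Counter, defaultdict


def index_blocks(genome, k, block_size):
    """Map each k-mer to the set of block numbers in which it starts."""
    index = defaultdict(set)
    for pos in range(len(genome) - k + 1):
        index[genome[pos:pos + k]].add(pos // block_size)
    return index


def significant_blocks(query, index, k, min_hits=3):
    """Return blocks hit by at least min_hits distinct query k-mers."""
    hits = Counter()
    # Use distinct words so a repeated k-mer in the query counts once.
    words = {query[i:i + k] for i in range(len(query) - k + 1)}
    for word in words:
        for block in index.get(word, ()):
            hits[block] += 1
    return sorted(block for block, n in hits.items() if n >= min_hits)


# Toy genome made of three 20-base blocks (hypothetical data).
genome = ("ACGTACGTTTGACCAGGTCA"
          "GGGGGGGGGGGGGGGGGGGG"
          "TTCAGCATCGATCGGATTAC")
index = index_blocks(genome, k=4, block_size=20)
print(significant_blocks("CAGCATCGAT", index, k=4))
```

In this toy case the query is a substring of the third block, so only that block collects enough hits. A real implementation would also have to separate true hits from background noise, which simple counting does not do.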
Figure 2. Block score as a function of the number of test words. (a) Human Unigene sequences were shuffled and used as queries against the whole human genome. The mean (open triangles) and the SD (crosses) are fitted to logarithmic and constant curves, respectively. Each bin corresponds to the number of rounds at which the recurrence finished. The number of occasions in each bin is shown by filled circles on a logarithmic scale. (b) Mean block score (filled squares) for real Unigene sequences mapped on the human genome. SDs are shown by error bars. A part of (a) is also shown for comparison. The number of sequences in each bin is shown by filled circles on a logarithmic scale.
List of programs and options used for experiment
| Name | Version | Options for alignment | Options for mapping | Reference |
|---|---|---|---|---|
| | 34 | -noTrimA -fine -q=rna | -noTrimA -fine -q=rna -ooc=11.ooc | ( |
| | 2005-05-06 | default | | ( |
| | 1.4.0 | --showalignment n --showtargetgff y -n 1 -S n --refine region -m e2g | | ( |
| | 2007-09-28 | default | -f 2 -B 2 | ( |
| | 2007-09-28 | -X | | ( |
| | 2.2.14 | | -D 3 -F'm R;V;D' | ( |
| | 2003-09-21 | default | | ( |
| | 1.2.1 | -Q3 -pa | -Q7 -M | This work |
| | 1.2.1 | -Q3 -pa -ya | | This work |
| | 1.2.1 | -Q3 -pa -yS -yX | | This work |
| | 1.2.1 | -Q3 -pa -yX | -Q7 -yX -M | This work |
| | 1.2.1 | -Q3 -pa -yX -LS | -Q7 -yX -LS -M | This work |
Blank cell indicates ‘not examined’. The ‘-X’ option of Gmap indicates that canonical intron boundaries are favoured. Conversely, the ‘-ya’ option of Spaln indicates that non-canonical intron boundaries are accepted. The ‘-LS’ (local similarity) option was used only in combination with the ‘-yX’ (cross-species) option.
Mapping of Unigene sequences onto human genome
| Program | Mapped (a) | Records (b) | Sensitivity (c) (%) | Selectivity (d) (%) | CPU (h) |
|---|---|---|---|---|---|
| | 124 103 | 3 774 716 | 99.80 | 3.29 | 37.37 |
| | 123 373 | 162 700 | 99.21 | 75.83 | 11.00 |
| | 118 643 | 304 138 | 95.41 | 39.01 | 63.05 |
| | 124 112 | 142 316 | 99.80 | 87.21 | 5.58 |
(a) The number of query sequences that were mapped on at least one genomic locus.
(b) The total number of gene models reported.
(c) The percentage of mapped queries among the 124 355 cDNA sequences examined.
(d) 100 × the number of mapped queries divided by the total number of records reported.
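As a worked check of the two definitions above, the following minimal sketch recomputes sensitivity and selectivity from the counts in the table. The function name `mapping_stats` is an assumption for illustration; the total of 124 355 queries comes from the footnotes, and the counts used are those of the last table row.

```python
TOTAL_QUERIES = 124_355  # cDNA sequences examined, from the table footnotes


def mapping_stats(mapped, records, total=TOTAL_QUERIES):
    """Return (sensitivity %, selectivity %) rounded to two decimals."""
    sensitivity = 100.0 * mapped / total    # mapped queries / all queries
    selectivity = 100.0 * mapped / records  # mapped queries / reported records
    return round(sensitivity, 2), round(selectivity, 2)


# Counts from the last row of the table: 124 112 mapped, 142 316 records.
print(mapping_stats(124_112, 142_316))
```

A selectivity far below 100% means a program reports many more gene models than mapped queries, that is, many redundant or spurious loci per query.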
Figure 3. Comparison of the performance of various spliced alignment methods. Error rates at the gene level (a) or the exon level (b) are shown as a function of artificially introduced noise levels. The actual error rates of Sim4 and Blat are, respectively, twice and five times as large as those indicated by the bar heights. The letters X and S attached to Spaln indicate that the ‘-yX’ and ‘-yX -yS’ options were applied, respectively. Each error rate is the mean of six trials; the numerical values and SDs are given in Tables S4 and S5.
Accuracy in cross-species comparison between human and mouse cDNA and genomic sequences
| Measure | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| #Answer | 266 | 981 | 982 | 937 | 978 | 982 | 982 | 982 |
| #Gene | 70 | 429 | 458 | 367 | 161 | 593 | 752 | 714 |
| %Gene | 7.13 | 43.69 | 46.64 | 37.37 | 16.40 | 60.39 | 76.58 | 72.71 |
| %Exon | 67.63 | 89.51 | 80.97 | 81.41 | 59.54 | 90.09 | 96.21 | 95.11 |
| %IntExon | 64.17 | 94.26 | 76.36 | 78.06 | 62.59 | 93.77 | 96.87 | 95.81 |
| %TermExon | 66.22 | 69.94 | 70.44 | 63.58 | 47.41 | 82.40 | 93.20 | 92.10 |
| %Junction | 79.50 | 93.97 | 85.97 | 86.03 | 73.54 | 93.27 | 97.91 | 97.26 |
| %Nucleotide | 95.30 | 98.20 | 93.22 | 92.60 | 91.18 | 97.77 | 99.38 | 99.31 |
| CPU (s) | 143.1 | 3181.3 | 221.2 | 52.5 | 33.9 | 220.8 | 190.3 | 145.5 |
A total of 982 human–mouse cross-species pairs of genomic segments and CDS sequences in the Projector data set were examined. ‘#Answer’ and ‘#Gene’ are the numbers of runs that produced any output and that produced a perfectly correct gene structure, respectively. The accuracies at the exon, junction and nucleotide levels are harmonic means of the specificity and sensitivity, given by 200C/(T + P), where C, T and P are the correctly predicted, true and predicted quantities, respectively. For internal (%IntExon) or terminal (%TermExon) exons, the sensitivity defined by 100 × C/T is presented.
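The accuracy measure can be made concrete with a short sketch. With sensitivity C/T and specificity C/P, their harmonic mean is 2(C/T)(C/P)/((C/T) + (C/P)) = 2C/(T + P), which gives 200C/(T + P) as a percentage. The function names and the example counts below are hypothetical and are not taken from the paper's data.

```python
def accuracy(correct, true, predicted):
    """Harmonic mean of sensitivity and specificity in percent: 200C/(T + P)."""
    if true + predicted == 0:
        return 0.0  # nothing true and nothing predicted
    return 200.0 * correct / (true + predicted)


def sensitivity(correct, true):
    """Sensitivity in percent: 100 * C / T."""
    return 100.0 * correct / true


# Hypothetical counts: 100 true exons, 95 predicted, 90 of them correct.
c, t, p = 90, 100, 95
sn = c / t  # sensitivity as a fraction
sp = c / p  # specificity as a fraction
harmonic = 100.0 * 2 * sn * sp / (sn + sp)

print(accuracy(c, t, p), harmonic, sensitivity(c, t))
```

The closed form avoids dividing by zero when C is zero but T and P are not, a case in which the explicit harmonic-mean expression is undefined.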
Figure 4. Accuracy in mapping and alignment of CDS sequences onto the human or mouse genomic sequence. Error rates at the gene level (a) and at the exon level (b) are shown as functions of artificially introduced noise levels. The actual error rates of Blat are three times as large as those indicated by the bar heights. SpalnXL indicates that the ‘-yX -LS’ options were applied to Spaln at run time. Each error rate is the mean of six trials; the numerical values and SDs are given in Table S9 with other information.