Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, Steven L Salzberg.
Abstract
TopHat is a popular spliced aligner for RNA-sequence (RNA-seq) experiments. In this paper, we describe TopHat2, which incorporates many significant enhancements to TopHat. TopHat2 can align reads of various lengths produced by the latest sequencing technologies, while allowing for variable-length indels with respect to the reference genome. In addition to de novo spliced alignment, TopHat2 can align reads across fusion breaks, which can occur after genomic translocations. TopHat2 combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. TopHat2 is available at http://ccb.jhu.edu/software/tophat.
Year: 2013 PMID: 23618408 PMCID: PMC4053844 DOI: 10.1186/gb-2013-14-4-r36
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Two possible incorrect alignments of spliced reads. 1) A read extending a few bases into the flanking exon can be aligned to the intron instead of the exon. 2) A read spanning multiple exons from genes with processed pseudogene copies can be aligned to the pseudogene copies instead of the gene from which it originates.
Performance of TopHat2 and other spliced aligners on a set of 20 million 100-bp, single-end reads, simulated based on transcripts from the entire human genome.
| Program | No. of mapped reads | Correctly mapped reads, % | Incorrectly mapped reads, % | Unmapped reads, % | Alignment accuracy of junction reads, % | Alignment accuracy of short-anchored reads, % |
|---|---|---|---|---|---|---|
| TopHat2 + Bowtie1 | 19,826,638 | 98.31 | 0.82 | 0.87 | 95.28 | 93.69 |
| TopHat2 + Bowtie2 | 19,826,673 | 98.03 | 1.10 | 0.87 | 94.28 | 89.67 |
| TopHat1.14 | 19,616,874 | 94.64 | 3.45 | 1.91 | 84.44 | 44.08 |
| GSNAP | 19,997,255 | 94.21 | 5.77 | 0.02 | 83.15 | 26.01 |
| RUM | 19,555,823 | 88.11 | 9.67 | 2.22 | 65.35 | 8.59 |
| MapSplice | 19,872,372 | 97.28 | 2.08 | 0.64 | 92.09 | 75.57 |
| STAR | 19,087,508 | 92.14 | 3.30 | 4.56 | 77.17 | 3.54 |
There were 6,862,278 reads spanning one or more splice junctions; the alignment accuracy of junction reads refers to this set.
There were 1,448,022 reads extending 10 bp or less into one exon; the alignment accuracy of the short-anchored reads is based on these alignments.
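As a quick consistency check on the single-end table above, the following sketch confirms that the correctly mapped, incorrectly mapped, and unmapped percentages in each row add up to 100, and converts a junction-read accuracy into an approximate read count using the 6,862,278 junction reads stated above. The function names are hypothetical, and the derived count is only an estimate because the published percentages are rounded.

```python
# Rows copied from the single-end table:
# (correct %, incorrect %, unmapped %, junction-read accuracy %)
ROWS = {
    "TopHat2 + Bowtie1": (98.31, 0.82, 0.87, 95.28),
    "TopHat2 + Bowtie2": (98.03, 1.10, 0.87, 94.28),
    "TopHat1.14": (94.64, 3.45, 1.91, 84.44),
    "GSNAP": (94.21, 5.77, 0.02, 83.15),
    "RUM": (88.11, 9.67, 2.22, 65.35),
    "MapSplice": (97.28, 2.08, 0.64, 92.09),
    "STAR": (92.14, 3.30, 4.56, 77.17),
}

JUNCTION_READS = 6_862_278  # reads spanning one or more splice junctions


def percent_total(correct, incorrect, unmapped):
    """Sum of the three mutually exclusive outcome percentages."""
    return correct + incorrect + unmapped


def approx_correct_junction_reads(junction_accuracy_pct):
    """Estimate of correctly aligned junction reads (rounded percentages in)."""
    return round(JUNCTION_READS * junction_accuracy_pct / 100)


if __name__ == "__main__":
    for name, (c, i, u, j) in ROWS.items():
        total = percent_total(c, i, u)
        print(f"{name}: total = {total:.2f}%, "
              f"~{approx_correct_junction_reads(j):,} junction reads correct")
```

Each row's three outcome percentages should total 100 within rounding error, which is a cheap way to catch transcription mistakes when copying the table.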
Performance of TopHat2 and other spliced aligners on a set of 20 million pairs of 100-bp reads, simulated based on transcripts from the entire human genome.
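| Program | No. of mapped pairs | Correctly mapped pairs, % | Incorrectly mapped pairs, % | Unmapped pairs, % | Alignment accuracy of junction pairs, % | Alignment accuracy of pairs containing short-anchored reads, % |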
|---|---|---|---|---|---|---|
| TopHat2 + Bowtie1 | 19,683,426 | 96.70 | 1.72 | 1.58 | 93.31 | 90.09 |
| TopHat2 + Bowtie2 | 19,686,006 | 96.19 | 2.24 | 1.57 | 92.03 | 85.88 |
| TopHat1.14 | 19,219,055 | 89.57 | 6.53 | 3.90 | 78.36 | 40.39 |
| GSNAP | 19,999,867 | 88.84 | 11.16 | 0.00 | 76.55 | 22.87 |
| RUM | 19,869,579 | 79.07 | 20.28 | 0.65 | 56.28 | 8.42 |
| MapSplice | 19,342,087 | 92.03 | 4.68 | 3.29 | 86.53 | 72.48 |
| STAR | 19,951,620 | 85.21 | 14.55 | 0.24 | 68.94 | 3.16 |
There were 9,491,394 pairs of reads classified as junction pairs.
There were 2,702,624 pairs containing short-anchored reads.
Performance of TopHat2 and other spliced aligners on single-end reads containing insertions and deletions (indels) of 1 to 3 bp.
| Program | Accuracy, % | Accuracy, % | ||
|---|---|---|---|---|
| TopHat2 | 70.9 | 16.8 | 12.1 | 2.8 |
| TopHat2 | 63.7 | 25.2 | 62.6 | 21.2 |
| GSNAP | 82.7 | 71.9 | 83.1 | 71.8 |
| RUM | 69.4 | 43.0 | 70.3 | 45.4 |
| MapSplice | 27.3 | 3.7 | 27.5 | 3.8 |
| STAR | 46.6 | 16.9 | 47.7 | 17.1 |
The number of reads containing each type of error is indicated in the column header. Boundary indels occur within 25 bp of an exon boundary. Percentages refer only to the reads of each type, not to the entire dataset.
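The boundary-indel definition above can be sketched as a simple distance test. This is an illustration only: the function name, the use of single integer coordinates for indels and exon boundaries, and the choice to treat "within 25 bp" as inclusive are all assumptions, not details taken from the paper.

```python
def is_boundary_indel(indel_pos, exon_boundaries, window=25):
    """Return True if the indel lies within `window` bp of any exon boundary.

    indel_pos       -- genomic coordinate of the indel (hypothetical convention)
    exon_boundaries -- iterable of exon start/end coordinates
    window          -- distance threshold in bp (25 in the table footnote);
                       the comparison is inclusive, which is an assumption
    """
    return any(abs(indel_pos - boundary) <= window for boundary in exon_boundaries)


if __name__ == "__main__":
    boundaries = [1000, 1200]  # made-up exon boundary coordinates
    for pos in (1010, 1100, 1190):
        print(pos, is_boundary_indel(pos, boundaries))
```

Under this definition an indel in the middle of a long exon is counted only in the overall indel column, while one near a splice site is also counted as a boundary indel.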
Performance of TopHat2 and other spliced aligners on paired reads in which at least one of the reads contained insertions and deletions (indels) of 1 to 3 bp.
| Program | Accuracy, % | Accuracy, % | ||
|---|---|---|---|---|
| TopHat2 | 69.8 | 16.3 | 14.0 | 3.1 |
| TopHat2 + Bowtie2 | 62.3 | 24.0 | 60.8 | 19.8 |
| GSNAP | 77.0 | 63.8 | 77.8 | 64.8 |
| RUM | 60.3 | 34.3 | 61.3 | 36.0 |
| MapSplice | 25.5 | 3.4 | 25.0 | 3.2 |
| STAR | 53.4 | 19.2 | 54.9 | 21.4 |
The number of pairs containing each type of error is indicated in the column header.
Boundary indels occur within 25 bp of an exon boundary. Percentages refer only to the pairs of each type, not to the entire dataset.
Figure 2. The number of read alignments from TopHat2, GSNAP, RUM, MapSplice, and STAR. The RNA-seq reads are from Chen et al. [11]. TopHat2 was run with and without realignment (realignment edit distance of 0). TopHat2, GSNAP, and STAR were run in both de novo and gene-mapping modes, while MapSplice was run only in de novo mode and RUM was run only in gene-mapping mode. The number of alignments at each edit distance is cumulative; for instance, the number of alignments at an edit distance of 2 includes all the alignments with an edit distance of 0, 1, or 2.
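The cumulative counting described in this caption amounts to a running sum over per-edit-distance counts. A minimal sketch follows; the counts are invented purely for illustration and are not data from the figure.

```python
from itertools import accumulate

# Hypothetical number of alignments found at exactly each edit distance 0..3.
# These values are made up; they are NOT taken from the figure.
alignments_at_exact_distance = [50, 30, 20, 10]

# Running sum: the entry at index d counts alignments with edit distance <= d.
cumulative = list(accumulate(alignments_at_exact_distance))

if __name__ == "__main__":
    for distance, total in enumerate(cumulative):
        print(f"edit distance <= {distance}: {total} alignments")
    # The value for distance 2 is 50 + 30 + 20 = 100.
```

Because the series is cumulative, it can never decrease as the edit distance grows, and the difference between neighbouring values recovers the count at one exact edit distance.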
Figure 3. The number of spliced-read alignments from TopHat2, GSNAP, RUM, MapSplice, and STAR. The RNA-seq reads are from Chen et al. [11]. TopHat2, GSNAP, and STAR were run in both de novo and gene-mapping modes, while MapSplice was run only in de novo mode and RUM was run only in gene-mapping mode. For each mapping mode, the two panels on the left show the number of spliced alignments whose splice sites were found in the gene annotations, and the two panels on the right show the number of all spliced alignments, including novel splice sites.
Expression levels of genes with pseudogene copies, from Chen et al. [11] (a).
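| No. of pseudogene copies (b) | No. of genes (%) (c) | Read pairs, % (d) | Ratio (e) | Normalized read pairs, % (f) | Normalized ratio (f) |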
|---|---|---|---|---|---|
| 1 | 553 (1.7%) | 6.85 | × 4.02 | 9.37 | × 5.49 |
| 2 | 113 (0.4) | 5.15 | × 14.79 | 5.20 | × 14.93 |
| 3 | 49 (0.2) | 1.27 | × 8.38 | 1.96 | × 12.99 |
| 4 | 27 (0.1) | 2.27 | × 27.32 | 2.28 | × 27.35 |
| ≥ 5 | 130 (0.4) | 6.91 | × 17.24 | 8.08 | × 20.16 |
| Total (≥ 1) | 872/32,439 (2.7) | 22.45 | × 8.35 | 26.88 | × 10.00 |
(a) Using Bowtie2, we aligned RNA-seq paired-end reads to 32,439 annotated genes.
(b) Number of pseudogene copies of a gene. The first row shows genes that have just one pseudogene, followed by rows for genes with two, three, four, and at least five pseudogene copies.
(c) Number of genes with the specified number of pseudogene copies; for example, 553 genes (1.7% of all genes) have one pseudogene copy.
(d) Percentage of read pairs that were mapped to genes with pseudogene copies.
(e) Ratio of columns 3 and 2.
(f) These two columns were similarly defined using a normalized count, where the number of reads mapping to each gene was normalized to account for gene length.
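The ratio columns in the table above can be reproduced from the gene counts and read-pair percentages. The sketch below assumes the ratio was computed from the unrounded gene fraction (for example 113/32,439 rather than the displayed 0.4%), which appears consistent with the published values; the function name is hypothetical.

```python
TOTAL_GENES = 32_439  # annotated genes used for the alignment

# (pseudogene copies, number of genes, read pairs %, normalized read pairs %)
ROWS = [
    ("1", 553, 6.85, 9.37),
    ("2", 113, 5.15, 5.20),
    ("3", 49, 1.27, 1.96),
    ("4", 27, 2.27, 2.28),
    (">= 5", 130, 6.91, 8.08),
    ("Total (>= 1)", 872, 22.45, 26.88),
]


def enrichment(read_pct, n_genes, total_genes=TOTAL_GENES):
    """Share of read pairs divided by share of genes (both in percent)."""
    gene_pct = 100 * n_genes / total_genes
    return read_pct / gene_pct


if __name__ == "__main__":
    for copies, n_genes, read_pct, norm_pct in ROWS:
        print(f"{copies}: x {enrichment(read_pct, n_genes):.2f}, "
              f"normalized x {enrichment(norm_pct, n_genes):.2f}")
```

Small differences from the published ratios are expected, because the read-pair percentages in the table are themselves rounded to two decimals.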
Figure 4. The number of read and spliced-read alignments from TopHat2, using different realignment edit distances and no realignment. Edit distances of 0, 1, and 2 were used. As TopHat2 performs more realignment (from no realignment, to a realignment edit distance of 2, then 1, then 0), the numbers of read alignments and spliced-read alignments increase. The differences in the number of read alignments between TopHat2 runs with different realignment edit distances are mostly explained by the increase in the number of spliced-read alignments.
Figure 5. The number of spliced-read alignments from TopHat2, GSNAP, STAR, and MapSplice without using gene annotation. The number of read alignments whose splice sites were found in the gene annotations is shown in brown, and the number of all spliced-read alignments, including novel splice sites, is shown in green.
Figure 6. TopHat2 pipeline. Details are given in the main text.