| Literature DB >> 26740642 |
I Medina1, J Tárraga2, H Martínez3, S Barrachina3, M I Castillo3, J Paschall4, J Salavert-Torres5, I Blanquer-Espert6, V Hernández-García5, E S Quintana-Ortí3, J Dopazo7.
Abstract
As sequencing technologies progress, the amount of data produced grows exponentially, shifting the bottleneck of discovery towards the data analysis phase. In particular, currently available mapping solutions for RNA-seq leave room for improvement in terms of sensitivity and performance, hindering an efficient analysis of transcriptomes by massive sequencing. Here, we present an innovative approach that combines re-engineering, optimization and parallelization. This solution results in a significant increase of mapping sensitivity over a wide range of read lengths and substantial shorter runtimes when compared with current RNA-seq mapping methods available.Entities:
Keywords: Burrows-Wheeler Transform; RNA-seq; high-performance computing; mapping
Mesh:
Year: 2016 PMID: 26740642 PMCID: PMC4833417 DOI: 10.1093/dnares/dsv039
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1.Schema of the implementation of the mapping process. (Top) Contiguous seeds of size (16 bp) are taken covering the whole read. Also, two more overlapping seeds near the ends of the read are taken for anchoring the ends to the exons. (Middle) Seeds are mapped without allowing any mismatch. Seeds mapped closer than read size and, in the same strand orientation, constitute a candidate alignment location (CAL). (Bottom) CALs closer than 500,000 bp and, in the same strand orientation, are clustered to form candidate exons and transcripts. These are evaluated and scores based on SW are assigned to them. This figure is available in black and white in print and in colour at DNA Research online.
Figure 2.Representation of mapping sensitivity (percentage of reads correctly mapped) versus runtimes (in min) for simulated datasets of different single read lengths (from 50 bp, upper left, to 400 bp, lower right) containing 10 million single-end reads in two scenarios of variability (low variability represented by circles and high variability represented by triangles). The colour code corresponds to the different programs. In all the cases, HPG Aligner sensibilities and runtimes were better than the ones shown by the rest of programs (Supplementary Table S1). These differences increase as read length and error increase.
Benchmarking for the simulated dataset containing 10 million single-end reads simulated with the BEERS program
| RL | MR (%) | HPG Aligner 1 | HISAT | STAR 2 | TopHat 2 + Bowtie 2 | MapSplice 2 | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IMR | RNM | CMR | T | IMR | RNM | CMR | T | IMR | RNM | CMR | T | IMR | RNM | CMR | T | IMR | RNM | CMR | T | ||
| 0.1 | 3.08 | 0.42 | 96.5 | 4.8 | 4.42 | 3.22 | 92.36 | 4.45 | 7.32 | 1.46 | 91.22 | 8.28 | 2.05 | 0.93 | 97.02 | 36.35 | 2.82 | 1.16 | 96.02 | 26.93 | |
| 50 | 1 | 4.26 | 0.84 | 94.9 | 4.9 | 9.79 | 8.61 | 81.60 | 4.80 | 11.23 | 1.72 | 87.05 | 8.73 | 2.06 | 3.04 | 94.90 | 79.65 | 2.95 | 2.03 | 95.02 | 26.13 |
| 2 | 5.82 | 2.26 | 91.92 | 5.2 | 16.29 | 17.18 | 66.53 | 4.58 | 15.44 | 3.16 | 81.40 | 8.85 | 2.00 | 10.18 | 87.82 | 374.9 | 3.21 | 4.81 | 91.98 | 26.08 | |
| 0.1 | 3.51 | 0.24 | 96.25 | 5.8 | 2.30 | 1.13 | 96.57 | 6.21 | 7.81 | 0.46 | 91.73 | 9.52 | 0.28 | 2.98 | 96.74 | 48.22 | 3.31 | 0.24 | 96.45 | 37.32 | |
| 75 | 1 | 4.17 | 0.94 | 94.89 | 6.0 | 4.64 | 3.48 | 91.88 | 6.40 | 11.82 | 1.18 | 87.00 | 9.97 | 1.24 | 6.90 | 91.86 | 50.78 | 3.59 | 1.16 | 95.25 | 35.05 |
| 2 | 5.64 | 1.27 | 93.09 | 6.4 | 10.59 | 10.46 | 78.95 | 6.46 | 16.62 | 1.55 | 81.83 | 11.25 | 0.94 | 22.18 | 76.88 | 58.93 | 5.13 | 2.20 | 92.67 | 34.28 | |
| 0.1 | 4.07 | 0.24 | 95.69 | 7.4 | 2.57 | 1.24 | 96.19 | 7.86 | 7.59 | 0.64 | 91.77 | 11.77 | 1.69 | 1.94 | 96.37 | 65.08 | 3.54 | 0.26 | 96.20 | 51.17 | |
| 100 | 1 | 4.95 | 0.55 | 94.50 | 7.5 | 5.70 | 4.51 | 89.79 | 8.15 | 11.90 | 0.62 | 87.48 | 12.08 | 0.99 | 11.70 | 87.31 | 74.02 | 3.98 | 0.68 | 95.34 | 46.03 |
| 2 | 6.07 | 0.93 | 93.00 | 7.8 | 15.64 | 17.19 | 67.17 | 8.40 | 16.48 | 1.14 | 82.38 | 12.35 | 0.60 | 36.14 | 63.26 | 85.65 | 6.45 | 1.50 | 92.05 | 45.6 | |
| 0.1 | 5.58 | 0.31 | 94.11 | 10.5 | 3.15 | 1.56 | 95.29 | 11.75 | 8.40 | 0.57 | 91.03 | 15.07 | 1.49 | 3.55 | 94.96 | 97.9 | 5.45 | 0.21 | 94.34 | 56.88 | |
| 150 | 1 | 7.77 | 1.95 | 90.28 | 10.6 | 12.49 | 13.03 | 74.48 | 11.41 | 12.88 | 2.73 | 84.39 | 15.08 | 0.82 | 27.13 | 72.05 | 121.55 | 8.23 | 2.67 | 89.10 | 53.53 |
| 2 | 7.06 | 0.78 | 92.16 | 10.9 | 24.42 | 39.76 | 35.82 | 11.91 | 15.97 | 0.73 | 83.30 | 15.73 | 0.45 | 61.84 | 37.71 | 140.28 | 13.65 | 1.57 | 84.78 | 56.73 | |
| 0.1 | 8.21 | 0.66 | 91.13 | 16.55 | 4.49 | 2.47 | 93.04 | 17.80 | 9.13 | 0.62 | 90.25 | 21.58 | 1.62 | 6.70 | 91.68 | 185.78 | 11.51 | 0.23 | 88.26 | 75.55 | |
| 250 | 1 | 8.43 | 0.81 | 90.76 | 16.5 | 22.28 | 31.6 | 46.12 | 19.28 | 12.99 | 0.71 | 86.30 | 22.3 | 0.28 | 53.29 | 46.43 | 256.83 | 29.85 | 0.75 | 69.40 | 79.28 |
| 2 | 8.72 | 1.62 | 89.66 | 17.45 | 17.75 | 77.04 | 5.21 | 17.46 | 18.35 | 1.43 | 80.22 | 23.35 | 0.07 | 89.62 | 10.31 | 309.85 | 35.27 | 1.24 | 63.49 | 76.37 | |
| 0.1 | 10.02 | 1.08 | 88.90 | 25.45 | 6.88 | 5.04 | 88.08 | 28.5 | 9.98 | 0.56 | 89.46 | 32.12 | 0.99 | 12.32 | 86.69 | 369.58 | 17.33 | 0.03 | 82.64 | 106.67 | |
| 400 | 1 | 12.94 | 3.35 | 83.71 | 25.6 | 22.52 | 66.17 | 11.31 | 29.7 | 14.82 | 4.26 | 80.92 | 33.83 | 0.09 | 82.40 | 17.51 | 573.18 | 51.64 | 1.65 | 46.71 | 103.98 |
| 2 | 11.78 | 2.84 | 85.38 | 26.0 | 3.08 | 96.82 | 0.10 | 27.10 | 32.75 | 2.75 | 64.5 | 24.5 | 0.01 | 99.02 | 0.97 | 603.28 | 54.34 | 0.51 | 45.15 | 106.53 | |
First column, RL, indicated read length in bp. Second column represents the mutation rate (MR). For each program, the table contains the following columns with the percentages of: IMR: incorrectly mapped reads; RNM: reads not mapped; CMR: correctly mapped reads covering the corresponding splice junctions. The last column, T, represents runtimes for producing a BAM file in min.
Real RNA-seq datasets taken from sequence read archive (SRA)
| SRA Study | Run accession | Run description | HPG Aligner | HISAT | STAR | TopHat2 + Bowtie2 | MapSplice | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RNM | CMR | Time | RNM | CMR | Time | RNM | CMR | Time | RNM | CMR | Time | RNM | CMR | Time | |||
| SRP009262 | SRR364003_1 | Illumina HiSeq 81 M 100 nt (single-end) | 4.53 | 79.72 | 11.0 | 15.16 | 76.10 | 10.43 | 5.50 | 75.00 | 13.9 | 25.73 | 63.38 | 115.9 | 4.63 | 76.83 | 139.9 |
| SRR364003_1 | Illumina HiSeq 81 M 100 nt (paired-end) | 4.90 | 78.20 | 21.6 | 15.00 | 77.8 | 21.46 | 8.00 | 72.02 | 22.7 | 29.26 | 59.9 | 230.6 | 5.04 | 77.5 | 288.1 | |
| SRP003173 | SRR063344 | Roche 454 GS FLX 1.26 M ∼600 nt (single-end) | 11.40 | 48.04 | 6.9 | 99.87 | 0.10 | 3.45 | 41.6 | 30.25 | 6.6 | 99.97 | 0.02 | 520.6 | NA | NA | NA |
SRA SRP009262 study was sequenced with Illumina HiSeq and contains 81.6 M reads (20 M reads are used for the benchmarking) of 100 nt length. Aligners were run in single and paired-end mode. SRA SRP003173 study was sequenced with Roche 454 GS FLX and contains 1.26 M reads (1,265,460) of ∼600 nt.
For each program, the table contains the following columns with the percentages of: RNM: reads not mapped; CMR: correctly mapped reads (mapped reads that cover the corresponding to known transcript positions). The last column, T, represents runtimes for producing a BAM file in min.
Scalability with the number of reads
| Million reads | HPG Aligner | HISAT | STAR | TopHat2+Bowtie2 | MapSplice |
|---|---|---|---|---|---|
| 2 | 0.9 | 0.95 | 4.5 | 38.7 | 11.8 |
| 5 | 1.9 | 1.7 | 5.6 | 40.5 | 18.65 |
| 10 | 3.8 | 4.2 | 7.75 | 75.2 | 36.4 |
| 20 | 7.6 | 7.28 | 13.5 | 112.2 | 68.5 |
Runtimes for producing a BAM file (in min) of the different aligners in datasets of growing size.