Yongchao Liu, Bernt Popp, Bertil Schmidt.
Abstract
The majority of next-generation sequencing short-reads can be properly aligned by leading aligners at high speed. However, alignment quality can still be improved, since usually not all reads can be correctly aligned to large genomes, such as the human genome, even for simulated data. Even slight improvements in this area are important but challenging, and usually require significantly more computational effort. In this paper, we present CUSHAW3, an open-source, parallelized, sensitive and accurate short-read aligner for both base-space and color-space sequences. In this aligner, we have investigated a hybrid seeding approach to improve alignment quality, which incorporates three different seed types, i.e. maximal exact match seeds, exact-match k-mer seeds and variable-length seeds, into the alignment pipeline. Furthermore, three techniques (a weighted seed-pairing heuristic, paired-end alignment pair ranking, and read mate rescuing) have been conceived to facilitate accurate paired-end alignment. For base-space alignment, we have compared CUSHAW3 to Novoalign, CUSHAW2, BWA-MEM, Bowtie2 and GEM by aligning both simulated and real reads to the human genome. The results show that CUSHAW3 consistently outperforms CUSHAW2, BWA-MEM, Bowtie2 and GEM in terms of single-end and paired-end alignment. Furthermore, our aligner has demonstrated better paired-end alignment performance than Novoalign for short-reads with high error rates. For color-space alignment, CUSHAW3 is consistently one of the best aligners compared to SHRiMP2 and BFAST. The source code of CUSHAW3 and all simulated data are available at http://cushaw3.sourceforge.net.
Year: 2014 PMID: 24466273 PMCID: PMC3899341 DOI: 10.1371/journal.pone.0086869
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
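Of the three seed types named in the abstract, exact-match k-mer seeds are the simplest to illustrate. The following is a minimal sketch, assuming a plain hash index over the reference and a diagonal-voting step to pick candidate loci; the function names (`build_kmer_index`, `kmer_seeds`, `candidate_diagonals`) are hypothetical, and this is not CUSHAW3's implementation, whose internals the abstract does not describe.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def kmer_seeds(read, index, k):
    """Return (read_offset, ref_pos) for every exact k-mer match of the read."""
    seeds = []
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            seeds.append((offset, ref_pos))
    return seeds


def candidate_diagonals(seeds):
    """Count seeds per diagonal (ref_pos - read_offset), most supported first.

    Seeds on the same diagonal agree on where the read starts in the
    reference, so a well-supported diagonal is a candidate locus for a
    full alignment (assumption: no indels between the seeds).
    """
    votes = defaultdict(int)
    for offset, ref_pos in seeds:
        votes[ref_pos - offset] += 1
    return sorted(votes.items(), key=lambda item: -item[1])


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"
    read = "CGTTAG"
    index = build_kmer_index(reference, 3)
    print(candidate_diagonals(kmer_seeds(read, index, 3)))
```

A real aligner would use a compressed index rather than a Python dictionary and would extend each candidate with a gapped alignment; the sketch only shows how exact k-mer hits narrow down where that extension is attempted.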
Alignment quality on simulated reads (in %).
| Aligner | 2% | | | 4% | | | 6% | | |
| | Sensitivity | Recall* | Recall** | Sensitivity | Recall* | Recall** | Sensitivity | Recall* | Recall** |
| SE | | | | | | | | | |
| CUSHAW3 | | 99.04 | 95.96 | 99.92 | 97.85 | 94.81 | 99.26 | 95.28 | 92.32 |
| CUSHAW2 | 99.95 | 99.00 | 95.96 | 99.33 | 97.61 | 94.64 | 95.45 | 92.84 | 90.04 |
| Novoalign | | | | | | | | | |
| BWA-MEM | 99.99 | 95.95 | 95.95 | 99.59 | 94.33 | 94.33 | 97.38 | 89.86 | 89.86 |
| Bowtie2 | 99.30 | 95.69 | 92.98 | 93.64 | 87.59 | 85.20 | 81.20 | 74.03 | 72.03 |
| GEM | 99.76 | 99.02 | 95.46 | 97.08 | 92.28 | 89.09 | 90.46 | 77.64 | 75.11 |
| PE | | | | | | | | | |
| CUSHAW3 | | 99.54 | 97.35 | | 99.14 | 96.99 | 99.96 | | |
| CUSHAW2 | 99.73 | 99.43 | 97.27 | 99.36 | 98.71 | 96.61 | 96.47 | 95.07 | 93.16 |
| Novoalign | | | 97.57 | | | 96.93 | | 97.13 | 94.88 |
| BWA-MEM | | 97.59 | | | 97.11 | | 99.88 | 95.55 | 95.55 |
| Bowtie2 | 99.45 | 98.53 | 96.41 | 93.54 | 91.52 | 89.54 | 80.29 | 77.37 | 75.68 |
| GEM | | 99.20 | 96.85 | 99.79 | 98.06 | 95.77 | 97.99 | 93.24 | 91.15 |
* means the recall is calculated from all reported alignments per read, and ** means the recall is calculated from the first alignment occurrence per read.
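The difference between the two recall variants in the footnote above can be made concrete with a short sketch. It assumes that sensitivity is the percentage of reads with at least one reported alignment, that recall is the percentage of reads with a correct alignment, and that an alignment counts as correct when it lies on the true chromosome within a small position tolerance. These definitions, the `Alignment` type, and the `tolerance` default are illustrative assumptions, not taken from the record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    chrom: str
    pos: int


def is_correct(aln, truth, tolerance=10):
    """Assumed correctness test: same chromosome, start within `tolerance` bp."""
    return aln.chrom == truth.chrom and abs(aln.pos - truth.pos) <= tolerance


def alignment_quality(reported, truth, tolerance=10):
    """Return sensitivity and both recall variants, in percent.

    reported: read id -> list of Alignment, in the order the aligner printed them
    truth:    read id -> true Alignment (known for simulated reads)
    """
    total = len(truth)
    aligned = correct_any = correct_first = 0
    for read_id, true_aln in truth.items():
        alns = reported.get(read_id, [])
        if not alns:
            continue  # unaligned read: counts against every measure
        aligned += 1
        # Recall*: any reported alignment of the read may be the correct one.
        if any(is_correct(a, true_aln, tolerance) for a in alns):
            correct_any += 1
        # Recall**: only the first reported alignment is considered.
        if is_correct(alns[0], true_aln, tolerance):
            correct_first += 1
    return {
        "sensitivity": 100.0 * aligned / total,
        "recall_all": 100.0 * correct_any / total,
        "recall_first": 100.0 * correct_first / total,
    }
```

Under these assumptions Recall** can never exceed Recall*, and the two coincide for an aligner that reports a single alignment per read, which matches the identical Recall and Recall** columns of BWA-MEM in the single-end rows.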
Figure 1. ROC curves of all evaluated aligners on the simulated data with the minimum MAPQ>0.
Real dataset information.
| Name | Type | Length | No. of Reads | Mean Insert |
| SRR034939 | PE | 100 | 36,201,642 | 525 |
| SRR211279 | PE | 100 | 50,937,050 | 302 |
| ERR024139 | PE | 100 | 53,653,010 | 313 |
Mean insert sizes were estimated using CUSHAW3.
Alignment quality on real reads (in %).
| Aligner | SRR034939 | SRR211279 | ERR024139 |
| SE | | | |
| CUSHAW3 | | | |
| CUSHAW2 | 93.86/93.86 | 96.76/96.76 | 96.74/96.74 |
| Novoalign | 96.80/91.27 | 98.44/98.28 | 98.49/97.50 |
| BWA-MEM | 98.30/97.14 | 99.17/98.58 | 99.07/98.50 |
| Bowtie2 | 95.56/95.56 | 97.13/97.13 | 97.20/97.20 |
| GEM | 93.69/93.69 | 95.10/95.10 | 94.82/94.81 |
| PE | | | |
| CUSHAW3 | 98.92/ | 99.46/ | 99.33/ |
| CUSHAW2 | 94.38/94.38 | 96.94/96.94 | 96.92/96.92 |
| Novoalign | 98.00/94.23 | 99.25/98.85 | 99.13/97.87 |
| BWA-MEM | | | |
| Bowtie2 | 96.23/95.56 | 97.31/97.13 | 97.39/97.20 |
| GEM | 95.52/93.69 | 96.16/95.10 | 96.15/94.81 |
For each value x/y, x is the sensitivity calculated from all reported alignments and y is the sensitivity after removing the alignments with <50% aligned base proportion per read.
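The x/y pairs described above can be computed along the following lines. This is a sketch under stated assumptions: each read has at most one alignment, given as a SAM-style CIGAR string; "aligned bases" are taken to be the read bases consumed by `M`, `=`, `X` and `I` operations (soft- and hard-clipped bases count as unaligned); and the helper names are hypothetical. The record itself does not say how the aligned base proportion was measured.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
ALIGNED_OPS = {"M", "=", "X", "I"}  # assumption: clipped bases are not aligned


def aligned_fraction(cigar, read_length):
    """Fraction of the read's bases that take part in the alignment."""
    aligned = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in ALIGNED_OPS)
    return aligned / read_length


def sensitivity_pair(cigars, read_length, min_fraction=0.5):
    """Return (x, y) in percent.

    cigars: read id -> CIGAR string, or None if the read was not aligned
    x: reads with a reported alignment / all reads
    y: same, after dropping alignments whose aligned fraction is < min_fraction
    """
    total = len(cigars)
    reported = [c for c in cigars.values() if c is not None]
    kept = [c for c in reported if aligned_fraction(c, read_length) >= min_fraction]
    return 100.0 * len(reported) / total, 100.0 * len(kept) / total
```

With this reading, x equals y for an aligner that never reports heavily clipped alignments, while a gap between the two values indicates that part of the sensitivity comes from local alignments covering less than half of the read.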
Runtimes (in minutes) on simulated and real base-space reads.
| Simulated | 2% | | 4% | | 6% | |
| | SE | PE | SE | PE | SE | PE |
| CUSHAW3 | 3.4 | 6.2 | 3.7 | 8.1 | 3.9 | 10.7 |
| CUSHAW2 | 2.5 | 2.5 | 2.8 | 2.9 | 2.9 | 3.1 |
| Novoalign | 6.7 | 6.6 | 38.1 | 7.0 | 131.7 | 12.6 |
| BWA-MEM | | | | | 2.0 | |
| Bowtie2 | 2.1 | 3.6 | 2.0 | 2.7 | | 2.2 |
| GEM | 5.7 | 2.4 | 5.9 | | 5.4 | 2.0 |
| Real | | | | | | |
| | SE | PE | SE | PE | SE | PE |
| CUSHAW3 | 62.0 | 292.4 | 78.6 | 317.9 | 85.1 | 264.1 |
| CUSHAW2 | 38.0 | 38.5 | 47.2 | 49.0 | 51.4 | 50.5 |
| Novoalign | 862.1 | 497.6 | 2,024.0 | 1,243.8 | 754.2 | 460.3 |
| BWA-MEM | | | | | | |
| Bowtie2 | 50.4 | 55.9 | 79.1 | 69.5 | 78.0 | 72.7 |
| GEM | 53.0 | 34.4 | 72.2 | 44.7 | 68.3 | 51.0 |
Figure 2. Peak resident memory of all evaluated aligners.
Alignment quality and runtimes on color-space reads.
| Dataset | Measure | CUSHAW3 | SHRiMP2 | BFAST |
| 50-bp | Sensitivity | | 91.55 | 88.94 |
| 50-bp | Recall* | 86.28 | | 81.01 |
| 50-bp | Recall** | | 84.22 | 81.01 |
| 50-bp | Time (min) | | 227 | 160 |
| 75-bp | Sensitivity | 92.27 | 92.33 | |
| 75-bp | Recall* | 91.16 | | 86.14 |
| 75-bp | Recall** | | 88.15 | 86.14 |
| 75-bp | Time (min) | | 263 | 389 |
* and ** are defined as in Table 1 (alignment quality on simulated reads).
Alignment results on GCAT benchmarks.
| Dataset | Measure | CUSHAW3 | CUSHAW2 | Novoalign | BWA-MEM |
| SE | | | | | |
| Small indels | Sensitivity | | 99.86 | 97.56 | 99.99 |
| Small indels | Recall | | | 97.47 | |
| Large indels | Sensitivity | | 99.50 | 97.56 | 99.99 |
| Large indels | Recall | 97.37 | 97.04 | 97.35 | |
| PE | | | | | |
| Small indels | Sensitivity | | 99.99 | 98.85 | |
| Small indels | Recall | 99.06 | 99.05 | 98.83 | |
| Large indels | Sensitivity | | 99.71 | 98.84 | |
| Large indels | Recall | 98.91 | 98.62 | 98.69 | |
Variant calling results on a GCAT benchmark.
| Aligner | Sensitivity | Specificity | Ti/Tv | Correct SNP | Correct Indel |
| CUSHAW3 | 83.74 | 99.9930 | 2.285 | 115,709 | 5,974 |
| CUSHAW2 | 83.51 | 99.9930 | | 112,727 | 5,841 |
| Novoalign | 84.10 | | 2.289 | 121,992 | |
| BWA-MEM | | 99.9926 | 2.285 | | 9,232 |
Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP) and Ti/Tv is the ratio of transitions to transversions in SNPs.
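The three measures defined above translate directly into code. The sketch below returns sensitivity and specificity in percent, matching the table's units, and classifies a SNP as a transition when it exchanges two purines (A, G) or two pyrimidines (C, T). The function names are hypothetical, and each SNP is assumed to be a (reference base, alternate base) pair with two different bases.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def sensitivity(tp, fn):
    """TP / (TP + FN), in percent."""
    return 100.0 * tp / (tp + fn)


def specificity(tn, fp):
    """TN / (TN + FP), in percent."""
    return 100.0 * tn / (tn + fp)


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) SNP pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("Ti/Tv is undefined without any transversion")
    return transitions / transversions
```

Because true negatives vastly outnumber false positives in a genome-scale call set, specificity sits very close to 100% for every aligner, which is why the table reports it to four decimal places.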
Figure 3. Program workflow of the single-end alignment using hybrid seeding.
Figure 4. Program workflow of the paired-end alignment with hybrid seeding.