Fatemeh Almodaresi, Mohsen Zakeri, Rob Patro.
Abstract
MOTIVATION: Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools that scale to large collections of reference sequences persists.
Year: 2021 PMID: 34117875 PMCID: PMC9502150 DOI: 10.1093/bioinformatics/btab408
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. The main steps of chaining and between-MEM alignment in the PuffAligner procedure, shown via an example. Here, m1, m2 and m3 are the MEMs projected from the left end of the read to the reference, and m4 and m5 are the MEMs projected from the right end of the read. In the first step, the chaining algorithm chooses, for each end of the read, the chains of MEMs that provide the highest coverage score: the m1-m2 chain for the left end and two single-MEM chains for the right end. The selected chains from each end are then joined to find the concordant pairs of chains; for this read that is the (m1-m2, m4) pair, as m5 is too far from m1-m2. The chain from each end then goes through the next step, between-MEM alignment. The green (dashed border) areas (MEMs) are exact matches, so no alignment is recalculated for them. Only the unmatched blue (solid border) parts of the reads (those nucleotides not occurring within a MEM) are aligned, using a modified version of KSW2.
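The steps in the caption can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not PuffAligner's implementation: the `Mem` tuple, the scoring (summed MEM length as a stand-in for the coverage score), and the `max_gap` and `max_fragment` thresholds are all hypothetical, read orientation is ignored, and only a single best chain is returned per read end.

```python
from typing import List, NamedTuple, Tuple


class Mem(NamedTuple):
    read_pos: int  # start of the exact match on the read
    ref_pos: int   # start of the exact match on the reference
    length: int    # length of the exact match


def best_chain(mems: List[Mem], max_gap: int = 100) -> Tuple[int, List[Mem]]:
    """Return (score, chain) for the colinear chain covering the most bases.

    Toy scoring (an assumption): score = summed MEM length. Two MEMs may be
    chained only if they do not overlap on read or reference and lie at most
    max_gap reference bases apart.
    """
    if not mems:
        return 0, []
    mems = sorted(mems, key=lambda m: (m.ref_pos, m.read_pos))
    score = [m.length for m in mems]
    prev = [-1] * len(mems)
    for i, cur in enumerate(mems):
        for j in range(i):
            p = mems[j]
            ref_gap = cur.ref_pos - (p.ref_pos + p.length)
            colinear = p.read_pos + p.length <= cur.read_pos and 0 <= ref_gap <= max_gap
            if colinear and score[j] + cur.length > score[i]:
                score[i] = score[j] + cur.length
                prev[i] = j
    end = max(range(len(mems)), key=score.__getitem__)
    chain = []
    i = end
    while i != -1:  # walk back-pointers to recover the chain
        chain.append(mems[i])
        i = prev[i]
    chain.reverse()
    return score[end], chain


def is_concordant(left: List[Mem], right: List[Mem], max_fragment: int = 1000) -> bool:
    """Toy concordance test: both chains fit within max_fragment reference bases."""
    start = min(left[0].ref_pos, right[0].ref_pos)
    end = max(left[-1].ref_pos + left[-1].length, right[-1].ref_pos + right[-1].length)
    return end - start <= max_fragment


def gaps_between_mems(chain: List[Mem]) -> List[Tuple[int, int, int, int]]:
    """Regions between consecutive MEMs as (read_start, read_end, ref_start, ref_end).

    Only these regions would need dynamic-programming alignment; the MEMs
    themselves are exact matches.
    """
    gaps = []
    for a, b in zip(chain, chain[1:]):
        read_start, ref_start = a.read_pos + a.length, a.ref_pos + a.length
        if b.read_pos > read_start or b.ref_pos > ref_start:
            gaps.append((read_start, b.read_pos, ref_start, b.ref_pos))
    return gaps
```

With left-end MEMs placed like m1, m2 and m3 in the figure, `best_chain` would select the m1-m2 chain, `is_concordant` would accept a nearby right-end chain such as m4 and reject a distant one such as m5, and `gaps_between_mems` would list the short unmatched stretch between m1 and m2 as the only region left to align.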
The performance of different tools for aligning experimental DNA-seq reads
| Aligner | Mapping-rate (%) | Time (mm:ss) | Memory (GB) |
|---|---|---|---|
| PuffAligner | 95.58 | 6:14 | 13.09 |
| deBGA | 99.75 | 10:46 | 41.04 |
| STAR | 93.88 | 4:29 | 30.36 |
| Bowtie2 | 95.44 | 16:15 | 3.50 |
Note: Times were measured after warming up the system cache, so that the influence of index loading time is mitigated.
Accuracy of abundance estimation with Salmon using alignments reported by each aligner for the mock metagenomic sample simulated from SRR10948222
| Alignment mode | Tool | Spearman | MARD | MAE | MSLE |
|---|---|---|---|---|---|
| Primary | PuffAligner | 0.69 | 0.028 | | 0.08 |
| | Bowtie2 | 0.58 | 0.053 | 2.91 | 0.15 |
| | STAR | | | 1.493 | |
| | deBGA | 0.28 | 0.616 | 656.08 | 6.53 |
| Up to 20 | PuffAligner | 0.9 | 0.006 | 0.40 | 0.006 |
| | Bowtie2 | 0.85 | 0.01 | | 0.012 |
| | STAR | | | 0.303 | |
| | deBGA | 0.28 | 0.573 | 637.60 | 5.65 |
| Up to 200 | PuffAligner | 0.97 | 0.002 | 0.36 | 0.001 |
| | Bowtie2 | | | | |
| | STAR | 0.929 | 0.004 | 0.299 | 0.002 |
| | deBGA | 0.28 | 0.571 | 637.83 | 5.55 |
| Best strata | PuffAligner | | | 0.36 | |
| | STAR | 0.929 | 0.004 | | 0.002 |
Note: All aligners are run in three main modes: allowing only one best alignment with ties broken randomly (Primary), up to 20 alignments reported per read, and up to 200 alignments reported per read. PuffAligner and STAR support a fourth mode that reports all equally best alignments (bestStrata). This option improves the performance while maintaining the accuracy of the results. The best result in each metric is highlighted in bold.
Fig. 2. Time performance for aligning a mock experiment simulated from bulk read sample SRR10948222. The dashed area shows the fraction of the time spent purely on aligning reads; the rest is the time required for index loading. PuffAligner is the fastest tool, yet most of its time is still dedicated to loading the index. The alignment time for Bowtie2 increases when asking for more alignments per read, while the other tools show constant alignment time scaling over the number of reads.
Fig. 3. Scalability of different tools in terms of final index disk space and construction memory for three different datasets: human transcriptome (gencode version 33), human genome (GRCh38 primary assembly) and a collection of genomes (4000 random complete bacterial genomes).