Mohammed Alser, Jeremy Rotman, Onur Mutlu, Serghei Mangul, Dhrithi Deshpande, Kodi Taraszka, Huwenbo Shi, Pelin Icer Baykal, Harry Taegyun Yang, Victor Xue, Sergey Knyazev, Benjamin D Singer, Brunilda Balliu, David Koslicki, Pavel Skums, Alex Zelikovsky, Can Alkan.
Abstract
Aligning sequencing reads onto a reference is an essential step of the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today's diverse array of alignment methods. We provide a systematic survey of algorithmic foundations and methodologies across 107 alignment methods, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read alignment. We discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology.
Year: 2021 PMID: 34446078 PMCID: PMC8390189 DOI: 10.1186/s13059-021-02443-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Summary of algorithms and features of the examined read alignment methods. We surveyed 107 alignment tools published from 1988 to 2020 (indicated in column “Year of publication”). The table is sorted by year of publication, and then grouped according to the area(s) of application (indicated in column “Application”) within each year. In column “Indexing,” we document the algorithms used to index the genome (the first step in read alignment). In column “Global Positioning,” we document the algorithms used to determine a global position of the read in the reference genome (the second step). In column “Pairwise alignment,” we document the algorithm used to determine the similarity between the read and the corresponding region of the reference genome (the last step). SW, NW, HD, and DP stand for Smith-Waterman algorithm, Needleman-Wunsch algorithm, Hamming distance, and dynamic programming, respectively. In column “Wrapper,” we document the read alignment algorithms that are built on top of other read alignment tools. Finally, we report the maximum read length tested in the corresponding paper in column “Max. Read Length Tested in the Paper (bp).” The tested read length in each paper is not necessarily the maximum read length that each tool can handle
| Aligner | Year of publication | Application | Indexing | Global positioning: fix length seed | Global positioning: spaced seed | Global positioning: seed chaining | Pairwise alignment | Wrapper | Max. read length tested in the paper (bp) | |
|---|---|---|---|---|---|---|---|---|---|---|
| FASTA [ | 1988 | DNA | Hashing | Y | N | Y | SW & NW | N | 1500 | |
| BLAST [ | 1990 | DNA | Hashing | Y | N | Y | Non-DP Heuristic | N | 73360 | |
| Gapped BLAST [ | 1997 | DNA | Hashing | Y | N | Y | SW | N | 246 | |
| SSAHA [ | 2001 | DNA | Hashing | Y | N | N | NW | N | 500 | |
| PatternHunter [ | 2002 | DNA | Hashing | Y | Y | Y | Non-DP heuristic | N | 500 | |
| BLAT [ | 2002 | DNA | Hashing | Y | N | Y | Non-DP heuristic | N | 500 | |
| BLASTZ [ | 2003 | DNA | Hashing | Y | N | N | SW | Y | 3000 | |
| C4 [ | 2005 | DNA | Hashing | Y | N | Y | Sparse DP | N | N/A | |
| GMAP [ | 2005 | DNA | Hashing | N | N | Y | NW | N | N/A | |
| BWT-SW [ | 2008 | DNA | BWT | Y | N | N | SW | N | 2000 | |
| MAQ [ | 2008 | DNA | Hashing | Y | Y | N | SW | N | 63 | |
| RMAP [ | 2008 | DNA | Hashing | Y | N | N | HD | N | 36 | |
| SOAP [ | 2008 | DNA | Hashing | Y | N | N | Non-DP heuristic | N | 50 | |
| SOCS [ | 2008 | DNA | Hashing | Y | N | N | Rabin-Karp Algorithm | N | 35 | |
| SeqMap [ | 2008 | DNA | Hashing | Y | N | N | Non-DP Heuristic | N | 30 | |
| ZOOM [ | 2008 | DNA | Hashing | Y | Y | N | SW | N | 36 | |
| QPALMA [ | 2008 | RNA-Seq | Suffix array | Y | N | Y | SW | Y | 36 | |
| BRAT [ | 2009 | BS-Seq | Hashing | Y | N | N | HD | N | 26 | |
| BSMAP [ | 2009 | BS-Seq | Hashing | Y | N | N | HD | N | 32 | |
| BFAST [ | 2009 | DNA | Hashing | N | Y | N | SW | N | 55 | |
| BWA [ | 2009 | DNA | BWT-FM | N | N | N | Semi-Global | N | 125 | |
| Bowtie [ | 2009 | DNA | BWT-FM | Y | N | N | HD | N | 76 | |
| CloudBurst [ | 2009 | DNA | Hashing | Y | N | N | Landau-Vishkin | N | 36 | |
| GNUMAP [ | 2009 | DNA | Hashing | Y | N | Y | NW | N | 36 | |
| GenomeMapper [ | 2009 | DNA | Hashing | Y | N | Y | NW | N | 200 | |
| MOM [ | 2009 | DNA | Hashing | Y | N | N | HD | N | 40 | |
| PASS [ | 2009 | DNA | Hashing | Y | N | Y | NW | N | 32 | |
| PerM [ | 2009 | DNA | Hashing | Y | Y | N | HD | N | 47 | |
| RazerS [ | 2009 | DNA | Hashing | Y | Y | Y | Myers Bit Vector | N | 76 | |
| SHRiMP [ | 2009 | DNA | Hashing | N | N | N | SW | N | 35 | |
| SOAP2 [ | 2009 | DNA | BWT-FM | Y | N | N | SW | N | 44 | |
| Slider [ | 2009 | DNA | Hashing | Y | N | N | HD | N | 36 | |
| segemehl [ | 2009 | DNA | Suffix array | N | N | Y | SW | N | 35 | |
| TopHat [ | 2009 | RNA-Seq | BWT-FM | Y | N | N | HD | Y | 42 | |
| BS-Seeker [ | 2010 | BS-Seq | BWT-FM | Y | N | N | HD | Y | 36 | |
| BWA-SW [ | 2010 | DNA | BWT-FM | N | N | N | SW | N | 10000 | |
| GASSST [ | 2010 | DNA | Hashing | Y | Y | Y | Semi-Global | N | 500 | |
| GSNAP [ | 2010 | DNA | Hashing | Y | N | Y | Non-DP Heuristic | N | 100 | |
| SMALT [ | 2010 | DNA | Hashing | Y | N | Y | SW | N | 150 | |
| Slider II [ | 2010 | DNA | Hashing | Y | N | N | HD | Y | 42 | |
| VMATCH [ | 2010 | DNA | Suffix array | Y | N | Y | SW | Y | N/A | |
| mrsFAST [ | 2010 | DNA | Hashing | Y | N | N | HD | N | 100 | |
| MapSplice [ | 2010 | RNA-Seq | BWT-FM | Y | N | N | HD | Y | 100 | |
| MicroRazerS [ | 2010 | RNA-Seq | Hashing | Y | N | Y | HD | N | 36 | |
| SpliceMap [ | 2010 | RNA-Seq | Hashing | Y | N | N | HD | Y | 50 | |
| Supersplat [ | 2010 | RNA-Seq | Hashing | N | N | N | NA | N | 36 | |
| Bismark [ | 2011 | BS-Seq | BWT-FM | Y | N | Y | SW & NW | Y | 50 | |
| LAST [ | 2011 | DNA/BS-Seq/RNA | Suffix array | N | Y | N | SW & NW | N | 105 | |
| DynMap [ | 2011 | DNA | Hashing | Y | N | N | NW | N | 52 | |
| SHRiMP2 [ | 2011 | DNA | Hashing | Y | Y | Y | SW | N | 75 | |
| SNAP [ | 2011 | DNA | Hashing | Y | N | N | NW | N | 10000 | |
| Stampy [ | 2011 | DNA | Hashing | Y | N | N | NW | N | 4500 | |
| TMAP | 2011 | DNA | BWT-FM | N | N | Y | SW | N | N/A | |
| X-Mate [ | 2011 | DNA | Hashing | N | N | N | Non-DP Heuristic | N | 50 | |
| SOAPSplice [ | 2011 | RNA-Seq | BWT-FM | Y | N | N | Non-DP Heuristic | N | 150 | |
| BRAT-BW [ | 2012 | BS-Seq | BWT-FM | N | N | N | HD | N | 62 | |
| BLASR [ | 2012 | DNA | Suffix array | Y | N | Y | NW | N | 8000 | |
| Batmis [ | 2012 | DNA | BWT-ST | Y | N | N | HD | N | 100 | |
| Bowtie2 [ | 2012 | DNA | BWT-FM | Y | N | Y | SW & NW | N | 400 | |
| GEM [ | 2012 | DNA | BWT-FM | N | N | Y | SW & NW | N | 150 | |
| RazerS3 [ | 2012 | DNA | Hashing | Y | Y | Y | Banded Myers Bit Vector | N | 800 | |
| SeqAlto [ | 2012 | DNA | Hashing | Y | N | N | NW | N | 200 | |
| SplazerS [ | 2012 | DNA | Hashing | Y | N | Y | Banded Myers Bit Vector | N | 150 | |
| WHAM [ | 2012 | DNA | Hashing | Y | N | N | NW | N | 74 | |
| YAHA [ | 2012 | DNA | Hashing | Y | N | Y | SW | N | 10000 | |
| OSA [ | 2012 | RNA-Seq | Hashing | Y | N | N | NA | N | 100 | |
| Passion [ | 2012 | RNA-Seq | Hashing | Y | N | Y | SW | Y | 75 | |
| BS-Seeker2 [ | 2013 | BS-Seq | BWT-FM | Y | N | Y | SW & NW | Y | 250 | |
| Subread [ | 2013 | DNA/RNA-Seq | Hashing | Y | Y | Y | SW | N | 202 | |
| BWA-MEM [ | 2013 | DNA | BWT-FM | N | N | Y | SW & NW | N | 650 | |
| Masai [ | 2013 | DNA | Suffix tree | N | N | Y | Banded Myers Bit Vector | N | 150 | |
| NextGenMap [ | 2013 | DNA | Hashing | Y | N | N | SW & NW | N | 250 | |
| SRmapper [ | 2013 | DNA | Hashing | Y | N | N | HD | N | 100 | |
| mrFAST [ | 2013 | DNA | Hashing | Y | N | N | Semi-Global | N | 180 | |
| CRAC [ | 2013 | RNA-Seq | BWT-FM | Y | N | N | Non-DP Heuristic | N | 200 | |
| STAR [ | 2013 | RNA-Seq | Suffix array | N | N | Y | SW | N | 5000 | |
| TopHat2 [ | 2013 | RNA-Seq | BWT-FM | Y | N | Y | SW & NW | Y | 101 | |
| Subjunc [ | 2013 | RNA-Seq | Hashing | Y | Y | Y | NW | N | 202 | |
| BWA-PSSM [ | 2014 | DNA | BWT-FM | Y | N | N | SW | Y | 100 | |
| CUSHAW3 [ | 2014 | DNA | BWT-FM | Y | N | Y | SW & Semi-Global | N | 100 | |
| Hobbes2 [ | 2014 | DNA | Hashing | Y | N | Y | Banded Myers Bit Vector | N | 100 | |
| MOSAIK [ | 2014 | DNA | Hashing | Y | N | N | SW | N | 100 | |
| hpg-Aligner [ | 2014 | DNA | Suffix array | N | N | Y | SW | N | 5000 | |
| mrsFAST-Ultra [ | 2014 | DNA | Hashing | Y | N | N | HD | N | 100 | |
| JAGuaR [ | 2014 | RNA-Seq | BWT-FM | Y | N | N | SW | Y | 100 | |
| ContextMap 2 [ | 2015 | RNA-Seq | BWT-FM | Y | N | Y | SW & NW | Y | 76 | |
| HISAT [ | 2015 | RNA-Seq | BWT-FM | Y | N | N | Non-DP Heuristic | N | 100 | |
| ERNE 2 [ | 2016 | DNA/BS-Seq | BWT-FM + hashing | Y | N | N | HD | N | 100 | |
| GraphMap [ | 2016 | DNA | Hashing | Y | Y | Y | Semi-global | N | 9000 | |
| NanoBLASTer [ | 2016 | DNA | Hashing | Y | N | Y | NW | N | 7040 | |
| minimap [ | 2016 | DNA | Hashing | Y | N | N | N/A | N | 13000 | |
| rHAT [ | 2016 | DNA | Hashing | Y | N | Y | SW | N | 8000 | |
| KART [ | 2017 | DNA | BWT-FM | N | N | Y | NW | N | 7118 | |
| LAMSA [ | 2017 | DNA | BWT-FM + hashing | Y | N | Y | Sparse DP | Y | 100000 | |
| DART [ | 2017 | RNA-Seq | BWT-FM | N | N | Y | NW | N | 251 | |
| minimap2 [ | 2018 | DNA/RNA-Seq | Hashing | Y | N | Y | NW | N | 11628 | |
| DREAM-Yara [ | 2018 | DNA | BWT-FM | Y | N | N | Banded Myers Bit Vector | Y | 150 | |
| MUMmer4 [ | 2018 | DNA | Suffix array | Y | N | Y | SW | Y | 7821 | |
| NGMLR [ | 2018 | DNA | Hashing | Y | N | Y | SW | N | 50000 | |
| lordFAST [ | 2018 | DNA | BWT-FM + hashing | N | N | Y | SW & NW | N | 35489 | |
| BatMeth2 [ | 2019 | BS-Seq | BWT-FM | Y | N | Y | SW & NW | N | 125 | |
| GraphMap2 [ | 2019 | DNA/RNA-Seq | Hashing | Y | Y | Y | Semi-global | N | 9000 | |
| Magic-BLAST [ | 2019 | DNA/RNA-Seq | Hashing | Y | N | N | Non-DP Heuristic | N | 90000 | |
| BWA-MEM2 [ | 2019 | DNA | BWT-FM | N | N | Y | SW | N | 650 | |
| HISAT2 [ | 2019 | DNA | BWT-FM | Y | N | N | Non-DP Heuristic | N | 100 | |
| deSALT [ | 2019 | RNA-Seq | Hashing | Y | N | Y | SW | N | 8000 | |
| conLSH [ | 2020 | DNA | Hashing | Y | N | Y | Sparse DP | N | 8000 | |
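The three most common entries in the "Pairwise alignment" column (HD, NW, SW) can be contrasted with a minimal sketch. The scoring values used here (match = 1, mismatch = -1, gap = -1) and the function names are illustrative assumptions, not parameters taken from any surveyed tool, and the DP functions return only the score rather than the full alignment.

```python
def hamming_distance(a: str, b: str) -> int:
    """HD: count mismatching positions; no indels are allowed."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """NW: best score for aligning all of a against all of b (global)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def sw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """SW: best score over all pairs of substrings (local); cells never drop below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best
```

Hamming distance runs in linear time but cannot represent insertions or deletions, whereas both DP variants fill a table whose size is the product of the two sequence lengths.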
Advantages and limitations of read alignment algorithms. We compare the ease of implementing each algorithm (“Easy to implement”). We define the “ease of implementation” as the ability to quickly implement such an algorithm and its indexing technique, flexibly apply some changes to it, and easily understand its working principle. We also record whether the algorithm allows for an exact and/or inexact match (“Search for exact/inexact match”). The use of spaced seeds enables searching for inexact match using a hash table. We also compare the size of the genome index (indicated in column “Index size”), the speed of seed query (indicated in column “Seed query speed”), and the possibility to vary the length of the seed (“Seed length”)
| Hashing | Suffix tree and BWT-FM | |
|---|---|---|
| Easy to implement | Yes | No |
| Search for exact/inexact match | Exact | Exact and inexact |
| Index size | Large | Compressed (small) |
| Indexing time | Small | Large |
| Seed query speed | O(1), fast | Slow |
| Seed length | Fixed length per index | Can be fixed or variable |
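The contrast in the table can be made concrete with a small sketch of both index types. It is a toy illustration under stated assumptions: the function names and the mask notation ('1' = must match, '0' = don't care) are hypothetical, the FM-index is built from a naively sorted suffix array, and only exact backward search is shown for it.

```python
from collections import Counter, defaultdict


def build_hash_index(reference: str, mask: str) -> dict:
    """Hash every window of len(mask); a mask of all '1's is a contiguous seed."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(mask) + 1):
        window = reference[pos:pos + len(mask)]
        key = "".join(c for c, m in zip(window, mask) if m == "1")
        index[key].append(pos)
    return dict(index)


def query_hash(index: dict, seed: str, mask: str) -> list:
    """Single dictionary lookup; the seed must have the length the index was built for."""
    key = "".join(c for c, m in zip(seed, mask) if m == "1")
    return index.get(key, [])


def build_fm_index(text: str):
    """Return (suffix array, C table, occurrence table) for text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = "".join(text[i - 1] for i in sa)
    counts = Counter(text)
    C, total = {}, 0
    for c in sorted(counts):  # C[c] = number of characters smaller than c
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}  # occ[c][i] = count of c in bwt[:i]
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def fm_search(pattern: str, sa, C, occ) -> list:
    """Backward search: one step per pattern character, any pattern length."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


reference = "ACGTACGGTCA"
contiguous = build_hash_index(reference, "1111")
spaced = build_hash_index(reference, "1101")
fm = build_fm_index(reference)
```

With this reference, the seed `ACTG` (one substitution relative to `ACGG` at position 4) is missed by the contiguous index but found by the spaced one, because the substituted base falls on the don't-care position. The FM-index answers `ACG` and `ACGG` from the same structure, whereas each hash index is tied to one seed length.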
Fig. 1 Overview of a read alignment algorithm. a The seeds from the reference genome sequence are extracted. b Each extracted seed and all its occurrence locations in the reference genome are stored using the data structure of choice (suffix tree and hash table are presented as an example). Common prefixes of the seeds are stored once in the branches of the suffix tree, while the hash table stores each seed individually. c The seeds from each read sequence are extracted. d The occurrences of each extracted seed in the reference genome are determined by querying the index database. In this example, the three seeds from the first read appear adjacent at locations 5, 7, and 9 in the reference genome. Two of the same seeds appear also adjacent at another two locations (12 and 16). Other non-adjacent locations are filtered out (marked with X) as they may not span a good match with the first read. e The adjacent seeds are linked together to form a longer chain of seeds by examining the mismatches between the gaps. Pre-alignment filters can also be applied to quickly decide whether or not the computationally expensive DP calculation is needed. f Once the pre-alignment filter accepts the alignment between a read and a region in the reference genome, then DP-based (or non-DP-based) verification algorithms are used to generate the alignment file (in BAM or SAM formats), which contains alignment information such as the exact number of differences, location of each difference, and their type.
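The steps in this caption can be sketched end to end on a toy example. The sketch below is a simplification under stated assumptions: it uses a hash index with non-overlapping fixed-length seeds, replaces seed chaining with a count of seeds that agree on the same read start position, and verifies candidates with Hamming distance rather than DP. The names `build_index`, `map_read`, `min_seeds`, and `max_mismatches` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(reference: str, k: int) -> dict:
    """Steps a-b: extract every k-mer of the reference and store its locations."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def map_read(read: str, reference: str, index: dict, k: int,
             min_seeds: int = 2, max_mismatches: int = 2) -> list:
    """Steps c-f: seed the read, query the index, filter, then verify."""
    votes = Counter()
    # Steps c-d: non-overlapping seeds; each hit implies a candidate read start.
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if 0 <= start <= len(reference) - len(read):
                votes[start] += 1
    hits = []
    for start, n_seeds in votes.items():
        # Step e (simplified): keep locations supported by enough co-linear seeds.
        if n_seeds < min_seeds:
            continue
        # Step f (simplified): verify with Hamming distance instead of DP.
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return sorted(hits)


reference = "GATTACAGGCTTACGATCCA"
index = build_index(reference, k=3)
# The read equals reference[5:14] with its last base changed from C to G.
print(map_read("CAGGCTTAG", reference, index, k=3))  # [(5, 1)]
```

Two of the three seeds agree on start position 5, so the candidate survives the filter and is reported with its single mismatch. A read with only one matching seed never reaches the verification step.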
Fig. 2 Combination of algorithms utilized by read alignment tools. Sankey plot displaying the flow of surveyed tools using each indexing technique and pairwise alignment. For every indexing technique, the percentage of surveyed tools using the algorithm is displayed (BWT-FM 26.2%, BWT-FM and Hashing 2.8%, Hashing 60.8%, Other Suffix 10.3%). For every pairwise alignment technique, the percentage of surveyed tools using the algorithm is displayed (Smith-Waterman 28.3%, Hamming distance 19.2%, Needleman-Wunsch 16.2%, Other DP 14.1%, Non-DP Heuristic 13.1%, Multiple Methods 9.1%)
Fig. 3 The landscape of read alignment algorithms published from 1988 to 2020. a Histogram showing the cumulation of surveyed tools over time colored by the algorithm used for genome indexing. The first published aligner, FASTA, is labeled as well as the point at which Bowtie and BWA were introduced and changed the landscape of aligners. b The popularity of all surveyed aligners, judged by citations per year since the initial release. Tools are grouped by the algorithm used for genome indexing. The six overall most popular aligners are labeled. c Histogram showing the cumulation of surveyed tools over time colored by the algorithm used for pairwise alignment. The two aligners credited to have been the first to use the three most popular algorithms (FASTA: Smith-Waterman and Needleman-Wunsch, RMAP: Hamming distance) are labeled. d The popularity of each surveyed aligner, judged by citations per year since the initial release. Tools are grouped by the algorithm used for pairwise alignment. The six overall most popular aligners are labeled.
Fig. 4 The effect of read alignment algorithms on the speed of alignment and computational resources. Results of the benchmarking performed on 11 surveyed DNA read alignment tools that can be installed through bioconda (RMAP, Bowtie, BWA, GSNAP, SMALT, LAST, SNAP, Bowtie2, Subread, HISAT2, and minimap2) additionally noted in Supplementary Table 2 and Supplementary Note 3. Each tool’s CPU time and RAM required were recorded for 10 different WGS samples from the 1000 Genomes Project. a, b Violin plots showing the relative performance (a CPU time and b RAM) of the benchmarked aligners. Aligners are ordered by year of release. c, d The relative performance (c CPU time and d RAM) of the benchmarked aligners grouped by the algorithm used for genome indexing and colored by individual aligners (BWT-FM CPU time vs. Suffix array CPU time: LRT, p value = 1.5 × 10⁻¹⁵, Hashing memory vs. BWT-FM memory: LRT, p value = 2.2 × 10⁻³, BWT-FM memory vs. Suffix Array memory: LRT, p value < 2 × 10⁻¹⁶). The legend of d is the same for c, e, and f. e The relative performance (CPU time) of the benchmarked aligners grouped by whether the tool was released before or after long-read technology was introduced (2013) and colored by individual aligners (LRT, p value = 3.7 × 10⁻¹¹). f The relative performance (CPU time) of the benchmarked aligners grouped by the algorithm used for pairwise alignment and colored by individual aligners (Needleman-Wunsch CPU time vs. Smith-Waterman CPU time: Wald, p value = 1.3 × 10⁻⁴, Needleman-Wunsch CPU time vs. Hamming Distance CPU time: Wald, p value = 9.3 × 10⁻⁷, Needleman-Wunsch CPU time vs. Non-DP Heuristic CPU time: Wald, p value = 1.8 × 10⁻¹⁰)
• Error rate. The error rate of modern short-read sequencing technologies is smaller than that of modern long-read technologies.
• Genome coverage. Throughput (i.e., the number of reads) of modern short-read sequencing technologies is higher than that of modern long-read technologies.
• Global position. Determine a global position of the read by identifying the starting position or positions of the read in the reference genome. This step is ambiguous with short reads, as the repetitive structure of the human genome causes such reads to align to multiple locations of the genome. In contrast, long reads are usually longer than the majority of repeat regions and are aligned to a single location in the genome.
• Local pairwise alignment. After determining the global position of each read, the algorithms map all bases of the read to the reference segments, located at these global positions, in order to account for indels. Due to the smaller error rate of short-read technologies, it is usually easier to perform local alignment on short reads than on long ones.
• Genomic variants. Single-nucleotide polymorphisms (SNPs) are easier to detect with short reads than with long reads because of the lower error rate and higher coverage of short-read sequencing technologies. Structural variants (SVs) are easier to detect with long reads, which span the entire SV region. Current long-read-based tools [