Hao Ye, Joe Meehan, Weida Tong, Huixiao Hong.
Abstract
Precision medicine, or personalized medicine, has been proposed as a modernized and promising medical strategy. Patients' genetic variants are the key information for implementing precision medicine. Next-generation sequencing (NGS) is an emerging technology for deciphering genetic variants. Alignment of raw reads to a reference genome is one of the key steps in NGS data analysis. Many algorithms have been developed for aligning short read sequences since 2008, and users must decide which one to use in their studies. Choosing an alignment algorithm also means choosing the set of suitable parameters for it. Understanding these algorithms helps in selecting the appropriate one for different applications in precision medicine. Here, we review currently available algorithms and their major strategies, such as seed-and-extend and q-gram filter. We also discuss the challenges facing current alignment algorithms, including alignment in multiple repeated regions, alignment of long reads, and alignment facilitated by known genetic variants.
Keywords: alignment; genetic variants; next-generation sequencing; precision medicine; short reads
Year: 2015 PMID: 26610555 PMCID: PMC4695832 DOI: 10.3390/pharmaceutics7040523
Source DB: PubMed Journal: Pharmaceutics ISSN: 1999-4923 Impact factor: 6.321
Figure 1. A brief flow chart of genetic studies using NGS. In the first step, DNA or cDNA samples are extracted from the cells of individuals. Each sample is then cleaved into small fragments, and PCR is carried out to build the library by amplifying each fragment. The library is then sequenced on the NGS platform. The original output from the platform is a set of images. A base-calling algorithm then processes the images and outputs the so-called “raw reads”. An alignment algorithm is then used to align the millions of short reads onto the reference human genome, followed by detection of genetic variants for downstream analysis in the genetic studies.
The most frequently used NGS platforms *.
| Platform Name | Illumina HiSeq 2500 | Ion Torrent-Proton II | PacBio RS II | Oxford Nanopore MinION |
|---|---|---|---|---|
| Instrument cost (USD) ** | 690 k | 224 k | 695 k | 1 k *** |
| Reagent cost per run / per GB | 4126 / 45.84 | 1000 / 20.41 | 100 / 1111.11 | 900 / 1000 |
| Reads per run | 300 million | 280 million | 0.03 million | 0.1 million |
| Average read length | 2 × 150 bp | 175 bp | 14,000 bp | 9,000 bp |
| Run time | 10 h | 5 h | 2 h | 6 h |
| Major errors | substitution | indel | indel | deletion |
| Error rate (%) | 0.1 | 1 | 1 | 4 |
| Amplification | bridge PCR | emPCR | none (single-molecule sequencing, SMS) | none (SMS) |
| Advantage | low cost per GB; high output | low cost | long reads; no amplification bias | long reads; no amplification bias |
| Disadvantage | high cost | homopolymer errors | low throughput; high cost | high error rate |
* Sources: http://www.molecularecologist.com/next-gen-fieldguide-2014/ and websites of the companies;
** Sources: http://www.molecularecologist.com/next-gen-table-3a-2014/;
*** Access fee. Sources: https://www.nanoporetech.com/products-services/minion-mki.
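The per-GB reagent figures in the platform table can be related to the other rows by simple arithmetic. The sketch below assumes that "GB" means 10^9 bases, that the nominal yield of a run is reads per run times average read length, and that the Illumina entry counts 2 × 150 = 300 bases per read; the function name `cost_per_gb` is illustrative only. Under these assumptions the Illumina, Ion Torrent and Oxford Nanopore columns reproduce the tabulated 45.84, 20.41 and 1000. The PacBio column does not (the nominal yield gives roughly 238 per GB), so its source presumably used a different yield figure.

```python
def cost_per_gb(cost_per_run, reads_per_run, bases_per_read):
    """Reagent cost per 10^9 sequenced bases, from nominal run yield."""
    yield_gb = reads_per_run * bases_per_read / 1e9  # gigabases per run
    return cost_per_run / yield_gb

# (reagent cost per run, reads per run, bases per read), copied from the table
platforms = {
    "Illumina HiSeq 2500": (4126, 300_000_000, 2 * 150),
    "Ion Torrent-Proton II": (1000, 280_000_000, 175),
    "Oxford Nanopore MinION": (900, 100_000, 9_000),
}

for name, (cost, reads, length) in platforms.items():
    print(f"{name}: {cost_per_gb(cost, reads, length):.2f} per GB")
```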
Alignment algorithms and software tools.
| Name | Reference | Remark |
|---|---|---|
| SOAP * | | k-mer inexact match seed; supports at most 3 mismatches; GPU calculation supported |
| CUSHAW $ | | k-mer inexact match, maximal exact match and hybrid seeds; GPU supported |
| Bowtie & | | k-mer inexact match seed; high speed; double-index; up to 3 mismatches |
| BWA | | k-mer inexact match and maximal exact match seed |
| GASSST | | k-mer exact match seed; tested so far for reads up to 500 bp |
| GNUMAP | | k-mer exact match seed; probabilistically maps reads to repeat regions |
| MOSAIK | | k-mer exact match seed |
| NextGenMap | | q-gram filter; GPU calculation supported |
| QPALMA | | k-mer inexact match; incorporates read quality score and splice site |
| RMAP | | k-mer inexact match seed; 10 mismatches allowed; incorporates read quality score |
| Segemehl | | k-mer inexact match seed; enhanced suffix arrays |
| SeqMap | | k-mer inexact match; supports Windows, Linux, Mac OS |
| Stampy | | k-mer inexact match; supports up to 30 bp indels in paired-end read alignment |
| Cloudburst | | highly sensitive read mapping with MapReduce |
| drFAST | | k-mer inexact match; specially designed for better delineation of structural variants |
| BFAST | | k-mer spaced seeds |
| MAQ | | k-mer spaced seeds; incorporates quality scores of reads in alignment |
| MOM | | k-mer spaced seeds; unlimited mismatches in the 3′ and 5′ flanking regions |
| PASS | | k-mer spaced seeds; implemented in C++ and supported on Linux and Windows |
| PerM | | k-mer spaced seeds; 9 mismatches are allowed |
| SHRiMP2 | | combined k-mer spaced seeds and q-gram filter |
| ZOOM | | k-mer spaced seeds; tolerates 2 mismatches by default |
| BarraCUDA | | uses GPU to speed up BWA |
| GEM | | q-gram filter |
| MPSCAN | | q-gram filter; supports Windows, Linux, Mac OS |
| ERNE | | long gap support; works on Windows, Mac OS X, Linux |
| SARUMAN | | k-mer inexact match seed; supports GPU calculation |
| LAST | | adaptive seed |
| Genalice MAP | NA | cloud calculation; high sensitivity for SNPs and long indels |
| Novoalign | NA | supports up to 7 and 16 mismatches in single-end and paired-end reads, respectively |
| PRIMEX | | k-mer inexact match seed; written in C++; lookup table and server functionality |
| SOCS | | good at aligning CpG methylation-enriched reads |
| SToRM | | does not support paired-end reads |
| iSAAC | | k-mer inexact match seed; high speed |
| RazerS | | q-gram filter; supports Windows, Linux, Mac OS X |
| SSAHA2 | | k-mer inexact match seed; supports various output formats |
| UGENE | | works on Windows, Linux and Mac OS X |
* Includes SOAP, SOAP2, SOAP3 and SOAP3-dp; $ Includes CUSHAW (k-mer inexact match seed), CUSHAW2 (maximal exact match seed) and CUSHAW3 (hybrid seeds); & Includes Bowtie and Bowtie2. NA: commercial software, no reference available.
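Several tools in the table above are described as using "k-mer spaced seeds". As a rough illustration of the idea, a spaced seed compares only selected positions of a k-mer window, so a mismatch at an ignored position does not prevent a seed hit. The mask `1101011` and the helper names below are hypothetical and are not taken from any of the listed tools.

```python
from collections import defaultdict

def apply_mask(window, mask):
    """Keep only the characters at positions where the mask has a '1'."""
    return "".join(c for c, m in zip(window, mask) if m == "1")

def spaced_seed_hits(read, reference, mask):
    """Return (read_offset, reference_position) pairs whose masked windows agree."""
    span = len(mask)
    # Index every reference window by its masked key
    index = defaultdict(list)
    for pos in range(len(reference) - span + 1):
        index[apply_mask(reference[pos:pos + span], mask)].append(pos)
    # Look up every read window under the same mask
    hits = []
    for offset in range(len(read) - span + 1):
        key = apply_mask(read[offset:offset + span], mask)
        hits.extend((offset, pos) for pos in index.get(key, []))
    return hits

reference = "ACGTTGCAAGCT"
read = "GTAGCAA"  # reference[2:9] with a mismatch at an ignored position
print(spaced_seed_hits(read, reference, "1101011"))
print(read in reference)  # a contiguous exact 7-mer lookup would miss this read
```

With a contiguous seed of the same span, the single mismatch would remove the hit; the spaced mask trades some specificity for this tolerance.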
Figure 2. A brief workflow of the seed-and-extend strategy in alignment. Generally, the strategy can be divided into three steps: (1) generate raw seeds from a read; (2) identify the matched seeds; and (3) extend the matched seeds and perform local alignment through standard dynamic programming algorithms based on the Needleman–Wunsch (NW) algorithm [78] or the Smith–Waterman (SW) algorithm [79].
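The three steps of the seed-and-extend caption can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the implementation of any tool listed above: seeds are all overlapping exact k-mers of the read, the index is a plain hash table, and the extension step is a score-only Smith–Waterman with linear gap penalties. The scoring values and the `pad` parameter are arbitrary choices made for the example.

```python
from collections import defaultdict

def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

def seed_and_extend(read, reference, index, k, pad=2):
    """Return (best score, implied read start on the reference)."""
    # Steps 1 and 2: generate seeds from the read and look up exact matches
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            candidates.add(pos - offset)  # where the read would start
    # Step 3: extend each candidate by local alignment in a padded window
    best = (0, None)
    for start in sorted(candidates):
        lo = max(0, start - pad)
        hi = min(len(reference), start + len(read) + pad)
        score = smith_waterman_score(read, reference[lo:hi])
        if score > best[0]:
            best = (score, start)
    return best

reference = "TTGACCGTAGGCATCGATTACGGCTAAGCTTGCA"
read = "AGGCATTGATTA"  # reference[8:20] with one substituted base
index = build_kmer_index(reference, k=4)
print(seed_and_extend(read, reference, index, k=4))
```

The seeds flanking the substituted base still match exactly, so the true location is recovered even though the read as a whole does not occur in the reference.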
Figure 3. The overall workflow of the q-gram filter in alignment. The strategy consists of three steps: (1) generation of q-grams from a read; (2) identification of highly mapped regions in a reference sequence through mapping of multiple q-grams; and (3) local alignment of the read against the highly mapped regions through standard dynamic programming algorithms based on the Needleman–Wunsch (NW) algorithm [78] or the Smith–Waterman (SW) algorithm [79].
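The q-gram filter steps in this caption can be illustrated with a short sketch. It relies on the q-gram lemma: a read of length n that matches a region with at most e errors shares at least n - q + 1 - q·e q-grams with that region. The code below is a simplified illustration under stated assumptions, not the method of any specific tool: hits are counted per exact diagonal (which suits substitutions better than indels), and the verification step is an edit-distance dynamic program with free end gaps in the reference window rather than a full NW or SW alignment. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def best_edit_distance(read, window):
    """Minimum edit distance between read and any substring of window."""
    prev = [0] * (len(window) + 1)  # the match may start anywhere in window
    for i in range(1, len(read) + 1):
        cur = [i] + [0] * len(window)
        for j in range(1, len(window) + 1):
            cost = 0 if read[i - 1] == window[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return min(prev)  # the match may end anywhere in window

def qgram_filter_align(read, reference, q, max_errors):
    """Return sorted (implied start, edit distance) pairs that pass the filter."""
    n = len(read)
    threshold = n - q + 1 - q * max_errors  # q-gram lemma lower bound
    if threshold < 1:
        raise ValueError("q is too large for this read length and error budget")
    index = defaultdict(list)
    for pos in range(len(reference) - q + 1):
        index[reference[pos:pos + q]].append(pos)
    # Steps 1 and 2: map the read's q-grams and count hits per diagonal
    hits = Counter()
    for offset in range(n - q + 1):
        for pos in index.get(read[offset:offset + q], []):
            hits[pos - offset] += 1
    # Step 3: verify only the highly mapped regions by dynamic programming
    results = []
    for diag, count in hits.items():
        if count >= threshold:
            lo = max(0, diag - max_errors)
            hi = min(len(reference), diag + n + max_errors)
            distance = best_edit_distance(read, reference[lo:hi])
            if distance <= max_errors:
                results.append((diag, distance))
    return sorted(results)

reference = "TTGACCGTAGGCATCGATTACGGCTAAGCTTGCA"
read = "AGGCATTGATTA"  # reference[8:20] with one substituted base
print(qgram_filter_align(read, reference, q=3, max_errors=1))
```

Only regions that collect enough q-gram hits reach the dynamic programming step, which is where the filter saves time relative to aligning the read against every position.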