Jongpill Choi, Kiejung Park, Seong Beom Cho, Myungguen Chung.
Abstract
BACKGROUND: A number of alignment tools have been developed to align sequencing reads to the human reference genome. The scale of information from next-generation sequencing (NGS) experiments, however, is increasing rapidly. Recent NGS-based studies routinely produce exome or whole-genome sequences from hundreds or thousands of samples. To meet the growing need to analyze very large NGS data sets, faster, more sensitive, and more accurate mapping tools must be developed.
Keywords: Hash table index; Hybrid index; Mapper; NGS; Sequence alignment; Suffix array index
Year: 2015 PMID: 26702294 PMCID: PMC4688996 DOI: 10.1186/s13015-015-0062-4
Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405
Fig. 1 Constructing the hybrid index. Panel a shows the procedure for constructing the hybrid index given Sequence = TATAGGCATGAGCCAC and q = 1. Construction proceeds as follows. First, convert the nucleotide symbols in the sequence into the corresponding decimal values (I). Second, count each q-gram and store the counts in the hash table (HT) (II-I). Third, set the beginning position of each q-gram based on the q-gram counts (II-II). Fourth, store the positions of each q-gram in the suffix array (SA), as shown in (III). Finally, sort each q-gram range in the SA to finish constructing the hybrid index. The sizes of Sequence, SA, and HT are 16, 16, and 4^q + 1 = 5, respectively. Panel b shows the constructed hybrid index
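The construction steps in the Fig. 1 caption can be sketched in Python as follows. This is a minimal illustration, not the HIA implementation: the function name `build_hybrid_index`, the encoding A=0, C=1, G=2, T=3, the use of 0-based positions, and indexing only positions that have a full q-gram are all assumptions made here.

```python
# Assumed 2-bit encoding; the source does not state the actual mapping.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def build_hybrid_index(seq, q):
    """Return (ht, sa): a hash table of bucket starts and a bucketed suffix array."""
    vals = [CODE[c] for c in seq]            # (I) symbols -> decimal values
    n_pos = len(seq) - q + 1                 # positions that start a full q-gram

    def qgram(i):
        v = 0
        for x in vals[i:i + q]:
            v = v * 4 + x                    # base-4 value of the q-gram at i
        return v

    ht = [0] * (4 ** q + 1)
    for i in range(n_pos):                   # (II-I) count each q-gram
        ht[qgram(i) + 1] += 1
    for k in range(1, len(ht)):              # (II-II) counts -> start offsets
        ht[k] += ht[k - 1]

    sa = [0] * n_pos
    next_slot = ht[:-1]                      # slice copy: next free slot per bucket
    for i in range(n_pos):                   # (III) place positions into buckets
        g = qgram(i)
        sa[next_slot[g]] = i
        next_slot[g] += 1

    for g in range(4 ** q):                  # sort each bucket by its suffix
        lo, hi = ht[g], ht[g + 1]
        sa[lo:hi] = sorted(sa[lo:hi], key=lambda p: seq[p:])
    return ht, sa

ht, sa = build_hybrid_index("TATAGGCATGAGCCAC", 1)
print(ht)   # [0, 5, 9, 13, 16]
```

For the caption's example the table has 4^1 + 1 = 5 entries, and bucket `g` occupies `sa[ht[g]:ht[g + 1]]`. Sorting whole suffixes with `sorted` is chosen for brevity only; a production index would use a dedicated suffix-sorting routine.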
Fig. 2 Classification diagram of split causes between matched and unmatched regions. Panel a shows the split causes of LMUR. Panel b shows the split causes of RMUR. There are four split causes (Mismatch, Insertion, Deletion, and Mixed) in LMUR and RMUR. Panel c shows the split causes of MRUR, which can be classified in greater detail. A yellow block indicates that the read block and the sequence block match exactly. A dark gray block indicates a mismatch region. A white block shows that the two blocks are gapped. A red block shows that the two blocks overlap. Finally, a blue block indicates that there are two more split causes on the read block and the sequence block
Fig. 3 Finding MRs and CARs. MRs and CARs are found for a read (r = GCCATG) using the hybrid index constructed in Fig. 1
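How a hybrid index of this kind can be queried is sketched below for the read from Fig. 3. The sketch only shows a generic exact-match lookup (jump to the q-gram bucket through the HT, then narrow the range in the SA by binary search); it is not the paper's MR/CAR procedure. The helper names `build_index`, `narrow`, and `longest_match`, the A=0..T=3 encoding, and 0-based positions are assumptions.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}      # assumed encoding

def build_index(seq, q):
    """Compact builder: sorted suffix positions plus q-gram bucket starts."""
    sa = sorted(range(len(seq) - q + 1), key=lambda p: seq[p:])
    ht = [0] * (4 ** q + 1)
    for p in sa:
        g = 0
        for c in seq[p:p + q]:
            g = g * 4 + CODE[c]
        ht[g + 1] += 1
    for k in range(1, len(ht)):
        ht[k] += ht[k - 1]                   # prefix sums give bucket starts
    return ht, sa

def narrow(seq, sa, lo, hi, depth, ch):
    """Shrink SA range [lo, hi) to suffixes whose character at `depth` is ch."""
    def char_at(k):
        p = sa[k] + depth
        return seq[p] if p < len(seq) else ""  # end of sequence sorts first
    a, b = lo, hi
    while a < b:                             # first index with char >= ch
        m = (a + b) // 2
        if char_at(m) < ch:
            a = m + 1
        else:
            b = m
    left, b = a, hi
    while a < b:                             # first index with char > ch
        m = (a + b) // 2
        if char_at(m) <= ch:
            a = m + 1
        else:
            b = m
    return left, a

def longest_match(seq, ht, sa, q, read, start=0):
    """Longest exact match of read[start:] and the reference positions it hits."""
    if start + q > len(read):
        return 0, []
    g = 0
    for c in read[start:start + q]:
        g = g * 4 + CODE[c]
    lo, hi = ht[g], ht[g + 1]                # bucket of the seed q-gram
    if lo == hi:
        return 0, []
    depth = q
    while start + depth < len(read):
        nlo, nhi = narrow(seq, sa, lo, hi, depth, read[start + depth])
        if nlo == nhi:
            break                            # no suffix extends the match
        lo, hi, depth = nlo, nhi, depth + 1
    return depth, sorted(sa[lo:hi])

seq = "TATAGGCATGAGCCAC"
ht, sa = build_index(seq, 1)
print(longest_match(seq, ht, sa, 1, "GCCATG"))           # (4, [11])
print(longest_match(seq, ht, sa, 1, "GCCATG", start=2))  # (4, [6])
```

In this example the read prefix GCCA matches at reference position 11, and the read suffix CATG matches at position 6, so the read splits into two exactly matching pieces that map to different places in the sequence.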
Results of index generation
| Aligner | Options | Time (min) | Memory (GB) | Size (GB) |
|---|---|---|---|---|
| HIA | -t 1 -q 14 | 165 | 20.32 | 12.63 |
| HIA | -t 12 -q 14 | 28 | 20.47 | 12.63 |
| BWA | | 65 | 4.53 | 5.40 |
| Bowtie2 | | 99 | 5.35 | 4.10 |
| SOAP2 | | 55 | 3.39 | 5.90 |
| SeqAlto | -I 0 genome.fa 28 | 33 | 37.99 | 22.40 |
| SeqAlto | -I 1 genome.fa 22 | 12 | 13.19 | 5.52 |
Time is elapsed time in minutes. Memory is the peak memory used during index construction. Size is the total size of all generated files.
Results for simulated single-end reads
| Aligner | Time (s) | % Aligned | % Unique [% Err] | % Q10 [% Err] |
|---|---|---|---|---|
| (a) Illumina-like 100 bp reads (unpaired) | | | | |
| HIA | 464 | 100.00 | 96.57 [0.4314] | 95.80 [0.2151] |
| BWA | 1242 | 98.11 | 94.73 [0.1711] | 94.60 [0.1562] |
| BWA MEM | 265 | 100.00 | 96.30 [0.0497] | 95.27 [0.0153] |
| Bowtie2 | 1291 | 99.95 | 99.63 [2.4252] | 94.22 [0.0208] |
| SOAP2 | 264 | 79.37 | 76.27 [0.4679] | |
| SeqAlto | 1459 | 99.69 | 96.33 [0.2861] | 96.04 [0.2156] |
| (b) Illumina-like 150 bp reads (unpaired) | | | | |
| HIA | 530 | 100.00 | 97.56 [0.2552] | 97.26 [0.1643] |
| BWA | 2464 | 98.00 | 95.55 [0.0953] | 95.48 [0.0866] |
| BWA MEM | 355 | 100.00 | 97.36 [0.0210] | 96.41 [0.0053] |
| Bowtie2 | 2069 | 99.97 | 99.87 [1.6663] | 95.99 [0.0094] |
| SOAP2 | 525 | 68.72 | 66.78 [0.2806] | |
| SeqAlto | 3608 | 99.68 | 97.25 [0.1947] | 97.10 [0.1490] |
| (c) 454-like 250 bp reads (unpaired) | | | | |
| HIA | 1009 | 99.96 | 98.28 [0.4189] | 94.38 [0.1772] |
| BWA-SW | 3157 | 99.86 | 97.61 [0.6735] | 94.38 [0.0357] |
| BWA MEM | 1497 | 100.00 | 97.92 [0.0767] | 97.26 [0.0346] |
| Bowtie2 | 2947 | 99.59 | 83.40 [0.5980] | 36.44 [0.0011] |
| (d) 454-like 400 bp reads (unpaired) | | | | |
| HIA | 1378 | 99.76 | 98.48 [0.1557] | 96.17 [0.0397] |
| BWA-SW | 5144 | 100.00 | 95.89 [0.2084] | 94.00 [0.0284] |
| BWA MEM | 2426 | 99.99 | 98.46 [0.0471] | 97.98 [0.0238] |
| Bowtie2 | 6597 | 99.96 | 88.35 [0.3048] | 32.93 [0.0000] |
Time is elapsed time in seconds. Unique refers to reads with MAPQ ≥ 1, where MAPQ is available. Q10 refers to reads with MAPQ ≥ 10.
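The column definitions can be made concrete with a short sketch. It assumes each read is represented by its MAPQ value, with `None` for an unaligned read, and that all percentages are taken over the total number of reads; the function name `mapq_summary` and both conventions are assumptions, since the denominator is not stated here.

```python
def mapq_summary(mapqs):
    """Percent aligned, unique (MAPQ >= 1), and Q10 (MAPQ >= 10) over all reads."""
    total = len(mapqs)
    if total == 0:
        raise ValueError("no reads given")
    aligned = [m for m in mapqs if m is not None]   # None marks an unaligned read

    def pct(count):
        return 100.0 * count / total

    return {
        "aligned": pct(len(aligned)),
        "unique": pct(sum(1 for m in aligned if m >= 1)),
        "q10": pct(sum(1 for m in aligned if m >= 10)),
    }

print(mapq_summary([None, 0, 3, 60]))
# {'aligned': 75.0, 'unique': 50.0, 'q10': 25.0}
```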
Results for simulated paired-end reads
| Aligner | Time (s) | % Aligned | % Unique [% Err] | % Q10 [% Err] |
|---|---|---|---|---|
| (a) Illumina-like 100 bp reads (paired) | | | | |
| HIA | 1009 | 99.96 | 97.17 [0.0859] | 96.75 [0.0510] |
| BWA | 2554 | 99.79 | 97.70 [0.0954] | 97.49 [0.0692] |
| BWA MEM | 646 | 99.99 | 98.03 [0.0282] | 97.95 [0.0172] |
| Bowtie2 | 1691 | 97.13 | 97.09 [1.2711] | 93.73 [0.0128] |
| SOAP2 | 586 | 84.54 | 82.75 [0.3356] | |
| SeqAlto | 2945 | 99.61 | 97.15 [0.0833] | 97.02 [0.0788] |
| (b) Illumina-like 150 bp reads (paired) | | | | |
| HIA | 1153 | 99.99 | 98.10 [0.0780] | 97.89 [0.0537] |
| BWA | 5380 | 99.78 | 98.17 [0.0983] | 98.06 [0.0876] |
| BWA MEM | 649 | 99.99 | 98.43 [0.0137] | 98.39 [0.0083] |
| Bowtie2 | 2348 | 97.13 | 97.12 [0.9904] | 94.14 [0.0083] |
| SOAP2 | 856 | 75.25 | 74.04 [0.3487] | |
| SeqAlto | 7191 | 99.58 | 97.80 [0.0723] | 97.74 [0.0698] |
Time is elapsed time in seconds. Unique refers to reads with MAPQ ≥ 1, where MAPQ is available. Q10 refers to reads with MAPQ ≥ 10.
Results for real datasets
| Aligner | Time (s) | % Aligned | % Unique | % Q10 |
|---|---|---|---|---|
| (a) Illumina 100 bp reads (unpaired) | | | | |
| HIA | 369 | 97.71 | 91.23 | 86.41 |
| BWA | 2877 | 85.87 | 81.82 | 81.68 |
| BWA MEM | 272 | 96.86 | 89.32 | 86.85 |
| Bowtie2 | 1291 | 94.96 | 92.11 | 83.69 |
| SOAP2 | 283 | 87.29 | 82.29 | |
| SeqAlto | 1567 | 89.16 | 85.22 | 84.60 |
| (b) 454 400 bp reads (unpaired) | | | | |
| HIA | 964 | 99.05 | 96.90 | 95.92 |
| BWA-SW | 6369 | 99.53 | 96.48 | 92.63 |
| BWA MEM | 830 | 99.73 | 96.36 | 94.86 |
| Bowtie2 | 6597 | 98.37 | 96.96 | 91.02 |
| (c) Illumina 100 bp reads (paired) | | | | |
| HIA | 1111 | 91.53 | 87.48 | 85.25 |
| BWA | 2871 | 88.90 | 86.59 | 86.26 |
| BWA MEM | 690 | 93.49 | 90.55 | 89.80 |
| Bowtie2 | 1646 | 93.13 | 91.60 | 84.93 |
| SOAP2 | 725 | 82.18 | 79.56 | |
| SeqAlto | 3370 | 92.04 | 87.82 | 87.55 |
Time is elapsed time in seconds. Unique refers to reads with MAPQ ≥ 1, where MAPQ is available. Q10 refers to reads with MAPQ ≥ 10.
Results of the multithreading tests
| Aligner | Time (6 threads) | Time (12 threads) |
|---|---|---|
| HIA | 932 | 505 |
| BWA | 4006 | 2586 |
| BWA MEM | 1162 | 645 |
| Bowtie2 | 1180 | 789 |
| SOAP2 | 2217 | 1616 |
| SeqAlto | 3945 | 2077 |
Time is elapsed time in minutes.