Mahdi Heydari, Giles Miclotte, Yves Van de Peer, Jan Fostier.
Abstract
BACKGROUND: Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string.
Keywords: Graph alignment; Illumina; Markov Model; Next-generation sequencing; de Bruijn Graph
Year: 2018 PMID: 30180801 PMCID: PMC6122196 DOI: 10.1186/s12859-018-2319-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Association between the de Bruijn graph and the Markov model (MM) tables. Part of a de Bruijn graph is shown on the left. True paths are depicted by blue lines. The number inside each node is the multiplicity of that node, i.e., the number of times the node's sequence is present in the reference genome. A table at each node guides the aligner based on previously observed nodes. The 2-MM and 3-MM tables of node A are shown on the right. Based on the 2-MM table, reads that align to CA are guided to E, because the continuation to node D is not allowed. However, the information in this table is insufficient to guide reads that align to BA, since continuations to both E and D are valid. In contrast, the 3-MM table guides reads that align to FBA to D, and reads that align to GBA to E. The information in the final row of the 3-MM table is redundant because it is also contained in the lower-order 2-MM table.
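The lookup described in the caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: the table contents are read off the caption, the tables store only the set of allowed continuations (the source does not say what the real tables store), and the names `MM2_A`, `MM3_A`, and `allowed_continuations` are hypothetical. The redundant final row of the 3-MM table is left out on purpose, so the lookup falls back to the 2-MM table for that context.

```python
from typing import Dict, FrozenSet, Sequence, Tuple

Context = Tuple[str, ...]
Table = Dict[Context, FrozenSet[str]]

# Hypothetical tables for node A. Keys are the nodes observed before A
# (oldest first); values are the continuations that occur on true paths.
MM2_A: Table = {
    ("B",): frozenset({"D", "E"}),  # BA: both continuations are valid
    ("C",): frozenset({"E"}),       # CA: only E is allowed
}
MM3_A: Table = {
    ("F", "B"): frozenset({"D"}),   # FBA -> D
    ("G", "B"): frozenset({"E"}),   # GBA -> E
}

# Tables indexed by the number of preceding nodes they condition on.
TABLES_A: Dict[int, Table] = {1: MM2_A, 2: MM3_A}


def allowed_continuations(
    path: Sequence[str], tables: Dict[int, Table]
) -> FrozenSet[str]:
    """Return the allowed successors of the last node in `path`.

    The longest available context is tried first; if it has no entry,
    the lookup falls back to the next lower-order table.
    """
    history = tuple(path[:-1])  # nodes observed before the current node
    for k in sorted(tables, reverse=True):
        if len(history) < k:
            continue
        hit = tables[k].get(history[-k:])
        if hit is not None:
            return hit
    return frozenset()  # no information: the aligner gets no guidance


print(sorted(allowed_continuations(["C", "A"], TABLES_A)))
print(sorted(allowed_continuations(["B", "A"], TABLES_A)))
print(sorted(allowed_continuations(["F", "B", "A"], TABLES_A)))
```

A read with only BA aligned so far still sees two candidates, while one more node of history (FBA or GBA) makes the choice unambiguous. This is the benefit of the higher-order table that the caption describes.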
Artificial datasets used for the evaluation of graph aligner tools
| Abbr. | Organism | Reference ID | Genome size | Repeated 31-mers (%) | Sequencing platform | Cov. | Read length |
|---|---|---|---|---|---|---|---|
| S1 | | NC010473 | 4.5 Mbp | 3.2 | Illumina HiSeq 2500 | 25 | 150 bp |
| S2 | | NC010473 | 4.5 Mbp | 3.2 | Illumina HiSeq 2000 | 50 | 100 bp |
| S3 | | HG19 | 45.2 Mbp | 4.3 | Illumina HiSeq 2500 | 25 | 150 bp |
| S4 | | HG19 | 45.2 Mbp | 4.3 | Illumina HiSeq 2000 | 50 | 100 bp |
| S5 | | Release 5 | 116.4 Mbp | 1.1 | Illumina HiSeq 2500 | 25 | 150 bp |
| S6 | | Release 5 | 116.4 Mbp | 1.1 | Illumina HiSeq 2000 | 50 | 100 bp |
Real datasets used for the evaluation of graph aligner tools
| Abbr. | Organism | Reference ID | Genome size | Repeated 31-mers (%) | Cov. | Sequencing platform | Read length | Trimmed reads | Dataset ID |
|---|---|---|---|---|---|---|---|---|---|
| R1 | | Nc013714.1 | 2.6 Mbp | 0.4 | 373 X | Illumina MiSeq | 251 bp | | SRR1151311 |
| R2 | | NC010473 | 4.5 Mbp | 3.2 | 418 X | Illumina MiSeq | 150 bp | | Ill. Data library |
| R3 | | NC000913 | 4.5 Mbp | 0.6 | 612 X | Illumina GAII | 100 bp | | ERA000206 |
| R4 | | NC011083.1 | 4.7 Mbp | 0.5 | 97 X | Illumina MiSeq | 239 bp | ✓ | SRR1206093 |
| R5 | | ERR330008 | 6.1 Mbp | 0.6 | 169 X | Illumina MiSeq | 120 bp | ✓ | ERR330008 |
| R6 | | HG19 | 45.2 Mbp | 4.3 | 29 X | Illumina HiSeq | 100 bp | | Ill. Data library |
| R7 | | WS222 | 97.6 Mbp | 2.6 | 58 X | Illumina HiSeq | 101 bp | | SRR543736 |
| R8 | | Release 5 | 116.4 Mbp | 1.1 | 52 X | Illumina HiSeq | 100 bp | | SRR823377 |
Accuracy comparison of graph aligner tools in terms of correct alignment of reads to the graph on simulated data
| Correctly aligned reads (%) | S1 | S2 | S3 | S4 | S5 | S6 |
|---|---|---|---|---|---|---|
| BGREAT | 99.94 | 99.61 | 98.92 | 96.16 | 99.89 | 99.40 |
| BrownieAligner | 100.00 | 99.99 | 99.42 | 98.07 | 99.97 | 99.89 |
| BrownieAlignerNoMM | 99.99 | 99.98 | 99.30 | 97.67 | 99.96 | 99.85 |
| deBGA | 99.52 | 83.48 | 99.07 | 83.01 | 99.37 | 83.37 |
Accuracy evaluation of BrownieAlignerNoMM and BrownieAligner on the subset of the simulated reads that align to a path of at least two nodes in the graph
| Correctly aligned reads (%) | S1 | S2 | S3 | S4 | S5 | S6 |
|---|---|---|---|---|---|---|
| BrownieAligner | 99.34 | 99.05 | 90.72 | 86.07 | 98.21 | 97.12 |
| BrownieAlignerNoMM | 98.72 | 98.47 | 87.68 | 82.39 | 97.38 | 96.13 |
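To make the effect of the Markov model on these multi-node reads easier to read off, the short sketch below computes the per-dataset difference between the two rows of the table above, in percentage points. The values are copied from the table; the variable names are illustrative only.

```python
# Percentages copied from the table above (reads spanning at least two nodes).
with_mm = {"S1": 99.34, "S2": 99.05, "S3": 90.72,
           "S4": 86.07, "S5": 98.21, "S6": 97.12}
without_mm = {"S1": 98.72, "S2": 98.47, "S3": 87.68,
              "S4": 82.39, "S5": 97.38, "S6": 96.13}

# Gain in percentage points from enabling the Markov model.
gain = {ds: round(with_mm[ds] - without_mm[ds], 2) for ds in with_mm}

for ds, g in gain.items():
    print(f"{ds}: +{g:.2f} percentage points")
```

The differences range from 0.58 (S2) to 3.68 (S4) percentage points, with the two largest gains on S3 and S4, the datasets built on the HG19 reference.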
Accuracy comparison of graph aligner tools in terms of correct alignment of reads to the graph on real data
| Correctly aligned reads (%) | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 |
|---|---|---|---|---|---|---|---|---|
| BGREAT | 94.55 | 94.28 | 91.28 | 84.97 | 96.09 | 92.01 | 94.57 | 80.37 |
| BrownieAligner | 99.81 | 99.81 | 99.55 | 99.02 | 99.78 | 96.98 | 96.53 | 89.59 |
| BrownieAlignerNoMM | 99.81 | 99.80 | 99.52 | 98.99 | 99.78 | 96.67 | 96.47 | 89.55 |
| deBGA | 99.67 | 99.30 | 92.36 | 97.31 | 93.63 | 98.42 | 74.72 | 85.42 |
Fig. 2 Peak memory usage. Peak memory usage of the aligner tools for the simulated datasets
Fig. 3 Runtime. Average runtime of the tools to align 1M reads for the simulated datasets
Fig. 4 Runtime. The effect of the branch-and-bound strategy on the running time of BrownieAligner
Fig. 5 Peak memory usage. Peak memory usage of the aligner tools for the real datasets
Fig. 6 Runtime. Average runtime of the tools to align 1M reads for the real datasets