Dinghua Li, Yukun Huang, Chi-Ming Leung, Ruibang Luo, Hing-Fung Ting, Tak-Wah Lam.
Abstract
BACKGROUND: The recent release of the gene-targeted metagenomics assembler Xander demonstrated that using a trained hidden Markov model (HMM) to guide the traversal of a de Bruijn graph gives a clear advantage over other assembly methods. As a pilot study, however, Xander leaves considerable room for improvement. Apart from being slow, Xander uses only one k-mer size for graph construction, and any single choice of k compromises either sensitivity or accuracy. Xander uses a Bloom-filter representation of the de Bruijn graph to reduce its memory footprint, but Bloom filters introduce false positives, and it is not clear how these affect assembly quality. Finally, Xander does not keep track of the multiplicity of k-mers, which would be an effective way to distinguish erroneous k-mers from correct ones.
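The remark about k-mer multiplicity can be illustrated with a minimal sketch. It is not MegaGTA's or Xander's code: the function names, the toy reads, and the cutoff `min_multiplicity=2` are all assumptions made for illustration. The idea is only that a k-mer created by a sequencing error tends to occur far less often than a genuine one, so a count threshold can separate the two.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count how many times each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_multiplicity=2):
    """Keep k-mers seen at least min_multiplicity times (hypothetical cutoff)."""
    return {kmer for kmer, n in counts.items() if n >= min_multiplicity}


# Toy data (assumed): three error-free copies of a read and one copy
# carrying a single substitution error (A -> T at position 5).
reads = ["ACGTACGGA"] * 3 + ["ACGTTCGGA"]
counts = count_kmers(reads, k=4)
trusted = solid_kmers(counts)
```

In this toy example the four k-mers that span the substitution each occur once and are filtered out, while the k-mers of the error-free read are kept. A real assembler would choose the cutoff from the observed coverage rather than a fixed value.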
Keywords: Assembly; De Bruijn graph; Metagenomics; Targeted gene
Year: 2017 PMID: 29072142 PMCID: PMC5657035 DOI: 10.1186/s12859-017-1825-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The workflow of Xander (left) and MegaGTA (right). Their differences are highlighted in bold
Assembly statistics of different k-mer sizes
| | MegaGTA (k = 30) | MegaGTA (k = 36) | MegaGTA (k = 45) | Xander (k = 30) | Xander (k = 36) | Xander (k = 45) |
|---|---|---|---|---|---|---|
| # of contigs | 16 | 7 | 4 | 14 | 7 | 4 |
| # of genes recovered | 9 | 5 | 4 | 8 | 4 | 4 |
| Duplication ratio | 1.82 | 1.46 | 1.00 | 1.75 | 1.82 | 1.00 |
| # misassembled contigs | 1 | 0 | 0 | 1 | 0 | 0 |
| # partially unaligned contigs | 2 | 0 | 0 | 2 | 0 | 0 |
| # mismatches per 100 kbp | 148 | 150 | 96 | 534 | 278 | 64 |
| Wall time (s) | 101 | 73 | 65 | 1264 | 1090 | 573 |
| Gene fraction of each recovered rplB gene (%) | | | | | | |
| | 84.8 | – | – | 84.8 | – | – |
| | 82.5 | 82.5 | – | 82.5 | 82.5 | – |
| | 99.6 | 99.6 | 81.5 | 99.6 | 99.6 | 81.5 |
| | 81.4 | – | – | 81.4 | – | – |
| | 78.1 | – | – | 78.1 | – | – |
| | 98.2 | 64.3 | – | 98.2 | 64.3 | – |
| | 99.6 | 99.6 | 99.6 | 99.6 | 99.6 | 99.6 |
| | – | – | 99.6 | – | – | 99.6 |
| | 55.0 | 55.0 | 93.2 | – | – | 93.9 |
| | 62.2 | – | – | 62.2 | – | – |
Assembly result with or without low coverage penalty
| | Before UCHIME, with penalty | Before UCHIME, without penalty | After UCHIME, with penalty | After UCHIME, without penalty |
|---|---|---|---|---|
| # of gene contigs | 13 | 14 | 7 | 7 |
| # of genes recovered | 6 | 6 | 5 | 4 |
| # misassembled contigs | 0 | 0 | 0 | 0 |
| # partially unaligned contigs | 2 | 3 | 0 | 0 |
| # mismatches per 100 kbp | 543.9 | 997.1 | 149.9 | 208.8 |
| Gene fraction of each recovered rplB gene (%) | | | | |
| | 82.5 | 82.5 | 82.5 | 82.5 |
| | 99.6 | 99.6 | 99.6 | 99.6 |
| | 64.3 | 64.3 | 64.3 | 64.3 |
| | 99.6 | 99.6 | 99.6 | 99.6 |
| | 99.6 | 99.6 | – | – |
| | 84.3 | 84.3 | 55.0 | – |
Assembly results of MegaGTA (using iterative de Bruijn graph) and Xander (merging contigs of three k-mer sizes)
| | MegaGTA (iterative de Bruijn graph) | Xander (union of three k-mer sizes) |
|---|---|---|
| # of gene contigs | 10 | 19 |
| # of genes recovered | 10 | 10 |
| Duplication ratio | 1 | 1.79 |
| # misassembled contigs | 0 | 1 |
| # partially unaligned contigs | 1 | 2 |
| # mismatches per 100 kbp | 13.52 | 453.05 |
| Time (s) | 277 | 2927 |
| Gene fraction of each recovered rplB gene (%) | | |
| | 98.77 | 84.77 |
| | 82.48 | 82.48 |
| | 99.64 | 99.64 |
| | 81.39 | 81.39 |
| | 78.14 | 78.14 |
| | 98.21 | 98.21 |
| | 99.64 | 99.64 |
| | 99.64 | 99.64 |
| | 99.29 | 99.29 |
| | 63.31 | 62.23 |
Performance of MegaGTA, Xander and MEGAHIT on the rhizosphere soil metagenomic sample
| | MegaGTA | MegaGTA | Xander | Xander | MEGAHIT | MEGAHIT |
|---|---|---|---|---|---|---|
| Gene rplB | | | | | | |
| Cluster identity | 99% | 95% | 99% | 95% | 99% | 95% |
| # of gene contigs aligned by Framebot | 17,668 | 5,079 | 15,933 | 4,237 | 578 | 465 |
| Total length (bp) | 13.9 M | 3.96 M | 12.5 M | 3.32 M | 378 k | 311 k |
| Median length (bp) | 822 | 822 | 822 | 822 | 639 | 660 |
| # of matched reference genes | 491 | 427 | 456 | 385 | 208 | 193 |
| Median % aa identity | 76.73 | 76.00 | 77.46 | 76.90 | 77.50 | 77.07 |
| Gene nifH | | | | | | |
| Cluster identity | 99% | 95% | 99% | 95% | 99% | 95% |
| # of gene contigs aligned by Framebot | 33 | 11 | 31 | 10 | 9 | 5 |
| Total length (bp) | 27.8 k | 9,225 | 25.3 k | 8,412 | 7,368 | 4,464 |
| Median length (bp) | 888 | 888 | 882 | 883.5 | 930 | 930 |
| # of matched reference genes | 13 | 10 | 12 | 8 | 8 | 5 |
| Median % aa identity | 91.55 | 90.54 | 92.96 | 91.19 | 85.14 | 83.99 |
| Gene nirK | | | | | | |
| Cluster identity | 99% | 95% | 99% | 95% | 99% | 95% |
| # of gene contigs aligned by Framebot | 1,336 | 392 | 1,242 | 336 | 203 | 179 |
| Total length (bp) | 1.09 M | 321 k | 1.02 M | 277 k | 170 k | 153 k |
| Median length (bp) | 687 | 787.5 | 690 | 748.5 | 735 | 750 |
| # of matched reference genes | 55 | 53 | 50 | 47 | 71 | 66 |
| Median % aa identity | 89.29 | 86.61 | 89.06 | 87.30 | 66.41 | 65.97 |
Fig. 2 Number of matched nirK reference genes (clustered at 99% identity) vs. minimum aa identity reported by Framebot
Number of contigs with false k-mers vs. different Bloom filter sizes
| Bloom filter size (GB) | 256 | 128 | 64 |
|---|---|---|---|
| # contigs | 15,929 | 15,933 | 16,107 |
| # contigs with false k-mers | 3 | 62 | 1,694 |
| # contigs with internal false k-mers | 1 | 46 | 1,523 |
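The trend in the table above, where smaller Bloom filters produce more contigs containing false k-mers, follows from how a Bloom filter works. The sketch below is a generic illustration and not Xander's implementation: the class name, the SHA-256 based hashing, and the parameter names are assumptions, and the record does not state how many hash functions or k-mers Xander used. `expected_fpr` uses the standard approximation $(1 - e^{-kn/m})^k$ for $m$ bits, $n$ inserted items and $k$ hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative sketch only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True if every position is set; this may be a false positive.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def expected_fpr(num_bits, num_items, num_hashes):
    """Approximate false positive rate: (1 - exp(-k*n/m)) ** k."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes


# Inserted k-mers are always reported as present (no false negatives).
bf = BloomFilter(num_bits=1024, num_hashes=3)
inserted = ["ACGT", "CGTA", "GTAC"]
for kmer in inserted:
    bf.add(kmer)

# Degenerate one-bit filter: once anything is added, every query is "present".
tiny = BloomFilter(num_bits=1, num_hashes=2)
tiny.add("ACGT")
```

With a fixed number of inserted k-mers, `expected_fpr` grows as `num_bits` shrinks, which matches the direction of the counts in the table. The absolute rates depend on the number of distinct k-mers and hash functions, neither of which is given here.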