Basir Shariat, Narjes Sadat Movahedi, Hamidreza Chitsaz, Christina Boucher.
Abstract
MOTIVATION: Assembly quality is intimately tied to the complexity of the de Bruijn graph built by the assembler, and many paradigms have therefore been developed to reduce that complexity. One obvious combinatorial paradigm is to allow the value of k to vary: a larger value of k where the graph is more complex and a smaller value of k where the graph would likely contain fewer spurious edges and vertices. One open problem that affects the practicality of this method is how to predict the value of k prior to building the de Bruijn graph. We show that optimal values of k can be predicted prior to assembly by using the information contained in a phylogenetically close genome, which helps make the use of multiple values of k practical for genome assembly.
Year: 2014 PMID: 25558875 PMCID: PMC4304221 DOI: 10.1186/1471-2164-15-S10-S9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. (Left) The self sequence landscape for CATCATTTG, and (right) the sequence landscape of another string, GGCATCATTGGGTATAACC, with respect to CATCATTTG. The mountains (light grey or black) mark occurrences of substrings of the source string in the target string. The number at the peak of each mountain denotes the frequency of occurrence. The maximal sequence landscape is highlighted in light grey, and the arrows show the ascent and descent of the landscapes.
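The idea behind the caption can be illustrated with a small sketch. It is not the landscape construction used by the authors. It assumes a simplified reading in which, for each start position in the source string, we take the longest substring that also occurs in the target string and count its (possibly overlapping) occurrences there. The function names `count_overlapping` and `longest_matches` are hypothetical.

```python
def count_overlapping(text, sub):
    """Count occurrences of a non-empty `sub` in `text`, allowing overlaps."""
    count = 0
    start = text.find(sub)
    while start != -1:
        count += 1
        start = text.find(sub, start + 1)
    return count


def longest_matches(source, target):
    """For each start position i in `source`, report the longest substring
    source[i:j] that occurs in `target`, with its frequency in `target`.
    Simplifying assumption: one 'mountain' per source start position."""
    results = []
    for i in range(len(source)):
        j = i
        # Extending greedily is valid: if source[i:j+1] occurs in target,
        # every shorter prefix source[i:j] occurs as well.
        while j < len(source) and source[i:j + 1] in target:
            j += 1
        if j > i:
            sub = source[i:j]
            results.append((i, sub, count_overlapping(target, sub)))
    return results


source = "CATCATTTG"
target = "GGCATCATTGGGTATAACC"
for start, sub, freq in longest_matches(source, target):
    print(start, sub, freq)
```

Shorter substrings such as CAT occur more often in the target than the longest match does, which is why the frequency is recorded separately for each substring.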
Performance comparison between major assembly tools and HyDA-Vista on the error-corrected standard multicell E. coli dataset (6.2 Gbp, 28 million reads, 100 bp, treated as single-end), evaluated using QUAST in default mode [43].
| Assembler | NGA50 | NA50 | Largest (bp) | Total (bp) | MA | GF (%) | Unaligned (bp) |
|---|---|---|---|---|---|---|---|
| SOAPdenovo | 32,032 | 35,343 | 101,201 | 4,304,232 | 3 | 95.2 | 3,421 |
| ABySS | 31,237 | 32,987 | 110,012 | 4,530,701 | 0 | 97.56 | 0 |
| SPAdes | 60,338 | 60,768 | 173,976 | 4,545,775 | 0 | 97.8 | 3,001 |
| IDBA | 57,826 | 58,549 | 173,964 | 4,538,426 | 0 | 97.7 | 2,349 |
| HyDA | 36,292 | 39,069 | 123,771 | 4,524,075 | 0 | 97.4 | 0 |
All statistics are based on contigs no shorter than 500 bp. Since there are no (QUAST-defined) misassemblies in any of the assemblies, the length statistics are based on correct contigs. NGA50 (NA50) is the (QUAST-corrected) contig size such that the contigs of at least that size cover half of the genome (assembly) size [43,44]. Total is the sum of the lengths of all contigs. MA is the number of misassemblies. GF is the genome fraction percentage, that is, the fraction of genome bases covered by the assembly. Unaligned is the total length of all contigs that could not be aligned to the reference.
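The N50-style definition above can be sketched in a few lines. This is a minimal illustration, not QUAST's implementation: it assumes the lengths of the correct (aligned) blocks are already available as a plain list, and the function name `n50_like` and the toy lengths are hypothetical. Passing the assembly size as the target gives an NA50-like value; passing the reference genome size gives an NGA50-like value.

```python
def n50_like(block_lengths, target_size, min_length=500):
    """Return the largest length L such that blocks of length >= L together
    cover at least half of `target_size`, or None if that is never reached.
    Blocks shorter than `min_length` are ignored (500 bp in the table)."""
    kept = sorted((n for n in block_lengths if n >= min_length), reverse=True)
    covered = 0
    for length in kept:
        covered += length
        if 2 * covered >= target_size:
            return length
    return None


# Toy block lengths in bp (illustrative only, not data from the table).
blocks = [400, 900, 700, 500, 300]
assembly_size = sum(n for n in blocks if n >= 500)  # 2100
genome_size = 4000                                  # assumed reference size

print(n50_like(blocks, assembly_size))  # NA50-like value
print(n50_like(blocks, genome_size))    # NGA50-like value
```

Because the genome-based variant normalises by the reference size rather than by the assembly size, it stays comparable across assemblers whose total lengths differ.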