Sylwester Swat, Artur Laskowski, Jan Badura, Wojciech Frohmberg, Pawel Wojciechowski, Aleksandra Swiercz, Marta Kasprzak, Jacek Blazewicz.
Abstract
MOTIVATION: There are very few methods for de novo genome assembly based on the overlap graph approach. This approach is considered to give more exact results than the so-called de Bruijn graph approach, but at the cost of much greater running time and memory usage. Assembly methods built on the overlap graph model often cannot process larger datasets, mainly because of the memory limits of a computer. This is why mostly de Bruijn-based assembly methods, which are fast and fairly accurate, have been developed in recent decades. However, the latter methods can fail for longer or more repetitive genomes, because they decompose reads into shorter fragments and lose part of the information. An efficient assembler that processes big datasets and uses the overlap graph model is still sought.
Year: 2021 PMID: 33471088 PMCID: PMC8289375 DOI: 10.1093/bioinformatics/btab005
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An example of the construction of a sparse graph. (A) Three overlapping reads. (B) The algorithm starts with the overlap area set to its minimum value and increases it once overlaps of that length have been checked within the whole dataset. The first edge found is (a, c), weighted by the appropriate offset. (C) After the overlap area is increased, edge (b, c) is found; it replaces (a, c), which is now recognized as a transitive edge because of the supplementing connection from a to b. This connection can be checked quickly with bitwise operations, because the offset between a and b is already known. (D) Edge (a, b) is added once the overlap area rises further.
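The transitive-edge test from the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not ALGA's implementation: the dictionary-based graph, the function name `is_transitive` and the example offsets are hypothetical, and the bitwise comparison mentioned in the caption is replaced here by plain integer arithmetic on offsets.

```python
def is_transitive(offsets, a, c):
    """Return True if edge (a, c) is implied by a two-step path a -> b -> c.

    `offsets` maps a directed edge (u, v) to the offset of read v
    relative to read u (a hypothetical representation for this sketch).
    """
    if (a, c) not in offsets:
        return False
    for (u, b), off_ab in offsets.items():
        if u != a or b == c:
            continue  # only consider edges a -> b with b different from c
        off_bc = offsets.get((b, c))
        # The edge is transitive when the two shorter offsets add up to it.
        if off_bc is not None and off_ab + off_bc == offsets[(a, c)]:
            return True
    return False


# Hypothetical offsets for three reads a, b, c laid out left to right.
offsets = {("a", "c"): 5}
print(is_transitive(offsets, "a", "c"))  # no path through b is known yet

offsets[("a", "b")] = 2
offsets[("b", "c")] = 3
print(is_transitive(offsets, "a", "c"))  # a -> b -> c spans the same offset
```

Dropping such edges keeps the graph sparse, because the path through b already carries the same positional information.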
Fig. 2. Solving the minimum directed spanning tree problem in a local subarea. (A) Reads from the example. (B) Subgraph S identified in a larger graph for vertex a. S contains a directed cycle, and no edge is deleted in this iteration. (C) Result of the reduction for the identified subgraph. Vertex h is not reachable from its starting vertex within distance D, so edges (g, h) and (h, f) do not belong to it.
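Identifying the local subgraph around a vertex can be sketched as a bounded shortest-path search. The sketch below covers only that extraction step, not the spanning tree reduction itself. It assumes that distance is the sum of edge weights along a path, which the caption does not specify; the function name `local_subgraph`, the adjacency-list format and the example graph are likewise hypothetical.

```python
import heapq


def local_subgraph(graph, start, max_dist):
    """Collect vertices reachable from `start` within `max_dist`.

    `graph` maps a vertex to a list of (neighbor, weight) pairs.
    Distance is the sum of edge weights along a path (an assumption).
    Returns the vertex set and the edges with both ends inside it.
    """
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd <= max_dist and nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    vertices = set(dist)
    edges = [(u, v, w) for u in vertices
             for v, w in graph.get(u, []) if v in vertices]
    return vertices, edges


# Hypothetical graph: h lies too far from f, so edges touching h are excluded.
graph = {"f": [("g", 2)], "g": [("h", 9)], "h": [("f", 1)]}
vertices, edges = local_subgraph(graph, "f", 5)
print(sorted(vertices), edges)
```

Edges with an endpoint outside the returned vertex set are left out, mirroring the caption's remark that (g, h) and (h, f) fall outside the subgraph when h is unreachable within D.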
Characteristics of datasets used in the comparison
| G [Mbp] | N [M] | R [bp] | I [bp] | D |
|---|---|---|---|---|
| 91.0 | 18.3 | 101 | 158 | 41 |
| 100.3 | 34.3 | 110 | 225 | 75 |
| 120.3 | 48.0 | 101 | 272 | 81 |
| 58.7 | 48.5 | 250 | 499 | 413 |
| 4.6 | 22.7 | 101 | 504 | 989 |
| 4.2 | 4.9 | 98 | 315 | 229 |
Note: ‘G’ stands for genome length, ‘N’ for number of read pairs, ‘R’ for average read length, ‘I’ for average insert size, ‘D’ for average depth of coverage.
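The columns of this table are related to each other. As a rough check, depth of coverage can be estimated as the total number of sequenced bases divided by the genome length, with each read pair contributing two reads. The source does not state how D was computed, so this formula is an assumption, and the function name `depth_of_coverage` is hypothetical.

```python
def depth_of_coverage(genome_mbp, read_pairs_m, read_len_bp):
    """Estimate average depth as total sequenced bases / genome length.

    Each read pair contributes two reads of average length `read_len_bp`.
    """
    total_bases = 2 * read_pairs_m * 1e6 * read_len_bp
    return total_bases / (genome_mbp * 1e6)


# (G [Mbp], N [M], R [bp], reported D) rows copied from the table.
rows = [
    (91.0, 18.3, 101, 41),
    (100.3, 34.3, 110, 75),
    (120.3, 48.0, 101, 81),
    (58.7, 48.5, 250, 413),
    (4.6, 22.7, 101, 989),
    (4.2, 4.9, 98, 229),
]
for g, n, r, reported in rows:
    estimate = depth_of_coverage(g, n, r)
    print(f"estimated {estimate:7.1f}   reported {reported}")
```

The estimates land within roughly 1% of the reported values; small differences are expected because the table entries are rounded.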
Results summarized for all genomes from Table 1 (average values) from the point of view of assemblers’ functionality
| Assembler | Genome fraction | Dupl. ratio | Inaccuracy |
|---|---|---|---|
| ALGA | 95.98% | 1.004 | 1.26% |
| GRASShopPER | 86.44% | 1.053 | 1.76% |
| MEGAHIT | 96.00% | 1.017 | 30.33% |
| Platanus | 92.30% | 1.008 | 0.47% |
| Readjoiner | 44.54% | 1.133 | 2.90% |
| SAGE2 | 76.65% | 1.008 | 4.83% |
| SGA | 95.77% | 1.008 | 1.28% |
| SOAPdenovo2 | 91.89% | 1.008 | 1.12% |
| SPAdes | 96.62% | 1.004 | 27.76% |
| Velvet | 81.93% | 1.015 | 6.01% |
Note: Genome fraction is the part of a reference genome covered by contigs aligned to it; duplication ratio is the ratio of the total aligned length of the contigs to the length of the aligned part of the genome (with 1 being the optimum); inaccuracy is the sum of the lengths of inaccurate contigs (contigs that are misassembled, unaligned or partially unaligned to a reference genome) divided by the genome length.
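The three measures defined in the note can be written out as simple ratios. The sketch below follows the definitions given above; the function name `assembly_metrics`, its parameter names and the input numbers are hypothetical and do not come from the paper's data.

```python
def assembly_metrics(genome_len, covered_ref_bases, aligned_contig_bases,
                     inaccurate_contig_lens):
    """Compute genome fraction, duplication ratio and inaccuracy.

    genome_len             -- length of the reference genome
    covered_ref_bases      -- reference bases covered by aligned contigs
    aligned_contig_bases   -- total aligned length of the contigs
    inaccurate_contig_lens -- lengths of misassembled, unaligned or
                              partially unaligned contigs
    """
    genome_fraction = covered_ref_bases / genome_len
    duplication_ratio = aligned_contig_bases / covered_ref_bases
    inaccuracy = sum(inaccurate_contig_lens) / genome_len
    return genome_fraction, duplication_ratio, inaccuracy


# Hypothetical inputs, chosen only to illustrate the arithmetic.
gf, dup, inacc = assembly_metrics(
    genome_len=1_000_000,
    covered_ref_bases=960_000,
    aligned_contig_bases=964_800,
    inaccurate_contig_lens=[7_000, 5_000],
)
print(f"genome fraction {gf:.2%}, dupl. ratio {dup:.3f}, inaccuracy {inacc:.2%}")
```

Because inaccuracy divides by the genome length rather than by the assembly length, it is not bounded by the genome fraction, which is consistent with assemblers in the table showing both a high genome fraction and a high inaccuracy.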
Fig. 3. Results shown as the dependency between NG50 values and the total length of misassemblies. NG50 is the length of the contig such that contigs of at least that length together cover half of a reference genome. Total misassembly length is the sum of the lengths of inaccurate contigs (contigs that are misassembled, unaligned or partially unaligned to a reference genome). The better the results, the closer they lie to the bottom-right part of the graph. Axes are in logarithmic scale.
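The NG50 definition in the caption translates directly into a short computation. This is a minimal sketch; the function name `ng50` and the contig lengths are hypothetical, and returning `None` when the contigs cannot cover half of the reference is a convention chosen for this example.

```python
def ng50(contig_lengths, genome_length):
    """Return the NG50 of an assembly against a reference genome length.

    Contigs are taken from longest to shortest; NG50 is the length of
    the contig at which the running total first reaches half of the
    genome length. Returns None if the contigs never reach that total.
    """
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_length:
            return length
    return None


contigs = [400, 300, 200, 100]
print(ng50(contigs, 1200))  # the two longest contigs reach half of 1200
print(ng50(contigs, 2400))  # total contig length is below half of 2400
```

Unlike N50, which is computed against the total assembly length, NG50 uses the reference genome length, so assemblies of different total size remain comparable.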
Fig. 4. Results shown as the percentage of a reference genome covered by contigs aligned to it, depending on the minimal length of contigs taken into account. The X axis is in logarithmic scale.
Values of BUSCO measure summed up for all datasets
| | Complete and single-copy genes | Complete and duplicated genes | Fragmented genes |
|---|---|---|---|
| Reference genomes | 10 055 | 70 | 39 |
| ALGA | 9259 | 74 | 362 |
| GRASShopPER | 4228 | 23 | 451 |
| Platanus | 8967 | 61 | 447 |
| SAGE2 | 7236 | 70 | 734 |
| SGA | 9178 | 61 | 410 |
| SOAPdenovo2 | 7983 | 62 | 778 |
| Velvet | 6472 | 62 | 1167 |
Note: The values refer to numbers of BUSCO genes recognized correctly or partially within contigs produced by the assemblers.