Martin Ayling, Matthew D Clark, Richard M Leggett.
Abstract
In recent years, the use of longer range read data combined with advances in assembly algorithms has stimulated big improvements in the contiguity and quality of genome assemblies. However, these advances have not directly transferred to metagenomic data sets, as assumptions made by the single genome assembly algorithms do not apply when assembling multiple genomes at varying levels of abundance. The development of dedicated assemblers for metagenomic data was a relatively late innovation and for many years, researchers had to make do using tools designed for single genomes. This has changed in the last few years and we have seen the emergence of a new type of tool built using different principles. In this review, we describe the challenges inherent in metagenomic assemblies and compare the different approaches taken by these novel assembly tools.
Keywords: Metagenomics; algorithms; assembly; sequencing
Year: 2020 PMID: 30815668 PMCID: PMC7299287 DOI: 10.1093/bib/bbz020
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Two different approaches to genome assembly: (a) in Overlap, Layout, Consensus (OLC) assembly, (i) overlaps are found between reads and an overlap graph is constructed (edges indicate overlapping reads). (ii) Reads are laid out into contigs based on the overlaps (dashed lines indicate overlapping portions). (iii) The most likely sequence is chosen to construct the consensus sequence. (b) In de Bruijn graph (dBg) assembly, (i) reads are decomposed into kmers by sliding a window of size k across the reads. (ii) The kmers become vertices in the dBg, with edges connecting overlapping kmers. Polymorphisms (red) form branches in the graph. A count is kept of how many times a kmer is seen, shown here as numbers above kmers. (iii) Contigs are built by walking the graph from edge nodes. A variety of heuristics handle branches in the graph; for example, low coverage paths, as shown here, may be ignored.
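
The dBg steps in panel (b) can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed tools: the function names (`count_kmers`, `build_graph`, `walk_contig`) and the `min_count` parameter are assumptions made for illustration, reverse complements are ignored, and the walk only checks the number of outgoing edges, which is simpler than the heuristics real assemblers apply.

```python
from collections import Counter, defaultdict

def count_kmers(reads, k):
    """Slide a window of size k across every read and count each kmer."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def build_graph(counts, min_count=1):
    """Kmers are vertices; an edge joins two kmers that overlap by k-1 bases.

    Kmers seen fewer than min_count times are dropped, which mimics
    ignoring low coverage paths (hypothetical threshold).
    """
    kept = {kmer for kmer, n in counts.items() if n >= min_count}
    graph = defaultdict(list)
    for kmer in kept:
        for base in "ACGT":
            successor = kmer[1:] + base
            if successor in kept:
                graph[kmer].append(successor)
    return kept, graph

def walk_contig(start, graph):
    """Extend a contig from start while exactly one outgoing edge exists."""
    contig, current, seen = start, start, {start}
    while len(graph.get(current, [])) == 1:
        successor = graph[current][0]
        if successor in seen:  # stop on cycles
            break
        contig += successor[-1]
        seen.add(successor)
        current = successor
    return contig

# Three error-free reads plus one read carrying a single-base difference.
reads = ["ACGTTGCA"] * 3 + ["ACGTAGCA"]
counts = count_kmers(reads, k=4)
kept, graph = build_graph(counts, min_count=2)
contig = walk_contig("ACGT", graph)
```

With `min_count=1` the variant read creates a branch at `ACGT` and the walk stops there; raising the threshold removes the low coverage path so the walk can continue through it. In a metagenome this same filter can discard genuine low-abundance organisms, which is one reason single-genome heuristics transfer poorly.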
Metagenomic assembly tools: key concepts and references to publications
| Tool | Approach | Key concepts | Reference |
|---|---|---|---|
| BBAP | OLC | Blast-based overlap assembly, with optional intermediary assembly stage. | Lin |
| Genovo | OLC | Generative probabilistic model; applies a series of hill-climbing steps iteratively until convergence; randomly (Chinese restaurant process, CRP, prior) picks a contig to align read ‘i’ to; breaks up chimeric contigs by taking the edge reads off contigs every ~5 iterations. | Laserson |
| IDBA-UD | dBg | Build graph; remove dead ends (<2k-1); merge bubbles; break graph on progressive (local) depth; error correction in reads (map reads to confident contigs; reads which match in all but a few bases can be ‘corrected’ to map perfectly); use mate pair info to build a ‘local’ assembly, avoiding repeats and chimeras; hold trivial contigs, remove reads; make next graph; after k_max, partition graph and clip tips based on progressive (local) depth; paired-end reads require long contigs to be effective. | Peng |
| IVA (iterative virus assembler) | OLC | Aimed at viruses. Greedy kmer-based extension. The most abundant kmer in the set is used as a seed, and this seed is grown out using a read that perfectly maps to it. A new kmer is drawn from the prefix of this read, which must be much more abundant than any other of the same size and occur more than 10 times in the data set. | Hunt |
| MAP | OLC | Reads are filtered before overlap (to reduce the number of pairwise alignments made); simple paths are found first; mate pair support is used to simplify paths; edges with contradictory or insufficient mate pair support are removed. | Lai |
| MegaGTA | dBg | Guided assembly targeting specific genes. Employs HMM profile model, iterative kmers and succinct dBg. | Li |
| MEGAHIT | dBg | Solid kmers (occurring more often than a set threshold) and mercy kmers (the remainder); mercy kmers that occur between two solid kmers in a read are kept; build a succinct dBg (dBg with Burrows-Wheeler Transform); remove tips and bubbles, progressively remove low local coverage edges; increase kmer size, extract kmers from contigs and reads, build next graph. | Li |
| MetaVelvet | dBg | dBg is first built with Velvet; population structure is estimated from coverage of nodes (Poisson distributions); dBg is partitioned into hypothetical subgraphs (possibly different species) using these peaks as a guide; only nodes from the primary distribution are considered; chimeric and repeat contigs are identified and split by paired end info and coverage differences. Assembly is produced for the primary distribution; the procedure is repeated for the next. | Namiki |
| MetaVelvet-SL | dBg | Similar to MetaVelvet, but chimeric contigs are identified using an SVM trained on (paired ends, coverage, contig lengths) for each dinucleotide (AA, AT...GG); a training set is generated from a similar population, the SVM is trained on this, then passed over the dBg for decomposition. | Afiahayati |
| Omega | OLC | Read prefix/suffix (+/−) are stored in hashes; graph is built of V(r); simple paths (1 in, 1 out) are contracted, and transitive edges are reduced; tips are removed (<10r) and bubbles are removed (hold edges with more r); minimum cost flow analysis for short (<1000 bp) contigs; mate pair insert sizes are then estimated from the assembly and used to support contigs; scaffolding with long mate pair reads; remaining unresolved contigs are merged on similar coverage. | Haider |
| PRICE | Hybrid | Reads are ‘collapsed’ if identical, then if near identical; assembly then essentially uses a (single strand) dBg with greedy walking, starting at highest coverage; identical contigs are collapsed, then near identical contigs (ungapped) and finally gapped. | Ruby |
| Ray Meta | dBg | Extension of Ray—no graph partitioning performed, doesn’t use a single peak for kmer coverage, min and peak coverage are specific for each seed path; heuristics-based graph traversal; graph is coloured according to an expected taxonomic profile. | Boisvert |
| SAVAGE | OLC | Aimed at viral quasi-species recovery. Strict overlap conditions reproduce quasi-species assembly with minimal misassemblies. | Baaijens |
| Snowball | Iterative joining | Guided assembly targeting specific genes. Overlapping paired-end reads are merged, then assigned to profile domains. Consensus reads are assembled for each domain by iterative joining. | Gregor |
| SPAdes and metaSPAdes | dBg | SPAdes started out as a tool aiming to resolve uneven coverage in single cell genome data; metaSPAdes builds a metagenomics-specific pipeline on top of SPAdes. Multiple kmer sizes of dBg, starting with the lowest kmer size and adding hypothetical kmers (preferably of the smallest useful size) to connect the graph. | Bankevich |
| VICUNA | Overlap | A min hash algorithm based on a pairwise genetic distance threshold: inexact matching first (reads with similar or identical hashes are merged), followed by string matching of hash prefixes/suffixes; (optional) target-like reads are kept first (similar reads binned, similarity of bin is used), everything else removed. | Yang |
| Xander | dBg | Guided assembly targeting specific genes. Employs HMM profile model. | Wang |
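
The solid and mercy kmer idea listed for MEGAHIT can be sketched in a few lines of Python. This follows only the one-line description in the table and is not MEGAHIT's actual implementation: the function names (`kmers_of`, `solid_and_mercy`) and the `threshold` parameter are hypothetical, and the rule used here (rescue any non-solid kmer lying between the first and last solid kmer of a read) is an assumed simplification.

```python
from collections import Counter

def kmers_of(read, k):
    """Return the kmers of a read in order of position."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def solid_and_mercy(reads, k, threshold):
    """Split kmers into solid ones and rescued ('mercy') ones.

    Solid: seen more than `threshold` times across all reads.
    Mercy: non-solid kmers that sit between two solid kmers in a read
    (simplified rule, assumed for illustration).
    """
    counts = Counter(km for read in reads for km in kmers_of(read, k))
    solid = {km for km, n in counts.items() if n > threshold}
    mercy = set()
    for read in reads:
        kms = kmers_of(read, k)
        solid_positions = [i for i, km in enumerate(kms) if km in solid]
        if len(solid_positions) < 2:
            continue  # nothing in this read is flanked by solid kmers
        first, last = solid_positions[0], solid_positions[-1]
        mercy.update(km for km in kms[first:last + 1] if km not in solid)
    return solid, mercy

# Two well-covered regions, one low coverage read bridging them,
# and one unrelated low coverage read.
reads = ["ACGTA"] * 3 + ["GGCAT"] * 3 + ["ACGTATTGGCAT", "CCCTA"]
solid, mercy = solid_and_mercy(reads, k=3, threshold=2)
```

Here the four kmers unique to the bridging read are kept as mercy kmers because they connect two solid regions, while the kmers of the unrelated read are discarded. The intent, as described in the table, is to retain low-abundance sequence that a plain count threshold would throw away.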