Alla Mikheenko, Andrey Prjibelski, Vladislav Saveliev, Dmitry Antipov, Alexey Gurevich.
Abstract
Motivation: The emergence of high-throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long-read sequencing. These technological advances, along with novel computational approaches, became the next step towards automatic pipelines capable of assembling nearly complete mammalian-size genomes.
Year: 2018 PMID: 29949969 PMCID: PMC6022658 DOI: 10.1093/bioinformatics/bty266
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Upper bound assembly construction. (a) All available reads (brown for long reads, orange for mate-pairs, and yellow for paired-ends) are mapped to the reference (gray) to compute zero-coverage genomic regions. Repeat sequences (red) are detected using repeat finder software. Non-repetitive covered fragments are reported as upper bound contigs. (b) The overlaps between the contigs (green) and either long or mate-pair reads are detected, and contigs are further joined to form upper bound scaffolds. (c) The gaps between adjacent contigs within a scaffold are filled either with reference sequence (for covered regions) or with stretches of N nucleotides (for coverage gaps). Unresolved repeats are added as separate sequences.
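Step (a) of the caption can be read as interval arithmetic on the reference: merge the read alignments into covered regions, then cut out the repeats. The sketch below is only an illustration of that idea under simplifying assumptions (one reference sequence, half-open coordinates, no minimum fragment length); the function names are hypothetical and this is not the tool's actual implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(base: List[Interval], holes: List[Interval]) -> List[Interval]:
    """Return the parts of `base` that are not covered by any interval in `holes`."""
    result: List[Interval] = []
    holes = merge(holes)
    for start, end in merge(base):
        cur = start
        for h_start, h_end in holes:
            if h_end <= cur or h_start >= end:
                continue  # this hole does not touch the remaining part
            if h_start > cur:
                result.append((cur, h_start))
            cur = max(cur, h_end)
        if cur < end:
            result.append((cur, end))
    return result


def upper_bound_contigs(read_alignments: List[Interval],
                        repeats: List[Interval]) -> List[Interval]:
    """Covered, non-repetitive reference fragments (assumed reading of step (a))."""
    covered = merge(read_alignments)
    return subtract(covered, repeats)


# Hypothetical toy input: three read alignments and two repeat regions.
reads = [(0, 100), (80, 250), (400, 600)]
repeats = [(150, 200), (450, 460)]
print(upper_bound_contigs(reads, repeats))
```

In this toy input the region between positions 250 and 400 has zero coverage, so it never appears in the output; steps (b) and (c) of the caption, which join and gap-fill these fragments, are not modelled here.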
Fig. 2. Detection of discrepancies caused by transposable elements (TEs). On each subfigure, we plot the reference genome R (top), the contig C (bottom), their matching fragments (blue and green bars for the positions in C and R, respectively) and locations of TEs (violet bars) causing discrepancies in the mapping. The inconsistencies in the alignments are shown by arrows and δ characters. (a) TE is present in R and missing in C. Since δ here is equal to the TE’s length, a suitably chosen breakpoint threshold X changes the classification of this discrepancy from a relocation to a local misassembly (δ ≤ X). (b) TE is located inside C, but its position in R is far away from the rest of the C mappings, and it could also be located on the opposite strand. Original QUAST would treat this situation as two misassembly breakpoints (relocations or inversions) because δ1 and δ2 are usually much higher than X. In contrast, QUAST-LG classifies such a pattern as a possible TE, since it computes a single discrepancy δ from δ1 and δ2 that is again equal to the TE’s length and can be covered by an appropriate X. (c) TE is the first or the last alignment fragment in C, while its location on R is a large distance δ away from the neighboring C fragment. QUAST-LG cannot reliably distinguish this situation from a real relocation/inversion: it would need to recognize the TE from its genomic sequence, which is out of the scope of this paper.
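Case (a) comes down to comparing the discrepancy δ with the breakpoint threshold X. A minimal sketch of that comparison follows, assuming that a discrepancy larger than X counts as a relocation and anything up to X counts as a local misassembly. The function name, the TE length and both threshold values are hypothetical, and case (b) is not modelled.

```python
def classify_discrepancy(delta: int, threshold_x: int) -> str:
    """Assumed rule: |delta| above X is a relocation, otherwise a local misassembly."""
    return "relocation" if abs(delta) > threshold_x else "local misassembly"


te_length = 6000  # hypothetical TE length in bp; in case (a) delta equals the TE length
for x in (1000, 7000):  # two illustrative thresholds, not values taken from the text
    print(f"X = {x}: {classify_discrepancy(te_length, x)}")
```

With the smaller threshold the TE-sized discrepancy is reported as a relocation; with the larger one it becomes a local misassembly, which is the effect the caption describes.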
Benchmark datasets details
| Dataset | Yeast | Yeast | Worm | Fly | Human | Human |
|---|---|---|---|---|---|---|
| Species | | | | | | |
| Genome size | 12.1 Mb | 12.1 Mb | 100.3 Mb | 137.6 Mb | 3.1 Gb | 3.1 Gb |
| Library 1: RL, IS | 101 bp, 220 bp | 250 bp, 350 bp | 100 bp, 250 bp | 101 bp, 225 bp | 250 bp, 350 bp | 150 bp, 350 bp |
| Library 1: coverage | 1038× | 300× | 65× | 35× | 50× | 100× |
| Library 2: type | PB | NP | PB | MP | MP | NP |
| Library 2: RL, IS | 6 kb, — | 7.7 kb, — | 11 kb, — | 110 bp, 8 kb | 101 bp, 6 kb | 6.5 kb, — |
| Library 2: coverage | 155× | 120× | 40× | 40× | 30× | 30× |
Note: Read lengths (RL) and insert sizes (IS) are given as median values; ‘—’ indicates no insert size for long-read libraries. Type is the sequencing technology used to generate Library 2: PB, NP and MP stand for PacBio SMRT, Oxford Nanopore Technologies and Illumina mate-pair data, respectively. Library 1 data were generated with Illumina sequencers for all datasets. HG001 and HG004 are human sample identifiers in the Genome in a Bottle Consortium (Zook ).
Assemblers used in the study
| Name | Reference | Version | Date |
|---|---|---|---|
| ABySS | | 2.0.2 | Oct 2016 |
| DISCOVAR | — | 52488 | Mar 2015 |
| MaSuRCA | | 3.2.3 | Sep 2017 |
| Meraculous | | 2.2.4 | Jun 2017 |
| Platanus | | 1.2.4 | Oct 2015 |
| SOAPdenovo | | 2.04 | Dec 2013 |
| SPAdes | | 3.11.1 | Oct 2017 |
| Canu | | 1.6 | Jun 2017 |
| FALCON | | 0.7 | Jun 2016 |
| Flye | | 2.3 | Jan 2018 |
| MaSuRCA | | 3.2.3 | Sep 2017 |
| Miniasm | | 0.2-r168 | Nov 2017 |
Note: The assemblers are divided into two groups based on the read types they can process. DISCOVAR de novo is the successor of the popular ALLPATHS-LG (Gnerre ) assembler, but it has not been published yet (indicated with ‘—’).
QUAST and QUAST-LG performance
| Dataset | Genome size (Mb) | # asm. | QUAST time (hh:mm) | QUAST RAM (GB) | QUAST-LG time (hh:mm) | QUAST-LG RAM (GB) |
|---|---|---|---|---|---|---|
| Yeast | 12.1 | 5 | 00:06 | 1.2 | 00:01 | 1.1 |
| Yeast | 12.1 | 4 | 00:04 | 1.2 | 00:01 | 0.6 |
| Worm | 100.3 | 5 | 02:51 | 8.4 | 00:08 | 6.3 |
| Fly | 137.6 | 6 | 04:55 | 13.8 | 00:21 | 9.8 |
| Human | 3088.3 | 4 | — | — | 03:55 | 135.2 |
| Human | 3088.3 | 4 | — | — | 04:05 | 135.4 |
Note: # asm. stands for the number of assemblies being processed. Running time is in hh:mm format; maximal RAM consumption is in GB; ‘—’ indicates the fact that conventional QUAST was not able to process the human datasets. All benchmarking was done on a server with Intel Xeon X7560 2.27 GHz CPUs using 8 threads.
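The running times in the table can be turned into rough speed-up ratios. The snippet below is a small arithmetic sketch using only the rows where both tools finished. The helper names and the row labels are made up for illustration, and because the times are reported to the minute, the ratios for the short Yeast runs are very coarse.

```python
def to_minutes(hhmm: str) -> int:
    """Convert an 'hh:mm' string from the table into minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


# (label, QUAST time, QUAST-LG time), copied from the table in row order.
RUNS = [
    ("Yeast (first row)", "00:06", "00:01"),
    ("Yeast (second row)", "00:04", "00:01"),
    ("Worm", "02:51", "00:08"),
    ("Fly", "04:55", "00:21"),
]


def speedups(runs):
    """Ratio of QUAST to QUAST-LG running time for each row."""
    return {name: to_minutes(quast) / to_minutes(quast_lg)
            for name, quast, quast_lg in runs}


for name, ratio in speedups(RUNS).items():
    print(f"{name}: {ratio:.1f}x")
```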
Comparison of assemblies of six benchmark datasets
| Assembler | LGA50 | Largest alignment (Mb) | Genome fraction (%) | # mis. | NGA50 (Mb) | Mismatches per 100 kb | Indels per 100 kb | K-mer-based compl. (%) | K-mer-based # misjoins | BUSCO compl. (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Yeast | ||||||||||
| Canu | 35 | 0.669 | 579.50 | 48.25 | 64.39 | |||||
| FALCON | 1.502 | 96.07 | 184.09 | 92.09 | 86.31 | 95.18 | ||||
| Flye | 1.083 | 98.04 | 24 | 0.677 | 7 | |||||
| MaSuRCA | 14 | 0.686 | 97.41 | 60 | 0.346 | 680.43 | 50.04 | 62.36 | 10 | |
| Miniasm | 97.31 | 35 | 0.663 | 155.74 | 104.48 | 85.46 | 97.93 | |||
| UpperBound | 6 | 1.524 | 99.92 | 0 | 0.777 | 0.00 | 0.00 | 99.90 | 0 | 99.31 |
| Yeast | ||||||||||
| Canu | 7 | 1.090 | 98.84 | 12 | 0.658 | 565.83 | 101.41 | 62.04 | 98.96 | |
| Flye | 8 | 1.081 | 97.71 | 0.663 | 56.14 | 649.09 | 52.09 | 3 | 67.93 | |
| MaSuRCA | 14 | 5 | ||||||||
| Miniasm | 8 | 1.057 | 98.26 | 7 | 0.639 | 52.60 | 754.08 | 48.08 | 66.89 | |
| UpperBound | 6 | 1.524 | 99.87 | 0 | 0.777 | 0.00 | 0.00 | 99.94 | 0 | 99.31 |
| Reference | 6 | 1.532 | 100.00 | 0 | 0.924 | 0.00 | 0.00 | 100.00 | 0 | 99.31 |
| Worm | ||||||||||
| Canu | 27 | 147 | 1.292 | 41.18 | ||||||
| FALCON | 29 | 3.052 | 98.67 | 1.176 | 65.11 | 126.13 | 88.94 | 8 | 94.39 | |
| Flye | 3.354 | 99.31 | 122 | 43.50 | 95.23 | 6 | ||||
| MaSuRCA | 32 | 2.542 | 99.18 | 138 | 1.016 | 33.28 | 7.79 | 97.45 | 25 | |
| Miniasm | 29 | 2.839 | 99.41 | 262 | 1.215 | 54.47 | 143.88 | 87.41 | 5 | 96.04 |
| UpperBound | 8 | 12.667 | 99.95 | 0 | 3.507 | 0.00 | 0.00 | 99.96 | 0 | 96.37 |
| Reference | 3 | 20.924 | 100.00 | 0 | 17.494 | 0.00 | 0.00 | 100.00 | 0 | 96.37 |
| Fly | ||||||||||
| ABySS | 2.694 | 79.45 | 0.331 | 92.10 | 60.50 | 99.01 | ||||
| MaSuRCA | 186 | 1.807 | 84.61 | 922 | 0.157 | 1316.66 | 63.35 | 66 | ||
| Meraculous | 111 | 1.586 | 82.58 | 305 | 0.316 | 1241.33 | 91.06 | 6 | 99.01 | |
| Platanus | 97 | 81.07 | 280 | 1288.45 | 91.18 | 62.36 | 24 | 99.01 | ||
| SOAPdenovo | 155 | 1.631 | 713 | 0.238 | 1308.22 | 91.12 | 63.50 | 12 | 99.67 | |
| SPAdes | 123 | 1.656 | 80.41 | 388 | 0.287 | 1173.67 | 93.08 | 61.39 | 108 | 99.01 |
| UpperBound | 44 | 3.558 | 99.16 | 0 | 1.015 | 0.00 | 0.00 | 97.28 | 0 | 99.67 |
| Reference | 3 | 32.079 | 100.00 | 0 | 25.281 | 0.00 | 0.00 | 100.00 | 0 | 99.67* |
| Human | ||||||||||
| ABySS | 263 | 20.392 | 93.56 | 820 | 3.326 | 27.44 | 86.92 | 572 | ||
| DISCOVAR | 106.24 | 535 | 93.40 | |||||||
| SOAPdenovo | 3725 | 2.193 | 85.10 | 670 | 0.210 | 129.15 | 50.41 | 77.73 | 90.43 | |
| UpperBound | 112 | 35.878 | 99.06 | 0 | 8.309 | 0.00 | 0.00 | 99.24 | 0 | 92.74 |
| Human | ||||||||||
| Canu | 296 | 92.25 | 853 | 2.745 | 258.95 | 68.03 | 83.93 | 523 | ||
| Flye | 266 | 21.735 | 91.91 | 3.172 | 580.26 | 1125.37 | 26.59 | 69.64 | ||
| MaSuRCA | 22.413 | 13227 | 892 | 87.79 | ||||||
| UpperBound | 105 | 75.724 | 99.07 | 0 | 7.862 | 0.00 | 0.00 | 99.51 | 0 | 92.74 |
| Reference | 9 | 248.956 | 100.00 | 0 | 144.769 | 0.00 | 0.00 | 100.00 | 0 | 93.75 |
Note: All statistics are given for scaffolds kb. The best value for each column is indicated in bold (upper bound assembly and reference genome statistics are excluded from the best value determination). LGA50 is the minimal number of aligned fragments that cover half of the reference genome. NGA50 is the length of the shortest of these LGA50 aligned fragments. # mis. is the total number of misassemblies. K-mer-based compl. is the fraction of unique reference 101-mers present in the assembly. K-mer-based # misjoins is the total number of k-mer-based misjoins. BUSCO compl. is the number of conserved genes completely or partially identified in the assembly, divided by the total number of BUSCO genes. UpperBound and Reference denote the upper bound assembly and the reference genome statistics, respectively. Reference is given once per unique genome, that is, only four times for the six datasets. We manually checked the BUSCO compl. value of the MaSuRCA assembly of Fly, which exceeded the reference value (100.00 versus 99.67% completeness; marked ‘*’ in the table). In the D. melanogaster reference sequence, a single short BUSCO gene differs from the BUSCO core sequence by a few SNPs and is therefore not identified by the tool. MaSuRCA, in contrast, assembled this gene in a form closer to the BUSCO version, which enabled its identification. A similar situation occurred on the Human dataset, where ABySS and DISCOVAR partially assembled two BUSCO genes missed in the reference genome. These two assemblers were thus able to exceed the upper bound estimate of BUSCO completeness.
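The LGA50, NGA50 and k-mer-based completeness definitions in the note translate directly into a few lines of code. The sketch below follows those definitions literally and is not the tool's implementation. The function names are hypothetical, and the k-mer function compares plain forward-strand strings, ignoring reverse complements, which is a simplifying assumption.

```python
from collections import Counter
from typing import List, Optional, Tuple


def nga50_lga50(aligned_lengths: List[int],
                genome_size: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (NGA50, LGA50): take aligned fragments from longest to shortest
    until they cover at least half of the reference genome."""
    total = 0
    for count, length in enumerate(sorted(aligned_lengths, reverse=True), start=1):
        total += length
        if 2 * total >= genome_size:
            return length, count
    return None, None  # the fragments never reach half of the genome


def kmer_completeness(reference: str, assembly: str, k: int) -> float:
    """Fraction of reference k-mers that occur exactly once in the reference
    and are also present in the assembly (forward strand only, an assumption)."""
    ref_counts = Counter(reference[i:i + k] for i in range(len(reference) - k + 1))
    unique = {kmer for kmer, n in ref_counts.items() if n == 1}
    if not unique:
        return 0.0
    asm_kmers = {assembly[i:i + k] for i in range(len(assembly) - k + 1)}
    return len(unique & asm_kmers) / len(unique)


# Hypothetical toy data: four aligned fragments on a 120 bp genome.
print(nga50_lga50([40, 30, 20, 10], genome_size=120))
# Toy sequences with k = 3 instead of the 101-mers used in the table.
print(kmer_completeness("ACGTACGGA", "CGTAC", k=3))
```

In the toy example the two longest fragments (40 and 30) together cover 70 of 120 bases, so LGA50 is 2 and NGA50 is 30. For real genomes, the k-mer sets would be far too large for plain Python sets, so a dedicated k-mer counter would be needed.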