Fengyuan Huang, Li Xiao, Min Gao, Ethan J Vallely, Kevin Dybvig, T Prescott Atkinson, Ken B Waites, Zechen Chong.
Abstract
BACKGROUND: Accurate de novo assembly of bacterial genomes is fundamental to understanding the evolution and pathogenesis of new bacterial species. The advent and popularity of third-generation sequencing (TGS) enable assembly of bacterial genomes at unprecedented speed. However, most current TGS assemblers were designed for human or other species that do not have circular genomes. In addition, the repetitive DNA fragments in many bacterial genomes, combined with the high error rate of long-read sequencing data, make accurate assembly challenging even though these genomes are relatively small. An optimized method that addresses these issues is therefore urgently needed.
Keywords: Bacteria genome; De novo assembly; Hybrid-read assembly; Long-read-only assembly
Year: 2022 PMID: 35546658 PMCID: PMC9092672 DOI: 10.1186/s12864-022-08577-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 4.547
Fig. 1 The workflow of B-assembler. B-assembler has two modes: long-read-only assembly and hybrid-read assembly
Evaluation of assembly on simulated sequences
| Asm | # ctg | Largest contig | Cov % | # Local mis | # SNV | # Indels | Time(min) | Memory(G) |
|---|---|---|---|---|---|---|---|---|
| B-assembler L | 1 | 668,045 | 98.446 | 0 | 0.9 | 0 | 56 | 8.9 |
| wtdbg2 | 1 | 658,000 | 96.842 | 1 | 217.3 | 216.08 | 11 | 1.7 |
| Flye | 1 | 701,022 | 98.444 | 0 | 0.89 | 74.01 | 724 | 30.1 |
| Canu | 1 | 676,819 | 98.455 | 0 | 1.62 | 44.26 | 3,848 | 37.2 |
| Unicycler L | 1 | 680,945 | 99.839 | 0 | 65.47 | 225.6 | 70 | 3.5 |
| B-assembler H | 1 | 676,353 | 99.67 | 0 | 0 | 0 | 66 | 11.7 |
| Unicycler H | 1 | 677,975 | 99.809 | 0 | 7.48 | 5.15 | 1,634 | 14.8 |
| hybridSPAdes | 2 | 433,245 | 99.947 | 1 | 6.78 | 0.88 | 42 | 16 |
| haslr | 1 | 613,987 | 90.47 | 0 | 1.79 | 13.36 | 4 | 1.3 |
| lathe | 1 | 673,506 | 95.49 | 1 | 12.48 | 26.75 | 85 | 6.7 |
# Local mis: local misassemblies; Cov %: assembled genome fraction; # SNV: number of mismatches per 100 kbp; # Indels: number of indels per 100 kbp; L: long-read-only mode; H: hybrid mode
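The per-100 kbp normalization used for the # SNV and # Indels columns can be sketched as follows. This is a minimal illustration only: the function name `per_100kbp` and the example numbers are hypothetical and are not taken from the paper or from any tool's API.

```python
def per_100kbp(event_count: int, aligned_bases: int) -> float:
    """Return the number of events (mismatches or indels) per 100 kbp.

    event_count   -- raw number of mismatches or indels (hypothetical input)
    aligned_bases -- number of assembly bases aligned to the reference
    """
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    return event_count * 100_000 / aligned_bases


# Hypothetical example: 6 mismatches over 600,000 aligned bases.
rate = per_100kbp(6, 600_000)
print(f"{rate:.2f} mismatches per 100 kbp")
```

Normalizing by aligned length makes error rates comparable across assemblies whose contigs differ in total length.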
Evaluation of assembly on ONT sequences
| Genome | Asm | # ctg | Length | Suppl. A | Suppl. C | M.R. | Time (min) | Memory (G) | # Mapped PCR | Indels | SNV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | B-assembler L | 1 | 681,050 | 703 | 0 | 88.62 | 88 | 11.5 | 68 | 35 | 16 |
| | B-assembler H | 1 | 685,432 | 689 | 0 | 88.65 | 1,294 | 23.8 | 70 | 7 | 16 |
| | wtdbg2 | 1 | 660,247 | 2,806 | 12 | 86.92 | 7 | 2.1 | 57 | 84 | 156 |
| | Flye | 3 | 700,499 | 958 | 1 | 87.93 | 29 | 11.1 | 64 | 36 | 60 |
| | Canu | 1 | 793,572 | 873 | 1 | 87.89 | 2,883 | 15.7 | 65 | 39 | 41 |
| | Unicycler L | 2 | 755,568 | 759 | 2 | 87.91 | 73 | 10.6 | 66 | 42 | 99 |
| | Unicycler H | 61 | 407,877 | 8,646 | 10 | 76.43 | 1,410 | 24.5 | 68 | 12 | 50 |
| | B-assembler L | 1 | 1,047,044 | 1,290 | 0 | 98.95 | 86 | 15.4 | / | / | / |
| | wtdbg2 | 1 | 1,014,839 | 3,089 | 7 | 98.74 | 10 | 3.2 | / | / | / |
| | Flye | 1 | 1,052,944 | 2,218 | 1 | 98.7 | 254 | 9.1 | / | / | / |
| | Canu | 1 | 1,066,644 | 1,719 | 0 | 98.92 | 2,933 | 28.8 | / | / | / |
| | Unicycler L | 1 | 1,066,967 | 2,122 | 1 | 98.81 | 97 | 17.9 | / | / | / |
# ctg: number of assembled contigs; Suppl. A: number of supplementary alignments; Suppl. C: number of supplementary alignment clusters; M.R.: mapping rate; B-assembler L and Unicycler L: the long-read-only modes of B-assembler and Unicycler; B-assembler H and Unicycler H: the hybrid-read modes of B-assembler and Unicycler
Selected QUAST statistics on the PacBio data
| Species ID | Ref # ctg | B-assembler # ctg | B-assembler dup | B-assembler mis | Canu # ctg | Canu dup | Canu mis | Unicycler # ctg | Unicycler dup | Unicycler mis | Flye # ctg | Flye dup | Flye mis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.004 | 1 | 1.006 | 1 | 1.006 | 1 | |||||||
| 2 | |||||||||||||
| 9 | 15 | 1.038 | 20 | 11 | 1.016 | 13 | 10 | 1.014 | 10 | ||||
| 11 | 6 | 11 | 1.013 | 41 | 1.007 | 7 | 11 | 1.036 | 13 | ||||
| 19 | 204 | 1.032 | 20 | 297 | 1.054 | 5 | 120 | 1.005 | 6 | ||||
| 3 | 10 | 1.023 | 19 | 1.006 | 15 | 7 | 1.019 | 23 | |||||
| 2 | 14 | 19 | 1.015 | 11 | 25 | 1.004 | 3 | 1.02 | 7 | ||||
| 1 | 2 | 4 | 1.006 | 1 | 1.002 | 1 | |||||||
| 1 | 2 | 3 | 1.004 | 2 | 1.001 | ||||||||
| 2 | 8 | 1.006 | 3 | 5 | 1.003 | 5 | 2 | 3 | |||||
| 2 | 2 | 4 | 1.006 | 2 | 3 | 1.001 | |||||||
| 4 | 2 | 1.004 | 2 | 2 | 2 | 2 | 2 | ||||||
| 1 | 1.001 | 2 | 1.003 | 1.001 | |||||||||
| 3 | 1.05 | 4 | 4 | 4 | 7 | 1.025 | 6 | 1.051 | 8 | ||||
# ctg: number of assembled contigs; dup: duplication ratio; mis: misassemblies larger than 1 kbp
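For readers unfamiliar with the dup column, QUAST describes the duplication ratio as the total number of aligned bases in the assembly divided by the total number of aligned bases in the reference. The sketch below illustrates that arithmetic; the function name `duplication_ratio` and the input values are hypothetical, chosen only to show how a value such as 1.006 could arise.

```python
def duplication_ratio(aligned_assembly_bases: int,
                      aligned_reference_bases: int) -> float:
    """Aligned bases in the assembly divided by aligned bases in the reference.

    A value above 1 suggests that some reference regions are covered by
    more than one stretch of assembled sequence.
    """
    if aligned_reference_bases <= 0:
        raise ValueError("aligned_reference_bases must be positive")
    return aligned_assembly_bases / aligned_reference_bases


# Hypothetical example: 1,006,000 assembly bases align to
# 1,000,000 reference bases.
ratio = duplication_ratio(1_006_000, 1_000_000)
print(f"duplication ratio: {ratio:.3f}")
```

A ratio close to 1 indicates that the assembler neither collapsed nor over-represented the reference sequence.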
Fig. 2 Indels and mismatches produced by the benchmarked assemblers on the 14 NCTC PacBio samples. Indel and mismatch counts are given per 100 kbp