Rei Kajitani1, Dai Yoshimura1, Yoshitoshi Ogura2,3, Yasuhiro Gotoh3, Tetsuya Hayashi3, Takehiko Itoh1.
Abstract
De novo assembly of short DNA reads remains an essential technology, especially for large-scale projects and high-resolution variant analyses in epidemiology. However, the existing tools often lack sufficient accuracy required to compare closely related strains. To facilitate such studies on bacterial genomes, we developed Platanus_B, a de novo assembler that employs iterations of multiple error-removal algorithms. The benchmarks demonstrated the superior accuracy and high contiguity of Platanus_B, in addition to its ability to enhance the hybrid assembly of both short and nanopore long reads. Although the hybrid strategies for short and long reads were effective in achieving near full-length genomes, we found that short-read-only assemblies generated with Platanus_B were sufficient to obtain ≥90% of exact coding sequences in most cases. In addition, while nanopore long-read-only assemblies lacked fine-scale accuracies, inclusion of short reads was effective in improving the accuracies. Platanus_B can, therefore, be used for comprehensive genomic surveillances of bacterial pathogens and high-resolution phylogenomic analyses of a wide range of bacteria.
Keywords: de novo assembly; Bacterial genome; high-resolution phylogenomics; large-scale genomic surveillance
Year: 2020 PMID: 32658266 PMCID: PMC7433917 DOI: 10.1093/dnares/dsaa014
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Workflow of Platanus_B and the related assemblers. Orange- and blue-framed boxes correspond to error-correction and Platanus_B-specific processes, respectively.
Strains used in the benchmarks
| Strain | Genome size (bp) | Repetitive 100-mer rate (%) | No. of CDSs |
|---|---|---|---|
| | 5,594,605 | 5.31 | 5,291 |
| | 4,641,652 | 1.91 | 4,357 |
| | 5,191,712 | 0.95 | 4,824 |
| | 5,326,023 | 1.13 | 4,972 |
| | 2,354,886 | 7.63 | 2,051 |
The reference genomes were downloaded from the RefSeq database. The repetitive 100-mer rate is the percentage of 100-mers that occur more than once in a reference genome. The RefSeq assembly accessions of the reference genomes were GCF_000008865.2, GCF_000005845.2, GCF_000829175.1, GCF_000828775.1, and GCF_000010505.1.
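The repetitive 100-mer rate described in the table note can be illustrated with a minimal sketch. The source does not say how the rate was computed, so this sketch makes two assumptions: the rate is taken over k-mer occurrences (positions) rather than distinct k-mers, and only the forward strand is counted. The function name `repetitive_kmer_rate` is hypothetical.

```python
from collections import Counter


def repetitive_kmer_rate(sequences, k=100):
    """Return the percentage of k-mer occurrences whose k-mer appears more than once.

    Assumptions (not from the source): forward strand only, and the rate is
    computed over occurrences rather than over distinct k-mers.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of length k over each sequence (no wrap-around).
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1

    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence of a k-mer seen at least twice counts as repetitive.
    repeated = sum(c for c in counts.values() if c > 1)
    return 100.0 * repeated / total
```

For example, `repetitive_kmer_rate(["ACGTACGT"], k=4)` sees five 4-mers, of which `ACGT` occurs twice, giving 40%. Storing every 100-mer as a string is memory-hungry for multi-megabase genomes, so a real run would likely use a dedicated k-mer counter instead.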
Figure 2. Contiguity and accuracy of the short-read-based assemblers, and effects of the error-correction function in Platanus_B. (A and B) Platanus_B and other existing assemblers. Read lengths are 300 and 150 bp, respectively. (C) Deactivation test of the error-correction functions of Platanus_B. ‘Platanus_B no-corrections’ corresponds to the modified version where all correction functions (based on k-mers, physical coverage, and read mapping) are deactivated. Read length is 300 bp.
Run time and memory usage of benchmarks using the TruSeq PCR-free kit for multiple species
| Strain | Assembler | PE × 10 | PE × 20 | PE × 30 | PE × 40 | PE × 50 |
|---|---|---|---|---|---|---|
| (A) CPU time (s) | | | | | | |
| | Platanus_B | 796 | 919 | 1,080 | 1,125 | 1,284 |
| | Platanus | 58 | 63 | 82 | 668 | 899 |
| | Platanus-allee | 278 | 289 | 353 | 420 | 481 |
| | MaSuRCA | 498 | 517 | 697 | 867 | 1,006 |
| | SPAdes | 621 | 927 | 1,283 | 1,626 | 2,023 |
| | Unicycler | 3,054 | 5,369 | 8,805 | 9,281 | 11,759 |
| | DISCOVAR | 575 | 276 | 330 | 411 | 471 |
| | Platanus_B | 672 | 741 | 856 | 853 | 996 |
| | Platanus | 54 | 62 | 84 | 571 | 740 |
| | Platanus-allee | 235 | 241 | 283 | 308 | 351 |
| | MaSuRCA | 417 | 423 | 552 | 641 | 792 |
| | SPAdes | 516 | 803 | 1,109 | 1,446 | 1,709 |
| | Unicycler | 1,994 | 3,397 | 5,076 | 6,889 | 8,188 |
| | DISCOVAR | 484 | 204 | 225 | 279 | 326 |
| | Platanus_B | 713 | 774 | 888 | 902 | 1,004 |
| | Platanus | 62 | 63 | 80 | 513 | 706 |
| | Platanus-allee | 264 | 263 | 320 | 377 | 448 |
| | MaSuRCA | 445 | 487 | 641 | 791 | 945 |
| | SPAdes | 603 | 3,036 | 5,240 | 6,751 | 8,987 |
| | Unicycler | 1,856 | 5,237 | 8,389 | 10,602 | 13,830 |
| | DISCOVAR | 819 | 315 | 275 | 334 | 394 |
| | Platanus_B | 744 | 801 | 970 | 993 | 1,128 |
| | Platanus | 61 | 67 | 88 | 536 | 793 |
| | Platanus-allee | 263 | 250 | 300 | 340 | 374 |
| | MaSuRCA | 476 | 470 | 620 | 776 | 862 |
| | SPAdes | 543 | 887 | 1,146 | 1,458 | 1,840 |
| | Unicycler | 1,695 | 2,854 | 3,688 | 4,926 | 6,096 |
| | DISCOVAR | 657 | 262 | 274 | 339 | 384 |
| | Platanus_B | 419 | 476 | 529 | 559 | 666 |
| | Platanus | 28 | 32 | 39 | 363 | 427 |
| | Platanus-allee | 158 | 167 | 190 | 211 | 246 |
| | MaSuRCA | 220 | 236 | 296 | 354 | 397 |
| | SPAdes | 273 | 380 | 505 | 666 | 813 |
| | Unicycler | 1,633 | 2,217 | 2,741 | 3,091 | 3,922 |
| | DISCOVAR | 236 | 128 | 154 | 189 | 225 |
| (B) Real time (s) | | | | | | |
| | Platanus_B | 977 | 1,056 | 1,116 | 1,116 | 1,274 |
| | Platanus | 47 | 54 | 59 | 517 | 629 |
| | Platanus-allee | 292 | 308 | 337 | 365 | 393 |
| | MaSuRCA | 250 | 230 | 317 | 359 | 397 |
| | SPAdes | 227 | 295 | 401 | 505 | 624 |
| | Unicycler | 1,054 | 1,625 | 2,572 | 2,728 | 3,427 |
| | DISCOVAR | 177 | 89 | 113 | 134 | 166 |
| | Platanus_B | 869 | 921 | 963 | 951 | 1,040 |
| | Platanus | 42 | 47 | 53 | 503 | 569 |
| | Platanus-allee | 269 | 281 | 297 | 311 | 340 |
| | MaSuRCA | 210 | 190 | 239 | 263 | 315 |
| | SPAdes | 180 | 256 | 348 | 449 | 526 |
| | Unicycler | 632 | 980 | 1,433 | 1,909 | 2,252 |
| | DISCOVAR | 139 | 66 | 80 | 95 | 104 |
| | Platanus_B | 879 | 915 | 957 | 946 | 987 |
| | Platanus | 43 | 49 | 56 | 410 | 533 |
| | Platanus-allee | 274 | 284 | 307 | 319 | 355 |
| | MaSuRCA | 217 | 220 | 273 | 322 | 371 |
| | SPAdes | 333 | 1,711 | 2,231 | 2,288 | 3,025 |
| | Unicycler | 610 | 2,164 | 2,737 | 3,027 | 4,216 |
| | DISCOVAR | 225 | 99 | 102 | 116 | 136 |
| | Platanus_B | 919 | 942 | 1,017 | 1,005 | 1,084 |
| | Platanus | 43 | 50 | 57 | 427 | 572 |
| | Platanus-allee | 268 | 268 | 293 | 302 | 313 |
| | MaSuRCA | 236 | 204 | 267 | 325 | 342 |
| | SPAdes | 188 | 303 | 359 | 497 | 646 |
| | Unicycler | 620 | 899 | 1,137 | 1,425 | 1,738 |
| | DISCOVAR | 182 | 79 | 94 | 108 | 127 |
| | Platanus_B | 637 | 678 | 695 | 697 | 808 |
| | Platanus | 29 | 31 | 33 | 401 | 412 |
| | Platanus-allee | 217 | 220 | 229 | 235 | 252 |
| | MaSuRCA | 121 | 121 | 144 | 161 | 172 |
| | SPAdes | 110 | 126 | 174 | 230 | 263 |
| | Unicycler | 550 | 642 | 783 | 882 | 1,108 |
| | DISCOVAR | 74 | 47 | 52 | 60 | 69 |
| (C) Peak memory usage (GB) | | | | | | |
| | Platanus_B (-m 16) | 14.98 | 15.02 | 15.08 | 15.00 | 14.95 |
| | Platanus_B (-m 1) | 1.83 | 1.85 | 2.41 | 2.41 | 2.41 |
| | Platanus | 8.40 | 8.40 | 8.40 | 14.94 | 14.94 |
| | Platanus-allee | 14.95 | 14.96 | 14.96 | 14.95 | 14.95 |
| | MaSuRCA | 15.64 | 15.63 | 15.63 | 15.63 | 15.63 |
| | SPAdes | 1.37 | 2.61 | 2.62 | 2.62 | 2.62 |
| | Unicycler | 1.37 | 2.61 | 2.62 | 2.62 | 2.62 |
| | DISCOVAR | 2.36 | 4.21 | 6.02 | 7.97 | 9.88 |
| | Platanus_B (-m 16) | 14.93 | 14.95 | 14.99 | 14.94 | 14.91 |
| | Platanus_B (-m 1) | 1.63 | 1.64 | 1.64 | 2.40 | 2.40 |
| | Platanus | 8.40 | 8.40 | 8.40 | 14.90 | 14.90 |
| | Platanus-allee | 14.91 | 14.92 | 14.92 | 14.91 | 14.91 |
| | MaSuRCA | 15.64 | 15.63 | 12.92 | 7.51 | 12.92 |
| | SPAdes | 1.14 | 2.27 | 2.62 | 2.62 | 2.62 |
| | Unicycler | 1.14 | 2.27 | 2.62 | 2.62 | 2.62 |
| | DISCOVAR | 1.84 | 3.56 | 5.10 | 6.70 | 8.27 |
| | Platanus_B (-m 16) | 14.95 | 14.98 | 15.02 | 14.95 | 14.95 |
| | Platanus_B (-m 1) | 1.61 | 1.62 | 1.62 | 2.15 | 2.15 |
| | Platanus | 8.40 | 8.40 | 8.40 | 14.93 | 14.93 |
| | Platanus-allee | 14.94 | 14.95 | 14.95 | 14.94 | 14.94 |
| | MaSuRCA | 15.64 | 15.63 | 15.63 | 15.63 | 15.63 |
| | SPAdes | 1.17 | 2.34 | 2.62 | 2.62 | 2.62 |
| | Unicycler | 1.17 | 2.33 | 2.62 | 2.62 | 2.62 |
| | DISCOVAR | 2.03 | 3.93 | 5.67 | 7.45 | 9.21 |
| | Platanus_B (-m 16) | 14.96 | 14.99 | 15.03 | 14.96 | 14.95 |
| | Platanus_B (-m 1) | 1.80 | 1.81 | 1.81 | 2.15 | 2.15 |
| | Platanus | 8.40 | 8.40 | 8.40 | 14.93 | 14.93 |
| | Platanus-allee | 14.94 | 14.95 | 14.96 | 14.95 | 14.95 |
| | MaSuRCA | 15.64 | 15.63 | 11.56 | 11.57 | 6.16 |
| | SPAdes | 1.26 | 2.53 | 2.62 | 2.62 | 2.62 |
| | Unicycler | 1.26 | 2.52 | 2.62 | 2.62 | 2.62 |
| | DISCOVAR | 2.13 | 4.01 | 5.82 | 7.63 | 9.45 |
| | Platanus_B (-m 16) | 14.82 | 14.83 | 14.85 | 14.83 | 14.80 |
| | Platanus_B (-m 1) | 1.07 | 1.08 | 1.09 | 1.20 | 1.20 |
| | Platanus | 8.40 | 8.40 | 8.40 | 14.79 | 14.79 |
| | Platanus-allee | 14.80 | 14.80 | 14.80 | 14.80 | 14.80 |
| | MaSuRCA | 15.63 | 15.63 | 14.27 | 10.19 | 7.47 |
| | SPAdes | 0.58 | 1.16 | 1.75 | 2.32 | 2.61 |
| | Unicycler | 0.60 | 1.16 | 1.74 | 2.32 | 2.61 |
| | DISCOVAR | 0.98 | 1.84 | 2.68 | 3.63 | 4.23 |
The benchmarks were run on a machine with 24 CPUs (Intel(R) Xeon(R) CPU E5-2687W v4, 3.00 GHz) and 256 GB of RAM. Each tool was executed with 4 threads, and the real and CPU times were measured using GNU time (version 1.7). (A) CPU time (s), (B) real time (s), and (C) peak memory usage (GB). For Platanus_B, two values (16 and 1) were specified for the available-memory option (-m).
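The three quantities reported in the table (CPU time, real time, and peak memory) can be collected for any command with a short sketch like the one below. It is only an illustration and not the authors' setup, which used GNU time. The helper name `measure` is hypothetical, and the code assumes a Unix-like system where Python's standard `resource` module is available.

```python
import resource
import subprocess
import sys
import time


def measure(cmd):
    """Run cmd and return (real_seconds, cpu_seconds, peak_rss) for child processes.

    peak_rss is ru_maxrss as reported by the OS: kilobytes on Linux, bytes on
    macOS. It is the maximum over all children waited for so far, so it is
    only meaningful per command if each command is measured in a fresh process.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    real = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)

    # CPU time is user + system time, summed over all threads of the child.
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    return real, cpu, after.ru_maxrss


if __name__ == "__main__":
    # Placeholder workload; an assembler command line would go here instead.
    print(measure([sys.executable, "-c", "sum(range(10**6))"]))
```

Because CPU time is summed over threads, a tool run with 4 threads can report a CPU time larger than its real time, which is why panels (A) and (B) differ.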
Figure 3. Benchmark using multiple preparation kits for short reads. (A and B) Escherichia coli O157 Sakai. Read lengths are 300 and 150 bp, respectively. (C and D) Escherichia coli K-12 MG1655. Read lengths are 300 and 150 bp, respectively.
Figure 4. Coding sequence exact-match rates of short- and long-read-based assemblies for E. coli strains. For mixed input of long and short reads followed by polishing with Pilon, the coverage depth given in the column names (×10–50) applies to each library. For example, the total coverage depth is 20 (long reads, 10; short reads, 10) if the coverage depth is denoted as ‘×10’. Pilon was executed three times for each long-read-based assembly (Canu, Flye, Wtdbg2, and miniasm+Racon).
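A coding sequence exact-match rate of the kind plotted in Figure 4 could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' procedure: a reference CDS is counted as matched if its full nucleotide sequence, or its reverse complement, occurs as an exact substring of any assembled contig. Matches spanning the ends of a circular contig are ignored. The function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def cds_exact_match_rate(cds_seqs, contigs):
    """Return the percentage of CDSs found exactly (either strand) in the contigs.

    Assumption (not from the source): "exact match" means the entire CDS is an
    exact substring of a single contig, with no mismatches or indels.
    """
    if not cds_seqs:
        return 0.0
    contigs = [c.upper() for c in contigs]
    hits = 0
    for cds in cds_seqs:
        fwd = cds.upper()
        rev = reverse_complement(fwd)
        if any(fwd in contig or rev in contig for contig in contigs):
            hits += 1
    return 100.0 * hits / len(cds_seqs)
```

With one contig `ATGAAACCCTAGGG` and three CDSs, of which one matches on the forward strand, one only as a reverse complement, and one not at all, the function returns about 66.7%. Plain substring search is adequate for a few thousand bacterial CDSs; an aligner would be the usual choice at larger scale.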