Abstract
Existing long-read assemblers require thousands of central processing unit (CPU) hours to assemble a human genome and are being outpaced by sequencing technologies in terms of both throughput and cost. We developed a long-read assembler, wtdbg2 (https://github.com/ruanjue/wtdbg2), that is 2-17 times as fast as published tools while achieving comparable contiguity and accuracy. It paves the way for population-scale long-read assembly in the future.
Year: 2019 PMID: 31819265 PMCID: PMC7004874 DOI: 10.1038/s41592-019-0669-3
Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547
Fig. 1. Outline of the wtdbg2 algorithm. Wtdbg2 groups 256 base pairs into a bin, shown as a small box in the figure. Bins (boxes) with the same color share k-mers; a gray bin does not match any other bin because of sequencing errors. Wtdbg2 performs all-vs-all alignment between binned reads and constructs the fuzzy-Bruijn assembly graph, where a vertex is a 4-bin segment and an edge connects two vertices if they are both present on a read. Wtdbg2 then trims tips, pops bubbles, and produces the final contig sequences from the consensus of the read subsequences attached to each edge.
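The binning and graph-construction steps in the caption can be illustrated with a toy Python sketch. This is not the wtdbg2 implementation. It assumes error-free reads whose bin boundaries coincide, so that two bins match only when their sequences are identical, whereas wtdbg2 matches bins through k-mer-based alignment. It also assumes that vertices are formed by a sliding window of four bins and that edges link consecutive vertices on a read. The function names `bin_read`, `read_vertices`, and `build_graph` are hypothetical.

```python
from collections import defaultdict

BIN_SIZE = 256       # base pairs per bin, as stated in the caption
BINS_PER_VERTEX = 4  # a vertex is a 4-bin segment


def bin_read(seq, bin_size=BIN_SIZE):
    """Split a read into consecutive full-length bins, dropping a short tail."""
    n_bins = len(seq) // bin_size
    return [seq[i * bin_size:(i + 1) * bin_size] for i in range(n_bins)]


def read_vertices(bins, width=BINS_PER_VERTEX):
    """Slide a window of `width` bins along a read; each window is one vertex.

    Assumption: bins are compared by exact sequence identity (toy model).
    """
    return [tuple(bins[i:i + width]) for i in range(len(bins) - width + 1)]


def build_graph(reads):
    """Return {(u, v): support}, counting reads on which v directly follows u."""
    edges = defaultdict(int)
    for seq in reads:
        vertices = read_vertices(bin_read(seq))
        for u, v in zip(vertices, vertices[1:]):
            edges[(u, v)] += 1
    return dict(edges)
```

Calling `build_graph` on a list of read strings returns each edge with the number of reads supporting it. Tip trimming, bubble popping, and consensus calling would operate on this graph and are left out of the sketch.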
Evaluating long-read assemblies
FALCON requires PacBio-style read names and does not work with ONT data or with the A4 strain of D. melanogaster, which was downloaded from SRA. The A. thaliana assembly by FALCON was obtained from the PacBio website because our assembly was fragmented. MECAT produces fragmented assemblies for the ONT dataset. Human assemblies were performed by the developers of each assembler. Base-level evaluations and NGA50 are reported only when the sequenced strain or individual is close to the reference genome. BUSCO scores are computed for genomes sequenced to 50-fold coverage or higher.
| Dataset | Metric | CANU | FALCON | Flye | MECAT | Ra | Wtdbg2 |
|---|---|---|---|---|---|---|---|
| | Total length (>= 50kbp) | 106.5Mb | 100.8Mb | 102.0Mb | 102.1Mb | 108.1Mb | 104.8Mb |
| | % reference genome covered | 99.58 | 99.16 | 99.29 | 99.51 | 99.55 | 99.37 |
| | % genome covered more than once | 0.33 | 0.25 | 0.15 | 0.35 | 0.69 | 0.13 |
| | NG75 (75% ref. in contigs longer than NG75) | 1,884,280 | 935,802 | 1,275,590 | 1,424,674 | 1,320,829 | 2,255,274 |
| | NG50 (50% ref. in contigs longer than NG50) | 2,677,990 | 1,629,544 | 1,926,198 | 2,113,456 | 2,047,105 | 3,596,268 |
| | NGA50 (50% ref. in alignments longer than NGA50) | 1,283,814 | 980,062 | 1,087,075 | 1,119,713 | 1,019,386 | 1,365,602 |
| | # alignment breakpoints | 681 | 192 | 284 | 278 | 724 | 177 |
| | BUSCO (% complete single-copy genes) | 98.2% | 88.1% | 98.4% | 97.0% | 90.9% | 97.5% |
| | # substitutions/1Mb (pre-/post-polish) | 64.1 / 62.2 | 233.2 / 50.1 | 61.6 / 57.6 | 65.9 / 62.8 | 309.9 / 66.8 | 83.8 / 60.3 |
| | # insertions/1Mb (pre-/post-polish) | 31.1 / 22.4 | 592.7 / 19.4 | 29.8 / 21.8 | 43.9 / 21.9 | 3011.2 / 24.3 | 110.6 / 20.8 |
| | # deletions/1Mb (pre-/post-polish) | 152.8 / 55.1 | 1822.7 / 56.7 | 381.4 / 56.9 | 366.0 / 57.9 | 144.1 / 53.1 | 343.0 / 57.7 |
| | Wall-clock time over 32 CPUs (pre-polish) | 9h30m | 2h06m | 2h58m | 3h08m | 2h23m | 26m |
| | Total length (>= 50kbp) | 135.0Mb | | 130.7Mb | | 126.5Mb | 127.4Mb |
| | % reference genome covered | 91.74 | | 89.40 | | 86.35 | 89.34 |
| | % genome covered more than once | 1.19 | | 0.14 | | 0.68 | 0.22 |
| | NG75 | 714,013 | | 1,367,004 | | 685,943 | 1,752,322 |
| | NG50 | 4,298,595 | | 6,016,667 | | 1,898,336 | 10,631,323 |
| | NGA50 | 1,837,928 | | 2,210,468 | | 1,700,400 | 2,989,107 |
| | # alignment breakpoints | 823 | | 248 | | 225 | 276 |
| | # substitutions per 1Mb (pre-polish) | 847.6 | | 1318 | | 1976.2 | 1109.2 |
| | # insertions per 1Mb (pre-polish) | 255.9 | | 10669.9 | | 4388.7 | 371.2 |
| | # deletions per 1Mb (pre-polish) | 7168.2 | | 1901.3 | | 2324.6 | 9746.3 |
| | Wall-clock time over 32 CPUs (pre-polish) | 22h23m | | 1h41m | | 2h10m | 50m |
| | Total length (>= 50kbp) | 196.5Mb | 138.1Mb | 122.3Mb | 188.4Mb | 133.3Mb | 125.0Mb |
| | % reference genome covered | 99.04 | 97.03 | 93.55 | 97.47 | 92.52 | 92.66 |
| | % genome covered more than once | 47.61 | 11.35 | 3.72 | 51.46 | 3.38 | 1.08 |
| | NG75 | 460,325 | 4,810,976 | 180,227 | 1,096,121 | 404,218 | 2,182,254 |
| | NG50 | 873,036 | 7,979,657 | 370,306 | 3,525,236 | 1,210,836 | 8,707,235 |
| | # alignment breakpoints | 3,059 | 2,102 | 1,674 | 2,573 | 2,078 | 1,777 |
| | BUSCO (% complete single-copy genes) | 43.8% | 91.9% | 93.1% | 49.2% | 87.8% | 90.3% |
| | Wall-clock time over 32 CPUs (pre-polish) | 30h42m | (by PacBio) | 20h3m | 11h33m | 18h33m | 1h12m |
| Human CHM1 cell line PacBio x100 | Total length (>= 50kbp) | 2,837Mb | 2,938Mb | 2,712Mb | |||
| % reference genome covered | 89.33 | 90.13 | 86.03 | ||||
| % genome covered more than once | 0.53 | 0.72 | 0.02 | ||||
| NG75 | 3,793,440 | 7,726,658 | 4,387,668 | ||||
| NG50 | 17,570,750 | 26,132,317 | 18,220,221 | ||||
| NGA50 | 7,128,216 | 9,262,902 | 8,017,241 | ||||
| # alignment breakpoints | 1,795 | 7,966 | 1,619 | ||||
| BUSCO (% complete single-copy genes) | 91.3% | 91.5% | 90.5% | ||||
| # substitutions per 1Mb (post-polish) | 961.5 | 966.6 | 963.6 | ||||
| # insertions per 1Mb (post-polish) | 142.8 | 140.1 | 140.2 | ||||
| # deletions per 1Mb (post-polish) | 140.0 | 137.6 | 141.1 | ||||
| Total CPU hours (pre-polish CPU hours) | 22,750 | 68,789 | 2,506 (632) | ||||
Wtdbg2 performance on other human genomes. Performance metrics were obtained on a machine with 96 CPU cores. G. size: size of the reference genome; Cov.: sequencing coverage; NG50: 50% of the reference genome is in contigs longer than this length.
| Data set | Technology | Cov. | CPU hour | Real hour | Peak RAM (GB) | NG50 (Mb) |
|---|---|---|---|---|---|---|
| NA12878 | Nanopore | 36 | 1513 | 26 | 235 | 10.3 |
| NA19240 | Nanopore | 35 | 1197 | 19 | 226 | 4.4 |
| NA24385 | PacBio CCS | 28 | 410 | 6 | 108 | 11.8 |
| HG00733 | PacBio Sequel | 93 | 1906 | 37 | 338 | 29.2 |
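The NG50 and NG75 statistics reported in the tables above follow directly from their definition: sort the contigs from longest to shortest and report the length at which the running total first reaches 50% (or 75%) of the reference genome size. The following is a minimal Python sketch of that definition, not the evaluation pipeline used in the paper. The function name `ngx` and its parameters are hypothetical.

```python
def ngx(contig_lengths, genome_size, x=50):
    """Return NGx for an assembly, or None if it covers less than x% of the genome.

    NGx is the contig length L such that contigs of length >= L together
    span at least x% of `genome_size` (the reference size, not the assembly size).
    """
    target = genome_size * x / 100
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return None
```

Using the reference genome size rather than the assembly size as the denominator makes the value comparable between assemblers whose total assembled lengths differ. NGA50 applies the same calculation to aligned blocks after contigs are broken at alignment breakpoints, which this sketch does not cover.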