Nicola De Maio1, Liam P Shaw1, Alasdair Hubbard2, Sophie George3,1, Nicholas D Sanderson1, Jeremy Swann1, Ryan Wick4, Manal AbuOun5, Emma Stubberfield5, Sarah J Hoosdally1, Derrick W Crook3,1, Timothy E A Peto3,1, Anna E Sheppard3,1, Mark J Bailey6, Daniel S Read6, Muna F Anjum5, A Sarah Walker3,1, Nicole Stoesser1.
Abstract
Illumina sequencing allows rapid, cheap and accurate whole genome bacterial analyses, but short reads (<300 bp) do not usually enable complete genome assembly. Long-read sequencing greatly assists with resolving complex bacterial genomes, particularly when combined with short-read Illumina data (hybrid assembly). However, it is not clear how different long-read sequencing methods affect hybrid assembly accuracy. Relative automation of the assembly process is also crucial to facilitating high-throughput complete bacterial genome reconstruction, avoiding multiple bespoke filtering and data manipulation steps. In this study, we compared hybrid assemblies for 20 bacterial isolates, including two reference strains, using Illumina sequencing and long reads from either Oxford Nanopore Technologies (ONT) or SMRT Pacific Biosciences (PacBio) sequencing platforms. We chose isolates from the family Enterobacteriaceae, as these frequently have highly plastic, repetitive genetic structures, and complete genome reconstruction for these species is relevant for a precise understanding of the epidemiology of antimicrobial resistance. We de novo assembled genomes using the hybrid assembler Unicycler and compared different read processing strategies, as well as comparing to long-read-only assembly with Flye followed by short-read polishing with Pilon. Hybrid assembly with either PacBio or ONT reads facilitated high-quality genome reconstruction, and was superior to the long-read assembly and polishing approach evaluated with respect to accuracy and completeness. Combining ONT and Illumina reads fully resolved most genomes without additional manual steps, and at a lower consumables cost per isolate in our setting. Automated hybrid assembly is a powerful tool for complete and accurate bacterial genome assembly.
Keywords: Enterobacteriaceae; bacterial genomics; hybrid assembly; long-read sequencing; plasmid assembly
Year: 2019 PMID: 31483244 PMCID: PMC6807382 DOI: 10.1099/mgen.0.000294
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Summary of all hybrid assemblies in terms of circularized contigs
Different rows refer to different isolates. ‘n of m’ means that n contigs were circular in the assembly out of m total contigs. When n and m are identical, it means that the assembly was considered complete, and these cases are shaded in green. ‘Basic’, ‘Corrected’, ‘Filtered’ and ‘Subsampled’ refer to the strategies of long read preparation (see Methods). ‘NA’ refers to cases where the assembly pipeline repeatedly failed. The true number of circular structures was estimated by inspection.
| Isolate | ONT (MinION): Basic | ONT (MinION): Corrected | ONT (MinION): Filtered | ONT (MinION): Subsampled | PacBio (RSII System): Basic | PacBio (RSII System): Corrected | PacBio (RSII System): Filtered | PacBio (RSII System): Subsampled | True circular structures (estimated) |
|---|---|---|---|---|---|---|---|---|---|
| CFT073 (reference) | 1 of 1 | 1 of 1 | 0 of 9 | 1 of 1 | 0 of 9 | 0 of 9 | 0 of 9 | 0 of 9 | 1 |
| MGH78578 (reference) | 6 of 6 | 4 of 7 | | 6 of 6 | 4 of 7 | 2 of 22 | 2 of 22 | 2 of 22 | 6 |
| RBHSTW-00029 | 3 of 9 | 3 of 9 | 3 of 9 | 3 of 9 | 3 of 9 | 3 of 9 | 3 of 9 | 3 of 9 | 4 |
| RBHSTW-00053 | 6 of 6 | 6 of 6 | 6 of 6 | 6 of 6 | 6 of 6 | 6 of 6 | 6 of 6 | 6 of 6 | 6 |
| RBHSTW-00059 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 |
| RBHSTW-00122 | 4 of 4 | 4 of 4 | 4 of 4 | 4 of 4 | 4 of 4 | 4 of 4 | 4 of 4 | 4 of 4 | 4 |
| RBHSTW-00123 | 7 of 7 | | 7 of 7 | 7 of 7 | 5 of 8 | 4 of 18 | 4 of 18 | 4 of 18 | 7 |
| RBHSTW-00127 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 of 5 | 5 |
| RBHSTW-00128 | 4 of 4 | 4 of 4 | 4 of 4 | 4 of 4 | 4 of 4 | 3 of 6 | 3 of 6 | 3 of 6 | 4 |
| RBHSTW-00131 | 4 of 4 | 2 of 7 | 4 of 4 | 4 of 4 | 3 of 15 | 4 of 5 | 3 of 15 | 2 of 15 | 4 |
| RBHSTW-00142 | 7 of 7 | 5 of 25 | 7 of 7 | 7 of 7 | 4 of 24 | 4 of 58 | 4 of 24 | 4 of 27 | 7 |
| RBHSTW-00167 | 9 of 9 | 5 of 15 | 10 of 10 | 9 of 9 | 4 of 34 | 3 of 60 | 3 of 60 | 3 of 60 | 9 |
| RBHSTW-00189 | 6 of 6 | 6 of 6 | 5 of 6 | 6 of 6 | 5 of 29 | 5 of 28 | 5 of 29 | 5 of 30 | 6 |
| RBHSTW-00277 | 2 of 2 | 2 of 2 | 1 of 8 | 2 of 2 | 1 of 8 | 1 of 8 | 1 of 8 | 1 of 8 | 2 |
| RBHSTW-00309 | 4 of 5 | 5 of 5 | 5 of 5 | 4 of 5 | 5 of 5 | 4 of 5 | 5 of 5 | 5 of 5 | 5 |
| RBHSTW-00340 | 3 of 11 | 3 of 11 | 4 of 4 | 4 of 4 | 2 of 25 | 2 of 25 | 2 of 24 | 2 of 25 | 4 |
| RBHSTW-00350 | 2 of 2 | 2 of 2 | 2 of 3 | 2 of 2 | 2 of 2 | 2 of 2 | 2 of 2 | 2 of 2 | 2 |
| RHB10-C07 | 1 of 1 | 1 of 1 | 1 of 1 | 1 of 1 | 1 of 1 | 1 of 1 | 1 of 17 | 1 of 1 | 1 |
| RHB11-C04 | 3 of 3 | 3 of 3 | 3 of 3 | 3 of 3 | 3 of 3 | 3 of 3 | 3 of 3 | 3 of 3 | 3 |
| RHB14-C01 | 1 of 12 | 1 of 12 | 1 of 15 | 1 of 12 | 1 of 15 | 1 of 15 | 1 of 15 | 1 of 15 | 2 |
| Total contigs | 109 | 130 | 115 | 102 | 218 | 294 | 276 | 265 | 87 |
| Total circularized contigs (% of total estimated circular structures from Bandage) | 83 (95 %) | 67 (84 %) | 77 (95 %) | 84 (97 %) | 67 (77 %) | 62 (71 %) | 62 (71 %) | 61 (70 %) | |
| Total circularized contigs for reference strains (i.e. structures known) | 7 (100 %) | 5 (71 %) | 0 (0 %) | 7 (100 %) | 5 (71 %) | 2 (29 %) | 2 (29 %) | 2 (29 %) | |
| Total isolates with all contigs circularized (% isolates) | 16 (80 %) | 12 (60 %) | 13 (65 %) | 17 (85 %) | 9 (45 %) | 7 (35 %) | 7 (35 %) | 8 (40 %) | |
Fig. 1. Examples of genome structure uncertainty in hybrid assemblies in (a) the chromosome and (b) the accessory genome. (a) An ONT+Illumina hybrid assembly for isolate RBHSTW-00029 using the ‘Basic’ long-read preparation strategy. (b) A PacBio+Illumina hybrid assembly for isolate MGH78578 using the ‘Corrected’ long-read preparation strategy. Plots were obtained using Bandage on the ‘assembly.gfa’ output file from Unicycler, with grey boxes indicating unresolved structures. Each contig is annotated with contig length and Illumina coverage; connections between contigs represent overlaps between contig ends. The assembly for RBHSTW-00029 in (a) and that of isolate RHB14-C01 (which showed a similar pattern of chromosome structure uncertainty) represented the only two datasets that could not be completely assembled with any of the attempted strategies using ONT+Illumina data. They were also not fully assembled by any PacBio+Illumina strategy, which similarly failed to completely assemble isolates RBHSTW-00189, RBHSTW-00277, RBHSTW-00340 and CFT073 (Figure S4). The pattern in (b) was only observed for PacBio+Illumina data, and was the reason for incomplete assemblies for isolates RBHSTW-00123, RBHSTW-00131, RBHSTW-00142, RBHSTW-00167 and MGH78578 (Figure S5).
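An assembly graph such as ‘assembly.gfa’ can also be inspected programmatically for circular contigs. The sketch below is illustrative only and is not necessarily how circular contigs were counted in this study: it assumes GFA 1 `S` (segment) and `L` (link) records, and treats a segment as circular only when every link touching it is a same-strand link back to itself. The function name `circular_segments` and the three-segment example graph are made up for the illustration.

```python
from collections import defaultdict


def circular_segments(gfa_text):
    """Return (circular_names, all_names) for the segments in GFA 1 text.

    Heuristic (an assumption, not a Unicycler API): a segment is circular
    when it has at least one link and all of its links join it to itself
    in the same orientation, i.e. its end overlaps its own start.
    """
    segments = []
    links = defaultdict(list)  # segment -> [(other segment, same orientation?)]
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            segments.append(fields[1])
        elif fields[0] == "L":
            src, src_strand, dst, dst_strand = fields[1:5]
            same = src_strand == dst_strand
            links[src].append((dst, same))
            if dst != src:
                links[dst].append((src, same))
    circular = [
        name for name in segments
        if links[name] and all(other == name and same for other, same in links[name])
    ]
    return circular, segments


# Hypothetical toy graph: segment 1 closes on itself, segments 2 and 3 are
# joined to each other and therefore remain unresolved.
example_gfa = "\n".join([
    "S\t1\tACGT",
    "S\t2\tGGCC",
    "S\t3\tTTAA",
    "L\t1\t+\t1\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
])

if __name__ == "__main__":
    circular, everything = circular_segments(example_gfa)
    print(f"{len(circular)} of {len(everything)}")
```

The printed 'n of m' string follows the notation of the summary table, so a graph with unresolved connections like those boxed in grey would report n smaller than m.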
Comparison between PacBio- and ONT-based hybrid assemblies
Comparisons are shown using ALE, DNAdiff and REAPR (see Methods). Different rows represent different isolates. All entries representing a better score for the PacBio assembly are shaded in red, while those showing a better score for ONT are shaded in blue. ‘ALE score’ is the assembly likelihood difference (calculated by ALE from the mapping of Illumina reads) between PacBio and ONT assemblies. ‘Unmapped reads’ refers to the number of Illumina reads that ALE did not map to the corresponding assembly. ‘REAPR errors’ refers to the assembly errors found by REAPR by mapping Illumina reads to the corresponding assembly. For each isolate, one ONT and one PacBio-based assembly with the best completion (i.e. number of circularized contigs) were chosen for comparison. DNAdiff results show the median (range) results from comparing all assemblies for an isolate across read preparation strategies, i.e. 4×4=16 comparisons for each isolate. ‘GSNPs’/‘GIndels’ refer to high-confidence SNPs/indels between ONT and PacBio assemblies.
| Isolate | ALE score | PacBio unmapped reads (% total) | ONT unmapped reads (% total) | PacBio REAPR errors | ONT REAPR errors | DNAdiff GSNPs | DNAdiff GIndels |
|---|---|---|---|---|---|---|---|
| CFT073 (reference) | −17 928 | 29 246 (0.89 %) | 29 240 (0.89 %) | 5 | 5 | 1 (0–1) | 0 (0–0) |
| MGH78578 (reference) | −1 532 602 | 41 793 (1.31 %) | 38 371 (1.21 %) | 8 | 7 | 6 (1–7) | 0 (0–1) |
| RBHSTW-00029 | 207 465 | 50 056 (1.85 %) | 49 876 (1.84 %) | 3 | 3 | 0 (0–0) | 0 (0–0) |
| RBHSTW-00053 | 4 727 | 50 860 (1.62 %) | 50 861 (1.62 %) | 12 | 11 | 1.5 (0–4) | 0 (0–0) |
| RBHSTW-00059 | −143 627 | 37 357 (1.04 %) | 36 251 (1.01 %) | 15 | 14 | 0 (0–0) | 0 (0–0) |
| RBHSTW-00122 | 0 | 24 355 (1.18 %) | 24 355 (1.18 %) | 6 | 7 | 0 (0–0) | 0 (0–0) |
| RBHSTW-00123 | −1 963 188 | 56 224 (1.68 %) | 57 074 (1.70 %) | 17 | 21 | 4 (1–6) | 4.5 (2–6) |
| RBHSTW-00127 | −1 145 | 34 206 (0.98 %) | 34 206 (0.98 %) | 16 | 16 | 0 (0–0) | 0 (0–0) |
| RBHSTW-00128 | 3 114 | 31 526 (1.06 %) | 31 507 (1.05 %) | 6 | 8 | 2 (1–2) | 2 (1–4) |
| RBHSTW-00131 | 399 368 | 25 880 (0.88 %) | 26 271 (0.89 %) | 24 | 28 | 3 (1–7) | 1 (1–3) |
| RBHSTW-00142 | −790 773 | 34 684 (1.23 %) | 32 590 (1.16 %) | 12 | 12 | 3 (1–11) | 0 (0–1) |
| RBHSTW-00167 | 4 083 063 | 34 510 (1.13 %) | 76 805 (2.52 %) | 24 | 33 | 21 (18–47) | 1.5 (0–4) |
| RBHSTW-00189 | −158 523 | 37 378 (1.25 %) | 37 418 (1.25 %) | 9 | 12 | 11.5 (7–21) | 1 (0–2) |
| RBHSTW-00277 | 18 417 | 33 677 (0.99 %) | 33 685 (0.99 %) | 16 | 16 | 2 (0–2) | 0 (0–0) |
| RBHSTW-00309 | −518 811 | 30 704 (0.88 %) | 30 327 (0.87 %) | 17 | 36 | 2 (0–11) | 44.5 (0–86) |
| RBHSTW-00340 | −906 675 | 30 802 (0.87 %) | 29 860 (0.84 %) | 11 | 10 | 2 (0–4) | 0 (0–1) |
| RBHSTW-00350 | 21 188 | 28 907 (0.79 %) | 28 907 (0.79 %) | 12 | 13 | 2 (2–4) | 5 (0–8) |
| RHB10-C07 | −23 295 | 27 779 (0.90 %) | 27 777 (0.90 %) | 22 | 21 | 5 (0–17) | 0.5 (0–1) |
| RHB11-C04 | 12 774 | 24 879 (0.86 %) | 24 881 (0.86 %) | 25 | 25 | 2 (0–6) | 0 (0–0) |
| RHB14-C01 | 172 712 | 30 478 (0.95 %) | 30 576 (0.95 %) | 13 | 12 | 3 (0–3) | 0 (0–0) |
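Because the red and blue shading of the original table is not preserved here, the per-isolate direction of the REAPR comparison can be recovered from the two error columns. The sketch below is a simple tally, assuming that fewer REAPR errors corresponds to the better score; the function name `tally_reapr` is hypothetical, and the values are transcribed from the ‘PacBio REAPR errors’ and ‘ONT REAPR errors’ columns.

```python
# (PacBio REAPR errors, ONT REAPR errors) per isolate, transcribed from the table.
reapr_errors = {
    "CFT073": (5, 5),
    "MGH78578": (8, 7),
    "RBHSTW-00029": (3, 3),
    "RBHSTW-00053": (12, 11),
    "RBHSTW-00059": (15, 14),
    "RBHSTW-00122": (6, 7),
    "RBHSTW-00123": (17, 21),
    "RBHSTW-00127": (16, 16),
    "RBHSTW-00128": (6, 8),
    "RBHSTW-00131": (24, 28),
    "RBHSTW-00142": (12, 12),
    "RBHSTW-00167": (24, 33),
    "RBHSTW-00189": (9, 12),
    "RBHSTW-00277": (16, 16),
    "RBHSTW-00309": (17, 36),
    "RBHSTW-00340": (11, 10),
    "RBHSTW-00350": (12, 13),
    "RHB10-C07": (22, 21),
    "RHB11-C04": (25, 25),
    "RHB14-C01": (13, 12),
}


def tally_reapr(errors):
    """Group isolates by which hybrid assembly has fewer REAPR errors."""
    groups = {"PacBio": [], "ONT": [], "tie": []}
    for isolate, (pacbio, ont) in errors.items():
        if pacbio < ont:
            groups["PacBio"].append(isolate)
        elif ont < pacbio:
            groups["ONT"].append(isolate)
        else:
            groups["tie"].append(isolate)
    return groups


if __name__ == "__main__":
    for platform, isolates in tally_reapr(reapr_errors).items():
        print(f"{platform}: {len(isolates)} isolates")
```

The same pattern applies to the unmapped-read columns, where fewer unmapped Illumina reads would be taken as the better score.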
Fig. 2. Examples of mismatches identified between the ONT-based and the PacBio-based assemblies for the two reference strains (E. coli CFT073 and K. pneumoniae MGH78578). Each sub-figure is an IGV (v2.4.3) view of part of the PacBio-based assembly, centred around a PacBio-ONT SNP, with all reads from the same isolate mapped to it. We performed this analysis for all SNPs in isolates MGH78578 and CFT073, and report examples for the two most typical patterns observed. (a) SNP from MGH78578 with very low Illumina coverage, but normal PacBio and ONT coverage. Most of the Illumina reads have a different base than the one in the PacBio-assembled reference (the red T's), suggesting perhaps an error in the PacBio assembly. A similar pattern is observed in 14 SNPs in CFT073 (with 12 due to error in the PacBio assembly), and 11 SNPs in MGH78578 (with 10 due to error in the PacBio assembly). (b) SNP from MGH78578 with normal Illumina coverage; Illumina reads support both bases with similar proportions, suggesting that this could be a polymorphic site within the original DNA sample. This pattern was observed for four SNPs in CFT073 and for 13 SNPs in MGH78578.