Kerstin Neubert1,2, Eric Zuchantke3, Robert Maximilian Leidenfrost4, Röbbe Wünschiers4, Josephine Grützke2, Burkhard Malorny2, Holger Brendebach2, Sascha Al Dahouk2, Timo Homeier5, Helmut Hotzel3, Knut Reinert1, Herbert Tomaso3, Anne Busch6,7.
Abstract
BACKGROUND: We benchmarked sequencing technologies and assembly strategies for short-read, long-read, and hybrid assemblers with respect to correctness, contiguity, and completeness of assemblies in genomes of Francisella tularensis. Benchmarking allowed in-depth analyses of genomic structures of the Francisella pathogenicity islands and insertion sequences. Five major high-throughput sequencing technologies were applied, including next-generation "short-read" and third-generation "long-read" sequencing methods.
Keywords: Francisella pathogenicity island; High-throughput sequencing; Hybrid assembly; Illumina HiSeq; Pacific Biosciences RS; Illumina MiSeq; Insertion sequences; Ion Torrent’s Ion S5; Oxford Nanopore Technologies MinION; Short-read assembly
Year: 2021 PMID: 34773979 PMCID: PMC8590783 DOI: 10.1186/s12864-021-08115-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Processing workflow for sequencing data including data QC, preprocessing, de novo assembly, assembly evaluation and annotation (utilized tools in brackets).
Sequencing data and error rates for isolate FSC237 to reference NC_006570.2, with a GC content of 32.26%
| Platform | GC reads (%) | Mapped bases (bp) | Mismatches | Insertions | Deletions | Mismatch error rate (%) | Insertions error rate (%) | Deletions error rate (%) | Total error rate (%) | Even score | Total error rate added (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MiSeq | 36.29 | 265,564,803 | 710,772 | 653 | 8371 | 0.268 | 0.0002 | 0.0032 | 0.27 | 0.6715 | 0.2710 |
| HiSeq | 32.12 | 7,402,011,869 | 22,972,180 | 43,344 | 56,564 | 0.310 | 0.0006 | 0.0008 | 0.31 | 0.9862 | 0.3117 |
| Ion Torrent | 32.66 | 350,087,308 | 783,525 | 911,991 | 522,265 | 0.224 | 0.2605 | 0.1492 | 0.51 | 0.9400 | 0.6335 |
| PacBio | 32.45 | 1,118,125,280 | 77,755,474 | 50,861,188 | 35,660,450 | 6.954 | 4.5488 | 3.1893 | 14.99 | 0.9747 | 14.6922 |
| MinION | 32.26 | 533,513,044 | 68,465,798 | 13,287,287 | 18,136,808 | 12.833 | 2.4905 | 3.3995 | 16.88 | 0.9669 | 18.7230 |
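The per-type error rates in the table appear to be each count divided by the number of mapped bases, and the "Total error rate added" column appears to be the sum of the three per-type rates. The following is a minimal sketch of that arithmetic, assuming this interpretation; the names `COUNTS`, `error_rates`, and `RATES` are illustrative and not taken from the original study.

```python
# Counts copied from the table above:
# (mapped bases, mismatches, insertions, deletions) per platform.
COUNTS = {
    "MiSeq": (265_564_803, 710_772, 653, 8_371),
    "HiSeq": (7_402_011_869, 22_972_180, 43_344, 56_564),
    "Ion Torrent": (350_087_308, 783_525, 911_991, 522_265),
    "PacBio": (1_118_125_280, 77_755_474, 50_861_188, 35_660_450),
    "MinION": (533_513_044, 68_465_798, 13_287_287, 18_136_808),
}


def error_rates(mapped, mismatches, insertions, deletions):
    """Return per-type and summed error rates as percent of mapped bases.

    Assumption: rate = 100 * count / mapped bases (not confirmed by the source).
    """
    rates = {
        "mismatch": 100.0 * mismatches / mapped,
        "insertion": 100.0 * insertions / mapped,
        "deletion": 100.0 * deletions / mapped,
    }
    # Summed rate, compared here against the "Total error rate added" column.
    rates["total_added"] = sum(rates.values())
    return rates


RATES = {platform: error_rates(*counts) for platform, counts in COUNTS.items()}

if __name__ == "__main__":
    for platform, r in RATES.items():
        print(
            f"{platform:12s} mismatch={r['mismatch']:.3f}% "
            f"ins={r['insertion']:.4f}% del={r['deletion']:.4f}% "
            f"total={r['total_added']:.4f}%"
        )
```

Under this assumption the recomputed values agree with the tabulated per-type rates and with the "Total error rate added" column to the printed precision.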
Fig. 2 GC-bias plots for dataset FSC237 sequences for short-read (A) and long-read platforms (B). Normalized coverage is plotted for GC percentages with at least 1000 windows in the genome. Unbiased coverage is represented by a dashed line at normalized coverage of 1. GC distribution of FSC237 to reference NC_006570.2 (C)
Fig. 3 NA50 values versus assembly errors for short-read assemblies. Optimal results are expected to have high NA50 with low error rates (located in the upper left corner)
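NA50 and NGA50 are N50-style statistics that QUAST computes on aligned blocks (contigs broken at misassemblies) rather than on raw contigs; the NG variants use the reference genome length instead of the assembly length as the denominator. The sketch below shows only the shared N50-style calculation on a list of block lengths. The function name `nx50` and the toy lengths are illustrative assumptions, not values from the study.

```python
def nx50(lengths, total=None):
    """Return the length L such that blocks of length >= L sum to at least
    half of `total`.

    With total=None the sum of `lengths` is used (N50 / NA50 style).
    Passing the reference genome length gives the NG50 / NGA50 style variant.
    Returns 0 if the blocks never reach half of `total`.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


if __name__ == "__main__":
    toy_blocks = [50, 40, 30, 20, 10]  # hypothetical block lengths
    print(nx50(toy_blocks))             # denominator: assembly length (150)
    print(nx50(toy_blocks, total=300))  # denominator: hypothetical reference length
```

Feeding aligned block lengths instead of contig lengths turns the same calculation into the "A" variants, which is why a misassembled but long contig lowers NA50 relative to N50.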
Hybrid assembly results for FSC237 isolate based on PacBio data
| Assembler | Short read library | Total length (bp) | GC (%) | Contigs (≥ 500 bp) | NGA50 (Mb) | Genome covered (%) | Genomic features (Complete + partial) | Complete Busco (%) | True/all circular contigs (size) | Errors (per 100 kb) |
|---|---|---|---|---|---|---|---|---|---|---|
| Canu/Pilon | HiSeq | 1,889,842 | 32.27 | 1 | 1.89 | 99.85 | 3794 + 3 part | 93.92 | 0/0 | 4.34 |
| Canu/Pilon | MiSeq | 1,889,606 | 32.27 | 1 | 1.89 | 99.85 | 3794 + 3 part | 92.57 | 0/0 | 17.14 |
| Canu/Pilon | Ion Torrent | 1,889,812 | 32.27 | 1 | 1.89 | 99.85 | 3794 + 3 part | 93.24 | 0/0 | 5.82 |
| Flye/Pilon | HiSeq | 1,892,761 | 32.26 | 2 | 1.50 | 100.00 | 3795 + 2 part | 93.92 | 1/1 (1892709) | 0.74 |
| Flye/Pilon | MiSeq | 1,958,505 | 32.22 | 2 | 1.53 | 100.00 | 3796 + 1 part | 92.57 | 1/1 (1892639) | 23.78 |
| Flye/Pilon | Ion Torrent | 1,958,909 | 32.22 | 2 | 1.53 | 100.00 | 3796 + 1 part | 93.92 | 0/1 (393321) | 28.16 |
| SPAdes | HiSeq | 1,858,769 | 32.28 | 2 | 1.12 | 98.20 | 3736 + 3 part | 93.92 | 0/1 (1499404) | 0.16 |
| SPAdes | MiSeq | 1,830,140 | 32.33 | 29 | 0.09 | 96.69 | 3600 + 15 part | 93.92 | 0/0 | 14.09 |
| SPAdes | Ion Torrent | 1,858,056 | 32.28 | 2 | 1.50 | 98.17 | 3738 + 1 part | 92.57 | 1/1 (1892668) | 4.36 |
| Unicycler | HiSeq | 1,856,294 | 32.29 | 6 | 1.46 | 97.79 | 3703 + 7 part | 93.92 | 1/1 (1892695) | 0.05 |
| Unicycler | MiSeq | 1,865,687 | 32.28 | 7 | 1.15 | 98.08 | 3730 + 4 part | 93.92 | 0/1 (393314) | 10.28 |
| Unicycler | Ion Torrent | 1,855,936 | 32.29 | 6 | 1.46 | 98.06 | 3718 + 5 part | 93.92 | 1/1 (1892586) | 6.04 |
Hybrid assembly results for FSC237 isolate based on MinION data
| Assembler | Short read library | Total length (bp) | GC (%) | Contigs (≥ 500 bp) | NGA50 (Mb) | Genome covered (%) | Genomic features (Complete + partial) | Complete Busco (%) | True/all circular contigs (size) | Errors (per 100 kb) |
|---|---|---|---|---|---|---|---|---|---|---|
| Canu/Pilon | HiSeq | 1,949,612 | 32.26 | 1 | 1.95 | 99.97 | 3794 + 3 part | 93.24 | 1/1 (1891217) | 53.90 |
| Canu/Pilon | MiSeq | 1,942,673 | 32.34 | 1 | 1.94 | 99.97 | 3794 + 3 part | 62.16 | 0/0 | 351.18 |
| Canu/Pilon | Ion Torrent | 1,948,850 | 32.27 | 1 | 1.95 | 99.97 | 3794 + 3 part | 88.51 | 1/1 (1890592) | 78.43 |
| Flye/Pilon | HiSeq | 1,921,160 | 31.96 | 3 | 1.89 | 99.97 | 3793 + 3 part | 77.70 | 1/1 (1893441) | 98.56 |
| Flye/Pilon | MiSeq | 1,944,710 | 31.58 | 3 | 1.92 | 99.97 | 3792 + 3 part | 64.86 | 1/1 (1913901) | 1059.83 |
| Flye/Pilon | Ion Torrent | 1,921,860 | 31.95 | 3 | 1.89 | 99.97 | 3792 + 3 part | 67.57 | 0/0 | 162.61 |
| SPAdes | HiSeq | 1,892,530 | 32.28 | 1 | 1.86 | 98.21 | 3740 + 2 part | 93.92 | 1/1 (1891993) | 0.16 |
| SPAdes | MiSeq | 1,827,899 | 32.33 | 32 | 0.09 | 96.58 | 3592 + 15 part | 93.92 | 0/0 | 20.84 |
| SPAdes | Ion Torrent | 1,858,435 | 32.28 | 2 | 1.40 | 98.19 | 3736 + 2 part | 91.89 | 0/1 (1498148) | 4.57 |
| Unicycler | HiSeq | 1,892,775 | 32.26 | 1 | 1.89 | 100.00 | 3794 + 3 part | 93.92 | 0/0 | 0.05 |
| Unicycler | MiSeq | 1,921,618 | 32.25 | 12 | 1.89 | 99.88 | 3789 + 3 part | 93.92 | 1/1 (1891153) | 57.40 |
| Unicycler | Ion Torrent | 1,892,630 | 32.26 | 1 | 1.89 | 100.00 | 3794 + 3 part | 93.92 | 0/0 | 6.77 |
Fig. 5 Genome assemblies of FSC237 isolate based on different sequencing platforms and assemblers aligned to the SCHU S4 reference genome: a) Canu, b) Flye, c) SPAdes and d) Unicycler. Assembled contigs from inside to outside: ONT+Ion Torrent, PacBio+Ion Torrent, ONT+MiSeq, PacBio+MiSeq, ONT+HiSeq, PacBio+HiSeq; misassembled bases identified by QUAST (dark red); mismatches in total error (grey bars); outer circle: F. tularensis subsp. tularensis str. SCHU S4 reference; inner circle: F. tularensis subsp. tularensis str. SCHU S4 RefSeq genes (yellow), pathogenicity islands (cyan bars), ISFtu1-6 (red/purple/blue bars) and repeats (black), presented with Circos. The origin of replication is at twelve o’clock
Genomic variants in holarctica isolates with respect to their reference genomes called with three different short-read sequencing datasets
| Isolate | | FSC237 | 08 T0013 | 12 T0050 | 12 T0052 | 12 T0058 |
|---|---|---|---|---|---|---|
| NCBI ID | | NC_006570.2 | NC_017463.1 | NC_009749.1 | NC_009749.1 | NC_019551.1 |
| HiSeq | Average coverage | 3910x | 4793x | 1432x | 3232x | 7711x |
| HiSeq | SNPs | 0 | 207 | 35 | 35 | 96 |
| HiSeq | Indels | 0 | 2 | 0 | 0 | 0 |
| MiSeq | Average coverage | 140x | 102x | 79x | 78x | 89x |
| MiSeq | SNPs | 81 | 312 | 153 | 173 | 246 |
| MiSeq | Indels | 0 | 67 | 9 | 12 | 11 |
| Ion Torrent | Average coverage | 185x | 204x | 202x | 163x | 92x |
| Ion Torrent | SNPs | 0 | 189 | 49 | 47 | 94 |
| Ion Torrent | Indels | 1 | 619 | 611 | 617 | 924 |
Maximum RAM consumption and running time for assembly of FSC237 isolate with subsampled data (PacBio: 94 Mb, MinION: 92 Mbp, HiSeq: 151 Mb)
| Assembler | Max RAM (Gb), PacBio | Max RAM (Gb), MinION | Running time (min), PacBio | Running time (min), MinION |
|---|---|---|---|---|
| Canu 1.8 | 3.56 | 6.06 | 27.93 | 97.03 |
| Flye 2.4.2 | 2.64 | 10.33 | 6.91 | 96.91 |
| Flye 2.5 | 2.60 | 7.42 | 7.26 | 67.43 |
| SPAdes 3.13.0 | 2.74 | 2.70 | 6.09 | 5.84 |
| Unicycler 0.4.7 | 8.00 | 6.01 | 52.32 | 44.32 |
Fig. 6 Alignment of genomes including Francisella pathogenicity island (FPI) and insertion sequences (IS or ISFtu) as coloured bars with neighbor joining tree based on kSNP on the left side
Clades, genes, insertion sequences and FPI in isolates and reference strains. The number of detected insertion sequences in assembled genomes corresponds to those in respective references (Table 4)
| | 08 T0013 | 12 T0050 | 12 T0052 | 12 T0058 | FSC237 | NC_017463.1 | NC_009749.1 | NC_019551.1 | NC_006570.2 | NC_007880.1 | NZ_CP009633.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Strain | |||||||||||
| Source | Hare, Ehingen (Bavaria, Germany) 2008 | Hare, Herringhausen (North Rhine-Westphalia, Germany) | Tick, Germany | Hare, Heideck (Bavaria, Germany) | Human, Ohio 1941 | Dead beaver found near Red Rock, Okla. 1978 | Human, France 2002 | Human, Ljusdal, Sweden 1998 | Human, Ohio 1941 | Live vaccine strain, Russia 1930 | Turbid saltwater, Utah 1950 |
| Clade | B.4 | B.11 | B.11 | B.33 | A.I.13 | B.4 | B.11 | B.12 | A.I.13 | B.24 | T/N.1 |
| Size (bp) | 1,893,712 | 1,890,831 | 1,890,859 | 1,913,616 | 1,892,771 | 1,895,727 | 1,890,909 | 1,894,157 | 1,892,775 | 1,895,994 | 1,910,592 |
| Genes | 2153 | 2151 | 2149 | 2164 | 2080 | 2162 | 2145 | 2143 | 2078 | 2147 | 1845 |
| CDS | 2104 | 2102 | 2100 | 2110 | 2031 | 2113 | 2096 | 2094 | 2029 | 2098 | 1796 |
| rRNA | 10 | 10 | 10 | 13 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| tRNA | 38 | 38 | 38 | 40 | 38 | 38 | 38 | 38 | 38 | 38 | 38 |
| tmRNA | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| FPI | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 |
| ISFtu1 | 59 | 59 | 59 | 59 | 53 | 61 | 59 | 59 | 53 | 59 | 1 |
| ISFtu2 | 42 | 42 | 42 | 42 | 16 | 42 | 42 | 42 | 16 | 43 | 17 |
| ISFtu3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 5 |
| ISFtu4 | 2 | 2 | 2 | 2 | 1 | 2 | 2 | 2 | 1 | 2 | 1 |
| ISFtu5 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
| ISFtu6 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 |
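The caption's comparison between isolates and their references can be checked mechanically. The sketch below is a minimal illustration that uses the ISFtu1 to ISFtu6 counts as transcribed in the table above and the isolate-to-reference pairing given in the variant-calling table; the names `IS_COUNTS`, `REFERENCE_OF`, and `is_differences` are illustrative assumptions.

```python
ELEMENTS = ("ISFtu1", "ISFtu2", "ISFtu3", "ISFtu4", "ISFtu5", "ISFtu6")

# Copy numbers per genome, transcribed from the table above (ISFtu1..ISFtu6).
IS_COUNTS = {
    "08 T0013": (59, 42, 3, 2, 1, 1),
    "12 T0050": (59, 42, 3, 2, 1, 1),
    "12 T0052": (59, 42, 3, 2, 1, 1),
    "12 T0058": (59, 42, 3, 2, 1, 1),
    "FSC237": (53, 16, 3, 1, 1, 1),
    "NC_017463.1": (61, 42, 3, 2, 1, 1),
    "NC_009749.1": (59, 42, 3, 2, 1, 1),
    "NC_019551.1": (59, 42, 3, 2, 1, 1),
    "NC_006570.2": (53, 16, 3, 1, 1, 1),
}

# Isolate-to-reference pairing as listed in the variant-calling table.
REFERENCE_OF = {
    "FSC237": "NC_006570.2",
    "08 T0013": "NC_017463.1",
    "12 T0050": "NC_009749.1",
    "12 T0052": "NC_009749.1",
    "12 T0058": "NC_019551.1",
}


def is_differences(isolate):
    """Return {element: (isolate count, reference count)} where counts differ."""
    reference = REFERENCE_OF[isolate]
    return {
        name: (iso, ref)
        for name, iso, ref in zip(ELEMENTS, IS_COUNTS[isolate], IS_COUNTS[reference])
        if iso != ref
    }


if __name__ == "__main__":
    for isolate in REFERENCE_OF:
        print(isolate, is_differences(isolate) or "matches reference")
```

With the values as transcribed here, all isolates match their references except for the ISFtu1 count of 08 T0013 (59 versus 61 in NC_017463.1).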
Strain selection and reference genomes
| Species | Isolate | Clade | Reference strain | Reference assembly (RefSeq ID) | Reference genome size |
|---|---|---|---|---|---|
| | FSC237 | AI | SCHU S4 NC_006570.2 | GCF_000008985.1 | 1,892,775 |
| | 08 T0013 | B.4 | OSU 18 NC_017463.1 | GCF_000011405.1 | 1,895,727 |
| | 12 T0050 | B.6 | FTNF002–00 NC_009749.1 | GCF_000017785.1 | 1,890,909 |
| | 12 T0052 | B.6 | FTNF002–00 NC_009749.1 | GCF_000017785.1 | 1,890,909 |
| | 12 T0058 | B.12 | FSC 200 NC_019551.1 | GCF_000168775.2 | 1,894,157 |
Assembler software for hybrid assembly
| Assembler | Version | Method | Read error correction | Assembly polishing |
|---|---|---|---|---|
| Canu + Pilon | 1.8/ 1.23 | Long-read first/ Best overlap graph (BOG) | consensus of long-reads from overlapping reads | Pilon |
| Flye + Pilon | 2.4.2/ 1.23 | Long-read first/Repeat graph | None | Pilon |
| SPAdes | 3.13.0 | Short-read first/ de Bruijn graph | BayesHammer (Illumina); hammer (Ion Torrent) | MismatchCorrector (default: disabled) |
| Unicycler | 0.4.7 | Short-read first/ de Bruijn graph (SPAdes) and string graph of short-read contigs and long-reads (miniasm) | BayesHammer (Illumina) | Racon + Pilon |
Assembler software for short-read assembly
| Assembler | Assembly method | Version | Release date | Parameter |
|---|---|---|---|---|
| ABySS | Single k-mer De Bruijn graph | 2.2.3 | 27/09/2019 | -k 96 / -k 128 |
| A5-miseq | Automated pipeline including read cleaning, k-mer based error correction, assembly with IDBA and misassembly correction | 20160825 | 25/08/2016 | Default |
| IDBA | Accumulated De Bruijn graph with iteratively increased k-mer size | 1.1.3 | 11/07/2016 | --mink 20 --maxk 124 |
| MaSuRCA | De Bruijn graph and Overlap-Layout-Consensus (OLC) | 3.3.4 | 13/09/2019 | GRAPH_KMER_SIZE = auto cgwErrorRate = 0.25 CLOSE_GAPS = 1 |
| MIRA | ‘High-quality alignments first’ contig building strategy with iterative removal of technology-specific errors | V5rc2 | 26/02/2019 | Default |
| SGA | String graph based on read pair overlaps (using FM index) | 0.10.15 | 05/08/2016 | -m 111 --min-branch-length 400 |
| SPAdes | Multi-kmer De Bruijn graph | 3.13.0 | 16/10/2018 | --cov-cutoff auto --careful |
| Tadpole | Single k-mer-based assembly with read extension optimized for correctness | BBMap 35.85 | 16/08/2016 | Default |
| VelvetOptimiser | Single k-mer De Bruijn graph with optimised N50 | 2.2.6; Velvet: 1.2.10 | 03/08/2017; 05/07/2018 | -s 97 -e 127 -x 10 |