Hooman Derakhshani, Steve P Bernier, Victoria A Marko, Michael G Surette.
Abstract
BACKGROUND: Illumina technology currently dominates bacterial genomics due to its high read accuracy and low sequencing cost. However, the incompleteness of draft genomes generated from Illumina reads limits their application in comprehensive genomic analyses. Alternatively, hybrid assembly using both Illumina short reads and long reads generated by single-molecule sequencing technologies can enable assembly of complete bacterial genomes, yet the high per-genome cost of long-read sequencing limits the widespread use of this approach in bacterial genomics. Here, we developed a protocol for hybrid assembly of complete bacterial genomes using miniaturized multiplexed Illumina sequencing and non-barcoded PacBio sequencing of a synthetic genomic pool (SGP), thus significantly decreasing the overall per-genome cost of sequencing.
Keywords: Bacterial genomics; De novo assembly; Hybrid assembly; Synthetic genomic pool
Year: 2020 PMID: 32727443 PMCID: PMC7392658 DOI: 10.1186/s12864-020-06910-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Schematic workflow for completion of draft bacterial genomes using long-read sequencing of synthetic genomic pools. Individual bacterial genomes are sequenced using a miniaturized, cost-effective, multiplexed sequencing protocol on the Illumina platform. Short Illumina reads are used for de novo assembly of draft bacterial genomes (Illumina contigs). The gDNA of the bacterial isolates is then combined into a synthetic genomic pool library (~80 Mbp total genome size) and subjected to standard PacBio sequencing without multiplexing. The resulting long reads are mapped to the Illumina assemblies to sort high-quality long reads by genome. This is followed by de novo long-read assembly to generate ultra-long PacBio contigs for each genome and, finally, completion of the draft bacterial genomes by hybrid assembly of Illumina short reads, PacBio long reads, and ultra-long PacBio contigs. The final assembly is polished with high-quality Illumina reads to correct potential assembly errors. Names of the bioinformatics software used at each step of the assembly pipeline are indicated in parentheses.
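The read-sorting step in Fig. 1 (mapping non-barcoded long reads to the Illumina assemblies and assigning each read to one genome) can be illustrated with a minimal sketch. It assumes alignments are available in PAF format (the tab-separated format written by mappers such as minimap2) and that a contig-to-genome lookup table exists. The function name, the `min_mapq` cutoff, and the "most matching bases wins" rule are illustrative assumptions, not the criteria used by the authors.

```python
from collections import defaultdict


def bin_reads_by_genome(paf_lines, contig_to_genome, min_mapq=20):
    """Assign each long read to the genome of its best alignment.

    paf_lines: iterable of PAF records (12 mandatory tab-separated columns).
    contig_to_genome: dict mapping Illumina contig name -> genome label.
    min_mapq: hypothetical mapping-quality cutoff (an assumption).
    """
    best = {}  # read name -> (matching bases, genome label)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed records
        read, contig = fields[0], fields[5]
        matches, mapq = int(fields[9]), int(fields[11])
        genome = contig_to_genome.get(contig)
        if genome is None or mapq < min_mapq:
            continue  # unknown contig or ambiguous placement
        # Keep only the alignment with the most matching bases per read.
        if read not in best or matches > best[read][0]:
            best[read] = (matches, genome)

    bins = defaultdict(list)
    for read, (_, genome) in best.items():
        bins[genome].append(read)
    return {genome: sorted(reads) for genome, reads in bins.items()}
```

Each resulting bin of read names could then be used to extract the reads for per-genome long-read assembly. Reads that map poorly or ambiguously are dropped rather than guessed at, which is one conservative design choice among several possible ones.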
Fig. 2 Overall performance and quality assessment of the SGP hybrid assembly pipeline. (a) Assembly statistics of 20 genomes subjected to the SGP hybrid assembly pipeline. The bar plot depicts genome size, GC content, N50 value, and assembly completeness of individual genomes (orange circles indicate complete assemblies and blue triangles indicate fragmented assemblies). (b) Evaluation of SGP hybrid assembly performance by parallel multiplexed Illumina and PacBio sequencing of 9 genomes. The bar plot depicts assembly statistics for Illumina short-read assembly, hybrid assembly of barcoded Illumina and PacBio reads (Barcoded_Hybrid), SGP hybrid assembly, and long-read assembly of barcoded PacBio reads polished using long reads (Flye) or short Illumina reads (Flye_Pilon). (c) Reference-free assembly validation: assembly quality was assessed by mapping high-quality barcoded Illumina reads of each genome to its corresponding assemblies using Breseq. The top panel shows the frequency of single-base substitutions, the middle panel shows the frequency of insertion sequences (unresolved repeat elements identified as small contigs with higher read coverage than other regions of the chromosome), and the bottom panel shows the frequency of small insertions and deletions (indels) in the various assemblies of each genome. (d) Dot plot showing the ability of different assembly approaches to resolve multiple copies of the 16S ribosomal RNA gene within each genome. The position of each dot on the x-axis depicts the total number of 16S rRNA gene copies detected within each assembly. Numbers within dots indicate intragenomic heterogeneity (number of sequence variants) among the 16S rRNA genes of each assembly.
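For readers unfamiliar with the N50 value reported in Fig. 2, the following minimal sketch computes it under the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. The function name is illustrative and is not taken from the pipeline.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the cumulative sum to at least
        # half of the assembly size defines the N50.
        if running >= half_total:
            return length
    return 0  # empty assembly
```

For a complete single-contig genome the N50 equals the genome size, which is why complete assemblies reach the highest possible N50 for a given genome.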
Fig. 3 Evaluating the effect of assembly quality on gene annotation. (a) Hybrid assembly improves gene prediction and annotation accuracy. The left bar plot depicts the total number of predicted coding sequences (CDSs) in the various assemblies of each genome. The bar plot on the right shows the ratio of incomplete to complete CDSs in each assembly. CDSs were predicted by Prodigal and aligned to the UniProtKB/TrEMBL protein database using DIAMOND Blastp. The ratio of query sequence length to subject sequence length was then used as a proxy for the completeness of the predicted CDSs (threshold of ≥0.95). (b) Pangenome analysis of different assemblies of related strains, including Escherichia fergusonii (GC162 vs. GC505), Christensenella massiliensis (GC249 vs. GC441) and Blautia obeum (GC481 vs. GC508). Phylogenetic trees were generated by alignment of core and accessory genes identified by PIRATE. The colour ramp indicates the Markov clustering (MCL) threshold at which each gene family was classified (the higher this threshold, the less divergent that gene family is across assemblies).
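The CDS completeness proxy described in Fig. 3a can be sketched as follows. The sketch assumes that each predicted CDS has one best hit and that query and subject lengths are available as pairs in the same units. It also assumes that a CDS counts as complete when the query-to-subject length ratio is at least 0.95, which is one reading of the stated threshold. Function names are illustrative.

```python
def count_cds_completeness(length_pairs, threshold=0.95):
    """Count complete and incomplete CDSs from (query_len, subject_len) pairs.

    A CDS is treated as complete when query_len / subject_len >= threshold
    (assumed interpretation of the 0.95 cutoff).
    """
    complete = incomplete = 0
    for query_len, subject_len in length_pairs:
        if subject_len <= 0:
            continue  # skip records without a usable subject length
        if query_len / subject_len >= threshold:
            complete += 1
        else:
            incomplete += 1
    return complete, incomplete


def incomplete_to_complete_ratio(length_pairs, threshold=0.95):
    """Return incomplete/complete, or None when no CDS is complete."""
    complete, incomplete = count_cds_completeness(length_pairs, threshold)
    if complete == 0:
        return None
    return incomplete / complete
```

A fragmented assembly tends to truncate genes at contig ends, which lowers the query-to-subject ratio for those CDSs and raises the incomplete-to-complete ratio.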