José Arturo Molina-Mora, Rebeca Campos-Sánchez, César Rodríguez, Leming Shi, Fernando García.
Abstract
Genotyping methods and genome sequencing are indispensable for revealing the genomic structure of bacterial species that display a high level of genome plasticity. However, genome reconstruction (assembly) is not straightforward because of data complexity, including the repeats and the mobile and accessory genetic elements of bacterial genomes. Moreover, since the solution to this problem is strongly influenced by the sequencing technology, the bioinformatics pipelines, and the selection criteria used to assess assemblers, there is no systematic way to select a priori the optimal assembler and parameter settings. To assemble the genome of Pseudomonas aeruginosa strain AG1 (PaeAG1), short-read (Illumina) and long-read (Oxford Nanopore) sequencing data were used in 13 different non-hybrid and hybrid approaches. PaeAG1 is a multiresistant high-risk sequence type 111 (ST-111) clone that was isolated from a Costa Rican hospital; it was the first reported isolate of P. aeruginosa carrying both the blaVIM-2 and blaIMP-18 genes, which encode metallo-β-lactamase (MBL) enzymes. To assess the assemblies, multiple metrics of contiguity, correctness and completeness (the 3C criterion, as we define it here) were used to benchmark the 13 approaches and select a definitive assembly. In addition, annotation was performed to identify genes (coding and RNA regions) and to describe the genomic content of PaeAG1. Whereas long-read and hybrid approaches performed better in terms of contiguity, higher correctness and completeness metrics were obtained with short-read-only and hybrid approaches. A manually curated and polished hybrid assembly gave rise to a single circular sequence with 100% of core genes and known regions identified, >98% of reads mapped back, no gaps, and uniform coverage. The strategy followed to obtain this high-quality 3C assembly is detailed in the manuscript, and we provide readers with an all-in-one script to replicate our results or to apply the strategy to other troublesome cases.
The final 3C assembly revealed that the PaeAG1 genome has 7,190,208 bp, a GC content of 65.7% and 6,709 genes (6,620 coding sequences), many of which are located in mobile genomic elements, such as 57 genomic islands, six prophages, and two complete integrons carrying the blaVIM-2 and blaIMP-18 MBL genes. Up to 250 and 60 of the predicted genes are anticipated to play a role in virulence (adherence, quorum sensing and secretion) and antibiotic resistance (β-lactamases, efflux pumps, etc.), respectively. Altogether, the assembly and annotation of the PaeAG1 genome provide new perspectives to continue studying the genomic diversity and gene content of this important human pathogen.
Year: 2020 PMID: 31996747 PMCID: PMC6989561 DOI: 10.1038/s41598-020-58319-6
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. General bioinformatic pipeline to assemble, compare and annotate the Pseudomonas aeruginosa AG1 genome using short and long reads as well as hybrid approaches.
Comparison of contiguity and annotation of P. aeruginosa AG1 genome assembly by different approaches*.
| 3C Criterion | Level | Metric | Velvet (short reads) | SPAdes (short reads) | IDBA (short reads) | Megahit (short reads) | SKESA (short reads) | Unicycler (short reads) | Canu (long reads) | Flye (long reads) | Unicycler (long reads) | IDBA (hybrid) | SPAdes (hybrid) | Unicycler (hybrid) | Final assembly (hybrid) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Contiguity | Contigs assembly | Contigs | 89 | 127 | 125 | 217 | 113 | 2 | 5 | 121 | 16 | ||||
| Total length | 7094145 | 7090598 | 7103650 | 7047434 | 7074438 | 7121028 | 7092836 | 7188777 | 7189601 | ||||||
| GC (%) | 65.79 | 65.73 | 65.74 | 65.73 | 65.77 | 65.77 | 65.66 | 65.59 | 65.64 | 65.74 | 65.68 | 65.71 | 65.71 | ||
| N50 | 223421 | 170948 | 168521 | 68375 | 151417 | 4329427 | 7178173 | 141288 | 1593634 | ||||||
| L50 | 33 | 11 | 14 | 14 | 34 | 15 | 1 | 1 | 1 | 15 | 2 | 1 | 1 | ||
| Scaffolding | Scaffolds | 1 | 10 | 10 | 10 | 2 | 1 | 1 | 1 | 1 | 10 | 10 | 1 | 1 | |
| N50 & NG50 | 7039385 | 7078855 | 7079244 | 7091835 | 7056837 | 7080238 | 7121028 | 7209472 | 7465826 | 7082290 | 7171429 | 7189601 | 7190208 | ||
| Genome fraction (%) | 98.362 | 98.293 | 98.484 | 98.054 | 98.382 | 99.381 | 99.991 | 98.356 | 99.717 | ||||||
| NA50 | 177145 | 375326 | 491929 | 478607 | 708585 | 709611 | 4328063 | 7207242 | 7177177 | 477586 | 3956502 | 7189601 | 7190208 | ||
| LA50 | 12 | 6 | 5 | 5 | 4 | 4 | 1 | 1 | 1 | 4 | 1 | 1 | 1 | ||
| N's per 100 kbp | 52.13 | 77.51 | 75.96 | 151.6 | 81.92 | 1.34 | 74.67 | 5.56 | |||||||
| Correctness | Misassemblies | 22 | 37 | 33 | 24 | 19 | 1 | 4 | 26 | 2 | |||||
| Unaligned mis. contigs | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ||
| Mismatches per 100 kbp | 6.56 | 2.42 | 4.88 | 1.61 | 1.84 | 3.68 | 11.33 | ||||||||
| Indels per 100 kbp | 6.49 | 0.41 | 0.67 | 0.28 | 1.79 | 1 | 1.14 | ||||||||
| Completeness | 40 core genes (BUSCO) | Fragmented genes | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 9 | 9 | 0 | 0 | 0 | 0 |
| Intact genes | 40 | 40 | 40 | 40 | 40 | 40 | 20 | 13 | 23 | 40 | 40 | 40 | 40 | ||
| Lost genes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |||||
| Completeness score (strict, %) | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |||||
| Whole genome annotation | CDS | 6574 | 6554 | 6543 | 6565 | 6540 | 6567 | 11229 | 9565 | 9089 | 6559 | 6605 | 6621 | 6620 | |
| Contigs | 1 | 10 | 10 | 10 | 2 | 1 | 1 | 1 | 1 | 10 | 10 | 1 | 1 | ||
| rRNA | 2 | 5 | 5 | 5 | 3 | 3 | 12 | 12 | 12 | 4 | 14 | 12 | 12 | ||
| tmRNA | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | ||
| tRNA | 70 | 62 | 69 | 70 | 61 | 70 | 72 | 65 | 75 | 69 | 76 | 76 | 76 | ||
| Completeness & correctness | Mean length of CDS (bp) | 938.34 | 957.54 | 956.28 | 954.9 | 950.19 | 953.49 | 955.11 | 963.51 | 961.89 | 961.86 | ||||
| Integron blaVIM-2 | Identity (%) | 100.0 | 99.5 | 99.8 | 100.0 | 100.0 | 99.7 | 99.488 | 99.257 | 99.843 | 99.753 | 99.778 | 99.778 | 100 | |
| Coverage | 0.5 | 0.7 | 0.6 | 0.5 | 0.6 | 1.0 | 0.6 | ||||||||
| Integron blaIMP-18 | Identity (%) | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 99.515 | 98.744 | 99.728 | 100 | 100 | 100 | 100 | |
| Coverage | 0.9 | 0.9 | 0.9 | 0.8 | 0.8 | 1.0 | 0.8 | ||||||||
*For some metrics, best and worst values are marked as bold or italics, respectively.
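The contiguity rows of the table rely on N50, L50 and NG50. As a minimal sketch of how these metrics are conventionally computed (the function name `nx50` and the toy contig lengths are illustrative assumptions, not part of the authors' pipeline): N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the assembly size, and L50 is the number of contigs needed to get there. NG50 uses a reference genome size in place of the assembly size.

```python
def nx50(lengths, genome_size=None):
    """Return (N50, L50) for a list of contig lengths.

    If genome_size is given, half of that size is used as the target
    instead of half of the assembly size, which yields (NG50, LG50).
    Returns None when the contigs never reach the target.
    """
    ordered = sorted(lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(ordered)
    target = total / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    return None


# Toy contig lengths (hypothetical, for illustration only).
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = nx50(toy_contigs)                    # half of 150 is reached at the 2nd contig
toy_ng50 = nx50(toy_contigs, genome_size=200)  # half of 200 is reached at the 3rd contig
```

For an assembly made of a single contig, such as the final assembly with its 7,190,208 bp sequence, this calculation returns the full sequence length as N50 and an L50 of 1, which agrees with the table.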
Figure 2. General comparison of P. aeruginosa AG1 genome assemblies. (a) Relationship between different assemblers by PCA using contiguity and annotation features. (b) Completeness evaluation and comparison for all approaches using the final assembly as reference. (c) De novo assembly graphs of three different approaches using short-read, long-read or hybrid assemblers. More details in Supplementary Fig. S1.
Figure 3. Annotation of the P. aeruginosa AG1 genome. (a) Circularized genome showing phage and integron locations. (b) Specific annotation of different genomic determinants, including the number of elements. (c) Genome synteny comparison among three strains of P. aeruginosa: PAO1 (general reference), AG1 (our assembly) and RIVM-EMC2982 (the closest one to PaeAG1 according to BLAST analysis).
Figure 4. Pan-genome analysis of ST-111 P. aeruginosa strains. (a) Clustering of strains according to their gene-content profiles. A total of 10,637 genes were identified. (b) Distribution of the gene content across all strains; the core genome comprises 4,783 genes (45% of total genes). (c) Distribution of the number of genes by the number of genomes in which they occur.
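The core-genome share quoted for Figure 4 can be checked with a short calculation. The sketch below also shows one simple way to split a pan-genome into core and accessory genes from a presence/absence table. It assumes that "core" means present in every strain, which may differ from the threshold used by the authors' pan-genome tool; the gene and strain names are hypothetical.

```python
def classify_pangenome(presence):
    """Split genes into (core, accessory) sets.

    presence maps each gene name to the set of strains that carry it.
    Assumption: a gene is "core" only if every strain carries it.
    """
    strains = set().union(*presence.values()) if presence else set()
    core = {gene for gene, carriers in presence.items() if carriers == strains}
    accessory = set(presence) - core
    return core, accessory


def core_fraction(n_core, n_total):
    """Percentage of pan-genome genes that belong to the core genome."""
    return 100.0 * n_core / n_total


# Hypothetical presence/absence data for three strains.
toy_presence = {
    "geneA": {"s1", "s2", "s3"},
    "geneB": {"s1"},
    "geneC": {"s1", "s2", "s3"},
    "geneD": {"s2", "s3"},
}
toy_core, toy_accessory = classify_pangenome(toy_presence)

# Figures reported in the caption: 4,783 core genes out of 10,637 in total.
reported_share = core_fraction(4783, 10637)
```

With the reported counts, `reported_share` rounds to 45, matching the percentage given in the caption.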