Sagar M Utturkar, Dawn M Klingeman, Richard A Hurt, Steven D Brown.
Abstract
This study characterized regions of DNA that remained unassembled by either PacBio or Illumina sequencing technologies for seven bacterial genomes. Two genomes were manually finished using bioinformatics and PCR/Sanger sequencing approaches, and the regions not assembled by automated software were analyzed. Gaps within Illumina assemblies mostly corresponded to repetitive DNA regions such as multiple rRNA operon sequences. PacBio gap sequences were evaluated for several properties: GC content, read coverage, gap length, ability to form strong secondary structures, and corresponding annotations. Our hypothesis that strong secondary DNA structures blocked DNA polymerases and contributed to gap sequences was not supported. PacBio assemblies had few limitations overall, and gaps were explained as the cumulative effect of lower-than-average sequence coverage and repetitive sequences at contig termini. An important aspect of the present study is the compilation of biological features that interfered with assembly, including active transposons, multiple plasmid sequences, phage DNA integration, and large sequence duplications. Our targeted genome finishing approach and systematic evaluation of the unassembled DNA will be useful for others looking to close, finish, and polish microbial genome sequences.
Keywords: Illumina; PacBio; Pilon; circlator; genome assembly; next-generation sequencing (NGS); repetitive DNA
Year: 2017 PMID: 28769883 PMCID: PMC5513972 DOI: 10.3389/fmicb.2017.01272
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Properties of gap sequences present within PacBio assembly.
| Gap name | Start | End | Length (bp) | Coverage | GC (%) | Annotation |
|---|---|---|---|---|---|---|
| AD2_Overlap1 | 3,502 | 5,535 | 2033 | 36x | 39.4 | Membrane protein insertase |
| AD2_Overlap2 | 180,557 | 182,612 | 2055 | 116x | 35.1 | Transposase DDE domain |
| AD2_Gap1 | 558,824 | 559,892 | 1068 | 82x | 39 | Transposase mutator type |
| BC_Overlap1 | 6,343,204 | 6,349,991 | 6788 | 36x | 32.5 | Transposase Tn3 family protein |
| BC_Gap1 | 6,389,652 | 6,390,057 | 405 | 4x | 35.5 | RNA-binding protein |
Average sequence coverage and GC contents of the final genome assembly are provided in the Table .
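Two of the gap properties listed here, length and GC content, are simple to recompute from coordinates and sequence. The following is a minimal Python sketch, not the authors' pipeline. The helper names `gc_content` and `span_length` are hypothetical, and taking length as end minus start is an assumption: it reproduces four of the five tabulated lengths, while BC_Overlap1 is listed one base longer (6788 versus 6787), so the table's exact coordinate convention is not known from this record.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage of a DNA sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    return 100.0 * gc / called if called else 0.0


def span_length(start: int, end: int) -> int:
    """Length of a region taken as end minus start (assumed convention)."""
    return end - start


# (start, end) coordinates copied from the gap table.
GAP_COORDS = {
    "AD2_Overlap1": (3502, 5535),
    "AD2_Overlap2": (180557, 182612),
    "AD2_Gap1": (558824, 559892),
    "BC_Overlap1": (6343204, 6349991),
    "BC_Gap1": (6389652, 6390057),
}

gap_lengths = {name: span_length(s, e) for name, (s, e) in GAP_COORDS.items()}

for name, length in gap_lengths.items():
    print(f"{name}\t{length} bp")
```

Given the actual gap sequence as a string, `gc_content` returns a percentage that can be compared with the GC values in the table. The sequences themselves are not part of this record, so no such comparison is shown.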
Figure 1. AD2 genome assembly comparisons. The outermost orange circle corresponds to the finished genome assembly. The next two circles show genes on the positive and negative strands, color-coded by COG category. The next yellow circle corresponds to the Illumina assembly, with gaps in the Illumina assembly denoted by red strokes. The next circle denotes the strong positional preference, marked in pink. The next two concentric circles show the sequence coverage for the Illumina and PacBio technologies, respectively, as heatmaps (lowest: light blue; highest: dark blue). In the innermost circle, AD2_SC1 (yellow) was generated by super assembly of draft contigs (green), and AD2_HC1 (sky blue) shares a 780 kb overlap with AD2_SC1. The blue-highlighted region denotes sequence overlaps validated using the PCR/Sanger approach. A detailed illustration is provided in Figure S1.
Figure 2. Summary of biological features with the potential to interfere with the assembly process. (A) Presence of active transposon elements in strain JBW45; (B) repetitive transposon sequences at the contig terminus region of B. cellulosolvens; (C) large sequence duplications in C. pasteurianum; (D) presence of megaplasmids in strain KO116; (E) genome duplication assembled as a spurious contig in C. thermocellum LQRI; (F) multiple copies of rRNA operons in C. paradoxum JW-YL-7. The figures are illustrations only and are not drawn to scale.
Summary of rRNA operons present within Illumina assembly.
| 4 | 2 (50) | 2 | |
| 6 | 4 (66) | 2 | |
| 14 | 12 (85) | 2 | |
| 9 | 5 (55) | 4 | |
| 12 | 11 (90) | 1 | |
| 8 | 4 (50) | 4 | |
| 10 | 0 (0) | 0 |
Assembly summary statistics for de novo and hybrid assemblies.
| Illumina | 102 | 331 | 116 | 3.48 | SPAdes* |
| | 107 | 282 | 84 | 3.54 | ABySS |
| Illumina + PacBio | 14 | 2,270 | 2,270 | 3.57 | SPAdes |
| PacBio-only | 10 | 982 | 891 | 3.49 | SMRTanalysis v 2.2 |
| PacBio-only | | | | | |
| Illumina | 110 | 373 | 194 | 5.13 | SPAdes* |
| | 120 | 315 | 115 | 5.19 | ABySS |
| Illumina + PacBio | 30 | 4,654 | 4,654 | 5.19 | SPAdes |
| PacBio-only | | | | | |
| Illumina | 175 | 1,025 | 637 | 5.13 | SPAdes |
| | 131 | 169 | 78 | 5.03 | ABySS* |
| Illumina + PacBio | 147 | 4,498 | 4,498 | 5.19 | SPAdes |
| PacBio-only | | | | | |
| Illumina | 70 | 477 | 244 | 5.3 | SPAdes* |
| | 114 | 318 | 110 | 5.4 | ABySS |
| Illumina + PacBio | 1 | 5,381 | 5,381 | 5.38 | SPAdes |
| PacBio-only | | | | | |
| Illumina | 661 | 293 | 121 | 2.23 | SPAdes |
| | 43 | 235 | 74 | 1.84 | ABySS* |
| Illumina + PacBio | 612 | 1,061 | 323 | 2.26 | SPAdes |
| PacBio-only | | | | | |
| Illumina | 194 | 1,143 | 271 | 6.81 | SPAdes |
| | 172 | 358 | 130 | 6.99 | ABySS* |
| Illumina + PacBio | 122 | 3,522 | 3,522 | 6.91 | SPAdes |
| PacBio-only | 12 | 2,261 | 1,340 | 6.94 | SMRTanalysis v 2.0 |
| PacBio-only | 3 | 6,349 | 6,349 | 6.88 | SMRTanalysis v 2.2 |
| PacBio-only | | | | | |
| Illumina | 6 | 4,108 | 4,108 | 4.36 | SPAdes* |
| | 101 | 207 | 73 | 4.35 | ABySS |
| Illumina + PacBio | 9 | 4,022 | 4,022 | 4.36 | SPAdes |
| PacBio-only | | | | | |
Best assemblies are shown in bold. The best draft assemblies achieved with only the Illumina data are marked with *.
Additional numbers shown in brackets correspond to the extra-chromosomal plasmid DNA.
Assemblies performed prior to the availability of SMRTanalysis version 2.2. Prior assemblies are included to describe the effectiveness of algorithm improvement on genome assembly using the same data.
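Assembly summary statistics of this kind usually include N50, the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The table's column headers did not survive in this record, so which column (if any) holds N50 is an assumption. The sketch below shows only the standard calculation, with a hypothetical helper name `n50` and made-up contig lengths.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig lengths (0 for an empty assembly)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half of the
        # assembly size defines N50.
        if running >= half_total:
            return length
    return 0


# Illustrative lengths only; these are not taken from the study.
example_contigs = [10, 8, 6, 4, 2]
print(n50(example_contigs))
```

When one contig holds more than half of the assembled bases, N50 equals the length of that largest contig, which would be consistent with the rows in which two adjacent numeric columns carry the same value.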
Summary of Pilon call verification by Sanger sequencing.
| | SNP calls tested | Correct | Incorrect |
|---|---|---|---|
| | 11 | 11 | 0 |
| | 22 | 17 | 5 |
| | 6 | 4 | 2 |
| | 8 | 8 | 0 |
| Total | 47 | 40 | 7 |
SNP here refers to indels as well as polymorphisms. Of the 47 SNP calls, 19 were indels; of the 7 incorrect SNP calls, 1 was an indel.
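The verification rates implied by these counts can be worked out directly. The short Python sketch below assumes that the three numeric columns are calls tested, calls confirmed, and calls found incorrect, in that order; the row labels were lost from this record, so the rows are left unnamed. The split into indel and non-indel calls uses only the counts stated in the footnote.

```python
# (tested, correct, incorrect) per row of the Pilon verification table;
# the column meaning is inferred, not stated in the record.
rows = [(11, 11, 0), (22, 17, 5), (6, 4, 2), (8, 8, 0)]

total_calls = sum(r[0] for r in rows)
total_correct = sum(r[1] for r in rows)
total_incorrect = sum(r[2] for r in rows)

overall_rate = 100.0 * total_correct / total_calls

# Footnote counts: 19 of the calls were indels, 1 of the incorrect calls was an indel.
indel_calls, indel_incorrect = 19, 1
other_calls = total_calls - indel_calls
other_incorrect = total_incorrect - indel_incorrect

indel_rate = 100.0 * (indel_calls - indel_incorrect) / indel_calls
other_rate = 100.0 * (other_calls - other_incorrect) / other_calls

print(f"overall: {overall_rate:.1f}% of {total_calls} calls confirmed")
print(f"indels: {indel_rate:.1f}%  non-indels: {other_rate:.1f}%")
```

On these counts, roughly 85% of the tested calls were confirmed by Sanger sequencing, and indel calls were confirmed at a higher rate than the remaining calls. With so few calls in each class, these percentages are rough.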