Jade L L Teng, Man Lung Yeung, Elaine Chan, Lilong Jia, Chi Ho Lin, Yi Huang, Herman Tse, Samson S Y Wong, Pak Chung Sham, Susanna K P Lau, Patrick C Y Woo.
Abstract
Although PacBio third-generation sequencers have increased read lengths and thereby facilitated the assembly of complete genomes, no study has reported using PacBio data alone to completely sequence a two-chromosome bacterial genome from a single library in a single run. Previous studies using earlier sequencing chemistries have at most finished bacterial genomes containing a single chromosome by de novo assembly. In this study, we compared the robustness of PacBio RS II, using one SMRT cell and the latest P6-C4 chemistry, with that of Illumina HiSeq 1500 in sequencing the genome of Burkholderia pseudomallei. This bacterium has two large circular chromosomes, a very high G+C content of 68–69%, highly repetitive regions and substantial genomic diversity, and its genome is one of the largest and most complex bacterial genomes sequenced. Assemblies were evaluated against a reference genome generated by hybrid assembly of the PacBio and Illumina datasets with subsequent manual validation. Results showed that PacBio data with de novo assembly, but not Illumina data, completely sequenced the B. pseudomallei genome without any gaps or mis-assemblies. The two large contigs of the PacBio assembly aligned unambiguously to the reference genome, sharing >99.9% nucleotide identity. Conversely, Illumina data assembled using three different assemblers resulted in fragmented assemblies (201–366 contigs) sharing only 92.2–100% and 92.0–100% nucleotide identity with the chromosome I and II reference sequences, respectively, with no indication that the B. pseudomallei genome consists of two chromosomes with four copies of ribosomal operons. Among all assemblies, the PacBio assembly recovered the highest number of core proteins, virulence proteins, and housekeeping genes based on whole-genome multilocus sequence typing (wgMLST). Most notably, assembly based solely on PacBio data outperformed even hybrid assembly using both PacBio and Illumina datasets. The hybrid approach generated only 74 contigs, whereas PacBio data alone with de novo assembly achieved complete closure of the two-chromosome B. pseudomallei genome without additional costly bench work or further sequencing. PacBio RS II using P6-C4 chemistry is highly robust and cost-effective and should be the platform of choice for sequencing bacterial genomes, particularly those known to be difficult to sequence.
Keywords: Burkholderia pseudomallei; P6-C4; PacBio RS II; complete; genome
Year: 2017 PMID: 28824579 PMCID: PMC5539568 DOI: 10.3389/fmicb.2017.01448
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
Genomes used for bioinformatics analyses in this study.
| Strain | Genomic elements | Number of bases | Status |
|---|---|---|---|
| | Contigs | 7,160,336 | Draft assembly |
| | Chromosome 1 | 4,092,668 | Complete |
| | Chromosome 2 | 3,138,747 | Complete |
| | Chromosome 1 | 3,988,455 | Complete |
| | Chromosome 2 | 3,100,794 | Complete |
| | Contigs | 7,214,442 | Draft assembly |
| | Contigs | 6,934,311 | Draft assembly |
| | Contigs | 6,767,946 | Draft assembly |
| | Contigs | 7,080,338 | Draft assembly |
| | Contigs | 6,730,227 | Draft assembly |
| | Contigs | 7,030,687 | Draft assembly |
| | Chromosome 1 | 4,115,277 | Complete |
| | Chromosome 2 | 3,171,393 | Complete |
| | Chromosome 1 | 4,126,292 | Complete |
| | Chromosome 2 | 3,181,762 | Complete |
| | Contigs | 7,454,077 | Draft assembly |
| | Contigs | 7,188,691 | Draft assembly |
| | Contigs | 7,118,369 | Draft assembly |
| | Contigs | 7,401,189 | Draft assembly |
| | Contigs | 7,246,987 | Draft assembly |
| | Chromosome 1 | 3,912,947 | Complete |
| | Chromosome 2 | 3,127,456 | Complete |
| | Contigs | 6,997,097 | Draft assembly |
| | Contigs | 6,827,079 | Draft assembly |
| | Contigs | 6,888,055 | Draft assembly |
| | Contigs | 6,908,769 | Draft assembly |
| | Contigs | 7,012,758 | Draft assembly |
| | Contigs | 6,717,096 | Draft assembly |
| | Chromosome 1 | 4,074,542 | Complete |
| | Chromosome 2 | 3,173,005 | Complete |
| | Contigs | 7,136,682 | Draft assembly |
| | Contigs | 7,148,557 | Draft assembly |
| | Contigs | 7,348,022 | Draft assembly |
| | Contigs | 7,389,720 | Draft assembly |
Genome characteristics for PacBio and Illumina platforms.
| Platform | PacBio RS II (latest P6-C4 chemistry) | Illumina HiSeq | Illumina HiSeq | Illumina HiSeq |
|---|---|---|---|---|
| Assembler | SMRT analysis software suite | MIRA | SPAdes | Velvet |
| Total number of bases | 1,000,419,819 | 6,278,193,333 | 6,278,193,333 | 6,278,193,333 |
| Number of reads assembled | 114,845 | 4,296,615<sup>a</sup> | 7,655,760<sup>a</sup> | 7,655,760<sup>a</sup> |
| Average depth of coverage | 143× | 70×<sup>b</sup> | 143× | 143× |
| Average read length (bp) | 8,711 | 151 | 131 | 131 |
| No. of contigs (>200 bp) | 2 | 366 | 201 | 288 |
| Largest contig (bp) | 4,091,945 | 152,181 | 372,549 | 299,448 |
| Assembled genome size (bp) | 7,222,235 | 7,261,126 | 7,134,451 | 7,137,994 |
| N50 (bp) | 4,091,945 | 45,496 | 83,355 | 69,759 |
| GC content (%) | 68.2 | 68.1 | 68.2 | 68.2 |
| Number of subsystems<sup>c</sup> | 522 | 519 | 521 | 522 |
| Number of coding sequences<sup>c</sup> | 7,014 | 7,044 | 6,972 | 6,927 |
| Number of RNAs<sup>c</sup> | 71 | 73 | 56 | 60 |
| Number of tRNAs<sup>c</sup> | 59 | 59 | 53 | 53 |
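
The N50 row above can be checked with a short calculation. The following is a minimal sketch of the usual N50 definition (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the assembly size). The function name `n50` is illustrative, and the length of the second PacBio contig is an assumption: it is inferred here as the assembled genome size minus the largest contig, since the table reports only two contigs.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length of the contig at which the running total of
    contig lengths, taken from longest to shortest, first reaches
    at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# PacBio assembly: the largest contig is taken from the table; the second
# contig length is inferred (assumption) as genome size minus largest contig.
largest_contig = 4_091_945
assembled_size = 7_222_235
pacbio_contigs = [largest_contig, assembled_size - largest_contig]

pacbio_n50 = n50(pacbio_contigs)
print(pacbio_n50)
```

Because the largest PacBio contig alone covers more than half of the assembly, its length is also the N50, which agrees with the table. The Illumina N50 values cannot be recomputed this way because the individual contig lengths are not listed.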
Summary of number of repeats predicted by Tandem Repeats Finder.
| Platform | Reference genome (hybrid assembly) | PacBio RS II | Illumina HiSeq 1500 | Illumina HiSeq 1500 | Illumina HiSeq 1500 |
|---|---|---|---|---|---|
| Assembler | SPAdes | SMRT analysis software suite | MIRA | SPAdes | Velvet |
| Number of repeats | 2,052 | 2,045 | 2,088 | 2,042 | 2,053 |
| Copy number | 1.8–69.3 | 1.8–75.3 | 1.8–51.7 | 1.8–36.9 | 1.8–35 |
| Period size (bp) | 4–954 | 4–954 | 4–954 | 4–954 | 4–834 |
| Total length (bp) | 159,618 | 162,531 | 160,870 | 159,741 | 152,134 |
| Percentage of genome | 2.3% | 2.3% | 2.2% | 2.2% | 2.1% |
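
The "Percentage of genome" row can be reproduced from the total repeat length in this table and the assembled genome sizes in the platform comparison table. The sketch below assumes the percentage is simply total repeat length divided by assembled genome size; the dictionary and function names are illustrative. The reference genome column is left out because its total size is not given in that table.

```python
# (total repeat length in bp, assembled genome size in bp) per assembly,
# copied from the two tables in this record.
assemblies = {
    "PacBio RS II (SMRT analysis)": (162_531, 7_222_235),
    "Illumina (MIRA)": (160_870, 7_261_126),
    "Illumina (SPAdes)": (159_741, 7_134_451),
    "Illumina (Velvet)": (152_134, 7_137_994),
}


def repeat_percent(repeat_bp, genome_bp):
    """Share of the assembly covered by tandem repeats, in percent."""
    return 100.0 * repeat_bp / genome_bp


# Round to one decimal place, as in the table.
repeat_percentages = {
    name: round(repeat_percent(repeat_bp, genome_bp), 1)
    for name, (repeat_bp, genome_bp) in assemblies.items()
}

for name, pct in repeat_percentages.items():
    print(f"{name}: {pct}%")
```

Under that assumption the four computed values round to 2.3%, 2.2%, 2.2% and 2.1%, matching the table.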
Recovery of important B. pseudomallei proteins in different assemblies.
| Platform | Reference genome (hybrid assembly) | PacBio RS II | Illumina HiSeq 1500 | Illumina HiSeq 1500 | Illumina HiSeq 1500 |
|---|---|---|---|---|---|
| Assembler | SPAdes | SMRT analysis software suite | MIRA | SPAdes | Velvet |
| Core proteins | 3,804 | 3,804 | 3,787 | 3,803 | 3,802 |
| Virulence factors | 137 | 137 | 136 | 137 | 137 |
| Actin-based motility | 1 | 1 | 1 | 1 | 1 |
| Adhesin | 11 | 11 | 11 | 11 | 11 |
| Antiphagocytosis | 25 | 25 | 25 | 25 | 25 |
| Invasion | 53 | 53 | 53 | 53 | 53 |
| Secretion systems | 47 | 47 | 46 | 47 | 47 |
| MLST genes | 7 | 7 | 7 | 7 | 7 |
| wgMLST genes<sup>a</sup> | 5,689 | 5,678 | 5,588 | 5,634 | 5,654 |
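
To put the wgMLST row on a common scale, the counts can be expressed as a percentage of the number recovered from the reference genome. This is a minimal sketch, not a calculation reported in the source: using the reference count (5,689) as the denominator is an assumption made here for illustration, and the names are hypothetical.

```python
# wgMLST gene counts per assembly, copied from the table above.
REFERENCE_WGMLST = 5_689

wgmlst_counts = {
    "PacBio RS II (SMRT analysis)": 5_678,
    "Illumina (MIRA)": 5_588,
    "Illumina (SPAdes)": 5_634,
    "Illumina (Velvet)": 5_654,
}


def recovery_percent(found, reference):
    """Genes recovered, as a percentage of the reference count."""
    return 100.0 * found / reference


wgmlst_recovery = {
    name: recovery_percent(count, REFERENCE_WGMLST)
    for name, count in wgmlst_counts.items()
}

# Print assemblies from highest to lowest recovery.
for name, pct in sorted(wgmlst_recovery.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {pct:.2f}%")
```

On these numbers the PacBio assembly recovers about 99.8% of the reference wgMLST genes, ahead of Velvet (about 99.4%), SPAdes (about 99.0%) and MIRA (about 98.2%).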
Comparison of the PacBio RS II and Illumina HiSeq platforms used in this study<sup>a</sup>.
| | PacBio RS II (P6-C4 chemistry) | Illumina HiSeq 1500 (rapid run mode) |
|---|---|---|
| Instrument price (US$) | $700,000 | $690,000 |
| Read length | 8 to 15 kb | 2 × 151 bp |
| Throughput per run | Up to 1 Gb | Up to 90 Gb |
| Instrument run time | 4 h | 40 h |
| Cost per Gb (US$) | $300 | $55 |
| Extra labor cost<sup>b</sup> | Nil | Yes |
| Extra time for completing the genome<sup>b</sup> | Nil | ≥6 months |