Ian J Miller, Marc G Chevrette, Jason C Kwan.
Abstract
Genome mining has become an increasingly powerful, scalable, and economically accessible tool for the study of natural product biosynthesis and drug discovery. However, there remain important biological and practical problems that can complicate or obscure biosynthetic analysis in genomic and metagenomic sequencing projects. Here, we focus on limitations of available technology as well as computational and experimental strategies to overcome them. We review the unique challenges and approaches in the study of symbiotic and uncultured systems, as well as those associated with biosynthetic gene cluster (BGC) assembly and product prediction. Finally, to explore sequencing parameters that affect the recovery and contiguity of large and repetitive BGCs assembled de novo, we simulate Illumina and PacBio sequencing of the Salinispora tropica genome, focusing on assembly of the salinilactam (slm) BGC.
Keywords: binning; bioinformatics; biosynthesis; biosynthetic gene clusters; genome mining; genome sequencing; metagenomics; secondary metabolism
Year: 2017 PMID: 28587290 PMCID: PMC5484115 DOI: 10.3390/md15060165
Source DB: PubMed Journal: Mar Drugs ISSN: 1660-3397 Impact factor: 5.118
Figure 1. Circular genome map of de novo assemblies mapped back to the Salinispora tropica CNB-440 reference genome (GCA_0016425.1). Assemblies of simulated Illumina HiSeq 2500 data, generated with a mean insert size of 275 bp and different combinations of read length (50–125 bp) and sequencing depth (1–100×), show fragmentation (indicated by black bars) throughout the chromosome, including in BGCs (annotated as green boxes in the outermost ring; the slm pathway is annotated in red).
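The relationship between read length, sequencing depth, and the amount of data to simulate can be sketched with the usual coverage arithmetic (coverage ≈ total sequenced bases / genome size). This is a minimal illustration, not the authors' simulation code: the function name is hypothetical, paired-end reads are assumed (implied by the stated insert size), and the genome size of roughly 5.2 Mb is an approximate, assumed value.

```python
def read_pairs_for_coverage(genome_size_bp: int, read_length_bp: int, coverage: int) -> int:
    """Number of read pairs so that total sequenced bases is at least coverage x genome size."""
    bases_needed = coverage * genome_size_bp
    bases_per_pair = 2 * read_length_bp  # assumption: paired-end, two reads per fragment
    return -(-bases_needed // bases_per_pair)  # ceiling division on integers


# Assumption: approximate genome size, used only for illustration.
ASSUMED_GENOME_SIZE_BP = 5_200_000

# The two extremes of the read length and depth ranges given in the caption.
for read_length, coverage in [(50, 1), (125, 100)]:
    pairs = read_pairs_for_coverage(ASSUMED_GENOME_SIZE_BP, read_length, coverage)
    print(f"len{read_length}_cov{coverage}: {pairs:,} read pairs")
```

Under these assumptions, the shallowest short-read run needs tens of thousands of read pairs, whereas the deepest long-read run needs about two million.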
Figure 2. (a) Alignment of de novo contigs to the reference slm pathway. De novo contigs colored in green, yellow, and red mapped to the reference slm pathway sequence one, two, and three times, respectively. In other words, contigs colored in red mapped to three different locations in the slm BGC, due to exact repeats. (b) Fragmentation and percent recovery (by length) of the salinilactam biosynthetic gene cluster for different combinations of read length and sequencing depth on a simulated Illumina HiSeq 2500 platform run, with and without PacBio CLR sequencing. An insert size of 275 bp was used for all 12 simulated sequencing runs displayed here (the results obtained using a longer fragment size and greater sequencing depths are also explored in Figure S1). Notably, 30× PacBio coverage was required to fully scaffold the Illumina-based assembly with a read length of 125 bp and 100× coverage (len125_cov100_pb30×, where the number following “len” gives the Illumina read length, “cov” the depth of Illumina read coverage, and “pb” the depth of PacBio coverage).
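The run-label convention described in the caption can be decoded mechanically. The sketch below is an illustration only: the function and type names are hypothetical, and it assumes that Illumina-only runs simply omit the "_pb" suffix, which the caption does not state.

```python
import re
from typing import NamedTuple, Optional


class RunLabel(NamedTuple):
    read_length_bp: int                # Illumina read length ("len")
    illumina_coverage: int             # Illumina depth ("cov")
    pacbio_coverage: Optional[int]     # PacBio depth ("pb"), None if absent


# Assumption: the "_pb<depth>" part is optional and may end in "×" or "x".
_LABEL_RE = re.compile(r"len(\d+)_cov(\d+)(?:_pb(\d+)[×x]?)?")


def parse_run_label(label: str) -> RunLabel:
    """Split a label such as 'len125_cov100_pb30×' into its three parameters."""
    match = _LABEL_RE.fullmatch(label)
    if match is None:
        raise ValueError(f"unrecognized run label: {label!r}")
    length, cov, pb = match.groups()
    return RunLabel(int(length), int(cov), int(pb) if pb is not None else None)


print(parse_run_label("len125_cov100_pb30×"))
```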