Heath E O'Brien, Yunchen Gong, Pauline Fung, Pauline W Wang, David S Guttman.
Abstract
Next-generation genomic technology has both greatly accelerated the pace of genome research and increased our reliance on draft genome sequences. While groups such as the Genomics Standards Consortium have made strong efforts to promote genome standards, there is still a general lack of uniformity among published draft genomes, leading to challenges for downstream comparative analyses. This lack of uniformity is a particular problem when using standard draft genomes, which frequently have large numbers of low-quality sequencing tracts. Here we present a proposal for an "enhanced-quality draft" genome that identifies at least 95% of the coding sequences, thereby effectively providing a full accounting of the genic component of the genome. Enhanced-quality draft genomes are easily attainable through a combination of small- and large-insert next-generation, paired-end sequencing. We illustrate the generation of an enhanced-quality draft genome by re-sequencing the plant pathogenic bacterium Pseudomonas syringae pv. phaseolicola 1448A (Pph 1448A), which has a published, closed genome sequence of 5.93 Mbp. We use a combination of Illumina paired-end and mate-pair sequencing, and surprisingly find that de novo assemblies with 100x paired-end coverage and mate-pair sequencing with as low as 2-5x coverage are substantially better than assemblies based on higher coverage. The rapid and low-cost generation of large numbers of enhanced-quality draft genome sequences will be of particular value for microbial diagnostics and biosecurity, which rely on precise discrimination of potentially dangerous clones from closely related benign strains.
Year: 2011 PMID: 22073286 PMCID: PMC3206934 DOI: 10.1371/journal.pone.0027199
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Assembly statistics for P. syringae pv. phaseolicola 1448A resequencing.
| Assembler | PE depth | MP depth | Contigs/scaffolds (#) | N50 (kb) | ORF coverage, all (100%) | ORF coverage, all (90%) | ORF coverage, single-copy (100%) | ORF coverage, single-copy (90%) | Mismatches |
|---|---|---|---|---|---|---|---|---|---|
| SOAPdenovo | 250x | | 165 | 63 | 82% | 88% | 84% | 91% | 0.011% |
| SOAPdenovo/SSPACE | 250x | 100x | 47 | 444 | 82% | 88% | 84% | 91% | 0.007% |
| SOAPdenovo/SSPACE | 100x | 5x | 165 | 384 | 76% | 93% | 79% | 96% | 0.005% |
| CLC | 250x | | 465 | 27 | 94% | 96% | 97% | 98% | 0.006% |
| CLC/SSPACE | 250x | 100x | 129 | 507 | 94% | 96% | 97% | 98% | 0.010% |
| CLC/SSPACE | 100x | 5x | 95 | 542 | 95% | 97% | 98% | 99% | 0.004% |
PE = paired-end sequencing; MP = mate-pair sequencing.
ORF coverage: percent of all open reading frames (ORFs), or of single-copy ORFs only, that were completely (100%) or at least 90% reassembled.
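The N50 column is easiest to read with the usual definition in mind: the contig (or scaffold) length at which contigs of that length or longer hold at least half of the assembled bases. Below is a minimal Python sketch under that assumption. The text here does not say which tool computed N50, and the function name and example lengths are illustrative only.

```python
def n50(lengths):
    """Return N50 for a list of contig or scaffold lengths (standard definition)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Illustrative lengths (not data from the study): total 32, half reached at 8 + 8.
example = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(example))
```

Because N50 depends only on the length distribution, merging contigs into scaffolds can raise it sharply without changing ORF coverage, which is the pattern seen between the unscaffolded and SSPACE rows that share a paired-end depth.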
Figure 1. Relationship between quantity of sequence data (expressed as expected read depth) and assembly quality (number of contigs and N50) for subsamples of reads.
A) De novo assemblies of 75×2 bp paired-end reads (insert size 150–300 bp). B) Scaffolding of contigs using 38×2 bp mate-pair reads (insert size 4–6 kb). Three random subsets of reads were analyzed for each coverage level except the highest, which used all of the reads.
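Expected read depth maps to a number of read pairs through depth = (pairs × 2 × read length) / genome size. The sketch below works that arithmetic for the 5.93 Mbp genome and draws a seeded random subset. It assumes that "75×2 bp" and "38×2 bp" mean pairs of 75 bp and 38 bp reads. The helper names are hypothetical, and the authors' actual subsampling procedure is not described here.

```python
import math
import random

GENOME_SIZE_BP = 5_930_000  # Pph 1448A closed genome size from the abstract


def pairs_for_depth(depth, read_len_bp, genome_size_bp=GENOME_SIZE_BP):
    """Read pairs needed for an expected depth: depth * genome / (2 * read length)."""
    return math.ceil(depth * genome_size_bp / (2 * read_len_bp))


def subsample_pairs(pairs, n_pairs, seed):
    """Draw a reproducible random subset of read pairs (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(pairs, n_pairs)


pe_pairs = pairs_for_depth(100, 75)  # 100x with 75 bp paired-end reads
mp_pairs = pairs_for_depth(5, 38)    # 5x with 38 bp mate-pair reads
print(pe_pairs, mp_pairs)
```

Using a different seed for each replicate gives independent subsets at the same depth, in the spirit of the three random subsets per coverage level mentioned in the caption.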
Comparison of P. syringae resequencing projects.
| Strain | Sequencing Method & Depth | Assembler | N50 (kb) | Completeness | Complete ORFs | Mismatches | Reference |
|---|---|---|---|---|---|---|---|
| | Illumina (26x)/454 (2.5x/5xPE) | VCAKE/Newbler | 90 | 99% | 85% | 0.018% | |
| | Illumina (24xPE) | Velvet 0.718 | 165 | 97.5% | 91% | 0.027% | |
| Pph 1448A | Illumina (100xPE/5xMP) | CLC/SSPACE | 542 | 98% | 95% | 0.004% | This study |
PE = paired-end sequencing; MP = mate-pair sequencing. If not specified, single-read sequencing was performed.
Completeness: percent of the single-copy portion of the reference genome present in contigs.
Complete ORFs: percent of all ORFs completely reassembled.
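The ORF percentages in both tables reduce to a threshold count over per-ORF reassembly fractions. The sketch below shows that calculation, assuming each ORF already has a fraction between 0 and 1 describing how much of it was recovered in the assembly. How those fractions were derived is not stated here, and the function name and input values are made up for illustration.

```python
def percent_reassembled(orf_fractions, threshold):
    """Percent of ORFs whose reassembled fraction is at least `threshold`."""
    if not orf_fractions:
        raise ValueError("need at least one ORF")
    hits = sum(1 for f in orf_fractions if f >= threshold)
    return 100.0 * hits / len(orf_fractions)


# Made-up fractions for four ORFs, not data from the study.
fractions = [1.0, 0.95, 0.5, 1.0]
complete = percent_reassembled(fractions, 1.0)       # fully reassembled ORFs
near_complete = percent_reassembled(fractions, 0.9)  # at least 90% reassembled
print(complete, near_complete)
```

Restricting the input to single-copy ORFs before calling the function would give the single-copy variant of the same statistic.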