Andrew J Page, Nishadi De Silva, Martin Hunt, Michael A Quail, Julian Parkhill, Simon R Harris, Thomas D Otto, Jacqueline A Keane.
Abstract
The rapidly falling cost of bacterial genome sequencing has led to its routine use in large-scale microbial analysis. Though mapping approaches can be used to find differences relative to a reference, many bacteria are subject to constant evolutionary pressures resulting in events such as the loss and gain of mobile genetic elements, horizontal gene transfer through recombination, and genomic rearrangements. De novo assembly is the reconstruction of the underlying genome sequence, an essential step in understanding bacterial genome diversity. Here we present a high-throughput bacterial assembly and improvement pipeline that has been used to generate nearly 20 000 annotated draft genome assemblies in public databases. We demonstrate its performance on a public data set of 9404 genomes. We find all the genes used in multi-locus sequence typing schemes present in 99.6 % of assembled genomes. When tested on low-, neutral- and high-GC organisms, more than 94 % of genes were present and completely intact. The pipeline has proven to be scalable and robust across a wide variety of datasets without requiring human intervention. All of the software is available on GitHub under the GNU GPL open source license.
Keywords: assembly; high-throughput; illumina; prokaryotic
Year: 2016 PMID: 28348874 PMCID: PMC5320598 DOI: 10.1099/mgen.0.000083
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Overview of the method with major components noted.
Comparison of de novo assemblies derived from the pipeline against their corresponding complete reference genomes using QUAST
More comprehensive details are available in Table S1.
| Organism | | | |
|---|---|---|---|
| Coverage | 40.16 | 28.11 | 43.86 |
| Number of contigs | 247 | 22 | 38 |
| Total length | 3 856 742 | 4 711 864 | 3 016 231 |
| Reference length | 4 086 189 | 4 895 678 | 3 075 806 |
| Genome fraction (%) | 94.32 | 95.74 | 98.00 |
| DNA GC content (%) | 67.81 | 52.15 | 32.64 |
| Reference DNA GC content (%) | 67.72 | 52.16 | 32.78 |
| N50 | 23 177 | 517 904 | 206 505 |
| Number of misassemblies | 6 | 10 | 4 |
| Number of mismatches per 100 kbp | 1.43 | 1.15 | 1.76 |
| Number of indels per 100 kbp | 0.6 | 1.92 | 0.17 |
| Genes | 3624 | 4727 | 2965 |
| Percentage of reference genes found | 93.19 | 95.19 | 98.41 |
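For readers unfamiliar with the N50 row above, the following is a minimal sketch of the usual definition: the contig length at which contigs of that length or longer cover at least half of the total assembly length. The function name `n50` and the example contig lengths are illustrative assumptions, not taken from the pipeline or from QUAST's source code.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    # Reached only when the input is empty.
    raise ValueError("cannot compute N50 of an empty assembly")


# Hypothetical contig lengths (not from the paper): total is 290,
# so the running sum must reach at least 145.
example = [80, 70, 50, 40, 30, 20]
print(n50(example))
```

A higher N50 generally indicates a more contiguous assembly, which is why it is reported alongside the number of contigs.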
Summary of the isolates in the large public dataset
| Species | Number of samples | Mean contigs | Mean coverage |
|---|---|---|---|
| | 168 | 70 | 134 |
| | 379 | 24 | 121 |
| | 178 | 167 | 145 |
| | 157 | 37 | 120 |
| | 1441 | 122 | 150 |
| | 234 | 75 | 205 |
| | 1643 | 55 | 92 |
| | 171 | 81 | 136 |
| | 299 | 405 | 118 |
| | 534 | 36 | 174 |
| | 131 | 86 | 91 |
| | 116 | 26 | 293 |
| | 159 | 81 | 374 |
| | 3562 | 74 | 290 |
| Other | 232 | 80 | 136 |
Fig. 2. Distribution of the number of contigs in a set of 9404 assemblies.
Fig. 3. Distribution of the percentage difference between the size of each assembly and the size of a closely related reference sequence.
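The quantity plotted in Fig. 3 can be sketched as a signed percentage difference relative to the reference size. The sign convention (negative when the assembly is smaller than the reference) and the function name `percent_size_difference` are assumptions, since the record does not state how the difference was computed. The input values are the total length and reference length from the first column of the QUAST comparison table, used purely as an illustration.

```python
def percent_size_difference(assembly_length, reference_length):
    """Signed size difference of an assembly relative to a reference, in percent.

    Assumed convention: negative means the assembly is shorter than the reference.
    """
    if reference_length <= 0:
        raise ValueError("reference length must be positive")
    return (assembly_length - reference_length) / reference_length * 100


# Total length and reference length from the first column of the QUAST table.
diff = percent_size_difference(3_856_742, 4_086_189)
print(f"{diff:.2f} %")
```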