Tanja Magoc, Stephan Pabinger, Stefan Canzar, Xinyue Liu, Qi Su, Daniela Puiu, Luke J Tallon, Steven L Salzberg.
Abstract
MOTIVATION: A large and rapidly growing number of bacterial organisms have been sequenced by the newest sequencing technologies. Cheaper and faster sequencing technologies make it easy to generate very high coverage of bacterial genomes, but these advances mean that DNA preparation costs can exceed the cost of sequencing for small genomes. The need to contain costs often results in the creation of only a single sequencing library, which in turn introduces new challenges for genome assembly methods.
Year: 2013 PMID: 23665771 PMCID: PMC3702249 DOI: 10.1093/bioinformatics/btt273
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Bacterial genomes and sequence read lengths used in the GAGE-B evaluation
| Species | Genome size (Mb) | GC content (%) | Sequencing technology | Read length (bp) | Fragment length (bp) | Coverage |
|---|---|---|---|---|---|---|
| | 4.7 | 65 | HiSeq | 101 | 180 | 250× |
| | 5.4 | 35 | HiSeq | 101 | 180 | 100–300× |
| | 5.4 | 35 | MiSeq | 250 | 600 | 100× |
| | 5.3 | 43 | HiSeq | 101 | 180 | 250× |
| | 5.1 | 64 | HiSeq | 100 | 335 | 115× |
| | 5.1 | 64 | MiSeq | 250 | 335ᵃ | 100× |
| | 4.6 | 69 | HiSeq | 101 | 220 | 210× |
| | 4.6 | 69 | MiSeq | 251 | 540 | 100× |
| | 2.9 | 33 | HiSeq | 101 | 180 | 250× |
| | 4.0 | 48 | HiSeq | 100 | 335 | 110× |
| | 4.0 | 48 | MiSeq | 250 | 335ᵃ | 100× |
| | 2.9 | 33 | HiSeq | 101 | 400 | 250× |
Note: All datasets used paired-end reads from both ends of every fragment.
ᵃ The fragment lengths for two of the MiSeq libraries were relatively short, only 335 bp, because the same library was used for both HiSeq and MiSeq sequencing of those species.
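The coverage column in the table above can be related to genome size and read length through the standard relation coverage = (number of reads × read length) / genome size. The following is a minimal sketch of that arithmetic, not a procedure taken from the study; the function names `depth_of_coverage` and `reads_for_coverage` are hypothetical, and the sketch assumes every read has the same length and that all reads count toward coverage.

```python
import math


def depth_of_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth of coverage: total sequenced bases divided by genome size.

    Assumes uniform read length (hypothetical helper, for illustration only).
    """
    return n_reads * read_length_bp / genome_size_bp


def reads_for_coverage(genome_size_bp: int, read_length_bp: int, coverage: float) -> int:
    """Smallest number of reads whose total bases reach the target coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    # Example using the first table row: 4.7 Mb genome, 101 bp reads, 250x coverage.
    n = reads_for_coverage(4_700_000, 101, 250)
    print(n, "reads, i.e.", math.ceil(n / 2), "read pairs for a paired-end library")
```

Because all datasets here are paired-end, the number of sequenced fragments is roughly half the number of reads returned by `reads_for_coverage`.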
Fig. 1. Comparison of N50 contig size (in kilobases) on the y-axis, versus depth of coverage on the x-axis, for the eight assemblers used in this study. All datasets were 100 bp HiSeq reads from B.cereus.
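For readers unfamiliar with the N50 statistic plotted in Fig. 1, the sketch below shows the conventional definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of a reference total. It is an illustration only, not the evaluation code used in the study. The function name `n50` and the optional `genome_size` parameter are assumptions: by default the reference total is the summed contig length, and passing `genome_size` instead measures against a known genome length. The corrected N50 used in the tables is defined in Section 2.4 and is not implemented here.

```python
from typing import Iterable, Optional


def n50(contig_lengths: Iterable[int], genome_size: Optional[int] = None) -> int:
    """Return the N50 of a set of contig lengths (hypothetical helper).

    If genome_size is given, half of that value is the target; otherwise half
    of the total assembly length is used. Returns 0 when the contigs never
    reach the target (possible only when genome_size exceeds the assembly).
    """
    lengths = sorted(contig_lengths, reverse=True)
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2

    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0


if __name__ == "__main__":
    contigs = [100, 80, 60, 40, 20]
    print(n50(contigs))                   # measured against the assembly length
    print(n50(contigs, genome_size=500))  # measured against an assumed genome size
```

Measuring against a fixed genome size makes values comparable across assemblers whose total assembly lengths differ, which is one reason to expose it as a parameter.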
Comparison of corrected N50 contig sizes, shown in kilobases, for assemblies where the finished reference genome was identical or near-identical
| Assembler | Species assembled | ||||||
|---|---|---|---|---|---|---|---|
| HiSeq (100 bp) reads | MiSeq (250 bp) reads | ||||||
| ABySS | 13.0 | 115.7 | 93.0 | 130.6 | 21.4 | 68.5 | 60.3 |
| CABOG | 11.2 | 78.2 | 48.8 | 150.5 | 30.5 | 8.3 | 32.5 |
| MIRA | 17.7 | 129.2 | 87.1 | 100.0 | 15.4 | 75.0 | 108.7 |
| MaSuRCA | 36.2 | 71.6 | |||||
| SGA | 12.1 | 27.9 | 23.4 | 25.5 | 9.1 | 12.8 | 27.3 |
| SOAPdenovo | 10.5 | 147.2 | 106.5 | 33.5 | 113.3 | 65.5 | |
| SPAdes | 83.5 | 147.9 | 77.1 | 103.7 | 118.1 | ||
| Velvet | 13.1 | 60.3 | 39.5 | 24.5 | 24.2 | 41.5 | 67.1 |
Note: The best values (or two best, in case of near-ties) for each genome are shown in bold. Corrected N50 is defined in Section 2.4.
Comparison of N50 sizes (in kilobases) for assemblies where the sequenced strain was too divergent to compute a corrected N50 value
| Assembler | Species assembled | ||||
|---|---|---|---|---|---|
| ABySS | 237.5 | 48.3 | 146.2 | 73.9 | 89.9 |
| CABOG | 278.4 | 61.6 | 94.2 | 102.8 | 105.8 |
| MIRA | 246.2 | 47.4 | 134.3 | 132.4 | 105.6 |
| MaSuRCA | |||||
| SGA | 68.8 | 23.4 | 41.2 | 38.1 | 47.8 |
| SOAPdenovo | 243.9 | 57.9 | 116.1 | 146.3 | 74.2 |
| SPAdes | 379.7 | 97.2 | 187.1 | ||
| Velvet | 184.4 | 38.9 | 125.2 | 122.5 | 83.0 |
Note: Boldface indicates the best result in each column, with the top two results highlighted when the difference was minimal. All genomes shown here were assembled from 100 bp HiSeq reads. See the Supplementary data for additional statistics.
Assemblies of R.sphaeroides using one versus two libraries
| Assembler (dataset) | MaSuRCA (one HiSeq library) | MaSuRCA (GAGE setting: two libraries) | SPAdes (one MiSeq library) | Allpaths-LG (GAGE setting: two libraries) |
|---|---|---|---|---|
| Contigs | ||||
| Number | 258 | 185 | 204 | |
| Errors | 11 | 9 | 15 | |
| Local errors | 7 | 12 | ||
| Corrected N50 (kb) | 30 | 118 | 43 | |
| Scaffolds | ||||
| Number | 101 | 73 | 33 | |
| Errors | 9 | 12 | 15 | |
| Local errors | 71 | 9 | 81 | |
| Corrected N50 (kb) | 197 | 2528 | 152 | |
Note: Shown are results from MaSuRCA and SPAdes, the two assemblers with the best performance in this study on a single library, and from Allpaths-LG, which was the best performer on the two-library dataset in the original GAGE study. We also include an additional comparison using MaSuRCA on two libraries. The best values in each row are shown in boldface type.