Iain Maccallum, Dariusz Przybylski, Sante Gnerre, Joshua Burton, Ilya Shlyakhter, Andreas Gnirke, Joel Malek, Kevin McKernan, Swati Ranade, Terrance P Shea, Louise Williams, Sarah Young, Chad Nusbaum, David B Jaffe.
Abstract
We demonstrate that genome sequences approaching finished quality can be generated from short paired reads. Using 36 base (fragment) and 26 base (jumping) reads from five microbial genomes of varied GC composition and sizes up to 40 Mb, ALLPATHS2 generated assemblies with long, accurate contigs and scaffolds. Velvet and EULER-SR were less accurate. For example, for Escherichia coli, the fraction of 10-kb stretches that were perfect was 99.8% (ALLPATHS2), 68.7% (Velvet), and 42.1% (EULER-SR).
Year: 2009 PMID: 19796385 PMCID: PMC2784318 DOI: 10.1186/gb-2009-10-10-r103
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: source data
| | S. aureus | E. coli | R. sphaeroides | S. pombe | N. crassa |
|---|---|---|---|---|---|
| Strain | USA300 | K12 MG1655 | 2.4.1 | 972 h | 74A |
| Reference sequence | Finished, curated | Finished, curated | Finished, curated | Finished | Finished |
| GC composition (%) | 33 | 51 | 69 | 36 | 49 |
| Genome size (kb) | 2,873 | 4,639 | 4,603 | 12,554 | 39,226 |
| Reference N50 (kb) | 2,873 | 4,639 | 3,188 | 4,509 | 665 |
| Sequence coverage (x) | 89 | 139 | 370 | 148 | 123 |
For five genomes, statistics from ALLPATHS, Velvet, and EULER assemblies are shown. In all cases identical data were supplied to all three programs. Contigs of size <1,000 bases were discarded. Strain: strain that was sequenced. GC composition: percent of GC base pairs in the finished reference sequence. Genome size: total number of bases in reference. Reference N50: N50 size of reference. Reference sequences: all were finished, and in addition the reference sequences for the three bacterial genomes were curated to ensure full concordance with the samples, as described in the main text. S. aureus, circular chromosome and two plasmids; E. coli, circular chromosome; R. sphaeroides, finished but differs at 374 base positions from our sample, two circular chromosomes, four circular plasmids, and one linear plasmid; S. pombe, three linear chromosomes; N. crassa, finished at the base level, 251 contigs from 7 chromosomes. Sequence coverage: total coverage by usable reads; for a precise definition see Table S3 in Additional data file 1.
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: contiguity
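| | ALLPATHS | Velvet | EULER |
|---|---|---|---|
| Contig N50 | |||
| S. aureus | 477 | 87 | 68 |
| E. coli | 337 | 62 | 19 |
| R. sphaeroides | 156 | 151 | 3 |
| S. pombe | 51 | 55 | 24 |
| N. crassa | 19 | 16 | 14 |
| Scaffold N50 | |||
| S. aureus | 611 | 562 | 68 |
| E. coli | 2,680 | 298 | 19 |
| R. sphaeroides | 858 | 1,126 | 3 |
| S. pombe | 222 | 422 | 24 |
| N. crassa | 58 | 186 | 14 |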
For five genomes, statistics from ALLPATHS, Velvet, and EULER assemblies are shown. Contiguity refers to contig and scaffold N50 values.
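The N50 values above follow the standard definition: the largest length L such that sequences of length at least L together contain at least half of all bases. A minimal Python sketch, assuming only a list of contig or scaffold lengths (the function name `n50` is illustrative and is not taken from any of the three assemblers):

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    contain at least half of the total number of bases.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Applying the function to contig lengths gives the contig N50, and applying it to scaffold lengths gives the scaffold N50. Any length filter, such as discarding contigs shorter than 1,000 bases, would be applied to the list before the call.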
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: genome coverage
| | ALLPATHS | Velvet | EULER |
|---|---|---|---|
| | 99.1% | 97.0% | 96.7% |
| | 99.3% | 97.7% | 94.6% |
| | 98.5% | 94.3% | 65.0% |
| | 95.9% | 95.5% | 93.6% |
| | 89.5% | 88.7% | 89.2% |
| | 98.8% | 94.4% | 93.0% |
| | 99.0% | 92.4% | 72.7% |
| | 96.0% | 87.2% | 6.2% |
| | 91.3% | 91.3% | 76.3% |
| | 71.0% | 62.1% | 61.5% |
| | 88.3% | 36.5% | 41.3% |
| | 86.5% | 25.5% | 0.0% |
| | 69.5% | 58.4% | 0.0% |
| | 18.3% | 17.8% | 1.7% |
| | 1.3% | 0.3% | 0.0% |
For five genomes, statistics from ALLPATHS, Velvet, and EULER assemblies are shown. Genome coverage refers to the fraction of the reference genome covered by the assembly, computed as the fraction of 100-mers in the reference sequence present in the assembly. If a 100-mer was present multiple times in the reference sequence, we checked for up to that many copies in the assembly.
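The coverage measure described above can be sketched as a multiset comparison of k-mers. The Python below is only an illustration under stated assumptions: sequences are plain uppercase `ACGT` strings, each k-mer is counted together with its reverse complement (the caption does not say how strand was handled), and the names `canonical`, `kmer_counts`, and `genome_coverage` are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_counts(sequences, k):
    """Count canonical k-mers over a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[canonical(seq[i:i + k])] += 1
    return counts


def genome_coverage(reference, assembly, k=100):
    """Fraction of reference k-mers found in the assembly.

    A k-mer occurring n times in the reference is credited at most
    n times, matching "up to that many copies in the assembly".
    """
    ref = kmer_counts(reference, k)
    asm = kmer_counts(assembly, k)
    total = sum(ref.values())
    if total == 0:
        raise ValueError("reference contains no k-mers of the requested size")
    found = sum(min(n, asm[kmer]) for kmer, n in ref.items())
    return found / total
```

With `k=3`, a reference `["AAAAC"]` and an assembly `["AAAC"]`, the reference holds `AAA` twice and `AAC` once, while the assembly holds each once, so two of the three reference k-mers are credited.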
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: correctness (of chunks approximately 10 kb or less)
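| | ALLPATHS | Velvet | EULER |
|---|---|---|---|
| Class I | |||
| S. aureus | 99.3% | 70.7% | 51.7% |
| E. coli | 99.8% | 68.7% | 42.1% |
| R. sphaeroides | 99.7% | 71.8% | 18.9% |
| S. pombe | 79.7% | 66.2% | 31.4% |
| N. crassa | 78.6% | 49.9% | 19.1% |
| Class II | |||
| S. aureus | 0.7% | 13.7% | 26.4% |
| E. coli | 0.2% | 18.0% | 24.1% |
| R. sphaeroides | 0.0% | 19.3% | 39.2% |
| S. pombe | 18.6% | 26.6% | 32.6% |
| N. crassa | 15.3% | 24.3% | 24.1% |
| Class III | |||
| S. aureus | 0.0% | 9.1% | 13.7% |
| E. coli | 0.0% | 6.4% | 26.8% |
| R. sphaeroides | 0.0% | 5.9% | 37.0% |
| S. pombe | 1.3% | 3.7% | 28.7% |
| N. crassa | 3.2% | 11.4% | 32.3% |
| Class IV | |||
| S. aureus | 0.0% | 6.2% | 5.9% |
| E. coli | 0.0% | 5.9% | 5.3% |
| R. sphaeroides | 0.2% | 2.0% | 3.2% |
| S. pombe | 0.3% | 2.6% | 5.1% |
| N. crassa | 1.3% | 7.9% | 12.8% |
| Class V | |||
| S. aureus | 0.0% | 0.4% | 2.3% |
| E. coli | 0.0% | 1.0% | 1.7% |
| R. sphaeroides | 0.1% | 1.0% | 1.5% |
| S. pombe | 0.2% | 0.8% | 2.1% |
| N. crassa | 0.9% | 5.7% | 10.4% |
| Class VI (no 100-mer match) | |||
| S. aureus | 0.0% | 0.0% | 0.0% |
| E. coli | 0.0% | 0.0% | 0.0% |
| R. sphaeroides | 0.0% | 0.1% | 0.2% |
| S. pombe | 0.0% | 0.0% | 0.0% |
| N. crassa | 0.7%* | 0.9%* | 1.4%* |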
For five genomes, statistics from ALLPATHS, Velvet, and EULER assemblies are shown. Correctness: contigs were divided into approximately 10 kb chunks, leaving smaller contigs intact. Subject to the caveat that the reference sequences might have some errors (likely greater for the fungi), we assayed the absolute accuracy of each chunk by finding the minimum number of errors (substitution or indel bases) among all alignments of it to the reference sequence for the genome. The table shows the distribution of the bases in the chunks according to their accuracy. Chunks having no 100-mer match are separately classified (class VI). *For N. crassa, some AT-rich regions that are missing from the reference [23] appear as novel sequence in the assemblies.
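The chunking step can be illustrated with a short Python sketch. The caption only says that contigs were divided into approximately 10 kb chunks and that smaller contigs were left intact; the rule used here (round the contig length to the nearest multiple of the target, then cut near-equal pieces) is an assumption, and `split_into_chunks` is a hypothetical name.

```python
def split_into_chunks(contig, target=10_000):
    """Split a contig into near-equal pieces of roughly `target` bases.

    Contigs shorter than about 1.5 * target are returned as a single
    chunk, so small contigs stay intact.
    """
    n = len(contig)
    # Number of pieces: nearest whole multiple of the target, at least one.
    pieces = max(1, round(n / target))
    base, extra = divmod(n, pieces)
    chunks = []
    start = 0
    for i in range(pieces):
        # Spread the remainder over the first `extra` chunks.
        size = base + (1 if i < extra else 0)
        chunks.append(contig[start:start + size])
        start += size
    return chunks
```

Each chunk would then be aligned to the reference and assigned to an accuracy class by its minimum error count; the table reports the share of bases, not the share of chunks, falling in each class.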
ALLPATHS, Velvet and EULER assemblies of five microbial genomes: base accuracy, misassemblies, and long-range validity
| | ALLPATHS | Velvet | EULER |
|---|---|---|---|
| Quality, from class I to III chunks | |||
| S. aureus | Q59 | Q33 | Q32 |
| E. coli | Q67 | Q34 | Q30 |
| R. sphaeroides | >Q60 | Q35 | Q29 |
| S. pombe | Q42 | Q37 | Q29 |
| N. crassa | Q39 | Q32 | Q28 |
| % in class IV to V chunks | |||
| S. aureus | 0.0% | 6.6% | 8.2% |
| E. coli | 0.0% | 6.9% | 7.0% |
| R. sphaeroides | 0.3% | 3.0% | 4.7% |
| S. pombe | 0.5% | 3.4% | 7.2% |
| N. crassa | 2.2% | 13.6% | 23.2% |
| At 100-kb distance | |||
| S. aureus | 100.0% | 45.6% | 99.8% |
| E. coli | 100.0% | 86.4% | N/A |
| R. sphaeroides | 100.0% | 75.8% | N/A |
| S. pombe | 99.8% | 80.2% | 100.0% |
| N. crassa | 99.8% | 13.6% | N/A |
For five genomes, statistics from ALLPATHS, Velvet, and EULER assemblies are shown. Base accuracy: the inferred quality score for the bases in chunk classes I to III, obtained by dividing the number of errors in these chunks by the total number of bases, and then taking -10*log10 of this quantity. Misassemblies: total fraction of bases in chunks in classes IV to V. Long-range validity: we chose 10,000 pairs of 100-base sequences at random from the scaffolds, where the two sequences were separated by 100,000 bases, considering only cases where both sequences could be placed uniquely on the reference. We scored the placement as 'valid' if both sequences were placed on the same chromosome, in the same orientation, with separation 100,000 ± 25% bases. We report the fraction of valid placements. N/A: all scaffolds were of length <100 kb.
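Two of the measures defined above reduce to short calculations. The Python sketch below is illustrative only: `quality_score` applies the stated formula, -10*log10(errors / bases), and `is_valid_placement` applies the stated validity rule to one pair of placements. Both function names, and the representation of a placement as chromosome, position, and strand, are assumptions made for the example.

```python
import math


def quality_score(errors, bases):
    """Inferred quality: -10 * log10(errors / bases)."""
    if bases <= 0:
        raise ValueError("bases must be positive")
    if errors < 0 or errors > bases:
        raise ValueError("errors must lie between 0 and bases")
    if errors == 0:
        # No observed errors: the score is unbounded.
        return math.inf
    return -10 * math.log10(errors / bases)


def is_valid_placement(chrom_a, pos_a, strand_a,
                       chrom_b, pos_b, strand_b,
                       expected=100_000, tolerance=0.25):
    """Check one pair of uniquely placed 100-base sequences.

    Valid means: same chromosome, same orientation, and a separation
    within `tolerance` (as a fraction) of the expected distance.
    """
    if chrom_a != chrom_b or strand_a != strand_b:
        return False
    separation = abs(pos_b - pos_a)
    return abs(separation - expected) <= tolerance * expected
```

For example, one error in 1,000 bases corresponds to Q30, and one error in 1,000,000 bases to Q60. The reported long-range validity would be the fraction of sampled pairs for which `is_valid_placement` returns `True`.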
Figure 1. The ALLPATHS assembly of S. aureus. Each edge represents a contiguous and unambiguous sequence of bases and, for this assembly, each component is its own scaffold. Longer edges are in red, short edges in gray. The sizes of the gray edges and regions are in bases. Several key features are called out in blue boxes. Five short sequences totaling 9 kb are not shown. Images of the graphs for all five ALLPATHS assemblies of this paper are available at [16].