| Literature DB >> 21876754 |
Jarrod A Chapman1, Isaac Ho, Sirisha Sunkara, Shujun Luo, Gary P Schroth, Daniel S Rokhsar.
Abstract
We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.Entities:
Mesh:
Substances:
Year: 2011 PMID: 21876754 PMCID: PMC3158087 DOI: 10.1371/journal.pone.0023501
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1Paired ends.
A. Fragment pair end separation distribution. Pairs are separated by 279±7 bp. B. Mate-pairs are produced by circularizing a genomic segment (vertical line indicates junction). End-sequences from sheared fragments that contain the junction (1) represent reads that point outward at the ends of the original segment. End-sequences from sheared fragments that do not contain the junction (2) are inwardly directed and adjacent on the original segment. C. Mate-pair end separation distribution. Two-thirds of all pairs are found to be divergently oriented and separated by 3.2±0.2 kb. An artifactual population of convergently oriented pairs separated by less than 500 bp is apparent, representing fragments of type (2) shown above in panel B.
Figure 2Example of a 7-mer graph.
The node a is X-terminated to the left. The non-reciprocal linkage between nodes b and c is removed because the terminal base (lower case “a” in the sequence) of node c is low quality. Node e is F-terminated to the right. The resultant U-U contig is the union of nodes b and d: CTGCTGCT.
Figure 3k-mer frequency and extension characteristics in Pichia.
A. 41-mer frequency distributions. The overall 41-mer distribution (green) is decomposed into genomic (red) and non-genomic (yellow) contributions. At fewer than ∼30 occurrences non-genomic (error-induced) 41-mers dominate. The modal frequency is ∼135. B. Graph features as functions of d. The total number of nodes (blue), total number of X-terminated nodes (red), and total number of F-terminated (yellow) nodes in the 41-mer graph are calculated as functions of the assembly parameter dmin. We find the optimal assembly to occur at dmin = 10.
Figure 4Estimated gap sizes vs. actual contig separation in the Pichia genome.
75% of the initial inter-contig gaps are resolved during gap closing. 97% of gaps are found to be within 4 bp of their estimated size, and 58% within 1 bp.
Comparison of assembles of E. coli K12 MG1655 benchmark dataset.
| Assembler | Assembly as reported in | Contig N50 (kbp) | Scaffold N50 (kbp) | Coverage | Errors reported |
| Allpaths2 | Allpaths2 | 337 | 2,680 | 99.3% | Base accuracy Q67; no misassemblies |
| Soapdenovo | Soapdenovo | 89 | 105 | NR | 5 incorrect contigs |
| Velvet | Allpaths2 | 62 | 298 | 97.7 | Base accuracy Q34; 6.9% of 10 kb regions missassembled |
| Velvet | ABySS | 54 | NR | 98.8 | 9 incorrect contigs (mean size 33 kbp) |
| Euler-SR | ABySS | 57 | NR | 99.8 | 26 incorrect contigs (mean size 52 kbp) |
| Euler | Allpaths2 | 19 | 19 | 94.6 | Base accuracy Q30; 7.0% of 10 kb regions misassembled |
|
|
|
|
|
|
|
| Edena | ABySS | 16 | NR | 99.1% | 6 incorrect contigs (mean size 13 kbp) |
| ABySS | ABySS | 45 | NR | 99.4% | 13 incorrect contigs (mean size 33 kbp) |
| SSAKE | ABySS | 11 | NR | 99.99% | 38 incorrect contigs (mean size 6 kbp) |
In ref. [9] analysis of ABySS, Velvet, Euler-SR, SSAKE, and Edena, only contigs of at least 100 bp were considered and genome coverage was based on full length, partial, and broken alignments with at least 95% identity to reference. Contigs with broken alignments, or that aligned with less than 95% identity, were considered incorrect. In the ref. [23] analysis of Allpaths2, Velvet, and Euler, only contigs of at least 1 kbp were considered. Genome coverage computed as the fraction of 100-mers in the reference sequence that are present in the assembly, allowing for multiple occurrences in the assembly. Base quality reported as total number of discrepancies to reference, computed over ∼10 kb assembly segments that contain fewer than 1% such discrepancies. Misassemblies were reported as the total fraction of bases in ∼10 kb segments containing at least 1% error. In the ref. [11] summary of Soap denovo assembly, contigs >100 bp were reported.
NR: not reported.
*Four localized discrepancies were noted between our meraculous assembly and the E. coli K12 MG1655 reference sequence. As described in the text, further examination showed that all four discrepancies were in fact errors in the reference (or mutations in the lineages separating the MG1655 reference sample from the short read dataset sample). Analysis of errors reported for other assemblers have not been analysed.
Figure 5Differences between E. coli meraculous and reference sequence identify transposon insertion.
Bottom line shows portion of the Genbank reference genome for E. coli str. K-12 substr. MG1655 produced by Sanger sequencing and directed finishing strain [27]. Top shows alignment of the de novo meraculus contigs to reference sequence. Solid lines agree perfectly. Angled dashed lines represent unaligned meraculous contig-ends that correspond to the beginning and end of a transposable element. All short-read data supports the meraculous sequence, indicating either insertion of the transposon in the Illumina-sequenced lineage, or an error in the MG1655 reference.
Comparison of P. Stipitis assembly scaffold characteristics (including scaffolds of size at least 2 kbp).
| Assembler | No. Scaffolds | Total Size (Mbp) | Scaffold N50 (no. / kbp) | Total gap bases (kbp; %) | Scaffolding errors |
| ABySS | 111 | 15.48 | 20 / 263 | 7.3 (0.05%) | 0 |
| Meraculous | 118 | 14.79 | 18 / 269 | 81.7 (0.55%) | 0 |
| SOAPdenovo | 88 | 14.74 | 14 / 348 | 156 (1.06%) | 0 |
| Velvet | 157 | 14.82 | 24 / 202 | 136 (0.92%) | 78 |
To assess accuracy of the assemblies, contigs were aligned to the reference genome using BLAST. Scaffolding errors include non-colinear arrangements of contigs within scaffolds with respect to the reference sequence.
Comparison of P. Stipitis assembly contig characteristics (including contigs of at least 100 bp).
| Assembler | No. Contigs | Total Size (Mbp) | Contig N50 (no. / kbp) | Contig error rate | Reference coverage | Unique coverage |
| ABySS | 132 | 15.48 | 21 / 263 | 1/29 kbp | 97.8% | 92.2% |
| Meraculous | 489 | 14.70 | 44 / 101 | <1/15000 kbp | 95.8% | 95.8% |
| SOAPdenovo | 561 | 14.58 | 64 / 65 | 1/6.4 kbp | 95.2% | 95.1% |
| Velvet | 572 | 14.69 | 87 / 53 | 1/15 kbp | 96.5% | 95.4% |
Contig error rate is measured for only the single best-aligning BLAST HSP per contig. Reference coverage is based on the total number of bases spanned by at least one HSP; unique coverage is based on the total number of reference bases spanned by exactly one HSP.