Shenglong Zhu, Danny Z Chen, Scott J Emrich.
Abstract
BACKGROUND: Although single molecule sequencing is still improving, the lengths of the generated sequences are inevitably an advantage in genome assembly. Prior work that utilizes long reads to conduct genome assembly has mostly focused on correcting sequencing errors and improving contiguity of de novo assemblies.
Keywords: Genome improvement; Genome scaffolding; Sequence assembly; Single molecule sequencing
Year: 2017 PMID: 29244003 PMCID: PMC5731603 DOI: 10.1186/s12864-017-4271-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 A visual overview. a Alignment of long reads to the target assembly. b Extraction of structurally consistent segments from the target assembly. c The reassembly of the consistent portions has two scaffolds: the first scaffold is formed by the red (1st) segment followed by the reverse complement (indicated by a leftward arrow) of the blue (3rd) segment in b, and the gap induced by disassembling is then filled by one or more connecting long reads; the second scaffold is the single green (2nd) segment
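The reassembly step in panel c can be illustrated with a minimal sketch. The function names, the toy sequences, and the idea of passing the gap-filling sequence as a plain string are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
# Illustrative sketch only: names and toy data are hypothetical.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(pieces, gap_fills):
    """Join oriented segments into one scaffold.

    pieces: list of (sequence, strand) tuples, strand is '+' or '-'.
    gap_fills: one filler sequence per junction between consecutive
        pieces (assumed here to come from connecting long reads).
    """
    if len(gap_fills) != max(len(pieces) - 1, 0):
        raise ValueError("need exactly one gap fill per junction")
    parts = []
    for i, (seq, strand) in enumerate(pieces):
        parts.append(seq if strand == "+" else reverse_complement(seq))
        if i < len(gap_fills):
            parts.append(gap_fills[i])
    return "".join(parts)


# Toy stand-ins for the red (1st), green (2nd) and blue (3rd) segments.
red, green, blue = "ACGTT", "GGGCC", "AAGCT"
gap = "TTTT"  # hypothetical sequence taken from a connecting long read

scaffold_1 = build_scaffold([(red, "+"), (blue, "-")], [gap])
scaffold_2 = build_scaffold([(green, "+")], [])
```

Here `scaffold_1` corresponds to the red segment followed by the reverse-complemented blue segment, and `scaffold_2` to the green segment on its own.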
Fig. 2 Comparisons generated by Nucmer and delta-filter (with -r and -q options). a Comparison of the mutated E. coli genome to reference. b Comparison after applying our method
Correction of mutations of E. coli K-12 with two additional downsamples of the 50X data
| Coverage | # of mutations | 50 | 100 | 150 | 200 | 300 | 400 | 500 |
|---|---|---|---|---|---|---|---|---|
| 50x | Scaffolds | 1 | 1 | 3 | 1 | 2 | 9 | 7 |
| | Total running time | 23m35s | 25m | 26m37s | 27m41s | 32m15s | 37m51s | 38m29s |
| | Stage 1 time | 20m31s | 20m4s | 19m24s | 18m37s | 18m25s | 17m58s | 17m30s |
| | Stage 2 time | 3m4s | 4m56s | 7m13s | 9m4s | 13m50s | 19m53s | 20m59s |
| | Relocations | 1 | 1 | 3 | 2 | 3 | 6 | 5 |
| | Inversions | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| | N50 | 4641851 | 4642418 | 3799940 | 4641735 | 4495031 | 1610858 | 2378162 |
| 25x | Scaffolds | 3 | 3 | 7 | 16 | 14 | 23 | 39 |
| | Total running time | 17m47s | 18m47s | 20m8s | 20m56s | 23m31s | 27m2s | 26m7s |
| | Stage 1 time | 15m52s | 15m27s | 15m26s | 14m50s | 14m21s | 14m37s | 13m13s |
| | Stage 2 time | 1m56s | 3m20s | 4m42s | 6m6s | 9m10s | 12m25s | 12m54s |
| | Relocations | 2 | 1 | 2 | 1 | 3 | 7 | 12 |
| | Inversions | 0 | 0 | 0 | 0 | 1 | 3 | 7 |
| | N50 | 3046572 | 4641477 | 4638607 | 4641947 | 2781405 | 906635 | 1057202 |
| 12.5x | Scaffolds | 20 | 11 | 18 | 31 | 59 | 74 | 120 |
| | Total running time | 16m28s | 16m43s | 17m17s | 17m32s | 18m32s | 19m34s | 18m33s |
| | Stage 1 time | 14m31s | 13m51s | 13m12s | 12m37s | 11m41s | 11m1s | 9m45s |
| | Stage 2 time | 1m57s | 2m52s | 4m5s | 4m55s | 6m51s | 8m33s | 8m48s |
| | Relocations | 3 | 1 | 3 | 6 | 7 | 9 | 11 |
| | Inversions | 0 | 0 | 2 | 2 | 1 | 6 | 8 |
| | N50 | 1592081 | 1150026 | 719436 | 730221 | 385833 | 276032 | 118970 |
Errors determined by QUAST on assemblies of S. cerevisiae W303
| | Draft assembly | Correction (BLASR) | Correction (Nucmer) | Correction (ScaffMatch) |
|---|---|---|---|---|
| Contigs | 30 | 294 | 278 | 204 |
| Misassemblies | 108 | 22 | 64 | 26 |
| Relocations | 37 | 7 | 18 | 9 |
| Translocations | 69 | 14 | 46 | 17 |
| Inversions | 2 | 1 | 0 | 0 |
| Misassembled contig length | 10 865 048 | 969 713 | 3 151 961 | 1 607 623 |
Quality comparisons for S. aureus USA300 assemblies using downsampled PacBio long reads
| Coverage | Tool | Contigs | N50 | NG50 | Avg. identity | Relocations | Translocations | Inversions | Total time | Stage 1 time | Stage 2 time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 293.63x | SMSC | 13 | 2585902 | 2585902 | 99.99 | 3 | 0 | 0 | 6m51s | 2m12s | 4m39s |
| | Canu | 5 | 1492711 | 1492711 | 99.98 | 2 | 0 | 0 | 36m26s | N/A | N/A |
| | ScaffMatch | 14 | 1099804 | 1099804 | 99.98 | 28 | 2 | 0 | N/A | N/A | 1m7s |
| | ScaffMatch* | 31 | 611114 | 479066 | 99.98 | 16 | 0 | 0 | N/A | N/A | 2m12s |
| 17.54x | SMSC | 55 | 91135 | 91135 | 99.84 | 8 | 0 | 1 | 2m15s | 1m5s | 1m10s |
| | Canu | 52 | 89147 | 84231 | 99.64 | 2 | 0 | 0 | 6m8s | N/A | N/A |
| | ScaffMatch | 13 | 1131851 | 1131851 | 99.85 | 30 | 0 | 1 | N/A | N/A | 1m6s |
| | ScaffMatch* | 106 | 50567 | 49351 | 99.87 | 2 | 0 | 0 | N/A | N/A | 25s |
| 23.47x | SMSC | 18 | 641333 | 641333 | 99.94 | 3 | 0 | 0 | 2m37s | 56s | 1m41s |
| | Canu | 27 | 224570 | 224570 | 99.85 | 2 | 0 | 0 | 7m29s | N/A | N/A |
| | ScaffMatch | 14 | 1091191 | 1091191 | 99.94 | 29 | 0 | 0 | N/A | N/A | 54s |
| | ScaffMatch* | 70 | 89077 | 89077 | 99.95 | 3 | 0 | 0 | N/A | N/A | 32s |
| 29.31x | SMSC | 12 | 2569515 | 2569515 | 99.95 | 6 | 0 | 0 | 3m6s | 1m4s | 2m2s |
| | Canu | 15 | 426754 | 426754 | 99.92 | 2 | 0 | 0 | 9m54s | N/A | N/A |
| | ScaffMatch | 13 | 1091146 | 1091146 | 99.96 | 34 | 0 | 0 | N/A | N/A | 54s |
| | ScaffMatch* | 55 | 149552 | 149552 | 99.97 | 6 | 0 | 0 | N/A | N/A | 37s |
ScaffMatch* denotes ScaffMatch run with error-corrected PacBio long reads as input
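For readers unfamiliar with the N50 and NG50 columns reported in the tables above, the following is a minimal sketch of the standard definitions: the length of the contig at which the cumulative length, taken from longest to shortest, first reaches half of the assembly size (N50) or half of the reference genome size (NG50). The function name `nx50` and the toy lengths are illustrative assumptions, not values or code from the paper.

```python
def nx50(lengths, reference_size=None):
    """Compute N50, or NG50 when reference_size is given.

    lengths: iterable of contig (or scaffold) lengths.
    reference_size: genome size for NG50; if None, the total
        assembly length is used, which yields N50.
    Returns 0 if the assembly covers less than half the target size.
    """
    lengths = sorted(lengths, reverse=True)
    total = reference_size if reference_size is not None else sum(lengths)
    target = total / 2
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return 0


# Toy example with made-up contig lengths.
contigs = [80, 70, 50, 40, 30, 20]
n50 = nx50(contigs)                        # half of 290 is 145
ng50 = nx50(contigs, reference_size=400)   # half of 400 is 200
```

When the assembly length is close to the reference size, the two values coincide, which is one reason a table may list identical N50 and NG50 entries for a given row.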
Fig. 3 An example of our underlying graph model. a The graph constructed for four segments. Each dashed edge represents a segment from tail (5’-end) to head (3’-end). Each solid edge is for a bundle of long reads. b A possible ordering of the vertices/segments. This ordering will also decide the orientation of the segments in the resulting scaffolds. c The actual ordering and orientation based on the result of b
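How a vertex ordering can determine segment orientation (panels b and c) might be sketched as follows. This assumes, purely for illustration, that each segment contributes a tail vertex and a head vertex and that the ordering lists the two ends of a segment consecutively; the representation and the name `orient_segments` are hypothetical and not drawn from the paper.

```python
def orient_segments(vertex_order):
    """Derive ordered, oriented segments from an ordering of end vertices.

    vertex_order: list of (segment_id, end) tuples with end in
        {"tail", "head"}; both ends of a segment must be adjacent.
    Returns a list of (segment_id, strand): '+' when the tail is
    visited before the head, '-' when the head comes first.
    """
    if len(vertex_order) % 2 != 0:
        raise ValueError("every segment needs both a tail and a head")
    oriented = []
    for i in range(0, len(vertex_order), 2):
        (seg_a, end_a), (seg_b, end_b) = vertex_order[i], vertex_order[i + 1]
        if seg_a != seg_b or {end_a, end_b} != {"tail", "head"}:
            raise ValueError(f"ends of segment {seg_a!r} are not adjacent")
        oriented.append((seg_a, "+" if end_a == "tail" else "-"))
    return oriented


# Toy ordering: segment s1 is read forward, s3 is read head-first.
order = [("s1", "tail"), ("s1", "head"), ("s3", "head"), ("s3", "tail")]
layout = orient_segments(order)
```

In this toy case `layout` places `s1` in forward orientation followed by `s3` reverse-complemented, which mirrors the kind of result shown in panel c.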
Fig. 4 An example of the NP-completeness reduction. The reduction is from the Hamiltonian path problem to the decision version of the scaffolding problem. The instance of the Hamiltonian path problem is the 5-vertex graph on the left. The instance of the decision version of the scaffolding problem is the entire graph, including the dashed edges