Francesca Giordano, Louise Aigrain, Michael A Quail, Paul Coupland, James K Bonfield, Robert M Davies, German Tischler, David K Jackson, Thomas M Keane, Jing Li, Jia-Xing Yue, Gianni Liti, Richard Durbin, Zemin Ning.
Abstract
Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness, as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported, ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well-characterized genome, the Saccharomyces cerevisiae S288C strain, using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform-associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.
Year: 2017 PMID: 28638050 PMCID: PMC5479803 DOI: 10.1038/s41598-017-03996-z
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Read length distributions for the ONT and PacBio datasets. Read length distributions for the four yeast strains S288C, N44, CBS432 and SK1: (a) the 2D-Pass ONT datasets and (b) the PacBio datasets. (c) Comparison of read length distributions for the S288C 31X datasets: ONT 2D-Pass and the PacBio ONT-emulating 31X subset.
Summary statistics for the 2D-Pass ONT datasets for the S288C, N44, CBS432 and SK1 strains.
| Oxford Nanopore Datasets | |||||||
|---|---|---|---|---|---|---|---|
| Strain | Dataset | Bases (Mb) | Reads | Average (b) | Longest (b) | N50 (b) | Identity |
| S288C | 2D-Pass: 31X | 383 | 42,325 | 9,040 | 56,477 | 11,693 | 93.3% |
| | 2D-Pass: 20X | 242 | 26,721 | 9,057 | 56,477 | 11,659 | 92.8% |
| | 2D-Pass: 10X | 121 | 13,366 | 9,054 | 56,028 | 11,716 | 92.0% |
| N44 | 2D-Pass: 11X | 130 | 15,654 | 8,292 | 37,837 | 9,861 | NA |
| CBS432 | 2D-Pass: 9X | 110 | 12,211 | 8,952 | 46,481 | 11,201 | NA |
| SK1 | 2D-Pass: 4X | 51 | 5,938 | 8,589 | 36,791 | 10,971 | NA |
For the S288C strain, 20X and 10X subsets are also shown, each made of reads randomly selected from the immediately larger 2D-Pass dataset.
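The depth and N50 columns in the table above can be reproduced from a list of read lengths. The following is a minimal sketch, not the authors' code: the genome size of 12.1 Mb is an assumed round figure for S288C, and the function names are illustrative.

```python
# Assumption: approximate S288C genome size in bases (not stated in the table).
ASSUMED_GENOME_SIZE = 12_100_000


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("need at least one read length")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest read down until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def depth(total_bases, genome_size=ASSUMED_GENOME_SIZE):
    """Average read depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


if __name__ == "__main__":
    toy_lengths = [2, 3, 4, 5, 6]          # 20 bases in total
    print(n50(toy_lengths))                # 6 + 5 = 11 >= 10, so the N50 is 5
    print(f"{depth(383_000_000):.1f}X")    # 383 Mb of 2D-Pass reads
```

Under the assumed genome size, the 383 Mb dataset comes out close to its nominal 31X label.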
Summary statistics for the PacBio datasets for the S288C, N44, CBS432 and SK1 strains.
| Pacific Biosciences Datasets | |||||||
|---|---|---|---|---|---|---|---|
| Strain | Dataset | Bases (Mb) | Reads | Average (b) | Longest (b) | N50 (b) | Identity |
| S288C | 120X | 1,463 | 239,408 | 6,109 | 35,196 | 8,656 | 92.5% |
| | ONT-Emu: 31X | 375 | 42,180 | 8,893 | 35,196 | 11,196 | 91.9% |
| | ONT-Emu: 20X | 242 | 26,786 | 9,035 | 35,196 | 11,615 | 91.7% |
| | ONT-Emu: 10X | 121 | 13,456 | 8,993 | 31,627 | 11,582 | 91.2% |
| N44 | 148X | 1,794 | 371,025 | 4,834 | 33,906 | 6,800 | NA |
| CBS432 | 135X | 1,639 | 324,414 | 5,053 | 34,173 | 7,212 | NA |
| SK1 | 248X | 3,019 | 697,989 | 4,325 | 34,080 | 6,184 | NA |
For the S288C strain, the ONT-emulating (‘ONT-Emu’) 31X, 20X and 10X subsets are also shown; they were selected to match the 31X, 20X and 10X ONT S288C datasets in depth and read length distribution.
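One way to select a subset that matches another dataset in read count and read length distribution is histogram matching. The sketch below is an assumption about how such a subset could be drawn, not the procedure used for the ONT-Emu subsets; the function name, the 1,000-base bin size and the fixed seed are all illustrative choices.

```python
import random
from collections import defaultdict


def emulate_length_distribution(source_lengths, target_lengths, bin_size=1000, seed=0):
    """Pick indices of source reads so that their length histogram follows the target's.

    Hypothetical helper: for every length bin, draw as many source reads
    (without replacement) as the target has in that bin, capped by availability.
    """
    rng = random.Random(seed)

    # Group source read indices by length bin.
    source_bins = defaultdict(list)
    for index, length in enumerate(source_lengths):
        source_bins[length // bin_size].append(index)

    # Count how many reads the target dataset has in each bin.
    wanted = defaultdict(int)
    for length in target_lengths:
        wanted[length // bin_size] += 1

    chosen = []
    for bin_id, count in wanted.items():
        pool = source_bins.get(bin_id, [])
        chosen.extend(rng.sample(pool, min(count, len(pool))))
    return sorted(chosen)


if __name__ == "__main__":
    source = [500, 1500, 1600, 2500, 2600, 2700]   # toy "PacBio" read lengths
    target = [1550, 2550, 2650]                    # toy "ONT" read lengths
    picked = emulate_length_distribution(source, target)
    print([source[i] for i in picked])
```

With this scheme, bins for which the source has too few reads simply contribute fewer reads, so the emulating subset can end up slightly smaller than the target.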
Summary statistics for the de novo assemblies of the S288C ONT datasets. The pipelines are the hybrid pipeline PBcR-MiSeq; Miniasm, which has neither a base error correction nor a consensus step; and the non-hybrid pipelines Racon (on a Miniasm draft assembly), Falcon, SMARTdenovo, ABruijn, PBcR-Self and Canu. From top to bottom: all the 2D-Pass reads (2D-Pass 31X), the 2D-Pass 20X subset and the 2D-Pass 10X subset.
| Oxford Nanopore S288C Datasets | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | Assembler | Bases (Mb) | Contigs | N50 (kb) | Reference Coverage | SNPs, Indels (# per kb) | Identity | MisAss | Na50 (kb) | Genes (6,615) | CPU Time (h) | Memory (GB) |
| 2D-Pass 31X | PBcR-MiSeq | 11.9 | 76 | 305 | 99.08% | | | | | | | |
| | Miniasm | 11.8 | 27 | 739 | 94.85% | 34, 67 | 89.42% | 26 | 362 | 3,353 | | |
| | Racon | 12.0 | 27 | 752 | 98.80% | 0.4, 11 | | 24 | 534 | 6,533 | 8 | 5 |
| | Falcon | 11.9 | 43 | 717 | 99.09% | 0.5, 21 | 97.79% | 27 | | 6,526 | | |
| | SMARTdenovo | | 28 | 625 | 99.54% | 0.3, 14 | 98.50% | 25 | 531 | 6,556 | 2 | |
| | ABruijn | 12.4 | | | 98.89% | | 98.49% | 31 | 536 | 6,533 | 44 | 8 |
| | PBcR-Self | 12.9 | 64 | 616 | 99.21% | 0.2, 17 | 98.24% | 92 | 525 | 6,552 | 695 | 23 |
| | Canu | | 29 | 698 | | | 98.30% | 34 | 530 | | 80 | 14 |
| | +Nanopolish | 12.3 | 29 | 709 | 99.63% | 0.1, 4 | 99.57% | 35 | 538 | 6,584 | 1,835 | 12 |
| 2D-Pass 20X | PBcR-MiSeq | 11.8 | 66 | 269 | 99.09% | | | | 262 | 6,522 | 95 | 13 |
| | Miniasm | 11.6 | 39 | 418 | 94.66% | 34, 67 | 89.36% | 24 | 286 | 3,271 | | |
| | Racon | 11.8 | 39 | 423 | 98.11% | 0.7, 13 | | 26 | 393 | 6,478 | 5 | |
| | Falcon | 10.7 | 84 | 210 | 90.64% | 0.6, 21 | 97.56% | 17 | 194 | 5,946 | 10 | 44 |
| | SMARTdenovo | 11.9 | | | 98.99% | 0.8, 16 | 98.23% | 24 | 455 | 6,528 | 1 | 4 |
| | ABruijn | | | 468 | 98.55% | 0.3, 16 | 98.28% | 12 | 436 | 6,495 | 29 | 7 |
| | PBcR-Self | 12.9 | 72 | 545 | | 0.3, 18 | 98.08% | 74 | 452 | | 342 | 20 |
| | Canu | 11.9 | 31 | 544 | 98.99% | 0.2, 18 | 98.10% | 25 | 441 | 6,525 | 41 | 10 |
| 2D-Pass 10X | PBcR-MiSeq | 11.3 | 123 | | | | | 13 | | | 33 | 7 |
| | Miniasm | 7.9 | 158 | 58 | 67.90% | 24, 46 | 89.26% | 12 | 43 | 2,256 | 0.02 | 0.002 |
| | Racon | 8.1 | 158 | 60 | 70.33% | 2, 13 | 97.72% | 15 | 58 | 4,520 | 3 | 1 |
| | Falcon | 1.4 | 113 | 17 | 15.90% | 0.1, 3 | 97.43% | 6 | 16 | 901 | 3 | 24 |
| | SMARTdenovo | 10.4 | | 115 | 88.71% | 5, 24 | 96.58% | | 104 | 5,610 | | |
| | ABruijn | 8.5 | 86 | 111 | 72.68% | 1, 16 | 97.47% | 21 | 97 | 4,711 | 16 | 8 |
| | PBcR-Self | | 167 | 106 | 91.33% | 1, 22 | | 64 | 102 | 5,957 | 71 | 3 |
| | Canu | 10.7 | 115 | 134 | 91.52% | 1, 23 | 97.32% | 18 | 112 | 5,955 | 13 | 6 |
For the ‘2D-Pass 31X’ dataset, the results from Nanopolish on the Canu assembly are also shown. In each column the best value is highlighted in bold. For the identity column the best value is always that of the hybrid assembly PBcR-MiSeq, so we also highlighted (bold and underlined) the best value among the non-hybrid pipelines, ignoring Nanopolish as it is the only polishing tool. For the 10X datasets, we ignored assemblies with less than 80% reference coverage when choosing the best values.
As Table 3, but for the PacBio-based assemblies from the ONT-Emu PacBio subsets at, from top to bottom, 31X, 20X and 10X depth.
| Pacific Biosciences S288C Datasets | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | Assembler | Bases (Mb) | Contigs | N50 (kb) | Reference Coverage | SNPs, Indels (# per kb) | Identity | MisAss | Na50 (kb) | Genes (6,615) | CPU Time (h) | Memory (GB) |
| ONT-Emu 31X | PBcR-MiSeq | 11.9 | 76 | 270 | 98.68% | | | | 270 | 6,526 | 132 | 17 |
| | Miniasm | 12.5 | 35 | 563 | 96.10% | 19, 88 | 89.37% | 53 | 106 | 3,226 | | |
| | Racon | | 34 | 544 | 99.09% | 0.3, 4 | 99.50% | 22 | 429 | 6,540 | 19 | 5 |
| | Falcon | 12.0 | 35 | 549 | 98.18% | 0.3, 2 | 99.78% | 28 | 436 | 6,508 | 13 | 64 |
| | SMARTdenovo | 12.3 | | | | 0.2, 3 | 99.66% | 27 | 549 | 6,596 | 2 | |
| | ABruijn | 12.3 | 26 | 666 | 99.30% | | 99.87% | 43 | 469 | 6,565 | 19 | 7 |
| | PBcR-Self | 12.4 | 39 | 751 | 99.50% | | 99.92% | 43 | 548 | 6,590 | 63 | 24 |
| | Canu | 12.3 | 28 | 607 | 99.92% | | | 29 | 534 | | 15 | 10 |
| ONT-Emu 20X | PBcR-MiSeq | 11.7 | 64 | 304 | 98.73% | | | | 264 | 6,501 | 86 | 13 |
| | Miniasm | 12.0 | 86 | 202 | 93.08% | 18, 84 | 89.57% | 53 | 69 | 3,255 | | |
| | Racon | 11.6 | 86 | 194 | 95.35% | 1, 7 | 99.18% | 25 | 189 | 6,241 | 10 | |
| | Falcon | 9.9 | 152 | 115 | 82.22% | 0.3, 2 | 99.65% | 29 | 112 | 5,341 | 5.6 | 41 |
| | SMARTdenovo | | | | | | 99.09% | 24 | 434 | 6,534 | 1 | 3 |
| | ABruijn | 11.7 | 58 | 272 | 96.06% | | 99.72% | 36 | 258 | 6,325 | 23 | 9 |
| | PBcR-Self | 12.3 | 44 | 502 | 99.03% | 0.2, 2 | 99.78% | 35 | 428 | 6,560 | 30 | 20 |
| | Canu | | 42 | 454 | 99.47% | 0.2, 2 | | 28 | 432 | | 8 | 7 |
| ONT-Emu 10X | PBcR-MiSeq | | | | | | | | | | | |
| | Miniasm | 4.0 | 120 | 35 | 33.92% | 6, 27 | 89.61% | 4 | 19 | 1,035 | 0.02 | 0.1 |
| | Racon | 3.8 | 120 | 34 | 35.48% | 1, 5 | 98.28% | 8 | 33 | 2,095 | 5 | 1 |
| | Falcon | 0.6 | 59 | 14 | 8.99% | 0.1, 0.2 | 99.41% | 10 | 13 | 421 | 1 | 23 |
| | SMARTdenovo | 8.5 | 157 | 61 | 71.29% | 3, 22 | 96.43% | 8 | 55 | 4,271 | 1 | 1 |
| | ABruijn | 4.7 | 67 | 71 | 41.45% | 0.4, 4 | 98.87% | 10 | 67 | 2,631 | 13 | 7 |
| | PBcR-Self | 9.7 | 232 | 57 | 78.99% | 1, 7 | 98.96% | 35 | 55 | 5,111 | 12 | 18 |
| | Canu | 8.9 | 178 | 62 | 75.35% | 0.4, 6 | 99.14% | 19 | 59 | 4,811 | 3 | 4 |
Figure 2. Homomer counts. Counts of the 5-base homomers in the reference (blue), in the Canu assembly (orange), and in the Canu assembly after polishing with Nanopolish (green).
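Counts of this kind can be obtained by scanning a sequence for runs of one repeated base. The sketch below is illustrative only: it assumes a "5-base homomer" means a maximal run of at least five identical bases, which may differ from the definition behind the figure, and `count_homomers` is a hypothetical name.

```python
from collections import Counter
from itertools import groupby


def count_homomers(sequence, min_run=5):
    """Count maximal runs of >= min_run identical bases, per base (A, C, G, T).

    Assumption: a run longer than min_run counts once, not once per window.
    """
    counts = Counter()
    # groupby yields each maximal stretch of consecutive identical characters.
    for base, run in groupby(sequence.upper()):
        if base in "ACGT" and sum(1 for _ in run) >= min_run:
            counts[base] += 1
    return counts


if __name__ == "__main__":
    # Runs: AAAAA (5), C, G, TTTTTT (6), GGGG (4)
    print(count_homomers("AAAAACGTTTTTTGGGG"))
```

Running the same function on a reference and on an assembly gives the two sets of counts that such a comparison needs.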
Figure 3. Chromosome homomer rate with respect to that of the mitochondrial genome. Ratios of 5-homomer counts, normalized by the chromosome length, between each chromosome and the mitochondrial genome (mt).
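A length-normalized ratio of this kind can be sketched as follows. This is a hedged illustration, not the authors' script: it assumes a homomer is a maximal run of at least five identical bases and that each ratio is the chromosome rate divided by the mitochondrial rate; the function names and the `"mt"` key are hypothetical.

```python
import re


def homomer_rate(sequence, min_run=5):
    """Homomer runs per base: maximal runs of >= min_run identical bases / sequence length."""
    if not sequence:
        raise ValueError("empty sequence")
    # A base followed by at least (min_run - 1) copies of itself; greedy, so runs are maximal.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    runs = sum(1 for _ in pattern.finditer(sequence.upper()))
    return runs / len(sequence)


def ratios_to_reference(sequences, reference_name="mt"):
    """Divide every sequence's homomer rate by the rate of the reference sequence."""
    reference_rate = homomer_rate(sequences[reference_name])
    if reference_rate == 0:
        raise ValueError("reference sequence contains no homomers")
    return {name: homomer_rate(seq) / reference_rate for name, seq in sequences.items()}


if __name__ == "__main__":
    toy = {
        "chrI": "AAAAACGTACGTACGTACGT",  # 1 run in 20 bases
        "mt": "AAAAATTTTT",              # 2 runs in 10 bases
    }
    print(ratios_to_reference(toy))
```

A ratio below 1 means the chromosome carries fewer homomers per base than the mitochondrial genome.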
The MiSeq-only assembly from SPAdes is shown in the top row.
| S288C Datasets: Scaffolding Pipelines | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dataset | Assembler | Bases (Mb) | Contigs | N50 (kb) | Reference Coverage | SNPs, Indels (# per kb) | Identity | MisAss | Na50 (kb) | Genes (6,615) | CPU Time (h) | Memory (GB) |
| MiSeq | SPAdes | 11.6 | 206 | 125 | 98.3% | 0.04, 0.03 | 99.98% | 5 | 125 | 6,399 | 5 | 12 |
| ONT 2D-Pass 31X | npScarf | | | | 99.8% | 0.4, 0.3 | 99.91% | 69 | | 6,559 | | |
| | HybridSPAdes | 11.8 | 64 | 444 | | 0.1, | 99.97% | | 416 | | 18 | 12 |
| | SMIS | 11.8 | 85 | 549 | 98.4% | | | 13 | 493 | 6,411 | 13 | |
| PacBio ONT-Emu 31X | npScarf | | | | 98.5% | 0.3, 0.4 | 99.91% | 67 | | 6,458 | | |
| | HybridSPAdes | | 68 | 364 | | 0.1, | | | 317 | | 27 | 12 |
| | SMIS | | 89 | 546 | 98.8% | | | 40 | 309 | 6,399 | 9 | 6 |
The MiSeq-only SPAdes assembly scaffolded by the npScarf, HybridSPAdes and SMIS pipelines using the ‘2D-Pass 31X’ ONT sample (middle) and the ‘ONT-Emu 31X’ PacBio subset (bottom).
Summary statistics for the de novo assemblies from the MiSeq-only SPAdes pipeline and for the same SPAdes assembly scaffolded with npScarf, HybridSPAdes and SMIS, for the N44 (top panel), CBS432 (middle panel) and SK1 (bottom panel) strains.
| Oxford Nanopore Datasets | |||||||
|---|---|---|---|---|---|---|---|
| Dataset | Assembler | Bases (Mb) | Contigs | N50 (kb) | Genes (6,615) | CPU Time (h) | Memory (GB) |
| N44 2D-Pass: 11X | SPAdes | 11.6 | 187 | 117 | 5,475 | 7 | 13 |
| | npScarf | 11.7 | 19 | 898 | 5,538 | 1 | 2 |
| | HybridSPAdes | 11.7 | 61 | 324 | 5,547 | 8 | 13 |
| | SMIS | 11.7 | 58 | 511 | 5,474 | 2 | 5 |
| CBS432 2D-Pass: 9X | SPAdes | 11.6 | 181 | 150 | 5,498 | 5 | 12 |
| | npScarf | 11.4 | 19 | 928 | 5,443 | 1 | 1 |
| | HybridSPAdes | 11.7 | 49 | 515 | 5,611 | 6 | 12 |
| | SMIS | 11.7 | 64 | 658 | 5,499 | 1 | 5 |
| SK1 2D-Pass: 4X | SPAdes | 11.6 | 240 | 118 | 6,341 | 6 | 12 |
| | npScarf | 11.7 | 43 | 507 | 6,435 | 0.3 | 1 |
| | HybridSPAdes | 11.7 | 111 | 227 | 6,444 | 5 | 12 |
| | SMIS | 11.7 | 142 | 358 | 6,341 | 1 | 3 |