Chengxi Ye, Christopher M Hill, Shigang Wu, Jue Ruan, Zhanshan Sam Ma.
Abstract
The highly anticipated transition from next generation sequencing (NGS) to third generation sequencing (3GS) has been difficult primarily due to high error rates and excessive sequencing cost. The high error rates make the assembly of long erroneous reads of large genomes challenging because existing software solutions are often overwhelmed by error correction tasks. Here we report a hybrid assembly approach that simultaneously utilizes NGS and 3GS data to address both issues. We gain advantages from three general and basic design principles: (i) Compact representation of the long reads leads to efficient alignments. (ii) Base-level errors can be skipped; structural errors need to be detected and corrected. (iii) Structurally correct 3GS reads are assembled and polished. In our implementation, preassembled NGS contigs are used to derive the compact representation of the long reads, motivating an algorithmic conversion from a de Bruijn graph to an overlap graph, the two major assembly paradigms. Moreover, since NGS and 3GS data can compensate for each other, our hybrid assembly approach reduces both of their sequencing requirements. Experiments show that our software is able to assemble mammalian-sized genomes orders of magnitude more quickly than existing methods without consuming a lot of memory, while saving about half of the sequencing cost.
Year: 2016 PMID: 27573208 PMCID: PMC5004134 DOI: 10.1038/srep31900
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. (A) Map de Bruijn graph contigs to the long reads. The long reads are in red; the de Bruijn graph contigs are in other colors. Each long read is converted into an ordered list of contigs, termed a compressed read. (B) Calculate overlaps between the compressed reads. The alignment is calculated using the anchors. Contained reads are removed and the reads are chained together in the best-overlap fashion. (C) Layout: construct the assembly backbone from the best overlaps. (D) Consensus: align all related reads to the backbone and calculate the most likely sequence as the consensus output.
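Steps (A) and (B) of Figure 1 can be illustrated with a minimal Python sketch. It is not the DBG2OLC implementation: the input format (a list of `(position_on_read, contig_id)` hits per long read) and all function names are hypothetical, matching is exact, and read orientation and noisy anchors are ignored.

```python
from typing import List, Tuple


def compress_read(hits: List[Tuple[int, int]]) -> List[int]:
    """Turn (position_on_read, contig_id) hits into an ordered list of contig IDs.

    The hit format is an assumption made for this sketch.
    """
    return [contig_id for _, contig_id in sorted(hits)]


def is_contained(inner: List[int], outer: List[int]) -> bool:
    """True if `inner` occurs as a contiguous run inside `outer`."""
    n = len(inner)
    if n == 0 or n > len(outer):
        return False
    return any(outer[i:i + n] == inner for i in range(len(outer) - n + 1))


def suffix_prefix_overlap(a: List[int], b: List[int]) -> int:
    """Number of contig IDs in the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


# Toy example: three contig hits on one long read, given out of order.
read_a = compress_read([(900, 7), (100, 3), (500, 5)])  # ordered contig IDs
read_b = [5, 7, 9]
shared = suffix_prefix_overlap(read_a, read_b)
```

Because each read is reduced to a handful of integers, comparing two reads touches only a few list elements rather than thousands of bases, which is the intuition behind the efficiency claim in the abstract.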
Figure 2. Read correction by multiple sequence alignment.
The left portion shows the removal of a false positive anchoring contig (brown) that appears only once in the multiple alignment. The right portion shows the detection of a chimeric read by aligning it to multiple reads. A breakpoint is detected because the reads that align with the left portion of the target read are inconsistent with the reads that align with the right portion of the target read.
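The two checks described in the Figure 2 caption can be sketched on compressed reads (lists of contig IDs). This is a simplified illustration under stated assumptions, not the published algorithm: reads count as "aligned" if they share any contig ID, the function names and the `min_support` threshold are hypothetical, and a real implementation would need to separate chimeras from ordinary coverage gaps.

```python
from collections import Counter
from typing import List, Optional


def drop_unsupported_anchors(target: List[int],
                             aligned_reads: List[List[int]],
                             min_support: int = 2) -> List[int]:
    """Remove contig IDs seen in fewer than `min_support` reads (target included)."""
    support = Counter()
    for read in [target] + aligned_reads:
        support.update(set(read))  # count each contig once per read
    return [c for c in target if support[c] >= min_support]


def find_breakpoint(target: List[int],
                    aligned_reads: List[List[int]]) -> Optional[int]:
    """Return a split index where left and right supporters are disjoint, else None."""
    for i in range(1, len(target)):
        left, right = set(target[:i]), set(target[i:])
        left_reads = {j for j, r in enumerate(aligned_reads) if left & set(r)}
        right_reads = {j for j, r in enumerate(aligned_reads) if right & set(r)}
        if left_reads and right_reads and not (left_reads & right_reads):
            return i
    return None


# Contig 99 occurs only in the target, so it is treated as a false anchor.
cleaned = drop_unsupported_anchors([1, 2, 99, 3], [[1, 2, 3], [2, 3, 4]])

# No read supports both halves of this target, suggesting a chimeric join.
split = find_breakpoint([1, 2, 8, 9], [[0, 1, 2, 3], [1, 2, 3], [7, 8, 9], [8, 9, 10]])
```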
Compression ratio on various datasets.
| Datasets | Sequencing Technology | Average Raw Read Length | NGS Contig N50 (DBG | Average Compressed Read Length | Compression Ratio |
|---|---|---|---|---|---|
| | PacBio | 4,734 | 31,233 ( | 7 | 1:676 |
| | PacBio | 5,614 | 2,264 ( | 8 | 1:702 |
| | PacBio | 14,519 | 3,115 ( | 11 | 1:1320 |
| | Oxford Nanopore | 6,597 | 3,303 ( | 4 | 1:1649 |
| | Illumina Miseq | 150 | 3,303 ( | 2 | 1:75 |
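The reported ratios are consistent with dividing the average raw read length (in bases) by the average compressed read length (in contig identifiers) and rounding to the nearest integer. The short Python check below reproduces the arithmetic from the table values; the function name is illustrative only.

```python
def compression_ratio(raw_length: int, compressed_length: int) -> int:
    """Bases per contig identifier, rounded to the nearest integer."""
    return round(raw_length / compressed_length)


# (average raw read length, average compressed read length) from the table rows
rows = [(4734, 7), (5614, 8), (14519, 11), (6597, 4), (150, 2)]
ratios = [compression_ratio(raw, comp) for raw, comp in rows]
```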
Computation time of each procedure.
| Species | Long Read Source | Short Read Assembly (CPU hr) | Compression (CPU hr) | Graph Construction (CPU hr) | Consensus (CPU hr) |
|---|---|---|---|---|---|
| | 20x PacBio | 0.1 | 0.03 | 0.005 | 2 |
| | 40x PacBio | 1 | 0.6 | 0.2 | 18 |
| | 30x PacBio | 25 | 37 | 3 | 1600 |
| | 30x Nanopore | 0.1 | 0.02 | 0.002 | 2 |
Assembly performance comparison on the S. cerevisiae genome (genome size: 12 Mbp).
| Cov | Assembler | Time (h) | NG50 | Contigs | NGA50 (454) | Identity (454) | Misassemblies (454) | NGA50 (PacBio) | Identity (PacBio) | Misassemblies (PacBio) | Longest | Sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10x | MHAP* | — | — | — | — | — | — | — | — | — | — | — |
| | HGAP* | 36.3 | — | 554 | — | 99.68% | 105 | — | 99.77% | 6 | 36,942 | 1,512,911 |
| | CA* | 15.1 | 85,728 | 289 | 68,030 | 97.49% | 134 | 81,451 | 97.46% | 13 | 448,177 | 12,285,888 |
| | PacBioToCA | 173.5 | 19,694 | 898 | 19,378 | 99.88% | 112 | 18,689 | 99.90% | 6 | 221,736 | 10,741,663 |
| | ECTools | 24.5 | 120,126 | 169 | 98,965 | 99.76% | 324 | 109,640 | 99.73% | 29 | 525,820 | 11,785,741 |
| | Falcon* | 1.3 | — | 675 | — | 99.23% | 116 | — | 99.28% | 4 | 36,616 | 4,137,485 |
| | DBG2OLC | 1.7 | 475,890 | 67 | 168,612 | 99.70% | 408 | 355,269 | 99.81% | 46 | 1,174,277 | 11,899,604 |
| 20x | MHAP* | 17.1 | 241,394 | 87 | 155,221 | 99.70% | 508 | 241,260 | 99.75% | 22 | 490,764 | 12,123,145 |
| | HGAP* | 31.1 | 8,578 | 1,210 | 6,908 | 99.85% | 307 | 7,619 | 99.90% | 20 | 86,998 | 8,624,090 |
| | CA* | 42.4 | 371,115 | 165 | 201,649 | 98.83% | 284 | 329,930 | 98.82% | 21 | 680,599 | 13,052,212 |
| | PacBioToCA | 400.9 | 66,974 | 395 | 65,171 | 99.87% | 157 | 65,171 | 99.91% | 7 | 628,280 | 11,487,222 |
| | ECTools | 34.2 | 176,663 | 172 | 109,931 | 99.77% | 565 | 150,351 | 99.74% | 46 | 624,112 | 12,887,799 |
| | Falcon* | 3.5 | 110,083 | 180 | 93,385 | 99.38% | 345 | 110,438 | 99.42% | 15 | 281,041 | 10,583,868 |
| | DBG2OLC | 2.6 | 597,541 | 47 | 172,455 | 99.71% | 440 | 576,287 | 99.88% | 37 | 1,085,773 | 12,476,994 |
| 40x | MHAP* | 36.6 | 614,363 | 65 | 243,012 | 99.91% | 598 | 589,044 | 99.94% | 24 | 1,090,578 | 12,356,826 |
| | HGAP* | 36.2 | 211,631 | 93 | 198,387 | 99.94% | 528 | 348,754 | 99.99% | 30 | 796,762 | 12,387,287 |
| | CA* | 115.2 | 365,912 | 114 | 160,867 | 99.66% | 358 | 377,360 | 99.60% | 11 | 769,189 | 15,171,228 |
| | PacBioToCA | 621.7 | 96,817 | 371 | 96,476 | 99.87% | 178 | 94,480 | 99.91% | 6 | 742,046 | 11,700,172 |
| | ECTools | 55.8 | 255,956 | 271 | 166,945 | 99.79% | 891 | 214,377 | 99.76% | 64 | 714,196 | 14,481,947 |
| | Falcon* | 11.2 | 614,509 | 58 | 247,745 | 99.72% | 336 | 555,886 | 99.74% | 10 | 1,069,920 | 12,116,235 |
| | DBG2OLC | 4.2 | 672,955 | 28 | 238,683 | 99.87% | 431 | 544,679 | 99.90% | 36 | 1,086,380 | 12,149,997 |
| 80x | MHAP* | 13.5 | 751,122 | 43 | 248,079 | 99.91% | 526 | 745,563 | 99.95% | 10 | 1,537,433 | 12,350,704 |
| | HGAP* | 46.5 | 818,775 | 33 | 248,655 | 99.95% | 534 | 678,552 | 99.99% | 23 | 1,545,906 | 12,621,393 |
| | CA* | 236.0 | 430,552 | 75 | 201,397 | 99.80% | 319 | 397,774 | 99.74% | 12 | 984,295 | 16,571,250 |
| | PacBioToCA | 274.3 | 64,967 | 364 | 63,651 | 99.88% | 45 | 62,268 | 99.91% | 10 | 233,799 | 11,651,218 |
| | ECTools | 100.9 | 247,871 | 382 | 154,348 | 99.79% | 1,470 | 164,839 | 99.76% | 101 | 881,635 | 15,925,328 |
| | Falcon* | 34.7 | 810,136 | 99 | 247,480 | 99.81% | 437 | 810,134 | 99.82% | 24 | 1,537,463 | 12,681,860 |
| | DBG2OLC | 8.1 | 678,365 | 29 | 204,065 | 99.92% | 426 | 574,476 | 99.95% | 35 | 1,089,897 | 12,209,592 |
*Assemblers that use only 3GS data.
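For readers unfamiliar with the NG50 column: under its standard definition, NG50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the genome size. The Python sketch below follows that definition; the function name and the toy contig lengths are illustrative and are not taken from the paper. NGA50 is commonly computed the same way on aligned blocks after contigs are broken at misassemblies.

```python
from typing import List, Optional


def ng50(contig_lengths: List[int], genome_size: int) -> Optional[int]:
    """Contig length at which the cumulative sum first reaches half the genome size.

    Returns None when the assembly covers less than half of the genome,
    in which case NG50 is undefined.
    """
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return None


# Toy example with a hypothetical genome of 200 bp.
value = ng50([10, 50, 20, 40, 30], genome_size=200)   # 50 + 40 + 30 reaches 100
undefined = ng50([10, 20, 30], genome_size=200)        # only 60 bp assembled
```

This may also explain the missing NG50 entries in the table for assemblies whose total length is well under half of the 12 Mbp genome.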
DBG2OLC assembly performance comparison on various genomes.
| Genome | Size | Coverage | NG50 | Contigs | NGA50 | Identity | Misassemblies | Longest | Sum |
|---|---|---|---|---|---|---|---|---|---|
| | 120 Mbp | 10x PacBio | 405,464 | 881 | 258,924 | 99.77% | 704 | 1,549,329 | 119 Mb |
| | | 20x PacBio | 2,431,755 | 306 | 926,138 | 99.90% | 117 | 6,015,430 | 120 Mb |
| | | 40x PacBio | 3,601,597 | 243 | 1,605,981 | 99.93% | 131 | 15,473,059 | 129 Mb |
| | 3.0 Gbp | 10x PacBio | 432,739 | 16,689 | 347,104 | 99.56% | — | 3,507,306 | 2.97 Gb |
| | | 20x PacBio | 1,886,756 | 9,757 | 1,416,766 | 99.82% | — | 14,597,500 | 3.13 Gb |
| | | 30x longest PacBio | 6,085,133 | 13,095 | 4,124,714 | 99.85% | — | 23,825,526 | 3.21 Gb |
| | 4.6 Mbp | 30x Nanopore | 4,680,635 | 1 | 1,850,974 | 99.77% | 1 | 4,680,635 | 4.7 Mb |