Xiao Luo, Xiongbin Kang, Alexander Schönhuth.
Abstract
Haplotype-aware diploid genome assembly is crucial in genomics, precision medicine, and many other disciplines. Long-read sequencing technologies have greatly improved genome assembly. However, current long-read assemblers are either reference based, and therefore introduce biases, or fail to capture the haplotype diversity of diploid genomes. We present phasebook, a de novo approach for reconstructing the haplotypes of diploid genomes from long reads. phasebook outperforms other approaches in terms of haplotype coverage by large margins, while achieving competitive performance in terms of assembly errors and assembly contiguity.
Keywords: Diploid; Genome assembly; Haplotype; Long reads
Year: 2021 PMID: 34706745 PMCID: PMC8549298 DOI: 10.1186/s13059-021-02512-x
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1. An overview of phasebook. Steps shown in purple text are optional. Correcting errors in raw reads is recommended for long reads with a high sequencing error rate, such as PacBio CLR and ONT reads. Filtering overlaps based on SNPs is recommended for small genomes or specific genomic regions such as the MHC
Fig. 2. A schematic diagram of read phasing and super read generation in a read cluster. The blue reads belong to haplotype 1, and the orange reads belong to haplotype 2. The bold sky-blue read is the seed read. We use WhatsHap to separate reads into two groups based on the SNPs covered by the long reads. The dashed regions in super reads represent bases with low sequencing coverage, which can optionally be trimmed to generate corrected super reads
Fig. 3. A schematic diagram of super read overlap graph construction. The blue super reads belong to haplotype 1, and the orange super reads belong to haplotype 2. The solid arrows (blue or orange) represent non-transitive edges, and the dashed arrows represent transitive edges. The solid reddish-purple arrows represent spurious edges caused by incorrect overlaps between super reads from different haplotypes
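The distinction between transitive and non-transitive edges in Fig. 3 can be illustrated with a minimal sketch. It assumes a plain directed graph given as `(source, target)` pairs and treats an edge `u -> w` as transitive when some intermediate node `v` has both `u -> v` and `v -> w`. The function name and the toy super read labels are hypothetical, and the sketch ignores overlap lengths, read orientation, and haplotype information, so it is not phasebook's actual implementation.

```python
def split_transitive_edges(edges):
    """Split directed edges into (non_transitive, transitive) lists.

    An edge (u, w) counts as transitive if another node v exists with
    edges (u, v) and (v, w), i.e. the edge is implied by a two-step path.
    """
    # Build successor sets for fast membership tests.
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    transitive = set()
    for u, w in edges:
        for v in successors.get(u, ()):
            # Skip the edge itself and self-loops as intermediates.
            if v in (u, w):
                continue
            if w in successors.get(v, ()):
                transitive.add((u, w))
                break

    non_transitive = [e for e in edges if e not in transitive]
    return non_transitive, sorted(transitive)


# Toy example: three overlapping super reads from one haplotype.
toy_edges = [("s1", "s2"), ("s2", "s3"), ("s1", "s3")]
kept, removed = split_transitive_edges(toy_edges)
```

Here `s1 -> s3` is implied by `s1 -> s2 -> s3`, so it is reported as transitive and the other two edges are kept.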
Benchmarking results for PacBio HiFi data
| Dataset | Assembler | Size (Mb) | HC (%) | k-mer recovery (%): All | k-mer recovery (%): Mat | k-mer recovery (%): Pat | NG50 (bp) | NGA50 (bp) | Phased N50 (bp) | QV | Switch error (%) | N (%) | Dup (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MHC (HiFi 15x) | phasebook-hi | 11.2 | 99.4 | 99.9 | 99.9 | 99.5 | 546,476 | 539,378 | 383,554 | 46.7 | 0.04 | 0.0 | 1.19 |
| | HiCanu | 9.5 | 90.8 | 100.0 | 99.8 | 99.9 | 4,827,925 | 517,913 | 460,744 | 42.1 | 0.00 | 0.0 | 1.23 |
| | Flye | 5.1 | 56.0 | 96.0 | 77.4 | 78.1 | 138,470 | 10,861 | 58,919 | 44.0 | 2.79 | 0.0 | 0.92 |
| | Hifiasm | 9.7 | 97.3 | 100.0 | 99.9 | 100.0 | 4,878,334 | 2,205,550 | 620,366 | 59.8 | 0.01 | 0.0 | 1.21 |
| | IPA | 8.6 | 81.9 | 99.5 | 92.7 | 99.6 | 1,426,746 | 818,630 | 589,604 | 47.8 | 0.03 | 0.0 | 1.15 |
| | Wtdbg2 | 4.7 | 49.8 | 91.2 | 47.0 | 70.5 | – | – | 72,528 | 38.6 | 3.76 | 0.0 | 0.77 |
| | | 9.0 | 56.5 | 84.7 | 54.7 | 54.0 | 257,599 | 155,079 | 167,061 | 31.4 | 5.90 | 9.0 | 1.50 |
| | | 9.0 | 59.4 | 84.7 | 55.2 | 53.7 | 257,599 | 155,079 | 176,555 | 31.4 | 6.26 | 9.0 | 1.43 |
| HG00733 (Chr6) (HiFi 18x) | phasebook-hi | 378.6 | 91.2 | 99.3 | 96.7 | 96.3 | 517,478 | 491,919 | 317,664 | 47.6 | 2.08 | 0.0 | 1.53 |
| | HiCanu | 344.6 | 83.9 | 99.7 | 97.3 | 97.4 | 1,456,437 | 991,541 | 440,882 | 39.2 | 1.76 | 0.0 | 1.42 |
| | Flye | 169.0 | 55.1 | 97.9 | 53.3 | 49.3 | – | – | 70,128 | 45.0 | 11.40 | 0.0 | 0.97 |
| | Hifiasm | 341.5 | 93.5 | 99.7 | 97.1 | 96.5 | 28,008,203 | 10,513,146 | 673,184 | 46.4 | 1.81 | 0.0 | 1.12 |
| | IPA | 280.5 | 73.3 | 99.2 | 82.8 | 86.0 | 1,612,661 | 722,645 | 460,373 | 41.7 | 1.95 | 0.0 | 1.21 |
| | Wtdbg2 | 167.3 | 54.8 | 97.3 | 49.4 | 47.3 | – | – | 72,850 | 40.6 | 16.09 | 0.0 | 0.97 |
| | | 340.8 | 76.5 | 99.3 | 91.7 | 91.1 | 379,321 | 354,305 | 330,339 | 40.8 | 6.48 | 0.4 | 1.3 |
| | | 340.9 | 76.5 | 99.3 | 91.7 | 91.1 | 381,196 | 359,841 | 327,329 | 40.9 | 6.52 | 0.4 | 1.3 |
| HG002 (HiFi 14x) | phasebook-hi | 6709 | – | 97.5 | 80.0 | 85.0 | 136,140 | – | 111,668 | 50.5 | 0.33 | 0.0 | – |
| | HiCanu | 2953 | – | 97.1 | 49.8 | 54.0 | – | – | 937,018 | 57.9 | 0.15 | 0.0 | – |
| | Falcon | 2955 | – | 97.3 | 49.5 | 62.2 | – | – | 501,274 | 49.1 | 0.40 | 0.0 | – |
| | Hifiasm | 3067 | – | 97.5 | 49.6 | 65.3 | – | – | 1,146,665 | 54.0 | 0.11 | 0.0 | – |
| | | 6435 | – | 98.8 | 87.8 | 90.2 | 145,138,636 | – | 858,407 | 40.0 | 1.64 | 5.4 | – |
| | | 6435 | – | 98.8 | 87.8 | 90.2 | 145,138,636 | – | 858,156 | 40.0 | 1.64 | 5.4 | – |
The sequencing technology and the average sequencing coverage per haplotype are shown in the first column. The MHC dataset is simulated, whereas the others are real. Size (Mb) is the total size of the assembly generated by each assembler. Because no high-quality phased assemblies are available as ground truth, haplotype coverage and NGA50 are not provided for HG002. NG50 and NGA50 are calculated against the diploid genome size (twice the haploid genome size). The haploid genome sizes of MHC, HG00733 (Chr6), and HG002 are 4.7 Mb, 171 Mb, and 3.1 Gb, respectively. HC (%) is the haplotype coverage. In the k-mer recovery (%) columns, All is the k-mer completeness for both haplotypes combined, Mat is the maternal hap-mer completeness, and Pat is the paternal hap-mer completeness. N (%) is the proportion of ambiguous bases. Dup (%) is the duplication ratio. The assemblers marked in italics (HapCut2 and WhatsHap) are reference-guided methods, whereas the others are de novo assembly methods. Note that for the MHC and HG00733 (Chr6) datasets we compared against IPA, the official PacBio assembler for HiFi reads, instead of Falcon. The publicly released assemblies of HG002 (Canu, Falcon, Hifiasm) were used directly for comparison
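To make the effect of the diploid genome size on NG50 concrete, the following is a minimal sketch using the standard NG50 definition: the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the reference genome size. The function name and the toy contig lengths are hypothetical. Only the 4.7 Mb haploid MHC size comes from the table notes above.

```python
def ng50(contig_lengths, genome_size):
    """Return NG50 for the given contig lengths and reference genome size.

    Returns None when the contigs together cover less than half of
    genome_size, in which case NG50 is undefined.
    """
    target = genome_size / 2
    cumulative = 0
    for length in sorted(contig_lengths, reverse=True):
        cumulative += length
        if cumulative >= target:
            return length
    return None


# Toy contig set (hypothetical lengths in bp), not taken from the paper.
toy_contigs = [4_000_000, 3_000_000, 2_000_000, 1_000_000]

haploid_size = 4_700_000          # MHC haploid size from the table notes
diploid_size = 2 * haploid_size   # size used for NG50/NGA50 in the tables

ng50_haploid = ng50(toy_contigs, haploid_size)   # 4_000_000
ng50_diploid = ng50(toy_contigs, diploid_size)   # 3_000_000
ng50_too_small = ng50([1_000_000], diploid_size) # None
```

Measured against the diploid size, the same contig set yields a smaller NG50, and an assembly that totals less than half of the diploid size has no NG50 at all.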
Benchmarking results for PacBio CLR data
| Dataset | Assembler | Size (Mb) | HC (%) | k-mer recovery (%): All | k-mer recovery (%): Mat | k-mer recovery (%): Pat | NG50 (bp) | NGA50 (bp) | Phased N50 (bp) | QV | Switch error (%) | N (%) | Dup (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MHC (CLR 25x) | phasebook | 10.8 | 95.2 | 96.7 | 88.0 | 76.3 | 172,577 | 172,577 | 133,141 | 37.7 | 0.66 | 0.0 | 1.36 |
| | phasebook-hi | 14.3 | 98.0 | 97.5 | 89.4 | 88.8 | 361,721 | 354,437 | 122,594 | 40.6 | 6.26 | 0.0 | 1.88 |
| | Canu | 5.9 | 59.2 | 96.3 | 80.7 | 76.9 | 2,184,005 | 62,395 | 65,217 | 39.4 | 4.57 | 0.0 | 0.92 |
| | Falcon | 5.4 | 60.5 | 94.3 | 82.2 | 68.8 | 4,814,264 | 32,719 | 120,818 | 27.6 | 5.24 | 0.0 | 1.17 |
| | Flye | 5.0 | 74.1 | 94.2 | 64.2 | 70.5 | 548,628 | 66,242 | 74,992 | 37.1 | 5.34 | 0.0 | 1.01 |
| | Wtdbg2 | 4.7 | 58.9 | 90.5 | 46.9 | 63.2 | – | – | 102,431 | 33.0 | 5.49 | 0.0 | 0.93 |
| | | 9.1 | 56.4 | 84.2 | 53.7 | 53.0 | 393,164 | 254,386 | 282,817 | 31.3 | 5.93 | 8.9 | 1.52 |
| | | 9.2 | 56.6 | 84.2 | 53.4 | 53.2 | 393,164 | 254,386 | 279,136 | 31.2 | 5.62 | 8.8 | 1.52 |
| HG00733 (Chr6) (CLR 44x) | phasebook | 453.2 | 92.9 | 98.7 | 89.7 | 90.6 | 256,934 | 253,785 | 164,373 | 32.6 | 5.50 | 0.0 | 1.92 |
| | phasebook-hi | 291.0 | 81.0 | 97.9 | 68.2 | 65.3 | 587,151 | 552,300 | 201,382 | 33.9 | 14.41 | 0.0 | 1.33 |
| | Canu | 178.0 | 56.8 | 97.9 | 52.7 | 52.9 | 110,328 | – | 119,821 | 38.8 | 17.57 | 0.0 | 0.98 |
| | Falcon | 185.6 | 63.4 | 95.1 | 41.8 | 42.0 | 155,444 | 132,950 | 142,167 | 28.6 | 22.22 | 0.0 | 1.04 |
| | Flye | 168.1 | 51.9 | 97.6 | 25.4 | 75.8 | – | – | 2,094,032 | 42.9 | 3.12 | 0.0 | 0.99 |
| | Wtdbg2 | 165.2 | 64.6 | 88.9 | 35.3 | 35.3 | – | – | 142,300 | 24.2 | 20.07 | 0.0 | 1.0 |
| | | 341.3 | 61.8 | 99.3 | 92.1 | 91.7 | 3,899,799 | 1,944,878 | 1,346,888 | 41.0 | 5.65 | 0.4 | 1.57 |
| | | 341.3 | 63.4 | 99.3 | 92.1 | 91.6 | 3,349,274 | 1,944,877 | 1,334,202 | 40.9 | 5.72 | 0.4 | 1.53 |
| HG002 (CLR 25x) | phasebook | 5829 | – | 92.3 | 60.1 | 59.2 | 96,740 | – | 70,473 | 31.9 | 1.61 | 0.0 | – |
| | phasebook-hi | 6590 | – | 97.1 | 62.0 | 70.6 | 312,775 | – | 150,123 | 35.2 | 7.16 | 0.0 | – |
| | Canu | 3119 | – | 97.1 | 49.5 | 62.3 | 47,412 | – | 207,853 | 40.0 | 6.49 | 0.0 | – |
| | | 6435 | – | 98.8 | 87.9 | 90.2 | 145,138,636 | – | 1,756,246 | 40.1 | 1.52 | 5.4 | – |
| | | 6435 | – | 98.8 | 87.9 | 90.2 | 145,138,636 | – | 1,729,580 | 40.1 | 1.52 | 5.4 | – |
| A. thaliana (CLR 75x) | phasebook | 301 | – | 89.7 | 76.6 | 76.0 | 66,120 | – | 39,513 | 27.0 | 2.78 | 0.0 | – |
| | phasebook-hi | 296 | – | 94.9 | 89.5 | 88.8 | 301,078 | – | 149,964 | 33.4 | 4.25 | 0.0 | – |
| | Canu | 238 | – | 94.4 | 86.8 | 86.6 | 204,191 | – | 64,998 | 31.9 | 4.88 | 0.0 | – |
| | Flye | 142 | – | 87.1 | 62.8 | 62.5 | 35,796 | – | 30,209 | 30.0 | 7.54 | 0.0 | – |
| | Wtdbg2 | 125 | – | 35.1 | 26.4 | 26.3 | 11,374 | – | 35,733 | 14.4 | 15.17 | 0.0 | – |
| | | 238 | – | 91.3 | 99.4 | 53.3 | 18,585,056 | – | 531,267 | 36.4 | 5.73 | 0.2 | – |
| | | 238 | – | 91.3 | 99.5 | 53.3 | 18,585,056 | – | 1,637,965 | 36.3 | 5.72 | 0.2 | – |
Because no high-quality phased assemblies are available as ground truth, haplotype coverage and NGA50 are not provided for HG002 and A. thaliana. We were unable to run Falcon, Flye, and Wtdbg2 on the HG002 (CLR) data on a machine with 48 cores and 1 TB RAM, probably because they ran out of memory. phasebook-hi denotes a two-step procedure: Canu's error correction and trimming modules are first applied to the raw noisy long reads, and the corrected reads are then assembled with phasebook in HiFi mode
Benchmarking results for ONT data
| Dataset | Assembler | Size (Mb) | HC (%) | k-mer recovery (%): All | k-mer recovery (%): Mat | k-mer recovery (%): Pat | NG50 (bp) | NGA50 (bp) | Phased N50 (bp) | QV | Switch error (%) | N (%) | Dup (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MHC (ONT 25x) | phasebook | 14.5 | 97.0 | 98.4 | 94.0 | 87.7 | 152,599 | 152,599 | 102,954 | 36.9 | 0.78 | 0.0 | 1.82 |
| | phasebook-hi | 14.8 | 91.6 | 97.5 | 89.4 | 88.8 | 796,423 | 746,361 | 176,895 | 40.6 | 6.99 | 0.0 | 1.96 |
| | Canu | 5.9 | 59.6 | 95.4 | 83.2 | 65.9 | 1,419,330 | 93,217 | 100,918 | 33.4 | 6.44 | 0.0 | 0.84 |
| | Falcon | 6.3 | 67.8 | 94.3 | 82.2 | 68.8 | 2,489,921 | 442,832 | 220,087 | 27.6 | 4.65 | 0.0 | 1.10 |
| | Flye | 5.0 | 54.5 | 95.0 | 81.4 | 62.6 | 792,993 | 86,893 | 106,195 | 39.7 | 6.22 | 0.0 | 0.82 |
| | Shasta | 5.0 | 63.0 | 86.8 | 62.5 | 48.5 | 957,250 | 136,119 | 147,280 | 23.7 | 10.71 | 0.0 | 0.89 |
| | Wtdbg2 | 5.0 | 76.5 | 89.1 | 67.0 | 47.4 | 4,821,843 | 208,838 | 136,335 | 24.3 | 4.33 | 0.0 | 0.99 |
| | | 9.1 | 55.9 | 84.5 | 55.6 | 52.9 | 493,723 | 279,218 | 367,400 | 31.5 | 6.33 | 8.9 | 1.53 |
| | | 9.2 | 56.9 | 84.4 | 55.3 | 53.0 | 493,723 | 279,218 | 324,450 | 31.5 | 5.47 | 8.9 | 1.51 |
| NA19240 (Chr6) (ONT 26x) | phasebook | 380.8 | 84.4 | 87.5 | 53.9 | 55.1 | 90,219 | 85,595 | 51,291 | 20.6 | 36.96 | 0.0 | 1.87 |
| | phasebook-hi | 335.5 | 79.4 | 83.7 | 46.3 | 45.9 | 127,968 | 122,097 | 63,798 | 22.3 | 40.12 | 0.0 | 1.76 |
| | Canu | 169.2 | 65.4 | 83.9 | 40.7 | 40.8 | – | – | 129,586 | 22.5 | 35.40 | 0.0 | 0.98 |
| | Falcon | 179.3 | 64.0 | 77.7 | 33.7 | 35.4 | 69,660 | 37,993 | 162,461 | 20.5 | 35.72 | 0.0 | 1.02 |
| | Flye | 167.4 | 57.3 | 89.4 | 40.1 | 48.3 | – | – | 115,043 | 25.1 | 31.85 | 0.0 | 0.99 |
| | Shasta | 164.9 | 62.7 | 76.8 | 33.2 | 34.2 | – | – | 190,758 | 20.2 | 39.98 | 0.0 | 0.99 |
| | Wtdbg2 | 166.7 | 60.6 | 78.3 | 33.2 | 33.9 | – | – | 160,729 | 20.4 | 35.13 | 0.0 | 0.98 |
| | | 341.3 | 56.2 | 96.3 | 71.2 | 71.2 | 27,552,232 | 3,285,944 | 105,889 | 23.0 | 29.39 | 0.4 | 1.68 |
| | | 341.3 | 56.4 | 96.2 | 71.1 | 71.3 | 27,552,232 | 3,520,811 | 102,096 | 23.0 | 28.45 | 0.4 | 1.70 |
| HG002 (ONT 38x) | phasebook | 5691 | – | 92.0 | 62.2 | 63.2 | 390,659 | – | 202,723 | 26.6 | 2.28 | 0.0 | – |
| | Canu | 2901 | – | 84.5 | 39.3 | 51.6 | – | – | 266,624 | 23.2 | 10.27 | 0.0 | – |
| | Flye | 2928 | – | 83.9 | 38.5 | 50.6 | – | – | 287,619 | 23.0 | 11.94 | 0.0 | – |
| | Shasta | 2805 | – | 88.8 | 40.9 | 53.2 | – | – | 276,802 | 25.0 | 12.59 | 0.0 | – |
| | Wtdbg2 | 2794 | – | 83.0 | 34.4 | 44.2 | – | – | 239,707 | 22.8 | 9.62 | 0.0 | – |
The publicly released assemblies of HG002 (Canu, Flye, Shasta, Wtdbg2) were used directly for comparison. We were unable to run HapCut2 and WhatsHap on the HG002 data on a machine with 48 cores and 1 TB RAM, probably because they ran out of memory