Rei Kajitani1, Dai Yoshimura1, Miki Okuno1, Yohei Minakuchi2, Hiroshi Kagoshima3, Asao Fujiyama3, Kaoru Kubokawa4, Yuji Kohara3, Atsushi Toyoda2,3, Takehiko Itoh5.
Abstract
The ultimate goal of diploid genome determination is to decode each homologous chromosome completely and independently, and several programs that phase haplotypes from consensus sequences have been developed. These methods work well for genomes with low heterozygosity, but many species have highly heterozygous genomes. Additionally, there are highly divergent regions (HDRs), where the haplotype sequences differ considerably. Because HDRs are likely to direct various interesting biological phenomena, many genomic analysis targets fall within these regions. However, HDRs cannot be accessed by existing phasing methods, so costly traditional methods have to be adopted. Here, we develop a de novo haplotype assembler, Platanus-allee ( http://platanus.bio.titech.ac.jp/platanus2 ), which initially constructs each haplotype sequence and then untangles the assembly graphs utilizing sequence links and synteny information. A comprehensive benchmark analysis reveals that Platanus-allee exhibits high recall and precision, particularly for HDRs. Using this approach, previously unknown HDRs are detected in the human genome, which may uncover novel aspects of genome variability.
Year: 2019 PMID: 30979905 PMCID: PMC6461651 DOI: 10.1038/s41467-019-09575-2
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Platanus-allee algorithm. a Schematic model of the concept of phasing and haplotype synteny-based assembly. b Workflow.
Table 1 Phased block statistics
| Species | Assembler | Input data | Total (Mbp) | Bubble total (Mbp) | Bubble-total / genome-size | Scaffold NG50 (kbp) | Scaffold LG50 (#) | Contig NG50 (kbp) | Contig LG50 (#) | % gaps | BUSCO duplicate complete (%) | % exact-match MP15k pairs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| P. polytes | Platanus-allee | PE + 4 MP | 442 | 393 | 0.818 | 404 | 344 | 102 | 1,282 | 2.80 | 79.03 | |
| P. polytes | Platanus-allee | PE + 4 MP + PacBio(20×) | 473 | 456 | 0.950 | | | 161 | 868 | 2.79 | | 33.82 |
| P. polytes | Platanus-allee | PE + 4 MP + 10X | 449 | 407 | 0.848 | 698 | 207 | 113 | 1,214 | 2.60 | 83.25 | 35.61 |
| P. polytes | Platanus-allee | PE + 4 MP + PacBio(20×) + 10X | 476 | 460 | | 2,392 | 65 | 143 | 987 | 2.71 | 88.94 | 33.57 |
| P. polytes | FALCON-Unzip | PacBio(99×) | 481 | 404 | 0.843 | 413 | 352 | 413 | | | 70.68 | 31.10 |
| P. polytes | FALCON-Unzip, Pilon, PH | PacBio(99×) + PE | 471 | 422 | 0.880 | 421 | 353 | | 353 | | 77.48 | 34.60 |
| P. polytes | Supernova | 10X | 313 | 122 | 0.253 | 79 | 789 | 31 | 2,362 | 1.92 | 24.32 | 29.98 |
| B. japonicum | Platanus-allee | PE + 3 MP | 720 | 686 | 0.880 | 1,090 | 194 | 47 | 4,406 | 3.79 | 86.30 | |
| B. japonicum | Platanus-allee | PE + 3 MP + PacBio(20×) | 739 | 715 | 0.916 | 1,514 | 142 | 48 | 4,300 | 4.40 | | |
| B. japonicum | Platanus-allee | PE + 3 MP + 10X | 732 | 694 | 0.889 | 1,155 | 172 | 33 | 6,271 | 3.94 | 84.15 | |
| B. japonicum | Platanus-allee | PE + 3 MP + PacBio(20×) + 10X | 750 | 720 | | | | 34 | 6,124 | 4.53 | 85.28 | |
| B. japonicum | FALCON-Unzip | PacBio(156×) | 918 | 378 | 0.484 | 172 | 1,179 | 172 | 1,179 | | 74.34 | |
| B. japonicum | FALCON-Unzip, Pilon, PH | PE + PacBio(156×) | 978 | 852 | 1.092 | 1,075 | 162 | | | | 80.98 | |
| B. japonicum | Supernova | 10X | 697 | 177 | 0.227 | 18 | 8,779 | 11 | 17,306 | 2.98 | 41.41 | |
| C. elegans | Platanus-allee | PE + 3 MP | 195 | 179 | 0.897 | 470 | 112 | 43 | 1,242 | 4.89 | 77.09 | |
| C. elegans | Platanus-allee | PE + 3 MP + PacBio(20×) | 205 | 198 | | | | 60 | 992 | 4.95 | | |
| C. elegans | FALCON-Unzip | PacBio(192×) | 243 | 224 | 1.121 | 511 | 105 | 511 | | | 82.28 | |
| C. elegans | FALCON-Unzip, Pilon, PH | PacBio(192×) + PE | 232 | 222 | 1.109 | 512 | 105 | | | | 82.49 | |
| H. sapiens | Platanus-allee | PE + 4 MP | 3,898 | 2,018 | 0.325 | 4 | 194,055 | 2 | 454,273 | 6.25 | 6.35 | |
| H. sapiens | Platanus-allee | PE + 4 MP + PacBio(20×) | 5,684 | 5,406 | 0.872 | 306 | 5,294 | 19 | 78,213 | 6.72 | 59.71 | |
| H. sapiens | Platanus-allee | PE + 4 MP + 10X | 4,918 | 4,025 | 0.649 | 59 | 23,612 | 13 | 118,218 | 2.64 | 39.15 | |
| H. sapiens | Platanus-allee | PE + 4 MP + PacBio(20×) + 10X | 5,673 | 5,460 | | 658 | 2,584 | 23 | 71,898 | 3.53 | 68.33 | |
| H. sapiens | FALCON-Unzip | PacBio(77×) | 4,721 | 3,691 | 0.595 | 109 | 13,508 | 109 | 13,508 | | 30.36 | |
| H. sapiens | FALCON-Unzip, Pilon, PH | PacBio(77×) + PE | 4,851 | 3,783 | 0.610 | 124 | 11,817 | | | | 30.98 | |
| H. sapiens | Supernova | 10X | 5,405 | 5,028 | 0.811 | 2,489 | 675 | 124 | 13,651 | 1.38 | 72.38 | |
| H. sapiens | Mostovoy et al. 2016 | PE + 1 MP + 10X + Bionano | 5,535 | 5,353 | 0.863 | | | 9 | 184,955 | 8.25 | | |
Statistics were calculated for phased blocks with a length of ≥ 500 bp. A bold value indicates the best one for each species. Bubbles are phased heterozygous regions. Genome sizes and heterozygosities were estimated based on the k-mer frequency information of PEs and GenomeScope[26]. Bubble-total/genome-size, NG50s and LG50s were calculated based on the estimated diploid genome sizes (P. polytes, 480 Mbp; B. japonicum, 780 Mbp; C. elegans, 200 Mbp; H. sapiens, 6.2 Gbp). Estimated heterozygosities are shown in Supplementary Table 5. BUSCO[27] (version 3.0.2) was used to estimate the rate of the phased single-copy genes for P. polytes, B. japonicum, C. elegans and H. sapiens with the endopterygota set (2442 orthologs), the metazoa set (978 orthologs), the nematoda set (982 orthologs) and the euarchotoglires set (6192 orthologs), respectively.
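The NG50 and LG50 values above are computed against the estimated genome size rather than the assembly total. The following is a minimal Python sketch of that calculation, assuming the standard definitions: NG50 is the length of the sequence at which the cumulative length, summed from longest to shortest, first reaches half the genome size, and LG50 is the number of sequences needed to get there. The function name `ng50_lg50`, its parameters, and the toy lengths are illustrative and are not taken from the Platanus-allee code; the 500 bp default mirrors the length cutoff stated in the table note.

```python
def ng50_lg50(lengths_bp, genome_size_bp, min_length_bp=500):
    """Return (NG50, LG50) for a list of sequence lengths in bp.

    Sequences shorter than min_length_bp are ignored. Returns (None, None)
    when the retained sequences cover less than half the genome size.
    """
    half_genome = genome_size_bp / 2
    kept = sorted((n for n in lengths_bp if n >= min_length_bp), reverse=True)
    cumulative = 0
    for count, length in enumerate(kept, start=1):
        cumulative += length
        if cumulative >= half_genome:
            # length is the NG50, count is the LG50
            return length, count
    return None, None


# Toy example (hypothetical lengths, not data from the paper).
toy_lengths = [1000, 4000, 300, 2000, 3000]
print(ng50_lg50(toy_lengths, genome_size_bp=12000))
```

Because the denominator is the genome size, an assembly that is much smaller than the genome can have no NG50 at all, which is why the sketch returns `(None, None)` in that case instead of falling back to N50.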
Fig. 2 Benchmarking for P. polytes and B. japonicum. a Alignment between the P. polytes bubbles mapped to the dsx-gene locus. Dot plot was obtained using Nucmer and a modified version of Mummerplot in Mummer package[57]. b Alignment between Platanus-allee bubble and the results obtained using other methods for the dsx-gene locus. c Precision-recall evaluation of B. japonicum based on Moleculo synthetic long reads. Underlined numbers indicate F-measures (harmonic mean of recall and precision). d Relation between phasing performances and heterozygosity of amphioxus. e Examples of highly divergent Platanus-allee bubbles obtained in the amphioxus analysis. Heterozygosities between bubble sequences were calculated based on the sequence difference (1 − (number of matches/alignment-length)) for the mapped 1k-mer pairs obtained from the Moleculo data using 100 kbp-windows. Alignments (“Arrangements of other results”) in b, e were performed using Minimap2[20] (see “Methods” section).
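The two quantities named in this caption, the F-measure and the sequence-difference heterozygosity, are simple enough to state as code. The sketch below is a direct Python transcription of the formulas given in the caption; the function names and the example numbers are hypothetical and do not come from the paper's pipeline.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def sequence_difference(matches, alignment_length):
    """Heterozygosity estimate: 1 - (number of matches / alignment length)."""
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    return 1 - matches / alignment_length


# Hypothetical values for illustration only.
print(f_measure(precision=0.9, recall=0.6))
print(sequence_difference(matches=950, alignment_length=1000))
```

The harmonic mean penalizes an imbalance between precision and recall more strongly than an arithmetic mean would, so a method cannot score well by maximizing only one of the two.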
Table 2 Consensus sequence statistics
| Species | Assembler | Input data | Total (Mbp) | Total / genome-size | Scaffold NG50 (kbp) | Scaffold LG50 (#) | Contig NG50 (kbp) | Contig LG50 (#) | % gaps | BUSCO single complete (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| P. polytes | Platanus-allee | PE + 4 MP | 248 | 1.032 | | | 157 | 460 | 3.22 | 95.95 |
| P. polytes | Platanus-allee | PE + 4 MP + PacBio(20×) | 247 | 1.028 | 7,784 | | 164 | 438 | 3.18 | 96.19 |
| P. polytes | Platanus-allee | PE + 4 MP + 10X | 248 | 1.031 | 6,621 | 13 | 138 | 516 | 3.08 | 96.40 |
| P. polytes | Platanus-allee | PE + 4 MP + PacBio(20×) + 10X | 247 | 1.029 | 7,845 | 11 | 145 | 482 | 3.02 | 96.44 |
| P. polytes | Platanus (v1.2.4) | PE + 4 MP | 232 | 0.968 | 5,995 | 15 | 159 | 438 | 1.37 | |
| P. polytes | FALCON-Unzip | PacBio(99×) | 261 | 1.088 | 5,196 | 18 | 5,196 | | | 89.35 |
| P. polytes | FALCON-Unzip, Pilon, PH | PacBio(99×) + PE | 242 | | 5,199 | 18 | | | | 92.47 |
| P. polytes | Supernova | 10X | 257 | 1.070 | 329 | 201 | 98 | 638 | 1.75 | 88.00 |
| B. japonicum | Platanus-allee | PE + 3 MP | 384 | 0.984 | 5,336 | 22 | 48 | 2,149 | 4.40 | 94.07 |
| B. japonicum | Platanus-allee | PE + 3 MP + PacBio(20×) | 389 | | | | 49 | 2,120 | 5.12 | |
| B. japonicum | Platanus-allee | PE + 3 MP + 10X | 392 | 1.004 | 4,914 | 21 | 34 | 3,104 | 4.52 | 93.87 |
| B. japonicum | Platanus-allee | PE + 3 MP + PacBio(20×) + 10X | 397 | 1.017 | 5,413 | 17 | 35 | 3,030 | 5.15 | 94.07 |
| B. japonicum | Platanus (v1.2.4) | PE + MP | 488 | 1.250 | 239 | 414 | 10 | 10,393 | 13.89 | 65.75 |
| B. japonicum | FALCON-Unzip | PacBio(156×) | 707 | 1.812 | 4,301 | 32 | | | | 27.91 |
| B. japonicum | FALCON-Unzip, Pilon, PH | PE + PacBio(156×) | 406 | 1.042 | 3,259 | 39 | 3,259 | 39 | | 84.97 |
| B. japonicum | Supernova | 10X | 662 | 1.697 | 45 | 2,271 | 23 | 5,554 | 2.86 | 54.09 |
| C. elegans | Platanus-allee | PE + 3 MP | 106 | 1.058 | 2,388 | 14 | 63 | 458 | 4.76 | 95.52 |
| C. elegans | Platanus-allee | PE + 3 MP + PacBio(20×) | 107 | 1.065 | | | 64 | 456 | 5.13 | 95.11 |
| C. elegans | Platanus (v1.2.4) | PE + 3 MP | 102 | | 1,848 | 19 | 71 | 364 | 2.30 | |
| C. elegans | FALCON-Unzip | PacBio(192×) | 109 | 1.093 | 2,064 | 17 | | | | 93.18 |
| C. elegans | FALCON-Unzip, Pilon, PH | PacBio(192×) + PE | 103 | 1.029 | 2,064 | 17 | 2,064 | | | 94.60 |
| H. sapiens | Platanus-allee | PE + 4 MP | 2,894 | 0.934 | 3,917 | 230 | 23 | 34,650 | 4.08 | 88.82 |
| H. sapiens | Platanus-allee | PE + 4 MP + PacBio(20×) | 2,995 | 0.966 | 3,717 | 250 | 22 | 36,365 | 6.76 | 86.45 |
| H. sapiens | Platanus-allee | PE + 4 MP + 10X | 2,914 | 0.940 | 3,050 | 305 | 25 | 33,967 | 2.46 | 89.34 |
| H. sapiens | Platanus-allee | PE + 4 MP + PacBio(20×) + 10X | 2,954 | | 3,730 | 252 | 24 | 34,583 | 3.66 | 88.66 |
| H. sapiens | Platanus (v1.2.4) | PE + 4 MP | 2,706 | 0.873 | 13,149 | 65 | 24 | 33,902 | 1.62 | |
| H. sapiens | FALCON-Unzip | PacBio(77×) | 2,788 | 0.899 | 8,670 | 98 | | | | 87.97 |
| H. sapiens | FALCON-Unzip, Pilon, PH | PacBio(77×) + PE | 2,791 | 0.900 | 8,669 | 98 | | | | 88.97 |
| H. sapiens | Supernova | 10X | 2,942 | 0.949 | 30,823 | | 147 | 5,954 | 1.42 | 90.33 |
| H. sapiens | Mostovoy et al. 2016 | PE + 1 MP + 10X + Bionano | 2,857 | 0.922 | | 34 | 9 | 91,880 | 10.22 | 88.76 |
Statistics were calculated for consensus (pseudo-haploid) sequences with a length of ≥ 500 bp. P. polytes, B. japonicum, C. elegans and H. sapiens correspond to a butterfly, an amphioxus, a worm and a human (NA12878), respectively. A bold value indicates the best one for each species. Genome sizes were estimated based on the k-mer frequency information of PEs and GenomeScope[26]. Total/genome-size, NG50s, and LG50s were calculated based on the estimated haploid genome sizes (P. polytes, 240 Mbp; B. japonicum, 390 Mbp; C. elegans, 100 Mbp; H. sapiens, 3.1 Gbp). BUSCO was used to estimate the rate of the non-redundantly constructed single-copy genes in a similar manner for the phased blocks (Table 1). The formats of the results from FALCON-Unzip and Supernova were “primary-contigs” and “pseudohap,” respectively.
Fig. 3 Benchmarking for the synthetic diploid sample of C. elegans. a Precision-recall evaluation for the synthetic diploid C. elegans based on the reference genomes. Underlined numbers indicate F-measures. b Phasing performances and heterozygosity for the entire genome and a highly divergent region of the synthetic diploid C. elegans. Phased 1k-mer pair rate and heterozygosity in the entire genome (above two graphs) and the highly divergent chromosome II end (bottom two graphs) were calculated using 1 Mbp- and 100 kbp-windows, respectively. Alignments (“Arrangements of the other results”) were performed using Minimap2[20] (see “Methods” section).
Fig. 4 Benchmarking for H. sapiens. a Precision-recall evaluation of the human sample based on the Moleculo synthetic long reads. Underlined numbers indicate F-measures. b Relation between phasing performances and heterozygosity using a human sample. c Phasing results for the human MHC class II region. Positions on chromosome 6 and gene annotation were based on human reference genome GRCh38.p10. Heterozygosity was calculated based on the sequence difference (1 − (number of matches/alignment-length)) for the mapped 1k-mer pairs from the Moleculo data using 1 Mbp-windows. d Dot plot between the bubble of Platanus-allee covering the MHC class II region. e Highly divergent and non-localized bubble obtained using human NA12878 data. Alignments (“Arrangements of the other results”) in c, e were performed using Minimap2[20] (see “Methods” section). Dot plots in d, e were depicted using Nucmer and a modified version of Mummerplot in Mummer package[57].