Min Xie1, Linfeng Yang2,1, Chenglin Jiang2, Shenshen Wu2, Cheng Luo2, Xin Yang1, Lijuan He1, Shixuan Chen1, Tianquan Deng1, Mingzhi Ye1, Jianbing Yan2,3, Ning Yang4,5.
Abstract
BACKGROUND: Generating chromosome-scale, haplotype-resolved assemblies is important for functional studies. However, current de novo assemblers are either haploid assemblers that discard allelic information or diploid assemblers that can only tackle genomes of low complexity.
Keywords: Diploid; Gamete cells; Haplotype-resolved de novo assembler; Highly heterozygous genomes
Year: 2022 PMID: 35164674 PMCID: PMC8842951 DOI: 10.1186/s12859-022-04591-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schematic of the gcaPDA workflow. (a) Gamete cells (N = 40) are isolated from the focal individual. (b) Whole-genome shotgun sequencing is performed to generate gamete cell short reads. (c) HiFi reads and Hi-C reads are generated by sequencing somatic tissues. (d) An initial assembly is generated by assembling HiFi reads into contigs and scaffolding the contigs into superscaffolds with Hi-C data. (e) Short reads of gamete cells are then mapped to the initial assembly and (f) SNPs are identified for each gamete cell. (g) Chromosome-scale haplotypes of the sequenced individual are reconstructed from the gamete cell SNP arrays using a majority voting strategy [23]; the number of gamete cells supporting each adjacent SNP combination is shown on the left. By comparing the SNPs of each gamete cell with the reconstructed haplotypes, (h) crossovers and (i) haplotype blocks of gamete cells can be determined. Gamete cell reads are (j) partitioned based on haplotype blocks and (k) normalized by k-mer depth to mimic the genome coverage distribution of regular parental WGS reads. Finally, (l) HiFi reads and the partitioned, normalized gamete cell reads are used to construct phased contigs with hifiasm [11], which are scaffolded into phased pseudochromosomes with Hi-C data
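Steps (g) and (h) can be illustrated with a minimal Python sketch. It is not the gcaPDA implementation: the function names, the 0/1/None allele encoding, and the tie-breaking rule are assumptions made for illustration only. For each pair of adjacent heterozygous SNPs, the gametes vote on whether the two alleles sit on the same haplotype, and a change in haplotype origin along a gamete marks a putative crossover.

```python
from typing import List, Optional, Tuple

# Assumed encoding: 0 = reference allele, 1 = alternate allele, None = no call.
Allele = Optional[int]


def phase_by_majority_vote(gametes: List[List[Allele]]) -> List[int]:
    """Reconstruct haplotype 1 over heterozygous SNPs (haplotype 2 is its complement)."""
    n_snps = len(gametes[0])
    hap1 = [0]  # arbitrary anchor: haplotype 1 carries allele 0 at the first SNP
    for i in range(1, n_snps):
        same = diff = 0
        for g in gametes:
            left, right = g[i - 1], g[i]
            if left is None or right is None:
                continue  # a gamete without both calls cannot vote
            if left == right:
                same += 1
            else:
                diff += 1
        prev = hap1[-1]
        # Ties are resolved in favour of "same"; this is an arbitrary choice.
        hap1.append(prev if same >= diff else 1 - prev)
    return hap1


def find_crossovers(gamete: List[Allele], hap1: List[int]) -> List[Tuple[int, int]]:
    """Return (left_snp, right_snp) index pairs between which the haplotype origin switches."""
    # Label every called SNP with the haplotype (1 or 2) that the gamete matches.
    origins = [(i, 1 if a == h else 2)
               for i, (a, h) in enumerate(zip(gamete, hap1)) if a is not None]
    return [(left_i, right_i)
            for (left_i, left_o), (right_i, right_o) in zip(origins, origins[1:])
            if left_o != right_o]


# Hypothetical toy data: five gametes typed at five heterozygous SNPs.
toy_gametes = [
    [0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0],   # recombinant gamete
    [1, None, 0, 1, 0],
    [0, 1, 1, 0, 1],
]
toy_hap1 = phase_by_majority_vote(toy_gametes)
toy_crossovers = find_crossovers(toy_gametes[2], toy_hap1)
```

A real pipeline would also need to handle genotyping errors, which this sketch would report as spurious double crossovers.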
Evaluation of Hifiasm assembly, Trio assembly, Hifiasm + Hi-C assembly and gcaPDA assembly
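| Assembly | Contigs: total (Mb) | Contigs: no. | Contigs: N50 (Mb) | k-mer completeness (%): all | k-mer completeness (%): Hap SK | k-mer completeness (%): Hap B73 | PPV* | Gene completeness (%): Comp | Gene completeness (%): Dup | Gene completeness (%): Frag | Gene completeness (%): Mis |
|---|---|---|---|---|---|---|---|---|---|---|---|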
| SK | 2154 | 671 | 223.2 | 76.15 | 100.00 | 0.00 | NA | 96.1 | 6.0 | 1.3 | 2.6 |
| B73 | 2104 | 267 | 220.8 | 75.97 | 0.00 | 100.00 | NA | 95.5 | 6.1 | 1.5 | 3.0 |
| B73 + SK | 4258 | 938 | 223.2 | 100.00 | 100.00 | 100.00 | NA | 96.9 | 94.8 | 0.9 | 2.2 |
| Hap1 | 2160 | 800 | 77.3 | 75.69 | 54.91 | 48.75 | 53.16 | 94.4 | 8.6 | 1.0 | 4.6 |
| Hap2 | 2160 | 673 | 72.4 | 75.89 | 48.47 | 55.35 | 53.12 | 95.2 | 8.3 | 0.8 | 4.0 |
| Hap1 + Hap2 | 4319 | 1473 | 75.5 | 99.86 | 99.75 | 99.67 | 53.14 | 97.6 | 94.2 | 0.2 | 2.2 |
| HapSK | 2134 | 671 | 96.9 | 75.98 | 97.53 | 2.49 | 97.53 | 96.4 | 5.3 | 0.9 | 2.7 |
| HapB73 | 2124 | 690 | 48.0 | 76.02 | 2.81 | 97.76 | 97.18 | 96.1 | 5.8 | 1.2 | 2.7 |
| HapSK + hapB73 | 4258 | 1361 | 69.6 | 99.96 | 99.89 | 99.93 | 97.36 | 97.7 | 94.9 | 0.2 | 2.1 |
| HapSK | 2160 | 1118 | 60.0 | 76.06 | 99.62 | 0.58 | 99.43 | 96.1 | 7.6 | 0.8 | 2.7 |
| HapB73 | 2118 | 811 | 35.6 | 75.98 | 0.49 | 99.85 | 99.51 | 96.5 | 7.2 | 1.2 | 2.9 |
| HapSK + hapB73 | 4278 | 1929 | 44.9 | 99.90 | 99.65 | 99.92 | 99.47 | 97.8 | 94.9 | 0.4 | 2.1 |
| HapSK | 2162 | 767 | 55.3 | 76.19 | 98.87 | 1.78 | 98.25 | 95.9 | 6.5 | 1.4 | 2.7 |
| HapB73 | 2159 | 699 | 57.0 | 76.41 | 2.25 | 99.46 | 97.77 | 95.4 | 6.2 | 1.6 | 3.0 |
| HapSK + hapB73 | 4320 | 1466 | 55.3 | 99.67 | 99.06 | 99.56 | 98.01 | 96.9 | 94.5 | 1.0 | 2.2 |
*PPV indicates positive predictive value
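As a rough illustration of the PPV column, the sketch below applies the general definition PPV = TP / (TP + FP) to hapmer counts. Treating hapmers of the intended haplotype as true positives and hapmers of the other haplotype as false positives is an assumption here, not a definition given in the source, and the function name and counts are hypothetical.

```python
def hapmer_ppv(target_hapmers_found: int, other_hapmers_found: int) -> float:
    """Percentage of haplotype-specific k-mers in an assembly that belong to the intended haplotype.

    Assumption: true positives are hapmers of the target haplotype and
    false positives are hapmers of the other haplotype.
    """
    total = target_hapmers_found + other_hapmers_found
    if total == 0:
        raise ValueError("no hapmers found in the assembly")
    return 100.0 * target_hapmers_found / total


# Hypothetical counts, not taken from the table above.
example_ppv = hapmer_ppv(target_hapmers_found=975, other_hapmers_found=25)
```

Under this reading, an unphased haplotype that mixes both parents evenly scores close to 50, and a cleanly phased one approaches 100.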
Fig. 2 Comparison of the phasing accuracy of contigs from different maize assemblies. Each contig is represented by a circle, with circles for hap1 contigs filled red and those for hap2 contigs filled blue. The size of a circle is proportional to the total number of k-mers in the contig. The x and y axes give the number of SK hapmers and B73 hapmers identified in a contig, respectively. Panels: (a) Hifiasm assembly, (b) Hifiasm + Hi-C assembly, (c) Trio assembly, (d) gcaPDA assembly
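The per-contig coordinates in such a plot can be obtained by counting how many k-mers of a contig fall into each parental hapmer set. The sketch below is a simplified illustration, not the tool used in the paper: the function names are hypothetical, the hapmer sets are assumed to hold canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement), and the sequence is assumed to contain only A, C, G and T.

```python
from typing import Set, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_hapmers(contig: str, sk_hapmers: Set[str], b73_hapmers: Set[str], k: int) -> Tuple[int, int]:
    """Count k-mers of a contig that are specific to SK (x axis) or to B73 (y axis)."""
    n_sk = n_b73 = 0
    for i in range(len(contig) - k + 1):
        kmer = canonical(contig[i:i + k])
        if kmer in sk_hapmers:
            n_sk += 1
        elif kmer in b73_hapmers:
            n_b73 += 1
    return n_sk, n_b73


# Hypothetical toy input with k = 3; real analyses use much longer k-mers.
toy_counts = count_hapmers("ACGTTT", sk_hapmers={"ACG"}, b73_hapmers={"AAA"}, k=3)
```

A well-phased contig lies close to one axis, whereas a contig that switches haplotype has substantial counts on both.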
Fig. 3 Haplotype blocks in the gcaPDA assembly. Each chromosome is represented by a rectangle, with width proportional to chromosome length. Haplotype blocks are defined by consecutive hapmers of the same haplotype
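The block definition above amounts to a run-length grouping of hapmers ordered along a chromosome. A minimal sketch follows, assuming each hapmer is given as a hypothetical (position, haplotype label) pair already sorted by position; the function name and the toy positions are illustrative only.

```python
from itertools import groupby
from typing import List, Tuple


def haplotype_blocks(hapmers: List[Tuple[int, str]]) -> List[Tuple[str, int, int, int]]:
    """Merge consecutive hapmers with the same label into (label, start, end, n_hapmers) blocks."""
    blocks = []
    for label, run in groupby(hapmers, key=lambda hapmer: hapmer[1]):
        positions = [position for position, _ in run]
        blocks.append((label, positions[0], positions[-1], len(positions)))
    return blocks


# Hypothetical hapmer positions along one pseudochromosome.
toy_blocks = haplotype_blocks(
    [(10, "SK"), (50, "SK"), (90, "B73"), (120, "B73"), (300, "SK")]
)
```

In a correctly phased pseudochromosome, one block would be expected to span nearly the whole chromosome; many short alternating blocks would point to switch errors.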
Fig. 4 Sequence comparison of different assemblies. (a) Chromosome 4 of the HapSK and hapB73 assemblies (generated by gcaPDA) and of the reference B73 assembly (B73) were aligned to chromosome 4 of the reference SK assembly (alignments colored green, blue and purple, respectively). Inset at upper left: a large inversion (c. 8 Mb) between B73 and SK. Inset at lower right: a large InDel (c. 3 Mb) between B73 and SK. (b) Sequence alignments of the ZmBAM1d locus, which is known to be highly divergent between B73 and SK. Circled numbers indicate large InDels between SK and B73. Coverage and identity of the alignments are shown on the left. The ZmBAM1d gene is represented as a green brick