Haoling Xie, Wen Li, Yuqiong Hu, Cheng Yang, Jiansen Lu, Yuqing Guo, Lu Wen, Fuchou Tang.
Abstract
Genome assembly has benefited from long-read sequencing technologies with higher accuracy and higher continuity. However, most human genome assemblies require large amounts of DNA from homogeneous cell lines and do not preserve cell heterogeneity, since cell heterogeneity could profoundly affect haplotype assembly results. Herein, using a single-cell genome long-read sequencing technology (SMOOTH-seq), we sequenced K562 and HG002 cells on the PacBio HiFi and Oxford Nanopore Technologies (ONT) platforms and conducted de novo genome assembly. For the first time, we completed a human genome assembly with high continuity (NG50 of ∼2 Mb using 95 individual K562 cells) at the single-cell level, and explored the impact of different assemblers and sequencing strategies on genome assembly. With sequencing data from 30 individual diploid HG002 cells of relatively high genome coverage (average coverage ∼41.7%) on the ONT platform, the NG50 can reach over 1.3 Mb. Furthermore, with the genome assembled from the K562 single-cell dataset, a more complete and accurate set of insertion events and complex structural variations could be identified. This study opens a new chapter in the practice of single-cell de novo genome assembly.
Year: 2022 PMID: 35819189 PMCID: PMC9303314 DOI: 10.1093/nar/gkac586
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
Figure 1. The assembly workflow and K562 assembly metrics. (A) The workflow illustrates the samples, sequencing platforms, assemblers and evaluation metrics used to demonstrate the feasibility of genome assembly from scWGS datasets. (B) K562 assembly (95 cells) benchmarking results for PacBio HiFi data (primary contigs). N50 is the length of the shortest contig among the longest contigs that together cover half of the total assembly size; NG50 is the length of the shortest contig among the longest contigs that together cover half of the reference genome size; NGA50 is the length of the shortest aligned block at half of the reference genome size; BUSCO is a tool that assesses the completeness of benchmarking universal single-copy orthologs present in an assembly; the per-base consensus quality value (QV) represents a log-scaled probability of error for the assembly, with higher QVs indicating a more accurate consensus. (C) MHC assemblies of K562 cells compared with the reference human genome (hg38). Only contigs longer than 500 kb are displayed.
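The contiguity and accuracy metrics defined in the Figure 1 caption can be illustrated with a minimal sketch. The function names and the toy contig lengths below are hypothetical and are not taken from the study; the QV conversion assumes the standard Phred convention, QV = -10 log10(p), which is one common reading of "log-scaled probability of error".

```python
import math

def nx(contig_lengths, total, fraction=0.5):
    """Length of the shortest contig in the smallest set of longest
    contigs whose summed length reaches fraction * total.
    Returns None when the contigs never reach that target."""
    target = fraction * total
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return None

def n50(contig_lengths):
    # N50 is measured against the total assembly size.
    return nx(contig_lengths, sum(contig_lengths))

def ng50(contig_lengths, genome_size):
    # NG50 is measured against the reference (or estimated) genome size.
    return nx(contig_lengths, genome_size)

def qv_from_error_rate(p):
    # Assumed Phred-style scaling: an error rate of 1e-3 gives QV 30.
    return -10 * math.log10(p)

# Toy contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_contigs)
toy_ng50 = ng50(toy_contigs, genome_size=400)
```

Because NG50 is measured against the genome size rather than the assembly size, it is lower than N50 whenever the assembly is smaller than the genome, which makes it the more conservative figure for incomplete assemblies.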
Figure 2. Single-cell haplotype assembly metrics of HG002 cells. (A) HG002 assembly (157 cells) benchmarking results for PacBio HiFi mode (primary contigs). QUAST diffs reports the number of large structural discrepancies (>5 kb) observed between the assemblies and the phased HG002 reference genome, normalized by the assembly size (in Mb). The total bases of collapsed sequences and expandable sequences report the number of bp that are collapsed and potentially expandable in each assembly (smaller is better). (B) Trio HiCanu HG002 heterozygosity assembly statistics. Using k-mers specific to each of the two parental genomes, trio HiCanu separated the reads by parental haplotype, and we then used wtdbg2 to assemble the parental haplotypes. QUAST diffs reports the number of large structural discrepancies (>5 kb) observed between the HG003/HG004 haplotype assemblies and the HG003/HG004 haplotype reference genomes, normalized by the assembly size (in Mb). (C) Trio HiCanu HG002 MHC haplotype assemblies compared with the reference human genome (hg38); only contigs longer than 80 kb are displayed. (D) Trio HiCanu HG002 primary contigs and associated haplotigs (from wtdbg2) spanning the MHC region, displayed and annotated with various HLA genes.
Figure 3. Assembly metrics for 192 HG002 cells at a lower sequencing depth per cell and 30 HG002 cells at a higher genome coverage. (A) HG002 assembly (192 cells) benchmarking results for ONT mode (primary contigs). The per-base consensus quality value (QV) was estimated by Merqury. QUAST diffs reports the number of large structural discrepancies (>5 kb) observed between the assemblies and the phased HG002 reference genome, normalized by the assembly size (in Mb). The total bases of collapsed sequences and expandable sequences report the number of bp that are collapsed and potentially expandable in each assembly (smaller is better). (B) Cumulative plot illustrating the growth of the assemblies' total length. (C) NGx plot showing the contig length distribution (NG50: contigs equal to or larger than this length represent 50% of the estimated genome size). Because the total length of contigs assembled from one cell did not exceed 1.5 Gb, NGx is not shown for the single-cell assembly. (D) HG002 assembly (1, 10, 20, 30 cells) benchmarking results for ONT mode (primary contigs). Only contigs longer than 10 kb were considered when computing the basic metrics. The per-base consensus quality value (QV) was estimated by Merqury. QUAST diffs reports the number of large structural discrepancies (>5 kb) observed between the assemblies and the phased HG002 reference genome, normalized by the assembly size (in Mb). The total bases of collapsed sequences and expandable sequences report the number of bp that are collapsed and potentially expandable in each assembly (smaller is better). (E) Visual representation of most contigs from the ONT assembly of 192 cells at a lower sequencing depth per cell. Each gray or black block indicates a continuous contig alignment calculated by QUAST, and only contigs labeled 'correct' by QUAST are displayed (the QUAST parameter 'lower threshold for the relocation' was 10 kb). The red dots mark the gap-closed regions in the assembled genome. (F) Visual representation of most contigs from the ONT assembly of 30 cells at a higher genome coverage. Each gray or black block indicates a continuous contig alignment calculated by QUAST, and only contigs labeled 'correct' by QUAST are displayed (the QUAST parameter 'lower threshold for the relocation' was 10 kb). The red dots mark the gap-closed regions in the assembled genome.
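The NGx curve in panel (C), and the reason it cannot be drawn when an assembly is much smaller than the genome, can be sketched as follows. This is an illustrative sketch only: `ngx_curve` and the toy numbers are hypothetical, and the study's own plots were presumably produced with its benchmarking tools rather than with code like this.

```python
def ngx_curve(contig_lengths, genome_size, xs=range(10, 101, 10)):
    """Map each x (percent of genome size) to NGx, i.e. the length of the
    contig at which the longest-first cumulative sum reaches x% of the
    genome. NGx is None when the assembly is too short to reach x%."""
    ordered = sorted(contig_lengths, reverse=True)
    curve = {}
    for x in xs:
        target = genome_size * x / 100
        running = 0
        value = None
        for length in ordered:
            running += length
            if running >= target:
                value = length
                break
        curve[x] = value
    return curve

# Toy example: contigs sum to 290 units against a genome of 400 units,
# so every NGx beyond 72.5% of the genome is undefined.
toy_curve = ngx_curve([80, 70, 50, 40, 30, 20], genome_size=400)
```

In the same way, an assembly whose total contig length stays below half of the estimated genome size has no defined NG50, which is consistent with NGx being omitted for the single-cell assembly.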
Figure 4. SV discovery and distribution from the K562 cell assembly. (A) Size distribution of SVs identified from K562 cells with the hifiasm assembly, with a 300 bp peak for Alu and a 6 kb peak for LINE elements. (B) Length distribution of SVs identified from the HiCanu assembly, the hifiasm assembly, raw single-cell HiFi reads and bulk CLR reads. (C) The precision, recall and F1-score of SVs identified from the HiCanu assembly, the hifiasm assembly and raw single-cell HiFi reads, where bulk CLR SVs were treated as ground truth. (D) The percentage of true positive SVs identified from the HiCanu assembly, the hifiasm assembly and raw single-cell HiFi reads, where bulk CLR SVs were treated as ground truth. (E) Ribbon (56) visualization of the translocations in K562 cells. The diagram on the left indicates the detailed positions and directions of the translocations at the BCR-ABL1 locus and the NUP214-XKR3 locus. The diagram on the right indicates that the translocation between chr3 and chr10 generates the CDC25A-GRID1 fusion gene, which was not detected by direct mapping of single-cell reads.
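The precision, recall and F1-score in panels (C) and (D) can be sketched as below, treating one call set as ground truth in the way the caption treats the bulk CLR SVs. The matching rule here (same chromosome and SV type, breakpoints within 500 bp, size ratio of at least 0.7) is an assumption chosen for illustration and is not the criterion reported in the study; all names and records are hypothetical.

```python
def sv_match(a, b, max_dist=500, min_size_ratio=0.7):
    """Assumed matching rule; each SV is (chrom, pos, svtype, length)
    with a positive length."""
    same_kind = a[0] == b[0] and a[2] == b[2]
    close = abs(a[1] - b[1]) <= max_dist
    ratio = min(a[3], b[3]) / max(a[3], b[3])
    return same_kind and close and ratio >= min_size_ratio

def benchmark(calls, truth):
    """Return (precision, recall, F1), matching each truth SV at most once."""
    used = set()
    tp = 0
    for call in calls:
        for i, t in enumerate(truth):
            if i not in used and sv_match(call, t):
                used.add(i)
                tp += 1
                break
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical records, for illustration only.
truth = [("chr1", 1000, "INS", 300), ("chr2", 5000, "DEL", 6000),
         ("chr3", 9000, "INS", 320), ("chr9", 700, "DEL", 150)]
calls = [("chr1", 1100, "INS", 310), ("chr2", 5200, "DEL", 5800),
         ("chr5", 400, "INS", 300)]
precision, recall, f1 = benchmark(calls, truth)
```

Here two of the three calls match a truth SV, so precision is 2/3, recall is 2/4, and F1 is their harmonic mean.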