Abstract
High-quality chromosome-scale haplotype sequences of diploid genomes, polyploid genomes, and metagenomes provide important insights into genetic variation associated with disease and biodiversity. However, whole-genome short-read sequencing does not directly yield haplotype information spanning whole chromosomes. Haplotype reconstruction therefore requires computational assembly of shorter haplotype fragments, which can be challenging owing to limited fragment lengths and high haplotype and repeat variability across genomes. Recent advancements in long-read and chromosome-scale sequencing technologies, alongside computational innovations, are improving the reconstruction of haplotypes at the level of whole chromosomes. Here, we review and discuss recent methodological progress and perspectives in these areas.
Year: 2021 PMID: 33845884 PMCID: PMC8040228 DOI: 10.1186/s13059-021-02328-9
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Third-generation sequencing initiatives and reference data sets
| Initiatives | # samples/#haplotypes | Technologies | Links |
|---|---|---|---|
| Genome in a Bottle | 2 trios and 1 sample, 6 haplotypes | PacBio, ONT, Illumina, BioNano, Strand-seq, 10xG | |
| Human Genome Structural Variation Consortium | > 3 trios, > 6 haplotypes | PacBio, Illumina, BioNano, Hi-C, Strand-seq, 10xG | |
| Vertebrate Genome Project (VGP; facilitated by Genome 10K), Darwin Tree of Life Project | > 100, ongoing haplotyping efforts | 10xG, PacBio, Hi-C | |
| Human Pangenome Project | > 10, > 20 haplotypes | PacBio, ONT, Hi-C | |
| Earth Biogenome Project (facilitated by Genome 10K) | > 10, ongoing haplotyping efforts | PacBio, Hi-C | |
| The DNA Zoo project | > 10, ongoing haplotyping efforts | Hi-C and WGS | |
| Japanese Reference Project | > 1, > 2 haplotypes | PacBio, Illumina | |
| CHM1, CHM13 | Individual samples, two haplotypes each (except CHM1 and CHM13) | PacBio, ONT, BioNano, Hi-C, Illumina | n/a |
Fig. 1 Third-generation sequencing technologies and their characteristics (read length, error rate, and scale of information). The read length and scale of information (local versus chromosome-scale) together determine the haplotype range that can be achieved; moving down the schematic, this range increases (orange arrow). Sequencing costs per sample increase moving from short-read sequencing down to nanopore sequencing, and then decrease again for BioNano and Hi-C (yellow arrows). Similarly, read length and error rate first increase moving down to nanopore sequencing, and then decrease again for BioNano and Hi-C (green arrows).
Methods and computational tools for haplotype reconstruction
| Approach | Tools | Data | Advantages | Disadvantages |
|---|---|---|---|---|
| Molecular haplotyping | WhatsHap | Long reads such as PacBio, Hi-C of individual | Can phase de novo and rare variants | Limitations in complex regions such as centromeres, HLA, etc. |
| Single-cell phasing | CHISEL | Single-cell short reads | High precision at the single-cell level, detection of rare alleles | Engineering tricks required to scale to more than a million cells |
| Polyploid phasing | HapTree | Local phasing | Can phase de novo and rare variants | Limitations in repetitive regions and not optimized for ploidy > 5 |
| De novo | | | | |
| Diploid assembly | Falcon Unzip | Long reads and Hi-C of individual | Local phased contigs | No chromosome-scale assembly and computationally expensive |
| | DipAsm | Long reads and Hi-C of individual | Chromosome-scale diploid assembly | Collapsed assembly not suitable for repetitive regions |
| | Hifiasm, HiCanu | HiFi reads of individual | High consensus accuracy and continuity | No chromosome-scale assembly |
| | pstools | HiFi and Hi-C reads | High-quality chromosome-scale haplotype assembly | Only designed for haplotyping diploids |
| | TrioCanu | Long reads of trios | Local phased contigs | Requires family information |
| Polyploid assembly | SDA | Long reads of individual | Local phased contigs | Needs to be optimized for whole genomes |
| | POLYTE | Illumina short reads | Local phased contigs | Does not scale well to whole genomes |
| De novo (re-) assembly | IDBA-UD | Metagenome short reads | No prior knowledge required | Low sensitivity: rare haplotypes can remain undetected |
| | OPERA-MS | Metagenome using short and long reads | High continuity | Computationally expensive |
| SNV-based assembly | ConStrains | Metagenome short reads | Computational efficiency | Assembly accuracy depends on variant calling |
| Read binning | MetaMaps | Metagenome long reads | Computational efficiency | Accuracy depends on database |
| Contig binning | ProxiMeta | Metagenome short reads and Hi-C | Reference-free, ability to link plasmids to host chromosome | Multiple technologies necessary (Hi-C + shotgun sequencing) |
Fig. 2 Molecular haplotyping techniques in reference-based phasing. Individual haplotypes are derived directly from sequencing data of the target sample based on read alignments to the reference genome. Local and chromosome-scale haplotype phasing make use of short- and long-range sequencing data, respectively; hybrid haplotype phasing combines the two data types.
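The idea behind read-based phasing can be illustrated with a toy example. The sketch below is an illustration only and is not the algorithm of any tool listed in the table: it assumes a diploid sample, error-free reads that have already been aligned and reduced to the alleles they carry at heterozygous sites, and at least one site per read. The function name `phase_reads` and the input layout are hypothetical.

```python
def phase_reads(reads):
    """Greedily phase heterozygous sites from aligned reads (toy example).

    reads: list of dicts mapping variant position -> allele (0 = ref, 1 = alt).
           Each read is assumed to cover at least one heterozygous site.
    Returns (hap0, hap1, assignment), where hap0/hap1 map position -> allele
    and assignment[i] is the haplotype (0 or 1) chosen for reads[i].
    """
    hap0 = {}
    assignment = [None] * len(reads)
    # Visit reads from left to right along the reference.
    order = sorted(range(len(reads)), key=lambda i: min(reads[i]))
    for i in order:
        read = reads[i]
        # Vote: +1 for each already phased site that agrees with haplotype 0,
        # -1 for each that disagrees.
        score = sum(1 if hap0[p] == a else -1
                    for p, a in read.items() if p in hap0)
        side = 0 if score >= 0 else 1
        assignment[i] = side
        # Extend the haplotypes with sites this read sees for the first time.
        for p, a in read.items():
            if p not in hap0:
                hap0[p] = a if side == 0 else 1 - a
    # In a diploid, haplotype 1 carries the other allele at every het site.
    hap1 = {p: 1 - a for p, a in hap0.items()}
    return hap0, hap1, assignment


# Four reads, each linking two neighbouring heterozygous sites.
example_reads = [
    {100: 0, 200: 1},
    {200: 0, 300: 0},
    {100: 1, 200: 0},
    {300: 1, 400: 0},
]
hap0, hap1, assignment = phase_reads(example_reads)
```

Because the procedure is greedy, a single erroneous read can propagate a wrong phase, and reads that share no site with earlier reads are joined to the existing block arbitrarily. This is why the reachable haplotype range depends on how far individual reads or read pairs span, as the caption describes for local versus chromosome-scale data.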
Fig. 3 Haplotype-aware de novo assembly. Collapsed assembly approaches identify sequence variants on a consensus assembly and subsequently phase these variants into haplotypes using chromosome-scale data (Hi-C). Semi-collapsed approaches follow a similar approach, but after phasing, variants in the initial assembly graph are updated and final contigs are produced based on this updated graph. Uncollapsed approaches directly determine haplotype-specific overlaps in local sequencing reads by retaining SNPs and repeat variation in all possible overlaps, and construct haplotypes based on the selected overlaps.
Fig. 4 Strain-resolved metagenome assembly. a Given a pooled sequencing sample, the goal of strain-resolved metagenome assembly is to reconstruct all individual microbial strains. b A typical workflow consists of four steps: de novo assembly, contig binning, bin-wise re-assembly, and assembly curation. Each step can be performed at the species level or at the strain level, as illustrated in the left and middle columns, respectively. Some workflows skip the initial de novo assembly step and perform strain-resolved binning directly on the sequencing reads, which can be reference-guided (right column).
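As a rough illustration of the contig binning step in this workflow, the sketch below groups contigs by mean read depth alone. This is a deliberately simplified assumption: it is not how ProxiMeta or any other tool in the table works, and practical binners draw on richer signals such as sequence composition or Hi-C contacts. The function name `bin_contigs_by_coverage`, the `tolerance` parameter, and the depth values are hypothetical.

```python
def bin_contigs_by_coverage(coverage, tolerance=0.25):
    """Group contigs whose mean read depth is similar (toy example).

    coverage:  dict mapping contig name -> mean read depth.
    tolerance: a contig joins the current bin if its depth is at most
               (1 + tolerance) times the depth of the bin's first contig.
    Returns a list of bins, each a list of contig names, ordered by depth.
    """
    bins = []
    for name, depth in sorted(coverage.items(), key=lambda kv: kv[1]):
        if bins and depth <= bins[-1]["anchor"] * (1 + tolerance):
            bins[-1]["contigs"].append(name)
        else:
            # Depth jumped too far: start a new bin anchored at this contig.
            bins.append({"anchor": depth, "contigs": [name]})
    return [b["contigs"] for b in bins]


# Two organisms at roughly 10x and 50x abundance (made-up numbers).
example_bins = bin_contigs_by_coverage(
    {"c1": 10.0, "c2": 11.0, "c3": 50.0, "c4": 55.0, "c5": 12.0}
)
```

Depth alone cannot separate organisms of similar abundance, and closely related strains often share most of their sequence, which is one reason the workflow in the caption follows binning with bin-wise re-assembly and curation.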