Xingtan Zhang, Ruoxi Wu, Yibin Wang, Jiaxin Yu, Haibao Tang.
Abstract
Diploid genomes consist of two homologous sets of chromosomes, one from each parent, while polyploid genomes contain more than two homologous sets. Most reference genome assemblies collapse haplotypes into 'mosaic' sequences, ignoring allelic variants that may be involved in important cellular and biological functions. Unzipping haplotypes into distinct sets of sequences has been a growing trend in recent genome studies, as it is an essential step towards resolving important clinical and biological questions, such as compound heterozygotes, heterosis, and evolution. Herein, we review existing methods for alignment-based and assembly-based haplotype phasing of heterozygous diploid and polyploid genomes, as well as recent advances in experimental approaches for improved genome phasing. We anticipate that full haplotype phasing could become a routine procedure in genome studies in the near future.
Keywords: Genome assembly; Haplotype phasing; Heterozygosity; Ploidy; Reference genome
Year: 2019; PMID: 31908732; PMCID: PMC6938933; DOI: 10.1016/j.csbj.2019.11.011
Source DB: PubMed; Journal: Comput Struct Biotechnol J; ISSN: 2001-0370; Impact factor: 7.271
Fig. 1. Overview of the two main classes of haplotype phasing strategies. The left panel (A) shows the alignment-based workflow and the right panel (B) shows the assembly-based workflow. In alignment-based haplotype phasing (A), reads are sequenced at relatively low coverage (<30×) and mapped to a reference genome for variant calling. Linked variants are then extended into phased blocks, each containing a number of neighboring SNPs represented as 0 (REF) or 1 (ALT). In assembly-based haplotype phasing (B), much deeper sequencing is typically carried out using a variety of sequencing technologies. Allele-aware de novo assembly can be achieved using FALCON-Unzip or Canu trio binning. When working with multiple haplotypes, primary contigs can be selected as an arbitrary haplotype representation, e.g. using purge_haplotigs, for downstream analysis. Alternatively, the full set of haplotypes can be resolved with Hi-C technology, e.g. using ALLHiC.
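To make the idea of "phased blocks" in the alignment-based workflow concrete, the following is a minimal sketch of how heterozygous SNPs might be grouped into blocks: two SNPs fall in the same block when a chain of reads connects them. The function name `phase_blocks`, the read representation (a dict of SNP position to 0/1 allele), and the example positions are all assumptions made for illustration. This is not the algorithm used by any tool discussed in this review, and it only finds which SNPs can be phased together, not the haplotypes themselves.

```python
from collections import defaultdict


def phase_blocks(reads):
    """Group SNP positions into blocks linked by shared reads.

    reads: list of dicts mapping SNP position -> allele (0 = REF, 1 = ALT).
    Returns a sorted list of blocks, each a sorted list of positions.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for read in reads:
        positions = sorted(read)
        for pos in positions:
            find(pos)  # register SNPs seen only on single-SNP reads
        # A read covering several SNPs links all of them together.
        for left, right in zip(positions, positions[1:]):
            union(left, right)

    groups = defaultdict(list)
    for pos in list(parent):
        groups[find(pos)].append(pos)
    return sorted(sorted(block) for block in groups.values())


# Hypothetical reads: the first two overlap at position 250.
example_reads = [
    {100: 0, 250: 1},
    {250: 1, 400: 0},
    {900: 1, 1200: 1},
    {1500: 0},
]
blocks = phase_blocks(example_reads)
```

With these example reads, positions 100, 250 and 400 end up in one block because the shared SNP at 250 bridges the first two reads, while the SNP at 1500 stays in a block of its own. Longer reads or Hi-C links connect more SNPs and therefore yield longer blocks.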
Overview of software listed in this study.
| Program | Algorithm | Input data | Highlights | Limitations | Citation |
|---|---|---|---|---|---|
| WhatsHap | Dynamic programming algorithm | VCF/BAM/reference genome | Good completeness and accuracy | Ignores structural variations | |
| HapCUT2 | MAX-CUT-based heuristic algorithm | VCF/BAM/reference genome | Handles a wide range of sequencing technologies, including Illumina short reads, PacBio long reads, 10x linked reads and Hi-C reads | Ignores structural variations | |
| SHAPEIT3 and SHAPEIT4 | HMM/MCMC/PBWT | VCF/genetic map | Excellent accuracy and speed | Ignores structural variations | |
| HaploMerger1/2 | Whole-genome comparison | Draft genome assembly | Suitable for diploid assemblies with high heterozygosity | Not suitable for highly fragmented scaffolds (e.g. N50 < 100 kb) | |
| Redundans | Whole-genome comparison | Draft genome assembly | Multiple functions, including removal of heterozygous sequences, scaffolding and gap closure | May discard some repetitive and paralogous contigs in the reduction step | |
| Purge_haplotigs | Read depth | BAM/draft genome assembly | Avoids over-purging some repetitive and paralogous contigs | Cannot resolve haplotype switching in the draft genome because contigs are retained arbitrarily; the 'pseudo-haploid' assembly does not represent true phasing in polymorphic regions | |
| FALCON and FALCON-Unzip | Heuristic algorithms to identify 'bubble' structures and a greedy algorithm to assist haplotype construction | PacBio raw reads | FALCON-Unzip assembles highly accurate, contiguous primary contigs and haplotigs, allowing downstream analysis at the haplotype level | Primary contigs contain haplotype switching errors between adjacent phase blocks | |
| Canu | Trio binning | Parental short reads/F1 long reads | Generates two sets of haploid genomes, one for each parental line | Limited application for highly heterozygous genomes with no recorded pedigree information | |
| FALCON-Phase | Pipeline integrating PacBio reads and Hi-C data to reassign haplotypes in diploid genomes | Output of FALCON-Unzip assembly/Hi-C reads | Benefits from allele-aware contig assembly by FALCON-Unzip | Diploid genomes only | |
| ALLHiC | Prune/optimize/genetic algorithm | Draft contig assembly/Hi-C mapping BAM/allele table | Applicable to genomes of varying complexity, including simple diploid, heterozygous diploid, allopolyploid and autopolyploid genomes | Sensitive to the accuracy of the starting contig assembly | |
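The read-depth idea listed for Purge_haplotigs in the table can be illustrated with a small sketch. In a heterozygous diploid assembly, a contig representing only one haplotype tends to receive roughly half the read depth of a contig in which both haplotypes are collapsed. The function `classify_contigs`, the `tolerance` parameter, the labels, and the depth values below are hypothetical and chosen for illustration. The real tool derives its cutoffs from a read-depth histogram and applies further checks, which this sketch does not attempt to reproduce.

```python
def classify_contigs(depths, diploid_depth, tolerance=0.25):
    """Label contigs by mean read depth (illustrative only).

    depths: dict mapping contig name -> mean read depth.
    diploid_depth: expected depth where both haplotypes are collapsed.
    tolerance: allowed relative deviation from each expected depth (assumed).
    """
    haploid_depth = diploid_depth / 2
    labels = {}
    for contig, depth in depths.items():
        if abs(depth - haploid_depth) <= tolerance * haploid_depth:
            # About half depth: likely covers a single haplotype.
            labels[contig] = "haplotig_candidate"
        elif abs(depth - diploid_depth) <= tolerance * diploid_depth:
            # About full depth: haplotypes are likely collapsed here.
            labels[contig] = "collapsed"
        else:
            # Too low or too high, e.g. repeats or contaminants.
            labels[contig] = "unclassified"
    return labels


# Hypothetical mean depths for four contigs at 60x diploid coverage.
example_depths = {"ctgA": 58.0, "ctgB": 31.0, "ctgC": 29.0, "ctgD": 140.0}
labels = classify_contigs(example_depths, diploid_depth=60.0)
```

Depth alone cannot say which haplotig candidates pair with each other, or which copy belongs to which parental haplotype. This is consistent with the limitation noted in the table that a 'pseudo-haploid' assembly does not represent true phasing.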