Andrea Minio, Noé Cochetel, Amanda M Vondras, Mélanie Massonnet, Dario Cantu.
Abstract
De novo genome assembly is essential for genomic research. High-quality genomes assembled into phased pseudomolecules are challenging to produce and often contain assembly errors because of repeats, heterozygosity, or the chosen assembly strategy. Although algorithms that produce partially phased assemblies exist, haploid draft assemblies that may lack biological information remain favored because they are easier to generate and use. We developed HaploSync, a suite of tools that produces fully phased, chromosome-scale diploid genome assemblies and performs extensive quality control to limit assembly artifacts. HaploSync scaffolds sequences from a draft diploid assembly into phased pseudomolecules guided by a genetic map and/or the genome of a closely related species. HaploSync generates a report that visualizes the relationships between current and legacy sequences, for both haplotypes, and displays their gene and marker content. This quality control helps the user identify misassemblies and guides HaploSync's correction of scaffolding errors. Finally, HaploSync fills assembly gaps with unplaced sequences and resolves collapsed homozygous regions. In a series of plant, fungal, and animal kingdom case studies, we demonstrate that HaploSync efficiently increases the assembly contiguity of phased chromosomes, improves completeness by filling gaps, corrects scaffolding, and correctly phases highly heterozygous, complex regions.
Keywords: assembly error correction; chromosome anchoring; diploid genomes; haplotype phasing; hybrid genome assembly
Year: 2022 PMID: 35686922 PMCID: PMC9339290 DOI: 10.1093/g3journal/jkac143
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.542
Fig. 1. The HaploSync pipeline builds and refines haploid and diploid genome assemblies. The diploid-aware pipeline can deliver fully phased diploid pseudomolecules using a draft diploid assembly or diploid pseudomolecules as input. If draft sequences are used, HaploSplit first separates the haplotypes into 2 pseudomolecule sets. Pseudomolecules, whether provided by the user or reconstructed with HaploSplit, then undergo quality control with HaploDup. If errors are found, input sequences can be edited with HaploBreak before the pseudomolecules are rebuilt with HaploSplit. If no errors are detected and there are unplaced sequences, the pseudomolecules undergo gap-filling with HaploFill. After each filling iteration, quality control can be performed with HaploDup. Finally, HaploMap can be used to identify colinear regions between pseudomolecules.
Fig. 2. The HaploSplit procedure using genetic markers as input. a) The procedure identifies marker positions in the draft sequences. b) The longest sorted set of markers is identified for each draft sequence. c) Each sequence is assigned to a unique genomic region in the map (linkage group) and oriented. d) A directed adjacency network is built for each linkage group, connecting all sequences whose ranges of genetic markers do not overlap. Sequences sharing markers are placed in separate network paths. e) The tiling path that maximizes the number of covered markers is selected for the first haplotype. f) Sequences belonging to the first haplotype are removed from the adjacency network and the second-best tiling path is used to scaffold the second haplotype.
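Steps d) to f) can be read as a longest-path search in a directed acyclic graph. The following is a minimal sketch under simplifying assumptions, not HaploSplit's actual implementation: each sequence is reduced to a hypothetical `Seq` record holding the first and last map marker it covers and its marker count, two sequences may be chained only if their marker ranges do not overlap, and the second haplotype is obtained by rerunning the search on whatever the first path left behind. All names and the toy data are illustrative.

```python
from typing import NamedTuple


class Seq(NamedTuple):
    """Hypothetical record for one draft sequence within a linkage group."""
    name: str
    first: int      # map index of the first marker covered
    last: int       # map index of the last marker covered
    n_markers: int  # number of map markers found on the sequence


def best_tiling_path(seqs):
    """Chain of non-overlapping sequences that covers the most markers."""
    order = sorted(seqs, key=lambda s: (s.last, s.first))
    best = []  # best[i] = (score, predecessor index) for a path ending at order[i]
    for i, s in enumerate(order):
        score, prev = s.n_markers, None
        for j in range(i):
            # An edge j -> i exists only if the marker ranges do not overlap.
            if order[j].last < s.first and best[j][0] + s.n_markers > score:
                score, prev = best[j][0] + s.n_markers, j
        best.append((score, prev))
    if not best:
        return []
    i = max(range(len(best)), key=lambda k: best[k][0])
    path = []
    while i is not None:  # walk predecessors back to the start of the path
        path.append(order[i])
        i = best[i][1]
    return path[::-1]


def split_haplotypes(seqs):
    """First haplotype = best path; second = best path among the remainder."""
    hap1 = best_tiling_path(seqs)
    used = {s.name for s in hap1}
    hap2 = best_tiling_path([s for s in seqs if s.name not in used])
    return hap1, hap2


# Toy linkage group with markers indexed 0..9 (invented data).
toy = [
    Seq("A", 0, 4, 5), Seq("B", 5, 9, 5),                     # long sequences
    Seq("a", 0, 3, 4), Seq("b", 4, 6, 3), Seq("c", 7, 9, 2),  # shorter ones
]
hap1, hap2 = split_haplotypes(toy)
print([s.name for s in hap1], [s.name for s in hap2])
```

In this toy case the two long sequences form the first path and the three shorter ones form the second, which mirrors the caption's point that sequences sharing markers end up in separate paths.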
Fig. 3. Example of HaploDup's interactive reports. The figure shows 2 static screenshots of HaploDup's interactive output. a) Assembly quality control of M. rotundifolia chromosome 12 Haplotype 1: whole-sequence alignment of both alternative haplotypes on Haplotype 1, legacy contig and hybrid scaffold composition of Haplotype 1, position of the genetic markers and the duplicated markers in Haplotype 1, and number of significant alignments per gene of Haplotype 1 in each alternative haplotype. In this example, the legacy contig composition and the position of duplicated markers indicate that both alleles (primary contig and haplotig) and both marker copies were placed in a hybrid scaffold (overlaid box). b) Unplaced sequence quality control: marker content is compared between pseudomolecules and unplaced sequences to evaluate the conditions that prevent the inclusion of a specific unplaced sequence. Markers are color-coded based on their order in the map. The structure of pseudomolecules and unplaced sequences is represented with color-coded blocks. Blocks identify the composition in terms of draft assembly sequences, and color coding shows the relationships between the composing sequences (e.g. primary to haplotig relationships). In this example, the presence of a marker (overlaid box, the dark marker on the right of the contig) in the unplaced sequence far from its expected position on the map extends the expected coverage of the map to the end of the linkage group and prevents placement in any haplotype scaffold.
Assembly statistics.

| Genotype | Kingdom | Haploid size | Chromosomes | Technology | Markers (per Mb) | Input sequences | Input size | Output set | HaploSplit | HaploFill |
|---|---|---|---|---|---|---|---|---|---|---|
| | Fungi | 14 Mb | 7 + R | PacBio | 116 (8.3) | Primary | 15.5 Mb | Hap 1 | 11.6 Mb | 12.9 Mb |
| | | | | | | Haplotigs | 13.8 Mb | Hap 2 | 12.4 Mb | 13.7 Mb |
| | | | | | | Total | 29.2 Mb | Unpl | 5.2 Mb | 2.7 Mb |
| | Plantae | 119 Mb | 5 | PacBio | 676 (5.7) | Primary | 140.0 Mb | Hap 1 | 109.0 Mb | 114.7 Mb |
| | | | | | | Haplotigs | 104.9 Mb | Hap 2 | 106.6 Mb | 111.5 Mb |
| | | | | | | Total | 245.0 Mb | Unpl | 29.4 Mb | 19.0 Mb |
| | Animalia | 2.6 Gb (29 + X); 2.5 Gb (29 + Y) | 29 + XY | PacBio | 46,325 (17.6) | Primary | 2.7 Gb | Hap 1 | 2.6 Gb (29 + X) | 2.6 Gb (29 + X) |
| | | | | | | Haplotigs | 2.5 Gb | Hap 2 | 2.3 Gb (29 + Y) | 2.5 Gb (29 + Y) |
| | | | | | | Total | 5.2 Gb | Unpl | 0.3 Gb | 0.2 Gb |
| | Plantae | 487–557 Mb | 19 | PacBio + Dovetail HiC | 1,661 (3.5) | Primary | 570.2 Mb | Hap 1 | 350.8 Mb | 455.6 Mb |
| | | | | | | Haplotigs | 284.7 Mb | Hap 2 | 263.4 Mb | 410.9 Mb |
| | | | | | | Total | 854.9 Mb | Unpl | 239.9 Mb | 47.1 Mb |
| | Plantae | 483 Mb | 20 | PacBio + BioNano | 1,661 (3.5) | Primary | 459.5 Mb | Hap 1 | 374.3 Mb | 400.5 Mb |
| | | | | | | Haplotigs | 364.8 Mb | Hap 2 | 338.9 Mb | 370.0 Mb |
| | | | | | | Total | 896.0 Mb | Unpl | 165.5 Mb | 63.0 Mb |

Summary statistics for the testing dataset.
Where Hap 1: Haplotype 1; Hap 2: Haplotype 2; Unpl: Unplaced sequences.
FalconUnzip (Hamlin ).
Forche et al. (2004).
FalconUnzip (Chin ).
Singer .
FalconUnzip (Koren ).
Low et al. (2020) using the Integrated Bovine Map of sex chromosome (ver. Btau_4.0, https://www.hgsc.bcm.edu/other-mammals/bovine-genome-project).
Range of values as reported for PN40024 in Jaillon and Cabernet Sauvignon in Cochetel .
FalconUnzip + SSPACE + HiRise (Vondras ).
Zou .
FalconUnzip + Hybrid Scaffolder (Cochetel ).
Zou .
Reported for FalconUnzip assembly as haplotype separation is lost during Hybrid Scaffolding.
Reported for FalconUnzip assembly as haplotype separation is lost during Hybrid Scaffolding.
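The effect of gap-filling on unplaced sequence can be worked out directly from the table. The sketch below is illustrative arithmetic only: the values are the "Unpl" entries after HaploSplit and after HaploFill as they appear in the table, the row labels (kingdom plus technology) are a convenience of this example because the genotype names are not available here, and the Animalia row is left out because it is reported in Gb with a single decimal.

```python
# Unplaced sequence (Mb) after HaploSplit and after HaploFill, copied from the table.
# Keys are ad hoc labels for this example, not names used by the authors.
unplaced_mb = {
    "Fungi, PacBio": (5.2, 2.7),
    "Plantae, PacBio": (29.4, 19.0),
    "Plantae, PacBio + Dovetail HiC": (239.9, 47.1),
    "Plantae, PacBio + BioNano": (165.5, 63.0),
}


def percent_reduction(before, after):
    """Relative decrease, in percent, from `before` to `after`."""
    return 100.0 * (before - after) / before


for label, (before, after) in unplaced_mb.items():
    change = percent_reduction(before, after)
    print(f"{label}: {before} -> {after} Mb unplaced ({change:.1f}% reduction)")
```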
Fig. 4. HaploSplit performance. a) The results of using different sources of external information and HaploSplit protocols for V. vinifera cv. Cabernet Franc cl. 04 (Vondras ) assembly. Map-based assembly produces the largest first haplotype, but its overassembly occurs at the expense of the second haplotype's completeness. A map-based approach is conservative and limited by the density of the markers. The hybrid approach recovers more sequences where the map is lacking information, without overassembling, and delivers a better reconstruction of both haplotypes. b) Effect of limited marker availability on overall assembly length, tested on B. taurus Angus × Brahma (Koren ; Low ) by subsampling the genetic map. Longer sequences are more likely to contain a marker, making the first reconstructed haplotype the most complete across all tests, with little variation in size. As the number of available markers increases and short sequences are included, the completeness of the second haplotype improves. c) Effect of limited marker availability on the number of placed sequences, tested on B. taurus Angus × Brahma (Koren ; Low ) by subsampling the genetic map. Increasing the number of markers as fragmentation increases allows more sequences to be recruited for scaffolding and improves completeness. Haplotype 1, with long sequences, shows little variation. In contrast, Haplotype 2 greatly benefits from increased marker density. Most of the sequences that remain unplaced are short and represent a small fraction of the genome's length.
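The reasoning behind panels b) and c), that long sequences almost always carry a marker while short ones are anchored only as marker density rises, can be illustrated with a toy simulation. This is a sketch with invented data, not a reproduction of the B. taurus experiment: sequence lengths and marker positions are arbitrary, sequences are laid end to end on a single coordinate axis, haplotypes are not modeled, and subsampling is done by taking nested prefixes of one shuffled marker list.

```python
import bisect
import random


def anchored_fraction(seq_lengths, marker_positions):
    """Fraction of total length lying in sequences with at least one marker."""
    markers = sorted(marker_positions)
    start, anchored = 0, 0
    for length in seq_lengths:
        end = start + length
        # First marker at or after `start`; the sequence is anchored if it falls before `end`.
        i = bisect.bisect_left(markers, start)
        if i < len(markers) and markers[i] < end:
            anchored += length
        start = end
    return anchored / sum(seq_lengths)


rng = random.Random(0)                          # fixed seed for repeatability
lengths = [1000] * 5 + [100] * 20 + [10] * 50   # toy mix of long and short sequences
genome = sum(lengths)
all_markers = [rng.randrange(genome) for _ in range(200)]
rng.shuffle(all_markers)

# Nested subsamples: each smaller set is a subset of the next larger one.
subsample_sizes = (10, 50, 200)
fractions = [anchored_fraction(lengths, all_markers[:n]) for n in subsample_sizes]
for n, f in zip(subsample_sizes, fractions):
    print(f"{n} markers: {f:.2%} of total length anchored")
```

Because the subsamples are nested, the anchored fraction can only stay the same or grow as markers are added, and most of the early gain comes from the long sequences.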