Lyam Baudry, Nadège Guiglielmoni, Hervé Marie-Nelly, Alexandre Cormier, Martial Marbouty, Komlan Avia, Yann Loe Mie, Olivier Godfroy, Lieven Sterck, J Mark Cock, Christophe Zimmer, Susana M Coelho, Romain Koszul.
Abstract
Hi-C exploits contact frequencies between pairs of loci to bridge and order contigs during genome assembly, resulting in chromosome-level assemblies. Because few robust programs are available for this type of data, we developed instaGRAAL, a complete overhaul of the GRAAL program that adapts it for efficient assembly of large genomes. instaGRAAL features a number of improvements over GRAAL, including a modular correction approach that optionally integrates independent data. We validate the program using data for two brown algae and human to generate near-complete assemblies with minimal human intervention.
Keywords: Desmarestia herbacea; Ectocarpus; GPU; Hi-C scaffolding; Hi-C, genome assembly; MCMC
Year: 2020 PMID: 32552806 PMCID: PMC7386250 DOI: 10.1186/s13059-020-02041-z
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Matrix generation and binning process. a From left to right: (i) the input data to be processed, and paired-end reads to be mapped onto the Ectocarpus sp. reference v1 genome assembly; (ii) raw contact map before binning, where each pixel is a contact count between two restriction fragments (RFs); and (iii) raw contact map after binning, where each pixel is a contact count between a determined number of RFs (see b). b Schematic description of one iteration of the binning process over 10 restriction fragments (arrows). From left to right: (i) initial contact map, where each pixel is a contact count between two RFs; (ii) filtering step: RFs that are either too short or have a read coverage more than one standard deviation below the mean are discarded; (iii) binning step (1 bin = 3 RFs): adjacent RFs are pooled by three, with sum-pooling along all pixels in a 3 × 3 square; and (iv) binning step (1 bin = 9 RFs): adjacent RFs are pooled by nine
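The filtering and sum-pooling steps described in this caption can be sketched with NumPy. This is a minimal illustration, not instaGRAAL's code: the function names, the `min_length` parameter, and the zero-padding of leftover fragments at the matrix edge are assumptions made for the example.

```python
import numpy as np

def filter_fragments(contacts, lengths, min_length):
    """Discard RFs that are too short or poorly covered.

    contacts: square matrix of contact counts between RFs.
    lengths: RF lengths in bp. min_length is a hypothetical cutoff.
    """
    coverage = contacts.sum(axis=0)
    # Keep RFs whose coverage is not more than one std below the mean.
    threshold = coverage.mean() - coverage.std()
    keep = (lengths >= min_length) & (coverage >= threshold)
    return contacts[np.ix_(keep, keep)], keep

def sum_pool(contacts, factor=3):
    """Pool adjacent RFs by `factor`, summing each factor x factor block."""
    n = contacts.shape[0]
    n_bins = -(-n // factor)  # ceiling division
    # Assumption: leftover RFs at the edge form a smaller, zero-padded bin.
    padded = np.zeros((n_bins * factor, n_bins * factor), dtype=contacts.dtype)
    padded[:n, :n] = contacts
    return padded.reshape(n_bins, factor, n_bins, factor).sum(axis=(1, 3))
```

Applying `sum_pool` once gives bins of 3 RFs, and applying it again to the result gives bins of 9 RFs, which matches the two binning levels shown in panel b.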
Fig. 2 Evolution of the Ectocarpus sp. contact map, the parameters of the polymer model, and the log-likelihood of the contact map. a The raw contact map before (upper part) and after (bottom part) scaffolding using instaGRAAL. Scaffolds are ordered by size. b Evolution of three parameters of the polymer model (exponent, pre-factor, mean trans-contacts) and the log-likelihood as a function of iterations
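To make the three parameters named in this caption concrete, the sketch below scores a contact map under a generic power-law decay. It is not the polymer model used by instaGRAAL: the functional form (pre-factor times distance raised to the exponent for pairs on the same scaffold, a constant mean for trans pairs) and the Poisson observation model are assumptions chosen for illustration, as are all function names.

```python
import math

def expected_contacts(distance, exponent, prefactor, mean_trans):
    """Expected contact count for one pair of bins.

    distance: separation in bp (> 0) for bins on the same scaffold,
    or None for bins on different scaffolds (trans pair).
    """
    if distance is None:
        return mean_trans
    return prefactor * distance ** exponent

def log_likelihood(pairs, exponent, prefactor, mean_trans):
    """Poisson log-likelihood (an assumed observation model).

    pairs: iterable of (observed_count, distance_or_None).
    """
    total = 0.0
    for count, distance in pairs:
        lam = expected_contacts(distance, exponent, prefactor, mean_trans)
        total += count * math.log(lam) - lam - math.lgamma(count + 1)
    return total
```

Under this kind of model, a scaffold arrangement that places strongly interacting bins close together yields a higher log-likelihood, which is the quantity tracked across iterations in panel b.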
Fig. 3 Size distribution (log scale) of the final Ectocarpus sp. scaffolds after 250 instaGRAAL iterations. After filtering, and prior to correction, 27 main scaffolds (red bars), or putative chromosomes, were obtained. The dotted green horizontal line represents the proportion of the filtered genome that was not integrated into the main 27 scaffolds, which is less than 0.6% of the initial assembly. Each scaffold presents, after normalization, a high-quality Hi-C profile with features that are typical of eukaryotic genomes (Additional file 1: Fig. S1)
Fig. 4 Step-by-step correction procedure. Correction procedure (top to bottom): (i) in silico restriction of the genome and binning, yielding a set of bins; (ii) reordering of all bins into scaffolds without taking into account their input contig of origin; typically, groups of bins from the same input contig naturally aggregate, but some bins get scattered to other scaffolds (e.g., bin 13, pink arrow), while others will be “flipped” with respect to the original assembly (e.g., bin 4, red arrows); (iii) reconstruction of the original input contigs by relocating scattered bins next to the biggest bin group; and (iv) bins in the original input contigs are oriented according to their original consensus orientation
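Steps (iii) and (iv) of this caption can be sketched as follows. This is a simplified illustration under stated assumptions, not instaGRAAL's implementation: bins are modeled as hypothetical `(contig, index, strand)` tuples, the "biggest bin group" is taken to be the longest run of consecutive bins from the same contig, and the consensus orientation is taken to be a majority vote over the strands of a contig's bins.

```python
from itertools import groupby

def correct_scaffolds(scaffolds):
    """Rebuild input contigs inside scaffolds.

    scaffolds: list of lists of (contig, index, strand) tuples,
    with strand equal to +1 or -1.
    """
    best = {}     # contig -> (run length, scaffold number, run start)
    members = {}  # contig -> list of (index, strand) over all scaffolds
    for s, scaffold in enumerate(scaffolds):
        pos = 0
        for contig, run in groupby(scaffold, key=lambda b: b[0]):
            run = list(run)
            if len(run) > best.get(contig, (0,))[0]:
                best[contig] = (len(run), s, pos)
            members.setdefault(contig, []).extend((i, st) for _, i, st in run)
            pos += len(run)

    corrected = []
    for s, scaffold in enumerate(scaffolds):
        rebuilt, pos = [], 0
        for contig, run in groupby(scaffold, key=lambda b: b[0]):
            size = len(list(run))
            # Only the biggest group keeps its place; scattered bins join it.
            if best[contig][1:] == (s, pos):
                # Assumed consensus: majority strand among the contig's bins.
                strand = 1 if sum(st for _, st in members[contig]) >= 0 else -1
                order = sorted((i for i, _ in members[contig]), reverse=strand < 0)
                rebuilt.extend((contig, i, strand) for i in order)
            pos += size
        corrected.append(rebuilt)
    return corrected
```

For example, with a scattered bin of contig `A` and a flipped bin inside it, the bins are gathered back into one block in their original order and with a single orientation, while the position of the block within the scaffold is kept.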
Comparison of Nx, NGx (i.e., Nx with respect to the original reference v1 genome assembly; in bp), and BUSCO completeness for the different assemblies (linkage group v2, GRAAL v3, and corrected instaGRAAL v4) of the Ectocarpus sp. genome
| Statistic | Reference v1 assembly | Linkage group v2 assembly | v3 GRAAL | v4 corrected instaGRAAL |
|---|---|---|---|---|
| N50 | 497,380 | 6,528,661 | 6,867,074 | 6,813,345 |
| NG50 | 497,380 | 6,528,661 | 6,725,743 | 6,813,345 |
| N75 | 233,412 | 5,613,161 | 5,693,784 | 5,686,617 |
| NG75 | 233,412 | 5,613,161 | 5,672,622 | 5,686,617 |
| L50 | 118 | 12 | 11 | 11 |
| LG50 | 118 | 12 | 12 | 11 |
| L75 | 258 | 19 | 18 | 19 |
| LG75 | 258 | 19 | 19 | 19 |
| BUSCO completeness (%) | 75.9 | 76.9 | 76.24 | 77.56 |
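The Nx, NGx, Lx, and LGx statistics reported in this table follow standard definitions, which can be sketched as below. The function name and signature are illustrative and do not come from QUAST or instaGRAAL; passing `genome_size` switches from Nx/Lx (relative to the assembly's own total length) to NGx/LGx (relative to a reference length).

```python
def nx_lx(lengths, x=50, genome_size=None):
    """Return (Nx, Lx), or (NGx, LGx) when genome_size is given.

    lengths: scaffold or contig lengths in bp.
    x: percentage of the total length to cover (50 for N50/L50).
    genome_size: reference length in bp; defaults to the assembly length.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * x / 100
    running = 0
    # Walk sequences from longest to shortest until the target is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    # The assembly is shorter than the requested share of the reference.
    return None, None
```

For scaffold lengths `[80, 70, 50, 40, 30, 20]`, `nx_lx(lengths)` returns `(70, 2)`, and `nx_lx(lengths, genome_size=400)` returns `(50, 3)`, which shows how NG50 drops when the assembly is shorter than the reference.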
Comparison of Nx, NGx (i.e., Nx with respect to the original human reference genome assembly; in bp), and other QUAST statistics for the different assemblies (artificial assembly, corrected instaGRAAL, and SALSA2) of the Homo sapiens genome
| Statistic | Reference genome assembly | Artificial assembly | instaGRAAL | SALSA2 |
|---|---|---|---|---|
| N50 | 145,138,636 | 300,000 | 143,373,745 | 152,389,473 |
| NG50 | 145,138,636 | 300,000 | 143,373,745 | 152,389,473 |
| N75 | 107,043,718 | 300,000 | 89,477,166 | 130,103,422 |
| NG75 | 107,043,718 | 300,000 | 82,128,910 | 103,672,000 |
| L50 | 9 | 5,165 | 9 | 9 |
| LG50 | 9 | 5,454 | 9 | 9 |
| L75 | 15 | 7,747 | 15 | 15 |
| LG75 | 15 | 8,181 | 17 | 17 |
| No. of genomic features | 3,625,295 + 305 partial | 3,411,473 + 44,299 partial | 3,456,227 + 3,836 partial | 3,415,115 + 44,127 partial |
| Genome fraction (%) | 100.0 | 94.6 | 94.6 | 94.5 |
| No. of misassemblies | 9 | 0 | 776 | 438 |