Kazuaki Yamaguchi<sup>1</sup>, Mitsutaka Kadota<sup>1</sup>, Osamu Nishimura<sup>1</sup>, Yuta Ohishi<sup>1</sup>, Yuki Naito<sup>2</sup>, Shigehiro Kuraku<sup>1,3,4</sup>.
Abstract
The recent development of ecological studies has been fueled by the introduction of massive information based on chromosome-scale genome sequences, even for species for which genetic linkage information is not accessible. This was enabled mainly by the application of Hi-C, a method for genome-wide chromosome conformation capture that was originally developed for investigating long-range chromatin interactions. Performing genomic scaffolding using Hi-C data is highly resource-demanding and employs elaborate laboratory steps for sample preparation. It starts with building a primary genome sequence assembly as an input, which is followed by computation for genome scaffolding using Hi-C data, requiring careful validation. This article presents technical considerations for obtaining optimal Hi-C scaffolding results and provides a test case of its application to a reptile species, the Madagascar ground gecko (Paroedura picta). Among the metrics that are frequently used for evaluating scaffolding results, we investigate the validity of the completeness assessment of chromosome-scale genome assemblies using single-copy reference orthologues.
Keywords: BUSCO; Hi-C scaffolding; chromosome-scale genome assembly; completeness assessment; gene space; iconHi-C
Year: 2021 PMID: 34432923 PMCID: PMC9292758 DOI: 10.1111/mec.16146
Source DB: PubMed Journal: Mol Ecol ISSN: 0962-1083 Impact factor: 6.622
FIGURE 1. Overview of the workflow used for Hi‐C library preparation. Digestion of chromatin DNA is performed with restriction enzymes or DNA nuclease. DNA ends are labelled by a biotinylated nucleotide (left) or a biotinylated bridge adapter (right). Ligation is performed in situ in the nucleus, and biotin‐containing DNA is captured and used for the generation of sequencing libraries.
FIGURE 2. Technical considerations in Hi‐C scaffolding. The major points regarding technical considerations (left) are shown as hands‐on steps. Individual rows show possible solutions (middle) and our demonstration using the Madagascar ground gecko (right). See Naumova et al. (2013) for details of the potential effect of cell cycle status on chromatin contacts. See Kadota et al. (2020) for a method to estimate the optimal number of PCR cycles for library amplification.
Comparison of sample preparation for proximity‐based genome scaffolding
| Different specifications | In situ Hi‐C by Rao et al.<sup>a</sup> | iconHi‐C (ver. 1.0)<sup>b</sup> | Arima‐HiC Kit (Arima Genomics; ver. A160134 v01) | Proximo Hi‐C (Animal) Prep Kit (Phase Genomics; ver. 4.0) | Dovetail Hi‐C Kit (Dovetail Genomics; ver. 1.4) | Omni‐C Proximity Ligation Assay Kit (Dovetail Genomics; ver. 1.3) | EpiTect Hi‐C Kit (Qiagen; ver. 04/2019) |
|---|---|---|---|---|---|---|---|
| Crosslinking agent | Formaldehyde (final 1%) | Formaldehyde (final 1%) | Formaldehyde (final 2%) | Crosslinking solution (included in the kit) | Formaldehyde (final 1.5%) | DSG (final 30 mM)<sup>c</sup> and formaldehyde (final 1%) | Formaldehyde (final 1%) |
| Enzyme for chromatin DNA digestion | MboI (cuts at "GATC") | HindIII (cuts at "AAGCTT") or DpnII (cuts at "GATC") | Cocktail of A1 and A2 enzymes (cut at "GATC" and "GANTC")<sup>c</sup> | Sau3AI (cuts at "GATC") | DpnII (cuts at "GATC") | Nuclease enzyme mix<sup>c</sup> | Hi‐C digestion enzyme (cuts at "GATC") |
| Duration of restriction enzyme digestion | 2 h to overnight at 37°C | Overnight at 37°C | 30–60 min at 37°C | 1 h at 37°C | 1 h at 37°C | 30 min at 30°C | 2 h at 37°C |
| Biotin‐labeling method | Incorporation of biotinylated nucleotide | Incorporation of biotinylated nucleotide | Incorporation of biotinylated nucleotide | Incorporation of biotinylated nucleotide | Incorporation of biotinylated nucleotide | Ligation of biotin‐containing bridge adapter<sup>c</sup> | Incorporation of biotinylated nucleotide |
| Chromatin capture | N/A | N/A | N/A | By Recovery Beads (included in the kit)<sup>c</sup> | By Chromatin Capture Beads (included in the kit)<sup>c</sup> | By Chromatin Capture Beads (included in the kit)<sup>c</sup> | N/A |
| Ligation condition | 4 h at room temperature | 4–6 h at 16°C | 15 min at room temperature | 4 h at 25°C | 1–16 h at 16°C | 30 min at 22°C and 1 h at 22°C<sup>c</sup> | 2 h at 16°C |
| Reverse crosslinking | Overnight or at least 1.5 h at 68°C | Overnight at 65°C | 1.5–16 h at 68°C | 1–18 h at 65°C | 45 min at 68°C | 45 min at 68°C | 90 min at 80°C<sup>c</sup> |
| Quality control (QC) of ligated DNA | No | Yes (by size distribution analysis) | Yes (yield of biotin incorporated DNA) | Yes (yield of biotin incorporated DNA) | Yes (yield of ligated DNA) | Yes (yield of ligated DNA) | No |
| Fragmentation of the ligated DNA | Yes (by sonication) | Yes (by sonication) | Yes (by sonication) | Yes (enzymatic; included in the kit)<sup>c</sup> | Yes (by sonication) | No<sup>c</sup> | Yes (by sonication) |
| Removal of biotin from unligated ends | No | Yes | No | No | No | N/A | No |
| PCR cycles for sequencing library preparation | 4–12 cycles | Optimized for each library<sup>c</sup> | Optimized for each library<sup>c</sup> | 12 cycles | 11 cycles | 12 cycles | 7 cycles |
| Library QC target | Not specified | Yield and size distribution; digestion with NheI or ClaI<sup>c</sup> | Yield and size distribution | Yield and size distribution | Yield and size distribution | Yield and size distribution | Yield and size distribution |
<sup>a</sup>Rao et al., 2014; <sup>b</sup>Kadota et al., 2020; <sup>c</sup>Specification applied to a subset of the kits/protocols.
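
The recognition sequences listed in the table lend themselves to a quick in silico check of how densely a given enzyme choice would cut an input assembly. The following is a minimal sketch and not part of any of the listed protocols: the function names `site_positions` and `mean_fragment_length` and the toy sequence are hypothetical, and the exact cut position within each site is ignored.

```python
import re


def site_positions(sequence: str, motif: str) -> list[int]:
    """Return 0-based start positions of every match of a recognition motif.

    "N" in the motif stands for any base, as in "GANTC".
    """
    pattern = motif.upper().replace("N", "[ACGT]")
    # A zero-width lookahead also reports matches that overlap each other.
    return [m.start() for m in re.finditer(f"(?={pattern})", sequence.upper())]


def mean_fragment_length(sequence: str, motifs: list[str]) -> float:
    """Approximate mean fragment length after cutting at all given motifs."""
    cuts = {pos for motif in motifs for pos in site_positions(sequence, motif)}
    # k distinct cut sites split one linear sequence into k + 1 fragments.
    return len(sequence) / (len(cuts) + 1)


if __name__ == "__main__":
    toy = "AAGATCTTGAATCAAGCTTGATC"  # hypothetical 23-bp sequence
    for motif in ("GATC", "AAGCTT", "GANTC"):
        print(motif, site_positions(toy, motif))
    print(mean_fragment_length(toy, ["GATC", "GANTC"]))
```

On a real assembly the same functions would be applied per scaffold; combining "GATC" with "GANTC" yields more cut sites, and hence shorter expected fragments, than "GATC" alone.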
Comparison of computational programs for proximity‐based genome scaffolding. The programs are sorted in descending order of the number of citations of the publications introducing the individual programs, with the exception of the programs that are not openly maintained (LACHESIS and HiRise, at the bottom).
| Program | Description | Input data requirement | Other information |
|---|---|---|---|
| 3d‐dna<sup>a,b</sup> | Misjoin correction algorithm is applied to detect errors in the input assembly; compatible with multiple enzymes | Accepts only Juicer mapper format | The results can be reviewed and modified directly by JuiceBox |
| SALSA2<sup>c</sup> | Uses the physical coverage of Hi‐C pairs to identify misassembled regions of the input assembly; compatible with multiple enzymes | Generic bam (bed) file, assembly graph, unitig, 10x link files | The results can be visualized by JuiceBox via the included script |
| ALLHiC<sup>d</sup> | Scaffolding and phasing of a polyploid genome | Hi‐C read pairs; (option) associated gene annotation or chromosome‐scale genome assembly for a closely related species | Generate the chromatin contact matrix to evaluate genome scaffolding |
| FALCON‐Phase<sup>e</sup> | Scaffolding and phasing of a diploid genome | Hi‐C read pairs; FALCON‐Unzip assembly | Output two phased full‐length pseudo‐haplotypes |
| HiCAssembler<sup>f</sup> | Misassemblies are corrected by iterative joining of high‐confidence scaffold paths | Hi‐C matrix of h5 format created by HiCExplorer | Misassembled regions in the input assembly can be corrected by specifying the location in the program |
| instaGRAAL<sup>g</sup> | Overhauling the GRAAL program to allow efficient assembly of large genomes | Hi‐C matrix of instaGRAAL format created by hicstuff or HiC‐Box | Requires NVIDIA CUDA and can be executed in a limited environment |
| LACHESIS<sup>h</sup> | No function to correct scaffold misjoins | Generic bam format | Developer's support discontinued; intricate installation |
| HiRise<sup>i</sup> | Employed in Dovetail Chicago/Hi‐C service | Generic bam format | Open‐source version at GitHub not updated since 2015 |
<sup>a</sup>Dudchenko et al., 2017; <sup>b</sup>Durand et al., 2016; <sup>c</sup>Ghurye et al., 2019; <sup>d</sup>Zhang et al., 2019; <sup>e</sup>Kronenberg et al., 2018; <sup>f</sup>Renschler et al., 2019; <sup>g</sup>Baudry et al., 2020; <sup>h</sup>Burton et al., 2013; <sup>i</sup>Putnam et al., 2016.
FIGURE 3. Genome assembly of the Madagascar ground gecko. (a) Hi‐C contact map. The intensities of chromatin contacts quantified in Hi‐C data (red) are indicated in the matrix of different genomic regions. The blue frames indicate the putative chromosomal units. (b) An example of manual curation. The white frames indicate the scaffold units before Hi‐C scaffolding. In a part of the magnified view of the contact map shown in (a), the two input scaffolds indicated by the dashed lines on the left were judged to be derived from a single scaffold on the right. (c, d) Snail plots of the genome assembly before (c) and after (d) Hi‐C scaffolding. These plots were produced using BlobTools2 (Challis et al., 2020). The light‐grey spiral at the center shows the cumulative record count on a log scale, with the white lines indicating successive orders of digits. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold of the assembly, and the ranges in orange and light orange indicate the N50 and N90 lengths, respectively. The blue area in the outer layer shows the distribution of GC, AT, and N percentages in the base composition of each scaffold unit.
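
The N50 and N90 lengths highlighted in the snail plots can be computed directly from a list of scaffold lengths. The following is a minimal sketch under the common definition (the length of the scaffold at which the cumulative sum, taken from the longest scaffold downward, first reaches 50% or 90% of the total assembly length). The function name `nx_length` and the toy lengths are hypothetical, and BlobTools2 may differ in implementation details.

```python
def nx_length(lengths: list[int], percent: int = 50) -> int:
    """Return the Nx length (N50 by default) of a set of scaffold lengths."""
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 >= percent * total:
            return length
    raise AssertionError("unreachable: the full sum always meets the threshold")


if __name__ == "__main__":
    toy_lengths = [50, 30, 10, 5, 5]  # hypothetical scaffold lengths
    print("N50:", nx_length(toy_lengths, 50))
    print("N90:", nx_length(toy_lengths, 90))
```

Because N50 depends only on the length distribution, it rises sharply when scaffolds are joined, whether or not the joins are correct; it therefore needs to be read alongside the contact map and the completeness assessment.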
Improvement of the Madagascar ground gecko genome assembly. BUSCO's Tetrapoda gene set consisting of 5310 orthologues was used to assess gene space completeness with BUSCO v5
| Metric | Draft v1.0 (Hara et al., | Hi‐C scaffolds v2.0 (this study) |
|---|---|---|
| Total length (Mbp) | 1,694 | 1,562 |
| N50 scaffold length (Mbp) | 4.1 | 109.0 |
| Largest scaffold length (Mbp) | 33.7 | 184.3 |
| Number of scaffolds >1 Mbp | 297 | 18 |
| % of sum length of sequences >10 Mbp | 26.6 | 96.5 |
| % of sum length of sequences >1 Mbp | 73.3 | 96.5 |
| Number (%) of reference orthologues detected as “complete” | 4,575 (86.16) | 4,577 (86.20) |
| Number (%) of reference orthologues detected as “fragmented” or “complete” | 4,960 (93.41) | 4,969 (93.58) |
| Number (%) of reference orthologues detected as “duplicated” | 45 (0.8) | 38 (0.7) |
| Number (%) of reference orthologues recognized as “missing” | 350 (6.59) | 341 (6.42) |
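
The percentages in the table follow from the raw counts and the 5310 orthologues in the Tetrapoda gene set. The following minimal sketch recomputes them as a consistency check. The function name `busco_summary` is hypothetical, and the "fragmented" counts (385 and 392) are derived here by subtracting the "complete" count from the "fragmented or complete" count rather than taken from the table directly.

```python
TOTAL_ORTHOLOGUES = 5310  # size of the Tetrapoda gene set stated above


def busco_summary(complete: int, fragmented: int, missing: int,
                  total: int = TOTAL_ORTHOLOGUES) -> dict[str, float]:
    """Convert category counts into percentages of the reference gene set."""
    # Every reference orthologue falls into exactly one of the three categories.
    if complete + fragmented + missing != total:
        raise ValueError("category counts do not add up to the gene set size")
    return {
        "complete": round(100 * complete / total, 2),
        "complete_or_fragmented": round(100 * (complete + fragmented) / total, 2),
        "missing": round(100 * missing / total, 2),
    }


if __name__ == "__main__":
    # Counts from the table; fragmented = (fragmented or complete) - complete.
    draft_v1 = busco_summary(complete=4575, fragmented=4960 - 4575, missing=350)
    hic_v2 = busco_summary(complete=4577, fragmented=4969 - 4577, missing=341)
    print("Draft v1.0:", draft_v1)
    print("Hi-C scaffolds v2.0:", hic_v2)
```

The two assemblies differ by only two "complete" orthologues out of 5310, which illustrates that scaffolding changes contiguity far more than it changes the detected gene space.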