Technical considerations in Hi-C scaffolding and evaluation of chromosome-scale genome assemblies.

Kazuaki Yamaguchi¹, Mitsutaka Kadota¹, Osamu Nishimura¹, Yuta Ohishi¹, Yuki Naito², Shigehiro Kuraku¹,³,⁴.

Abstract

The recent development of ecological studies has been fueled by the introduction of massive information based on chromosome-scale genome sequences, even for species for which genetic linkage is not accessible. This was enabled mainly by the application of Hi-C, a method for genome-wide chromosome conformation capture that was originally developed for investigating the long-range interaction of chromatins. Performing genomic scaffolding using Hi-C data is highly resource-demanding and employs elaborate laboratory steps for sample preparation. It starts with building a primary genome sequence assembly as an input, which is followed by computation for genome scaffolding using Hi-C data, requiring careful validation. This article presents technical considerations for obtaining optimal Hi-C scaffolding results and provides a test case of its application to a reptile species, the Madagascar ground gecko (Paroedura picta). Among the metrics that are frequently used for evaluating scaffolding results, we investigate the validity of the completeness assessment of chromosome-scale genome assemblies using single-copy reference orthologues.

Keywords: BUSCO; Hi-C scaffolding; chromosome-scale genome assembly; completeness assessment; gene space; iconHi-C

Substances: Chromatin

Year: 2021 PMID: 34432923 PMCID: PMC9292758 DOI: 10.1111/mec.16146

Source DB: PubMed Journal: Mol Ecol ISSN: 0962-1083 Impact factor: 6.622

INTRODUCTION

Molecular ecology research often targets intra‐ or interspecific variations of information in DNA sequences. In eukaryotes, DNA molecules are found in cell nuclei as part of “chromatin”, a complex of DNA and proteins that modulates the conformation of chromosomal DNAs in the nuclear environment. Hi‐C is a method for the genome‐wide capture of such chromosome conformations and was originally developed for detecting long‐range chromatin interactions (Lieberman‐Aiden et al., 2009) (Figure 1). This method has more recently been applied to the scaffolding of genome sequences from diverse species (Burton et al., 2013; Kaplan & Dekker, 2013; Marie‐Nelly et al., 2014). In general, the more closely two genomic regions are located on a DNA sequence, the more frequently they come into contact in the three‐dimensional organization of chromatin. In genome scaffolding using Hi‐C data, fragmentary sequences of genomic DNA are grouped, ordered, and oriented on the basis of chromatin contact frequency between different genomic regions. Collectively, genome scaffolding based on this type of chromatin contact, captured in situ in nuclei by digestion and ligation (“proximity ligation”), is called proximity‐guided assembly (PGA).
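The grouping step described above can be illustrated with a deliberately simplified sketch. It is not the algorithm of any published scaffolder: the function name `group_contigs`, the normalisation by the product of contig lengths, and the threshold value are all assumptions made for illustration, and ordering and orientation are not addressed.

```python
from collections import defaultdict


def group_contigs(lengths, contacts, min_density):
    """Group contigs linked by a sufficiently high Hi-C contact density.

    lengths:     {contig_name: length in bp}
    contacts:    {(contig_a, contig_b): number of Hi-C read pairs linking them}
    min_density: hypothetical threshold on pairs / (len_a * len_b)
    """
    parent = {name: name for name in lengths}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), n_pairs in contacts.items():
        # Crude length normalisation (an assumption; real tools often
        # normalise by the number of restriction sites instead).
        density = n_pairs / (lengths[a] * lengths[b])
        if density >= min_density:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in lengths:
        groups[find(name)].append(name)
    return sorted(sorted(members) for members in groups.values())


# Toy data: c1/c2 and c3/c4 share many contacts; c2/c3 share very few.
toy_lengths = {"c1": 1_000_000, "c2": 2_000_000, "c3": 1_500_000, "c4": 500_000}
toy_contacts = {("c1", "c2"): 4000, ("c3", "c4"): 1500, ("c2", "c3"): 20}
print(group_contigs(toy_lengths, toy_contacts, min_density=1e-10))
```

With these toy numbers, the two strongly linked pairs end up in separate groups, mimicking two putative chromosomal units.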

FIGURE 1

Overview of the workflow used for Hi‐C library preparation. Digestion of chromatin DNA is performed with restriction enzymes or DNA nuclease. DNA ends are labelled by a biotinylated nucleotide (left) or a biotinylated bridge adapter (right). Ligation is performed in situ in the nucleus, and biotin‐containing DNA is captured and used for the generation of sequencing libraries.

Molecular ecology studies have been fueled by genome‐wide approaches for monitoring genetic diversity, which is most reliably achieved by the assembly of whole‐genome sequences using the output of modern DNA sequencers. Previously, sequences resulting from whole‐genome assembly were often flanked by long interspersed repeats and remained unassembled with any other sequence (Peona et al., 2018). Under this circumstance, chromosome‐scale sequences were obtained only through genetic linkage mapping, which requires a cross of identified mates and a sufficient number of offspring (Tang et al., 2015; Yoshitake et al., 2018), or optical mapping, which requires a large quantity of high‐molecular‐weight genomic DNA. After the introduction of PGA, Hi‐C scaffolding has become a major solution and has been adopted in mass genome sequencing projects to realize the reconstruction of chromosome‐scale sequences of genomic DNA (e.g., Rhie et al., 2021). The utility of Hi‐C scaffolding is characterized by its handiness (compared with the resource‐demanding alternatives mentioned above), requiring only chromatin preparation from a single individual and short‐read sequencing on an ordinary sequencing platform. Nonetheless, performing successful Hi‐C scaffolding is not trivial. Most frequently, researchers outsource the whole process to a commercial company or an experienced collaborator, which may not allow them to optimize parameters pertaining to sample preparation and computation with repeated attempts. 
Alternatively, especially when cost‐saving is desired, researchers may perform the whole preparation by themselves; however, different parts of the process (tissue sampling, library preparation, sequencing, scaffolding, and output validation) may be performed by different individuals, rarely resulting in a self‐contained experience. For these reasons, technical tips regarding the whole process are not explicitly written or shared with academic researcher communities, although they may accumulate at facilities that take on mass genome sequencing projects. It should also be noted that Hi‐C requires the chromatin contained in cell nuclei, rather than extracted genomic DNA. This is often misunderstood, even by those who have a long experience with DNA sequencing, resulting in the unfavourable sampling and storage of materials. In this review, we address the existing technical information about sample preparation protocols/kits and computational programs, and present technical factors for more successful Hi‐C scaffolding (Figure 2) based on our experience with diverse multicellular organisms (Kadota et al., 2020).

FIGURE 2

Technical considerations in Hi‐C scaffolding. The major points regarding technical considerations (left) are shown as hands‐on steps. Individual rows show possible solutions (middle) and our demonstration using the Madagascar ground gecko (right). See Naumova et al. (2013) for details of the potential effect of cell cycle status on chromatin contacts. See Kadota et al. (2020) for the method to estimate the optimal number of PCR cycles for library amplification.

WHAT MAKES A DIFFERENCE IN CHROMOSOME‐SCALE GENOME SCAFFOLDING?

The analysis of chromatin dynamics, for which Hi‐C was originally developed, requires appropriate tissues/cells as materials for addressing specific biological questions; however, in Hi‐C scaffolding, the choice of materials is less important because it targets the reconstruction of the whole genome as the uniform goal, even when using different cell populations in an organism. One may expect that the use of numerous types of tissues will yield an optimal performance covering maximally diverse chromatin contacts. However, our previous attempt with this intention did not lead to improvement (Kadota et al., 2020). In general, the use of multiple tissues (in separate preparations) should increase the chance of obtaining a more successful library, and it is preferable to choose tissues with low endogenous nuclease activity (that is, to avoid nuclease‐rich tissues such as pancreas) (Takeshita et al., 2000) and those from which single cells can be prepared relatively easily for chromatin fixation (e.g., blood). For animals, tissues including muscle, blood, and liver are listed as promising choices in typical Hi‐C manuals and have been frequently used in published studies (e.g., Rhie et al., 2021). Basically, frozen tissues can serve as materials for Hi‐C library preparation, which should lower the hurdle for species that are difficult to access, a typical challenge in ecology. Table 1 summarizes the key laboratory steps in the preparation of chromatin, Hi‐C DNA, and libraries for sequencing, in that order. As noncommercial choices, this table includes the traditional protocol by Rao et al. (2014), as well as a derivative of this protocol, iconHi‐C (Kadota et al., 2020), which resembles many others (e.g., Belaghzal et al., 2017). As of April 2021, four biochemical companies (Arima Genomics, Dovetail Genomics, Phase Genomics, and Qiagen) manufacture Hi‐C kits, which are formulated with different components and protocols. 
In general, conventional Hi‐C kits employ a restriction enzyme or a cocktail of multiple restriction enzymes, whereas Omni‐C employs a sequence‐independent endonuclease (Table 1). In Omni‐C, which is now provided as a kit by Dovetail Genomics, disuccinimidyl glutarate (DSG) and formaldehyde are used for sample fixation to capture more proximal contacts (Nowak et al., 2005). Restriction enzyme digestion and ligation are performed in situ or on chromatin‐binding beads. Library preparation is performed by sonication followed by adapter ligation. The differences in specification between these kits/protocols include (1) choice of the DNA digestion method, (2) method of biotin incorporation, (3) adaptability of the sample quality control (QC) to the laboratory workflow, and (4) degree of amplification in library preparation (Table 1). Close attention to these factors helps to flag unsuccessful sample preparation, such as insufficient chromatin fixation and insufficient DNA digestion, and allows the retrieval of chromatin contacts with maximal diversity. Signs of unsuccessful samples can be detected in QC steps before sequencing (Kadota et al., 2020). When a species of interest has unusual properties in the biochemistry of the selected tissues, genome size, or base composition, which affect the efficiency and uniformity of DNA fragmentation, the choice of the kit/protocol may be crucial (Figure 2). In sequencing Hi‐C libraries, it is usually recommended to obtain 100 million read pairs per Gb of genome (Dudchenko et al., 2018; https://www.dnazoo.org/methods), as also suggested by typical Hi‐C kit manuals. Ultimately, the diversity of library molecules, which can be inferred with preliminary small‐scale sequencing (Kadota et al., 2020), determines the ideal number of read pairs to obtain.
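The sequencing depth guideline above is simple arithmetic, sketched here for convenience. The function names are hypothetical; the only figure taken from the text is the rule of thumb of 100 million read pairs per Gb of genome, and the example uses the 1.80‐Gbp genome size and the roughly 100 million read pairs reported for the gecko test case later in this article.

```python
# Rule of thumb quoted in the text: 100 million read pairs per Gb of genome.
READ_PAIRS_PER_GB = 100_000_000


def recommended_read_pairs(genome_size_gb, pairs_per_gb=READ_PAIRS_PER_GB):
    """Return the number of Hi-C read pairs suggested by the rule of thumb."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return round(genome_size_gb * pairs_per_gb)


def fraction_of_recommendation(obtained_pairs, genome_size_gb):
    """Return obtained read pairs as a fraction of the recommended number."""
    return obtained_pairs / recommended_read_pairs(genome_size_gb)


target = recommended_read_pairs(1.80)
achieved = fraction_of_recommendation(100_000_000, 1.80)
print(f"recommended: {target:,} read pairs; obtained fraction: {achieved:.2f}")
```

As the text notes, this guideline is only a starting point: the diversity of library molecules ultimately determines how many read pairs are worth obtaining.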

TABLE 1

Comparison of sample preparation for proximity‐based genome scaffolding

Different specifications	In situ Hi‐C by Rao et al.^a	iconHi‐C (ver. 1.0)^b	Arima‐HiC Kit (Arima Genomics; ver. A160134 v01)	Proximo Hi‐C (Animal) Prep Kit (Phase Genomics; ver. 4.0)	Dovetail Hi‐C Kit (Dovetail Genomics; ver. 1.4)	Omni‐C Proximity Ligation Assay Kit (Dovetail Genomics; ver. 1.3)	EpiTect Hi‐C Kit (Qiagen; ver. 04/2019)
Crosslinking agent	Formaldehyde (final 1%)	Formaldehyde (final 1%)	Formaldehyde (final 2%)	Crosslinking solution (included in the kit)	Formaldehyde (final 1.5%)	DSG (final 30 mM)^c and formaldehyde (final 1%)	Formaldehyde (final 1%)
Enzyme for chromatin DNA digestion	MboI (cuts at "GATC")	HindIII (cuts at "AAGCTT") or DpnII (cuts at "GATC")	Cocktail of A1 and A2 enzymes (cut at "GATC" and "GANTC")^c	Sau3AI (cuts at "GATC")	DpnII (cuts at "GATC")	Nuclease enzyme mix^c	Hi‐C digestion enzyme (cuts at "GATC")
Duration of restriction enzyme digestion	2 h to overnight at 37℃	Overnight at 37℃	30–60 min at 37℃	1 h at 37℃	1 h at 37℃	30 min at 30℃	2 h at 37℃
Biotin‐labeling method	Incorporation of biotinylated nucleotide	Incorporation of biotinylated nucleotide	Incorporation of biotinylated nucleotide	Incorporation of biotinylated nucleotide	Incorporation of biotinylated nucleotide	Ligation of biotin‐containing bridge adapter^c	Incorporation of biotinylated nucleotide
Chromatin capture	N/A	N/A	N/A	By Recovery Beads (included in the kit)^c	By Chromatin Capture Beads (included in the kit)^c	By Chromatin Capture Beads (included in the kit)^c	N/A
Ligation condition	4 h at room temperature	4–6 h at 16℃	15 min at room temperature	4 h at 25℃	1–16 h at 16℃	30 min at 22℃ and 1 h at 22°C^c	2 h at 16℃
Reverse crosslinking	Overnight or at least 1.5 h at 68℃	Overnight at 65℃	1.5–16 h at 68℃	1–18 h at 65℃	45 min at 68℃	45 min at 68℃	90 min at 80°C^c
Quality control (QC) of ligated DNA	No	Yes (by size distribution analysis)	Yes (yield of biotin incorporated DNA)	Yes (yield of biotin incorporated DNA)	Yes (yield of ligated DNA)	Yes (yield of ligated DNA)	No
Fragmentation of the ligated DNA	Yes (by sonication)	Yes (by sonication)	Yes (by sonication)	Yes (enzymatic; included in the kit)^c	Yes (by sonication)	No^c	Yes (by sonication)
Removal of biotin from unligated ends	No	Yes	No	No	No	N/A	No
PCR cycles for sequencing library preparation	4–12 cycles	Optimized for each library^c	Optimized for each library^c	12 cycles	11 cycles	12 cycles	7 cycles
Library QC target	Not specified	Yield and size distribution; digestion with NheI or ClaI^c	Yield and size distribution	Yield and size distribution	Yield and size distribution	Yield and size distribution	Yield and size distribution

^a Rao et al., 2014; ^b Kadota et al., 2020; ^c Specification applied to a subset of the kits/protocols.

Table 2 summarizes the specification of the existing computational programs for Hi‐C scaffolding. Most of these were developed and maintained by academic parties, with the exception of HiRise, which is used exclusively in paid services by Dovetail Genomics (Putnam et al., 2016), and LACHESIS, which is no longer maintained (Burton et al., 2013). These programs implement different algorithms for using Hi‐C read alignment in scaffolding sequences (Ghurye et al., 2019). Apart from those core algorithmic differences, more superficial parameters with default settings that vary among programs can also largely affect the output, which includes a minimum input sequence length (see Kadota et al., 2020 for an example of a remarkable improvement using an altered length parameter setting) and the number of iterative cycles for misjoin correction (Figure 2). Some of the programs listed in Table 2 are used with certain specifications. 
FALCON‐Phase (Kronenberg et al., 2018) requires the output of the long read‐based assembly by FALCON‐Unzip (Chin et al., 2016), whereas ALLHiC, which was developed to overcome the difficulty in resolving polyploidy, requires a chromosome‐scale genome assembly or an associated gene annotation of a closely related species for phasing and scaffolding polyploid genomes (Zhang et al., 2019). More crucial factors that are independent of program choice include the quality and continuity of the input genome assembly (reviewed in Whibley et al., 2021) and the amount of Hi‐C reads remaining after excluding improper fragments resulting from unintended ligation products (self‐ligation, religation, and unligation [“dangling end”]; see Kadota et al., 2020 for details).
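The exclusion of improper fragments mentioned above can be sketched as follows. This is a simplified illustration based on conventions commonly used by Hi‐C read filtering tools (same restriction fragment and inward‐facing reads for a dangling end, same fragment and outward‐facing reads for self‐ligation, adjacent fragments and inward‐facing reads for religation); the exact definitions in Kadota et al. (2020) may differ, and the `ReadEnd` type and category labels are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadEnd:
    contig: str    # sequence the read maps to
    fragment: int  # index of the restriction fragment containing the read
    pos: int       # leftmost mapped coordinate
    strand: str    # "+" or "-"


def classify_pair(a, b):
    """Assign a read pair to a simplified Hi-C ligation-product category."""
    if a.contig != b.contig:
        return "valid"
    left, right = sorted((a, b), key=lambda end: end.pos)
    inward = left.strand == "+" and right.strand == "-"
    outward = left.strand == "-" and right.strand == "+"
    if left.fragment == right.fragment:
        if inward:
            return "dangling_end"      # unligated fragment
        if outward:
            return "self_ligation"     # fragment circularised on itself
        return "same_fragment_other"
    if abs(left.fragment - right.fragment) == 1 and inward:
        return "religation"            # neighbouring fragments rejoined
    return "valid"


def summarise(pairs):
    """Count read pairs per category."""
    return Counter(classify_pair(a, b) for a, b in pairs)


pairs = [
    (ReadEnd("ctg1", 5, 1000, "+"), ReadEnd("ctg1", 5, 1300, "-")),
    (ReadEnd("ctg1", 5, 1000, "-"), ReadEnd("ctg1", 5, 1300, "+")),
    (ReadEnd("ctg1", 5, 1000, "+"), ReadEnd("ctg1", 6, 1500, "-")),
    (ReadEnd("ctg1", 5, 1000, "+"), ReadEnd("ctg2", 40, 90000, "+")),
]
print(summarise(pairs))
```

Only pairs in the "valid" category would count towards the amount of usable Hi‐C data in such a scheme.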

TABLE 2

Comparison of computational programs for proximity‐based genome scaffolding. The programs are sorted in descending order of the number of citations of the literature introducing the individual programs, with the exception of the programs that are not openly maintained (LACHESIS and HiRise, at the bottom).

Program	Description	Input data requirement	Other information
3d‐dna^a,b	Misjoin correction algorithm is applied to detect errors in the input assembly; compatible with multiple enzymes	Accepts only Juicer mapper format	The results can be reviewed and modified directly by JuiceBox
SALSA2^c	Uses the physical coverage of Hi‐C pairs to identify misassembled regions of the input assembly; compatible with multiple enzymes	Generic bam (bed) file, assembly graph, unitig, 10x link files	The results can be visualized by JuiceBox via the included script
ALLHiC^d	Scaffolding and phasing of a polyploid genome	Hi‐C read pairs; (option) associated gene annotation or chromosome‐scale genome assembly for a closely related species	Generate the chromatin contact matrix to evaluate genome scaffolding
FALCON‐Phase^e	Scaffolding and phasing of a diploid genome	Hi‐C read pairs; FALCON‐Unzip assembly	Output two phased full‐length pseudo‐haplotypes
HiCAssembler^f	Misassemblies are corrected by iterative joining of high‐confidence scaffold paths	Hi‐C matrix of h5 format created by HiCExplorer	Misassembled regions in the input assembly can be corrected by specifying the location in the program
instaGRAAL^g	Overhauling the GRAAL program to allow efficient assembly of large genomes	Hi‐C matrix of instaGRAAL format created by hicstuff or HiC‐Box	Requires NVIDIA CUDA and can be executed in a limited environment
LACHESIS^h	No function to correct scaffold misjoins	Generic bam format	Developer's support discontinued; intricate installation
HiRise^i	Employed in Dovetail Chicago/Hi‐C service	Generic bam format	Open‐source version at GitHub not updated since 2015

^a Dudchenko et al., 2017; ^b Durand et al., 2016; ^c Ghurye et al., 2019; ^d Zhang et al., 2019; ^e Kronenberg et al., 2018; ^f Renschler et al., 2019; ^g Baudry et al., 2020; ^h Burton et al., 2013; ^i Putnam et al., 2016.

Overall, there is no single gold‐standard method for library preparation and post‐sequencing scaffolding. When a need for troubleshooting is encountered, one can consider the technical points included in Figure 2, which may provide alternatives for possible improvement.

VALIDATION OF CHROMOSOME‐SCALE SCAFFOLDING OUTPUT

The goal of chromosome‐scale genome assembly is the reconstruction of the actual order of nucleotide bases in DNA sequences. Assembly products can be rigorously evaluated by referring to any independent information on genome size, chromosomal organization, and location of individual genes, if available. It may not be widely known that a Hi‐C scaffolding output needs to be carefully evaluated and can often be manually modified by referring to the matrix of chromatin contact frequencies (Howe et al., 2020; also see below for an example of a reptile species), a process called “review” in the manual of the program 3d‐dna (https://www.dnazoo.org/methods). In Hi‐C scaffolding, inversions and misjoins occur more frequently than in other scaffolding methods (Dudchenko et al., 2018; Ghurye et al., 2019). This is mainly because paired Hi‐C reads carry no information about the original fragment orientation in the genome, and the orientation of the sequences that are to be joined is reliably determined only when they are long enough to harbour sufficient data points for chromatin contacts among them and other sequences. Therefore, it is also important to choose a scaffolding program that assumes and facilitates “review” in a dedicated editor, such as JuiceBox (Dudchenko et al., 2018). In the visualized chromatin contact map, the parts to modify appear as outstanding signals distant from the diagonal line that do not fit in the intensified signals (intrachromosomal contacts) demarcated in squares (Figure 3a). Such outstanding signals caused by sequence misjoins or disjoins can be resolved by relocating the relevant scaffolds in the contact map (e.g., Figure 3a,b). After the “review”, the program HiC‐Hiker can reduce the error rate further by considering not only the junctions between two adjacent contigs, but also multiple neighbouring contigs (Nakabayashi & Morishita, 2020).

FIGURE 3

Genome assembly of the Madagascar ground gecko. (a) Hi‐C contact map. The intensities of chromatin contacts quantified in Hi‐C data (red) are indicated in the matrix of different genomic regions. The blue frames indicate the putative chromosomal units. (b) An example of manual curation. The white frames indicate the scaffold units before Hi‐C scaffolding. In a part of the magnified view of the contact map shown in (a), the two input scaffolds indicated by the dashed lines on the left were judged to be derived from a single scaffold on the right. (c, d) Snail plots of the genome assembly before (c) and after (d) Hi‐C scaffolding. These plots were produced using BlobTools2 (Challis et al., 2020). The light‐grey spiral at the center shows the cumulative record count on a log scale, with the white lines indicating successive orders of digits. The distribution of scaffold lengths is shown in dark grey with the plot radius scaled to the longest scaffold of the assembly, and the ranges in orange and light orange indicate the N50 and N90 lengths, respectively. The blue area in the outer layer shows the distribution of GC, AT, and N percentages in the base composition of each scaffold unit.

In reality, no comprehensive answer is available for checking the output of de novo genome sequencing. However, karyotypes, namely the number and size of chromosomes prepared from single cells, serve as valuable references for these aspects, and should ideally be made available prior to the assessment of Hi‐C scaffolding results (see Uno et al., 2020 for an example of this sort for sharks with scarce karyotyping reports). If chromosomal gene mapping records or optical mapping results also exist, they can be used as a reference for validation, namely for validating the order of the chromosome segments, for example, using protein‐coding genes with such prior records as markers (see Kadota et al., 2020 for an example). 
Several early studies employed an existing genome assembly of a closely related species for validation (Dong et al., 2013; Worley et al., 2014); however, this incurred uncontrollable risks because one cannot discern the artifacts to be corrected from natural cross‐species differences. It should be noted that sex chromosome pairs (X/Y or Z/W) may not be assembled with high precision, especially when they have regions that are similar to each other, which are known as pseudoautosomal regions (PAR) (Liu et al., 2019). Sex chromosomes or their segments can be identified by an outstanding ratio of read coverage between a male and a female, if additional whole genome sequencing reads covering both sexes are available (Palmer et al., 2019). Another typical concern is allelic redundancy. Unless one aims to separate different alleles (“haplotype phasing”), it is advisable to discard highly similar sequences with allelic differences (“haplotigs”) before performing Hi‐C scaffolding, because they can confuse Hi‐C read mapping and result in insufficient scaffolding in those regions. Methods for evaluating large genome assemblies have been long debated, and no single metric allows an overall assessment (Bradnam et al., 2013; Rhie et al., 2021; Thrash et al., 2020; Veeckman et al., 2016). Scaffolding programs insert tracts of undetermined bases (“N”) between the sequences joined by Hi‐C data, and it should be noted that the length of these “N” tracts is implicitly set to a uniform value throughout a genome by individual programs (for example, inserting 500 Ns is the default setting in 3d‐dna and SALSA2). In the evaluation of the output of de novo genome assembly, the metrics N50 length and NG50 length are frequently used (Bradnam et al., 2013). These metrics apply to scaffold sequences and contig sequences, with the latter indicating sequences without any intervening ambiguous bases (“N”). 
The N50 and NG50 lengths denote the length of the shortest sequence among the longest sequences that together cover 50% of the total sequence length in the genome assembly and of the genome size, respectively. Basically, a larger N50 or NG50 length indicates a more continuous genome assembly. However, the optimal N50 or NG50 length is inherently defined by the karyotype of the species of interest. For the human genome, the N50 of the optimal genome assembly is approximately 154 Mbp, while it is limited to approximately 15 Mbp for the sea lamprey, with more than 100 small, dot‐like chromosomes (2n = 168; Potter & Rothwell, 1970). For this unique karyotype, N50 length cannot be substantially larger than 15 Mbp. Even larger N50 lengths for this species or its close relatives would indicate over‐assembly, which can be the result of a limited number of in silico chromosome fusions. Very importantly, the overall sequence length statistics, such as N50 and NG50, do not reflect the sequence content and its precision. To assess sequence content, one of the metrics proposed most recently was the quantification of reconstructed long terminal repeat (LTR) retrotransposons (LTR Assembly Index, LAI) (Ou et al., 2018). This metric, however, is applicable only when prior information of LTR motif sequences is available, and is useful if the genome in question inherently harbours the type of LTRs assumed by this program with a certain abundance. The demand for a more accurate assessment method is increasing as genome sequences of unprecedented quality and continuity emerge. When evaluating genome assemblies, one needs to perform a multifaceted assessment using different metrics (for details, see Rhie et al., 2021), including the coverage of the protein‐coding gene space, which is widely used as a central metric (Figure 2). The following section will focus on how the use of the metric for scoring the completeness of protein‐coding genes should be adapted to the prevailing chromosome‐scale genome assembly production.
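The N50 and NG50 definitions above, and the distinction between scaffolds and contigs, can be made concrete with a minimal sketch. The function names are hypothetical, and splitting scaffolds at any run of “N” is an assumption made for illustration (dedicated tools may apply a minimum gap length).

```python
import re


def nx50(lengths, reference_total):
    """Length of the shortest sequence among the longest sequences that
    together cover at least 50% of reference_total; None if never reached."""
    threshold = reference_total * 0.5
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length
    return None


def n50(lengths):
    """N50: the reference is the total length of the assembly itself."""
    return nx50(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the reference is the (estimated) genome size."""
    return nx50(lengths, genome_size)


def contig_lengths(scaffold_seq):
    """Split a scaffold at runs of undetermined bases and return contig lengths."""
    return [len(part) for part in re.split(r"[Nn]+", scaffold_seq) if part]


toy = [80, 70, 50, 40, 30, 20]             # toy sequence lengths, total 290
print(n50(toy))                            # 80 + 70 = 150 >= 145
print(ng50(toy, genome_size=400))          # 80 + 70 + 50 = 200 >= 200
print(contig_lengths("ACGT" + "N" * 500 + "GGCCA"))
```

Because NG50 is tied to the genome size rather than to the assembly length, it remains undefined (here, `None`) when the assembly covers less than half of the genome.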

LIMITATION OF GENE SPACE COMPLETENESS ASSESSMENT

The measurement of gene space completeness was used as a metric of genome assembly quality even before 2010, when most of the available genome assemblies did not reach a chromosomal scale. The only maintained program for this purpose in that period, CEGMA (Parra et al., 2009), was originally developed for identifying a set of protein‐coding genes in a given de novo genome assembly, to be used as a gene set for training gene prediction programs (Parra et al., 2007). Later, the support for CEGMA was discontinued, which was subsequently almost completely replaced by BUSCO (Simão et al., 2015). Generally, when no other option is available as a benchmark solution, users need to be warned about potential misleading reports from the single solution. As previously reported for the benchmarking of multiple sequence alignments (Iantorno et al., 2014), developers and users of genome assembly assessment tools should be fully informed about the perils of misleading assessments. Since its first release in 2015, BUSCO has been rapidly upgraded to version 2 in 2016, version 3 in 2017, version 4 in 2019, and version 5 in January 2021. BUSCO assumes the use of its accompanying reference gene set derived from OrthoDB (Kriventseva et al., 2019), and both the reference gene set and the pipeline for searching reference genes have been upgraded. This sort of benchmark program is expected to serve as a reliable standard on which genome assemblies can be uniformly compared. Most recently, the BUSCO pipeline was upgraded to version 5 and adopted a new component program for gene search, MetaEuk (Levy Karin et al., 2020), which sometimes yields largely different values compared with the earlier versions 2 and 3 (these two versions superficially perform in the same way because version 3 was a refactored version of version 2). 
Another persistent concern with BUSCO is the criterion for choosing reference single‐copy genes (see Korlach et al., 2017)—genes that are absent from genome‐wide sequences of some species (no more than 10% of all of the species considered) are included in the reference orthologue set. Some genes that were secondarily lost during evolution can also be implicitly queried and judged as missing from the genome assembly because of incomplete sequencing or assembly, which results in underestimation of genome assembly completeness. Such an inaccurate assessment of elaborately produced genome assemblies severely hampers the establishment of reasonable decisions in research. To circumvent this systematic inaccuracy, we previously developed the gene set core vertebrate genes (CVG) that contained only the genes retained as single copies in all 29 rigorously selected vertebrate species (Hara et al., 2015). This gene set is included as an option at our original web application, gVolante (Nishimura et al., 2017, 2019), in which different BUSCO versions (including its latest version 5), as well as CEGMA, are available. Apart from the concerns mentioned above, scoring orthologue detection beyond cross‐species differences is not trivial. As a baseline that is independent of this factor, we assessed the nearly complete human genome assembly CHM13 v1.0 (https://github.com/nanopore‐wgs‐consortium/chm13) released by the Telomere‐to‐Telomere consortium (https://sites.google.com/ucsc.edu/t2tworkinggroup/home; Nurk et al., 2021)—the completeness assessment of this assembly is expected to be nearly 100% if no technical limitations arise. This assessment of the human CHM13 v1.0 assembly resulted in 79 genes judged as missing out of 5,310 BUSCO reference orthologues for Tetrapoda (1.49%) by BUSCO version 5, and one out of 233 CVGs (0.43%) by CEGMA. 
We tentatively analysed the properties of these 79 reference genes that were judged as missing in OrthoDB v9 and v10 and checked manually the nucleotide sequences of the human CHM13 v1.0 genome assembly for the existence of their orthologues. Importantly, this search revealed that all 79 genes existed in the CHM13 v1.0 assembly (Table S1) and proved BUSCO’s false‐negative detections. This suggests a systematic underestimation of completeness assessment scores by BUSCO, which may be worth exploring further on a larger scale. Importantly, in this human CHM13 genome assembly (version 1.0), the five remaining gaps are known to be localized in nonprotein‐coding regions—more precisely, ribosomal DNA arrays in the acrocentric arm regions of five chromosomes. The orthologues that were judged as missing in the assessment above are thought to have escaped the gene detection process of the BUSCO pipeline. It is possible that such false negatives occur when a queried orthologue is too divergent to fit within a range recognized as an orthologue by BUSCO or has sequences that are too long or repetitive (even in introns or flanking non‐coding regions) to be scanned properly by the programs implemented inside BUSCO, namely, TBLASTN and Augustus. Basically, genome assemblies with higher continuity are expected to yield higher completeness scores (see Jauhal & Newcomb, 2021); however, the scores tend to be rather saturated as long as the assessment targets the genomic space marked by a limited number of protein‐coding genes. In resorting to protein‐coding gene completeness, one needs to pay closer attention to the mitigation of false negatives and false positives, by choosing a more appropriate orthologue set and parameters for orthologue search. 
It is also instrumental to perform an independent assessment of gene coverage in genome assemblies by mapping raw RNA‐seq reads or the transcript contig sequences derived from them to the genome assembly sequences (for details, see Rhie et al., 2021).
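The missing rates quoted above for the CHM13 v1.0 baseline follow directly from the reported counts. The short sketch below only reproduces that arithmetic; the function and variable names are illustrative and are not part of BUSCO, CEGMA, or gVolante.

```python
def missing_rate(missing: int, total: int) -> float:
    """Percentage of reference orthologues reported as missing."""
    return 100.0 * missing / total


# Counts reported in the text for the human CHM13 v1.0 assembly.
busco_missing = missing_rate(79, 5310)  # BUSCO v5, Tetrapoda set
cvg_missing = missing_rate(1, 233)      # CEGMA with the CVG set

# Apparent completeness if every "missing" call is taken at face value.
busco_apparent = 100.0 - busco_missing

print(f"BUSCO Tetrapoda missing: {busco_missing:.2f}%")
print(f"CVG missing: {cvg_missing:.2f}%")
print(f"BUSCO apparent completeness: {busco_apparent:.2f}%")
```

Because all 79 genes were found in the assembly by manual inspection, the whole gap between the apparent score and 100% in this baseline is attributable to false negatives rather than to missing sequence.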

ARE THEY CHROMOSOMES? CONSIDERATIONS IN ASSEMBLY FINALIZATION

The typical practice of genome assembly finalization includes the process of removing unnecessary sequences, such as unambiguous contaminants and organelle genomes. Herein, a possible discrepancy between the number of resultant chromosome‐scale sequences and the haploid/diploid chromosome number needs to be addressed. This should be followed by the renumbering of the sequences and other amendments required at sequence submission to public databases (https://www.ncbi.nlm.nih.gov/genbank/genomesubmit/). It remains controversial whether the products of chromosome‐scale genome assemblies can be called “chromosomes”. A semantic criticism in this context is that chromosomes consist of not only DNA, but also other components, mainly proteins. It should also be cautioned that “chromosome‐scale” sequences built by Hi‐C scaffolding alone are prone to errors and should be continuously improved by other approaches—it may be risky to regard “Hi‐C karyotyping” as replacing conventional cytogenetic analyses of karyotypes. To evoke a careful distinction, a set of terms including “C‐scaffold” (for chromosome‐scale genome scaffold, instead of “chromosome”) and “scaffotype” (a set of chromosome‐scale scaffolds, instead of “karyotype”) was introduced to avoid confusion (Lewin et al., 2019). Apart from these concerns about semantics and QC, the utility of chromosome‐scale genome sequences opens up new frontiers of molecular‐level biology affecting a wide variety of fields involving diverse species (reviewed in Deakin et al., 2019).

TEST CASE OF THE MADAGASCAR GROUND GECKO

As a test case, we dissected the chromosome‐scale genome assembly of the Madagascar ground gecko (Paroedura picta) by referring to the technical consideration factors raised above (Figure 2). The karyotype of this species is 2n = 36 (Main et al., 2012), and the genome size based on the nuclear DNA content is 1.80 Gbp (Hara et al., 2018). Molecular sequence data provision for this animal was initiated with transcriptome analysis (Hara et al., 2015), which was followed by short‐read genome assembly (Hara et al., 2018). For loss‐of‐function experiments, genome editing with CRISPR/Cas9 was recently demonstrated in a reptilian species (Rasys et al., 2019). To promote the potential of the Madagascar ground gecko in question‐driven biological studies, its genome assembly has been incorporated as one of the target species into the guide RNA designing tool CRISPRdirect (https://crispr.dbcls.jp/) (Naito et al., 2015). This resource is expected to facilitate the use of this animal in diverse life science studies that demand loss‐of‐function experiments. The chromosome‐scale genome scaffolding of the Madagascar ground gecko benefited from the supply of embryos (see Supporting Information Methods for the detailed procedure). Chromatin preparation from the small embryonic sample allowed the improvement of sequence continuity without sacrificing adult animals: the N50 scaffold length increased from 4.1 to 109.0 Mbp (Table 3). This scaffolding performance was achieved with only about 100 million read pairs, which is roughly half of the data size usually recommended in the specifications of commercial kits (100 million read pairs per Gb of genome, i.e., about 180 million read pairs for this 1.80‐Gbp genome). This could be because the diversity of the reads obtained from our Hi‐C library was sufficiently high. Precise control of library quality before sequencing was a prerequisite for this efficient data production (Figure S1).
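The comparison between the sequencing depth used here and the kit guideline can be worked out explicitly. The following is a minimal sketch using only the figures given in the text (1.80 Gbp genome size, 100 million read pairs per Gb recommended, about 100 million read pairs used); the variable names are illustrative.

```python
# Figures taken from the text; names are illustrative only.
RECOMMENDED_PAIRS_PER_GBP = 100e6  # commercial kit guideline quoted above
genome_size_gbp = 1.80             # nuclear DNA content estimate
read_pairs_used = 100e6            # "about 100 million read pairs"

# Read pairs the guideline would suggest for a genome of this size.
recommended_pairs = RECOMMENDED_PAIRS_PER_GBP * genome_size_gbp

# Fraction of the guideline that was actually sequenced.
fraction_of_recommended = read_pairs_used / recommended_pairs

print(f"Recommended: {recommended_pairs / 1e6:.0f} million read pairs")
print(f"Used: {fraction_of_recommended:.0%} of the recommended amount")
```

The result, a little over half of the guideline, is consistent with the statement that the data size was about half of the usual recommendation.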

TABLE 3

Improvement of the Madagascar ground gecko genome assembly. BUSCO's Tetrapoda gene set consisting of 5,310 orthologues was used to assess gene space completeness with BUSCO v5.

Metric	Draft v1.0 (Hara et al., 2018)	Hi‐C scaffolds v2.0 (this study)
Total length (Mbp)	1,694	1,562
N50 scaffold length (Mbp)	4.1	109.0
Largest scaffold length (Mbp)	33.7	184.3
Number of scaffolds >1 Mbp	297	18
% of sum length of sequences >10 Mbp	26.6	96.5
% of sum length of sequences >1 Mbp	73.3	96.5
Number (%) of reference orthologues detected as “complete”	4,575 (86.16)	4,577 (86.20)
Number (%) of reference orthologues detected as “fragmented” or “complete”	4,960 (93.41)	4,969 (93.58)
Number (%) of reference orthologues detected as “duplicated”	45 (0.8)	38 (0.7)
Number (%) of reference orthologues recognized as “missing”	350 (6.59)	341 (6.42)
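The length metrics in Table 3 (N50 scaffold length, largest scaffold length, and the percentage of total length held in sequences above a size cut-off) can be computed from a plain list of sequence lengths. The sketch below uses the common definition of N50 and hypothetical toy lengths; it is not the script used to produce the table.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() requires at least one sequence length")


def fraction_longer_than(lengths, threshold):
    """Fraction of the total length contained in sequences longer than threshold."""
    return sum(n for n in lengths if n > threshold) / sum(lengths)


# Hypothetical toy lengths (arbitrary units), not the gecko assembly.
toy_lengths = [50, 40, 30, 20, 10]

toy_n50 = n50(toy_lengths)
toy_largest = max(toy_lengths)
toy_fraction = fraction_longer_than(toy_lengths, 25)

print(toy_n50, toy_largest, round(100 * toy_fraction, 1))
```

For a real assembly, the lengths would be read from the FASTA file or its index, and the cut-offs would be 1 Mbp and 10 Mbp as in the table.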

As the input for this Hi‐C scaffolding demonstration, aimed at obtaining the first chromosome‐scale genome assembly for the taxon Gekkota (as of May 2021), we employed three draft genome assemblies: (1) the traditional short‐read shotgun assembly, (2) the Chromium Supernova assembly using linked reads, and (3) the combination of the two former data types, as well as scaffolding with paired‐end RNA‐seq reads (Figure S2). Each of these three starting assemblies was scaffolded using Hi‐C reads by varying the input sequence length threshold, as included in Figure 2. We derived 15 chromosome‐level assemblies. A total of 18 assemblies, including the three starting nonchromosome‐scale assemblies, were subjected to the comparison of sequence length statistics (Figure S3) and to completeness assessment with BUSCO. The latter did not produce remarkable differences between the assemblies, as is often observed in the assessment of chromosome‐scale sequences (see above). In contrast, varying the input sequence length threshold largely affected the scaffolding output (Figure S3). Applying the small length of 1,000 bp always produced suboptimal output, while the large length of 15,000 bp, the default of some scaffolding programs, did not produce the best output either (Figures S1 and S3). Among the variable outputs, we evaluated multiple aspects, including the component sequence length distribution, and identified an assembly with optimal or nearly optimal results in all of the N50 scaffold length, the largest scaffold length, and the proportion of the sum scaffold length in the total assembly size (Assembly 6 in Figures S1 and S3). This assembly was subjected to manual curation (“review”; see above) to derive a sequence assembly for public release. 
The manual interventions performed therein included a recovery of the linkage between two small scaffolds, to form a putative single middle‐sized chromosome sequence (Figure 3a,b). Importantly, in assessing the genome assembly of this species, a cross‐species comparison referring to a chromosome‐scale genome assembly was not helpful, because species outside the taxon Gekkota (e.g., anole lizard) diverged more than 150 million years ago (Hara et al., 2018). Instead, our review was performed by referring to the previously published records of gene mapping using fluorescence in situ hybridization (FISH) on a different species of Gekkota (Supporting Information Methods), which assisted the retrieval of the correct order of chromosome segments based on the location of protein‐coding genes. In the resulting genome assembly, the number of chromosome‐scale scaffolds with a length >1 Mbp was 18, which equals or nearly equals the haploid number of chromosomes (n = 18 for XX/ZZ or 19 for XY/ZW; note that the sex chromosome organization in this species is unknown) (Figure 3a). The percentage of sequences longer than 1 Mbp in the entire assembly was 96.5%, indicating that most of the sequence information is incorporated into the resulting chromosome‐sized scaffolds (Table 3). The resulting Madagascar ground gecko genome assembly was assessed to cover 93.58% of BUSCO’s reference orthologues for the taxon Tetrapoda (4,969 out of 5,310 genes), which were judged as being complete or fragmented by BUSCO version 5 (Table 3). The number of reference orthologues detected as complete increased by two genes after Hi‐C scaffolding (Table 3). The low percentage of the orthologues detected as duplicated (<1%) shows that the assembly harbours almost no redundancy caused by duplicated haplotypes. Putative contaminant sequences from organelles or other organisms were removed from the assembly prior to the public release. 
The resulting chromosome‐scale genome assembly of the Madagascar ground gecko, which was introduced as an example of Hi‐C scaffolding, will serve as a basis for various studies focusing on the ecology and evolution of this species, as well as other molecular‐level biological studies performed in comparison with other amniote species, including mammals and birds.
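The input sequence length threshold varied in this test case can be illustrated with a toy filter. This is a minimal sketch assuming that the threshold acts as a minimum-length cut-off on the sequences passed to the scaffolder; the contig names and lengths are hypothetical, and the function is not part of any scaffolding program.

```python
def filter_by_min_length(lengths, min_length):
    """Keep only the sequences whose length is at least min_length (in bp)."""
    return {name: n for name, n in lengths.items() if n >= min_length}


# Hypothetical contig lengths in bp, for illustration only.
toy_contigs = {
    "ctg1": 250_000,
    "ctg2": 40_000,
    "ctg3": 12_000,
    "ctg4": 3_000,
    "ctg5": 800,
}

# The two thresholds named in the text: 1,000 bp and 15,000 bp.
for threshold in (1_000, 15_000):
    kept = filter_by_min_length(toy_contigs, threshold)
    retained = sum(kept.values()) / sum(toy_contigs.values())
    print(f"min length {threshold} bp: {len(kept)} sequences kept, "
          f"{retained:.1%} of total length retained")
```

A low threshold admits many short sequences that carry few Hi-C contacts, whereas a high threshold discards sequence that could otherwise be placed, which is one possible reading of why neither extreme gave the best result here.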

AUTHOR CONTRIBUTIONS

Shigehiro Kuraku conceived the study and drafted the manuscript. Yuta Ohishi, Shigehiro Kuraku, Mitsutaka Kadota, Osamu Nishimura, and Kazuaki Yamaguchi analysed the data reviewed in this article. Yuki Naito and Kazuaki Yamaguchi set up public data use. All authors contributed to the final writing of the manuscript.

CONFLICT OF INTEREST

The authors declare that there is no conflict of interest.

54 in total

1. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes.

Authors: Genis Parra; Keith Bradnam; Ian Korf
Journal: Bioinformatics Date: 2007-03-01 Impact factor: 6.937

2. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

3. The changing face of genome assemblies: Guidance on achieving high-quality reference genomes.

Authors: Annabel Whibley; Joanna L Kelley; Shawn R Narum
Journal: Mol Ecol Resour Date: 2021-01-20 Impact factor: 7.090

4. Phased diploid genome assembly with single-molecule real-time sequencing.

Authors: Chen-Shan Chin; Paul Peluso; Fritz J Sedlazeck; Maria Nattestad; Gregory T Concepcion; Alicia Clum; Christopher Dunn; Ronan O'Malley; Rosa Figueroa-Balderas; Abraham Morales-Cruz; Grant R Cramer; Massimo Delledonne; Chongyuan Luo; Joseph R Ecker; Dario Cantu; David R Rank; Michael C Schatz
Journal: Nat Methods Date: 2016-10-17 Impact factor: 28.547

Review 5. Chromosomics: Bridging the Gap between Genomes and Chromosomes.

Authors: Janine E Deakin; Sally Potter; Rachel O'Neill; Aurora Ruiz-Herrera; Marcelo B Cioffi; Mark D B Eldridge; Kichi Fukui; Jennifer A Marshall Graves; Darren Griffin; Frank Grutzner; Lukáš Kratochvíl; Ikuo Miura; Michail Rovatsos; Kornsorn Srikulnath; Erik Wapstra; Tariq Ezaz
Journal: Genes (Basel) Date: 2019-08-20 Impact factor: 4.096

Review 6. How to identify sex chromosomes and their turnover.

Authors: Daniela H Palmer; Thea F Rogers; Rebecca Dean; Alison E Wright
Journal: Mol Ecol Date: 2019-10-10 Impact factor: 6.185

7. New insights into mammalian sex chromosome structure and evolution using high-quality sequences from bovine X and Y chromosomes.

Authors: Ruijie Liu; Wai Yee Low; Rick Tearle; Sergey Koren; Jay Ghurye; Arang Rhie; Adam M Phillippy; Benjamin D Rosen; Derek M Bickhart; Timothy P L Smith; Stefan Hiendleder; John L Williams
Journal: BMC Genomics Date: 2019-12-19 Impact factor: 3.969

8. Assessing the gene space in draft genomes.

Authors: Genis Parra; Keith Bradnam; Zemin Ning; Thomas Keane; Ian Korf
Journal: Nucleic Acids Res Date: 2008-11-28 Impact factor: 16.971

9. The common marmoset genome provides insight into primate biology and evolution.

Authors:
Journal: Nat Genet Date: 2014-07-20 Impact factor: 38.330

10. MetaEuk-sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics.

Authors: Eli Levy Karin; Milot Mirdita; Johannes Söding
Journal: Microbiome Date: 2020-04-03 Impact factor: 14.650
