Literature DB >> 33987534

Hapo-G, haplotype-aware polishing of genome assemblies with accurate reads.

Abstract

Single-molecule sequencing technologies have recently been commercialized by Pacific Biosciences and Oxford Nanopore with the promise of sequencing long DNA fragments (kilobases to megabases order) and then, using efficient algorithms, provide high quality assemblies in terms of contiguity and completeness of repetitive regions. However, the error rate of long-read technologies is higher than that of short-read technologies. This has a direct consequence on the base quality of genome assemblies, particularly in coding regions where sequencing errors can disrupt the coding frame of genes. In the case of diploid genomes, the consensus of a given gene can be a mixture between the two haplotypes and can lead to premature stop codons. Several methods have been developed to polish genome assemblies using short reads and generally, they inspect the nucleotide one by one, and provide a correction for each nucleotide of the input assembly. As a result, these algorithms are not able to properly process diploid genomes and they typically switch from one haplotype to another. Herein we proposed Hapo-G (Haplotype-Aware Polishing Of Genomes), a new algorithm capable of incorporating phasing information from high-quality reads (short or long-reads) to polish genome assemblies and in particular assemblies of diploid and heterozygous genomes.

Entities: Chemical Gene Species

Year: 2021 PMID： 33987534 PMCID： PMC8092372 DOI： 10.1093/nargab/lqab034

Source DB: PubMed Journal: NAR Genom Bioinform ISSN： 2631-9268

INTRODUCTION

Long-read technologies commercialized by Pacific Biosciences (PACBIO) and Oxford Nanopore Technologies (ONT) are able to sequence long DNA molecules but at the cost of a higher error rate at least for standard protocols. Their throughputs are sufficient to generate complex genomes (1–5) and their costs are almost compatible with their use in large-scale resequencing projects (6–8). Standard genome assemblies currently rely on a combination of several technologies, making it possible to generate complete assemblies in terms of both repetitive and coding regions. The quality of the consensus relies heavily on the use of high-quality reads, mainly short reads, and the choice of a polishing algorithm. One of the most popular polishing algorithms, Pilon (9), was developed several years ago, before the advent of the long-read era and was originally designed to detect variants and improve microbial genome assemblies. With the increasing popularity of long-read technologies, public databases now contain a large collection of very contiguous assemblies, but even if the overall reported quality seems sufficient, local errors can critically affect protein prediction (10). Aware of this issue, the bioinformatic community has developed several tools over the past 2 years (11–16). Most of the tools (Pilon, Racon, NextPolish, HyPo, Apollo and POLCA) are based on short-read alignment, ntEdit is the only method that uses a kmer approach and NextPolish combines both strategies (Table 1). After aligning the short reads, the algorithms detect errors by examining the pileup of bases from the reads (Pilon, NextPolish), by generating a consensus using Partial Order Alignment (Racon, HyPo) or by detecting variants (POLCA). While these tools are capable of correcting most of the errors in a draft assembly generated using long reads, we have observed frequent issues when correcting heterozygous regions. Indeed, the case of diploid genomes is particularly problematic since in this case the long-read assembly is composed of collapsed homozygous regions and duplicated allelic regions which will complicate the correct alignment of short reads. As the existing tools work locally and not at the scale of a 150 bp read and its mate, they frequently generate a mixture of haplotypes. Switching between haplotypes is problematic for the alignment of short reads and variant calling, but it can also affect the coding sequence of genes. As an example, when we were dealing with a long-read genome assembly, we observed that the Pilon correction was not able to restore a deletion in a heterozygous coding region (Figure 1). This simple observation motivated our need to develop a new polishing algorithm.

Table 1.

General characteristics of existing polishing algorithms. These seven tools were evaluated in our benchmark with the specified parameters

	Reference	Publication date	Version	Read alignment	Integrated aligner	Parameters used in the benchmark
Hapo-G	This study	-	0.1	yes	BWA	-t 36
Apollo	Firtina C. et al. (12)	2020	2.0	yes	no	-t 36
POLCA	Ziminn AV. et al. (14)	2020	3.4.2	yes	BWA	-t 36
HyPo	Kundu R. et al. (13)	2019	1.0.3	yes	no	-s size* -c 180 -t 36
NextPolish	Hu J. et al. (12)	2019	1.3.2	yes	BWA	task = best parallel_jobs = 6 multithread_jobs = 6 genome_size = auto
ntEdit	Warren RL. et al. (11)	2019	1.3.2	no	NA	-k 40 -t 36 –outbloom –solid ntEdit: -m 1 -t 36
Racon	Vaser R. et al. (15)	2017	1.4.3	yes	no	-t 36
Pilon	Walker BJ. et al. (9)	2014	1.23	yes	no	–threads 36

*size = 120Mb (Arabidopsis thaliana), 120Mb (Solanum tuberosum), 100 K (synthetic sequence).

Figure 1.

Example of a deletion in a coding frame. The two haplotypes have consistent coding frames (codons are alternatively colored in red and blue) and the draft assembly contains a deletion in a stretch of T’s (green box). Pilon was not able to restore the coding frame and add a second frameshift. In comparison Hapo-G was able to restore haplotype A.

General characteristics of existing polishing algorithms. These seven tools were evaluated in our benchmark with the specified parameters *size = 120Mb (Arabidopsis thaliana), 120Mb (Solanum tuberosum), 100 K (synthetic sequence). Example of a deletion in a coding frame. The two haplotypes have consistent coding frames (codons are alternatively colored in red and blue) and the draft assembly contains a deletion in a stretch of T’s (green box). Pilon was not able to restore the coding frame and add a second frameshift. In comparison Hapo-G was able to restore haplotype A. Here we present Hapo-G (pronounced as apogee), a new method dedicated to the polishing of genome assemblies. This algorithm tends to phase the assembly while correcting the sequencing errors. We compare Hapo-G with existing polishers (HyPo, Apollo, NextPolish, Pilon, POLCA and Racon) and show that Hapo-G is not only comparable to existing methods for polishing draft assemblies, but also faster and tends to decrease jumps between haplotypes. Hapo-G is written in C, uses the hts library (17) and is freely available at http://www.genoscope.cns.fr/hapog.

MATERIALS AND METHODS

Hapo-G algorithm

Hapo-G, like most existing tools, requires a sorted bam file containing the high-quality read alignments on the draft genome. These alignments could have been generated using bwa mem (18), minimap2 (19) or any other alignment tool capable of producing a bam file. Hapo-G maintains two stacks of alignments, the first (all-ali) contains all the alignments that overlap the currently inspected base, and the second (hap-ali) contains only the read alignments that agree with the last selected haplotype. Hapo-G selects a reference alignment and tries to use it as long as possible to polish the region where it aligns, which will minimize mixing between haplotypes (Figure 2).

Figure 2.

Description of the Hapo-G algorithm. Two stacks of alignments are stored, all-ali which contains all the alignment of a specific region (not shown) and hap-ali which contains reads from the same haplotype. The bam file is processed iteratively, and for each input alignment, Hapo-G will polish the region (draft genome is in black) between the start of the last alignment and the start of the current alignment. The reference alignment is the one used as the backbone for error-correction. Once the frequency of the reference base (in red) is computed, the position is classified as homozygous (case 1, left panel), heterozygous (case 2, lower panel) or sequencing error (case 3, right panel). Hapo-G performs the polishing sequentially and scans the input bam file of alignments sorted by position. For each input alignment (called current alignment), Hapo-G polishes each nucleotide in the region between the last recorded position and the start position of the current alignment (called current position), after which the current alignment will be added to both stacks (Figure 2).

Polishing of a given nucleotide in the draft assembly

First, the two stacks (all-ali and hap-ali), if they are not empty, are cleaned to remove any alignment that does not overlap with the current position. In case the reference alignment has been deleted, a new alignment is selected from the hap-ali stack (the read alignment that ends closest to the current position). If the coverage at the current position is below a threshold, set at three reads, the current base in the draft sequence remains unchanged. Otherwise, the nucleotide of the reference alignment (called the reference base) is extracted and the frequency of this reference base is calculated in the all-ali and hap-ali stacks. Based on its frequency, the current position is tagged as a homozygous site, a heterozygous site or a sequencing error (Figure 2). The position is classified as homozygous if the frequency of the reference base is >0.8 and at least 3 reads from the hap-ali stack are in accordance. If the reference base and the nucleotide of the draft assembly (the current base) are different, the current base is replaced by the reference base. The position is classified as heterozygous if the frequency of the reference base is between 0.2 and 0.8 and the hap-ali stack contains at least six reads. If the reference base and the current base are different, the current base is replaced by the reference base. In addition, any read alignments that do not have the same base as the reference base at the current position will be removed from the hap-ali stack. Indeed, they may represent a second haplotype. Importantly, when a read is removed from the stack, its name and its mate name are added to a hash table. The corresponding read alignments will be ignored when encountered later while polishing the current sequence. The hash table is empty when the end of the current sequence is reached.

Usage and parallelization of Hapo-G

The polishing step of Hapo-G is wrapped in a python script which manages the pre and post processing steps. First, the wrapper indexes the genome and maps the reads on the draft assembly using bwa-mem for short reads and minimap2 for long-reads. The polishing step, written in C using the htslib, is not multithreaded but can be easily parallelized by splitting the input fasta file as well as the alignment file. This divide and conquer strategy makes it possible to speed up the polishing step and allows to take advantage of a wide range of computing architectures.

Generation of benchmarking datasets

Homozygous genome assemblies

We downloaded the Nanopore data of an Arabidopsis thaliana sample produced by Michigan State University (Table 2) and assembled these data using the Flye assembler (20) (v2.8.1) with the parameter ‘-g 120m’ to indicate a genome size of 120 Mb. We obtained an assembly of 118 Mb, comprising 16 large contigs and a contig N50 of 14.8 Mb (Table 2). This assembly was used as input of the Medaka polisher (v1.2.0, https://github.com/nanoporetech/medaka), in conjunction with the Nanopore reads and the ‘-m r941_min_high_g303’ parameter was applied, in order to choose a suitable model for this type of data. This assembly was used to compare polishing tools.

Table 2.

		Arabidopsis thaliana Col-0	Homo sapiens	Synthetic sequence	Solanum tuberosum L. RH89–039-16	Metagenomic sample
Illumina	Accession number	SRR12136403	ERR194147	NA	PRJNA573826	SRX4901583
	Read length (bp)	2 × 150	2 × 101	2 × 150	2 × 250	2 × 151
	Coverage	176 X	30 X	50 X	47 X	1150 X
Nanopore	Accession number	SRR12136402	-	-	PRJNA573826	SRX5161985 SRX4901586
	Reads N50 (bp)	18 827	-	-	25 280	22 660
	Coverage	95 X	-	-	75 X	83 X
	Accession number	-	-	-	SRX7922852	-
PACBIO HiFi	Reads N50 (bp)	-	-	-	10 000	-
	Coverage	-	-	-	14 X	-
Assembly	Number of contigs	238	1,172	1	11,070	107
	Cumulative size	119 992 853	2 818 937 673	102 000	1 332 417 447	49 379 539
	Contig N50 (bp)	14 841 396	11 821 944	102 000	440 422	3 584 230

Datasets and long-read assemblies generated for the benchmark. Coverages were computed using a genome size of 120 Mb, 3 Gb, 1.6 Gb and 56 Mb for Arabidopsis thaliana, Homo sapiens, Solanum tuberosum and the metagenome sample (sum of genome sizes), respectively In addition, we downloaded a human genome assembly based on Nanopore reads that was published by Jain et al. (2) as well as Illumina reads from the same individual (ERR194147). This dataset was used to benchmark all polishers, in terms of quality of the result but also of restitution time and memory requirements.

Heterozygous genome assemblies

We generated a 100 kb sequence using an Homo sapiens model (http://rsat.sb-roscoff.fr/RSAT_home.cgi) and created two haplotypes by incorporating, each time, 100 random mutations into the initial 100 kb sequence (http://www.bioinformatics.org/sms2/mutate_dna.html). We added two random 1 kb sequences to both ends of the two haplotypes to avoid mapping issues on first and last nucleotides. Illumina short-reads were generated from both haplotypes using ART (21) software (version 2.5.1) and the following parameters: -ss HSXt -p -l 150 -f 25 -m 200 -s 10. From the two haplotype sequences, we generated an haploid sequence by alternatively retaining 60 nucleotides of each sequence and adding 2000 random mutations (http://www.bioinformatics.org/sms2/mutate_dna.html) to simulate sequencing errors. The resulting haploid sequence is a mixture of the two haplotypes and is used to test the ability of each polisher to correct errors and phase the draft sequence. Additionally, we downloaded the Nanopore data produced by the authors of a recent article by Zhou et al. (22) describing the diploid assembly of Solanum tuberosum (Table 2). The long-read dataset was assembled using the Flye assembler (20) (v2.8.1) with the parameter ‘-g 1600m’ to indicate a genome size of 1600 Mb (representing the length of the diploid genome). The resulting assembly had a size of 1.33 Gb and a contig N50 of 440 kb (Table 2). Using this assembly, we performed two benchmarks: a first on the whole genome and a second on a specific genomic region which has been thoroughly analyzed in the Zhou et al. publication.

Metagenome assembly

We downloaded two MinION runs from a synthetic mock microbial community (23) composed of 12 bacterial strains (SRX4901586 and SRX5161985) as well as the corresponding Illumina sequencing reads (SRR8073716). We assembled the Nanopore data using Metaflye (24) and obtained a genome assembly of 43.1 Mb in size, composed of 104 contigs of >2 kb. Contigs were then polished two times with Racon (Medaka could not be launched on this too old dataset) and used as input assembly for all polishing methods.

Benchmarking of polishing methods

Each polisher was launched (on a 36 cores server with 380GB of memory) iteratively six times on the input assembly to evaluate accuracy and impact of multiple rounds of correction. If needed, Illumina reads were aligned with BWA mem (v0.7.17 with the default parameters except -t 36), and the resulting bam file was sorted and indexed using Samtools (17) (v1.10 with the default parameters except -@ 36 and -m 10G). Surprisingly, HyPo never succeeded when using a genome size of 1300 Mb for the correction of the Solanum T. genome assembly, but was able to polish the sequence when using 120 Mb. Importantly, as Apollo does not take into account paired-reads, short-reads alignment was performed in single-end mode. In our hands, Apollo needed 60 hours to perform a single iteration over the Arabidopsis genome, 60 times longer than most other methods. We therefore decided not to use it for further testing. The parameters used for each polisher are described in Table 1.

Polishing of homozygous genome assemblies

The QUAST suite (25) was used to generate statistics on the quality of the alignment between the polished assemblies and the reference genome (Col-0 downloaded from the TAIR website). In addition, Illumina reads were aligned on each polished assembly and only perfect alignments were kept. The accuracy and completeness of the gene content were assessed using TAIR10.1 annotation and BUSCO (26) (version 5 with brassicales_odb10 dataset). Exons from the reference Arabidopsis annotation (using the getfasta command from bedtools (27)) were extracted and aligned onto each assembly using the Blat aligner (28). Exons that were aligned with 100% identity along their entire length were kept and unique exon names were counted to avoid multi-mapping bias. On the human genome dataset, all polishers were launched only one time, as this large assembly was mainly used to evaluate their restitution time and memory requirements. As with Arabidopsis thaliana, the quality was assessed by aligning the Illumina reads on each polished assembly and only perfect alignments were retained. The accuracy and completeness of the gene content was assessed with the reference annotation using pblat (29) and as previously described.

Polishing of heterozygous genome assemblies

The polished assemblies of the original 100 kb genomic region were aligned with the two haplotype sequences using muscle (30) (version 3.8), and each position was labeled haplotype 1 or haplotype 2 (if the base was similar to the corresponding haplotype), error (if the base was different from the two haplotypes) or equal (if the three bases were identical). For each polisher, the number of swaps between the two haplotypes and the number of errors were reported. The diploid and heterozygous genome of Solanum tuberosum was used to assess the ability of each polisher to locally preserve the haplotype phasing. Two benchmarks were performed: a first on the whole genome and a second on a specific genomic region. Polishing algorithms were compared on their ability to phase genomic regions during the error-correction step. For that purpose, heterozygous variants were detected from the Illumina short reads, without prior assembly, using discoSNP (31) and default parameters. The sequence context (30 bp on the right and left side) of each variant was extracted from the discoSNP output and only 61 bp-sequence with a single variant were kept. In the discoSNP output, the detected variants were phased based on the Illumina reads (-A parameter of discoSNP), and only chains of at least three variants validated by at least 5 short reads were selected. For each heterozygous SNP, the two variants were mapped on each polished assembly and only perfect matches were kept. All reliable chains of variants were searched in the alignment results and a given chain was validated only if all its variants were found in a perfect match and on the same genomic sequence. In addition, we focused on a 300 kb genomic region of chromosome 8, which has been described in the Figure 2B of the Zhou et al. publication. The authors illustrate a syntenic block on the two haplotypes. The coding exons of the two haplotypes were extracted and only the exons that contain at least one difference were selected and used as candidate exons. This region was assembled into four different contigs in our nanopore-based assembly, with two collapsed (contig_3372 and contig_15103) and two duplicated contigs (contig_15126 and contig_15127). The candidate exons were searched in the polished versions of these four contigs and only the perfect alignments were kept. The gene content completeness was evaluated using BUSCO (26) (version 5 with solanales_odb10 dataset).

Polishing of a metagenomic assembly

A synthetic mock community composed of 12 bacterial strains was used to compare each polisher and assess their ability to recover specific regions of each genome. For that purpose, we collected all 31-mers in the metagenome sample that are specific to a given species, using UniqueKMER (32) (Supplementary Table S1). We then searched for these specific 31-mers in each polished assembly and hypothesized that a higher number of specific 31-mers reflected a more realistic composition of the metagenome.

RESULTS

Impact of the sequencing coverage

Even if the sequencing coverage is no longer a problem, the Arabidopsis thaliana and Solanum tuberosum genome assemblies were polished independently with various depths of coverage. On the homozygous genome, the consensus quality is already high with as low as 25× of coverage, even if the overall quality can be improved with a higher coverage (between 50× and 100×, Supplementary Figure S1). Interestingly, some metrics (identity and number of perfect mapped reads) seem to decrease when the coverage is too high (180×, Supplementary Figure S1). In heterozygous regions, coverage has a direct impact on the ability to phase variants, and it intuitively requires twice as much coverage as in homozygous regions. With a coverage of 25×, Hapo-G missed 4% of phased variants, compared to polishing with 50×. However, the number of phased variants with 25× remains similar or higher than that of other polishers, even with 50× of coverage (Supplementary Figure S2). Overall, all of the polishing tools achieved very similar results in terms of quality metrics. They produced a polished assembly with <60 errors per 100 kb (Figure 3A), with the exception of ntEdit (75 errors per 100 kb), Apollo (96 errors per 100 kb) and Racon (128 errors per 100 kb). These results are confirmed by the number of perfectly mapped Illumina read pairs (Figure 3B), with the lowest scores obtained by Racon and ntEdit (34.1 and 41.8 M respectively), while the highest number was achieved by HyPo and NextPolish (43.0 M after the first round of correction). Regarding the alignment of the reference annotation, again, differences were small (<450 exons between Hapo-G, HyPo, NextPolish, Pilon and POLCA out of the 203 233 input exons), with ntEdit and Racon assemblies containing the lowest number of exons retrieved perfectly (Figure 3C). For Hapo-G, HyPo, ntEdit, NextPolish, Pilon and POLCA, increasing the number of polishing rounds didn’t seem to have any significant impact, although two rounds for most tools seems to be optimal. Oddly, increasing the number of polishing rounds with Racon increased the number of errors per 100 kb from 128 for the first round to 161 for the sixth round. Racon’s inferior performance can be explained by the fact that it was originally designed to perform polishing using long reads. Moreover, the gene content was very similar for all methods, and more complete than the uncorrected assembly, except for Racon and Apollo which obtained a lower value than the input assembly (Supplementary Table S2). The fastest average running time was achieved by ntEdit with approximately 25 min for each round, while the slowest was Apollo (3647 min for a single iteration) and Racon, with an average running time of 163 min. From the alignment-based methods, Hapo-G was the fastest with an average running time of 40 min (Figure 3D). Memory requirement is variable, ntEdit requires the least amount of memory, while alignment-based methods are generally limited by the use of samtools sort which is the part that requires the most amount of memory (Supplementary Figure S3). However, depending on the implementation, some methods require twice as much memory (Pilon, POLCA and Racon).

Figure 3.

Comparison of polishing algorithms on the Arabidopsis thaliana genome assembly. Lower panels of A, B and C show the full distribution and the upper panels are a zoom on the higher values. (A) Number of errors per 100 kb after each round of polishing, when compared to the Arabidopsis thaliana reference genome. (B) Number of Illumina pairs mapped perfectly on each assembly. (C) Number of Arabidopsis thaliana exons aligned with 100% identity after each round of polishing. (D) Run times of polishing tools, for each polishing round. The human genome was successfully polished using Hapo-G, NextPolish, ntEdit and Pilon. Other methods were not able to produce results: HyPo crashed due to a lack of memory, Racon generated an empty fasta file with no error messages and POLCA crashed while processing the VCF (Variant Call Format) file. As already observed on the homozygous plant genome, the quality metrics were similar for all methods, and ntEdit had the lowest results but was the fastest. Hapo-G was the second fastest method, only 5 h compared to 30 h with NextPolish, and therefore seems to be a good compromise between speed and quality of the result (Supplementary Figure S4) Initially, the 100 kb sequence contained 861 haplotype switches and 1877 sequencing errors. In this benchmark, we only performed one round of correction for each tool. Pilon was the only polisher to generate more haplotype switches than there were initially in the reference (881 switches). The POLCA and ntEdit polished sequences contained >800 switches, while about 500 switches were still present in the HyPo, Racon and NextPolish corrected sequences. Hapo-G, the only tool dedicated to heterozygous genomes, obtained the best result with only 65 switches (Figure 4 and Supplementary Table S3). In terms of remaining sequencing errors, only three corrected sequences still contained some errors, the one obtained with: HyPo (5 errors), Racon (198 errors) and ntEdit (938 errors).

Figure 4.

Comparison of polishing algorithms on a synthetic diploid sequence. The 100 kb sequence is represented on the x axis and each polishing tool has a dedicated track, where remaining errors are represented with red bars and switches between the two haplotypes are represented by blue bars. In the case of a more complex and heterozygous genome, the situation is different. Three methods seemed to perform better than the others: Hapo-G, HyPo and NextPolish. As for simple and homozygous genomes, ntEdit and Racon obtained the worst results (Figure 5A). However, the number of phased variants that could be recovered was higher in the assembly corrected with Hapo-G and this from the first round (472 534 phased variants compared to 469 441 after six rounds of NextPolish, Figure 5B). The situation is the same when looking at chains of at least three variants. Hapo-G was the only one to retrieve >100 000 chains whereas the second best result was obtained by NextPolish with 92 291 chains (Figure 5C). Additionally, Hapo-G was twice as fast as HyPo and seven times faster than NextPolish, the other two methods that performed well on heterozygous genomes. Furthermore, Hapo-G was the second fastest method and as previously observed, ntEdit was the fastest (Figure 5D). The six rounds of polishing using Hapo-G were faster than a single round performed with NextPolish. Interestingly, we have observed three types of tools: those that take advantage of multiple rounds of correction, those for which one round seems sufficient, and those that produce inferior results by performing multiple rounds of correction. Hapo-G, HyPo, Pilon and POLCA are in the first category, ntEdit and NextPolish in the second and Racon is the only one which seems to degrade the quality of the assembly as rounds are performed (Figure 5A–C).

Figure 5.

Comparison of polishing algorithms on the Solanum tuberosum genome assembly for each polishing round. Lower panels of A, B and C show the full distribution and the upper panels are a zoom on the higher values. (A) Number of Illumina pairs mapped perfectly on each assembly. (B) Number of phased variants retrieved in each assembly. (C) Number of chains composed of >3 variants (and confirmed by at least 5 reads) perfectly retrieved in each assembly. (D) Run time of each polishing tool. In addition, we focused on the two allelic regions described in the study by Zhou et al. and counted the number of candidate exons in each assembly. In the Hapo-G corrected sequence, we recovered 86 candidate exons which represented the highest number of exons found in all assemblies, compared to 39 in the unpolished assembly. For comparison, 84 candidate exons were found in the NextPolish assembly which is the second best result (Supplementary Table S4). Furthermore, the completeness of the gene content was evaluated on the whole genome assembly using BUSCO and 5950 genes conserved across solanales. Even if all the methods had a higher gene content completeness than the unpolished assembly (81.5%), only three methods (Hapo-G, HyPo and NextPolish) obtained a BUSCO score equal or higher than 98.2% (Supplementary Table S5). On the metagenomic sample, Hapo-G and HyPo were the two methods that allowed us to retrieve the highest number of specific 31-mers (Supplementary Figure S5B) and to perfectly align the higher number of paired-end reads (Supplementary Figure S5A). Hapo-G retrieved the highest number of specific 31-mers, and HyPo the highest number of paired-ends reads. This could be explained by the fact that the coverage was heterogeneous, so the number of specific 31-mers was not directly linked to the number of correctly mapped paired-ends reads. Interestingly, in this context, NextPolish is the fastest method (even faster than the kmer approach of ntEdit), only 30 mn of running time for the first iteration, compared to the ∼2 h needed for other mapping-based methods (Supplementary Figure S5C).

DISCUSSION

In this study, we report a new software, Hapo-G, which is able to polish draft assemblies with a quality equivalent to that of existing tools on simple and homozygous genomes, while being faster, but which also improves the polishing of heterozygous genomic regions as well as metagenomic assemblies. Nowadays, the number of polishing tools is high and although almost all are based on the same principle (mapping of short reads), their performances are different, likewise none are specialized in processing heterozygous genomes. In this study, we compared eight existing algorithms: Hapo-G, HyPo, Apollo, NextPolish, ntEdit, Pilon, POLCA and Racon. We obtained very similar results on a small plant genome (Arabidopsis thaliana) with the exception of ntEdit which is the only tool not based on short-read alignment. In addition, we observe that the oldest and most widely used polishing tool, Pilon, is not among the tools with the best results. In general, Hapo-G, HyPo and NextPolish stand out and often give good results. We observed on a synthetic diploid sequence and on a true heterozygous genome (Solanum tuberosum) that the phasing of variants is of better quality when the assembly is polished with Hapo-G, leading after six rounds of correction to the higher number of paired-end reads perfectly mapped back to the assembly (47 957 836 out of 151 018 344). Only two tools, ntEdit and Hapo-G, succeeded in polishing the 1.3 Gb of the Solanum tuberosum genome assembly in <3 h on average. On this large genome, the six rounds of Hapo-G ended before the first round of NextPolish, which is the second best tool according to our benchmark and already incorporates several rounds of mapping/correction internally. Importantly, on much larger genomes like that of the Human, HyPo, Apollo and Racon failed to polish the consensus in <3 days on a 36-core server with 380 Gb of memory. Increasing the number of correction rounds is generally beneficial, except for ntEdit and NextPolish where the results are very similar from round one through sixth, and Racon where the quality of the consensus seems to deteriorate when adding new correction cycles. In homozygous regions, although many polishers have achieved similar results, NextPolish appears to be the best performer, therefore, if possible, we suggest using NextPolish in combination with Hapo-G to achieve high quality in homozygous and heterozygous regions. It is important to note that Hapo-G should be used last to best preserve variant phasing. In fact, homozygosity is generally not complete and heterozygous regions may remain. Interestingly, by combining NextPolish and Hapo-G, the number of perfectly mapped paired-end reads was higher than after six rounds of Hapo-G or NextPolish separately. However, a combination of Hapo-G with any other polisher did not lead to better results in our tests (Supplementary Figure S6). Hapo-G algorithm requires high-quality reads, which excludes error-prone long reads. However, PACBIO HiFi long-reads are suitable to polish a genome assembly using Hapo-G. Even if it was not initially dedicated to long reads, we polished the Solanum tuberosum using both short and long reads. The input coverage was low (14×), but combining both technologies allowed Hapo-G to increase the number of phased variants (491 093 compared to 482 404 after six rounds of Hapo-G with short reads, Supplementary Figure S7). Hapo-G is often the fastest alignment-based method, achieves good results (particularly in metagenomic samples and heterozygous regions) and is flexible (it can take long or short reads as input as long as they are of high quality). Based on these observations, we recommend using Hapo-G to polish genome and metagenome assemblies and eventually by performing multiple rounds of correction if possible. In addition, depending on the size of the genome and if possible, we recommend combining Hapo-G with another polisher like HyPo or NextPolish which generally give good results.

DATA AVAILABILITY

Hapo-G is an open source software, source code, binaries as well as results of the benchmark are freely available from http://www.genoscope.cns.fr/hapog. In addition, Hapo-G has been added in the Anaconda repository for ease of installation and use: https://anaconda.org/lbgb_cea/hapog. All data, short and long reads, used in the article are available on public repositories. Click here for additional data file.

30 in total

1. Assembly of long, error-prone reads using repeat graphs.

Authors: Mikhail Kolmogorov; Jeffrey Yuan; Yu Lin; Pavel A Pevzner
Journal: Nat Biotechnol Date: 2019-04-01 Impact factor: 54.908

2. Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm.

Authors: Can Firtina; Jeremie S Kim; Mohammed Alser; Damla Senol Cali; A Ercument Cicek; Can Alkan; Onur Mutlu
Journal: Bioinformatics Date: 2020-06-01 Impact factor: 6.937

3. ART: a next-generation sequencing read simulator.

Authors: Weichun Huang; Leping Li; Jason R Myers; Gabor T Marth
Journal: Bioinformatics Date: 2011-12-23 Impact factor: 6.937

4. A computational toolset for rapid identification of SARS-CoV-2, other viruses and microorganisms from sequencing data.

Authors: Shifu Chen; Changshou He; Yingqiang Li; Zhicheng Li; Charles E Melançon
Journal: Brief Bioinform Date: 2021-03-22 Impact factor: 11.622

5. Major Impacts of Widespread Structural Variation on Gene Expression and Crop Improvement in Tomato.

Authors: Michael Alonge; Xingang Wang; Matthias Benoit; Sebastian Soyk; Lara Pereira; Lei Zhang; Hamsini Suresh; Srividya Ramakrishnan; Florian Maumus; Danielle Ciren; Yuval Levy; Tom Hai Harel; Gili Shalev-Schlosser; Ziva Amsellem; Hamid Razifard; Ana L Caicedo; Denise M Tieman; Harry Klee; Melanie Kirsche; Sergey Aganezov; T Rhyker Ranallo-Benavidez; Zachary H Lemmon; Jennifer Kim; Gina Robitaille; Melissa Kramer; Sara Goodwin; W Richard McCombie; Samuel Hutton; Joyce Van Eck; Jesse Gillis; Yuval Eshed; Fritz J Sedlazeck; Esther van der Knaap; Michael C Schatz; Zachary B Lippman
Journal: Cell Date: 2020-06-17 Impact factor: 66.850

6. Gapless assembly of maize chromosomes using long-read technologies.

Authors: Jianing Liu; Arun S Seetharam; Kapeel Chougule; Shujun Ou; Kyle W Swentowsky; Jonathan I Gent; Victor Llaca; Margaret R Woodhouse; Nancy Manchanda; Gernot G Presting; David A Kudrna; Magdy Alabady; Candice N Hirsch; Kevin A Fengler; Doreen Ware; Todd P Michael; Matthew B Hufford; R Kelly Dawe
Journal: Genome Biol Date: 2020-05-20 Impact factor: 13.583

7. Long-read assembly of the Brassica napus reference genome Darmor-bzh.

Authors: Mathieu Rousseau-Gueutin; Caroline Belser; Corinne Da Silva; Gautier Richard; Benjamin Istace; Corinne Cruaud; Cyril Falentin; Franz Boideau; Julien Boutte; Regine Delourme; Gwenaëlle Deniot; Stefan Engelen; Julie Ferreira de Carvalho; Arnaud Lemainque; Loeiz Maillet; Jérôme Morice; Patrick Wincker; France Denoeud; Anne-Marie Chèvre; Jean-Marc Aury
Journal: Gigascience Date: 2020-12-15 Impact factor: 6.524

8. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

9. Chromosome-level assemblies of multiple Arabidopsis genomes reveal hotspots of rearrangements with altered evolutionary dynamics.

Authors: Wen-Biao Jiao; Korbinian Schneeberger
Journal: Nat Commun Date: 2020-02-20 Impact factor: 14.919

10. Haplotype-resolved genome analyses of a heterozygous diploid potato.

Authors: Qian Zhou; Dié Tang; Wu Huang; Zhongmin Yang; Yu Zhang; John P Hamilton; Richard G F Visser; Christian W B Bachem; C Robin Buell; Zhonghua Zhang; Chunzhi Zhang; Sanwen Huang
Journal: Nat Genet Date: 2020-09-28 Impact factor: 38.330

6 in total

1. Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing.

Authors: Caroline Belser; Franc-Christophe Baurens; Benjamin Noel; Guillaume Martin; Corinne Cruaud; Benjamin Istace; Nabila Yahiaoui; Karine Labadie; Eva Hřibová; Jaroslav Doležel; Arnaud Lemainque; Patrick Wincker; Angélique D'Hont; Jean-Marc Aury
Journal: Commun Biol Date: 2021-09-07

2. The genome sequence of the lesser marbled fritillary, Brenthis ino, and evidence for a segregating neo-Z chromosome.

Authors: Alexander Mackintosh; Dominik R Laetsch; Tobias Baril; Robert G Foster; Vlad Dincă; Roger Vila; Alexander Hayward; Konrad Lohse
Journal: G3 (Bethesda) Date: 2022-05-30 Impact factor: 3.542

3. Long-read and chromosome-scale assembly of the hexaploid wheat genome achieves high resolution for research and breeding.

Authors: Jean-Marc Aury; Stefan Engelen; Benjamin Istace; Cécile Monat; Pauline Lasserre-Zuber; Caroline Belser; Corinne Cruaud; Hélène Rimbert; Philippe Leroy; Sandrine Arribat; Isabelle Dufau; Arnaud Bellec; David Grimbichler; Nathan Papon; Etienne Paux; Marion Ranoux; Adriana Alberti; Patrick Wincker; Frédéric Choulet
Journal: Gigascience Date: 2022-04-28 Impact factor: 7.658

4. Genomic adaptation of the picoeukaryote Pelagomonas calceolata to iron-poor oceans revealed by a chromosome-scale genome sequence.

Authors: Nina Guérin; Marta Ciccarella; Elisa Flamant; Paul Frémont; Sophie Mangenot; Benjamin Istace; Benjamin Noel; Caroline Belser; Laurie Bertrand; Karine Labadie; Corinne Cruaud; Sarah Romac; Charles Bachy; Martin Gachenot; Eric Pelletier; Adriana Alberti; Olivier Jaillon; Patrick Wincker; Jean-Marc Aury; Quentin Carradec
Journal: Commun Biol Date: 2022-09-16

5. Generation of a chromosome-level genome assembly for Pacific halibut (Hippoglossus stenolepis) and characterization of its sex-determining genomic region.

Authors: Andrew J Jasonowicz; Anna Simeon; Margot Zahm; Cédric Cabau; Christophe Klopp; Céline Roques; Carole Iampietro; Jérôme Lluch; Cécile Donnadieu; Hugues Parrinello; Daniel P Drinan; Lorenz Hauser; Yann Guiguen; Josep V Planas
Journal: Mol Ecol Resour Date: 2022-06-01 Impact factor: 8.678

6. Population genomics of apricots unravels domestication history and adaptive events.

Authors: Alexis Groppi; Shuo Liu; Amandine Cornille; Stéphane Decroocq; Quynh Trang Bui; David Tricon; Corinne Cruaud; Sandrine Arribat; Caroline Belser; William Marande; Jérôme Salse; Cécile Huneau; Nathalie Rodde; Wassim Rhalloussi; Stéphane Cauet; Benjamin Istace; Erwan Denis; Sébastien Carrère; Jean-Marc Audergon; Guillaume Roch; Patrick Lambert; Tetyana Zhebentyayeva; Wei-Sheng Liu; Olivier Bouchez; Céline Lopez-Roques; Rémy-Félix Serre; Robert Debuchy; Joseph Tran; Patrick Wincker; Xilong Chen; Pierre Pétriacq; Aurélien Barre; Macha Nikolski; Jean-Marc Aury; Albert Glenn Abbott; Tatiana Giraud; Véronique Decroocq
Journal: Nat Commun Date: 2021-06-25 Impact factor: 14.919

6 in total