
Benchmarking of next and third generation sequencing technologies and their associated algorithms for de novo genome assembly.

Marios Gavrielatos¹, Konstantinos Kyriakidis², Demetrios A Spandidos³, Ioannis Michalopoulos¹.

Abstract

Genome assemblers are computational tools for de novo genome assembly, based on a plenitude of primary sequencing data. The quality of genome assemblies is estimated by their contiguity and the occurrences of misassemblies (duplications, deletions, translocations or inversions). The rapid development of sequencing technologies has enabled the rise of novel de novo genome assembly strategies. The ultimate goal of such strategies is to utilise the features of each sequencing platform in order to address the existing weaknesses of each sequencing type and compose a complete and correct genome map. In the present study, the hybrid strategy, which is based on Illumina short paired-end reads and Nanopore long reads, was benchmarked using the MaSuRCA and Wengan assemblers. Moreover, the long-read assembly strategy was benchmarked using Canu for Nanopore reads, and Hifiasm and HiCanu for PacBio HiFi reads. The assemblies were performed on a computational cluster with limited computational resources. Their outputs were evaluated in terms of accuracy and computational performance. The PacBio HiFi assembly strategy outperformed the other ones, while Hi-C scaffolding, which is based on chromatin 3D structure, is required in order to increase continuity, accuracy and completeness when large and complex genomes, such as the human one, are assembled. The use of Hi-C data is also necessary when using the hybrid assembly strategy. The results revealed that HiFi sequencing enabled the rise of novel algorithms which require less genome coverage than the other strategies, making the assembly a less computationally demanding task. Taken together, these developments may lead to the democratisation of genome assembly projects, which are now approachable by smaller labs with limited technical and financial resources.


Year: 2021 PMID: 33537807 PMCID: PMC7893683 DOI: 10.3892/mmr.2021.11890

Source DB: PubMed Journal: Mol Med Rep ISSN: 1791-2997 Impact factor: 2.952

Introduction

The first human genome draft (1) was based on Sanger sequencing technology (2), cost $2.7 billion and was carried out over a period of 10 years (3). In comparison, the sequencing of the human genome (~3 Gbp haploid genome size) on a next generation sequencing (NGS) platform, where millions of reads are efficiently mapped to the reference genome, currently costs <$1,000 and can be performed in <2 days (4). Short-read de novo genome assemblers have difficulty producing large and reliable contigs, particularly in low complexity regions such as centromeres, telomeres and other repetitive regions (5,6). To address this issue, third generation sequencing (7) technologies have been developed. Nanopore (https://nanoporetech.com/) (8,9) and PacBio (https://www.pacb.com/) (10) sequencing platforms were launched around 2010. Third generation sequencers sequence single molecules in real time (10) without the need for PCR amplification and thus avoid PCR bias (11,12). The main drawback of long reads is their lower accuracy compared to Illumina short reads: Typical Nanopore and PacBio Sequel I long reads have an average accuracy of 90% (13) compared to 99.9% for typical Illumina short reads (4). As a consequence, assemblies produced only from long reads were more contiguous, but they also contained more errors, which made genome annotation, variant calling and other genome analyses challenging tasks (6,12). By following the hybrid assembly strategy (14,15), the advantages of the two generations are combined, incorporating the information contained in the two read types and overcoming their drawbacks. Recent advances in long-read sequencing by PacBio have shown very promising results: Sequel System II was released in 2019 with an upgraded version of the SMRT flow cell first introduced in 2013 (16), which was able to increase the sequencing yield up to 8-fold. 
However, the greatest breakthrough was the advance of circular consensus sequencing (CCS) (17), which sequences the same circular DNA molecule 10 times to produce a highly accurate (99.9%) high-fidelity (HiFi) consensus read, while increasing unique molecular yield and insert size (up to 25 Kbp). At the same time, recent advances in Nanopore's base identification algorithm, Bonito (https://github.com/nanoporetech/bonito) (18), have led to greater than 97% base accuracy. Usually, the primary genome assembly is very fragmented and some contigs are misassembled. For this reason, the completion of the assembly requires the construction of scaffolds (19). To this end, the Hi-C sequencing method provides the chromosomal conformation information necessary to assemble chromosome-level scaffolds. The general principle of this method is based on the proximity and contacts of chromosomal regions in the cell nucleus. The frequency of contacts is higher between regions of the same chromosome; thus, different chromosomes can be distinguished during the assembly (20). The result of this method is a collection of pairs of reads of chimeric fragments that can be mapped to the assembly, joining very remote areas. Using the recent sequencing and scaffolding technologies, it is now possible to construct new reference genomes and finish the assembly of existing ones, by closing gaps in the centromeres, telomeres and other low complexity regions. For this reason, new projects have been launched and new consortia have been formed (21–23). The telomere to telomere (T2T) consortium (https://sites.google.com/ucsc.edu/t2tworkinggroup/) (24,25) aims to finish the entire human genome by producing chromosomes without gaps. Almost two decades after the first draft of the human genome by the International Human Genome Sequencing Consortium, T2T published a completed human genome with the exception of five known gaps within the rDNA arrays (https://genomeinformatics.github.io/CHM13v1/). 
The development of sequencing technologies and assembly and scaffolding algorithms, as well as the sharp increase of publicly available data (https://www.ncbi.nlm.nih.gov/genbank/statistics/), democratised de novo genome assembly projects by making them more approachable to smaller labs. The present study aimed to compare genome assembly pipelines, which use different assembly strategies, evaluating them in terms of accuracy, speed and computational power needed. Finally, the need for scaffold construction, incorporating Hi-C sequencing data was also evaluated.

Materials and methods

Data acquisition and experimental overview

Primary sequencing data were downloaded for 3 organisms, Drosophila virilis, Drosophila melanogaster and Homo sapiens (Table I). Some FASTQ files were subsampled using the Reformat tool from BBTools (https://sourceforge.net/projects/bbmap/). Following the hybrid assembly strategy, which uses short paired-end Illumina reads in combination with long Nanopore reads, the low complexity genome of Drosophila virilis and the high complexity genome of Homo sapiens were assembled from read data downloaded from the European Nucleotide Archive (ENA) (26) and the T2T Consortium, respectively. The Drosophila melanogaster genome was assembled following the long-read assembly strategy using only HiFi reads retrieved from ENA. Finally, Hi-C reads were used to create the scaffolds of our assemblies. It is important to note that the sequencing data used to assemble the Homo sapiens genome derive from CHM13hTERT, which is a female haploid cell line; thus, there will be no Y chromosome in the final assemblies. The experiments were performed on the Biomedical Research Foundation, Academy of Athens (BRFAA) computer cluster that consists of 24 nodes of 128 GB RAM each. Each node consists of 2 Intel® Xeon® Silver 4116 processors with 12 cores per processor and 2 threads per core (i.e., 48 CPUs per node). Additionally, the Homo sapiens assembly by Wengan was performed on an Aristotle University of Thessaloniki (AUTh) computational system on a single node which consists of 4 AMD Opteron™ 6274 processors with 16 cores per processor and 1 thread per core (i.e., 64 CPUs) and 256 GB RAM.

Table I.

ENA accessions and T2T links of primary sequencing data.

Organism	Genome size (Mbp)	Illumina paired-end sequencing (coverage)	Illumina Hi-C sequencing (coverage)	Nanopore reads (coverage)	PacBio/HiFi reads (coverage)
Drosophila virilis	169	SRR1536175 (108×)	SRR7029394 (67×)	SRR7167958 (50×)
Drosophila melanogaster	140				SRR9969842 (37×), SRR10238607 (subsampled to 92×)
Homo sapiens	3,200	SRR3189741 SRR3189742 (Combined and subsampled to 34×)	https://github.com/nanopore-wgs-consortium/CHM13#hi-c-data (40×)	https://github.com/nanopore-wgs-consortium/CHM13#oxford-nanopore-data (Subsampled to 30×)	SRR11292120 SRR11292121 SRR11292122 SRR11292123 (Combined and subsampled to 16×)

In some cases, where more than one FASTQ file was used, the files were combined and randomly subsampled to lower coverages. Mbp, Megabase pairs.
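The relationship between coverage, genome size and the subsampling fraction can be sketched as follows. This is a minimal illustration and not the procedure used in the study: the function names are hypothetical, the 60× starting coverage is an assumed example value, and coverage is taken to mean total sequenced bases divided by genome size.

```python
def coverage(total_bases: int, genome_size_bp: int) -> float:
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


def bases_for_coverage(target_coverage: float, genome_size_bp: int) -> int:
    """Number of sequenced bases needed to reach a target coverage."""
    return round(target_coverage * genome_size_bp)


def subsample_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep when randomly subsampling to a lower coverage.

    Returns 1.0 when the data set is already at or below the target.
    """
    if target_coverage >= current_coverage:
        return 1.0
    return target_coverage / current_coverage


# Hypothetical example: a 60x data set reduced to 30x keeps half of the reads.
print(subsample_fraction(60, 30))

# Bases needed for 30x coverage of a 3,200 Mbp genome (size taken from Table I).
print(bases_for_coverage(30, 3_200_000_000))
```

The resulting fraction could then be supplied to whichever read subsampling tool is used; random subsampling only approximates the target coverage, because read lengths vary.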

The pipeline is divided into 3 parts: In the first stage of the current workflow (Fig. 1), different assemblers were used for the genome construction. In the second stage, the scaffolding, Hi-C data were combined with the initial assembly, in order to increase its continuity and accuracy. In the last stage, the final assembly was assessed and evaluated with the use of various tools.

Figure 1.

Pipeline stages and tools used in each step of the workflow.

Genome assembly

In order to assess the hybrid assembly strategy, the present study chose to evaluate two pipelines, MaSuRCA (version 3.3.5) (27,28) and Wengan (version 0.1) (29). MaSuRCA workflow offers three different assemblers, CABOG (30), SOAPdenovo (31) and Flye (32). The pipeline was tested using CABOG and Flye assemblers, which are designed for long-read assembly. Wengan pipeline is based on DiscovarDenovo assembler (33). Canu (version 2.0) (34) is a long-read assembler, designed to use long high-noise single-molecule sequencing data, such as Nanopore and PacBio reads. Its workflow is based on the Celera assembler (35) which was used in the Human Genome Project to produce the first draft of the human genome. Hifiasm (version 0.13) (36) and HiCanu (Canu version 2.1.1) (37) are long-read assemblers exclusively for HiFi reads. The main difference between HiFi assemblers and the ones mentioned previously, is that Hifiasm and HiCanu produce phased assemblies. A phased assembly is a haplotype-resolved assembly, where high complexity regions, such as genes, will be separated into two different alleles (36,38). HiCanu is a modified version of Canu, adapted to take advantage of the characteristics of HiFi reads. Hifiasm produces two different files for the primary and alternative assembly, whereas HiCanu combines the primary and the alternative assembly in the same FASTA file.

Scaffolding

In order to test the necessity of scaffolding, a scaffolder was used to improve the assembly continuity and completeness, as follows: Hi-C data are mapped to the primary assembly by the Arima mapping pipeline (39), to produce a BAM file which is subsequently converted to a BED file. SALSA (version 2.2) (40) uses this BED file, which contains the mapping information of Hi-C reads on the assembly, to scaffold the primary assembly.

Quality control metrics

For the quality control of the assemblies produced, different evaluation tools were used. These tools produce and present the qualitative and quantitative characteristics of the assemblies in a comprehensible way. QUAST (version 5.0.2) (41), a genome assembly evaluation tool, produces various metrics for our assemblies, using a reference genome (Table II). The standard assembly statistics include the calculation of N50/NG50 and L50/LG50 values (42), as follows: N50 (or NG50) is the size of the contig, where at least 50% of the genome assembly size (or the reference genome size), is contained in contigs of equal or larger size than this contig. Higher N50/NG50 values signify more contiguous assemblies. L50 (or LG50) is the smallest number of contigs whose length sum makes up for at least 50% of the genome assembly length (or reference genome length). Lower L50/LG50 values signify more contiguous assemblies. Furthermore, QUAST makes use of BUSCO (Quast version 5.0.2) (43), to assess genome assembly and annotation completeness, based on evolutionarily-informed expectations of gene content of near-universal single-copy orthologs.
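The N50/NG50 and L50/LG50 definitions above can be expressed in a few lines of code. The following is a minimal sketch for illustration only; it is not QUAST's implementation, and the function name and the toy contig lengths are assumptions.

```python
from typing import Optional, Sequence, Tuple


def nx_lx(lengths: Sequence[int], total: int,
          fraction: float = 0.5) -> Optional[Tuple[int, int]]:
    """Return (Nx, Lx) for a set of contig lengths.

    With total = sum(lengths) the result is (N50, L50); with total set to
    the reference genome size the result is (NG50, LG50). Returns None if
    the contigs never reach the requested fraction of `total`.
    """
    threshold = total * fraction
    running = 0
    # Walk through the contigs from longest to shortest, accumulating length.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= threshold:
            return length, count
    return None


contigs = [80, 70, 50, 40, 30, 20]           # toy contig lengths (bp)
print(nx_lx(contigs, total=sum(contigs)))     # N50 and L50
print(nx_lx(contigs, total=400))              # NG50 and LG50 for a 400 bp reference
```

When the assembly is larger than the reference genome, NG50 is at least as large as N50, because the 50% threshold is reached after fewer contigs; when it is smaller, NG50 is at most N50.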

Table II.

Reference genomes used for the evaluation of the assemblies.

Organisms	Reference genomes
Drosophila virilis	GCA_007989325.1_vir160_genomic.fna
	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/007/989/325/GCA_007989325.1_vir160/
Drosophila melanogaster	GCA_002300595.1_Dmel_A4_1.0_genomic.fna
	https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/002/300/595/GCA_002300595.1_Dmel_A4_1.0/
Homo sapiens	chm13.draft_v1.0.fasta
	https://s3.amazonaws.com/nanopore-human-wgs/chm13/assemblies/chm13.draft_v1.0.fasta.gz

Genome consistency plots

JupiterPlot (version 1.0) (44) is a workflow that uses Circos (45) to generate a genome assembly consistency plot between a reference genome and a genome assembly. The chromosomes of the reference genome are represented as coloured arcs on the left half circle of the plot, whereas the contigs/scaffolds of the assembled genome are represented as outlined white arcs on the right half circle. The number and size of white arcs is indicative of the genome contiguity. JupiterPlot represents synteny between the reference and the assembled genome, indicating corresponding contiguous regions as ribbons whose width is proportional to their sequence length. In this manner, assembly errors and chromosomal misassemblies can be visually identified: A ribbon in twisted position represents an inversion, a ribbon which crosses over other ribbons represents a translocation, a lack of a ribbon connecting a region of the reference genome represents a deletion and the overlap of two ribbons connecting the same reference genome region represents a duplication. Although in other cases these misassemblies may represent genuine chromosomal aberrations, in our case they represent assembly errors due to low sequence complexity of repetitive regions such as centromeres, telomeres, etc., low sequencing coverage and weaknesses of each assembly algorithm.

Results

Drosophila genome assemblies

Primary (unscaffolded) MaSuRCA (CABOG or Flye) hybrid assemblies are by far the most fragmented of all Drosophila virilis assemblies, based on N50/NG50 and L50/LG50 values (Table III) and manual inspection of genome assembly consistency plots (Fig. 2). Canu, based exclusively on long Nanopore data, produced the most contiguous primary assembly. MaSuRCA/CABOG produced the most misassembled contigs, while Wengan hybrid assembler created the least misassembled ones. All but Canu assemblies present very high rates of preserved gene completeness, similar to the rates of the reference genomes (Table IV). The sizes of all Drosophila virilis primary assemblies are comparable to each other and very similar to that of the reference genome. Wengan is the fastest hybrid assembler and produced the Drosophila virilis genome 71 times faster than Canu, while the average CPU usage of Wengan is smaller than the rest of these assemblers (Table V). Hi-C-based scaffolding ameliorated the contiguity and it limited the misassemblies of all assemblies, but it did not improve the gene completeness and it did not alter the final assembly size.

Table III.

Metrics of Drosophila assemblies.

Assemblers	Contigs/scaffolds	Genome assembly size (bp)	N50	NG50	L50	LG50
MaSuRCA (CABOG)	1,016	167,374,624	366,859	359,873	127	131
MaSuRCA (CABOG)/SALSA (Arima)	532	167,617,624	3,400,369	3,400,369	15	15
MaSuRCA (Flye)	689	163,000,738	419,467	406,899	113	121
MaSuRCA (Flye)/SALSA (Arima)	230	163,230,238	5,261,864	5,258,634	9	10
Wengan	329	153,989,049	3,232,846	3,013,042	13	16
Wengan/SALSA (Arima)	229	154,046,842	21,036,706	16,232,289	3	4
Canu	425	169,315,961	4,435,749	4,435,749	10	10
Canu/SALSA (Arima)	488	176,029,265	25,182,285	25,182,285	4	4
Hifiasm (insert size: 11 Kbp; coverage: 37×)	314	149,971,598	23,693,975	23,693,975	3	3
Hifiasm (insert size: 24 Kbp; coverage: 40×)	149	164,010,561	21,707,601	24,110,342	4	3
Hifiasm (insert size: 24 Kbp; coverage: 92×)	186	169,871,295	23,943,049	24,211,538	4	3
Hifiasm/SALSA (insert size: 11 Kbp; coverage: 37×)	308	149,976,098	23,693,975	23,693,975	3	3
Hifiasm/SALSA (insert size: 24 Kbp; coverage: 40×)	141	164,015,561	24,110,342	24,620,248	4	3
Hifiasm/SALSA (insert size: 24 Kbp; coverage: 92×)	183	169,876,757	23,943,049	24,211,538	4	3
HiCanu (insert size: 11 Kbp; coverage: 37×)	1,792	295,986,869	2,513,964	6,791,534	24	7
HiCanu (insert size: 24 Kbp; coverage: 40×)	1,024	322,211,690	6,752,429	17,694,921	12	4
HiCanu (insert size: 24 Kbp; coverage: 92×)	1,269	337,795,659	11,255,983	26,987,095	8	2
HiCanu/SALSA (insert size: 11 Kbp; coverage: 37×)	1,747	296,025,369	5,836,825	10,646,076	14	4
HiCanu/SALSA (insert size: 24 Kbp; coverage: 40×)	1,023	322,224,690	12,833,112	30,402,815	7	2
HiCanu/SALSA (insert size: 24 Kbp; coverage: 92×)	1,281	337,778,159	6,830,725	16,844,691	12	4

Hybrid assemblies (MaSuRCA and Wengan) and the long-read Nanopore assembly (Canu) were based on the Drosophila virilis genome (size: 169,773,245 bp). HiFi PacBio assemblies (Hifiasm and HiCanu) were based on the Drosophila melanogaster genome (size: 145,940,863 bp). Hifiasm and HiCanu assemblies were performed using three combinations of insert size and coverage.

Figure 2.

Drosophila virilis assemblies comparison. Hybrid assemblers, MaSuRCA (CABOG and Flye) and Wengan, used Illumina short reads and Nanopore long reads for the assembly, while Canu, a long read assembler utilised Nanopore long reads for the same purpose. SALSA improved contiguity in all assemblies.

Table IV.

BUSCO values of Drosophila assemblies.

Assemblers	Completed and single-copy BUSCOs (S)	Completed and duplicated BUSCOs (D)	Fragmented BUSCOs (F)	Missing BUSCOs (M)
Drosophila virilis reference genome	98.0%	0.5%	0.7%	0.8%
MaSuRCA (CABOG)	96.1%	1.5%	0.8%	1.6%
MaSuRCA (CABOG)/SALSA (Arima)	96.1%	1.4%	0.8%	1.7%
MaSuRCA (Flye)	98.2%	0.5%	0.8%	0.5%
MaSuRCA (Flye)/SALSA (Arima)	98.0%	0.5%	0.8%	0.7%
Wengan	98.0%	0.4%	0.7%	0.9%
Wengan/SALSA (Arima)	97.9%	0.3%	0.8%	1.0%
Canu	62.7%	0.2%	21.3%	15.8%
Canu/SALSA (Arima)	64.0%	0.3%	20.7%	15.0%
Drosophila melanogaster reference genome	97.9%	0.7%	0.9%	0.5%
Hifiasm (insert size: 11 Kbp; coverage: 37×)	98.1%	0.6%	0.7%	0.6%
Hifiasm (insert size: 24 Kbp; coverage: 40×)	98.2%	0.4%	0.7%	0.7%
Hifiasm (insert size: 24 Kbp; coverage: 90×)	98.1%	0.5%	0.7%	0.7%
Hifiasm/SALSA (insert size: 11 Kbp; coverage: 37×)	98.1%	0.6%	0.7%	0.6%
Hifiasm/SALSA (insert size: 24 Kbp; coverage: 40×)	98.2%	0.4%	0.7%	0.7%
Hifiasm/SALSA (insert size: 24 Kbp; coverage: 90×)	98.2%	0.5%	0.7%	0.6%
HiCanu (insert size: 11 Kbp; coverage: 37×)	4.8%	94.1%	0.6%	0.5%
HiCanu (insert size: 24 Kbp; coverage: 40×)	3.8%	95.2%	0.5%	0.5%
HiCanu (insert size: 24 Kbp; coverage: 90×)	3.2%	95.5%	0.7%	0.6%
HiCanu/SALSA (insert size: 11 Kbp; coverage: 37×)	42.3%	56.7%	0.5%	0.5%
HiCanu/SALSA (insert size: 24 Kbp; coverage: 40×)	37.3%	61.6%	0.5%	0.6%
HiCanu/SALSA (insert size: 24 Kbp; coverage: 90×)	39.0%	59.9%	0.5%	0.6%

Table V.

Assembly time and CPU usage comparison.

Organism	Assemblers	CPU time (sec)	CPU usage	Elapsed (wall clock) time (h:mm:ss)
Drosophila virilis	MaSuRCA (CABOG)	1,638,637.72	3,954%	11:30:39
	MaSuRCA (Flye)	1,344,633.10	3,961%	9:25:44
	Canu	9,934,418.98	3,532%	78:07:27
	Wengan	198,241.94	2,831%	1:56:42
Drosophila melanogaster	Hifiasm (insert size: 11 Kbp; coverage: 37×)	163,816.92	4,098%	1:06:37
	Hifiasm (insert size: 24 Kbp; coverage: 40×)	215,855.05	4,287%	1:23:54
	Hifiasm (insert size: 24 Kbp; coverage: 90×)	4,271,030.94	4,313%	25:40:58
	HiCanu (insert size: 11 Kbp; coverage: 37×)	85,224.85	1,752%	1:21:03
	HiCanu (insert size: 24 Kbp; coverage: 40×)	107,146.65	2,235%	1:19:53
	HiCanu (insert size: 24 Kbp; coverage: 90×)	176,649.77	1,646%	2:58:46
Homo sapiens	Hifiasm	1,272,271.15	4,113%	8:35:29
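The three columns of Table V are related: assuming that CPU usage is reported as total CPU time divided by elapsed wall clock time (as the Unix time utility does, which is an assumption about how these values were collected), the percentage can be recomputed from the other two columns. The sketch below uses hypothetical function names and the Wengan and MaSuRCA (CABOG) rows as examples.

```python
def hms_to_seconds(elapsed: str) -> int:
    """Convert an 'h:mm:ss' wall clock string into seconds."""
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_usage_percent(cpu_seconds: float, elapsed: str) -> float:
    """CPU usage as a percentage: CPU time relative to wall clock time.

    A value of 100% corresponds to one fully busy CPU, so 2,831% means that
    roughly 28 CPUs were busy on average.
    """
    return 100 * cpu_seconds / hms_to_seconds(elapsed)


# Values taken from Table V.
print(round(cpu_usage_percent(198_241.94, "1:56:42")))     # Wengan
print(round(cpu_usage_percent(1_638_637.72, "11:30:39")))  # MaSuRCA (CABOG)
```

Dividing the percentage by 100 gives the average number of busy CPUs, which can be compared with the 48 CPUs available per BRFAA node.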

In Drosophila melanogaster primary assemblies, Hifiasm outperformed HiCanu, producing less fragmented and misassembled contigs (Figs. 3 and 4). As HiCanu produces phased assemblies, the vast majority of single-copy genes appeared as completed and duplicated in BUSCO analysis (Table IV). Nevertheless, the sum of completed single and duplicated BUSCOs in Hifiasm and HiCanu was practically identical to that of the reference genome. While using the 11 Kbp insert size and 37× coverage data, Hifiasm produced Drosophila melanogaster genome faster than HiCanu. However, as the coverage was increased, the assembly time of Hifiasm increased more rapidly than that of HiCanu: The assembly time of Hifiasm and HiCanu using 24 Kbp insert size and 40× coverage data was approximately the same, while HiCanu was 12× faster than Hifiasm, when 24 Kbp insert size and 92× coverage was used. The average CPU usage of HiCanu was also smaller than that of Hifiasm (Table V). SALSA scaffolding based on Hi-C data, slightly improved Hifiasm assemblies, while it ameliorated the contiguity of HiCanu ones. It also slightly limited the misassemblies of HiCanu outputs. It did not influence the gene completeness of any assembly. Insert size (11 and 24 Kbp) and coverage (37×, 40× and 92×) did not influence the outcome of Hifiasm; however, a small deterioration in assembly contiguity at the 92× coverage was noted. On the other hand, a higher insert size and coverage improved HiCanu performance.
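The comparison of BUSCO completeness between phased and unphased assemblies can be reproduced by summing the single-copy (S) and duplicated (D) percentages from Table IV. This is a small worked example with a hypothetical helper name; the values are the 11 Kbp insert size, 37× coverage rows and the reference genome row.

```python
def complete_busco(single_copy: float, duplicated: float) -> float:
    """Total completed BUSCOs (%) as the sum of single-copy and duplicated."""
    return round(single_copy + duplicated, 1)


# (S, D) percentages from Table IV.
rows = {
    "Drosophila melanogaster reference genome": (97.9, 0.7),
    "Hifiasm": (98.1, 0.6),
    "HiCanu": (4.8, 94.1),
}

for name, (single_copy, duplicated) in rows.items():
    print(f"{name}: {complete_busco(single_copy, duplicated)}% complete")
```

Although HiCanu reports almost all genes as duplicated, because both haplotypes are written to the same FASTA file, its total of completed BUSCOs stays within a few tenths of a percentage point of the reference genome.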

Figure 3.

Drosophila melanogaster Hifiasm assemblies comparison. Hifiasm performed three different assemblies using PacBio HiFi long reads with different insert sizes (11 Kbp, 24 Kbp) and coverages (37×, 40×, 92×). A region in one of the two termini of chr 2L appears translocated in the assemblies produced with 11 Kbp insert size at 37× coverage and with 24 Kbp insert size at 92× coverage. The same region appears deleted in the assembly produced with 24 Kbp insert size at 40× coverage prior to SALSA scaffolding and inverted in the same assembly after SALSA scaffolding.

Figure 4.

Drosophila melanogaster HiCanu assemblies comparison. HiCanu performed three different assemblies using PacBio HiFi long reads with different insert sizes (11 Kbp, 24 Kbp) and coverages (37×, 40×, 92×). Deletions of major regions or entire chromosomes can be found in all assemblies. Apparent duplications, such as those of major parts of chr 3L in the assemblies produced with 24 Kbp insert size at 40× coverage, are the result of phasing.

Overall, Hifiasm performed most effectively in the primary assembly of Drosophila melanogaster genome (which is comparable to that of Drosophila virilis), in terms of genome contiguity, accuracy and completeness. At 37× and 40× coverages, Hifiasm was also the fastest assembler; however, the CPU usage of Wengan and HiCanu was half of that of Hifiasm. The combination of Hi-C data had a minimal effect on the improvement of Hifiasm assembly. Among hybrid assemblers, Wengan performed best when combined with SALSA.

Homo sapiens genome assemblies

The human genome is much more complex than that of Drosophila; thus, its assembly is a more demanding task which requires many more computational resources. The MaSuRCA and Wengan hybrid assemblers and the Canu long-read assembler were not able to complete the assembly of the human genome, even at half of the original Illumina and Nanopore coverage, on the BRFAA cluster with 128 GB RAM. Wengan, though, was able to produce a human genome assembly on the AUTh computational system with 256 GB RAM, when FASTQ files were subsampled by half (Fig. 5). The incorporation of Hi-C data improved the genome continuity and completeness, while reducing misassemblies (Table VI).

Figure 5.

Homo sapiens assemblies comparison. The Wengan hybrid assembler used 34× Illumina short reads and 30× Nanopore long reads for the assembly, while Hifiasm used 16× PacBio HiFi long reads.

Table VI.

Homo sapiens assembly metrics.

Assemblers	Contigs/scaffolds	Genome assembly size (bp)	N50	NG50	L50	LG50
Reference	24	3,056,916,522	154,259,625		8
Wengan	2,000	2,845,883,522	39,733,923	36,783,291	23	26
Wengan/SALSA (Arima)	1,689	2,845,883,522	59,573,195	56,310,190	15	17
Hifiasm	498	3,045,796,332	45,256,540	45,256,540	20	20
Hifiasm/SALSA (Arima)	431	3,045,840,332	61,206,687	61,206,687	15	15

Hifiasm was unable to assemble the human genome on the BRFAA cluster when the original 30× coverage of HiFi data was used. Nevertheless, it succeeded in producing a notable assembly on the same computational system with subsampled data (16× coverage), in contrast to HiCanu, which failed to run because of low memory resources, even with the subsampled data. Hifiasm failed to produce a contig for chromosome 22. SALSA improved the contiguity, accuracy and completeness of the Hifiasm assembly (Table VI). The longest chromosomes of the genome are well assembled; however, four of the smallest autosomal chromosomes (chr 16, 19, 21, 22) are missing (Fig. 5). Hifiasm outperformed HiCanu, Canu, Wengan and MaSuRCA, as it managed to run with low resources and low coverage, producing primary and scaffolded assemblies superior to those of Wengan.

Discussion

The use of a reference genome in the study of medical genetics, with the help of novel tools and methods, can aid the identification of novel drug-sequence variant interactions (46) and of variants that may be related to the genetic basis of a variety of diseases, such as cancer (47), and can support further analysis (48). By studying these variants, we are able to analyse the heterogeneity of different populations in order to understand their differences (49). To propose an optimised de novo genome assembly workflow, in the present study, factors such as the maximum assembly contiguity, accuracy and completeness were taken into account, without ignoring other parameters crucial for the execution of the sequencing experiments and the production of the assemblies, such as financial, computational power and time limitations. These findings suggest that the assembly exclusively based on long, highly accurate PacBio HiFi reads outperforms Illumina-Nanopore hybrid and Nanopore assembly. De novo genome assemblers that use HiFi reads require lower amounts of data compared to other strategies. It has been reported that a 30× genome coverage, using HiFi data, is sufficient to produce high-quality assemblies (18,50). The present study revealed that even a 16× coverage of the human genome was adequate for that purpose. Thus, subsampling in the Hifiasm assembly strategy allows the adaptation of sequencing data to the computational resources available as follows: Sequencing data with a coverage of no higher than 40× can be produced, as the current findings and previous experience from other Hifiasm users (https://downloads.pacbcloud.com/public/dataset/redwood2020/hifiasm/v12/) suggest, and if the assembly fails to run on the computational system, the data can be subsampled using the divide and conquer approach, until the computational resources are adequate for the analysis. 
However, if the subsampled data correspond to <30× coverage, the final assembly may deteriorate, as observed in the Homo sapiens assembly, where chromosome 22 is missing from the primary assembly and chromosomes 16, 19, 21 and 22 are missing from the final assembly, after the scaffolding and correction process. On the other hand, it has been reported that a hybrid assembly would need 50× Illumina short-read coverage and 30× Nanopore long-read coverage of the genome (15,51,52). In the case of the human genome, notable results were produced with a 34× Illumina and 30× Nanopore coverage. Therefore, the volume of data used for HiFi assemblies is much smaller. As the volume of data decreases, so do the computational requirements for CPU power and particularly memory. In addition, the use of highly accurate long reads bypasses several computationally demanding, time-consuming steps of the assembly workflow. In the hybrid assembly strategy, Wengan performed most effectively in terms of accuracy and speed. Among the hybrid assemblers, Wengan produced the most contiguous Drosophila virilis assemblies. Although no hybrid assembler produced a human genome assembly on the BRFAA cluster, Wengan was the only assembler that managed to construct a primary assembly on the AUTh computational system. The assembler we recommend for HiFi reads is Hifiasm, as it outperformed HiCanu in a small genome and it succeeded in producing a notable assembly of a large genome, whereas HiCanu failed to run. Hifiasm performed equally well regardless of insert size and coverage, while HiCanu output improves with the increase in insert size and coverage. We recommend the use of Hifiasm or HiCanu assemblers, depending on the available computational resources as well as the organism's genome size and complexity. Hifiasm produced the most contiguous assemblies and its assembly strategy is highly efficient in terms of computational power and time on a single node of the cluster. 
For this reason, Hifiasm is also used by the Human Pangenome Project (https://humanpangenome.org/). On the other hand, HiCanu gives the possibility to run the assembly on a grid when using a computational cluster. Distributing the tasks over different nodes allows the use of more computational resources than running on a central resource, and jobs can be executed in parallel, speeding up the assembly. Even when running on a grid, HiCanu was unable to produce a human genome assembly, as the main bottleneck of all assemblers is RAM size. Finally, by following the PacBio HiFi assembly strategy for small genomes, we utilise only one sample preparation and one sequencing technology, in contrast to the Illumina/Nanopore hybrid strategy where we need to make three sample preparations (Illumina, Nanopore and Hi-C) and utilise two sequencing technologies (Illumina sequencing for short genomic and Hi-C reads and Nanopore for long genomic reads). For larger genomes, similar to the human one, the PacBio HiFi assembly strategy relies on two sample preparations and two sequencing technologies (Illumina sequencing for short Hi-C reads and PacBio for long genomic reads). Our analysis suggests that the use of additional information for scaffolding is not necessary in small genomes (such as insect genomes); however, it offers a noticeable improvement in larger and more complex genomes (such as the human genome and higher plant genomes). The computational resources required for scaffolding, even for the most complex genomes, are far less than those for the assembly step. Ideally, multiple types of data are used, as each seems to exploit different genome features. The successive use of 10x (https://www.10xgenomics.com/) (53,54), Bionano (https://bionanogenomics.com/) (55) and Hi-C data will generate the most accurate scaffolds (25,56). 
Although the use of 10x and Bionano data is not imperative, Hi-C sequencing reads are highly recommended for complex genomes, in order to increase the continuity of the assembly, while improving the accuracy by reducing major misassemblies and translocations. The development of sequencing technologies led to a great reduction in sequencing cost. The purchase of a sequencer is no longer compulsory for genome assembly projects, as different institutes provide a variety of sequencing services at prices affordable to many labs. Each of PacBio, Illumina and Nanopore offers a network of certified sequencing service providers. Some of these providers are certified for more than one of those sequencing technologies. Moreover, the purchase of a computational cluster is no longer necessary, as bioinformatics infrastructures, such as ELIXIR (57), can offer researchers the computational resources necessary for the accomplishment of demanding tasks, such as a de novo genome assembly. The major bottlenecks in genome assembly projects were the computationally demanding assembly algorithms and the large cost of sequencing. The development of new assembly algorithms, which require much less computational power and memory, is the result of major improvements in long-read accuracy by PacBio. The future of genomics relies on long reads in order to resolve low complexity regions of the genomes and perform telomere-to-telomere assemblies. Alongside the advances in read accuracy, third generation sequencing led to the reduction of sequencing cost. Furthermore, the increase of genomic data availability in public databases (58), such as the Sequence Read Archive (SRA) (59), allows researchers to find and use a variety of raw sequencing data from the same species of interest, already produced by others, for the primary assembly and/or the scaffolding process. 
Finally, it is important to note that all assembly algorithms and methods we utilised during this work are being constantly updated to improve their performance and computational efficiency, which even allows the reanalysis of older data and the discovery of novel information. In addition, as basecallers are also constantly updated, re-basecalling raw signal files (for example, fast5-formatted files in Nanopore) can produce more accurate reads. In conclusion, continuous advancements in all the fields mentioned above are leading towards the democratisation of de novo genome assembly projects, by enabling scientific laboratories with limited technical and financial resources to perform a great variety of genomic studies without the need for expensive sequencing equipment and computational infrastructure.
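
As a compact recap of the sample preparation and sequencing technology requirements compared in the discussion above, the following Python sketch tabulates the three scenarios. The class name `Strategy`, the function `summarise` and the strategy labels are illustrative assumptions, not part of any assembler or of the original study; only the counts are taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Strategy:
    """Hypothetical record of what one assembly strategy requires."""
    name: str
    preparations: Tuple[str, ...]   # sample (library) preparations
    technologies: Tuple[str, ...]   # sequencing platforms used


# Requirements as stated in the discussion; the labels are illustrative.
STRATEGIES: List[Strategy] = [
    Strategy("PacBio HiFi (small genome)",
             ("PacBio HiFi",),
             ("PacBio",)),
    Strategy("PacBio HiFi + Hi-C (large genome)",
             ("PacBio HiFi", "Hi-C"),
             ("PacBio", "Illumina")),
    Strategy("Illumina/Nanopore hybrid + Hi-C",
             ("Illumina", "Nanopore", "Hi-C"),
             ("Illumina", "Nanopore")),
]


def summarise(strategies: List[Strategy]) -> Dict[str, Tuple[int, int]]:
    """Map each strategy name to (number of preparations, number of technologies)."""
    return {s.name: (len(s.preparations), len(s.technologies)) for s in strategies}


if __name__ == "__main__":
    for name, (preps, techs) in summarise(STRATEGIES).items():
        print(f"{name}: {preps} sample preparation(s), {techs} sequencing technology(ies)")
```

The sketch only counts laboratory inputs; it does not model coverage, cost or computational requirements, which the study evaluates separately.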

53 in total

1. Circos: an information aesthetic for comparative genomics.

Authors: Martin Krzywinski; Jacqueline Schein; Inanç Birol; Joseph Connors; Randy Gascoyne; Doug Horsman; Steven J Jones; Marco A Marra
Journal: Genome Res Date: 2009-06-18 Impact factor: 9.043

2. Assembly of long, error-prone reads using repeat graphs.

Authors: Mikhail Kolmogorov; Jeffrey Yuan; Yu Lin; Pavel A Pevzner
Journal: Nat Biotechnol Date: 2019-04-01 Impact factor: 54.908

3. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.

Authors: Sergey Nurk; Brian P Walenz; Arang Rhie; Mitchell R Vollger; Glennis A Logsdon; Robert Grothe; Karen H Miga; Evan E Eichler; Adam M Phillippy; Sergey Koren
Journal: Genome Res Date: 2020-08-14 Impact factor: 9.043

4. Integrating Hi-C links with assembly graphs for chromosome-scale assembly.

Authors: Jay Ghurye; Arang Rhie; Brian P Walenz; Anthony Schmitt; Siddarth Selvaraj; Mihai Pop; Adam M Phillippy; Sergey Koren
Journal: PLoS Comput Biol Date: 2019-08-21 Impact factor: 4.475

5. Comprehensive mapping of long-range interactions reveals folding principles of the human genome.

Authors: Erez Lieberman-Aiden; Nynke L van Berkum; Louise Williams; Maxim Imakaev; Tobias Ragoczy; Agnes Telling; Ido Amit; Bryan R Lajoie; Peter J Sabo; Michael O Dorschner; Richard Sandstrom; Bradley Bernstein; M A Bender; Mark Groudine; Andreas Gnirke; John Stamatoyannopoulos; Leonid A Mirny; Eric S Lander; Job Dekker
Journal: Science Date: 2009-10-09 Impact factor: 47.728

6. Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly.

Authors: Ernest T Lam; Alex Hastie; Chin Lin; Dean Ehrlich; Somes K Das; Michael D Austin; Paru Deshpande; Han Cao; Niranjan Nagarajan; Ming Xiao; Pui-Yan Kwok
Journal: Nat Biotechnol Date: 2012-08 Impact factor: 54.908

7. Nanopore sequencing and assembly of a human genome with ultra-long reads.

Authors: Miten Jain; Sergey Koren; Karen H Miga; Josh Quick; Arthur C Rand; Thomas A Sasani; John R Tyson; Andrew D Beggs; Alexander T Dilthey; Ian T Fiddes; Sunir Malla; Hannah Marriott; Tom Nieto; Justin O'Grady; Hugh E Olsen; Brent S Pedersen; Arang Rhie; Hollian Richardson; Aaron R Quinlan; Terrance P Snutch; Louise Tee; Benedict Paten; Adam M Phillippy; Jared T Simpson; Nicholas J Loman; Matthew Loose
Journal: Nat Biotechnol Date: 2018-01-29 Impact factor: 54.908

8. ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers.

Authors: Lauren Coombe; Jessica Zhang; Benjamin P Vandervalk; Justin Chu; Shaun D Jackman; Inanc Birol; René L Warren
Journal: BMC Bioinformatics Date: 2018-06-20 Impact factor: 3.169

9. Aggressive assembly of pyrosequencing reads with mates.

Authors: Jason R Miller; Arthur L Delcher; Sergey Koren; Eli Venter; Brian P Walenz; Anushka Brownley; Justin Johnson; Kelvin Li; Clark Mobarry; Granger Sutton
Journal: Bioinformatics Date: 2008-10-24 Impact factor: 6.937

10. Finding Nemo: hybrid assembly with Oxford Nanopore and Illumina reads greatly improves the clownfish (Amphiprion ocellaris) genome assembly.

Authors: Mun Hua Tan; Christopher M Austin; Michael P Hammer; Yin Peng Lee; Laurence J Croft; Han Ming Gan
Journal: Gigascience Date: 2018-03-01 Impact factor: 6.524
