
De novo chromosome level assembly of a plant genome from long read sequence data.

Priyanka Sharma1, Ardashir Kharabian Masouleh1, Bruce Topp1, Agnelo Furtado1, Robert J Henry1,2.   

Abstract

Recent advances in the sequencing and assembly of plant genomes have allowed the generation of genomes with increasing contiguity and sequence accuracy. Chromosome level genome assemblies using sequence contigs generated from long read sequencing have involved the use of proximity analysis (Hi-C) or traditional genetic maps to guide the placement of sequence contigs within chromosomes. The development of highly accurate long reads by repeated sequencing of circularized DNA (HiFi; PacBio) has greatly increased the size of contigs. We now report the use of HiFiasm to assemble the genome of Macadamia jansenii, a genome that has been used as a model to test sequencing and assembly. This achieved almost complete chromosome level assembly from the sequence data alone without the need for higher level chromosome map information. Eight of the 14 chromosomes were represented by a single large contig (six with telomere repeats at both ends) and the other six assembled from two to four main contigs. The small number of chromosome breaks appears to be the result of highly repetitive regions including ribosomal genes that cannot be assembled by these approaches. De novo assembly of near complete chromosome level plant genomes now appears possible using these sequencing and assembly tools. Further targeted strategies might allow these remaining gaps to be closed.
© 2021 The Authors. The Plant Journal published by Society for Experimental Biology and John Wiley & Sons Ltd.

Keywords: Macadamia jansenii; HiFi reads; HiFiasm; de novo genome assembly; mitochondrial genome; nuclear genome; nuclear ribosomal RNA; plastid genome; technical advance

Year:  2021        PMID: 34784084      PMCID: PMC9300133          DOI: 10.1111/tpj.15583

Source DB:  PubMed          Journal:  Plant J        ISSN: 0960-7412            Impact factor:   7.091


INTRODUCTION

Reference genome sequences are a key resource for plant science. The challenge of producing a complete genome sequence has been greatly reduced by advances in both DNA sequencing (Hon et al., 2020; Levy and Myers, 2016) and sequence assembly tools (Chen et al., 2017; Phillippy, 2017). Final assembly of chromosome level genomes has relied upon evidence other than the sequence data alone, such as genetic maps (Fierst, 2015; Yu et al., 2019). Advances in sequencing, assembly and scaffolding technologies, along with the rapid increase in the amount of freely available genomic data (https://www.ncbi.nlm.nih.gov/genbank/statistics), have greatly facilitated the development of highly accurate de novo assemblers. Short‐read de novo assemblers are not efficient in assembling the complex and long repetitive regions of plant genomes, such as centromeres and telomeres (Liao et al., 2019). To address this limitation, long read sequencing technologies, also known as third generation sequencers, have been developed. However, these long reads from PacBio (Menlo Park, CA, USA) and Oxford Nanopore (Oxford, UK) have been less accurate, with an average base calling accuracy of 90% compared to the 99.9% accuracy of the Illumina (San Diego, CA, USA) reads (Amarasinghe et al., 2020; Shendure et al., 2017). Hybrid assembly pipelines have therefore often been used, aiming to overcome the shortcomings of both long reads and short reads. This has allowed assembly of larger contigs from complex genomes. However, to achieve chromosome level genome assembly, scaffolding of the contigs was usually required. Analysis of sequence proximity in the chromatin by methods such as Hi‐C has made this possible (Dudchenko et al., 2017; Kaplan and Dekker, 2013).
Recent advances in long read sequencing technology have allowed a single molecule to be sequenced multiple times to produce long high fidelity reads (HiFi; PacBio) with a base level accuracy of 99.9% (Wenger et al., 2019). We have used Macadamia jansenii to compare methods for the sequencing and assembly of plant genomes (Murigneux et al., 2020; Sharma et al., 2021a). This genome has a size (approximately 800 Mb) typical of many plant genomes but with a relatively low heterozygosity (Sharma et al., 2021b). Assembly of this genome using highly accurate circular consensus sequencing (CCS) reads (HiFi; PacBio) using the HiFiasm assembly tool (Cheng et al., 2021) was found to give a more contiguous genome than that obtained with earlier longer continuous long reads (CLR; PacBio) (Sharma et al., 2021a). The HiFiasm assembler has been used to successfully assemble genomes of Fragaria × ananassa (garden strawberry), Rana muscosa (mountain yellow‐legged frog) and Sequoia sempervirens (California redwood) (Cheng et al., 2021). Recently, HiFiasm was reported to allow highly contiguous assembly of plant genomes (Driguez et al., 2021). We now report the near complete chromosome level assembly of the M. jansenii genome from HiFi reads with the HiFiasm assembly tool, as well as an analysis of the assembled genome against a Hi‐C chromosome level assembly.

RESULTS

HiFiasm assembly

The estimated genome size of M. jansenii is 780 Mb (Murigneux et al., 2020). The primary HiFiasm assembly was 826 Mb in 779 contigs (Table S1), with a longest contig of 71.9 Mb and an average contig length of about 1 Mb. BUSCO analysis (https://busco.ezlab.org) showed that the assembly covered 99.6% of universal single copy genes (Table 1). The contigs generated in this assembly were placed in three groups based on their size: large contigs (>1 Mb), medium size contigs (between 100 kb and 1 Mb) and small contigs (<100 kb).
Table 1

HiFiasm contigs in different size categories and comparison of primary and haploid assemblies generated from the HiFiasm genome assembler tool

Contig set                   Number of contigs   Assembly length (Mb)   N50 (Mb)   N75 (Mb)   BUSCO (%)
HiFiasm assembly
Total contigs                779                 826                    46         25         99.6
Contigs >40 Mb               10                  524                    50         46         68.7
Contigs >10 Mb               19                  746                    48         39         93.9
Contigs >1 Mb                30                  784                    46         30         99.1
Contigs >100 kb              94                  805                    46         27         99.0
Between 100 kb and 1 Mb      64                  20                     0.49       0.22       0.20
Between 10 kb and 100 kb     685                 22                     0.032      0.028      0.00
Comparison of HiFiasm primary and haploid assemblies
Primary assembly             779                 827                    46.1       25         99.60
Hap 1_assembly               879                 816                    24.4       8.9        98.80
Hap 2_assembly               363                 776                    14.3       5.4        97.90
Hap 1 >1 Mb                  96                  736                    16.4       6.8        96.70
Hap 2 >1 Mb                  72                  766                    24.5       12.3       98.10
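
The size classes and the N50 and N75 statistics reported in Table 1 follow directly from the list of contig lengths. The Python sketch below is illustrative only and is not the QUAST implementation used in the study; the function names are hypothetical, and the toy input is the ten longest contig lengths listed in Table 2.

```python
def nx(lengths, fraction=0.5):
    """Return the contig length L such that contigs of length >= L
    together cover at least `fraction` of the total assembly length
    (fraction=0.5 gives N50, fraction=0.75 gives N75)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return 0


def size_class(length):
    """Assign a contig to one of the three size groups used in the text."""
    if length > 1_000_000:
        return "large"      # >1 Mb
    if length >= 100_000:
        return "medium"     # between 100 kb and 1 Mb
    return "small"          # <100 kb


# Toy input: the ten longest contigs (bp), copied from Table 2.
toy = [71_935_981, 57_251_071, 57_081_251, 56_513_637, 49_863_231,
       48_320_516, 47_997_562, 46_138_073, 46_131_124, 43_049_961]

print(nx(toy))        # 49863231
print(nx(toy, 0.75))  # 46138073
```

For these ten contigs the sketch gives an N50 of about 49.9 Mb and an N75 of about 46.1 Mb, which agrees with the "Contigs >40 Mb" row of Table 1 after rounding.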

Larger size contigs >1 Mb

There were 30 contigs greater than 1 Mb in length. These contigs alone provided a good assembly with an N50 of 46 Mb and a BUSCO score of 99.1% (Table 1). Dotplot analysis against the Hi‐C assembly (Sharma et al., 2021b) showed that, of the nine contigs more than 46 Mb in length, eight corresponded to complete Hi‐C pseudo‐molecules (i.e. each contig corresponds to a single chromosome; chromosomes 1, 4, 5, 6, 10, 11, 13 and 14) (Figure 2a). One contig (ptg000010l) corresponded to a large part of the second largest chromosome (chromosome 2), and another two contigs of approximately 25 and 2.7 Mb covered the other parts of this chromosome (Figure 2b and Tables 2 and 3). The 14 contigs between 4 and 46 Mb in size covered the remaining six chromosomes, in combinations of two to four contigs. Five of the seven contigs between 1 and 4 Mb in size corresponded to nuclear ribosomal RNA sequences, and the other two contigs matched parts of chromosomes 2 and 7 (Figure 2b and Tables 2 and 3).
Figure 2

Dotplots of HiFiasm contigs against Hi‐C pseudo‐molecules. (a) Pseudo‐molecules that are covered by a single HiFiasm contig. (b) Pseudo‐molecules that are covered by more than one HiFiasm contig.

Table 2

Chromosomal location of HiFiasm contigs >1 Mb

Contig id (>1 Mb)   Length (bp)   Corresponding Hi‐C pseudo‐molecule
ptg000016l          71 935 981    Chr 1 + Ribo RNA
ptg000003l          57 251 071    Chr 6
ptg000017l          57 081 251    Chr 4
ptg000011l          56 513 637    Chr 5
ptg000004l          49 863 231    Chr 10
ptg000012l          48 320 516    Chr 11
ptg000023l          47 997 562    Chr 13
ptg000010l          46 138 073    Chr 2 + Ribo RNA
ptg000008l          46 131 124    Chr 14
ptg000014l          43 049 961    Chr 9
ptg000009l          39 279 660    Chr 3
ptg000002l          29 700 554    Chr 8
ptg000001l          26 771 894    Chr 12
ptg000006l          25 189 511    Chr 2
ptg000007l          23 138 637    Chr 7 + Ribo RNA
ptg000013l          22 539 440    Chr 8
ptg000020l          22 399 594    Chr 7
ptg000052l          20 335 125    Chr 12
ptg000021l          13 354 688    Chr 3
ptg000019l          8 098 418     Chr 7
ptg000022l          6 676 624     Chr 3
ptg000005l          6 127 021     Chr 9
ptg000072l          4 271 045     Chr 12
ptg000018l          2 743 534     Ribo RNA
ptg000025l          2 713 795     Part of Chr 2
ptg000062l          1 651 603     Ribo RNA
ptg000074l          1 299 006     Ribo RNA
ptg000034l          1 171 806     Ribo RNA
ptg000036l          1 154 310     Ribo RNA
ptg000033l          1 122 141     Part of Chr 7
Table 3

HiFiasm contigs (>1 Mb) covering each of the Hi‐C pseudo‐molecules

Hi‐C pseudo‐molecule (A)  Size, bp (B)  HiFiasm contigs (C)                                  HiFiasm contig lengths, bp (D)                                     Combined length, bp (E)  Extra HiFiasm length, bp (E − B)
Chr 1                     67 682 215    ptg000016l                                           71 935 981                                                         71 935 981               4 253 766
Chr 2                     63 669 590    ptg000006l + ptg000025l + ptg000010l                 74 041 379 (=25 189 511 + 2 713 795 + 46 138 073)                  74 041 379               10 371 789
Chr 3                     58 143 993    ptg000021l + ptg000009l + ptg000022l                 59 310 972 (=13 354 688 + 39 279 660 + 6 676 624)                  59 310 972               1 166 979
Chr 4                     56 076 407    ptg000017l                                           57 081 251                                                         57 081 251               1 004 844
Chr 5                     55 220 784    ptg000011l                                           56 513 637                                                         56 513 637               1 292 853
Chr 6                     53 595 462    ptg000003l                                           57 251 071                                                         57 251 071               3 655 609
Chr 7                     52 077 970    ptg000019l + ptg000020l + ptg000033l + ptg000007l    54 758 790 (=8 098 418 + 22 399 594 + 1 122 141 + 23 138 637)      54 758 790               2 680 820
Chr 8                     49 563 658    ptg000013l + ptg000002l                              52 239 994 (=22 539 440 + 29 700 554)                              52 239 994               2 676 336
Chr 9                     49 085 581    ptg000014l + ptg000005l                              49 176 982 (=43 049 961 + 6 127 021)                               49 176 982               91 401
Chr 10                    48 974 653    ptg000004l                                           49 863 231                                                         49 863 231               888 578
Chr 11                    47 698 009    ptg000012l                                           48 320 516                                                         48 320 516               622 507
Chr 12                    46 713 600    ptg000001l + ptg000072l + ptg000052l                 51 378 064 (=26 771 894 + 4 271 045 + 20 335 125)                  51 378 064               4 664 464
Chr 13                    45 610 911    ptg000023l                                           47 997 562                                                         47 997 562               2 386 651
Chr 14                    45 288 529    ptg000008l                                           46 131 124                                                         46 131 124               842 595

Extra HiFiasm length is the HiFiasm contig length minus the Hi‐C scaffold length (E − B).
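
The last column of Table 3 is simple arithmetic: the lengths of the HiFiasm contigs assigned to a pseudo‐molecule are summed (E) and the size of the Hi‐C pseudo‐molecule (B) is subtracted. A minimal Python check is sketched below, using values copied from Tables 2 and 3 for two chromosomes; the dictionary and function names are illustrative and not part of the original analysis.

```python
# Hi-C pseudo-molecule sizes in bp (column B of Table 3).
hic_size = {"Chr 2": 63_669_590, "Chr 9": 49_085_581}

# Lengths in bp of the HiFiasm contigs assigned to each pseudo-molecule (Table 2).
contig_lengths = {
    "Chr 2": [25_189_511, 2_713_795, 46_138_073],
    "Chr 9": [43_049_961, 6_127_021],
}


def extra_length(chrom):
    """Return (combined HiFiasm length E, extra length E - B) for one chromosome."""
    combined = sum(contig_lengths[chrom])
    return combined, combined - hic_size[chrom]


for chrom in hic_size:
    combined, extra = extra_length(chrom)
    print(chrom, combined, extra)
# Chr 2 74041379 10371789
# Chr 9 49176982 91401
```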

Medium size contigs

There were 64 contigs between 100 kb and 1 Mb in size. These contigs had 0% BUSCO genes (Table 1). Only eight contigs in the range between 100 and 824 kb corresponded to seven Hi‐C pseudo‐molecules (with an alignment block length of more than 100 kb) (Figure 1b; Figures S2 and S3; Tables S2 and S3). Of these eight contigs, five corresponded to the terminal parts of the Hi‐C pseudo‐molecules and three corresponded to non‐terminal regions of Hi‐C chromosomes 3 and 7, marked with red stars in Figure S2(a,b). Most of the medium size contigs corresponded to ribosomal RNA genes (Figure 5b) and one contig of 183 kb corresponded to the chloroplast assembly (Figure 3b). None of the contigs showed similarity with mitochondrial sequences (Figure 4b).
Figure 1

Dotplot of Macadamia jansenii Hi‐C genome assembly against HiFiasm contigs. (a) HiFiasm longest contigs (>1 Mb size), (b) HiFiasm medium size contigs (<1 Mb and >100 kb) and (c) HiFiasm smallest contigs (<100 kb).

Figure 5

Dotplot of Macadamia jansenii nuclear ribosomal RNA sequence against HiFiasm contigs. (a) HiFiasm longest contigs (>1 Mb size), (b) HiFiasm medium size contigs (<1 Mb and >100 kb) and (c) HiFiasm smallest contigs (<100 kb).

Figure 3

Dotplot of Macadamia jansenii chloroplast genome sequence against HiFiasm contigs. (a) HiFiasm longest contigs (>1 Mb size), (b) HiFiasm medium size contigs (<1 Mb and >100 kb) and (c) HiFiasm smallest contigs (<100 kb).

Figure 4

Dotplot of Macadamia jansenii mitochondria genome sequence against three sets of HiFiasm contigs. (a) HiFiasm longest contigs (>1 Mb size), (b) HiFiasm medium size contigs (<1 Mb and >100 kb) and (c) HiFiasm smallest contigs (<100 kb).


Smaller contigs

There were 685 contigs between 10 and 100 kb in size. Most of these small contigs from the HiFiasm assembly corresponded to small portions of the chloroplast and mitochondrial genomes. Aligned together, these contigs covered the complete organelle genomes (Figures 3c and 4c). However, a few contigs corresponded to nuclear ribosomal RNA sequences (Figure 5c). This contig set also showed 0% BUSCO genes (Table 1).

Influence of data volume

HiFiasm assemblies from CCS reads from two individual single molecule, real time (SMRT) sequencing cells and from the combined data are given in Table S1. A HiFiasm assembly generated from the 10× CCS data produced 4511 contigs with an assembly length of 909 Mb and an N50 of 0.38 Mb, whereas a larger CCS file with 18× coverage generated an assembly with fewer contigs (1058), a shorter assembly length (833 Mb) and an improved N50 of 4.4 Mb (Table S1). The 18× assembly was closer to the combined CCS assembly (and the Hi‐C assembly) than the 10× assembly. Haploid assembly details are given in Table 1. The haploid 1 assembly had a greater number of contigs than the haploid 2 assembly. The BUSCO results were similar for the two haploid and primary assemblies as all assemblies were relatively complete.
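
Coverage here is the total number of sequenced bases divided by the genome size, so the yields of separate SMRT cells add. The rough Python sketch below assumes the 780 Mb genome size estimate quoted earlier; the per‐cell yields are back‐calculated from the stated 10× and 18× coverage rather than taken from the sequencing reports, and the function names are illustrative.

```python
GENOME_SIZE = 780_000_000  # bp, estimated genome size quoted in the text


def coverage(total_bases, genome_size=GENOME_SIZE):
    """Fold coverage given a total read yield in bases."""
    return total_bases / genome_size


def bases_needed(target_coverage, genome_size=GENOME_SIZE):
    """Total read yield in bases required to reach a target fold coverage."""
    return target_coverage * genome_size


# Back-calculated yields (assumption): bases implied by 10x and 18x coverage.
cell_1 = bases_needed(10)
cell_2 = bases_needed(18)

print(coverage(cell_1 + cell_2))   # 28.0
print(bases_needed(28) / 1e9)      # 21.84 (Gb of HiFi data for 28x)
```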

Comparison with Hi‐C assembly

A dotplot analysis of the 14 pseudo‐molecules of the M. jansenii Hi‐C assembly against the HiFiasm assembly is shown in Figure 1. The dotplot of contigs >1 Mb in size showed a complete match of 25 contigs (out of a total of 30) with the 14 Hi‐C pseudo‐molecules (Sharma et al., 2021b) (Figure 1a). The remaining five large contigs did not contribute to the chromosome assembly; they were composed of nuclear ribosomal RNA sequences. Chromosomes 1, 4, 5, 6, 10, 11, 13 and 14 were each covered by a single contig of the HiFiasm assembly (Figure 2a), two chromosomes (Chr 8 and 9) were covered by two contigs, chromosomes 2, 3 and 12 were covered by three contigs, and chromosome 7 was covered by four contigs (Figure 2b, Tables 2 and 3). Analysis of the sequence at the ends of the HiFiasm contigs (Table 4) showed that the eight Hi‐C pseudo‐molecules (1, 4, 5, 6, 10, 11, 13 and 14) covered by single HiFiasm contigs had telomere repeats at both ends, except for pseudo‐molecules 1 and 5, which had a telomere at one end and 18S ribosomal RNA genes at the other. The two pseudo‐molecules that were covered by two contigs (Chr 8 and Chr 9) had telomere sequences at one end of each contig. In the case of chromosome 12, two contigs had telomere repeats at one end, indicating their position at the ends of the chromosome. One of these had 5S ribosomal RNA gene sequences at its other end, matching the 5S ribosomal RNA sequences at the end of the middle contig. Chromosome 3 (covered by three contigs) also had two contigs with telomere repeats, confirming their terminal position in the chromosome. Similarly, chromosome 7 (covered by four contigs) had telomere repeats at one end of two contigs, indicating their position at the ends of the chromosome, with the other two contigs in the middle of the chromosome.
Table 4

Presence of telomere repeats and rRNA at the ends of HiFiasm contigs

Hi‐C pseudo‐molecule   HiFiasm contig   Terminal features of HiFiasm contig (Terminal 1 / Terminal 2)
Hi‐C pseudo‐molecules covered by a single HiFiasm contig
Chr 1                  ptg000016l       Telomere / 18S rRNA
Chr 4                  ptg000017l       Telomere / Telomere
Chr 5                  ptg000011l       Telomere / 18S rRNA
Chr 6                  ptg000003l       Telomere / Telomere
Chr 10                 ptg000004l       Telomere / Telomere
Chr 11                 ptg000012l       Telomere / Telomere
Chr 13                 ptg000023l       Telomere / Telomere
Chr 14                 ptg000008l       Telomere / Telomere
Hi‐C pseudo‐molecules covered by more than one HiFiasm contig
Chr 2                  ptg000006l       Telomere
                       ptg000025l       28S rRNA
                       ptg000010l       18S rRNA / 28S rRNA
Chr 3                  ptg000021l       Telomere
                       ptg000009l       -
                       ptg000022l       Telomere
Chr 7                  ptg000019l       Telomere
                       ptg000020l       -
                       ptg000033l       -
                       ptg000007l       Telomere
Chr 8                  ptg000013l       Telomere / Repeats
                       ptg000002l       Telomere / Repeats
Chr 9                  ptg000014l       Telomere
                       ptg000005l       Telomere
Chr 12                 ptg000001l       Telomere
                       ptg000072l       5S rRNA
                       ptg000052l       Telomere / 5S rRNA
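
Checking contig ends for telomere repeats, as summarised in Table 4, can be approximated by counting copies of the telomere motif in a window at each end of a contig. The sketch below is not the bioserf method used in the study. It assumes the Arabidopsis‐type plant repeat TTTAGGG (the motif is not stated in the text), and the window size and repeat threshold are arbitrary illustrative values.

```python
TELOMERE = "TTTAGGG"  # assumption: Arabidopsis-type plant telomere repeat


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def count_motif(seq, motif=TELOMERE):
    """Count non-overlapping copies of the motif on either strand."""
    seq = seq.upper()
    return seq.count(motif) + seq.count(revcomp(motif))


def telomere_ends(contig, window=2000, min_repeats=10):
    """Return (left_has_telomere, right_has_telomere) for one contig.

    window and min_repeats are illustrative thresholds, not values
    taken from the study.
    """
    left = count_motif(contig[:window]) >= min_repeats
    right = count_motif(contig[-window:]) >= min_repeats
    return left, right


# Toy contigs: telomere repeats at one end only, and at both ends.
middle = "ACGTACGTAC" * 300
one_end = "CCCTAAA" * 20 + middle + "GATTACA" * 50
both_ends = "CCCTAAA" * 20 + middle + "TTTAGGG" * 20

print(telomere_ends(one_end))    # (True, False)
print(telomere_ends(both_ends))  # (True, True)
```

A contig flagged at both ends is a candidate telomere‐to‐telomere chromosome; a contig flagged at one end only would then be checked for ribosomal RNA or other repeats at the opposite end.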

Organelle genome analysis

Dotplot analysis of a 159 kb full length chloroplast genome assembled using the GetOrganelle toolkit (Jin et al., 2020) against the HiFiasm genome assembly indicated the insertion of small fragments of chloroplast sequences in the nuclear genome assembly (Figure 3a; Figure S1A), in agreement with the previously reported Hi‐C assembly results (Sharma et al., 2021b) (Figure S1B). Among the middle size contig set, only one contig (ptg000186l) of 183 kb aligned with the chloroplast genome (Figure 3b). Contig ptg000186l covered the complete chloroplast genome including the two inverted repeat regions of the chloroplast (Figure S4). Another HiFiasm middle size contig, ptg000066l, also showed some similarity with the chloroplast assembly and also aligned with the terminal end of Hi‐C chromosome 14 (Figure S5). Analysis of the smaller size contigs showed that the majority of these contigs contained some fragments of the chloroplast assembly (Figure 3c).
Mitochondrial sequence analysis revealed that the size of the de novo mitochondrial assembly was 351 kb. Analysis against the HiFiasm assembly indicated the presence of mitochondrial sequences in the smallest set of contigs. The majority of these contigs covered small fragments of the mitochondrial genome (Figure 4c), whereas, in the larger contig set (>1 Mb), only a few contigs showed some similarity with mitochondrial sequences. These represent mitochondrial sequences inserted in the nuclear genome (Figure 4a), in agreement with the dotplot result for the Hi‐C assembly (Figure S1B[b]). The middle size contigs did not show the presence of any mitochondrial sequences in the dotplot analysis (Figure 4b).

Nuclear ribosomal RNA gene sequences analysis

Dotplot analysis of nuclear ribosomal RNA sequences showed matches with the majority of the middle size contigs, with a small number of contigs from the smaller set of contigs having ribosomal RNA sequences (Figure 5b,c).

Analysis of repeat elements

The HiFiasm contigs were longer than the corresponding Hi‐C pseudomolecules (Table 3). This is probably because the HiFiasm contigs included a larger proportion of repetitive elements than the corresponding Hi‐C pseudomolecules (Table 5). The longer chromosomes generally had a higher content of repetitive elements, suggesting that the presence of these repeat regions explained their greater size. The HiFiasm assemblies included more repetitive elements in the larger chromosomes but a lower repeat content in the smaller chromosomes, largely as a result of the inclusion of fewer unclassified repeats in the HiFiasm assemblies of the smaller chromosomes.
Table 5

Comparative repetitive elements of Hi‐C pseudo‐molecules and HiFiasm assembly

Pseudo‐molecule   Assembler   Size (bp)     Total repeats (%)   LINE (%)   LTR (%)   DNA elements (%)   Unclassified (%)   Simple repeats (%)
Chr 1             Hi‐C        67 682 215    62                  4.13       30.3      0.52               26.8               0.64
                  HiFiasm     71 935 981    62.2                3.88       34.8      0.58               22.6               0.65
Chr 2             Hi‐C        63 669 590    66                  3.31       38.2      1.12               23.3               0.86
                  HiFiasm     74 041 379    68                  2.98       39.6      1.22               23.8               0.44
Chr 3             Hi‐C        58 143 993    52.3                6.15       20.5      1.13               24.2               0.67
                  HiFiasm     59 310 972    54.3                7.93       20.7      0.97               24.1               0.72
Chr 4             Hi‐C        56 076 407    55.1                6.26       22.8      0.79               24.3               1.13
                  HiFiasm     57 081 251    57.2                7.21       21.6      0.90               26.56              1.23
Chr 5             Hi‐C        55 220 784    53.1                3.27       31.4      0.96               17.0               0.78
                  HiFiasm     56 513 637    54.3                3.47       31.6      0.75               17.9               0.85
Chr 6             Hi‐C        53 595 462    55.1                8.75       19.9      1.18               25.5               0.79
                  HiFiasm     57 251 071    58.8                9.47       22.6      1.43               24.0               1.44
Chr 7             Hi‐C        52 077 970    52.9                7.04       21.3      1.42               22.4               0.85
                  HiFiasm     54 758 790    51.2                6.65       21.3      1.25               21.4               0.79
Chr 8             Hi‐C        49 563 658    44.0                5.39       15.5      0.64               21.3               1.25
                  HiFiasm     52 239 994    41.8                6.25       22.0      1.30               12.3               0
Chr 9             Hi‐C        49 085 581    48.1                5.39       18.4      1.76               22.0               0.86
                  HiFiasm     49 176 982    45.9                5.41       24.8      1.65               14.0               0
Chr 10            Hi‐C        48 974 653    48.1                6.24       17.3      0.90               22.8               1.02
                  HiFiasm     49 863 231    44.7                5.91       22.9      2.82               13.1               0
Chr 11            Hi‐C        47 698 009    48.1                6.24       17.3      0.90               22.8               1.02
                  HiFiasm     48 320 516    44.7                4.01       26.3      2.82               11.6               0
Chr 12            Hi‐C        46 713 600    44.6                4.47       16.5      1.39               21.3               0.85
                  HiFiasm     51 378 064    25.6                4.47       21.2      2.44               11.7               0
Chr 13            Hi‐C        45 610 911    42.2                5.52       13.7      0.70               21.1               1.31
                  HiFiasm     47 997 562    39.7                5.42       18.8      1.63               13.8               0
Chr 14            Hi‐C        45 288 529    42.5                5.82       12.9      0.96               21.5               1.50
                  HiFiasm     46 131 124    41.1                5.53       19.6      1.94               13.9               0

DISCUSSION

This era of genomics is continuing to advance with improved sequencing technologies and the potential to sequence all recorded species on earth (Lewin et al., 2018). Accurate chromosome level genome assembly requires accurate reads, high genome coverage and long read length. This has typically involved the use of very high coverage and data from multiple sequencing platforms, along with Hi‐C mapping, to achieve chromosome level assemblies. However, the combination of high sequence accuracy and long read length in HiFi reads (99.8% accuracy at around 15 kb average length) provides the option to assemble a complete genome using a single sequencing technology (Cheng et al., 2021) and with a more readily obtainable genome coverage (Wenger et al., 2019). In the present study, we have combined the benefit of the highly accurate reads with an improved assembly tool, HiFiasm (Cheng et al., 2021). HiFi read genome coverage of 28–40×, for plant genomes within the range of 700−1000 Mb size, was sufficient to generate high quality assemblies with Mb‐scale contigs (Sharma et al., 2021b). The DNA extracted from M. jansenii may have contained some impurities that reduced the efficiency of the DNA sequencing. Two SMRT cells were required to generate 28× genome coverage with CCS reads. For some samples, a single run may provide the required coverage if sufficient DNA purity is achieved, reducing the cost of obtaining sufficient sequence. When the two individual CCS runs of 10× and 18× were assembled separately using HiFiasm, the resulting assemblies were fragmented (N50 of 0.38 and 4.4 Mb, respectively) for M. jansenii (Table S1), whereas the combined 28× gave a highly contiguous assembly with an N50 of 46.1 Mb and a BUSCO score of 99.6%. The combined CCS run results suggest that, if the isolation method resulted in high purity DNA, a single run with less coverage may be sufficient to assemble the genome.
The higher base‐calling accuracy of HiFi reads improves assembly accuracy and allows many time‐consuming and computationally demanding steps in the assembly workflow to be bypassed. The M. jansenii assembly from HiFiasm using HiFi sequencing data produced a near chromosome level assembly, with eight contigs covering eight complete Hi‐C pseudo‐molecules and the other six chromosomes covered by a total of 17 contigs (two to four contigs each). Chromosomes 1 and 5 had 18S ribosomal RNA genes at one end, suggesting that these repeats near the end of the chromosome had prevented assembly to the telomere. Chromosome 12 was interrupted by 5S ribosomal RNA genes. For a plant with a genome of approximately 800 Mb, we estimate a high‐quality chromosome level assembly could be produced within 1 week from the plant material, if the DNA extraction step is well established. This highly contiguous M. jansenii chromosome level assembly will help achieve a better understanding of the genome of macadamias. All four species of Macadamia are listed as threatened under Australian legislation (Mast et al., 2008), although M. jansenii is particularly endangered because of its very low population size (<200 plants in the wild) (Shapcott and Powell, 2011). The highly accurate genome assembly will facilitate its conservation and use in breeding. Macadamia jansenii has small inedible nuts (Gross and Weston, 1992); however, as a result of its small tree size and narrow root spread, it is being tested as a rootstock and in hybrids with the commercial species Macadamia integrifolia (Alam et al., 2018). The HiFiasm assembly (BUSCO 99%) is more complete than the Hi‐C assembly (BUSCO 97%) (Sharma et al., 2021b), suggesting the incorporation of some regions missing in the Hi‐C assembly.
The initiative to complete the genome assembly of almost all living organisms (Koepfli et al., 2015; Lewin et al., 2018) requires a highly efficient assembly method with sustainable financial, computational and time requirements without compromising on genome accuracy. Contiguity and completeness should be taken into consideration (Rhie et al., 2021). Our analysis suggests that HiFiasm assembly with the HiFi reads may require almost no further scaffolding for the plants with similar genome size of approximately 800 Mb. Analysis of the nature of the few remaining regions of the genome that are not assembled in these analyses may allow the development of targeted strategies to complete these assemblies. Analysis of the sequences at the ends of the contigs formed by HiFiasm assembly of HiFi reads may identify those contigs that have been interrupted by repetitive sequences that cannot be assembled de novo. This technology is successfully assembling regions with high levels of the repeat sequences that make up more than 50% of the M. jansenii genome (Sharma et al., 2021b). It may be that the very high accuracy of the HiFi reads detects minor variations in repeat sequences that allow their unique assembly and that only perfect repeats that are longer than the HiFi reads create a barrier to assembly. The present study suggests that more than half of the total chromosomes could be assembled telomere to telomere for the plants with a genome size of approximately 800 Mb, whereas plants with larger genome sizes may require some additional methods for complete assembly. Assemblies of larger genomes have been shown to require a higher level of coverage with long read data to achieve the same size of assembled contigs (Sharma et al., 2021a). The chromosomes covered by more than one contig have some end sequences that indicate how they should be connected to other contigs. 
The present study also suggests that the large ribosomal gene clusters in the genome of plants may be one of the few limitations to complete assembly. This would suggest that sequence analysis of the ends of contigs could be used to guide high level assembly of the genome. However, additional information may be required for plants with very large and complex genomes. This approach will be useful for producing plant genomes generating high quality de novo chromosome level assemblies, especially for laboratories with limited financial, technical and computational resources.

METHODS

Sequencing data

Short‐read (Illumina) sequencing data were from Murigneux et al. (2020) and long read data (PacBio HiFi) were from Sharma et al. (2021a). The HiFiasm genome assembly (Cheng et al., 2021) was generated using the High Performance Computing facility at the University of Queensland. For assembly, 24 processing cores and 120 GB of memory were employed. The HiFiasm assembler was run with default settings, which include built‐in purging of duplications for heterozygous genomes. The HiFiasm output consists of GFA graph files for two haploid assemblies (hap 1 and hap 2), one primary contig assembly and one alternate haplotig assembly. Each haplotig GFA file and the primary contig GFA file were converted to FASTA format using the awk command.
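
The GFA to FASTA conversion only needs the segment (S) records of the graph file, in which the second tab‐separated field is the contig name and the third is its sequence. The exact awk command is not given in the text; the Python function below is an equivalent sketch, with a hypothetical function name and a toy two‐contig input.

```python
import io


def gfa_to_fasta(gfa_lines, out, width=60):
    """Write every GFA segment (S) record to `out` as a FASTA entry.

    gfa_lines: iterable of text lines from a GFA file.
    out: writable text handle.
    Returns the number of contigs written.
    """
    written = 0
    for line in gfa_lines:
        if not line.startswith("S\t"):
            continue  # skip header (H), link (L) and other record types
        fields = line.rstrip("\n").split("\t")
        name, seq = fields[1], fields[2]
        out.write(f">{name}\n")
        # Wrap the sequence at a fixed line width.
        for start in range(0, len(seq), width):
            out.write(seq[start:start + width] + "\n")
        written += 1
    return written


# Toy GFA content, invented for illustration.
toy_gfa = [
    "H\tVN:Z:1.0\n",
    "S\tptg000001l\tACGTACGT\tLN:i:8\n",
    "S\tptg000002l\tGGGTTTAA\tLN:i:8\n",
    "L\tptg000001l\t+\tptg000002l\t+\t0M\n",
]
buffer = io.StringIO()
n_contigs = gfa_to_fasta(toy_gfa, buffer)
print(n_contigs)
print(buffer.getvalue(), end="")
```

For a real assembly the same function can be given an open file handle for the GFA input and another for the FASTA output.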

Analysis of assembly

The primary HiFiasm assembly of M. jansenii included 779 contigs that were categorised into three subsets: (i) contigs >1 Mb in size; (ii) contigs <1 Mb and >100 kb in size; and (iii) contigs <100 kb in size. Along with the primary and two haploid assemblies, all three primary contig subsets were analysed using QUAST (Gurevich et al., 2013), BUSCO (Simão et al., 2015) and RepeatModeler (Humann et al., 2019). The telomere sequences in the HiFiasm contigs were identified using the bioserf platform (https://bioserf.org) (Somanathan and Baysdorfer, 2018). Ribosomal RNA and protein coding genes at the terminal ends of the HiFiasm contigs were identified using an NCBI BLAST search (https://blast.ncbi.nlm.nih.gov). Ribosomal RNA in the contigs was identified using Barrnap (https://github.com/tseemann/barrnap) (Seemann, 2013) with default settings for eukaryotes. The HiFiasm contigs were compared with the 14 M. jansenii pseudo‐molecules from the Hi‐C assembly (Sharma et al., 2021b) using the online interactive D‐Genies dotplot tool (Cabanettes and Klopp, 2018), with Minimap2 as the aligner. Dotplot images were created after selecting the ‘sort contigs’ option, setting the ‘minimum identity’ parameter to 0.75 and checking the ‘strong precision’ tick box.
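
D‐Genies was used interactively, so no command line is given. For readers who want a comparable filter offline, the sketch below parses Minimap2 PAF output and keeps alignments by identity and block length. It assumes identity can be approximated as matching bases divided by alignment block length (PAF columns 10 and 11); whether D‐Genies computes its ‘minimum identity’ in exactly this way is an assumption, and the 100 kb block threshold is borrowed from the medium contig analysis in the Results. The function name and the toy records are invented for illustration.

```python
def filter_paf(lines, min_identity=0.75, min_block=100_000):
    """Keep PAF alignments with identity >= min_identity and
    alignment block length > min_block.

    Returns a list of (query, target, block_length, identity) tuples.
    """
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        matches = int(fields[9])    # column 10: number of matching bases
        block = int(fields[10])     # column 11: alignment block length
        if block == 0:
            continue
        identity = matches / block
        if identity >= min_identity and block > min_block:
            kept.append((fields[0], fields[5], block, round(identity, 3)))
    return kept


# Toy PAF records (invented): one good hit, one short hit, one low-identity hit.
toy_paf = [
    "ptg000016l\t71935981\t0\t200000\t+\tChr1\t67682215\t0\t200000\t190000\t200000\t60",
    "ptg000186l\t183000\t0\t60000\t+\tChr14\t45288529\t0\t60000\t59000\t60000\t60",
    "ptg000033l\t1122141\t0\t200000\t-\tChr7\t52077970\t0\t200000\t120000\t200000\t20",
]
print(filter_paf(toy_paf))
# [('ptg000016l', 'Chr1', 200000, 0.95)]
```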

Characterisation of organelle genomes content of HiFiasm contigs

A reference mitochondrial genome, chloroplast genome and nuclear ribosomal RNA sequence from this sample were assembled from Illumina raw reads (Murigneux et al., 2020) using the GetOrganelle toolkit (Jin et al., 2020) with default parameters. The HiFiasm contigs (779) were compared with the organellar and ribosomal sequences in dotplots.

AUTHOR CONTRIBUTIONS

RJH, AF, AKM and BT designed the study and supervised the project. PS and AKM were responsible for genome assembly and analysis. PS, AF, AKM and RJH were responsible for data analysis. PS, RJH and AF were responsible for the tables and figures. PS and RJH drafted the manuscript. PS was responsible for data deposition. All authors edited and approved the final manuscript submitted for publication.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

SUPPORTING INFORMATION

Figure S1. (A) Dotplot of Macadamia jansenii HiFiasm longest contigs (more than 1 Mb) against the (a) chloroplast, (b) mitochondria and (c) nuclear ribosomal RNA sequence of M. jansenii. (B) Dotplot of M. jansenii Hi‐C assembly against the (a) chloroplast, (b) mitochondria and (c) nuclear ribosomal RNA sequence of M. jansenii.
Figure S2. (a) Dotplots of Hi‐C pseudo‐molecules against HiFiasm contigs (longest contigs >1 Mb). (b) Dotplots of Hi‐C pseudo‐molecules against HiFiasm contigs (longest and middle size contigs).
Figure S3. (a) Dotplots of Hi‐C pseudo‐molecules against HiFiasm contigs (longest contigs >1 Mb). (b) Dotplots of Hi‐C pseudo‐molecules against HiFiasm contigs (longest and middle size contigs).
Figure S4. Chloroplast assembly covered by a single HiFiasm contig (ptg000186l) and small parts by ptg000066l.
Figure S5. Chloroplast sequence (ptg000186l and ptg000066l) insertions in the Hi‐C assembly.
Table S1. IPA and HiFiasm assembly from different volumes of sequence data.
Table S2. HiFiasm contigs (<1 Mb and >100 kb) that are part of the Hi‐C pseudo‐molecule assembly.
Table S3. HiFiasm contigs (biggest contigs and middle size contigs) corresponding to the 14 Macadamia jansenii Hi‐C pseudo‐molecules.
REFERENCES

1.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors:  Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal:  Bioinformatics       Date:  2015-06-09       Impact factor: 6.937

Review 2.  The Genome 10K Project: a way forward.

Authors:  Klaus-Peter Koepfli; Benedict Paten; Stephen J O'Brien
Journal:  Annu Rev Anim Biosci       Date:  2015       Impact factor: 8.923

3.  Structural and Functional Annotation of Eukaryotic Genomes with GenSAS.

Authors:  Jodi L Humann; Taein Lee; Stephen Ficklin; Dorrie Main
Journal:  Methods Mol Biol       Date:  2019

Review 4.  DNA sequencing at 40: past, present and future.

Authors:  Jay Shendure; Shankar Balasubramanian; George M Church; Walter Gilbert; Jane Rogers; Jeffery A Schloss; Robert H Waterston
Journal:  Nature       Date:  2017-10-11       Impact factor: 49.962

5.  De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds.

Authors:  Olga Dudchenko; Sanjit S Batra; Arina D Omer; Sarah K Nyquist; Marie Hoeger; Neva C Durand; Muhammad S Shamim; Ido Machol; Eric S Lander; Aviva Presser Aiden; Erez Lieberman Aiden
Journal:  Science       Date:  2017-03-23       Impact factor: 47.728

6.  Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm.

Authors:  Haoyu Cheng; Gregory T Concepcion; Xiaowen Feng; Haowen Zhang; Heng Li
Journal:  Nat Methods       Date:  2021-02-01       Impact factor: 28.547

Review 7.  Opportunities and challenges in long-read sequencing data analysis.

Authors:  Shanika L Amarasinghe; Shian Su; Xueyi Dong; Luke Zappia; Matthew E Ritchie; Quentin Gouil
Journal:  Genome Biol       Date:  2020-02-07       Impact factor: 13.583

8.  The genome of the endangered Macadamia jansenii displays little diversity but represents an important genetic resource for plant breeding.

Authors:  Priyanka Sharma; Valentine Murigneux; Jasmine Haimovitz; Catherine J Nock; Wei Tian; Ardashir Kharabian Masouleh; Bruce Topp; Mobashwer Alam; Agnelo Furtado; Robert J Henry
Journal:  Plant Direct       Date:  2021-12-14

9.  Highly accurate long-read HiFi sequencing data for five complex genomes.

Authors:  Ting Hon; Kristin Mars; Greg Young; Yu-Chih Tsai; Joseph W Karalius; Jane M Landolin; Nicholas Maurer; David Kudrna; Michael A Hardigan; Cynthia C Steiner; Steven J Knapp; Doreen Ware; Beth Shapiro; Paul Peluso; David R Rank
Journal:  Sci Data       Date:  2020-11-17       Impact factor: 6.444

10.  GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes.

Authors:  Jian-Jun Jin; Wen-Bin Yu; Jun-Bo Yang; Yu Song; Claude W dePamphilis; Ting-Shuang Yi; De-Zhu Li
Journal:  Genome Biol       Date:  2020-09-10       Impact factor: 13.583
