Literature DB >> 34938939

The genome of the endangered Macadamia jansenii displays little diversity but represents an important genetic resource for plant breeding.

Priyanka Sharma¹, Valentine Murigneux², Jasmine Haimovitz³, Catherine J Nock⁴, Wei Tian^5,6, Ardashir Kharabian Masouleh¹, Bruce Topp¹, Mobashwer Alam¹, Agnelo Furtado¹, Robert J Henry^1,7.

Abstract

Macadamia, a recently domesticated expanding nut crop in the tropical and subtropical regions of the world, is one of the most economically important genera in the diverse and widely adapted Proteaceae family. All four species of Macadamia are rare in the wild with the most recently discovered, M. jansenii, being endangered. The M. jansenii genome has been used as a model for testing sequencing methods using a wide range of long read sequencing techniques. Here, we report a chromosome level genome assembly, generated using a combination of Pacific Biosciences sequencing and Hi-C, comprising 14 pseudo-molecules, with a N50 of 52 Mb and a total genome assembly size of 758 Mb of which 56% is repetitive. Completeness assessment revealed that the assembly covered -97.1% of the conserved single copy genes. Annotation predicted 31,591 protein coding genes and allowed the characterization of genes encoding biosynthesis of cyanogenic glycosides, fatty acid metabolism, and anti-microbial proteins. Re-sequencing of seven other genotypes confirmed low diversity and low heterozygosity within this endangered species. Important morphological characteristics of this species such as small tree size and high kernel recovery suggest that M. jansenii is an important source of these commercial traits for breeding. As a member of a small group of families that are sister to the core eudicots, this high-quality genome also provides a key resource for evolutionary and comparative genomics studies.

Entities: Chemical

Keywords: Proteaceae; endangered species; genome assembly; genome diversity; genome sequencing; wild species

Year: 2021 PMID： 34938939 PMCID： PMC8671617 DOI： 10.1002/pld3.364

Source DB: PubMed Journal: Plant Direct ISSN： 2475-4455

INTRODUCTION

Macadamia is a recent domesticate with a complex domestication history (Peace, 2005). The four currently recognized Macadamia species are endemic to the central coast of eastern Australia (Mast et al., 2008). However, macadamia was domesticated in Hawaii around 100 years ago, with most of the global production based upon the Hawaiian domesticated germplasm (Hardner, 2016). Macadamia is a member of the Proteaceae family, one of a group of families that are sister to the core eudicots (Christenhusz & Byng, 2016; Gross & Weston, 1992). Macadamia was the first Australian native plant to be widely grown as a food plant (Peace et al., 2013). All the Hawaiian macadamia cultivars have been reported to be based upon only a few or possibly even a single tree from Australia (Nock et al., 2019). The resulting narrow gene pool makes it susceptible to disease and climate change, whereas the unexploited wild macadamia germplasm of Australia provides an opportunity for great improvement of this newly domesticated crop. Despite a rapid international increase in macadamia production, breeding is restricted because of lack of genomic information (Topp et al., 2019). Macadamia remains the most widely grown Australian native food crop (Peace et al., 2013). Macadamia production was valued at USD 1.17 billion in 2019, and production is expected to grow at a rate of 9.2% from 2020 to 2027 (https://www.grandviewresearch.com/industry-analysis/macadamia-nut-market). Among the M acadamia species, M. integrifolia, the species from which most of the domesticated gene pool is derived (Hardner, 2016), was the first genome to be sequenced (Nock et al., 2016). This genome, of cultivar HAES 741, has supported initial efforts at genome based breeding (O'Connor et al., 2018) and has recently been upgraded to chromosome level with a contig N50 of 413 Kb (Nock et al., 2020). The other species that has been a contributor to domesticated germplasm, M. tetraphylla, has also been sequenced with contig N50 of 1.18 Mb (Niu et al., 2020). All species are rare in the wild, but M. jansenii is endangered and is only found in a limited area to the north‐west of Bundaberg, Queensland (Hayward et al., 2021; Shapcott & Powell, 2011). Macadamia jansenii is endangered under the Australian (EPBC) Act and critically endangered under the Queensland (Qld Nature Conservation Act) legislation (Gross & Weston, 1992). Due to the expected low heterozygosity associated with the extremely small population size, this species has been used as a model to compare available genome sequencing technologies (Murigneux et al., 2020; Sharma et al., 2021). Macadamia jansenii has been sequenced (Murigneux et al., 2020), using three long read sequencing technologies, Oxford Nanopore (PromethION), PacBio (Sequel I), and BGI (Single‐tube Long Fragment Read). The genome was recently updated by sequencing using the PacBio HiFi sequencing (Sharma et al., 2021). Here, we report chromosome level assembly of the same genotype using Hi‐C and annotation of the genome. This provides a platform that allows analysis of key genes of importance in macadamia breeding, a reference genome in this group of angiosperms, and insights into the impact of rarity on plant genomes. This high‐quality reference genome also provides a platform for analysis of three unique attributes of macadamia, the high levels of unusual fatty acids (Hu et al., 2019), high cyanogenic glucoside content (Nock et al., 2016), and the presence of a novel anti‐microbial peptide (Marcus et al., 1999). The fatty acid, palmitoleic acid (16:1), is found in large amounts in macadamia and has been considered to have potential human health benefits (Solà Marsiñach & Cuenca, 2019; Song et al., 2018). Cyanogenic glycosides in plants are part of their defense against herbivores. However, the highly bitter nuts of M. jansenii are not edible, and the use of this species in macadamia breeding will require selection to ensure high levels of cyanogenic glycosides are avoided. Identification of the associated genes could assist by providing molecular tools for use in breeding selection. A novel antimicrobial protein was reported in the kernels of M. integrifolia (Marcus et al., 1999). These small antimicrobial proteins were found to be produced by processing of a larger pre‐cursor protein. As fungal infection and insect herbivores are major hurdles in macadamia production (Dahler et al., 1995; Marcus et al., 1999; Nock et al., 2016), retention of the antimicrobial protein and cyanogenesis in some parts of the plant may be important. Analysis of candidate genes for these traits may assist in understanding and manipulating them in macadamia breeding. The assembly and annotation of the genome presented here will allow characterization of the role of these novel genes and facilitate their use in breeding.

RESULTS

Genome sequencing and assembly

A pseudo‐molecule level genome assembly of PacBio contigs (Murigneux et al., 2020) was produced using Hi‐C. The estimated genome size of M. jansenii has been reported to be 780 Mb (Murigneux et al., 2020), and the size of the final Hi‐C assembly was 758 Mb, comprised of 219 scaffolds with an N50 of 52 Mb (Table 1). Of this, 97% was anchored to the 14 largest scaffolds representing the 14 chromosomes (Figure S1; Table S1). The statistical summary of self‐interacting genomic region analysis or topologically associating domain (TAD) analysis at three different resolutions is given in Tables S2 and S3. Comparison of the PacBio assembly with the Hi‐C chromosome assembly shows the number of scaffolds decreased from 762 to 219 and the length of the longest scaffold increased sixfold (Table 1). The L50 reduced from 135 to 7 scaffolds, and the N50 was improved from 1.58 to 52 Mb.

TABLE 1

Macadamia jansenii genome sequencing and assembly statistics

	PacBio	Dovetail Chicago	Dovetail Hi‐C assembly
Library statistics	3,170,206 reads	213 M read pairs; 2 × 150 bp	156 M read pairs; 2× 150 bp
Coverage	84X	88X	3,601X
Genome assembly
Total length	758.28 Mb	758.30 Mb	758.43 Mb
L50/N50 ^a	135 scaffolds; 1.58 Mb	199 scaffolds; 1.0 Mb	7 scaffolds; 52.1 Mb
L90/N90 ^a	457 scaffolds; .51 Mb	767 scaffolds; .23 Mb	13 scaffolds; 45.61 Mb
Longest scaffold	10,537,631 bp	8,434,305 bp	67,682,215 bp
Number of scaffolds	762	1,529	219
BUSCO results ^a
Complete genes (single+ duplicate)	96.70%	97.20%	97.1%
Single genes	79.10%	80.10%	80.80%
Duplicated genes	17.60%	17.10%	16.30%
Fragmented genes	0.90%	1.00%	1.00%
Missing genes	2.00%	2.00%	2.10%

Eudicots_odb10 dataset, Number of BUSCOs = 2,326.

Macadamia jansenii genome sequencing and assembly statistics Eudicots_odb10 dataset, Number of BUSCOs = 2,326.

Assembly completeness and repeat element analysis

The completeness of the M. jansenii assembly was assessed by Benchmarking Universal Single‐Copy orthologs (BUSCO) analysis (Simão et al., 2015). This analysis revealed 97.1% complete genes (single and duplicated) in the Hi‐C assembly (Table 1). A total of 423.6 Mb, representing 55.9% of the Hi‐C assembly, was identified as repetitive (Table 2). Class I TE (Transposable Elements) repeats were the most abundant repetitive elements representing 30% of the genome, including LTRs (24%), LINE (5.67%), and SINE (0%), and Class II TE repeats were 1.56%.

TABLE 2

Annotation of repeat sequences in the M. jansenii genome

	Hi‐C assembly
Total repetitive content	55.9%
Class I TEs repeats	29.9%
LTRs	24%
LINE	5.67%
SINE	0%
Class II TEs repeats	1.56%
Low complexity repeats	0.33%
Simple repeats	1.35%

Annotation of repeat sequences in the M. jansenii genome

Structural and functional annotation

A total of 31,591 genes were identified in the repeat‐masked Hi‐C M. jansenii genome using a homology‐based and RNA assisted approach. The average length of the genes was 1,368 bp (Table 3). Of a total of 31,591 transcripts, only 22,500 sequences (71%) were annotated by BLAST2GO (Figure S2). Then, the transcripts were functionally annotated using Gene Ontology (GO) terms to assess the potential role of the genes in the M. jansenii genome. The most abundant M. jansenii specific gene families were organic cyclic and heterocyclic compounds among the molecular function; organic and cellular metabolic among the biological process; and protein‐containing binding membrane and intracellular organelle among the cellular component (Figure S3). The annotation edit distance (AED) score plot, which calculates how well the predicted genes agrees with the external evidence, is given in Figure S4, where zero represents the perfect evidence match and one represents no support for the predicted genes. The comparison of the three Macadamia genomes, assembled so far, showed M. jansenii has the highly continuous assembly with highest number of BUSCO genes (Table 4).

TABLE 3

Genes predicted in the M. jansenii genome

Gene prediction
Total number of genes	31,591
Total coding region	43,235,907 bp
Average length of genes	1,368 bp
Number of single‐exon genes	2,458
Number of genes with annotation	22,500

TABLE 4

Comparison of genome assemblies of three Macadamia species

	M. integrifolia (V1)	M. integrifolia (V2)	M. tetraphylla	M. jansenii
Assembly length (Mb)	518.49	744.64	750.53	758.43
N50 (Mb)	4.7 kb	413 kb	1.2 Mb	52.1 Mb
No. of contigs/scaffolds	193,493	4094	4,335	219
Repeats	37.00%	55.00%	61.42%	55.90%
Complete BUSCO	77.40%	90.20%	89.72%	97.1%
No. of coding genes	35,337	34,274	31,571	31,591

Genes predicted in the M. jansenii genome Comparison of genome assemblies of three Macadamia species

Anti‐microbial genes

Blast analysis identified homologs of the macadamia antimicrobial protein in the M. jansenii genome (Figure S5A,B). The ANN01396 transcript from M jansenii also showed four repeat segments of cysteine motifs with the same structure as found in MiAMP‐2 (Figure 1a). Comparison of the translated protein sequences indicated a high level of homology with only 28 differences in the 665 amino acid sequence (Figure 1b). The M. jansenii sequence provides the first genomic sequence for this novel anti‐microbial gene and reveals the presence of an intron in the 5′ UTR (Figure S6).

FIGURE 1

Anti‐microbial peptide structure. (a) The cDNA sequence of anti‐microbial gene of with four repeat segments (RS), shown in red open boxes and cysteine residues in green filled boxes aligned with M. jansenii transcript sequence ANN01396, showing same pattern. (b) The alignment of the anti‐microbial peptide sequence from the and M. jansenii. The first half of the sequence shows the repeat segments within red boxes with green highlighted cysteine residues. Differences in amino acid sequence throughout the alignment as shown in blue highlighted text

Cyanogenic glycoside genes

Analysis of genes of cyanogenic glycoside metabolism detected a total of 11 putative genes with high homology in the M. jansenii genome. These genes were distributed throughout the genome. The largest number of these genes (five) is encoded by CYP79 which is involved in the first step of the cyanogenesis, responsible for conversion of amino acids to oxime. In contrast, only one gene of CYP71 was found which is responsible for conversion of oxime to hydroxynitrile (Table S4).

Fatty acid metabolism genes

This study identified the key enzymes involved in fatty acid biosynthesis: elongases (e.g., KAS, FATA, and FATB) and desaturases (e.g., SAD). A total of 12 of these genes were found in the M. jansenii genome. Stearoyl‐ACP desaturases (SAD) which convert 18:0 to 18:1 were found to be abundant with five genes present. Ketoacyl‐ACP‐synthase (KAS), which is responsible for elongation of butyryl‐ACP from a 4 to 14 carbon chain, was found to be two in number (Table S5).

Heterozygosity and genetic diversity

To study the genetic diversity within the species, resequencing was undertaken for eight accessions including the one for which the genome assembly was generated. The trimmed read sequence data yield for the genotypes used in this study ranged from 19.7 to 22.8 Gb corresponding to 25X to 29X of the M. jansenii genome (assembled size 780 Mb). The eight accessions analyzed had between 5.4 and 7.0 million variants relative to the reference genome (Tables 5 and S6). Most of these were SNPs with less than 600,000 indels in all genotypes. Most SNPs were heterozygous with approximately 1 million or less homozygous SNP variants in each individual (Table S6). The level of SNP heterozygosity for the eight genotypes (including the reference) was found to be in the range of 0.26% to 0.34% with an average of 0.31% (Table 5). The statistical polymorphic sites in up to seven of the eight M. jansenii accessions is given in Table S7. The genotypes varied in their divergence from the reference with most unique variants being heterozygous and only 762 to 44,014 unique homozygous SNPs being found in an individual and not present in the other seven genotypes (Table S8).

TABLE 5

Heterozygosity and genetic variation in eight M. jansenii accessions

AccessionID	Polymorphic variants			Heterozygosity analysis		Unique polymorphic SNP sites ^a
AccessionID	Total ^b	Indels	SNP	Heterozygous SNP sites	Heterozygosity ^c	Heterozygous	Homozygous	Total
1005 ^d	5,393,188	486,846	4,739,937	2,428,956	0.31	585,053	762.00 ^e	585,815.00
1161004	5,311,865	377,580	4,797,864	2,038,553	0.26	187,441	8,615.00	196,056.00
1161003	6,649,485	555,641	5,898,975	2,465,089	0.32	521,184	44,014.00	565,198.00
1161005	6,109,728	531,550	5,393,088	2,347,362	0.3	608,485	40,629.00	649,114.00
1161001a	6,868,915	574,625	6,087,318	2,649,035	0.34	95,089	8,155.00	103,244.00
1003	6,944,903	586,001	6,148,269	2,672,103	0.34	99,715	9,878.00	109,593.00
1002	6,642,855	586,334	5,857,020	2,447,418	0.31	219,933	27,408.00	247,341.00
1161001b	6,594,383	548,292	5,852,433	2,556,695	0.33	83,662	4,843.00	88,505.00

Variant sites only found in this individual and not in any of the other seven genotypes.

Comprised of replacements, multi nucleotide variants (MNVs), Indels, and SNPs.

Calculated as a percentage of heterozygous SNP sites compared to the total genome size in bases.

Nuclear genome (780 Mb) used as reference for genetic variation and heterozygosity analysis.

Homozygous sites identified possibly due to errors in the reference genome.

Heterozygosity and genetic variation in eight M. jansenii accessions Variant sites only found in this individual and not in any of the other seven genotypes. Comprised of replacements, multi nucleotide variants (MNVs), Indels, and SNPs. Calculated as a percentage of heterozygous SNP sites compared to the total genome size in bases. Nuclear genome (780 Mb) used as reference for genetic variation and heterozygosity analysis. Homozygous sites identified possibly due to errors in the reference genome.

Genome duplication

The MCScanX tool identified 6,303 genes (~20%) of the total genes as WGD/segmental genes. Among all the types of duplicated genes, dispersed genes (15,875 genes) were found to be highest in number, followed by WGD, singleton (3,183 genes) and tandem (2,966 genes). The proximal gene pairs which are separated by only a few genes on the chromosomes were found to be lowest in number, as 2,573 (Table S9). The distribution of the duplicated genes is shown in Figure 2, where pseudo‐molecules 3 and 10 represents a relationship based upon duplicated genes. The whole genome duplication (WGD) tool computed Ks distribution graph with a peak at 0.3 (Figure S7). This shows that M. jansenii genome has undergone a single whole genome duplication event about 20 Million years ago.

FIGURE 2

Chromosomal distribution of duplicated genes of M. jansenii, generated using SynVisio. The parallel horizontal lines represent the 14 pseudo‐molecules of M. jansenii genome, with connected ribbons representing the duplicated genes

DISCUSSION

A major constraint to the use of M. jansenii for commercial breeding is the risk of an inedible kernel due to high levels of toxic cyanogenic glycosides. Cyanogenic glycosides have been observed in all the four species of Macadamia. However, the concentration varies at different developmental stages (Castada et al., 2020). Even the edible cultivars derived from M. integrifolia have genes involved in the cyanogenic glycoside pathway (Nock et al., 2016). However, cyanogenic glycoside levels are extremely low in the kernel of the commercially important species M. integrifolia and M. tetraphylla (Dahler et al., 1995). The high level of bitterness in the seeds of M. jansenii may be associated with high concentrations of cyanogenic glycosides. Knowledge of these genes will support efforts to avoid their transfer to domesticated Macadamia when using M. jansenii as a source of other desirable genes. Plants may produce antimicrobial proteins as part of their defense against microbial attack. Macadamia seed might have antimicrobial proteins that protect them against attack when germinating in the warm and moist rainforest environment. A new family of antimicrobial peptides, MiAMP‐2, was discovered in the seeds of M. integrifolia (Marcus et al., 1999). In addition to antimicrobial properties, these seed storage proteins are homologous to vicilin 7S globulins and have been identified as putative allergens (Rost et al., 2016, 2020). A cDNA sequence, from M. integrifolia, encoding these proteins, MiAMP‐2, has been reported to contain four repeat segments, with each segment comprised of cysteine rich motifs (C‐X‐X‐X‐C‐[10 to12] X‐C‐X‐X‐X‐C), where X is any other amino acid residue. Although only a single gene was found in the M. jansenii genome, it encoded a protein with four domains that correspond to the previously reported antimicrobial peptides, suggesting that four copies of the peptide could be derived from each translation of this gene. This is the first report of a gene structure for the macadamia anti‐microbial peptide with a single intron. This gene has potential for wide use as an antimicrobial protein in plant defense. Macadamia oil has a unique composition being 75% fat, 80% of which is monounsaturated, for exampe, oleic oil (C18:1) 55–67%, followed by palmitoleic acid (C16:1) 15–22% (Aquino‐Bolaños et al., 2016; Curb et al., 2000; Hu et al., 2019). The results of analysis of the genes of lipid metabolism in the M. jansenii genome are consistent with this fatty acid profile. The number of SAD genes which are responsible for conversion of stearoyl‐ACP (18:0) to oleate (18:1) was found to be higher in number than the other genes in these pathways and may explain the desirable high oleic content of macadamias. Retention of these genes will be important in breeding. This species may provide a source of genes for manipulation of lipids in other food crops. This rare species has a very small population size explaining the low heterozygosity (Ceballos et al., 2018). The heterozygosity was less than one third that of the more widespread, M. integrifolia, reported to have a heterozygosity of 0.98% (Nock et al., 2020; Topp et al., 2019). This analysis indicates the importance of conserving the diversity of this endangered species and retaining the unique alleles that may be useful in breeding. M. jansenii is a small tree with a high kernel recovery, and both of these traits are key for macadamia improvement. Sustainable intensification of production will be facilitated by the breeding of smaller trees, and improved kernel recovery is central to kernel yield. Genome level analysis will support field studies for the conservation of this species (Shapcott & Powell, 2011) and molecular analysis of diversity in support of breeding (Mai et al., 2020). The use of M. jansenii as a model in testing genome sequencing and assembly methods (Murigneux et al., 2020; Sharma et al., 2021) is further enhanced by the chromosome level assembly presented here. This is currently the most complete genome sequence available for a macadamia and any member of the more than 1,660 Proteaceae species (Christenhusz & Byng, 2016) making a useful contribution to the goal of sequencing plant biodiversity (Lewin et al., 2018). The Proteaceae belongs to the basal eudicot order Proteales, a sister group to core eudicots (Chanderbali et al., 2016; Drinnan et al., 1994). Among the basal eudicots there are few well characterized genomes. Available genomes include Aquilegia coerulea (Ranuncules) (Filiault et al., 2018), Papaver somniferum (Ranuncules) (Pei et al., 2021), Nelumbo nucifera (Proteales) (Ming et al., 2013), Trochodendron aralioides (Trochodendrales) (Strijk et al., 2019), and Tetracentron sinense (Trochodendrales) (Liu et al., 2020). The M. jansenii genome provides a valuable contribution to comparative genomics in this group of flowering plants. The chromosome level assembly with an N50 scaffold length of 52 Mb and 97.1% of complete BUSCO genes compares favorably with those available for other endangered species, for example, Acer yangbiense with N50 45 Mb and 90.5% of complete BUSCO genes (Giordano et al., 2017), Ostrya rehderiana N50 2.31 Mb (Yang et al., 2018), and Nyssa yunnanensis with N50 of 985 Kb and BUSCO score of 90.5% (Weixue et al., 2020).

EXPERIMENTAL PROCEDURES

Plant material

Fresh leaf tissue of M. jansenii was collected from ex situ collections of trees at Nambour and Tiaro (three accessions were from the Maroochy Research Facility, Department of Agriculture and Fisheries, Nambour, Queensland, Australia, Accessions 1005, 1003, and 1002, and five from Tiaro, Queensland, Australia, Accession 1161003, 1161005, 1161001a, 1161001b, and 1161004). Fresh leaf tissue (fully expanded young flush) was collected and immediately frozen by placing under dry ice and stored at −80°C until further processed for DNA and RNA extraction.

DNA and RNA isolation

Leaf tissue was coarsely ground under liquid nitrogen using a mortar and pestle and further ground under cryogenic conditions into a fine powder using a Tissue Lyser (MM400, Retsch, Germany). All accessions were used for DNA isolation. DNA was extracted as per an established method (Furtado, 2014) with minor modification where phenol was excluded from the extraction method. DNA was extracted from 2–3 gm of leaf tissue and dissolved in up to 400 μl of TE buffer. Accession 10051 was used for RNA isolation. RNA was extracted as per established methods (Furtado, 2014; Rubio‐Piña & Zapata‐Pérez, 2011). RNA was extracted from 2–3 gm of tissue and treated with extraction buffer, chloroform, and phenol/chloroform (1:1) in different steps, followed by further purification using DNase treatment from the Qiagen's RNeasy Mini kit. RNA quality and quantity were determined using A260/280 and A260/230 absorbance ratio (Nanodrop, Invitrogen USA) and RNA integrity measurements (Bioanalyser, Agilent technology, USA).

Chromosome level assembly

Chicago library sequencing and sequencing

DNA was isolated as per an established method (Furtado, 2014). Then, the library was prepared as described in Putnam et al. (2016). Briefly, ~500 ng of HMW gDNA was reconstituted into chromatin in vitro and fixed with formaldehyde. Fixed chromatin was digested with DpnII, the 5′ overhangs filled in with biotinylated nucleotides, and then, free blunt ends were ligated. After ligation, crosslinks were reversed, and the DNA was purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350 bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina‐compatible adapters. Biotin‐containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeqX platform to produce 213 million 2 × 150 bp paired end reads, which provided 88.11 × physical coverage of the genome (1–100 kb pairs).

Dovetail Hi‐C library preparation and sequencing

A Dovetail Hi‐C library was prepared in a similar manner as described previously (Lieberman‐Aiden et al., 2009). Briefly, for each library, chromatin was fixed in place with formaldehyde in the nucleus and then extracted. Fixed chromatin was digested with DpnII, the 5′ overhangs filled in with biotinylated nucleotides, and then free blunt ends were ligated. After ligation, crosslinks were reversed, and the DNA purified from protein. Purified DNA was treated to remove biotin that was not internal to ligated fragments. The DNA was then sheared to ~350 bp mean fragment size, and sequencing libraries were generated using NEBNext Ultra enzymes and Illumina‐compatible adapters. Biotin‐containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The libraries were sequenced on an Illumina HiSeqX platform to produce 156 million 2 × 150 bp paired end reads, which provided 3,601.74 × physical coverage of the genome (10–10,000 kb pairs).

Scaffolding the assembly with HiRise

The input de novo assembly, shotgun reads, Chicago library reads, and Dovetail Hi‐C library reads were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al., 2016). An iterative analysis was conducted. First, Shotgun and Chicago library sequences were aligned to the draft input assembly using a modified SNAP read mapper (http://snap.cs.berkeley.edu). The separations of Chicago read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and to make joins above a threshold. After aligning and scaffolding Chicago data, Dovetail Hi‐C library sequences were aligned and scaffolded following the same method. After scaffolding, shotgun sequences were used to close gaps between contigs.

Re‐sequencing

To study the genetic diversity within the species, re‐sequencing of the eight different genotypes was performed on the DNBseq platform (Drmanac et al., 2010). The seven Macadamia jansenii samples were selected randomly to represent diversity in the population. A DNBseq library was prepared as follows. Briefly, genomic DNA (1 μg) was randomly fragmented using a Covaris; magnetic beads were used to select fragments with an average size of 300–400 bp, and DNA was quantified using a Qubit fluorometer. The fragments were subjected to end‐repair and 3′ adenylated; adaptors were ligated to the ends of these 3′ adenylated fragments. Then, the double stranded products were heat denatured and circularized by the splint oligo sequence; the single strand circle DNA (ssCir DNA) was formatted as the final library. The final library was then amplified to make DNA nanoball (DNB) which had more than 300 copies of each molecule, and the DNBs were loaded into the patterned nanoarray. Finally, pair‐end 150 bases reads were generated by combinatorial Probe‐Anchor Synthesis (cPAS) (MGISEQ‐2000).

RNA‐sequencing

RNA sequencing was undertaken by Macrogen, South Korea. Total RNA was subjected to ribosomal RNA depletion (Ribo zero plant) and then sequenced on Illumina platform using TruSeq stranded total RNA LT sample prep kit (plant) to obtain 154 M paired end reads of 23.3 Gb read length.

Genome assembly quality evaluation and repetitive element evaluation

The completeness of the genome assembly was evaluated by checking the integrity of the protein coding genes in the Hi‐C assembly using Benchmarking Universal Single‐Copy Orthologs (BUSCO) (version v5.0.0) analysis with eudicot odb10 dataset with 2,326 genes. Repetitive elements in the Hi‐C assembly were identified de novo and classified using RepeatModeler (version 2.0.1). The repeat library obtained from RepeatModeler was used to identify and mask the repeats in the Hi‐C assembly file using RepeatMasker (Version 4.1.0).

Structural annotation and functional annotation

The prediction of the protein coding genes in the repeat masked genome was carried out using ab initio and evidence‐based approach. For ab initio prediction, Dovetail staff used Augustus (version 2.5.5) (Stanke et al., 2006) and SNAP (version 2006‐07‐28) (Johnson et al., 2008). For evidence‐based approach, MAKER (Cantarel et al., 2008) was used. For training the ab initio model for M. jansenii, coding sequences from Malus domestica, Prunus persica, and Arabidopsis thaliana were used using AUGUSTUS and SNAP. Six rounds of prediction optimization were done with the package provided by AUGUSTUS. To generate the peptide evidence in Maker pipeline, Swiss‐Prot peptide sequences from the UniProt database were downloaded and used in combination with the protein sequences from M. domestica, P. persica, and A. thaliana. To assess the quality of the gene prediction, AED scores were generated for each of the predicted genes as part of MAKER pipeline. Only those genes which were predicted by both SNAP and AUGUSTUS were retained in the final gene set. To generate the intron hints, a bam file was generated by aligning the RNAseq reads to the genome using the STAR aligner software (version 2.7), and then, bam2hints tool was used within the AUGUSTUS. The predicted genes were further characterized for their putative function by performing a BLASTx search against nr protein database (all non‐redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF), as part of annotations undertaken by Dovetail and also by using OmicsBox Ver 1.3.11 (BioBam Bioinformatics, Spain).

Gene families

To identify the anti‐microbial genes in the genome, BLAST homology search was performed to identity transcripts similar to the M. integrifolia antimicrobial cDNA (MiAMP2, GenBank: AF161884.1) (Marcus et al., 1999). Then, sequence alignment was undertaken using Clone Manager ver 9.0 (SciEd, USA). Multiple Alignment was undertaken using a reference sequence as indicated in the results and alignment parameter scoring matrix of Mismatch (2), Open Gap (4), and Extension‐Gap (1). Genes involved in the metabolism of cyanogenic glycosides and fatty acids were identified in M. jansenii genome using BLASTp (1E‐10) and percentage identity >60%, and then, top hits were identified from the list based upon sequence similarity. For cyanogenic genes, the reference sequence was taken from the M. integrifolia genome (Nock et al., 2016), whereas for fatty acids reference sequences were used from A. thaliana.

Heterozygosity and genetic diversity analysis

All analysis including basic variant analysis (BVA) to determine polymorphic positions for heterozygosity analysis was performed using Qiagen CLC Genomics Workbench 21.0.4 (CLC bio, Aarhus, Denmark). BGI short read sequences of six genotypes (1003, 1002, 1161003, 1161005, 1161001a, and 1161001b) and Illumina reads of two genotypes (1005 and 1161004) of M. jansenii were first trimmed at quality limit of 0.01 (Phred score of 20 and above). The trimmed reads were then mapped to the M. jansenii Dovetail Hi‐C derived reference genome (Accession 1005). Mapping of short reads was conducted at settings as follows; Masking Mode (No), Match score (1), Mismatch score (2), Insertion cost (3), deletion cost (3), Length fraction (1), Similarity fraction (0.95), Global alignment (No), and Non‐specific match handling (Map randomly). To identify variants, the mapping file was subjected to the “Basic Variant Analysis” tool at the following settings: minimum coverage (10), minimum count (3), and minimum frequency (25%). Homozygous SNP variants were those filtered at 100% variant frequency. Heterozygous SNP variants were those filtered for a frequency range of between 25% to 75% for the alternate allele. The number of homozygous SNP variants also represented the number of homozygous sites. The number of heterozygous SNP sites were identified when the variant table was sorted for “Reference allele” as “Yes.” The heterozygosity for an accession was determined by representing the accession‐specific heterozygous SNPs sites as a percentage of the M. jansenii genome size (780 Mb). Accession‐specific polymorphic SNP sites present in up to seven of the eight accessions were determined as outlined below. Accession‐specific SNP sites (homozygous and heterozygous), identified by filtering “sample Count” for “≤7,” were represented as a percentage of the M. jansenii genome size (780 Mb). Accession‐specific unique polymorphic SNP sites, defined as those variant sites present only once in any of the eight accessions, were identified by filtering the “sample Count” for “≤1.” MCScanX (Wang et al., 2012) was used to identify whole‐genome duplication (WGD)/segmental along with tandem, dispersed, singleton, and proximal duplication on the M. jansenii genome. An all‐versus‐all BLASTP was performed (E value: 1e−10, max target sequences: 5 and m6 format output), for the M. jansenii whole‐genome protein sequences. The duplicate gene classifier and MCScanX program was executed using the M. jansenii genome annotation (gff file) along with the BLASTP output file using the default parameters. The collinearity file (MCScanX output file) and the genome annotation file were used to generate the synteny plot using SynVisio toolkit (Bandi & Gutwin, 2020). The Ks distribution graph was generated using WGD tools (Zwaenepoel and van de Peer, 2018). Timing of WGD was estimated as described by Magallón et al. (2015).

AUTHOR CONTRIBUTIONS

Contributions of authors were as follows: designed the study and supervised the project: RJH, AF, BT, and MA; collected sample: MA, BT, AF, and PS; management of germplasm: MA and BT; DNA and RNA isolation: PS and AF; data analysis and prepared the figures: PS and AF; bioinformatics analysis: PS, AF, VM, JH, and AM; drafted the manuscript: PS, AF, JH, and WT; and data deposition: PS. All authors edited and approved the final manuscript.

CONFLICT OF INTEREST

The Authors did not report any conflict of interest. Table S1: Size of each scaffold and number of genes per scaffold Table S2: Topologically Associated Domains (TADs) analysis summary Table S3: TAD statistics at different resolutions Table S4: Location of cyanogenic genes on pseudo‐molecules Table S5: Location of fatty acid genes on pseudo‐molecules Table S6: SNP heterozygosity statistics in eight Macadamia jansenii accessions Table S7: Polymorphic sites in up to seven of the eight Macadamia jansenii accessions Table S8: Genotype‐specific unique polymorphic SNP sites Table S9: List of duplicated genes in M. jansenii genome, using MCScanX tool Figure S1: Linkage density histogram of Hi‐C assembly of M. jansenii genome Figure S2: BLAST2GO sequence similarity search Figure S3: Gene ontology (GO) analysis by BLAST2GO Figure S4: Frequency graph of AED scores Figure S5: Alignment of the vicilin‐like antimicrobial‐peptide transcript from and M. jansenii Figure S6: Alignment of anti‐microbial CDS sequence of against the M. jansenii transcript sequence Figure S7: Ks plot of M. jansenii was generated by Wgd tool. Click here for additional data file. Table S4: Location of cyanogenic genes on pseudo‐molecule (based on blastp results) Click here for additional data file. Table S5: Location of fatty acid genes on pseudo‐molecules (based on blastp results) Click here for additional data file.

32 in total

1. DNA extraction from vegetative tissue for next-generation sequencing.

Authors: Agnelo Furtado
Journal: Methods Mol Biol Date: 2014

2. A metacalibrated time-tree documents the early rise of flowering plant phylogenetic diversity.

Authors: Susana Magallón; Sandra Gómez-Acevedo; Luna L Sánchez-Reyes; Tania Hernández-Hernández
Journal: New Phytol Date: 2015-01-23 Impact factor: 10.151

3. A label-free shotgun proteomics analysis of macadamia nut.

Authors: Johanna Rost; Sridevi Muralidharan; N Alice Lee
Journal: Food Res Int Date: 2019-11-30 Impact factor: 6.475

4. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays.

Authors: Radoje Drmanac; Andrew B Sparks; Matthew J Callow; Aaron L Halpern; Norman L Burns; Bahram G Kermani; Paolo Carnevali; Igor Nazarenko; Geoffrey B Nilsen; George Yeung; Fredrik Dahl; Andres Fernandez; Bryan Staker; Krishna P Pant; Jonathan Baccash; Adam P Borcherding; Anushka Brownley; Ryan Cedeno; Linsu Chen; Dan Chernikoff; Alex Cheung; Razvan Chirita; Benjamin Curson; Jessica C Ebert; Coleen R Hacker; Robert Hartlage; Brian Hauser; Steve Huang; Yuan Jiang; Vitali Karpinchyk; Mark Koenig; Calvin Kong; Tom Landers; Catherine Le; Jia Liu; Celeste E McBride; Matt Morenzoni; Robert E Morey; Karl Mutch; Helena Perazich; Kimberly Perry; Brock A Peters; Joe Peterson; Charit L Pethiyagoda; Kaliprasad Pothuraju; Claudia Richter; Abraham M Rosenbaum; Shaunak Roy; Jay Shafto; Uladzislau Sharanhovich; Karen W Shannon; Conrad G Sheppy; Michel Sun; Joseph V Thakuria; Anne Tran; Dylan Vu; Alexander Wait Zaranek; Xiaodi Wu; Snezana Drmanac; Arnold R Oliphant; William C Banyai; Bruce Martin; Dennis G Ballinger; George M Church; Clifford A Reid
Journal: Science Date: 2009-11-05 Impact factor: 47.728

Review 5. Runs of homozygosity: windows into population history and trait architecture.

Authors: Francisco C Ceballos; Peter K Joshi; David W Clark; Michèle Ramsay; James F Wilson
Journal: Nat Rev Genet Date: 2018-01-15 Impact factor: 53.242

6. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity.

Authors: Yupeng Wang; Haibao Tang; Jeremy D Debarry; Xu Tan; Jingping Li; Xiyin Wang; Tae-ho Lee; Huizhe Jin; Barry Marler; Hui Guo; Jessica C Kissinger; Andrew H Paterson
Journal: Nucleic Acids Res Date: 2012-01-04 Impact factor: 16.971

Review 7. Evolving Ideas on the Origin and Evolution of Flowers: New Perspectives in the Genomic Era.

Authors: Andre S Chanderbali; Brent A Berger; Dianella G Howarth; Pamela S Soltis; Douglas E Soltis
Journal: Genetics Date: 2016-04 Impact factor: 4.562

8. The Aquilegia genome provides insight into adaptive radiation and reveals an extraordinarily polymorphic chromosome with a unique history.

Authors: Danièle L Filiault; Evangeline S Ballerini; Terezie Mandáková; Gökçe Aköz; Nathan J Derieg; Jeremy Schmutz; Jerry Jenkins; Jane Grimwood; Shengqiang Shu; Richard D Hayes; Uffe Hellsten; Kerrie Barry; Juying Yan; Sirma Mihaltcheva; Miroslava Karafiátová; Viktoria Nizhynska; Elena M Kramer; Martin A Lysak; Scott A Hodges; Magnus Nordborg
Journal: Elife Date: 2018-10-16 Impact factor: 8.713

9. Genome of the long-living sacred lotus (Nelumbo nucifera Gaertn.).

Authors: Ray Ming; Robert VanBuren; Yanling Liu; Mei Yang; Yuepeng Han; Lei-Ting Li; Qiong Zhang; Min-Jeong Kim; Michael C Schatz; Michael Campbell; Jingping Li; John E Bowers; Haibao Tang; Eric Lyons; Ann A Ferguson; Giuseppe Narzisi; David R Nelson; Crysten E Blaby-Haas; Andrea R Gschwend; Yuannian Jiao; Joshua P Der; Fanchang Zeng; Jennifer Han; Xiang Jia Min; Karen A Hudson; Ratnesh Singh; Aleel K Grennan; Steven J Karpowicz; Jennifer R Watling; Kikukatsu Ito; Sharon A Robinson; Matthew E Hudson; Qingyi Yu; Todd C Mockler; Andrew Carroll; Yun Zheng; Ramanjulu Sunkar; Ruizong Jia; Nancy Chen; Jie Arro; Ching Man Wai; Eric Wafula; Ashley Spence; Yanni Han; Liming Xu; Jisen Zhang; Rhiannon Peery; Miranda J Haus; Wenwei Xiong; James A Walsh; Jun Wu; Ming-Li Wang; Yun J Zhu; Robert E Paull; Anne B Britt; Chunguang Du; Stephen R Downie; Mary A Schuler; Todd P Michael; Steve P Long; Donald R Ort; J William Schopf; David R Gang; Ning Jiang; Mark Yandell; Claude W dePamphilis; Sabeeha S Merchant; Andrew H Paterson; Bob B Buchanan; Shaohua Li; Jane Shen-Miller
Journal: Genome Biol Date: 2013-05-10 Impact factor: 13.583