Priyanka Sharma, Valentine Murigneux, Jasmine Haimovitz, Catherine J Nock, Wei Tian, Ardashir Kharabian Masouleh, Bruce Topp, Mobashwer Alam, Agnelo Furtado, Robert J Henry.
Abstract
Macadamia, a recently domesticated and expanding nut crop in the tropical and subtropical regions of the world, is one of the most economically important genera in the diverse and widely adapted Proteaceae family. All four species of Macadamia are rare in the wild, and the most recently discovered, M. jansenii, is endangered. The M. jansenii genome has been used as a model for testing sequencing methods using a wide range of long-read sequencing techniques. Here, we report a chromosome-level genome assembly, generated using a combination of Pacific Biosciences sequencing and Hi-C, comprising 14 pseudo-molecules, with an N50 of 52 Mb and a total genome assembly size of 758 Mb, of which 56% is repetitive. Completeness assessment revealed that the assembly covered ~97.1% of the conserved single-copy genes. Annotation predicted 31,591 protein-coding genes and allowed the characterization of genes encoding biosynthesis of cyanogenic glycosides, fatty acid metabolism, and anti-microbial proteins. Re-sequencing of seven other genotypes confirmed low diversity and low heterozygosity within this endangered species. Important morphological characteristics of this species, such as small tree size and high kernel recovery, suggest that M. jansenii is an important source of these commercial traits for breeding. As a member of a small group of families that are sister to the core eudicots, this high-quality genome also provides a key resource for evolutionary and comparative genomics studies.
Keywords: Proteaceae; endangered species; genome assembly; genome diversity; genome sequencing; wild species
Year: 2021 PMID: 34938939 PMCID: PMC8671617 DOI: 10.1002/pld3.364
Source DB: PubMed Journal: Plant Direct ISSN: 2475-4455
Macadamia jansenii genome sequencing and assembly statistics
| | PacBio | Dovetail Chicago | Dovetail Hi‐C assembly |
|---|---|---|---|
| Library statistics | 3,170,206 reads | 213 M read pairs; 2 × 150 bp | 156 M read pairs; 2 × 150 bp |
| Coverage | 84X | 88X | 3,601X |
| Total length | 758.28 Mb | 758.30 Mb | 758.43 Mb |
| L50/N50 | 135 scaffolds; 1.58 Mb | 199 scaffolds; 1.00 Mb | 7 scaffolds; 52.10 Mb |
| L90/N90 | 457 scaffolds; 0.51 Mb | 767 scaffolds; 0.23 Mb | 13 scaffolds; 45.61 Mb |
| Longest scaffold | 10,537,631 bp | 8,434,305 bp | 67,682,215 bp |
| Number of scaffolds | 762 | 1,529 | 219 |
| Complete genes (single + duplicated) | 96.70% | 97.20% | 97.10% |
| Single genes | 79.10% | 80.10% | 80.80% |
| Duplicated genes | 17.60% | 17.10% | 16.30% |
| Fragmented genes | 0.90% | 1.00% | 1.00% |
| Missing genes | 2.00% | 2.00% | 2.10% |
Eudicots_odb10 dataset, Number of BUSCOs = 2,326.
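The L50/N50 and L90/N90 rows above follow the usual contiguity definitions: Nx is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least x% of the assembly, and Lx is the number of scaffolds in that set. The following is a minimal sketch of that calculation, not the pipeline used for this assembly; the function name `nx_lx` and the toy scaffold lengths are illustrative assumptions.

```python
def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of scaffold lengths.

    fraction=0.5 gives N50/L50; fraction=0.9 gives N90/L90.
    """
    ordered = sorted(lengths, reverse=True)   # longest scaffolds first
    threshold = fraction * sum(ordered)       # bases that must be covered
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            # 'length' is the shortest scaffold needed to reach the threshold
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy scaffold lengths (hypothetical, not taken from the assembly)
toy = [80, 70, 50, 40, 30, 20, 10]
print(nx_lx(toy, 0.5))  # N50 and L50
print(nx_lx(toy, 0.9))  # N90 and L90
```

Read this way, the Hi‐C column means that the 7 longest scaffolds hold at least half of the 758.43 Mb assembly, and the shortest of those 7 is 52.1 Mb long.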
Annotation of repeat sequences in the M. jansenii genome
| | Hi‐C assembly |
|---|---|
| Total repetitive content | 55.9% |
| Class I TEs repeats | 29.9% |
| LTRs | 24% |
| LINE | 5.67% |
| SINE | 0% |
| Class II TEs repeats | 1.56% |
| Low complexity repeats | 0.33% |
| Simple repeats | 1.35% |
Genes predicted in the M. jansenii genome
| Gene prediction | |
|---|---|
| Total number of genes | 31,591 |
| Total coding region | 43,235,907 bp |
| Average length of genes | 1,368 bp |
| Number of single‐exon genes | 2,458 |
| Number of genes with annotation | 22,500 |
Comparison of genome assemblies of three Macadamia species
| | | | | M. jansenii (this study) |
|---|---|---|---|---|
| Assembly length (Mb) | 518.49 | 744.64 | 750.53 | 758.43 |
| N50 | 4.7 kb | 413 kb | 1.2 Mb | 52.1 Mb |
| No. of contigs/scaffolds | 193,493 | 4,094 | 4,335 | 219 |
| Repeats | 37.00% | 55.00% | 61.42% | 55.90% |
| Complete BUSCO | 77.40% | 90.20% | 89.72% | 97.10% |
| No. of coding genes | 35,337 | 34,274 | 31,571 | 31,591 |
FIGURE 1 Anti‐microbial peptide structure. (a) The cDNA sequence of anti‐microbial gene of with four repeat segments (RS), shown in red open boxes, and cysteine residues, shown in green filled boxes, aligned with M. jansenii transcript sequence ANN01396, which shows the same pattern. (b) The alignment of the anti‐microbial peptide sequence from the and M. jansenii. The first half of the sequence shows the repeat segments within red boxes, with cysteine residues highlighted in green. Differences in amino acid sequence throughout the alignment are shown in blue highlighted text
Heterozygosity and genetic variation in eight M. jansenii accessions
| Accession ID | Polymorphic variants: total | Polymorphic variants: indels | Polymorphic variants: SNPs | Heterozygous SNP sites | Heterozygosity (%) | Unique polymorphic SNP sites: heterozygous | Unique polymorphic SNP sites: homozygous | Unique polymorphic SNP sites: total |
|---|---|---|---|---|---|---|---|---|
| 1005 | 5,393,188 | 486,846 | 4,739,937 | 2,428,956 | 0.31 | 585,053 | 762 | 585,815 |
| 1161004 | 5,311,865 | 377,580 | 4,797,864 | 2,038,553 | 0.26 | 187,441 | 8,615 | 196,056 |
| 1161003 | 6,649,485 | 555,641 | 5,898,975 | 2,465,089 | 0.32 | 521,184 | 44,014 | 565,198 |
| 1161005 | 6,109,728 | 531,550 | 5,393,088 | 2,347,362 | 0.30 | 608,485 | 40,629 | 649,114 |
| 1161001a | 6,868,915 | 574,625 | 6,087,318 | 2,649,035 | 0.34 | 95,089 | 8,155 | 103,244 |
| 1003 | 6,944,903 | 586,001 | 6,148,269 | 2,672,103 | 0.34 | 99,715 | 9,878 | 109,593 |
| 1002 | 6,642,855 | 586,334 | 5,857,020 | 2,447,418 | 0.31 | 219,933 | 27,408 | 247,341 |
| 1161001b | 6,594,383 | 548,292 | 5,852,433 | 2,556,695 | 0.33 | 83,662 | 4,843 | 88,505 |
Unique polymorphic SNP sites: variant sites found only in this individual and not in any of the other seven genotypes.
Polymorphic variants (total): comprise replacements, multi-nucleotide variants (MNVs), indels, and SNPs.
Heterozygosity: calculated as the percentage of heterozygous SNP sites relative to the total genome size in bases.
Nuclear genome (780 Mb) used as reference for genetic variation and heterozygosity analysis.
Homozygous sites identified possibly due to errors in the reference genome.
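The heterozygosity column can be reproduced from the footnotes above: heterozygous SNP sites divided by the 780 Mb nuclear genome size, expressed as a percentage. The sketch below applies that formula to the heterozygous SNP counts from the table; the names `GENOME_SIZE_BP`, `HET_SNP_SITES`, and `heterozygosity_percent` are illustrative assumptions, not identifiers from the study.

```python
# Nuclear genome size stated in the table notes (780 Mb), in bases
GENOME_SIZE_BP = 780_000_000

# Heterozygous SNP sites per accession, copied from the table
HET_SNP_SITES = {
    "1005": 2_428_956,
    "1161004": 2_038_553,
    "1161003": 2_465_089,
    "1161005": 2_347_362,
    "1161001a": 2_649_035,
    "1003": 2_672_103,
    "1002": 2_447_418,
    "1161001b": 2_556_695,
}


def heterozygosity_percent(het_sites, genome_size_bp=GENOME_SIZE_BP):
    """Heterozygous SNP sites as a percentage of the genome size in bases."""
    return 100.0 * het_sites / genome_size_bp


for accession, sites in HET_SNP_SITES.items():
    print(f"{accession}: {heterozygosity_percent(sites):.2f}%")
```

Rounded to two decimals, these values match the heterozygosity column, ranging from 0.26% for accession 1161004 to 0.34% for accessions 1161001a and 1003.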
FIGURE 2 Chromosomal distribution of duplicated genes of M. jansenii, generated using SynVisio. The parallel horizontal lines represent the 14 pseudo‐molecules of the M. jansenii genome, with connected ribbons representing the duplicated genes