Literature DB >> 33283274

Recent advances in Cannabis sativa genomics research.

Bhavna Hurgobin^1,2, Muluneh Tamiru-Oli^1,2, Matthew T Welling^1,2, Monika S Doblin^1,2, Antony Bacic^1,2, James Whelan^1,2,3, Mathew G Lewsey^1,2.

Abstract

Cannabis (Cannabis sativa L.) is one of the oldest cultivated plants purported to have unique medicinal properties. However, scientific research of cannabis has been restricted by the Single Convention on Narcotic Drugs of 1961, an international treaty that prohibits the production and supply of narcotic drugs except under license. Legislation governing cannabis cultivation for research, medicinal and even recreational purposes has been relaxed recently in certain jurisdictions. As a result, there is now potential to accelerate cultivar development of this multi-use and potentially medically useful plant species by application of modern genomics technologies. Whilst genomics has been pivotal to our understanding of the basic biology and molecular mechanisms controlling key traits in several crop species, much work is needed for cannabis. In this review we provide a comprehensive summary of key cannabis genomics resources and their applications. We also discuss prospective applications of existing and emerging genomics technologies for accelerating the genetic improvement of cannabis.

Entities: Chemical Disease Gene Mutation Species

Keywords: breeding; cannabinoids; cannabis; crop improvement; genome assembly; genomics

Mesh：

Year: 2021 PMID： 33283274 PMCID： PMC7986631 DOI： 10.1111/nph.17140

Source DB: PubMed Journal: New Phytol ISSN： 0028-646X Impact factor: 10.151

Introduction

Cannabis sativa L. (cannabis), a member of the Cannabaceae family, is one of the world’s oldest domesticated crops (Bradshaw et al., 1981; Long et al., 2017). It is believed to have originated in Central Asia, from where its cultivation rapidly spread throughout Asia and Europe. Nowadays legal and illegal cannabis cultivation occurs globally (Van Bakel et al., 2011). The exact number of species comprising the Cannabis genus is controversial. Some claim the genus consists of three species that display distinct phenotypic differences; namely C. sativa L., C. indica Lam (Lamarck) and C. ruderalis (Sawler et al., 2015; Clarke & Merlin, 2016; Henry et al., 2020). The alternative, and perhaps most accepted, viewpoint is that Cannabis is a monotypic genus consisting of a single species, Cannabis sativa L. (referred to as cannabis hereafter) (Small & Cronquist, 1976). Cannabis has a diploid genome (2n = 20) consisting of nine autosomes and a pair of sex chromosomes (X and Y) (Braich et al., 2019; McKernan et al., 2020). It is predominantly dioecious, meaning a plant is either a male or a female, with estimated haploid genome sizes of 843 Mb and 818 Mb for male and female plants, respectively (Van Bakel et al., 2011). Despite the presence of defined sex chromosomes, environmental factors such as reduced photoperiod and low temperature, and foliar applications of chemicals such as silver nitrate and the ethylene hormone inhibitor silver thiosulfate induce pollen production in female flowers, leading to the production of ‘feminised seeds’ (Ram & Sett, 1982; Kaushal, 2012; Lubell & Brand, 2018). This technique has been exploited as a useful tool in cannabis breeding (for example, selfing or crossing female plants) and in generating populations for dissection of the genetic bases of important traits. Cannabis can be classified as fibre‐type (hemp or industrial hemp) and drug‐type (medicinal cannabis or marijuana) based on usage and cannabinoid content; fibre‐type plants contain < 0.3% ∆9‐tetrahydrocannabinol (THC) whereas drug‐type plants contain > 0.3% THC. Both have been exploited by humans for various applications since 8000 bce (Srinivasababu, 2014). For instance, the stalk of hemp is an important fibre source whilst oil extracted from its seeds is used in several food and nonfood applications (Clarke & Merlin, 2016). More recently, there have been applications of hemp in construction, geotextiles, cosmetics, as a food product and as a therapeutic agent (Piluzza et al., 2013). Historical medicinal and recreational usage of cannabis has been reported, particularly the use of marijuana for its mood‐altering narcotic properties. Consequently, marijuana is the most cultivated, trafficked and abused illicit drug in the world (Sawler et al., 2015). Its prolonged usage has been associated with detrimental health outcomes, such as impaired cognitive development and psychomotor performance, leading to chronic health conditions (Andre et al., 2016). Hence, there is an urgent need to conduct evidence‐based research to safeguard purity and quality of products, and to better understand the mode of action of cannabinoids for therapeutic applications. Cannabis plants grown for medicinal and recreational end‐uses are generally shorter, have thinner stems, more branches and a higher density of floral tissues than industrial hemp plants. Cultivars also can be discriminated by their cannabinoid profile, also termed chemotype (Piluzza et al., 2013; Clarke & Merlin, 2016). Cannabinoids are secondary metabolites produced in capitate stalked glandular trichomes (Fig. 1), more than 120 of which have been identified (Braich et al., 2019; Kovalchuk et al., 2020). Two of these, THC and CBD (cannabidiol), are highly sought after by cannabis breeders and pharmaceutical industries (Adams et al., 1940; Weiblen et al., 2015; Andre et al., 2016). Precursor synthesis of these cannabinoids occurs from two distinct metabolic pathways; the polyketide pathway and the methylerythritol phosphate (MEP) pathway (Fig. 1) (Kovalchuk et al., 2020). These produce alkylresorcinolic acids, including olivetolic acid (OA) that is specific to cannabis, and geranyl diphosphate (GPP), respectively. CBGA (cannabigerolic acid) is then synthesized from OA and GPP to produce the acidic precursors of THC (tetrahydrocannabinolic acid; THCA) and CBD (cannabidiolic acid; CBDA) (Weiblen et al., 2015). THC is the main psychoactive/intoxicant in cannabis. It induces sensations of euphoria, anxiety, paranoia and cognitive deficits, and is associated primarily with the narcotic status of cannabis (Boggs et al., 2018). However, THC also has therapeutic benefits as it confers relief from nausea caused by certain anti‐cancer treatments and acts as an anti‐inflammatory agent (Andre et al., 2016). CBD, which is an isomer of THC, has an analgesic effect, and also is purported to have neuroprotective, anti‐cancer and anti‐diabetic properties (Andre et al., 2016). Epidiolex, the first CBD‐based product approved by the US Food and Drug Administration, also has been shown to reduce seizures in children with Dravet syndromes (O’Connell et al., 2017; Chen et al., 2019). Industrial hemp and recreational drug chemotypes differ in their THC : CBD ratios. Cannabis plants are frequently classified into three main chemotypes based on this ratio; chemotype I plants (drug‐type) exhibit a THC : CBD ratio well beyond 1.0, chemotype II plants have an intermediate ratio of 0.5–2.0 and chemotype III plants (fibre‐type) have a ratio well below 1.0 (Aizpurua‐Olaizola et al., 2016). Additionally, the DW of THC in mature female inflorescences is used to demarcate cultivars for industrial hemp end‐uses such as seed or fibre production (Piluzza et al., 2013).

Fig. 1

Schematic diagram of cannabinoid biosynthesis including polyketide and isoprenoid precursor pathways. Precursor pathways are merged by a plastid‐localized aromatic prenyltransferase, with alkylresorcinolic acids and geranyl diphosphate intermediates forming cannabigeroids with a linear isoprenyl residue (Gülck & Møller, 2020). Cannabinoid synthesis concludes in the apoplastic storage cavity of glandular trichomes. Here, cannabigeroids are converted to tri‐ and di‐cyclic cannabinoids such as ∆9‐tetrahydrocannabinolic acid (THCA) and cannabidiolic acid (CBDA) via stereoselective oxidative cyclisation of the isoprenyl moiety. This occurs enzymatically by the cannabinoid synthases THCAS, CBDAS and CBCAS. The green arrow indicates location of the extracellular storage cavity of a Cannabis stalked glandular trichome; bar, 100 µm. Subcellular locations of cannabinoid and precursor pathway enzymes were predicted with the subcellular location software TargetP‐2.0 (http://www.cbs.dtu.dk/services/TargetP/). AAE, acyl‐activating enzyme; CBCA, cannabichromenic acid; CBCAS, cannabichromenic acid synthase; CBCOA, cannabiorcichromenic acid; CBCVA, cannabichromevarinic acid; CBDA, cannabidiolic acid; CBDAS, cannabidiolic acid synthase; CBDPA, cannabidiphorolic acid; CBDVA, cannabidivarinic acid; CMK, 4‐(cytidine 50‐diphospho)‐2‐C‐methyl‐d‐erythritol kinase; DXR, 1‐deoxy‐D‐xylulose‐5‐phosphate reductoisomerase; DXS, 1‐deoxy‐d‐xylulose 5 phosphate synthase; HDR, 1‐hydroxy‐2‐methyl‐2‐butenyl 4‐diphosphate reductase; HDS, 1‐hydroxy‐2‐methyl‐2‐butenyl 4‐diphosphate synthase; MCT, 2‐C‐methyl‐d‐erythritol 4‐phosphate cytidylyltransferase; MDS, 2‐C‐methyl‐d‐erythritol 2,4‐cyclodiphosphate synthase; MEP, 2‐C‐methyl‐d‐erythritol‐4‐phosphate; OAC, olivetolic acid cyclase; PT, prenyltransferase (e.g. geranylpyrophosphate:olivetolate geranyltransferase (GOT)); THCA, tetrahydrocannabinolic acid; THCAS, tetrahydrocannabinolic acid synthase; THCOA, tetrahydrocannabiorcolic acid; THCPA, tetrahydrocannabipgorolic acid; THCVA, tetrahydrocannabivarinic acid; TKS, tetraketide synthase. The prospects of using contemporary breeding technologies to improve cannabis traits for medicinal applications are promising. However, progress in this area is hampered by several issues. First, the genetics of cannabis is poorly understood and causes incorrect classification of cultivars/strains, with implications for researchers, growers, cannabis users and regulators (Vergara et al., 2016; Welling et al., 2016; Schwabe & McGlaughlin, 2019). Secondly, intensive clandestine breeding practices since the early 1970s have led to a genetic bottleneck and reduction in allelic diversity in marijuana plants (Clarke & Merlin, 2016). Thirdly, restrictions such as the international narcotics conventions and associated legislation have hampered the exchange of cannabis genetic resources and research materials (Welling et al., 2016). Genomic analyses have facilitated a paradigm shift in the improvement of cultivars of major crop species over the last two decades (Morrell et al., 2012; Yuan et al., 2017). This has been driven by high‐throughput sequencing and single nucleotide polymorphism (SNP) marker‐based genotyping platforms, as well as the development of high‐quality reference genome and transcriptome assemblies. These technologies have increased our understanding of gene content, genomic variation and the genetic basis of complex agronomic traits in multiple plant species (Xie et al., 2015; Xu et al., 2017; Appels et al., 2018; Thomas et al., 2019). The status of cannabis as an emerging, high‐value and clinically efficacious crop and the potential of genomics‐assisted breeding, coupled with the relaxation of regulations and restrictions, warrant expanded research efforts. In this review we summarize available cannabis genomics resources and report on the application of these tools. We also discuss future applications and emerging genomics technologies relevant to the genetic improvement of cannabis.

Cannabis sativa genomics resources and the discoveries they have enabled

Genome assemblies

De novo assembly of plant genomes remains challenging and the cannabis genome is no exception (Schatz et al., 2012). The cannabis genome is highly heterozygous (estimated at 12.5–40.5%) and contains large amounts of repetitive elements (estimated at 70%) (Van Bakel et al., 2011; Pisupati et al., 2018; Gao et al., 2020; Kovalchuk et al., 2020). Several attempts have been made to assemble this complex genome, as illustrated by publicly available genome and transcriptome assemblies of varying sizes and completeness for 12 different cannabis cultivars (Table 1a,b). Initial efforts relied upon the use of short‐read sequencing technology, but these proved computationally challenging. Application of third‐generation long‐read sequencing technologies such as Single‐Molecule Real‐Time (SMRT) sequencing (PacBio) and MinION (Oxford Nanopore Technologies) have greatly improved the contiguity of cannabis reference sequences, as has the anchoring of scaffolds using genetic linkage maps coupled with Hi‐C data (Grassa et al., 2018; Laverty et al., 2019; Gao et al., 2020; McKernan et al., 2020). This has resulted in the creation of four chromosome‐level assemblies for Purple Kush (PK; drug‐type), Finola (FN; hemp), JL (wild accession) and CBDRx (cs10; high‐CBD) (Grassa et al., 2018; Laverty et al., 2019; Gao et al., 2020). The cs10 assembly is the most complete and contiguous chromosome‐level assembly, comprising 25 302 protein‐coding genes (Fig. 2; Maoz, 2020). The current version of this assembly (v.2.0; GenBank acc. no. GCA_900626175.2) has recently been updated with the chromosomes renumbered according to an agreed community standard (Table 2; https://www.ncbi.nlm.nih.gov/assembly/GCF_900626175.2#/st; Maoz, 2020). Earlier this year, the International Cannabis Genomics Research Consortium (ICGRC) proposed that cs10 be used as the reference for cannabis genomics (Maoz, 2020). Cannabis genome assemblies other than cs10 also have much to offer. For instance, the PK, FN and the contig‐level Jamaican Lion trio assemblies have been key to confirming findings from earlier, lower‐resolution studies as well as uncovering important biology of the Cannabis genus (Van Bakel et al., 2011; Laverty et al., 2019; Prentout et al., 2019; Vergara et al., 2019; Booth et al., 2020; McKernan et al., 2020). Further improvement of these assemblies will assist in fully realising their potential.

Table 1

Statistics for the latest Cannabis sativa reference genome and transcriptome assemblies.

Genome assemblies
Cultivar (GenBank acc. no.)	Sex	Total sequence length (bp)	Total ungapped length (bp)	Number of scaffolds	Scaffold N50 (bp)	Number of contigs	Contig N50 (bp)	GC content (%)	Number of chromosomes and plasmids	Genome coverage	Sequencing technology	Number of protein‐coding genes/transcripts	Reference
cs10/CBDRx ^a (GCA_900626175.2)	Female	876 147 649	736 579 359	221	91 913 889	1052	1959 202	34	10	100×	Oxford Nanopore Technology	25 302 protein‐coding genes	Grassa et al. (2018)
Purple Kush ^b (GCA_000230575.5)	Female	891 964 663	891 346 362	6653	60 968 100	12 836	133 904	34	10	79×	PacBio	30 074 transcripts	Laverty et al. (2019)
Finola (GCA_003417725.2)	Male	1009 674 739	1009 380 638	2362	77 135 887	5303	370 471	34	10	97.64×	PacBio	37 689 transcripts	Lavertyet al. (2019)
Pineapple Banana Bubba Kush (GCA_002090435.1)	Male	512 174 223	512 174 223	18 355	51 819	18 355	51 819	34	–	72×	PacBio	–	–
LA Confidential (GCA_001510005.1)	Female	595 358 288	595 357 797	311 039	2649	311 039	2649	35	–	50×	454 sequencing	–	–
Chemdog 91 (GCA_001509995.1)	Female	285 932 793	285 527 436	175 088	2250	190 122	2189	33	–	50×	Illumina GAII	–	–
Cannatonic (GCA_001865755.1)	Female	585 823 666	585 823 666	11 110	128 718	11 110	128 718	34	–	10×	PacBio	–	–
Jamaican Lion (female parent)^c (GCA_012923425.1)	Female	876 735 611	876 735 611	–	–	1599	3283 100	34	–	125×	PacBio Sequel	27 664 genes	McKernan et al. (2020)
Jamaican Lion (male parent)^c (GCA_013030025.1)	Male	1009 156 132	1009 156 132	–	–	1264	1668 042	34	–	125×	PacBio Sequel	31 591 genes	McKernan et al. (2020)
Jamaican Lion (F1) ^c (JAATIR000000000.1)	Female	999 122 115	999 122 115	–	–	658	1668 042	34	–	–	PacBio Sequel	–	McKernan et al. (2020)
JL ^d (GCA_013030365.1)	Female	812 525 420	811 830 406	483	82 998 198	2978	509 999	34	10	153×	PacBio Sequel	38 382 protein‐coding genes	Gao et al. (2020)

Grassa et al. (2018) refer to an earlier version of this assembly in their paper (GenBank acc. no. GCA_900626175.1).

Laverty et al. (2019) refer to an earlier version of this assembly in their paper (GenBank acc. no. GCA_000230575.4).

McKernan et al. (2020) refer to earlier versions of these assemblies in their paper (Jamaican Lion, female parent = CoGe ID 55184; Jamaican Lion, male parent = CoGe ID 55360 and Jamaican Lion, F1 = CoGe ID 55567).

Genome annotation not included with this assembly.

The Cannbio assembly was generated using RNA‐Seq data from the female and male cannabis cultivars, Cannbio‐2 and Cannbio‐male, respectively.

Fig. 2

Benchmarking Universal Single‐Copy Ortholog (BUSCO) assessment of the cannabis genome assemblies shown in Table 1(a). The percentages of complete (single‐copy and duplicated), fragmented and missing universal single‐copy orthologue genes were identified using busco v.4.02 (Simão et al., 2015). The Jamaican Lion assemblies (female parent, male parent, F1) have more complete BUSCOs on average, but they also harbour a larger number of duplicated BUSCOs, which reflects the fragmented nature of these assemblies.

Table 2

Chromosomal nomenclatures of Cannabis sativa genome assemblies highlighting the discrepancies in chromosome numbering among the current assemblies (https://www.ncbi.nlm.nih.gov/assembly/GCF_900626175.2#/st; Maoz, 2020).

cs10 v.2.0 (GenBank acc. no. GCA_900626175.2)		cs10 v.1.0 (GenBank acc. no. GCA_900626175.1)		Finola v.2.0 (latest; GenBank acc. no. GCA_003417725.2)		Purple Kush v.5.0 (latest; GenBank acc. no. GCA_000230575.5)
Chromosome number	GenBank sequence	Chromosome number	GenBank sequence	Chromosome number	GenBank sequence	Chromosome number	GenBank sequence
1	NC_044371.1	2	NC_044371.1	5	CM011609.1	5	CM010796.2
2	NC_044375.1	6	NC_044375.1	3	CM011607.1	3	CM010793.2
3	NC_044372.1	3	NC_044372.1	4	CM011608.1	4	CM010794.2
4	NC_044373.1	4	NC_044373.1	7	CM011611.1	7	CM010799.2
5	NC_044374.1	5	NC_044374.1	1	CM011605.1	1	CM010790.2
6	NC_044377.1	8	NC_044377.1	2	CM011606.1	2	CM010792.2
7	NC_044378.1	9	NC_044378.1	6	CM011610.1	6	CM010797.2
8	NC_044379.1	10	NC_044379.1	9	CM011613.1	9	CM010798.2
9	NC_044376.1	7	NC_044376.1	8	CM011612.1	8	CM010795.2
10	NC_044370.1	1	NC_044370.1	10	CM011614.1	10	CM010791.2

Chromosome 10 (chromosome 1 in cs10 v1.0) corresponds to the sex chromosome in cs10 v.2.0, Purple Kush v.5.0 and Finola v.2.0.

Statistics for the latest Cannabis sativa reference genome and transcriptome assemblies. Grassa et al. (2018) refer to an earlier version of this assembly in their paper (GenBank acc. no. GCA_900626175.1). Laverty et al. (2019) refer to an earlier version of this assembly in their paper (GenBank acc. no. GCA_000230575.4). McKernan et al. (2020) refer to earlier versions of these assemblies in their paper (Jamaican Lion, female parent = CoGe ID 55184; Jamaican Lion, male parent = CoGe ID 55360 and Jamaican Lion, F1 = CoGe ID 55567). Genome annotation not included with this assembly. The Cannbio assembly was generated using RNA‐Seq data from the female and male cannabis cultivars, Cannbio‐2 and Cannbio‐male, respectively. Benchmarking Universal Single‐Copy Ortholog (BUSCO) assessment of the cannabis genome assemblies shown in Table 1(a). The percentages of complete (single‐copy and duplicated), fragmented and missing universal single‐copy orthologue genes were identified using busco v.4.02 (Simão et al., 2015). The Jamaican Lion assemblies (female parent, male parent, F1) have more complete BUSCOs on average, but they also harbour a larger number of duplicated BUSCOs, which reflects the fragmented nature of these assemblies. Chromosomal nomenclatures of Cannabis sativa genome assemblies highlighting the discrepancies in chromosome numbering among the current assemblies (https://www.ncbi.nlm.nih.gov/assembly/GCF_900626175.2#/st; Maoz, 2020). Chromosome 10 (chromosome 1 in cs10 v1.0) corresponds to the sex chromosome in cs10 v.2.0, Purple Kush v.5.0 and Finola v.2.0. Genome assemblies have made enormous contributions to our understanding of the cannabinoid biosynthetic pathways through the underlying synthase genes. In particular, the assemblies have shed light on the inheritance of these genes (Grassa et al., 2018; Laverty et al., 2019; McKernan et al., 2020). Before the availability of genome sequence data, de Meijer et al. proposed a Mendelian inheritance model of chemotype that involved a single locus, B, with two co‐dominant alleles, B and B, encoding for THCAS and CBDAS, respectively (De Meijer et al., 2003). Homozygosity at the B locus led to the production of either THC (drug‐type; B) or CBD (fibre‐type; B/B), whilst heterozygous individuals (B/B) had a mixed THC‐CBD chemotype. However, the creation of high‐quality genome assemblies supports an alternative, multilocus model whereby THCAS and CBDAS are two different genes located in close proximity in a retrotransposon‐rich region of the cannabis genome (Grassa et al., 2018; Laverty et al., 2019; McKernan et al., 2020). This finding supports other earlier nongenomics studies (De Meijer et al., 2003; Kojoma et al., 2006; Weiblen et al., 2015). The identification of the less‐studied CBCA synthase (cannabichromenic acid synthase; CBCAS) gene (also known as inactive THCAS) is another notable discovery originating from genomics datasets (Grassa et al., 2018; Laverty et al., 2019; McKernan et al., 2020). This gene catalyses the synthesis of cannabichromene (CBC), which is an emerging target for medicinal cannabis breeding as it is nonintoxicating, can reduce pain sensations and act as an anti‐inflammatory agent (Laverty et al., 2019). The functional, chemotype‐determining forms of CBCAS, THCAS and CBDAS are highly similar at the nucleotide and amino acid levels (Fig. 3a,b). The fact that long‐read sequencing enabled the assembly of such highly similar loci reflects the importance of this technology in resolving complex, repetitive regions in the cannabis genome (Schatz et al., 2012; Michael & VanBuren, 2020). Our survey of the contig‐level PacBio genome assemblies of Pineapple Banana Bubba Kush (PBBK) and Cannatonic also supports these findings. We identified a larger number of THCAS and CBCAS gene loci in these assemblies compared to Chemdog91 and LA Confidential, which were generated using short‐read sequencing (Fig. 4; Supporting information Tables S1, S2). It is likely that underestimation of THCAS, CBDAS and CBCAS loci has occurred in these collapsed and relatively smaller assemblies (<595 Mb). However, the presence of true biological variation among these cultivars cannot be ruled out. For instance, structural variants (SVs) in the form of copy number variations (CNVs) and presence/absence variations (PAVs) are known to affect the gene content of many plant species (Saxena et al., 2014).

Fig. 3

Fig. 4

Maximum‐likelihood phylogenetic tree depicting the relationship among the Cannabis sativa cannabinoid synthase genes tetrahydrocannabinolic acid synthase (THCAS), cannabidiolic acid synthase (CBDAS) and cannabichromenic acid synthase (CBCAS). The published nucleotide sequences of the active/functional forms of THCAS (GenBank acc. no. AB057805.1), CBDAS (GenBank acc. no. AB292682.1), CBCAS (GenBank acc. no. LY658671.1) and the paralogues of these genes as annotated in the cs10 v.2.0 and Jamaican Lion (female parent and male parent) assemblies (Supporting Information Tables S1, S2) were aligned against the latest C. sativa reference genome assemblies (Table 1) using Blast+/2.2.29 (Altschul et al., 1990). Best hits corresponding to a percentage identity > 98.5%, query coverage > 75% and alignment length = query length ± 100 bp were retained (Tables S1, S2). The nucleotide sequences of these best hits were extracted from each assembly (where applicable) using bedtools v.2.26.0 (Quinlan & Hall, 2010). transdecoder v.3.0 was used to predict the longest open reading frame from the extracted regions (https://transdecoder.github.io/). The predicted proteins along with amino acid sequences (complete CDS) of AB057805.1 (gene ID in blue), AB292682.1 (gene ID in red), LY658671.1 (gene ID in green) and the other cannabinoid synthase gene copies annotated in the cs10 v.2.0 and Jamaican Lion (female parent and male parent) genome assemblies were used for multiple sequence alignment using clustal Omega (Sievers et al., 2011). The phylogenetic tree was reconstructed from these alignments using raxml v.8.12.12. with 500 bootstrap replicates under the JTT model of amino acid substitution and visualized using Interactive Tree Of Life (iTOL) (Letunic & Bork, 2007; Stamatakis et al., 2008). The tree was rooted with the Humulus lupulus THCAS homolog (GenBank acc. no. LA634839.1). Only bootstrap values of > 70% are shown. It is worth noting that all CBCAS genes cluster with some the THCAS genes reflecting the high sequence similarity between these two cannabinoid synthase genes (Fig. 3).

Sequence similarity between the Cannabis sativa cannabinoid synthase genes tetrahydrocannabinolic acid synthase (THCAS; GenBank acc. no. AB057805.1), cannabidiolic acid synthase (CBDAS: GenBank acc. no. AB292682.1) and cannabichromenic acid synthase (CBCAS; GenBank acc. no. LY658671.1). (a) Protein sequence alignments of THCAS, CBDAS and CBCAS were performed using Clustal Omega and protein domains were annotated using interproscan v.5.41‐78.0 (Sievers et al., 2011; Jones et al., 2014). The p‐cresol methylhydroxylase (PCMH)‐type flavin adenine dinucleotide (FAD)‐binding domain (residues 77–251, PrositeProfiles: PS51387, InterPro:IPR016166) and berberine and berberine‐like domain (residues 480–538 for THCAS and CBCAS, residues 479–537 for CBDAS, Pfam: PF08031, InterPro: IPR012951) are highlighted in red and black, respectively. The FAD‐binding domain (residues 81–214, Pfam: PF01565, IPR006094) is not shown. (b) THCAS is more similar to CBCAS than CBDAS at the nucleotide and amino acid levels. It is possible that the presence of CBCAS may lead to the production of THCA as a by‐product (McKernan et al., 2020). Maximum‐likelihood phylogenetic tree depicting the relationship among the Cannabis sativa cannabinoid synthase genes tetrahydrocannabinolic acid synthase (THCAS), cannabidiolic acid synthase (CBDAS) and cannabichromenic acid synthase (CBCAS). The published nucleotide sequences of the active/functional forms of THCAS (GenBank acc. no. AB057805.1), CBDAS (GenBank acc. no. AB292682.1), CBCAS (GenBank acc. no. LY658671.1) and the paralogues of these genes as annotated in the cs10 v.2.0 and Jamaican Lion (female parent and male parent) assemblies (Supporting Information Tables S1, S2) were aligned against the latest C. sativa reference genome assemblies (Table 1) using Blast+/2.2.29 (Altschul et al., 1990). Best hits corresponding to a percentage identity > 98.5%, query coverage > 75% and alignment length = query length ± 100 bp were retained (Tables S1, S2). The nucleotide sequences of these best hits were extracted from each assembly (where applicable) using bedtools v.2.26.0 (Quinlan & Hall, 2010). transdecoder v.3.0 was used to predict the longest open reading frame from the extracted regions (https://transdecoder.github.io/). The predicted proteins along with amino acid sequences (complete CDS) of AB057805.1 (gene ID in blue), AB292682.1 (gene ID in red), LY658671.1 (gene ID in green) and the other cannabinoid synthase gene copies annotated in the cs10 v.2.0 and Jamaican Lion (female parent and male parent) genome assemblies were used for multiple sequence alignment using clustal Omega (Sievers et al., 2011). The phylogenetic tree was reconstructed from these alignments using raxml v.8.12.12. with 500 bootstrap replicates under the JTT model of amino acid substitution and visualized using Interactive Tree Of Life (iTOL) (Letunic & Bork, 2007; Stamatakis et al., 2008). The tree was rooted with the Humulus lupulus THCAS homolog (GenBank acc. no. LA634839.1). Only bootstrap values of > 70% are shown. It is worth noting that all CBCAS genes cluster with some the THCAS genes reflecting the high sequence similarity between these two cannabinoid synthase genes (Fig. 3). The identification of sex chromosomes in the cannabis genome is another notable genomics‐driven achievement (Prentout et al., 2019; McKernan et al., 2020). Of the 565 sex‐linked genes identified in the PK transcriptome, 363 were mapped to cs10 v.1.0 chromosome 1 (cs10 v.2.0 chromosome 10), indicating that this chromosome pair constitutes the sex chromosomes (Prentout et al., 2019). This enabled the identification of sex‐specific molecular markers to aid cannabis breeding. THCA and CBDA are produced at much higher concentrations in the inflorescences of female cannabis plants compared with males, and hence female plants are economically more valuable (Braich et al., 2019; Prentout et al., 2019; McKernan et al., 2020). Having the capacity to identify male and female plants at an early stage enables yield improvement and better management of cannabis crops. Approximately 3500 sex‐biased genes have been identified, which are differentially expressed between female and male cannabis plants, with a subset being expressed in the flower buds (Prentout et al., 2019). These genes are not restricted to the sex chromosomes: some are located on the Y‐chromosome of male plants and are involved in trichome development, sex determination, hermaphroditism and photoperiod‐independence (McKernan et al., 2020). Consistent chromosome nomenclature is important when working with multiple genome assemblies of the same species. However, discrepancies exist between the cs10, PK and FN assemblies. Chromosomal mappings performed by NRGene determined that cs10 v.2.0 chromosome 7 (cs10 v.1.0 chromosome 9) corresponds to chromosome 6 of PK and FN (Table 2; Maoz, 2020). Whilst our synteny analyses broadly agree with the findings from NRGene, they also show the lack of synteny between these genomes in some regions as illustrated by breaks in the chromosomal alignments (Fig. 5). These could have occurred due to SVs which would indicate true biological variation between these unrelated chemotypes. Another possibility relates to the more fragmented and less contiguous nature of the PK and FN assemblies relative to the cs10 v.2.0 assembly. Such lack of assembly contiguity can cause an underestimation of the syntenic relationship, causing genomic regions to erroneously appear as absent in one assembly relative to another (Liu et al., 2018).

Fig. 5

Dotplots showing the syntenic relationship between genomes of three Cannabis sativa cultivars. Pairwise genome alignments for (a) PK v.5.0 (GenBank acc. no. GCA_000230575.5) and FN v.2.0 (GenBank acc. no. GCA_003417725.2), (b) cs10 v.2.0 (GenBank acc. no. GCA_900626175.2) and PK v.5.0 and (c) cs10 v.2.0 and FN v.2.0 were performed using Minimap2 and the alignments were visualized using d‐genies (Cabanettes & Klopp, 2018; Li, 2018) (Supporting Information Table S3). Breaks in the alignment could be due to the presence of structural variants or the less contiguous nature of the PK and FN assemblies. The difference in chromosome orientation between the assemblies also can be seen. Only chromosome‐level alignments are shown. PK, Purple Kush; FN, Finola. Our synteny analyses also revealed the presence of multiple CBDAS, CBDAS‐like and inactive THCAS (CBCAS) genes on chromosome 6 and unplaced scaffolds of the PK and FN assemblies (Tables S1, S3). We found that the scaffolds harbouring the CBDAS gene copies could be best aligned (start to end) against the CBDAS gene cluster region (29.63–30.93 Mbp) of cs10 v.2.0 chromosome 7 (Table S3). Likewise, the scaffolds containing the CBCAS genes could be best aligned against the inactive THCAS gene cluster region (25.82–26.05 Mbp) of cs10 v.2.0 chromosome 7. These findings suggest that the unplaced scaffolds belong to chromosome 6 of PK and FN and that the regions surrounding the candidate genes on these scaffolds share synteny with cs10 v.2.0 chromosome 7. Although these results highlight the syntenic relationship that exists between these various cannabis chemotypes, they also illustrate the difficulty in inferring such a relationship when presented with fragmented assemblies. The continuous improvement of current cannabis assemblies will therefore be required for more accurate comparative genomics analyses within and between species.

Gene expression

Large‐scale gene expression studies on cannabis have been instrumental in elucidating cannabinoid metabolism, but application of these approaches to other important traits is limited (Guerriero et al., 2017; Spitzer‐Rimon et al., 2019; McKernan et al., 2020). Many enzymes required for the conversion of primary metabolic precursors through to the synthesis of THCA and CBDA were identified by early comparisons between expressed sequence tags from trichome and leaf tissue (Marks et al., 2009; Van Bakel et al., 2011; Stout et al., 2012; Braich et al., 2019; Livingston et al., 2019; Zager et al., 2019). RNA‐Seq analysis identified 15‐fold increases in cannabinoid pathway genes in flowers of PK compared with FN, although this preliminary investigation lacked a sufficient number of biological replicates for robust statistical analyses (Van Bakel et al., 2011). Tissue and organ comparisons were recently improved, with female and male plants compared as well as various trichome morphotypes (Braich et al., 2019; Livingston et al., 2019). Gene co‐expression network analysis identified functionally relevant cannabis terpene synthases (CsTPS) involved in mono‐ and sesquiterpene accumulation (Zager et al., 2019). The differential expression of CsTPS genes among cultivars also has been linked to variation in terpene profiles of these cultivars (Booth et al., 2020). However, the genes underlying the synthesis of many minor cannabinoids and associated expression patterns are less well‐defined (Pollastro et al., 2011; Citti et al., 2019; Welling et al., 2019; Basas‐Jaumandreu & de las Heras, 2020).

Substantial CNV exists among cannabinoid synthase loci

The CNV of cannabinoid synthases has been reported in the cannabis genome and may have resulted from either natural or artificial selection (Grassa et al., 2018; Vergara et al., 2019; McKernan et al., 2020). Overall, it appears that CBDAS exists in significantly larger copy numbers compared to THCAS and CBCAS (Fig. 4; Tables S1, S2, S4). We also identified a larger number of CBDAS gene clusters relative to the other cannabinoid synthases, suggesting the presence of higher sequence variation among CBDAS gene copies. A previous report also suggested a more recent evolution of THCAS and CBCAS genes, which originated from the ancestral gene, CBDAS, as a result of gene duplication (Onofri et al., 2015). Additionally, we found the functional copies of CBDAS to be highly similar (>99% nucleotide identity) among accessions (Table S1). This also was the case for the functional THCAS and CBCAS loci, suggesting that intensive breeding practices have been performed to select these desirable cannabinoid synthase loci, which have become less polymorphic as a result. Beyond their similarity at the nucleotide and amino acid sequence levels, studies on the evolution of these genes would shed more light on their origin and functional divergence.

The relationship between cannabinoid synthase CNVs and cannabinoid content

The association between cannabinoid synthases, CNVs and overall cannabinoid yield remains unclear and may, in fact, not be significant (Grassa et al., 2018). Of the five quantitative trait loci (QTL) for total cannabinoid content recently identified, none belong to the cannabinoid synthase gene cluster on cs10 v.2.0 chromosome 7 (cs10 v.1.0 chromosome 9) (Grassa et al., 2018), suggesting that other non‐cannabinoid synthase‐related loci contribute to cannabinoid yield. Regardless, the functional forms of THCAS and CBDAS rather than their other copies are essential for the synthesis of THCA and CBDA, respectively. This is supported by our survey of the genome assemblies of the high‐THC producing cultivars PBBK and PK. Whilst both assemblies contained similar numbers of THCAS and CBDAS copies, they also contain a functional THCAS copy that bears > 99% nucleotide identity with the THCA‐producing gene locus identified by Sirikantaramas and colleagues (GenBank acc. no. AB057805.1; Fig. 4; Table S1) (Sirikantaramas et al., 2004). It is likely that nonfunctional CBDAS loci, inferred by the presence of premature stop codons and frameshift mutations, also are present in these assemblies (Weiblen et al., 2015). This is probable because CBDAS is a stronger competitor for CBGA than THCAS (Weiblen et al., 2015). However, when the nonfunctional form of CBDAS and the functional form of THCAS are both present, THCA is produced instead of CBDA (Weiblen et al., 2015). Indeed, we identified two loci in PBBK and PK that share > 96% nucleotide identity with the published nonfunctional CBDAS homologues from Skunk #1 (Table S5) (Weiblen et al., 2015). We also identified these loci in Cannatonic and JL, and found that these cultivars also contain one locus which is highly similar (>99% nucleotide identity) to AB057805.1 (Table S1). It is important to validate hypotheses generated in silico to explain chemotypic properties of cannabis cultivars by conducting either in vitro or in vivo studies. This can be achieved using genetic approaches such as co‐segregation of known chemotypes and enzyme genotypes (Weiblen et al., 2015). For example, the association between functional forms of either THCAS or CBDAS and chemotype was demonstrated by analysis of an F2 population derived from crossing hemp and marijuana chemotype plants. The presence of functional THCAS or CBDAS in F2 individuals was determined by genotype analysis, then related to the plants' cannabinoid contents (Weiblen et al., 2015).

Short read sequencing creates challenges during CNV and expression analyses of the highly similar cannabinoid synthase loci

The predominant use of short read methods to analyze cannabis transcriptomes may create misleading artefacts resulting from the very high sequence similarity of cannabinoid synthase gene loci. Short reads that originate from paralogous and pseudogenic regions of a genome frequently cannot be distinguished (Ju et al., 2017; Dougherty et al., 2018). These challenges are illustrated in our analysis of trichome transcriptome data from nine medicinal cannabis cultivars generated by Zager et al. (2019) (Fig. 6a–c). Using cs10 v.1.0 as the reference, we identified two highly similar inactive THCAS loci (>99% nucleotide identity), LOC115697880 and LOC115697886, which were expressed much more highly than the remaining THCAS‐like and inactive THCAS‐like copies across all cultivars (Table S6). These loci are more similar to CBCAS (>99% nucleotide identity) than to the functional THCAS gene, AB057805.1 (>94% nucleotide identity) (Fig. 4; Table S1). It was reported that THCA (>13 % DW on average) was present in these cultivars (Zager et al., 2019). Therefore, we suspect that although the high‐THCA cultivars each harbour at least one functional THCAS gene copy, the reads that originated from this locus were forced to map to LOC115697880 and LOC115697886 owing to the higher similarity between the reads and these two loci, and the fact that cs10 lacks a functional THCAS gene. Whilst our findings agree with our observations above that CNV does not impact the synthesis and accumulation of THCA, they reflect the inadequacy of short reads to differentiate between highly similar loci and highlight the challenges of using an unrelated reference assembly for expression analyses.

Fig. 6

Cannabinoid synthase gene expression in relation to cannabinoid content and composition in nine high cannabinoid yielding cannabis cultivars (data taken from Zager et al., 2019). (a) Tetrahydrocannabinolic acid : cannabidiolic acid (THCA : CBDA) ratio and (b) cannabinoid contents of the cultivars. The lower panel in (b) shows a zoomed‐in view of cannabinoid content (% DW) in the range 0–0.5%. (c) Trichome‐specific expression patterns of 13 cannabinoid synthase genes from cs10 v.1.0 genome assembly (GenBank acc. no. GCA_900626175.1) in these cultivars. Positions on chromosomes represent one or more cannabinoid synthase locus. The reference CBDAS (LOC115697762) and inactive THCAS (LOC115697880) loci are underlined. LOC115697762 bears 100% nucleotide identity with the functional CBDAS identified by Taura et al. (2007) (GenBank acc. no. AB292682.1), whereas LOC115697880 is 99% identical to CBCAS (GenBank acc. no. LY658671.1) at the nucleotide level (Taura et al., 2007). Of the 13 loci, two (LOC115698060 and LOC115697886) are pseudogenic inactive THCAS copies containing in‐frame stop codons, whereas the remaining 11 genes produce full‐length CDS. Trichome enriched RNA‐seq reads previously reported by Zager et al. (2019) were accessed from the NCBI Sequence Read Archive (SRA project no. PRJNA498707; Zager et al., 2019). The reads were mapped to the Cannabis sativa cs10 v.1.0 genome assembly (using Hisat2 v.2.1.0 and sorted by genomic location using samtools v.1.9 Li et al., 2009; Kim et al., 2019). stringtie v.1.3.5 was used to assemble RNA‐Seq alignments into potential transcripts and to calculate gene abundances (TPM) (Supporting Information Table S6; Pertea et al., 2015). Chromosome numbers have been changed to community standard nomenclature in accordance with cs10 v.2.0. (GenBank acc. no. GCA_900626174.2.). Cannabis sativa var. cs10 is associated with a high CBD chemotype. BB, Black Berry Kush; BL, Black Lime; CC, Cherry Chem; CT, Canna Tsu; MT, Mama Thai; SD, Sour Diesel; TP, Terple; TPM, Transcripts per million; VF, Valley Fire; WC, White Cookies. Error bars represent ± 1 SD of the mean metabolite content of each cultivar (n = 3). Likewise, the higher expression of the CBDAS gene, LOC115697762, and higher CBDA content in Canna Tsu (7.76 ± 0.3% DW) relative to the other high‐THCA cultivars suggests that only one copy of the CBDAS gene is predominantly responsible for CBDA synthesis (Table S6). LOC115697762 bears 100% nucleotide identity with the functional CBDAS identified by Taura et al. (GenBank acc. no. AB292682.1; Fig. 4; Table S1) (Taura et al., 2007). Further, the moderate expression of another highly similar CBDAS‐like gene copy (LOC115696884, 88% nucleotide identity to LOC115697762) in all cultivars could explain the low concentrations (<0.45% DW) of CBDA detected in the high‐THCA cultivars (Zager et al., 2019). Overall, our findings suggest that CNV does not affect cannabinoid content. Recent studies using long‐read approaches from PacBio (SMRT isoform sequencing, Iso‐Seq) and Oxford Nanopore support this (McKernan et al., 2020; Michael, 2020). Both concluded that only single copies of the functional THCAS and CBDAS genes were expressed in the cannabis genome and contribute to the production of THCA and CBDA, respectively (McKernan et al., 2020; Michael, 2020). This finding highlights the strength of long‐read sequencing at more accurately identifying and quantifying the expression of paralogous genes (Dougherty et al., 2018).

Terpene synthases

Cannabis is a prolific producer of terpenes and these compounds are thought to contribute to the therapeutic efficacy of cannabis preparations via the ‘entourage effect’, by acting in combination with cannabinoids (Ben‐Shabat et al., 1998; LaVigne et al., 2020). However, evidence for the existence of the entourage effect is largely anecdotal and lacks mechanistic explanation, for example whether the effect is additive or multiplicative. Despite their therapeutic potential and similar biosynthetic origin, genetic prediction of terpene composition is challenging. Genomic analysis of 55 CsTPS genes suggests a complex genetic background characterized by CsTPS nonhomologous gene clusters and tandem duplication events (Allen et al., 2019; Booth et al., 2020; McKernan et al., 2020). Further, the presence of extensive CNVs within the CsTPS‐a (sesquiterpene synthase) and CsTPS‐b (monoterpene synthase) gene subfamilies point to their highly diverse nature (Booth et al., 2020). It appears that some members of the CsTPS‐b gene subfamily can produce sesquiterpene in both cannabis and sandalwood. This indicates that similar selective pressures have occurred on the species‐specific monoterpene synthase ancestors to produce these loci (Gao et al., 2012). The diversity of the CsTPS genes complicates studies of gene regulation at various levels. Terpene composition can vary between cultivars, tissue types, trichome morphotypes and across development, whilst nonenzymatic modifications such as the oxidation of β‐myrcene cause variation independent of genomic and transcriptomic regulation (Marchini et al., 2014; Aizpurua‐Olaizola et al., 2016; Allen et al., 2019; Livingston et al., 2019; Booth et al., 2020). Consequently, future transcriptional studies would need to consider gene‐environment interactions, as well as organ, tissue and cell‐type specificity.

SNP studies

High‐throughput SNP studies have contributed substantially to our understanding of cannabis evolutionary history. Key findings include: the genome‐wide distinction between drug and hemp types that resulted from selective breeding; the association between chemotypic identity and variation of loci encoding cannabinoid synthases; and errors in cultivar classification and ancestry by breeders (Van Bakel et al., 2011; Sawler et al., 2015; Lynch et al., 2016; Soorni et al., 2017). These findings are likely to facilitate the development of more accurate diagnostic systems for cannabis germplasm to assist with product compliance, traceability, provenance and consumer education (Henry et al., 2020). Domestication and intensive breeding have narrowed the genetic and allelic diversity of cannabis gene pools (Sawler et al., 2015; Soorni et al., 2017). However, these can be expanded by a comprehensive, genomics‐based assessment of cannabis germplasm that defines the diversity available. This would in turn help with the identification of heterotic groups for use in crosses to achieve hybrid vigour and hence assist with the creation of elite cultivars (Huang et al., 2015).

Genomics approaches that could be applied in the near‐term to improve cannabis traits

Our knowledge of how cannabis traits relate to genotype is currently limited. Many next generation sequencing (NGS)‐based tools exist that can rapidly identify genetic variation underlying traits of interest. The available genetic and genomic resources provide an opportunity to apply NGS tools for trait discovery and molecular breeding in cannabis, as has been achieved in other crops (Varshney et al., 2014; Kang et al., 2016; Crossa et al., 2017).

QTL mapping and gene discovery

Low‐density molecular markers such as amplified fragment length polymorphisms (AFLPs) and simple sequence repeats (SSRs) have been employed in cannabis to identify QTLs associated with sex determination and cannabinoid composition (Weiblen et al., 2015; Faux et al., 2016). High‐density genetic maps recently developed for cannabis should improve the accuracy of QTL mapping (Grassa et al., 2018; Laverty et al., 2019). Several NGS‐based methods are available for high‐resolution mapping and interval detection in plants (Schneeberger et al., 2009; Abe et al., 2012; Takagi et al., 2013). A common feature of these methods is that they combine NGS with bulked‐segregant analysis to study mutated or natural populations. The methods have been deployed mainly in self‐pollinating species with homozygous genomes (Jaganathan et al., 2020). However, recent protocol improvements have allowed their application to heterozygous, outcrossing species, for example to identify loci involved in sex determination, flowering, trichome formation, anthocyanin accumulation and leaf shape in Dioscorea rotundata (Guinea yam), Brassica rapa and Vitis vinifera (grapevine) (Tamiru et al., 2017; Demmings et al., 2019; Itoh et al., 2019; Zhang et al., 2020). Similar approaches could be used in cannabis. Ethyl methanesulfonate (EMS) mutagenesis has been applied successfully to hemp and a protocol exists for cannabis cell culture (Bielecka et al., 2014; Hari, 2020). However, large‐scale generation and screening of mutant lines can be a logistical challenge in cannabis owing to its size and the requirement in many jurisdictions to grow it in secure and licensed facilities. Consequently, the exploitation of the cannabis natural diversity provides a better option for mining important genes in the short term (Vergara et al., 2016; Welling et al., 2016). Therefore, the available gene/QTL mapping tools should go beyond simple genetic variations such as SNPs and small insertions/deletions (Indels) and consider SVs including large Indels, rearrangements and CNVs, all of which have been shown to affect trait diversity in several crops including cannabis (Wang et al., 2016; Chakraborty et al., 2019; McKernan et al., 2020).

Genome‐wide association studies (GWAS)

There are only two published cannabis GWAS studies, one of which is ongoing and the other reporting marker‐trait association using a limited number of SNPs (B. J. Campbell et al., 2019; Henry et al., 2020; http://multihemp.eu/). The paucity of studies is likely a consequence of the previous limited accessibility of cannabis genetic diversity and insufficient marker density. GWAS relies on linkage disequilibrium (LD) for detecting common genetic variants associated with a trait in natural and experimental populations (Brachi et al., 2011). Consequently, it has been widely deployed in self‐crossing species that generally have more extensive LD. Nevertheless, GWAS has been used successfully for genotype–trait association in outcrossing and vegetatively propagated species including date palm, sweet potato and hop (Humulus lupulus), the species most closely related to cannabis (Henning et al., 2015; Hazzouri et al., 2019; Okada et al., 2019). Application of efficient and high‐throughput phenotyping systems to cannabis will help GWAS studies owing to the species’ high phenotypic plasticity; applicable commercial and open‐source solutions already exist (L. G. Campbell et al., 2019). A variant of GWAS termed mGWAS (metabolite‐based GWAS) that combines genotyping and metabolic profiling has proved powerful for dissecting the genetic bases of metabolic diversity in plants (Fang & Luo, 2019; Chen et al., 2020). A similar approach could be applied to the cannabinoid and terpene biosynthesis pathways of cannabis.

Genomic selection

Genomic selection utilizes genome‐wide marker information to predict the breeding value of genotypes, which is an estimate of the value of a genotype as a parent. It integrates genotypic and phenotypic data from a reference population and uses statistical models to determine the genomic‐estimated breeding values (GEBVs) of other individuals for which only genotype information is available. Elite lines with the highest GEBVs are then selected for use as parents in breeding programs. Genomic selection is considered particularly promising for genetic improvement of complex traits controlled by many genes with minor effects (Heslot et al., 2015; Spindel & McCouch, 2016). The approach has been successfully implemented in breeding programmes for outcrossing heterozygous species such as maize and cassava (Crossa et al., 2017; Elias et al., 2018). It might consequently make a notable contribution to the genetic improvement of complex cannabis traits once marker density, population size, statistical models and accuracy of phenotyping improve.

Emerging genomics technologies with high potential in cannabis research

Phased genome assemblies

The assembly of heterozygous plant genomes remains challenging despite the use of long‐read sequencing technology (Michael & VanBuren, 2020). Genome assembly of outcrossing species such as cannabis is particularly challenging because haplotypes consist of various repeating units, short Indels and SVs (Chin et al., 2016; VanBuren et al., 2018). As a result, the majority of near‐complete haploid cannabis assemblies are fully unphased, meaning they are a synthetic patchwork of collapsed segments of homologous chromosomes that do not fully capture genome composition (Chin et al., 2016; Grassa et al., 2018; Laverty et al., 2019; Gao et al., 2020). There have been recent efforts to produce partially phased genome assemblies for cannabis (two draft assemblies are now available for the maternal Jamaican Lion cultivar) and hop (var. Cascade) (Padgitt‐Cobb et al., 2019; Medicinal Genomics, 2020b). In addition, NRGene recently announced the creation of fully‐phased genome assemblies for two proprietary elite cultivars using cs10 as the basis for scaffold ordering (Weisshaus, 2020). These assemblies are likely to improve our understanding of haplotype structure and heterozygosity, which is essential for allele‐specific analyses of quantitative traits (Chin et al., 2016).

Cannabis pangenomics

Plant pangenomics studies conducted over the last five years have demonstrated the inadequacy of a single reference genome in representing the entire genetic diversity of a species (Golicz et al., 2016; Hurgobin et al., 2018). The creation of a cannabis pangenome promises to shed more light on the extent of gene content variation, as well as forming the basis for cannabis breeding programmes. For instance, the inclusion of diverse genotypes and wild cannabis populations (sometime referred to as C. ruderalis), would facilitate the identification of elite marker genes in the dispensable/variable genome of cannabis and drive the process of heterosis to create resilient and high‐yielding cultivars (Tao et al., 2019). Additionally, combining gene PAV and pangenome‐wide SNPs would lead to a more accurate identification of causal variants associated with traits of interest (Hurgobin & Edwards, 2017). Independent cannabis pangenome initiatives are being led by NRGene and Medicinal Genomics (NRGene, 2018; Medicinal Genomics, 2020a). Allelic variations and additional genes involved in cannabinoid biosynthesis were identified by comparing the recent fully‐phased assemblies from NRGene with existing chromosome‐level, unphased cannabis assemblies (Weisshaus, 2020). Conserved genomic regions, as well as variable regions harbouring SVs such as CNVs, PAVs and large rearrangements, which may be implicated in cannabis and hemp breeding, were also identified. The increased accessibility of long‐read sequencing will likely encourage the construction of graph‐based pangenomes, which are considered to be the future of plant pangenomics studies (Bayer et al., 2020). A graph‐based pangenome consists of a single, nonredundant reference that contains SVs from multiple individuals/accessions. It can be visualized as a sequence graph with branches representing accession‐specific sequences (Garrison et al., 2018). The first graph‐based pangenome of soybean was recently produced using 29 wild and cultivated accessions (Liu et al., 2020). Using this resource and the SVs that they had identified among the accessions, the authors performed a GWAS and identified a 10‐kbp PAV, which was associated with seed lustre variation. This type of analysis would be helpful for cannabis as it could be used to determine the SV landscape of this important crop and its association with traits of interest.

Single cell genomics

Transcriptomics has been revolutionized by single cell RNA‐Sequencing (scRNA‐Seq) because it enables researchers to investigate the complex interplay of molecular regulators and identify active cellular processes at true cellular resolution (Rich‐Griffin et al., 2020). Whilst only a handful of scRNA‐Seq studies have been conducted in plants to date, they demonstrate that this technology may have great potential in cannabis breeding (Rich‐Griffin et al., 2020). For instance, given the specificity of the cannabinoid biosynthetic pathway to the capitate stalked trichomes, a detailed view of gene expression in the cell‐types involved in this pathway would assist with the selection of marker genes. An understanding of cell‐types which respond most strongly to environmental cues such as biotic or abiotic stresses also would be valuable. Combining this information with single‐cell assay for transposable accessible chromatin (scATAC) sequencing data would enable the identification of regulatory regions of the genome that drive cell‐type specific expression (Rich‐Griffin et al., 2020). Integration of scRNA‐Seq with CRISPR/Cas9‐based genetic screens would enable more effective selection of cell types for targeted gene manipulation whilst reducing the impact of pleiotropy (Yuan et al., 2018; Shahan, 2019; Marand et al., 2020; Rich‐Griffin et al., 2020).

Concluding remarks and future perspectives

Cannabis biology remains poorly understood but the relaxation of legislation, together with the application of existing and emerging genomics technologies, is likely to fuel more scientific research on this fascinating plant. The rapid generation of sequencing data and the creation of additional, fully phased, chromosome‐level genome assemblies will enable a more comprehensive assessment of the genetic architecture of important traits and aid in marker discovery. Whilst genomics‐assisted breeding will have a pivotal role in increasing the efficiency and precision of cannabis crop improvement, the utility of integrating the other ‘omics technologies cannot be overlooked. In line with the overarching goal of the ICGRC, the integration of individual ‘omics approaches such as transcriptomics, phenomics, metabolomics and proteomics, among others, and the ongoing development of computational methods will be useful for understanding gene function, biological and metabolic pathways, and regulatory networks underlying traits of interest. Complementing these ‘omics approaches with traditional breeding practices will provide a multifaceted strategy for the improvement of this emerging crop species.

Author contributions

BH and MGL conceived the manuscript; BH and MW conducted genome sequence analyses; BH, MTO and MW wrote the manuscript; BH, MTO, MTW, MGL, MSD, AB and JW revised the manuscript. All authors read and approved the final version. Table S1 Summary of best Blast hits of the functional cannabinoid biosynthesis genes against the latest C. sativa reference genome assemblies. Table S2 Summary of best Blast hits of other THCAS‐ and CBDAS‐like gene copies against the latest C. sativa reference genome assemblies. Table S3 Alignment of Purple Kush v.5.0 (GenBank acc. no. GCA_000230575.5) and Finola v.2.0 (GenBank acc. no. GCA_003417725.2) unplaced scaffolds against the cs10 v.2.0 genome assembly (GenBank acc. no. GCA_900626175.2). Table S4 Cannabinoid synthase genes from cs10 v.2.0 (GenBank acc. no. GCA_900626175.2) and Jamaican Lion (female parent: GenBank acc. no. GCA_012923435.1; male parent: GenBank acc. no. GCA_013030025.1) assemblies. Table S5 Summary of best Blast hits of nonfunctional CBDAS homologs from Skunk #1 against the latest C. sativa reference genome assemblies. Table S6 Transcript per million (TPM) counts of the 13 cannabinoid synthase genes from the cs10 v.1.0 genome assembly (GenBank acc. no. GCA_900626175.1) in nine medicinal cannabis cultivars from Zager et al. (2019). Please note: Wiley Blackwell are not responsible for the content or functionality of any Supporting Information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office. Click here for additional data file.

97 in total

1. Terpene Synthases and Terpene Variation in Cannabis sativa.

Authors: Judith K Booth; Macaire M S Yuen; Sharon Jancsik; Lufiani L Madilao; Jonathan E Page; Jörg Bohlmann
Journal: Plant Physiol Date: 2020-06-26 Impact factor: 8.340

2. Pan-Genome of Wild and Cultivated Soybeans.

Authors: Yucheng Liu; Huilong Du; Pengcheng Li; Yanting Shen; Hua Peng; Shulin Liu; Guo-An Zhou; Haikuan Zhang; Zhi Liu; Miao Shi; Xuehui Huang; Yan Li; Min Zhang; Zheng Wang; Baoge Zhu; Bin Han; Chengzhi Liang; Zhixi Tian
Journal: Cell Date: 2020-06-17 Impact factor: 41.582

3. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation.

Authors: Ivica Letunic; Peer Bork
Journal: Bioinformatics Date: 2006-10-18 Impact factor: 6.937

4. A rapid bootstrap algorithm for the RAxML Web servers.

Authors: Alexandros Stamatakis; Paul Hoover; Jacques Rougemont
Journal: Syst Biol Date: 2008-10 Impact factor: 15.683

5. Cannabis glandular trichomes alter morphology and metabolite content during flower maturation.

Authors: Samuel J Livingston; Teagen D Quilichini; Judith K Booth; Darren C J Wong; Kim H Rensing; Jessica Laflamme-Yonkman; Simone D Castellarin; Joerg Bohlmann; Jonathan E Page; A Lacey Samuels
Journal: Plant J Date: 2019-10-12 Impact factor: 6.417

6. Efficient Approach to Correct Read Alignment for Pseudogene Abundance Estimates.

Authors: Chelsea J-T Ju; Zhuangtian Zhao; Wei Wang
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2016-07-14 Impact factor: 3.710

7. Induction of fertile male flowers in genetically female Cannabis sativa plants by silver nitrate and silver thiosulphate anionic complex.

Authors: H Y Mohan Ram; R Sett
Journal: Theor Appl Genet Date: 1982-12 Impact factor: 5.699

8. The pangenome of an agronomically important crop plant Brassica oleracea.

Authors: Agnieszka A Golicz; Philipp E Bayer; Guy C Barker; Patrick P Edger; HyeRan Kim; Paula A Martinez; Chon Kit Kenneth Chan; Anita Severn-Ellis; W Richard McCombie; Isobel A P Parkin; Andrew H Paterson; J Chris Pires; Andrew G Sharpe; Haibao Tang; Graham R Teakle; Christopher D Town; Jacqueline Batley; David Edwards
Journal: Nat Commun Date: 2016-11-11 Impact factor: 14.919

9. Assessment of Genetic Diversity and Population Structure in Iranian Cannabis Germplasm.

Authors: Aboozar Soorni; Reza Fatahi; David C Haak; Seyed Alireza Salami; Aureliano Bombarely
Journal: Sci Rep Date: 2017-11-15 Impact factor: 4.379

10. Quantitative Trait Locus Analysis of Leaf Morphology Indicates Conserved Shape Loci in Grapevine.

Authors: Elizabeth M Demmings; Brigette R Williams; Cheng-Ruei Lee; Paola Barba; Shanshan Yang; Chin-Feng Hwang; Bruce I Reisch; Daniel H Chitwood; Jason P Londo
Journal: Front Plant Sci Date: 2019-11-15 Impact factor: 5.753

16 in total

1. Developing Oligo Probes for Chromosomes Identification in Hemp (Cannabis sativa L.).

Authors: Dmitry V Romanov; Gennady I Karlov; Mikhail G Divashuk
Journal: Plants (Basel) Date: 2022-07-22

2. Deciphering the iPBS retrotransposons based genetic diversity of nanoparticles induced in Vitro seedlings of industrial hemp (Cannabis sativa L.).

Authors: Ozlem Akgur; Muhammad Aasim
Journal: Mol Biol Rep Date: 2022-06-18 Impact factor: 2.742

Review 3. The Use of Cannabis sativa L. for Pest Control: From the Ethnobotanical Knowledge to a Systematic Review of Experimental Studies.

Authors: Genís Ona; Manica Balant; José Carlos Bouso; Airy Gras; Joan Vallès; Daniel Vitales; Teresa Garnatje
Journal: Cannabis Cannabinoid Res Date: 2021-10-05

4. Combined use of specific length amplified fragment sequencing (SLAF-seq) and bulked segregant analysis (BSA) for rapid identification of genes influencing fiber content of hemp (Cannabis sativa L.).

Authors: Yue Zhao; Yufeng Sun; Kun Cao; Xiaoyan Zhang; Jing Bian; Chengwei Han; Ying Jiang; Lei Xu; Xiaonan Wang
Journal: BMC Plant Biol Date: 2022-05-21 Impact factor: 5.260

5. THC and CBD Fingerprinting of an Elite Cannabis Collection from Iran: Quantifying Diversity to Underpin Future Cannabis Breeding.

Authors: Mahboubeh Mostafaei Dehnavi; Ali Ebadi; Afshin Peirovi; Gail Taylor; Seyed Alireza Salami
Journal: Plants (Basel) Date: 2022-01-04

6. Industry-Based Misconceptions Regarding Cross-Pollination of Cannabis spp.

Authors: Kenneth J Olejar; Sang-Hyuck Park
Journal: Front Plant Sci Date: 2022-01-26 Impact factor: 5.753

7. Tobacco Rattle Virus as a Tool for Rapid Reverse-Genetics Screens and Analysis of Gene Function in Cannabis sativa L.

Authors: Hanan Alter; Reut Peer; Aviv Dombrovsky; Moshe Flaishman; Ben Spitzer-Rimon
Journal: Plants (Basel) Date: 2022-01-26

Review 8. Applications of cell- and tissue-specific 'omics to improve plant productivity.

Authors: Bhavna Hurgobin; Mathew G Lewsey
Journal: Emerg Top Life Sci Date: 2022-04-15

Review 9. Application of High-Throughput Sequencing on the Chinese Herbal Medicine for the Data-Mining of the Bioactive Compounds.

Authors: Xiaoyan Liu; Xun Gong; Yi Liu; Junlin Liu; Hantao Zhang; Sen Qiao; Gang Li; Min Tang
Journal: Front Plant Sci Date: 2022-07-14 Impact factor: 6.627

10. Origin and Evolution of the Cannabinoid Oxidocyclase Gene Family.

Authors: Robin van Velzen; M Eric Schranz
Journal: Genome Biol Evol Date: 2021-08-03 Impact factor: 3.416