Literature DB >> 32118265

Genome sequence of the agarwood tree Aquilaria sinensis (Lour.) Spreng: the first chromosome-level draft genome in the Thymelaeceae family.

Xupo Ding¹, Wenli Mei¹, Qiang Lin², Hao Wang¹, Jun Wang¹, Shiqing Peng³, Huiliang Li³, Jiahong Zhu³, Wei Li¹, Pei Wang¹, Huiqin Chen¹, Wenhua Dong¹, Dong Guo³, Caihong Cai¹, Shengzhuo Huang¹, Peng Cui², Haofu Dai¹.

Abstract

BACKGROUD: Aquilaria sinensis (Lour.) Spreng is one of the important plant resources involved in the production of agarwood in China. The agarwood resin collected from wounded Aquilaria trees has been used in Asia for aromatic or medicinal purposes from ancient times, although the mechanism underlying the formation of agarwood still remains poorly understood owing to a lack of accurate and high-quality genetic information.
FINDINGS: We report the genomic architecture of A. sinensis by using an integrated strategy combining Nanopore, Illumina, and Hi-C sequencing. The final genome was ∼726.5 Mb in size, which reached a high level of continuity and a contig N50 of 1.1 Mb. We combined Hi-C data with the genome assembly to generate chromosome-level scaffolds. Eight super-scaffolds corresponding to the 8 chromosomes were assembled to a final size of 716.6 Mb, with a scaffold N50 of 88.78 Mb using 1,862 contigs. BUSCO evaluation reveals that the genome completeness reached 95.27%. The repeat sequences accounted for 59.13%, and 29,203 protein-coding genes were annotated in the genome. According to phylogenetic analysis using single-copy orthologous genes, we found that A. sinensis is closely related to Gossypium hirsutum and Theobroma cacao from the Malvales order, and A. sinensis diverged from their common ancestor ∼53.18-84.37 million years ago.
CONCLUSIONS: Here, we present the first chromosome-level genome assembly and gene annotation of A. sinensis. This study should contribute to valuable genetic resources for further research on the agarwood formation mechanism, genome-assisted improvement, and conservation biology of Aquilaria species.

Entities: CellLine Chemical Disease Species

Keywords: zzm321990 Aquilaria sinensiszzm321990 ; Hi-C sequencing; agarwood; annotation; chromosome-level genome assembly

Year: 2020 PMID： 32118265 PMCID： PMC7050300 DOI： 10.1093/gigascience/giaa013

Source DB: PubMed Journal: Gigascience ISSN： 2047-217X Impact factor: 6.524

Background

Agarwood is the fragrant resin-filled heartwood from the trees of the Aquilaria or Gyrinops genus, high-quality preparations of which are more costly than gold in the international market [1, 2]. Agarwood has been used as precious incense in Buddhist, Islamic, and Hindu ceremonies, and also as a traditional medicine in Chinese therapies and Ayurveda [3]. Modern pharmacological and chemical studies have indicated that sesquiterpenoid and phenylethyl chromone derivatives are the principal compounds in agarwood, many of which have been studied for potential pharmacological activities including neuroprotective, sedative, acetylcholinesterase inhibitory, antioxidant, antibacterial, and anti-inflammatory activities [4-7]. However, healthy Aquilaria trees generate very little agarwood unless they have been stimulated by various forms of injury or microbial infestation. In the wild, agarwood formation is usually related to natural factors such as wounding by wind or lightning damage, or gnawing by insects and fungi. [8, 9]. As a result of agarwood's potential medicinal and economic importance, traditional methods used for producing agarwood in Asia include chopping, nailing, boring holes, burning the stem of Aquilaria trees, or pruning the partial trunk [10]. This has resulted in wild Aquilaria plants being excessively exploited, and many species are now decreasing or endangered [11]. Aquilaria sinensis has been harvested and cultured for producing agarwood, which has been used in traditional Chinese medicine in China as early as the seventh century [11]. The morphological characteristics and agarwood of A. sinensis are shown in Fig. 1. As the largest producer of agarwood in China, the population of A. sinensis has undergone a dramatic decline in the past decade and its wild populations are threatened [11, 12]. The availability of agarwood is limited by the exhaustion of its time-consuming preparation and its plant sources. Although the expression of genes related to terpene synthesis or stress responses during agarwood formation has been described via transcriptome sequencing [2, 13, 14], the molecular mechanism of agarwood formation has remained unclear because of a lack of accurate genome information and genetic resources. Recently it has been discovered that 2-(2-phenylethyl) chromone and its derivatives were the key markers for agarwood formation in A. sinensis and their hypothetical biosynthetic pathway has been elucidated [8]. With the decreasing population of A. sinensis plants in the wild and increasing demand in the agarwood market, it is important to interrogate the genomic background to explore the mechanism of agarwood formation and to accelerate genome-assisted improvement in breeding systems.

Figure 1:

Morphological characteristics of Aquilaria sinensis. (a) mature tree; (b) flower; (c) fruit; (d) seed; (e) cracked seed; (f) agarwood generation; (g) agarwood. The images b–e were captured using a stereoscopic fluorescence microscope (Olympus SZX16, Pittsburgh, PA) under the dark field. All the photos were taken by Dr. Jun Wang and processed by Dr. Xupo Ding. Herein, we sequenced and assembled the genome of Aquilaria sinensis (NCBI:txid210372) by means of a hybrid approach using Illumina short reads, Oxford Nanopore Technologies (ONT) long reads, and Hi-C data. We reveal the genomic features of Aquilaria sinensis, including repeat sequences, gene annotation, and evolution. This reference genome will provide the fundamental genetic information to elucidate the metabolic formation of agarwood and facilitate genetic research on Aquilaria trees.

Data Description

Genomic DNA extraction and genome size estimation

An individual plant of cultivar Aquilaria sinensis (Lour.) Spreng was collected from Chengxi district (110 19.245 E, 19 59.757 N), Haikou, China. After collection healthy, fresh leaves were snap-frozen in liquid nitrogen, followed by preservation at −80°C in the laboratory prior to DNA extraction. High molecular weight plant genomic DNA was extracted from these leaves using a modified CTAB method [15]. The quality and quantity of the isolated DNA were checked by electrophoresis on a 0.75% agarose gel and a NanoDrop D-1000 spectrophotometer (NanoDrop Technologies, Wilmington, DE), and the DNA was then accurately quantified using Qubit technology (Life Technologies, Carlsbad, CA). Subsequently, 150-bp paired-end (PE) libraries with insert lengths of 270 bp were constructed and 49.84 Gb raw data were generated on the Illumina Hiseq2500 platform (Illumina HiSeq 2500 System, RRID:SCR_016383) using standard protocols, which were used for estimating the genome size of A. sinensis with the following formula: genome size = [Num (total k-mer) − Num (erroneous k-mer)]/mean depth of k-mer[16, 17]. Finally, the genome size of A.sinensis was estimated as 773.3 Mb with the total number of 19-mer ∼3.71 × 1010 and the peak of 19-mer at the depth of 48 (Supplementary Fig. S1). The GC content of A. sinensis genome was 39.23%, which is considered a moderate GC level (Supplementary Fig. S2). Meanwhile, the heterozygosity of 0.6% and repeat content of 53.12% for theA. sinensis genome were also estimated [18].

Genomic sequencing and assembly using Nanopore long reads

One Nanopore 1D library was prepared following the Oxford Nanopore SQK-LSK 108 kit and GridION protocol (Oxford Nanopore Technologies, Oxford, UK) [19]. Genomic DNA was first repaired and end-prepped with NEBNext FFPE Repair Mix (New England Biolabs [NEB]) and the NEBNext Ultra II End Repair/dA-Tailing Module (NEB). The DNA was then purified with AMPure XP beads (Beckmann Coulter) and ligated with sequencing adapters provided by ONT using concentrated T4 DNA ligase 2 M U mL−1 (NEB). After purification with AMPure XP beads (Beckman Coulter) using dilution buffer (ONT) and wash buffer (ONT), the library was mixed with sequencing buffer (ONT) and library loading beads (ONT) and loaded on 16 flow cells (R9.4) of GridION X5 platform (GridION, RRID:SCR_017986) [20], generating 71.3 Gb raw DNA reads (∼100× coverage of the genome assembly). We obtained 4.8 million Nanopore long reads (67.7 Gb in total) with an N50 read length of 21.29 kb and the longest read length of 935.06 kb after removing adapter sequences (Supplementary Table S1). The clean long reads obtained from Nanopore were initially assembled by wtdbg (wtdbg, RRID:SCR_017225) version 1.3 [21] with parameters as follows: wtdbg -t 60 -i Passed.fastq -o Sample -H -k 17 -S 1.01 -e 4. The iterative polishing was conducted thrice by Pilon version 1.22 (Pilon, RRID:SCR_014731) [22] and BWA (BWA, RRID:SCR_010910) [23] with the default parameters. The Pilon program was also run with default parameters to fill gaps, fix bases (including single-nucleotide polymorphisms and indels), and correct local misassemblies. A total of 99.26% of Illumina short reads were able to align to the assembled genome (Supplementary Table S2). The primary draft genome assembly was 720 Mb with a contig N50 length of 1.1 Mb and the longest contig length of 11.9 Mb (Supplementary Table S3). The contig N50 of the A. sinensis genome was much higher than other published medicinal plants' genome assemblies (Supplementary Table S4).

Hi-C library construction and chromosome-scale assembly

Hi-C, derived from chromosome conformation capture technology, is a method that probes the 3D architecture of whole genomes by coupling proximity-based ligation with massively parallel sequencing [24]. The Hi-C contract matrix has been widely used for assembly correction to generate chromosome-scale scaffolds. In this work, the genomic DNA used for the Hi-C library was extracted from a fresh leaf sample of A. sinensis using standard methods. The cross-linked DNA from lysed cells was digested with DpnII after cells were fixed with formaldehyde. Sticky ends were biotin labeled and proximity ligated to form chimeric junctions and then physically sheared to a size of 300–500 bp. Chimeric fragments representing the original cross-linked and long-distance physical interactions were then processed into PE sequencing libraries after PCR amplification. The PCR cycling protocol was as follows: with 95°C for 5 minutes; cycled 18×; 4°C for 30 seconds, 45°C for 1 second, 70°C for 20 seconds, and 98°C for 30 seconds; and then held at 4°C. The products of PCR were purified according to the Hi-C protocol and then the purified DNA was sheared, end-repaired, adenylation tailed, and universal adapter ligated, and samples were indexed as described in the manufacturer's recommendations [25]. The whole-genome Hi-C library was sequenced with 150 bp PE sequencing on Illumina Hiseq 2500. A total of 714.27 million clean PE reads (∼103.07 Gb, roughly 142× coverage of assembled genome) were generated after filtering adapters and low-quality reads with Fastp (version 0.12.6) and HiC-Pro (HiC-Pro, RRID:SCR_017643) [26, 27]. By mapping the Hi-C data to the Nanopore-based assembly using bowtie2 (bowtie2, RRID:SCR_005476) [28], we found 93.49 million unique mapped PE reads and 62.89 million valid interaction pairs, which, respectively, accounted for 26.18% and 17.61% in the clean data (Supplementary Table S5). We used BWA and Lachesis (Lachesis, RRID:SCR_017644) software to align PE reads and retain the reads aligned to 500 bp away from each restriction site [29]. According to the methods of clustering, ordering, and orienting to the assembly contigs, these sequences were divided into 8 chromosome clusters and scaffolded by using Lachesis software with tuned parameters (Supplementary Table S6, Fig. 2). Finally, a heat map of Hi-C interaction for final assembly was produced using R (version 3.5.3) [30, 31].

Figure 2:

Hi-C interaction matrix for A. sinensis genome assembly using 8 clusters.

Hi-C interaction matrix for A. sinensis genome assembly using 8 clusters. A total of 1,862 contigs were used for scaffolding by Hi-C data, which consequently generated 805 scaffolds. The Hi-C–assisted chromosome-length scaffolds resulted in a final size of 716.6 Mb accounting for the 99.85% draft genome, which showed a high level of continuity with a contig N50 of 1.1 Mb and a scaffold N50 of 88.78 Mb. The final draft genome assembly of A. sinensis was 726.5 Mb (Supplementary Table S3). The anchor rate of contigs (>100 kb) to pseudochromosomes was attained up to the 98.63% level based on the Hi-C assembly (Table 1). The scaffold N50 of the A. sinensis genome was also superior to other published medicinal plant genome assemblies (Supplementary Table S4).

Table 1:

Statistics of the final genome assembly for Aquilaria sinensis

Statistic	Contig length (bp)	Contig No.	Scaffold length (bp)	Scaffold No.
N50	1,058,652	164	88,784,932	4
N60	726,407	246	86,380,100	5
N70	495,861	366	84,956,755	6
Longest	11,913,571	1	109,870,270	1
Total	720,187,708	2,015	726,587,161	9
Length ≥1 kb	720,187,482	2,013	726,587,161	9
Length ≥2 kb	720,179,880	2,008	726,587,161	9
Length ≥5 kb	720,112,854	1,991	726,587,161	9

Statistics of the final genome assembly for Aquilaria sinensis

RNA preparation and sequencing

Iso-seq was performed for genome assembly and annotation. The sample of mixed root, stem, and leaf used for RNA extraction was obtained from the same plant used for Oxford Nanopore DNA sequencing and immediately snap-frozen in liquid nitrogen. Total RNA was extracted from the frozen tissue using a Qiagen RNA extraction kit (Qiagen, Hilden, DE) and the sequencing library was then prepared with SMRTbellTM template prep kit 1.0 (Pacific Biosciences, Menlo Park, CA, USA) after RNA reverse transcription with SMARTerTM PCR cDNA Synthesis kit and complementary DNA (cDNA) amplification with KAPA HiFi PCR kits (Kapa Biosystems, Boston, Massachusetts, USA). Full-length transcriptome sequencing was subsequently performed using the PacBio Sequel System (PacBio Sequel System, RRID:SCR_017989). A total of 18,411,342 subreads were obtained from Iso-seq after raw data filtering with SMRTLING 5.1 and derived 136,050 consensus sequences, of which 94.71% (128,854) can be aligned to the final genome of A. sinensis (Supplementary Table S7).

Genome quality evaluation

To evaluate the completeness of our assembly, we subjected the final assembled genome sequences to BUSCO version 3 (BUSCO, RRID:SCR_015008) (BUSCO, Embryophyta odb 10) [32, 33]. Overall, 95.27% of 1,375 expected embryophyta genes were identified in our genome assembly as the complete and partial BUSCO profiles. Among these identified 1,310 complete expected embryophyta genes, 1,202 and 108 were identified as single copy and duplicated copies, respectively (Supplementary Table S8).

Repeat sequences within the A. sinensis genome assembly

Transposable elements (TEs) and tandem repeats were identified with both homology-based annotation and de novo methods. Consensus sequences of repetitive elements were de novo identified and classified using the software package RepeatModeler version 1.04 (RepeatModeler, RRID:SCR_015027) [34]. RepeatMask version 3.2.9 (RepeatMasker, RRID:SCR_012954) [34], RepeatProteinMasker [35], and TRF [36] were used to discover and identify repeats within the respective genomes. Furthermore, simple sequence repeats (SSRs) in the A. sinensis genome were also classified with MISA (MISA, RRID:SCR_010765) [37]. The results showed that de novo predicted repeats were more recently active than Repbase [38] predicted repeats (Supplementary Fig. S3). The identified repeat sequences in the A. sinensis genome assembly accounted for 59.13% and total length of those accounted for 425.87 Mb (Supplementary Table S9). In particular, the details showed that long terminal repeats (LTRs) were the most abundant repeat type and that 2 non-LTR retrotransposons, short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs) [39], had the lowest proportions in the final assemblies. In addition, 13.12% of repeat sequences could not be classified (Table 2). A total of 367,251 SSRs were identified from the draft assembly in 675 scaffolds. Mononucleotides (64.71%), dinucleotides (18.19%), and trinucleotides (12.46%) comprised nearly 96% of SSRs in our assembly (Supplementary Table S10).

Table 2:

Statistics of transposable elements in Aquilaria sinensis genome sequences

Type	Repbase TEs		Mips-REdat TEs		TE proteins		RepeatModeler		Combined TEs
Type	Length (Mb)	% in genome	Length (Mb)	% in genome	Length (Mb)	% in genome	Length (Mb)	% in genome	Length (Mb)	% in genome
DNA	13,223,408	1.84	1,392,136	0.19	10,456,270	1.45	28,698 131	3.98	38,895,471	5.4
LINE	2,916,904	0.41	253,492	0.04	7,680,548	1.07	6,394,899	0.89	12,239,695	1.7
LTR	73,748,923	10.24	22,973,865	3.19	75,336,839	10.46	138,348,032	19.21	192,609,862	26.74
SINE	2,232	0	1,145	0	0	0	0	0	4,539	0
Other	6,189,190	0.86	380,555	0.05	1,369,337	0.19	0	0	87,659,087	12.17
Unknown	35,443	0	0	0	0	0	124,331,790	17.26	94,460,416	13.12
Total	96,116,100	13.35	25,001,193	3.47	94,842,994	13.17	296,679,047	41.19	425,869,070	59.13

Statistics of transposable elements in Aquilaria sinensis genome sequences

Gene prediction and annotation

Three strategies were used for gene prediction. Augustus version 3.2.3 (Augustus, RRID:SCR_008417) [40], GlimmHmm [41], and GeneID (GeneID, RRID:SCR_002473) [42] were used for ab initio gene prediction, using model training based on coding sequence (CDS) from Corchorus olitorius (COLO4_1.0) [43], Durio zibethinus (Duzib1.0) [44], Gossypium hirsutum (ASM98774v1) [45], Herrania umbratica (ASM216827v2) [46], Theobroma cacao (Cirollo_cocoa_geneoe_v2) [47], and Arabidopsis thaliana (TAIR10) [48]. GeneWise (GeneWise, RRID:SCR_015054) [49] and GeMoMa [50] were used for homology prediction. PASA (PASA, RRID:SCR_014656) [51] and Tophat (TopHat, RRID:SCR_013035) [52] were used for gene structural prediction based on expressed sequence tag and cDNA sequences. Finally, the total gene prediction was obtained from the union of these 3 strategies with EVM [51] and filtering the TEs with Transposon PSI (Transposon, RRID:SCR_001159) [53]. RNA-sequencing data of mixed tissues were mapped with the annotation of the reference genome using MatchAnnot [54], respectively. The final annotation was composed of 29,203 gene models with an average of 3,177.62 bp transcripts and 1,114.16 bp CDS, each gene containing 5.02 exons with an average length of 222.09 bp. The comparative information of genes from A. sinensis and 6 closely related plants was also calculated (Supplementary Table S11), including their distributions of CDS and gene length, exon and intron length, and exon and intron number (Supplementary Fig. S4). Genes were characterized for their putative function by performing Blastall [55] and KAAS [56] searches of the peptide sequences against Swiss-Prot (Swiss-Prot, RRID:SCR_002380) [57], NR [58], TrEMBL (TrEMBL, RRID:SCR_002380) [57], KEGG database (KEGG, RRID:SCR_012773) [59], COG (Clusters of Orthologous Groups) database (COG, RRID:SCR_007273) [60], and the Gene Ontology (GO) database (GO, RRID:SCR_002811) [61]. Protein conservative models and motif prediction were performed with InterProScan version 5.2 (InterproScan, RRID:SCR_005829) [62]. Of these 29,203 protein-coding genes, 82.64% have functional annotation. The database research hits can be summarized as follows: Swiss-Prot (19,586 [67.07%]), NR (24,097 [82.52%]), TrEMBL (23,455 [80.32%]), KEGG (8,494 [29.09%]), COG (13,592 [46.54%]), GO (14,019 [48.00%]), and InterProScan (20,031 [68.59%]) (Supplementary Table S12). In addition, we also identified 207 microRNAs, 34 ribosomal RNAs, 173 transfer RNAs (tRNAs), and 1,173 small nuclear RNAs via the Rfam non-coding RNA database (Rfam, RRID:SCR_007891) [63], tRNAscan-SE (tRNAscan-SE, RRID:SCR_010835) [64], and RNAmmer [65]. The average length, total length, and percentage of non-coding RNAs in the A. sinensis genome were further assessed (Supplementary Table S13). In addition, 48.61% of predicted genes (14,197) were supported by Iso-seq transcripts (Supplementary Table S14).

Gene family identification and phylogenetic tree construction

By keeping the longest transcript for each gene, whole protein coding gene sets from theA. sinensis genome and 12 other representative plant genomes including G. hirsutum (ASM98774v1), A. thaliana (TAIR10), T. cacao (Cirollo_cocoa_geneoe_v2), Cephalotus follicularis (Cfol_1.0), Citrus clementina (Citrus_clementina_v1.0), Cucurbita pepo (ASM280686v2), Eucalyptus grandis (Egrandis1_0), Glycine max (Glycine_max_v2.1), Helianthus annuus (HanXRQr 1.0), Populus euphratica (PopEup_1.0), Quercus suber (CorkOak 1.0), and Vitis vinifera (assembly 12X) were used to construct a global gene family classification with all-vs-all BLASTP (1e−5 cutoff, Blast+ v2.3.056) and OrthoMCL version 2.0.9 (Ortholog Groups of Protein Sequences, RRID:SCR_007839) [66]. The default settings were used for BLASTP and OrthoMCL. In our assembly, 21,955 genes were clustered into 13,713 gene families. Gene family analysis also revealed that 789 gene families and 7,248 genes were unique to A. sinensis in the above comparison (Fig. 3a and Supplementary Table S15). Of these, 9,615 gene families were shared among A. sinensis and 4 representative species (G. hirsutum from Malvaceae, C. olitorius from Tiliaceae, T. cacao from Sterculiaceae, and A. thaliana as the model plant from Cruciferae), whereas 804 gene families were unique to the A. sinensis genome (Fig. 3b). Malvaceae, Tiliaceae, and Sterculiaceae are beyond the order Malvales, and the Thymelaeceae family is also divided into order Malvales in APG IV [67].

Figure 3:

Comparative genomic analysis of Aquilaria sinensis and other plant species. (a) Distribution of genes and gene families of 13 plant species we investigated. (b) Venn diagram showing the distribution of shared gene families among the Malvales plants Aquilaria sinensis (agarwood), Theobroma cacao (cocoa), Gossypium hirsutum (cotton), Corchorus olitorius (jute), and the model plant Arabidopsis thaliana (Arabidopsis). (c) Divergence time estimation and gene family changes among 13 plant species. The black number at each node denotes estimated divergence time from present (million years ago). The blue number at the root (11,885) denotes the total number of gene families predicted in the most recent common ancestor (MRCA), and the green/red numbers around each branch denote gene family gain/loss number. The red nodes indicate the known divergence time of Asterids and Rosids. (d) Transversion substitutions at 4-fold degenerate sites (4dTv) distribution in selected assemblies of A. sinensis, A. thaliana, O. sativa, M. truncatula, and V. vinifera. Single-copy genes, or the orphan genes with only a single copy in the genome during duplication and evolution of species, are highly conserved and are generally used for establishing genetic relationship and origin of species. Alignment of single-copy genes was performed with protein sequences by Mafft (Mafft, RRID:SCR_011811) [68], then poorly aligned and highly divergent sites were subjected to filtering with Gblocks (Gblocks, RRID:SCR_015945) [69] and the final CDSs were used for evolutional analyses by RAxML with GTRGAMMA model (RAxML, RRID:SCR_006086) [70]. The bootstrap was 100 and H. annuus from the Asterids was the outgroup [71]. We constructed a phylogenetic tree and estimated the divergence time of 13 plants by 89 single-copy gene families with the MCMCTREE of PAML [72] (Supplementary Fig. S5) (Parameters: clock = 2, RootAge = < 100.6, model = 7, BDparas = 1 1 0, kappa gamma = 6 2, alpha gamma = 1 1, rgene gamma = 2 3.18, sigma2 gamma = 1 1.3; divergence time of Asterids and Rosids [∼118 million years ago (Mya)] was used for calibration [71]). The divergence time between A. sinensis and A. thaliana was estimated as 82.14 (95% CI, 67.63–93.99) Mya, and the divergence time between A. sinensis and the common ancestor of G. hirsutum and T. cacao from Malvales order was ∼69.64 (95% CI, 53.18–84.37) Mya (Fig. 3c and Supplementary Fig. S6), whereas the divergence time between G. hirsutum and T. cacao was determined as 31.33–69.23 Mya in our analysis, which is concordance with the previous studies [44, 45].

Gene family expansion and contraction

Expansion and contraction of a defining gene family is an important driver of metabolite variation and species adaptation during plant evolution [73]. We determined the expansion and contraction of orthologous gene families in the A. sinensis genome by means of CAFÉ 2.2 (CAFÉ, RRID:SCR_005983) with default parameters [74]. We inferred 53 expanded families and 117 contracted families with the A. sinensis genome after comparing 11,855 gene families across all 13 species (Fig. 3c and Supplementary Table S16). We used Blast2GO (B2G, RRID:SCR_005828) to enrich the ontology categories (GO and KEGG terms). The expanded gene families were involved in the pathways of plant circadian rhythm, tricarboxylic acid cycle, propanoate metabolism, ribosome biogenesis, and aminoacyl-tRNA biosynthesis (Supplementary Table S17 and Fig. S7), and the contracted gene families mapped pathways of starch/sucrose metabolism, sesquiterpenoid and triterpenoid biosynthesis, and linoleic acid metabolism (Supplementary Table S18 and Fig. S8).

4DTv Distribution

We used MCScanX to identify the syntenic regions [75], with the longest isoform for each gene selected for this exercise. The top 5 mutual hits of the BLASTP results in gene family analysis were used as input. Only the syntenic segments that have >5 gene pairs were considered for 4-fold degenerate synonymous site (4DTv) calculation. Pairwise sequence was aligned using MUSCLE [76]. Raw 4DTv values were corrected for possible multiple transversions at the same site. Based on 4DTv distribution, a large accumulation of gene duplications is evident in the A. sinensis genome and distinct from the scenarios in A. thaliana, M. truncatula, and V. vinifera (Fig. 3d).

Conclusions

In summary, a high-quality de novo genome assembly and in-depth characterization of A. sinensis, combining Nanopore single-molecule long reads and Hi-C, has been provided in this study. The final assembly was ∼726.5 Mb in size, which was slightly smaller than the k-mer estimated genome size of 773.3 Mb. The Hi-C data were used to identify and revise 230 misassemblies and assign the contigs into chromosome-scale scaffolds. This consequently generated an assembly with a high level of continuity with a contig N50 of 1.1 Mb and a scaffold N50 of 88.78 Mb. We also predicted 29,203 protein-coding genes from the final assembly and 82.64% (24,133 genes) of all protein-coding genes were annotated. We estimated that the divergence time between A. sinensis and its common ancestor G. hirsutum and T. cacao from the Malvales order was ∼53.18–84.37 Mya. The genome of A. sinensis seems to have experienced a recent whole-genome duplication event after the K-T boundary [77]. The chromosome-level genome assembly of A. sinensis is also the first high-quality genome in the Thymelaeceae family. Considering that wild A. sinensis tree populations are currently highly threatened by heavy exploitation for the production of commercially valuable agarwood products, the genome assembly of the A. sinensis tree presented here will provide valuable information to aid the global conservation of these precious biological resources and contribute to understanding the mechanism of agarwood formation, which eventually will help us reveal the evolution of aromatic genes and plants.

Availability of Supporting Data and Materials

Supporting data and materials are available in the GigaScience GigaDB database [78], with the raw genomics and transcriptome sequences deposited in the NCBI SRA database under the BioProject accession number PRJNA556948 and BioSample accession number SAMN12385133.

Additional files

Supplementary Figure 1: k-mer (k = 19) analysis for estimating the size of the Aquilaria sinensis genome. Supplementary Figure 2: GC content and average sequencing depth of the Illumina sequencing data used for genome estimation. Supplementary Figure 3: Distribution of sequence divergence rates of different TE types with Repbase (A) and de novo (B) methods in the Aquilaria sinensis genome. Supplementary Figure 4: Distribution of gene elements in Aquilaria sinensis genome and 6 other plant genomes. Supplementary Figure 5: Phylogenetic tree of 13 plant species including Aquilaria sinensis. Supplementary Figure 6: Estimation of divergence time of 13 plant species investigated in the present study. The colored numbers on the nodes are the divergence time from present (million years ago). Numbers in parentheses indicate the 95% confidence interval of the divergence time. Supplementary Figure 7: GO enrichment of expansion gene families in Aquilaria sinensis genome. Supplementary Figure 8: GO enrichment of contraction gene families in Aquilaria sinensis genome. Supplementary Table 1: Summary of Nanopore sequencing for Aquilaria sinensis genome. Supplementary Table 2: Support of Illumina data for Nanopore data in Aquilaria sinensis genome assembly. Supplementary Table 3: Statistics of the results of Aquilaria sinensis genome assembly before Hi-C mapping. Supplementary Table 4: Comparisons of genome assemblies of medicinal plants based on descending contig N50. Supplementary Table 5: Summary of mapping status of Hi-C data. Supplementary Table 6: Statistics of initial and final assembly with Hi-C. Supplementary Table 7: Mapping result of Iso-seq from Aquilaria sinensis. Supplementary Table 8: Statistics of BUSCO evolution for Aquilaria sinensis genome. Supplementary Table 9: Statistics of repeat sequence in Aquilaria sinensis genome via different methods. Supplementary Table 10: Statistics of SSRs in Aquilaria sinensis genome sequences. Supplementary Table 11: Statistics of characteristics of gene models in Aquilaria sinensis and 6 other plant genomes. Supplementary Table 12: The annotated genes of Aquilaria sinensis that can be functionally classified in each corresponding database. Supplementary Table 13: Noncoding RNA annotation in the Aquilaria sinensis genome. Supplementary Table 14: Annotation of Iso-seq and comparison with genome annotation of the Aquilaria sinensis genome. Supplementary Table 15: Summary of gene families among 13 plant species. Supplementary Table 16: Summary of gene family changes among 13 species. Supplementary Table 17: KEGG mapping of expansion gene families in Aquilaria sinensis genome. Supplementary Table 18: KEGG mapping of contraction gene families in Aquilaria sinensis genome. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Click here for additional data file. Caroline Belser -- 9/5/2019 Reviewed Click here for additional data file. Caroline Belser -- 12/16/2019 Reviewed Click here for additional data file. Fernando Cruz -- 9/29/2019 Reviewed Click here for additional data file. Fernando Cruz -- 1/9/2020 Reviewed Click here for additional data file. Click here for additional data file.

Abbreviations

4DTv: 4-fold degenerate synonymous sites; APG: Angiosperm Phylogeny Group; bp: base pairs; BUSCO: Benchmarking Universal Single-Copy Orthologs; CAFÉ: Computational Analysis of Gene Family Evolution; cDNA: complementary DNA; CDS: coding sequence; COG: Clusters of Orthologous Groups; CTAB: cetyl trimethylammonium bromide; EVM: EVidenceModeler; FFPE: formalin-fixed paraffin-embedded; Gb: gigabase pairs; GC: guanine-cytosine; GO: Gene Ontology; Hi-C: high-throughput chromosome conformation capture; Iso-seq: Isoform sequencing; IUCN: International Union for Conservation of Nature and Natural Resources; KAAS: KEGG Automatic Annotation Server; kb: kilobase pairs; KEGG: Kyoto Encyclopedia of Genes and Genomes; K-T: Cretaceous-Tertiary; LINE: long interspersed nuclear element; Mb: megabase pairs; MRCA: most recent common ancestor; Mya: million years ago; NCBI: National Center for Biotechnology Information; NEB: New England Biolabs; Nr: non-redundant protein database; ONT: Oxford Nanopore Technologies; PASA: Program to Assemble Spliced Alignments; PE: paired end; Pfam: protein families; RAxML: Randomized Axelerated Maximum Likelihood; SINE: short interspersed nuclear element; SMRT: single-molecule real time; SRA: Sequence Read Archive; SSR: simple sequence repeat; TE: transposable element; TrEMBL: Translated EMBL-Bank; TRF: Tandem Repeats Finder; tRNA: transnfer RNA; CI: confidence interval.

Competing Interests

The authors declare that they have no competing interests.

Funding

This work was supported by the Central Public-interest Scientific Institution Basal Research Fund for Chinese Academy of Tropical Agricultural Sciences (17CXTD-15 and 1630052020003), the National Natural Science Foundation of China (31870668) and the China Agriculture Research System (CARS-21). We are grateful to NextOmics Co., Ltd. (Wuhan, China), for providing technical help.

Authors' Contributions

H.F.D., P.C., and W.L.M. conceptualized the research program. X.P.D., W.L.M., and S.Q.P. designed experiments and coordinated the program. S.Z.H. collected the specimens and J.W. took the photos. H.L.L and J.H.Z. extracted the DNA. X.P.D., Q.L., P.W., P.C., W.L., H.Q.C., W.H.D., D.G., and C.H.C were partially involved with either experiments or data analysis. X.P.D. and Q.L. wrote the manuscript. All authors read and approved the final manuscript.

67 in total

1. Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis.

Authors: J Castresana
Journal: Mol Biol Evol Date: 2000-04 Impact factor: 16.240

2. Aquilanols A and B, Macrocyclic Humulene-Type Sesquiterpenoids from the Agarwood of Aquilaria malaccensis.

Authors: Chi Thanh Ma; Taeyong Eom; Eunji Cho; Bo Wu; Tae Ryong Kim; Ki Bong Oh; Sang Beom Han; Sung Won Kwon; Jeong Hill Park
Journal: J Nat Prod Date: 2017-10-30 Impact factor: 4.050

3. Three-dimensional architecture of inorganic nanoarrays electrodeposited through a surface-layer protein mask.

Authors: Daniel B Allred; Anchi Cheng; Mehmet Sarikaya; François Baneyx; Daniel T Schwartz
Journal: Nano Lett Date: 2008-04-01 Impact factor: 11.189

4. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.

Authors: T M Lowe; S R Eddy
Journal: Nucleic Acids Res Date: 1997-03-01 Impact factor: 16.971

5. Fast gapped-read alignment with Bowtie 2.

Authors: Ben Langmead; Steven L Salzberg
Journal: Nat Methods Date: 2012-03-04 Impact factor: 28.547

6. The sunflower genome provides insights into oil metabolism, flowering and Asterid evolution.

Authors: Hélène Badouin; Jérôme Gouzy; Christopher J Grassa; Florent Murat; S Evan Staton; Ludovic Cottret; Christine Lelandais-Brière; Gregory L Owens; Sébastien Carrère; Baptiste Mayjonade; Ludovic Legrand; Navdeep Gill; Nolan C Kane; John E Bowers; Sariel Hubner; Arnaud Bellec; Aurélie Bérard; Hélène Bergès; Nicolas Blanchet; Marie-Claude Boniface; Dominique Brunel; Olivier Catrice; Nadia Chaidir; Clotilde Claudel; Cécile Donnadieu; Thomas Faraut; Ghislain Fievet; Nicolas Helmstetter; Matthew King; Steven J Knapp; Zhao Lai; Marie-Christine Le Paslier; Yannick Lippi; Lolita Lorenzon; Jennifer R Mandel; Gwenola Marage; Gwenaëlle Marchand; Elodie Marquand; Emmanuelle Bret-Mestries; Evan Morien; Savithri Nambeesan; Thuy Nguyen; Prune Pegot-Espagnet; Nicolas Pouilly; Frances Raftis; Erika Sallet; Thomas Schiex; Justine Thomas; Céline Vandecasteele; Didier Varès; Felicity Vear; Sonia Vautrin; Martin Crespi; Brigitte Mangin; John M Burke; Jérôme Salse; Stéphane Muños; Patrick Vincourt; Loren H Rieseberg; Nicolas B Langlade
Journal: Nature Date: 2017-05-22 Impact factor: 49.962

7. GenomeScope: fast reference-free genome profiling from short reads.

Authors: Gregory W Vurture; Fritz J Sedlazeck; Maria Nattestad; Charles J Underwood; Han Fang; James Gurtowski; Michael C Schatz
Journal: Bioinformatics Date: 2017-07-15 Impact factor: 6.937

8. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity.

Authors: Yupeng Wang; Haibao Tang; Jeremy D Debarry; Xu Tan; Jingping Li; Xiyin Wang; Tae-ho Lee; Huizhe Jin; Barry Marler; Hui Guo; Jessica C Kissinger; Andrew H Paterson
Journal: Nucleic Acids Res Date: 2012-01-04 Impact factor: 16.971

9. Identification of genes related to agarwood formation: transcriptome analysis of healthy and wounded tissues of Aquilaria sinensis.

Authors: Yanhong Xu; Zheng Zhang; Mengxi Wang; Jianhe Wei; Hongjiang Chen; Zhihui Gao; Chun Sui; Hongmei Luo; Xingli Zhang; Yun Yang; Hui Meng; Wenlan Li
Journal: BMC Genomics Date: 2013-04-08 Impact factor: 3.969

10. Genome sequence of the agarwood tree Aquilaria sinensis (Lour.) Spreng: the first chromosome-level draft genome in the Thymelaeceae family.

Authors: Xupo Ding; Wenli Mei; Qiang Lin; Hao Wang; Jun Wang; Shiqing Peng; Huiliang Li; Jiahong Zhu; Wei Li; Pei Wang; Huiqin Chen; Wenhua Dong; Dong Guo; Caihong Cai; Shengzhuo Huang; Peng Cui; Haofu Dai
Journal: Gigascience Date: 2020-03-01 Impact factor: 6.524

6 in total

1. Characterization of the complete chloroplast genome of Corydalis bungeana Turcz.

Authors: Qi Wang; Zhixian Lei; Lirong Zhou; Biwei Mai; Naiyun Zhu; Xiaoli Zhao; Wenting Xu
Journal: Mitochondrial DNA B Resour Date: 2021-06-14 Impact factor: 0.658

2. The complete chloroplast genome of Dryopteris crassirhizoma Nakai.

Authors: Qi Wang; Wenting Xu; Lirong Zhou; Biwei Mai; Naiyun Zhu; Xiaoli Zhao; Zhixian Lei
Journal: Mitochondrial DNA B Resour Date: 2021-05-27 Impact factor: 0.658

3. Genome Assembly of Salicaceae Populus deltoides (Eastern Cottonwood) I-69 Based on Nanopore Sequencing and Hi-C Technologies.

Authors: Shengjun Bai; Hainan Wu; Jinpeng Zhang; Zhiliang Pan; Wei Zhao; Zhiting Li; Chunfa Tong
Journal: J Hered Date: 2021-05-24 Impact factor: 2.645

Review 4. Review on the Development and Applications of Medicinal Plant Genomes.

Authors: Qi-Qing Cheng; Yue Ouyang; Zi-Yu Tang; Chi-Chou Lao; Yan-Yu Zhang; Chun-Song Cheng; Hua Zhou
Journal: Front Plant Sci Date: 2021-12-23 Impact factor: 5.753

5. Effects of various artificial agarwood-induction techniques on the metabolome of Aquilaria sinensis.

Authors: Ningnan Zhang; Shiyu Xue; Jie Song; Xiuren Zhou; Dahao Zhou; Xiaojin Liu; Zhou Hong; Daping Xu
Journal: BMC Plant Biol Date: 2021-12-13 Impact factor: 4.215

6. Genome sequence of the agarwood tree Aquilaria sinensis (Lour.) Spreng: the first chromosome-level draft genome in the Thymelaeceae family.

6 in total