Literature DB >> 35466378

Chromosome-level genome assembly and characterization of Sophora Japonica.

Weixiao Lei¹, Zefu Wang², Man Cao¹, Hui Zhu¹, Min Wang¹, Yi Zou¹, Yunchun Han¹, Dandan Wang¹, Zeyu Zheng¹, Ying Li¹, Bingbing Liu³, Dafu Ru¹.

Abstract

Sophora japonica is a medium-size deciduous tree belonging to Leguminosae family and famous for its high ecological, economic and medicinal value. Here, we reveal a draft genome of S. japonica, which was ∼511.49 Mb long (contig N50 size of 17.34 Mb) based on Illumina, Nanopore and Hi-C data. We reliably assembled 110 contigs into 14 chromosomes, representing 91.62% of the total genome, with an improved N50 size of 31.32 Mb based on Hi-C data. Further investigation identified 271.76 Mb (53.13%) of repetitive sequences and 31,000 protein-coding genes, of which 30,721 (99.1%) were functionally annotated. Phylogenetic analysis indicates that S. japonica separated from Arabidopsis thaliana and Glycine max ∼107.53 and 61.24 million years ago, respectively. We detected evidence of species-specific and common-legume whole-genome duplication events in S. japonica. We further found that multiple TF families (e.g. BBX and PAL) have expanded in S. japonica, which might have led to its enhanced tolerance to abiotic stress. In addition, S. japonica harbours more genes involved in the lignin and cellulose biosynthesis pathways than the other two species. Finally, population genomic analyses revealed no obvious differentiation among geographical groups and the effective population size continuously declined since 2 Ma. Our genomic data provide a powerful comparative framework to study the adaptation, evolution and active ingredients biosynthesis in S. japonica. More importantly, our high-quality S. japonica genome is important for elucidating the biosynthesis of its main bioactive components, and improving its production and/or processing.

Entities: Chemical

Keywords: zzm321990 Sophora japonicazzm321990 ; Hi-C; WGD; genome; nanopore

Mesh：

Year: 2022 PMID： 35466378 PMCID： PMC9154292 DOI： 10.1093/dnares/dsac009

Source DB: PubMed Journal: DNA Res ISSN： 1340-2838 Impact factor: 4.477

1. Introduction

Sophora japonica (2n = 28) is a popular perennial ornamental and medicinal plant of the subfamily Papilionoideae commonly known as Chinese scholar tree or Japanese pagoda tree. Widely distributed in Asia and Europe, it has been cultivated for more than 3,000 years in China. As a medicinal plant, almost every part of S. japonica, including the leaves, buds, flowers, seeds and bark, is used medicinally in Asia. For example, the buds and fruits are commonly used as ingredients in traditional Chinese medicine, and flavonoid components of the buds and fruits reportedly have haemostatic properties. Moreover, the dry flowers (Huaihua or Flos Sophorae) and flower buds (Huaimi or Flos Sophorae Immaturus) are included in both Chinese and European Pharmacopeia and contain diverse kinds of components such as tetraglycosides, flavones, isoflavones, isoflavone tetraglycosides and flavonol glycosides (rutin, for instance). According to Chinese pharmacopeia records, clinical trials and modern pharmacological studies, rutin is one of the main effective components and reportedly has more than 40 therapeutic properties., Sophora japonica is also widely used in urban greening programmes and planted by highways as it has high resistance to pests and diseases, pollution tolerance and adaptability to diverse environmental factors. In urban areas of Beijing, China, where S. japonica accounts for 81% of the street trees, and elsewhere it plays important roles in microclimate regulation, maintenance of healthy urban ecosystems and improvement of air quality. However, despite its medicinal and ecological value, very little genomic information is available for S. japonica. Therefore, we have assembled and annotated the S. japonica genome using long reads obtained from Oxford Nanopore (ONT) sequencing and short reads from Illumina sequencing. This yielded a genome assembly of 511.49 Mb with a ∼17.34 Mb contig N50 size. Using Hi-C data, we associated 91.62% of the assembled bases with 14 pseudo-chromosomes, with an improved N50 of 31.32 Mb. We identified a species-special whole-genome duplication (WGD) event and found many of the duplicated genes caused by WGD have contributed to S. japonica’s adaptation and biosynthesis. We further performed whole-genome resequencing for 35 individuals to examine the population structure. The high-quality chromosome-level S. japonica genome provides robust foundations for investigating pests and diseases resistance, pollution tolerance and environmental adaptability and biosynthesis of its main active ingredients of the species.

2. Materials and methods

Sample collection and sequencing

Leaves were collected from a S. japonica individual in Lanzhou, Gansu Province, China (36°2′57″N, 103°51′34″E) (Fig. 1). Genomic DNA was extracted with a QIAGEN® Genomic Kit following the manufacturer’s recommendations. An Illumina paired-end library (with insert size of 350 bp) was constructed using an Illumina genomic DNA sample preparation kit. The Illumina NovaSeq 6000 platform was then used for sequencing and yielded ∼103.70 Gb of raw sequence data (∼193.55× coverage of the genome) (Supplementary Table S1). A Nanopore library was constructed and sequenced with a PromethION sequencer (Oxford Nanopore Technologies, UK). A total of 60.45 Gb (∼112.83×) of long reads were generated (Supplementary Table S1). The Hi-C library was constructed using young leaves from the same S. japonica individual. Illumina NovaSeq 6000 platform was used to generate 60.73 Gb Hi-C data (Supplementary Table S1). We also collected four tissue samples (including leaves, stems, flowers and fruits) from the same individual for RNA-Seq. For each sample, the fresh tissue was used to construct cDNA library. Finally, a total of 27.47 Gb paired-end reads were obtained for all RNA-Seq samples (Supplementary Table S1).

Figure 1

(A) Photo of S. japonica tree; (B) photo of an S. japonica leaf; (C) flower buds (Huaimi) of S. japonica; (D) flowers (Huaihua) of S. japonica; (E) seeds of S. japonica.

Genome size and heterozygosity estimation

To estimate the size and heterozygosity level of the S. japonica genome, 17-mer spectrum analysis was performed following a procedure previously applied in Oplegnathus fasciatus genome research. Based on the total number of k-mers (74,577,100,335), the estimated genome size is ∼535.77 Mb (Supplementary Fig. S1 and Table S2). The estimated heterozygosity level and repetition frequency of the Styphnolobiumjaponicum genome are 1.19% and 52.77%, respectively (Supplementary Fig. S1 and Table S2).

Assembly of the S. japonicum genome

We used the previously generated 60.45 Gb ONT single molecule long reads for genome assembling. NextDenovo (version 2.3) (https://github.com/Nextomics/NextDenovo), including sequencing error correction, preliminary assembly and genome polishing, was performed under the recommended pipeline. The NextCorrect module was used for raw read correction and consensus sequence (CNS) extraction. The NextGraph module was used for preliminary assembly, and the NextPolish (version 1.2.2) module for genome polishing. According to previously described procedures, the S. japonica assembly was further refined based on 60.73 Gb Hi-C data (Supplementary Table S1) to improve the primary genome assembly to the chromosome level. BUSCO (version 4) was further used to assess the completeness of the genome assembly with embryophyta_odb10 database.

Annotation of repeat and non-coding RNAs

We used two strategies to predict repeats in S. japonica genome. For homology-based strategy, RepeatMasker (version 4.1.0) was performed for homology searching of repetitive elements against Repbase database (version 23). For ab initio identification, LTR Finder (version 1.07) and RepeatModeler (version 2.0) were performed, respectively. To identify non-coding RNA (ncRNA) sequences, two strategies were used: database searching and model-based prediction. rfam_scan.pl (version 1.0.4) was used to search against Rfam database (version 11.0) to detect small nuclear RNA (snRNA) and microRNA (miRNA) sequences, and tRNAscan-SE (version 2.0) was used (with eukaryotic parameters) to predict transfer RNA (tRNA) sequences. In addition, RNAmmer (version 1.2) was used to predict ribosomal RNA (rRNA) and its subunits.

Prediction and functional annotation of protein-coding genes

To maximize the reliability of the gene annotation process, repetitive regions in the assembled genome were masked before gene annotation. The prediction was performed by homology-based, transcriptome-based and ab initio strategies to identify high-quality protein-coding genes. In homology-based prediction process, protein-coding sequences of seven species (Arabidopsis thaliana, Cicer arietinum, Glycine soja, Lupinus albus, Populus trichocarpa, Trifolium pratense and Vigna unguiculata) were downloaded from NCBI or Phytozome (v12.1; https://phytozome.jgi.doe.gov/) (Supplementary Table S3) and align with our S. japonica genome using TBLASTN (E-value < 1 × 10−5). We used GeneWise (version 2.4.1) to align homologous genome sequences with matched proteins to define gene models. To generate transcriptome-based predictions based on RNA-Seq data, we sequenced RNA-Seq for leaves, stems, flowers and fruits of S. japonica, respectively. All RNA reads were initially aligned by HISAT2 (version 2.1.0) based on the reference genome and further assembled into transcripts using Trinity (version 2.6.6). These assembled sequences were compared with the S. japonica genome using the Program to Assemble Spliced Alignments (PASA) (version 2.3.3). Then we clustered the valid transcript alignments based on the positions obtained from the genome mapping, and finally assembled them into gene structures with the scripts from PASA packages (version 2.3.3) (‘build_comprehensive_transcriptome.dbi’ and ‘pasa_asmbls_to_training_set.dbi’) following default parameter settings. For ab initio prediction, the genome was subjected to analysis by Augustus with parameters trained using PASA self-trained gene models. EVM (version 1.1.1) was then used to merge the gene models obtained by the three methods into a final consensus set. The gene model was finally updated by PASA to generate untranslated regions and alternative splicing variants. We used BLASTP (E-value < 1 × 10−5) to annotate functions of protein-coding genes based on entries in the Swiss-Prot (version 2020_04), TrEMBL (version 2020_04) and NR databases (release 20200502). The protein domains were predicted by searching against InterPro database (version v84) using InterProScan (version 5.32-71.0). The Gene Ontology (GO) terms were obtained from the corresponding InterPro entries. The pathways for each gene were predicted by BBH method on KAAS website (https://www.genome.jp/tools/kaas/) based on Kyoto Encyclopedia of Genes and Genomes (KEGG) databases.

Phylogeny and evolution of gene families

To explore S. japonica’s evolutionary relationships, we used OrthoFinder (version 2.3.8) to cluster its genes with genes from 16 other dicot species: Vigna radiata, Vigna unguiculate, Phaseolus vulgaris, Glycine max, G.soja, Cajanus cajan, Medicago truncatula, T.pratense, C.arietinum, Lotus japonicus, L.albus, Lupinus angustifolius, Arachis duranensis, Arachis ipaensis, Senna toraand A.thaliana (Supplementary Table S3). Before clustering, the genes for each species were filtered. Only the longest transcript of each gene was retained and short protein-coding sequences (fewer than 40 amino acids) or error sequences (stop codons appearing prematurely) were removed. We identified 204 groups of strictly single-copy genes and used them to construct maximum likelihood trees to assess their evolutionary relationships using GTRGAMMA models in RAxML (version 8.2.12). We also used the ‘mcmctree’ programme in the PAML (version 4.9j) package to estimate divergence times, based on known approximate divergence times of A. duranensis and M. truncatula [49–58 million years ago (Ma)], as well as G. max and A. thaliana (98–117 Ma) (http://www.timetree.org/). The expansion and contraction of gene families are important drivers of metabolite variation and species adaptation during plant evolution. Thus, we explored the expansion and contraction of orthologous gene families in the S. japonica genome using CAFE (version 2.2) with default parameters. The expansive and contractive gene families in S. japonica were extracted using home-made perl script and further applied to GO enrichment analyses using TBtools (version 1.075).

Whole-genome duplications

We used the MCScanX (version 1.1.11) package to search for collinear blocks (defined as regions containing more than five collinear genes) between pseudochromosomes of pairs of the 16 genomes listed above and our S. japonica genome. The number of synonymous substitutions per synonymous site values of collinearity of orthologous gene pairs were determined using the Perl script ‘add_ka_and_ks_to_collinearity.pl’ implemented in MCScanX. We further determined the expanded genes which caused by WGD waves and also applied to GO enrichment analyses.

Population genetic analyses

An Illumina NovaSeq 6000 sequencing platform and PE150 sequencing strategy were used for whole-genome resequencing of 35 samples with high coverage (∼18×) (Supplementary Table S4). BWA (version 0.7.17) was used to map the reads to the chromosome-level genome of S. japonicum. Picard software (version 2.23.6) was applied to remove duplicates (broadinstitute.github.io/picard). SAMtools (version 1.10) was applied to identify single-nucleotide polymorphisms (SNPs) and InDels. To obtain high quality SNPs, we filtered the SNPs with the following criteria: (i) no InDel within 5 bps window; (ii) root mean square quality no less than 20; (iii) missing rate <20%; (iv) minimum allele frequency no <0.05 and P-value of Hardy–Weinberg equilibrium no <0.001. In addition, the base quality and mapping quality were set to 20 and 30, respectively. To analyse phylogenetic relationships, we used TreeBeST (version 1.9.2) to construct a neighbour-joining tree, ADMIXTURE (version 1.23) to infer the population structure, and PLINK (version 1.07) for principal component analysis (PCA)-based population clustering, respectively.

Demographic history

We used pairwise sequentially Markovian coalescence (PSMC) (version 0.6.5-r67) modelling to infer demographic history. It is capable of reconstructing the history of changes in population size over time using the distribution of the most recent common ancestor between two alleles within an individual. SAMtools (version 1.3) was used to obtain consensus sequences and divide them into non-overlapping 100 base pair bins, with the following parameters: -N25 -t15 -r5 -p ‘4 + 25 × 2 + 4 + 6’. A generation time of 7 years and mutation rate of 3.65 × 10−9 per nucleotide per year were used to convert the scaling time and population size to actual time and size.

3. Results

To perform a de novo assembly of the S. japonica genome (estimated genome size ∼535.77 Mb; Supplementary Table S2 and Fig. S1), we combined several sequencing technologies and assembly strategies (see Materials and methods). A total of 60.45 Gb ONT reads, corresponding to ∼112.83-fold coverage of the estimated genome size, were generated and used for de novo assembly (Supplementary Table S1). The primary genome assembly of S. japonica includes 110 contigs with N50 = 17.34 Mb and longest contig of 32.91 Mb (Table 1). The genome is 511,488,806 bp long with an average GC content of 33.77% (Table 1). The genome size is similar to that of assembled C.arietinum and L.japonicus (532.29 Mb, 544.14 Mb, respectively).,

Table 1

Summary of the genome assembly and annotation tables

Genome assembly
Estimated genome size	535.77 Mb
N50 length (contig)	17.34 Mb
Longest contig	32.92 Mb
Number of contigs	110
Total length of contigs	511.49 Mb
N50 length (scaffold)	31.32 Mb
Longest scaffold	60.53 Mb
Number of scaffolds	99
Total length of scaffolds	511.49 Mb
Average GC content (%)	33.77
BUSCO score of assembly (%)	98.1 (S: 88.7, D: 9.4), F: 1.0, M: 0.9
Transposable elements
Annotation	Percent (%)	Total length (bp)
DNA	5.59	27,990,056
LINE	1.17	5,880,495
LTR	37.98	190,194,740
SINE	0.15	762,931
Satellite	0.18	904,219
Simple repeat	9.37	46,924,249
Small RNA	0.09	438,768
Unknown	20.31	101,699,257
Total	53.13	271,762,340
Protein-coding genes
Predicted genes	31,000
Average genes length(bp)	4,864.39
Average CDS length (bp)	1,304.15
Average exons per gene	5.63
Average exon length (bp)	231.49
Average intron length (bp)	620.11
BUSCO score of annotation (%)	97.4 (S: 87.4, D: 10.0), F: 1.0, M: 1.6

Summary of the genome assembly and annotation tables The assembly was further refined using 60.73 Gb Hi-C data (Supplementary Table S1) and previously described procedures for chromosome-level anchorage. Sophora japonica has 2n = 28 chromosomes, according to karyotype analyses, and in total 468.63 Mb (91.62%) contig sequences were anchored onto 14 chromosomes (Fig. 2A and B). The final scaffold genome was 511,485,306 bp long in size, a bit larger than the contig genome. In addition, the scaffold N50 was increased to 31.32 Mb, with a longest scaffold of 60.53 Mb and 98.1% coverage according to BUSCO (version 4, dataset: embryophyta_odb10) analysis (single, duplicated, fragmented and missed percentages: 88.7%, 9.4%, 1.0% and 0.9%, respectively) (Table 1). Furthermore, 99.45% of the Illumina data could be mapped to the chromosome-level genome (coverage of 97.81%), and 97.47%, 97.12% and 96.64% of the assembled genome sequence could be covered by at least 4-fold, 10-fold and 20-fold Illumina data, respectively (Supplementary Table S5). These results clearly suggest that our assembly has high quality and is relatively complete.

Figure 2.

(A) Circos plot showing genomic features of S. japonica. Concentric circles, outer to inner, show GC density, gene density, repetitive sequence density and collinearity, respectively. (B) Heat map of chromatin contact matrices generated by aligning the Hi-C dataset to the S. japonica genome. (C) Phylogenetic relationships and divergence times of commelinid plants obtained using the maximum likelihood (ML) method with A. thaliana as a distant outgroup. Divergence times were estimated using the ‘mcmctree’ program incorporated in the PAML package.

Repeat and ncRNAs annotation

We identified ∼271.76 Mb repetitive elements in total, accounting for 53.13% of the genome (Table 1). Of these, DNA transposons, long terminal repeat, long interspersed nuclear elements and short interspersed nuclear elements accounted for 5.59%, 37.98%, 1.17% and 0.15% of the genome, respectively (Table 1). The fraction of repetitive sequences in the genome is comparable to other genomes, like those of grapevine (41%), castor bean (50%) and soybean (59%), but less than that seen in the genomes of sorghum (62%) and maize (85%). The final ncRNA annotation results included 456 miRNA, 75 tRNA, 1,488 rRNA and 160 snRNA sequences with average lengths of 165.23, 73.81, 398.98 and 141.29 bp, respectively, accounts for a small part of the genome (Supplementary Table S6). We generated 31,000 gene models in total. Coding sequences of the predicted genes are 4,864.39 bp long and have 5.63 exons, on average (Table 1). In addition, among 31,000 predicted protein-coding genes, 97.07% (30,092), 92.17% (28,574), 97.29% (30,160), 82.07% (25,442), 96.38% (29,879) and 46.23% (14,330) were obtained from searches against the InterPro, GO, TrEMBL, Swiss-PROT, NR and KEGG database, respectively. In total, functional annotations resulted in the assignment of putative functional annotations for 30,721 (99.1%) genes (Supplementary Table S7). BUSCO assessment indicated that 97.4% of the highly conserved plant genes (1,614 in total) are completely present in the genome (single, duplicated, fragmented and missed percentages: 87.4%, 10.0%, 1.0% and 1.6%, respectively) (Table 1). These results clearly show that the annotated gene set of S. japonica is relatively complete. From comparison with published genomes of the 16 species listed in Materials and methods, we found S. japonica shared 10,612 gene families with four other species (A. thaliana, M. truncatula, A. duranensis and G. max) and contained 2,350 unique gene families (Fig. 2C). Meanwhile, we identified 204 single-copy orthologous genes. Phylogenetic inferences based on 204 single-copy orthologous genes indicates that S. japonica separated from A.thaliana and G.max about 107.53 Ma and 61.24 Ma, respectively (Fig. 2D). After comparing gene families of all 17 species, we inferred that the S. japonica genome includes 3,288 extended families and 11,602 contracted families (Fig. 2D). Among them, 568 expanded and 325 contracted gene families were significant (P < 0.01, Supplementary Tables S8 and S9). GO enrichment analysis indicated that gene families related to ‘response to fungus (GO: 0009620)’, ‘regulation of root development (GO: 2000280)’, ‘regulation of response to stress (GO: 0080134)’, ‘flavonoid metabolic process (GO: 0009812)’, ‘defense response to oomycetes (GO: 0002229)’ have all expanded (Supplementary Table S8 and Fig. S2A), which may explain the high general environmental resistance of S. japonica. While gene families related to ‘terpene biosynthetic process (GO: 0051762)’, ‘sesquiterpene metabolic process (GO: 0051761)’, ‘ribosomal subunit (GO: 0044391)’, ‘alkaloid biosynthetic process (GO: 0035825)’ have all contracted (Supplementary Table S9 and Fig. S2B). Furthermore, the PAL TF family, which plays an important role in the process of plant resistance through regulating the synthesis of plant antibiotics, expanded from four in A. thaliana and five in M. truncatula to six in S. japonica genome (Supplementary Fig. S2C). In addition, we found the BBX TF family, which is widely involved in plant growth and development, and plays important role in abiotic stress response,, expanded from 32 members in A. thaliana and 24 in M. truncatula to 38 in S. japonica genome (Supplementary Fig. S2D). WGDs (polyploidization) have played a major role in the angiosperms’ evolutionary history. The Ks value for collinear gene pairs indicated that a WGD has recently occurred in the evolution of S. japonica, and evidence of the pan-legume WGD event was also observed in its genome (Fig. 3A).

Figure 3

Results of comparative genomic analyses. (A) Ks distribution of syntenic blocks. (B) Dot plot of syntenic blocks identified by MCscanX in the S. japonica genome. (C) MCscanX identified synteny blocks (involving ≥5 collinear genes) between S. japonica, G. max and M. truncatula. Intragenomic syntenic analysis of S. japonica revealed strong collinearity between large segments of different chromosomes (Fig. 3B). The widespread occurrence of one-to-one syntenic blocks confirms that WGD events have occurred in the evolutionary history of the S. japonica genome. Furthermore, we identified 2:2 S. japonica–G. max and 2:1 S. japonica–M. truncatula syntenic depth ratios (Fig. 3C). This clearly confirms that two WGD events have occurred in S. japonica’s evolution as there is documented evidence that G. max has undergone two WGD events and M. truncatula has undergone one. We further found that the most expanded genes which caused by WGD waves were enriched in ‘response to water (GO: 0009415)’, ‘response to salt stress (GO: 0009651)’, ‘response to abiotic stimulus (GO: 0009628)’ as well as ‘regulation of biosynthetic process (GO: 0009889)’ (Supplementary Tables S10 and S11). We generated short Illumina reads for 35 S. japonica individuals from 10 populations sampled in various parts of its range, with an average depth of ∼18× (Fig. 4A and Supplementary Tables S4 and S12). After mapping to the reference S. japonica genome and quality control, we obtained 6,250,321 high-quality SNPs with the mapping ratio and coverage of 97.92% and 92.35%, respectively, on average (Supplementary Table S12). The genetic diversity (π) of S. japonica was calculated to be 0.0034 based on the high-quality SNPs (Supplementary Table S13). Using S. tora as an outgroup, we explored the phylogenetic relationships among the 35 accessions by examining whole-genome genetic variations. The neighbour-joining phylogenetic tree we obtained showed that the 35 accessions cluster in a single group (Fig. 4D). According to bar plots of assignment probabilities from ADMIXTURE analysis and cross-validation analysis of the 35 individuals, the best number of clusters (K) is 1 (lowest CV error = 0.65). PCA-based analysis of S. japonica populations also clustered the individuals into a single group (Fig. 4B). In summary, there is no obvious differentiation among the geographical groups of S. japonica.

Figure 4

(A) Map showing current distribution of S. japonica. (B) Principal component analysis (PCA) plots showing scores of the first two principal components. (C) Demographic history of S. japonica, showing PSMC estimates of the species’ effective population size (Ne). The time scale on the x-axis was calculated assuming a mutation rate per generation (μ) of 3.65×e−9 and generation time (g) of 7 years. The time of the Naynayxungla glaciation is highlighted in grey vertical bars. (D) Neighbour joining tree obtained with Senna tora as the outgroup. Red and blue indicate species of S. japonica and S. tora, respectively. The strong selection pressure likely exerted on S. japonica through its long domestication process is expected to have substantially affected the effective population size (Ne) of the existing genetic clusters. To address this issue, we estimated Ne using the PSMC method (Li and Durbin, 2011) (Fig. 4C). The results indicate that the ancestral Ne of S. japonica peaked ∼2 Ma then continuously declined. It is obvious that during Naynayxungla glaciation period, S. japonica underwent a sharp decline.

4. Discussion

In this study, we report a high-quality, chromosome-level genome for S. japonica obtained by a combination of Nanopore sequencing and Hi-C technology. The genome survey results indicated that the heterozygosity of S. japonica genome was relatively high (1.19%) and similar to that of Pyrus bretschneideri (1.2%), Quercus lobata (1.25%), Jasminum sambac (1.5%) and Castanopsis tibetana (1.32%). The final assembled genome of S. japonica is 511.49 Mb in length, consisting of 110 contigs with a contig N50 length of 17.34 Mb, which is much higher than that of P. bretschneideri (35.7 kb), Q. lobata (∼18 kb) and C. tibetana (3.3 Mb), while comparable to that of J. sambac (17.5 Mb). Besides, it is also relatively high compared with other legumes, such as V.unguiculata (10.9 Mb),L.albus (1.76 Mb) and V.radiata (2.8 Mb). Further Hi-C scaffolding placed 91.62% of the assembled genome on 14 pseudochromosomes (Lachesis Groups) ranging in size from 24.80 to 60.53 Mb with an improved N50 length of 31.32 Mb. The BUSCO assessment indicated that 98.1% of the conserved genes could be detected in the assembled genome. With the highly complete genome assembly, we obtain a high-quality gene set with better contiguity and annotation, which contains 31,000 protein coding genes, 456 miRNAs, 75 tRNAs, 1,488 rRNAs and 160 snRNAs. Phylogenomic analysis based on 204 single-copy orthologous genes revealed that S. tora and S. japonica were placed successively sister to all other Fabaceae species, and the two diverged at 74.77 Ma. Our comparative genomics analyses revealed that S. japonica and other legume species shared the common legume WGD event, and experienced an independent WGD event which was first identified in our study (Fig. 2). WGD is recognized as a powerful enabler of plant genome expansion and evolution. WGD (or paleopolyploidy) has been shown to contribute significantly to plant adaptation during global environmental changes in the evolutionary history of angiosperms. For example, WTD events in Asteraceae, Lamiaceae and Apiaceae have contributed to the environmental adaptation during the Cretaceous period. Here, using the chromosome-scale genome of S. japonica, we first identified a species-specific WGD event, which was confirmed in G. max and M. truncatula genomes by Ks value profiles and collinearity analysis (Fig. 3). As Glycine-specific duplication is estimated to have occurred ∼13 Ma,, and legume-common tetraploidy event at ∼59 Ma, we calculated that a species-specific WGD of S. japonica occurred ∼19.42–29.32 Ma (Ks ∼0.21) (Fig. 2D). Besides, we found that gene families related to oxidoreductase activity, cellular response, defence response, response, regulation, positive regulation, negative regulation and immune responses have all expanded, most of which are caused by WGD waves (Supplementary Tables S10 and S11 and Fig. S2). We further examined members of TF families, such as BBX and PAL, which have been determined to be involved in abiotic stress responses, were increased in S. japonica (Supplementary Fig. S2C and D). We thus propose that WGD events may have increased S. japonica’s adaptability to diverse environmental factors and high resistance to pests and diseases. Re-sequencing of 35 S. japonica individuals from surrounding areas of Gansu detected no obvious differentiation in samples from different geographic regions (Fig. 4B and D), so we speculate that the sampled S. japonica trees may have originated from the same nursery base. It is reasonable to infer that the S. japonica planted throughout the country may stem from very limited sources. To further study the species’ genetic diversity, we calculated its π value and found that it is moderate compared with those of other species (Supplementary Table S13). However, this does not mean that the π value of S. japonica will not decrease in the near future. In particular, we notice that, the effective population size of S. japonica was continuously declined since 2 Ma (Fig. 4C). Thus, S. japonica from various sources or wild varieties should be planted to ensure that its genetic diversity does not decrease in the future. In conclusion, we have constructed a chromosome-level reference genome for S. japonica. The genome is 511.49 Mb long, with a contig N50 size of 17.34 Mb and 98.1% coverage according to BUSCO analysis, indicating that the assembly has high completeness. We also detected evidence of two WGD events in the species’ evolutionary history. The high-quality genome and whole genome resequencing data of S. japonica provide further foundations for comparative genomics and analyses of both the adaptation and evolution of dicotyledons.

Supplementary data

Supplementary data are available at DNARES online.

Funding

This study was supported by the National Natural Science Foundation of China (grant no. 32001085) and Fundamental Research Funds for Central Universities (grant no. lzujbky-2020-34, lzujbky-2020-ct02).

Conflict of interest

The authors have no conflict of interest to declare.

Data availability

Data acquired in this Whole Genome Shotgun project has been deposited in NCBI under project number PRJNA814452 and the BIG Data Center (http://bigd.big.ac.cn) under project number PRJCA005733. The genome assembly file is available at NCBI. All the annotation tables containing results of the draft genome analysis are available at figshare doi.org/10.6084/m9.figshare.14790132. Click here for additional data file.

67 in total

1. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies.

Authors: Brian J Haas; Arthur L Delcher; Stephen M Mount; Jennifer R Wortman; Roger K Smith; Linda I Hannick; Rama Maiti; Catherine M Ronning; Douglas B Rusch; Christopher D Town; Steven L Salzberg; Owen White
Journal: Nucleic Acids Res Date: 2003-10-01 Impact factor: 16.971

2. AUGUSTUS: a web server for gene finding in eukaryotes.

Authors: Mario Stanke; Rasmus Steinkamp; Stephan Waack; Burkhard Morgenstern
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

3. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors: Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Bioinformatics Date: 2015-06-09 Impact factor: 6.937

4. Chromosome-scale genome assembly of Castanopsis tibetana provides a powerful comparative framework to study the evolution and adaptation of Fagaceae trees.

Authors: Ye Sun; Jianling Guo; Xiaorong Zeng; Risheng Chen; Yi Feng; Shuang Chen; Kai Yang
Journal: Mol Ecol Resour Date: 2021-11-02 Impact factor: 7.090

5. Draft genome sequence of chickpea (Cicer arietinum) provides a resource for trait improvement.

Authors: Rajeev K Varshney; Chi Song; Rachit K Saxena; Sarwar Azam; Sheng Yu; Andrew G Sharpe; Steven Cannon; Jongmin Baek; Benjamin D Rosen; Bunyamin Tar'an; Teresa Millan; Xudong Zhang; Larissa D Ramsay; Aiko Iwata; Ying Wang; William Nelson; Andrew D Farmer; Pooran M Gaur; Carol Soderlund; R Varma Penmetsa; Chunyan Xu; Arvind K Bharti; Weiming He; Peter Winter; Shancen Zhao; James K Hane; Noelia Carrasquilla-Garcia; Janet A Condie; Hari D Upadhyaya; Ming-Cheng Luo; Mahendar Thudi; C L L Gowda; Narendra P Singh; Judith Lichtenzveig; Krishna K Gali; Josefa Rubio; N Nadarajan; Jaroslav Dolezel; Kailash C Bansal; Xun Xu; David Edwards; Gengyun Zhang; Guenter Kahl; Juan Gil; Karam B Singh; Swapan K Datta; Scott A Jackson; Jun Wang; Douglas R Cook
Journal: Nat Biotechnol Date: 2013-01-27 Impact factor: 54.908

6. The Medicago genome provides insight into the evolution of rhizobial symbioses.

Authors: Nevin D Young; Frédéric Debellé; Giles E D Oldroyd; Rene Geurts; Steven B Cannon; Michael K Udvardi; Vagner A Benedito; Klaus F X Mayer; Jérôme Gouzy; Heiko Schoof; Yves Van de Peer; Sebastian Proost; Douglas R Cook; Blake C Meyers; Manuel Spannagl; Foo Cheung; Stéphane De Mita; Vivek Krishnakumar; Heidrun Gundlach; Shiguo Zhou; Joann Mudge; Arvind K Bharti; Jeremy D Murray; Marina A Naoumkina; Benjamin Rosen; Kevin A T Silverstein; Haibao Tang; Stephane Rombauts; Patrick X Zhao; Peng Zhou; Valérie Barbe; Philippe Bardou; Michael Bechner; Arnaud Bellec; Anne Berger; Hélène Bergès; Shelby Bidwell; Ton Bisseling; Nathalie Choisne; Arnaud Couloux; Roxanne Denny; Shweta Deshpande; Xinbin Dai; Jeff J Doyle; Anne-Marie Dudez; Andrew D Farmer; Stéphanie Fouteau; Carolien Franken; Chrystel Gibelin; John Gish; Steven Goldstein; Alvaro J González; Pamela J Green; Asis Hallab; Marijke Hartog; Axin Hua; Sean J Humphray; Dong-Hoon Jeong; Yi Jing; Anika Jöcker; Steve M Kenton; Dong-Jin Kim; Kathrin Klee; Hongshing Lai; Chunting Lang; Shaoping Lin; Simone L Macmil; Ghislaine Magdelenat; Lucy Matthews; Jamison McCorrison; Erin L Monaghan; Jeong-Hwan Mun; Fares Z Najar; Christine Nicholson; Céline Noirot; Majesta O'Bleness; Charles R Paule; Julie Poulain; Florent Prion; Baifang Qin; Chunmei Qu; Ernest F Retzel; Claire Riddle; Erika Sallet; Sylvie Samain; Nicolas Samson; Iryna Sanders; Olivier Saurat; Claude Scarpelli; Thomas Schiex; Béatrice Segurens; Andrew J Severin; D Janine Sherrier; Ruihua Shi; Sarah Sims; Susan R Singer; Senjuti Sinharoy; Lieven Sterck; Agnès Viollet; Bing-Bing Wang; Keqin Wang; Mingyi Wang; Xiaohong Wang; Jens Warfsmann; Jean Weissenbach; Doug D White; Jim D White; Graham B Wiley; Patrick Wincker; Yanbo Xing; Limei Yang; Ziyun Yao; Fu Ying; Jixian Zhai; Liping Zhou; Antoine Zuber; Jean Dénarié; Richard A Dixon; Gregory D May; David C Schwartz; Jane Rogers; Francis Quétier; Christopher D Town; Bruce A Roe
Journal: Nature Date: 2011-11-16 Impact factor: 49.962