Literature DB >> 35946426

Genomic insights into genetic diploidization in the homosporous fern Adiantum nelumboides.

Yan Zhong¹, Yongbo Liu², Wei Wu³, Jingfang Chen¹, Chenyu Sun¹, Hongmei Liu⁴, Jiangping Shu⁵, Atsushi Ebihara⁶, Yuehong Yan⁵, Renchao Zhou¹, Harald Schneider⁴.

Abstract

Whole genome duplication has been recognized as a major process in speciation of land plants, especially in ferns. Whereas genome downsizing contributes greatly to the post-genome shock responses of polyploid flowering plants, diploidization of polyploid ferns diverges by maintaining most of the duplicated DNA and is thus expected to be dominated by genic processes. As a consequence, fern genomes provide excellent opportunities to study ecological speciation enforced by expansion of protein families via polyploidy. To test the key predictions of this hypothesis, we reported the de novo genome sequence of Adiantum nelumboides, a tetraploid homosporous fern. The obtained draft genome had a size of 6.27 Gb assembled into 11,767 scaffolds with the contig N50 of 1.37 Mb. Repetitive DNA sequences contributed with about 81.7%, a remarkably high proportion of the genome. With 69,568 the number of predicted protein-coding genes exceeded those reported in most other land plant genomes. Intragenomic synteny analyses recovered 443 blocks with the average block size of 1.29 Mb and the average gene content of 16 genes. The results are consistent with the hypothesis of high ancestral chromosome number, lack of substantial genome downsizing, and dominance of genic diploidization. As expected in the calciphilous plants, a notable number of detected genes were involved in calcium uptake and transport. In summary, the genome sequence of a tetraploid homosporous fern not only provides access to a genomic resource of a derived fern, but also supports the hypothesis of maintenance of high chromosome numbers and duplicated DNA in young polyploid ferns.

Entities: Chemical

Keywords: diploidization; ecological adaptation; genome assembly; homeologous chromosome pairing; whole genome duplication

Year: 2022 PMID： 35946426 PMCID： PMC9387920 DOI： 10.1093/gbe/evac127

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 4.065

Unlike flowering plants, polyploid ferns maintain high chromosome numbers and duplicated DNA. Thus, their diploidization is expected to be dominated by genic processes instead by processes enabling rapid reducing of the duplicated genome components. By sequencing and characterizing the whole genome of a tetraploid homosporous fern, Adiantum nelumboides, we provided genomic evidence to support the key predictions of this hypothesis.

Introduction

Consisting of approximately 10,578 species, ferns are not only the sister lineage to seed plants but also the second-largest group of vascular plants (Smith et al. 2006; PPGI 2016). As their origin in the Paleozoic, they are major components in most terrestrial ecosystems (Mehltreter et al. 2010). Existing evidence supports the hypothesis of rather distinct trends in the genome evolution of ferns compared with other land plants especially to angiosperms as illustrated by recorded characteristics of fern genomes (Klekowski and Baker 1966; Wagner and Wagner 1980; Haufler 1987; Barker 2009; Barker and Wolf 2010; Clark et al. 2016; Sessa and Der 2016; Hidalgo et al. 2017a; Huang et al. 2020; Fujiwara et al. 2021). (1) Ferns accumulate large chromosome numbers including the largest known chromosome number. (2) They possess medium-to-large-sized genomes including the second-largest genome recorded. (3) These two characters, chromosome number (n) and genome size (1C), are positively correlated. (4) Among land plants, they show the highest frequency of polyploidy enforced speciation. (5) Some evidence supports ancient polyploidy events in the evolutionary history of ferns but arguably they were less common than proposed in the past. These characteristics together with various lines of evidence ranging from cytology, population genetics, genomics, and proteomics support the hypothesis that the post-whole-genome duplication (WGD) shock response of ferns is highly distinct from angiosperms. Diploidization is mainly achieved by genic process instead of cytological processes enabling rapid genome downsizing as reported from many derived angiosperms (Leitch and Bennett 2004; Wang et al. 2021). Instead, polyploid fern genomes arguably maintain most of the duplicated DNA and thus accumulate large chromosome numbers, a large number of pseudogenes, and a high proportion of noncoding DNA (Wolf et al. 2015; Clark et al. 2016; Marchant et al. 2019). Our knowledge on the genomic diversity of seed plants and most other plant lineages have been rapidly improved since the publication of the first plant genome of Arabidopsis thaliana in 2000 as a consequence of hundreds of sequenced whole genomes that have been made available in recent years (Sun et al. 2022). In contrast, only two heterosporous ferns, Azolla filiculoides and Salvinia cucullata, and one homosporous fern, Alsophila spinulosa, have been fully sequenced to date (Li et al. 2018; Huang et al. 2022). Two sequenced heterosporous ferns diverge from the majority of ferns by having relative small genomes (750 Mb for Az. filiculoides and 260 Mb for S. cucullata), whereas most homosporous fern lineages accumulate medium-to-large genomes (Clark et al. 2016; Fujiwara et al. 2021), including the second-largest genomes ever recorded, the genome of Tmesipteris obliqua (Hidalgo et al. 2017a, 2017b). In a whole-genome shotgun sequencing project, nuclear genome contigs were reported for a homosporous fern, Ceratopteris richardii (Marchant et al. 2019). However, this incompletely assembled genome with a coverage of only 38% genome assembled is insufficient to comprehensively address questions regarding genome structure, content, and evolution in homosporous ferns. Despite the above-mentioned efforts to obtain a whole genome of C. richardii, we still lack a completely assembled genome for polyploid ferns that contribute >80% of the extant diversity (PPGI 2016). Therefore, understanding the evolution of ferns in particular and vascular plants in general is incomplete due to the lack of a whole-genome sequence of one or several derived ferns. Genome sequencing of ferns with typical large genome size and high chromosome number is essential for a comprehensive understanding of fern biology. Instead of hunting for the smallest fern genome, this study focused on a rather typical polyploid homosporous fern. Here we fully sequenced, assembled, and annotated the genome of a homosporous fern, Adiantum nelumboides X. C. Zhang that was previously recorded as Adiantum reniforme var. sinense Y. X. Lin. The genus Adiantum belongs to Pteridaceae, which comprises ca. 1211 species over 10% of extant ferns (PPGI 2016). Species in this family not only occupy a wide range of niches, including terrestrial, epiphytic, xeric-adapted rupestral, and even aquatic habitats, but also comprise many highly specialized species (Schuettpelz et al. 2007). The species studied here was firstly discovered in Chongqing, China in 1978 (Lin 1980). It has a narrow distribution along the Yangtze River from Shizhu County to Wanzhou District of Chongqing (Tianquan et al. 1987) and usually occurs in exposed karst rocks and rocky crevices (Kang et al. 2008;fig. 1). Because of its sparse distribution, limited population number, and small population size, Ad. nelumboides was listed as a critically endangered species in Threatened Species List of China’s Higher Plants (Qin et al. 2017). In addition to the loss of habitats, over-collection by local people as a consequence of its medicinal value has further reduced the population size of Ad. nelumboides (Kang et al. 2008).

Fig. 1.

Adiantum nelumboides. (a) habitat of a natural population in Shizhu County, Chongqing, China. (b) An individual in this population. (c) Adaxial frond surface with sporangia on the margin. (d) Abaxial frond surface with sporangia on the margin. This ecologically highly specialized species has a chromosome number of 2n = 120 (Lin 1989) and a holoploid genome size of 7.39 Gb based on flow cytometry analysis (Fujiwara et al. 2021). Considering the high conserved basic chromosome number of Adiantum of x = 30, Ad. nelumboides is interpreted as a tetraploid, whereas the sister taxon Ad. reniforme occurring in Canary Islands and Madeira has been recorded as hexaploid (6x) with 2n = 180 (Wang et al. 2015) or a decaploid (10x) with 2n = 300 (Manton and Vida 1968). The study was designed to confirm several predictions about the genomes of polyploid fern species that achieved to re-establish homeosis and fully functional meiosis after the WGD event. Firstly, the characteristics of the generated genome are expected to be consistent with the post-WGD diploidization without genome downsizing including maintenance of high chromosome number enabling homeologous chromosome pairing during meiosis (Grusz et al. 2017). Thus, the genome is expected to possess a large number of genes and noncoding DNA especially transposable elements, as observed in other ferns (Wolf et al. 2015). Components relevant to gene regulation such as microRNA (miRNA) and small nuclear RNA (snRNA) are also expected to be present with a high copy number. Consistent with reports on the low frequency of large syntenic gene blocks in Ceratopteris (Nakazato et al. 2006; Marchant et al. 2019), the genome of Ad. nelumboides is expected to lack conserved large synteny blocks of genes. This expectation is in conflict to the assumed conservation of large synteny blocks as the consequence of chromosome duplication in newly formed polyploids (MacKintosh and Ferrier 2017). Finally, we predict the occurrence of a large number of protein-coding genes regulating calcium transport and uptake, which enable the adaptation of Ad. nelumboides to grow on limestone rocks—arguably contributed a lot by the WGD

Results

Genome Assembly and Annotation

We generated 834 Gb PacBio reads and 389 Gb of Illumina paired-end 150-bp reads from an Ad. nelumboides individual for genome assembly (supplementary table 1, Supplementary Material online). The genome size was estimated to be 5.94 Gb, the heterozygosity was estimated to be 0.26% and the repeat content was estimated to be 59.0% using K-mer statistics of a subset (235 Gb) of the Illumina reads (supplementary fig. 1, Supplementary Material online). The genome assembled with the PacBio reads had a total length of 6.27 Gb sorted into 11,767 contigs with an N50 contig size of 1.37 Mb (table 1). The Benchmarking Universal Single-Copy Orthologs (BUSCO) assessment showed that 412 of the 425 BUSCO genes (96.9%) in the viridiplantae_odb10 data set were recovered in the assembled genome, including 35.3% single-copy BUSCO genes and 61.6% duplicated BUSCO genes (supplementary table 2, Supplementary Material online). The high fidelity of the assembly was supported by the high genome coverage rate of 97.3% and high mapping rate of 94.8% of Illumina reads. The overall read-mapping rates for frond and rhizome transcriptomes were 92.4% and 86.0%, respectively. These observations suggest high quality of the genome assembly. The overall Guanine+Cytosine content of Ad. nelumboides is 40.2%, which is comparable with those of other ferns, ranging from 37.9% to 42.9%.

Table 1

Statistics of the Genome Assembly for Adiantum nelumboides

Assembly Features
Contig length (bp)	6,272,116,485
Contig number	11,767
N50 (bp)	1,373,929
L50	1,324
N90 (bp)	271,030
L90	5,099
Guanine+Cytosine content	40.2%
Repeat content (% of the genome assembly)	81.7
Number of predicted gene models	69,568
Average coding sequence length (bp)	1,309.9
Average exon number per gene	4.1
Average exon length (bp)	328
Average intron length (bp)	3,380

Statistics of the Genome Assembly for Adiantum nelumboides Noncoding repetitive DNA contributed about 81.7% of the assembled genome (5.12 Gb; table 1), which is a much higher proportion than those reported for two heterosporous ferns, containing 53.6% in Azolla and 44.5% in Salvinia and even exceeding the 74.6% reported for the homosporous tree fern Als. spinulosa. Long terminal repeat (LTR) retrotransposons contributed 54.1% of the genome, with Ty3-gypsy (46.6% of the genome) and Ty1-copia (7.4% of the genome) being the most abundant types, and DNA transposons contributed 3.1% of the genome (supplementary table 3, Supplementary Material online). A total of 59,841 intact LTRs were obtained and most of the intact LTRs (99.9%) were inserted in the relative recent phylogenetic past (0–15 Ma), with the highest proportion of LTRs inserted ca. 0.46 Ma (supplementary fig. 2, Supplementary Material online). The Ad. nelumboides genome contained 69,568 protein-coding genes based on de novo prediction, transcriptome sequences, and homology with other known plant proteins (table 1). The majority (94.6%) of the predicted genes were functionally annotated in at least one public database (supplementary table 4, Supplementary Material online). The BUSCO assessment for the predicted proteome showed that 87.1% complete BUSCO genes was recovered in the viridiplantae_odb 10 data set (supplementary table 2, Supplementary Material online), higher than 72.1% detected in Als. spinulosa from eukaryote_odb database. The mapping rates of frond and rhizome RNA-seq to annotated coding sequences were 85.8% and 84.5%, respectively. The average exon and intron sizes were 328 and 3,380 bp, respectively (table 1). The average intron size of Ad. nelumboides is among the largest reported in plants, compared with those observed in the genomes of conifers (Nystedt et al. 2013; Niu et al. 2022). Nearly all (99.0%) long introns (>3000 bp in size) contained repeats, suggesting repeat insertion and expansion contributed to intron size expansion. A total of 2,586 copies of tRNAs, 621 copies of ribosomal RNAs (rRNAs), 736 copies of snRNAs, and 9,453 copies of miRNAs (in 22 miRNA families) were identified (supplementary tables 5 and 6, Supplementary Material online). Among the 22 miRNA families, 4,139 copies of miRNA408 accounted for 43.8% of all miRNA copies, which is the largest copy number for a miRNA family. Other two miRNA families also showed abundant copies, with 2,463 for miRNA287 and 1,063 for miRNA672, in the genome of Ad. nelumboides, whereas miR156/157, miR170/171, miR396, miR165/166, miR159, miR160, miR168, and miR169 belong to the conserved miRNA families previously identified in plants (Berruezo et al. 2017; You et al. 2017).

Intragenomic Synteny and WGD

A total of 443 intragenomic syntenic blocks in the genome were identified, and they contain 14,996 genes and 3,485 gene pairs. On average, each syntenic block contains eight homologous gene pairs, with the longest block containing 23 gene pairs. With a small fraction of genes having more than one counterpart, these 3,485 gene pairs include 6,928 genes, indicating that approximately 10% of the annotated Ad. nelumboides genes exhibit synteny-based signals. The peak value of Ks (the number of substitutions per synonymous site) distribution for all paralogous gene pairs within the 443 blocks is 0.063 (fig. 2 and supplementary fig. 3, Supplementary Material online), suggesting a very recent WGD event shaping the genome of this fern. Applying a synonymous substitution rate of 4.79 × 10−9 in polypodiaceous nuclear genomes, this WGD event was dated to approximately 6.6 Ma, whereas the absolute dating approach based on phylogenetic analysis showed that the WGD event occurred at 10.7 Ma (7.5–12.6 Ma).

Fig. 2.

Frequency distribution of synonymous substitution rate (Ks) between gene pairs on syntenic blocks in the genome of Adiantum nelumboides. The x axis shows the Ks with a peak corresponding to 0.063, and the y axis represents the number of gene pairs. Only the distribution of Ks values of 0–0.5 was shown here. With 69,568 genes in the 6.27 Gb genome, its average gene density is about 1 gene per 90 kb. On average, the segment length for five genes is [the default minimum number of gene pairs for syntenic blocks (m) is 5 in MCScanX] about 450 kb. Because there are a lot of genomic contigs <450 kb, they may escape the detection of syntenic block identification. We then set m to 4, 3, and 2 separately, and a very limited increase for the number of syntenic blocks as well as the number of genes on these blocks was observed (supplementary table 7, Supplementary Material online). In the whole genome, a large number of pseudogenes (89,510) were identified in the intergenic spacers, and they correspond to 13,373 genes. After excluding the internal stop codons from the pseudogenes, the Ks distribution between pseudogenes and their corresponding genes showed a Ks peak at 0.068 (supplementary fig. 4, Supplementary Material online), which is very close to the Ks peak related to the WGD (0.063). Because pseudogenes usually exhibit an accelerated mutation rate due to the relaxation of selective constraints, we considered pseudogenes with a Ks value <1.5-fold of 0.063 from their corresponding genes as those produced after the WGD event. According to this criterion, 8,604 pseudogenes (9.6%), which correspond to 2,174 genes (16.3%), arguably originated after the detected WGD. Among the 8,604 pseudogenes, 1,548 were present in the syntenic blocks and corresponded to 661 genes. The 1,548 pseudogenes were most promising candidates originated after the detected WGD. This implies that there have been quite a few gene silencing events in the genome since the WGD. Many instances of gene rearrangements in the syntenic blocks of Ad. nelumboides were found (see fig. 3 for examples), besides a great number of pseudogenes identified in the intergenic spacers. Given this data, pseudogenization of one homeolog was common in the identified syntenic blocks. For example, no corresponding annotated duplicates for the two genes, Ane38759 and Ane38765, were found in the collinear block alignment #102, but their pseudogenes were detected (fig. 3). Ks between the two genes and their pseudogenes in the syntenic blocks (only for the translatable regions) were 0.058 and 0.059, respectively, resembling the Ks peak value of 0.063 identified for the WGD event.

Fig. 3.

Examples for homeolog loss due to pseudogenization after the recent whole-genome duplication. Five syntenic blocks are shown. Each color bar represents a gene and the gene names are coded consecutively. The gene names at the beginning and end are labeled. Blue and red bars represent genes in the syntenic blocks and only gene pairs are connected with solid green lines. Yellow bars represent pseudogenes formed after the recent whole-genome duplication, which are connected with corresponding genes with dotted green lines.

Gene Supporting the Adaptation to Limestone Habitats

Given the preference of this species as well as several other species of Adiantum to grow on limestone rocks, the study focused specifically to identify gene families considered to play a role in the adaptation to the specific needs of these habitats. These observations will enhance the usage of this species as a model to study the role of genes potentially supporting the adaptation of these plants to calcium-rich habitats. Genes from 20 species were clustered into 46,961 gene families with two or more members. We detected 5,840 gene families that were significantly expanded (P < 0.05) and 447 gene families that were significantly contracted (P < 0.05) in Ad. nelumboides (fig. 4). Although there are a large number of expanded gene families in Ad. nelumboides, functional enrichment analysis showed that neither the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways nor the GO terms are significantly enriched for these expanded gene families. There was one KEGG pathway, ascorbate and aldarate metabolism, and 169 GO terms are significantly underrepresented in Ad. nelumboides (Fisher’s exact test, P < 0.05; supplementary Excel file 1, Supplementary Material online). The underrepresented GO terms include l-ascorbic acid biosynthetic process, l-ascorbic acid metabolic process, response to reactive oxygen species, vitamin biosynthetic process, vitamin metabolic process, etc.

Fig. 4.

Gene family expansion/contraction analysis for Adiantum nelumboides and 19 other species. Phylogeny of these species is reconstructed based on single- or low-copy orthologous genes. The number of expanded/contracted gene families is indicated at each node and after species. Five protein families related to calcium uptake and transport were detected in the genome of Ad. nelumboides (table 2 and supplementary Excel file 2, Supplementary Material online), including Ca2+-ATPase proteins, calmodulins, IQD gene family of calmodulin binding, calmodulin-binding receptor-like kinase proteins, and calcineurin B-like calcium sensor. There were also three potassium transport-related protein families expanded, namely potassium proton antiporter family, potassium transporter family, and potassium channel family (table 2). In addition, we found three salt-related protein families in Ad. nelumboides, including Na+/H+ exchanger subfamily, another Na+/H+ exchanger subfamily, and vacuolar sodium/proton exchanger family (table 2). In 8 of these 11 gene families, 13–58% members likely originated in the WGD based on pairwise Ks analysis.

Table 2

Expanded Gene Families Related to Calcium, Potassium, and Sodium Uptake and Transport in the Adiantum nelumboides Genome

Category	OrthoID	Gene Family Description	Number of Members in Adiantum nelumboides	Number of Members likely Produced by the WGD (%)	Average Number of Members in Other Species[a]	Reference for Gene Family Description
Calcium uptake and transport	OG0000331	Ca²⁺-ATPases proteins: Ca²⁺-efflux transporters responsible for maintaining homeostasis of cytosolic Ca²⁺ concentration	23	13 (57%)	7	Huda et al. (2013)
	OG0000341	Calmodulins: predominant calcium receptors and small Ca²⁺ binding protein that acts to transduce second messenger signals into a wide array of cellular responses	11	0	7	Chin and Means (2000)
	OG0000187	IQD gene family of calmodulin binding: linking calcium-signaling pathways to the regulation of gene expression	19	8 (42%)	10	Abel et al. (2005)
	OG0000729	Calmodulin-binding receptor-like kinase proteins	15	2 (13%)	5	-
	OG0000245	Calcineurin B-like calcium sensor: Ca sensors that interact with CIPK for stress responses, including mediating salt tolerance by regulating ion homeostasis in Arabidopsis	14	6 (43%)	9	Mao et al. (2016), Deng et al. (2013), Yu et al. (2014), Jin et al. (2016)
Potassium transport	OG0001821	Potassium proton antiporter family	6	2 (33%)	3	–
	OG0002459	Potassium transporter family	6	2 (33%)	2	–
	OG0002487	Potassium channel family	6	0	2	–
Salt tolerance	OG0000437	Na⁺/H⁺ exchanger subfamily	12	6 (50%)	6	Qiu et al. (2003)
	OG0002550	Na⁺/H⁺ exchanger subfamily	12	7 (58%)	2	Qiu et al. (2003)
	OG0000946	Vacuolar sodium/proton exchanger family	7	0	4	Glenn et al. (1999)

Number of members in other species can be found in Supplementary Excel file 2, Supplementary Material online.

Expanded Gene Families Related to Calcium, Potassium, and Sodium Uptake and Transport in the Adiantum nelumboides Genome Number of members in other species can be found in Supplementary Excel file 2, Supplementary Material online.

Discussion

Rapid Genome Evolution Following a Recent WGD

Given the conserved basic chromosome number of n = 30 or less frequent n = 29 (Rice et al. 2015), Ad. nelumboides is with 2n = 120 a tetraploid (supplementary fig. 5, Supplementary Material online). Consistently, the genome size of 7.39 Gb corresponds to 4n given the reported genome size of diploids such as the genome size of 3.78 Gb of Adiantum caudatum (Kuo and Li 2019), a closest relative with available genome size information. Furthermore, according to the results of the synteny analysis, Ad. nelumboides has undergone a WGD. The separation of Ad. nelumboides from its sister species Ad. reniforme has been dated back to about 4.94 Ma (Wang et al. 2015), so the establishment of the WGD event (6.6/10.7 Ma) occurred before their divergence. We did not find the signal of two earlier WGDs predating the origins of core leptosporangiate and Polypodiales, respectively. These events have been suggested before but these claims are controversial (Li et al. 2018; Huang et al. 2020). The very recent origin of this WGD event appears to be in contrast with relatively short syntenic blocks and a small fraction of genes on these blocks. Low gene density is the most plausible explanation (1 gene per 90 kb) in the Ad. nelumboides genome. Changes in the minimum number of gene pairs used for identifying syntenic blocks have little effect on the results of syntenic blocks, which implied that syntenic blocks in the Ad. nelumboides genome are indeed short and the proportion of genes on these syntenic blocks is indeed small. These observations are consistent with the hypothesis that the Ad. nelumboides genome has experienced rapid genome evolution following a rather recent WGD event. This argument is inconsistent with a previous suggestion that fern genomes are thought to evolve slowly and be less dynamic in general (Leitch and Leitch 2013; Clark et al. 2016). Rapid genome reorganization has also been reported in the polypoid fern, C. richardii. This distantly related species, which also has a relative large genome (11.25 Gb) and has experienced a WGD event (Barker 2009). Its genome does not show evidence of large-scale synteny based on a high-resolution genetic linkage map, although 76% loci were duplicated (Nakazato et al. 2006). While these results may be less unexpected in the genome of C. richardii (Barker 2009; Barker and Wolf 2010), our findings for rapid genome evolution following such a recent WGD event in Ad. nelumboides are most striking. This may explain why genomic evidence for polyploidy is not as evident in homosporous ferns as expected in the past. Compared with most angiosperms showing evidence for rapid genome downsizing after WGDs, the Ad. nelumboides genome does not show evident genome downsizing. In contrast, the characteristics of the Ad. nelumboides genome suggest rapid evolution at segmental and genic levels. The lack of evidence for extensive polyploidy in homosporous ferns with high chromosome numbers has been referred to as the “polyploidy paradox” (Soltis and Soltis 2000). For example, many homosporous ferns with high chromosome numbers exhibit diploid gene expression at isozyme loci (Haufler and Soltis 1986; Soltis 1986; Soltis and Soltis 1988a, 1988b; Gastony 1991). Soltis and Soltis (2000) considered two explanations for this paradox. In the first explanation, ancient polyploidy in ferns coincides with extensive gene silencing to produce genetic diploids, whereas the second explanation assumes frequent chromosomal fission resulting in high chromosome numbers. The second explanation does not hold for Ad. nelumboides because its genome size is roughly two-fold as large as its diploid relatives. Although Ad. nelumboides is a tetraploid species, a previous microsatellite study suggested that this species behaves a genetic diploid (Kang et al. 2008). Rapid chromosomal and genomic changes following WGD (Song et al. 1995; Leitch and Bennett 1997; Chester et al. 2012; Soltis et al. 2012; Chester et al. 2015), including karyotypic variation (intrachromosomal and intragenomic rearrangements) and gene silencing (and even loss), may explain genetic diploidization in Ad. nelumboides. In fact, pseudogenization and gene deletion by recombination have been considered to be major mechanisms for fractionation, as a particularly important component of diploidization (Li et al. 2021). A great number of pseudogenes, including pseudogenization of one homeolog, are detected in the Ad. nelumboides genome, which is consistent with the genetic diploidization prediction. Moreover, with a long interval between neighboring genes in the Ad. nelumboides genome, the chance for recombination between genes is higher than other species with compact genomes. Correspondingly, more genomic rearrangements are expected between neighboring genes and very shorter syntenic blocks and even the loss of syntenic blocks in the Ad. nelumboides genome supports this. Meanwhile, a high level of recombination should be expected for Ad. nelumboides with such a high chromosome number of 2n = 120, because increased chromosome numbers in nuclear genomes tend to cause increased genetic recombination (Qumsiyeh 1994). Overall, these data suggest rapid and substantial gene rearrangements and gene silencing following polyploidy in Ad. nelumboides. Thus, our genome fits well with the first explanation proposed by Soltis and Soltis (2000).

Genetic Adaptation to High-Calcium Habitats

Adiantum nelumboides usually occurs in thin soils on limestone or purplish shale rocks and rocky crevices (Tianquan et al. 1987; Kang et al. 2008). Limestone is a sedimentary rock composed primarily of calcium carbonate (CaCO3). Soils developed from purplish shale contain relatively high contents of calcium, potassium, and sodium other than silicon, aluminum, and iron (Du et al. 2013). Expansion and contraction of gene families have been suggested as major mechanisms underlying phenotypic differences as well as the key contributor to adaptive evolution (Purugganan et al. 1995; Lashbrook et al. 1998). Five protein families related to calcium uptake and transport in Ad. nelumboides play various roles related to maintaining homeostasis of cytosolic Ca2+ concentration and regulating gene expression and cellular responses via calcium-signaling pathway. Potassium (K+) transport is critical for enzyme activation, osmotic adjustment, turgor generation, cell expansion, regulation of membrane electric potential, and pH homeostasis (Hawkesford et al. 2012). In plants, Na+/H+ exchangers in the plasma membrane are critical for growth in high levels of salt, removing toxic Na+ from the cytoplasm by transport out of the cell (Qiu et al. 2003). Vacuolar sodium/proton exchanger, on the other hand, plays essential role in transporting Na+ from the cytoplasm to vacuoles (Glenn et al. 1999). These protein families related to calcium uptake and transport, potassium, and sodium transport are expected to contribute to the adaptation of Ad. nelumboides to the harsh habitats. Ascorbate has a role in protecting against reactive oxygen species produced during photosynthesis (Awad et al. 2015). The KEGG pathway of ascorbate is significantly underrepresented in Ad. nelumboides. The underrepresented GO terms, including l-ascorbic acid biosynthetic process and l-ascorbic acid metabolic process, are related to reactive oxygen species, vitamin biosynthetic process, vitamin metabolic process, etc. The underrepresented KEGG pathway and GO terms in Ad. nelumboides are consistent with its ecological characteristics, given that it inhabits shady habitats with relatively weak sunlight intensity and that less reactive oxygen species are expected to be produced. Furthermore, Ad. nelumboides has a large copy number of miRNA408, which might be associated with its adaptation in the barren rocky habitats with various abiotic stresses. The miRNA408 family has been considered to be major contributors to abiotic stress tolerance through regulation of redox status during heat, cold, salt, and oxidative stress conditions in Arabidopsis (Ma et al. 2015).

Conclusions

The genome of the tetraploid Ad. nelumboides provides important insights into the genomic structure of polyploid ferns that are geologically relatively young. The recovered structure of the genome is highly consistent with the first explanation of the fern genome paradox proposed by Soltis and Soltis (2000). The WGD has resulted in a highly expanded genome by conserving the vast majority of the duplicated DNA and its packaging in chromosomes. However, several genic processes including gene silencing and expansion of regulation mechanisms enabled these genomes to establish homeostasis. In turn, the genome of this rare fern also supports the prediction that some of the expanded gene families contribute to the adaptation to stressful environments such as calcium-rich substrates. Besides providing the key to resolve the fern genome paradox, the complete genome of a derived homosporous fern also provides crucial information about the evolution of ferns and their relatives.

Materials and Methods

Sampling and Sequencing

An adult individual of Ad. nelumboides from a wild population in Shizhu County, Chongqing, China was collected and cultivated at the Shanghai Chenshan Botanical Garden, Shanghai, China. The voucher specimen (YYH15116) has been deposited in the herbarium of Shanghai Chenshan Botanical Garden (CSH). Several mature leaves were used for DNA isolation and library construction. Genomic DNA was isolated using the HiPure Plant DNA Mini Kit (Magen, Guangzhou, China). DNA integrity was tested on 1% agarose gel, whereas DNA purity was tested and quantified using Qubit Fluorometer (Invitrogen, Carlsbad, CA, USA). The genome libraries with an insert size of 30 kb were constructed and then sequenced on the PacBio Sequel II System based on the single-molecule real-time (SMRT) sequencing technology. To estimate the genome size and polish the genome assembly, a short genome fragment library with an insert size of 350 bp was also constructed and sequenced on the Illumina Novaseq platform with 150 bp paired-end reads obtained. All Illumina reads were filtered using Trimmomatic v0.39 (Bolger et al. 2014) with default parameters. Frond and rhizome tissues used for RNA isolation and transcriptome sequencing were sampled from the specimen mentioned above. RNA-sequencing libraries were constructed using the Illumina TruSeq RNA Sample Prep Kit. The paired-end libraries with an insert size of 300 bp were sequenced on an Illumina Novaseq platform. For each tissue, about 8 Gb sequence data were generated (supplementary table 1, Supplementary Material online).

Genome Size Estimation

We used Jellyfish v.2.2.10 (Marçais and Kingsford 2011) to obtain k-mers and calculate k-mers frequency for a subset (235 Gb) of the Illumina data. Genome size, heterozygosity, and repeat content were estimated based on k-mer (k = 25) frequency distributions by GenomeScope (http://qb.cshl.edu/genomescope).

Genome Assembly

The PacBio reads obtained were corrected, trimmed, and assembled into contigs using Canu v1.5 (Koren et al. 2017) with default parameters. The primary assembly was polished by referring to the Illumina reads with Pilon (Walker et al. 2014) with default parameters.

Identification of Repetitive Sequences

Known repeat sequences were identified by RepeatMasker version 4.0.3 (www.repeatmasker.org) with the Repbase library (Jurka et al. 2005). A de novo repeat library was constructed using RepeatModeler version 1.0.7 (RepeatModeler Open-1.0. 2008–2015). The results from different methods were merged to generate a final nonredundant set of repetitive sequences. The repeat-masked genome sequences were used for subsequent gene prediction. LTRs were predicted by combining LTR_finder_parallel (Ou and Jiang 2019) and LTRharvest (Ellinghaus et al. 2008). The final identification was achieved using by LTR_retriever (Ou and Jiang 2018). For intact LTR transposons, their insertion times were estimated based on the genetic divergence between the two LTRs, assuming their sequences were identical upon integration. Using the formula T = K/(2*r), where K and r denote the divergence and mutation rate. A synonymous substitution rate of 4.79 × 10−9 per site per year for polypodiaceous nuclear genomes was applied (Barker 2009).

Gene Prediction and Annotation

We predicted protein-coding genes using a combination of homologous sequence search, ab initio gene prediction, and transcriptome-data comparison with the genome sequence implemented in an automatic genome annotation tool GETA v2.4.5 (https://github.com/chenlianfu/geta). Illumina RNA-seq reads from different tissues were mapped to the genome assembly using HISAT2 (Kim et al. 2019). Protein sequences of two heterosporous ferns (Az. filiculoides v1.1 and S. cucullata v1.2 available at https://www.fernbase.org/), and three flowering plants (Populus trichocarpa v4.1, Ar. thaliana TAIR10 and Oryza sativa v7.0 accessed via Phytozome version 13; Goodstein et al. 2012) were employed for homology-based prediction with GeneWise (https://www.ebi.ac.uk/∼birney/wise2/). Ab initio prediction was performed in Augustus v3.3.3 (Stanke and Morgenstern 2005), trained with intron and exon information generated above. These prediction results were integrated and then were searched against the Pfam database for screening to get the final gene prediction result. The rRNAs, miRNAs, small nucleolar RNAs, and snRNAs were predicted using INFERNAL version 1.1rc4 (Nawrocki and Eddy 2013) against the Rfam database version 11.0 (Burge et al. 2012). The transfer RNAs (tRNAs) were identified using tRNAscan-SE (Lowe and Eddy 1997) with the parameters: -X 20 -z 8. Gene functions were assigned based on the best matches in the alignments using Blastp (Altschul et al. 1997) with the SwissProt database. Functional annotation of genes was also performed by using InterproScan (Jones et al. 2014) and eggnog-mapper (http://eggnog-mapper.embl.de/). The SwissProt, eggNOG, and InterPro annotation results were then integrated.

Genome Assembly and Annotation Quality Assessment

The quality of the genome assembly and annotation was assessed by QUAST v5.1.0 (Gurevich et al. 2013), BUSCO (Simão et al. 2015), and the mapping rates of Illumina DNA-seq and RNA-seq reads. QUAST was employed to calculate N50, L50, N90, and L90, whereas Illumina reads were mapped to the genome assembly using BWA-mem v0.7.17 (Li and Durbin 2009) with default parameters. The mapping rates were summarized using the flagstat module in SAMtools (Li et al. 2009). RNA-seq reads was aligned to the genome and the predicted protein-coding sequences, and the mapping rate was recorded using the flagstat module in SAMtools and HISAT2 (Kim et al. 2019). The completeness of the genome assembly and predicted protein-coding sequences were further evaluated using the BUSCO viridiplantae_odb10 database with default parameters.

Identification of Collinear Blocks and WGD Analysis

All-versus-all alignment of the protein sequences of Ad. nelumboides was constructed using the Blastp algorithm (Altschul et al. 1997). To detect the signature of WGD, the program MCScanX (Wang et al. 2012) with default parameters was used to define duplicated blocks. For the default setting, at least five genes (m = 5) in a block were required to call synteny. For each gene pair in the duplicated blocks, the Ks values were calculated using the YN00 method in PAML4 (Yang 2007) and the distribution of Ks values was plotted. To assess the influence of m on the number of syntenic blocks, we also set m to be 4, 3, and 2. Homeolog loss after the WGD caused by pseudogenization was checked for the selected collinear block pairs. To estimate the absolute time of the WGD, gene pairs in the syntenic blocks with Ks ranging from 0.043 to 0.083 (0.063 ± 0.02) were selected. Protein sequences of three related species (Pteris vittata, Aleuritopteris chrysophylla, and Ad. caudatum), which were obtained from transcriptome sequencing, were downloaded from GigaDB (Shen et al. 2018). Orthologs were searched for Ad. nelumboides (using only one of the two homeologs produced by the WGD) and the three related species with OrthoFinder (Emms and Kelly 2019). Single-copy orthologs of the three species and the homeologs of Ad. nelumboides produced by the WGD were aligned using MAFFT v7.0 (Katoh and Standley 2013). RAxML-NG (Kozlov et al. 2019) was used to construct the maximum likelihood tree for each protein. Trees inconsistent with the species phylogeny were discarded. Using the divergence time of Pt. vittata and Ale. chrysophylla (Huang et al. 2020) as a calibration point, the divergence time between the homeologs in Ad. nelumboides was estimated using the mcmctree program implemented in PAML.

Pseudogene Identification and Analysis

To enable the identification of putative pseudogenes, protein sequences of Ad. nelumboides were first aligned with the repeat-masked genome using TBLASTN (Altschul et al. 1997). Pseudogenes were then identified following the PseudogenePipeline (https://github.com/ShiuLab/PseudogenePipeline). Only sequences >300 bp in length were kept as pseudogenes. Then, we extract-coding sequences of the identified pseudogenes and their corresponding genes and translated them into amino acid sequences. To calculate Ks between each pseudogene and its corresponding gene, the internal stop codons of pseudogenes were removed. For each pseudogene and its corresponding gene, their protein sequences were aligned, and then converted into nucleotide sequence alignments by ParaAT (Zhang et al. 2012). Ks was calculated for each pseudogene and its corresponding gene using KaKs_Calculator 2.0 (Wang et al. 2010), and then the Ks distribution was plotted.

Gene Family Analysis

Protein-coding genes from 19 other species (1 moss: Physcomitrella patens; one lycopodiales: Selaginella moellendorffii; 2 ferns: Az. filiculoides and S. cucullata; 3 gymnosperms: Pinus taeda, Pinus abies, and Ginkgo biloba; and 12 angiosperms: Ar. thaliana, Po. trichocarpa, Lotus japonica, Vitis vinifera, Solanum lycopersicum, Mimulus guttatus, O. sativa, Sorghum bicolor, Musa acuminata, Zostera marina, and Spirodela polyrhiza) were downloaded from Phytozome version 13 (Goodstein et al. 2012). For each gene with alternative splicing variants, the longest transcript was selected to represent that gene. Proteins from Ad. nelumboides and the 19 species were then combined to perform an all-against-all comparison using Blastp in OrthoFinder with default parameters. Orthologous groups were then established. Based on these orthologous groups, orthologous single- or two-copy nuclear genes were identified among Ad. nelumboides and 19 other species. For each orthologous gene, protein sequences were aligned using MAFFT v7.0, and all the alignments were then concatenated into a super matrix, and subject to substitution model test using jModelTest2 with the Akaike information criterion (Darriba et al. 2012). Following the mode and using Ph. patens as an outgroup, the phylogenetic tree was constructed using RAxML v8.2.10 (Stamatakis 2014) with 1,000 bootstrap replicates. Gene family expansions and contractions were inferred using the program CAFE v 4.1(De Bie et al. 2006). Given equal birth (λ) and death (μ) rate, global (one λ) and multiple local rates (λ1, λ2,…) were determined using a likelihood ratio test. Then the maximum likelihood value of turnover rates across the phylogenetic tree was estimated, and inferences on ancestral states of gene family sizes for each node and changes along each branch in the phylogenetic tree were drawn. Gene families with an accelerated rate of expansion and contraction were determined with a threshold conditional P-value (P < 0.05). For the expanded and contracted gene families, KEGG pathway and GO category enrichment analysis was conducted using KOBAS 3.0 (Xie et al. 2011) and statistical significance was tested by Fisher’s exact test in combination with the False Discovery Rate correction. For those expanded gene families related to calcium, potassium and sodium uptake and transport in the Ad. nelumboides genome, we calculated pairwise Ks between members within each gene family and counted the number of genes likely produced by the WGD with Ks values ranging from <25% to >25% 0.063. Click here for additional data file.

74 in total

1. Gene silencing in a polyploid homosporous fern: paleopolyploidy revisited.

Authors: G J Gastony
Journal: Proc Natl Acad Sci U S A Date: 1991-03-01 Impact factor: 11.205

2. Genetic map-based analysis of genome structure in the homosporous fern Ceratopteris richardii.

Authors: Takuya Nakazato; Min-Kyung Jung; Elizabeth A Housworth; Loren H Rieseberg; Gerald J Gastony
Journal: Genetics Date: 2006-04-28 Impact factor: 4.562

3. LTR_retriever: A Highly Accurate and Sensitive Program for Identification of Long Terminal Repeat Retrotransposons.

Authors: Shujun Ou; Ning Jiang
Journal: Plant Physiol Date: 2017-12-12 Impact factor: 8.340

4. The Chinese pine genome and methylome unveil key features of conifer evolution.

Authors: Shihui Niu; Jiang Li; Wenhao Bo; Weifei Yang; Andrea Zuccolo; Stefania Giacomello; Xi Chen; Fangxu Han; Junhe Yang; Yitong Song; Yumeng Nie; Biao Zhou; Peiyi Wang; Quan Zuo; Hui Zhang; Jingjing Ma; Jun Wang; Lvji Wang; Qianya Zhu; Huanhuan Zhao; Zhanmin Liu; Xuemei Zhang; Tao Liu; Surui Pei; Zhimin Li; Yao Hu; Yehui Yang; Wenzhao Li; Yanjun Zan; Linghua Zhou; Jinxing Lin; Tongqi Yuan; Wei Li; Yue Li; Hairong Wei; Harry X Wu
Journal: Cell Date: 2021-12-28 Impact factor: 41.582

Review 5. Twenty years of plant genome sequencing: achievements and challenges.

Authors: Yanqing Sun; Lianguang Shang; Qian-Hao Zhu; Longjiang Fan; Longbiao Guo
Journal: Trends Plant Sci Date: 2021-11-12 Impact factor: 18.313

6. Conservation and divergence of small RNA pathways and microRNAs in land plants.

Authors: Chenjiang You; Jie Cui; Hui Wang; Xinping Qi; Li-Yaung Kuo; Hong Ma; Lei Gao; Beixin Mo; Xuemei Chen
Journal: Genome Biol Date: 2017-08-23 Impact factor: 13.583

7. OrthoFinder: phylogenetic orthology inference for comparative genomics.

Authors: David M Emms; Steven Kelly
Journal: Genome Biol Date: 2019-11-14 Impact factor: 13.583

8. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling rapid identification of long terminal repeat retrotransposons.

Authors: Shujun Ou; Ning Jiang
Journal: Mob DNA Date: 2019-12-12

9. Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors: Heng Li; Richard Durbin
Journal: Bioinformatics Date: 2009-05-18 Impact factor: 6.937

10. TaCIPK29, a CBL-interacting protein kinase gene from wheat, confers salt stress tolerance in transgenic tobacco.

Authors: Xiaomin Deng; Wei Hu; Shuya Wei; Shiyi Zhou; Fan Zhang; Jiapeng Han; Lihong Chen; Yin Li; Jialu Feng; Bin Fang; Qingchen Luo; Shasha Li; Yunyi Liu; Guangxiao Yang; Guangyuan He
Journal: PLoS One Date: 2013-07-29 Impact factor: 3.240