Vinciane Mossion1, Benjamin Dauphin1,2, Jason Grant1, Michael Kessler3, Niklaus Zemp4, Daniel Croll1. 1. Laboratory of Evolutionary Genetics, University of Neuchâtel, Neuchâtel, Switzerland. 2. Swiss Federal Research Institute WSL, Birmensdorf, Switzerland. 3. Department of Systematic and Evolutionary Botany, University of Zürich, Zurich, Switzerland. 4. Genetic Diversity Centre (GDC), ETH Zurich, Zurich, Switzerland.
Abstract
Ferns are the second most diverse group of land plants after angiosperms. Extant species occupy a wide range of habitats and contribute significantly to ecosystem functioning. Despite the importance of ferns, most taxa are poorly covered by genomic resources and within-species studies based on high-resolution markers are entirely lacking. The genus Botrychium belongs to the family Ophioglossaceae, which includes species with very large genomes and chromosome numbers (e.g., Ophioglossum reticulatum 2n = 1520). The genus has a cosmopolitan distribution with 35 species, half of which are polyploids. Here, we establish a transcriptome for Botrychium lunaria (L.) Sw., a diploid species with an extremely large genome of about ~19.0-23.7 Gb. We assembled 25,677 high-quality transcripts with an average length of 1,333 bp based on deep RNA-sequencing of a single individual. We sequenced 11 additional transcriptomes of individuals from two populations in Switzerland, including the population of the reference individual. Based on read mapping to reference transcript sequences, we identified 374,463 single nucleotide polymorphisms (SNPs) segregating among individuals for an average density of 14 SNPs per kilobase. We found that all 12 transcriptomes were most likely from diploid individuals. The transcriptome-wide markers provided unprecedented resolution of the population genetic structure, revealing substantial variation in heterozygosity among individuals. We also constructed a phylogenomic tree of 92 taxa representing all fern orders to ascertain the placement of the genus Botrychium. High-quality transcriptomic resources and SNP sets constitute powerful population genomic resources to investigate the ecology, and evolution of fern populations.
Ferns are the second most diverse group of land plants after angiosperms. Extant species occupy a wide range of habitats and contribute significantly to ecosystem functioning. Despite the importance of ferns, most taxa are poorly covered by genomic resources and within-species studies based on high-resolution markers are entirely lacking. The genus Botrychium belongs to the family Ophioglossaceae, which includes species with very large genomes and chromosome numbers (e.g., Ophioglossum reticulatum 2n = 1520). The genus has a cosmopolitan distribution with 35 species, half of which are polyploids. Here, we establish a transcriptome for Botrychium lunaria (L.) Sw., a diploid species with an extremely large genome of about ~19.0-23.7 Gb. We assembled 25,677 high-quality transcripts with an average length of 1,333 bp based on deep RNA-sequencing of a single individual. We sequenced 11 additional transcriptomes of individuals from two populations in Switzerland, including the population of the reference individual. Based on read mapping to reference transcript sequences, we identified 374,463 single nucleotide polymorphisms (SNPs) segregating among individuals for an average density of 14 SNPs per kilobase. We found that all 12 transcriptomes were most likely from diploid individuals. The transcriptome-wide markers provided unprecedented resolution of the population genetic structure, revealing substantial variation in heterozygosity among individuals. We also constructed a phylogenomic tree of 92 taxa representing all fern orders to ascertain the placement of the genus Botrychium. High-quality transcriptomic resources and SNP sets constitute powerful population genomic resources to investigate the ecology, and evolution of fern populations.
Ferns (Polypodiopsida) are a highly diverse group of cosmopolitan vascular plants and present in most climates. Around 85% of the species richness is found in the tropics (Page, 2002; Ranker & Haufler, 2008), where ferns have diversified into a multitude of habitats including deserts, grasslands, forest understory, mountainous regions, and aquatic environments (Mehltreter et al., 2010). Ferns play key roles in ecosystem functioning, including serving as a habitat for invertebrates (Ellwood & Foster, 2004), shaping plant recolonization of disturbed habitats (Walker, 1994) and influencing the composition of tree species communities (George & Bazzaz, 1999a, 1999b). Most ferns are characterized by two independent life stages (i.e., gametophytic and sporophytic; Pinson et al., 2016). Ferns have complex and idiosyncratic life cycles. For example, many fern species are capable of versatile reproductive modes (Sessa et al., 2012) including apomixis (Grusz, 2016), sporophytic and gametophytic selfing, and outcrossing (Barker & Wolf, 2010; Haufler et al., 2016). Flexibility in the reproductive mode plays a crucial role in the foundation and persistence of isolated populations from a few spores, and thus likely contributed to the evolution of fern lifestyles. In addition, the reproductive mode likely impacts the tempo of evolution and lineage diversification.Phylogenetic analyses have resolved the position of ferns as the sister group to seed plants (Pryer et al., 2001) and as the second earliest diverging lineage of vascular land plants (Pryer et al., 2004; Raubeson & Jansen, 1992). The root age of ferns has been estimated to be c. 360–431 million years, underlining the deep divergence among ferns lineages (Des Marais et al., 2003; Lehtonen et al., 2017; Magallón et al., 2013; Pryer et al., 2004; Qi et al., 2018; Rothfels et al., 2015; Testo & Sundue, 2016; Wikström & Kenrick, 2001; Zhong et al., 2014). Fern phylogenies have been mostly established based on chloroplast markers (Grewe et al., 2013; Kuo et al., 2011; Lu et al., 2015; Rai & Graham, 2010; Schuettpelz & Pryer, 2007; Testo & Sundue, 2016), sometimes combined either with mitochondrial markers (Knie et al., 2015) or nuclear markers (Pryer et al., 2001, 2004) or both (Qiu et al., 2007). The heavy reliance on plastid and mitochondrial markers largely prevents analyses of reticulation events (i.e., hybridization and introgression). Yet, polyploidizations occur at one third of the fern speciation rate (Wood et al., 2009). In contrast, a series of recent phylogenomic studies have highlighted the power of transcriptome‐based approaches (Leebens‐Mack et al., 2019; Qi et al., 2018; Rothfels et al., 2013, 2015; Shen et al., 2018; Wickett et al., 2014). Expanding genome‐ or transcriptome‐wide data sets are essential to overcome challenges imposed by high degrees of paralogy in many fern genomes.The first insights into the structure of fern genomes were provided by the complete genome sequences of Azolla filiculoides and Salvinia cucullata (Li et al., 2018). These two genera are known for their small genome sizes (A. filiculoides 753 Mb and S. cucullata 255 Mb; Li et al., 2018). For example, the genome sizes of A. microphylla and S. molesta were estimated at 1C = 0.77 pg or ~0.75 Gband 1C = 2.25 pg, respectively (Clark et al., 2016; Obermayer et al., 2002). Conversely, many fern genomes are of enormous size (e.g., Tmesipteris obliqua 1C = 150.61 pg; Hidalgo et al., 2017). Major progress in the establishment of genomic resources was made with the sequencing of 73 fern transcriptomes (Carpenter et al., 2019; Leebens‐Mack et al., 2019). Such data sets have been successfully used to develop single copy nuclear markers to resolve deep evolutionary relationships among ferns (Rothfels et al., 2013, 2015). Transcriptome assemblies are also an important tool to develop genotyping approaches and overcome challenges associated with extremely large fern genomes (Bennett & Leitch, 2001; Hanson & Leitch, 2002; Obermayer et al., 2002). These approaches typically reduce genome complexity but still provide sufficient polymorphic markers to conduct population genomics analyses (Seeb et al., 2011). Establishing transcriptomic data sets for understudied fern clades will bring new insights into fern diversification.An important genus lacking transcriptomic resources is Botrychium belonging to subclass Ophioglossidae (PPG I, 2016). This subclass is characterized by a subterranean gametophytic stage (Field et al., 2015; Jeffrey, 1898; Winther & Friedman, 2007) and extremely large and complex genomes (e.g., Ophioglossum petiolatum 1C = 65.55 pg; Obermayer et al., 2002). Botrychium occurs in open habitats on nearly every continent across a broad temperate and boreal distribution. The genus is divided into three monophyletic clades as estimated by non‐coding plastid markers (Simplex‐Campestre, Lanceolatum and Lunaria; Dauphin et al., 2017), containing 35 recognized taxa (PPG I, 2016). The challenge of identifying Botrychium taxa based on morphology is underlined by claims of cryptic species (Clausen, 1938; Hauk, 1995). Ambiguous morphologies are sometimes caused by polyploidization which is a major driver of speciation as half of the known Botrychium species are allopolyploids (Dauphin et al., 2018). Nuclear markers have resolved the parental origins of these allopolyploid taxa and have provided insights into the genus radiation (Dauphin et al., 2018). Additionally, the reconstruction of maternal lineages of Botrychium has revealed high genetic diversity within the Lunaria clade, highlighting the uncertainty of taxonomic assignments (Dauphin et al., 2014, 2017; Maccagni et al., 2017). Previous population genetic studies based on isozymes have shown a lack of genetic differentiation among morphologically recognized types (Williams et al., 2016), and the low amount of genetic variation detected within Botrychium populations suggests pervasive self‐fertilization (Farrar, 1998; Hauk & Haufler, 1999; Williams, 2021). Furthermore, genetic differentiation among populations and regions was found to be low, suggesting that gene flow may occur (Birkeland et al., 2017; Camacho & Liston, 2001; Swartz & Brunsfeld, 2002). These studies highlight the need for powerful, genome‐wide marker systems to resolve population structures, life histories, and taxonomy of eusporangiate ferns.In this study, we developed transcriptome‐wide SNPs for the fern species Botrychium lunaria (L.) Sw., which has a large genome size of about 24.1 pg (2C; ~23.7 Gb; Veselý et al., 2012), establishing the first such resource for ferns. By analysing the transcriptome of 12 individuals mapped against the newly established reference transcriptome, we show the power of the discovered SNPs to resolve population‐level variation in heterozygosity and to test for evidence of ploidy variation. We further demonstrate the power of transcriptome‐wide markers to resolve phylogenetic relationships at the genus level and among deeply divergent fern lineages.
MATERIALS AND METHODS
Sampling, library preparation and sequencing
Leaf material of B. lunaria was obtained from three locations in Switzerland: two in the Val d’Hérens in the Pennine Alps, Mase and Forclaz, within approximately 30 km, and one at the Chasseral in the Jura Mountains (Table 1). The two regions are separated by ~120 km and the Alpine population was sampled on meadows over an altitudinal range of 1500 to 2400 m (Table 1). Leaves of six individuals from Val d’Hérens and from Chasseral were collected in July 2015 and June 2017, respectively. Plant material was wrapped in aluminum foil and frozen immediately in liquid nitrogen. Total RNA was extracted from trophophores (i.e., the sterile part of the leaves) using the RNAeasy Plant Mini Kit (Qiagen) and DNA was eliminated using DNase I digestion. Total RNA was quantified using a Qubit fluorometer (Invitrogen, Thermo Fisher Scientific) with the RNA Broad‐Range assay kit (Invitrogen, Thermo Fisher Scientific) and quality‐checked using an Agilent 2200 Tape Station (Agilent Technologies, Inc.). Samples were diluted to 100 ng/µl in RNase free ultra‐pure water before library preparation. The RNA‐sequencing libraries were prepared following a TruSeq RNA library preparation protocol (Illumina, Inc.) enriching for polyadenylated RNAs. After quality assessment on an Agilent 2200 Tape Station, libraries were pooled and sequenced in 150 bp single‐end mode on one lane of an Illumina HiSeq 4000 sequencer.
TABLE 1
Populations, accessions and voucher information. Geographic coordinates are given in the WGS84 reference system (decimal degrees)
Individual identifier
Population
Location
Latitude
Longitude
Altitude (m)
Date
Voucher number
Absolute genome sizea
Depositb institute
CHA_I_1
Chasseral
Chasseral
47.12974
7.04934
1549.71
07.06.2017
NE000101258
–
NEU
CHA_I_3
Chasseral
Chasseral
47.12974
7.04934
1549.71
07.06.2017
NE000101257
–
NEU
CHA_I_5
Chasseral
Chasseral
47.12974
7.04934
1549.71
07.06.2017
NE000101256
–
NEU
CHA_I_6
Chasseral
Chasseral
47.12974
7.04934
1549.71
07.06.2017
NE000101255
–
NEU
CHA_I_7
Chasseral
Chasseral
47.12974
7.04934
1549.71
07.06.2017
CHA_I_7
20.58
UNINE
CHA_I_8
Chasseral
Chasseral
47.12974
7.04934
1549.71
07.06.2017
NE000101254
–
NEU
IIT2_H2
Val d'Hérens
Forclaz
46.08741
7.54379
2346.15
2015
IIT2_H2
19.38
UNINE
IIT1_H5
Val d'Hérens
Forclaz
46.08851
7.53906
2346.15
2015
IIT1_H5
19.97
UNINE
IIT2_B2
Val d'Hérens
Forclaz
46.08741
7.54379
2406.10
2015
IIT2_B2
19.84
UNINE
IT1_A3
Val d'Hérens
Mase
46.20432
7.48350
2406.62
2015
IT1_A3
19.40
UNINE
IT1_H1
Val d'Hérens
Mase
46.20432
7.48350
2406.62
2015
IT1_H1
19.84
UNINE
IT2_D4
Val d'Hérens
Mase
46.19642
7.48502
2424.07
2015
IT2_D4
–
UNINE
Expressed as 2C‐values and pictograms.
NEU: Herbarium of the University of Neuchâtel; UNINE: University of Neuchâtel, Institute of Biology Herbarium. The specimens deposited at UNINE are frozen samples stored at –80°C.
Populations, accessions and voucher information. Geographic coordinates are given in the WGS84 reference system (decimal degrees)Expressed as 2C‐values and pictograms.NEU: Herbarium of the University of Neuchâtel; UNINE: University of Neuchâtel, Institute of Biology Herbarium. The specimens deposited at UNINE are frozen samples stored at –80°C.Absolute genome sizes were estimated by flow cytometry to assess the ploidy level of the six specimens for which frozen material was available, following the one‐step methodology of Doležel et al. (1998). Frozen material was silica‐dried before being processed.
De novo assembly, filtering and quality assessment
Sequencing reads were quality‐checked using FastQC v. 0.11.7 (Andrews, 2010) and trimmed using Trimmomatic v. 0.38 (Bolger et al., 2014). Reads were retained if the leading and trailing base qualities were >5, had a 4‐bp sliding window >15, and a minimum read length of 36 bp. Trimmed reads from a single reference individual were then de novo assembled using Trinity v. 2.8.3 (Haas et al., 2013). The other individual transcriptomes were used for SNP calling (see below). The reference individual choice was made based on the highest trimmed transcript sequence number. We used the pseudo‐alignment percentage calculated by Kallisto v. 0.45.0 (Bray et al., 2016) to assess the representativeness of the raw assembly across the twelve sequenced individuals in total. Candidate coding regions were identified using TransDecoder v. 5.3.0 (Haas et al., 2013). Only transcripts with an open reading frame (ORF) of at least 100 amino acids were kept. We also retained only the longest isoform per transcript using the Trinity v. 2.8.3 toolkit. We used Diamond v. 0.9.24 (Buchfink et al., 2015) to screen the transcript assembly against the NCBI nonredundant protein (nr) and UniVec databases to identify potential foreign RNA contaminants. The best hit for each transcript was assigned at the phylum‐level using the R package taxise v. 0.9.7 (Chamberlain & Szöcs, 2013) in RStudio v. 1.2.1335 (R Development Core Team, 2020; RStudio Team, 2015). We excluded all transcripts with a best hit outside of the plant kingdom (i.e., Viridiplantae). We did not filter for hits within the plant kingdom due to the scarce representation of seedless plant sequences in the NCBI nr and UniVec databases. Transcript redundancies were detected using CD‐HIT‐EST v.4.8.1 (Weizhong & Godzik, 2006) with a global sequence identity of 0.95% and a word size of 10. Redundant transcripts were pruned from the final assembly retaining the longest transcript per cluster. The completeness of the transcriptome assemblies was assessed using BUSCO v. 5.0.0 with the viridiplantae_odb10 database (Simão et al., 2015). Data were visualized using the R package ggplot2 v. 3.3.2 (Wickham, 2016).
Variant calling
We generated alignment BAM files for each individual against the transcriptome using the short read aligner Bowtie2 v. 2.4.2 (Langmead, 2010) and SAMtools v. 1.11 (Li et al., 2009). Depth coverage of the reference individual was extracted using SAMtools idxstats. Alignments were processed with HaplotypeCaller implemented in the Genome Analysis Toolkit (GATK) v. 4.1.8.1 (Van der Auwera et al., 2013; DePristo et al., 2011; McKenna et al., 2010) for single nucleotide polymorphism (SNP) calling. The resulting gvcf files were combined and genotyped using the GATK CombineGVCF and GenotypeGVCF tools, respectively. We excluded monomorphic sites from further analysis. We filtered SNPs for the number of genotyped chromosomes (AN ≥ 20) out of a maximum of 24 (12 diploid individuals). Quality criteria QUAL > 100, QualByDepth > 5.0, RMSMappingQuality > 20.0, MappingQualityRankSumTest retained values >–2.0 and <2.0, and ReadPosRankSumTest and BaseQualityRankSumTest retained values >–2.0 and <2.0 were defined following the best practices and were applied to flag low‐quality loci (Figure S1). We removed SNPs failing the above filters using VCFtools v. 0.1.16 (Danecek et al., 2011) and added a filter to retain only biallelic SNPs. Analyses were performed using the R packages vcfR v. 1.12.0 (Knaus & Grünwald, 2017) and the SNP statistics among transcripts were visualized using ggplot2 v. 3.3.2.
Population genetics analyses
Intra‐individual allele frequencies were calculated for each individual and SNP locus using the mapped read depth per allele (AD). The frequency distributions were plotted per individual. We subsampled the number of SNPs by selecting one SNP every 1,000 bp of transcriptomic sequence using VCFtools v. 0.1.16. The subsampling was performed to obtain a representative set of markers across all polymorphic transcripts. Furthermore, the subsampling reduced linkage disequilibrium among markers to reduce biases in the representation of the population genetic structure. We performed principal component analyses (PCA) and calculated the pairwise Nei's F
ST and the mean heterozygosity (H
e) per location and per individual (Nei, 1987). These analyses were performed using the R packages vcfR v. 1.12.0, adegenet v. 2.1.3 (Jombart & Ahmed, 2011) and hierfstat v. 0.5–7 (Goudet, 2005), and data were visualized using ggplot2 v. 3.3.2.
Functional annotation
We functionally characterized encoded protein sequences based on gene ontology (GO) terms. We summarized GO terms by selecting the least redundant annotations among the 30 most frequent terms per ontology (biological process BP, cellular component CC, and molecular function MF). Analyses were performed using the Bioconductor packages AnnotationDbi v. 1.46.0 (Pagès et al., 2020), GO.db v. 3.8.2 (Carlson, 2020), GSEABase v. 1.34.0 (Morgan et al., 2020), annotate v. 1.62.0 (Gentleman, 2020), and data were visualized using the R package ggplot2 v. 3.3.2.
Genus‐level phylogenetic analyses
To estimate the phylogenetic placement of the 12 individuals included in our study, we retrieved sequences of four previously analysed nuclear regions (i.e., ApPEFP_C, CRY2cA, CRY2cB, and transducin) of both diploid and polyploid Botrychium taxa (Dauphin et al., 2018). We searched homologous sequences in the transcriptome assembly using BLAST v. 2.9.0 (Altschul et al., 1990). If the associated transcript was found in the assembly, we used BCFtools v. 1.9 (Li, 2011) to retrieve the corresponding transcript from the 11 remaining individuals using the VCF file information. Sequence alignments were performed with MAFFT v. 7.470 under the G‐INS‐i strategy and default parameters (Katoh et al., 2015; Katoh & Standley, 2013). Multiple alignments were visually inspected and manually adjusted using Geneious v. 8.1.9 (Kearse et al., 2012). Phylogenetic trees were inferred using maximum likelihood (ML) in RAxML‐NG v. 0.9.0 (Kozlov et al., 2019). We ran tree inferences with a fixed random seed of 2 under the HKY+GAMMA model based on model settings by Dauphin et al. (2018) to ensure reproducibility. The tree search was performed using 25 random‐ and 25 parsimony‐based starting trees. The branch support was estimated using 1,000 bootstrap replicates and calculated according to the transfer bootstrap expectation matrix (Lemoine et al., 2018). The support values were depicted on the best‐scoring ML tree. The tree was visualized using the R packages ape v. 5.4–1 (Paradis & Schliep, 2019) and ggtree v.2.4.0 (Yu et al., 2017).
Phylogenomic analyses
We performed phylogenomic analyses across ferns by including the newly established B. lunaria transcriptome for a total of 95 transcriptomes including 86 fern species (18 eusporangiates and 68 leptosporangiates), six Spermatophyta and two Lycopodiopsida (Table S1; Leebens‐Mack et al., 2019; Qi et al., 2018; Shen et al., 2018). The Spermatophyta and Lycopodiopsida species represent the outgroup in the analysis. We first performed an orthologue search using OrthoFinder v. 2.3.12 (Emms & Kelly, 2015) including the newly established B. lunaria transcriptome and protein sequences of other transcriptomes. We retained orthogroups shared by every member of the taxa set regardless of the gene copy number. We used ASTRAL‐Pro v. 1.1.5, which was specifically developed to summarize multicopy gene trees into a species tree (Zhang et al., 2020). To reduce the computational load, we processed only orthogroups for which at least half of the species had a single‐copy gene and excluded two orthogroups with very high paralogue counts (n ≥ 700). Sequences of the orthogroups subset were subsequently aligned with MAFFT v. 7.475 under the G‐INS‐i strategy and default parameters. The optimal substitution model was assessed for each orthogroup alignment using Modeltest‐NG v. 0.1.6 (Darriba et al., 2020). Finally, unrooted gene trees were built using maximum likelihood (ML) in RAxML‐NG v. 1.0.1. We ran tree inferences under the best model according to the Akaike Information Criterion (AICc) with a fixed random seed of 12,345. The tree search was performed on 25 random and 25 parsimony‐based starting trees. The inferred gene trees (i.e., best ML trees) were used to estimate a species tree with ASTRAL‐Pro. Branch support was calculated using local posterior probabilities (Sayyari & Mirarab, 2016). In order to display internal branch lengths on our final trees, we first added arbitrary terminal branch lengths using the python script (add‐bl.py) provided with ASTRAL. Species trees were edited using the R packages treeio v.1.14.0 (Wang et al., 2020) and ape v. 5.4–1. The arbitrary terminal branch lengths were removed for the final tree visualizations.
RESULTS
Sample collection and transcriptome assembly
We successfully analyzed transcriptomes of 12 B. lunaria individuals. The transcriptome sequencing produced 14.6–50.1 million reads per individual. After quality trimming, we retained 97.0%–99.2% of the reads (Figure 1a, Table S2). The highest number of high‐quality reads (49.5 million) was obtained for the Chasseral individual CHA_I_1. We selected this individual to assemble a reference transcriptome for the species. The raw assembly for CHA_I_1 contained 167,306 transcripts for a total of 87,537 candidate genes before redundancy filtering. Mapping reads from all 12 individuals to the raw transcriptome assembly showed a pseudoalignment rate (i.e., percentage of mapped reads) between 74.2%–82.5% regardless of the population of origin (Figure 1a, Table S2). We analyzed all assembled transcripts for the presence of high‐confidence open reading frames (ORF; ≥100 amino acids). We retained 69,280 transcripts (41.4%) covering 26,139 predicted genes (Figure 1b). Next, we selected the longest transcript for each gene (Figure 1d). We screened each gene against the complete nonredundant protein and the UniVec database of NCBI and found evidence for contamination in 438 transcripts. Most contaminant sequences were associated with viruses, fungi, or bacteria (Figure 1c). We performed an additional screen for redundant transcripts using CD‐HIT‐EST. We identified 16 clusters and retained only a unique representative transcript per cluster for further analyses. The final assembly consisted of 25,677 unique transcripts spanning a total of 3,423 Mb. The average and median transcript lengths were 1,333 and 967 bp, respectively (Figure 1d). The N50 of the final transcriptome was 1,995 bp with an average GC content of 44.3% (Table 2). GO terms were assigned to 11,137 transcripts (43.4%; Figure 2d, Table S3).
FIGURE 1
De novo assembly of the Botrychium lunaria transcriptome. (a) Distribution of the raw, trimmed, and pseudoaligned read numbers per individual. (b) Transcripts and gene assembly content at each filtering stage of the transcriptome. (c) Contaminant transcript sequences detected by phylum. Retained phyla are indicated in bold. See methods for further details. (d) Distribution of the assembled transcript lengths given in base pairs (bp). (e) BUSCO genes detected at each filtering stage of the B. lunaria transcriptome and for two genome assemblies of ferns (Azolla filliculoides and Salvinia cucullata; Li et al., 2018)
TABLE 2
Overview of assembly statistics over the different transcript filtering stages
Filtering stage
Assembled bases
Transcripts
Genes
N50‐longest isoform
N50‐all
GC%
Raw assembly
56,273,802
167,306
87,537
1,689
1,089
43.68
ORF‐encoding
34,588,465
69,280
26,139
2,152
1,988
44.00
Longest isoform only
34,588,465
26,139
26,139
1,988
1,988
44.31
Contaminant screening
34,245,455
25,701
25,701
1,995
1,995
44.30
CD‐HIT
34,230,067
25,677
25,677
1,995
1,995
44.30
FIGURE 2
Analyses of the assembly coverage. (a) Number of aligned reads per assembled transcript for the reference individual. (b) Aligned reads of the reference individual according to the assembled transcript length. (c) Mapping rate of all 12 individuals from the Chasseral and Val d’Hérens populations, including the subpopulations Forclaz and Mase. The reference individual used to assemble the transcriptome was CHA_I_1. (d) Characterizations of predicted functions encoded by the transcriptome. Gene ontology (GO) term annotations are shown for the 30 most frequent terms per ontology (BP, biological process; CC, cellular component; MF, molecular function). GO terms with highly similar functions are excluded from the representation
De novo assembly of the Botrychium lunaria transcriptome. (a) Distribution of the raw, trimmed, and pseudoaligned read numbers per individual. (b) Transcripts and gene assembly content at each filtering stage of the transcriptome. (c) Contaminant transcript sequences detected by phylum. Retained phyla are indicated in bold. See methods for further details. (d) Distribution of the assembled transcript lengths given in base pairs (bp). (e) BUSCO genes detected at each filtering stage of the B. lunaria transcriptome and for two genome assemblies of ferns (Azolla filliculoides and Salvinia cucullata; Li et al., 2018)Overview of assembly statistics over the different transcript filtering stagesAnalyses of the assembly coverage. (a) Number of aligned reads per assembled transcript for the reference individual. (b) Aligned reads of the reference individual according to the assembled transcript length. (c) Mapping rate of all 12 individuals from the Chasseral and Val d’Hérens populations, including the subpopulations Forclaz and Mase. The reference individual used to assemble the transcriptome was CHA_I_1. (d) Characterizations of predicted functions encoded by the transcriptome. Gene ontology (GO) term annotations are shown for the 30 most frequent terms per ontology (BP, biological process; CC, cellular component; MF, molecular function). GO terms with highly similar functions are excluded from the representation
A highly complete B. lunaria transcriptome
The completeness of the assembled B. lunaria transcriptome assessed using BUSCO matched the completeness of complete genome assemblies of other ferns. Importantly, the absence of fern species among the 30 species constituting the BUSCO viridiplantae_odb10 database might lead to an underestimation of the assembly completeness. We found 90.1% complete single‐copy, 0.9% complete duplicates, 5.9% fragmented and 3.1% missing genes for the B. lunaria transcriptome (Figure 1e, Table 3). This is comparable to the only two complete genome assemblies of ferns: A. filiculoides with 79.3% and S. cucullata with 90.4% complete single‐copy genes (Figure 1e, Table 3). The two Salviniaceae genomes and the B. lunaria transcriptome exhibited a comparable number of missing BUSCO genes (2.8%–1.6% and 3.1%, respectively; Figure 1e, Table 3) and of fragmented BUSCO genes (9.9%–4.7% and 5.9% respectively; Figure 1e, Table 3). The mapped reads coverage depth of the reference individual to the assembled transcripts is on average 1649× with a range of 4 to 514,622×. Transcript coverage is strongly skewed towards low coverage, whereas ~7% show coverage >4000 reads (Figure 2a). The read coverage shows no obvious association with transcript length (Figure 2b).
TABLE 3
Analyses of assembly completeness using BUSCO genes. The B. lunaria transcriptome is compared to two genome assemblies of ferns (Azolla filliculoides and Salvinia cucullata)
Species name
Filtering stage
Ca
CSb
CDc
Fd
Me
Nf
Data set
Azolla filliculoides
–
371
337
34
42
12
425
viridiplantae_odb10
Salvinia cucullata
–
398
384
14
20
7
425
viridiplantae_odb10
B. lunaria
Raw assembly
397
118
279
20
8
425
viridiplantae_odb10
B. lunaria
ORF‐encoding
397
124
273
19
9
425
viridiplantae_odb10
B. lunaria
Longest isoform only
387
383
4
25
13
425
viridiplantae_odb10
B. lunaria
Contaminant screening
387
383
4
25
13
425
viridiplantae_odb10
B. lunaria
CD‐HIT
387
383
4
25
13
425
viridiplantae_odb10
aComplete genes, bComplete and single copy genes, cComplete and duplicated genes, dFragmented genes, eMissing genes, fTotal number of BUSCO genes in the data set.
Analyses of assembly completeness using BUSCO genes. The B. lunaria transcriptome is compared to two genome assemblies of ferns (Azolla filliculoides and Salvinia cucullata)aComplete genes, bComplete and single copy genes, cComplete and duplicated genes, dFragmented genes, eMissing genes, fTotal number of BUSCO genes in the data set.
Phylogenetic assignment of all analyzed transcriptomes
The analysis of nuclear barcoding loci confirmed the 12 individuals included in the transcriptome analyses as B. lunaria. Among the four nuclear loci previously sequenced in a broad sample of Botrychium species, three loci displayed sequence variation nearly exclusively in intronic sequences (Dauphin et al., 2018). Therefore, no comparison with our transcriptomic sequences was possible. We focused on the locus CRY2cA carrying sufficient informative sites in the coding regions to produce a well‐supported phylogeny. The combined data set for CRY2cA included 67 individuals, representing 38 Botrychium taxa and an outgroup constituted by Sceptridium multifidum and Botrypus virginianum (Dauphin et al., 2018; Table S4). The multiple sequence alignment contained a total of 3,579 sites and 153 patterns. The main clades Lanceolatum, Lunaria, and Simplex‐Campestre were resolved as monophyletic (Figure S2). Lanceolatum was resolved as a sister group to the Simplex‐Campestre and Lunaria clade. All individuals from the Chasseral and Val d’Hérens grouped with B. lunaria var. lunaria and formed a well‐supported clade.
Transcriptome‐wide phylogenomic trees of ferns
Phylogenomic analyses based on the transcriptomes of 91 fern species (Leebens‐Mack et al., 2019; Qi et al., 2018; Shen et al., 2018) including the new transcriptome for B. lunaria were congruent with previous studies. We identified 41,186 orthogroups in total (Table S5) of which we retained 761 orthogroups and 92 species to construct a phylogenomic tree (Table S6). The species tree was well‐resolved and branch support values from the local posterior probability (LPP) were high (>0.95; Figure 5). The species tree topology was consistent with the most recent fern phylogenies (Knie et al., 2015; Kuo et al., 2011; Leebens‐Mack et al., 2019; Lu et al., 2015; Qi et al., 2018; Rai & Graham, 2010; Rothfels et al., 2015; Shen et al., 2018; Testo & Sundue, 2016) and with the current consensus classification (PPG I, 2016). The B. lunaria transcriptome clustered with the sister genus Sceptridium and the closely related genus Botrypus. Among the earliest divergent ferns (i.e., eusporangiate and early leptosporangiate), we identified Equisetales as the sister clade to all other ferns, and Marattiales as the sister clade to all leptosporangiates. Consistent with recent work, the Gleicheniales were recovered as paraphyletic and the deep eupolypod relationships remained largely unresolved in our phylogeny (Figure S3).
FIGURE 5
Phylogenomic relationships among ferns including Botrychium lunaria. A species tree of 761 orthologous genes including 92 taxa inferred by coalescence‐based method implemented in ASTRAL‐Pro. The branch support is indicated by local posterior probability (LPP) values. LPP values of the main topology are displayed in dark red above branches on left node sides. Fern subclasses (Equisetidae, Ophioglossidae, Marattiidae and Polypodiidae) are denoted by coloured rectangles or a vertical dark grey line on clade sides. Polypodiidae orders (Osmundales, Hymenophyllales, Gleicheniales, Schizaeales Salviniales, Cyatheales and Polypodiales) are designated by coloured rectangles. The internal branch lengths are in coalescent units. The tree with the full ASTRAL annotations is available as Supporting Information S7
Ploidy assessments and analyses of within‐species transcriptome‐wide polymorphisms
The mapping rate of reads from each individual against the assembled transcriptome sequences varied between 82.1%–86.7% (Figure 2c). The highest mapping rate was found for the individual IIT1_H5 (86.7%), which was slightly higher than the mapping rate of the reference individual used to establish the transcriptome (CHA_I_1, 85.5%; Table S4). We found no meaningful difference in mapping rates among populations. Based on reads aligned against reference transcripts, we assessed intra‐individual allele frequency distributions based on read counts. Diploid individuals should show a singular, dominant peak at 0.5 corresponding to heterozygous SNPs (Chen et al., 2018). We indeed found a major peak around a frequency of 0.5 without any secondary peaks at 0.25 or 0.75, suggesting that all individuals are likely diploid (Figure 3a). However, very recent autopolypoids would also share this signature. The broad peak shape is likely due to stochasticity in read coverage at each SNP among individuals. Low read coverage can produce individual allele frequency estimates far from 0.5. For independent evidence of ploidy levels, we used flow cytometry. The estimated absolute genome sizes varied between 19.38–20.58 pg among the six analyzed individuals (2C‐values; Table 1). Hence, there is no indication for ploidy variation and the genome size estimates match previous reports for the diploid genome of B. lunaria (Veselý et al., 2012). Given this evidence for diploidy, we proceeded with calling SNPs assuming a diploid state in all individuals. We recovered a total of 376,463 high‐quality biallelic SNPs after filtering. The average number of SNPs per transcript was 17 and the maximum number was 257 (Figure 3b). The SNP density per transcript had a mean of 14, a median of 10, and a maximum of 153 SNPs per kb (Figure 3c). The median SNP density decreased slightly with transcript length (Figure 3d).
FIGURE 3
Analyses of population‐level transcriptomic polymorphism. (a) Distribution of the transcriptome‐wide SNP reference allele frequencies per individual estimated from mapped reads. The light‐blue dashed lines show the mean reference allele frequency. Homozygous positions (frequencies 0 and 1) were excluded. (b) Number of SNPs per transcript up to 257. (c) Density of SNPs per transcript (i.e., number of SNPs per kb) up to 153. (d) SNP density according to bins of transcript length given in bp. The mean density is shown by a red rectangle and the median by the solid black line inside the box. The box outline indicates the first and third quartiles. The whiskers show the minimum (bottom one) and maximum (top one) number of SNPs per kb for each bin
Analyses of population‐level transcriptomic polymorphism. (a) Distribution of the transcriptome‐wide SNP reference allele frequencies per individual estimated from mapped reads. The light‐blue dashed lines show the mean reference allele frequency. Homozygous positions (frequencies 0 and 1) were excluded. (b) Number of SNPs per transcript up to 257. (c) Density of SNPs per transcript (i.e., number of SNPs per kb) up to 153. (d) SNP density according to bins of transcript length given in bp. The mean density is shown by a red rectangle and the median by the solid black line inside the box. The box outline indicates the first and third quartiles. The whiskers show the minimum (bottom one) and maximum (top one) number of SNPs per kb for each bin
Population structure and heterozygosity
The SNP genotyping data from the 12 individuals revealed a clear population differentiation between the two main sampling locations (Figure 4a). The first principal component (PC1, 17% of total variance explained) of the PCA identified a divergent genotype in the Chasseral population (CHA_I_7, Figure 4b). The second principal component (PC2, 12%) separated the two populations Chasseral and Val d’Hérens (Figure 4b). We performed a second PCA excluding the CHA_I_7 individual and found the Chasseral population to be more diverse than Val d’Hérens (Figure 4c). We found no apparent differentiation between the two locations sampled in Val d’Hérens, but the sampling coverage was limited (n = 3 per location). The pairwise F
ST between populations was low (0.040). Mean heterozygosity was slightly higher in Val d’Hérens (H
e = 0.20) than in the Chasseral population (0.17; Figure 4d). We found similar levels of variation in individual heterozygosity among populations ranging from 0.16 to 0.21, except for CHA_I_7, which was an outlier in the PCA (Figure 4e). CHA_I_7 showed less than half the heterozygosity (H
e = 0.05) compared to other members of the same population.
FIGURE 4
Population genetic structure and observed heterozygosity. (a) Sampling locations of Botrychium lunaria populations analysed in this study. (b) Principal component analysis (PCA) of the populations Chasseral and Val d’Hérens (sites Mase and Forclaz). (c) PCA of both populations excluding the CHA_I_7 outlier. PCA were analysed using a reduced SNP data set of a maximum of 1 SNP per kb of transcript. (d) Mean observed heterozygosity per location grouped by population. (e) Mean observed heterozygosity per individual grouped by population
Population genetic structure and observed heterozygosity. (a) Sampling locations of Botrychium lunaria populations analysed in this study. (b) Principal component analysis (PCA) of the populations Chasseral and Val d’Hérens (sites Mase and Forclaz). (c) PCA of both populations excluding the CHA_I_7 outlier. PCA were analysed using a reduced SNP data set of a maximum of 1 SNP per kb of transcript. (d) Mean observed heterozygosity per location grouped by population. (e) Mean observed heterozygosity per individual grouped by population
DISCUSSION
We established a high‐quality transcriptome for the genus Botrychium filling an important gap in the coverage of eusporangiate ferns. The completeness of the transcriptomic gene space was comparable to well‐assembled fern genomes. Using 12 individuals of the same species sampled in two regions, we were able to generate the first intraspecies transcriptome‐wide SNP data set for ferns in general. Our analyses revealed a clear genetic distinction between the two B. lunaria populations and among the twelve individuals challenging the general assumption of gametophytic selfing being the dominant reproductive mode in Botrychium populations. A phylogenomic tree based on 761 orthologous genes confirmed the phylogenetic position of the genus among other fern lineages.
Establishment of a transcriptome for B. lunaria
Generating a representative transcriptome assembly is challenging because not all genes are expressed in all tissues and life cycle stages. Across the life cycle of ferns, gene expression patterns are largely overlapping (Sigel et al., 2018), but the covered gene space is usually increased by including multiple target tissues. For Botrychium, only the trophophore and the sporophore were adequate tissues for the extraction of RNA since underground tissues are colonized by arbuscular mycorrhizal fungi (AMF; Winther & Friedman, 2007) leading to numerous contaminants. Because we only included trophophore tissue, the assembled transcriptome potentially underrepresents sporogenesis‐ and root‐specific genes. Despite these challenges, our B. lunaria transcriptome has a fairly complete gene space in comparison to a wide range of assembled transcriptomes (Der et al., 2011; Leebens‐Mack et al., 2019; Qi et al., 2018; Shen et al., 2018; Figure 1e). It is important to note that database‐dependent tools such as BUSCO consistently underestimate transcriptome completeness if the database was compiled without closely related species. The challenge in using BUSCO is exemplified by the absence of ferns species in the viridiplantae data set. The gene space of assembled fern genomes tends to show less fragmented BUSCO genes compared to the B. lunaria transcriptome (Li et al., 2018). However, the B. lunaria transcriptome is consistent with other high‐quality fern transcriptomes (Leebens‐Mack et al., 2019; Qi et al., 2018; Shen et al., 2018). Missing gene segments in assembled transcriptomes are often caused by uneven read depth among genes or alternative splicing complicating gene recovery. The completeness of the B. lunaria transcriptome compared to other fern genomes and transcriptomes provides a powerful tool for phylogenomic and population genetic analyses.The B. lunaria transcriptome enables strong phylogenomic inference at different taxonomic levels, overcoming challenges associated with the small number of nuclear and chloroplast markers available for most ferns. Botrychium taxa have subtle morphological characteristics, hence taxonomy relies largely on molecular data (Dauphin et al., 2014, 2017; Maccagni et al., 2017; Stensvold & Farrar, 2017; Stensvold et al., 2002). We have placed the individual B. lunaria transcriptomes among other closely related taxa by retrieving orthologous genes, which were previously used for phylogenetic analyses. The newly established transcriptome will enable investigations of the gene content of Botrychium. The tens of thousands of markers across the coding sequences allow powerful genome‐wide studies of population and species differentiation. Highly dense marker sets are critical for ecological genomics investigations to for example, identify loci underlying climatic adaptation, recent gene flow events and consequences of variation in the reproductive mode. Beyond this, refined analyses of transcriptomic markers can help to retrace the evolution of the extensive ploidy variation among Botrychium.As an expansion of the phylogenetic analyses within Botrychium, we analysed orthologous genes across all ferns. The genus Botrychium was placed within the Ophioglossales with strong support (Figure 5). Furthermore, the phylogenetic placements of most fern clades were congruent with previous studies and highlighted unresolved discordances. For example, the paraphyly observed for the Gleicheniales in our species tree (Figure 5, Figure S3) corroborates recent findings (Qi et al., 2018; Shen et al., 2018). Sparse sampling can strongly influence tree topologies. For instance, the Matoniaceae, constituting one of the three Gleicheniales families (PPG I, 2016), are not represented in phylogenomic studies. Phylogenies based only on few barcoding loci, but with a more representative sampling, identified Gleicheniales as being a monophyletic clade (Pryer et al., 2004; Schuettpelz et al., 2006; Schuettpelz & Pryer, 2007). Phylogenomic studies including the Matoniaceae will be needed to ascertain the placement of the Dipteridaceae outside of the order Gleicheniales. The resolution of our fern‐wide phylogenomic analyses revealed some limitations in resolution at the within‐genus level. Equisetum hyemale and E. arvense were recovered as sister species distant from E. diffusum even though E. diffusum and E. arvense belong to the same subgenus Equisetum which does not include E. hyemale (subgenus. Hippochaete; Christenhusz et al., 2021). The reduced power to resolve some phylogenetic relationships with genera might be explained by the scarce species representation within some genera and the degree of conservation of the included orthologues.Phylogenomic relationships among ferns including Botrychium lunaria. A species tree of 761 orthologous genes including 92 taxa inferred by coalescence‐based method implemented in ASTRAL‐Pro. The branch support is indicated by local posterior probability (LPP) values. LPP values of the main topology are displayed in dark red above branches on left node sides. Fern subclasses (Equisetidae, Ophioglossidae, Marattiidae and Polypodiidae) are denoted by coloured rectangles or a vertical dark grey line on clade sides. Polypodiidae orders (Osmundales, Hymenophyllales, Gleicheniales, Schizaeales Salviniales, Cyatheales and Polypodiales) are designated by coloured rectangles. The internal branch lengths are in coalescent units. The tree with the full ASTRAL annotations is available as Supporting Information S7
Fine‐grained resolution of population structure
The transcriptome‐wide SNPs revealed clear geographical structuring between the two B. lunaria populations sampled from Switzerland (Figure 4b, c). The differentiation was apparent even when subsampling SNPs containing a maximum of 1 SNP per kb to avoid biases by highly polymorphic transcripts and high linkage disequilibrium. The full set of SNPs will provide a powerful tool for ecological genomics investigations to unravel loci contribution to adaptation. It has been generally assumed Botrychium species show no meaningful genetic differentiation within populations (Farrar, 1998; Hauk & Haufler, 1999; Williams, 2021) or low genetic differentiation among populations (Birkeland et al., 2017; Camacho & Liston, 2001; Dauphin et al., 2020; Swartz & Brunsfeld, 2002). However, the absence of genetic differentiation reported by previous studies may well stem from low marker resolution. The transcriptome‐wide markers used here showed every individual was clearly distinct, and populations showed marked differentiation. The Chasseral and Val d’Hérens populations were collected in the Jura Mountains and Pennine Alps, respectively. The two sites are 120 km apart and separated by habitats unsuitable for B. lunaria. Therefore, reduced gene flow and genetic differentiation among populations is expected. We found no indication of genetic substructure among the two locations Mase and Forclaz within the Val d’Hérens valley. This suggests sufficient gene flow at the local scale or recent recolonization at the upper front of the valley, which is consistent with restriction‐site associated DNA sequencing‐based analyses of the same field sites (Dauphin, 2017). We found no evidence for higher levels of ploidy based on mapped read depths per individual (Figure 3a). The intraindividual allele frequency distributions showed some spread around the singular peak at frequency 0.5 (expected for diploids). The spread in the estimated frequencies likely reflects noise caused by low coverage SNP positions and was observed in other systems as well (Chen et al., 2018). The individual CHA_1_7 has a less pronounced and a shifted major peak. The individual does not cluster as closely as the other genotypes of the same population and is less heterozygous. We find no difference in sequencing quality or quantity that could explain the observed peak pattern. Unequal mapping of reads matching the two alleles may be a possible reason. Such patterns may be expected for introgressed sequences of higher divergence. Absolute genome size measurements corroborate the absence of higher level ploidy. This is in accordance with a broader sampling of genome sizes of 57 individuals across the B. lunaria group, which likewise recovered only diploid conditions (V. Mossion & M. Kessler, unpublished data), except for the allotetraploid species B. yaaxudakeit, which is only known from northwestern North America (Stensvold et al., 2002). Also, consistent with our findings, a recent study of Swiss populations based on allozyme markers found no evidence for fixed heterozygosity, a typical indicator of polyploidy (Dauphin et al., 2020). However, we cannot rule out that very recent polyploidization or autopolyploidization events may remain undetected.Population‐level genetic diversity is indicative of the reproductive mode of Botrychium populations. Self‐fertilization is considered common in homosporous ferns and includes sporophytic and gametophytic selfing (Gastony & Gottlieb, 1985; Klekowski & Baker, 1966; McCauley et al., 1985; Sessa et al., 2016; Soltis & Soltis, 1986, 1990, 1992). In sporophytic selfing, zygotes are produced by gametes from two distinct gametophytes that originate from a single sporophyte. In contrast, in gametophytic selfing, zygotes are produced from gametes of the same gametophyte. Gametophytic selfing is thought to be the main reproductive mode for the genus Botrychium (Hauk & Haufler, 1999) and does lead to completely homozygous plants within one generation (Klekowski & Lloyd, 1968). In a population undergoing largely gametophytic selfing, low genetic diversity would also be expected among individuals due to genetic drift. The genetic diversity of B. lunaria populations (Figure 4d, e) and the clear structure among sites (Figure 4b, c) suggest that sporophytic selfing or outcrossing are predominant. Only one individual in the Chasseral population showed signature of a recent gametophytic selfing event (Figure 4e). These findings contrast with the general assumption that gametophytic selfing is the dominant reproductive mode in the genus but corroborate evidence of outcrossing in Swiss populations (Dauphin et al., 2020).Our study establishes a high‐quality set of transcriptome‐wide SNPs in a species of the early diverging fern genus Botrychium. With an estimated genome size of 19.4–24.6 (2C; ~19.0–23.7 Gb) for B. lunaria, the assembly of transcriptomes provides the only currently feasible approach to generate extensive genome‐wide markers information. Our transcriptome‐wide SNPs enable fine‐grained demographic history analyses and investigate the consequences of the complex mating systems and putative polyploidization events. Furthermore, ecological genomics studies based on our transcriptomic markers will provide the first opportunities to dissect adaptation to the local environment within fern species.
AUTHOR CONTRIBUTIONS
V.M., B.D. and D.C. designed the study. V.M., M.K. and D.C. performed analyses, N.Z. contributed to analyses, B.D. and J.G. acquired funding. V.M. and D.C. wrote the manuscript with input from coauthors.Supplementary MaterialClick here for additional data file.Table S1‐S6Click here for additional data file.File S1Click here for additional data file.File S2Click here for additional data file.File S3Click here for additional data file.File S4Click here for additional data file.File S5Click here for additional data file.File S6Click here for additional data file.File S7Click here for additional data file.
Authors: Troy E Wood; Naoki Takebayashi; Michael S Barker; Itay Mayrose; Philip B Greenspoon; Loren H Rieseberg Journal: Proc Natl Acad Sci U S A Date: 2009-08-10 Impact factor: 11.205
Authors: Joshua P Der; Michael S Barker; Norman J Wickett; Claude W dePamphilis; Paul G Wolf Journal: BMC Genomics Date: 2011-02-08 Impact factor: 3.969
Authors: F Lemoine; J-B Domelevo Entfellner; E Wilkinson; D Correia; M Dávila Felipe; T De Oliveira; O Gascuel Journal: Nature Date: 2018-04-18 Impact factor: 49.962