Literature DB >> 33552749

Genomic and transcriptomic resources for candidate gene discovery in the Ranunculids.

Tatiana Arias^1,2,3, Diego Mauricio Riaño-Pachón⁴, Verónica S Di Stilio².

Abstract

PREMISE: Multiple transitions from insect to wind pollination are associated with polyploidy and unisexual flowers in Thalictrum (Ranunculaceae), yet the underlying genetics remains unknown. We generated a draft genome of Thalictrum thalictroides, a representative of a clade with ancestral floral traits (diploid, hermaphrodite, and insect pollinated) and a model for functional studies. Floral transcriptomes of T. thalictroides and of wind-pollinated, andromonoecious T. hernandezii are presented as a resource to facilitate candidate gene discovery in flowers with different sexual and pollination systems.
METHODS: A draft genome of T. thalictroides and two floral transcriptomes of T. thalictroides and T. hernandezii were obtained from HiSeq 2000 Illumina sequencing and de novo assembly.
RESULTS: The T. thalictroides de novo draft genome assembly consisted of 44,860 contigs (N50 = 12,761 bp, 243 Mbp total length) and contained 84.5% conserved embryophyte single-copy genes. Floral transcriptomes contained representatives of most eukaryotic core genes, and most of their genes formed orthogroups. DISCUSSION: To validate the utility of these resources, potential candidate genes were identified for the different floral morphologies using stepwise data set comparisons. Single-copy gene analysis and simple sequence repeat markers were also generated as a resource for population-level and phylogenetic studies.

Entities: Chemical

Keywords: Ranunculaceae; Thalictrum hernandezii; Thalictrum thalictroides; draft genome; floral transcriptome; pollination syndrome; sexual system

Year: 2021 PMID： 33552749 PMCID： PMC7845765 DOI： 10.1002/aps3.11407

Source DB: PubMed Journal: Appl Plant Sci ISSN： 2168-0450 Impact factor: 1.936

Transcriptomic and genomic resources are a valuable tool for studying development within an evolutionary context, as they enable the search for candidate loci and their regulatory regions, contributing to the upgrading of study systems into model lineages. Thalictrum L. is an emerging model lineage within the Ranunculales (Damerval and Becker, 2017), the sister group to all other eudicots (Lane et al., 2018). The genus is therefore an extant representative of a privileged phylogenetic node, before a major evolutionary radiation within the flowering plants (Zeng et al., 2017). Thalictrum is a temperate genus of approximately 200 species with floral diversity that encompasses unisexual and wind‐pollinated flowers, in association with polyploidy (Soza et al., 2012, 2013). Certain species, mainly T. thalictroides (L.) A. J. Eames & B. Boivin, are amenable to virus‐induced gene silencing (Di Stilio et al., 2010), which has enabled functional studies of ABC model floral MADS‐box genes (Galimba et al., 2012, 2018; Di Stilio et al., 2013; Galimba and Di Stilio, 2015; Soza et al., 2016). Two chloroplast genomes have been assembled to date for this genus, for T. coreanum H. Lév. (Park et al., 2015) and T. thalictroides (from this data set, Morales‐Briones et al., 2019). Here, we contribute genomic resources for Thalictrum, including the first de novo draft genome of diploid, hermaphroditic, and insect‐pollinated T. thalictroides, and two de novo floral transcriptomes for T. thalictroides and for tetraploid, andromonoecious, and wind‐pollinated T. hernandezii Tausch ex J. Presl (Fig. 1). A qualitative comparison between these transcriptomes provided a preliminary list of enriched transcription factor families and potential candidate genes for developmental studies. Microsatellite markers were also identified as a resource for population‐level genetic studies. These genetic resources are presented as a tool to facilitate research on the genetic underpinnings of floral diversity in Thalictrum, and across Ranunculales.

Figure 1

Floral phenotypes of Thalictrum study species. Inflorescence of hermaphroditic T. thalictroides with hermaphrodite open flowers (A) and andromonoecious T. hernandezii with staminate buds () and hermaphrodite open flowers () occurring together in an inflorescence (B); inset, detail of young staminate flower. Scale bar = 1 cm.

METHODS

Thalictrum thalictroides draft nuclear genome

Plant materials and DNA extraction

Two live accessions of T. thalictroides (Tt), TtWT478 and TtWT964, were grown in the University of Washington greenhouse (Fig. 1A, Appendix 1). Total DNA was extracted separately from healthy young leaves of two individuals using a modified cetyltrimethylammonium bromide (CTAB) method (Shyu and Hu, 2013). DNA quality and concentration were measured in a Qubit 4 Fluorometer (Thermo Fisher Scientific, Waltham, Massachusetts, USA) and confirmed with agarose gel electrophoresis.

Library preparation

Total genomic DNA was sheared by sonication for 15–24 min using a Bioruptor (Diagenode, Denville, New Jersey, USA). Size‐selected samples (200–400 bp) were extracted using x‐tracta Gel Extractor tools (USA Scientific, Ocala, Florida, USA) and purified using the Gel DNA Purification Kit (QIAGEN, Germantown, Maryland, USA) for the end‐repair, adenylation of 3′ ends, ligation, and enrichment steps. Sheared DNA was visualized on a 2% low‐melting‐point agarose gel stained with ethidium bromide. Illumina’s TruSeq paired‐end (2 × 100 bp) libraries (Illumina, San Diego, California, USA) were generated for each individual with three different insert sizes (170, 500, and 800 bp), following the manufacturer's instructions. Libraries were assessed in an Agilent 2100 Bioanalyzer with the DNA 1000 Kit (Agilent, Santa Clara, California, USA) to determine average length, quantitated by real‐time quantitative PCR (qPCR) using a TaqMan Probe (Thermo Fisher Scientific) and amplified off‐board on a cBot (TruSeq PE Cluster Kit v3–cBot–HS; Illumina) to generate clusters on the sequencing flow cell, and then sequenced on a HiSeq 2000 (TruSeq SBS Kit v3‐HS; Illumina).

Read pre‐processing and genome assembly

Raw read quality was visually inspected with FastQC (Andrews, 2015) and then treated with Trimmomatic version 0.38 (Bolger et al., 2014) to remove adapter sequences and low‐quality bases, keeping only paired‐end reads with at least 80 bp for de novo assembly. Assemblies were initially conducted in CLC Genomics Workbench version 7.5.1 (QIAGEN, Hilden, Germany) and SOAP de novo version 2.04 (Luo et al., 2012a); assemblies were then scanned using BlobTools (Laetsch and Blaxter, 2017) to identify contigs originating from contaminants and finalized in MaSuRCA version 3.2.9 (Zimin et al., 2013). Genome sequences of potential contaminants were downloaded from the National Center for Biotechnology Information (NCBI) and used to map the clean reads with BBDuk (BBMap version 35.85; https://sourceforge.net/projects/bbmap/), keeping only unmapped reads for genome assembly. K‐mer‐based statistics were computed for the clean reads with Jellyfish version 2.2.10 (Marçais and Kingsford, 2011), GenomeScope version 1.0 (Vurture et al., 2017), and Smudgeplots version 0.2.3 (Ranallo‐Benavidez et al., 2020). One assembly was generated for each of the two T. thalictroides accessions, and a third was generated by combining the data from the two accessions. Each assembly was polished using Pilon version 1.23 (Bruce et al., 2014), with two rounds of error correction. Assembly statistics were computed with Quast version 5.0.2 (Gurevich et al., 2013), and sequence repeats were identified with RepeatModeler version 1.0.11 (http://www.repeatmasker.org/RepeatModeler/) and RepeatMasker version 4.0.9.p2 (http://www.repeatmasker.org/RepeatMasker/). Simple sequence repeat (SSR) markers and loci with di‐ to hexanucleotide repeats were identified in the genome and transcriptomes using the MicroSAtellite Identification tool (MISA; Thiel et al., 2003). The accuracy of the assemblies was assessed by mapping the contaminant‐free clean reads back to the assembly using Bowtie2 version 2.4.1 (Langmead et al., 2012) and computing the fraction of reads that map in the correct orientation (forward‐reverse) and within the length range of the insert size used to build the library; the insert size distribution was then estimated from the mapped paired‐end reads using the CollectInsertSizeMetrics function in Picard version 2.23.8 (http://broadinstitute.github.io/picard/).

Gene predictions and orthogroup identification

RNA‐seq data sets generated from T. thalictroides flowers (see below), together with a 1KP project transcriptome (Matasci et al., 2014) (http://www.onekp.com) and predicted protein sequences from the Aquilegia coerulea E. James genome (Filiault et al., 2018), were used as extrinsic evidence for protein‐coding gene prediction using Braker2 version 2.1.2 (Hoff et al., 2019; Bruna et al., 2020). RNA‐seq data were also employed for protein‐coding gene prediction with StringTie version 1.3.6 (Pertea et al., 2015) and TransDecoder version 5.5.0 (https://transdecoder.github.io); ab initio protein‐coding gene prediction was generated with SNAP daf76ba (Korf, 2004) using Arabidopsis thaliana (L.) Heynh. as a model. Final models for protein‐coding genes were produced with EVidenceModeler version 1.1.1 (Haas et al., 2008). tRNAs were predicted with tRNAscan‐SE version 2.0 (Chan and Lowe, 2019) and ribosomal RNA genes with RNAmmer version 1.2 (Lagesen et al., 2007). Other non‐protein‐coding RNA (ncRNA) were predicted with Infernal version 1.1.2 (Nawrocki and Eddy, 2013) and Rfam version 14.1 (Kalvari et al., 2018). Predicted protein‐coding genes were evaluated with TransRate version 1.0.3 (Smith‐Unna et al., 2016) against Papaver somniferum L. and A. coerulea (Ranunculales), and A. thaliana and Solanum lycopersicum L., representing the two main eudicot orders.

Genome annotation

Functional annotation was conducted with BLASTP (against the TAIR, SwissProt, and TrEMBL protein databases). Protein functional descriptions and Gene Ontology (GO) terms were added with the “Automated Assignment of Human Readable Descriptions” tool (AHRD version 3.3.3) (The Tomato Genome Consortium et al., 2012). Conserved regions, domains, and sequence features were identified with InterProScan version 5.35‐74.0 (Jones et al., 2014). Transcription‐associated proteins (TAPs) were identified using the approach described for the construction of PlnTFDB version 3.0 (http://plntfdb.bio.uni‐potsdam.de) (Pérez‐Rodríguez et al., 2010). Classification rules and software to assign protein sequences into TAP families based on their domain architecture were obtained from https://bitbucket.org/diriano/mytfdb. Benchmarking Universal Single‐Copy Orthologs (BUSCO) version 3.0 (Waterhouse et al., 2017) was used with near‐universal single‐copy orthologs selected from OrthoDB version 9.0 for Embryophyta (Zdobnov et al., 2017). Clusters of orthologous genes (orthogroups) were identified with OrthoFinder version 2.3.3 (Emms and Kelly, 2019). Sequence similarity was computed with Diamond version 0.9.24 (Buchfink et al., 2015) and gene clusters were generated with MCL version 14‐137, with 1.5 inflation (Enright et al., 2002).

Thalictrum thalictroides and T. hernandezii floral transcriptomes

Plant materials and RNA extractions

Fresh open flowers were collected from an individual of T. thalictroides (Fig. 1A) that was also used for genome sequencing (TtWT478, hermaphroditic flowers) and from T. hernandezii (Fig. 1B, Th_HWT441 hermaphroditic and Th_SWT441 staminate flowers) (Fig. 1, Appendix 1). Flowers of T. thalictroides and T. hernandezii were flash‐frozen in liquid nitrogen and total RNA was extracted with TRIzol (Invitrogen, Carlsbad, California, USA), following the manufacturer’s instructions. RNA quality (RIN ≥ 6.5) and concentration were determined in an Agilent 2100 Bioanalyzer and with agarose gel electrophoresis.

RNA library preparation

Library preparation and sequencing were carried out by the Beijing Genomics Institute (BGI Group, Hong Kong). Flower RNA‐seq libraries were prepared for T. thalictroides hermaphrodite (Tt_H), T. hernandezii hermaphrodite (Th_H), and T. hernandezii staminate (Th_S) flowers using Illumina’s TruSeq mRNA (unstranded) paired‐end (2 × 100 bp) kit and sequenced in a HiSeq 2000 Illumina sequencer. A mixed floral stage library of T. thalictroides (1KP, library name: GBVZ; Matasci et al., 2014) was downloaded from NCBI’s Sequence Read Archive and also included in the analysis.

Read pre‐processing, sequence assembly, and annotation

Contaminant reads were removed using BBDuk (BBMap version 35.85) by mapping against the same contaminants detected during genome assembly (see above); de novo assembly was then conducted in Trinity version 2.8.5 (Grabherr et al., 2011) and assessed for completeness with BUSCO version 3.0 (as for the genome assembly). Two de novo transcriptome assemblies were generated, one for T. thalictroides and another for T. hernandezii (combining libraries for the two flower types). Only contigs larger than 200 bp were used in further analyses. Read mapping metrics were computed against the assemblies using Bowtie2 version 2.4.1 (Langmead et al., 2012) with the parameters ‐‐maxins 1000 ‐‐very‐sensitive and Salmon version 0.14.0 (Patro et al., 2017) with the parameters quant ‐‐validateMappings ‐‐seqBias, as a measure of transcriptome accuracy. Assembled transcripts were compared against NCBI’s non‐redundant protein database using Diamond version 0.9.23 and assessed in MEGAN‐LR version 6.14.2 (Huson et al., 2018). Polypeptides encoded by the assembled transcripts were identified with TransDecoder version 5.5.0, including BLASTP hits against the SwissProt database and profile hidden Markov model hits against the Pfam database (El‐Gebali et al., 2019). Functional annotation, including identification of TAPs and clustering of shared (orthologous) genes, or “orthogroups,” was carried out as described above.

Identification of candidate genes

First, we performed a three‐way comparison to identify orthogroups among the T. thalictroides genome and the T. thalictroides and T. hernandezii de novo transcriptomes. We considered that orthogroups present in the T. thalictroides genome and transcriptome, but not found in the T. hernandezii transcriptome, are more likely to contain high‐confidence genes involved in the development of floral traits that characterize insect‐pollinated flowers (Fig. 1A). Conversely, orthogroups expressed exclusively in T. hernandezii (i.e., not found in the T. thalictroides transcriptome) that map to the T. thalictroides draft genome were considered more likely to contain high‐confidence genes involved in the development of floral traits that characterize wind‐pollinated flowers (Fig. 1B). Second, we analyzed T. hernandezii‐specific orthogroups to identify transcripts associated with the different floral sexes, i.e., staminate (male) vs. hermaphrodite. To that end, we computed the expression level of the de novo–assembled T. hernandezii transcripts in the two T. hernandezii data sets (Ther_S and Ther_H) using Salmon version 0.14.0 with options ‐‐seqBias ‐‐validateMappings ‐‐recoverOrphans ‐‐libType A and considered a transcript as expressed when it had at least a single mapped read. For the data intersections of interest, we performed an enrichment analysis of the families of TAPs using a Fisher’s exact test, correcting P values for the false discovery rate using the Benjamini–Hochberg method (Benjamini and Hochberg, 1995). GO term enrichment analysis was conducted using topGO (Alexa and Rahnenfuhrer, 2010), with the weight0 method, correcting P values for the false discovery rate using the Benjamini–Hochberg method.

RESULTS

A draft nuclear genome for Thalictrum thalictroides

Genome sequencing and de novo assembly

Paired‐end sequencing of the six libraries from two live accessions of T. thalictroides resulted in 49,105,897 sequenced fragments or 8.8 Gbp, after quality‐trimming and contaminant removal (Appendix 2). The genome sequences of contaminants identified by BlobTools version 1.0.1 (Laetsch and Blaxter, 2017; Appendix S1), as well as mitochondria and plastids (Appendix S2), were used to map the clean reads with bbduk2, keeping only unmapped reads for further processing. Short‐read data from TtWT478 had contamination from an aphid (GCF_000142985.2) and its bacterial endosymbiont (GCF_000009605.1; Appendix S1), which was removed. Metrics for the T. thalictroides draft genome assemblies were generated with MaSuRCA version 3.2.9 (Table 1; Appendices S3, S4). The best assembly, as measured by mapped reads (Appendix S4), assembly contiguity metrics (Table 1), and gene content (BUSCOs; Fig. 2), was generated by combining both accessions. This “consensus” assembly consisted of 44,860 contigs, with N50 = 12,761 bp (Table 1, Appendix S2) and 83.8% complete conserved embryophyte single‐copy genes (88.5% when considering complete and fragmented BUSCOs; Fig. 2). The BUSCO estimate increased to 84.5% after gene prediction (90.9% when considering complete and fragmented BUSCOs). We confirmed with Smudgeplots (Ranallo‐Benavidez et al., 2020) that the joint data set behaves like a diploid genome (Appendix S5), and the low level of duplicated BUSCOs in the consensus supports this (46/1440, or 3.2%; Fig. 2). We were able to map, on average, 5.75% more high‐quality reads to the consensus assemblies than to either individual genome across the different next‐generation sequencing libraries (Appendix S4). Mean contig coverage for the consensus assembly (from the two combined accessions) was 55× (median = 31.9×). Our genome size estimate from the consensus was 243.1 Mbp, comparable to the 286.4 Mbp estimate from k‐mer frequency statistics in GenomeScope version 1.0; the heterozygosity estimate was 1.23% (Appendix S6).

Table 1

Summary statistics for de novo genome assemblies of Thalictrum thalictroides.

Metric	Consensus	WT964 ^b	WT478 ^c
No. of contigs	44,860	41,892	41,384
Total length (bp)	243,091,176	205,112,749	167,308,766
Largest contig (bp)	119,890	80,531	57,549
GC (%)	36.23	36.29	36.61
N50 (bp)	12,761	10,189	8017
L50	4281	4739	4875

Assembly statistics were computed using Quast version 5.0.2 (Gurevich et al., 2013).

Sequenced accessions from two different individuals.

Figure 2

Proportion of Benchmarking Universal Single‐Copy Orthologs (BUSCOs) for the Thalictrum thalictroides genome (total number of BUSCOs = 1440 genes in Embryophyta from OrthoDB version 9.0). Data are shown for the two sequenced accessions (WT964 and WT478) and the consensus.

Summary statistics for de novo genome assemblies of Thalictrum thalictroides. Assembly statistics were computed using Quast version 5.0.2 (Gurevich et al., 2013). Sequenced accessions from two different individuals. Proportion of Benchmarking Universal Single‐Copy Orthologs (BUSCOs) for the Thalictrum thalictroides genome (total number of BUSCOs = 1440 genes in Embryophyta from OrthoDB version 9.0). Data are shown for the two sequenced accessions (WT964 and WT478) and the consensus. More than one third of the genome (37.6%) could be assigned to different classes of repeat elements, with the most abundant class being the autonomous long terminal repeat retroelements, particularly from the Copia (7.7%) and Gypsy (6.8%) superfamilies (Appendix S7). SSRs were also a common occurrence in the genome; 65,651 were identified and the most common SSR was dinucleotide (Table 2; Arias et al., 2020a).

Table 2

Simple sequence repeat (SSR) motif distribution in the Thalictrum thalictroides genome and the T. thalictroides and T. hernandezii (hermaphrodite and staminate combined) floral transcriptomes.

SSR properties	T. thalictroides genome ^b	T. thalictroides transcriptome	T. hernandezii transcriptome
Total no. of identified SSRs ^a	65,651	11,826	18,631
No. of SSRs present in tandem	3582	381	538
No. of contigs with SSR	22,867	9844	16,512
No. of contigs with >1 SSR	12,596	1634	1898

SSR sequences are available at https://doi.org/10.6084/m9.figshare.11984370.v8.

With abundance ≥5% of total SSRs in the genome.

Simple sequence repeat (SSR) motif distribution in the Thalictrum thalictroides genome and the T. thalictroides and T. hernandezii (hermaphrodite and staminate combined) floral transcriptomes. SSR sequences are available at https://doi.org/10.6084/m9.figshare.11984370.v8. With abundance ≥5% of total SSRs in the genome. We predicted 33,624 protein‐coding genes and 1936 non‐coding RNA loci (Appendix S8). Orthogroup analysis resulted in 67.4% of predicted protein‐coding genes forming clusters with at least one of the four reference genomes included in the analysis (Appendix S9); most were shared by all five species, providing support for our gene predictions. TransRate analysis also showed that a large number of predicted proteins in T. thalictroides had best bi‐directional BLAST hits against the reference genomes of A. thaliana, S. lycopersicum, A. coerulea, and P. somniferum (Appendix S10), and 88.3% of predicted protein‐coding genes had hits against InterPro member databases (Arias et al., 2020b). Functional descriptions were added for 22,603 protein‐coding genes. We identified 1569 TAPs; 1258 of these were transcription factors (TFs) that can be grouped into 68 TF families, and the remaining 311 belonged to 29 families of other transcriptional regulators (oTRs) (Fig. 3, Appendix S11).

Figure 3

The relative abundance of different families of transcription‐associated proteins (TAPs) in the floral transcriptomes of Thalictrum thalictroides (Tt) and T. hernandezii (Th_H, hermaphrodite and Th_S, staminate). Transcripts >0.1 TPM (transcripts per million) are included. Statistically significant differences between samples (chi‐square test) are marked with black (P < 0.01) or red (P < 0.05) asterisks. TFF: transcription factor family, OTR: other transcriptional regulator family.

Thalictrum thalictroides and T. hernandezii floral transcriptome assembly

De novo transcriptome assemblies of T. thalictroides (Tt; GHXU00000000) consisted of 54,104 contigs (N50 = 1817 bp), while T. hernandezii (Th; GHXT00000000) had 124,707 contigs (N50 = 1703 bp), with 80.1% and 82.9% identified complete BUSCOs, respectively (Table 3). For T. thalictroides, the total percentage of identified complete BUSCOs increased to 94.2% when including both the transcripts predicted from the genome sequence and those from the de novo transcriptome assembly (96.3% when considering complete plus fragmented BUSCOs). TransDecoder identified 26,407 ORFs per sample for T. thalictroides and 52,313 for T. hernandezii (Arias et al., 2020c). There were approximately twice as many conditional reciprocal best BLAST hits in T. hernandezii compared to T. thalictroides with either of the reference transcriptomes, which is consistent with the former being a tetraploid (Table 4). A search within the Thalictrum floral transcriptomes detected 30,457 SSR markers, with approximately half of the analyzed contigs containing SSRs. The most common SSR was trinucleotide, followed by dinucleotide (Table 2; Arias et al., 2020a).

Table 3

Properties of de novo floral transcriptome assemblies of Thalictrum thalictroides and T. hernandezii.

Metrics	Parameter	T. thalictroides	T. hernandezii
Contigs	No. of sequences (transcripts)	54,104	124,707
	Largest sequence (bp)	14,923	11,578
	No. of bases	72,336,777	150,432,265
	Mean sequence length	1337.0	1206.3
	No. of sequences >1 Kbp	29,920	60,919
	N50 (bp)	1817	1703
	% GC	40	40
BUSCO	Complete	80.1%	82.9%
	Single copy	42.3%	16.5%
	Duplicated	37.8%	66.4%
	Fragmented	7.8%	7.6%
	Missing	12.1%	9.5%
Read mapping	No. of paired‐ends (fragments)	40,504,757	65,712,208
Read mapping	Proportion of paired‐end fragments mapping back to the assembly	87.92% (Salmon) 88.85% overall mapping 84.25% concordantly mapping (Bowtie2)	89.64% (Salmon) 90.72% overall mapping 87.03% concordantly mapping (Bowtie2)

Table 4

Comparative metrics of de novo floral transcriptome assemblies of Thalictrum thalictroides and T. hernandezii.

	T. thalictroides		T. hernandezii
Parameter	Reference Aquilegia coerulea	Reference T. thalictroides	Reference Aquilegia coerulea	Reference T. thalictroides
No. of CRBB ^a contigs	40,771	46,744	97,550	90,591
Proportion of CRBB contigs	0.75357	0.86397	0.78223	0.72643
No. of reciprocal best hits per reference	0.99289	1.31451	2.74325	2.20615
No. of CRBB reference transcripts	18,877	19,387	18,323	21,224
Proportion of CRBB reference transcripts	0.45971	0.54519	0.51527	0.51686

Conditional reciprocal best BLAST algorithm, implemented in TransRate (Smith‐Unna et al., 2016).

Properties of de novo floral transcriptome assemblies of Thalictrum thalictroides and T. hernandezii. 87.92% (Salmon) 88.85% overall mapping 84.25% concordantly mapping (Bowtie2) 89.64% (Salmon) 90.72% overall mapping 87.03% concordantly mapping (Bowtie2) Comparative metrics of de novo floral transcriptome assemblies of Thalictrum thalictroides and T. hernandezii. Conditional reciprocal best BLAST algorithm, implemented in TransRate (Smith‐Unna et al., 2016).

High representation of transcription factors in floral transcriptomes

The T. thalictroides and T. hernandezii floral transcriptomes contained 3541 transcripts that could be assigned into 94 TAP families; of these, MYB‐related, AP2‐EREBP, bHLH, and bZIP were among the top families. MADS‐box genes (which include the floral organ identity genes) were found in all floral transcriptomes at approximately 2% relative abundance, and several TAP families were represented at significantly different levels in the three transcriptomes (Fig. 3).

Functional annotation of floral transcriptomes

To optimize orthogroup detection, Thalictrum transcriptomes were compared against multiple reference genomes: (1) T. thalictroides draft genome (this work); (2) A. thaliana (Brassicaceae); and (3) the two phylogenetically most closely related genomes available, A. coerulea (Ranunculaceae) and P. somniferum (Papaveraceae, Ranunculales). As a result of these comparisons, we identified 14 T. thalictroides‐ and 75 T. hernandezii‐specific orthogroups (Appendix S12). Stepwise comparisons were conducted to (a) validate transcripts against the T. thalictroides draft genome and (b) conduct inter‐ and intraspecies qualitative comparisons between wind‐ vs. insect‐pollinated and male vs. hermaphrodite floral morphologies (Fig. 1). First, we performed an inter‐species transcriptome comparison, using the draft genome for validation (Fig. 4A). The three‐way intersection in the Venn diagram (5204 orthogroups) represents orthologs that can be mapped to the T. thalictroides draft genome and that are expressed in floral transcriptomes of both species. A GO term enrichment analysis for biological process provided a broad characterization of gene functions at this triple intersection, examples of relevant categories include gene silencing by miRNA, maintenance of floral organ identity, meristem initiation, adaxial/abaxial pattern specification, and negative regulation of growth (Appendix S13; Arias et al., 2020d). A “core” of 9556 orthogroups common to both species is represented by three of the intersecting areas (5204 + 1298 + 3054). Intersection area “a” (3477 orthogroups) comprised 6451 transcripts uniquely expressed in T. thalictroides that also mapped to the reference genome (therefore considered of high confidence). Intersection area “b” (3054 orthogroups) comprised 11,251 transcripts found exclusively in T. hernandezii and similarly validated by the reference genome. Two sets of species‐specific orthogroups not found in the genome (274 and 473 orthogroups each) could represent lineage‐specific expansions and/or losses, or artifacts arising from incomplete sequencing. Finally, orthogroups found in both transcriptomes but not in the genome (1298) point to limitations due to fragmentation, as many can be found in other reference genomes (Appendix S12). Second, we performed an intraspecific comparison within T. hernandezii‐specific orthogroups validated by our draft genome (Fig. 4A area “b”; 3054 orthogroups). Transcripts from male (Ther_S) and hermaphrodite (Ther_H) floral transcriptomes were compared, yielding 447 male‐specific and 765 hermaphrodite‐specific transcripts (Fig. 4B, areas “c” and “d”, respectively).

Figure 4

The number of shared and unique orthogroups or transcripts among the Thalictrum data sets in this study, with examples of candidate genes (in red based on Arabidopsis, see text for details). The total number of orthogroups per data set is indicated in parentheses. Representative flower pictures are shown for T. thalictroides (TTHA), insect‐pollinated with hermaphrodite flowers, and T. hernandezii (THER), wind‐pollinated with hermaphrodite (THER_HERM) and staminate (THER_MALE) flowers. (A) Interspecies comparison of floral transcriptomes (TTHA_RNA and THER_RNA) and validation against the T. thalictroides draft genome (TTHA). Area “a” represents T. thalictroides‐exclusive orthogroups found in the draft genome; “b” represents T. hernandezii‐exclusive orthogroups (both floral types combined) found in the draft genome. (B) Intraspecies comparison of transcripts of two floral types in T. hernandezii (male and hermaphrodite). Area “c” represents male flower–specific transcripts; “d” represents hermaphrodite flower–specific transcripts.

DISCUSSION

The genome completeness results obtained here are within range of other draft genomes of non‐traditional species, such as Calotropis gigantea (77–88%; Hoopes et al., 2017) and certain Brassicaceae (61–95%; Lopez et al., 2017), yet lower than the reference genomes used in our orthogroup analysis (94.5–99.6%; Appendix S9). Our genome size estimated range of 243.1–286.4 Mbp was comparable to that of A. coerulea, another diploid member of the Ranunculaceae (291.7 Mbp; Filiault et al., 2018). This value is slightly lower than our previous estimate for this species from flow cytometry (1C value = 0.366 pg or 356 Mbp; Soza et al., 2013), which we attribute to the relatively low sequencing depth and to the presence of complex or long repetitive regions (longer than the short reads) that were not recovered in our sequencing and/or assembly. Both of these limitations can be overcome in the near future with additional deep sequencing from third‐generation sequencing technologies. Among the most abundant repetitive elements found, long terminal repeat transposable elements have been previously found to underlie homeotic flower mutants of T. thalictroides (Galimba et al., 2012). Our heterozygosity estimate of 1.23% is higher than that of A. coerulea (0.2–0.35%; Filiault et al., 2018) but still within the range of obligate outcrossers (Leffler et al., 2012). The number of TF families detected in the T. thalictroides draft genome is comparable to that found in the genomes of A. thaliana, S. lycopersicum, A. coerulea, and P. somniferum. The ratio of TF to oTRs was similar among the three representatives of Ranunculales (T. thalictroides, A. coerulea, and P. somniferum), with approximately 4.1 TFs per oTR, and lower than estimates for A. thaliana and S. lycopersicum (5.1 TF per oTR). The use of multiple reference genomes captured most orthologs in our transcriptomes, aiding in the validation of our gene predictions. Because T. hernandezii is a tetraploid (Soza et al., 2013), de novo transcriptome assembly was expected to include up to four expressed alleles per gene, thus explaining the larger number of assembled transcripts and of reciprocal best hits per reference for this species (Tables 3, 4), as well as the larger fraction of duplicated BUSCOs.

Applications: Data‐mining examples for candidate genes in flower development

To test for applications of our results, we identified potential candidate genes for the morphological differences between flower types in the two species (Fig. 1). Our goal was to provide a preliminary, qualitative working list of candidate genes for future investigations of the genetic basis of distinct sexual systems (hermaphroditic vs. unisexual) and pollination modes (insect vs. wind). To that end, we first searched for known candidate genes within our comparisons (Fig. 4). Venn diagram areas “a” and “b” (Fig. 4A) represent functional orthogroups for flowers with distinct morphologies due to differing pollination modes: insect‐pollinated T. thalictroides vs. wind‐pollinated T. hernandezii. Venn diagram areas “c” and “d” (Fig. 4B) represent examples of transcripts expressed in flowers with distinct sexual systems: staminate flowers, with sepals and stamens, and hermaphrodite flowers, with added carpels. It is possible that a small number of these orthogroups are expressed at low levels in both species, but that due to the lack of replicates they would appear as differentially expressed. Thalictrum thalictroides has petaloid sepals that are comparatively bigger and white, upright stamens with smaller anthers, and carpels with short styles and stigmas. Thalictrum hernandezii has smaller flowers with smaller, green sepals, pendant stamens with larger anthers on longer filaments, and carpels with longer styles and stigmas (Fig. 1). Based on these phenotypic differences, we predict that the transcriptome comparisons could yield genes involved in processes such as cell elongation or cell division (longer stamen filaments and styles), increased flexibility (in pendant filaments), epidermal cell elongation (extended stigmatic papillae in wind‐pollinated flowers; Di Stilio et al., 2009), or increased pollinator grip in petaloid organs, among others. A subset of the genes emerging from our comparisons fit these criteria, thus serving as validation for the usefulness of our data sets as a resource (see below). First, we searched for previously characterized candidate genes in the T. thalictroides draft genome and the three transcriptomes. A homolog of MIXTA‐like2 (ThtMYBML2, FJ487606.1) with a role in papillate cells and stigmatic papillae was identified in both species, as a full transcript in the genome assembly (GenBank: KAF5204412.1) and in the T. hernandezii transcriptome (90% protein identity, TSA:GHXT01017115), and in two fragments in the T. thalictroides transcriptome assembly (TSA:GHXU01051721 and GHXU01036124). A second candidate for differences in morphology between species is the Thalictrum STYLE2.1 ortholog, which is involved in style length in tomato (Chen et al., 2007). A Solanum query (UniProt B6CG44) was used to retrieve sequences from the T. thalictroides genome (65% protein identity, GenBank: KAF5189420.1) and the transcriptomes (65% identity, TSA:GHXU01012872; 67% identity, TSA:GHXT01076288). The presence of these candidate genes at the three‐way intersection of all data sets (Fig. 4A) suggests that regulatory changes in expression levels, rather than on/off switches, likely underlie the phenotype differences. We then identified candidate orthogroups or transcripts of interest in our data set via stepwise comparisons (Arias et al., 2020e). Orthogroups exclusive to T. thalictroides (Fig. 4A, area “a”) and relevant to the insect pollination syndrome included orthologs of BIG PETAL (BPE; Szécsi et al., 2006) and JAGGED (JAG; Sauret‐Güeto et al., 2013) with a potential role in large, petaloid sepals (Fig. 1A); LESS ADHESIVE POLLEN 5 (LAP5; Dobritsa et al., 2010) for “stickier” pollen (higher in insect‐pollinated species); STYLISH 1/2 (STY1/2; Sohlberg et al., 2006) for regulation of style length (short) and stigma size (compact or “capitate”); and jasmonate (JA) hormone pathway JAZ 1‐3 (Figueroa and Browse, 2015) in stamen filament elongation (shorter and flattened). Other “usual suspects” in flower development enriched in this area included WUSCHEL‐related homeobox (WOX3‐4) contributing to general aspects of floral architecture and morphology (Costanzo et al., 2014), the YABBY family INNER‐NO‐OUTER (INO) in ovule development (Simon et al., 2017), and the TCP family CYCLOIDEA (CYC) in flower symmetry (Hileman and Cubas, 2009). Thalictrum hernandezii–specific orthogroups relevant to the wind‐pollination syndrome (Fig. 4A, area “b”) included orthologs of Arabidopsis FLAKY POLLEN 1 (FKP1) affecting pollen coat qualities (Ishiguro et al., 2010) relevant to pollen adaptations to wind pollination (less “sticky”); PECTIN METHYLESTERASE 34 (PME34), which is highly expressed in stamen filaments and relevant to long, flexible filaments and styles (Gou et al., 2012), and PME48, a promoter of pollen tube elongation and thus relevant for successful fertilization through long styles (Leroux et al., 2015); and STIGMA/STYLE CELL‐CYCLE INHIBITOR 1 (SCI1; DePaoli et al., 2014), relevant to the extended styles and stigmas. Most members of the OVULE ABORTION (OVA) family (Berg et al., 2005), relevant to sex determination, were also specific to this andromonoecious species. Comparisons between T. hernandezii hermaphrodite and staminate flower transcriptomes (intra‐individual; Fig. 4B) were best suited to identify carpel‐specific candidate genes (only present in hermaphrodite flowers) and, to a lesser extent, genes related to sexually dimorphic features of stamens and sepals, which are found in both flower types. Out of 765 transcripts uniquely expressed in T. hernandezii hermaphrodite flowers (Fig. 4B, area “d”), relevant candidates included orthologs of PECTIN METHYL ESTERASE 5 (PME5) expressed in carpels and during fruit development (Louvet et al., 2006), SHATTERPROOF 2 (SHP2) and SEEDSTICK (STK) in ovule development (Favaro et al., 2003), EMBRYO DEFECTIVE 25 (EMB25) in embryo development (Meinke, 2020), and RADIALIS‐like (RAD‐like) in ovule and embryo development (Baxter et al., 2007). Among the 447 male‐specific transcripts, MIKC* MADS‐box genes AG‐like30/65 are main contributors to pollen and pollen‐tube function (Adamczyk and Fernandez, 2009) and orthologs with pollen development function, such as POLLEN CALCIUM BINDING PROTEIN 1 (PC1; Wang et al., 2008), ABERRANT POLLEN DEVELOPMENT 2 (APD2; Luo et al., 2012b), and POLLEN EXPRESSED TRANSCRIPTION FACTOR 2, affecting pollen germination (PTF2; Niu et al., 2013).

Conclusions

This study provides genomic and transcriptomic resources for Thalictrum, a representative of an early‐diverging lineage of eudicots with distinct floral morphologies representing diversity in sexual and pollination systems. Genomic resources for T. thalictroides and transcriptomes for T. thalictroides and T. hernandezii (Ranunculaceae) generated here increased the known set of protein‐coding genes for this genus to 33,624 (predicted from genome sequence, BioProject: PRJNA439007), from approximately 132 nuclear genes, 10,461 expressed sequence tags, and 130 population sets available in NCBI databases to date. The value of these resources has been exemplified in the identification of transposable elements, molecular markers, and putative candidate genes. Future potential uses of these resources include the identification of other genes of interest and their regulatory regions (draft genome), as well as primer design to contribute to ongoing phylogenetic and population‐level studies in Thalictrum (e.g., Humphrey and Ossip‐Drahos, 2018; Timerman and Barrett, 2018) and in other Ranunculids.

Author Contribution

T.A. co‐developed questions and framework, obtained funding, made collections, performed lab work, performed analyses, and wrote the initial draft of the manuscript. D.M.R.P. co‐developed the framework, performed analyses, and edited the manuscript. V.D.S. co‐developed questions and framework, performed analyses, mentored T. A., and wrote and edited the manuscript. All authors approved the final version of the manuscript. APPENDIX S1. Phylum‐level taxonomic assignment of contaminants in the genome of Thalictrum thalictroides (from BlobPlots). The top plot summarizes sequence coverage and GC content for each contig in the assembly (circles), with T. thalictroides contigs shown in yellow. The size of the circles is proportional to contig size; the color key for taxa is provided in the inset legend. The bottom plot shows the frequency distribution of the contaminating taxa for each of the two accessions used. Click here for additional data file. APPENDIX S2. Genome accession numbers of species, mitochondria and plastids used for contaminant removal using BBduk. Click here for additional data file. APPENDIX S3. Frequency of contig length in de novo genome assemblies of Thalictrum thalictroides. Click here for additional data file. APPENDIX S4. Metrics for clean read mapping to de novo genome assemblies of Thalictrum thalictroides (after quality‐trimming and contaminant removal). Click here for additional data file. APPENDIX S5. Estimation of ploidy from clean reads using Smudgeplots version 0.2.3 (k‐mer = 21). Color intensity reflects the approximate number of k‐mers per bin, from purple (low) to yellow (high). The most abundant k‐mer pair is the diploid heterozygous (AB). Click here for additional data file. APPENDIX S6. Genome size estimation with GenomeScope version 1.0. Click here for additional data file. APPENDIX S7. Transposable and other repeat elements in the Thalictrum thalictroides genome. Click here for additional data file. APPENDIX S8. Number of predicted RNA types in the Thalictrum thalictroides genome. Click here for additional data file. APPENDIX S9. Identification of orthogroups in the Thalictrum thalictroides genome using columbine (Aquilegia coerulea), tomato (Solanum lycopersicum), opium poppy (Papaver somniferum), and Arabidopsis thaliana. Protein identifiers for each orthogroup and species are available at https://doi.org/10.6084/m9.figshare.11984358.v1. Click here for additional data file. APPENDIX S10. Evaluation of the predicted transcriptome from the Thalictrum thalictroides genome assembly (Tt) against four reference genomes (see Appendix S9). Metrics are from TransRate (Smith‐Unna et al., 2016). Click here for additional data file. APPENDIX S11. Number and type of transcription‐associated proteins (TAPs) and proportion of complete BUSCO in the genomes of Thalictrum thalictroides (TTHA), Aquilegia coerulea (ACOE), Papaver somniferum (PSOM), Solanum lycopersicum (SLYC), and Arabidopsis thaliana (ATHA). Click here for additional data file. APPENDIX S12. The number of shared and unique orthologous genes (orthogroups) at the intersection between five species and six data sets. Genome comparisons between Thalictrum thalictroides (TTHA, this study) and the reference genomes of Arabidopsis thaliana (ATHA), Aquilegia coerulea (ACOE), and Papaver somniferum (PSOM); and two Thalictrum floral transcriptomes, T. thalictroides (TTHA_RNA, this study) and T. hernandezii (THER_RNA; this study). Click here for additional data file. APPENDIX S13. Enriched Gene Ontology categories for comparisons between floral transcriptomes of Thalictrum thalictroides (Tt) and T. hernandezii (Th). Click here for additional data file.

Species	DNA code	Usage	Voucher accession no. (Herbarium)	Locality	SRA accession no.
T. hernandezii Tausch ex J. Presl	ThWT441	RNA‐seq analysis	V. Di Stilio 119 (WTU)	Cultivated from wild seed, Huitzila, Mexico Liston1215	SRR6869420, SRR6869419
T. thalictroides (L.) A. J. Eames & B. Boivin	TtWT478	Whole genome assembly RNA‐seq analysis	V. Di Stilio 124 (WTU)	Cultivated from nursery material (MI, USA)	SRR6869426, SRR6869425, SRR6869418 Transcriptome: SRR6869422 GBVZ
T. thalictroides	TtWT964	Whole genome assembly	Unvouchered	Cultivated from nursery material (NC, USA)	SRR6869424, SRR6869423, SRR6869421

Species

DNA code

Usage

Voucher accession no. (Herbarium)

Locality

SRA accession no.

T. hernandezii Tausch ex J. Presl

ThWT441

RNA‐seq analysis

V. Di Stilio 119 (WTU)

Cultivated from wild seed, Huitzila, Mexico Liston1215

SRR6869420, SRR6869419

T. thalictroides (L.) A. J. Eames & B. Boivin

TtWT478

Whole genome assembly

RNA‐seq analysis

V. Di Stilio 124 (WTU)

Cultivated from nursery material (MI, USA)

SRR6869426, SRR6869425, SRR6869418

Transcriptome:

SRR6869422

GBVZ

T. thalictroides

TtWT964

Whole genome assembly

Unvouchered

Cultivated from nursery material (NC, USA)

SRR6869424, SRR6869423, SRR6869421

Sample	Library insert size	No. of sequenced fragments					Clean reads (Gbp)	SRA accession no.
Sample	Library insert size	Raw reads	Trimmed reads	Percent (from raw) after Trimming	Clean reads ^a	Percent (from raw) after cleaning	Clean reads (Gbp)	SRA accession no.
WT964	170	11,429,928	10,146,420	88.8	9,758,045	85.4	1.684	SRX3823744
WT964	500	11,887,778	9,155,669	77.0	8,910,174	75.0	1.497	SRX3823741
WT964	800	11,826,149	9,253,079	78.2	9,130,960	77.2	1.524	SRX3823742
WT478	170	11,484,802	10,191,821	88.7	6,601,641	57.5	1.290	SRX3823739
WT478	500	11,477,882	10,414,934	90.7	7,727,941	67.3	1.502	SRX3823740
WT478	800	11,721,131	9,012,622	76.9	6,977,136	59.5	1.350	SRX3823747

Clean reads are the number of reads remaining after quality‐trimming and contaminant removal (see Appendices S1 and S2 for a list of contaminants).

71 in total

1. Fast and sensitive protein alignment using DIAMOND.

Authors: Benjamin Buchfink; Chao Xie; Daniel H Huson
Journal: Nat Methods Date: 2014-11-17 Impact factor: 28.547

2. Divergent selection on the biomechanical properties of stamens under wind and insect pollination.

Authors: David Timerman; Spencer C H Barrett
Journal: Proc Biol Sci Date: 2018-12-19 Impact factor: 5.349

3. Partial redundancy and functional specialization of E-class SEPALLATA genes in an early-diverging eudicot.

Authors: Valerie L Soza; Corey D Snelson; Kristen D Hewett Hazelton; Verónica S Di Stilio
Journal: Dev Biol Date: 2016-08-06 Impact factor: 3.582

4. tRNAscan-SE: Searching for tRNA Genes in Genomic Sequences.

Authors: Patricia P Chan; Todd M Lowe
Journal: Methods Mol Biol Date: 2019

5. Four closely-related RING-type E3 ligases, APD1-4, are involved in pollen mitosis II regulation in Arabidopsis.

Authors: Guo Luo; Hongya Gu; Jingjing Liu; Li-Jia Qu
Journal: J Integr Plant Biol Date: 2012-09-19 Impact factor: 7.061

6. PECTIN METHYLESTERASE48 is involved in Arabidopsis pollen grain germination.

Authors: Christelle Leroux; Sophie Bouton; Marie-Christine Kiefer-Meyer; Tohnyui Ndinyanka Fabrice; Alain Mareck; Stéphanie Guénin; Françoise Fournet; Christoph Ringli; Jérôme Pelloux; Azeddine Driouich; Patrice Lerouge; Arnaud Lehner; Jean-Claude Mollet
Journal: Plant Physiol Date: 2014-12-18 Impact factor: 8.340

7. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics.

Authors: Robert M Waterhouse; Mathieu Seppey; Felipe A Simão; Mosè Manni; Panagiotis Ioannidis; Guennadi Klioutchnikov; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal: Mol Biol Evol Date: 2018-03-01 Impact factor: 16.240

8. Salmon provides fast and bias-aware quantification of transcript expression.

Authors: Rob Patro; Geet Duggal; Michael I Love; Rafael A Irizarry; Carl Kingsford
Journal: Nat Methods Date: 2017-03-06 Impact factor: 28.547

9. The Pfam protein families database in 2019.

Authors: Sara El-Gebali; Jaina Mistry; Alex Bateman; Sean R Eddy; Aurélien Luciani; Simon C Potter; Matloob Qureshi; Lorna J Richardson; Gustavo A Salazar; Alfredo Smart; Erik L L Sonnhammer; Layla Hirsh; Lisanna Paladin; Damiano Piovesan; Silvio C E Tosatto; Robert D Finn
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

10. Chloroplast primers for clade-wide phylogenetic studies of Thalictrum.

Authors: Diego F Morales-Briones; Tatiana Arias; Verónica S Di Stilio; David C Tank
Journal: Appl Plant Sci Date: 2019-10-16 Impact factor: 1.936

1 in total

1. Complete chloroplast genome sequence of Thalictrum viscosum W.T.Wang & S.H.Wang, 1979 (Ranunculaceae).

Authors: Haohong Cai; Wenbo Shi; Song Weicai; Weiqi Han; Shuo Wang
Journal: Mitochondrial DNA B Resour Date: 2022-09-05 Impact factor: 0.610

1 in total