Literature DB >> 28431014

The draft genome of Ruellia speciosa (Beautiful Wild Petunia: Acanthaceae).

Abstract

The genus Ruellia (Wild Petunias; Acanthaceae) is characterized by an enormous diversity of floral shapes and colours manifested among closely related species. Using Illumina platform, we reconstructed the draft genome of Ruellia speciosa, with a scaffold size of 1,021 Mb (or ∼1.02 Gb) and an N50 size of 17,908 bp, spanning ∼93% of the estimated genome (∼1.1 Gb). The draft assembly predicted 40,124 gene models and phylogenetic analyses of four key enzymes involved in anthocyanin colour production [flavanone 3-hydroxylase (F3H), flavonoid 3'-hydroxylase (F3'H), flavonoid 3',5'-hydroxylase (F3'5'H), and dihydroflavonol 4-reductase (DFR)] found that most angiosperms here sampled harboured at least one copy of F3H, F3'H, and DFR. In contrast, fewer than one-half (but including R. speciosa) harboured a copy of F3'5'H, supporting observations that blue flowers and/or fruits, which this enzyme is required for, are less common among flowering plants. Ka/Ks analyses of duplicated copies of F3'H and DFR in R. speciosa suggested purifying selection in the former but detected evidence of positive selection in the latter. The genome sequence and annotation of R. speciosa represents only one of only four families sequenced in the large and important Asterid clade of flowering plants and, as such, will facilitate extensive future research on this diverse group, particularly with respect to floral evolution.

Entities: Chemical Disease Gene Species

Keywords: Asterid; Ka/Ks; anthocyanin; evolution; phylogenetic

Mesh：

Substances：

Year: 2017 PMID： 28431014 PMCID： PMC5397612 DOI： 10.1093/dnares/dsw054

Source DB: PubMed Journal: DNA Res ISSN： 1340-2838 Impact factor: 4.458

1. Introduction

The Asterid clade is one of the most species-rich lineages of flowering plants, containing over a quarter of all known species of angiosperms (∼80,000 of 300,000). In this clade are four of the 10 most diverse families of flowering plants: Asteraceae (Sunflowers: ∼24,000 spp.), Rubiaceae (Coffee Family: ∼11,200 spp.), Lamiaceae (Mint Family: ∼7,200 spp.), and Acanthaceae (Acanthus Family: 4,200 spp.). Asterids are of tremendous economic and ecological significance, with cultivated and economically important members including coffee, mints (basil, oregano, rosemary, sage, thyme, lavender, teak), tomatoes and relatives (potatoes, tobacco, chilies, peppers, petunias), and carrots and relatives (carrots, cumin, cilantro, dill, fennel, parsley, celery), as well as blueberries, olives, sunflowers, sweet potatoes, snapdragons, dogwoods, ashes, milkweeds, and hollies. Members of the Asterid clade furthermore dominate ecosystems worldwide. Despite a revolution in sequencing plant genomes over the last decade, there exists only a handful of plants in the Asterid clade with nuclear reference genomes (http://www.plantgdb.org/). These genomes derive primarily from the tomato family (Solanaceae) in which nine species have been sequenced (Solanum, four species: ∼838–900 Mb;Nicotiana, three species: ∼3 Gb;,Capsicum, three species: ∼3.48 Gb;,and Petunia, two species: ∼1.4 Gb). Beyond this, there exists only two additional reference genomes in the Asterid clade, only one of which is >100 Mb in size: the monkeyflower Mimulus guttatus, ∼430 Mb; the other derives from the insectivorous bladderwort Utricularia gibba, which is only ∼81.9 Mb. Thus, only three of 100+ families that comprise Asterids have representative nuclear genomes. To build genomic resources for Asterids, we here contribute new nuclear genome reference from a fourth family in this clade: Acanthaceae. The genus Ruellia (Wild Petunias) contains ∼400 species that are distributed primarily in the Neotropics and Paleotropics. With very few exceptions, all species of Ruellia (∼50 species counted to date) are thought to be diploid and share a somatic chromosome count of 2n = 34 (reviewed by Tripp and Tripp). Ruellia is marked by tremendous diversity in both flower shape and colour, with very closely related species (e.g. sister taxa) often divergent in floral form (Fig. 1). Flowers of Ruellia range from blue to purple, pink, red, green, yellow, and white and are pollinated by bees, butterflies, hawkmoths, bats, hummingbirds, and sunbirds., Owing to the floral diversity and species richness in the genus, Ruellia has potential as a future model system for floral evolution in angiosperms.

Figure 1

Floral colour and floral shape diversity present within Ruellia. (A) R. speciosa. (B) Ruellia macrantha. (C) Ruellia longipedunculata. (D) Ruellia hirsuto-glandulosa. (E) Ruellia bourgaei. (F) Ruellia saccata. (G) Ruellia biolleyi. (H) R. breedlovei. (I) Ruellia californica. (J) Ruellia noctiflora. Ruellia speciosa (Beautiful Wild Petunia) is a pale yellow-flowered species native to a narrow portion of the southern Sierra Madre Oriental of Mexico and is nearly extinct in the wild due to habitat destruction. Because of its tiny population sizes and isolation from related species, the genome is expected to be highly homozygous and suitable for sequencing as a reference.

2. Materials and methods

2.1. DNA isolation, library preparation, and sequencing

Plant material for this study was collected in the wild by E. Tripp and S. Acosta (voucher # 175, housed at the Duke University Herbarium; in living cultivation in the University of Colorado Greenhouses), in a small canyon of a mountain that lies within the city limits of Oaxaca City, Oaxaca, Mexico. Total genomic DNA was extracted from young leaf tissue of R.speciosa using the DNeasy Plant Mini Kit (QIAGEN) following the manufacturer’s instructions. Presence of high molecular weight DNA was confirmed by 1% agarose-gel electrophoresis stained with SYBR Safe (Invitrogen). DNA concentration was quantified using a Qubit 2.0 Fluorimeter spectrophotometer (Life Technology). DNA that passed quality control was used for mate-pair library preparation followed by whole-genome shotgun sequencing. Small fragment libraries with insert sizes of 250, 350, 450, and 550 bp were prepared using a TruSeq DNA PCR-Free Library Prep Kit (Illumina). Two mate-pair libraries with average insert sizes of 2 and 5 kbp were also built using a Nextera Mate Pair Library Preparation Kit (Illumina). Final libraries were quantified using a Qubit and qualities were assessed using a Bioanalyzer. Libraries were sequenced on an Illumina HiSeq2500 using either 2 × 125 bp paired-end or 2 × 150 bp paired-end chemistry at the Genomics and Microarray Core, University of Colorado–Anschutz Medical Campus.

2.2. Flow cytometry

Prior to sequencing, we conducted flow cytometry to estimate the genome size of R.speciosa. Samples were transferred to the Iowa State University Flow Cytometry Facility and analysed on a BD Biosciences FACSCanto Flow Cytometer using Solanum lycopersicum as an internal standard. All cytometry protocols followed in house methods at that facility. The software package BD FACSDiva v.6.1.3 was used to analyse the data.

2.3. De novo nuclear genome assembly

De novo assembly of the R.speciosa genome into contigs was conducted using MaSuRCA v3.1.0. All raw untrimmed sequenced reads were used as input for MaSuRCA as instructed in the manual and run with default settings except the memory usage option (NUM_THREADS and JF_SIZE). Gap closing was conducted via BGI’s GapCloser v1.6 of SOAPdenovo. Following this, we used the package SSPACE v2.0 (scaffolding pre-assemblies after contig extension) to conduct the final round of scaffolding. Reads trimmed with Trimmomatic v0.33 (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50) were also assembled with SOAPdenovo2 (‘Multi-kmer’ method was used, kmer ranges from 65 to 80) and Abyss v1.9.0 (k = 71, b = 1,000, P = 0.95 and s = 500). Assembly generated by MaSuRCA was used for all downstream analysis as it yielded the best statistics based on scaffold N50. Completeness of genome assembly was evaluated using BUSCO v1.2.

2.4. Ploidy estimation and heterozygosity analysis

To estimate ploidy of R.speciosa, we used an approach similar to that employed by Kentaro Yoshida’s group. After quality trimming and error correction, reads were mapped backed to the assembled genome using BWA v0.7.1327 with default settings allowing two gaps and without seeding. Samtools was used to convert the sam file containing the alignment information into bam format and the resulting bam file was sorted and indexed. SNP calling was conducted using both samtools mpileup and bcftools. To reliably calculate base and minor allele frequencies to enable identification of heterozygous alleles, only positions with read coverages ≥ ×50 (approximately one-half of the estimated overall read coverage) and minor allele frequencies ≥ 0.2 were considered. Ploidy is determined by the ratio of alleles at heterozygous positions. In a diploid genome, a ratio of 1:1 is expected for the majority of the heterozygous alleles; in triploid and tetraploid genomes, these ratios should ∼1:2 and 1:3, respectively. Using Marker-predicted genes as input, prediction of recent or ancient polyploidization was secondarily tested by plotting the distribution of ages of duplicated sequences using DupPipe. Synonymous divergence (Ks) was estimated using PAML v4.9c and the F3X4 model of evolution. As polyploidy or whole-genome duplication (WGD) yields a large and enduring signature in such plots, significant features (α = 0.05) in age distributions were identified with SiZer v0.1-4. To estimate genome heterozygosity, the number of distinct k-mers in our reads was calculated using Jellyfish v2.2.5. k-mers were plotted as frequencies, and the number of peaks was used as an indicator of genome heterozygosity.

2.5. Genome annotation with MAKER

Genome annotation was conducted using the MAKER-P annotation pipeline and its dependent programmes. Two rounds of running the pipeline were completed to yield the final annotation file. Prior to annotation of the protein-coding genes, a R.speciosa-specific repeat library was constructed following the tutorial included in MAKER-P tool kit and RepBase v21.09. This library served to mask repeat elements in the assembled genome using RepeatMasker v3.2.0. The repeat-masked assembled genome of R. speciosa (scaffold length ≥ 1,000 bp) was then processed through the MAKER annotation pipeline. Gene prediction was conducted using SNAP and AUGUSTUS v3.2.2, trained on Arabidopsis thaliana and S.lycopersicum as models. RNA-seq data obtained from corolla and leaf samples of R. speciosa (Zhuang and Tripp, in rev.; SRA accession: SRP075855) served as EST evidence to refine gene models. High confidence annotation of protein sequences of 31 dicot species were retrieved from the PLAZA database v3.0 to serve as protein evidence for refining gene predictions. The annotation file from the first-round run was used to create the HMM training file for SNAP. The resulting HMM training file together with first-round GFF annotation file were further processed by MAKER-P to generate a final annotation file. Reads generated from previous RNA-seq study (Zhuang and Tripp, in rev.) were mapped against the R. speciosa genome assembly using BWA with default parameters. Paired-end reads where each end mapped to gene-containing regions of different scaffolds were then used to combine the fragments of a single gene. All predicted protein sequences were further validated following methods described by Upadhyay et al.

2.6. SSR identification and maker predication

We used GMATo (genome-wide microsatellite analysing tool; https://sourceforge.net/projects/gmato/files/) to characterize the masked SSR loci. Masked SSR sequences were extracted using an in-house Perl script prior to analysis with GMATo. To enable comparison of our SSR statistics to those reported from other species, we employed similar GMATo parameters to those used in prior studies. These included retaining SSR motifs that were 2–8 bp in length and had ≥4 repeats with a minimum of 10 bp length (−r 5 −m 2 −× 10 −s 0).

2.7. Gene Ontology database annotation

Functional annotation of predicted gene models was conducted using the Trinotate pipeline v3.0.0 (http://trinotate.sourceforge.net/), with a blast e-value threshold of 1 × 10−. To assign function annotations to the predicted genes, local NCBI-BLAST v2.2.31 was used to search for homologies between the gene and SwissProt, UniProt, and eggNOG/Gene Ontology (GO) databases. In addition, all green plant entries integrated into UniProtKB/TrEMBL (Taxonomy: Viridiplantae) were also retrieved to serve as a custom database for blast searches. To enable comparison and visualization of GO annotations in R.speciosa to other plants, the annotated GO terms of Ruellia were combined with GO annotation files of A.thaliana and Solanum tuberosum, which were downloaded from AgriGO (http://bioinfo.cau.edu.cn/agriGO/dowload). This combined file was analysed using BGI WEGO (http://wego.genomics.org.cn) to classify GO terms into Level-2 sub categories.

2.8. Phylogenetic analysis of genes involved in the anthocyanin biosynthesis pathway

To enable phylogenetic investigation of key structural genes involved in the production of anthocyanin pigments that occur in flowers, fruits, and leaves, amino acid sequences representing genes from several angiosperm species were downloaded from PLAZA and OrthoDB v9. Sequences from Pinus taeda were also download from PINREFSEQ (http://pinegenome.org) to facilitate tree rooting. In total, four structural genes from 25 dicots plus three monocots were downloaded then subject to phylogenetic analysis along with orthologues retrieved from R.speciosa and P.taeda (ntax = 30 in total). Based on Trinotate annotation, genes annotated as flavanone 3-hydroxylase (F3H), flavonoid 3′-hydroxylase (F3′H), flavonoid 3′,5′-hydroxylase (F3′5′H), and dihydroflavonol 4-reductase (DFR) were extracted, and CD-HIT was used to retrieve non-redundant copies (at a 99% similarity cutoff). To accurately estimate copy numbers of family genes studied, SNPs located within target gene-containing regions identified from polyploidy analysis were further examined; the gene was considered to have unidentified paralogues if the percentage of SNPs were ≥90% of gene length and averaged ≥10 reads mapped to the SNP-containing region. We searched for orthologues of these genes in queried plants using the package Proteinortho v5.15. Identified sequences were parsed and extracted with an in-house Perl script. Two alignment matrices were constructed: the first containing all F3H, F3′H, and F3′5′H sequences and the second containing all DFR sequences. Sequence alignment was conducted using the ‘linsi’ option of MAFFT v7.305. Resulting alignments were truncated to exclude regions of extremely high sequence divergence (namely: autapomorphic divergence), which can interfere with phylogenetic signal owing to phenomena such as long branch attraction. Evolutionary histories were inferred using maximum likelihood methods, employing the LG+ F amino acid model as the best-fitting model determined by MEGA6, and final resulting phylogenetic trees were rooted using P.taeda. Branch support was assessed via 200 ML bootstrap replicates. To test selection on and assess potential functional divergence of duplicated genes, ratios of non-synonymous (Ka) to synonymous (Ks) nucleotide substitutions were estimated using the Ka/Ks web service at the computational biology unit, hosted at The University of Bergen. Phylogenetic trees from the earlier analyses were used as reference trees for the Ka/Ks analysis.

2.9. Genome-wide characterization of MYB3R and R2R3 type MYB gene family

To characterize the MYB gene family in R.speciosa, protein sequences of 125 typical R2R3-MYB proteins and six atypical MYB proteins (AtMYBCDC5, AtMYB3R-like, and AtMYB4R1) were downloaded from the Arabidopsis information resource (https://www.arabidopsis.org/browse/genefamily/MYB.jsp). Three Petunia MYBs regulate anthocyanin biosynthesis were also downloaded from NCBI database (GI:673536265, GI:321688260, GI:321688230) to facilitate function annotation. Proteinortho was used to conduct homologue searches of Arabidopsis MYBs against protein sequence of all gene models annotated by Maker-P with default parameters. Proteins without homologues among Arabidopsis MYBs were removed from subsequent analysis. The resulting MYB proteins were then assessed for the presence of Myb domains using Pfam v26.0 on the Pfam server with default parameters (http://pfam.sanger.ac.uk). Proteins containing Myb domains were extracted and subjected to phylogenetic analysis using methods as earlier employing the LG + G amino acid model as the best-fitting model. Function assignments were conducted based on known functions of Arabidopsis MYBs. MYBs with bootstrap values ≥ 70% were considered to belong to the same subgroup.

2.10. Gene expression analysis

To measure expression levels of selected anthocyanin genes and identified MYBs, reads from previous tissue-specific RNA-seq libraries (Zhuang and Tripp, in rev.; SRA accession: SRP075855) were trimmed with Trimmomatic v0.3323(parameters used: ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36) and mapped back to Maker-P predicted gene sequences using BWA with default parameters. eXpress v1.5.1 was used to measure gene abundance with default parameters. The ‘fpkm’ column of tissue-specific eXpress output was extracted and used for building expression heatmaps using the heatmap.2 program within the gplots package for R.

2.11. Reference-based chloroplast assembly

We additionally used resulting reads to assemble the R.speciosa chloroplast genome. First, we retrieved all available plant chloroplast sequences that were 100,000–200,000 bp in size from NCBI. Blat was used to compared our contigs generated from MaSuRCA to these reference chloroplast sequences to find nucleotide sequences with high similarity. Five contigs (length ≥ 10,000 bp; query coverage ≥ 80%) were identified. We used Blast to compare these five contigs against reference chloroplasts and identified two with the highest e-values and query coverages: Lindenbergia philippensis (Accession: HG530133) and Andrographis paniculata (Accession: KF150644). These, together with a chloroplast reference from closed related species Ruellia breedlovei (Accession: KP300014) were used as references for our R. speciosa assembly. Quality trimmed paired-end reads were aligned to the references using Bowtie v2-2.2.3. Reads with either end mapped to the references were extracted using Samtools and an in-house Perl script. Mapped reads were randomly subsampled from ×100 to ×500 coverage of the estimated chloroplast genome size; reads with an apparent depth under ×5 were discarded using BBmap v36.02. The resulting aligned reads were assembled using SPAdes v3.5.0 and scaffolded using SSPACE v2.0 against all sequenced libraries. Scaffolds were gap closed using Gapcloser v1.6. The best assemblies containing three scaffolds ≥ 300 bp were selected and subjected to second-round scaffolding and gap closing, thereby generating a circular chloroplast genome. The chloroplast sequence of the closely related R.breedlovei was used to correct the orientation of the region between the inverted repeats (IRs). CpGAVASwas used for chloroplast annotation and visualization.

3. Results and discussion

3.1. Genome sequencing and assembly

From a whole-genome perspective, the successful sequencing, assembly, and annotation of the R.speciosa genome contributes important new data about one of the largest yet understudied clades of flowering plants—the Asterids. Our targeting Ruellia initiates the process of building a new model system for studies of floral evolution. We herein estimate the genome size of R.speciosa to be ∼1.2 Gb based on flow cytometry or ∼1.1 Gb based on Kmer analysis (Fig. 2B). For the nuclear genome assembly, nine paired-end libraries with insert sizes ranging from 250 to 550 bp and two mate-pair libraries with insert sizes of 3–6 kbp were constructed. After quality trimming, we generated >118.9 Gb of sequence data representing >×104 coverage of the predicted genome size (Supplementary Table S1). Before scaffolding, short reads were assembled into 808,710 contigs with an N50 of 1,551 bp that totaled 733 Mb (spanning 67.25% of the predicted genome size; Table 1). Additional MP reads improved the N50 by ∼10 and the additional gap closing step increased genome coverage to ∼90.07% of its predicted size without Ns (Table 1). The success of a draft genome assembly is strongly dependent on the genetic complexity of the specific organism and its genome size. Of the two methods we used for genome size estimation, the Kmer protocol estimated a smaller genome size, which may suggest a substantial repeat fraction in the genome that is difficult to account for using this type of analysis. Our repeat DNA analysis suggested that ∼70% of the genome is repetitive (Table 2). Repeat sequences are difficult to assemble because high-identity reads can derive from different portions of the genome, generating gaps, ambiguities, and collapses in alignment and assembly., This high repeat DNA content is likely to explain the relatively low scaffold N50 (17,908 bp), even though the sequencing depth exceeds ×100. Nonetheless, our scaffold N50 is comparable to those derived from genome assemblies of Cannabis sativa (N50: 16.7 kb),Conyza canadensis (N50: 33.5 kb, repeat content 6.25%),Ocimum tenuiflorum (N50: 27.1 kb; repeat content: 42.9%),Lemna minor (N50: 23.6 kb, repeat content: 61.5%),Solanum commersonii (N50: 44.3 kb, repeat content: 44.5%),Hevea brasiliensis (N50: 3 kb, repeat content: 72%),Aegilops tauschii (N50: 57.6 kb, repeat content: 66%), and Triticum urartu (N50: 63.7 kb, repeat content: 67%). Although data from additional sequencing libraries or a combination of existing and additional sequencing technologies may be needed to make further improvements to the quality of the R. speciosa genome for purposes of whole-genome synteny analyses, our current assembly enables us to characterize numerous aspects of this genome ranging from repeat elements, gene content, gene annotation, and phylogenetic history of specific pathways of interest.

Figure 2

Table 1.

Summary of the de novo genome assembly of R.speciosa

Category	Contigs	Scaffolds (all)	Scaffold (>1 k)
Number of contigs	808,710	341,168	84,681
N50 (bp)	1,551	17,908	19,156
L50	92,735	14,124	10,074
Largest contig/scaffolds (bp)	34,954	345,229	345,229
GC content (%)	39.27	38.80	38.77
N’s percentage		3.62	3.82
Total span (M)	733	1,021	892
Estimated coverage (%)	67.25	93.69	81.89
Number of gene models			40,124
Mean transcript length			1,131 bp
Number of genes annotated			37,848 (94.32%)

Table 2.

De novo identification of sequence repeats in the genome of R.speciosa

Class	Number	Total length (bp)	Percentage of genome
Retrotransposons
LINE	29,246	11,469,068	1.23
SINE	276	14,095	1.50e⁻⁰⁵
LTR Copia	53,812	23,369,721	2.50
LTR Gypsy	342,944	158,185,560	16.92
LTR other	26,556	9,772,864	1.05
DNA transposons	113,878	35,584,790	3.81
Tandem repeats
Satellite	1,213	176,347	0.02
Low complexity	64,404	3,277,475	0.35
Simple repeats	352,081	19,601,079	2.10
Unknown	1,158,125	384,916,724	41.17
Total	2,142,535	646,367,723	69.13

(A) Ploidy analysis of assembled R. speciosa genome. A major peak around 0.5 supports hypothesis of diploidy in R. speciosa. (B) Twenty-one k-mer depth distribution of whole-genome Illumina reads; the single peak observed indicates high level of genome homozygosity. Summary of the de novo genome assembly of R.speciosa De novo identification of sequence repeats in the genome of R.speciosa

3.2. Ploidy and heterozygosity of R. speciosa

Polyploidization is rampant in flowering plants and is an important contributor to genomic diversity and function; as such, polyploidization can facilitate novel ecological transitions and substantially impact evolutionary trajectories. As discussed by Yoshida et al., for diploid species, we expect a single peak at 0.5 for the mean of read counts at heterozygous positions. To characterize the ploidy of R. speciosa, the distribution of read counts of biallelic SNPs was calculated from high coverage genomic regions. Figure 2A depicts a major peak around 0.5, supporting a hypothesis of diploidy in R. speciosa. This ploidy was further confirmed by Ks analysis using DupPipe: peaks of gene duplication serve as evidence of ancient WGDs, but none was identified through Sizer analysis in our Ks plot (Supplementary Fig. S1) thus supporting our earlier conclusion that R. speciosa genome is diploid. Kmer analysis (Fig. 2B) revealed the presence of only a single peak, consistent with our prediction that R.speciosa is highly homozygous. A very minor peak immediately adjacent to the major peak could be detected only when Kmer size ≤ 31, which represents regions with higher copy numbers such as repeats. To further explore heterozygosity and polymorphism,we calculated the percentage of total insertions/deletions and SNPs using SAMtools and BCFtools. We found a total of 2,431,330 SNPs (∼0.20%) with relatively relaxed search parameters (depth ≥ 3) or 299,334 SNPs (0.02%) of high confidence (depth ≥ 30), further supporting our hypothesis that the R. speciosa genome is highly homozygous.

3.3. Genome annotation and quality assessment

Of the 84,681 scaffolds (length > 1,000 bp) used for gene annotation, 25,256 (29.82%) were found to contain annotated genes. Our final Maker-P annotation predicted 40,124 protein-coding genes with an average length of 1,131 bp. This number exceeds the number of genes annotated for two relatively closely related species that have comparable genome sizes: tomato (gene number: 34,727, genome size ∼900 Mb) and potato (gene number: 35,004, genome size ∼844 Mb). To assess completeness of our R. speciosa genome assembly, 966 genes present as single-copy orthologues in at least 90% of plant species in OrthoDB were compiled. Using BUSCO, we retrieved orthologues for 741 (77.52%) conserved genes in Ruellia, suggesting a relatively high level of completeness for draft plant genomes generated only from Illumina reads. Additionally, mapping reads from previous RNA-seq experiment on R. speciosa (Zhuang and Tripp, in rev.) back to our genome assembly demonstrated that 76.12% of the quality trimmed reads could be successfully mapped using Tophat2 with default parameters. We interpret this as further evidence of the completeness of our genome assembly given this number is on par with expected mapping rates for RNA-seq data from very well assembled genomes such as the human genome, which has a 70–90% RNA-seq read mapping rate. Gene annotation databases are commonly used to evaluate functional properties of experimentally derived gene sets. Comparison of completed sets of genes from different genomes helps to reveal the genetic basis of biological traits as well as differences among species. Using the Trinotate pipeline, we assigned GO functional terms to 30,852 (76.89%) genes. Genome-wide comparative analyses among R. speciosa, S. tuberosum, and A. thaliana revealed distinct patterns of enriched Level 2 GO terms (Supplementary Fig. S2). Although potato is more closely related to R. speciosa than is Arabidopsis, gene enrichment patterns were more similar between R. speciosa and Arabidopsis than between R. speciosa and S. tuberosum for two of the three categories [cellular component and biological process categories (BP)]. GO terms categorized under molecular function were more consistent among the three species. Across all functions, fewer GO terms were enriched in S. tuberosum compared with in R. speciosa and A. thaliana, with the exception of GO terms involved metabolic processes (BP).

3.4. Analysis of repetitive elements

Repetitive DNA elements generally compose the majority of nuclear genomes in plants, but this content varies tremendously, ranging from 3% (U.gibba) to 80% of the total genome (Triticum aestivum). In this study, a combination of homologue-based searching and de novo analyses identified 417,698 (2.47%) tandem repeats in our assembled R. speciosa genome, which can be parsed into 352,081 simple repeats (2.1%), 64,404 low-complexity elements (0.35%), and 1,213 satellite elements (0.02%). Various types of retrotransposons comprising ∼25.51% of the whole assembled genome were also recovered in this analysis; these include 29,246 (1.23%) long interspersed nuclear elements (LINE), 276 (1.50e−05%) short interspersed nuclear elements (SINE), 424,312 (20.47%) long terminal repeats (LTR), and 113,878 (3.81%) DNA transposons. Overall, using homologue-based and de novo approaches, we estimate that ∼70% of the R. speciosa genome is composed of repetitive DNA (Table 2). However, because repeat elements are difficult to assemble, our use of assembly dependent methods for repeat identification may still underestimate total repeat content in the genome. Repetitive sequences possess high sequence homogeneity but, over the course of evolution, accumulate variations in both sequence and copy numbers. Because of this, repetitive elements are very useful for markers for many downstream applications, ranging from plant population genetics to breeding studies. In particular, molecular makers developed from simple sequence repeats (or microsatellites) have been utilized most extensively as these can be readily amplified by PCR and contain a large amount of allelic variation at each locus. We identified 181,793 SSR loci using GMATo. These repetitive elements are depicted via motif type in a frequency distribution plot shown in Fig. 3. The most abundant repeat motifs are AT and TA, together accounting for ∼53.37% of the total SSRs identified (Fig. 3A). Dinucleotide repeat motifs were the greatest in number (Fig. 3B), consistent with results from SSR study in holy basil but differing from patterns in most other plants including tomato, potato, cucumber, and rice where trinucleotide SSRs represent the majority (not including mononucleotide SSRs) of repeat type. Taken together, these data suggest that SSRs constitute a relatively unique and important aspect of the R. speciosa genome.

Figure 3

Simple sequence repeat (SSR or microsatellite) analysis. (A) Bar plot of top 15 repeat motifs with highest abundance. (B) Pie chart showing distribution of identified SSRs based on motif type.

3.5. Analysis of genes involved in the anthocyanin biosynthesis pathway

Flower colour is among the key traits that serve to signal rewards to pollinators and thus helps determine reproductive success of both plants and pollinators. The anthocyanin biosynthesis pathway (ABP), largely responsible for flower colour, has furthermore been extensively studied for its beneficial contributions to human health. Using data from the R.speciosa genome as well as orthologues in other plants with annotated nuclear reference genomes, we reconstructed evolutionary histories of four structural enzymes in the ABP. In R.speciosa, we found two putative copies of F3H, eight of F3′H, one of F3′5′H, and three of DFR. These enzymes were present in variable duplicated copy numbers in other reference genomes as well as in Ruellia (Fig. 4, Supplementary Figs S3 and S4; see also ref. 114), although studies in other systems have reported these enzymes as single copy (12 species of Penstemon).

Figure 4

Summary phylogenies for (A) F3H, F3′H, and F3′5′H and (B) DFR. Phylogenetic analyses incorporated all copies of these four genes available from published angiosperm genomes (these collapased into clades depicted with triangles) as well as all copies from one gymnosperm, P. taeda. (A) Analyses recovered two major clades of F3H, three of F3′H, and one of F3′5′H in angiosperms (A) as well as three major clades of DFR in angiosperms (B). All Pinus copies depicted as individual lineages. Numbers of copies of genes recovered in R. speciosa genome shown in brackets. Numbers above branches indicate maximum likelihood bootstrap support (only values at or above 70% shown except for DFR-2). Full phylogenetic detail provided in Supplementary Figs S3 and S4. Phylogenetic reconstruction yielded two clades of F3H, three of F3′H, one of F3′5′H, and three of DFR, and strong bootstrap support (≥70%) was recovered for some but not all key branches in our topologies (Fig. 4, Supplementary Figs S3 and S4). Our trees in part contradict those in Campanella et al., which depicted clades of F3′H and DFR that correspond specifically to monocots vs. dicots (Supplementary Figs S3 and S4). Instead, Fig. 4 suggests the evolution of different copies of F3H, F3′H, F3′5′H, and DFR predates the divergence of monocots from Eudicots. With a given gene genealogy in our analyses (e.g. Fig. 4A), there has been subsequent lineage-specific duplication. Thus, ABP genes are marked by a history of early duplication as well as subsequent duplication events, products of which have in some cases facilitated the evolution of tissue-specific functions. Consistent with this, our expression analysis of putative F3′H genes showed tissue-specific expression patterns: among 5 putative F3′H copies with detectable expression levels (fpkm > 1), two copies appeared to be leaf specific and two were corolla specific (Fig. 5). However, Ka/Ks analysis of R.speciosa copies yielded ratios consistently <1, suggesting an overall signature of purifying selection or constraint on these copies (Fig. 6; Supplementary Table S2). Our phylogenetic results also suggest that F3′5′H is more closely related to one of the three clades of F3′H than it is to F3H, corroborating Seitz et al. who documented multiple evolutionary origins of F3′5′H from an F3′H ancestor in Asteraceae. F3′5′H is an enzyme required for production of blue delphinidins and, across all land plants, is evolutionarily younger than F3H and F3′H. Our finding that only one-half of all reference genomes sampled harbour a copy of this enzyme corroborates observations that blue anthocyanins are less common in plants than are purple, pink, and red cyanidins and pelargonidins that derive from other branches of the ABP pathway. Finally, DFR plays a crucial role in production of all three branches of the ABP pathway and was traditionally thought of as a substrate generalist, having the capacity to metabolize precursors to all three branches. However, substrate specificity has evolved in numerous groups of plants, in some cases driven by a single amino acid change. We recovered three copies of DFR in the R.speciosa genome, each resolved in three separate clades (Fig. 4B). RNA-seq analysis detected expression of two of these three copies: DFR.1 and DFR.3 as labelled in Fig. 4B. Although both copies were expressed in leaf as well as corolla tissue, DFR.1 was up-regulated in leaves (a 2.56-fold increase) whereas DFR.3 was up-regulated in corollas (a 3.50-fold increase) (Fig. 5). In contrast to F3′H copies, Ka/Ks analysis of DFR copies present in R.speciosa yielded ratios > 1 in one of the three copies (Fig. 6; Supplementary Table S3). These data suggest positive selection and are consistent with a hypothesis of adaptive evolution of this copy but functional assays are needed to understand the role of duplicated ABP loci in R.speciosa.

Figure 5

Heatmap showing expression patterns of putative structural genes functional in ABP and MYBs identified in Ruellia. Only genes with detectable expression level (fpkm > 1) were shown.

Figure 6

Phylogenetic hypothesis for F3'H (A) and DFR (B) identified in R speciosa. Ka/Ks values shown above branches (values >1 in red). Node IDs (1 through 7 for each gene) reflect Ka and Ks values in Supplementary Tables 3 & 4.

Heatmap showing expression patterns of putative structural genes functional in ABP and MYBs identified in Ruellia. Only genes with detectable expression level (fpkm > 1) were shown. Phylogenetic hypothesis for F3'H (A) and DFR (B) identified in R speciosa. Ka/Ks values shown above branches (values >1 in red). Node IDs (1 through 7 for each gene) reflect Ka and Ks values in Supplementary Tables 3 & 4.

3.6. Phylogenetic analysis of MYB gene family

MYB transcription factors, marked by a conserved DNA-binding domain, comprise the largest transcription factor family in plants. R2R3 type MYBs are specific to the plant kingdom and are involved in the transcriptional control of plant-specific processes. Based on homologue searches to known MYB3R and R2R3 type MYBs in A.thaliana, we identified 96 homologous MYBs in the R.speciosa genome then assigned putative functions to 90 of them based on 87 function-annotated MYBs in Arabidopsis (Fig. 7). Our phylogenetic tree is generally consistent with prior data from Arabidopsis alone. Among classified MYBs, of particular interest are three subgroups that function in flavonoid biosynthesis (highlighted in blue in Fig. 7), a large pathway that includes the ABP branch. In most species, the anthocyanin branch is controlled by a ternary complex of MYB-bHLH-WD40 transcription factors and the specificity of each function appears to be the result of a particular R2R3-MYB protein that joins the complex., A total of nine Ruellia MYBs (RsMYBs; all highlighted in red in Fig. 7) belonging to two of the three flavonoid subgroups were identified (Fig. 7). Phylogenetic analyses resolved RsMYB71 in the same clade as several flavonoid MYBs from Arabidopsis as well as the R2R3-MYB members anthocyanin2, deep purple, and purple haze of Petunia, which are responsible for full petal colour, flower tube venation/bud-blush, and vegetative pigmentation. Transcriptome data demonstrated corolla-specific expression of RsMYB71 in Ruellia (Fig. 5), suggesting that this regulatory factor may be a key candidate in regulating flower colour in Ruellia. In contrast, our transcriptome data failed to detect any expression of the remaining eight RsMYBs, all of which belong to a second subgroup (Figs 5 and 7). Comparative analysis of these MYBs across multiple species of Ruellia followed by functional assays are needed to advance knowledge of flavonoid biosynthesis and flower colour production in this group.

Figure 7

Phylogenetic analysis of MYB genes present in Ruellia speciosa, Petunia hybrida, and Arabidopsis thaliana. MYBs of R. speciosa are in red text, MYBs of A. thaliana are in blue text, and MYBs of P. hybrida are in black text. Functional annotations based on experimental confirmation in Arabidopsis are marked by asterisks. Branches supported by ML bootstrap values > 70% are labeled to the right of the supported node. Clades highlighted in orange and yellow depict genes involved in non-flavonoid functions; clades highlighted in blue depict genes involved in flavonoid biosynthesis.

3.7. De novo assembly of chloroplast

Chloroplasts are essential photosynthetic organelles of plant cells that manufacture energy in the presence of sunlight. Because the chloroplast genome is small, relatively easily sequenced and assembled, and generally uniparentally inherited, this molecule can provide abundant molecular information to support comparative evolutionary research, especially for species without a whole nuclear genome reference available. Furthermore, nucleotide substitution rates of chloroplast are relatively slow and therefore provide an appropriate window of resolution to study plant phylogeny at deep evolutionary time scales. In angiosperms, most assembled chloroplast genomes range from 120 to 160 kb in length and exist as circular molecules. Employing reference-based assembly methods, a subset of chloroplasts reads were extracted from our whole-genome shotgun dataset and successfully assembled into a circular contig with a length of 148,941 bp. Function and structure annotations conducted using CPGAVAS show that the obtained R.speciosa chloroplast genome has a standard quadripartite organization (Fig. 8), comprising two copies of IRs, a large single copy region, and a small single copy region, typical of other plants.,, In total, five genes involved in photosynthesis and 38 genes involved in self-replication were identified in the R. speciosa chloroplast genome, these with an average intron length of 659 bp (Fig. 8). The whole chloroplast genome assembled in this study will be useful to future phylogenetic research both in Acanthaceae and across angiosperms.

Figure 8

Gene organization of the R. speciosa chloroplast genome. Genes drawn inside the circle are transcribed clockwise while those drawn outside are transcribed counterclockwise. Different gene functional groups are colour coded. The map was drawn using CpGAVAS.

4. Conclusion

Asterids contain a quarter of all known species of flowering plants and four of the 10 most diverse families of flowering plants. We have contributed a new nuclear genome sequence that serves as only the third family represented by such a sequence within this important clade. By building new genomic resources for R.speciosa, this study facilitates future investigation of Asterid evolution, particular with respect to flower colour production. Additionally, this new resource may facilitate future investigation of the genomic architecture of evolutionary intermediates (sensu Stebbins), as R.speciosa is member to a clade that demonstrates a clear transition from bee to hummingbird to bat pollination and has features intermediate between the latter two stages.

Availability

Raw nucleotide sequence data are available in the NCBI sequence read archive database (BioProject PRJNA326965) under the accession number SRP077633. The draft assembly is deposited in the NCBI whole-genome shotgun database (BioProject PRJNA326965) under the submission number SUB1661288 (released upon article acceptance). Click here for additional data file.

107 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Draft genome of the wheat A-genome progenitor Triticum urartu.

Authors: Hong-Qing Ling; Shancen Zhao; Dongcheng Liu; Junyi Wang; Hua Sun; Chi Zhang; Huajie Fan; Dong Li; Lingli Dong; Yong Tao; Chuan Gao; Huilan Wu; Yiwen Li; Yan Cui; Xiaosen Guo; Shusong Zheng; Biao Wang; Kang Yu; Qinsi Liang; Wenlong Yang; Xueyuan Lou; Jie Chen; Mingji Feng; Jianbo Jian; Xiaofei Zhang; Guangbin Luo; Ying Jiang; Junjie Liu; Zhaobao Wang; Yuhui Sha; Bairu Zhang; Huajun Wu; Dingzhong Tang; Qianhua Shen; Pengya Xue; Shenhao Zou; Xiujie Wang; Xin Liu; Famin Wang; Yanping Yang; Xueli An; Zhenying Dong; Kunpu Zhang; Xiangqi Zhang; Ming-Cheng Luo; Jan Dvorak; Yiping Tong; Jian Wang; Huanming Yang; Zhensheng Li; Daowen Wang; Aimin Zhang; Jun Wang
Journal: Nature Date: 2013-03-24 Impact factor: 49.962

3. agriGO: a GO analysis toolkit for the agricultural community.

Authors: Zhou Du; Xin Zhou; Yi Ling; Zhenhai Zhang; Zhen Su
Journal: Nucleic Acids Res Date: 2010-04-30 Impact factor: 16.971

4. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations.

Authors: Michael S Campbell; MeiYee Law; Carson Holt; Joshua C Stein; Gaurav D Moghe; David E Hufnagel; Jikai Lei; Rujira Achawanantakun; Dian Jiao; Carolyn J Lawrence; Doreen Ware; Shin-Han Shiu; Kevin L Childs; Yanni Sun; Ning Jiang; Mark Yandell
Journal: Plant Physiol Date: 2013-12-04 Impact factor: 8.340

5. Fragment assignment in the cloud with eXpress-D.

Authors: Adam Roberts; Harvey Feng; Lior Pachter
Journal: BMC Bioinformatics Date: 2013-12-07 Impact factor: 3.169

6. Draft genome sequence of eggplant (Solanum melongena L.): the representative solanum species indigenous to the old world.

Authors: Hideki Hirakawa; Kenta Shirasawa; Koji Miyatake; Tsukasa Nunome; Satomi Negoro; Akio Ohyama; Hirotaka Yamaguchi; Shusei Sato; Sachiko Isobe; Satoshi Tabata; Hiroyuki Fukuoka
Journal: DNA Res Date: 2014-09-18 Impact factor: 4.458

7. Extensive error in the number of genes inferred from draft genome assemblies.

Authors: James F Denton; Jose Lugo-Martinez; Abraham E Tucker; Daniel R Schrider; Wesley C Warren; Matthew W Hahn
Journal: PLoS Comput Biol Date: 2014-12-04 Impact factor: 4.475

8. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524

9. CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences.

Authors: Chang Liu; Linchun Shi; Yingjie Zhu; Haimei Chen; Jianhui Zhang; Xiaohan Lin; Xiaojun Guan
Journal: BMC Genomics Date: 2012-12-20 Impact factor: 3.969

10. Reference genomes and transcriptomes of Nicotiana sylvestris and Nicotiana tomentosiformis.

Authors: Nicolas Sierro; James N D Battey; Sonia Ouadi; Lucien Bovet; Simon Goepfert; Nicolas Bakaher; Manuel C Peitsch; Nikolai V Ivanov
Journal: Genome Biol Date: 2013-06-17 Impact factor: 13.583

7 in total

1. Variance of allele balance calculated from low coverage sequencing data infers departure from a diploid state.

Authors: Kyle Fletcher; Rongkui Han; Diederik Smilde; Richard Michelmore
Journal: BMC Bioinformatics Date: 2022-04-25 Impact factor: 3.307

2. RADseq dataset with 90% missing data fully resolves recent radiation of Petalidium (Acanthaceae) in the ultra-arid deserts of Namibia.

Authors: Erin A Tripp; Yi-Hsin Erica Tsai; Yongbin Zhuang; Kyle G Dexter
Journal: Ecol Evol Date: 2017-08-30 Impact factor: 2.912

3. Orthologous nuclear markers and new transcriptomes that broadly cover the phylogenetic diversity of Acanthaceae.

Authors: Erica B Morais; Jürg Schönenberger; Elena Conti; Alexandre Antonelli; Péter Szövényi
Journal: Appl Plant Sci Date: 2019-09-25 Impact factor: 1.936

4. De novo Assembly of the Pokeweed Genome Provides Insight Into Pokeweed Antiviral Protein (PAP) Gene Expression.

Authors: Kira C M Neller; Camille A Diaz; Adrian E Platts; Katalin A Hudak
Journal: Front Plant Sci Date: 2019-08-06 Impact factor: 5.753

5. Complete chloroplast genome sequence of Barleria prionitis, comparative chloroplast genomics and phylogenetic relationships among Acanthoideae.

Authors: Dhafer A Alzahrani; Samaila S Yaradua; Enas J Albokhari; Abidina Abba
Journal: BMC Genomics Date: 2020-06-06 Impact factor: 3.969

6. The Complete Chloroplast Genome of Heimia myrtifolia and Comparative Analysis within Myrtales.

Authors: Cuihua Gu; Bin Dong; Liang Xu; Luke R Tembrock; Shaoyu Zheng; Zhiqiang Wu
Journal: Molecules Date: 2018-04-08 Impact factor: 4.411

7. Draft genome of Tanacetum cinerariifolium, the natural source of mosquito coil.

Authors: Takanori Yamashiro; Akira Shiraishi; Honoo Satake; Koji Nakayama
Journal: Sci Rep Date: 2019-12-03 Impact factor: 4.379

7 in total